# Data Analytics Must Knows

### Pandas, SQL

Situation: 

You are given JSON data, formatted as a string, that holds basic employee information for a company with several departments, along with a series of analytics questions to answer. 

You are to provide a solution to each question in each language/library: **Pandas**, **SQL**, and **PySpark**.

The data given to you is shown in the next cell:

In [1]:
employees_json = """
    [
        {"EmployeeID":1, "Name":"Alice", "Department":"HR", "Salary":60000, "JoiningDate":"2020-01-15", "PerformanceScore":3},
        {"EmployeeID":2, "Name":"Bob", "Department":"IT", "Salary":70000, "JoiningDate":"2019-06-20", "PerformanceScore":4},
        {"EmployeeID":3, "Name":"Charlie", "Department":"IT", "Salary":80000, "JoiningDate":"2018-07-23", "PerformanceScore":2},
        {"EmployeeID":4, "Name":"David", "Department":"HR", "Salary":65000, "JoiningDate":"2020-02-10", "PerformanceScore":5},
        {"EmployeeID":5, "Name":"Eve", "Department":"Finance", "Salary":75000, "JoiningDate":"2021-03-15", "PerformanceScore":3}
    ]
"""
bonuses_json = """
    [
        {"EmployeeID":1, "Bonus":5000},
        {"EmployeeID":2, "Bonus":7000},
        {"EmployeeID":3, "Bonus":8000},
        {"EmployeeID":6, "Bonus":6000}
    ]
"""
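
Before any analysis, it can help to check how the two datasets line up on `EmployeeID`. The following is a minimal sketch using only the standard library; the records are trimmed to the ID field, and the names `without_bonus` and `orphan_bonuses` are illustrative, not part of the original notebook.

```python
import json

# Same EmployeeID values as employees_json / bonuses_json above, trimmed to the key field
employees = json.loads('[{"EmployeeID":1},{"EmployeeID":2},{"EmployeeID":3},{"EmployeeID":4},{"EmployeeID":5}]')
bonuses = json.loads('[{"EmployeeID":1},{"EmployeeID":2},{"EmployeeID":3},{"EmployeeID":6}]')

employee_ids = {row["EmployeeID"] for row in employees}
bonus_ids = {row["EmployeeID"] for row in bonuses}

# Employees that have no bonus row, and bonus rows that match no employee
without_bonus = sorted(employee_ids - bonus_ids)
orphan_bonuses = sorted(bonus_ids - employee_ids)

print(without_bonus)   # [4, 5]
print(orphan_bonuses)  # [6]
```

The mismatch in both directions matters for any later join: an inner join would drop employees 4 and 5 as well as the bonus for ID 6.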

**Q0: Import the necessary libraries and perform the necessary setup to complete the tasks for all the indicated languages and libraries.**

Import libraries

In [2]:
# DateTime
from datetime import datetime, date 
# JSON
import json
# Pandas
import pandas as pd
# SQL
import os
from dotenv import load_dotenv
import psycopg2
from sqlalchemy import create_engine
from sqlalchemy.sql import text

JSON Parsing

In [3]:
# JSON Parsing (Employee)
employees_dict = json.loads(employees_json)
employees_dict[0]

{'EmployeeID': 1,
 'Name': 'Alice',
 'Department': 'HR',
 'Salary': 60000,
 'JoiningDate': '2020-01-15',
 'PerformanceScore': 3}

In [4]:
# JSON Parsing (Bonuses)
bonuses_dict = json.loads(bonuses_json)
bonuses_dict[0]

{'EmployeeID': 1, 'Bonus': 5000}
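
Note that `json.loads` returns a list of dictionaries here, despite the `_dict` suffix in the variable names. As a rough sketch (the two-record sample and the name `mismatched` are assumptions for illustration), one way to confirm that every record shares the same fields before building a DataFrame:

```python
import json

# Two records copied from employees_json, enough to illustrate the check
sample_json = """[
    {"EmployeeID":1, "Name":"Alice", "Department":"HR", "Salary":60000, "JoiningDate":"2020-01-15", "PerformanceScore":3},
    {"EmployeeID":2, "Name":"Bob", "Department":"IT", "Salary":70000, "JoiningDate":"2019-06-20", "PerformanceScore":4}
]"""

records = json.loads(sample_json)
container_type = type(records).__name__  # a JSON array parses to a Python list

# Use the first record's keys as the reference and flag any record that differs
expected_keys = set(records[0])
mismatched = [r for r in records if set(r) != expected_keys]

print(container_type)  # list
print(mismatched)      # []
```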

Setup Pandas

In [5]:
employees_pd = pd.DataFrame(employees_dict)
employees_pd

   EmployeeID     Name Department  Salary JoiningDate  PerformanceScore
0           1    Alice         HR   60000  2020-01-15                 3
1           2      Bob         IT   70000  2019-06-20                 4
2           3  Charlie         IT   80000  2018-07-23                 2
3           4    David         HR   65000  2020-02-10                 5
4           5      Eve    Finance   75000  2021-03-15                 3


In [6]:
bonuses_pd = pd.DataFrame(bonuses_dict)
bonuses_pd

   EmployeeID  Bonus
0           1   5000
1           2   7000
2           3   8000
3           6   6000
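
Because the source is JSON, `JoiningDate` arrives as plain strings (an `object` column in Pandas). A minimal sketch of converting it to a datetime column, assuming the ISO `YYYY-MM-DD` format seen in the data; `employees_demo` is a hypothetical three-row stand-in for `employees_pd`:

```python
import pandas as pd

# Hypothetical stand-in for employees_pd with only the columns needed here
employees_demo = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "JoiningDate": ["2020-01-15", "2019-06-20", "2018-07-23"],
})

# Parse the strings into datetime64 values so date arithmetic and .dt accessors work
employees_demo["JoiningDate"] = pd.to_datetime(employees_demo["JoiningDate"], format="%Y-%m-%d")

joining_year = employees_demo["JoiningDate"].dt.year.tolist()
print(employees_demo.dtypes)
print(joining_year)  # [2020, 2019, 2018]
```

Doing the conversion once, right after loading, avoids repeating it in every date-based question.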


Setup PostgreSQL

In [7]:
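# Get credentials
# Note: load_dotenv() does not override variables that are already set,
# so a shell-level USER (the OS login name on most Unix systems) takes
# precedence over a USER entry in .env.
load_dotenv()
user = os.environ.get("USER")
pw = os.environ.get("PASS")
db = os.environ.get("DB")
host = os.environ.get("HOST")
api = os.environ.get("API")
port = 5432
schema = 'da_must_knows'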
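
`os.environ.get` returns `None` for a missing variable, which would only surface later as a confusing connection error. One possible guard is sketched below; the helper `missing_settings` and the sample environment are hypothetical, and the variable names are the ones read in the cell above.

```python
import os

# Variable names the notebook reads for the database connection
REQUIRED_VARS = ("USER", "PASS", "DB", "HOST")

def missing_settings(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

# Hypothetical, incomplete environment used only to show the behaviour
example_env = {"USER": "analyst", "PASS": "", "DB": "analytics"}
print(missing_settings(example_env))  # ['PASS', 'HOST']

# In the notebook, the same check could be run against the real environment:
# missing_settings(os.environ)
```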

In [8]:
# Connect to database
uri = f"postgresql+psycopg2://{user}:{pw}@{host}:{port}/{db}"
alchemyEngine = create_engine(uri)
conn = alchemyEngine.connect()
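
Interpolating credentials straight into the URI breaks if the password contains characters such as `@` or `/`. A minimal sketch of URL-escaping the credentials first, using the standard library; `build_postgres_uri` and the sample values are made up for illustration:

```python
from urllib.parse import quote_plus

def build_postgres_uri(user, password, host, port, db):
    """Build a SQLAlchemy-style PostgreSQL URI with URL-escaped credentials."""
    return f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"

# Hypothetical credentials; the '@' and '/' in the password are percent-encoded
demo_uri = build_postgres_uri("analyst", "p@ss/word", "localhost", 5432, "analytics")
print(demo_uri)  # postgresql+psycopg2://analyst:p%40ss%2Fword@localhost:5432/analytics
```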

In [None]:
# Load to SQL
# The DataFrame index is written as an extra "index" column (to_sql defaults to index=True)
employees_pd.to_sql(con=conn, name="employees", schema=schema)
bonuses_pd.to_sql(con=conn, name="bonuses", schema=schema)
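
By default `to_sql` raises an error if the table already exists, so re-running the load cell fails. The sketch below shows the `index` and `if_exists` parameters, using an in-memory SQLite database as a stand-in for PostgreSQL (an assumption made so the example runs without a server); `bonuses_demo` mirrors the bonuses data.

```python
import sqlite3
import pandas as pd

bonuses_demo = pd.DataFrame({"EmployeeID": [1, 2, 3, 6], "Bonus": [5000, 7000, 8000, 6000]})

# In-memory SQLite stands in for the PostgreSQL connection used in the notebook
con = sqlite3.connect(":memory:")

# index=False skips the extra "index" column; if_exists="replace" makes the load re-runnable
bonuses_demo.to_sql("bonuses", con, index=False, if_exists="replace")
bonuses_demo.to_sql("bonuses", con, index=False, if_exists="replace")  # second run does not fail

# PRAGMA table_info returns one row per column; the name is the second field
columns = [row[1] for row in con.execute("PRAGMA table_info(bonuses)")]
row_count = con.execute("SELECT COUNT(*) FROM bonuses").fetchone()[0]
print(columns)    # ['EmployeeID', 'Bonus']
print(row_count)  # 4
con.close()
```

Whether to drop the index column is a judgment call: the notebook's tables keep it, which is why an `index` column appears in the query results.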

In [9]:
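rs = conn.execute(text("SELECT table_name FROM information_schema.tables WHERE table_schema = :schema"), {"schema": schema})
tables = [table[0] for table in rs.fetchall()]
# Join outside the f-string: a backslash or reused quote inside {} needs Python 3.12+
table_list = '\n- '.join(tables)
print(f'The tables in the database are: \n- {table_list}')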

The tables in the database are: 
- employees
- bonuses


In [10]:
for table in tables:
    print("=================================")
    print(f'Table [{table}]')
    df = pd.read_sql_query(f'SELECT * FROM {schema}.{table} LIMIT 5', conn)
    print(f'Dimensions: {df.shape[0]} rows x {df.shape[1]} columns\n')
    print(df.head())
    info_df = pd.DataFrame.from_dict({'Datatypes':df.dtypes, 'NULL count':df.isna().sum()})
    print()
    print(info_df)
    print()

Table [employees]
Dimensions: 5 rows x 7 columns

   index  EmployeeID     Name Department  Salary JoiningDate  PerformanceScore
0      0           1    Alice         HR   60000  2020-01-15                 3
1      1           2      Bob         IT   70000  2019-06-20                 4
2      2           3  Charlie         IT   80000  2018-07-23                 2
3      3           4    David         HR   65000  2020-02-10                 5
4      4           5      Eve    Finance   75000  2021-03-15                 3

                 Datatypes  NULL count
index                int64           0
EmployeeID           int64           0
Name                object           0
Department          object           0
Salary               int64           0
JoiningDate         object           0
PerformanceScore     int64           0

Table [bonuses]
Dimensions: 4 rows x 3 columns

   index  EmployeeID  Bonus
0      0           1   5000
1      1           2   7000
2      2           3   8000
3 