#### **First Step**: Querying data from the database


Task:

- Establish a connection to the database.
- Load the data into a DataFrame (`df`) for cleansing.


In [11]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt


# Add the 'src' folder to sys.path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

from connections.db import DB

db = DB()

In [12]:
# Fetch the data from the database as a dataframe
df = db.fetch_as_dataframe('../sql/queries/004_get_raw_data.sql')

2024-08-18 17:52:01,921 - ✔ Connected to database
2024-08-18 17:52:03,583 - ✔ Data loaded into DataFrame
2024-08-18 17:52:03,588 - ✔ Cursor closed
2024-08-18 17:52:03,588 - ✔ Connection closed
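
The `DB` class itself is not shown in this notebook. As a rough sketch of what a `fetch_as_dataframe`-style helper might do, the example below runs a query over a DB-API connection and builds the DataFrame from the cursor. It rests on several assumptions: `fetch_sql_as_dataframe` is a hypothetical name, it takes the query text rather than a path to a `.sql` file, and an in-memory `sqlite3` database stands in for the real one so the snippet runs on its own.

```python
import sqlite3

import pandas as pd


def fetch_sql_as_dataframe(conn, query: str) -> pd.DataFrame:
    """Hypothetical helper: run a query and return the rows as a DataFrame."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        # cursor.description has one entry per result column; item 0 is its name.
        columns = [desc[0] for desc in cursor.description]
        return pd.DataFrame(cursor.fetchall(), columns=columns)
    finally:
        # Close the cursor even if the query fails.
        cursor.close()


# Assumption: sqlite3 in memory replaces the project's real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (first_name TEXT, yoe INTEGER)")
conn.execute("INSERT INTO candidates VALUES ('Bernadette', 2)")
demo_df = fetch_sql_as_dataframe(conn, "SELECT * FROM candidates")
conn.close()
```

Closing the cursor in a `finally` block matches the "Cursor closed" and "Connection closed" lines in the log above, which suggest the real class also releases its resources after each call.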


---


#### **Second Step**: Clean the data


Task:

- Standardize column names.
- Identify inconsistencies in data types.


In [13]:
# Standardize the names of the columns.
df.columns = [col.lower() for col in df.columns]
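
Lowercasing is enough when the raw names already use underscores. If they also contained spaces or stray whitespace, a slightly broader normalization could look like the sketch below. The raw column names here are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw headers with mixed case, spaces, and padding.
raw = pd.DataFrame(columns=["First Name", " Email", "YOE", "Code Challenge Score"])

# Trim whitespace, lowercase, and replace inner spaces with underscores.
raw.columns = (
    raw.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

standardized = list(raw.columns)
```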

In [14]:
# Check the initial data type of each column
df.dtypes

first_name                   object
last_name                    object
email                        object
application_date             object
country                      object
yoe                           int64
seniority                    object
technology                   object
code_challenge_score          int64
technical_interview_score     int64
dtype: object

> Note:
>
> The `application_date` column is stored as `object` (strings). It must be converted to a datetime type so that time-based queries (filtering by year, sorting by date) behave correctly.


In [15]:
df['application_date'] = pd.to_datetime(df['application_date'])

In [16]:
df.dtypes

first_name                           object
last_name                            object
email                                object
application_date             datetime64[ns]
country                              object
yoe                                   int64
seniority                            object
technology                           object
code_challenge_score                  int64
technical_interview_score             int64
dtype: object

> Note:
>
> As we can see, the `application_date` column now has the `datetime64[ns]` dtype.
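
The conversion above relies on pandas inferring the date format and raises an error on values it cannot parse. A more defensive variant, sketched here on made-up sample values and assuming the dates follow `YYYY-MM-DD`, passes an explicit `format` and uses `errors="coerce"` so that unparseable entries become `NaT` and can be counted before going further.

```python
import pandas as pd

# Made-up sample: two valid dates and one malformed entry.
sample = pd.DataFrame(
    {"application_date": ["2021-02-26", "2021-09-09", "not a date"]}
)

# Assumption: dates are formatted as YYYY-MM-DD.
parsed = pd.to_datetime(
    sample["application_date"], format="%Y-%m-%d", errors="coerce"
)

# Values that failed to parse are NaT; count them to spot bad input.
n_unparsed = int(parsed.isna().sum())

# Once the column is datetime, time-based filters work through the .dt accessor.
n_in_2021 = int((parsed.dt.year == 2021).sum())
```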

---


#### **Third Step**: Upload the data as a new clean table


Task:

- Define the schema of the clean table and save it in `sql/migrations/schema_clean.sql`.
- Define `sql/migrations/seed_data_clean.sql` to upload the data.
- Run both scripts to create the table and load the data into it.


> Note:
>
> I developed a class that generates the `schema.sql` and `seed_data.sql` files automatically from the DataFrame.
>
> Check it out at [pysqlschema.py](https://github.com/DCajiao/workshop001_candidates_analysis/blob/develop/src/utils/pysqlschema.py)


In [17]:
from utils.pysqlschema import SQLSchemaGenerator

generator = SQLSchemaGenerator(table_name='candidates_cleaned')
generator.generate_schema(df, '../sql/migrations/schema_clean.sql')
generator.generate_seed_data(df, '../sql/migrations/seed_data_clean.sql')

2024-08-18 17:52:03,683 - Generating schema for candidates_cleaned
2024-08-18 17:52:03,684 - Infering SQL type for object
2024-08-18 17:52:03,685 - Infering SQL type for object
2024-08-18 17:52:03,686 - Infering SQL type for object
2024-08-18 17:52:03,687 - Infering SQL type for datetime64[ns]
2024-08-18 17:52:03,687 - Infering SQL type for object
2024-08-18 17:52:03,689 - Infering SQL type for int64
2024-08-18 17:52:03,690 - Infering SQL type for object
2024-08-18 17:52:03,690 - Infering SQL type for object
2024-08-18 17:52:03,691 - Infering SQL type for int64
2024-08-18 17:52:03,692 - Infering SQL type for int64
2024-08-18 17:52:03,694 - Query written to ../sql/migrations/schema_clean.sql
2024-08-18 17:52:03,695 - Generating seed data for candidates_cleaned
2024-08-18 17:52:07,443 - Query written to ../sql/migrations/seed_data_clean.sql


"INSERT INTO candidates_cleaned VALUES ('Bernadette', 'Langworth', 'leonard91@yahoo.com', '2021-02-26 00:00:00', 'Norway', 2, 'Intern', 'Data Engineer', 3, 3);\nINSERT INTO candidates_cleaned VALUES ('Camryn', 'Reynolds', 'zelda56@hotmail.com', '2021-09-09 00:00:00', 'Panama', 10, 'Intern', 'Data Engineer', 2, 10);\nINSERT INTO candidates_cleaned VALUES ('Larue', 'Spinka', 'okey_schultz41@gmail.com', '2020-04-14 00:00:00', 'Belarus', 4, 'Mid-Level', 'Client Success', 10, 9);\nINSERT INTO candidates_cleaned VALUES ('Arch', 'Spinka', 'elvera_kulas@yahoo.com', '2020-10-01 00:00:00', 'Eritrea', 25, 'Trainee', 'QA Manual', 7, 1);\nINSERT INTO candidates_cleaned VALUES ('Larue', 'Altenwerth', 'minnie.gislason@gmail.com', '2020-05-20 00:00:00', 'Myanmar', 13, 'Mid-Level', 'Social Media Community Management', 9, 7);\nINSERT INTO candidates_cleaned VALUES ('Alec', 'Abbott', 'juanita_hansen@gmail.com', '2019-08-17 00:00:00', 'Zimbabwe', 8, 'Junior', 'Adobe Experience Manager', 2, 9);\nINSERT INT
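
The log above shows one SQL type being inferred per column dtype. The sketch below illustrates how such a mapping could work. It is not the implementation in `pysqlschema.py`: the function names `infer_sql_type` and `build_create_table` and the chosen SQL type names are assumptions made for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_sql_type(dtype) -> str:
    """Hypothetical mapping from a pandas dtype to a SQL column type."""
    if ptypes.is_bool_dtype(dtype):
        return "BOOLEAN"
    if ptypes.is_integer_dtype(dtype):
        return "INTEGER"
    if ptypes.is_float_dtype(dtype):
        return "DOUBLE PRECISION"
    if ptypes.is_datetime64_any_dtype(dtype):
        return "TIMESTAMP"
    # Fallback for object/string columns.
    return "TEXT"


def build_create_table(df: pd.DataFrame, table_name: str) -> str:
    """Build a CREATE TABLE statement with one column per DataFrame column."""
    columns = ",\n".join(
        f"    {name} {infer_sql_type(dtype)}" for name, dtype in df.dtypes.items()
    )
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n{columns}\n);"


# Small sample mirroring three of the columns in the cleaned DataFrame.
sample = pd.DataFrame(
    {
        "first_name": ["Bernadette"],
        "application_date": pd.to_datetime(["2021-02-26"]),
        "yoe": [2],
    }
)
ddl = build_create_table(sample, "candidates_cleaned")
```

Checking `bool` before the integer test keeps boolean columns from being classified by a broader numeric rule, and the `TEXT` fallback means unknown dtypes still produce a valid column definition.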

In [18]:
# Create schema
db.execute("../sql/migrations/schema_clean.sql", False)

2024-08-18 17:52:08,444 - ✔ Connected to database
2024-08-18 17:52:08,827 - ✔ Query executed
2024-08-18 17:52:08,828 - ✔ Cursor closed
2024-08-18 17:52:08,828 - ✔ Connection closed


In [19]:
# Seed data
db.execute("../sql/migrations/seed_data_clean.sql", False)

2024-08-18 17:52:09,617 - ✔ Connected to database


In [None]:
# Check if the data was inserted correctly
db.execute("../sql/queries/001_view_tables.sql", True)

2024-08-18 14:21:23,260 - ✔ Connected to database
2024-08-18 14:21:23,666 - ✔ Query executed
2024-08-18 14:21:23,667 - ✔ Cursor closed
2024-08-18 14:21:23,667 - ✔ Connection closed


[('candidates',), ('candidates_cleaned',)]

---

#### **Results**:


- The raw data has been queried and loaded into a DataFrame.
- Column names have been standardized to lowercase.
- The `application_date` column has been converted to a datetime column.
- New `schema` and `seed_data` files have been generated automatically from the clean DataFrame, using [pysqlschema.py](https://github.com/DCajiao/workshop001_candidates_analysis/blob/develop/src/utils/pysqlschema.py), and saved to `sql/migrations/`.

---