#### **First Step**: Query data from the database


Task:

- Establish connection to the database
- Load data into a data frame such as `df` for cleansing


In [29]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt


# Add the 'src' folder to sys.path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

from connections.db import DB

db = DB()

In [30]:
# Fetch the data from the database as a dataframe
df = db.fetch_as_dataframe('../sql/queries/004_get_raw_data.sql')

2024-08-18 21:52:58,879 - ✔ Connected to database
2024-08-18 21:53:01,250 - ✔ Data loaded into DataFrame
2024-08-18 21:53:01,250 - ✔ Cursor closed
2024-08-18 21:53:01,250 - ✔ Connection closed
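The `DB` class lives in `src/connections/db.py` and is not shown in this notebook. As a rough idea of what a "query to dataframe" helper can look like, here is a minimal sketch that uses an in-memory SQLite database as a stand-in. The function body, the table and the sample rows are assumptions for illustration, not the project's actual implementation.

```python
import sqlite3

import pandas as pd


def fetch_as_dataframe(conn: sqlite3.Connection, query: str) -> pd.DataFrame:
    """Run a SELECT statement and return the result set as a DataFrame."""
    return pd.read_sql_query(query, conn)


# Hypothetical stand-in for the real database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (first_name TEXT, yoe INTEGER)")
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?)",
    [("Ada", 5), ("Linus", 12)],
)
conn.commit()

sample_df = fetch_as_dataframe(conn, "SELECT * FROM candidates")
conn.close()  # release the connection once the data is in memory
```

Closing the connection right after loading mirrors the log output above, where the cursor and connection are closed as soon as the DataFrame is ready.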


---


#### **Second Step**: Clean the data


Task:

- Standardize column names.
- Identify inconsistencies in data types.


In [31]:
# Standardize the names of the columns.
df.columns = [col.lower() for col in df.columns]
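Lowercasing is enough here because the column names contain no spaces. If the source headers were less tidy, the same step could also trim whitespace and replace spaces with underscores. A small sketch, using made-up headers that are not taken from the dataset:

```python
import pandas as pd

# Hypothetical headers, chosen to resemble untidy source columns.
raw = pd.DataFrame(columns=["First Name", " Email", "Code Challenge Score"])

# Trim whitespace, lowercase, and replace inner spaces with underscores.
raw.columns = (
    raw.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)
normalized_columns = list(raw.columns)
```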

In [32]:
# Check type of initial columns
df.dtypes

first_name                   object
last_name                    object
email                        object
application_date             object
country                      object
yoe                           int64
seniority                    object
technology                   object
code_challenge_score          int64
technical_interview_score     int64
dtype: object

> Note:
>
> The `application_date` column is stored as `object` (string). It should be converted to datetime so that time-based queries behave correctly.


In [33]:
df['application_date'] = pd.to_datetime(df['application_date'])

> Note:
>
> As the output below shows, the `application_date` column is now of type `datetime64[ns]`.

In [34]:
df.dtypes

first_name                           object
last_name                            object
email                                object
application_date             datetime64[ns]
country                              object
yoe                                   int64
seniority                            object
technology                           object
code_challenge_score                  int64
technical_interview_score             int64
dtype: object
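The conversion above raises an error if any value cannot be parsed. When the quality of the date strings is uncertain, one option is to pass `errors="coerce"` and count the values that failed to parse. A minimal sketch with made-up values (the malformed entry is an assumption, not something found in this dataset):

```python
import pandas as pd

# Hypothetical values; the last one is deliberately malformed.
dates = pd.Series(["2021-02-26", "2021-09-09", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")

# Count how many values could not be parsed.
unparsed_count = int(parsed.isna().sum())
```

A non-zero `unparsed_count` would be a signal to inspect the raw strings before continuing.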

> Note: 
>
> The `technology` column has many categories, as we saw in the notebook [01_data_exploration](https://github.com/DCajiao/workshop001_candidates_analysis/blob/develop/notebooks/01_data_exploration.ipynb). I have therefore decided to create a new column called `technology_topic` that groups them into broader topics, so they are easier to aggregate in the visualization stage.

In [35]:
# In this way I will group the categories by topic:

technology_topic = {
    "Development - Backend" : "Development",
    "Development - FullStack" : "Development",
    "Development - CMS Frontend" : "Development",
    "Development - Frontend" : "Development",
    "Development - CMS Backend" : "Development",
    "DevOps" : "Development",
    "Security" : "Security",
    "Security Compliance" : "Security",
    "System Administration" : "Security",
    "QA Manual" : "QA",
    "QA Automation" : "QA",
    "Design" : "Design",
    "Adobe Experience Manager" : "Design",
    "Data Engineer" : "Data",
    "Business Intelligence" : "Data",
    "Database Administration" : "Data",
    "Business Analytics / Project Management" : "Data",
    "Mulesoft" : "Data",
    "Salesforce" : "Marketing",
    "Client Success" : "Marketing",
    "Sales" : "Marketing",
    "Technical Writing" : "Communication",
    "Social Media Community Management" : "Communication",
}

In [36]:
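# Map each category to its topic; categories missing from the mapping keep their original value
df['technology_topic'] = df['technology'].map(technology_topic)
df['technology_topic'] = df['technology_topic'].fillna(df['technology'])

# Reorder the columns so that `technology_topic` sits next to `technology`
df = df[['first_name', 'last_name', 'email', 'application_date', 'country', 'yoe', 'seniority', 'technology', 'technology_topic', 'code_challenge_score', 'technical_interview_score']]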
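Because `fillna` silently falls back to the original category, it can be useful to list which values were not covered by the mapping. The sketch below shows the `map` plus `fillna` pattern on a small excerpt of the dictionary. The value "Game Development" is a hypothetical category added only to demonstrate the fallback.

```python
import pandas as pd

# Excerpt of the mapping, plus one hypothetical unmapped category.
topic_map = {"QA Manual": "QA", "QA Automation": "QA", "DevOps": "Development"}
tech = pd.Series(["QA Manual", "DevOps", "Game Development"])

# map() returns NaN for every value that is not a key of the dictionary.
mapped = tech.map(topic_map)

# Collect the categories that had no mapping, for a quick review.
unmapped = sorted(tech[mapped.isna()].unique())

# Fall back to the original category where no topic was found.
topic = mapped.fillna(tech)
```

An empty `unmapped` list would confirm that the dictionary covers every category in the column.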


---


#### **Third Step**: Upload the data as a new clean table


Task:

- Define the clean table schema and save it in `sql/migrations/schema_clean.sql`.
- Generate `sql/migrations/seed_data_clean.sql` to load the data.
- Run both scripts to create the table and load the data into it.


> Note: 
>
> I developed a class that generates the `schema.sql` and `seed_data.sql` files automatically from the dataframe.
>
> Check it out at [pysqlschema.py](https://github.com/DCajiao/workshop001_candidates_analysis/blob/develop/src/utils/pysqlschema.py)


In [37]:
from utils.pysqlschema import SQLSchemaGenerator

generator = SQLSchemaGenerator(table_name='candidates_cleaned')
generator.generate_schema(df, '../sql/migrations/schema_clean.sql')
generator.generate_seed_data(df, '../sql/migrations/seed_data_clean.sql')

2024-08-18 21:53:01,402 - Generating schema for candidates_cleaned
2024-08-18 21:53:01,402 - Infering SQL type for object
2024-08-18 21:53:01,402 - Infering SQL type for object
2024-08-18 21:53:01,406 - Infering SQL type for object
2024-08-18 21:53:01,406 - Infering SQL type for datetime64[ns]
2024-08-18 21:53:01,406 - Infering SQL type for object
2024-08-18 21:53:01,408 - Infering SQL type for int64
2024-08-18 21:53:01,409 - Infering SQL type for object
2024-08-18 21:53:01,409 - Infering SQL type for object
2024-08-18 21:53:01,409 - Infering SQL type for object
2024-08-18 21:53:01,409 - Infering SQL type for int64
2024-08-18 21:53:01,409 - Infering SQL type for int64
2024-08-18 21:53:01,414 - Query written to ../sql/migrations/schema_clean.sql
2024-08-18 21:53:01,415 - Generating seed data for candidates_cleaned
2024-08-18 21:53:05,213 - Query written to ../sql/migrations/seed_data_clean.sql
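The log shows one SQL type being inferred per column dtype. The actual logic is in `pysqlschema.py`. The sketch below only illustrates the general idea of mapping pandas dtypes to SQL types. The function names, the chosen SQL types and the sample frame are all assumptions and may differ from what `SQLSchemaGenerator` really does.

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_sql_type(dtype) -> str:
    """Map a pandas dtype to a generic SQL column type (illustrative only)."""
    if ptypes.is_bool_dtype(dtype):
        return "BOOLEAN"
    if ptypes.is_integer_dtype(dtype):
        return "INTEGER"
    if ptypes.is_float_dtype(dtype):
        return "FLOAT"
    if ptypes.is_datetime64_any_dtype(dtype):
        return "TIMESTAMP"
    return "VARCHAR(255)"  # fallback for object/string columns


def build_create_table(frame: pd.DataFrame, table_name: str) -> str:
    """Assemble a CREATE TABLE statement from the DataFrame's columns."""
    columns = ",\n".join(
        f"    {name} {infer_sql_type(dtype)}"
        for name, dtype in frame.dtypes.items()
    )
    return f"CREATE TABLE {table_name} (\n{columns}\n);"


# Hypothetical one-row frame with the three kinds of dtype seen in the log.
example = pd.DataFrame(
    {
        "email": ["a@example.com"],
        "application_date": pd.to_datetime(["2021-02-26"]),
        "yoe": [3],
    }
)
ddl = build_create_table(example, "candidates_cleaned")
```

Checking booleans before integers keeps the branches unambiguous, and the string fallback covers every `object` column.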


In [38]:
# Create schema
db.execute("../sql/migrations/schema_clean.sql", False)

In [39]:
# Seed data
db.execute("../sql/migrations/seed_data_clean.sql", False)

In [40]:
# Check if the data was inserted correctly
db.execute("../sql/queries/001_view_tables.sql", True)

In [41]:
# Check the size of the tables
db.execute("../sql/queries/003_view_tables_sizes.sql", True)
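Another simple check is to compare the number of rows in the new table with the number of rows in the dataframe. The sketch below does this against an in-memory SQLite database. The sample rows and the use of `to_sql` are assumptions for illustration; the notebook itself loads the data through the generated seed file.

```python
import sqlite3

import pandas as pd

# Hypothetical clean data standing in for the real dataframe.
clean = pd.DataFrame(
    {"email": ["a@example.com", "b@example.com"], "yoe": [3, 7]}
)

conn = sqlite3.connect(":memory:")
clean.to_sql("candidates_cleaned", conn, index=False)

# Count the rows that actually landed in the table.
row_count = conn.execute(
    "SELECT COUNT(*) FROM candidates_cleaned"
).fetchone()[0]
conn.close()

# The table should hold exactly as many rows as the dataframe.
rows_match = row_count == len(clean)
```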

---

#### **Results**:


- The raw data has been queried and loaded as a dataframe.
- Column names have been standardized.
- The `application_date` column has been correctly formatted as a datetime column.
- A `technology_topic` column has been added to group the `technology` categories in future graphs.
- A new `schema` and `seed_data` have been generated automatically from the clean dataframe, using [pysqlschema.py](https://github.com/DCajiao/workshop001_candidates_analysis/blob/develop/src/utils/pysqlschema.py), and saved to `sql/migrations/`.
- The clean data table has been created in the database.

---