# CSV DATABASE LOAD

### Importing necessary libraries

This section imports the libraries needed to load, manipulate, and store data in a relational database. It also adjusts the working environment so that modules in the parent directory can be imported.

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), ".."))): Adds the parent directory of the current working directory to sys.path, so that the src package can be imported from this notebook.
connect_db (from src.database.db_conection): Establishes the connection to the database.
create_database (from src.database.database_create): Creates and structures the database.

In [3]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
import pandas as pd
from src.database.db_conection import connect_db
from src.database.database_create import create_database
from sqlalchemy import text
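
Re-running the import cell appends the same directory to sys.path each time. A minimal sketch of a guarded version of the same path adjustment follows; the helper name add_parent_to_path is hypothetical and not part of the project.

```python
import os
import sys


def add_parent_to_path() -> str:
    """Append the parent of the working directory to sys.path once (hypothetical helper)."""
    parent = os.path.abspath(os.path.join(os.getcwd(), ".."))
    # Skip the append if the directory is already registered,
    # so repeated cell executions do not pile up duplicates.
    if parent not in sys.path:
        sys.path.append(parent)
    return parent


project_root = add_parent_to_path()
length_after_first_call = len(sys.path)

# A second call leaves sys.path unchanged.
add_parent_to_path()
length_after_second_call = len(sys.path)
```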

### Load the CSV to start data manipulation

The pd.read_csv() function reads the CSV file located at ../data/candidates.csv. Specifying the delimiter (sep=";") and the encoding (utf-8) correctly is key to avoiding read errors and preserving the integrity of the imported data. Printing the first records with head() helps to understand the dataset and to spot inconsistencies or preprocessing needs before further analysis.
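
To illustrate why the delimiter matters, the sketch below parses a small in-memory sample with and without sep=";". The sample rows are made up for the example and are not taken from candidates.csv.

```python
import io

import pandas as pd

# Hypothetical two-row sample in a semicolon-delimited layout.
sample = "First Name;Country;YOE\nAna;Norway;2\nLuis;Panama;10\n"

# With the correct delimiter, each field becomes its own column.
parsed = pd.read_csv(io.StringIO(sample), sep=";")

# With the default comma delimiter, each row collapses into a single column.
collapsed = pd.read_csv(io.StringIO(sample))

print(parsed.shape)     # (2, 3)
print(collapsed.shape)  # (2, 1)
```

A column count of 1 right after loading is therefore a quick sign that the delimiter was wrong.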


In [4]:
data = pd.read_csv('../data/candidates.csv', sep=";", encoding='utf-8')
print(data.head())

   First Name   Last Name                      Email Application Date  \
0  Bernadette   Langworth        leonard91@yahoo.com       2021-02-26   
1      Camryn    Reynolds        zelda56@hotmail.com       2021-09-09   
2       Larue      Spinka   okey_schultz41@gmail.com       2020-04-14   
3        Arch      Spinka     elvera_kulas@yahoo.com       2020-10-01   
4       Larue  Altenwerth  minnie.gislason@gmail.com       2020-05-20   

   Country  YOE  Seniority                         Technology  \
0   Norway    2     Intern                      Data Engineer   
1   Panama   10     Intern                      Data Engineer   
2  Belarus    4  Mid-Level                     Client Success   
3  Eritrea   25    Trainee                          QA Manual   
4  Myanmar   13  Mid-Level  Social Media Community Management   

   Code Challenge Score  Technical Interview Score  
0                     3                          3  
1                     2                         10  
2          

The .count() method returns the number of non-null values in each column of the DataFrame loaded above. It is a quick way to assess data quality and detect missing values in the dataset.

In [5]:
data.count()

First Name                   50000
Last Name                    50000
Email                        50000
Application Date             50000
Country                      50000
YOE                          50000
Seniority                    50000
Technology                   50000
Code Challenge Score         50000
Technical Interview Score    50000
dtype: int64

Complete data: every column reports 50,000 non-null values. Because count() excludes nulls and all columns report the same total, this indicates that none of them contains null values.
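
As a complementary check, isna().sum() reports the missing values directly instead of inferring them from the non-null counts. The sketch below uses a small made-up DataFrame with one gap to show how the two views relate.

```python
import pandas as pd

# Hypothetical sample with one missing score.
sample = pd.DataFrame({
    "Country": ["Norway", "Panama", "Belarus"],
    "Code Challenge Score": [3, None, 7],
})

non_null = sample.count()      # non-null values per column
missing = sample.isna().sum()  # null values per column

# A column has gaps when its non-null count is below the row count.
has_gaps = bool((non_null < len(sample)).any())

print(non_null["Code Challenge Score"])  # 2
print(missing["Code Challenge Score"])   # 1
print(has_gaps)                          # True
```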

### Create the database after reading the CSV

The following cell calls the create_database() function, defined in the src.database.database_create module. The function creates the database, so that the infrastructure needed to store the data is in place before loading.

In [None]:
if __name__ == "__main__":
    print("Creating the database...")
    create_database()  

Creating the database...
Database 'candidates' created successfully.


This output implies that the connection to the database engine succeeded and that the user had the permissions needed to create a database.

### Upload the CSV to the database

In [7]:
engine = connect_db()

The connect_db() function, located in src.database.db_conection, establishes the connection to the database engine using SQLAlchemy, a SQL toolkit and ORM (Object Relational Mapper) widely used in data analysis environments.

The engine variable stores the connection object returned by connect_db(). This object allows SQL queries to be executed and the database to be accessed from Python.
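
The implementation of connect_db() is not shown here. As a rough sketch of how such a function can return a SQLAlchemy engine, the example below uses an in-memory SQLite URL as a stand-in; the function name connect_db_sketch and the URL are assumptions, and the real project presumably targets a different database server.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine


def connect_db_sketch(url: str = "sqlite:///:memory:") -> Engine:
    """Hypothetical stand-in for connect_db(); the real URL and driver are not shown."""
    return create_engine(url)


sketch_engine = connect_db_sketch()

# An Engine hands out connections on demand; text() wraps a raw SQL string.
with sketch_engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()

print(result)  # 1
```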

In [None]:
data.to_sql("rawCandidates", engine, if_exists="replace", index=False)
print("Data inserted correctly")

Data inserted correctly


The code follows a structured flow: load the CSV, connect to the database, and insert the data. The contents of the CSV file are written to the database, where they are available for subsequent analysis.

to_sql(): This pandas method writes a DataFrame to a SQL table.
"rawCandidates": The name of the table in the database where the data will be stored.
engine: The database connection object, previously created with connect_db().
if_exists="replace": If the table already exists, it is dropped and recreated with the new data.
index=False: The DataFrame index is not written as a column.
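
The effect of if_exists can be checked on a small scale. The sketch below uses an in-memory SQLite connection as a stand-in for the project's engine and a made-up two-row DataFrame; it compares "replace" with "append".

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the real database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({"First Name": ["Ana", "Luis"], "YOE": [2, 10]})
count_query = 'SELECT COUNT(*) AS n FROM "rawCandidates"'

# Writing twice with "replace" drops and recreates the table, so rows do not accumulate.
sample.to_sql("rawCandidates", conn, if_exists="replace", index=False)
sample.to_sql("rawCandidates", conn, if_exists="replace", index=False)
rows_after_replace = int(pd.read_sql_query(count_query, conn)["n"].iloc[0])

# "append" keeps the existing rows and adds the new ones.
sample.to_sql("rawCandidates", conn, if_exists="append", index=False)
rows_after_append = int(pd.read_sql_query(count_query, conn)["n"].iloc[0])

conn.close()

print(rows_after_replace, rows_after_append)  # 2 4
```

Using "replace" makes the load cell safe to re-run, at the cost of discarding whatever the table held before.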

The printed message indicates that the upload to the database completed without raising an error.

### Verify that the data has been loaded correctly

In [9]:
df = pd.read_sql_query('SELECT * FROM "rawCandidates"', engine)
print(df.head())

   First Name   Last Name                      Email Application Date  \
0  Bernadette   Langworth        leonard91@yahoo.com       2021-02-26   
1      Camryn    Reynolds        zelda56@hotmail.com       2021-09-09   
2       Larue      Spinka   okey_schultz41@gmail.com       2020-04-14   
3        Arch      Spinka     elvera_kulas@yahoo.com       2020-10-01   
4       Larue  Altenwerth  minnie.gislason@gmail.com       2020-05-20   

   Country  YOE  Seniority                         Technology  \
0   Norway    2     Intern                      Data Engineer   
1   Panama   10     Intern                      Data Engineer   
2  Belarus    4  Mid-Level                     Client Success   
3  Eritrea   25    Trainee                          QA Manual   
4  Myanmar   13  Mid-Level  Social Media Community Management   

   Code Challenge Score  Technical Interview Score  
0                     3                          3  
1                     2                         10  
2          

The query runs through the previously created engine and retrieves the rows stored in the rawCandidates table, which holds the data loaded from the CSV file.

The main purpose of this operation is to verify the correct insertion of the data, ensuring that the information stored in the database matches the original source.

In [10]:
print(df.count())

First Name                   50000
Last Name                    50000
Email                        50000
Application Date             50000
Country                      50000
YOE                          50000
Seniority                    50000
Technology                   50000
Code Challenge Score         50000
Technical Interview Score    50000
dtype: int64


The counts match those of the original DataFrame: every column has 50,000 non-null values, which confirms that the load was complete and introduced no missing values.
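
Matching counts show that no values were lost, but they do not compare the values themselves. A stricter round-trip check could look like the sketch below, which again uses in-memory SQLite and a made-up DataFrame as stand-ins for the project's database and data.

```python
import sqlite3

import pandas as pd

# Hypothetical sample standing in for the candidates data.
original = pd.DataFrame({
    "First Name": ["Ana", "Luis"],
    "Country": ["Norway", "Panama"],
    "YOE": [2, 10],
})

conn = sqlite3.connect(":memory:")
original.to_sql("rawCandidates", conn, if_exists="replace", index=False)
loaded = pd.read_sql_query('SELECT * FROM "rawCandidates"', conn)
conn.close()

# Compare shape, column names, and row contents.
# Row order is assumed to be preserved here; on other databases an
# ORDER BY clause may be needed before comparing row by row.
same_shape = original.shape == loaded.shape
same_columns = list(original.columns) == list(loaded.columns)
same_values = original.to_dict("records") == loaded.to_dict("records")

print(same_shape, same_columns, same_values)  # True True True
```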