## **WORKSHOP 001 - NOTEBOOK #2: Exploratory Data Analysis (EDA)**

This notebook focuses on conducting **Exploratory Data Analysis (EDA)** on the candidates' dataset. EDA is a fundamental step in the data analysis process, as it allows us to understand the structure and characteristics of the data. By examining the dataset, we can identify patterns, relationships, and key insights that inform further analysis and decision-making.

In this notebook, we will explore the dataset using a range of statistical and visual techniques. We will assess the distribution of variables and investigate correlations between them. Through this process, we aim to develop a thorough understanding of the dataset and extract valuable insights.

---

### **Setting Environment**

In [1]:
import os

# Show the directory the notebook starts in.
print(os.getcwd())

# Move to the project root (workshop-001).
try:
    os.chdir("../../workshop-001")
except FileNotFoundError:
    print(
        "FileNotFoundError - The directory may not exist or you might not be in the specified path.\n"
        "If this has already worked, do not run this block again, as the current directory is already set to workshop-001."
    )

# Confirm the working directory after the change.
print(os.getcwd())

d:\U\FIFTH SEMESTER\ETL\workshop-001\notebooks
d:\U\FIFTH SEMESTER\ETL\workshop-001


## **Load Data**

### **Import dependencies**

We will use Pandas to analyse the data in a DataFrame, and Matplotlib and Seaborn to plot the patterns found in the dataset.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
plt.style.use('ggplot')

In [4]:
from functions.db_connection.connection import creating_engine

### **Create engine**

The connection logic is delegated to a Python script called _connection.py_, whose _creating_engine_ function sets up the connection to a PostgreSQL database using SQLAlchemy.

In [6]:
engine = creating_engine()
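
The notebook does not show the contents of _connection.py_, so the following is only a minimal sketch of what a function like _creating_engine_ might do. The helper `build_postgres_url`, the name `creating_engine_sketch`, and the `PG_*` environment variables are all assumptions made for illustration; the real script may read its credentials from somewhere else entirely.

```python
import os
from urllib.parse import quote


def build_postgres_url(user, password, host, port, database):
    """Assemble a PostgreSQL connection URL, percent-encoding the credentials."""
    # safe="" also encodes characters such as "/" and "@" that would break the URL.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


def creating_engine_sketch():
    """Hypothetical stand-in for creating_engine(); the real connection.py may differ."""
    # Imported here so the URL helper above works even without SQLAlchemy installed.
    from sqlalchemy import create_engine

    # The PG_* variable names are assumptions for this sketch, not the project's own.
    url = build_postgres_url(
        user=os.environ["PG_USER"],
        password=os.environ["PG_PASSWORD"],
        host=os.getenv("PG_HOST", "localhost"),
        port=os.getenv("PG_PORT", "5432"),
        database=os.environ["PG_DATABASE"],
    )
    return create_engine(url)
```

Keeping credentials in environment variables (or a file excluded from version control) rather than in the notebook avoids committing them with the analysis.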

### **Load Database**

We read the _candidates_raw_ table from the PostgreSQL database through the SQLAlchemy engine, parsing the _Application Date_ column as dates while loading.

In [None]:
df = pd.read_sql_table("candidates_raw", engine, parse_dates=["Application Date"])
df
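
To see what `parse_dates` does without a PostgreSQL server, here is a small sketch that uses an in-memory SQLite database as a stand-in. It swaps `pd.read_sql_table` for `pd.read_sql_query`, because `read_sql_table` requires a SQLAlchemy connectable while `read_sql_query` also accepts a plain `sqlite3` connection. The two rows and the `candidate_id` column are made up for illustration; only the table name and the _Application Date_ column come from the notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical rows; the dates are stored as plain text, as SQLite has no date type.
demo = pd.DataFrame(
    {
        "candidate_id": [1, 2],
        "Application Date": ["2021-02-26", "2021-09-09"],
    }
)

conn = sqlite3.connect(":memory:")
try:
    # Write the demo rows, then read them back with date parsing enabled.
    demo.to_sql("candidates_raw", conn, index=False)
    loaded = pd.read_sql_query(
        "SELECT * FROM candidates_raw",
        conn,
        parse_dates=["Application Date"],
    )
finally:
    conn.close()

# "Application Date" is now a datetime column instead of text.
print(loaded.dtypes)
```

With a datetime column, accessors such as `loaded["Application Date"].dt.year` become available for grouping or plotting over time.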