> # Lab 3: Data Engineering & Exploratory Data Analysis (EDA) Workshop

**Student Name:** `Hasyashri Bhatt`

**Student Number:** `9028501`

**Course:** `Machine Learning Programming (PROG8245)`

**Reference:** For coding help and reference, I used ChatGPT, Copilot, and W3Schools.com.

---

**Introduction:**

This lab focuses on practical data engineering and exploratory data analysis (EDA) using Python, SQL, and cloud-based infrastructure. We connect to a PostgreSQL database hosted on Neon.tech, populate it with synthetic employee data, and then perform a series of data processing and analytical tasks using the Pandas library. The aim is to simulate a real-world scenario where data must be collected, cleaned, transformed, and visualized to extract meaningful business insights.

---

**Objective:**

1. Set up a free PostgreSQL cloud database on Neon.tech

2. Generate and insert synthetic employee data using the Faker library

3. Connect to the database with psycopg2 and SQLAlchemy

4. Load data into a Pandas DataFrame and perform:

   - Data cleaning and transformation

   - Feature engineering (e.g., calculating years of service)

   - Scaling numeric data

5. Create and interpret two visualizations:

   - A grouped bar chart (salary by position and start year)

   - An advanced heatmap using joined department data

---         


> ## **1. Data Collection**
--- 

I created a free cloud database using `Neon.tech`, a service that provides a PostgreSQL database without requiring a credit card. In the database, I created a table named `employees` with the following columns:

`employee_id:` A unique ID for each employee

`name:` The employee's full name

`position:` Their job title (all in the IT field)

`start_date:` The date they joined the company (between 2015 and 2024)

`salary:` Their annual salary (ranging from $60,000 to $200,000)
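
The column list above could correspond to a table definition along the following lines. This is a minimal sketch, not the exact statement run in Neon: the column types, the `SERIAL` auto-generated key, and the helper name `create_employees_table` are assumptions, since the lab only lists the column names.

```python
# Hypothetical DDL for the employees table; the column types are assumptions.
CREATE_EMPLOYEES_SQL = """
CREATE TABLE IF NOT EXISTS employees (
    employee_id SERIAL PRIMARY KEY,   -- unique ID, generated by PostgreSQL
    name        VARCHAR(100) NOT NULL,
    position    VARCHAR(50)  NOT NULL,
    start_date  DATE         NOT NULL,
    salary      INTEGER      NOT NULL
);
"""


def create_employees_table(conn):
    """Run the DDL on an open DB-API connection (e.g. one from psycopg2)."""
    cur = conn.cursor()
    cur.execute(CREATE_EMPLOYEES_SQL)
    cur.close()
    conn.commit()
```

Leaving `employee_id` to the database is consistent with the `INSERT` statements in this lab, which do not supply an ID.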

I used the `psycopg2` library to connect Python to the cloud database and Pandas to load the data into a DataFrame for analysis.
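
The connection step might look roughly like the sketch below. It is an illustration under assumptions, not the exact code used in this lab: `DATABASE_URL` is a hypothetical environment variable holding the Neon connection string, and `connect_to_neon` and `fetch_employees` are made-up helper names.

```python
import os


def connect_to_neon():
    """Open a psycopg2 connection from a URL stored in an environment variable.

    DATABASE_URL is a hypothetical variable name; it should hold the
    connection string for the Neon database.
    """
    import psycopg2  # imported here so fetch_employees works without it

    return psycopg2.connect(os.environ["DATABASE_URL"])


def fetch_employees(conn):
    """Return (column_names, rows) for the employees table."""
    cur = conn.cursor()
    cur.execute(
        "SELECT employee_id, name, position, start_date, salary "
        "FROM employees ORDER BY employee_id"
    )
    columns = [col[0] for col in cur.description]  # first field is the name
    rows = cur.fetchall()
    cur.close()
    return columns, rows
```

With a live connection, `columns, rows = fetch_employees(connect_to_neon())` followed by `pd.DataFrame(rows, columns=columns)` would produce a DataFrame ready for analysis. Keeping the connection string in an environment variable avoids storing the password in the notebook.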

---

In [7]:
# Installing necessary packages for PostgreSQL and data generation
%pip install psycopg2-binary

Collecting psycopg2-binary
  Downloading psycopg2_binary-2.9.10-cp312-cp312-win_amd64.whl.metadata (5.0 kB)
Downloading psycopg2_binary-2.9.10-cp312-cp312-win_amd64.whl (1.2 MB)
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.9.10
Note: you may need to restart the kernel to use updated packages.





In [2]:
%pip install faker

Collecting faker
  Using cached faker-37.3.0-py3-none-any.whl.metadata (15 kB)
Collecting tzdata (from faker)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached faker-37.3.0-py3-none-any.whl (1.9 MB)
Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: tzdata, faker
Successfully installed faker-37.3.0 tzdata-2025.2
Note: you may need to restart the kernel to use updated packages.





In [8]:
# Importing necessary libraries
import random
from faker import Faker
from datetime import date # Import the date object
import psycopg2

**After this, I used the Python library Faker to generate 50 fake (but realistic) employee records. The script prints one `INSERT` statement per record; I copied these statements into the Neon SQL editor and ran them to populate the `employees` table.**
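
Copy-pasting works for a one-off load. As an alternative, the same rows could be inserted directly from Python with bound parameters, which removes the need to escape quotes in names by hand. The sketch below is not what was done in this lab; `insert_employees` and its `placeholder` argument are hypothetical names introduced for illustration.

```python
INSERT_SQL = (
    "INSERT INTO employees (name, position, start_date, salary) "
    "VALUES ({ph}, {ph}, {ph}, {ph})"
)


def insert_employees(conn, rows, placeholder="%s"):
    """Insert (name, position, start_date, salary) tuples with bound parameters.

    placeholder is "%s" for psycopg2; other DB-API drivers may expect "?".
    """
    cur = conn.cursor()
    # The driver sends the values separately from the SQL text, so a name
    # such as O'Brien needs no manual quote doubling.
    cur.executemany(INSERT_SQL.format(ph=placeholder), rows)
    cur.close()
    conn.commit()
    return len(rows)
```

Only the placeholder token is formatted into the SQL string; the row values themselves never are, which is the point of the technique.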

**Note:** The generated data changes every time the code runs. I copied the output of my first run into the `Neon.tech` database, so all results in this lab are based on those stored records, not on the output shown here.
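
One way to make the generated values repeatable is to seed the random number generator. The sketch below shows the idea with the standard library only; `make_rows`, the shortened `POSITIONS` list, and the seed value are assumptions for illustration. It does not generate names; Faker offers `Faker.seed(...)` for seeding its own generators in a similar way.

```python
import random
from datetime import date, timedelta

POSITIONS = ["Software Engineer", "Data Analyst", "DevOps Engineer"]  # shortened list
START, END = date(2015, 1, 1), date(2024, 6, 1)


def make_rows(n, seed):
    """Generate n (position, start_date, salary) tuples from a seeded RNG."""
    rng = random.Random(seed)  # private generator: same seed, same sequence
    span_days = (END - START).days
    rows = []
    for _ in range(n):
        position = rng.choice(POSITIONS)
        start_date = START + timedelta(days=rng.randint(0, span_days))
        salary = rng.randint(60000, 200000)
        rows.append((position, start_date, salary))
    return rows
```

Calling `make_rows(50, seed=42)` twice returns identical lists, so a rerun of the notebook would match what was loaded into the database.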


In [10]:
fake = Faker()

positions = [
    'Software Engineer', 'Data Analyst', 'DevOps Engineer', 'ML Engineer', 'QA Engineer',
    'Backend Developer', 'Frontend Developer', 'Cloud Architect', 'SysAdmin', 'Data Scientist'
]

# Date range for start dates, defined as date objects
start_date_obj = date(2015, 1, 1)
end_date_obj = date(2024, 6, 1)

for i in range(50):
    # Double any single quotes so the name is a valid SQL string literal
    name = fake.name().replace("'", "''")
    position = random.choice(positions)
    # Pass date objects to date_between
    start_date = fake.date_between(start_date=start_date_obj, end_date=end_date_obj)
    salary = random.randint(60000, 200000)

    # Print one INSERT statement per record; employee_id is left to the database
    print(f"INSERT INTO employees (name, position, start_date, salary) VALUES ('{name}', '{position}', '{start_date}', {salary});")


INSERT INTO employees (name, position, start_date, salary) VALUES ('Patrick Smith', 'SysAdmin', '2021-12-27', 113618);
INSERT INTO employees (name, position, start_date, salary) VALUES ('Juan Maddox', 'SysAdmin', '2023-07-21', 121114);
INSERT INTO employees (name, position, start_date, salary) VALUES ('Juan Ortiz', 'Data Analyst', '2022-04-21', 72008);
INSERT INTO employees (name, position, start_date, salary) VALUES ('Roberta Hudson', 'Software Engineer', '2018-12-22', 74096);
INSERT INTO employees (name, position, start_date, salary) VALUES ('Connie Lopez', 'Backend Developer', '2015-12-14', 65219);
INSERT INTO employees (name, position, start_date, salary) VALUES ('Mrs. Christina Jones MD', 'Frontend Developer', '2020-05-21', 72869);
INSERT INTO employees (name, position, start_date, salary) VALUES ('Nicholas Blair', 'QA Engineer', '2015-07-13', 170677);
INSERT INTO employees (name, position, start_date, salary) VALUES ('Patricia Rodriguez', 'ML Engineer', '2017-05-12', 172558);
INS