# Data Pipeline for H+ Sport  

## Overview  
This notebook demonstrates an **ETL (Extract, Transform, Load) pipeline** for processing business data from Excel workbooks. The dataset includes customer, employee, and order records. The final step loads the cleaned data into a PostgreSQL database for further analysis.  

## Objectives:  
- **Extract** data from Excel files.  
- **Transform** data (cleaning, formatting, handling missing values).  
- **Load** structured data into a PostgreSQL database.  

## Technologies Used:  
- **Python** (`pandas`, `sqlalchemy`, `psycopg2`)  
- **PostgreSQL** (for data storage)  
- **Excel** (as the raw data source)  

## Expected Outcome:  
By the end of this notebook, a structured dataset will be available in PostgreSQL, ready for querying and analysis.  


In [None]:
# Import Libraries

import pandas as pd  # data extraction, cleaning, and transformation
import psycopg2  # PostgreSQL driver that SQLAlchemy uses to talk to the database
from sqlalchemy import create_engine  # creates and manages the database connections

### Step 1: Extract data from the Excel files into pandas DataFrames

In [57]:
# Read the customer Excel file
customer = pd.read_excel("H+ Sport Customers.xlsx", sheet_name='data')
customer.head()  # first five rows

| | CustomerID | FirstName | LastName | Email | Phone | Address | City | State | Zipcode |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | Carol | Shaw | cshaw0@mlb.com | (206)804-8771 | 8157 Longview Court | Seattle | WA | 98121 |
| 1 | 101 | Elizabeth | Carr | ecarr1@oracle.com | (512)187-2507 | 3934 Petterle Trail | Austin | TX | 78732 |
| 2 | 102 | Ernest | Ramos | eramos2@plala.or.jp | (816)540-4257 | 8699 Clarendon Terrace | Kansas City | MO | 64199 |
| 3 | 103 | Jane | Carter | jcarter3@harvard.edu | (214)839-0542 | 2830 Novick Lane | Irving | TX | 75037 |
| 4 | 104 | Martha | Cooper | mcooper4@go.com | (727)235-5696 | 4537 Hoard Lane | Tampa | FL | 33625 |


In [107]:
# Read the employee Excel sheet
employee = pd.read_excel("H+ Sport Employees.xlsx", sheet_name='Employees-Table')
employee.head()  # first five rows

| | Employee Name | Building | Department | Status | Hire Date | Month | Years | Benefits | Salary | Job Rating | New Salary | Tax Rate | 2.91% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Page, Lisa | West | ADC | Full Time | 1999-01-11 | Jan | 25 | DMR | 24550 | 1 | | | |
| 1 | Taylor, Hector | West | ADC | Half-Time | 2011-02-21 | Feb | 13 | DM | 26795 | 4 | | | |
| 2 | Dawson, Jonathan | West | ADC | Contract | 2007-03-06 | Mar | 17 | | 42540 | 5 | | | |
| 3 | Duran, Brian | Taft | ADC | Hourly | 2012-08-30 | Aug | 12 | | 35680 | 2 | | | |
| 4 | Weber, Larry | Watson | ADC | Full Time | 2007-12-31 | Dec | 17 | M | 72830 | 2 | | | |


In [122]:
# Read the orders Excel file
order = pd.read_excel("H+ Sport Orders.xlsx", sheet_name='data')
order.head()  # first five rows

| | OrderID | Date | TotalDue | Status | CustomerID | SalespersonID | CustomersComment | SalespersonsComment |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 05/14/2016 | $140.91 | paid | 413 | 130 | | |
| 1 | 1001 | 07/31/2016 | $105.32 | returned | 128 | 102 | | |
| 2 | 1002 | 07/09/2016 | $217.30 | past due | 791 | 115 | | |
| 3 | 1003 | 04/04/2016 | $281.39 | paid | 974 | 139 | | |
| 4 | 1004 | 02/16/2016 | $254.76 | paid | 866 | 102 | | |


### Step 2: Transform the data (i.e., clean it): handle missing and duplicate data

#### Customer

In [60]:
# Customer: identify rows whose (Email, Phone) pair duplicates an earlier row.
columns_to_check = ['Email', 'Phone']
duplicates_cus = customer[customer.duplicated(columns_to_check)]  # both columns must match together
print(duplicates_cus)

Empty DataFrame
Columns: [CustomerID, FirstName, LastName, Email, Phone, Address, City, State, Zipcode]
Index: []
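
Passing both columns to `duplicated` flags a row only when its Email and Phone pair matches an earlier row. If the goal is to catch a repeat in either column, one option is to check each column separately and combine the results. The sketch below uses made-up sample rows (not taken from the H+ Sport file) to show the difference between the two checks.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "a@example.com", "c@example.com"],
    "Phone": ["111", "222", "333", "222"],
})

# Pair-wise check: a row counts as a duplicate only if BOTH values repeat together
pair_dupes = sample.duplicated(["Email", "Phone"])

# Either-column check: a row counts as a duplicate if Email OR Phone repeats
either_dupes = sample.duplicated("Email") | sample.duplicated("Phone")

print(pair_dupes.sum())    # 0
print(either_dupes.sum())  # 2
```

Which check is appropriate depends on whether two customers may legitimately share a phone number or an email address.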


In [61]:
duplicates_cus.shape

(0, 9)

In [62]:
customer = customer.drop_duplicates(columns_to_check, keep='first')  # drop duplicates, keeping the first occurrence
customer.to_excel('H+ Sport Customers.xlsx', sheet_name='data', index=False)  # overwrite the original file with the cleaned data

In [63]:
# Re-run the check to confirm no duplicate (Email, Phone) pairs remain
columns_to_check = ['Email', 'Phone']
duplicates_cus = customer[customer.duplicated(columns_to_check)]
print(duplicates_cus)

Empty DataFrame
Columns: [CustomerID, FirstName, LastName, Email, Phone, Address, City, State, Zipcode]
Index: []


In [64]:
# Check for missing values in the customer data
customer.isnull().sum()

CustomerID    0
FirstName     0
LastName      0
Email         0
Phone         0
Address       0
City          0
State         0
Zipcode       0
dtype: int64

#### Employee

In [108]:
# find Employee size
employee.shape

(741, 13)

In [109]:
# Employee: count fully duplicated rows
employee.duplicated().sum()

np.int64(0)

In [110]:
employee = employee.drop_duplicates(keep='first') # drop duplicates

In [111]:
employee.duplicated().sum() # check duplicates

np.int64(0)

In [112]:
employee.isnull().sum()

Employee Name      0
Building           0
Department         0
Status             0
Hire Date          0
Month              0
Years              0
Benefits         247
Salary             0
Job Rating         0
New Salary       741
Tax Rate         741
2.91%            741
dtype: int64

In [113]:
employee.isnull().all()  # find completely empty columns

Employee Name    False
Building         False
Department       False
Status           False
Hire Date        False
Month            False
Years            False
Benefits         False
Salary           False
Job Rating       False
New Salary        True
Tax Rate          True
2.91%             True
dtype: bool

In [114]:
employee.columns

Index(['Employee Name', 'Building', 'Department', 'Status', 'Hire Date',
       'Month', 'Years', 'Benefits', 'Salary', 'Job Rating', 'New Salary',
       'Tax Rate', '2.91%'],
      dtype='object')

In [115]:
# Drop the three empty columns; 'Job Rating' is populated but is dropped here as well
columns_remove = ['New Salary', 'Tax Rate', '2.91%', 'Job Rating']
employee = employee.drop(columns=columns_remove)
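
As an alternative to listing the empty columns by hand, pandas can find and drop them directly. This is a minimal sketch on a made-up frame (the column names echo the employee sheet, but the rows are hypothetical); it removes only columns in which every value is missing.

```python
import pandas as pd

# Hypothetical sample: one populated column, one partly empty, one fully empty
sample = pd.DataFrame({
    "Salary": [24550, 26795, 42540],
    "Benefits": ["DMR", None, None],
    "New Salary": [None, None, None],
})

# Names of the columns in which every value is missing
empty_cols = sample.columns[sample.isnull().all()].tolist()

# Drop them in one step; how="all" leaves partly filled columns untouched
trimmed = sample.dropna(axis=1, how="all")

print(empty_cols)
print(trimmed.columns.tolist())
```

A populated column such as `Job Rating` would not be caught by this approach and would still need an explicit `drop`.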


In [116]:
employee.columns

Index(['Employee Name', 'Building', 'Department', 'Status', 'Hire Date',
       'Month', 'Years', 'Benefits', 'Salary'],
      dtype='object')

In [117]:
# Employee - dealing with missing values
employee.head()

| | Employee Name | Building | Department | Status | Hire Date | Month | Years | Benefits | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Page, Lisa | West | ADC | Full Time | 1999-01-11 | Jan | 25 | DMR | 24550 |
| 1 | Taylor, Hector | West | ADC | Half-Time | 2011-02-21 | Feb | 13 | DM | 26795 |
| 2 | Dawson, Jonathan | West | ADC | Contract | 2007-03-06 | Mar | 17 | | 42540 |
| 3 | Duran, Brian | Taft | ADC | Hourly | 2012-08-30 | Aug | 12 | | 35680 |
| 4 | Weber, Larry | Watson | ADC | Full Time | 2007-12-31 | Dec | 17 | M | 72830 |


In [118]:
employee['Benefits'] = employee['Benefits'].fillna('Unknown')  # replace missing values with 'Unknown'
employee.head()

| | Employee Name | Building | Department | Status | Hire Date | Month | Years | Benefits | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Page, Lisa | West | ADC | Full Time | 1999-01-11 | Jan | 25 | DMR | 24550 |
| 1 | Taylor, Hector | West | ADC | Half-Time | 2011-02-21 | Feb | 13 | DM | 26795 |
| 2 | Dawson, Jonathan | West | ADC | Contract | 2007-03-06 | Mar | 17 | Unknown | 42540 |
| 3 | Duran, Brian | Taft | ADC | Hourly | 2012-08-30 | Aug | 12 | Unknown | 35680 |
| 4 | Weber, Larry | Watson | ADC | Full Time | 2007-12-31 | Dec | 17 | M | 72830 |


In [119]:
employee.isnull().sum()

Employee Name    0
Building         0
Department       0
Status           0
Hire Date        0
Month            0
Years            0
Benefits         0
Salary           0
dtype: int64

In [120]:
employee.to_excel("H+ Sport Employees.xlsx", sheet_name="Employees-Table", index=False)  # overwrite the original file with the cleaned data

#### Orders

In [125]:
orders_dup = order[order.duplicated("OrderID")]  # rows whose OrderID repeats an earlier row
orders_dup

| | OrderID | Date | TotalDue | Status | CustomerID | SalespersonID | CustomersComment | SalespersonsComment |
|---|---|---|---|---|---|---|---|---|
| 138 | 1105 | 05/10/2016 | $158.59 | paid | 666 | 147 | | |
| 139 | 1106 | 05/23/2016 | $267.98 | paid | 743 | 119 | | |
| 140 | 1107 | 07/25/2015 | $332.55 | paid | 949 | 115 | | |
| 141 | 1108 | 11/19/2015 | $215.83 | paid | 758 | 139 | | |
| 142 | 1109 | 08/09/2015 | $225.52 | paid | 810 | 110 | | |
| 143 | 1110 | 07/11/2016 | $93.32 | paid | 211 | 120 | | |
| 144 | 1111 | 05/27/2016 | $24.14 | paid | 400 | 101 | | |
| 145 | 1112 | 05/12/2016 | $299.00 | paid | 822 | 106 | | |
| 146 | 1113 | 08/21/2015 | $58.78 | due | 189 | 102 | | |
| 147 | 1114 | 08/02/2015 | $367.73 | paid | 248 | 148 | | |
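
By default `duplicated("OrderID")` lists only the later copies of each repeated ID. To compare every copy side by side before deciding which one to keep, `keep=False` can be used. The sketch below runs on made-up sample orders, not on the H+ Sport file.

```python
import pandas as pd

# Hypothetical sample orders, for illustration only
sample = pd.DataFrame({
    "OrderID": [1000, 1001, 1001, 1002],
    "Status": ["paid", "due", "paid", "paid"],
})

# keep=False marks every member of a duplicate group, not just the later ones
all_copies = sample[sample.duplicated("OrderID", keep=False)]

print(all_copies)
```

Here both rows with `OrderID` 1001 are returned, which makes it visible that they differ in `Status`.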


In [127]:
order = order.drop_duplicates("OrderID", keep='first')  # drop repeated OrderIDs, keeping the first occurrence
order.duplicated("OrderID").sum()  # confirm no repeated OrderIDs remain

np.int64(0)

In [128]:
# Check for missing values
order.isnull().sum()

OrderID                  0
Date                     0
TotalDue                 0
Status                   0
CustomerID               0
SalespersonID            0
CustomersComment       200
SalespersonsComment    200
dtype: int64

In [129]:
order.isnull().all() # find empty columns

OrderID                False
Date                   False
TotalDue               False
Status                 False
CustomerID             False
SalespersonID          False
CustomersComment        True
SalespersonsComment     True
dtype: bool

In [130]:
# remove empty columns
order = order.drop(columns=['CustomersComment','SalespersonsComment'])
order.columns

Index(['OrderID', 'Date', 'TotalDue', 'Status', 'CustomerID', 'SalespersonID'], dtype='object')
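
The `order.head()` output suggests that `Date` and `TotalDue` are stored as text (`05/14/2016`, `$140.91`). If that is the case, converting them to proper date and numeric types before loading makes sorting and aggregation in PostgreSQL straightforward. The following is a minimal sketch under that assumption, using a two-row sample that mirrors the layout shown above.

```python
import pandas as pd

# Small sample mirroring the text layout shown in order.head()
sample = pd.DataFrame({
    "Date": ["05/14/2016", "07/31/2016"],
    "TotalDue": ["$140.91", "$105.32"],
})

# Parse month/day/year text into datetime values
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

# Strip the currency symbol and any thousands separators, then convert to float
sample["TotalDue"] = (
    sample["TotalDue"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(sample.dtypes)
```

If the columns already arrive as datetime or numeric types, this step is unnecessary.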

In [131]:
# save in original sheet
order.to_excel("H+ Sport Orders.xlsx", sheet_name="data", index=False)

### Step 3: Create a database
In pgAdmin 4, create the database and its tables.

### Step 4: Load the clean data into the database

In [143]:
# Database credentials
username = 'postgres'
password = '********'
host = 'localhost'
port = 5432
db_name = 'H+ Sport'

In [144]:
# Establish a connection
engine = create_engine(f'postgresql://{username}:{password}@{host}:{port}/{db_name}')
try:
    with engine.connect():
        print("Connection successful!")
except Exception as e:
    print(f"Connection failed: {e}")

Connection successful!
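
Building the connection string with an f-string works as long as the credentials contain no URL-special characters. A password containing characters such as `@` or `/` would need escaping. One way to avoid that is SQLAlchemy's `URL.create`, which takes each component separately. The sketch below uses a placeholder password chosen purely for illustration and only builds the URL; it does not connect to a database.

```python
from sqlalchemy.engine import URL

# Placeholder credentials, for illustration only
url = URL.create(
    drivername="postgresql+psycopg2",
    username="postgres",
    password="p@ss/word",  # hypothetical password containing URL-special characters
    host="localhost",
    port=5432,
    database="H+ Sport",
)

# Render with the password masked, so the string is safe to print or log
print(url.render_as_string(hide_password=True))
```

The resulting `url` object can be passed to `create_engine` in place of the f-string.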


In [145]:
# Load each DataFrame into its own table; if_exists='replace' drops and recreates a table that already exists
customer.to_sql('Customers', engine, if_exists='replace', index=False)
employee.to_sql('Employees', engine, if_exists='replace', index=False)
order.to_sql('Orders', engine, if_exists='replace', index=False)

# Release the engine's pooled connections
engine.dispose()
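
A quick way to confirm a load is to read the row count back from the table and compare it with the DataFrame. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database and a made-up sample frame so that it runs without a PostgreSQL server. Against PostgreSQL, the same query would be run through `engine` before calling `engine.dispose()`.

```python
import sqlite3
import pandas as pd

# Hypothetical sample frame standing in for a cleaned DataFrame
sample = pd.DataFrame({
    "OrderID": [1000, 1001, 1002],
    "Status": ["paid", "returned", "past due"],
})

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere
conn = sqlite3.connect(":memory:")
sample.to_sql("Orders", conn, if_exists="replace", index=False)

# Read the row count back and compare it with the DataFrame
loaded_rows = pd.read_sql('SELECT COUNT(*) AS n FROM "Orders"', conn)["n"].iloc[0]
rows_match = int(loaded_rows) == len(sample)
print(rows_match)

conn.close()
```

Note the double quotes around the table name: in PostgreSQL, mixed-case names such as `Customers`, `Employees`, and `Orders` must be quoted in SQL, otherwise they are folded to lowercase and not found.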

## Summary  

✅ Successfully extracted customer, employee, and order data from Excel files.  
✅ Performed necessary **data cleaning and transformation**, ensuring data consistency.  
✅ Loaded the cleaned datasets into a **PostgreSQL database** for further analysis.  
  
