<a href="https://colab.research.google.com/github/Shazizan/portfolio/blob/master/etl_vault_pd_bank_cust_insight.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **ETL with Pandas & Loading into the Vault via the GitHub API**

data: bank_customer_attrition_insight_data

# **Data Preparation & Extraction**

In [1]:
import pandas as pd

# URL to your CSV in GitHub (raw link)
url = "https://raw.githubusercontent.com/Shazizan/data/refs/heads/master/Bank-Customer-Attrition-Insights-Data.csv"

# Read CSV into Pandas DataFrame
df = pd.read_csv(url)

df.head()  # see first 5 rows


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
0,1,15598695,Fields,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464
1,2,15649354,Johnston,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456
2,3,15737556,Vasilyev,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377
3,4,15671610,Hooper,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350
4,5,15625092,Colombo,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425


# **Transform the Data / Clean**

- Remove unnecessary columns: RowNumber
- Handle missing values
- Create new features: AgeGroup & ActiveCardHolder

In [2]:
#Remove unnecessary columns: RowNumber
df = df.drop(columns=['RowNumber'])

In [5]:
# Handle missing values
# Check whether any column has missing (null) values.
# A count greater than 0 means that column has missing values.

df.isnull().sum()

Unnamed: 0,0
CustomerId,0
Surname,0
CreditScore,0
Geography,0
Gender,0
Age,0
Tenure,0
Balance,0
NumOfProducts,0
HasCrCard,0
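
The check above reports zero missing values for the columns shown, so no fix is needed for this dataset. If a column did contain gaps, one way to handle them is sketched below. The toy frame and the fill strategy (median for numbers, a placeholder for text) are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    "CreditScore": [619.0, None, 502.0],
    "Geography": ["France", None, "Spain"],
})

# Count missing values per column, as in the cell above.
missing = toy.isnull().sum()

# Fill numeric gaps with the column median and text gaps with a placeholder.
cleaned = toy.copy()
cleaned["CreditScore"] = cleaned["CreditScore"].fillna(cleaned["CreditScore"].median())
cleaned["Geography"] = cleaned["Geography"].fillna("Unknown")

# Total number of missing cells left after filling.
remaining = int(cleaned.isnull().sum().sum())
```

Whether to fill or drop rows depends on the column; dropping with `dropna()` may be the safer choice when the missing field is the target (`Exited`).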


In [10]:
#Create New Feature 1: AgeGroup

df['AgeGroup'] = pd.cut(df['Age'], bins=[0,25,35,50,100], labels=['Young','Adult','Senior','Old'])
df.head()   # shows first 5 rows in table format

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned,AgeGroup
0,15598695,Fields,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464,Senior
1,15649354,Johnston,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456,Senior
2,15737556,Vasilyev,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377,Senior
3,15671610,Hooper,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350,Senior
4,15625092,Colombo,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425,Senior
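
All five rows shown fall in the Senior group because `pd.cut` uses right-inclusive intervals by default, so the bins here are (0, 25], (25, 35], (35, 50] and (50, 100]. A small sketch with made-up ages on the bin edges shows where each boundary value lands:

```python
import pandas as pd

# Made-up ages chosen to sit on and just past each bin edge.
ages = pd.Series([18, 25, 26, 35, 36, 50, 51, 92])

groups = pd.cut(
    ages,
    bins=[0, 25, 35, 50, 100],
    labels=["Young", "Adult", "Senior", "Old"],
)

print(groups.tolist())
# ['Young', 'Young', 'Adult', 'Adult', 'Senior', 'Senior', 'Old', 'Old']

# Values outside the outer edges (for example 0 or 101) get no label and become NaN.
outside = pd.cut(pd.Series([0, 101]), bins=[0, 25, 35, 50, 100],
                 labels=["Young", "Adult", "Senior", "Old"])
```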


In [14]:
#Create New Feature 2: ActiveCardHolder

df['ActiveCardHolder'] = df['HasCrCard'] & df['IsActiveMember']
df.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned,AgeGroup,ActiveCardHolder
0,15598695,Fields,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464,Senior,1
1,15649354,Johnston,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456,Senior,0
2,15737556,Vasilyev,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377,Senior,0
3,15671610,Hooper,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350,Senior,0
4,15625092,Colombo,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425,Senior,1


Highlight:
- The table above now shows the two new columns: AgeGroup and ActiveCardHolder.
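
The `&` operator used for ActiveCardHolder is a bitwise AND. It behaves like a logical AND here only because both inputs are 0/1 flags. A minimal sketch on a made-up frame, with a guard that checks this assumption first:

```python
import pandas as pd

# Made-up flags covering all four 0/1 combinations.
flags = pd.DataFrame({
    "HasCrCard":      [1, 0, 1, 0],
    "IsActiveMember": [1, 1, 0, 0],
})

# Guard: the trick below is only valid when every value is 0 or 1.
all_binary = bool(flags.isin([0, 1]).all().all())

# 1 only where the customer has a card AND is an active member.
flags["ActiveCardHolder"] = flags["HasCrCard"] & flags["IsActiveMember"]

print(flags["ActiveCardHolder"].tolist())  # [1, 0, 0, 0]
```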

# **Load to Target System Using GitHub API**

## **Step 1: Install PyGithub**

- PyGithub is a Python library that wraps the GitHub API.
- It lets Python authenticate with GitHub and create or update files in our repo, with no manual upload.

In [15]:
!pip install PyGithub

Collecting PyGithub
  Downloading pygithub-2.8.1-py3-none-any.whl.metadata (3.9 kB)
Collecting pynacl>=1.4.0 (from PyGithub)
  Downloading pynacl-1.6.0-cp38-abi3-manylinux_2_34_x86_64.whl.metadata (9.4 kB)
Downloading pygithub-2.8.1-py3-none-any.whl (432 kB)
Downloading pynacl-1.6.0-cp38-abi3-manylinux_2_34_x86_64.whl (1.4 MB)
Installing collected packages: pynacl, PyGithub
Successfully installed PyGithub-2.8.1 pynacl-1.6.0


## **Step 2: Import Libraries**

- pandas → already imported; our df holds the transformed data.
- Github, Auth → the PyGithub client class and its authentication helpers, which let us use the GitHub API.

In [22]:
from github import Github, Auth
import pandas as pd

## **Step 3: Convert DataFrame to CSV string**

- The GitHub API does not accept a DataFrame directly; the file content must be sent as text.
- df.to_csv(index=False) converts the table into a CSV-formatted string and leaves out the index column.

In [23]:
csv_string = df.to_csv(index=False)
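
To see what this conversion produces, here is a small sketch on a made-up two-row frame (not the real data). Calling `to_csv` without a path returns a `str`, and reading it back through `io.StringIO` recovers the same columns and values:

```python
import io

import pandas as pd

# Made-up two-row frame standing in for the transformed df.
toy = pd.DataFrame({
    "CustomerId": [15598695, 15649354],
    "AgeGroup": ["Senior", "Senior"],
})

# With no path argument, to_csv returns the CSV as a string.
csv_text = toy.to_csv(index=False)

# First line is the header; index=False means no extra index column.
header = csv_text.splitlines()[0]

# Round trip: parse the string back into a DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
```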

## **Step 4: Authenticate to GitHub**

- GitHub needs to verify who you are before allowing file uploads.
- A personal access token (PAT) works like a password for scripts, so don’t share it or commit it to a repo.

In [24]:
# Replace with your personal access token
# Use Auth.Token for modern authentication
g = Github(auth=Auth.Token("REPLACE_WITH_PERSONAL_TOKEN"))
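
Pasting the token into a cell makes it easy to leak when the notebook is shared. One common alternative, sketched here, is to read it from an environment variable. The variable name `GITHUB_TOKEN` and both helper functions are assumptions for illustration, not part of the original notebook.

```python
import os


def read_github_token(env_var="GITHUB_TOKEN"):
    """Return the token stored in the given environment variable, or fail loudly."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your PAT first.")
    return token


def make_client(env_var="GITHUB_TOKEN"):
    """Build a PyGithub client from the token in the environment."""
    # Imported here so read_github_token stays usable without PyGithub installed.
    from github import Auth, Github

    return Github(auth=Auth.Token(read_github_token(env_var)))
```

With the variable set, `g = make_client()` would replace the hard-coded token in the cell above.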

## **Step 5: Connect to your repository**

- We need to tell PyGithub which repo we want to push the file into.

In [25]:
# get_user() with no argument returns the authenticated user, so only the repo name needs replacing
repo = g.get_user().get_repo("pipeline-vault")

## **Step 6: Create a new file in the repo**

- This sends our CSV string to GitHub as a new file. The arguments are the file path in the repo, the commit message, and the content.

In [26]:
repo.create_file("processed_data.csv", "Add processed data", csv_string)

{'content': ContentFile(path="processed_data.csv"),
 'commit': Commit(sha="71ecbf2ccbb1d529f6c6d8c23470e4d19be02fe8")}
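
Re-running the cell above is likely to fail once processed_data.csv exists, because `create_file` only creates new files. A hedged sketch of a create-or-update helper follows. `upsert_file` and its `not_found_error` parameter are hypothetical names introduced here. The sketch assumes PyGithub's `get_contents` and `update_file` methods, and that a missing path raises `UnknownObjectException`.

```python
def upsert_file(repo, path, message, content, not_found_error):
    """Create `path` in the repo, or update it if it already exists.

    `repo` is expected to behave like a PyGithub Repository, and
    `not_found_error` is the exception raised for a missing file
    (with PyGithub, github.UnknownObjectException).
    """
    try:
        existing = repo.get_contents(path)
    except not_found_error:
        # No file at this path yet: create it, as in the cell above.
        return repo.create_file(path, message, content)
    # File exists: the API needs the current blob SHA to replace it.
    return repo.update_file(path, message, content, existing.sha)


# Usage with the objects from the earlier steps (needs network access):
# from github import UnknownObjectException
# upsert_file(repo, "processed_data.csv", "Update processed data",
#             csv_string, UnknownObjectException)
```

Passing the exception class in as an argument keeps the helper free of a hard PyGithub import, which also makes it easy to try out with a stand-in repo object.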

## **Step 7: Check it worked**

- Open the repo on GitHub → the file processed_data.csv should now be listed.
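
The same check can be done from Python by reading the file back through the API and comparing it with the DataFrame. The helper `csv_matches_frame` below is a hypothetical name, and the sketch compares only shape and column names rather than every value.

```python
import io

import pandas as pd


def csv_matches_frame(raw_bytes, frame):
    """Return True if the CSV bytes parse to the same shape and columns as frame."""
    uploaded = pd.read_csv(io.BytesIO(raw_bytes))
    same_shape = uploaded.shape == frame.shape
    same_columns = list(uploaded.columns) == list(frame.columns)
    return same_shape and same_columns


# Usage against the real repo (needs network access and `repo` from Step 5):
# remote = repo.get_contents("processed_data.csv")
# print(csv_matches_frame(remote.decoded_content, df))
```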