<a href="https://colab.research.google.com/github/Shazizan/portfolio/blob/master/etl_vault_pd_stock_price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ETL with Pandas - Stock Price Data**

# **Setup Configuration & Extraction**

In [8]:
import pandas as pd

url = "https://raw.githubusercontent.com/Shazizan/data/refs/heads/master/stock_price.csv"
df = pd.read_csv(url)

In [10]:
print(df)

          Date      Close
0   2022-01-03  58.099998
1   2022-01-04  59.389999
2   2022-01-05  58.779999
3   2022-01-06  59.360001
4   2022-01-07  60.779999
5   2022-01-10  60.730000
6   2022-01-11  61.130001
7   2022-01-12  61.180000
8   2022-01-13  61.490002
9   2022-01-14  61.630001
10  2022-01-18  60.910000
11  2022-01-19  59.590000
12  2022-01-20  59.090000
13  2022-01-21  58.150002
14  2022-01-24  58.959999
15  2022-01-25  58.320000
16  2022-01-26  58.650002
17  2022-01-27  58.270000
18  2022-01-28  58.889999
19  2022-01-31  59.660000
20  2022-02-01  60.580002
21  2022-02-02  61.389999
22  2022-02-03  61.099998
23  2022-02-04  61.270000


# **Transformation (simple cleaning)**

Filter the data to keep only the January rows, since that is all we need for this exercise.

In [12]:
# Convert the column to datetime
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

Knowledge:
- Convert the column to datetime; otherwise pandas treats it as plain strings and the .dt accessor used in the filter is unavailable.
- errors='coerce' turns any invalid date into NaT (missing) instead of raising an error, so it won't break the filtering.
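
To make the effect of errors='coerce' concrete, here is a minimal sketch. The three-value series is a made-up sample (it is not taken from stock_price.csv), chosen so that one entry cannot be parsed as a date.

```python
import pandas as pd

# Hypothetical sample: two valid dates and one entry that is not a date.
raw = pd.Series(["2022-01-03", "not a date", "2022-02-01"])

parsed = pd.to_datetime(raw, errors="coerce")

# The invalid entry becomes NaT, so the column is still a datetime column.
n_invalid = int(parsed.isna().sum())
print(n_invalid)  # 1

# NaT has no month, so the comparison is False and that row is filtered out.
january_mask = parsed.dt.month == 1
print(january_mask.tolist())  # [True, False, False]
```

Counting the NaT values before filtering is a cheap way to notice bad dates that would otherwise be dropped silently.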

In [13]:
# Filter rows where month is January (month == 1)
january_data = df[df['Date'].dt.month == 1]

# Display the result
print(january_data.head())

        Date      Close
0 2022-01-03  58.099998
1 2022-01-04  59.389999
2 2022-01-05  58.779999
3 2022-01-06  59.360001
4 2022-01-07  60.779999


In [14]:
# Display all rows
print(january_data)

         Date      Close
0  2022-01-03  58.099998
1  2022-01-04  59.389999
2  2022-01-05  58.779999
3  2022-01-06  59.360001
4  2022-01-07  60.779999
5  2022-01-10  60.730000
6  2022-01-11  61.130001
7  2022-01-12  61.180000
8  2022-01-13  61.490002
9  2022-01-14  61.630001
10 2022-01-18  60.910000
11 2022-01-19  59.590000
12 2022-01-20  59.090000
13 2022-01-21  58.150002
14 2022-01-24  58.959999
15 2022-01-25  58.320000
16 2022-01-26  58.650002
17 2022-01-27  58.270000
18 2022-01-28  58.889999
19 2022-01-31  59.660000


In [19]:
# reset index (optional)
january_data = january_data.reset_index(drop=True)

In [20]:
print(january_data.head())

        Date      Close
0 2022-01-03  58.099998
1 2022-01-04  59.389999
2 2022-01-05  58.779999
3 2022-01-06  59.360001
4 2022-01-07  60.779999


Knowledge:
- pd.to_datetime() ensures pandas understands the column as dates.
- .dt.month == 1 selects rows from January of any year; this dataset only covers 2022, so no year filter is needed here.
- reset_index(drop=True) renumbers the rows from 0 and discards the old index.
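
The three steps above can be wrapped in one small helper. This is only a sketch: the function name filter_month and its parameters are hypothetical, and it also checks the year, which matters once a file spans more than one year. The 2023 row in the sample is invented for illustration, with a placeholder Close value.

```python
import pandas as pd

def filter_month(df, year, month, date_col="Date"):
    """Return the rows of df whose date_col falls in the given year and month."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    mask = (out[date_col].dt.year == year) & (out[date_col].dt.month == month)
    return out.loc[mask].reset_index(drop=True)

# Two rows from the dataset plus one invented 2023 row (placeholder Close).
sample = pd.DataFrame({
    "Date": ["2022-01-31", "2022-02-01", "2023-01-03"],
    "Close": [59.66, 60.580002, 0.0],
})

jan_2022 = filter_month(sample, 2022, 1)
print(len(jan_2022))  # 1
```

With only a month check, the invented 2023 row would have been kept as well.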

# **Load into the target system**

The target system for this experiment is a GitHub repo (pipeline-vault).

In [21]:
# Save locally
january_data.to_csv("january_data.csv", index=False)

# **Setup Configuration for the target System**

The cells below run shell commands in Colab (lines prefixed with !) and IPython magics (lines prefixed with %).

In [24]:
# Set Git identity
!git config --global user.name "xxxxxxxx"
!git config --global user.email "xxxxxxxxxxx@gmail.com"

**Generate a Personal Access Token (PAT)**

Go to GitHub → Settings → Developer settings → Personal access tokens → Tokens (classic) → Generate new token.
- Set an expiration (your choice, e.g., 30 days).
- Select the repo scope (for full repo access).
- Click Generate token and copy it.
- Treat this token like a password: it gives access to your GitHub account.
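
One way to keep the token out of the notebook text is to read it from the environment and build the clone URL in code. The sketch below assumes the token is stored in an environment variable named GITHUB_TOKEN (a name chosen here for illustration) and uses a hypothetical helper, build_clone_url; it produces the same URL form as the clone cell.

```python
import os
from urllib.parse import quote

def build_clone_url(user, repo, token):
    """Build an HTTPS clone URL of the form https://user:token@github.com/user/repo.git."""
    # Percent-encode the credentials so special characters cannot break the URL.
    return (
        f"https://{quote(user, safe='')}:{quote(token, safe='')}"
        f"@github.com/{user}/{repo}.git"
    )

# Assumption: the token was exported as GITHUB_TOKEN before running this cell.
token = os.environ.get("GITHUB_TOKEN", "")
url = build_clone_url("Shazizan", "pipeline-vault", token)

# Avoid printing the URL, since it contains the token.
print(url.endswith("@github.com/Shazizan/pipeline-vault.git"))  # True
```

Git stores the clone URL, token included, in the cloned repo's .git/config, so the expiration set on the token still matters.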

In [25]:
# Authenticate with GitHub by embedding the token in the repo URL
!git clone https://Shazizan:<TOKEN-TO-BE-REPLACED>@github.com/Shazizan/pipeline-vault.git

Cloning into 'pipeline-vault'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects:  90% (9/10)

In [26]:
# Copy the CSV into the cloned repo, then commit and push
!cp january_data.csv pipeline-vault/
%cd pipeline-vault
!git add january_data.csv
!git commit -m "Add January 2022 data"
!git push origin main

/content/pipeline-vault/pipeline-vault
[main f79d84b] Add January 2022 data
 1 file changed, 21 insertions(+)
 create mode 100644 january_data.csv
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 505 bytes | 505.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/Shazizan/pipeline-vault.git
   3c130bd..f79d84b  main -> main


# **Check the Git push in Colab**

In [27]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean


# **Verify at the Target System**

- Go to your target repo (pipeline-vault) on GitHub in a browser.
- Check that january_data.csv is there.
- Confirm that the commit message matches your push, e.g., "Add January 2022 data"
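
The file contents can also be checked in code, for example on the local copy before pushing or on a copy downloaded from the repo. The sketch below uses a hypothetical helper, check_january_csv, and assumes the Date column is written as YYYY-MM-DD; the two sample rows are taken from the January data shown earlier.

```python
import csv
import io
from datetime import date

def check_january_csv(csv_text):
    """Return (row_count, all_january) for CSV text with a Date column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    all_january = all(date.fromisoformat(r["Date"]).month == 1 for r in rows)
    return len(rows), all_january

# Two rows from the January data, in the layout written by to_csv(index=False).
sample = "Date,Close\n2022-01-03,58.099998\n2022-01-31,59.66\n"

n_rows, ok = check_january_csv(sample)
print(n_rows, ok)  # 2 True
```

For the real file, pass the text of january_data.csv to the helper; the January data shown earlier has 20 rows.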