# Data Pipelines with Python

I am developing an automated data pipeline that can extract billing data from multiple sources and transform it into a structured format for efficient analysis and revenue reporting.

## Pre-requisites

In [3]:
# Pre-requisite 1
# ---
# Importing pandas library for data manipulation
import pandas as pd
# Importing numpy library for scientific computations
import numpy as np

## 1. Data Exploration

### Load and Review the datasets

In [4]:
# Dataset url = https://bit.ly/416WE1X
# Dataset1 - subscriptions
subscription_df = pd.read_csv('https://raw.githubusercontent.com/GMwangi3/DE_Week7/main/dataset1.csv')
subscription_df.head(5)

Unnamed: 0,customer_id,date_of_purchase,total_amount_billed,payment_status,payment_method,promo_code,country_of_purchase
0,101,04/01/2021,100,paid,credit card,PROMO1,USA
1,102,04/02/2021,200,paid,bank transfer,PROMO2,USA
2,103,04/02/2021,50,overdue,credit card,,UK
3,104,04/03/2021,75,disputed,e-wallet,PROMO3,UK
4,105,04/04/2021,125,paid,credit card,PROMO4,USA


In [5]:
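# Dataset2 - Payments
# NOTE: this URL points to dataset1.csv (the subscriptions file), so payment_df is a copy of subscription_df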
payment_df = pd.read_csv('https://raw.githubusercontent.com/GMwangi3/DE_Week7/main/dataset1.csv')
payment_df.head(5)

Unnamed: 0,customer_id,date_of_purchase,total_amount_billed,payment_status,payment_method,promo_code,country_of_purchase
0,101,04/01/2021,100,paid,credit card,PROMO1,USA
1,102,04/02/2021,200,paid,bank transfer,PROMO2,USA
2,103,04/02/2021,50,overdue,credit card,,UK
3,104,04/03/2021,75,disputed,e-wallet,PROMO3,UK
4,105,04/04/2021,125,paid,credit card,PROMO4,USA


In [6]:
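# Dataset3 - Refunds
# NOTE: this URL points to dataset1.csv (the subscriptions file), so refund_df is a copy of subscription_df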
refund_df = pd.read_csv('https://raw.githubusercontent.com/GMwangi3/DE_Week7/main/dataset1.csv')
refund_df.head(5)

Unnamed: 0,customer_id,date_of_purchase,total_amount_billed,payment_status,payment_method,promo_code,country_of_purchase
0,101,04/01/2021,100,paid,credit card,PROMO1,USA
1,102,04/02/2021,200,paid,bank transfer,PROMO2,USA
2,103,04/02/2021,50,overdue,credit card,,UK
3,104,04/03/2021,75,disputed,e-wallet,PROMO3,UK
4,105,04/04/2021,125,paid,credit card,PROMO4,USA
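
The three previews above are identical, so a quick sanity check can show whether each source really came from a different file. The following is a minimal sketch using small inline frames in place of the real CSVs; `find_identical_frames` is a hypothetical helper and the `refund_amount` column is an assumption made for illustration.

```python
import itertools
import pandas as pd

def find_identical_frames(frames):
    """Return pairs of names whose DataFrames have identical contents."""
    return [
        (name_a, name_b)
        for (name_a, df_a), (name_b, df_b) in itertools.combinations(frames.items(), 2)
        if df_a.equals(df_b)
    ]

# Illustrative data only; in the notebook these would be the loaded CSVs
subs = pd.DataFrame({"customer_id": [101, 102], "total_amount_billed": [100, 200]})
pays = subs.copy()  # simulates loading the same file twice
refs = pd.DataFrame({"customer_id": [101], "refund_amount": [25]})  # hypothetical column

identical = find_identical_frames(
    {"subscriptions": subs, "payments": pays, "refunds": refs}
)
print(identical)  # [('subscriptions', 'payments')]
```

An empty list would mean every source is distinct; any pair that appears is worth checking against its URL.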


## 2. Data Preparation

### Get column data types

In [7]:
# Function that prints the data types of columns for a given dataframe
def get_datatypes(df):
  df_name =[x for x in globals() if globals()[x] is df][0]
  print("\n" + df_name)
  print("=================")
  print(df.dtypes)

df_list = [subscription_df, payment_df, refund_df]

for df in df_list:
  get_datatypes(df)


subscription_df
customer_id             int64
date_of_purchase       object
total_amount_billed     int64
payment_status         object
payment_method         object
promo_code             object
country_of_purchase    object
dtype: object

payment_df
customer_id             int64
date_of_purchase       object
total_amount_billed     int64
payment_status         object
payment_method         object
promo_code             object
country_of_purchase    object
dtype: object

refund_df
customer_id             int64
date_of_purchase       object
total_amount_billed     int64
payment_status         object
payment_method         object
promo_code             object
country_of_purchase    object
dtype: object
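
Looking up a DataFrame's name through `globals()` works in this notebook but depends on variable definition order. One alternative, sketched here under the assumption that the frames are kept in a dictionary keyed by name, is to pass the names explicitly; `describe_dtypes` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def describe_dtypes(frames):
    """Print the dtypes of each named frame and return them as one table."""
    for name, df in frames.items():
        print(f"\n{name}")
        print("=" * 17)
        print(df.dtypes)
    # One column per frame, one row per column name, for side-by-side comparison
    return pd.DataFrame(
        {name: df.dtypes.astype(str) for name, df in frames.items()}
    )

# Illustrative data only
sample = pd.DataFrame({"customer_id": [101], "date_of_purchase": ["04/01/2021"]})
dtype_table = describe_dtypes(
    {"subscription_df": sample, "payment_df": sample.copy()}
)
```

The returned table makes it easy to spot a column whose type differs between sources.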


### Missing Values

In [8]:
# Function that prints the number of missing values per column for a given dataframe
def get_nulls(df):
  df_name =[x for x in globals() if globals()[x] is df][0]
  print("\n" + df_name)
  print("=================\n")
  print(df.isna().sum())

for df in df_list:
  get_nulls(df)


subscription_df

customer_id            0
date_of_purchase       0
total_amount_billed    0
payment_status         0
payment_method         0
promo_code             3
country_of_purchase    0
dtype: int64

payment_df

customer_id            0
date_of_purchase       0
total_amount_billed    0
payment_status         0
payment_method         0
promo_code             3
country_of_purchase    0
dtype: int64

refund_df

customer_id            0
date_of_purchase       0
total_amount_billed    0
payment_status         0
payment_method         0
promo_code             3
country_of_purchase    0
dtype: int64


### Duplicate data

In [9]:
# Function that prints number of duplicate records
def get_duplicates(df):
  df_name =[x for x in globals() if globals()[x] is df][0]
  print("\n" + df_name)
  print("=================\n")
  print(df.duplicated().sum())

for df in df_list:
  get_duplicates(df)


subscription_df

0

payment_df

0

refund_df

0


The `promo_code` column has 3 missing values in each dataset. No other column has missing values, and there are no duplicate rows.
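
A missing promo code most likely means that no promotion was applied, so dropping those rows would discard valid bills. A minimal sketch of one way to handle them, assuming a placeholder label is acceptable downstream; `NO_PROMO` and the `used_promo` column are hypothetical names introduced for illustration.

```python
import pandas as pd

NO_PROMO = "NO_PROMO"  # hypothetical placeholder label

# Illustrative rows modelled on the subscription data
df = pd.DataFrame(
    {"customer_id": [103, 104, 106], "promo_code": [None, "PROMO3", None]}
)

# Record whether a promo was used before overwriting the missing values
df["used_promo"] = df["promo_code"].notna()
df["promo_code"] = df["promo_code"].fillna(NO_PROMO)
print(df)
```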

## 3. Data Transformation

### Date Formatting

In [16]:

# Convert the date/time columns
subscription_df['date_of_purchase'] = pd.to_datetime(subscription_df['date_of_purchase'])
payment_df['date_of_payment'] = pd.to_datetime(payment_df['date_of_payment'])
refund_df['date_of_refund'] = pd.to_datetime(refund_df['date_of_refund'])

KeyError: 'date_of_payment'

Only `subscription_df['date_of_purchase']` is converted. The second line fails because all three DataFrames were loaded from `dataset1.csv`, which has no `date_of_payment` or `date_of_refund` column.
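
The conversion can be made more defensive by checking that the column exists and by stating the date format explicitly (`infer_datetime_format` is deprecated as of pandas 2.0). The sketch below assumes the `MM/DD/YYYY` layout seen in `date_of_purchase`; `convert_date_column` is a hypothetical helper.

```python
import pandas as pd

def convert_date_column(df, column, fmt="%m/%d/%Y"):
    """Return a copy of df with `column` parsed as datetime.

    Raises a KeyError that lists the available columns if `column` is absent.
    """
    if column not in df.columns:
        raise KeyError(f"{column!r} not found; available columns: {list(df.columns)}")
    out = df.copy()
    out[column] = pd.to_datetime(out[column], format=fmt)
    return out

# Illustrative rows modelled on the subscription data
subscriptions = pd.DataFrame(
    {"customer_id": [101, 102], "date_of_purchase": ["04/01/2021", "04/02/2021"]}
)
converted = convert_date_column(subscriptions, "date_of_purchase")
print(converted["date_of_purchase"].tolist())
```

Because the function works on a copy, a failed conversion leaves the original frame untouched.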

### Merging the Datasets

In [11]:
# Merge the datasets on their shared keys
merge_keys = ['customer_id', 'payment_status', 'payment_method']
subscription_payment_merge = pd.merge(subscription_df, payment_df, how='left', on=merge_keys)
final_merged_df = pd.merge(subscription_payment_merge, refund_df, how='left', on='customer_id')

final_merged_df

Unnamed: 0,customer_id,date_of_purchase_x,total_amount_billed_x,payment_status_x,payment_method_x,promo_code_x,country_of_purchase_x,date_of_purchase_y,total_amount_billed_y,promo_code_y,country_of_purchase_y,date_of_purchase,total_amount_billed,payment_status_y,payment_method_y,promo_code,country_of_purchase
0,101,2021-04-01,100,paid,credit card,PROMO1,USA,04/01/2021,100,PROMO1,USA,04/01/2021,100,paid,credit card,PROMO1,USA
1,102,2021-04-02,200,paid,bank transfer,PROMO2,USA,04/02/2021,200,PROMO2,USA,04/02/2021,200,paid,bank transfer,PROMO2,USA
2,103,2021-04-02,50,overdue,credit card,,UK,04/02/2021,50,,UK,04/02/2021,50,overdue,credit card,,UK
3,104,2021-04-03,75,disputed,e-wallet,PROMO3,UK,04/03/2021,75,PROMO3,UK,04/03/2021,75,disputed,e-wallet,PROMO3,UK
4,105,2021-04-04,125,paid,credit card,PROMO4,USA,04/04/2021,125,PROMO4,USA,04/04/2021,125,paid,credit card,PROMO4,USA
5,106,2021-04-05,150,paid,credit card,,UK,04/05/2021,150,,UK,04/05/2021,150,paid,credit card,,UK
6,107,2021-04-06,75,overdue,e-wallet,PROMO5,USA,04/06/2021,75,PROMO5,USA,04/06/2021,75,overdue,e-wallet,PROMO5,USA
7,108,2021-04-06,100,overdue,bank transfer,PROMO6,USA,04/06/2021,100,PROMO6,USA,04/06/2021,100,overdue,bank transfer,PROMO6,USA
8,109,2021-04-07,50,paid,bank transfer,,UK,04/07/2021,50,,UK,04/07/2021,50,paid,bank transfer,,UK
9,110,2021-04-07,25,overdue,credit card,PROMO7,USA,04/07/2021,25,PROMO7,USA,04/07/2021,25,overdue,credit card,PROMO7,USA
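
The default `_x` and `_y` suffixes in the merged output make it hard to tell which source a column came from. A minimal sketch of a more explicit merge on illustrative data: `suffixes` names the sources, `validate` raises if the keys are not unique on both sides, and `indicator` adds a `_merge` column showing where each row matched. The suffix labels are assumptions chosen for readability.

```python
import pandas as pd

# Illustrative rows; both frames share the key columns
subscriptions = pd.DataFrame(
    {
        "customer_id": [101, 103],
        "payment_status": ["paid", "overdue"],
        "total_amount_billed": [100, 50],
    }
)
payments = subscriptions.copy()

merged = subscriptions.merge(
    payments,
    how="left",
    on=["customer_id", "payment_status"],
    suffixes=("_subscription", "_payment"),  # hypothetical labels
    validate="one_to_one",  # fail fast on duplicated keys
    indicator=True,  # adds a `_merge` column
)
print(merged.columns.tolist())
```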


### Unpaid Bills

In [12]:
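# After the two merges, `payment_status` exists only with the `_x` and `_y` suffixes,
# so filter on the subscription-side column to keep the unpaid bills
final_merged_df = final_merged_df[final_merged_df['payment_status_x'] != 'paid']
final_merged_df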

## 4. Data Loading

### Output to csv file

In [13]:
final_merged_df.to_csv('unpaid_bills.csv', header=True, index=False)
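
After writing the file, reading it back is a cheap way to confirm that the export is complete. A minimal sketch, assuming a temporary directory stands in for the real output location and using illustrative rows:

```python
import os
import tempfile
import pandas as pd

# Illustrative unpaid bills
unpaid = pd.DataFrame(
    {
        "customer_id": [103, 104],
        "payment_status": ["overdue", "disputed"],
        "total_amount_billed": [50, 75],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "unpaid_bills.csv")
    unpaid.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)  # read back before the directory is removed

rows_match = len(reloaded) == len(unpaid)
columns_match = list(reloaded.columns) == list(unpaid.columns)
print(rows_match, columns_match)
```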

## 5. Automation

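I will configure a cron job to run the notebook every day at 4 am. Cron cannot execute an `.ipynb` file directly, so the entry calls `jupyter nbconvert` to run it:

0 4 * * * jupyter nbconvert --to notebook --execute --inplace /path/to/notebook.ipynb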
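
A scheduler is often easier to point at a plain Python script than at a notebook. The sketch below shows how the steps could be arranged as extract, transform, and load functions. It is an illustration under assumptions, not the notebook's implementation: the function names are hypothetical, it covers only the subscriptions source, and it reads from an in-memory CSV instead of the real files.

```python
import io
import pandas as pd

def extract(source):
    """Read one billing CSV from a path, URL, or file-like object."""
    return pd.read_csv(source)

def transform(subscriptions):
    """Parse purchase dates and keep only the bills that are not paid."""
    out = subscriptions.copy()
    out["date_of_purchase"] = pd.to_datetime(out["date_of_purchase"], format="%m/%d/%Y")
    return out[out["payment_status"] != "paid"].reset_index(drop=True)

def load(df, target):
    """Write the result as CSV to a path or file-like object."""
    df.to_csv(target, index=False)

def run_pipeline(source, target):
    """Run extract, transform, and load in order and return the result."""
    unpaid = transform(extract(source))
    load(unpaid, target)
    return unpaid

# Illustrative in-memory input; a real run would pass file paths or URLs
sample_csv = io.StringIO(
    "customer_id,date_of_purchase,total_amount_billed,payment_status\n"
    "101,04/01/2021,100,paid\n"
    "103,04/02/2021,50,overdue\n"
    "104,04/03/2021,75,disputed\n"
)
buffer = io.StringIO()
result = run_pipeline(sample_csv, buffer)
print(buffer.getvalue())
```

Saved as a script, the same functions could be called from a cron entry with the Python interpreter, which avoids depending on a Jupyter installation at run time.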