# Customer transactions analysis
This notebook performs an exploratory analysis of the dataset found under `./dataset/Sample_Dataset.csv`.

## Data Gathering
Load the local CSV file into a pandas `DataFrame`.

In [93]:
import pandas as pd

df_transactions = pd.read_csv("dataset/Sample_Dataset.csv")
print(
    f"The transactions data set has {len(df_transactions)} records with {df_transactions.shape[1]} variables."
)
df_transactions.head()

The transactions data set has 10000 records with 9 variables.


Unnamed: 0,Customer ID,Transaction Date,Brand Name,Sector,Gender,Date of Birth,Country,No of Scans,Amount Spent
0,68844730,2022-01-02,Festina Group,Jewellery & Watches,,,ES,1,241.8
1,57088234,2022-01-02,Skechers,Shoes,F,1971-12-14,ES,1,76.48
2,50612353,2022-01-02,North Sails,Men's Apparel,M,1978-07-04,ES,1,24.5
3,36233318,2022-01-02,Converse,Shoes,,,ES,1,75.0
4,36256323,2022-01-02,Clarks,Shoes,F,1983-11-17,ES,1,21.25


In [94]:
df_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer ID       10000 non-null  int64  
 1   Transaction Date  10000 non-null  object 
 2   Brand Name        10000 non-null  object 
 3   Sector            10000 non-null  object 
 4   Gender            4223 non-null   object 
 5   Date of Birth     3955 non-null   object 
 6   Country           9328 non-null   object 
 7   No of Scans       10000 non-null  int64  
 8   Amount Spent      10000 non-null  float64
dtypes: float64(1), int64(2), object(6)
memory usage: 703.3+ KB
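
The `info()` output shows that `Gender`, `Date of Birth`, and `Country` have fewer than 10000 non-null values. As a minimal sketch, the share of missing values per column could be quantified as follows. The `sample` frame is a hypothetical miniature of the table, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the transactions table (not the real data).
sample = pd.DataFrame(
    {
        "Customer ID": [1, 2, 3, 4],
        "Gender": [None, "F", "M", None],
        "Country": ["ES", "ES", None, "ES"],
    }
)

# isna() marks missing cells; the mean of a boolean column is the
# fraction of True values, i.e. the share of missing entries.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `df_transactions`, the same expression would give the missing share for each of the nine columns.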


## Data Assessment and Cleaning
In this part, we'll perform parallel assessment and cleaning steps on the data set.

### Data types
The `Transaction Date` and `Date of Birth` columns should be cast to `datetime`.

In [95]:
df_transactions["Transaction Date"] = pd.to_datetime(
    df_transactions["Transaction Date"]
)
# Some birth dates are not in a valid date format; those are set to NaT
df_transactions["Date of Birth"] = pd.to_datetime(
    df_transactions["Date of Birth"], errors="coerce"
)
df_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Customer ID       10000 non-null  int64         
 1   Transaction Date  10000 non-null  datetime64[ns]
 2   Brand Name        10000 non-null  object        
 3   Sector            10000 non-null  object        
 4   Gender            4223 non-null   object        
 5   Date of Birth     3953 non-null   datetime64[ns]
 6   Country           9328 non-null   object        
 7   No of Scans       10000 non-null  int64         
 8   Amount Spent      10000 non-null  float64       
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 703.3+ KB


In [96]:
df_transactions.head()

Unnamed: 0,Customer ID,Transaction Date,Brand Name,Sector,Gender,Date of Birth,Country,No of Scans,Amount Spent
0,68844730,2022-01-02,Festina Group,Jewellery & Watches,,NaT,ES,1,241.8
1,57088234,2022-01-02,Skechers,Shoes,F,1971-12-14,ES,1,76.48
2,50612353,2022-01-02,North Sails,Men's Apparel,M,1978-07-04,ES,1,24.5
3,36233318,2022-01-02,Converse,Shoes,,NaT,ES,1,75.0
4,36256323,2022-01-02,Clarks,Shoes,F,1983-11-17,ES,1,21.25
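
The non-null count of `Date of Birth` drops from 3955 to 3953 after the conversion, so two values were coerced to `NaT`. The sketch below shows one way such values could be located, by comparing the raw column with the parsed one. The `raw` series is hypothetical example data.

```python
import pandas as pd

# Hypothetical raw values: two valid dates, one missing, one malformed.
raw = pd.Series(["1971-12-14", None, "not a date", "1983-11-17"])

# errors="coerce" turns anything that cannot be parsed into NaT.
parsed = pd.to_datetime(raw, errors="coerce")

# Present in the raw data but missing after parsing: these were coerced.
coerced = raw.notna() & parsed.isna()
print(int(coerced.sum()))
print(raw[coerced].tolist())
```

On the real data, this comparison would have to be run on the raw column before it is overwritten by the parsed one.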


### Variable names
To simplify the SQL queries, it's better to remove the spaces from the variable names.

I'll replace the spaces with `_`.

In [97]:
from typing import Dict

renaming_map: Dict[str, str] = {}
for column in df_transactions:
    renaming_map[str(column)] = str(column).replace(" ", "_")

df_transactions.rename(columns=renaming_map, inplace=True)
df_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Customer_ID       10000 non-null  int64         
 1   Transaction_Date  10000 non-null  datetime64[ns]
 2   Brand_Name        10000 non-null  object        
 3   Sector            10000 non-null  object        
 4   Gender            4223 non-null   object        
 5   Date_of_Birth     3953 non-null   datetime64[ns]
 6   Country           9328 non-null   object        
 7   No_of_Scans       10000 non-null  int64         
 8   Amount_Spent      10000 non-null  float64       
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 703.3+ KB
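
As an alternative to building a renaming map in a loop, the column index can be rewritten in one step with the vectorised string accessor. This is only a sketch of that variant on a hypothetical empty frame. It assumes all column labels are strings.

```python
import pandas as pd

# Hypothetical frame with the same style of column names.
sample = pd.DataFrame(columns=["Customer ID", "Date of Birth", "Sector"])

# Replace every space in every column label; regex=False treats " " literally.
sample.columns = sample.columns.str.replace(" ", "_", regex=False)
print(list(sample.columns))
```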


### Date of Birth

In [98]:
df_transactions.describe()

Unnamed: 0,Customer_ID,Transaction_Date,Date_of_Birth,No_of_Scans,Amount_Spent
count,10000.0,10000,3953,10000.0,10000.0
mean,56419790.0,2022-07-23 00:56:00.960000,1981-02-14 19:10:23.830002432,1.1075,65.923188
min,33744040.0,2022-01-02 00:00:00,1778-05-09 00:00:00,1.0,0.0
25%,44991340.0,2022-04-29 00:00:00,1973-12-17 00:00:00,1.0,24.97
50%,60740390.0,2022-07-28 00:00:00,1980-03-25 00:00:00,1.0,46.87
75%,68128650.0,2022-10-29 00:00:00,1988-10-30 00:00:00,1.0,82.935
max,74569050.0,2022-12-31 00:00:00,2022-10-30 00:00:00,8.0,2365.0
std,12839850.0,,,0.363809,76.506243


Some dates of birth are out of range (some people appear to have been born in 2022 and others in 1778).

Let's keep only the people born between 1 January 1912 and 1 January 2008 (roughly 14 to 110 years old at the time of the transaction). Records without a birth date are kept as well.

In [99]:
import datetime as dt

# Keep only the records without a birth date or with a valid birth date
df_transactions = df_transactions[
    (df_transactions["Date_of_Birth"].isnull())
    | (
        (
            df_transactions["Date_of_Birth"]
            >= dt.datetime(year=1912, month=1, day=1, hour=0, minute=0)
        )
        & (
            df_transactions["Date_of_Birth"]
            <= dt.datetime(year=2008, month=1, day=1, hour=0, minute=0)
        )
    )
]
df_transactions["Date_of_Birth"].describe()

count                             3935
mean     1981-01-03 22:24:29.275730624
min                1942-11-24 00:00:00
25%                1973-12-09 00:00:00
50%                1980-03-02 00:00:00
75%                1988-08-20 00:00:00
max                2006-06-02 00:00:00
Name: Date_of_Birth, dtype: object

In [100]:
df_transactions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9982 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Customer_ID       9982 non-null   int64         
 1   Transaction_Date  9982 non-null   datetime64[ns]
 2   Brand_Name        9982 non-null   object        
 3   Sector            9982 non-null   object        
 4   Gender            4207 non-null   object        
 5   Date_of_Birth     3935 non-null   datetime64[ns]
 6   Country           9313 non-null   object        
 7   No_of_Scans       9982 non-null   int64         
 8   Amount_Spent      9982 non-null   float64       
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 779.8+ KB
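
The fixed birth-date window only approximates an age range, because the transactions are spread over the whole of 2022. A possible refinement, sketched here on hypothetical data, is to compute each customer's age at the transaction date and filter on that. The helper `age_in_years` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows: a plausible adult, a 2022 birth date, a missing
# birth date, and someone whose birthday falls one day after the purchase.
sample = pd.DataFrame(
    {
        "Transaction_Date": pd.to_datetime(["2022-06-01"] * 4),
        "Date_of_Birth": pd.to_datetime(
            ["1980-03-25", "2022-01-15", None, "1980-06-02"]
        ),
    }
)


def age_in_years(birth: pd.Series, when: pd.Series) -> pd.Series:
    """Completed years between two datetime Series (NaN if birth is missing)."""
    years = when.dt.year - birth.dt.year
    # Subtract one year when the birthday has not yet occurred in that year.
    before_birthday = (when.dt.month < birth.dt.month) | (
        (when.dt.month == birth.dt.month) & (when.dt.day < birth.dt.day)
    )
    return years - before_birthday.astype(int)


age = age_in_years(sample["Date_of_Birth"], sample["Transaction_Date"])

# Keep rows with no birth date or with an age between 14 and 110 inclusive.
keep = age.isna() | age.between(14, 110)
print(sample[keep])
```

This keeps the same intent as the fixed window (14 to 110 years old) while removing the dependence on when in 2022 the transaction took place.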


## Export to SQL DB
Once the data frame is cleaned and ready for the SQL analysis (done in the `sql` folder), it is sent to a MySQL database.

In [101]:
from pathlib import Path
import json

# Load secrets
secret_file: Path = Path.home() / ".secrets.json"
with open(secret_file, encoding="utf8") as secrets_file:
    mysql_secrets: Dict[str, str] = json.load(secrets_file)["mysql"]

In [102]:
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database, drop_database
from urllib.parse import quote_plus

user: str = mysql_secrets["user"]
password: str = quote_plus(mysql_secrets["password"])
database_name = "customer_transactions"
engine = create_engine(
    f"mysql+pymysql://{user}:{password}@localhost/{database_name}"
)

# Drop database if it already exists
if database_exists(engine.url):
    drop_database(engine.url)
create_database(engine.url)

df_transactions.to_sql(
    name="transactions", index=False, con=engine, if_exists="replace"
)

9982
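
The password is passed through `quote_plus` before it is placed in the connection URL. The sketch below illustrates why: characters such as `@`, `/`, and `:` have a structural meaning in a URL and would otherwise be misread as separators. The credentials are hypothetical and used for illustration only.

```python
from urllib.parse import quote_plus, unquote_plus

# Hypothetical credentials, for illustration only.
user = "analyst"
password = "p@ss/word:1"

# Without escaping, the "@" inside the password is ambiguous with the
# "@" that separates the credentials from the host.
unsafe_url = f"mysql+pymysql://{user}:{password}@localhost/customer_transactions"

# quote_plus percent-encodes the reserved characters in the password.
safe_url = (
    f"mysql+pymysql://{user}:{quote_plus(password)}@localhost/customer_transactions"
)
print(unsafe_url)
print(safe_url)

# The encoding is reversible, so the original password is not lost.
print(unquote_plus(quote_plus(password)) == password)
```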

## Save cleaned data to CSV
Let's save the cleaned data to CSV so it can be used in Tableau.

In [103]:
df_transactions.to_csv("dataset/cleaned_data_set.csv", index=False)
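
One point to keep in mind is that CSV stores dates as plain text, so the `datetime` dtypes are not preserved in the file. If the cleaned file is later read back into pandas, the date columns would need to be parsed again. Here is a minimal sketch using an in-memory buffer and a hypothetical two-row frame instead of the real file:

```python
import io

import pandas as pd

# Hypothetical two-row frame with the cleaned column names.
sample = pd.DataFrame(
    {
        "Customer_ID": [1, 2],
        "Transaction_Date": pd.to_datetime(["2022-01-02", "2022-01-03"]),
        "Date_of_Birth": pd.to_datetime(["1971-12-14", None]),
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Reading back without parse_dates leaves the dates as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime columns; empty cells become NaT.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Transaction_Date", "Date_of_Birth"])

print(plain.dtypes)
print(restored.dtypes)
```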