## **Loading Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

## **Loading Dataset**

In [4]:
df=pd.read_excel("Data/transactions.xlsx")
df.head()

Unnamed: 0,Transaction_ID,Date,Customer_Name,Total_Items,Amount($),Payment_Method,City,Store_Type,Discount_Applied,Customer_Category,Season,Promotion
0,1000667075,2022-09-12 17:40:23,David King,5,30.98,Debit Card,Chicago,Warehouse Club,True,Teenager,Fall,BOGO (Buy One Get One)
1,1000156022,2022-01-20 23:03:20,Michael Williamson,3,23.29,Credit Card,Boston,Warehouse Club,True,Homemaker,Winter,Discount on Selected Items
2,1000681674,2022-10-15 07:49:59,Chelsea Garza,7,25.62,Debit Card,Chicago,Pharmacy,False,Teenager,Fall,Discount on Selected Items
3,1000692089,2024-04-05 09:39:58,Scott Lopez,5,14.64,Mobile Payment,Atlanta,Pharmacy,False,Homemaker,Summer,Discount on Selected Items
4,1000328702,2021-05-28 04:16:54,Crystal Adams,4,62.27,Credit Card,Miami,Convenience Store,False,Retiree,Summer,


## **Data Pre-Processing**

---

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38500 entries, 0 to 38499
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Transaction_ID     38500 non-null  int64  
 1   Date               38500 non-null  object 
 2   Customer_Name      38500 non-null  object 
 3   Total_Items        38500 non-null  int64  
 4   Amount($)          38500 non-null  float64
 5   Payment_Method     38500 non-null  object 
 6   City               38500 non-null  object 
 7   Store_Type         38500 non-null  object 
 8   Discount_Applied   38500 non-null  bool   
 9   Customer_Category  38500 non-null  object 
 10  Season             38500 non-null  object 
 11  Promotion          25529 non-null  object 
dtypes: bool(1), float64(1), int64(2), object(8)
memory usage: 3.3+ MB


- The dataset contains 38,500 entries across 12 columns, with the `Promotion` column having 25,529 non-null entries, indicating missing values. All other columns are complete. 
- The `Date` column is currently of type `object`, which should be converted to `datetime` for proper date handling. 
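
Before converting, it can help to confirm that every value in `Date` actually parses. The sketch below is illustrative only: it uses a small made-up frame (`sample`) rather than the real dataset, and it assumes the timestamps follow the `YYYY-MM-DD HH:MM:SS` layout seen in the preview above.

```python
import pandas as pd

# Toy stand-in for the `Date` column (hypothetical values);
# the last entry is deliberately malformed.
sample = pd.DataFrame(
    {"Date": ["2022-09-12 17:40:23", "2022-01-20 23:03:20", "not a date"]}
)

# errors="coerce" turns unparseable strings into NaT instead of raising,
# so bad rows can be counted before the column is overwritten.
parsed = pd.to_datetime(
    sample["Date"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

n_unparseable = int(parsed.isna().sum())
print(parsed.dtype)
print(f"Unparseable dates: {n_unparseable}")
```

If the count is zero on the real data, the plain `pd.to_datetime` conversion is safe to apply.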

---

### Handling the Date Data Type

In [None]:
df["Date"]=pd.to_datetime(df.Date) # Converting 'Date' column to datetime dtype

In [None]:
df["Date"].dtype  # Checking the dtype

dtype('<M8[ns]')

In [None]:
str(df["Date"].dtype)  # Checking the dtype as a readable string

'datetime64[ns]'

In [11]:
df.head(3)

Unnamed: 0,Transaction_ID,Date,Customer_Name,Total_Items,Amount($),Payment_Method,City,Store_Type,Discount_Applied,Customer_Category,Season,Promotion
0,1000667075,2022-09-12 17:40:23,David King,5,30.98,Debit Card,Chicago,Warehouse Club,True,Teenager,Fall,BOGO (Buy One Get One)
1,1000156022,2022-01-20 23:03:20,Michael Williamson,3,23.29,Credit Card,Boston,Warehouse Club,True,Homemaker,Winter,Discount on Selected Items
2,1000681674,2022-10-15 07:49:59,Chelsea Garza,7,25.62,Debit Card,Chicago,Pharmacy,False,Teenager,Fall,Discount on Selected Items


---

### Checking Missing Values 

In [None]:
df.isnull().sum()

Transaction_ID           0
Date                     0
Customer_Name            0
Total_Items              0
Amount($)                0
Payment_Method           0
City                     0
Store_Type               0
Discount_Applied         0
Customer_Category        0
Season                   0
Promotion            12971
dtype: int64

The `Promotion` column has 12,971 missing values, about 33.7% of the 38,500 rows. Every other column is complete, so the next step is to inspect `Promotion` and decide how to handle the gaps.
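
The share of missing values can be computed directly rather than read off the counts. This is a minimal sketch on a made-up frame (`toy`), not the real dataset; the final lines simply redo the arithmetic for the counts reported above.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (hypothetical rows).
toy = pd.DataFrame(
    {
        "City": ["Chicago", "Boston", "Miami"],
        "Promotion": ["BOGO (Buy One Get One)", np.nan, np.nan],
    }
)

# isnull().mean() gives the fraction of missing values per column.
missing_pct = toy.isnull().mean().mul(100).round(2)
print(missing_pct)

# Same arithmetic for the counts reported above: 12,971 of 38,500 rows.
share = round(12971 / 38500 * 100, 2)
print(f"Promotion missing share: {share}%")
```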

---

### Checking the "Promotion" Column

In [None]:
df["Promotion"].value_counts()

Promotion
Discount on Selected Items    12811
BOGO (Buy One Get One)        12718
Name: count, dtype: int64

The non-null entries in the `Promotion` column take exactly two values: **"Discount on Selected Items"** (12,811 occurrences) and **"BOGO (Buy One Get One)"** (12,718 occurrences). Because the dataset has no explicit "None" or "No Promotion" label, the 12,971 missing values most likely represent transactions with no promotional offer. They are therefore treated as a separate "No Promotion" category.
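
One way to probe this interpretation is to cross-tabulate missingness in `Promotion` against another column, for example `Discount_Applied`. The sketch below runs on a made-up frame (`toy`); whether the two columns are actually related in the real data is not established here, so treat it as a diagnostic pattern rather than a result.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (hypothetical rows).
toy = pd.DataFrame(
    {
        "Discount_Applied": [True, True, False, False, False],
        "Promotion": [
            "BOGO (Buy One Get One)",
            "Discount on Selected Items",
            "Discount on Selected Items",
            np.nan,
            np.nan,
        ],
    }
)

# Rows: Discount_Applied (False/True); columns: is Promotion missing (False/True).
missing_by_discount = pd.crosstab(
    toy["Discount_Applied"],
    toy["Promotion"].isna().rename("Promotion_missing"),
)
print(missing_by_discount)
```

If missing values were spread evenly across both rows of the table on the real data, that would suggest missingness is unrelated to `Discount_Applied`.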

---

### Handling Missing Values

In [None]:
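# Fill the missing values with "No Promotion".
# Assigning the result back avoids the chained in-place fill, which newer pandas versions warn about.
df["Promotion"] = df["Promotion"].fillna("No Promotion")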

In [None]:
df.isnull().sum() ##Checking The Missing Values Again

Transaction_ID       0
Date                 0
Customer_Name        0
Total_Items          0
Amount($)            0
Payment_Method       0
City                 0
Store_Type           0
Discount_Applied     0
Customer_Category    0
Season               0
Promotion            0
dtype: int64

In [None]:
df["Promotion"].value_counts() ##Verify

Promotion
No Promotion                  12971
Discount on Selected Items    12811
BOGO (Buy One Get One)        12718
Name: count, dtype: int64

The missing values in the `Promotion` column were filled with the label **"No Promotion"**. Transactions without a promotional offer now form an explicit third category (12,971 rows), and the dataset has no remaining null values.

---

### Checking for Duplicate Values

In [21]:
num_duplicates = df.duplicated().sum()
print(f"Total number of duplicate rows in the dataset: {num_duplicates}")


Total number of duplicate rows in the dataset: 0
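
`df.duplicated()` only flags rows that match in every column. A stricter check is to look for repeated keys, here assuming `Transaction_ID` is meant to be unique (the source does not state this). The sketch uses a made-up frame (`toy`) to show the difference between the two checks.

```python
import pandas as pd

# Toy frame for illustration only: the last two rows share an ID
# but differ in amount, so they are not full-row duplicates.
toy = pd.DataFrame(
    {
        "Transaction_ID": [1000000001, 1000000002, 1000000002],
        "Amount($)": [30.98, 23.29, 25.62],
    }
)

full_row_dupes = int(toy.duplicated().sum())
id_dupes = int(toy.duplicated(subset=["Transaction_ID"]).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Repeated Transaction_ID values: {id_dupes}")
```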


---

### Summary Statistics of the Dataset

In [25]:
df.describe()

Unnamed: 0,Transaction_ID,Date,Total_Items,Amount($)
count,38500.0,38500,38500.0,38500.0
mean,1000500000.0,2022-03-10 05:20:27.853402624,5.490649,52.459843
min,1000000000.0,2020-01-01 03:51:50,1.0,5.0
25%,1000248000.0,2021-02-02 04:45:56.500000,3.0,28.76
50%,1000501000.0,2022-03-11 09:14:40.500000,5.0,52.26
75%,1000751000.0,2023-04-13 13:47:41.750000128,8.0,76.35
max,1001000000.0,2024-05-18 19:06:29,10.0,100.0
std,289070.8,,2.868476,27.442214


The summary statistics show that, for `Total_Items` and `Amount($)`, the mean and the median (50th percentile) are nearly equal (5.49 vs. 5.0 and 52.46 vs. 52.26), which suggests roughly symmetric distributions with little skew. Both columns are also bounded (1 to 10 items, $5.00 to $100.00), so there are no extreme values that would count as notable outliers. `Transaction_ID` is an identifier, so its summary statistics carry no analytical meaning.
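
The outlier reading can be cross-checked with Tukey's 1.5 × IQR rule, which is one common convention and not necessarily the criterion used in this notebook. The sketch below works from the quartiles, minimum, and maximum copied out of the `df.describe()` output above; `iqr_fences` is a hypothetical helper name.

```python
# Quartiles, minimum and maximum copied from the df.describe() output above.
summary = {
    "Total_Items": {"q1": 3.0, "q3": 8.0, "min": 1.0, "max": 10.0},
    "Amount($)": {"q1": 28.76, "q3": 76.35, "min": 5.0, "max": 100.0},
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


results = {}
for column, stats in summary.items():
    lower, upper = iqr_fences(stats["q1"], stats["q3"])
    # A column has no IQR outliers if its min and max both lie inside the fences.
    results[column] = lower <= stats["min"] and stats["max"] <= upper
    print(f"{column}: fences=({lower:.2f}, {upper:.2f}), within fences={results[column]}")
```

Since the minimum and maximum of both columns fall inside their fences, no value would be flagged under this rule.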

---

## **Data Preprocessing Completed** ☑️☑️

The data preprocessing phase is complete: the `Date` column was converted to `datetime`, missing `Promotion` values were labeled "No Promotion", and no duplicate rows or significant outliers were found. With the dataset cleaned, the next phase explores the data to derive actionable insights.

--- 

## **Data Analysis** 
### ~ Finding meaningful insights. 