## Dataset Description

## Data Cleaning

Before any Exploratory Data Analysis can answer our research questions, we first have to clean the dataset so that data-quality issues don't distort the results.

### Importing, Loading and Reading the Dataset

We begin by importing all the necessary libraries before starting the actual Data Cleaning process.

In [103]:
import numpy as np
import pandas as pd

Next, we load the dataset and call the `.head()` method to preview its first five rows.

In [104]:
df = pd.read_csv('retail_store_sales.csv')
df.head()

| | Transaction ID | Customer ID | Category | Item | Price Per Unit | Quantity | Total Spent | Payment Method | Location | Transaction Date | Discount Applied |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TXN_6867343 | CUST_09 | Patisserie | Item_10_PAT | 18.5 | 10.0 | 185.0 | Digital Wallet | Online | 2024-04-08 | True |
| 1 | TXN_3731986 | CUST_22 | Milk Products | Item_17_MILK | 29.0 | 9.0 | 261.0 | Digital Wallet | Online | 2023-07-23 | True |
| 2 | TXN_9303719 | CUST_02 | Butchers | Item_12_BUT | 21.5 | 2.0 | 43.0 | Credit Card | Online | 2022-10-05 | False |
| 3 | TXN_9458126 | CUST_06 | Beverages | Item_16_BEV | 27.5 | 9.0 | 247.5 | Credit Card | Online | 2022-05-07 | NaN |
| 4 | TXN_4575373 | CUST_05 | Food | Item_6_FOOD | 12.5 | 7.0 | 87.5 | Digital Wallet | Online | 2022-10-02 | False |


Then we call the `.info()` method to view general information about the dataset.

In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12575 entries, 0 to 12574
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    12575 non-null  object 
 1   Customer ID       12575 non-null  object 
 2   Category          12575 non-null  object 
 3   Item              11362 non-null  object 
 4   Price Per Unit    11966 non-null  float64
 5   Quantity          11971 non-null  float64
 6   Total Spent       11971 non-null  float64
 7   Payment Method    12575 non-null  object 
 8   Location          12575 non-null  object 
 9   Transaction Date  12575 non-null  object 
 10  Discount Applied  8376 non-null   object 
dtypes: float64(3), object(8)
memory usage: 1.1+ MB


This tells us that the dataset contains 12,575 entries. More importantly, it reveals three potential issues we'll need to resolve as we go through the Data Cleaning process:

1. The `Discount Applied` column uses a Dtype of `object` rather than `bool`.

2. The `Transaction Date` column uses a Dtype of `object` rather than `datetime64`. Since each value holds the day, month, and year together, we can also split them into separate `int64` columns.

3. The columns `Item`, `Price Per Unit`, `Quantity`, `Total Spent`, and `Discount Applied` all contain null values, so we'll need to examine each of them and decide how to handle the missing data.
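As a rough sketch of how issues 1 and 2 might be handled, the snippet below converts toy versions of the two columns. The `toy` frame and the `Year`, `Month`, and `Day` column names are assumptions made for illustration. The nullable `boolean` dtype is used instead of plain `bool` because the real column contains nulls.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    'Transaction Date': ['2024-04-08', '2023-07-23', '2022-10-05'],
    'Discount Applied': [True, False, None],
})

# Parse the ISO-formatted strings into datetime64 values.
toy['Transaction Date'] = pd.to_datetime(toy['Transaction Date'], format='%Y-%m-%d')

# Split the date into separate integer columns (hypothetical column names).
toy['Year'] = toy['Transaction Date'].dt.year.astype('int64')
toy['Month'] = toy['Transaction Date'].dt.month.astype('int64')
toy['Day'] = toy['Transaction Date'].dt.day.astype('int64')

# The nullable 'boolean' dtype keeps missing values as <NA>,
# whereas astype(bool) would coerce them to True or False.
toy['Discount Applied'] = toy['Discount Applied'].astype('boolean')
```

Whether the missing discounts should stay as `<NA>` or be filled is a separate decision that this sketch leaves open.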

Before investigating these specific issues, we'll run some preliminary checks for duplicate rows and for categorical values that are represented in more than one way.

### Duplicate Values

In [106]:
df.duplicated().sum()

0

This dataset contains no duplicate rows, so there are **no issues** there.
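Note that `df.duplicated()` only flags rows that match in every column. A complementary check, sketched below on a toy frame, is to look for repeated identifiers. Treating `Transaction ID` as a unique key is an assumption here, not something the dataset confirms.

```python
import pandas as pd

# Toy data: the last two rows share an ID but differ elsewhere.
toy = pd.DataFrame({
    'Transaction ID': ['TXN_1', 'TXN_2', 'TXN_2'],
    'Total Spent': [185.0, 261.0, 43.0],
})

# No row is an exact copy of another, so the full-row check finds nothing.
full_row_dupes = int(toy.duplicated().sum())

# Restricting the comparison to the ID column reveals the repeated key.
id_dupes = int(toy.duplicated(subset=['Transaction ID']).sum())
```

On the real data, the same `subset` argument could be passed to `df.duplicated()` to run this check.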

### Multiple Representations of the Same Categorical Value

The columns to watch out for here are `Category`, `Payment Method`, and `Location`. We have to make sure that no value appears under more than one spelling or label.

We'll start with `Category`.

In [107]:
df['Category'].unique()

array(['Patisserie', 'Milk Products', 'Butchers', 'Beverages', 'Food',
       'Furniture', 'Electric household essentials',
       'Computers and electric accessories'], dtype=object)

Next, `Payment Method`.

In [108]:
df['Payment Method'].unique()

array(['Digital Wallet', 'Credit Card', 'Cash'], dtype=object)

Lastly, `Location`.

In [109]:
df['Location'].unique()

array(['Online', 'In-store'], dtype=object)

Since no two labels in any of these columns represent the same thing, there are **no issues** to be found here.
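Eyeballing `.unique()` works for a handful of labels. For longer lists, a programmatic check along the following lines may help. The helper name `hidden_variants` is hypothetical, and the normalisation (trimming whitespace and ignoring case) is only one possible definition of "the same label".

```python
import pandas as pd

def hidden_variants(series: pd.Series) -> int:
    """Count labels that collapse into another label after normalisation."""
    raw = series.dropna().unique()
    normalised = pd.Series(raw).str.strip().str.casefold().unique()
    return len(raw) - len(normalised)

# Toy examples: 'online ' differs from 'Online' only by case and a trailing space.
clean = pd.Series(['Online', 'In-store', 'Online'])
messy = pd.Series(['Online', 'online ', 'In-store'])

clean_variants = hidden_variants(clean)
messy_variants = hidden_variants(messy)
```

A result of zero means no two labels differ only by case or surrounding whitespace. It would not catch synonyms or misspellings.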

### Incorrect Datatypes

#### Transaction Date Column

#### Discount Applied Column

### Missing Data

For this portion, we'll resolve each of the previously mentioned columns individually.

#### Item Column

One simple solution would be to drop all rows containing null values in the `Item` column. However, let's first take a quick look at the `Category` and `Item` columns.

In [110]:
df[['Category', 'Item']].head(8)

| | Category | Item |
|---|---|---|
| 0 | Patisserie | Item_10_PAT |
| 1 | Milk Products | Item_17_MILK |
| 2 | Butchers | Item_12_BUT |
| 3 | Beverages | Item_16_BEV |
| 4 | Food | Item_6_FOOD |
| 5 | Patisserie | NaN |
| 6 | Food | Item_1_FOOD |
| 7 | Furniture | NaN |


Each item follows the pattern `Item_##_CATEGORYID`, where the category identifier is derived from the value in `Category`. So while we can't recover the item number, we can still determine the category identifier for each missing item, since the `Category` column contains no null values.

So rather than dropping the rows with a null `Item` and losing potentially valuable data, we can fill the missing `Item` values with a placeholder that carries only the category identifier.

We'll first create a dictionary whose keys are the values of `Category` and whose values are the corresponding placeholder item codes.

In [111]:
category_id = {'Patisserie':'Item_NA_PAT', 'Milk Products':'Item_NA_MILK', 'Butchers':'Item_NA_BUT', 'Beverages':'Item_NA_BEV', 'Food':'Item_NA_FOOD', 'Furniture':'Item_NA_FUR',
               'Electric household essentials':'Item_NA_EHE', 'Computers and electric accessories':'Item_NA_CEA'}
category_id

{'Patisserie': 'Item_NA_PAT',
 'Milk Products': 'Item_NA_MILK',
 'Butchers': 'Item_NA_BUT',
 'Beverages': 'Item_NA_BEV',
 'Food': 'Item_NA_FOOD',
 'Furniture': 'Item_NA_FUR',
 'Electric household essentials': 'Item_NA_EHE',
 'Computers and electric accessories': 'Item_NA_CEA'}
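A hand-written mapping like this is easy to mistype, so it may be worth cross-checking it against the identifiers that actually occur in the data. The sketch below does this on a toy frame. The sample rows, the `Suffix` column, and the regular expression are assumptions based on the `Item_##_CATEGORYID` pattern described above.

```python
import pandas as pd

# Toy data following the Item_##_CATEGORYID pattern (values are illustrative).
toy = pd.DataFrame({
    'Category': ['Patisserie', 'Food', 'Patisserie', 'Food', 'Furniture'],
    'Item': ['Item_10_PAT', 'Item_6_FOOD', None, 'Item_1_FOOD', 'Item_3_FUR'],
})

# Work only with rows whose item code is known.
known = toy.dropna(subset=['Item']).copy()

# Pull the trailing identifier (e.g. 'PAT') out of each item code.
known['Suffix'] = known['Item'].str.extract(r'^Item_\d+_([A-Z]+)$', expand=False)

# Collect the distinct identifiers seen per category; each category
# should have exactly one if the naming pattern holds.
suffixes = known.groupby('Category')['Suffix'].unique()

# Build placeholder codes only for categories with an unambiguous identifier.
derived = {
    category: f'Item_NA_{ids[0]}'
    for category, ids in suffixes.items()
    if len(ids) == 1
}
```

Comparing `derived` with `category_id` on the full dataset would show whether every hand-typed placeholder matches the identifiers in use.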

In [112]:
df['Item'] = df['Item'].fillna(df['Category'].map(category_id))
df[['Category', 'Item']].head(8)

| | Category | Item |
|---|---|---|
| 0 | Patisserie | Item_10_PAT |
| 1 | Milk Products | Item_17_MILK |
| 2 | Butchers | Item_12_BUT |
| 3 | Beverages | Item_16_BEV |
| 4 | Food | Item_6_FOOD |
| 5 | Patisserie | Item_NA_PAT |
| 6 | Food | Item_1_FOOD |
| 7 | Furniture | Item_NA_FUR |


And if we run `.info()` on our dataset, `Item` should no longer contain any null values.

In [113]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12575 entries, 0 to 12574
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    12575 non-null  object 
 1   Customer ID       12575 non-null  object 
 2   Category          12575 non-null  object 
 3   Item              12575 non-null  object 
 4   Price Per Unit    11966 non-null  float64
 5   Quantity          11971 non-null  float64
 6   Total Spent       11971 non-null  float64
 7   Payment Method    12575 non-null  object 
 8   Location          12575 non-null  object 
 9   Transaction Date  12575 non-null  object 
 10  Discount Applied  8376 non-null   object 
dtypes: float64(3), object(8)
memory usage: 1.1+ MB


#### Price Per Unit Column

Filling the null values for `Price Per Unit` is doable as long as both `Quantity` and `Total Spent` contain values: the price per unit is simply the total spent divided by the quantity.

In [114]:
df['Price Per Unit'] = df['Price Per Unit'].fillna(df['Total Spent']/df['Quantity'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12575 entries, 0 to 12574
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    12575 non-null  object 
 1   Customer ID       12575 non-null  object 
 2   Category          12575 non-null  object 
 3   Item              12575 non-null  object 
 4   Price Per Unit    12575 non-null  float64
 5   Quantity          11971 non-null  float64
 6   Total Spent       11971 non-null  float64
 7   Payment Method    12575 non-null  object 
 8   Location          12575 non-null  object 
 9   Transaction Date  12575 non-null  object 
 10  Discount Applied  8376 non-null   object 
dtypes: float64(3), object(8)
memory usage: 1.1+ MB
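The fill relies on the relationship `Total Spent` = `Price Per Unit` × `Quantity`. One hedged way to confirm that this holds is to test it on the rows where all three values are known, as sketched below on a toy frame. The sample values and the default tolerance of `np.isclose` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: the second row is missing its price.
toy = pd.DataFrame({
    'Price Per Unit': [18.5, np.nan, 21.5],
    'Quantity': [10.0, 9.0, 2.0],
    'Total Spent': [185.0, 261.0, 43.0],
})

# Rows where all three values are present can be used to test the relationship.
complete = toy.dropna(subset=['Price Per Unit', 'Quantity', 'Total Spent'])
relationship_holds = bool(np.isclose(
    complete['Price Per Unit'] * complete['Quantity'],
    complete['Total Spent'],
).all())

# Fill the gap only once the relationship has been confirmed.
if relationship_holds:
    toy['Price Per Unit'] = toy['Price Per Unit'].fillna(
        toy['Total Spent'] / toy['Quantity']
    )
```

If the check failed on the real data (for example, because discounts change the total), dividing the total by the quantity would not be a safe way to recover the price.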


#### Quantity and Total Spent Columns

Suspiciously enough, both `Quantity` and `Total Spent` contain the same number of null values. We'll first check whether the nulls occur in the same rows.

In [115]:
(df['Quantity'].isna() == df['Total Spent'].isna()).all()

True

While we were able to fill `Price Per Unit` using `Quantity` and `Total Spent`, the reverse doesn't work: whenever one of the two is missing, so is the other, and `Price Per Unit` alone can't determine either. Since we can't fill these missing values, we'll delete all rows containing null values in both columns.

In [116]:
df.dropna(subset=['Quantity', 'Total Spent'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11971 entries, 0 to 12574
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    11971 non-null  object 
 1   Customer ID       11971 non-null  object 
 2   Category          11971 non-null  object 
 3   Item              11971 non-null  object 
 4   Price Per Unit    11971 non-null  float64
 5   Quantity          11971 non-null  float64
 6   Total Spent       11971 non-null  float64
 7   Payment Method    11971 non-null  object 
 8   Location          11971 non-null  object 
 9   Transaction Date  11971 non-null  object 
 10  Discount Applied  7983 non-null   object 
dtypes: float64(3), object(8)
memory usage: 1.1+ MB
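After the drop, the output reads `Index: 11971 entries, 0 to 12574`, so the row labels now have gaps. If later steps rely on positional labels, renumbering the rows may be convenient. The following is a minimal sketch on a toy frame, not a required step.

```python
import numpy as np
import pandas as pd

# Toy data: the middle row is missing both values.
toy = pd.DataFrame({
    'Quantity': [10.0, np.nan, 2.0],
    'Total Spent': [185.0, np.nan, 43.0],
})

# Drop rows where either column is missing, then renumber the rows from 0.
cleaned = toy.dropna(subset=['Quantity', 'Total Spent']).reset_index(drop=True)
```

Passing `drop=True` discards the old labels instead of keeping them as an extra column.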


#### Discount Applied Column

### Inconsistent Formatting

Check the formats of `Transaction ID`, `Customer ID`, and `Item` here. Dates and discounts don't need checking, as they will already have been resolved in the Incorrect Datatypes section.
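A possible starting point for this check is sketched below. The regular expressions are inferred from the sample rows shown earlier (seven digits after `TXN_`, two after `CUST_`, and `NA` allowed as an item number for filled placeholders), so they are assumptions rather than a documented specification.

```python
import pandas as pd

# Patterns inferred from the sample rows; they are assumptions.
patterns = {
    'Transaction ID': r'TXN_\d{7}',
    'Customer ID': r'CUST_\d{2}',
    'Item': r'Item_(?:\d+|NA)_[A-Z]+',
}

# Toy data: the second transaction ID is deliberately too short.
toy = pd.DataFrame({
    'Transaction ID': ['TXN_6867343', 'TXN_37319'],
    'Customer ID': ['CUST_09', 'CUST_22'],
    'Item': ['Item_10_PAT', 'Item_NA_FUR'],
})

# Count the values in each column that do not fully match their pattern.
bad_counts = {
    column: int((~toy[column].str.fullmatch(pattern)).sum())
    for column, pattern in patterns.items()
}
```

Any column with a non-zero count would then need a closer look before the analysis continues.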

### Outliers

## Research Questions & Exploratory Data Analysis