# Basic Data Cleaning on a Retail Dataset

This notebook demonstrates a few basic data cleaning techniques:
- Handling missing values
- Removing duplicates
- Converting data types when necessary

For this project, I am using a retail store sales dataset from Kaggle.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("retail_store_sales.csv")
df.head()

First, I will check the structure of the dataset: its columns, non-null counts, and data types.

In [None]:
df.info()

From the info() output, I can see that the dataset has 12,575 rows, but several columns have fewer non-null values than that. Specifically, Item, Price Per Unit, Quantity, Total Spent, and Discount Applied contain missing values and therefore need cleaning. I also noticed that Transaction Date is stored as an object dtype, but it should be a datetime for any time-based analysis. Likewise, Discount Applied is an object, but it should be a boolean (True or False).
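
As a minimal sketch, the missing values that info() hints at can also be counted directly per column. The small DataFrame below is made up for illustration (it is not the Kaggle data), and only the column names are taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail data; the values are invented for illustration.
toy = pd.DataFrame({
    "Item": ["Item_1", np.nan, "Item_3", "Item_4"],
    "Price Per Unit": [5.0, 12.5, np.nan, 8.0],
    "Quantity": [2.0, 1.0, 3.0, np.nan],
    "Discount Applied": [True, np.nan, False, np.nan],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of rows that are missing in each column, as a percentage
missing_pct = (toy.isna().mean() * 100).round(1)

print(pd.DataFrame({"missing": missing_counts, "percent": missing_pct}))
```

Running the same two lines on `df` would give exact counts instead of reading them off the info() summary.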

To start the cleaning, I will check for duplicate rows and remove any that exist.

In [None]:
# Here I am counting fully duplicated rows. It turns out that there aren't any, so I don't have to drop anything.
df.duplicated().sum()
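
This dataset had no duplicates, but here is a rough sketch of what the drop step could look like if the count were nonzero. The toy data and the "Transaction ID" column are assumptions made for this example only.

```python
import pandas as pd

# Toy data with one fully duplicated row; the column names are hypothetical.
toy = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T2", "T3"],
    "Total Spent": [10.0, 25.0, 25.0, 8.0],
})

# Rows that are identical to an earlier row in every column
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

# Duplicates can also be checked on a key column only
n_key_dupes = toy.duplicated(subset=["Transaction ID"]).sum()

print(n_dupes, len(deduped), n_key_dupes)
```

Checking a key column on its own can catch rows that repeat an identifier but differ elsewhere, which a whole-row check would miss.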

Next, I will fill in some missing values and drop the rows that contain others.

In [None]:
# I am going to assume that if there is no value for discount, then no discount was applied, so I fill it with False.
# The fill has to happen before the cast, because a NaN cast to bool would become True.
# Assigning the result back avoids calling fillna with inplace=True on a selected column, which recent pandas versions warn about.
df["Discount Applied"] = df["Discount Applied"].fillna(False).astype(bool)
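
The order of these two steps matters, and a small sketch shows why. The Series below is made up, and it assumes the column holds True, False, and NaN values, as an object column read from a CSV typically would.

```python
import numpy as np
import pandas as pd

# Object-dtype Series with one missing entry, mimicking the discount column
s = pd.Series([True, False, np.nan], dtype=object)

# Casting first: NaN is truthy, so the missing entry silently becomes True
cast_first = s.astype(bool)

# Filling first: the missing entry becomes False, matching the assumption above
fill_first = s.fillna(False).astype(bool)

print(cast_first.tolist())
print(fill_first.tolist())
```

Casting before filling would quietly mark every unrecorded discount as applied, which is the opposite of the assumption I am making.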

In [None]:
#Now I am going to drop all rows that are missing values for Price Per Unit, Quantity, or Total Spent, because I don't know that data.
df.dropna(subset=["Price Per Unit", "Quantity", "Total Spent"], inplace=True)
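
To know exactly how many rows a drop like this removes, the row count can be compared before and after. This is a sketch on made-up data, and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two of the four rows are missing at least one required value
toy = pd.DataFrame({
    "Price Per Unit": [5.0, np.nan, 4.0, 8.0],
    "Quantity": [2.0, 1.0, np.nan, 3.0],
    "Total Spent": [10.0, np.nan, 12.0, 24.0],
})

required = ["Price Per Unit", "Quantity", "Total Spent"]

rows_before = len(toy)
cleaned = toy.dropna(subset=required)  # drop rows missing any required column
rows_dropped = rows_before - len(cleaned)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```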

In [None]:
#Finally, I am going to change Transaction Date from an object Dtype to a datetime Dtype
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"])
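
If some date strings were malformed, a plain to_datetime call would raise an error. One possible way to handle that is sketched below. It assumes a year-month-day format, which I have not confirmed for this dataset, and the sample strings are made up.

```python
import pandas as pd

# Made-up date strings, including one that cannot be parsed
dates = pd.Series(["2023-01-15", "2024-11-02", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising;
# the format string is an assumption about how the dates are written
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Count how many entries failed to parse
n_unparsed = parsed.isna().sum()

print(parsed)
print(n_unparsed)
```

Counting the NaT values afterwards shows whether any dates were lost in the conversion.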

In [None]:
df.info()

Now that the cleaning is done, I can see from the second info() output that there are no longer any missing values, and the Transaction Date and Discount Applied columns have been given appropriate data types. Around 1,000 rows with missing information were deleted, and many more were updated based on the assumption that if a discount wasn't recorded, then there was no discount.
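
These conclusions could also be checked in code rather than by reading the info() output. The helper below, `check_cleaned`, is a hypothetical name I made up for this sketch, and the toy DataFrame stands in for the cleaned data.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_datetime64_any_dtype


def check_cleaned(frame):
    """Return a dict of basic post-cleaning checks (hypothetical helper)."""
    return {
        "no_missing_values": not frame.isna().any().any(),
        "date_is_datetime": is_datetime64_any_dtype(frame["Transaction Date"]),
        "discount_is_bool": is_bool_dtype(frame["Discount Applied"]),
    }


# Toy stand-in for the cleaned data; the values are invented
toy = pd.DataFrame({
    "Transaction Date": pd.to_datetime(["2023-01-15", "2024-11-02"]),
    "Discount Applied": [True, False],
    "Total Spent": [10.0, 24.0],
})

checks = check_cleaned(toy)
print(checks)
```

Calling `check_cleaned(df)` on the real DataFrame would flag any column that still has missing values or the wrong dtype.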
