
# Project : Sales Data Wrangling 

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#Gathering">Data Gathering</a></li>
<li><a href="#Assessing">Data Assessing</a></li>
<li><a href="#Cleaning">Data Cleaning</a></li>    
<li><a href="#Assessing2">Data Assessing After Cleaning</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#Storing">Data Storing</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

This dataset contains invoice-level sales records from three supermarket branches (A, B, and C), with 1006 rows and 16 columns. Each record holds the invoice ID, the branch and city where the sale occurred, customer type, gender, product line, unit price, quantity sold, 5% tax, total, date, time, payment method, and customer rating. The goal is to analyze sales, understand customer behavior, evaluate the financial performance of each branch, and explore the factors that affect customer satisfaction.


In [1]:
# Import the libraries used in this notebook
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
from datetime import time
# show all columns
pd.set_option('display.max_columns', None)

<a id='Gathering'></a>

## 1- Data Gathering

In [2]:
df = pd.read_csv("Capstone Data - Supermarket Sales.csv")
df.head(5)

Unnamed: 0,Invoice ID,Branch,Yangon,Naypyitaw,Mandalay,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,Rating
0,750-67-8428,A,1,0,0,Normal,Male,Health and beauty,74.69,7,26.1415,,1/5/2019,13:08,Ewallet,9.1
1,226-31-3081,C,0,1,0,Normal,Male,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,9.6
2,631-41-3108,A,1,0,0,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,7.4
3,123-19-1176,A,1,0,0,Normal,Male,Health and beauty,58.22,8,,489.048,1/27/2019,8 - 30 PM,Ewallet,8.4
4,373-73-7910,A,1,0,0,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,5.3


<a id='Assessing'></a>
## 2- Data Assessing

### 2-1- Tidiness issues

In [3]:
df.sample(5)

Unnamed: 0,Invoice ID,Branch,Yangon,Naypyitaw,Mandalay,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,Rating
15,299-46-1805,B,0,0,1,Normal,Male,Sports and travel,93.72,6,28.116,590.436,1/15/2019,16:19,Cash,4.5
222,389-25-3394,C,0,1,0,Normal,Male,Electronic accessories,11.81,5,2.9525,62.0025,2/17/2019,18:06,Cash,9.4
716,857-16-3520,A,1,0,0,Member,Female,Fashion accessories,71.46,7,25.011,525.231,3/28/2019,16:06,Ewallet,4.5
991,602-16-6955,B,0,0,1,Normal,Female,Sports and travel,76.6,10,38.3,804.3,1/24/2019,18:10,Ewallet,6.0
440,450-28-2866,C,0,1,0,Member,Male,Food and beverages,17.44,5,4.36,91.56,1/15/2019,19:25,Cash,8.1


In [4]:
df.shape

(1006, 16)

#### Each variable forms a column and contains values
- 'Yangon', 'Naypyitaw', and 'Mandalay' are indicator columns for a single variable (the city); convert them into one 'City' column


#### Each observation forms a row
- nothing

#### Each type of observational unit forms a table
- nothing

### 2-2- Quality issues

In [5]:
df.sample(10)

Unnamed: 0,Invoice ID,Branch,Yangon,Naypyitaw,Mandalay,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,Rating
173,608-27-6295,B,0,0,1,Member,Male,Electronic accessories,52.89,6,15.867,333.207,1/19/2019,17:34,Credit card,9.8
444,301-11-9629,A,1,0,0,Normal,Female,Sports and travel,19.1,7,6.685,140.385,1/15/2019,10:43,Cash,9.7
896,781-84-8059,C,0,1,0,Normal,Male,Fashion accessories,60.74,7,21.259,446.439,1/18/2019,16:23,Ewallet,5.0
866,361-59-0574,B,0,0,1,Member,Male,Sports and travel,90.53,8,36.212,760.452,3/15/2019,14:48,Credit card,6.5
402,236-86-3015,C,0,1,0,Member,Male,Home and lifestyle,13.98,1,0.699,14.679,2/4/2019,13:38,Ewallet,9.8
693,196-01-2849,C,0,1,0,Member,Female,Fashion accessories,73.38,7,25.683,539.343,2/10/2019,13:56,Cash,9.5
630,149-61-1929,A,1,0,0,Normal,Male,Sports and travel,64.19,10,32.095,673.995,1/19/2019,14:08,Credit card,6.7
647,574-31-8277,B,0,0,1,Member,Male,Fashion accessories,33.63,1,1.6815,35.3115,3/20/2019,19:55,Cash,5.6
986,764-44-8999,B,0,0,1,Normal,Female,Health and beauty,14.76,2,1.476,30.996,2/18/2019,14:42,Ewallet,4.3
618,828-46-6863,A,1,0,0,Member,Male,Food and beverages,98.53,6,29.559,620.739,1/23/2019,11:22,Credit card,4.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1006 entries, 0 to 1005
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Invoice ID     1006 non-null   object 
 1   Branch         1006 non-null   object 
 2   Yangon         1006 non-null   int64  
 3   Naypyitaw      1006 non-null   int64  
 4   Mandalay       1006 non-null   int64  
 5   Customer type  1006 non-null   object 
 6   Gender         1006 non-null   object 
 7   Product line   1006 non-null   object 
 8   Unit price     1006 non-null   object 
 9   Quantity       1006 non-null   int64  
 10  Tax 5%         997 non-null    float64
 11  Total          1003 non-null   float64
 12  Date           1006 non-null   object 
 13  Time           1006 non-null   object 
 14  Payment        1006 non-null   object 
 15  Rating         1006 non-null   float64
dtypes: float64(3), int64(4), object(9)
memory usage: 125.9+ KB


In [7]:
df.describe()

Unnamed: 0,Yangon,Naypyitaw,Mandalay,Quantity,Tax 5%,Total,Rating
count,1006.0,1006.0,1006.0,1006.0,997.0,1003.0,1006.0
mean,0.338966,0.329026,0.332008,5.469185,15.479682,322.734689,7.056163
std,0.473594,0.470093,0.471168,3.014153,11.72832,245.865964,3.318751
min,0.0,0.0,0.0,-8.0,0.5085,10.6785,4.0
25%,0.0,0.0,0.0,3.0,5.9865,123.78975,5.5
50%,0.0,0.0,0.0,5.0,12.2275,254.016,7.0
75%,1.0,1.0,1.0,8.0,22.7205,471.009,8.5
max,1.0,1.0,1.0,10.0,49.65,1042.65,97.0


In [8]:
df.duplicated().sum()

6

In [9]:
df.isnull().sum()

Invoice ID       0
Branch           0
Yangon           0
Naypyitaw        0
Mandalay         0
Customer type    0
Gender           0
Product line     0
Unit price       0
Quantity         0
Tax 5%           9
Total            3
Date             0
Time             0
Payment          0
Rating           0
dtype: int64

In [10]:
df['Customer type'].value_counts()

Normal     515
Member     463
-           27
Memberr      1
Name: Customer type, dtype: int64

In [11]:
df['Quantity'].value_counts()

 10    120
 1     111
 4     109
 7     102
 5     102
 6      98
 9      94
 2      92
 3      91
 8      83
-8       2
-1       1
-7       1
Name: Quantity, dtype: int64

In [12]:
negative_values = df[df['Quantity'] < 0]
negative_values

Unnamed: 0,Invoice ID,Branch,Yangon,Naypyitaw,Mandalay,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,Rating
629,308-39-1707,A,1,0,0,Normal,Female,Fashion accessories,12.09 USD,-1,,12.6945,1/26/2019,18:19,Credit card,8.2
830,237-44-6163,A,1,0,0,Normal,Male,Electronic accessories,10.56 USD,-8,,88.704,1/24/2019,17:43,Cash,7.6
881,115-38-7388,C,0,1,0,Member,Female,Fashion accessories,10.18 USD,-8,,85.512,3/30/2019,12:51,Credit card,9.5
903,865-41-9075,A,1,0,0,Normal,Male,Food and beverages,11.53 USD,-7,,84.7455,1/28/2019,17:35,Cash,8.1


#### 1- Completeness :
- Remove "USD" from the "Unit price" column
- Remove "PM" and " - " from the "Time" column
- Fill the missing values in the "Total" column with "Quantity" multiplied by "Unit price", plus "Tax 5%"
- Fill the missing values in the "Tax 5%" column
- Replace "-" in the "Customer type" column with the most common value

#### 2- Validity :
- Rename columns => 'Invoice_ID' , 'Customer_type' , 'Product_line' , 'Unit_price' , 'Tax_5%' 
- Convert the "Unit price" column from object to a numeric type (float)
- Convert the "Date" column from object to datetime


#### 3- Accuracy :
- "Quantity" contains negative values (-1, -7, -8) in four rows
- "Customer type" contains the misspelled value "Memberr"

#### 4- Consistency :
- Remove the 6 duplicate rows

<a id='Cleaning'></a>
## 3- Data Cleaning

In [13]:
# Work on a copy so the raw DataFrame stays unchanged
df_clean = df.copy()
df_clean.head()

Unnamed: 0,Invoice ID,Branch,Yangon,Naypyitaw,Mandalay,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,Rating
0,750-67-8428,A,1,0,0,Normal,Male,Health and beauty,74.69,7,26.1415,,1/5/2019,13:08,Ewallet,9.1
1,226-31-3081,C,0,1,0,Normal,Male,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,9.6
2,631-41-3108,A,1,0,0,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,7.4
3,123-19-1176,A,1,0,0,Normal,Male,Health and beauty,58.22,8,,489.048,1/27/2019,8 - 30 PM,Ewallet,8.4
4,373-73-7910,A,1,0,0,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,5.3


### 3-1 Fixing Tidiness issues

#### A- Define :
- Convert the 'Yangon', 'Naypyitaw', and 'Mandalay' indicator columns into a single 'City' column

#### B- Code :

In [14]:
# Reshape the DataFrame from wide to long format: one row per (invoice, city) pair
df_clean = pd.melt(df_clean,
                   id_vars=['Invoice ID', 'Branch', 'Customer type', 'Gender', 'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date', 'Time', 'Payment', 'Rating'],
                   value_vars=['Yangon', 'Naypyitaw', 'Mandalay'],
                   var_name='City',
                   value_name='Sales')
# Keep only the rows where "Sales" equals 1, i.e. the city where the sale took place
df_clean = df_clean[df_clean['Sales'] == 1]
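
The same reshaping can be sketched without `melt`, assuming every row flags exactly one city. The toy frame below is illustrative only (it is not the real dataset); `idxmax(axis=1)` returns the label of the column that holds the 1, and the original row index is preserved instead of the expanded index that `melt` produces.

```python
import pandas as pd

# Toy frame mirroring the three indicator columns (illustrative values only).
toy = pd.DataFrame({
    "Invoice ID": ["750-67-8428", "226-31-3081", "299-46-1805"],
    "Yangon": [1, 0, 0],
    "Naypyitaw": [0, 1, 0],
    "Mandalay": [0, 0, 1],
})

city_cols = ["Yangon", "Naypyitaw", "Mandalay"]

# Sanity check: each row should flag exactly one city.
assert (toy[city_cols].sum(axis=1) == 1).all()

# idxmax(axis=1) returns the column label holding the row maximum (the 1).
toy["City"] = toy[city_cols].idxmax(axis=1)
toy = toy.drop(columns=city_cols)
print(toy)
```

With this approach there is no helper `Sales` column to filter on or drop afterwards.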

#### C-Test :

In [15]:
df_clean.sample(5)

Unnamed: 0,Invoice ID,Branch,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,Rating,City,Sales
2439,650-98-6268,B,Member,Female,Food and beverages,20.87,3,3.1305,65.7405,3/20/2019,13:53,Credit card,8.0,Mandalay,1
168,796-32-9050,A,Normal,Male,Food and beverages,51.28,6,15.384,323.064,1/19/2019,16:31,Cash,6.5,Yangon,1
1079,841-35-6630,C,Normal,Female,Electronic accessories,75.91,6,22.773,478.233,3/9/2019,18:21,Cash,8.7,Naypyitaw,1
1667,842-29-4695,C,Member,Male,Sports and travel,17.14,7,5.999,125.979,1/16/2019,12:07,Credit card,7.9,Naypyitaw,1
182,851-28-6367,A,Member,Male,Sports and travel,15.5,10,7.75,162.75,3/23/2019,10:55,Ewallet,8.0,Yangon,1


In [16]:
df_clean['City'].value_counts()

Yangon       341
Mandalay     334
Naypyitaw    331
Name: City, dtype: int64

In [17]:
# Drop the helper "Sales" column
df_clean = df_clean.drop(columns=['Sales'])

### 3-2 Fixing Quality Issues

#### A- Define :
- Remove duplicates

#### B- Code :

In [18]:
df_clean = df_clean.drop_duplicates()

#### C-Test  :

In [19]:
df_clean.duplicated().sum()

0

#### A- Define :
- Change column names

#### B- Code :

In [20]:
df_clean = df_clean.rename(columns={'Invoice ID': 'Invoice_ID', 'Customer type': 'Customer_type', 'Product line': 'Product_line', 'Unit price': 'Unit_price', 'Tax 5%': 'Tax_5%'})

#### C-Test :

In [21]:
df_clean.head(5)

Unnamed: 0,Invoice_ID,Branch,Customer_type,Gender,Product_line,Unit_price,Quantity,Tax_5%,Total,Date,Time,Payment,Rating,City
0,750-67-8428,A,Normal,Male,Health and beauty,74.69,7,26.1415,,1/5/2019,13:08,Ewallet,9.1,Yangon
2,631-41-3108,A,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,7.4,Yangon
3,123-19-1176,A,Normal,Male,Health and beauty,58.22,8,,489.048,1/27/2019,8 - 30 PM,Ewallet,8.4,Yangon
4,373-73-7910,A,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,5.3,Yangon
6,355-53-5943,A,Normal,Male,Electronic accessories,68.84,6,20.652,433.692,2/25/2019,14:36,Ewallet,5.8,Yangon


#### A- Define :
- Remove "USD" from the "Unit_price" column

#### B- Code :

In [22]:
# count USD
contains_usd = df_clean['Unit_price'].str.contains('USD', regex=False).sum()
contains_usd 

5

In [23]:
# Remove USD
df_clean['Unit_price'] = df_clean['Unit_price'].astype(str).str.replace('USD', '', regex=False).str.strip()

#### C-Test :

In [26]:
# Check if there are any values containing "USD"  
contains_usd = df_clean['Unit_price'].str.contains('USD').any()  
if contains_usd:  
    print("There is still USD")  
else:  
    print("There is no USD")

There is no USD


#### A- Define :
- Convert the "Unit_price" column from object to a numeric type (float)

#### B- Code :

In [27]:
# Replace type
df_clean['Unit_price'] = pd.to_numeric(df_clean['Unit_price'])

#### C-Test :

In [28]:
df_clean['Unit_price'].info()

<class 'pandas.core.series.Series'>
Int64Index: 1000 entries, 0 to 3008
Series name: Unit_price
Non-Null Count  Dtype  
--------------  -----  
1000 non-null   float64
dtypes: float64(1)
memory usage: 15.6 KB


#### A- Define :
- Convert the "Date" column from object to datetime

#### B- Code :

In [29]:
# Replace type
df_clean['Date'] = pd.to_datetime(df_clean['Date'])

#### C-Test :

In [30]:
df_clean['Date'].info()

<class 'pandas.core.series.Series'>
Int64Index: 1000 entries, 0 to 3008
Series name: Date
Non-Null Count  Dtype         
--------------  -----         
1000 non-null   datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 15.6 KB


#### A- Define :
- Remove "PM" and " - " from the "Time" column

#### B- Code :

In [31]:
# Replace " - " with ":"
df_clean['Time'] = df_clean['Time'].replace({' - ': ':'}, regex=True) 

In [32]:
# Remove the AM/PM markers
df_clean['Time'] = df_clean['Time'].str.replace('PM', '', regex=False).str.replace('AM', '', regex=False).str.strip()

#### C-Test :

In [33]:
# Count the values that still contain "-"
contains_dash = df_clean['Time'].str.contains('-', regex=False)
contains_dash.sum()

0

In [34]:
# Count the values that still contain "PM"
contains_PM = df_clean['Time'].str.contains('PM', regex=False)
contains_PM.sum()

0
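
Stripping the `PM` marker turns `8 - 30 PM` into `8:30`, which reads as a morning time next to 24-hour values such as `13:08`. If the marker is meant literally (an assumption; the source does not say), the value could be converted to 24-hour form instead. A minimal sketch, where `to_24h` is a hypothetical helper and the sample values are taken from the rows shown earlier:

```python
from datetime import datetime

import pandas as pd


def to_24h(raw: str) -> str:
    """Convert '8 - 30 PM' or '13:08' to a zero-padded 24-hour 'HH:MM' string."""
    text = raw.replace(" - ", ":").strip()
    # Values with an AM/PM marker use the 12-hour clock; the rest are already 24-hour.
    fmt = "%I:%M %p" if text.upper().endswith(("AM", "PM")) else "%H:%M"
    return datetime.strptime(text, fmt).strftime("%H:%M")


times = pd.Series(["13:08", "8 - 30 PM", "10:29"])
print(times.map(to_24h).tolist())  # ['13:08', '20:30', '10:29']
```

Applied to the notebook, this would replace the two string-replacement cells with `df_clean['Time'].map(to_24h)`.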

#### A- Define :
- Fill the null values in the "Total" column with "Quantity" multiplied by "Unit_price", plus "Tax_5%"


#### B- Code :

In [35]:
# Replace null 
df_clean['Total'] = df_clean['Total'].fillna(df_clean['Quantity'] * df_clean['Unit_price'] + df_clean['Tax_5%'])

#### C-Test :

In [36]:
df_clean['Total'].isnull().sum()

0
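
The fill expression depends on `Tax_5%`, so a row missing both `Total` and `Tax_5%` would stay null. One way to avoid that dependency, assuming the tax is always 5% of the subtotal as the column name suggests, is to compute the total from price and quantity alone. The toy values below come from the first two rows displayed earlier; the second row's missing tax is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Unit_price": [74.69, 15.28],
    "Quantity": [7, 5],
    "Tax_5%": [26.1415, np.nan],   # second tax value removed for illustration
    "Total": [np.nan, np.nan],
})

# Total = subtotal + 5% tax = subtotal * 1.05, independent of the Tax_5% column.
subtotal = toy["Unit_price"] * toy["Quantity"]
toy["Total"] = toy["Total"].fillna(subtotal * 1.05)

print(toy["Total"].round(4).tolist())
```

For the first row this gives 7 * 74.69 * 1.05, approximately 548.9715, the same value as `Quantity * Unit_price + Tax_5%`.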

#### A- Define :
- Fill the null values in the "Tax_5%" column with 5% of "Unit_price" multiplied by "Quantity"

#### B- Code :

In [37]:
# Replace null 
df_clean['Tax_5%'] = df_clean['Tax_5%'].fillna(df_clean['Unit_price'] * df_clean['Quantity'] * 0.05)

#### C-Test :

In [38]:
df_clean['Tax_5%'].isnull().sum()

0

#### A- Define :
- Replace "-" in the "Customer_type" column with the most common value, and correct the misspelling "Memberr"

#### B- Code :

In [39]:
# Replace (-)
df_clean['Customer_type'] = df_clean['Customer_type'].replace('-', np.nan)

In [40]:
# Most frequent value
most = df_clean['Customer_type'].mode()[0]
most

'Normal'

In [41]:
df_clean['Customer_type'] = df_clean['Customer_type'].fillna(most)

In [42]:
# Replace Memberr
df_clean['Customer_type'] = df_clean['Customer_type'].replace('Memberr', 'Member')

#### C-Test :

In [43]:
df_clean['Customer_type'].value_counts()

Normal    540
Member    460
Name: Customer_type, dtype: int64

#### A- Define :
- Replace the negative values in the "Quantity" column with their absolute values

#### B- Code :

In [44]:
df_clean['Quantity'] = df_clean['Quantity'].abs()

#### C-Test :

In [46]:
# count negative
negative_values2 = df_clean['Quantity'] < 0
negative_values2.sum()

0
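
The order of the cleaning steps matters here. The rows with negative quantities also had a missing `Tax_5%`, and the tax was imputed before the sign was fixed, so for a row such as invoice 237-44-6163 the imputed tax is 10.56 * -8 * 0.05 = -4.224. A sketch of both orderings on toy rows, assuming (as the `abs()` step does) that the negative signs are entry errors:

```python
import numpy as np
import pandas as pd

# Two of the negative-quantity rows shown during assessment.
toy = pd.DataFrame({
    "Unit_price": [10.56, 12.09],
    "Quantity": [-8, -1],
    "Tax_5%": [np.nan, np.nan],
})

# Order used above: impute the tax first, fix the sign of Quantity afterwards.
tax_then_abs = toy["Tax_5%"].fillna(toy["Unit_price"] * toy["Quantity"] * 0.05)

# Alternative order: fix the sign first, then impute the tax.
quantity = toy["Quantity"].abs()
abs_then_tax = toy["Tax_5%"].fillna(toy["Unit_price"] * quantity * 0.05)

print(tax_then_abs.round(4).tolist())   # negative taxes
print(abs_then_tax.round(4).tolist())   # positive taxes
```

Moving the `abs()` cell ahead of the tax imputation would keep the imputed taxes positive.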

In [47]:
# Profit rate of 25%, applied as a markup on COGS
profit_margin = 0.25 

# Cost of Goods Sold
df_clean['COGS'] = df_clean['Total'] / (1 + profit_margin)

In [48]:
# Calculate profit: Total minus COGS (COGS rounded to one decimal place)
df_clean['Profit'] = df_clean['Total'] - df_clean['COGS'].round(1)

In [49]:
df_clean.head()

Unnamed: 0,Invoice_ID,Branch,Customer_type,Gender,Product_line,Unit_price,Quantity,Tax_5%,Total,Date,Time,Payment,Rating,City,COGS,Profit
0,750-67-8428,A,Normal,Male,Health and beauty,74.69,7,26.1415,548.9715,2019-01-05,13:08,Ewallet,9.1,Yangon,439.1772,109.7715
2,631-41-3108,A,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,2019-03-03,13:23,Credit card,7.4,Yangon,272.4204,68.1255
3,123-19-1176,A,Normal,Male,Health and beauty,58.22,8,23.288,489.048,2019-01-27,8:30,Ewallet,8.4,Yangon,391.2384,97.848
4,373-73-7910,A,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2019-02-08,10:37,Ewallet,5.3,Yangon,507.5028,126.8785
6,355-53-5943,A,Normal,Male,Electronic accessories,68.84,6,20.652,433.692,2019-02-25,14:36,Ewallet,5.8,Yangon,346.9536,86.692
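
A worked check of the first two rows clarifies two details of the calculation: `COGS` is rounded only inside the `Profit` expression, so the stored `Profit` differs slightly from `Total - COGS`, and dividing by `1 + profit_margin` makes the 25% a markup on cost, which corresponds to 20% of the total. The sketch below uses the two `Total` values shown above; the variable names other than `profit_margin` are illustrative.

```python
import pandas as pd

profit_margin = 0.25
total = pd.Series([548.9715, 340.5255])

cogs = total / (1 + profit_margin)

# As computed in the notebook: COGS is rounded to one decimal before subtracting.
profit_as_in_notebook = total - cogs.round(1)

# Without the rounding, Profit equals Total - COGS exactly.
profit_unrounded = total - cogs

print(cogs.round(4).tolist())
print(profit_as_in_notebook.round(4).tolist())
print(profit_unrounded.round(4).tolist())
print((profit_unrounded / total).round(4).tolist())  # share of the total
print((profit_unrounded / cogs).round(4).tolist())   # markup on cost
```

Dropping `.round(1)` would make the `COGS` and `Profit` columns sum back to `Total`.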


<a id='Assessing2'></a>
## 4- Data Assessing After Cleaning

In [50]:
df_clean.sample(10)

Unnamed: 0,Invoice_ID,Branch,Customer_type,Gender,Product_line,Unit_price,Quantity,Tax_5%,Total,Date,Time,Payment,Rating,City,COGS,Profit
2108,766-85-7061,B,Normal,Male,Health and beauty,87.87,10,43.935,922.635,2019-03-29,10:25,Ewallet,5.1,Mandalay,738.108,184.535
1873,101-81-4070,C,Member,Female,Health and beauty,62.82,2,6.282,131.922,2019-01-17,12:36,Ewallet,4.9,Naypyitaw,105.5376,26.422
2905,715-20-1673,B,Normal,Male,Electronic accessories,28.38,5,7.095,148.995,2019-03-06,20:57,Cash,9.4,Mandalay,119.196,29.795
1652,778-89-7974,C,Normal,Male,Health and beauty,70.21,6,21.063,442.323,2019-03-30,14:58,Cash,7.4,Naypyitaw,353.8584,88.423
645,706-36-6154,A,Member,Male,Home and lifestyle,19.36,9,8.712,182.952,2019-01-18,18:43,Ewallet,8.7,Yangon,146.3616,36.552
1085,756-01-7507,C,Normal,Female,Health and beauty,20.38,5,5.095,106.995,2019-01-22,18:56,Cash,6.0,Naypyitaw,85.596,21.395
184,586-25-0848,A,Normal,Female,Sports and travel,12.34,7,4.319,90.699,2019-03-04,11:19,Credit card,6.7,Yangon,72.5592,18.099
2588,458-61-0011,B,Normal,Male,Food and beverages,60.3,4,12.06,253.26,2019-02-20,18:43,Cash,5.8,Mandalay,202.608,50.66
631,655-07-2265,A,Normal,Male,Electronic accessories,78.31,3,11.7465,246.6765,2019-03-05,16:38,Ewallet,5.4,Yangon,197.3412,49.3765
1586,390-31-6381,C,Normal,Male,Food and beverages,27.22,3,4.083,85.743,2019-01-07,12:37,Cash,7.3,Naypyitaw,68.5944,17.143


In [51]:
df_clean.shape

(1000, 16)

In [52]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 3008
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Invoice_ID     1000 non-null   object        
 1   Branch         1000 non-null   object        
 2   Customer_type  1000 non-null   object        
 3   Gender         1000 non-null   object        
 4   Product_line   1000 non-null   object        
 5   Unit_price     1000 non-null   float64       
 6   Quantity       1000 non-null   int64         
 7   Tax_5%         1000 non-null   float64       
 8   Total          1000 non-null   float64       
 9   Date           1000 non-null   datetime64[ns]
 10  Time           1000 non-null   object        
 11  Payment        1000 non-null   object        
 12  Rating         1000 non-null   float64       
 13  City           1000 non-null   object        
 14  COGS           1000 non-null   float64       
 15  Profit         1000 non-null   float64       

In [53]:
df_clean.describe()

Unnamed: 0,Unit_price,Quantity,Tax_5%,Total,Rating,COGS,Profit
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,55.67213,5.51,15.353497,322.966749,7.06,258.373399,64.593849
std,26.494628,2.923431,11.742764,245.885335,3.324375,196.708268,49.175056
min,10.08,1.0,-4.224,10.6785,4.0,8.5428,2.1785
25%,32.875,3.0,5.924875,124.422375,5.5,99.5379,24.922375
50%,55.23,5.0,12.088,253.848,7.0,203.0784,50.798
75%,77.935,8.0,22.44525,471.35025,8.5,377.0802,94.30025
max,99.96,10.0,49.65,1042.65,97.0,834.12,208.55
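
The summary still shows a `Rating` maximum of 97.0 and a negative `Tax_5%` minimum (-4.224), so a range check may be worth adding at this stage. The sketch below counts out-of-range values per column; `out_of_range` and `bounds` are hypothetical names, and the 0 to 10 rating scale is an assumption rather than something the dataset documents.

```python
import pandas as pd

# Hypothetical valid ranges; None means "no bound on that side".
bounds = {"Rating": (0, 10), "Tax_5%": (0, None), "Quantity": (1, None)}


def out_of_range(df, bounds):
    """Return the number of values outside [low, high] for each listed column."""
    counts = {}
    for col, (low, high) in bounds.items():
        mask = pd.Series(False, index=df.index)
        if low is not None:
            mask |= df[col] < low
        if high is not None:
            mask |= df[col] > high
        counts[col] = int(mask.sum())
    return counts


# Toy rows, including the two extreme values visible in the summary.
toy = pd.DataFrame({
    "Rating": [9.1, 97.0, 4.0],
    "Tax_5%": [26.1415, -4.224, 3.82],
    "Quantity": [7, 8, 5],
})
print(out_of_range(toy, bounds))  # {'Rating': 1, 'Tax_5%': 1, 'Quantity': 0}
```

Calling `out_of_range(df_clean, bounds)` would report how many rows need another look before the data is stored.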


In [54]:
df_clean.isna().sum()

Invoice_ID       0
Branch           0
Customer_type    0
Gender           0
Product_line     0
Unit_price       0
Quantity         0
Tax_5%           0
Total            0
Date             0
Time             0
Payment          0
Rating           0
City             0
COGS             0
Profit           0
dtype: int64

In [56]:
df_clean.duplicated().sum()

0

<a id='Storing'></a>
## 5- Data Storing

In [58]:
df_clean.to_csv('clean_Supermarket_Sales_data.csv')
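
By default `to_csv` also writes the row index, and `read_csv` then reads it back as an extra column named `Unnamed: 0`. If the index is not needed, which seems likely here since it is left over from the reshaping step, passing `index=False` avoids that. A round-trip sketch on toy rows, using an in-memory buffer in place of the real file:

```python
import io

import pandas as pd

# Toy rows with a non-contiguous index, as left behind by the cleaning steps.
toy = pd.DataFrame(
    {
        "Invoice_ID": ["750-67-8428", "631-41-3108"],
        "Date": pd.to_datetime(["2019-01-05", "2019-03-03"]),
        "Total": [548.9715, 340.5255],
    },
    index=[0, 2],
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # do not write the index column
buffer.seek(0)

# Dates are stored as text in CSV, so parse them again on load.
reloaded = pd.read_csv(buffer, parse_dates=["Date"])
print(reloaded.columns.tolist())
print(reloaded.dtypes)
```

The same two arguments would apply to the real file: `index=False` when saving and `parse_dates=['Date']` when loading.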