# Case Study


* This dataset contains 1,000 retail transaction records from a supermarket, with 17 variables describing sales, customers, products, and transactions.

* **Transaction details**: Invoice ID, Date, Time, Branch, City

* **Customer information**: Customer type (e.g., Member/Normal), Gender

* **Product & sales info**: Product line, Unit price, Quantity, COGS, Tax (5%), Total, Gross income, Gross margin percentage

* **Payment & feedback**: Payment method, Customer rating

* Overall, it is a clean, well-structured transactional dataset (no missing values) commonly used for descriptive analytics, sales performance analysis, customer behavior analysis, and basic predictive modeling in a retail context.

In [73]:
# -- import libraries -- 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style

In [74]:
# -- Set and configure plot appearance and format --
# -- See here for more options: https://matplotlib.org/users/customizing.html --
%matplotlib inline


# DataFrame

In [75]:
# -- Load dataset, supermarket_sales.csv --
# -- Note: the Date column is read in as text (object), not as a datetime --

df = pd.read_csv('https://raw.githubusercontent.com/ttchuang/dataset/master/supermarket_sales.csv')


## Examine Data Types and Format
**Data Issues: Please examine the dataset. Do you see any problems with the data?**

In [76]:
# -- Examine the dataset --
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Invoice ID               1000 non-null   object 
 1   Branch                   1000 non-null   object 
 2   City                     1000 non-null   object 
 3   Customer type            1000 non-null   object 
 4   Gender                   1000 non-null   object 
 5   Product line             1000 non-null   object 
 6   Unit price               1000 non-null   float64
 7   Quantity                 1000 non-null   int64  
 8   Tax 5%                   1000 non-null   float64
 9   Total                    1000 non-null   float64
 10  Date                     1000 non-null   object 
 11  Time                     1000 non-null   object 
 12  Payment                  1000 non-null   object 
 13  cogs                     1000 non-null   float64
 14  gross margin percentage  

In [77]:
# -- Preview the dataset
df.head()

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,76.4,4.761905,3.82,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33,Ewallet,465.76,4.761905,23.288,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3
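The preview rows suggest how the derived columns relate to one another: cogs looks like Unit price times Quantity, Tax 5% looks like 5% of cogs, Total looks like cogs plus tax, and gross income appears to equal the tax amount. The following is a minimal sketch that checks those assumed relationships on two rows copied from the preview. The name `sample` and the formulas are assumptions inferred from the displayed values, not documented rules of the dataset.

```python
import numpy as np
import pandas as pd

# Two rows copied from the preview; only the columns needed for the check.
sample = pd.DataFrame({
    'Unit price': [74.69, 15.28],
    'Quantity': [7, 5],
    'Tax 5%': [26.1415, 3.82],
    'Total': [548.9715, 80.22],
    'cogs': [522.83, 76.4],
    'gross margin percentage': [4.761905, 4.761905],
    'gross income': [26.1415, 3.82],
})

# Recompute each derived column from Unit price and Quantity (assumed formulas).
cogs = sample['Unit price'] * sample['Quantity']
tax = 0.05 * cogs
total = cogs + tax
margin_pct = 100 * (total - cogs) / total

# Compare the recomputed values with the stored ones, allowing for rounding.
checks = pd.DataFrame({
    'cogs': np.isclose(cogs, sample['cogs']),
    'Tax 5%': np.isclose(tax, sample['Tax 5%']),
    'Total': np.isclose(total, sample['Total']),
    'gross income': np.isclose(total - cogs, sample['gross income']),
    'gross margin percentage': np.isclose(margin_pct, sample['gross margin percentage']),
})
print(checks.all())
```

If the same check holds on the full DataFrame, the gross margin percentage column is constant and carries no information for analysis.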


In [78]:
# -- Type casting --
df['Date'] = pd.to_datetime(df['Date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Invoice ID               1000 non-null   object        
 1   Branch                   1000 non-null   object        
 2   City                     1000 non-null   object        
 3   Customer type            1000 non-null   object        
 4   Gender                   1000 non-null   object        
 5   Product line             1000 non-null   object        
 6   Unit price               1000 non-null   float64       
 7   Quantity                 1000 non-null   int64         
 8   Tax 5%                   1000 non-null   float64       
 9   Total                    1000 non-null   float64       
 10  Date                     1000 non-null   datetime64[ns]
 11  Time                     1000 non-null   object        
 12  Payment                  1000 non-n
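The raw dates look like month/day/year strings (for example 1/27/2019), so a value such as 1/5/2019 could be read as either January 5 or May 1. Below is a small sketch, on a few hand-copied strings rather than the full dataset, of passing an explicit `format` to `pd.to_datetime` so the parsing does not depend on inference. The month-first format is an assumption based on the preview rows.

```python
import pandas as pd

# A few date strings in the same style as the raw Date column.
dates = pd.Series(['1/5/2019', '3/8/2019', '1/27/2019'])

# An explicit format removes any ambiguity: month first, then day, then year.
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed.dt.month.tolist())

# For contrast: the same string read day-first lands in a different month.
day_first = pd.to_datetime('1/5/2019', dayfirst=True)
print(day_first.month)
```

An explicit format also makes pandas raise an error on strings that do not match, which tends to surface data problems early.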

## Create Periods

### Using to_period()

**Create a new column called QuarterlyPeriod, whose values have the format YYYYQn (e.g., 2019Q1).**

In [79]:
df['Year'] = df['Date'].dt.year
df.head()


Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating,Year
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,2019-01-05,13:08,Ewallet,522.83,4.761905,26.1415,9.1,2019
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,2019-03-08,10:29,Cash,76.4,4.761905,3.82,9.6,2019
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,2019-03-03,13:23,Credit card,324.31,4.761905,16.2155,7.4,2019
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,2019-01-27,20:33,Ewallet,465.76,4.761905,23.288,8.4,2019
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2019-02-08,10:37,Ewallet,604.17,4.761905,30.2085,5.3,2019


In [80]:
df['Quarter'] = df['Date'].dt.quarter
df.head()

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating,Year,Quarter
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,2019-01-05,13:08,Ewallet,522.83,4.761905,26.1415,9.1,2019,1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,2019-03-08,10:29,Cash,76.4,4.761905,3.82,9.6,2019,1
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,2019-03-03,13:23,Credit card,324.31,4.761905,16.2155,7.4,2019,1
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,2019-01-27,20:33,Ewallet,465.76,4.761905,23.288,8.4,2019,1
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2019-02-08,10:37,Ewallet,604.17,4.761905,30.2085,5.3,2019,1


In [81]:
# -- Create a new column holding the quarterly period --
df['QuarterlyPeriod'] = df['Date'].dt.to_period('Q')
df.head()

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating,Year,Quarter,QuarterlyPeriod
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,2019-01-05,13:08,Ewallet,522.83,4.761905,26.1415,9.1,2019,1,2019Q1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,2019-03-08,10:29,Cash,76.4,4.761905,3.82,9.6,2019,1,2019Q1
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,2019-03-03,13:23,Credit card,324.31,4.761905,16.2155,7.4,2019,1,2019Q1
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,2019-01-27,20:33,Ewallet,465.76,4.761905,23.288,8.4,2019,1,2019Q1
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2019-02-08,10:37,Ewallet,604.17,4.761905,30.2085,5.3,2019,1,2019Q1
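As a hedged illustration of what `to_period()` produces, the sketch below converts a few dates from the preview to quarterly and monthly periods. The variable names are illustrative only, and the monthly frequency is an extra example that is not used elsewhere in this notebook.

```python
import pandas as pd

# Three dates taken from the preview rows.
dates = pd.Series(pd.to_datetime(['2019-01-05', '2019-03-08', '2019-02-08']))

# 'Q' gives quarterly periods, 'M' gives monthly periods.
quarterly = dates.dt.to_period('Q')
monthly = dates.dt.to_period('M')

# Periods keep their components, and can be turned into plain text labels.
quarter_labels = quarterly.astype(str)
print(quarter_labels.tolist())
print(monthly.astype(str).tolist())
print(quarterly.dt.year.tolist(), quarterly.dt.quarter.tolist())
```

Because a period column sorts chronologically, it is usually a better grouping key than a hand-built text label.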


### Create a New Column

**Create a new column called Quarter**

In [82]:
# -- Create a new column --
df['Quarter'] = df['Date'].dt.quarter
df.head()

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating,Year,Quarter,QuarterlyPeriod
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,2019-01-05,13:08,Ewallet,522.83,4.761905,26.1415,9.1,2019,1,2019Q1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,2019-03-08,10:29,Cash,76.4,4.761905,3.82,9.6,2019,1,2019Q1
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,2019-03-03,13:23,Credit card,324.31,4.761905,16.2155,7.4,2019,1,2019Q1
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,2019-01-27,20:33,Ewallet,465.76,4.761905,23.288,8.4,2019,1,2019Q1
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2019-02-08,10:37,Ewallet,604.17,4.761905,30.2085,5.3,2019,1,2019Q1


# Split-Apply-Combine

In [83]:
# -- What is the total sales amount for each customer type? --
df.groupby('Customer type')['Total'].sum()

Customer type
Member    164223.444
Normal    158743.305
Name: Total, dtype: float64

In [84]:
# -- What is the average sales amount per transaction for each customer type? --
df.groupby('Customer type')['Total'].mean()

Customer type
Member    327.791305
Normal    318.122856
Name: Total, dtype: float64
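The two cells above each apply a single function per group. As a sketch of the same split-apply-combine idea, `agg()` can compute several statistics in one pass. The five-row `sample` frame below is copied from the preview and stands in for the full dataset, so its numbers differ from the outputs above.

```python
import pandas as pd

# Five rows copied from the preview; a stand-in for the full DataFrame.
sample = pd.DataFrame({
    'Customer type': ['Member', 'Normal', 'Normal', 'Member', 'Normal'],
    'Total': [548.9715, 80.22, 340.5255, 489.048, 634.3785],
})

# Split by customer type, apply three aggregations, combine into one table.
summary = sample.groupby('Customer type')['Total'].agg(['sum', 'mean', 'count'])
print(summary)
```

Showing the count next to the mean makes it easier to judge how much weight an average deserves.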

In [89]:
# -- Create a cross table Customer Type by Gender showing their mean totals -- 

df.groupby(['Customer type','Gender'])['Total'].mean().unstack(level=1)

Customer type / Gender,Female,Male
Member,337.727753,316.985419
Normal,332.233256,305.047581
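The same kind of cross table can also be built with `pivot_table()`, which some readers find easier to read than `groupby()` followed by `unstack()`. The sketch below compares the two approaches on five rows copied from the preview. The `sample` frame is a stand-in, so the means differ from the full-dataset output above.

```python
import pandas as pd

# Five rows copied from the preview; a stand-in for the full DataFrame.
sample = pd.DataFrame({
    'Customer type': ['Member', 'Normal', 'Normal', 'Member', 'Normal'],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'Total': [548.9715, 80.22, 340.5255, 489.048, 634.3785],
})

# Approach 1: group by both keys, then move Gender into the columns.
by_groupby = (sample.groupby(['Customer type', 'Gender'])['Total']
                    .mean()
                    .unstack(level=1))

# Approach 2: state rows, columns, values and aggregation in a single call.
by_pivot = sample.pivot_table(index='Customer type', columns='Gender',
                              values='Total', aggfunc='mean')
print(by_pivot)
```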


# Plotting

## Bar and Column Charts

* **How do sales compare across categories (e.g., customer types or product lines)?**
* **How do sales compare across cities?**

**Syntax**:

    * Column chart (vertical bars): DataFrame.plot.bar()
    * Bar chart (horizontal bars): DataFrame.plot.barh()
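To make the difference between the two calls concrete, here is a minimal sketch that draws the same totals both ways, side by side. The totals are the per-customer-type sums computed earlier. The `Agg` backend line is an assumption that lets the sketch run without a display, and can be dropped inside Jupyter.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Total sales per customer type, taken from the groupby output earlier.
totals = pd.Series({'Member': 164223.444, 'Normal': 158743.305}, name='Total')

fig, (ax_col, ax_bar) = plt.subplots(1, 2, figsize=(12, 4))

# plot.bar() draws vertical bars: the value is the bar height.
totals.plot.bar(ax=ax_col, title='Column chart: plot.bar()', rot=0)

# plot.barh() draws horizontal bars: the value is the bar width.
totals.plot.barh(ax=ax_bar, title='Bar chart: plot.barh()')

fig.tight_layout()
```

Horizontal bars tend to work better when category names are long, such as the product lines in this dataset.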

In [86]:
df.head()

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating,Year,Quarter,QuarterlyPeriod
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,2019-01-05,13:08,Ewallet,522.83,4.761905,26.1415,9.1,2019,1,2019Q1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,2019-03-08,10:29,Cash,76.4,4.761905,3.82,9.6,2019,1,2019Q1
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,2019-03-03,13:23,Credit card,324.31,4.761905,16.2155,7.4,2019,1,2019Q1
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,2019-01-27,20:33,Ewallet,465.76,4.761905,23.288,8.4,2019,1,2019Q1
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2019-02-08,10:37,Ewallet,604.17,4.761905,30.2085,5.3,2019,1,2019Q1


In [87]:
# -- Calculate total sales for each customer type --
df_category = df.groupby('Customer type')['Total'].sum()
df_category

Customer type
Member    164223.444
Normal    158743.305
Name: Total, dtype: float64

In [None]:
# -- Plot a column chart --
df_category.plot.bar(title='Sales by Customer Type',
                     xlabel='Customer type',
                     ylabel='Sales',
                     rot=45,
                     figsize=(12, 6))
plt.show()

### Add Data Labels

In [None]:
# -- Plot a column chart with data labels --
# -- plot.bar() returns a matplotlib Axes object, so it is named ax here --
ax = df_category.plot.bar(title='Sales by Customer Type',
                          xlabel='Customer type',
                          ylabel='Sales',
                          rot=45,
                          figsize=(12, 6))
ax.bar_label(ax.containers[0], label_type='edge')
plt.show()
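By default, data labels show each value at full precision, which can clutter the chart. The sketch below is one possible refinement: it rounds the labels with the `fmt` argument and adds a little spacing with `padding`. It assumes matplotlib 3.4 or later, where `bar_label` was introduced, and reuses the totals computed earlier. The `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Total sales per customer type, taken from the groupby output earlier.
totals = pd.Series({'Member': 164223.444, 'Normal': 158743.305}, name='Total')

ax = totals.plot.bar(title='Sales by Customer Type', rot=0, figsize=(6, 4))

# fmt controls the label text; padding is the gap in points above each bar.
labels = ax.bar_label(ax.containers[0], fmt='%.0f', label_type='edge', padding=3)

# Leave some headroom so the labels are not clipped by the top of the axes.
ax.margins(y=0.1)
print([text.get_text() for text in labels])
```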