<a id='top'></a>
## Data Visualization: Discovering Trends and Patterns in Online Retail Sales
### By: P. A. Ogbodum

<b> Contents Outline </b><br>
- <a href = '#introduction'> Introduction </a>
- <a href = '#data wrangling'> Data Wrangling </a>
- <a href = '#eda'> Exploratory Data Analysis </a>
- <a href = '#conclusion'> Conclusion </a>


<a id='introduction'></a>

# Introduction

**Data Set:** Online Retails Sale Dataset

**Source:** [kaggle](https://www.kaggle.com/datasets/rohitmahulkar/online-retails-sale-dataset)

**Description:** This data set was collected by an intern at Forage and contains information about online retail sales made by the Tata Group, a multinational conglomerate headquartered in Mumbai, India. It has 10 columns and 541,909 records for sales between 2010 and 2011.

**Variables:** The data set contains 10 variables, listed below: <br>
- **InvoiceNo**: Unique ID to identify sale
- **InvoiceDate**: The date the invoice was made 
- **InvoiceTime**: The time the invoice was made
- **StockCode**: Code number for the item of purchase
- **Description**: Description of the product
- **Quantity**: Quantity of the items purchased
- **UnitPrice**: Price of a single unit of the item
- **Totalsale**: Total value of the items purchased (Quantity × UnitPrice)
- **CustomerID**: Unique number of identification for each customer
- **Country**: Country where the purchase was made

**Credit:** *ROHIT MAHULKAR*

**Analysis Question:** Which product performed best in terms of sales between 2010 and 2011, and what are its characteristics?

**<a href='#top'>Go to first cell</a>**


In [1]:
# import needed modules
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import mitosheet
import re

%matplotlib inline

<a id='data wrangling'></a>
**<a href='#top'>Go to first cell</a>**
    
# Data Wrangling

> ## Data Gathering

In [2]:
# read flat file into jupyter
df = pd.read_csv('Online Retail.csv')

# check if operation was successful
df.sample(n=5)

Unnamed: 0,InvoiceNo,InvoiceDate,InvoiceTime,StockCode,Description,Quantity,UnitPrice,Totalsale,CustomerID,Country
376227,569519,04-10-2011,02:31:00 PM,23396,LE JARDIN BOTANIQUE CUSHION COVER,4,3.75,15.0,13792.0,United Kingdom
141496,548514,31-03-2011,04:03:00 PM,48184,DOORMAT ENGLISH ROSE,1,14.13,14.13,,United Kingdom
115063,546105,09-03-2011,12:21:00 PM,22861,EASTER TIN CHICKS IN GARDEN,1,1.65,1.65,14662.0,United Kingdom
44551,540180,05-01-2011,01:05:00 PM,22083,PAPER CHAIN KIT RETROSPOT,2,2.95,5.9,15984.0,United Kingdom
61000,541423,17-01-2011,05:54:00 PM,20854,BLUE PATCH PURSE PINK HEART,7,1.63,11.41,,United Kingdom


> ## Data Assessment
>> ### Visual Assessment

In [3]:
# use mitosheet to perform visual assessment
mitosheet.sheet(analysis_to_replay="id-ywyefybwwh")


In [4]:
from mitosheet import *; register_analysis("id-ywyefybwwh");
    
# Imported Online Retail.csv
import pandas as pd
Online_Retail = pd.read_csv(r'Online Retail.csv')

# Reordered column Quantity
Online_Retail_columns = [col for col in Online_Retail.columns if col != 'Quantity']
Online_Retail_columns.insert(6, 'Quantity')
Online_Retail = Online_Retail[Online_Retail_columns]

# Reordered column Description
Online_Retail_columns = [col for col in Online_Retail.columns if col != 'Description']
Online_Retail_columns.insert(3, 'Description')
Online_Retail = Online_Retail[Online_Retail_columns]

# Sorted InvoiceDate in descending order
Online_Retail = Online_Retail.sort_values(by='InvoiceDate', ascending=False, na_position='last')


In [5]:
from mitosheet import *; register_analysis("id-azmjrmwwda");
    
# Imported Online Retail.csv
import pandas as pd
Online_Retail = pd.read_csv(r'Online Retail.csv')

# Pivoted into Online_Retail
Online_Retail_pivot = pd.DataFrame(data={})

# Deleted Online_Retail_pivot
del Online_Retail_pivot


<b>Notes:</b>
- Unsuitable data type (object) for the InvoiceDate and InvoiceTime columns [quality issue].
- Negative values in the Quantity and Totalsale columns [quality issue].
- Some InvoiceNo values are alphanumeric; these appear to mark the rows with negative Quantity and Totalsale values [quality issue].
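
The data type note on the date and time columns can be illustrated with a small sketch. It assumes the dates are stored day-first as `DD-MM-YYYY` and the times in 12-hour `hh:mm:ss AM/PM` form, as the sample rows shown earlier suggest. The combined column name `InvoiceDateTime` and the three-row frame are hypothetical and used only for illustration.

```python
import pandas as pd

# Three date/time pairs copied from the sample output above (illustration only)
sample = pd.DataFrame({
    "InvoiceDate": ["04-10-2011", "31-03-2011", "05-01-2011"],
    "InvoiceTime": ["02:31:00 PM", "04:03:00 PM", "01:05:00 PM"],
})

# Combine the two string columns and parse them with an explicit day-first format
sample["InvoiceDateTime"] = pd.to_datetime(
    sample["InvoiceDate"] + " " + sample["InvoiceTime"],
    format="%d-%m-%Y %I:%M:%S %p",
)

# Sorting the raw strings is lexicographic; sorting the parsed column is chronological
by_string = sample.sort_values("InvoiceDate")["InvoiceDate"].tolist()
by_datetime = sample.sort_values("InvoiceDateTime")["InvoiceDate"].tolist()
print(by_string)
print(by_datetime)
```

The two orderings differ, which is why sorting or filtering by date is unreliable while the column is still stored as strings.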

>> ### Programmatic Assessment

In [6]:
# dataframe overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   InvoiceDate  541909 non-null  object 
 2   InvoiceTime  541909 non-null  object 
 3   StockCode    541909 non-null  object 
 4   Description  540455 non-null  object 
 5   Quantity     541909 non-null  int64  
 6   UnitPrice    541909 non-null  float64
 7   Totalsale    541909 non-null  float64
 8   CustomerID   406829 non-null  float64
 9   Country      541909 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 41.3+ MB


In [221]:
df[df.Description.isna()]

Unnamed: 0,InvoiceNo,InvoiceDate,InvoiceTime,StockCode,Description,Quantity,UnitPrice,Totalsale,CustomerID,Country
622,536414,01-12-2010,11:52:00 AM,22139,,56,0.0,0.0,,United Kingdom
1970,536545,01-12-2010,02:32:00 PM,21134,,1,0.0,0.0,,United Kingdom
1971,536546,01-12-2010,02:33:00 PM,22145,,1,0.0,0.0,,United Kingdom
1972,536547,01-12-2010,02:33:00 PM,37509,,1,0.0,0.0,,United Kingdom
1987,536549,01-12-2010,02:34:00 PM,85226A,,1,0.0,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...,...,...
535322,581199,07-12-2011,06:26:00 PM,84581,,-2,0.0,0.0,,United Kingdom
535326,581203,07-12-2011,06:31:00 PM,23406,,15,0.0,0.0,,United Kingdom
535332,581209,07-12-2011,06:35:00 PM,21620,,6,0.0,0.0,,United Kingdom
536981,581234,08-12-2011,10:33:00 AM,72817,,27,0.0,0.0,,United Kingdom


<b>Notes:</b>
- Missing values in Description and CustomerID columns [quality issue].
- CustomerID should be integer data type not float64 [quality issue].
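
CustomerID is read as float64 because the missing entries are represented as NaN, which is a floating-point value. The sketch below shows two possible ways to reach an integer type, using three hypothetical values in place of the real column: a nullable `Int64` that keeps the gaps, or filling the gaps with zero before casting.

```python
import pandas as pd

# Hypothetical stand-in for the CustomerID column, with one missing value
ids = pd.Series([13792.0, None, 14662.0], name="CustomerID")
print(ids.dtype)  # float64, because NaN forces a float column

# Option 1: nullable integer type, missing values stay missing (<NA>)
nullable = ids.astype("Int64")

# Option 2: replace missing values with 0, then cast to a plain integer
filled = ids.fillna(0).astype("int64")

print(nullable.tolist())
print(filled.tolist())
```

The second option treats 0 as a placeholder ID, so any later per-customer aggregation would need to exclude it.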

In [7]:
# check for duplicates
df.duplicated().any()

True

In [8]:
# get sum of all duplicates
df.duplicated().sum()

5268

In [9]:
# confirm number of duplicated rows 
df_dup = df[df.duplicated(keep='first')]
df_dup.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5268 entries, 517 to 541701
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    5268 non-null   object 
 1   InvoiceDate  5268 non-null   object 
 2   InvoiceTime  5268 non-null   object 
 3   StockCode    5268 non-null   object 
 4   Description  5268 non-null   object 
 5   Quantity     5268 non-null   int64  
 6   UnitPrice    5268 non-null   float64
 7   Totalsale    5268 non-null   float64
 8   CustomerID   5225 non-null   float64
 9   Country      5268 non-null   object 
dtypes: float64(3), int64(1), object(6)
memory usage: 452.7+ KB


<b>Notes:</b>
- There exist 5268 duplicated records in the dataframe [tidiness issue].

In [10]:
# Summary statistics
df.describe()

Unnamed: 0,Quantity,UnitPrice,Totalsale,CustomerID
count,541909.0,541909.0,541909.0,406829.0
mean,9.55225,4.611114,17.987795,15287.69057
std,218.081158,96.759853,378.810824,1713.600303
min,-80995.0,-11062.06,-168469.6,12346.0
25%,1.0,1.25,3.4,13953.0
50%,3.0,2.08,9.75,15152.0
75%,10.0,4.13,17.4,16791.0
max,80995.0,38970.0,168469.6,18287.0


<b>Notes:</b>
- Quantity and Totalsale columns have their minimum value as the negative of their maximum value [quality issue].
- The maximum values are far larger than the means, which suggests the presence of outliers in the *Quantity, UnitPrice, and Totalsale* columns. These might be tolerated given the context of the values [quality issue].

In [11]:
# compare the rows holding the minimum and maximum Quantity to see why the two values mirror each other
check_df = df[(df.Quantity == 80995) | (df.Quantity == -80995)]
check_df

Unnamed: 0,InvoiceNo,InvoiceDate,InvoiceTime,StockCode,Description,Quantity,UnitPrice,Totalsale,CustomerID,Country
540421,581483,09-12-2011,09:15:00 AM,23843,"PAPER CRAFT , LITTLE BIRDIE",80995,2.08,168469.6,16446.0,United Kingdom
540422,C581484,09-12-2011,09:27:00 AM,23843,"PAPER CRAFT , LITTLE BIRDIE",-80995,2.08,-168469.6,16446.0,United Kingdom


In [12]:
# select the rows whose InvoiceNo contains 'C'
c_values = df[df.InvoiceNo.str.contains('C')]

# sample the rows to see whether 'C' invoices have negative Quantity entries
c_values.sample(n=10)

Unnamed: 0,InvoiceNo,InvoiceDate,InvoiceTime,StockCode,Description,Quantity,UnitPrice,Totalsale,CustomerID,Country
144098,C548723,04-04-2011,09:02:00 AM,85063,CREAM SWEETHEART MAGAZINE RACK,-1,16.95,-16.95,15874.0,United Kingdom
115512,C546131,09-03-2011,03:08:00 PM,21218,RED SPOTTY BISCUIT TIN,-1,3.75,-3.75,16057.0,United Kingdom
288268,C562147,03-08-2011,10:47:00 AM,22064,PINK DOUGHNUT TRINKET POT,-1,1.65,-1.65,16180.0,United Kingdom
176799,C552024,05-05-2011,06:01:00 PM,22990,COTTON APRON PANTRY DESIGN,-2,4.95,-9.9,14961.0,United Kingdom
387277,C570274,10-10-2011,10:52:00 AM,22423,REGENCY CAKESTAND 3 TIER,-6,10.95,-65.7,13113.0,United Kingdom
84299,C543375,07-02-2011,03:09:00 PM,22325,MOBILE VINTAGE HEARTS,-3,4.95,-14.85,16321.0,Australia
80661,C543050,03-02-2011,10:14:00 AM,22242,5 HOOK HANGER MAGIC TOADSTOOL,-24,1.65,-39.6,12625.0,Germany
411009,C572187,21-10-2011,11:00:00 AM,POST,POSTAGE,-1,18.0,-18.0,12403.0,Denmark
235,C536391,01-12-2010,10:24:00 AM,22556,PLASTERS IN TIN CIRCUS PARADE,-12,1.65,-19.8,17548.0,United Kingdom
504695,C578991,27-11-2011,03:31:00 PM,22996,TRAVEL CARD WALLET VINTAGE TICKET,-1,0.42,-0.42,15555.0,United Kingdom
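
A random sample can only suggest that every 'C' invoice carries a negative quantity. A direct check over all matching rows is sketched below on a hypothetical four-row frame built from rows that appear in this notebook; on the full data, the same expressions would be applied to `df`.

```python
import pandas as pd

# Four rows taken from outputs shown in this notebook (illustration only)
sample = pd.DataFrame({
    "InvoiceNo": ["581483", "C581484", "C548723", "581199"],
    "Quantity": [80995, -80995, -1, -2],
})

# Rows whose invoice number starts with 'C'
is_c = sample["InvoiceNo"].str.startswith("C")

# True only if every 'C' row has a negative quantity
all_c_negative = bool((sample.loc[is_c, "Quantity"] < 0).all())

# Count of negative quantities that do not belong to a 'C' invoice
negative_without_c = int(((sample["Quantity"] < 0) & ~is_c).sum())

print(all_c_negative, negative_without_c)
```

On this sample every 'C' row is negative, but invoice 581199 has a negative quantity without the prefix, so dropping the 'C' rows alone would not remove every negative value.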


In [13]:
# see number of rows in this sub-dataframe
len(c_values.InvoiceNo)

9288

<b>Rationale:</b> *c_values* has 9288 rows while *df_dup* has only 5268, so not all rows in *c_values* can be duplicates.

In [14]:
# rows that both contain 'C' in InvoiceNo and duplicate an earlier row
c_and_dup = df[(df.InvoiceNo.str.contains('C')) & (df.duplicated())]

# get sample of dataframe above
c_and_dup.sample(n=8)

Unnamed: 0,InvoiceNo,InvoiceDate,InvoiceTime,StockCode,Description,Quantity,UnitPrice,Totalsale,CustomerID,Country
133118,C547725,25-03-2011,10:43:00 AM,22924,FRIDGE MAGNETS LA VIE EN ROSE,-72,0.85,-61.2,,United Kingdom
24183,C538341,10-12-2010,02:03:00 PM,22725,ALARM CLOCK BAKELIKE CHOCOLATE,-1,3.75,-3.75,15514.0,United Kingdom
361735,C568370,26-09-2011,04:43:00 PM,90143,SILVER BRACELET W PASTEL FLOWER,-1,7.5,-7.5,15154.0,United Kingdom
86901,C543611,10-02-2011,02:38:00 PM,82483,WOOD 2 DRAWER CABINET WHITE FINISH,-1,4.95,-4.95,17850.0,United Kingdom
86898,C543611,10-02-2011,02:38:00 PM,82483,WOOD 2 DRAWER CABINET WHITE FINISH,-1,4.95,-4.95,17850.0,United Kingdom
235951,C557663,21-06-2011,05:59:00 PM,21121,SET/10 RED POLKADOT PARTY CANDLES,-24,1.25,-30.0,,EIRE
390549,C570556,11-10-2011,11:10:00 AM,20969,RED FLORAL FELTCRAFT SHOULDER BAG,-144,3.39,-488.16,16029.0,United Kingdom
440149,C574510,04-11-2011,01:25:00 PM,22360,GLASS JAR ENGLISH CONFECTIONERY,-1,2.95,-2.95,15110.0,United Kingdom


In [15]:
# get length of dataframe
len(c_and_dup)

37

<b>Rationale:</b> <i>Since I could not determine the meaning of the <b>C</b> prefix in the <b>InvoiceNo</b> column, I will exclude those rows from this analysis.</i>

**Summary of Assessment:**
<hr>

**Quality issues**
- Unsuitable data type (object) for the InvoiceDate and InvoiceTime columns.
- Negative values in the Quantity and Totalsale columns.
- Some InvoiceNo values are alphanumeric; these appear to mark the rows with negative Quantity and Totalsale values.
- Missing values in Description and CustomerID columns.
- CustomerID should be integer data type not float64.
- Quantity and Totalsale columns have their minimum value as the negative of their maximum value.
- The maximum values are far larger than the means, which suggests the presence of outliers in the Quantity, UnitPrice, and Totalsale columns. These might be tolerated given the context of the values (will be addressed during exploratory analysis).

**Tidiness issue**
- There exist 5268 duplicated records in the dataframe.

> ## Data Cleaning

### Define
- Convert the *InvoiceDate and InvoiceTime* columns from object (string) to datetime.
- Drop rows with negative values in the *Quantity and Totalsale* columns.
- Drop rows with 'C'-prefixed values in the *InvoiceNo* column.
- Fill missing values with the text 'Not Available' in the *Description* column and with zero (0) in the *CustomerID* column.
- Change the *CustomerID* data type from float to integer.
- Drop duplicated rows.

### Code


In [237]:
# duplicate dataframe to easily access original copy in the event of error.
df_clean = df.copy()

# test
# df_clean.sample(n=10)
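
The steps listed under Define could be implemented along the following lines. This is a minimal sketch under stated assumptions, not the author's code: the helper name `clean_sales` and the combined `InvoiceDateTime` column are hypothetical, and the small `raw` frame (rows taken from earlier outputs, plus one deliberately repeated row) stands in for `df_clean`.

```python
import pandas as pd

def clean_sales(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the Define list and return a new frame."""
    # Drop exact duplicate rows, keeping the first occurrence
    out = frame.drop_duplicates()

    # Keep rows without a 'C' prefix and without negative Quantity or Totalsale
    is_c = out["InvoiceNo"].astype(str).str.startswith("C")
    non_negative = (out["Quantity"] >= 0) & (out["Totalsale"] >= 0)
    out = out[~is_c & non_negative].copy()

    # Fill missing values as described in the Define list
    out["Description"] = out["Description"].fillna("Not Available")
    out["CustomerID"] = out["CustomerID"].fillna(0).astype("int64")

    # Parse day-first dates and 12-hour times into one datetime column (assumed formats)
    out["InvoiceDateTime"] = pd.to_datetime(
        out["InvoiceDate"] + " " + out["InvoiceTime"],
        format="%d-%m-%Y %I:%M:%S %p",
    )
    return out.reset_index(drop=True)

# Small stand-in for df_clean; the last row repeats the first on purpose
raw = pd.DataFrame({
    "InvoiceNo": ["581483", "C581484", "536414", "581199", "581483"],
    "InvoiceDate": ["09-12-2011", "09-12-2011", "01-12-2010", "07-12-2011", "09-12-2011"],
    "InvoiceTime": ["09:15:00 AM", "09:27:00 AM", "11:52:00 AM", "06:26:00 PM", "09:15:00 AM"],
    "Description": ["PAPER CRAFT , LITTLE BIRDIE", "PAPER CRAFT , LITTLE BIRDIE",
                    None, None, "PAPER CRAFT , LITTLE BIRDIE"],
    "Quantity": [80995, -80995, 56, -2, 80995],
    "Totalsale": [168469.6, -168469.6, 0.0, 0.0, 168469.6],
    "CustomerID": [16446.0, 16446.0, None, None, 16446.0],
})

cleaned = clean_sales(raw)
print(cleaned[["InvoiceNo", "Description", "Quantity", "CustomerID", "InvoiceDateTime"]])
```

Filtering on both the 'C' prefix and the sign of the values is deliberate, since a negative quantity can appear on an invoice without the prefix.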

### Test
-

<a id='eda'></a>
**<a href='#top'>Go to first cell</a>**
    
# Exploratory Data Analysis

Address issue of outliers.
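
One common way to examine outliers such as those noted during assessment is the interquartile-range (IQR) rule, which flags values lying more than 1.5 × IQR beyond the first or third quartile. The sketch below is an assumption about how this step might be approached, not the method chosen for this analysis; the `flag_outliers` helper and the sample quantities are hypothetical.

```python
import pandas as pd

def flag_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical quantities: mostly small orders plus one very large one
quantity = pd.Series([1, 2, 3, 4, 6, 10, 12, 80995], name="Quantity")
mask = flag_outliers(quantity)

# Show the flagged values
print(quantity[mask].tolist())
```

Flagged rows are candidates for inspection rather than automatic removal, since a very large order may be a genuine sale.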

<a id='conclusion'></a>
**<a href='#top'>Go to first cell</a>**
    
# Conclusion