<a id='top'></a>
## Data Visualization: Discovering Trends and Patterns in Online Retail Sales
### By: P. A. Ogbodum

<b> Contents Outline </b><br>
- <a href = '#introduction'> Introduction </a>
- <a href = '#data wrangling'> Data Wrangling </a>
- <a href = '#eda'> Exploratory Data Analysis </a>
- <a href = '#conclusion'> Conclusion </a>


<a id='introduction'></a>

# Introduction

**Data Set:** Online Retails Sale Dataset

**Source:** [Kaggle](https://www.kaggle.com/datasets/rohitmahulkar/online-retails-sale-dataset)

**Description:** This data set was collected by an intern at Forage and contains information about online retail sales made by the Tata Group, a multinational conglomerate headquartered in Mumbai, India. It has 10 columns and 541,909 records covering sales between 2010 and 2011.

**Variables:** The data set contains the 10 variables listed below: <br>
- **InvoiceNo**: Unique ID identifying the sale
- **InvoiceDate**: The date the invoice was issued
- **InvoiceTime**: The time the invoice was issued
- **StockCode**: Code number of the purchased item
- **Description**: Description of the product
- **Quantity**: Number of units of the item purchased
- **UnitPrice**: Price of a single unit of the item
- **Totalsale**: Total value of the items purchased on the invoice line
- **CustomerID**: Unique identification number for each customer
- **Country**: Country where the purchase was made

**Credit:** *ROHIT MAHULKAR*

**Analysis Question:** Which product performed best in terms of sales between 2010 and 2011, and what are its characteristics?

**<a href='#top'>Go to first cell</a>**


In [1]:
# import needed modules
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import mitosheet
import re

%matplotlib inline

<a id='data wrangling'></a>
**<a href='#top'>Go to first cell</a>**
    
# Data Wrangling

> ## Data Gathering

In [2]:
# read flat file into jupyter
df = pd.read_csv('Online Retail.csv')

# check if operation was successful
df.sample(n=5)

| | InvoiceNo | InvoiceDate | InvoiceTime | StockCode | Description | Quantity | UnitPrice | Totalsale | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|---|---|
| 313376 | C564488 | 25-08-2011 | 02:12:00 PM | 23144 | ZINC T-LIGHT HOLDER STARS SMALL | -9 | 0.83 | -7.47 | 16859.0 | United Kingdom |
| 1581 | 536544 | 01-12-2010 | 02:32:00 PM | 22451 | SILK PURSE BABUSHKA RED | 2 | 6.77 | 13.54 | NaN | United Kingdom |
| 343542 | 566952 | 15-09-2011 | 04:37:00 PM | 22423 | REGENCY CAKESTAND 3 TIER | 1 | 24.96 | 24.96 | NaN | United Kingdom |
| 325964 | 565465 | 05-09-2011 | 09:46:00 AM | 23129 | HEART SHAPED HOLLY WREATH | 4 | 4.15 | 16.6 | 15364.0 | United Kingdom |
| 252336 | 559109 | 06-07-2011 | 11:52:00 AM | 23226 | FILIGREE HEART DAISY WHITE | 2 | 1.25 | 2.5 | 15021.0 | United Kingdom |
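
In the sampled rows, Totalsale appears to equal Quantity multiplied by UnitPrice. The sketch below checks that relationship on the five sampled rows only; `sample`, `expected`, `matches` and `mismatches` are illustrative names, and whether the relationship holds for the whole file is an assumption that would need the same check on `df`.

```python
import numpy as np
import pandas as pd

# Values copied from the five sampled rows above (not the full data set)
sample = pd.DataFrame({
    "Quantity": [-9, 2, 1, 4, 2],
    "UnitPrice": [0.83, 6.77, 24.96, 4.15, 1.25],
    "Totalsale": [-7.47, 13.54, 24.96, 16.6, 2.5],
})

# Compare with a tolerance because prices are floating-point numbers
expected = sample["Quantity"] * sample["UnitPrice"]
matches = np.isclose(expected, sample["Totalsale"])

# Rows where the stored total deviates from Quantity * UnitPrice
mismatches = sample[~matches]
print(f"{len(mismatches)} of {len(sample)} rows deviate")
```

Running the same comparison on the full dataframe would show whether Totalsale can be treated as a derived column.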


> ## Data Assessment
>> ### Visual Assessment

In [3]:
# use mitosheet to perform visual assessment
mitosheet.sheet(analysis_to_replay="id-ywyefybwwh")


In [4]:
from mitosheet import *; register_analysis("id-ywyefybwwh");
    
# Imported Online Retail.csv
import pandas as pd
Online_Retail = pd.read_csv(r'Online Retail.csv')

# Reordered column Quantity
Online_Retail_columns = [col for col in Online_Retail.columns if col != 'Quantity']
Online_Retail_columns.insert(6, 'Quantity')
Online_Retail = Online_Retail[Online_Retail_columns]

# Reordered column Description
Online_Retail_columns = [col for col in Online_Retail.columns if col != 'Description']
Online_Retail_columns.insert(3, 'Description')
Online_Retail = Online_Retail[Online_Retail_columns]


In [5]:
from mitosheet import *; register_analysis("id-azmjrmwwda");
    
# Imported Online Retail.csv
import pandas as pd
Online_Retail = pd.read_csv(r'Online Retail.csv')

# Pivoted into Online_Retail
Online_Retail_pivot = pd.DataFrame(data={})

# Deleted Online_Retail_pivot
del Online_Retail_pivot


<b>Notes:</b>
- Unsuitable data types for the InvoiceDate and InvoiceTime columns (dates and times are stored as text).
- Inconsistent formatting of InvoiceDate records.
- Implausible negative values in the Quantity and Totalsale columns.
- Some InvoiceNo values are alphanumeric (they contain a 'C'); these seem to mark the rows with negative Quantity and Totalsale values.
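
One possible way to address the first two notes is sketched below. It assumes the day-month-year and 12-hour clock layouts seen in the sampled rows (`%d-%m-%Y` and `%I:%M:%S %p`); `sample` is a hypothetical stand-in for `df`, and its third date is deliberately off-format to show how inconsistent records surface.

```python
import pandas as pd

# Hypothetical sample mirroring the InvoiceDate / InvoiceTime layout seen above
sample = pd.DataFrame({
    "InvoiceDate": ["25-08-2011", "01-12-2010", "2011/09/15"],  # last value is deliberately off-format
    "InvoiceTime": ["02:12:00 PM", "02:32:00 PM", "04:37:00 PM"],
})

# Parse day-month-year strings; values that do not match become NaT instead of raising
sample["InvoiceDate"] = pd.to_datetime(
    sample["InvoiceDate"], format="%d-%m-%Y", errors="coerce"
)

# Parse 12-hour clock strings and keep only the time-of-day component
sample["InvoiceTime"] = pd.to_datetime(
    sample["InvoiceTime"], format="%I:%M:%S %p", errors="coerce"
).dt.time

# Rows whose date failed to parse can then be inspected separately
bad_dates = sample[sample["InvoiceDate"].isna()]
print(bad_dates)
```

Using `errors="coerce"` keeps the conversion from failing on the first inconsistent record, at the cost of having to review the resulting `NaT` rows afterwards.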

>> ### Programmatic Assessment

In [6]:
# dataframe overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   InvoiceDate  541909 non-null  object 
 2   InvoiceTime  541909 non-null  object 
 3   StockCode    541909 non-null  object 
 4   Description  540455 non-null  object 
 5   Quantity     541909 non-null  int64  
 6   UnitPrice    541909 non-null  float64
 7   Totalsale    541909 non-null  float64
 8   CustomerID   406829 non-null  float64
 9   Country      541909 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 41.3+ MB


<b>Notes:</b>
- Missing values in the Description and CustomerID columns.
- CustomerID is an identifier, so it should be a string/object data type rather than float64.
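
A minimal sketch of the CustomerID conversion follows, assuming every non-missing ID is a whole number (as in the rows shown so far). `sample` is a hypothetical stand-in for `df`. Going through pandas' nullable `Int64` type first is one option among several; it avoids IDs ending up as text like `"16859.0"`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: IDs are read as float64 because of the missing values
sample = pd.DataFrame({"CustomerID": [16859.0, np.nan, 15364.0]})

# Convert to nullable integers first so 16859.0 becomes "16859" rather than "16859.0";
# missing values stay missing (<NA>) instead of turning into the text "nan"
sample["CustomerID"] = sample["CustomerID"].astype("Int64").astype("string")

print(sample["CustomerID"].tolist())
print(sample["CustomerID"].dtype)
```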

In [7]:
# check for duplicates
df.duplicated().any()

True

In [8]:
# get sum of all duplicates
df.duplicated().sum()

5268

In [9]:
# confirm number of duplicated rows 
df_dup = df[df.duplicated(keep='first')]
df_dup.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5268 entries, 517 to 541701
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    5268 non-null   object 
 1   InvoiceDate  5268 non-null   object 
 2   InvoiceTime  5268 non-null   object 
 3   StockCode    5268 non-null   object 
 4   Description  5268 non-null   object 
 5   Quantity     5268 non-null   int64  
 6   UnitPrice    5268 non-null   float64
 7   Totalsale    5268 non-null   float64
 8   CustomerID   5225 non-null   float64
 9   Country      5268 non-null   object 
dtypes: float64(3), int64(1), object(6)
memory usage: 452.7+ KB


<b>Notes:</b>
- The dataframe contains 5,268 duplicated records, i.e. rows identical in every column to an earlier row.
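
If these rows are treated as true duplicates (an assumption; identical invoice lines could also be legitimate repeats), they could be removed roughly as sketched here. `sample` and `deduped` are hypothetical names, and the three rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in which the second row repeats the first exactly
sample = pd.DataFrame({
    "InvoiceNo": ["536544", "536544", "565465"],
    "StockCode": ["22451", "22451", "23129"],
    "Quantity": [2, 2, 4],
})

# Count rows that repeat an earlier row in every column
n_dupes = sample.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduped = sample.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes, len(deduped))
```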

In [10]:
# Summary statistics
df.describe()

| | Quantity | UnitPrice | Totalsale | CustomerID |
|---|---|---|---|---|
| count | 541909.0 | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 17.987795 | 15287.69057 |
| std | 218.081158 | 96.759853 | 378.810824 | 1713.600303 |
| min | -80995.0 | -11062.06 | -168469.6 | 12346.0 |
| 25% | 1.0 | 1.25 | 3.4 | 13953.0 |
| 50% | 3.0 | 2.08 | 9.75 | 15152.0 |
| 75% | 10.0 | 4.13 | 17.4 | 16791.0 |
| max | 80995.0 | 38970.0 | 168469.6 | 18287.0 |


<b>Notes:</b>
- The minimum values of the Quantity and Totalsale columns are the exact negatives of their maximum values (±80,995 and ±168,469.6).

In [12]:
# inspect the rows holding the minimum and maximum Quantity to see why the two extremes mirror each other
check_df = df[(df.Quantity == 80995) | (df.Quantity == -80995)]
check_df

| | InvoiceNo | InvoiceDate | InvoiceTime | StockCode | Description | Quantity | UnitPrice | Totalsale | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|---|---|
| 540421 | 581483 | 09-12-2011 | 09:15:00 AM | 23843 | PAPER CRAFT , LITTLE BIRDIE | 80995 | 2.08 | 168469.6 | 16446.0 | United Kingdom |
| 540422 | C581484 | 09-12-2011 | 09:27:00 AM | 23843 | PAPER CRAFT , LITTLE BIRDIE | -80995 | 2.08 | -168469.6 | 16446.0 | United Kingdom |


In [43]:
# select the rows whose InvoiceNo contains the character 'C'
c_values = df[df.InvoiceNo.str.contains('C')]

# sample the subset to see whether these rows have negative Quantity entries
c_values.sample(n=10)

| | InvoiceNo | InvoiceDate | InvoiceTime | StockCode | Description | Quantity | UnitPrice | Totalsale | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|---|---|
| 82038 | C543185 | 04-02-2011 | 11:22:00 AM | 22333 | RETROSPOT PARTY BAG + STICKER SET | -14 | 1.65 | -23.1 | NaN | United Kingdom |
| 246858 | C558736 | 01-07-2011 | 03:14:00 PM | 23146 | TRIPLE HOOK ANTIQUE IVORY ROSE | -2 | 3.29 | -6.58 | 18075.0 | United Kingdom |
| 147060 | C549050 | 06-04-2011 | 10:17:00 AM | 21314 | SMALL GLASS HEART TRINKET POT | -3 | 2.1 | -6.3 | 13767.0 | United Kingdom |
| 337735 | C566459 | 12-09-2011 | 05:17:00 PM | 21535 | RED RETROSPOT SMALL MILK JUG | -2 | 2.55 | -5.1 | 13236.0 | United Kingdom |
| 363775 | C568581 | 28-09-2011 | 09:59:00 AM | 22778 | GLASS CLOCHE SMALL | -1 | 3.95 | -3.95 | 15804.0 | United Kingdom |
| 329645 | C565848 | 07-09-2011 | 12:48:00 PM | 22385 | JUMBO BAG SPACEBOY DESIGN | -1 | 2.08 | -2.08 | 14606.0 | United Kingdom |
| 234537 | C557566 | 21-06-2011 | 10:39:00 AM | 22103 | MIRROR MOSAIC T-LIGHT HOLDER ROUND | -2 | 1.65 | -3.3 | 15998.0 | United Kingdom |
| 326956 | C565615 | 05-09-2011 | 03:20:00 PM | 21197 | MULTICOLOUR CONFETTI IN TUBE | -21 | 1.65 | -34.65 | 12683.0 | France |
| 537497 | C581305 | 08-12-2011 | 11:42:00 AM | 82484 | WOOD BLACK BOARD ANT WHITE FINISH | -1 | 7.95 | -7.95 | 16933.0 | United Kingdom |
| 133265 | C547763 | 25-03-2011 | 11:31:00 AM | 85150 | LADIES & GENTLEMEN METAL SIGN | -1 | 2.55 | -2.55 | 14194.0 | United Kingdom |
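
A random sample can only hint at the pattern. The sketch below shows one way to test it exhaustively in both directions. It assumes the 'C' always appears as a prefix, as it does in every sampled value, so it uses `str.startswith` instead of the `str.contains` call above. `sample` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical sample mixing regular and 'C'-prefixed invoices
sample = pd.DataFrame({
    "InvoiceNo": ["C564488", "536544", "C543185", "565465"],
    "Quantity": [-9, 2, -14, 4],
})

# Boolean mask marking the invoices that start with 'C'
is_c = sample["InvoiceNo"].str.startswith("C")

# Direction 1: does every 'C' invoice carry a negative Quantity?
all_c_negative = (sample.loc[is_c, "Quantity"] < 0).all()

# Direction 2: does every negative Quantity belong to a 'C' invoice?
all_negative_are_c = is_c[sample["Quantity"] < 0].all()

print(all_c_negative, all_negative_are_c)
```

If either check came back `False` on the real data, the rows breaking the pattern would be worth inspecting before deciding how to treat them.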


In [44]:
# see number of rows in this sub-dataframe
len(c_values.InvoiceNo)

9288

<b>Rationale:</b> *c_values* has 9,288 rows while *df_dup* has only 5,268, so duplication alone cannot account for every row whose InvoiceNo contains 'C'.

In [54]:
# subset of rows that both contain 'C' in InvoiceNo and are duplicates of an earlier row
c_and_dup = df[(df.InvoiceNo.str.contains('C')) & (df.duplicated())]

# get sample of dataframe above
c_and_dup.sample(n=8)

| | InvoiceNo | InvoiceDate | InvoiceTime | StockCode | Description | Quantity | UnitPrice | Totalsale | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|---|---|
| 24169 | C538341 | 10-12-2010 | 02:03:00 PM | 22730 | ALARM CLOCK BAKELIKE IVORY | -1 | 3.75 | -3.75 | 15514.0 | United Kingdom |
| 86890 | C543611 | 10-02-2011 | 02:38:00 PM | 82483 | WOOD 2 DRAWER CABINET WHITE FINISH | -1 | 4.95 | -4.95 | 17850.0 | United Kingdom |
| 96695 | C544580 | 21-02-2011 | 02:25:00 PM | S | SAMPLES | -1 | 9.74 | -9.74 | NaN | United Kingdom |
| 293157 | C562582 | 07-08-2011 | 01:53:00 PM | 21452 | TOADSTOOL MONEY BOX | -1 | 2.95 | -2.95 | 15640.0 | United Kingdom |
| 24177 | C538341 | 10-12-2010 | 02:03:00 PM | 22727 | ALARM CLOCK BAKELIKE RED | -1 | 3.75 | -3.75 | 15514.0 | United Kingdom |
| 215598 | C555723 | 06-06-2011 | 04:21:00 PM | 22171 | 3 HOOK PHOTO SHELF ANTIQUE WHITE | -2 | 8.5 | -17.0 | 15737.0 | United Kingdom |
| 461408 | C575940 | 13-11-2011 | 11:38:00 AM | 23309 | SET OF 60 I LOVE LONDON CAKE CASES | -24 | 0.55 | -13.2 | 17838.0 | United Kingdom |
| 24183 | C538341 | 10-12-2010 | 02:03:00 PM | 22725 | ALARM CLOCK BAKELIKE CHOCOLATE | -1 | 3.75 | -3.75 | 15514.0 | United Kingdom |


In [55]:
# get length of dataframe
len(c_and_dup)

37

<i>Since I cannot determine what the <b>C</b> in the <b>InvoiceNo</b> column means, I will mask out those rows for this analysis.</i>
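
The masking step could look roughly like the sketch below, which reuses the same `str.contains('C')` filter as the cells above. `sample` and `no_c` are hypothetical names, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample mixing regular and 'C' invoices
sample = pd.DataFrame({
    "InvoiceNo": ["C538341", "536544", "C543611", "565465"],
    "Quantity": [-1, 2, -1, 4],
})

# True for every row whose InvoiceNo contains the character 'C'
has_c = sample["InvoiceNo"].str.contains("C")

# Keep only the rows without 'C'; copy() avoids chained-assignment warnings later
no_c = sample[~has_c].copy()

print(len(sample), len(no_c))
```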

In [58]:
# check which columns contain missing values
df.isna().any()

InvoiceNo      False
InvoiceDate    False
InvoiceTime    False
StockCode      False
Description     True
Quantity       False
UnitPrice      False
Totalsale      False
CustomerID      True
Country        False
dtype: bool
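
The `df.info()` output earlier implies 541,909 − 540,455 = 1,454 missing Description values and 541,909 − 406,829 = 135,080 missing CustomerID values. A sketch of how such counts and percentages can be computed directly is shown below on a hypothetical `sample`; the column names `n_missing` and `pct_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in both affected columns
sample = pd.DataFrame({
    "Description": ["SILK PURSE BABUSHKA RED", None,
                    "REGENCY CAKESTAND 3 TIER", "HEART SHAPED HOLLY WREATH"],
    "CustomerID": [np.nan, 16859.0, np.nan, 15364.0],
})

# Number of missing values per column
missing = sample.isna().sum().to_frame("n_missing")

# Share of rows affected, as a percentage
missing["pct_missing"] = (missing["n_missing"] / len(sample) * 100).round(2)

print(missing)
```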

> ## Data Cleaning

<a id='eda'></a>
**<a href='#top'>Go to first cell</a>**
    
# Exploratory Data Analysis

<a id='conclusion'></a>
**<a href='#top'>Go to first cell</a>**
    
# Conclusion