<a id='top'></a>
## Data Visualization: Discovering Trends and Patterns in Online Retail Sales
### By: P. A. Ogbodum

<b> Contents Outline </b><br>
- <a href = '#introduction'> Introduction </a>
- <a href = '#data wrangling'> Data Wrangling </a>
- <a href = '#eda'> Exploratory Data Analysis </a>
- <a href = '#conclusion'> Conclusion </a>


<a id='introduction'></a>

# Introduction

**Data Set:** Online Retails Sale Dataset

**Source:** [kaggle](https://www.kaggle.com/datasets/rohitmahulkar/online-retails-sale-dataset)

**Description:** This data set was collected by an intern at Forage and contains information about online retail sales made by the Tata Group, a multinational conglomerate headquartered in Mumbai, India. It has 10 columns and 541,909 records covering sales between 2010 and 2011.

**Variables:** The data set contains 10 variables, listed below:<br>
- **InvoiceNo**: Unique ID identifying the sale
- **InvoiceDate**: The date the invoice was issued
- **InvoiceTime**: The time the invoice was issued
- **StockCode**: Code number of the item purchased
- **Description**: Description of the product
- **Quantity**: Number of units of the item purchased
- **UnitPrice**: Price of a single unit of the item
- **Totalsale**: Total value of the items purchased
- **CustomerID**: Unique identification number for each customer
- **Country**: Country where the purchase was made
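
As a quick sanity check on the variable definitions, the sketch below tests whether `Totalsale` equals `Quantity` multiplied by `UnitPrice`. The three rows are copied from the `df.sample()` output in the Data Gathering step. The `sample` and `matches` names are introduced here for illustration only, and the relationship itself is an assumption to verify, not something the data source documents.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df.sample() output shown in this notebook
sample = pd.DataFrame({
    "Quantity": [2, 2, 12],
    "UnitPrice": [6.25, 2.46, 2.95],
    "Totalsale": [12.5, 4.92, 35.4],
})

# Recompute the total and compare with a floating-point tolerance
expected = sample["Quantity"] * sample["UnitPrice"]
matches = np.isclose(sample["Totalsale"], expected)

print(matches.all())
```

Running the same comparison on the full dataframe would show whether the relationship holds for every record.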

**Credit:** *ROHIT MAHULKAR*

**Analysis Question:** Which product performed best in terms of sales between 2010 and 2011, and what are its characteristics?
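
One possible way to approach this question, sketched under assumptions: sum `Totalsale` per product `Description` and rank the totals. The `toy` frame, its product names, and its values are hypothetical and exist only to make the example runnable. Ranking by revenue is one reading of "best performing"; ranking by `Quantity` would be another.

```python
import pandas as pd

# Hypothetical data standing in for the real dataframe
toy = pd.DataFrame({
    "Description": ["ITEM A", "ITEM B", "ITEM A", "ITEM C"],
    "Totalsale": [10.0, 25.0, 20.0, 5.0],
})

# Total sales value per product, highest first
sales_by_product = (
    toy.groupby("Description")["Totalsale"]
    .sum()
    .sort_values(ascending=False)
)

# The first index label is the product with the largest total
best_product = sales_by_product.index[0]
print(best_product, sales_by_product.iloc[0])
```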

**<a href='#top'>Go to first cell</a>**


In [1]:
# import needed modules
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import mitosheet

%matplotlib inline

<a id='data wrangling'></a>
**<a href='#top'>Go to first cell</a>**
    
# Data Wrangling

> ## Data Gathering

In [2]:
# read flat file into jupyter
df = pd.read_csv('Online Retail.csv')

# check if operation was successful
df.sample(n=5)

| | InvoiceNo | InvoiceDate | InvoiceTime | StockCode | Description | Quantity | UnitPrice | Totalsale | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|---|---|
| 333506 | 566198 | 09-09-2011 | 01:58:00 PM | 23400 | SHELF WITH 4 HOOKS HOME SWEET HOME | 2 | 6.25 | 12.5 | 18109.0 | United Kingdom |
| 370834 | 569202 | 30-09-2011 | 05:22:00 PM | 20978 | 36 PENCILS TUBE SKULLS | 2 | 2.46 | 4.92 | NaN | United Kingdom |
| 508405 | 579196 | 28-11-2011 | 03:54:00 PM | 84032B | CHARLIE + LOLA RED HOT WATER BOTTLE | 1 | 3.29 | 3.29 | 14096.0 | United Kingdom |
| 215436 | 555719 | 06-06-2011 | 03:31:00 PM | 22326 | ROUND SNACK BOXES SET OF4 WOODLAND | 12 | 2.95 | 35.4 | 12609.0 | Germany |
| 193188 | 553521 | 17-05-2011 | 02:35:00 PM | 22684 | FRENCH BLUE METAL DOOR SIGN 9 | 2 | 0.0 | 0.0 | NaN | United Kingdom |


> ## Data Assessment
>> ### Visual Assessment

In [3]:
# use mitosheet to perform visual assessment
mitosheet.sheet(analysis_to_replay="id-ywyefybwwh")


In [4]:
from mitosheet import *; register_analysis("id-ywyefybwwh");
    
# Imported Online Retail.csv
import pandas as pd
Online_Retail = pd.read_csv(r'Online Retail.csv')

# Reordered column Quantity
Online_Retail_columns = [col for col in Online_Retail.columns if col != 'Quantity']
Online_Retail_columns.insert(6, 'Quantity')
Online_Retail = Online_Retail[Online_Retail_columns]

# Reordered column Description
Online_Retail_columns = [col for col in Online_Retail.columns if col != 'Description']
Online_Retail_columns.insert(3, 'Description')
Online_Retail = Online_Retail[Online_Retail_columns]


In [7]:
from mitosheet import *; register_analysis("id-azmjrmwwda");
    
# Imported Online Retail.csv
import pandas as pd
Online_Retail = pd.read_csv(r'Online Retail.csv')

# Pivoted into Online_Retail
Online_Retail_pivot = pd.DataFrame(data={})

# Deleted Online_Retail_pivot
del Online_Retail_pivot

<b>Notes:</b>
- The InvoiceDate and InvoiceTime columns are not stored as date/time data types.
- InvoiceDate records are formatted inconsistently.
- The Quantity and Totalsale columns contain implausible negative values.
- Some InvoiceNo values are alphanumeric; these appear to mark the rows with negative Quantity and Totalsale values.
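
The notes above could be checked programmatically along the following lines. This is a minimal sketch, not the notebook's own method. The `toy` frame is hypothetical: its date and time strings follow the pattern seen in the sample output, while the `C` prefix on one invoice number and the negative quantity are invented for illustration. The day-first format string is an assumption; `errors="coerce"` turns any record that does not match it into `NaT`, so inconsistently formatted dates can be counted.

```python
import pandas as pd

# Hypothetical rows; the "C" prefix and negative quantity are illustrative
toy = pd.DataFrame({
    "InvoiceNo": ["566198", "C569202", "579196"],
    "InvoiceDate": ["09-09-2011", "30-09-2011", "28-11-2011"],
    "InvoiceTime": ["01:58:00 PM", "05:22:00 PM", "03:54:00 PM"],
    "Quantity": [2, -2, 1],
})

# Combine date and time into one datetime column (assumed day-first format)
toy["InvoiceDatetime"] = pd.to_datetime(
    toy["InvoiceDate"] + " " + toy["InvoiceTime"],
    format="%d-%m-%Y %I:%M:%S %p",
    errors="coerce",  # non-matching records become NaT
)

# Flag invoice numbers that are not purely numeric, and negative quantities
non_numeric_invoice = ~toy["InvoiceNo"].str.isdigit()
negative_quantity = toy["Quantity"] < 0

# Cross-tabulate the two flags to see how often they coincide
print(pd.crosstab(non_numeric_invoice, negative_quantity))
```

If the two flags coincide on the real data, that would support the reading that alphanumeric invoice numbers mark the negative rows.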

>> ### Programmatic Assessment

In [8]:
# dataframe overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   InvoiceDate  541909 non-null  object 
 2   InvoiceTime  541909 non-null  object 
 3   StockCode    541909 non-null  object 
 4   Description  540455 non-null  object 
 5   Quantity     541909 non-null  int64  
 6   UnitPrice    541909 non-null  float64
 7   Totalsale    541909 non-null  float64
 8   CustomerID   406829 non-null  float64
 9   Country      541909 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 41.3+ MB


<b>Notes:</b>
- Missing values in Description and CustomerID columns.
- CustomerID should be a string/object data type, not float64.
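
A sketch of how both notes might be acted on, using a small hypothetical `toy` frame in place of `df`. Going through the nullable `Int64` type first is a design choice made here, not something the notebook prescribes: it drops the trailing `.0` and keeps missing IDs as missing values instead of turning them into the text `"nan"`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing Description and one missing CustomerID
toy = pd.DataFrame({
    "Description": ["SHELF WITH 4 HOOKS HOME SWEET HOME", None, "36 PENCILS TUBE SKULLS"],
    "CustomerID": [18109.0, 14096.0, np.nan],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# float64 -> nullable integer -> string, so 18109.0 becomes "18109"
toy["CustomerID"] = toy["CustomerID"].astype("Int64").astype("string")
print(toy["CustomerID"])
```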

In [10]:
# check for duplicates
df.duplicated().any()

True

In [13]:
# get sum of all duplicates
df.duplicated().sum()

5268

In [46]:
# confirm number of duplicated rows 
df_dup = df[df.duplicated(keep='first')]
df_dup.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5268 entries, 517 to 541701
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    5268 non-null   object 
 1   InvoiceDate  5268 non-null   object 
 2   InvoiceTime  5268 non-null   object 
 3   StockCode    5268 non-null   object 
 4   Description  5268 non-null   object 
 5   Quantity     5268 non-null   int64  
 6   UnitPrice    5268 non-null   float64
 7   Totalsale    5268 non-null   float64
 8   CustomerID   5225 non-null   float64
 9   Country      5268 non-null   object 
dtypes: float64(3), int64(1), object(6)
memory usage: 452.7+ KB


<b>Notes:</b>
- The dataframe contains 5,268 duplicated records.
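
During cleaning, duplicates like these could be dropped with `drop_duplicates`. This is a minimal sketch on a hypothetical `toy` frame, keeping the first occurrence of each row to mirror the `keep='first'` setting used in the check above.

```python
import pandas as pd

# Hypothetical rows; the second row exactly repeats the first
toy = pd.DataFrame({
    "InvoiceNo": ["566198", "566198", "579196"],
    "StockCode": ["23400", "23400", "84032B"],
    "Quantity": [2, 2, 1],
})

# Count rows that repeat an earlier row
n_duplicates = toy.duplicated(keep="first").sum()

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates, len(deduped))
```

Whether identical rows in this data set are true duplicates or legitimate repeat line items is a judgment call to make before dropping them.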

> ## Data Cleaning

<a id='eda'></a>
**<a href='#top'>Go to first cell</a>**
    
# Exploratory Data Analysis

<a id='conclusion'></a>
**<a href='#top'>Go to first cell</a>**
    
# Conclusion