<a id='top'></a>
## Data Visualization: Discovering Trends and Patterns in Online Retail Sales
### By: P. A. Ogbodum

<b> Contents Outline </b><br>
- <a href = '#introduction'> Introduction </a>
- <a href = '#data wrangling'> Data Wrangling </a>
- <a href = '#eda'> Exploratory Data Analysis </a>
- <a href = '#conclusion'> Conclusion </a>


<a id='introduction'></a>

# Introduction

**Data Set:** Online Retails Sale Dataset

**Source:** [Kaggle](https://www.kaggle.com/datasets/rohitmahulkar/online-retails-sale-dataset)

**Description:** This data set was collected by an intern at Forage and contains information about online retail sales made by the Tata Group, a multinational conglomerate headquartered in Mumbai, India. It has 10 columns and 541,909 records covering sales between 2010 and 2011.

**Variables:** The data set contains the 10 variables listed below: <br>
- **InvoiceNo**: Unique ID that identifies a sale
- **InvoiceDate**: Date the invoice was issued
- **InvoiceTime**: Time the invoice was issued
- **StockCode**: Code number of the purchased item
- **Description**: Description of the product
- **Quantity**: Number of units purchased
- **UnitPrice**: Price of a single unit of the item
- **Totalsale**: Total value of the items purchased
- **CustomerID**: Unique identification number for each customer
- **Country**: Country where the purchase was made

**Credit:** *ROHIT MAHULKAR*

**Analysis Question:** Which product performed best in terms of sales between 2010 and 2011, and what are its characteristics?

**<a href='#top'>Go to first cell</a>**


In [1]:
# import needed modules
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import mitosheet

%matplotlib inline

<a id='data wrangling'></a>
**<a href='#top'>Go to first cell</a>**
    
# Data Wrangling

> ## Data Gathering

In [2]:
# read flat file into jupyter
df = pd.read_csv('Online Retail.csv')

# check if operation was successful
df.sample(n=5)

| | InvoiceNo | InvoiceDate | InvoiceTime | StockCode | Description | Quantity | UnitPrice | Totalsale | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|---|---|
| 344503 | 567076 | 16-09-2011 | 12:27:00 PM | 35598B | BLACK CHRISTMAS TREE 60CM | 72 | 0.65 | 46.8 | 15505.0 | United Kingdom |
| 260838 | 559816 | 12-07-2011 | 04:11:00 PM | 23206 | LUNCH BAG APPLE DESIGN | 1 | 4.13 | 4.13 | NaN | United Kingdom |
| 140228 | 548388 | 30-03-2011 | 04:52:00 PM | 23156 | SET OF 5 MINI GROCERY MAGNETS | 4 | 2.08 | 8.32 | 13268.0 | United Kingdom |
| 325913 | 565460 | 05-09-2011 | 09:24:00 AM | 22324 | BLUE POLKADOT KIDS BAG | 48 | 1.65 | 79.2 | 16843.0 | United Kingdom |
| 454216 | 575516 | 10-11-2011 | 10:45:00 AM | 21479 | WHITE SKULL HOT WATER BOTTLE | 4 | 4.25 | 17.0 | 17340.0 | United Kingdom |
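The five sampled rows suggest that `Totalsale` is the product of `Quantity` and `UnitPrice`. The snippet below is a minimal sketch of that check, using values copied from the sample output; the `rows`, `expected`, and `matches` names are illustrative only, and five rows cannot confirm that the relationship holds for the whole data set.

```python
import numpy as np
import pandas as pd

# Values copied from the five sampled rows shown in the output above.
rows = pd.DataFrame({
    "Quantity": [72, 1, 4, 48, 4],
    "UnitPrice": [0.65, 4.13, 2.08, 1.65, 4.25],
    "Totalsale": [46.8, 4.13, 8.32, 79.2, 17.0],
})

# Recompute the total and compare with a tolerance, because the
# values are floating-point numbers.
expected = rows["Quantity"] * rows["UnitPrice"]
matches = np.isclose(expected, rows["Totalsale"])

print(matches.all())
```

Running the same comparison on the full `df` would show whether any rows break the pattern.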


> ## Data Assessment
>> ### Visual Assessment

In [3]:
# use the mitosheet to carry out visual assessment
mitosheet.sheet(analysis_to_replay="id-ywyefybwwh")

MitoWidget(analysis_data_json='{"analysisName": "id-ywyefybwwh", "analysisToReplay": {"analysisName": "id-azmj…



In [None]:
from mitosheet import *; register_analysis("id-ywyefybwwh");
    
# Imported Online Retail.csv
import pandas as pd
Online_Retail = pd.read_csv(r'Online Retail.csv')

# Reordered column Quantity
Online_Retail_columns = [col for col in Online_Retail.columns if col != 'Quantity']
Online_Retail_columns.insert(6, 'Quantity')
Online_Retail = Online_Retail[Online_Retail_columns]

# Reordered column Description
Online_Retail_columns = [col for col in Online_Retail.columns if col != 'Description']
Online_Retail_columns.insert(3, 'Description')
Online_Retail = Online_Retail[Online_Retail_columns]


In [20]:
from mitosheet import *; register_analysis("id-azmjrmwwda");
    
# Imported Online Retail.csv
import pandas as pd
Online_Retail = pd.read_csv(r'Online Retail.csv')

# Pivoted into Online_Retail
Online_Retail_pivot = pd.DataFrame(data={})

# Deleted Online_Retail_pivot
del Online_Retail_pivot

# Imported Marriage_Divorce_DB.csv
Marriage_Divorce_DB = pd.read_csv(r'Marriage_Divorce_DB.csv')

# Deleted Marriage_Divorce_DB
del Marriage_Divorce_DB

# Imported twitter-archive-enhanced.csv
twitter_archive_enhanced = pd.read_csv(r'twitter-archive-enhanced.csv')

# Deleted twitter_archive_enhanced
del twitter_archive_enhanced


<b>Notes:</b>
- Unsuitable data type for the InvoiceDate and InvoiceTime columns.
- Inconsistent formatting of InvoiceDate records.
- Implausible negative values in the Quantity and Totalsale columns.
- Some InvoiceNo values are alphanumeric; these seem to mark the rows with negative Quantity and Totalsale values.
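The visual observations above can also be checked in code. The following is a minimal sketch, not the notebook's own method: it runs on a small hypothetical `sample` frame built to mimic the issues, and it assumes the `DD-MM-YYYY` date layout and 12-hour time layout seen in the sampled rows.

```python
import pandas as pd

# Hypothetical rows that mimic the noted issues; not taken from the data set.
sample = pd.DataFrame({
    "InvoiceNo": ["567076", "X559816", "548388"],
    "InvoiceDate": ["16-09-2011", "12-07-2011", "2011/03/30"],
    "InvoiceTime": ["12:27:00 PM", "04:11:00 PM", "04:52:00 PM"],
    "Quantity": [72, -1, 4],
    "Totalsale": [46.8, -4.13, 8.32],
})

# Dates that do not follow the assumed DD-MM-YYYY layout become NaT.
parsed_dates = pd.to_datetime(
    sample["InvoiceDate"], format="%d-%m-%Y", errors="coerce"
)
bad_date_rows = sample[parsed_dates.isna()]

# Date and time can be combined into a single datetime column.
timestamps = pd.to_datetime(
    sample["InvoiceDate"] + " " + sample["InvoiceTime"],
    format="%d-%m-%Y %I:%M:%S %p",
    errors="coerce",
)

# Rows with a negative quantity or a negative total.
negative_rows = sample[(sample["Quantity"] < 0) | (sample["Totalsale"] < 0)]

# Invoice numbers that contain anything other than digits.
alnum_mask = ~sample["InvoiceNo"].astype(str).str.isdigit()

# Number of alphanumeric invoices that also have a negative quantity.
overlap = int((alnum_mask & (sample["Quantity"] < 0)).sum())

print(len(bad_date_rows), len(negative_rows), overlap)
```

Applying the same masks to `df` would show how many records each issue affects and whether alphanumeric invoice numbers line up with the negative rows.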

>> ### Programmatic Assessment

In [4]:
# dataframe overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   InvoiceDate  541909 non-null  object 
 2   InvoiceTime  541909 non-null  object 
 3   StockCode    541909 non-null  object 
 4   Description  540455 non-null  object 
 5   Quantity     541909 non-null  int64  
 6   UnitPrice    541909 non-null  float64
 7   Totalsale    541909 non-null  float64
 8   CustomerID   406829 non-null  float64
 9   Country      541909 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 41.3+ MB




<b>Notes:</b>
- Missing values in the Description (1,454 records) and CustomerID (135,080 records) columns.
- CustomerID should be a string/object data type, not float64.
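As a minimal sketch of how these two notes could be handled, the snippet below counts missing values per column and converts a float identifier column to strings without a trailing `.0`. It uses a small hypothetical `sample` frame, and the route through pandas' nullable `Int64` type is one possible approach rather than the method this notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing values; not taken from the data set.
sample = pd.DataFrame({
    "Description": ["LUNCH BAG APPLE DESIGN", None, "BLUE POLKADOT KIDS BAG"],
    "CustomerID": [15505.0, np.nan, 16843.0],
})

# Count missing values in each column.
missing_counts = sample.isna().sum()

# Cast float IDs to nullable integers first so that 15505.0 becomes
# "15505" rather than "15505.0"; missing IDs stay missing.
customer_id = sample["CustomerID"].astype("Int64").astype("string")

print(missing_counts)
print(customer_id)
```

Going through `Int64` first keeps missing identifiers as `<NA>` instead of turning them into the literal string `"nan"`.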


> ## Data Cleaning

<a id='eda'></a>
**<a href='#top'>Go to first cell</a>**
    
# Exploratory Data Analysis

<a id='conclusion'></a>
**<a href='#top'>Go to first cell</a>**
    
# Conclusion