# **Analysis Workflow**



* Reading the data from Google Sheets: [Sheet Link](https://docs.google.com/spreadsheets/d/1yUCGN3Z-lywUyD3gjE2V6h_VMsOWgkfKcov65aO6e8M/edit?usp=sharing)
* Basic exploratory data analysis (EDA).
* Data cleaning.
* Merging with the other dimension (Dim) sheets.
* Saving the final consolidated data in BigQuery.

**Note:** Use the Table of Contents in the left side panel to navigate between sections.






# **Necessary Libraries**

In [1]:
# Necessary libraries

# for storing data into BigQuery
from google.cloud import bigquery
from google.colab import auth

# for authentication
# auth.authenticate_user()

# initialize the BigQuery client
project_id = 'keen-phalanx-396514'
client = bigquery.Client(project_id, location='US')

# for Cleaning, Analyzing & Charts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re


from google.auth.transport.requests import Request
from google.oauth2.service_account import Credentials

**Google Sheet URLs for CSV export**

To read a Google Sheets tab directly into pandas as CSV, rewrite its URL to follow this structure:
https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=csv&gid={sheet_gid}

For example, this edit URL:

https://docs.google.com/spreadsheets/d/1yUCGN3Z-lywUyD3gjE2V6h_VMsOWgkfKcov65aO6e8M/edit?gid=1531479241#gid=1531479241

becomes this export URL:

https://docs.google.com/spreadsheets/d/1yUCGN3Z-lywUyD3gjE2V6h_VMsOWgkfKcov65aO6e8M/export?format=csv&gid=1531479241
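The URL rewrite described above can be sketched as a small helper. The function name `to_csv_export_url` is hypothetical, and falling back to `gid=0` when the URL carries no `gid` is an assumption, not something the sheet guarantees.

```python
import re


def to_csv_export_url(sheet_url: str) -> str:
    """Convert a Google Sheets edit URL into its CSV export URL."""
    id_match = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", sheet_url)
    if id_match is None:
        raise ValueError("no spreadsheet id found in URL")
    gid_match = re.search(r"[?&#]gid=(\d+)", sheet_url)
    # Assumption: fall back to gid 0 when the URL does not name a tab.
    gid = gid_match.group(1) if gid_match else "0"
    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{id_match.group(1)}/export?format=csv&gid={gid}"
    )


edit_url = (
    "https://docs.google.com/spreadsheets/d/"
    "1yUCGN3Z-lywUyD3gjE2V6h_VMsOWgkfKcov65aO6e8M/edit?gid=1531479241#gid=1531479241"
)
export_url = to_csv_export_url(edit_url)
print(export_url)
```

The resulting string can be passed straight to `pd.read_csv`, provided the sheet is shared so that anyone with the link can view it.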


# **Data Exploration**

In [2]:
# @title Google Sheet URLs for CSV export

# CSV export URL for each sheet

orders = "https://docs.google.com/spreadsheets/d/1yUCGN3Z-lywUyD3gjE2V6h_VMsOWgkfKcov65aO6e8M/export?format=csv&gid=1531479241"
customers = "https://docs.google.com/spreadsheets/d/1yUCGN3Z-lywUyD3gjE2V6h_VMsOWgkfKcov65aO6e8M/export?format=csv&gid=2099175586"
returns = "https://docs.google.com/spreadsheets/d/1yUCGN3Z-lywUyD3gjE2V6h_VMsOWgkfKcov65aO6e8M/export?format=csv&gid=1158708900"
users = "https://docs.google.com/spreadsheets/d/1yUCGN3Z-lywUyD3gjE2V6h_VMsOWgkfKcov65aO6e8M/export?format=csv&gid=531959115"

# Read directly into Pandas DataFrame
df_orders = pd.read_csv(orders, index_col='Order ID')
df_customers = pd.read_csv(customers)
df_returns = pd.read_csv(returns)
df_users = pd.read_csv(users)


In [3]:
# @title Orders

# Display the first few rows
df_orders.head()

Order ID,Row ID,Customer ID,Customer Segment,Product Category,Product Sub-Category,Product Container,Product Name,Order Priority,Ship Mode,Region,...,Postal Code,Order Date,Ship Date,Quantity Ordered,Unit Price,Discount,Product Base Margin,Shipping Cost,Sales,Profit
88525,18606,2,Corporate,Office Supplies,Labels,Small Box,Avery 49,Not Specified,Regular Air,Central,...,60101,5/28/2012,5/30/2012,2,2.88,0.01,0.36,0.5,5.9,1.32
88522,20847,3,Corporate,Office Supplies,Pens & Art Supplies,Wrap Bag,SANFORD Liquid Accent™ Tank-Style Highlighters,High,Express Air,West,...,98221,7/7/2010,7/8/2010,4,2.84,0.01,0.54,0.93,13.01,4.56
88523,23086,3,Corporate,Office Supplies,Paper,Small Box,Xerox 1968,Not Specified,Express Air,West,...,98221,7/27/2011,7/28/2011,7,6.68,0.03,0.37,6.15,49.92,-47.64
88523,23087,3,Corporate,Office Supplies,"Scissors, Rulers and Trimmers",Small Pack,Acme® Preferred Stainless Steel Scissors,Not Specified,Regular Air,West,...,98221,7/27/2011,7/28/2011,7,5.68,0.01,0.56,3.6,41.64,-30.51
88523,23088,3,Corporate,Technology,Telephones and Communication,Small Box,V70,Not Specified,Express Air,West,...,98221,7/27/2011,7/27/2011,8,205.99,0.0,0.59,2.5,1446.67,998.2023


In [4]:
# @title Customers

# Display the first few rows
df_customers.head()

Unnamed: 0,Customer ID,Customer Name
0,2,Janice Fletcher
1,3,Bonnie Potter
2,5,Ronnie Proctor
3,6,Dwight Hwang
4,7,Leon Gill


In [5]:
# @title Returns

# Display the first few rows
df_returns.head()

Unnamed: 0,Order ID,Status
0,65,Returned
1,612,Returned
2,614,Returned
3,678,Returned
4,710,Returned


In [6]:
# @title Users

# Display the first few rows
df_users.head()

Unnamed: 0,Region,Manager
0,Central,Chris
1,East,Erin
2,South,Sam
3,West,William


## **Basic Data Cleaning**

In [7]:
# @title Dropping The Row ID
df_orders.drop("Row ID", axis=1, inplace=True)
df_orders.head()


Order ID,Customer ID,Customer Segment,Product Category,Product Sub-Category,Product Container,Product Name,Order Priority,Ship Mode,Region,State or Province,...,Postal Code,Order Date,Ship Date,Quantity Ordered,Unit Price,Discount,Product Base Margin,Shipping Cost,Sales,Profit
88525,2,Corporate,Office Supplies,Labels,Small Box,Avery 49,Not Specified,Regular Air,Central,Illinois,...,60101,5/28/2012,5/30/2012,2,2.88,0.01,0.36,0.5,5.9,1.32
88522,3,Corporate,Office Supplies,Pens & Art Supplies,Wrap Bag,SANFORD Liquid Accent™ Tank-Style Highlighters,High,Express Air,West,Washington,...,98221,7/7/2010,7/8/2010,4,2.84,0.01,0.54,0.93,13.01,4.56
88523,3,Corporate,Office Supplies,Paper,Small Box,Xerox 1968,Not Specified,Express Air,West,Washington,...,98221,7/27/2011,7/28/2011,7,6.68,0.03,0.37,6.15,49.92,-47.64
88523,3,Corporate,Office Supplies,"Scissors, Rulers and Trimmers",Small Pack,Acme® Preferred Stainless Steel Scissors,Not Specified,Regular Air,West,Washington,...,98221,7/27/2011,7/28/2011,7,5.68,0.01,0.56,3.6,41.64,-30.51
88523,3,Corporate,Technology,Telephones and Communication,Small Box,V70,Not Specified,Express Air,West,Washington,...,98221,7/27/2011,7/27/2011,8,205.99,0.0,0.59,2.5,1446.67,998.2023


In [8]:
# @title Rows & Columns

print("Rows:", df_orders.shape[0])
print("Columns:", df_orders.shape[1])

Rows: 9427
Columns: 21


In [9]:
# @title Dataset Columns

# Print the column (feature) names.
df_orders.columns

Index(['Customer ID', 'Customer Segment', 'Product Category',
       'Product Sub-Category', 'Product Container', 'Product Name',
       'Order Priority', 'Ship Mode', 'Region', 'State or Province', 'City',
       'Postal Code', 'Order Date', 'Ship Date', 'Quantity Ordered',
       'Unit Price', 'Discount', 'Product Base Margin', 'Shipping Cost',
       'Sales', 'Profit'],
      dtype='object')

In [10]:
# @title Columns Data Type

# Print the column data types.
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9427 entries, 88525 to 87533
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Customer ID           9427 non-null   int64  
 1   Customer Segment      9427 non-null   object 
 2   Product Category      9427 non-null   object 
 3   Product Sub-Category  9427 non-null   object 
 4   Product Container     9427 non-null   object 
 5   Product Name          9427 non-null   object 
 6   Order Priority        9427 non-null   object 
 7   Ship Mode             9427 non-null   object 
 8   Region                9427 non-null   object 
 9   State or Province     9427 non-null   object 
 10  City                  9427 non-null   object 
 11  Postal Code           9427 non-null   int64  
 12  Order Date            9427 non-null   object 
 13  Ship Date             9427 non-null   object 
 14  Quantity Ordered      9427 non-null   int64  
 15  Unit Price           

In [11]:
# @title Correcting Column Data Types

# Convert the date columns to datetime, and treat postal codes as labels rather than numbers.
df_orders['Order Date'] = df_orders['Order Date'].astype('datetime64[ns]')
df_orders['Ship Date'] = df_orders['Ship Date'].astype('datetime64[ns]')
df_orders['Postal Code'] = df_orders['Postal Code'].astype('object')


# Print the column data types again to confirm the change.
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9427 entries, 88525 to 87533
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Customer ID           9427 non-null   int64         
 1   Customer Segment      9427 non-null   object        
 2   Product Category      9427 non-null   object        
 3   Product Sub-Category  9427 non-null   object        
 4   Product Container     9427 non-null   object        
 5   Product Name          9427 non-null   object        
 6   Order Priority        9427 non-null   object        
 7   Ship Mode             9427 non-null   object        
 8   Region                9427 non-null   object        
 9   State or Province     9427 non-null   object        
 10  City                  9427 non-null   object        
 11  Postal Code           9427 non-null   object        
 12  Order Date            9427 non-null   datetime64[ns]
 13  Ship Date         
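As an alternative to `astype('datetime64[ns]')`, the dates can be parsed with an explicit format. The sketch below uses a hypothetical two-row sample that mirrors the month/day/year strings shown in the orders preview. It assumes every date in the sheet follows that same pattern.

```python
import pandas as pd

# Hypothetical sample mirroring the month/day/year strings in the orders sheet.
sample = pd.DataFrame({
    "Order Date": ["5/28/2012", "7/7/2010"],
    "Ship Date": ["5/30/2012", "7/8/2010"],
    "Postal Code": [60101, 98221],
})

for col in ["Order Date", "Ship Date"]:
    # An explicit format removes day/month ambiguity; non-matching strings raise an error.
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y")

# Postal codes are identifiers, not quantities, so store them as strings.
sample["Postal Code"] = sample["Postal Code"].astype(str)

# With real datetime columns, date arithmetic works directly.
days_to_ship = (sample["Ship Date"] - sample["Order Date"]).dt.days
print(sample.dtypes)
print(days_to_ship.tolist())
```

Stating the format explicitly makes a malformed date fail loudly instead of being silently parsed with the wrong day and month.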

In [12]:
# @title Dataset Summary Statistics

# Summary statistics for the numeric and datetime columns
df_orders.describe()

Unnamed: 0,Customer ID,Order Date,Ship Date,Quantity Ordered,Unit Price,Discount,Product Base Margin,Shipping Cost,Sales,Profit
count,9427.0,9427,9427,9427.0,9427.0,9427.0,9355.0,9427.0,9427.0,9427.0
mean,1738.238464,2012-03-05 19:19:32.759096064,2012-03-07 20:01:51.509494016,13.79739,88.295006,0.04963,0.512174,12.794485,949.599889,139.22017
min,2.0,2010-01-01 00:00:00,2010-01-02 00:00:00,1.0,0.99,0.0,0.35,0.49,1.32,-16476.838
25%,898.0,2011-03-07 12:00:00,2011-03-09 00:00:00,5.0,6.48,0.02,0.38,3.215,61.1,-74.00475
50%,1750.0,2012-04-08 00:00:00,2012-04-09 00:00:00,10.0,20.99,0.05,0.52,6.05,203.42,2.5392
75%,2578.5,2013-03-26 00:00:00,2013-03-28 00:00:00,17.0,85.99,0.08,0.59,13.99,776.355,140.2077
max,3403.0,2013-12-31 00:00:00,2014-01-17 00:00:00,170.0,6783.02,0.25,0.85,164.73,100119.16,16332.414
std,979.277823,,,15.107223,281.527308,0.031797,0.13523,17.18041,2597.902183,998.434762


In [13]:
# Summary statistics for all columns, including the categorical ones
df_orders.describe(include='all')

Unnamed: 0,Customer ID,Customer Segment,Product Category,Product Sub-Category,Product Container,Product Name,Order Priority,Ship Mode,Region,State or Province,...,Postal Code,Order Date,Ship Date,Quantity Ordered,Unit Price,Discount,Product Base Margin,Shipping Cost,Sales,Profit
count,9427.0,9427,9427,9427,9427,9427,9427,9427,9427,9427,...,9427.0,9427,9427,9427.0,9427.0,9427.0,9355.0,9427.0,9427.0,9427.0
unique,,4,3,17,7,1263,6,3,4,49,...,1697.0,,,,,,,,,
top,,Corporate,Office Supplies,Paper,Small Box,"Global High-Back Leather Tilter, Burgundy",High,Regular Air,Central,California,...,10177.0,,,,,,,,,
freq,,3375,5182,1380,4888,27,1970,7037,2899,1022,...,54.0,,,,,,,,,
mean,1738.238464,,,,,,,,,,...,,2012-03-05 19:19:32.759096064,2012-03-07 20:01:51.509494016,13.79739,88.295006,0.04963,0.512174,12.794485,949.599889,139.22017
min,2.0,,,,,,,,,,...,,2010-01-01 00:00:00,2010-01-02 00:00:00,1.0,0.99,0.0,0.35,0.49,1.32,-16476.838
25%,898.0,,,,,,,,,,...,,2011-03-07 12:00:00,2011-03-09 00:00:00,5.0,6.48,0.02,0.38,3.215,61.1,-74.00475
50%,1750.0,,,,,,,,,,...,,2012-04-08 00:00:00,2012-04-09 00:00:00,10.0,20.99,0.05,0.52,6.05,203.42,2.5392
75%,2578.5,,,,,,,,,,...,,2013-03-26 00:00:00,2013-03-28 00:00:00,17.0,85.99,0.08,0.59,13.99,776.355,140.2077
max,3403.0,,,,,,,,,,...,,2013-12-31 00:00:00,2014-01-17 00:00:00,170.0,6783.02,0.25,0.85,164.73,100119.16,16332.414


In [14]:
# @title Exporting the modified Dataset

# Keep the index so that 'Order ID' is written as a regular column.
df_orders.to_csv('df_orders_exported.csv')

# **Data Cleaning**

In [15]:
# @title Reading Data
# Read the exported orders dataset back in
df_cleaned = pd.read_csv('/content/df_orders_exported.csv')
df_cleaned.head()

Unnamed: 0,Order ID,Customer ID,Customer Segment,Product Category,Product Sub-Category,Product Container,Product Name,Order Priority,Ship Mode,Region,...,Postal Code,Order Date,Ship Date,Quantity Ordered,Unit Price,Discount,Product Base Margin,Shipping Cost,Sales,Profit
0,88525,2,Corporate,Office Supplies,Labels,Small Box,Avery 49,Not Specified,Regular Air,Central,...,60101,2012-05-28,2012-05-30,2,2.88,0.01,0.36,0.5,5.9,1.32
1,88522,3,Corporate,Office Supplies,Pens & Art Supplies,Wrap Bag,SANFORD Liquid Accent™ Tank-Style Highlighters,High,Express Air,West,...,98221,2010-07-07,2010-07-08,4,2.84,0.01,0.54,0.93,13.01,4.56
2,88523,3,Corporate,Office Supplies,Paper,Small Box,Xerox 1968,Not Specified,Express Air,West,...,98221,2011-07-27,2011-07-28,7,6.68,0.03,0.37,6.15,49.92,-47.64
3,88523,3,Corporate,Office Supplies,"Scissors, Rulers and Trimmers",Small Pack,Acme® Preferred Stainless Steel Scissors,Not Specified,Regular Air,West,...,98221,2011-07-27,2011-07-28,7,5.68,0.01,0.56,3.6,41.64,-30.51
4,88523,3,Corporate,Technology,Telephones and Communication,Small Box,V70,Not Specified,Express Air,West,...,98221,2011-07-27,2011-07-27,8,205.99,0.0,0.59,2.5,1446.67,998.2023
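One thing to keep in mind with a CSV round trip is that CSV stores no type information, so pandas infers the types again on reading. The sketch below uses a hypothetical two-row frame and an in-memory buffer instead of the exported file. The postal code with a leading zero is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for df_orders after the dtype fixes.
frame = pd.DataFrame(
    {
        "Postal Code": ["60101", "02134"],
        "Order Date": pd.to_datetime(["2012-05-28", "2010-07-07"]),
    },
    index=pd.Index([88525, 88522], name="Order ID"),
)

buffer = io.StringIO()
frame.to_csv(buffer)  # the index is written as the 'Order ID' column

buffer.seek(0)
plain = pd.read_csv(buffer)  # default read: types are inferred from the text

buffer.seek(0)
typed = pd.read_csv(
    buffer,
    parse_dates=["Order Date"],   # restore the datetime column
    dtype={"Postal Code": str},   # keep postal codes as text
)

print(plain.dtypes)
print(typed.dtypes)
```

In the default read the dates come back as plain strings and the postal codes as integers. Passing `parse_dates` and `dtype` restores the types that were set before the export.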


In [16]:
# @title Missing values check
# Count the missing values in each column
df_cleaned.isnull().sum()

Column,Missing values
Order ID,0
Customer ID,0
Customer Segment,0
Product Category,0
Product Sub-Category,0
Product Container,0
Product Name,0
Order Priority,0
Ship Mode,0
Region,0


In [17]:
# @title Duplicate Checking

df_cleaned.duplicated().sum()

1

In [18]:
# @title Removing Duplicate

df_cleaned.drop_duplicates(inplace=True)

In [19]:
df_cleaned.duplicated().sum()

0
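Before dropping duplicates it can help to look at the rows involved. The sketch below uses hypothetical rows, not the actual duplicate from the dataset, to show how `keep=False` surfaces every copy.

```python
import pandas as pd

# Hypothetical rows; the second and third are exact copies of each other.
orders = pd.DataFrame({
    "Order ID": [88525, 88523, 88523, 88523],
    "Product Name": ["Avery 49", "Xerox 1968", "Xerox 1968", "V70"],
    "Sales": [5.90, 49.92, 49.92, 1446.67],
})

# keep=False flags every member of a duplicate group, so both copies are visible.
duplicate_rows = orders[orders.duplicated(keep=False)]
print(duplicate_rows)

# drop_duplicates keeps the first occurrence by default; reset_index tidies the row labels.
deduplicated = orders.drop_duplicates().reset_index(drop=True)
print(len(orders), "->", len(deduplicated))
```

Note that rows sharing an Order ID are not duplicates on their own, because one order can contain several products. Only rows that match in every column are removed.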

# **Merge & Consolidate Data**

In [20]:
# @title Merge with Customers

# Merge orders with customers on 'Customer ID'
df_consolidated = pd.merge(df_cleaned, df_customers, on='Customer ID', how='left')

In [21]:
# @title Merge with Regions

# Merge with users on 'Region' to add the manager of each region
df_consolidated = pd.merge(df_consolidated, df_users, on='Region', how='left')

In [22]:
# @title Merge with Returns

# Merge the result with returns on 'Order ID' to include order status
df_consolidated = pd.merge(df_consolidated, df_returns, on='Order ID', how='left')
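The three left merges can be guarded against accidental row duplication with the `validate` argument of `merge`. In the sketch below the small frames are hypothetical stand-ins for the sheets (the order with ID 65 and its customer pairing are made up), and `return_match` is an illustrative column name.

```python
import pandas as pd

# Hypothetical stand-ins for the orders, customers, users and returns sheets.
orders = pd.DataFrame({
    "Order ID": [88525, 88522, 65],
    "Customer ID": [2, 3, 5],
    "Region": ["Central", "West", "East"],
})
customers = pd.DataFrame({
    "Customer ID": [2, 3, 5],
    "Customer Name": ["Janice Fletcher", "Bonnie Potter", "Ronnie Proctor"],
})
managers = pd.DataFrame({
    "Region": ["Central", "East", "South", "West"],
    "Manager": ["Chris", "Erin", "Sam", "William"],
})
returns = pd.DataFrame({"Order ID": [65], "Status": ["Returned"]})

merged = (
    orders
    # validate raises MergeError if the right-hand key is not unique.
    .merge(customers, on="Customer ID", how="left", validate="many_to_one")
    .merge(managers, on="Region", how="left", validate="many_to_one")
    # indicator records whether each order found a match in the returns sheet.
    .merge(returns, on="Order ID", how="left", validate="many_to_one",
           indicator="return_match")
)

# A left join against unique keys must not change the number of order rows.
assert len(merged) == len(orders)
print(merged[["Order ID", "Manager", "Status", "return_match"]])
```

If a dimension sheet ever contained a repeated key, the merge would fail immediately rather than silently multiplying order rows.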

In [23]:
# @title Consolidated Data

# The final df_consolidated will contain merged data
df_consolidated.head()

Unnamed: 0,Order ID,Customer ID,Customer Segment,Product Category,Product Sub-Category,Product Container,Product Name,Order Priority,Ship Mode,Region,...,Quantity Ordered,Unit Price,Discount,Product Base Margin,Shipping Cost,Sales,Profit,Customer Name,Manager,Status
0,88525,2,Corporate,Office Supplies,Labels,Small Box,Avery 49,Not Specified,Regular Air,Central,...,2,2.88,0.01,0.36,0.5,5.9,1.32,Janice Fletcher,Chris,
1,88522,3,Corporate,Office Supplies,Pens & Art Supplies,Wrap Bag,SANFORD Liquid Accent™ Tank-Style Highlighters,High,Express Air,West,...,4,2.84,0.01,0.54,0.93,13.01,4.56,Bonnie Potter,William,
2,88523,3,Corporate,Office Supplies,Paper,Small Box,Xerox 1968,Not Specified,Express Air,West,...,7,6.68,0.03,0.37,6.15,49.92,-47.64,Bonnie Potter,William,
3,88523,3,Corporate,Office Supplies,"Scissors, Rulers and Trimmers",Small Pack,Acme® Preferred Stainless Steel Scissors,Not Specified,Regular Air,West,...,7,5.68,0.01,0.56,3.6,41.64,-30.51,Bonnie Potter,William,
4,88523,3,Corporate,Technology,Telephones and Communication,Small Box,V70,Not Specified,Express Air,West,...,8,205.99,0.0,0.59,2.5,1446.67,998.2023,Bonnie Potter,William,


In [24]:
# @title EDA of Consolidated Data

df_consolidated['Status'].value_counts(dropna=False)


Status,count
NaN,9328
Returned,98


Let's fill the NaN values with 'Order Complete'. Orders with no match in the returns sheet were not returned.

In [25]:
df_consolidated['Status'] = df_consolidated['Status'].fillna('Order Complete')


In [26]:
df_consolidated['Status'].value_counts(dropna=False)

Status,count
Order Complete,9328
Returned,98


# **Store the Data in BigQuery**

In [27]:
# @title BigQuery

df_consolidated.to_gbq('Assignment.superstore_sales_data',
                     project_id,
                     chunksize=None,
                     if_exists='replace'
                     )

100%|██████████| 1/1 [00:00<00:00, 6820.01it/s]
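The consolidated columns contain spaces and hyphens (for example `Product Sub-Category`), which are awkward to query in SQL. A possible pre-upload step is to normalize them to snake_case. The helper name `to_bigquery_column` is hypothetical, and whether the original names are accepted unchanged depends on the BigQuery and pandas-gbq versions in use, so treat this as an optional sketch rather than a required step.

```python
import re

import pandas as pd


def to_bigquery_column(name: str) -> str:
    """Lower-case a column name and replace non-alphanumeric runs with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Prefix an underscore if the name would otherwise start with a digit.
    return f"_{cleaned}" if cleaned[:1].isdigit() else cleaned


# Hypothetical empty frame carrying a few of the consolidated column names.
sample = pd.DataFrame(columns=["Order ID", "Product Sub-Category", "State or Province"])
sample = sample.rename(columns=to_bigquery_column)
print(list(sample.columns))
```

The renamed frame can then be uploaded with the same `to_gbq` call used above.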
