- Author: Maximiliano Lopez Salgado
- First Commit: 2023-06-20                      # following ISO 8601 format
- Last Commit: 2023-06-20                       # following ISO 8601 format
- Description: This notebook is used to perform EDA on the Superstore dataset

In [29]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import folium
from folium import plugins
import sqlite3

# Exploratory Data Analysis (EDA)

## 1. Understanding the data

### 1.1 Gathering data

In [30]:
# Import the cleaned CSV files
order = pd.read_csv('../datasets/order.csv', encoding='latin1')
customer = pd.read_csv('../datasets/customer.csv', encoding='latin1')
shipment = pd.read_csv('../datasets/shipment.csv', encoding='latin1')
product = pd.read_csv('../datasets/product.csv', encoding='latin1')
stock = pd.read_csv('../datasets/stock.csv', encoding='latin1')

### 1.2 Assessing data

In [31]:
# Take a look at each DataFrame's shape (rows, columns)
display(order.shape)
display(customer.shape)
display(shipment.shape)
display(product.shape)
display(stock.shape)

(9994, 5)

(9994, 3)

(9994, 8)

(9994, 8)

(9994, 3)

In [32]:
# Take a look at each DataFrame's info
# info() prints its report directly and returns None, so it is not wrapped in display()
order.info()
customer.info()
shipment.info()
product.info()
stock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   order_id       9994 non-null   object
 1   order_date     9994 non-null   object
 2   shipment_code  9994 non-null   int64 
 3   customer_id    9994 non-null   object
 4   product        9994 non-null   object
dtypes: int64(1), object(4)
memory usage: 390.5+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customer_id    9994 non-null   object
 1   customer_name  9994 non-null   object
 2   segment        9994 non-null   object
dtypes: object(3)
memory usage: 234.4+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ship_id      9994 non-null   int64 
 1   ship_date    9994 non-null   object
 2   ship_mode    9994 non-null   object
 3   country      9994 non-null   object
 4   city         9994 non-null   object
 5   state        9994 non-null   object
 6   postal_code  9994 non-null   int64 
 7   region       9994 non-null   object
dtypes: int64(2), object(6)
memory usage: 624.8+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   product_id    9994 non-null   object 
 1   category      9994 non-null   object 
 2   sub-category  9994 non-null   object 
 3   product_name  9994 non-null   object 
 4   price         9994 non-null   float64
 5   quantity      9994 non-null   int64  
 6   discount      9994 non-null   float64
 7   profit        9994 non-null   float64
dtypes: float64(3), int64(1), object(4)
memory usage: 624.8+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   order_id    9994 non-null   object 
 1   product_id  9994 non-null   object 
 2   stock_id    0 non-null      float64
dtypes: float64(1), object(2)
memory usage: 234.4+ KB
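The `info()` output reports `order_date` and `ship_date` as `object`, so they are stored as plain strings. Before any time-based analysis they would need to be converted to datetimes. The following is a minimal sketch, assuming the strings follow a month/day/year pattern such as `11/8/2016`; `order_sample` is a hypothetical stand-in for the real `order` DataFrame.

```python
import pandas as pd

# Hypothetical sample standing in for the real `order` DataFrame
order_sample = pd.DataFrame({
    "order_id": ["CA-2016-152156", "CA-2016-138688", "US-2015-108966"],
    "order_date": ["11/8/2016", "6/12/2016", "10/11/2015"],
})

# Assumption: dates are month/day/year; an explicit format avoids ambiguous parsing
order_sample["order_date"] = pd.to_datetime(order_sample["order_date"], format="%m/%d/%Y")

print(order_sample.dtypes)
print(order_sample["order_date"].dt.year.tolist())
```

The same call could be applied to `shipment['ship_date']` if it uses the same pattern.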

In [33]:
# Use describe method to get descriptive statistics
display(order.describe())
display(customer.describe())
display(shipment.describe())
display(product.describe())
display(stock.describe())

| | shipment_code |
|---|---|
| count | 9994.0 |
| mean | 4997.5 |
| std | 2885.163629 |
| min | 1.0 |
| 25% | 2499.25 |
| 50% | 4997.5 |
| 75% | 7495.75 |
| max | 9994.0 |


| | customer_id | customer_name | segment |
|---|---|---|---|
| count | 9994 | 9994 | 9994 |
| unique | 793 | 793 | 3 |
| top | WB-21850 | William Brown | Consumer |
| freq | 37 | 37 | 5191 |


| | ship_id | postal_code |
|---|---|---|
| count | 9994.0 | 9994.0 |
| mean | 4997.5 | 55190.379428 |
| std | 2885.163629 | 32063.69335 |
| min | 1.0 | 1040.0 |
| 25% | 2499.25 | 23223.0 |
| 50% | 4997.5 | 56430.5 |
| 75% | 7495.75 | 90008.0 |
| max | 9994.0 | 99301.0 |


| | price | quantity | discount | profit |
|---|---|---|---|---|
| count | 9994.0 | 9994.0 | 9994.0 | 9994.0 |
| mean | 229.858001 | 3.789574 | 0.156203 | 28.656896 |
| std | 623.245101 | 2.22511 | 0.206452 | 234.260108 |
| min | 0.444 | 1.0 | 0.0 | -6599.978 |
| 25% | 17.28 | 2.0 | 0.0 | 1.72875 |
| 50% | 54.49 | 3.0 | 0.2 | 8.6665 |
| 75% | 209.94 | 5.0 | 0.2 | 29.364 |
| max | 22638.48 | 14.0 | 0.8 | 8399.976 |


| | stock_id |
|---|---|
| count | 0.0 |
| mean | NaN |
| std | NaN |
| min | NaN |
| 25% | NaN |
| 50% | NaN |
| 75% | NaN |
| max | NaN |
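Two things stand out in these statistics: `customer` has 9994 rows but only 793 unique `customer_id` values, and `stock_id` has a count of 0, meaning every value is missing. One possible way to handle both is sketched below, using small hypothetical samples (`customer_sample`, `stock_sample`) in place of the real DataFrames. Whether duplicates should actually be dropped depends on how the tables are used later.

```python
import pandas as pd

# Hypothetical samples standing in for the real `customer` and `stock` DataFrames
customer_sample = pd.DataFrame({
    "customer_id": ["CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-20335"],
    "customer_name": ["Claire Gute", "Claire Gute", "Darrin Van Huff",
                      "Sean O'Donnell", "Sean O'Donnell"],
    "segment": ["Consumer", "Consumer", "Corporate", "Consumer", "Consumer"],
})
stock_sample = pd.DataFrame({
    "order_id": ["CA-2016-152156", "CA-2016-138688"],
    "product_id": ["FUR-BO-10001798", "OFF-LA-10000240"],
    "stock_id": [float("nan"), float("nan")],
})

# Keep one row per customer_id
unique_customers = customer_sample.drop_duplicates(subset="customer_id").reset_index(drop=True)

# List the columns in which every value is missing
empty_columns = stock_sample.columns[stock_sample.isna().all()].tolist()

print(len(customer_sample), "rows ->", len(unique_customers), "unique customers")
print("Columns with no data:", empty_columns)
```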


## 2. Extracting and Plotting the data

From the DataFrames we have, here is some of the information we can potentially extract. The same list applies to each of the four template DataFrames (`df`, `df2`, `df3`, `df4`):

- Count of unique values.
- Distribution of values.
- Count of values.
- Descriptive statistics of a column's values.
- Time-based analysis of a column's values.
- Geographic distribution of values on a map.
- Visualization of regions with a heatmap.
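As an illustration of the time-based analysis item, the sketch below counts distinct orders per month. It assumes the date column has already been converted to a datetime dtype; the `orders` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical orders with an already-parsed datetime column
orders = pd.DataFrame({
    "order_id": ["A-1", "A-2", "A-3", "A-4"],
    "order_date": pd.to_datetime(["2016-11-08", "2016-11-20", "2016-06-12", "2015-10-11"]),
})

# Group by calendar month and count distinct orders in each month
monthly_orders = (
    orders.groupby(orders["order_date"].dt.to_period("M"))["order_id"]
    .nunique()
    .sort_index()
)

print(monthly_orders)
```

`nunique()` is used instead of `count()` because an order can span several rows, one per product.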

## DataFrame 1: Customer Data

### Number of unique values

In [34]:
# `cleaned_df` and 'column1' are placeholders: replace them with the DataFrame and column you want to analyze
unique_column1 = cleaned_df['column1'].nunique()
print("Number of unique values:", unique_column1)

NameError: name 'cleaned_df' is not defined

### Distribution of values

In [None]:
# Count how often each value occurs ('column1' is a placeholder for the column you want to analyze)
value_counts = cleaned_df['column1'].value_counts().reset_index()
value_counts.columns = ['column1', 'Count']

# Display the ten most frequent values
print("\nDistribution of values:")
print(value_counts.head(10))
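For a concrete case, the distribution can be reported as both absolute counts and relative shares. This is a minimal sketch on a hypothetical `segments` Series; the values only imitate the `segment` column and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical values imitating the `segment` column
segments = pd.Series(
    ["Consumer", "Consumer", "Corporate", "Consumer", "Consumer"], name="segment"
)

# Absolute counts and relative shares, aligned on the segment label
counts = segments.value_counts()
shares = segments.value_counts(normalize=True)
distribution = pd.DataFrame({"count": counts, "share": shares.round(2)})

print(distribution)
```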

### Pie chart of the distribution of values

In [None]:
# Plot the ten most frequent values from the `value_counts` table built in the previous cell
plt.figure(figsize=(8, 8))
plt.pie(value_counts['Count'][:10], labels=value_counts['column1'][:10], autopct='%1.1f%%')
plt.title('Distribution of Values (Top 10)')
plt.show()
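A pie chart of only the top 10 values shows percentages of the top 10, not of the whole column. One possible workaround is to fold the remaining values into a single slice. The helper `top_n_with_other` and the sample data below are hypothetical, not part of the notebook.

```python
import pandas as pd

def top_n_with_other(values: pd.Series, n: int = 10, other_label: str = "Other") -> pd.Series:
    """Return counts of the n most frequent values plus one slice for the rest."""
    counts = values.value_counts()
    top = counts.head(n)
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top

# Hypothetical city values
cities = pd.Series(
    ["Henderson", "Henderson", "Henderson", "Los Angeles", "Los Angeles", "Miami", "Costa Mesa"]
)
pie_data = top_n_with_other(cities, n=2)
print(pie_data)
```

The result can be passed to `plt.pie(pie_data.values, labels=pie_data.index, autopct='%1.1f%%')`, so the slices add up to the full column.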

### Descriptive statistics of Column values

In [None]:
# 'column1' is a placeholder for the column you want to analyze
column1_stats = cleaned_df['column1'].describe()

print("Descriptive statistics of column1:")
print(column1_stats)

### Distribution of values based on a column or several columns

In [None]:
cleaned_df[['column1', 'column2', 'column3', 'column4']].hist(figsize=(12, 6))
plt.tight_layout()
plt.show()

### Barplot of the distribution of values


In [None]:
# Count occurrences of each value ('column1' is a placeholder, for example a state column)
column1_counts = cleaned_df['column1'].value_counts()

plt.figure(figsize=(12, 6))
column1_counts.plot(kind='bar')
plt.xlabel('column1')
plt.ylabel('Count')
plt.title('Distribution of column1 values')
plt.xticks(rotation=45)
plt.show()

## Analyses specific to geolocation data

### Geographic distribution of values (i.e. customers, start/end points) on a map

Merging and plotting every location can take a long time. To shorten the runtime, consider working with a random sample of the data: a smaller subset is faster to merge and to plot on a map.
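Note that `DataFrame.sample(n=...)` raises a `ValueError` when `n` exceeds the number of rows (with the default `replace=False`), and it returns a different sample on every run unless a seed is set. The sketch below guards against both; `safe_sample` is a hypothetical helper and the seed value is arbitrary.

```python
import pandas as pd

def safe_sample(df: pd.DataFrame, n: int = 250, seed: int = 42) -> pd.DataFrame:
    """Sample at most n rows, reproducibly, without failing on small frames."""
    return df.sample(n=min(n, len(df)), random_state=seed)

# Hypothetical frame with fewer rows than the requested sample size
small = pd.DataFrame({"customer_city": ["henderson", "miami", "westminster"]})
sampled = safe_sample(small, n=250)
print(len(sampled))
```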

In [None]:
# This cell needs the folium library.
# If you haven't imported it yet, run:
# import folium

# Randomly sample the geolocation DataFrame
sampled_geolocation = geolocation.sample(n=250) 

# Randomly sample the cleaned_customers DataFrame
sampled_cleaned_customers = cleaned_df.sample(n=250) 

# Merge the two samples on city name ('geolocation_city' on the left, 'customer_city' on the right).
merged_df = pd.merge(sampled_geolocation, sampled_cleaned_customers, left_on='geolocation_city', right_on='customer_city', how='inner')

# Create a map object centered at the latitude and longitude you want, for example:
# (named `customer_map` to avoid shadowing Python's built-in `map`)
customer_map = folium.Map(location=[-12.257569734193066, -53.113064202406306], zoom_start=4)

# Iterate over the merged rows and add a marker for each location
for index, row in merged_df.iterrows():
    lat = row['geolocation_lat']
    lon = row['geolocation_lng']
    city = row['geolocation_city']
    marker = folium.Marker(location=[lat, lon], popup=city)
    marker.add_to(customer_map)

# Save the map
customer_map.save('map.html')

# Display the map
customer_map

### Heatmap - Regions with the highest concentration of values.

As before, consider using a random sample of the data if the DataFrames are too large, to avoid long runtimes or memory errors.

In [None]:
# Group by 'customer_city' and count the occurrences
heatmap_city_counts = sampled_cleaned_customers['customer_city'].value_counts().reset_index()
heatmap_city_counts.columns = ['customer_city', 'count']

# Convert 'customer_city' to lowercase for consistency
heatmap_city_counts['customer_city'] = heatmap_city_counts['customer_city'].str.lower()

# Merge with the 'sampled_geolocation' DataFrame
heatmap_merged_df = pd.merge(sampled_geolocation, heatmap_city_counts, left_on='geolocation_city', right_on='customer_city', how='inner')

# Create a map object centered at a specific latitude and longitude
map_heatmap = folium.Map(location=[-12.257569734193066, -53.113064202406306], zoom_start=4)

# Create a HeatMap layer using the aggregated latitude and longitude coordinates
heatmap_data = heatmap_merged_df[['geolocation_lat', 'geolocation_lng', 'count']].values
heatmap_layer = plugins.HeatMap(heatmap_data)

# Add the HeatMap layer to the map
heatmap_layer.add_to(map_heatmap)

# Save the map as an HTML file
map_heatmap.save('heatmap.html')

# Display the map
map_heatmap
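If the geolocation table holds several rows per city, the merge above repeats each city's count once per row, which inflates the weight of those cities. One possible remedy is to reduce the geolocation data to a single coordinate per city before merging. The sketch below assumes the column names used in this notebook; the coordinates and city rows are made-up sample values.

```python
import pandas as pd

# Hypothetical data; the coordinates are illustrative only
geolocation = pd.DataFrame({
    "geolocation_city": ["miami", "miami", "henderson"],
    "geolocation_lat": [25.70, 25.80, 36.00],
    "geolocation_lng": [-80.20, -80.30, -115.00],
})
customers = pd.DataFrame({"customer_city": ["Miami", "Miami", "Henderson"]})

# One averaged coordinate per city
city_coords = geolocation.groupby("geolocation_city", as_index=False)[
    ["geolocation_lat", "geolocation_lng"]
].mean()

# Number of customers per lowercase city name
city_counts = (
    customers["customer_city"].str.lower()
    .value_counts()
    .rename_axis("customer_city")
    .reset_index(name="count")
)

# Exactly one row per city: [lat, lng, weight]
heatmap_points = city_coords.merge(
    city_counts, left_on="geolocation_city", right_on="customer_city", how="inner"
)
heatmap_data = heatmap_points[["geolocation_lat", "geolocation_lng", "count"]].values.tolist()
print(heatmap_data)
```

The resulting `heatmap_data` list has the `[lat, lng, weight]` layout that `plugins.HeatMap` accepts.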
