#### Exploratory Data Analysis (EDA)

Datasets:
- _customers_clean.csv_
- _inventory_clean.csv_
- _products_clean.csv_
- _salesforce_clean.csv_
- _suppliers_clean.csv_
- _transactions_clean.csv_

Author: Luis Sergio Pastrana Lemus  
Date: 2025-07-06

# Exploratory Data Analysis – Grocery Store Dataset

## __1. Libraries__.

In [11]:
from pathlib import Path
import sys

# Define the project root dynamically: take the notebook's working directory and move one level up
project_root = Path.cwd().parent

# Add src to sys.path if it is not already
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import the project helper functions (wildcard import; importing names explicitly would be more controlled)
from src import *


from IPython.display import display, HTML
import os
import numpy as np  # needed for np.arange in the plotting cells below
import pandas as pd

## __2. Path to Data file__.

In [12]:
# Build the path to the data files and load each dataset
data_file_path = project_root / "data" / "processed" / "clean"

df_customers_clean = load_dataset_from_csv(data_file_path, "customers_clean.csv", header='infer', parse_dates=['join_date'])
df_inventory_clean = load_dataset_from_csv(data_file_path, "inventory_clean.csv", header='infer', parse_dates=['date'])
df_products_clean = load_dataset_from_csv(data_file_path, "products_clean.csv", header='infer')
df_salesforce_clean = load_dataset_from_csv(data_file_path, "salesforce_clean.csv", header='infer')
df_suppliers_clean = load_dataset_from_csv(data_file_path, "suppliers_clean.csv", header='infer')
df_transactions_clean = load_dataset_from_csv(data_file_path, "transactions_clean.csv", header='infer', parse_dates=['date'])

# data_file_path = project_root / "data" / "processed" / "feature"

# df_xxx_feature = load_dataset_from_csv(data_file_path, "xxx_feature.csv", sep=',', header='infer')

In [None]:
# Format notebook output
format_notebook()

## __3. Exploratory Data Analysis__.

### 3.0 Casting Data types.

In [13]:
# Cast dtypes with the helper functions from features.py and mark missing values correctly with pd.NA

# missing values to pd.NA
df_inventory_clean = replace_missing_values(df_inventory_clean, include=['warehouse_location'])
df_customers_clean = replace_missing_values(df_customers_clean, include=['segment'])

# object to string
df_products_clean = cast_datatypes(df_products_clean, 'string', c_include=['product_name', 'brand'])
df_suppliers_clean = cast_datatypes(df_suppliers_clean, 'string', c_include=['supplier_name', 'contact_info'])
df_customers_clean = cast_datatypes(df_customers_clean, 'string', c_include=['customer_name'])
df_salesforce_clean = cast_datatypes(df_salesforce_clean, 'string', c_include=['employee_name'])

# object to numeric
df_products_clean = cast_datatypes(df_products_clean, 'numeric', numeric_type='Float64', c_include=['unit_cost'])
df_customers_clean = cast_datatypes(df_customers_clean, 'numeric', numeric_type="Float64", c_include=['total_spent'])

# object to category
df_products_clean = cast_datatypes(df_products_clean, 'category', c_include=['category', 'status'])
df_inventory_clean = cast_datatypes(df_inventory_clean, 'category', c_include=['warehouse_location'])
df_customers_clean = cast_datatypes(df_customers_clean, 'category', c_include=['segment'])
df_salesforce_clean = cast_datatypes(df_salesforce_clean, 'category', c_include=['region'])

# object to datetime
df_inventory_clean['date'] = pd.to_datetime(df_inventory_clean['date'], errors='coerce', utc=True)
df_customers_clean['join_date'] = pd.to_datetime(df_customers_clean['join_date'], errors='coerce', utc=True)
df_transactions_clean['date'] = pd.to_datetime(df_transactions_clean['date'], errors='coerce', utc=True)
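
The helpers `replace_missing_values` and `cast_datatypes` come from `src` and their internals are not shown in this notebook. As a rough sketch of what equivalent casts look like in plain pandas, assuming a toy frame and a hypothetical list of placeholder tokens treated as missing (neither is taken from the project):

```python
import pandas as pd

# Hypothetical toy frame; column names mirror customers_clean.csv
df_demo = pd.DataFrame({
    "customer_name": ["Ana", "Luis", "Eva"],
    "total_spent": ["120.5", "80", "n/a"],
    "segment": ["Gold", "unknown", "Silver"],
    "join_date": ["2024-01-05", "2024-02-10", "not a date"],
})

# Assumed placeholder tokens; the project's own list may differ
missing_tokens = ["unknown", "n/a", ""]

# Placeholder tokens to pd.NA
df_demo["segment"] = df_demo["segment"].mask(df_demo["segment"].isin(missing_tokens), pd.NA)

# object to string
df_demo["customer_name"] = df_demo["customer_name"].astype("string")

# object to nullable float; unparseable values become missing
df_demo["total_spent"] = pd.to_numeric(df_demo["total_spent"], errors="coerce").astype("Float64")

# object to category
df_demo["segment"] = df_demo["segment"].astype("category")

# object to timezone-aware datetime; unparseable values become NaT
df_demo["join_date"] = pd.to_datetime(df_demo["join_date"], errors="coerce", utc=True)

print(df_demo.dtypes)
```

Using `errors="coerce"` keeps the cast from failing on dirty values, at the cost of silently turning them into missing values, so it is worth checking `isna().sum()` before and after the cast.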

In [14]:
df_customers_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   customer_id    5000 non-null   int64              
 1   customer_name  5000 non-null   string             
 2   join_date      5000 non-null   datetime64[ns, UTC]
 3   total_spent    5000 non-null   Float64            
 4   frequency      5000 non-null   int64              
 5   segment        4850 non-null   category           
dtypes: Float64(1), category(1), datetime64[ns, UTC](1), int64(2), string(1)
memory usage: 205.3 KB


In [15]:
df_inventory_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   inventory_id        20000 non-null  int64              
 1   date                20000 non-null  datetime64[ns, UTC]
 2   product_id          20000 non-null  int64              
 3   beginning_stock     20000 non-null  int64              
 4   received            20000 non-null  int64              
 5   sold                20000 non-null  int64              
 6   warehouse_location  19703 non-null  category           
 7   ending_stock        20000 non-null  int64              
dtypes: category(1), datetime64[ns, UTC](1), int64(6)
memory usage: 1.1 MB


In [16]:
df_products_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   product_id        10000 non-null  int64   
 1   product_name      10000 non-null  string  
 2   category          10000 non-null  category
 3   supplier_id       10000 non-null  int64   
 4   unit_cost         10000 non-null  Float64 
 5   status            10000 non-null  category
 6   brand             10000 non-null  string  
 7   list_price        10000 non-null  float64 
 8   median_unit_cost  10000 non-null  float64 
dtypes: Float64(1), category(2), float64(2), int64(2), string(2)
memory usage: 576.6 KB


In [17]:
df_salesforce_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   employee_id    2000 non-null   int64   
 1   employee_name  2000 non-null   string  
 2   region         2000 non-null   category
 3   total_sales    2000 non-null   float64 
 4   effectiveness  2000 non-null   float64 
dtypes: category(1), float64(2), int64(1), string(1)
memory usage: 64.8 KB


In [18]:
df_suppliers_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   supplier_id     2000 non-null   int64  
 1   supplier_name   2000 non-null   string 
 2   lead_time_days  2000 non-null   int64  
 3   contact_info    2000 non-null   string 
 4   rating          2000 non-null   float64
dtypes: float64(1), int64(2), string(2)
memory usage: 78.3 KB


In [19]:
df_transactions_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   transaction_id  20000 non-null  int64              
 1   date            20000 non-null  datetime64[ns, UTC]
 2   product_id      20000 non-null  int64              
 3   units_sold      20000 non-null  int64              
 4   customer_id     20000 non-null  int64              
 5   employee_id     20000 non-null  int64              
 6   list_price      20000 non-null  float64            
 7   sales_amount    20000 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(2), int64(5)
memory usage: 1.2 MB


### 3.1  Descriptive Statistics.

#### 3.1.1 Descriptive statistics for original datasets.

In [None]:
# Descriptive statistics for xxx dataset
df_xxx_feature.describe(include='all')

#### 3.1.2 Descriptive statistics for name dataset, quantitative values.

<table>
  <thead>
    <tr>
      <th>CV (%)</th>
      <th>Interpretation for Coefficient of Variation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><small><strong>0–10%</strong></small></td>
      <td><small><strong>Very low</strong> variability → <strong>very reliable</strong> Mean</small></td>
    </tr>
    <tr>
      <td><small><strong>10–20%</strong></small></td>
      <td><small><strong>Moderate</strong> variability → <strong>reliable</strong> Mean</small></td>
    </tr>
    <tr>
      <td><small><strong>20–30%</strong></small></td>
      <td><small><strong>Considerable</strong> variability → <strong>somewhat skewed</strong> Mean</small></td>
    </tr>
    <tr>
      <td><small><strong>&gt;30%</strong></small></td>
      <td><small><strong>High</strong> variability → <strong>prefer</strong> Median</small></td>
    </tr>
  </tbody>
</table>
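
The thresholds in the table can be applied in a few lines of pandas. The sketch below is not the project's `evaluate_central_trend`; the function names are hypothetical, the CV uses the sample standard deviation, and each band is assumed to include its upper bound.

```python
import pandas as pd


def coefficient_of_variation(series: pd.Series) -> float:
    """Return the sample coefficient of variation in percent: std / |mean| * 100."""
    values = series.dropna().astype(float)
    mean = values.mean()
    if mean == 0:
        # CV is undefined when the mean is zero
        return float("nan")
    return values.std(ddof=1) / abs(mean) * 100


def interpret_cv(cv_percent: float) -> str:
    """Map a CV (%) to the bands of the table above (upper bounds assumed inclusive)."""
    if pd.isna(cv_percent):
        return "Undefined (mean is zero)"
    if cv_percent <= 10:
        return "Very low variability: very reliable mean"
    if cv_percent <= 20:
        return "Moderate variability: reliable mean"
    if cv_percent <= 30:
        return "Considerable variability: somewhat skewed mean"
    return "High variability: prefer median"


# Hypothetical sample values
sample = pd.Series([10, 12, 11, 9, 13])
cv = coefficient_of_variation(sample)
print(f"CV = {cv:.2f}% -> {interpret_cv(cv)}")
```

The CV is only meaningful for ratio-scale variables with a mean far from zero, which is why the sketch returns NaN instead of dividing by a zero mean.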


In [None]:
df_xxx_feature['column_name'].describe()

In [None]:
# Evaluate the coefficient of variation to select the proper measure of central tendency
evaluate_central_trend(df_xxx_feature, 'column_name')

In [None]:
# Evaluate boundary thresholds and detect potential outliers
outlier_limit_bounds(df_xxx_feature, 'column_name', bound='both', clamp_zero=True)
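
`outlier_limit_bounds` is a project helper whose internals are not shown here. A common way to derive such bounds is Tukey's 1.5 × IQR rule; the sketch below assumes that rule and uses a hypothetical function name, with `clamp_zero` mirroring the parameter above by flooring the lower bound at zero for quantities that cannot be negative.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5, clamp_zero: bool = False) -> tuple[float, float]:
    """Return (lower, upper) outlier bounds using the k * IQR rule (assumed, not the project's method)."""
    values = series.dropna().astype(float)
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    if clamp_zero:
        # Negative bounds make no sense for counts, prices, or stock levels
        lower = max(lower, 0.0)
    return float(lower), float(upper)


# Hypothetical sample with one obvious outlier
sample_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
lower_bound, upper_bound = iqr_bounds(sample_values, clamp_zero=True)
outliers = sample_values[(sample_values < lower_bound) | (sample_values > upper_bound)]
print(lower_bound, upper_bound, outliers.tolist())
```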

In [None]:
# Show data distribution with detailed statistical info
plot_distribution_dispersion(df_xxx_feature, 'column_name', bins=43)

### 3.2 Data Visualization: Distributions and Relationships.

#### 3.2.1 Covariance and Correlation Analysis.

##### 3.2.1.1 Covariance Matrix.

In [None]:
# Covariance between the selected numeric columns
df_xxx_feature[['column_name', 'column_name']].cov()

##### 3.2.1.2 Correlation Matrix.

| Correlation Value     | Interpretation                |
| --------------------- | ----------------------------- |
| `+0.7` to `+1.0`      | Strong positive correlation   |
| `+0.3` to `+0.7`      | Moderate positive correlation |
| `0.0` to `+0.3`       | Weak positive correlation     |
| `0`                   | No correlation                |
| `-0.3` to `0`         | Weak negative correlation     |
| `-0.7` to `-0.3`      | Moderate negative correlation |
| `-1.0` to `-0.7`      | Strong negative correlation   |
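
The table can be turned into a small labelling function for the output of `DataFrame.corr()`. This is a minimal sketch, not the project's `evaluate_correlation`; the function name and sample data are hypothetical, and the boundary values 0.3 and 0.7 are assumed to belong to the stronger band.

```python
import pandas as pd


def interpret_correlation(r: float) -> str:
    """Label a Pearson coefficient using the bands of the table above."""
    if pd.isna(r):
        return "Undefined"
    strength = abs(r)
    if strength == 0:
        return "No correlation"
    if strength >= 0.7:
        label = "Strong"
    elif strength >= 0.3:
        label = "Moderate"
    else:
        label = "Weak"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} correlation"


# Hypothetical sample data
df_corr_demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})
corr_matrix = df_corr_demo.corr()  # Pearson by default

for col in ["y", "z"]:
    r = corr_matrix.loc["x", col]
    print(f"x vs {col}: r = {r:.2f} -> {interpret_correlation(r)}")
```

Pearson's coefficient only measures linear association, so a weak value does not rule out a nonlinear relationship; the scatter matrix below helps check for that.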


In [None]:
# Correlation between the selected numeric columns
df_xxx_feature[['column_name', 'column_name']].corr()

In [None]:
evaluate_correlation(df_xxx_feature)

In [None]:
plot_scatter_matrix(df_xxx_feature[['column_name', 'column_name']])

### 3.3 Data Visualization: Data dispersion and outliers.

#### 3.3.1 Data dispersion and outliers for ...

In [None]:
# xxx Distribution Frequency and Frequency density
plot_frequency_density(df_xxx_feature['column_name'], bins=np.arange(min, max, step), color='grey', title='Frequency Density of name', 
                       xlabel='Name (units)', ylabel='Density', xticks_range=(min, max, step), show_kde=True, rotation=0)

In [None]:
# xxx data dispersion
plot_boxplots(ds_list=[df_xxx_feature['column_name']], xlabels=['name'], ylabel='Values', title='Name Data dispersion', 
              yticks_range=(min, max, step), rotation=0, color=['grey'])

### 3.4 Data visualization for ...

#### 3.4.1 Data visualization for ...

In [None]:
# Plots for insights

## 4. Conclusions and key insights

### 🎯 Key Findings

#### Behavioral Insights

- **XXX**: xxx 

#### Other Insights

- **XXX**: xxx 

### Final Takeaways

- **XXX**: xxx 

