
# EDA "Grupo Bimbo Inventory Demand"

### Introduction

Grupo Bimbo, a leading multinational bakery company, faces a unique challenge in managing the inventory of its products. With a typical shelf life of just one week, the accuracy of daily inventory calculations is paramount. Currently, these calculations are performed by direct delivery sales employees who rely on their personal experiences to predict the forces of supply, demand, and consumer behavior at each store. The margin for error in this process is minimal. Underestimating demand results in empty shelves and lost sales, while overestimating demand leads to excess product returns and increased expenses.

Grupo Bimbo aims to create a predictive model that can accurately forecast inventory needs based on historical data, thereby optimizing the supply chain and improving efficiency.


## Libraries

In [1]:
# importing the basic libraries
!pip install ydata_profiling
from ydata_profiling import ProfileReport
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import path
import os
import plotly.express as px
import gc



## Load data

We load the datasets `train_2013.csv` and `test_2014.csv`, which contain the training and test observations respectively.


In [2]:
train=pd.read_csv("/content/drive/MyDrive/train_2013.csv")
test=pd.read_csv("/content/drive/MyDrive/test_2014.csv")

print("Training Size : (%d,%d)"%train.shape)
print("Test Size : (%d,%d)"%test.shape)

Training Size : (1126694,20)
Test Size : (630452,19)


## Understand the data

We performed a preliminary exploration to understand the structure of the training data, including descriptive statistics, the count of null values per column, and the first rows of the dataset.


In [3]:
# Preliminary exploration

# Descriptive Statistics
print("----"*15)
print("Descriptive Statistics")
print("----"*15)
print(train.describe())

# Missing Values
print("----"*15)
print("Missing Values")
print("----"*15)
print(train.isnull().sum())

# DataFrame
print("----"*15)
print("DataFrame")
print("----"*15)
train.head()


------------------------------------------------------------
Descriptive Statistics
------------------------------------------------------------
                 Id      Expected
count  1.126694e+06  1.126694e+06
mean   5.634753e+05  4.238658e+00
std    3.253159e+05  7.542596e+01
min    1.000000e+00  0.000000e+00
25%    2.817512e+05  0.000000e+00
50%    5.634785e+05  0.000000e+00
75%    8.452108e+05  0.000000e+00
max    1.126934e+06  2.451940e+04
------------------------------------------------------------
Missing Values
------------------------------------------------------------
Id                        0
TimeToEnd                 0
DistanceToRadar           0
Composite                 0
HybridScan                0
HydrometeorType           0
Kdp                       0
RR1                       0
RR2                       0
RR3                       0
RadarQualityIndex         0
Reflectivity              0
ReflectivityQC            0
RhoHV                     0
Velocity            

Unnamed: 0,Id,TimeToEnd,DistanceToRadar,Composite,HybridScan,HydrometeorType,Kdp,RR1,RR2,RR3,RadarQualityIndex,Reflectivity,ReflectivityQC,RhoHV,Velocity,Zdr,LogWaterVolume,MassWeightedMean,MassWeightedSD,Expected
0,1,56.0 37.0 31.0 25.0 19.0 13.0 7.0 2.0,30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0,0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0,0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,0.006246 0.0200476 0.0113924 0.217157 0.028566...,13.0 17.5 14.0 8.5 7.0 11.0 9.0 9.0,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,0.865 0.841667 0.765 0.985 0.768333 0.491667 0...,-99901.0 -99901.0 -99901.0 -99901.0 -99901.0 -...,7.9375 4.5 4.1875 5.5625 3.375 7.0625 5.3125 6...,nan nan nan nan nan nan nan nan,nan nan nan nan nan nan nan nan,nan nan nan nan nan nan nan nan,0.0
1,2,58.0 48.0 38.0 29.0 19.0 9.0,77.0 77.0 77.0 77.0 77.0 77.0,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,8.0 8.0 8.0 8.0 8.0 8.0,0.0 0.0 0.0 0.0 0.0 0.0,0.0 0.0 0.0 0.0 0.0 0.0,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,999.0 999.0 999.0 999.0 999.0 999.0,15.0 18.5 10.5 3.0 0.5 -3.0,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,0.635 0.851667 0.891667 0.638333 0.791667 0.73...,-4.0 -3.0 -2.0 -0.5 -4.0 3.0,2.6875 3.0 2.375 6.25 3.125 6.0625,nan nan nan nan nan nan,nan nan nan nan nan nan,nan nan nan nan nan nan,0.0
2,3,59.0 20.0,75.0 75.0,-99900.0 -99900.0,-99900.0 -99900.0,8.0 8.0,0.0 0.0,0.0 0.0,-99900.0 -99900.0,-99900.0 -99900.0,999.0 999.0,6.5 4.0,-99900.0 -99900.0,0.998333 0.891667,-99900.0 -3.5,-6.5 -4.6875,nan nan,nan nan,nan nan,0.0
3,4,53.0 43.0 34.0 24.0 14.0 5.0,21.0 21.0 21.0 21.0 21.0 21.0,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,8.0 8.0 8.0 8.0 8.0 8.0,0.0 0.0 0.0 0.0 0.0 0.0,0.0 0.0 0.0 0.0 0.0 0.0,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,0.0 0.0 0.0 0.0 0.0 0.0,11.0 14.0 12.0 11.0 13.0 15.5,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,0.688333 0.518333 0.708333 0.805 0.708333 0.555,-7.0 -12.0 -11.5 -8.5 -8.0 -13.0,-0.375 5.0625 1.1875 2.0 2.0625 0.3125,nan nan nan nan nan nan,nan nan nan nan nan nan,nan nan nan nan nan nan,0.0
4,5,56.0 52.0 43.0 59.0 54.0 48.0 42.0 36.0 31.0 5...,69.0 69.0 69.0 83.0 83.0 83.0 83.0 83.0 83.0 5...,23.0 24.0 22.0 15.5 14.5 16.0 15.0 18.5 12.5 1...,13.5 15.5 19.0 -99900.0 -99900.0 -99900.0 -999...,9.0 9.0 9.0 8.0 8.0 8.0 8.0 9.0 9.0 9.0 9.0 9....,0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0....,0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.27899 0....,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,-99900.0 -99900.0 -99900.0 -99900.0 -99900.0 -...,1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.996433 0...,14.0 14.0 17.0 24.5 23.5 21.5 25.0 16.0 21.0 1...,14.0 14.0 17.0 -99900.0 -99900.0 -99900.0 -999...,1.01833 1.01167 0.991667 1.015 1.015 1.005 1.0...,14.0 13.5 12.5 -13.5 -19.5 -16.0 -15.0 -14.0 -...,0.9375 -0.875 -0.75 0.0 0.0625 0.3125 0.5625 -...,-13.4793885769 -12.1370512402 -11.6001776071 n...,1.86413642918 1.27740873124 1.35497004174 nan ...,0.755068594278 0.502681241559 0.514253049727 n...,0.0
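
The preview above shows that each feature cell stores a variable number of space-separated readings. A minimal sketch of how the number of readings per row could be counted before splitting is given below. The toy frame and the column name `n_readings` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame mimicking the layout above (values are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "TimeToEnd": ["56.0 37.0 31.0", "58.0 48.0", "59.0"],
    "Expected": [0.0, 0.0, 0.3],
})

# Number of readings per row = number of separating spaces + 1
toy["n_readings"] = toy["TimeToEnd"].str.count(" ") + 1

# Distribution of sequence lengths, useful when deciding how many pieces to keep
length_counts = toy["n_readings"].value_counts().sort_index()
print(length_counts)
```

Running the same count on the real `TimeToEnd` column would indicate how many rows hold more readings than the split used later can keep.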


### Cleaning and treatment of Data

As the preview above shows, each feature cell holds several space-separated readings rather than a single value. We identified and addressed missing values, replaced the specific missing-data codes with `NaN` so they can be imputed uniformly, and considered removing columns with a high proportion of missing values. We also split these compound columns into multiple columns for easier analysis.
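
The missing-code replacement and column-pruning steps described above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the code list is the one used later in this notebook, while the toy values and the `max_missing` threshold are hypothetical. Note that a value such as `999.0` may be a valid reading in some columns, so a per-column list could be safer.

```python
import numpy as np
import pandas as pd

# Sentinel codes that mark missing radar readings (same list as used later in this notebook)
missing_value_codes = [-99000, -99901, -99903, -99900, 999.0]

# Toy numeric frame (illustrative values only)
toy = pd.DataFrame({
    "Reflectivity_1": [13.0, 15.0, 6.5, 11.0],
    "Kdp_1": [-99900.0, -99900.0, -99900.0, 0.5],
    "RadarQualityIndex_1": [0.006, 999.0, 999.0, 0.0],
})

# Replace the sentinel codes with NaN (returns a new frame, the original is untouched)
cleaned = toy.replace(missing_value_codes, np.nan)

# Fraction of missing values per column
missing_fraction = cleaned.isna().mean()

# Keep only columns whose missing fraction does not exceed the hypothetical threshold
max_missing = 0.6
kept = cleaned.loc[:, missing_fraction <= max_missing]

print(missing_fraction)
print(kept.columns.tolist())
```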


In [3]:
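# Split each space-separated column into up to 8 numeric columns

split_frames = []

for column in train.columns:
  # n=7 allows at most 7 splits (8 pieces); readings beyond the 8th stay unsplit in the last piece
  train_split = train[column].astype(str).str.split(' ', expand=True, n=7)
  # Rename new columns
  train_split.columns = [f'{column}_{i+1}' for i in range(train_split.shape[1])]
  # Convert the data to numeric (values that cannot be parsed become NaN)
  train_split = train_split.apply(pd.to_numeric, errors='coerce')
  split_frames.append(train_split)

# Combine new columns in a single concat instead of growing the frame inside the loop
train_final = pd.concat(split_frames, axis=1)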
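
One side effect of the `n=7` argument in the split above is worth checking. When a row holds more than eight readings, the eighth piece keeps the unsplit remainder, and `pd.to_numeric(..., errors='coerce')` then turns it into `NaN`. The sketch below shows this on a toy series and one possible alternative: split fully, then keep the first eight pieces. The variable names are illustrative and this is not the notebook's actual method.

```python
import pandas as pd

# Toy series: one short row and one row with ten readings (illustrative values)
s = pd.Series(["1.0 2.0 3.0", "1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0"])

# With n=7 the 8th piece of the long row is the unsplit remainder "8.0 9.0 10.0"
limited = s.str.split(" ", expand=True, n=7)
print(limited.iloc[1, 7])

# Coercing to numeric turns that remainder into NaN, so the 8th reading is lost
limited_numeric = limited.apply(pd.to_numeric, errors="coerce")

# Alternative: split every reading, then keep only the first 8 pieces
full = s.str.split(" ", expand=True)
first_eight = full.iloc[:, :8].apply(pd.to_numeric, errors="coerce")
print(first_eight.iloc[1, 7])
```

Whether the first eight readings or some other subset should be kept depends on how the readings are ordered, which this sketch does not address.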


In [24]:
# Split of columns with Dask (more processing) - Not functional: converting from Dask to pandas takes longer
import dask.dataframe as dd

# Load the CSV file with Dask
train = dd.read_csv('/content/drive/MyDrive/train_2013.csv')

def split_columns(df, column):
    # Convert to string and then split, limiting the result to 8 pieces
    df_split = df[column].astype(str).str.split(' ', expand=True, n=7)
    # Rename the new columns
    df_split.columns = [f'{column}_{i+1}' for i in range(df_split.shape[1])]
    # Convert the data to numeric (optional)
    for col in df_split.columns:
        df_split[col] = df_split[col].apply(pd.to_numeric, errors='coerce', meta=('result', 'float64')) # Specify meta for Dask
    return df_split

# Start with no result; the final Dask DataFrame is built up column by column
train_final = None

# Process each column of the original DataFrame
for column in train.columns:
    if column not in ['Expected']:  # Exclude the 'Expected' column
        train_split = split_columns(train, column)
        if train_final is None:
            train_final = train_split
        else:
            train_final = dd.concat([train_final, train_split], axis=1)
    else:
        if train_final is None:
            train_final = train[[column]]
        else:
            train_final[column] = train[column]

# Persist the DataFrame in Dask memory
train_final = train_final.persist()

# Compute the final DataFrame
train_final = train_final.compute()



We're assuming that the indices of each dataframes are 
 aligned. This assumption is not generally safe.

In [44]:
# Check how train_final turned out

print(train_final.shape)
train_final.head(3)

(1126694, 146)


Unnamed: 0,Id_1,TimeToEnd_1,TimeToEnd_2,TimeToEnd_3,TimeToEnd_4,TimeToEnd_5,TimeToEnd_6,TimeToEnd_7,TimeToEnd_8,DistanceToRadar_1,...,MassWeightedMean_8,MassWeightedSD_1,MassWeightedSD_2,MassWeightedSD_3,MassWeightedSD_4,MassWeightedSD_5,MassWeightedSD_6,MassWeightedSD_7,MassWeightedSD_8,Expected_1
0,1,56.0,37.0,31.0,25.0,19.0,13.0,7.0,2.0,30.0,...,,,,,,,,,,0.0
1,2,58.0,48.0,38.0,29.0,19.0,9.0,,,77.0,...,,,,,,,,,,0.0
2,3,59.0,20.0,,,,,,,75.0,...,,,,,,,,,,0.0


In [43]:
train_final.isnull().sum()

Id_1                      0
TimeToEnd_1               0
TimeToEnd_2          158469
TimeToEnd_3          254935
TimeToEnd_4          334912
                     ...   
MassWeightedSD_5     971604
MassWeightedSD_6     979712
MassWeightedSD_7     990634
MassWeightedSD_8    1120851
Expected_1                0
Length: 146, dtype: int64

In [31]:
# Replace specific missing value codes with NaN
missing_value_codes = [-99000, -99901, -99903, -99900, 999.0]
# train_final.replace(missing_value_codes, np.nan, inplace=True)  # did not work either

# train_final = train_final.dropna()  # no, definitely not

# Print the dataframe info to verify changes
print(train_final.shape)
print(train_final.info())
train_final.head()

(13, 146)
<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, 18368 to 1080865
Columns: 146 entries, Id_1 to Expected_1
dtypes: float64(146)
memory usage: 14.9 KB
None


Unnamed: 0,Id_1,TimeToEnd_1,TimeToEnd_2,TimeToEnd_3,TimeToEnd_4,TimeToEnd_5,TimeToEnd_6,TimeToEnd_7,TimeToEnd_8,DistanceToRadar_1,...,MassWeightedMean_8,MassWeightedSD_1,MassWeightedSD_2,MassWeightedSD_3,MassWeightedSD_4,MassWeightedSD_5,MassWeightedSD_6,MassWeightedSD_7,MassWeightedSD_8,Expected_1
18368,18376.0,43.0,37.0,31.0,25.0,19.0,14.0,8.0,2.0,45.0,...,1.370042,0.825327,0.558576,0.685658,0.630849,0.644939,0.656052,0.572861,0.531898,0.0
81545,81569.0,34.0,30.0,25.0,20.0,16.0,11.0,6.0,2.0,55.0,...,1.695983,0.821164,0.960096,0.692376,0.754972,0.86527,0.763035,0.702645,0.65003,0.3
217564,217624.0,37.0,33.0,28.0,23.0,18.0,14.0,9.0,4.0,30.0,...,1.500083,0.583623,0.620634,0.634983,0.583623,0.652475,0.556304,0.508403,0.58993,43.0
417070,417174.0,60.0,56.0,51.0,46.0,42.0,37.0,32.0,28.0,78.0,...,1.391009,0.71477,0.694124,0.711431,0.656705,0.694186,0.614423,0.619405,0.529053,1.0
480725,480837.0,59.0,53.0,47.0,42.0,36.0,30.0,24.0,19.0,69.0,...,1.509353,0.649775,0.754972,0.93089,0.520072,0.637161,0.952285,0.812261,0.57548,0.3
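
The 13-row frame above appears consistent with a run in which `dropna()` was applied: only rows with a value in every one of the 146 columns survive, which discards almost all of the 1,126,694 observations. A minimal sketch of a less aggressive option uses the `thresh` parameter of `DataFrame.dropna` on a toy frame. The threshold value and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of trailing NaN pattern (illustrative values only)
toy = pd.DataFrame({
    "TimeToEnd_1": [56.0, 58.0, 59.0, 53.0],
    "TimeToEnd_2": [37.0, 48.0, np.nan, 43.0],
    "TimeToEnd_3": [31.0, np.nan, np.nan, 34.0],
    "MassWeightedSD_1": [np.nan, np.nan, np.nan, 0.75],
})

# Dropping rows with any NaN keeps only fully populated rows
strict = toy.dropna()

# thresh=N keeps rows that have at least N non-missing values (hypothetical N=2)
lenient = toy.dropna(thresh=2)

print(len(toy), len(strict), len(lenient))
```

An alternative design is to keep every row and leave the `NaN` values for imputation, or for a model that accepts missing values.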


### Exploratory Data Analysis


In [None]:
profile_obj = ProfileReport(train_final, title='How much did it rain')
profile_obj.to_file('/content/drive/MyDrive/train_final.html')
profile_obj


