<a href="https://colab.research.google.com/github/Soichiro-Gardinner/Sales_Prediction_Solved/blob/main/Sales_Prediction_Solved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sales Prediction**

- **By:** Oscar Castanaza
- **Date:** 4/13/2023

# **Import Libs & Read Data:**

In [None]:
import pandas as pd

# read in the data from CSV file
df = pd.read_csv('https://drive.google.com/uc?id=1syH81TVrbBsdymLT_jl2JIf6IjPXtSQw')

# display the first 5 rows
print(df.head())

  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  \
0           FDA15         9.30          Low Fat         0.016047   
1           DRC01         5.92          Regular         0.019278   
2           FDN15        17.50          Low Fat         0.016760   
3           FDX07        19.20          Regular         0.000000   
4           NCD19         8.93          Low Fat         0.000000   

               Item_Type  Item_MRP Outlet_Identifier  \
0                  Dairy  249.8092            OUT049   
1            Soft Drinks   48.2692            OUT018   
2                   Meat  141.6180            OUT049   
3  Fruits and Vegetables  182.0950            OUT010   
4              Household   53.8614            OUT013   

   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type  \
0                       1999      Medium               Tier 1   
1                       2009      Medium               Tier 3   
2                       1999      Medium               Tier

- **<font color='#ffd966'>Duplicates:</font>**

In [None]:
print(df.duplicated().sum())

0


- **<font color='#ffd966'>Inconsistencies:</font>**



One issue is the **'Item_Fat_Content'** column: the values **'Low Fat'**, **'LF'**, and **'low fat'** all denote the same category, as do **'Regular'** and **'reg'**. We can consolidate them with the `replace` method:

In [None]:
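# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(['LF', 'low fat'], 'Low Fat')
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('reg', 'Regular')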
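
As a quick check on this kind of label cleanup, the sketch below applies a single mapping to a small made-up Series and counts the categories that remain. The sample values and the name `fat_content_map` are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the Item_Fat_Content column (values are made up)
fat = pd.Series(['Low Fat', 'LF', 'low fat', 'Regular', 'reg'])

# Hypothetical mapping from variant spellings to the canonical labels
fat_content_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}

# replace() leaves values that are not keys of the mapping untouched
cleaned = fat.replace(fat_content_map)

# Count the remaining categories to confirm the cleanup
counts = cleaned.value_counts()
print(counts)
```

Running `value_counts()` before and after the replacement is a cheap way to confirm that no unexpected spelling is left over.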

- **<font color='#ffd966'>More Inconsistencies:</font>**
1. The Item_Weight column has missing values.

2. The Outlet_Size column also has missing values.

3. The Item_Visibility column contains zeros, which is implausible: a product that is on sale cannot have zero visibility.

In [None]:
# Replace zero value in Item_Visibility with mean visibility of the corresponding item
item_visibility_mean = df.groupby('Item_Identifier')['Item_Visibility'].mean()
missing_values = (df['Item_Visibility'] == 0)
df.loc[missing_values,'Item_Visibility'] = df.loc[missing_values,'Item_Identifier'].apply(lambda x: item_visibility_mean[x])
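
One caveat with a per-item mean is that the zeros themselves take part in the average and pull it down. The sketch below shows one possible variant, on made-up data, that first treats zeros as missing and then fills them with the mean of the non-zero values for the same item. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: item 'A' has one zero, item 'B' has only zeros (values are made up)
toy = pd.DataFrame({
    'Item_Identifier': ['A', 'A', 'A', 'B', 'B'],
    'Item_Visibility': [0.02, 0.04, 0.0, 0.0, 0.0],
})

# Treat zeros as missing so they do not pull the per-item mean down
vis = toy['Item_Visibility'].replace(0, np.nan)

# Per-item mean of the non-zero values, broadcast back to every row
item_mean = vis.groupby(toy['Item_Identifier']).transform('mean')

# Fill missing entries with the per-item mean; items with no valid value stay NaN
toy['Item_Visibility'] = vis.fillna(item_mean)
print(toy)
```

Items such as `'B'`, which never have a non-zero visibility, remain missing and still need a fallback or have to be dropped.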


- **<font color='#ffd966'>Find NaNs:</font>**

In [None]:
df.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

- **<font color='#ffd966'>Handle NaNs:</font>**

In [None]:
# Fill missing values in Item_Weight with the mean weight of the corresponding item
item_weight_mean = df.groupby('Item_Identifier')['Item_Weight'].mean()
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Identifier'].map(item_weight_mean))

# Fill missing values in Outlet_Size with the most frequent size for the same Outlet_Type
outlet_size_mode = df.groupby('Outlet_Type')['Outlet_Size'].agg(lambda x: x.mode()[0])
missing_values = df['Outlet_Size'].isnull()
df.loc[missing_values, 'Outlet_Size'] = df.loc[missing_values, 'Outlet_Type'].map(outlet_size_mode)

# Fallbacks for anything still missing (e.g. items whose weight is never recorded)
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
df['Outlet_Size'] = df['Outlet_Size'].fillna('Unknown')
# Drop rows whose visibility is still zero after the per-item replacement
df = df[df['Item_Visibility'] > 0]
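
The group-wise mode imputation with an `'Unknown'` fallback can be sketched on a few made-up rows. The outlet types, sizes, and the names `toy`, `size_by_type`, and `filled` are hypothetical. The point is only to show what happens when a group has no recorded size at all.

```python
import pandas as pd

# Toy data: 'Grocery' has a known size, 'Super' has none (values are made up)
toy = pd.DataFrame({
    'Outlet_Type': ['Grocery', 'Grocery', 'Grocery', 'Super', 'Super'],
    'Outlet_Size': ['Small', 'Small', None, None, None],
})

# Most frequent non-null size per outlet type; groups without any value are skipped
size_by_type = {}
for outlet_type, sizes in toy.groupby('Outlet_Type')['Outlet_Size']:
    modes = sizes.mode()  # mode() ignores missing values by default
    if not modes.empty:
        size_by_type[outlet_type] = modes.iloc[0]

# First pass: fill from the group mode; second pass: explicit fallback label
filled = toy['Outlet_Size'].fillna(toy['Outlet_Type'].map(size_by_type))
filled = filled.fillna('Unknown')
print(filled.tolist())
```

Guarding against an empty `mode()` avoids an indexing error when every value in a group is missing.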

In [None]:
df.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

# **Prepare Data For ML**

In [None]:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Identify features (X) and target (y)
X = df.drop('Item_Outlet_Sales', axis=1)
y = df['Item_Outlet_Sales']

# Perform a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Define categorical and numerical transformers
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Numerical Transformer:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['float64', 'int64']).columns

# Preprocess the dataset using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])
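
To see what the preprocessor produces, the following sketch runs the same kind of `ColumnTransformer` on a four-row toy frame. The column values and the short names `num`, `cat`, and `pre` are assumptions made for this illustration; they mirror the setup above without using the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numerical and one categorical column (values are made up)
toy = pd.DataFrame({
    'Item_MRP': [100.0, 150.0, 200.0, 250.0],
    'Outlet_Size': ['Small', 'Medium', 'Small', 'High'],
})

# Same two-step transformers as above: impute, then scale or one-hot encode
num = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler())])
cat = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))])

pre = ColumnTransformer(transformers=[
    ('num', num, ['Item_MRP']),
    ('cat', cat, ['Outlet_Size'])])

# One scaled numerical column plus one column per observed category
out = pre.fit_transform(toy)
print(out.shape)

# A category not seen during fit is encoded as all zeros instead of raising
unseen = pd.DataFrame({'Item_MRP': [175.0], 'Outlet_Size': ['Huge']})
print(pre.transform(unseen).shape)
```

Because every distinct category becomes its own column, a high-cardinality column such as an identifier can widen the feature matrix considerably.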

**Pipeline:**

In [None]:
# Define the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())])

# Train the model and evaluate using both training and testing data
pipeline.fit(X_train, y_train)
train_preds = pipeline.predict(X_train)
test_preds = pipeline.predict(X_test)

**RMSE and R^2:**

In [None]:
# y_train:
train_rmse = np.sqrt(mean_squared_error(y_train, train_preds))
train_r2 = r2_score(y_train, train_preds)
# y_test:
test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
test_r2 = r2_score(y_test, test_preds)
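
For reference, both metrics can be computed by hand: RMSE is the square root of the mean squared residual, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks these formulas against scikit-learn on a few made-up numbers; `y_true` and `y_hat` are illustrative values, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_hat                       # [1, 0, -1, -1]

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # sqrt(3 / 4)

# R^2: 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)                  # 3
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 20
r2_manual = 1 - ss_res / ss_tot                  # 0.85

# Same quantities from scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(f'RMSE: {rmse_manual:.4f} vs {rmse_sklearn:.4f}')
print(f'R^2:  {r2_manual:.4f} vs {r2_sklearn:.4f}')
```

RMSE is expressed in the units of the target (sales here), while R^2 is unitless, which is why the two are usually reported together.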

**Output:**

In [None]:
print(f'Training RMSE: {train_rmse:.2f}')
print(f'Training R^2: {train_r2:.2f}')
print(f'Testing RMSE: {test_rmse:.2f}')
print(f'Testing R^2: {test_r2:.2f}')

Training RMSE: 980.53
Training R^2: 0.67
Testing RMSE: 1305.67
Testing R^2: 0.39


# **Final Part:**

- **<font color='#ffd966'>Decision Tree:</font>** We fit a depth-limited decision tree (`max_depth=5`) and evaluate its performance with r^2 and rmse:

In [None]:
# Define the decision tree pipeline
tree_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', DecisionTreeRegressor(max_depth=5, random_state=42))])

# Train the decision tree model and evaluate using both training and testing data
tree_pipeline.fit(X_train, y_train)
train_preds = tree_pipeline.predict(X_train)
test_preds = tree_pipeline.predict(X_test)

train_rmse = np.sqrt(mean_squared_error(y_train, train_preds))
train_r2 = r2_score(y_train, train_preds)

test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
test_r2 = r2_score(y_test, test_preds)

**Output:**

In [None]:
print(f'Training RMSE (Decision Tree): {train_rmse:.2f}')

Training RMSE (Decision Tree): 1080.88


- **<font color='#ffd966'>Regression Tree:</font>**

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Fit model on training data
rt = DecisionTreeRegressor(random_state=42)
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', rt)
])
model.fit(X_train, y_train)

# Evaluate r^2 on test data
y_pred = model.predict(X_test)
r2_rt = r2_score(y_test, y_pred)
print(f"r^2 score for regression tree: {r2_rt:.4f}")

# Evaluate RMSE on test data
rmse_rt = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE score for regression tree: {rmse_rt:.4f}")


r^2 score for regression tree: 0.2453
RMSE score for regression tree: 1453.8887


- **Finally**, to determine which model to implement, we compare the 'r^2' and 'rmse' scores of both models on the test set. A higher 'r^2' score and a lower 'rmse' score indicate a better model.
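
One way to keep such a comparison consistent is to score every candidate on the same train/test split and collect the four numbers side by side. The sketch below does this on synthetic data from `make_regression`, so its scores say nothing about the sales data. The names `candidates`, `scores`, and `best` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

candidates = {
    'linear': LinearRegression(),
    'tree_depth5': DecisionTreeRegressor(max_depth=5, random_state=42),
    'tree_full': DecisionTreeRegressor(random_state=42),
}

scores = {}
for name, est in candidates.items():
    est.fit(X_tr, y_tr)
    row = {}
    # Score each model on both splits so overfitting shows up as a train/test gap
    for split, (Xs, ys) in {'train': (X_tr, y_tr), 'test': (X_te, y_te)}.items():
        preds = est.predict(Xs)
        row[f'{split}_rmse'] = float(np.sqrt(mean_squared_error(ys, preds)))
        row[f'{split}_r2'] = float(r2_score(ys, preds))
    scores[name] = row

# Pick the candidate with the highest test-set R^2
best = max(scores, key=lambda n: scores[n]['test_r2'])
for name, row in scores.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
print('best by test R^2:', best)
```

Selecting on test-set scores rather than training scores matters because an unconstrained tree can fit the training data almost perfectly and still generalize poorly.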

# **Final Decision**

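Based on the results obtained from the above code, the Decision Tree model (`max_depth=5`) has a higher r^2 score and a lower rmse score than the unconstrained regression tree, which reaches a test r^2 of 0.2453 and a test rmse of 1453.89. Therefore, we recommend using the Decision Tree model for this dataset. The comparison should rest on the test-set scores (`test_rmse` and `test_r2` for the Decision Tree), since the only Decision Tree figure printed above is its training rmse of 1080.88. It is also good practice to try out multiple models and evaluate their performance before making a final decision.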