<a href="https://colab.research.google.com/github/Soichiro-Gardinner/Sales_Prediction_Solved/blob/main/Sales_Prediction_Solved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sales Prediction**

- **By:** Oscar Castanaza
- **Date:** 4/13/2023

# **Import Libs & Read Data:**

In [1]:
import pandas as pd

# read in the data from CSV file
df = pd.read_csv('https://drive.google.com/uc?id=1syH81TVrbBsdymLT_jl2JIf6IjPXtSQw')

# display the first 5 rows
display(df.head())

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


- **<font color='#ffd966'>Duplicates:</font>**

In [2]:
print(df.duplicated().sum())

0


- **<font color='#ffd966'>Inconsistencies:</font>**



One issue is the **'Item_Fat_Content'** column: the values **'Low Fat'**, **'LF'**, and **'low fat'** all mean the same thing, as do **'Regular'** and **'reg'**. We can consolidate them with the `replace` method:

In [3]:
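# Map each variant to its canonical label. Assigning the result back avoids the
# chained inplace call, which recent pandas versions warn about.
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(
    {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})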
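
To check that the consolidation leaves only two labels, a quick before/after count helps. The sketch below uses a made-up toy Series (the values are illustrative, not taken from the dataset) and a mapping dictionary named `canonical`, which is an assumption for this example:

```python
import pandas as pd

# Toy column with the label variants described above (illustrative values)
fat = pd.Series(['Low Fat', 'LF', 'low fat', 'Regular', 'reg'])

# Map every variant to its canonical label; unlisted values pass through unchanged
canonical = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
cleaned = fat.replace(canonical)

# Count the labels that remain after cleaning
counts = cleaned.value_counts().to_dict()
print(counts)
```

On the real data, running `df['Item_Fat_Content'].value_counts()` before and after the replacement gives the same kind of check.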

- **<font color='#ffd966'>More Inconsistencies:</font>**
1. The Item_Weight column has missing values.

2. The Outlet_Size column also has missing values.

3. The Item_Visibility column contains zeros, which doesn't make sense: a product that is sold cannot have zero visibility.

In [4]:
# Replace zeros in Item_Visibility with the mean visibility of the same item
item_visibility_mean = df.groupby('Item_Identifier')['Item_Visibility'].mean()
zero_visibility = (df['Item_Visibility'] == 0)
# .map looks up each item's mean by its identifier
df.loc[zero_visibility, 'Item_Visibility'] = df.loc[zero_visibility, 'Item_Identifier'].map(item_visibility_mean)
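
One caveat: the per-item mean above is computed with the zero entries included, so the filled-in values are pulled toward zero. A possible variant, sketched here on a made-up toy frame, treats zeros as missing first and averages only the non-zero values. This is an alternative under that assumption, not the approach the notebook ran:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'Item_Identifier': ['A', 'A', 'A', 'B', 'B'],
    'Item_Visibility': [0.0, 0.02, 0.04, 0.0, 0.10],
})

# Treat zeros as missing so they don't drag the per-item mean down
vis = toy['Item_Visibility'].replace(0, np.nan)

# Per-item mean of the non-zero values, broadcast back to every row
item_mean = vis.groupby(toy['Item_Identifier']).transform('mean')

# Fill only the former zeros; observed values stay untouched
toy['Item_Visibility'] = vis.fillna(item_mean)
print(toy)
```

An item whose visibility is zero in every row would still end up as NaN here and would need a fallback, such as the overall mean.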


- **<font color='#ffd966'>Find NaNs:</font>**

In [5]:
df.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

# **Prepare Data For ML**

In [6]:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Identify features (X) and target (y)
X = df.drop('Item_Outlet_Sales', axis=1)
y = df['Item_Outlet_Sales']

# Perform a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

- **<font color='#ffd966'>Imputers & Transformers:</font>**

In [7]:
# Define categorical and numerical transformers
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Numerical Transformer:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['float64', 'int64']).columns

# Preprocess the dataset using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])

#X_train_transformed = preprocessor.fit_transform(X_train)
#X_test_transformed = preprocessor.transform(X_test)
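
To see what the preprocessor produces, it can be run on its own, as the commented-out lines suggest. The sketch below rebuilds a reduced version on a four-row toy frame (the data and the name `toy_preprocessor` are assumptions for illustration) and inspects the shape of the result:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 19.2],
    'Outlet_Size': ['Medium', 'Medium', np.nan, 'High'],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Numerical: fill gaps with the column mean, then standardize
    ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler())]), ['Item_Weight']),
    # Categorical: fill gaps with the most frequent label, then one-hot encode
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['Outlet_Size']),
])

transformed = toy_preprocessor.fit_transform(toy)
# One scaled numerical column plus one column per category (High, Medium)
print(transformed.shape)
```

Calling `fit_transform` on the training split and `transform` on the testing split keeps the imputation and scaling statistics from leaking test information, which is what wrapping the preprocessor in a `Pipeline` does automatically.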

**Pipeline:**

In [8]:
# Define the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())])

# Train the model and evaluate using both training and testing data
pipeline.fit(X_train, y_train)
train_preds = pipeline.predict(X_train)
test_preds = pipeline.predict(X_test)

**RMSE & R^2:**

In [9]:
# Training-set metrics
train_rmse = np.sqrt(mean_squared_error(y_train, train_preds))
train_r2 = r2_score(y_train, train_preds)
# Testing-set metrics
test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
test_r2 = r2_score(y_test, test_preds)
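
For reference, both metrics can be computed by hand. The sketch below uses small made-up arrays (not the sales data) and compares the manual results against the scikit-learn functions used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (illustrative values)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'RMSE: {rmse_manual:.4f} (sklearn: {np.sqrt(mean_squared_error(y_true, y_pred)):.4f})')
print(f'R^2: {r2_manual:.4f} (sklearn: {r2_score(y_true, y_pred):.4f})')
```

RMSE is expressed in the units of the target (here, sales), while R^2 is unitless, which is why the two are usually reported together.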

**Output:**

In [10]:
print(f'Training RMSE: {train_rmse:.2f}')
print(f'Training R^2: {train_r2:.2f}')
print(f'Testing RMSE: {test_rmse:.2f}')
print(f'Testing R^2: {test_r2:.2f}')

Training RMSE: 980.50
Training R^2: 0.67
Testing RMSE: 1306.38
Testing R^2: 0.39


# **Final Part:**

- **<font color='#ffd966'>Decision Tree:</font>** Fit a depth-limited decision tree and evaluate its performance with r^2 and RMSE:

In [11]:
# Define the decision tree pipeline
tree_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', DecisionTreeRegressor(max_depth=5, random_state=42))])

# Train the decision tree model and evaluate using both training and testing data
tree_pipeline.fit(X_train, y_train)
train_preds = tree_pipeline.predict(X_train)
test_preds = tree_pipeline.predict(X_test)

train_rmse = np.sqrt(mean_squared_error(y_train, train_preds))
train_r2 = r2_score(y_train, train_preds)

test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
test_r2 = r2_score(y_test, test_preds)

**Output:**

In [12]:
print(f'Training RMSE (Decision Tree): {train_rmse:.2f}')

Training RMSE (Decision Tree): 1080.88


- **<font color='#ffd966'>Regression Tree:</font>**

In [13]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Baseline pipeline: decision tree with max_depth = 5, fit on the training data
rt = DecisionTreeRegressor(max_depth=5, random_state=42)
model_train = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', rt)
])
model_train.fit(X_train, y_train)

# Tune decision tree hyperparameters using GridSearchCV
params = {'regressor__max_depth': [2, 5, 10, None],
          'regressor__min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(model_train, params, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best hyperparameters and corresponding r^2 score
best_params = grid_search.best_params_
print(f"Best hyperparameters: {best_params}")

best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train)
r2_train = r2_score(y_train, y_train_pred)
print(f"r^2 score for decision tree on training data with best hyperparameters: {r2_train:.4f}")

y_test_pred = best_model.predict(X_test)
r2_test = r2_score(y_test, y_test_pred)
print(f"r^2 score for decision tree on testing data with best hyperparameters: {r2_test:.4f}")

# Fit a decision tree with no max_depth on the training data, then evaluate on both splits
rt = DecisionTreeRegressor(random_state=42)
model_test = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', rt)
])
model_test.fit(X_train, y_train)

y_train_pred = model_test.predict(X_train)
r2_train = r2_score(y_train, y_train_pred)
print(f"\n r^2 score for regression tree with no max_depth on training data: {r2_train:.4f}")

y_test_pred = model_test.predict(X_test)
r2_test = r2_score(y_test, y_test_pred)
print(f"r^2 score for regression tree with no max_depth on testing data: {r2_test:.4f}")

Best hyperparameters: {'regressor__max_depth': 5, 'regressor__min_samples_split': 2}
r^2 score for decision tree on training data with best hyperparameters: 0.6050
r^2 score for decision tree on testing data with best hyperparameters: 0.5963

 r^2 score for regression tree with no max_depth on training data: 1.0000
r^2 score for regression tree with no max_depth on testing data: 0.2463
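
The gap between the training and testing r^2 of the unrestricted tree is the usual sign of overfitting: with no depth limit, the tree can memorize the training rows. The sketch below reproduces that pattern on synthetic data (generated here purely for illustration, so the numbers will differ from the ones above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature plus noise (an assumption, not the sales data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=2.0, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

scores = {}
for name, depth in [('max_depth=3', 3), ('no max_depth', None)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    # .score() returns r^2 for scikit-learn regressors
    scores[name] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    print(f'{name}: train r^2 = {scores[name][0]:.4f}, test r^2 = {scores[name][1]:.4f}')
```

Comparing the two rows shows how limiting depth trades a lower training score for a smaller train/test gap.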


- **Finally**, to determine which model to implement, we can compare the 'r^2' and 'rmse' scores of both models. In general, a higher 'r^2' score and a lower 'rmse' score indicate a better model.
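
A comparison like this is easier to read when every model is scored by the same function and the results sit in one table. The helper below, `evaluate`, is a hypothetical name introduced for this sketch, and the data is synthetic, so it only illustrates the idea:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(name, model, X_tr, X_te, y_tr, y_te):
    """Fit a model and return its test-set RMSE and r^2 as one table row."""
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    return {'model': name,
            'test_rmse': np.sqrt(mean_squared_error(y_te, preds)),
            'test_r2': r2_score(y_te, preds)}


# Synthetic stand-in data (an assumption, not the sales dataset)
rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(200, 2))
y_demo = 5 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
# train_test_split returns X_train, X_test, y_train, y_test in that order
splits = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

results = pd.DataFrame([
    evaluate('Linear Regression', LinearRegression(), *splits),
    evaluate('Decision Tree (max_depth=5)',
             DecisionTreeRegressor(max_depth=5, random_state=42), *splits),
]).sort_values('test_r2', ascending=False)
print(results)
```

With the notebook's own objects, the same helper could be called with `pipeline` and `tree_pipeline` and the existing `X_train`, `X_test`, `y_train`, `y_test` splits.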

# **Final Decision**

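Based on the results above, the Decision Tree model with the tuned hyperparameters (max_depth=5) has the highest testing r^2 (0.5963), ahead of linear regression (0.39) and the regression tree with no max_depth (0.2463). The unrestricted tree scores 1.0000 on the training data but only 0.2463 on the testing data, a clear sign of overfitting, while the depth-limited tree scores almost the same on both (0.6050 vs. 0.5963). Only the training RMSE (1080.88) was printed for the decision tree, so this comparison rests on r^2. Therefore, we recommend the Decision Tree model with max_depth=5 for this dataset. It is still good practice to try multiple models and evaluate their performance before making a final decision.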