
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn import set_config
set_config(display='diagram')
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Input Data

In [2]:
df = pd.read_csv('/content/drive/MyDrive/sales_predictions.csv')

In [3]:
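#dropping empty data
#note: drop() returns a new DataFrame and the result is not assigned here,
#so Item_Weight stays in df (it still appears in the head() output below)
df.drop(columns = 'Item_Weight')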
#fix inconsistencies
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('LF', 'Low Fat')
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('reg', 'Regular')
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('low fat', 'Low Fat')
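
One detail worth checking in the cleaning cell: `DataFrame.drop` returns a new DataFrame by default, so calling it without assigning the result leaves `df` unchanged. The sketch below uses a small made-up frame (`toy` and `fat_map` are hypothetical names, not part of the notebook) to show that behavior, along with a single dictionary-based `replace` that covers all three label fixes in one pass.

```python
import pandas as pd

# Hypothetical toy frame standing in for the sales data.
toy = pd.DataFrame({
    'Item_Weight': [9.3, None, 17.5, 5.0],
    'Item_Fat_Content': ['LF', 'reg', 'low fat', 'Regular'],
})

# drop() returns a new frame; without assignment the original keeps the column.
toy.drop(columns='Item_Weight')
still_has_weight = 'Item_Weight' in toy.columns

# Assigning the result is what actually removes the column.
dropped = toy.drop(columns='Item_Weight')

# One mapping handles every inconsistent label in a single call.
fat_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
dropped['Item_Fat_Content'] = dropped['Item_Fat_Content'].replace(fat_map)

print(still_has_weight)
print(sorted(dropped['Item_Fat_Content'].unique()))
```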

In [4]:
#make copy
ml_df = df.copy()
ml_df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


Identify the features (X) and target (y): Assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.  

In [5]:
x = ml_df.drop(['Item_Identifier', 'Item_Outlet_Sales', 'Outlet_Location_Type'], axis=1)
y = ml_df['Item_Outlet_Sales']

Perform a train test split 

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
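
When no `test_size` is passed, `train_test_split` holds out 25% of the rows, and a fixed `random_state` makes the split repeatable. A minimal sketch on made-up arrays (`X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# Default split: 75% train, 25% test.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# Re-running with the same seed reproduces the same held-out rows.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=42)
print(np.array_equal(X_te, X_te2))
```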

Create a preprocessing object to prepare the dataset for Machine Learning

In [7]:
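ml_df['Outlet_Size'].value_counts()
#ordinal-encode outlet size (Small < Medium < High)
#assign the result: replace(..., inplace=True) returns None, so oe would just hold None
#note: x_train and x_test were created in earlier cells, so this only changes ml_df
ml_df['Outlet_Size'] = ml_df['Outlet_Size'].replace({'Small':0, 'Medium':1, 'High': 2})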

In [8]:
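#encode fat content as 0/1 on ml_df; assign the result, since replace(..., inplace=True) returns None
ml_df['Item_Fat_Content'] = ml_df['Item_Fat_Content'].replace({'Low Fat':0, 'Regular':1})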
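
As the cells are ordered here, `x` and the train/test split already exist before these manual encodings run, so the encoded values end up only in `ml_df` and not in the data the models see. One possible alternative, sketched below under that assumption, is to do the ordinal encoding inside a pipeline with the `OrdinalEncoder` the notebook already imports. The `sizes` frame and the names `size_order` and `ordinal_pipe` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of outlet sizes, including one missing value.
sizes = pd.DataFrame({
    'Outlet_Size': pd.Series(['Medium', 'High', np.nan, 'Small', 'Medium'], dtype=object)
})

# Explicit category order so Small < Medium < High maps to 0 < 1 < 2.
size_order = [['Small', 'Medium', 'High']]
ordinal_pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent'),  # fill the gap with the most common size
    OrdinalEncoder(categories=size_order),
)

encoded = ordinal_pipe.fit_transform(sizes)
print(encoded.ravel())
```

Because the encoder lives in the pipeline, it would be fitted on the training rows only and applied consistently to the test rows.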

##Pipelines

In [9]:
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
#note: scikit-learn 1.2 renamed this argument to sparse_output, and sparse was removed in 1.4
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

In [10]:
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

In [11]:
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

In [12]:
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)
preprocessor = make_column_transformer(number_tuple, category_tuple, remainder = 'passthrough')
preprocessor

In [13]:
preprocessor.fit(x_train)
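
Fitting the preprocessor on `x_train` alone means the imputation values, scaling statistics, and one-hot categories are all learned from training rows only. The sketch below illustrates that on made-up data: a category that appears only in the test row is encoded as all zeros because of `handle_unknown='ignore'`. The frames `train` and `test` and the helper `to_dense` are hypothetical, and the encoder is left at its default output type to avoid version-specific arguments.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test rows with one numeric and one categorical column.
train = pd.DataFrame({
    'Item_MRP': [100.0, 200.0, np.nan, 300.0],
    'Outlet_Type': pd.Series(
        ['Grocery Store', 'Supermarket Type1', 'Grocery Store', np.nan], dtype=object),
})
test = pd.DataFrame({
    'Item_MRP': [150.0],
    'Outlet_Type': pd.Series(['Supermarket Type2'], dtype=object),  # unseen in train
})

num_pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OneHotEncoder(handle_unknown='ignore'))
pre = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include='number')),
    (cat_pipe, make_column_selector(dtype_include='object')),
)
pre.fit(train)  # statistics and categories come from the training rows only


def to_dense(a):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return a.toarray() if hasattr(a, 'toarray') else np.asarray(a)


train_out = to_dense(pre.transform(train))
test_out = to_dense(pre.transform(test))
print(train_out.shape, test_out.shape)
print(test_out)
```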

##Build a linear regression model.
Evaluate the performance of your model based on r^2.
Evaluate the performance of your model based on rmse.

In [14]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

In [15]:
#StandardScaler is already imported in the first cell
#this new instance is not used by any pipeline below; numeric_pipe keeps its own scaler
scaler = StandardScaler()

In [16]:
#make_pipeline is already imported in the first cell
reg_pipe = make_pipeline(preprocessor, reg)

In [17]:
reg_pipe.fit(x_train,y_train)

Evaluate the performance of your model based on r^2

In [18]:
training_predictions = reg_pipe.predict(x_train)
test_predictions = reg_pipe.predict(x_test)

In [19]:
train_r2 = r2_score(y_train, training_predictions)
test_r2 = r2_score(y_test, test_predictions)

print(f'Model Training R2: {train_r2}')
print(f'Model Test R2: {test_r2}')

Model Training R2: 0.5615550873278972
Model Test R2: 0.5670977300237667
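
For reference, `r2_score` computes one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A small worked example on made-up numbers (the arrays are assumptions for illustration) shows the manual calculation agreeing with scikit-learn:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: 0.25 + 0.25 + 0 + 1 = 1.5
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares around the mean (6): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot  # 1 - 1.5 / 20 = 0.925
print(r2_manual, r2_score(y_true, y_pred))
```

Read this way, a test R2 of about 0.567 means the linear model's squared error on held-out rows is roughly 43% of what a predict-the-mean baseline would produce.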


Evaluate the performance of your model based on rmse.

In [20]:
train_mse = mean_squared_error(y_train, training_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

print(f'Model Training MSE: {train_mse}')
print(f'Model Test MSE: {test_mse}')

Model Training MSE: 1297558.2979281037
Model Test MSE: 1194367.530704372


In [21]:
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

print(f'Model Training RMSE: {train_rmse}')
print(f'Model Test RMSE: {test_rmse}')

Model Training RMSE: 1139.1041646522515
Model Test RMSE: 1092.8712324443225
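
RMSE is the square root of MSE, which puts the error back in the same units as `Item_Outlet_Sales`. Since the same three metrics are computed for each model, a small helper can avoid repeating the cells. The function name `report_metrics` and the sample arrays below are hypothetical, offered as one possible sketch.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def report_metrics(y_true, y_pred):
    """Return R2, MSE and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'r2': r2_score(y_true, y_pred),
        'mse': mse,
        'rmse': float(np.sqrt(mse)),  # same units as the target
    }


# Hypothetical targets and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])
metrics = report_metrics(y_true, y_pred)
print(metrics)

# Sanity check against the notebook's printed figures:
# the square root of the test MSE reproduces the test RMSE.
reported_rmse = np.sqrt(1194367.530704372)
print(reported_rmse)
```

In the notebook this could be called as `report_metrics(y_test, test_predictions)` once the predictions exist.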


##Your second task is to build a regression tree model to predict sales.

In [None]:
ml_df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,0,0.016047,Dairy,249.8092,OUT049,1999,1.0,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,1,0.019278,Soft Drinks,48.2692,OUT018,2009,1.0,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,0,0.01676,Meat,141.618,OUT049,1999,1.0,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,1,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,0,0.0,Household,53.8614,OUT013,1987,2.0,Tier 3,Supermarket Type1,994.7052


In [24]:
from sklearn.tree import DecisionTreeRegressor
dec_tree = DecisionTreeRegressor(random_state = 42)
dec_tree_pipe = make_pipeline(preprocessor, dec_tree)
dec_tree_pipe.fit(x_train, y_train)

Compare the performance of your model based on r^2.

In [26]:
dec_tree_training_predictions = dec_tree_pipe.predict(x_train)
dec_tree_test_predictions = dec_tree_pipe.predict(x_test)

In [27]:
train_r2 = r2_score(y_train, dec_tree_training_predictions)
test_r2 = r2_score(y_test, dec_tree_test_predictions)

print(f'Model Training R2: {train_r2}')
print(f'Model Test R2: {test_r2}')

Model Training R2: 1.0
Model Test R2: 0.20323548627224275
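
A training R2 of 1.0 next to a much lower test R2 is the usual sign of an unconstrained tree memorizing its training rows. One common way to limit this is the `max_depth` parameter of `DecisionTreeRegressor`. The sketch below compares an unconstrained tree with a depth-limited one on synthetic data; the data, the noise level, and the choice of `max_depth=3` are all assumptions for illustration, not tuned values for the sales dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy linear relationship with one feature.
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 4.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# Unconstrained tree: grows until every training row is fit exactly.
full_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
# Depth-limited tree: at most 3 levels of splits.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, model in [('unconstrained', full_tree), ('max_depth=3', shallow_tree)]:
    print(name,
          'depth:', model.get_depth(),
          'train R2:', round(model.score(X_tr, y_tr), 3),
          'test R2:', round(model.score(X_te, y_te), 3))
```

On the real data, the same idea would mean looping over several `max_depth` values in `dec_tree_pipe` and keeping the one with the best test R2.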


Compare the performance of your model based on rmse.

In [28]:
train_mse = mean_squared_error(y_train, dec_tree_training_predictions)
test_mse = mean_squared_error(y_test, dec_tree_test_predictions)

print(f'Model Training MSE: {train_mse}')
print(f'Model Test MSE: {test_mse}')

Model Training MSE: 2.4264137179864312e-29
Model Test MSE: 2198255.1971051954


In [29]:
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

print(f'Model Training RMSE: {train_rmse}')
print(f'Model Test RMSE: {test_rmse}')

Model Training RMSE: 4.925864104892086e-15
Model Test RMSE: 1482.6514078181679


##You now have tried 2 different models on your data set. You need to determine which model to implement.

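The decision tree's test R2 is only about 0.20, compared with about 0.57 for linear regression, and its test RMSE is higher (about 1482.65 versus 1092.87). Its training R2 of 1.0 shows that the tree memorized the training data and does not generalize to unseen rows. Of the two models tried, linear regression is the better one to implement for this dataset.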