
In [1]:
import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
dfBFSales = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/train.csv')

# Select target as a 1-D array and features as a dataframe
y = dfBFSales.loc[:,['Purchase']].values.ravel()
X = dfBFSales.drop(['Purchase'],axis=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8, test_size=0.2,random_state=1)

In [5]:
from sklearn.ensemble import RandomForestRegressor

# Function for building and scoring Random Forest models
def get_random_forest_mae(X_trn, X_tst, y_trn, y_tst):
    mdlRfsBFSales = RandomForestRegressor(random_state=1)
    mdlRfsBFSales.fit(X_trn, y_trn)
    y_tst_prd = mdlRfsBFSales.predict(X_tst)
    mae = mean_absolute_error(y_tst, y_tst_prd)
    return (mae)

### **Numerical Features**

In [6]:
# Select numeric features
cols_num = [col for col in X.columns if X[col].dtype in ['int64', 'float64']]
Xnum = X[cols_num]

# Split numeric features into training and test sets
Xnum_train, Xnum_test, y_train, y_test = train_test_split(Xnum,y,train_size=0.8, test_size=0.2,random_state=1)

In [8]:
# Count number of missing values in each column of the training data
Xnum_train.isna().sum()

User_ID                    0
Occupation                 0
Marital_Status             0
Product_Category_1         0
Product_Category_2    138892
Product_Category_3    306504
dtype: int64

The only columns with missing values are the category numbers for the 2nd and 3rd products purchased by the customer. This is most likely because customers with no data in these columns did not purchase a 2nd or 3rd product. It does not make sense to drop both columns entirely, because we would lose valuable data for customers who did purchase a 2nd and 3rd product. Instead, we can try replacing the missing values with a new product category: 0.

In [10]:
# Replace missing values with 0
Xnum_train_repnull = Xnum_train.fillna(0)
Xnum_test_repnull = Xnum_test.fillna(0)

print('MAE from replacing missing values with 0s:')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train, y_test))

MAE from replacing missing values with 0s:
2194.5193114225435
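
Filling with 0 only acts as a "new category" if 0 is not already a real category code. A minimal sketch of that check on a made-up frame (the `toy` data and the `zero_unused` name are illustrative assumptions, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the numeric features (assumption, not the real data)
toy = pd.DataFrame({
    "Product_Category_2": [2.0, np.nan, 8.0],
    "Product_Category_3": [np.nan, np.nan, 5.0],
})

# Columns that contain at least one missing value
cols_missing = [c for c in toy.columns if toy[c].isna().any()]

# Check that 0 does not already appear as a category code in those columns
zero_unused = all((toy[c].dropna() != 0).all() for c in cols_missing)
print("0 is free to use as a new category:", zero_unused)

# Replace missing values with the new category 0
toy_filled = toy.fillna(0)
print(toy_filled)
```

If the check came back `False` on the real data, a different unused code would be a safer choice than 0.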


In [11]:
X_train[cols_num]=Xnum_train_repnull[cols_num]
X_test[cols_num]=Xnum_test_repnull[cols_num]

### **Non-Numerical Features**

In [12]:
# Select non-numeric features
cols_obj = [col for col in X.columns if X[col].dtype == 'object']
cols_obj

['Product_ID', 'Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [14]:
# Select categorical features
cols_cat = [col for col in X.columns if X[col].dtype == 'object' and X[col].nunique()<10]
cols_cat

['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

Product_ID is dropped here because it has too many unique values to be treated as a categorical feature. It would not be very useful anyway, as the product categories have already been included in the numerical data, so it is left out of the model.

In [22]:
from sklearn.preprocessing import LabelEncoder

Xle_train = X_train.copy()
Xle_test = X_test.copy()
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in cols_cat:
    Xle_train[col] = label_encoder.fit_transform(X_train[col])
    Xle_test[col] = label_encoder.transform(X_test[col])
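
One thing to watch: `LabelEncoder.transform` raises an error if the test set contains a label that never appeared in the training set. A possible alternative, sketched here on made-up data, is scikit-learn's `OrdinalEncoder`, which encodes several feature columns at once and can map unseen categories to a placeholder. The toy frames and the choice of -1 as the placeholder are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up training and test frames (assumption, not the real data)
toy_train = pd.DataFrame({
    "Age": ["0-17", "26-35", "55+"],
    "City_Category": ["A", "B", "C"],
})
toy_test = pd.DataFrame({
    "Age": ["26-35", "18-25"],  # "18-25" never appears in toy_train
    "City_Category": ["C", "A"],
})

# Unseen categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = enc.fit_transform(toy_train)
test_enc = enc.transform(toy_test)

print(enc.categories_)
print(test_enc)
```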

In [21]:
mae = get_random_forest_mae(Xle_train[cols_num + cols_cat], Xle_test[cols_num + cols_cat], y_train, y_test)
print("MAE from Numerical and Encoded Categorical Columns:")
print(mae)

MAE from Numerical and Encoded Categorical Columns:
2155.658274877166


This is only slightly better than the MAE obtained from the numerical data alone (2155.66 versus 2194.52, about 1.8% lower). Let's try gradient boosting to see if it improves the MAE.

In [18]:
from xgboost import XGBRegressor

#Build and score default Gradient Boosting Model
mdlXgbBFSales = XGBRegressor()
mdlXgbBFSales.fit(Xle_train[cols_num + cols_cat], y_train)
y_test_pred = mdlXgbBFSales.predict(Xle_test[cols_num + cols_cat])
mae = mean_absolute_error(y_test, y_test_pred)

print("MAE from default XGBoost model:")
print(mae)

MAE from default XGBoost model:
2088.5694435910577


This result is better than that obtained from the random forest model with numerical and categorical data. Let's see if it can be improved further by tuning the parameters of the gradient boosting model.

In [20]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

# Define parameter grid for tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
}

# Create XGBoost model
mdlXgbBFSales = XGBRegressor()

# Perform grid search with cross-validation
grid_search = GridSearchCV(mdlXgbBFSales, param_grid, cv=5, scoring='neg_mean_absolute_error')
grid_search.fit(Xle_train[cols_num + cols_cat], y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
y_test_pred = best_model.predict(Xle_test[cols_num + cols_cat])
mae = mean_absolute_error(y_test, y_test_pred)

print("Best parameters:", grid_search.best_params_)
print("MAE:", mae)

Best parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 300}
MAE: 2061.397194056713
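
To get a feel for why this cell takes a while to run, we can count the model fits. A small sketch using scikit-learn's `ParameterGrid` on the same grid (the `n_candidates` and `n_fits` names are just for illustration, and the final +1 assumes `GridSearchCV` is left at its default `refit=True`):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
}

# Number of parameter combinations the grid search tries
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold, plus one final refit on all training data
cv_folds = 5
n_fits = n_candidates * cv_folds + 1

print("Candidates:", n_candidates)
print("Total fits:", n_fits)
```

Adding one more value to any list in the grid multiplies the number of candidates, so the cost grows quickly.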


This is the lowest MAE so far, so the tuned model is the best of those tried for predicting purchase values from the other numerical and categorical data about the customer. To make actual predictions with this code, we would need to know how the categorical data has been encoded so that new inputs can be entered accordingly.
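
One way to keep track of the encoding, sketched on made-up data: fit a separate `LabelEncoder` per column and store it, so the label-to-code mapping can be looked up and reused later. The `encoders` and `mappings` dictionaries, the toy frame, and the `new_customer` row are assumptions for illustration, not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical training data (assumption, not the real data)
toy_train = pd.DataFrame({
    "Gender": ["F", "M", "M"],
    "City_Category": ["A", "C", "B"],
})

encoders = {}   # fitted encoder per column
mappings = {}   # readable label -> code lookup per column
toy_encoded = toy_train.copy()
for col in toy_train.columns:
    le = LabelEncoder()
    toy_encoded[col] = le.fit_transform(toy_train[col])
    encoders[col] = le
    # classes_ is sorted, and each label's code is its position in classes_
    mappings[col] = {label: code for code, label in enumerate(le.classes_)}

print(mappings)

# Encode a new (hypothetical) customer with the stored encoders
new_customer = pd.DataFrame({"Gender": ["M"], "City_Category": ["B"]})
new_encoded = new_customer.copy()
for col in new_customer.columns:
    new_encoded[col] = encoders[col].transform(new_customer[col])

print(new_encoded)
```

With the mapping saved alongside the model, new inputs can be encoded the same way the training data was.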