# Project: Price Mechanism
Objective: build a model to predict course prices from scraped Udemy data <br> 
Models: Multiple Linear Regression, Random Forest Regression, Gradient Boosting (XGBoost, LightGBM) <br> 
Evaluation Metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) <br>

## Packages

In [10]:
# Data handling
import pandas as pd
import numpy as np

# Preprocessing
import category_encoders as ce
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split

# Models
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor 
from lightgbm import LGBMRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import f_regression

# Metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

## Preprocessing

In [2]:
# Import data
data_clean = pd.read_csv("./Data/Udemy_Clean.csv", index_col=0)

# Transform discounted_price variable
data_clean["Discounted"] = data_clean["Discounted_Price"] != data_clean["Price"]
# Remove unused discounted_price variable
data_clean.drop(columns = ["Discounted_Price"], inplace=True)
data_clean.set_index("Title", inplace=True)
# Inspect data
data_clean.head()

Title,Overall_Rating,Best_Rating,Worst_Rating,No_of_Ratings,Category,Subcategory,Topic,Instructor,Language,SkillsFuture,No_of_Practice_Test,No_of_Articles,No_of_Coding_Exercises,Video_Duration_Hr,No_of_Additional_Resources,Bestseller,Price,Discounted
Complete Hypnotherapy & Hypnosis Certification Diploma,4.7,5,0.5,3524,Lifestyle,Esoteric Practices,Hypnotherapy,Dr Karen E Wells,English,False,0,4,0,3.0,0,Yes,104.98,True
Pinterest Marketing for Wedding Professionals 2020,5.0,5,0.5,1,Marketing,Social Media Marketing,Pinterest Marketing,Staci Nichols,English,False,0,0,0,0.6,2,No,29.98,True
Master the Telephone Sales- Cold calling Secrets,4.5,5,0.5,3,Marketing,Product Marketing,Marketing Strategy,Sanjay Bhasin,English,False,0,0,0,0.733333,0,No,29.98,True
5 Practical Management concepts you MUST know,5.0,5,0.5,2,Personal Development,Leadership,Management Skills,Vasudev Murthy,English,False,0,0,0,2.0,0,No,49.98,True
Fermented Foods Mastery,4.5,5,0.5,187,Health & Fitness,Nutrition,Fermented Foods,Kale Brock,English,False,0,3,0,1.5,12,No,68.98,True


Numeric and categorical variables will be preprocessed separately and then concatenated together afterwards. 

In [3]:
# Define response and explanatory variables
y = pd.DataFrame(data_clean["Price"])
X_raw = data_clean[data_clean.columns.drop('Price')]

# Separate numeric and categorical variables for X
X_numeric = X_raw.select_dtypes(exclude=["object", "boolean"])
X_categorical = X_raw.select_dtypes(include=["object", "boolean"])

### Categorical Encoding
For categorical encoding, we use a binary encoder. Several categorical variables in this dataset have high cardinality, and binary encoding produces far fewer features for them than one-hot encoding would. 
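
To make the column-count argument concrete, here is a simplified, hand-rolled sketch of binary encoding in plain Python. It is not the `category_encoders` implementation: the helper name `binary_encode` and the choice to number categories from 1 in sorted order are assumptions made for illustration.

```python
def binary_encode(values):
    """Encode each category as the binary digits of its ordinal (illustrative only)."""
    categories = sorted(set(values))
    # Assumption of this sketch: ordinals start at 1, assigned in sorted order
    ordinal = {cat: i for i, cat in enumerate(categories, start=1)}
    # Number of binary digits needed to write the largest ordinal
    width = len(categories).bit_length()
    return [[int(bit) for bit in format(ordinal[v], f"0{width}b")] for v in values]


# Three distinct categories fit in two binary columns
encoded = binary_encode(["Design", "Marketing", "Lifestyle", "Design"])

# Columns needed for a hypothetical feature with 1,000 distinct values
binary_columns = (1000).bit_length()
one_hot_columns = 1000
```

Under this scheme a feature with 1,000 levels needs 10 binary columns instead of 1,000 one-hot columns, which is the saving the paragraph above refers to.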

In [4]:
# Create encoder
encoder = ce.BinaryEncoder(cols=X_categorical.columns, return_df=True)
# Fit and transform data 
X_categorical_encoded = encoder.fit_transform(X_categorical)
X_categorical_encoded.reset_index(inplace=True)
X_categorical_encoded.head()


Unnamed: 0,Title,Category_0,Category_1,Category_2,Category_3,Category_4,Subcategory_0,Subcategory_1,Subcategory_2,Subcategory_3,...,Instructor_12,Instructor_13,Instructor_14,Language_0,SkillsFuture_0,SkillsFuture_1,Bestseller_0,Bestseller_1,Discounted_0,Discounted_1
0,Complete Hypnotherapy & Hypnosis Certification...,0,0,0,0,1,0,0,0,0,...,0,0,1,1,0,1,0,1,0,1
1,Pinterest Marketing for Wedding Professionals ...,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,1,1,0,0,1
2,Master the Telephone Sales- Cold calling Secrets,0,0,0,1,0,0,0,0,0,...,0,1,1,1,0,1,1,0,0,1
3,5 Practical Management concepts you MUST know,0,0,0,1,1,0,0,0,0,...,1,0,0,1,0,1,1,0,0,1
4,Fermented Foods Mastery,0,0,1,0,0,0,0,0,0,...,1,0,1,1,0,1,1,0,0,1


### Outliers

In [5]:
# Outlier detection 
lof = LocalOutlierFactor()
yhat = pd.DataFrame(lof.fit_predict(X_numeric), columns=["outliers_d"])
outliers_index = yhat[yhat["outliers_d"]==-1].index
outliers_index

Int64Index([    8,    20,    31,    32,    71,    73,    76,    78,    80,
               94,
            ...
            16255, 16263, 16329, 16339, 16347, 16368, 16390, 16401, 16413,
            16427],
           dtype='int64', length=1522)
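
`fit_predict` labels each row 1 (inlier) or -1 (outlier), which is why the cell above filters on -1. A minimal sketch on made-up one-dimensional data, with `n_neighbors=2` chosen only because the toy sample is tiny (the notebook uses the default settings):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four tightly spaced points and one far-away point (toy data, not from the Udemy set)
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [10.0]])

lof_toy = LocalOutlierFactor(n_neighbors=2)
labels = lof_toy.fit_predict(X_toy)  # 1 = inlier, -1 = outlier

# Row positions of the points flagged as outliers
outlier_positions = np.where(labels == -1)[0]
```

The positions are row numbers, not labels, so they only line up with a DataFrame whose index is the default 0..n-1 range, as is the case in the notebook once `reset_index` has been called.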

In [6]:
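# Combine both numeric and categorical variables back into one dataframe
X_numeric.reset_index(inplace=True)
# The encoded frame already carries its own "Title" column; drop it so X keeps a single "Title"
X = pd.concat([X_numeric, X_categorical_encoded.drop(columns="Title")], axis=1)
y.reset_index(inplace=True)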

# Drop outliers for both X and y
X.drop(outliers_index, axis=0, inplace=True)
y.drop(outliers_index, axis=0, inplace=True)

# Double check that it has been dropped
print(len(X))
print(len(y))

14907
14907


In [7]:
# Reset index as title for both 
X.set_index("Title", inplace=True)
y.set_index("Title", inplace=True)

### Train Test Split
Now that the outliers are removed and the categorical variables are encoded, we split the dataset into training and test sets (80/20) for model development and evaluation. 

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Model Development

### (1) Multiple Linear Regression Model

In [11]:
# Copy training and testing variables
X_train_mlr = X_train.copy()
X_test_mlr = X_test.copy()
y_train_mlr = y_train.copy()
y_test_mlr = y_test.copy()

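# Helper: fit a linear regression on the training set, print the R^2 on the
# training data, and return predictions for the test set
# (y_test is accepted for a uniform call signature but is not used here)
def mlr_model(X_train, X_test, y_train, y_test): 
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # model.score returns R^2; it is evaluated on the training data here
    print("Score: {0:0.3f}".format(model.score(X_train, y_train)))
    return predictions

# Initial model using all features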
predictions_1 = mlr_model(X_train_mlr, X_test_mlr, y_train_mlr, y_test_mlr)

Score: 0.151


The model explains only about 15% of the variance in the training data (R² = 0.151), so we try feature selection to see whether it improves the fit. 

#### Feature selection 

In [27]:
# Select the 15 features with the highest F-statistic (univariate linear correlation with the target)
fs_1 = SelectKBest(score_func=f_regression, k=15)
fs_1.fit(X_train_mlr, y_train_mlr)
X_train_fs1 = fs_1.transform(X_train_mlr)
X_test_fs1 = fs_1.transform(X_test_mlr)

predictions_2 = mlr_model(X_train_fs1, X_test_fs1, y_train_mlr, y_test_mlr)

Score: 0.137




In [31]:
# Select the 15 features with the highest mutual information with the target
fs_1 = SelectKBest(score_func=mutual_info_regression, k=15)
fs_1.fit(X_train_mlr, y_train_mlr)
X_train_fs1 = fs_1.transform(X_train_mlr)
X_test_fs1 = fs_1.transform(X_test_mlr)

predictions_3 = mlr_model(X_train_fs1, X_test_fs1, y_train_mlr, y_test_mlr)


Score: 0.131


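Neither selection method improves the training score (0.137 with `f_regression` and 0.131 with `mutual_info_regression`, against 0.151 with all features), so we keep the original model without feature selection.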
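
One caveat on this comparison: `mlr_model` prints the R² on the training data, and for ordinary least squares with an intercept that value cannot increase when columns are removed, so the full model is bound to score at least as well on this measure. Comparing scores on held-out data is a fairer test. A minimal NumPy sketch on synthetic data (the helper `ols_r2` and all numbers are illustrative assumptions, not the notebook's results):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 200
X_syn = rng.normal(size=(n_rows, 6))
# Only the first two columns carry signal in this synthetic target
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(size=n_rows)


def ols_r2(X_fit, y_fit, X_eval, y_eval):
    """Fit OLS with an intercept on (X_fit, y_fit); return R^2 on (X_eval, y_eval)."""
    A_fit = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)
    A_eval = np.column_stack([np.ones(len(X_eval)), X_eval])
    residuals = y_eval - A_eval @ coef
    return 1.0 - residuals @ residuals / np.sum((y_eval - y_eval.mean()) ** 2)


train, test = slice(0, 150), slice(150, None)
subset = [0, 1]  # stand-in for the columns a selector might keep

# Training R^2: the subset model can never beat the full model here
train_r2_full = ols_r2(X_syn[train], y_syn[train], X_syn[train], y_syn[train])
train_r2_subset = ols_r2(X_syn[train][:, subset], y_syn[train],
                         X_syn[train][:, subset], y_syn[train])

# Held-out R^2: this is where a smaller model can actually come out ahead
test_r2_full = ols_r2(X_syn[train], y_syn[train], X_syn[test], y_syn[test])
test_r2_subset = ols_r2(X_syn[train][:, subset], y_syn[train],
                        X_syn[test][:, subset], y_syn[test])
```

In the notebook, the equivalent check would be to score each candidate on `X_test_mlr` and `y_test_mlr` rather than on the training split.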

In [32]:
# Final model (note: this rebinds the name mlr_model from the helper function to the fitted estimator)
mlr_model = LinearRegression()
mlr_model.fit(X_train_mlr, y_train_mlr)
mlr_predictions = mlr_model.predict(X_test_mlr)
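
The objective lists RMSE and MAE as the evaluation metrics. A minimal sketch of both, written out with NumPy on made-up prices; in the notebook, `y_test_mlr` and `mlr_predictions` would take the place of the toy arrays, and `mean_squared_error` and `mean_absolute_error` from the imports compute the same MSE and MAE.

```python
import numpy as np

# Toy prices standing in for the true and predicted values (not notebook results)
y_true = np.array([100.0, 50.0, 30.0])
y_pred = np.array([90.0, 60.0, 30.0])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))  # penalises large errors more heavily
mae = np.mean(np.abs(errors))         # average absolute miss, in price units
```

Both metrics are in the same units as the price, and RMSE is always at least as large as MAE, so a wide gap between the two points to a few large misses.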

### (2) Random Forest Regression

In [None]:
# Copy training and testing data
X_train_rfr = X_train.copy()
X_test_rfr = X_test.copy()
y_train_rfr = y_train.copy()
y_test_rfr = y_test.copy()

