In [1]:
import numpy as np
import pandas as pd

# CX Kaggle Competition: Phone Price Prediction

### Table of Contents

* [4. Data Splitting](#setup)
* [5. Validation Dataset](#validation)
* [6. Model Evaluation Function](#rmse)
* [7. Regression Model Training](#models)
* [8. Classification Model Training](#models)
* [9. Suggestions for Improving Your Model](#models)
* [10. Submission](#submission)

### Hosted and maintained by the [Students Association of Applied Statistics (SAAS)](https://saas.berkeley.edu). Authored by Ashwin Natampalli and Sarang Deshpande.

# Modeling Overview

We will build several regression and classification models to predict the ```price_range``` column of a dataset that describes different mobile phones. The modeling process has four steps, and we will walk through each one: split the data appropriately, fit a model to the training data, evaluate the model on held-out data and adjust it as necessary, and finally use the test dataset to make predictions. Along the way we will compare the models to determine which performs best on our data.

## Data Splitting

**Step 1:** Split the dataset into our features and response variable.

In [2]:
train_data = pd.read_csv('true_train.csv')
X = train_data.drop(columns=['price_range']) # extract X variables
y = train_data.loc[:, 'price_range'] # extract y variable

X

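| Unnamed: 0 | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | pc | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | 2 | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 |
| 1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | 6 | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 |
| 2 | 563 | 1 | 0.5 | 1 | 2 | 1 | 41 | 0.9 | 145 | 5 | 6 | 1263 | 1716 | 2603 | 11 | 2 | 9 | 1 | 1 | 0 |
| 3 | 1859 | 0 | 0.5 | 1 | 3 | 0 | 22 | 0.7 | 164 | 1 | 7 | 1004 | 1654 | 1067 | 17 | 1 | 10 | 1 | 0 | 0 |
| 4 | 1821 | 0 | 1.7 | 0 | 4 | 1 | 10 | 0.8 | 139 | 8 | 10 | 381 | 1018 | 3220 | 13 | 8 | 18 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | 674 | 1 | 2.9 | 1 | 1 | 0 | 21 | 0.2 | 198 | 3 | 4 | 576 | 1809 | 1180 | 6 | 3 | 4 | 1 | 1 | 1 |
| 1496 | 858 | 0 | 2.2 | 0 | 1 | 0 | 50 | 0.1 | 84 | 1 | 2 | 528 | 1416 | 3978 | 17 | 16 | 3 | 1 | 1 | 0 |
| 1497 | 794 | 1 | 0.5 | 1 | 0 | 1 | 2 | 0.8 | 106 | 6 | 14 | 1222 | 1890 | 668 | 13 | 4 | 19 | 1 | 1 | 0 |
| 1498 | 1911 | 0 | 0.9 | 1 | 1 | 1 | 36 | 0.7 | 108 | 8 | 3 | 868 | 1632 | 3057 | 9 | 1 | 5 | 1 | 1 | 0 |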


## Validation Dataset

We plan to build several models with different architectures and, potentially, different hyperparameters. To compare them fairly, we hold out a validation dataset and measure every model's performance on it. The architecture and hyperparameters that perform best on the validation data are the ones to keep.

**Step 2:** Split your data into a training set and a validation set.

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=21)
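
If the classes in ```price_range``` are not equally represented, a purely random split can give the validation set a different class mix than the training set. The sketch below shows one way to guard against that with the `stratify` argument of `train_test_split`. It runs on a small synthetic stand-in for the phone data (the values and the `_demo` names are made up for illustration), so treat it as a pattern rather than the notebook's required approach.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phone data (hypothetical values, not the real dataset)
rng = np.random.default_rng(21)
X_demo = pd.DataFrame({
    'ram': rng.integers(256, 4000, size=200),
    'battery_power': rng.integers(500, 2000, size=200),
})
y_demo = pd.Series(np.repeat([0, 1, 2, 3], 50), name='price_range')

# stratify=y_demo keeps each class's share the same in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo)

print('Training class counts:  ', y_tr.value_counts().sort_index().to_dict())
print('Validation class counts:', y_va.value_counts().sort_index().to_dict())
```

On the real data, the same call would take `X` and `y` in place of the demo objects.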



#### K-fold Cross Validation (Optional)

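You can also use k-fold cross-validation as a more robust approach to validation: the data is split into k folds, and each fold serves once as the validation set while the remaining folds are used for training.
See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html for documentation.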

**Step 3:** Set up cross-validation.

In [4]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for train_index, val_index in kf.split(X):
    # kf.split yields positional indices, so select rows with .iloc for both X and y
    X_train_cv, X_val_cv = X.iloc[train_index, :], X.iloc[val_index, :]
    y_train_cv, y_val_cv = y.iloc[train_index], y.iloc[val_index]
    
    # continue to fit and evaluate model
    ...
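
To make the "fit and evaluate" part of the loop concrete, here is a minimal sketch that trains a model inside each fold and averages the per-fold RMSE. The data is synthetic and the `_demo` names are illustrative. The choice of `LinearRegression` and of `shuffle=True` are assumptions made for this example, not requirements of the competition.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y_demo = pd.Series(2.0 * X_demo['a'] - X_demo['b'] + rng.normal(scale=0.1, size=100))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=21)
fold_rmse = []
for train_index, val_index in kf_demo.split(X_demo):
    X_tr, X_va = X_demo.iloc[train_index], X_demo.iloc[val_index]
    y_tr, y_va = y_demo.iloc[train_index], y_demo.iloc[val_index]

    model = LinearRegression().fit(X_tr, y_tr)   # fit on the training folds
    pred = model.predict(X_va)                   # predict on the held-out fold
    fold_rmse.append(np.sqrt(np.mean((y_va - pred) ** 2)))

print('Per-fold RMSE:', np.round(fold_rmse, 3))
print('Mean RMSE:', np.mean(fold_rmse))
```

Averaging over folds gives a steadier estimate than a single train/validation split, at the cost of fitting the model k times.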

## Model Evaluation Function

We are going to use root-mean-squared error (RMSE) as our evaluation metric. It is defined below, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.

$$\text{RMSE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - \hat{y}_i)^2}.$$

**Step 4:** Complete the function below that returns the RMSE given true y-values and predicted y-values.

In [154]:
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true-y_pred)**2))
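
As a quick sanity check, the function can be applied to a tiny hand-made example. The arrays below are made up for illustration. The second computation uses scikit-learn's `mean_squared_error` as an independent cross-check.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # Same formula as above; np.asarray lets it accept lists as well as arrays
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

y_true_demo = [0, 1, 2, 3]
y_pred_demo = [0, 1, 2, 1]

# Errors are 0, 0, 0, 2, so the squared errors are 0, 0, 0, 4 and their mean is 1
print(rmse(y_true_demo, y_pred_demo))                         # 1.0
print(np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)))  # 1.0
```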

## Regression Model Training

Given the format of the response variable in our dataset, we can try both regression and classification models. Let's look at the regression models first.

### Linear Regression

**Step 5:** Create and fit a linear regression model to the training data.

In [155]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression() # Create sklearn's linear regression object
lin_reg.fit(X_train, y_train) # Fit linear regression model to training data

lin_reg_train_pred = lin_reg.predict(X_train) # Predict against training data using fitted model
print('Training RMSE:', rmse(y_train, lin_reg_train_pred)) # Calculate RMSE on training data

lin_reg_val_pred = lin_reg.predict(X_val) # Predict against validation data using fitted model
print('Validation RMSE:', rmse(y_val, lin_reg_val_pred)) # Calculate RMSE on validation data

Training RMSE: 0.31585674882173426
Validation RMSE: 0.33064651119948174


### Ridge Regression

**Step 6:** Create and fit a ridge regression model to the training data.

It is probably a good idea to tune the ```alpha``` hyperparameter, which controls the strength of the regularization penalty. Review https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html for documentation on sklearn's Ridge Regression model.

In [156]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1000) # Create ridge regression object
ridge.fit(X_train, y_train) # Fit ridge regression model

ridge_train_pred = ridge.predict(X_train) # Predict against X_train using fitted model
print('Training RMSE:', rmse(y_train, ridge_train_pred)) # Calculate RMSE on training data

ridge_val_pred = ridge.predict(X_val) # Predict against X_val using fitted model
print('Validation RMSE:', rmse(y_val, ridge_val_pred)) # Calculate RMSE on validation data

Training RMSE: 0.31635818476779604
Validation RMSE: 0.3293175263323787
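
One simple way to tune ```alpha``` is to fit the model for several candidate values and keep the one with the lowest validation RMSE. The sketch below does this on synthetic data. The candidate grid and the `_demo` names are assumptions chosen for illustration, and a sensible grid for the phone data may look different.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=21)

# Fit one ridge model per candidate alpha and record its validation RMSE
val_rmse_by_alpha = {}
for alpha in [0.01, 0.1, 1, 10, 100, 1000]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_rmse_by_alpha[alpha] = np.sqrt(np.mean((y_va - model.predict(X_va)) ** 2))

best_alpha = min(val_rmse_by_alpha, key=val_rmse_by_alpha.get)
for alpha, score in val_rmse_by_alpha.items():
    print(f'alpha={alpha}: validation RMSE={score:.4f}')
print('Best alpha:', best_alpha)
```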


### Random Forest Regression

**Step 7:** Create and fit a random forest regression model to the training data. 

Note that it is a good idea to tune the hyperparameters of the random forest regression model. Take a look at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html for documentation on the hyperparameters of the Random Forest model. Of these, ```n_estimators``` and ```max_depth``` are likely the most relevant.

In [157]:
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(n_estimators=100) # Create random forest regressor object
rf_reg.fit(X_train, y_train) # Fit random forest regression model

rf_reg_train_pred = rf_reg.predict(X_train) # Predict against X_train using fitted model
print('Training RMSE:', rmse(y_train, rf_reg_train_pred)) # Calculate RMSE on training data

rf_reg_val_pred = rf_reg.predict(X_val) # Predict against X_val using fitted model
print('Validation RMSE:', rmse(y_val, rf_reg_val_pred)) # Calculate RMSE on validation data

Training RMSE: 0.10046848593796291
Validation RMSE: 0.2857697208126386
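
The gap between the training and validation RMSE above suggests the forest is fitting the training data much more closely than it generalizes, which is one reason to tune ```n_estimators``` and ```max_depth```. A possible way to do so is a small grid search with cross-validation, sketched here on synthetic data. The grid values and the `_demo` names are illustrative assumptions, and a real search would usually cover more values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic nonlinear regression data (hypothetical, for illustration only)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(150, 4))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

# Deliberately tiny grid so the example runs quickly
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=21),
    param_grid,
    scoring='neg_root_mean_squared_error',  # sklearn maximizes scores, hence the negated RMSE
    cv=3,
)
search.fit(X_demo, y_demo)

print('Best hyperparameters:', search.best_params_)
print('Cross-validated RMSE:', -search.best_score_)
```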


### Other Regression Models to Try

* Polynomial Regression
* LASSO Regression
* Support Vector Regression

In [158]:
# Your code here
...
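
As a starting point, the three models listed above can each be set up as a scikit-learn pipeline and compared on validation RMSE. This is a sketch on synthetic data. The hyperparameter values, the use of `StandardScaler`, and the `_demo` names are assumptions for illustration and would need tuning on the phone data.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR

# Synthetic data with a squared term (hypothetical, for illustration only)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.2, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=21)

candidates = {
    # Polynomial regression = polynomial feature expansion followed by linear regression
    'polynomial (degree 2)': make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    # LASSO and SVR are sensitive to feature scale, so standardize first
    'lasso': make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    'svr': make_pipeline(StandardScaler(), SVR(C=1.0)),
}

val_rmse = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_rmse[name] = np.sqrt(np.mean((y_va - model.predict(X_va)) ** 2))
    print(f'{name}: validation RMSE={val_rmse[name]:.4f}')
```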

## Classification Model Training

Let's look at the classification models now. Classification models are normally evaluated with a metric such as accuracy, but we will keep using RMSE: it is the evaluation function chosen for the Kaggle competition because it can score both regression and classification models.
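
To see how RMSE and accuracy can disagree, consider the small made-up example below. It assumes the class labels are ordered integers (so that "off by three" is worse than "off by one"). The arrays are purely illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

y_true_demo = np.array([0, 1, 2, 3])
near_miss = np.array([0, 1, 2, 2])  # one prediction off by one class
far_miss = np.array([0, 1, 2, 0])   # one prediction off by three classes

# Both prediction sets get exactly one label wrong, so accuracy is identical
print(accuracy_score(y_true_demo, near_miss), accuracy_score(y_true_demo, far_miss))  # 0.75 0.75

# RMSE also accounts for how far off the wrong label is
print(rmse(y_true_demo, near_miss))  # 0.5
print(rmse(y_true_demo, far_miss))   # 1.5
```

Under RMSE, a classifier that tends to miss by one class scores better than one that misses by several, even when their accuracy is the same.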

### Decision Tree Classifier

**Step 9:** Create and fit a decision tree classifier model to the training data.

Explore https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html for hyperparameters.

In [159]:
from sklearn.tree import DecisionTreeClassifier

tree_classifier = DecisionTreeClassifier() # Create Decision Tree Classifier object
tree_classifier.fit(X_train, y_train) # Fit decision tree classifier model

tree_classifier_train_pred = tree_classifier.predict(X_train) # Predict against X_train using fitted model
print('Training RMSE:', rmse(y_train, tree_classifier_train_pred)) # Calculate RMSE on training data

tree_classifier_val_pred = tree_classifier.predict(X_val) # Predict against X_val using fitted model
print('Validation RMSE:', rmse(y_val, tree_classifier_val_pred)) # Calculate RMSE on validation data

Training RMSE: 0.0
Validation RMSE: 0.41231056256176607


### Random Forest Classifier

Let's try to see if we can improve our decision tree classifier by adding more trees and building a random forest classifier.

**Step 10:** Create and fit a random forest classifier model to the training data.

Explore https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html for hyperparameters.

In [160]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=100) # Create Random Forest Classifier object
rf_classifier.fit(X_train, y_train) # Fit random forest classifier model

rf_classifier_train_pred = rf_classifier.predict(X_train) # Predict against X_train using fitted model
print('Training RMSE:', rmse(y_train, rf_classifier_train_pred)) # Calculate RMSE on training data

rf_classifier_val_pred = rf_classifier.predict(X_val) # Predict against X_val using fitted model
print('Validation RMSE:', rmse(y_val, rf_classifier_val_pred)) # Calculate RMSE on validation data

Training RMSE: 0.0
Validation RMSE: 0.33665016461206926


### Other Classification Models to Try

* K-Nearest Neighbors
* Support Vector Classification
* Logistic Regression


In [161]:
# Your Code Here
...
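
A possible starting point for the three classifiers listed above is sketched here. It uses a synthetic four-class dataset from `make_classification`, and the hyperparameter values, the scaling step, and the `_demo` names are assumptions for illustration. Predicted class labels are scored with RMSE, as in the rest of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=21)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=21)

# All three models depend on feature scale, so each pipeline standardizes first
candidates = {
    'knn': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    'svc': make_pipeline(StandardScaler(), SVC(C=1.0)),
    'logistic regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

val_rmse_clf = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)  # predicted class labels
    val_rmse_clf[name] = np.sqrt(np.mean((y_va - pred) ** 2))
    print(f'{name}: validation RMSE={val_rmse_clf[name]:.4f}')
```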

## Suggestions for Improving Your Model

**Variable Selection:** Consider which variables are important enough to include in your model. The EDA you did on the dataset will be useful for this task.

**Feature Engineering:** Transform variables in the dataset, for example by taking logs or squaring them. You can also combine columns into new features.

**Hyperparameter Tuning:** Try several hyperparameter values and keep the ones that give the best validation RMSE. This applies only to models that have hyperparameters to tune, such as ```alpha``` for ridge regression or ```n_estimators``` and ```max_depth``` for random forests.

**Explore a different model:** Feel free to explore models that you haven't been introduced to yet! Google is your best friend.
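
As an illustration of feature engineering, the sketch below combines columns and applies a log transform. It uses the first three rows of the feature table shown earlier. The new column names (`pixel_count`, `screen_area`, `log_ram`) are hypothetical, and whether these particular features help is something to check against validation RMSE.

```python
import numpy as np
import pandas as pd

# First three rows of the relevant columns from the feature table shown earlier
phones = pd.DataFrame({
    'px_height': [20, 905, 1263],
    'px_width': [756, 1988, 1716],
    'sc_h': [9, 17, 11],
    'sc_w': [7, 3, 2],
    'ram': [2549, 2631, 2603],
})

engineered = phones.assign(
    pixel_count=phones['px_height'] * phones['px_width'],  # combine two columns
    screen_area=phones['sc_h'] * phones['sc_w'],           # combine two columns
    log_ram=np.log1p(phones['ram']),                       # log transform, log(1 + x)
)
print(engineered)
```

Whatever transformations you choose, apply the same ones to the training, validation, and test features so that the model sees identical columns everywhere.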

## Submission

**Step 11:** Choose the model that performed best on the validation data. Use that model to predict against the test data.

In [162]:
X_test = pd.read_csv("true_test.csv")
y_test_pred = rf_reg.predict(X_test)    # Predict against test data; rf_reg had the lowest validation RMSE above, so swap in your own best model if it differs

Run the cells below to save your predicted test values.

In [163]:
from datetime import datetime

def results_to_csv(data, y_test):
    """Save predictions y_test to a timestamped CSV whose filename starts with the prefix `data`."""
    df = pd.DataFrame({'price_range': y_test})
    df.to_csv(data + '_submission_' + datetime.now().strftime("%Y_%m_%d-%H_%M_%S") + '.csv',
              index_label='id')

In [164]:
results_to_csv("price_range", y_test_pred)