In [1]:
import numpy as np
import pandas as pd

In [2]:
train = pd.read_csv("train.csv")
X_train = train.drop(labels=['totalyearlycompensation'], axis=1)
y_train = train.loc[:, 'totalyearlycompensation']

## Practice Exercise
# We can also filter out the columns that are not of type int or float
...
# ---- SOLUTION ----
X_train = X_train.select_dtypes(include=['int', 'float']).drop(train.columns[0], axis=1)
X_train

Unnamed: 0,yearsofexperience,yearsatcompany,stockgrantvalue,bonus,cityid,Masters_Degree,Bachelors_Degree,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic
0,2.0,0.0,0.0,20000.0,7839,1,0,0,0,0,0,0,0,0,1
1,15.0,4.0,0.0,23000.0,8909,0,0,0,0,0,0,0,0,0,0
2,2.0,0.0,29000.0,23000.0,10182,0,1,0,0,0,0,1,0,0,0
3,1.0,1.0,65000.0,30000.0,7392,1,0,0,0,0,1,0,0,0,0
4,9.0,0.0,12000.0,15000.0,8816,0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50108,12.0,1.0,10000.0,0.0,7419,0,0,0,0,0,0,0,0,0,0
50109,10.0,0.0,39000.0,29000.0,40303,0,0,0,0,0,0,0,0,0,0
50110,6.0,0.0,91000.0,28000.0,7419,0,0,0,0,0,0,0,0,0,0
50111,2.0,2.0,75000.0,0.0,7351,0,0,0,0,0,0,0,0,0,0


# 2. Modeling

We will now build several regression models to predict the ```totalyearlycompensation``` column of our data. Modeling involves several steps: splitting the data, fitting an appropriate model, evaluating that model, and predicting on the test dataset. Later, we will also determine which model performs best on our data.

## 2.1 Validation

Use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split your training data into a training set and a validation set. Here we hold out 20% of the full training data for validation by setting `test_size=0.2` (scikit-learn's own default is 25%).

In [3]:
from sklearn.model_selection import train_test_split
# Exercise template: X_train, X_val, y_train, y_val = train_test_split(..., ..., test_size=..., random_state=42)
# ---- SOLUTION ----
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
X_val

Unnamed: 0,yearsofexperience,yearsatcompany,stockgrantvalue,bonus,cityid,Masters_Degree,Bachelors_Degree,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic
41851,11.0,11.0,225000.0,51000.0,11521,0,0,0,0,0,0,0,0,0,0
32219,0.0,0.0,22000.0,15000.0,11470,0,1,0,0,0,1,0,0,0,0
45305,5.0,0.0,85000.0,25000.0,7322,0,1,0,0,0,0,1,0,0,0
25924,6.0,6.0,20000.0,20000.0,7419,0,0,0,0,0,0,0,0,0,0
7390,10.0,1.0,0.0,0.0,6583,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27505,25.0,3.0,0.0,0.0,7472,0,0,0,0,0,0,0,0,0,0
33624,5.0,1.0,20000.0,14000.0,10646,0,0,0,0,0,0,0,0,0,0
26070,1.0,1.0,0.0,6000.0,11204,0,0,0,0,0,0,0,0,0,0
10758,15.0,0.0,29000.0,20000.0,12008,0,0,0,0,0,0,0,0,0,0


#### K-Fold Cross Validation

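The single hold-out split above works, but its error estimate depends on which rows happen to land in the validation set. K-fold cross-validation is more robust: it splits the data into K folds, uses each fold once as the validation set, and averages the resulting errors. Feel free to set up your own K-fold cross-validation scheme. For more information, please read https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd.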
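As one possible starting point, here is a minimal sketch of 5-fold cross-validation with scikit-learn's `KFold` and `cross_val_score`. The data is synthetic and the names `X_demo`, `y_demo`, and `fold_rmse` are made up for illustration; they stand in for the real training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (NOT the competition data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 5 folds: every sample is used for validation exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so RMSE is returned negated
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=kf, scoring="neg_root_mean_squared_error",
)
fold_rmse = -neg_rmse

print("RMSE per fold:", fold_rmse)
print("Mean RMSE:", fold_rmse.mean())
```

Averaging over folds gives a steadier estimate than a single split, at the cost of fitting the model K times.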

In [4]:
# Complete this as an exercise

## 2.2 Accuracy & Error

Our Kaggle competition uses Root-Mean-Square-Error (RMSE). In mathematical notation, it is:

$$\text{RMSE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - \hat{y}_i)^2}$$
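To make the formula concrete, here is a small worked example on three made-up values, computed directly with NumPy. The arrays and the name `rmse_manual` are hypothetical and are not part of the competition data.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Errors: -10, 10, -30 -> squared: 100, 100, 900 -> mean: 1100 / 3
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse_manual)  # sqrt(1100 / 3), roughly 19.15
```

Because the errors are squared before averaging, the single large miss (30) dominates the result.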

Complete the function below.

In [5]:
from sklearn.metrics import mean_squared_error
def rmse(y_true, y_pred):
    # Exercise placeholder: return ...
    # ---- SOLUTION ----
    # RMSE is the square root of the mean squared error
    return mean_squared_error(y_true, y_pred)**0.5

## 2.3 Analyzing Different Models

We will now fit several regression models and compare how well they perform.

### 2.3.1 Linear Regression

Fit a linear regression model to your data and report your RMSE.

In [6]:
from sklearn.linear_model import LinearRegression

# Instantiate sklearn's linear regression object
lr = LinearRegression()

# Fit linear regression model
lr.fit(X_train, y_train)

LinearRegression()

In [7]:
# Predict against X_train using fitted model above
lr_train_pred = lr.predict(X_train)

# Calculate RMSE of predicted training output
rmse(y_train, lr_train_pred)

74637.41605606221

In [8]:
lr_val_pred = lr.predict(X_val)

# Calculate RMSE of validation output
## Notice the validation RMSE is slightly higher (worse) than the training RMSE
rmse(y_val, lr_val_pred)

78331.596485081

### 2.3.2 Random Forests

Fit a random forest model to your data and report your RMSE.

**NOTE:** If your model performs worse than your linear regression, make sure you tune the parameters of the RandomForestRegressor!

Try to understand what the parameters mean by looking at the Decision Trees lecture.
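As a rough illustration of what tuning can look like, the sketch below compares training and validation RMSE for a few `max_depth` values. It runs on synthetic data; the names `X_demo`, `y_demo`, and `results` and the candidate depths are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, noisy data standing in for the real features (NOT the competition data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(600, 4))
y_demo = np.sin(X_demo[:, 0]) * 10 + X_demo[:, 1] ** 2 + rng.normal(scale=3.0, size=600)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = {}
for depth in [2, 5, 10, None]:  # None places no limit on tree depth
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    train_rmse = mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5
    val_rmse = mean_squared_error(y_va, model.predict(X_va)) ** 0.5
    results[depth] = (train_rmse, val_rmse)
    print(f"max_depth={depth}: train RMSE={train_rmse:.2f}, val RMSE={val_rmse:.2f}")
```

A large gap between the two numbers for a given depth suggests the trees are memorizing the training set; the depth with the lowest validation RMSE is usually the better pick.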

In [9]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=50)
rf.fit(X_train, y_train)

RandomForestRegressor(n_estimators=50)

In [10]:
rf_train_pred = rf.predict(X_train)
rmse(y_train, rf_train_pred)

25065.155897053028

In [11]:
rf_val_pred = rf.predict(X_val)
rmse(y_val, rf_val_pred)    # Looks like a case of overfitting since training error << validation error
                            # Tune parameters, such as max_depth or n_estimators, to reduce overfitting

55491.751810650065

### 2.3.3 Logistic Regression

Haha, don't get fooled here! Logistic regression (when combined with a decision rule) is really just a classification algorithm in disguise. In this case, we are looking for a numerical output, so logistic regression doesn't make sense at all.

### 2.3.4 Ridge Regression

Fit a ridge regression model and report your RMSE.
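If you are unsure where to start, here is a minimal sketch of the `Ridge` API on synthetic data. The names `X_demo`, `y_demo`, and `ridge_val_rmse` and the 240/60 split are assumptions for illustration; in the notebook you would fit on `X_train` and evaluate on `X_val` instead.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the training data (NOT the competition data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([4.0, 0.0, -3.0, 2.0, 0.5]) + rng.normal(scale=1.0, size=300)

# alpha is the L2 penalty strength; 1.0 is scikit-learn's default
ridge = Ridge(alpha=1.0)
ridge.fit(X_demo[:240], y_demo[:240])

# Evaluate on the 60 held-out rows
ridge_val_rmse = mean_squared_error(y_demo[240:], ridge.predict(X_demo[240:])) ** 0.5
print(ridge_val_rmse)
```

Larger `alpha` values shrink the coefficients more strongly, so it is worth trying a few values and comparing validation RMSE.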

In [19]:
# YOUR CODE HERE
...

### 2.3.5 Support Vector Regression (OPTIONAL)

Fit a support vector regression model and report your RMSE.

**NOTE:** Support vector machines (SVMs) often tend to overfit because of how they "enforce" correct classification of data points. This topic is out of scope, but it serves to show that fancier models don't always prevail.

If you would like to understand more about support vector regression, please read https://towardsdatascience.com/unlocking-the-true-power-of-support-vector-regression-847fd123a4a0.

In [12]:
from sklearn.svm import SVR
svr = SVR(C=0.001, kernel='linear')

In [13]:
svr.fit(X_train[:10000], y_train[:10000])    # Limiting to 10000 samples b/c SVR takes a while

SVR(C=0.001, kernel='linear')

In [14]:
svr_train_pred = svr.predict(X_train)
rmse(y_train, svr_train_pred)

78540.53106286435

In [15]:
svr_val_pred = svr.predict(X_val)
rmse(y_val, svr_val_pred)

83998.36471966408

### 2.3.6 Neural Networks (OPTIONAL)

Train a neural network on the data. Report your RMSE.

**NOTE**: Neural networks take a long time to train, so it is better to train them on a GPU. Kaggle provides free weekly GPU usage (37 hours/week). To use a GPU, choose 'GPU' under Accelerator in the Settings panel on the right side of your screen.

In [20]:
# YOUR CODE HERE
...

### 2.3.7 Your Own Model

There are many regression models to choose from. Some perform better in certain situations than others, and some require specific assumptions about the data to work at all. If you would like to try out more models, please read https://towardsdatascience.com/7-of-the-most-commonly-used-regression-algorithms-and-how-to-choose-the-right-one-fc3c8890f9e3.

## 2.4 Submission

Choose the model that performed best on the validation data. Use that model to predict against the test data.
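One simple way to make this choice explicit is to collect the validation RMSEs and take the minimum. The sketch below uses the values reported in the cells above, rounded to two decimals; the dictionary keys and the name `best_model_name` are made up for illustration.

```python
# Validation RMSEs copied from the outputs above (rounded to two decimals)
val_rmse = {
    "linear_regression": 78331.60,
    "random_forest": 55491.75,
    "svr": 83998.36,
}

# Lower RMSE is better, so pick the key with the smallest value
best_model_name = min(val_rmse, key=val_rmse.get)
print(best_model_name)  # random_forest
```

If you fit additional models (ridge, a neural network, or your own), add their validation RMSEs to the dictionary before choosing.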

In [17]:
X_test = pd.read_csv("test.csv")
X_test = X_test.select_dtypes(include=['int', 'float']).drop(train.columns[0], axis=1)  # For testing purposes only
X_test

Unnamed: 0,yearsofexperience,yearsatcompany,stockgrantvalue,bonus,cityid,Masters_Degree,Bachelors_Degree,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic
0,4.0,0.0,6000.0,15000.0,6580,0,0,0,0,0,0,0,0,0,0
1,10.0,1.0,74000.0,16000.0,8198,0,0,0,0,0,0,0,0,0,0
2,10.0,5.0,40000.0,0.0,1311,0,0,0,0,0,0,0,0,0,0
3,2.0,2.0,0.0,1000.0,9592,0,1,0,0,0,1,0,0,0,0
4,5.0,1.0,15000.0,15000.0,7422,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12524,8.0,3.0,165000.0,0.0,7472,0,0,0,0,0,0,0,0,0,0
12525,17.0,10.0,320000.0,0.0,11527,1,0,0,0,0,1,0,0,0,0
12526,11.0,0.0,30000.0,30000.0,7434,1,0,0,0,0,1,0,0,0,0
12527,10.0,1.0,75000.0,30000.0,11420,0,1,0,0,0,0,1,0,0,0


In [18]:
y_test_pred = rf.predict(X_test)
y_test_pred

array([114380., 248280., 156180., ..., 234900., 300380., 202320.])