In [None]:
import numpy as np
import pandas as pd

# CX Kaggle Competition: Salary Prediction

### Table of Contents

* [5. Setup](#setup)
* [6. Validation](#validation)
* [7. Accuracy & Error](#rmse)
* [8. Different Models](#models)
* [9. Submission](#submission)

### Hosted and maintained by the [Students Association of Applied Statistics (SAAS)](https://saas.berkeley.edu). Authored by Jasmine Lee and Akhil Vemuri.

# Modeling

We will now build several regression models to predict the ```totalyearlycompensation``` column of our data. Modeling involves several steps: splitting the data, fitting an appropriate model, evaluating that model, and predicting on the test dataset. At the end, we will determine which model performs best on our data.

<span id="setup"></span>

## Setup

**Question 1:** While it is possible to use categorical features in predictive modeling, for simplicity's sake we will filter out all non-numerical columns. Fill in the blanks so that only columns of type ```int``` or ```float``` remain.

In [None]:
train = pd.read_csv("train.csv")
X_train = train.drop(labels=['totalyearlycompensation'], axis=1)
y_train = train.loc[:, 'totalyearlycompensation']

X_train = X_train.select_dtypes(...).drop(...)
X_train
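
As a hedged illustration of dtype filtering, here is a minimal sketch on a made-up DataFrame. The column names are hypothetical and not taken from the competition data, and the `.drop(...)` step from the cell above is deliberately left out because which columns to drop depends on the real dataset.

```python
import pandas as pd

# Hypothetical toy frame; these column names are illustrative only.
toy = pd.DataFrame({
    "company": ["A", "B", "C"],                # object (string) column
    "yearsofexperience": [1.5, 4.0, 10.0],     # float column
    "level_code": [1, 2, 3],                   # int column
})

# include="number" keeps integer and float columns and discards the rest.
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```

Passing an explicit list such as `include=["int", "float"]` is another option if you prefer to name the types.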

#### One Hot Encoding (OPTIONAL)

Although we just removed all the categorical columns, those extra features can still be leveraged to potentially obtain a better model.

Use sklearn's [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) object to transform the categorical columns to numerical ones.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# train = pd.read_csv("train.csv")
# X_train = train.drop(labels=['totalyearlycompensation'], axis=1)
# y_train = train.loc[:, 'totalyearlycompensation']

# YOUR CODE HERE
...

<span id="validation"></span>

## Validation

**Question 2:** Use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split your training data into a training set and a validation set. A typical choice is to hold out 20% of the full training data for validation.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(..., ..., test_size=..., random_state=42)
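
To see what an 80/20 split produces, here is a small sketch on synthetic arrays. The names `X_demo` and `y_demo` and the data itself are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_va.shape)
```

With 100 rows, this leaves 80 rows for training and 20 for validation.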

#### K-Fold Cross Validation

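The validation method above is usable but not very robust: it gives a single error estimate that depends on which rows happen to land in the validation set. K-Fold Cross-Validation averages the error over K different splits, which usually gives a more stable estimate.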

**Question 3:** Feel free to set up your own K-Fold cross-validation scheme. For more information, please read https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd.

In [None]:
# Complete this as an exercise; you can still continue with the rest of this notebook using the single validation split
...
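
As one possible starting point, here is a minimal sketch of 5-fold cross-validation using sklearn's `KFold`. It runs on synthetic data with a linear model; the data, the number of folds, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data: a linear signal plus noise with standard deviation 0.5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmses = []
for train_idx, val_idx in kf.split(X_demo):
    # Fit on four folds, then evaluate on the held-out fold.
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[val_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(y_demo[val_idx], preds)))

# The cross-validated score is the average RMSE over the folds.
cv_rmse = float(np.mean(fold_rmses))
print(fold_rmses, cv_rmse)
```

Each row is used for validation exactly once, so the averaged RMSE depends less on any single lucky or unlucky split.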

<span id="rmse"></span>

## Accuracy & Error

Our Kaggle competition uses Root-Mean-Square-Error (RMSE) as the error metric. In mathematical notation, it is:

$$\text{RMSE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - \hat{y}_i)^2}.$$

**Question 4:** Complete the function below.

In [None]:
from sklearn.metrics import mean_squared_error
def rmse(y_true, y_pred):
    return ...
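
A small worked example can help check your function. The sketch below uses made-up numbers and computes RMSE in two ways: from sklearn's `mean_squared_error` and directly from the formula above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# Squared errors are 1, 0 and 4, so the MSE is 5/3 and the RMSE is sqrt(5/3), about 1.291.
rmse_demo = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

# The same quantity written out directly from the formula.
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))
print(rmse_demo, rmse_manual)
```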

<span id="models"></span>

## Analyzing Different Models

We will now analyze several regression models and compare how well they perform.

### Linear Regression

**Question 5:** Fit a linear regression model to your training data and report your RMSE.

*Hint: Simply run the following cells*

In [None]:
from sklearn.linear_model import LinearRegression

# Instantiate sklearn's linear regression object
lr = LinearRegression()

# Fit linear regression model
lr.fit(X_train, y_train)

In [None]:
# Predict against X_train using fitted model above
lr_train_pred = lr.predict(X_train)

# Calculate RMSE of predicted training output
rmse(y_train, lr_train_pred)

In [None]:
lr_val_pred = lr.predict(X_val)

# Calculate RMSE of validation output
# Note: the error on the validation data is usually slightly higher than on the training data
rmse(y_val, lr_val_pred)

### Random Forests

**Question 6:** Fit a random forest model to your data and report your RMSE.

**NOTE:** If you're finding that your model is performing worse than your linear regression, make sure you tune the parameters to the RandomForestRegressor! Try to understand what the parameters mean by looking at the Decision Trees lecture.

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=...)
rf.fit(..., ...)

In [None]:
rf_train_pred = rf.predict(...)
rmse(y_train, rf_train_pred)

In [None]:
rf_val_pred = rf.predict(...)
rmse(y_val, rf_val_pred)

This looks like a case of overfitting, since the training error is much lower than the validation error. Random forests and decision trees are notorious for overfitting because they can fit the training data almost exactly, but we can tune parameters such as ```max_depth``` (shallower trees trade a little bias for lower variance) or ```n_estimators``` (averaging more trees reduces variance) to improve the model's predictive ability.
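
The gap between training and validation error can be sketched on synthetic data. Everything below (the data, the helper `rmse_gap`, and the specific depths) is hypothetical and only meant to show how one might compare a fully grown forest with a depth-limited one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature, four noise features, noisy target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=2.0, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def rmse_gap(max_depth):
    """Fit a forest with the given max_depth; return (train RMSE, validation RMSE)."""
    forest = RandomForestRegressor(n_estimators=50, max_depth=max_depth, random_state=0)
    forest.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, forest.predict(X_tr)))
    val_rmse = np.sqrt(mean_squared_error(y_va, forest.predict(X_va)))
    return train_rmse, val_rmse

# max_depth=None grows each tree until its leaves are (nearly) pure.
deep_train, deep_val = rmse_gap(max_depth=None)
shallow_train, shallow_val = rmse_gap(max_depth=3)
print("deep:   ", deep_train, deep_val)
print("shallow:", shallow_train, shallow_val)
```

A large gap between the two numbers for the deep forest is the overfitting signature. Whether the shallow forest actually validates better depends on the data, so compare the validation RMSEs rather than assuming it.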

### Logistic Regression

Haha, don't get fooled here! Logistic regression (when combined with a decision rule) is really just a classification algorithm in disguise. In this case, we are looking for a numerical output, so logistic regression doesn't make sense at all.

### Ridge Regression

**Question 7:** Fit a ridge regression model and report your RMSE.

In [None]:
# YOUR CODE HERE
...
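
To build intuition for the regularization parameter, here is a hedged sketch on synthetic data showing how sklearn's `Ridge` shrinks the coefficients as `alpha` grows. The data and the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known linear coefficients plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Fit ridge regression for increasing regularization strength and
# record the Euclidean norm of the fitted coefficient vector.
coef_norms = {}
for alpha in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms[alpha] = float(np.linalg.norm(ridge.coef_))

print(coef_norms)
```

A very small `alpha` behaves almost like ordinary linear regression, while a large `alpha` pulls the coefficients toward zero.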

**Question 8:** Do you notice a difference between the Ridge Regression model and the Linear Regression one? Does changing the regularization parameter increase or decrease accuracy? What does this tell you about the data?

**Answer:**

### Support Vector Regression (OPTIONAL)

Fit a support vector regression model and report your RMSE.

**NOTE:** Support vector machines (SVMs) can tend to overfit because of how they "enforce" correct classification of data points. This topic is out of scope, but it serves to show that fancier models don't always prevail.

If you would like to understand more about support vector regression, please read https://towardsdatascience.com/unlocking-the-true-power-of-support-vector-regression-847fd123a4a0.

In [None]:
from sklearn.svm import SVR
svr = SVR(C=0.001, kernel='linear')

In [None]:
svr.fit(X_train[:10000], y_train[:10000])    # Limiting to 10000 samples b/c SVR takes a while

In [None]:
svr_train_pred = ...
rmse(y_train, svr_train_pred)

In [None]:
svr_val_pred = ...
rmse(y_val, svr_val_pred)

### Neural Networks (OPTIONAL)

Train a neural network on the data. Report your RMSE.

**NOTE**: Neural networks take a long time to train, so it is better to train them on a GPU. Kaggle provides free weekly GPU usage (37 hours/week). To use a GPU, choose 'GPU' under Accelerator in the Settings panel on the right side of your screen.

In [None]:
# YOUR CODE HERE
...
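
If you want a quick baseline before moving to a deep learning framework, one option is sklearn's `MLPRegressor`. The sketch below is an assumption-laden illustration on synthetic data: the layer sizes, `max_iter`, and the scaling step are arbitrary choices, and sklearn's implementation runs on the CPU, so it does not use the GPU accelerator.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Neural networks usually train better on standardized inputs,
# so the scaler and the network are chained in one pipeline.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
mlp.fit(X_demo, y_demo)
mlp_pred = mlp.predict(X_demo)
print(mlp_pred[:5])
```

On the real data, the salary target is on a much larger scale than these synthetic values, so you may also need to rescale the target for training to behave well.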

### Your Own Model

There are tons of regression models out there to choose from. Some perform better in certain situations than others, and some require specific assumptions about the data to work at all. If you would like to try out more models, please read https://towardsdatascience.com/7-of-the-most-commonly-used-regression-algorithms-and-how-to-choose-the-right-one-fc3c8890f9e3.

<span id="submission"></span>

## Submission

**Question 9:** Choose the model that performed best on the validation data. Use that model to predict against the test data.

In [None]:
X_test = pd.read_csv("test.csv")
X_test = X_test.select_dtypes(...).drop(...)
X_test
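
A fitted sklearn model expects the test features to match the training features, so it can help to align the columns explicitly. The sketch below uses two hypothetical toy frames; the column names are made up.

```python
import pandas as pd

# Hypothetical frames: the test data has an extra column and a different column order.
train_features = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
test_raw = pd.DataFrame({"b": [2.5], "extra": ["x"], "a": [3]})

# Select the training columns, in training order, from the test frame.
test_features = test_raw[train_features.columns]
print(list(test_features.columns))
```

If a training column is missing from the test frame, this selection raises a `KeyError`, which is a useful early warning before calling `predict`.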

In [None]:
y_test_pred = ...    # Choose best model to predict using
y_test_pred

Run the cells below to save your predicted test values.

In [None]:
from datetime import datetime

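def results_to_csv(data, y_test):
    """Save predictions to '<data>_submission_<timestamp>.csv' in the working directory."""
    # Cast predictions to integers before writing them out
    y_test = y_test.astype(int)
    df = pd.DataFrame({'totalyearlycompensation': y_test})
    # The row index is written as the first column, labelled 'key'
    df.to_csv(data + '_submission_' + datetime.now().strftime("%Y_%m_%d-%H_%M_%S") + '.csv',
              index_label='key')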

In [None]:
results_to_csv("salary", y_test_pred)