# Our first predictive model!

One of the great strengths of Python and pandas is how easily they let us build powerful predictive models.  The code follows a basic structure that can be reused to fit many different types of models.   

Let's use a standard data set on features of different colleges to build a simple model to predict the graduation rate of a college (Y) from the other data available on that college (X).

In [None]:
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# this trick is required to get plots to display inline with the rest of your notebook,
# not in a separate window
%matplotlib inline

### EDA and preparing the data for modelling

Recall `College.csv` - download again to your local machine if needed by [clicking on this link](https://www.statlearning.com/s/College.csv).

In [None]:
# Now, run the following code and "Choose Files" to upload the file into Colab:

from google.colab import files
uploaded = files.upload()


In [None]:
#import the data into a Pandas data frame:
college_df = pd.read_csv("College.csv", index_col=0)

In [None]:
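# Let's rename all of the columns to remove punctuation (PET PEEVE ALERT)
# good use for Gen AI!!
# One possible approach: drop every non-word character, so e.g. "Grad.Rate" becomes "GradRate"
college_df.columns = college_df.columns.str.replace(r"[^\w]", "", regex=True)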

We want to fit a model predicting `GradRate` based on the other features of the college.  

We need to do some pre-processing first.  The feature `Private` is categorical, so we need to transform it into a binary feature before fitting a regression.  We can do this using the `get_dummies` function from `pandas`.
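
As a minimal sketch of what `get_dummies` does, here is a toy data frame. The values and the names `toy_df` and `toy_encoded` are made up for illustration and are not taken from `College.csv`.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the encoding
toy_df = pd.DataFrame({"Private": ["Yes", "No", "Yes"], "Apps": [1660, 2186, 1428]})

# drop_first=True keeps one indicator column (Private_Yes) instead of two redundant ones
toy_encoded = pd.get_dummies(toy_df, columns=["Private"], prefix=["Private"], drop_first=True)
print(toy_encoded)
```

Depending on your pandas version, the new `Private_Yes` column holds booleans or 0/1 integers; either works as a regression input.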



In [None]:
college_df = pd.get_dummies(college_df, columns=['Private'], prefix=['Private'], drop_first=True)

Also, we want to make sure our target, `GradRate`, has reasonable values, that is, between 0 and 100.

In [None]:
# prompt: check for GradRate greater than 100 or less than 0

college_df[(college_df['GradRate'] > 100) | (college_df['GradRate'] < 0)]


In [None]:
# flag the ROWS (colleges) whose GradRate is outside 0-100
bad_rows = (college_df['GradRate'] > 100) | (college_df['GradRate'] < 0)
# drop the flagged rows
college_df = college_df[~bad_rows]


In [None]:
# Now define our X and y, what are we predicting and what are the features?

# the features are going to be everything other than GradRate

features = list(college_df.columns)
features.remove("GradRate")

# our target is GradRate

target = "GradRate"

# For readability, identify your X (predictors) and y (target) variable cleanly
X = college_df[features]
y = college_df[target]


**Is this a classification or regression task??**

Remember to define Training and Test sets.  The Training set is used to fit the model, and then that model is applied to the Test set to see how good the predictions are.  
- Split data into 80% training and 20% test sets
- Fit the model to the training data to optimize parameters
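
To see what an 80/20 split produces, here is a small sketch on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the college data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
```

The rows are shuffled before splitting, so the test set is a random 20% of the data rather than the last 20 rows.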

In [None]:
# Split data into training and test sets

# need to import the `train_test_split` function

# Use random_state if you want the same random split every time

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)


### Fitting a Linear Regression model




In [None]:
# First we'll need to import the predictive model class that we'll use

from sklearn.linear_model import LinearRegression

# Instantiate the model
my_model = LinearRegression()

# fit the regression model to the TRAINING DATA
my_model.fit(X_train,y_train)


In [None]:
# print the coefficients from the model

print(pd.DataFrame({'Predictor':X_train.columns, 'coefficient':my_model.coef_.round(3)}))

## Evaluating the Model

That's it!  This same fit-then-predict structure applies to most models in scikit-learn, and now you can fit some of the most powerful models out there with this simple code! 💪 💻

Now we get to see how well the model performs.  One way to do evaluation is to numerically compare the predictions ON THE TEST SET to the actual values in the test set.

This is a regression problem so our predictions are numeric.  We can compare the prediction and the actual using mean absolute error (MAE) or root mean squared error (RMSE).

 [There are many other metrics available too!](https://scikit-learn.org/stable/modules/model_evaluation.html#)
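
To make the two metrics concrete, here is a small worked example. The numbers are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical graduation rates (not from College.csv)
y_true_demo = np.array([60, 70, 80, 90])
y_pred_demo = np.array([62, 67, 80, 95])

errors = y_pred_demo - y_true_demo          # [ 2, -3,  0,  5]
mae_demo = np.mean(np.abs(errors))          # (2 + 3 + 0 + 5) / 4 = 2.5
rmse_demo = np.sqrt(np.mean(errors ** 2))   # sqrt((4 + 9 + 0 + 25) / 4) = sqrt(9.5)

print("MAE:", round(mae_demo, 2))
print("RMSE:", round(rmse_demo, 2))
```

Both metrics are in the same units as the target (percentage points of graduation rate here). RMSE is never smaller than MAE because squaring gives large misses more weight; in this toy example the single 5-point miss pulls RMSE above MAE.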

In [None]:
# Calculate the model predictions as applied to the TEST set
y_pred = my_model.predict(X_test)


# Print the first 10 values in y_pred and y_test next to each other
predictions_df = pd.DataFrame({'Pred': y_pred.round(1), 'Truth': y_test})
print(predictions_df.head(10))


In [None]:
# Calculate RMSE directly using Numpy

pred_rmse = np.sqrt(np.mean((y_pred - y_test)**2))
print("RMSE:", pred_rmse)

# PET PEEVE TRIGGER ALERT - DIGITS!!


Interpret your RMSE in the context of the problem.  YOU SHOULD ALWAYS DO THIS!

We can instead use `sklearn.metrics` to calculate evaluation metrics directly.  Note that `root_mean_squared_error` requires scikit-learn 1.4 or later.


In [None]:
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score

rmse = root_mean_squared_error(y_test, y_pred)
print("RMSE:", round(rmse,2))
mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute error:", round(mae,3))
r_squared = r2_score(y_test, y_pred)
print("R-squared=",round(r_squared,3))

What if you were to calculate the RMSE on the TRAINING set?  Do you think it would be higher or lower?

In [None]:
y_training_pred = my_model.predict(X_train)
rmse_train = root_mean_squared_error(???,???)
print("RMSE_train:", round(rmse_train,3))
mae = mean_absolute_error(???,???)
print("Mean absolute error:", round(mae,3))

You will be using and amending this basic code MANY times this semester!

In [None]:
# Can we visualize how good our predictions are?
# Let's plot the predicted GradRate vs. the true value

plt.scatter(y_pred, y_test) # PET PEEVE TRIGGER ALERT
# plt.xlabel("Predicted GradRate")
# plt.ylabel("True GradRate")
# plt.title("Predicted GradRate vs. True GradRate")
# plot x=y line on plot
plt.plot([20, 100], [20, 100], 'r--')
plt.show()


### Optional Class Exercises: EDA and Modelling



1. Use the pd.plotting.scatter_matrix() function to produce a scatterplot matrix of any 3 numeric features and your Target [Apps, Accept, Enroll, Top10perc, GradRate].  What do you learn?

2. Use the boxplot() method to produce side-by-side boxplots of GradRate for Private=Yes vs Private=No. What do you learn?

3. We might be interested in studying Elite schools, where Elite is defined by whether the school has more than 50% of its students from the top 10% of their high school class (Top10perc).  Create a new feature, called Elite, by binning the Top10perc variable into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. How many Elite colleges are there, and do Elite colleges have a higher or lower graduation rate?

4. Recreate the AcceptPerc feature (`Accept` divided by `Apps`).  Add this as a new feature to the regression above.  Did the model get any better?  

5. Fit another model!  It is as simple as changing the line of code that defines `my_model` to use another type of model (as long as that model has been imported).  For instance, replace the lines above with:

  `from sklearn.tree import DecisionTreeRegressor`

  `my_model = DecisionTreeRegressor(max_depth=3)`

  everything after that can remain the same!
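
As a minimal sketch of that last point, the loop below runs the same fit, predict, and evaluate steps with two different models. The data here is synthetic and hypothetical (`X_demo`, `y_demo`), standing in for the college features so the example is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: 200 rows, 4 features, roughly linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 65 + 5 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

models = [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]

# Only the model object changes; fit / predict / evaluate stay the same
results = {}
for name, model in models:
    model.fit(X_tr, y_tr)
    results[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(name, "MAE:", round(results[name], 2))
```

Which model scores better depends on the data, so compare the test-set metrics on the college data rather than relying on this toy example.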
