# Our first predictive model!

One of the strengths of Python and pandas is how easily we can build powerful predictive models.  The code follows a basic structure that can be reused to fit many different types of models.

Let's use a standard data set of car features to build a simple model that predicts the mpg of a car (y) from the other data available on that car (X).

In [1]:
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# this trick is required to get plots to display inline with the rest of your notebook,
# not in a separate window
%matplotlib inline

### EDA and preparing the data for modelling

In [None]:
# This reads the data from a url and sets the column names.

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original"
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model', 'origin', 'car_name']
# sep=r"\s+" splits on whitespace (delim_whitespace is deprecated in recent pandas versions)
mpg_df = pd.read_csv(url, sep=r"\s+", header=None, names=column_names)

In [None]:
# always a good idea to look at your data!!
mpg_df.head()

In [None]:
mpg_df.describe()

In [None]:
# how many missing values for each feature?
mpg_df.isnull().sum()

In [6]:
# Eliminate (drop) any instances with missing values (NaNs)
cleaned_df = mpg_df.dropna()

We want to fit a model predicting `mpg` based on the other features of the car.  Is this a classification or a regression problem?

In [7]:
# Now define our X and y, what are we predicting and what are the features?


features = ["weight", "acceleration", "cylinders", "displacement"]
target = "mpg"

# For readability, identify your X (predictors) and y (target) variable cleanly
X = cleaned_df[features]
y = cleaned_df[target]


Recall the roles of the training and test sets.  The training set is used to fit the model; the fitted model is then applied to the test set to see how good its predictions are.
- Split the data into 80% training and 20% test sets
- Fit the model to the training data to optimize its parameters

In [8]:
# Split data into training and test sets
# need to import the `train_test_split` function

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
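
Because `train_test_split` shuffles the rows, the split above changes on every run, and so will the error metrics. The sketch below, on a small made-up data set (the `toy_X` and `toy_y` values are hypothetical, not the car data), shows one way to make the split repeatable by fixing `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the car data (hypothetical values, for illustration only)
toy_X = pd.DataFrame({"weight": np.arange(10) * 100 + 2000})
toy_y = pd.Series(np.arange(10) + 15.0, name="mpg")

# Fixing random_state makes the shuffle, and therefore the split, repeatable
split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print("train rows:", len(X_tr), "test rows:", len(X_te))
print("same test rows both times:", X_te.index.equals(split_b[1].index))
```

Any integer works as the seed; what matters is using the same one each time you want the same split.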

### Fitting a Linear Regression model




In [None]:
# First we'll need to import the predictive model class that we'll use

from sklearn.linear_model import LinearRegression

# Instantiate the model
my_model = LinearRegression()

# fit the regression model to the TRAINING DATA
my_model.fit(X_train,y_train)

print(pd.DataFrame({'Predictor':X_train.columns, 'coefficient':my_model.coef_.round(3)}))

## Evaluating the Model

That's it!  The same instantiate-then-fit structure applies to most scikit-learn models, so you can now fit some of the most powerful models out there with this simple code! 💪 💻

Now we get to see how well the model performs.  One way to do evaluation is to numerically compare the predictions ON THE TEST SET to the actual values in the test set.

This is a regression problem so our predictions are numeric.  We can compare the prediction and the actual using mean absolute error (MAE) or root mean squared error (RMSE).

 [There are many other metrics available too!](https://scikit-learn.org/stable/modules/model_evaluation.html#)
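
To make the two metrics concrete, here is a small worked example on four made-up predictions (the `toy_truth` and `toy_pred` values are hypothetical). MAE averages the absolute errors, while RMSE squares the errors before averaging and then takes the square root, so large misses count for more.

```python
import numpy as np

# Hypothetical true and predicted mpg values, for illustration only
toy_truth = np.array([20.0, 25.0, 30.0, 35.0])
toy_pred = np.array([21.0, 23.0, 33.0, 35.0])

# Prediction errors: [1, -2, 3, 0]
errors = toy_pred - toy_truth

# MAE: mean of |error| = (1 + 2 + 3 + 0) / 4 = 1.5
toy_mae = np.mean(np.abs(errors))

# RMSE: sqrt of mean squared error = sqrt((1 + 4 + 9 + 0) / 4) = sqrt(3.5)
toy_rmse = np.sqrt(np.mean(errors ** 2))

print("MAE:", toy_mae)
print("RMSE:", round(float(toy_rmse), 3))
```

Both metrics are in the same units as the target (mpg here), and RMSE is never smaller than MAE.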

In [None]:
# Calculate the model predictions as applied to the TEST set
y_pred = my_model.predict(X_test)

# Print the first 10 values in y_pred and y_test next to each other
predictions_df = pd.DataFrame({'Pred': y_pred.round(1), 'Truth': y_test})
print(predictions_df.head(10))


In [None]:
# Calculate RMSE directly using Numpy

pred_rmse = np.sqrt(np.mean((y_pred - y_test)**2))
print("RMSE:", pred_rmse.round(2))


We can use `sklearn.metrics` to calculate many useful evaluation metrics (note that `root_mean_squared_error` requires scikit-learn 1.4 or later).


In [None]:
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score

# Use the built-in round() so this works whether the metric is a Python or NumPy float
rmse = root_mean_squared_error(y_test, y_pred)
print("RMSE:", round(rmse, 2))
mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute error:", round(mae, 2))
r_squared = r2_score(y_test, y_pred)
print("R-squared:", round(r_squared, 3))

You will be using and amending this basic code MANY times this semester!
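
As an illustration of how little changes when you amend the code, the sketch below swaps in a different regressor while keeping the same split, fit, predict, and evaluate steps. It uses synthetic data (the `demo_` names and the choice of `DecisionTreeRegressor` with `max_depth=4` are assumptions made for this example, not part of the car analysis).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic data: two features and a noisy linear target (illustration only)
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(200, 2))
demo_y = 3.0 * demo_X[:, 0] - 2.0 * demo_X[:, 1] + rng.normal(0, 1, size=200)

# Same 80/20 split as before
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

# Only these two lines differ from the linear regression version
tree_model = DecisionTreeRegressor(max_depth=4, random_state=0)
tree_model.fit(demo_X_train, demo_y_train)

# Predict on the test set and evaluate, exactly as before
demo_pred = tree_model.predict(demo_X_test)
demo_mae = mean_absolute_error(demo_y_test, demo_pred)
print("MAE:", round(demo_mae, 2))
```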

In [None]:
# Can we visualize how good our predictions are?
# Let's plot the predicted mpg vs. the true value

predictions_df.plot(kind="scatter",  x="Pred",y="Truth")

# add a reference line where predicted = true (perfect predictions fall on this line)
plt.plot([0, 50], [0, 50], color='red', linestyle='--')
plt.xlim(5, 40)
plt.ylim(5, 40)
plt.xlabel("Predicted MPG")
plt.ylabel("True MPG")

# Putting it all together - Hands-on exercise: College Admissions


The College data set is available as a CSV file at the URL used in the cell below.


Read the College.csv data into a data frame called `college_df`.


### About the data

In [None]:
# read in data
college_df = pd.read_csv("https://www.statlearning.com/s/College.csv") # to read from URL
#college_df = pd.read_csv("College.csv") # if you have it locally



The data set contains a number of variables for 777 different universities and colleges in the US. The variables are:

- Private : Public/private indicator
- Apps : Number of applications received
- Accept : Number of applicants accepted
- Enroll : Number of new students enrolled
- Top10perc : New students from top 10 % of high school class
- Top25perc : New students from top 25 % of high school class
- F.Undergrad : Number of full-time undergraduates
- P.Undergrad : Number of part-time undergraduates
- Outstate : Out-of-state tuition
- Room.Board : Room and board costs
- Books : Estimated book costs
- Personal : Estimated personal spending
- PhD : Percent of faculty with Ph.D.s
- Terminal : Percent of faculty with terminal degree
- S.F.Ratio : Student/faculty ratio
- perc.alumni : Percent of alumni who donate
- Expend : Instructional expenditure per student
- Grad.Rate : Graduation rate

What interesting questions might you want to ask?  Specifically, what would make a good target feature?

In [None]:
college_df.describe()

In [None]:
college_df.head()

### Optional Class Exercises: EDA and Modelling

1. Use the pd.read_csv() function to read the data into Python. Call the loaded data `college_df`.

2. Use head() and describe() to explore the features.  Do you see anything interesting that might require attention?

3. Use the pd.plotting.scatter_matrix() function to produce a scatterplot matrix of any 3 numeric features and your Target [Apps, Accept, Enroll, Top10perc, Grad.Rate].  What do you learn?

4. Use the boxplot() method to produce side-by-side boxplots of Outstate for Private=Yes vs Private=No. What do you learn?

5. We might be interested in studying Elite schools, where Elite is defined by whether the school has more than 50% of its students from the top 10% of their high school class (Top10perc).  Create a new feature, called Elite, by binning the Top10perc variable into two groups based on whether or not the percentage of students coming from the top 10% of their high school classes exceeds 50. Note that Top10perc is recorded as a percentage (0 to 100), not a proportion. How many Elite colleges are there?

`college_df['Elite'] = pd.cut(college_df['Top10perc'], [0, 50, 100],
labels=['No', 'Yes'])`

6. Fit a regression where Y = your target of interest and X = three features that you think might be good predictors of the target. What is your RMSE? Interpret the RMSE.

7. If time: What other interesting things can you learn from this data?  
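
For exercise 5, it can help to check how `pd.cut` assigns values at the bin edges before applying it to the real data. The sketch below uses a tiny made-up frame (`toy_df` and its values are hypothetical); `include_lowest=True` is an optional extra here that would keep a value of exactly 0 in the first bin.

```python
import pandas as pd

# Hypothetical Top10perc values chosen to sit on and around the 50% cutoff
toy_df = pd.DataFrame({"Top10perc": [10, 50, 51, 90]})

# Bins are right-inclusive by default: [0, 50] -> 'No', (50, 100] -> 'Yes'
toy_df["Elite"] = pd.cut(
    toy_df["Top10perc"],
    bins=[0, 50, 100],
    labels=["No", "Yes"],
    include_lowest=True,
)

print(toy_df)

# Count how many rows fall in each group
print(toy_df["Elite"].value_counts())
```

A school at exactly 50 is labeled "No", which matches the "more than 50%" definition of Elite.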

