# NETID: <fill in here\>

# Assessing Model Accuracy
In this lesson, we look at different ways to evaluate whether the machine learning models you build accomplish their intended objective.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

weather = pd.read_csv('lecture5dataA.csv').dropna()
noncategorical = [weather.columns[i] for i in range(3, 11) if i != 9]
print("Noncategorical Features: ", noncategorical)
weather.head()

## Loss Functions and Accuracy

When evaluating your models, remember that different models must be evaluated with the appropriate metric. Classification accuracy is not, for example, the same thing as the mean squared error used in regression problems. Furthermore, a good value on either metric (high accuracy or low MSE) does not by itself prove a model is "good".

## <span style="color:green"><em>Problem 1</em></span>
Edit the lines marked with TODOs below to do the following:
1. Add two columns to `temperatures` to store Temperature and Apparent Temperature in degrees Rankine. The Rankine scale is an unusual unit of temperature. Temperature in Rankine is 9/5 * (temperature in Celsius) + 491.67.
2. Train two models and predict with each: one for Celsius and one for Rankine
3. Compare the results

In [None]:
temperatures = weather.loc[:,["Temperature (C)", "Apparent Temperature (C)"]]
temperatures["Temperature (R)"] = 0 #TODO: replace the 0 (hint: you don't need a loop)
temperatures["Apparent Temperature (R)"] = 0 #TODO: replace the 0

In [None]:
celcius_model = LinearRegression()
# TODO: split the data. The goal (target) is temperature in Celsius; the feature is apparent temperature in Celsius.
# Make sure the variable names for the two models' data are different! Otherwise, one will overwrite the other.
# Then, fit the model.


x_tr_C, x_te_C, y_tr_C, y_te_C = # Fill in here


rankines_model = LinearRegression()
# TODO: same as above, but for Rankine


x_tr_R, x_te_R, y_tr_R, y_te_R = # Fill in here

In [None]:
from sklearn.metrics import mean_squared_error

# TODO store the predictions for the test sets
celcius_predictions = "Fill in here"
rankines_predictions = "Fill in here"

# TODO find mean squared error of each model's predictions
celcius_MSE = mean_squared_error("Fill in here", "Fill in here") # TODO
rankines_MSE = mean_squared_error("Fill in here", "Fill in here") # TODO

print("celcius MSE:", celcius_MSE)
print("rankines MSE:", rankines_MSE)
print("\n(if the MSE for rankines is 0, you missed something two cells above)")

#### The MSEs of the two models are significantly different: one is more than triple the other. To inspect this difference, let's plot the predictions of the two models and compare.

In [None]:
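plt.figure(figsize=(15, 5))  # plt.subplots() would leave an extra empty axes behind the panels
plt.subplot(121)
# Label each artist explicitly so the legend cannot swap the two entries
plt.scatter(x_te_C, y_te_C, label="Actual Values")
plt.plot(x_te_C, celcius_predictions, 'k', linewidth=4, label="Predictions")
plt.legend()
plt.title('Celsius Linear Regression')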
plt.xlabel('Apparent Temperature (C)')
plt.ylabel('Temperature (C)')

plt.subplot(122)
# Label each artist explicitly so the legend cannot swap the two entries
plt.scatter(x_te_R, y_te_R, label="Actual Values")
plt.plot(x_te_R, rankines_predictions, 'k', linewidth=4, label="Predictions")
plt.legend()
plt.title('Rankines Linear Regression')
plt.xlabel('Apparent Temperature (R)')
plt.ylabel('Temperature (R)')
plt.show()

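#### The plots look the same! The only significant difference is the scale of the axes. That's why the MSE for Rankines is bigger: a change of one degree Celsius is a change of 9/5 degrees Rankine, so every error is 9/5 times larger and every squared error is (9/5)^2 = 3.24 times larger (assuming both models use the same train/test split). The constant offset of 491.67 cancels out and has no effect on the error. To compare models independently of their units, we use a _baseline_.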
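
To see the effect of units in isolation, here is a minimal sketch on synthetic data. The arrays and the helper `to_rankine` are made up for illustration and are not part of the weather dataset or the exercise. Both the true values and the predictions are converted from Celsius to Rankine, and the two MSEs are compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true temperatures and imperfect predictions, in Celsius.
y_true_C = rng.uniform(-10, 35, size=1000)
y_pred_C = y_true_C + rng.normal(0, 2, size=1000)

def to_rankine(celsius):
    """Convert Celsius to Rankine using the formula given in Problem 1."""
    return 9 / 5 * celsius + 491.67

# MSE computed by hand so the scaling is easy to follow.
mse_C = np.mean((y_true_C - y_pred_C) ** 2)
mse_R = np.mean((to_rankine(y_true_C) - to_rankine(y_pred_C)) ** 2)

# The offset cancels in the difference; only the factor 9/5 survives, squared.
ratio = mse_R / mse_C
print("MSE ratio (Rankine / Celsius):", ratio)
```

The predictions are equally good in both units, yet the raw MSE differs by a fixed factor, which is why a raw MSE cannot be compared across differently scaled targets.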

### <span style="color:green"><em>end of Problem 1</em></span>

## <span style="color:green"><em>Problem 2</em></span>

Compute the 'score' of the Celsius model using sklearn's .score() method on *celcius_model*. Do the same for the Rankines model.

In [None]:
print("sklearn's score for Celsius:", celcius_model.score('FILL IN HERE','FILL IN HERE'))

In [None]:
print("sklearn's score for Rankines:", rankines_model.score('FILL IN HERE','FILL IN HERE'))

But what exactly is .score() doing?

When building a model, we typically have a baseline model to compare against. This allows us to see whether or not our model is better than a relatively simple, naive model.

In our case, the simplest, most naive (baseline) model we can build to predict a location's temperature is one that predicts the same value for every test point: the mean of all temperatures in the testing set.

Go ahead and compute the MSE for the outputs of this baseline model: i.e. compute the MSE between the true outputs (i.e. *y_te_C*) and the predicted outputs of this baseline model (i.e. mean of the testing labels, *y_te_C*). Follow the same procedure for the Rankines model.

In [None]:
test_goal_mean_C = 'FILL IN HERE'
baseline_C = np.full((len(celcius_predictions),), test_goal_mean_C)
baseline_C_MSE = mean_squared_error('FILL IN HERE', 'FILL IN HERE')

In [None]:
test_goal_mean_R = 'FILL IN HERE'
baseline_R = np.full((len(rankines_predictions),), test_goal_mean_R)
baseline_R_MSE = mean_squared_error('FILL IN HERE', 'FILL IN HERE')

Now, compute the normalized score (relative to a baseline model), defined as `norm_score = 1 - model_MSE / baseline_MSE`. If you did everything correctly, your computed normalized scores should match the values returned by sklearn's .score() method, up to floating-point rounding.
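
For reference, the normalized score can be written out in full. The symbols here are introduced only for this illustration: $y_i$ are the true test labels, $\hat{y}_i$ the model's predictions, $\bar{y}$ the mean of the test labels, and $n$ the number of test points.

$$
R^2 = 1 - \frac{\mathrm{MSE}_{\text{model}}}{\mathrm{MSE}_{\text{baseline}}}
    = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
    = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

The factors of $1/n$ cancel, leaving the coefficient of determination $R^2$, which is what sklearn's regressors report from .score(). A score of 1 is a perfect fit, 0 matches the baseline, and a negative score is worse than the baseline. Rescaling the target by a constant factor multiplies the numerator and denominator by the same amount, so the score does not depend on the units.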

If necessary, ask TAs for help!

In [None]:
score_C = 'FILL IN HERE'
print("Your computed score:", score_C)

In [None]:
score_R = 'FILL IN HERE'
print("Your computed score:", score_R)

### <span style="color:green"><em>end of Problem 2</em></span>

## Bias and Variance

To understand one of the most important concepts in machine learning evaluation, the bias-variance tradeoff, we must first establish what each term means. Simply put, *bias* is the tendency to systematically over- or under-estimate something. For example, if a seesaw starts off at an incline, then we can say that it is already biased to one side regardless of the weight of the people using it. On the other hand, *variance* measures how far observations spread around their mean value. High variance corresponds to more spread-out observations, while low variance corresponds to datapoints that are clumped closer together.

How do these terms apply to machine learning models? One way to think about a highly biased model is to consider the worst case, where the model fails to learn anything at all. The model then stays at its pre-training parameters, so its predictions are biased toward whatever those parameters produce. High variance is the opposite problem. Consider a model whose predictions are fairly accurate on average. If it exhibits high variance, its individual predictions stray further from that average, meaning it is more sensitive to any noise in the data.
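
A small simulation can make both terms concrete. This is only a sketch under assumed conditions, not part of the lesson's datasets: the "truth" is a sine curve, each dataset adds fresh Gaussian noise, and polynomials of different degrees stand in for simple and flexible models. The function name `bias_and_variance` and all parameter values are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup for illustration: a sine curve observed with Gaussian noise.
x = np.linspace(0, 1, 30)
true_f = np.sin(2 * np.pi * x)

def bias_and_variance(degree, n_datasets=200, noise=0.3):
    """Fit a polynomial of the given degree to many noisy copies of the data."""
    preds = np.empty((n_datasets, x.size))
    for k in range(n_datasets):
        y_noisy = true_f + rng.normal(0, noise, size=x.size)
        coefs = np.polyfit(x, y_noisy, degree)
        preds[k] = np.polyval(coefs, x)
    # Squared bias: how far the average prediction is from the truth.
    bias_sq = np.mean((preds.mean(axis=0) - true_f) ** 2)
    # Variance: how much predictions move from one noisy dataset to the next.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {degree: bias_and_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

In this setup the straight line (degree 1) cannot follow the curve no matter how much data it sees, so its bias is large, while the degree-9 polynomial tracks the curve on average but changes more from one noisy dataset to the next.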

## Bias-Variance Tradeoff

The discussion above leads to the 'bias-variance tradeoff'. Simply put, the bias and variance of a model's predictions must be balanced to find the best machine learning model for a task. As you may have guessed, lowering bias (by making a model more flexible) typically raises variance, and lowering variance typically raises bias; hence, the tradeoff.

## Overfitting and Underfitting

Having a high bias means your model did not learn as much as it could have (*underfitting*), while having a high variance means the model was responsive to training data to the point that it does not generalize well (*overfitting*).


In [None]:
deaths = pd.read_csv('lecture5dataB.csv')
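# Fill missing values by assignment; chained inplace calls on a column
# trigger warnings (or silently do nothing) in newer pandas versions.
deaths['Book of Death'] = deaths['Book of Death'].fillna(0)
deaths['Death Year'] = deaths['Death Year'].fillna(deaths['Death Year'].mean())
deaths.dropna(subset=['Book Intro Chapter'],inplace=True)
deaths['Death Chapter'] = deaths['Death Chapter'].fillna(deaths['Death Chapter'].mean())
# Strip a leading "House " so that "House Stark" and "Stark" count as one allegiance.
# regex=True is needed for a callable replacement in pandas 2.x.
deaths["Allegiances"] = deaths["Allegiances"].str.replace(pat=r'House (?P<one>.*)', repl=lambda m: m.group('one'), regex=True)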

print(",\n".join(deaths["Allegiances"].unique()))

In [None]:
deaths.head()

In [None]:
from sklearn.tree import DecisionTreeClassifier

X = deaths[['Death Year','Book Intro Chapter','Book of Death','Death Chapter']]
Y = deaths['Allegiances']
x_tr, x_te, y_tr, y_te = train_test_split(X, Y, test_size = 0.2, random_state=42)
train_scores = []
test_scores = []

max_depths = list(range(10,100))
for i in max_depths:
    model = DecisionTreeClassifier(max_depth=i)

    model.fit(x_tr, y_tr)
    
    train_scores.append(model.score(x_tr, y_tr))
    test_scores.append(model.score(x_te, y_te))
    
plt.figure(figsize=(15, 5))  # plt.subplots() would leave an extra empty axes behind the panels
plt.subplots_adjust(wspace=0.4)
plt.subplot(131)
plt.plot(max_depths, train_scores)
plt.title('Training Score: More complex is better')
plt.xlabel('Model Complexity')
plt.ylabel('Training Score')
plt.subplot(132)
plt.plot(max_depths, test_scores)
plt.title("Testing Score: There's a sweetspot")
plt.xlabel('Model Complexity')
plt.ylabel('Testing Score')
plt.subplot(133)
plt.plot(max_depths, np.subtract(train_scores,test_scores))
plt.title("Generalization Error")
plt.xlabel('Model Complexity')
plt.ylabel('(Training Score) - (Testing Score)')
plt.show()

## <span style="color:green"><em>Problem 3 (Optional)</em></span>
### NOTE: there's a required Problem 4 at the bottom of the notebook. Don't skip it!
### Part a
Modify the loop above to programmatically find the best `max_depth` for a Decision Tree. Print out the training and testing score of just a model that uses that `max_depth`.
You could also try using sklearn's `GridSearchCV` instead, which we will cover in a later lecture.


### Part b
Now, imagine that getting a `Lannister`'s allegiance wrong has a much harsher consequence. To be specific, for any Lannister, the penalty for not predicting that they're a Lannister is 5x the normal penalty. Adjust your scoring mechanism to use this new metric (still produce a score normalized by a baseline). If you used GridSearchCV above, then see if you can use GridSearchCV's `scoring` parameter.
Note: we're dealing with classification here, not regression, so you'll need to use a classification loss function.
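
One possible starting point, shown here as a sketch on made-up labels rather than the real data, is the `sample_weight` argument that sklearn's classification metrics accept. The example uses `zero_one_loss` (the fraction of wrong predictions) and weights each sample by the cost of misclassifying it. The toy arrays and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import zero_one_loss

# Toy labels, made up for illustration (not taken from the dataset).
y_true = np.array(["Lannister", "Stark", "Stark", "Lannister", "None"])
y_pred = np.array(["Stark", "Stark", "Stark", "Lannister", "None"])

# Mistakes on true Lannisters cost 5x; every other sample costs 1x.
weights = np.where(y_true == "Lannister", 5.0, 1.0)

# Unweighted: 1 wrong prediction out of 5 samples.
plain_loss = zero_one_loss(y_true, y_pred)
# Weighted: the wrong sample carries weight 5 out of a total weight of 13.
weighted_loss = zero_one_loss(y_true, y_pred, sample_weight=weights)

print("unweighted 0-1 loss:", plain_loss)
print("weighted 0-1 loss:", weighted_loss)
```

The single mistake here happens to be on a Lannister, so the weighted loss is noticeably larger than the unweighted one. A normalized score could then be formed in the same spirit as Problem 2, by comparing against the weighted loss of a baseline of your choosing.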

### Part c
Now, imagine that you care twice as much about people whose Death Year is greater than or equal to `300`. Adjust your scoring mechanism to use this new metric (still produce a score normalized by a baseline). sklearn's built-in scoring names don't support this, because the weight depends on a feature value. GridSearchCV's `scoring` parameter does also accept a callable with the signature `scorer(estimator, X, y)`, which can see the features, so a custom scorer is one route to explore.

### <span style="color:green"><em>end of Problem 3</em></span>

## Feature-Subset Selection Techniques 


A dataset will usually have many features, many of which will not be useful at all. The key is to determine which are helpful in improving your model.

Use the following block to help decide whether a particular feature subset is helpful for a linear model built on a weather dataset from Szeged, a city in Hungary. Feel free to modify it to suit your needs.



In [None]:
weather[noncategorical].describe()

## <span style="color:green"><em>Problem 4</em></span>

Using what you have learned, create a correlation matrix of the data. Use it to decide the three best features to use in predicting Humidity and store those in a list named `three_correlated_features`. Then store the two best features in a list named `two_correlated_features`. Compare the result of using `three_correlated_features` vs `two_correlated_features` to train a Linear Regression. (When we say compare the results, we mean print out and compare the scores.)

Your results should show you an important lesson about feature selection: you don't always need all the features to get almost the same results, and selecting a smaller feature subset may be more resource-efficient.

### <span style="color:green"><em> end of Problem 4 </em></span>