## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. Cells in which "# YOUR CODE HERE" is found are the cells where your graded code should be written.
2. In order to test out or debug your code you may also create notebook cells or edit existing notebook cells other than "# YOUR CODE HERE". We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
3. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE"
4. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will lose points for your work in that section.
5. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
6. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the autograder will ignore the modified "assert" statement. Make sure you don't edit the assert statements.
7. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
8. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
9. The **Grading** section at the end of the document (before the **Feedback** section) contains some code for our autograder on GradeScope. You are expected to fail this block of code in your Jupyter environment. DO NOT edit this block of code, or you may not get points for your assignment.
10. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, code that you ran and later deleted may still be in memory, and it may be the only reason your asserts are passing.

# Feature Learning

In this exercise we will run feature learning methods on a dataset to identify the key features to use for predicting a target within it.

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn as sk

plt.style.use("ggplot")

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, mutual_info_regression, RFE
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

The data we'll be working with is the [California housing dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset).

In [3]:
house_data = fetch_california_housing()

In [4]:
print(house_data["DESCR"])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

After reading the description above, which features do you expect to be meaningful for predicting the target? Which do you expect not to be?

In [5]:
house_features = pd.DataFrame(house_data["data"], columns=house_data["feature_names"])
house_prices = pd.Series(house_data["target"])

In [6]:
house_features.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [7]:
house_features.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 |


In [8]:
house_prices.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
dtype: float64

In [9]:
house_prices.describe()

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
dtype: float64

First, we'll [standardize](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling) our data before passing it to our model. We'll use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler), which subtracts each feature's mean and divides by its standard deviation so that every feature has mean 0 and unit variance.

This class is a scikit-learn transformer, which follows a two-phase process:
1. First, it fits to the data to learn something (in this case, the mean and standard deviation of each feature).
1. Then it transforms any desired dataset using what it learned.

The helper method `fit_transform` performs both steps in a single call.

Because the scaler should be fit only once, on the training data, call `fit_transform` (or `fit` followed by `transform`) on the training data and then call only `transform` on the test data. Fitting again on the test data would scale it with its own statistics rather than the ones learned from the training data.
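
As a minimal sketch of the fit-then-transform pattern, the toy example below fits a scaler on a small training array and reuses it on a test array. The `toy_*` names and values are hypothetical and are not part of the housing data. It also checks by hand that `transform` is the same as subtracting the training mean and dividing by the training standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from the housing dataset).
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.0, 50.0], [5.0, 10.0]])

toy_scaler = StandardScaler()
# Learn the mean and standard deviation from the training data, then scale it.
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# Reuse the training statistics on the test data; do not fit again.
toy_test_scaled = toy_scaler.transform(toy_test)

# transform is equivalent to (x - mean_) / scale_ with the training statistics.
manual_test_scaled = (toy_test - toy_scaler.mean_) / toy_scaler.scale_

print(toy_train_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_test_scaled.mean(axis=0))   # generally not 0
```

The test means are not zero because the test data is scaled with the training mean, which is the behavior the asserts later in this section check for.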

In [10]:
X_train, X_test, y_train, y_test = train_test_split(house_features, house_prices, test_size=0.2, random_state=0)

In [11]:
# Use a StandardScaler to standardize the house_features.
# Update (overwrite) the older values of X_train and X_test with the
# scaled values that are output by the StandardScaler object.
scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [12]:
X_test.mean(axis=0)

array([-0.01475655,  0.00810338, -0.00714746,  0.0051149 ,  0.00017061,
        0.03115683,  0.01656657, -0.01669752])

In [13]:
# Take the absolute value so that large negative means do not pass by accident
assert np.all(np.abs(X_train.mean(axis=0)) < 0.0000000001)

# Note that if you do a fit_transform on the test data you did that part wrong
assert np.all(np.abs(X_test.mean(axis=0)) > 0.0000001)
assert np.all(np.abs(X_test.mean(axis=0)) < 0.04) 

Note that we split the data into training and test sets before fitting the scaler, following the process we outlined several lectures ago for effectively evaluating supervised learning problems.

## Filter Feature Selection

Select the best features based on [mutual information score](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression) from the __training data__, then transform X_train and X_test into the new subset of selected features. See [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest).

In [15]:
# We will select the best k features from all feature selection methods
k = 4

In [16]:
# Save the transformer you use (fit) as "mi_transformer".
# Save the transformed set of features as "mi_X_train" and "mi_X_test".
# YOUR CODE HERE
mi_transformer = SelectKBest(mutual_info_regression, k=k)

mi_transformer.fit(X_train, y_train)

mi_X_train = mi_transformer.transform(X_train)
mi_X_test = mi_transformer.transform(X_test)

In [17]:
for feature, importance in zip(house_features.columns, mi_transformer.scores_):
    print(f"The MI score for {feature} is {importance}")

The MI score for MedInc is 0.39963013811031
The MI score for HouseAge is 0.031868598399131365
The MI score for AveRooms is 0.10820951212328822
The MI score for AveBedrms is 0.0294647707244291
The MI score for Population is 0.02414968703531084
The MI score for AveOccup is 0.06974783284531316
The MI score for Latitude is 0.3604449601205042
The MI score for Longitude is 0.4014818781218592
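
To see which columns a fitted `SelectKBest` actually keeps, its `get_support()` method returns a Boolean mask over the input features. The sketch below is one way to use it, on synthetic data where only two columns drive the target. The data, the `toy_*` names, and the `mi_score` wrapper (used only to fix `random_state` so the scores are repeatable) are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(500, 4))
# Hypothetical target that depends only on columns "a" and "c".
toy_y = 3.0 * toy_X[:, 0] + 3.0 * toy_X[:, 2] + rng.normal(scale=0.1, size=500)
toy_names = np.array(["a", "b", "c", "d"])


def mi_score(X, y):
    """Hypothetical wrapper that fixes the seed of the MI estimator."""
    return mutual_info_regression(X, y, random_state=0)


toy_selector = SelectKBest(mi_score, k=2).fit(toy_X, toy_y)

# Boolean mask with one entry per input feature; True means "kept".
toy_mask = toy_selector.get_support()
toy_selected = list(toy_names[toy_mask])
print(toy_selected)
```

The same indexing should work with `house_features.columns` and `mi_transformer.get_support()` to list the features kept in this exercise.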


Do these values match what you expected? What are the most important features to use for predicting house prices?

In [18]:
assert mi_transformer.k == k
assert isinstance(mi_transformer, SelectKBest)
assert len(mi_transformer.scores_) == 8
assert mi_X_train.shape == (16512, k)
assert mi_X_test.shape == (4128, k)

Since the focus in this exercise is on the feature learning and not on the supervised learning portion, we will use a simple estimator (linear regression) for the model training portions.

In [19]:
miEst = LinearRegression().fit(mi_X_train, y_train)

In [20]:
print(f"The mean squared error when training on the MI selected features is {mean_squared_error(y_train, miEst.predict(mi_X_train))}.")
print(f"When testing on the test data, the mean squared error is {mean_squared_error(y_test, miEst.predict(mi_X_test))}")

The mean squared error when training on the MI selected features is 0.5494954356737961.
When testing on the test data, the mean squared error is 0.5642909967350348
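
For reference, the mean squared error is the average of the squared differences between the true and predicted values. The sketch below computes it by hand on a few made-up numbers (the `toy_*` arrays are hypothetical) and compares the result with `mean_squared_error`. Taking the square root gives an error in the same units as the target, which for this dataset is hundreds of thousands of dollars.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions (hypothetical, for illustration only).
toy_true = np.array([3.0, 1.5, 2.0, 4.0])
toy_pred = np.array([2.5, 1.5, 3.0, 3.0])

# Average of the squared residuals.
manual_mse = np.mean((toy_true - toy_pred) ** 2)
sklearn_mse = mean_squared_error(toy_true, toy_pred)

# Root mean squared error, expressed in the units of the target.
toy_rmse = np.sqrt(manual_mse)

print(manual_mse, sklearn_mse, toy_rmse)
```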


## Wrapper Feature Selection

Now try using [recursive feature elimination](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) to select the 4 features we will use instead.

Note that after calling the ```.fit()``` method of the RFE object, a Boolean array that indicates which features are selected is stored as ```rfe_transformer.support_```.
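
As a small illustration of these attributes, the sketch below runs RFE on synthetic data in which only two of six columns influence the target. The data and the `toy_*` names are assumptions for illustration. Besides `support_`, the fitted object also exposes `ranking_`, where selected features have rank 1 and features eliminated earlier have higher ranks.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 6))
# Hypothetical target that depends only on columns 1 and 4.
toy_y = 5.0 * toy_X[:, 1] - 4.0 * toy_X[:, 4] + rng.normal(scale=0.1, size=200)

# With step=2, RFE drops the two weakest features per round: 6 -> 4 -> 2.
toy_rfe = RFE(LinearRegression(), n_features_to_select=2, step=2)
toy_rfe.fit(toy_X, toy_y)

print(toy_rfe.support_)   # Boolean mask of the selected features
print(toy_rfe.ranking_)   # 1 for selected features, larger for earlier eliminations

# transform keeps only the selected columns.
toy_X_selected = toy_rfe.transform(toy_X)
```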

In [21]:
# Use an RFE object to determine the k features to select from X_train using a step of 2
# Save the rfe object as rfe_transformer
# Create rfe_X_train and rfe_X_test as the updated features based on the RFE output

rfeEst = LinearRegression()

rfe_transformer = RFE(rfeEst, n_features_to_select=k, step=2)

rfe_transformer.fit(X_train, y_train)

rfe_X_train = rfe_transformer.transform(X_train)
rfe_X_test = rfe_transformer.transform(X_test)

In [22]:
assert isinstance(rfe_transformer, RFE)
assert rfe_transformer.step == 2
assert len(rfe_transformer.support_) == 8
assert rfe_X_train.shape == (16512, k)
assert rfe_X_test.shape == (4128, k)

In [23]:
rfeEst = LinearRegression().fit(rfe_X_train, y_train)

In [24]:
print(f"The mean squared error when training on the RFE selected features is {mean_squared_error(y_train, rfeEst.predict(rfe_X_train))}.")
print(f"When testing on the test data, the mean squared error is {mean_squared_error(y_test, rfeEst.predict(rfe_X_test))}")

The mean squared error when training on the RFE selected features is 0.5451672299133382.
When testing on the test data, the mean squared error is 0.5580832992075773


In [25]:
print(f"The most important features as determined by RFE were {list(house_features.columns[rfe_transformer.support_])}")

The most important features as determined by RFE were ['MedInc', 'AveBedrms', 'Latitude', 'Longitude']


## Embedded Methods

For the embedded methods feature selection example, we will use Lasso. For this task you should use [LassoCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV) and **not** [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) so that it trains with various values for alpha.

Since this is an embedded method, the feature selection will occur directly in the model.
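
To illustrate how the L1 penalty performs selection inside the model, the sketch below fits a plain `Lasso` at a few fixed alpha values on synthetic data and counts the nonzero coefficients. The data, the alpha values, and the `toy_*` names are assumptions chosen for illustration; this is not the LassoCV model the task asks for.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 5))
# Hypothetical target: a strong effect from column 0 and a weak one from column 3.
toy_y = 2.0 * toy_X[:, 0] + 0.5 * toy_X[:, 3] + rng.normal(scale=0.1, size=300)

toy_coefs = {}
for alpha in (0.01, 0.1, 1.0):
    # A larger alpha means a stronger L1 penalty, which pushes more coefficients to exactly 0.
    toy_coefs[alpha] = Lasso(alpha=alpha).fit(toy_X, toy_y).coef_
    print(alpha, np.round(toy_coefs[alpha], 3),
          "nonzero:", np.count_nonzero(toy_coefs[alpha]))
```

A coefficient of exactly zero means the model ignores that feature, which is why the coefficient magnitudes can be read as a rough importance ranking when the inputs are standardized.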

In [26]:
# Create a LassoCV model trained with 10 alphas and save it to lassoEst
# YOUR CODE HERE
lassoEst = LassoCV(n_alphas=10)

lassoEst.fit(X_train, y_train)

In [27]:
lassoEst.coef_

array([ 0.8172285 ,  0.11878355, -0.22453658,  0.2660654 , -0.0062399 ,
       -0.02936718, -0.88177587, -0.85089246])

In [28]:
for feature, coef in zip(house_features.columns, lassoEst.coef_):
    print(f"The magnitude of the feature coefficient for {feature} is {abs(coef)}.")

The magnitude of the feature coefficient for MedInc is 0.8172284997738892.
The magnitude of the feature coefficient for HouseAge is 0.11878355321284885.
The magnitude of the feature coefficient for AveRooms is 0.2245365787499033.
The magnitude of the feature coefficient for AveBedrms is 0.26606540007395557.
The magnitude of the feature coefficient for Population is 0.0062399043585295655.
The magnitude of the feature coefficient for AveOccup is 0.02936717698787836.
The magnitude of the feature coefficient for Latitude is 0.8817758655028378.
The magnitude of the feature coefficient for Longitude is 0.850892463078756.


In [29]:
lassoEst.alpha_

0.0017266465315919485

In [30]:
assert lassoEst
assert isinstance(lassoEst, LassoCV)
assert len(lassoEst.coef_) == 8

In [31]:
print(f"The mean squared error when training using lasso is {mean_squared_error(y_train, lassoEst.predict(X_train))}.")
print(f"When testing on the test data, the mean squared error is {mean_squared_error(y_test, lassoEst.predict(X_test))}")

The mean squared error when training using lasso is 0.5236104671923245.
When testing on the test data, the mean squared error is 0.5298113560764504


Compare each model's prioritized features. Which model do you think is the best? What do they tell you about this data?

# Grading
The following code block is used purely for grading. If it raises an error in your environment, you can ignore it. DO NOT MODIFY THE CODE BLOCK BELOW.

In [None]:
# Autograding with Otter Grader
import otter
grader = otter.Notebook()
grader.check_all()

## Feedback

In [None]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    return 'this was cool'