For this homework assignment, you are to submit a **single** ipynb file. Use the provided ipynb file to keep the same formatting for each question. In the ipynb file name, replace "NAME" with your first name. Unless otherwise specified, present your code as well as the output in your report. It is the student's responsibility to make sure the ipynb file runs when submitted. This assignment is worth 59 points.

# 1: Cross-Validation *(21 points)*

To investigate cross-validation, we will look at the `housing` data set, which provides information from the 1990 US Census on housing in California. The 20,000+ observations have features like location, housing age, and population. The target variable is the median house value for a Californian district.

This is an example of regression, as opposed to classification as we have primarily studied.

Convert the inputs of the `housing` data set into a `pandas` data frame so that you can look at the variables. *(2 points)*

In [26]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [28]:
import pandas as pd

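# Use the data set's feature names as column labels so the variables are readable
house_data = pd.DataFrame(housing.data, columns=housing.feature_names)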

Split the data frame (inputs) and the target (outputs) into three sets: training set, validation set, test set as we did in class:

- First as a trainval set and test set
- Second as a train set and validate set

**Reminder to never use the test set until the very final step. We never want to tell our models what the test set is while we are finetuning the model.**  *(5 points)*

In [30]:
from sklearn.model_selection import train_test_split
(X_trainval, X_test, Y_trainval, Y_test)=train_test_split(house_data, housing.target)
(splitx_train,splitx_val,splity_train,splity_val)=train_test_split(X_trainval,Y_trainval)
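
The two-stage split above can be sketched on synthetic data to see how large each set ends up. This is only an illustration: the random arrays, the variable names, and the `random_state=0` seed are assumptions and are not part of the assignment. Fixing `random_state` makes the split reproducible between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the inputs and the target (hypothetical data, not the housing set)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = rng.normal(size=1000)

# First split: hold out the test set (the default test_size is 0.25)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)

# Second split: carve the validation set out of trainval only, never out of test
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=0)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)
```

With the default settings, roughly 56% of the rows end up in the train set, 19% in the validation set, and 25% in the test set.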


In order to perform the regression, we will use a Ridge linear regression model. Fit a ridge model (without adjusting any parameters) to the training set and record its accuracy on the validation set. *(3 points)*

*Reminder that the default score for regression is $R^2$, which takes values in $(-\infty,1]$.*

In [None]:
from sklearn.linear_model import Ridge

house_ridge = Ridge().fit(splitx_train,splity_train)
house_ridge.score(splitx_val, splity_val)

0.6235308901361161
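
As a reminder of what the score above measures, here is a minimal sketch of $R^2 = 1 - SS_{res}/SS_{tot}$ computed by hand. The helper name `r2_score_manual` and the toy arrays are hypothetical. The sketch shows why the score can be negative: predictions that are worse than always guessing the mean give $SS_{res} > SS_{tot}$.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_score_manual(y_true, y_true)                       # exact predictions
mean_only = r2_score_manual(y_true, np.full(4, 2.5))            # always predict the mean
bad = r2_score_manual(y_true, np.array([4.0, 3.0, 2.0, 1.0]))   # reversed predictions
print(perfect, mean_only, bad)
```

A perfect fit scores 1, predicting the mean scores 0, and the reversed predictions score below 0.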

Perform a stratified fold on the trainval set, returning all of the scores along with the average score. Is this score better than our original score without cross-validating?  *(6 points)*

In [None]:
from sklearn.model_selection import cross_val_score
import numpy as np

strat_fold = cross_val_score(Ridge(), housing.data,housing.target)
print(strat_fold)
np.mean(strat_fold)

[0.54878594 0.46817341 0.55078466 0.53693584 0.66053068]


np.float64(0.5530421056931834)

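The original single-split score (about 0.624) is higher than the cross-validated average (about 0.553). Note that the cross-validation above was run on the full data set (`housing.data`) rather than on the trainval set, so the two numbers are not directly comparable.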
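
One detail worth checking: for a regressor such as `Ridge`, `cross_val_score` with the default `cv` uses a plain, unshuffled `KFold` with 5 folds rather than a stratified split. The sketch below illustrates this on synthetic data from `make_regression` (an assumption, not the housing data) and shows how to pass a shuffled splitter explicitly. The `random_state=0` seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (hypothetical stand-in for the housing data)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Default behaviour for a regressor: 5 folds taken in row order, no shuffling
default_scores = cross_val_score(Ridge(), X, y)
explicit_scores = cross_val_score(Ridge(), X, y, cv=KFold(n_splits=5))

# Shuffling the rows before folding, useful when the rows are stored in a meaningful order
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(Ridge(), X, y, cv=shuffled_cv)

print(default_scores.mean(), shuffled_scores.mean())
```

If the rows of a data set are sorted in some way, unshuffled folds can differ a lot from one another, which is one possible reason for fold scores that vary widely.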

Increase the number of folds to 10. Return all scores along with the average score and determine if the accuracy improved or worsened. What does this tell you about the sensitivity of the model to the actual train-validate split? *(3 points)*

In [None]:
strat_fold10 = cross_val_score(Ridge(), housing.data,housing.target, cv=10)
print(strat_fold10)
np.mean(strat_fold10)


[0.482818   0.61412011 0.42268645 0.48182494 0.55703274 0.54134247
 0.47497151 0.45838648 0.48177509 0.59533218]


np.float64(0.5110289965995403)

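The accuracy worsened: the average score dropped from about 0.553 with 5 folds to about 0.511 with 10 folds. The individual fold scores also range from about 0.42 to about 0.61, which indicates that the model's score is sensitive to which observations land in the training and validation sets of a particular split.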
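
To put numbers on the spread of the fold scores, the arrays printed in the two cells above can be summarized by their mean, standard deviation, and range. The values below are copied from those outputs; only the variable names are new.

```python
import numpy as np

# Fold scores copied from the 5-fold and 10-fold outputs above
scores_5 = np.array([0.54878594, 0.46817341, 0.55078466, 0.53693584, 0.66053068])
scores_10 = np.array([0.482818, 0.61412011, 0.42268645, 0.48182494, 0.55703274,
                      0.54134247, 0.47497151, 0.45838648, 0.48177509, 0.59533218])

for name, s in [("5-fold", scores_5), ("10-fold", scores_10)]:
    # A large standard deviation relative to the mean signals split sensitivity
    print(f"{name}: mean={s.mean():.3f}, std={s.std():.3f}, "
          f"min={s.min():.3f}, max={s.max():.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two averages is larger than the fold-to-fold noise.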

Attempt to run a `LeavePOut` fold using $p=1000$. Theoretically we can do this, but in practice we can't in this case. When you've given up, stop the computation and explain why the computations never stopped. *(2 points)*

In [None]:
from sklearn.model_selection import LeavePOut
leave=LeavePOut(p=1000)
strat_leave = cross_val_score(Ridge(), housing.data,housing.target,cv=leave)
print(strat_leave)
np.mean(strat_leave)

KeyboardInterrupt: 

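The computation never finished because `LeavePOut` with $p=1000$ uses every possible subset of 1000 observations as a validation set, which is $\binom{n}{1000}$ splits for $n$ observations. With more than 20,000 observations this number is astronomically large, and a model has to be fit for each split, so the computation cannot finish in practice.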
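
The size of the problem can be estimated without running any model. The sketch below counts the splits with the binomial coefficient, first on a tiny case and then for the housing data. The row count of 20640 is an assumption (the text above only says "20,000+"), and the log is computed with `math.lgamma` to avoid building the huge integer.

```python
import math

# Tiny case: leaving out p=2 of n=5 points gives C(5, 2) validation sets
small_splits = math.comb(5, 2)

# Housing-sized case (n_obs is an assumed row count)
n_obs, p = 20640, 1000

# log10 of C(n_obs, p), using log-gamma so no enormous integer is created
log10_splits = (
    math.lgamma(n_obs + 1) - math.lgamma(p + 1) - math.lgamma(n_obs - p + 1)
) / math.log(10)

print(small_splits)
print(f"about 10^{log10_splits:.0f} splits")
```

Even at a million model fits per second, a count with over a thousand digits is far out of reach.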

# 2: Grid Search *(21 points)*

Continue to use the `housing` data set.

Using a for loop, perform a grid search (with cross-validation) with the list of parameter values for the regularization $\alpha$ in the Ridge model.

You should print all of the average scores in an array, the best score, and the best parameter. Also make sure you're using the appropriate data sets (trainval, train, val, or test)! *(7 points)*

In [33]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.linear_model import Ridge

params = [0.01,0.05,0.1,0.5,1,5,10,20,50,100]

arrays_mean = []
best_score = -np.inf  # R^2 can be negative, so do not start from 0
best_para = None

for i in params:
    # Mean cross-validation score on the trainval set for this alpha
    ral = Ridge(alpha=i)
    grid_cross = cross_val_score(ral, X_trainval, Y_trainval)
    mean_score = np.mean(grid_cross)
    arrays_mean.append(mean_score)

    if mean_score > best_score:
        best_score = mean_score
        best_para = i

print(np.array(arrays_mean))
print("best score:", best_score)
print("best parameter:", best_para)

[0.60418138 0.60418215 0.60418312 0.60419079 0.60420031 0.60427322
 0.60435668 0.60450018 0.60477736 0.60488876]
best score: 0.6048887561125514
best parameter: 100


Verify that the `GridSearchCV` function performs like your for loop by finding the best parameters and score for the same list of parameters. *(5 points)*

In [35]:
params1 = {'alpha': [0.01,0.05,0.1,0.5,1,5,10,20,50,100]}

grid_searched=GridSearchCV(Ridge(),params1,cv=5)
grid_searched.fit(X_trainval,Y_trainval)
print(grid_searched.best_score_)
print(grid_searched.best_params_)


0.6048887561125514
{'alpha': 100}
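
The agreement between the loop and `GridSearchCV` can also be checked value by value rather than only at the best score. The following is a minimal sketch on synthetic data (the `make_regression` data and the short `alphas` list are assumptions for illustration); it compares every mean score from a manual loop with `cv_results_["mean_test_score"]`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression problem (hypothetical stand-in for the trainval set)
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
alphas = [0.1, 1, 10, 100]

# Manual grid search: mean 5-fold score for each alpha
loop_means = np.array(
    [cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas]
)
best_from_loop = alphas[int(np.argmax(loop_means))]

# The same search through GridSearchCV
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5).fit(X, y)
grid_means = search.cv_results_["mean_test_score"]

print(np.allclose(loop_means, grid_means))
print(best_from_loop, search.best_params_)
```

Both approaches use the same unshuffled 5-fold split for a regressor, which is why the mean scores agree.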


Using the information about the best parameters, narrow down the actual best parameter value by adding values to the `params` array and recomputing the grid search.

You should have accuracy up to two decimal places. *(4 points)*

In [37]:
params2 = {'alpha': [0.01,0.05,0.1,0.5,0.6,0.7,0.8,0.9,1,1.5,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,325,30,35,40,45,50,60,70,80,90,91,92,93,94,95,96,97,98,99,100,110,125,150,175,200]}
grid_searched_best=GridSearchCV(Ridge(),params2,cv=5)
grid_searched_best.fit(X_trainval,Y_trainval)
print(grid_searched_best.best_score_)
print(grid_searched_best.best_params_)


0.60489243063517
{'alpha': 92}


Now, display the results of your final grid search as a data frame. *(2 points)*

In [38]:
pd.DataFrame(grid_searched_best.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005677 | 0.001539 | 0.002475 | 6.9e-05 | 0.01 | {'alpha': 0.01} | 0.609882 | 0.604033 | 0.594795 | 0.589386 | 0.622811 | 0.604181 | 0.011718 | 50 |
| 1 | 0.004404 | 7.1e-05 | 0.002369 | 7.9e-05 | 0.05 | {'alpha': 0.05} | 0.609882 | 0.604032 | 0.5948 | 0.589386 | 0.622811 | 0.604182 | 0.011717 | 49 |
| 2 | 0.005297 | 0.00172 | 0.002424 | 0.000128 | 0.1 | {'alpha': 0.1} | 0.609881 | 0.604032 | 0.594806 | 0.589386 | 0.62281 | 0.604183 | 0.011716 | 48 |
| 3 | 0.004544 | 0.00011 | 0.002454 | 9.6e-05 | 0.5 | {'alpha': 0.5} | 0.609877 | 0.604028 | 0.594856 | 0.589389 | 0.622803 | 0.604191 | 0.011705 | 47 |
| 4 | 0.004199 | 5.1e-05 | 0.002292 | 3.6e-05 | 0.6 | {'alpha': 0.6} | 0.609876 | 0.604027 | 0.594869 | 0.58939 | 0.622801 | 0.604193 | 0.011702 | 46 |
| 5 | 0.004152 | 6.8e-05 | 0.002246 | 8.5e-05 | 0.7 | {'alpha': 0.7} | 0.609875 | 0.604026 | 0.594881 | 0.589391 | 0.622799 | 0.604195 | 0.011699 | 45 |
| 6 | 0.004183 | 0.0001 | 0.002318 | 0.000157 | 0.8 | {'alpha': 0.8} | 0.609874 | 0.604025 | 0.594894 | 0.589391 | 0.622798 | 0.604197 | 0.011696 | 44 |
| 7 | 0.004026 | 3.1e-05 | 0.002168 | 6e-05 | 0.9 | {'alpha': 0.9} | 0.609873 | 0.604024 | 0.594906 | 0.589392 | 0.622796 | 0.604198 | 0.011693 | 43 |
| 8 | 0.003961 | 0.000116 | 0.002166 | 5.2e-05 | 1.0 | {'alpha': 1} | 0.609872 | 0.604023 | 0.594919 | 0.589393 | 0.622794 | 0.6042 | 0.011691 | 42 |
| 9 | 0.004106 | 0.000103 | 0.00216 | 3.3e-05 | 1.5 | {'alpha': 1.5} | 0.609867 | 0.604019 | 0.594981 | 0.589396 | 0.622786 | 0.60421 | 0.011677 | 41 |
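
The full `cv_results_` frame is wide, so it can help to keep only the columns needed to compare parameter values and to sort by rank. The sketch below does this on a small synthetic search (the `make_regression` data and the four-value grid are assumptions for illustration); the column names are the ones shown in the output above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Small synthetic search (hypothetical stand-in for the final grid search)
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1, 10, 100]}, cv=5).fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Keep the columns that matter for comparing alphas, best rank first
summary = (
    results[["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

The first row of the sorted summary corresponds to `best_params_` and `best_score_`.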


Let's finally use the testing set! Use the output of `GridSearchCV` to find the score on the test set. Does your model generalize well? *(3 points)*

In [39]:
grid_searched_test=GridSearchCV(Ridge(),params2,cv=5)
grid_searched_test.fit(X_trainval,Y_trainval)
print(grid_searched_test.score(X_test, Y_test))
print(grid_searched_test.best_score_)
print(grid_searched_test.best_params_)


0.595626306755606
0.60489243063517
{'alpha': 92}


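The model generalizes reasonably well: the test score (about 0.596) is close to the cross-validated score on the trainval set (about 0.605), so there is little sign of overfitting. Both scores are modest, however, which suggests that the Ridge model underfits the data rather than fails to generalize.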

# 3: Group K-Fold *(17 points)*

Many data sets come already split into a training set and a test set: data was collected first (training), and then a second collection occurred (testing) that can be used to evaluate a model. For the UCI HAR Dataset, we first need to load the inputs and outputs of the training set.

Both are stored without column titles, so we need to include `header=None` to let Python know there are no column titles.

Similarly, because the data is stored as a .txt file with values separated by whitespace, we need to tell Python how to separate the columns. The `ravel` function applied to the output values flattens the single-column table into a one-dimensional array.
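
The effect of these options can be seen on a tiny in-memory example. This is a sketch only: the two short strings stand in for the real .txt files and are made up for illustration. It uses `sep=r"\s+"`, the regular-expression way of saying "split on any run of whitespace".

```python
import io
import pandas as pd

# Made-up file contents: inputs separated by runs of spaces, one label per line
x_text = "1.0  2.0   3.0\n4.0  5.0   6.0\n"
y_text = "5\n2\n"

# header=None: the first line is data, not column titles
X_demo = pd.read_csv(io.StringIO(x_text), header=None, sep=r"\s+")

# The labels load as a 2x1 table; ravel flattens them to a 1-D array of length 2
y_frame = pd.read_csv(io.StringIO(y_text), header=None)
y_demo = y_frame.values.ravel()

print(X_demo.shape, y_frame.shape, y_demo.shape)
```

Scikit-learn estimators expect the target as a one-dimensional array, which is why the flattening step is needed.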

In [14]:
# You will need to replace "YOUR PATH" with the path to your txt files relative to where your ipynb file is saved
import pandas as pd
# sep=r'\s+' splits on any run of whitespace (replaces the deprecated delim_whitespace=True)
X_tr = pd.read_csv('drive/MyDrive/UCI HAR Dataset/UCI HAR Dataset/train/X_train.txt',header=None,sep=r'\s+')
Y_tr = pd.read_csv('drive/MyDrive/UCI HAR Dataset/UCI HAR Dataset/train/y_train.txt',header=None).values.ravel()


Recall that if we are classifying data points that are similar to one another, it would be better if we ensure that every "similar" point is put together (either in the training set or the test set).

The UCI HAR Dataset contains recordings of 30 individuals performing activities of daily living while carrying a waist-mounted smartphone with embedded inertial sensors. The classifier is tasked with classifying the human activity that was performed.

The following code provides the grouping information for each of the individuals.

In [15]:
# You will need to replace "YOUR PATH" with the path to your txt files relative to where your ipynb file is saved

group = pd.read_csv('drive/MyDrive/UCI HAR Dataset/UCI HAR Dataset/train/subject_train.txt',header=None).values.ravel()

Use a 10-Fold stratified cross-validation for a Random Forest model with the inputs `n_jobs=-1` and `n_estimators=20`. Return the average accuracy. *(2 points)*

In [16]:
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 uses all available CPU cores, as the question asks
forest_txt = RandomForestClassifier(n_jobs=-1, n_estimators=20)


In [17]:
from sklearn.model_selection import cross_val_score
import numpy as np
scored=cross_val_score(forest_txt,X_tr, Y_tr, cv=10)
txt_mean=np.mean(scored)
print(txt_mean)

0.9227465986394557


Now, use a Group 10-Fold stratified cross-validation using a Random Forest model with the same inputs as before. Return the average accuracy. *(6 points)*

In [18]:
from sklearn.model_selection import GroupKFold
gf = GroupKFold(n_splits=10)
group_score=cross_val_score(forest_txt,X_tr,Y_tr,groups=group,cv=gf)
txt_mean_group=np.mean(group_score)
print(txt_mean_group)

0.900890068176426


You may notice that the ungrouped version gave you higher accuracy than the grouped version. This is because we have overfit our data. Give an explanation of what is happening and why the first accuracy is too optimistic. *(2 points)*

In the ungrouped cross-validation, recordings from the same individual appear in both the training folds and the validation fold. The model can therefore learn each person's particular movement patterns and is then validated on more recordings from those same people, which inflates the score. That score overestimates how well the model will do on individuals it has never seen, so it is too optimistic for predicting the output of newly collected inputs.
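
The leakage can be made visible by counting, for each fold, how many individuals appear on both sides of the split. The sketch below uses a made-up grouping of 6 subjects with 10 recordings each (an assumption for illustration, not the HAR data), and the helper `shared_groups_per_fold` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# 6 hypothetical subjects with 10 recordings each, stored subject by subject
groups = np.repeat(np.arange(6), 10)
X = np.zeros((len(groups), 1))  # the feature values do not matter for splitting

def shared_groups_per_fold(splitter, use_groups):
    """Number of subjects that appear in both train and validation, per fold."""
    shared = []
    split_groups = groups if use_groups else None
    for train_idx, val_idx in splitter.split(X, groups=split_groups):
        overlap = set(groups[train_idx]) & set(groups[val_idx])
        shared.append(len(overlap))
    return shared

# Plain K-fold ignores the subjects, so some subjects straddle the split
plain = shared_groups_per_fold(KFold(n_splits=4), use_groups=False)

# Group K-fold keeps all recordings of a subject on one side of the split
grouped = shared_groups_per_fold(GroupKFold(n_splits=3), use_groups=True)

print(plain)
print(grouped)
```

Any nonzero count for the plain split means the model is validated partly on people it has already seen during training.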

Now, load in the test sets from the UCI HAR Dataset. Build a new Random Forest model training on ALL of the training set using the same inputs and then calculate the score on the test set. For this final accuracy, explain why it is closer to the group-folded average score. *(7 points)*

In [20]:
# sep=r'\s+' splits on any run of whitespace (replaces the deprecated delim_whitespace=True)
X_te = pd.read_csv('drive/MyDrive/UCI HAR Dataset/UCI HAR Dataset/test/X_test.txt',header=None,sep=r'\s+')
Y_te = pd.read_csv('drive/MyDrive/UCI HAR Dataset/UCI HAR Dataset/test/y_test.txt',header=None).values.ravel()


In [21]:
from sklearn.ensemble import RandomForestClassifier

# Same settings as before: all CPU cores, 20 trees
forest_te = RandomForestClassifier(n_jobs=-1, n_estimators=20)

In [22]:
forest_te=forest_te.fit(X_tr,Y_tr)

In [24]:
forest_te.score(X_te,Y_te)

0.9189005768578216

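The test set in the UCI HAR Dataset comes from individuals who do not appear in the training set, which is the same situation that group k-fold simulates by holding out whole individuals. The group-folded average is therefore the more appropriate estimate of test performance. In this particular run the test score (about 0.919) falls between the grouped average (about 0.901) and the ungrouped average (about 0.923) and is numerically nearer the ungrouped one; with only 20 trees and no fixed random seed, differences of this size can change from run to run.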