<a href="https://colab.research.google.com/github/AalexisYU/AalexisYU.github.io/blob/master/HW4_anthony_alexis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phys 3130

# HW4: Train-test Split and Cross-Validation Lab

Author: Dr. Elaina A. Hyde

Due Date: April 02, 2025 10pm

-----

This notebook is a guided exercise containing all questions for HW4. Students should make a copy and enter their own solutions, avoiding copy-paste from online sources or Gemini. Please note that this is a partial 'AI autocomplete' assignment: you are free to use any (cited) web references. If you have AI-generated code, make SURE that you have explained and cited it, and leave in your prompt as well as your original contribution (use the hashtag `#` to include comments in your code).

### How to turn in this homework:
* **Step 1:** make a copy of this notebook in your own Google Drive and rename it to HW4_firstname_lastname.ipynb

* **Step 2:** go to your GitHub account and select the private class repo you created.

* **Step 3:** upload your .ipynb file to your private GitHub repo

* **Step 4:** your upload timestamp serves as the due-date check. Do not upload past the deadline, or you will lose 10% of the marks for every hour late.

---
Student name: [Anthony Alexis]

---
## Review of train/test validation methods

We've discussed basic models.

Make sure to review overfitting, underfitting, and how to validate the "generalizability" of your models by testing them on unseen data.

In this homework you'll practice two related validation methods:
1. **train/test split**
2. **k-fold cross-validation**

Train/test split and k-fold cross-validation both serve two useful purposes:
- We guard against overfitting by holding some of the data out of the fit, and
- We retain that held-out data to evaluate our model.

In the case of cross-validation, the model fitting and evaluation is performed multiple times on different train/test splits of the data.

Ultimately we can use the training and testing validation framework to compare multiple models on the same dataset. This could be a comparison of two linear models, or of completely different models on the same data.
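
As a rough illustration of the two validation methods described above, the sketch below scores one linear model with a single train/test split and then with 5-fold cross-validation. The data is synthetic and the names `X_demo`, `y_demo`, `split_r2`, and `cv_r2` are made up for this example; they are not part of the homework dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (an assumption for illustration): a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Train/test split: fit once on the training half, score once on the held-out half.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)
split_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# k-fold cross-validation: repeat the fit-and-score step on k different splits.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2")

print("Single split R^2:", round(split_r2, 3))
print("5-fold R^2 scores:", np.round(cv_r2, 3))
print("Mean and std of fold scores:", round(cv_r2.mean(), 3), round(cv_r2.std(), 3))
```

The single split gives one number, while cross-validation gives one score per fold, so you can also look at how much the scores vary.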


## Instructions

For your independent practice, fit **three different models** on the California housing data. For example, you could pick three different subsets of variables, one or more polynomial models, or any other model that you like.

**Start with train/test split validation:**
* Fix a testing/training split of the data
* Train each of your models on the training data
* Evaluate each of the models on the test data
* Rank the models by how well they score on the testing data set.

**Then try K-Fold cross-validation:**
* Perform k-fold cross-validation and use the cross-validation scores to compare your models. Did this change your rankings?
* Try a few different values of K for the same models.

If you're interested, try a variety of response variables. We start with **MedHouseVal**, the median house value (the `.target` attribute from the dataset load method).

In [1]:
from matplotlib import pyplot as plt

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')


In [2]:
from sklearn.datasets import fetch_california_housing
#this loads the free california housing data
#remember, why do we NOT want to use the boston set?
#https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html
california = fetch_california_housing()


In [3]:
# Convert to DataFrame for easy manipulation
columns = california.feature_names

#model features and target
X = pd.DataFrame(california.data, columns=california.feature_names)
y = california.target

# Show first few rows of the dataframe
print(X.head())

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  


### Q1. Clean up any data problems

Load the data.  Fix any problems, if applicable.

In [16]:
# California data is from SKlearn so it is reasonably clean
#read up on the columns in SKlearn documentation and enter your solution
#using code and text

# Check for blank values or duplicate rows
has_blank = X.isnull().values.any()
has_duplicates = X.duplicated().any()
if has_blank:
    print(X.isnull().sum())

if has_duplicates:
    print(X.duplicated().sum())

# Start from the full data so X_clean and y_clean exist even if nothing is removed.
X_clean = X
y_clean = y

# Apply changes, if necessary. Each step works on X_clean so both fixes can apply together.
# Indexing y with X_clean.index relies on X keeping its default 0..n-1 row labels.
if has_blank:
    X_clean = X_clean.dropna()
    y_clean = y[X_clean.index] #ai autocomplete used after I put in my if statement. There was no prompt, I pressed tab because I liked what it suggested; it looked correct
    print("There were blank rows present in this data. They have now been removed.")
else:
    print("There are no blank rows in this data.")

if has_duplicates:
    X_clean = X_clean.drop_duplicates()
    y_clean = y[X_clean.index] #ai autocomplete used above, I just re-applied the same code for duplicate rows.
    print("There were duplicate rows present in this data. They have now been removed.")
else:
    print("There are no duplicate rows in this data.")

# Print a final message if there was no cleaning done.
if not has_blank and not has_duplicates:
    print("No data cleaning was necessary.")

There are no blank rows in this data.
There are no duplicate rows in this data.
No data cleaning was necessary.


### Q2. Select 3-4 variables from the dataset and perform a 50/50 train/test split

- Use sklearn.
- Score and plot your predictions.

In [None]:
#X.columns will print your column names, as long as you have saved X correctly

In [None]:
#what does a 50/50 split look like, how about an 80/20 split?
# Hint: Split dataset into training and test sets
#X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=??, random_state=42)
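
To make the `test_size` hint concrete, here is a small sketch that splits a placeholder array two ways and prints how many rows land in each part. The arrays `data_demo` and `target_demo` are stand-ins invented for this example, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays (assumption): 20 rows, standing in for real features and target.
data_demo = np.arange(40).reshape(20, 2)
target_demo = np.arange(20)

split_sizes = {}
for test_size in (0.5, 0.2):
    # test_size is the fraction of rows held out for testing.
    X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
        data_demo, target_demo, test_size=test_size, random_state=42
    )
    split_sizes[test_size] = (len(X_train_d), len(X_test_d))
    print(f"test_size={test_size}: {len(X_train_d)} train rows, {len(X_test_d)} test rows")
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in the test set every time the cell is rerun.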

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

predictors = ['MedInc', 'HouseAge', 'AveRooms', 'Population']

#Regression models are ideal for our task, as we’re trying to predict continuous values
#(in this case, house prices). Let's use a simple Linear Regression model to start:

# Instantiate the model

# Fit the model on training data

# Predict values using the test set

# Evaluate the model

#Return the coefficient of determination of the prediction R2
#this can be done 2 ways (probably more)

#print("Mean Squared Error: ", mse)
#print("R2 Score: ", r2)
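
One possible way to fill in the steps outlined in the cell above is sketched here on synthetic data, so it runs without downloading anything. The column names `f1`, `f2`, `f3` and all `*_s` and `*_demo` variables are hypothetical; swap in your own predictors and target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): three made-up features, noisy linear target.
rng = np.random.default_rng(1)
X_syn = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])
y_syn = 2.0 * X_syn["f1"] - 1.0 * X_syn["f2"] + rng.normal(scale=0.3, size=300)

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_syn, y_syn, test_size=0.5, random_state=42
)

# Instantiate the model and fit it on the training data only.
lr_demo = LinearRegression()
lr_demo.fit(X_train_s, y_train_s)

# Predict values for the held-out test set.
y_pred_s = lr_demo.predict(X_test_s)

# Evaluate: mean squared error, plus R^2 computed two equivalent ways.
mse_demo = mean_squared_error(y_test_s, y_pred_s)
r2_from_metric = r2_score(y_test_s, y_pred_s)      # from true vs. predicted values
r2_from_model = lr_demo.score(X_test_s, y_test_s)  # model predicts internally

print("Mean Squared Error:", mse_demo)
print("R2 Score (r2_score):", r2_from_metric)
print("R2 Score (model.score):", r2_from_model)
```

For a regressor, `score` returns the coefficient of determination, so the two $R^2$ values should agree up to floating-point error.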

In [None]:
#compare predicted results to test results
#hint: use jointplot


### Q3. Try 70/30 and 90/10 split
- Score and plot.  
- How do your metrics change?
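
A compact way to compare several split ratios is to loop over `test_size` and record the metrics for each. The sketch below does this on synthetic data; `X_q3`, `y_q3`, and `split_metrics` are names invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the housing data.
rng = np.random.default_rng(2)
X_q3 = rng.normal(size=(500, 3))
y_q3 = X_q3 @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

split_metrics = {}
for test_size in (0.5, 0.3, 0.1):  # 50/50, 70/30, and 90/10 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_q3, y_q3, test_size=test_size, random_state=42
    )
    preds = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    split_metrics[test_size] = {
        "r2": r2_score(y_te, preds),
        "mse": mean_squared_error(y_te, preds),
        "n_test": len(y_te),
    }
    print(test_size, split_metrics[test_size])
```

A smaller test set leaves more data for training but makes the test score noisier, which is worth keeping in mind when you interpret the changes.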

### Q4. Try K-Folds cross-validation with K between 5 and 10 for your regression.

- What seems optimal?
- How do your scores change?
- What is the variance of the scores like?
- Try different folds to get a sense of how this impacts your score.

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

# iterate through folds 5-10

# Perform cross-validation

# Make cross-validated predictions
#print("Cross-Predicted R2:", r2)
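
The loop outlined in the cell above could look something like the following sketch, which runs on synthetic data. The names `X_cv`, `y_cv`, and `cv_summary` are assumptions for this example, and shuffling the folds with a fixed seed is a choice made here, not something the assignment requires.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic stand-in data (assumption), not the housing data.
rng = np.random.default_rng(3)
X_cv = rng.normal(size=(400, 3))
y_cv = X_cv @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=400)

cv_summary = {}
# Iterate through folds 5-10.
for k in range(5, 11):
    folds = KFold(n_splits=k, shuffle=True, random_state=42)

    # Perform cross-validation: one R^2 score per fold.
    scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=folds, scoring="r2")

    # Make cross-validated predictions: each row is predicted by a model
    # that did not see that row during fitting.
    preds = cross_val_predict(LinearRegression(), X_cv, y_cv, cv=folds)
    pooled_r2 = r2_score(y_cv, preds)

    cv_summary[k] = (scores.mean(), scores.std(), pooled_r2)
    print(f"K={k}: mean R2={scores.mean():.3f}, std={scores.std():.3f}, "
          f"cross-predicted R2={pooled_r2:.3f}")
```

The standard deviation across folds gives a quick sense of how stable the score is for each K.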


### Q5. Optimize the $R^2$ score

Can you optimize your $R^2$ by selecting the best features and validating the model using either train/test split or K-Folds?

Your code will need to iterate through the different combinations of predictors, cross-validate the current model parameterization, and determine which set of features performed best.

The number of K-folds is up to you.

> *Hint:* the `itertools` package is useful for combinations and permutations.


In [None]:
from itertools import combinations
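
As a minimal sketch of the search described in Q5, the code below cross-validates a linear model on every non-empty subset of a small synthetic feature set and keeps the best mean score. The columns `a` to `d` and the variables `X_fs`, `y_fs`, and `results` are hypothetical; only `a` and `c` actually drive the synthetic target.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): only "a" and "c" carry signal.
rng = np.random.default_rng(4)
X_fs = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
y_fs = 2.0 * X_fs["a"] + 1.0 * X_fs["c"] + rng.normal(scale=0.5, size=300)

folds = KFold(n_splits=5, shuffle=True, random_state=42)
results = []

# Try every subset size, and every combination of columns of that size.
for n_features in range(1, len(X_fs.columns) + 1):
    for subset in combinations(X_fs.columns, n_features):
        scores = cross_val_score(
            LinearRegression(), X_fs[list(subset)], y_fs, cv=folds, scoring="r2"
        )
        results.append((scores.mean(), subset))

# Pick the subset with the highest mean cross-validated R^2.
best_score, best_subset = max(results, key=lambda item: item[0])
print("Best subset:", best_subset, "mean R2:", round(best_score, 3))
```

With 4 candidate columns there are 15 subsets to try; the count grows as $2^n - 1$, so an exhaustive search like this only suits small feature sets.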


### Q6. Can you explain what could be wrong with the approach in Question 5?

### Q7. Explore another target variable and practice `patsy` formulas

Can you find another response variable in this dataset that can be predicted accurately from some combination of the remaining predictors?

**Try out using patsy to construct your target and predictor matrices from formula strings.**

*Tip: Check out pairplots, coefficients, and Pearson correlation scores.*

*Tip: For more on patsy models see: https://witve.com/codes/comprehensive-guide-to-patsy-for-building-statistical-models-in-python/*

In [None]:
# Check out variable relations
import seaborn as sns
#sns.pairplot(X) as long as X is defined correctly

In [None]:
import patsy
#df = X.copy()

# Add response to core DataFrame
#df['MedHouseVal'] = y  # new column for the original target, so MedInc is not overwritten

In [None]:
# Easily change your variable predictors without reslicing your DataFrame
#AveBedrms and AveRooms correlate
#new target variable median income MedInc
#yp, Xp = patsy.dmatrices(" MedInc ~ Population + AveRooms", data=df, return_type="dataframe")

# "unravel" y

#make your train test split


In [None]:
# Build a new model and calculate the score:
#lm = LinearRegression()

#print("R^2 Score: ", metrics.r2_score(y_test, predictions))
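
The patsy workflow outlined in the last few cells can be sketched end to end as follows. This example uses a synthetic DataFrame with made-up columns `x1`, `x2`, and `resp` rather than the housing data, and it assumes patsy is installed. The `- 1` in the formula is a choice made here to drop patsy's automatic `Intercept` column, since `LinearRegression` fits its own intercept.

```python
import numpy as np
import pandas as pd
import patsy
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in DataFrame (assumption): hypothetical columns x1, x2, resp.
rng = np.random.default_rng(5)
df_demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df_demo["resp"] = (
    1.0 + 2.0 * df_demo["x1"] - 0.5 * df_demo["x2"] + rng.normal(scale=0.3, size=200)
)

# Build target and predictor matrices from a formula string.
yp_demo, Xp_demo = patsy.dmatrices(
    "resp ~ x1 + x2 - 1", data=df_demo, return_type="dataframe"
)

# "Unravel" y: turn the one-column DataFrame into a 1-D array.
yp_flat = yp_demo.values.ravel()

# Make the train/test split.
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    Xp_demo, yp_flat, test_size=0.3, random_state=42
)

# Build a new model and calculate the score.
lm_demo = LinearRegression()
lm_demo.fit(Xp_train, yp_train)
predictions_demo = lm_demo.predict(Xp_test)
r2_patsy = r2_score(yp_test, predictions_demo)
print("R^2 Score:", r2_patsy)
```

Changing the predictors then only means editing the formula string, for example `"resp ~ x1"`, without reslicing the DataFrame by hand.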