# Sprint 2 
## 90-803 Machine Learning Foundations with Python (Spring 2024)
### Team name:
#### Due Date: Thursday, February 15th, 2024

### Topics covered:
- Collaboration as a team through Git/GitHub

**Imports all the necessary packages we are going to use in this Lab.**

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

from matplotlib import pyplot as plt
%matplotlib inline

---

### California Housing Data

Take a minute to understand the dataset [california housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) we are loading and its features. 

In [2]:
california_housing = fetch_california_housing()

print(california_housing.DESCR)

df = pd.DataFrame(california_housing.data)
df.columns = california_housing.feature_names
df['price'] = pd.DataFrame(california_housing.target)
df.head()

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,price
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


## Train and Test Split

For this next part we are going to split our dataset into train and test. Take a look at [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from scikit-learn. Choose do a 70/30 split (70% train, 30% test) and a random state of 34.

In [3]:
y = df.price
X = df[df.columns.difference(['price'])]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=34)
X_test

Unnamed: 0,AveBedrms,AveOccup,AveRooms,HouseAge,Latitude,Longitude,MedInc,Population
6658,1.083521,2.268623,3.772009,22.0,34.15,-118.12,3.0119,1005.0
2118,0.836795,2.578635,5.017804,27.0,36.75,-119.72,3.9514,869.0
12101,1.016156,2.875850,6.726190,13.0,33.94,-117.34,5.5563,3382.0
4076,1.026217,2.202247,6.520599,33.0,34.14,-118.45,7.9625,588.0
15872,1.038306,3.661290,4.997984,52.0,37.76,-122.41,3.0774,1816.0
...,...,...,...,...,...,...,...,...
8497,0.985185,2.792593,3.456790,38.0,33.90,-118.31,3.5417,1131.0
1708,1.049479,2.460938,5.656250,38.0,37.94,-122.31,4.3958,945.0
8017,1.075697,2.776892,6.203187,36.0,33.84,-118.10,4.5417,697.0
16332,1.051821,2.900560,5.900560,20.0,38.03,-121.34,4.4063,2071.0


---

### Ridge and Lasso Regression

In scikit-learn you can compute Lasso Regression and Ridge Regression with the following:

- [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
- [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)


**Before calculating a Ridge or Lasso Model it is very important that we standarize our features. We will do this by using the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) in scikit-learn**

The Standard Scaler will remove the mean and scale to unit variance:

$z = \dfrac{(x - \mu)}{\sigma}$

As with any other class in scikit-learn you will have to instantiate the class and then apply the method `fit_transform` onto your data. Print your `X_train_standarized` to see how it looks!

In [4]:
# Scaling training features
standard_scaler = StandardScaler()
X_train_standarized = standard_scaler.fit_transform(X_train)
X_test_standarized = standard_scaler.fit_transform(X_test)
print(X_train_standarized)

[[-0.15883511 -0.08868281  0.12811849 ...  1.10428421  0.12004525
  -0.63435711]
 [-0.16433559 -0.06886675  0.59383867 ...  0.7141955   1.99537045
  -0.46769837]
 [-0.15975142  0.10450834  0.7055036  ...  1.1843024   1.27767358
   0.13156389]
 ...
 [-0.25286169 -0.16523975 -0.58578282 ...  0.85422734  0.65738182
  -0.56698443]
 [-0.03716648 -0.11470346  0.96684364 ...  0.58916706  5.87021631
   0.12890444]
 [-0.135918    0.12662334 -0.44297127 ...  0.78921256 -0.55482923
   0.87886876]]


**Code a LinearRegression model and print its corresponding Train MSE, Test MSE, Train $R^2$, and Test $R^2$. this model will serve as a baseline to compare if our Ridge and Lasso are doing better. Please use your standarised datasets!**

In [5]:
lr_model = LinearRegression()
lr_model.fit(X_train_standarized,y_train)
lr_y_pred = lr_model.predict(X_train_standarized)
lr_y_pred_test = lr_model.predict(X_test_standarized)

lr_mse_train = mean_squared_error(y_train,lr_y_pred)
lr_mse_test = mean_squared_error(y_test,lr_y_pred_test)
lr_r_sq_train = lr_model.score(X_train_standarized,y_train)
lr_r_sq_test = lr_model.score(X_test_standarized, y_test)

print("Linear Regression")
print("Train MSE: ",lr_mse_train)
print("Test MSE: ",lr_mse_test)
print("Train R^2: ",lr_r_sq_train)
print("Test R^2: ",lr_r_sq_test)

Linear Regression
Train MSE:  0.5297201176001207
Test MSE:  0.512969865770243
Train R^2:  0.6008202491907139
Test R^2:  0.6177895869466736


--> your answer here

---

**The next few cells are for you to practice making changes as a team, commiting them and resolving conflicts. The outcome of those cells does not matter as much as gaining experience as a team**


**Create a model, either Ridge or Lasso.**

In [6]:
#your code here


**Do a few more commits so that you get the hang of it with your team!**

---

**When indicated make sure to push one last commit to your remote repo. We will check your group repos for this Sprint 2**

---

### END OF Sprint 2 Jupyter Notebook!

Make sure to go back to the README and modify the indicated sections as a team.