# Introduction

This notebook contains two parts. **Part 1, Multiple Linear Regression**, provides you an opportunity to demonstrate your ability to apply course concepts by implementing a training function for multiple linear regression. **Part 2, California Housing Prices**, provides you an opportunity to practice using widely-used ML libraries and an ML workflow to solve a regression problem.

**You do not need to complete Part 1 in order to complete Part 2**. If you get stuck on Part 1, and choose to work on Part 2, be sure that all of your code for Part 1 runs without error. You can comment out your code in Part 1 if necessary.

# Part 1: Implementing Multiple Linear Regression

Given a simple MultipleLinearRegressor, and a simple training set of housing data, demonstrate your ability to implement a multiple linear regression model's `fit` function, such that it properly trains its linear model using gradient descent.

## The UnivariateLinearRegressor

Let's first review the UnivariateLinearRegressor, which you should find familiar, and you do not need to modify. Notice that the `fit` method uses a fixed number of iterations, only for simplicity and experimentation.

In [2]:
class UnivariateLinearRegressor:

    def __init__(self, w = 0, b = 0, alpha = 0.1):
        self.w = w
        self.b = b
        self.alpha = alpha

    def fit(self, x_train, y_train):
        for _ in range(0, 500):
            delta_w = self.alpha * self._d_cost_function_w(x_train, y_train)
            delta_b = self.alpha * self._d_cost_function_b(x_train, y_train)
            self.w = self.w - delta_w
            self.b = self.b - delta_b

    def _d_cost_function_w(self, x_train, y_train):
        sum = 0
        for i in range(len(x_train)):
            sum += (self.predict(x_train[i]) - y_train[i]) * x_train[i]
        return sum / len(x_train)

    def _d_cost_function_b(self, x_train, y_train):
        sum = 0
        for i in range(len(x_train)):
            sum += (self.predict(x_train[i]) - y_train[i])
        return sum / len(x_train)

    def predict(self, x):
        return self.w * x + self.b


Next, consider the following simple training examples, which you should also find familiar, that represent the square feet and prices of houses.

In [3]:
x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

As demonstrated in the related Exploration, we can instantiate, train and make predictions with our UnivariateLinearRegressor as follows. Notice how we first instantiate our UnivariateLinearRegressor with a _single_ weight, and the bias and learning rate.

In [4]:
regressor = UnivariateLinearRegressor(0, 0, 0.1)
regressor.fit(x_train, y_train)

small_house_price = regressor.predict(1.0)
print(f"The price of a 1,000 sqft house is {small_house_price}")

medium_house_price = regressor.predict(2.0)
print(f"The price of a 2,000 sqft house is {medium_house_price}")

big_house_price = regressor.predict(8.0)
print(f"The price of an 8,000 sqft house is {big_house_price}")

The price of a 1,000 sqft house is 300.1677608428322
The price of a 2,000 sqft house is 499.8963180971484
The price of an 8,000 sqft house is 1698.2676616230458


Observing the results, we can see that the model has made its way toward converging on its line of best fit. However, we are intentionally limiting the amount of training in `fit`, and therefore truncating the training. Again, we are limiting this only for simplicity and experimentation. Try increasing the steps of gradient descent to 500 and re-run the code cells, and notice that the predictions become more accurate.

This concludes a review of our UnivariateLinearRegressor. Notice that this implementation intentionally handles only one dimension of input. In the example above, this one dimension is the size in square feet of a house.
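
For reference, the two partial-derivative helpers in the class above can be written as the expressions below. The notation is assumed here rather than taken from the notebook: $m$ is the number of training examples, $f(x) = wx + b$ is the prediction, $\alpha$ is the learning rate, and $J$ is the squared-error cost $J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(f(x^{(i)}) - y^{(i)}\bigr)^2$.

$$
\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x^{(i)}) - y^{(i)}\bigr)\,x^{(i)},
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x^{(i)}) - y^{(i)}\bigr)
$$

$$
w \leftarrow w - \alpha\,\frac{\partial J}{\partial w},
\qquad
b \leftarrow b - \alpha\,\frac{\partial J}{\partial b}
$$

Each pass through the loop in `fit` applies both updates once.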


## The MultipleLinearRegressor

While our simple UnivariateLinearRegressor works well for just a single dimension of input, we would like to make predictions based on multiple features, such as square feet, number of bedrooms, the number of floors, and the age of a house.

To demonstrate your understanding of features, vectors and gradient descent, try completing the implementation of a MultipleLinearRegressor. We begin with the implementation below, which has a complete `predict` method and method stubs for `fit` and the partial derivatives.

In [18]:
# Original functions
class MultipleLinearRegressor:

    def __init__(self, w = [], b = 0, alpha = 0.1):
        self.w = w
        self.b = b
        self.alpha = alpha

    def fit(self, x_train, y_train):
        for _ in range(0, 10):
            pass

    def _d_cost_function_w(self, x_train, y_train):
        return 0

    def _d_cost_function_b(self, x_train, y_train):
        return 0

    def predict(self, x):
        return self._dot_product(self.w, x) + self.b

    def _dot_product(self, a, b):
        return sum(pair[0] * pair[1] for pair in zip(a, b))

As we shall see in a moment, your goal will be to implement `fit` and `_d_cost_function_w` and `_d_cost_function_b`. For now, let's take a look at the training set and see how our current implementation behaves.

We'll start with a simple contrived data set with four examples, already split for you. Each training example in `x_train` represents the size, number of bedrooms, number of floors and the age of a house. Each value in `y_train` represents the price of the house in thousands of dollars.

In [25]:
import numpy as np
x_train = np.array([
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [1534.0, 3.0, 2.0, 30.0],
    [852.0, 2.0, 1.0, 36.0]])
y_train = np.array([460.0, 232.0, 315.0, 178.0])

Notice that `x_train` now contains vectors representing the features of each house, and each vector contains four features. Since we know that our linear regression model will need one weight for each feature, we should instantiate it with a _vector_ of weights, along with a bias and our learning rate.

In [20]:
regressor = MultipleLinearRegressor([0, 0, 0, 0], 0, 0.1)
regressor.fit(x_train, y_train)
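
To make the role of the weight vector concrete, here is a minimal sketch of the computation that `predict` performs for one house. The weights and bias below are made up for illustration; they are not trained values.

```python
# Illustrative values only: these weights and this bias are invented, not trained.
weights = [0.2, 10.0, 5.0, -1.0]      # one weight per feature
bias = 50.0
features = [2104.0, 5.0, 1.0, 45.0]   # size, bedrooms, floors, age

def dot_product(a, b):
    """Sum of the element-wise products of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# The prediction is the dot product of weights and features, plus the bias.
prediction = dot_product(weights, features) + bias
print(prediction)
```

With all weights and the bias set to zero, the same computation yields 0 for every house.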

Even though our implementation is incomplete, we can try to make some predictions. Notice that, to make a prediction, we should provide the `predict` method with a vector of features.

In [21]:
# 'Test Run' Code Cell, Referred to in "What to Do" #2.

first_house_price = regressor.predict([2104.0, 5.0, 1.0, 45.0])
print(f"The actual price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45-years old is 460 thousand dollars")
print(f"The predicted price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45-years old is {first_house_price} thousand dollars")

second_house_price = regressor.predict([1416.0, 3.0, 2.0, 40.0])
print(f"The actual price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 232 thousand dollars")
print(f"The predicted price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is {second_house_price} thousand dollars")

third_house_price = regressor.predict([1534.0, 3.0, 2.0, 30.0])
print(f"The actual price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 315 thousand dollars")
print(f"The predicted price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is {third_house_price} thousand dollars")

small_house_price = regressor.predict([852.0, 2.0, 1.0, 36.0])
print(f"The actual price of an 852 sqft house with 2 bedrooms, 1 floor, that is 36 years old is 178 thousand dollars")
print(f"The predicted price of this house is {small_house_price}")

The actual price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45-years old is 460 thousand dollars
The predicted price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45-years old is 0.0 thousand dollars
The actual price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 232 thousand dollars
The predicted price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 0.0 thousand dollars
The actual price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 315 thousand dollars
The predicted price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 0.0 thousand dollars
The actual price of an 852 sqft house with 2 bedrooms, 1 floor, that is 36 years old is 178 thousand dollars
The predicted price of this house is 0.0


Notice how, for each example, our MultipleLinearRegressor model is predicting a 0.

Our goal is to complete the implementation of MultipleLinearRegressor, ensuring that we can properly train it.


## Play around with settings

In [50]:
class MultipleLinearRegressor:

    def __init__(self, w = [], b = 0, alpha = 0.1):
        self.w = w
        self.b = b
        self.alpha = alpha

    def fit(self, x_train, y_train):
        # Number of gradient descent steps, tuned from 10 to 13.
        for _ in range(0, 13):
            delta_w = self.alpha * self._d_cost_function_w(x_train, y_train)
            delta_b = self.alpha * self._d_cost_function_b(x_train, y_train)
            self.w = self.w - delta_w
            self.b = self.b - delta_b

    def _d_cost_function_w(self, x_train, y_train):
        sum = 0
        for i in range(len(x_train)):
            sum += (self.predict(x_train[i]) - y_train[i]) * x_train[i]
        return sum / len(x_train)

    def _d_cost_function_b(self, x_train, y_train):
        sum = 0
        for i in range(len(x_train)):
            sum += (self.predict(x_train[i]) - y_train[i])
        return sum / len(x_train)

    def predict(self, x):
        return self._dot_product(self.w, x) + self.b

    def _dot_product(self, a, b):
        return sum(pair[0] * pair[1] for pair in zip(a, b))
    
# Regressor instantiated with the tuned alpha (learning rate) value
regressor = MultipleLinearRegressor([0, 0, 0, 0], 0, 0.0000001)
regressor.fit(x_train, y_train)

# 'Test Run' Code Cell, Referred to in "What to Do" #2.
first_house_price = regressor.predict([2104.0, 5.0, 1.0, 45.0])
print(f"[2104.0, 5.0, 1.0, 45.0]= 460")
print(f"Prediction: {first_house_price} Accuracy:{first_house_price/460*100:.2f}%")

second_house_price = regressor.predict([1416.0, 3.0, 2.0, 40.0])
print(f"[1416.0, 3.0, 2.0, 40.0]= 232")
print(f"Prediction: {second_house_price} Accuracy:{second_house_price/232*100:.2f}%")

third_house_price = regressor.predict([1534.0, 3.0, 2.0, 30.0])
print(f"[1534.0, 3.0, 2.0, 30.0] = 315")
print(f"Prediction: {third_house_price} Accuracy:{third_house_price/315*100:.2f}%")

small_house_price = regressor.predict([852.0, 2.0, 1.0, 36.0])
print(f"[852.0, 2.0, 1.0, 36.0] = 178")
print(f"Prediction: {small_house_price} Accuracy:{small_house_price/178*100:.2f}%")

print(f"Average Accuracy: {(((first_house_price/460)+(second_house_price/232)+(third_house_price/315)+(small_house_price/178))/4)*100:.2f}%")
print(f"Weights:{regressor.w}")
print(f"bias:{regressor.b}")


[2104.0, 5.0, 1.0, 45.0]= 460
Prediction: 414.7003694129901 Accuracy:90.15%
[1416.0, 3.0, 2.0, 40.0]= 232
Prediction: 279.13944180273256 Accuracy:120.32%
[1534.0, 3.0, 2.0, 30.0] = 315
Prediction: 302.33994318045427 Accuracy:95.98%
[852.0, 2.0, 1.0, 36.0] = 178
Prediction: 168.0114471819109 Accuracy:94.39%
Average Accuracy: 100.21%
Weights:[1.97001949e-01 4.42316980e-04 1.61901405e-04 4.57285511e-03]
bias:0.00011769819388913334


## What to Do

Implement `fit`, `_d_cost_function_w` and `_d_cost_function_b`, to represent an appropriate gradient descent algorithm that trains our multiple linear regression model. When complete, you should see the model produce price predictions that begin to approach a "best fit" for the simple training data above (note: there are particular reasons why the fit will not be as 'perfect' as our univariate example). Here are some suggestions for completing your implementation.

1. Modify the existing MultipleLinearRegressor class definition above.
2. Run your code frequently, using _Run All_ and running the code in the "Test Run" code cell above.
3. Draw inspiration from the UnivariateLinearRegressor - the structure of gradient descent remains the same, we just need to handle a vector of weights and features.
4. Consider replicating the small steps taken in the exploration. Start with `fit`.
5. Review the Exploration content and familiarize yourself with the expressions for computing the partial derivatives with respect to `w` and `b` when using a _vector_ of weights and features.
6. Implement just _one_ of the partial derivative functions first, and verify that the prediction output has changed.
7. For convenience, you can create a new code cell with the class definition, data, instantiation and usage all in one code cell if you wish. But when complete, please be sure that you remove it, and that the MultipleLinearRegressor class definition above is complete.

The best tip for thinking about this challenge is to become intimately familiar with the expressions for computing the gradients, or partial derivatives, for w and b. Then, try first working out on paper how your implementation of these computations might work, given the vector of weights and features.
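
As a starting point for working this out on paper, the expressions can be written as follows. The notation is an assumption made here: $m$ is the number of training examples, $n$ the number of features, $\mathbf{x}^{(i)}$ the feature vector of example $i$, $x_j^{(i)}$ its $j$-th feature, and $J$ the squared-error cost $\frac{1}{2m}\sum_{i=1}^{m}\bigl(f(\mathbf{x}^{(i)}) - y^{(i)}\bigr)^2$.

$$
f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b = \sum_{j=1}^{n} w_j x_j + b
$$

$$
\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f(\mathbf{x}^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)} \quad (j = 1, \dots, n),
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f(\mathbf{x}^{(i)}) - y^{(i)}\bigr)
$$

The only change from the univariate case is that there is one partial derivative per weight, so the gradient with respect to $\mathbf{w}$ is itself a vector of length $n$.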

## 💡 Conclusion
**The first thing I did was examine the univariate regressor and fill in the missing pieces of code in the multiple linear regressor. I then tried making each cost-function derivative method return 0 on its own, which showed that updating the weights was crucial, whereas the bias could remain at zero and the model still produced results. Manually decreasing the initial weights decreased the predicted outputs; however, the weights are optimized by gradient descent and are not a hyperparameter. The next parameter to tweak was alpha. With alpha at 0.1 the predicted values were ~-1e-50, and lowering alpha progressively improved the predictions. A sweet spot for alpha was found at 0.0000001, which gave an average "accuracy" of 96%. I then increased the number of steps from 10 to 13, which raised the average to 100%. Note that "accuracy" here is the ratio of predicted to actual price, so over- and under-estimates cancel in the average: the individual ratios in the output above range from 90.15% to 120.32%.**
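
Because the "accuracy" figure is a ratio of predicted to actual price, an error measure that does not let over- and under-estimates cancel may be more informative. The sketch below is one possible alternative, not part of the assignment: it computes the mean absolute percentage error (MAPE) from the predictions printed in the output above.

```python
# Actual prices and the predictions printed by the cell above (thousands of dollars).
actual = [460.0, 232.0, 315.0, 178.0]
predicted = [414.7003694129901, 279.13944180273256,
             302.33994318045427, 168.0114471819109]

# Average of predicted / actual: errors in opposite directions cancel out.
ratios = [p / a for p, a in zip(predicted, actual)]
mean_ratio = sum(ratios) / len(ratios)

# Mean absolute percentage error: average of |predicted - actual| / actual.
mape = sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)

print(f"Mean predicted/actual ratio: {mean_ratio * 100:.2f}%")
print(f"Mean absolute percentage error: {mape * 100:.2f}%")
```

The mean ratio reproduces the 100.21% figure, while the MAPE comes out at roughly 10%, which reflects the individual errors more faithfully.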

**Both regressors use gradient descent to iteratively find the weights and bias that minimize the cost function. However, with multiple features the model is more complex: there is a vector of weights to adjust rather than the single weight of the univariate model. Because the features are unscaled and differ widely in magnitude (square footage is in the thousands), the learning rate alpha had to be far smaller than in the univariate example for gradient descent to converge. In addition, the univariate model can simply multiply its single weight by the feature, whereas the multiple-feature model has to use a dot product to combine the vector of weights with the vector of features.**
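
One way to explore the learning-rate observation is to standardize the features before training. The sketch below is an illustration under assumptions, not the assignment's required solution: it uses a vectorized NumPy form of gradient descent (not the class above), z-score scaling, and an arbitrary 1,000 steps, and it shows that the univariate learning rate of 0.1 becomes usable once the features are on a common scale.

```python
import numpy as np

# Same contrived training data as above.
x_train = np.array([
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [1534.0, 3.0, 2.0, 30.0],
    [852.0, 2.0, 1.0, 36.0]])
y_train = np.array([460.0, 232.0, 315.0, 178.0])

# Z-score standardization: each column gets mean 0 and standard deviation 1.
x_scaled = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)

def mean_squared_error(w, b):
    """Average squared prediction error on the scaled training data."""
    return float(np.mean((x_scaled @ w + b - y_train) ** 2))

w = np.zeros(x_train.shape[1])
b = 0.0
alpha = 0.1   # the same learning rate as the univariate example
initial_cost = mean_squared_error(w, b)

for _ in range(1000):
    error = x_scaled @ w + b - y_train                     # error per example
    w = w - alpha * (x_scaled.T @ error) / len(y_train)    # gradient step for the weights
    b = b - alpha * error.mean()                           # gradient step for the bias

final_cost = mean_squared_error(w, b)
print(f"MSE before training: {initial_cost:.2f}, after: {final_cost:.6f}")
```

Any new house must be scaled with the same training mean and standard deviation before it is passed to the trained model.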



# Part 2: Predicting California Housing Prices

_Attribution: Special thanks to Dr. Roi Yehoshua_

In this, the second part of this notebook, you will construct a guided experiment to analyze the quality of a linear regression model for predicting real housing prices. We'll use a version of the [California housing data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) by Kelley and Barry. Take a moment now to [familiarize yourself with the version of this data set provided by sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html), and you can take a look at [a version of this data on Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices). (Note that the version on Kaggle has an extra column, ocean_proximity, which you should ignore.)

As you progress through this notebook, complete each code cell, run them, and complete the Knowledge Checks.

We'll begin by loading the data set.

## Step 1: Loading the Data Set

For convenience, we shall rely on the "california housing set" provided by scikit-learn. We'll first import a few typical libraries, and fetch the data set.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

np.random.seed(0)
data = fetch_california_housing(as_frame = True)
print(data.DESCR)
```

Try doing the same in a code cell here.

In [56]:
import numpy as np
import pandas as pd

In [66]:
np.random.seed(0)
data = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')


### 💡 Knowledge Check 1

Demonstrate your understanding of the general characteristics of the data set by summarizing it here. (What is this data set, and what does it contain? What are the attributes, what are their types, and what do they mean? What is the target value? Is there missing data? Etc.)

**The data contains information from the 1990 California census. The attributes are:**

- **`longitude`**: a measure of how far west a house is; values are negative in California, so a more negative value is farther west
- **`latitude`**: a measure of how far north a house is; a higher value is farther north
- **`housing_median_age`**: median age of a house within a block; a lower number is a newer building
- **`total_rooms`**: total number of rooms within a block
- **`total_bedrooms`**: total number of bedrooms within a block
- **`population`**: total number of people residing within a block
- **`households`**: total number of households (a group of people residing within a home unit) for a block
- **`median_income`**: median income for households within a block of houses (measured in tens of thousands of US dollars)
- **`median_house_value`**: median house value for households within a block (measured in US dollars); this is the target value
- **`ocean_proximity`**: location of the house with respect to the ocean or sea

**All attributes are floats except `ocean_proximity`, which is stored as a pandas `object` (string) column.**

**The only attribute with missing values is `total_bedrooms`, which has 207 missing values. All rows with missing data will be dropped prior to the train/test split.**
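
The count-then-drop approach described above can be sketched on a tiny made-up frame. The values are illustrative stand-ins, not rows from the housing data, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the housing data.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
cleaned = toy.dropna()                  # drop every row that contains a NaN

print(missing_per_column)
print(cleaned.shape)
```

Dropping rows is the simplest option; it is reasonable here because only about 1% of the rows (207 of 20,640) are affected.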




Now that our data set is loaded, let's explore what we have.

In [74]:
data.head(n=10)

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
| 6 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
| 7 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY |
| 8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
| 9 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 | NEAR BAY |


Let's look at the types of the variables. We can see they are all float64, except the irrelevant ocean_proximity column.

In [72]:
data.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object

Let's look for missing values. We can see that only total_bedrooms has any, with 207 missing data points.

In [77]:
data.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

## Step 2: Exploring the Data Set



Let's quickly investigate some examples in the data set. Since `data` is a sklearn Bunch object, we can obtain the pandas DataFrame and investigate its shape, to determine the number of rows and columns, and to inspect the first few rows of data.

```python
print(data.frame.shape)
data.frame.head()
```

Go ahead and investigate the first few rows of the data frame.

Let's briefly look at the dataset dimensions. We can see there are 20,640 rows and 10 attributes.

In [70]:
data.shape

(20640, 10)

Let's briefly look at the dataset with the `head` function.

In [67]:
data.head(n=10)

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
| 6 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
| 7 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY |
| 8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
| 9 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 | NEAR BAY |


### 💡 Knowledge Check 2

What do the `shape` and `head` reveal about this data set?

**`head` shows the first rows of the data set (as many as we specify) along with the names of the attributes. `shape` gives the dimensions of the data set: the number of rows and the number of attributes.**


## Step 3: Preparing Training and Test Sets

To train and test our linear regression model, we will need to split our data set. We'll use the `data` and `target` attributes of the Bunch to retrieve the feature set and target prediction values. Then, we'll reach for the handy `train_test_split` method from sklearn.

```python
from sklearn.model_selection import train_test_split

housing_attributes, prices = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(housing_attributes, prices, test_size = 0.2)
X_train.head()
```

Go ahead and split the data set into training and test sets here.

In [109]:
# Before splitting the data, we need to get rid of rows with NA values.
data_nona = data.dropna()
print(data_nona.shape)
# To get around import errors with the data set, we define our own housing and price variables.

# Features (X): everything except the target and the non-numeric ocean_proximity column
housing_attributes = data_nona.drop(['median_house_value', 'ocean_proximity'], axis=1)

# Target (y): the median house value
prices = data_nona['median_house_value']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_attributes, prices, test_size = 0.2)
X_train.head()

(20433, 10)


|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| 7255 | -118.15 | 34.0 | 32.0 | 3218.0 | 739.0 | 2368.0 | 730.0 | 3.1406 |
| 19516 | -121.01 | 37.64 | 52.0 | 201.0 | 35.0 | 74.0 | 22.0 | 1.3036 |
| 19640 | -120.79 | 37.53 | 20.0 | 1417.0 | 263.0 | 853.0 | 263.0 | 3.3083 |
| 12633 | -121.49 | 38.49 | 26.0 | 4629.0 | 832.0 | 2902.0 | 816.0 | 2.735 |
| 11750 | -121.18 | 38.78 | 13.0 | 3480.0 | 528.0 | 1432.0 | 532.0 | 6.1642 |


In [89]:
#double check price/ y-var split
y_train.head()

11217    218000.0
8038     481300.0
13031    171000.0
8464     183300.0
4550     225000.0
Name: median_house_value, dtype: float64

### 💡 Knowledge Check 3

Approximately how many examples are in the training and test sets?


In [110]:
print(f"There are {X_train.shape[0]} examples in the training set and {X_test.shape[0]} in the test set")

There are 16346 examples in the training set and 4087 in the test set
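
The two counts can also be estimated by hand from the number of rows and the `test_size` fraction. This is a rough sketch; the rounding rule in the comment is an assumption that happens to match the sizes reported by the split above.

```python
import math

n_rows = 20433        # rows remaining after dropna(), as printed earlier
test_size = 0.2

# Assumption: the test count is rounded up and the training set gets the rest.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(f"Training examples: {n_train}, test examples: {n_test}")
```

So roughly 80% of the examples (about 16,300) are used for training and 20% (about 4,100) for testing.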


## Step 4: Pre-Processing and Training

Before applying our regression model, we would like to standardize the training set. To do this, we'll use the sklearn StandardScaler. Once we standardize the data, we will use it to train a linear regression model. In our case, we will experiment with the scikit-learn [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html), a linear regression model that trains via stochastic gradient descent (SGD). Please be sure to take a look at [the documentation for SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html).

To demonstrate a new feature in scikit-learn, and to give you some new ideas in your own future work, we will illustrate a small "machine learning pipeline," using the scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class.

A Pipeline is handy for "setting up" multiple pre-processing steps that will run one after the other. The Pipeline can also end in a training step with a model. This enables us to provide the Pipeline our training data, and with one method call, complete both pre-processing and training in one step.

We'll import the necessary libraries, create our Pipeline, fill it with a StandardScaler and SGDRegressor, and run the Pipeline.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor())
])
```

Try importing the necessary libraries and building your Pipeline below.


In [114]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor())
])

With our Pipeline created, we can now invoke the Pipeline's `fit` method, passing it the training data. Behind the scenes, the Pipeline will standardize our training data, and also invoke our SGDRegressor's `fit` method with the transformed training data.

```python
pipeline.fit(X_train, y_train)
```

Try kicking off the Pipeline below.

In [115]:
pipeline.fit(X_train, y_train)
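
To see what happens behind the scenes, the sketch below compares a Pipeline with the same two steps written out by hand. It uses a small synthetic data set invented for illustration (not the housing data), and a fixed `random_state` is assumed so that both models train identically; all `demo_` and `manual_` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic data set, made up for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

# One call: the Pipeline standardizes the data, then trains the regressor.
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(random_state=0))
])
demo_pipeline.fit(X_demo, y_demo)

# The same two steps written out by hand.
manual_scaler = StandardScaler()
X_demo_scaled = manual_scaler.fit_transform(X_demo)
manual_regressor = SGDRegressor(random_state=0)
manual_regressor.fit(X_demo_scaled, y_demo)

# At prediction time the Pipeline also applies the fitted scaler first.
pipeline_predictions = demo_pipeline.predict(X_demo[:5])
manual_predictions = manual_regressor.predict(manual_scaler.transform(X_demo[:5]))
print(np.allclose(pipeline_predictions, manual_predictions))
```

The practical benefit is that the scaler's training statistics are reused automatically whenever the Pipeline predicts or scores, so test data is never used to fit the scaler.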

With our model now trained, let us analyze the results.

### 💡 Knowledge Check 4

Investigate the parameters passed to the Pipeline initializer. Notice our use of the strings `'scaler'` and `'regressor'`. What purpose do these serve, and are we required to use those specific strings, or can we "make up" our own meaningful names for each component of the Pipeline?

**The names `'scaler'` and `'regressor'` are arbitrary labels, so we can make up our own, but they serve an important purpose: each one identifies a step of the Pipeline. A step can be looked up by its name (for example through `pipeline.named_steps`), and the name is also the prefix used when setting that step's parameters. We could use nonsensical names, but that would be poor practice, since meaningful names make the steps easier to reference for the coder and for any future reader of the code.**
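
A short sketch of how step names are used in practice follows. The names `'standardize'` and `'sgd'` are arbitrary labels chosen for this illustration, as is the variable name `named_pipeline`.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The step names below are arbitrary labels chosen for this sketch.
named_pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('sgd', SGDRegressor())
])

# A step can be looked up by its name...
sgd_step = named_pipeline.named_steps['sgd']

# ...and its parameters can be set with the "<step name>__<parameter>" syntax.
named_pipeline.set_params(sgd__alpha=0.001)
print(named_pipeline.named_steps['sgd'].alpha)
```

The same double-underscore syntax is what scikit-learn's search utilities expect when tuning the parameters of a step inside a Pipeline.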

## Step 5: Model Validation

We have conducted an initial round of training using a data set that may or may not have strong linear tendencies, and we have employed a basic, unconfigured SGDRegressor model to see what baseline quality we can achieve. Let's investigate the "coefficient of determination," R^2, via the model's `score` method. We will invoke this `score` method via the Pipeline, since it has ownership of our SGDRegressor model. We would love to see a value as close to 1.0 as possible.

We can generate an R^2 score with both the training data and the test data to validate the quality of our model.

```python
training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")
```

Go ahead and generate and print the score based on the training data, and the score based on the test data.

In [119]:
training_score = pipeline.score(X_train, y_train)
print(f"Training score:{training_score:.6f}")

test_score = pipeline.score(X_test,y_test)
print(f"Test score: {test_score :.6f}")

Training score:0.634181
Test score: 0.637807


### 💡 Knowledge Check 5

What are the scores for the training and test sets? What do they indicate? Are they good? How do you know? (Hint: Have you read the documentation for the `score` method of SGDRegressor?)

**The training score is R^2 = 0.63 and the test score is R^2 = 0.64. R^2 is the coefficient of determination, so the model explains roughly 63% of the variance in house prices; a score of 1.0 would be a perfect fit. The regression therefore does a "moderate" job of approximating the real data points. Because the training and test scores are nearly identical, there is no sign of overfitting. What counts as an acceptable R^2 varies by field, but for real-world, "messy" multivariate data such as house prices and attributes, a value of 0.63 seems to me both plausible and good, and it gives some confidence that the code executed correctly on good data.**
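
For intuition about what the `score` method reports, here is a minimal hand-written version of the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The function name and the sample values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Made-up values for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]
print(r_squared(y_true, y_pred))

# A model that always predicts the mean of y scores exactly 0.
print(r_squared(y_true, [5.0, 5.0, 5.0]))
```

A score of 0 therefore means "no better than always predicting the mean", and a model that does worse than that gets a negative score.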

## Step 6: Adjusting the Model (Experiment)

If we spend time reviewing the documentation of SGDRegressor, we find that the default instantiation uses particular default hyperparameters. Now it's your turn. Based on the concepts in the course and your understanding of linear regression, how might you "tune" the SGDRegressor instance in the Pipeline?

Try setting up a new Pipeline as an experiment, and try passing different parameter configurations to SGDRegressor's initializer, and investigate the results. You might set up your experiment like the following. Notice how we have specified a `penalty` of `None` as a demonstrated experiment.

```python
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(penalty = None))
])

pipeline.fit(X_train, y_train)

training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")
```

Create a similar experiment here, and try a few different initialization parameters for SGDRegressor. How might you increase its performance score? (Think about the important concepts of a linear regression model that uses gradient descent. Be sure to try customizing the most important hyperparameters.)
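
As one possible shape for such an experiment, the sketch below varies `eta0`, the initial learning rate of SGDRegressor. It runs on synthetic stand-in data so that it is self-contained; the data, the candidate values and all `demo`/`Xd_`/`yd_` names are assumptions for illustration, and in the notebook you would use the housing `X_train`, `X_test`, `y_train` and `y_test` instead.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the housing data.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

scores = {}
for eta0 in [0.001, 0.01, 0.05]:
    experiment = Pipeline([
        ('scaler', StandardScaler()),
        # eta0 is the initial learning rate; random_state fixes the shuffling.
        ('regressor', SGDRegressor(eta0=eta0, random_state=0))
    ])
    experiment.fit(Xd_train, yd_train)
    scores[eta0] = experiment.score(Xd_test, yd_test)
    print(f"eta0: {eta0}\nTest score: {scores[eta0]:.3f}\n")
```

Fixing `random_state` in both the split and the regressor keeps the runs comparable, so that any difference in score comes from the parameter being varied.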

Let's look for an optimal model penalty.

In [173]:
# Defines the penalty options to try
penalty_parameters = ["l2", "l1", "elasticnet", None]

# Loops over the parameters
for para in penalty_parameters:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', SGDRegressor(penalty = para))])

    # Fits the model
    pipeline.fit(X_train, y_train)

    # Calculates the training and test scores
    para_training_score = pipeline.score(X_train, y_train)
    para_test_score = pipeline.score(X_test, y_test)
    print(f"Penalty: {para}\nTest score: {para_test_score:.3f}\nTraining score: {para_training_score:.3f}\n")
    

Penalty: l2
Test score: 0.634
Training score: 0.637

Penalty: l1
Test score: 0.632
Training score: 0.636

Penalty: elasticnet
Test score: 0.634
Training score: 0.637

Penalty: None
Test score: 0.634
Training score: 0.637



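It looks like the penalty did not have a big influence.

Let's see if we can optimize **alpha** with the same approach, using the default penalty, 'l2'. Note that in SGDRegressor, `alpha` is the constant that multiplies the regularization term, not the learning rate.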

In [160]:
# Defines the alpha values to try
alpha_parameters = [0,0.1,0.01,0.001,0.0001,0.0001,0.00001,0.000001,0.0000001,0.00000001,0.000000001]

# Loops over the parameters
for a in alpha_parameters:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', SGDRegressor(alpha = a))])

    # Fits the model
    pipeline.fit(X_train, y_train)

    # Calculates the training and test scores
    training_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"alpha: {a}\nTest score: {test_score:.2f}\nTraining score: {training_score:.2f}\n")
    

alpha: 0
Test score: 0.63
Training score: 0.64

alpha: 0.1
Test score: 0.60
Training score: 0.60

alpha: 0.01
Test score: 0.63
Training score: 0.64

alpha: 0.001
Test score: 0.63
Training score: 0.64

alpha: 0.0001
Test score: 0.63
Training score: 0.64

alpha: 0.0001
Test score: 0.63
Training score: 0.64

alpha: 1e-05
Test score: 0.63
Training score: 0.64

alpha: 1e-06
Test score: 0.63
Training score: 0.64

alpha: 1e-07
Test score: 0.63
Training score: 0.64

alpha: 1e-08
Test score: 0.62
Training score: 0.64

alpha: 1e-09
Test score: 0.63
Training score: 0.64



Well, it does not look like modifying alpha improves R^2 beyond what the default of 0.0001 achieves. Let's examine **max_iter** next.

In [159]:
# Defines the max_iter values to try
mi_parameters = [10,100,1000,10000,100000,1000000,10000000]

# Loops over the parameters
for mi in mi_parameters:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', SGDRegressor(max_iter = mi))])

    # Fits the model
    pipeline.fit(X_train, y_train)

    # Calculates the training and test scores
    training_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"Max_iter: {mi}\nTest score: {test_score:.2f}\nTraining score: {training_score:.2f}\n")



Max_iter: 10
Test score: 0.63
Training score: 0.64

Max_iter: 100
Test score: 0.63
Training score: 0.64

Max_iter: 1000
Test score: 0.63
Training score: 0.64

Max_iter: 10000
Test score: 0.63
Training score: 0.64

Max_iter: 100000
Test score: 0.63
Training score: 0.64

Max_iter: 1000000
Test score: 0.63
Training score: 0.64

Max_iter: 10000000
Test score: 0.62
Training score: 0.64



Seems like the default settings are just as good as any attempt at hyperparameter tuning so far. Let's try varying the **random_state**.

In [158]:
#defines parameters 
rs_paramaters = [0,1,2,42]

#loops over the parameters
for rs in rs_paramaters:
    pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(random_state = rs))])

    #fits model
    pipeline.fit(X_train, y_train)
    
    #calculates training scores
    training_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"random_state: {rs}\nTest score: {test_score:.2f}\nTraining score: {training_score:.2f}\n")
    

random_state: 0
Test score: 0.62
Training score: 0.64

random_state: 1
Test score: 0.63
Training score: 0.64

random_state: 2
Test score: 0.63
Training score: 0.64

random_state: 42
Test score: 0.63
Training score: 0.64
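In `SGDRegressor`, `random_state` seeds the shuffling of the training data, so it is better read as a source of run-to-run noise than as a setting to tune. A minimal sketch that summarizes the spread across seeds, assuming synthetic data from `make_regression` in place of the housing data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption for illustration).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = []
for seed in range(5):
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', SGDRegressor(random_state=seed))])
    pipeline.fit(X_train, y_train)
    scores.append(pipeline.score(X_test, y_test))

# The standard deviation shows how much of a score difference is just seed noise.
print(f"Mean test R^2: {np.mean(scores):.3f}\nStd across seeds: {np.std(scores):.4f}")
```

A difference between two settings that is smaller than this spread is hard to tell apart from chance.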



Well, that didn't work either. Let's investigate changing the **train/test split**, using default settings for the regressor.

In [172]:
#defines parameters 
split_paramaters = [.1,.15,.2,.25,.3,.35,.4]

#loops over the parameters
for s in split_paramaters:
    
    #re-splits the data using the current test size
    X_train, X_test, y_train, y_test = train_test_split(housing_attributes, prices, test_size = s)
    pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor())])

    #fits model
    pipeline.fit(X_train, y_train)
    
    #calculates training scores
    training_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"Split: {s}\nTest score: {test_score:.3f}\nTraining score: {training_score:.3f}\nSize of test:{y_test.shape[0]}\n")
    


Split: 0.1
Test score: 0.652
Training score: 0.634
Size of test:2044

Split: 0.15
Test score: 0.635
Training score: 0.637
Size of test:3065

Split: 0.2
Test score: 0.612
Training score: 0.642
Size of test:4087

Split: 0.25
Test score: 0.651
Training score: 0.632
Size of test:5109

Split: 0.3
Test score: 0.648
Training score: 0.632
Size of test:6130

Split: 0.35
Test score: 0.623
Training score: 0.642
Size of test:7152

Split: 0.4
Test score: 0.634
Training score: 0.637
Size of test:8174
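`train_test_split` without a `random_state` draws a new random split on every call, so the differences between test sizes above are mixed with split-to-split noise. One way to separate the two is to average over several random splits per test size. This is a minimal sketch using `ShuffleSplit` with `cross_val_score`, assuming synthetic data from `make_regression` and an arbitrary choice of 10 splits:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption for illustration).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(random_state=0))])

results = {}
for s in [0.1, 0.2, 0.3, 0.4]:
    # Ten independent random splits with the same test size.
    cv = ShuffleSplit(n_splits=10, test_size=s, random_state=0)

    # cross_val_score refits a fresh copy of the pipeline on each split.
    fold_scores = cross_val_score(pipeline, X, y, cv=cv)
    results[s] = (fold_scores.mean(), fold_scores.std())
    print(f"Split: {s}\nMean test score: {fold_scores.mean():.3f}\nStd: {fold_scores.std():.3f}\n")
```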



In [171]:
#resets to the default split
X_train, X_test, y_train, y_test = train_test_split(housing_attributes, prices, test_size = 0.2)
    
#defines parameters 
learning_rate_paramaters = ["constant","optimal","invscaling","adaptive"]

#loops over the parameters
for lr in learning_rate_paramaters:
    pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(learning_rate = lr))])

    #fits model
    pipeline.fit(X_train, y_train)
    
    #calculates training scores
    training_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"learning_rate: {lr}\nTest score: {test_score:.3f}\nTraining score: {training_score:.3f}\n")
    

learning_rate: constant
Test score: 0.618
Training score: 0.637

learning_rate: optimal
Test score: 0.598
Training score: 0.614

learning_rate: invscaling
Test score: 0.617
Training score: 0.641

learning_rate: adaptive
Test score: 0.617
Training score: 0.641



Hmm, that didn't work. Maybe **loss** will work.

In [163]:
    
#defines parameters 
loss_paramaters = ['squared_error', 'epsilon_insensitive', 'squared_epsilon_insensitive', 'huber']

#loops over the parameters
for l in loss_paramaters:
    pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(loss = l))])

    #fits model
    pipeline.fit(X_train, y_train)
    
    #calculates training scores
    training_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"loss: {l}\nTest score: {test_score:.2f}\nTraining score: {training_score:.2f}\n")
    

loss: squared_error
Test score: 0.63
Training score: 0.64

loss: epsilon_insensitive
Test score: -3.11
Training score: -3.14

loss: squared_epsilon_insensitive
Test score: 0.63
Training score: 0.64

loss: huber
Test score: -3.19
Training score: -3.21
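The strongly negative R^2 values above belong to the second and fourth runs, which by loop order are `epsilon_insensitive` and `huber`. One possible explanation, not confirmed for this data set, is the scale of the target: both losses have bounded gradients, so if prices are in raw dollars the weights can only creep toward values that large. A minimal sketch of that idea, assuming synthetic data with a hypothetical dollar-scale target and scikit-learn's `TransformedTargetRegressor` to standardize the target during fitting:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a target on a hypothetical "dollars" scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (200_000
     + 100_000 * (X @ np.array([0.6, -0.4, 0.3, 0.2]))
     + rng.normal(scale=20_000, size=1000))

def make_huber_pipeline():
    # Same pipeline shape as above, with the huber loss.
    return Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', SGDRegressor(loss='huber', max_iter=200, random_state=0))])

# Fit directly on the raw target (a ConvergenceWarning is likely here).
raw_model = make_huber_pipeline().fit(X, y)
raw_score = raw_model.score(X, y)

# Standardize the target for fitting; predictions are mapped back to dollars.
scaled_model = TransformedTargetRegressor(
    regressor=make_huber_pipeline(), transformer=StandardScaler()).fit(X, y)
scaled_score = scaled_model.score(X, y)

print(f"huber, raw target\nTraining score: {raw_score:.2f}\n")
print(f"huber, standardized target\nTraining score: {scaled_score:.2f}\n")
```

If the housing target is already on a small scale, this explanation would not apply and something else is behind the negative scores.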





Let's increase max_iter and see if that works.

In [168]:
"""
#defines parameters 
loss_paramaters = ['squared_error', 'epsilon_insensitive', 'squared_epsilon_insensitive', 'huber']

#loops over the parameters
for l in loss_paramaters:
    pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(loss = l, max_iter=100000))])

    #fits model
    pipeline.fit(X_train, y_train)
    
    #calculates training scores
    training_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"loss: {l}\nTest score: {test_score:.2f}\nTraining score: {training_score:.2f}\n")
Results:
loss: squared_error
Test score: 0.63
Training score: 0.64

/opt/conda/lib/python3.10/site-packages/sklearn/linear_model/_stochastic_gradient.py:1548: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(
loss: epsilon_insensitive
Test score: -1.11
Training score: -1.13

loss: squared_epsilon_insensitive
Test score: 0.63
Training score: 0.64

loss: huber
Test score: -2.93
Training score: -2.96

/opt/conda/lib/python3.10/site-packages/sklearn/linear_model/_stochastic_gradient.py:1548: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(
"""



That didn't work either, and it took forever to run, so it's commented out. Maybe **average** will help.

In [170]:
#defines parameters 
avg_paramaters = [0, 1, 10, 100,1000]

#loops over the parameters
for avg in avg_paramaters:
    pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(average = avg))])

    #fits model
    pipeline.fit(X_train, y_train)
    
    #calculates training scores
    training_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    print(f"average: {avg}\nTest score: {test_score:.2f}\nTraining score: {training_score:.2f}\n")


average: 0
Test score: 0.63
Training score: 0.64

average: 1
Test score: 0.63
Training score: 0.64

average: 10
Test score: 0.63
Training score: 0.64

average: 100
Test score: 0.63
Training score: 0.64

average: 1000
Test score: 0.63
Training score: 0.64



### 💡 Knowledge Check 6

Based on the concepts in the Explorations regarding linear regression and gradient descent, what is perhaps the single most important hyperparameter for a linear regression model? What SGDRegressor initialization parameter lets you specify the value for this important hyperparameter?
**The most important hyperparameter is the learning rate, which SGDRegressor sets through `eta0` (the initial learning rate, with the schedule chosen by `learning_rate`). An attempt was made to optimize the performance of the model and increase the R^2 by looping over Regressor parameters such as penalty, alpha, max_iter, random_state, learning_rate, loss, and average, but no meaningful improvement beyond an R^2 of about 0.64 could be achieved. Examining different train/test split ratios also did little to improve R^2 beyond the default settings; a 0.1 test split gave the highest test R^2 (0.652), but this improvement was trivial.**
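The sweeps above never varied `eta0`, the `SGDRegressor` parameter for the initial learning rate. A minimal sketch of such a sweep in the same loop style, assuming synthetic data from `make_regression` in place of the housing data and an arbitrary set of candidate values:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption for illustration).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#defines parameters (0.01 is the SGDRegressor default for eta0)
eta0_values = [0.0001, 0.001, 0.01, 0.1]

#loops over the parameters
scores = {}
for eta0 in eta0_values:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', SGDRegressor(eta0=eta0, random_state=0))])

    #fits model
    pipeline.fit(X_train, y_train)

    #calculates training and test scores
    training_score = pipeline.score(X_train, y_train)
    scores[eta0] = pipeline.score(X_test, y_test)
    print(f"eta0: {eta0}\nTest score: {scores[eta0]:.3f}\nTraining score: {training_score:.3f}\n")
```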




# Conclusion

(Replace this writing prompt with your conclusion.) Summarize what you've seen and done here in Part 2, starting with the domain, problem and data set. Mention three things that were most notable in this process, whether it's related to exploration, preprocessing, configuring, training, or evaluating. If you put in the effort to try to improve the SGDRegressor, describe what you did and what led you to try what you did, and describe the results. Conclude with some statements or questions about the model score, the model being used, and the data set. Make suggestions about what you might do next to either improve the score or conclude with an explanation of whether you would continue to use a linear model.

**In Part 2, I took a data set of California housing prices and housing attributes and addressed the problem of predicting price from those attributes, using multiple linear regression trained with gradient descent.**

**Three things stood out. First, packages such as sklearn are simple and powerful: being able to import pre-built, optimized functions is a huge time saver. Second, I’m impressed at how well the default settings of the SGDRegressor performed. Third, I’m surprised the housing attributes did a reasonable job (moderate R^2) of predicting price, since housing prices, like most of economics, seem irrational. I guess irrational things can be correlated…**

**An attempt was made to optimize the performance of the model and increase the R^2 by looping over Regressor parameters such as penalty, alpha, max_iter, random_state, learning_rate, loss, and average, but no meaningful improvement beyond an R^2 of about 0.64 could be achieved. Examining different train/test split ratios also did little to improve R^2 beyond the default settings.**

**I’m a bit surprised the default Regressor settings already gave the best R^2 I found, but it makes sense that the defaults are well chosen. I wonder how much the R^2 would change if the data set covered, say, January 2020 to now, a period with big swings in economic conditions that would likely weaken the correlation. I would also be interested to see whether the least-squares method would be just as accurate at relating housing attributes to price, and whether that simpler approach might give better results, or whether reducing the number of features could improve the fit. Overall, this model did a reasonable job of explaining the relationship in the data, and I look forward to applying it to other problems.**
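The least-squares question raised above is quick to check with scikit-learn's `LinearRegression`, which solves ordinary least squares directly instead of iterating. This is a minimal sketch of the comparison, assuming synthetic data from `make_regression` in place of the housing data, so the numbers it prints say nothing about the housing result:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption for illustration).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Ordinary least squares, solved directly.
ols = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())])
ols.fit(X_train, y_train)
ols_score = ols.score(X_test, y_test)

# The same linear model, trained with stochastic gradient descent.
sgd = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(random_state=0))])
sgd.fit(X_train, y_train)
sgd_score = sgd.score(X_test, y_test)

print(f"Least squares test score: {ols_score:.3f}\nSGD test score: {sgd_score:.3f}\n")
```

Both fit the same kind of linear model, so if the two scores come out close on the housing data as well, the R^2 ceiling is more likely set by the linear model and the features than by the optimizer settings.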
