#### Prelude
In this homework, we will use the California Housing Prices dataset.
Your assignment is to develop a regression model to forecast housing prices, utilizing the 'median_house_value' column as the target variable. 

#### To begin, follow these EDA steps:
1. Import the dataset into your working environment.
2. Examine the distribution of the median_house_value variable, paying particular attention to whether it exhibits a long-tailed distribution.
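
One way to make the long-tail check in step 2 concrete is to compare the skewness of the raw values with the skewness of their `np.log1p` transform. The sketch below runs on synthetic log-normal data, which is an assumption standing in for median_house_value because no file is read here. The `skewness` helper is illustrative and not part of the assignment.

```python
import numpy as np


def skewness(values):
    """Sample skewness: mean of the cubed z-scores (positive = long right tail)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


# Synthetic stand-in for a price column (assumption: log-normal, right-skewed).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.6, size=10_000)

skew_raw = skewness(prices)
skew_log = skewness(np.log1p(prices))

print(f"raw skew: {skew_raw:.2f}, log1p skew: {skew_log:.2f}")
```

A clearly positive skew on the raw values that drops towards zero after the transform is the usual sign that a log transform of the target is worthwhile.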

#### Let's modify our approach to include only specific records and columns. Here's what we'll do:
1. Load the full dataset.
2. Filter the records to keep only those where ocean_proximity is either '<1H OCEAN' or 'INLAND'.
3. Keep only the following columns:

* 'latitude'
* 'longitude'
* 'housing_median_age'
* 'total_rooms'
* 'total_bedrooms'
* 'population'
* 'households'
* 'median_income'
* 'median_house_value'

In [60]:
import pandas as pd
import numpy as np 
pd.set_option('display.max_columns', None) 
pd.set_option('display.width', None) 

In [61]:
# import dataset
df = pd.read_csv("housing.csv")

In [62]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

In [63]:
df['ocean_proximity'].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

In [64]:
df = df[df['ocean_proximity'].isin(['<1H OCEAN', 'INLAND'])]
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
701,-121.97,37.64,32.0,1283.0,194.0,485.0,171.0,6.0574,431000.0,<1H OCEAN
830,-121.99,37.61,9.0,3666.0,711.0,2341.0,703.0,4.6458,217000.0,<1H OCEAN
859,-121.97,37.57,21.0,4342.0,783.0,2172.0,789.0,4.6146,247600.0,<1H OCEAN
860,-121.96,37.58,15.0,3575.0,597.0,1777.0,559.0,5.7192,283500.0,<1H OCEAN
861,-121.98,37.58,20.0,4126.0,1031.0,2079.0,975.0,3.6832,216900.0,<1H OCEAN


In [65]:
columns = [
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households", 
    "median_income",
    "median_house_value"
]

df = df[columns]
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
701,-121.97,37.64,32.0,1283.0,194.0,485.0,171.0,6.0574,431000.0
830,-121.99,37.61,9.0,3666.0,711.0,2341.0,703.0,4.6458,217000.0
859,-121.97,37.57,21.0,4342.0,783.0,2172.0,789.0,4.6146,247600.0
860,-121.96,37.58,15.0,3575.0,597.0,1777.0,559.0,5.7192,283500.0
861,-121.98,37.58,20.0,4126.0,1031.0,2079.0,975.0,3.6832,216900.0


### **Question 1**
There's one feature with missing values. What is it?

- total_rooms
- total_bedrooms
- population
- households

In [66]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        157
population              0
households              0
median_income           0
median_house_value      0
dtype: int64

#### **Answer: total_bedrooms**

### **Question 2**

What's the median (50% percentile) for variable 'population'?

- 995
- 1095
- 1195
- 1295

In [67]:
# Median (50th percentile) of the population column
df['population'].median()

1195.0

#### **Answer: 1195**

#### Data Preparation and Partitioning

1. Begin with the filtered dataset you previously created.
2. Randomize the order of the records in the dataset. Use a random seed of 42 to ensure reproducibility.
3. Divide the shuffled data into three subsets:

    - Training set: 60% of the data
    - Validation set: 20% of the data
    - Test set: 20% of the data


4. Transform the 'median_house_value' variable by applying a logarithmic function. Specifically, use the np.log1p() function for this transformation
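
The four steps above can be sketched as a small helper. This is a minimal illustration under stated assumptions: `split_60_20_20` is a hypothetical name, the toy frame holds made-up values, and the training set takes the remainder after the two 20% cuts so that no row is dropped by integer rounding.

```python
import numpy as np
import pandas as pd


def split_60_20_20(df, seed=42):
    """Shuffle rows with a fixed seed, then cut into train/validation/test frames."""
    n = len(df)
    n_val = int(0.2 * n)
    n_test = int(0.2 * n)
    n_train = n - n_val - n_test  # remainder goes to train so every row is used

    # Seeded permutation of the row positions makes the shuffle reproducible.
    idx = np.random.default_rng(seed).permutation(n)
    shuffled = df.iloc[idx].reset_index(drop=True)

    # .copy() gives independent frames that can be modified safely later on.
    train = shuffled.iloc[:n_train].copy()
    val = shuffled.iloc[n_train:n_train + n_val].copy()
    test = shuffled.iloc[n_train + n_val:].copy()
    return train, val, test


# Toy frame standing in for the filtered housing data (values are made up).
toy = pd.DataFrame({
    "median_income": np.linspace(1.0, 10.0, 11),
    "median_house_value": np.linspace(50_000.0, 500_000.0, 11),
})

train, val, test = split_60_20_20(toy, seed=42)

# Log-transform the target after splitting; np.expm1 reverses it later.
y_train = np.log1p(train["median_house_value"].values)

print(len(train), len(val), len(test))
```

Calling the helper twice with the same seed returns the same partition, which is what makes results comparable between runs.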

In [68]:
# split and prepare your dataset
# NOTE: the assignment asks for seed 42; the outputs shown below come from random_state=13
df_shuffled = df.sample(frac=1, random_state=13).reset_index(drop=True)
df_shuffled



Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-117.75,34.08,33.0,2824.0,523.0,1797.0,493.0,3.6359,135100.0
1,-118.38,34.18,40.0,2079.0,568.0,1396.0,526.0,3.0061,190800.0
2,-118.44,34.18,35.0,972.0,270.0,550.0,256.0,2.2461,215000.0
3,-117.85,33.62,13.0,5192.0,658.0,1865.0,662.0,15.0001,500001.0
4,-118.08,34.13,35.0,1897.0,279.0,733.0,291.0,7.4185,500001.0
...,...,...,...,...,...,...,...,...,...
15682,-121.29,38.59,19.0,2460.0,470.0,1346.0,480.0,3.6563,95600.0
15683,-120.48,37.30,39.0,1015.0,356.0,875.0,313.0,1.5000,67000.0
15684,-121.89,37.68,22.0,1898.0,239.0,734.0,245.0,6.2918,334100.0
15685,-122.73,38.42,26.0,1446.0,296.0,884.0,295.0,4.3523,150000.0


In [69]:
train_size = int(0.6 * len(df_shuffled))
valid_size = int(0.2 * len(df_shuffled))
test_size = int(0.2 * len(df_shuffled))
print(train_size, valid_size, test_size)

9412 3137 3137


In [70]:
train_size + valid_size + test_size

15686

In [71]:
df_shuffled.shape

(15687, 9)

In [72]:
train_df = df_shuffled.iloc[:train_size]
valid_df = df_shuffled.iloc[train_size:train_size + valid_size]
# The test set starts where the validation set ends and takes all remaining rows
test_df = df_shuffled.iloc[train_size + valid_size:]

In [73]:
train_df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-117.75,34.08,33.0,2824.0,523.0,1797.0,493.0,3.6359,135100.0
1,-118.38,34.18,40.0,2079.0,568.0,1396.0,526.0,3.0061,190800.0
2,-118.44,34.18,35.0,972.0,270.0,550.0,256.0,2.2461,215000.0
3,-117.85,33.62,13.0,5192.0,658.0,1865.0,662.0,15.0001,500001.0
4,-118.08,34.13,35.0,1897.0,279.0,733.0,291.0,7.4185,500001.0
...,...,...,...,...,...,...,...,...,...
9407,-121.96,37.31,31.0,3890.0,711.0,1898.0,717.0,5.2534,290900.0
9408,-121.06,39.25,17.0,3127.0,539.0,1390.0,520.0,3.9537,172800.0
9409,-118.49,34.25,30.0,2871.0,470.0,1335.0,458.0,5.0232,253900.0
9410,-119.02,35.34,38.0,1463.0,294.0,692.0,295.0,2.3125,65800.0


In [74]:
valid_df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
9412,-118.14,34.01,42.0,1007.0,277.0,1060.0,268.0,3.0179,153700.0
9413,-119.80,36.76,52.0,1853.0,437.0,764.0,390.0,1.6429,69200.0
9414,-118.24,34.24,31.0,3019.0,469.0,1349.0,462.0,7.1463,394100.0
9415,-117.88,33.75,10.0,1823.0,590.0,2176.0,548.0,1.5026,151800.0
9416,-121.07,37.71,39.0,223.0,37.0,92.0,37.0,3.3750,212500.0
...,...,...,...,...,...,...,...,...,...
12544,-117.33,34.04,18.0,1837.0,388.0,727.0,336.0,2.5187,116700.0
12545,-118.11,33.83,36.0,1726.0,287.0,820.0,288.0,5.5767,218100.0
12546,-117.86,33.85,17.0,1131.0,236.0,622.0,244.0,4.9306,158500.0
12547,-119.83,36.77,23.0,2168.0,503.0,1190.0,425.0,2.6250,71600.0


In [75]:
test_df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
12549,-119.44,36.59,32.0,1153.0,236.0,761.0,241.0,2.8250,67600.0
12550,-121.58,37.02,27.0,2303.0,471.0,1447.0,467.0,3.2019,203600.0
12551,-118.04,34.05,32.0,1252.0,273.0,1337.0,263.0,2.6579,156800.0
12552,-121.86,37.41,16.0,1603.0,287.0,1080.0,296.0,6.1256,266900.0
12553,-118.17,34.06,44.0,1856.0,461.0,1853.0,452.0,2.5033,131900.0
...,...,...,...,...,...,...,...,...,...
15682,-121.29,38.59,19.0,2460.0,470.0,1346.0,480.0,3.6563,95600.0
15683,-120.48,37.30,39.0,1015.0,356.0,875.0,313.0,1.5000,67000.0
15684,-121.89,37.68,22.0,1898.0,239.0,734.0,245.0,6.2918,334100.0
15685,-122.73,38.42,26.0,1446.0,296.0,884.0,295.0,4.3523,150000.0


In [104]:
# Copy the slices first so the assignments below modify independent frames
# rather than views of df_shuffled (this avoids pandas' SettingWithCopyWarning).
train_df, valid_df, test_df = train_df.copy(), valid_df.copy(), test_df.copy()
train_df['median_house_value'] = np.log1p(train_df['median_house_value'])
valid_df['median_house_value'] = np.log1p(valid_df['median_house_value'])
test_df['median_house_value'] = np.log1p(test_df['median_house_value'])

### **Question 3**

- Address the missing values in the column identified in Q1.
- Consider two approaches for handling these missing values:

    1. Replace them with 0
    2. Replace them with the mean value of the column


- Implement both approaches separately.
- For each approach:

    - Train a linear regression model without any regularization, using the code provided in the lessons.
    - Important: When calculating the mean, use only the training dataset.
- Evaluate both models using the validation dataset.
- Calculate the Root Mean Square Error (RMSE) for each model.
- Round the RMSE scores to two decimal places using the round(score, 2) function.
- Compare the RMSE scores from both approaches.
- Determine which method of handling missing values results in a better (lower) RMSE.

Options:

- With 0
- With mean
- Both are equally good
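
The requirement to compute the mean from the training data only can be illustrated on a toy example. The frames below contain made-up numbers. The point of the sketch is that the fill value is learned from the training split and then reused unchanged on the validation split, and that the result of `fillna` is assigned back, because `fillna` returns a new Series by default.

```python
import numpy as np
import pandas as pd

# Toy frames (made-up numbers) with one gap in each split.
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0]})
val = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

# Learn the fill value from the training split only; mean() skips NaN.
train_mean = train["total_bedrooms"].mean()

# fillna returns a new Series, so the result has to be assigned back.
train_filled = train.assign(total_bedrooms=train["total_bedrooms"].fillna(train_mean))
val_filled = val.assign(total_bedrooms=val["total_bedrooms"].fillna(train_mean))

print(train_mean)
print(val_filled["total_bedrooms"].tolist())
```

Using the validation mean instead would leak information from the validation set into preprocessing, which tends to make validation scores look slightly better than they should.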

In [105]:
def linear_regression(X, y):
    X = np.hstack([np.ones((X.shape[0], 1)), X])

    # beta = (X^T * X)^(-1) * X^T * y
    X_transpose = X.T
    beta = np.linalg.inv(X_transpose.dot(X)).dot(X_transpose).dot(y)

    return beta

In [112]:
def calculate_rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

In [107]:
def predict(X, beta):
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    return X.dot(beta)
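
As a sanity check of the normal-equation approach, the sketch below fits noise-free synthetic data and confirms that the known weights are recovered. It is a variation on the function above rather than the course's code: `fit_linear` is a hypothetical name, and it uses `np.linalg.solve` in place of an explicit inverse, which is generally the more numerically stable choice.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares weights (bias first), solved without forming an inverse."""
    X = np.column_stack([np.ones(len(X)), X])
    # Solve (X^T X) w = X^T y directly.
    return np.linalg.solve(X.T @ X, X.T @ y)


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

true_w = np.array([2.0, 0.5, -1.0, 3.0])  # bias followed by three feature weights
y = true_w[0] + X @ true_w[1:]            # noise-free target

w = fit_linear(X, y)
print(np.round(w, 6))
```

If the feature matrix contains NaN values, both this function and the one above return NaN weights, so missing values have to be filled before fitting.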

In [108]:
# with mean:
train_df_mean = train_df.copy()
valid_df_mean = valid_df.copy()

mean_value = train_df_mean['total_bedrooms'].mean()
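# fillna returns a new Series, so assign the result back to the column
train_df_mean['total_bedrooms'] = train_df_mean['total_bedrooms'].fillna(mean_value)
valid_df_mean['total_bedrooms'] = valid_df_mean['total_bedrooms'].fillna(mean_value)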

X_train_mean = train_df_mean.drop('median_house_value', axis=1).values
y_train_mean = train_df_mean['median_house_value'].values

X_val_mean = valid_df_mean.drop('median_house_value', axis=1).values
y_val_mean = valid_df_mean['median_house_value'].values

beta_mean = linear_regression(X_train_mean, y_train_mean)

y_pred_mean = predict(X_val_mean, beta_mean)
rmse_mean = round(calculate_rmse(y_val_mean, y_pred_mean), 2)

In [109]:
# with zero:
train_df_0 = train_df.copy()
valid_df_0 = valid_df.copy()

# fillna returns a new Series, so assign the result back to the column
train_df_0['total_bedrooms'] = train_df_0['total_bedrooms'].fillna(0)
valid_df_0['total_bedrooms'] = valid_df_0['total_bedrooms'].fillna(0)

X_train_0 = train_df_0.drop('median_house_value', axis=1).values
y_train_0 = train_df_0['median_house_value'].values

X_val_0 = valid_df_0.drop('median_house_value', axis=1).values
y_val_0 = valid_df_0['median_house_value'].values

beta_0 = linear_regression(X_train_0, y_train_0)

y_pred_0 = predict(X_val_0, beta_0)
rmse_0 = round(calculate_rmse(y_val_0, y_pred_0), 2)


In [110]:
# the answer:
if rmse_0 < rmse_mean:
    result = "With 0"
elif rmse_0 > rmse_mean:
    result = "With mean"
else:
    result = "Both are equally good"

rmse_0, rmse_mean, result

(0.33, 0.33, 'Both are equally good')
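
Because the target was transformed with np.log1p, the RMSE of 0.33 is measured on the log scale, not in dollars. The sketch below shows one rough way to read it, as a typical multiplicative error factor, and shows that `np.expm1` maps log-scale values back to prices. The example price is hypothetical, and the factor is a rule of thumb rather than an exact error bound.

```python
import numpy as np

rmse_log = 0.33  # validation RMSE reported above, on the log1p scale

# Rough reading: predictions are typically off by about this factor.
factor = float(np.exp(rmse_log))
print(round(factor, 2))

# np.expm1 is the inverse of np.log1p, so it maps log-scale values back to prices.
price = 250_000.0  # hypothetical house value
recovered = float(np.expm1(np.log1p(price)))
print(recovered)
```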

### **Question 4**

- Now we'll train a regularized linear regression model.
- Fill all NA values with 0 for this question.
- Test the following values of r: [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]
- Use RMSE to evaluate each model on the validation dataset.
- Round the RMSE scores to 2 decimal places.
- Determine which r value gives the best RMSE.

If multiple r values tie for the best RMSE, select the smallest one.

Options:

- 0
- 0.000001
- 0.001
- 0.0001

In [113]:
def reg_regression(X, y, r):
    X = np.hstack([np.ones((X.shape[0], 1)), X])

    # beta = (X^T * X + r * I)^(-1) * X^T * y
    n_features = X.shape[1]
    I = np.eye(n_features)
    X_transpose = X.T
    beta = np.linalg.inv(X_transpose.dot(X) + r * I).dot(X_transpose).dot(y)

    return beta
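
To see what the `r * I` term does, the sketch below fits synthetic data in which the second feature is almost a copy of the first, so X^T X is close to singular. `fit_ridge` is a hypothetical name. Like the function above, it also penalizes the bias term, and it uses `np.linalg.solve` in place of an explicit inverse. The data and the r values are made up for illustration.

```python
import numpy as np


def fit_ridge(X, y, r):
    """Regularized least squares: solve (X^T X + r I) w = X^T y, bias included."""
    X = np.column_stack([np.ones(len(X)), X])
    XTX = X.T @ X + r * np.eye(X.shape[1])
    return np.linalg.solve(XTX, X.T @ y)


rng = np.random.default_rng(2)
x1 = rng.normal(size=100)

# The second column is a near-duplicate of the first, so X^T X is ill-conditioned.
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

# The weight norm shrinks as the regularization strength grows.
norms = {r: float(np.linalg.norm(fit_ridge(X, y, r))) for r in [0.001, 0.1, 10]}
for r, norm in norms.items():
    print(r, round(norm, 4))
```

Larger r values pull the weights towards zero, which stabilizes the solution but can increase the error once r becomes too large.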

In [114]:
r_values = [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]
rmse_results = {}

train_df_reg = train_df.copy()
valid_df_reg = valid_df.copy()

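# fillna returns a new Series, so assign the result back to the column
train_df_reg['total_bedrooms'] = train_df_reg['total_bedrooms'].fillna(0)
valid_df_reg['total_bedrooms'] = valid_df_reg['total_bedrooms'].fillna(0)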

X_train_reg = train_df_reg.drop('median_house_value', axis=1).values
y_train_reg = train_df_reg['median_house_value'].values

X_val_reg = valid_df_reg.drop('median_house_value', axis=1).values
y_val_reg = valid_df_reg['median_house_value'].values

for r in r_values:
    beta_ridge = reg_regression(X_train_reg, y_train_reg, r)
    y_pred_ridge = predict(X_val_reg, beta_ridge)
    rmse_results[r] = calculate_rmse(y_val_reg, y_pred_ridge)

rmse_results


{0: 0.33416623731868444,
 1e-06: 0.3341662344365404,
 0.0001: 0.3341659495966846,
 0.001: 0.3341633963625798,
 0.01: 0.3341413182910901,
 0.1: 0.3341628914902619,
 1: 0.3367874989141284,
 5: 0.3393316158734458,
 10: 0.3398674800638476}

In [96]:
# Compare RMSE rounded to 2 decimals, as the question asks; ties go to the smallest r
best_r = min(rmse_results, key=lambda r: (round(rmse_results[r], 2), r))
best_r

0
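
The tie-breaking rule from the question (round to two decimals, then prefer the smallest r) can be sketched on the RMSE values printed above. The dictionary is copied from that output, and the variable names are illustrative.

```python
# Validation RMSE per r, copied from the output above.
rmse_by_r = {
    0: 0.33416623731868444,
    1e-06: 0.3341662344365404,
    0.0001: 0.3341659495966846,
    0.001: 0.3341633963625798,
    0.01: 0.3341413182910901,
    0.1: 0.3341628914902619,
    1: 0.3367874989141284,
    5: 0.3393316158734458,
    10: 0.3398674800638476,
}

# Sort key: rounded score first, then r itself, so ties go to the smallest r.
best_rounded = min(rmse_by_r, key=lambda r: (round(rmse_by_r[r], 2), r))

# Without rounding, differences in the later decimals decide the winner.
best_raw = min(rmse_by_r, key=rmse_by_r.get)

print(best_rounded, best_raw)  # prints: 0 0.01
```

After rounding, every r up to 0.1 scores 0.33, so the rule selects r = 0. The unrounded minimum sits at r = 0.01, but its advantage only appears in the fifth decimal place.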

### **Question 5**

We previously used seed 42 for data splitting. Now, let's investigate how different seeds affect our model's performance.
- Use these seed values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
- For each seed:

    - Split the data into train/validation/test sets with a 60%/20%/20% ratio
    - Replace missing values with 0
    - Train a linear regression model without regularization
    - Evaluate the model on the validation set and record the RMSE
- Calculate the standard deviation of all the RMSE scores using np.std
- Round the standard deviation to 3 decimal places using round(std, 3)

What's the value of std?
- 0.5
- 0.05
- 0.005
- 0.0005
> **Note:** The standard deviation indicates the level of variability among the values. A low standard deviation suggests that the values are clustered closely together, showing little variation. Conversely, a high standard deviation indicates that the values are spread out over a wider range, demonstrating greater differences. In the context of our model, if the standard deviation of scores is low, it implies that our model exhibits *consistency* across different data splits.
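
A small numeric sketch of what np.std computes may help here. The scores below are hypothetical and not taken from this notebook. By default `np.std` divides by n (population standard deviation), and passing `ddof=1` gives the sample version, which is slightly larger.

```python
import numpy as np

scores = [0.33, 0.34, 0.33, 0.35, 0.34]  # hypothetical RMSE values, one per seed

std_pop = float(np.std(scores))             # default ddof=0: divide by n
std_sample = float(np.std(scores, ddof=1))  # divide by n - 1

# The same population value computed by hand for comparison.
mean = sum(scores) / len(scores)
manual = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

print(round(std_pop, 3), round(std_sample, 3))
```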

In [115]:
rmses = []
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

for seed in seeds:
    df_shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    train_size = int(0.6 * len(df_shuffled))
    valid_size = int(0.2 * len(df_shuffled))

    # Copy the slices so that filling missing values works on independent frames
    train_df = df_shuffled.iloc[:train_size].copy()
    valid_df = df_shuffled.iloc[train_size:train_size + valid_size].copy()

    # fillna returns a new Series; assign it back instead of using inplace=True
    train_df['total_bedrooms'] = train_df['total_bedrooms'].fillna(0)
    valid_df['total_bedrooms'] = valid_df['total_bedrooms'].fillna(0)

    y_train = np.log1p(train_df['median_house_value'].values)
    y_val = np.log1p(valid_df['median_house_value'].values)

    X_train = train_df.drop('median_house_value', axis=1).values
    X_val = valid_df.drop('median_house_value', axis=1).values

    beta = linear_regression(X_train, y_train)

    y_pred = predict(X_val, beta)
    rmse = calculate_rmse(y_val, y_pred)

    rmses.append(rmse)

std_rmse = np.std(rmses)
std_rmse

0.005631486961146894