# Assignment

### Data Story: California Housing Dataset

- **Dataset Overview**: The California Housing dataset describes 20,640 California census block groups (districts), with one row of aggregate statistics per block group. It is derived from the 1990 U.S. census.
- **Target Variable**: The target is the median house value of each block group, expressed in units of $100,000.
- **Features**:
  - **MedInc**: Median income in the block group.
  - **HouseAge**: Median age of houses in the block group.
  - **AveRooms**: Average number of rooms per household.
  - **AveBedrms**: Average number of bedrooms per household.
  - **Population**: Population of the block group.
  - **AveOccup**: Average number of household members.
  - **Latitude/Longitude**: Geographical location of the block group.
- **Preprocessing**:
  - **Feature Scaling**: The features are standardized to a mean of 0 and a standard deviation of 1, which puts them on a common scale and makes the fitted coefficients comparable.
  - **Train-Test Split**: The dataset is split into 80% training and 20% testing data so that the model is evaluated on data it was not trained on.
- **Regression Model**:
  - **Model**: A linear regression model is used to predict the median house value.
  - **Evaluation Metrics**: The model was evaluated using MSE (Mean Squared Error), MAE (Mean Absolute Error), and R² (R-squared) to assess its accuracy and performance.

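Note: The `ssl` module is imported to work around an SSL certificate error when fetching the dataset. The workaround disables HTTPS certificate verification, so it should be used only where that is acceptable.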


### Data Loading and Preprocessing

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings("ignore")

In [2]:
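import os, ssl

# Workaround for SSL certificate errors during the download:
# disable HTTPS certificate verification unless PYTHONHTTPSVERIFY is set.
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context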

In [3]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [10]:
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

Convert the dataset into a pandas DataFrame:

In [17]:
df = pd.DataFrame(housing.data, columns=housing.feature_names)

# Add the target column (median house value)
df['target'] = housing.target

In [16]:
df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [15]:
df.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


**Handle Missing Values**: Check whether any column contains missing values.

In [19]:
df.isnull().sum()

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
target        0
dtype: int64

No column contains null values, so no imputation is needed.

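**Standardization:**
- Standardization is applied to the features with `StandardScaler`, so that each feature has a mean of 0 and a standard deviation of 1. Ordinary least-squares linear regression does not strictly require this, because its predictions are unchanged when features are rescaled, but scaling makes the coefficients comparable and matters for regularized or gradient-based models.
- The raw features span very different ranges: "Population" reaches into the tens of thousands, while "AveBedrms" is typically close to 1. Without scaling, the size of a coefficient reflects the feature's units rather than its importance.
- The scaler here is fit on the full dataset before the train-test split, so the test rows influence the scaling statistics. Fitting the scaler on the training split only avoids this leakage.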
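
As a minimal sketch of what standardization computes, the snippet below applies z = (x - mean) / std by hand and compares the result with `StandardScaler`. The small matrix `X` is made up for illustration and is not taken from the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): one small-scale and one large-scale column.
X = np.array([[2.5, 300.0],
              [3.5, 1200.0],
              [8.0, 2400.0],
              [4.0, 900.0]])

# z = (x - mean) / std, computed per column.
# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1 (up to floating-point error), regardless of their original units.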

In [21]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Select the features for scaling (excluding the target)
features = df.drop(columns=['target'])

# Fit the scaler and transform the features
scaled_features = scaler.fit_transform(features)

# Convert the scaled features back to a DataFrame and add the target column
scaled_df = pd.DataFrame(scaled_features, columns=features.columns)
scaled_df['target'] = df['target']

# Display the first few rows of the scaled data
print(scaled_df.head())

     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  2.344766  0.982143  0.628559  -0.153758   -0.974429 -0.049597  1.052548   
1  2.332238 -0.607019  0.327041  -0.263336    0.861439 -0.092512  1.043185   
2  1.782699  1.856182  1.155620  -0.049016   -0.820777 -0.025843  1.038503   
3  0.932968  1.856182  0.156966  -0.049833   -0.766028 -0.050329  1.038503   
4 -0.012881  1.856182  0.344711  -0.032906   -0.759847 -0.085616  1.038503   

   Longitude  target  
0  -1.327835   4.526  
1  -1.322844   3.585  
2  -1.332827   3.521  
3  -1.337818   3.413  
4  -1.337818   3.422  
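
One possible variation, sketched here on synthetic data rather than the housing dataset, is to wrap the scaler and the model in a scikit-learn `Pipeline` so that the scaling statistics are learned from the training split only. The data, coefficients, and variable names below are assumptions made for illustration. The sketch also compares the result with a model trained on unscaled features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (hypothetical data with mixed scales).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 1000.0])
y = X @ np.array([0.5, -0.02, 0.001]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_train only, then reuses those statistics on X_test.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
pred_pipe = pipe.predict(X_test)

# Plain least squares on the raw, unscaled features, for comparison.
pred_raw = LinearRegression().fit(X_train, y_train).predict(X_test)

# Ordinary least squares is invariant to this kind of rescaling,
# so both sets of predictions should agree up to floating-point error.
print(np.allclose(pred_pipe, pred_raw))
```

For plain linear regression the choice mainly affects how the coefficients read, not the predictions. For models with regularization the two approaches would generally differ.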


## Regression Algorithms

### Linear Regression

Linear regression models the relationship between a dependent variable (the target) and independent variables (the features) by fitting a linear equation to the observed data.

**Why it suits this dataset**: It is a simple, interpretable model that serves as a reasonable baseline, and it performs well when the relationship between the features and the target is approximately linear.
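
To make "fitting a linear equation" concrete, the sketch below solves the least-squares problem directly with NumPy and compares the result with `LinearRegression`. The data is generated from a known linear rule chosen for illustration (intercept 1.5, weights 2.0 and -3.0); none of it comes from the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from a known linear rule plus a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.05, size=200)

# Least squares by hand: prepend a column of ones so the first coefficient is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

model = LinearRegression().fit(X, y)

print(beta)                           # [intercept, w1, w2]
print(model.intercept_, model.coef_)  # same values, split into intercept and weights
```

Both approaches minimize the sum of squared residuals, so they recover the same coefficients, which land close to the values used to generate the data.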

#### Steps to Implement Linear Regression:
- Import Required Libraries.
- Split the Data into training and testing sets.
- Train the Linear Regression Model on the training data.
- Evaluate the Model on the test data using appropriate metrics (e.g., Mean Squared Error or R-squared).

In [23]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [27]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(scaled_features, df['target'], test_size=0.2, random_state=42)

# Initialize and Train the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

# Make Predictions on the Test Set
y_pred = model.predict(X_test)

# Evaluate the Model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Output the evaluation results
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared (R²):", r2)


Mean Squared Error (MSE): 0.5558915986952444
Mean Absolute Error (MAE): 0.5332001304956566
R-squared (R²): 0.5757877060324508
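
Because the target is expressed in units of $100,000, these numbers can be translated into dollars. The short calculation below reuses the metric values printed above. RMSE (the square root of MSE) is not reported in the original output and is added here only as a convenience, since it is in the same units as the target.

```python
# Metric values copied from the output above (target unit: $100,000).
mse = 0.5558915986952444
mae = 0.5332001304956566
r2 = 0.5757877060324508

unit = 100_000     # dollars per target unit
rmse = mse ** 0.5  # back in target units, unlike MSE

print(f"RMSE: about ${rmse * unit:,.0f}")
print(f"MAE:  about ${mae * unit:,.0f}")
print(f"Variance explained: {r2:.1%}")
```

Read this way, the model's typical error is in the tens of thousands of dollars, and it explains a little over half of the variance in median house value.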


### Metrics:

- **MSE**: The average of the squared errors. Squaring penalizes large errors more heavily than small ones, and the result is in squared target units.
- **MAE**: The average magnitude of the errors, without squaring. It is less sensitive to outliers than MSE and is in the same units as the target.
- **R²**: The proportion of the variance in the target that the model explains. A value of 1 is a perfect fit, and a value of 0 is no better than always predicting the mean.
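
The three metrics can be worked through by hand on a tiny example. The values below are made up for illustration, and the hand-computed results are checked against the scikit-learn functions used earlier.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions (same units as the target).
y_true = np.array([2.0, 1.0, 4.0, 3.0])
y_hat = np.array([2.5, 1.0, 3.0, 3.5])

errors = y_true - y_hat                          # [-0.5, 0.0, 1.0, -0.5]

mse = np.mean(errors ** 2)                       # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
mae = np.mean(np.abs(errors))                    # (0.5 + 0 + 1 + 0.5) / 4 = 0.5

ss_res = np.sum(errors ** 2)                     # residual sum of squares = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 5.0
r2 = 1 - ss_res / ss_tot                         # 1 - 1.5 / 5.0 = 0.7

# The hand-computed values agree with scikit-learn.
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

The single error of 1.0 accounts for two thirds of the MSE but only half of the MAE, which shows how squaring amplifies large errors.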