### Regression Assignment
#### Objective:
##### The objective of this assignment is to evaluate your understanding of regression techniques in supervised learning by applying them to a real-world dataset.
##### Dataset: Use the California Housing dataset available in the sklearn library. This dataset contains housing-related features for districts in California and the median house value of each district.


### 1. Loading and Preprocessing

In [12]:
# Load the California Housing dataset with fetch_california_housing() from sklearn.datasets,
# then convert it into a pandas DataFrame for easier manipulation.

# Let's import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset; keep the features in a DataFrame and the target in a Series
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

In [14]:
# Checking for missing values
missing = X.isnull().sum()
missing

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

In [None]:
#No missing values in the dataset

In [16]:
# Let's standardize the features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled

array([[ 2.34476576,  0.98214266,  0.62855945, ..., -0.04959654,
         1.05254828, -1.32783522],
       [ 2.33223796, -0.60701891,  0.32704136, ..., -0.09251223,
         1.04318455, -1.32284391],
       [ 1.7826994 ,  1.85618152,  1.15562047, ..., -0.02584253,
         1.03850269, -1.33282653],
       ...,
       [-1.14259331, -0.92485123, -0.09031802, ..., -0.0717345 ,
         1.77823747, -0.8237132 ],
       [-1.05458292, -0.84539315, -0.04021111, ..., -0.09122515,
         1.77823747, -0.87362627],
       [-0.78012947, -1.00430931, -0.07044252, ..., -0.04368215,
         1.75014627, -0.83369581]])

#### Explanation of Preprocessing:
#### Missing Values: Missing values must be handled before training so that the model is not biased or weakened by incomplete data. This dataset has no missing values, so no imputation is needed.
#### Feature Scaling: Standardizing rescales each feature to zero mean and unit variance, which prevents scale-sensitive models from being dominated by features with larger numeric ranges. It matters most for models such as SVR and can help gradient-based optimizers converge faster; tree-based models (Decision Tree, Random Forest, Gradient Boosting) are largely insensitive to feature scale.
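
The notebook fits the scaler on the full feature matrix before splitting. A common alternative is to learn the mean and standard deviation from the training rows only and reuse them on the test rows. The following is a minimal sketch of that idea on toy data; the arrays `X_train_demo` and `X_test_demo` are made-up values for illustration, not part of the assignment.

```python
import numpy as np

# Toy feature matrix with two columns on very different scales (illustrative values).
X_train_demo = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0], [4.0, 4000.0]])
X_test_demo = np.array([[2.5, 2500.0]])

# Learn the statistics from the training rows only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# Apply the same transformation to both splits.
X_train_std = (X_train_demo - mean) / std
X_test_std = (X_test_demo - mean) / std

print(X_train_std.mean(axis=0))  # approximately 0 for every column
print(X_train_std.std(axis=0))   # approximately 1 for every column
print(X_test_std)
```

With scikit-learn the same pattern would be `scaler.fit(X_train)` followed by `scaler.transform(...)` on each split, so no information from the test rows enters the scaling statistics.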

In [17]:
# Let's split the data into training and testing sets (80% train, 20% test).
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=40)

### 2. Regression Algorithm Implementation

#### i) Linear Regression

In [20]:
# Importing the necessary model
from sklearn.linear_model import LinearRegression
# Initializing and fitting the linear regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
# Making predictions
linear_pred = linear_reg.predict(X_test)
linear_pred

array([2.00412998, 2.57561917, 1.19769801, ..., 2.42460902, 1.73085934,
       1.35765516])

#### Explanation:
##### Linear Regression assumes a linear relationship between the features and the target variable. If that relationship is roughly linear, the model performs well.
##### The California Housing dataset includes features like income, house age, and average rooms, which might have linear relationships with the target variable, so linear regression is a reasonable baseline to start with.
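
To make the "linear relationship" assumption concrete, here is a minimal sketch of ordinary least squares on synthetic data using NumPy. The data, `true_coef`, and `true_intercept` are invented for the illustration; the point is only that when the target really is a linear function of the features plus small noise, the fit recovers the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the target is an exact linear function of two features plus small noise.
X_demo = rng.normal(size=(200, 2))
true_coef = np.array([1.5, -2.0])
true_intercept = 0.5
y_demo = X_demo @ true_coef + true_intercept + rng.normal(scale=0.01, size=200)

# Append a column of ones so the intercept is estimated together with the coefficients.
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
coef, intercept = solution[:2], solution[2]

print("estimated coefficients:", coef)
print("estimated intercept:", intercept)
```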


#### ii) Decision Tree Regression

In [21]:
#importing necessary models
from sklearn.tree import DecisionTreeRegressor
#Initializing Decision tree regression
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train,y_train)
#Making predictions
dt_pred = dt_reg.predict(X_test)
dt_pred

array([1.607, 2.294, 0.768, ..., 1.986, 2.264, 0.827])

#### Explanation:
##### A Decision Tree Regressor builds a tree-like structure by splitting the data at each node on the feature and threshold that most reduce the variance within the resulting groups. Splitting continues until the data is divided into small, homogeneous groups (leaf nodes). Each leaf predicts the average target value of the training instances that fall into it.
##### Since the California Housing dataset may contain non-linear relationships between features and the target, Decision Trees can model such interactions.
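
The variance-reducing split described above can be sketched for a single feature as follows. This is a simplified illustration, not scikit-learn's implementation; `best_split` and the toy arrays are hypothetical names and values introduced for the example.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted child variance) of the best split on one feature."""
    best_threshold, best_score = None, np.inf
    values = np.unique(x)  # sorted unique feature values
    # Candidate thresholds are the midpoints between consecutive unique values.
    for left_val, right_val in zip(values[:-1], values[1:]):
        threshold = (left_val + right_val) / 2
        left, right = y[x <= threshold], y[x > threshold]
        # Size-weighted variance of the two child nodes: lower means purer children.
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy data with two obvious groups of target values.
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])

threshold, score = best_split(x_demo, y_demo)
left_prediction = y_demo[x_demo <= threshold].mean()   # leaf value = mean of its targets
right_prediction = y_demo[x_demo > threshold].mean()
print(threshold, left_prediction, right_prediction)
```

A full tree repeats this search over every feature and then recurses into each child node.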

#### iii) Random Forest Regression

In [22]:
#importing necessary models
from sklearn.ensemble import RandomForestRegressor
#Initializing Random forest regression
rf_reg = RandomForestRegressor()
rf_reg.fit(X_train,y_train)
#Making predictions
rf_pred = rf_reg.predict(X_test)
rf_pred

array([1.95706, 2.37913, 0.74204, ..., 1.90402, 1.96961, 0.99142])

#### Explanation:
##### Random Forest is an ensemble learning method that creates multiple decision trees and aggregates their predictions. The final prediction is the average of all the trees' predictions, which helps to reduce overfitting and improve generalization.
##### Random Forest can model complex non-linear relationships. It suits datasets like California Housing, where several features may interact with each other.
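
The averaging step can be checked directly: for a fitted `RandomForestRegressor`, the individual trees are available in `estimators_`, and the mean of their predictions should match `predict`. The sketch below uses synthetic data (an assumption for illustration, not the housing data).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic non-linear regression problem (illustrative only).
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with every tree separately, then average across trees.
X_new = rng.uniform(-3, 3, size=(5, 2))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(forest.predict(X_new))
```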

#### iv) Gradient Boosting Regression

In [23]:
#importing necessary models
from sklearn.ensemble import GradientBoostingRegressor
#Initializing Gradient Boosting regression
gb_reg = GradientBoostingRegressor()
gb_reg.fit(X_train,y_train)
#Making predictions
gb_pred = gb_reg.predict(X_test)
gb_pred

array([1.81656876, 2.30660447, 0.8132348 , ..., 2.11724067, 2.03201091,
       0.91179624])

#### Explanation:
##### Gradient Boosting builds models sequentially. Each new model attempts to correct the residual errors of the models built before it.
##### Gradient Boosting is highly effective for datasets with complex, non-linear relationships. In the case of California Housing, it can model the non-linear effects of features on house prices.
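
The sequential residual-fitting idea can be sketched by hand for squared-error loss: start from the mean, repeatedly fit a small tree to the current residuals, and add a shrunken copy of its predictions. This is a simplified illustration on synthetic data, not the internals of `GradientBoostingRegressor`; `learning_rate`, `n_rounds`, and the tree depth are values chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic one-feature problem with a non-linear target (illustrative only).
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Round 0: a constant model that predicts the mean of the target.
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y_demo - prediction                    # what the current ensemble gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                        # the new model targets the residuals
    prediction += learning_rate * tree.predict(X_demo) # add a shrunken correction
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after boosting: ", mse_history[-1])
```

The training error shrinks with each round; in practice the number of rounds and the learning rate are tuned on held-out data to avoid overfitting.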

#### v) Support Vector Regression (SVR)

In [24]:
# Importing the necessary model
from sklearn.svm import SVR
# Initializing Support Vector Regression
svr_reg = SVR()
svr_reg.fit(X_train,y_train)
#Making predictions
svr_pred = svr_reg.predict(X_test)
svr_pred

array([1.72930462, 2.2756762 , 0.76080943, ..., 2.05595118, 1.76671968,
       0.98554852])

#### Explanation:
##### SVR can model non-linear relationships using kernel functions, which can be helpful for datasets with complex feature-target relationships. It is sensitive to feature scaling, which is one reason the features were standardized, and extreme outliers can still influence the fit.
##### The California Housing dataset, with multiple features influencing house prices, might benefit from SVR's ability to handle non-linear patterns.
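
One way to see how SVR treats errors is its epsilon-insensitive loss: deviations smaller than `epsilon` cost nothing, and larger ones are penalized linearly rather than quadratically. The sketch below is a standalone illustration of that loss; the helper name `epsilon_insensitive_loss` and the toy arrays are assumptions for the example, and `epsilon=0.1` mirrors the default of `sklearn.svm.SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the absolute error outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true_demo = np.array([1.0, 1.0, 1.0])
y_pred_demo = np.array([1.05, 1.5, 0.0])  # small, medium, and large errors

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo)
squared = (y_true_demo - y_pred_demo) ** 2  # squared error, for comparison

print(losses)
print(squared)
```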

### 3. Model Evaluation and Comparison

##### Let's evaluate the performance of each algorithm using the following metrics:
##### Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared Score (R²)
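
For reference, the three metrics can be computed by hand as below. This is a minimal pure-Python sketch on a four-point toy example; `regression_metrics` is a hypothetical helper, and in the notebook itself the scikit-learn functions are used instead.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mse, mae, r2

mse_demo, mae_demo, r2_demo = regression_metrics(
    [3.0, -0.5, 2.0, 7.0],
    [2.5, 0.0, 2.0, 8.0],
)
print(mse_demo, mae_demo, round(r2_demo, 4))  # 0.375 0.5 0.9486
```

MSE penalizes large errors more heavily than MAE because the errors are squared, while R² compares the model's squared error with that of always predicting the mean.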

In [26]:
# Let's import the metric functions
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


#### i) Linear Regression

In [28]:
# The predictions are stored in linear_pred
# Let's calculate the evaluation metrics
#Mean squared error metric
mse_linear = mean_squared_error(y_test,linear_pred)
#Mean absolute error
mae_linear = mean_absolute_error(y_test,linear_pred)
#r2score
r2_linear = r2_score(y_test,linear_pred)

In [29]:
# Let's print the results
print("Linear regression Evaluation: ")
print("Mean Squared Error: ",mse_linear)
print("Mean absolute Error: ",mae_linear)
print("R2 score: ",r2_linear)

Linear regression Evaluation: 
Mean Squared Error:  0.5417517275769405
Mean absolute Error:  0.538957248055476
R2 score:  0.6075794091011186


#### ii) Decision tree regression

In [33]:
# The predictions are stored in dt_pred
# Let's calculate the evaluation metrics
#Mean squared error metric
mse_dt = mean_squared_error(y_test,dt_pred)
#Mean absolute error
mae_dt = mean_absolute_error(y_test,dt_pred)
#r2score
r2_dt = r2_score(y_test,dt_pred)

In [34]:
# Let's print the results
print("Decision Tree Regression Evaluation: ")
print("Mean Squared Error: ",mse_dt)
print("Mean absolute Error: ",mae_dt)
print("R2 score: ",r2_dt)

Decision Tree Regression Evaluation: 
Mean Squared Error:  0.5256955741617491
Mean absolute Error:  0.4665009617248062
R2 score:  0.619209764649653


#### iii) Random Forest Regression

In [35]:
# The predictions are stored in rf_pred
# Let's calculate the evaluation metrics
#Mean squared error metric
mse_rf = mean_squared_error(y_test,rf_pred)
#Mean absolute error
mae_rf = mean_absolute_error(y_test,rf_pred)
#r2score
r2_rf = r2_score(y_test,rf_pred)

In [36]:
# Let's print the results
print("Random Forest Regression Evaluation: ")
print("Mean Squared Error: ",mse_rf)
print("Mean absolute Error: ",mae_rf)
print("R2 score: ",r2_rf)

Random Forest Regression Evaluation: 
Mean Squared Error:  0.2736603968684209
Mean absolute Error:  0.339245191521318
R2 score:  0.8017727139975268


#### iv) Gradient Boosting Regression

In [37]:
# The predictions are stored in gb_pred
# Let's calculate the evaluation metrics
#Mean squared error metric
mse_gb = mean_squared_error(y_test,gb_pred)
#Mean absolute error
mae_gb = mean_absolute_error(y_test,gb_pred)
#r2score
r2_gb = r2_score(y_test,gb_pred)

In [38]:
# Let's print the results
print("Gradient Boosting Regression Evaluation: ")
print("Mean Squared Error: ",mse_gb)
print("Mean absolute Error: ",mae_gb)
print("R2 score: ",r2_gb)

Gradient Boosting Regression Evaluation: 
Mean Squared Error:  0.29882813115581697
Mean absolute Error:  0.37517686471055917
R2 score:  0.7835423389790304


#### v) Support Vector Regression

In [39]:
# The predictions are stored in svr_pred
# Let's calculate the evaluation metrics
#Mean squared error metric
mse_svr = mean_squared_error(y_test,svr_pred)
#Mean absolute error
mae_svr = mean_absolute_error(y_test,svr_pred)
#r2score
r2_svr = r2_score(y_test,svr_pred)

In [40]:
# Let's print the results
print("Support Vector Regression Evaluation: ")
print("Mean Squared Error: ",mse_svr)
print("Mean absolute Error: ",mae_svr)
print("R2 score: ",r2_svr)

Support Vector Regression Evaluation: 
Mean Squared Error:  0.36526277205789964
Mean absolute Error:  0.3995027521112096
R2 score:  0.7354200724279787


##### To compare the results of all the regression models (Linear Regression, Decision Tree Regressor, Random Forest Regressor, Gradient Boosting Regressor, and Support Vector Regressor), we will look at the three evaluation metrics for each model: Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²).

In [45]:
# Let's compare the evaluation results by printing them side by side
# Linear Regression
print("Linear regression Evaluation: ")
print("Mean Squared Error: ",mse_linear)
print("Mean absolute Error: ",mae_linear)
print("R2 score: ",r2_linear)
# Decision Tree Regression
print("\nDecision Tree Regression Evaluation: ")
print("Mean Squared Error: ",mse_dt)
print("Mean absolute Error: ",mae_dt)
print("R2 score: ",r2_dt)
# Random Forest Regression
print("\nRandom Forest Regression Evaluation: ")
print("Mean Squared Error: ",mse_rf)
print("Mean absolute Error: ",mae_rf)
print("R2 score: ",r2_rf)
# Gradient Boosting Regression
print("\nGradient Boosting Regression Evaluation: ")
print("Mean Squared Error: ",mse_gb)
print("Mean absolute Error: ",mae_gb)
print("R2 score: ",r2_gb)
# Support Vector Regression
print("\nSupport Vector Regression Evaluation: ")
print("Mean Squared Error: ",mse_svr)
print("Mean absolute Error: ",mae_svr)
print("R2 score: ",r2_svr)

Linear regression Evaluation: 
Mean Squared Error:  0.5417517275769405
Mean absolute Error:  0.538957248055476
R2 score:  0.6075794091011186

Decision Tree Regression Evaluation: 
Mean Squared Error:  0.5256955741617491
Mean absolute Error:  0.4665009617248062
R2 score:  0.619209764649653

Random Forest Regression Evaluation: 
Mean Squared Error:  0.2736603968684209
Mean absolute Error:  0.339245191521318
R2 score:  0.8017727139975268

Gradient Boosting Regression Evaluation: 
Mean Squared Error:  0.29882813115581697
Mean absolute Error:  0.37517686471055917
R2 score:  0.7835423389790304

Support Vector Regression Evaluation: 
Mean Squared Error:  0.36526277205789964
Mean absolute Error:  0.3995027521112096
R2 score:  0.7354200724279787


##### Best-Performing Algorithm:
##### Lower MSE and MAE: These indicate the model's prediction is closer to the actual values, which generally translates to better performance. Higher R²: A higher R² value indicates that the model is better at explaining the variance in the target variable (i.e., it fits the data well).

##### In this project, Random Forest Regression achieved the lowest MSE (0.274) and MAE (0.339) and the highest R² (0.802), so it is the best-performing algorithm here. Gradient Boosting Regression is a close second (MSE 0.299, MAE 0.375, R² 0.784).

##### Worst-Performing Algorithm:
##### Higher MSE and MAE: Models with higher values of MSE and MAE are not predicting well. Lower R²: A lower R² means the model doesn't explain much of the variance in the data.

##### In this project, Linear Regression has the highest MSE (0.542) and MAE (0.539) and the lowest R² (0.608), with the Decision Tree only slightly better (MSE 0.526, MAE 0.467, R² 0.619). These two are therefore the worst-performing algorithms in this case.
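
The ranking can be reproduced from the metrics printed above. The sketch below copies those values by hand (rounded to four decimals) into a plain dictionary and sorts by R²; the `results` structure is an assumption made for this illustration, and in the notebook the same ordering could be obtained from the `mse_*`, `mae_*`, and `r2_*` variables directly.

```python
# Metric values copied from the printed output above, rounded to four decimals.
results = {
    "Linear Regression": {"mse": 0.5418, "mae": 0.5390, "r2": 0.6076},
    "Decision Tree": {"mse": 0.5257, "mae": 0.4665, "r2": 0.6192},
    "Random Forest": {"mse": 0.2737, "mae": 0.3392, "r2": 0.8018},
    "Gradient Boosting": {"mse": 0.2988, "mae": 0.3752, "r2": 0.7835},
    "SVR": {"mse": 0.3653, "mae": 0.3995, "r2": 0.7354},
}

# Sort from best to worst by R² (higher is better).
ranked = sorted(results.items(), key=lambda item: item[1]["r2"], reverse=True)

for name, metrics in ranked:
    print(f"{name:<20} MSE={metrics['mse']:.4f}  MAE={metrics['mae']:.4f}  R2={metrics['r2']:.4f}")
```

Sorting by MSE or MAE in ascending order gives the same ordering for these results.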