# Regression

## 1. Loading and Preprocessing
* Load the California Housing dataset using the fetch_california_housing function from sklearn.
* Convert the dataset into a pandas DataFrame for easier handling.
* Handle missing values (if any) and perform necessary feature scaling (e.g., standardization).
* Explain the preprocessing steps you performed and justify why they are necessary for this dataset.

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectKBest, f_regression

In [4]:
# Load the California housing dataset
data = fetch_california_housing()

In [5]:
# Convert to a pandas DataFrame for easier handling and manipulation.
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target
df

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23   4.526
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22   3.585
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24   3.521
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25   3.413
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25   3.422
...       ...       ...       ...        ...         ...       ...       ...        ...     ...
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48    -121.09   0.781
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49    -121.21   0.771
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43    -121.22   0.923
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43    -121.32   0.847


In [7]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64
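
The check above finds no missing values, so no imputation is needed for this dataset. If a column did contain gaps, one common option would be median imputation. The sketch below illustrates that idea on a small made-up frame (`toy`, with hypothetical values), not on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not taken from the California Housing dataset.
toy = pd.DataFrame({
    "MedInc": [8.3, np.nan, 7.2, 5.6],
    "HouseAge": [41.0, 21.0, np.nan, 52.0],
})

# Count missing values per column, as in the check above.
print(toy.isnull().sum())

# Fill each gap with the median of its own column.
toy_filled = toy.fillna(toy.median())
print(toy_filled)
```

In a modelling workflow the medians would usually be computed on the training split only and then reused on the test split.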


In [8]:
# Check for duplicate rows in the dataset
df.duplicated().sum()


0

In [9]:
# Separate the columns by dtype to see which are categorical and which are numerical.
categorical_columns = df.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Print Results
print("Categorical Columns:", categorical_columns)
print("Numerical Columns:", numerical_columns)


Categorical Columns: []
Numerical Columns: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'Target']


In [10]:
# Correlation of each column with the Target column, sorted from highest to lowest.
print(df.corr()['Target'].sort_values(ascending=False))

Target        1.000000
MedInc        0.688075
AveRooms      0.151948
HouseAge      0.105623
AveOccup     -0.023737
Population   -0.024650
Longitude    -0.045967
AveBedrms    -0.046701
Latitude     -0.144160
Name: Target, dtype: float64
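
`df.corr()` computes the Pearson correlation coefficient by default. As a rough illustration of what that number means, the sketch below computes it by hand on made-up arrays (`x` and `y` are hypothetical values, not columns of the dataset) and compares the result with NumPy's built-in.

```python
import numpy as np

# Made-up values for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Pearson r: sum of products of centered values divided by the
# square root of the product of the two sums of squares.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered**2).sum() * (y_centered**2).sum()
)

# NumPy's built-in gives the same value.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Pearson correlation captures only linear association, so a value near zero (as for `Latitude` or `Longitude` taken individually) does not rule out a non-linear relationship with the target.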


In [11]:
# Select the best features with SelectKBest (univariate feature selection)
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns=['Target'])  # Features
y = df['Target']  # Target

# Note: f_classif is an ANOVA F-test intended for class labels; for a continuous
# target such as this one, f_regression is the scorer designed for the task.
# The output below was produced with f_classif.
select_k = SelectKBest(score_func=f_classif, k=4)  # Keep the top 4 features; k is a free choice
X_selected = select_k.fit_transform(X, y)

# Get the names and scores of the selected features
selected_features = X.columns[select_k.get_support()]
selected_scores = select_k.scores_[select_k.get_support()]  # Scores of the selected features only

print("Selected Features:", selected_features)
print("Feature Scores based on select_k:", selected_scores)

# Create a DataFrame to display feature names and scores
feature_scores_df = pd.DataFrame({'Feature': selected_features, 'Score': selected_scores})

# Sort by score in descending order
feature_scores_df = feature_scores_df.sort_values(by="Score", ascending=False)

# Print results
print("Selected Features:\n", feature_scores_df)

Selected Features: Index(['MedInc', 'HouseAge', 'Population', 'Latitude'], dtype='object')
Feature Scores based on select_k: [6.80719001 1.11517788 1.19632039 1.25951125]
Selected Features:
       Feature     Score
0      MedInc  6.807190
3    Latitude  1.259511
2  Population  1.196320
1    HouseAge  1.115178
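
The cell above scores features with `f_classif`, which is designed for class labels. For a continuous target, scikit-learn provides `f_regression`. The sketch below shows how that scorer could be used; it runs on synthetic data from `make_regression` (an assumption made so the example is self-contained), so its scores say nothing about the housing features.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: 8 features, of which 3 actually drive the target.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, n_informative=3, noise=10.0, random_state=42
)

# Keep the 4 features with the highest univariate F-statistic.
selector = SelectKBest(score_func=f_regression, k=4)
X_demo_selected = selector.fit_transform(X_demo, y_demo)

print("Selected column indices:", selector.get_support(indices=True))
print("F-scores for all features:", selector.scores_.round(2))
```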


In [12]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nTraining Data (Features):")
print(X_train)
print("\nTesting Data (Features):")
print(X_test)

# Holding out a test set lets us evaluate the model on unseen data and detect overfitting.


Training Data (Features):
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
14196  3.2596      33.0  5.017657   1.006421      2300.0  3.691814     32.71   
8267   3.8125      49.0  4.473545   1.041005      1314.0  1.738095     33.77   
17445  4.1563       4.0  5.645833   0.985119       915.0  2.723214     34.66   
14265  1.9425      36.0  4.002817   1.033803      1418.0  3.994366     32.69   
2271   3.5542      43.0  6.268421   1.134211       874.0  2.300000     36.78   
...       ...       ...       ...        ...         ...       ...       ...   
11284  6.3700      35.0  6.129032   0.926267       658.0  3.032258     33.78   
11964  3.0500      33.0  6.868597   1.269488      1753.0  3.904232     34.02   
5390   2.9344      36.0  3.986717   1.079696      1756.0  3.332068     34.03   
860    5.7192      15.0  6.395349   1.067979      1777.0  3.178891     37.58   
15795  2.5755      52.0  3.402576   1.058776      2619.0  2.108696     37.77   

       Longi

In [13]:
X_train
X_train.shape

(16512, 8)

In [14]:
y_train

14196    1.030
8267     3.821
17445    1.726
14265    0.934
2271     0.965
         ...  
11284    2.292
11964    0.978
5390     2.221
860      2.835
15795    3.250
Name: Target, Length: 16512, dtype: float64

In [15]:
X_test

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
20046  1.6812      25.0  4.192201   1.022284      1392.0  3.877437     36.06    -119.01
3024   2.5313      30.0  5.039384   1.193493      1565.0  2.679795     35.14    -119.46
15663  3.4801      52.0  3.977155   1.185877      1310.0  1.360332     37.80    -122.44
20484  5.7376      17.0  6.163636   1.020202      1705.0  3.444444     34.28    -118.72
9814   3.7250      34.0  5.492991   1.028037      1063.0  2.483645     36.62    -121.93
...       ...       ...       ...        ...         ...       ...       ...        ...
15362  4.6050      16.0  7.002212   1.066372      1351.0  2.988938     33.36    -117.22
16623  2.7266      28.0  6.131915   1.256738      1650.0  2.340426     35.36    -120.83
18086  9.2298      25.0  7.237676   0.947183      1585.0  2.790493     37.31    -122.05
2144   2.7850      36.0  5.289030   0.983122      1227.0  2.588608     36.77    -119.76


In [16]:
y_test

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
          ...   
15362    2.63300
16623    2.66800
18086    5.00001
2144     0.72300
3665     1.51500
Name: Target, Length: 4128, dtype: float64

In [17]:
# Display the first few rows of the DataFrame (df itself has not been scaled)
print(df.head())

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  Target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  


* Feature Scaling

In [18]:
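# Standardization rescales each feature to mean 0 and standard deviation 1, making features comparable.
# Fit the scaler on the training data only, then apply the same transformation to the test data.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Used later by Linear Regression and SVR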


In [19]:
X_train_scaled

array([[-0.326196  ,  0.34849025, -0.17491646, ...,  0.05137609,
        -1.3728112 ,  1.27258656],
       [-0.03584338,  1.61811813, -0.40283542, ..., -0.11736222,
        -0.87669601,  0.70916212],
       [ 0.14470145, -1.95271028,  0.08821601, ..., -0.03227969,
        -0.46014647, -0.44760309],
       ...,
       [-0.49697313,  0.58654547, -0.60675918, ...,  0.02030568,
        -0.75500738,  0.59946887],
       [ 0.96545045, -1.07984112,  0.40217517, ...,  0.00707608,
         0.90651045, -1.18553953],
       [-0.68544764,  1.85617335, -0.85144571, ..., -0.08535429,
         0.99543676, -1.41489815]])
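
A scaler is normally fitted on the training rows only and then reused, unchanged, on the test rows, so that no information from the test set influences preprocessing. The following sketch demonstrates this on made-up arrays (`train` and `test` are random stand-ins, not the housing features).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up feature matrices standing in for X_train and X_test.
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learn mean/std from the training rows
test_scaled = demo_scaler.transform(test)        # reuse those statistics on the test rows

# Training columns end up with mean ~0 and std ~1; test columns only approximately so.
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

Tree-based models are insensitive to monotonic rescaling of features, which is why scaling matters mainly for models such as Linear Regression with regularization or SVR.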

In [20]:
print(len(X_train_scaled[0]))
X_train_scaled.shape
X_train.shape

8


(16512, 8)

In [21]:
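from sklearn.preprocessing import StandardScaler

y_train_df = pd.DataFrame(y_train)  # Convert the Series to a DataFrame (StandardScaler expects 2-D input)
target_scaler = StandardScaler()  # Separate scaler so the feature scaler above is not overwritten
target_scaler.fit(y_train_df)
y_train_scaled = target_scaler.transform(y_train_df)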

In [22]:
y_train_scaled

array([[-0.90118909],
       [ 1.5127714 ],
       [-0.29921255],
       ...,
       [ 0.12891731],
       [ 0.65997132],
       [ 1.01890847]])
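
The models later in this notebook are fitted on the unscaled `y_train`, so the scaled target is not strictly required. If a model were trained on a scaled target, its predictions would have to be mapped back to the original units before computing metrics. A minimal sketch of that round trip, using made-up values and a hypothetical `target_scaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values standing in for y_train.
y_demo = np.array([1.03, 3.82, 1.73, 0.93, 0.97]).reshape(-1, 1)

target_scaler = StandardScaler()
y_demo_scaled = target_scaler.fit_transform(y_demo)

# Pretend these are predictions made in the scaled space.
pred_scaled = y_demo_scaled.copy()

# Map predictions back to the original units before scoring them.
pred_original = target_scaler.inverse_transform(pred_scaled)
print(pred_original.ravel())
```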

In [23]:
y_test

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
          ...   
15362    2.63300
16623    2.66800
18086    5.00001
2144     0.72300
3665     1.51500
Name: Target, Length: 4128, dtype: float64

## 2. Regression Algorithm Implementation
Implement the following regression algorithms: 
* Linear Regression 
* Decision Tree Regressor 
* Random Forest Regressor 
* Gradient Boosting Regressor 
* Support Vector Regressor (SVR)
   
For each algorithm, provide a brief explanation of how it works and explain why it might be suitable for this dataset.

In [32]:
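from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

results = {}  # Collects the metrics for each model

# Linear Regression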
print("\n1.Linear Regression:")
print("Linear Regression assumes a linear relationship between features and the target variable.")
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
y_pred_lr = lr_model.predict(X_test_scaled)
results['Linear Regression'] = {
    "MSE": mean_squared_error(y_test, y_pred_lr),
    "MAE": mean_absolute_error(y_test, y_pred_lr),
    "R2": r2_score(y_test, y_pred_lr)
}

# Decision Tree Regressor
print("\n2.Decision Tree Regressor:")
print("Decision Trees split the data into branches to make predictions and are useful for capturing non-linear patterns.")
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
results['Decision Tree'] = {
    "MSE": mean_squared_error(y_test, y_pred_dt),
    "MAE": mean_absolute_error(y_test, y_pred_dt),
    "R2": r2_score(y_test, y_pred_dt)
}

# Random Forest Regressor
print("\n3.Random Forest Regressor:")
print("Random Forest uses multiple Decision Trees to improve accuracy and reduce overfitting.")
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
results['Random Forest'] = {
    "MSE": mean_squared_error(y_test, y_pred_rf),
    "MAE": mean_absolute_error(y_test, y_pred_rf),
    "R2": r2_score(y_test, y_pred_rf)
}

# Gradient Boosting Regressor
print("\n4.Gradient Boosting Regressor:")
print("Gradient Boosting builds models sequentially, each correcting the errors of the previous one.")
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
results['Gradient Boosting'] = {
    "MSE": mean_squared_error(y_test, y_pred_gb),
    "MAE": mean_absolute_error(y_test, y_pred_gb),
    "R2": r2_score(y_test, y_pred_gb)
}

# Support Vector Regressor (SVR)
print("\n5.Support Vector Regressor (SVR):")
print("SVR uses support vectors to fit a regression model and is effective for small datasets.")
svr_model = SVR()
svr_model.fit(X_train_scaled, y_train)
y_pred_svr = svr_model.predict(X_test_scaled)
results['SVR'] = {
    "MSE": mean_squared_error(y_test, y_pred_svr),
    "MAE": mean_absolute_error(y_test, y_pred_svr),
    "R2": r2_score(y_test, y_pred_svr)
}

results_df = pd.DataFrame(results).T
print("\nModel Performance:\n", results_df)


1.Linear Regression:
Linear Regression assumes a linear relationship between features and the target variable.

2.Decision Tree Regressor:
Decision Trees split the data into branches to make predictions and are useful for capturing non-linear patterns.

3.Random Forest Regressor:
Random Forest uses multiple Decision Trees to improve accuracy and reduce overfitting.

4.Gradient Boosting Regressor:
Gradient Boosting builds models sequentially, each correcting the errors of the previous one.

5.Support Vector Regressor (SVR):
SVR uses support vectors to fit a regression model and is effective for small datasets.

Model Performance:
                         MSE       MAE        R2
Linear Regression  0.555892  0.533200  0.575788
Decision Tree      0.495235  0.454679  0.622076
Random Forest      0.255368  0.327543  0.805123
Gradient Boosting  0.293997  0.371643  0.775645
SVR                0.357004  0.398599  0.727563
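
The five model blocks above repeat the same fit, predict and score steps. One way to reduce the repetition would be a small helper; the sketch below uses a hypothetical function name (`evaluate`) and synthetic data from `make_regression`, so the numbers it prints are unrelated to the table above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, X_tr, X_te, y_tr, y_te):
    """Fit the model and return MSE, MAE and R2 on the held-out data."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "MSE": mean_squared_error(y_te, pred),
        "MAE": mean_absolute_error(y_te, pred),
        "R2": r2_score(y_te, pred),
    }


# Synthetic stand-in data so the sketch runs without downloading anything.
X_syn, y_syn = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}
demo_results = {name: evaluate(m, X_tr, X_te, y_tr, y_te) for name, m in models.items()}
demo_results_df = pd.DataFrame(demo_results).T
print(demo_results_df)
```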


## 3. Model Evaluation and Comparison
Evaluate the performance of each algorithm using the following metrics: 
* Mean Squared Error (MSE) 
* Mean Absolute Error (MAE) 
* R-squared Score (R²)
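
To make the three metrics concrete, the sketch below computes them by hand on made-up vectors (`y_true` and `y_hat` are hypothetical values): MSE is the mean of the squared errors, MAE is the mean of the absolute errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up true values and predictions for illustration.
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

errors = y_true - y_hat
mse = np.mean(errors**2)                 # mean squared error
mae = np.mean(np.abs(errors))            # mean absolute error
r2 = 1.0 - np.sum(errors**2) / np.sum((y_true - y_true.mean())**2)
print(mse, mae, r2)
```

Lower MSE and MAE are better, while R² closer to 1 is better; MSE penalizes large errors more heavily than MAE does.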
  
Compare the results of all models and identify:
* The best-performing algorithm, with justification.
* The worst-performing algorithm, with reasoning.

In [38]:
results_df = pd.DataFrame(results).T
print("\nModel Performance Comparison:\n", results_df)

# Identify Best and Worst Performing Models
best_model = results_df['R2'].idxmax()
worst_model = results_df['R2'].idxmin()

print(f"\nBest Performing Model: {best_model} with R² Score of {results_df.loc[best_model, 'R2']:.4f}")
print(f"Worst Performing Model: {worst_model} with R² Score of {results_df.loc[worst_model, 'R2']:.4f}")

# Explanation for best and worst models
print(f"\n{best_model} performed best because it achieved the highest R² score on the test set, indicating the best fit to unseen data.")
print(f"{worst_model} performed worst, likely because it could not capture the dataset’s patterns effectively (underfitting).")


Model Performance Comparison:
                         MSE       MAE        R2
Linear Regression  0.555892  0.533200  0.575788
Decision Tree      0.495235  0.454679  0.622076
Random Forest      0.255368  0.327543  0.805123
Gradient Boosting  0.293997  0.371643  0.775645
SVR                0.357004  0.398599  0.727563

Best Performing Model: Random Forest with R² Score of 0.8051
Worst Performing Model: Linear Regression with R² Score of 0.5758

Random Forest performed best because it achieved the highest R² score on the test set, indicating the best fit to unseen data.
Linear Regression performed worst, likely because it could not capture the dataset’s patterns effectively (underfitting).
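
The ranking above rests on a single train/test split. A hedged way to check how stable such a comparison is would be k-fold cross-validation; the sketch below is an illustration on synthetic data (not the housing data), with the scaler wrapped in a pipeline so it is refitted inside each fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch is self-contained.
X_cv, y_cv = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)

# The pipeline refits the scaler inside each fold, so no held-out statistics leak into training.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_val_score(pipeline, X_cv, y_cv, cv=5, scoring="r2")
print(cv_scores.round(3), cv_scores.mean().round(3))
```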
