#### Linear Regression By Abhijit Challapalli

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, RidgeCV, Lasso,ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold,GridSearchCV
import matplotlib.pyplot as plt

In [None]:
#1. Load and Explore Data:
#- Read the "housing.csv" file.
#- Display the first few rows of the dataset.
#- Extract input (X) and output (Y) data from the dataset.
housing_df = pd.read_csv('/content/drive/MyDrive/housing.csv')
print(housing_df.head())
print(housing_df.shape)

x = housing_df.drop(columns=['median_house_value'])
y = housing_df['median_house_value']
print(x.shape)
print(y.shape)



   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                  41          880           129.0   
1    -122.22     37.86                  21         7099          1106.0   
2    -122.24     37.85                  52         1467           190.0   
3    -122.25     37.85                  52         1274           235.0   
4    -122.25     37.85                  52         1627           280.0   

   population  households  median_income ocean_proximity  median_house_value  
0         322         126         8.3252        NEAR BAY              452600  
1        2401        1138         8.3014        NEAR BAY              358500  
2         496         177         7.2574        NEAR BAY              352100  
3         558         219         5.6431        NEAR BAY              341300  
4         565         259         3.8462        NEAR BAY              342200  
(20640, 10)
(20640, 9)
(20640,)


In [None]:
#2. Handle Missing Values:
#- Fill missing values. Imputation method should make sense.
# Count missing values per column to see which features need imputation
print(housing_df.isnull().sum())

# Use the median strategy with SimpleImputer
imputer = SimpleImputer(strategy='median')

# Fit on the total_bedrooms column and write the imputed values back as a 1-D column
housing_df['total_bedrooms'] = imputer.fit_transform(housing_df[['total_bedrooms']]).ravel()




#### The total_bedrooms feature is right-skewed, so median imputation is preferable to mean imputation here: the median is more robust to extreme values.
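
##### The effect of skew on the mean and the median can be checked on a small synthetic sample. The sketch below is not the housing data: the lognormal parameters and the three extreme values are assumptions chosen only to produce a right-skewed column.

```python
import numpy as np

# Synthetic right-skewed sample standing in for total_bedrooms (illustrative values only).
rng = np.random.default_rng(0)
bedrooms = rng.lognormal(mean=6.0, sigma=0.7, size=5000)

mean_value = bedrooms.mean()
median_value = np.median(bedrooms)

# The long right tail pulls the mean above the median.
print(f"mean:   {mean_value:.1f}")
print(f"median: {median_value:.1f}")

# Appending a few extreme values moves the mean far more than the median.
with_outliers = np.append(bedrooms, [50_000, 60_000, 70_000])
mean_shift = with_outliers.mean() - mean_value
median_shift = np.median(with_outliers) - median_value
print(f"mean shift:   {mean_shift:.1f}")
print(f"median shift: {median_shift:.1f}")
```

##### If the mean and the median of the real column differ in the same way, the median is the safer fill value.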

In [None]:
#3. Encode Categorical Data:
#- Convert categorical columns in the dataset to numerical data.

housing_df.info()
label = LabelEncoder()
# LabelEncoder expects a 1-D Series, so select the column with single brackets
housing_df['ocean_proximity'] = label.fit_transform(housing_df['ocean_proximity'])
print(housing_df.head())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  int64  
 3   total_rooms         20640 non-null  int64  
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  int64  
 6   households          20640 non-null  int64  
 7   median_income       20640 non-null  float64
 8   ocean_proximity     20640 non-null  object 
 9   median_house_value  20640 non-null  int64  
dtypes: float64(4), int64(5), object(1)
memory usage: 1.6+ MB
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                  41          880           129.0   
1    -122.22     37.86                  21         7099          1106.0   
2    -122.24    

  y = column_or_1d(y, warn=True)
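
##### LabelEncoder maps each category to an integer, which a linear model then treats as an ordered numeric scale. One possible alternative is one-hot encoding. The sketch below uses pd.get_dummies on a small hypothetical frame; the rows and category labels are illustrative and are not taken from the housing file.

```python
import pandas as pd

# Tiny hypothetical frame; the category labels are illustrative only.
toy = pd.DataFrame({
    "median_income": [8.3, 3.8, 5.6, 2.1],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"],
})

# One indicator column per category, so no ordering is implied between categories.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
print(encoded)
```

##### With an unregularized linear model it may be worth passing drop_first=True so that the indicator columns are not perfectly collinear with the intercept.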


In [None]:
#4. Split the Dataset:
#- Split the data into 80% training dataset and 20% test dataset
x = housing_df.drop(columns=['median_house_value'])
y = housing_df['median_house_value']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2, random_state= 10)
print(x_train.shape)
print(x_test.shape)
print(y_test.shape)
print(y_train.shape)


(16512, 9)
(4128, 9)
(4128,)
(16512,)


In [None]:

#5. Standardize Data:
#- Standardize the training and test datasets
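scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse its mean and standard deviation for the test data
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)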

x_train_scaled
x_test_scaled

array([[ 0.18192367, -0.67849302, -1.34840706, ..., -1.00916727,
         0.71293818,  1.96551927],
       [-1.38943917,  0.9181829 , -0.14117516, ...,  0.88878716,
         0.03727867,  1.96551927],
       [ 0.88979   , -0.90928099, -1.34840706, ...,  1.21103343,
         0.31517798, -0.82796103],
       ...,
       [-1.40450016,  1.12071111,  0.34171759, ..., -0.18705867,
         0.03991695,  1.26714919],
       [ 0.63375324, -0.76327227,  1.30750311, ..., -0.34437575,
        -1.53456991, -0.82796103],
       [ 0.67391587, -0.65494323,  0.90509248, ..., -0.70214523,
         2.03683973, -0.82796103]])
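
##### A common convention is to learn the scaling statistics from the training split and reuse them on the test split. The sketch below uses synthetic data (the means, spreads, and sizes are assumptions) to show how the result differs from refitting the scaler on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
x_train_demo = rng.normal(loc=100.0, scale=20.0, size=(800, 2))
# The test split is deliberately drawn with a shifted mean.
x_test_demo = rng.normal(loc=110.0, scale=20.0, size=(200, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(x_train_demo)  # learn mean/std from training data
test_scaled = scaler.transform(x_test_demo)        # reuse the training mean/std

# Refitting on the test split hides the shift: its columns are forced to mean 0.
refit_scaled = StandardScaler().fit_transform(x_test_demo)

print("test mean with training statistics:", test_scaled.mean(axis=0))
print("test mean after refitting:         ", refit_scaled.mean(axis=0))
```

##### With transform, the test features stay on the scale the model was trained on, so a real shift between the splits remains visible to the model.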

In [None]:
#6. Linear Regression:
#- Perform Linear Regression on the training data.
#- Predict the output for the test dataset using the fitted model.
#- Print the root mean squared error (RMSE) from Linear Regression.
linear_reg = LinearRegression()
linear_reg.fit(x_train_scaled,y_train)

linear_pred = linear_reg.predict(x_test_scaled)
print(linear_pred)

linear_rmse = np.sqrt(mean_squared_error(y_test,linear_pred))
print("Linear Regression RMSE:",linear_rmse)



[277317.06686109 274032.58208745 263316.78296738 ... 224587.26169117
 113353.56280082 368871.93725691]
Linear Regression RMSE: 70559.92223637385
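
##### RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same unit as median_house_value. The three target/prediction pairs below are made-up numbers, used only to check the formula against mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only.
y_true = np.array([200_000.0, 150_000.0, 320_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0])

# Same computation two ways: via scikit-learn and directly from the definition.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse_sklearn, rmse_manual)
```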


In [None]:

#7. Lasso Regression:
#- Implement Lasso Regression on the training data.
#- Predict the output for the test dataset using the fitted Lasso model.
#- Evaluate and print the RMSE for Lasso Regression.
lasso_regression = Lasso(alpha=0.3)
lasso_regression.fit(x_train_scaled,y_train)

lasso_predict = lasso_regression.predict(x_test_scaled)
print(lasso_predict)

lasso_rmse = np.sqrt(mean_squared_error(y_test,lasso_predict))
print("Lasso Regression RMSE:",lasso_rmse)


[277314.87396588 274028.5801481  263312.50983532 ... 224586.55092893
 113354.8325373  368870.9012924 ]
Lasso Regression RMSE: 70560.00003923356


In [None]:
#8. Ridge Regression:
#- Implement Ridge Regression on the training data.
#- Predict the output for the test dataset using the fitted Ridge model.
#- Evaluate and print the RMSE for Ridge Regression.
ridge_regression = Ridge(alpha=0.3)
ridge_regression.fit(x_train_scaled,y_train)
ridge_predict = ridge_regression.predict(x_test_scaled)
print(ridge_predict)
ridge_rmse = np.sqrt(mean_squared_error(y_test,ridge_predict))
print("Ridge Regression RMSE:",ridge_rmse)


[277306.26563042 274021.42830044 263307.23971625 ... 224584.68522079
 113351.07837298 368874.02079982]
Ridge Regression RMSE: 70559.96740390857


In [None]:
#9. Elastic Net Regression:
#- Implement Elastic Net Regression on the training data.
#- Predict the output for the test dataset using the fitted Elastic Net model.
#- Evaluate and print the RMSE for Elastic Net Regression.
elastic_regression = ElasticNet(alpha=0.3, l1_ratio= 0.5)
elastic_regression.fit(x_train_scaled,y_train)
elastic_predict = elastic_regression.predict(x_test_scaled)
print(elastic_predict)
elastic_rmse = np.sqrt(mean_squared_error(y_test,elastic_predict))
print("Elastic Net Regression RMSE:",elastic_rmse)


[250058.88612489 240900.79682736 230262.2291075  ... 217722.44822775
 120361.52985551 354560.25505993]
Elastic Net Regression RMSE: 75927.58639581053
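
##### One likely reason the Elastic Net RMSE differs so much from the Ridge RMSE at the same alpha is penalty scaling. scikit-learn's ElasticNet divides the squared error by 2 * n_samples while Ridge does not, so the L2 part of ElasticNet corresponds to a Ridge alpha of n_samples * alpha * (1 - l1_ratio). For the 16512 training rows used here, that is about 2477 rather than 0.3. The sketch below illustrates the effect on synthetic data; the sample size and coefficients are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
n_samples = 2000
X = StandardScaler().fit_transform(rng.normal(size=(n_samples, 5)))
true_coef = np.array([50.0, -30.0, 20.0, 10.0, -5.0])
y = X @ true_coef + rng.normal(scale=10.0, size=n_samples)

# Same nominal alpha for both models.
ridge = Ridge(alpha=0.3).fit(X, y)
enet = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X, y)

# Ridge alpha that matches the L2 part of the ElasticNet penalty above.
equivalent_ridge_alpha = n_samples * 0.3 * (1 - 0.5)

ridge_norm = np.linalg.norm(ridge.coef_)
enet_norm = np.linalg.norm(enet.coef_)
print("equivalent Ridge alpha:", equivalent_ridge_alpha)
print("Ridge coefficient norm:      ", ridge_norm)
print("Elastic Net coefficient norm:", enet_norm)
```

##### The Elastic Net coefficients are shrunk much more strongly, which is consistent with a higher error on both training folds and test data.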


In [None]:
#10. Cross-Validation and Grid Search:
#- Apply cross-validation on the dataset to assess the models' generalization performance.
#- Perform grid search to fine-tune hyperparameters for Ridge and Lasso Regression models.
#- Discuss the results of cross-validation and grid search, providing insights into the optimal hyperparameters for the models

# Perform cross-validation for Ridge Regression Model
ridge_cross = cross_val_score(Ridge(alpha=0.3), x_train_scaled, y_train, cv=10, scoring='neg_mean_squared_error')
ridge_cv_rmse = np.sqrt(-ridge_cross.mean())
print("Ridge Regression Model Cross-Validation RMSE:",ridge_cv_rmse)

# Perform cross-validation for Lasso Regression Model
lasso_cross = cross_val_score(Lasso(alpha=0.3), x_train_scaled, y_train, cv=10, scoring='neg_mean_squared_error')
lasso_cv_rmse = np.sqrt(-lasso_cross.mean())
print("Lasso Regression Model Cross-Validation RMSE:",lasso_cv_rmse)


# Perform cross-validation for Elastic Net Regression
elastic_net_cv_scores = cross_val_score(ElasticNet(alpha=0.3, l1_ratio=0.5), x_train_scaled, y_train, cv=10, scoring='neg_mean_squared_error')
elastic_net_cv_rmse = np.sqrt(-elastic_net_cv_scores.mean())
print("Elastic Net Regression Cross-Validation RMSE:", elastic_net_cv_rmse)



Ridge Regression Model Cross-Validation RMSE: 69594.53595245
Lasso Regression Model Cross-Validation RMSE: 69594.60268664401
Elastic Net Regression Cross-Validation RMSE: 74517.0198212744


##### Plain linear regression is a simple, interpretable model with no regularization hyperparameter to tune, so cross-validation was performed only for the Ridge, Lasso, and Elastic Net regression models.
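
##### When the features are standardized before cross-validation, every validation fold has already contributed to the scaler's mean and standard deviation. A possible way to avoid this is to put the scaler and the model in one pipeline so that scaling is refit inside each fold. The sketch below uses synthetic data; make_pipeline and the data shapes are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features and a noisy linear target (illustrative only).
rng = np.random.default_rng(10)
X = rng.normal(loc=50.0, scale=15.0, size=(500, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=5.0, size=500)

# The scaler is refit on the training part of every fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=0.3))
folds = KFold(n_splits=10, shuffle=True, random_state=10)
scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")

# Per-fold RMSE also shows how much the score varies between folds.
fold_rmse = np.sqrt(-scores)
print("mean RMSE:", fold_rmse.mean())
print("std of RMSE across folds:", fold_rmse.std())
```

##### Reporting the spread across folds alongside the mean helps judge whether small RMSE differences between models are meaningful.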

In [None]:
# Grid Search
fold = KFold(n_splits=10, shuffle=True, random_state=10)
parameters = {'alpha':[0.001,0.01,0.1,0.2,0.5,0.9,1.0,5.0,10.0]}
model_cv_ridge = GridSearchCV(estimator=ridge_regression,
                            param_grid=parameters,
                            scoring='neg_mean_squared_error',
                            cv=fold,
                            return_train_score=True,
                            verbose=1)
model_cv_ridge.fit(x_train_scaled,y_train)
print(model_cv_ridge.best_params_)
ridge_grid_rmse = np.sqrt(-model_cv_ridge.best_score_)
print(ridge_grid_rmse)



Fitting 10 folds for each of 9 candidates, totalling 90 fits
{'alpha': 10.0}
69595.83234708985


In [None]:
model_cv_lasso = GridSearchCV(estimator=lasso_regression,
                            param_grid=parameters,
                            scoring='neg_mean_squared_error',
                            cv=fold,
                            return_train_score=True,
                            verbose=1)
model_cv_lasso.fit(x_train_scaled,y_train)
print(model_cv_lasso.best_params_)
lasso_grid_rmse = np.sqrt(-model_cv_lasso.best_score_)
print(lasso_grid_rmse)

Fitting 10 folds for each of 9 candidates, totalling 90 fits
{'alpha': 10.0}
69597.33594758525


#### Interpretations for Cross-Validation

##### The Ridge and Lasso models perform almost identically under 10-fold cross-validation (RMSE of about 69,594.54 and 69,594.60), with Ridge marginally lower. Both values are close to the corresponding test RMSE of about 70,560, which suggests that neither model is overfitting badly.

##### The Elastic Net model has a noticeably larger RMSE (about 74,517 in cross-validation and 75,928 on the test set). Because the error is higher on both, this points to underfitting from over-regularization rather than to overfitting. Two factors likely contribute: the additional hyperparameter l1_ratio, which sets the balance between the L1 and L2 penalties, was left at 0.5 without tuning, and in scikit-learn alpha=0.3 shrinks coefficients far more strongly in ElasticNet than the same value does in Ridge.

##### Elastic Net regression is a hybrid of Lasso and Ridge, so it may perform poorly unless alpha and l1_ratio are tuned together.
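
##### Tuning alpha and l1_ratio together can be done with the same GridSearchCV approach used for Ridge and Lasso. This is a minimal sketch on synthetic data; the grid values, fold count, and max_iter setting are assumptions, not settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data with a few irrelevant features (illustrative only).
rng = np.random.default_rng(10)
X = rng.normal(size=(400, 6))
coef = np.array([40.0, -25.0, 0.0, 0.0, 10.0, 0.0])
y = X @ coef + rng.normal(scale=5.0, size=400)

# Search over both hyperparameters at once.
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0],
    "l1_ratio": [0.1, 0.5, 0.9],
}
search = GridSearchCV(
    estimator=ElasticNet(max_iter=10_000),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=10),
)
search.fit(X, y)

best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_)
print(best_rmse)
```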

##### -----------------------------------------------------------

#### Interpretations for Grid Search
##### According to the grid search, the best alpha for both Ridge and Lasso is 10.0. Since 10.0 is the largest value in the grid, the true optimum may lie above it, and a wider grid would be needed to confirm the choice. The best grid-search RMSE values (69,595.83 for Ridge and 69,597.34 for Lasso) are marginally higher than the earlier cross-validation RMSE values, not lower. The two sets of numbers are not directly comparable, because the grid search uses shuffled folds while the earlier cross-validation does not.

##### The best RMSE for Ridge is about 1.5 lower than the best RMSE for Lasso. That difference is negligible next to an RMSE of roughly 69,600, so after hyperparameter tuning the two models are effectively equivalent on this dataset, with Ridge marginally ahead.
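
##### When the selected alpha sits on the boundary of the grid, one option is to rerun the search on a wider, log-spaced grid. The sketch below shows the idea on synthetic data; the helper on_grid_edge and the np.logspace range are hypothetical choices, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold


def on_grid_edge(best_value, grid_values):
    """Return True if the selected value is the smallest or largest grid value."""
    ordered = sorted(grid_values)
    return bool(best_value == ordered[0] or best_value == ordered[-1])


# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(10)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=2.0, size=300)

# Log-spaced grid from 1e-3 to 1e4, wider than a hand-picked list.
alphas = np.logspace(-3, 4, 8)
wide_search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=10),
)
wide_search.fit(X, y)

best_alpha = wide_search.best_params_["alpha"]
print("best alpha:", best_alpha)
print("on grid edge:", on_grid_edge(best_alpha, alphas))
```

##### If the check still reports an edge value, the grid can be extended further in that direction before drawing conclusions about the ideal alpha.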