# Predicting Housing Prices with Regularized Regression

1. Data Preparation:

a. Load the dataset using pandas.
b. Explore and clean the data. Handle missing values and outliers.
c. Split the dataset into training and testing sets.

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('housing.csv')

#checking the missing values
data.isna().sum()

df=data.fillna(method='bfill')
data.isna().sum()



Education             0
JoiningYear           0
City                  0
Projects Completed    0
Age                   0
Gender                0
EverBenched           0
Experience            0
LeaveOrNot            0
dtype: int64

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('housing.csv')

# Define your features (X) and target vaiable (y)
X = df[['area', 'bedrooms', 'bathrooms', 'stories']]  
y = df['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Training data -X - shape:\t',X_train.shape)
print()
print('Training data -Y - shape:\t',_.shape)
print()
print('Testing data shape\n')
print('testing data(x-input) shape :\t',X_test.shape)
print()
print('testing data(Y-input) shape :\t',y_test.shape)

Training data -X - shape:	 (436, 4)

Training data -Y - shape:	 (9,)

Testing data shape

testing data(x-input) shape :	 (109, 4)

testing data(Y-input) shape :	 (109,)


2. Implement Lasso Regression:

a. Choose a set of features (independent variables, X) and house prices as the dependent variable (y).
b. Implement Lasso regression using scikit-learn to predict house prices based on the selected features.
c. Discuss the impact of L1 regularization on feature selection and coefficients.

In [4]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_error

# Create a Lasso regression model
lasso = Lasso(alpha=0.01)  # You can adjust the alpha (penalty parameter) for regularization

# Fit the model to the training data
lasso.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lasso.predict(X_test)


In [10]:
# Assuming you have a trained Lasso model (lasso) and new data in a DataFrame (new_data)
new_data = pd.DataFrame({
    'area': [6000],
    'bedrooms': [4],
    'bathrooms': [1],
    'stories': [2],
    # Add values for other features
})

# Make predictions for new data
predicted_prices_lasso = lasso.predict(new_data)

print("Predicted House Prices (Lasso Model):")
for i, price in enumerate(predicted_prices_lasso):
    print(f"Prediction {i+1}: ${price:.2f}")


Predicted House Prices (Lasso Model):
Prediction 1: $4954326.83


3. Evaluate the Lasso Regression Model:

a. Calculate the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) for the Lasso regression model.
b. Discuss how the Lasso model helps prevent overfitting and reduces the impact of irrelevant features.

In [5]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)


Mean Absolute Error: 1158970.4803865917
Mean Squared Error: 2457741643358.9106
Root Mean Squared Error: 1567718.6110265166


4. Implement Ridge Regression:

a. Select the same set of features as independent variables (X) and house prices as the dependent variable (y).
b. Implement Ridge regression using scikit-learn to predict house prices based on the selected features.
c. Explain how 12 regularization in Ridge regression differs from L1 regularization in Lasso

In [13]:
from sklearn.linear_model import Ridge

# Create a Ridge regression model
ridge = Ridge(alpha=1.0)  # You can adjust the alpha (penalty parameter) for regularization

# Fit the model to the training data
ridge.fit(X_train, y_train)

# Make predictions on the test data
y_pred_ridge = ridge.predict(X_test)



In [12]:
# Assuming you have a trained Ridge model (ridge) and new data in a DataFrame (new_data)
new_data = pd.DataFrame({
    'area': [6000],
    'bedrooms': [4],
    'bathrooms': [1],
    'stories': [2],
    # Add values for other features
})

# Make predictions for new data
predicted_prices = ridge.predict(new_data)

print("Predicted House Prices:")
for i, price in enumerate(predicted_prices):
    print(f"Prediction {i+1}: ${price:.2f}")

Predicted House Prices:
Prediction 1: $4961460.04


5. Evaluate the Ridge Regression Model:

a. Calculate the MAE, MSE, and RMSE for the Ridge regression model.
b. Discuss the benefits of Ridge regression in handling multicollinearity among features and its impact on the model's coefficients.

In [11]:
# Calculate evaluation metrics
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
rmse_ridge = mean_squared_error(y_test, y_pred_ridge, squared=False)

print("Ridge Regression Metrics:")
print("Mean Absolute Error:", mae_ridge)
print("Mean Squared Error:", mse_ridge)
print("Root Mean Squared Error:", rmse_ridge)


Ridge Regression Metrics:
Mean Absolute Error: 1158471.4534767317
Mean Squared Error: 2456765538413.5244
Root Mean Squared Error: 1567407.266288352


6. Model Comparison:

a. Compare the results of the Lasso and Ridge regression models.
b. Discuss when it is preferable to use Lasso, Ridge, or plain linear regression.

7. Hyperparameter Tuning:

a. Explore hyperparameter tuning for Lasso and Ridge, such as the strength of regularization, and discuss how different hyperparameters affect the models.

8. Model Improvement:

a. Investigate any feature engineering or data preprocessing techniques that can enhance the performance of the regularized regression models.

9. Conclusion:

a. Summarize the findings and provide insights into how Lasso and Ridge regression can be valuable tools for estimating house prices and handling complex datasets.

# Diagnosing and Remedying Heteroscedasticity and Multicollinearity

1. Initial Linear Regression Model:

a. Describe the dataset and the variables you're using for predicting employee performance.
b. Implement a simple linear regression model to predict employee performance.

In [46]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load your dataset into a pandas DataFrame
data = pd.read_csv('Employee_new.csv')



In [37]:
data.head()

Unnamed: 0,Education,JoiningYear,City,Projects_Completed,Age,Gender,EverBenched,Experience,LeaveOrNot,salary
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0,7420
1,Bachelors,2013,Pune,1,28,Female,No,3,1,8960
2,Bachelors,2014,New Delhi,3,38,Female,No,2,0,9960
3,Masters,2016,Bangalore,3,27,Male,No,5,1,7500
4,Masters,2017,Pune,3,24,Male,Yes,2,1,7420


In [38]:
#encoding education

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data['Education']=le.fit_transform(data['Education'])
data.head()

Unnamed: 0,Education,JoiningYear,City,Projects_Completed,Age,Gender,EverBenched,Experience,LeaveOrNot,salary
0,0,2017,Bangalore,3,34,Male,No,0,0,7420
1,0,2013,Pune,1,28,Female,No,3,1,8960
2,0,2014,New Delhi,3,38,Female,No,2,0,9960
3,1,2016,Bangalore,3,27,Male,No,5,1,7500
4,1,2017,Pune,3,24,Male,Yes,2,1,7420


In [39]:
data.isna().sum()

Education             0
JoiningYear           0
City                  0
Projects_Completed    0
Age                   0
Gender                0
EverBenched           0
Experience            0
LeaveOrNot            0
salary                0
dtype: int64

In [40]:
df=data.fillna(method='bfill')
data.isna().sum()

Education             0
JoiningYear           0
City                  0
Projects_Completed    0
Age                   0
Gender                0
EverBenched           0
Experience            0
LeaveOrNot            0
salary                0
dtype: int64

In [45]:
# Define your predictor variables (X) and target variable (Y) with the correct column names.
X = data[['Education','Experience','Projects_Completed']]
Y = data['Age']
x

# Fit the initial linear regression model.
model = LinearRegression()
model.fit(X, Y)




2. Identifying Heteroscedasticity:

a. Explain what heteroscedasticity is in the context of linear regression.
b. Provide methods for diagnosing heteroscedasticity in a regression model.
c. Apply these diagnostic methods to your model's residuals and report your findings.

In [48]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(x):
    vif = pd.DataFrame()
    vif["variables"]=x.columns
    vif["VIF"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
    return(vif)

In [49]:
k=calc_vif(x)
k

Unnamed: 0,variables,VIF
0,Experience,3.985165
1,Education,1.218883
2,Projects Completed,4.066063


In [None]:
X = df.drop(['Experience'],axis=1)
calc_vif(X)