
# Predicting Housing Prices with Regularized Regression

# 1. Data Preparation


In [31]:
import pandas as pd

# Load the housing price dataset
df = pd.read_csv('housing_price_dataset.csv')



In [5]:
# Explore the data
df.head()

# Check for missing values
df.isnull().sum()

# Handle missing values
# ...

# Identify and remove outliers
# ...


price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64
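The missing-value and outlier steps are left as placeholders above. As a minimal sketch of one common approach (median imputation followed by a 1.5 × IQR filter), the following uses a small made-up DataFrame named `toy`. The values, the `toy` name, and the choice of the IQR rule are assumptions for illustration, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing DataFrame
toy = pd.DataFrame({
    "price": [100.0, 120.0, 110.0, 115.0, 5000.0, 105.0],
    "area": [50.0, 60.0, np.nan, 58.0, 55.0, 52.0],
})

# Fill missing numeric values with the column median
toy_filled = toy.fillna(toy.median(numeric_only=True))

# Keep rows whose price lies within 1.5 * IQR of the quartiles
q1, q3 = toy_filled["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = toy_filled["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
toy_clean = toy_filled[in_range]

print(toy_clean)
```

Since the check above reports no missing values for this dataset, only the outlier step would change anything here, and whether extreme prices should be dropped at all is a modelling decision.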

In [6]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
# (random_state fixes the shuffle so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(df.drop('price', axis=1), df['price'], test_size=0.25, random_state=42)


# 2. Implement Lasso Regression

In [16]:
# Select the features to use in the model
features = ['area', 'bedrooms', 'bathrooms']

# Define the independent and dependent variables
X = X_train[features]
y = y_train


In [17]:
from sklearn.linear_model import Lasso

# Create a Lasso regression model
lasso = Lasso()

# Fit the model to the training data
lasso.fit(X, y)


# c. Discuss the impact of L1 regularization on feature selection and coefficients.

L1 regularization in Lasso regression adds a penalty proportional to the sum of the absolute values of the model coefficients. As the penalty strength alpha grows, some coefficients are driven exactly to zero, so the corresponding features are effectively removed from the model (feature selection). The remaining non-zero coefficients are also shrunk toward zero, which can help to reduce overfitting. The Lasso() call above uses scikit-learn's default of alpha=1.0.
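To make the zeroing effect concrete, here is a small sketch on synthetic data where only the first two of five features influence the target. The data, the names `X_demo` and `y_demo`, and the choice of `alpha=0.5` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two columns influence the target; the other three are noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Compare the unpenalized and L1-penalized coefficients side by side
print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
```

In this setup the noise features should receive coefficients of exactly zero under Lasso, while the two informative coefficients are pulled toward zero relative to ordinary least squares.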

# 3. Evaluate the Lasso Regression

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Make predictions on the testing data, using the same feature columns as in training
y_pred = lasso.predict(X_test[features])

# Calculate the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
print('R2:', r2)



# b. Discuss how the Lasso model helps prevent overfitting and reduces the impact of irrelevant features

The Lasso model helps prevent overfitting by penalizing the sum of the absolute values of the model coefficients, which limits how closely the model can fit noise in the training data. This can lead to feature selection, where irrelevant features are assigned zero coefficients and are effectively removed from the model. Features that keep a non-zero coefficient are still shrunk toward zero, so a weakly relevant feature has less influence on the model's predictions than it would in an unpenalized fit.
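The strength of this effect depends on alpha. The sketch below, on synthetic data with 2 informative and 18 noise features, counts how many coefficients stay non-zero as alpha increases. The data, the variable names, and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(60, 20))
# Only columns 0 and 1 carry signal
y_noise = 3.0 * X_noise[:, 0] - 2.0 * X_noise[:, 1] + rng.normal(size=60)

nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_noise, y_noise)
    # Number of features the model still uses at this penalty strength
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))

print(nonzero_counts)
```

A very small alpha keeps most noise features in the model, while a very large alpha removes every feature, including the useful ones, so alpha has to be tuned rather than simply maximized.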



# 4. Implement Ridge Regression

In [27]:
# Select the features to use in the model
features = ['area', 'bedrooms', 'bathrooms']

# Define the independent and dependent variables
X = X_train[features]
y = y_train


In [28]:
from sklearn.linear_model import Ridge

# Create a Ridge regression model
ridge = Ridge()

# Fit the model to the training data
ridge.fit(X, y)


# c. Explain how L2 regularization in Ridge regression differs from L1 regularization in Lasso.

L2 regularization in Ridge regression penalizes the sum of the squared values of the model coefficients. This shrinks the coefficients of all features toward zero but, unlike L1 regularization, it rarely makes any of them exactly zero, so it does not perform feature selection. L2 regularization is helpful for handling multicollinearity among features.
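For reference, the two objectives can be written side by side. This follows the form documented for scikit-learn's Lasso and Ridge estimators, with $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the penalty strength; the intercept is omitted for brevity.

```latex
\text{Lasso (L1):}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha \,\lVert w \rVert_1

\text{Ridge (L2):}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 \;+\; \alpha \,\lVert w \rVert_2^2
```

Because the two losses are scaled differently (Lasso divides by $2n$, Ridge does not), the same numeric value of $\alpha$ does not represent the same amount of regularization in the two models.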

# 5. Evaluate the Ridge Regression Model

In [None]:
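from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Predict with the same feature columns the model was trained on
ridge_predictions = ridge.predict(X_test[features])

mae = mean_absolute_error(y_test, ridge_predictions)
mse = mean_squared_error(y_test, ridge_predictions)
rmse = np.sqrt(mse)

# Print the evaluation metrics
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)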


# **b. Discuss the benefits of Ridge regression in handling multicollinearity among features and its impact on the model's coefficients:**

Ridge regression is effective in handling multicollinearity, which occurs when features are highly correlated. It does this by adding a penalty proportional to the sum of squared coefficients to the least-squares loss. The benefits are:

Reduces the impact of multicollinearity: Ridge does not remove the correlation between features, but it spreads the weight across correlated features instead of letting one take a large coefficient that another offsets.

Stabilizes coefficient estimates: Shrinking the coefficients lowers their variance, so the fitted model is less sensitive to small changes in the training data.

No feature elimination: Unlike Lasso, Ridge does not set coefficients exactly to zero, which is useful when all features are believed to carry signal.
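The stabilizing effect can be sketched with two almost identical synthetic features. The data, the names `X_corr` and `y_corr`, and `alpha=1.0` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly a copy of x1
X_corr = np.column_stack([x1, x2])
# Both features contribute equally to the target
y_corr = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=1.0, size=100)

ols_coef = LinearRegression().fit(X_corr, y_corr).coef_
ridge_coef = Ridge(alpha=1.0).fit(X_corr, y_corr).coef_

print("OLS coefficients:  ", np.round(ols_coef, 3))
print("Ridge coefficients:", np.round(ridge_coef, 3))
```

With features this strongly correlated, ordinary least squares can only pin down the sum of the two coefficients, so the individual values tend to swing apart. Ridge keeps the sum close to 4 while pulling the two coefficients toward each other.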

# 6. Model Comparison



# b. Discuss when it is preferable to use Lasso, Ridge, or plain linear regression

Use plain linear regression when there are many more observations than features, multicollinearity is not a concern, and all features are believed to be relevant.

Use Lasso when feature selection is important and you expect only some features to matter, so that irrelevant ones can be dropped.

Use Ridge when features are correlated and you want a more stable model that retains all of them.
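One way to ground this choice in data is to compare the three models under the same cross-validation scheme. The sketch below uses synthetic data from `make_regression` rather than the housing dataset, and the alpha values are scikit-learn's defaults; all of this is an assumed setup for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: 10 features, only 3 of which are informative
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "ridge": Ridge(alpha=1.0),
}

# Mean 5-fold cross-validated R^2 for each model
cv_r2 = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in cv_r2.items():
    print(f"{name}: {score:.4f}")
```

On an easy problem like this one the three scores are likely to be close. The differences tend to matter more with few observations, many features, or strong correlation between features.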

# 7. Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Example of hyperparameter tuning for Ridge
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), param_grid, cv=5)
# Fit on the same numeric feature columns used for the models above
grid.fit(X_train[features], y_train)
optimal_alpha = grid.best_params_['alpha']
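Because the Ridge penalty acts on coefficient magnitudes, the best alpha depends on the scale of the features. A common refinement, sketched here on synthetic data, is to standardize inside a pipeline so that scaling is refit on each cross-validation fold. The pipeline, the synthetic data, and the scoring choice are assumptions and not part of the original search.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)

# make_pipeline names each step after its lowercased class, hence "ridge__alpha"
pipe = make_pipeline(StandardScaler(), Ridge())
param_grid_demo = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="neg_mean_squared_error")
search.fit(X_syn, y_syn)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Cross-validated MSE:", -search.best_score_)
```

The same pattern applies to Lasso by swapping the final pipeline step and using the `lasso__alpha` key.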


# **Diagnosing and Remedying Heteroscedasticity and Multicollinearity**







# 1. Initial Linear Regression Model

# **a. Describe the dataset and the variables you're using for predicting employee performance:**

Provide a brief overview of the dataset, including the variables used for predicting employee performance, such as experience, education level, and the number of projects completed.

In [33]:
from sklearn.linear_model import LinearRegression

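# X must contain the independent variables and y the target variable for the
# employee-performance data; reassign them first, since X and y defined earlier
# still refer to the housing training set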
reg = LinearRegression()
reg.fit(X, y)


# c. Discuss why linear regression is a suitable choice for this prediction problem.

Linear regression is a reasonable starting point for this prediction problem if the relationship between the predictor variables and the target variable is approximately linear. Predictors such as experience, education level, and number of projects completed are plausibly related to employee performance in a roughly additive, proportional way, and a linear model keeps each coefficient easy to interpret. This assumption should be checked against the residuals rather than taken for granted.

# 2. Identifying Heteroscedasticity

# **a. Explain what heteroscedasticity is in the context of linear regression.**

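Heteroscedasticity is a violation of one of the assumptions of linear regression, namely that the variance of the error term is constant across all values of the predictor variables (homoscedasticity). In other words, heteroscedasticity occurs when the spread of the residuals changes systematically with the predictors or the fitted values. The coefficient estimates remain unbiased, but the usual standard errors, and therefore confidence intervals and p-values, become unreliable.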

# b. Provide methods for diagnosing heteroscedasticity in a regression model.

There are a number of methods for diagnosing heteroscedasticity, including:

Residual plots: A residual plot is a scatter plot of the residuals against the fitted values. If the variance of the residuals is constant, the points form a band of roughly constant width around the zero line. If the variance is not constant, the plot shows a pattern, such as a fan or cone shape that widens or narrows as the fitted values increase.

Breusch-Pagan test: The Breusch-Pagan test is a statistical test for heteroscedasticity. The null hypothesis is that the error variance is constant (no heteroscedasticity), and the alternative is that the variance depends on the predictors. If the p-value of the test is less than the chosen significance level, such as 0.05, the null hypothesis is rejected and we conclude that there is evidence of heteroscedasticity.
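The test can be sketched by hand with NumPy, SciPy, and scikit-learn. The helper below, `breusch_pagan_lm`, is a hypothetical name and implements the studentized (Koenker) variant: regress the squared residuals on the predictors and compare n × R² with a chi-squared distribution. The synthetic data is made up for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

def breusch_pagan_lm(X, y):
    """Return (LM statistic, p-value) for the studentized Breusch-Pagan test."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    # Step 1: residuals from the original regression
    resid = y - LinearRegression().fit(X, y).predict(X)
    # Step 2: R^2 of the auxiliary regression of squared residuals on X
    r2_aux = LinearRegression().fit(X, resid ** 2).score(X, resid ** 2)
    # Step 3: LM = n * R^2 is chi-squared with k degrees of freedom under H0
    lm = n * r2_aux
    return lm, stats.chi2.sf(lm, df=k)

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
X_het = x.reshape(-1, 1)
y_const = 2.0 * x + rng.normal(scale=1.0, size=500)     # constant noise
y_fan = 2.0 * x + rng.normal(scale=0.5 * x, size=500)   # noise grows with x

lm_const, p_const = breusch_pagan_lm(X_het, y_const)
lm_fan, p_fan = breusch_pagan_lm(X_het, y_fan)
print("constant variance:", lm_const, p_const)
print("fan-shaped variance:", lm_fan, p_fan)
```

The fan-shaped series should yield a much larger statistic and a very small p-value. Plotting the residuals against the fitted values for the same two series would show the band and fan shapes described above.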