
Data: https://drive.google.com/file/d/1OULf9w4ztck9TLdvEIqr37AwUmrZ6En-/view

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('/content/Boston_Housing.csv')

## **Problem Statement:**

You are tasked with developing a predictive model to estimate the median value of owner-occupied homes (MEDV) in various neighborhoods of Boston. To achieve this, you will use the Boston Housing dataset, which contains various features describing different aspects of the neighborhoods.

**Dataset Description:**

The Boston Housing dataset includes the following features:

1. CRIM: Per capita crime rate by town.
2. ZN: Proportion of residential land zoned for large lots.
3. INDUS: Proportion of non-retail business acres per town.
4. CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
5. NOX: Nitrogen oxide concentration (parts per 10 million).
6. RM: Average number of rooms per dwelling.
7. AGE: Proportion of owner-occupied units built before 1940.
8. DIS: Weighted distance to employment centers.
9. RAD: Index of accessibility to radial highways.
10. TAX: Full-value property tax rate per $10,000.
11. PTRATIO: Pupil-teacher ratio.
12. B: 1000(Bk - 0.63)², where Bk is the proportion of Black residents by town.
13. LSTAT: Percentage of lower status population.

**Objective:**

Your objective is to build a simple linear regression model that predicts the median value of homes (MEDV) based on a single feature, specifically the average number of rooms per dwelling (RM). The goal is to create a model that accurately estimates the MEDV using the RM feature.

**Tasks:**

1. Load the Boston Housing dataset from the provided source.

2. Preprocess the dataset by selecting the RM feature as the independent variable (X) and the MEDV as the dependent variable (y).

3. Split the dataset into training and testing sets (e.g., 80% for training and 20% for testing) to evaluate the model's performance.

4. Develop a simple linear regression model using the training data, where you predict MEDV based on RM.

5. Evaluate the model's performance on the testing data using various metrics such as RMSE, MAE, MAPE, R², and Adjusted R².

6. Visualize the model's predictions by plotting the regression line against the actual data points.

In [None]:
data

# Basic checks

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

No constant columns: no column has a standard deviation of 0.
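The same check can be made explicit instead of read off the `describe()` table. The sketch below is a minimal illustration on a made-up toy frame; the helper name `constant_columns` and the `FLAG` column are hypothetical and not part of the dataset.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Hypothetical toy frame: 'FLAG' never changes, the other columns do.
toy = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1],
    "MEDV": [24.0, 21.6, 34.7],
    "FLAG": [1, 1, 1],
})
constant = constant_columns(toy)
print(constant)
```

On the real data, `constant_columns(data)` should return an empty list if the observation above holds.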

In [None]:
data.isnull().sum()

In [None]:
data.duplicated().sum()

* No missing values
* No duplicates

# EDA

In [None]:
# RM
plt.figure(figsize=(10,3))
sns.histplot(data['RM'])

In [None]:
# MEDV
plt.figure(figsize=(10,3))
sns.histplot(data['MEDV'])


Both RM and MEDV look roughly bell-shaped, i.e. approximately normally distributed.
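A histogram gives only a visual impression, so a skewness value can back it up: values near 0 suggest a symmetric shape, positive values a longer right tail. The sketch below uses made-up numbers and a hypothetical helper `skewness` (the biased, population form); it is an illustration, not a formal normality test.

```python
import numpy as np

def skewness(values) -> float:
    """Population skewness: the mean of the cubed z-scores."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

symmetric = [1, 2, 3, 4, 5]        # made-up, perfectly symmetric sample
right_skewed = [1, 1, 1, 2, 10]    # made-up sample with a long right tail

print(skewness(symmetric))
print(skewness(right_skewed))
```

On the notebook's data, `skewness(data['RM'])` and `skewness(data['MEDV'])` would give the corresponding figures.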

In [None]:
# relationship between RM and MEDV
plt.figure(figsize=(10,3))
sns.scatterplot(x=data['RM'],y=data['MEDV'])


There is a positive, roughly linear relationship between RM and MEDV: home values tend to rise with the average number of rooms.

In [None]:
plt.figure(figsize=(10,3))
sns.heatmap(data.corr(),annot=True)

The correlation between RM and MEDV is about 0.7, a fairly strong positive linear association.
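The coefficient shown in the heatmap is the Pearson correlation. As a minimal sketch of what that number measures, the snippet below computes it from its definition on made-up values and compares it with `np.corrcoef`; the helper name `pearson_r` and the sample arrays are assumptions for illustration only.

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Made-up values standing in for RM and MEDV.
rooms = [5.0, 6.0, 6.5, 7.0, 8.0]
values = [15.0, 21.0, 22.0, 30.0, 45.0]

r = pearson_r(rooms, values)
print(r, np.corrcoef(rooms, values)[0, 1])
```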

# Split data into x and y

In [None]:
# Double brackets keep x as a 2-D DataFrame, which scikit-learn expects for features
x = data[['RM']]
y = data['MEDV']

# Split data for training and testing

* Training data: data used to fit the model
* Test data: data held out to evaluate the model
* Common split ratios: 80:20 or 70:30
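To make the 80:20 idea concrete, here is a minimal sketch of a shuffled split done by hand with NumPy. It is not scikit-learn's implementation and will not select the same rows as `train_test_split(..., random_state=42)`; the helper name `shuffle_split` and the row count of 506 (the usual size of the Boston Housing data) are assumptions.

```python
import math
import numpy as np

def shuffle_split(n_rows: int, test_size: float = 0.2, seed: int = 42):
    """Return (train_idx, test_idx) from a random permutation of row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = math.ceil(n_rows * test_size)  # round the test count up
    return order[n_test:], order[:n_test]

# Assumed row count; check data.shape for the actual figure.
train_idx, test_idx = shuffle_split(506)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, so the test rows are never seen while fitting.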

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)


In [None]:
print(x_train.shape)
print(x_test.shape)

# Linear Regression

In [None]:
# import LinearRegression
from sklearn.linear_model import LinearRegression

In [None]:
# initialise model
model = LinearRegression()

In [None]:
# Train a model
model.fit(x_train,y_train)
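For a single feature, ordinary least squares has a closed-form solution, which shows what fitting the model amounts to. The sketch below applies those formulas to made-up, noise-free values; the helper name `fit_line` and the numbers are assumptions for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept for one feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return float(slope), float(intercept)

# Made-up data lying exactly on the line y = 9 * x - 34.
rooms = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
values = 9.0 * rooms - 34.0

slope, intercept = fit_line(rooms, values)
print(slope, intercept)
```

After `model.fit(...)`, the fitted values are available as `model.coef_` and `model.intercept_` and can be compared with a hand computation like this one.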


In [None]:
# test : make predictions
y_pred=model.predict(x_test)

In [None]:
new_df = pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
new_df

# Evaluating the model

* MSE
* MAE
* RMSE
* MAPE
* R2 score
* adjusted R2 score
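
The first five metrics can each be written in a line or two of NumPy, which makes their definitions easier to compare. This is a minimal sketch on made-up numbers; the helper name `regression_metrics` is hypothetical, and MAPE is returned as a fraction rather than a percentage.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Compute MSE, RMSE, MAE, MAPE (as a fraction) and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err) / np.abs(y_true))),
        "r2": float(1 - ss_res / ss_tot),
    }

# Made-up actual and predicted values.
actual = [20.0, 30.0, 40.0]
predicted = [22.0, 27.0, 41.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAPE is undefined when an actual value is 0, so this sketch assumes all targets are non-zero.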

In [None]:
# MSE by hand: sum of squared errors divided by the number of test samples
sse = sum((y_test-y_pred)**2)
n=len(y_test)
print(sse/n)

In [None]:
# MSE using NumPy
mse = np.mean((y_test-y_pred)**2)
print(mse)

In [None]:
# Import the metric functions from scikit-learn
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score

In [None]:
# MSE
mse = mean_squared_error(y_test,y_pred)
print(mse)

In [None]:
# mae
mae = mean_absolute_error(y_test,y_pred)
print(mae)

In [None]:
# rmse
rmse = np.sqrt(mse)
print(rmse)

In [None]:
# mape (scikit-learn returns a fraction, e.g. 0.2 means 20%)
mape = mean_absolute_percentage_error(y_test,y_pred)
print(mape)

In [None]:
# r2 score
# Store the result under a new name so the imported r2_score function is not overwritten
r2 = r2_score(y_test,y_pred)
print(r2)

An R² of about 0.37 means the model explains roughly 37% of the variance in MEDV on the test set. It is not the share of predictions that are correct.

In [None]:
# Plotting the results
plt.scatter(x_test, y_test, color='black', label='Actual Data')
plt.plot(x_test, y_pred, color='blue', linewidth=3, label='Regression Line')
plt.xlabel('Number of Rooms (RM)')
plt.ylabel('Median Value (MEDV in $1000s)')
plt.title('Linear Regression on Boston Housing Dataset')
plt.legend()
plt.show()

In [None]:
# Calculate Adjusted R² (Adjusted Coefficient of Determination)
n = x_test.shape[0]  # Number of test samples
p = x_test.shape[1]  # Number of features (1 in this case)
adjusted_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))
print(adjusted_r2)

In [None]:
# Adjusted R² is never larger than R²; with a single feature the penalty is small
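
To see how small the penalty is, the adjusted R² formula can be wrapped in a function and evaluated for a few feature counts. This is a sketch under assumptions: the helper name `adjusted_r2_score` is hypothetical, and the inputs (R² of 0.37, 102 test rows) are illustrative values rather than figures read from the notebook's output.

```python
def adjusted_r2_score(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R²: penalizes R² for the number of features used."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Illustrative inputs, not results taken from the notebook.
one_feature = adjusted_r2_score(0.37, 102, 1)
five_features = adjusted_r2_score(0.37, 102, 5)
print(one_feature, five_features)
```

With the same R², adding features lowers the adjusted score, which is why it is preferred when comparing models with different numbers of predictors.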