# Linear Regression - PSC

<img src="https://drek4537l1klr.cloudfront.net/serrano/v-4/Figures/image027.png" height=500 width=500>

> **Problem Statement**: Predict house prices in Delhi localities using this [dataset](https://www.kaggle.com/datasets/neelkamal692/delhi-house-price-prediction), then evaluate the model.

In [202]:
import opendatasets as od
import pandas as pd
import numpy as np

In [203]:
od.download("https://www.kaggle.com/datasets/neelkamal692/delhi-house-price-prediction")

Skipping, found downloaded files in "./delhi-house-price-prediction" (use force=True to force download)


In [186]:
df = pd.read_csv('delhi-house-price-prediction/MagicBricks.csv')

In [187]:
df

| | Area | BHK | Bathroom | Furnishing | Locality | Parking | Price | Status | Transaction | Type | Per_Sqft |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 800.0 | 3 | 2.0 | Semi-Furnished | Rohini Sector 25 | 1.0 | 6500000 | Ready_to_move | New_Property | Builder_Floor | NaN |
| 1 | 750.0 | 2 | 2.0 | Semi-Furnished | J R Designers Floors, Rohini Sector 24 | 1.0 | 5000000 | Ready_to_move | New_Property | Apartment | 6667.0 |
| 2 | 950.0 | 2 | 2.0 | Furnished | Citizen Apartment, Rohini Sector 13 | 1.0 | 15500000 | Ready_to_move | Resale | Apartment | 6667.0 |
| 3 | 600.0 | 2 | 2.0 | Semi-Furnished | Rohini Sector 24 | 1.0 | 4200000 | Ready_to_move | Resale | Builder_Floor | 6667.0 |
| 4 | 650.0 | 2 | 2.0 | Semi-Furnished | Rohini Sector 24 carpet area 650 sqft status R... | 1.0 | 6200000 | Ready_to_move | New_Property | Builder_Floor | 6667.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1254 | 4118.0 | 4 | 5.0 | Unfurnished | Chittaranjan Park | 3.0 | 55000000 | Ready_to_move | New_Property | Builder_Floor | 12916.0 |
| 1255 | 1050.0 | 3 | 2.0 | Semi-Furnished | Chittaranjan Park | 3.0 | 12500000 | Ready_to_move | Resale | Builder_Floor | 12916.0 |
| 1256 | 875.0 | 3 | 3.0 | Semi-Furnished | Chittaranjan Park | 3.0 | 17500000 | Ready_to_move | New_Property | Builder_Floor | 12916.0 |
| 1257 | 990.0 | 2 | 2.0 | Unfurnished | Chittaranjan Park Block A | 1.0 | 11500000 | Ready_to_move | Resale | Builder_Floor | 12916.0 |


In [188]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Area         1259 non-null   float64
 1   BHK          1259 non-null   int64  
 2   Bathroom     1257 non-null   float64
 3   Furnishing   1254 non-null   object 
 4   Locality     1259 non-null   object 
 5   Parking      1226 non-null   float64
 6   Price        1259 non-null   int64  
 7   Status       1259 non-null   object 
 8   Transaction  1259 non-null   object 
 9   Type         1254 non-null   object 
 10  Per_Sqft     1018 non-null   float64
dtypes: float64(4), int64(2), object(5)
memory usage: 108.3+ KB


In [204]:
df.isna().sum()

Area           0
BHK            0
Bathroom       0
Furnishing     0
Locality       0
Parking        0
Price          0
Status         0
Transaction    0
Type           0
Per_Sqft       0
dtype: int64

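`df.isna().sum()` reports the number of missing values in each column. The zeros shown here come from a later run of this cell (execution count 204, after the cleaning steps below had already been applied). In the raw data, the `df.info()` output above implies 2 missing values in `Bathroom`, 5 in `Furnishing`, 33 in `Parking`, 5 in `Type`, and 241 in `Per_Sqft`.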

### Cleaning the dataset

In [190]:
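df['Furnishing'] = df['Furnishing'].fillna(df['Furnishing'].mode().iloc[0])

In [191]:
df['Bathroom'] = df['Bathroom'].fillna(df['Bathroom'].mode().iloc[0])

Mode imputation suits `Furnishing`, a categorical column, and is also a reasonable choice for `Bathroom`, which is numeric but takes only a few discrete values. Filling with the most frequent value keeps the imputed entries within the set of values already observed, and with only 5 and 2 missing entries respectively it barely changes either column's distribution. The result is assigned back to the column because recent pandas versions warn about calling `fillna(..., inplace=True)` on a column selection.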
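As a minimal sketch of what mode imputation does, the snippet below applies it to a small made-up DataFrame. The `toy` frame and its values are assumptions for illustration only and are not taken from `MagicBricks.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, not from the real dataset)
toy = pd.DataFrame({
    "Furnishing": ["Furnished", "Semi-Furnished", None, "Semi-Furnished"],
    "Bathroom": [2.0, None, 2.0, 3.0],
})

# mode() ignores missing values and returns a Series (ties are possible),
# so .iloc[0] picks the first most-frequent value
furnishing_mode = toy["Furnishing"].mode().iloc[0]
bathroom_mode = toy["Bathroom"].mode().iloc[0]

# Assign the filled column back instead of modifying a selection in place
toy["Furnishing"] = toy["Furnishing"].fillna(furnishing_mode)
toy["Bathroom"] = toy["Bathroom"].fillna(bathroom_mode)

print(toy)
```

When two values tie for most frequent, `mode()` returns them in sorted order, so `.iloc[0]` silently picks the smallest one; that is worth checking on columns with many ties.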

In [192]:
df['Parking'] = df['Parking'].fillna(0)

This assumes that a missing `Parking` value means the listing has no parking space, so the missing entries are filled with 0.

In [193]:
df['Type'] = df['Type'].fillna('NA')

Missing `Type` values are filled with the placeholder string 'NA' (not available). This keeps the affected rows in the dataset and marks them as their own category instead of guessing a property type.

In [194]:
df['Per_Sqft'] = df.groupby(['Locality'])['Per_Sqft'].transform(lambda x: x.fillna(x.min()))

In [195]:
df['Per_Sqft'] = df.groupby(['Area'])['Per_Sqft'].transform(lambda x: x.fillna(x.min()))

In [196]:
df['Per_Sqft'] = df.groupby(['Transaction'])['Per_Sqft'].transform(lambda x: x.fillna(x.min()))

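The three passes fill `Per_Sqft` in sequence, each acting as a fallback for the one before it: first with the minimum within the same `Locality`, then within the same `Area`, and finally within the same `Transaction` type. A group whose values are all missing has no minimum, so its rows stay missing until a later pass covers them. The approach assumes that listings sharing a locality, area, or transaction type have similar per-square-foot rates, and using the group minimum makes the filled values low-end estimates.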
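The fallback behaviour of successive group-wise fills can be sketched on a small made-up frame. The `toy` data below is hypothetical, and only two of the three passes are shown to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): locality "B" has no known Per_Sqft at all
toy = pd.DataFrame({
    "Locality": ["A", "A", "A", "B", "B"],
    "Transaction": ["Resale", "Resale", "New_Property", "Resale", "New_Property"],
    "Per_Sqft": [6000.0, np.nan, 7000.0, np.nan, np.nan],
})

# Pass 1: fill with the minimum of the same locality.
# Locality "B" is entirely missing, so its minimum is NaN and nothing is filled.
toy["Per_Sqft"] = toy.groupby("Locality")["Per_Sqft"].transform(
    lambda s: s.fillna(s.min())
)
after_locality = toy["Per_Sqft"].copy()

# Pass 2: fall back to the minimum of the same transaction type
toy["Per_Sqft"] = toy.groupby("Transaction")["Per_Sqft"].transform(
    lambda s: s.fillna(s.min())
)

print(after_locality.tolist())
print(toy["Per_Sqft"].tolist())
```

After the first pass the two rows from locality "B" are still missing; the second pass fills them from the `Resale` and `New_Property` minimums.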

In [197]:
df.isna().sum()

Area           0
BHK            0
Bathroom       0
Furnishing     0
Locality       0
Parking        0
Price          0
Status         0
Transaction    0
Type           0
Per_Sqft       0
dtype: int64

No more missing values in the dataset!

### Splitting the dataset

In [198]:
from sklearn.model_selection import train_test_split

This imports the `train_test_split` function from the `sklearn.model_selection` module. It is commonly used in machine learning to split a dataset into training and testing sets.

In [199]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

- `df`: Your original DataFrame containing the dataset you want to split.
- `test_size=0.2`: This parameter specifies the proportion of the dataset that should be included in the test split. In this case, it's set to 20%, meaning 80% of the data will be used for training (`train_df`), and 20% will be used for testing (`test_df`).
- `random_state=42`: This parameter sets a seed for the random number generator, ensuring reproducibility. If you use the same random state (42 in this case) in the future, you'll get the same split.

After running this code, we'll have two DataFrames:

- `train_df`: This DataFrame contains 80% of the data and is typically used for training machine learning models.
- `test_df`: This DataFrame contains 20% of the data and is reserved for evaluating the performance of your trained models.

These subsets are helpful for training a model on one portion of the data and assessing its performance on another unseen portion, helping to ensure the generalization of your model to new, unseen data.
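The effect of `test_size` and `random_state` can be checked on a small made-up frame. This is a sketch with hypothetical data: it splits the same 10 rows twice and compares the results.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows (hypothetical values)
toy = pd.DataFrame({"Area": range(10), "Price": range(100, 110)})

# Two splits with the same seed
first_train, first_test = train_test_split(toy, test_size=0.2, random_state=42)
second_train, second_test = train_test_split(toy, test_size=0.2, random_state=42)

# 80% / 20% of 10 rows
print(len(first_train), len(first_test))

# Same seed, same rows in the test set
print(first_test.index.equals(second_test.index))

# No row appears in both subsets
print(set(first_train.index) & set(first_test.index))
```

The rows are shuffled before splitting by default, so the test set is a random sample, not the last 20% of the frame.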

In [76]:
train_df.columns

Index(['Area', 'BHK', 'Bathroom', 'Furnishing', 'Locality', 'Parking', 'Price',
       'Status', 'Transaction', 'Type', 'Per_Sqft'],
      dtype='object')

For now, we select only the numerical columns as model inputs. The categorical columns (`Furnishing`, `Locality`, `Status`, `Transaction`, `Type`) are left for later.

In [200]:
train_inputs = train_df[['Area', 'BHK', 'Bathroom', 'Parking', 'Per_Sqft']]

In [86]:
train_targets = train_df[['Price']]

In [87]:
test_inputs = test_df[['Area', 'BHK', 'Bathroom', 'Parking', 'Per_Sqft']]

In [88]:
test_targets = test_df[['Price']]

### Fitting the model

1. **Importing LinearRegression:**
   - `from sklearn.linear_model import LinearRegression`: This line imports the `LinearRegression` class from scikit-learn's linear_model module. `LinearRegression` fits an ordinary least squares model, which predicts a target variable as a weighted sum of one or more predictor variables plus an intercept.

In [183]:
from sklearn.linear_model import LinearRegression

2. **Creating a Linear Regression Model:**
   - `linear = LinearRegression()`: This line creates an instance of the `LinearRegression` model and assigns it to the variable `linear`. This instance will be used to store the trained model.


In [184]:
linear = LinearRegression()

3. **Fitting the Model:**
   - `linear.fit(train_inputs, train_targets)`: This line fits (trains) the linear regression model using the training inputs (`train_inputs`) and the corresponding target values (`train_targets`). The model learns the coefficients and intercept that best describe the relationship between the inputs and the targets.

In [92]:
linear.fit(train_inputs,train_targets)
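To make "learning the coefficients and intercept" concrete, here is a minimal sketch on made-up data generated from a known rule. The `toy_inputs`, `toy_targets`, and `toy_model` names and all values are assumptions for illustration; the fitted model should recover the rule used to build the targets.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy inputs (hypothetical values)
toy_inputs = pd.DataFrame({
    "Area": [500.0, 800.0, 1000.0, 1200.0, 1500.0],
    "BHK": [1, 2, 2, 3, 4],
})

# Targets built from a known rule: Price = 5000 * Area + 100000 * BHK + 250000
toy_targets = 5000 * toy_inputs["Area"] + 100000 * toy_inputs["BHK"] + 250000

toy_model = LinearRegression()
toy_model.fit(toy_inputs, toy_targets)

# One coefficient per input column, plus a single intercept
print(dict(zip(toy_inputs.columns, toy_model.coef_)))
print(toy_model.intercept_)
```

Because the notebook passes a one-column DataFrame as the target (`train_df[['Price']]`), the fitted `linear.coef_` there is a 2D array with one row, and `linear.intercept_` is a one-element array.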

### Making Predictions

In [96]:
train_predictions = linear.predict(train_inputs)

In [95]:
test_predictions = linear.predict(test_inputs)

`linear.predict()` uses the trained linear regression model (`linear`) to make predictions on the inputs. The resulting predictions are stored in the respective variables (`train_predictions` and `test_predictions`).

### Evaluating the Model

In [97]:
from sklearn.metrics import mean_squared_error

`from sklearn.metrics import mean_squared_error`: This line imports the `mean_squared_error` function from scikit-learn's metrics module. The function is used to compute the mean squared error between actual and predicted values.

In [100]:
np.sqrt(mean_squared_error(train_targets, train_predictions))

15633062.56043585

In [101]:
np.sqrt(mean_squared_error(test_targets, test_predictions))

17752287.814827345

`mean_squared_error(targets, predictions)` returns the mean squared error (MSE) between the actual target values (`targets`) and the predicted values (`predictions`). Taking `np.sqrt` of the result gives the root mean squared error (RMSE). Older scikit-learn versions also accept `squared=False` to return the RMSE directly, but that parameter was deprecated in scikit-learn 1.4 in favor of `root_mean_squared_error`, so the explicit square root is the more portable form.


The RMSE is a commonly used metric to evaluate the performance of regression models. It represents the square root of the average squared differences between predicted and actual values. Lower RMSE values indicate better model performance.
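The definition can be verified by hand on a few made-up numbers. In this sketch the `actual` and `predicted` arrays are hypothetical; the manual calculation follows the definition step by step and is compared with scikit-learn's result.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical prices and predictions
actual = np.array([5e6, 12e6, 20e6])
predicted = np.array([6e6, 10e6, 23e6])

# Step by step: differences -> squares -> mean -> square root
errors = predicted - actual
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, the single 3-million miss contributes more to the result than the two smaller misses combined. The RMSE is expressed in the same unit as the target, which is what allows it to be compared with house prices directly.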

In [102]:
df.describe()

| | Area | BHK | Bathroom | Parking | Price | Per_Sqft |
|---|---|---|---|---|---|---|
| count | 1259.0 | 1259.0 | 1259.0 | 1259.0 | 1259.0 | 1259.0 |
| mean | 1466.452724 | 2.796664 | 2.555203 | 1.911041 | 21306700.0 | 14193.106434 |
| std | 1568.05504 | 0.954425 | 1.041627 | 6.19811 | 25601150.0 | 19635.233678 |
| min | 28.0 | 1.0 | 1.0 | 1.0 | 1000000.0 | 1259.0 |
| 25% | 800.0 | 2.0 | 2.0 | 1.0 | 5700000.0 | 5345.0 |
| 50% | 1200.0 | 3.0 | 2.0 | 1.0 | 14200000.0 | 10000.0 |
| 75% | 1700.0 | 3.0 | 3.0 | 2.0 | 25500000.0 | 15556.0 |
| max | 24300.0 | 10.0 | 7.0 | 114.0 | 240000000.0 | 183333.0 |


> The training RMSE (about 15.6 million) and test RMSE (about 17.8 million) are roughly 73% and 83% of the mean price (21,306,700), so the typical prediction error is large relative to a typical house price.
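The comparison can be made explicit with a little arithmetic. The sketch below copies the RMSE values and the `Price` mean and standard deviation from the outputs above; note that `df.describe()` covers the whole dataset, not just the training or test split, so the ratios are only a rough yardstick.

```python
# Figures copied from the notebook outputs above
train_rmse = 15633062.56043585
test_rmse = 17752287.814827345
mean_price = 21306700.0
std_price = 25601150.0

# RMSE as a fraction of the mean price
train_vs_mean = train_rmse / mean_price
test_vs_mean = test_rmse / mean_price

# RMSE as a fraction of the price standard deviation;
# a model that always predicted the mean would score close to 1.0 here
train_vs_std = train_rmse / std_price
test_vs_std = test_rmse / std_price

print(f"RMSE / mean price: train {train_vs_mean:.2f}, test {test_vs_mean:.2f}")
print(f"RMSE / price std:  train {train_vs_std:.2f}, test {test_vs_std:.2f}")
```

The second pair of ratios is below 1, so the model does better than a constant mean prediction even though the absolute error is large.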

In [103]:
from sklearn.metrics import r2_score

`from sklearn.metrics import r2_score`: This line imports the `r2_score` function from scikit-learn's metrics module. R-squared measures the proportion of the variance in the dependent variable (the target, `Price`) that the model explains using the independent variables (the input features).

`r2_score(targets, predictions)` calculates the R-squared by comparing the actual target values (`targets`) with the predicted values (`predictions`).

An R-squared of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the mean of the targets. `r2_score` can also return a negative value for a model that does worse than that. A higher value means that a larger proportion of the variance in the target variable is explained by the model.
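As a sketch of how the score is computed, the snippet below evaluates R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and checks it against `r2_score`. The arrays and the `r2_manual` helper are hypothetical. The second case shows that scikit-learn's `r2_score` can fall below 0 when predictions are worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and two sets of predictions
actual = np.array([10.0, 20.0, 30.0, 40.0])
good_predictions = np.array([12.0, 18.0, 33.0, 37.0])
bad_predictions = np.array([40.0, 30.0, 20.0, 10.0])  # order reversed


def r2_manual(y_true, y_pred):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


r2_good = r2_manual(actual, good_predictions)
r2_bad = r2_manual(actual, bad_predictions)

print(r2_good, r2_score(actual, good_predictions))
print(r2_bad, r2_score(actual, bad_predictions))
```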

In [106]:
r2_score(train_targets, train_predictions)

0.6110304057736029

In [107]:
r2_score(test_targets, test_predictions)

0.5832756892300516

> A value of 0.611 for the training set and 0.583 for the testing set means the model explains about 61% and 58% of the variance in price, leaving roughly 40% unexplained. The small gap between the two scores suggests the model is not overfitting heavily.