In [47]:
import pandas as pd

df = pd.read_csv('cleaned_bengalur.csv')

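df.info()          # prints the column summary
df.describe()      # not displayed: a cell only echoes its last expression
df.isnull().sum()  # missing-value count per column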

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13257 entries, 0 to 13256
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    13257 non-null  int64  
 1   area_type     13257 non-null  object 
 2   availability  13257 non-null  object 
 3   location      13257 non-null  object 
 4   total_sqft    13257 non-null  float64
 5   bath          13257 non-null  float64
 6   balcony       13257 non-null  float64
 7   price         13257 non-null  float64
 8   bhk           13257 non-null  int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 932.3+ KB


Unnamed: 0      0
area_type       0
availability    0
location        0
total_sqft      0
bath            0
balcony         0
price           0
bhk             0
dtype: int64

In [48]:
df['price_per_sqft'] = df['price'] * 100000 / df['total_sqft']
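
The factor of 100000 in the cell above only makes sense if price is recorded in lakhs (1 lakh = 100,000 rupees), which is an assumption here based on the later mentions of ₹ Lakhs. Under that assumption the new column is in rupees per square foot. A minimal sketch with made-up listings (the `toy` frame is hypothetical, not part of the dataset):

```python
import pandas as pd

# Hypothetical listings: price in lakhs of rupees, area in square feet
toy = pd.DataFrame({"price": [50.0, 120.0], "total_sqft": [1000.0, 1500.0]})

# 1 lakh = 100,000 rupees, so the result is rupees per square foot
toy["price_per_sqft"] = toy["price"] * 100000 / toy["total_sqft"]
print(toy)
```

A 50 lakh flat of 1000 sqft comes out at 5000 rupees per sqft, and a 120 lakh flat of 1500 sqft at 8000.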

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13257 entries, 0 to 13256
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      13257 non-null  int64  
 1   area_type       13257 non-null  object 
 2   availability    13257 non-null  object 
 3   location        13257 non-null  object 
 4   total_sqft      13257 non-null  float64
 5   bath            13257 non-null  float64
 6   balcony         13257 non-null  float64
 7   price           13257 non-null  float64
 8   bhk             13257 non-null  int64  
 9   price_per_sqft  13257 non-null  float64
dtypes: float64(5), int64(2), object(3)
memory usage: 1.0+ MB


In [50]:
# Drop the leftover index column that was written into the CSV
df = df.drop(columns='Unnamed: 0')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13257 entries, 0 to 13256
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   area_type       13257 non-null  object 
 1   availability    13257 non-null  object 
 2   location        13257 non-null  object 
 3   total_sqft      13257 non-null  float64
 4   bath            13257 non-null  float64
 5   balcony         13257 non-null  float64
 6   price           13257 non-null  float64
 7   bhk             13257 non-null  int64  
 8   price_per_sqft  13257 non-null  float64
dtypes: float64(5), int64(1), object(3)
memory usage: 932.3+ KB


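Step 1: Prepare Features and Target (X and y)
Before training, we decide what data the model uses to make predictions (the features) and what it is trying to predict (the target).
Features (X): The input variables the model uses to make a prediction. In our case, these are total_sqft, bath, balcony, and bhk, plus the encoded text columns area_type, availability, and location.

Target (y): The value we want to predict. Here, it's the price.

Scikit-learn models need numeric input, so we must convert the text columns into a numerical format. A common way to do this is one-hot encoding, which creates a new binary (0 or 1) column for each unique value. Called without a columns argument, pd.get_dummies encodes every object column in the DataFrame, and drop_first=True omits the first category of each column because it is implied when all the others are 0. We also leave price_per_sqft out of X: it is computed from price, so keeping it would leak the target into the features.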

In [51]:
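# One-hot encode every text column; drop_first omits one redundant column per feature
df_encoded = pd.get_dummies(df, drop_first=True)
# Exclude the target and price_per_sqft (derived from the target) from the features
X = df_encoded.drop(['price', 'price_per_sqft'], axis='columns')
y = df_encoded['price']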
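
To see what one-hot encoding produces on a small scale, here is a minimal sketch on a made-up frame. The `toy` data and the location names in it are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical mini-dataset with one text column
toy = pd.DataFrame({
    "location": ["Whitefield", "Hebbal", "Whitefield", "Indiranagar"],
    "total_sqft": [1200.0, 900.0, 1500.0, 1100.0],
})

# One binary column per location; drop_first=True omits the first
# category in sorted order ("Hebbal"), which becomes the implicit baseline
encoded = pd.get_dummies(toy, columns=["location"], drop_first=True)
print(encoded.columns.tolist())
```

A row with 0 in both remaining location columns is a "Hebbal" row, so no information is lost by dropping that column.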

Step 2: Split the Data into Training and Testing Sets
To check if your model is accurate, you need to test it on data it has never seen before. We split our dataset into two parts:

Training Set: Used to train the model.

Testing Set: Used to evaluate the trained model's performance.

We'll use Scikit-learn's train_test_split function for this. A common split is 80% for training and 20% for testing.

In [52]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
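
As a quick sanity check of what the split does, the sketch below runs train_test_split on ten made-up rows. The arrays `X_toy` and `y_toy` are hypothetical stand-ins for X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10
)
print(X_tr.shape, X_te.shape)
```

Eight rows land in the training set and two in the test set, and every row appears in exactly one of them.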

Step 3: Create and Train the Linear Regression Model 🤖
Now for the main event! We will import the LinearRegression class, create an instance of the model, and train it using the .fit() method on our training data.

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create an instance of the Linear Regression model
lr_model = LinearRegression()

# Train the model using the training data
lr_model.fit(X_train, y_train)

# Evaluate the model on the test data
score = lr_model.score(X_test, y_test)

print(f"Model R-squared score: {score}")

y_predicted = lr_model.predict(X_test)

mse = mean_squared_error(y_test, y_predicted)
rmse = np.sqrt(mse) # RMSE is often easier to interpret

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Model R-squared score: 0.4202147552544194
Mean Squared Error (MSE): 13260.61
Root Mean Squared Error (RMSE): 115.15
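
For a scikit-learn regressor, .score() returns the coefficient of determination, R-squared = 1 - SS_res / SS_tot, so a value of 0.42 means the model accounts for roughly 42% of the variance in the test prices. A minimal sketch of that formula, using made-up prices rather than values from the dataset, checked against sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted prices
y_true = np.array([50.0, 80.0, 120.0, 200.0])
y_pred = np.array([60.0, 75.0, 110.0, 180.0])

# Residual sum of squares: error left over after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean price
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```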


Step 4: Evaluate the Model's Performance

Now it's time to answer the most important question: "Is my model any good?"
A loss function is a method for calculating a single "error score" that tells you how wrong your model's predictions are. The lower the score on unseen data, the better the model. For this housing price project, we use one of the most common loss functions for regression: Mean Squared Error (MSE).

Error: For each house, the error is the difference between the actual price and the price the model predicted (actual price - predicted price).

Squared: We square each error. This does two things: it makes all the errors positive (so they don't cancel each other out) and it heavily punishes large errors. A single error of ₹10 Lakhs contributes 100 to the sum of squares, while ten errors of ₹1 Lakh contribute only 10 in total.

Mean: We take the average of all these squared errors. This gives us a single, final score that represents the overall "wrongness" of our model. Because price appears to be recorded in lakhs, the MSE is in lakhs squared; taking the square root (RMSE) brings it back to lakhs.
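
The three steps above can be sketched in a few lines of NumPy. The `mse` helper and the prices below are illustrative assumptions; in the notebook itself, sklearn's mean_squared_error does this work.

```python
import numpy as np

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

# Hypothetical prices in lakhs for ten houses
actual = np.full(10, 100.0)

# Model A: off by 1 lakh on every house
pred_a = actual - 1.0
# Model B: exact on nine houses, off by 10 lakhs on one
pred_b = actual.copy()
pred_b[0] -= 10.0

print(mse(actual, pred_a))  # 1.0
print(mse(actual, pred_b))  # 10.0
```

Both models miss by 10 lakhs in total, yet Model B's MSE is ten times larger because its error is concentrated in one large miss.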

In [54]:
from sklearn.ensemble import RandomForestRegressor

# No random_state is set, so the scores will vary slightly between runs
rfr_model = RandomForestRegressor()
rfr_model.fit(X_train, y_train)

score = rfr_model.score(X_test, y_test)
print(f"Model R-squared score: {score}")

y_predicted = rfr_model.predict(X_test)
mse = mean_squared_error(y_test, y_predicted)
rmse = np.sqrt(mse) # RMSE is often easier to interpret

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Model R-squared score: 0.6528357617377477
Mean Squared Error (MSE): 7940.20
Root Mean Squared Error (RMSE): 89.11


In [55]:
from sklearn.ensemble import GradientBoostingRegressor

gbr_model = GradientBoostingRegressor()
gbr_model.fit(X_train, y_train)

score = gbr_model.score(X_test, y_test)
print(f"Model R-squared score: {score}")

y_predicted = gbr_model.predict(X_test)
mse = mean_squared_error(y_test, y_predicted)
rmse = np.sqrt(mse) # RMSE is often easier to interpret

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Model R-squared score: 0.5953556855336994
Mean Squared Error (MSE): 9254.86
Root Mean Squared Error (RMSE): 96.20
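
The three model cells above repeat the same fit-and-score steps. One possible way to avoid the repetition is a small helper, sketched here. The `evaluate` function is a hypothetical name, and the data comes from make_regression purely as a stand-in, so the numbers it prints say nothing about the housing dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model, then return R-squared, MSE, and RMSE on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {"r2": r2_score(y_test, y_pred), "mse": mse, "rmse": np.sqrt(mse)}

# Synthetic stand-in data, not the Bengaluru dataset
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
# train_test_split returns X_train, X_test, y_train, y_test in that order
splits = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

results = evaluate(LinearRegression(), *splits)
print(results)
```

With the notebook's own split, the same helper could be called as evaluate(RandomForestRegressor(), X_train, X_test, y_train, y_test) so that all models are scored in exactly the same way.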
