# ML on House Rent Dataset
This mini project performs Regression and Classification on the House Rent Dataset.

Each step of the process is explained in markdown cells and comments.

## Dataset Columns:

BHK: Number of Bedrooms, Hall, Kitchen.

Rent: Rent of the Houses/Apartments/Flats.

Size: Size of the Houses/Apartments/Flats in Square Feet.

Floor: Floor on which the House/Apartment/Flat is situated, together with the Total Number of Floors in the building (Example: Ground out of 2, 3 out of 5, etc.)

Area Type: Basis on which the Size of the Houses/Apartments/Flats is calculated: Super Area, Carpet Area or Built Area.

Area Locality: Locality of the Houses/Apartments/Flats.

City: City where the Houses/Apartments/Flats are Located.

Furnishing Status: Furnishing Status of the Houses/Apartments/Flats: Furnished, Semi-Furnished or Unfurnished.

Tenant Preferred: Type of Tenant Preferred by the Owner or Agent.

Bathroom: Number of Bathrooms.

Point of Contact: Whom to contact for more information about the Houses/Apartments/Flats.

## Dataset used: 
https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset/data

## Dependencies
To run this notebook, make sure you have the following dependencies installed:

In [1]:
# !pip install pandas==2.2.1
# !pip install scikit-learn==1.5.0

## Load House Rent Dataset
Load, inspect and clean the data for later predictions.

In [2]:
import pandas as pd

# modify csv path if needed
csv_path = "/home/user/PycharmProjects/data_analytics_uni/task_4/House_Rent_Dataset.csv"
df = pd.read_csv(csv_path)

df.head()

,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2022-05-18,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2022-05-13,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2022-05-16,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2022-07-04,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2022-05-09,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Posted On          4746 non-null   object
 1   BHK                4746 non-null   int64 
 2   Rent               4746 non-null   int64 
 3   Size               4746 non-null   int64 
 4   Floor              4746 non-null   object
 5   Area Type          4746 non-null   object
 6   Area Locality      4746 non-null   object
 7   City               4746 non-null   object
 8   Furnishing Status  4746 non-null   object
 9   Tenant Preferred   4746 non-null   object
 10  Bathroom           4746 non-null   int64 
 11  Point of Contact   4746 non-null   object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB


In [4]:
df.describe()

,BHK,Rent,Size,Bathroom
count,4746.0,4746.0,4746.0,4746.0
mean,2.08386,34993.45,967.490729,1.965866
std,0.832256,78106.41,634.202328,0.884532
min,1.0,1200.0,10.0,1.0
25%,2.0,10000.0,550.0,1.0
50%,2.0,16000.0,850.0,2.0
75%,3.0,33000.0,1200.0,2.0
max,6.0,3500000.0,8000.0,10.0


In [5]:
df.dtypes

Posted On            object
BHK                   int64
Rent                  int64
Size                  int64
Floor                object
Area Type            object
Area Locality        object
City                 object
Furnishing Status    object
Tenant Preferred     object
Bathroom              int64
Point of Contact     object
dtype: object

In [6]:
# see null counts for each column
df.isnull().sum()

Posted On            0
BHK                  0
Rent                 0
Size                 0
Floor                0
Area Type            0
Area Locality        0
City                 0
Furnishing Status    0
Tenant Preferred     0
Bathroom             0
Point of Contact     0
dtype: int64

### Turn string value columns into numeric types

* Extract 'Floor Number' and 'Total Floor' from 'Floor' column
* Turn 'Area Type', 'City', 'Furnishing Status', 'Tenant Preferred' and 'Point of Contact' into numeric values
* Drop the remaining non-numeric columns, which are not used as features: 'Floor', 'Posted On', 'Area Locality'.

In [7]:
# extract floor number and total floor from the 'Floor' column
df["Floor Number"] = df["Floor"].apply(lambda x: str(x).split()[0])
df["Total Floor"] = df["Floor"].apply(lambda x: str(x).split()[-1])

In [8]:
# map the non-numeric floor labels to integers: 'Lower' -> -2, 'Upper' -> -1, 'Ground' -> 0
df["Floor Number"] = df["Floor Number"].replace(['Lower'], -2)
df["Floor Number"] = df["Floor Number"].replace(['Upper'], -1)
df["Floor Number"] = df["Floor Number"].replace(['Ground'], 0)

In [9]:
# convert 'Floor Number' and 'Total Floor' to integers
df["Floor Number"] = pd.to_numeric(df["Floor Number"], errors='coerce').fillna(0).astype(int)
df["Total Floor"] = pd.to_numeric(df["Total Floor"], errors='coerce').fillna(0).astype(int)
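The three cells above can be checked on a handful of sample strings. The sketch below is only an illustration: the helper name `parse_floor` and the sample values are hypothetical, and it assumes that 'Floor' values look like 'Ground out of 2', '3 out of 5', 'Upper Basement out of 4' or a bare 'Ground'. It uses the same fallback of 0 for anything that cannot be parsed.

```python
import pandas as pd

# hypothetical sample values; the real column may contain other patterns
samples = pd.Series([
    "Ground out of 2",
    "3 out of 5",
    "Upper Basement out of 4",
    "Lower Basement out of 2",
    "Ground",
])

# same mapping as in the notebook: basement levels become negative floors
WORD_TO_FLOOR = {"Lower": -2, "Upper": -1, "Ground": 0}


def parse_floor(text):
    """Return (floor_number, total_floors) for one 'Floor' string."""
    tokens = str(text).split()
    first, last = tokens[0], tokens[-1]
    if first in WORD_TO_FLOOR:
        floor_number = WORD_TO_FLOOR[first]
    elif first.isdigit():
        floor_number = int(first)
    else:
        floor_number = 0  # unparseable value, same fallback as fillna(0)
    # a value without 'out of N' has no numeric last token, so fall back to 0
    total_floors = int(last) if last.isdigit() else 0
    return floor_number, total_floors


parsed = [parse_floor(value) for value in samples]
print(parsed)
```

Note that the fallback makes a bare 'Ground' indistinguishable from a building whose total floor count is really unknown, which is worth keeping in mind when interpreting 'Total Floor'.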

In [10]:
df["Area Type"].unique()

array(['Super Area', 'Carpet Area', 'Built Area'], dtype=object)

In [11]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
df["Area Type"]= label_encoder.fit_transform(df["Area Type"])
df["Area Type"].unique()

array([2, 1, 0])

In [12]:
df['City'].unique()

array(['Kolkata', 'Mumbai', 'Bangalore', 'Delhi', 'Chennai', 'Hyderabad'],
      dtype=object)

In [13]:
df['City']= label_encoder.fit_transform(df['City'])
df['City'].unique()

array([4, 5, 0, 2, 1, 3])

In [14]:
df["Furnishing Status"].unique()

array(['Unfurnished', 'Semi-Furnished', 'Furnished'], dtype=object)

In [15]:
df["Furnishing Status"]= label_encoder.fit_transform(df["Furnishing Status"])
df["Furnishing Status"].unique()

array([2, 1, 0])

In [16]:
df["Tenant Preferred"].unique()

array(['Bachelors/Family', 'Bachelors', 'Family'], dtype=object)

In [17]:
df["Tenant Preferred"]= label_encoder.fit_transform(df["Tenant Preferred"])
df["Tenant Preferred"].unique()

array([1, 0, 2])

In [18]:
df["Point of Contact"].unique()

array(['Contact Owner', 'Contact Agent', 'Contact Builder'], dtype=object)

In [19]:
df["Point of Contact"]= label_encoder.fit_transform(df["Point of Contact"])
df["Point of Contact"].unique()

array([2, 0, 1])
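Because the same `label_encoder` object is refitted for every column, it only remembers the labels of the last column it saw. If the integer codes need to be translated back later, one option is to keep a separate encoder per column. The following is a minimal sketch of that idea, not the notebook's own method; the names `demo` and `encoders` are hypothetical, and the data is a tiny frame built from the unique values printed above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# tiny stand-in frame built from the unique values shown in the outputs above
demo = pd.DataFrame({
    "Area Type": ["Super Area", "Carpet Area", "Built Area"],
    "Furnishing Status": ["Unfurnished", "Semi-Furnished", "Furnished"],
})

# keep one fitted encoder per column so the codes can be decoded later
encoders = {}
for column in demo.columns:
    encoder = LabelEncoder()
    demo[column] = encoder.fit_transform(demo[column])
    encoders[column] = encoder

# LabelEncoder assigns codes in sorted order of the labels
area_mapping = {label: code for code, label in enumerate(encoders["Area Type"].classes_)}
print(area_mapping)

# decode integer codes back into the original strings
decoded = encoders["Furnishing Status"].inverse_transform([2, 1, 0])
print(list(decoded))
```

The codes are assigned alphabetically, so they carry no meaning of their own; a linear model will still treat them as ordered numbers.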

In [20]:
# drop the non-numeric columns that are no longer needed
df.drop(columns=['Floor', 'Posted On', 'Area Locality'], inplace=True)

In [21]:
df.dtypes

BHK                  int64
Rent                 int64
Size                 int64
Area Type            int64
City                 int64
Furnishing Status    int64
Tenant Preferred     int64
Bathroom             int64
Point of Contact     int64
Floor Number         int64
Total Floor          int64
dtype: object

## Single Variable Linear Regression

Independent variable: 'Size'

Dependent variable: 'Rent'

In [22]:
from sklearn.preprocessing import minmax_scale

# scale 'Size' and 'Rent' to the [0, 1] range; the errors reported below are in these scaled units
df["Size"] = minmax_scale(df["Size"])
df["Rent"] = minmax_scale(df["Rent"])
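For reference, `minmax_scale` maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below reproduces this by hand and shows how to undo the scaling; the rent values in it are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# made-up rent values, for illustration only
rent = np.array([1200.0, 10000.0, 16000.0, 33000.0])

scaled = minmax_scale(rent)

# the same transformation written out by hand
manual = (rent - rent.min()) / (rent.max() - rent.min())

# inverse transformation: back to the original units
restored = scaled * (rent.max() - rent.min()) + rent.min()

print(scaled)
print(restored)
```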

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[['Size']]
y = df['Rent']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

In [24]:
# the model fitted in the previous cell is reused here
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("Training Score:", train_score)
print("Testing Score:", test_score)

Training Score: 0.15453829729062218
Testing Score: 0.3193923624358027


In [25]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 0.00033014737314407085
R-squared: 0.3193923624358027


### Single Variable Linear Regression Results:

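This model uses only the size of the property to predict rent. Its R-squared is about 0.155 on the training set and 0.319 on the test set, so size alone explains only a small part of the variation in rent and many factors that affect rent prices are missed. A test score above the training score is unusual and may be an artefact of the single 90/10 split, for example if the most extreme rents ended up in the training set.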
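Since 'Rent' was min-max scaled before fitting, the mean squared error printed above is in scaled units. A rough conversion back to the original rent units is sketched below. It assumes the scaling used the minimum and maximum shown in the `df.describe()` output (1200 and 3500000), which holds here because the whole column was scaled at once.

```python
import math

# minimum and maximum of 'Rent' from the df.describe() output above
rent_min, rent_max = 1200.0, 3_500_000.0

# mean squared error printed for the single variable model (scaled units)
mse_scaled = 0.00033014737314407085

# RMSE scales linearly with the range, MSE with its square
rmse_scaled = math.sqrt(mse_scaled)
rmse_rent = rmse_scaled * (rent_max - rent_min)

print(f"RMSE in original rent units: {rmse_rent:.0f}")
```

Comparing this figure with the median rent of 16000 gives a more tangible sense of the error than the scaled MSE does.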

## Multiple Variable Linear Regression

Independent variables: every column except for 'Rent'

Dependent variable: 'Rent'

In [26]:
X = df.drop('Rent', axis=1)
y = df['Rent']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

In [27]:
# the model fitted in the previous cell is reused here
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("Training Score:", train_score)
print("Testing Score:", test_score)

Training Score: 0.2761320535456504
Testing Score: 0.43500573742326765


In [28]:
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 0.0002740659395165622
R-squared: 0.43500573742326765


### Multiple Variable Linear Regression Results:

This model uses every column except 'Rent' as a feature. Its R-squared is about 0.276 on the training set and 0.435 on the test set. It fits better than the single variable model because it considers more factors, but it still leaves most of the variation in rent unexplained.
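Both linear models score higher on the test set than on the training set, which suggests that the scores depend on how the single split happened to fall. One way to check this is k-fold cross-validation, which reports one score per fold. The sketch below uses synthetic data from `make_regression` as a stand-in, since the CSV is not loaded here; the variable names and the choice of 5 folds are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the rent data (assumption: the real CSV is not loaded here)
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=42)

# one R-squared value per fold instead of a single train/test split
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print("R-squared per fold:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`; a large spread between folds would indicate that a single split is not a reliable estimate.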


## Decision Tree Regression

Independent variables: every column except for 'Rent'

Dependent variable: 'Rent'

In [29]:
from sklearn.tree import DecisionTreeRegressor

X = df.drop('Rent', axis=1)
y = df['Rent']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor()
model.fit(X_train, y_train)

In [30]:
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("Training Score:", train_score)
print("Testing Score:", test_score)

Training Score: 0.9998476428721843
Testing Score: 0.43072730863694864


In [31]:
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 0.000185332675602548
R-squared: 0.43072730863694864
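The training score of about 0.9998 next to a test score of about 0.43 indicates that the unrestricted tree has largely memorised the training data. A common way to limit this is to restrict the depth of the tree. The sketch below compares an unrestricted tree with a depth-limited one on synthetic data; the data, the variable names and `max_depth=4` are arbitrary assumptions, and a shallower tree is not guaranteed to score better on the test set.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for the rent data (assumption: the real CSV is not loaded here)
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# max_depth=None grows the tree until the leaves are pure; 4 is an arbitrary limit
scores = {}
for depth in (None, 4):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

Setting `random_state` also makes the tree reproducible between runs, which the notebook's `DecisionTreeRegressor()` call does not do.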


## Logistic Regression

Independent variables: 'BHK', 'Size', 'Floor Number', 'Bathroom'

Dependent variable: 'High Rent' (1 if 'Rent' is above its 75th percentile, otherwise 0)

In [32]:
from sklearn.linear_model import LogisticRegression

threshold = df['Rent'].quantile(0.75)  
df['High Rent'] = (df['Rent'] > threshold).astype(int)

features = ['BHK', 'Size', 'Floor Number', 'Bathroom']
target = 'High Rent'

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=42, max_iter=1000)

model.fit(X_train, y_train)

In [33]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Accuracy Score:", accuracy)
print("===============================================================================")
print('Confusion Matrix:', conf_matrix)
print("===============================================================================")
print('Classification Report:', class_report)

Accuracy Score: 0.8726315789473684
Confusion Matrix: [[692  33]
 [ 88 137]]
Classification Report:               precision    recall  f1-score   support

           0       0.89      0.95      0.92       725
           1       0.81      0.61      0.69       225

    accuracy                           0.87       950
   macro avg       0.85      0.78      0.81       950
weighted avg       0.87      0.87      0.87       950
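The figures in the classification report can be derived directly from the confusion matrix. The sketch below recomputes accuracy, precision, recall and F1 for class 1 from the matrix printed above, using scikit-learn's layout in which rows are true labels and columns are predicted labels.

```python
import numpy as np

# confusion matrix printed above: rows = true class, columns = predicted class
conf_matrix = np.array([[692, 33],
                        [88, 137]])

tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)   # share of predicted high-rent listings that really are high rent
recall = tp / (tp + fn)      # share of actual high-rent listings that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The recall of 0.61 means that 88 of the 225 high-rent listings in the test set were classified as low rent.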



## Decision Tree Classifier

Independent variables: 'BHK', 'Size', 'Floor Number', 'Bathroom'

Dependent variable: 'High Rent' (1 if 'Rent' is above its 60th percentile, otherwise 0)

In [34]:
from sklearn.tree import DecisionTreeClassifier

threshold = df['Rent'].quantile(0.6)  
df['High Rent'] = (df['Rent'] > threshold).astype(int)

features = ['BHK', 'Size', 'Floor Number', 'Bathroom']
target = 'High Rent'

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=41)
model = DecisionTreeClassifier()

model.fit(X_train, y_train)

In [35]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Accuracy Score:", accuracy)
print("===============================================================================")
print('Confusion Matrix:', conf_matrix)
print("===============================================================================")
print('Classification Report:', class_report)

Accuracy Score: 0.7789473684210526
Confusion Matrix: [[491  85]
 [125 249]]
Classification Report:               precision    recall  f1-score   support

           0       0.80      0.85      0.82       576
           1       0.75      0.67      0.70       374

    accuracy                           0.78       950
   macro avg       0.77      0.76      0.76       950
weighted avg       0.78      0.78      0.78       950
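The two classifiers use different thresholds (75th and 60th percentile), so their class balance differs and their accuracies are not directly comparable. A simple reference point is the accuracy of always predicting the majority class. The sketch below computes that baseline from the two confusion matrices printed above; the dictionary keys and variable names are hypothetical.

```python
import numpy as np

# confusion matrices printed above: rows = true class, columns = predicted class
matrices = {
    "logistic regression (75th percentile)": np.array([[692, 33], [88, 137]]),
    "decision tree (60th percentile)": np.array([[491, 85], [125, 249]]),
}

results = {}
for name, cm in matrices.items():
    total = cm.sum()
    accuracy = np.trace(cm) / total
    # accuracy obtained by always predicting the most frequent true class
    baseline = cm.sum(axis=1).max() / total
    results[name] = (accuracy, baseline)
    print(f"{name}: accuracy={accuracy:.3f}, baseline={baseline:.3f}, gain={accuracy - baseline:.3f}")
```

Measured against its own baseline, the decision tree gains more than its lower raw accuracy suggests, because its target is more balanced.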

