# Rent Price Prediction Using Machine Learning

### This notebook:
#### - Loads rental dataset
#### - Cleans and encodes categorical data
#### - Trains a Linear Regression model
#### - Evaluates model performance
#### - Allows interactive rent prediction


# Import necessary Libraries

In [25]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load CSV file
### pd.read_csv is the function used to read a data file in CSV format.
### sep=';' defines the separator. Since the file uses ; to separate values, it was important to specify this in the code to avoid parsing errors.
### index_col=False ensures that no column of the file is used as the index, so every column is treated as data.
### df.head() prints the first rows of the dataset; in this case I chose the first 7.

In [76]:
df = pd.read_csv('/Users/collinchimene/Desktop/Class/Rentprice.csv', sep=';', index_col=False)
df.head(7)

Unnamed: 0,City,Location,Propertytype,Area_m2,Furnished,Peoplesharing,Floor,Bathroom,Kitchen,Parking,WIFI,Gym,Rent(Euro)
0,Berlin,Alfred-Jung-Straße,Studio,21,Yes,0,2,1,1,Yes,Yes,No,749
1,Berlin,Motzstraße,Studio,21,Yes,0,4,1,1,No,Yes,No,700
2,Berlin,Adlershof,Studio,17,Yes,0,3,1,1,Yes,Yes,Yes,859
3,Berlin,Hohenzollerndamm,Shared Apartment,11,Yes,7,1,2,1,No,Yes,No,590
4,Berlin,Nürnberger Straße,Shared Apartment,11,Yes,2,4,1,1,No,Yes,No,600
5,Berlin,Keithstraße,Studio,17,Yes,0,5,1,1,Yes,Yes,No,899
6,Berlin,Keibelstraße,Shared Apartment,11,Yes,3,2,1,1,Yes,Yes,No,660
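The effect of the separator can be checked on a small in-memory sample. This is only a sketch: the two rows and the reduced column set below are made up for illustration and are not read from Rentprice.csv.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout
sample_csv = io.StringIO(
    "City;Propertytype;Area_m2;Rent(Euro)\n"
    "Berlin;Studio;21;749\n"
    "Berlin;Shared Apartment;11;590\n"
)

# With the correct separator, each field becomes its own column
parsed = pd.read_csv(sample_csv, sep=";", index_col=False)

# With the default separator (a comma), each whole line lands in one column
sample_csv.seek(0)
unparsed = pd.read_csv(sample_csv)

print(parsed.shape)
print(unparsed.shape)
```

Comparing the two shapes is a quick way to confirm that the separator matches the file before doing any further work.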


### Here we get the shape: the dataset has 20 rows and 13 columns.

In [77]:
df.shape

(20, 13)

### Here we drop the features that we consider unnecessary for training the model, using the drop function to remove those columns.

In [78]:
df1 = df.drop(['City','Location','Floor','Gym','WIFI'], axis='columns')
df1.head()

Unnamed: 0,Propertytype,Area_m2,Furnished,Peoplesharing,Bathroom,Kitchen,Parking,Rent(Euro)
0,Studio,21,Yes,0,1,1,Yes,749
1,Studio,21,Yes,0,1,1,No,700
2,Studio,17,Yes,0,1,1,Yes,859
3,Shared Apartment,11,Yes,7,2,1,No,590
4,Shared Apartment,11,Yes,2,1,1,No,600


# Data Cleaning Process
### We use this step to check whether the data contains missing values (NaN). isnull() marks every missing entry and .sum() counts them for each column. A 0 in every column means that all values are present, so no rows need to be cleaned or dropped for this reason. Note that isnull() does not detect misspelled words or implausible numbers; those have to be checked separately.


In [29]:
df1.isnull().sum()

Propertytype     0
Area_m2          0
Furnished        0
Peoplesharing    0
Bathroom         0
Kitchen          0
Parking          0
Rent(Euro)       0
dtype: int64
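As a small illustration of what the missing-value check does and does not catch, the sketch below uses a made-up four-row frame (not the rental data) containing one NaN and one misspelled label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one misspelled label
toy = pd.DataFrame({
    "Furnished": ["Yes", "Yess", "No", np.nan],
    "Area_m2": [21, 17, 11, 9],
})

# Counts NaN entries per column; the misspelling is not counted
missing_per_column = toy.isnull().sum()

# Listing the distinct labels is one way to spot misspellings
label_counts = toy["Furnished"].value_counts()

print(missing_per_column)
print(label_counts)
```

Running value_counts() on each text column is a cheap complement to isnull(), since unexpected labels show up as extra categories.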

# Use of the function get_dummies
### We use this function to take the values of Propertytype and convert them into 2 columns containing 0 and 1. Notice that if we do not pass dtype=int, recent pandas versions return Boolean values instead, so 0 would be False and 1 would be True. For that reason I made sure to add it to the function call.

In [30]:
dummies = pd.get_dummies(df1.Propertytype, dtype=int)
dummies

Unnamed: 0,Shared Apartment,Studio
0,0,1
1,0,1
2,0,1
3,1,0
4,1,0
5,0,1
6,1,0
7,1,0
8,1,0
9,0,1
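The difference made by dtype=int can be seen on a short made-up series. This is a sketch, and the default dtype is version dependent (Boolean in pandas 2.x, as far as I know), so only the integer variant is relied on here.

```python
import pandas as pd

# Hypothetical three-row sample of the Propertytype column
property_type = pd.Series(
    ["Studio", "Shared Apartment", "Studio"], name="Propertytype"
)

# One 0/1 column per category, in sorted category order
as_int = pd.get_dummies(property_type, dtype=int)

# Without dtype, the column type depends on the installed pandas version
as_default = pd.get_dummies(property_type)

print(as_int)
print(as_default.dtypes)
```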


# Use of the function concat
### We use this function to join DataFrames. Since we built a separate table with the new features taken from the Propertytype column, we now use concat to join that new table with the first table.

In [31]:
joindataframe = pd.concat([df1, dummies], axis='columns')
joindataframe.head()

Unnamed: 0,Propertytype,Area_m2,Furnished,Peoplesharing,Bathroom,Kitchen,Parking,Rent(Euro),Shared Apartment,Studio
0,Studio,21,Yes,0,1,1,Yes,749,0,1
1,Studio,21,Yes,0,1,1,No,700,0,1
2,Studio,17,Yes,0,1,1,Yes,859,0,1
3,Shared Apartment,11,Yes,7,2,1,No,590,1,0
4,Shared Apartment,11,Yes,2,1,1,No,600,1,0
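One detail worth keeping in mind is that concat along the columns matches rows by index label, not by position. The sketch below uses two tiny made-up frames to show both the aligned and the misaligned case.

```python
import pandas as pd

# Hypothetical frames sharing the same index labels
left = pd.DataFrame({"Area_m2": [21, 11]}, index=[0, 1])
right = pd.DataFrame({"Studio": [1, 0]}, index=[0, 1])

# Rows with equal index labels are placed side by side
joined = pd.concat([left, right], axis="columns")

# If the labels differ, the result is padded with NaN
shifted = right.set_axis([1, 2], axis="index")
misaligned = pd.concat([left, shifted], axis="columns")

print(joined)
print(misaligned)
```

In the notebook both tables come from the same df1, so their indexes match and no NaN values are introduced.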


### Now that we no longer need the Propertytype column, we can drop it, because its content has already been turned into columns of its own.

In [32]:
final = joindataframe.drop(['Propertytype'], axis='columns')
final.head()

Unnamed: 0,Area_m2,Furnished,Peoplesharing,Bathroom,Kitchen,Parking,Rent(Euro),Shared Apartment,Studio
0,21,Yes,0,1,1,Yes,749,0,1
1,21,Yes,0,1,1,No,700,0,1
2,17,Yes,0,1,1,Yes,859,0,1
3,11,Yes,7,2,1,No,590,1,0
4,11,Yes,2,1,1,No,600,1,0


### We use the replace function to replace the misspelled word Yess with Yes.

In [33]:
final['Furnished'] = final['Furnished'].replace('Yess', 'Yes')
final

Unnamed: 0,Area_m2,Furnished,Peoplesharing,Bathroom,Kitchen,Parking,Rent(Euro),Shared Apartment,Studio
0,21,Yes,0,1,1,Yes,749,0,1
1,21,Yes,0,1,1,No,700,0,1
2,17,Yes,0,1,1,Yes,859,0,1
3,11,Yes,7,2,1,No,590,1,0
4,11,Yes,2,1,1,No,600,1,0
5,17,Yes,0,1,1,Yes,899,0,1
6,11,Yes,3,1,1,Yes,660,1,0
7,10,Yes,4,1,1,No,620,1,0
8,9,Yes,6,2,1,No,630,1,0
9,17,Yes,0,1,1,No,850,0,1


In [34]:
categorical_cols = ['Furnished', 'Parking']

### We now use LabelEncoder to convert the Yes/No values in the Furnished and Parking columns into 0s and 1s. fit_transform learns the labels of a column and transforms the data in one step.

In [35]:

for col in categorical_cols:
    le = LabelEncoder()
    final[col] = le.fit_transform(final[col])

final.head()

Unnamed: 0,Area_m2,Furnished,Peoplesharing,Bathroom,Kitchen,Parking,Rent(Euro),Shared Apartment,Studio
0,21,1,0,1,1,1,749,0,1
1,21,1,0,1,1,0,700,0,1
2,17,1,0,1,1,1,859,0,1
3,11,1,7,2,1,0,590,1,0
4,11,1,2,1,1,0,600,1,0


In [36]:
X = final.drop('Rent(Euro)', axis='columns')
X

Unnamed: 0,Area_m2,Furnished,Peoplesharing,Bathroom,Kitchen,Parking,Shared Apartment,Studio
0,21,1,0,1,1,1,0,1
1,21,1,0,1,1,0,0,1
2,17,1,0,1,1,1,0,1
3,11,1,7,2,1,0,1,0
4,11,1,2,1,1,0,1,0
5,17,1,0,1,1,1,0,1
6,11,1,3,1,1,1,1,0
7,10,1,4,1,1,0,1,0
8,9,1,6,2,1,0,1,0
9,17,1,0,1,1,0,0,1


#### I had to rename the column because of the errors I was getting due to the "()": with attribute access, as in final.Rent(Euro), Python does not treat the parentheses as part of the column name.

In [37]:
final = final.rename(columns={"Rent(Euro)": "Rent_Euro"})
y = final.Rent_Euro
y


0      749
1      700
2      859
3      590
4      600
5      899
6      660
7      620
8      630
9      850
10     540
11     580
12     500
13     540
14     560
15     420
16    1050
17     450
18     925
19     610
Name: Rent_Euro, dtype: int64
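As an alternative to renaming, bracket indexing accepts any column name, including one with parentheses. The sketch below uses a made-up three-row frame to show that both routes return the same values.

```python
import pandas as pd

# Hypothetical frame with the original column name
rents = pd.DataFrame({"Rent(Euro)": [749, 700, 859]})

# Bracket access works with parentheses in the name
y_bracket = rents["Rent(Euro)"]

# Attribute access needs a name that is a valid Python identifier
renamed = rents.rename(columns={"Rent(Euro)": "Rent_Euro"})
y_attr = renamed.Rent_Euro

print(y_bracket.tolist())
print(y_attr.tolist())
```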

In [38]:
# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
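For reference, a split like this is normally used by fitting on the training part and scoring on the held-out part. The sketch below does that on synthetic data; the single feature, the linear rule and the noise level are assumptions made up for the example, not properties of the rental dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 rows, rent roughly linear in area (assumed)
rng = np.random.default_rng(0)
area = rng.integers(9, 36, size=20)
rent = 30 * area + 100 + rng.normal(0, 20, size=20)
X_demo = pd.DataFrame({"Area_m2": area})
y_demo = pd.Series(rent, name="Rent_Euro")

# Same split settings as in the notebook: 80% train, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit only on the training rows, then score on both parts
demo_model = LinearRegression().fit(X_tr, y_tr)
train_r2 = demo_model.score(X_tr, y_tr)
test_r2 = demo_model.score(X_te, y_te)

print(len(X_tr), len(X_te))
print(train_r2, test_r2)
```

The score on the held-out rows is the one that indicates how the model may behave on listings it has not seen, although with only 4 test rows it is a rough estimate.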

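#### The Linear Regression model is trained here. Note that it is fitted on the full dataset (X, y) and not on X_train and y_train, so the split above is not used.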

In [39]:
model = LinearRegression()
# Train model
model.fit(X, y)

LinearRegression()


In [40]:
X.head()

Unnamed: 0,Area_m2,Furnished,Peoplesharing,Bathroom,Kitchen,Parking,Shared Apartment,Studio
0,21,1,0,1,1,1,0,1
1,21,1,0,1,1,0,0,1
2,17,1,0,1,1,1,0,1
3,11,1,7,2,1,0,1,0
4,11,1,2,1,1,0,1,0


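#### First prediction made from this model. The input values follow the column order of X (Area_m2, Furnished, Peoplesharing, Bathroom, Kitchen, Parking, Shared Apartment, Studio). The predicted prices are in euros and fall within the range of rents seen in the dataset (420 to 1050).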

In [41]:
model.predict([[28,1,0,1,1,0,0,1]])



array([863.38257929])

In [42]:
model.predict([[35,0,2,2,1,1,1,0]])



array([719.08708238])

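#### This is the model's score, the coefficient of determination R². A value of about 0.88 means the model explains roughly 88% of the variance in rent. It is measured on the same data the model was trained on, and it is not a percentage of correct predictions.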

In [43]:
model.score(X,y)

0.8782947083583302
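For a regressor, score returns R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. The sketch below recomputes it by hand on five made-up points to show that both values agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical area/rent pairs, used only to illustrate the formula
X_toy = np.array([[9.0], [11.0], [17.0], [21.0], [28.0]])
y_toy = np.array([630.0, 600.0, 850.0, 749.0, 900.0])

toy_model = LinearRegression().fit(X_toy, y_toy)
pred = toy_model.predict(X_toy)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(toy_model.score(X_toy, y_toy))
```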

#### I imported the OneHotEncoder preprocessor to try to improve the accuracy of the rent price predictions.

In [68]:
from sklearn.preprocessing import OneHotEncoder
new = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

In [71]:
Xn = new.fit_transform(X)  # X is a DataFrame, which has no toarray() method

In [72]:
model.fit(Xn, y)

LinearRegression()

#### With the OneHotEncoder features the score (R²) is almost 1.0. Because it is measured on the training data, and one-hot encoding every column of a 20-row dataset gives the model nearly one feature per distinct value, this mostly shows how closely the model fits the data it has already seen. It does not by itself guarantee accurate predictions for new listings.

In [73]:
model.score(Xn, y)

0.9968963988149804