# Machine Learning
### In this section, we will prepare our database for the machine learning process, perform linear regression, and evaluate performance.
#### Written By: Nadav Bitran Numa and Maor Bezalel

#### First, we will load the desired modules and datasets

In [82]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.metrics import r2_score,mean_squared_error,confusion_matrix,f1_score,precision_score,recall_score,accuracy_score,make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

In [2]:
df_Cities = pd.read_csv("AllCities.csv")
df_Cities_Env_Avg = pd.read_csv("AllCitiesEnvorinment (AverageFill).csv")

In [3]:
df_Cities.Property_Type = df_Cities.Property_Type.astype('category')

df_Cities.Rooms = df_Cities.Rooms.astype("category")

df_Cities.Floor = df_Cities.Floor.astype('category')

df_Cities.Square_Meter = df_Cities.Square_Meter.astype("float64")

df_Cities.Price = df_Cities.Price.astype('int64')

df_Cities['Sale_Date'] = pd.to_datetime(df_Cities['Sale_Date'],format='%Y-%m-%d',errors="coerce")

# 

# Step 1: Preparing the database
### In this part we will prepare the database for machine learning.
+ 1) We will remove unwanted columns
+ 2) We will unite the information of the two databases we have
+ 3) We will perform a final treatment of the data

# 

**Before we start merging the two databases, we would like to verify that the neighborhood names in the two databases match**

Explanation: the first database may store a neighborhood name without the prefix שכונת (Hebrew for "neighborhood of"), while the second database stores the same name with that prefix.
To be able to merge the two, we have to make sure that the names are really identical.

Example: 

in df_Cities ---> **בן גוריון, אילת**

in df_Cities_Env_Avg ---> **שכונת בן גוריון, אילת**

To find out whether such cases exist, we will compute the set difference between the neighborhood names of df_Cities and those of df_Cities_Env_Avg

In [4]:
set_Cities = set(df_Cities.Neighborhood.unique().tolist())
set_Env = set(df_Cities_Env_Avg.Neighborhood.unique().tolist())

print("set_Cities - set_Env : ", set_Cities-set_Env)

set_Cities - set_Env :  {'שכונת קרית רבין', 'הדר יוסף', 'הצפון החדש החלק הדרומי', 'שכונת הצפון הישן החלק הדרומי', 'שכונת מרכז העיר', 'שכונת משכנות גבעת זאב', 'שכונת ותיקים', 'יפו העתיקה', 'שכונת התקווה', 'שכונת רובע ט', 'שכונת קריית שרת', 'שכונת נווה שרת', "שכונת ג'סי כהן", 'שכונת נווה רמז', 'שכונת הצפון החדש סביבת כיכר המדינה'}


The result contains two groups of names:
The first group: neighborhoods that start with the word שכונת, which we will have to correct.

The second group: neighborhoods that do not start with the word שכונת. From this we conclude that there are neighborhoods for which we have data in the main database but not in the secondary one.

In [5]:
# Removing the word שכונת in the main dataset
updated = [neighborhood[6::] for neighborhood in df_Cities.loc[df_Cities['Neighborhood'].str.startswith('שכונת'),'Neighborhood']]
df_Cities.loc[df_Cities['Neighborhood'].str.startswith('שכונת'),'Neighborhood'] = updated

# Removing the word שכונת in the second dataset
updated_Env = [neighborhood[6::] for neighborhood in df_Cities_Env_Avg.loc[df_Cities_Env_Avg['Neighborhood'].str.startswith('שכונת'),'Neighborhood']]
df_Cities_Env_Avg.loc[df_Cities_Env_Avg['Neighborhood'].str.startswith('שכונת'),'Neighborhood'] = updated_Env
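
The prefix removal above can also be illustrated on plain Python sets. This is only a minimal sketch: the names in `main_names` and `env_names` are hypothetical toy values, `strip_prefix` is a helper introduced here for illustration, and `str.removeprefix` assumes Python 3.9 or later.

```python
PREFIX = "שכונת "  # the word plus its trailing space, 6 characters in total

def strip_prefix(name: str) -> str:
    """Drop the leading 'שכונת ' if present; leave other names untouched."""
    return name.removeprefix(PREFIX)

# Hypothetical toy data, not taken from the real files
main_names = {"בן גוריון", "שכונת התקווה"}
env_names = {"שכונת בן גוריון", "התקווה"}

# Before normalising, both names look unmatched
print(main_names - env_names)

main_clean = {strip_prefix(n) for n in main_names}
env_clean = {strip_prefix(n) for n in env_names}

# After normalising, the difference is empty
print(main_clean - env_clean)  # set()
```

Unlike slicing with a fixed offset, this helper only touches names that really start with the prefix, so it is safe to apply to every value.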

## 1.1: Removing Unwanted data

It can be seen that the building number column is not relevant information for machine learning, so we will drop it.
In addition, we would like to extract the year of sale from the sale date column

### 1.1.1: Removing Building Number Column

In [6]:
df_Cities.drop(columns="Building_Number",inplace=True)

### 1.1.2: Extracting Year Of Sale

In [7]:
df_Cities['Sale_Year'] = df_Cities['Sale_Date'].dt.year
df_Cities.drop(columns=['Sale_Date'],inplace=True)
df_Cities.insert(0, 'Sale_Year', df_Cities.pop('Sale_Year'))

Before we finish, we have to ask ourselves: will the rows whose street name is unknown help us in the machine learning process? The answer is probably no.
Because we want to perform the machine learning process at the apartment level (and not at the neighborhood level), we will later encode the street names as categories. As a result, all the apartments whose street name is Unknown would be grouped under a single category, which is wrong

In [8]:
df_Cities[df_Cities.Street=='Unknown']

Unnamed: 0,Sale_Year,City,Neighborhood,Street,Property_Type,Rooms,Floor,Square_Meter,Price
0,2022,אשדוד,"הקריה מע""ר",Unknown,1,5.0,8,227.0,3500000
1,2022,אשדוד,"הקריה מע""ר",Unknown,1,3.0,3,81.0,2370000
3,2022,אשדוד,"הקריה מע""ר",Unknown,1,5.0,13,137.0,3550000
4,2022,אשדוד,"הקריה מע""ר",Unknown,1,3.0,1,77.0,2099000
5,2022,אשדוד,"הקריה מע""ר",Unknown,1,5.0,3,189.0,3300000
...,...,...,...,...,...,...,...,...,...
229101,2013,תל אביב -יפו,מונטיפיורי,Unknown,1,3.0,30,65.0,2135000
229102,2013,תל אביב -יפו,מונטיפיורי,Unknown,1,3.0,30,68.0,2250000
229103,2013,תל אביב -יפו,מונטיפיורי,Unknown,1,3.0,30,76.0,2205000
229104,2013,תל אביב -יפו,מונטיפיורי,Unknown,1,3.0,30,69.0,2250000


### 1.1.3: Removing apartments with an unknown street name  

In [9]:
df_Cities.drop(index=df_Cities[df_Cities['Street']=='Unknown'].index,inplace=True)
df_Cities.reset_index(drop=True,inplace=True)


# 

## 1.2: Uniting the information of the two databases

This process requires "replicating" the values of each row in the second database across all the rows in the main database that belong to the same neighborhood.
pandas can express this as a join (`DataFrame.merge`), but in this notebook we handle it "manually" using loops

In [10]:
# Adding columns from the second database, and resetting them to NaN values
new_columns = df_Cities_Env_Avg.columns[2::]
df_Cities[new_columns] = np.nan

# Loop through all the neighborhoods in the second database,
for row in range(len(df_Cities_Env_Avg)):
    
    # while transferring the data from there to all the records 
    # in the main database that belong to that neighborhood
    for col in df_Cities_Env_Avg.columns.tolist()[2::]:
        
        df_Cities.loc[df_Cities.Neighborhood==df_Cities_Env_Avg.loc[row,'Neighborhood'],col] = df_Cities_Env_Avg.loc[row,col]

**And it works!**

In [11]:
df_Cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203191 entries, 0 to 203190
Data columns (total 21 columns):
 #   Column                               Non-Null Count   Dtype   
---  ------                               --------------   -----   
 0   Sale_Year                            203191 non-null  int64   
 1   City                                 203191 non-null  object  
 2   Neighborhood                         203191 non-null  object  
 3   Street                               203191 non-null  object  
 4   Property_Type                        203191 non-null  category
 5   Rooms                                203191 non-null  category
 6   Floor                                203191 non-null  category
 7   Square_Meter                         203191 non-null  float64 
 8   Price                                203191 non-null  int64   
 9   Schools                              200147 non-null  float64 
 10  Kindergartens_And_Dormitories        200147 non-null  float64 
 11  

### Uniting the information of the two databases - Conclusions:
+ We understood the importance of removing the apartments that do not have a street name



+ It can be seen that after merging the two databases, there are missing values in columns 9 to 20: each of the two added columns shown above has 3,044 nulls (203,191 - 200,147). The reason is probably that these rows belong to neighborhoods that **did not have** such data on the site

Therefore, we will have to delete this data in the next step
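
As a point of comparison with the neighborhood-by-neighborhood loop in this section, the same "replication" can be sketched as a left join. This is not the method used in the notebook. The two frames below are hypothetical toy data, and the sketch assumes every neighborhood appears at most once in the second table (the `validate` argument checks exactly that). If the same neighborhood name occurs in several cities, the join key would also need a city column, which we assume here without having confirmed it for the real file.

```python
import pandas as pd

# Hypothetical toy frames; column names mirror the notebook, values are invented
sales = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "C"],
    "Price": [100, 120, 90, 200],
})
env = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Schools": [3.0, 1.0],
})

# A left join keeps every sale and copies the neighborhood-level values
# onto each matching row; neighborhoods missing from `env` get NaN.
merged = sales.merge(env, on="Neighborhood", how="left", validate="many_to_one")
print(merged)
```

Neighborhood "C" has no row in `env`, so its `Schools` value is NaN, which is the same kind of gap that shows up as missing values after the merge in this section.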

# 

## 1.3: Final treatment of the data

### 1.3.1: Converting the city column to a categorical variable with numeric values

**Before**

In [12]:
df_Cities.City.value_counts()

תל אביב -יפו    63547
חולון           27254
אשדוד           21091
בת ים           20435
הרצלייה         12910
גבעתיים          9794
רעננה            9488
הוד השרון        8408
ראש העין         7190
נס ציונה         5295
רמת השרון        5172
אור יהודה        4105
שדרות            2658
אופקים           2523
אור עקיבא        1830
אילת             1218
גבעת זאב          273
Name: City, dtype: int64

In [13]:
uniques = df_Cities.City.unique()
numbers = list(range(len(uniques)))
dictionary_rep = dict(zip(uniques,numbers))
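# Assign the mapped column back instead of calling replace(..., inplace=True)
# on an attribute access, which newer pandas versions may not propagate to the frame
df_Cities['City'] = df_Cities['City'].map(dictionary_rep)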

**After**

In [14]:
df_Cities.City.value_counts()

16    63547
7     27254
0     21091
1     20435
5     12910
3      9794
12     9488
6      8408
14     7190
8      5295
13     5172
11     4105
15     2658
9      2523
10     1830
2      1218
4       273
Name: City, dtype: int64

### 1.3.2:  Converting the neighborhood column to a categorical variable with numeric values

In [15]:
uniques = df_Cities.Neighborhood.unique()
len(uniques)

248

As you can see, there are 248 unique neighborhood names in our database.


Can we in this case perform the same conversion as we did in the city column?

**The answer is no!**

Explanation: Imagine that there is data on apartments in the Ben Gurion neighborhood in Eilat,
but there is also a Ben Gurion neighborhood in Ashdod.
If we perform a plain conversion like the one we did for the city column, the data on these two different neighborhoods will be merged under the same category.
Therefore we will have to perform a more complex conversion, which takes this problem into account and solves it

In [16]:
# uniques counts the unique neighborhoods accumulated so far,
# so that every city continues the numbering where the previous one stopped
uniques = 0

# Loop through all the unique city codes we have
for city in df_Cities.City.unique().tolist():
    
    # temp_unique - contains all the unique neighborhoods in the current city
    temp_unique = df_Cities[df_Cities.City==city].Neighborhood.unique().tolist()
    # temp_number - Creates serial numbers for neighborhoods in the current city
    temp_number = [(x + uniques) for x in list(range(len(temp_unique)))]
    
    # Combine these two variables into a dictionary data structure
    temp_dictionary = dict(zip(temp_unique,temp_number))
    
    uniques+=len(temp_number)
    # Place the serial values for all the values that belong to the current city
    df_Cities.loc[df_Cities.City == city, "Neighborhood"] = df_Cities.loc[df_Cities.City == city, "Neighborhood"].replace(temp_dictionary)
    
print(uniques)

256


Now we have 256 unique neighborhoods instead of 248.
We can conclude that some neighborhood names appear in at least two different cities; the 8 extra categories they produce (256 - 248) are now encoded separately
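
The difference between encoding by name and encoding by (city, neighborhood) pair can be seen on a tiny example. This is only a sketch with hypothetical placeholder records; it numbers the pairs in order of first appearance rather than city by city as the loop above does, but it separates the categories in the same way.

```python
# Hypothetical (city, neighborhood) records; the names are placeholders
records = [
    ("Eilat", "Ben Gurion"),
    ("Ashdod", "Ben Gurion"),
    ("Ashdod", "Rova Tet"),
    ("Eilat", "Ben Gurion"),
]

# Encoding by name alone merges the two Ben Gurion neighborhoods
by_name = {}
name_codes = [by_name.setdefault(nb, len(by_name)) for _, nb in records]

# Encoding by the (city, neighborhood) pair keeps them apart
by_pair = {}
pair_codes = [by_pair.setdefault(rec, len(by_pair)) for rec in records]

print(name_codes, len(by_name))  # [0, 0, 1, 0] 2
print(pair_codes, len(by_pair))  # [0, 1, 2, 0] 3
```

The gap between the two category counts (3 versus 2 here) plays the same role as the gap between 256 and 248 in the notebook.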

### 1.3.3: Converting the street column to a categorical variable with numeric values

In [17]:
uniques = df_Cities.Street.unique()
len(uniques)

3037

As you can see, we have 3037 unique street names

However, this case is the same as the neighborhood column, so we will apply the same treatment

In [18]:
uniques = 0
for i in df_Cities.City.unique().tolist():
    
    # temp_unique - contains all the unique streets in the current city
    temp_unique = df_Cities[df_Cities.City==i].Street.unique().tolist()
    
    # temp_number - Creates serial numbers for streets in the current city
    temp_number = [(x + uniques) for x in list(range(len(temp_unique)))]
    
    # Combine these two variables into a dictionary data structure
    temp_dictionary = dict(zip(temp_unique,temp_number))
    uniques+=len(temp_number)
    
    # Place the serial values for all the values that belong to the current city
    df_Cities.loc[df_Cities.City == i, "Street"] = df_Cities.loc[df_Cities.City == i, "Street"].replace(temp_dictionary)
    
print(uniques)

4795


We now have 4795 unique streets!
This means that street names shared by at least 2 different cities account for 1758 extra categories (4795 - 3037).
This difference is significant and will help us later

Now, we will remove the rows of neighborhoods for which there is no information in columns 9-20

In [176]:
df_Cities.dropna(axis=0,inplace=True)
df_Cities.reset_index(drop=True,inplace=True)


df_Cities.Neighborhood = df_Cities.Neighborhood.astype('category')
df_Cities.Street = df_Cities.Street.astype('category')

df_Cities.Rooms = df_Cities.Rooms.astype('int64')
df_Cities.Rooms = df_Cities.Rooms.astype('category')

df_Cities.Floor = df_Cities.Floor.astype('int64')
df_Cities.Floor = df_Cities.Floor.astype('category')


## Feature Engineering

In [20]:
df_Cities['Institutions'] = df_Cities['Public_Institutions'] + df_Cities['Community_Institutions'] + df_Cities['Religious_Institutions']
df_Cities['Schools_Institutions'] = df_Cities['Schools'] + df_Cities['Non_Formal_Educational_Institutions']
df_Cities.drop(columns=['Public_Institutions','Community_Institutions','Religious_Institutions',
                       'Schools','Non_Formal_Educational_Institutions',],inplace=True)

# Step 2: The Machine Learning Process

## In this part we will perform the machine learning process
+ we will extract the target column from the database
+ we will divide the data into train and test in a ratio of 80-20
+ we will train the model

## Multiple Linear Regression 

### 2.0.1: Extracting the target column from the database 

In [163]:
columns = df_Cities.columns[df_Cities.columns != 'Price']
X = df_Cities[columns]
y = df_Cities['Price']

### 2.0.2: Dividing the data into train and test in a ratio of 80-20

In [164]:
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
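
The cell above fits the scaler on the training split only and then applies it to both splits. A minimal pure-Python sketch of what that means for a single column follows; the square-meter values are hypothetical.

```python
from statistics import fmean, pstdev

# Hypothetical square-meter values, split into train and test by hand
train = [60.0, 80.0, 100.0, 120.0]
test = [90.0, 150.0]

# The statistics come from the training split only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation, as StandardScaler uses

# Both splits are transformed with the same training statistics
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction, while the test values generally do not, because they never influenced `mu` and `sigma`.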

### 2.0.3: Training the model

In [179]:
model = LinearRegression().fit(X_train,Y_train)
y_pred = model.predict(X_test)

print("First Result:",r2_score(Y_test,y_pred))

First Result: 0.5633260989389982


### 2.0.4: Performance Evaluation

In [181]:
scores = cross_val_score(model, X_train, Y_train, scoring='r2', cv=200)
print('SCORE:',max(scores))

SCORE: 0.6608619563949935


**You can see that the best fold of the main model reached an R² of about 0.66 (the single train/test split gave about 0.56), which is not bad for housing price forecasting!**

Summary:
+ In the machine learning step, we performed a linear regression to predict apartment prices based on their features.
+ In order to get a general picture of the quality of the model itself, we performed 200-fold cross-validation.
+ We chose 200 folds because our database is quite large, so we wanted to find a balance between the quality of the performance evaluation and the variance of the data within each fold
+ These results indicate that predicting apartment prices is definitely not simple: many parameters that raise or lower an apartment's price are not necessarily in our database, and it will probably be impossible to collect all the influencing parameters in one database
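
For reference, the R² score used above is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and also contrasts the maximum and the mean of a set of fold scores. All numbers are hypothetical and are not the notebook's fold scores; the point is only that the maximum is an optimistic summary, while the mean is the more common one.

```python
from statistics import fmean

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical prices (in millions) and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(r2(y_true, y_pred))

# Hypothetical per-fold scores: the maximum is optimistic, the mean is typical
fold_scores = [0.41, 0.55, 0.58, 0.66]
print("max :", max(fold_scores))
print("mean:", fmean(fold_scores))
```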



**This model was the main model of our project. Now we will move on to other models based on additional hypotheses that we wanted to test**

# 

# 

Now, we will build several additional models and evaluate their performance.

### Logistic Regression Models:
+ 1) Predicting whether an apartment has 4 or more rooms given its price
+ 2) Predicting whether an apartment has 4 or more rooms given its Square_Meter

### GNB\KNN\DT\RF Models:
+ 3) Predicting how many rooms there are in an apartment given all its data
+ 4) Predicting the property type of an apartment given all its data

For each model we are going to perform a general performance evaluation, and a specific performance evaluation for one result that we will receive

## Extras: Other Models

### First Prediction:

In [25]:
# 2.0.1
df_Cities['Above 4 rooms'] = [1 if rooms>=4 else 0 for rooms in df_Cities['Rooms']]
X = df_Cities[['Price']]
Y = df_Cities['Above 4 rooms']

# 2.0.2
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# 2.0.3
model = LogisticRegression().fit(X_train,Y_train)
y_pred = model.predict(X_test)

# 2.0.4
scores = cross_val_score(model, X_train, Y_train, scoring='f1', cv=20)

df_Cities.drop(columns = 'Above 4 rooms',inplace=True)

### Performance Evaluation

In [26]:
print("SUCCESS RATE:",f1_score(Y_test,y_pred))
print("PRECISION RATE:",precision_score(Y_test,y_pred))
print("RECALL RATE:",recall_score(Y_test,y_pred))
print("CONF MAT:")

# confusion_matrix puts the actual classes in rows and the predictions in columns,
# with class 0 (fewer than 4 rooms) before class 1 (4 or more rooms)
cm = pd.DataFrame(confusion_matrix(Y_test,y_pred),index=['Actual Below 4','Actual 4 or more'])
cm = cm.rename(columns={0: 'Predicted Below 4',1:"Predicted 4 or more"})
cm

SUCCESS RATE: 0.7140861411692278
PRECISION RATE: 0.7343430408103633
RECALL RATE: 0.6949168164431541
CONF MAT:


Unnamed: 0,Predicted Below 4,Predicted 4 or more
Actual Below 4,12876,5455
Actual 4 or more,6620,15079


+ We got an F1 score of about 0.71
+ Out of about 20.5K apartments that the model predicted to have 4 or more rooms, it was right in about 15.1K of them (73%)
+ Out of about 21.7K apartments that actually have 4 or more rooms, the model identified about 15.1K of them (69%)

In [27]:
print("BEST SUCCESS RATE:",max(scores))
print("MEAN SUCCESS RATE:",scores.mean())

BEST SUCCESS RATE: 0.7182320441988951
MEAN SUCCESS RATE: 0.7083373893070604


# 

### Second Prediction

In [28]:
df_Cities['Above 4 rooms'] = [1 if rooms>=4 else 0 for rooms in df_Cities['Rooms']]
X = df_Cities[['Square_Meter']]
Y = df_Cities['Above 4 rooms']

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


model = LogisticRegression().fit(X_train,Y_train)
y_pred = model.predict(X_test)

scores = cross_val_score(model, X_train, Y_train, scoring='f1', cv=20)

df_Cities.drop(columns = 'Above 4 rooms',inplace=True)

### Performance Evaluation

In [29]:
print("SUCCESS RATE:",f1_score(Y_test,y_pred))
print("PRECISION RATE:",precision_score(Y_test,y_pred))
print("RECALL RATE:",recall_score(Y_test,y_pred))
print("CONF MAT:")
# confusion_matrix puts the actual classes in rows and the predictions in columns,
# with class 0 (fewer than 4 rooms) before class 1 (4 or more rooms)
cm = pd.DataFrame(confusion_matrix(Y_test,y_pred),index=['Actual Below 4','Actual 4 or more'])
cm = cm.rename(columns={0: 'Predicted Below 4',1:"Predicted 4 or more"})
cm

SUCCESS RATE: 0.9110975724060362
PRECISION RATE: 0.9173311942622564
RECALL RATE: 0.9049480984964856
CONF MAT:


Unnamed: 0,Predicted Below 4,Predicted 4 or more
Actual Below 4,16795,1752
Actual 4 or more,2042,19441


+ We got an F1 score of about 0.91
+ Out of about 21.2K apartments that the model predicted to have 4 or more rooms, it was right in about 19.4K of them (92%)
+ Out of about 21.5K apartments that actually have 4 or more rooms, the model identified about 19.4K of them (90%)

In [30]:
print("BEST SUCCESS RATE:",max(scores))
print("MEAN SUCCESS RATE:",scores.mean())

BEST SUCCESS RATE: 0.9166859523258505
MEAN SUCCESS RATE: 0.9126891175982003


# 

### Third Prediction

In [84]:
X = df_Cities[df_Cities.columns[df_Cities.columns!='Rooms']]
Y = df_Cities['Rooms']

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

KNN = KNeighborsClassifier(n_neighbors=10).fit(X_train,Y_train)
RandomForest = RandomForestClassifier().fit(X_train,Y_train)
GuassianNB = GaussianNB().fit(X_train,Y_train)
DT = DecisionTreeClassifier(max_depth=10).fit(X_train,Y_train)

KNN_pred = KNN.predict(X_test)
RandomForest_pred = RandomForest.predict(X_test)
GuassianNB_pred = GuassianNB.predict(X_test)
DT_pred = DT.predict(X_test)

#### KNN: 

In [85]:
def get_labels(shape_row,shape_col):
    Act = []
    Pred = {}
    for i in range(1,shape_row+1):
        Act.append(("Actual ",i,' Rooms'))
    for i in range(1,shape_col+1):
        Pred[i-1] = ("Pred ",i,' Rooms')
    return (Act,Pred)


labels=  get_labels(confusion_matrix(Y_test,KNN_pred).shape[0],confusion_matrix(Y_test,KNN_pred).shape[1])
cm = pd.DataFrame(confusion_matrix(Y_test,KNN_pred),index=labels[0])
cm = cm.rename(columns=labels[1])
cm

Unnamed: 0,"(Pred , 1, Rooms)","(Pred , 2, Rooms)","(Pred , 3, Rooms)","(Pred , 4, Rooms)","(Pred , 5, Rooms)","(Pred , 6, Rooms)","(Pred , 7, Rooms)","(Pred , 8, Rooms)","(Pred , 9, Rooms)","(Pred , 10, Rooms)"
"(Actual , 1, Rooms)",31,182,17,18,22,8,1,0,0,0
"(Actual , 2, Rooms)",21,5040,1233,17,3,0,0,0,0,0
"(Actual , 3, Rooms)",3,1394,8998,1367,34,1,0,0,0,0
"(Actual , 4, Rooms)",0,23,1819,9348,972,7,0,0,0,0
"(Actual , 5, Rooms)",0,1,38,1798,5582,271,9,1,0,0
"(Actual , 6, Rooms)",1,0,4,91,846,523,24,1,0,0
"(Actual , 7, Rooms)",0,0,1,9,93,90,15,0,0,0
"(Actual , 8, Rooms)",1,0,0,1,24,24,3,3,0,0
"(Actual , 9, Rooms)",0,0,0,1,8,5,2,0,0,0
"(Actual , 10, Rooms)",0,0,0,0,0,1,0,0,0,0


In [86]:
print("SCORE: ",accuracy_score(Y_test,KNN_pred))

SCORE:  0.7379465400949288


#### RandomForest:

In [87]:
labels=  get_labels(confusion_matrix(Y_test,RandomForest_pred).shape[0],confusion_matrix(Y_test,RandomForest_pred).shape[1])
cm = pd.DataFrame(confusion_matrix(Y_test,RandomForest_pred),index=labels[0])
cm = cm.rename(columns=labels[1])
cm

Unnamed: 0,"(Pred , 1, Rooms)","(Pred , 2, Rooms)","(Pred , 3, Rooms)","(Pred , 4, Rooms)","(Pred , 5, Rooms)","(Pred , 6, Rooms)","(Pred , 7, Rooms)","(Pred , 8, Rooms)","(Pred , 9, Rooms)","(Pred , 10, Rooms)"
"(Actual , 1, Rooms)",52,154,22,19,26,6,0,0,0,0
"(Actual , 2, Rooms)",36,5264,999,10,5,0,0,0,0,0
"(Actual , 3, Rooms)",2,1001,9608,1164,22,0,0,0,0,0
"(Actual , 4, Rooms)",3,12,1178,10113,858,5,0,0,0,0
"(Actual , 5, Rooms)",0,0,18,985,6345,332,19,1,0,0
"(Actual , 6, Rooms)",3,0,2,49,696,705,33,2,0,0
"(Actual , 7, Rooms)",0,0,1,6,78,91,30,1,1,0
"(Actual , 8, Rooms)",2,0,0,3,17,25,8,1,0,0
"(Actual , 9, Rooms)",0,0,0,1,7,2,3,2,1,0
"(Actual , 10, Rooms)",0,0,0,0,1,0,0,0,0,0


In [88]:
print("SCORE: ",accuracy_score(Y_test,RandomForest_pred))

SCORE:  0.8023732200849363


#### GaussianNB

In [89]:
labels=  get_labels(confusion_matrix(Y_test,GuassianNB_pred).shape[0],confusion_matrix(Y_test,GuassianNB_pred).shape[1])
cm = pd.DataFrame(confusion_matrix(Y_test,GuassianNB_pred),index=labels[0])
cm = cm.rename(columns=labels[1])
cm

Unnamed: 0,"(Pred , 1, Rooms)","(Pred , 2, Rooms)","(Pred , 3, Rooms)","(Pred , 4, Rooms)","(Pred , 5, Rooms)","(Pred , 6, Rooms)","(Pred , 7, Rooms)","(Pred , 8, Rooms)","(Pred , 9, Rooms)","(Pred , 10, Rooms)"
"(Actual , 1, Rooms)",9,186,26,8,25,17,2,6,0,0
"(Actual , 2, Rooms)",13,5396,776,14,35,79,1,0,0,0
"(Actual , 3, Rooms)",125,3356,6651,1298,222,141,1,3,0,0
"(Actual , 4, Rooms)",209,42,3098,7522,1095,184,12,7,0,0
"(Actual , 5, Rooms)",192,0,79,3379,3418,464,127,41,0,0
"(Actual , 6, Rooms)",51,0,10,109,675,426,182,37,0,0
"(Actual , 7, Rooms)",5,0,2,3,41,84,64,9,0,0
"(Actual , 8, Rooms)",3,0,0,0,7,20,18,8,0,0
"(Actual , 9, Rooms)",4,0,0,0,3,4,4,1,0,0
"(Actual , 10, Rooms)",0,0,0,0,0,1,0,0,0,0


In [90]:
print("SCORE: ",accuracy_score(Y_test,GuassianNB_pred))

SCORE:  0.5869098176367724


#### Decision Tree

In [105]:
labels=  get_labels(confusion_matrix(Y_test,DT_pred).shape[0],confusion_matrix(Y_test,DT_pred).shape[1])
cm = pd.DataFrame(confusion_matrix(Y_test,DT_pred),index=labels[0])
cm = cm.rename(columns=labels[1])
cm

Unnamed: 0,"(Pred , 1, Rooms)","(Pred , 2, Rooms)","(Pred , 3, Rooms)","(Pred , 4, Rooms)","(Pred , 5, Rooms)","(Pred , 6, Rooms)","(Pred , 7, Rooms)","(Pred , 8, Rooms)","(Pred , 9, Rooms)","(Pred , 10, Rooms)"
"(Actual , 1, Rooms)",78,122,33,16,20,7,2,1,0,0
"(Actual , 2, Rooms)",140,4910,1238,21,4,0,1,0,0,0
"(Actual , 3, Rooms)",24,1190,9225,1300,54,4,0,0,0,0
"(Actual , 4, Rooms)",13,38,1505,9500,1064,44,5,0,0,0
"(Actual , 5, Rooms)",17,0,75,1132,5845,566,55,6,4,0
"(Actual , 6, Rooms)",7,1,8,54,620,699,80,17,4,0
"(Actual , 7, Rooms)",0,0,2,6,80,74,39,5,2,0
"(Actual , 8, Rooms)",1,0,2,1,7,31,8,4,2,0
"(Actual , 9, Rooms)",0,0,1,0,8,2,3,1,1,0
"(Actual , 10, Rooms)",0,0,0,0,1,0,0,0,0,0


In [107]:
print("SCORE: ",accuracy_score(Y_test,DT_pred))

SCORE:  0.7569572820384711


### Performance Evaluation

**Models ranked by accuracy**:
+ 1) RandomForest: 80.24%
+ 2) DecisionTree: 75.70%
+ 3) KNN: 73.79%
+ 4) GNB: 58.69%


(Please note that these figures come from a single experiment and are not definitive; tuning the hyperparameters with the GridSearchCV function would likely give higher accuracy.
Since these models are not our main focus in this project, we will settle for these results)
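
For completeness, here is a minimal sketch of what such a hyperparameter search could look like for the KNN model. It was not run as part of this project: the data is synthetic (generated with `make_classification`), and the candidate values for `n_neighbors` are hypothetical, so the printed results say nothing about the apartment data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would pass its own X and Y
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled
# using only its own training part
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical grid; the candidate values are illustrative only
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 10, 20]}
grid = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```

The same pattern would apply to the decision tree (`max_depth`) or the random forest, each with its own parameter grid.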

### Fourth Prediction

In [109]:
X = df_Cities[df_Cities.columns[df_Cities.columns!='Property_Type']]
Y = df_Cities['Property_Type']

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

KNN = KNeighborsClassifier(n_neighbors=10).fit(X_train,Y_train)
RandomForest = RandomForestClassifier().fit(X_train,Y_train)
GuassianNB = GaussianNB().fit(X_train,Y_train)
DT = DecisionTreeClassifier(max_depth=10).fit(X_train,Y_train)

KNN_pred = KNN.predict(X_test)
RandomForest_pred = RandomForest.predict(X_test)
GuassianNB_pred = GuassianNB.predict(X_test)
DT_pred = DT.predict(X_test)

#### KNN:

In [126]:
labels = ['Actual בית פרטי','Actual דירה בבניין','Actual דירת גג','Actual דירת גן','Actual קוטג']
labels2 = {0:'Pred בית פרטי',
          1:'Pred דירה בבניין',
          2:'Pred דירת גג',
          3: 'Pred דירת גן',
          4: 'Pred קוטג'}
cm = pd.DataFrame(confusion_matrix(Y_test,KNN_pred),index=labels)
cm = cm.rename(columns=labels2)
cm

Unnamed: 0,Pred בית פרטי,Pred דירה בבניין,Pred דירת גג,Pred דירת גן,Pred קוטג
Actual בית פרטי,34,141,0,0,62
Actual דירה בבניין,19,37702,1,7,247
Actual דירת גג,0,228,2,0,2
Actual דירת גן,1,268,0,5,18
Actual קוטג,12,613,0,5,663


In [127]:
print("SCORE: ",accuracy_score(Y_test,KNN_pred))

SCORE:  0.9594304271796152


#### RandomForest

In [128]:
cm = pd.DataFrame(confusion_matrix(Y_test,RandomForest_pred),index=labels)
cm = cm.rename(columns=labels2)
cm

Unnamed: 0,Pred בית פרטי,Pred דירה בבניין,Pred דירת גג,Pred דירת גן,Pred קוטג
Actual בית פרטי,49,118,0,2,68
Actual דירה בבניין,35,37576,23,54,288
Actual דירת גג,0,225,5,0,2
Actual דירת גן,1,233,1,35,22
Actual קוטג,8,460,0,4,821


In [129]:
print("SCORE: ",accuracy_score(Y_test,RandomForest_pred))

SCORE:  0.9614289283037721


#### GaussianNB

In [130]:
cm = pd.DataFrame(confusion_matrix(Y_test,GuassianNB_pred),index=labels)
cm = cm.rename(columns=labels2)
cm

Unnamed: 0,Pred בית פרטי,Pred דירה בבניין,Pred דירת גג,Pred דירת גן,Pred קוטג
Actual בית פרטי,27,112,1,4,93
Actual דירה בבניין,592,34944,611,430,1399
Actual דירת גג,4,161,48,3,16
Actual דירת גן,9,174,1,35,73
Actual קוטג,29,404,3,14,843


In [131]:
print("SCORE: ",accuracy_score(Y_test,GuassianNB_pred))

SCORE:  0.896752435673245


#### Decision Tree

In [132]:
cm = pd.DataFrame(confusion_matrix(Y_test,DT_pred),index=labels)
cm = cm.rename(columns=labels2)
cm

Unnamed: 0,Pred בית פרטי,Pred דירה בבניין,Pred דירת גג,Pred דירת גן,Pred קוטג
Actual בית פרטי,13,151,1,0,72
Actual דירה בבניין,13,37601,19,15,328
Actual דירת גג,0,226,4,0,2
Actual דירת גן,1,252,1,8,30
Actual קוטג,6,548,0,6,733


In [133]:
print("SCORE: ",accuracy_score(Y_test,DT_pred))

SCORE:  0.9582563077691731


### Performance Evaluation
**Models ranked by accuracy**:
+ 1) RandomForest: 96.14%
+ 2) KNN: 95.94%
+ 3) DecisionTree: 95.83%
+ 4) GNB: 89.68%


(Please note that these figures come from a single experiment and are not definitive; tuning the hyperparameters with the GridSearchCV function would likely give higher accuracy.
Since these models are not our main focus in this project, we will settle for these results)
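
When reading these accuracy figures it helps to compare them with a majority-class baseline, because one property type dominates the test set. The sketch below derives that baseline and the per-class recall from the KNN confusion matrix of this prediction; the dictionary names are introduced here for illustration.

```python
# Row totals of the KNN confusion matrix above, i.e. how many test rows
# actually belong to each property type
actual_counts = {
    "בית פרטי": 34 + 141 + 0 + 0 + 62,
    "דירה בבניין": 19 + 37702 + 1 + 7 + 247,
    "דירת גג": 0 + 228 + 2 + 0 + 2,
    "דירת גן": 1 + 268 + 0 + 5 + 18,
    "קוטג": 12 + 613 + 0 + 5 + 663,
}
# Diagonal of the same matrix: correctly classified rows per type
correct = {"בית פרטי": 34, "דירה בבניין": 37702, "דירת גג": 2,
           "דירת גן": 5, "קוטג": 663}

total = sum(actual_counts.values())
majority = max(actual_counts, key=actual_counts.get)

# Accuracy of a model that always predicts the most common type
baseline = actual_counts[majority] / total
print("baseline accuracy:", round(baseline, 4))

# Per-class recall shows how the rare types fare
for name, n in actual_counts.items():
    print(name, round(correct[name] / n, 3))
```

Always predicting the most common type would already score close to 95%, so the differences between the four models are better judged by how they handle the rare property types than by overall accuracy alone.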