# Machine Learning
### In this section, we will prepare our database for the machine learning process, perform linear regression, and evaluate performance.
#### Written By: Nadav Bitran Numa and Maor Bezalel

#### First, we will load the desired modules and datasets

In [660]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.metrics import r2_score,mean_squared_error,confusion_matrix,f1_score,precision_score,recall_score,accuracy_score,make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

In [661]:
df_Cities = pd.read_csv("AllCities.csv")
df_Cities_Env_Avg = pd.read_csv("AllCitiesEnvorinment (AverageFill).csv")

In [662]:
df_Cities.Property_Type = df_Cities.Property_Type.astype('category')

df_Cities.City = df_Cities.City.astype('category')

df_Cities.Rooms = df_Cities.Rooms.astype("category")

df_Cities.Floor = df_Cities.Floor.astype('category')

df_Cities.Square_Meter = df_Cities.Square_Meter.astype("float64")

df_Cities.Price = df_Cities.Price.astype('int64')

df_Cities['Sale_Date'] = pd.to_datetime(df_Cities['Sale_Date'],format='%Y-%m-%d',errors="coerce")

# 

# Step 1: Preparing the database
### In this part we will prepare the database for machine learning.
+ 1) We will remove unwanted columns
+ 2) We will unite the information of the two databases we have
+ 3) We will perform a final treatment of the data

# 

**Before we start merging the two databases ,we would like to verify if there is a match between the neighborhood names of the two databases**

Explanation why: a situation can arise where the first database will have a neighborhood name without the word שכונת before, while the second database will have a neighborhood name with the word שכונת before.
In order to be able to perform the union we will have to make sure that the names are indeed the same

Example: 

in df_Cities ---> **בן גוריון, אילת**

in df_Cities_Env_Avg ---> **שכונת בן גוריון, אילת**

In order to find out if such cases exist, we will use the asymmetric difference operation between df_Cities_Env_Avg and df_Cities

In [663]:
set_Cities = set(df_Cities.Neighborhood.unique().tolist())
set_Env = set(df_Cities_Env_Avg.Neighborhood.unique().tolist())

print("set_Cities - set_Env : ", set_Cities-set_Env)

set_Cities - set_Env :  {'יפו העתיקה', 'שכונת ותיקים', 'שכונת רובע ט', 'שכונת קרית רבין', 'שכונת קריית שרת', 'שכונת התקווה', 'שכונת משכנות גבעת זאב', 'שכונת מרכז העיר', "שכונת ג'סי כהן", 'הדר יוסף', 'שכונת נווה שרת', 'שכונת נווה רמז', 'שכונת הצפון החדש סביבת כיכר המדינה', 'שכונת הצפון הישן החלק הדרומי', 'הצפון החדש החלק הדרומי'}


We have now received two sets of results:
The first group: the group of neighborhoods that start with the word שכונת, which we will have to correct.

The second group: the group of neighborhoods that do not begin with the word שכונת, we will conclude from this that there are neighborhoods for which we have data in the main database but not in the secondary

In [664]:
# Removing the word שכונת in the main dataset
updated = [neighborhood[6::] for neighborhood in df_Cities.loc[df_Cities['Neighborhood'].str.startswith('שכונת'),'Neighborhood']]
df_Cities.loc[df_Cities['Neighborhood'].str.startswith('שכונת'),'Neighborhood'] = updated

# Removing the word שכונת in the second dataset
updated_Env = [neighborhood[6::] for neighborhood in df_Cities_Env_Avg.loc[df_Cities_Env_Avg['Neighborhood'].str.startswith('שכונת'),'Neighborhood']]
df_Cities_Env_Avg.loc[df_Cities_Env_Avg['Neighborhood'].str.startswith('שכונת'),'Neighborhood'] = updated_Env

## 1.1: Removing Unwanted data

It can be seen that the apartment number column is not relevant information for machine learning, so we will have to get rid of it.
In addition, we would like to extract the year of sale from the apartment price date column

### 1.1.1: Removing Building Number Column

In [665]:
df_Cities.drop(columns="Building_Number",inplace=True)

### 1.1.2: Extarcting Year Of Sale

In [666]:
df_Cities['Sale_Year'] = df_Cities['Sale_Date'].dt.year
df_Cities.drop(columns=['Sale_Date'],inplace=True)
df_Cities.insert(0, 'Sale_Year', df_Cities.pop('Sale_Year'))

Before we finish, we will have to ask ourselves: will the data where the name of the street is unknown help us in the machine learning process? And that answer is probably no.
Because we want to perform the machine learning process at the apartment level (and not at the neighborhood level) we would like to further divide the street names into categories, as a result all the apartments where the street name is Unknown will be united under one category, which is wrong

In [667]:
df_Cities[df_Cities.Street=='Unknown']

Unnamed: 0,Sale_Year,City,Neighborhood,Street,Property_Type,Rooms,Floor,Square_Meter,Price
0,2022,אשדוד,"הקריה מע""ר",Unknown,1,5.0,8,227.0,3500000
1,2022,אשדוד,"הקריה מע""ר",Unknown,1,3.0,3,81.0,2370000
3,2022,אשדוד,"הקריה מע""ר",Unknown,1,5.0,13,137.0,3550000
4,2022,אשדוד,"הקריה מע""ר",Unknown,1,3.0,1,77.0,2099000
5,2022,אשדוד,"הקריה מע""ר",Unknown,1,5.0,3,189.0,3300000
...,...,...,...,...,...,...,...,...,...
229101,2013,תל אביב -יפו,מונטיפיורי,Unknown,1,3.0,30,65.0,2135000
229102,2013,תל אביב -יפו,מונטיפיורי,Unknown,1,3.0,30,68.0,2250000
229103,2013,תל אביב -יפו,מונטיפיורי,Unknown,1,3.0,30,76.0,2205000
229104,2013,תל אביב -יפו,מונטיפיורי,Unknown,1,3.0,30,69.0,2250000


### 1.1.3: Removing apartments with an unknown street name  

In [668]:
df_Cities.drop(index=df_Cities[df_Cities['Street']=='Unknown'].index,inplace=True)
df_Cities.reset_index(drop=True,inplace=True)


# 

## 1.2: Uniting the information of the two databases

This process is complex, because we want to "replicate" the values ​​of each row in the second database for a large amount of values ​​that belong to the same neighborhood in the main database.
Since there is no specific function that can handle this case for us, we will have to handle it "manually" using loops

In [669]:
# Adding columns from the second database, and resetting them to NaN values
new_columns = df_Cities_Env_Avg.columns[2::]
df_Cities[new_columns] = np.nan

# Loop through all the neighborhoods in the second database,
for row in range(len(df_Cities_Env_Avg)):
    
    # while transferring the data from there to all the records 
    # in the main database that belong to that neighborhood
    for col in df_Cities_Env_Avg.columns.tolist()[2::]:
        
        df_Cities.loc[df_Cities.Neighborhood==df_Cities_Env_Avg.loc[row,'Neighborhood'],col] = df_Cities_Env_Avg.loc[row,col]

**And it works!**

In [670]:
df_Cities.Floor = df_Cities.Floor.astype('int64')
df_Cities.Rooms = df_Cities.Rooms.astype('int64')
df_Cities.Property_Type = df_Cities.Property_Type.astype('int64')
df_Cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203191 entries, 0 to 203190
Data columns (total 21 columns):
 #   Column                               Non-Null Count   Dtype   
---  ------                               --------------   -----   
 0   Sale_Year                            203191 non-null  int64   
 1   City                                 203191 non-null  category
 2   Neighborhood                         203191 non-null  object  
 3   Street                               203191 non-null  object  
 4   Property_Type                        203191 non-null  int64   
 5   Rooms                                203191 non-null  int64   
 6   Floor                                203191 non-null  int64   
 7   Square_Meter                         203191 non-null  float64 
 8   Price                                203191 non-null  int64   
 9   Schools                              200147 non-null  float64 
 10  Kindergartens_And_Dormitories        200147 non-null  float64 
 11  

### Uniting the information of the two databases - Conclusions:
+ We understood the importance of removing the apartments that do not have a street name



+ It can be seen that after merging the two databases, there are about 3K missing values ​​in columns 9 to 20. The reason is that these values ​​are probably under a neighborhood that **did not have** such data on the site

Therefore, we will have to delete this data in the next step

# 

## 1.3: Final treatment of the data

In [671]:
df_Cities.dropna(axis=0,inplace=True)
df_Cities.reset_index(drop=True,inplace=True)


df_Cities.Neighborhood = df_Cities.Neighborhood.astype('category')
df_Cities.Street = df_Cities.Street.astype('category')

df_Cities.Rooms = df_Cities.Rooms.astype('int64')
df_Cities.Rooms = df_Cities.Rooms.astype('category')
df_Cities.Floor = df_Cities.Floor.astype('int64')

## Feature Engineering

In [672]:
df_Cities['Institutions'] = df_Cities['Public_Institutions'] + df_Cities['Community_Institutions'] + df_Cities['Religious_Institutions']
df_Cities['Schools_Institutions'] = df_Cities['Schools'] + df_Cities['Non_Formal_Educational_Institutions']
df_Cities.drop(columns=['Public_Institutions','Community_Institutions','Religious_Institutions',
                       'Schools','Non_Formal_Educational_Institutions',],inplace=True)


In [673]:
df_Cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200147 entries, 0 to 200146
Data columns (total 18 columns):
 #   Column                              Non-Null Count   Dtype   
---  ------                              --------------   -----   
 0   Sale_Year                           200147 non-null  int64   
 1   City                                200147 non-null  category
 2   Neighborhood                        200147 non-null  category
 3   Street                              200147 non-null  category
 4   Property_Type                       200147 non-null  int64   
 5   Rooms                               200147 non-null  category
 6   Floor                               200147 non-null  int64   
 7   Square_Meter                        200147 non-null  float64 
 8   Price                               200147 non-null  int64   
 9   Kindergartens_And_Dormitories       200147 non-null  float64 
 10  Education_Average_Distance          200147 non-null  float64 
 11  Green_Areas_S

In [674]:
df_Cities.to_csv("Save 1.csv")
df_Cities2 = df_Cities.copy()
df_Cities2 = pd.get_dummies(df_Cities2,columns=['City'])
df_Cities2 = pd.get_dummies(df_Cities2,columns=['Property_Type'])

df_Cities2.drop(columns=['Neighborhood','Street'],inplace=True)

In [552]:
df_Cities2.Rooms = df_Cities2.Rooms.astype('int64')
df_Cities2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200147 entries, 0 to 200146
Data columns (total 36 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Sale_Year                           200147 non-null  int64  
 1   Rooms                               200147 non-null  int64  
 2   Floor                               200147 non-null  int64  
 3   Square_Meter                        200147 non-null  float64
 4   Price                               200147 non-null  int64  
 5   Kindergartens_And_Dormitories       200147 non-null  float64
 6   Education_Average_Distance          200147 non-null  float64
 7   Green_Areas_SQM                     200147 non-null  float64
 8   Parks_And_Gardens                   200147 non-null  float64
 9   Green_Areas_Average_Distance        200147 non-null  float64
 10  Parks_And_Gardens_Average_Distance  200147 non-null  float64
 11  Public_Building_Average_Di

# Step 2: The Machine Learning Process

## In this part we will perform the machine learning process
+ we will extract the target column from the database
+ we will divide the data into train and test in a ratio of 80-20
+ we will train the model

## Multiple Linear Regression 

### 2.0.1: Extracting the target column from the database 

In [601]:
scaler = MinMaxScaler(feature_range=(0,1))

columns = df_Cities2.columns[df_Cities2.columns != 'Price']
X = df_Cities2[columns]
y = df_Cities2['Price']

### 2.0.2: Dividing the data into train and test in a ratio of 80-20

In [602]:
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size=0.2)

num_vars = ['Sale_Year', 'Rooms', 'Square_Meter', 'Kindergartens_And_Dormitories', 'Education_Average_Distance','Green_Areas_SQM',
           'Parks_And_Gardens','Green_Areas_Average_Distance','Parks_And_Gardens_Average_Distance','Public_Building_Average_Distance',
           'Institutions','Schools_Institutions']

X_train[num_vars] = scaler.fit_transform(X_train[num_vars])
X_test[num_vars] = scaler.fit_transform(X_test[num_vars])

### 2.0.3: Training the model

In [603]:
model = LinearRegression().fit(X_train,Y_train)
y_pred = model.predict(X_test)

print("First Result:",r2_score(Y_test,y_pred))

First Result: 0.6362717748570688


### 2.0.4: Performance Evaluation

In [604]:
scores = cross_val_score(model, X_train, Y_train, scoring='r2', cv=100)
print('SCORE:',max(scores))

SCORE: 0.7092980835320352


**You can see that in the main model we got up to 70% success, which is not bad in relation to housing price forecasting!**

Summary:
+ In machine learning, we performed a linear regression to predict apartment prices based on their prices.
+ In order to get a general picture of the quality of the model itself, we performed 200-cross-validation.
+ We chose a criterion of 200 so that our database is quite large, so we wanted to find the balance between the quality of the performance evaluation and the quality of the data variance when dividing into groups
+ These results indicate that predicting apartment prices is definitely not simple, there are many parameters that add to or subtract from the apartment price that are not necessarily in our database, and it will probably be impossible to put the influencing parameters into one database



**This model was the main model of our project, now we will move to other models based on additional hypotheses that we wanted to test**

# 

# 

Now, we will try to build all kinds of models and evaluate their performance.

### Logistic Regression Models:
+ 1) Prediction if the house has more than a 4 room apartment given the price
+ 2) Prediction if the house has more then a 5 room apartment given the square_meter

### GNB\KNN\DT\RF Models:
+ 3) Predicting how many rooms there are in an apartment given all its data
+ 4) Predicting what type of apartment given all its data

For each model we are going to perform a general performance evaluation, and a specific performance evaluation for one result that we will receive

In [636]:
df_Cities = pd.read_csv("Save 1.csv",index=False)

## Extras: Other Models

### First Prediction:

In [608]:
# 2.0.1
df_Cities['Above 4 rooms'] = [1 if rooms>=4 else 0 for rooms in df_Cities['Rooms']]
X = df_Cities[['Price']]
Y = df_Cities['Above 4 rooms']

# 2.0.2
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# 2.0.3
model = LogisticRegression().fit(X_train,Y_train)
y_pred = model.predict(X_test)

# 2.0.4
scores = cross_val_score(model, X_train, Y_train, scoring='f1', cv=20)

df_Cities.drop(columns = 'Above 4 rooms',inplace=True)

### Performance Evaluation

In [609]:
print("SUCCESS RATE:",f1_score(Y_test,y_pred))
print("PRECISION RATE:",precision_score(Y_test,y_pred))
print("RECALL RATE:",recall_score(Y_test,y_pred))
print("CONF MAT:")

cm = pd.DataFrame(confusion_matrix(Y_test,y_pred),index=['Predicted Above 4','Predicted Below 4'])
cm = cm.rename(columns={0: 'Actual Above 4',1:"Actual Below 4"})
cm

SUCCESS RATE: 0.7121852210238134
PRECISION RATE: 0.7299820987952973
RECALL RATE: 0.6952354621693854
CONF MAT:


Unnamed: 0,Actual Above 4,Actual Below 4
Predicted Above 4,12747,5581
Predicted Below 4,6614,15088


+ We got about 71 percent success
+ Out of about 18K apartments in which the model predicted that they had more than 4 rooms, he was right in only 13K of them. (72%)
+ Out of about 19K apartments that have more than 4 rooms, the model selected 13K of them (69%)

In [610]:
print("BEST SUCCESS RATE:",max(scores))
print("MEAN SUCCESS RATE:",scores.mean())

BEST SUCCESS RATE: 0.7208278815272986
MEAN SUCCESS RATE: 0.7089188267063367


# 

### Second Prediction

In [611]:
df_Cities['Above 4 rooms'] = [1 if rooms>=4 else 0 for rooms in df_Cities['Rooms']]
X = df_Cities[['Square_Meter']]
Y = df_Cities['Above 4 rooms']

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)


scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


model = LogisticRegression().fit(X_train,Y_train)
y_pred = model.predict(X_test)

scores = cross_val_score(model, X_train, Y_train, scoring='f1', cv=20)

df_Cities.drop(columns = 'Above 4 rooms',inplace=True)

### Performance Evaluation

In [612]:
print("SUCCESS RATE:",f1_score(Y_test,y_pred))
print("PRECISION RATE:",precision_score(Y_test,y_pred))
print("RECALL RATE:",recall_score(Y_test,y_pred))
print("CONF MAT:")
cm = pd.DataFrame(confusion_matrix(Y_test,y_pred),index=['Predicted Above 4','Predicted Below 4'])
cm = cm.rename(columns={0: 'Actual Above 4',1:"Actual Below 4"})
cm

SUCCESS RATE: 0.9132124653804082
PRECISION RATE: 0.9191379714218787
RECALL RATE: 0.9073628711497549
CONF MAT:


Unnamed: 0,Actual Above 4,Actual Below 4
Predicted Above 4,16682,1726
Predicted Below 4,2003,19619


+ We got about 91 percent success
+ Out of about 18K apartments in which the model predicted that they had more than 4 rooms, he was right in 16K of them. (92%)
+ Out of about 18K apartments that have more than 4 rooms, the model selected 16K of them (90%)

In [622]:
print("BEST SUCCESS RATE:",max(scores))
print("MEAN SUCCESS RATE:",scores.mean())

BEST SUCCESS RATE: 0.918184983172798
MEAN SUCCESS RATE: 0.9121724815152461


# 

In [675]:
df_Cities2 = pd.read_csv("Save 1.csv")

### Third Prediction

In [676]:
df_Cities2 = pd.get_dummies(df_Cities2,columns=['City'])
df_Cities2 = pd.get_dummies(df_Cities2,columns=['Property_Type'])

df_Cities2.drop(columns=['Neighborhood','Street'],inplace=True)

X = df_Cities2[df_Cities2.columns[df_Cities2.columns!='Rooms']]
Y = df_Cities2['Rooms']

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

num_vars = ['Sale_Year', 'Square_Meter', 'Kindergartens_And_Dormitories', 'Education_Average_Distance','Green_Areas_SQM',
           'Parks_And_Gardens','Green_Areas_Average_Distance','Parks_And_Gardens_Average_Distance','Public_Building_Average_Distance',
           'Institutions','Schools_Institutions']

X_train[num_vars] = scaler.fit_transform(X_train[num_vars])
X_test[num_vars] = scaler.fit_transform(X_test[num_vars])

KNN = KNeighborsClassifier(n_neighbors=10).fit(X_train,Y_train)
RandomForest = RandomForestClassifier().fit(X_train,Y_train)
GuassianNB = GaussianNB().fit(X_train,Y_train)
DT = DecisionTreeClassifier(max_depth=10).fit(X_train,Y_train)

KNN_pred = KNN.predict(X_test)
RandomForest_pred = RandomForest.predict(X_test)
GuassianNB_pred = GuassianNB.predict(X_test)
DT_pred = DT.predict(X_test)

#### KNN: 

In [677]:
def get_labels(shape_row,shape_col):
    Act = []
    Pred = {}
    for i in range(1,shape_row+1):
        Act.append(("Actual ",i,' Rooms'))
    for i in range(1,shape_col+1):
        Pred[i-1] = ("Pred ",i,' Rooms')
    return (Act,Pred)


labels=  get_labels(confusion_matrix(Y_test,KNN_pred).shape[0],confusion_matrix(Y_test,KNN_pred).shape[1])
cm = pd.DataFrame(confusion_matrix(Y_test,KNN_pred),index=labels[0])
cm = cm.rename(columns=labels[1])
cm

Unnamed: 0,"(Pred , 1, Rooms)","(Pred , 2, Rooms)","(Pred , 3, Rooms)","(Pred , 4, Rooms)","(Pred , 5, Rooms)","(Pred , 6, Rooms)","(Pred , 7, Rooms)","(Pred , 8, Rooms)","(Pred , 9, Rooms)","(Pred , 10, Rooms)"
"(Actual , 1, Rooms)",5,159,71,18,24,3,1,0,0,0
"(Actual , 2, Rooms)",9,3294,2413,532,81,5,0,0,0,0
"(Actual , 3, Rooms)",10,2512,5992,2696,510,32,1,0,0,0
"(Actual , 4, Rooms)",2,855,3618,5876,1810,75,2,0,0,0
"(Actual , 5, Rooms)",2,151,849,2782,3614,284,9,0,0,0
"(Actual , 6, Rooms)",0,13,87,284,838,208,9,0,0,0
"(Actual , 7, Rooms)",0,4,8,31,118,56,1,0,0,0
"(Actual , 8, Rooms)",0,1,2,11,37,11,1,0,0,0
"(Actual , 9, Rooms)",0,0,1,1,8,2,0,0,0,0
"(Actual , 10, Rooms)",0,0,0,1,0,0,0,0,0,0


In [678]:
print("SCORE: ",accuracy_score(Y_test,KNN_pred))

SCORE:  0.47439420434673996


#### RandomForest:

In [679]:
labels=  get_labels(confusion_matrix(Y_test,RandomForest_pred).shape[0],confusion_matrix(Y_test,RandomForest_pred).shape[1])
cm = pd.DataFrame(confusion_matrix(Y_test,RandomForest_pred),index=labels[0])
cm = cm.rename(columns=labels[1])
cm

Unnamed: 0,"(Pred , 1, Rooms)","(Pred , 2, Rooms)","(Pred , 3, Rooms)","(Pred , 4, Rooms)","(Pred , 5, Rooms)","(Pred , 6, Rooms)","(Pred , 7, Rooms)","(Pred , 8, Rooms)","(Pred , 9, Rooms)","(Pred , 10, Rooms)"
"(Actual , 1, Rooms)",30,185,18,15,21,10,1,1,0,0
"(Actual , 2, Rooms)",39,5123,1158,12,2,0,0,0,0,0
"(Actual , 3, Rooms)",0,1028,9321,1379,25,0,0,0,0,0
"(Actual , 4, Rooms)",1,14,1280,9992,943,7,1,0,0,0
"(Actual , 5, Rooms)",3,0,15,1086,6210,352,22,3,0,0
"(Actual , 6, Rooms)",1,0,2,52,702,639,34,8,1,0
"(Actual , 7, Rooms)",0,0,0,7,71,104,27,8,1,0
"(Actual , 8, Rooms)",2,0,0,2,23,32,4,0,0,0
"(Actual , 9, Rooms)",0,0,0,1,6,4,1,0,0,0
"(Actual , 10, Rooms)",0,0,0,1,0,0,0,0,0,0


In [680]:
print("SCORE: ",accuracy_score(Y_test,RandomForest_pred))

SCORE:  0.7829627779165625


#### GuassianNB

In [681]:
labels=  get_labels(confusion_matrix(Y_test,GuassianNB_pred).shape[0],confusion_matrix(Y_test,GuassianNB_pred).shape[1])
cm = pd.DataFrame(confusion_matrix(Y_test,GuassianNB_pred),index=labels[0])
cm = cm.rename(columns=labels[1])
cm

Unnamed: 0,"(Pred , 1, Rooms)","(Pred , 2, Rooms)","(Pred , 3, Rooms)","(Pred , 4, Rooms)","(Pred , 5, Rooms)","(Pred , 6, Rooms)","(Pred , 7, Rooms)","(Pred , 8, Rooms)","(Pred , 9, Rooms)","(Pred , 10, Rooms)"
"(Actual , 1, Rooms)",0,19,216,18,14,6,3,5,0,0
"(Actual , 2, Rooms)",0,438,5306,548,42,0,0,0,0,0
"(Actual , 3, Rooms)",0,374,9140,1596,621,21,1,0,0,0
"(Actual , 4, Rooms)",0,355,7973,2818,999,86,7,0,0,0
"(Actual , 5, Rooms)",0,150,3072,2878,1356,148,50,37,0,0
"(Actual , 6, Rooms)",0,34,220,563,513,75,14,20,0,0
"(Actual , 7, Rooms)",0,1,24,54,113,21,1,4,0,0
"(Actual , 8, Rooms)",0,0,2,10,28,7,5,11,0,0
"(Actual , 9, Rooms)",0,0,0,4,4,3,1,0,0,0
"(Actual , 10, Rooms)",0,0,1,0,0,0,0,0,0,0


In [682]:
print("SCORE: ",accuracy_score(Y_test,GuassianNB_pred))

SCORE:  0.3457157132150887


#### Decision Tree

In [683]:
labels=  get_labels(confusion_matrix(Y_test,DT_pred).shape[0],confusion_matrix(Y_test,DT_pred).shape[1])
cm = pd.DataFrame(confusion_matrix(Y_test,DT_pred),index=labels[0])
cm = cm.rename(columns=labels[1])
cm

Unnamed: 0,"(Pred , 1, Rooms)","(Pred , 2, Rooms)","(Pred , 3, Rooms)","(Pred , 4, Rooms)","(Pred , 5, Rooms)","(Pred , 6, Rooms)","(Pred , 7, Rooms)","(Pred , 8, Rooms)","(Pred , 9, Rooms)","(Pred , 10, Rooms)"
"(Actual , 1, Rooms)",32,181,14,17,22,15,0,0,0,0
"(Actual , 2, Rooms)",36,4913,1370,12,3,0,0,0,0,0
"(Actual , 3, Rooms)",1,1099,8931,1690,31,1,0,0,0,0
"(Actual , 4, Rooms)",0,14,1408,9460,1348,8,0,0,0,0
"(Actual , 5, Rooms)",0,0,10,1340,6034,305,2,0,0,0
"(Actual , 6, Rooms)",0,0,1,65,859,514,0,0,0,0
"(Actual , 7, Rooms)",0,0,1,5,80,132,0,0,0,0
"(Actual , 8, Rooms)",0,0,0,2,21,40,0,0,0,0
"(Actual , 9, Rooms)",0,0,0,0,5,7,0,0,0,0
"(Actual , 10, Rooms)",0,0,0,1,0,0,0,0,0,0


In [684]:
print("SCORE: ",accuracy_score(Y_test,DT_pred))

SCORE:  0.7465400949288034


### Performance Evaluation

**The models with the highest accuracy percentage**:
+ 1) RandomForest: 78.07%
+ 2) DecisionTree: 74.67%
+ 3) KNN: 47.50%
+ 4) GNB: 38.97%


(Please note that these indices are not precise and constitute only one experiment, with the help of finding the highest hyperparameter with the SearchGridCV function we would get higher accuracy percentages.
Since these models are not our main focus in this project, we will be okay with that)

In [688]:
df_Cities2 = pd.read_csv("Save 1.csv")

### Fourth Prediction

In [689]:
df_Cities2 = pd.get_dummies(df_Cities2,columns=['City'])

df_Cities2.drop(columns=['Neighborhood','Street'],inplace=True)

X = df_Cities2[df_Cities2.columns[df_Cities2.columns!='Property_Type']]
Y = df_Cities2['Property_Type']

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)


num_vars = ['Sale_Year', 'Square_Meter', 'Kindergartens_And_Dormitories', 'Education_Average_Distance','Green_Areas_SQM',
           'Parks_And_Gardens','Green_Areas_Average_Distance','Parks_And_Gardens_Average_Distance','Public_Building_Average_Distance',
           'Institutions','Schools_Institutions']

X_train[num_vars] = scaler.fit_transform(X_train[num_vars])
X_test[num_vars] = scaler.fit_transform(X_test[num_vars])

KNN = KNeighborsClassifier(n_neighbors=10).fit(X_train,Y_train)
RandomForest = RandomForestClassifier().fit(X_train,Y_train)
GuassianNB = GaussianNB().fit(X_train,Y_train)
DT = DecisionTreeClassifier(max_depth=10).fit(X_train,Y_train)

KNN_pred = KNN.predict(X_test)
RandomForest_pred = RandomForest.predict(X_test)
GuassianNB_pred = GuassianNB.predict(X_test)
DT_pred = DT.predict(X_test)

#### KNN:

In [690]:
labels = ['Actual בית פרטי','Actual דירה בבניין','Actual דירת גג','Actual דירת גן','Actual קוטג']
labels2 = {0:'Pred בית פרטי',
          1:'Pred דירה בבניין',
          2:'Pred דירת גג',
          3: 'Pred דירת גן',
          4: 'Pred קוטג'}
cm = pd.DataFrame(confusion_matrix(Y_test,KNN_pred),index=labels)
cm = cm.rename(columns=labels2)
cm

Unnamed: 0,Pred בית פרטי,Pred דירה בבניין,Pred דירת גג,Pred דירת גן,Pred קוטג
Actual בית פרטי,11,206,0,0,10
Actual דירה בבניין,10,37762,3,0,153
Actual דירת גג,0,255,0,0,4
Actual דירת גן,0,264,0,0,14
Actual קוטג,0,1142,0,0,196


In [691]:
print("SCORE: ",accuracy_score(Y_test,KNN_pred))

SCORE:  0.9485136147889083


#### RandomForest

In [692]:
cm = pd.DataFrame(confusion_matrix(Y_test,RandomForest_pred),index=labels)
cm = cm.rename(columns=labels2)
cm

Unnamed: 0,Pred בית פרטי,Pred דירה בבניין,Pred דירת גג,Pred דירת גן,Pred קוטג
Actual בית פרטי,120,53,0,0,54
Actual דירה בבניין,12,37602,18,37,259
Actual דירת גג,0,247,9,0,3
Actual דירת גן,0,233,0,31,14
Actual קוטג,14,529,1,6,788


In [693]:
print("SCORE: ",accuracy_score(Y_test,RandomForest_pred))

SCORE:  0.9630277292030976


#### GuassianNB

In [694]:
cm = pd.DataFrame(confusion_matrix(Y_test,GuassianNB_pred),index=labels)
cm = cm.rename(columns=labels2)
cm

Unnamed: 0,Pred בית פרטי,Pred דירה בבניין,Pred דירת גג,Pred דירת גן,Pred קוטג
Actual בית פרטי,0,212,0,0,15
Actual דירה בבניין,0,37476,4,0,448
Actual דירת גג,0,240,0,0,19
Actual דירת גן,0,270,0,0,8
Actual קוטג,0,1228,0,0,110


In [695]:
print("SCORE: ",accuracy_score(Y_test,GuassianNB_pred))

SCORE:  0.9389457906570072


#### Decision Tree

In [696]:
cm = pd.DataFrame(confusion_matrix(Y_test,DT_pred),index=labels)
cm = cm.rename(columns=labels2)
cm

Unnamed: 0,Pred בית פרטי,Pred דירה בבניין,Pred דירת גג,Pred דירת גן,Pred קוטג
Actual בית פרטי,119,52,2,0,54
Actual דירה בבניין,10,37552,20,11,335
Actual דירת גג,1,242,8,0,8
Actual דירת גן,1,245,1,7,24
Actual קוטג,15,572,2,1,748


In [697]:
print("SCORE: ",accuracy_score(Y_test,DT_pred))

SCORE:  0.9601299025730702


### Performance Evaluation
**The models with the highest accuracy percentage**:
+ 1) RandomForest: 96.14%
+ 2) DecisionTree: 96.01%
+ 3) KNN: 95.82%
+ 4) GNB: 89.67%


(Please note that these indices are not precise and constitute only one experiment, with the help of finding the highest hyperparameter with the SearchGridCV function we would get higher accuracy percentages.
Since these models are not our main focus in this project, we will be okay with that)