# Context
### Model Explanation
To enhance our project, we decided to build a model that predicts whether a house is overvalued or undervalued relative to its comparables.
The criterion is the listed price: is the home listed at a significantly higher price than neighboring properties with similar characteristics?
Since this is a classification problem with a binary outcome, we chose logistic regression.
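
The labeling rule described above can be sketched as follows. This is only an illustration under stated assumptions: the toy listings are made up, the 10% margin (`THRESHOLD`) is a hypothetical stand-in for "significantly higher", and the comparables are assumed to be homes sharing the same `city` and `type_of_house`.

```python
import pandas as pd

# Hypothetical toy listings; the prices are made up for illustration.
listings = pd.DataFrame({
    "city": ["Oshawa", "Oshawa", "Oshawa", "Milton", "Milton"],
    "type_of_house": ["Detached"] * 5,
    "price": [700_000.0, 720_000.0, 950_000.0, 1_000_000.0, 1_050_000.0],
})

# Assumed margin: a listing counts as overvalued when it is more than
# 10% above the mean listed price of its comparables.
THRESHOLD = 0.10

# Mean listed price of each (city, type_of_house) group, aligned row by row.
group_mean = listings.groupby(["city", "type_of_house"])["price"].transform("mean")

# Binary target for the classifier: 1 = overvalued, 0 = not overvalued.
listings["over_valued"] = (listings["price"] > group_mean * (1 + THRESHOLD)).astype(int)
print(listings)
```

A binary column of this kind is what a logistic regression would take as its target; the margin itself is a modeling choice and would need to be tuned on the real data.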

In [2]:
!pip install pymongo

Collecting pymongo
  Downloading pymongo-4.6.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (676 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 676.9/676.9 kB 6.0 MB/s eta 0:00:00
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.7/307.7 kB 15.3 MB/s eta 0:00:00
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.6.3


In [11]:
#Importing the dependencies
import pandas as pd
from pymongo import MongoClient
from api_keys import mongo_username, mongo_password
from pprint import pprint

In [9]:
#Connection to our MongoClient Instance
connection_string = f"mongodb+srv://{mongo_username}:{mongo_password}@cluster0.9gjuly6.mongodb.net/mydatabase"
mongo = MongoClient(connection_string)

#Assigning our db to a variable
db = mongo['properties']

#Assigning our collections to a variable
all_houses = db["all_houses"]
sold_houses = db['sold_houses']
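
One caveat about the connection string: if the username or password contains reserved characters such as `@`, `:` or `/`, they must be percent-escaped before being placed in a MongoDB URI. A minimal sketch of that step, using placeholder credentials rather than the real contents of `api_keys`:

```python
from urllib.parse import quote_plus

# Placeholder credentials for illustration only; in the notebook these
# come from the api_keys module.
mongo_username = "user@example"
mongo_password = "p@ss:word/1"

# Percent-escape both parts so reserved characters do not break the URI.
connection_string = (
    f"mongodb+srv://{quote_plus(mongo_username)}:{quote_plus(mongo_password)}"
    "@cluster0.9gjuly6.mongodb.net/mydatabase"
)
print(connection_string)
```

The escaped string can then be passed to `MongoClient` exactly as in the cell above.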

In [16]:
#Converting our collections to Pandas DataFrames
all_houses_df = pd.DataFrame(list(all_houses.find()))
sold_houses_df = pd.DataFrame(list(sold_houses.find()))

In [18]:
all_houses_df.head()

Unnamed: 0,_id,address,status,latitude,longitude,floor_size,bedrooms,bathrooms,garage,city,type_of_house,date_listed,neighbourhood,price,sold_price
0,65e3e8514625ce6cbae3942a,167 Olive Ave,Sold Conditional,-78.85339,43.88987,831.0,1.0,1,0,Oshawa,Freehold Townhouse,2024-03-02,Central,319900.0,319900.0
1,65e3e8514625ce6cbae3942c,233 Bennet Dr,Sold,-79.51915,43.928,1444.0,3.0,2,1,King,Detached,2024-03-02,King City,1750000.0,1691000.0
2,65e3e8514625ce6cbae3942d,124 Norwood Crt,Sold,-78.82709,43.90868,1399.0,3.0,2,1,Oshawa,Detached,2024-03-02,Eastdale,780000.0,802000.0
3,65e3e8524625ce6cbae3942f,629 Crerar Ave,Sold,-78.83612,43.89229,1284.0,3.0,2,0,Oshawa,Detached,2024-03-02,Donevan,675000.0,777000.0
4,65e3e8534625ce6cbae39438,610 - 153 Beecroft Rd,Sold Conditional,-79.41436,43.76526,749.0,2.0,1,1,North York,Condo Apt,2024-03-02,Lansing-Westgate,675000.0,675000.0


In [27]:
all_houses_df.dtypes

_id                      object
address                  object
status                   object
latitude                float64
longitude               float64
floor_size              float64
bedrooms                float64
bathrooms                 int64
garage                    int64
city                     object
type_of_house            object
date_listed      datetime64[ns]
neighbourhood            object
price                   float64
sold_price              float64
dtype: object

Our model compares houses by city and by house type. Since the data contains many different type labels, we bin them into a few broader categories.

In [40]:
#Checking the different types
all_houses_df['type_of_house'].value_counts()

type_of_house
Detached                                  265
Condo Apt                                 163
Freehold Townhouse                         70
Condo Townhouse                            54
Semi-Detached                              34
Single Family Residence                     9
Detached, Freehold                          9
Link                                        8
Apartment Unit, Condominium                 7
Condo/Apt Unit                              5
Vacant Land                                 4
Duplex                                      3
Townhouse/Row House, Condominium            3
Other                                       1
Semi Detached, Single Family Residence      1
Semi-Det Condo                              1
Row/Townhouse                               1
Name: count, dtype: int64

In [44]:
#Binning types
bin_df = all_houses_df.copy()
condo_types = ['Condo Apt','Apartment Unit, Condominium','Condo/Apt Unit','Semi-Det Condo']
townhouse_types = ['Freehold Townhouse','Condo Townhouse','Townhouse/Row House, Condominium','Row/Townhouse']
other = ['Semi-Detached','Single Family Residence','Detached, Freehold','Link','Vacant Land','Duplex','Semi Detached, Single Family Residence']

# Replace in dataframe (replace() accepts a list of values to map to a single label)
bin_df['type_of_house'] = bin_df['type_of_house'].replace(condo_types, "Condominium")
bin_df['type_of_house'] = bin_df['type_of_house'].replace(townhouse_types, "Townhouse")
bin_df['type_of_house'] = bin_df['type_of_house'].replace(other, "Other")

# Check to make sure binning was successful
bin_df['type_of_house'].value_counts()

type_of_house
Detached       265
Condominium    176
Townhouse      128
Other           69
Name: count, dtype: int64
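
As a cross-check, the binned totals can be recomputed from the raw counts printed earlier. The sketch below copies those counts by hand and applies the same three lists; the helper `bin_label` is a hypothetical name introduced only for this check.

```python
# Raw counts copied from the value_counts() output before binning.
raw_counts = {
    "Detached": 265, "Condo Apt": 163, "Freehold Townhouse": 70,
    "Condo Townhouse": 54, "Semi-Detached": 34, "Single Family Residence": 9,
    "Detached, Freehold": 9, "Link": 8, "Apartment Unit, Condominium": 7,
    "Condo/Apt Unit": 5, "Vacant Land": 4, "Duplex": 3,
    "Townhouse/Row House, Condominium": 3, "Other": 1,
    "Semi Detached, Single Family Residence": 1, "Semi-Det Condo": 1,
    "Row/Townhouse": 1,
}

condo_types = ['Condo Apt', 'Apartment Unit, Condominium', 'Condo/Apt Unit', 'Semi-Det Condo']
townhouse_types = ['Freehold Townhouse', 'Condo Townhouse', 'Townhouse/Row House, Condominium', 'Row/Townhouse']
other = ['Semi-Detached', 'Single Family Residence', 'Detached, Freehold', 'Link',
         'Vacant Land', 'Duplex', 'Semi Detached, Single Family Residence']

def bin_label(raw_type):
    """Hypothetical helper: return the bin a raw type label falls into."""
    if raw_type in condo_types:
        return "Condominium"
    if raw_type in townhouse_types:
        return "Townhouse"
    if raw_type in other:
        return "Other"
    return raw_type  # "Detached" and "Other" pass through unchanged

# Sum the raw counts per bin.
binned_counts = {}
for raw_type, n in raw_counts.items():
    label = bin_label(raw_type)
    binned_counts[label] = binned_counts.get(label, 0) + n

print(binned_counts)
# {'Detached': 265, 'Condominium': 176, 'Townhouse': 128, 'Other': 69}
```

The totals match the binned output, and the overall row count (638) is unchanged, so no label was dropped or double-counted.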

In [30]:
#For each house type per city we want a mean listed price; this first pass accumulates totals per city only
#mean = SUM(x_i) / n
#Counters to sum the listed prices per city
Oshawa_prices_counter = 0
Oakville_prices_counter = 0
Burlington_prices_counter = 0
Milton_prices_counter = 0
Vaughan_prices_counter = 0
#Counters for the number of houses found per city
Oshawa_n_counter = 0
Oakville_n_counter = 0
Burlington_n_counter = 0
Milton_n_counter = 0
Vaughan_n_counter = 0
for index, row in bin_df.iterrows():
  if row['city'] == 'Oshawa':
    Oshawa_prices_counter += row['price']
    Oshawa_n_counter += 1
  elif row['city'] == 'Oakville':
    Oakville_prices_counter += row['price']
    Oakville_n_counter += 1
  elif row['city'] == 'Burlington':
    Burlington_prices_counter += row['price']
    Burlington_n_counter += 1
  elif row['city'] == 'Milton':
    Milton_prices_counter += row['price']
    Milton_n_counter += 1
  elif row['city'] == 'Vaughan':
    Vaughan_prices_counter += row['price']
    Vaughan_n_counter += 1

In [31]:
#Now we calculate the mean listed price per city, starting with Oshawa
Oshawa_mean = Oshawa_prices_counter/Oshawa_n_counter
Oshawa_mean

858427.0138888889
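
The same per-city mean, and the per-city, per-type mean mentioned in the comments, can also be obtained without manual counters. The following is a sketch using `groupby` on a made-up toy frame; it assumes the `city`, `type_of_house` and `price` columns of `bin_df` and is not meant to reproduce the figure above.

```python
import pandas as pd

# Hypothetical toy data with the same column names as bin_df.
toy = pd.DataFrame({
    "city": ["Oshawa", "Oshawa", "Milton", "Milton", "Milton"],
    "type_of_house": ["Detached", "Townhouse", "Detached", "Detached", "Condominium"],
    "price": [800_000.0, 600_000.0, 1_000_000.0, 1_200_000.0, 650_000.0],
})

# Mean listed price per city (equivalent to prices_counter / n_counter).
city_means = toy.groupby("city")["price"].mean()

# Mean listed price per city and house type.
city_type_means = toy.groupby(["city", "type_of_house"])["price"].mean()

print(city_means)
print(city_type_means)
```

On the real data, `bin_df.groupby("city")["price"].mean()` would cover every city in one call, including those not listed in the `if`/`elif` chain.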

In [35]:
new_df = all_houses_df.copy()