# Merging data

The data used for model building is currently spread across several datasets:
- `listings_modified.csv` - main dataset, contains data about listings from centris.ca;
- `extra_data_modified.csv` - extra data about the listings from the dataset above;
- `policeCoord.csv` & `firestations.csv` - locations of police stations and fire stations in Montréal;
- `montreal_hpi.csv` - Montréal house price index (HPI) data;
- `sociodemo_modified.csv` - Montréal sociodemographic data.

In [99]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import category_encoders as ce
import textstat
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

## 1. Preparing listings data

In [2]:
listings = pd.read_csv('data/listings_modified.csv')

### 1.1. Dropping redundant columns

(see EDA: `eda_merged.ipynb`)

In [3]:
redundant_cols = [
    '__type', 'ShortCity',
    
    'RevenuEffectif', 'ShortDescription', 'Langue', 'SuperficieTerrain',
    'NbEquipements', 'NbAnimals', 'NbCultures', 'LocationPrice', 'LocationPriceDesc',
    
    'CatgCode', 'PhotoText', 'DiffusionWebUrl', 'PhotoUrl', 'PhotoPrintingUrl', 
    'FinancialToolUrl', 'PasserelleUrl', 'CommunityUrl', 'GoogleStreetViewUrl', 
    'WalkScoreUrl', 'ShareThisUrl', 'SummaryUrl', 'PrintingPageUrl', 
    'GoogleMapAddressLink', 
    
    'IsFavorite', 'IsRemoved', 'IsSubject', 'IsStrictQueryMatch', 'IsFavoriteDisplayed',
    'IsRemovedDisplayed', 'IsEmptyView', 'IsMissingView', 'IsInFavorite',

    'IsNew', 'IsNewPrice'
]
listings.drop(redundant_cols,
              axis=1,
              inplace=True)

### 1.2. Converting `VisiteVirtuelleUrl` & `OpenHouse` to boolean columns

In [4]:
listings['OpenHouse'] = ~listings['OpenHouse'].isna()

In [5]:
listings['VisiteVirtuelleUrl'] = ~listings['VisiteVirtuelleUrl'].isna()
listings.rename({'VisiteVirtuelleUrl': 'IsVirtualTour', 'OpenHouse': 'IsOpenHouse'}, axis=1, inplace=True)

In [6]:
listings.head()

Unnamed: 0,MlsNumber,CategoryCode,Lat,Lng,Address,City,Quartier,PostalCode,LongDescription,BuyPrice,...,NbChambres,NbSallesEaux,NbSallesBains,NbFoyerPoele,NbGarages,NbStationnements,NbPiscines,NbBordEaux,IsVirtualTour,PageTitle
0,9027411,COP,45.556508,-73.633198,"8708, Av. André-Grasset, apt. 202",Ahuntsic-Cartierville (Montréal),Ahuntsic Central,H2M 2L4,PRIX POUR VENTE RAPIDE. Superbe condo ensoleil...,187500.0,...,1,0,1,0,0,1,0,0,False,Condos for sale in Ahuntsic-Cartierville (Mont...
1,9030205,PPR,45.542062,-73.696868,"12064 - 12066, Rue Valmont",Ahuntsic-Cartierville (Montréal),Nouveau Bordeaux,H3M 2V5,"Duplex à vendre à Ahuntsic près de la rivière,...",574900.0,...,4,0,1,0,0,1,0,0,True,Duplexes for sale in Ahuntsic-Cartierville (Mo...
2,9041238,PCI,45.573887,-73.649079,"2326 - 2328, Rue Fleury E.",Ahuntsic-Cartierville (Montréal),Ahuntsic East,H2B 1K9,rare sur le marché! ideal pour dentiste ou med...,475000.0,...,0,0,0,0,0,1,0,0,False,Commercial building for sale in Ahuntsic-Carti...
3,9048590,PPR,45.57104,-73.643413,"2283 - 2287, Rue Sauvé E.",Ahuntsic-Cartierville (Montréal),Ahuntsic East,H2B 1B4,Excellent opportunity for builder/ contractor....,545000.0,...,2,0,1,0,0,0,0,0,False,Triplexes for sale in Ahuntsic-Cartierville (M...
4,9066959,COP,45.533876,-73.672649,"2110, Rue Caroline-Béique, apt. 804",Ahuntsic-Cartierville (Montréal),Nouveau Bordeaux,H4N 3K2,Beautiful condo built in 2011 still under guar...,200000.0,...,1,0,1,0,0,0,0,0,True,Condos for sale in Ahuntsic-Cartierville (Mont...


### 1.3. `Utilisation`

Merge "Commercial and residential" and "Residential and commercial" into a single category:

In [7]:
listings['Utilisation'].value_counts()

Residential only                     2299
Commercial only                       392
Residential and commercial            255
Commercial and office space           139
Commercial and residential            124
Commercial and industrial              22
Industrial and offices                 19
Industrial only                        15
Offices only                           14
Other                                  14
Multi-family dwelling                   5
Commercial or industrial (income)       4
Retirement home                         4
Name: Utilisation, dtype: int64

In [8]:
listings.loc[
    listings['Utilisation'] == 'Residential and commercial', 
    'Utilisation'
] = 'Commercial and residential'

Replace NaN values with 'No information'

In [70]:
listings['Utilisation'] = listings['Utilisation'].fillna('No information')

### 1.4. `BuyPriceDesc`

In [9]:
listings['BuyPriceDesc'].value_counts()

 /square foot    20
Name: BuyPriceDesc, dtype: int64

In [10]:
listings.loc[listings['BuyPriceDesc'].isna(), 'BuyPrice'].describe()

count    1.464700e+04
mean     5.204339e+05
std      5.965235e+05
min      9.480320e+03
25%      2.590000e+05
50%      3.790000e+05
75%      5.599000e+05
max      1.388000e+07
Name: BuyPrice, dtype: float64

In [11]:
listings.loc[listings['BuyPriceDesc'] == ' /square foot', 'BuyPrice'].describe()

count    2.000000e+01
mean     8.246817e+04
std      2.718385e+05
min      3.130000e+00
25%      6.725000e+01
50%      1.556900e+02
75%      1.580800e+02
max      1.126332e+06
Name: BuyPrice, dtype: float64

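Records with a missing `BuyPriceDesc` appear to state the total price, whereas the 20 records marked ` /square foot` mostly have much lower values (median of about 156), which is consistent with a price per square foot.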

### 1.5. Renaming columns with French names to English

In [12]:
listings.rename({
    'Quartier': 'District',
    'RevenuBrutPotentiel': 'GrossPotentialIncome',
    'NbPieces': 'NumberOfRooms',
    'NbChambres': 'NumberOfBedrooms',
    'NbSallesEaux': 'NumberOfBathroomsWithShower',
    'NbSallesBains': 'NumberOfBathroomsWithBathtub',
    'NbFoyerPoele': 'NumberOfFireplaces',
    'NbGarages': 'NumberOfGarages',
    'NbStationnements': 'NumberOfParkingLots',
    'NbPiscines': 'NumberOfPools',
    'NbBordEaux': 'NbBordEaux',
}, axis=1, inplace=True)

In [13]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14667 entries, 0 to 14666
Data columns (total 29 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   MlsNumber                     14667 non-null  int64  
 1   CategoryCode                  14667 non-null  object 
 2   Lat                           14667 non-null  float64
 3   Lng                           14667 non-null  float64
 4   Address                       14667 non-null  object 
 5   City                          14667 non-null  object 
 6   District                      11488 non-null  object 
 7   PostalCode                    14667 non-null  object 
 8   LongDescription               12922 non-null  object 
 9   BuyPrice                      14667 non-null  float64
 10  BuyPriceDesc                  20 non-null     object 
 11  Category                      14667 non-null  object 
 12  Construction                  14667 non-null  int64  
 13  I

### 1.6. Columns with location data

#### `Address`

In [14]:
listings['Address'].value_counts()

Boul. Gouin O.                         13
Rue Lakeview                            3
50e Avenue (R.-d.-P.)                   3
Boul. Beaconsfield                      3
59e Avenue                              3
                                       ..
4374 - 4380, Rue Notre-Dame O.          1
705, Boul. Georges-Vanier, apt. 101     1
1085, Rue Smith, apt. 1205              1
3601, Rue St-Jacques, apt. 412          1
409, Av. Clarke                         1
Name: Address, Length: 14623, dtype: int64

The `Address` column is very specific (14623 distinct values) and probably adds little, since the location is already captured by `Lat` and `Lng`.

#### `City`

In [15]:
print(len(listings['City'].unique()))
listings['City'].value_counts()

33


Ville-Marie (Montréal)                                 2087
Mercier/Hochelaga-Maisonneuve (Montréal)               1124
Rivière-des-Prairies/Pointe-aux-Trembles (Montréal)     920
Le Plateau-Mont-Royal (Montréal)                        871
Le Sud-Ouest (Montréal)                                 816
Côte-des-Neiges/Notre-Dame-de-Grâce (Montréal)          736
Rosemont/La Petite-Patrie (Montréal)                    730
Villeray/Saint-Michel/Parc-Extension (Montréal)         729
Saint-Laurent (Montréal)                                704
Ahuntsic-Cartierville (Montréal)                        696
Verdun/Île-des-Soeurs (Montréal)                        677
Pierrefonds-Roxboro (Montréal)                          511
LaSalle (Montréal)                                      470
Montréal-Nord (Montréal)                                433
Lachine (Montréal)                                      361
Dollard-Des Ormeaux                                     333
Côte-Saint-Luc                          

This column will be used later to merge this data frame with the sociodemographic data.

Removing "(Montréal)" from the end of some records' values (also will be useful later):

In [16]:
listings['City'] = listings['City'].apply(
    lambda city: city[:-11] 
                 if city.endswith('(Montréal)') 
                 else city
)
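
The slice `city[:-11]` relies on the suffix being exactly one space plus the 10 characters of "(Montréal)". A minimal sketch of a vectorised alternative that does not depend on counting characters is shown below; the sample values are made up for illustration and `cleaned` is a hypothetical name.

```python
import pandas as pd

# Toy values shaped like the `City` column
cities = pd.Series([
    'Ville-Marie (Montréal)',
    'Lachine (Montréal)',
    'Dollard-Des Ormeaux',
])

# Remove a trailing "(Montréal)" together with the whitespace before it
cleaned = cities.str.replace(r'\s*\(Montréal\)$', '', regex=True)
print(cleaned.tolist())
```

Values without the suffix are left unchanged, as in the original lambda.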

#### `District` (formerly `Quartier`)

In [17]:
print(len(listings['District'].unique()))
listings['District'].value_counts()

52


Central                           1244
Central West                       831
Le Plateau-Mont-Royal              735
Mercier                            591
Rivière-des-Prairies               504
Hochelaga-Maisonneuve              499
Île-des-Soeurs                     442
Rosemont                           431
Saint-Henri/Petite Bourgogne       428
Pointe-aux-Trembles                416
Côte-des-Neiges                    400
Villeray                           342
East                               336
Notre-Dame-de-Grâce                336
Saint-Michel                       302
La Petite-Patrie                   299
Old Montréal                       251
Verdun                             235
Saint-Paul/Émard                   228
Nouveau Bordeaux                   192
Pierrefonds/Central East           176
L'Île-Bizard                       163
Pointe-Saint-Charles               160
West                               146
Ahuntsic West                      143
Mile-End                 

#### `PostalCode`

In [18]:
print(len(listings['PostalCode'].unique()))
listings['PostalCode'].value_counts()

7996


H3B 3C3    61
H3C 1W4    57
H3G 2M5    52
H1V 2L1    45
H3G 2H5    45
           ..
H9C 1E2     1
H9E 1G1     1
H9C 2H4     1
H9H 2M7     1
H3Y 3C3     1
Name: PostalCode, Length: 7996, dtype: int64

We decided to drop the `Address` and `PostalCode` columns because they are too specific. The remaining location-related columns are `Lat`, `Lng`, `City` and `District`. During model fitting we will check whether all of them are needed.

In [19]:
listings.drop(columns=['Address', 'PostalCode'], inplace=True)

In [20]:
listings.head()

Unnamed: 0,MlsNumber,CategoryCode,Lat,Lng,City,District,LongDescription,BuyPrice,BuyPriceDesc,Category,...,NumberOfBedrooms,NumberOfBathroomsWithShower,NumberOfBathroomsWithBathtub,NumberOfFireplaces,NumberOfGarages,NumberOfParkingLots,NumberOfPools,NbBordEaux,IsVirtualTour,PageTitle
0,9027411,COP,45.556508,-73.633198,Ahuntsic-Cartierville,Ahuntsic Central,PRIX POUR VENTE RAPIDE. Superbe condo ensoleil...,187500.0,,"Condo/Loft, Apartment, Detached",...,1,0,1,0,0,1,0,0,False,Condos for sale in Ahuntsic-Cartierville (Mont...
1,9030205,PPR,45.542062,-73.696868,Ahuntsic-Cartierville,Nouveau Bordeaux,"Duplex à vendre à Ahuntsic près de la rivière,...",574900.0,,"Duplex, Detached",...,4,0,1,0,0,1,0,0,True,Duplexes for sale in Ahuntsic-Cartierville (Mo...
2,9041238,PCI,45.573887,-73.649079,Ahuntsic-Cartierville,Ahuntsic East,rare sur le marché! ideal pour dentiste ou med...,475000.0,,Commercial,...,0,0,0,0,0,1,0,0,False,Commercial building for sale in Ahuntsic-Carti...
3,9048590,PPR,45.57104,-73.643413,Ahuntsic-Cartierville,Ahuntsic East,Excellent opportunity for builder/ contractor....,545000.0,,"Triplex, Detached",...,2,0,1,0,0,0,0,0,False,Triplexes for sale in Ahuntsic-Cartierville (M...
4,9066959,COP,45.533876,-73.672649,Ahuntsic-Cartierville,Nouveau Bordeaux,Beautiful condo built in 2011 still under guar...,200000.0,,"Condo/Loft, Apartment, Detached",...,1,0,1,0,0,0,0,0,True,Condos for sale in Ahuntsic-Cartierville (Mont...


### 1.7. Other columns

#### 1.7.1 `LongDescription`
From this column we will keep the following data:
1. Description length,
2. Description language.

In [21]:
listings['LongDescription'].head(n=10)

0    PRIX POUR VENTE RAPIDE. Superbe condo ensoleil...
1    Duplex à vendre à Ahuntsic près de la rivière,...
2    rare sur le marché! ideal pour dentiste ou med...
3    Excellent opportunity for builder/ contractor....
4    Beautiful condo built in 2011 still under guar...
5    One bedroom condo in a sought out area of Ahun...
6    Grand duplex avec garage dans un secteur très ...
7    Hard to imagine a more inspiring space than th...
8    You want to become a Florist  !!! So this busi...
9    Great Deal !!! . 4plex for sale in the heart o...
Name: LongDescription, dtype: object

In [22]:
def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp = spacy.load('en_core_web_sm')
Language.factory("language_detector", func=get_lang_detector)
nlp.add_pipe('language_detector', last=True)

listings['DescriptionLanguage'] = listings['LongDescription'].apply(lambda text: nlp(str(text))._.language['language'])

In [23]:
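# Select rare languages from the value_counts() index; unique() returns values
# in a different order, so it cannot be masked with the counts directly
lang_counts = listings['DescriptionLanguage'].value_counts()
rare_languages = set(lang_counts[lang_counts < 10].index)
listings['DescriptionLanguage'] = listings['DescriptionLanguage'].apply(lambda lang: 'OTHER' if lang in rare_languages else lang)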

In [24]:
listings['DescriptionLanguage'].unique()

array(['fr', 'en', 'tl', 'de', 'it', 'OTHER'], dtype=object)

Languages occurring fewer than 10 times are grouped into the `OTHER` category.

In [25]:
listings['DescriptionLength'] = listings['LongDescription'].apply(lambda text: len(str(text)))
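
Note that `LongDescription` has missing values (12922 non-null out of 14667 rows), and `str(text)` turns a missing value into the string `'nan'`, which has length 3. The sketch below, on made-up descriptions, shows one possible way to treat missing descriptions as empty text and to keep a separate flag; `safe_length` and `has_description` are hypothetical names, not columns used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

descriptions = pd.Series(['Beautiful condo', np.nan, 'Grand duplex avec garage'])

# str(np.nan) is the string 'nan', so a missing description gets length 3
naive_length = descriptions.apply(lambda text: len(str(text)))

# Treat missing descriptions as empty text instead
safe_length = descriptions.fillna('').str.len()
has_description = descriptions.notna()

print(naive_length.tolist(), safe_length.tolist(), has_description.tolist())
```

The same caveat applies to language detection, where the text `'nan'` would be passed to the detector for rows without a description.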

In [26]:
listings.drop(columns=['LongDescription'], inplace=True)

In [27]:
listings.head()

Unnamed: 0,MlsNumber,CategoryCode,Lat,Lng,City,District,BuyPrice,BuyPriceDesc,Category,Construction,...,NumberOfBathroomsWithBathtub,NumberOfFireplaces,NumberOfGarages,NumberOfParkingLots,NumberOfPools,NbBordEaux,IsVirtualTour,PageTitle,DescriptionLanguage,DescriptionLength
0,9027411,COP,45.556508,-73.633198,Ahuntsic-Cartierville,Ahuntsic Central,187500.0,,"Condo/Loft, Apartment, Detached",1988,...,1,0,0,1,0,0,False,Condos for sale in Ahuntsic-Cartierville (Mont...,fr,348
1,9030205,PPR,45.542062,-73.696868,Ahuntsic-Cartierville,Nouveau Bordeaux,574900.0,,"Duplex, Detached",1955,...,1,0,0,1,0,0,True,Duplexes for sale in Ahuntsic-Cartierville (Mo...,fr,347
2,9041238,PCI,45.573887,-73.649079,Ahuntsic-Cartierville,Ahuntsic East,475000.0,,Commercial,1959,...,0,0,0,1,0,0,False,Commercial building for sale in Ahuntsic-Carti...,fr,349
3,9048590,PPR,45.57104,-73.643413,Ahuntsic-Cartierville,Ahuntsic East,545000.0,,"Triplex, Detached",1956,...,1,0,0,0,0,0,False,Triplexes for sale in Ahuntsic-Cartierville (M...,en,127
4,9066959,COP,45.533876,-73.672649,Ahuntsic-Cartierville,Nouveau Bordeaux,200000.0,,"Condo/Loft, Apartment, Detached",2011,...,1,0,0,0,0,0,True,Condos for sale in Ahuntsic-Cartierville (Mont...,en,285


`CategoryCode` contains similar but more general information about the category, so we will drop the `Category` column.

In [28]:
listings['Category'].value_counts()

Condo/Loft, Apartment, Detached                                      3338
Condo/Loft, Apartment, Attached                                      2255
Condo/Loft, Apartment, Semi-detached                                 1385
Single Family, Two or more storey, Detached                          1344
Single Family, Bungalow, Detached                                     647
                                                                     ... 
Revenue Property, Attached, 11 units                                    1
Revenue Property, Attached, 5 plex on 2 streets (Boyer & Mentana)       1
Lot, 4,661 sqft                                                         1
Revenue Property, Detached, 8 Appartments                               1
Lot, 7,003 sqft                                                         1
Name: Category, Length: 408, dtype: int64

In [29]:
listings['CategoryCode'].value_counts()

COP    7698
UNI    3522
PPR    2587
PCI     719
TER     140
FER       1
Name: CategoryCode, dtype: int64

In [23]:
all_categories = set()
listings['Category'].apply(lambda s: all_categories.update(s.split(', ')))
len(all_categories)

331

The `PageTitle` column will be dropped because it contains mainly the address and building category, which we already have in other columns.

In [24]:
listings['PageTitle'].value_counts()

Condos for sale in Mercier/Hochelaga-Maisonneuve (Montréal), Montréal (Island) – 2540, Av. Bilaudeau, 17659544 | Centris                                  1
Condos for sale in Montréal-Nord (Montréal), Montréal (Island) – 3681, Boul. Henri-Bourassa E., apt. 302, 15086430 | Centris                              1
Condos for sale in Ville-Marie (Montréal), Montréal (Island) – 2020, Rue Falardeau, apt. 101, 28792616 | Centris                                          1
Houses for sale in Lachine (Montréal), Montréal (Island) – 325, Tsse Louis-Basile-Pigeon, 28901276 | Centris                                              1
Commercial building for sale in Verdun/Île-des-Soeurs (Montréal), Montréal (Island) – 3135, Rue Wellington, 28572456 | Centris                            1
                                                                                                                                                         ..
Condos for sale in Ville-Marie (Montréal), Montréal (Island) – 6

In [30]:
listings.drop(columns=['PageTitle', 'Category'], inplace=True)

In [31]:
listings.head()

Unnamed: 0,MlsNumber,CategoryCode,Lat,Lng,City,District,BuyPrice,BuyPriceDesc,Construction,IsOpenHouse,...,NumberOfBathroomsWithShower,NumberOfBathroomsWithBathtub,NumberOfFireplaces,NumberOfGarages,NumberOfParkingLots,NumberOfPools,NbBordEaux,IsVirtualTour,DescriptionLanguage,DescriptionLength
0,9027411,COP,45.556508,-73.633198,Ahuntsic-Cartierville,Ahuntsic Central,187500.0,,1988,True,...,0,1,0,0,1,0,0,False,fr,348
1,9030205,PPR,45.542062,-73.696868,Ahuntsic-Cartierville,Nouveau Bordeaux,574900.0,,1955,False,...,0,1,0,0,1,0,0,True,fr,347
2,9041238,PCI,45.573887,-73.649079,Ahuntsic-Cartierville,Ahuntsic East,475000.0,,1959,False,...,0,0,0,0,1,0,0,False,fr,349
3,9048590,PPR,45.57104,-73.643413,Ahuntsic-Cartierville,Ahuntsic East,545000.0,,1956,False,...,0,1,0,0,0,0,0,False,en,127
4,9066959,COP,45.533876,-73.672649,Ahuntsic-Cartierville,Nouveau Bordeaux,200000.0,,2011,False,...,0,1,0,0,0,0,0,True,en,285


## 2. Merging data frames

### 2.1. `extra_data_modified.csv`

In [32]:
extra_data = pd.read_csv('data/extra_data_modified.csv')
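
The cell below groups column names of `extra_data` that are near-duplicates of each other and merges each group into one `var_i` column. The core idea, a pairwise string similarity compared against a threshold, can be sketched on a toy list. The names below are invented for illustration, and only `difflib` is used here (the notebook additionally takes the maximum with `Levenshtein.ratio`).

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical column names; the real ones come from extra_data.columns
names = ['heating system', 'heating-system', 'water supply', 'pool']

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after removing spaces and hyphens."""
    strip = str.maketrans('', '', '- ')
    return SequenceMatcher(None, a.translate(strip), b.translate(strip)).ratio()

threshold = 0.6
# Pairs at or above the threshold are treated as the same variable
similar_pairs = [(a, b) for a, b in combinations(names, 2)
                 if similarity(a, b) >= threshold]
print(similar_pairs)
```

A lower threshold merges more columns at the risk of joining unrelated ones, so the resulting groups are worth inspecting by eye.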

In [42]:
from itertools import combinations
from collections import Counter
import difflib
import Levenshtein

threshold = 0.6
minGroupSize = 1
stripJunk = str.maketrans("","","- ")


columns_lower = [c.lower().replace('-', ' ').replace(',', ' ').replace('/', ' ') for c in extra_data.columns]

def getRatio(a, b):
    a = a.lower().translate(stripJunk)
    b = b.lower().translate(stripJunk)
    return np.max([difflib.SequenceMatcher(isjunk=None, a=a, b=b).ratio(), Levenshtein.ratio(a, b)])

data = columns_lower
paired = {c:{c} for c in data}
count = 0
for a, b in combinations(data,2):
    if getRatio(a,b) < threshold:
        count += 1
        continue
    else:
        paired[a].add(b)
        paired[b].add(a)

# Greedily extract groups in which every pair of names is similar
groups = []
ungrouped = set(data)
while ungrouped:
    bestGroup = set()
    for col in ungrouped:
        # Names similar to col, narrowed down to names similar to each other
        g = paired[col] & ungrouped
        for c in g.copy():
            g &= paired[c]
        if len(g) > len(bestGroup):
            bestGroup = g
    if len(bestGroup) < minGroupSize : break  # to terminate grouping early change minGroupSize to 3
    ungrouped -= bestGroup
    groups.append(bestGroup)

new_variable_aliases = {}
for i, g in enumerate(groups):
    for column in g:
        new_variable_aliases[column] = f'var_{i}'

df_living_area = extra_data.loc[:, ['MlsNumber', 'LivingArea']]
df_other_cols = extra_data.drop(columns=['LivingArea'])
df = df_other_cols.melt(id_vars=['MlsNumber'])
df['variable'] = df['variable'].apply(lambda x: new_variable_aliases[x.lower().replace('-', ' ').replace(',', ' ').replace('/', ' ')])
df = df.groupby(by=['MlsNumber', 'variable']).max().reset_index()
df = df.pivot(index=['MlsNumber'], columns='variable', values='value').reset_index()

print(df.head())
print(groups)

variable  MlsNumber  var_0  var_1  var_10  var_100  var_101  var_102  var_103  \
0           8459190      0      0       0        0        0        0        0   
1           8529438      0      0       0        0        0        0        0   
2           8529809      0      0       0        0        0        0        0   
3           8539583      0      0       0        0        0        0        0   
4           8559001      0      0       0        0        0        0        0   

variable  var_104  var_105  ...  var_90  var_91  var_92  var_93  var_94  \
0               0        0  ...       0       0       0       0       0   
1               0        0  ...       0       0       0       0       0   
2               0        0  ...       0       0       0       0       0   
3               0        0  ...       0       0       0       0       0   
4               0        0  ...       0       0       0       0       0   

variable  var_95  var_96  var_97  var_98  var_99  
0          

In [43]:
extra_data_final = pd.merge(df, df_living_area, on='MlsNumber', how='left')
len(extra_data_final)

9016

In [73]:
final_df = pd.merge(listings, extra_data_final, on='MlsNumber', how='left')

In [74]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14672 entries, 0 to 14671
Columns: 306 entries, MlsNumber to LivingArea
dtypes: bool(2), float64(283), int64(14), object(7)
memory usage: 34.2+ MB
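
`listings` had 14667 rows before this merge and `final_df` has 14672, which suggests that some `MlsNumber` values occur more than once in one of the inputs. A minimal sketch of how such duplicates could be detected and guarded against is shown below; the toy frames `left` and `right` are stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for `listings` and `extra_data_final`
left = pd.DataFrame({'MlsNumber': [1, 2, 3], 'BuyPrice': [100.0, 200.0, 300.0]})
right = pd.DataFrame({'MlsNumber': [1, 2, 2], 'LivingArea': [50.0, 70.0, 70.0]})

# Keys that appear more than once on the right-hand side
duplicated_keys = right.loc[right['MlsNumber'].duplicated(), 'MlsNumber'].unique()
print(duplicated_keys)

# Without deduplication the left join grows from 3 to 4 rows
grown = pd.merge(left, right, on='MlsNumber', how='left')

# After deduplication, validate='many_to_one' raises pandas.errors.MergeError
# if the right-hand keys are still not unique
right_unique = right.drop_duplicates(subset='MlsNumber')
merged = pd.merge(left, right_unique, on='MlsNumber', how='left',
                  validate='many_to_one')
print(len(grown), len(merged))
```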


In [75]:
final_df.loc[final_df['BuyPriceDesc'] == ' /square foot', 'LivingArea']

505     NaN
519     NaN
714     NaN
2717    NaN
5583    NaN
6398    NaN
6534    NaN
6739    NaN
9158    NaN
10237   NaN
10251   NaN
10279   NaN
10365   NaN
10368   NaN
10381   NaN
10402   NaN
10596   NaN
10617   NaN
10634   NaN
14065   NaN
Name: LivingArea, dtype: float64

In [76]:
# .copy() avoids modifying a view of the original frame in the next line
final_df = final_df.loc[final_df['BuyPriceDesc'] != ' /square foot'].copy()
final_df.drop(columns=['BuyPriceDesc'], inplace=True)

These observations are removed because their `LivingArea` is missing, so the total price cannot be derived from the price per square foot. The `BuyPriceDesc` column is then no longer needed and is dropped as well.

### 2.2. `policeCoord.csv` & `firestations.csv`

In [77]:
police_coord = pd.read_csv('data/policeCoord.csv').rename({'long': 'lng'}, axis=1)
firestations_coord = pd.read_csv('data/firestations.csv')

In [78]:
police_coord.head()

Unnamed: 0,lng,lat
0,-73.542094,45.583362
1,-73.617431,45.577735
2,-73.545913,45.556857
3,-73.701026,45.528533
4,-73.593463,45.585063


In [79]:
firestations_coord.head()

Unnamed: 0,address,lat,lng
0,550 Blvd. St. Pierre,45.502644,-73.559248
1,230 Blvd.Chèvremont,45.493473,-73.876561
2,13795 Blvd. Pierrefonds,45.4933,-73.849787
3,13 rue Centre Commercial,45.509543,-73.806234
4,10 Sunnydale,45.493874,-73.806822


Adding distances to the closest fire station and police station to the merged data frame:

In [80]:
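def get_distance_to_closest_building(lat: float, lng: float, 
                                     buildings: pd.DataFrame) -> float:
    """
    Returns the approximate distance (in kilometres) from the point at
    latitude lat and longitude lng to the closest building in the data
    frame passed as the buildings parameter ('lat' and 'lng' columns)
    """
    # Euclidean distance in degrees to every building, computed at once
    # and without adding a column to the buildings data frame
    distances = np.sqrt(
        (buildings['lat'] - lat)**2 + 
        (buildings['lng'] - lng)**2)
    # *111 converts degrees of latitude to km; a degree of longitude is
    # shorter at Montréal's latitude, so east-west distances are overestimated
    return distances.min() * 111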
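
Multiplying a Euclidean distance in degrees by 111 treats a degree of longitude like a degree of latitude, which is only a rough approximation at Montréal's latitude. If more accurate distances are wanted, the haversine formula is one common option. The sketch below is an assumption about how it could be applied here, not the method used in this notebook; `haversine_km`, `distance_to_closest_km` and the sample coordinates are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km; inputs in degrees, arrays are broadcast."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def distance_to_closest_km(lat, lng, station_lats, station_lngs):
    """Distance from one point to the nearest of several stations."""
    return float(np.min(haversine_km(lat, lng, station_lats, station_lngs)))

# Hypothetical coordinates near Montréal
station_lats = np.array([45.50, 45.60])
station_lngs = np.array([-73.60, -73.60])
print(distance_to_closest_km(45.50, -73.50, station_lats, station_lngs))
```

With the real data, the station arrays would come from the `lat` and `lng` columns of `firestations_coord` and `police_coord`.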

In [81]:
final_df['DistanceToFireStation'] = final_df.apply(
    lambda row: get_distance_to_closest_building(
        row['Lat'], row['Lng'], firestations_coord), axis=1)
final_df['DistanceToPoliceStation'] = final_df.apply(
    lambda row: get_distance_to_closest_building(
        row['Lat'], row['Lng'], police_coord),
    axis=1)

In [82]:
final_df.head()

Unnamed: 0,MlsNumber,CategoryCode,Lat,Lng,City,District,BuyPrice,Construction,IsOpenHouse,GenreCode,...,var_93,var_94,var_95,var_96,var_97,var_98,var_99,LivingArea,DistanceToFireStation,DistanceToPoliceStation
0,9027411,COP,45.556508,-73.633198,Ahuntsic-Cartierville,Ahuntsic Central,187500.0,1988,True,AP,...,,,,,,,,,1.34448,2.138863
1,9030205,PPR,45.542062,-73.696868,Ahuntsic-Cartierville,Nouveau Bordeaux,574900.0,1955,False,2X,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.145066,1.571043
2,9041238,PCI,45.573887,-73.649079,Ahuntsic-Cartierville,Ahuntsic East,475000.0,1959,False,C,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.648593,0.775819
3,9048590,PPR,45.57104,-73.643413,Ahuntsic-Cartierville,Ahuntsic East,545000.0,1956,False,3X,...,,,,,,,,,1.065887,1.1651
4,9066959,COP,45.533876,-73.672649,Ahuntsic-Cartierville,Nouveau Bordeaux,200000.0,2011,False,AP,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,66.89,2.405775,3.175101


### 2.3. `montreal_hpi`
We won't add this to our data frame because it describes the market as a whole over time rather than individual listings.

In [83]:
montreal_hpi = pd.read_csv('data/montreal_hpi.csv')

All columns in this data frame are extremely strongly correlated with each other.

In [84]:
montreal_hpi.head()

Unnamed: 0,Date,Composite_HPI,Single_Family_HPI,One_Storey_HPI,Two_Storey_HPI,Townhouse_HPI,Apartment_HPI,Composite_Benchmark,Single_Family_Benchmark,One_Storey_Benchmark,Two_Storey_Benchmark,Townhouse_Benchmark,Apartment_Benchmark
0,Jan 2005,100.0,100.0,100.0,100.0,100.0,100.0,192200,200200,165400,245400,178600,163300
1,Feb 2005,101.3,101.3,101.6,100.9,104.8,100.9,194700,202800,168100,247600,187200,164800
2,Mar 2005,102.4,102.6,103.1,102.0,104.6,101.4,196800,205400,170600,250300,186800,165600
3,Apr 2005,103.2,103.4,104.1,102.5,105.7,102.1,198400,207000,172200,251500,188800,166800
4,May 2005,103.6,103.9,104.7,102.8,104.9,102.4,199100,208000,173200,252300,187400,167300


### 2.4. `sociodemo_modified.csv`

In [85]:
sociodemo = pd.read_csv('data/sociodemo_modified.csv')

In [86]:
sociodemo.head()

Unnamed: 0,Municipality,Population,Variation,Density,avgAge,below15,below24,below44,below64,below65,...,University,College,Secondary,Apprentice,No,NonImmigrant,Immigrant,french,English,Others
0,Ahuntsic-Cartierville,126891,2,5256,41,16,11,29,26,18,...,36,16,20,10,18,60,40,64,10,26
1,Anjou,41928,3,3064,43,16,10,24,28,22,...,24,15,23,15,23,71,29,76,8,16
2,Baie-d'Urfé,3850,-1,639,43,18,14,16,32,20,...,57,13,18,3,9,70,30,19,72,9
3,Beaconsfield,19505,2,1772,40,19,15,18,32,16,...,51,18,18,5,8,76,24,21,73,6
4,Côte-des-Neiges/Notre-Dame-de-Grâce (Montréal),165031,0,7699,38,16,14,33,23,14,...,48,15,20,6,11,47,53,32,39,29


Reducing number of columns:

In [41]:
# The code below is copied from the EDA, slightly changed to keep the Municipality column, which is needed for merging

In [87]:
# Get correlation matrix (numeric columns only; Municipality is a string)
corr = sociodemo.select_dtypes('number').corr()

# Create a mask for correlations above 0.9
# but below 1.0, which excludes each variable's correlation with itself
mask = (corr > 0.9) & (corr < 1.0)
high_corr = corr[mask]

# Flag the columns that have no high correlation with any other column
col_to_filter_out = ~high_corr.any()

# Leave Municipality column
cols_to_leave = list(high_corr.columns[col_to_filter_out])
cols_to_leave.insert(0, 'Municipality')

# Apply new mask
sociodemo = sociodemo[cols_to_leave]
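
The mask above removes every column that has at least one correlation above 0.9, so both members of a highly correlated pair are dropped. If keeping one representative per pair is preferred, a common variant looks only at the upper triangle of the correlation matrix. The sketch below uses toy data and absolute correlations, both of which are assumptions rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data: `b` is almost a copy of `a`, `c` is unrelated
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [1.1, 2.0, 3.2, 3.9, 5.1],
    'c': [2.0, -1.0, 0.5, 3.0, -2.0],
})

corr = df.corr().abs()
# Keep only the part above the diagonal so each pair is seen once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop the later column of every pair with correlation above 0.9
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```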

Removing "(Montréal)" from the end of `Municipality` values:

In [88]:
sociodemo['Municipality'] = sociodemo['Municipality'].apply(
    lambda city: city[:-11] 
                 if city.endswith('(Montréal)') 
                 else city
)

In [89]:
np.sort(sociodemo['Municipality'].unique())

array(['Ahuntsic-Cartierville', 'Anjou', "Baie-d'Urfé", 'Beaconsfield',
       'Côte-Saint-Luc', 'Côte-des-Neiges/Notre-Dame-de-Grâce',
       'Dollard-Des Ormeaux', 'Dorval', 'Hampstead', 'Kirkland',
       "L'Île-Bizard/Sainte-Geneviève", 'LaSalle', 'Lachine',
       'Le Plateau-Mont-Royal', 'Le Sud-Ouest',
       'Mercier/Hochelaga-Maisonneuve', 'Mont-Royal', 'Montréal-Est',
       'Montréal-Nord', 'Montréal-Ouest', 'Outremont',
       'Pierrefonds-Roxboro', 'Pointe-Claire',
       'Rivière-des-Prairies/Pointe-aux-Trembles',
       'Rosemont/La Petite-Patrie', 'Saint-Laurent', 'Saint-Léonard',
       'Sainte-Anne-de-Bellevue', 'Senneville', 'Verdun/Île-des-Soeurs',
       'Ville-Marie', 'Villeray/Saint-Michel/Parc-Extension', 'Westmount'],
      dtype=object)

In [90]:
np.sort(final_df['City'].unique())

array(['Ahuntsic-Cartierville', 'Anjou', "Baie-d'Urfé", 'Beaconsfield',
       'Côte-Saint-Luc', 'Côte-des-Neiges/Notre-Dame-de-Grâce',
       'Dollard-Des Ormeaux', 'Dorval', 'Hampstead', 'Kirkland',
       "L'Île-Bizard/Sainte-Geneviève", 'LaSalle', 'Lachine',
       'Le Plateau-Mont-Royal', 'Le Sud-Ouest',
       'Mercier/Hochelaga-Maisonneuve', 'Mont-Royal', 'Montréal-Est',
       'Montréal-Nord', 'Montréal-Ouest', 'Outremont',
       'Pierrefonds-Roxboro', 'Pointe-Claire',
       'Rivière-des-Prairies/Pointe-aux-Trembles',
       'Rosemont/La Petite-Patrie', 'Saint-Laurent', 'Saint-Léonard',
       'Sainte-Anne-de-Bellevue', 'Senneville', 'Verdun/Île-des-Soeurs',
       'Ville-Marie', 'Villeray/Saint-Michel/Parc-Extension', 'Westmount'],
      dtype=object)
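
Comparing the two printed arrays by eye works for 33 names, but the check can also be done programmatically. A minimal sketch on toy series (stand-ins for `final_df['City']` and `sociodemo['Municipality']`):

```python
import pandas as pd

# Toy stand-ins for final_df['City'] and sociodemo['Municipality']
city = pd.Series(['Anjou', 'Lachine', 'Westmount', 'Anjou'])
municipality = pd.Series(['Anjou', 'Lachine', 'Verdun'])

# Names present on one side only would be lost or unmatched in the merge
only_in_listings = sorted(set(city) - set(municipality))
only_in_sociodemo = sorted(set(municipality) - set(city))
print(only_in_listings, only_in_sociodemo)
```

Two empty lists mean that every listing finds a matching municipality.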

Merging both data frames:

In [91]:
final_df = pd.merge(final_df, sociodemo, left_on='City', right_on='Municipality')

`MlsNumber` is no longer needed after the merges, and `Municipality` duplicates `City`, so we drop both.

In [132]:
final_df.drop(columns=['MlsNumber', 'Municipality'], inplace=True)

In [133]:
final_df

Unnamed: 0,CategoryCode,Lat,Lng,City,District,BuyPrice,Construction,IsOpenHouse,GenreCode,Utilisation,...,belowBach,College,Secondary,Apprentice,No,NonImmigrant,Immigrant,french,English,Others
0,COP,45.556508,-73.633198,Ahuntsic-Cartierville,Ahuntsic Central,187500.00,1988,True,AP,No information,...,6,16,20,10,18,60,40,64,10,26
1,PPR,45.542062,-73.696868,Ahuntsic-Cartierville,Nouveau Bordeaux,574900.00,1955,False,2X,Residential only,...,6,16,20,10,18,60,40,64,10,26
2,PCI,45.573887,-73.649079,Ahuntsic-Cartierville,Ahuntsic East,475000.00,1959,False,C,Commercial and residential,...,6,16,20,10,18,60,40,64,10,26
3,PPR,45.571040,-73.643413,Ahuntsic-Cartierville,Ahuntsic East,545000.00,1956,False,3X,Residential only,...,6,16,20,10,18,60,40,64,10,26
4,COP,45.533876,-73.672649,Ahuntsic-Cartierville,Nouveau Bordeaux,200000.00,2011,False,AP,No information,...,6,16,20,10,18,60,40,64,10,26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14647,COP,45.481498,-73.591471,Westmount,,661013.26,1986,False,AP,No information,...,7,10,15,3,6,67,33,19,72,9
14648,UNI,45.479493,-73.600415,Westmount,,1295000.00,1909,True,ME,No information,...,7,10,15,3,6,67,33,19,72,9
14649,COP,45.488253,-73.587413,Westmount,,1450000.00,1989,False,AP,No information,...,7,10,15,3,6,67,33,19,72,9
14650,UNI,45.487523,-73.608672,Westmount,,2268000.00,1914,False,ME,No information,...,7,10,15,3,6,67,33,19,72,9


## 3. Encoding categorical columns

In [134]:
oh = ce.OneHotEncoder(cols=['CategoryCode', 'City', 'District', 'Utilisation', 'DescriptionLanguage', 'GenreCode'])
final_df_encoded = oh.fit_transform(final_df)
final_df_encoded

Unnamed: 0,CategoryCode_1,CategoryCode_2,CategoryCode_3,CategoryCode_4,CategoryCode_5,CategoryCode_6,Lat,Lng,City_1,City_2,...,belowBach,College,Secondary,Apprentice,No,NonImmigrant,Immigrant,french,English,Others
0,1,0,0,0,0,0,45.556508,-73.633198,1,0,...,6,16,20,10,18,60,40,64,10,26
1,0,1,0,0,0,0,45.542062,-73.696868,1,0,...,6,16,20,10,18,60,40,64,10,26
2,0,0,1,0,0,0,45.573887,-73.649079,1,0,...,6,16,20,10,18,60,40,64,10,26
3,0,1,0,0,0,0,45.571040,-73.643413,1,0,...,6,16,20,10,18,60,40,64,10,26
4,1,0,0,0,0,0,45.533876,-73.672649,1,0,...,6,16,20,10,18,60,40,64,10,26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14647,1,0,0,0,0,0,45.481498,-73.591471,0,0,...,7,10,15,3,6,67,33,19,72,9
14648,0,0,0,1,0,0,45.479493,-73.600415,0,0,...,7,10,15,3,6,67,33,19,72,9
14649,1,0,0,0,0,0,45.488253,-73.587413,0,0,...,7,10,15,3,6,67,33,19,72,9
14650,0,0,0,1,0,0,45.487523,-73.608672,0,0,...,7,10,15,3,6,67,33,19,72,9
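
The encoded columns above are named with numeric suffixes (`CategoryCode_1`, `City_1`, ...), which makes it hard to tell which category each column stands for. As an alternative sketch, not the encoder used in this notebook, `pd.get_dummies` names the new columns after the category values. The toy frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'CategoryCode': ['COP', 'PPR', 'COP'],
    'BuyPrice': [187500.0, 574900.0, 200000.0],
})

# One indicator column per category value, named <column>_<value>
encoded = pd.get_dummies(toy, columns=['CategoryCode'], dtype=int)
print(list(encoded.columns))
```

Unlike a fitted encoder object, `get_dummies` does not remember the categories, so it is less convenient when new data has to be encoded consistently later.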


## 4. Create two dataframes
`extra_data` is missing for approximately 1/3 of the records, so we save two data frames: one with all records but without the extra-data columns, and one with only the records that have matching extra data.
The `LivingArea` column is also meant to be dropped because of its many missing values.

In [142]:
#final_df_encoded.drop(columns=['LivingArea'], inplace=True)
final_df_no_extra_data = pd.merge(final_df_encoded.iloc[:, 0:147], final_df_encoded.iloc[:,426:], left_index=True, right_index=True)
final_df_with_extra_data = final_df_encoded.loc[~final_df_encoded['var_0'].isna(), :]
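
The positional slices `0:147` and `426:` depend on the exact column order after encoding and silently select the wrong columns if that order changes. A sketch of a name-based split is shown below; it assumes the extra-data columns can be identified by the `var_` prefix plus `LivingArea`, and it uses a toy frame rather than `final_df_encoded`.

```python
import numpy as np
import pandas as pd

# Toy frame: two listing columns, two extra-data columns and LivingArea
toy = pd.DataFrame({
    'BuyPrice': [100.0, 200.0, 300.0],
    'Lat': [45.5, 45.6, 45.7],
    'var_0': [0.0, np.nan, 1.0],
    'var_1': [1.0, np.nan, 0.0],
    'LivingArea': [60.0, np.nan, np.nan],
})

# Columns that come from the extra data, identified by name
extra_cols = [c for c in toy.columns if c.startswith('var_')] + ['LivingArea']

# All rows, without the extra-data columns
no_extra = toy.drop(columns=extra_cols)
# Only rows that have extra data, without the sparse LivingArea column
with_extra = toy.loc[toy['var_0'].notna()].drop(columns=['LivingArea'])
print(list(no_extra.columns), len(with_extra))
```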

## 5. PCA
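We fit PCA as a quick check that it could later be used to reduce dimensionality. The features are not standardised here, so the dominant first component likely reflects the columns with the largest scale rather than genuine shared structure.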

In [148]:
pca = PCA(n_components=20)
pca.fit(final_df_no_extra_data)  # PCA is unsupervised and ignores y; BuyPrice is still one of the input columns
print(pca.explained_variance_ratio_)

[9.97888201e-01 2.09082834e-03 2.09239950e-05 3.62093312e-08
 3.37283428e-09 2.68259887e-09 1.34737234e-09 9.98494282e-10
 7.90637246e-10 4.54157139e-10 2.40041423e-10 1.31926449e-10
 1.09460331e-10 4.87665639e-11 2.93226414e-11 2.68471790e-11
 2.21064674e-11 1.59251050e-11 1.10668124e-11 9.01912186e-12]


In [146]:
pca = PCA(n_components=20)
pca.fit(final_df_with_extra_data)  # PCA is unsupervised and ignores y; BuyPrice is still one of the input columns
print(pca.explained_variance_ratio_)

[9.97345079e-01 2.61402436e-03 4.08091463e-05 6.83464132e-08
 6.28051654e-09 4.83637316e-09 2.44963217e-09 1.97893334e-09
 1.47383450e-09 8.31428238e-10 4.92657956e-10 2.59614925e-10
 1.80462700e-10 9.76418808e-11 5.67415203e-11 4.94570614e-11
 4.08221711e-11 3.04142250e-11 2.02077185e-11 1.44994594e-11]
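
In both runs the first component explains more than 99% of the variance, which is typical when unscaled columns with very different magnitudes are passed to PCA. The sketch below, on synthetic data with invented distributions, illustrates the effect of standardising the features first and of leaving the target out. It is an assumption about a possible next step, not a result from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy data: columns on very different scales
toy = pd.DataFrame({
    'BuyPrice': rng.normal(500_000, 100_000, 200),
    'Population': rng.normal(100_000, 30_000, 200),
    'NumberOfRooms': rng.normal(5, 2, 200),
    'Lat': rng.normal(45.5, 0.05, 200),
})

features = toy.drop(columns=['BuyPrice'])  # keep the target out of the PCA

# PCA on raw features: the largest-scale column dominates
raw_pca = PCA(n_components=2).fit(features)
# PCA after standardisation: every feature contributes on the same scale
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(features)

print(raw_pca.explained_variance_ratio_)
print(scaled_pca.named_steps['pca'].explained_variance_ratio_)
```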


## 6. Exporting final data frame

In [149]:
final_df_no_extra_data.to_csv('data/final_df_no_extra_data.csv', encoding='utf-8')
final_df_with_extra_data.to_csv('data/final_df_with_extra_data.csv', encoding='utf-8')