# Du-Proprio ML Project
### Data Innovation Society
#### Matteo Esposito, Victor Debray, Edric Jung

> TODO
* ~~Preprocess Data for Geocoding (fix wrongly formatted addresses or remove)~~
* Geocode (find postal codes using addresses)
* Exploratory data analysis
* Model (NN)
* Metrics

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re

%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

In [3]:
# Load the scraped listings
df = pd.read_csv(os.path.join(os.getcwd(), 'data', 'du_proprio_scraped_FULL.csv'))

## Data Preprocessing

In [4]:
from IPython.display import display

def displayAll(df):
    '''
    Display `df`, showing at most 20 rows and 20 columns.
    '''
    with pd.option_context('display.max_rows', 20, 'display.max_columns', 20):
        display(df)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19799 entries, 0 to 19798
Data columns (total 7 columns):
Rent Price            19799 non-null object
Address               19701 non-null object
Location              19799 non-null object
Number of Bed         19799 non-null int64
Number of Bath        19799 non-null object
Living Area (ft^2)    19799 non-null int64
Land Area (ft^2)      19799 non-null int64
dtypes: int64(3), object(4)
memory usage: 1.1+ MB


In [6]:
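# Remove rows with a missing (NaN) address: 98 of 19799 rows.
# .copy() makes the result an independent frame, so adding columns later
# does not trigger SettingWithCopyWarning.
df = df[pd.notnull(df['Address'])].copy()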
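
The same filtering can also be written with `DataFrame.dropna` and its `subset` argument. A minimal sketch on a made-up frame (the `toy` values are illustrative and are not taken from the scraped data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scraped data (values are made up).
toy = pd.DataFrame({
    "Address": ["4560, rue Bosco", np.nan, "2605, chemin Ste-Foy"],
    "Number of Bed": [3, 2, 4],
})

# Keep only rows whose Address is present.
toy_clean = toy.dropna(subset=["Address"]).copy()

# Count how many rows were removed.
n_dropped = len(toy) - len(toy_clean)
print(f"dropped {n_dropped} of {len(toy)} rows")
```

Note that the original index labels are kept, so the filtered frame has gaps in its index.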

In [15]:
# Get an idea of what type of addresses we are dealing with
df['Address'].value_counts()

Modèle à construire - Par les Constructions André Jacques inc.    4
chemin des Hauteurs                                               4
Montée du Cerf                                                    4
12370, rue Jean-Baptiste-Normand                                  3
4776-4774, AV PALM                                                3
Projet domiciliaire Le Majordome - Chemin James-McFaul            3
4560, rue Bosco                                                   3
rue Bellevue - Par Construction C.R.D                             3
rue Kildare - Par Construction C.R.D                              3
, Domaine Lévesque                                                3
Le Gardenville - Unité 2-111                                      3
2605, chemin Ste-Foy                                              3
51, avenue Au Pied de la Côte                                     3
1, rue Félix-Leclerc                                              3
1320, AV DES MELEZES                            

In [8]:
# Flag addresses that follow the "<civic number>, <street name>" format,
# i.e. digits, a comma, then text containing at least one letter.
# Doing this so the addresses work cleanly with the postal code API.
# The pattern has no capture groups, so str.contains raises no warning.
df['respects_dt_format'] = df.Address.str.contains(r'[0-9]+,.*[a-zA-Z]')
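
To see what the format check accepts and rejects, the sketch below applies an equivalent pattern (written without capture groups) to a few addresses taken from the `value_counts()` outputs in this notebook. It uses `re.search`, which is the matching behaviour `str.contains` relies on; the names `PATTERN`, `samples` and `flags` are only for illustration.

```python
import re

# Equivalent to the respects_dt_format pattern, without capture groups.
PATTERN = re.compile(r"[0-9]+,.*[a-zA-Z]")

samples = [
    "4560, rue Bosco",                    # number, comma, street name
    "12370, rue Jean-Baptiste-Normand",
    "4776-4774, AV PALM",                 # the last number is followed by a comma
    "chemin des Hauteurs",                # no civic number
    ", Domaine Lévesque",                 # comma but no number before it
    "1653-1661-1665 rue Girouard Ouest",  # numbers but no comma
]

# True when the address contains digits, a comma, then at least one letter.
flags = {s: bool(PATTERN.search(s)) for s in samples}
for address, ok in flags.items():
    print(f"{ok!s:5}  {address}")
```

The first three addresses are flagged as well formatted and the last three are not, which is the kind of entry the next cells list as bad addresses.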

In [9]:
# Create table of badly formatted addresses
df_unformatted = df[~df['respects_dt_format']]
print('% of wrongly formatted rows: {} %'.format(round(100*len(df_unformatted)/len(df),3)))

% of wrongly formatted rows: 6.812 %


In [10]:
# Look at bad entries
displayAll(df_unformatted['Address'].value_counts())

Modèle à construire - Par les Constructions André Jacques inc.              4
Montée du Cerf                                                              4
chemin des Hauteurs                                                         4
Le Gardenville - Unité 2-111                                                3
rue Bellevue - Par Construction C.R.D                                       3
rue Kildare - Par Construction C.R.D                                        3
Chemin des Cerisiers                                                        3
chemin Abercrombie                                                          3
rang Sainte-Catherine                                                       3
Projet domiciliaire Le Majordome - Chemin James-McFaul                      3
                                                                           ..
, chemin du Lac Creux                                                       1
1653-1661-1665 rue Girouard Ouest                               

In [11]:
df_well_formatted = df[df['respects_dt_format']]
print('% of well formatted rows: {} %'.format(round(100*len(df_well_formatted)/len(df),3)))

% of well formatted rows: 93.188 %
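
The two percentages can also be read directly off the boolean mask, since the mean of a boolean Series is the share of `True` values. A minimal sketch on a made-up frame (`toy` and its four addresses are illustrative only):

```python
import pandas as pd

# Toy frame: three well formatted addresses and one without a civic number.
toy = pd.DataFrame({"Address": [
    "4560, rue Bosco",
    "chemin des Hauteurs",
    "1, rue Félix-Leclerc",
    "2605, chemin Ste-Foy",
]})

# Boolean mask: True where the address has digits, a comma, then a letter.
mask = toy["Address"].str.contains(r"[0-9]+,.*[a-zA-Z]")

# The mean of a boolean mask is the fraction of True values.
pct_good = round(100 * mask.mean(), 3)
pct_bad = round(100 * (~mask).mean(), 3)

# The same mask splits the frame into the two tables.
well_formatted, unformatted = toy[mask], toy[~mask]
print(f"well formatted: {pct_good} %, wrongly formatted: {pct_bad} %")
```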


In [12]:
# Look at good entries
displayAll(df_well_formatted['Address'].value_counts())

In [13]:
# Save the 'clean' table for postal code lookups.
# Badly formatted rows are simply dropped, since they are < 7% of the data.
df_well_formatted = df_well_formatted.drop(['respects_dt_format'], axis=1)
df_well_formatted.to_csv(os.path.join(os.getcwd(), 'data', 'du_proprio_clean.csv'))
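
One thing to keep in mind when reloading the clean file: `to_csv` writes the index by default, and after row filtering that index has gaps. The sketch below shows the effect on a made-up frame, using an in-memory buffer instead of a file; passing `index=False` is an option to consider, not something the cell above does.

```python
import io
import pandas as pd

# Toy frame with gaps in the index, as happens after dropping rows.
toy = pd.DataFrame(
    {"Address": ["4560, rue Bosco", "2605, chemin Ste-Foy"]},
    index=[0, 2],
)

# Default behaviour: the index is written as an unnamed first column.
buf = io.StringIO()
toy.to_csv(buf)
with_index = pd.read_csv(io.StringIO(buf.getvalue()))

# index=False: only the data columns are written.
buf = io.StringIO()
toy.to_csv(buf, index=False)
without_index = pd.read_csv(io.StringIO(buf.getvalue()))

print(list(with_index.columns))
print(list(without_index.columns))
```

With the default, the reloaded frame carries an extra `Unnamed: 0` column holding the old index labels; with `index=False` only `Address` comes back.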