# Lyons Housing Data Set

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the data; change the file path below to match your own Drive

infile= "/content/drive/MyDrive/Colab Notebooks/Spring 2024/Machine Learning and Data Mining /Data/lyon_housing.csv"
lyon=pd.read_csv(infile)

In [None]:
lyon.head()

Unnamed: 0,date_transaction,type_achat,type_bien,nombre_pieces,surface_logement,surface_carrez_logement,surface_terrain,nombre_parkings,prix,adresse,commune,latitude,longitude,date_construction,anciennete
0,2019-10-31,ancien,maison,5,100.0,,247.0,0,530000.0,6 PAS DES ANTONINS,Villeurbanne,45.781673,4.879333,2003-06-11 11:38:24,16.387783
1,2018-11-26,ancien,maison,2,52.0,,156.0,0,328550.0,12 RUE DU LUIZET,Villeurbanne,45.78324,4.884683,2003-06-11 11:38:24,15.459633
2,2016-08-04,ancien,appartement,1,28.0,28.2,0.0,1,42500.0,4 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.148839
3,2016-11-18,ancien,appartement,3,67.0,66.3,0.0,1,180900.0,6 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.439058
4,2016-12-16,ancien,appartement,1,28.0,,0.0,1,97000.0,163 AV ROGER SALENGRO,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.515719


This data is from https://www.kaggle.com/benoitfavier/lyon-housing

type_achat- ancien means an existing property, VEFA means a sale prior to completion (off-plan)

type_bien- house (maison) or apartment (appartement)

nombre_pieces- number of rooms, use this

surface_logement- interior space in square meters

surface_carrez_logement- floor area as measured under the Carrez law, which counts only space with a ceiling height of at least 1.80 m (drop this variable)

surface_terrain- land area, drop this

nombre_parkings- parking spots

prix- selling price, predict this

anciennete- age of the property in years at the time of the sale (negative when the sale came before construction)



Convert the dates into pandas datetime variables so we can extract the year of construction and the year of the sale.

It is also possible to extract the quarter or the month from the datetime variables, which we could use to look for seasonality in prices if so inclined.

For now, extract the year of the sale; that is a categorical variable we will want.
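As a small illustration of the `.dt` accessor, here is a sketch on a three-row toy frame (the name `sales` is made up for this example; the dates are the first three `date_transaction` values shown above). It pulls out the year, quarter, and month in the same way.

```python
import pandas as pd

# Toy frame; `sales` is a hypothetical name used only in this sketch.
sales = pd.DataFrame({"date_transaction": ["2019-10-31", "2018-11-26", "2016-08-04"]})
sales["date_transaction"] = pd.to_datetime(sales["date_transaction"])

# The .dt accessor exposes calendar parts of a datetime column.
sales["year"] = sales["date_transaction"].dt.year
sales["quarter"] = sales["date_transaction"].dt.quarter
sales["month"] = sales["date_transaction"].dt.month
print(sales)
```

Grouping prices by `quarter` or `month` would be one way to start looking for seasonality.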

In [None]:
lyon['date_transaction']=pd.to_datetime(lyon['date_transaction'])

In [None]:
lyon['year_transaction']=lyon['date_transaction'].dt.year

In [None]:
lyon['date_construction']=pd.to_datetime(lyon['date_construction'])



In [None]:
lyon['year_construction']=lyon['date_construction'].dt.year

In [None]:
lyon.head()

Unnamed: 0,date_transaction,type_achat,type_bien,nombre_pieces,surface_logement,surface_carrez_logement,surface_terrain,nombre_parkings,prix,adresse,commune,latitude,longitude,date_construction,anciennete,year_transaction,year_construction
0,2019-10-31,ancien,maison,5,100.0,,247.0,0,530000.0,6 PAS DES ANTONINS,Villeurbanne,45.781673,4.879333,2003-06-11 11:38:24,16.387783,2019,2003
1,2018-11-26,ancien,maison,2,52.0,,156.0,0,328550.0,12 RUE DU LUIZET,Villeurbanne,45.78324,4.884683,2003-06-11 11:38:24,15.459633,2018,2003
2,2016-08-04,ancien,appartement,1,28.0,28.2,0.0,1,42500.0,4 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.148839,2016,2003
3,2016-11-18,ancien,appartement,3,67.0,66.3,0.0,1,180900.0,6 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.439058,2016,2003
4,2016-12-16,ancien,appartement,1,28.0,,0.0,1,97000.0,163 AV ROGER SALENGRO,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.515719,2016,2003


In [None]:
# how about the age of the property?
lyon['anciennete'].describe()

count    40516.000000
mean        21.246938
std          9.397379
min         -3.853563
25%         15.064690
50%         26.571388
75%         28.775403
max         31.494144
Name: anciennete, dtype: float64

# Convert a continuous variable into a categorical one using pandas cut

There is too much detail in the age of the properties, so use the cut function in pandas to convert it to a limited number of categories. Negative ages (sales made before construction finished) get their own category.
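To see how `pd.cut` treats the bin edges, here is a minimal sketch on a handful of made-up ages, using the same bins and labels as the cell below. By default each bin includes its right edge and excludes its left edge, and values outside the outermost edges become missing.

```python
import pandas as pd

# Made-up ages chosen to land on and around the bin edges.
ages = pd.Series([-3.85, 0.0, 4.9, 5.0, 16.4, 31.5, 45.0])
bins = [-5, 0, 5, 10, 20, 30, 40]
labels = ['UnderConstruction', '0-5', '5-10', '10-20', '20-30', '30+']

# Default right=True: intervals are (-5, 0], (0, 5], (5, 10], and so on.
binned = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"age_years": ages, "category": binned}))
```

Note that an age of exactly 0.0 falls in `UnderConstruction` and 45.0 falls outside the last bin, so it comes back as `NaN`. It is worth checking `lyon['age'].isna().sum()` after binning the real column.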



In [None]:
temp=pd.cut(lyon.anciennete,bins=[-5,0,5,10,20,30,40],labels=['UnderConstruction','0-5','5-10','10-20','20-30','30+'])

In [None]:
lyon['age']=temp

In [None]:
lyon.head(3)

Unnamed: 0,date_transaction,type_achat,type_bien,nombre_pieces,surface_logement,surface_carrez_logement,surface_terrain,nombre_parkings,prix,adresse,commune,latitude,longitude,date_construction,anciennete,year_transaction,year_construction,age
0,2019-10-31,ancien,maison,5,100.0,,247.0,0,530000.0,6 PAS DES ANTONINS,Villeurbanne,45.781673,4.879333,2003-06-11 11:38:24,16.387783,2019,2003,10-20
1,2018-11-26,ancien,maison,2,52.0,,156.0,0,328550.0,12 RUE DU LUIZET,Villeurbanne,45.78324,4.884683,2003-06-11 11:38:24,15.459633,2018,2003,10-20
2,2016-08-04,ancien,appartement,1,28.0,28.2,0.0,1,42500.0,4 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.148839,2016,2003,10-20


Okay, that's as far as I will go in addressing a couple of issues there. Your turn.

# Build your predictor of housing prices in Lyons

Predictors- use at least these variables:

type_achat, type_bien, nombre_pieces, surface_logement, nombre_parkings, commune(?), year_transaction (as a category, not an integer), age (category)
  
-one-hot encode the categories

-standard scale the other data

-combine the standard-scaled and the one-hot data into a pandas DataFrame

-build a neural net regressor, a nearest neighbor regressor, and a linear regressor

-use some metrics: what is the MSE? the R2? the mean absolute error?

-use cross validation to figure out which model seems to be best

-use ELI5 to understand what the most important predictors are
                                                                                                    
                                                                                                        