# Sommaire

### 1. Les données
### 2. Premier modèle "naif" 
### 3. Deuxième modèle sur un dataset plus "optimisé"

## 1. Les données

On remarque déjà que beaucoup des données sont qualitatives, on va donc devoir les transformer pour pouvoir utiliser sklearn. Aussi, on devra séparer les cas où les données sont nominales ou ordinales, et utiliser OneHotEncoder, ou OrdinalEncoder (sklearn) en fonction. 

Point intéressant, on possède les données latitude et longitude, qui définissent la position d'une maison. On peut ainsi se demander si city, neighbourhood et zipcode ont un intérêt significatif.

Données compledataes (à edataclure potentiellement) :
- description
- name
- amenities

Données nominales :
- property_type
- room_type (discutable)
- bed_type (discutable)
- city
- neighbourhood
- zipcode

Données ordinales : 
- cancellation_policy
- cleaning_fee
- first_review
- host_has_profile_pic
- host_identity_verified
- host_response_rate
- host_since
- instant_bookable
- last_review

In [45]:
# Importations

import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [46]:
# Chargement des données

train_data = pd.read_csv("airbnb_train.csv")
test_data = pd.read_csv("airbnb_test.csv")

In [47]:
# data = train_data
data = train_data

In [48]:
# Traitement du cas amenities

def clean_amenities(amenities_str):
    if pd.isna(amenities_str):
        return []
    cleaned_str = amenities_str.replace('{', '').replace('}', '').replace('"', '')
    amenities_list = cleaned_str.split(',')
    return amenities_list

# Application sur la colonne
if (type(data['amenities'][0]) != list): # Pour vérifier si elle n'est pas déjà transformée.
    data['amenities'] = data['amenities'].apply(clean_amenities)

# print(data['amenities'].head())

On a maintenant une bonne vision de ces données qualitatives, on va pouvoir les traiter au cas par cas, en fonction des données qu'elles renvoient.

1. Pour les booléens :
    - true : 1
    - false : 0
    - nan : 0

2. Pour les taux :
    - on va les convertir simplement en réels

3. Pour les dates :
    - On compte les secondes (timestamp())

4. Pour le reste :
    - Au cas par cas

In [49]:
def ordBoolean(data:str):
    if (data == 't'):
        return 1
    else:
        return 0
    
def ordRate(data:str):
    if pd.isna(data):
        return 0
    else:
        y = str(data)
        return float(y.replace('%', '')) / 100.0
    
def ordDate(data:str):
    if pd.isna(data):
        return 0
    else:
        return pd.to_datetime(data).timestamp()
    
def ordCancellationPolicy(data):
    if (data == 'fledataible'):
        return 0
    elif (data == 'moderate'):
        return 1
    elif (data == 'strict'):
        return 2
    elif (data == 'super_strict_30'):
        return 3
    elif (data == 'super_strict_60'):
        return 4
    else:
        return 5 # Pour les valeurs inconnues, comment choisir ? moyenne, -1, 5 ?
    
def ordCleaningFee(data:bool):
    if (data == True):
        return 1
    else:
        return 0

data['host_has_profile_pic'] = data['host_has_profile_pic'].apply(ordBoolean)
data['host_identity_verified'] = data['host_identity_verified'].apply(ordBoolean)
data['instant_bookable'] = data['instant_bookable'].apply(ordBoolean)
data['host_response_rate'] = data['host_response_rate'].apply(ordRate)
data['first_review'] = data['first_review'].apply(ordDate)
data['host_since'] = data['host_since'].apply(ordDate)
data['last_review'] = data['last_review'].apply(ordDate)
data['cancellation_policy'] = data['cancellation_policy'].apply(ordCancellationPolicy)
data['cleaning_fee'] = data['cleaning_fee'].apply(ordCleaningFee)

# data.head()

On retire les colonnes trop compledataes à analyser.

In [50]:
data = data.drop(columns=['name','description','id'])

### Cas Amenities

Une première option pour amenities, serait de réaliser un script qui simulerait un OneHeatEncoder sur les différents équipements (125) mais, on constate que tous les équipements n'ont pas la même importance, on pourrait donc les rassembler (on remarque par exemple qu'il y a beaucoup d'équipements pour les enfants, et d'autres beaucoup plus luxueux comme hot tub). Comme cela prend du temps de les regrouper, j'utiliserais une IA pour faire le gros du travail, puis j'ajusterai au besoin.


On pourra également faire plus simple, et simplement compter les aménagements (mais donner la même importance à un sèche-cheveux qu'à un jaccuzzi ce n'est pas optimal)

In [51]:
unique_amenities = []
series = data['amenities']
for l in series:
    for e in l:
        if (e not in unique_amenities):
            unique_amenities.append(e)

grouped_amenities = {
    'Basic': ['TV', 'Wireless Internet', 'Internet', 'Kitchen', 'Heating', 'Air conditioning', 'Refrigerator', 'Essentials', 'Shampoo', 'Hot water'],
    'Safety': ['Smoke detector', 'Carbon monoxide detector', 'Fire extinguisher', 'First aid kit', 'Safety card'],
    'Family': ['Family/kid friendly', 'Crib', 'Children’s books and toys', 'High chair', 'Outlet covers', 'Baby bath', 'Changing table', 'Baby monitor', 'Stair gates'],
    'Luxury': ['Hot tub', 'Gym', 'BBQ grill', 'Pool', 'Doorman', 'Private entrance', 'Beachfront', 'Waterfront', 'Garden or backyard'],
    'Accessibility': ['Elevator', 'Step-free access', 'Wide entryway', 'Accessible-height bed', 'Wide clearance to bed', 'Flat smooth pathway to front door', 'Well-lit path to entrance'],
    'Extras': ['Laptop friendly workspace', 'Hangers', 'Hair dryer', 'Iron', 'Lock on bedroom door', 'Bed linens', 'Microwave', 'Coffee maker', 'Dishes and silverware', 'Cooking basics', 'Oven', 'Stove', 'Luggage dropoff allowed', 'EV charger']
}
# Pour les autres je les ignorerai simplement pour accélérer le traitement.

def nomAmenities():
    keys = grouped_amenities.keys()
    for key in keys:
        data[key] = 0
        for ligne in range(len(data['amenities'])):
            for amenity in grouped_amenities[key]:
                if (amenity in data['amenities'][ligne]):
                    data.loc[ligne, key] = 1
        print(key, " done")
    

def ordAmenities():
    if data['amenities'].dtype == 'object':
        data['amenities'] = data['amenities'].apply(lambda data: len(data))
    else:
        return 0

# ordAmenities() si on avait voulu utiliser la longueur

nomAmenities()

data.head()

Basic  done
Safety  done
Family  done
Luxury  done
Accessibility  done
Extras  done


Unnamed: 0,log_price,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,...,review_scores_rating,zipcode,bedrooms,beds,Basic,Safety,Family,Luxury,Accessibility,Extras
0,4.317488,House,Private room,"[TV, Wireless Internet, Kitchen, Free parking ...",3,1.0,Real Bed,5,0,LA,...,,90804,0.0,2.0,1,1,0,0,0,1
1,4.007333,House,Private room,"[Wireless Internet, Air conditioning, Kitchen,...",4,2.0,Real Bed,2,0,NYC,...,86.0,11385,1.0,2.0,1,1,1,0,0,1
2,7.090077,Apartment,Entire home/apt,"[TV, Wireless Internet, Air conditioning, Kitc...",6,2.0,Real Bed,5,0,DC,...,,20009,2.0,2.0,1,1,1,0,0,1
3,3.555348,House,Private room,"[TV, Cable TV, Internet, Wireless Internet, Ai...",1,1.0,Real Bed,5,1,NYC,...,96.0,11104,1.0,1.0,1,1,1,1,0,1
4,5.480639,House,Entire home/apt,"[TV, Cable TV, Internet, Wireless Internet, Ki...",4,1.0,Real Bed,1,1,SF,...,96.0,94131,2.0,2.0,1,1,1,0,0,1


In [52]:
if 'amenities' in data.columns: 
    data = data.drop(columns=['amenities'])

## On passe à la création du modèle

Pour le choix de l'estimateur on avons utilisé la documentation sklearn (https://scikit-learn.org/stable/machine_learning_map.html), qui nous a mené vers une SVR

In [53]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

from sklearn.metrics import accuracy_score

In [60]:
# Initialisation
y = data['log_price']
X = data.drop(columns=['log_price'])

In [61]:
# Preprocessing

cat_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

num_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
        ('num', num_preprocessor, X.select_dtypes(exclude='object').columns),
        ('cat', cat_preprocessor, X.select_dtypes(include='object').columns)
    ],
    remainder='passthrough'
)

In [62]:
# https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

pipe = Pipeline([
                ("preprocessing", preprocessor),
                ("model", SVR(kernel = 'linear'))
                ])

# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

pipe.get_params()

{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('num',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(strategy='median')),
                                                    ('scaler', StandardScaler())]),
                                    Index(['accommodates', 'bathrooms', 'cancellation_policy', 'cleaning_fee',
          'first_review', 'host_has_profile_pic', 'host_identity_verified',
          'host_response_rate', 'host_since', 'instant_bookable', 'last_rev...
          'latitude', 'longitude', 'number_of_reviews', 'review_scores_rating',
          'bedrooms', 'beds', 'Basic', 'Safety', 'Family', 'Luxury',
          'Accessibility', 'Extras'],
         dtype='object')),
                                   ('cat',
                                    Pipeline(steps=[('imputer',
                                          

In [15]:
model = GridSearchCV(estimator = pipe,
             param_grid = {"model__kernel" : ["linear","rbf","poly"]},
             cv = 3)

# Je vais faire de la cross-validation pour avoir des paramètres sensiblement plus fiables.

In [16]:
# On détermine les meilleurs paramètres grâce à la cross-validation

model.fit(X,y)
pd.DataFrame(model.cv_results_)


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__kernel,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,139.21945,3.846081,8.695879,0.251548,linear,{'model__kernel': 'linear'},0.656786,0.661505,0.655672,0.657988,0.002528,2
1,40.555517,2.080592,8.63939,0.856194,rbf,{'model__kernel': 'rbf'},0.679295,0.665117,0.669097,0.67117,0.005971,1
2,64.129656,4.211246,8.511711,0.307603,poly,{'model__kernel': 'poly'},0.600234,0.599667,0.622376,0.607426,0.010574,3


Après exécution de la validation croisée (pour nomAmenities), on trouve que 'rbf' obtient les meilleurs résultats, suivi par 'linear', et finalement 'poly'. On choisira donc 'rbf' pour le moment

In [63]:
pipe.fit(X, y)
predictions = pipe.predict(X)

In [21]:

import numpy as np

def root_mean_squared_error(y_true, y_pred):
	return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"Moyenne des log_prices : {y.mean()}")

print(f"RMSE : {root_mean_squared_error(y, predictions)}")

Moyenne des log_prices : 4.783480914904761
RMSE : 0.31486937788200775


Le modèle semble plutôt bon, on a tendance à sous-estimer un peu les biens possédant peu d'équipements ? Le point important est surtout qu'on constate que le nombre d'équipements n'augmente pas particulièrement avec le prix. len() n'est donc peut-être pas la meilleure idée.

Cela concorde avec ce qu'on avait vu dans Data Visualization

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score


y_pred = model.predict(X)
errors = y - y_pred
SCE = np.sum(errors**2)
mean_y = np.mean(y)
SCT = np.sum((y - mean_y)**2)
SCR = np.sum((y_pred - mean_y)**2)
precision = model.score(X, y)
R2 = r2_score(y, y_pred)

metrics = {
    "SCE (SSE)": SCE,
    "SCT": SCT,
    "SCR": SCR,
    "R²": R2,
    "Précision (model.score)" : precision
}
print(pd.Series(metrics, dtype=float))


SCE (SSE)                   2204.339350
SCT                        11485.866206
SCR                         8316.368228
R²                             0.808082
Précision (model.score)        0.808082
dtype: float64


In [None]:
# N'est valable que si on utilise len(amenities) dans data["amenities"] !!!

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(1, figsize=(18, 9))
plt.subplot(211)
plt.axhline(y.mean(), label='Moyenne', color='green', linewidth=1)
plt.scatter(X['amenities'], predictions, label='Predictions', linewidth=0.1, color='red')
plt.legend()

plt.title('Comparaison des prédictions et de la moyenne des log_prices')
plt.subplot(212)
plt.axhline(y.mean(), label='Moyenne', color='green', linewidth=1)
plt.scatter(X['amenities'], y, label='Vraies valeurs', linewidth=0.1, color='blue')
plt.legend()

plt.title('Comparaison des vraies valeurs et de la moyenne des log_prices')

## 2.3 Exportation au format CSV du test

In [64]:
data = test_data

In [65]:
# Traitement du cas amenities

def clean_amenities(amenities_str):
    if pd.isna(amenities_str):
        return []
    cleaned_str = amenities_str.replace('{', '').replace('}', '').replace('"', '')
    amenities_list = cleaned_str.split(',')
    return amenities_list

# Application sur la colonne
if (type(data['amenities'][0]) != list): # Pour vérifier si elle n'est pas déjà transformée.
    data['amenities'] = data['amenities'].apply(clean_amenities)

# print(data['amenities'].head())

In [66]:
def ordBoolean(data:str):
    if (data == 't'):
        return 1
    else:
        return 0
    
def ordRate(data:str):
    if pd.isna(data):
        return 0
    else:
        y = str(data)
        return float(y.replace('%', '')) / 100.0
    
def ordDate(data:str):
    if pd.isna(data):
        return 0
    else:
        return pd.to_datetime(data).timestamp()
    
def ordCancellationPolicy(data):
    if (data == 'fledataible'):
        return 0
    elif (data == 'moderate'):
        return 1
    elif (data == 'strict'):
        return 2
    elif (data == 'super_strict_30'):
        return 3
    elif (data == 'super_strict_60'):
        return 4
    else:
        return 5 # Pour les valeurs inconnues, comment choisir ? moyenne, -1, 5 ?
    
def ordCleaningFee(data:bool):
    if (data == True):
        return 1
    else:
        return 0

data['host_has_profile_pic'] = data['host_has_profile_pic'].apply(ordBoolean)
data['host_identity_verified'] = data['host_identity_verified'].apply(ordBoolean)
data['instant_bookable'] = data['instant_bookable'].apply(ordBoolean)
data['host_response_rate'] = data['host_response_rate'].apply(ordRate)
data['first_review'] = data['first_review'].apply(ordDate)
data['host_since'] = data['host_since'].apply(ordDate)
data['last_review'] = data['last_review'].apply(ordDate)
data['cancellation_policy'] = data['cancellation_policy'].apply(ordCancellationPolicy)
data['cleaning_fee'] = data['cleaning_fee'].apply(ordCleaningFee)

# data.head()

In [67]:
data = data.drop(columns=['name','description','id'])

In [68]:
unique_amenities = []
series = data['amenities']
for l in series:
    for e in l:
        if (e not in unique_amenities):
            unique_amenities.append(e)

grouped_amenities = {
    'Basic': ['TV', 'Wireless Internet', 'Internet', 'Kitchen', 'Heating', 'Air conditioning', 'Refrigerator', 'Essentials', 'Shampoo', 'Hot water'],
    'Safety': ['Smoke detector', 'Carbon monoxide detector', 'Fire extinguisher', 'First aid kit', 'Safety card'],
    'Family': ['Family/kid friendly', 'Crib', 'Children’s books and toys', 'High chair', 'Outlet covers', 'Baby bath', 'Changing table', 'Baby monitor', 'Stair gates'],
    'Luxury': ['Hot tub', 'Gym', 'BBQ grill', 'Pool', 'Doorman', 'Private entrance', 'Beachfront', 'Waterfront', 'Garden or backyard'],
    'Accessibility': ['Elevator', 'Step-free access', 'Wide entryway', 'Accessible-height bed', 'Wide clearance to bed', 'Flat smooth pathway to front door', 'Well-lit path to entrance'],
    'Extras': ['Laptop friendly workspace', 'Hangers', 'Hair dryer', 'Iron', 'Lock on bedroom door', 'Bed linens', 'Microwave', 'Coffee maker', 'Dishes and silverware', 'Cooking basics', 'Oven', 'Stove', 'Luggage dropoff allowed', 'EV charger']
}
# Pour les autres je les ignorerai simplement pour accélérer le traitement.

def nomAmenities():
    keys = grouped_amenities.keys()
    for key in keys:
        data[key] = 0
        for ligne in range(len(data['amenities'])):
            for amenity in grouped_amenities[key]:
                if (amenity in data['amenities'][ligne]):
                    data.loc[ligne, key] = 1
        print(key, " done")
    

def ordAmenities():
    if data['amenities'].dtype == 'object':
        data['amenities'] = data['amenities'].apply(lambda data: len(data))
    else:
        return 0

# ordAmenities() si on avait voulu utiliser la longueur

nomAmenities()

data.head()

Basic  done
Safety  done
Family  done
Luxury  done
Accessibility  done
Extras  done


Unnamed: 0,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,first_review,...,review_scores_rating,zipcode,bedrooms,beds,Basic,Safety,Family,Luxury,Accessibility,Extras
0,Apartment,Entire home/apt,"[Wireless Internet, Air conditioning, Kitchen,...",3,1.0,Real Bed,2,1,NYC,1466208000.0,...,100.0,11201.0,1.0,1.0,1,0,1,0,0,1
1,Apartment,Entire home/apt,"[Wireless Internet, Air conditioning, Kitchen,...",7,1.0,Real Bed,2,1,NYC,1501891000.0,...,93.0,10019.0,3.0,3.0,1,1,1,0,0,1
2,Apartment,Entire home/apt,"[TV, Cable TV, Wireless Internet, Air conditio...",5,1.0,Real Bed,1,1,NYC,1493510000.0,...,92.0,10027.0,1.0,3.0,1,1,1,0,0,1
3,House,Entire home/apt,"[TV, Cable TV, Internet, Wireless Internet, Ki...",4,1.0,Real Bed,5,1,SF,0.0,...,,94117.0,2.0,2.0,1,1,0,0,0,0
4,Apartment,Entire home/apt,"[TV, Internet, Wireless Internet, Air conditio...",2,1.0,Real Bed,1,1,DC,1431389000.0,...,40.0,20009.0,0.0,1.0,1,1,0,0,0,0


In [69]:
if 'amenities' in data.columns: 
    data = data.drop(columns=['amenities'])

In [70]:
X = data

In [71]:
predictions_test = pipe.predict(X)

In [76]:
output = pd.DataFrame({
    'id': test_data['id'],
    'log_price_pred': predictions_test
})
output.to_csv('predictions_test_model1.csv', index=False)

data = test

In [None]:
medium_min = ['Lock on bedroom door', 'translation missing: en.hosting_amenity_50', 'Pets live on this property', 'Dog(s)', 'translation missing: en.hosting_amenity_49', 'Private living room', 'Cat(s)', 'Smoking allowed', 'Host greets you']
medium_plus = ['TV', 'Pets allowed', 'Suitable for events', 'Washer', 'Dryer', 'Family/kid friendly', '24-hour check-in', 'Self Check-In', 'Keypad', 'Coffee maker', 'Dishes and silverware', 'Cooking basics', 'Oven', 'Stove', 'Elevator in building', 'Cable TV', 'BBQ grill', 'Garden or backyard', 'Doorman', 'Gym', 'Elevator', 'Hot tub', 'Private entrance', 'Outlet covers', 'Children’s books and toys', 'Pack ’n Play/travel crib', 'Children’s dinnerware', 'Game console', 'Bathtub', 'High chair', 'Room-darkening shades', 'Dishwasher', 'Patio or balcony', 'Pool', 'Indoor fireplace', 'Wheelchair accessible', 'Lockbox', 'Flat', ' smooth pathway to front door', 'Smart lock', 'Window guards', 'Pocket wifi', 'Ethernet connection', 'Hot water kettle', 'Handheld shower head', 'Baby bath', 'Babysitter recommendations']
high_plus = ['Stair gates', 'Changing table', 'Crib', 'Fireplace guards']

Pas eu le temps snif