# Mini-Challenges - Workshop 7

In [123]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

prices = pd.read_csv("../data/price_availability.csv", sep=";")
listings = pd.read_csv("../clean/listings_final.csv", sep=";")
intersects = np.intersect1d(np.unique(prices['listing_id']), np.unique(listings['listing_id']))
listings_prices = prices[prices['listing_id'].isin(intersects)]

# These variables must not be used from here on!
del prices
del intersects

print("Data loaded.")

Data loaded.


## Legend:

- **(J)** means the answer closely resembles code already present in a notebook from the previous workshop.
- **(!)** means the question is hard and may require more in-depth research than the others.

In [127]:
listings_prices.head()

Unnamed: 0,listing_id,day,created,available,local_currency,local_price,min_nights
94,17757345,2018-10-14,2018-09-27 10:40:07.000+0000,False,EUR,40,1
95,17757345,2018-10-14,2018-09-27 06:05:54.000+0000,False,EUR,40,1
96,17757345,2018-10-14,2018-09-26 19:31:41.000+0000,False,EUR,40,1
297,2581464,2019-01-05,2018-09-27 10:40:54.000+0000,True,EUR,78,2
298,2581464,2019-01-05,2018-09-27 06:06:46.000+0000,True,EUR,78,2


## (J) Question 4

**Group the apartments by listing ID and by 14-day window.**

In [148]:
dates = pd.to_datetime(listings_prices["day"], format='%Y-%m-%d')
# Average the numeric columns per listing and 14-day bin; numeric_only=True skips text columns such as local_currency.
range_prices = (listings_prices.assign(date=dates)
                .groupby(['listing_id', pd.Grouper(key='date', freq='14D')])
                .mean(numeric_only=True)
                .reset_index())
print("Number of observations:", range_prices.shape[0])
range_prices.head()

Number of observations: 9990


Unnamed: 0,listing_id,date,available,local_price,min_nights
0,56093,2018-08-27,0.0,170.0,4.0
1,56093,2018-09-10,0.0,170.0,4.0
2,56093,2018-09-24,0.0,170.0,4.0
3,56093,2018-10-08,0.0,170.0,4.0
4,56093,2018-10-22,0.0,170.0,4.0
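
To make the 14-day binning easier to follow, here is a minimal sketch on a hypothetical toy frame (the `toy` data and its values are invented for illustration, not taken from the dataset). It assumes the default `pd.Grouper` behaviour, where the bins are 14 days wide and start at midnight of the earliest date found in the column.

```python
import pandas as pd

# Hypothetical toy data: one listing observed on three different days.
toy = pd.DataFrame({
    "listing_id": [1, 1, 1],
    "date": pd.to_datetime(["2018-09-01", "2018-09-10", "2018-09-20"]),
    "local_price": [100.0, 120.0, 80.0],
})

# Group by listing and by 14-day bin, then average the numeric columns.
toy_binned = (toy.groupby(["listing_id", pd.Grouper(key="date", freq="14D")])
                 .mean(numeric_only=True)
                 .reset_index())
print(toy_binned)
```

The first two observations fall in the bin starting on 2018-09-01 and are averaged together, while the third one lands in the next bin. The `date` column of the result holds the start of each bin, not the original days.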


## Question 5

**Keep only the apartments whose minimum number of consecutive nights is between 2 (inclusive) and 7 (exclusive).**

In [149]:
range_prices_nights = range_prices[(range_prices['min_nights'] >= 2) & (range_prices['min_nights'] < 7)]
print("Number of observations:", range_prices_nights.shape[0])
range_prices_nights.head()

Number of observations: 6515


Unnamed: 0,listing_id,date,available,local_price,min_nights
0,56093,2018-08-27,0.0,170.0,4.0
1,56093,2018-09-10,0.0,170.0,4.0
2,56093,2018-09-24,0.0,170.0,4.0
3,56093,2018-10-08,0.0,170.0,4.0
4,56093,2018-10-22,0.0,170.0,4.0
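
Because `min_nights` was averaged over each 14-day window in the previous step, it can take non-integer values, so the half-open interval matters. As a small sketch on hypothetical values, the same filter can also be written with `Series.between`, assuming a pandas version recent enough to accept `inclusive="left"` (1.3 or later):

```python
import pandas as pd

# Hypothetical averaged min_nights values, including both boundaries.
min_nights = pd.Series([1.0, 2.0, 4.5, 7.0, 10.0])

# Explicit comparison operators, as in the cell above.
mask_ops = (min_nights >= 2) & (min_nights < 7)

# Equivalent half-open interval [2, 7) using Series.between.
mask_between = min_nights.between(2, 7, inclusive="left")

print(min_nights[mask_between])
```

Both masks keep 2.0 and 4.5 and reject 7.0, which matches the "2 inclusive, 7 exclusive" wording of the question.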


## (J) Question 6

**Were any apartments removed? If so, how many remain?**

In [150]:
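# range_prices_nights is a subset of range_prices, so the intersection holds exactly the listings that survived the filter.
intersect_nights = np.intersect1d(np.unique(range_prices['listing_id']),
                                  np.unique(range_prices_nights['listing_id']))
print("Number of apartments remaining:", len(intersect_nights))

Number of apartments remaining: 669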
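
The cell above reports how many listings remain, but not how many were dropped. One possible way to answer the first half of the question is a set difference, sketched here on hypothetical ID arrays (`ids_before` and `ids_after` stand in for the `listing_id` columns before and after filtering):

```python
import numpy as np

# Hypothetical listing IDs before and after filtering (duplicates are expected,
# since each listing appears once per 14-day bin).
ids_before = np.array([10, 10, 20, 30, 30, 40])
ids_after = np.array([10, 30, 30])

# Unique IDs present before the filter but absent afterwards.
removed = np.setdiff1d(ids_before, ids_after)
# Unique IDs that survived the filter.
remaining = np.unique(ids_after)

print("Removed:", len(removed), "Remaining:", len(remaining))
```

`np.setdiff1d` returns sorted unique values, so there is no need to call `np.unique` on its inputs first.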


## (J) Question 7

**Apply dictionary (label) encoding to the date, in a new column named *date_dict*.**

In [151]:
dict_encoder = LabelEncoder()
# range_prices_nights is a filtered slice of range_prices; take an explicit copy
# so that adding a column does not trigger pandas' SettingWithCopyWarning.
range_prices_nights = range_prices_nights.copy()
range_prices_nights['date_dict'] = dict_encoder.fit_transform(range_prices_nights['date'])
range_prices_nights.head()


Unnamed: 0,listing_id,date,available,local_price,min_nights,date_dict
0,56093,2018-08-27,0.0,170.0,4.0,0
1,56093,2018-09-10,0.0,170.0,4.0,1
2,56093,2018-09-24,0.0,170.0,4.0,2
3,56093,2018-10-08,0.0,170.0,4.0,3
4,56093,2018-10-22,0.0,170.0,4.0,4
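
Dictionary encoding replaces each distinct value with an integer code. `LabelEncoder` assigns the codes in sorted order, which is why the earliest bin gets 0 in the table above. The sketch below illustrates the same idea with `pd.factorize(sort=True)`, a pandas substitute used here only to keep the example free of scikit-learn; the dates are hypothetical.

```python
import pandas as pd

# Hypothetical bin start dates, deliberately unsorted and with a repeat.
dates = pd.Series(pd.to_datetime(
    ["2018-09-10", "2018-08-27", "2018-09-10", "2018-09-24"]
))

# codes[i] is the rank of dates[i] among the sorted distinct dates;
# uniques lists those distinct dates in sorted order.
codes, uniques = pd.factorize(dates, sort=True)

print(codes)
print(uniques)
```

The mapping from code back to date is simply `uniques[code]`, which plays the role of the "dictionary".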


## (J) Question 8

**Store the one-hot encoding of the dates in the variable *one_hot_dates*.**

In [152]:
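# scikit-learn >= 1.2 names this parameter sparse_output; older releases use sparse=False instead.
one_hot_encoder = OneHotEncoder(sparse_output=False)
# The encoder expects a 2-D array with one column per feature.
date_labels = range_prices_nights['date_dict'].values.reshape(-1, 1)
one_hot_dates = one_hot_encoder.fit_transform(date_labels)
one_hot_dates.shape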

(6515, 10)
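
The shape above reads as one row per observation and one column per distinct date code. As a rough illustration of what the encoder produces, the same kind of matrix can be built by hand with NumPy by indexing an identity matrix; the `codes` array below is hypothetical.

```python
import numpy as np

# Hypothetical integer codes, as produced by a label encoding step.
codes = np.array([0, 2, 1, 2])
n_categories = codes.max() + 1

# Row i of the identity matrix is the one-hot vector for category i,
# so fancy indexing with the codes yields one one-hot row per observation.
one_hot = np.eye(n_categories)[codes]

print(one_hot.shape)
print(one_hot)
```

Each row contains a single 1.0, in the column given by its code. This shortcut assumes the codes are contiguous integers starting at 0, which holds for label-encoded data.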

## (!) Question 9

**Append the matrix of one-hot vectors to the DataFrame *range_prices_nights* and store the result in a new DataFrame named *dataset*.**

In [161]:
dataset = range_prices_nights.copy()
for i in range(one_hot_dates.shape[1]):
    dataset['oh_' + str(i + 1)] = one_hot_dates[:, i]
dataset.head()

Unnamed: 0,listing_id,date,available,local_price,min_nights,date_dict,oh_1,oh_2,oh_3,oh_4,oh_5,oh_6,oh_7,oh_8,oh_9,oh_10
0,56093,2018-08-27,0.0,170.0,4.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,56093,2018-09-10,0.0,170.0,4.0,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,56093,2018-09-24,0.0,170.0,4.0,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,56093,2018-10-08,0.0,170.0,4.0,3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,56093,2018-10-22,0.0,170.0,4.0,4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
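
Assigning the columns one by one works because a NumPy array is assigned by position. An alternative is `pd.concat`, but it aligns on the index, and a frame that went through boolean filtering no longer has a contiguous index. The sketch below, on a hypothetical `frame`, shows the pitfall and one way around it.

```python
import numpy as np
import pandas as pd

# Hypothetical filtered frame: note the gaps in the index left by filtering.
frame = pd.DataFrame({"listing_id": [1, 1, 2], "date_dict": [0, 1, 0]},
                     index=[0, 3, 7])
one_hot = np.eye(2)[frame["date_dict"].to_numpy()]
oh_columns = ["oh_" + str(i + 1) for i in range(one_hot.shape[1])]

# Pitfall: a fresh DataFrame gets the index 0, 1, 2, so concat aligns the
# rows on mismatched labels and pads the gaps with NaN.
misaligned = pd.concat([frame, pd.DataFrame(one_hot, columns=oh_columns)], axis=1)

# Reusing the original index keeps every row paired with its own vector.
oh_frame = pd.DataFrame(one_hot, columns=oh_columns, index=frame.index)
combined = pd.concat([frame, oh_frame], axis=1)

print(len(misaligned), len(combined))
```

In this toy case the misaligned version ends up with more rows than the input, which is exactly the kind of error a row-count check would catch.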


If everything went well, the number of observations should be unchanged:

In [166]:
if dataset.shape[0] != range_prices_nights.shape[0]:
    raise ValueError("ERROR: the number of observations does not match!")
print("Everything is OK!")

Everything is OK!
