# Technical test: Aurélien Allard

The goal of this test is to group together the property listings available in [this Google Sheet](https://docs.google.com/spreadsheets/d/1XUjqeXVgjZJ8jVNAn9MeIb9zHLhVj4mwRs-hk1P050o/edit).

## Connecting to the Google Sheet

First, we want to load all the data in a format that Pandas can work with, so that we can manipulate and analyse it.

To do so, we derive a CSV export of the sheet by rewriting its URL.

In [8]:
import pandas as pd

# Variable definitions
sheet_url = "https://docs.google.com/spreadsheets/d/1XUjqeXVgjZJ8jVNAn9MeIb9zHLhVj4mwRs-hk1P050o/edit#gid=1762609115"
csv_url = sheet_url.replace("/edit#gid=", "/export?format=csv&gid=")

# Load the CSV export into a Pandas dataframe
raw_df = pd.read_csv(
    csv_url,
    # Use the ID column as the row index
    index_col=0,
)

# Inspect the result
raw_df.head(2)

| ID | URL | CRAWL_SOURCE | PROPERTY_TYPE | NEW_BUILD | DESCRIPTION | IMAGES | SURFACE | LAND_SURFACE | BALCONY_SURFACE | TERRACE_SURFACE | ... | DEALER_NAME | DEALER_TYPE | CITY_ID | CITY | ZIP_CODE | DEPT_CODE | PUBLICATION_START_DATE | PUBLICATION_END_DATE | LAST_CRAWL_DATE | LAST_PRICE_DECREASE_DATE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22c05930-0eb5-11e7-b53d-bbead8ba43fe | `http://www.avendrealouer.fr/location/levallois...` | A_VENDRE_A_LOUER | APARTMENT | False | Au rez de chaussée d'un bel immeuble récent,ap... | `["https://cf-medias.avendrealouer.fr/image/_87...` | 72.0 | NaN | NaN | NaN | ... | Lamirand Et Associes | AGENCY | 54178039 | Levallois-Perret | 92300.0 | 92 | 2017-03-22T04:07:56.095 | NaN | 2017-04-21T18:52:35.733 | NaN |
| 8d092fa0-bb99-11e8-a7c9-852783b5a69d | `https://www.bienici.com/annonce/ag440414-16547...` | BIEN_ICI | APARTMENT | False | Je vous propose un appartement dans la rue Col... | `["http://photos.ubiflow.net/440414/165474561/p...` | 48.0 | NaN | NaN | NaN | ... | Proprietes Privees | MANDATARY | 54178039 | Levallois-Perret | 92300.0 | 92 | 2018-09-18T11:04:44.461 | NaN | 2019-06-06T10:08:10.89 | 2018-09-25 |


<details>
<summary>Alternative method</summary>
<br>
With access to credentials, we could instead connect through the Google API, possibly with the help of a library such as gspread.
</details>
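
Coming back to the CSV export used above: `str.replace` returns the URL unchanged when the `/edit#gid=` pattern is absent, so a sheet URL with a slightly different shape would fail silently. A minimal sketch of a more defensive rewrite follows. The helper name `csv_export_url` and its regular expressions are assumptions made for illustration, not part of the original notebook.

```python
import re


def csv_export_url(sheet_url: str) -> str:
    """Build a CSV export URL from a Google Sheets URL (hypothetical helper)."""
    # Spreadsheet key: the path segment that follows /spreadsheets/d/
    sheet_id = re.search(r"/spreadsheets/d/([^/]+)", sheet_url)
    # Tab identifier: accepted after '#', '?' or '&'
    gid = re.search(r"[#?&]gid=(\d+)", sheet_url)
    if sheet_id is None or gid is None:
        # Fail loudly instead of handing an HTML page to pd.read_csv
        raise ValueError(f"Unrecognised Google Sheets URL: {sheet_url}")
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id.group(1)}"
        f"/export?format=csv&gid={gid.group(1)}"
    )


sheet_url = "https://docs.google.com/spreadsheets/d/1XUjqeXVgjZJ8jVNAn9MeIb9zHLhVj4mwRs-hk1P050o/edit#gid=1762609115"
csv_url = csv_export_url(sheet_url)
print(csv_url)
```

For the URL used in this notebook, the helper yields the same string as the `replace` call.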

## Analysis

We want a general picture of the columns present and of the number of entries (2164 in this extract).

We also take the opportunity to compute summary statistics, which give a sense of the main trends.

In [12]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2164 entries, 22c05930-0eb5-11e7-b53d-bbead8ba43fe to f3ae8be0-9fcf-11e9-ab3e-47ec2b68d334
Data columns (total 56 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   URL                         2164 non-null   object 
 1   CRAWL_SOURCE                2164 non-null   object 
 2   PROPERTY_TYPE               2164 non-null   object 
 3   NEW_BUILD                   1973 non-null   object 
 4   DESCRIPTION                 2160 non-null   object 
 5   IMAGES                      2164 non-null   object 
 6   SURFACE                     2050 non-null   float64
 7   LAND_SURFACE                3 non-null      float64
 8   BALCONY_SURFACE             0 non-null      float64
 9   TERRACE_SURFACE             25 non-null     float64
 10  ROOM_COUNT                  1835 non-null   float64
 11  BEDROOM_COUNT               696 non-null    float64
 12  BATHROOM_COUNT              

In [13]:
raw_df.describe()

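| | SURFACE | LAND_SURFACE | BALCONY_SURFACE | TERRACE_SURFACE | ROOM_COUNT | BEDROOM_COUNT | BATHROOM_COUNT | LUNCHROOM_COUNT | TOILET_COUNT | FIREPLACE | ... | GREENHOUSE_GAS_CONSUMPTION | PRICE | PRICE_M2 | RENTAL_EXPENSES | DEPOSIT | FEES | CITY_ID | ZIP_CODE | DEPT_CODE | PUBLICATION_END_DATE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2050.0 | 3.0 | 0.0 | 25.0 | 1835.0 | 696.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 2104.0 | 1991.0 | 441.0 | 55.0 | 94.0 | 2164.0 | 2163.0 | 2164.0 | 0.0 |
| mean | 128.136068 | 30.0 | NaN | 31.3636 | 2.243597 | 1.847701 | NaN | NaN | NaN | NaN | ... | NaN | 426114.2 | 5483.77342 | 406.687347 | 2988.506 | 1531.935957 | 54178039.0 | 92300.0 | 92.0 | NaN |
| std | 423.339898 | 8.660254 | NaN | 57.467242 | 1.715621 | 1.104508 | NaN | NaN | NaN | NaN | ... | NaN | 648660.3 | 5384.760624 | 1329.451157 | 8071.172419 | 6269.06513 | 0.0 | 0.0 | 0.0 | NaN |
| min | 6.0 | 25.0 | NaN | 5.59 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | 33.0 | 1.78 | 2.0 | 70.0 | 29.5 | 54178039.0 | 92300.0 | 92.0 | NaN |
| 25% | 36.2 | 25.0 | NaN | 13.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | 1600.0 | 32.375 | 60.0 | 860.5 | 378.0 | 54178039.0 | 92300.0 | 92.0 | NaN |
| 50% | 55.0 | 25.0 | NaN | 18.0 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | 220000.0 | 7586.21 | 100.0 | 1350.0 | 639.67 | 54178039.0 | 92300.0 | 92.0 | NaN |
| 75% | 93.0 | 32.5 | NaN | 26.0 | 3.0 | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | 609750.0 | 10056.75 | 151.0 | 2450.0 | 806.7 | 54178039.0 | 92300.0 | 92.0 | NaN |
| max | 10287.0 | 40.0 | NaN | 300.0 | 10.0 | 8.0 | NaN | NaN | NaN | NaN | ... | NaN | 6000000.0 | 89000.0 | 14287.33 | 60000.0 | 60000.0 | 54178039.0 | 92300.0 | 92.0 | NaN |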

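The `PRICE` quartiles in the summary above (1600 at 25%, 220000 at the median) suggest that rents and sale prices share the same column. Assuming the `MARKETING_TYPE` column separates the two (the values `RENT` and `SALE` appear in the data), a per-type summary could be sketched as follows. The toy frame and its values are illustrative only; on the real data the same call would apply to `raw_df`.

```python
import pandas as pd

# Toy stand-in for raw_df; the values are illustrative only
toy = pd.DataFrame({
    "MARKETING_TYPE": ["RENT", "RENT", "SALE", "SALE", "SALE"],
    "PRICE": [950.0, 2400.0, 198000.0, 260000.0, 849000.0],
})

# One row of summary statistics per marketing type
summary = toy.groupby("MARKETING_TYPE")["PRICE"].describe()
print(summary)
```

Splitting the summary this way keeps the rent and sale distributions from distorting each other's mean and quartiles.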

## Cleaning

Let us make the data more readable by dropping the columns that offer no meaningful basis for comparing two properties, as well as the location columns that carry redundant information.

Note that we also drop the images here: the watermarks, and the uncertainty about whether two listings of the same property would share similar photographs, discourage us from using them.


In [122]:
# Irrelevant columns to drop
col_to_drop = ['URL', 'CRAWL_SOURCE', 'IMAGES', 'PRICE_EVENTS', 'FEES', 'FEES_INCLUDED', 'DEALER_NAME',
 'DEALER_TYPE', 'CITY_ID', 'CITY', 'DEPT_CODE', 'PUBLICATION_START_DATE', 'PUBLICATION_END_DATE',
 'LAST_CRAWL_DATE', 'LAST_PRICE_DECREASE_DATE']

df = raw_df.drop(col_to_drop, axis=1)

Still with a view to lightening the process, we drop the columns that hold little or no information: only columns with non-null values in at least 10% of the rows are kept.

In [128]:
# Drop columns that are empty or hold too few comparable values
df.dropna(axis=1, thresh=0.1*len(df), inplace=True)
df.count()

PROPERTY_TYPE               2164
NEW_BUILD                   1973
DESCRIPTION                 2160
SURFACE                     2050
ROOM_COUNT                  1835
BEDROOM_COUNT                696
FURNISHED                    467
PARKING                     2164
HEATING_TYPES               2164
HEATING_MODE                 653
FLOOR                        660
FLOOR_COUNT                  465
CONSTRUCTION_YEAR            503
ELEVATOR                     552
CARETAKER                    247
MARKETING_TYPE              2164
PRICE                       2104
PRICE_M2                    1991
RENTAL_EXPENSES              441
RENTAL_EXPENSES_INCLUDED     600
EXCLUSIVE_MANDATE           2164
OCCUPIED                    1209
ZIP_CODE                    2163
dtype: int64
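
To make the effect of `thresh` concrete: `dropna(axis=1, thresh=n)` keeps a column only if it holds at least `n` non-null values. The sketch below uses a toy frame with hypothetical column names and an integer threshold of 10% of the rows.

```python
import numpy as np
import pandas as pd

# Toy frame with 20 rows; the column names are hypothetical
toy = pd.DataFrame({
    "full": np.arange(20, dtype=float),         # 20 non-null values
    "sparse": [12.0, 8.5] + [np.nan] * 18,      # 2 non-null values
    "almost_empty": [25.0] + [np.nan] * 19,     # 1 non-null value
    "empty": [np.nan] * 20,                     # 0 non-null values
})

# Keep columns with at least 10% non-null values (here: 2 out of 20)
min_non_null = len(toy) // 10
kept = toy.dropna(axis=1, thresh=min_non_null)

print(toy.count())
print(list(kept.columns))
```

With this threshold, `sparse` sits exactly on the limit and survives, while `almost_empty` and `empty` are dropped.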

## Grouping parameters

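We can set aside every property that has no duplicate on a few key features. Here we restrict the data to apartments with a surface between 20 and 80 m² and flag the rows whose `PROPERTY_TYPE`, `SURFACE` and `ZIP_CODE` repeat those of an earlier row.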

In [130]:
cleaned_df = df[(raw_df['SURFACE'].between(20, 80)) & (raw_df['PROPERTY_TYPE'] == "APARTMENT")]

duplicated = cleaned_df.duplicated(['PROPERTY_TYPE', 'SURFACE', 'ZIP_CODE'])
cleaned_df[duplicated].sort_values(['SURFACE'], ascending=False)

| ID | PROPERTY_TYPE | NEW_BUILD | DESCRIPTION | SURFACE | ROOM_COUNT | BEDROOM_COUNT | FURNISHED | PARKING | HEATING_TYPES | HEATING_MODE | ... | ELEVATOR | CARETAKER | MARKETING_TYPE | PRICE | PRICE_M2 | RENTAL_EXPENSES | RENTAL_EXPENSES_INCLUDED | EXCLUSIVE_MANDATE | OCCUPIED | ZIP_CODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9d231240-c677-11e9-a6b2-651beb16710e | APARTMENT | False | Metro LOUISE MICHEL. Au calme dans un immeuble... | 80.00 | 4.0 | NaN | NaN | False | `[]` | NaN | ... | NaN | NaN | SALE | 849000.0 | 10612.50 | NaN | NaN | True | False | 92300.0 |
| 7c4ec190-c8b6-11e9-92d3-cb429fb9e457 | APARTMENT | False | Métro LOUISE MICHEL. Au calme dans un immeuble... | 80.00 | 4.0 | 3.0 | NaN | False | `["ELECTRIC"]` | INDIVIDUAL | ... | True | True | SALE | 849000.0 | 10612.50 | NaN | NaN | False | False | 92300.0 |
| 709fb420-af80-11e9-9fab-c3006e339e11 | APARTMENT | False | Situé à LEVALLOIS PERRET, cet appartement de b... | 77.88 | 3.0 | 2.0 | False | True | `["GAS"]` | COLLECTIVE | ... | True | NaN | RENT | 2400.0 | 30.82 | 190.0 | NaN | False | NaN | 92300.0 |
| 20537fc0-5011-11e9-afff-399d21b97abe | APARTMENT | False | LEVALLOIS \| JEAN-ZAY \| 4 PIÈCES MAGNIFIQUE App... | 77.30 | 4.0 | 2.0 | NaN | True | `["ELECTRIC"]` | INDIVIDUAL | ... | True | NaN | SALE | 950000.0 | 12289.78 | NaN | NaN | False | False | 92300.0 |
| d5c6e3f0-67af-11e9-8084-55ce2049f05c | APARTMENT | False | LEVALLOIS \| JEAN-ZAY \| 4 PIECES MAGNIFIQUE App... | 77.30 | 4.0 | 2.0 | NaN | False | `["ELECTRIC"]` | INDIVIDUAL | ... | True | False | SALE | 950000.0 | 12289.78 | NaN | NaN | True | False | 92300.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| c8e1e7d0-c357-11e9-81e9-5f24299f2ef3 | APARTMENT | False | ORPI vous propose ce joli petit 2 pièces de 21... | 21.00 | 2.0 | 1.0 | NaN | False | `[]` | INDIVIDUAL | ... | NaN | NaN | SALE | 198000.0 | 9428.57 | NaN | NaN | False | False | 92300.0 |
| d87e1260-13b5-11e9-9d9b-7591d677cbca | APARTMENT | False | LEVALLOIS // GREFFULHE Appartement de 21m2 sit... | 21.00 | 1.0 | 1.0 | NaN | False | `[]` | COLLECTIVE | ... | True | True | SALE | 260000.0 | 12380.95 | NaN | NaN | False | False | 92300.0 |
| 6d0329a0-ad55-11e9-9cb2-ef48193159b3 | APARTMENT | False | Au coeur du Quartier Greffulle de Levallois. D... | 20.00 | 1.0 | NaN | True | False | `[]` | NaN | ... | NaN | NaN | RENT | 950.0 | 47.50 | NaN | NaN | False | NaN | 92300.0 |
| 8b06b0f0-b112-11e9-9fab-c3006e339e11 | APARTMENT | False | Au coeur du Quartier Greffulle de Levallois, d... | 20.00 | 1.0 | NaN | True | False | `[]` | NaN | ... | NaN | NaN | RENT | 950.0 | 47.50 | NaN | True | False | NaN | 92300.0 |

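One detail of `duplicated` matters for the selection above: with its default `keep="first"`, the first row of each group of duplicates is not flagged, so it is missing from the filtered frame. If the intent is to keep every member of each group, `keep=False` may be closer to it. The sketch below illustrates the difference on a toy frame with hypothetical IDs.

```python
import pandas as pd

# Toy listings; the IDs "a" to "d" are hypothetical
toy = pd.DataFrame(
    {
        "PROPERTY_TYPE": ["APARTMENT"] * 4,
        "SURFACE": [80.0, 80.0, 77.3, 21.0],
        "ZIP_CODE": [92300.0] * 4,
    },
    index=pd.Index(["a", "b", "c", "d"], name="ID"),
)
keys = ["PROPERTY_TYPE", "SURFACE", "ZIP_CODE"]

# Default keep="first": only the second and later occurrences are flagged
later_only = toy[toy.duplicated(keys)]

# keep=False: every row that belongs to a group of duplicates is flagged
all_members = toy[toy.duplicated(keys, keep=False)]

print(list(later_only.index))
print(list(all_members.index))
```

Here rows `a` and `b` share the same key values: the default flags only `b`, whereas `keep=False` returns both.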

## Grouping

In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer

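# `tx` is defined in a cell that is not shown here; the 2x2 result below implies it holds two texts
tfidf = TfidfVectorizer().fit_transform(tx)
# TF-IDF rows are L2-normalised by default, so this product is the cosine similarity
pairwise_similarity = tfidf * tfidf.T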

In [65]:
pairwise_similarity.toarray()

array([[1.        , 0.80308189],
       [0.80308189, 1.        ]])
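
Because `tx` is not shown, here is a self-contained sketch of the same computation on three hypothetical descriptions, loosely modelled on the listings above. It also sets `strip_accents="unicode"`, an assumption on our part, so that spellings such as "Métro" and "Metro" count as the same token.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions, loosely modelled on the listings in this notebook
descriptions = [
    "Métro Louise Michel, appartement 4 pièces au calme",
    "Metro Louise Michel, appartement 4 pieces avec balcon",
    "Studio meublé proche gare",
]

# strip_accents="unicode" maps "Métro" and "Metro" to the same token
vectorizer = TfidfVectorizer(strip_accents="unicode")
tfidf = vectorizer.fit_transform(descriptions)

# Rows are L2-normalised by default, so the product is the cosine similarity
similarity = (tfidf @ tfidf.T).toarray()
print(similarity.round(2))
```

The first two descriptions share most of their vocabulary and score well above zero, while the third shares no token with them and scores zero.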

In [180]:
from scipy.spatial.distance import pdist, squareform

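# Pairwise similarity between listings: share of columns holding equal values
# (1 - mismatch rate), laid out as a square ID x ID matrix
comp_1 = pd.DataFrame(
    1 - squareform(pdist(df, lambda u, v: (u != v).mean())),
    columns=df.index,
    index=df.index,
)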
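
A caveat on the metric used in this cell: under the usual NaN semantics, `u != v` is true wherever both rows are missing a value, so two listings that leave the same fields empty are penalised for it. The sketch below, on numeric toy rows, contrasts the plain mismatch rate with a variant that ignores shared gaps. The helper `mismatch_rate` and its `ignore_shared_nan` flag are hypothetical.

```python
import numpy as np


def mismatch_rate(u, v, ignore_shared_nan=False):
    """Share of positions where two numeric rows differ (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    differs = u != v  # NaN != NaN is True, so shared gaps count as mismatches
    if ignore_shared_nan:
        # Do not count positions where both rows are missing
        differs &= ~(np.isnan(u) & np.isnan(v))
    return float(differs.mean())


# Toy rows: surface, room count, bedroom count (missing in both), price
row_a = [80.0, 4.0, np.nan, 849000.0]
row_b = [80.0, 4.0, np.nan, 850000.0]

plain = 1 - mismatch_rate(row_a, row_b)
nan_aware = 1 - mismatch_rate(row_a, row_b, ignore_shared_nan=True)
print(plain, nan_aware)
```

Whether shared gaps should count as agreement is a modelling choice; the variant only shows how much the score moves.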

In [192]:
comp_2 = comp_1.unstack(fill_value='').to_frame().reset_index().set_axis(['Bien A', 'Bien B', 'Score'], axis=1)
comp_2[(comp_2['Bien A'] != comp_2['Bien B']) & (comp_2['Score'] > 0.5)].sort_values(['Score'], ascending=False)

ValueError: cannot insert ID, already exists

In [201]:
comp_1.unstack()

ID                                    ID                                  
22c05930-0eb5-11e7-b53d-bbead8ba43fe  22c05930-0eb5-11e7-b53d-bbead8ba43fe    1.000000
                                      8d092fa0-bb99-11e8-a7c9-852783b5a69d    0.260870
                                      44b6a5c0-3466-11e9-8213-25cc7d9bf5fc    0.217391
                                      e9e07ed0-812f-11e8-82aa-61eacebe4584    0.217391
                                      872302b0-5a21-11e9-950c-510fefc1ed35    0.173913
                                                                                ...   
f3ae8be0-9fcf-11e9-ab3e-47ec2b68d334  d3579370-824f-11e9-af18-9742751bcff8    0.391304
                                      cce3fc60-c86b-11e9-a6b2-651beb16710e    0.304348
                                      beec50b0-c85e-11e9-92d3-cb429fb9e457    0.347826
                                      8cba88c0-a07a-11e9-a8e6-0de7b497e456    0.130435
                                      f3ae8be0-9fcf-11e

In [202]:
comp_1.unstack().index.rename(['Bien A', 'Bien B', 'Score'])

ValueError: Length of names must match number of levels in MultiIndex.
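
Both errors above appear to share one cause: the rows and the columns of `comp_1` are both named `ID`. After `unstack()`, the two index levels therefore carry the same name and `reset_index()` cannot create two `ID` columns. The second attempt fails because the MultiIndex has two levels while three names are supplied; the score is the Series' values, not a level. One possible fix, sketched on a toy matrix with hypothetical IDs, is to name the two axes before unstacking and to name the Series itself.

```python
import pandas as pd

# Toy similarity matrix; the IDs "a", "b", "c" are hypothetical
ids = pd.Index(["a", "b", "c"], name="ID")
comp_1 = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.6],
     [0.2, 0.6, 1.0]],
    index=ids,
    columns=ids,
)

pairs = (
    comp_1.rename_axis(index="Bien B", columns="Bien A")  # distinct level names
    .unstack()            # Series indexed by (Bien A, Bien B)
    .rename("Score")      # name the values so reset_index yields a Score column
    .reset_index()
)

# Drop self-comparisons and keep the most similar pairs first
matches = pairs[(pairs["Bien A"] != pairs["Bien B"]) & (pairs["Score"] > 0.5)]
matches = matches.sort_values("Score", ascending=False)
print(matches)
```

Each pair still appears twice, as (A, B) and (B, A); comparing with `<` instead of `!=` would keep a single copy of each.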