# 7. Data transformation

## 7.1 Hierarchical indexes and columns
The **pandas** library supports hierarchical indexes and columns. In that case, the access keys are tuples (or a single key for the first level).

### 7.1.2 Hierarchical index (multi-index)
The communes *DataFrame* implicitly has 3 index levels: *Région*, *Département* and *Code INSEE* (or the commune name).

In [1]:
# import the usual modules
import pandas as pnd
import numpy as np
import matplotlib.pyplot as plt

# magic command to display plots inline
%matplotlib inline

# display options
pnd.set_option("display.max_rows", 16)
# note: matplotlib >= 3.6 renamed this style to 'seaborn-v0_8-darkgrid'
plt.style.use('seaborn-darkgrid')

In [3]:
# load the data
geo = pnd.read_csv("correspondance-code-insee-code-postal.csv",
                   sep=';',
                   usecols=range(11),
                   index_col="Code INSEE")
geo.sort_index(inplace=True)
geo.head()

Code INSEE,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape
1001,1400,L'ABERGEMENT-CLEMENCIAT,AIN,RHONE-ALPES,Commune simple,242.0,1565.0,0.8,"46.1534255214, 4.92611354223","{""type"": ""Polygon"", ""coordinates"": [[[4.926273..."
1002,1640,L'ABERGEMENT-DE-VAREY,AIN,RHONE-ALPES,Commune simple,483.0,912.0,0.2,"46.0091878776, 5.42801696363","{""type"": ""Polygon"", ""coordinates"": [[[5.430089..."
1004,1500,AMBERIEU-EN-BUGEY,AIN,RHONE-ALPES,Chef-lieu canton,379.0,2448.0,13.4,"45.9608475114, 5.3729257777","{""type"": ""Polygon"", ""coordinates"": [[[5.386190..."
1005,1330,AMBERIEUX-EN-DOMBES,AIN,RHONE-ALPES,Commune simple,290.0,1605.0,1.6,"45.9961799872, 4.91227250796","{""type"": ""Polygon"", ""coordinates"": [[[4.895580..."
1006,1300,AMBLEON,AIN,RHONE-ALPES,Commune simple,589.0,602.0,0.1,"45.7494989044, 5.59432017366","{""type"": ""Polygon"", ""coordinates"": [[[5.614854..."


In [4]:
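# split the "lat, lon" string into two numeric columns
geo["Latitude"] = geo["geo_point_2d"].apply(lambda x: float(x.split(', ')[0]))
geo["Longitude"] = geo["geo_point_2d"].apply(lambda x: float(x.split(', ')[1]))
# rescale the area (the values suggest hectares -> km²)
geo["Superficie"] /= 100.0
# density in inhabitants per unit area (Population appears to be in thousands)
geo["Densité"] = 1000 * geo["Population"] / geo["Superficie"]
# ordered categorical: astype('category', categories=...) is no longer accepted
# by recent pandas, so the categories go through a CategoricalDtype
statuts = ["Commune simple", "Chef-lieu canton", "Sous-préfecture",
           "Préfecture", "Préfecture de région", "Capitale d'état"]
geo["Statut"] = geo["Statut"].astype(
    pnd.api.types.CategoricalDtype(categories=statuts, ordered=True))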

In [5]:
geo.index

Index(['01001', '01002', '01004', '01005', '01006', '01007', '01008', '01009',
       '01010', '01011',
       ...
       '97608', '97609', '97610', '97611', '97612', '97613', '97614', '97615',
       '97616', '97617'],
      dtype='object', name='Code INSEE', length=36742)

The existing index is reset with the **reset_index()** method:

In [6]:
# reset the existing index
geo1 = geo.reset_index()
type(geo1.index)

pandas.core.indexes.range.RangeIndex

In [7]:
# apply a three-level index to the communes
geo1.set_index(["Région", "Département", "Code INSEE"], inplace=True)
geo1

Région,Département,Code INSEE,Code Postal,Commune,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,Latitude,Longitude,Densité
RHONE-ALPES,AIN,01001,01400,L'ABERGEMENT-CLEMENCIAT,Commune simple,242.0,15.65,0.8,"46.1534255214, 4.92611354223","{""type"": ""Polygon"", ""coordinates"": [[[4.926273...",46.153426,4.926114,51.118211
RHONE-ALPES,AIN,01002,01640,L'ABERGEMENT-DE-VAREY,Commune simple,483.0,9.12,0.2,"46.0091878776, 5.42801696363","{""type"": ""Polygon"", ""coordinates"": [[[5.430089...",46.009188,5.428017,21.929825
RHONE-ALPES,AIN,01004,01500,AMBERIEU-EN-BUGEY,Chef-lieu canton,379.0,24.48,13.4,"45.9608475114, 5.3729257777","{""type"": ""Polygon"", ""coordinates"": [[[5.386190...",45.960848,5.372926,547.385621
RHONE-ALPES,AIN,01005,01330,AMBERIEUX-EN-DOMBES,Commune simple,290.0,16.05,1.6,"45.9961799872, 4.91227250796","{""type"": ""Polygon"", ""coordinates"": [[[4.895580...",45.996180,4.912273,99.688474
RHONE-ALPES,AIN,01006,01300,AMBLEON,Commune simple,589.0,6.02,0.1,"45.7494989044, 5.59432017366","{""type"": ""Polygon"", ""coordinates"": [[[5.614854...",45.749499,5.594320,16.611296
RHONE-ALPES,AIN,01007,01500,AMBRONAY,Commune simple,309.0,33.59,2.3,"46.0055913782, 5.35760660735","{""type"": ""Polygon"", ""coordinates"": [[[5.413533...",46.005591,5.357607,68.472760
RHONE-ALPES,AIN,01008,01500,AMBUTRIX,Commune simple,274.0,5.18,0.7,"45.9367134524, 5.3328092349","{""type"": ""Polygon"", ""coordinates"": [[[5.321986...",45.936713,5.332809,135.135135
RHONE-ALPES,AIN,01009,01300,ANDERT-ET-CONDON,Commune simple,294.0,6.96,0.3,"45.7873565333, 5.65788307924","{""type"": ""Polygon"", ""coordinates"": [[[5.656393...",45.787357,5.657883,43.103448
...,...,...,...,...,...,...,...,...,...,...,...,...,...
MAYOTTE,MAYOTTE,97610,97600,KOUNGOU,Chef-lieu canton,138.0,27.69,19.8,"-12.7465604467, 45.1869991913","{""type"": ""Polygon"", ""coordinates"": [[[45.23692...",-12.746560,45.186999,715.059588


In [8]:
# the index is now a MultiIndex
type(geo1.index)

pandas.core.indexes.multi.MultiIndex

Direct access to the first level:

In [9]:
geo1.loc[("ILE-DE-FRANCE")]  # or equivalently geo1.loc["ILE-DE-FRANCE"]

Département,Code INSEE,Code Postal,Commune,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,Latitude,Longitude,Densité
PARIS,75101,75001,PARIS-1ER-ARRONDISSEMENT,Capitale d'état,33.0,1.81,17.6,"48.8626304852, 2.33629344655","{""type"": ""Polygon"", ""coordinates"": [[[2.344559...",48.862630,2.336293,9723.756906
PARIS,75102,75002,PARIS-2E-ARRONDISSEMENT,Chef-lieu canton,36.0,0.99,22.4,"48.8679033789, 2.34410716666","{""type"": ""Polygon"", ""coordinates"": [[[2.350834...",48.867903,2.344107,22626.262626
PARIS,75103,75003,PARIS-3E-ARRONDISSEMENT,Chef-lieu canton,35.0,1.16,35.7,"48.8630541318, 2.35936105897","{""type"": ""Polygon"", ""coordinates"": [[[2.368401...",48.863054,2.359361,30775.862069
PARIS,75104,75004,PARIS-4E-ARRONDISSEMENT,Chef-lieu canton,33.0,1.60,28.2,"48.854228282, 2.35736193814","{""type"": ""Polygon"", ""coordinates"": [[[2.364320...",48.854228,2.357362,17625.000000
PARIS,75105,75005,PARIS-5E-ARRONDISSEMENT,Chef-lieu canton,42.0,2.52,61.5,"48.8445086596, 2.34985938556","{""type"": ""Polygon"", ""coordinates"": [[[2.365944...",48.844509,2.349859,24404.761905
PARIS,75106,75006,PARIS-6E-ARRONDISSEMENT,Chef-lieu canton,40.0,2.15,43.1,"48.8489680919, 2.33267089859","{""type"": ""Polygon"", ""coordinates"": [[[2.336591...",48.848968,2.332671,20046.511628
PARIS,75107,75007,PARIS-7E-ARRONDISSEMENT,Chef-lieu canton,34.0,4.11,57.4,"48.8560825982, 2.31243868773","{""type"": ""Polygon"", ""coordinates"": [[[2.316633...",48.856083,2.312439,13965.936740
PARIS,75108,75008,PARIS-8E-ARRONDISSEMENT,Chef-lieu canton,39.0,3.85,40.3,"48.8725272666, 2.31258256042","{""type"": ""Polygon"", ""coordinates"": [[[2.320781...",48.872527,2.312583,10467.532468
...,...,...,...,...,...,...,...,...,...,...,...,...
VAL-D'OISE,95658,95450,VIGNY,Chef-lieu canton,103.0,6.81,1.1,"49.080165557, 1.92992323682","{""type"": ""Polygon"", ""coordinates"": [[[1.953222...",49.080166,1.929923,161.527166


Access to the second level:

In [10]:
geo1.loc[("ILE-DE-FRANCE", "HAUTS-DE-SEINE")]


Code INSEE,Code Postal,Commune,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,Latitude,Longitude,Densité
92002,92160,ANTONY,Sous-préfecture,66.0,9.56,61.4,"48.7503412602, 2.2993268102","{""type"": ""Polygon"", ""coordinates"": [[[2.303837...",48.750341,2.299327,6422.594142
92004,92600,ASNIERES-SUR-SEINE,Chef-lieu canton,31.0,4.82,81.6,"48.9153530123, 2.2880384663","{""type"": ""Polygon"", ""coordinates"": [[[2.284270...",48.915353,2.288038,16929.460581
92007,92220,BAGNEUX,Chef-lieu canton,86.0,4.18,38.5,"48.7983229866, 2.30989995212","{""type"": ""Polygon"", ""coordinates"": [[[2.309699...",48.798323,2.309900,9210.526316
92009,92270,BOIS-COLOMBES,Chef-lieu canton,36.0,1.93,28.2,"48.9153368426, 2.26738552597","{""type"": ""Polygon"", ""coordinates"": [[[2.266398...",48.915337,2.267386,14611.398964
92012,92100,BOULOGNE-BILLANCOURT,Sous-préfecture,33.0,6.15,113.1,"48.8365843138, 2.23913599058","{""type"": ""Polygon"", ""coordinates"": [[[2.236100...",48.836584,2.239136,18390.243902
92014,92340,BOURG-LA-REINE,Chef-lieu canton,55.0,1.86,19.8,"48.7799073627, 2.31643010763","{""type"": ""Polygon"", ""coordinates"": [[[2.319915...",48.779907,2.316430,10645.161290
92019,92290,CHATENAY-MALABRY,Chef-lieu canton,126.0,6.37,32.4,"48.7681690197, 2.26282598525","{""type"": ""Polygon"", ""coordinates"": [[[2.276145...",48.768169,2.262826,5086.342229
92020,92320,CHATILLON,Chef-lieu canton,101.0,2.93,32.4,"48.8034091756, 2.28799131883","{""type"": ""Polygon"", ""coordinates"": [[[2.272195...",48.803409,2.287991,11058.020478
...,...,...,...,...,...,...,...,...,...,...,...
92064,92210,SAINT-CLOUD,Chef-lieu canton,93.0,7.51,29.7,"48.8428741034, 2.20864570316","{""type"": ""Polygon"", ""coordinates"": [[[2.224048...",48.842874,2.208646,3954.727031


Access to the third level:

In [11]:
geo1.loc[("ILE-DE-FRANCE", "HAUTS-DE-SEINE", "92002")]

Code Postal                                                     92160
Commune                                                        ANTONY
Statut                                                Sous-préfecture
Altitude Moyenne                                                   66
Superficie                                                       9.56
Population                                                       61.4
geo_point_2d                              48.7503412602, 2.2993268102
geo_shape           {"type": "Polygon", "coordinates": [[[2.303837...
Latitude                                                      48.7503
Longitude                                                     2.29933
Densité                                                       6422.59
Name: (ILE-DE-FRANCE, HAUTS-DE-SEINE, 92002), dtype: object
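The three access patterns above can be tried on a small frame. The sketch below uses a hypothetical four-row table (a few values copied from the outputs above, not the real file). It also shows **xs()**, which this section does not use, as one possible way to select on an inner level.

```python
import pandas as pnd

# hypothetical toy table: a handful of rows, not the real communes file
toy = pnd.DataFrame({
    "Région": ["ILE-DE-FRANCE", "ILE-DE-FRANCE", "ILE-DE-FRANCE", "RHONE-ALPES"],
    "Département": ["PARIS", "HAUTS-DE-SEINE", "HAUTS-DE-SEINE", "AIN"],
    "Code INSEE": ["75101", "92002", "92004", "01001"],
    "Population": [17.6, 61.4, 81.6, 0.8],
})
toy = toy.set_index(["Région", "Département", "Code INSEE"]).sort_index()

# first level only: a DataFrame indexed by the 2 remaining levels
region = toy.loc["ILE-DE-FRANCE"]
# first two levels: a DataFrame indexed by "Code INSEE"
dept = toy.loc[("ILE-DE-FRANCE", "HAUTS-DE-SEINE")]
# all three levels: a single row, returned as a Series
row = toy.loc[("ILE-DE-FRANCE", "HAUTS-DE-SEINE", "92002")]
# cross-section on an inner level, without spelling out the outer ones
by_code = toy.xs("92004", level="Code INSEE")
```

Each additional key in the tuple removes one level from the index of the result, until a single row remains.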

Sorting the multi-index avoids performance problems: lookups on an unsorted MultiIndex can be slow, and pandas may emit a PerformanceWarning for them.

In [12]:
geo1.sort_index(inplace=True)
geo1.loc[("ILE-DE-FRANCE", "HAUTS-DE-SEINE")]

Code INSEE,Code Postal,Commune,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,Latitude,Longitude,Densité
92002,92160,ANTONY,Sous-préfecture,66.0,9.56,61.4,"48.7503412602, 2.2993268102","{""type"": ""Polygon"", ""coordinates"": [[[2.303837...",48.750341,2.299327,6422.594142
92004,92600,ASNIERES-SUR-SEINE,Chef-lieu canton,31.0,4.82,81.6,"48.9153530123, 2.2880384663","{""type"": ""Polygon"", ""coordinates"": [[[2.284270...",48.915353,2.288038,16929.460581
92007,92220,BAGNEUX,Chef-lieu canton,86.0,4.18,38.5,"48.7983229866, 2.30989995212","{""type"": ""Polygon"", ""coordinates"": [[[2.309699...",48.798323,2.309900,9210.526316
92009,92270,BOIS-COLOMBES,Chef-lieu canton,36.0,1.93,28.2,"48.9153368426, 2.26738552597","{""type"": ""Polygon"", ""coordinates"": [[[2.266398...",48.915337,2.267386,14611.398964
92012,92100,BOULOGNE-BILLANCOURT,Sous-préfecture,33.0,6.15,113.1,"48.8365843138, 2.23913599058","{""type"": ""Polygon"", ""coordinates"": [[[2.236100...",48.836584,2.239136,18390.243902
92014,92340,BOURG-LA-REINE,Chef-lieu canton,55.0,1.86,19.8,"48.7799073627, 2.31643010763","{""type"": ""Polygon"", ""coordinates"": [[[2.319915...",48.779907,2.316430,10645.161290
92019,92290,CHATENAY-MALABRY,Chef-lieu canton,126.0,6.37,32.4,"48.7681690197, 2.26282598525","{""type"": ""Polygon"", ""coordinates"": [[[2.276145...",48.768169,2.262826,5086.342229
92020,92320,CHATILLON,Chef-lieu canton,101.0,2.93,32.4,"48.8034091756, 2.28799131883","{""type"": ""Polygon"", ""coordinates"": [[[2.272195...",48.803409,2.287991,11058.020478
...,...,...,...,...,...,...,...,...,...,...,...
92064,92210,SAINT-CLOUD,Chef-lieu canton,93.0,7.51,29.7,"48.8428741034, 2.20864570316","{""type"": ""Polygon"", ""coordinates"": [[[2.224048...",48.842874,2.208646,3954.727031
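To see what sorting changes, here is a minimal sketch on a hypothetical three-row index that is deliberately given out of order. The `is_monotonic_increasing` property is used here only as a quick way to check whether the index is sorted.

```python
import pandas as pnd

# hypothetical toy index, deliberately given out of order
idx = pnd.MultiIndex.from_tuples(
    [("ILE-DE-FRANCE", "HAUTS-DE-SEINE", "92002"),
     ("RHONE-ALPES", "AIN", "01001"),
     ("ILE-DE-FRANCE", "PARIS", "75101")],
    names=["Région", "Département", "Code INSEE"])
toy = pnd.DataFrame({"Population": [61.4, 0.8, 17.6]}, index=idx)

# the regions are interleaved, so the index is not sorted yet
was_sorted = toy.index.is_monotonic_increasing
toy_sorted = toy.sort_index()
is_sorted = toy_sorted.index.is_monotonic_increasing

# on the sorted index, a slice between two partial keys is well defined
idf = toy_sorted.loc[("ILE-DE-FRANCE", "HAUTS-DE-SEINE"):("ILE-DE-FRANCE", "PARIS")]
```

Once the index is sorted, label-based slices such as the last line select a contiguous block of rows.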


### Hierarchical columns

In [14]:
# recent pandas versions expect sheet_name (the older spelling was sheetname)
thd = pnd.read_excel("FranceTHD_Open_Data_Observatoire_Juin2015.xlsx",
                     sheet_name="Communes",
                     header=1,
                     index_col="Code INSEE",
                     names=["Département", "Commune",
                            "1 Mbit", "3 Mbit", "8 Mbit", "30 Mbit", "100 Mbit",
                            "DSL 1 Mbit", "DSL 3 Mbit", "DSL 8 Mbit", "DSL 30 Mbit", "DSL 100 Mbit",
                            "Câble 1 Mbit", "Câble 3 Mbit", "Câble 8 Mbit", "Câble 30 Mbit", "Câble 100 Mbit",
                            "Fibre 1 Mbit", "Fibre 3 Mbit", "Fibre 8 Mbit", "Fibre 30 Mbit", "Fibre 100 Mbit"])
thd.head()

Code INSEE,Département,Commune,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,DSL 1 Mbit,DSL 3 Mbit,DSL 8 Mbit,...,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
1001,AIN,L'Abergement-Clémenciat,1.0,0.448,0.052,0.0,0.0,1.0,0.448,0.052,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002,AIN,L'Abergement-de-Varey,0.676,0.594,0.571,0.571,0.571,0.124,0.024,0.0,...,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
1004,AIN,Ambérieu-en-Bugey,1.0,0.966,0.794,0.234,0.0,1.0,0.966,0.794,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1005,AIN,Ambérieux-en-Dombes,1.0,0.99,0.942,0.667,0.004,1.0,0.985,0.937,...,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
1006,AIN,Ambléon,1.0,1.0,1.0,0.934,0.934,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934


In [15]:
# keep only the broadband columns
thd2 = thd.loc[:, "1 Mbit":]
thd2.head()

Code INSEE,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,DSL 1 Mbit,DSL 3 Mbit,DSL 8 Mbit,DSL 30 Mbit,DSL 100 Mbit,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
1001,1.0,0.448,0.052,0.0,0.0,1.0,0.448,0.052,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002,0.676,0.594,0.571,0.571,0.571,0.124,0.024,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
1004,1.0,0.966,0.794,0.234,0.0,1.0,0.966,0.794,0.234,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1005,1.0,0.99,0.942,0.667,0.004,1.0,0.985,0.937,0.663,0,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
1006,1.0,1.0,1.0,0.934,0.934,1.0,1.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934


There are implicitly 2 levels of columns, Techno and Débit, as in the original Excel file.

In [16]:
# create hierarchical columns and name their levels
thd2.columns = [["THD", "THD", "THD", "THD", "THD",
                 "DSL", "DSL", "DSL", "DSL", "DSL",
                 "Câble", "Câble", "Câble", "Câble", "Câble",
                 "Fibre", "Fibre", "Fibre", "Fibre", "Fibre"],
                ["1 Mbit", "3 Mbit", "8 Mbit", "30 Mbit", "100 Mbit",
                 "1 Mbit", "3 Mbit", "8 Mbit", "30 Mbit", "100 Mbit",
                 "1 Mbit", "3 Mbit", "8 Mbit", "30 Mbit", "100 Mbit",
                 "1 Mbit", "3 Mbit", "8 Mbit", "30 Mbit", "100 Mbit"]]
thd2.columns.names = ["Techno", "Débit"]  # name the 2 column levels
thd2.head()

Techno,THD,THD,THD,THD,THD,DSL,DSL,DSL,DSL,DSL,Câble,Câble,Câble,Câble,Câble,Fibre,Fibre,Fibre,Fibre,Fibre
Débit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit
Code INSEE,,,,,,,,,,,,,,,,,,,,
1001,1.0,0.448,0.052,0.0,0.0,1.0,0.448,0.052,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002,0.676,0.594,0.571,0.571,0.571,0.124,0.024,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
1004,1.0,0.966,0.794,0.234,0.0,1.0,0.966,0.794,0.234,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1005,1.0,0.99,0.942,0.667,0.004,1.0,0.985,0.937,0.663,0,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
1006,1.0,1.0,1.0,0.934,0.934,1.0,1.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934


This reproduces the structure of the original Excel file, with its merged cells.

In [17]:
# access a single value
thd2.loc["01002", ("DSL", "3 Mbit")]

0.024

In [18]:
thd2["DSL"].columns

Index(['1 Mbit', '3 Mbit', '8 Mbit', '30 Mbit', '100 Mbit'], dtype='object', name='Débit')

In [19]:
# access part of a row
thd2.loc["01002", ("DSL")]  # or thd2.loc["01002", "DSL"]

Débit
1 Mbit      0.124
3 Mbit      0.024
8 Mbit      0.000
30 Mbit     0.000
100 Mbit    0.000
Name: 01002, dtype: float64

**Note**: the lists describing the hierarchical columns contain repetitions. They could have been built with the *+* and *\** operators, which act on lists:

In [20]:
[["THD"] * 5 + ["DSL"] * 5 + ["Câble"] * 5 + ["Fibre"] * 5,
 ["1 Mbit", "3 Mbit", "8 Mbit", "30 Mbit", "100 Mbit"] * 4]

[['THD',
  'THD',
  'THD',
  'THD',
  'THD',
  'DSL',
  'DSL',
  'DSL',
  'DSL',
  'DSL',
  'Câble',
  'Câble',
  'Câble',
  'Câble',
  'Câble',
  'Fibre',
  'Fibre',
  'Fibre',
  'Fibre',
  'Fibre'],
 ['1 Mbit',
  '3 Mbit',
  '8 Mbit',
  '30 Mbit',
  '100 Mbit',
  '1 Mbit',
  '3 Mbit',
  '8 Mbit',
  '30 Mbit',
  '100 Mbit',
  '1 Mbit',
  '3 Mbit',
  '8 Mbit',
  '30 Mbit',
  '100 Mbit',
  '1 Mbit',
  '3 Mbit',
  '8 Mbit',
  '30 Mbit',
  '100 Mbit']]
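Another option, not used in this section, is to let pandas build the combinations itself. The sketch below relies on `MultiIndex.from_product`, which takes one list per level and generates every pair; the variable names are only illustrative.

```python
import pandas as pnd

technos = ["THD", "DSL", "Câble", "Fibre"]
rates = ["1 Mbit", "3 Mbit", "8 Mbit", "30 Mbit", "100 Mbit"]

# every (Techno, Débit) pair, the first list varying slowest
columns = pnd.MultiIndex.from_product([technos, rates], names=["Techno", "Débit"])

# the same index built from the two repeated lists shown above
by_hand = pnd.MultiIndex.from_arrays(
    [["THD"] * 5 + ["DSL"] * 5 + ["Câble"] * 5 + ["Fibre"] * 5, rates * 4],
    names=["Techno", "Débit"])
```

The resulting object could be assigned directly to `thd2.columns`, which also sets the level names in one step.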

### Swapping hierarchical levels

Levels of a hierarchical index or of hierarchical columns can be swapped with the **swaplevel()** method:

In [22]:
thd2.swaplevel("Débit", "Techno", axis=1).head()

Débit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit
Techno,THD,THD,THD,THD,THD,DSL,DSL,DSL,DSL,DSL,Câble,Câble,Câble,Câble,Câble,Fibre,Fibre,Fibre,Fibre,Fibre
Code INSEE,,,,,,,,,,,,,,,,,,,,
1001,1.0,0.448,0.052,0.0,0.0,1.0,0.448,0.052,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002,0.676,0.594,0.571,0.571,0.571,0.124,0.024,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
1004,1.0,0.966,0.794,0.234,0.0,1.0,0.966,0.794,0.234,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1005,1.0,0.99,0.942,0.667,0.004,1.0,0.985,0.937,0.663,0,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
1006,1.0,1.0,1.0,0.934,0.934,1.0,1.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934


The levels are swapped as expected, but identical rates are now scattered across the columns. We will see below how to group them.
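As a side note, one possible way to regroup the columns right after the swap is to sort them. This is only a sketch on a hypothetical one-row frame, and it is not the approach developed later in this chapter.

```python
import pandas as pnd

# hypothetical toy frame: 2 technos x 2 rates for one commune
cols = pnd.MultiIndex.from_product(
    [["DSL", "Fibre"], ["1 Mbit", "3 Mbit"]], names=["Techno", "Débit"])
toy = pnd.DataFrame([[1.0, 0.448, 0.0, 0.0]],
                    index=pnd.Index(["01001"], name="Code INSEE"),
                    columns=cols)

# after the swap, "Débit" is on top but its values alternate
swapped = toy.swaplevel("Débit", "Techno", axis=1)
# sorting the columns brings identical rates next to each other
grouped = swapped.sort_index(axis=1)
```

The sort is lexicographic on the labels, so with the full data "100 Mbit" would come before "3 Mbit".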

## 7.2 Pivot operations / cross tabulations

The **pivot_table()** method builds a summary table of values aggregated and broken down by the distinct values of one or more columns.

It returns a new *DataFrame* according to the parameters provided:

- index: columns of the original *DataFrame* whose values are used as the index
- values: columns of the original *DataFrame* whose values are aggregated
- columns: columns of the original *DataFrame* whose values are used as column names
- aggfunc: aggregation function applied to the values, by default the mean
- margins: whether to compute subtotals

There is also the **crosstab()** function, which by default computes a table of counts and, with its *normalize* option, a table of cross frequencies.
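The parameters listed above can be tried on a very small frame before running them on the communes. The data below is hypothetical (made-up regions and populations), chosen so that the sums are easy to check by hand.

```python
import pandas as pnd

# hypothetical toy data, not the real communes file
toy = pnd.DataFrame({
    "Région": ["NORD", "NORD", "SUD", "SUD", "SUD"],
    "Statut": ["Commune simple", "Préfecture", "Commune simple",
               "Commune simple", "Préfecture"],
    "Population": [1.0, 5.0, 2.0, 3.0, 8.0],
})

# sum of "Population" per region and status, with totals in the "All" margins
sums = toy.pivot_table(index="Région", columns="Statut", values="Population",
                       aggfunc="sum", margins=True)

# number of rows per (region, status) pair
counts = pnd.crosstab(toy["Région"], toy["Statut"])
```

Here `sums` holds aggregated values of a third column, whereas `counts` only counts how many rows fall into each cell.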

In [23]:
# cross table: "Statut" x "Altitude Moyenne", "Population", "Superficie"
geo.pivot_table(index="Statut", values=["Altitude Moyenne", "Population", "Superficie"])

Statut,Altitude Moyenne,Population,Superficie
Commune simple,279.097813,0.868631,14.665004
Chef-lieu canton,269.461074,7.381477,43.604453
Sous-préfecture,236.016667,22.865417,56.93575
Préfecture,220.905405,58.458108,37.894189
Préfecture de région,141.884615,140.057692,47.372692
Capitale d'état,33.0,17.6,1.81


In [24]:
# cross table: "Région" x "Statut" with the sum of "Population"
geo.pivot_table(index="Région", columns="Statut", values="Population", aggfunc='sum')

Statut,Commune simple,Chef-lieu canton,Sous-préfecture,Préfecture,Préfecture de région,Capitale d'état
Région,,,,,,
ALSACE,973.2,307.9,224.3,67.2,271.7,
AQUITAINE,1672.6,902.5,219.3,176.4,236.7,
AUVERGNE,677.0,323.6,138.5,66.8,138.6,
BASSE-NORMANDIE,861.7,332.8,122.3,46.1,109.3,
BOURGOGNE,882.5,351.9,147.5,108.3,152.1,
BRETAGNE,1697.5,760.7,348.0,162.1,206.6,
CENTRE,1258.4,642.6,190.7,333.5,113.2,
CHAMPAGNE-ARDENNE,662.9,197.3,299.7,134.6,46.2,
...,...,...,...,...,...,...
MIDI-PYRENEES,1409.0,605.6,183.0,224.7,440.2,


In [25]:
# cross frequencies (as percentages)
pnd.set_option('display.precision', 3)
pnd.crosstab(geo["Région"], geo["Statut"], normalize=True, margins=True) * 100

Statut,Commune simple,Chef-lieu canton,Sous-préfecture,Préfecture,Préfecture de région,Capitale d'état,All
Région,,,,,,,
ALSACE,2.292,0.136,0.027,0.003,0.003,0.000,2.460
AQUITAINE,5.696,0.501,0.038,0.011,0.003,0.000,6.249
AUVERGNE,3.201,0.327,0.027,0.008,0.003,0.000,3.565
BASSE-NORMANDIE,4.600,0.302,0.022,0.005,0.003,0.000,4.932
BOURGOGNE,5.166,0.362,0.030,0.008,0.003,0.000,5.569
BRETAGNE,2.994,0.422,0.030,0.008,0.003,0.000,3.457
CENTRE,4.556,0.400,0.038,0.014,0.003,0.000,5.011
CHAMPAGNE-ARDENNE,5.000,0.278,0.030,0.008,0.003,0.000,5.318
...,...,...,...,...,...,...,...
NORD-PAS-DE-CALAIS,3.892,0.278,0.030,0.003,0.003,0.000,4.205


The **stack()** and **unstack()** methods move column levels into the index (stacking) and index levels into the columns (unstacking).

In [26]:
thd2.head()

Techno,THD,THD,THD,THD,THD,DSL,DSL,DSL,DSL,DSL,Câble,Câble,Câble,Câble,Câble,Fibre,Fibre,Fibre,Fibre,Fibre
Débit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit
Code INSEE,,,,,,,,,,,,,,,,,,,,
1001,1.0,0.448,0.052,0.0,0.0,1.0,0.448,0.052,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002,0.676,0.594,0.571,0.571,0.571,0.124,0.024,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
1004,1.0,0.966,0.794,0.234,0.0,1.0,0.966,0.794,0.234,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1005,1.0,0.99,0.942,0.667,0.004,1.0,0.985,0.937,0.663,0,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
1006,1.0,1.0,1.0,0.934,0.934,1.0,1.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934


The **stack()** method takes the lowest column level and moves it to the lowest index level.

In [27]:
# move the "Débit" column level into the index
thd2.stack()

,Techno,Câble,DSL,Fibre,THD
Code INSEE,Débit,,,,
01001,1 Mbit,0.0,1.000,0.000,1.000
01001,100 Mbit,0.0,0.000,0.000,0.000
01001,3 Mbit,0.0,0.448,0.000,0.448
01001,30 Mbit,0.0,0.000,0.000,0.000
01001,8 Mbit,0.0,0.052,0.000,0.052
01002,1 Mbit,0.0,0.124,0.571,0.676
01002,100 Mbit,0.0,0.000,0.571,0.571
01002,3 Mbit,0.0,0.024,0.571,0.594
...,...,...,...,...,...
97701,3 Mbit,0.0,0.907,0.000,0.907


The operation can be repeated until no column level is left. The result is then a *Series* object with a hierarchical index.

In [28]:
# move the "Débit" and "Techno" column levels into the index
thd2.stack().stack()

Code INSEE  Débit     Techno
01001       1 Mbit    Câble     0.000
                      DSL       1.000
                      Fibre     0.000
                      THD       1.000
            100 Mbit  Câble     0.000
                      DSL       0.000
                      Fibre     0.000
                      THD       0.000
                                ...  
97801       30 Mbit   Câble     0.000
                      DSL       0.211
                      Fibre     0.000
                      THD       0.211
            8 Mbit    Câble     0.000
                      DSL       0.663
                      Fibre     0.000
                      THD       0.663
Length: 733860, dtype: float64

Conversely, the **unstack()** method takes the lowest index level and moves it to the lowest column level.

In [29]:
# unstack
thd2.stack().unstack()

Techno,Câble,Câble,Câble,Câble,Câble,DSL,DSL,DSL,DSL,DSL,Fibre,Fibre,Fibre,Fibre,Fibre,THD,THD,THD,THD,THD
Débit,1 Mbit,100 Mbit,3 Mbit,30 Mbit,8 Mbit,1 Mbit,100 Mbit,3 Mbit,30 Mbit,8 Mbit,1 Mbit,100 Mbit,3 Mbit,30 Mbit,8 Mbit,1 Mbit,100 Mbit,3 Mbit,30 Mbit,8 Mbit
Code INSEE,,,,,,,,,,,,,,,,,,,,
01001,0.0,0.0,0.0,0.0,0.0,1.000,0.0,0.448,0.000,0.052,0.000,0.000,0.000,0.000,0.000,1.000,0.000,0.448,0.000,0.052
01002,0.0,0.0,0.0,0.0,0.0,0.124,0.0,0.024,0.000,0.000,0.571,0.571,0.571,0.571,0.571,0.676,0.571,0.594,0.571,0.571
01004,0.0,0.0,0.0,0.0,0.0,1.000,0.0,0.966,0.234,0.794,0.000,0.000,0.000,0.000,0.000,1.000,0.000,0.966,0.234,0.794
01005,0.0,0.0,0.0,0.0,0.0,1.000,0.0,0.985,0.663,0.937,0.004,0.004,0.004,0.004,0.004,1.000,0.004,0.990,0.667,0.942
01006,0.0,0.0,0.0,0.0,0.0,1.000,0.0,1.000,0.000,1.000,0.934,0.934,0.934,0.934,0.934,1.000,0.934,1.000,0.934,1.000
01007,0.0,0.0,0.0,0.0,0.0,0.778,0.0,0.066,0.000,0.059,0.713,0.713,0.713,0.713,0.713,0.973,0.713,0.779,0.713,0.772
01008,0.0,0.0,0.0,0.0,0.0,1.000,0.0,0.557,0.000,0.013,0.000,0.000,0.000,0.000,0.000,1.000,0.000,0.557,0.000,0.013
01009,0.0,0.0,0.0,0.0,0.0,1.000,0.0,0.616,0.000,0.584,0.000,0.000,0.000,0.000,0.000,1.000,0.000,0.616,0.000,0.584
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97612,0.0,0.0,0.0,0.0,0.0,0.996,0.0,0.871,0.394,0.624,0.000,0.000,0.000,0.000,0.000,0.996,0.000,0.871,0.394,0.624


The level to stack or unstack can be specified, either by the name of the index or column level or by its position.

In [30]:
thd2.stack("Techno")  # or thd2.stack(level=0)

,Débit,1 Mbit,100 Mbit,3 Mbit,30 Mbit,8 Mbit
Code INSEE,Techno,,,,,
01001,Câble,0.000,0.000,0.000,0.000,0.000
01001,DSL,1.000,0.000,0.448,0.000,0.052
01001,Fibre,0.000,0.000,0.000,0.000,0.000
01001,THD,1.000,0.000,0.448,0.000,0.052
01002,Câble,0.000,0.000,0.000,0.000,0.000
01002,DSL,0.124,0.000,0.024,0.000,0.000
01002,Fibre,0.571,0.571,0.571,0.571,0.571
01002,THD,0.676,0.571,0.594,0.571,0.571
...,...,...,...,...,...,...
97701,Câble,0.000,0.000,0.000,0.000,0.000


Applying the **unstack()** method afterwards yields, this time, a table in which identical rates are grouped together.

In [31]:
thd2.stack("Techno").unstack()

Débit,1 Mbit,1 Mbit,1 Mbit,1 Mbit,100 Mbit,100 Mbit,100 Mbit,100 Mbit,3 Mbit,3 Mbit,3 Mbit,3 Mbit,30 Mbit,30 Mbit,30 Mbit,30 Mbit,8 Mbit,8 Mbit,8 Mbit,8 Mbit
Techno,Câble,DSL,Fibre,THD,Câble,DSL,Fibre,THD,Câble,DSL,Fibre,THD,Câble,DSL,Fibre,THD,Câble,DSL,Fibre,THD
Code INSEE,,,,,,,,,,,,,,,,,,,,
01001,0.0,1.000,0.000,1.000,0.0,0.0,0.000,0.000,0.0,0.448,0.000,0.448,0.0,0.000,0.000,0.000,0.0,0.052,0.000,0.052
01002,0.0,0.124,0.571,0.676,0.0,0.0,0.571,0.571,0.0,0.024,0.571,0.594,0.0,0.000,0.571,0.571,0.0,0.000,0.571,0.571
01004,0.0,1.000,0.000,1.000,0.0,0.0,0.000,0.000,0.0,0.966,0.000,0.966,0.0,0.234,0.000,0.234,0.0,0.794,0.000,0.794
01005,0.0,1.000,0.004,1.000,0.0,0.0,0.004,0.004,0.0,0.985,0.004,0.990,0.0,0.663,0.004,0.667,0.0,0.937,0.004,0.942
01006,0.0,1.000,0.934,1.000,0.0,0.0,0.934,0.934,0.0,1.000,0.934,1.000,0.0,0.000,0.934,0.934,0.0,1.000,0.934,1.000
01007,0.0,0.778,0.713,0.973,0.0,0.0,0.713,0.713,0.0,0.066,0.713,0.779,0.0,0.000,0.713,0.713,0.0,0.059,0.713,0.772
01008,0.0,1.000,0.000,1.000,0.0,0.0,0.000,0.000,0.0,0.557,0.000,0.557,0.0,0.000,0.000,0.000,0.0,0.013,0.000,0.013
01009,0.0,1.000,0.000,1.000,0.0,0.0,0.000,0.000,0.0,0.616,0.000,0.616,0.0,0.000,0.000,0.000,0.0,0.584,0.000,0.584
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97612,0.0,0.996,0.000,0.996,0.0,0.0,0.000,0.000,0.0,0.871,0.000,0.871,0.0,0.394,0.000,0.394,0.0,0.624,0.000,0.624
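The whole stack / unstack sequence can be replayed on a frame small enough to follow by eye. The sketch below uses a hypothetical 2 x 4 extract (values copied from the first two communes above). Depending on the pandas version, **stack()** may emit a FutureWarning about its implementation and may order the rows differently, so the checks should go through labels rather than positions.

```python
import pandas as pnd

# hypothetical toy extract: 2 communes, 2 technos x 2 rates
cols = pnd.MultiIndex.from_product(
    [["DSL", "Fibre"], ["1 Mbit", "8 Mbit"]], names=["Techno", "Débit"])
toy = pnd.DataFrame(
    [[1.0, 0.052, 0.0, 0.0],
     [0.124, 0.0, 0.571, 0.571]],
    index=pnd.Index(["01001", "01002"], name="Code INSEE"),
    columns=cols)

# stacking every column level in turn ends with a Series
flat = toy.stack().stack()

# stacking "Techno" first, then unstacking it, puts "Débit" on top of the columns
by_rate = toy.stack("Techno").unstack()
```

In `by_rate` the column levels are ordered (Débit, Techno), the same regrouped layout as in the table above.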


**Optional guided exercise: flattening the data**

1) Start from the *DataFrame* thd and reset the index on a copy.

2) Move the *Code INSEE*, *Département* and *Commune* columns to the index.

3) The remaining columns hold the telecom data. Build a hierarchical index on these columns with the levels *Techno* and *Débit*. The result is a *DataFrame* with a 3-level hierarchical index and 2-level columns.

4) Move the hierarchical columns to the index. The result is a *Series* object with a 5-level hierarchical index, whose values are the throughput figures.

5) Reset the index and name the columns *Code INSEE*, *Département*, *Commune*, *Techno* and *Débit*, adding the name *Valeur* for the column that holds the values. The result is a table with 6 distinct columns and one row per numeric value.

6) A particular value can then be looked up by specifying the commune, the techno and the débit.
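
On a toy table, the six steps can be sketched as follows. This is only a minimal sketch: `thd_toy` is a hypothetical miniature stand-in for thd, and it assumes the telecom column names have the form "Techno Débit" (for example "DSL 1 Mbit"), so that splitting on the first space yields the two column levels.

```python
import pandas as pnd

# Hypothetical miniature stand-in for the thd DataFrame (values are illustrative)
thd_toy = pnd.DataFrame(
    {
        "Département": ["AIN", "AIN"],
        "Commune": ["L'ABERGEMENT-CLEMENCIAT", "L'ABERGEMENT-DE-VAREY"],
        "DSL 1 Mbit": [1.000, 0.124],
        "DSL 3 Mbit": [0.448, 0.024],
        "Fibre 1 Mbit": [0.000, 0.571],
        "Fibre 3 Mbit": [0.000, 0.571],
    },
    index=pnd.Index(["01001", "01002"], name="Code INSEE"),
)

# 1) reset the index on a copy
flat = thd_toy.copy().reset_index()

# 2) move the identifying columns to the index (3 levels)
flat = flat.set_index(["Code INSEE", "Département", "Commune"])

# 3) split "DSL 1 Mbit" into ("DSL", "1 Mbit") to get 2-level columns
flat.columns = pnd.MultiIndex.from_tuples(
    [tuple(name.split(" ", 1)) for name in flat.columns],
    names=["Techno", "Débit"],
)

# 4) move both column levels to the index: Series with a 5-level index
series = flat.stack(["Techno", "Débit"])

# 5) back to a flat table, naming the value column "Valeur"
table = series.rename("Valeur").reset_index()

# 6) look up one value by commune, techno and débit
row = table[
    (table["Commune"] == "L'ABERGEMENT-DE-VAREY")
    & (table["Techno"] == "Fibre")
    & (table["Débit"] == "3 Mbit")
]
print(row)
```

Depending on the pandas version, `stack()` may emit a warning about its new implementation; the resulting table is the same here.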

In [None]:
# reset the index


Only the broadband data needs to be flattened. Start by moving the other columns (*Code INSEE*, *Département* and *Commune*) to the index, which gives a 3-level hierarchical index.

In [None]:
# move the other columns to the index


Create a hierarchical index on the remaining columns. The result is a *DataFrame* with a 3-level hierarchical index and 2-level columns.

In [None]:
# create a hierarchical index on the columns


Move the hierarchical columns to the index. The result is a *Series* object with a 5-level hierarchical index.

In [None]:
# move the hierarchical columns to the index


Move all index levels back to columns. The result is a table with 6 distinct columns and one row per numeric value.

In [None]:
# reset the index


In [None]:
# access a particular row


## 7.3 Merging *DataFrame* objects

In this section we merge the *DataFrame* of communes with the one holding the broadband data, using the "*Code INSEE*" index as the key.

The **join()** method merges 2 *DataFrame* objects on the index (by default) or on a column.

The "*how*" option specifies how the join (the term used in databases) is performed:
- left (default): uses the index of the *DataFrame* on which the method is called
- right: uses the index of the *DataFrame* passed as argument
- outer: uses the union of the indexes
- inner: uses the intersection of the indexes

The *lsuffix* and *rsuffix* options add a suffix to the names of columns that appear in both *DataFrame* objects.
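
To make the four join modes concrete, here is a minimal sketch on two toy tables. The names `left` and `right` and their contents are hypothetical and are not the geo and thd data; only `join()`, `how` and `rsuffix` come from the text above.

```python
import pandas as pnd

# Two hypothetical tables sharing the "Commune" column and part of their index
left = pnd.DataFrame(
    {"Commune": ["A", "B", "C"], "Population": [0.8, 0.2, 13.4]},
    index=pnd.Index(["01001", "01002", "01004"], name="Code INSEE"),
)
right = pnd.DataFrame(
    {"Commune": ["B", "C", "D"], "1 Mbit": [0.676, 1.0, 1.0]},
    index=pnd.Index(["01002", "01004", "01005"], name="Code INSEE"),
)

# One join per mode; rsuffix renames the duplicated column of the right table
joined = {
    how: left.join(right, how=how, rsuffix="_")
    for how in ("left", "right", "outer", "inner")
}

for how, df in joined.items():
    print(how, list(df.index))
```

Rows missing on one side are filled with *NaN*, and the duplicated column of the right-hand table is renamed "Commune_".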

In [32]:
alldata = geo.join(thd, rsuffix="_")
alldata.head()

Unnamed: 0_level_0,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,...,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
Code INSEE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1001,1400,L'ABERGEMENT-CLEMENCIAT,AIN,RHONE-ALPES,Commune simple,242.0,15.65,0.8,"46.1534255214, 4.92611354223","{""type"": ""Polygon"", ""coordinates"": [[[4.926273...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002,1640,L'ABERGEMENT-DE-VAREY,AIN,RHONE-ALPES,Commune simple,483.0,9.12,0.2,"46.0091878776, 5.42801696363","{""type"": ""Polygon"", ""coordinates"": [[[5.430089...",...,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
1004,1500,AMBERIEU-EN-BUGEY,AIN,RHONE-ALPES,Chef-lieu canton,379.0,24.48,13.4,"45.9608475114, 5.3729257777","{""type"": ""Polygon"", ""coordinates"": [[[5.386190...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1005,1330,AMBERIEUX-EN-DOMBES,AIN,RHONE-ALPES,Commune simple,290.0,16.05,1.6,"45.9961799872, 4.91227250796","{""type"": ""Polygon"", ""coordinates"": [[[4.895580...",...,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
1006,1300,AMBLEON,AIN,RHONE-ALPES,Commune simple,589.0,6.02,0.1,"45.7494989044, 5.59432017366","{""type"": ""Polygon"", ""coordinates"": [[[5.614854...",...,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934


In [33]:
alldata.columns

Index(['Code Postal', 'Commune', 'Département', 'Région', 'Statut',
       'Altitude Moyenne', 'Superficie', 'Population', 'geo_point_2d',
       'geo_shape', 'Latitude', 'Longitude', 'Densité', 'Département_',
       'Commune_', '1 Mbit', '3 Mbit', '8 Mbit', '30 Mbit', '100 Mbit',
       'DSL 1 Mbit', 'DSL 3 Mbit', 'DSL 8 Mbit', 'DSL 30 Mbit', 'DSL 100 Mbit',
       'Câble 1 Mbit', 'Câble 3 Mbit', 'Câble 8 Mbit', 'Câble 30 Mbit',
       'Câble 100 Mbit', 'Fibre 1 Mbit', 'Fibre 3 Mbit', 'Fibre 8 Mbit',
       'Fibre 30 Mbit', 'Fibre 100 Mbit'],
      dtype='object')

Note that *pandas* has suffixed the 2 columns "Département" and "Commune" coming from the right-hand *DataFrame* (thd), since only *rsuffix* was specified.

Several other functions are available for combining *DataFrame* objects:

- **merge()** function: join on an index or a column
- **append()** method: appends a *Series* or *DataFrame* object (deprecated in pandas 1.4 and removed in 2.0, where **concat()** replaces it)
- **concat()** function: concatenation of *DataFrame* objects along an axis

See the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
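
As a rough illustration of **merge()** and **concat()**, the sketch below uses small hypothetical tables (`communes`, `debits` and `extra` are made-up names and values, not the course data).

```python
import pandas as pnd

# Hypothetical tables keyed by a "Code INSEE" column
communes = pnd.DataFrame(
    {"Code INSEE": ["01001", "01002"],
     "Commune": ["L'ABERGEMENT-CLEMENCIAT", "L'ABERGEMENT-DE-VAREY"]}
)
debits = pnd.DataFrame(
    {"Code INSEE": ["01002", "01001"], "1 Mbit": [0.676, 1.0]}
)

# merge(): join on a column; rows are matched on the key, not on position
merged = pnd.merge(communes, debits, on="Code INSEE", how="inner")

# concat() along axis 0: stack rows (this is what append() used to do)
extra = pnd.DataFrame({"Code INSEE": ["01004"], "Commune": ["AMBERIEU-EN-BUGEY"]})
stacked = pnd.concat([communes, extra], ignore_index=True)

# concat() along axis 1: put columns side by side, aligned on the index
side_by_side = pnd.concat(
    [communes.set_index("Code INSEE"), debits.set_index("Code INSEE")],
    axis=1,
)

print(merged)
print(stacked)
print(side_by_side)
```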

**Exercise**

Build a pivot table from the 1 Mbit to 100 Mbit data, broken down by commune status.

In [39]:
# pivot on the values of the "Statut" column
alldata.pivot_table(index='Statut', values=['1 Mbit', '3 Mbit', '8 Mbit', '30 Mbit', '100 Mbit'])

Unnamed: 0_level_0,1 Mbit,100 Mbit,3 Mbit,30 Mbit,8 Mbit
Statut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Commune simple,0.951,0.018,0.58,0.112,0.417
Chef-lieu canton,0.993,0.06,0.908,0.411,0.826
Sous-préfecture,0.997,0.11,0.957,0.376,0.883
Préfecture,1.0,0.313,0.957,0.485,0.875
Préfecture de région,1.0,0.474,0.98,0.717,0.924
Capitale d'état,,,,,


## 7.4 Techniques for filling in missing data

The *DataFrame* contains incomplete rows.

In [40]:
alldata[alldata.isnull().any(axis=1)]

Unnamed: 0_level_0,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,...,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
Code INSEE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
09304,09240,SUZAN,ARIEGE,MIDI-PYRENEES,Commune simple,548.0,2.40,0.0,"43.0294117749, 1.43489468125","{""type"": ""Polygon"", ""coordinates"": [[[1.432086...",...,,,,,,,,,,
13201,13001,MARSEILLE--1ER-ARRONDISSEMENT,BOUCHES-DU-RHONE,PROVENCE-ALPES-COTE D'AZUR,Préfecture de région,29.0,1.78,40.9,"43.2999009436, 5.38227869795","{""type"": ""Polygon"", ""coordinates"": [[[5.372144...",...,,,,,,,,,,
13202,13002,MARSEILLE--2E--ARRONDISSEMENT,BOUCHES-DU-RHONE,PROVENCE-ALPES-COTE D'AZUR,Commune simple,6.0,3.64,25.7,"43.3126964178, 5.36364983265","{""type"": ""Polygon"", ""coordinates"": [[[5.373616...",...,,,,,,,,,,
13203,13003,MARSEILLE--3E--ARRONDISSEMENT,BOUCHES-DU-RHONE,PROVENCE-ALPES-COTE D'AZUR,Commune simple,25.0,2.54,45.4,"43.3121200046, 5.38010981423","{""type"": ""Polygon"", ""coordinates"": [[[5.375021...",...,,,,,,,,,,
13204,13004,MARSEILLE--4E--ARRONDISSEMENT,BOUCHES-DU-RHONE,PROVENCE-ALPES-COTE D'AZUR,Commune simple,58.0,2.88,47.1,"43.306733355, 5.40087488752","{""type"": ""Polygon"", ""coordinates"": [[[5.408970...",...,,,,,,,,,,
13205,13005,MARSEILLE--5E--ARRONDISSEMENT,BOUCHES-DU-RHONE,PROVENCE-ALPES-COTE D'AZUR,Commune simple,35.0,2.18,44.5,"43.2928061953, 5.3975770959","{""type"": ""Polygon"", ""coordinates"": [[[5.391944...",...,,,,,,,,,,
13206,13006,MARSEILLE--6E--ARRONDISSEMENT,BOUCHES-DU-RHONE,PROVENCE-ALPES-COTE D'AZUR,Commune simple,41.0,2.06,43.3,"43.2870989546, 5.38104785394","{""type"": ""Polygon"", ""coordinates"": [[[5.373816...",...,,,,,,,,,,
13207,13007,MARSEILLE--7E--ARRONDISSEMENT,BOUCHES-DU-RHONE,PROVENCE-ALPES-COTE D'AZUR,Commune simple,35.0,5.84,35.9,"43.28251778, 5.36323440435","{""type"": ""Polygon"", ""coordinates"": [[[5.367986...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75115,75015,PARIS-15E-ARRONDISSEMENT,PARIS,ILE-DE-FRANCE,Chef-lieu canton,40.0,8.46,236.5,"48.8401554186, 2.29355937244","{""type"": ""Polygon"", ""coordinates"": [[[2.301320...",...,,,,,,,,,,


We now look at the different ways of filling in missing data automatically.

In [44]:
# convert the statuses back to type str
alldata2 = alldata.copy().sort_index()
alldata2["Statut"] = alldata2["Statut"].astype(str)
alldata2

Unnamed: 0_level_0,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,...,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
Code INSEE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01001,01400,L'ABERGEMENT-CLEMENCIAT,AIN,RHONE-ALPES,Commune simple,242.0,15.65,0.8,"46.1534255214, 4.92611354223","{""type"": ""Polygon"", ""coordinates"": [[[4.926273...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01002,01640,L'ABERGEMENT-DE-VAREY,AIN,RHONE-ALPES,Commune simple,483.0,9.12,0.2,"46.0091878776, 5.42801696363","{""type"": ""Polygon"", ""coordinates"": [[[5.430089...",...,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
01004,01500,AMBERIEU-EN-BUGEY,AIN,RHONE-ALPES,Chef-lieu canton,379.0,24.48,13.4,"45.9608475114, 5.3729257777","{""type"": ""Polygon"", ""coordinates"": [[[5.386190...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01005,01330,AMBERIEUX-EN-DOMBES,AIN,RHONE-ALPES,Commune simple,290.0,16.05,1.6,"45.9961799872, 4.91227250796","{""type"": ""Polygon"", ""coordinates"": [[[4.895580...",...,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
01006,01300,AMBLEON,AIN,RHONE-ALPES,Commune simple,589.0,6.02,0.1,"45.7494989044, 5.59432017366","{""type"": ""Polygon"", ""coordinates"": [[[5.614854...",...,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934
01007,01500,AMBRONAY,AIN,RHONE-ALPES,Commune simple,309.0,33.59,2.3,"46.0055913782, 5.35760660735","{""type"": ""Polygon"", ""coordinates"": [[[5.413533...",...,0.0,0.0,0.0,0.0,0.0,0.713,0.713,0.713,0.713,0.713
01008,01500,AMBUTRIX,AIN,RHONE-ALPES,Commune simple,274.0,5.18,0.7,"45.9367134524, 5.3328092349","{""type"": ""Polygon"", ""coordinates"": [[[5.321986...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01009,01300,ANDERT-ET-CONDON,AIN,RHONE-ALPES,Commune simple,294.0,6.96,0.3,"45.7873565333, 5.65788307924","{""type"": ""Polygon"", ""coordinates"": [[[5.656393...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97610,97600,KOUNGOU,MAYOTTE,MAYOTTE,Chef-lieu canton,138.0,27.69,19.8,"-12.7465604467, 45.1869991913","{""type"": ""Polygon"", ""coordinates"": [[[45.23692...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000


*NaN* values can be replaced:
- by a given value,
- by the last known value, option *method='ffill'* (forward fill),
- by the next known value, option *method='bfill'* (backward fill),
- by a function applied to certain columns,
- by an interpolation function over the index.
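
The five strategies above can be compared on a toy table. In this sketch the `df` table and its values are hypothetical. It uses the **ffill()** and **bfill()** methods, which are equivalent to the *method=* option; recent pandas releases emit a deprecation warning for that keyword of **fillna()**. The column mean is just one possible choice of function.

```python
import numpy as np
import pandas as pnd

# Hypothetical table with gaps
df = pnd.DataFrame(
    {
        "Altitude Moyenne": [242.0, np.nan, 379.0, np.nan],
        "Population": [0.8, 0.2, np.nan, 1.6],
    },
    index=[0, 1, 2, 3],
)

by_value = df.fillna(0)                         # a given value
forward = df.ffill()                            # last known value
backward = df.bfill()                           # next known value
by_mean = df.fillna(df.mean())                  # a function per column (here the mean)
interpolated = df.interpolate(method="index")   # linear interpolation over the index

print(interpolated)
```

Note that a backward fill leaves a trailing *NaN* untouched, since no later value exists to copy from.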

In [43]:
# replace NaN with a given value
alldata2.fillna(0)

Unnamed: 0_level_0,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,...,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
Code INSEE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01001,01400,L'ABERGEMENT-CLEMENCIAT,AIN,RHONE-ALPES,Commune simple,242.0,15.65,0.8,"46.1534255214, 4.92611354223","{""type"": ""Polygon"", ""coordinates"": [[[4.926273...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01002,01640,L'ABERGEMENT-DE-VAREY,AIN,RHONE-ALPES,Commune simple,483.0,9.12,0.2,"46.0091878776, 5.42801696363","{""type"": ""Polygon"", ""coordinates"": [[[5.430089...",...,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
01004,01500,AMBERIEU-EN-BUGEY,AIN,RHONE-ALPES,Chef-lieu canton,379.0,24.48,13.4,"45.9608475114, 5.3729257777","{""type"": ""Polygon"", ""coordinates"": [[[5.386190...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01005,01330,AMBERIEUX-EN-DOMBES,AIN,RHONE-ALPES,Commune simple,290.0,16.05,1.6,"45.9961799872, 4.91227250796","{""type"": ""Polygon"", ""coordinates"": [[[4.895580...",...,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
01006,01300,AMBLEON,AIN,RHONE-ALPES,Commune simple,589.0,6.02,0.1,"45.7494989044, 5.59432017366","{""type"": ""Polygon"", ""coordinates"": [[[5.614854...",...,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934
01007,01500,AMBRONAY,AIN,RHONE-ALPES,Commune simple,309.0,33.59,2.3,"46.0055913782, 5.35760660735","{""type"": ""Polygon"", ""coordinates"": [[[5.413533...",...,0.0,0.0,0.0,0.0,0.0,0.713,0.713,0.713,0.713,0.713
01008,01500,AMBUTRIX,AIN,RHONE-ALPES,Commune simple,274.0,5.18,0.7,"45.9367134524, 5.3328092349","{""type"": ""Polygon"", ""coordinates"": [[[5.321986...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01009,01300,ANDERT-ET-CONDON,AIN,RHONE-ALPES,Commune simple,294.0,6.96,0.3,"45.7873565333, 5.65788307924","{""type"": ""Polygon"", ""coordinates"": [[[5.656393...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97610,97600,KOUNGOU,MAYOTTE,MAYOTTE,Chef-lieu canton,138.0,27.69,19.8,"-12.7465604467, 45.1869991913","{""type"": ""Polygon"", ""coordinates"": [[[45.23692...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000


In [45]:
# replace NaN with the last known value
alldata2.fillna(method='ffill')

Unnamed: 0_level_0,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,...,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
Code INSEE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01001,01400,L'ABERGEMENT-CLEMENCIAT,AIN,RHONE-ALPES,Commune simple,242.0,15.65,0.8,"46.1534255214, 4.92611354223","{""type"": ""Polygon"", ""coordinates"": [[[4.926273...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01002,01640,L'ABERGEMENT-DE-VAREY,AIN,RHONE-ALPES,Commune simple,483.0,9.12,0.2,"46.0091878776, 5.42801696363","{""type"": ""Polygon"", ""coordinates"": [[[5.430089...",...,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
01004,01500,AMBERIEU-EN-BUGEY,AIN,RHONE-ALPES,Chef-lieu canton,379.0,24.48,13.4,"45.9608475114, 5.3729257777","{""type"": ""Polygon"", ""coordinates"": [[[5.386190...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01005,01330,AMBERIEUX-EN-DOMBES,AIN,RHONE-ALPES,Commune simple,290.0,16.05,1.6,"45.9961799872, 4.91227250796","{""type"": ""Polygon"", ""coordinates"": [[[4.895580...",...,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
01006,01300,AMBLEON,AIN,RHONE-ALPES,Commune simple,589.0,6.02,0.1,"45.7494989044, 5.59432017366","{""type"": ""Polygon"", ""coordinates"": [[[5.614854...",...,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934
01007,01500,AMBRONAY,AIN,RHONE-ALPES,Commune simple,309.0,33.59,2.3,"46.0055913782, 5.35760660735","{""type"": ""Polygon"", ""coordinates"": [[[5.413533...",...,0.0,0.0,0.0,0.0,0.0,0.713,0.713,0.713,0.713,0.713
01008,01500,AMBUTRIX,AIN,RHONE-ALPES,Commune simple,274.0,5.18,0.7,"45.9367134524, 5.3328092349","{""type"": ""Polygon"", ""coordinates"": [[[5.321986...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01009,01300,ANDERT-ET-CONDON,AIN,RHONE-ALPES,Commune simple,294.0,6.96,0.3,"45.7873565333, 5.65788307924","{""type"": ""Polygon"", ""coordinates"": [[[5.656393...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97610,97600,KOUNGOU,MAYOTTE,MAYOTTE,Chef-lieu canton,138.0,27.69,19.8,"-12.7465604467, 45.1869991913","{""type"": ""Polygon"", ""coordinates"": [[[45.23692...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000


It is also possible to drop the rows (*axis=0*) or the columns (*axis=1*) that contain a *NaN*.

In [46]:
# drop the rows with missing values
alldata2.dropna(axis=0)

Unnamed: 0_level_0,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape,...,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
Code INSEE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01001,01400,L'ABERGEMENT-CLEMENCIAT,AIN,RHONE-ALPES,Commune simple,242.0,15.65,0.8,"46.1534255214, 4.92611354223","{""type"": ""Polygon"", ""coordinates"": [[[4.926273...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01002,01640,L'ABERGEMENT-DE-VAREY,AIN,RHONE-ALPES,Commune simple,483.0,9.12,0.2,"46.0091878776, 5.42801696363","{""type"": ""Polygon"", ""coordinates"": [[[5.430089...",...,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
01004,01500,AMBERIEU-EN-BUGEY,AIN,RHONE-ALPES,Chef-lieu canton,379.0,24.48,13.4,"45.9608475114, 5.3729257777","{""type"": ""Polygon"", ""coordinates"": [[[5.386190...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01005,01330,AMBERIEUX-EN-DOMBES,AIN,RHONE-ALPES,Commune simple,290.0,16.05,1.6,"45.9961799872, 4.91227250796","{""type"": ""Polygon"", ""coordinates"": [[[4.895580...",...,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
01006,01300,AMBLEON,AIN,RHONE-ALPES,Commune simple,589.0,6.02,0.1,"45.7494989044, 5.59432017366","{""type"": ""Polygon"", ""coordinates"": [[[5.614854...",...,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934
01007,01500,AMBRONAY,AIN,RHONE-ALPES,Commune simple,309.0,33.59,2.3,"46.0055913782, 5.35760660735","{""type"": ""Polygon"", ""coordinates"": [[[5.413533...",...,0.0,0.0,0.0,0.0,0.0,0.713,0.713,0.713,0.713,0.713
01008,01500,AMBUTRIX,AIN,RHONE-ALPES,Commune simple,274.0,5.18,0.7,"45.9367134524, 5.3328092349","{""type"": ""Polygon"", ""coordinates"": [[[5.321986...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
01009,01300,ANDERT-ET-CONDON,AIN,RHONE-ALPES,Commune simple,294.0,6.96,0.3,"45.7873565333, 5.65788307924","{""type"": ""Polygon"", ""coordinates"": [[[5.656393...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97610,97600,KOUNGOU,MAYOTTE,MAYOTTE,Chef-lieu canton,138.0,27.69,19.8,"-12.7465604467, 45.1869991913","{""type"": ""Polygon"", ""coordinates"": [[[45.23692...",...,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.000,0.000,0.000


The data-reading functions also provide options for handling missing values.
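
For instance, **read_csv()** accepts an *na_values* option listing extra strings to treat as missing. The sketch below reads a small in-memory CSV; the markers "n/d" and "-" are hypothetical and chosen only for illustration.

```python
import io
import pandas as pnd

# Hypothetical CSV content where missing populations are written "n/d" or "-"
raw = io.StringIO(
    "Code INSEE;Commune;Population\n"
    "01001;L'ABERGEMENT-CLEMENCIAT;0.8\n"
    "01002;L'ABERGEMENT-DE-VAREY;n/d\n"
    "01004;AMBERIEU-EN-BUGEY;-\n"
)

df = pnd.read_csv(
    raw,
    sep=";",
    dtype={"Code INSEE": str},   # keep the leading zeros of the codes
    na_values=["n/d", "-"],      # extra strings to read as NaN
)

print(df)
print(df["Population"].isna().sum())
```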

## 7.5 Techniques for removing duplicates

The **duplicated()** method finds the duplicated rows of a *DataFrame*, while the **drop_duplicates()** method removes them.
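
Before applying these methods to the real data, here is a minimal sketch on a hypothetical 4-row table (`toy` and its values are made up). It also shows the *keep* and *subset* options of **drop_duplicates()**, which this section does not otherwise cover.

```python
import pandas as pnd

# Hypothetical table: row 2 repeats row 0
toy = pnd.DataFrame(
    {"1 Mbit": [1.0, 0.676, 1.0, 1.0], "3 Mbit": [0.448, 0.594, 0.448, 0.966]}
)

mask = toy.duplicated()                              # True for repeats of an earlier row
unique_rows = toy.drop_duplicates()                  # keeps the first occurrence
keep_last = toy.drop_duplicates(keep="last")         # keeps the last occurrence instead
by_subset = toy.drop_duplicates(subset=["1 Mbit"])   # compares only the "1 Mbit" column

print(mask.tolist())
print(list(unique_rows.index), list(keep_last.index), list(by_subset.index))
```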

In [47]:
var = thd.reset_index().loc[:, "1 Mbit":"100 Mbit"]
var

Unnamed: 0,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit
0,1.000,0.448,0.052,0.000,0.000
1,0.676,0.594,0.571,0.571,0.571
2,1.000,0.966,0.794,0.234,0.000
3,1.000,0.990,0.942,0.667,0.004
4,1.000,1.000,1.000,0.934,0.934
5,0.973,0.779,0.772,0.713,0.713
6,1.000,0.557,0.013,0.000,0.000
7,1.000,0.616,0.584,0.000,0.000
...,...,...,...,...,...
36685,0.996,0.871,0.624,0.394,0.000


In [48]:
var.duplicated().value_counts()

False    26871
True      9822
dtype: int64

In [49]:
var.drop_duplicates()

Unnamed: 0,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit
0,1.000,0.448,0.052,0.000,0.000
1,0.676,0.594,0.571,0.571,0.571
2,1.000,0.966,0.794,0.234,0.000
3,1.000,0.990,0.942,0.667,0.004
4,1.000,1.000,1.000,0.934,0.934
5,0.973,0.779,0.772,0.713,0.713
6,1.000,0.557,0.013,0.000,0.000
7,1.000,0.616,0.584,0.000,0.000
...,...,...,...,...,...
36685,0.996,0.871,0.624,0.394,0.000
