# 8. The split-apply-combine method

The **groupby()** method groups the rows of a *DataFrame* that share the same values in one or more columns.

It returns a *DataFrameGroupBy* object, which can be thought of as a dictionary of *DataFrame* objects in which:
- the keys are the distinct values of the column (or columns) used to split the data,
- and the values are sub-DataFrames of the initial DataFrame.

**Methodology**:
- **split**: divide the data into subsets
- **apply**: apply a function to each group
- **combine**: combine the results

See the documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html
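
As a minimal sketch of the three steps listed above, the following example performs them by hand on a small made-up DataFrame (the column names `region` and `population` and all values are illustrative assumptions, not taken from the INSEE file) and compares the result with a single `groupby()` call.

```python
import pandas as pnd

# toy data (hypothetical values, for illustration only)
df = pnd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "population": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# split: one sub-DataFrame per distinct value of "region"
pieces = {key: sub for key, sub in df.groupby("region")}

# apply: compute a statistic on each piece
means = {key: sub["population"].mean() for key, sub in pieces.items()}

# combine: gather the per-group results into a single Series
by_hand = pnd.Series(means, name="population").rename_axis("region")

# groupby() chains the three steps in one expression
direct = df.groupby("region")["population"].mean()

print(by_hand)
print(direct)
```

Both Series should hold the same per-region means; the manual version is only meant to make the three steps visible.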

In [1]:
# import the usual modules
import pandas as pnd
import numpy as np
import matplotlib.pyplot as plt

# magic command to display plots inline
%matplotlib inline

# display options
pnd.set_option("display.max_rows", 16)
plt.style.use('seaborn-darkgrid')

In [2]:
# load the data
geo = pnd.read_csv("correspondance-code-insee-code-postal.csv",
                   sep=';',
                   usecols=range(11),
                   index_col="Code INSEE")
geo.sort_index(inplace=True)
geo.head()

Code INSEE,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape
1001,1400,L'ABERGEMENT-CLEMENCIAT,AIN,RHONE-ALPES,Commune simple,242.0,1565.0,0.8,"46.1534255214, 4.92611354223","{""type"": ""Polygon"", ""coordinates"": [[[4.926273..."
1002,1640,L'ABERGEMENT-DE-VAREY,AIN,RHONE-ALPES,Commune simple,483.0,912.0,0.2,"46.0091878776, 5.42801696363","{""type"": ""Polygon"", ""coordinates"": [[[5.430089..."
1004,1500,AMBERIEU-EN-BUGEY,AIN,RHONE-ALPES,Chef-lieu canton,379.0,2448.0,13.4,"45.9608475114, 5.3729257777","{""type"": ""Polygon"", ""coordinates"": [[[5.386190..."
1005,1330,AMBERIEUX-EN-DOMBES,AIN,RHONE-ALPES,Commune simple,290.0,1605.0,1.6,"45.9961799872, 4.91227250796","{""type"": ""Polygon"", ""coordinates"": [[[4.895580..."
1006,1300,AMBLEON,AIN,RHONE-ALPES,Commune simple,589.0,602.0,0.1,"45.7494989044, 5.59432017366","{""type"": ""Polygon"", ""coordinates"": [[[5.614854..."


In [3]:
# build a DataFrameGroupBy object
regions = geo.groupby("Région")
type(regions)

pandas.core.groupby.DataFrameGroupBy

A *DataFrameGroupBy* object can be seen as a dictionary in which:
- the keys are the values of the column used to split the data
- the values are sub-DataFrames of the initial *DataFrame* (in the outputs shown here, without the column that was used to split the data)

In [6]:
# access the dictionary of groups
type(regions.groups)

dict

In [7]:
# access the number of groups
regions.ngroups

27

In [9]:
geo["Région"].nunique()

27
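
The dictionary analogy can be sketched on a small made-up DataFrame (the values below are illustrative assumptions): the keys are listed with `groups`, the pairs are obtained by iterating over the grouped object, and `get_group()` plays the role of a lookup.

```python
import pandas as pnd

# toy data (hypothetical values, for illustration only)
toy = pnd.DataFrame({
    "Région": ["ALSACE", "CORSE", "ALSACE", "CORSE", "BRETAGNE"],
    "Population": [2.0, 0.1, 1.5, 64.3, 3.0],
})
grouped = toy.groupby("Région")

# keys of the "dictionary": the distinct values of the grouping column
keys = sorted(grouped.groups)

# iterating yields (key, sub-DataFrame) pairs, much like dict.items()
sizes = {key: len(sub) for key, sub in grouped}

# get_group() plays the role of a dictionary lookup
corse = grouped.get_group("CORSE")

print(keys)
print(sizes)
print(corse)
```

The number of keys matches both `ngroups` and `nunique()` on the grouping column, as in the cells above.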

The **size()** method returns the number of rows in each group.

In [10]:
# group sizes
regions.size()

Région
ALSACE                         904
AQUITAINE                     2296
AUVERGNE                      1310
BASSE-NORMANDIE               1812
BOURGOGNE                     2046
BRETAGNE                      1270
CENTRE                        1841
CHAMPAGNE-ARDENNE             1954
                              ... 
MIDI-PYRENEES                 3020
NORD-PAS-DE-CALAIS            1545
PAYS DE LA LOIRE              1502
PICARDIE                      2291
POITOU-CHARENTES              1462
PROVENCE-ALPES-COTE D'AZUR     978
REUNION                         24
RHONE-ALPES                   2887
Length: 27, dtype: int64

We check that the group sizes add up to the total number of communes.

In [11]:
# sum of the group sizes
regions.size().sum() == len(geo)

True
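
The equality above holds when every row has a value in the grouping column. As a hedged illustration on made-up data, the sketch below shows that rows with a missing key are left out of the groups by default, and that the `dropna=False` option (assuming pandas 1.1 or later) keeps them in a group of their own.

```python
import numpy as np
import pandas as pnd

# toy data with one missing region (hypothetical values)
toy = pnd.DataFrame({
    "Région": ["ALSACE", "ALSACE", np.nan, "CORSE"],
    "Population": [2.0, 1.5, 0.3, 64.3],
})

# default behaviour: the row with a missing key belongs to no group
sizes = toy.groupby("Région").size()

# dropna=False keeps a dedicated group for missing keys
sizes_with_nan = toy.groupby("Région", dropna=False).size()

print(sizes.sum(), len(toy))
print(sizes_with_nan.sum(), len(toy))
```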

The **describe()** method provides summary statistics, either for the whole *DataFrame* or for each subgroup.

In [12]:
geo.describe()

,Altitude Moyenne,Superficie,Population
count,36742.0,36742.0,36742.0
mean,277.81381,1735.786,1.755484
std,290.435871,14447.83,8.10913
min,0.0,2.0,0.0
25%,103.0,645.0,0.2
50%,185.0,1081.0,0.4
75%,334.0,1850.0,1.1
max,2713.0,1871833.0,440.2


In [13]:
# describe for each group
regions.describe()

,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Population,Population,Population,Population,Population,Superficie,Superficie,Superficie,Superficie,Superficie,Superficie,Superficie,Superficie
Région,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
ALSACE,904.0,302.655973,164.141776,110.0,186.0,253.0,361.00,928.0,904.0,2.040155,...,1.500,271.7,904.0,920.202434,957.163127,64.0,423.75,674.5,1142.00,18358.0
AQUITAINE,2296.0,138.384146,154.109311,2.0,57.0,103.0,175.00,1738.0,2296.0,1.396995,...,1.000,236.7,2296.0,1819.704704,2123.650800,2.0,741.50,1214.0,2063.00,24881.0
AUVERGNE,1310.0,637.366412,282.575966,204.0,376.5,613.0,871.75,1354.0,1310.0,1.026336,...,0.800,138.6,1310.0,1994.300763,1247.572770,101.0,1082.00,1747.5,2622.75,9324.0
BASSE-NORMANDIE,1812.0,119.595475,76.193013,4.0,52.0,114.0,176.25,344.0,1812.0,0.812472,...,0.700,109.3,1812.0,979.507174,684.901173,63.0,525.00,818.5,1224.50,7370.0
BOURGOGNE,2046.0,281.342131,102.654453,64.0,200.0,260.5,350.00,655.0,2046.0,0.802688,...,0.600,152.1,2046.0,1548.894917,1075.885575,60.0,776.25,1241.5,2047.50,8637.0
BRETAGNE,1270.0,88.930709,54.477878,1.0,51.0,79.0,113.00,268.0,1270.0,2.499921,...,2.400,206.6,1270.0,2160.540945,1609.143176,8.0,995.50,1757.5,2882.75,11779.0
CENTRE,1841.0,142.575231,50.122349,29.0,110.0,132.0,166.00,424.0,1841.0,1.378816,...,1.100,135.2,1841.0,2143.457360,1473.699429,100.0,1132.00,1777.0,2736.00,11813.0
CHAMPAGNE-ARDENNE,1954.0,198.787615,90.452667,64.0,129.0,174.0,251.75,471.0,1954.0,0.686131,...,0.400,180.8,1954.0,1315.348004,972.845389,58.0,691.25,1066.0,1645.50,10726.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
MIDI-PYRENEES,3020.0,410.063907,334.601047,56.0,198.0,292.0,497.25,2131.0,3020.0,0.947848,...,0.600,440.2,3020.0,1510.269868,1570.157061,24.0,609.75,1037.5,1813.00,17008.0


The **get_group()** method returns the sub-*DataFrame* corresponding to a given group.

In [16]:
# access a single group
regions.get_group("CORSE").head(7)

Code INSEE,Altitude Moyenne,Code Postal,Commune,Département,Population,Statut,Superficie,geo_point_2d,geo_shape
2A001,173.0,20167,AFA,CORSE-DU-SUD,2.8,Commune simple,1198.0,"41.9844089346, 8.79828936731","{""type"": ""Polygon"", ""coordinates"": [[[8.784625..."
2A004,136.0,20000,AJACCIO,CORSE-DU-SUD,64.3,Préfecture de région,8314.0,"41.9347926638, 8.70132275974","{""type"": ""Polygon"", ""coordinates"": [[[8.636808..."
2A006,244.0,20167,ALATA,CORSE-DU-SUD,3.0,Commune simple,3065.0,"41.9735186682, 8.73133206502","{""type"": ""Polygon"", ""coordinates"": [[[8.784316..."
2A008,488.0,20128,ALBITRECCIA,CORSE-DU-SUD,1.5,Commune simple,4599.0,"41.860474641, 8.87771179182","{""type"": ""Polygon"", ""coordinates"": [[[8.860380..."
2A011,728.0,20112,ALTAGENE,CORSE-DU-SUD,0.1,Commune simple,475.0,"41.7091975922, 9.07705100575","{""type"": ""Polygon"", ""coordinates"": [[[9.081959..."
2A014,168.0,20151,AMBIEGNA,CORSE-DU-SUD,0.1,Commune simple,621.0,"42.0878005416, 8.77383533017","{""type"": ""Polygon"", ""coordinates"": [[[8.786968..."
2A017,250.0,20167,APPIETTO,CORSE-DU-SUD,1.4,Commune simple,3459.0,"42.0032023283, 8.73355643393","{""type"": ""Polygon"", ""coordinates"": [[[8.788208..."


**Exercise**

Find the regional group with the smallest number of communes.

In [19]:
# idxmin() returns the label of the smallest value (argmin() returns a position in recent pandas)
geo.groupby("Région").size().idxmin()

'MAYOTTE'

The **aggregate()** method, or its alias **agg()**, aggregates the values of each group by applying one or more functions.

In [24]:
# sum of all numeric values within each group
regions.agg(np.sum)

Région,Altitude Moyenne,Superficie,Population
ALSACE,273601.0,831863.0,1844.3
AQUITAINE,317730.0,4178042.0,3207.5
AUVERGNE,834950.0,2612534.0,1344.5
BASSE-NORMANDIE,216707.0,1774867.0,1472.2
BOURGOGNE,575626.0,3169039.0,1642.3
BRETAGNE,112942.0,2743887.0,3174.9
CENTRE,262481.0,3946105.0,2538.4
CHAMPAGNE-ARDENNE,388431.0,2570190.0,1340.7
...,...,...,...
MIDI-PYRENEES,1238393.0,4561015.0,2862.5


In [25]:
# mean and standard deviation of all numeric values within each group
regions.agg([np.mean, np.std])

,Altitude Moyenne,Altitude Moyenne,Superficie,Superficie,Population,Population
Région,mean,std,mean,std,mean,std
ALSACE,302.655973,164.141776,920.202434,957.163127,2.040155,10.317211
AQUITAINE,138.384146,154.109311,1819.704704,2123.650800,1.396995,6.306459
AUVERGNE,637.366412,282.575966,1994.300763,1247.572770,1.026336,4.436420
BASSE-NORMANDIE,119.595475,76.193013,979.507174,684.901173,0.812472,3.173097
BOURGOGNE,281.342131,102.654453,1548.894917,1075.885575,0.802688,4.042821
BRETAGNE,88.930709,54.477878,2160.540945,1609.143176,2.499921,8.060555
CENTRE,142.575231,50.122349,2143.457360,1473.699429,1.378816,5.227198
CHAMPAGNE-ARDENNE,198.787615,90.452667,1315.348004,972.845389,0.686131,4.784049
...,...,...,...,...,...,...
MIDI-PYRENEES,410.063907,334.601047,1510.269868,1570.157061,0.947848,8.397871
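
The two aggregations above can be reproduced on a small made-up DataFrame. This sketch makes two assumptions that differ from the cells above: the functions are passed by name as strings (`"sum"`, `"mean"`, `"std"`) rather than as NumPy functions, and the numeric columns are selected explicitly so that text columns are never aggregated.

```python
import pandas as pnd

# toy data (hypothetical values, for illustration only)
toy = pnd.DataFrame({
    "Région": ["A", "A", "B", "B"],
    "Superficie": [10.0, 30.0, 5.0, 15.0],
    "Population": [1.0, 3.0, 2.0, 6.0],
})

# select the numeric columns explicitly before aggregating
grouped = toy.groupby("Région")[["Superficie", "Population"]]

# one function: one result column per input column
totals = grouped.agg("sum")

# several functions: the columns become a two-level MultiIndex
stats = grouped.agg(["mean", "std"])

print(totals)
print(stats)
```

A single statistic is then read with a tuple, for example `stats.loc["B", ("Population", "mean")]`.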


The **apply()** method applies the given function to each group of a *DataFrameGroupBy* object and combines the results into a new *DataFrame*.

In [30]:
# rank of each commune within its region, by population
def fct_rang(region):
    region = region.copy()  # work on a copy rather than mutating the group passed by apply()
    region["rang"] = region["Population"].rank(ascending=False)
    region["rang"] = region["rang"].astype(int)
    return region

df = regions.apply(fct_rang)
df.sort_values(["rang", "Population"], ascending=[True, False]).head(50)

Code INSEE,Altitude Moyenne,Code Postal,Commune,Département,Population,Statut,Superficie,geo_point_2d,geo_shape,rang
31555,148.0,31000/31100/31200/31300/31400/31500,TOULOUSE,HAUTE-GARONNE,440.2,Préfecture de région,11809.0,"43.5963814303, 1.43167293364","{""type"": ""Polygon"", ""coordinates"": [[[1.461403...",1
06088,115.0,06000/06100/06200/06300,NICE,ALPES-MARITIMES,340.7,Préfecture,7391.0,"43.7119992661, 7.23826889465","{""type"": ""Polygon"", ""coordinates"": [[[7.308745...",1
44109,18.0,44000/44100/44200/44300,NANTES,LOIRE-ATLANTIQUE,282.0,Préfecture de région,6577.0,"47.2316356767, -1.54831008605","{""type"": ""Polygon"", ""coordinates"": [[[-1.51819...",1
67482,139.0,67000/67100/67200,STRASBOURG,BAS-RHIN,271.7,Préfecture de région,7809.0,"48.5712679849, 7.76752679517","{""type"": ""Polygon"", ""coordinates"": [[[7.770759...",1
34172,45.0,34000/34070/34080/34090,MONTPELLIER,HERAULT,255.1,Préfecture de région,5711.0,"43.6134409138, 3.86851657896","{""type"": ""Polygon"", ""coordinates"": [[[3.860837...",1
33063,9.0,33000/33100/33200/33300/33800,BORDEAUX,GIRONDE,236.7,Préfecture de région,4970.0,"44.8572445351, -0.57369678116","{""type"": ""Polygon"", ""coordinates"": [[[-0.57387...",1
75115,40.0,75015,PARIS-15E-ARRONDISSEMENT,PARIS,236.5,Chef-lieu canton,846.0,"48.8401554186, 2.29355937244","{""type"": ""Polygon"", ""coordinates"": [[[2.301320...",1
59350,27.0,59000/59160/59260/59777/59800,LILLE,NORD,226.8,Préfecture de région,3499.0,"50.6317183168, 3.04783272312","{""type"": ""Polygon"", ""coordinates"": [[[3.054628...",1
...,...,...,...,...,...,...,...,...,...,...
82121,102.0,82000,MONTAUBAN,TARN-ET-GARONNE,56.1,Préfecture,13591.0,"44.0222594578, 1.36408636501","{""type"": ""Polygon"", ""coordinates"": [[[1.402326...",2
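
As an alternative sketch (not the approach used in the cell above), the same per-region ranking can be obtained without `apply()` by calling `rank()` directly on the grouped column. The data are made up, and `method="min"` is an assumption chosen so that tied communes share the same integer rank; with the default `method="average"`, ties receive fractional ranks that `astype(int)` then truncates.

```python
import pandas as pnd

# toy data with a tie in region A (hypothetical values)
toy = pnd.DataFrame({
    "Commune": ["a", "b", "c", "d", "e"],
    "Région": ["A", "A", "A", "B", "B"],
    "Population": [5.0, 9.0, 5.0, 2.0, 7.0],
})

# rank within each region, largest population first;
# method="min" gives tied communes the same (smallest) rank
toy["rang"] = (
    toy.groupby("Région")["Population"]
       .rank(ascending=False, method="min")
       .astype(int)
)

print(toy.sort_values(["Région", "rang"]))
```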


**Exercise**

Write a function that returns the top 10 most populated towns of a region.

Use this function to restrict the *DataFrame* geo to the towns that are in the top 10 of their region.

In [36]:
a = list(np.random.random(10))
a.sort()
a

[0.11612551674293803,
 0.19817033783572879,
 0.35843831767999446,
 0.36707114413824227,
 0.37129063102364024,
 0.72294560107558326,
 0.75151490807421639,
 0.79317397760350439,
 0.79669625057361582,
 0.81717161232002866]

In [32]:
def top3(region):
    return region.sort_values("Population", ascending=False).head(3)

geo.groupby("Région").apply(top3)

Région,Code INSEE,Code Postal,Commune,Département,Région,Statut,Altitude Moyenne,Superficie,Population,geo_point_2d,geo_shape
ALSACE,67482,67000/67100/67200,STRASBOURG,BAS-RHIN,ALSACE,Préfecture de région,139.0,7809.0,271.7,"48.5712679849, 7.76752679517","{""type"": ""Polygon"", ""coordinates"": [[[7.770759..."
ALSACE,68224,68100/68200,MULHOUSE,HAUT-RHIN,ALSACE,Sous-préfecture,250.0,2234.0,111.2,"47.749163303, 7.32570047509","{""type"": ""Polygon"", ""coordinates"": [[[7.344518..."
ALSACE,68066,68000,COLMAR,HAUT-RHIN,ALSACE,Préfecture,187.0,6645.0,67.2,"48.1099405789, 7.38468690323","{""type"": ""Polygon"", ""coordinates"": [[[7.352773..."
AQUITAINE,33063,33000/33100/33200/33300/33800,BORDEAUX,GIRONDE,AQUITAINE,Préfecture de région,9.0,4970.0,236.7,"44.8572445351, -0.57369678116","{""type"": ""Polygon"", ""coordinates"": [[[-0.57387..."
AQUITAINE,64445,64000,PAU,PYRENEES-ATLANTIQUES,AQUITAINE,Préfecture,211.0,3152.0,82.8,"43.3200189773, -0.350337918181","{""type"": ""Polygon"", ""coordinates"": [[[-0.37894..."
AQUITAINE,33281,33700,MERIGNAC,GIRONDE,AQUITAINE,Chef-lieu canton,42.0,4784.0,66.5,"44.8322953289, -0.681733084891","{""type"": ""Polygon"", ""coordinates"": [[[-0.61064..."
AUVERGNE,63113,63000/63100,CLERMONT-FERRAND,PUY-DE-DOME,AUVERGNE,Préfecture de région,370.0,4307.0,138.6,"45.7856492991, 3.11554542903","{""type"": ""Polygon"", ""coordinates"": [[[3.077364..."
AUVERGNE,03185,03100,MONTLUCON,ALLIER,AUVERGNE,Sous-préfecture,235.0,2071.0,39.0,"46.3385883496, 2.60390499777","{""type"": ""Polygon"", ""coordinates"": [[[2.573022..."
...,...,...,...,...,...,...,...,...,...,...,...
PROVENCE-ALPES-COTE D'AZUR,83137,83000/83100/83200,TOULON,VAR,PROVENCE-ALPES-COTE D'AZUR,Préfecture,126.0,4419.0,165.5,"43.1361589728, 5.93239634249","{""type"": ""Polygon"", ""coordinates"": [[[5.979060..."
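
The per-group selection can also be written without `apply()`. The sketch below uses a hypothetical helper `top_n` (its name and parameters are assumptions made for this example) that sorts once and then keeps the first `n` rows of each group with `GroupBy.head()`; the data are made up.

```python
import pandas as pnd

def top_n(frame, n, by="Population", key="Région"):
    """Hypothetical helper: keep the n rows with the largest `by` in each `key` group."""
    ordered = frame.sort_values(by, ascending=False)
    # GroupBy.head(n) keeps the first n rows of every group
    return ordered.groupby(key).head(n)

# toy data (hypothetical values, for illustration only)
toy = pnd.DataFrame({
    "Commune": ["a", "b", "c", "d", "e", "f"],
    "Région": ["A", "A", "A", "B", "B", "B"],
    "Population": [1.0, 5.0, 3.0, 8.0, 2.0, 4.0],
})

result = top_n(toy, 2)
print(result)
```

Unlike the `apply()` version, the result keeps the original flat index instead of adding the region as an extra index level.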


It is also possible to group by several columns.

In [37]:
# grouping by several columns
regions = geo.groupby(["Région", "Département"])
regions.describe()

,,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Altitude Moyenne,Population,Population,Population,Population,Population,Superficie,Superficie,Superficie,Superficie,Superficie,Superficie,Superficie,Superficie
Région,Département,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
ALSACE,BAS-RHIN,527.0,241.631879,121.802016,110.0,164.0,204.0,279.00,914.0,527.0,2.079127,...,1.5,271.7,527.0,909.495256,1085.812380,64.0,382.00,635.0,1140.50,18358.0
ALSACE,HAUT-RHIN,377.0,387.960212,177.299695,176.0,253.0,350.0,454.00,928.0,377.0,1.985676,...,1.6,111.2,377.0,935.169761,741.979765,121.0,478.00,694.0,1153.00,6645.0
AQUITAINE,DORDOGNE,557.0,158.527828,62.845546,9.0,115.0,156.0,196.00,385.0,557.0,0.739677,...,0.7,29.3,557.0,1653.100539,989.668334,53.0,946.00,1448.0,2079.00,8966.0
AQUITAINE,GIRONDE,542.0,46.981550,28.425920,2.0,21.0,45.0,66.00,139.0,542.0,2.647601,...,1.8,236.7,542.0,1872.142066,2789.684276,2.0,615.00,965.5,1798.50,22253.0
AQUITAINE,LANDES,331.0,70.404834,39.174348,5.0,36.5,68.0,96.00,187.0,331.0,1.147130,...,1.0,30.4,331.0,2825.066465,3059.243095,222.0,969.50,1690.0,3282.50,19282.0
AQUITAINE,LOT-ET-GARONNE,319.0,102.761755,39.865808,16.0,75.5,101.0,126.50,221.0,319.0,1.033856,...,0.8,33.9,319.0,1686.824451,1147.365547,395.0,962.00,1396.0,2048.50,8119.0
AQUITAINE,PYRENEES-ATLANTIQUES,547.0,270.349177,251.901880,12.0,138.5,211.0,300.00,1738.0,547.0,1.190128,...,0.7,82.8,547.0,1406.526508,1731.931279,99.0,580.50,870.0,1537.50,24881.0
AUVERGNE,ALLIER,320.0,331.328125,105.587393,204.0,260.0,305.0,379.25,915.0,320.0,1.072813,...,0.8,39.0,320.0,2301.234375,1212.785239,188.0,1311.00,2179.0,2913.50,7207.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
RHONE-ALPES,AIN,419.0,408.451074,242.373371,171.0,224.5,286.0,546.00,1198.0,419.0,1.405251,...,1.5,39.6,419.0,1377.706444,844.603068,53.0,767.50,1228.0,1795.00,5067.0
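
When grouping by several columns, the result is indexed by a MultiIndex with one level per grouping column. The following sketch, on made-up data, shows how one group is addressed with a tuple of keys and how `reset_index()` turns the index levels back into ordinary columns.

```python
import pandas as pnd

# toy data (hypothetical values, for illustration only)
toy = pnd.DataFrame({
    "Région": ["A", "A", "A", "B"],
    "Département": ["A1", "A1", "A2", "B1"],
    "Population": [1.0, 3.0, 5.0, 7.0],
})

# the result is indexed by a MultiIndex (Région, Département)
means = toy.groupby(["Région", "Département"])["Population"].mean()

# a single group is addressed with a tuple of keys
a1 = means.loc[("A", "A1")]

# reset_index() turns the index levels back into ordinary columns
flat = means.reset_index()

print(means)
print(a1)
print(flat)
```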


The **agg()** (or **aggregate()**) method can also be applied to one or more columns by passing a dictionary that maps column names to functions. The result is a *DataFrame* with one row per group and one column per dictionary key.

In [38]:
# a different summary function for each column
regions.agg({"Superficie": np.sum, "Population": np.sum, "Altitude Moyenne": np.mean})

Région,Département,Altitude Moyenne,Population,Superficie
ALSACE,BAS-RHIN,241.631879,1095.7,479304.0
ALSACE,HAUT-RHIN,387.960212,748.6,352559.0
AQUITAINE,DORDOGNE,158.527828,412.0,920777.0
AQUITAINE,GIRONDE,46.981550,1435.0,1014701.0
AQUITAINE,LANDES,70.404834,379.7,935097.0
AQUITAINE,LOT-ET-GARONNE,102.761755,329.8,538097.0
AQUITAINE,PYRENEES-ATLANTIQUES,270.349177,651.0,769370.0
AUVERGNE,ALLIER,331.328125,343.3,736395.0
...,...,...,...,...
RHONE-ALPES,AIN,408.451074,588.8,577259.0
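
The dictionary form can be sketched on made-up data as follows. The second call uses named aggregation, which assumes pandas 0.25 or later; the output column names `superficie_totale` and `population_moyenne` are hypothetical names chosen for this example.

```python
import pandas as pnd

# toy data (hypothetical values, for illustration only)
toy = pnd.DataFrame({
    "Région": ["A", "A", "A", "B"],
    "Département": ["A1", "A1", "A2", "B1"],
    "Superficie": [10.0, 30.0, 20.0, 5.0],
    "Population": [1.0, 3.0, 5.0, 7.0],
})
by_dept = toy.groupby(["Région", "Département"])

# dictionary: column name -> function applied to that column
summary = by_dept.agg({"Superficie": "sum", "Population": "mean"})

# named aggregation: output column = (input column, function)
named = by_dept.agg(
    superficie_totale=("Superficie", "sum"),
    population_moyenne=("Population", "mean"),
)

print(summary)
print(named)
```

Named aggregation is convenient when the same input column needs several statistics with explicit output names.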


Groups can also be built from numeric categories obtained with the NumPy function *digitize()*.

In [39]:
thd = pnd.read_excel("FranceTHD_Open_Data_Observatoire_Juin2015.xlsx",
                     sheet_name="Communes",  # spelled "sheetname" in old pandas releases
                     header=1,
                     index_col="Code INSEE",
                     names=["Département", "Commune",
                            "1 Mbit", "3 Mbit", "8 Mbit", "30 Mbit", "100 Mbit",
                            "DSL 1 Mbit", "DSL 3 Mbit", "DSL 8 Mbit", "DSL 30 Mbit", "DSL 100 Mbit",
                            "Câble 1 Mbit", "Câble 3 Mbit", "Câble 8 Mbit", "Câble 30 Mbit", "Câble 100 Mbit",
                            "Fibre 1 Mbit", "Fibre 3 Mbit", "Fibre 8 Mbit", "Fibre 30 Mbit", "Fibre 100 Mbit"])
thd.head()

Code INSEE,Département,Commune,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,DSL 1 Mbit,DSL 3 Mbit,DSL 8 Mbit,...,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
1001,AIN,L'Abergement-Clémenciat,1.0,0.448,0.052,0.0,0.0,1.0,0.448,0.052,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002,AIN,L'Abergement-de-Varey,0.676,0.594,0.571,0.571,0.571,0.124,0.024,0.0,...,0.0,0.0,0.0,0.0,0.0,0.571,0.571,0.571,0.571,0.571
1004,AIN,Ambérieu-en-Bugey,1.0,0.966,0.794,0.234,0.0,1.0,0.966,0.794,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1005,AIN,Ambérieux-en-Dombes,1.0,0.99,0.942,0.667,0.004,1.0,0.985,0.937,...,0.0,0.0,0.0,0.0,0.0,0.004,0.004,0.004,0.004,0.004
1006,AIN,Ambléon,1.0,1.0,1.0,0.934,0.934,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.934,0.934,0.934,0.934,0.934


In [40]:
# 5 equally spaced edges splitting [0, 1] into 4 segments
bins = np.linspace(0, 1, 5)
bins

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])

In [45]:
thd["1 Mbit"]

Code INSEE
01001    1.000
01002    0.676
01004    1.000
01005    1.000
01006    1.000
01007    0.973
01008    1.000
01009    1.000
         ...  
97612    0.996
97613    0.999
97614    1.000
97615    1.000
97616    0.991
97617    0.990
97701    1.000
97801    0.993
Name: 1 Mbit, Length: 36693, dtype: float64

In [44]:
# categorise the "1 Mbit" column according to the given partition
cat = np.digitize(thd["1 Mbit"], bins)
cat

array([5, 3, 5, ..., 4, 5, 4], dtype=int64)

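By default (*right=False*), each interval is closed on the left and open on the right, and the returned category is the index of the interval:
- 1: [0, 0.25)
- 2: [0.25, 0.5)
- 3: [0.5, 0.75)
- 4: [0.75, 1.0)
- 5: values greater than or equal to 1.0 (here, exactly 1.0)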
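
The intervals listed above can be checked on a few hand-picked values (chosen for illustration). The sketch also shows the effect of `right=True`, and, as an alternative that is not used in this notebook, the pandas function `cut()`, which returns labelled intervals instead of integer codes.

```python
import numpy as np
import pandas as pnd

bins = np.linspace(0, 1, 5)   # edges: 0, 0.25, 0.5, 0.75, 1
values = np.array([0.0, 0.25, 0.676, 0.99, 1.0])

# right=False (default): bins[i-1] <= x < bins[i]
left_closed = np.digitize(values, bins)

# right=True: bins[i-1] < x <= bins[i]
right_closed = np.digitize(values, bins, right=True)

# alternative: pandas cut() returns labelled intervals (closed on the right);
# include_lowest=True keeps the value 0.0 in the first interval
intervals = pnd.cut(values, bins, include_lowest=True)

print(left_closed)
print(right_closed)
print(intervals)
```

With `right=False`, the value 1.0 falls into a fifth category of its own, which is why the notebook obtains five groups from four segments.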

In [46]:
# group according to the partition and compute the mean of the values in each group
groups = thd.groupby(cat)
groups.mean()

,1 Mbit,3 Mbit,8 Mbit,30 Mbit,100 Mbit,DSL 1 Mbit,DSL 3 Mbit,DSL 8 Mbit,DSL 30 Mbit,DSL 100 Mbit,Câble 1 Mbit,Câble 3 Mbit,Câble 8 Mbit,Câble 30 Mbit,Câble 100 Mbit,Fibre 1 Mbit,Fibre 3 Mbit,Fibre 8 Mbit,Fibre 30 Mbit,Fibre 100 Mbit
1,0.05302,0.006919,0.003573,0.000606,0.000332,0.052688,0.006587,0.003241,0.000274,0,0.0,0.0,0.0,0.0,0.0,0.000332,0.000332,0.000332,0.000332,0.000332
2,0.384078,0.048213,0.020293,0.00335,0.000581,0.38371,0.047632,0.019713,0.002769,0,0.0,0.0,0.0,0.0,0.0,0.000581,0.000581,0.000581,0.000581,0.000581
3,0.647012,0.093624,0.043031,0.005817,0.003669,0.643564,0.08912,0.038526,0.002148,0,0.003728,0.003728,0.003728,0.002892,0.002892,0.000777,0.000777,0.000777,0.000777,0.000777
4,0.957697,0.4887,0.344005,0.097368,0.014134,0.956363,0.48085,0.333629,0.082615,0,0.008626,0.008626,0.008626,0.008539,0.006573,0.007655,0.007655,0.007655,0.007655,0.007655
5,1.0,0.711897,0.54091,0.169414,0.028516,0.999988,0.704713,0.526087,0.137616,0,0.027997,0.027997,0.027997,0.027997,0.019013,0.011172,0.011172,0.011172,0.011172,0.011172


The **quantile()** method computes quantiles of each numeric column within each group.

In [47]:
# first decile of each numeric column, by region
geo.groupby("Région").quantile(0.1)

Région,Altitude Moyenne,Population,Superficie
ALSACE,156.0,0.30,280.8
AQUITAINE,27.0,0.10,496.0
AUVERGNE,284.0,0.10,676.4
BASSE-NORMANDIE,23.1,0.10,365.1
BOURGOGNE,177.5,0.10,509.5
BRETAGNE,30.0,0.40,585.9
CENTRE,91.0,0.20,734.0
CHAMPAGNE-ARDENNE,104.0,0.10,459.0
...,...,...,...
MIDI-PYRENEES,148.0,0.10,365.9


In [48]:
# first and last deciles of each numeric column, by region
geo.groupby("Région").quantile([0.1, 0.9])

Région,,Altitude Moyenne,Population,Superficie
ALSACE,0.1,156.0,0.30,280.8
ALSACE,0.9,532.7,3.30,1845.5
AQUITAINE,0.1,27.0,0.10,496.0
AQUITAINE,0.9,258.0,2.30,3486.0
AUVERGNE,0.1,284.0,0.10,676.4
AUVERGNE,0.9,1031.0,1.81,3665.1
BASSE-NORMANDIE,0.1,23.1,0.10,365.1
BASSE-NORMANDIE,0.9,226.0,1.49,1748.4
...,...,...,...,...
POITOU-CHARENTES,0.1,18.0,0.20,628.0
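
As a small sketch on made-up data, the example below applies `quantile()` to a single grouped column, first with one quantile and then with a list; in the second case the quantile becomes an extra index level, as in the output above. The values rely on the default linear interpolation between data points.

```python
import pandas as pnd

# toy data (hypothetical values, for illustration only)
toy = pnd.DataFrame({
    "Région": ["A"] * 5 + ["B"] * 5,
    "Population": [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
grouped = toy.groupby("Région")["Population"]

# a single quantile: one value per group
median = grouped.quantile(0.5)

# several quantiles: the quantile is added as a second index level
deciles = grouped.quantile([0.1, 0.9])

print(median)
print(deciles)
```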


**Optional guided exercise: principal component analysis with StatsModels**

We will run a PCA on the regional means of the THD data.

1) Join the two datasets thd and geo on their Code INSEE index.

2) Group by region, compute the mean of the numeric values and keep only the first 20 columns.

3) Import the **pca()** function from the *statsmodels.sandbox.tools* module and look up in the documentation how to call it on the data to obtain factors in 2 dimensions.

4) Reuse the code from section 4.1.5 (map with annotations) to display the result of the PCA.