# Introduction to the Pandas library

DataFrame: a data structure comparable to a two-dimensional numpy.array, with extra features

Advantages of Pandas:
 - columns are not restricted to a single data type
 - simpler handling of NaN values
 - easy extraction of values and subsets
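
As a minimal sketch of these three points, the toy table below (invented values, not taken from the course files) mixes strings and floats, contains one missing value, and is filtered with a boolean condition.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "food": ["butter", "cola", "egg white"],
    "protein_g": [0.85, 0.0, np.nan],
})

# Each column keeps its own data type
print(toy.dtypes)

# NaN values are skipped by default in aggregations
mean_protein = toy["protein_g"].mean()

# A boolean condition extracts a subset of rows
no_protein = toy[toy["protein_g"] == 0.0]
print(mean_protein)
print(no_protein)
```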

## Basics

In [1]:
import pandas as pd

In [2]:
### Reading a CSV file
data = pd.read_csv('Data/food-info.csv')
type(data)

pandas.core.frame.DataFrame

In [None]:
### Displaying the data
data.head()
data.head(3)
data.tail()
data.tail(7)

In [None]:
### The columns and shape attributes
print(data.shape,'\n',data.columns)

In [None]:
### data types, available through dtypes:
# object (for strings or for variables holding mixed types)
# int, float, datetime, bool
print(data.dtypes)

## Selecting rows and columns

In [None]:
type(data.loc[0])

In [None]:
### selecting a single row
data.loc[0]

In [None]:
### selecting several rows
data.loc[3:6]

In [None]:
data.iloc[3:6,:]
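
The two slices above do not return the same number of rows. A small sketch on an invented ten-row table (not the course data) shows why: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position.

```python
import pandas as pd

# Toy table with the default integer index 0..9
toy = pd.DataFrame({"value": range(10)})

by_label = toy.loc[3:6]      # label-based: the end label 6 is included
by_position = toy.iloc[3:6]  # position-based: the end position 6 is excluded

print(len(by_label), len(by_position))
```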

In [None]:
data.loc[[2,10,5]]

In [None]:
data.loc[list(range(2,5))+list(range(9,11))]

In [None]:
### selecting one column
data['NDB_No']

In [None]:
### selecting one column with .loc
data.loc[:,"NDB_No"]

In [None]:
### selecting one column with .iloc[]
data.iloc[:,0]

In [None]:
### selecting several columns
data[["Zinc_(mg)","NDB_No"]].head()

In [None]:
### selecting several columns with loc
data.loc[:,["NDB_No","Calcium_(mg)","Energ_Kcal"]].head()

In [None]:
### selecting several columns with iloc
data.iloc[:,[10,0,3]+list(range(2,5))].head()

### Exercise: display the first 3 rows of the columns measured in grams

Looking at the column names, these columns can be identified by their "(g)" suffix. All that remains is to...

Two possible solutions are given below, and there are others. **Above all, try it yourself first!**

In [16]:
### A first solution
data.columns

grams=[]
for col in data.columns:
    if col[-3:] == "(g)":  # alternative: if "(g)" in col:
        grams.append(col)
data[grams].head(3)

Unnamed: 0,Water_(g),Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g)
0,15.87,0.85,81.11,2.11,0.06,0.0,0.06,51.368,21.021,3.043
1,15.87,0.85,81.11,2.11,0.06,0.0,0.06,50.489,23.426,3.012
2,0.24,0.28,99.48,0.0,0.0,0.0,0.0,61.924,28.732,3.694


In [18]:
### A second one, using the endswith() method and a boolean list to select the columns with .loc
data.loc[:2,[c.endswith("_(g)") for c in data.columns]]

Unnamed: 0,Water_(g),Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g)
0,15.87,0.85,81.11,2.11,0.06,0.0,0.06,51.368,21.021,3.043
1,15.87,0.85,81.11,2.11,0.06,0.0,0.06,50.489,23.426,3.012
2,0.24,0.28,99.48,0.0,0.0,0.0,0.0,61.924,28.732,3.694
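
A third variant, sketched here on an invented three-column table rather than on the course file, builds the boolean mask with the vectorised `.str` accessor of the column index instead of a Python loop or list comprehension.

```python
import pandas as pd

# Toy data invented for illustration; column names mimic the course file
toy = pd.DataFrame({
    "Water_(g)": [15.87, 0.24],
    "Iron_(mg)": [0.02, 0.0],
    "Protein_(g)": [0.85, 0.28],
})

mask = toy.columns.str.endswith("_(g)")  # one boolean per column
grams_only = toy.loc[:, mask]
print(grams_only)
```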


## Data manipulation with Pandas

### Transforming a column

A column can easily be transformed by adding, subtracting, multiplying or dividing all of its rows by the same value. The two examples below convert values from milligrams to grams and vice versa.

In [20]:
### Converting mg to g
iron_g = data["Iron_(mg)"]/1000

### Converting g to mg
protein_mg = data["Protein_(g)"]*1000

print(iron_g,protein_mg)

0       0.00002
1       0.00016
2       0.00000
3       0.00031
4       0.00043
5       0.00050
6       0.00033
7       0.00064
8       0.00016
9       0.00021
10      0.00076
11      0.00007
12      0.00016
13      0.00015
14      0.00013
15      0.00014
16      0.00038
17      0.00044
18      0.00065
19      0.00023
20      0.00052
21      0.00024
22      0.00017
23      0.00013
24      0.00072
25      0.00044
26      0.00020
27      0.00022
28      0.00023
29      0.00041
         ...   
8588    0.00900
8589    0.00030
8590    0.00010
8591    0.00163
8592    0.03482
8593    0.00228
8594    0.00017
8595    0.00017
8596    0.00486
8597    0.00025
8598    0.00023
8599    0.00013
8600    0.00011
8601    0.00068
8602    0.00783
8603    0.00311
8604    0.00030
8605    0.00018
8606    0.00080
8607    0.00004
8608    0.00387
8609    0.00005
8610    0.00038
8611    0.00520
8612    0.00150
8613    0.00140
8614    0.00058
8615    0.00360
8616    0.00350
8617    0.00140
Name: Iron_(mg), Length:

### Creating and deleting a column

To create a new column, simply assign a Series object to a new column name in the dataframe. Assigning to an existing column name replaces all of its values with the new Series.

In [None]:
### The new column is added at the end of the dataframe
data["Iron_(g)"] = data["Iron_(mg)"]/1000
data.head(1)

To delete a column, you can use del or drop

In [None]:
### With del
del data["Iron_(g)"]
data.head(1)

In [None]:
### With drop
data["Iron_(g)"] = data["Iron_(mg)"]/1000 # recreate the column that will be dropped
data.drop(["Iron_(g)"], axis='columns', inplace=True)
data.head(1)

# WARNING: with inplace=True, the dataframe itself is modified and the dropped columns are lost...
# as a precaution, the default is inplace=False, in which case drop returns a new dataframe
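
To make the default behaviour concrete, here is a small sketch on an invented two-column table: without `inplace=True`, `drop` leaves the original dataframe untouched and returns a new one.

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({"Iron_(mg)": [0.02, 0.16], "Iron_(g)": [0.00002, 0.00016]})

# Default inplace=False: `toy` is unchanged, the result is a new DataFrame
trimmed = toy.drop(columns=["Iron_(g)"])

print(list(toy.columns))
print(list(trimmed.columns))
```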

### Operations between columns

In [None]:
### Adding up the amounts of iron and calcium contained in a food
fer_calcium = data["Iron_(mg)"] + data["Calcium_(mg)"]
fer_calcium.head()

In [None]:
### Columns can also be multiplied (or divided)
water_energy = data["Water_(g)"]*data["Energ_Kcal"]
water_energy[:5]

### Sorting a dataframe

In [26]:
### use the sort_values() method
data.sort_values("Protein_(g)")
# Be sure to read the documentation, in particular the inplace= and ascending= parameters

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg),Score
750,4656,OIL INDUSTRIAL PALM KERNEL CONFECTION FAT,0.05,884,0.00,100.00,0.01,0.00,0.0,0.00,...,0.0,3.81,,,24.7,87.558,5.406,0.834,0.0,-0.750000
4257,14148,CARBONATED BEV COLA,89.62,41,0.00,0.00,0.06,10.58,0.0,10.75,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,0.000000
4187,14051,ALCOHOLIC BEV DISTILLED VODKA 80 PROOF,66.60,231,0.00,0.00,0.00,0.00,0.0,0.00,...,0.0,,,,,0.000,0.000,0.000,0.0,0.000000
4186,14050,ALCOHOLIC BEV DISTILLED RUM 80 PROOF,66.60,231,0.00,0.00,0.00,0.00,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,0.000000
4185,14049,ALCOHOLIC BEV DISTILLED GIN 90 PROOF,62.10,263,0.00,0.00,0.00,0.00,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,0.000000
4180,14042,BEVERAGES FORT LO CAL FRUIT JUC BEV,97.21,4,0.00,0.00,2.09,0.70,0.0,0.63,...,20.0,1.14,0.0,0.0,0.1,0.000,0.000,0.000,0.0,0.000000
4178,14038,BEVERAGES OCEAN SPRAY CRAN-ENERGY CRANBERRY EN...,96.18,15,0.00,0.00,0.11,3.75,0.0,3.75,...,0.0,0.00,0.0,0.0,0.1,0.000,0.000,0.000,0.0,0.000000
4207,14074,BEVERAGES ZEVIA COLA CAFFEINE FREE,98.87,0,0.00,0.00,0.01,1.13,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,0.000000
4398,14545,TEA HERB CHAMOMILE BREWED,99.70,1,0.00,0.00,0.00,0.20,0.0,0.00,...,1.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,0.000000
4346,14381,TEA HERB OTHER THAN CHAMOMILE BREWED,99.70,1,0.00,0.00,0.00,0.20,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.002,0.001,0.005,0.0,0.000000


In [24]:
help(data.sort_values)

Help on method sort_values in module pandas.core.frame:

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last') method of pandas.core.frame.DataFrame instance
    Sort by the values along either axis
    
    Parameters
    ----------
            by : str or list of str
                Name or list of names to sort by.
    
                - if `axis` is 0 or `'index'` then `by` may contain index
                  levels and/or column labels
                - if `axis` is 1 or `'columns'` then `by` may contain column
                  levels and/or index labels
    
                .. versionchanged:: 0.23.0
                   Allow specifying index or column level names.
    axis : {0 or 'index', 1 or 'columns'}, default 0
         Axis to be sorted
    ascending : bool or list of bool, default True
         Sort ascending vs. descending. Specify list for multiple sort
         orders.  If this is a list of bools, must match the length of
        

### Exercise: a top-level judoka wants to adapt his diet to his training programme and asks you to identify foods that are both high in protein and low in lipids. His coach then contacts you to stress that the most important thing is to get as much protein as possible!

Here are a few hints for this exercise. You can do without them, but if you do not know where to start, these few lines should help.

Since we want to look at protein and lipid quantities jointly, the simplest approach is to compute a score involving the variables Protein_(g) and Lipid_Tot_(g). The idea is that the higher the score, the better the food meets our criteria (lots of protein, few lipids).

You will therefore compute a score of the form __*Score = a × Protein + b × Lipid*, where a and b are constants to be chosen__ (arbitrarily, admittedly, but still with a justification!).

Questions to ask yourself:
 - First, what should the signs of a and b be to address our problem?
 - Which coefficient should be the larger in absolute value?
 - Next, what are the min and max of the variables Protein_(g) and Lipid_Tot_(g)?
 - Consequently, how do you handle the fact that these two variables do not vary on the same scale at all?

Your turn! At first, try not to look at the solution proposed below...

In [6]:
# score = 2 * protein - 0.5 * lipid
# a > 0 because the score should increase with the amount of protein
# b < 0 because the score should decrease with the amount of lipid
# |a| > |b| because, according to the coach, the amount of protein matters most
# both variables are min-max normalised so that each takes values between 0 and 1
max_p = max(data["Protein_(g)"])
min_p = min(data["Protein_(g)"])
max_l = max(data["Lipid_Tot_(g)"])
min_l = min(data["Lipid_Tot_(g)"])

prot_norm = (data["Protein_(g)"]-min_p)/(max_p-min_p)
lip_norm = (data["Lipid_Tot_(g)"]-min_l)/(max_l-min_l)

data["score"] = 2 * prot_norm - 0.5 *  lip_norm

data.sort_values("score", ascending = False)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg),score
4991,16423,SOY PROT ISOLATE K TYPE CRUDE PROT BASIS,4.98,321,88.32,0.53,3.58,2.59,2.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.066,0.101,0.258,0.0,1.997350
6155,19177,GELATINS DRY PDR UNSWTND,13.00,335,85.60,0.10,1.30,0.00,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.070,0.060,0.010,0.0,1.937906
216,1258,EGG WHITE DRIED STABILIZED GLUCOSE RED,6.53,362,84.63,0.48,3.63,4.72,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.147,0.173,0.070,20.0,1.914040
124,1136,EGG WHITE DRIED PDR STABILIZED GLUCOSE RED,8.54,376,82.40,0.04,4.55,4.47,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,1.865742
8152,35055,SEAL BEARDED (OOGRUK) MEAT DRIED (ALASKA NATIVE),11.60,351,82.60,2.30,3.50,0.00,0.0,0.00,...,393.0,,,,,0.600,1.330,0.370,,1.858971
151,1173,EGG WHITE DRIED,5.80,382,81.10,0.00,5.30,7.80,0.0,5.40,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,1.836504
4990,16422,SOY PROT ISOLATE K TYPE,4.98,326,80.69,0.53,3.58,10.22,5.6,0.00,...,0.0,0.00,0.0,0.0,0.0,0.077,0.117,0.299,0.0,1.824569
4833,16122,SOY PROTEIN ISOLATE,4.98,338,80.69,3.39,3.58,7.36,5.6,0.00,...,0.0,0.00,0.0,0.0,0.0,0.422,0.645,1.648,0.0,1.810269
4200,14066,BEVERAGES PROT PDR WHEY BSD,3.44,352,78.13,1.56,10.55,6.25,3.1,0.00,...,0.0,0.00,0.0,0.0,0.0,0.781,0.158,0.299,16.0,1.761448
123,1135,EGG WHITE DRIED FLAKES STABILIZED GLUCOSE RED,14.62,351,76.92,0.04,4.25,4.17,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,1.741648


In [7]:
# Fred's two-key sort
data.sort_values(["Protein_(g)","Lipid_Tot_(g)"], ascending=[False,True])

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg),score
4991,16423,SOY PROT ISOLATE K TYPE CRUDE PROT BASIS,4.98,321,88.32,0.53,3.58,2.59,2.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.066,0.101,0.258,0.0,1.997350
6155,19177,GELATINS DRY PDR UNSWTND,13.00,335,85.60,0.10,1.30,0.00,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.070,0.060,0.010,0.0,1.937906
216,1258,EGG WHITE DRIED STABILIZED GLUCOSE RED,6.53,362,84.63,0.48,3.63,4.72,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.147,0.173,0.070,20.0,1.914040
8152,35055,SEAL BEARDED (OOGRUK) MEAT DRIED (ALASKA NATIVE),11.60,351,82.60,2.30,3.50,0.00,0.0,0.00,...,393.0,,,,,0.600,1.330,0.370,,1.858971
124,1136,EGG WHITE DRIED PDR STABILIZED GLUCOSE RED,8.54,376,82.40,0.04,4.55,4.47,0.0,0.00,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,1.865742
151,1173,EGG WHITE DRIED,5.80,382,81.10,0.00,5.30,7.80,0.0,5.40,...,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0,1.836504
4990,16422,SOY PROT ISOLATE K TYPE,4.98,326,80.69,0.53,3.58,10.22,5.6,0.00,...,0.0,0.00,0.0,0.0,0.0,0.077,0.117,0.299,0.0,1.824569
4833,16122,SOY PROTEIN ISOLATE,4.98,338,80.69,3.39,3.58,7.36,5.6,0.00,...,0.0,0.00,0.0,0.0,0.0,0.422,0.645,1.648,0.0,1.810269
4200,14066,BEVERAGES PROT PDR WHEY BSD,3.44,352,78.13,1.56,10.55,6.25,3.1,0.00,...,0.0,0.00,0.0,0.0,0.0,0.781,0.158,0.299,16.0,1.761448
8234,35180,STEELHEAD TROUT DRIED FLESH (SHOSHONE BANNOCK),6.49,382,77.27,8.06,10.58,0.00,0.0,0.00,...,64.0,2.41,15.7,628.0,0.0,0.829,1.228,1.739,227.0,1.709474


In [14]:
data2 = data[["Protein_(g)","Lipid_Tot_(g)","score"]][data["score"]<1.1]
data2.sort_values("score", ascending=False)

Unnamed: 0,Protein_(g),Lipid_Tot_(g),score
3645,49.10,4.77,1.088016
4985,49.81,8.90,1.083444
3662,48.06,1.61,1.080265
4628,47.68,0.80,1.075710
191,50.00,12.00,1.072246
4175,47.62,3.57,1.060501
4828,47.01,1.22,1.058438
4830,44.95,2.39,1.005939
4829,45.51,8.90,0.986071
190,45.71,17.14,0.949400


In [15]:
data2.sort_values(["Protein_(g)","Lipid_Tot_(g)"], ascending=[False,True])

Unnamed: 0,Protein_(g),Lipid_Tot_(g),score
191,50.00,12.00,1.072246
4985,49.81,8.90,1.083444
3645,49.10,4.77,1.088016
121,48.37,43.04,0.880135
122,48.17,43.95,0.871056
3662,48.06,1.61,1.080265
4839,47.94,30.34,0.933898
4996,47.94,30.34,0.933898
4628,47.68,0.80,1.075710
4175,47.62,3.57,1.060501
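
The two listings above rank the same rows differently. A minimal sketch on three invented foods (values chosen only to make the effect visible) illustrates why: the weighted score trades a little protein for far fewer lipids, whereas the two-key sort only looks at lipids to break exact ties on protein.

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "food": ["A", "B", "C"],
    "protein": [50.0, 49.0, 10.0],
    "lipid": [100.0, 5.0, 0.0],
})

# Min-max normalisation puts both variables on a 0..1 scale
prot_norm = (toy["protein"] - toy["protein"].min()) / (toy["protein"].max() - toy["protein"].min())
lip_norm = (toy["lipid"] - toy["lipid"].min()) / (toy["lipid"].max() - toy["lipid"].min())
toy["score"] = 2 * prot_norm - 0.5 * lip_norm

# Ranking by the combined score
by_score = toy.sort_values("score", ascending=False)

# Ranking by protein first, lipids only as a tie-breaker
by_keys = toy.sort_values(["protein", "lipid"], ascending=[False, True])

print(list(by_score["food"]))
print(list(by_keys["food"]))
```

Food B has slightly less protein than A but far fewer lipids, so it comes first by score and second in the two-key sort.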


## Handling missing values

### The data

In [16]:
titanic = pd.read_csv("Data/titanic-survival.csv")
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [17]:
### The describe() method returns a few summary statistics for each numeric variable of the dataframe
titanic.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


### Identifying missing values

In [18]:
### the isnull() method
titanic["age"].isnull()

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15       True
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1280    False
1281    False
1282     True
1283     True
1284     True
1285    False
1286    False
1287    False
1288    False
1289    False
1290    False
1291     True
1292     True
1293     True
1294    False
1295    False
1296    False
1297     True
1298    False
1299    False
1300    False
1301    False
1302     True
1303     True
1304    False
1305     True
1306    False
1307    False
1308    False
1309     True
Name: age, Length: 1310, dtype: bool

How can we get the number of missing values?

In [None]:
### Number of missing values: two possible solutions
tot_nan1 = len(titanic.loc[titanic["age"].isnull(),"age"])
tot_nan2 = titanic["age"].isnull().sum()

print(tot_nan1, tot_nan2)

In [19]:
titanic.isnull().head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,False,False,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False,True,True,False
3,False,False,False,False,False,False,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,False,False,False,True,True,False


### The problem with missing values: computing the mean

In [None]:
### Computing the mean age by hand, without the built-in mean() method
age_mean =  sum(titanic["age"])/len(titanic["age"])
age_mean

In [None]:
age_mean_wo_null = sum(titanic.loc[~titanic["age"].isnull(),"age"])/len(titanic.loc[~titanic["age"].isnull(),"age"])
age_mean_wo_null

In [None]:
### A much simpler way: the built-in Series.mean() method
titanic["age"].mean()

In [None]:
titanic["fare"].mean()
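
The difference between these approaches can be reproduced on an invented three-value Series (not the Titanic data): Python's built-in `sum` propagates NaN, while `Series.mean()` skips missing values by default.

```python
import math

import numpy as np
import pandas as pd

# Toy ages invented for illustration, with one missing value
ages = pd.Series([20.0, np.nan, 40.0])

naive_mean = sum(ages) / len(ages)               # NaN propagates through the built-in sum
manual_mean = sum(ages.dropna()) / ages.count()  # count() only counts non-missing values
pandas_mean = ages.mean()                        # skipna=True by default

print(math.isnan(naive_mean), manual_mean, pandas_mean)
```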

### Exercise: compute the mean fare per class. Return the results as a dictionary

### Removing missing values

In [None]:
### The .dropna() method
titanic.dropna(axis = 1)

In [None]:
titanic.dropna(axis = 0)

In [None]:
titanic.dropna(axis = 0, subset = ["boat"])

In [None]:
### To keep only the rows for which both age and sex are known
titanic.dropna(axis = 0, subset = ["age","sex"])

In [None]:
### This can be checked with the shape attribute
titanic.dropna(axis = 0, subset = ["age","sex"]).shape
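
To see what each `dropna` call keeps, here is a sketch on an invented three-row table whose column names mimic the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "age": [29.0, np.nan, 30.0],
    "sex": ["female", "male", "male"],
    "boat": ["2", np.nan, np.nan],
})

complete_columns = toy.dropna(axis=1)           # keeps only the columns without any NaN
complete_rows = toy.dropna(axis=0)              # keeps only the rows without any NaN
known_age = toy.dropna(axis=0, subset=["age"])  # only the "age" column is inspected

print(list(complete_columns.columns))
print(list(complete_rows.index))
print(list(known_age.index))
```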

## Pivot tables

For those who know them, this is the equivalent of pivot tables in Excel.

In [None]:
### The DataFrame.pivot_table() method: read the documentation!
titanic.pivot_table(index = "pclass", values = "fare")

In [None]:
titanic.pivot_table(index = "pclass", values = "fare", aggfunc = "sum")

In [None]:
titanic.pivot_table(index = "pclass", values = ["age","fare","survived"])

In [None]:
titanic.pivot_table(index="embarked", values = ["fare", "survived"], aggfunc = "sum")
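
As a sketch of what these calls compute, the invented four-row table below shows that the default aggregation is the mean, and that the same figures can be obtained with `groupby`.

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3],
    "fare": [100.0, 200.0, 10.0, 30.0],
})

mean_fare = toy.pivot_table(index="pclass", values="fare")  # default aggfunc is the mean
total_fare = toy.pivot_table(index="pclass", values="fare", aggfunc="sum")
same_mean = toy.groupby("pclass")["fare"].mean()            # equivalent groupby form

print(mean_fare)
print(total_fare)
```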

## Sorting and reindexing

In [None]:
new_titanic = titanic.sort_values("age", ascending=False)
new_titanic

We have already introduced the loc and iloc indexers without really dwelling on them. Now is a good time to do so.

What is the difference between the two? Look at what new_titanic.loc[9,:] and new_titanic.iloc[9,:] return to see more clearly...

In [None]:
new_titanic.loc[9,:]

In [None]:
new_titanic.iloc[9,:]

This shows why it is useful to reindex the rows of a new dataframe built from an existing one: the row labels no longer match the row positions...


In [None]:
### the .reset_index() method
new_titanic.reset_index()

### see the drop option to avoid getting a new column holding the old index
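
The whole sequence can be sketched on an invented three-row table: after sorting, labels and positions disagree, and `reset_index(drop=True)` restores a clean 0..n-1 index without keeping the old one as a column.

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({"age": [30, 80, 2]})
sorted_toy = toy.sort_values("age", ascending=False)  # index is now 1, 0, 2

label_value = sorted_toy.loc[0, "age"]   # row whose label is 0
position_value = sorted_toy.iloc[0, 0]   # row in first position

clean = sorted_toy.reset_index(drop=True)  # fresh index, old labels discarded

print(label_value, position_value)
print(clean)
```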

## Applying a function to a DataFrame

In [None]:
### the .apply() method

In [None]:
### define a function that returns the 100th element of a column
def obs100(col):
    return col.iloc[99]

In [None]:
### Apply this function to the dataframe
titanic.apply(obs100, axis=0)

Now try the same thing with __axis = 1__. What happens? Why?

In [None]:
### define a new function
def from_france(row):
    if row["embarked"] == "C":
        return "departed from France"
    else:
        return "departed from elsewhere"

titanic.apply(from_france, axis=1)
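
To check what `apply` hands to the function in each case, here is a sketch on an invented 3×2 table, using `len` as the applied function: with `axis=0` the function receives each column as a Series, with `axis=1` it receives each row.

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: one call per column, each column has 3 elements
col_sizes = toy.apply(len, axis=0)

# axis=1: one call per row, each row has 2 elements
row_sizes = toy.apply(len, axis=1)

print(col_sizes)
print(row_sizes)
```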

### Exercise: use the apply method to determine the number of missing values in each column

A hint: you first need to define a function

### Exercise: use the apply method to create a new variable age_label in the dataframe, containing 'minor' if the person is under 18, 'adult' if their age is 18 or more, and 'unknown' otherwise

Hints: proceed step by step! First define a function, then apply it to the dataframe and store the result in a new column

### Exercise: compute the survival rate as a percentage for each age group (adult/minor)

Hint: the previous exercise has done almost all the work...