# Handling missing values

## Introduction to the dataset

In [1]:
import pandas as pd

titanic_survival = pd.read_csv("titanic_survival.csv")
titanic_survival.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


## Finding missing values

In [2]:
# pandas.isnull()
# Takes a Series object (that is, a DataFrame column) as its argument
# and returns True for each value that is missing and False otherwise

In [3]:
sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)
print(sex_is_null)

0       False
1       False
2       False
3       False
4       False
        ...  
1305    False
1306    False
1307    False
1308    False
1309     True
Name: sex, Length: 1310, dtype: bool


In [4]:
sex_null = sex[sex_is_null]
print(sex_null)

1309    NaN
Name: sex, dtype: object


* Count the number of missing values in the "age" column:
    * Assign the age column of the titanic_survival dataframe to the variable age.
    * Use pandas.isnull() on the variable age to create a Series of True and False values.
    * Use the resulting Series to select only the null elements of the "age" column and assign the result to the variable age_null.
    * Assign the number of missing values in age_null to the variable age_null_count (using the len() function).
* Display age_null_count to see the number of missing values in the "age" column.

In [5]:
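age = titanic_survival["age"]
# Boolean Series: True where the age is missing
age_is_null = pd.isnull(age)
# Keep only the missing entries of the "age" column
age_null = age[age_is_null]
age_null_count = len(age_null)
age_null_count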

264
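
The count above can also be obtained without building the filtered Series. The sketch below is a minimal illustration on a small hypothetical `age` Series (not the real dataset): because `True` counts as 1, summing the boolean mask gives the number of missing values directly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "age" column of titanic_survival
age = pd.Series([29.0, np.nan, 2.0, np.nan, 25.0], name="age")

# Boolean mask: True where the value is missing
age_is_null = pd.isnull(age)

# Approach used in the exercise: filter with the mask, then count the rows
age_null_count = len(age[age_is_null])

# Shortcut: True counts as 1, so summing the mask gives the same number
age_null_count_short = int(age_is_null.sum())

print(age_null_count, age_null_count_short)
```

Both approaches should agree; the shortcut simply avoids the intermediate selection.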

In [6]:
for col in titanic_survival.columns:
    
    print(titanic_survival[col].isnull().value_counts())
    print("-------------")

False    1309
True        1
Name: pclass, dtype: int64
-------------
False    1309
True        1
Name: survived, dtype: int64
-------------
False    1309
True        1
Name: name, dtype: int64
-------------
False    1309
True        1
Name: sex, dtype: int64
-------------
False    1046
True      264
Name: age, dtype: int64
-------------
False    1309
True        1
Name: sibsp, dtype: int64
-------------
False    1309
True        1
Name: parch, dtype: int64
-------------
False    1309
True        1
Name: ticket, dtype: int64
-------------
False    1308
True        2
Name: fare, dtype: int64
-------------
True     1015
False     295
Name: cabin, dtype: int64
-------------
False    1307
True        3
Name: embarked, dtype: int64
-------------
True     824
False    486
Name: boat, dtype: int64
-------------
True     1189
False     121
Name: body, dtype: int64
-------------
False    745
True     565
Name: home.dest, dtype: int64
-------------
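
The loop above prints one `value_counts()` result per column. As a more compact alternative, the sketch below (on a small hypothetical dataframe, not the real dataset) calls `DataFrame.isnull()` on the whole dataframe and sums each column, which yields a single Series of missing-value counts.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of titanic_survival
df = pd.DataFrame({
    "sex": ["female", "male", None],
    "age": [29.0, np.nan, np.nan],
    "fare": [211.34, 151.55, 7.25],
})

# isnull() returns a dataframe of booleans; sum() adds them up column by column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```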


## The problem with missing values

In [7]:
mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])
print(mean_age)
# a calculation that includes a missing value returns a missing value

nan


* Use age_is_null to create a vector containing only the values of the "age" column that are not NaN (that is, those for which age_is_null is False).
* Assign the result to the variable good_ages.
* Compute the mean of this new vector and assign the result to the variable mean_age.
* Display this mean.

In [8]:
age_is_null = pd.isnull(titanic_survival["age"])
# ~ inverts the mask, keeping the rows for which age_is_null is False
good_ages = titanic_survival["age"][~age_is_null]
mean_age = sum(good_ages) / len(good_ages)
print(mean_age)

29.8811345124283


## Computing a mean more simply

In [9]:
# Series.mean()
# Computes the mean of a column, ignoring missing values.

In [10]:
mean_age = titanic_survival["age"].mean()
print(mean_age)

29.8811345124283
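
To make the difference between the two approaches concrete, here is a minimal sketch on a hypothetical three-value Series. It contrasts Python's built-in `sum()`, which lets NaN propagate, with `Series.mean()`, whose `skipna` parameter defaults to `True`.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical ages, one of them missing
ages = pd.Series([29.0, np.nan, 2.0])

# Built-in sum() adds the values one by one, so NaN propagates to the result
builtin_mean = sum(ages) / len(ages)

# Series.mean() skips NaN by default (skipna=True): (29 + 2) / 2
pandas_mean = ages.mean()

# With skipna=False the missing value propagates, as with the built-in sum
pandas_mean_strict = ages.mean(skipna=False)

print(builtin_mean, pandas_mean, pandas_mean_strict)
```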


* Assign the mean of the "fare" column to the variable mean_fare.
* Display the result.

In [11]:
mean_fare = titanic_survival["fare"].mean()
print(mean_fare)

33.29547928134572


## Computing fare statistics

* Create an empty dictionary named fares_by_class.
* Create the list passenger_classes containing the elements [1, 2, 3].
* Use a for loop to iterate over the list passenger_classes:
    * Select only the rows of titanic_survival for which the pclass column equals the loop variable, that is, the class number (1, 2 or 3).
    * Select only the fare column for this subset of rows.
    * Use the Series.mean() method to compute the mean of this subset.
    * Add this mean to the fares_by_class dictionary, with the class number as the key and the mean ticket fare as the value.
* Once the loop has finished, the fares_by_class dictionary should have 1, 2 and 3 as keys, with the corresponding means as values.
* Display the result.

In [12]:
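fares_by_class = {}
passenger_classes = [1, 2, 3]

for pclass in passenger_classes:
    # Select the fares of the passengers in this class in a single .loc call
    class_fares = titanic_survival.loc[titanic_survival["pclass"] == pclass, "fare"]
    fares_by_class[pclass] = class_fares.mean()

print(fares_by_class)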

{1: 87.50899164086687, 2: 21.1791963898917, 3: 13.302888700564957}
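
The same per-class means can be sketched without an explicit loop by using `groupby`. The data below is a small hypothetical sample, not the real dataset; the point is only that `groupby("pclass")["fare"].mean()` performs the select-then-average steps for every class at once.

```python
import pandas as pd

# Hypothetical miniature of titanic_survival (one fare is missing)
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [100.0, 80.0, 20.0, 22.0, 8.0, None],
})

# Group the rows by class, average the fares in each group (NaN is skipped),
# then convert the resulting Series to a plain dictionary
fares_by_class = df.groupby("pclass")["fare"].mean().to_dict()
print(fares_by_class)
```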


## Introduction to pivot tables

In [13]:
# DataFrame.pivot_table()

# A way to split the rows into subsets according to the values of one column and to run a calculation on each subset

In [14]:
import numpy
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values = "fare", aggfunc= numpy.mean)
print(passenger_class_fares)

# The mean is the default aggregation for this method, so aggfunc=numpy.mean can be omitted.
# aggfunc only needs to be specified for another aggregation, such as a sum.

             fare
pclass           
1.0     87.508992
2.0     21.179196
3.0     13.302889
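
As a small illustration of the `aggfunc` parameter, the sketch below builds three pivot tables on hypothetical data: one relying on the default (mean), and two naming another aggregation as a string. Passing the string names `"median"` and `"sum"` is a choice made for this example; the notebook itself passes numpy functions.

```python
import pandas as pd

# Hypothetical miniature of titanic_survival
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [100.0, 80.0, 20.0, 22.0, 8.0, 10.0],
})

# No aggfunc given: pivot_table falls back to the mean
default_pivot = df.pivot_table(index="pclass", values="fare")

# Any other aggregation has to be requested explicitly
median_pivot = df.pivot_table(index="pclass", values="fare", aggfunc="median")
sum_pivot = df.pivot_table(index="pclass", values="fare", aggfunc="sum")

print(default_pivot)
print(median_pivot)
print(sum_pivot)
```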


* Use the DataFrame.pivot_table() method to compute the mean age for each passenger class ("pclass").
* Assign the result to the variable passager_age.
* Display passager_age.
* Do the same with the survived column for each passenger class.

In [15]:
passager_age = titanic_survival.pivot_table(index="pclass", values="age")
print(passager_age)

              age
pclass           
1.0     39.159918
2.0     29.506705
3.0     24.816367


In [16]:
passager_survived = titanic_survival.pivot_table(index= "pclass", values="survived")
print(passager_survived)

        survived
pclass          
1.0     0.619195
2.0     0.429603
3.0     0.255289


## Pivot tables, level 2

* Build a pivot table that computes the total fare collected ("fare") and the total number of survivors ("survived") for each port of embarkation ("embarked"). Use the numpy.sum function.
* Assign the result to the variable port_stats.
* Display the result.

In [17]:
import numpy
port_stats = titanic_survival.pivot_table(index= "embarked", values=["fare", "survived"], aggfunc= numpy.sum)
port_stats

embarked,fare,survived
C,16830.7922,150.0
Q,1526.3085,44.0
S,25033.3862,304.0
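
When the columns call for different aggregations, `aggfunc` also accepts a dictionary mapping each column to its own function. The sketch below uses hypothetical data and, as an assumption made for illustration, sums the fares while averaging the survival flag.

```python
import pandas as pd

# Hypothetical miniature of titanic_survival
df = pd.DataFrame({
    "embarked": ["C", "C", "S", "S", "Q"],
    "fare": [50.0, 30.0, 10.0, 20.0, 8.0],
    "survived": [1.0, 0.0, 1.0, 1.0, 0.0],
})

# One aggregation per column: total fare and survival rate for each port
port_stats = df.pivot_table(
    index="embarked",
    values=["fare", "survived"],
    aggfunc={"fare": "sum", "survived": "mean"},
)
print(port_stats)
```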


## Dropping missing values

In [18]:
# DataFrame.dropna(axis=0) or DataFrame.dropna(axis="index") => drops every row that contains at least one missing value
# DataFrame.dropna(axis=1) => drops every column that contains at least one missing value

In [19]:
drop_na_rows = titanic_survival.dropna(axis=0)
drop_na_rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest


In [20]:
drop_na_columns = titanic_survival.dropna(axis=1)
drop_na_columns

0
1
2
3
4
...
1305
1306
1307
1308
1309


In [21]:
# dropna(axis=..., subset=["name"])
# Only the labels listed in subset are checked: with axis=0, the rows whose "name" value is missing are dropped

In [22]:
titanic_survival.dropna(axis=0, subset=["name"])

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3.0,0.0,"Zabour, Miss. Hileni",female,14.5000,1.0,0.0,2665,14.4542,,C,,328.0,
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665,14.4542,,C,,,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,,C,,,


* Drop every row of titanic_survival in which the "age" or "sex" column has a missing value, and assign the result to the variable new_titanic_survival.
* Compare the number of remaining rows using the shape attribute.

In [23]:
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])
new_titanic_survival

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,3.0,0.0,"Youseff, Mr. Gerious",male,45.5000,0.0,0.0,2628,7.2250,,C,,312.0,
1304,3.0,0.0,"Zabour, Miss. Hileni",female,14.5000,1.0,0.0,2665,14.4542,,C,,328.0,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,,C,,,


In [24]:
print(new_titanic_survival.shape)

(1046, 14)


In [25]:
print(titanic_survival.shape)

(1310, 14)
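
The effect of the different `dropna` options is easier to see on a tiny dataframe. The sketch below uses hypothetical data and compares the default behaviour, the `subset` parameter, and `how="all"` (which only drops rows where every value is missing).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of titanic_survival; row 2 is entirely empty
df = pd.DataFrame({
    "name": ["Allen", "Allison", None, "Zabour"],
    "sex": ["female", "male", None, "female"],
    "age": [29.0, np.nan, np.nan, 14.5],
})

# Default: drop a row as soon as one of its values is missing
any_missing = df.dropna(axis=0)

# subset: only look at the "name" column when deciding
name_present = df.dropna(axis=0, subset=["name"])

# how="all": drop a row only if all of its values are missing
not_empty = df.dropna(axis=0, how="all")

print(any_missing.shape, name_present.shape, not_empty.shape)
```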


## Using iloc to access rows

In [26]:
# DataFrame.loc[]
# Selects rows by their index label or columns by their name

In [27]:
new_titanic_survival = titanic_survival.sort_values("age", inplace= False, ascending= False)
new_titanic_survival

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0000,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.8500,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.7750,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1297,3.0,0.0,"Wiseman, Mr. Phillippe",male,,0.0,0.0,A/4. 34244,7.2500,,S,,,
1302,3.0,0.0,"Yousif, Mr. Wazli",male,,0.0,0.0,2647,7.2250,,C,,,
1303,3.0,0.0,"Yousseff, Mr. Gerious",male,,0.0,0.0,2627,14.4583,,C,,,
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665,14.4542,,C,,,


In [28]:
new_titanic_survival.loc[0]

pclass                                   1
survived                                 1
name         Allen, Miss. Elisabeth Walton
sex                                 female
age                                     29
sibsp                                    0
parch                                    0
ticket                               24160
fare                               211.338
cabin                                   B5
embarked                                 S
boat                                     2
body                                   NaN
home.dest                     St Louis, MO
Name: 0, dtype: object

In [29]:
# loc looks up the index label, not the position

In [30]:
# DataFrame.iloc[]
# iloc selects elements by their integer position

In [31]:
new_titanic_survival.iloc[0:5]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.775,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


* Assign the first 10 rows of new_titanic_survival to the variable first_ten_rows.
* Assign the 5th row of new_titanic_survival to the variable row_position_fifth.
* Assign the row of new_titanic_survival whose index label is 25 to the variable row_index_25.

In [32]:
first_ten_rows = new_titanic_survival.iloc[0:10]
first_ten_rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.775,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
727,3.0,0.0,"Connors, Mr. Patrick",male,70.5,0.0,0.0,370369,7.75,,Q,,171.0,
81,1.0,0.0,"Crosby, Capt. Edward Gifford",male,70.0,1.0,1.0,WE/P 5735,71.0,B22,S,,269.0,"Milwaukee, WI"
506,2.0,0.0,"Mitchell, Mr. Henry Michael",male,70.0,0.0,0.0,C.A. 24580,10.5,,S,,,"Guernsey / Montclair, NJ and/or Toledo, Ohio"
285,1.0,0.0,"Straus, Mr. Isidor",male,67.0,1.0,0.0,PC 17483,221.7792,C55 C57,S,,96.0,"New York, NY"
594,2.0,0.0,"Wheadon, Mr. Edward H",male,66.0,0.0,0.0,C.A. 24579,10.5,,S,,,"Guernsey, England / Edgewood, RI"


In [33]:
row_position_fifth = new_titanic_survival.iloc[4]
row_position_fifth

pclass                             1
survived                           0
name         Artagaveytia, Mr. Ramon
sex                             male
age                               71
sibsp                              0
parch                              0
ticket                      PC 17609
fare                         49.5042
cabin                            NaN
embarked                           C
boat                             NaN
body                              22
home.dest        Montevideo, Uruguay
Name: 9, dtype: object

In [34]:
row_index_25 = new_titanic_survival.loc[25]
row_index_25

pclass                         1
survived                       0
name         Birnbaum, Mr. Jakob
sex                         male
age                           25
sibsp                          0
parch                          0
ticket                     13905
fare                          26
cabin                        NaN
embarked                       C
boat                         NaN
body                         148
home.dest      San Francisco, CA
Name: 25, dtype: object
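
The difference between label-based and position-based access only shows up once the index is no longer in its original order. A minimal sketch on hypothetical data: after sorting, `iloc[0]` returns the first row in the new order, while `loc[0]` still returns the row that carries the label 0.

```python
import pandas as pd

# Hypothetical passengers; the default index labels are 0, 1, 2, 3
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "age": [29.0, 80.0, 2.0, 45.0],
})

# Sorting reorders the rows but keeps their original labels
by_age = df.sort_values("age", ascending=False)

# Position 0 is now the oldest passenger (original label 1)
first_by_position = by_age.iloc[0]

# Label 0 still points to the row that was first before sorting
row_label_0 = by_age.loc[0]

print(list(by_age.index))
print(first_by_position["name"], row_label_0["name"])
```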

## Column indexes

In [35]:
new_titanic_survival.iloc[0,0]

1.0

In [36]:
new_titanic_survival.iloc[:,0:3]

Unnamed: 0,pclass,survived,name
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence..."
1235,3.0,0.0,"Svensson, Mr. Johan"
135,1.0,0.0,"Goldschmidt, Mr. George B"
9,1.0,0.0,"Artagaveytia, Mr. Ramon"
...,...,...,...
1297,3.0,0.0,"Wiseman, Mr. Phillippe"
1302,3.0,0.0,"Yousif, Mr. Wazli"
1303,3.0,0.0,"Yousseff, Mr. Gerious"
1305,3.0,0.0,"Zabour, Miss. Thamine"


In [37]:
new_titanic_survival.loc[83, "age"]

64.0

In [38]:
new_titanic_survival.loc[766, "pclass"]

3.0

* Assign the value of the "age" column at row index label 1100 of new_titanic_survival to the variable row_index_1100_age.
* Assign the value of the "survived" column at row index label 25 of new_titanic_survival to the variable row_index_25_survived.
* Assign the first 5 rows and first 3 columns of new_titanic_survival to the variable five_rows_three_cols.
* Display all the results.

In [39]:
row_index_1100_age = new_titanic_survival.loc[1100, "age"]
row_index_1100_age

29.0

In [40]:
row_index_25_survived = new_titanic_survival.loc[25, "survived"]
row_index_25_survived

0.0

In [41]:
five_rows_three_cols = new_titanic_survival.iloc[0:5,0:3]
print(five_rows_three_cols)

      pclass  survived                                               name
14       1.0       1.0               Barkworth, Mr. Algernon Henry Wilson
61       1.0       1.0  Cavendish, Mrs. Tyrell William (Julia Florence...
1235     3.0       0.0                                Svensson, Mr. Johan
135      1.0       0.0                          Goldschmidt, Mr. George B
9        1.0       0.0                            Artagaveytia, Mr. Ramon
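
The row-and-column forms of `loc` and `iloc` can be summarised in a short sketch. The dataframe below is hypothetical, with non-consecutive index labels chosen to mimic a sorted dataframe.

```python
import pandas as pd

# Hypothetical rows with non-consecutive index labels
df = pd.DataFrame(
    {
        "pclass": [1.0, 3.0, 2.0],
        "survived": [1.0, 0.0, 1.0],
        "age": [80.0, 74.0, 30.0],
    },
    index=[14, 1235, 506],
)

# loc[row_label, column_name]: a single value looked up by labels
age_by_label = df.loc[1235, "age"]

# iloc[row_position, column_position]: a single value looked up by positions
first_cell = df.iloc[0, 0]

# iloc with slices: first 2 rows and first 2 columns
top_left = df.iloc[0:2, 0:2]

print(age_by_label, first_cell)
print(top_left)
```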


## Reindexing the rows of a DataFrame

In [42]:
new_titanic_survival

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0000,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.8500,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.7750,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1297,3.0,0.0,"Wiseman, Mr. Phillippe",male,,0.0,0.0,A/4. 34244,7.2500,,S,,,
1302,3.0,0.0,"Yousif, Mr. Wazli",male,,0.0,0.0,2647,7.2250,,C,,,
1303,3.0,0.0,"Yousseff, Mr. Gerious",male,,0.0,0.0,2627,14.4583,,C,,,
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665,14.4542,,C,,,


In [43]:
# DataFrame.reset_index()
# By default, the method creates a new column that keeps the old index
# To discard the old index instead, set the drop parameter:
# DataFrame.reset_index(drop=True)

In [44]:
new_titanic_survival.reset_index()

Unnamed: 0,index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0000,A23,S,B,,"Hessle, Yorks"
1,61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.8500,C46,S,6,,"Little Onn Hall, Staffs"
2,1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.7750,,S,,,
3,135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
4,9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,1297,3.0,0.0,"Wiseman, Mr. Phillippe",male,,0.0,0.0,A/4. 34244,7.2500,,S,,,
1306,1302,3.0,0.0,"Yousif, Mr. Wazli",male,,0.0,0.0,2647,7.2250,,C,,,
1307,1303,3.0,0.0,"Yousseff, Mr. Gerious",male,,0.0,0.0,2627,14.4583,,C,,,
1308,1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665,14.4542,,C,,,


Exercise
* Reindex the new_titanic_survival dataframe so that the first row starts at 0, and drop the old index.
* Assign the result to the variable titanic_reindexed.
* Display the first 5 rows and first 3 columns of titanic_reindexed.

In [45]:
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
titanic_reindexed.iloc[0:5,0:3]

Unnamed: 0,pclass,survived,name
0,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson"
1,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence..."
2,3.0,0.0,"Svensson, Mr. Johan"
3,1.0,0.0,"Goldschmidt, Mr. George B"
4,1.0,0.0,"Artagaveytia, Mr. Ramon"
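
The two behaviours of `reset_index` can be compared side by side on a small hypothetical dataframe: without `drop`, the old labels are kept in a new column named `index`; with `drop=True`, they are discarded.

```python
import pandas as pd

# Hypothetical passengers, sorted so that the index labels are out of order
df = pd.DataFrame({"name": ["a", "b", "c"], "age": [29.0, 80.0, 2.0]})
by_age = df.sort_values("age", ascending=False)

# Default: the old labels move into a new "index" column
kept = by_age.reset_index()

# drop=True: the old labels are thrown away
dropped = by_age.reset_index(drop=True)

print(kept)
print(dropped)
```

In both cases the new index runs from 0 upward, so position and label coincide again.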


## Applying functions to a DataFrame

In [46]:
# DataFrame.apply()
# By default, the method runs the function on each column of the dataframe
# The function is passed to apply as an argument

In [47]:
# A function that returns the 100th element

def row_100(column):
    # extract the 100th element of a column
    item = column.iloc[99]
    return item

In [48]:
# return the 100th element of each column
row_100_var = titanic_survival.apply(row_100)
row_100_var

pclass                                                       1
survived                                                     1
name         Duff Gordon, Lady. (Lucille Christiana Sutherl...
sex                                                     female
age                                                         48
sibsp                                                        1
parch                                                        0
ticket                                                   11755
fare                                                      39.6
cabin                                                      A16
embarked                                                     C
boat                                                         1
body                                                       NaN
home.dest                                       London / Paris
dtype: object

Exercise
* Write a function that counts the number of missing elements in a Series object.
* Use the DataFrame.apply() method to apply your function to titanic_survival.
* Assign the result to the variable column_null_count.
* Display the result.

In [49]:
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

In [50]:
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
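
Any function that takes a Series and returns a single value can be passed to `apply` in the same way. As a variation on the exercise, the sketch below uses a hypothetical helper `null_fraction` and a hypothetical dataframe to compute the share of missing values in each column rather than the raw count.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of titanic_survival
df = pd.DataFrame({
    "sex": ["female", "male", None, "female"],
    "age": [29.0, np.nan, np.nan, 14.5],
})


def null_fraction(column):
    # The mean of a boolean mask is the proportion of True values
    return pd.isnull(column).mean()


# apply() calls null_fraction once per column and collects the results in a Series
column_null_fraction = df.apply(null_fraction)
print(column_null_fraction)
```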


## Applying a function to a row

In [51]:
# DataFrame.apply(function, axis=1)
# Runs the function on each row

In [52]:
def is_minor(row):
    if row["age"] < 18:
        return True
    else:
        return False

In [53]:
minors = titanic_survival.apply(is_minor, axis=1)
minors

0       False
1        True
2        True
3       False
4       False
        ...  
1305    False
1306    False
1307    False
1308    False
1309    False
Length: 1310, dtype: bool

In [54]:
def which_class(row):
    pclass = row["pclass"]
    
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    else:
        return "Third Class"

In [55]:
classes = titanic_survival.apply(which_class, axis=1)
classes

0       First Class
1       First Class
2       First Class
3       First Class
4       First Class
           ...     
1305    Third Class
1306    Third Class
1307    Third Class
1308    Third Class
1309        Unknown
Length: 1310, dtype: object

* Write a function that returns the string "Minor" for someone under 18, "Adult" if their age is 18 or more, and "Unknown" if the value is missing.
* Use this function with the apply() method to find the correct label for each passenger in the titanic_survival dataframe.
* Assign the result to the variable age_labels.
* Display the result.

In [56]:
def passager_age(row):
    age= row["age"]
    
    if pd.isnull(age):
        return "Unknown"
    elif age < 18:
        return "Minor"
    else:
        return "Adult"

In [57]:
age_labels = titanic_survival.apply(passager_age, axis=1)
age_labels

0         Adult
1         Minor
2         Minor
3         Adult
4         Adult
         ...   
1305    Unknown
1306      Adult
1307      Adult
1308      Adult
1309    Unknown
Length: 1310, dtype: object
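
Row-wise `apply` calls a Python function once per row, which can be slow on large dataframes. As a possible alternative (not the approach used in this notebook), the same labels can be sketched with `numpy.select`, which evaluates the conditions on whole columns at once. The ages below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value and the boundary case 18
age = pd.Series([29.0, 0.9167, np.nan, 18.0])

# Conditions are checked in order; the first one that is True wins
conditions = [age.isnull(), age < 18]
choices = ["Unknown", "Minor"]

# Rows matching no condition get the default label
age_labels = pd.Series(
    np.select(conditions, choices, default="Adult"),
    index=age.index,
)
print(age_labels)
```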

### Practice: computing the survival rate by age group

* Add an "age_labels" column to the titanic_survival dataframe, containing the age_labels variable created in the previous video.
* Build a pivot table that computes the mean survival rate ("survived" column) for each age group ("age_labels" column) of the titanic_survival dataframe.
* Assign the result to the variable age_group_survival.
* Display the result.

In [60]:
titanic_survival["age_labels"] = age_labels

In [63]:
age_group_survival = titanic_survival.pivot_table(index= "age_labels", values="survived")
age_group_survival

age_labels,survived
Adult,0.387892
Minor,0.525974
Unknown,0.277567
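
The same survival rates can be cross-checked with `groupby`, which returns a Series rather than a dataframe. The sketch below runs on a small hypothetical sample, so its numbers do not match the real output above.

```python
import pandas as pd

# Hypothetical miniature of titanic_survival with its age_labels column
df = pd.DataFrame({
    "age_labels": ["Adult", "Adult", "Minor", "Minor", "Unknown"],
    "survived": [1.0, 0.0, 1.0, 1.0, 0.0],
})

# Since survived is 0 or 1, its mean per group is the share of survivors
age_group_survival = df.groupby("age_labels")["survived"].mean()
print(age_group_survival)
```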
