# Handling missing values

### Introduction to the dataset

In [2]:
import pandas as pd

fichier_in = "C:/Users/Thierno Barry/Documents/Python Scripts/01.Data Science - Analyse de données avec Python/00. Data/titanic_survival.csv"
titanic_survival = pd.read_csv(fichier_in)

In [6]:
titanic_survival.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


### Finding missing values

In [4]:
# pandas.isnull()

In [6]:
sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)
print(sex_is_null)

0       False
1       False
2       False
3       False
4       False
        ...  
1305    False
1306    False
1307    False
1308    False
1309     True
Name: sex, Length: 1310, dtype: bool


In [7]:
sex_null = sex[sex_is_null]
print(sex_null)

1309    NaN
Name: sex, dtype: object


#### Mission:
- Count the missing values in the "age" column:
>- Assign the "age" column of the titanic_survival dataframe to the variable age
>- Call pandas.isnull() on age to create a Series of True and False values
>- Use the resulting Series to select only the null elements of the "age" column and assign the result to the variable age_null
>- Assign the number of missing values in age_null to the variable age_null_count (using len())
- Print age_null_count to see the number of missing values in the "age" column

In [17]:
age = titanic_survival["age"]
age_is_null = pd.isnull(age)
age_null = age[age_is_null]
age_null_count = len(age_null)
print(age_null_count)

264
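
The mask-then-count steps above can also be collapsed into one expression, because `True` counts as 1 when a boolean Series is summed. The sketch below uses a small made-up Series in place of the real "age" column, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "age" column (values are made up for illustration).
age = pd.Series([29.0, np.nan, 2.0, np.nan, 25.0], name="age")

# Long form, as in the mission: build the mask, filter, then count.
age_is_null = pd.isnull(age)
age_null_count = len(age[age_is_null])

# Short form: summing a boolean Series counts its True values.
age_null_count_short = int(age.isnull().sum())

print(age_null_count, age_null_count_short)
```

Both variables hold the same count; the short form avoids building the intermediate filtered Series.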


### The problem with missing values

In [16]:
mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])
print(mean_age)

nan
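
The `nan` result comes from floating-point arithmetic: any sum that includes a NaN is itself NaN, so a single missing age is enough to spoil the total. A minimal sketch with made-up values (plain Python, no dataset needed):

```python
import math

# Made-up values with one missing entry.
values = [20.0, float("nan"), 40.0]

# Any arithmetic involving NaN yields NaN, so the total is NaN too.
total = sum(values)
mean = total / len(values)

# Filtering the NaN entries out first gives a usable result.
clean = [v for v in values if not math.isnan(v)]
clean_mean = sum(clean) / len(clean)

print(mean, clean_mean)
```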


#### Mission:
- Use age_is_null to create a vector that contains only the values of the "age" column that are not NaN (that is, those for which age_is_null is False)
- Assign the result to the variable good_ages
- Compute the mean of this new vector and assign the result to the variable mean_age
- Print the mean.

In [22]:
age = titanic_survival["age"]
age_is_null = pd.isnull(age)
good_ages = age[~age_is_null]  # ~ inverts the mask: keep rows where age_is_null is False
mean_age = sum(good_ages)/len(good_ages)
print(mean_age)

29.8811345124283


### Computing a mean more simply

In [23]:
# Series.mean()

In [25]:
mean_age = titanic_survival["age"].mean()
print(mean_age)

29.8811345124283
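
`Series.mean()` gives the same number as the manual filtering because it skips NaN values by default (its `skipna` parameter defaults to `True`). The sketch below, on a made-up Series, contrasts the three behaviours.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry.
ages = pd.Series([20.0, np.nan, 40.0])

# Built-in sum() adds every element, so the NaN propagates.
naive_mean = sum(ages) / len(ages)

# Series.mean() ignores NaN by default (skipna=True).
default_mean = ages.mean()

# skipna=False reproduces the NaN result.
strict_mean = ages.mean(skipna=False)

print(naive_mean, default_mean, strict_mean)
```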


#### Mission:
- Assign the mean of the "fare" column to the variable mean_fare.
- Print the result.

In [3]:
mean_fare = titanic_survival["fare"].mean()
print(mean_fare)

33.29547928134572


### Computing fare statistics

#### Mission:
- Create an empty dictionary named fares_by_class.
- Create the list passenger_classes containing the elements [1,2,3].
- Use a for loop to iterate over passenger_classes:
>- Select only the rows of titanic_survival where the pclass column equals the loop variable (the iterator), i.e. the class number (1, 2 or 3)
>- Select only the fare column for that subset of rows
>- Use the Series.mean() method to compute the mean of that subset
>- Add the computed mean to the fares_by_class dictionary, with the class number as key (and the mean ticket fare as value)
- Once the loop has finished, fares_by_class should have 1, 2 and 3 as keys, with the corresponding means as values.
- Print the result.

In [9]:
fares_by_class = {}
passenger_classes = [1,2,3]

for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    pclass_fares = pclass_rows["fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
    
print(fares_by_class)

{1: 87.50899164086687, 2: 21.1791963898917, 3: 13.302888700564957}
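
For comparison, the same per-class means can be obtained without an explicit loop by using `DataFrame.groupby()`. This is only a sketch on a made-up dataframe (the column names match the dataset, the values do not).

```python
import pandas as pd

# Made-up rows standing in for titanic_survival.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [100.0, 80.0, 20.0, 22.0, 8.0, 12.0],
})

# Loop version, as in the mission.
fares_by_class = {}
for this_class in [1, 2, 3]:
    class_fares = df.loc[df["pclass"] == this_class, "fare"]
    fares_by_class[this_class] = class_fares.mean()

# groupby version: one call that yields the same means.
fares_grouped = df.groupby("pclass")["fare"].mean().to_dict()

print(fares_by_class)
print(fares_grouped)
```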


### Introduction to pivot tables

In [None]:
# DataFrame.pivot_table()

In [12]:
import numpy as np

passenger_classes_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)
print(passenger_classes_fares)

             fare
pclass           
1.0     87.508992
2.0     21.179196
3.0     13.302889


#### Mission:
- Use the DataFrame.pivot_table() method to compute the mean age for each passenger class ("pclass").
- Assign the result to the variable passenger_age.
- Print passenger_age.
- Do the same with the survived column for each passenger class.

In [14]:
import numpy as np

passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=np.mean)
print(passenger_age)

              age
pclass           
1.0     39.159918
2.0     29.506705
3.0     24.816367


In [15]:
passenger_survived = titanic_survival.pivot_table(index="pclass", values="survived", aggfunc=np.mean)
print(passenger_survived)

        survived
pclass          
1.0     0.619195
2.0     0.429603
3.0     0.255289


### Pivot tables, level 2

#### Mission:
- Build a pivot table that computes the total money collected ("fare") and the total number of survivors ("survived") for each port of embarkation ("embarked"). You will need the numpy.sum function.
- Assign the result to the variable port_stats.
- Print the result.

In [17]:
import numpy as np

port_stats = titanic_survival.pivot_table(index="embarked", values=["fare", "survived"], aggfunc=np.sum)
print(port_stats)

                fare  survived
embarked                      
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0
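
When the columns need different aggregations, `pivot_table()` also accepts a dictionary for `aggfunc`. The sketch below runs on a made-up dataframe and passes the aggregations as the strings "sum" and "mean", which pandas accepts in place of the NumPy functions.

```python
import pandas as pd

# Made-up rows standing in for titanic_survival.
df = pd.DataFrame({
    "embarked": ["C", "C", "Q", "S", "S"],
    "fare": [50.0, 30.0, 8.0, 10.0, 12.0],
    "survived": [1.0, 0.0, 1.0, 0.0, 1.0],
})

# One aggregation applied to every listed column.
port_totals = df.pivot_table(
    index="embarked", values=["fare", "survived"], aggfunc="sum"
)

# A dict selects a different aggregation per column.
port_mixed = df.pivot_table(
    index="embarked",
    values=["fare", "survived"],
    aggfunc={"fare": "sum", "survived": "mean"},
)

print(port_totals)
print(port_mixed)
```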


### Dropping missing values

In [18]:
# DataFrame.dropna(axis=0 or axis='index') drops rows; to drop columns, use axis=1

In [23]:
# drop every row that contains a missing value
drop_na_rows = titanic_survival.dropna(axis=0)
drop_na_rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest


In [24]:
# drop every column that contains a missing value
drop_na_columns = titanic_survival.dropna(axis=1)
drop_na_columns

0
1
2
3
4
...
1305
1306
1307
1308
1309


In [25]:
# dropna(axis=..., subset=["name"])

In [27]:
titanic_survival.dropna(axis=0, subset=["name"])

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3.0,0.0,"Zabour, Miss. Hileni",female,14.5000,1.0,0.0,2665,14.4542,,C,,328.0,
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665,14.4542,,C,,,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,,C,,,


#### Mission:
- Drop every row of titanic_survival where the "age" or "sex" column has a missing value and assign the result to the variable new_titanic_survival.
- Use the shape attribute to check how many rows remain.

In [46]:
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])
new_titanic_survival

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,3.0,0.0,"Youseff, Mr. Gerious",male,45.5000,0.0,0.0,2628,7.2250,,C,,312.0,
1304,3.0,0.0,"Zabour, Miss. Hileni",female,14.5000,1.0,0.0,2665,14.4542,,C,,328.0,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,,C,,,


In [29]:
print(new_titanic_survival.shape)

(1046, 14)
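
The three `dropna()` variants used in this section behave quite differently, which is easier to see on a tiny dataframe. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up dataframe: "name" is complete, "age" and "cabin" have gaps.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [30.0, np.nan, 25.0],
    "cabin": [np.nan, np.nan, "C22"],
})

# axis=0: drop rows with any missing value (only row 2 is complete).
rows_complete = df.dropna(axis=0)

# axis=1: drop columns with any missing value (only "name" is complete).
cols_complete = df.dropna(axis=1)

# subset: only look at "age" when deciding which rows to drop.
rows_with_age = df.dropna(axis=0, subset=["age"])

print(rows_complete.shape, cols_complete.shape, rows_with_age.shape)
```

Restricting the check with `subset` is what keeps most of the Titanic rows: without it, a single missing value in any column removes the whole row.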


### Using iloc to access rows

In [30]:
# DataFrame.loc[] (by label) and DataFrame.iloc[] (by position)

In [50]:
new_titanic_survival = new_titanic_survival.sort_values("age", inplace=False, ascending=False)
new_titanic_survival

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0000,0.0,0.0,27042,30.0000,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0000,1.0,0.0,19877,78.8500,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0000,0.0,0.0,347060,7.7750,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0000,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0000,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
657,3.0,1.0,"Baclini, Miss. Eugenie",female,0.7500,2.0,1.0,2666,19.2583,,C,C,,"Syria New York, NY"
427,2.0,1.0,"Hamalainen, Master. Viljo",male,0.6667,1.0,1.0,250649,14.5000,,S,4,,"Detroit, MI"
1240,3.0,1.0,"Thomas, Master. Assad Alexander",male,0.4167,0.0,1.0,2625,8.5167,,C,16,,
747,3.0,0.0,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.3333,0.0,2.0,347080,14.4000,,S,,,"Stanton, IA"


In [74]:
# row label (index)
new_titanic_survival.loc[0]

pclass                                 1.0
survived                               1.0
name         Allen, Miss. Elisabeth Walton
sex                                 female
age                                   29.0
sibsp                                  0.0
parch                                  0.0
ticket                               24160
fare                              211.3375
cabin                                   B5
embarked                                 S
boat                                     2
body                                   NaN
home.dest                     St Louis, MO
Name: 0, dtype: object

In [36]:
# positional order
new_titanic_survival.iloc[0:5]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.775,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
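
The difference between the two accessors only becomes visible once the rows are no longer in their original order. A minimal sketch on a made-up three-row dataframe:

```python
import pandas as pd

# Made-up ages; the default index labels are 0, 1, 2.
df = pd.DataFrame({"age": [29.0, 80.0, 2.0]})

# Sorting reorders the rows but each row keeps its original label.
sorted_df = df.sort_values("age", ascending=False)

# .loc looks up the row whose label is 0, wherever it now sits.
by_label = sorted_df.loc[0, "age"]

# .iloc looks up the first row in the current order.
by_position = sorted_df.iloc[0]["age"]

print(list(sorted_df.index), by_label, by_position)
```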


#### Mission:
- Assign the first 10 rows of new_titanic_survival to the variable first_ten_rows.
- Assign the 5th row of new_titanic_survival to the variable row_position_fifth.
- Assign the row of new_titanic_survival whose index label is 25 to the variable row_index_25.

In [47]:
first_ten_rows = new_titanic_survival.iloc[0:10]
print(first_ten_rows)

   pclass  survived                                             name     sex  \
0     1.0       1.0                    Allen, Miss. Elisabeth Walton  female   
1     1.0       1.0                   Allison, Master. Hudson Trevor    male   
2     1.0       0.0                     Allison, Miss. Helen Loraine  female   
3     1.0       0.0             Allison, Mr. Hudson Joshua Creighton    male   
4     1.0       0.0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   
5     1.0       1.0                              Anderson, Mr. Harry    male   
6     1.0       1.0                Andrews, Miss. Kornelia Theodosia  female   
7     1.0       0.0                           Andrews, Mr. Thomas Jr    male   
8     1.0       1.0    Appleton, Mrs. Edward Dale (Charlotte Lamson)  female   
9     1.0       0.0                          Artagaveytia, Mr. Ramon    male   

       age  sibsp  parch    ticket      fare    cabin embarked boat   body  \
0  29.0000    0.0    0.0     24160  211.3

In [48]:
row_position_fifth = new_titanic_survival.iloc[4]
row_position_fifth

pclass                                                   1.0
survived                                                 0.0
name         Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
sex                                                   female
age                                                     25.0
sibsp                                                    1.0
parch                                                    2.0
ticket                                                113781
fare                                                  151.55
cabin                                                C22 C26
embarked                                                   S
boat                                                     NaN
body                                                     NaN
home.dest                    Montreal, PQ / Chesterville, ON
Name: 4, dtype: object

In [49]:
row_index_25 = new_titanic_survival.loc[25]
row_index_25

pclass                       1.0
survived                     0.0
name         Birnbaum, Mr. Jakob
sex                         male
age                         25.0
sibsp                        0.0
parch                        0.0
ticket                     13905
fare                        26.0
cabin                        NaN
embarked                       C
boat                         NaN
body                       148.0
home.dest      San Francisco, CA
Name: 25, dtype: object

### Column indexes

In [51]:
new_titanic_survival.iloc[0,0]

1.0

In [52]:
new_titanic_survival.iloc[:,0:3]

Unnamed: 0,pclass,survived,name
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence..."
1235,3.0,0.0,"Svensson, Mr. Johan"
135,1.0,0.0,"Goldschmidt, Mr. George B"
9,1.0,0.0,"Artagaveytia, Mr. Ramon"
...,...,...,...
657,3.0,1.0,"Baclini, Miss. Eugenie"
427,2.0,1.0,"Hamalainen, Master. Viljo"
1240,3.0,1.0,"Thomas, Master. Assad Alexander"
747,3.0,0.0,"Danbom, Master. Gilbert Sigvard Emanuel"


In [58]:
new_titanic_survival.loc[83,"age"]

64.0

In [54]:
new_titanic_survival.loc[766,"pclass"]

3.0

#### Mission:
- Assign the value at row label 1100 in the "age" column of new_titanic_survival to the variable row_index_1100_age.
- Assign the value at row label 25 in the "survived" column of new_titanic_survival to the variable row_index_25_survived.
- Assign the first 5 rows and first 3 columns of new_titanic_survival to the variable five_rows_three_cols.
- Print all the results.

In [75]:
row_index_1100_age = new_titanic_survival.loc[1100,"age"]
row_index_25_survived = new_titanic_survival.loc[25,"survived"]
five_rows_three_cols = new_titanic_survival.iloc[0:5,0:3]

In [76]:
print(row_index_1100_age)

29.0


In [77]:
print(row_index_25_survived)

0.0


In [78]:
print(five_rows_three_cols)

      pclass  survived                                               name
14       1.0       1.0               Barkworth, Mr. Algernon Henry Wilson
61       1.0       1.0  Cavendish, Mrs. Tyrell William (Julia Florence...
1235     3.0       0.0                                Svensson, Mr. Johan
135      1.0       0.0                          Goldschmidt, Mr. George B
9        1.0       0.0                            Artagaveytia, Mr. Ramon
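
One detail worth keeping in mind when slicing rows and columns together: `iloc` slices exclude the end position, as Python slices do, while `loc` slices include the end label. The sketch below uses a made-up dataframe with labels 10, 20, 30 so that labels and positions cannot be confused.

```python
import pandas as pd

# Made-up dataframe with non-default row labels.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=[10, 20, 30],
)

# Positional slice: rows 0-1 and columns 0-1 (end positions excluded).
by_position = df.iloc[0:2, 0:2]

# Label slice: rows 10 through 20 and columns "a" through "b" (ends included).
by_label = df.loc[10:20, "a":"b"]

# Single cell, addressed both ways.
cell_label = df.loc[20, "b"]
cell_position = df.iloc[1, 1]

print(by_position.equals(by_label), cell_label, cell_position)
```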


### Reindexing the rows of a DataFrame

In [79]:
# DataFrame.reset_index(drop=True) - drop=True discards the old index instead of keeping it as an "index" column

In [80]:
new_titanic_survival.reset_index()

Unnamed: 0,index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0000,0.0,0.0,27042,30.0000,A23,S,B,,"Hessle, Yorks"
1,61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0000,1.0,0.0,19877,78.8500,C46,S,6,,"Little Onn Hall, Staffs"
2,1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0000,0.0,0.0,347060,7.7750,,S,,,
3,135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0000,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
4,9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0000,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1041,657,3.0,1.0,"Baclini, Miss. Eugenie",female,0.7500,2.0,1.0,2666,19.2583,,C,C,,"Syria New York, NY"
1042,427,2.0,1.0,"Hamalainen, Master. Viljo",male,0.6667,1.0,1.0,250649,14.5000,,S,4,,"Detroit, MI"
1043,1240,3.0,1.0,"Thomas, Master. Assad Alexander",male,0.4167,0.0,1.0,2625,8.5167,,C,16,,
1044,747,3.0,0.0,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.3333,0.0,2.0,347080,14.4000,,S,,,"Stanton, IA"


#### Mission:
- Reindex the new_titanic_survival dataframe so that the first row starts at 0, and discard the old index.
- Assign the result to the variable titanic_reindexed.
- Display the first 5 rows and first 3 columns of titanic_reindexed.

In [91]:
new_titanic_survival_2 = titanic_survival.dropna(axis=0, subset=["age", "sex"])
titanic_reindexed = new_titanic_survival_2.reset_index(drop=True)
titanic_reindexed.iloc[0:5,0:3]

Unnamed: 0,pclass,survived,name
0,1.0,1.0,"Allen, Miss. Elisabeth Walton"
1,1.0,1.0,"Allison, Master. Hudson Trevor"
2,1.0,0.0,"Allison, Miss. Helen Loraine"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)"
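
The effect of the `drop` argument can be checked on a small sorted dataframe. The values below are made up; only the behaviour of `reset_index()` matters.

```python
import pandas as pd

# Made-up ages, sorted so the index labels are out of order (1, 0, 2).
df = pd.DataFrame({"age": [29.0, 80.0, 2.0]}).sort_values("age", ascending=False)

# Default: the old labels are kept in a new "index" column.
kept = df.reset_index()

# drop=True: the old labels are thrown away.
dropped = df.reset_index(drop=True)

print(list(kept.columns), list(dropped.columns), list(dropped.index))
```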


### Applying functions to a DataFrame

In [92]:
# DataFrame.apply()

In [93]:
# A function that returns the 100th element

def row_100(column):
    # extract the 100th element of a column (position 99, since positions start at 0)
    item = column.iloc[99]
    return item

In [94]:
# return the 100th element of each column
row_100_var = titanic_survival.apply(row_100)
row_100_var

pclass                                                     1.0
survived                                                   1.0
name         Duff Gordon, Lady. (Lucille Christiana Sutherl...
sex                                                     female
age                                                       48.0
sibsp                                                      1.0
parch                                                      0.0
ticket                                                   11755
fare                                                      39.6
cabin                                                      A16
embarked                                                     C
boat                                                         1
body                                                       NaN
home.dest                                       London / Paris
dtype: object

#### Mission:
- Write a function that counts the number of missing elements in a Series object.
- Use the DataFrame.apply() method to apply your function to titanic_survival.
- Assign the result to the variable column_null_count.
- Display the result.

In [99]:
import pandas as pd

def count_missing_values(column):
    column_is_null    = pd.isnull(column)
    column_null       = column[column_is_null]
    return len(column_null)

column_null_count = titanic_survival.apply(count_missing_values)
column_null_count

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
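
By default `apply()` calls the function once per column and collects the return values into a Series indexed by column name. The sketch below, on a made-up dataframe, shows that mechanism and compares it with the built-in `isnull().sum()` chain, which should give the same counts.

```python
import numpy as np
import pandas as pd

# Made-up dataframe with gaps in both columns.
df = pd.DataFrame({
    "age": [30.0, np.nan, 25.0],
    "cabin": [np.nan, np.nan, "C22"],
})

def count_missing_values(column):
    # column is one Series per call; sum the True values of its null mask.
    return int(pd.isnull(column).sum())

# One call per column, results gathered into a Series.
via_apply = df.apply(count_missing_values)

# Built-in alternative: null mask for the whole frame, summed per column.
via_builtin = df.isnull().sum()

print(via_apply.to_dict(), via_builtin.to_dict())
```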

### Applying a function to each row

In [None]:
# DataFrame.apply(function, axis=1)

In [113]:
def is_minor(row):
    if row["age"] < 18:
        return True
    else:
        return False

In [114]:
minors = titanic_survival.apply(is_minor, axis=1)
minors

0       False
1        True
2        True
3       False
4       False
        ...  
1305    False
1306    False
1307    False
1308    False
1309    False
Length: 1310, dtype: bool

In [103]:
def which_class(row):
    pclass = row["pclass"]
    
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    else:
        return "Third Class"

In [104]:
classes = titanic_survival.apply(which_class, axis=1)
classes

0       First Class
1       First Class
2       First Class
3       First Class
4       First Class
           ...     
1305    Third Class
1306    Third Class
1307    Third Class
1308    Third Class
1309        Unknown
Length: 1310, dtype: object

#### Mission:
- Create a function that returns the string "minor" for someone under 18, "adult" if their age is 18 or over, and "unknown" if the value is missing.
- Use this function with the apply() method to find the correct label for each passenger of the titanic_survival dataframe.
- Assign the result to the variable age_labels.
- Display the result.

In [117]:
import pandas as pd

def is_minor(row):
    age = row["age"]
    
    if age < 18:
        return "minor"
    elif age >= 18:
        return "adult"
    else:
        return "unknown"

age_labels = titanic_survival.apply(is_minor, axis=1)
age_labels

0         adult
1         minor
2         minor
3         adult
4         adult
         ...   
1305    unknown
1306      adult
1307      adult
1308      adult
1309    unknown
Length: 1310, dtype: object

In [119]:
# reference solution: test for a missing value explicitly, before any comparison
def generate_age_label(row):
    age = row["age"]
    
    if pd.isnull(age):
        return "Unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
    
age_labels = titanic_survival.apply(generate_age_label, axis=1)
age_labels

0         adult
1         minor
2         minor
3         adult
4         adult
         ...   
1305    Unknown
1306      adult
1307      adult
1308      adult
1309    Unknown
Length: 1310, dtype: object
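
Both versions label missing ages as unknown, but for different reasons: the first relies on every comparison with NaN evaluating to False, so it falls through to the final `else`, while the second tests for NaN explicitly. The sketch below demonstrates that comparison behaviour and shows one possible vectorized alternative using `numpy.select`. The ages are made up, and the lowercase "unknown" label follows the mission text rather than the output above.

```python
import numpy as np
import pandas as pd

nan = float("nan")

# Comparisons with NaN are always False, in both directions.
nan_is_minor = nan < 18
nan_is_adult = nan >= 18

# Made-up ages, including a missing one and the boundary value 18.
age = pd.Series([29.0, 0.9167, np.nan, 18.0])

# np.select picks the first matching condition for each element.
labels = np.select(
    [age.isnull().to_numpy(), (age < 18).to_numpy()],
    ["unknown", "minor"],
    default="adult",
)
age_labels = pd.Series(labels, index=age.index)

print(nan_is_minor, nan_is_adult)
print(age_labels)
```

Checking for missing values first keeps the logic explicit and does not depend on the order of the comparisons.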

### Practice: computing the survival percentage by age group

#### Mission:
- Add an "age_labels" column to the titanic_survival dataframe, containing the age_labels variable created in the previous video.
- Create a pivot table that computes the mean survival rate ("survived" column) for each age group ("age_labels" column) of the titanic_survival dataframe.
- Assign the resulting object (a DataFrame indexed by age_labels) to the variable age_group_survival.
- Print the result.

In [125]:
import numpy as np

titanic_survival["age_labels"] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived", aggfunc=np.mean)
print(age_group_survival)

            survived
age_labels          
Unknown     0.277567
adult       0.387892
minor       0.525974
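
Because "survived" only holds 0 and 1, its mean within a group is the share of survivors in that group. The sketch below checks this on made-up rows and also shows how to pull a Series out of the pivot table, which is a DataFrame.

```python
import pandas as pd

# Made-up passengers: 1 of 2 adults and 2 of 3 minors survive.
df = pd.DataFrame({
    "age_labels": ["adult", "adult", "minor", "minor", "minor", "unknown"],
    "survived": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})

# The pivot table is a DataFrame with one "survived" column.
age_group_survival = df.pivot_table(
    index="age_labels", values="survived", aggfunc="mean"
)

# Selecting the column gives a Series; multiply by 100 for percentages.
survival_percent = age_group_survival["survived"] * 100

print(survival_percent)
```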
