# Statistical analysis and insight detection notebook

In [1]:
import pandas as pd

# pd.set_option("display.max_columns", 50)

## a. Data

In [4]:
df = pd.read_csv("./data/train.csv", sep=",")
df.head(3)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [8]:
df.shape

(891, 12)

## b. General analysis of the dataset

In [6]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Two kinds of dtype are present:
* numeric, int64 or float64 (PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare)
* text, stored as object (Name, Sex, Ticket, Cabin, Embarked)
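
The split above can also be derived from the dtypes rather than read off by eye. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of the real data and a hypothetical helper `split_columns_by_kind`; neither is part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Age": [22.0, 38.0, None],
        "Name": ["Braund", "Cumings", "Heikkinen"],
        "Sex": ["male", "female", "female"],
    }
)


def split_columns_by_kind(frame):
    """Hypothetical helper: return (numeric columns, all other columns)."""
    numeric_cols = frame.select_dtypes(include="number").columns.tolist()
    other_cols = frame.select_dtypes(exclude="number").columns.tolist()
    return numeric_cols, other_cols


numeric_cols, text_cols = split_columns_by_kind(toy)
print(numeric_cols)
print(text_cols)
```

Selecting on "number" rather than on specific dtypes keeps the sketch independent of whether integers or floats are involved; everything else lands in the second list.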

In [7]:
df.isna().sum() / df.shape[0]

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

Two columns are heavily affected by missing values: Age (about 20%) and Cabin (about 77%).

Apart from these two, only Embarked has missing values, and very few (about 0.2%, i.e. 2 rows out of 891).
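
To read this kind of summary more directly, the missing-value shares can be filtered and sorted. This is only a sketch on a made-up frame (`toy`) with a hypothetical helper `missing_share`; it relies on the fact that the mean of a boolean mask equals the count of `True` values divided by the number of rows.

```python
import pandas as pd


def missing_share(frame):
    """Hypothetical helper: share of missing values per affected column, largest first."""
    share = frame.isna().mean()  # same as frame.isna().sum() / frame.shape[0]
    return share[share > 0].sort_values(ascending=False)


# Toy frame (illustrative values only, not the real dataset).
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, 35.0],
        "Cabin": [None, "C85", None, None],
        "Fare": [7.25, 71.2833, 7.925, 53.1],
    }
)

result = missing_share(toy)
print(result)
```

Columns without any missing value (here `Fare`) are dropped from the result, so only the columns that need attention remain.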

## c. Statistical analysis of the text variables


In [10]:
[col for col in df.columns if df[col].dtype == "object"]

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [12]:
textual_cols = ["Name", "Sex", "Ticket", "Cabin", "Embarked"]

In [21]:
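for col in textual_cols:
    # Count each distinct value once, treating missing values as their own "isna" category.
    counts = df[col].fillna("isna").value_counts()
    # Put the raw counts next to their share of all rows.
    data = pd.concat([counts, (counts / df.shape[0]).rename("prop")], axis=1)
    print(data)
    print("\n" * 2 + "=" * 40 + "\n")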

                                          count      prop
Name                                                     
Braund, Mr. Owen Harris                       1  0.001122
Boulos, Mr. Hanna                             1  0.001122
Frolicher-Stehli, Mr. Maxmillian              1  0.001122
Gilinski, Mr. Eliezer                         1  0.001122
Murdlin, Mr. Joseph                           1  0.001122
...                                         ...       ...
Kelly, Miss. Anna Katherine "Annie Kate"      1  0.001122
McCoy, Mr. Bernard                            1  0.001122
Johnson, Mr. William Cahoone Jr               1  0.001122
Keane, Miss. Nora A                           1  0.001122
Dooley, Mr. Patrick                           1  0.001122

[891 rows x 2 columns]



        count      prop
Sex                    
male      577  0.647587
female    314  0.352413



          count      prop
Ticket                   
347082        7  0.007856
CA. 2343      7  0.007856
1601          7 