# Introduction



This notebook performs an exploratory analysis of the data to understand how the variables behave and which kinds of analysis they support.

The Titanic dataset used here (loaded below from `train.csv`) contains the details of a subset of the passengers on board: 891 passengers, with one row per passenger.

# Data loading

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

## Creating the DataFrame

In [2]:
data = pd.read_csv('../data/train.csv')
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


# Data description

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


| Column       | Description                                  | Data type | Non-null count | Comments                                             |
|--------------|----------------------------------------------|-----------|----------------|------------------------------------------------------|
| PassengerId  | Unique passenger identifier                  | int64     | 891            |                                                      |
| Survived     | Whether the passenger survived               | int64     | 891            | 0 = No (did not survive), 1 = Yes (survived)         |
| Pclass       | Ticket class                                 | int64     | 891            | 1 = First class, 2 = Second class, 3 = Third class   |
| Name         | Passenger name                               | object    | 891            |                                                      |
| Sex          | Passenger sex                                | object    | 891            | "male", "female"                                     |
| Age          | Passenger age                                | float64   | 714            |                                                      |
| SibSp        | Number of siblings/spouses aboard            | int64     | 891            |                                                      |
| Parch        | Number of parents/children aboard            | int64     | 891            |                                                      |
| Ticket       | Ticket number                                | object    | 891            |                                                      |
| Fare         | Fare paid by the passenger                   | float64   | 891            |                                                      |
| Cabin        | Cabin number                                 | object    | 204            |                                                      |
| Embarked     | Port of embarkation                          | object    | 889            | "C" = Cherbourg, "Q" = Queenstown, "S" = Southampton |


- **There are missing values (Age, Cabin, and Embarked).**
- **`Pclass` is stored as `int64` even though it encodes a category (ticket class); `Parch` and `Fare` already have numeric types (`int64` and `float64`).**
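
A summary table like the one above can drift out of sync with the data when it is written by hand. One possible way to derive it directly from the DataFrame is sketched below; `column_summary` and the small `sample` frame are hypothetical names and made-up values used only for illustration, not part of the original notebook.

```python
import pandas as pd


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, non-null count and null count for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "nulls": df.isna().sum(),
    })


# Tiny made-up sample standing in for the real DataFrame
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, None, 26.0],
    "Embarked": ["S", "C", None],
})

summary = column_summary(sample)
print(summary)
```

In the notebook, the same function could be called as `column_summary(data)`.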

# Initial exploration

In [4]:
# First rows of the dataset
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Statistical summary

In [5]:
# Statistical summary of the numeric columns
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
# Summary of the object-typed columns
data.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


## Basic queries before cleaning

### Percentage of survivors

In [8]:
porcentaje_supervivencia = (data['Survived'].mean()) * 100
print(f'Percentage of survivors: {porcentaje_supervivencia:.2f}%')

Percentage of survivors: 38.38%


### Survival rate by class

In [9]:
p_sup_class = data.groupby('Pclass')['Survived'].mean() * 100

print('Survival rate by class (%)')
print(p_sup_class)

Survival rate by class (%)
Pclass
1    62.962963
2    47.282609
3    24.236253
Name: Survived, dtype: float64


In [10]:
# Check which classes are present and how many passengers each one has
data['Pclass'].value_counts()

Pclass
3    491
1    216
2    184
Name: count, dtype: int64
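
The two cells above report survival rates and class sizes separately. As a rough sketch, a single named aggregation could show both side by side, which makes it easier to judge how many passengers stand behind each percentage. The `sample` frame and the output column names (`passengers`, `survivors`, `survival_pct`) are made up for illustration.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the dataset
sample = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# One row per class: group size, number of survivors, survival rate in percent
by_class = sample.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survivors="sum",
    survival_pct=lambda s: s.mean() * 100,
)
print(by_class)
```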

### Survival rate by sex

In [11]:
p_sup_sex = data.groupby('Sex')['Survived'].mean() * 100
print('Survival rate by sex (%)')
print(p_sup_sex)

Survival rate by sex (%)
Sex
female    74.203822
male      18.890815
Name: Survived, dtype: float64
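
Class and sex are examined one at a time above. A possible next step, sketched here on made-up data, is to cross the two variables with `pivot_table` so that each cell holds the survival rate for one sex within one class. The `sample` frame and the `rates` name are illustrative assumptions.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the dataset
sample = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "female", "male"],
    "Pclass":   [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Rows are sexes, columns are classes, cells are survival rates in percent
rates = sample.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
) * 100
print(rates)
```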


# Data cleaning

## Missing data

### Identifying missing data

In [12]:
# Number of missing values per column
print(data.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


**The columns with missing data are:**

- Age: 177
- Cabin: 687
- Embarked: 2
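
Raw counts are easier to compare as a share of the rows. With 891 rows, the counts above correspond to roughly 19.9% for Age, 77.1% for Cabin, and 0.2% for Embarked. A minimal sketch of how that share could be computed follows; the `sample` frame is made up for illustration.

```python
import pandas as pd

# Tiny made-up sample; None marks a missing value
sample = pd.DataFrame({
    "Age":   [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})

# isna() yields booleans, so the column mean is the fraction of missing values
missing_pct = sample.isna().mean() * 100

# Keep only the columns that actually have missing values, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```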

### Handling missing data

#### Age

In [13]:
data[data['Age'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


**Since Age is a numeric variable, its missing values are filled with the column median, which is less affected by extreme values than the mean.**

In [14]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['Age'] = data['Age'].fillna(data['Age'].median())
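
To illustrate why the choice between mean and median matters, the sketch below uses a small made-up series in which one high value pulls the mean well above the typical age, while the median stays close to it. The `ages` values are invented for illustration only.

```python
import pandas as pd

# Made-up ages: four young passengers, one much older, two missing
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, None, None], name="Age")

mean_age = ages.mean()      # pulled upward by the single value of 80
median_age = ages.median()  # middle of the observed values

# Fill the gaps with the median and assign the result to a new series
filled = ages.fillna(median_age)

print(f"mean={mean_age:.1f}, median={median_age:.1f}")
print(filled.tolist())
```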

#### Cabin

In [15]:
data[data['Cabin'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,28.0,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,W./C. 6607,23.4500,,S


**Because most values are missing (687 of 891) and there is no reliable way to impute them, the most reasonable option is to drop the column rather than introduce bias through guessed cabin values.**

In [17]:
data.drop('Cabin', axis=1, inplace=True)
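
Dropping the column is the approach taken in this notebook. As an alternative that is not used here, one could keep a binary flag recording whether a cabin was registered before dropping the raw values. The sketch below shows that idea on made-up data; `HasCabin` is a hypothetical column name.

```python
import pandas as pd

# Tiny made-up sample; None marks a passenger without a recorded cabin
sample = pd.DataFrame({"Cabin": ["C85", None, "B42", None]})

# 1 if a cabin value is present, 0 otherwise
sample["HasCabin"] = sample["Cabin"].notna().astype(int)

# The raw cabin codes can then be dropped without losing the presence information
sample = sample.drop(columns="Cabin")
print(sample)
```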

#### Embarked

In [22]:
data[data['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,


In [23]:
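# 'D' is not one of the existing port codes (C, Q, S); it flags the two missing values as their own category
data['Embarked'] = data['Embarked'].fillna('D')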
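
The cell above marks the two missing ports with a separate code. Another common option, shown only as a sketch and not as this notebook's method, is to fill them with the most frequent port. In the summary of object columns, `S` is the top value with 644 of 889 non-null entries. The `embarked` series below is made up for illustration.

```python
import pandas as pd

# Made-up port codes with two missing entries
embarked = pd.Series(["S", "C", "S", None, "Q", "S", None], name="Embarked")

# mode() returns a Series because ties are possible; take the first entry
most_common = embarked.mode()[0]

filled = embarked.fillna(most_common)
print(most_common)
print(filled.tolist())
```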

### Verifying the data

In [24]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB
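
The `info()` output above shows 891 non-null values in every remaining column. If the cleaning steps are rerun later, a small check could make that expectation explicit instead of relying on reading the output. The helper name `assert_no_missing` and the sample frames are assumptions made for this sketch.

```python
import pandas as pd


def assert_no_missing(df: pd.DataFrame) -> None:
    """Raise ValueError listing every column that still contains missing values."""
    missing = df.isna().sum()
    remaining = missing[missing > 0]
    if not remaining.empty:
        raise ValueError(f"Columns with missing values: {remaining.to_dict()}")


# Made-up frames: one fully populated, one with a gap in Age
clean = pd.DataFrame({"Age": [22.0, 28.0], "Embarked": ["S", "C"]})
dirty = pd.DataFrame({"Age": [22.0, None], "Embarked": ["S", "C"]})

assert_no_missing(clean)  # passes silently

try:
    assert_no_missing(dirty)
    check_failed = False
except ValueError as err:
    check_failed = True
    print(err)
```

In the notebook, the call would be `assert_no_missing(data)` after the last cleaning cell.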
