# Titanic

A dataset about the Titanic disaster is attached for open-ended exercises.

Some exercises you can try:

* Count the null values.
* Show the percentage of rows with null attributes.
* Clean the columns.
* Find the minimum and maximum age of the people on board.
* Compute the median age.
* Find the highest and lowest ticket prices (column `Fare`).
* Count the passengers who embarked (column `Embarked`).
* Examine the distribution of sexes among the passengers.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [61]:
df = pd.read_csv('titanic.csv')

In [24]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [29]:
df.isnull().sum() / len(df) * 100 

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64
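
The cell above gives the percentage of nulls per column. The exercise list also asks for the percentage of rows that contain at least one null, which is a different number. A minimal sketch of both, on a small made-up frame (`toy` and its values are hypothetical, not taken from `titanic.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# True for every row that has a null in at least one column
rows_with_null = toy.isnull().any(axis=1)

# The mean of a boolean Series is the fraction of True values
pct_rows_with_null = rows_with_null.mean() * 100

# Per-column percentage, equivalent to isnull().sum() / len(toy) * 100
pct_null_per_column = toy.isnull().mean() * 100

print(pct_rows_with_null)
print(pct_null_per_column)
```

Applied to the real data, the same two expressions would run on `df` instead of `toy`.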

In [68]:
df.duplicated().sum()  

0

In [67]:
df.head(25)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex1
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Sin_Datos_Cabina,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Sin_Datos_Cabina,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Sin_Datos_Cabina,S,1
5,6,0,3,"Moran, Mr. James",male,26.057135,0,0,330877,8.4583,Sin_Datos_Cabina,Q,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,Sin_Datos_Cabina,S,1
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,Sin_Datos_Cabina,S,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,Sin_Datos_Cabina,C,0


In [66]:
# Assign the result back to the column; calling fillna(..., inplace=True)
# on df['Cabin'] is deprecated and will stop working in pandas 3.0.
df['Cabin'] = df['Cabin'].fillna('Sin_Datos_Cabina')


In [63]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

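# Encode Sex as an integer column (female -> 0, male -> 1)
le = LabelEncoder()
df['Sex1'] = le.fit_transform(df['Sex'])

# Split the rows by whether Age is known
df_age_not_null = df[df['Age'].notnull()]
df_age_is_null = df[df['Age'].isnull()]

# Fit a linear regression of Age on Fare, Sex1 and Pclass using the known ages
model = LinearRegression()
model.fit(df_age_not_null[['Fare', 'Sex1', 'Pclass']], df_age_not_null['Age'])

# Predict the missing ages from the same three features
predicted_ages = model.predict(df_age_is_null[['Fare', 'Sex1', 'Pclass']])

# Write the predictions into the rows where Age was null
df.loc[df['Age'].isnull(), 'Age'] = predicted_ages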
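
The regression above is one way to impute `Age`. For comparison, a simpler and commonly used alternative is to fill each missing age with the median of its `Pclass`/`Sex` group. This is a different technique from the one used in this notebook, sketched here on a small made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, each with one missing age
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age of each (Pclass, Sex) group, broadcast back to every row
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages; known ages are left unchanged
toy["Age_filled"] = toy["Age"].fillna(group_median)

print(toy)
```

Comparing the distribution of ages filled this way against the regression predictions is a quick sanity check on either method.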

In [64]:
# Fill the two missing ports with 'S' and assign the result back;
# calling fillna(..., inplace=True) on df['Embarked'] is deprecated in pandas.
df['Embarked'] = df['Embarked'].fillna('S')


In [65]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
Sex1             0
dtype: int64

In [76]:
print(f'The minimum age is {df["Age"].min()}')
print(f'The maximum age is {df["Age"].max()}')
print(f'The median age is {df["Age"].median().round(2)}')
print(f'The highest price paid for a ticket is {df["Fare"].max()}')
print(f'The lowest price paid for a ticket is {df["Fare"].min()}')
print(f'The total number of people on board is {df["Embarked"].count()}')
print(f'The total number of men on board is {(df["Sex"] == "male").sum()}')
print(f'The total number of women on board is {(df["Sex"] == "female").sum()}')

The minimum age is 0.42
The maximum age is 80.0
The median age is 26.1
The highest price paid for a ticket is 512.3292
The lowest price paid for a ticket is 0.0
The total number of people on board is 891
The total number of men on board is 577
The total number of women on board is 314
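
The summary above reports single totals. The exercises on `Embarked`, the sex distribution and the highest and lowest fares can also be answered with per-category counts and the top and bottom values. A minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for three Titanic columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.2833, 8.05, 8.4583],
})

# Passengers per port of embarkation
embarked_counts = toy["Embarked"].value_counts()

# Share of each sex, as a percentage
sex_share = toy["Sex"].value_counts(normalize=True) * 100

# The two highest and the two lowest fares
top_fares = toy["Fare"].nlargest(2)
bottom_fares = toy["Fare"].nsmallest(2)

print(embarked_counts)
print(sex_share)
print(top_fares)
print(bottom_fares)
```

On the real data the same calls would run on `df`, with a larger `n` for `nlargest` and `nsmallest` if more rows are of interest.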
