# Reading data
Read the train.csv file as a pandas dataframe.

In [39]:
import pandas as pd
import numpy as np

train = pd.read_csv('data/train.csv')
train.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


# Indexing
1. Create a function that returns the name of a passenger given their PassengerId.
2. Create a function that returns the PassengerId of a passenger given their Name.
3. Print a message with the ID of passenger **Montvila, Rev. Juozas** with the following format: 'The ID of passenger Montvila, Rev. Juozas is ##'
4. Print a message with the name of the passenger with ID **42** with the following format: 'The passenger with ID 42 is X'

5. Print all information about the oldest passenger.

In [None]:
def get_passenger(df: pd.DataFrame, passengerId: int):
    """Return the name of a passenger given their ID.

    Args:
        df (pd.DataFrame): Dataset
        passengerId (int): Passenger ID

    Returns:
        str: The passenger's name
    """
    # Filter Name with a boolean mask and take the first match
    return df.Name[df.PassengerId == passengerId].values[0]

def get_passengerId(df: pd.DataFrame, passenger: str):
    """Return the ID of a passenger given their name.

    Args:
        df (pd.DataFrame): Dataset
        passenger (str): Passenger name

    Returns:
        int: The passenger's ID
    """
    # Filter PassengerId with a boolean mask and take the first match
    return df.PassengerId[df.Name == passenger].values[0]

print(f'The passenger with ID 42 is {get_passenger(train, 42)}')
print(f'The ID of passenger Montvila, Rev. Juozas is {get_passengerId(train, "Montvila, Rev. Juozas")}')

print(train[train.Age == train.Age.max()])


The passenger with ID 42 is Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)
The ID of passenger Montvila, Rev. Juozas is 887
     PassengerId  Survived  Pclass                                  Name  \
630          631         1       1  Barkworth, Mr. Algernon Henry Wilson   

      Sex   Age  SibSp  Parch Ticket  Fare Cabin Embarked  
630  male  80.0      0      0  27042  30.0   A23        S  
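
The two lookup functions above raise an `IndexError` when nothing matches, because `.values[0]` indexes an empty array. The following is a minimal sketch of a more defensive variant, not the notebook's own solution: the helper name `find_name` and the toy frame are hypothetical and only keep the example self-contained. It also shows `idxmax` as another way to fetch the oldest passenger; unlike the mask used above, it returns a single row even when several passengers share the maximum age.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values, same column names)
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['A, Mr. X', 'B, Mrs. Y', 'C, Miss. Z'],
    'Age': [22.0, 38.0, None],
})

def find_name(df: pd.DataFrame, passenger_id: int):
    """Return the passenger's name, or None if the ID is not present."""
    matches = df.loc[df.PassengerId == passenger_id, 'Name']
    return matches.iloc[0] if not matches.empty else None

print(find_name(toy, 2))   # existing ID
print(find_name(toy, 99))  # unknown ID, no exception raised

# idxmax skips missing ages and returns the index label of the maximum
oldest = toy.loc[toy.Age.idxmax()]
print(oldest)
```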


# Subsetting
We are asked to share data for analysis by a third party. Since our dataset contains personal details, we only want to share with them the following information: ticket classes, fares and port of embarkation. We are asked to deliver a sample of the first 100 rows of this dataset.

6. Create and save the new dataset in **data/port_fares.csv**.

In [23]:
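# Keep only the non-personal columns and the first 100 rows
port_fares = train[['Pclass', 'Fare', 'Embarked']].head(100)
# index=False keeps the row index out of the shared file
port_fares.to_csv('data/port_fares.csv', index=False)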
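
By default `DataFrame.to_csv` writes the row index as an extra first column with an empty header, and `read_csv` then names that column `Unnamed: 0`. The sketch below illustrates the difference on a hypothetical toy frame, writing to an in-memory buffer instead of `data/port_fares.csv` so that it runs without any files.

```python
import io
import pandas as pd

# Hypothetical toy frame with the three shared columns
toy = pd.DataFrame({
    'Pclass': [3, 1],
    'Fare': [7.25, 71.2833],
    'Embarked': ['S', 'C'],
})

# Default behaviour: the row index is written as an extra column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```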


# Counting
7. We want to know if there were any survivors over the age of 60, print all of their information.
8. How many people over 60 survived?
9. What percentage of people over 60 survived?

In [38]:
# Build each filter once and reuse it
over_60 = train[train.Age > 60]
survivors_over_60 = over_60[over_60.Survived == 1]

print(survivors_over_60)
print('')
print(survivors_over_60.shape[0])
print('')
print(survivors_over_60.shape[0] / over_60.shape[0] * 100, '%')

     PassengerId  Survived  Pclass                                       Name  \
275          276         1       1          Andrews, Miss. Kornelia Theodosia   
483          484         1       3                     Turkula, Mrs. (Hedwig)   
570          571         1       2                         Harris, Mr. George   
630          631         1       1       Barkworth, Mr. Algernon Henry Wilson   
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)   

        Sex   Age  SibSp  Parch       Ticket     Fare Cabin Embarked  
275  female  63.0      1      0        13502  77.9583    D7        S  
483  female  63.0      0      0         4134   9.5875   NaN        S  
570    male  62.0      0      0  S.W./PP 752  10.5000   NaN        S  
630    male  80.0      0      0        27042  30.0000   A23        S  
829  female  62.0      0      0       113572  80.0000   B28      NaN  

5

22.727272727272727 %
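
The same count and percentage can also be read directly off boolean masks, since `True` counts as 1 when summed or averaged. This is a sketch on a hypothetical toy frame rather than the real data; note that a missing age compares as `False`, so passengers without a recorded age are left out of the over-60 group.

```python
import pandas as pd

# Hypothetical toy data with the two columns used in this section
toy = pd.DataFrame({
    'Age': [62.0, 70.0, 65.0, 30.0, None],
    'Survived': [1, 0, 0, 1, 1],
})

over_60 = toy.Age > 60  # NaN > 60 evaluates to False

# Summing a boolean mask counts the True values
n_survivors = int((over_60 & (toy.Survived == 1)).sum())

# The mean of a 0/1 column is the fraction of ones
pct = toy.loc[over_60, 'Survived'].mean() * 100

print(n_survivors)
print(pct, '%')
```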


# Women and children first?
10. Find out if women and children were more likely to survive.

In [45]:
# Passengers with a missing Age fall into none of the three groups
kids = train[train.Age < 18]
women = train[(train.Age >= 18) & (train.Sex == 'female')]
men = train[(train.Age >= 18) & (train.Sex == 'male')]

print("% of children who survived:", kids[kids.Survived == 1].shape[0] / kids.shape[0] * 100, '%')
print("% of women who survived:", women[women.Survived == 1].shape[0] / women.shape[0] * 100, '%')
print("% of men who survived:", men[men.Survived == 1].shape[0] / men.shape[0] * 100, '%')

% of children who survived: 53.98230088495575 %
% of women who survived: 77.18446601941747 %
% of men who survived: 17.72151898734177 %
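
Because comparisons with a missing `Age` are `False`, passengers without a recorded age end up in none of the three groups above. One possible way to make that explicit is to label every row and aggregate with `groupby`. The sketch below assumes a hypothetical toy frame and a hypothetical `Group` column; it is not the notebook's own approach.

```python
import pandas as pd

# Hypothetical toy data with the columns used in this section
toy = pd.DataFrame({
    'Age': [5.0, 30.0, 40.0, 25.0, None, 10.0],
    'Sex': ['male', 'female', 'male', 'female', 'male', 'female'],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# Start with a catch-all label, then overwrite it where a condition holds
toy['Group'] = 'unknown age'
toy.loc[toy.Age < 18, 'Group'] = 'children'
toy.loc[(toy.Age >= 18) & (toy.Sex == 'female'), 'Group'] = 'women'
toy.loc[(toy.Age >= 18) & (toy.Sex == 'male'), 'Group'] = 'men'

# Mean of the 0/1 Survived column per group, as a percentage
rates = toy.groupby('Group')['Survived'].mean() * 100
print(rates)
```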


11. Write a function that returns the percentage of people that survived from a subset given as a boolean Pandas series.

In [49]:
def porcentaje_sobrevivir(df: pd.DataFrame, nombre_columna: str):
    """Return the percentage of people in df who survived.

    Args:
        df (pd.DataFrame): Dataset
        nombre_columna (str): Name of the binary column

    Returns:
        str: The survival percentage formatted as a string
    """
    # Rows where the column equals 1, divided by all rows in df
    return f'{df[df[nombre_columna]==1].shape[0] / df.shape[0] * 100} %'

porcentaje_sobrevivir(train, 'Survived')

'38.38383838383838 %'
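
The task asks for a function that receives the subset as a boolean pandas Series, whereas the function above takes a column name. A minimal sketch of the mask-based variant follows; the name `survival_rate`, the toy frame, and the choice to return `NaN` for an empty subset are assumptions made for illustration.

```python
import pandas as pd

def survival_rate(df: pd.DataFrame, mask: pd.Series) -> float:
    """Percentage of survivors among the rows where mask is True."""
    subset = df[mask]
    if subset.empty:
        # No rows selected: the percentage is undefined
        return float('nan')
    return subset['Survived'].mean() * 100

# Hypothetical toy data
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male'],
    'Survived': [1, 0, 0, 0],
})

female_rate = survival_rate(toy, toy.Sex == 'female')
empty_rate = survival_rate(toy, toy.Sex == 'other')
print(female_rate)
print(empty_rate)
```

With the real data, a call such as `survival_rate(train, train.Age < 18)` would give the figure for children.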

# Summarizing

12. What is the median age of the passengers?
13. How many passengers embarked from each port?

14. Generate two hypotheses about how the survival rate differs among groups of passengers. Write your code to explore both hypotheses.

In [None]:
print(train.Age.median())

print(train.Embarked.value_counts())

28.0
S    644
C    168
Q     77
Name: Embarked, dtype: int64
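
The cell above does not cover task 14. As a sketch only, two candidate hypotheses could be that the survival rate differs by ticket class and by port of embarkation; these hypotheses and the toy frame below are assumptions for illustration, not results from the dataset. Grouping by a column and averaging `Survived` gives the rate per group.

```python
import pandas as pd

# Hypothetical toy data with the relevant columns
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Embarked': ['C', 'S', 'S', 'S', 'Q'],
    'Survived': [1, 1, 0, 1, 0],
})

# Hypothesis 1 (assumed): survival rate differs by ticket class
by_class = toy.groupby('Pclass')['Survived'].mean() * 100

# Hypothesis 2 (assumed): survival rate differs by port of embarkation
by_port = toy.groupby('Embarked')['Survived'].mean() * 100

print(by_class)
print(by_port)
```

Replacing `toy` with `train` would run the same comparison on the full dataset.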
