# Reading data
Read the train.csv file as a pandas dataframe.

In [2]:
import pandas as pd

# Load the CSV file into a DataFrame
# (named df because every later cell refers to the DataFrame as df)
df = pd.read_csv("gitignoreclo/train.csv")

df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


# Indexing
1. Create a function that returns the name of a passenger given their PassengerId.
2. Create a function that returns the PassengerId of a passenger given their Name.
3. Print a message with the ID of passenger **Montvila, Rev. Juozas** with the following format: 'The ID of passenger Montvila, Rev. Juozas is ##'
4. Print a message with the name of the passenger with ID **42** with the following format: 'The passenger with ID 42 is X'
5. Print all information about the oldest passenger.

In [22]:
def id_nombre(passenger_id: int, df: pd.DataFrame) -> str:
    """Return the name of the passenger with the given ID (or a message if there is no match)."""
    pasajero = df.loc[df['PassengerId'] == passenger_id, 'Name']
    return pasajero.iloc[0] if not pasajero.empty else "ID not found"

def nombre_id(nombre: str, df: pd.DataFrame) -> int:
    """Return the ID of the passenger with the given name (or a message string if there is no match)."""
    pasajero = df.loc[df['Name'] == nombre, 'PassengerId']
    return pasajero.iloc[0] if not pasajero.empty else "Name not found"

# Get the ID of passenger "Montvila, Rev. Juozas"
id_montvila = nombre_id("Montvila, Rev. Juozas", df)
print(f"The ID of passenger Montvila, Rev. Juozas is {id_montvila}")

# Get the name of the passenger with ID 42
nombre_42 = id_nombre(42, df)
print(f"The passenger with ID 42 is {nombre_42}")

# Information about the oldest passenger
# (idxmax() skips missing ages and returns the row label of the maximum)
pasajero_mayor = df.loc[df['Age'].idxmax()]
print("Information about the oldest passenger:")
print(pasajero_mayor)


The ID of passenger Montvila, Rev. Juozas is 887
The passenger with ID 42 is Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)
Information about the oldest passenger:
PassengerId                                     631
Survived                                          1
Pclass                                            1
Name           Barkworth, Mr. Algernon Henry Wilson
Sex                                            male
Age                                            80.0
SibSp                                             0
Parch                                             0
Ticket                                        27042
Fare                                           30.0
Cabin                                           A23
Embarked                                          S
Name: 630, dtype: object
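
The two lookup functions above scan a column with a boolean mask on every call and return a message string when nothing matches. As an alternative sketch (not the approach used in this notebook), the lookup can go through an index, with `Series.get` returning `None` for a missing key. The three-row `sample` frame below copies values from the table printed earlier; `names_by_id` and `ids_by_name` are hypothetical helper names.

```python
import pandas as pd

# Three rows copied from the table displayed above (not the full train.csv)
sample = pd.DataFrame({
    "PassengerId": [1, 3, 887],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Montvila, Rev. Juozas"],
})

# Build one Series per lookup direction; the index holds the lookup key
names_by_id = sample.set_index("PassengerId")["Name"]
ids_by_name = sample.set_index("Name")["PassengerId"]

# Series.get returns None (its default) instead of raising when the key is absent
print(names_by_id.get(887))
print(ids_by_name.get("Montvila, Rev. Juozas"))
print(names_by_id.get(42))  # ID 42 is not in this three-row sample
```

Returning `None` keeps the "not found" case distinguishable from a real name, which a message string does not.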


# Subsetting
We are asked to share data for analysis by a third party. Since our dataset contains personal details, we only want to share with them the following information: ticket classes, fares and port of embarkation. We are asked to deliver a sample of the first 100 rows of this dataset.

6. Create and save the new dataset in **data/port_fares.csv**.

In [21]:
# Select the columns we want to share (ticket class, fare, port of embarkation)
# .loc slices by label and includes the end label, so :99 gives rows 0-99 (100 rows)
subset_df = df.loc[:99, ['Pclass', 'Fare', 'Embarked']]

# Save the new dataset as 'port_fares.csv' (written to 'gitignoreclo/' here, not 'data/')
subset_df.to_csv('gitignoreclo/port_fares.csv', index=False)

print("New dataset saved to 'gitignoreclo/port_fares.csv'.")

New dataset saved to 'gitignoreclo/port_fares.csv'.
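
The subset above relies on `df.loc[:99, ...]` returning exactly 100 rows. That holds while the DataFrame keeps its default 0, 1, 2, ... index, because `.loc` slices by label and includes the end label. A small sketch of the difference, using the `Pclass` and `Fare` values of the first five rows displayed earlier (the `toy` frame and variable names are only for illustration):

```python
import pandas as pd

# Pclass and Fare of rows 0-4 from the table displayed at the top of the notebook
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# With the default index, label slicing and position slicing agree
by_label = toy.loc[:2, ["Pclass", "Fare"]]   # labels 0, 1, 2 (end label included)
by_position = toy.iloc[:3]                   # positions 0, 1, 2 (end position excluded)
print(len(by_label), len(by_position))

# After reordering, labels no longer match positions
by_fare = toy.sort_values("Fare")            # index order becomes 0, 2, 4, 3, 1
print(len(by_fare.loc[:2]))                  # stops at label 2, wherever it now sits
print(len(by_fare.head(3)))                  # always the first three rows
```

If the frame might have been sorted or filtered beforehand, `head(100)` or `.iloc[:100]` is the safer way to take "the first 100 rows".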


# Counting
7. We want to know if there were any survivors over the age of 60. Print all of their information.
8. How many people over 60 survived?
9. What percentage of people over 60 survived?

In [20]:
# Filter passengers older than 60
over_60 = df[df['Age'] > 60]

# Keep those who survived (Survived == 1)
survivors_over_60 = over_60[over_60['Survived'] == 1]

# Print all information about the survivors over 60
print("Survivors over the age of 60:")
print(survivors_over_60)

# Count how many survived
count_survivors_over_60 = survivors_over_60.shape[0]
print(f"Number of survivors over 60: {count_survivors_over_60}")

# Compute the percentage of survivors over 60 relative to all passengers over 60
total_over_60 = over_60.shape[0]
percentage_survivors_over_60 = (count_survivors_over_60 / total_over_60) * 100
print(f"Percentage of survivors over 60: {percentage_survivors_over_60:.2f}%")

Survivors over the age of 60:
     PassengerId  Survived  Pclass                                       Name  \
275          276         1       1          Andrews, Miss. Kornelia Theodosia   
483          484         1       3                     Turkula, Mrs. (Hedwig)   
570          571         1       2                         Harris, Mr. George   
630          631         1       1       Barkworth, Mr. Algernon Henry Wilson   
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)   

        Sex   Age  SibSp  Parch       Ticket     Fare Cabin Embarked  
275  female  63.0      1      0        13502  77.9583    D7        S  
483  female  63.0      0      0         4134   9.5875   NaN        S  
570    male  62.0      0      0  S.W./PP 752  10.5000   NaN        S  
630    male  80.0      0      0        27042  30.0000   A23        S  
829  female  62.0      0      0       113572  80.0000   B28      NaN  
Number of survivors over 60: 5
Percentage of survivors ov

# Women and children first?
10. Find out if women and children were more likely to survive.

In [19]:
# Filter women (Sex == 'female')
women = df[df['Sex'] == 'female']

# Filter children (Age < 18); passengers with a missing age are excluded
children = df[df['Age'] < 18]

# Compute the survival rate for women
survival_rate_women = women['Survived'].mean() * 100
print(f"Survival rate for women: {survival_rate_women:.2f}%")

# Compute the survival rate for children
survival_rate_children = children['Survived'].mean() * 100
print(f"Survival rate for children: {survival_rate_children:.2f}%")

# Compute the overall survival rate
survival_rate_overall = df['Survived'].mean() * 100
print(f"Overall survival rate: {survival_rate_overall:.2f}%")

# Check whether women and children were more likely to survive than the overall average
if survival_rate_women > survival_rate_overall:
    print("Women were more likely to survive than the overall average.")
else:
    print("Women were not more likely to survive than the overall average.")

if survival_rate_children > survival_rate_overall:
    print("Children were more likely to survive than the overall average.")
else:
    print("Children were not more likely to survive than the overall average.")

Survival rate for women: 74.20%
Survival rate for children: 53.98%
Overall survival rate: 38.38%
Women were more likely to survive than the overall average.
Children were more likely to survive than the overall average.


11. Write a function that returns the percentage of people that survived from a subset given as a boolean Pandas series.

In [23]:
def survival_rate(subset: pd.Series, df: pd.DataFrame) -> float:
    """
    Returns the percentage of people that survived from a given subset.
    
    Parameters:
    subset (pd.Series): A boolean series indicating which rows to consider.
    df (pd.DataFrame): The DataFrame containing the Titanic dataset.
    
    Returns:
    float: Survival percentage of the subset.
    """
    subset_df = df[subset]  # Apply the boolean mask to filter the dataset
    if subset_df.empty:
        return 0.0  # Avoid division by zero if the subset is empty
    
    survival_percentage = subset_df['Survived'].mean() * 100
    return round(survival_percentage, 2)  # Round to 2 decimal places

# Example usage:
df = pd.read_csv("gitignoreclo/train.csv")

# Survival rate for women
women_subset = df['Sex'] == 'female'
print(f"Survival rate for women: {survival_rate(women_subset, df)}%")

# Survival rate for children
children_subset = df['Age'] < 18
print(f"Survival rate for children: {survival_rate(children_subset, df)}%")

# Survival rate for first-class passengers
first_class_subset = df['Pclass'] == 1
print(f"Survival rate for first-class passengers: {survival_rate(first_class_subset, df)}%")

Survival rate for women: 74.2%
Survival rate for children: 53.98%
Survival rate for first-class passengers: 62.96%
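
Because `survival_rate` accepts any boolean Series, masks can be combined with `|`, `&` and `~`. The sketch below treats "women and children" as one group and compares it with its complement (adult men and passengers of unknown age) rather than with the overall average, which already includes the group itself. To stay self-contained it redefines a shortened copy of the function and uses only the nine rows displayed at the top of the notebook, so the printed rates describe that tiny sample, not the full dataset.

```python
import pandas as pd

def survival_rate(subset: pd.Series, df: pd.DataFrame) -> float:
    """Shortened copy of the function above: survival percentage of the masked rows."""
    subset_df = df[subset]
    if subset_df.empty:
        return 0.0
    return round(subset_df["Survived"].mean() * 100, 2)

# Survived, Sex and Age of the nine rows shown in the first output cell
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0, 1],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 19.0, None, 26.0],
})

# A missing Age compares as False, so that passenger only enters the group through Sex
women_or_children = (sample["Sex"] == "female") | (sample["Age"] < 18)

print(f"Women or children: {survival_rate(women_or_children, sample)}%")
print(f"Everyone else:     {survival_rate(~women_or_children, sample)}%")
```

On the full data, the same two calls with `df` in place of `sample` give a direct group-versus-rest comparison.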


# Summarizing

12. What is the median age of the passengers?
13. How many passengers embarked from each port?

In [24]:
# Compute the median age of the passengers (median() ignores missing ages)
median_age = df['Age'].median()
print(f"The median age of the passengers is: {median_age:.2f}")

# Count how many passengers embarked from each port
embarked_counts = df['Embarked'].value_counts()
print("\nNumber of passengers who embarked from each port:")
print(embarked_counts)

The median age of the passengers is: 28.00

Number of passengers who embarked from each port:
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64
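
The three counts above add up to 889. `value_counts()` leaves out missing values by default, so any passenger without a recorded port (row 829 in the survivors table, for example) is not counted. A minimal sketch of how to make the missing entries visible, using the `Embarked` values of the five survivors over 60 printed earlier:

```python
import pandas as pd

# Embarked values of rows 275, 483, 570, 630 and 829 from the survivors output
embarked = pd.Series(["S", "S", "S", "S", None], name="Embarked")

default_counts = embarked.value_counts()             # NaN is dropped
with_missing = embarked.value_counts(dropna=False)   # NaN gets its own row
n_missing = int(embarked.isna().sum())

print(default_counts)
print(with_missing)
print(f"Passengers with no port recorded: {n_missing}")
```

Running `df['Embarked'].value_counts(dropna=False)` on the full dataset would show whether missing ports explain the gap between the summed counts and the number of rows.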


14. Generate two hypotheses about how the survival rate differs among groups of passengers. Write your code to explore both hypotheses.

# Hypotheses

1. First-class passengers had a higher survival rate than lower-class passengers.
   - Hypothesis: Passengers in first class (Pclass = 1) had a greater chance of survival than those in second or third class (Pclass = 2 or 3).
2. Women had a higher survival rate than men.
   - Hypothesis: The survival rate for female passengers was higher than that of male passengers, consistent with the "women and children first" rule.
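
One possible way to explore both hypotheses is to group by the column of interest and average `Survived`, which gives a survival rate per group. The sketch below is only an illustration of that technique: it runs on the nine rows displayed at the top of the notebook, so its numbers cannot confirm or reject either hypothesis. The helper name `rate_by` is hypothetical.

```python
import pandas as pd

# Survived, Pclass and Sex of the nine rows shown in the first output cell
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3, 1],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "female", "male"],
})

def rate_by(column: str, df: pd.DataFrame) -> pd.Series:
    """Survival percentage for each distinct value of `column`."""
    return (df.groupby(column)["Survived"].mean() * 100).round(2)

# Hypothesis 1: survival rate by ticket class
by_class = rate_by("Pclass", sample)
print(by_class)

# Hypothesis 2: survival rate by sex
by_sex = rate_by("Sex", sample)
print(by_sex)

# Group sizes matter when reading the rates: a group with one passenger is not evidence
print(sample.groupby("Pclass")["Survived"].agg(["mean", "count"]))
```

Replacing `sample` with `df` applies the same grouping to the full dataset; the `count` column shows how many passengers each rate is based on.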
