<font color='darkorange'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file and *app\.py* file as instructed. *These exercises use the same [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) as the lab.*


## Exercise 1: Survival Patterns


For this exercise you will analyze survival patterns on the Titanic by looking at passenger class, sex, and age group. Name the function `survival_demographics()`.

1. Create a new column in the Titanic dataset that classifies passengers into age categories (i.e., a pandas `category` series). The categories should be:
    - Child (up to 12)
    - Teen (13–19)
    - Adult (20–59)
    - Senior (60+)  
  
	Hint: The `pd.cut()` function might come in handy here.

2. Group the passengers by class, sex, and age group.  

3. For each group, calculate:  
    - The total number of passengers, `n_passengers`
    - The number of survivors, `n_survivors`
    - The survival rate, `survival_rate`

4. Return a table that includes the results for *all* combinations of class, sex, and age group.  

5. Order the results so they are easy to interpret.  

6. Come up with a clear question that your results table makes you curious about (e.g., “Did women in first class have a higher survival rate than men in other classes?”). Write this question in your `app.py` file above the call to your visualization function, using `st.write("Your Question Here")`.
   
7. Create a Plotly visualization in a function named `visualize_demographic()` that directly addresses your question by returning a Plotly figure (e.g., `fig = px. ...`). You are free to choose the chart type that you think best communicates the findings. Be creative — try different approaches, compare them, and ensure that your chart clearly answers the question you posed.
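
For step 1, it can help to check how `pd.cut()` treats the boundary ages before applying it to the full dataset. The sketch below uses a small made-up series of ages (not the Titanic data) and assumes right-inclusive bins, so 12 falls in Child and 13 in Teen. Fractional ages such as 12.5 would land in Teen under this choice.

```python
import pandas as pd

# Toy ages chosen to sit on the category boundaries (illustrative only)
ages = pd.Series([0.42, 12, 13, 19, 20, 59, 60, 80, None])

age_group = pd.cut(
    ages,
    bins=[0, 12, 19, 59, float("inf")],           # intervals (0, 12], (12, 19], (19, 59], (59, inf)
    labels=["Child", "Teen", "Adult", "Senior"],
    right=True,                                    # each bin includes its right edge
)

print(age_group.tolist())
# ['Child', 'Child', 'Teen', 'Teen', 'Adult', 'Adult', 'Senior', 'Senior', nan]
```

A missing age stays missing after binning, so those rows need to be handled explicitly before or after grouping.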


In [1]:
import plotly.express as px
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')


In [2]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
import pandas as pd

def survival_demographics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze survival patterns on the Titanic by class, sex, and age group.

    Parameters:
        df (pd.DataFrame): Titanic dataset containing at least 'Age', 'Pclass', 'Sex', 'Survived'.

    Returns:
        pd.DataFrame: Summary table with Pclass, Sex, AgeGroup, n_passengers, n_survivors, survival_rate
    """

    # Step 1: Create age categories on a copy so the caller's DataFrame is untouched
    df = df.copy()
    age_bins = [0, 12, 19, 59, float('inf')]
    age_labels = ['Child', 'Teen', 'Adult', 'Senior']
    df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=True)

    # Drop rows with missing AgeGroup (due to missing Age)
    df = df.dropna(subset=['AgeGroup'])

    # Step 2: Group by class, sex, and age group.
    # observed=False keeps every combination, including ones with no passengers
    group_cols = ['Pclass', 'Sex', 'AgeGroup']
    grouped = df.groupby(group_cols, observed=False)

    # Step 3: Aggregate total passengers and survivors
    result = grouped['Survived'].agg(
        n_passengers='count',
        n_survivors='sum'
    ).reset_index()

    # Step 4: Calculate survival rate
    result['survival_rate'] = result['n_survivors'] / result['n_passengers']

    # Step 5: Sort for readability
    result = result.sort_values(by=['Pclass', 'Sex', 'AgeGroup'])

    return result

In [4]:
# Load the data, then build and display the summary table
url = "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv"
df = pd.read_csv(url)

summary_df = survival_demographics(df)
summary_df


Unnamed: 0,Pclass,Sex,AgeGroup,n_passengers,n_survivors,survival_rate
0,1,female,Child,1,0,0.0
1,1,female,Teen,13,13,1.0
2,1,female,Adult,68,66,0.970588
3,1,female,Senior,3,3,1.0
4,1,male,Child,3,3,1.0
5,1,male,Teen,4,1,0.25
6,1,male,Adult,80,34,0.425
7,1,male,Senior,14,2,0.142857
8,2,female,Child,8,8,1.0
9,2,female,Teen,8,8,1.0
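
Step 4 asks for *all* combinations of the grouping columns. With a categorical column such as `AgeGroup`, whether empty combinations appear depends on the `observed` argument of `groupby()`, and recent pandas versions warn when it is left unset. The sketch below uses a made-up three-row table (not the Titanic data) to compare the two settings.

```python
import pandas as pd

# Toy data: nobody in class 2 is a Child
toy = pd.DataFrame({
    "Pclass": [1, 1, 2],
    "AgeGroup": pd.Categorical(
        ["Child", "Adult", "Adult"], categories=["Child", "Adult"]
    ),
    "Survived": [1, 0, 1],
})

# observed=False: every class/age-group combination gets a row
all_combos = toy.groupby(["Pclass", "AgeGroup"], observed=False)["Survived"].agg(
    n_passengers="count", n_survivors="sum"
).reset_index()

# observed=True: only combinations that actually occur in the data
observed_only = toy.groupby(["Pclass", "AgeGroup"], observed=True)["Survived"].agg(
    n_passengers="count", n_survivors="sum"
).reset_index()

print(len(all_combos), len(observed_only))
```

With `observed=False` the empty class 2 / Child group appears with zero passengers, so its survival rate (0 divided by 0) is undefined and shows up as `NaN`.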


In [5]:
import plotly.express as px
import pandas as pd

def visualize_demographic(summary_df: pd.DataFrame):
    """
    Create a Plotly bar chart showing survival rates across
    passenger class, sex, and age group.

    Parameters:
        summary_df (pd.DataFrame): Output of survival_demographics()

    Returns:
        plotly.graph_objs._figure.Figure: A Plotly Figure object
    """
    fig = px.bar(
        summary_df,
        x="Pclass",
        y="survival_rate",
        color="AgeGroup",
        barmode="group",
        facet_col="Sex",
        category_orders={
            "Pclass": [1, 2, 3],
            "AgeGroup": ['Child', 'Teen', 'Adult', 'Senior'],
            "Sex": ["male", "female"]
        },
        labels={
            "Pclass": "Passenger Class",
            "survival_rate": "Survival Rate",
            "AgeGroup": "Age Group"
        },
        title="Survival Rate by Class, Sex, and Age Group"
    )

    fig.update_layout(
        yaxis=dict(tickformat=".0%"),
        legend_title_text='Age Group',
        height=500,
        template="plotly_white"
    )

    return fig


In [6]:
url = "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv"
df = pd.read_csv(url)

summary_df = survival_demographics(df)

fig = visualize_demographic(summary_df)
fig.show()  # OR st.plotly_chart(fig) in Streamlit


## Exercise 2: Family Size and Wealth

Using the Titanic dataset, write a function named `family_groups()` to explore the relationship between family size, passenger class, and ticket fare.  

1. Create a new column in the Titanic dataset that represents the total family size for each passenger, `family_size`. Family size is defined as the number of siblings/spouses aboard plus the number of parents/children aboard, plus the passenger themselves.

2. Group the passengers by family size and passenger class. For each group, calculate:  
   - The total number of passengers, `n_passengers`
   - The average ticket fare, `avg_fare`
   - The minimum and maximum ticket fares (to capture variation in wealth), `min_fare` and `max_fare`

3. Return a table with these results, sorted so that the values are clear and easy to interpret (for example, by class and then family size).

4. Write a function called `last_names()` that extracts the last name of each passenger from the `Name` column, and returns the count for each last name (i.e., a pandas series with last name as index, and count as value). Does this result agree with that of the data table above? Share your findings in your app using `st.write`.

5. Just like you did in Exercise 1, come up with a clear question that your results make you curious about. Write this question in your `app.py` file above the call to your visualization function. Then, create a Plotly visualization in a function named `visualize_families()` that directly addresses your question. As in Exercise 1, you are free to choose the chart type that you think best communicates the findings.

In [None]:
def family_groups(df: pd.DataFrame) -> pd.DataFrame:
    """
    Adds a family_size column, groups by family size and class,
    and computes fare statistics.

    Parameters:
        df (pd.DataFrame): Titanic dataset

    Returns:
        pd.DataFrame: Grouped summary with passenger count and fare stats
    """

    # Ensure necessary columns are present
    required_cols = ['SibSp', 'Parch', 'Pclass', 'Fare']
    for col in required_cols:
        if col not in df.columns:
            raise ValueError(f"DataFrame must contain '{col}' column.")
        
    # Step 1: Add family_size = SibSp + Parch + 1 (self), working on a copy
    # so the caller's DataFrame is not modified
    df = df.copy()
    df['family_size'] = df['SibSp'] + df['Parch'] + 1

    # Step 2: Group by family size and class
    grouped = df.groupby(['family_size', 'Pclass'])

    # Step 3: Aggregate values ('size' counts every passenger, even if Fare is missing)
    result = grouped['Fare'].agg(
        n_passengers='size',
        avg_fare='mean',
        min_fare='min',
        max_fare='max'
    ).reset_index()

    # Step 4: Sort for readability
    result = result.sort_values(by=['Pclass', 'family_size'])

    return result

In [8]:
def last_names(df: pd.DataFrame) -> pd.Series:
    """
    Extracts last names from Name column and counts frequency.

    Parameters:
        df (pd.DataFrame): Titanic dataset

    Returns:
        pd.Series: Last name as index, count as values
    """

    # Extract the last name (the text before the first comma) without modifying df
    last_name = df['Name'].str.split(',').str[0].str.strip().rename('LastName')

    # Count last names
    last_name_counts = last_name.value_counts()

    return last_name_counts

In [9]:
url = "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv"
df = pd.read_csv(url)

In [10]:
# Part 1
family_df = family_groups(df)
print(family_df)

    family_size  Pclass  n_passengers    avg_fare  min_fare  max_fare
0             1       1           109   63.672514    0.0000  512.3292
3             2       1            70   91.848039   29.7000  512.3292
6             3       1            24   95.681075   26.2833  211.5000
9             4       1             7  133.521429  120.0000  151.5500
12            5       1             2  262.375000  262.3750  262.3750
15            6       1             4  263.000000  263.0000  263.0000
1             1       2           104   14.066106    0.0000   73.5000
4             2       2            34   24.682962   11.5000   33.0000
7             3       2            31   31.693819   13.0000   73.5000
10            4       2            13   36.575969   11.5000   65.0000
13            5       2             1   23.000000   23.0000   23.0000
16            6       2             1   18.750000   18.7500   18.7500
2             1       3           324    9.272052    0.0000   56.4958
5             2     

In [11]:
# Part 2
last_name_counts = last_names(df)
print(last_name_counts.head(10))

LastName
Andersson    9
Sage         7
Skoog        6
Panula       6
Carter       6
Goodwin      6
Johnson      6
Rice         5
Fortune      4
Williams     4
Name: count, dtype: int64
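
One way to approach the question in step 4 (whether last-name counts agree with `family_size`) is to put both numbers side by side for each last name. The sketch below is only an illustration on a made-up four-row table. The names and the `comparison` table are hypothetical, not part of the assignment.

```python
import pandas as pd

# Toy passengers: two Smiths travel together, a third Smith travels alone
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Jane", "Smith, Mr. Alan", "Jones, Miss. Amy"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
})

last_name = toy["Name"].str.split(",").str[0].str.strip()
family_size = toy["SibSp"] + toy["Parch"] + 1

# For each last name: how often it appears vs. the largest family_size reported
comparison = pd.DataFrame({
    "name_count": last_name.value_counts(),
    "max_family_size": family_size.groupby(last_name).max(),
})
print(comparison)
```

Here "Smith" appears three times while the largest reported family size is two, which shows how unrelated passengers who share a surname can make the two measures disagree.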


In [14]:
def visualize_families(family_df: pd.DataFrame):
    """
    Create a Plotly line chart showing the relationship between
    family size and average fare, broken down by passenger class.

    Parameters:
        family_df (pd.DataFrame): Output from family_groups()

    Returns:
        plotly.graph_objs._figure.Figure: A Plotly Figure
    """
    fig = px.line(
        family_df,
        x="family_size",
        y="avg_fare",
        color="Pclass",
        markers=True,
        labels={
            "family_size": "Family Size",
            "avg_fare": "Average Fare",
            "Pclass": "Passenger Class"
        },
        title="Average Fare by Family Size and Passenger Class"
    )

    fig.update_layout(
        template="plotly_white",
        hovermode="x unified",
        legend_title_text='Class',
        height=500
    )

    return fig


In [15]:
url = "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv"
df = pd.read_csv(url)

# Generate processed data
family_df = family_groups(df)

# Generate plot
fig = visualize_families(family_df)
fig.show()  # or st.plotly_chart(fig) in Streamlit

## Bonus Question

Add a new column, `older_passenger`, to the Titanic dataset that indicates whether each passenger’s age is above the median age for *their* passenger class. So, suppose row $x$ is in passenger class 2. Then, a value of `True` at row $x$ would indicate that the passenger is older than 50% of class 2 passengers, and `False` would indicate that they are not.

- You should use pandas functions to accomplish this.
- The new column should contain Boolean values (True if the age is above the median, False if less than or equal to).
- Return the updated table in the function `determine_age_division()`

Once you’ve created this column, consider how this age division relates to your analysis above. Try to visualize this analysis in Plotly using the function name `visualize_age_division()`.
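
A minimal sketch of one possible approach, assuming the per-class median is computed with `groupby(...).transform("median")` so that it lines up row by row with the original table. The function body and the toy data are illustrative assumptions, not a reference solution.

```python
import pandas as pd

def determine_age_division(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a Boolean older_passenger column."""
    out = df.copy()
    # Median age of each passenger's own class, aligned to the original rows
    class_median = out.groupby("Pclass")["Age"].transform("median")
    # True only when the age is strictly above the class median;
    # rows with a missing Age compare as False
    out["older_passenger"] = out["Age"] > class_median
    return out

# Toy data (not the Titanic dataset)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Age": [20.0, 40.0, 60.0, 10.0, 30.0, None],
})
result = determine_age_division(toy)
print(result["older_passenger"].tolist())
# [False, False, True, False, True, False]
```

Note that a passenger with a missing age ends up as `False` under this sketch. Depending on the analysis, it may be preferable to drop or flag those rows instead.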