<font color='darkorange'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file and *app\.py* file as instructed. *These exercises use the same [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) as the lab.*


## Exercise 1: Survival Patterns


For this exercise you will analyze survival patterns on the Titanic by looking at passenger class, sex, and age group. Name the function `survival_demographics()`.

1. Create a new column in the Titanic dataset that classifies passengers into age categories (i.e., a pandas category series). The categories should be:

        Child (up to 12)
        Teen (13–19)
        Adult (20–59)
        Senior (60+)
Hint: The `pd.cut()` function might come in handy here.

2. Group the passengers by class, sex, and age group.

3. For each group, calculate:

        The total number of passengers, n_passengers
        The number of survivors, n_survivors
        The survival rate, survival_rate
4. Return a table that includes the results for all combinations of class, sex, and age group.

5. Order the results so they are easy to interpret.

6. Come up with a clear question that your results table makes you curious about (e.g., “Did women in first class have a higher survival rate than men in other classes?”). Write this question in your *app.py* file above the call to your visualization function, using `st.write("Your Question Here")`.

7. Create a Plotly visualization in a function named `visualize_demographic()` that directly addresses your question by returning a Plotly figure (e.g., `fig = px. ...`). You are free to choose the chart type that you think best communicates the findings. Be creative: try different approaches, compare them, and ensure that your chart clearly answers the question you posed.
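
As a rough illustration of steps 1 to 3, the sketch below bins a handful of made-up ages with `pd.cut()` and then aggregates by the resulting category. The toy ages and `Survived` values are invented for illustration and are not taken from the Titanic data; the open-ended upper bin edge (`float('inf')`) is one possible choice, not a requirement of the exercise.

```python
import pandas as pd

# Toy ages chosen to sit on and around the category boundaries (hypothetical data)
ages = pd.Series([0.5, 12, 13, 19, 20, 59, 60, 80, None])

# With right=True each bin includes its upper edge: (0, 12], (12, 19], (19, 59], (59, inf)
age_group = pd.cut(
    ages,
    bins=[0, 12, 19, 59, float('inf')],
    labels=['Child', 'Teen', 'Adult', 'Senior'],
    right=True,
)

toy = pd.DataFrame({
    'age_group': age_group,
    'Survived': [1, 1, 0, 1, 0, 0, 1, 0, 1],
})

# observed=False keeps every category in the result, even empty ones;
# rows with a missing age fall into no bin and are left out of the groups
summary = toy.groupby('age_group', observed=False).agg(
    n_passengers=('Survived', 'size'),
    n_survivors=('Survived', 'sum'),
).reset_index()
summary['survival_rate'] = summary['n_survivors'] / summary['n_passengers']
print(summary)
```

Note that with these edges a fractional age such as 12.5 lands in the Teen bin, so it is worth checking how the boundaries treat non-integer ages in the real data.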

In [None]:
import streamlit as st
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')

def survival_demographics():
    """
    Analyze survival patterns by passenger class, sex, and age group.
    Returns a DataFrame with passenger counts, survivor counts, and survival rates.
    """
    # Age categories
    bins = [0, 12, 19, 59, 100]
    labels = ['Child', 'Teen', 'Adult', 'Senior']
    df['age_group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True)
    
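    # Group by class, sex, and age group. observed=False keeps every
    # combination in the result, including age groups with no passengers.
    grouped = df.groupby(['Pclass', 'Sex', 'age_group'], observed=False).agg(
        n_passengers=('PassengerId', 'count'),
        n_survivors=('Survived', 'sum')
    ).reset_index()
    
    # Survival rate (NaN for combinations with no passengers)
    grouped['survival_rate'] = (grouped['n_survivors'] / grouped['n_passengers']).round(3)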
    
    grouped = grouped.sort_values(['Pclass', 'Sex', 'age_group'])
    
    return grouped

def visualize_demographic():
    """
    Create a visualization with distinct colors for men and women
    showing survival rates across passenger classes and age groups.
    """
    data = survival_demographics()
    
    fig = px.bar(
        data,
        x='Pclass',
        y='survival_rate',
        color='Sex',
        facet_col='age_group',
        facet_col_wrap=2,
        title='Titanic Survival Rates: Women vs Men Across Classes and Age Groups',
        labels={
            'survival_rate': 'Survival Rate',
            'Pclass': 'Passenger Class',
            'Sex': 'Gender',
            'age_group': 'Age Group'
        },
        category_orders={
            'Pclass': [1, 2, 3],
            'Sex': ['female', 'male'],
            'age_group': ['Child', 'Teen', 'Adult', 'Senior']
        },
        barmode='group',
        color_discrete_map={
            'female': '#FF6B9C',  
            'male': '#4A90E2'    
        }
    )
    
    # Customize the layout
    fig.update_layout(
        height=600,
        showlegend=True,
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        )
    )
    
    # Apply the percent format and range to every facet's y-axis,
    # not only the first one
    fig.update_yaxes(tickformat='.0%', range=[0, 1.1])
    
    fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
    
    # Label each bar with its survival rate. Setting the text on the traces
    # works in every facet, whereas hand-placed annotations would have to
    # track Plotly's axis numbering for wrapped facets.
    fig.update_traces(texttemplate='%{y:.0%}', textposition='outside')
    
    return fig

visualize_demographic()


In [None]:
def visualize_gender_comparison():
    data = survival_demographics()
    
    # Calculate the difference in survival rates between women and men
    women_data = data[data['Sex'] == 'female'].set_index(['Pclass', 'age_group'])
    men_data = data[data['Sex'] == 'male'].set_index(['Pclass', 'age_group'])
    
    comparison_data = women_data[['survival_rate']].copy()
    comparison_data.columns = ['women_survival_rate']
    comparison_data['men_survival_rate'] = men_data['survival_rate']
    comparison_data['survival_difference'] = comparison_data['women_survival_rate'] - comparison_data['men_survival_rate']
    comparison_data = comparison_data.reset_index()
    
    fig = px.bar(
        comparison_data,
        x='survival_difference',
        y='age_group',
        color='survival_difference',
        facet_col='Pclass',
        title='Survival Advantage: Women vs Men Across Classes and Age Groups',
        labels={
            'survival_difference': 'Survival Rate Advantage for Women',
            'age_group': 'Age Group',
            'Pclass': 'Passenger Class'
        },
        color_continuous_scale='RdYlBu',
        range_color=[-1, 1]
    )
    
    fig.update_layout(
        height=500,
        showlegend=False
    )
    
    # Format every facet's x-axis as a percentage, not only the first one
    fig.update_xaxes(tickformat='.0%')
    
    fig.for_each_annotation(lambda a: a.update(text=f"Class {a.text.split('=')[1]}"))
    
    for i in range(1, 4):
        fig.add_vline(x=0, line_dash="dash", line_color="black", 
                     row=1, col=i)
    
    return fig
visualize_gender_comparison()





In [None]:

st.write(
'''
# Titanic Visualization 1 - Demographic Analysis

**Research Question:** How did the survival rates for women and children compare across different passenger classes, and were there situations where class privilege significantly altered expected survival patterns?

'''
)

fig1 = visualize_demographic()
st.plotly_chart(fig1, use_container_width=True)

st.write(
'''
# Titanic Visualization 2 - Gender Comparison
'''
)

fig2 = visualize_gender_comparison()
st.plotly_chart(fig2, use_container_width=True)

# Data table
st.write("### Underlying Data")
demographic_data = survival_demographics()
st.dataframe(demographic_data)

st.write("### Key Insights")
st.write("""
- **Women's Advantage**: Across all classes and age groups, women had significantly higher survival rates than men
- **Class Matters**: First Class women had near-perfect survival rates (75-100%), while Third Class women had more varied outcomes
- **Children First**: The 'women and children first' protocol is evident, but class privilege amplified its effects
- **Notable Exception**: Third Class men had extremely low survival rates across all age groups
- **Color Coding**: Pink bars represent women, blue bars represent men - making gender comparisons immediate and clear
""")









## Exercise 2: Family Size and Wealth

Using the Titanic dataset, write a function named `family_groups()` to explore the relationship between family size, passenger class, and ticket fare.  

1. Create a new column in the Titanic dataset that represents the total family size for each passenger, `family_size`. Family size is defined as the number of siblings/spouses aboard plus the number of parents/children aboard, plus the passenger themselves.

2. Group the passengers by family size and passenger class. For each group, calculate:  
   - The total number of passengers, `n_passengers`
   - The average ticket fare, `avg_fare`
   - The minimum and maximum ticket fares (to capture variation in wealth), `min_fare` and `max_fare`

3. Return a table with these results, sorted so that the values are clear and easy to interpret (for example, by class and then family size).

4. Write a function called `last_names()` that extracts the last name of each passenger from the `Name` column, and returns the count for each last name (i.e., a pandas series with last name as index, and count as value). Does this result agree with that of the data table above? Share your findings in your app using `st.write`.

5. Just like you did in Exercise 1, come up with a clear question that your results make you curious about. Write this question in your *app.py* file above the call to your visualization function. Then, create a Plotly visualization in a function named `visualize_families()` that directly addresses your question. As in Exercise 1, you are free to choose the chart type that you think best communicates the findings.
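
To make the family-size definition and the last-name extraction concrete, here is a minimal sketch on three invented passengers. The names and counts are hypothetical; the only assumption carried over from the dataset is that `Name` follows a "Last, Title. First" pattern.

```python
import pandas as pd

# Hypothetical passengers in the same "Last, Title. First" format as the Name column
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Smith, Mrs. Anna', 'Jones, Miss. Eva'],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
})

# Siblings/spouses + parents/children + the passenger themselves
toy['family_size'] = toy['SibSp'] + toy['Parch'] + 1

# The last name is everything before the first comma
last_name = toy['Name'].str.split(',').str[0].str.strip()

# value_counts() returns a Series with last name as index and count as value
counts = last_name.value_counts()
print(toy[['Name', 'family_size']])
print(counts)
```

In this toy case the surname counts match the family sizes, but that need not hold in the real data: unrelated passengers can share a surname, and relatives travelling together can have different ones.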

In [None]:
def family_groups():
    """
    Analyze family size, passenger class, and ticket fare relationships.
    """
    df['family_size'] = df['SibSp'] + df['Parch'] + 1
    
    grouped = df.groupby(['family_size', 'Pclass']).agg(
        n_passengers=('PassengerId', 'count'),
        avg_fare=('Fare', 'mean'),
        min_fare=('Fare', 'min'),
        max_fare=('Fare', 'max')
    ).reset_index()
    
    grouped['avg_fare'] = grouped['avg_fare'].round(2)
    grouped['min_fare'] = grouped['min_fare'].round(2)
    grouped['max_fare'] = grouped['max_fare'].round(2)
    
    grouped = grouped.sort_values(['Pclass', 'family_size'])
    
    return grouped

In [9]:
def last_names():
    """
    Extract last names from the Name column and return counts for each last name.
    """
    # Extract last names (everything before the comma)
    df['last_name'] = df['Name'].str.split(',').str[0].str.strip()
    
    # Count occurrences of each last name; value_counts() returns a Series
    # with the last name as index and the count as value
    last_name_counts = df['last_name'].value_counts()
    
    return last_name_counts

st.write("""
# Family Analysis of Titanic Survival

**Research Question:** How did family size and passenger class interact to affect ticket fares, and were larger families concentrated in specific classes with distinct fare patterns?
""")



In [11]:
def visualize_families():
    """
    Create a visualization exploring the relationship between family size, 
    passenger class, and ticket fares.
    """
    data = family_groups()
    # Cast class to string so Plotly assigns discrete colors; a numeric
    # column would be mapped to a continuous color scale instead
    data['Pclass'] = data['Pclass'].astype(str)
    class_colors = {
        '1': '#1f77b4',  # Blue for 1st class
        '2': '#ff7f0e',  # Orange for 2nd class
        '3': '#2ca02c'   # Green for 3rd class
    }
    
    # Create a bubble chart showing family size vs fare by class
    fig = px.scatter(
        data,
        x='family_size',
        y='avg_fare',
        size='n_passengers',
        color='Pclass',
        title='Family Size vs Average Ticket Fare by Passenger Class',
        labels={
            'family_size': 'Family Size',
            'avg_fare': 'Average Fare (£)',
            'Pclass': 'Passenger Class',
            'n_passengers': 'Number of Passengers'
        },
        category_orders={'Pclass': ['1', '2', '3']},
        color_discrete_map=class_colors,
        size_max=40,
        hover_data=['min_fare', 'max_fare']
    )
    
    # Customize layout
    fig.update_layout(
        height=500,
        showlegend=True,
        xaxis=dict(dtick=1),  # Ensure integer ticks for family size
        yaxis=dict(title='Average Fare (£)')
    )
    
    # Connect the points within each class with a dotted line in the class color
    for pclass, color in class_colors.items():
        class_data = data[data['Pclass'] == pclass]
        if len(class_data) > 1:
            fig.add_trace(
                go.Scatter(
                    x=class_data['family_size'],
                    y=class_data['avg_fare'],
                    mode='lines',
                    line=dict(dash='dot', width=1, color=color),
                    showlegend=False,
                    hoverinfo='skip'
                )
            )
    
    return fig
visualize_families()

In [12]:
def visualize_fare_ranges():
    """
    Create a visualization showing fare ranges for different family sizes and classes.
    """
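    data = family_groups()
    # Cast class to string so Plotly assigns discrete colors; a numeric
    # column would be mapped to a continuous color scale instead
    data['Pclass'] = data['Pclass'].astype(str)
    
    # Bar chart of average fare by family size, one facet per class;
    # the minimum and maximum fares are available on hover
    fig = px.bar(
        data,
        x='family_size',
        y='avg_fare',
        color='Pclass',
        facet_col='Pclass',
        title='Average Ticket Fare by Family Size and Passenger Class',
        labels={
            'family_size': 'Family Size',
            'avg_fare': 'Average Fare (£)',
            'Pclass': 'Passenger Class'
        },
        category_orders={'Pclass': ['1', '2', '3']},
        color_discrete_map={
            '1': '#1f77b4',
            '2': '#ff7f0e',
            '3': '#2ca02c'
        },
        hover_data=['min_fare', 'max_fare']
    )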
    
    fig.update_layout(
        height=400,
        showlegend=False
    )
    
    fig.for_each_annotation(lambda a: a.update(text=f"Class {a.text.split('=')[1]}"))
    
    return fig

In [13]:
# App layout for Exercise 2 (uses the Titanic dataset loaded above)
st.write(
'''
# Titanic Visualization 1 - Family Analysis

**Research Question:** How did family size and passenger class interact to affect ticket fares, and were larger families concentrated in specific classes with distinct fare patterns?

'''
)

# Generate and display the family visualization
fig1 = visualize_families()
st.plotly_chart(fig1, use_container_width=True)

st.write(
'''
# Titanic Visualization 2 - Fare Ranges
'''
)

# Generate and display the fare range visualization
fig2 = visualize_fare_ranges()
st.plotly_chart(fig2, use_container_width=True)

# Show the family groups data
st.write("### Family Groups Data")
family_data = family_groups()
st.dataframe(family_data)

# Show last name analysis
st.write("### Last Name Analysis")
last_name_data = last_names()
st.write("Top 20 Most Common Last Names:")
st.dataframe(last_name_data.head(20))

# Compare last name counts with family size analysis
st.write("### Comparison: Last Name Counts vs Family Size Analysis")

# Calculate average family size by class
avg_family_by_class = df.groupby('Pclass')['family_size'].mean().round(2)

st.write("**Average Family Size by Class:**")
st.write(avg_family_by_class)

st.write("**Key Findings:**")
st.write("""
- **Last Name Distribution**: The most common last names (Sage, Andersson, Asplund, etc.) appear 5-11 times, suggesting large family groups
- **Family Size Patterns**: Third class had the largest average family sizes, while first class had smaller families
- **Fare Structure**: First class families paid significantly higher fares regardless of family size
- **Large Families**: Families larger than 4 were almost exclusively in third class with lower average fares
- **Wealth Disparity**: The fare range within first class was much wider than other classes, indicating greater wealth variation among wealthy passengers
""")

st.write(
'''
# Titanic Visualization Bonus - Large Families Analysis
'''
)

# Bonus visualization focusing on large families
def visualize_large_families():
    """
    Focus on families with 4+ members to understand large family patterns.
    """
    data = family_groups()
    # Cast class to string so Plotly assigns discrete colors; a numeric
    # column would be mapped to a continuous color scale instead
    data['Pclass'] = data['Pclass'].astype(str)
    large_families = data[data['family_size'] >= 4]
    
    fig = px.bar(
        large_families,
        x='family_size',
        y='n_passengers',
        color='Pclass',
        title='Distribution of Large Families (4+ Members) by Passenger Class',
        labels={
            'family_size': 'Family Size',
            'n_passengers': 'Number of Passengers',
            'Pclass': 'Passenger Class'
        },
        category_orders={'Pclass': ['1', '2', '3']},
        color_discrete_map={
            '1': '#1f77b4',
            '2': '#ff7f0e',
            '3': '#2ca02c'
        }
    )
    
    fig.update_layout(height=400)
    return fig

fig3 = visualize_large_families()
st.plotly_chart(fig3, use_container_width=True)




## Bonus Question

Add a new column, `older_passenger`, to the Titanic dataset that indicates whether each passenger’s age is above the median age for *their* passenger class. So, suppose row $x$ is in passenger class 2. Then, a value of `True` at row $x$ would indicate that the passenger is older than 50% of class 2 passengers, and `False` would indicate that they are younger.

- You should use pandas functions to accomplish this.
- The new column should contain Boolean values (True if the age is above the median, False if less than or equal to).
- Return the updated table in the function `determine_age_division()`

Once you’ve created this column, consider how this age division relates to your analysis above. Try to visualize this analysis in Plotly using the function name `visualize_age_division()`.
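
One possible way to approach the bonus question is sketched below on a small invented table. This is not a reference solution: the `data` parameter is an assumption made so the sketch is self-contained (the exercise names the function but does not specify its arguments), and the toy ages are hypothetical.

```python
import pandas as pd

def determine_age_division(data):
    """Flag passengers older than the median age of their own class."""
    out = data.copy()
    # transform('median') broadcasts each class median back onto its rows
    class_median = out.groupby('Pclass')['Age'].transform('median')
    # Strictly greater than the median; a missing age compares as False
    out['older_passenger'] = out['Age'] > class_median
    return out

# Hypothetical passengers: class 1 median is 40, class 2 median is 20
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 2],
    'Age': [20, 40, 60, 10, 30, None],
})
result = determine_age_division(toy)
print(result)
```

Because the comparison is strict, a passenger exactly at the class median gets `False`, which matches the rule stated above. Passengers with a missing age also get `False` here, so you may want to decide explicitly how to treat them before visualizing.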