<font color='darkorange'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file and *app\.py* file as instructed. *These exercises use the same [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) as the lab.*


## Exercise 1: Survival Patterns


For this exercise you will analyze survival patterns on the Titanic by looking at passenger class, sex, and age group. Name the function `survival_demographics()`.

1. Create a new column in the Titanic dataset that classifies passengers into age categories (i.e., a pandas `category` series). The categories should be:
    - Child (up to 12)
    - Teen (13–19)
    - Adult (20–59)
    - Senior (60+)  
  
    Hint: The `pd.cut()` function might come in handy here.

2. Group the passengers by class, sex, and age group.  

3. For each group, calculate:  
    - The total number of passengers, `n_passengers`
    - The number of survivors, `n_survivors`
    - The survival rate, `survival_rate`

4. Return a table that includes the results for *all* combinations of class, sex, and age group.  

5. Order the results so they are easy to interpret.  

6. Come up with a clear question that your results table makes you curious about (e.g., “Did women in first class have a higher survival rate than men in other classes?”). Write this question in your `app.py` file above the call to your visualization function, using `st.write("Your Question Here")`.
   
7. Create a Plotly visualization in a function named `visualize_demographic()` that directly addresses your question by returning a Plotly figure (e.g., `fig = px. ...`). You are free to choose the chart type that you think best communicates the findings. Be creative — try different approaches, compare them, and ensure that your chart clearly answers the question you posed.
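
Steps 1 to 4 can be sketched on a small made-up table. This is only an illustration under stated assumptions: the `toy` rows are hypothetical (not real Titanic records), and the bin edges are one reasonable reading of the age categories above. It shows how right-closed bins in `pd.cut()` place boundary ages, and how `observed=False` keeps age groups that have no passengers for a given class and sex.

```python
import pandas as pd

# Hypothetical toy rows, chosen to sit on the bin boundaries
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Sex": ["female", "male", "female", "male"],
    "Age": [12, 13, 59, 60],
    "Survived": [1, 0, 1, 0],
})

# Right-closed bins: (0, 12], (12, 19], (19, 59], (59, inf)
toy["age_group"] = pd.cut(
    toy["Age"],
    bins=[0, 12, 19, 59, float("inf")],
    labels=["Child", "Teen", "Adult", "Senior"],
)

# observed=False asks pandas for every class/sex/age-group combination,
# including those with no passengers
summary = (
    toy.groupby(["Pclass", "Sex", "age_group"], observed=False)["Survived"]
    .agg(n_passengers="count", n_survivors="sum")
    .reset_index()
)

print(toy[["Age", "age_group"]])
print(len(summary))  # 2 classes x 2 sexes x 4 age groups
```

With `observed=True`, the same grouping would return only the four combinations that actually occur in `toy`.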


In [1]:
import plotly.express as px
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')


In [2]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
import pandas as pd

def survival_demographics(file_path):
    # Load the dataset
    df = pd.read_csv(file_path)

    # Define age bins and labels
    age_bins = [0, 12, 19, 59, float('inf')]
    age_labels = ["Child", "Teen", "Adult", "Senior"]

    # Create age category column (using category dtype for clarity)
    df['age_group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=True)
    df['age_group'] = df['age_group'].astype(pd.CategoricalDtype(categories=age_labels, ordered=True))

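    # Group by Pclass, Sex, and age_group.
    # Note: observed=True keeps only the combinations present in the data.
    # Step 4 asks for *all* combinations; observed=False would also include
    # the empty ones (rows with 0 passengers and a NaN survival rate).
    grouped = df.groupby(['Pclass', 'Sex', 'age_group'], observed=True)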

    # Aggregate counts and survival statistics
    results = grouped['Survived'].agg(
        n_passengers='count',
        n_survivors='sum',
        survival_rate='mean'
    ).reset_index()

    # Reorder for clarity: sort by class (ascending), sex (female first), age_group (ordered)
    sex_order = ['female', 'male']
    results['Sex'] = pd.Categorical(results['Sex'], categories=sex_order, ordered=True)
    results = results.sort_values(['Pclass', 'Sex', 'age_group'])

    return results

# Example curiosity question to place above your visualization function in app.py:
# st.write("Did children in third class have a higher survival rate than adults in second class?")


In [11]:
# Example usage:
table = survival_demographics('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')
print(table)

    Pclass     Sex age_group  n_passengers  n_survivors  survival_rate
0        1  female     Child             1            0       0.000000
1        1  female      Teen            13           13       1.000000
2        1  female     Adult            68           66       0.970588
3        1  female    Senior             3            3       1.000000
4        1    male     Child             3            3       1.000000
5        1    male      Teen             4            1       0.250000
6        1    male     Adult            80           34       0.425000
7        1    male    Senior            14            2       0.142857
8        2  female     Child             8            8       1.000000
9        2  female      Teen             8            8       1.000000
10       2  female     Adult            58           52       0.896552
11       2    male     Child             9            9       1.000000
12       2    male      Teen            10            1       0.100000
13    

In [16]:
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'  # or 'notebook_connected' or 'iframe' if 'notebook' does not work

def visualize_demographic(table):
    # Example question: children in 3rd class vs. adults in 2nd class
    mask = (((table['Pclass'] == 3) & (table['age_group'] == 'Child')) |
            ((table['Pclass'] == 2) & (table['age_group'] == 'Adult')))
    subset = table[mask].copy()

    # The table has one row per sex within each class/age group, so label each
    # class/age group on the x-axis and color by sex; otherwise both rows
    # would be drawn in the same bar position.
    subset['group'] = ('Class ' + subset['Pclass'].astype(str)
                       + ' ' + subset['age_group'].astype(str))

    fig = px.bar(subset,
                 x='group',
                 y='survival_rate',
                 color='Sex',
                 barmode='group',
                 hover_data=['n_passengers', 'n_survivors'],
                 labels={'survival_rate': 'Survival Rate', 'group': 'Class and Age Group'},
                 title='Survival Rate: Children in 3rd Class vs Adults in 2nd Class'
    )

    # Creative options: try a grouped bar chart for all combinations
    # Or use px.parallel_categories for categorical trends
    # Uncomment for grouped bar by sex and age group:
    #
    # fig = px.bar(table, x='age_group', y='survival_rate', color='Sex', facet_col='Pclass',
    #              barmode='group', title='Survival Rate by Age, Sex, and Class')

    return fig


## Exercise 2: Family Size and Wealth

Using the Titanic dataset, write a function named `family_groups()` to explore the relationship between family size, passenger class, and ticket fare.  

1. Create a new column in the Titanic dataset that represents the total family size for each passenger, `family_size`. Family size is defined as the number of siblings/spouses aboard plus the number of parents/children aboard, plus the passenger themselves.

2. Group the passengers by family size and passenger class. For each group, calculate:  
   - The total number of passengers, `n_passengers`
   - The average ticket fare, `avg_fare`
   - The minimum and maximum ticket fares (to capture variation in wealth), `min_fare` and `max_fare`

3. Return a table with these results, sorted so that the values are clear and easy to interpret (for example, by class and then family size).

4. Write a function called `last_names()` that extracts the last name of each passenger from the `Name` column, and returns the count for each last name (i.e., a pandas series with last name as index, and count as value). Does this result agree with that of the data table above? Share your findings in your app using `st.write`.

5. Just as you did in Exercise 1, come up with a clear question that your results make you curious about. Write this question in your `app.py` file above the call to your visualization function. Then, create a Plotly visualization in a function named `visualize_families()` that directly addresses your question. As in Exercise 1, you are free to choose the chart type that you think best communicates the findings.
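
For the comparison asked for in step 4, one possible check is to set the number of passengers sharing a last name against the largest `family_size` recorded among them. The sketch below uses hypothetical toy rows, and the column names `name_count`, `max_family_size`, and `agrees` are made up for illustration; it is not the only way to make this comparison.

```python
import pandas as pd

# Hypothetical toy rows: two related Smiths, one unrelated Smith, one Jones
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Jane", "Smith, Miss. Ann", "Jones, Mr. Tom"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
})

toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1
toy["last_name"] = toy["Name"].str.split(",").str[0].str.strip()

# Compare how many passengers share a last name with the largest
# family size reported by anyone carrying that name
comparison = toy.groupby("last_name").agg(
    name_count=("Name", "size"),
    max_family_size=("family_size", "max"),
)
comparison["agrees"] = comparison["name_count"] == comparison["max_family_size"]

print(comparison)
```

In this toy table the two measures disagree for "Smith" because an unrelated passenger shares the name, which is the kind of mismatch worth looking for in the real data.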

In [20]:
def family_groups(file_path):
    df = pd.read_csv(file_path)

    # Family size = siblings/spouses + parents/children + the passenger
    df['family_size'] = df['SibSp'] + df['Parch'] + 1

    # One named aggregation computes the count and all fare statistics per group
    results = df.groupby(['family_size', 'Pclass']).agg(
        n_passengers=('PassengerId', 'count'),
        avg_fare=('Fare', 'mean'),
        min_fare=('Fare', 'min'),
        max_fare=('Fare', 'max'),
    ).reset_index()

    # Sort by class, then family size
    results = results.sort_values(['Pclass', 'family_size'])
    return results


In [21]:
def last_names(file_path):
    df = pd.read_csv(file_path)
    # Names look like "Last, Title. First ...", so the last name is the text before the comma
    df['last_name'] = df['Name'].str.split(',').str[0].str.strip()
    # Count occurrences of each last name (most common first)
    last_name_counts = df['last_name'].value_counts()
    return last_name_counts

In [22]:
family_table = family_groups('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')
print(family_table.head())

    family_size  Pclass  n_passengers    avg_fare  min_fare  max_fare
0             1       1           109   63.672514    0.0000  512.3292
3             2       1            70   91.848039   29.7000  512.3292
6             3       1            24   95.681075   26.2833  211.5000
9             4       1             7  133.521429  120.0000  151.5500
12            5       1             2  262.375000  262.3750  262.3750


In [23]:
last_name_counts = last_names('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')
print(last_name_counts.head())

last_name
Andersson    9
Sage         7
Skoog        6
Panula       6
Carter       6
Name: count, dtype: int64


## Bonus Question

Add a new column, `older_passenger`, to the Titanic dataset that indicates whether each passenger’s age is above the median age for *their* passenger class. So, suppose row $x$ is in passenger class 2. Then, a value of `True` at row $x$ would indicate that the passenger is older than 50% of class 2 passengers, and `False` would indicate that they are not.

- You should use pandas functions to accomplish this.
- The new column should contain Boolean values (`True` if the age is above the class median, `False` if it is less than or equal to it).
- Return the updated table in the function `determine_age_division()`
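
A minimal sketch of this idea, assuming the function takes a DataFrame with `Pclass` and `Age` columns (the exercise does not fix the signature, so the `df` parameter is an assumption, and the `toy` rows are hypothetical). It uses `groupby(...).transform("median")` to line up each passenger with the median age of their own class.

```python
import pandas as pd

def determine_age_division(df):
    # Assumed signature: takes a DataFrame and returns a copy with the new column
    out = df.copy()
    # Median age of each passenger's own class, aligned row by row
    class_median = out.groupby("Pclass")["Age"].transform("median")
    # Comparisons with a missing age evaluate to False
    out["older_passenger"] = out["Age"] > class_median
    return out

# Hypothetical toy rows; class 1 median is 40, class 2 median is 20
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Age": [20, 40, 60, 10, 30, None],
})
result = determine_age_division(toy)
print(result)
```

Note that a passenger with a missing age ends up as `False` here; whether that is the desired treatment is a design choice left open by the exercise.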

Once you’ve created this column, consider how this age division relates to your analysis above. Try to visualize this analysis in Plotly using the function name `visualize_age_division()`.