# The Donner Party
In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested route to the Sacramento Valley. Having reached its full size of 87 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 40 of the 87 members had died from famine and exposure to extreme cold.

In [12]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## About this exercise
**Exercise set:** Logistic Regression Analysis on Donner Party Survival Rates

**Objective:** In this exercise set, you’ll explore survival rates based on various factors, build alternative logistic regression models, and evaluate them using statistical measures.


---

#### Part 1: Exploratory Data Analysis (EDA)

> 1. **Breakdown of Survival Rates by Gender**
>   - Calculate and compare survival rates for males and females.
>   
>   *Question*: What are the differences in survival rates between males and females?


In [8]:
# Import the data
donner = pd.read_csv('./data/donner_party.csv')

# Rename the columns 
donner.columns = ['survived','name','sex','age', 'status']


In [13]:
# your code here
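
As a rough sketch of the group-wise calculation described in question 1, the block below uses a small made-up table rather than the real `donner` data. The labels `"Male"`/`"Female"` and the coding 1 = survived, 0 = died are assumptions; check how your CSV actually encodes both columns.

```python
import pandas as pd

# Made-up toy rows for illustration only (NOT the real Donner data).
# Assumption: `survived` is coded 1 = survived, 0 = died.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male",
            "Female", "Female", "Female", "Female"],
    "survived": [1, 0, 0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column within a group is that group's survival rate;
# the count shows how many people each rate is based on.
rates = toy.groupby("sex")["survived"].agg(rate="mean", n="count")
print(rates)
```

Reporting the group size next to each rate helps when judging whether a difference rests on only a handful of people.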


---
> 2. **Survival Rates by Age**
>    - Divide individuals into age groups (e.g., children, young adults, adults, elderly).
>    - Calculate survival rates for each age group.
> 
>    *Question*: How does age correlate with survival? Are there certain age groups with higher or lower survival rates?


In [2]:
# your code here
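
One possible way to form the age groups in question 2 is `pd.cut`. The cut points and labels below are hypothetical choices, and the rows are made up for illustration; neither comes from the exercise data.

```python
import pandas as pd

# Made-up toy rows for illustration only (NOT the real Donner data).
toy = pd.DataFrame({
    "age": [3, 8, 15, 22, 25, 34, 47, 62],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Hypothetical cut points; right=False makes each bin [lower, upper),
# so a 13-year-old falls in "teen", not "child".
bins = [0, 13, 20, 40, 120]
labels = ["child", "teen", "adult", "older adult"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# Survival rate and head count per age group.
by_age = toy.groupby("age_group", observed=True)["survived"].agg(
    rate="mean", n="count"
)
print(by_age)
```

Different cut points can change the picture noticeably in a small dataset, so it is worth trying more than one grouping.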


---
> 3. **Survival Rates within Families**
>    - Identify family groups in the data based on similar last names (if applicable) or create a family grouping variable.
>    - Calculate the survival rate within each family and analyze patterns.
> 
>    *Question*: Do larger families have different survival rates compared to smaller or individual groups?


In [3]:
# your code here
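
The family grouping in question 3 could be sketched as follows. This assumes names are stored as "First Last"; if your `name` column uses another layout (for example "Last, First"), the split has to change. The names below are invented, and shared surnames are only a proxy for family membership.

```python
import pandas as pd

# Invented names for illustration only (NOT members of the Donner Party).
toy = pd.DataFrame({
    "name": ["Ann Smith", "Bob Smith", "Cy Smith",
             "Dee Jones", "Ed Jones", "Flo Lee"],
    "survived": [1, 1, 0, 0, 1, 0],
})

# Assumption: the last whitespace-separated token is the family name.
toy["family"] = toy["name"].str.split().str[-1]

# Family size and survival rate, largest families first.
families = (
    toy.groupby("family")["survived"]
    .agg(size="count", rate="mean")
    .sort_values("size", ascending=False)
)
print(families)
```

From a table like `families`, you can compare rates across sizes, for instance by grouping again on `size`.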


---
> 4. **Interaction Between Gender and Age on Survival**
>    - Plot survival rates by both gender and age, dividing age into meaningful ranges (such as bins of 10 years).
>    - Use box plots or a heat map to visualise survival across these groups.
> 
>    *Question*: Is there an interaction between age and gender that appears to influence survival rates?

In [4]:
# your code here
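
For question 4, a sex-by-age-bin table of survival rates is one convenient input for a heat map. The sketch below builds such a table from made-up rows; the ten-year bin labels are an illustrative choice.

```python
import pandas as pd

# Made-up toy rows for illustration only (NOT the real Donner data).
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male",
            "Female", "Female", "Female", "Female"],
    "age": [5, 25, 28, 45, 7, 23, 31, 50],
    "survived": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Ten-year bins [0, 10), [10, 20), ... with readable labels such as "20-29".
edges = list(range(0, 70, 10))
labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]
toy["age_bin"] = pd.cut(toy["age"], bins=edges, labels=labels, right=False)

# Rows are age bins, columns are sexes, cells are survival rates.
# Cells with no observations come out as NaN.
grid = toy.pivot_table(
    index="age_bin", columns="sex", values="survived",
    aggfunc="mean", observed=True,
)
print(grid)
```

A table like `grid` can be passed to `sns.heatmap(grid, annot=True)`. Empty cells are a reminder that some sex and age combinations have few or no people.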


---

#### Part 2: Logistic Regression Modeling

5. **Baseline Model**
   - Build an initial logistic regression model using **Age** as the only predictor of survival.

   *Question*: What are the log-likelihood and AIC for this model? How do these metrics help evaluate model fit?


In [None]:
# your code here
formula_string = ...


In [None]:
# Define a logistic regression model
model = sm.formula.logit(formula_string, data=donner)

# Fit the model to the data
results = model.fit()

# Print a model summary table
print(results.summary())
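
The log-likelihood and AIC asked about in question 5 also appear as attributes of a fitted results object. The sketch below fits a logit to simulated data (not the Donner records, and with arbitrary true coefficients) and checks that the reported AIC matches $2k - 2\ln L$, where $k$ is the number of estimated coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (NOT the Donner records): the true survival
# odds are made to fall with age using arbitrary coefficients.
rng = np.random.default_rng(0)
age = rng.uniform(1, 65, size=300)
p_true = 1 / (1 + np.exp(-(1.5 - 0.05 * age)))
sim = pd.DataFrame({"age": age, "survived": rng.binomial(1, p_true)})

# disp=0 silences the optimizer's convergence printout.
fit = smf.logit("survived ~ age", data=sim).fit(disp=0)

k = len(fit.params)                # intercept + age slope
aic_by_hand = 2 * k - 2 * fit.llf  # AIC = 2k - 2 * log-likelihood

print(f"log-likelihood:    {fit.llf:.3f}")
print(f"AIC (statsmodels): {fit.aic:.3f}")
print(f"AIC (by hand):     {aic_by_hand:.3f}")
```

A higher (less negative) log-likelihood means the model assigns more probability to the observed outcomes. AIC adds a penalty of 2 per coefficient, so lower AIC is preferred when comparing models fitted to the same data.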

Now that we have fitted a model to the data, we can use the model parameters to predict the probability of survival at any age, including ages that were not in the original dataset.

In fact, we can plot the logistic regression line to visually show the chances of survival for all ages in a range of our choosing.

Although the next cell should run for you without issues, it is worth spending 5-10 minutes trying to understand what each line does. Feel free to tweak the code to better understand it. 

In [None]:
# Generate a range of ages
age_range = pd.DataFrame({'age': np.linspace(donner['age'].min(), donner['age'].max(), 100)})

# Compute the linear predictor (log-odds) at each age
log_odds = results.params['Intercept'] + results.params['age'] * age_range['age']

# Compute the predicted probabilities (the sigmoid of the log-odds)
probabilities = results.predict(age_range)

# Plot the model's sigmoid function
plt.figure(figsize=(10, 6))
plt.plot(age_range['age'], probabilities, label='Logistic fit')
plt.xlabel('Age')
plt.ylabel('Predicted probability (survived = 1)')
plt.title('Negative Sigmoid Function: Survived ~ Age')

# Superimpose the actual data points
sns.scatterplot(data=donner, x='age', y='survived', style='survived', alpha=0.2)

plt.legend()
plt.show()
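
To make the link between the log-odds and the plotted probabilities concrete, here is a minimal sketch of the sigmoid transformation and its inverse. The intercept and slope are hypothetical values chosen for illustration; with a real fit you would take them from `results.params`.

```python
import numpy as np

# Hypothetical coefficients for illustration only; with a fitted model
# these would come from results.params.
intercept, slope = 1.5, -0.05

ages = np.array([5, 30, 60])
log_odds = intercept + slope * ages     # linear predictor
prob = 1 / (1 + np.exp(-log_odds))      # sigmoid maps log-odds into (0, 1)

# The logit function is the inverse: it recovers the log-odds.
recovered = np.log(prob / (1 - prob))

for a, lo, p in zip(ages, log_odds, prob):
    print(f"age {a:>2}: log-odds {lo:+.2f}, probability {p:.3f}")
```

With a negative slope the probability falls as age rises, and log-odds of zero correspond to a probability of exactly 0.5.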


---
6. **Alternative Model Building**
   - Create at least two alternative models by adjusting predictors:
     - Model 1: Include **Sex** and **Age** only.
     - Model 2: Add an interaction term for **Sex \* Age**.  
     - Model 3: Add **Sex**, **Age**, and **Marital Status** as predictors of survival.

   *Questions*:
     - What are the AIC and log-likelihood values for each model?
     - Based on these metrics, which model appears to provide the best fit?

   *Hint:*
   Although we have not seen interactions in class, answer the following: 
   - Do you expect the effect of age on survival to be the same for females and males? 
   - Refit and plot the model above separately for males and females. What do you see?
   - You can come up with new predictors from existing ones. An interaction is a predictor that is obtained by multiplying two predictors together, e.g. $Age \times Sex$. When will this new predictor be large? When will it be small? What does a coefficient on this new predictor indicate?
   - Once you answer the previous questions, ask your professor. Did your answers match up?
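
The multiplication described in the hint can be sketched on a tiny made-up table. The 0/1 coding used here (1 for male, 0 for female) is an assumption; the opposite coding is equally valid and changes which group the product is zero for.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age": [10, 40, 10, 40],
    "sex": ["Female", "Female", "Male", "Male"],
})

# Assumed dummy coding: 1 for male, 0 for female.
toy["male"] = (toy["sex"] == "Male").astype(int)

# The interaction column is the element-wise product of the two predictors.
toy["age_x_male"] = toy["age"] * toy["male"]
print(toy)
```

Under this coding the product is zero for every female and equals age for every male, so a coefficient on it only shifts the age slope for males.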


In [None]:
# Your code for Model 1


In [None]:
# Your code for Model 2


In [None]:
# Your code for Model 3



---
7. **Chi-Square Test for Model Comparison**
   - Conduct chi-square (likelihood-ratio) tests to compare each alternative model with the baseline model.

   *Question*: Do any of the alternative models significantly improve upon the baseline? What does this tell you about the influence of added features on survival prediction?


In [None]:
# Your code here
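
One common form of the chi-square comparison in question 7 is the likelihood-ratio test for nested models. The sketch below runs it on simulated data (not the Donner records); the variable `male` and the true coefficients are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Simulated stand-in data (NOT the Donner records).
rng = np.random.default_rng(1)
n = 400
age = rng.uniform(1, 65, size=n)
male = rng.integers(0, 2, size=n)
p_true = 1 / (1 + np.exp(-(1.5 - 0.04 * age - 0.8 * male)))
sim = pd.DataFrame({"age": age, "male": male,
                    "survived": rng.binomial(1, p_true)})

# The baseline model is nested inside the larger one.
base = smf.logit("survived ~ age", data=sim).fit(disp=0)
full = smf.logit("survived ~ age + male", data=sim).fit(disp=0)

# Test statistic: twice the gain in log-likelihood.
lr_stat = 2 * (full.llf - base.llf)
# Degrees of freedom: number of extra coefficients in the larger model.
df_diff = int(full.df_model - base.df_model)
# Upper-tail chi-square probability.
p_value = chi2.sf(lr_stat, df_diff)

print(f"LR statistic = {lr_stat:.3f}, df = {df_diff}, p = {p_value:.4f}")
```

The test is only valid when the smaller model is a special case of the larger one and both are fitted to the same rows.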



---
8. **Log-Odds Prediction for Specific Individual**
   - Using the best-fitting model, calculate the predicted log-odds of survival for a 24-year-old single female.

   *Question*: What are the log-odds and corresponding probability of survival for this individual based on the model?


In [None]:
# Your code here
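
A prediction for one specific person, as in question 8, can be obtained by passing a one-row DataFrame to `predict`. The sketch below uses simulated data and a model with age and sex only; marital status is left out for brevity, and the level names `"Female"`/`"Male"` are assumptions about the coding.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (NOT the Donner records).
rng = np.random.default_rng(2)
n = 400
age = rng.uniform(1, 65, size=n)
sex = rng.choice(["Female", "Male"], size=n)
eta = 1.5 - 0.04 * age - 0.8 * (sex == "Male")
sim = pd.DataFrame({"age": age, "sex": sex,
                    "survived": rng.binomial(1, 1 / (1 + np.exp(-eta)))})

fit = smf.logit("survived ~ age + C(sex)", data=sim).fit(disp=0)

# One-row frame with the same predictor columns used in the formula.
person = pd.DataFrame({"age": [24], "sex": ["Female"]})
prob = float(fit.predict(person).iloc[0])   # predicted P(survived = 1)
log_odds = np.log(prob / (1 - prob))        # logit of that probability

# Cross-check: "Female" is the reference level here, so the dummy
# term drops out of the linear predictor.
by_hand = fit.params["Intercept"] + fit.params["age"] * 24

print(f"probability = {prob:.3f}, log-odds = {log_odds:.3f}")
```

The by-hand check works because the formula interface picks the alphabetically first level, `"Female"`, as the reference category.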


---

#### Part 3: Model Interpretation and Insights

9. **Interpret Coefficients and Log-Odds**
   - Interpret the coefficients of the model, especially for significant predictors such as **Sex**, **Age**, and any interaction terms.

   *Questions*:
     - What does the coefficient for **Sex** imply about survival rates?
     - How does the **Age** coefficient impact the survival probability?
     - If there’s a significant interaction between **Sex** and **Age**, what does this tell you about survival differences across age and gender?

In [22]:
# Your code or comments here
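
Coefficients on the log-odds scale are often easier to discuss after exponentiating them into odds ratios. The values below are hypothetical, chosen only to show the arithmetic; substitute `results.params` from your chosen model.

```python
import numpy as np
import pandas as pd

# Hypothetical log-odds coefficients for illustration only.
params = pd.Series({"Intercept": 1.5, "age": -0.05, "male": -0.8})

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that predictor, holding the others fixed.
odds_ratios = np.exp(params.drop("Intercept"))
print(odds_ratios.round(3))

# Effect of a ten-year age difference on the odds.
ten_year_or = np.exp(10 * params["age"])
print(f"odds ratio for +10 years: {ten_year_or:.3f}")
```

An odds ratio below 1 means lower odds of survival. These ratios apply to odds, not probabilities, so the change in probability depends on where the person starts on the curve.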


---
10. **Final Model Selection and Justification**
    - Summarise your findings and select the best model based on AIC, log-likelihood, and chi-square results.

    *Question*: Which model would you use for explaining survival rates in this historical context, and why?
