# COGS 108 - Project Proposal

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

- Harvey Sandhu - Conceptualization
- Kyle Wu - Conceptualization
- Connor Wu - Conceptualization
- Adrianne See - Conceptualization
- Teddy Nguyen - Conceptualization

## Research Question

Did the characteristics of passengers on the Titanic, such as ticket class, age, and gender, affect an individual's survivability?
- Considering the "women and children first" rule, how does it correlate with survival statistics, specifically when looking at the "adult male" column of the data?
- Did richer passengers, i.e., those who purchased higher-deck or higher-class tickets at higher prices, have a higher survival rate than those who purchased cheaper tickets?
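
As a minimal sketch of how these questions could be turned into numbers, the snippet below computes the fraction of survivors within each group of a column. The column names `survived`, `pclass`, and `adult_male` are assumptions (only an "adult male" column is mentioned above), the helper `survival_rate_by` is hypothetical, and the rows are made-up toy values, not real Titanic records.

```python
from collections import defaultdict


def survival_rate_by(rows, key):
    """Return {group value: fraction survived} for the column named `key`."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group = row[key]
        total[group] += 1
        survived[group] += row["survived"]  # 1 = survived, 0 = did not
    return {group: survived[group] / total[group] for group in total}


# Made-up toy rows for illustration only; NOT real Titanic records.
# Column names are assumed, not confirmed against the actual dataset.
toy_rows = [
    {"survived": 1, "pclass": 1, "adult_male": False},
    {"survived": 0, "pclass": 1, "adult_male": True},
    {"survived": 1, "pclass": 2, "adult_male": False},
    {"survived": 0, "pclass": 3, "adult_male": True},
    {"survived": 0, "pclass": 3, "adult_male": True},
    {"survived": 1, "pclass": 3, "adult_male": False},
]

rates_by_adult_male = survival_rate_by(toy_rows, "adult_male")
rates_by_class = survival_rate_by(toy_rows, "pclass")
```

If the data is loaded into a pandas DataFrame `df` instead, the same summary is usually written as `df.groupby("adult_male")["survived"].mean()`.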



## Background and Prior Work

Introduction

The RMS Titanic was a British ocean liner infamous for sinking on April 15, 1912, after colliding with an iceberg. Roughly 1,500 of the more than 2,200 people aboard (passengers and crew) died in the disaster. The data set we are analyzing includes information on 891 passengers, including their passenger class, age, sex, and whether they traveled with companions.

Similar studies

A study similar to our proposed topic, using a slightly different data set, concluded that gender was the key determining factor in survivability.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Because the event is historical, the underlying records will not go out of date, but different data sets may still produce different results. Although both data sets contain 891 passengers, ours includes some additional information that may reveal more nuance in the death statistics. The individuals included may also differ, meaning that a different sample of Titanic passengers may have been drawn.

An article published in *Proceedings of the National Academy of Sciences* and hosted by the NIH states that, despite the widespread belief that women and children are prioritized in maritime accidents, women in fact have a significant survival disadvantage in these situations.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) It further asserts that although the survival rate of women on the Titanic was significantly higher than that of men, maritime accidents as a whole show the opposite pattern.

Citations

1. <a name="cite_note-1"></a> [^](#cite_ref-1) "Analysis of Titanic Survival Data." ttsteiger.github.io/projects/titanic_report.html. Accessed 4 Feb. 2026.
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Elinder, Mikael, and Oscar Erixson. "Gender, Social Norms, and Survival in Maritime Disasters." *Proceedings of the National Academy of Sciences of the United States of America*, U.S. National Library of Medicine, 14 Aug. 2012, pmc.ncbi.nlm.nih.gov/articles/PMC3421183/.

## Hypothesis


We predict that gender and class were the primary factors affecting the survival rates of passengers aboard the Titanic.
- Gender: We believe that the “women and children first” rule had a major impact on which passengers survived, since this “rule” was most prominent in the 19th and 20th centuries.
- Socioeconomic status: Richer individuals likely had a higher rate of survival due to classist discrimination and a higher perceived self-value.


## Data

1. The dataset we found is close to ideal for answering our chosen question.
- Contains age, sex, and ticket price/class for each passenger.
- Ideally, data for every passenger would be included, which gives a fuller picture and reduces sampling bias.
- Ideally the data is sourced from the original documents like a passenger list or from ticket sales.
- These were probably stored in a spreadsheet-like paper record at the time, then transferred to a digital spreadsheet later.
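
A quick structural check along these lines could confirm that a candidate file actually carries the fields listed above. This is only a sketch under stated assumptions: the column names in `REQUIRED_COLUMNS` are guesses rather than the dataset's confirmed headers, `summarize_csv` is a hypothetical helper, and the sample text is made up.

```python
import csv
import io

# Assumed column names; adjust to match the real file's headers.
REQUIRED_COLUMNS = {"survived", "pclass", "sex", "age", "fare"}


def summarize_csv(text):
    """Return (required columns that are absent, blank-value count per column)."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    missing_columns = REQUIRED_COLUMNS - set(fieldnames)
    blank_counts = {name: 0 for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            value = row.get(name)
            # Treat absent or whitespace-only cells as missing values.
            if value is None or value.strip() == "":
                blank_counts[name] += 1
    return missing_columns, blank_counts


# Made-up sample for illustration only; NOT real Titanic records.
sample = """survived,pclass,sex,age,fare
1,1,female,29,100.0
0,3,male,,8.05
0,2,male,40,13.0
"""

missing_columns, blank_counts = summarize_csv(sample)
```

Counting blanks per column early shows which fields, such as age, may need handling before any survival rates are compared.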

2. There are a few useful real datasets we could use, but most are quite similar to the base one we decided on.
- [Safe-DS Titanic dataset](https://datasets.safeds.com/en/stable/datasets/titanic/) - Contains cabin numbers, which could be used to identify a more precise location on the ship.
- [Wikipedia](https://en.wikipedia.org/wiki/Passengers_of_the_Titanic) - Contains information about the nationalities of the passengers. This would be useful in examining which nationalities were more likely to survive and whether any discriminatory practices occurred in the evacuation of the ship. There is also data on which survivors were on which lifeboats, though that is not particularly useful.

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [ ] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

Crew members are not accounted for in the dataset. Additionally, we are reducing the tragedy to numeric and categorical data, which does not take into account the emotional and psychological factors in play.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

It is well-known that in the tragedy of the Titanic, a "women and children first" policy was implemented in evacuating the ship, likely unevenly across classes. The data also only include passengers who bought tickets, not crew members. 

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

The correlations that we are exploring are more likely the result of discriminatory or otherwise biased views held at the time (e.g., valuing women and children more) than of the passenger characteristics we are measuring. There are also factors that cannot be measured or accounted for, such as blocked access to lifeboats and bad luck (the relatively random location of passengers on the ship when the tragedy occurred). Our visualizations will be based on measurable and/or clearly identifiable passenger characteristics in order to identify larger trends.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

We have ensured that this dataset does not include passenger names or otherwise unnecessary PII.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

Ticket class and cost are direct proxy variables for wealth and social status of passengers. However, this will be accounted for in our analysis to determine if class-based discrimination likely affected survivability for passengers.

 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

- Communication & responsiveness: Communicate through our Instagram group chat, online meetings on Google Meet, in-person meetings at the discussion, and whenever else needed. Acknowledge messages within a day, and make sure others are aware of your absence.
- Contribution and task management: While each member has individual strengths, effort should be distributed equally across the project. No single person should have to write all the code or text; each person is responsible for contributing. However, we may use Google Docs to write our work and have one person push it to GitHub.
- Conflict and decision making: Use a majority vote for decisions. If conflict arises, resolve it through dedicated meetings. If someone is underperforming, notify them and allow a week for improvement before escalating to the TAs.

## Project Timeline Proposal


| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 2/2 | Before 11:59 PM | Brainstorm ideas on Kaggle | Make a group chat, finalize project review, ensure we are on the same page |
| 2/4 | 3 PM | Do background research and pick a Kaggle dataset | Begin to work on project proposal and divide up responsibilities |
| 2/9 | 3 PM | Edit and finalize proposal, examine data structure | Assign data wrangling responsibilities to each member, discuss project direction |
| 2/16 | 3 PM | Import and wrangle data; EDA | Review and edit wrangling, visualize data, start to discuss analysis plan |
| 2/23 | 3 PM | Finalize wrangling and begin analysis | Discuss and edit analysis; complete project check-in |
| 3/9 | 3 PM | Complete analysis | Edit the full project during in-person meeting |
| 3/18 | Before 11:59 PM | Make final changes and ensure the video and project are ready | Turn in Final Project & Group Project Surveys |