# COGS 108 - Project Proposal

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

**Team Contributions**
- **Ailyn Becerra:** Organized the authors section and wrote the team expectations.
- **Sean Chiu:** Drafted the initial research question, hypothesis, and ethics section.
- **Alex Evans:** Contributed to project planning and proposal development.
- **Max Henderson:** Wrote the background and prior work section and developed the project timeline proposal.
- **Paulina Pelayo:** Refined the research question and hypothesis, identified the dataset, and contributed to the ethics section.

## Research Question

**Research Question:** How are breed labels and spay/neuter status associated with (1) the likelihood of adoption and (2) the length of stay for animals at the Austin Animal Center?

We will treat adoption as a binary outcome derived from the Outcome Type variable and define length of stay as the time difference between intake and outcome timestamps. We will use statistical inference and predictive modeling to examine how these factors relate to adoption outcomes and length of stay, while also considering other variables such as age and animal type. Because this is observational shelter data, our analysis will focus on identifying associations rather than making causal claims.
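As a minimal sketch of the two outcome definitions above, the snippet below derives a binary adoption flag and a length of stay in days from a single record. The field names `outcome_type`, `intake_time`, and `outcome_time`, the timestamp format, and the sample values are assumptions for illustration, not the dataset's confirmed schema. It also assumes an intake timestamp has already been matched to each outcome record.

```python
from datetime import datetime

# Assumed timestamp format; adjust to match the actual export.
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def is_adopted(outcome_type: str) -> int:
    """Return 1 if the outcome is an adoption, else 0."""
    return int(outcome_type.strip().lower() == "adoption")


def length_of_stay_days(intake_time: str, outcome_time: str) -> float:
    """Return the time between intake and outcome, in days."""
    intake = datetime.strptime(intake_time, TIME_FORMAT)
    outcome = datetime.strptime(outcome_time, TIME_FORMAT)
    return (outcome - intake).total_seconds() / 86400


# Hypothetical record, not taken from the real data.
record = {
    "outcome_type": "Adoption",
    "intake_time": "01/03/2020 09:00:00 AM",
    "outcome_time": "01/10/2020 09:00:00 PM",
}
adopted = is_adopted(record["outcome_type"])
stay = length_of_stay_days(record["intake_time"], record["outcome_time"])
print(adopted, stay)
```

Keeping length of stay in fractional days means same-day outcomes are preserved rather than rounded down to zero.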

## Background and Prior Work

Many shelter animals unfortunately spend the rest of their lives in a shelter, and adoption is often their only way out. People choose one animal over another for many reasons, but we would like to distill those reasons into a few easy-to-measure variables and determine which factors affect adoption rates most.

A study at UC Davis <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) suggests that an animal is more likely to be adopted if it is spayed or neutered. Because this is a variable that a shelter can control, it is worth studying further.

Other studies <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) of participants who adopted animals but later returned them to the shelter concluded that, while dogs are typically returned for aggressive behaviours, cats are most often returned for personal reasons. This may suggest that adopters are more forgiving of a cat's specific characteristics, which could lead to a weaker correlation between a cat's measured attributes and its adoption rate.

Another study <a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) concluded that size and age are likely the largest contributing factors to how long an animal waits before it is adopted. Small dogs, puppies, and large dogs were adopted fastest, while medium-sized dogs stayed at shelters the longest. Additionally, the authors found that sex and coat color played a very small role in how likely a dog is to be adopted.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Jaime Clevenger and Philip H. Kass. (Winter 2003). Determinants of Adoption and Euthanasia of Shelter Dogs Spayed or Neutered in the University of California Veterinary Student Surgery Program Compared to Other Shelter Dogs. 
2. <a name="cite_note-2"></a> [^](#cite_ref-2)  Sloane M Hawes, Josephine M Kerrigan, Tess Hupe, Kevin N Morris. (2020 Sep 3). Factors Informing the Return of Adopted Dogs and Cats to an Animal Shelter. https://pmc.ncbi.nlm.nih.gov/articles/PMC7552273/
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Leslie Sinn. (July–August 2016). Factors affecting the selection of cats by adopters. *Journal of Veterinary Behavior*. https://www.sciencedirect.com/science/article/abs/pii/S1558787816300442
4. <a name="cite_note-4"></a> [^](#cite_ref-4) William P Brown, Janelle P Davidson, Marion E Zuefle. Effects of phenotypic characteristics on the length of stay of dogs at two no kill animal shelters. https://pubmed.ncbi.nlm.nih.gov/23282290/

## Hypothesis

We hypothesize that animals recorded as spayed or neutered will have a higher probability of adoption and a shorter length of stay compared to intact animals. This relationship is expected to vary significantly across breed labels, as historical adoption rates differ between breeds. 

Furthermore, we anticipate that variables such as age and animal type will influence these outcomes and must be controlled in our analysis. We expect these relationships to be associative rather than causal, meaning that while these factors are correlated with adoption speed, they are not the sole drivers of adoption decisions.
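One simple way to account for animal type when comparing adoption rates is to compute the rates within each type separately. The sketch below does this on a few made-up records; the field layout and values are hypothetical. A fuller analysis would adjust for several variables, such as age, at once.

```python
from collections import defaultdict

# Made-up records for illustration only: (animal_type, fixed, adopted).
records = [
    ("Dog", True, 1), ("Dog", True, 1), ("Dog", False, 0), ("Dog", False, 1),
    ("Cat", True, 1), ("Cat", True, 0), ("Cat", False, 0), ("Cat", False, 0),
]


def adoption_rates(rows):
    """Return the adoption rate for each (animal_type, fixed) group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [adoptions, total]
    for animal_type, fixed, adopted in rows:
        counts[(animal_type, fixed)][0] += adopted
        counts[(animal_type, fixed)][1] += 1
    return {group: adoptions / total for group, (adoptions, total) in counts.items()}


rates = adoption_rates(records)
for (animal_type, fixed), rate in sorted(rates.items()):
    status = "spayed/neutered" if fixed else "intact"
    print(f"{animal_type:>3} | {status:<15} | adoption rate = {rate:.2f}")
```

Comparing rates within each animal type keeps a difference between cats and dogs from being mistaken for an effect of spay/neuter status.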

## Data

### Ideal Dataset
The ideal dataset would be a comprehensive record from animal shelters across diverse geographic and socio-economic regions over the last 10 years. Important variables would include breed labels (primary and secondary) recorded by the shelter, spay/neuter status at intake and outcome, age, animal type, and precise timestamps for intake and adoption to accurately calculate length of stay. This data would typically be collected by shelter staff during routine intake and outcome procedures and stored within a centralized shelter management system before being exported as structured files.

For this project, raw data will be stored in the `data/00-raw/` directory, with cleaned datasets placed in `01-interim/` and analysis-ready datasets stored in `02-processed/` to support reproducibility and organized workflows.
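The storage layout described above could be set up with a short helper like the one below. This is only a sketch: it assumes the interim and processed folders sit next to `00-raw` under `data/`, and the function name `make_data_dirs` is hypothetical.

```python
import tempfile
from pathlib import Path

# Stage folders named in the proposal, assumed to live under data/.
STAGES = ["00-raw", "01-interim", "02-processed"]


def make_data_dirs(project_root: Path) -> list[Path]:
    """Create one folder per data stage under project_root/data."""
    paths = [project_root / "data" / stage for stage in STAGES]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Demonstrated in a temporary folder so the sketch leaves nothing behind;
# in the project, pass the repository root instead.
with tempfile.TemporaryDirectory() as tmp:
    for path in make_data_dirs(Path(tmp)):
        print(path.relative_to(tmp))
```

Using `exist_ok=True` makes the helper safe to rerun whenever a teammate clones the repository.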

### Real Dataset
**Austin Animal Center Outcomes**
- **URL**: [https://www.kaggle.com/datasets/jackdaoud/animal-shelter-analytics?select=Austin_Animal_Center_Outcomes.csv](https://www.kaggle.com/datasets/jackdaoud/animal-shelter-analytics?select=Austin_Animal_Center_Outcomes.csv)
- **Description**: This dataset contains over 100,000 records from 2013 to the present, documenting animal outcomes at the Austin Animal Center. Key variables include `Breed`, `Color`, `Date of Birth`, `Sex upon Outcome`, `Outcome Type`, and outcome timestamps. The dataset is publicly available through Kaggle and provided by the City of Austin, requiring no special permissions to access. Team members can download the data locally using the Kaggle API. These variables will allow us to measure adoption likelihood using the Outcome Type field and compute length of stay from intake and outcome timestamps. Because the dataset is large and publicly accessible, it is well-suited for both statistical analysis and predictive modeling.
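Two of the listed variables need light processing before analysis. The sketch below shows one possible approach: mapping a `Sex upon Outcome` label to a spay/neuter status, and turning `Date of Birth` into an age at outcome. The label values used here (for example "Neutered Male" and "Intact Female") are assumptions about how the column is coded and should be checked against the actual file. The function names are hypothetical.

```python
from datetime import date


def sterilization_status(sex_upon_outcome: str) -> str:
    """Map a `Sex upon Outcome` label to 'fixed', 'intact', or 'unknown'."""
    label = sex_upon_outcome.strip().lower()
    if label.startswith(("spayed", "neutered")):
        return "fixed"
    if label.startswith("intact"):
        return "intact"
    # Anything else (assumed to include "Unknown" or blanks) is left unknown.
    return "unknown"


def age_in_years(date_of_birth: date, outcome_date: date) -> float:
    """Return the approximate age at outcome, in years."""
    return (outcome_date - date_of_birth).days / 365.25


# Hypothetical values, not taken from the real data.
print(sterilization_status("Neutered Male"))
print(sterilization_status("Intact Female"))
print(round(age_in_years(date(2018, 1, 1), date(2020, 1, 1)), 2))
```

Keeping "unknown" as its own category avoids silently counting unlabeled animals as intact.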

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> This dataset reflects animal shelter operations within Austin, Texas. Adoption patterns observed in this region may not generalize to other locations due to differences in local culture, shelter resources, and adoption policies.

 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We have considered that our analysis does not include the demographic data of adopters (e.g., race, gender, or income). As a result, we cannot evaluate whether our findings reflect broader social inequities in adoption behavior.


### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Breed labels are often subjective guesses by staff based on visual traits. This introduces a 'labeling bias' where an animal might be categorized based on visual interpretation, which could skew the data on adoption performance for specific breeds.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> To avoid misleading results, we will report both mean and median for 'Length of Stay.' This ensures that animals that spend an exceptionally long time in the shelter do not disproportionately affect the perceived average adoption time. We will avoid framing results in ways that suggest certain breeds are inherently more or less adoptable. Instead, we will interpret findings as reflections of historical adoption patterns within this dataset.

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
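
To illustrate the point made under C.3, the sketch below compares the mean and median on a small set of made-up lengths of stay in which one animal stayed far longer than the rest. The numbers are hypothetical and only show why both statistics are worth reporting.

```python
from statistics import mean, median

# Made-up lengths of stay in days; one animal stayed far longer than the rest.
stays = [3, 5, 6, 8, 10, 12, 180]

print(f"mean   = {mean(stays):.1f} days")    # mean   = 32.0 days
print(f"median = {median(stays):.1f} days")  # median = 8.0 days
```

A single long stay pulls the mean to four times the median here, so reporting only the mean would overstate the typical wait.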

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> We must ensure that 'breed' does not unintentionally act as a proxy for the socio-economic status of the neighborhoods where animals are surrendered. We will avoid using intake location as a predictive variable to prevent reinforcing geographic stigmas.

 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> Our report will explicitly state that the results show correlations and do not establish causation. We will also acknowledge unmeasured variables, such as an animal's individual temperament, that our data cannot capture.


### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> We will avoid presenting results in ways that could encourage breed-based stigma or oversimplified adoption recommendations. Findings will be framed as descriptive patterns rather than prescriptive guidance.


## Team Expectations

* ***Communication**: Team members are expected to communicate consistently and to engage in respectful academic discourse when disagreements arise. The main medium of communication will be the team Discord server. Failing to keep up with communication (ghosting, etc.) may result in disciplinary action (a group meeting to discuss the lack of communication, or group removal). Each team member is expected to reach out whenever unforeseen circumstances (emergencies, urgent matters, etc.) may prevent them from completing part of the work.*
* ***Decision-making**: Minor decisions that do not interfere with the project plan may be made at each team member’s discretion. Decisions that may impact the overall project or directly affect the work of other team members (changing approach, changing dataset, etc.) must be discussed with all team members prior to making a final decision.*
* ***Tasks**: Tasks must be equitably distributed among team members. If a team member is unable to complete a task because of unforeseen circumstances, they must tell the other team members as early as possible so that the task can still be completed on time. It is important that each team member contributes early and often to ensure team deadlines are met.*

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/21  |  5 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication |
| 1/28  |  5 PM | Review previous COGS 108 projects | Complete project review and discuss project ideas|
| 2/4   |  5 PM | Do background research on topic  | Discuss ideal dataset(s) and ethics; create and submit project proposal |
| 2/11  | 5 PM  | Import & Wrangle Data | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/18  | 5 PM  | Complete data checkpoint | Discuss individual and group progress |
| 2/25  | 5 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis |
| 3/4   | 5 PM  | Complete EDA | Discuss and submit EDA checkpoint |
| 3/11  | 5 PM  | Prepare final project | Discuss and finalize final project |
| 3/18  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |
