# COGS 108 - Data Checkpoint

# Names

- Nicole Kim
- Rikako Ono
- Geena Limfat
- MyungJoo Kim
- Elizaveta Beltyukova

# Research Question

How can we predict when car accidents are most likely to happen in the United States? Which factors can we rely on to make those predictions?


## Background and Prior Work


Over the years, cars have become safer, whether through improvements in structural design or the development of new technologies.
Despite this progress, however, the frequency of car accidents has risen significantly over the last several years.
Possible contributors associated with newer vehicles include less-researched technology, regenerative braking, increased total mass, and faster acceleration.
Specifically, there were 6,102,936 police-reported vehicle accidents in the United States in 2021 alone<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3).

Interestingly, the factors influencing car crashes in the United States are quite diverse. Accidents can be the result of negative behaviors such as distracted driving, speeding, drunk driving, reckless driving, and tailgating<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1).
In fact, more than a third (36%) of all fatal crashes involve alcohol<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3).
All of these factors contribute to the high rate of car accidents in the US, making it crucial for individuals and authorities to address these issues through awareness, preventive measures, and effective policies to improve road safety. 

Accident frequency can also correlate with variables as general as the state in which one is driving and as specific as the month, day, and even hour. For instance, previous work has analyzed the frequency of car crashes by day of the week and time of day.
This analysis was conducted by the National Safety Council (NSC), which used data from 2021 to examine the frequency of fatal and non-fatal car crashes by time of day and day of the week.
They concluded that in the warmer months (spring and summer), fatal crashes peaked in the late evening and at night (between 8 p.m. and midnight), whereas in winter and early spring, fatal crashes peaked between 4 and 8 p.m.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)
Another observation from their analysis is that more crashes tended to happen toward the end of the week (Friday to Sunday). It will be interesting to compare this with our own analysis later on to see whether it is consistent with our findings for the years 2016-2020.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) [*Top 25 causes of car accidents: Exploring the major factors.* GJEL Accident Attorneys. (2023, November 24).](https://www.gjel.com/car-accident-lawyers/top-causes-car-accidents/) 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) [*Car crashes by time of day and day of week.* Injury Facts. (2023, April 18). ](https://injuryfacts.nsc.org/motor-vehicle/overview/crashes-by-time-of-day-and-day-of-week/) 
3. <a name="cite_note-3"></a> [^](#cite_ref-3) [Moore, T. *Fatal car crash statistics 2024.* USA Today. (2024, January 16).](https://www.usatoday.com/money/blueprint/auto-insurance/fatal-car-crash-statistics/#:~:text=There%20are%20nearly%2043%2C000%20fatal,accidents%20in%20the%20United%20States.&text=Of%20those%2C%2039%2C508%20were%20fatal) 


# Hypothesis


We hypothesize that car accidents will happen most frequently around the holiday season because of factors such as increased vacation travel, which leads to more traffic, and the influence of alcohol.
Additionally, we expect more accidents on weekday mornings, due to the high volume of commuters traveling to work and school.
We also expect that innovative features such as autopilot and regenerative braking can negatively affect electric vehicle drivers.
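
One way these predictions could be checked is by deriving calendar features from `Start_Time` and counting accidents per month and per weekday morning. The sketch below is illustrative only: the toy rows, the `add_time_features` helper, and the 6:00-9:59 definition of a "weekday morning" are assumptions, not part of our actual analysis.

```python
import pandas as pd

# Toy rows standing in for the accident data; only Start_Time is needed here.
sample = pd.DataFrame({
    "Start_Time": [
        "2017-01-02 08:15:00",  # Monday morning
        "2017-01-03 08:40:00",  # Tuesday morning
        "2017-07-08 22:05:00",  # Saturday night
        "2017-12-24 18:30:00",  # Sunday evening, holiday season
        "2017-12-26 07:50:00",  # Tuesday morning, holiday season
    ]
})

def add_time_features(df):
    """Return a copy with month, weekday, and hour columns derived from Start_Time."""
    out = df.copy()
    start = pd.to_datetime(out["Start_Time"])
    out["month"] = start.dt.month
    out["day_of_week"] = start.dt.dayofweek  # Monday=0 ... Sunday=6
    out["hour"] = start.dt.hour
    # Hypothetical definition of a "weekday morning": Mon-Fri, 6:00-9:59.
    out["weekday_morning"] = (out["day_of_week"] < 5) & out["hour"].between(6, 9)
    return out

features = add_time_features(sample)
accidents_per_month = features.groupby("month").size()
weekday_morning_share = features["weekday_morning"].mean()
```

On the real data, the same grouping could be applied to `accident_df`; if some timestamps do not share one format, `pd.to_datetime` may need `errors="coerce"` so that unparseable values become missing instead of raising.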


# Data

## Data overview

- Dataset #1
  - Dataset Name: US Accidents (2016 - 2023)
  - Link to the dataset: https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents?select=US_Accidents_March23.csv
  - Number of observations: 7,728,394
  - Number of variables: 46
- Dataset #2
  - Dataset Name: Highest Public Transit Usage Cities in California
  - Link to the dataset: https://www.homearea.com/rankings/place-in-ca/percent_using_public_transportation/#:~:text=The%20California%20percent%20using%20public,year%20saw%20several%20big%20changes
  - Number of observations: 62
  - Number of variables: 3

### Set up

In [1]:
import pandas as pd
import numpy as np

## Dataset #1 - US Accidents (2016 - 2023)

Because the full dataset is very large, we filtered it before uploading it to GitHub.

In [2]:
accident_df = pd.read_csv('AccidentData.csv')
accident_df.head()

Unnamed: 0,Start_Time,Severity,City,Description
0,2017-01-01 00:03:31,3,Norwalk,Accident on I-5 Northbound at Exits 120 120A B...
1,2017-01-01 00:09:26,3,Lynwood,Accident on I-710 Southbound at Exits 12 12A 1...
2,2017-01-01 00:09:52,3,Hesperia,Accident on I-15 Northbound at Exit 138 Oak Hi...
3,2017-01-01 00:10:14,2,Pasadena,Accident on CA-110 Southbound at Glenarm St.
4,2017-01-01 00:11:14,3,Colton,Accident on I-10 Eastbound at Exit 72 I-215.
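
The filtering step itself is not recorded in this notebook. A minimal sketch of how a file of this size could be reduced is shown below, reading the CSV in chunks and keeping only the four columns that appear in the preview. The `filter_accidents` helper and the `State == "CA"` criterion are assumptions made for illustration, and the in-memory CSV is a tiny stand-in for `US_Accidents_March23.csv`.

```python
import io
import pandas as pd

# Columns kept in AccidentData.csv, taken from the preview of accident_df.
KEEP_COLUMNS = ["Start_Time", "Severity", "City", "Description"]

def filter_accidents(source, state="CA", chunksize=100_000):
    """Read a large accident CSV in chunks and keep only rows for one state.

    The State == "CA" filter is an assumption; the original filtering
    criteria are not recorded in this notebook.
    """
    kept = []
    reader = pd.read_csv(
        source, usecols=KEEP_COLUMNS + ["State"], chunksize=chunksize
    )
    for chunk in reader:
        # Keep matching rows and drop the helper State column.
        kept.append(chunk.loc[chunk["State"] == state, KEEP_COLUMNS])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for the full Kaggle file.
raw_csv = io.StringIO(
    "ID,Start_Time,Severity,City,State,Description\n"
    "A-1,2017-01-01 00:03:31,3,Norwalk,CA,Accident on I-5\n"
    "A-2,2017-01-01 00:05:00,2,Dayton,OH,Accident on I-70\n"
    "A-3,2017-01-01 00:09:26,3,Lynwood,CA,Accident on I-710\n"
)
filtered = filter_accidents(raw_csv, chunksize=2)
```

Reading in chunks keeps memory use bounded, since only the filtered rows from each chunk are retained. Writing the result with `filtered.to_csv("AccidentData.csv", index=False)` would also avoid storing an extra index column in the file.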


## Dataset #2 - Highest Public Transit Usage Cities in California

In [3]:
City = ['San Francisco', 'Oakland', 'Berkeley', 'Daly City', 'Alameda', 'San Leandro', 'Richmond', 'Pleasanton', 'San Mateo', 'Concord', 'San Ramon', 'Hayward', 'Los Angeles', 'Mountain View', 'Fremont', 'Antioch', 'Sunnyvale', 'Pittsburg', 'East Los Angeles', 'Redwood City', 'Vallejo', 'Pasadena', 'Long Beach', 'El Cajon', 'San Jose', 'Santa Clara', 'Santa Ana', 'Inglewood', 'South Gate', 'San Diego', 'Downey', 'Sacramento', 'Chula Vista', 'Santa Clarita', 'Glendale', 'Anaheim', 'Escondido', 'Riverside', 'Santa Rosa', 'Oceanside', 'Santa Barbara', 'Elk Grove', 'Palmdale', 'San Bernardino', 'Fullerton', 'Fresno', 'Moreno Valley', 'Pomona', 'Huntington Beach', 'Santa Maria', 'Lancaster', 'Ontario', 'Garden Grove', 'Corona', 'Bakersfield', 'Stockton', 'Rancho Cucamonga', 'Lakewood', 'San Buenaventura (Ventura)', 'Irvine', 'Visalia', 'Napa']
Percent_Used = [0.347, 0.227, 0.221, 0.207, 0.185, 0.149, 0.121, 0.105, 0.103, 0.099, 0.098, 0.098, 0.089, 0.088, 0.081, 0.079, 0.076, 0.071, 0.066, 0.066, 0.061, 0.055, 0.055, 0.053, 0.05, 0.049, 0.048, 0.047, 0.046, 0.04, 0.038, 0.038, 0.036, 0.034, 0.033, 0.031, 0.03, 0.03, 0.026, 0.023, 0.023, 0.021, 0.021, 0.021, 0.021, 0.02, 0.019, 0.018, 0.018, 0.018, 0.016, 0.015, 0.014, 0.012, 0.011, 0.011, 0.011, 0.008, 0.008, 0.008, 0.006, 0.005]

transit_df = pd.DataFrame({'City':City, 'Percent Using Transit':Percent_Used})

transit_df

Unnamed: 0,City,Percent Using Transit
0,San Francisco,0.347
1,Oakland,0.227
2,Berkeley,0.221
3,Daly City,0.207
4,Alameda,0.185
...,...,...
57,Lakewood,0.008
58,San Buenaventura (Ventura),0.008
59,Irvine,0.008
60,Visalia,0.006
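
Since both datasets share a `City` column, one plausible way to combine them is to count accidents per city and then left-join the transit percentages. The sketch below uses toy stand-ins for `accident_df` and `transit_df`; the `Accident Count` column name is hypothetical, and this is not necessarily the join we will use in the final analysis.

```python
import pandas as pd

# Toy stand-ins for accident_df and transit_df with the same column names.
accidents = pd.DataFrame({
    "City": ["Pasadena", "Pasadena", "Oakland", "Colton"],
    "Severity": [2, 3, 3, 3],
})
transit = pd.DataFrame({
    "City": ["Oakland", "Pasadena"],
    "Percent Using Transit": [0.227, 0.055],
})

# Count accidents per city, then attach transit usage where available.
counts = (
    accidents.groupby("City")
    .size()
    .reset_index(name="Accident Count")
)
# indicator=True adds a _merge column showing which cities found a match.
combined = counts.merge(transit, on="City", how="left", indicator=True)
```

A left join keeps cities that have accidents but no transit figure (marked `left_only` in `_merge`), which makes unmatched or inconsistently spelled city names easy to spot. Raw counts are not adjusted for population, so any comparison across cities would likely need a per-capita normalization from an additional source.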


# Ethics & Privacy

We do not expect bias from the source of collection itself, because the data was collected by the National Highway Traffic Safety Administration (NHTSA).
Since this is a reputable government organization in charge of collecting and organizing the data, we assume that no demographic or ethnic biases were introduced during collection.
However, there may be biases in which accidents were reported to the government in the first place, as racial or ethnic biases may have influenced reporting.
This may exclude certain demographics of people or over-represent others, but because we are not focusing on race or demographics, it should not substantially affect our analysis. Additionally, since the data is spread widely across the country and over four years, such biases should largely even out.
The large dataset also helps protect the privacy of the people involved, as accidents are recorded under police jurisdiction regions rather than the exact location where each accident occurred.

One factor we may fail to fully consider is location and its impact on each accident.
For example, weather conditions or other variables that we are unable to record may be the main reason for an accident, yet we might attribute it to a different cause based on the information we analyzed.
However, this should still be acceptable, as the main goal of our research is to find out how certain factors consistently result in accidents.
In other words, our focus is on determining whether an accident is likely when a certain variable is present, rather than on which variable contributes most to an accident.

With the dataset we are using, we must also take the COVID-19 pandemic into account, as the dataset notes that between the months of May and March the data will be more sparse because there were fewer people on the road.
We will account for this bias in our analysis, as we understand that trends may be skewed by this event.

We will detect these specific biases before analysis by thoroughly discussing our research question before finding datasets and identifying any additional variables that could skew our data and affect our question.
During data analysis, we will identify any points that do not match the rest of the data and conduct additional research to determine whether those data points are abnormalities or reflect biases that we did not previously take into account.
Then, we will explain the abnormalities and biases in our analysis and discuss how they affect our data.
After writing our analysis, we will proofread our conclusions for any other biases that stand out and, if any appear, revise and address them.
Additionally, we may conduct a peer review (if allowed) with another group so that we can get unbiased feedback.
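
As a rough illustration of how points that do not match the rest of the data might be flagged, the sketch below marks months whose accident counts fall outside the Tukey IQR fences. The `flag_unusual_months` helper, the `k=1.5` threshold, and the monthly totals are all made up for illustration; this is one possible screening rule, not a method we have committed to.

```python
import pandas as pd

def flag_unusual_months(monthly_counts, k=1.5):
    """Flag months whose counts fall outside the Tukey IQR fences."""
    q1 = monthly_counts.quantile(0.25)
    q3 = monthly_counts.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (monthly_counts < lower) | (monthly_counts > upper)

# Made-up monthly totals for illustration only; not taken from the dataset.
monthly_counts = pd.Series(
    [100, 104, 98, 30, 101, 99, 103, 97],
    index=pd.period_range("2017-01", periods=8, freq="M"),
)
flags = flag_unusual_months(monthly_counts)
unusual = monthly_counts[flags]
```

A flagged month is only a prompt for further research: it could be a genuine abnormality, a reporting gap, or an effect such as reduced traffic during the pandemic.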


# Team Expectations 


* *Keep in touch with group members via iMessage or Discord group chat.*
* *Fill out When2Meet polls on time in order to coordinate meeting times.*
* *Complete individually assigned tasks on time.*
* *Communicate promptly if any change of plans arises.*


# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/3 | 12 PM | Determine best form of communication | Complete Previous Project Review Assignment |
| 2/7 | 6:30 PM | Brainstorm ideas for project | Work on Project Proposal Assignment |
| 2/12 | 8:30 PM | Brainstorm any new ideas for project + potential datasets | Finish Project Proposal Assignment |
| 2/17 | 12 PM | Find data and work on research | Work on Checkpoint #1: Data |
| 2/22 | 8:30 PM | Look over feedback | Complete Checkpoint #1: Data |
| 3/2 | 12 PM | Work on assigned tasks | Work on Checkpoint #2: EDA |
| 3/10 | 6:30 PM | Complete analysis; draft results/conclusion/discussion | Complete Checkpoint #2: EDA |
| 3/15 | 12 PM | Work on assigned tasks | Work together to discuss and wrap up the Final Project |
| 3/20 | Before 11:59 PM | Make final edits and prepare to turn in Final Project | Turn in Final Project & Group Project Surveys |