# COGS 108 - Data Checkpoint

# Names

- Menghang Wu
- Cecilia Lin
- Julie Cai
- Yunfei Shih
- Guan Huang Chen

# Research Question

Is there a statistically significant difference in the preference for hybrid plug-in vehicles versus battery electric vehicles (EVs) across the counties in Washington?

## Background and Prior Work

In recent years, heightened concerns over climate change and air pollution have drawn significant attention to sustainable transportation. Electric vehicles (EVs) and hybrid vehicles are gaining traction across the United States, particularly in eco-conscious states like Washington. These vehicles are essential for reducing greenhouse gas emissions and lowering dependency on fossil fuels. Additionally, rising fuel prices are prompting more consumers to consider electric or fuel-efficient hybrid options.

Although both governments and businesses actively promote the electric vehicle market, consumer preferences vary considerably by region, especially in the choice between hybrid vehicles and battery electric vehicles (BEVs). Understanding these regional preferences offers valuable insights for assessing market demand and shaping policies that support a smoother transition toward electrification in the transportation sector.

Prior studies indicate that preferences for EVs and hybrids differ across geographic regions. In a study by Nelder and Jung (2016)<a name="cite_ref-1.1"></a>[<sup>1</sup>](#cite_note-1.1), factors influencing the adoption of EVs were examined, including the distribution of charging infrastructure, vehicle range, and geographic characteristics. These factors significantly impact the efficiency of both hybrid and electric vehicles, influencing consumer preferences for each type across different areas.

Another relevant study by Morrissey et al. (2016)<a name="cite_ref-1.2"></a>[<sup>2</sup>](#cite_note-1.2) analyzed EV adoption patterns in several European countries, focusing on urban versus rural regions. They found that urban consumers were more inclined to choose BEVs due to readily available charging infrastructure, whereas rural areas favored hybrids, given their flexibility in regions with limited charging options. This research supports the notion that consumer preferences for hybrids and BEVs may vary based on infrastructure and geography, aligning closely with our research question of whether statistically significant preference differences exist across Washington counties.

References:
1. <a name="cite_note-1.1"></a> [^](#cite_ref-1.1) Nelder, C., & Jung, C. (2016). The future of electric vehicles in the U.S.: Forecasts and projections. Rocky Mountain Institute. https://rmi.org

2. <a name="cite_note-1.2"></a> [^](#cite_ref-1.2) Morrissey, P., Weldon, P., & O'Mahony, M. (2016). Future standard and fast charging infrastructure planning: An analysis of charging behaviour in EV-ready urban regions. Journal of Transport Geography. https://www.infona.pl/resource/bwmeta1.element.elsevier-5c9e73d4-17a4-38da-ae24-db1598cce4d7


# Hypothesis


We predict that counties across Washington show a statistically significant preference for battery electric vehicles (BEVs) over plug-in hybrid vehicles. Preference will be measured by the ratio of BEVs to plug-in hybrids in each county. We believe that people prefer BEVs because they are priced lower than hybrid vehicles and because BEV owners do not need to worry about fluctuating gasoline prices.

# Data

## Data overview

- Dataset #1
  - Dataset Name: Electric Vehicle Population Data
  - Link to the dataset: https://catalog.data.gov/dataset/electric-vehicle-population-data
  - Number of observations: 210165
  - Number of variables: 17



The dataset includes all electric vehicles registered across the counties of Washington State, with model years ranging from 1999 to 2025. It has 210,165 rows, which we treat as the population; we could sample randomly from it if needed. Its size supports statistical analysis, since any sample we draw would contain well over 1,000 observations.

Most of the variables are categorical and stored as strings, such as County, City, Make, and Electric Vehicle Type; Model Year is stored as an integer. The two quantitative variables of interest, Electric Range and Base MSRP, are stored as floats. Electric Range is the distance an EV can travel on a single battery charge, while Base MSRP is the manufacturer's suggested retail price of an EV without any additional features.

The important variables for our project are County and Electric Vehicle Type. The County column gives the county in which the EV is registered, and Electric Vehicle Type identifies the vehicle as a Plug-in Hybrid Electric Vehicle (PHEV) or a Battery Electric Vehicle (BEV). Another variable we might explore is Base MSRP, since we expect BEVs to be more common because of their lower price.

To prepare for the analysis, we store the dataset as a pandas DataFrame, drop unnecessary features to lessen the computational burden, check for missing values, and ensure that each feature has the correct data type. We also group the EVs by county to see the percentage split between BEVs and PHEVs in each county.
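As a minimal sketch of the per-county split described above, the cell below computes the BEV share for each county on a small hand-made frame. The toy rows and the names `toy`, `type_share`, and `bev_share` are illustrative assumptions, not part of our dataset; in the project the same call would presumably be applied to the cleaned frame.

```python
import pandas as pd

# Toy stand-in for the cleaned data (values are made up for illustration).
toy = pd.DataFrame({
    "County": ["King", "King", "King", "King", "Yakima", "Yakima"],
    "Electric Vehicle Type": [
        "Battery Electric Vehicle (BEV)",
        "Battery Electric Vehicle (BEV)",
        "Battery Electric Vehicle (BEV)",
        "Plug-in Hybrid Electric Vehicle (PHEV)",
        "Battery Electric Vehicle (BEV)",
        "Plug-in Hybrid Electric Vehicle (PHEV)",
    ],
})

# Rows are counties, columns are vehicle types, values are row-wise proportions.
type_share = pd.crosstab(
    toy["County"], toy["Electric Vehicle Type"], normalize="index"
)

# Share of BEVs among all EVs registered in each county.
bev_share = type_share["Battery Electric Vehicle (BEV)"]
print(bev_share)
```

Using `normalize="index"` keeps the counts-to-proportions step in one call, so each row of `type_share` sums to 1 regardless of how many vehicles a county has.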

## Electric Vehicle Population Data

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [3]:
ev_population = pd.read_csv('Electric_Vehicle_Population_Data.csv')
ev_population.head(3)

Unnamed: 0,VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
0,5UXTA6C0XM,Kitsap,Seabeck,WA,98380.0,2021,BMW,X5,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,30.0,0.0,35.0,267929112,POINT (-122.8728334 47.5798304),PUGET SOUND ENERGY INC,53035090000.0
1,5YJ3E1EB1J,Kitsap,Poulsbo,WA,98370.0,2018,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,215.0,0.0,23.0,475911439,POINT (-122.6368884 47.7469547),PUGET SOUND ENERGY INC,53035090000.0
2,WP0AD2A73G,Snohomish,Bothell,WA,98012.0,2016,PORSCHE,PANAMERA,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,15.0,0.0,1.0,101971278,POINT (-122.206146 47.839957),PUGET SOUND ENERGY INC,53061050000.0


In [4]:
#check number of observations and variables
ev_population.shape

(210165, 17)

In [5]:
#check datatypes
ev_population.dtypes

VIN (1-10)                                            object
County                                                object
City                                                  object
State                                                 object
Postal Code                                          float64
Model Year                                             int64
Make                                                  object
Model                                                 object
Electric Vehicle Type                                 object
Clean Alternative Fuel Vehicle (CAFV) Eligibility     object
Electric Range                                       float64
Base MSRP                                            float64
Legislative District                                 float64
DOL Vehicle ID                                         int64
Vehicle Location                                      object
Electric Utility                                      object
2020 Census Tract       

In [6]:
# drop irrelevant columns and keep variables that we might use for analysis
ev_clean = ev_population.drop(columns = ['VIN (1-10)', 
                                         'Legislative District', 
                                         'DOL Vehicle ID', 
                                         'Vehicle Location', 
                                         'Electric Utility', 
                                         '2020 Census Tract',
                                         'Electric Range'])
ev_clean.head()

Unnamed: 0,County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Base MSRP
0,Kitsap,Seabeck,WA,98380.0,2021,BMW,X5,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,0.0
1,Kitsap,Poulsbo,WA,98370.0,2018,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,0.0
2,Snohomish,Bothell,WA,98012.0,2016,PORSCHE,PANAMERA,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,0.0
3,Kitsap,Bremerton,WA,98310.0,2018,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,0.0
4,King,Redmond,WA,98052.0,2019,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,0.0


In [7]:
#check for missing values
ev_clean.isna().any()

County                                                True
City                                                  True
State                                                False
Postal Code                                           True
Model Year                                           False
Make                                                 False
Model                                                False
Electric Vehicle Type                                False
Clean Alternative Fuel Vehicle (CAFV) Eligibility    False
Base MSRP                                             True
dtype: bool

In [8]:
ev_clean['Base MSRP'].value_counts().head()

Base MSRP
0.0        206851
69900.0      1334
31950.0       367
52900.0       221
32250.0       142
Name: count, dtype: int64

Base MSRP contains both missing values and zeros, so we fill them with the mean Base MSRP of vehicles from the same make to limit possible bias.
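To make the fill concrete, here is a small sketch on a toy frame: zeros are treated as missing, each gap is filled with the mean of its make, and makes with no recorded price at all fall back to the mean of their vehicle type (the fallback used in the cell that follows). The toy values and the shortened type labels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: TESLA has two known prices, KIA has none, BMW has one.
toy = pd.DataFrame({
    "Make": ["TESLA", "TESLA", "TESLA", "KIA", "KIA", "BMW"],
    "Electric Vehicle Type": ["BEV", "BEV", "BEV", "BEV", "BEV", "PHEV"],
    "Base MSRP": [60000.0, 80000.0, 0.0, 0.0, np.nan, 50000.0],
})

# Treat 0 as missing (assumption: 0 marks an unrecorded price).
msrp = toy["Base MSRP"].replace(0, np.nan)

# Stage 1: fill gaps with the mean price of the same make.
msrp = msrp.fillna(msrp.groupby(toy["Make"]).transform("mean"))

# Stage 2: makes with no price at all fall back to the vehicle-type mean.
msrp = msrp.fillna(msrp.groupby(toy["Electric Vehicle Type"]).transform("mean"))

toy["Base MSRP"] = msrp
print(toy)
```

Note that the stage 2 mean is computed after stage 1, so already-imputed prices contribute to the vehicle-type mean. When most values are imputed, as the zero count above suggests, the filled column mostly reflects a handful of recorded prices.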


In [9]:
# treat 0 as missing; assign the result back instead of calling replace with inplace=True on a column selection
ev_clean['Base MSRP'] = ev_clean['Base MSRP'].replace(0, np.nan)
ev_clean['Base MSRP'] = ev_clean.groupby('Make')['Base MSRP'].transform(lambda x: x.fillna(x.mean()))

# for makes that have no base MSRP at all, fill with the mean for the vehicle type
ev_clean['Base MSRP'] = ev_clean.groupby('Electric Vehicle Type')['Base MSRP'].transform(lambda x: x.fillna(x.mean()))


We drop rows with missing County, City, or Postal Code values, since assigning those vehicles to a county at random could introduce significant bias into the analysis.

In [10]:
ev_clean = ev_clean.dropna(subset=['County', 'City', 'Postal Code'])

In [11]:
ev_clean.isna().any()

County                                               False
City                                                 False
State                                                False
Postal Code                                          False
Model Year                                           False
Make                                                 False
Model                                                False
Electric Vehicle Type                                False
Clean Alternative Fuel Vehicle (CAFV) Eligibility    False
Base MSRP                                            False
dtype: bool

In [15]:
ev_clean.head()

Unnamed: 0,County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Base MSRP
0,Kitsap,Seabeck,WA,98380.0,2021,BMW,X5,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,52859.547244
1,Kitsap,Poulsbo,WA,98370.0,2018,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,70175.777631
2,Snohomish,Bothell,WA,98012.0,2016,PORSCHE,PANAMERA,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,132440.0
3,Kitsap,Bremerton,WA,98310.0,2018,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,70175.777631
4,King,Redmond,WA,98052.0,2019,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,66840.223284


After handling missing values, we will start taking a look at the EV type distributions.

In [13]:
ev_clean['Electric Vehicle Type'].value_counts()

Electric Vehicle Type
Battery Electric Vehicle (BEV)            165552
Plug-in Hybrid Electric Vehicle (PHEV)     44609
Name: count, dtype: int64

In [14]:
#look at the distribution of the vehicle types across counties
ev_count_by_county = ev_clean.groupby(['County','Electric Vehicle Type'])['City'].size().reset_index(name='Count')
ev_count_by_county

Unnamed: 0,County,Electric Vehicle Type,Count
0,Ada,Battery Electric Vehicle (BEV),2
1,Adams,Battery Electric Vehicle (BEV),46
2,Adams,Plug-in Hybrid Electric Vehicle (PHEV),21
3,Alameda,Battery Electric Vehicle (BEV),4
4,Alameda,Plug-in Hybrid Electric Vehicle (PHEV),1
...,...,...,...
283,Yakima,Battery Electric Vehicle (BEV),912
284,Yakima,Plug-in Hybrid Electric Vehicle (PHEV),372
285,Yolo,Battery Electric Vehicle (BEV),3
286,York,Battery Electric Vehicle (BEV),1
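The table above includes counties such as Ada, Alameda, Yolo, and York, which are not Washington counties. The sketch below shows one way to restrict the data to Washington and then compute a chi-square statistic for independence between county and vehicle type. It assumes that out-of-state rows carry a `State` value other than `WA`; the toy rows, the shortened type labels, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows: two Washington counties plus one out-of-state registration.
rows = (
    [("WA", "King", "BEV")] * 3 + [("WA", "King", "PHEV")] * 1
    + [("WA", "Yakima", "BEV")] * 1 + [("WA", "Yakima", "PHEV")] * 3
    + [("CA", "Alameda", "BEV")] * 1
)
toy = pd.DataFrame(rows, columns=["State", "County", "Electric Vehicle Type"])

# Keep only vehicles whose State is WA, then build the county-by-type table.
wa_only = toy[toy["State"] == "WA"]
table = pd.crosstab(wa_only["County"], wa_only["Electric Vehicle Type"])

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected counts come from the row and column totals.
observed = table.to_numpy(dtype=float)
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot @ col_tot / observed.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(table)
print("chi2 =", chi2, "dof =", dof)
```

The toy counts are far too small for the chi-square approximation to be reliable; they only illustrate the arithmetic. With SciPy available, `scipy.stats.chi2_contingency(table)` would return the statistic together with a p-value.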


# Ethics & Privacy

**Issues with Privacy and Terms of Use:**

1. A dataset that includes information on vehicle population, school buses, ZEV sales, hydrogen refueling stations, and EV chargers may raise specific privacy and terms-of-use problems, for example by revealing how often someone uses an EV charger at a specific location. The data also contains location information, such as the ZIP code and city. Combined with the vehicle make and the VIN, the ZIP code could identify the owner to someone who lives in the same ZIP code.

**Mitigation of Privacy Risks:**

2. The team will carefully review the terms of use for the datasets and use the data accordingly, so that privacy standards are maintained. We will also use aggregation techniques so that no individual or small group can be identified from the location or frequency of EV infrastructure usage. To address these privacy issues, our team will implement strict data handling, with a clear commitment to protecting personal and community-level privacy throughout the research. We will restrict our analysis to the necessary variables, such as county and vehicle type, to avoid potential privacy issues.
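One possible form of the aggregation safeguard mentioned above is small-cell suppression: county-level counts below a minimum size are masked before being reported. The sketch below is only an illustration under stated assumptions; the threshold `K_MIN = 5` and the second county's counts are hypothetical, not a rule taken from the dataset's terms of use.

```python
import pandas as pd

K_MIN = 5  # hypothetical minimum cell size; not prescribed by the data source

# Toy county-level counts (the small counts are made up for illustration).
counts = pd.DataFrame({
    "County": ["Adams", "Adams", "Garfield", "Garfield"],
    "Electric Vehicle Type": ["BEV", "PHEV", "BEV", "PHEV"],
    "Count": [46, 21, 3, 1],
})

# Keep counts at or above the threshold; smaller cells become NaN (suppressed).
counts["Reported Count"] = counts["Count"].where(counts["Count"] >= K_MIN)
print(counts)
```

Masking rather than dropping the rows keeps the table's shape intact, so readers can see that a cell exists without learning its exact value.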



Our team acknowledges that this project might have some potential ethics or privacy issues. However, we shall address all potential biases or privacy concerns regarding the use of the data.

We used the "Electric Vehicle Population Data" dataset from data.wa.gov, which is intended for public access and use. Our question concerns the difference in preference for plug-in hybrid vehicles versus battery electric vehicles across the counties in Washington. We acknowledge that the data covers only Washington State, so extending the analysis to other states may be limited by various factors. However, the dataset comprises 210,165 observations drawn from counties across Washington. Its use and analysis could apply to similar counties with only minor ethical or bias issues.


# Team Expectations 


* *Communicate via Text. Respond to text within 24 hours. Weekly virtual meetings to finish weekly tasks.*
* *Respectfully give feedback. Do not be blunt or rude.*
* *Unanimous decision, but if there’s disagreement, then the decision will be made by majority vote.*
* *Cecilia will be the facilitator to ensure the project is on track for completion.*
* *No other specific roles, but tasks will be assigned or voluntarily taken.*
* *The load of tasks should be fair and equal among the team members.*
* *A list of current tasks and upcoming meetings will be posted in the group chat announcements section.*
* *When issues arise, communicate early with the team. Seek help from the team as soon as possible if you need it.*
* *If not able to finish certain tasks on time, take on a larger share of the tasks next time.*

# Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/25  |  10 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research, assign sections | 
| 10/30  | 5 PM  |Finish draft of proposal; Search for datasets  | Revise and submit proposal |
| 11/14 | 6 PM  | Cecilia finishes EDA| Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part |
| 11/20 | 6 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 11/27 | 10 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 12/4 |5 PM  | Complete analysis; Draft results/conclusion/discussion| Discuss/edit full project |
| 12/11 | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |