In [1]:
#Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

#to display plots inline
%matplotlib inline

In [2]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# Access your dataset
data_path = '/content/gdrive/My Drive/Colab Notebooks/Electiric vehical/final_after_outlier_cleaned_data.csv'

In [4]:
df = pd.read_csv(data_path)
# Display the first few rows of the DataFrame
df.head()

Unnamed: 0,VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
0,1C4JJXP68P,Yakima,Yakima,WA,98901.0,2023,JEEP,WRANGLER,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,21,0,249905295,POINT (-120.4688751 46.6046178),PACIFICORP,53077000000.0
1,WBY8P6C05L,Kitsap,Kingston,WA,98346.0,2020,BMW,I3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,153,0,260917289,POINT (-122.5178351 47.7981436),PUGET SOUND ENERGY INC,53035090000.0
2,JTDKARFP1J,Kitsap,Port Orchard,WA,98367.0,2018,TOYOTA,PRIUS PRIME,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,25,0,186410087,POINT (-122.6530052 47.4739066),PUGET SOUND ENERGY INC,53035090000.0
3,5UXTA6C09N,Snohomish,Everett,WA,98208.0,2022,BMW,X5,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,30,0,186076915,POINT (-122.2032349 47.8956271),PUGET SOUND ENERGY INC,53061040000.0
4,JTMAB3FV7P,Thurston,Rainier,WA,98576.0,2023,TOYOTA,RAV4 PRIME,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,42,0,236505139,POINT (-122.6771414 46.8882415),PUGET SOUND ENERGY INC,53067010000.0


**Formulating and Preparing for Hypothesis Testing**

**I am formulating the hypothesis related to the proportion of BEVs in urban vs. non-urban areas.**

Hypothesis 1: Urbanization and BEV Adoption

Null Hypothesis (H0): There is no significant difference in the proportion of BEVs compared to PHEVs between urban and non-urban areas.

Alternative Hypothesis (H1): Urban areas have a significantly higher proportion of BEVs compared to PHEVs than non-urban areas.

**Creating an Urban vs. Non-Urban Classification**


Areas need to be classified as urban or non-urban. For simplicity, I assume that counties containing large cities are urban and that all other counties are non-urban.

In [5]:
# Counties treated as urban (a simplifying assumption based on their large cities)
urban_counties = ['King', 'Snohomish', 'Pierce', 'Clark', 'Spokane', 'Thurston']

# Creating a new column to classify urban vs non-urban areas
df['Urban'] = df['County'].apply(lambda x: 'Urban' if x in urban_counties else 'Non-Urban')

# Checking the distribution
df['Urban'].value_counts()


Urban,count
Urban,147615
Non-Urban,28211


**Performing Hypothesis Testing**

To compare the proportion of BEVs in urban vs. non-urban areas, I use a chi-square test of independence on the contingency table of area type by electric vehicle type.

In [6]:
import scipy.stats as stats

# Create a contingency table
contingency_table = pd.crosstab(df['Urban'], df['Electric Vehicle Type'])

# Perform the Chi-square test
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

# Display the results
print(f"Chi-square Statistic: {chi2}")
print(f"P-Value: {p}")

# Conclusion
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis - There is a significant difference in BEV adoption between urban and non-urban areas.")
else:
    print("Fail to reject the null hypothesis - No significant difference in BEV adoption between urban and non-urban areas.")


Chi-square Statistic: 744.5965126540892
P-Value: 6.0023304534558134e-164
Reject the null hypothesis - There is a significant difference in BEV adoption between urban and non-urban areas.
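With a sample this large, even small differences produce tiny p-values, so it may help to look at the row proportions and an effect size alongside the test. The following is a minimal sketch on a small made-up table: the counts are hypothetical and not taken from the dataset, and `cramers_v` is a helper name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V effect size for a contingency table (0 = no association, 1 = perfect)."""
    # Use the uncorrected statistic; Yates' correction is only applied to 2x2 tables
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical counts for illustration only (not taken from the dataset)
toy = pd.DataFrame(
    {"BEV": [80, 60], "PHEV": [20, 40]},
    index=["Urban", "Non-Urban"],
)

# Share of each vehicle type within each area type (each row sums to 1)
row_shares = toy.div(toy.sum(axis=1), axis=0)
print(row_shares)
print(f"Cramér's V: {cramers_v(toy):.3f}")
```

In the notebook itself, the same helper could be called as `cramers_v(contingency_table)`, and `pd.crosstab(df['Urban'], df['Electric Vehicle Type'], normalize='index')` would give the row shares directly.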


**Interpretation of Results:**

Chi-square Statistic: 744.60
The chi-square statistic of 744.60 shows that the observed frequencies differ from those expected under independence by far more than chance would explain. With 175,826 vehicles, however, the statistic is inflated by the sample size: the corresponding effect size is Cramér's V = sqrt(744.60 / 175,826) ≈ 0.065, which points to a statistically clear but weak association between area type (urban vs. non-urban) and vehicle type (BEV vs. PHEV).

P-Value: 6.00e-164
The p-value of 6.00e-164 is far below any common significance level (e.g., 0.05). It is the probability of seeing a statistic at least this large if area type and vehicle type were independent, and that probability is virtually zero.

Conclusion: Reject the null hypothesis
Given the very small p-value, I reject the null hypothesis: the BEV share of the EVs in the dataset differs significantly between urban and non-urban areas. The chi-square test is two-sided and measures association only, so on its own it does not show that the urban share is the higher one, as H1 states, nor that area type causes the difference. The direction has to be read from the BEV proportions in the contingency table.
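Because H1 is directional (urban higher than non-urban), a one-sided two-proportion z-test is one way to test the stated direction directly. This is a swapped-in technique, not the test used in the notebook; the sketch below uses only the standard library, the function name `one_sided_prop_ztest` is introduced here, and the counts are hypothetical.

```python
import math


def one_sided_prop_ztest(x1, n1, x2, n2):
    """z statistic and p-value for H1: p1 > p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 80 BEVs among 100 urban EVs vs. 60 BEVs among 100 non-urban EVs
z, p_value = one_sided_prop_ztest(80, 100, 60, 100)
print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
```

For a 2x2 table, the square of this z statistic equals the uncorrected chi-square statistic, so the two tests agree on strength of evidence; the z-test simply adds the direction.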

**Hypothesis 2: CAFV Eligibility and EV Adoption**

This hypothesis will test whether counties with higher CAFV (Clean Alternative Fuel Vehicle) eligibility rates have a higher overall EV adoption rate.

**Hypothesis Formulation**

Null Hypothesis (H0): There is no significant difference in EV adoption rates between counties with high CAFV eligibility and those with low CAFV eligibility.

Alternative Hypothesis (H1): Counties with higher CAFV eligibility rates have significantly higher EV adoption rates compared to those with lower CAFV eligibility rates.

**Preparing the Data for Hypothesis Testing**

**Calculating the proportion of EVs that are CAFV-eligible in each county, then classifying counties as "High CAFV Eligibility" or "Low CAFV Eligibility" relative to the median proportion**

In [7]:
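# Calculate the proportion of CAFV-eligible vehicles by county.
# Caution: the column holds full labels such as 'Clean Alternative Fuel Vehicle Eligible'
# (see the df.head() output), so the comparison with 'Eligible' below matches no rows
# and every county ends up with a proportion of 0.
county_cafv_proportion = df.groupby('County')['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].apply(lambda x: (x == 'Eligible').mean()).reset_index()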

# Determine the median CAFV eligibility proportion
median_cafv_proportion = county_cafv_proportion['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].median()

# Classify counties as "High CAFV Eligibility" or "Low CAFV Eligibility"
county_cafv_proportion['CAFV Eligibility Category'] = county_cafv_proportion['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].apply(lambda x: 'High' if x >= median_cafv_proportion else 'Low')

# Merge this classification back to the main dataframe
df = df.merge(county_cafv_proportion[['County', 'CAFV Eligibility Category']], on='County', how='left')

# Check the classification distribution
df['CAFV Eligibility Category'].value_counts()


CAFV Eligibility Category,count
High,175826

**Conducting Hypothesis Testing**

I perform a chi-square test to see whether the CAFV eligibility category (high vs. low) is associated with vehicle type (BEV vs. PHEV), which serves here as the stand-in for EV adoption because the dataset contains only EVs.

In [8]:
# Create a contingency table for CAFV eligibility category and EV type
contingency_table_cafv = pd.crosstab(df['CAFV Eligibility Category'], df['Electric Vehicle Type'])

# Perform the Chi-square test
chi2_cafv, p_cafv, dof_cafv, expected_cafv = stats.chi2_contingency(contingency_table_cafv)

# Display the results
print(f"Chi-square Statistic: {chi2_cafv}")
print(f"P-Value: {p_cafv}")

# Conclusion
alpha = 0.05
if p_cafv < alpha:
    print("Reject the null hypothesis - There is a significant difference in EV adoption rates between high and low CAFV eligibility counties.")
else:
    print("Fail to reject the null hypothesis - No significant difference in EV adoption rates between high and low CAFV eligibility counties.")


Chi-square Statistic: 0.0
P-Value: 1.0
Fail to reject the null hypothesis - No significant difference in EV adoption rates between high and low CAFV eligibility counties.
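A chi-square statistic of exactly 0.0 with a p-value of exactly 1.0 can arise when the contingency table has a single row, as it does when every county falls into one category. A small guard before the test makes that case visible. This is a sketch only: `chi2_or_none` is a helper name introduced here, and both tables are hypothetical.

```python
import pandas as pd
from scipy import stats


def chi2_or_none(table: pd.DataFrame):
    """Run chi2_contingency only when the table has at least 2 rows and 2 columns."""
    if table.shape[0] < 2 or table.shape[1] < 2:
        # Nothing to compare: the test would have zero degrees of freedom
        return None
    chi2, p, dof, _ = stats.chi2_contingency(table)
    return chi2, p, dof


# Hypothetical tables for illustration only
one_row = pd.DataFrame({"BEV": [70], "PHEV": [30]}, index=["High"])
two_rows = pd.DataFrame({"BEV": [80, 60], "PHEV": [20, 40]}, index=["High", "Low"])

print(chi2_or_none(one_row))
print(chi2_or_none(two_rows))
```

Printing `dof_cafv` next to the statistic in the notebook would serve the same purpose: a value of 0 signals that no comparison took place.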


**Interpretation of Results:**

Chi-square Statistic: 0.0
The chi-square statistic of 0.0 does not show that high- and low-eligibility counties have the same EV mix. The classification step placed all 175,826 vehicles in the "High" category and none in "Low", so the contingency table has a single row. With only one group there is nothing to compare: the expected frequencies equal the observed ones by construction, and the test has zero degrees of freedom.

P-Value: 1.0
The p-value of 1.0 follows from the same degenerate table. It is not evidence for or against the hypothesis.

Conclusion: Inconclusive (degenerate test)
Formally the code fails to reject the null hypothesis, but these results cannot support any claim about CAFV eligibility and EV adoption. A single category arises when every county has the same eligibility proportion, which happens if the label compared in the proportion calculation matches no rows: the column holds full labels such as "Clean Alternative Fuel Vehicle Eligible" rather than "Eligible". The classification needs to produce both categories before the test is rerun. Even then, the test compares the BEV/PHEV mix between county groups, not an adoption rate, because the dataset contains only EVs.