# Explanation of Script 

The dataset "Electric_Vehicle_Population_Data.csv" has been investigated and later cleaned.

The cleaned file is written to data/out/vehiclePopulation.csv.

**Cleaning/Transformation operations**

**(1.** Filtering data:
- State == 'WA'

**(2.** Dropping columns

**(3.** Dropping observations with missing values

**(4.** Altering string columns:
- 'Model'
- 'Make'

**(5.** Changing dtype of columns:
- 'DOL Vehicle ID'
- 'Model Year'

**(6.** Adding new columns:
- Separate 'Longitude' and 'Latitude' columns, created from the 'Vehicle Location' column (float)
- manufacturerCountry (str)
- americanOrForeign (str: 'American' or 'Foreign')

# Load relevant packages

In [1]:
import pandas as pd
import numpy as np

# (1 Investigate file

In [2]:
df = pd.read_csv("data/in/Electric_Vehicle_Population_Data.csv")

In [3]:
df.columns

Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
      dtype='object')

In [4]:
df['State'].unique()

array(['FL', 'NV', 'WA', 'IL', 'NY', 'VA', 'OK', 'KS', nan, 'CA', 'NE',
       'MD', 'CO', 'DC', 'TN', 'SC', 'CT', 'OR', 'TX', 'SD', 'HI', 'GA',
       'MS', 'AR', 'NC', 'MO', 'UT', 'PA', 'DE', 'OH', 'WY', 'AL', 'ID',
       'AZ', 'AK', 'LA', 'NM', 'WI', 'KY', 'NJ', 'MN', 'MA', 'ME', 'RI',
       'NH', 'ND'], dtype=object)

#### 'County' column

In [5]:
df['County'].unique()

array(['Monroe', 'Clark', 'Yakima', 'Skagit', 'Snohomish', 'Island',
       'Thurston', 'Grant', 'St. Clair', 'Pierce', 'Saratoga', 'Stevens',
       'King', 'Kitsap', 'Newport News', 'Jackson', 'Whitman', 'Lake',
       'Spokane', 'Clallam', 'Cowlitz', 'Kittitas', 'Grays Harbor',
       'Chelan', 'Whatcom', 'Benton', 'Walla Walla', 'Mason', 'San Juan',
       'Lewis', 'Jefferson', 'Douglas', 'Klickitat', 'Geary', 'Skamania',
       'Fairfax', nan, 'Franklin', 'Okanogan', 'Sonoma', 'Asotin',
       'Ferry', 'Pacific', 'Riverside', 'Orange', 'Wahkiakum',
       'Leavenworth', 'Contra Costa', 'Howard', 'Larimer',
       'District of Columbia', 'Washington', 'Tipton', 'San Diego',
       'Sumter', "Prince George's", 'New Haven', 'Lincoln', 'Las Animas',
       'Frederick', 'Adams', 'Hidalgo', 'Pend Oreille', 'Bexar',
       'Garfield', 'Pennington', 'Honolulu', 'Anne Arundel', 'Montgomery',
       'Houston', 'Charleston', 'Monterey', 'Kern', 'Napa', 'Loudoun',
       'Harrison', 'Pulaski'

#### 'Model Year' column

In [6]:
df['Model Year'].value_counts()

2022.0    26469
2021.0    18331
2018.0    14214
2020.0    11009
2019.0    10247
2017.0     8612
2016.0     5724
2015.0     4931
2013.0     4684
2014.0     3674
2023.0     1884
2012.0     1702
2011.0      838
2010.0       24
2008.0       23
2000.0       10
1999.0        3
2002.0        2
1998.0        1
1997.0        1
Name: Model Year, dtype: int64

The dataset has vehicles with model years from 1997 to 2023. Most of them are from 2022.
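Because `value_counts()` sorts by frequency, the first and last model years are easy to misread from the output. A minimal sketch of reading the range directly, using made-up toy data in place of the real file:

```python
import pandas as pd

# Toy stand-in for df['Model Year']; the values are illustrative only.
toy = pd.DataFrame({"Model Year": [1997.0, 2011.0, 2022.0, 2023.0, 2022.0]})

# min/max give the year range regardless of how the counts are ordered.
first_year = int(toy["Model Year"].min())
last_year = int(toy["Model Year"].max())

# sort_index() orders the counts chronologically instead of by frequency.
counts_by_year = toy["Model Year"].value_counts().sort_index()

print(f"Model years range from {first_year} to {last_year}")
print(counts_by_year)
```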

#### 'Make' column

In [7]:
df['Make'].value_counts()

TESLA             51976
NISSAN            12865
CHEVROLET         10139
FORD               5789
BMW                4676
KIA                4481
TOYOTA             4389
VOLKSWAGEN         2510
AUDI               2325
VOLVO              2285
CHRYSLER           1789
HYUNDAI            1410
JEEP               1147
RIVIAN              881
FIAT                822
PORSCHE             817
HONDA               792
MINI                629
MITSUBISHI          587
POLESTAR            557
MERCEDES-BENZ       505
SMART               272
JAGUAR              219
LINCOLN             168
CADILLAC            107
LUCID MOTORS         65
SUBARU               59
LAND ROVER           38
LEXUS                33
FISKER               20
GENESIS              18
AZURE DYNAMICS        7
TH!NK                 3
BENTLEY               3
Name: Make, dtype: int64

Tesla, Nissan and Chevrolet are the three most common makes in the dataset.

#### 'Electric Vehicle Type' column


In [8]:
df['Electric Vehicle Type'].value_counts()

Battery Electric Vehicle (BEV)            85875
Plug-in Hybrid Electric Vehicle (PHEV)    26508
Name: Electric Vehicle Type, dtype: int64

#### 'Electric Range' column

In [9]:
df['Electric Range'].describe()

count    112383.000000
mean         87.818166
std         102.329616
min           0.000000
25%           0.000000
50%          32.000000
75%         208.000000
max         337.000000
Name: Electric Range, dtype: float64

In [10]:
noElectricRange = len(df.query('`Electric Range` == 0'))

print(f'There are {noElectricRange} vehicles with no electric range data. This amounts to {noElectricRange/len(df)*100} percent')

There are 39156 vehicles with no electric range data. This amounts to 34.76392563524335 percent


#### 'Base MSRP' column

In [11]:
baseMSRP = len(df.query('`Base MSRP` == 0'))

print(f'{baseMSRP} observations have no base MSRP data. This amounts to {baseMSRP/len(df)*100} percent')

108878 observations have no base MSRP data. This amounts to 96.66530532521264 percent
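The 'Electric Range' and 'Base MSRP' checks follow the same pattern: count the rows where the value is 0 and express that as a share of all rows. A small helper could cover both. The name `zero_share` and the toy data are assumptions made for illustration, not part of the original script.

```python
import pandas as pd


def zero_share(frame, column):
    """Return (count, percentage) of rows where `column` equals 0."""
    zeros = int((frame[column] == 0).sum())
    return zeros, zeros / len(frame) * 100


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Electric Range": [0, 0, 32, 208],
    "Base MSRP": [0, 0, 0, 69900],
})

for col in ["Electric Range", "Base MSRP"]:
    count, pct = zero_share(toy, col)
    print(f"{col}: {count} zero values ({pct:.1f} percent)")
```

Formatting the percentage with `:.1f` keeps the printed message readable without changing the underlying calculation.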


#### 'Legislative District' column

In [12]:
df['Legislative District'].unique()

array([nan, 15., 39., 38.,  1., 21., 10., 40., 22., 13., 20.,  2., 32.,
        7., 46., 30., 35., 44., 14.,  9.,  3., 23., 24.,  5., 33., 45.,
       19., 27., 26., 25., 43., 17.,  6., 41., 37., 34., 31., 12., 28.,
       48., 49.,  4., 29., 36., 42.,  8., 18., 11., 16., 47.])

#### Checking for missing values in any of the columns

In [13]:
df.isna().sum()

VIN (1-10)                                             0
County                                               251
City                                                 251
State                                                251
Postal Code                                          251
Model Year                                           251
Make                                                 251
Model                                                271
Electric Vehicle Type                                251
Clean Alternative Fuel Vehicle (CAFV) Eligibility    251
Electric Range                                       251
Base MSRP                                            251
Legislative District                                 537
DOL Vehicle ID                                       251
Vehicle Location                                     275
Electric Utility                                     694
2020 Census Tract                                    251
dtype: int64

In [14]:
len(df[df.isnull().any(axis=1)])

733
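The per-column counts and the number of affected rows can also be combined into one summary with percentages. This is a sketch on made-up toy data; `summary` and `rows_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (made up for illustration).
toy = pd.DataFrame({
    "Make": ["TESLA", None, "NISSAN", "KIA"],
    "Model": ["Model 3", None, None, "Niro"],
    "Electric Range": [220.0, np.nan, 73.0, 26.0],
})

# isna().sum() counts missing values per column; isna().mean() gives the share.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": toy.isna().mean() * 100,
})

# Rows that have at least one missing value in any column.
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(summary)
print(f"Rows with at least one missing value: {rows_with_missing}")
```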

# (2 Cleaning data

#### (2.1 Filtering data

In [15]:
df = df.query("State == 'WA'")

#### (2.2 Dropping columns

In [16]:
df.columns

Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
      dtype='object')

In [17]:
df = df.drop(
    ['VIN (1-10)', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Base MSRP', '2020 Census Tract', 'Legislative District', 'Electric Utility']
    , axis='columns')

#### (2.3 Dropping observations with missing values

In [18]:
df = df.dropna().copy()

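Checking for duplicated rows. Note that `len(df.duplicated(keep=False))` is the length of the boolean mask, which always equals `len(df)`, so the two numbers below match whether or not duplicates exist. `df.duplicated(keep=False).sum()` would count the duplicated rows instead.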

In [19]:
len(df.duplicated(keep=False)), len(df)

(112058, 112058)
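`DataFrame.duplicated` returns one boolean per row, so its length says nothing about duplicates. Summing the mask is what counts them. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy data: the first two rows are identical (values are made up).
toy = pd.DataFrame({
    "Make": ["TESLA", "TESLA", "NISSAN"],
    "Model": ["Model 3", "Model 3", "Leaf"],
})

# keep=False marks every member of a duplicated group as True.
mask = toy.duplicated(keep=False)

# The mask has one entry per row, so its length always equals len(toy).
print(len(mask), len(toy))

# Summing the mask counts the rows that have at least one identical twin.
n_duplicated = int(mask.sum())
print(f"Duplicated rows: {n_duplicated}")

# drop_duplicates() keeps the first occurrence of each duplicated row.
deduplicated = toy.drop_duplicates()
```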

#### (2.4 Altering string columns

In [20]:
df['Model'] = df['Model'].str.title()
df['Make'] = df['Make'].str.upper()

#### (2.5 Changing dtype of columns

In [21]:
df['DOL Vehicle ID'] = df['DOL Vehicle ID'].astype(int)
df['Model Year'] = df['Model Year'].astype(int)
df.dtypes

County                    object
City                      object
State                     object
Postal Code              float64
Model Year                 int32
Make                      object
Model                     object
Electric Vehicle Type     object
Electric Range           float64
DOL Vehicle ID             int32
Vehicle Location          object
dtype: object

#### (2.6 Adding new columns

##### 'Longitude' and 'Latitude'

In [22]:
df['Vehicle Location'] = df['Vehicle Location'].str.replace('POINT (', '', regex=False)
df['Vehicle Location'] = df['Vehicle Location'].str.replace(')', '', regex=False)

In [23]:
# Pass n as a keyword: recent pandas versions no longer accept it positionally.
df[['Longitude', 'Latitude']] = df['Vehicle Location'].str.split(' ', n=1, expand=True)
df = df.astype({"Longitude": "float", "Latitude": "float"})
df = df.drop('Vehicle Location', axis=1)
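An alternative to the replace-then-split approach is to pull both numbers out in one step with `Series.str.extract` and named groups. This is a sketch that assumes every value has the form `POINT (<longitude> <latitude>)`; the regular expression is an illustration, not the author's method.

```python
import pandas as pd

# Toy data in the same 'POINT (lon lat)' format as 'Vehicle Location'.
toy = pd.DataFrame({
    "Vehicle Location": [
        "POINT (-120.50721 46.60448)",
        "POINT (-121.7515 48.53892)",
    ]
})

# Named groups become the column names of the extracted DataFrame.
pattern = (
    r"POINT \((?P<Longitude>-?\d+(?:\.\d+)?) "
    r"(?P<Latitude>-?\d+(?:\.\d+)?)\)"
)
coords = toy["Vehicle Location"].str.extract(pattern).astype(float)

# Replace the original text column with the two float columns.
toy = toy.drop(columns="Vehicle Location").join(coords)
print(toy)
```

Values that do not match the pattern come back as NaN, which makes malformed locations easy to spot before converting.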

In [24]:
# Remove observations whose coordinates fall outside WA.
# The bounds were gathered using (GPS Coordinates, Latitude and Longitude with Interactive Maps, n.d.); the full source is in the bibliography of the report.
# Clicking on the borders of the state in that tool gives the latitude and longitude ranges used below.
# Several observations had coordinates in Mexico and Malaysia, which we assumed were errors, so they are removed.
df = df.query("Latitude > 45.53359761204305 & Latitude < 49.00212680393799 & Longitude < -116.90669145505936 & Longitude > -124.75828882197806")
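The same bounding-box filter can be written with named constants and `Series.between`, which may be easier to read and adjust. This sketch reuses the bounds from the query above; the toy rows are made up, and `inclusive="neither"` (strict inequalities, as in the query) assumes pandas 1.3 or newer.

```python
import pandas as pd

# Bounds copied from the query above.
LAT_MIN, LAT_MAX = 45.53359761204305, 49.00212680393799
LON_MIN, LON_MAX = -124.75828882197806, -116.90669145505936

# Toy data: two points inside the box and one made-up outlier.
toy = pd.DataFrame({
    "City": ["Yakima", "Concrete", "Outlier"],
    "Latitude": [46.60448, 48.53892, 3.139],
    "Longitude": [-120.50721, -121.7515, 101.6869],
})

# Strictly inside the box on both axes.
inside_box = (
    toy["Latitude"].between(LAT_MIN, LAT_MAX, inclusive="neither")
    & toy["Longitude"].between(LON_MIN, LON_MAX, inclusive="neither")
)
filtered = toy[inside_box]
print(filtered)
```

A rectangle is only an approximation of the state's outline, so points just across the border but inside the box would still pass this filter.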

##### 'manufacturerCountry'

In [25]:
manufacturerCountry = pd.read_excel('data/in/manufacturerCountry.xlsx')
manufacturerCountry = manufacturerCountry.drop_duplicates()

In [26]:
# Convert 'Make' to upper case so it matches the 'Make' values in df and the two DataFrames can be joined.
manufacturerCountry['Make'] = manufacturerCountry['Make'].str.upper()

In [27]:
df2 = pd.merge(df,
                manufacturerCountry,
                on='Make', how='left')

In [28]:
df2[df2['manufacturerCountry'].isna()]['Make'].value_counts()

LUCID MOTORS      65
FISKER            19
GENESIS           18
AZURE DYNAMICS     7
TH!NK              3
Name: Make, dtype: int64
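Unmatched makes can also be found with the `indicator` option of `merge`, and `validate="many_to_one"` raises an error if a make appears more than once in the lookup table, which would otherwise silently multiply rows. This is a sketch on made-up toy data; `vehicles` and `lookup` are illustrative names.

```python
import pandas as pd

# Toy stand-ins for df and manufacturerCountry (values are illustrative).
vehicles = pd.DataFrame({"Make": ["TESLA", "NISSAN", "TH!NK"]})
lookup = pd.DataFrame({
    "Make": ["TESLA", "NISSAN"],
    "manufacturerCountry": ["United States", "Japan"],
})

# indicator=True adds a '_merge' column describing where each row matched.
# validate='many_to_one' checks that 'Make' is unique in the lookup table.
merged = vehicles.merge(
    lookup, on="Make", how="left", indicator=True, validate="many_to_one"
)

# Rows marked 'left_only' had no country in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Make"].value_counts()
print(unmatched)

# Drop the helper column once the check is done.
merged = merged.drop(columns="_merge")
```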

##### Adding the country manually for the remaining makes (looked up online)

In [29]:
df2.loc[df2['Make']=="LUCID MOTORS", "manufacturerCountry"] = 'United States'
df2.loc[df2['Make']=="FISKER", "manufacturerCountry"] = 'United States'
df2.loc[df2['Make']=="GENESIS", "manufacturerCountry"] = 'South Korea'
df2.loc[df2['Make']=="AZURE DYNAMICS", "manufacturerCountry"] = 'United States'
df2.loc[df2['Make']=="TH!NK", "manufacturerCountry"] = 'Norway'
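The manual assignments can also be kept in one dictionary and applied in a single step, which makes it simpler to add further makes later. This sketch uses the same make-to-country pairs as the cell above; `manual_countries` is an illustrative name and the toy rows are made up.

```python
import pandas as pd

# Same make-to-country pairs as assigned in the cell above.
manual_countries = {
    "LUCID MOTORS": "United States",
    "FISKER": "United States",
    "GENESIS": "South Korea",
    "AZURE DYNAMICS": "United States",
    "TH!NK": "Norway",
}

# Toy data: one make already has a country, two are still missing.
toy = pd.DataFrame({
    "Make": ["NISSAN", "GENESIS", "TH!NK"],
    "manufacturerCountry": ["Japan", None, None],
})

# map() looks up each make; fillna() only fills rows that are still missing.
toy["manufacturerCountry"] = toy["manufacturerCountry"].fillna(
    toy["Make"].map(manual_countries)
)
print(toy)
```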

##### 'americanOrForeign'

In [30]:
#np.where(condition, value if condition is true, value if condition is false)
df2['americanOrForeign'] = np.where(df2['manufacturerCountry'] == 'United States', 'American', 'Foreign')
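Note that `np.where` labels every row whose country is not exactly 'United States' as 'Foreign', including rows where the country is missing. If a true boolean flag is preferred over the two string labels, the comparison itself can be stored. This is a sketch on toy data; the column name `isAmerican` is hypothetical and not part of the original script.

```python
import numpy as np
import pandas as pd

# Toy data, including one row with a missing country.
toy = pd.DataFrame({"manufacturerCountry": ["United States", "Japan", None]})

# String labels, as in the cell above: a missing country ends up as 'Foreign'.
toy["americanOrForeign"] = np.where(
    toy["manufacturerCountry"] == "United States", "American", "Foreign"
)

# Boolean alternative (hypothetical column name): True only for exact matches.
toy["isAmerican"] = toy["manufacturerCountry"].eq("United States")

print(toy)
print(toy.dtypes)
```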

In [31]:
df2.dtypes

County                    object
City                      object
State                     object
Postal Code              float64
Model Year                 int32
Make                      object
Model                     object
Electric Vehicle Type     object
Electric Range           float64
DOL Vehicle ID             int32
Longitude                float64
Latitude                 float64
manufacturerCountry       object
americanOrForeign         object
dtype: object

In [32]:
df2.head(2)

| | County | City | State | Postal Code | Model Year | Make | Model | Electric Vehicle Type | Electric Range | DOL Vehicle ID | Longitude | Latitude | manufacturerCountry | americanOrForeign |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yakima | Yakima | WA | 98901.0 | 2011 | NISSAN | Leaf | Battery Electric Vehicle (BEV) | 73.0 | 218972519 | -120.50721 | 46.60448 | Japan | Foreign |
| 1 | Skagit | Concrete | WA | 98237.0 | 2017 | CHEVROLET | Bolt Ev | Battery Electric Vehicle (BEV) | 238.0 | 186750406 | -121.7515 | 48.53892 | United States | American |


# Export to CSV

In [33]:
df2.to_csv('data/out/vehiclePopulation.csv', index=False)