<a href="https://colab.research.google.com/github/Murad1997/DS-Portfolio/blob/main/Data_analysis_and_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Loading
  - We use the Electric Vehicle Population dataset, which is provided by the Washington State Department of Licensing (DOL) and is available [here](https://data.wa.gov/Transportation/Electric-Vehicle-Population-Data/f6w7-q2d2/about_data). It lists the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through the DOL.
  - The dataset has about `163K` rows and `17` columns. The columns are described below:

   <table>
  <thead>
      <tr>
        <th>Column Name</th>
        <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
        <td>VIN (1-10)</td>
        <td>The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).</td>
      </tr>
      <tr>
        <td>County</td>
        <td>This is the geographic region of a state that a vehicle's owner is listed to reside within. Vehicles registered in Washington state may be located in other states.</td>
      </tr>
      <tr>
        <td>City</td>
        <td>The city in which the registered owner resides.</td>
      </tr>
      <tr>
        <td>State</td>
        <td>This is the geographic region of the country associated with the record. These addresses may be located in other states.</td>
      </tr>
      <tr>
        <td>Postal Code</td>
        <td>The 5 digit zip code in which the registered owner resides.</td>
      </tr>
      <tr>
        <td>Model Year</td>
        <td>The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).</td>
      </tr>
      <tr>
        <td>Make</td>
        <td>The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).</td>
      </tr>
      <tr>
        <td>Model</td>
        <td>The model of the vehicle, determined by decoding the Vehicle Identification Number (VIN).</td>
      </tr>
      <tr>
        <td>Electric Vehicle Type</td>
        <td>This distinguishes the vehicle as all electric or a plug-in hybrid.</td>
      </tr>
      <tr>
        <td>Clean Alternative Fuel Vehicle (CAFV) Eligibility</td>
        <td>This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement in House Bill 2042 as passed in the 2019 legislative session.</td>
      </tr>
      <tr>
        <td>Electric Range</td>
        <td>Describes how far a vehicle can travel purely on its electric charge.</td>
      </tr>
      <tr>
        <td>Base MSRP</td>
        <td>This is the lowest Manufacturer's Suggested Retail Price (MSRP) for any trim level of the model in question.</td>
      </tr>
      <tr>
        <td>Legislative District</td>
        <td>The specific section of Washington State that the vehicle's owner resides in, as represented in the state legislature.</td>
      </tr>
      <tr>
        <td>DOL Vehicle ID</td>
        <td>Unique number assigned to each vehicle by Department of Licensing for identification purposes.</td>
      </tr>
      <tr>
        <td>Vehicle Location</td>
        <td>The center of the ZIP Code for the registered vehicle.</td>
      </tr>
      <tr>
        <td>Electric Utility</td>
        <td>This is the electric power retail service territories serving the address of the registered vehicle. All ownership types for areas in Washington are included: federal, investor owned, municipal, political subdivision, and cooperative. If the address for the registered vehicle falls into an area with overlapping electric power retail service territories then a single pipe | delimits utilities of same TYPE and a double pipe || delimits utilities of different types. We combined vehicle address and Homeland Infrastructure Foundation Level Database (HIFLD) (https://gii.dhs.gov/HIFLD) Retail_Service_Territories feature layer using a geographic information system to assign values for this field. Blanks occur for vehicles with addresses outside of Washington or for addresses falling into areas in Washington not containing a mapped electric power retail service territory in the source data.</td>
      </tr>
      <tr>
        <td>2020 Census Tract</td>
        <td>The census tract identifier is a combination of the state, county, and census tract codes as assigned by the United States Census Bureau in the 2020 census, also known as Geographic Identifier (GEOID). More information can be found here: https://www.census.gov/programs-surveys/geography/about/glossary.html#par_textimage_13 https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html</td>
      </tr>
  </tbody>
  </table>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import requests
import re

In [55]:
df = pd.read_csv('https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD')

## Renaming the columns for better readability
  - We rename every column by removing parentheses, replacing spaces with underscores, and converting the name to lowercase.

In [56]:
def func(s: str) -> str:
  # Drop parentheses, replace spaces with underscores, and lowercase the name.
  s = s.replace('(', '').replace(')', '').replace(' ', '_').lower()
  return s
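
To make the effect of the renaming concrete, here is a small sketch that applies the same transformation to a few of the original column names. The name `to_snake_case` and the empty `demo` frame are illustrative only and are not part of the notebook.

```python
import pandas as pd


def to_snake_case(name: str) -> str:
    # Same steps as the notebook's helper: drop parentheses,
    # replace spaces with underscores, lowercase everything.
    return name.replace("(", "").replace(")", "").replace(" ", "_").lower()


# Hypothetical empty frame that only carries a few of the original headers.
demo = pd.DataFrame(columns=["VIN (1-10)", "Postal Code", "Electric Vehicle Type"])
demo = demo.rename(columns=to_snake_case)

renamed = list(demo.columns)
print(renamed)
```

Note that hyphens are left untouched, so `VIN (1-10)` becomes `vin_1-10`. Such a column has to be accessed with bracket syntax (`df['vin_1-10']`) rather than attribute syntax.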

In [57]:
df.columns = df.columns.map(func)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163003 entries, 0 to 163002
Data columns (total 17 columns):
 #   Column                                           Non-Null Count   Dtype  
---  ------                                           --------------   -----  
 0   vin_1-10                                         163003 non-null  object 
 1   county                                           162999 non-null  object 
 2   city                                             162999 non-null  object 
 3   state                                            163003 non-null  object 
 4   postal_code                                      162999 non-null  float64
 5   model_year                                       163003 non-null  int64  
 6   make                                             163003 non-null  object 
 7   model                                            163003 non-null  object 
 8   electric_vehicle_type                            163003 non-null  object 
 9   clean_alternati

In [58]:
df.memory_usage(deep = True).sum()/(1024**2)

127.04358577728271

  - The dataset takes about `127 MB` of memory. It is worth inspecting the column datatypes to see whether we can reduce the memory usage without losing any information.
  - Many columns are stored with the `object` dtype (Python strings), which takes a lot of space. We therefore check whether these `object` columns can be converted to datatypes that need much less space.

In [59]:
df['state'].memory_usage(deep = True)/(1024**2)

9.17177677154541

  - We use the `memory_usage` method to measure the memory used by the `state` column. It takes more than `9 MB`, which is a lot for a single column, so it is worth digging further.
  - According to the column descriptions, the `state` column stores the state associated with each record as a string. Let us find out how many unique values it contains.
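
The same check can be run for every column at once. The following is a minimal sketch on a small made-up frame (the `demo` data is an assumption, not the real dataset). It compares the shallow and deep memory reports, which shows why `deep=True` matters for `object` columns.

```python
import pandas as pd

# Hypothetical toy data: one string column stored as object, one integer column.
demo = pd.DataFrame({
    "state": pd.Series(["WA", "WA", "OR", "WA", "CA"] * 1000, dtype="object"),
    "model_year": [2020, 2021, 2019, 2022, 2023] * 1000,
})

# deep=False counts only the pointers held in an object column;
# deep=True also counts the Python string objects they point to.
shallow = demo.memory_usage(deep=False)
deep = demo.memory_usage(deep=True)

report = (
    pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
    .sort_values("deep_bytes", ascending=False)
)
print(report)
```

For numeric columns the two numbers agree, while for `object` columns the deep figure is larger. Sorting by `deep_bytes` points to the columns where a dtype change is most likely to pay off.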

In [60]:
df['state'].nunique()

45

  - The `state` column has only `45` unique values. Instead of storing each value as a separate string, we convert the column to the `pandas` `category` dtype.

In [61]:
df['state'].astype('category').memory_usage(deep = True)/(1024**2)

0.15912818908691406

  - Converting to the `pandas` `category` dtype reduces the memory usage of this column from more than `9 MB` to about `0.16 MB`.
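
The saving comes from how a categorical column is stored: each unique label is kept once, and every row holds only a small integer code pointing to it. A short sketch on made-up values (the `states` series is illustrative) shows the two parts.

```python
import pandas as pd

# Hypothetical sample of state codes stored as plain Python strings.
states = pd.Series(["WA", "WA", "OR", "WA", "CA"], dtype="object")
as_cat = states.astype("category")

# The unique labels are stored once (sorted by default) ...
labels = list(as_cat.cat.categories)
# ... and each row keeps only a small integer code into that list.
codes = list(as_cat.cat.codes)

print(labels)
print(codes)
print(as_cat.cat.codes.dtype)
```

With few distinct labels the codes fit into `int8`, so each row costs one byte instead of a pointer plus a full Python string object.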

In [62]:
# Names of the object-dtype columns, skipping the first one (vin_1-10).
l = list(df.columns[df.dtypes==object][1:])

In [47]:
dtypes_mapping = {
    'county': 'category',
    'state': 'category',
}

In [48]:
df = df.astype(dtypes_mapping)

In [49]:
df.memory_usage(deep = True).sum()/(1024**2)

108.64638328552246
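
The mapping above is written by hand. As a possible generalization, low-cardinality `object` columns could be picked automatically. The sketch below is only an illustration under stated assumptions: the helper `categorize_low_cardinality`, its `max_unique_ratio` threshold, and the `demo` frame are all hypothetical and not part of the notebook.

```python
import pandas as pd


def categorize_low_cardinality(frame: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy in which object columns with few distinct values are categorical."""
    result = frame.copy()
    n_rows = max(len(result), 1)  # guard against division by zero on empty frames
    for col in result.select_dtypes(include="object").columns:
        # Convert only when distinct values are a small share of the rows;
        # near-unique columns (e.g. identifiers) gain nothing from categories.
        if result[col].nunique() / n_rows <= max_unique_ratio:
            result[col] = result[col].astype("category")
    return result


# Hypothetical toy data: a repetitive column and a unique identifier column.
demo = pd.DataFrame({
    "state": pd.Series(["WA", "OR"] * 500, dtype="object"),
    "vin": pd.Series([f"VIN{i:07d}" for i in range(1000)], dtype="object"),
    "model_year": [2020] * 1000,
})

compact = categorize_low_cardinality(demo)
before = demo.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(compact.dtypes)
print(before, after)
```

The threshold is a judgment call. A column whose values are almost all distinct should stay as it is, because a categorical would have to store every label plus a code per row.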

# Exploratory Data Analysis

# Modeling

# Conclusion