# Data Visualization (UOC)

The goal is to extend the dataset extracted from https://data.wa.gov/Transportation/Electric-Vehicle-Population-Data/f6w7-q2d2/data, which contains few numeric fields.
The following website was found: https://ev-database.org, which gives a detailed description of the characteristics of each electric vehicle.
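Conceptually, extending one dataset with attributes from another source amounts to a left join on the vehicle's make and model. The following is a minimal sketch of that idea, not the approach used in the cells below: the frames `registry` and `specs`, the make and model names, and the numeric value are all hypothetical placeholders.

```python
import pandas as pd

# Hypothetical registry rows: several registrations can share one model.
registry = pd.DataFrame({
    'Make': ['MAKE A', 'MAKE A', 'MAKE B'],
    'Model': ['MODEL X', 'MODEL X', 'MODEL Y'],
})

# Hypothetical scraped specifications, one row per model (placeholder value).
specs = pd.DataFrame({
    'Make': ['MAKE A'],
    'Model': ['MODEL X'],
    'Range (km)': [400.0],
})

# A left join keeps every registry row; models without specs get NaN.
extended = registry.merge(specs, on=['Make', 'Model'], how='left')
print(extended)
```

With this layout each model only needs to be looked up once, and every registration of that model receives the same values.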

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import nltk
import numpy as np
import re
import urllib.parse
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Alejandro\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Building the new dataset

We create the new columns that will hold the numeric data to be added.

In [3]:
df = pd.read_csv('./Electric_Vehicle_Population_Data.csv')
df['0-100 (s)'] = np.nan
df['VMax (km/h)'] = np.nan
df['Range (km)'] = np.nan
df['Batteries capacity (kWh)'] = np.nan
df['Efficiency (Wh/km)'] = np.nan
df['Price (€)'] = np.nan

We check for and drop the rows where the Model column contains NaN or NULL values.

In [4]:
df['Model'].isnull().sum()

84

In [5]:
df=df.dropna(subset=['Model'])

In [6]:
df['Model'].isnull().sum()

0

## Web scraping

In [7]:
%%capture
basepath = 'https://ev-database.org'

def remove_characters(s):
    # Keep only digits and the decimal point (this also drops thousands separators).
    return re.sub(r"[^0-9.]", "", s)

car_list = []

i = 0

for index, row in df.iterrows():
    if row['Electric Vehicle Type'] == 'Battery Electric Vehicle (BEV)':
        car = row['Make'].lower() + ' ' + row['Model'].lower()
        if row['Model'] + ' ' + row['Make'] in car_list:
            print('Already processed')
        else:
            print('Setting model {} of make {}.'.format(row['Model'], row['Make']))
            # Work on a copy so the assignments below do not trigger SettingWithCopyWarning.
            all_rows = df[(df['Make'] == row['Make']) & (df['Model'] == row['Model'])].copy()
            car_list.append(row['Model'] + ' '+ row['Make'])

            encoded_string = urllib.parse.quote(row['Make'] + ' ' + row['Model'] + '"')
            url = basepath + '/#title-filter:value="' + encoded_string
            print(url)
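            # Note: everything after '#' is a URL fragment, which HTTP clients do not send
            # to the server, so this request fetches the same base page for every model;
            # the actual matching is done below by comparing names.
            response = requests.get(url)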
            soup = BeautifulSoup(response.content, 'html.parser')
            list_items = soup.find_all('div', class_='list-item')
#             print(list_items)
            for item in list_items:
                sentences = nltk.sent_tokenize(item.text)
                for sentence in sentences:
                    sentence = sentence.strip()
                    lines = sentence.split('\n')
                    cleaned_list = list(filter(None, lines))
                    # Skip empty fragments; the first line is assumed to hold the vehicle name.
                    if cleaned_list and car in cleaned_list[0].lower():
#                         print(cleaned_list[0])
    #                 print(cleaned_list)
                        for idx, s in enumerate(cleaned_list):
                            if '0 - 100' in s:
                                print(cleaned_list[idx+1])
                                all_rows['0-100 (s)'] = remove_characters(cleaned_list[idx+1])
                            if 'Top Speed' in s:
                                all_rows['VMax (km/h)'] = remove_characters(cleaned_list[idx+1])
                            if 'Range*' in s:
                                all_rows['Range (km)'] = remove_characters(cleaned_list[idx+1])
                            if 'kWh' in s:
                                all_rows['Batteries capacity (kWh)'] = remove_characters(cleaned_list[idx])
                            if 'Wh/km' in s:
                                all_rows['Efficiency (Wh/km)'] = remove_characters(cleaned_list[idx])
                            if '€' in s:
                                all_rows['Price (€)'] = remove_characters(cleaned_list[idx+1])
                                break
                        break      
#                 break
            df.update(all_rows)
#         i = i + 1
#         if i == 2:
#             break
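The label matching in the loop above can be isolated into a small helper, which makes it testable without network access. The following is a minimal sketch, not the notebook's actual code: `to_float`, `parse_specs`, `NEXT_LINE_LABELS`, and the sample lines are hypothetical, and the assumed layout (a label on one line, its value on the next) is inferred from the loop rather than verified against the website.

```python
import re

def to_float(text):
    """Keep digits and dots only, then convert; return None if no valid number remains."""
    cleaned = re.sub(r"[^0-9.]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None

# Labels whose value is assumed to sit on the following line.
NEXT_LINE_LABELS = {
    "0 - 100": "0-100 (s)",
    "Top Speed": "VMax (km/h)",
    "Range*": "Range (km)",
}

def parse_specs(lines):
    """Map a list of text lines to a dict of numeric column values."""
    specs = {}
    for idx, line in enumerate(lines):
        for label, column in NEXT_LINE_LABELS.items():
            # Guard against a label appearing on the very last line.
            if label in line and idx + 1 < len(lines):
                specs[column] = to_float(lines[idx + 1])
        # Check 'Wh/km' first so an efficiency line is never read as a capacity.
        if "Wh/km" in line:
            specs["Efficiency (Wh/km)"] = to_float(line)
        elif "kWh" in line:
            specs["Batteries capacity (kWh)"] = to_float(line)
    return specs

# Hypothetical sample lines, not copied from the website.
sample = ["Example Make Model", "0 - 100", "7.5 sec", "Top Speed", "160 km/h",
          "Range*", "350 km", "60.0 kWh", "170 Wh/km"]
specs = parse_specs(sample)
print(specs)
```

Returning floats (or `None`) instead of strings keeps the target columns numeric, and the bounds check avoids an `IndexError` when a label is the last line of an item.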

We remove the vehicles that are not fully electric (plug-in hybrids).

In [8]:
df = df[df["Electric Vehicle Type"] != "Plug-in Hybrid Electric Vehicle (PHEV)"]
# print(df)

In [9]:
df = df.reset_index()

In [10]:
cols_to_plot = [18,19,20,21,22,23]
df_subset = df.iloc[:, cols_to_plot]
df.dtypes

index                                                  int64
VIN (1-10)                                            object
County                                                object
City                                                  object
State                                                 object
Postal Code                                          float64
Model Year                                           float64
Make                                                  object
Model                                                 object
Electric Vehicle Type                                 object
Clean Alternative Fuel Vehicle (CAFV) Eligibility     object
Electric Range                                       float64
Base MSRP                                            float64
Legislative District                                 float64
DOL Vehicle ID                                       float64
Vehicle Location                                      object
Electric Utility        
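Because `remove_characters` returns strings, the scraped columns may end up with an `object` dtype rather than `float64`, which matters before plotting. A minimal sketch of coercing them by name follows; the frame `demo` and its values are hypothetical, and only the column names come from the notebook.

```python
import pandas as pd

numeric_cols = ['0-100 (s)', 'VMax (km/h)', 'Range (km)',
                'Batteries capacity (kWh)', 'Efficiency (Wh/km)', 'Price (€)']

# Hypothetical miniature frame with string values, including an empty one.
demo = pd.DataFrame({
    'Range (km)': ['350', '410'],
    'Price (€)': ['45000', ''],
})

# Convert only the columns that are present; unparseable entries become NaN.
present = [c for c in numeric_cols if c in demo.columns]
demo[present] = demo[present].apply(pd.to_numeric, errors='coerce')
print(demo.dtypes)
```

Selecting the columns by name also avoids depending on fixed positions such as `[18, 19, 20, 21, 22, 23]`, which shift whenever a column is added or removed.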

In [11]:
df.to_csv('./test.csv', sep=',')