# Pandas
In addition to the dataframe class and its attendant methods and functions, Pandas also provides functions and methods for data collection. In particular,
Pandas supports reading a variety of important file formats such as CSV, Feather, and Excel spreadsheets. Pandas also supports reading data directly from
relational databases through libraries such as SQLAlchemy and sqlite3. Other database libraries, such as the connector for Snowflake data warehouses, also work well with Pandas,
so the core concepts and ideas carry over.

In [2]:
import pandas as pd

In [3]:
houses = pd.read_csv('./data/houses.csv') # reads data from CSV into dataframe

In [4]:
houses.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [5]:
houses.tail()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125
1459,1460,20,RL,75.0,9937,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2008,WD,Normal,147500


In [6]:
# imagine that this was done in another file to write to a sqlite database
from sqlite3 import connect
conn = connect('./data/mydb.sqlite') # connection object abstracts away inner working of database, so we can operate in service oriented manner
df = pd.DataFrame(data=[[0, '10/11/12'], [1, '12/11/10']],
                  columns=['int_column', 'date_column'])
df.to_sql('test_data2', conn) # can use to_sql method on dataframe to write dataframe to a table on a database using connection

In [7]:
df.head()

Unnamed: 0,int_column,date_column
0,0,10/11/12
1,1,12/11/10


In [8]:
# we now want to read from the database
from sqlite3 import connect
conn = connect('./data/mydb.sqlite')
sql = 'SELECT int_column, date_column FROM test_data2' # query to execute against the table written above; it has to be a SELECT query to extract data
df = pd.read_sql(sql, conn)

In [9]:
df.head()

Unnamed: 0,int_column,date_column
0,0,10/11/12
1,1,12/11/10
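
The write-then-read round trip above can be sketched in one self-contained piece. This is a minimal sketch under a few assumptions that are not in the original cells: it uses an in-memory SQLite database, a hypothetical table name `demo_table`, and the `if_exists='replace'` and `index=False` options so that the cell can be re-run and the dataframe index is not stored as a column.

```python
import sqlite3

import pandas as pd

# In-memory database so the sketch leaves no file behind (an assumption;
# the cells above use ./data/mydb.sqlite).
conn = sqlite3.connect(':memory:')

df_out = pd.DataFrame(
    data=[[0, '10/11/12'], [1, '12/11/10']],
    columns=['int_column', 'date_column'],
)

# if_exists='replace' lets the cell be re-run without a "table already exists" error;
# index=False keeps the dataframe index out of the table.
df_out.to_sql('demo_table', conn, if_exists='replace', index=False)

# Read the same rows back with a SELECT query.
df_in = pd.read_sql('SELECT int_column, date_column FROM demo_table', conn)
conn.close()
```

Without `if_exists`, a second call to `to_sql` with the same table name raises an error, which is worth keeping in mind when re-running notebook cells.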


# API (Application Programming Interface)
APIs allow us to interact with an application through an interface that abstracts away its inner workings.
APIs often come in the form of software libraries. When interacting with applications running on the web, a popular method is to use a REST API.
REST stands for Representational State Transfer.
We send HTTP requests to a server (a GET request to retrieve data) and receive the data in a structured format, usually JSON.
Some APIs require an API key, which moderates access and guards against misuse.
Some APIs are rate limited: only x requests are allowed in t minutes. ALWAYS respect rate limits or you might be blocked. You can use time.sleep between calls to enforce limits on your end.
Some APIs, despite using REST, provide wrapper libraries because the URIs formed during complex calls are cumbersome to work with.
These wrappers can be either third-party (PRAW) or official (Twitter).
Many APIs can be found at [https://github.com/public-apis/public-apis](https://github.com/public-apis/public-apis)
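
The advice about time.sleep can be sketched as a small helper. This is only an illustration: the function name `fetch_all` and its parameters are hypothetical, and the right interval depends on the limits published by the API you are calling.

```python
import time


def fetch_all(urls, fetch, min_interval=1.0):
    """Call fetch(url) for each URL, waiting at least min_interval seconds between calls."""
    results = []
    last_call = None
    for url in urls:
        if last_call is not None:
            # Sleep only for the part of the interval that has not already elapsed.
            elapsed = time.monotonic() - last_call
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
        last_call = time.monotonic()
        results.append(fetch(url))
    return results
```

With the requests library, a call could look like `fetch_all(urls, requests.get, min_interval=2.0)`. Passing the fetch function in as an argument keeps the throttling logic separate from the HTTP client.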

In [10]:
# queries the USGS Earthquake incident API for all earthquakes between the
# first and second of January 2023
url = 'https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2023-01-01&endtime=2023-01-02'
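
Writing the query string by hand gets error-prone as the number of parameters grows. As a sketch, the same URL can be assembled from a dictionary with the standard library; the variable names `base_url`, `params` and `built_url` are illustrative only.

```python
from urllib.parse import urlencode

base_url = 'https://earthquake.usgs.gov/fdsnws/event/1/query'
params = {
    'format': 'geojson',
    'starttime': '2023-01-01',
    'endtime': '2023-01-02',
}

# urlencode joins the pairs with '&' and escapes any special characters.
built_url = f'{base_url}?{urlencode(params)}'
print(built_url)
```

The requests library can do the same thing if you pass the dictionary as `params=` to `requests.get`.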

In [11]:
import requests # allows interaction through HTTP requests
import json
raw_earthquakes = requests.get(url)
print(raw_earthquakes.text) # string representation of JSON

{"type":"FeatureCollection","metadata":{"generated":1674683399000,"url":"https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2023-01-01&endtime=2023-01-02","title":"USGS Earthquakes","status":200,"api":"1.13.6","count":337},"features":[{"type":"Feature","properties":{"mag":4.2,"place":"4 km NNE of Jayapura, Indonesia","time":1672617572846,"updated":1673574738040,"tz":null,"url":"https://earthquake.usgs.gov/earthquakes/eventpage/us7000j3yb","detail":"https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000j3yb&format=geojson","felt":null,"cdi":null,"mmi":null,"alert":null,"status":"reviewed","tsunami":0,"sig":271,"net":"us","code":"7000j3yb","ids":",us7000j3yb,","sources":",us,","types":",origin,phase-data,","nst":19,"dmin":16.495,"rms":0.84,"gap":99,"magType":"mb","type":"earthquake","title":"M 4.2 - 4 km NNE of Jayapura, Indonesia"},"geometry":{"type":"Point","coordinates":[140.7278,-2.4906,35]},"id":"us7000j3yb"},
{"type":"Feature","properties":{"mag":3.6

In [12]:
type(raw_earthquakes.text)

str

In [13]:
earth_json = json.loads(raw_earthquakes.text) # converts the string representation to dictionary in Python

In [14]:
earth_json

{'type': 'FeatureCollection',
 'metadata': {'generated': 1674683399000,
  'url': 'https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2023-01-01&endtime=2023-01-02',
  'title': 'USGS Earthquakes',
  'status': 200,
  'api': '1.13.6',
  'count': 337},
 'features': [{'type': 'Feature',
   'properties': {'mag': 4.2,
    'place': '4 km NNE of Jayapura, Indonesia',
    'time': 1672617572846,
    'updated': 1673574738040,
    'tz': None,
    'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000j3yb',
    'detail': 'https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000j3yb&format=geojson',
    'felt': None,
    'cdi': None,
    'mmi': None,
    'alert': None,
    'status': 'reviewed',
    'tsunami': 0,
    'sig': 271,
    'net': 'us',
    'code': '7000j3yb',
    'ids': ',us7000j3yb,',
    'sources': ',us,',
    'types': ',origin,phase-data,',
    'nst': 19,
    'dmin': 16.495,
    'rms': 0.84,
    'gap': 99,
    'magType': 'mb',
    'type': 'earthquak

In [15]:
type(earth_json)

dict

In [16]:

with open('earthquakes.json', 'w') as fp:
    json.dump(earth_json, fp)
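
Saving the parsed response means later analysis does not need to hit the API again. A minimal sketch of the full round trip, assuming a small hand-made dictionary and a temporary directory rather than the real `earthquakes.json` file:

```python
import json
import os
import tempfile

# Small stand-in for the parsed API response (an assumption for this sketch).
sample = {'metadata': {'count': 1}, 'features': [{'id': 'us7000j3yb'}]}

# Write to a temporary directory so the sketch does not touch the working directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.json')
    with open(path, 'w') as fp:
        json.dump(sample, fp)   # dictionary -> JSON text on disk
    with open(path) as fp:
        restored = json.load(fp)  # JSON text on disk -> dictionary
```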

In [42]:
# let us now massage this data into a dataframe. Let us extract the magnitude, timestamp, long, lat, place

In [17]:
earth_json['features'][0]

{'type': 'Feature',
 'properties': {'mag': 4.2,
  'place': '4 km NNE of Jayapura, Indonesia',
  'time': 1672617572846,
  'updated': 1673574738040,
  'tz': None,
  'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000j3yb',
  'detail': 'https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us7000j3yb&format=geojson',
  'felt': None,
  'cdi': None,
  'mmi': None,
  'alert': None,
  'status': 'reviewed',
  'tsunami': 0,
  'sig': 271,
  'net': 'us',
  'code': '7000j3yb',
  'ids': ',us7000j3yb,',
  'sources': ',us,',
  'types': ',origin,phase-data,',
  'nst': 19,
  'dmin': 16.495,
  'rms': 0.84,
  'gap': 99,
  'magType': 'mb',
  'type': 'earthquake',
  'title': 'M 4.2 - 4 km NNE of Jayapura, Indonesia'},
 'geometry': {'type': 'Point', 'coordinates': [140.7278, -2.4906, 35]},
 'id': 'us7000j3yb'}

In [18]:
def process_incident(incident):
    properties = incident['properties']
    title = properties['title']
    magnitude = properties['mag']
    place = properties['place']
    time = properties['time']

    geometry = incident['geometry']
    longitude = geometry['coordinates'][0]
    latitude = geometry['coordinates'][1]
    record = [title, magnitude, place, time, longitude, latitude]
    return record

In [19]:
process_incident(earth_json['features'][0])

['M 4.2 - 4 km NNE of Jayapura, Indonesia',
 4.2,
 '4 km NNE of Jayapura, Indonesia',
 1672617572846,
 140.7278,
 -2.4906]

In [20]:
records = []
for incident in earth_json['features']:
    record = process_incident(incident)
    records.append(record)
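
As an alternative to looping by hand, pandas can flatten nested dictionaries with `pd.json_normalize`. The sketch below is not the approach used in this notebook. It assumes two hand-made features in the same shape as the USGS response; the second one is hypothetical and only there to show a missing place. It also assumes the third coordinate is the depth.

```python
import pandas as pd

# First feature abbreviated from the response above; second feature is hypothetical.
features = [
    {'properties': {'mag': 4.2, 'place': '4 km NNE of Jayapura, Indonesia'},
     'geometry': {'coordinates': [140.7278, -2.4906, 35]}},
    {'properties': {'mag': 1.0, 'place': None},
     'geometry': {'coordinates': [0.0, 0.0, 0]}},
]

# Nested keys become dotted column names such as 'properties.mag'.
flat = pd.json_normalize(features)

# The coordinates stay as a list per row, so split them into separate columns.
coords = pd.DataFrame(flat['geometry.coordinates'].tolist(),
                      columns=['long', 'lat', 'depth'])

quakes = pd.concat([flat[['properties.mag', 'properties.place']], coords], axis=1)
quakes.columns = ['magnitude', 'place', 'long', 'lat', 'depth']
```

The explicit loop is easier to debug when records are irregular, while `json_normalize` is shorter when every record has the same shape.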


In [22]:
records

[['M 4.2 - 4 km NNE of Jayapura, Indonesia',
  4.2,
  '4 km NNE of Jayapura, Indonesia',
  1672617572846,
  140.7278,
  -2.4906],
 ['M 3.6 - 103 km N of Suárez, Puerto Rico',
  3.62,
  '103 km N of Suárez, Puerto Rico',
  1672617413930,
  -65.7256,
  19.3601],
 ['M -0.8 - 85 km NNW of Karluk, Alaska',
  -0.76,
  '85 km NNW of Karluk, Alaska',
  1672617277550,
  -155.180333333333,
  58.2275],
 ['M 0.6 - 10km NW of The Geysers, CA',
  0.57,
  '10km NW of The Geysers, CA',
  1672617223560,
  -122.8415,
  38.8445],
 ['M 2.3 - ', 2.28, None, 1672617148580, -66.8565, 17.8848333333333],
 ['M 1.2 - 3km NNW of Fontana, CA',
  1.2,
  '3km NNW of Fontana, CA',
  1672616861690,
  -117.4708333,
  34.121],
 ['M 1.7 - 12 km NE of Smiley, Texas',
  1.7,
  '12 km NE of Smiley, Texas',
  1672616392936,
  -97.5368796,
  29.33441162],
 ['M 1.9 - 110 km NNW of Yakutat, Alaska',
  1.9,
  '110 km NNW of Yakutat, Alaska',
  1672616048220,
  -140.4536,
  60.4709],
 ['M 4.3 - 123 km NW of Mikuni, Japan',
  4.3,

In [23]:

earthquakes = pd.DataFrame(data=records,
                           columns=['title', 'magnitude', 'place', 'time', 'long', 'lat'])

In [24]:
earthquakes.head()

Unnamed: 0,title,magnitude,place,time,long,lat
0,"M 4.2 - 4 km NNE of Jayapura, Indonesia",4.2,"4 km NNE of Jayapura, Indonesia",1672617572846,140.7278,-2.4906
1,"M 3.6 - 103 km N of Suárez, Puerto Rico",3.62,"103 km N of Suárez, Puerto Rico",1672617413930,-65.7256,19.3601
2,"M -0.8 - 85 km NNW of Karluk, Alaska",-0.76,"85 km NNW of Karluk, Alaska",1672617277550,-155.180333,58.2275
3,"M 0.6 - 10km NW of The Geysers, CA",0.57,"10km NW of The Geysers, CA",1672617223560,-122.8415,38.8445
4,M 2.3 -,2.28,,1672617148580,-66.8565,17.884833


In [25]:
earthquakes.to_csv('./data/earthquakes.csv', quotechar='"', index=False) # this data needs to be cleaned before actual use!
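
As a hedged illustration of what that cleaning might involve, the sketch below takes two rows from the output above, converts the timestamps and drops the row with no place. It assumes the `time` values are milliseconds since the Unix epoch, and dropping incomplete rows is only one possible policy.

```python
import pandas as pd

# Two rows copied from the records shown above.
sample = pd.DataFrame(
    data=[
        ['M 4.2 - 4 km NNE of Jayapura, Indonesia', 4.2,
         '4 km NNE of Jayapura, Indonesia', 1672617572846, 140.7278, -2.4906],
        ['M 2.3 - ', 2.28, None, 1672617148580, -66.8565, 17.8848333333333],
    ],
    columns=['title', 'magnitude', 'place', 'time', 'long', 'lat'],
)

cleaned = sample.copy()
# Assumes the 'time' values are milliseconds since the Unix epoch.
cleaned['time'] = pd.to_datetime(cleaned['time'], unit='ms', utc=True)
# Assumed policy: drop rows with no place. Filling in a placeholder is also reasonable.
cleaned = cleaned.dropna(subset=['place']).reset_index(drop=True)
```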

# Scraping

While a file might technically be "machine-readable", it might not be in a structure that natively supports pulling specific pieces of data out of it.
These files typically encode a wide gamut of data and information, for example webpages (HTML) and PDF files. The act of pulling data from these files
is called scraping. In some cases, sophisticated tools like OCR might even be required.

Pulling data from websites is called web scraping, and the bread-and-butter library for web scraping in Python is Beautiful Soup (imported as bs4).

In [26]:
from bs4 import BeautifulSoup

In [27]:
mazda_mx_30_url = 'https://www.ancap.com.au/safety-ratings/mazda/mx-30/da006c'
mazda_mx_30_page = requests.get(mazda_mx_30_url)
mazda_mx_30_soup = BeautifulSoup(mazda_mx_30_page.content, "html.parser")

In [28]:
mazda_mx_30_soup

<!DOCTYPE html>
<html><head><meta content="width=device-width" name="viewport"/><meta charset="utf-8"/><title>Mazda MX-30 | Safety Rating &amp; Report | ANCAP</title><meta content="Take a look through the latest ANCAP safety ratings, assessment scores &amp; technical reports - including a range of crash test images and videos - for the Mazda MX-30." name="description"/><meta content="Mazda MX-30 Safety Rating &amp; Report | ANCAP" property="og:title"/><meta content="Explore ANCAP mazda Crash Test Ratings" property="og:description"/><meta content="http://cdn.ancap.com.au/app/public/assets/8fd98322ab91ad313955a7a00ec99cb48155c9f5/large.png?1616459020" property="og:image"/><meta content="https://www.ancap.com.au/safety-ratings/mazda/mx-30/da006c" property="og:url"/><meta content="8" name="next-head-count"/><style>/*cyrillic-ext*/@font-face{font-family:'Inter';font-style:normal;font-weight:300;font-display:swap;src:url(/fonts/s/inter/v12/UcC73FwrK3iLTeHuS_fvQtMwCp50KnMa2JL7SUc.woff2)format

In [29]:
car_name_elem = mazda_mx_30_soup.find('h1', class_='text-3xl sm:text-4xl md:text-6xl font-bold my-5') # find returns the first match; find_all returns a list of all matches

In [30]:
car_name_elem

<h1 class="text-3xl sm:text-4xl md:text-6xl font-bold my-5">Mazda MX-30</h1>

In [52]:
print(car_name_elem.text)

Mazda MX-30


In [31]:
car_star_elems = mazda_mx_30_soup.find_all('div', class_='w-4 h-4 mt-0.5')

In [32]:
car_star_elems

[<div class="w-4 h-4 mt-0.5"><svg class="w-full h-full" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg"><polygon fill-rule="evenodd" points="12 16.667 5 22 8 14 2 9.5 9.5 9.5 12 2 14.5 9.5 22 9.5 16 14 19 22"></polygon></svg></div>,
 <div class="w-4 h-4 mt-0.5"><svg class="w-full h-full" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg"><polygon fill-rule="evenodd" points="12 16.667 5 22 8 14 2 9.5 9.5 9.5 12 2 14.5 9.5 22 9.5 16 14 19 22"></polygon></svg></div>,
 <div class="w-4 h-4 mt-0.5"><svg class="w-full h-full" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg"><polygon fill-rule="evenodd" points="12 16.667 5 22 8 14 2 9.5 9.5 9.5 12 2 14.5 9.5 22 9.5 16 14 19 22"></polygon></svg></div>,
 <div class="w-4 h-4 mt-0.5"

In [55]:
num_stars = len(car_star_elems)

In [33]:
rankings = {
    'adult_occupant_protection': 'text-alt-yellow',
    'child_occupant_protection': 'text-alt-blue',
    'vulnerable_road_user_protection': 'text-alt-green',
    'safety_assist': 'text-alt-red'
}

class_base = 'font-bold text-5xl mt-3 mb-2'

In [34]:
mazda_safety_properties = {}
for prop, cls_key in rankings.items():
    full_class_name = f'{class_base} {cls_key}'
    elem = mazda_mx_30_soup.find('div', class_=full_class_name)
    text = elem.text
    mazda_safety_properties[prop] = text

In [35]:
mazda_safety_properties

{'adult_occupant_protection': '93%',
 'child_occupant_protection': '87%',
 'vulnerable_road_user_protection': '68%',
 'safety_assist': '74%'}

In [76]:
mazda_mx_30_soup.find_all('div', class_='font-bold text-5xl mt-3 mb-2 text-alt-yellow')

[<div class="font-bold text-5xl mt-3 mb-2 text-alt-yellow">93%</div>]
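
Passing a multi-class string to `class_` matches the exact value of the class attribute, so the lookup breaks if the site reorders its classes. The sketch below uses a hypothetical snippet modelled on the ANCAP markup to show the difference between that exact match and a CSS selector; it is an illustration, not a claim about how the live page is structured.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: same classes as above, but in a different order.
html = '<div class="text-alt-yellow font-bold text-5xl mt-3 mb-2">93%</div>'
snippet = BeautifulSoup(html, 'html.parser')

# Exact-string match: finds nothing here, because the class order differs.
exact = snippet.find('div', class_='font-bold text-5xl mt-3 mb-2 text-alt-yellow')

# CSS selector: matches any div carrying both classes, in any order.
by_selector = snippet.select_one('div.font-bold.text-alt-yellow')
```

Selectors are more tolerant of reordering, but neither approach survives a site redesign that renames the classes.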

In [36]:
def parse_ancap_car_page(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    name = soup.find('h1', class_='text-3xl sm:text-4xl md:text-6xl font-bold my-5').text
    num_stars = len(soup.find_all('div', class_='w-4 h-4 mt-0.5'))
    year = soup.find('p', class_='font-bold text-sm leading-snug').text

    rankings = {
        'adult_occupant_protection': 'text-alt-yellow',
        'child_occupant_protection': 'text-alt-blue',
        'vulnerable_road_user_protection': 'text-alt-green',
        'safety_assist': 'text-alt-red'
    }

    class_base = 'font-bold text-5xl mt-3 mb-2'
    safety_properties = {}
    for prop, cls_key in rankings.items():
        full_class_name = f'{class_base} {cls_key}'
        elem = soup.find('div', class_=full_class_name)
        if elem is None:
            text = ''
        else:
            text = elem.text
        safety_properties[prop] = text
    safety_properties['name'] = name
    safety_properties['num_stars'] = num_stars
    safety_properties['rating_year'] = year
    return safety_properties


In [37]:
mazda_30_safety = parse_ancap_car_page(mazda_mx_30_url)

In [38]:
mazda_30_safety

{'adult_occupant_protection': '93%',
 'child_occupant_protection': '87%',
 'vulnerable_road_user_protection': '68%',
 'safety_assist': '74%',
 'name': 'Mazda MX-30',
 'num_stars': 5,
 'rating_year': 'Jan 2021 - onwards'}

In [39]:
parse_ancap_car_page('https://www.ancap.com.au/safety-ratings/toyota/corolla-cross/530bf9')

{'adult_occupant_protection': '85%',
 'child_occupant_protection': '88%',
 'vulnerable_road_user_protection': '87%',
 'safety_assist': '83%',
 'name': 'Toyota Corolla Cross',
 'num_stars': 5,
 'rating_year': 'Jul 2022 - onwards'}

In [40]:
page = requests.get('https://www.ancap.com.au/safety-ratings/mazda')
soup = BeautifulSoup(page.content, 'html.parser')

In [41]:
import pyperclip
pyperclip.copy(str(soup))

In [44]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

In [45]:
import tqdm # for progress bar

In [46]:
driver = Firefox()

In [47]:
driver.get('https://www.ancap.com.au/safety-ratings/mazda')

In [48]:
elems = driver.find_elements(By.TAG_NAME, 'a')

In [50]:
cars = []
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None and '/safety-ratings/mazda/' in href:
        cars.append(href)

In [51]:
cars

['https://www.ancap.com.au/safety-ratings/mazda/bt-50/d7c175',
 'https://www.ancap.com.au/safety-ratings/mazda/mx-30/da006c',
 'https://www.ancap.com.au/safety-ratings/mazda/3/8fa82b',
 'https://www.ancap.com.au/safety-ratings/mazda/cx-30/dc6684',
 'https://www.ancap.com.au/safety-ratings/mazda/6/ac206f',
 'https://www.ancap.com.au/safety-ratings/mazda/cx-8/add913',
 'https://www.ancap.com.au/safety-ratings/mazda/cx-5/f239a6',
 'https://www.ancap.com.au/safety-ratings/mazda/cx-9/8aa28b',
 'https://www.ancap.com.au/safety-ratings/mazda/mx-5/6e3441',
 'https://www.ancap.com.au/safety-ratings/mazda/2/d944ee',
 'https://www.ancap.com.au/safety-ratings/mazda/cx-3/3ce139',
 'https://www.ancap.com.au/safety-ratings/mazda/bt-50/3e6178',
 'https://www.ancap.com.au/safety-ratings/mazda/3/5c3ab8',
 'https://www.ancap.com.au/safety-ratings/mazda/3/8c1df2',
 'https://www.ancap.com.au/safety-ratings/mazda/cx-5/650267',
 'https://www.ancap.com.au/safety-ratings/mazda/2/7e3c35',
 'https://www.ancap.co
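
A page can link to the same car more than once, in which case the list holds repeated URLs. A minimal sketch for removing exact duplicates while keeping the original order, using a hypothetical list with one repeated link:

```python
# Hypothetical list with a repeated link, as can happen when a page links to
# the same car from several places.
hrefs = [
    'https://www.ancap.com.au/safety-ratings/mazda/bt-50/d7c175',
    'https://www.ancap.com.au/safety-ratings/mazda/mx-30/da006c',
    'https://www.ancap.com.au/safety-ratings/mazda/bt-50/d7c175',
]

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order.
unique_hrefs = list(dict.fromkeys(hrefs))
```

This only removes identical URLs. Different rating pages for the same model still need a separate check on the car name.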

In [52]:
car_ratings = []
cars_seen = set()
for car_url in tqdm.tqdm(cars):
    ratings = parse_ancap_car_page(car_url)
    ratings['url'] = car_url
    name = ratings['name']
    if name not in cars_seen:
        car_ratings.append(ratings)
        cars_seen.add(name)


100%|██████████| 37/37 [00:19<00:00,  1.87it/s]


In [53]:
car_ratings

[{'adult_occupant_protection': '86%',
  'child_occupant_protection': '89%',
  'vulnerable_road_user_protection': '67%',
  'safety_assist': '84%',
  'name': 'Mazda BT-50',
  'num_stars': 5,
  'rating_year': 'Jul 2022 - onwards',
  'url': 'https://www.ancap.com.au/safety-ratings/mazda/bt-50/d7c175'},
 {'adult_occupant_protection': '93%',
  'child_occupant_protection': '87%',
  'vulnerable_road_user_protection': '68%',
  'safety_assist': '74%',
  'name': 'Mazda MX-30',
  'num_stars': 5,
  'rating_year': 'Jan 2021 - onwards',
  'url': 'https://www.ancap.com.au/safety-ratings/mazda/mx-30/da006c'},
 {'adult_occupant_protection': '98%',
  'child_occupant_protection': '89%',
  'vulnerable_road_user_protection': '81%',
  'safety_assist': '76%',
  'name': 'Mazda 3',
  'num_stars': 5,
  'rating_year': 'Apr 2019 - onwards',
  'url': 'https://www.ancap.com.au/safety-ratings/mazda/3/8fa82b'},
 {'adult_occupant_protection': '99%',
  'child_occupant_protection': '88%',
  'vulnerable_road_user_protecti

In [54]:
car_df = pd.DataFrame.from_records(car_ratings)

In [94]:
car_df.head(10)

Unnamed: 0,adult_occupant_protection,child_occupant_protection,vulnerable_road_user_protection,safety_assist,name,num_stars,rating_year,url
0,86%,89%,67%,84%,Mazda BT-50,5,Jul 2022 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
1,93%,87%,68%,74%,Mazda MX-30,5,Jan 2021 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
2,98%,89%,81%,76%,Mazda 3,5,Apr 2019 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
3,99%,88%,80%,76%,Mazda CX-30,5,Feb 2020 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
4,95%,91%,66%,73%,Mazda 6,5,Jun 2018 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
5,96%,87%,72%,73%,Mazda CX-8,5,Jul 2018 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
6,95%,80%,78%,59%,Mazda CX-5,5,Apr 2017 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
7,,,,,Mazda CX-9,5,Jul 2016 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
8,,,,,Mazda MX-5,5,Sep 2015 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...
9,,,,,Mazda 2,0,Jan 2023 - onwards,https://www.ancap.com.au/safety-ratings/mazda/...


In [55]:
car_df.to_csv('./data/mazda_safety.csv', index=False)
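
The scraped percentages are stored as strings such as '86%', and missing ratings as empty strings, so they cannot be averaged or sorted numerically as they are. A sketch of one way to convert them, using two rows copied from the table above and only two of the four rating columns:

```python
import pandas as pd

# Values copied from rows of the table above; '' marks a rating that was not found.
sample = pd.DataFrame({
    'name': ['Mazda BT-50', 'Mazda CX-9'],
    'adult_occupant_protection': ['86%', ''],
    'safety_assist': ['84%', ''],
})

pct_columns = ['adult_occupant_protection', 'safety_assist']
for col in pct_columns:
    # Strip the trailing '%' and let unparseable values (such as '') become NaN.
    sample[col] = pd.to_numeric(sample[col].str.rstrip('%'), errors='coerce')
```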

In [56]:
driver.get('https://www.ancap.com.au/safety-ratings/toyota')

In [None]:
elems = driver.find_elements(By.TAG_NAME, 'a')
cars = []
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None and '/safety-ratings/toyota/' in href:
        cars.append(href)
car_ratings = []
cars_seen = set()
for car_url in tqdm.tqdm(cars):
    ratings = parse_ancap_car_page(car_url)
    ratings['url'] = car_url
    name = ratings['name']
    if name not in cars_seen:
        car_ratings.append(ratings)
        cars_seen.add(name)
car_df = pd.DataFrame.from_records(car_ratings)

100%|██████████| 48/48 [00:23<00:00,  2.09it/s]
 31%|███▏      | 15/48 [00:05<00:11,  2.86it/s]


KeyboardInterrupt: 

In [59]:
car_df

Unnamed: 0,adult_occupant_protection,child_occupant_protection,vulnerable_road_user_protection,safety_assist,name,num_stars,rating_year,url
0,85%,88%,87%,83%,Toyota Corolla Cross,5,Jul 2022 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
1,89%,88%,81%,77%,Toyota Landcruiser,5,Jul 2021 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
2,90%,88%,76%,82%,Toyota Kluger / Highlander,5,Mar 2021 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
3,88%,87%,80%,83%,Toyota Mirai,5,Nov 2020 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
4,86%,86%,78%,82%,Toyota Yaris Cross,5,Aug 2020 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
5,86%,87%,78%,87%,Toyota Yaris,5,May 2020 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
6,95%,84%,88%,78%,Toyota Fortuner,5,Oct 2019 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
7,94%,88%,84%,79%,Toyota Granvia,5,Oct 2019 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
8,94%,88%,84%,77%,Toyota Hiace,5,May 2019 - onwards,https://www.ancap.com.au/safety-ratings/toyota...
9,96%,87%,88%,78%,Toyota Hilux,5,Jul 2019 - onwards,https://www.ancap.com.au/safety-ratings/toyota...


In [60]:
car_df.to_csv('./data/toyota_safety.csv', index=False)

In [61]:
driver.close()

MaxRetryError: HTTPConnectionPool(host='localhost', port=50259): Max retries exceeded with url: /session/58125c61-3707-4905-b5e0-748f51740eea/window (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcd41838f40>: Failed to establish a new connection: [Errno 61] Connection refused'))
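
The traceback suggests the browser session was no longer reachable when `driver.close()` ran, although the exact cause is not shown here. One defensive pattern is to wrap the driver in a context manager that always attempts cleanup and tolerates a session that has already gone. The helper name `managed_driver` is hypothetical, and this is a sketch rather than a Selenium feature.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver built by make_driver and always attempt quit() afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        try:
            driver.quit()
        except Exception:
            # The browser may already be gone (for example, closed by hand);
            # there is nothing left to clean up in that case.
            pass
```

With Selenium this could be used as `with managed_driver(Firefox) as driver:` followed by the scraping code, so the browser is shut down even if the loop is interrupted.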