# Autovillage webscraper

In this notebook I will write a script that crawls and scrapes
this [website](https://www.autovillage.co.uk/used-car/filter/bodystyle/saloon). <br>
My goal is to create a data frame with the features below: <br>
1. make - the car's brand
2. model - the car's model
3. doors - number of doors the car has
4. body_type - saloon, hatchback, sport, SUV, etc.
5. transmission - automatic or manual
6. mileage - number of miles on the odometer
7. engine_size - in cc
8. price - in £
9. year - the car's year of registration

My plan of action is to drill down to the elements I want on the web page, build a for loop around the containers I find, and store my data in lists.

### The packages I will use:

In [1]:
# dataframe
import pandas as pd
import numpy as np

# Webscraping libraries
from urllib.request import urlopen # url inspector
from bs4 import BeautifulSoup
import re
from selenium import webdriver # connects to chrome browser
import warnings
warnings.filterwarnings('ignore')

# Web crawler imports
import requests
from requests import get

# Web crawlers random seeds/time delays
from time import sleep
from random import randint

# image viewer for cell outputs
from IPython.display import display, Markdown, Latex, Image, display_html, HTML

## Establishing a connection
I will start by establishing a connection to the Autovillage website, which will allow me to send requests and read the responses.

In [2]:
# establish my url as a string
my_url = 'https://www.autovillage.co.uk/used-car/filter/bodystyle/saloon'

In [3]:
my_client = urlopen(my_url) # open a connection to the webpage
autovillage_page = my_client.read() # read all of the HTML from the webpage
my_client.close() # close the connection once the HTML has been read
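As a minimal sketch of the same step, the connection can be wrapped in a small helper that sets a timeout and always closes the connection. The names `fetch_html` and `opener`, the timeout value, and the User-Agent string are my own assumptions for illustration; they are not part of the original notebook.

```python
from urllib.request import Request, urlopen


def fetch_html(url, opener=urlopen, timeout=10):
    """Return the raw bytes of a page and close the connection afterwards.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    helper can be tried out without touching the network.
    """
    # Hypothetical User-Agent header; some sites reject requests without one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (notebook scraper)"})
    # The with-block closes the connection even if read() raises.
    with opener(request, timeout=timeout) as response:
        return response.read()
```

With this helper, `autovillage_page = fetch_html(my_url)` would do the job of the two lines above.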

## Parse the HTML

Parsing turns the raw HTML of the response into a tree of Python objects that I can search and navigate.

In [4]:
# HTML parsing
page_soup = BeautifulSoup(autovillage_page, "html.parser")

In [10]:
page_soup.h1 # a tag can be viewed by using its name as an attribute of the soup

<h1>Used Cars for Sale</h1>
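To show what attribute access on the soup returns, here is a small sketch on a made-up HTML string (a heavily simplified stand-in for the real page, not its actual markup). It returns the first matching tag, or `None` when the tag does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page.
sample_html = """
<html><body>
  <h1>Used Cars for Sale</h1>
  <div class="ucatid20"><div class="item">2016 Audi A4</div></div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

heading = sample_soup.h1           # first <h1> tag in the document
heading_text = heading.get_text()  # the text inside that tag
missing = sample_soup.h2           # None, because there is no <h2>

print(heading)
print(heading_text)
print(missing)
```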

## Traverse the HTML
I capture exactly the elements I want by inspecting the web page's code. Once I have found the exact points within the HTML that I want to reference, I can save them as variables and use them as containers in my for loops.

In [None]:
page_soup.body # run this if you are a sadist and want to view the entire webpage HTML; otherwise use the inspector in your web browser

### Create container
The container is the html code that houses exactly what we need.

In [40]:
# only focus on the HTML that contains the info that's important to me
container = page_soup.findAll("div", {"class":"ucatid20"}) # found the ucatid20 class by inspecting the webpage and selecting the entire container

In [41]:
# check how many containers there are (should be 10 since 10 cars)
len(container) 


10

It found 10 containers on the page. This is correct, as I can see that 10 cars are listed on each of the webpages.
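The container lookup can be sketched on a made-up snippet. The HTML below is hypothetical and only mimics the class name found above. The dictionary filter used in this notebook and the `class_` keyword select the same tags, and `find_all` is the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two car listings and one unrelated div.
sample_html = """
<div class="ucatid20"><div class="item">2016 Audi A4</div></div>
<div class="ucatid20"><div class="item">2013 BMW 5 Series</div></div>
<div class="advert">not a car listing</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Both calls select the same tags; class_ is the keyword form of {"class": ...}.
by_dict = sample_soup.find_all("div", {"class": "ucatid20"})
by_keyword = sample_soup.find_all("div", class_="ucatid20")

print(len(by_dict), len(by_keyword))
```

Counting the matches, as done above with `len(container)`, is a cheap check that the class name picks up the listings and nothing else.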

### The images
For fun I will webscrape the images and loop the outputs. This will be a good starting base to structure my web scraper on.

In [63]:
# I will index into the html code that contains the images
container[0].findAll("div", {"class":"mb5"})[0].img
# change first number index to swap cars

<img alt="Audi A4" src="https://cdn-csnetworkstock.s3.amazonaws.com/audi/a4/22545/32872023/audi_a4_1_pl.jpg"/>

In [46]:
# find the image of a car: index into one car of the container and keep only the img tag within that container's HTML
image = container[0].findAll("div", {"class":"mb5"})

# use IPython display to render the HTML as a cell output
display(HTML(str(image[0].img)))

Great! The parsing worked, now let's loop!

In [47]:
# a for loop that iterates over all 10 containers and returns the images using ipython display output

for item in range(0,len(container)):
    display(HTML(str(container[item].findAll("div", {"class":"mb5"})[0].img)))
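A slightly more defensive variant of this loop, sketched on made-up HTML, iterates over the containers directly and skips listings without an image. The sample markup and the names `sample_containers` and `image_info` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second listing has no <img> tag.
sample_html = """
<div class="ucatid20"><div class="mb5"><img alt="Audi A4" src="https://example.com/audi_a4.jpg"/></div></div>
<div class="ucatid20"><div class="mb5"></div></div>
"""
sample_containers = BeautifulSoup(sample_html, "html.parser").find_all("div", class_="ucatid20")

image_info = []
for car in sample_containers:   # iterate over the tags themselves, no index needed
    img = car.find("img")       # None when a listing has no image
    if img is not None:
        image_info.append((img.get("alt"), img.get("src")))

print(image_info)
```

Collecting the `alt` and `src` attributes keeps the image data in a plain list, which can be displayed or stored later.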



## Parse containers
Now that my little demo worked, I can parse into the features I will need for my data frame. I will also look at the alinkContainer class, as I found that the website seems to store its car details in that container.

Just as above, I will test by parsing into 1 car container and once I get it right I will reverse engineer a for loop to iterate over all of the containers.

In [48]:
# use this container for our car names potentially
name_container = page_soup.findAll("div", {"class":"alinkContainer"})

In [49]:
# this is one path to get the name of the car but it might not be optimal
name_container[0].div.a

<a href="/used-car/audi/a4/mtz_76725_55392689">
Audi A4 2.0T FSI S Line 4dr Manual</a>

In [50]:
# price
pricing = name_container[1].findAll("div", {"class": "avprice"}) # our prices
pricing

[<div class="avprice">£10,700</div>]

In [51]:
# This container houses the price of the car; it is separate from the other containers that house the details.
pricing2 = name_container[1].find("div", class_="avprice") # our prices
pricing2 # find returns a single tag, so more find/findAll calls can be chained onto it without an error

<div class="avprice">£10,700</div>
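The difference between the two calls can be sketched on a one-line, made-up snippet: `find_all` (the current name for `findAll`) returns a list-like result even for a single match, while `find` returns the tag itself, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the price markup shown above.
sample_soup = BeautifulSoup(
    '<div class="alinkContainer"><div class="avprice">£10,700</div></div>',
    "html.parser",
)

all_prices = sample_soup.find_all("div", class_="avprice")  # list-like ResultSet
first_price = sample_soup.find("div", class_="avprice")     # a single Tag

price_text = first_price.get_text()                              # works on a Tag
no_match = sample_soup.find("div", class_="does-not-exist")     # None, not an error

print(len(all_prices), price_text, no_match)
```

Because `find` can return `None`, it is worth checking the result before chaining further calls when a listing might be missing a field.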

## Parsing into my features
All of my features except price sit in divs with the class `item`. The only thing required is to index into the correct feature using Python indexing.

In [52]:
# year + name 
year_name_model_html = container[0].div.findAll("div", {"class":"item"})[0]
year_name_model_html

<div class="item">
2016 Audi A4 

</div>

In [53]:
# engine + transmission
eng_tran_html = container[0].div.span
eng_tran_html

<span>1984cc Manual </span>

In [54]:
# door + body type
door_body_html = container[0].div.findAll("div", {"class":"item"})[2].span
door_body_html#.get_text()

<span> 4 Door Saloon</span>

In [55]:
# Mileage
mileage_html = container[0].div.findAll("div", {"class":"item"})[3].span
mileage_html#.get_text()

<span> 26,354 miles</span>

Great! I found all of my features. The code above will help me create my scraper.

## Looping my containers
In this section, I will create for loops that append the elements from my car containers to lists, which will become the data frame columns. I have already indexed into the right places; I just need to iterate over them. Only once I have reached the very end of my parsing do I call `get_text()` to extract the text, because it returns a plain string that cannot be searched any further. `.strip()` removes leading and trailing white space from that text.
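A quick sketch of what `get_text()` and `.strip()` do, using a made-up tag shaped like the `item` div shown earlier:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with the same stray newlines as the real "item" div.
item = BeautifulSoup('<div class="item">\n2016 Audi A4 \n\n</div>', "html.parser").div

raw_text = item.get_text()               # keeps the newlines and trailing space
clean_text = item.get_text().strip()     # trims white space at both ends
same_result = item.get_text(strip=True)  # Beautiful Soup's built-in shortcut

print(repr(raw_text))
print(repr(clean_text))
```

For a tag that holds a single piece of text, `get_text(strip=True)` gives the same result as calling `.strip()` afterwards.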

In [56]:
container2 = page_soup.findAll("div", {"class":"avprice"})
container2

[<div class="avprice">£15,500</div>,
 <div class="avprice">£10,700</div>,
 <div class="avprice">£10,650</div>,
 <div class="avprice">£15,990</div>,
 <div class="avprice">£11,950</div>,
 <div class="avprice">£26,950</div>,
 <div class="avprice">£7,195</div>,
 <div class="avprice">£11,750</div>,
 <div class="avprice">£10,370</div>,
 <div class="avprice">£59,490</div>]

In [57]:
price =[] # car price
year_make_model =[] # year made, brand name, model
eng_tran =[] # engine size and transmission type
door_body =[] # number of doors and body style
mileage =[] # number of miles on the odometer

# loop within container2 and return just the text
for item in container2:
    
    #price
    price.append(item.text)
    
for item in range(0,len(container)):
    
    #year, make, and model
    car_names= container[item].div.findAll("div", {"class":"item"})[0]
    year_make_model.append(car_names.get_text().strip())
    
    #engine size and transmission type
    tran = container[item].div.span
    eng_tran.append(tran.get_text())
    
    # number of doors and car body type
    door_bod = container[item].div.findAll("div", {"class":"item"})[2].span
    door_body.append(door_bod.get_text())
    
    # Car mileage
    car_mileage = container[item].div.findAll("div", {"class":"item"})[3].span
    mileage.append(car_mileage.get_text())
    

In [58]:
# Check to see if the features turned out alright
price

['£15,500',
 '£10,700',
 '£10,650',
 '£15,990',
 '£11,950',
 '£26,950',
 '£7,195',
 '£11,750',
 '£10,370',
 '£59,490']

In [59]:
# let's count how many cars we have in each feature; it should be 10
print("Rows in price:",len(price))
print("Rows in mileage:",len(mileage))
print("Rows in door count/body style:",len(door_body))
print("Rows in engine size/transmission:",len(eng_tran))
print("Rows in year/make/model:",len(year_make_model))

Rows in price: 10
Rows in mileage: 10
Rows in door count/body style: 10
Rows in engine size/transmission: 10
Rows in year/make/model: 10
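Since the prices come from a different container than the other details, the lists only line up if every listing has every field. The printed counts above can be turned into an automatic check. This is a minimal sketch, and `check_equal_lengths` is a hypothetical helper name, not something from the original notebook.

```python
def check_equal_lengths(columns):
    """Return the shared length of the lists in `columns`, or raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # All lengths are equal here; an empty dict counts as zero rows.
    return next(iter(lengths.values()), 0)


# Sample values taken from the outputs above.
sample_columns = {
    "price": ["£15,500", "£10,700"],
    "mileage": ["26,354 miles", "59,036 miles"],
}
row_count = check_equal_lengths(sample_columns)
print(row_count)
```

Equal lengths are necessary but not sufficient: they do not prove that each price belongs to the car in the same row, so a spot check against the web page is still worthwhile.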


## Make Our data frame
Test our web scraper by building a practice data frame.

In [60]:
car_df = pd.DataFrame({'price':price, 
                       'mileage':mileage, 
                       'door/body':door_body, 
                       'eng/tran':eng_tran, 
                       'year/make/model':year_make_model})

In [61]:
car_df

| | price | mileage | door/body | eng/tran | year/make/model |
|---|---|---|---|---|---|
| 0 | £15,500 | 26,354 miles | 4 Door Saloon | 1984cc Manual | 2016 Audi A4 |
| 1 | £10,700 | 59,036 miles | 4 Door Saloon | 1995cc Automatic | 2013 BMW 5 Series |
| 2 | £10,650 | 46,875 miles | 4 Door Saloon | 1598cc Automatic | 2015 Volkswagen Passat |
| 3 | £15,990 | 33,909 miles | 4 Door Saloon | 2143cc Manual | 2015 Mercedes-Benz C Class |
| 4 | £11,950 | 49,000 miles | 4 Door Saloon | 2143cc Automatic | 2013 Mercedes-Benz C Class |
| 5 | £26,950 | 16,000 miles | Saloon | 2993cc Manual | Jaguar XJ Series |
| 6 | £7,195 | 81,754 miles | 4 Door Saloon | 2400cc Automatic | 2012 Volvo S60 |
| 7 | £11,750 | 53,000 miles | 4 Door Saloon | 2143cc Automatic | 2013 Mercedes-Benz E Class |
| 8 | £10,370 | 75,238 miles | 4 Door Saloon | 1499cc Automatic | 2015 BMW 3 Series |
| 9 | £59,490 | 20,000 miles | 4 Door Saloon | 5900cc Automatic | 2014 Aston Martin Rapide |


Great! The web scraper worked! But it only covers 1 page, so we need to set up a web crawler to handle many other pages.
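The practice data frame still holds combined strings rather than the target features listed at the top. As a rough sketch of how they might be split later, here are a few regex helpers. All function names are hypothetical, the patterns only cover the formats seen in the output above, and splitting make from model is left out because multi-word makes such as "Aston Martin" make it ambiguous.

```python
import re


def parse_int(text):
    """'£15,500' -> 15500 and '26,354 miles' -> 26354; None if no digits."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def parse_engine_transmission(text):
    """'1984cc Manual ' -> (1984, 'Manual'); (None, None) if the format differs."""
    match = re.match(r"(\d+)cc\s+(\w+)", text.strip())
    if match is None:
        return None, None
    return int(match.group(1)), match.group(2)


def parse_door_body(text):
    """' 4 Door Saloon' -> (4, 'Saloon'); 'Saloon' -> (None, 'Saloon')."""
    match = re.match(r"(?:(\d+)\s+Door\s+)?(.+)$", text.strip())
    if match is None:
        return None, None
    doors = int(match.group(1)) if match.group(1) else None
    return doors, match.group(2)


def parse_year_make_model(text):
    """'2016 Audi A4' -> (2016, 'Audi A4'); the year is None when it is missing."""
    match = re.match(r"(?:((?:19|20)\d{2})\s+)?(.+)$", text.strip())
    if match is None:
        return None, None
    year = int(match.group(1)) if match.group(1) else None
    return year, match.group(2)


print(parse_int("£15,500"), parse_engine_transmission("1984cc Manual "))
print(parse_door_body("Saloon"), parse_year_make_model("Jaguar XJ Series"))
```

Returning `None` for a missing piece, as with the Jaguar row that has no year or door count, keeps such rows in the data so they can be handled during cleaning.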

## Save DF
Test saving the practice data frame to a CSV file.

In [61]:
save_path = '../Raw-Data/car_practice.csv'

In [62]:
car_df.to_csv(save_path)

## Creating My Webcrawler
Now that I know everything works on a single webpage, I will automate the process of scraping the remaining pages by creating a web crawler.

A crawler essentially goes through each page and applies your web scraper. It's very handy as I plan on using a few thousand cars for my data set.

In [40]:
# set up crawler

# this loop builds the URL of each page in my defined page range and scrapes it
for i in range(0,150): #range of pages to scrape
    url= 'https://www.autovillage.co.uk/used-car/page/{}/filter/bodystyle/saloon'.format(i)
    html= urlopen(url)
    autovillage_page= html.read()
    soup= BeautifulSoup(autovillage_page, "html.parser")
    
    # Define my containers so I can reference them in below loops
    # (search soup, the page just downloaded, not page_soup from the first page)
    container= soup.findAll("div", {"class":"ucatid20"})
    container2= soup.findAll("div", {"class":"avprice"})
    
    # iterate into my price container
    for item in container2:
        # create price feature
        price.append(item.text)
        
    # iterate into my other features container   
    for item in range(0,len(container)):
    
        #year, make, and model
        car_names= container[item].div.findAll("div", {"class":"item"})[0]
        year_make_model.append(car_names.get_text().strip())
    
        #engine size and transmission type
        tran = container[item].div.span
        eng_tran.append(tran.get_text())
    
        # number of doors and car body type
        door_bod = container[item].div.findAll("div", {"class":"item"})[2].span
        door_body.append(door_bod.get_text())
    
        # Car mileage
        car_mileage = container[item].div.findAll("div", {"class":"item"})[3].span
        mileage.append(car_mileage.get_text())
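The notebook imports `sleep` and `randint` for time delays, but the loop above does not pause between requests. Here is a sketch of a crawler with a random pause and basic error handling. The function `crawl` and its parameters are hypothetical, `fetch` and `parse` stand in for the download and scraping steps shown earlier, and the first valid page number on the site is something to check by hand.

```python
from random import randint
from time import sleep

# Page URL pattern taken from the crawler loop above.
BASE_URL = "https://www.autovillage.co.uk/used-car/page/{}/filter/bodystyle/saloon"


def crawl(pages, fetch, parse, min_delay=1, max_delay=3, pause=sleep):
    """Fetch and parse each page in turn, pausing between requests.

    fetch(url) returns a page's HTML and parse(html) returns a list of rows;
    both are supplied by the caller. pause is a parameter only so the delay
    can be swapped out when trying the function.
    """
    rows = []
    for page_number in pages:
        url = BASE_URL.format(page_number)
        try:
            html = fetch(url)
        except OSError as error:  # urllib's URLError and HTTPError subclass OSError
            print(f"Skipping page {page_number}: {error}")
            continue
        rows.extend(parse(html))
        pause(randint(min_delay, max_delay))  # random whole-second delay
    return rows
```

A random delay spreads the requests out so the crawler puts less load on the website, and skipping a failed page means one bad response does not end the whole run.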
    
    

Great! The web crawler worked! Now let's form a data frame out of the lists.

## Create the Data Frame

In [46]:
# Create the DataFrame
car_df_main = pd.DataFrame({'price':price, 
                       'mileage':mileage, 
                       'door/body':door_body, 
                       'eng/tran':eng_tran, 
                       'year/make/model':year_make_model})


In [52]:
car_df_main

| | price | mileage | door/body | eng/tran | year/make/model |
|---|---|---|---|---|---|
| 0 | £16,990 | 23,142 miles | 4 Door Saloon | 2494cc Automatic | 2015 Lexus GS |
| 1 | £139,850 | 8,976 miles | 4 Door Saloon | 6752cc Automatic | 2017 Bentley Mulsanne |
| 2 | £14,500 | 43,174 miles | 4 Door Saloon | 2179cc Automatic | 2015 Jaguar XF |
| 3 | £10,280 | 23,493 miles | 4 Door Saloon | 2000cc Manual | 2016 Mazda 6 |
| 4 | £9,590 | 25,122 miles | 4 Door Saloon | 1998cc Manual | 2014 Mazda 6 |
| 5 | £15,698 | 4,705 miles | 4 Door Saloon | 1999cc Automatic | 2015 Jaguar XE |
| 6 | £19,936 | 16,323 miles | 4 Door Saloon | 2143cc Manual | 2017 Mercedes-Benz C Class |
| 7 | £34,950 | 59,500 miles | 4 Door Saloon | 2891cc Automatic | 2017 Alfa Romeo Giulia |
| 8 | £14,795 | 77,410 miles | 4 Door Saloon | 2987cc Automatic | 2015 Mercedes-Benz E Class |
| 9 | £10,996 | 34,000 miles | 4 Door Saloon | 1995cc Manual | 2014 BMW 3 Series |


It worked! We created our data frame, and we can run this script as many times as we want to retrieve more cars in the future. The website changes the order of its listings between visits, so I am unable to reproduce this exact data frame.

## Main Data Frame Save
I will save my main created dataframe as a csv file in my raw data folder. I will then load it into my [Data Cleaning](https://github.com/PaulWill92/cars/blob/master/Jupyter-Notebooks/02-Data_Cleaning.ipynb) notebook and proceed to clean it.

In [None]:
# Create our save directory path
car_save_path = '../Raw-Data/saloon.csv'
car_df_main.to_csv(car_save_path) # running this cell again overwrites the previously saved file
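One detail worth knowing when saving: by default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. A small sketch with an in-memory buffer instead of a real file (the sample rows are taken from the outputs above):

```python
import io

import pandas as pd

sample_df = pd.DataFrame({"price": ["£15,500"], "mileage": ["26,354 miles"]})

with_index = io.StringIO()
sample_df.to_csv(with_index)                  # default: the index is written too
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)  # leave the index out

reloaded_with_index = pd.read_csv(io.StringIO(with_index.getvalue()))
reloaded_without_index = pd.read_csv(io.StringIO(without_index.getvalue()))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Passing `index=False` when saving, or `index_col=0` when loading, avoids carrying that extra column into the cleaning step.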