# Task 2 - Webcrawlers

## *Websites to be consumed*

### The webcrawlers in this task used the python library BeautifulSoup. There were two main websites used to get information for this task. The golftraders.com.au/drivers site to get drivers for sale and the golfmonthly.com/reviews/drivers/ site to get the reviews.

## *Rationale*

### The idea behind the whole project was to get golf drivers for sale and then reviews of numerous drivers. These could then be linked together with the final aim to provide a recommendation for the end user of the system. After much research the sites listed above were chosen as they provided multiple drivers and reviews. 

### Using a trading site for the golf drivers allowed a mix of both new and older drivers to be found. The beauty of this is that typically only the latest or newest drivers are available on a specific golf retailers website. This would make having enough cases and variation hard to achieve. 

### Similarly using the review site of a magazine rather than those provided by the manufacturer meant there would be less bias involved. It should be noted that the review site also sold the drivers it reviewed, however, it was an English site and all the prices were in pounds.

### The two webcrawlers will be explained in detail separately.

## Part 1 - Extracting the drivers for sale information

### Clicking on the golftraders link above will show the layout of the page from which the data was to be extracted. In general it consisted of a picture of a driver for sale with the price and some general details. The name of the driver that was displayed was also a link which took the viewer to another page where the specifics and further description of the driver were given. Only the driver name, the link to its' sale page and the price were extracted from the initial page. These were then supplemented by using the specifications given on the linked sale page. This supplemented information became important as it had the details that a buyer would be interested in. Namely the specific brand and model, whether the club was left or right handed, the loft of the driver, the shaft type and flex, the grip and the condition (new or used).

### The process to extract then became a two step problem. Firstly the driver name, link to the individual driver page and the price were collected and stored. Then the link to each individual driver was followed to and the specific information extracted. This extra information was then also stored enhancing the information for each driver. The original driver name extracting with the link included extra information and was quite messy. By using the specific brand and model fields extracted in the enhancement step above it was possible to join these names and then use the joined name field to replace the driver name. This will help in the joining stage with the reviews, shown in the next webcrawler section.

### The layout of the golftraders website was rather complicated. Several attempts were required to get the coding right to extract the correct information. There were multiple drop down menus and filtering options available on the page that could be ignored. With the final extraction being the information listed in a set of heading tags and the price being stored in a span tag.

### There was a copyright notice on the page. There were no issues using the site and no contact was made with the sites owners (either me contacted them or they contacting me. The terms of use page https://www.golftraders.com.au/terms_of_use  does not mention anything in regard to web scraping activities. So there were seemingly no legal impediments to using the site for this project.

### The code is shown below and comment as to what each part is doing. The explanation of the review webpage extraction comes before the code for it down the page.

# Extracting information from Golf Traders website https://www.golftraders.com.au/drivers/

In [1]:
# import statements used for both extractors

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

# Libraries for nltk
import nltk   #natural language toolkit
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

# for parsing the URLs
import urllib.parse 

# to output nice dataframe table
from IPython.display import display 

In [974]:
'''
Tokenizer function removes punctuation - this is for the reviews - want the full review to make sense to a reader
stemming and lematizing will be used later when performing NLP tasks in the recommender
'''

def tokenize_text(text):
    # tokenization to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    
    for t in tokens:        
        if re.search('[a-z0-9]', t): # use regex to get only letters and numbers - tokens already lower case
            filtered_tokens.append(t)       
    return filtered_tokens

## Define a page (https://www.golftraders.com.au/drivers/ ) that data will be extracted from and a BeautifulSoup object to do the scraping

In [951]:
# page and soup objects
page = requests.get("https://www.golftraders.com.au/drivers/")
soup = BeautifulSoup(page.content,'html.parser') 

### Perform some intial investigations into the page structure and extract some test features

In [952]:
# Get the new of a driver from the page
print(soup.find("h3", {"itemprop":"name"}).find("a")["title"]) 

New Callaway Mavrik Driver 10.5º Graphite with Shaft Options Cover H2919


In [953]:
# get the link for the associated page that will have to be followed to get the details
print(soup.find("h3", {"itemprop":"name"}).find("a")["href"])

https://www.golftraders.com.au/new-callaway-mavrik-driver-10.5-graphite-with-shaf


In [954]:
# get the price of a driver
print(soup.find("span", {"itemprop":"price"}))

<span itemprop="price">$549.00</span>


In [955]:
# Get the name of the driver and its' link
golf_drivers = soup.select("div h3 a",{"itemprop":"name","a":"title"})  

In [2]:
# check the format of output of the first value
display(golf_drivers[0:5])

NameError: name 'golf_drivers' is not defined

## Iterate through the whole list of drivers to extract the driver name and driver link

In [1004]:
# Use this to get the Driver title and it's link for more information
driver_brand_model =[]
driver_link = []
j=0 # simple counter
for d in golf_drivers:
    try:           # catch any errors 
        # ignore the first value as that is None - see print is code above above 1st value is filter [<a class="accordion-toggle" ....
        if j == 0: 
            pass    
        else:
            driver_link.append(d.get('href'))
            driver_brand_model.append(d.get('title').lower())   # set to lower case.
    except:
        pass # just continue if there is an error
    j+=1

In [1122]:
# check the first 5 values - format looks good but messy including too much info and will be cleaned later
driver_brand_model[0:5]

['new callaway mavrik driver 10.5º graphite with shaft options cover h2919',
 'cobra ladies fly-z marine driver 10.5-13.5° ladies flex+hc+tool -lh -new #e3494',
 'cobra max blue offset driver hl 15° graphite ladies + cover - new lh #e3468',
 'new cobra king radspeed driver 9º motore x f1 stiff flex cover tool h4867',
 'new cobra king sz white / black driver 10.5º graphite regular hc tool h3307']

In [1123]:
# check the first 5 values - looks correct
driver_link[0:5]

['https://www.golftraders.com.au/new-callaway-mavrik-driver-10.5-graphite-with-shaf',
 'https://www.golftraders.com.au/cobra-ladies-fly-z-marine-driver-10.5-13.5-ladies',
 'https://www.golftraders.com.au/cobra-max-blue-offset-driver-hl-15-graphite-ladies',
 'https://www.golftraders.com.au/new-cobra-king-radspeed-driver-9-motore-x-f1-stiff',
 'https://www.golftraders.com.au/new-cobra-king-sz-white-black-driver-10.5-graphite']

In [1124]:
# Get the price of the drivers
price_drivers = soup.select("div p span",{"itemprop":"price"}) 

In [1125]:
# chack the value of the first 5 items
price_drivers[0:5]

[<span itemprop="price">$549.00</span>,
 <span itemprop="price">$329.00</span>,
 <span itemprop="price">$399.00</span>,
 <span itemprop="price">$799.00</span>,
 <span itemprop="price">$549.00</span>]

In [1126]:
price =[]
for p in price_drivers:    
    remove_dollar_comma = re.sub('[\$,]', '', str(p)) # get rid of the $ sign
    keep_numbers = re.findall(r"[\d,]*\.\d\d", str(remove_dollar_comma)) # remove the commas in pprice over 1000
    price.append(keep_numbers)    

In [1127]:
# check the first 5 prices 
price[0:5]

[['549.00'], ['329.00'], ['399.00'], ['799.00'], ['549.00']]

# With Driver model, the specific driver link and the price a dataframe can be made. 
## Other columns for Loft, Dexterity, Shaft, Flex, Grip and Condition are created to be filled using the specific link assocaited with each driver.

In [1128]:
# create a dataframe of the driver, link and price. Make columns information  Loft, Dexterity, Shaft, Flex, Grip, Condition
df_drivers = pd.DataFrame([(i,j,k) for i,j,k in zip(driver_brand_model,driver_link,price)], 
                          columns=['Driver_Name','Link','Price'])

# need to add more columns for later adding of data from each indivudual page

df_drivers['Brand']=''
df_drivers['Model']=''
df_drivers['Loft']=''
df_drivers['Dexterity']=''
df_drivers['Shaft']=''
df_drivers['Flex']=''
df_drivers['Grip']=''
df_drivers['Condition']=''

df_drivers.head()

Unnamed: 0,Driver_Name,Link,Price,Brand,Model,Loft,Dexterity,Shaft,Flex,Grip,Condition
0,new callaway mavrik driver 10.5º graphite with...,https://www.golftraders.com.au/new-callaway-ma...,[549.00],,,,,,,,
1,cobra ladies fly-z marine driver 10.5-13.5° la...,https://www.golftraders.com.au/cobra-ladies-fl...,[329.00],,,,,,,,
2,cobra max blue offset driver hl 15° graphite l...,https://www.golftraders.com.au/cobra-max-blue-...,[399.00],,,,,,,,
3,new cobra king radspeed driver 9º motore x f1 ...,https://www.golftraders.com.au/new-cobra-king-...,[799.00],,,,,,,,
4,new cobra king sz white / black driver 10.5º g...,https://www.golftraders.com.au/new-cobra-king-...,[549.00],,,,,,,,


In [1129]:
# to remove square brackets from the Price
df_drivers['Price'] = df_drivers['Price'].str.get(0)

In [1130]:
# View the data
df_drivers.head()

Unnamed: 0,Driver_Name,Link,Price,Brand,Model,Loft,Dexterity,Shaft,Flex,Grip,Condition
0,new callaway mavrik driver 10.5º graphite with...,https://www.golftraders.com.au/new-callaway-ma...,549.0,,,,,,,,
1,cobra ladies fly-z marine driver 10.5-13.5° la...,https://www.golftraders.com.au/cobra-ladies-fl...,329.0,,,,,,,,
2,cobra max blue offset driver hl 15° graphite l...,https://www.golftraders.com.au/cobra-max-blue-...,399.0,,,,,,,,
3,new cobra king radspeed driver 9º motore x f1 ...,https://www.golftraders.com.au/new-cobra-king-...,799.0,,,,,,,,
4,new cobra king sz white / black driver 10.5º g...,https://www.golftraders.com.au/new-cobra-king-...,549.0,,,,,,,,


# With initial data in the dataframe use the link to add extra information from the description section on each individual page
## Create some more fields to fill :  Brand, Model, Loft, Dexterity, Shaft, Flex, Grip, Condition

In [1131]:
'''
Method to remove tags and some words from a string (what is returned as a soup object). 
Convert to string and replace with empty string
'''

# simple method to remove <p> tags and <span.....> handed degrees symbol
def remove_tags (tags_remove):
    tags_to_remove=str(tags_remove) # make sure it is a string
    tags_to_remove=tags_to_remove.replace('°','') # to remove the degrees symbol off the loft
    tags_to_remove=tags_to_remove.replace('º','') # to remove the degrees symbol off the loft sometimes it is underlined
    tags_to_remove=tags_to_remove.replace('<p>','')
    tags_to_remove=tags_to_remove.replace('</p>','')
    tags_to_remove=tags_to_remove.replace('<span style="line-height: 20.8px;">','')
    tags_to_remove=tags_to_remove.replace('</span>','')
    return tags_to_remove.lower() # return cleaned string in lower case.

In [1132]:
'''
Method to fill the extra fileds within df_driver dataframe. Get the links associated with each driver (represents a new webapge). 
Then follow that page and extract the information from the specifications table. Need to split the Dex field to get only left or right from lefthanded
'''

def df_driver_fill_extra_fields():
    # now fill df_drivers with information from URL stored 
    link    = df_drivers['Link']

    for l in link:    
    
        try: # use a try/catch to catch exceptions - so error deosn't get thrown
            each_driver_page = requests.get(l)
            each_driver_soup = BeautifulSoup(each_driver_page.content,'html.parser')
            # get the specifications from inside the table
            specifications = each_driver_soup.select("table tbody tr td p")  
            # get the values from the table as elements of specification
            Brand =     remove_tags(specifications[3])  
            Model =     remove_tags(specifications[5])  
            Loft =      remove_tags(specifications[9])    
            Dex =       remove_tags(specifications[13]) 
            Dexterity = Dex.split()[0] # use this to split 'right handed' into 'right' and 'handed' and then only take first value i.e.right (or left)
            Shaft  =    remove_tags(specifications[15])
            Flex   =    remove_tags(specifications[17])
            Grip   =    remove_tags(specifications[19])
            Condition = remove_tags(specifications[21])
            
            # now fill the fields in df_drivers with supplemented data.
            df_drivers.loc[df_drivers.Link == l,'Brand'] = Brand
            df_drivers.loc[df_drivers.Link == l,'Model'] = Model
            df_drivers.loc[df_drivers.Link == l,'Loft'] = Loft
            df_drivers.loc[df_drivers.Link == l,'Dexterity'] = Dexterity
            df_drivers.loc[df_drivers.Link == l,'Shaft'] = Shaft
            df_drivers.loc[df_drivers.Link == l,'Flex'] = Flex
            df_drivers.loc[df_drivers.Link == l,'Grip'] = Grip
            df_drivers.loc[df_drivers.Link == l,'Condition'] = Condition
        except: 
            pass # simply catch the exception and continue
           


In [1133]:
# run the function above to fill the extra fields
df_driver_fill_extra_fields()

# Now show the dataframe of the names and specifications of the drivers from golftraders website.
## The general nature of the data can be seen. The Driver_Name is messy due to each original name being a specific driver for sale. But it can be recreated by joining the Brand and Model fields. 

In [1134]:
df_drivers.head()

Unnamed: 0,Driver_Name,Link,Price,Brand,Model,Loft,Dexterity,Shaft,Flex,Grip,Condition
0,new callaway mavrik driver 10.5º graphite with...,https://www.golftraders.com.au/new-callaway-ma...,549.0,callaway,mavrik,10.5,right,graphite - options available,options available,lamkin crossline black midsize rubber grip,new
1,cobra ladies fly-z marine driver 10.5-13.5° la...,https://www.golftraders.com.au/cobra-ladies-fl...,329.0,cobra,cobra ladies fly-z ultra marine,10.5-13.5 adjustable,left,graphite matrix vlct -sp 50g,ladies,cobra winn+ master wrap rubber grip,new
2,cobra max blue offset driver hl 15° graphite l...,https://www.golftraders.com.au/cobra-max-blue-...,399.0,cobra,max blue offset high launch,15,left,matrix mfs 45x4 white tie graphite,ladies,cobra winn master wrap soft polymer grip,new
3,new cobra king radspeed driver 9º motore x f1 ...,https://www.golftraders.com.au/new-cobra-king-...,799.0,cobra,king radspeed,9,right,graphite motore x f1,stiff,cobra connect rubber grip,new
4,new cobra king sz white / black driver 10.5º g...,https://www.golftraders.com.au/new-cobra-king-...,549.0,cobra,king sz black / white,10.5 adjustable,right,graphite tensei av series 65,regular,cobra connect rubber grip,new


## Now the driver name can be replaced by the Brand and Model column concatenated together for a much cleaner Brand_Name column

In [1135]:
# Join model and brand in new column Name
df_drivers['Name'] = df_drivers['Brand'].str.cat(df_drivers['Model'],sep=" ")
df_drivers.head()

Unnamed: 0,Driver_Name,Link,Price,Brand,Model,Loft,Dexterity,Shaft,Flex,Grip,Condition,Name
0,new callaway mavrik driver 10.5º graphite with...,https://www.golftraders.com.au/new-callaway-ma...,549.0,callaway,mavrik,10.5,right,graphite - options available,options available,lamkin crossline black midsize rubber grip,new,callaway mavrik
1,cobra ladies fly-z marine driver 10.5-13.5° la...,https://www.golftraders.com.au/cobra-ladies-fl...,329.0,cobra,cobra ladies fly-z ultra marine,10.5-13.5 adjustable,left,graphite matrix vlct -sp 50g,ladies,cobra winn+ master wrap rubber grip,new,cobra cobra ladies fly-z ultra marine
2,cobra max blue offset driver hl 15° graphite l...,https://www.golftraders.com.au/cobra-max-blue-...,399.0,cobra,max blue offset high launch,15,left,matrix mfs 45x4 white tie graphite,ladies,cobra winn master wrap soft polymer grip,new,cobra max blue offset high launch
3,new cobra king radspeed driver 9º motore x f1 ...,https://www.golftraders.com.au/new-cobra-king-...,799.0,cobra,king radspeed,9,right,graphite motore x f1,stiff,cobra connect rubber grip,new,cobra king radspeed
4,new cobra king sz white / black driver 10.5º g...,https://www.golftraders.com.au/new-cobra-king-...,549.0,cobra,king sz black / white,10.5 adjustable,right,graphite tensei av series 65,regular,cobra connect rubber grip,new,cobra king sz black / white


## All looks good so replace Driver_Name column with Name and drop Name, Brand and Model

In [1136]:
df_drivers['Driver_Name'] = df_drivers['Name']
df_drivers.drop('Name', axis=1, inplace=True)
df_drivers.drop('Brand', axis=1, inplace=True)
df_drivers.drop('Model', axis=1, inplace=True)
df_drivers.head()

Unnamed: 0,Driver_Name,Link,Price,Loft,Dexterity,Shaft,Flex,Grip,Condition
0,callaway mavrik,https://www.golftraders.com.au/new-callaway-ma...,549.0,10.5,right,graphite - options available,options available,lamkin crossline black midsize rubber grip,new
1,cobra cobra ladies fly-z ultra marine,https://www.golftraders.com.au/cobra-ladies-fl...,329.0,10.5-13.5 adjustable,left,graphite matrix vlct -sp 50g,ladies,cobra winn+ master wrap rubber grip,new
2,cobra max blue offset high launch,https://www.golftraders.com.au/cobra-max-blue-...,399.0,15,left,matrix mfs 45x4 white tie graphite,ladies,cobra winn master wrap soft polymer grip,new
3,cobra king radspeed,https://www.golftraders.com.au/new-cobra-king-...,799.0,9,right,graphite motore x f1,stiff,cobra connect rubber grip,new
4,cobra king sz black / white,https://www.golftraders.com.au/new-cobra-king-...,549.0,10.5 adjustable,right,graphite tensei av series 65,regular,cobra connect rubber grip,new


## Final check for NaNs shows some of the new Driver_Names formed didn't work so need to remove them

In [1165]:
df_drivers['Driver_Name'].replace(' ', np.nan, inplace=True)

In [1166]:
# check rows 80-85
df_drivers[80:85]

Unnamed: 0,Driver_Name,Link,Price,Loft,Dexterity,Shaft,Flex,Grip,Condition
80,taylormade sim 2,https://www.golftraders.com.au/new-taylormade-...,779.0,9.0,right,graphite fujikura ventus 5 blue,stiff,taylormade golf pride z grip standard rubber grip,new
81,,https://www.golftraders.com.au/new-taylormade-...,699.0,,,,,,
82,,https://www.golftraders.com.au/new-taylormade-...,699.0,,,,,,
83,titleist 915d3,https://www.golftraders.com.au/new-titleist-91...,499.0,8.5,right,graphite fujikura vista pro seventy,extra stiff,titleist golf pride m580 rubber grip,new
84,titleist 915d3,https://www.golftraders.com.au/new-titleist-91...,440.0,8.5,right,graphite grafalloy prolaunch 65,stiff,titleist golf pride m580 rubber grip,used


In [1167]:
# drop rows with Driver_Name = NaN
df_drivers.dropna(subset=['Driver_Name'],inplace=True)

In [1168]:
# check again - rows gone
df_drivers[80:85]

Unnamed: 0,Driver_Name,Link,Price,Loft,Dexterity,Shaft,Flex,Grip,Condition
80,taylormade sim 2,https://www.golftraders.com.au/new-taylormade-...,779.0,9.0,right,graphite fujikura ventus 5 blue,stiff,taylormade golf pride z grip standard rubber grip,new
83,titleist 915d3,https://www.golftraders.com.au/new-titleist-91...,499.0,8.5,right,graphite fujikura vista pro seventy,extra stiff,titleist golf pride m580 rubber grip,new
84,titleist 915d3,https://www.golftraders.com.au/new-titleist-91...,440.0,8.5,right,graphite grafalloy prolaunch 65,stiff,titleist golf pride m580 rubber grip,used
85,titleist 915d3,https://www.golftraders.com.au/new-titleist-91...,499.0,8.5,right,graphite aldila voodoo xvs6,extra stiff,titleist golf pride m580 rubber grip,used
86,titleist 915d4,https://www.golftraders.com.au/new-titleist-91...,699.0,9.5,right,graphite project x hzrdus 6.0 62g,regular,titleist golf pride m580 rubber grip,new


## Finally write it to csv file for the next part

In [1099]:
# now save it to csv file df_books to save having to do this all again !
df_drivers.to_csv('df_drivers.csv',header=True)

# Extracting information from GolfMonthly reviews website https://www.golfmonthly.com/reviews/drivers/

## Part 2 - Webcrawler extracting reviews from GolfMonthly.com 

### The next part of the task was the extraction of reviews from the https://www.golfmonthly.com/reviews/drivers/ pages. There is some manipulation of the URL required as the reviews stretch over 12 pages that will be discussed later.

### Clicking on the link above will show the layout of the page from which the reviews will be retrieved. It was a golf magazine so there were some adverts, links to recent golf stories, video tips and podcasts on the front page. Access to the review was made via a mouse over drop down list at the top of the page. Once on the driver review page there was a selector where price, manufacturer and recency could be entered plus any key words. There were also links to clothing and other accessories.

### The reviews page had the name of each driver and a price with a brief description. The driver title acted as a link to the actual driver review. So there were several steps involved in navigating and retrieving the reviews. The first step was to loop through all 12 pages and get the links to the 20 reviews from each page. Once these links were attained they could then each be visited and the relevant information extracted, namely the review and verdict fields, that were combined with the driver name field extracted from the link. While there was a price field it was in pounds and deemed of little value so not extracted.

### The golfmonthly site is based in England and so had extensive terms and conditions documentation (part of European requirements I believe) - https://www.futureplc.com/terms-conditions/. I searched the document and found no reference to crawling or scraping limitations. There was no problem running the scraper many times, sometimes in quick succession after silly coding mistakes ! So as with the golftraders website there was no legal troubles in using this site for the project.

### The relevant python coding for the crawler is shown below with steps commented in each process.

In [1105]:
'''
This method gets all the driver review links by iterating over the 12 pages of reviews and then 
finding the relevant parts and finally storing each link in a list. The last 3 links on each page
were not related to the reviews and so not stored.
'''
def get_review_links():

    # get all the links to the reviews from the golfmonthly drivers review page
    link_golfmonthly =[]

    # there are 12 pages of driver reviews starting page 1 and ending page 12
    for i in range(1,12):
        # golf mag reviews page to get links from
        reviews_page = requests.get("https://www.golfmonthly.com/reviews/drivers/page/"+str(i)+"") 
        reviews_page_soup = BeautifulSoup(reviews_page.content,'html.parser')

        links_to_review =  reviews_page_soup.select("h2 a")   

        #links_to_review
        for l in links_to_review:
            link_golfmonthly.append(l.get('href'))
    
        # need to ignore the last 3 links on each page as they relate to golf swing tips and gear videos
        link_golfmonthly = link_golfmonthly[:len(link_golfmonthly)-3]
    return link_golfmonthly


In [1106]:
# Get total number of reviews (220) and show the first 5
links = get_review_links()
print(len(links))
links[0:5]

220


['https://www.golfmonthly.com/reviews/drivers/tour-edge-exs-220-driver-review',
 'https://www.golfmonthly.com/reviews/drivers/mizuno-st-z-driver-review',
 'https://www.golfmonthly.com/reviews/drivers/mizuno-st-drivers-review',
 'https://www.golfmonthly.com/reviews/drivers/taylormade-sim2-drivers-review',
 'https://www.golfmonthly.com/reviews/drivers/callaway-epic-21-drivers-review']

## Using the links collected for every driver the reviews can be retrieved.

In [1114]:
'''
This method gets the review_links and then visits each review page to extract 
the review and verdict field and then stores them in a list
'''
def get_reviews_list():
    
    link_golfmonthly = get_review_links()
    
    reviews_as_list =[] # list to store the values

    for l in link_golfmonthly:    
    
        try:
            each_link_page = requests.get(l)
            each_link_soup = BeautifulSoup(each_link_page.content,'html.parser')
        
        # the review is in a div tag body with tag reviewBody
            reviews =  each_link_soup.find_all("div",{"itemprop":"reviewBody"})    
        
            soup = BeautifulSoup(str(reviews))
            review = soup.get_text()
        
            # get the driver name, review and verdict for each link provided
            review_list = golfmonthly_review_to_list(review,l)         
    
        except:
            print('there was an error')
            pass # just catch the error but continue and move on if there is error
    
        #append each review to a list
        reviews_as_list.append(review_list)
    return reviews_as_list

## With the list of drivernames and corresponding link, reviews and verdict we can create a dataframe

In [1115]:
reviews_as_list = get_reviews_list()
df_golfmonthly_reviews = pd.DataFrame(reviews_as_list, columns=['Driver_Name','Link','Review','Verdict'])

In [1117]:
df_golfmonthly_reviews.head()

Unnamed: 0,Driver_Name,Link,Review,Verdict
0,tour edge exs 220 driver review,https://www.golfmonthly.com/reviews/drivers/to...,part of the exotics line from tour edge which ...,the ball flight on offer here is fantastic whe...
1,mizuno st z driver review,https://www.golfmonthly.com/reviews/drivers/mi...,mizuno has made serious strides in the driver ...,with the st-z model mizuno have created anothe...
2,mizuno st drivers review,https://www.golfmonthly.com/reviews/drivers/mi...,we were very impressed with last year s st200 ...,with the st range mizuno has created two excel...
3,taylormade sim2 drivers review,https://www.golfmonthly.com/reviews/drivers/ta...,as good as last year s sim drivers were one of...,with all sim2 models positioned at £449 there ...
4,callaway epic 21 drivers review,https://www.golfmonthly.com/reviews/drivers/ca...,callaway set the bar at a fairly lofty height ...,only by testing all three models you know for ...


### From the Driver_Name we can removed 'driver', 'drivers' and 'review'

In [1118]:
# Remove driver / drivers / review from the Driver_Name - it is redundant
df_golfmonthly_reviews['Driver_Name'] = df_golfmonthly_reviews['Driver_Name'].str.replace('driver','')
df_golfmonthly_reviews['Driver_Name'] = df_golfmonthly_reviews['Driver_Name'].str.replace('drivers','')
df_golfmonthly_reviews['Driver_Name'] = df_golfmonthly_reviews['Driver_Name'].str.replace('review','')

In [1121]:
df_golfmonthly_reviews.head()

Unnamed: 0,Driver_Name,Link,Review,Verdict
0,tour edge exs 220,https://www.golfmonthly.com/reviews/drivers/to...,part of the exotics line from tour edge which ...,the ball flight on offer here is fantastic whe...
1,mizuno st z,https://www.golfmonthly.com/reviews/drivers/mi...,mizuno has made serious strides in the driver ...,with the st-z model mizuno have created anothe...
2,mizuno st s,https://www.golfmonthly.com/reviews/drivers/mi...,we were very impressed with last year s st200 ...,with the st range mizuno has created two excel...
3,taylormade sim2 s,https://www.golfmonthly.com/reviews/drivers/ta...,as good as last year s sim drivers were one of...,with all sim2 models positioned at £449 there ...
4,callaway epic 21 s,https://www.golfmonthly.com/reviews/drivers/ca...,callaway set the bar at a fairly lofty height ...,only by testing all three models you know for ...


## Create a function that uses the review that has been extracted and the link and writes each of them to a list 


In [1113]:
'''
This method extracts the review and the verdict from the given webpage link. The words 'review' and 'verdict' are indexed with 
the values between the indices being used to create the review and verdict strings. The inital extracted review is tokenized to 
aid in this process.
'''

# Now before this is stored in a dataframe there is a need to lemmatize and remove punctuation and stop words. 
# Also there is a need to get the words before 'review' as that will be the name and brand of the driver with review after
def golfmonthly_review_to_list(review,link):
    add_row = []
    rev = tokenize_text(review)
    
    sp = urllib.parse.unquote(link)
    driver_name = sp.split("/")[5] # get the driver name from the end of the link - 6th element 
   
    j=0
    m=0 # simple counters
    review_notname=""
    verdict =""
    
    try: # incase these words are missing
    # find the index of the word review and verdict
        j = rev.index('review')
        m = rev.index('verdict')
    except:
        pass

    driver_name = driver_name.replace('-',' ')
    
    # get the review - (all words after the word review unitl the word verdict)
    for l in range(j+1,m):   #len(review_lem)):
        review_notname += rev[l] + ' '
    #print(review_notname.strip())
    
    for n in range(m+1,len(rev)):
        verdict += rev[n] + ' '
    #print(verdict.strip())
        
    return [driver_name,link,review_notname,verdict]; # return the 4 values as a list

## Now save to csv for next part

In [1120]:
# now save it to csv file df_books to save having to do this all again !
df_golfmonthly_reviews.to_csv('df_golfmonthly_reviews.csv',header=True)