# Scraping data from the web with Python

We would like to know whether we can predict mean ratings of individual rums based on price and sugar content.
In order to do so, we need to collect (and clean) the data for analysis.
Here, our goal will be to scrape our data from two websites before combining it in a single data frame.


## Part 1: Compiling a list of rum label names with sugar content

Using hydrometer tests, the Finnish "Systembolag" and a handful of non-professionals set out to measure how much sugar (g/L) is added to different, mainly high-end rums. Their results for 738 rums were then shared by "Capn Jimbo" on his website: http://rumproject.com/rumforum/viewtopic.php?t=1683

Before cleaning the data by splitting label names and sugar content into separate columns, we simply extract all the rows and place them in a new .csv file.

Let us first load the necessary packages and set a working directory for the .csv file.

In [1]:
from lxml import html
import requests
import csv
import os
from bs4 import BeautifulSoup
os.chdir("/Users/peerchristensen/Desktop/rum project")

Next, we will "scrape" the data using XPath. Before doing this, we must first look "under the hood" of the website, i.e. inspect the underlying HTML, to see which CSS class we're looking for. We then assign the matching text elements to a variable called "rums".

In [2]:
page = requests.get('http://rumproject.com/rumforum/viewtopic.php?t=1683&sid=b522a1521334abe1096817edfa299933')
contents = html.fromstring(page.content)
rums = contents.xpath('//span[@class="postbody"]/text()')
rums[66]

'\nAppleton Estate Extra 12yr: 0\tthefatrumpirate\t0\r\n'
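
To see why a query like this returns one string per row, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree` (not lxml, which the cell above uses) on a small hand-written snippet. The markup below is made up: it assumes the post body separates rows with `<br/>` tags, which may not match the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the forum page (not the real markup).
snippet = (
    '<html><body>'
    '<span class="postbody">Rum A: 0<br/>Rum B: 12<br/>Rum C: 35</span>'
    '<span class="signature">ignored</span>'
    '</body></html>'
)

root = ET.fromstring(snippet)

rows = []
# ElementTree supports a subset of XPath, including attribute predicates.
for span in root.findall(".//span[@class='postbody']"):
    # itertext() yields every text node inside the span, in document order.
    rows.extend(span.itertext())

print(rows)
```

Only the span with the matching class contributes text, and each piece of text between two `<br/>` tags becomes its own list element.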

For some reason, I wasn't able to complete the next steps in a single for loop, so we start by creating an empty list called "rums2" and then append each element that we previously stored in "rums", dropping any non-ASCII characters along the way. I couldn't quite figure out the optimal way to encode the text, but this solution works.

In [3]:
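rums2 = []
for rum in rums:
    # Drop non-ASCII characters, then decode back to text so the csv writer receives strings
    rums2.append(rum.encode('ascii', 'ignore').decode('ascii'))
rums2[66]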

'\nAppleton Estate Extra 12yr: 0\tthefatrumpirate\t0\r\n'

Finally, we create our file called "rumlist.csv" and write each element of "rums2" as a separate row.

In [4]:
with open("rumlist.csv", 'w') as rumSugar:
    wr = csv.writer(rumSugar, quoting=csv.QUOTE_ALL)
    for rum in rums2:
        wr.writerow([rum])

The data will, however, need some cleaning with regular expressions. I've done this in R.
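
For readers who would rather stay in Python, the same kind of cleaning could be sketched with the `re` module. This is not the R code used for the analysis. The pattern and the `parse_row` helper are illustrative and assume that each useful row looks like the example shown earlier: a label, a colon, and then the sugar content as a whole number.

```python
import re

# Assumed row format: "<label>: <sugar in g/L>", optionally followed by
# tab-separated extras such as the source of the measurement.
ROW_PATTERN = re.compile(r"^\s*(?P<label>.+?):\s*(?P<sugar>\d+)")

def parse_row(row):
    """Return (label, sugar) for a matching row, or None otherwise."""
    match = ROW_PATTERN.match(row)
    if match is None:
        return None
    return match.group("label"), int(match.group("sugar"))

example = "\nAppleton Estate Extra 12yr: 0\tthefatrumpirate\t0\r\n"
print(parse_row(example))
```

Rows that do not fit the assumed format (headings, comments, blank lines) come back as `None` and can be filtered out before writing the cleaned file.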

## Part 2: Scraping mean rum ratings and attaching metadata

Next, we would like to scrape some mean ratings and create more variables containing information about price in USD, country of origin and classification (e.g. aged, spiced, gold). We also want to know how many raters there are for each rum and determine a minimum number of raters. We scrape all of this data from the following site: 
https://www.rumratings.com/brands?action=index&controller=brands&order_by=number_of_ratings

From inspecting the website, we know that we want data from the first 21 pages ordered by number of raters. We now create a list called "urls" and fill it with the URL of each page. Since the pages differ only in their page number, we can do this with a simple for loop that appends one URL per iteration.

In [5]:
urls = []
for i in range(0, 21):
    urls.append('https://www.rumratings.com/brands?letter=&order_by=number_of_ratings&page=%d' % i)
urls[0:4]

['https://www.rumratings.com/brands?letter=&order_by=number_of_ratings&page=0',
 'https://www.rumratings.com/brands?letter=&order_by=number_of_ratings&page=1',
 'https://www.rumratings.com/brands?letter=&order_by=number_of_ratings&page=2',
 'https://www.rumratings.com/brands?letter=&order_by=number_of_ratings&page=3']
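
As an alternative sketch (Python 3 only), the query string can be assembled with `urllib.parse.urlencode` instead of string formatting, which keeps the parameters readable and takes care of escaping. `BASE_URL` and `build_urls` are names introduced here for illustration; the parameters are the ones visible in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.rumratings.com/brands"

def build_urls(n_pages):
    """Build one listing URL per page index, starting at 0 as in the loop above."""
    urls = []
    for page in range(n_pages):
        query = urlencode({"letter": "", "order_by": "number_of_ratings", "page": page})
        urls.append(BASE_URL + "?" + query)
    return urls

urls = build_urls(21)
print(urls[0])
print(len(urls))
```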

So far so good. We would then like to separate the data according to the aforementioned variables. Unfortunately, the data are grouped on the page in such a way that we need an intermediate step to isolate and extract the information we want.

We first create three empty lists and fill them with the elements matching the CSS classes shown in the code below. To parse the HTML of each page, we can use a package like "BeautifulSoup".

For each page (each url in "urls"), we place every element with the CSS class "rum-title" in our list called "rum_data".
These are the names of the individual rums. In the same way, and in the same for loop, we assign elements with the CSS classes "rum-info" and "rum-rating-icon" to separate lists. The "rum_info" list will need some further unpacking, as each of its entries holds several variables (country, classification, number of ratings and, where available, price). Importantly, each element of our three lists is itself a list (one per webpage/iteration) containing one entry for each rum on that page.

In [None]:
rum_data=[]
rum_info=[]
rum_ratings=[]

for url in urls:
    r=requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")  # name the parser explicitly to avoid bs4's warning
    rum_data.append(soup.find_all("div",{"class":"rum-title"}))
    rum_info.append(soup.find_all("div",{"class":"rum-info"}))
    rum_ratings.append(soup.find_all("div",{"class":"rum-rating-icon"}))

In [10]:
rum_info[9][9]


<div class="rum-info">
Dominican Republic | Aged | 31 ratings
</div>
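
Because each of the three lists holds one inner list per page, most of the later steps begin by flattening them. A minimal sketch of that page-level flattening, using `itertools.chain.from_iterable` on made-up strings rather than the scraped tags (the `pages` data is purely illustrative):

```python
from itertools import chain

# Hypothetical stand-in: one inner list per scraped page, one string per rum.
pages = [
    ["Rum A", "Rum B"],
    ["Rum C"],
    [],
    ["Rum D", "Rum E"],
]

# chain.from_iterable walks the inner lists in order and yields their items.
flat = list(chain.from_iterable(pages))

print(flat)
print(len(flat))
```

This removes only the per-page nesting. With the real data, each item is still a tag whose text has to be extracted separately.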

Alright, let's get the label names. The code below is more complicated than it needs to be and could be written more elegantly, mainly because it does several things at once: it flattens the nested lists, keeps every fifth element (which holds the label text), cleans each element a bit (though it would probably have been better to leave this for later) and deals with encoding in yet another loop.

In [13]:
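labs = []
for i in rum_data:
    for j in i:
        # Adding a tag to a list appends each of the tag's child nodes
        labs += j
# Keep every fifth node
labs = labs[::5]
labels = []
for i in labs:
    labels.append(str(i).strip("\n").rstrip())
labels2 = []
for i in labels:
    # Drop non-ASCII characters and decode back to text for the csv writer
    labels2.append(i.encode('ascii', 'ignore').decode('ascii'))
labels[88]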

'Diplomatico 2000 Single Vintage'

We then do the same for the variables that we may later use as predictors of mean ratings. As mentioned earlier, all this info is embedded in the "rum_info" variable, so we need to extract and clean the data.

In [None]:
info=[]
for i in rum_info:
    for j in i:    
        info+= j
info2=[]
for i in info:
    info2.append(str(i).split("|"))

country=[]
category=[]
n_ratings=[]
price=[]
for i in info2:
    country.append(i[0].strip("\n").rstrip())
    category.append(i[1].strip(" "))
    n_ratings.append(i[2].strip("\n ratings").rstrip().lstrip())
    try:
        price.append(i[3].strip("\n $").rstrip().lstrip())
    except IndexError:
        # No fourth field means no price was listed for this rum
        price.append("NA")
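
The same unpacking can be sketched as a small function that is easy to test on a single string. `parse_info` is a hypothetical helper, not part of the notebook. It assumes the field order seen in the output above (country, classification, number of ratings) plus an optional price written like `$25`. The second example string is invented to show the price case.

```python
def parse_info(text):
    """Split a 'country | category | N ratings [| $price]' string into fields."""
    parts = [part.strip() for part in text.split("|")]
    country, category = parts[0], parts[1]
    # "31 ratings" -> 31; take the first whitespace-separated token.
    n_ratings = int(parts[2].split()[0])
    # The price field is optional; None marks a missing value.
    price = parts[3].lstrip("$").strip() if len(parts) > 3 else None
    return country, category, n_ratings, price

print(parse_info("\nDominican Republic | Aged | 31 ratings\n"))
print(parse_info("\nBarbados | Gold | 120 ratings | $25\n"))
```

Checking the number of fields up front replaces the try/except, and converting the rating count to an integer makes it easier to apply a minimum number of raters later. Counts written with thousands separators are not handled in this sketch.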

We then get the mean ratings.

In [None]:
rats=[]
for i in rum_ratings:
    for j in i:    
        rats+= j
ratings=[]
for i in rats:
    ratings.append(str(i).strip("\n"))

Finally, we just need to write our file called "rumratings.csv" with our variables neatly organized in columns.

In [None]:
with open("rumratings.csv", 'w') as rumRating:
    wr = csv.writer(rumRating,delimiter=';', quoting=csv.QUOTE_ALL)
    wr.writerows(zip(labels2,country,category,ratings,n_ratings,price))
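
A small sketch of the same writing step with a header row added, using an in-memory buffer and made-up values so that it runs without the scraped data. The column names and rows are illustrative only. On Python 3, a real file would typically be opened with `newline=""`, as the `csv` module documentation recommends.

```python
import csv
import io

# Hypothetical rows standing in for the scraped lists.
labels = ["Rum A", "Rum B"]
country = ["Barbados", "Jamaica"]
ratings = ["7.5", "8.1"]

buffer = io.StringIO()  # stands in for open("rumratings.csv", "w", newline="")
writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_ALL)
writer.writerow(["label", "country", "rating"])  # header row
writer.writerows(zip(labels, country, ratings))

print(buffer.getvalue())
```

Note that `zip` stops at the shortest list, so if the scraped lists differ in length, the extra entries are silently dropped. Comparing the list lengths before writing is a cheap sanity check.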