> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `sds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

You should run `pip install scraping_class`

In [5]:
import scraping_class
from bs4 import BeautifulSoup
import pandas as pd
import lxml
connector = scraping_class.Connector("log.csv")

# Exercise Set 9: Parsing and Information Extraction

*Morning, August 17, 2018*

In this Exercise Set we shall develop our webscraping skills even further by practicing **parsing** and navigating HTML trees using BeautifulSoup, and by extracting information from raw text, with no HTML tags to help, using regular expressions. 

But just as importantly you will get a chance to think about **data quality issues** and how to ensure reliability when curating your own webdata. 

## Exercise Section 9.1: Logging and data quality

> **Ex. 9.1.1:** *Why is it important to log processes in your data collection?*

** answer goes here** 



> **Ex. 9.1.2:** *How does logging help with both ensuring and documenting the quality of your data?*

** answer goes here** 
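
As a starting point for thinking about these two questions, here is a minimal sketch of what a logging wrapper might record for each request. It is not the `scraping_class.Connector` implementation: the function name `logged_get`, the log columns, the file name `demo_log.csv` and the injectable `fetch` argument are all assumptions made for illustration.

```python
import csv
import time
from datetime import datetime, timezone

def logged_get(url, project, fetch, logfile="demo_log.csv"):
    """Call fetch(url) and append one row describing the attempt to logfile.

    `fetch` is any callable returning (status_code, text). A thin wrapper
    around requests.get would be one option (hypothetical, not shown here).
    """
    start = time.time()
    status, text, error = None, "", ""
    try:
        status, text = fetch(url)
    except Exception as exc:  # failed calls are logged too
        error = repr(exc)
    # Columns: timestamp, project, url, status, response length, seconds, error
    row = [datetime.now(timezone.utc).isoformat(), project, url, status,
           len(text), round(time.time() - start, 3), error]
    with open(logfile, "a", newline="") as f:
        csv.writer(f).writerow(row)
    return text

def fake_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs offline."""
    return 200, "<html>hello</html>"

html = logged_get("https://example.com", "demo", fake_fetch)
```

Each row tells you when a call was made, whether it succeeded, and how much data came back, which is the kind of information you need to spot silent failures and to document how the data was collected.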



## Exercise Section 9.2: Parsing a Table from HTML using BeautifulSoup.

Yesterday I showed you a neat little prepackaged function in pandas that did all the work. However, today we will learn the mechanics behind it. *(This is not just for educational purposes; sometimes the package will not do exactly what you want.)*

We hit the Basketball stats page from yesterday again: https://www.basketball-reference.com/leagues/NBA_2018.html.


> **Ex. 9.2.1:** Here we practice simply locating the table node of interest using the `find` method built into BeautifulSoup. But first we have to fetch the HTML using the `requests` module and parse the tree using `BeautifulSoup`. Then use the **>Inspector<** tool (*right click on the table > press inspect element*) in your browser to see how to locate the Eastern Conference table node, i.e. the *tag* name of the node and maybe some defining *attributes*.
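
To illustrate how `find` combines a tag name with an attribute, here is a small sketch on a made-up HTML snippet (only the `id` value `confs_standings_E` is taken from the real page). It uses Python's built-in `html.parser` instead of `lxml` simply to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page (made-up content).
html = """
<html><body>
<table id="confs_standings_E" class="stats_table"><caption>East</caption></table>
<table id="confs_standings_W" class="stats_table"><caption>West</caption></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching node, or None if nothing matches.
east = soup.find("table", id="confs_standings_E")
# The same lookup written with an attribute dictionary.
east_again = soup.find("table", attrs={"id": "confs_standings_E"})

print(east.name, east["id"], east.caption.text)
```

Searching on the `id` attribute is usually more robust than relying on the position of the table in the page, since the order of tables can change.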

In [9]:
response, call_id = connector.get('https://www.basketball-reference.com/leagues/NBA_2018.html', 'looking-around')
soup = BeautifulSoup(response.text, 'lxml')
soup_tables = soup.findAll('table')

In [33]:
soup_tables[0] #This is the html for the first table

<table class="suppress_all sortable stats_table" data-cols-to-freeze="1" id="confs_standings_E"><caption>Conference Standings Table</caption>
<colgroup><col/><col/><col/><col/><col/><col/><col/><col/></colgroup>
<thead>
<tr>
<th aria-label="Eastern Conference" class="poptip sort_default_asc left" data-stat="team_name" scope="col">Eastern Conference</th>
<th aria-label="Wins" class="poptip right" data-stat="wins" data-tip="Wins" scope="col">W</th>
<th aria-label="Losses" class="poptip right" data-stat="losses" data-tip="Losses" scope="col">L</th>
<th aria-label="Win-Loss Percentage" class="poptip right" data-stat="win_loss_pct" data-tip="Win-Loss Percentage" scope="col">W/L%</th>
<th aria-label="Games Behind" class="poptip sort_default_asc right" data-stat="gb" data-tip="Games Behind" scope="col">GB</th>
<th aria-label="Points Per Game" class="poptip right" data-stat="pts_per_g" data-tip="Points Per Game" scope="col">PS/G</th>
<th aria-label="Opponent Points Per Game" class="poptip righ

You have located the table and should now build a function that starts at a "table node", parses the information, and outputs a pandas DataFrame. 

Inspect the element either within the notebook or through the **>Inspector<** tool and start to see how a table is written in HTML. Which tag names can be used to locate rows? How will you iterate through columns? Where is the header located?

> **Ex. 9.2.2:** First you parse the header, which can be found under the canonical tag name: thead. 
Next you use the `find_all` method to search for the header cell tags, and iterate through each of the elements extracting the text, using the `.text` attribute built into the node object. Store the header values in a list container. 

> **Ex. 9.2.3:** Next you locate the rows, using the canonical tag name: tbody. From here you search for all row tags. Figure out the tag name yourself, by inspecting the tbody node in Python or using the **Inspector**. 

> **Ex. 9.2.4:** Next run through all the rows and extract each value, similar to how you extracted the header. However, here is a slight variation: since each value node can have a different tag depending on whether it is a digit or a string, you should use the `.children` attribute instead of `.find_all` (or compile a regex that matches both the td tag and the th tag). 
>Once the value nodes of each row have been located using `.children`, you should extract the value. Store the extracted rows as a list of lists: ```[[val1,val2,...valk],...]```
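
The regex alternative mentioned in Ex. 9.2.4 could look roughly like the sketch below. The HTML snippet is a made-up miniature of the standings table, and `html.parser` is used only to keep the example dependency-free.

```python
import re
from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr><th>Toronto Raptors</th><td>59</td><td>23</td></tr>
<tr><th>Boston Celtics</th><td>55</td><td>27</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
cell_tag = re.compile(r"^t[dh]$")  # matches the tag names 'td' and 'th' only

rows = []
for tr in soup.find("tbody").find_all("tr"):
    # find_all accepts a compiled regex and tests it against each tag name.
    rows.append([cell.text for cell in tr.find_all(cell_tag)])

print(rows)
```

One practical difference from `.children`: `find_all` only returns tag nodes, whereas `.children` also yields any whitespace strings that sit between the cells in the raw HTML.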

In [46]:
def get_tables(html):
    soup_tables = BeautifulSoup(html, 'lxml').findAll('table') #findall tables in the html
    dataframes = {}
    for table in soup_tables: #iterate through tables
        ths = table.find('thead').findAll('th') #find all tableheaders, i.e. the <th> html-tag
        column_headers = [th.text for th in ths] #extract the text from all the <th>-tags
        
        data = []
        for tr in table.find('tbody').findAll('tr'): #iterate through tablerows, i.e. the <tr> html-tag
            data.append([td.text for td in tr.children]) #extract the text from children of a <tr>-tag
            
        dataframes[table['id']] = pd.DataFrame(data, columns = column_headers) #save table to the dictionary

    return dataframes

In [47]:
NBA_dfs = get_tables(response.text)
NBA_dfs

{'confs_standings_E':            Eastern Conference   W   L  W/L%    GB   PS/G   PA/G    SRS
 0       Toronto Raptors* (1)   59  23  .720     —  111.7  103.9   7.29
 1        Boston Celtics* (2)   55  27  .671   4.0  104.0  100.4   3.23
 2    Philadelphia 76ers* (3)   52  30  .634   7.0  109.8  105.3   4.30
 3   Cleveland Cavaliers* (4)   50  32  .610   9.0  110.9  109.9   0.59
 4        Indiana Pacers* (5)   48  34  .585  11.0  105.6  104.2   1.18
 5            Miami Heat* (6)   44  38  .537  15.0  103.4  102.9   0.15
 6       Milwaukee Bucks* (7)   44  38  .537  15.0  106.5  106.8  -0.45
 7    Washington Wizards* (8)   43  39  .524  16.0  106.6  106.0   0.53
 8        Detroit Pistons (9)   39  43  .476  20.0  103.8  103.9  -0.26
 9     Charlotte Hornets (10)   36  46  .439  23.0  108.2  108.0   0.07
 10      New York Knicks (11)   29  53  .354  30.0  104.5  108.0  -3.53
 11        Brooklyn Nets (12)   28  54  .341  31.0  106.6  110.3  -3.67
 12        Chicago Bulls (13)   27  55  .32

**Ex. 9.2.5** Convert the data you have collected into a pandas dataframe. _Bonus:_ convert the code you've written above into a function which scrapes the page and returns a dataframe. 

In [None]:
#[Answer 9.2.5]
# see above

> **Ex. 9.2.6:** Now locate all tables from the page, using the `.find_all` method searching for the table tag name. Iterate through the table nodes and apply the function created for parsing html tables. Store each table in a dictionary using the table name as key. The name is found by accessing the id attribute of each table node, using dictionary-style syntax - i.e. `table_node['id']`.

> **9.2.extra.:** Compare your results to the pandas implementation, `pd.read_html`.
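
For the comparison, a minimal sketch of calling `pd.read_html` on a made-up miniature of the table is shown below. It assumes a parser backend such as `lxml` is installed, as it is in this notebook.

```python
from io import StringIO
import pandas as pd

html = """
<table id="confs_standings_E">
<thead><tr><th>Eastern Conference</th><th>W</th><th>L</th></tr></thead>
<tbody>
<tr><th>Toronto Raptors</th><td>59</td><td>23</td></tr>
<tr><th>Boston Celtics</th><td>55</td><td>27</td></tr>
</tbody>
</table>
"""

# read_html returns a list of DataFrames, one per matching <table>.
tables = pd.read_html(StringIO(html), attrs={"id": "confs_standings_E"})
east = tables[0]

print(east.dtypes)  # pandas infers numeric columns
print(east)
```

One thing to look for when comparing: a manual parser that collects `.text` values stores every cell as a string, while `pd.read_html` tries to infer numeric types.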

In [None]:
# [Answer to Ex. 9.2.6]
# see above

## Exercise Section 9.3: Practicing Regular Expressions.
This exercise is about developing your experience with designing your own regular expressions.

Remember you can always consult the regular expression reference page [here](https://www.regular-expressions.info/refquick.html), if you need to remember or understand a specific symbol. 

You should practice using the *"define-inspect-refine"* method described in the lectures to systematically ***explore*** and ***refine*** your expressions, and save all the patterns tried. You can download the small module that I created to handle this in the following way: 
``` python
import requests
url = 'https://raw.githubusercontent.com/snorreralund/explore_regex/master/explore_regex.py'
response = requests.get(url)
with open('explore_regex.py','w') as f:
    f.write(response.text)
import explore_regex as e_re
```

Remember to start ***broad*** to gain many examples, and iteratively narrow and refine.

We will use a sample of the Trustpilot dataset that you practiced collecting yesterday.
You can load it directly into Python from the following link: https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv

> **Ex. 9.3.0:** Load the data used in the exercise using the `pd.read_csv` function. (Hint: the path to the file can be either a URL or a system path.) 

>Define a variable `sample_string = '\n'.join(df.sample(2000).reviewBody)` as a sample of all the reviews that you will practice on. (Run it once in a while to get a new sample and check for potential differences.)
Imagine a company that wants to find the reviews where customers are concerned with the price of a service. They decide to write a regular expression to match all reviews where a currency and an amount are mentioned. 

> **Ex. 9.3.1:** 
> Write an expression that matches both the dollar sign ($) and "dollar" written literally, together with the amount before or after it. Remember that the "$" sign is a special character in regular expressions. Explore and refine using the `explore_pattern` method in the module I created called `explore_regex`. 
```python
import explore_regex as e_re
explore_regex = e_re.ExploreRegex(sample_string) # Initialize the ExploreRegex class.
explore_regex.explore_pattern(pattern) # Use the .explore_pattern method.
```


Start with exploring the context around digits ("\d") in the data. 
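
If you want to see what such an exploration step amounts to, here is a minimal sketch using only the standard `re` module. The helper name `show_context`, its parameters and the toy text are made up for illustration; this is not the `explore_regex` implementation.

```python
import re

def show_context(pattern, text, width=10, max_examples=5):
    """Return up to max_examples (match, context) pairs for pattern in text."""
    examples = []
    for m in re.finditer(pattern, text):
        # Cut a window of `width` characters on each side of the match.
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        examples.append((m.group(), text[start:end]))
        if len(examples) == max_examples:
            break
    return examples

toy_text = "Paid $25.00 for shipping. Waited 3 weeks. Only 10 dollars back!"
for match, context in show_context(r"\d+", toy_text):
    print(f"Match: {match}\tContext: {context}")
```

Looking at the context makes it easy to see which digits are prices (`$25.00`, `10 dollars`) and which are not (`3 weeks`), and that is what guides the next refinement of the pattern.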

In [49]:
import pandas as pd
import re
import requests
# download data
path2data = 'https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv'
df = pd.read_csv(path2data)
# download module
url = 'https://raw.githubusercontent.com/snorreralund/explore_regex/master/explore_regex.py'
response = requests.get(url)
# write script to your folder to create a local module
with open('explore_regex.py','w') as f:
    f.write(response.text)
# import local module
import explore_regex as e_re

In [53]:
import re
digit_re = re.compile('[0-9]+') 
df['hasNumber'] = df['reviewBody'].apply(lambda x: bool(digit_re.search(x)))
sample_string = '\n'.join(df[df['hasNumber']].sample(1000).reviewBody)

In [56]:
df[['reviewBody', 'hasNumber']].head()

Unnamed: 0,reviewBody,hasNumber
0,"Lots of inventory, very fast and efficient. I ...",False
1,I did not received the map I had ordered and p...,False
2,After searching a number of stores here in my ...,False
3,Website is not intuitive. I don't like having...,False
4,"Outstanding customer service, appreciated the ...",False


In [57]:
explore_regex = e_re.ExploreRegex(sample_string)

In [73]:
#price_pattern matches prices including decimals. The decimals are the '(\d*\.)*' part
price_pattern = r'((\d*\.)*\d+\s*(dollar|\$))|((dollar|\$)\s*(\d*\.)*\d+)'
explore_regex.explore_pattern(price_pattern, n_samples = 30)

------ Pattern: ((\d*\.)*\d+\s*(dollar|\$))|((dollar|\$)\s*(\d*\.)*\d+)	 Matched 247 patterns -----
Match: $2	Context:ckage for $2 shipping.
Match: $79	Context:sed to be $79 USD and e
Match: $20	Context: food for $20 less a ba
Match: $10	Context:reased to $10.  Imagine
Match: 10dollar	Context: belts at 10dollarmall for p
Match: $1000	Context: price of $1000.  Your pr
Match: $25.00	Context:ought the $25.00 gift card
Match: 10$	Context:s I get a 10$ off coupo
Match: $100	Context: It was a $100 fix, and 
Match: $99	Context:wn to the $99 service s
Match: $200	Context:serve for $200.  Returne
Match: $164.00	Context: I saved  $164.00 and did n
Match: $200	Context:ot so the $200.- 3 year 
Match: $1	Context:dditional $1,000 when 
Match: $277	Context: $65 on a $277 order. Th
Match: $10	Context:nce—apply $10 off now, 
Match: $100	Context:I ordered $100 plus in c
Match: $100	Context:nly worth $100 on non-di
Match: $900	Context:ng at the $900+ alternat
Match: $149	Context:g up with $149 website I


> **Ex.9.3.3** Use the `.report()` method, i.e. `explore_regex.report()`, and print all the patterns tried in the development process using the `.patterns` attribute, i.e. `explore_regex.patterns`.


>**Ex. 9.3.4** 
Finally, write a function that takes in a string and outputs whether there is a match. Use `re.search` to look for a match anywhere in the string (`re.match` only tests the beginning of the string). Hint: there is a match if the call does not return `None`, i.e. `re.search(pattern, string) is not None`.

> Define a column 'mention_currency' in the dataframe by applying the above function to the text column of the dataframe. 
***You should have approximately 310 reviews that match, but fewer is also alright.***
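
One detail worth checking in this exercise: `re.match` only tests the beginning of the string, whereas `re.search` scans the whole string. A small sketch of the difference, using a made-up review and a hypothetical helper name `has_price`:

```python
import re

price_pattern = re.compile(r'((\d*\.)*\d+\s*(dollar|\$))|((dollar|\$)\s*(\d*\.)*\d+)')

def has_price(text):
    """True if the pattern occurs anywhere in text."""
    return price_pattern.search(text) is not None

review = "Great service, and only $20 for shipping."
print(price_pattern.match(review))  # None: the review does not START with a price
print(has_price(review))            # True: search scans the whole string
```

Since prices rarely appear at the very start of a review, counting matches with `match` would give far fewer hits than counting with `search`.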

> **Ex. 9.3.5** Explore the relation between reviews mentioning prices and the average rating. 

> **Ex. 9.3.extra** Define a function that outputs the amount mentioned in the review (if there is more than one, the largest), define a new column by applying it to the data, and explore whether reviews mentioning higher prices are worse than others by plotting the amount versus the rating.
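
A possible starting point for the extra exercise is sketched below. The function name `largest_amount` and the simplified pattern are assumptions of this sketch; in particular, thousands separators such as `1,000` are not handled.

```python
import re

# One capture group per ordering: amount before the currency, or after it.
amount_pattern = re.compile(
    r'(\d+(?:\.\d+)?)\s*(?:dollar|\$)|(?:dollar|\$)\s*(\d+(?:\.\d+)?)'
)

def largest_amount(text):
    """Return the largest dollar amount mentioned in text as a float, or None."""
    amounts = []
    for before, after in amount_pattern.findall(text):
        # Exactly one of the two groups is non-empty for each match.
        amounts.append(float(before or after))
    return max(amounts) if amounts else None

print(largest_amount("I saved $164.00 on a $277 order"))  # 277.0
print(largest_amount("Fast delivery, thanks!"))           # None
```

Applied to the data, this could be used as `df['reviewBody'].apply(largest_amount)`, which leaves missing values for reviews without any amount.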

In [82]:
#9.3.4
def mention_currency(span):
    m = re.compile(r'((\d*\.)*\d+\s*(dollar|\$))|((dollar|\$)\s*(\d*\.)*\d+)').search(span)
    return bool(m)

df['mentionCurrency'] = df['reviewBody'].apply(mention_currency)
df.head()

Unnamed: 0.1,Unnamed: 0,__domain__,address_@type,address_addressCountry,address_addressLocality,address_postalCode,address_streetAddress,author_@type,datePublished,email,...,itemReviewed_name,meta_@type,name,reviewBody,reviewRating_@type,reviewRating_ratingValue,telephone,categories,hasNumber,mentionCurrency
0,159770,https://trustpilot.com/review/www.exmed.net,PostalAddress,,Fenton,63026,218 Seebold Spur,Person,2017-07-29T20:27:03Z,sales@exmed.net,...,Express Medical Supply,LocalBusiness,Express Medical Supply,"Lots of inventory, very fast and efficient. I ...",Rating,5,(800) 633-2139,/health_wellbeing,False,False
1,168724,https://trustpilot.com/review/mapscompany.com,PostalAddress,,"Petit-Rocher, NB",E8J 1E4,713 rue de la Mer,Person,2017-08-11T20:09:48Z,contact@mapscompany.com,...,MapsCompany,LocalBusiness,MapsCompany,I did not received the map I had ordered and p...,Rating,3,,/travel_holidays,False,False
2,96443,https://trustpilot.com/review/www.thriftbooks.com,PostalAddress,,Tukwila,98188,"18300 Cascade Ave S, Ste 150",Person,2015-03-19T22:59:22Z,reviews@thriftbooks.com,...,Thrift Books,LocalBusiness,Thrift Books,After searching a number of stores here in my ...,Rating,5,253-275-2251,/entertainment,False,False
3,173433,https://trustpilot.com/review/fabletics.com,PostalAddress,,,,,Person,2017-04-30T19:47:39Z,,...,Fabletics,LocalBusiness,Fabletics,Website is not intuitive. I don't like having...,Rating,2,855-202-3570,/clothes_fashion,False,False
4,138968,https://trustpilot.com/review/www.enterprise.com,PostalAddress,US,St Louis,63105,600 Corporate Park Dr,Person,2018-05-26T20:43:41Z,,...,Enterprise,LocalBusiness,Enterprise,"Outstanding customer service, appreciated the ...",Rating,5,,/transportation,False,False


In [85]:
#9.3.5
# select the rating column before averaging, so non-numeric columns are not aggregated
df.groupby('mentionCurrency')['reviewRating_ratingValue'].mean()

mentionCurrency
False    4.502371
True     3.046667
Name: reviewRating_ratingValue, dtype: float64

Apparently the average rating is higher when no currency is mentioned (about 4.50 versus 3.05). That would make sense if reviews that bring up prices are more often complaints.
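
When comparing group averages like this, it can help to look at the group sizes as well, since one group is much smaller than the other. A minimal sketch on made-up toy data (the column name `rating` and the values are invented for illustration):

```python
import pandas as pd

# Toy data, not the Trustpilot sample.
toy = pd.DataFrame({
    "mentionCurrency": [False, False, False, True, True],
    "rating": [5, 4, 5, 2, 4],
})

# Mean rating and number of reviews per group in one table.
summary = toy.groupby("mentionCurrency")["rating"].agg(["mean", "count"])
print(summary)
```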

> **Ex. 9.3.6:** Now we write a regular expression to extract emoticons from text.
Start by locating all mouths ')' of emoticons, and develop the variations from there. Remember that parentheses are special characters in regex, so you should use the escape character.

In [94]:
#9.3.6
sample_string = '\n'.join(df[df['hasNumber']].sample(1000).reviewBody)
explore_emoticon = e_re.ExploreRegex(sample_string)

#emoticon_pattern matches all the most common emoticons:
#    :), :(, :D, :o, :p, :* 
#also flipped, with noses '-', and winking eyes ';'.
#also the heart emoticon '<3'

emoticon_pattern = r'([:;]-?[\)\(PpDOo\*])|([\)\(\*]-?[:;])|<3' #raw string, so the backslashes reach the regex engine unchanged
explore_emoticon.explore_pattern(emoticon_pattern, n_samples = 30)

------ Pattern: ([:;]-?[\)\(PpDOo\*])|([\)\(\*]-?[:;])	 Matched 21 patterns -----
Match: :(	Context:er galley :( at $9.40,
Match: ):	Context: of Minted):  if you o
Match: :)	Context:s Izzy!!! :) I'm a Squ
Match: :-)	Context:redit card:-) Very prof
Match: :)	Context:excellent :)
The whole
Match: :)	Context: any one! :)
Inflates 
Match: ):	Context:RT version):

* I call
Match: :)	Context:og treats :)
I was the
Match: :-)	Context:day loans :-)
first tim
Match: :)	Context: giveaway :)
I had a c
Match: :(	Context:starving. :(
I loved t
Match: :(	Context:hat book. :(
My order 
Match: :)	Context: anyone.  :)
I traded 
Match: :-)	Context:t service :-)
booked a 
Match: :)	Context:thank you :)
A great e
Match: :D	Context: Wars MMO :D
Like many
Match: :D	Context:ys again. :D
From the 
Match: :)	Context:orry free.:)
I was loo
Match: :(	Context: cold now :(

*Thanks*
Match: :-(	Context: a while. :-(
Fees were
Match: ;)	Context:the title ;)

<3 Jani
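
The output above contains a few false positives such as `):` in `of Minted):`, where a closing parenthesis is simply followed by a colon. One possible refinement, sketched here on a made-up toy string, is to accept the flipped variants only when they are not glued to a preceding word character. The name `refined_emoticon` is an assumption of this sketch.

```python
import re

# Forward emoticons as before; flipped ones only when not directly after a
# word character (negative lookbehind); plus the heart '<3'.
refined_emoticon = re.compile(r'[:;]-?[)(PpDOo*]|(?<!\w)[)(*]-?[:;]|<3')

toy_text = "thank you :) (see the RT version): great, card:-) (: <3"
found = refined_emoticon.findall(toy_text)
print(found)
```

The lookbehind is applied only to the flipped alternative, so a forward emoticon written directly after a word, like `card:-)`, is still matched.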

