> **DO YOU USE GITHUB?**  
If True: print('Remember to make your edits in a personal copy of this notebook')  
Else: print("You don't have to understand. Continue your life.")

# Module 7: Web Scraping 2

In module 6 you learned some powerful tricks, but they only work when the data is already shipped in a neat format. This is often not the case. In this session we will learn the art of parsing unstructured text, as well as a more principled and advanced method of parsing HTML.

This will help you build ***custom datasets*** in just a few hours or days of work that would have taken ***months*** to curate and clean manually.



Readings for `module 6+7+8`:
- [Python for Data Analysis, chapter 6](https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf)
- [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
- [An introduction to web scraping with Python](https://towardsdatascience.com/an-introduction-to-web-scraping-with-python-a2601e8619e5)
- [Introduction to Web Scraping using Selenium](https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72)

Video material from `ISDS 2020`:
- [Web Scraping 1](https://bit.ly/ISDS2021_6)
- [Web Scraping 2](https://bit.ly/ISDS2021_7)
- [Web Scraping 3](https://bit.ly/ISDS2021_8)

Other resources:
- [Nicklas Webpage](https://nicklasjohansen.netlify.app/)
- [Data Driven Organizational Analysis, Fall 2021](https://efteruddannelse.kurser.ku.dk/course/2021-2022/ASTK18379U)
- [Master of Science (MSc) in Social Data Science](https://www.socialdatascience.dk/education)


## Introduction to HTML
[What is HTML?](https://www.w3schools.com/whatis/whatis_html.asp)  

HTML has a tree structure.

Each node in the tree has:
- Children, siblings, parents, descendants. 
- Ids and attributes

<img src="http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png"/>
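The relationships listed above can be explored directly in `BeautifulSoup`. The following is a minimal sketch on a tiny, made-up HTML document (the document and all names in it are assumptions for illustration); it uses the built-in `html.parser` so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
html_doc = """
<html>
  <body>
    <div id="main" class="article">
      <h1>Title</h1>
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
div = soup.find('div')

# Attributes (including the id) are accessed dictionary-style.
div_id = div['id']            # the id attribute
div_classes = div['class']    # class is multi-valued, so this is a list

# Children: the tags directly below the div.
child_names = [child.name for child in div.find_all(recursive=False)]

# Parent: the node directly above the div.
parent_name = div.parent.name

# Siblings: nodes on the same level, here the paragraph after the first one.
first_p = div.find('p')
next_p = first_p.find_next_sibling('p')

print(div_id, div_classes, child_names, parent_name, next_p.text)
```

Descendants (children, grandchildren and so on) are what a plain `find_all` searches by default; `recursive=False` restricts the search to direct children.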


## Important syntax and patterns
_______________
```html 
<p>The p tag indicates a paragraph</p>
```
_______________
```html 
<b>The b tag makes the text bold, giving us a clue to its importance </b>
```
output: <b>The b tag makes the text bold, giving us a clue to its importance </b>
```html
<em>The em tag emphasizes the text</em>, giving us a clue to its importance
```
output: <em>The em tag emphasizes the text</em>, giving us a clue to its importance
___________
```html 
<h1>h1</h1><h2>h2</h2><h3>h3</h3><b>Headers give similar clues</b>
```
output:
<h1>h1</h1><h2>h2</h2><h3>h3</h3><b>Headers give similar clues</b>  
  
```html 
<a href="https://www.google.com">The a tag creates a hyperlink</a>
```
output: <a href="https://www.google.com">The a tag creates a hyperlink</a>

## How do we find our way around this tree?
1. Regex: Extracting string patterns using `.split` and regular expressions.
2. CSS selectors: Specifying paths using CSS selectors or XPath syntax.
3. ```BeautifulSoup```: A more powerful, principled and readable way to parse data and navigate HTML.

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import selenium
import time
import pandas as pd

### Regex
- [What is regex?](https://en.wikipedia.org/wiki/Regular_expression)
- The brute-force way to parse is to convert your downloaded material into one large string.
- You can then apply standard string operations to it.
- And you can apply smart regex patterns to identify the data you are looking for, e.g. links.

In [3]:
url = 'https://www.theguardian.com/us-news/2019/aug/14/taco-eating-contest-death-fresno-california'
response = requests.get(url)
html = response.text
#html.split('\n')
#re.findall(r"(?P<url>https?://[^\s]+)", html)[0]
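To make the brute-force approach concrete, here is a minimal sketch that runs offline. The `sample_html` string is made up and stands in for `response.text`; the pattern is one possible choice, not the only way to match links.

```python
import re

# A short, made-up HTML snippet standing in for response.text.
sample_html = '''<p>Read <a href="https://example.com/a">this</a> and
<a href="http://example.org/b?x=1">that</a>, but not <a href="/relative">this one</a>.</p>'''

# Plain string operations: split the document into lines.
lines = sample_html.split('\n')

# Regex: match http:// or https:// followed by any run of characters
# that are not whitespace, quotes or angle brackets.
url_pattern = re.compile(r'https?://[^\s"\'<>]+')
urls = url_pattern.findall(sample_html)

print(len(lines))
print(urls)
```

Note that the relative link `/relative` is not matched, because the pattern requires a scheme. This kind of silent omission is one reason regex-only parsing of HTML tends to be fragile.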

### CSS Selectors 
- [What is a CSS Selector?](https://www.w3schools.com/css/css_selectors.asp)
- Another way to navigate the HTML tree.
- A selector defines a path to an element in the HTML tree.
- It is quick, but the path has to be hardcoded and is therefore more likely to break when the site changes.
- [Nicklas recommends this free CSS selector extension for Google Chrome](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

In [4]:
url = 'https://www.theguardian.com/us-news/2019/aug/14/taco-eating-contest-death-fresno-california'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
soup.select('.dcr-125vfar')[0].text

'Man dies after taco-eating contest in California'
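Since the live page may change, the same idea can be tried on a small, made-up snippet. The sketch below assumes nothing about The Guardian's markup; the class and id names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the class and id names are hypothetical.
html_doc = """
<div class="content">
  <h1 class="headline">Example headline</h1>
  <ul id="links">
    <li><a href="/one">One</a></li>
    <li><a href="/two">Two</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Select by class: returns a list of all matches.
headline = soup.select('.headline')[0].text

# select_one returns only the first match (or None if nothing matches).
first = soup.select_one('h1.headline').text

# Select by path: a tags inside li tags directly below the element with id "links".
hrefs = [a['href'] for a in soup.select('#links > li > a')]

print(headline, first, hrefs)
```

A selector tied to a class name that looks machine-generated, such as `dcr-125vfar` in the cell above, will likely stop matching if the site is rebuilt, which is the sense in which hardcoded selectors tend to break.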

## Parsing HTML with BeautifulSoup
BeautifulSoup makes the HTML tree navigable.
It allows you to:
* Search for elements by tag name and/or by attribute.
* Iterate through them and move up, sideways or down the tree.
* Handle standard tasks such as extracting raw text from HTML, which would be tedious to hardcode with `.split` commands and unstable with your own regular expressions.

In [5]:
url = 'https://www.theguardian.com/us-news/2019/aug/14/taco-eating-contest-death-fresno-california'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html,'lxml') # parse the raw html using BeautifulSoup

The raw HTML is just one long string. `BeautifulSoup` parses that string into a tree of tag objects that can be searched and navigated, so `soup` is no longer a plain string.

In [6]:
# example: finding hyperlinks
links = soup.find_all('a') # find all a tags, which denote hyperlinks.
[link['href'] for link in links if link.has_attr('href')][0:5] # unpack the hyperlink from the first five a nodes that have one.

['#maincontent',
 '#navigation',
 '/preference/edition/int',
 '/preference/edition/uk',
 '/preference/edition/us']

In [33]:
# example: finding headline
headline = soup.find('h1') # search for the first headline: h1 tag.
name = headline['class'][0].strip() # use the first class name as the column name.
value = headline.text.strip() # extract text using the built-in .text attribute.
print(name,':',value)

dcr-125vfar : Man dies after taco-eating contest in California


In [8]:
# example: finding article_text
article_text = soup.find_all('div', {'class':'dcr-185kcx9'})[0] # find_all is the current name for findAll
article_text.text

'A man died shortly after competing in a taco-eating contest at a minor league baseball game in California, authorities said Wednesday.Dana Hutchings, 41, of Fresno, died Tuesday night shortly after arriving at a hospital, said Tony Botti, a Fresno sheriff spokesman.An autopsy on Hutchings will be done Thursday to determine a cause of death, Botti said. It was not immediately known how many tacos the man had eaten or whether he had won the contest.Paul Braverman, a spokesman for the Fresno Grizzlies, said in a statement that the team was “devastated to learn” of the fan’s death and that the team would “work closely with local authorities and provide any helpful information that is requested”.Tuesday night’s competition came before Saturday’s World Taco Eating Championship, to be held at Fresno’s annual Taco Truck Throwdown. The team on Wednesday announced that it was canceling that contest.Matthew Boylan, who watched Tuesday’s taco eating contest from his seat in the stadium, told the 

Say we are interested in how articles cite sources to back up their story, i.e. their hyperlinking behaviour within the article, and we want to see whether the media have changed this behaviour over time.

We know how to search for links. But the cool part is that we can search from anywhere in the HTML tree. This means that once we have located the article content node, as above, we can search from there. The search then returns only the hyperlinks used within the article text.


In [9]:
# example: finding citation links
citations = article_text.find_all('a')

citation_links = [] # define container to the hyperlinks
for citation in citations: # iterate through each citation node
    if citation.has_attr('data-link-name'): # check if it has the right attribute
        if citation['data-link-name'] =='in body link': # and if the value of that attribute is correct
            print(citation['href'])
            citation_links.append(citation['href']) #  add link to the container

citation_links

https://www.theguardian.com/us-news/california
https://www.theguardian.com/us-news/gallery/2016/jul/04/nathans-famous-hotdog-eating-contest-in-pictures


['https://www.theguardian.com/us-news/california',
 'https://www.theguardian.com/us-news/gallery/2016/jul/04/nathans-famous-hotdog-eating-contest-in-pictures']
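The two nested `if` statements can also be expressed as an attribute filter passed to `find_all`. The sketch below shows this on made-up HTML (the markup is an assumption that only mimics the `data-link-name` attribute seen above).

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the data-link-name attribute used above.
html_doc = """
<nav><a href="/home" data-link-name="nav link">Home</a></nav>
<div class="article-body">
  <p>See <a href="https://example.com/source" data-link-name="in body link">the source</a>
  and <a href="https://example.com/ad" data-link-name="ad link">an ad</a>.</p>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Locate the article node first, then search only below it.
article = soup.find('div', {'class': 'article-body'})

# Only a tags whose data-link-name equals 'in body link' are returned.
citations = article.find_all('a', attrs={'data-link-name': 'in body link'})
citation_links = [a['href'] for a in citations]

print(citation_links)
```

The navigation link is excluded because the search starts at the article node, and the ad link is excluded by the attribute filter.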

## Creating a dataset from www.bold.dk

### Let's put together some of the stuff we have learned so far
1. **Investigate:** In this example we first investigate the website to understand its structure.
2. **Mapping:** Then we collect all the URLs and save them in a list.
3. **Parsing:** Finally, we collect the information from each URL in a simple loop.

#### First, we play around with the site to understand its structure

In [25]:
# define our URL
url = 'https://www.bold.dk/' 

# connects to site
response = requests.get(url)

# parse data with BeautifulSoup
soup = BeautifulSoup(response.text,'lxml') # parse the HTML

#identify articles to scrape by inspecting site
articles = soup.find_all('div',{'class':'news_list_item'}) # search for the div nodes that hold one article each

# checking if articles match website
for i in range(1):
    print(articles[i].text.strip())

# identifying how to find an url from an article
article_url = articles[0].attrs['data-vr-contentbox-url']
print(article_url)

Saints om Ings-skifte: Vi får en masse penge

09:54
https://www.bold.dk/fodbold/nyheder/saints-om-ings-skifte-vi-faar-en-masse-penge/


In [26]:
articles[0]

<div article_id="389211" class="news_list_item" data-vr-contentbox="#1" data-vr-contentbox-url="https://www.bold.dk/fodbold/nyheder/saints-om-ings-skifte-vi-faar-en-masse-penge/" id="news_list_item_389211" position="1" tag_ids="6331,8202,8212">
<div class="checkbox ball no_text_select" style="margin:7px 6px 3px 2px">
<img class="unchecked" src="https://s3.eu-central-1.amazonaws.com/static.bold.dk/img/sprites/newslist_checkbox_20x40.png"/>
</div>
<a class="title" href="/fodbold/nyheder/saints-om-ings-skifte-vi-faar-en-masse-penge/"><span data-vr-headline="">Saints om Ings-skifte: Vi får en masse penge</span>
<img src="https://s3.eu-central-1.amazonaws.com/static.bold.dk/img/tag/180x180/8206.png" style="position: absolute;width: 12px;right: 28px;top: 7px;"/>
<span class="font9 note-grey" style="right:2px;">09:54</span>
</a>
</div>

In [27]:
articles[0].a

<a class="title" href="/fodbold/nyheder/saints-om-ings-skifte-vi-faar-en-masse-penge/"><span data-vr-headline="">Saints om Ings-skifte: Vi får en masse penge</span>
<img src="https://s3.eu-central-1.amazonaws.com/static.bold.dk/img/tag/180x180/8206.png" style="position: absolute;width: 12px;right: 28px;top: 7px;"/>
<span class="font9 note-grey" style="right:2px;">09:54</span>
</a>

In [28]:
articles[0].a['href']

'/fodbold/nyheder/saints-om-ings-skifte-vi-faar-en-masse-penge/'
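The `href` above is relative, whereas the `data-vr-contentbox-url` attribute holds the full address. If only a relative link is available, the standard library can turn it into an absolute URL. A minimal sketch:

```python
from urllib.parse import urljoin

base_url = 'https://www.bold.dk/'
relative_href = '/fodbold/nyheder/saints-om-ings-skifte-vi-faar-en-masse-penge/'

# Combine the page address with the relative link.
absolute_url = urljoin(base_url, relative_href)

# Links that are already absolute are returned unchanged.
unchanged = urljoin(base_url, 'https://example.com/page')

print(absolute_url)
print(unchanged)
```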

#### Second, we create a list of urls that we want to scrape

In [30]:
url = 'https://www.bold.dk/' 
response = requests.get(url)
soup = BeautifulSoup(response.text,'lxml')
articles = soup.find_all('div',{'class':'news_list_item'})

#create an empty list
list_of_article_urls = []

# creating a loop that appends the article url to the list above
for i in range(len(articles)):
    list_of_article_urls.append(articles[i].attrs['data-vr-contentbox-url'])

#printing the list
#list_of_article_urls

#printing one example
print(list_of_article_urls[0])

https://www.bold.dk/fodbold/nyheder/saints-om-ings-skifte-vi-faar-en-masse-penge/


#### Third, we scrape each site from the url list

In [31]:
# this step usually requires a new round of investigation
# to figure out what information you want to download
# in this example we want the title, the lead and the time posted

# creating empty lists for the information we want to extract for every article
h1_list = []
lead = []
time_posted = []

for i in range(10): # first 10 articles only; use len(list_of_article_urls) to scrape them all
    
    # this time we scrape for each news article in the url list we created before
    url = list_of_article_urls[i]
    response = requests.get(url)
    soup = BeautifulSoup(response.text,'lxml')
    
    # pedagogical way of appending the title to the list
    temp_1 = soup.find_all('h1')
    temp_1 = temp_1[1]
    temp_1 = temp_1.text.strip()
    h1_list.append(temp_1)
    
    # how I would actually do it
    lead.append(soup.find_all('div',{'class':'lead'})[0].text.strip())
    
    # sometimes you do weird things that happen to work
    temp_3 = soup.find_all('time')
    temp_3 = temp_3[0]
    temp_3 = str(temp_3)[16:32]
    time_posted.append(temp_3)

In [130]:
# h1 
soup.find_all('h1')

[<h1 class="break_new_headline"></h1>,
 <h1>PL-klubber fortsætter knælen i ny sæson</h1>,
 <h1 class="title">Fodbold - Seneste nyheder</h1>]

In [145]:
soup.find_all('h1')[1]

<h1>PL-klubber fortsætter knælen i ny sæson</h1>

In [146]:
soup.find_all('h1')[1].text.strip()

'PL-klubber fortsætter knælen i ny sæson'

In [144]:
# lead
soup.find_all('div',{'class':'lead'})[0].text.strip()

'De 20 Premier League-klubber er blevet enige om at knæle før kickoff i næste sæson også med budskabet om at få racisme ud af fodbolden.'

In [129]:
# time_posted
soup.find_all('time')[0]

<time datetime="2021-08-03 20:28">03.08.2021 20:28</time>
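The string slicing used in the loop works because the timestamp sits at a fixed position in the tag's text representation. A less position-dependent alternative is to read the `datetime` attribute directly. The sketch below uses the tag shown above as a literal string, so it runs without fetching the page.

```python
from datetime import datetime
from bs4 import BeautifulSoup

# The time tag from the output above, parsed from a literal string.
time_html = '<time datetime="2021-08-03 20:28">03.08.2021 20:28</time>'
time_tag = BeautifulSoup(time_html, 'html.parser').find('time')

# The slicing approach from the loop.
sliced = str(time_tag)[16:32]

# Reading the attribute gives the same value without counting characters.
from_attr = time_tag['datetime']

# Optionally convert the string to a datetime object.
parsed = datetime.strptime(from_attr, '%Y-%m-%d %H:%M')

print(sliced, from_attr, parsed)
```

Attribute access keeps working if the tag gains extra attributes or the timestamp format changes length, which the fixed slice would not.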

#### Lastly, we put our collected information into a dataframe

In [32]:
import pandas as pd
df = pd.DataFrame({'title':h1_list, 'lead':lead, 'time':time_posted})
df

Unnamed: 0,title,lead,time
0,Saints om Ings-skifte: Vi får en masse penge,Selv om Southampton gerne ville have forlænget...,2021-08-05 09:54
1,Emil Nielsen lavede fire: Slet ingen arrogance,Emil Nielsen scorede fire mål i Lyngbys 9-0-po...,2021-08-05 09:35
2,Tårnby-træner hylder vildt mål: Helt vanvittigt,Tårnby FF-træner Timos Adraktas storroser Patr...,2021-08-05 09:12
3,Officielt: Frederik Sørensen på plads i Serie B,"Ternana bekræfter, at Serie B-klubben henter d...",2021-08-05 08:50
4,Avis: Brasiliansk spids er nu enig med FCM,Den brasilianske angriber Marrony er nu angive...,2021-08-05 08:43
5,Strandby tror på mirakel: Skagen slog jo BIF,Serie 3-klubben Strandby-Elling-Nielstrup tror...,2021-08-05 08:25
6,Klubløs siden januar: Izunna tager til Finland,Den tidligere Superliga-spiller Izunna Uzochuk...,2021-08-05 08:11
7,Klitten var ombejlet: Frosinone er det rigtige,"Lukas Klitten fortæller, at han havde flere ti...",2021-08-05 07:40
8,Medie: Lyngby henter islandsk angriber,Lyngby henter angiveligt den islandske angribe...,2021-08-05 07:17
9,Skuffet Zagreb-dansker: Var bedre end Legia,Dinamo Zagreb-danskeren Rasmus Lauritsen er sk...,2021-08-05 07:05


In [147]:
# saving df
df.to_csv('df_bold.dk.csv')

# loading df
pd.read_csv('df_bold.dk.csv', index_col=0)

Unnamed: 0,title,lead,time
0,PSV-anfører vil ikke kalde FCM en walkover,PSV-anfører Marco van Ginkel vil ikke kalde ka...,2021-08-03 22:30
1,Gerrard om overtidsmål: Et kæmpe øjeblik,Rangers-manager Steven Gerrard nød at se holde...,2021-08-03 22:06
2,Onuachu scorede forgæves i Shakhtar-triumf,"Paul Onuachu bragte Genk foran, men det var ik...",2021-08-03 21:55
3,PSV blæste decimerede FCM omkuld,PSV satte et afbudsramt FC Midtjylland-hold på...,2021-08-03 21:52
4,SLUT: PSV - FCM minut for minut,"FC Midtjylland er uden flere profiler, når hol...",2021-08-03 21:47
5,Ajax belønner 18-årig dansker med ny aftale,Ajax har forlænget kontrakten med deres unge d...,2021-08-03 21:22
6,Overblik: Disse hold er videre i pokalen,Her får du overblikket over alle tirsdagens re...,2021-08-03 21:10
7,Fredericia er videre efter vildt pokal-drama,FC Fredericia er videre til anden runde i Sydb...,2021-08-03 20:59
8,Malmö-triumf: Rieks og AC dukkede Rangers,"Søren Rieks scorede, og Anders Christiansen as...",2021-08-03 20:53
9,PL-klubber fortsætter knælen i ny sæson,De 20 Premier League-klubber er blevet enige o...,2021-08-03 20:28


# Exercise Set 7: Web Scraping 2

In this Exercise Set we shall develop our web scraping skills even further by practicing **parsing** and navigating HTML trees using `BeautifulSoup`. We will also train extracting information from raw text, with no HTML tags to help, using regular expressions.

But just as importantly, you will get a chance to think about **data quality issues** and how to ensure reliability when curating your own web data.

## Exercise Section 7.1: Logging and data quality

> **Ex. 7.1.1:** *`Why` is it important to log processes in your data collection?*



> **Ex. 7.1.2:**
*`How` does logging help with both ensuring and documenting the quality of your data?*


## Exercise Section 7.2: Parsing a Table from HTML using BeautifulSoup.

In module 6 I showed you a neat little prepackaged function in pandas that did all the work. Today, however, we will learn the mechanics behind it. *(This is not just for educational purposes: sometimes the package will not do exactly what you want.)*

We hit the Basketball stats page from yesterday again: https://www.basketball-reference.com/leagues/NBA_2018.html.


> **Ex. 7.2.1:** Here we practice locating the table node of interest using the `find` method built into BeautifulSoup. First fetch the HTML using the `requests` module and parse the tree using `BeautifulSoup`. Then use the **>Inspector<** tool (*right-click on the table > press Inspect Element*) in your browser to see how to locate the Eastern Conference table node, i.e. the *tag* name of the node and maybe some defining *attributes*.

In [166]:
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html'
response = requests.get(url)
soup = BeautifulSoup(response.text,'lxml')

In [248]:
#soup.find_all('table')[0].find('caption')

<caption>Conference Standings Table</caption>

In [231]:
#conf_E = soup.find('div',{'id':'div_confs_standings_E'})
conf_E=soup.find_all('table')[0]

'Conference Standings Table'

In [240]:
conf_E.find_all('th',{'data-stat':'team_name'})[15].text

'Atlanta Hawks'

In [108]:
#conf_E.find_all('th')

In [315]:
# collect the text of every column header cell (th with scope="col")
COLS = [th.text for th in conf_E.find_all('th',{'scope' : 'col'})]
COLS

['Eastern Conference', 'W', 'L', 'W/L%', 'GB', 'PS/G', 'PA/G', 'SRS']

In [202]:
conf_E.find_all('tr',{'class':'full_table'})[0]

<tr class="full_table"><th class="left" data-stat="team_name" scope="row"><a href="/teams/TOR/2018.html">Toronto Raptors</a>*</th><td class="right" data-stat="wins">59</td><td class="right" data-stat="losses">23</td><td class="right" data-stat="win_loss_pct">.720</td><td class="right" data-stat="gb">—</td><td class="right" data-stat="pts_per_g">111.7</td><td class="right" data-stat="opp_pts_per_g">103.9</td><td class="right" data-stat="srs">7.29</td></tr>

In [219]:
wins =[]
losses = []
winLossPerc = []
gamesBehind = []
PPG = []
opponentPPG = []
SimpeRS = []
table = conf_E.find_all('tr',{'class':'full_table'})
for row in table:
    cells = row.find_all('td') # find the data cells of the row once, then index into them
    wins.append(cells[0].text)
    losses.append(cells[1].text)
    winLossPerc.append(cells[2].text)
    gamesBehind.append(cells[3].text)
    PPG.append(cells[4].text)
    opponentPPG.append(cells[5].text)
    SimpeRS.append(cells[6].text)
#conf_E.find_all('tr',{'class':'full_table'})[0]#.find_all('td')#[0]

In [155]:
conf_E.find_all('tr',{'class':'full_table'})[0].find_all('td')[1].text
table[0].find_all('td')[1].text

'23'

In [220]:
import pandas as pd
df = pd.DataFrame({COLS[1]:wins, COLS[2]:losses, COLS[3]:winLossPerc,\
                   COLS[4]:gamesBehind, COLS[5]:PPG, COLS[6]:opponentPPG, COLS[7]:SimpeRS})
df

Unnamed: 0,Wins,Losses,Win-Loss Percentage,Games Behind,Points Per Game,Opponent Points Per Game,Simple Rating System
0,59,23,0.72,—,111.7,103.9,7.29
1,55,27,0.671,4.0,104.0,100.4,3.23
2,52,30,0.634,7.0,109.8,105.3,4.3
3,29,53,0.354,30.0,104.5,108.0,-3.53
4,28,54,0.341,31.0,106.6,110.3,-3.67
5,50,32,0.61,—,110.9,109.9,0.59
6,48,34,0.585,2.0,105.6,104.2,1.18
7,44,38,0.537,6.0,106.5,106.8,-0.45
8,39,43,0.476,11.0,103.8,103.9,-0.26
9,27,55,0.329,23.0,102.9,110.0,-6.84


Now that you have located the table, you should build a function that starts at a "table node", parses the information, and outputs a pandas DataFrame.

Inspect the element either within the notebook or through the **>Inspector<** tool and start to see how a table is written in HTML. Which tag names can be used to locate rows? How will you iterate through columns? Where is the header located?

> **Ex. 7.2.2:** First you parse the header, which can be found under the canonical tag name: thead.
Next you use the `find_all` method to search for the header cell tag, and iterate through each of the elements, extracting the text using the `.text` attribute built into the node object. Store the header values in a list container.

> **Ex. 7.2.3:** Next you locate the rows, using the canonical tag name: tbody. From here you search for all row tags. Figure out the tag name yourself by inspecting the tbody node in Python or using the **Inspector**.

> **Ex. 7.2.4:** Next run through all the rows and extract each value, similar to how you extracted the header. However, here is a slight variation: since each value node can have a different tag depending on whether it is a digit or a string, you should use the `.children` attribute instead of `.find_all` (or compile a regex that matches both the td tag and the th tag).
>Once the value nodes of each row have been located using `.children`, you should extract the value. Store the extracted rows as a list of lists: ```[[val1,val2,...valk],...]```
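The mechanics of Ex. 7.2.2 to 7.2.4 can be sketched on a toy table before applying them to the real page. The table below is made up, and `parse_table` is a hypothetical helper name. Instead of `.children` it passes a list of tag names to `find_all`, which is one alternative way to match both `th` and `td` cells.

```python
from bs4 import BeautifulSoup

# A made-up table with the same thead/tbody layout described above.
table_html = """
<table id="toy_standings">
  <thead>
    <tr><th scope="col">Team</th><th scope="col">W</th><th scope="col">L</th></tr>
  </thead>
  <tbody>
    <tr><th scope="row">Team A</th><td>59</td><td>23</td></tr>
    <tr><th scope="row">Team B</th><td>55</td><td>27</td></tr>
  </tbody>
</table>
"""
table_node = BeautifulSoup(table_html, 'html.parser').find('table')

def parse_table(table_node):
    """Return (header, rows) for a table node; hypothetical helper."""
    # Header: the text of every th inside thead.
    header = [th.text for th in table_node.find('thead').find_all('th')]
    # Rows: every tr inside tbody, taking th and td cells in document order.
    rows = []
    for tr in table_node.find('tbody').find_all('tr'):
        rows.append([cell.text for cell in tr.find_all(['th', 'td'])])
    return header, rows

header, rows = parse_table(table_node)
print(header)
print(rows)
```

With pandas imported, `pd.DataFrame(rows, columns=header)` turns the result into a DataFrame. Real tables may contain extra header rows or rows of a different length, so the sketch would need checks for those cases.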

In [320]:
tables = soup.find_all('table')
def create_table(arg):
    # guard against indices outside the list of tables on the page
    if arg < 0 or arg >= len(tables):
        print('index out of range')
        return None
    table = tables[arg]
    header = table.find('caption').text # the table title
    # column names are the header cells marked with scope="col"
    colNames = [th.text for th in table.find_all('th',{'scope':'col'})]
    # one empty container per value column (the first column holds the row labels)
    values = {'%s_values' % name: [] for name in colNames[1:]}
    return header, values

In [311]:
#tables[0].find_all('tr',{'class':'full_table'})

In [313]:
tables[6]

<table class="stats_table sortable" data-cols-to-freeze=",2" id="totals-team"> <caption>Total Stats Table</caption> <colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup> <thead> <tr> <th aria-label="Rank" class="ranker poptip sort_default_asc show_partial_when_sorting center" data-stat="ranker" data-tip="Rank" scope="col">Rk</th> <th aria-label="team" class="poptip center" data-stat="team" scope="col">Team</th> <th aria-label="Games" class="poptip center" data-stat="g" data-tip="Games" scope="col">G</th> <th aria-label="Minutes Played" class="poptip center" data-stat="mp" data-tip="Minutes Played" scope="col">MP</th> <th aria-label="Field Goals" class="poptip center" data-stat="fg" data-tip="Field Goals" scope="col">FG</th> <th aria-label="Field Goal Attempts" class="poptip center" data-stat="fga" data-tip="Field Goal Attempts" scope="col">FGA</th> <th aria-label="Field

> **Ex. 7.2.5:** Now locate all tables from the page, using the `.find_all` method searching for the table tag name. Iterate through the table nodes and apply the function created for parsing html tables. Store each table in a dictionary using the table name as key. The name is found by accessing the id attribute of each table node, using dictionary-style syntax - i.e. `table_node['id']`.

> **Ex. 7.2.6. (extra) :** Compare your results to the pandas implementation [pd.read_html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html)

## Exercise Section 7.3: The European Research Council (optional)
**NOTE** Exercise 7.3 is difficult and therefore optional. I expect fewer than 10% of you to be able to solve it.

Imagine we wanted to analyze whether the European funding behaviour was biased towards certain countries and gender. We might decide to scrape who has received funding from the ERC.
https://erc.europa.eu/

* First we figure out how to navigate the grant listings.
* Next we figure out how to page these results. 
* And finally we want to grab the information.

### Data Storage and operating system interactions

> **Ex. 7.3.1:** *Import the Python library `os`. Write Python code in this Jupyter notebook that creates a new folder in your directory called "erc_funding". Inside your new folder create 3 subfolders called 'mapping', 'raw_data' and 'parsed_data'.*
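One possible way to create such a folder structure is sketched below. To avoid touching your real working directory, the sketch builds everything inside a temporary directory (an assumption made for illustration); in the notebook you would use your own base path instead.

```python
import os
import tempfile

# Temporary base directory so the sketch leaves your project folder untouched.
base_dir = tempfile.mkdtemp()
project_dir = os.path.join(base_dir, 'erc_funding')

# makedirs creates missing parent folders too; exist_ok=True makes
# the cell safe to re-run without raising an error.
for sub in ['mapping', 'raw_data', 'parsed_data']:
    os.makedirs(os.path.join(project_dir, sub), exist_ok=True)

created = sorted(os.listdir(project_dir))
print(created)
```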



### Mapping

> **Ex. 7.3.2:** *Investigate [https://erc.europa.eu/projects-figures/erc-funded-projects/results?items_per_page=100&search_api_views_fulltext=&](https://erc.europa.eu/projects-figures/erc-funded-projects/results?items_per_page=100&search_api_views_fulltext=&). Figure out how many pages you need to loop through. Save the response for each page using `codecs` in your "mapping" subfolder. Use `tqdm` to track your loop.
Use the Snorre Ralund Connector class to log your activity.*



### Parsing

> **Ex. 7.3.3:** *Write a function that takes a filename (from our mapping subfolder) as input and returns (in our parsed_data subfolder) a `pandas` dataframe of parsed information. Use the `os` library to navigate your operating system (paths) and the `codecs` library to read files inside your function. Lastly, concatenate all your dataframes into one dataframe called "df" consisting of all parsed data.*

### Reliability and Data Quality

##### Inspect the data
> **Ex. 7.3.4:** *Investigate your dataframe "df". Check for duplicates. Count NaN values. Create a `matplotlib` histogram plot for every column of "df" illustrating the length of the string (x-axis) and row count (y-axis).*

##### Do simple descriptives
> **Ex. 7.3.5:** *Create a value_counts() for each of the three columns (Host Institution (HI), Researcher (PI) and Project acronym) in your "df". What can counting do for us in this exercise in terms of Reliability and Data Quality?*

##### Visualize the Log
> **Ex. 7.3.6:** *Load your "erc_log.csv". Convert the time column 't' to datetime. Use `matplotlib` to create three plots: (1) the time it took to make the call, (2) the response size over time, and (3) delta_t against response_size.*