# WEB SCRAPING: Day 1

The following contains the coding examples used on the first day of the CALDISS workshop "Web Scraping for the Social Sciences".

The code is meant to illustrate how the output from various DMI (Digital Methods Initiative) tools can be used.

Copy the code to your own folder to run and edit it yourself.

**CONTENT**
- DMI Tool: Text Ripper
- DMI Tool: Image Scraper
- DMI Tool: Search Engine Scraper
- Using Statistics Denmark’s API for StatBank (https://www.dst.dk/en/Statistik/statistikbanken/api)
- Using the Twitter API

## DMI TOOL: Text Ripper

Link to the tool: https://tools.digitalmethods.net/beta/textRipper/

What can we do with the output of Text Ripper?

The code below is inspired by this Medium post: https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908

With Python libraries such as `re` and `string`, we can quickly preprocess raw text to make it suitable for analysis.

In [2]:
import re
import string
import requests

textrip_url = "https://tools.digitalmethods.net/beta/results/textripper/frontpages2046898822.txt"
textrip_resp = requests.get(textrip_url)
text_raw = textrip_resp.text
print(text_raw)

https://www.en.caldiss.aau.dk/about/	About CALDISS                                              News    Events    Contact    Campus Areas    For the press    For Alumni                                Shortcuts       News    Events    Contact    Campus Areas    For the press    For Alumni        aau education    aau research    aau cooperation    About AAU    Vacant Positions    Staff and students                                                                    Caldiss  /  About CALDISS  /    Menu    Caldiss  /  About CALDISS  /         About CALDISS             The digital data and methods laboratory, CALDISS (Computational Analytics Laboratory for Digital Social Science), is a shared physical methods laboratory as well as a digital platform for the Faculty of Social Sciences at AAU, where new possibilities with digital data and digital methods are addressed and explored. The main purposes of the lab is to help everyone associated with the Faculty of Social Sciences at AAU better rea

In [3]:
lc_text = text_raw.lower()  #Convert to lower case
nopunct_text = lc_text.translate(str.maketrans("", "", string.punctuation))  #Remove (ASCII) punctuation
strip_text = nopunct_text.strip()  #Remove leading and trailing white-space
text_vec = strip_text.split(" ")  #Split on spaces - create a list of tokens
text_vec = [i for i in text_vec if i != '']  #Drop empty strings left by repeated spaces
print(text_vec)

['httpswwwencaldissaaudkabout\tabout', 'caldiss', 'news', 'events', 'contact', 'campus', 'areas', 'for', 'the', 'press', 'for', 'alumni', 'shortcuts', 'news', 'events', 'contact', 'campus', 'areas', 'for', 'the', 'press', 'for', 'alumni', 'aau', 'education', 'aau', 'research', 'aau', 'cooperation', 'about', 'aau', 'vacant', 'positions', 'staff', 'and', 'students', 'caldiss', 'about', 'caldiss', 'menu', 'caldiss', 'about', 'caldiss', 'about', 'caldiss', 'the', 'digital', 'data', 'and', 'methods', 'laboratory', 'caldiss', 'computational', 'analytics', 'laboratory', 'for', 'digital', 'social', 'science', 'is', 'a', 'shared', 'physical', 'methods', 'laboratory', 'as', 'well', 'as', 'a', 'digital', 'platform', 'for', 'the', 'faculty', 'of', 'social', 'sciences', 'at', 'aau', 'where', 'new', 'possibilities', 'with', 'digital', 'data', 'and', 'digital', 'methods', 'are', 'addressed', 'and', 'explored', 'the', 'main', 'purposes', 'of', 'the', 'lab', 'is', 'to', 'help', 'everyone', 'associated'
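Note that the first token above still contains a tab character, because the text is only split on single spaces, and that `string.punctuation` only covers ASCII punctuation. A minimal sketch of a slightly more robust tokenizer, using only the standard library, could look like the following. The function name `tokenize`, the choice to drop URLs entirely, and the sample text are assumptions made for illustration; they are not part of the workshop code.

```python
import re
import string

# ASCII punctuation plus the curly quotes that often appear in web text
PUNCTUATION = string.punctuation + "‘’“”"

def tokenize(text):
    """Lower-case the text, drop URLs and punctuation, and split on any whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())  # Replace URLs with a space
    text = text.translate(str.maketrans("", "", PUNCTUATION))  # Remove punctuation
    return text.split()  # split() without arguments splits on spaces, tabs and newlines

# Made-up sample in the style of the Text Ripper output
sample = "https://www.en.caldiss.aau.dk/about/\tAbout CALDISS    News   CALDISS’ lab."
print(tokenize(sample))
```

Because `split()` is called without arguments, runs of spaces do not produce empty strings, so no extra filtering step is needed.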

An important aspect of text preprocessing is removing stopwords: very common words (such as "the" and "for") that carry little meaning on their own.

In [4]:
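from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  #The old "stop_words" module is no longer available in recent scikit-learn versions

tokens = text_vec
stop_words = set(ENGLISH_STOP_WORDS)
tokens_nostop = [i for i in tokens if i not in stop_words]  #Keep only tokens that are not stopwords
print(tokens_nostop)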

['httpswwwencaldissaaudkabout\tabout', 'caldiss', 'news', 'events', 'contact', 'campus', 'areas', 'press', 'alumni', 'shortcuts', 'news', 'events', 'contact', 'campus', 'areas', 'press', 'alumni', 'aau', 'education', 'aau', 'research', 'aau', 'cooperation', 'aau', 'vacant', 'positions', 'staff', 'students', 'caldiss', 'caldiss', 'menu', 'caldiss', 'caldiss', 'caldiss', 'digital', 'data', 'methods', 'laboratory', 'caldiss', 'computational', 'analytics', 'laboratory', 'digital', 'social', 'science', 'shared', 'physical', 'methods', 'laboratory', 'digital', 'platform', 'faculty', 'social', 'sciences', 'aau', 'new', 'possibilities', 'digital', 'data', 'digital', 'methods', 'addressed', 'explored', 'main', 'purposes', 'lab', 'help', 'associated', 'faculty', 'social', 'sciences', 'aau', 'better', 'realize', 'make', 'use', 'potentials', 'developments', 'digital', 'social', 'science', 'offer', 'help', 'researchers', 'students', 'integrate', 'new', 'data', 'method', 'solutions', 'research', 're

After preprocessing we are ready for some simple computations (here using `pandas`).

In [5]:
import pandas as pd

text_df = pd.DataFrame({'Tokens': tokens_nostop})  #Create a one-column dataframe holding the tokens

text_df['Tokens'].value_counts()[:20]  #Count the tokens - top 20 tokens

data           53
caldiss        50
aau            34
social         28
new            26
sciences       22
digital        20
research       20
methods        19
faculty        16
contact        14
researchers    13
areas          13
               12
activities     11
management     10
science         9
use             9
caldiss’        9
students        9
Name: Tokens, dtype: int64
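The same kind of frequency table can also be sketched without `pandas`, using `collections.Counter` from the standard library. The token list below is a small made-up sample standing in for `tokens_nostop`.

```python
from collections import Counter

# Made-up sample standing in for tokens_nostop
sample_tokens = ["data", "caldiss", "data", "aau", "caldiss", "data"]

token_counts = Counter(sample_tokens)  # Map each token to its number of occurrences
top_tokens = token_counts.most_common(2)  # The two most frequent tokens as (token, count) pairs
print(top_tokens)
```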

## DMI TOOL: Image Scraper

Link to the tool: https://tools.digitalmethods.net/beta/imagesDeep/

Using the output from Image Scraper, we can write a short script to download all the images on a website.

In [6]:
import urllib.request
import pandas as pd
import os

image_csv = 'caldiss_imgurls.csv'
image_df = pd.read_csv(image_csv, sep = ",", skiprows = 1, header=None)  #Read the DMI image scraper output as a pandas dataframe
image_urls = list(image_df.loc[:,1])  #Extract the image URLs (second column) and store them in a list

image_name = "caldissimg"

img_folder = "./images/"

try:  #Create the folder "images". If it already exists, report that instead.
    os.mkdir(img_folder)
except FileExistsError:
    print("'images' folder already exists")
else:
    print("'images' folder created")

for i, url in enumerate(image_urls, start = 1):  #Loops through each URL, using i (starting at 1) as identifier
    img_path = image_name + str(i) + ".png"  #Creates the file name - assumes all images are .png!
    urllib.request.urlretrieve(url, filename = img_folder + img_path)  #Saves the image to the folder "images"

'images' folder created
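The script above saves every image with a `.png` extension, whatever its real format. A minimal sketch of how the extension could instead be taken from the URL itself is shown below. The helper name `image_filename`, the fallback to `.png`, and the example URLs are assumptions for illustration.

```python
import os
from urllib.parse import urlparse

def image_filename(url, prefix, number, default_ext=".png"):
    """Build a file name such as 'caldissimg1.jpg' from the extension found in the URL."""
    path = urlparse(url).path  # Keep only the path, dropping the domain and any query string
    ext = os.path.splitext(path)[1].lower()  # For example '.jpg'; empty if there is none
    return prefix + str(number) + (ext or default_ext)

# Hypothetical URLs, used only to show the behaviour
first_name = image_filename("https://example.org/img/logo.JPG?width=200", "caldissimg", 1)
second_name = image_filename("https://example.org/img/banner", "caldissimg", 2)
print(first_name, second_name)
```

This only inspects the URL, so it can still be wrong when a server delivers a different format than the URL suggests.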


## DMI TOOL: Search Engine Scraper

Link to the tool: https://tools.digitalmethods.net/beta/searchEngineScraper/

The output of the Search Engine Scraper from the Digital Methods Initiative can be used in a spider to extract content from the sites returned in Google's search results.

In the following, two pre-saved csv files are used, containing the top 10 Google search results for "education" in Sweden and Venezuela respectively.

In [None]:
import pandas as pd

swed_csv = 'swed_goog.csv'
swed_df = pd.read_csv(swed_csv, sep = '\t')  #Load the tab-separated file as a pandas dataframe
swed_urls = list(swed_df['article url'])  #Extract the URLs only
swed_urls

After reading in the list of URLs, we can loop through them and extract specific elements from each page.

In [None]:
import requests
from scrapy import Selector
import time
import random

swed_titles = []  #Create empty list
for url in swed_urls:  #Iterate over URLs in list
    time_delay = random.uniform(0.3,1.5)  #Set delay between 0.3 and 1.5 seconds in each iteration
    time.sleep(time_delay)  #Activate delay

    try:  #Error-handling - the page is requested only once and the response is reused
        resp = requests.get(url, timeout = 10)
    except requests.RequestException:  #Connection errors, timeouts etc.
        title = "website NA"
    else:
        url_sel = Selector(text = resp.text)  #Convert the HTML text to a Selector object
        title = url_sel.css('title::text').extract_first()  #Extract the text of the title element

    swed_titles.append(title)  #Add title to list

The title elements of the sites are now stored in the list `swed_titles`.

In [None]:
swed_titles  #print the titles

In [None]:
venez_csv = 'venez_goog.csv'
venez_df = pd.read_csv(venez_csv, sep = '\t')
venez_urls = list(venez_df['article url'])
venez_urls

In [None]:
venez_titles = []
for url in venez_urls:
    time_delay = random.uniform(0.3,1.5)
    time.sleep(time_delay)

    try:  #Request the page only once and reuse the response
        resp = requests.get(url, timeout = 10)
    except requests.RequestException:
        title = "website NA"
    else:
        url_sel = Selector(text = resp.text)
        title = url_sel.css('title::text').extract_first()

    venez_titles.append(title)

In [None]:
venez_titles
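The loops for Sweden and Venezuela are identical apart from the URL list, so the logic could be wrapped in a single function. The sketch below is one way to do that under some assumptions: the names `TitleParser`, `extract_title` and `scrape_titles` are hypothetical, the title is extracted with the standard library's `html.parser` instead of the scrapy `Selector` used above, and the download step is passed in as a `fetch` function so the example can run without network access.

```python
import random
import time
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of an HTML document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:  # Only the first title counts
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html):
    """Return the title text, or None if the page has no title element."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title

def scrape_titles(urls, fetch, delay=(0.3, 1.5)):
    """Return one title per URL; fetch is any function mapping a URL to an HTML string."""
    titles = []
    for url in urls:
        time.sleep(random.uniform(*delay))  # Random pause between requests
        try:
            html = fetch(url)
        except Exception:  # Broad on purpose in this sketch: any failure marks the site as unavailable
            titles.append("website NA")
        else:
            titles.append(extract_title(html))
    return titles

# Made-up pages standing in for real downloads
fake_pages = {"https://example.org/a": "<html><head><title>Education in Sweden</title></head></html>"}

def fake_fetch(url):
    return fake_pages[url]  # Raises KeyError for unknown URLs, simulating a failed request

demo_titles = scrape_titles(["https://example.org/a", "https://example.org/b"], fake_fetch, delay=(0, 0))
print(demo_titles)
```

With `requests`, the `fetch` argument could for instance be `lambda url: requests.get(url, timeout=10).text`, after which `scrape_titles(swed_urls, fetch)` and `scrape_titles(venez_urls, fetch)` would replace the two loops.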

## Using the Statistics Denmark’s API for StatBank

Link to API documentation: https://www.dst.dk/en/Statistik/statistikbanken/api

Statistics Denmark's API for StatBank makes it possible to access the data in StatBank (http://www.statbank.dk/statbank5a/default.asp?w=2560).

The following demonstrates how to interact with the API directly via Python.

The `PyDST` package is a Python package for working with the API in a simpler way: https://github.com/Kristianuruplarsen/PyDST

### Finding the right table

The StatBank has several APIs. The most useful is the API for extracting data: https://api.statbank.dk/v1/data

However, making use of the data API requires knowing what to ask it, which depends on the table we want to extract data from.

The "tableinfo" API returns information regarding a specific table in StatBank: https://api.statbank.dk/v1/tableinfo

Before even interacting with the API, it makes the most sense to find the table you want to draw data from via the main StatBank page: http://www.statbank.dk/statbank5a/default.asp?w=2560


## Extracting information about the Danish population ("FOLK1C")

In the following, we use the "tableinfo" API to find information regarding the table: FOLK1C.

In [None]:
import json
import requests

statbank_api = "https://api.statbank.dk/v1/tableinfo"  #Link to the API
table_req = {"lang": "en", "table": "folk1c"}  #The request to be sent (JSON format) - note the table input!
stat_req = requests.post(statbank_api, json=table_req)  #Send the request
stat_req.encoding = 'UTF-8'  #Decode the response as 'utf-8'

table_json = json.loads(stat_req.text)  #Load the data as JSON (allowing us to interact with the data); json.loads no longer accepts an "encoding" argument
print(json.dumps(table_json, indent=4, ensure_ascii=False))  #Print the data as indented JSON

With the `table_json` containing the information about the table FOLK1C, we can extract specific information about the table.

In [None]:
table_json['description']

In [None]:
for variable in table_json['variables']:
    print(variable['id'])

In [None]:
table_json['variables'][0]  #OMRÅDE (area/municipality)

In [None]:
table_json['variables'][2]  #Alder (age)
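To get an overview of all variables at once, the loop over `table_json['variables']` can be extended. The sketch below assumes that each variable has `id`, `text` and `values` keys and that each value has an `id` and a `text`. The dictionary `sample_tableinfo` is a hand-written stand-in with illustrative content, not actual API output, and the function name `summarise_variables` is hypothetical.

```python
# Hand-written stand-in for table_json; structure assumed, content illustrative
sample_tableinfo = {
    "variables": [
        {"id": "OMRÅDE", "text": "region",
         "values": [{"id": "101", "text": "Copenhagen"}, {"id": "851", "text": "Aalborg"}]},
        {"id": "ALDER", "text": "age",
         "values": [{"id": "20-24", "text": "20-24 years"}]},
    ]
}

def summarise_variables(tableinfo):
    """Return (variable id, number of values, id of the first value) for each variable."""
    summary = []
    for variable in tableinfo["variables"]:
        values = variable.get("values", [])
        first_id = values[0]["id"] if values else None
        summary.append((variable["id"], len(values), first_id))
    return summary

for row in summarise_variables(sample_tableinfo):
    print(row)
```

The value ids are what matter for the next step, since they are what the data API expects in a request.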

## Extracting data from the StatBank

Using the information above, we can now request specific data from the data API.

In [None]:
statbank_api = "https://api.statbank.dk/v1/data"  #Address of the data API

data_query = {'table': 'folk1c', 'format': 'CSV',  #Request in JSON
              'variables': [{'code': 'OMRÅDE', 'values': ['101', '851']},
                            {'code': 'ALDER', 'values': ['20-24', '25-29']}]
             }

data_req = requests.post(statbank_api, json=data_query)  #Send the request - data_req holds the response

print(data_req.text)  #Print the raw text output
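Writing the request dictionary by hand becomes error-prone when more variables are involved. A small helper could build it from a plain mapping of variable codes to values. The function name `build_data_request` is hypothetical; the keys it produces (`table`, `format`, `variables`, `code`, `values`) are the ones used in the request above.

```python
def build_data_request(table, selections, fmt="CSV"):
    """Build the JSON body for the data API from {variable code: list of values}."""
    return {
        "table": table,
        "format": fmt,
        "variables": [
            {"code": code, "values": list(values)}
            for code, values in selections.items()
        ],
    }

request_body = build_data_request(
    "folk1c",
    {"OMRÅDE": ["101", "851"], "ALDER": ["20-24", "25-29"]},
)
print(request_body)
```

The result can then be sent in the same way as before, for example with `requests.post(statbank_api, json=request_body)`.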

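The data API returns the data as delimited text (csv), as requested with `'format': 'CSV'`. Note that the values are separated by semicolons rather than commas.

This output is directly readable by the `pandas` package (`pd.read_csv` with `sep=";"`).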

In [None]:
from io import StringIO
import pandas as pd

dstdata = StringIO(data_req.text)  #Wrap the raw text output in a file-like object
dstdf = pd.read_csv(dstdata, sep=";")  #Read the semicolon-separated text as csv
dstdf  #Show the data

In [None]:
dstdf.groupby(['OMRÅDE']).sum(numeric_only=True)  #Group by municipality and sum the numeric columns
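The same grouping can be sketched without `pandas`, using the standard library's `csv` module. The text below is a made-up sample in a semicolon-separated layout; the column names `OMRÅDE`, `ALDER`, `TID` and `INDHOLD` and all numbers are assumptions for illustration, not actual API output.

```python
import csv
from collections import defaultdict
from io import StringIO

# Made-up sample; column names and numbers are illustrative assumptions
sample_text = """OMRÅDE;ALDER;TID;INDHOLD
Copenhagen;20-24 years;2019Q1;100
Copenhagen;25-29 years;2019Q1;150
Aalborg;20-24 years;2019Q1;40
Aalborg;25-29 years;2019Q1;30
"""

totals = defaultdict(int)
reader = csv.DictReader(StringIO(sample_text), delimiter=";")  # First line is used as column names
for row in reader:
    totals[row["OMRÅDE"]] += int(row["INDHOLD"])  # Sum the counts per municipality

print(dict(totals))
```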

## EXERCISE: Using the StatBank API

Using the API code below (or in the console), can you figure out whether more men or women were admitted to the hospital in 2017 with influenza?

Table: IND01

StatBank console: http://api.statbank.dk/console#subjects

In [None]:
import json
import requests

statbank_api = "https://api.statbank.dk/v1/tableinfo"  #Link to the API

table_req = #???#  Your code here!

stat_req = requests.post(statbank_api, json=table_req)  #Send the request
stat_req.encoding = 'UTF-8'  #Decode the response as 'utf-8'

table_json = json.loads(stat_req.text)  #Load the data as JSON (allowing us to interact with the data)
print(json.dumps(table_json, indent=4, ensure_ascii=False))  #Print the data as indented JSON

In [None]:
for variable in table_json['variables']:
    print(variable['id'])

In [None]:
table_json['variables'][0]  #Change the index to see info about other variables

In [None]:
from io import StringIO
import pandas as pd

statbank_api = "https://api.statbank.dk/v1/data"  #Address of the data API

data_req = #???# - Your code here!

data_req = requests.post(statbank_api, json=data_req)  #Send the request
dstdata = StringIO(data_req.text)  #Wrap the raw text output in a file-like object
dstdf = pd.read_csv(dstdata, sep=";")  #Read the semicolon-separated text as csv
dstdf  #Show the data