# LDSCI7236 Theory and Applications of Data Analytics

# Week 4 - Data-Driven Applications 



## PART A: Online data scraping (_aka_ _web scraping_) & Data Pre-processing Hands-on Practice


_Table of contents_

* Data collection via web scraping

* Data pre-processing

### 1.a. Why do we need to anonymise data?

* Data Protection Act 2018 and GDPR.
* Safeguarding the anonymity of individuals.

### 1.b. What is $k$-anonymity?

* A collection of techniques;

* Each released record is indistinguishable from at least $k - 1$ other records, so its contents map ambiguously to at least $k$ entities;

* Techniques include:

  * Hashing
  * Generalisation
  * Masking
  
  
Also, consider encryption and other data protection mechanisms.
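To make the three techniques listed above more concrete, here is a minimal sketch using only the Python standard library. The records, the salt, the helper names (`hash_value`, `generalise_age`, `mask_postcode`, `smallest_group`) and the choice of age and postcode as the columns to check are all assumptions made for illustration, not part of the module material.

```python
import hashlib
from collections import Counter

# Hypothetical records; names, ages and postcodes are made up for illustration.
records = [
    {"name": "Alice Smith", "age": 23, "postcode": "N1 9GU"},
    {"name": "Bob Jones", "age": 27, "postcode": "N1 4QL"},
    {"name": "Carol White", "age": 34, "postcode": "E2 8AA"},
    {"name": "Dan Brown", "age": 38, "postcode": "E2 7DP"},
]

def hash_value(value, salt="demo-salt"):
    """Hashing: replace a value with a salted SHA-256 digest (shortened for display)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]

def generalise_age(age, width=10):
    """Generalisation: replace an exact age with a range such as '20-29'."""
    low = (age // width) * width
    return "{}-{}".format(low, low + width - 1)

def mask_postcode(postcode):
    """Masking: keep the first part of the postcode and hide the rest."""
    return postcode.split()[0] + " ***"

anonymised = [
    {
        "name": hash_value(r["name"]),
        "age": generalise_age(r["age"]),
        "postcode": mask_postcode(r["postcode"]),
    }
    for r in records
]

def smallest_group(rows, columns):
    """Return the size of the smallest group of rows sharing the same values in `columns`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(counts.values())

k = smallest_group(anonymised, ["age", "postcode"])
print(anonymised)
print("k =", k)
```

With this toy data, every combination of age range and masked postcode is shared by two rows, so the sketch prints `k = 2`.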

### 1.c. "Spiders from Mars"

Programs that crawl the Web to examine and retrieve content from websites are traditionally called _spiders_.

This process is also known as _Web scraping_.

#### Advantages

* Lets you build a dataset when one is not readily available for processing.
* Lets you extract different pieces of information from non-traditional formats, e.g. social media posts.


#### Disadvantages

* It may be illegal to perform web scraping, e.g. for copyright reasons.
* It may be forbidden, e.g. because it violates a website's terms of service.
* Scraped information is often (unintentionally) badly organised, i.e. the data you collect online may suffer from quality issues or structural problems.
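One practical check before scraping is a site's `robots.txt` file, which states which paths automated programs are asked to avoid. This goes slightly beyond the points above: it is a separate convention from a site's terms of service, and passing the check does not by itself make scraping permitted. The sketch below uses the standard library's `urllib.robotparser`; the rules, the bot name `my-course-bot` and the `example.com` URLs are made up, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory rather than downloaded.
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) tells us whether the rules allow that agent to read the URL.
allowed_public = parser.can_fetch("my-course-bot", "https://example.com/degrees/")
allowed_private = parser.can_fetch("my-course-bot", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

For a live site you would typically call `set_url()` with the address of its `robots.txt` and then `read()` instead of `parse()`.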

## 2. Getting Web Data

### 2.a. Basics

The `urllib` Python library allows you to read data from Web pages.

The `urlopen()` function opens a URL (a _Uniform Resource Locator_ or, simply, a Web address). The URL can be given as a string or as a `Request` object. The function returns a file-like object containing the contents of the Web page.

The `read()` function reads the entire contents of a page. Let's see both functions in action:

In [1]:
# import the urllib library into your application
import urllib.request as urllib

# html receives the file-like response object for the page.
html = urllib.urlopen("https://www.nulondon.ac.uk/degrees/postgraduate/")
# contents holds the raw HTML of the page you accessed, as a bytes object.
contents = html.read()

# Let's try to print `contents` in class
contents
# The output is a bytes object (note the b'...' prefix), so it may not be easy to read!

b'<!DOCTYPE html>\n<html lang="en-GB">\n<head>\n\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\n\t<link rel="profile" href="http://gmpg.org/xfn/11">\n\t<link rel="pingback" href="https://www.nulondon.ac.uk/xmlrpc.php">\n\n\t<!-- Adobe TypeKit -->\n\t<script src="//use.typekit.net/rdj0dts.js"></script>\n\t<script>try{Typekit.load({async:true});}catch(e){}</script>\n\t\n\t<link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png">\n\t<link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png">\n\t<link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png">\n\t<link rel="manifest" href="/site.webmanifest">\n\t<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#e11837">\n\t<meta name="msapplication-TileColor" content="#e11837">\n\t<meta name="theme-color" content="#ffffff">\n \n\t<meta name="facebook-domain-verification" content="hmekue18kolnkax0ify1ucgk47tdzf" />\n\t\n\t<!-- NUL Google
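As the output above shows, `read()` returns raw bytes. The following sketch wraps the same two calls in a small helper that also decodes the bytes to text. The function name `fetch_html`, the `User-Agent` string and the 10-second timeout are illustrative assumptions; the `data:` URL is used only so that the example runs without network access.

```python
import urllib.request

def fetch_html(url, timeout=10):
    """Download a page and return its contents as a str rather than bytes."""
    # A Request object lets us attach headers; this User-Agent string is a made-up example.
    request = urllib.request.Request(url, headers={"User-Agent": "ldsci7236-demo/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Use the charset declared in the response headers, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")

# A data: URL keeps the example offline; swap in a real Web address to fetch a live page.
page = fetch_html("data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E")
print(page)
```

Using `with` closes the connection as soon as the contents have been read.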

## 3. Parsing with BeautifulSoup

__BeautifulSoup__ is a Web scraping library that converts HTML pages into easily traversable Python objects, retaining the general structure of a web page:

```html
<html>
   <head>...</head>
   <body>...</body>
</html>
```

#### Note
Make sure you have first installed the Python library __BeautifulSoup4__.
You can install it like so:

```
$ pip install beautifulsoup4

or 

$ conda install beautifulsoup4
```

For more information read the documentation: https://pypi.org/project/beautifulsoup4/

In [2]:
# import the Python library into your application
from bs4 import BeautifulSoup

# parse the contents of the page; naming the parser explicitly avoids a warning
soup = BeautifulSoup(contents, 'html.parser')

# We can also try other attributes, e.g. soup.html, soup.head and so on
soup.body

<body class="degrees-template degrees-template-page-degree-type-postgraduate degrees-template-page-degree-type-postgraduate-php single single-degrees postid-70104" data-header-ctas="visible">
<!-- NUL Google Tag Manager (noscript) -->
<noscript><iframe class="lazyload" data-src="https://www.googletagmanager.com/ns.html?id=GTM-MZHBDT" height="0" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" style="display:none;visibility:hidden" width="0"></iframe></noscript>
<!-- End NUL Google Tag Manager (noscript) -->
<!-- NU Google Tag Manager (noscript) (two tags) -->
<noscript><iframe class="lazyload" data-src="https://www.googletagmanager.com/ns.html?id=GTM-WGQLLJ" height="0" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" style="display:none;visibility:hidden" width="0"></iframe></noscript>
<noscript><iframe class="lazyload" data-src="https://www.googletagmanager.com/ns.html?id=GTM-PKNLJQX" height="0" src="data:image/gif;base64

### 3.a. Extracting text

Usually, we want to retrieve the text contained within HTML tags, such as headers. The `get_text()` function strips all tags from an HTML structure and returns a tag-less block of text.

__TIP:__ It is better to call `get_text()` right before you print, store, or manipulate your final data. It is easier to navigate an HTML page's tag structure using a `BeautifulSoup` object than a big block of text.

In [3]:
soup.body.h1.get_text(), soup.body.h2.get_text()

('Postgraduate Study', 'Choose your postgraduate degree')
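The tip above can be tried on a small fragment without downloading anything. In this sketch the HTML string and the variable names are made up; `separator` and `strip` are optional arguments of `get_text()` that help when a tag contains several nested pieces of text.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment used only for illustration.
html_doc = """
<div class="intro">
  <h1>Postgraduate Study</h1>
  <p>Choose your <b>postgraduate</b> degree.</p>
</div>
"""

soup_demo = BeautifulSoup(html_doc, "html.parser")
div = soup_demo.find("div")

# Navigate with tags first, and only convert to text at the very end.
heading = div.h1.get_text(strip=True)
paragraph = div.p.get_text()
# Join every piece of text inside the <div>, trimming whitespace around each piece.
everything = div.get_text(separator=" | ", strip=True)

print(heading)
print(paragraph)
print(everything)
```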

### 3.b. Tag attributes

We can access the attributes of an HTML tag (e.g. the `href` address of a link) as a Python `dict` object:

In [4]:
soup.body.a.attrs

{'href': 'https://www.nulondon.ac.uk/application/'}

### 3.c. Searching for tags

The `find()` function searches an HTML page for a particular tag and returns the _first occurrence_ of that tag, or `None` if the tag does not exist.

In [5]:
# Let's try other tags too, e.g. <title>
soup.find('a')

<a href="https://www.nulondon.ac.uk/application/">Apply now</a>

The `find_all()` function returns a list of instances of a tag found in an HTML page:

In [6]:
links = soup.body.find_all('a')
for i, link in enumerate(links):
    print('{:03d} {}'.format(i, link.attrs['href']))

000 https://www.nulondon.ac.uk/application/
001 tel:+442076374550
002 https://www.nulondon.ac.uk/nch-student-hub/
003 https://www.nulondon.ac.uk/offer-holders/
004 https://nulondon.peoplehr.net/
005 /
006 #
007 https://www.nulondon.ac.uk/degrees-2023/
008 https://www.nulondon.ac.uk/degrees/postgraduate/
009 https://www.nulondon.ac.uk/study/apprenticeships/
010 https://www.nulondon.ac.uk/pre-university-programmes
011 https://www.nulondon.ac.uk/study/nch-admissions/
012 https://www.nulondon.ac.uk/study/international-students-at-nch/visa/
013 https://www.nulondon.ac.uk/entry-requirements-country/
014 https://www.nulondon.ac.uk/fees-and-funding/
015 https://www.nulondon.ac.uk/application/
016 https://www.nulondon.ac.uk/prospectus/
017 https://www.nulondon.ac.uk/study/visit-us/
018 #
019 https://www.nulondon.ac.uk/cmens-faculty/
020 https://www.nulondon.ac.uk/faculties/economics-faculty/
021 https://www.nulondon.ac.uk/faculties/english-faculty-2/
022 https://www.nulondon.ac.uk/faculties/his
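The list above mixes absolute addresses with relative ones (`/`), in-page anchors (`#`) and telephone links. Also, `link.attrs['href']` raises a `KeyError` for any `<a>` tag that has no `href`. A possible way to tidy this up is sketched below on a made-up fragment; the filtering rules are one reasonable choice rather than the only one.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.nulondon.ac.uk/degrees/postgraduate/"
# Made-up fragment that mirrors the kinds of links printed above.
html_doc = """
<a href="https://www.nulondon.ac.uk/application/">Apply now</a>
<a href="/">Home</a>
<a href="#">Menu</a>
<a name="top">Anchor without href</a>
<a href="tel:+442076374550">Call</a>
"""
soup_demo = BeautifulSoup(html_doc, "html.parser")

absolute_links = []
for link in soup_demo.find_all("a"):
    href = link.get("href")           # returns None instead of raising KeyError
    if not href or href.startswith("#"):
        continue                       # skip missing and in-page links
    full = urljoin(base_url, href)     # resolve relative paths against the page URL
    if full.startswith("http"):
        absolute_links.append(full)    # keep Web links only (drops tel:, mailto:, ...)

print(absolute_links)
```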

A Web page is likely to contain many instances of the same tag (e.g. `<div>`, `<span>`, or `<p>`). Page authors usually distinguish them with attributes, e.g. by assigning a `class` used to style and format them appropriately, and we can use those attributes to narrow a search.

The `find()` and `find_all()` functions can search through a Web page for tags with specific attributes. Attributes can be specified as a dictionary.

__Example:__ Let's find all postgraduate programmes at NUL.

In [7]:
programmes = soup.body.find_all('div', {'class': 'masters-degree-content'})
for i, programme in enumerate(programmes):
    # We can parse a child tag
    title = programme.find('p', {'class': 'title'})
    name = title.get_text()
    print('{}: {}'.format(i + 1, name))

1: MSc Artificial Intelligence & Computer Science
2: MSc Artificial Intelligence & Data Analytics
3: MSc Artificial Intelligence and Ethics
4: MSc Artificial Intelligence & Technology Leadership
5: MA Contemporary Creative Writing
6: MSc Digital Politics & Sustainable Development
7: MSc Global Investment Banking
8: MA Philosophy and Artificial Intelligence
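Besides the dictionary form used above, BeautifulSoup also accepts the `class_` keyword argument and CSS selectors via `select()`. The sketch below compares the three on a made-up fragment that reuses the class names from the example; the programme titles are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up fragment using the same class names as the example above.
html_doc = """
<div class="masters-degree-content featured"><p class="title">MSc A</p></div>
<div class="masters-degree-content"><p class="title">MA B</p></div>
<div class="other"><p class="title">Not a degree</p></div>
"""
soup_demo = BeautifulSoup(html_doc, "html.parser")

# Three ways of asking for the same <div> tags.
by_dict = soup_demo.find_all("div", {"class": "masters-degree-content"})
by_keyword = soup_demo.find_all("div", class_="masters-degree-content")
# A CSS selector can describe the parent and the child tag in one expression.
by_css = soup_demo.select("div.masters-degree-content p.title")

titles = [p.get_text() for p in by_css]
print(len(by_dict), len(by_keyword))
print(titles)
```

Note that a tag carrying several classes (the first `<div>` here) still matches a search for any one of them.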


__Example:__ Let's find all faculty members at NUL.

In [36]:
# import the Python libraries you need, i.e. pandas
import pandas as pd


# Create an empty dataframe
df = pd.DataFrame(columns=['name', 'qualifications'])

# Read Web page
html = urllib.urlopen("https://www.nulondon.ac.uk/cmens-faculty/")
contents = html.read()
print(contents)
soup = BeautifulSoup(contents, 'html.parser')


# PLEASE NOTE: THE FOLLOWING CODE IS AN EXAMPLE OF HOW TO USE DIFFERENT HTML TAGS
# TO SEARCH FOR INFORMATION.
# THIS SPECIFIC EXAMPLE MAY NO LONGER WORK, SINCE THE PAGE AT THE ABOVE URL HAS BEEN UPDATED.
# CAN YOU TRY OTHER TAGS INSTEAD TO SEARCH FOR AND FIND INFORMATION?

# Extract faculty members
members = soup.find_all('div', {'class': 'faculty-member'})
#print(members)

for i, member in enumerate(members):
    # Let's parse child tags
    name = member.h4.a.get_text()
    qualifications = member.find('p', {'class': 'qualifications'}).get_text()
    # Slice and copy to dataframe
    df.loc[i] = [name, qualifications]
    

df

b'<!DOCTYPE html>\n<html lang="en-GB">\n<head>\n\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\n\t<link rel="profile" href="http://gmpg.org/xfn/11">\n\t<link rel="pingback" href="https://www.nulondon.ac.uk/xmlrpc.php">\n\n\t<!-- Adobe TypeKit -->\n\t<script src="//use.typekit.net/rdj0dts.js"></script>\n\t<script>try{Typekit.load({async:true});}catch(e){}</script>\n\t\n\t<link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png">\n\t<link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png">\n\t<link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png">\n\t<link rel="manifest" href="/site.webmanifest">\n\t<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#e11837">\n\t<meta name="msapplication-TileColor" content="#e11837">\n\t<meta name="theme-color" content="#ffffff">\n \n\t<meta name="facebook-domain-verification" content="hmekue18kolnkax0ify1ucgk47tdzf" />\n\t\n\t<!-- NUL Google

|   | name | qualifications |
|---|------|----------------|
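Because the live page has changed, the data frame above comes out empty. The same extraction pattern can still be practised offline. The sketch below runs on made-up markup that follows the class names assumed in the example (`faculty-member`, `qualifications`); it collects rows in a list and guards against tags that are absent, since `find()` returns `None` in that case.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup following the class names assumed in the example above.
html_doc = """
<div class="faculty-member"><h4><a href="#">Dr A. Example</a></h4>
  <p class="qualifications">PhD, MSc</p></div>
<div class="faculty-member"><h4><a href="#">Prof B. Sample</a></h4></div>
"""
soup_demo = BeautifulSoup(html_doc, "html.parser")

rows = []
for member in soup_demo.find_all("div", {"class": "faculty-member"}):
    name_tag = member.find("a")
    quals_tag = member.find("p", {"class": "qualifications"})
    rows.append({
        # Guard before calling get_text(), because find() may have returned None.
        "name": name_tag.get_text(strip=True) if name_tag else None,
        "qualifications": quals_tag.get_text(strip=True) if quals_tag else None,
    })

# Build the data frame once, from the collected rows.
faculty = pd.DataFrame(rows, columns=["name", "qualifications"])
print(faculty)
```

Collecting dictionaries and building the data frame at the end is a common alternative to assigning `df.loc[i]` row by row.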


Let's write the data frame to a file.

In [37]:
df.to_csv("nu-faculty.csv")
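By default, `to_csv()` also writes the row index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back. The sketch below shows the difference that `index=False` makes; it writes to strings rather than files, and the single-row data frame is made up.

```python
import io
import pandas as pd

faculty = pd.DataFrame({"name": ["Dr A. Example"], "qualifications": ["PhD"]})

# Default: the row index is written as an extra, unnamed first column.
with_index = faculty.to_csv()
# index=False writes only the data columns.
without_index = faculty.to_csv(index=False)

# Read both versions back and compare the column names.
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with)
print(cols_without)
```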

#### NOTE: More online resources on Web Scraping

* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), as seen today

* [Selenium](https://selenium-python.readthedocs.io/index.html) - incl. interacting with dynamic pages

* [Scrapy](https://scrapy.org/) - optimized for scraping tasks

## 4. Data pre-processing

As you may have realised from our examples thus far, __data does not always come in forms ready for analysis__. 

Data could be:

* In the wrong format (e.g. column types)
* Incorrect 
* Missing
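As an illustration of the first point, scraped or loaded values often arrive as text even when they represent numbers or dates. The sketch below converts two such columns; the data and column names are made up, and `errors="coerce"` is one possible policy (unparseable entries become missing values) rather than the only sensible one.

```python
import pandas as pd

# Made-up raw values: numbers and dates stored as text, with one unparseable price.
raw = pd.DataFrame({
    "price": ["10", "12.5", "n/a"],
    "date": ["2023-01-05", "2023-02-10", "2023-03-15"],
})

raw["price"] = pd.to_numeric(raw["price"], errors="coerce")  # 'n/a' becomes NaN
raw["date"] = pd.to_datetime(raw["date"])                    # text becomes datetimes

print(raw.dtypes)
print(raw)
```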

As a data scientist, you can spend a lot of your time (some might say, "as much as 75%") preparing data. 

Data preparation has been given many _modern_ names such as _data munging_ or _data wrangling_.

### 4.a. Function `explode()` for _data normalisation_

Consider the following example data frame, where column values are not __atomic__, i.e. a single cell holds several comma-separated values.

__Note:__ Python dictionary data structure: https://www.w3schools.com/python/python_dictionaries.asp

In [38]:
# import the Python library pandas into your application
import pandas as pd

# declare a dictionary structure
data = {
    "id": [ 1,  2, 3],
    "director": ['Mary, John', None, 'Mark, Mary'],
    "cast": ['A, B', 'A, B, C, D', 'A, B, C']
}

df = pd.DataFrame(data)
df

|   | id | director | cast |
|---|----|----------|------|
| 0 | 1 | Mary, John | A, B |
| 1 | 2 | None | A, B, C, D |
| 2 | 3 | Mark, Mary | A, B, C |


In [39]:
def _tolist(s):
    """
    Splits a comma-separated string and returns
    it as a list of strings
    """
    if not s:
        return []
    lst = s.split(',')
    lst = [x.strip() for x in lst]
    return lst

# Convert string to list of strings
df.cast = df.cast.map(_tolist)

# Transform each element of a list to a row
df = df.explode('cast', ignore_index=True)

df.groupby('cast').id.count()

cast
A    3
B    3
C    2
D    1
Name: id, dtype: int64
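The same idea applies to the `director` column, which also contains a missing value. The sketch below is a variation that uses the pandas string methods `str.split()` and `str.strip()` instead of the custom `_tolist()` helper; the variable name `films` is an assumption, and the data repeats the example above.

```python
import pandas as pd

films = pd.DataFrame({
    "id": [1, 2, 3],
    "director": ["Mary, John", None, "Mark, Mary"],
})

# Split each string into a list; the missing value stays missing.
films["director"] = films["director"].str.split(",")
# One row per list element; the row with no director is kept as a missing value.
films = films.explode("director", ignore_index=True)
# Remove the spaces that followed the commas.
films["director"] = films["director"].str.strip()

counts = films.groupby("director").id.count()
print(films)
print(counts)
```

Here `groupby()` leaves the missing director out of the counts, so only John, Mark and Mary are reported.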

### 4.b. Missing values

Possible actions for missing values include:

* Filtering out
* Filling in
* Ignoring

Let's start with __filtering out__ missing values.

In [40]:
df = pd.DataFrame({'A': [1, 2, 3, None, 5], 'B': ['I', None, 'III', 'IV', None]})
# note that the keyword None is used here to mark missing values (NaN)
df

|   | A | B |
|---|---|---|
| 0 | 1.0 | I |
| 1 | 2.0 | None |
| 2 | 3.0 | III |
| 3 | NaN | IV |
| 4 | 5.0 | None |


The `dropna()` function drops rows or columns with missing data. By default, it drops any row containing at least one missing value.

In [41]:
df.dropna()

|   | A | B |
|---|---|---|
| 0 | 1.0 | I |
| 2 | 3.0 | III |
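`dropna()` takes a few arguments that control what is dropped. The sketch below tries three of them (`subset`, `how` and `axis`) on the same data as above; the name `df_demo` is used so that `df` is left untouched.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3, None, 5], 'B': ['I', None, 'III', 'IV', None]})

rows_a = df_demo.dropna(subset=['A'])   # only consider missing values in column A
rows_all = df_demo.dropna(how='all')    # drop a row only if every value is missing
cols = df_demo.dropna(axis=1)           # drop columns that contain any missing value

print(len(rows_a), len(rows_all), cols.shape)
```

With this data, four rows survive the first call, all five survive the second, and the third removes both columns because each contains a missing value.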


Next, we can try __filling in__ missing values:

In [42]:
df.fillna({'A': -1, 'B': 'Unknown'})
# each missing value (NaN/None) is replaced with the value given for its column

|   | A | B |
|---|---|---|
| 0 | 1.0 | I |
| 1 | 2.0 | Unknown |
| 2 | 3.0 | III |
| 3 | -1.0 | IV |
| 4 | 5.0 | Unknown |
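Instead of a fixed placeholder, missing values can also be filled with something derived from the data. The sketch below fills column `A` with its mean and column `B` with the previous row's value; whether either choice is appropriate depends on the dataset (forward filling, for instance, assumes that row order is meaningful), so treat this as an illustration only.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3, None, 5], 'B': ['I', None, 'III', 'IV', None]})

filled = df_demo.copy()
filled['A'] = filled['A'].fillna(filled['A'].mean())  # mean of 1, 2, 3 and 5 is 2.75
filled['B'] = filled['B'].ffill()                     # carry the previous value forward

print(filled)
```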


__Note:__ Similarly, we could instead deal with missing values while loading (reading) the file.

```python
pd.read_csv(filename, na_values='0',
                      skip_blank_lines=True)
```
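To see what these two arguments do without needing a file on disk, the sketch below reads a made-up CSV string through `io.StringIO`. Note that treating `0` as missing is only appropriate when zero is not a legitimate value in your data.

```python
import io
import pandas as pd

# Made-up CSV text with one blank line; StringIO stands in for a file on disk.
csv_text = "city,temperature\nLondon,0\n\nParis,18\n"

default = pd.read_csv(io.StringIO(csv_text))
custom = pd.read_csv(io.StringIO(csv_text), na_values='0', skip_blank_lines=True)

print(default)
print(custom)
```

In both cases the blank line is skipped; with `na_values='0'` the London temperature is read as a missing value instead of `0`.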

### 4.c. Dealing with duplicates

Sometimes data frames contain duplicate rows, which you will usually want to discard, unless there is a _hidden_ time (or other ordering) dimension to the data that makes the repeated rows meaningful. 

In [43]:
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2]})
df

|   | A | B |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 1 | 2 |


Function `duplicated()` returns a `Series` object with `True` and `False` values, indicating whether a row has been previously observed. 

__NOTE:__ It also works on specific columns.

In [44]:
df.duplicated()

0    False
1     True
dtype: bool

Function `drop_duplicates()` returns a data frame with duplicate rows removed.

__NOTE:__ We can specify a list of column names as an argument to look for and drop duplicates e.g.

```python
df.drop_duplicates(['A', 'B'])
```

In [45]:
df.drop_duplicates()

|   | A | B |
|---|---|---|
| 0 | 1 | 2 |


__NOTE:__ By default, `drop_duplicates()` keeps the first occurrence of each duplicated row. To keep the last one instead, pass `keep='last'` as an argument.
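The column and `keep` options mentioned in the notes above can be combined. The sketch below uses a slightly larger made-up data frame, so that the rows differ in one column but not the other.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 3, 3]})

dup_a = df_demo.duplicated(subset=['A'])                   # compare column A only
last_a = df_demo.drop_duplicates(subset=['A'], keep='last')  # keep the last row per value of A
dup_b_all = df_demo.duplicated(subset=['B'], keep=False)   # flag every row whose B is repeated

print(dup_a.tolist())
print(last_a)
print(dup_b_all.tolist())
```

Passing `keep=False` marks all members of a duplicated group, which is handy for inspecting duplicates before deciding which ones to drop.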

### 4.d. Replacing values

Beyond filtering out or filling in missing data, we can replace certain values with new values:

In [46]:
df.replace({1: 'I', 2: 'II'})  #  This is equivalent to df.replace(1, 'I').replace(2, 'II')
                               #   or df.replace([1, 2], ['I', 'II'])

|   | A | B |
|---|---|---|
| 0 | I | II |
| 1 | I | II |
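Two details are worth trying out: `replace()` returns a new data frame rather than changing the original, and a nested dictionary restricts the replacement to one column. The sketch below uses a made-up data frame in which the value `1` appears in both columns.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 1], 'B': [2, 1]})

# Replace the values wherever they occur; df_demo itself is left unchanged.
everywhere = df_demo.replace({1: 'I', 2: 'II'})
# A nested dictionary limits the replacement to the named column.
only_a = df_demo.replace({'A': {1: 'I'}})

print(everywhere)
print(only_a)
```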


### 4.e. Renaming

It is possible to rename columns and rows using the function `rename()`. Labels in the mapping that do not exist in the data frame are ignored.

In [47]:
df.rename(index={0: '0th', 1: '1st', 2: '2nd', 3: '3rd', 4: '4th'}, 
          columns={'A': 'literal', 'B': 'roman'})

|   | literal | roman |
|---|---------|-------|
| 0th | 1 | 2 |
| 1st | 1 | 2 |
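`rename()` also accepts functions instead of dictionaries, which is convenient when every label follows the same rule. In the sketch below the lower-casing rule and the `row_` prefix are arbitrary choices made for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 1], 'B': [2, 2]})

# Apply a function to every column label and every index label.
renamed = df_demo.rename(columns=str.lower, index=lambda i: 'row_{}'.format(i))

print(renamed)
print(df_demo.columns.tolist())  # the original labels are unchanged
```

Like the other functions in this section, `rename()` returns a new data frame, so assign the result to a variable if you want to keep it.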
