# Week 4

## Python setup

In [41]:
import json
import math
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import os
import re
import urllib.parse
import urllib.request

## Exercises Prelude: Regular expressions

### 0.1 Tutorial in RegEx
_Honestly, this [YouTube](https://www.youtube.com/watch?v=rhzKDrUiJVk) guide is waaaayyyy better than the Google guide..._

Regular expressions are a powerful language for matching text patterns.

The Python "re" module provides regular expression support.

In [2]:
# Print the matched text if search() succeeded; search() returns None on failure
def check_regex(match):
    if match:
        print('found:', match.group())
    else:
        print('did not find')

In RegEx, we use a pattern, 'pat', to search through a text, 'str'. The pattern is a regular expression written in a defined syntax. In Python, patterns are usually written as raw strings (an 'r' prefix before the opening quote), so that backslashes reach the 're' module unchanged. The prefix is a feature of Python string literals, not of 're' itself.

_Note that the RegEx syntax in Python is a bit different from some other RegEx flavours!_
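
To see why the raw-string prefix matters, here is a minimal sketch comparing a normal and a raw string literal. The sample text and variable names are made up for illustration.

```python
import re

# In a normal string, '\b' is the backspace character (a single character).
# In a raw string it stays as backslash + 'b', which re reads as a word boundary.
normal = '\bcat\b'
raw = r'\bcat\b'

print(len(normal), len(raw))  # 5 7

sample = 'a cat sat'  # made-up sample text
match_normal = re.search(normal, sample)  # looks for literal backspace characters
match_raw = re.search(raw, sample)        # looks for the whole word 'cat'

print(match_normal)       # None
print(match_raw.group())  # cat
```
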

In [3]:
str = 'purple alice-b@google.com monkey dishwasher'
pat = r'([\w.-]+)@([\w.-]+)'

# re.search() returns a match object (or None if nothing matches); the match object holds the groups
match = re.search(pat, str)

check_regex(match)

found: alice-b@google.com


Group Extraction - Each pair of parentheses in the pattern defines a group, so we can split our result into groups and extract them separately.

In [4]:
print(match.group(0)) # (the whole match)
print(match.group(1)) # (the username, group 1)
print(match.group(2)) # (the host, group 2)

alice-b@google.com
alice-b
google.com
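
Groups can also be given names, which is easier to read than numeric indices when a pattern has several groups. A small sketch, assuming the same email example as above; the group names `user` and `host` are chosen here for illustration.

```python
import re

sample = 'purple alice-b@google.com monkey dishwasher'

# (?P<name>...) defines a named group
pat = r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)'
match = re.search(pat, sample)

user = match.group('user')
host = match.group('host')
parts = match.groupdict()  # all named groups as a dictionary

print(user)   # alice-b
print(host)   # google.com
print(parts)  # {'user': 'alice-b', 'host': 'google.com'}
```
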


A very useful function in re is findall(), which returns a list of all non-overlapping matches in the text.

In [5]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

pat = r'([\w\.-]+@[\w\.-]+)'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(pat, str) ## ['alice@google.com', 'bob@abc.com']

for email in emails:
    # do something with each found email string
    print(email)

alice@google.com
bob@abc.com


The findall function can also be used on the contents of a file!

In [6]:
# Open the file and read its text; the with-block closes the file afterwards
with open('../files/hamlet_act_1_scene_1.txt', encoding='utf-8') as f:
    file_text = f.read()

# Get only stage directions, marked by '(...)'
pat = r'\(([^)]+)\)'

# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(pat, file_text)
strings

['Enter Barnardo and Francisco, two sentinels.',
 'Enter Horatio and Marcellus.']

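findall() combined with groups can be used to subdivide each search result: when the pattern contains several groups, findall() returns a list of tuples with one entry per group.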

In [7]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pat = r'([\w\.-]+)@([\w\.-]+)'

tuples = re.findall(pat, str)

print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]

for username, host in tuples:
    print(f"Username: \"{username}\", Host: \"{host}\"")

[('alice', 'google.com'), ('bob', 'abc.com')]
Username: "alice", Host: "google.com"
Username: "bob", Host: "abc.com"
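
What findall() returns depends on how many groups the pattern has. The sketch below runs three variants of the email pattern on a made-up sample string to show the difference.

```python
import re

sample = 'alice@google.com, bob@abc.com'  # made-up sample text

# No groups: a list of whole matches
no_groups = re.findall(r'[\w.-]+@[\w.-]+', sample)

# One group: a list with only the group's text
one_group = re.findall(r'([\w.-]+)@[\w.-]+', sample)

# Two groups: a list of tuples, one entry per group
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', sample)

print(no_groups)   # ['alice@google.com', 'bob@abc.com']
print(one_group)   # ['alice', 'bob']
print(two_groups)  # [('alice', 'google.com'), ('bob', 'abc.com')]
```
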


### 0.2 What are regular expressions

A regular expression is a string that follows a predefined syntax and describes a text pattern, which makes it possible to search for that pattern in texts.

### 0.3 RegEx on 4-digit numbers from URL text

Find all 4-digit numbers in [this text](https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/regex_exercise.txt).

In [8]:
# Define the url
url = "https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/regex_exercise.txt"

# Get HTTPResponse from url
data = urllib.request.urlopen(url)

# Extract byte string from response data
byte_string = data.read()

# Decode byte string to regular string
text = byte_string.decode("utf-8")

In [9]:
# Get only 4-digit numbers
pat = r'(?<!\d)\d{4}(?!\d)'

# Feed the downloaded text into findall(); it returns a list of all the found strings
numbers = re.findall(pat, text)

# Print numbers one at a time on new lines
print(*numbers, sep="\n")

1234
9999
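
The pattern above uses a negative lookbehind `(?<!\d)` and a negative lookahead `(?!\d)` so that a match is never part of a longer run of digits. A minimal sketch on a made-up sample string, compared with a plain `\d{4}` and a word-boundary version:

```python
import re

sample = 'pin 1234, year 2020a, id 123456, x99999'  # made-up sample text

# Plain \d{4} also matches the first four digits of longer numbers
plain = re.findall(r'\d{4}', sample)

# Lookarounds require that no digit comes directly before or after
bounded = re.findall(r'(?<!\d)\d{4}(?!\d)', sample)

# \b is stricter: it also rejects letters next to the number ('2020a')
word_bounded = re.findall(r'\b\d{4}\b', sample)

print(plain)         # ['1234', '2020', '1234', '9999']
print(bounded)       # ['1234', '2020']
print(word_bounded)  # ['1234']
```

Which variant is the right one depends on whether something like `2020a` should count as a 4-digit number.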


### 0.4 RegEx for words starting with 'super' from URL text

In [10]:
# Get only words starting with 'super'
# ('+' inside [...] would be a literal plus sign, so plain \w* is used)
pat = r'super\w*'

# Feed the downloaded text into findall(); it returns a list of all the found strings
words = re.findall(pat, text)

# Print words one at a time on new lines
print(*words, sep="\n")

superpolaroid
supertaxidermy
superbeer
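
Inside square brackets, `+` is a literal plus sign and not a quantifier, so `[\w+]*` and `\w*` are not the same thing. The sketch below compares them on a made-up sample string and also shows a `\b` variant that only accepts 'super' at the start of a word.

```python
import re

sample = 'superman super+duper supermarket insuperable'  # made-up sample text

# [\w+]* matches word characters AND literal '+' signs
with_plus = re.findall(r'super[\w+]*', sample)

# \w* matches word characters only
word_chars = re.findall(r'super\w*', sample)

# \b additionally requires 'super' to start the word
word_start = re.findall(r'\bsuper\w*', sample)

print(with_plus)   # ['superman', 'super+duper', 'supermarket', 'superable']
print(word_chars)  # ['superman', 'super', 'supermarket', 'superable']
print(word_start)  # ['superman', 'super', 'supermarket']
```
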


### 0.5 RegEx find Wiki links in URL text

In [179]:
def find_all_wiki_links(text):
    # Get any strings surrounded by '[[...]]', skipping Category:, File: and Image: links
    pat = r'\[\[(?!Category:)(?!File:)(?!Image:)(.*?)\]\]'
    matches = re.findall(pat, text)
    
    # List to hold the final substrings
    results = []
    for match in matches:
        # Remove content in parentheses and split by '|'
        # (note: this also strips suffixes such as ' (band)' and adds link labels as extra names)
        cleaned_substrings = re.sub(r'\s*\(.*?\)\s*', '', match).split('|')
        results.extend(cleaned_substrings)
    
    # Create urls from wiki links one at a time while replacing spaces with '_' 
    urls = []
    for res in results:
        res = res.replace(" ", "_")
        urls.append("https://en.wikipedia.org/wiki/" + res)

    # Remove duplicate results and urls
    results = list(dict.fromkeys(results))
    urls = list(dict.fromkeys(urls))
    
    return urls, results 

# Find all wiki_link urls and print them out
urls, results = find_all_wiki_links(text)

print(f"Results ({len(results)} found):")
print(f"Urls ({len(urls)} found):")
print(*urls, sep="\n")

Results (2110 found):
Urls (2110 found):
https://en.wikipedia.org/wiki/country_music
https://en.wikipedia.org/wiki/3_of_Hearts
https://en.wikipedia.org/wiki/4_Runner
https://en.wikipedia.org/wiki/8_Ball_Aitken
https://en.wikipedia.org/wiki/Gene_Autry
https://en.wikipedia.org/wiki/Eddy_Arnold
https://en.wikipedia.org/wiki/Roy_Acuff
https://en.wikipedia.org/wiki/Rodney_Atkins
https://en.wikipedia.org/wiki/The_Abrams_Brothers
https://en.wikipedia.org/wiki/Ace_in_the_Hole_Band
https://en.wikipedia.org/wiki/Kay_Adams
https://en.wikipedia.org/wiki/Ryan_Adams
https://en.wikipedia.org/wiki/Doug_Adkins
https://en.wikipedia.org/wiki/Trace_Adkins
https://en.wikipedia.org/wiki/David_"Stringbean"_Akeman
https://en.wikipedia.org/wiki/Rhett_Akins
https://en.wikipedia.org/wiki/Alabama
https://en.wikipedia.org/wiki/Lauren_Alaina
https://en.wikipedia.org/wiki/Jason_Aldean
https://en.wikipedia.org/wiki/Alee
https://en.wikipedia.org/wiki/Daniele_Alexander
https://en.wikipedia.org/wiki/Jessi_Alexander
http
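
In wiki markup, a link has the form `[[Target]]` or `[[Target|label]]`, and the part in parentheses is often needed to tell pages apart (for example a band versus a place with the same name). As an alternative to the function above, here is a minimal sketch that keeps only the link target and leaves the parentheses in place. The function name `extract_link_targets` and the sample string are made up for illustration; this is not the approach used for the output above.

```python
import re

def extract_link_targets(wikitext):
    """Return unique link targets from [[Target]] and [[Target|label]] links."""
    targets = []
    for inner in re.findall(r'\[\[(.*?)\]\]', wikitext):
        # Skip links that do not point to articles
        if inner.startswith(('Category:', 'File:', 'Image:')):
            continue
        # Keep the target only: drop the label after '|' and any '#section' part
        target = inner.split('|')[0].split('#')[0].strip()
        if target:
            targets.append(target)
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(targets))

sample = '[[Alabama (band)|Alabama]], [[Gene Autry]], [[Category:Lists]], [[Gene Autry]]'
targets = extract_link_targets(sample)
print(targets)  # ['Alabama (band)', 'Gene Autry']
```
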

## Exercises Part 1: Download the Wikipedia pages of characters

### 1.1 Extract all links into a list of country performers

We use the Wikipedia API to fetch the content of the Wikipedia page: https://en.wikipedia.org/wiki/List_of_country_music_performers

In [12]:
baseurl = "https://en.wikipedia.org/w/api.php?"
action = "action=query"
title = "titles=List_of_country_music_performers"
content = "prop=revisions&rvprop=content"
dataformat = "format=json"

query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)
print(query)

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=List_of_country_music_performers&format=json
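
The same query string can also be built from a dictionary with `urllib.parse.urlencode`, which takes care of the `&` separators and of escaping special characters. A small sketch using the same parameters as above (the `params` dictionary is just one way to organise them):

```python
from urllib.parse import urlencode

baseurl = "https://en.wikipedia.org/w/api.php?"

# Same parameters as in the cell above, written as key/value pairs
params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "titles": "List_of_country_music_performers",
    "format": "json",
}

query = baseurl + urlencode(params)
print(query)

# Special characters in a title are escaped automatically
escaped = urlencode({"titles": 'David "Stringbean" Akeman'})
print(escaped)  # titles=David+%22Stringbean%22+Akeman
```

The first print gives the same URL as the query printed above.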


In [13]:
wikiresponse = urllib.request.urlopen(query)
wikidata = wikiresponse.read()
wikitext = wikidata.decode('utf-8')
json_response = json.loads(wikitext)

In [14]:
# Get the page ID (the only key in the 'pages' dictionary)
page = list(json_response["query"]["pages"].keys())[0]
page

'328877'

In [80]:
# Get text content from page
text = json_response["query"]["pages"][page]["revisions"][0]["*"]
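
The nesting of the JSON response can be hard to read from the indexing alone. The sketch below walks through the same keys on two made-up dictionaries that mimic the structure used in the cells above (a found page and a missing page). The helper name `extract_wikitext` and all values are assumptions for illustration only.

```python
def extract_wikitext(response):
    """Return (page_id, wikitext); wikitext is None when the page is missing."""
    pages = response["query"]["pages"]

    # 'pages' is a dictionary keyed by page ID; we asked for a single title
    page_id = next(iter(pages))

    # A page ID of '-1' means the title was not found
    if page_id == "-1":
        return page_id, None

    # The page text sits under the '*' key of the first revision
    return page_id, pages[page_id]["revisions"][0]["*"]

# Made-up responses with the same nesting as the real one
found = {"query": {"pages": {"12345": {"title": "Example",
                                       "revisions": [{"*": "Some [[wiki]] text"}]}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_wikitext(found))    # ('12345', 'Some [[wiki]] text')
print(extract_wikitext(missing))  # ('-1', None)
```
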

In [176]:
# Find all wiki_link urls and print them out
urls, results = find_all_wiki_links(text)
print(f"Urls ({len(urls)} found):")
print(*urls, sep="\n")

Urls (2110 found):
https://en.wikipedia.org/wiki/country_music
https://en.wikipedia.org/wiki/3_of_Hearts
https://en.wikipedia.org/wiki/4_Runner
https://en.wikipedia.org/wiki/8_Ball_Aitken
https://en.wikipedia.org/wiki/Gene_Autry
https://en.wikipedia.org/wiki/Eddy_Arnold
https://en.wikipedia.org/wiki/Roy_Acuff
https://en.wikipedia.org/wiki/Rodney_Atkins
https://en.wikipedia.org/wiki/The_Abrams_Brothers
https://en.wikipedia.org/wiki/Ace_in_the_Hole_Band
https://en.wikipedia.org/wiki/Kay_Adams
https://en.wikipedia.org/wiki/Ryan_Adams
https://en.wikipedia.org/wiki/Doug_Adkins
https://en.wikipedia.org/wiki/Trace_Adkins
https://en.wikipedia.org/wiki/David_"Stringbean"_Akeman
https://en.wikipedia.org/wiki/Rhett_Akins
https://en.wikipedia.org/wiki/Alabama
https://en.wikipedia.org/wiki/Lauren_Alaina
https://en.wikipedia.org/wiki/Jason_Aldean
https://en.wikipedia.org/wiki/Alee
https://en.wikipedia.org/wiki/Daniele_Alexander
https://en.wikipedia.org/wiki/Jessi_Alexander
https://en.wikipedia.org/w

### 1.2 Download the content of each wiki link and save them to a text file

In [69]:
# Folder to store the text files
folder_path = "wiki_text"

In [217]:
def get_result_content(result):
    # Create query using wikipedia's API
    baseurl = "https://en.wikipedia.org/w/api.php?"
    action = "action=query"
    
    # Add '_' for proper url format
    result = result.replace(" ", "_")
    
    # Use urllib.parse.quote to encode special characters
    title = f"titles={urllib.parse.quote(result)}"
    
    content = "prop=revisions&rvprop=content"
    dataformat = "format=json"
    query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)
    
    # Get http response from query and decode it
    wikiresponse = urllib.request.urlopen(query)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')
    json_response = json.loads(wikitext)
    
    # Get the page ID (the only key in the 'pages' dictionary)
    page = list(json_response["query"]["pages"].keys())[0]
    
    # The name gave no hits nor redirects
    if page == '-1':
        return -1
    
    url_content = json_response["query"]["pages"][page]["revisions"][0]["*"]
    
    # Check if the page is a redirect, in which case return the redirect target's content
    # (the keyword is case-insensitive and usually written '#REDIRECT')
    redirect = re.findall(r'#redirect\s*\[\[(.*?)\]\]', url_content, flags=re.IGNORECASE)
    if redirect:
        return get_result_content(redirect[0])
    
    return url_content
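
Redirect pages start with a line like `#REDIRECT [[Target]]`, and the keyword can be written in any letter case. Here is a minimal sketch of a redirect check that can be tested on sample strings without calling the API. The names `REDIRECT_PATTERN` and `redirect_target` and the sample strings are made up for illustration.

```python
import re

# Match '#redirect' in any letter case at the start of the text, then capture
# the target up to the first ']', '|' or '#'
REDIRECT_PATTERN = re.compile(r'^\s*#redirect\s*\[\[([^\]|#]+)', re.IGNORECASE)

def redirect_target(wikitext):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_PATTERN.match(wikitext)
    return match.group(1).strip() if match else None

print(redirect_target('#REDIRECT [[Hank Williams]]'))    # Hank Williams
print(redirect_target('#redirect [[Alabama (band)]]'))   # Alabama (band)
print(redirect_target('#Redirect[[Foo#Bar]]'))           # Foo
print(redirect_target('Just a normal article'))          # None
```

The query API also has a `redirects` parameter that is meant to resolve redirects on the server side; it is worth checking the API documentation before relying on it.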

In [218]:
def save_wikitext_to_txt(name):
    # Artist name becomes the name of the .txt file; every non-alphanumeric
    # character is replaced so the file name is valid
    file_name = re.sub(r'[^a-zA-Z0-9]', '_', name) + ".txt"

    # File path to where the file will be stored (create the folder if needed)
    os.makedirs(folder_path, exist_ok=True)
    file_path = os.path.join(folder_path, file_name)

    content = get_result_content(name)

    # The name gave no hits nor redirects
    if content == -1:
        return -1

    # Create/open file and write content from url; the with-block closes the file
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(content)
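
The file-name cleaning can be tried out without downloading anything. The sketch below uses the same substitution on a made-up example and writes into a temporary folder; the helper names `safe_file_name` and `write_text` are assumptions for illustration.

```python
import os
import re
import tempfile

def safe_file_name(name):
    # Replace every non-alphanumeric character with '_'
    return re.sub(r'[^a-zA-Z0-9]', '_', name) + '.txt'

def write_text(folder, name, content):
    # Create the folder if it does not exist yet, then write the file
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, safe_file_name(name))
    with open(path, 'w', encoding='utf-8') as file:
        file.write(content)
    return path

# Write into a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = write_text(os.path.join(tmp, 'wiki_text'), 'David "Stringbean" Akeman', 'demo')
    file_name = os.path.basename(path)
    with open(path, encoding='utf-8') as file:
        saved = file.read()

print(file_name)  # David__Stringbean__Akeman.txt
print(saved)      # demo
```

One thing to keep in mind: different names can end up with the same file name (for example `AC/DC` and `AC DC` both become `AC_DC.txt`), in which case the later file overwrites the earlier one.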

In [223]:
# # ONLY RUN IF YOU NEED TO START FROM SCRATCH!!!!
# # Save txt file for every artist in the list
# for name in results:
#     save_wikitext_to_txt(name)

## Exercises Part 2: Building the networks

# Part 2: Stats of the Country Music Network

This second part requires you to have built the network of Country Musicians as described in the exercises for Week 4. You should complete the following exercise from **Part 2**.

* *Simple network statistics and analysis*

And the following exercise from **Part 3**

* *Let's build a simple visualization of the network*

And that's it! You're all set.