# Week 4

## Python setup

In [87]:
import math
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import re
import urllib.request as req

## Exercises Part 1: Regular expressions

### 1.1 Tutorial in RegEx
_Honestly, this [YouTube](https://www.youtube.com/watch?v=rhzKDrUiJVk) guide is way better than the Google guide..._

Regular expressions are a powerful language for matching text patterns.

The Python "re" module provides regular expression support.

In [43]:
# re.search() returns None when nothing matches, so test the result before using it
def check_regex(match):
    if match:
        print('found:', match.group())
    else:
        print('did not find')

In RegEx, we use a pattern, 'pat', to search through a text, 'str'. The pattern is a regular expression written in a defined syntax. In Python, patterns are usually written as raw strings, prefixed with 'r', so that backslashes reach the 're' module unchanged.
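
To see why the 'r' prefix matters, here is a minimal sketch. The sample text and variable names are made up for illustration and are not part of the exercise.

```python
import re

# Without the r prefix, Python itself turns '\b' into a backspace character
plain = '\bcat\b'
# With the r prefix, the backslashes are kept, so re sees the word boundary \b
raw = r'\bcat\b'

print(len(plain))  # 5 characters: backspace, c, a, t, backspace
print(len(raw))    # 7 characters: \, b, c, a, t, \, b

sample = 'a cat sat'  # hypothetical sample text
plain_match = re.search(plain, sample)  # looks for literal backspaces, finds nothing
raw_match = re.search(raw, sample)      # finds the whole word 'cat'

print(plain_match)
print(raw_match.group())
```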

_Note that Python's regex syntax differs slightly from other RegEx flavors!_

In [46]:
str = 'purple alice-b@google.com monkey dishwasher'
pat = r'([\w.-]+)@([\w.-]+)'

# re.search() returns a Match object (or None if nothing matches), which contains the groups
match = re.search(pat, str)

check_regex(match)

found: alice-b@google.com


Group Extraction - We can split our result into groups and then extract them separately.

In [50]:
print(match.group(0)) # (the whole match)
print(match.group(1)) # (the username, group 1)
print(match.group(2)) # (the host, group 2)

alice-b@google.com
alice-b
google.com
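
Groups can also be given names with the `(?P<name>...)` syntax, which can be easier to read than numeric indices. A small sketch on the same example text; the group names `user` and `host` are illustrative choices, not something the exercise prescribes.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'

# Same pattern as before, but each group carries a name
pat = r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)'
match = re.search(pat, text)

if match:
    print(match.group('user'))   # the username, by name instead of index 1
    print(match.group('host'))   # the host, by name instead of index 2
    print(match.groupdict())     # all named groups as a dictionary
```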


One of the most useful functions in re is findall(), which returns a list of all non-overlapping matches of a pattern in a text.

In [58]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

pat = r'([\w\.-]+@[\w\.-]+)'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(pat, str) ## ['alice@google.com', 'bob@abc.com']

for email in emails:
    # do something with each found email string
    print(email)

alice@google.com
bob@abc.com


The findall function can also be used on the contents of a file!

In [72]:
# Open the file and read its full text; 'with' closes the file afterwards
with open('../files/hamlet_act_1_scene_1.txt', encoding='utf-8') as f:
    file_text = f.read()

# Match only stage directions, marked by '(...)'
pat = r'\(([^)]+)\)'

# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(pat, file_text)
strings

['Enter Barnardo and Francisco, two sentinels.',
 'Enter Horatio and Marcellus.']

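findall() and groups can be combined to subdivide the search results: when the pattern contains several groups, findall() returns a list of tuples, one tuple per match.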

In [76]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pat = r'([\w\.-]+)@([\w\.-]+)'

tuples = re.findall(pat, str)

print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]

# Unpack each tuple directly instead of shadowing the built-in name 'tuple'
for username, host in tuples:
    print(f'Username: "{username}", Host: "{host}"')

[('alice', 'google.com'), ('bob', 'abc.com')]
Username: "alice", Host: "google.com"
Username: "bob", Host: "abc.com"


### 1.2 What are regular expressions

A regular expression is a string written in a predefined syntax that describes a text pattern. A regex engine uses it to find the parts of a text that match the pattern.
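
A few common building blocks of that syntax, as a small sketch. The sample sentence is invented for illustration.

```python
import re

# Hypothetical sample text
sample = 'Order 66 shipped on 2020-09-21 to bob@abc.com'

# \d matches a digit, + means "one or more"
digit_runs = re.findall(r'\d+', sample)

# [A-Z] matches one uppercase letter, \w* matches zero or more word characters
capitalized = re.findall(r'[A-Z]\w*', sample)

# {n} repeats the preceding element exactly n times
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample)

print(digit_runs)
print(capitalized)
print(dates)
```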

### 1.3 RegEx on 4-digit numbers from URL text

Find all 4-digit numbers in [this text](https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/regex_exercise.txt).

In [96]:
# Define the url
url = "https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/regex_exercise.txt"

# Open the url; 'with' closes the HTTPResponse afterwards
with req.urlopen(url) as data:
    # Extract byte string from response data
    byte_string = data.read()

# Decode byte string to regular string
text = byte_string.decode("utf-8")

In [117]:
# Match exactly 4 digits that are neither preceded nor followed by another digit
pat = r'(?<!\d)\d{4}(?!\d)'

# Feed the file text into findall(); it returns a list of all the found strings
numbers = re.findall(pat, text)

# Print numbers one at a time on new lines
print(*numbers, sep="\n")

1234
9999
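
The lookbehind `(?<!\d)` and lookahead `(?!\d)` are what keep the pattern from matching inside longer digit runs. The sketch below compares three variants on a made-up sample string (not the exercise file) to show the difference.

```python
import re

# Hypothetical sample text, not taken from the exercise file
sample = 'id 1234, phone 5551234567, year 2020, code 99999, tag abc1234def'

# Plain \d{4} also cuts 4-digit chunks out of longer numbers
naive = re.findall(r'\d{4}', sample)

# \b requires a word boundary, so digits glued to letters are skipped
bounded = re.findall(r'\b\d{4}\b', sample)

# Lookarounds only forbid neighbouring digits, so 'abc1234def' still counts
lookaround = re.findall(r'(?<!\d)\d{4}(?!\d)', sample)

print(naive)
print(bounded)
print(lookaround)
```

Which variant is appropriate depends on whether a number attached to letters should count as a 4-digit number.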


### 1.4 RegEx for words starting with 'super' from URL text

In [125]:
# Get only words starting with 'super': 'super' followed by zero or more word characters
pat = r'super\w*'

# Feed the file text into findall(); it returns a list of all the found strings
words = re.findall(pat, text)

# Print words one at a time on new lines
print(*words, sep="\n")

superpolaroid
supertaxidermy
superbeer
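
One detail worth checking in patterns like this: inside square brackets, `+` is a literal plus sign rather than a quantifier, and without a word boundary `\b` the pattern can also match in the middle of a word. A sketch on an invented sample string:

```python
import re

# Hypothetical sample text, not taken from the exercise file
sample = 'superfood super+size insuperable'

# [\w+] is a character class: word characters OR a literal '+'
with_class = re.findall(r'super[\w+]*', sample)

# \w* allows only word characters after 'super'
word_chars = re.findall(r'super\w*', sample)

# \b additionally requires 'super' to sit at the start of a word
word_start = re.findall(r'\bsuper\w*', sample)

print(with_class)
print(word_chars)
print(word_start)
```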


### 1.5 RegEx find Wiki links in URL text

In [249]:
# Get any strings surrounded by '[[...]]'
pat = r'\[\[(.*?)\]\]'    
matches = re.findall(pat, text)
print("Matches:", *matches, sep="\n  ")

# List to hold the final substrings
results = []
for match in matches:
    # Remove content in parentheses and split by '|'
    cleaned_substrings = re.sub(r'\s*\(.*?\)\s*', '', match).split('|')
    results.extend(cleaned_substrings)
print("Results:", *results, sep="\n  ")

# Create urls from wiki links one at a time while replacing spaces with '_' 
urls = []
for res in results:
    urls.append("https://en.wikipedia.org/wiki/" + res.replace(" ", "_"))

# Remove duplicate urls while keeping their original order
urls = list(dict.fromkeys(urls))

# print out urls
print("Urls:", *urls, sep="\n  ")

Matches:
  drinking vinegar
  gentrify
  hashtag
  Bicycle|Bicycle(two-wheeled type)
  Pitchfork|Pitchfork Magazine
Results:
  drinking vinegar
  gentrify
  hashtag
  Bicycle
  Bicycle
  Pitchfork
  Pitchfork Magazine
Urls:
  https://en.wikipedia.org/wiki/drinking_vinegar
  https://en.wikipedia.org/wiki/gentrify
  https://en.wikipedia.org/wiki/hashtag
  https://en.wikipedia.org/wiki/Bicycle
  https://en.wikipedia.org/wiki/Pitchfork
  https://en.wikipedia.org/wiki/Pitchfork_Magazine
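
In MediaWiki markup, a link of the form `[[target|label]]` points to the page `target` and only displays `label`. As an alternative to splitting on '|', the sketch below captures the two parts in separate groups and builds URLs from the target alone. The sample text and variable names are assumptions made for illustration; this is not the approach used in the cell above.

```python
import re

# Hypothetical sample text in wiki markup
sample = 'I like [[drinking vinegar]] and my [[Bicycle|bike]].'

# Group 1: link target (no ']' or '|'); group 2: optional label after '|'
pat = r'\[\[([^\]|]+)(?:\|([^\]]+))?\]\]'
links = re.findall(pat, sample)

# Build one URL per link target, replacing spaces with '_'
base = 'https://en.wikipedia.org/wiki/'
target_urls = [base + target.replace(' ', '_') for target, label in links]

print(links)
print(*target_urls, sep='\n')
```

For a link without a label, findall() fills the second tuple entry with an empty string.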


## Exercises Part 2: Download the Wikipedia pages of characters