# Hardcore List Parsing

OK, you've made it here. We salute you. Below are some tips and tricks to get you started on the non-trivial task of extracting the names of all Marvel and DC characters on Wikipedia.

## Marvel Characters

We will start with the [List of Marvel characters](http://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters). As you can see the list is organized alphabetically. Therefore, you're going to need to do a few things:


1. Use Python to generate a list of API addresses for all of the relevant pages: [0-9](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_0%E2%80%939), [A](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_A), [B](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_B), [C](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_C), and so on all the way up to [Z](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_Z).


2. Let's take a look at the [A-page](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_A). Your task is to extract the information for each character and save it as a **txt** file. See an example [here](https://github.com/SocialComplexityLab/socialgraphs2020/blob/master/files/week4_example_1.png?raw=true) (where red is a character name and blue is the content associated with the character).
   * You can also see some of the issues you will have to face - [some characters have *redirections* to the *main articles*](https://github.com/SocialComplexityLab/socialgraphs2020/blob/master/files/week4_example_2.png?raw=true). In that case, follow the link to the main article and save its content. Keep in mind that links to **main articles** might be written as `{{main|link_name}}`, `{{Main article|link_name}}`, `{{Main|link_name}}` and more, so your regular expressions need to account for these variants.
   * Some characters have multiple *main articles*. At the same time, not all *main articles* are relevant. If you take a look at the character named **Aginar**, you can see that the *main article* sends users to [List of Eternals](https://en.wikipedia.org/wiki/List_of_Eternals), which is not specific to **Aginar**.
   * Some characters have a redirect of the form `#REDIRECT [[link_to_character_page]]` instead of *main articles*.
   * There are more *edge* cases, and you will have to make a lot of decisions along the way (most of the time, there won't be a *correct* way to handle a case).
   * Of course, you won't have time to tackle all of the issues, and we do not expect you to build a perfect dataset.


3. A couple of extra tips:
    * This is an important point: **Don't get the `html` version of the page**, get the standard [wiki markup](https://en.wikipedia.org/wiki/Help:Wiki_markup), which is what you see when you press "edit" on a Wikipedia page.
    * Some pages contain Unicode characters, so we recommend you save the files using the [`io.open`](http://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python) method with `utf-8` encoding. You can also take a look at `urllib.parse.quote()`, which might be helpful when encoding the links.
    * Store the content of all pages. It's up to you how to do this. One strategy is to use Python's built-in `pickle` format. Or you can simply write the content of the wiki pages to text files and store those in a folder on your computer. I'm sure there are other ways. It's crucial that you store them in a way that's easy to access, since we'll use these pages a lot throughout the remainder of the course (so you don't want to retrieve them from Wikipedia every time).

    * **Small hint**: names in the Marvel list are 2nd-level headings (i.e. wiki markup specifies them as `==Character Name==`). You can use regular expressions to match the pattern.
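
As a rough illustration of the small hint above, here is a minimal sketch that splits a page's wiki markup into one chunk per 2nd-level heading. The function name `split_sections` and the sample markup are made up for this example, and the pattern assumes each heading sits alone on its own line.

```python
import re

# Exactly two '=' on each side, alone on a line; '===Sub===' is skipped.
HEADING = re.compile(r"^==(?!=)[ \t]*(.+?)[ \t]*==[ \t]*$", re.MULTILINE)

def split_sections(wikitext):
    """Return {heading: body} for every 2nd-level section (hypothetical helper)."""
    matches = list(HEADING.finditer(wikitext))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next 2nd-level heading or the end of the page.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[m.group(1)] = wikitext[m.end():end].strip()
    return sections

sample = (
    "intro\n"
    "==A-Bomb==\nText about A-Bomb.\n===History===\nmore\n"
    "==Abomination==\n{{main|Abomination (character)}}\n"
    "==References==\n"
)
for name, body in split_sections(sample).items():
    print(repr(name), "->", repr(body))
```

Sub-headings such as `===History===` stay inside the body of their parent section, which is usually what you want when saving one file per character.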

## DC Characters

These are even harder than the Marvel characters. It's OK to give up now. (But you've come this far, so why give up now?)

You start at [DC Universe](http://en.wikipedia.org/wiki/List_of_DC_Comics_characters).

1. Follow the same strategy as above: loop over the alphabet to retrieve links to every DC character with their own wiki page.


2. This time the task is a little bit more tricky since you'll be parsing a table to get the page names, see [B](https://en.wikipedia.org/wiki/List_of_DC_Comics_characters:_B): you need to find a way to extract the names in the first column (as well as the associated links). For this task you can use **BeautifulSoup** (see [here](https://stackoverflow.com/a/53920093)).


3. You will have to alter your code, as some pages (see [C](https://en.wikipedia.org/wiki/List_of_DC_Comics_characters:_C)) include lists of character names.
    * And pages such as [A](https://en.wikipedia.org/wiki/List_of_DC_Comics_characters:_A) contain both lists of character names and content that is similar to the Marvel pages.
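
To make the table step a bit more concrete, here is a minimal sketch of pulling the first column (name and link) out of an HTML table. Step 2 suggests BeautifulSoup; this sketch swaps in the standard library's `html.parser` so that it runs without extra installs. The class `FirstColumnParser`, the helper `first_column`, and the sample table are all made up for illustration, and the real DC pages may need extra handling (nested tables, header cells in the first column).

```python
from html.parser import HTMLParser

class FirstColumnParser(HTMLParser):
    """Collect (text, href) for the first <td> of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []            # list of (name, href or None)
        self._cell_index = -1     # position of the current cell in its row
        self._in_first = False    # True while inside a first-column <td>
        self._text = []
        self._href = None

    def _close_first_cell(self):
        if self._in_first:
            self._in_first = False
            name = "".join(self._text).strip()
            if name:
                self.rows.append((name, self._href))

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_first_cell()
            self._cell_index = -1
        elif tag in ("td", "th"):
            self._close_first_cell()
            self._cell_index += 1
            if tag == "td" and self._cell_index == 0:
                self._in_first = True
                self._text, self._href = [], None
        elif tag == "a" and self._in_first and self._href is None:
            # Keep only the first link found in the cell.
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self._close_first_cell()

    def handle_data(self, data):
        if self._in_first:
            self._text.append(data)

def first_column(html):
    parser = FirstColumnParser()
    parser.feed(html)
    parser.close()
    return parser.rows

sample_html = (
    '<table><tr><th>Character</th><th>First appeared</th></tr>'
    '<tr><td><a href="/wiki/Bane_(DC_Comics)">Bane</a></td><td>1993</td></tr>'
    '<tr><td>Bat-Mite</td><td>1959</td></tr></table>'
)
print(first_column(sample_html))
```

Rows whose first cell has no link come back with `None` as the link, which makes it easy to filter down to characters that have their own wiki page.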

Big thanks to TA Germans for parsing these messy and horrible Wikipedia lists.

In [18]:
import urllib.request
import json
import re

url = 'https://en.wikipedia.org/wiki/Michael_Jackson'
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
text = data.decode('utf-8')


In [9]:
listOfHeaders= ["0%E2%80%939"]
for i in range(65,91):
    listOfHeaders.append(chr(i))
print(listOfHeaders)


['0%E2%80%939', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


In [14]:
api_list= []

baseurl = "https://en.wikipedia.org/w/api.php?"
action = "action=query"
content = "prop=revisions&rvprop=content"
dataformat ="format=json"
base_title = "titles=List_of_Marvel_Comics_characters:_"

for t in listOfHeaders:
    # Reuse base_title instead of repeating the literal string
    query = "{}{}&{}&{}&{}".format(baseurl, action, content, base_title + t, dataformat)
    api_list.append(query)

In [22]:
api_list[0]

'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=List_of_Marvel_Comics_characters:_0%E2%80%939&format=json'
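
The query string above is assembled by hand, with the en dash in "0–9" already percent-encoded. As an alternative sketch, `urllib.parse.urlencode` can do the encoding for you. The names `API_ENDPOINT` and `build_query` are made up for this example; the parameters are the same ones used in the cell above.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"  # same endpoint as baseurl, without the '?'

def build_query(title):
    """Build the API URL for the wiki markup of one page (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    # urlencode percent-encodes special characters such as ':' and the en dash.
    return API_ENDPOINT + "?" + urlencode(params)

print(build_query("List_of_Marvel_Comics_characters:_A"))
print(build_query("List_of_Marvel_Comics_characters:_0\u20139"))
```

With this approach the list of page suffixes can hold the plain text `0–9` rather than the pre-encoded `0%E2%80%939`.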

In [60]:
def getPageContents(link):
    wikiresponse = urllib.request.urlopen(link)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')
    response = json.loads(wikitext)

    pages = response["query"]["pages"].keys()
    pageContentList = []

    for page in pages:
        pageContentList.append(response["query"]["pages"][page]["revisions"][0]["*"])
    
    return pageContentList[0]

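def cleanText(text):
    # Remove <ref>...</ref> footnotes
    editedText = re.sub(r"<ref*[^~]*?</ref>", "", text)
    # Remove everything from the first HTML comment through the last "}}" that ends a line
    editedText = re.sub(r"<!--[\D|\d]*}}\n", "", editedText)
    # Drop everything up to the last line-ending "}" that is followed by more text
    return re.sub(r"[\D|\d]*?[}}]\n(?=.)", "", editedText)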

def match_link(character_text):
    # First {{main|...}} / {{Main article|...}} template in the text, or None
    return re.search(r"{{main[\w|\s|(|)]*}}", character_text, re.IGNORECASE)

def visit_link(text):
    # Report whether the text contains at least one main-article template
    links = re.findall(r"{{main[\w|\s|(|)]*}}", text, re.IGNORECASE)
    if links:
        return True
    print("not link")
    return False

def generate_sub_link(wiki_link):
    # Take the page title after the first "|" and turn it into an API query URL
    pre_transformed_link = re.search(r"(?<=[|])[\w|\s|()]*", wiki_link)
    if pre_transformed_link:
        title_for_link = pre_transformed_link.group().replace(" ", "_")
        return "{}{}&{}&{}&{}".format(baseurl, action, "prop=revisions&rvprop=content", "titles=" +  title_for_link , dataformat)
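
The exercise text mentions several spellings of the main-article template as well as redirect lines. Below is a minimal sketch of one way to pull the target titles out of both. The names `main_article_targets` and `redirect_target` are made up for this example, and the patterns only cover the variants listed in the exercise text; other templates on the real pages will slip through.

```python
import re

# {{main|...}}, {{Main|...}}, {{Main article|...}}: capture everything after the first '|'.
MAIN_TEMPLATE = re.compile(r"\{\{\s*main(?:\s+article)?\s*\|([^{}]*)\}\}", re.IGNORECASE)
# "#REDIRECT [[Target]]": capture the target, stopping at ']', '|' or '#'.
REDIRECT = re.compile(r"#redirects?\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)

def main_article_targets(text):
    """Return every page title named in a main-article template."""
    targets = []
    for m in MAIN_TEMPLATE.finditer(text):
        for part in m.group(1).split("|"):
            part = part.strip()
            # Assumption: parts containing '=' are named parameters (e.g. l1=label), not titles.
            if part and "=" not in part:
                targets.append(part)
    return targets

def redirect_target(text):
    """Return the redirect target, or None if the text is not a redirect."""
    m = REDIRECT.search(text)
    return m.group(1).strip() if m else None

print(main_article_targets("{{Main article|Ajak|List of Eternals}}"))
print(redirect_target("#REDIRECT [[Absorbing Man]]"))
```

Returning all targets leaves the decision about which main article is relevant (the **Aginar** case) to the calling code, for example by skipping titles that start with "List of".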


In [66]:
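# SINGLE: process one list page (the "A" page)
import os
os.makedirs("./marvel", exist_ok=True)   # the output folder must exist

content = getPageContents(api_list[1])

# 2nd-level headings (==Name==) at the start of a line are character names
matchedPatterns = re.findall(r"(?<=\n)==[\w|\s|-]*[^=]==", content)
titles = [w.replace('==', '') for w in matchedPatterns]
titles.pop()   # drop the last heading (presumably a non-character section such as References)
for index, title in enumerate(titles):

    # Capture everything between this heading and the next one;
    # re.escape keeps special characters in titles from breaking the pattern
    if index != len(titles) - 1:
        regex = "(?<=" + re.escape(title) + r"==)[\D|^\n|\w]+(?=\n==" + re.escape(titles[index + 1]) + ")"
    else:
        regex = "(?<=" + re.escape(title) + r"==)[\D|^\n|\w]+(?=\n==)"

    matchedText = re.findall(regex, content)
    try:
        single_character_text = cleanText(matchedText[0])

        # Prefer the main article when the section links to one
        match = match_link(single_character_text)
        if match:
            link = generate_sub_link(match.group())
            text_to_save = getPageContents(link)
        else:
            text_to_save = single_character_text + "\n"

        with open("./marvel/" + title.strip() + ".txt", "w", encoding="utf-8") as f:
            f.write(text_to_save)

    except Exception as err:
        print("failed on= " + title, "(" + repr(err) + ")")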



In [None]:
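# ALL HEADERS: repeat the single-page procedure for every list page
import os
os.makedirs("./marvel", exist_ok=True)   # the output folder must exist

for header in api_list:

    content = getPageContents(header)

    # 2nd-level headings (==Name==) at the start of a line are character names
    matchedPatterns = re.findall(r"(?<=\n)==[\w|\s|-]*[^=]==", content)
    titles = [w.replace('==', '') for w in matchedPatterns]
    titles.pop()   # drop the last heading (presumably a non-character section such as References)

    for index, title in enumerate(titles):

        # Capture everything between this heading and the next one;
        # re.escape keeps special characters in titles from breaking the pattern
        if index != len(titles) - 1:
            regex = "(?<=" + re.escape(title) + r"==)[\D|^\n|\w]+(?=\n==" + re.escape(titles[index + 1]) + ")"
        else:
            regex = "(?<=" + re.escape(title) + r"==)[\D|^\n|\w]+(?=\n==)"

        matchedText = re.findall(regex, content)

        try:
            single_character_text = cleanText(matchedText[0])

            # Prefer the main article when the section links to one
            match = match_link(single_character_text)
            if match:
                link = generate_sub_link(match.group())
                text_to_save = getPageContents(link)
            else:
                text_to_save = single_character_text + "\n"

            with open("./marvel/" + title.strip() + ".txt", "w", encoding="utf-8") as f:
                f.write(text_to_save)

        except Exception as err:
            print("!!!!! \n\n failed on= " + title + " (" + repr(err) + ")\n\n------------")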
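
Since the exercise stresses storing the pages so they do not have to be retrieved from Wikipedia every time, here is a minimal sketch of a read-through cache on disk. The names `safe_filename` and `load_or_fetch` are made up for this example, and `fetch` stands for any function that takes a title and returns the page text (for instance a wrapper around `getPageContents`).

```python
import os

def safe_filename(title):
    """Replace characters that are awkward in file names with '_'."""
    return "".join("_" if c in '\\/:*?"<>|' else c for c in title.strip())

def load_or_fetch(title, fetch, folder="./marvel"):
    """Return the stored text for `title`, calling `fetch(title)` only on a cache miss."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, safe_filename(title) + ".txt")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch(title)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

A second call with the same title then reads the file instead of hitting the API, which also makes it cheap to rerun the notebook after fixing a parsing bug.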