# Formalia:

Please read the [assignment overview page](https://github.com/SocialComplexityLab/socialgraphs2021/wiki/Assignments) carefully before proceeding. This page contains information about formatting, group sizes, and many other aspects of handing in the assignment.

_If you fail to follow these simple instructions, it will negatively impact your grade!_

**Due date and time**: The assignment is due on Tuesday, November 2nd, 2021 at 23:59. Hand in your IPython notebook file (with extension `.ipynb`) via http://peergrade.io/

### Overview

This year's Assignment 2 is all about analyzing the network of The Legend of Zelda: Breath of the Wild.

Note that this time I'm doing the exercises slightly differently in order to clean things up a bit. The issue is that the weekly lectures/exercises include quite a few instructions and intermediate results that are not quite something you guys can meaningfully answer. 

Therefore, in the assignment below, I have tried to reformulate the questions from the weekly exercises into something that is (hopefully) easier to answer. *Then I also note which lectures each question comes from*; that way, you can easily go back and find additional tips & tricks on how to solve things 😇


----

# Part 0: Building the network 

To create our network, we downloaded the Zelda Wiki pages for all characters in BotW (during Week 4) and linked them via the hyperlinks connecting pages to each other. To achieve this goal we have used regular expressions!

> * Explain the strategy you have used to extract the hyperlinks from the Wiki-pages, assuming that you have already collected the pages with the Zelda API.
> * Show the regular expression(s) you have built and explain in detail how they work.
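
As an illustration of the kind of pattern involved, here is a minimal sketch that pulls link targets out of wikitext. It assumes the standard MediaWiki link syntax `[[Target]]` and `[[Target|label]]`; the sample string and the `extract_links` helper are made up for illustration and are not the course's reference solution. Pages that express links through templates would need a different pattern.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the target is captured.
#   [^\[\]|#]+        the target: anything except brackets, a pipe, or '#'
#   (?:#[^\[\]|]*)?   an optional "#Section" anchor (not captured)
#   (?:\|[^\[\]]*)?   an optional "|label" part (not captured)
LINK_PATTERN = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the unique link targets in order of first appearance."""
    seen = []
    for target in LINK_PATTERN.findall(wikitext):
        target = target.strip()
        if target and target not in seen:
            seen.append(target)
    return seen


# Hypothetical wikitext, not taken from a real page
sample = "[[Link]] meets [[Princess Zelda|Zelda]] near [[Hyrule Castle#History|the castle]]."
print(extract_links(sample))
```

Each target that also names a character page would then become a directed link from the current page to that character.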

# Part 1: Network visualization and basic stats

Visualize the network (from lecture 5) and calculate stats (from lectures 4 and 5). For this exercise, we assume that you've already generated the BotW network and extracted the giant connected component (GCC). Use the GCC to report the results.

_Exercise 1a_: Stats (see lectures 4 and 5 for more hints)

> * What is the number of nodes in the network? 
> * What is the number of links?
> * Who is the top connected character in BotW? (Report results for the in-degrees and out-degrees). Comment on your findings. Is this what you would have expected?
> * Who are the top 5 most connected allies (again in terms of in/out-degree)? 
> * Who are the top 5 most connected enemies -- bosses included -- (again in terms of in/out-degree)?
> * Plot the in- and out-degree distributions. 
>   * What do you observe? 
>   * Can you explain why the in-degree distribution is different from the out-degree distribution?
> * Find the exponent of the degree distribution (by using the `powerlaw` package) for the in- and out-degree distribution. What does it say about our network?
> * Compare the degree distribution of the undirected graph to a *random network* with the same number of nodes and probability of connection *p*. Comment on your results.
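
One way to set up the random-network comparison is sketched below using only the standard library. It assumes that *p* is chosen so the random network has the same expected number of links as the real one, i.e. p = 2L / (N(N-1)). The node names and edges are a toy example, not real BotW data, and the helper names are hypothetical.

```python
import random
from collections import Counter


def degree_counts(edges, nodes):
    """Return {node: degree} for an undirected edge list."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def random_network(nodes, p, seed=None):
    """Erdős-Rényi G(N, p): link every pair of nodes independently with probability p."""
    rng = random.Random(seed)
    nodes = list(nodes)
    return [(u, v)
            for i, u in enumerate(nodes)
            for v in nodes[i + 1:]
            if rng.random() < p]


# Toy undirected network (hypothetical edges, for illustration only)
nodes = ["Link", "Zelda", "Ganon", "Mipha", "Revali", "Urbosa"]
edges = [("Link", "Zelda"), ("Link", "Ganon"), ("Link", "Mipha"),
         ("Link", "Revali"), ("Zelda", "Ganon"), ("Zelda", "Urbosa")]

N, L = len(nodes), len(edges)
p = 2 * L / (N * (N - 1))  # same as <k> / (N - 1)

real_degrees = degree_counts(edges, nodes)
rand_degrees = degree_counts(random_network(nodes, p, seed=42), nodes)

# Degree distribution: how many nodes have degree k
print("p =", p)
print("real:  ", sorted(Counter(real_degrees.values()).items()))
print("random:", sorted(Counter(rand_degrees.values()).items()))
```

On the real network the two distributions would typically be plotted on the same axes, which makes it easy to see whether the real network has hubs that the random one lacks.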

_Exercise 1b_: Visualization (see lecture 5 for more hints)

> * Create a nice visualization of the total (undirected) network:
>   * Color nodes according to the role;
>   * Scale node-size according to degree;
>   * Get node positions based on the Force Atlas 2 algorithm;
>   * Add whatever else you feel would make the visualization nicer.
> * Describe the structure you observe. Can you identify nodes with a privileged position in the network? Do you observe chains of connected nodes? Do you see any interesting groups of nodes (can you guess who's involved)?

# Part 2: Word-clouds

Create your own version of the word-clouds (from lecture 7). For this exercise we assume you know how to download and clean text from the ZeldaWiki pages.

Here's what you need to do:
> * Create a word-cloud for each race of the [five champions of Hyrule](https://zelda.fandom.com/wiki/Champions) (i.e. Hylian, Zora, Goron, Gerudo, and Rito) according to TC-IDF. Feel free to make it as fancy as you like. Explain your process and comment on your results.
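
As a point of reference, TC-IDF weights can be sketched with the standard library alone. The version below assumes TC is the raw term count and IDF is log(N / n_t), where N is the number of documents and n_t the number of documents containing the term; other IDF variants exist. The token lists are invented for illustration.

```python
import math
from collections import Counter


def tc_idf(documents):
    """documents: {name: list of tokens}. Returns {name: {word: TC * IDF}}."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in documents.values() for word in set(tokens))
    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        scores[name] = {word: count * math.log(n_docs / doc_freq[word])
                        for word, count in counts.items()}
    return scores


# Toy token lists (hypothetical, not real page content)
docs = {
    "Zora": ["water", "domain", "water", "hyrule"],
    "Goron": ["rock", "mountain", "rock", "hyrule"],
    "Rito": ["wind", "village", "hyrule"],
}
scores = tc_idf(docs)
top = max(scores["Zora"], key=scores["Zora"].get)
print(top, scores["Zora"])
```

A word that occurs in every document, such as `hyrule` here, gets a weight of zero, which is what removes words common to all races from the clouds.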

In [11]:
import requests
import pandas as pd
import re
import string
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import math


Fetch text with the Zelda API

In [2]:

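# MediaWiki API query that returns the latest revision's wikitext as JSON
baseurl = "https://zelda.fandom.com/api.php?action=query&titles={}&prop=revisions&rvprop=content&rvslots=*&format=json"
titles = ["Hylian", "Zora", "Goron", "Gerudo", "Rito"]
pagedict = {}    # raw JSON response per title
stringdict = {}  # wikitext per title
for title in titles:
    response = requests.get(baseurl.format(title))
    response.raise_for_status()
    data = response.json()  # parse the response once and reuse it
    pagedict[title] = data
    # The page ID is not known in advance, so take the first (only) page entry
    page = next(iter(data["query"]["pages"].values()))
    stringdict[title] = page["revisions"][0]["slots"]["main"]["*"]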
    

    

Clean texts

In [3]:

stringdict2 = {}
# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

for title in titles:
    text = stringdict[title]

    # Strip wiki markup first: once punctuation is removed these patterns can no longer match
    text = re.sub(r'\{\{.*?\}\}', '', text)  # templates
    text = re.sub(r'\[\[.*?\]\]', '', text)  # wiki links
    text = re.sub(r'<.*?>', '', text)        # HTML tags
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in word_tokenize(text) if word not in stop_words]
    # Optional: uncomment to lemmatize the tokens
    # tokens = [lemmatizer.lemmatize(token) for token in tokens]
    stringdict2[title] = tokens
    
    

    
    

In [37]:
# Here comes TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer






Convert the lists to strings so they can be used with the WordCloud library

In [20]:
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud


def term_count(titleList):
    # Number of occurrences of each token in one document
    return Counter(titleList)


# All five cleaned token lists as one collection, used to compute IDF
collectionText = nltk.TextCollection([stringdict2[title] for title in titles])


def idf(text):
    # IDF of every distinct token in the collection
    return {word: text.idf(word) for word in set(text)}


# Compute the IDF table once instead of once per title
idfs = idf(collectionText)


def tc_idf(title):
    # TC-IDF: raw term count times the IDF rounded up to an integer,
    # so the weight can be used as a repeat count below
    counts = term_count(stringdict2[title])
    return {word: int(math.ceil(idfs[word])) * count for word, count in counts.items()}


tc_idfs = {title: tc_idf(title) for title in titles}

for title, tcidf in tc_idfs.items():
    # Repeat each word according to its weight so WordCloud sees it that many times
    strng = "".join((word + " ") * weight for word, weight in tcidf.items())
    wordcloud = WordCloud(background_color="white", collocations=False).generate(strng)
    plt.imshow(wordcloud)
    plt.axis("off")
    print(title)  # label the cloud with the race it belongs to
    plt.show()




# Part 3: Communities and TF-IDF

Find communities and compute their associated TF-IDF (from lecture 7 and 8).

Here's what you need to do:
> * Explain the Louvain algorithm and how it finds communities in a network.
> * Explain how you chose to identify the communities: Which algorithm did you use? (If you did not use the Louvain method, explain how the method you used works.)
> * Comment on your results:
>   * How many communities did you find in total?
>   * Compute the value of modularity with the partition created by the algorithm.
>   * Plot the distribution of community sizes.
> * For the 5 largest communities, create TF-IDF based rankings of words in each community. 
>   * There are many ways to calculate TF-IDF, explain how you've done it and motivate your choices.
>   * List the 5 top words for each community according to TF.
>   * List the 5 top words for each community according to TF-IDF. Are these words more descriptive of the community than just the TF? Justify your answer.
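
For the modularity question, the value can be checked by hand on a small example. The sketch below assumes the common definition for undirected, unweighted networks, M = Σ_c [ L_c / L − (k_c / 2L)² ], where L_c is the number of links inside community c and k_c the total degree of its nodes. The graph and partition are a toy example, and the function name is hypothetical. Libraries that implement Louvain typically also expose a modularity function that can be used instead.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community_of: {node: community label}.
    """
    L = len(edges)
    links_inside = defaultdict(int)  # L_c: links with both ends in community c
    total_degree = defaultdict(int)  # k_c: summed degree of the nodes in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        total_degree[cu] += 1
        total_degree[cv] += 1
        if cu == cv:
            links_inside[cu] += 1
    return sum(links_inside[c] / L - (total_degree[c] / (2 * L)) ** 2
               for c in total_degree)


# Toy graph: two triangles joined by a single link
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(edges, partition), 4))
```

Putting all nodes into a single community gives a modularity of zero, which is a useful sanity check when comparing against the partition found by the algorithm.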

# Part 4: Sentiment of communities

Analyze the sentiment of the communities (lecture 8). Here, we assume that you've successfully identified communities. More tips & tricks can be found in Lecture 8's exercises.

A couple of additional instructions you will need below:
* We name each community by its three most connected characters.
* Average the average sentiment of the nodes in each community to find a community level sentiment.

Here's what you need to do (repeat these steps and report your results for **both LabMT and VADER**):
> * Calculate and store sentiment for every character
> * Create a histogram of all characters' associated sentiments.
> * Which 10 characters have the happiest pages, and which 10 have the saddest?

Now, compute the sentiment of communities: 
> * What are the three happiest communities according to the LabMT wordlist approach? What about VADER?
> * What are the three saddest communities according to the LabMT wordlist approach? What about VADER?
> * Create a bar plot showing the average sentiment of each community and add error-bars using the standard deviation for both methods. 
> * Explain the difference between the two methods and compare the results you have obtained above.
> * What is the advantage of using a rule-based method over the dictionary-based approach? 
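
The dictionary-based part of this pipeline can be sketched as follows. The word scores are invented for illustration and are NOT actual LabMT values; the page names and helper functions are hypothetical as well. The sketch assumes a page's sentiment is the plain average of the scores of its tokens that appear in the wordlist, and that a community's sentiment is the mean of its characters' sentiments, with the standard deviation used for error bars.

```python
from statistics import mean, stdev

# Hypothetical happiness scores on a 1-9 scale (made up, not real LabMT data)
toy_scores = {"friend": 7.5, "hero": 7.0, "battle": 3.0, "death": 1.5}


def page_sentiment(tokens, scores):
    """Average score of the tokens found in the wordlist; None if none are found."""
    found = [scores[t] for t in tokens if t in scores]
    return mean(found) if found else None


def community_sentiment(page_values):
    """Mean and standard deviation of the per-character sentiments in one community."""
    values = [v for v in page_values if v is not None]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Toy community with two characters (hypothetical token lists)
pages = {
    "character_a": ["hero", "friend", "of", "hyrule"],
    "character_b": ["battle", "death", "hero"],
}
per_character = {name: page_sentiment(tokens, toy_scores)
                 for name, tokens in pages.items()}
avg, spread = community_sentiment(per_character.values())
print(per_character, avg, spread)
```

A rule-based tool such as VADER would replace `page_sentiment` with its own scoring of whole sentences, which is where effects like negation and emphasis can be taken into account.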