# Explainer Notebook

### LINK TO WEBPAGE

https://dribo.github.io/DTU-02467_ProjectB/

### LINK TO GITHUB
https://github.com/Dribo/DTU-02467_projectA


# Motivation

We built our dataset by scraping the English Wikipedia article on the Cold War and then scraping every page it references. Each row is one Wikipedia page with the fields Title, Text, References and Categories.

We chose this dataset because the Cold War is an interesting subject, and we suspect that using the English-language page might introduce bias, since views of the conflict are polarized across countries and languages.

Our goal was to shed light on how Wikipedia might contain bias, and what that bias might look like.

# Basic Stats

We chose to scrape at a depth of 1, since this scrape was already very time-consuming.

When building the network, we ended up filtering heavily, because we found Wikipedia to be too densely interlinked for the purposes of our text analysis.

Our complete network has 857 nodes and 16216 edges.

# Tools, theory and analysis

We used NetworkX to represent the graph and the Louvain algorithm to detect communities.

We created our own heuristic classification of pages, which we then used to create subgraphs. We wanted a good distinction between events and people on Wikipedia. Inspecting the resulting labels by hand, we found the classification to be decent.

Afterwards, we examined the assortativity of the full graph with respect to our classification. The value was 0.05. Since we trust that our classification is reasonable, this tells us that pages link to events and people almost independently of their own type, so the event/person split will not be useful for community detection. From then on we worked with the subgraph containing only 'event' pages.
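To make the 0.05 figure easier to read, the sketch below computes the attribute assortativity coefficient from scratch on two toy graphs. It is an illustrative re-implementation of Newman's formula, not the code used in the notebook (which calls NetworkX); the function name `attribute_assortativity` and the toy node names are made up for this example.

```python
from collections import defaultdict

def attribute_assortativity(edges, node_type):
    """Newman's assortativity for a categorical attribute on an undirected graph."""
    # Count each undirected edge in both directions so the mixing matrix is symmetric.
    counts = defaultdict(float)
    for u, v in edges:
        counts[(node_type[u], node_type[v])] += 1
        counts[(node_type[v], node_type[u])] += 1
    total = sum(counts.values())
    types = sorted(set(node_type.values()))
    # e[(s, t)] is the fraction of edge ends that go from type s to type t.
    e = {pair: c / total for pair, c in counts.items()}
    trace = sum(e.get((t, t), 0.0) for t in types)
    # a[t] is the fraction of edge ends attached to type t (row sums);
    # by symmetry the column sums are identical.
    a = {t: sum(e.get((t, s), 0.0) for s in types) for t in types}
    expected = sum(a[t] ** 2 for t in types)
    return (trace - expected) / (1 - expected)

# Hypothetical toy nodes, labelled like the TYPE attribute in the notebook.
node_type = {"e1": "event", "e2": "event", "p1": "person", "p2": "person"}

# Links only within a type: perfectly assortative.
same_type_edges = [("e1", "e2"), ("p1", "p2")]
# Links only across types: perfectly disassortative.
cross_type_edges = [("e1", "p1"), ("e2", "p2")]

r_same = attribute_assortativity(same_type_edges, node_type)
r_cross = attribute_assortativity(cross_type_edges, node_type)
print(r_same, r_cross)
```

The two toy graphs give the extremes of 1 and -1. A value near 0, like the 0.05 we measured, sits between them: knowing a page's type says almost nothing about the type of the pages it links to.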

We spent some time experimenting with subgraphs, because it was difficult to get a good community segmentation. We finally reached a good modularity by removing high-degree nodes. We used distributions of the data, for example node degree distributions, to aid in our choice of thresholds for the subgraph filtering.

For the text analysis we ran TF-IDF on the communities. We attempted to answer our research question about finding bias in Wikipedia by looking at which words and distinctions surfaced, and whether these were surprising or not.

We decided to use the sklearn implementation of TF-IDF vectorisation, mainly to avoid writing code that would be less efficient than what was already available. A downside is that all pre-processing happens behind the scenes, so we relinquished some control over exactly how our text data was pre-processed. We inspected the result of the pre-processing and found it to be in the same vein as what we would have written ourselves, and therefore decided to move forward with this approach. In short, the text is pre-processed by sklearn's tokenizer, where for instance all punctuation counts as a token separator, as opposed to splitting only on spaces or keeping an abbreviation such as U.S. as a single token. We decided that this was a worthwhile trade-off.
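The tokenisation behaviour described above can be reproduced with the standard library alone. The sketch below applies the regular expression that sklearn documents as the default `token_pattern` of `TfidfVectorizer`, `(?u)\b\w\w+\b`, to a made-up sentence. It only mimics the lowercasing and tokenisation steps, not stop-word removal, and the sentence and the helper name `sklearn_like_tokens` are assumptions for illustration.

```python
import re

# Default token pattern documented for sklearn's text vectorizers:
# runs of two or more word characters, with punctuation acting as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def sklearn_like_tokens(text):
    """Lowercase the text and extract tokens the way the default pattern would."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical example sentence containing abbreviations.
sentence = "The U.S. and the U.S.S.R. signed the SALT I treaty."
tokens = sklearn_like_tokens(sentence)
print(tokens)
```

Note that the single-letter fragments of `U.S.` and `U.S.S.R.` are not merely split but dropped, because the pattern requires at least two word characters per token. Abbreviations like these therefore never reach the TF-IDF vocabulary under the default settings.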

# Discussion

In general the project went well. We found that Wikipedia has a large amount of data available, and that articles are more interlinked than we initially expected. This made it necessary to be somewhat creative in the heuristics we employed, in order to get a subgraph that was both interesting to analyse and usable for answering our questions.
We also found that using the ready-made TF-IDF implementation in sklearn gave us more time to actually analyse the data, rather than spending that time writing code that would in any case be less efficient. It did mean that we had less say in how the pre-processing was done, and therefore less control over the results. That said, the pre-processing made sense to us, and we did not find any obvious deviations from how we would have done it ourselves.

Some of the things we would have liked to expand on are sentiment analysis and other language editions of Wikipedia. We think this would have enabled a deeper, more interesting analysis, and would have been a better way to answer the research question we initially posed.




In [1]:
# Options
OPTION_PERFORM_SCRAPE = False
OPTION_SAVE_FIG = True
OPTION_SHOW_PLOT = True

In [68]:
# Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
from ast import literal_eval
from collections import defaultdict
import networkx as nx
from netwulf.interactive import visualize
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import numpy as np
import community
from collections import OrderedDict
from operator import getitem
import importlib
from wordcloud import WordCloud


In [69]:
# Static Variables
LINK_WIKI_ENGLISH = "https://en.wikipedia.org/wiki/Cold_War"
LINK_WIKI_GERMAN = "https://de.wikipedia.org/wiki/Kalter_Krieg"
EVENT_FILTER = ["conflicts", "conflict", "events", "event", "wars", "war", "coups", "coup",
                "crises", "crisis", "coup d'état", "history", "warfare", "battle", "battles",
                "invasion", "invasions", "revolution", "revolutions"]
PERSON_FILTER = ["births", "deaths", "people", "leader", "leaders", "politicians", "politician",
                 "writer", "writers", "scientist", "scientists", "personnel", "family", "families",
                 "executive", "executives", "spy", "spies", "person"]

# Explainer Scraping:

We define some functions to help scrape Wikipedia.

We take an initial link, in this case the Cold War article, collect all references from it, and scrape those pages as well. The combined data consists of the initial page and all pages it references. For each page we save the paragraph texts, references, title, URL and the Wikipedia categories listed at the bottom of the page. Because we store the references of every scraped page, we can later build all edges between them.

Our cleanup function removes pages that are internal or contributor-focused. A lot of content on Wikipedia is information about editing Wikipedia itself, which we are not interested in for this work.
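As a rough illustration of the cleanup step, the sketch below normalises a few made-up `href` values using only the standard library. It is a variant written for this explanation rather than the function used in the notebook: the name `normalise_wiki_link`, the example links, and the use of `urllib.parse` are assumptions, while the filtering rules (keep `/wiki/` links, drop fragments, skip `Wikipedia` and `Template` pages) mirror the notebook's cleanup function.

```python
from urllib.parse import urldefrag, urljoin

def normalise_wiki_link(href, language="en"):
    """Return an absolute article URL, or None if the link should be skipped."""
    if "/wiki/" not in href:
        return None  # external links, citation anchors, and similar
    # Drop the '#section' fragment so every section maps to the same article.
    path, _fragment = urldefrag(href)
    # Skip pages about Wikipedia itself rather than about the subject.
    if "Wikipedia" in path or "Template" in path:
        return None
    return urljoin(f"https://{language}.wikipedia.org", path)

# Hypothetical href values of the kind found in article paragraphs.
hrefs = [
    "/wiki/Berlin_Wall#Construction",
    "/wiki/Wikipedia:Citation_needed",
    "#cite_note-12",
    "/wiki/Truman_Doctrine",
]
links = [link for link in map(normalise_wiki_link, hrefs) if link is not None]
print(links)
```

Only the two article links survive, and the section fragment on the first one is removed so that it maps to the same node as any other link to that article.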



In [70]:
def hyperlink_cleanup(str_link, language):
    link_prefix = f"https://{language}.wikipedia.org"
    if "/wiki/" in str_link:
        out = link_prefix + str_link
        if '#' in out:
            out = out.split('#')[0]
        if 'Wikipedia' in out:
            return False
        if 'Template' in out:
            return False
        return out
    else:
        return False

In [71]:
def get_content_soup(link_wikipedia):
    website = requests.get(link_wikipedia)
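    # Name the parser explicitly so results do not depend on which parsers are installed.
    content_soup = BeautifulSoup(website.content, "html.parser")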
    return content_soup

def soup_get_title(wiki_content_soup):
    head = wiki_content_soup.find("h1", {"id": "firstHeading"})
    return head.text

def soup_get_reference_links(wiki_content_soup, language="en"):
    p_elements = wiki_content_soup.find_all("p")
    links = [a['href'] for p in p_elements for a in p.find_all("a", href=True)]
    links = [hyperlink_cleanup(link, language) for link in links]
    links = [link for link in links if link != False]
    return links

def soup_get_category_texts(wiki_content_soup, language="en"):
    # The category box sits at the bottom of the page; pages without one yield no categories.
    html_div = wiki_content_soup.find("div", {"id": "mw-normal-catlinks"})
    if html_div is None:
        return []
    category_list = html_div.find("ul")
    if category_list is None:
        return []
    link_texts = [a.text for a in category_list.find_all("a")]
    return link_texts

def soup_get_paragraph_texts(wiki_content_soup):
    p_elements = wiki_content_soup.find_all("p")
    paragraph_texts = [p.text for p in p_elements]
    return paragraph_texts

def get_all_reference_links(link_wikipedia, language):
    # Fetch the page and reuse the shared helpers instead of duplicating their logic.
    content_soup = get_content_soup(link_wikipedia)
    return soup_get_reference_links(content_soup, language)


In [72]:
# Scrape English version
if OPTION_PERFORM_SCRAPE:
    links_to_scan = get_all_reference_links(LINK_WIKI_ENGLISH, "en") + [LINK_WIKI_ENGLISH]
    links_to_scan = sorted(list(set(links_to_scan)))
    data = []
    for url in tqdm(links_to_scan):
        soup = get_content_soup(url)
        title = soup_get_title(soup)
        list_references = soup_get_reference_links(soup, language="en")
        list_paragraph_texts = soup_get_paragraph_texts(soup)
        list_category_texts = soup_get_category_texts(soup)
        data.append([url, title, list_references, list_paragraph_texts, list_category_texts])


In [73]:
# Save English version
if OPTION_PERFORM_SCRAPE:
    COLUMN_NAMES = ['URL', 'TITLE', 'LIST_REFERENCES', 'LIST_PARAGRAPH_TEXTS', "CATEGORIES"]
    df_wikipedia_english = pd.DataFrame(data, columns=COLUMN_NAMES)
    df_wikipedia_english = df_wikipedia_english.set_index('URL')
    df_wikipedia_english.to_csv('./data/wiki_english.csv')

In [74]:
# Scrape German Version
if OPTION_PERFORM_SCRAPE:
    links_to_scan = get_all_reference_links(LINK_WIKI_GERMAN, language="de") + [LINK_WIKI_GERMAN]
    links_to_scan = sorted(list(set(links_to_scan)))
    data = []
    for url in tqdm(links_to_scan):
        soup = get_content_soup(url)
        title = soup_get_title(soup)
        list_references = soup_get_reference_links(soup, language="de")
        list_paragraph_texts = soup_get_paragraph_texts(soup)
        list_category_texts = soup_get_category_texts(soup)
        data.append([url, title, list_references, list_paragraph_texts, list_category_texts])


In [75]:
# Save German Version
if OPTION_PERFORM_SCRAPE:
    COLUMN_NAMES = ['URL', 'TITLE', 'LIST_REFERENCES', 'LIST_PARAGRAPH_TEXTS', "CATEGORIES"]
    df_wikipedia_german = pd.DataFrame(data, columns=COLUMN_NAMES)
    df_wikipedia_german = df_wikipedia_german.set_index('URL')
    df_wikipedia_german.to_csv('./data/wiki_german.csv')

### Load data

In [76]:
#df_wikipedia_english = pd.read_csv('./data/wiki_english.csv', index_col='URL', converters={'LIST_REFERENCES': literal_eval, 'LIST_PARAGRAPH_TEXTS': literal_eval, "CATEGORIES": literal_eval})

#df_wikipedia_english['TYPE'] = df_wikipedia_english_fromCSV.apply(lambda x: wiki_util.get_category(x['CATEGORIES'], EVENT_FILTER, PERSON_FILTER), axis=1)

df_wikipedia_english = pd.read_csv('./data/wiki_english_with_tokens.csv', index_col='URL', converters={'LIST_REFERENCES': literal_eval, 'LIST_PARAGRAPH_TEXTS': literal_eval, "CATEGORIES": literal_eval, "TOKENS": literal_eval, "UNIQUE_TOKENS": literal_eval})

#df_wikipedia_german_fromCSV = pd.read_csv('./data/wiki_german.csv', index_col='URL', converters={'LIST_REFERENCES': literal_eval, 'LIST_PARAGRAPH_TEXTS': literal_eval, "CATEGORIES": literal_eval})


#df_wikipedia_german_fromCSV['TYPE'] = df_wikipedia_german_fromCSV.apply(lambda x: wiki_util.get_category(x['CATEGORIES'], EVENT_FILTER, PERSON_FILTER), axis=1)

# Works up to a point; a few small things are not caught properly. The None filter works best.

# Explainer:
The cell below contains a helper that prints basic graph statistics and the function that produces communities using the Louvain community detection algorithm, as explained on the website.

In [77]:
def get_graph_stats(graph):
    def format_print(str1, str2):
        print("{:>15}".format(str1), "{:>15}".format(str2))

    if graph.is_directed():
        in_degrees = [x[1] for x in graph.in_degree]
        out_degrees = [x[1] for x in graph.out_degree]
    else:
        degrees = [x[1] for x in graph.degree]

    format_print('Statistic', 'Value')
    format_print('N Nodes', len(graph.nodes))
    format_print('N Edges', len(graph.edges))
    if graph.is_directed():
        format_print('Max in_degree', max(in_degrees))
        format_print('Min in_degree', min(in_degrees))
        format_print('Max out_degree', max(out_degrees))
        format_print('Min out_degree', min(out_degrees))
        format_print('Mean in_degree', "{: 2.2f}".format(np.mean(in_degrees)))
        format_print('Mean out_degree', "{: 2.2f}".format(np.mean(out_degrees)))
    else:
        format_print('Max degree', max(degrees))
        format_print('Min degree', min(degrees))
        format_print('Mean degree', "{: 2.2f}".format(np.mean(degrees)))

def produce_communities(graph, resolution=0.3):
    list_communities = nx.community.louvain_communities(graph, seed=1, resolution=resolution)

    list_communities.sort(key=len, reverse=True)
    #print(len(list_communities))

    nice_colors = ['#0fdbff', '#0fdbff', '#ff0fb3', '#5e3582', '#ffe70f', '#1e9648', '#1e6296', '#4a1e96', '#961e6a', '#51888c']

    DEFAULT_COLOR = '#8c8c8c'
    partition_colors = defaultdict(lambda: DEFAULT_COLOR)

    for i in range(len(nice_colors)):
        partition_colors[i] = nice_colors[i]

    for node in graph.nodes:
        for i in range(len(list_communities)):
            if node in list_communities[i]:
                graph.nodes[node]['color'] = partition_colors[i]
                graph.nodes[node]['community'] = i

    return graph, len(list_communities), list_communities


# Explainer: Building the Edge List

To create a graph we first build an edge list. We go through the data and build a reference count dict with key (urlFrom, urlTo) and the link count as value, which becomes the edge weight. A defaultdict returns 0 for keys that have no entry yet, so we can simply increment the value instead of checking for None. Only links to pages that are themselves in the dataset are counted.

From this dict we can create either a directed graph or an undirected one; for the undirected graph we sum the weights of (A, B) and (B, A) for each edge.
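The counting and merging steps can be sketched on a toy example. The page names and the `references` mapping below are invented for illustration; the notebook itself reads them from the scraped dataframe.

```python
from collections import defaultdict

# Hypothetical scraped data: page -> pages it links to (repeats allowed).
references = {
    "A": ["B", "B", "C"],
    "B": ["A"],
    "C": [],
}

# Directed reference counts keyed by (url_from, url_to); missing keys read as 0.
reference_count = defaultdict(int)
for url_from, targets in references.items():
    for url_to in targets:
        reference_count[(url_from, url_to)] += 1

directed_edges = [(a, b, w) for (a, b), w in reference_count.items()]

# Undirected weights: add up both directions under an order-independent key.
undirected_count = defaultdict(int)
for (a, b), w in reference_count.items():
    undirected_count[tuple(sorted((a, b)))] += w

undirected_edges = [(a, b, w) for (a, b), w in undirected_count.items()]
print(directed_edges)
print(undirected_edges)
```

Keying the undirected counts on the sorted pair is one way to merge (A, B) and (B, A) without searching the edge list for an existing entry.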

In [78]:
def df_get_url_list(df_wiki):
    return list(df_wiki.index)

def mask_list(base, to_mask):
    res = [o for o in to_mask if o in base]
    return res

def get_dict_reference_count(df_wiki):
    reference_count = defaultdict(lambda:0)

    url_list = df_get_url_list(df_wiki)
    for url in tqdm(url_list):
        references = df_wiki["LIST_REFERENCES"][url]
        references = mask_list(url_list, references)

        for url_ref in references:
            reference_count[(url, url_ref)] += 1

    return reference_count

def get_edge_list(df_wiki, directed=False):
    reference_count = get_dict_reference_count(df_wiki)

    edge_list = []
    url_list = df_get_url_list(df_wiki)
    for a in tqdm(url_list):
        for b in url_list:
            weight = reference_count[(a,b)]
            if not directed:
                weight += reference_count[(b,a)]
            if weight > 0:
                if directed:
                    edge_list.append((a, b, weight))
                elif not ((a, b, weight) in edge_list or (b, a, weight) in edge_list):
                    edge_list.append((a, b, weight))

    return edge_list

def generate_graph_with_node_attributes(graph, df_wiki):
    for node in graph.nodes:
        graph.nodes[node]['TYPE'] = df_wiki.TYPE[node]
        PARAGRAPH_TEXTS = df_wiki.LIST_PARAGRAPH_TEXTS[node]
        graph.nodes[node]['LIST_PARAGRAPH_TEXTS'] = PARAGRAPH_TEXTS
        graph.nodes[node]['FLAT_TEXT'] = ' '.join(PARAGRAPH_TEXTS)
        graph.nodes[node]['CATEGORIES'] = df_wiki.CATEGORIES[node]
        graph.nodes[node]['TITLE'] = df_wiki.TITLE[node]
    return graph



In [79]:
def threshold_node_degree_undirected(graph, min=0, max=30):
    if graph.is_directed():
        raise Exception("Only use on undirected_graphs")
    nodes = (
        node for node, data in graph.nodes(data=True)
        if min <= graph.degree[node] <= max
    )
    return graph.subgraph(nodes)

def threshold_node_degree_directed(graph, min=[0, 0], max=[30,30]):
    if not graph.is_directed():
        raise Exception("Only use on directed graphs")
    if len(min) != 2:
        raise Exception("Min must be array of length 2")
    if len(max) != 2:
        raise Exception("Max must be array of length 2")
    nodes = (
        node for node, data in graph.nodes(data=True)
        if min[0] <= graph.in_degree[node] <= max[0]
        and min[1] <= graph.out_degree[node] <= max[1]
    )
    return graph.subgraph(nodes)

In [80]:
en_edge_list = get_edge_list(df_wikipedia_english)
en_edge_list_directed = get_edge_list(df_wikipedia_english, directed=True)

graph_en_directed = nx.DiGraph()
graph_en_directed.add_weighted_edges_from(en_edge_list_directed)
graph_en_directed = generate_graph_with_node_attributes(graph_en_directed, df_wikipedia_english)

100%|██████████| 859/859 [00:01<00:00, 564.72it/s]
100%|██████████| 859/859 [00:12<00:00, 66.36it/s] 
100%|██████████| 859/859 [00:01<00:00, 513.67it/s]
100%|██████████| 859/859 [00:00<00:00, 2173.99it/s]


# Explainer: Tokenizer

We initially used our own implementation of TF-IDF, but this proved too slow.
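For reference, a naive TF-IDF in plain Python looks roughly like the sketch below. It is not our original implementation, only an illustration of the textbook weighting tf × log(N / df) on made-up documents; the function name `naive_tf_idf` is hypothetical, and sklearn uses a smoothed and normalised variant, so its numbers will differ.

```python
import math
from collections import Counter

def naive_tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical mini-corpus; each string stands in for one community's text.
docs = ["cold war berlin", "cold war cuba", "cuba missile crisis"]
weights = naive_tf_idf(docs)
top_term_doc0 = max(weights[0], key=weights[0].get)
print(top_term_doc0)
```

Terms shared by several documents ("cold", "war") are down-weighted, so the term unique to the first document ranks highest. This is the effect we rely on when comparing communities.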

In [None]:
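# max_df is a float here: 1.0 keeps terms found in up to 100% of the documents.
# An integer 1 would instead drop every term that occurs in more than one document.
def tf_idf_sklearn(documents, max_df=1.0, min_df=1, max_features=100):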
    vectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                 max_features=max_features,
                                 max_df=max_df, min_df=min_df)
    tfidf_matrix = vectorizer.fit_transform(documents)
    return tfidf_matrix, vectorizer


# Graph and Text analysis
The cell below defines a wrapper class that makes it simple for us to iterate and experiment with different graph configurations.

In [None]:
def get_attributes_for_community(graph, community, attribute):
    list_attribute = []
    for (n, d) in graph.nodes(data=True):
        if d['community'] == community:
            list_attribute.append(d[attribute])

    return list_attribute

def get_attributes_for_graph(graph, attribute):
    list_attribute = []
    for (n, d) in graph.nodes(data=True):
        list_attribute.append(d[attribute])

    return list_attribute

def get_subgraph(graph, attribute, values):
    nodes = (
        node for node, data in graph.nodes(data=True)
        if data.get(attribute) in values
    )
    return graph.subgraph(nodes)

class CSSGraph:
    def __init__(self, directed=False):
        self.n_communities = 0
        self.communities = None
        self.directed = directed
        self.visualize_config = None
        self.graph_corpus = None
        self.community_corpus = [] #Maybe not needed

        self.tf_idf_df = None

        self.community_top_10_tf = {}
        self.community_top_10_tf_idf = {}
        self.community_top_3_nodes = {}

        self.vectorizer = None
        self.tf_idf_matrix = None

        if directed:
            self.graph = nx.DiGraph()
        else:
            self.graph = nx.Graph()

    def add_weighted_edges_from(self, edge_list):
        self.graph.add_weighted_edges_from(edge_list)

    def populate_node_attributes(self, df):
        self.graph = generate_graph_with_node_attributes(self.graph, df)

    def print_graph_stats(self):
        get_graph_stats(self.graph)

    def make_subgraph_with_attribute_values(self, attribute, values):
        self.graph = get_subgraph(self.graph, attribute=attribute, values=values)

    def filter_nodes_by_degree(self, d_min, d_max):
        if self.directed:
            self.graph = threshold_node_degree_directed(self.graph, min = d_min, max= d_max)
        else:
            self.graph = threshold_node_degree_undirected(self.graph, min=d_min, max=d_max)

    def embed_communities(self, resolution=1):
        self.graph, self.n_communities, self.communities = produce_communities(self.graph, resolution)
        self.embed_graph_corpus()
        self.embed_community_corpus()

    def visualize(self, graph_saving=False):
        if not graph_saving:
            visualize(self.graph)
        elif self.visualize_config is None:
            _, self.visualize_config = visualize(self.graph)
        else:
            visualize(self.graph, config=self.visualize_config)

    def embed_graph_corpus(self):
        if self.graph_corpus is None:
            self.graph_corpus = get_attributes_for_graph(self.graph, 'FLAT_TEXT')

    def embed_tf_idf(self, max_df=1.0, min_df=1, max_features=100):
        self.tf_idf_matrix, self.vectorizer = tf_idf_sklearn(self.community_corpus, max_df=max_df, min_df=min_df, max_features=max_features)
        self.tf_idf_df = pd.DataFrame(self.tf_idf_matrix.toarray(), columns=self.vectorizer.get_feature_names_out())

    def get_top_terms(self, community=None):
        if community is None:
            return [list(self.tf_idf_df.iloc[c].sort_values(ascending=False)[:10].index) for c in range(self.n_communities)]
        else:
            return list(self.tf_idf_df.iloc[community].sort_values(ascending=False)[:10].index)

    def get_nodes_in_community(self, community):
        return [node for node, c in self.graph.nodes(data='community') if c == community]

    def get_top_nodes(self, community=0, n_max = 3):
        nodes_in_community = self.get_nodes_in_community(community)
        degrees = [(node,val) for (node, val) in self.graph.degree() if node in nodes_in_community]
        sorted_degree= sorted(degrees, key=lambda x: x[1], reverse=True)[:n_max]
        return [node for node, val in sorted_degree]

    # Maybe not needed
    def embed_community_corpus(self):
        if len(self.community_corpus) != self.n_communities:
            for n in range(self.n_communities):
                self.community_corpus.append(' '.join(get_attributes_for_community(self.graph, n, 'FLAT_TEXT')))


## Subgraph without communities

In [None]:
# Create Graph
graph_en_basic = CSSGraph()
graph_en_basic.add_weighted_edges_from(en_edge_list)
graph_en_basic.populate_node_attributes(df_wikipedia_english)

graph_en_basic.make_subgraph_with_attribute_values('TYPE', ['event'])

#graph_en_basic.visualize(graph_saving=True)

# Full Graph

In [None]:
# Create Graph
graph_en_full = CSSGraph()
graph_en_full.add_weighted_edges_from(en_edge_list)
graph_en_full.populate_node_attributes(df_wikipedia_english)

# Community Analysis
graph_en_full.embed_communities(resolution=1)
graph_en_full.embed_tf_idf(max_df=0.99, min_df=0, max_features=100)

#graph_en_full.visualize(graph_saving=True)

# Sub Graph for Event

In [None]:
# Create Graph
graph_en_event = CSSGraph()
graph_en_event.add_weighted_edges_from(en_edge_list)
graph_en_event.populate_node_attributes(df_wikipedia_english)

# Filter Graph
graph_en_event.make_subgraph_with_attribute_values('TYPE', ['event'])

# Community Analysis
graph_en_event.embed_communities(resolution=1)
graph_en_event.embed_tf_idf(max_df=0.99, min_df=0, max_features=100)

#graph_en_event.visualize(graph_saving=True)

# Subgraph for event with degree threshold (Selected for further analysis)

In [None]:
# Create Graph
graph_en = CSSGraph()
graph_en.add_weighted_edges_from(en_edge_list)
graph_en.populate_node_attributes(df_wikipedia_english)

# Filter Graph
graph_en.make_subgraph_with_attribute_values('TYPE', ['event'])
graph_en.filter_nodes_by_degree(d_min=2, d_max=40)

# Community Analysis
graph_en.embed_communities(resolution=1)
graph_en.embed_tf_idf(max_df=0.99, min_df=0, max_features=100)

#graph_en.visualize()

# Graph Analysis

### Note on modularity

We see that the modularity is much higher on the graph that is first filtered to events and then thresholded on node degree.
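For intuition about what the modularity score measures, the sketch below evaluates the standard formula Q = Σ_c [L_c / m - (d_c / 2m)²] by hand on a toy graph of two triangles joined by one edge. The graph, the partitions and the function name `modularity` are invented for illustration; the notebook itself relies on `nx.community.modularity`.

```python
def modularity(edges, communities):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    community_of = {node: i for i, nodes in enumerate(communities) for node in nodes}
    internal = [0] * len(communities)    # L_c: edges inside community c
    degree_sum = [0] * len(communities)  # d_c: total degree of community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree_sum))

# Two triangles (1-2-3 and 4-5-6) joined by the single edge 3-4.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
good_split = [{1, 2, 3}, {4, 5, 6}]
everything_together = [{1, 2, 3, 4, 5, 6}]

q_good = modularity(edges, good_split)
q_single = modularity(edges, everything_together)
print(round(q_good, 3), round(q_single, 3))
```

The natural split scores about 0.357, while lumping everything into one community scores 0. A higher modularity on the filtered subgraph therefore indicates that its communities are more cleanly separated.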

In [None]:
# Statistics
graph_en.print_graph_stats()
graph_en_full.print_graph_stats()
print("Modularity:", nx.community.modularity(graph_en.graph, graph_en.communities))
print("Modularity event:", nx.community.modularity(graph_en_event.graph, graph_en_event.communities))
print("Modularity full:", nx.community.modularity(graph_en_full.graph, graph_en_full.communities))


## Assortativity

Here we use NetworkX's implementations of the degree assortativity and attribute assortativity coefficients.

In [None]:
print("Assortativity degree subgraph:", nx.degree_assortativity_coefficient(graph_en.graph))
print("Assortativity degree full graph:", nx.degree_assortativity_coefficient(graph_en_full.graph))
print("Assortativity type attribute full graph:", nx.attribute_assortativity_coefficient(graph_en_full.graph, 'TYPE'))


## Degree Distribution

In [None]:
G = graph_en.graph

if OPTION_SHOW_PLOT:
    degrees = [x[1] for x in G.degree]
    bins = np.linspace(min(degrees), max(degrees), max(degrees))

    mean = np.mean(degrees)
    median = np.median(degrees)

    hist, edges = np.histogram(degrees, bins=bins)
    x = (edges[1:] + edges[:-1])/2
    width = bins[1] - bins[0]
    fig, axs = plt.subplots(1, figsize=(6, 3))
    axs.bar(x, hist, width=width*0.9)

    axs.set_xlabel('Degrees')
    axs.set_ylabel('Articles')
    axs.set_yscale('log')
    axs.set_xticks(range(0, 501, 5))
    axs.set_yticks([10**0, 10**1, 10**2, max(hist)], labels=[10**0, 10**1, 10**2, max(hist)])
    # Vertical markers; axvline is independent of the log-scaled y axis.
    axs.axvline(mean, color='red', label='mean degree')
    axs.axvline(median, linestyle='--', color='red', label='median degree')
    axs.legend()

    fig.tight_layout(pad=0.5)
    plt.show()
    if OPTION_SAVE_FIG:
        fig.savefig("./images/degree_distributions_final.png")

G = graph_en_full.graph

if OPTION_SHOW_PLOT:
    degrees = [x[1] for x in G.degree]
    bins = np.linspace(min(degrees), max(degrees), 30)

    mean = np.mean(degrees)
    median = np.median(degrees)

    hist, edges = np.histogram(degrees, bins=bins)
    x = (edges[1:] + edges[:-1])/2
    width = bins[1] - bins[0]
    fig, axs = plt.subplots(1, figsize=(6, 3))
    axs.bar(x, hist, width=width*0.9)

    axs.set_xlabel('Degrees')
    axs.set_ylabel('Articles')
    axs.set_yscale('log')
    axs.set_xticks(range(0, 501, 40))
    axs.set_yticks([10**0, 10**1, 10**2, max(hist)], labels=[10**0, 10**1, 10**2, max(hist)])
    # Vertical markers; axvline is independent of the log-scaled y axis.
    axs.axvline(mean, color='red', label='mean degree')
    axs.axvline(median, linestyle='--', color='red', label='median degree')
    axs.legend()

    fig.tight_layout(pad=0.5)
    plt.show()
    if OPTION_SAVE_FIG:
        fig.savefig("./images/degree_distributions_final_fullgraph.png")

In [None]:
# Get top terms and stats
print("Number of communities:", graph_en.n_communities)
print("Top 10 terms in community 0")
print(graph_en.get_top_terms(0))
print("Top 3 nodes in community 0")
print(graph_en.get_top_nodes(0))

# Get top terms of graph
tf_idf_matrix, vectorizer = tf_idf_sklearn(graph_en.graph_corpus, max_df=0.99, min_df=1, max_features=100)
importance = np.argsort(np.asarray(tf_idf_matrix.sum(axis=0)).ravel())[::-1]
tfidf_feature_names = np.array(vectorizer.get_feature_names_out())
print(tfidf_feature_names[importance[:10]])


# WordClouds

In [None]:
#plt.style.use('classic')
fig, axs =  plt.subplots(nrows=3, ncols=3, figsize=(20, 12))
plt.subplots_adjust(hspace=0.2)
for community, ax in zip(range(graph_en.n_communities), axs.ravel()):
    top3 = graph_en.get_top_nodes(community, 3)
    top3 = [graph_en.graph.nodes[t]['TITLE'] for t in top3]
    text = graph_en.community_corpus[community]
    wordcloud = WordCloud().generate(text)
    ax.imshow(wordcloud)
    ax.set_title(top3, wrap=True)
    ax.axis("off")
plt.show()