<a href="https://colab.research.google.com/github/FranziskaSW/DS-keyword-clusters/blob/master/4_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing The Network Of Keywords
## Motivation
Every article in the NYT is tagged with several keywords. We can assume that some of these keywords are "section-specific", i.e. they mostly appear in specific sections (for example, tennis players will mostly appear in the Sports section, while dancers will appear in Culture). But some keywords might appear in many different sections (for example 'New York'; it's the New York Times after all).

Simplified, our dataset looks like this:

| article   | keywords                                                     | section |
| ---- | ------------------------------------------------------------ | ------- |
| 1    | 'Nuclear Weapons', 'Trump, Donald J', 'Kim Jong-un', 'North Korea', 'United States Defense and Military Forces' | World   |
| 2    | 'Cats', 'Wildfires', 'Santa Rosa (California)'               | U.S.    |
| 3    | 'Playoff Games', 'Football', 'Super Bowl', 'New England Patriots', 'Jacksonville Jaguars' | Sports  |
| ... |                           ...  |                                         ...  | 



If we assign each keyword the section of the newspaper it "belongs" to (the dominant section it appears in), we can visualize how the different sections of the newspaper interact content-wise. We define the keyword network with the keywords as nodes, and add an edge between two keywords if they appeared together in an article.

We will embed this graph in 2D in a way that represents the co-appearance connections between the keywords, visualizing the relations between the different sections of the newspaper. We will see if this network of keywords truthfully represents the sections of the newspaper.

From another perspective, since the embedding of the network is based solely on the connections between keywords, keywords that are close in the embedding also have a strong connection in the content of the NYT. In this sense, the network could also reveal some unexpected connections.

## Building The Network
### Getting & Cleaning The Data


In [0]:
import json
import requests
import pickle
import pandas as pd
import numpy as np
from itertools import chain
from itertools import combinations
import math
import os

global cwd
cwd = os.getcwd()


In [0]:
def get_data(year, month):
    """
    pulls the metadata of the articles that were published during that month and saves it in archive,
    uses the nytimes archive api
    :param year: str
    :param month: str
    """
    archive_key = 'Jctp3rj1ZdOaLQiMArs79ioGnwvfK1pC'
    month_api = year + '/' + month
    if len(month) == 1:
        month = '0' + month
    data_suffix = year + '_' + month
    url = 'https://api.nytimes.com/svc/archive/v1/' + month_api + '.json?api-key=' + archive_key

    print('-------------- load', url, ' --------------')
    html = requests.get(url)  # load page
    a = html.text
    api_return = json.loads(a)
    articles = api_return['response']['docs']
    # articles = response['docs']
    df = pd.DataFrame(articles)
    with open(cwd + "/data/archive/articles_" + data_suffix + ".pickle", "wb") as f:
        pickle.dump(df, f)
        

### Data Cleaning

**Articles**: 

Before cleaning the data, the API returns 232172 articles for the three years 2016-2018. But not all of them are relevant to us and some of them are duplicates. We will clean this dataset in the following way:

- Only keep articles with more than 20 words. 

- Drop duplicate articles, keep an article with the same headline only if it appears in different sections

- Only keep articles whose type_of_material is 'News' or a briefing type (shorter articles, mainly from the Sports section). These make up about 71% of the API results that remain after dropping duplicates and articles with 20 words or fewer.

- Drop articles of sections that are not relevant for our analysis (as explained in next paragraph about section cleaning)

After cleaning we are left with 104953 articles for the three years 2016, 2017, 2018.
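
The filtering steps above can be tried out on a toy DataFrame first. This is only a minimal sketch: the rows are invented for illustration and are not taken from the NYT archive, and only the three filters (word count, duplicates per section, type of material) are shown.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration
toy = pd.DataFrame({
    "headline": ["A", "A", "A", "B", "C"],
    "section_name": ["Sports", "Sports", "World", "World", "World"],
    "word_count": [300, 300, 300, 12, 450],
    "type_of_material": ["News", "News", "News", "News", "Op-Ed"],
})

min_words = 20
# 1) keep articles with more than min_words words (drops headline "B")
kept = toy[toy.word_count > min_words]
# 2) drop duplicates: the same headline survives only once per section
kept = kept.drop_duplicates(["headline", "section_name"])
# 3) keep only the relevant types of material (drops headline "C")
kept = kept[kept.type_of_material.isin(["News", "Brief", "briefing"])]

print(kept[["headline", "section_name"]])
```

Headline "A" survives twice here because it appears in two different sections, which is the behavior described in the second bullet point.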


In [0]:

def extr_headline_main(field):
    """
    extracts main headline from api entry
    :param field: api entry structure
    :return: headline
    """
    return field['main']


def clean_articles(df, word_count):
    """
    clean the articles, only keep articles with
    - more than 20 words
    - that are certain 'type_of_material'
    - drop duplicate articles, if same headline appears in same section
    :param df: DataFrame of articles
    :param word_count: minimum amount of words, not included
    :return: cleaned DataFrame of articles
    """
    # .copy() so that the column assignments below work on a real DataFrame, not on a view
    df = df[~(df.word_count.isnull())].copy()
    df['word_count'] = df.word_count.apply(int)
    df = df[df.word_count > word_count].copy()
    df['headline'] = df.headline.apply(extr_headline_main)
    df = df.drop_duplicates(['headline', 'section_name'])
    mask = ['News', 'Brief', 'briefing']
    df = df[df.type_of_material.isin(mask)]
    return df


**Section**: 

The field "section_name" has 98 different values for the years 2016-2018 in the raw data. Since we wish to use the sections of the NYT to tag the nodes of the network, this is too many, and we need to manually reduce them to fewer sections while staying true to the NY Times website. We also drop sections that are not interesting for our analysis, e.g. 'Crosswords & Games' or 'Insider Events'. We end up with the sections: World, Business&Technology, Culture, Sports, U.S., New York, Leisure, Style, Politics, Health&Science.

After this routine, 22.4% of the articles still do not have a section_name tag. Therefore we repeated the same routine on the 'news_desk' field, which contains similar information. Now only 1.8% of the articles cannot be assigned to a section, and these get the tag 'Unknown'.
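
The two-step lookup (first section_name, then news_desk as a fallback) can be sketched as follows. The mapping function `to_meta_section` is a hypothetical, heavily shortened stand-in for the real mapping, and the three rows are invented.

```python
import pandas as pd

def to_meta_section(name):
    """Hypothetical, shortened stand-in for the real section mapping."""
    mapping = {"Asia": "World", "Foreign": "World", "Tennis": "Sports", "Metro": "New York"}
    return mapping.get(name, "*UNKNOWN*")

# Invented rows: the second and third article have no section_name
toy = pd.DataFrame({
    "section_name": ["Asia", None, None],
    "news_desk": ["Foreign", "Metro", None],
})

# first pass: map section_name
toy["section"] = toy.section_name.apply(to_meta_section)
# second pass: for rows that are still unknown, try news_desk instead
unknown = toy.section == "*UNKNOWN*"
toy.loc[unknown, "section"] = toy.loc[unknown, "news_desk"].apply(to_meta_section)

print(toy.section.tolist())
```

Only the last row stays '*UNKNOWN*', because neither of its two fields can be mapped.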



In [0]:
def getSectionDict(name):
    """
    groups section_name into 12 meta-sections
    :param name: section_name as returned by the api
    :return: name of meta-section
    """
    world = ['World', 'Africa', 'Americas', 'Asia', 'Asia Pacific', 'Australia', 'Canada', 'Europe', 'Middle East',
             'What in the World', 'Opinion | The World', 'Foreign']
    if name in world: return 'World'
    us = ['U.S.', 'National']
    if name in us: return 'U.S.'
    politics = ['Elections', 'Politics', 'Tracking Trumps Agenda', 'The Upshot', 'Opinion | Politics', 'Upshot',
                'Washington ']
    if name in politics: return 'Politics'
    ny = ['N.Y. / Region', 'New York Today', 'Metro', 'Metropolitan']
    if name in ny: return 'New York'
    business_technology = ['Business Day', 'Economy', 'Media', 'Money', 'DealBook', 'Markets', 'Energy', 'IPhone App',
                           'Media', 'Technology', 'Personal Tech', 'Entrepreneurship', 'Your Money', 'Business',
                           'SundayBusiness']
    if name in business_technology: return 'Business & Technology'
    sports = ['Skiing', 'Rugby', 'Sailing', 'Cycling', 'Cricket', 'Auto Racing', 'Horse Racing', 'World Cup',
              'Olympics', 'Pro Football', 'Pro Basketball', 'Sports', 'Baseball', 'NFL', 'College Football', 'NBA',
              'College Basketball', 'Hockey', 'Soccer', 'Golf', 'Tennis']
    if name in sports: return 'Sports'
    arts = ['Opinion | Culture', 'Arts', 'Art & Design', 'Books', 'Book Review', 'BookReview', 'Best Sellers',
            'By the Book', 'Crime', 'Children\'s Books', 'Book Review Podcast', 'Now read this', 'Dance', 'Movies',
            'Music', 'Television', 'Theater', 'Pop Culture', 'Watching', 'Culture', 'Arts&Leisure']
    if name in arts: return 'Culture'
    style = ['Men\'s Style', 'Style', 'Styles', 'TStyle', 'Fashion & Style', 'Fashion & Beauty', 'Fashion', 'Weddings',
             'Self-Care']
    if name in style: return 'Style'
    science = ['Energy & Environment', 'Science', 'Climate', 'Opinion | Environment', 'Space & Cosmos', 'Trilobites',
               'Sciencetake', 'Out There']
    health = ['Mind', 'Health Guide', 'Health', 'Health Policy', 'Live', 'Global Health', 'The New Old Age', 'Science',
              'Well', 'Move']
    sci_hel = science + health + ['Family', 'Live']
    if name in sci_hel: return 'Health & Science'
    food = ['Eat', 'Wine, Beer & Cocktails', 'Restaurant Reviews', 'Dining', 'Food']
    travel = ['36 Hours', 'Frugal Traveler', '52 Places to go', 'Travel']
    magazine = ['Smarter Living', 'Wirecutter', 'Automobiles', 'T Magazine', 'Magazine', 'Design & Interiors',
                'Entertainment', 'Video', 'Weekend']
    leisure = food + travel + magazine
    if name in leisure: return 'Leisure'
    opinion = ['Opinion', 'Letters', 'Contributors', 'Editorials', 'Columnists', 'OpEd', 'Sunday Review', 'Games',
               'Editorial']
    realestate = ['Real Estate', 'RealEstate', 'Commercial Real Estate', 'The High End', 'Commercial', 'Find a Home',
                  'Mortgage Calculator', 'Your Real Estate', 'List a Home']
    education = ['Education', 'Education Life', 'The Learning Network', 'Lesson Plans', 'Learning']
    delete = (['Blogs', 'Insider Events', 'Retirement', 'América', 'Multimedia/Photos', 'The Daily',
               'Briefing', 'Sunday Review', 'Crosswords & Games', 'Times Insider', 'Corrections', 'NYTNow',
               'Corrections', 'Podcasts', 'Insider', 'Obits', 'Summary']
              + opinion + education + realestate)
    if name in delete: return '*DELETE*'
    else: return '*UNKNOWN*'

    
def clean_sections(df):
    """
    uses getSectionDict to rename sections to their meta-section
    :param df: DataFrame of articles
    :return: DataFrame of articles, section renamed
    """
    df['section'] = df.section_name.apply(lambda x: getSectionDict(x))
    without_section = df[df.section == '*UNKNOWN*']  # the articles that haven't had a section_name,
                                                     # many of them have news_desk entry
    sections_from_newsdesk = without_section.news_desk.apply(getSectionDict)
    # the assignment aligns on the index, so only the '*UNKNOWN*' rows are overwritten
    df.loc[sections_from_newsdesk.index, 'section'] = sections_from_newsdesk
    return df

### Create Keyword Table
In order to create a graph of keywords we first need to gather some information about them. We are mainly interested in which section each keyword belongs to, and we want to translate every keyword into a keyword id, so that we only need to store a number instead of the whole string.
Creating the table of keywords was more complicated than we first thought, because the DataFrames for the individual years got very large and the process needed many lookups.

For runtime and memory reasons, we processed the articles year-wise and then combined the yearly keyword tables again to find out the final and correct numbers. This is especially the case when we wanted to assign the keywords to sections. For that task, we counted how many times the keyword appeared in each section, and when we actually had the data of the whole timeframe in memory, we checked if one section stands out enough to assign the keyword to it. 
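
The section assignment described above can be illustrated on a small example. This is a sketch under assumptions: the counts are invented, and a pivot table is used here for brevity, whereas the notebook's own function iterates over the keywords one by one.

```python
import pandas as pd

# Hypothetical yearly counts for two keywords (several rows per keyword and section)
partial = pd.DataFrame({
    "keyword": ["Tennis", "Tennis", "Tennis",
                "New York City", "New York City", "New York City", "New York City"],
    "section": ["Sports", "Sports", "World",
                "New York", "Culture", "U.S.", "World"],
    "counts": [40, 50, 10, 30, 30, 25, 15],
})
threshold = 0.35

# one row per keyword, one column per section, yearly counts summed up
wide = partial.pivot_table(index="keyword", columns="section",
                           values="counts", aggfunc="sum", fill_value=0)

share = wide.max(axis=1) / wide.sum(axis=1)   # share of the most frequent section
dominant = wide.idxmax(axis=1)                # name of the most frequent section
# keep the dominant section only if it reaches the threshold
assigned = dominant.where(share >= threshold, "*UNSPECIFIC*")

print(assigned)
```

'Tennis' appears in Sports in 90% of its occurrences and is assigned to that section, while no section reaches 35% for 'New York City', so it is tagged '*UNSPECIFIC*'.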

In [0]:

def extr_keywords_step1(field):
    """
    brings entry as it comes from api in more handy format
    :param field: 'keywords' entry of api
    :return: tuple (name, value)
    """
    keyword = field
    keyword_tup = (keyword['name'], keyword['value'])
    return keyword_tup


def create_keyword_table_partial(df):
    """
    uses article DataFrame to create table of keywords. How often keyword appeared in which section
    :param df: articles DataFrame
    :return: DataFrame of keywords (keyword, section, counts)
    """
    dfs = df[['_id', 'section', 'pub_date', 'headline', 'keywords']]
    # expand columns from keyword_dict
    d1 = dfs.keywords.apply(pd.Series).merge(dfs, left_index=True, right_index=True).drop(["keywords"], axis=1)
    # columns are additional rows
    d2 = d1.melt(id_vars=['_id', 'section', 'pub_date', 'headline'], value_name="keyword").drop("variable", axis=1)

    mask = d2.keyword.isna()
    d3 = d2[~mask]

    d3 = d3.sort_values(by=['pub_date', '_id'])

    d3['keyword'] = d3.keyword.apply(lambda x: extr_keywords_step1(x))

    keyword_table = d3[['keyword', 'section', '_id']]
    table = keyword_table.groupby(by=['keyword', 'section']).count()
    table = table.reset_index()
    table.columns = ['keyword', 'section', 'counts']
    return table


def create_keyword_table(table, threshold, article_amount):
    """
    table: table of keywords where one keyword can have multiple rows, if it appeared in different sections
    function reduces this table to keyword_table, where each keyword only appears once and section is the most likely
    section (if section is more frequent than threshold value), if no section stands out, tag as '*UNSPECIFIC*'
    :param table: table of keywords
    :param threshold: to what percentage keyword needs to appear in one section, that this section overweights
    the others
    :param article_amount: amount of articles of full data set, used to calculate frequency of keywords
    :return: table of keywords where every keyword only appears once
    """
    keyword_table = pd.DataFrame([['keyword', 'name', 'value', 0, 'section']],
                                 columns=['keyword', 'name', 'value', 'counts', 'section'])
    unique_keywords = table.keyword.unique()
    for i, kw in enumerate(unique_keywords):
        if i % 100 == 0:
            print(str(i) + ' / ' + str(len(unique_keywords)))

        entries = table[table.keyword == kw]
        # sum the yearly counts of this keyword per section
        section_counts = entries.groupby('section')['counts'].sum()
        max_count = section_counts.max()
        total_counts = section_counts.sum()
        if max_count >= threshold*total_counts:
            section = section_counts.idxmax()
        else:
            section = '*UNSPECIFIC*'
        new_row = pd.DataFrame(data=  [[ kw,        kw[0],  kw[1],   total_counts, section]],
                               columns=['keyword', 'name', 'value', 'counts',     'section'])
        keyword_table = pd.concat([keyword_table, new_row])

    # ids and log-probabilities only need to be computed once, after all rows are collected
    keyword_table['id'] = range(0, keyword_table.shape[0])
    keyword_table['prob'] = np.log(keyword_table.counts / article_amount)
    keyword_table = keyword_table[1:].copy()  # drop the placeholder row

    # weight for how many edges we reduce later
    idf = np.log(article_amount / keyword_table.counts)
    keyword_table['idf'] = idf / max(idf)

    return keyword_table


def extr_keywords(field, table_keywords):
    """
    translate keywords structure as it comes from api to list of keywords ids (ids from table_keywords)
    :param field: article keywords as it comes from api
    :param table_keywords: table of keywords (created by create_keyword_table)
    :return: list of keyword ids
    """
    keyword_list = list()
    for keyword in field:
        try:
            # .iloc[0] raises an IndexError if the keyword is not in table_keywords
            keyword_id = table_keywords.id[
                (table_keywords.name == keyword['name']) &
                (table_keywords.value == keyword['value'])].iloc[0]
            keyword_list.append(keyword_id)
        except IndexError:
            pass
    return keyword_list


In [0]:

def main_articles_keywords():
    for year in ['2016', '2017', '2018']:
        # get and save articles
        for m in range(1,13):
            month = str(m)
            get_data(year, month)

        # load articles, clean them
        # concat dfs to df_year and then clean and translate keywords
        for m in range(1, 13):
            month = str(m)
            if len(month) == 1:
                month = '0' + month
            suffix = year + "_" + month
            print(suffix)

            with open(cwd + "/data/archive/articles_" + suffix + ".pickle", "rb") as f:
                df_new = pickle.load(f)

            if m == 1:
                df_year = df_new
            else:
                df_year = pd.concat([df_year, df_new], ignore_index=True)

        # save the concatenated articles of the whole year
        with open(cwd + "/data/archive/articles_" + year + ".pickle", "wb") as f:
            pickle.dump(df_year, f)

        print(df_year.shape)
        df_year = clean_articles(df=df_year, word_count=20)
        df_year = clean_sections(df_year)
        # drop sections that are not interesting for keyword-analysis
        df_year = df_year[~(df_year['section'] == '*DELETE*')]
        print(df_year.shape)

        with open(cwd + "/data/archive/articles_" + year + "_clean.pickle", "wb") as f:
            pickle.dump(df_year, f)

        # create keyword table for one year
        table_year = create_keyword_table_partial(df_year)
        with open(cwd + "/data/table_keywords_partial_" + year + ".pickle", "wb") as f:
            pickle.dump(table_year, f)

    # combine keyword tables of single years to full keyword table
    for i, year in enumerate(['2016', '2017', '2018']):

        with open(cwd + "/data/archive/articles_" + year + "_clean.pickle", "rb") as f:
            df_year = pickle.load(f)
        with open(cwd + "/data/table_keywords_partial_" + year + ".pickle", "rb") as f:
            table_year = pickle.load(f)

        if i == 0:
            table = table_year
            df = df_year
        else:
            table = pd.concat([table, table_year], ignore_index=True)
            df = pd.concat([df, df_year], ignore_index=True)
        print(df.shape, table.shape)

    with open(cwd + "/data/archive/df_16-18.pickle", "wb") as f:
        pickle.dump(df, f)
    with open(cwd + "/data/archive/df_16-18.pickle", "rb") as f:
        df = pickle.load(f)

    article_amount = df.shape[0]

    # combine keyword_tables from different years (counts, idf, major section)
    table_keywords = create_keyword_table(table, 0.35, article_amount)

    with open(cwd + "/data/table_keywords_16-18.pickle", "wb") as f:
        pickle.dump(table_keywords, f)

    # use this table to translate keyword to ids
    df['keywords'] = df.keywords.apply(lambda x: extr_keywords(x, table_keywords))
    with open(cwd + "/data/df_16-18.pickle", "wb") as f:
        pickle.dump(df, f)


## The Graph

To explore the graph, download the folder [here](https://drive.google.com/open?id=1OIffBrPUZ9WZZCbGsj8HEwXQU-ANojZv) and open index.html in a browser, e.g. Firefox, via file:///home/.../keyword-graph/index.html.

### Nodes
Each keyword is represented as a node, and the color indicates the section in which the keyword appeared most often, provided that this section accounts for at least 35% of the keyword's occurrences. Otherwise the keyword gets the tag 'Unspecific' (color: Lavender). 'Unknown' (color: Beige) is the tag for keywords whose articles could not be assigned to a section (see the paragraph about Section).

### Edges
Edges connect keywords that appeared together in an article. The weight of an edge is the sum of the two conditional probabilities, where the probability of a keyword is defined as the fraction of articles that have this keyword.
$$
W(A, B) = P (A|B) + P (B|A) = \frac{P(A \cap B)}{P(B)} + \frac{P(B \cap A)}{P(A)}
$$
In words: The conditional probability that keyword A appears in an article that contains keyword B is equal to the probability that keywords A and B appear in the same article divided by the probability with which keyword B appears. 

For example:
$$
\begin{aligned}
W(\text{'Musk, Elon'}, \text{'Boring Company'})
&= P(\text{'Musk, Elon'} | \text{'Boring Company'}) + P(\text{'Boring Company'} | \text{'Musk, Elon'}) \\
&= 85.71 \% + 3.35 \% = 89.06 \%
\end{aligned}
$$
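
To make the example concrete, here is a small sketch with article counts. The counts (179, 7 and 6) and the corpus size are assumptions chosen so that the two conditional probabilities match the percentages above; they are not taken from the data. The second half repeats the computation in log scale, which is how the probabilities are stored in the keyword and edge tables.

```python
import numpy as np

# Hypothetical counts, chosen to reproduce the percentages of the example
n_articles = 100_000   # assumed corpus size (cancels out in the result)
n_musk = 179           # articles tagged 'Musk, Elon'
n_boring = 7           # articles tagged 'Boring Company'
n_both = 6             # articles tagged with both keywords

# conditional probabilities directly from the counts
p_musk_given_boring = n_both / n_boring
p_boring_given_musk = n_both / n_musk
weight = p_musk_given_boring + p_boring_given_musk

# same computation with log-probabilities: log P(A|B) = log P(A and B) - log P(B)
def log_prob(n):
    return np.log(n / n_articles)

weight_log = (np.exp(log_prob(n_both) - log_prob(n_boring))
              + np.exp(log_prob(n_both) - log_prob(n_musk)))

print(round(100 * p_musk_given_boring, 2), round(100 * p_boring_given_musk, 2))
print(100 * weight, 100 * weight_log)
```

The weight is dominated by the rarer keyword: almost every 'Boring Company' article also mentions 'Musk, Elon', but not the other way around.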


In [0]:

def keyword_edges(field):
    """
    creates list of edges between keywords
    :param field: list of keyword ids
    :return: list of edges
    """
    field.sort()
    edges = []
    for subset in combinations(field, 2):
        edge = str(subset[0]) + ',' + str(subset[1])
        edges.append(edge)
    return edges


def edge_weight(edges_row, table_keywords):
    """
    calculates weight of edge based on conditional probability.
    In the beginning the probabilities are in log-scale, for the weight they are translated to normal scale
    :param edges_row: one row from edges table (one edge with information)
    :param table_keywords: keywords DataFrame
    :return: weight of edge
    """
    # log P(Source | Target) = log P(Source and Target) - log P(Target)
    p1 = (edges_row.prob - table_keywords[table_keywords.id == edges_row.Target].prob).iloc[0]
    # log P(Target | Source) = log P(Source and Target) - log P(Source)
    p2 = (edges_row.prob - table_keywords[table_keywords.id == edges_row.Source].prob).iloc[0]
    p1, p2 = np.exp(p1), np.exp(p2)
    p = (p1 + p2)*100
    return p


def edges_nodes(article_keywords, table_keywords, article_amount):
    """
    creates edges and nodes of the article keyword
    :param article_keywords: keywords of articles, every article has a list of keywords
    :param table_keywords: keywords DataFrame
    :param article_amount: amount of articles (same as rows in article_keywords)
    :return: edges DataFrame, nodes DataFrame
    """
    edges_list = article_keywords.apply(keyword_edges).tolist()  # one list of edges per article
    edges_df = pd.Series(list(chain.from_iterable(edges_list)))  # write everything in one list
    edges_counts = edges_df.value_counts()

    edges = pd.DataFrame([x.split(',') for x in edges_counts.index], columns=['keyword_1', 'keyword_2'])
    edges['Source'] = edges.keyword_1.astype(int)
    edges['Target'] = edges.keyword_2.astype(int)
    edges['Counts'] = edges_counts.to_numpy()  # same order as edges_counts.index

    e = edges[['Source', 'Target', 'Counts']]

    # only keep edges where both Source and Target are in table_keywords
    e_red = e[e.Source.isin(table_keywords.id) & e.Target.isin(table_keywords.id)].copy()

    e_red['prob'] = np.log(e_red.Counts/article_amount)
    e_red['Weight'] = e_red.apply(lambda x: edge_weight(x, table_keywords), axis=1)

    t = table_keywords[['id', 'section', 'value']]
    # keep every keyword that is Source or Target of at least one edge
    mask = t.id.isin(e_red.Source) | t.id.isin(e_red.Target)
    n = t[mask].copy()
    n.columns = ['id', 'Section', 'Label']

    return e_red, n

### Reducing The Number Of Edges And Nodes

If we consider the data of all 3 years, we get around 65,000 nodes and 800,000 edges, which would make our graph very difficult to grasp. Therefore we tried to reduce the size of the graph while still keeping all of the interesting nodes and edges.

First, for each node we only kept the most frequent 35% of its edges to other nodes, but only if the edge is in the top 35% of both nodes (similar to mutual k-nearest neighbours). With the 35% rule, Trump can keep 849 edges, but he is also included in the top 35% of 1951 other nodes. If we only consider the mutual top 35%, Trump is left with 847 edges, which means that two of his edges did not appear in the top 35% of the node at the other end.

Compared to a global cut that keeps the 35% most frequent edges overall, this node-wise 35% rule ensures that we also keep the nodes and edges of less common topics that would otherwise disappear from the graph.
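
The mutual top-35% rule can be sketched on a tiny graph. The four keywords and their co-occurrence counts are invented, and plain dictionaries are used instead of the DataFrames and the adjacency matrix of the actual implementation.

```python
import math

# Hypothetical symmetric co-occurrence counts between four keywords
counts = {
    "A": {"B": 9, "C": 5, "D": 1},
    "B": {"A": 9, "C": 6},
    "C": {"A": 5, "B": 6, "D": 7},
    "D": {"A": 1, "C": 7},
}
share = 0.35

def top_neighbours(neighbours, share):
    """Return the most frequent `share` of a node's neighbours (rounded up)."""
    k = math.ceil(len(neighbours) * share)
    ranked = sorted(neighbours, key=neighbours.get, reverse=True)
    return set(ranked[:k])

top = {node: top_neighbours(nb, share) for node, nb in counts.items()}

# an edge survives only if each endpoint is in the other's top list
mutual_edges = {
    tuple(sorted((a, b)))
    for a in top
    for b in top[a]
    if a in top[b]
}
print(sorted(mutual_edges))
```

In this toy graph the edge A-C is in the top list of A but not of C, so it is dropped, just like the two edges that Trump loses in the example above.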

So far we made sure that the less common topics also stay in the graph. Another problem that leads to a messy graph is that some keywords appear in disproportionately many articles compared to the others. This mainly affects hyper-keywords like "Politics and Government", "United States" and "Trump, Donald J" (of course). Therefore the Inverse Document Frequency value (idf-value) is used to reduce the weight of these nodes. Instead of keeping the top 35% of the edges as proposed above, we only keep idf-value*35% (Trump can then only keep 7% of his edges). The idf-value is calculated as below and then normalized by the maximum value, so that we get values between 0 and 1, where 1 means the keyword appears the least often and 0 means the keyword appears in every article.
$$
\text{idf}(k) = \log \left(\frac{\# \text{articles}}{\# \text{articles with keyword } k}\right)
$$
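
As a quick illustration of the normalization, the following sketch computes the normalized idf-value and the resulting share of edges a node may keep. The corpus size and the keyword frequencies are invented.

```python
import numpy as np

n_articles = 1000                                     # hypothetical corpus size
articles_with_keyword = np.array([1000, 500, 10, 1])  # hypothetical keyword frequencies

idf = np.log(n_articles / articles_with_keyword)
idf_norm = idf / idf.max()      # 0: keyword is in every article, 1: rarest keyword
share_kept = 0.35 * idf_norm    # fraction of its edges each keyword may keep

print(idf_norm)
print(share_kept)
```

A keyword that appears in every article would keep none of its edges, while the rarest keyword keeps the full 35%.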

In [0]:
def reduce_edges(edges, nodes, percentage, table_keywords, min_edges):
    """
    reduces the edges according to following:
    - only keep edges that are in top mutual 'percentage'% edges of their nodes
    - only keep nodes that have at least min_edges edges
    :param edges: edges DataFrame
    :param nodes: nodes DataFrame
    :param percentage: cutoff percentage
    :param table_keywords: keywords DataFrame
    :param min_edges: minimum amount of edges per node, included
    :return: lower dimensional edges DataFrame, lower dimensional nodes DataFrame
    """
    # find top x% of edges to each node
    # matrix of edges, nodes*nodes
    mat = np.zeros([nodes.id.max()+1, nodes.id.max()+1])
    for keyword_id in nodes.id:
        # the other keywords that keyword_id is connected to
        connected_t = edges[edges.Source == keyword_id][['Target', 'Counts']]
        connected_t.columns = ['Node', 'Counts']
        connected_s = edges[edges.Target == keyword_id] [['Source', 'Counts']]
        connected_s.columns = ['Node', 'Counts']

        total_connections = pd.concat([connected_s, connected_t])
        # .iloc[0] extracts the scalar idf-value of this keyword
        idf = table_keywords.idf[table_keywords.id == keyword_id].iloc[0]
        max_edges = math.ceil(total_connections.shape[0]*percentage*idf)
        tc = total_connections.sort_values(by='Counts', ascending=False)
        tc = tc[:max_edges]

        # entry = 1 if edge is in top x% of row-node
        mat[keyword_id, tc.Node.tolist()] = 1

    # only keep the edges that are in top x% of row-node AND column-node
    keep_edges = dict()
    for keyword_id in nodes.id:
        keyword_has = mat[keyword_id, :]
        keyword_appears_in = mat[:, keyword_id]

        l1 = np.nonzero(keyword_has)[0].tolist()
        l2 = np.nonzero(keyword_appears_in)[0].tolist()
        intersection = set(l1) & set(l2)
        dict_update = {keyword_id: intersection}
        keep_edges.update(dict_update)

    mask = []
    for idx in range(0, edges.shape[0]):
        mask.append(edges.Target[idx] in keep_edges[edges.loc[idx].Source])

    edges_reduced = edges[mask]

    # count the remaining edges per node (as Source plus as Target)
    s = edges_reduced.Source.value_counts()
    t = edges_reduced.Target.value_counts()
    degree = s.add(t, fill_value=0)

    # remove the nodes that are not left after the x% filtering
    nodes.index = nodes.id.tolist()
    nodes_reduced = nodes.loc[degree.index.tolist()]

    # drop nodes that don't have enough edges
    idx = degree[degree > min_edges].index.tolist()
    nodes_reduced = nodes_reduced[nodes_reduced.id.isin(idx)]

    # drop edges where we had one of those nodes
    mask = [all(tup) for tup in zip(edges_reduced.Source.isin(idx), edges_reduced.Target.isin(idx))]

    edges_reduced = edges_reduced[mask]

    return edges_reduced, nodes_reduced


  
def translate_id(table_keywords, edges, nodes):
    """
    resets the keyword id in table_keywords to index of this table, in case some of the rows were deleted (ids would be missing)
    renames ids in edges and nodes accordingly
    :param table_keywords: keywords DataFrame
    :param edges: edges DataFrame
    :param nodes: nodes DataFrame
    :return:
    """
    tr = pd.DataFrame(table_keywords.id)
    tr['id_new'] = tr.index

    edges_Source = pd.merge(edges, tr, left_on='Source', right_on='id')
    edges_Source.columns = ['Source_old', 'Target', 'Counts', 'prob', 'Weight', 'id', 'Source']
    edges_Target = pd.merge(edges_Source, tr, left_on='Target', right_on='id')
    edges_Target.columns = ['Source_old', 'Target_old', 'Counts', 'prob', 'Weight', 'id_x', 'Source', 'id_y', 'Target']
    edges = edges_Target[['Source', 'Target', 'Counts', 'prob', 'Weight']]

    nodes_new = pd.merge(nodes, tr, left_on='id', right_on='id')
    nodes_new.columns = ['id_old', 'Section', 'Label', 'id']
    nodes = nodes_new[['id', 'Section', 'Label']]

    table_keywords.id = tr.id_new
    return table_keywords, edges, nodes

In [0]:

def main_keywordgraph():

    with open(cwd + "/data/table_keywords_16-18.pickle", "rb") as f:
        table_keywords = pickle.load(f)

    with open(cwd + "/data/df_16-18.pickle", "rb") as f:
        df = pickle.load(f)

    article_amount = df.shape[0]
    keywords = df.keywords

    # only use keywords that will be relevant for us later,
    # because will sort out less frequent ones in reduce_edges anyways
    min_edges = 2
    percentage = 0.35
    table_keywords = table_keywords[table_keywords.counts >= min_edges / percentage]
    table_keywords.index = range(0, table_keywords.shape[0])


    edges, nodes = edges_nodes(keywords, table_keywords, article_amount)
    print(edges.shape, nodes.shape)

    with open(cwd + "/data/edges_16-18.pickle", "wb") as f:
        pickle.dump(edges, f)
    with open(cwd + "/data/nodes_16-18.pickle", "wb") as f:
        pickle.dump(nodes, f)


    # reduce_edges
    table_keywords, edges, nodes = translate_id(table_keywords, edges, nodes)

    edges_reduced, nodes_reduced = reduce_edges(edges, nodes, percentage, table_keywords, min_edges)
    print(edges_reduced.shape, nodes_reduced.shape)

    series = '02'
    name = 'idf-mutual_16-18_3'
    nodes_reduced.to_csv(cwd + '/data/gephi/' + series + 'nodes_' + name + '.csv', sep=';', index=False)
    edges_reduced.to_csv(cwd + '/data/gephi/' + series + 'edges_' + name + '.csv', sep=';', index=False)


## Evaluation

To explore the full graph, download the folder [here](https://drive.google.com/open?id=1OIffBrPUZ9WZZCbGsj8HEwXQU-ANojZv) and open index.html in a browser, e.g. Firefox, via file:///home/.../keyword-graph/index.html.


It was created with Gephi (with layout algorithm: Force Atlas 2) and exported with the sigma exporter extension.

The obvious problem with visualizing network graphs is that the graph itself is multidimensional, whereas the visualization can only capture two dimensions. So the visualization inevitably swallows some of the relationships in the network.

In our case the network is nicely structured in sections where the topics do not overlap much.
For example, the keywords of the Sports section mainly refer to the names of the players, their teams and the leagues, which are all unique to each discipline in Sports. This results in clusters that represent the different disciplines (Baseball, Basketball, Soccer, Football, Golf, Tennis).

https://drive.google.com/open?id=1yXb25nJC8Ry4Bxgl-WZ2N-vts9lNqdZZ

The Culture section is also clean enough to form sub-clusters according to the areas Music, Theater, Movies and Books.
The advantage of those two sections (Sports and Culture) is that they have clear sub-sections and do not overlap too much with the other sections. Of course Sports has edges to World, which is why it is located next to it, but these edges are not as strong as the edges within Sports itself.

https://drive.google.com/open?id=1tzXKhmIMfg9AT16vldnNaGpEI5p54iiu


The problem of dimensionality reduction becomes obvious where:
- Topics overlap a lot inside one section (for example, the node 'elections' is placed somewhere in the middle of the World section, because it is connected to many countries).
- Sections do not have clear borders to other sections. Many of the topics that are covered in the New York section could also be part of U.S. or Business&Technology.

So we find the sections World, U.S., New York and Business&Technology very close together in the middle of the graph. The nodes are still pushed towards one corner or the other according to their section, and there is still some structure inside the sections (e.g. Israel-Palestinians-Jerusalem-Netanyahu-... all next to each other). But the sections overlap heavily and we do not see clear sub-clusters like in the sections Sports and Culture.

https://drive.google.com/open?id=1-fkdz7I2JpFXKMlh5R9kvc552Ni045TN

Keywords with tag 'Unspecific':
When we assigned the keywords to the section that they appeared in the most, we gave the tag 'Unspecific' to those keywords that appear almost equally often in at least two sections and therefore cannot be assigned to one specific section.
The behavior of those keywords also shows the problem of the section overlap in certain areas. The 'Unspecific' keywords mainly appear in the middle of the graph, the blurry part (World, U.S., Business&Technology, ...), but hardly appear in the well-clustered part of the graph (Sports, Culture).