## Keyword Tagger Script

This notebook presents two methods for tagging the keywords of a text. The texts used here are the articles pulled from the Poliflw API. The two methods are total word count and term frequency-inverse document frequency (tf-idf).

#### Import section

The section below imports all the necessary libraries. Some are standard, such as numpy, pandas and json.
The nltk library is very useful for natural language processing (text mining). It provides stop-word lists, stemmers and tokenizers. On GitHub we found a part-of-speech tagger for Dutch texts that was trained on about 8 million records of Dutch sentences and words.

In [4]:
import pandas as pd
import numpy as np
import requests
import json
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('dutch'))
extra_stop_words = ['waar', 'onze', 'weer', 'daarom']
stop_words.update(extra_stop_words)
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("dutch")

# the section below imports a trained Dutch word-tagger we found on Github
from nltk.tag.perceptron import PerceptronTagger
import os
os.chdir(r"C:\Users\basje\Documents\Young Mavericks - Intern\Hackathon HvU\WordTaggerDutch")
tagger = PerceptronTagger(load=False)
tagger.load('model.perc.dutch_tagger_large.pickle')

#### Functions for accessing the Poliflw API

The two functions in this section download all the political articles through the Poliflw API and dump the downloaded data to a JSON file.

In [5]:
def FIND_ALL_DOCUMENTS_POLIFLOW():
    """
    This function retrieves all available articles through the Poliflw API.
    """
    BASE_URL = 'https://api.poliflw.nl/v0/search?scroll=1m'

    scroll_id = ''    
    total_results, total_size, size = 0, 0, 100
     
    all_data = []
    while not total_size or total_results < total_size:
        if scroll_id:
            result = requests.get(BASE_URL + '&size=' + str(size) + '&scroll_id=' + scroll_id)
        else:
            result = requests.get(BASE_URL + '&size=' + str(size))
    
        data = result.json()
    
        scroll_id = data['meta']['scroll']
        total_size = data['meta']['total']
    
        total_results += size
    
        print('%s/%s' % (total_results, total_size))
    
        if 'item' in data:
            all_data += data['item']
    
    return all_data
    
def SAVE_ALL_DOCUMENTS_POLIFLOW(data):
    """
    This function dumps the found data in a json-file.
    """
    with open('DataDump_ArticlesPoliflow.json', 'w') as OUT:
        json.dump(data, OUT)
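
The scroll loop above can be tried without network access. The following is a minimal sketch, assuming the response shape used in the function above (`meta.scroll`, `meta.total`, `item`); `fetch_page` and `make_fake_api` are hypothetical helpers invented for this illustration, not part of the Poliflw API. Unlike the function above, it counts the items actually received and stops when a page comes back empty.

```python
def collect_all_items(fetch_page):
    """Collect items from a scroll-style API until everything has been fetched.

    `fetch_page` is a hypothetical callable: it takes the current scroll id
    (or None for the first request) and returns a dict shaped like
    {'meta': {'scroll': ..., 'total': ...}, 'item': [...]}.
    """
    all_items = []
    scroll_id = None
    while True:
        data = fetch_page(scroll_id)
        items = data.get('item', [])
        if not items:
            break  # an empty page means there is nothing left to fetch
        all_items.extend(items)
        scroll_id = data['meta']['scroll']
        total = data['meta']['total']
        print('%s/%s' % (len(all_items), total))
        if len(all_items) >= total:
            break
    return all_items


def make_fake_api(documents, page_size):
    """Build an in-memory stand-in for the API so the loop can be tried offline."""
    def fetch_page(scroll_id):
        start = int(scroll_id) if scroll_id else 0
        page = documents[start:start + page_size]
        return {'meta': {'scroll': str(start + len(page)), 'total': len(documents)},
                'item': page}
    return fetch_page


fake_documents = [{'title': 'article %d' % i} for i in range(7)]
collected = collect_all_items(make_fake_api(fake_documents, page_size=3))
```

With a real API, `fetch_page` would wrap the `requests.get` call; that wiring is left out here on purpose.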

#### Functions for importing and refining data

The first function in this section transforms the JSON file with all the articles into a pandas dataframe. The
second function removes all the empty articles from the dataframe and resets the index.

In [6]:
def JSONfile_toDataframe(json_file):
    """ 
    This function converts a json-file into a pandas dataframe.
    The most important information per article will be stored in the dataframe.
    """
    
    # Empty lists to fill with article data
    dates = []
    date_granularities = []
    descriptions = []
    enrichments = []
    locations = []
    parties = []
    politicians = []
    sources = []
    titles = []
    
    all_lists = [dates,
                 date_granularities,
                 descriptions,
                 enrichments,
                 locations, 
                 parties,
                 politicians,
                 sources,
                 titles]
    
    all_searches = ['date',
                    'date_granularity',
                    'description', 
                    'enrichments',
                    'location',
                    'parties',
                    'politicians',
                    'source',
                    'title']
    
    # Loop through requested articles
    for i in range(len(json_file)):
        for j, searchkey in enumerate(all_searches):
            try:
                all_lists[j].append(json_file[i][searchkey])
            except KeyError:
                all_lists[j].append(None)

    # Save as DataFrame and return
    DF_search = pd.DataFrame(
        {'date':dates,
         'date granularity': date_granularities,
         'description': descriptions,
         'enrichments': enrichments,
         'location': locations,
         'parties': parties,
         'politicians': politicians,
         'source': sources,
         'title': titles
        })
    
    return DF_search

def RefineDataframe(dataframe):
    """
    This function removes the entries from the dataframe that do not contain a description.
    """
    empty_list = []
    
    # loop over dataframe to record the empty entries
    for i in range(len(dataframe)):
        if dataframe.description[i] is None:
            empty_list.append(i)
    
    # remove the empty entries by index and reset the index
    dataframe.drop(empty_list, inplace = True)
    dataframe = dataframe.reset_index()
    
    return dataframe
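
The same extraction and filtering can be sketched in plain Python, which makes the handling of missing keys explicit. This is an illustrative alternative under stated assumptions, not the notebook's method: `extract_records` and `drop_without_description` are hypothetical names, and the sample items are made up.

```python
FIELDS = ['date', 'date_granularity', 'description', 'enrichments',
          'location', 'parties', 'politicians', 'source', 'title']


def extract_records(items, fields=FIELDS):
    """Keep only the listed fields per article; missing keys become None."""
    return [{field: item.get(field) for field in fields} for item in items]


def drop_without_description(records):
    """Discard the records whose description is missing."""
    return [record for record in records if record['description'] is not None]


# made-up sample items, only for illustration
sample_items = [
    {'description': 'Eerste artikel', 'location': 'Groningen', 'parties': ['GroenLinks']},
    {'location': 'Ermelo', 'parties': ['SGP']},  # no description
]
records = drop_without_description(extract_records(sample_items))
```

Passing `records` to `pd.DataFrame` would give a dataframe with one column per field.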

In [8]:
# path to the data    
file_path = r'C:\Users\basje\Documents\Young Mavericks - Intern\Junior Hackathon\Data Dump\all_poliflw_items.json'

# load the data by opening the json file
with open(file_path, encoding='utf-8') as data:
    
    # convert json file to a pandas dataframe
    DF_all = JSONfile_toDataframe(json.load(data))
    
    # refine the dataframe by removing entries that contain no article text
    DF_all = RefineDataframe(DF_all)
    
DF_all.head(10)

| | index | date | date granularity | description | enrichments | location | parties | politicians | source | title |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2013-07-10T09:34:43 | 12.0 | Verkleining gemeenteraden gaat definitief niet... | {} | Groningen | [GroenLinks] | | Facebook | |
| 1 | 1 | 2013-05-26T10:40:32 | 12.0 | Zaterdag 1 juni: Bijeenkomst over krimp in Gro... | {} | Groningen | [GroenLinks] | | Facebook | |
| 2 | 2 | 2012-05-09T21:30:32 | 12.0 | Wethouder Philip Broeksma opende samen met col... | {} | Groningen | [GroenLinks] | | Facebook | |
| 3 | 3 | 2013-08-27T16:16:33 | 12.0 | Vanavond eerste steunfractie overleg na de hee... | {} | Groningen | [GroenLinks] | | Facebook | |
| 4 | 4 | 2012-08-14T13:29:56 | 12.0 | De vakantie loopt voor een heleboel mensen lan... | {} | Groningen | [GroenLinks] | | Facebook | |
| 5 | 5 | 2017-05-11T00:00:00 | 12.0 | `<div class="text">&#13;\n\t\t\t\t\t\t\t<div id...` | {} | Ridderkerk | [SGP] | | Partij nieuws | Herman Dooyeweerd en participatie |
| 6 | 6 | 2015-06-13T00:00:00 | 12.0 | `<div class="text">&#13;\n\t\t\t\t\t\t\t<div id...` | {} | Ermelo | [SGP] | | Partij nieuws | SGP-Informatieavond (15 juni 2015): Doet de PA... |
| 7 | 7 | 2017-10-03T10:59:12 | 12.0 | `<p>Het bestuur van D66 Renkum heeft de concept...` | {} | Renkum | [D66] | | Partij nieuws | Nieuwe namen op kandidatenlijst D66 Renkum |
| 8 | 8 | 2017-06-03T10:55:18 | 12.0 | `<p><span style="color: #1d2129;"><i>Dinsdagavo...` | {} | Rheden | [D66] | | Partij nieuws | Een streep door het blowverbod |
| 9 | 9 | 2013-12-17T00:00:00 | 12.0 | `<div class="text">&#13;\n\t\t\t\t\t\t\t<div id...` | {} | Ermelo | [SGP] | | Partij nieuws | Ermelo en zijn agrariers |


#### Functions for text cleaning

The functions in this section clean the texts. The process consists of several steps, which are all
combined in the last function: removing HTML markup, removing URLs, removing special characters, removing stop
words, and removing party names, city names and short words. The texts are also tokenized and tagged so that only nouns are kept.

In [9]:
def TextMining_CleanTextRegex(text, regex_statement, replacement):
    """
    This function replaces every match of the given regex statement in the text with the replacement.
    For example, the statement '<.*?>' with an empty replacement clears all html tags between '<' and '>'.
    """
    pattern = re.compile(regex_statement)
    return re.sub(pattern, replacement, text)

def TextMining_TokenizeText(text):
    """
    This function tokenizes the text.
    """
    return nltk.word_tokenize(text)

def TextMining_CleanNonNouns(text, word_tagger):
    """
    This function removes all non-noun words. It uses a tagger loaded with a model that was trained on 8
    million records of Dutch sentences / words. The tagger labels all the words in the article and the
    function keeps only the words that are labeled as nouns.
    """
    tagged = word_tagger.tag(text)
    return [t[0] for t in tagged if 'noun' in t[1]]

def TextMining_CleanStopWords(tokenized_text, stop_words):
    """
    This function removes all the stop words from an article. Stop words are frequently occurring words that
    carry relatively little meaning / weight. The remaining words are converted to lowercase.
    """
    return [word.lower() for word in tokenized_text if word.lower() not in stop_words]

def TextMining_CleanSmallWords(tokenizetext, min_wordsize=2):
    """
    This function removes all words that are shorter than min_wordsize characters (two by default).
    """
    return [word for word in tokenizetext if len(word) >= min_wordsize]

def Create_Filter(options_list):
    """
    This function creates a lowercase copy of the given list, e.g. of all party names or all city names.
    """
    return [opt.lower() for opt in options_list]

def TextMining_CleanCitiesParties(tokenizetext, party_list, city_list):
    """
    This function removes all words that are party names or that are city names occurring as a
    location in the dataframe.
    """
    # build each lookup set once instead of once per word
    party_filter = set(Create_Filter(party_list))
    city_filter = set(Create_Filter(city_list))
    return [word for word in tokenizetext if word not in party_filter and word not in city_filter]

def TextMining_StemWords(tokenized_text):
    """
    This function stems all the words in the article.
    """
    return [stemmer.stem(word) for word in tokenized_text]

def TextMining_ALL(texts_list, stop_words, party_list, city_list, word_tagger):
    """
    This function combines all the text cleaning steps and cleans the articles.
    """
    # remove html code
    step1 = [TextMining_CleanTextRegex(text, '<.*?>', '') for text in texts_list]
    
    # replace newlines ('\n') with a space
    step2 = [TextMining_CleanTextRegex(text, '\n', ' ') for text in step1]
    
    # remove urls
    step3 = [TextMining_CleanTextRegex(text, r'http\S+', '') for text in step2]
    
    # replace every run of non-letter characters (anything outside a-z and A-Z) with a space
    step4 = [TextMining_CleanTextRegex(text, r"[^a-zA-Z]+", ' ') for text in step3]
    
    # tokenize text with the nltk function
    step5 = [TextMining_TokenizeText(text) for text in step4]
    
    # remove all words that are not nouns
    step6 = [TextMining_CleanNonNouns(text, word_tagger) for text in step5]
    
    # remove the words contained in the stop word list and convert the remaining words to lowercase
    step7 = [TextMining_CleanStopWords(text, stop_words) for text in step6]
    
    # remove all words that are party names or city names
    step8 = [TextMining_CleanCitiesParties(text, party_list, city_list) for text in step7]
    
    # remove all small words with a length of 3 or less
    step9 = [TextMining_CleanSmallWords(text, 4) for text in step8]
    
    return step9
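
The regex-based steps above can be followed on a single string. The sketch below is a simplified stand-in that relies only on `re`: it skips the noun tagging and the party/city filter, splits on whitespace instead of using `nltk.word_tokenize`, and uses a made-up mini stop-word list (`SAMPLE_STOP_WORDS`) and a made-up sample text.

```python
import re

# hypothetical mini stop-word list; the notebook uses the NLTK Dutch list instead
SAMPLE_STOP_WORDS = {'het', 'van', 'zie', 'meer'}


def clean_text(text, stop_words=SAMPLE_STOP_WORDS, min_wordsize=4):
    text = re.sub(r'<.*?>', '', text)          # strip html tags
    text = re.sub(r'\n', ' ', text)            # newlines become spaces
    text = re.sub(r'http\S+', '', text)        # drop urls
    text = re.sub(r'[^a-zA-Z]+', ' ', text)    # keep ASCII letters only
    words = [word.lower() for word in text.split()]
    return [word for word in words
            if word not in stop_words and len(word) >= min_wordsize]


sample = '<p>Het bestuur van D66 Renkum: zie https://example.org/lijst</p>\nMeer info'
cleaned = clean_text(sample)
print(cleaned)  # ['bestuur', 'renkum', 'info']
```

Note what the letters-only step does to `D66`: the digits are replaced by a space, the leftover `d` is dropped as a short word, and the party name disappears. Accented letters are treated the same way, so a word such as `agrariër` would be split in two.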


#### Functions for calculating the term frequency and inverse document frequency

The functions in this section calculate the term frequency and the inverse document frequency. All the
unique words in the texts are collected in a dictionary that maps each word to an index. Using this dictionary, a matrix is
built that counts the occurrences of every word per text. The inverse document frequency is calculated for each word. These are then
combined to create the tf-idf value for each word in each text.
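
Written out, the quantities computed in this section are as follows. The formulas are read off the code; the symbols are notation chosen here: $n_{t,d}$ is the number of times word $t$ occurs in text $d$, and $N$ is the number of texts passed to the calculation.

```latex
\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\lvert \{ d : n_{t,d} > 0 \} \rvert}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A word that occurs in every text has $\mathrm{idf}(t) = \ln 1 = 0$, so its tf-idf value is zero regardless of how often it is used.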

In [10]:
def TFIDF_UniqueWords(processed_texts):
    """
    This function finds all the unique words from the provided articles.
    """
    temp = set([word for text in processed_texts for word in text])
    return list(temp)

def TFIDF_WordDictionary(unique_word_list):
    """
    This function creates a dictionary with all the unique words and gives them an index.
    """
    return {key:index for index, key in enumerate(unique_word_list)}
    
def TFIDF_EmptyMatrix(texts_list, unique_word_list):
    """
    This function creates an empty matrix filled with zeros. The axes are the number of articles in the texts
    list and the number of unique words.
    """
    return np.zeros((len(texts_list), len(unique_word_list)))

def TFIDF_CountMatrix(unique_word_list, texts_list, word_dictionary):
    """
    This function fills in an empty matrix with the occurrences of every word in every article.
    """
    temp_matrix = TFIDF_EmptyMatrix(texts_list, unique_word_list)
    for i in range(len(texts_list)):
        for word in texts_list[i]:
            temp_matrix[i, word_dictionary[word]] += 1
            
    return temp_matrix

def TFIDF_CleanMatrix(count_matrix):
    """
    This function can remove all the articles from the matrix that do not contain any words due to the text
    cleaning operations.
    """
    return count_matrix[np.where(count_matrix.sum(axis=1) != 0)]

def TFIDF_TermFrequency(count_matrix):
    """
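    This function calculates a smoothed term frequency of each word per article: 0.5 + 0.5 * (count / total
    number of words in the article). Note that a word that does not occur in an article still gets the
    baseline value 0.5. The outcome is a matrix with term frequency values for every word per article.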
    """
    sums = count_matrix.sum(axis=1)
    new_matrix = 0.5 + 0.5 * (count_matrix / sums[np.newaxis].T)
    return new_matrix

def TDIDF_InversedDocumentFrequency(count_matrix):
    """
    This function calculates the inverse document frequency of every word. The outcome is an array with the
    inverse document frequency value per word.
    """
    occurences = np.count_nonzero(count_matrix, axis=0)
    new_array = np.log((len(count_matrix) / occurences))
    return new_array

def TFIDF_CalculateTFIDF(tf_matrix, idf_matrix):
    """
    This function combines the term frequency matrix with the inverse document frequency array in order to
    calculate the term frequency-inverse document frequency value per word per article.
    """
    return tf_matrix * idf_matrix

def TFIDF_ALL(texts_list):
    """
    This function combines all the tf-idf functions to create a process with which the tf-idf values can be 
    calculated for the provided articles.
    """
    # find all the unique words in the provided texts
    step1 = TFIDF_UniqueWords(texts_list)
    
    # convert all the unique words to a dictionary that maps each word to an index
    step2 = TFIDF_WordDictionary(step1)
    
    # create a matrix with the number of occurrences per word per text
    step3 = TFIDF_CountMatrix(step1, texts_list, step2)
    
    # in the matrix calculate the term frequency for every word per text
    step4 = TFIDF_TermFrequency(step3)
    
    # calculate the inverse document frequency for each word
    step5 = TDIDF_InversedDocumentFrequency(step3)
    
    # combine the word frequency and the inversed document frequency
    step6 = TFIDF_CalculateTFIDF(step4, step5)
    
    return [step2, step6]
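
A small worked example makes one property of these formulas visible: because the term frequency starts at 0.5, a rare word that does not occur in a text at all still receives a positive tf-idf score there. The pure-Python sketch below mirrors the calculations above on two made-up texts. The `mask_absent` option is an assumption of this sketch, not part of the notebook; it sets the score of absent words to zero.

```python
import math


def tfidf_scores(texts, mask_absent=False):
    """Score every vocabulary word for every text with the formulas used above.

    With mask_absent=True (an assumption of this sketch, not part of the
    notebook) words that do not occur in a text get a score of 0.
    Texts are assumed to be non-empty lists of words.
    """
    vocabulary = sorted({word for text in texts for word in text})
    n_texts = len(texts)
    # document frequency: in how many texts does each word occur
    doc_freq = {word: sum(word in text for text in texts) for word in vocabulary}
    scores = []
    for text in texts:
        row = {}
        for word in vocabulary:
            count = text.count(word)
            if mask_absent and count == 0:
                row[word] = 0.0
                continue
            tf = 0.5 + 0.5 * count / len(text)
            idf = math.log(n_texts / doc_freq[word])
            row[word] = tf * idf
        scores.append(row)
    return scores


texts = [['raad', 'raad', 'gemeente'], ['gemeente', 'krimp']]
plain = tfidf_scores(texts)
masked = tfidf_scores(texts, mask_absent=True)
```

In `plain[0]`, the word `krimp` (absent from the first text) scores higher than `gemeente` (present, but occurring in every text); in `masked[0]` it scores zero. Whether masking is desirable depends on the intended ranking, so it is shown only as an option.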

#### Function for finding all the unique occurrences of locations and political parties

In [11]:
def Dataframe_FindUniques(dataframe, choice):
    """
    Create a list with all the unique occurrences depending on the given choice. At the moment you can choose between
    'party' and 'city' to create a new list with uniques.
    """
    mega_list = []
    
    if choice == 'party':
        for parties in dataframe.parties:
            # skip articles without party information
            if parties is None:
                continue
            for party in parties:
                mega_list.append(party)
        
    elif choice == 'city':
        for city in dataframe.location:
            if city is None:
                continue
            mega_list.append(city)
    
    mega_list = set(mega_list)
    
    return list(mega_list)

#### Functions for tagging the words and finding the top words per article

The functions in this section combine everything into a result. The texts are cleaned and the calculations are performed
based on the given choice. Then, based on the results, the top words per text are found and returned.
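
Selecting the top words from a row of scores can be sketched as follows. This is an illustrative alternative using the standard library, assuming a list of scores aligned with a vocabulary list; `top_words_ranked` is a hypothetical helper and the numbers are made up. It returns the words ordered from highest to lowest score.

```python
import heapq


def top_words_ranked(scores, vocabulary, number=5):
    """Return the `number` highest-scoring words, best first."""
    best_indexes = heapq.nlargest(number, range(len(scores)), key=scores.__getitem__)
    return [vocabulary[index] for index in best_indexes]


# made-up scores for one article, aligned with the vocabulary list
vocabulary = ['gemeente', 'krimp', 'raad', 'dorpen']
scores = [0.1, 0.9, 0.4, 0.7]
ranked = top_words_ranked(scores, vocabulary, number=2)
print(ranked)  # ['krimp', 'dorpen']
```

If `number` exceeds the vocabulary size, `heapq.nlargest` simply returns every word in ranked order.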

In [17]:
def Find_TFIDF_TopWords_ALL(analysis_result, number=5):
    """
    Find the `number` words with the highest tf-idf score per article. These are then presented in a list
    with the top words per article. np.argpartition returns the top words in no particular order.
    """
    temp_word_dict = {}
    for value, key in enumerate(analysis_result[0]):
        temp_word_dict[value] = key
            
    top_words_inds = []
    for arr in analysis_result[1]:
        top_words_inds.append(list(np.argpartition(arr, -number)[-number:]))
    
    top_words_per_article = []
    for indexes in top_words_inds:
        temp = []
        for index in indexes:
            temp.append(temp_word_dict[index])
        top_words_per_article.append(temp)
        
    return top_words_per_article

def Find_Count_TopWords_ALL(analysis_result, number=5):
    """
    Find the n number of words that occur the most per article. These are then presented in a list with the top 
    words per article.
    """
    top_words_per_article = []
    for art in analysis_result:
        temp_list = []
        most_used_words = Counter(art).most_common(number)
        for tup in most_used_words:
            temp_list.append(tup[0])
            
        top_words_per_article.append(temp_list)
            
    return top_words_per_article

def KeywordTagger_ALL(dataframe, stop_words, word_tagger, choice):
    """
    This function combines all the necessary functions in order to create a list with the top scoring words based
    on your choice ('tfidf' or 'count'). The result is a list with the top words per article.
    """
    start_index = 0
    end_index = 1000
    result_list = []
    party_list = Dataframe_FindUniques(dataframe, 'party')
    city_list = Dataframe_FindUniques(dataframe, 'city')
    
    while start_index < len(dataframe):
        
        # select a subset of the dataframe with the start and end index, these change (with +1000) after every loop
        raw_articles = dataframe.description[start_index:end_index]
        # clean and prepare all the texts in the selected subset
        cleaned_articles = TextMining_ALL(raw_articles, stop_words, party_list, city_list, word_tagger)
        
        if choice == 'tfidf':
            
            # calculate the tf-idf of each word per text
            analysed_articles = TFIDF_ALL(cleaned_articles)
            
            # find the top words based on the tf-idf
            top_words = Find_TFIDF_TopWords_ALL(analysed_articles)
        
        elif choice == 'count':
            
            # find the top words based on the count
            top_words = Find_Count_TopWords_ALL(cleaned_articles)
        
        for l in top_words:
            result_list.append(l)
        
        start_index += 1000
        end_index += 1000
        
        print('\r{}/{}'.format(start_index, len(dataframe)), end = '')
        
#        if start_index == 10000:
#            break
        
    print("\nTHE END!")
    
    return result_list
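
The batching in `KeywordTagger_ALL` can be isolated in a small generator. The sketch below is a hypothetical helper (`iter_batches`) with placeholder data; it also caps the progress counter at the number of articles, so the last line printed does not overshoot the total.

```python
def iter_batches(items, batch_size=1000):
    """Yield (start, batch) pairs that cover `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


articles = ['article %d' % i for i in range(2500)]  # placeholder data
progress = []
for start, batch in iter_batches(articles):
    # ... clean and score `batch` here ...
    done = min(start + len(batch), len(articles))
    progress.append(done)
    print('\r{}/{}'.format(done, len(articles)), end='')
print()
```

Because the tf-idf calculation runs once per batch, the inverse document frequency of a word is based on the articles in that batch only, so the same word can score differently depending on which batch an article falls in.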

In [18]:
# find the top words with the tf-idf method 
top_words_per_article_list_tfidf = KeywordTagger_ALL(DF_all, stop_words, tagger, 'tfidf')

# find the top words with the count method 
top_words_per_article_list_count = KeywordTagger_ALL(DF_all, stop_words, tagger, 'count')



610000/609240
THE END!
610000/609240
THE END!


#### Function for adding a list to a dataframe

In [19]:
def AddColumn_toDataframe(dataframe, column_list, column_name):
    """
    This function creates a new column in the dataframe using the given list and the given column name.
    """
    dataframe[column_name] = column_list

In [21]:
print(top_words_per_article_list_tfidf[:10], '\n')

# add the top words per article from the tf-idf method to the dataframe as a new column
AddColumn_toDataframe(DF_all, top_words_per_article_list_tfidf, 'KEYWORDS_TFIDF')

print(top_words_per_article_list_count[:10])

# add the top words per article from the count method to the dataframe as a new column
AddColumn_toDataframe(DF_all, top_words_per_article_list_count, 'KEYWORDS_COUNT')

[['aantal', 'raad', 'gemeente', 'gemeenteraden', 'personen'], ['bijeenkomst', 'dorp', 'juni', 'krimp', 'dorpen'], ['maatregelen', 'avond', 'bedrijven', 'mensen', 'wethouder'], ['vakantie', 'zonnepanelen', 'gebouwen', 'overleg', 'raadsvergadering'], ['wijk', 'verkiezingen', 'leden', 'mensen', 'folders'], ['vragen', 'praktijk', 'gemeente', 'mensen', 'participatie'], ['politiekbij', 'haken', 'instellingsterrein', 'achterkleinzoon', 'discussie'], ['jaar', 'leden', 'gemeente', 'lijst', 'kandidatenlijst'], ['voorstel', 'drugsbeleid', 'gemeente', 'alcohol', 'blowverbod'], ['mensen', 'kalveren', 'boer', 'jaar', 'gemeente']] 

[['gemeenteraden', 'personen', 'verkleining', 'kamer', 'plan'], ['krimp', 'dorpen', 'juni', 'bijeenkomst', 'pand'], ['wethouder', 'philip', 'broeksma', 'collega', 'herwil'], ['steunfractie', 'overleg', 'vakantie', 'raadsvergadering', 'zonnepanelen'], ['mensen', 'folders', 'vakantie', 'einde', 'verkiezingscampagne'], ['participatie', 'praktijk', 'mensen', 'voorbeeld', 'gem

#### Extra part for word2vec!

In [None]:
import gensim 
import logging
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree

def clustering_on_wordvecs(word_vectors, num_clusters):
    # Initialize a k-means object and use it to extract centroids
    kmeans_clustering = KMeans(n_clusters = num_clusters, init='k-means++')
    idx = kmeans_clustering.fit_predict(word_vectors)
    
    return kmeans_clustering.cluster_centers_, idx

def get_top_words(index2word, k, centers, wordvecs):
    tree = KDTree(wordvecs)
    #The KD-tree is queried for the k points closest to each cluster center.
    closest_points = [tree.query(np.reshape(x, (1, -1)), k=k) for x in centers]
    closest_words_idxs = [x[1] for x in closest_points]
    #Word Index is queried for each position in the above array, and added to a Dictionary.
    closest_words = {}
    for i in range(0, len(closest_words_idxs)):
        closest_words['Cluster #' + str(i)] = [index2word[j] for j in closest_words_idxs[i][0]]
    
    #A DataFrame is generated from the dictionary.
    df = pd.DataFrame(closest_words)
    df.index = df.index + 1
    return df
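
What the KD-tree query in `get_top_words` returns can be reproduced by brute force on a handful of vectors. The sketch below assumes Euclidean distance and uses invented 2-d toy vectors; `closest_words` is a hypothetical helper. It is meant to illustrate the lookup, not to replace the tree, which scales much better to a full vocabulary.

```python
import math


def closest_words(center, word_vectors, k):
    """Return the k words whose vectors lie nearest to `center` (Euclidean)."""
    def distance(word):
        return math.dist(center, word_vectors[word])
    return sorted(word_vectors, key=distance)[:k]


# toy 2-d vectors, invented for illustration
toy_vectors = {
    'raad': (0.0, 0.1),
    'gemeente': (0.2, 0.0),
    'vakantie': (3.0, 3.0),
    'zonnepanelen': (2.9, 3.2),
}
near_origin = closest_words((0.0, 0.0), toy_vectors, k=2)
print(near_origin)  # ['raad', 'gemeente']
```

`math.dist` requires Python 3.8 or newer.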

In [None]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

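# the party and city lists are local to KeywordTagger_ALL, so they are rebuilt here
party_list = Dataframe_FindUniques(DF_all, 'party')
city_list = Dataframe_FindUniques(DF_all, 'city')
cleaned_articles = TextMining_ALL(DF_all.description, stop_words, party_list, city_list, tagger)

# `size` is the gensim < 4 parameter name; gensim 4 renamed it to `vector_size`
model = gensim.models.Word2Vec(cleaned_articles, size=150, window=10, min_count=10, workers=10)
# passing the sentences to the constructor already trains the model; this call continues training
model.train(cleaned_articles, total_examples=len(cleaned_articles), epochs=10)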

search_word = 'wilders'
model.wv.most_similar(positive=search_word, topn=10)

In [None]:
# `syn0` and `index2word` are gensim < 4 names; gensim 4 uses `vectors` and `index_to_key`
Z = model.wv.syn0

centers, clusters = clustering_on_wordvecs(Z, 25)
centroid_map = dict(zip(model.wv.index2word, clusters))

top_words = get_top_words(model.wv.index2word, 20, centers, Z)