# Web page & policy tagging

## Table of Contents
* [Install packages](#first-bullet)
* [Read in data](#second-bullet)
* [Preprocessing](#third-bullet)
* [Installing word to vec model](#fourth-bullet)
* [Tagging webpages to policies](#fifth-bullet)
* [Analysing results](#sixth-bullet)



## 1) Install packages <a class="anchor" id="first-bullet"></a>

In [1]:
import pandas as pd
import numpy as np

#read in data
import os
import json

#preprocessing
import re, string

#tagging webpages to words
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
import gensim
from gensim.models.keyedvectors import KeyedVectors
import copy



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jvanhaeren001\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2) Read in data <a class="anchor" id="second-bullet"></a>

Read in the web pages from a JSON file and the policy codes from a CSV file.

### Read in web pages

In [2]:
# The context manager closes the file automatically, so no explicit close() is needed
with open(file="html_data_2.json", encoding="utf-8") as jsonFile:
    jsonObject = json.load(jsonFile)

In [3]:
data = jsonObject['html_list']

df_webpages = pd.DataFrame(data)

In [4]:
df_webpages.shape

(1837, 13)

In [5]:
df_webpages.columns

Index(['title', 'url', 'language', 'classification_information',
       'metadata_type_string', 'country', 'html', 'sdg_tag', 'ISO3166',
       'Location', 'Service', 'policy_code', 'Policy'],
      dtype='object')

### Read in policy codes

In [6]:
df_policy_codes = pd.read_csv('policy_codes.csv', sep=";")

In [7]:
df_policy_codes.shape

(133, 3)

In [8]:
df_policy_codes.columns

Index(['Policy_code', 'information or procedure', 'relevant_tags'], dtype='object')

## 3) Preprocessing <a class="anchor" id="third-bullet"></a>

Preprocessing of web pages and policy codes

In [9]:
raw_data = df_webpages.loc[:,'html']

In [10]:
raw_data.reset_index(drop=True)

0       Your Guide to Greece - Gov.gr ΕL EN Created wi...
1       Your Guide to Greece - Gov.gr Skip to main con...
2       Your Guide to Greece - Gov.gr Skip to main con...
3       Your Guide to Greece - Gov.gr Skip to main con...
4       Your Guide to Greece - Gov.gr Skip to main con...
                              ...                        
1832    Import charges - Tullverket About the website ...
1833    Recognition information for EU citizens - Swed...
1834    Candidates standing for election - Valmyndighe...
1835    The right to vote and voting cards - Valmyndig...
1836    - Yrkeshögskolan Antingen stödjer din webbläsa...
Name: html, Length: 1837, dtype: object

### Lowercasing and removing HTML tags, punctuation and numbers

In [11]:
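def preprocess_text(text):
    text = text.lower()
    text = re.compile(r'<.*?>').sub('', text)  # strip HTML tags
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)  # ASCII punctuation -> space
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace (raw string avoids an invalid-escape warning)
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # bracketed numbers; no-op here, brackets are already gone
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())  # drop remaining symbols, e.g. non-ASCII punctuation
    text = re.sub(r'\d', ' ', text)  # digits -> space
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace again
    return text.strip()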
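
As a quick sanity check, the cleaning steps can be traced on a small made-up HTML snippet. The sketch below repeats the same kind of regular-expression steps in a standalone helper; `clean_text` and the sample string are illustrative names, not part of the notebook.

```python
import re
import string

def clean_text(text):
    """Standalone copy of the main cleaning steps, for illustration only."""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)                                  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)   # ASCII punctuation
    text = re.sub(r'[^\w\s]', '', text)                                # other symbols
    text = re.sub(r'\d', ' ', text)                                    # digits
    return re.sub(r'\s+', ' ', text).strip()                           # whitespace

sample = "<p>Apply for a Passport: fees from 2023 [1]</p>"
cleaned = clean_text(sample)
print(cleaned)
```

Note that digits are dropped entirely, so years, amounts and reference numbers do not survive preprocessing.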

In [12]:
data = [preprocess_text(t) for t in raw_data]

In [13]:
df_webpages['text_preprocessed']=data

## 4) Installing word to vec model <a class="anchor" id="fourth-bullet"></a>

Word to vec model from: http://vectors.nlpl.eu/repository/

The model is not included in the GitHub repo. To run this code, you need to download an English word2vec model yourself, for example the English CoNLL17 corpus model (vector size = 100, window = 10), which is the one used here.

### From CoNLL17 corpus to gensim word2vec (ONLY RUN ONCE)

In [14]:
'''col_names = ["word"]
for i in range(0,100):
    col_names.append("vec" + str(i))'''

'col_names = ["word"]\nfor i in range(0,100):\n    col_names.append("vec" + str(i))'

In [15]:
#embedded_dic = pd.read_csv('model.txt', skiprows=1, sep=' ', encoding='latin-1', names = col_names, index_col=False)

In [16]:
#embedded_dic['word'] = embedded_dic['word'].str.encode('latin-1')

In [17]:
#embedded_dic['word'] = embedded_dic['word'].str.decode('utf-8', errors='ignore') 

In [18]:
#embedded_dic.index = embedded_dic['word']

In [19]:
#embedded_dic = embedded_dic.drop(columns = 'word')

In [20]:
'''np.savetxt('embedded_dic_english.txt', embedded_dic.reset_index().values, 
           delimiter=" ", 
           header="{} {}".format(len(embedded_dic), len(embedded_dic.columns)),
           comments="",
           fmt=["%s"] + ["%.18e"]*len(embedded_dic.columns), encoding = 'utf-8')''' #save the model as a model you can use with gensim

'np.savetxt(\'embedded_dic_english.txt\', embedded_dic.reset_index().values, \n           delimiter=" ", \n           header="{} {}".format(len(embedded_dic), len(embedded_dic.columns)),\n           comments="",\n           fmt=["%s"] + ["%.18e"]*len(embedded_dic.columns), encoding = \'utf-8\')'
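
The commented-out cells above write the vectors in the plain-text word2vec layout that `KeyedVectors.load_word2vec_format` reads with `binary=False`: a header line holding the vocabulary size and the vector dimension, followed by one line per word. A minimal sketch of that layout, using made-up two-dimensional vectors (the helper `to_word2vec_text` and the toy values are assumptions for illustration):

```python
def to_word2vec_text(vectors):
    """Serialise {word: [floats]} to the word2vec text layout."""
    dim = len(next(iter(vectors.values())))
    lines = ["{} {}".format(len(vectors), dim)]   # header: vocabulary size, dimension
    for word, vec in vectors.items():
        if len(vec) != dim:
            raise ValueError("all vectors must share one dimension")
        lines.append(" ".join([word] + ["%.18e" % v for v in vec]))
    return "\n".join(lines) + "\n"

# Toy vectors, not taken from the real model
toy = {"gas": [0.1, 0.2], "electricity": [0.3, 0.4]}
text = to_word2vec_text(toy)
print(text.splitlines()[0])
```

The header is what the `header="{} {}".format(...)` argument of `np.savetxt` produces in the commented-out cell.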

### Create similarity measure between documents

In [21]:
stopW = set(stopwords.words('english'))
punctuation_map = dict((ord(char), None) for char in string.punctuation)

class DocSimV1(object):
    def __init__(self, w2v_model , stopwords=stopW , remove_punctuation_map=punctuation_map):
        self.w2v_model = w2v_model
        self.stopwords = stopwords
        self.remove_punctuation_map = remove_punctuation_map

    def vectorize(self, doc):
        """Identify the vector values for each word in the given document"""
        doc = doc.lower()
        words = [w.translate(self.remove_punctuation_map) for w in doc.split(" ") if w not in self.stopwords]
        word_vecs = []
        for word in words:
            try:
                vec = self.w2v_model[word]
                word_vecs.append(vec)
            except KeyError:
                # Ignore if the word doesn't exist in the vocabulary
                pass

        # Assuming that document vector is the mean of all the word vectors
        vector = np.mean(word_vecs, axis=0)
        return vector


    def _cosine_sim(self, vecA, vecB):
        """Compute the cosine similarity between two vectors (0 if the result is NaN)."""
        csim = np.dot(vecA, vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB))
        if np.isnan(np.sum(csim)):
            return 0
        return csim

    def calculate_similarity(self, source_doc, target_docs=None, threshold=0):
        """Calculates & returns similarity scores between given source document & all
        the target documents."""
        if target_docs is None:  # avoid a shared mutable default argument
            target_docs = []
        if isinstance(target_docs, str):
            target_docs = [target_docs]

        source_vec = self.vectorize(source_doc)
        results = []
        for doc in target_docs:
            target_vec = self.vectorize(doc)
            sim_score = self._cosine_sim(source_vec, target_vec)
            if sim_score > threshold:
                results.append({
                    'score' : sim_score,
                    'doc' : doc
                })
            # Sort results by score in desc order
            #results.sort(key=lambda k : k['score'] , reverse=True)

        return results
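
The class represents a document as the mean of its word vectors and compares two documents by the cosine of the angle between those means. A dependency-free sketch of the same arithmetic on made-up two-dimensional embeddings (the helper names and `toy_vectors` are illustrative; the notebook itself uses NumPy and the real word2vec model):

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Made-up embeddings, chosen so the arithmetic is easy to follow
toy_vectors = {
    "gas": [1.0, 0.0],
    "electricity": [0.8, 0.6],
    "travel": [0.0, 1.0],
}

doc_vec = mean_vector([toy_vectors["gas"], toy_vectors["electricity"]])
sim_close = cosine_similarity(toy_vectors["gas"], toy_vectors["electricity"])
sim_far = cosine_similarity(toy_vectors["gas"], toy_vectors["travel"])
print(doc_vec, sim_close, sim_far)
```

With these toy values the two-word document averages to roughly `[0.9, 0.3]`, "gas" and "electricity" score about 0.8, and "gas" and "travel" score 0 because their vectors are orthogonal.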

In [22]:
from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('embedded_dic_english.txt', binary=False)

### Quick test to see if the model works

In [23]:
ds = DocSimV1(word_vectors)

In [126]:
ds.calculate_similarity("gas", ["identification","driving","travel", "visum", "electricity"])

[{'score': 0.14972492, 'doc': 'identification'},
 {'score': 0.40867487, 'doc': 'driving'},
 {'score': 0.30057374, 'doc': 'travel'},
 {'score': 0.12562916, 'doc': 'visum'},
 {'score': 0.7626458, 'doc': 'electricity'}]

## 5) Tagging webpages to policies <a class="anchor" id="fifth-bullet"></a>

In [25]:
def create_tags(org_data, policy_codes,threshold):
    
    index_text = org_data.columns.get_loc("text_preprocessed")
    main_policy_codes=policy_codes["Policy_code"][policy_codes["Policy_code"].str.len() == 1]
    
    all_tags =[]
    policy_codes_list = []

    for main_code in main_policy_codes:
        temp_one_policy = policy_codes[policy_codes.Policy_code.str.startswith(main_code)]
        sub_policy_codes=temp_one_policy["Policy_code"][temp_one_policy["Policy_code"].str.len() ==2]
        policy_codes_list.append(sub_policy_codes.values.tolist())       
        
        sub_policy_list = []

        for sub_policy_code in sub_policy_codes:
            sub_policy_tag = policy_codes[policy_codes["Policy_code"]==sub_policy_code]
                       
            sub_policy_tag = sub_policy_tag["relevant_tags"].str.split(',', expand=True)
                        
            sub_policy_list.append(sub_policy_tag.values.tolist()[0])
            
        all_tags.append(sub_policy_list)            
               
    #print(policy_codes_list)                
    
    temp = copy.deepcopy(org_data)
    
    ds = DocSimV1(word_vectors)
    
    for code in main_policy_codes:
        temp[code]=0 
           
    temp['tags'] = ""
    temp['subtags'] = ""
    temp['tag_word'] =""
    
    index_tags = temp.columns.get_loc("tags")  
    index_subtags = temp.columns.get_loc("subtags")

    index_tagwords = temp.columns.get_loc("tag_word")

    #double-check if the words in the brackets are the words we want to match with
    
    # combine the categories in a list
      
    tags_en = []
    tags_en = all_tags
    #print(tags_en)
          
    #define the names of the subcategories
          
    subtags_names = []
    subtags_names = policy_codes_list
    #print(subtags_names)
    

    #split each document into two-word windows for comparison
    for i in range(0,len(temp)): # doc per doc
        words = temp.iloc[i,index_text].split()
        word_filters = [] #contains all two-word windows of a document 
        
           
        for w in range(0,len(words)-1):          
            word_filters.append(words[w] + " " + words[w+1])
        
        for word_filter in word_filters:
            
            for t in range(0,len(tags_en)): # every main tag
                for subt in range(0,len(tags_en[t])): # for every subtag
                    sim_scores = ds.calculate_similarity(word_filter, tags_en[t][subt]) # compute similarity between preprocessed text and tag
                    scores = [sim_scores[k].get('score') for k in range(0,len(sim_scores))]
                    col_pos = index_text + 1 +t
                    score_max = max(scores, default=0)
                
                    if score_max > temp.iloc[i,col_pos]:
                        temp.iloc[i,col_pos] = float(score_max) # add max value of match under the tag variable

                    if score_max >=threshold:
                        index_max = np.argmax(scores)
                                                                       
                        if temp.columns[index_text + 1 + t] not in temp.iloc[i,index_tags]:
                            temp.iloc[i,index_tags] = temp.iloc[i,index_tags] + temp.columns[index_text + 1 +t] + ", " #add main tag
                                         
                        if subtags_names[t][subt] not in temp.iloc[i,index_subtags]:
                            temp.iloc[i,index_subtags] = temp.iloc[i,index_subtags] + subtags_names[t][subt] + ", " #add sub tag
                        
                        if tags_en[t][subt][index_max] not in temp.iloc[i,index_tagwords]:
                            temp.iloc[i,index_tagwords] = temp.iloc[i,index_tagwords] + tags_en[t][subt][index_max] + ", " #add tag word                           
       
    return temp
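
The core of `create_tags` is a sliding two-word window over each page: every window is scored against every tag word, and a tag is assigned once a score reaches the threshold. A simplified sketch of that loop follows. It assumes a stand-in scoring function: `toy_similarity` is a hypothetical word-overlap score, not the word2vec cosine similarity used in the notebook, and `tag_text` is an illustrative name.

```python
def bigrams(text):
    """All consecutive two-word windows of a whitespace-separated text."""
    words = text.split()
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

def toy_similarity(a, b):
    """Hypothetical stand-in score: Jaccard overlap of the two word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

def tag_text(text, tag_words, threshold):
    """Return the tag words matched by at least one window at or above the threshold."""
    matched = []
    for window in bigrams(text):
        for tag_word in tag_words:
            if toy_similarity(window, tag_word) >= threshold and tag_word not in matched:
                matched.append(tag_word)
    return matched

windows = bigrams("apply for a passport")
tags = tag_text("rules on income tax returns", ["income tax", "driving licence"], 0.95)
print(windows)
print(tags)
```

A page with fewer than two words produces no windows and therefore receives no tags, which also holds for the loop in `create_tags`.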

In [26]:
output = create_tags(df_webpages, df_policy_codes, 0.95)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


In [None]:
pd.set_option('display.max_columns', 41)

In [29]:
output

| | title | url | language | classification_information | metadata_type_string | country | html | sdg_tag | ISO3166 | Location | ... | R | S | T | U | V | W | X | tags | subtags | tag_word |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Your Guide to Greece - Gov.gr | https://www.gov.gr/en/sdg/citizens-and-family-... | en | G3 | Information | Greece | Your Guide to Greece - Gov.gr ΕL EN Created wi... | sdg | en | | ... | 1.000000 | 0.672528 | 0.800680 | 0.936135 | 0.813468 | 0.736008 | 0.930276 | G, B, F, R, K, | G3, B8, F1, R1, B9, K2, G1, | gender recognition, health, medical treatment,... |
| 1 | Your Guide to Greece - Gov.gr | https://www.gov.gr/en/sdg/citizens-and-family-... | en | G4 | Information | Greece | Your Guide to Greece - Gov.gr Skip to main con... | sdg | en | | ... | 0.595408 | 0.637999 | 0.692486 | 0.825659 | 0.606110 | 0.780286 | 0.886670 | B, G, D, | B4, G4, G3, D6, | taxation, Succession rights, gender recognitio... |
| 2 | Your Guide to Greece - Gov.gr | https://www.gov.gr/en/sdg/citizens-and-family-... | en | G4 | Information | Greece | Your Guide to Greece - Gov.gr Skip to main con... | sdg | en | | ... | 0.561580 | 0.629226 | 0.749408 | 0.809133 | 0.656334 | 0.779231 | 0.886670 | B, G, | B4, G4, G3, | taxation, Succession rights, gender recognition, |
| 3 | Your Guide to Greece - Gov.gr | https://www.gov.gr/en/sdg/citizens-and-family-... | en | G4 | Information | Greece | Your Guide to Greece - Gov.gr Skip to main con... | sdg | en | | ... | 0.891274 | 0.814821 | 0.856768 | 0.818225 | 0.900747 | 0.780286 | 0.886670 | G, D, L, B, | G4, G3, D6, L5, B4, | Succession rights, gender recognition, death, ... |
| 4 | Your Guide to Greece - Gov.gr | https://www.gov.gr/en/sdg/citizens-and-family-... | en | G4 | Information | Greece | Your Guide to Greece - Gov.gr Skip to main con... | sdg | en | | ... | 0.638353 | 0.624820 | 0.681847 | 0.812690 | 0.646286 | 0.779231 | 0.886670 | G, B, | G4, G3, B4, G2, | Succession rights, gender recognition, taxatio... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1832 | Import charges - Tullverket | https://www.tullverket.se/eng/business/importi... | en | L4 | Information | Sweden | Import charges - Tullverket About the website ... | sdg | SE | | ... | 0.672616 | 0.647360 | 0.676092 | 0.819390 | 0.872010 | 0.779743 | 0.843321 | H, D, I, L, | H1, D1, I1, L1, L4, | buying goods, moving, personal data, VAT, impo... |
| 1833 | Recognition information for EU citizens - Swed... | https://www.uhr.se/en/start/recognition-of-for... | en | B3 | Information | Sweden | Recognition information for EU citizens - Swed... | sdg | SE | | ... | 0.648890 | 0.610716 | 0.902828 | 0.830009 | 0.778063 | 0.833771 | 0.880796 | N, E, M, I, D, | N3, E1, M8, E3, E4, I1, D1, | vocational education, education system, certi... |
| 1834 | Candidates standing for election - Valmyndigheten | https://www.val.se/servicelankar/other-languag... | en | D3 | Information | Sweden | Candidates standing for election - Valmyndighe... | sdg | SE | | ... | 0.850436 | 0.666533 | 0.709328 | 0.718841 | 0.900747 | 0.759830 | 0.740445 | D, G, | D3, G1, | elections, birth, |
| 1835 | The right to vote and voting cards - Valmyndig... | https://www.val.se/servicelankar/other-languag... | en | D3 | Information | Sweden | The right to vote and voting cards - Valmyndig... | sdg | SE | | ... | 0.569242 | 0.590785 | 0.709328 | 0.704464 | 0.900747 | 0.759830 | 0.740445 | D, | D3, | elections, |


In [27]:
output.to_csv('webpages_policies_tagged.csv', index= False)


In [30]:
output.to_excel("webpages_policies_tagged.xlsx") 

## 6) Analysing results <a class="anchor" id="sixth-bullet"></a>

Each prediction is classified as a perfect match (the predicted labels are exactly the true labels), a not-perfect match (at least one predicted label is correct, but the label sets differ), or a wrong match (no predicted label is correct).

Next, we report the average number of labels predicted per web page.
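
The three outcome categories can be expressed as one small function over label sets. This is a sketch of the same logic as the cells below; `match_category` is an illustrative name and the example labels are made up in the style of the policy codes.

```python
def match_category(true_labels, predicted_labels):
    """Classify one prediction as 'perfect', 'not perfect' or 'wrong'."""
    true_set, predicted_set = set(true_labels), set(predicted_labels)
    if true_set == predicted_set:
        return "perfect"       # exactly the same labels
    if true_set & predicted_set:
        return "not perfect"   # at least one label in common
    return "wrong"             # no label in common

print(match_category(["G3"], ["G3"]))
print(match_category(["G4"], ["B4", "G4", "G3"]))
print(match_category(["D3"], ["G1"]))
```

Because the categories are mutually exclusive, every page falls into exactly one of them.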

In [31]:
predictions = output["subtags"]

In [32]:
true_values = output["classification_information"]

In [45]:
true_values_list=[]
for i in range(0,len(true_values)):
    true_values_list.append(true_values[i].split(";"))

In [63]:
predictions_list=[]
for i in range(0,len(predictions)):
    text = predictions[i].split(",")
    text = ' '.join(text).split() #remove blanks
    predictions_list.append(text)
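
The `subtags` strings end with a trailing `", "`, which is why the blanks have to be removed after splitting. For labels without internal spaces, the same parsing can be sketched as a single helper (`parse_labels` is an illustrative name, not part of the notebook):

```python
def parse_labels(raw, sep=","):
    """Split a separator-delimited label string and drop empty entries."""
    return [part.strip() for part in raw.split(sep) if part.strip()]

parsed = parse_labels("B4, G4, G3, ")
print(parsed)
print(parse_labels(""))
print(parse_labels("G3;B8", sep=";"))
```

An empty `subtags` string yields an empty list, so a page without any predicted tag is counted as having zero labels.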

In [66]:
len(predictions_list)

1837

In [67]:
len(true_values_list)

1837

1) Count the number of perfect matches

In [77]:
perfect_match = []
for i in range(0,len(true_values_list)):
    
    if set(true_values_list[i]) == set(predictions_list[i]):    
        perfect_match.append(True)
    else:
        perfect_match.append(False)   

In [78]:
sum(perfect_match)

59

In [80]:
(sum(perfect_match)/len(true_values_list))*100 #percentage perfect match

3.211758301578661

There are 59 perfect matches out of 1837 web pages (3.21%).

2) Count the number of not-perfect matches

In [92]:
not_perfect_match = []
for i in range(0,len(true_values_list)):
    
    if any(item in true_values_list[i] for item in predictions_list[i]) and not(set(true_values_list[i]) == set(predictions_list[i])):    
        not_perfect_match.append(True)
    else:
        not_perfect_match.append(False)  

In [98]:
sum(not_perfect_match)

678

In [99]:
(sum(not_perfect_match)/len(true_values_list))*100 #percentage not-perfect match

36.90800217746325

There are 678 not-perfect matches out of 1837 web pages (36.9%).

3) Count the number of wrong matches

In [94]:
wrong_match = []
for i in range(0,len(true_values_list)):
    
    if not(any(item in true_values_list[i] for item in predictions_list[i])):    
        wrong_match.append(True)
    else:
        wrong_match.append(False)  

In [100]:
sum(wrong_match)

1100

In [103]:
(sum(wrong_match)/len(true_values_list))*100 #percentage wrong match

59.88023952095808

There are 1100 wrong matches out of 1837 web pages (59.9%).
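
The three categories are mutually exclusive and cover every page, so their counts should add up to the 1837 pages. A short check using the counts reported above:

```python
# Counts taken from the three cells above
counts = {"perfect": 59, "not perfect": 678, "wrong": 1100}

total = sum(counts.values())
percentages = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)
print(percentages)
```

The total comes to 1837, and the shares are about 3.21%, 36.91% and 59.88%.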

4) Average number of tags

In [105]:
len(predictions_list[2])

3

In [107]:
len(predictions_list)

1837

In [110]:
sum_tags =0
for i in range(0, len(predictions_list)):
    sum_tags = sum_tags + len(predictions_list[i])
print(sum_tags/len(predictions_list))

3.2847033206314644


On average, 3.28 tags were predicted per web page.

In [111]:
stopW = set(stopwords.words('english'))


In [113]:
stopW

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r