# Good Morning and welcome to  <font color='BLUE'>"Nugget Word embedding for text classification" </font>


By Zied **HAJ YAHIA** and Adrien **SIEG**

### 0.A - Introducing the paper <font color='RED'>"Unsupervised Text Classification Leveraging Experts and Word Embeddings" </font> submitted to ACL

- **Authors** : Zied **HAJ YAHIA**, Adrien **SIEG**, and Léa **DELERIS** (Head of RiskAir DataLab), with the great support of Jordan **TOH**

- **Mission** : ORAIA (Operational Risk Artificial Intelligence Assistant) in BNP Paribas (Risk ORC)

- **Year** : 2018-2019

- **Topic**: Classify documents into hundreds of labels when you have NO DATA, no training set, ... nothing

<img src="paper.PNG">

### 0.B - Structure/Frame of Paper

1. **Introduction**



2. **Related Work**



3. **Method**

    - <font color='GREEN'>3.1</font> Cleaning Steps
    - <font color='GREEN'>3.2</font> Enrichment
    - <font color='GREEN'>3.3</font> Consolidation
    - <font color='GREEN'>3.4</font> Text Similarity Metric
    
    
    
4. **Experiments**

    - <font color='GREEN'>4.1</font> Dataset
        - 20 Newsgroups
        - AG's Corpus
        - Yahoo Answers
        - 5 Abstracts Group
        - Google Snippets
            
    - <font color='GREEN'>4.2</font> Configurations and Baseline Methods
    - <font color='GREEN'>4.3</font> Experimental Settings
    - <font color='GREEN'>4.4</font> Results and Discussion
    
    
    
5. **Application to Operational Risk Incident Classification**

  - <font color='GREEN'>5.1</font> Operational Risk Incidents Corpus and Taxonomy
 **ORAIA Dataset from BNP Group**
 
**IFS** > Cardif, Asset Management, Wealth Management, Personal Finance, IRB, ...

**Domestic Market** > Leasing Solutions, BDDF, BDDB, BNL, BGL, ...

**CIB** > ITO, APAC, Global Market, BP2S / Securities Services

   - <font color='GREEN'>5.2</font> Result
   - <font color='GREEN'>5.3</font> Discussion

### 0.C - What is the method? 

Our approach to **unsupervised text classification** models the task as a **text similarity problem** between **two sets of words**: 

- One containing the most relevant words in the document and 
- another containing keywords derived from the label of the target category. 

The key advantage of this approach is its **simplicity**, but its success hinges on **building a good dictionary of words for each category.**
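
As a minimal sketch of the idea, the toy example below assigns a document to the category whose keyword dictionary overlaps most with the document's words. The category dictionaries, the document, and the helper names `overlap_coefficient` and `classify` are hypothetical and only illustrate the principle; the notebook itself also relies on LSI and cosine similarity.

```python
# Toy illustration: classify a document by comparing its words with
# hypothetical per-category keyword dictionaries (not taken from the paper).

def overlap_coefficient(words_a, words_b):
    """Return |A ∩ B| / min(|A|, |B|), or 0.0 if either set is empty."""
    set_a, set_b = set(words_a), set(words_b)
    smallest = min(len(set_a), len(set_b))
    if smallest == 0:
        return 0.0
    return len(set_a & set_b) / smallest


def classify(document_words, category_dictionaries):
    """Return the best-scoring category and the score of every category."""
    scores = {
        category: overlap_coefficient(document_words, keywords)
        for category, keywords in category_dictionaries.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical dictionaries and document
category_dictionaries = {
    "rec.sport.hockey": ["hockey", "team", "season", "goal", "playoff"],
    "sci.space": ["space", "orbit", "launch", "nasa", "shuttle"],
}
document_words = ["team", "scored", "goal", "overtime", "playoff"]

predicted, scores = classify(document_words, category_dictionaries)
print(predicted, scores)
```

No training set is involved: the quality of the prediction depends entirely on the dictionaries, which is why the rest of the notebook focuses on building and enriching them.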

<img src="method.PNG">

### 0.D - Why is it so hard? 

<img src="religions.PNG">

<img src="hockey.PNG">

### 0.E - What is the taxonomy of Operational Risk? Example with ICT

<img src="example_taxonomy_ict.PNG">

# A concrete example with *20 Newsgroups*

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

def twenty_newsgroup_to_csv():
    newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

    df = pd.DataFrame([newsgroups_train.data, newsgroups_train.target.tolist()]).T
    df.columns = ['text', 'target']
    
    df['target'] = df['target'].apply(int)

    targets = pd.DataFrame( newsgroups_train.target_names)
    targets.columns=['title']

    out = pd.merge(df, targets, left_on='target', right_index=True)
    out['date'] = pd.to_datetime('now')
    out.to_csv('20_newsgroup.csv')
    
twenty_newsgroup_to_csv()

In [None]:
import pandas as pd
news20dataset = pd.read_csv('20_newsgroup.csv').rename(columns={'Unnamed: 0':'id'})
news20dataset.head()

In [None]:
import re
from nltk.corpus import stopwords
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess(raw_text):

    # keep only words
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and split 
    words = letters_only_text.lower().split()

    # remove stopwords
    stopword_set = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stopword_set]
    
    # remove short words (3 letters or fewer)
    meaningful_words = [w for w in meaningful_words if len(w)>3]
    
#     #stemmed words
#     ps = PorterStemmer()
#     stemmed_words = [ps.stem(word) for word in meaningful_words]
    
    # join the cleaned words into a single string
    cleaned_word_list = " ".join(meaningful_words)

    return cleaned_word_list

news20dataset['text'] = news20dataset['text'].apply(str)
news20dataset['text_cleaned_str'] = news20dataset['text'].apply(lambda line : preprocess(line))
news20dataset['text_cleaned_str'] = news20dataset['text_cleaned_str'].apply(str)

#news20dataset.to_csv('news20dataset_cleaned.csv')

# One row per category: concatenate the cleaned text of all its documents
dictionary_basic_words = (
    news20dataset.groupby('title')['text_cleaned_str']
    .apply(' '.join)
    .reset_index()
)

# Save the per-category text; the next section reloads this file
dictionary_basic_words.to_excel('dico_20wordgroups.xlsx', sheet_name='Sheet1', index=False)

dictionary_basic_words

### 1.A : Most frequent words <font color='RED'>(4.1)</font>

In [None]:
import pandas as pd
dictionary_basic_words = pd.read_excel('dico_20wordgroups.xlsx')

In [None]:
from collections import Counter

def find_most_frequent_words(raw_string_text, nb_words_to_return):
    # split() returns the list of all the words in the string
    words = raw_string_text.split()

    # count how many times each word occurs
    # (a distinct variable name avoids shadowing the Counter class)
    word_counts = Counter(words)

    # most_common() returns the k most frequent words
    # together with their respective counts
    return word_counts.most_common(nb_words_to_return)

# def remove_common_words_from_list(list_of_unique_words, list_to_filter):
#     return(list(filter(lambda x: x in list_of_unique_words, list_to_filter)))

# def keep_uncommon_most_frequent_words_from_column_of_text_in_data_frame(data_frame, name_column):
    
#     # 1. Transform string rows to list rows
#     data_frame[name_column] = data_frame[name_column].apply(lambda line : line.split())
    
#     # 2. Get only unique words - uncommon words between each category
#     from collections import Counter
#     frequency = Counter(data_frame[name_column].sum())
#     unique_words = list({word : frequency[word] for word in frequency if frequency[word] == 1 })
    
#     # 3. Keep away common words
#     data_frame[name_column] = data_frame[name_column].apply(lambda line : remove_common_words_from_list(unique_words, line))
    
#     # 4. Return string of unique words
#     data_frame[name_column] = data_frame[name_column].apply(lambda line : ' '.join(line))
    
#     return data_frame[name_column]

dictionary_basic_words['most_similar_words_not_unique'] = dictionary_basic_words['text_cleaned_str'].apply(lambda line : [word[0] for word in find_most_frequent_words(line, 150)])
# dictionary_basic_words['unique_words_per_categories'] = keep_uncommon_most_frequent_words_from_column_of_text_in_data_frame(dictionary_basic_words, 'text_cleaned_str')

In [None]:
dictionary_basic_words.head()

### 1.B : Most frequent words that are unique to one category  <font color='RED'>(4.1) </font>

In [None]:
from collections import Counter

def flatten_list(list_toflatten):
    return [item for sublist in list_toflatten for item in sublist]

# Count in how many categories each frequent word appears
top_words_per_category = list(dictionary_basic_words['most_similar_words_not_unique'])
list_most_frequent_words_not_unique_counter = Counter(flatten_list(top_words_per_category))

# Keep the words that are frequent in exactly one category (a set gives fast lookups)
unique_words = {k for k, v in list_most_frequent_words_not_unique_counter.items() if v == 1}

list_most_frequent_words_unique = [[word for word in nested_list if word in unique_words] for nested_list in top_words_per_category]
dictionary_basic_words['most_similar_words_unique_per_catergory'] = pd.Series(list_most_frequent_words_unique)

# writer = pd.ExcelWriter('dico_20wordgroups.xlsx')
# dictionary_basic_words.to_excel(writer, sheet_name='Sheet1')
# writer.save()

dictionary_basic_words.head()
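
The two steps of sections 1.A and 1.B can be checked on a small toy corpus. The sketch below uses hypothetical category texts and a hypothetical `top_n` of 3 (the notebook uses 150); it is only meant to show which words survive the uniqueness filter.

```python
from collections import Counter

# Hypothetical cleaned text per category (toy data, not from 20 Newsgroups)
texts_per_category = {
    "hockey": "game team goal team season game player",
    "space": "orbit launch team nasa orbit shuttle launch",
}
top_n = 3  # hypothetical value; the notebook keeps 150 words per category

# Step 1: top-n most frequent words of each category
top_words = {
    category: [word for word, _ in Counter(text.split()).most_common(top_n)]
    for category, text in texts_per_category.items()
}

# Step 2: count in how many categories each top word appears
category_counts = Counter(word for words in top_words.values() for word in words)

# Step 3: keep only the words that are in the top list of a single category
unique_top_words = {
    category: [word for word in words if category_counts[word] == 1]
    for category, words in top_words.items()
}
print(unique_top_words)
```

Here `team` is frequent in both toy categories, so it is dropped from both dictionaries, while category-specific words are kept.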

### 1.C <font color='GREEN'>[Bonus]</font> - Create a matrix of unique words per category

In [None]:
most_important_words = pd.DataFrame(list(dictionary_basic_words['most_similar_words_unique_per_catergory'])).T
most_important_words.columns = list(dictionary_basic_words['title'].drop_duplicates())
most_important_words.head(10)

# writer = pd.ExcelWriter('dico_20wordgroups_v2.xlsx')
# most_important_words.to_excel(writer, sheet_name='Sheet1')
# writer.save()

### 2 : Spelling Variants <font color='RED'>(4.2.1) </font>

<img src="spelling_variants.PNG">

In [None]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def split_dataset_into_words(dataset):
    datawords = dataset.apply(lambda x: x.split())
    return list(datawords)

# Group every word of the corpus under its stem: {stem: 'variant1, variant2, ...'}
def buffer_stemmisation_keywords(my_list):
    my_list = [item for sublist in my_list for item in sublist]
    aux = pd.DataFrame(my_list, columns =['word'] )
    aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
    aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
    aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
    aux.index = aux['word_stemmed']
    del aux['word_stemmed']
    my_dict = aux.to_dict('dict')['word']
    return my_dict

dictionnary_all_words_unstemmed = buffer_stemmisation_keywords(split_dataset_into_words(dictionary_basic_words['text_cleaned_str']))

# De-duplicate the spelling variants stored under each stem
for key, value in dictionnary_all_words_unstemmed.items():
    variants = {variant.strip(",") for variant in value.split()}
    dictionnary_all_words_unstemmed[key] = list(variants)

In [None]:
{'sure': ['surely', 'sure'],
 'look': ['looks', 'looked', 'look', 'looking'],
 'happen': ['happens', 'happened', 'happenings', 'happen', 'happening'],
 'japanes': ['japanese'],
 'citizen': ['citizens', 'citizen'],
 'world': ['worlds', 'world'],
 'prepar': ['prepare', 'prepared', 'preparation', 'preparations'],
 'round': ['rounded', 'round', 'rounding', 'rounds'],
 'peopl': ['people', 'peoples'],
 'stick': ['sticking', 'stick'],
 'concentr': ['concentrated', 'concentrate', 'concentrating', 'concentration'],
 'camp': ['camps', 'camp', 'camping'],
 'without': ['without'],
 'trial': ['trial'],
 'short': ['shorted', 'shorting', 'short'],
 'step': ['step', 'stepping'],
 'gass': ['gassed', 'gassing'],
 'seem': ['seems', 'seeming', 'seem', 'seemed']}

### 3 : Synonyms from WordNet  <font color='RED'>(4.2.2) </font>

In [None]:
from nltk.corpus import wordnet

def get_synonyms_wordnet(word):
    # keep the first lemma name of every WordNet synset of the word
    return list({synset.lemmas()[0].name() for synset in wordnet.synsets(word)})

def retrieve_synonyms_from_wordnet_dico(nested_list_of_words_per_category):
    # map each dictionary word to its WordNet synonyms
    return {word: get_synonyms_wordnet(word) for word in flatten_list(nested_list_of_words_per_category)}

all_synonyms_wordnet = retrieve_synonyms_from_wordnet_dico(dictionary_basic_words['most_similar_words_unique_per_catergory'])
dictionary_basic_words['wordnet'] = dictionary_basic_words['most_similar_words_unique_per_catergory'].apply(lambda line : flatten_list(list(map(get_synonyms_wordnet, line))))

In [None]:
dictionary_basic_words.head()

### 4: Enrichment from a pre-trained word embedding model <font color='RED'>(4.2.4) </font>

<img src="glove.gif">

In [None]:
import os
import shutil
import smart_open
from sys import platform

import gensim


def prepend_line(infile, outfile, line):
    with open(infile, 'r', encoding="utf8") as old:
        with open(outfile, 'w', encoding="utf8") as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)

def prepend_slow(infile, outfile, line):
    with open(infile, 'r', encoding="utf8") as fin:
        with open(outfile, 'w', encoding="utf8") as fout:
            fout.write(line + "\n")
            for line in fin:
                fout.write(line)
                
def get_lines(glove_file_name):
    # smart_open.open replaces the deprecated smart_open.smart_open
    with smart_open.open(glove_file_name, 'r', encoding="utf8") as f:
        num_lines = sum(1 for line in f)
    with smart_open.open(glove_file_name, 'r', encoding="utf8") as f:
        num_dims = len(f.readline().split()) - 1
    return num_lines, num_dims

# Input: GloVe Model File
# More models can be downloaded from http://nlp.stanford.edu/projects/glove/
glove_file="glove.6B.300d.txt"

num_lines, dims = get_lines(glove_file)

In [None]:
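# Output: Gensim Model text format.
gensim_file='glove_model2.txt'
gensim_first_line = "{} {}".format(num_lines, dims)

# Prepend the "<num_lines> <dims>" header that the word2vec text format expects
if platform == "linux" or platform == "linux2":
    prepend_line(glove_file, gensim_file, gensim_first_line)
else:
    prepend_slow(glove_file, gensim_file, gensim_first_line)

# Demo: Loads the newly created glove_model2.txt into gensim API.
model=gensim.models.KeyedVectors.load_word2vec_format(gensim_file,binary=False) #GloVe Model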

def get_similar_words_from_glove(word, nb_of_words_to_get = 5):
    try:
        neighbours = model.most_similar(positive=[word], topn=nb_of_words_to_get)
    except KeyError:
        # the word is not in the GloVe vocabulary
        return []
    return [neighbour for neighbour, _ in neighbours]

dictionary_basic_words['glove'] = dictionary_basic_words['most_similar_words_unique_per_catergory'].apply(lambda line : flatten_list(list(map(get_similar_words_from_glove, line))))

In [None]:
dictionary_basic_words
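
To make the enrichment step above more concrete, here is a minimal sketch of what a nearest-neighbour lookup in an embedding space does. The three-dimensional vectors and the helper names `cosine` and `most_similar` are hypothetical; real GloVe vectors are loaded through gensim as shown in the cells above.

```python
import math

# Hypothetical 3-dimensional embeddings (real GloVe vectors have 50 to 300 dimensions)
toy_embeddings = {
    "hockey": [0.9, 0.1, 0.0],
    "puck": [0.8, 0.2, 0.1],
    "goalie": [0.85, 0.05, 0.1],
    "orbit": [0.0, 0.9, 0.3],
    "rocket": [0.1, 0.8, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, topn=2):
    """Return the topn words closest to `word`, or [] if it is out of vocabulary."""
    if word not in embeddings:
        return []
    scored = [
        (other, cosine(embeddings[word], vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:topn]]


print(most_similar("hockey", toy_embeddings))
```

Each dictionary word is replaced by its closest neighbours, so a category dictionary grows with words that never appeared in the label itself.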

### 5. Enrichment from a new word embedding model trained on the input corpus <font color='RED'>(4.2.5) </font>

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import *
stemmer = PorterStemmer()

import pandas as pd
news20dataset = pd.read_csv('20_newsgroup.csv').rename(columns={'Unnamed: 0':'id'})
news20dataset['text'] = news20dataset['text'].apply(str)

corpus_list = list(news20dataset['text'])

In [None]:
# build a corpus for the word2vec model
def build_corpus(data):
    "Creates a list of lists containing words from each sentence"
    corpus = []
    for sentence in data:
        word_list = sentence.split(" ")
        corpus.append(word_list)    
           
    return corpus

def get_similar_words_from_homemade_word2vec(word):
    try:
        neighbours = model.wv.most_similar(positive=[word])
    except KeyError:
        # the word is not in the vocabulary of the trained model
        return []
    return [neighbour for neighbour, _ in neighbours]

corpus = build_corpus(corpus_list)

# Note: gensim < 4.0 calls this parameter `size`; gensim >= 4.0 renamed it to `vector_size`
model = gensim.models.Word2Vec(corpus, size=150, window=15, min_count=2, workers=10)
model.train(corpus,total_examples=len(corpus),epochs=25)

In [None]:
w1 = "hundred"
model.wv.most_similar(positive=w1)

get_similar_words_from_homemade_word2vec('home')

In [None]:
dictionary_basic_words['homemade_enrichment'] = dictionary_basic_words['most_similar_words_unique_per_catergory'].apply(lambda line : flatten_list(list(map(get_similar_words_from_homemade_word2vec, line))))

In [None]:
# writer = pd.ExcelWriter('dictionary_20newsGroup.xlsx')
# dictionary_basic_words.to_excel(writer, sheet_name='Sheet1')
# writer.save()

### 6. Function-aware component <font color='RED'>(4.2.6) </font>

\begin{multline}
        FAC(\textit{w},\textit{c}) = \\ \frac{TF(w,c) - \frac{1}{M}\sum_{1\leq k\leq M}TF(w,k) }{var(TF_{-c}(w))}
\end{multline}
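
As a hedged worked example of the formula above, the sketch below computes FAC for one word from hypothetical term frequencies. It assumes that $TF_{-c}(w)$ denotes the term frequencies of $w$ in every category except $c$, that the variance is the population variance, and that a zero variance yields no score; the function name `fac` and these conventions are assumptions, not taken from the paper.

```python
from statistics import mean, pvariance


def fac(term_frequencies, category):
    """FAC of one word for `category`.

    `term_frequencies` maps each category to the term frequency of the word.
    Assumption: the denominator is the population variance of the term
    frequencies in all categories other than `category`.
    """
    others = [tf for cat, tf in term_frequencies.items() if cat != category]
    spread = pvariance(others)
    if spread == 0:
        # Assumption: the score is undefined when the other categories do not vary
        return None
    average = mean(term_frequencies.values())
    return (term_frequencies[category] - average) / spread


# Hypothetical term frequencies of the word "goal" in four categories
tf_goal = {"hockey": 10, "space": 1, "religion": 3, "autos": 2}

print(fac(tf_goal, "hockey"))  # (10 - 4) / var([1, 3, 2])
print(fac(tf_goal, "space"))   # negative: "goal" is rarer than average in "space"
```

A large positive value means the word is much more frequent in the category than on average, while its frequency is stable across the other categories.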

In [None]:
import ast
def convert_stringList_to_List(string_list):
    buffer = ast.literal_eval(string_list)
    return (' '.join(buffer))

def convert_stringlist_to_string(dataset, column_to_apply):
    return (dataset.apply(lambda x: x[column_to_apply] if pd.isnull(x[column_to_apply]) else convert_stringList_to_List(x[column_to_apply]), axis=1))

def remove_common_words_from_list(list_of_unique_words, list_to_filter):
    return(list(filter(lambda x: x in list_of_unique_words, list_to_filter)))

def stem_my_list(list_of_words):
    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import sent_tokenize, word_tokenize
    ps = PorterStemmer()
    list_of_words_stemmed = [ps.stem(word) for word in list_of_words]
    return list_of_words_stemmed


def calculate_how_many_words_appear_per_category(data_frame, 
                                                 name_column_text,
                                                 name_colum_label,
                                                 name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category',
                                                 name_new_colum_label_buffer_to_create = 'label_to_use_for_outer_join'):
    
    # 0. Dataframe copy save
    data_frame_s1 = data_frame.copy()
    
    # 1. Transform string rows to list rows
    # data_frame[name_column] = data_frame[name_column].apply(lambda line : line.split())
    data_frame_s1[name_column_text] = convert_stringlist_to_string(data_frame_s1, name_column_text).apply(lambda line : line.split())
    
    # 2. Stemm to put away spelling variants of a given word
    data_frame_s1[name_column_text] = data_frame_s1[name_column_text].apply(lambda line : stem_my_list(line))

    # 3. How a given word appears per category
    from collections import Counter
    data_frame_s1[name_new_colum_counter_to_create] = data_frame_s1[name_column_text].apply(lambda line : Counter(line))
    
    # 4. For each category, list all the other categories
    data_frame_s1[name_new_colum_label_buffer_to_create] = data_frame_s1[name_colum_label].apply(lambda line: [category for category in data_frame_s1[name_colum_label].unique().tolist() if category != line])
    
    # 5. Word counts in the other categories are computed in calculate_how_many_words_appear_outer_category
    
    return data_frame_s1

def calculate_how_many_words_appear_outer_category(data_frame_from_before, 
                                                   name_colum_text,
                                                   name_colum_label, 
                                                   list_label_to_use_for_outer_join,
                                                   name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category',
                                                   name_new_colum_counter_to_create_OUTER = 'how_many_a_word_appear_per_category_OUTER'):
    
    # 1. Keep only the rows of the other categories (copy to avoid modifying a view)
    dataframe_filtered = data_frame_from_before[data_frame_from_before[name_colum_label].isin(list_label_to_use_for_outer_join)].copy()
    
    # 2. Change format of dictionary to operate aggregation of list
    dataframe_filtered[name_new_colum_counter_to_create_OUTER] = dataframe_filtered[name_new_colum_counter_to_create].apply(lambda line : list(line.items()))
    
    # 3. Sum the counts of each word over the other categories
    from collections import Counter
    inp = [ dict([i]) for i in tuple(dataframe_filtered[name_new_colum_counter_to_create_OUTER].sum()) ]
    count = Counter()
    for y in inp:
        count += Counter(y)
    
    # 4. Divide by number of categories
    G = len(data_frame_from_before[name_colum_label].unique().tolist())
    sum_all_words_outer_a_given_category = {k: v/G for k, v in count.items()}
        
    return sum_all_words_outer_a_given_category

def FAC(data_frame, 
        name_column_text,
        name_colum_label,
        name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category',
        name_new_colum_label_buffer_to_create = 'label_to_use_for_outer_join'): 
    
    dataframe_buffer = calculate_how_many_words_appear_per_category(data_frame, name_column_text, name_colum_label)
    
    OUTER = []
    for item_list_category in dataframe_buffer[name_new_colum_label_buffer_to_create]:
        OUTER.append(calculate_how_many_words_appear_outer_category(dataframe_buffer, name_column_text, name_colum_label, item_list_category))
    
    dataframe_buffer['RESULT_OUTER'] = pd.Series(OUTER)
    return dataframe_buffer

### 7. TF-ICF (Term Frequency - Inverse Category Frequency) <font color='RED'>(4.2.6) </font>

In [None]:
import ast
def convert_stringList_to_List(string_list):
    buffer = ast.literal_eval(string_list)
    return (' '.join(buffer))

def convert_stringlist_to_string(dataset, column_to_apply):
    return (dataset.apply(lambda x: x[column_to_apply] if pd.isnull(x[column_to_apply]) else convert_stringList_to_List(x[column_to_apply]), axis=1))

def remove_common_words_from_list(list_of_unique_words, list_to_filter):
    return(list(filter(lambda x: x in list_of_unique_words, list_to_filter)))

def stem_my_list(list_of_words):
    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import sent_tokenize, word_tokenize
    ps = PorterStemmer()
    list_of_words_stemmed = [ps.stem(word) for word in list_of_words]
    return list_of_words_stemmed

def popularity_of_words_per_category(data_frame, name_column_text):
    
    # 0. Dataframe copy save
    data_frame_s1 = data_frame.copy()
    
    # 1. Transform string rows to list rows
    # data_frame[name_column] = data_frame[name_column].apply(lambda line : line.split())
    data_frame_s1[name_column_text] = convert_stringlist_to_string(data_frame_s1, name_column_text).apply(lambda line : line.split())
    
    # 2. Stemm to put away spelling variants of a given word
    data_frame_s1[name_column_text] = data_frame_s1[name_column_text].apply(lambda line : stem_my_list(line))
    
    # 3. Keep only one occurrence of each stem per category (order preserved)
    data_frame_s1[name_column_text] = data_frame_s1[name_column_text].apply(lambda line : list(dict.fromkeys(line)))
    
    # 4. In how many categories does each word appear? The maximum is the number of categories
    from collections import Counter
    frequency = Counter(data_frame_s1[name_column_text].sum())
    
    # 5. Return the category frequency of each word as a plain dictionary
    popularity_per_words = dict(frequency)
    
    return popularity_per_words

def inverse_category_frequency(popularity_per_words, nb_labels):
    import math
    inverse_category_frequency_per_words = {k: math.log(nb_labels/v) for k, v in popularity_per_words.items()}
    return inverse_category_frequency_per_words

def calculate_how_many_words_appear_per_category(data_frame, name_column_text, name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category'):
        
    # 0. Dataframe copy save
    data_frame_s2 = data_frame.copy()
    
    # 1. Transform string rows to list rows
    data_frame_s2[name_column_text] = convert_stringlist_to_string(data_frame_s2, name_column_text).apply(lambda line : line.split())
    
    # 2. Stemm to put away spelling variants of a given word
    data_frame_s2[name_column_text] = data_frame_s2[name_column_text].apply(lambda line : stem_my_list(line))

    # 3. How a given word appears per category
    from collections import Counter
    data_frame_s2[name_new_colum_counter_to_create] = data_frame_s2[name_column_text].apply(lambda line : Counter(line))
    
    return data_frame_s2

def TFICF(data_frame, 
          name_column_text, 
          nb_labels,
          threshold,
          name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category'):
    
    # 0. Dataframe copy save
    data_frame_s3 = data_frame.copy()
    
    # 1. Calculate inverse_category_frequency
    inverse_category_frequency_per_words = popularity_of_words_per_category(data_frame_s3, name_column_text)
    inverse_category_frequency_per_words = inverse_category_frequency(inverse_category_frequency_per_words, nb_labels)
    
    # 2. Term frequency per word and per category
    data_frame_s4 = calculate_how_many_words_appear_per_category(data_frame_s3, name_column_text, name_new_colum_counter_to_create)
    
    # 3. Compute Term Frequency Inverse Category Frequency
    data_frame_s4['RESULT'] = data_frame_s4[name_new_colum_counter_to_create].apply(lambda line : {k: line[k]*inverse_category_frequency_per_words[k] for k in line})
    
    # 4. Filter my dictionary
    data_frame_s4['RESULT_v2'] = data_frame_s4['RESULT'].apply(lambda line : {k for k, v in line.items() if v > threshold})
    
    return data_frame_s4
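
The weighting computed by the functions above can be checked by hand on a toy example. The sketch below follows the same three steps (category frequency, inverse category frequency, TF times ICF) and the same threshold filter; the category word lists and the threshold value are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stemmed words per category (toy data)
category_words = {
    "hockey": ["team", "goal", "goal", "season"],
    "space": ["team", "orbit", "orbit", "launch"],
    "autos": ["team", "engine", "season"],
}
nb_labels = len(category_words)

# Category frequency: in how many categories does each word occur?
category_frequency = Counter(
    word for words in category_words.values() for word in set(words)
)

# ICF(w) = log(number of categories / category frequency of w)
icf = {word: math.log(nb_labels / cf) for word, cf in category_frequency.items()}

# TF-ICF(w, c) = TF(w, c) * ICF(w)
tficf = {
    category: {word: tf * icf[word] for word, tf in Counter(words).items()}
    for category, words in category_words.items()
}

# Keep only the words whose score exceeds a (hypothetical) threshold
threshold = 1.0
selected = {
    category: {word for word, score in scores.items() if score > threshold}
    for category, scores in tficf.items()
}
print(selected)
```

A word such as `team`, present in every category, gets an ICF of zero and is never selected, whatever its term frequency.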

### Preprocessing before merging <font color='RED'>(4.3) </font>

In [None]:
import re
from nltk.corpus import stopwords
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess_with_stemming(raw_text):
    
    # keep only words
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and split 
    words = letters_only_text.lower().split()

    # remove stopwords
    stopword_set = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stopword_set]
    
    # remove short words (3 letters or fewer)
    meaningful_words = [w for w in meaningful_words if len(w)>3]
    
    # stem the words
    ps = PorterStemmer()
    stemmed_words = [ps.stem(word) for word in meaningful_words]
    
    # join the stemmed words into a single string
    cleaned_str = " ".join(stemmed_words)

    return cleaned_str

In [None]:
import ast
def convert_stringList_to_List(string_list):
    buffer = ast.literal_eval(string_list)
    return (' '.join(buffer))

def convert_stringlist_to_string(dataset, column_to_apply):
    return (dataset.apply(lambda x: x[column_to_apply] if pd.isnull(x[column_to_apply]) else convert_stringList_to_List(x[column_to_apply]), axis=1))

def transform_column_into_string(dataset, column):
    dataset[column] = convert_stringlist_to_string(dataset, column)
    dataset[column] = dataset[column].apply(str)
    dataset[column] = dataset[column].apply(lambda line : preprocess_with_stemming(line))
    return dataset[column]

In [None]:
# load dataset
import pandas as pd
dictionary_basic_words = pd.read_excel('C:/Users/c28742/Downloads/research_paper/dictionary_20newsGroup.xlsx')

dictionary_basic_words

In [None]:
#column_to_transform = ['most_similar_words_unique_per_catergory', 'wordnet', 'glove', 'homemade_enrichment']
column_to_transform = ['most_similar_words_unique_per_catergory', 'glove', 'homemade_enrichment']

In [None]:
for item in column_to_transform:
    transform_column_into_string(dictionary_basic_words, item)

def create_final_dico(dataset, list_columns_in_final_dico):
    dataset['final_dico'] = dataset[list_columns_in_final_dico].apply(lambda line : ' '.join(line), axis = 1)
    dataset['final_dico'] = dataset['final_dico'].apply(lambda line : list(set(line.split())))
    dataset['final_dico'] = dataset['final_dico'].apply(lambda line : ' '.join(line))
    return dataset

dictionary_basic_words = create_final_dico(dictionary_basic_words, column_to_transform)

In [None]:
# ExcelWriter.save() was removed in pandas 2.0; the context manager saves on exit
with pd.ExcelWriter('final_dico.xlsx') as writer:
    dictionary_basic_words.to_excel(writer, sheet_name='Sheet1')

### Text similarity using LSI and cosine similarity <font color='RED'>(4.2.7) </font>

In [None]:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import collections
import itertools
import re
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from gensim import corpora, models, similarities
from tqdm import tqdm_notebook 
import itertools
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def remove_stop_words(words_list, stopw):
    return list(set(words_list)-set(stopw))
def stemming(words_list):
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in words_list]
def tokenize_column(colonne, stem=True):
    # regex=True is required: both patterns are regular expressions
    tokens = (colonne.str.replace(punctuation, " ", regex=True)
                     .str.replace("[0-9]+", " ", regex=True)
                     .str.lower()
                     .apply(word_tokenize)
                     .apply(set)
                     .apply(remove_stop_words, stopw=stopwords.words('english')))
    return tokens.apply(stemming) if stem else tokens
    
def overlap_score(tokens_a, tokens_b):
    # overlap coefficient: |A ∩ B| / min(|A|, |B|), 0 when one set is empty
    set_a, set_b = set(tokens_a), set(tokens_b)
    smallest = min(len(set_a), len(set_b))
    return 0 if smallest == 0 else len(set_a & set_b) / smallest

def intersection_similarity(df_cand, df_src, colnames, score_name="score_sim_inter"):
    # constants
    column_merge = "column_merge"
    # copy the columns needed from both dataframes
    df_c = df_cand[[colnames[id_cand], colnames[tokens_cand]]].copy()
    df_s = df_src[[colnames[id_src], colnames[tokens_src]]].copy()
    # join both dataframes (cartesian product)
    df_c[column_merge]=1
    df_s[column_merge]=1
    output = df_c.merge(df_s, on=column_merge).drop(labels=[column_merge], axis=1)
    # compute similarity (uses the score_name argument, not the global score_names)
    output[score_name] = output.apply(lambda row: overlap_score(row[colnames[tokens_cand]], row[colnames[tokens_src]]), axis=1)
    return output

def cosine_similarity(df_c, df_s, colnames, score_name="score_sim_cos"):
    # constants
    vec_bow_cand = "vec_bow_cand"
    vec_lsi_src = "vec_lsi_src"
    vec_columns = {vec_bow_cand:id_cand+"_vec_bow", vec_lsi_src:id_src+"_vec_lsi"}
    # build dictionary
    dictionary = corpora.Dictionary(df_c[colnames[tokens_cand]])
    # compute vec_bow
    df_c[vec_columns[vec_bow_cand]] = df_c[colnames[tokens_cand]].map(dictionary.doc2bow)
    # compute lsi model
    lsi = models.LsiModel(df_c[vec_columns[vec_bow_cand]], id2word=dictionary, num_topics=100)
    # compute vec_lsi src
    df_s[vec_columns[vec_lsi_src]] = list(lsi[df_s[colnames[tokens_src]].map(dictionary.doc2bow)])
    # index the candidates in LSI space
    index = similarities.MatrixSimilarity(lsi[df_c[vec_columns[vec_bow_cand]]])
    # compute cosine similarity score (np.round_ was removed in NumPy 2.0)
    output = np.round(index[df_s[vec_columns[vec_lsi_src]]], 3)
    # build the output data frame
    myShape = output.shape
    output = pd.DataFrame(output.reshape((output.shape[0]*output.shape[1], 1), order='C'))
    output.columns=[score_name]
    output[colnames[id_cand]]=list(df_c[colnames[id_cand]])*myShape[0]
    output[colnames[id_src]]=[v for v in df_s[colnames[id_src]] for _ in range(myShape[1])]
    return output

# remove identical ids
def different_id_row(row, id1, id2):
    return row[id1]!=row[id2]

def top_k_similarity_word2vec(df_cand, df_src, colnames, score_names={"cos":"score_sim_cos","inter":"score_sim_inter"}, top_k=10):
    # copy of current data frame
    df_c = df_cand.copy()
    df_s = df_src.copy()
    # column names of tokenized string
    colnames2 = {id_cand: colnames[id_cand], id_src:colnames[id_src], tokens_cand:colnames[text_cand]+"_tokens", tokens_src:colnames[text_src]+"_tokens"}
    
    # cleaning, applying to lower, tokenizing, removing stop word, stemming
    df_c[colnames2[tokens_cand]] = tokenize_column(df_c[colnames[text_cand]])
    df_s[colnames2[tokens_src ]] = tokenize_column(df_s[colnames[text_src]])
    
    # compute cosine similarity score
    df_cos = cosine_similarity(df_c, df_s, colnames2, score_name=score_names["cos"])
    
    # compute intersection similarity score
    df_inter = intersection_similarity(df_c, df_s, colnames2, score_name=score_names["inter"])

    # merge both results
    output = df_cos.merge(df_inter, on=[colnames[id_cand], colnames[id_src]])
    # remove pairs where candidate and source share the same id
    output = output[output[colnames[id_cand]] != output[colnames[id_src]]]

    # select the top k candidates for each source item, ranked by intersection score
    output = output.sort_values(by=score_names["inter"], ascending=False).groupby(by=colnames[id_src]).head(top_k).reset_index(drop=True)
    return output[[colnames[id_src], colnames[id_cand], score_names["inter"], score_names["cos"]]]
#constants
punctuation = "[!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\r\n]"
id_cand="id_cand"
id_src="id_src"
text_cand="text_cand"
text_src="text_src"
tokens_cand="tokens_cand"
tokens_src="tokens_src"
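
The selection step at the end of `top_k_similarity_word2vec` (drop pairs whose two ids are equal, sort by score, keep the first `top_k` rows per source id) can be illustrated without pandas. This is only a minimal sketch on made-up scores: `top_k_per_source` and the sample tuples are hypothetical names, not part of the notebook, and tie-breaking may differ from the pandas version.

```python
from collections import defaultdict

def top_k_per_source(rows, top_k=1):
    """rows: iterable of (id_src, id_cand, score); returns the top_k rows per id_src."""
    # Drop pairs where source and candidate share the same id
    rows = [r for r in rows if r[0] != r[1]]
    # Sort by score, highest first
    rows = sorted(rows, key=lambda r: r[2], reverse=True)
    kept = defaultdict(list)
    for id_src, id_cand, score in rows:
        # Keep a row only while its source still has fewer than top_k rows
        if len(kept[id_src]) < top_k:
            kept[id_src].append((id_src, id_cand, score))
    return [row for group in kept.values() for row in group]

# Hypothetical scores for two documents against two labels
scores = [
    ("doc1", "sport", 0.20),
    ("doc1", "politics", 0.75),
    ("doc2", "sport", 0.60),
    ("doc2", "politics", 0.10),
]
best = top_k_per_source(scores, top_k=1)
```

With `top_k=1`, each document ends up with the single label that has the highest score, which is how the notebook later assigns one predicted category per document.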

In [None]:
News_cleaned = pd.read_csv('C:/Users/c28742/Downloads/research_paper/news20dataset_cleaned.csv')
Risk_taxonomy = pd.read_excel('C:/Users/c28742/Downloads/research_paper/dico_20NewsGroup_v01.xlsx')

Risk_taxonomy['dico_final_buffer'] = Risk_taxonomy['dico_final_buffer'].apply(str)
News_cleaned['text_cleaned_str'] = News_cleaned['text_cleaned_str'].apply(str)

In [None]:
column_names={id_cand:'title', id_src:"id", text_cand:"dico_final_buffer", text_src:"text_cleaned_str"}
score_names={"cos":"score_sim_cos", "inter":"score_sim_inter"}

# score every document against the taxonomy, one row at a time, then concatenate once
results = []
for i in tqdm_notebook(range(len(News_cleaned))):
    results.append(top_k_similarity_word2vec(Risk_taxonomy, News_cleaned.iloc[i:i+1], column_names, score_names, top_k=1))

sim_score = pd.concat(results)


In [None]:
# the context manager saves and closes the file on exit
with pd.ExcelWriter('result_20NewsGroup20022019.xlsx') as writer:
    sim_score.to_excel(writer, sheet_name='Sheet1')

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
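
The `normalize=True` branch of `plot_confusion_matrix` divides every row by its row total, so each cell becomes the share of a true class that was predicted as a given label. A minimal pure-Python sketch of that step follows; `normalize_rows` and the toy matrix are hypothetical, and the zero-total guard is an assumption that the notebook code does not include.

```python
def normalize_rows(cm):
    """Divide each row of a confusion matrix by its row total."""
    normalized = []
    for row in cm:
        total = sum(row)
        # Assumption: a class with no true samples gets a row of zeros
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized

# Toy 2-class confusion matrix (rows: true labels, columns: predicted labels)
cm = [[8, 2],
      [1, 9]]
cm_norm = normalize_rows(cm)
```

After normalization the diagonal holds the per-class recall, which makes classes of different sizes comparable on the same colour scale.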

In [None]:
def popularity_of_words(data_frame, name_column, nb_occurence_per_category):
    from collections import Counter
    
    # 1. Transform string rows to list rows
    data_frame[name_column] = convert_stringlist_to_string(data_frame, name_column).apply(lambda line : line.split())
    
    # 2. Stem to merge spelling variants of a given word
    data_frame[name_column] = data_frame[name_column].apply(lambda line : stem_my_list(line))
    
    # 3. Keep only the first occurrence of each stem, preserving word order
    data_frame[name_column] = data_frame[name_column].apply(lambda line : sorted(set(line), key=lambda x:line.index(x)))
    
    # 4. In how many categories does each word appear? The maximum is the number of categories
    frequency = Counter(data_frame[name_column].sum())
    
    # 5. Calculate inverse_category_frequency per word
    inverse_category_frequency_per_words = {k: 1/v for k, v in frequency.items()}
    print(inverse_category_frequency_per_words)
    
    # 6. Select the words that appear in at most nb_occurence_per_category categories
    occurence_words = [word for word in frequency if frequency[word] <= nb_occurence_per_category]
    
    # 7. Drop the words shared by too many categories
    data_frame[name_column] = data_frame[name_column].apply(lambda line : remove_common_words_from_list(occurence_words, line))
    
    # 8. Return string of unique words
    data_frame[name_column] = data_frame[name_column].apply(lambda line : ' '.join(line))
    
    return data_frame[name_column]

# ICF = Inverse Category Frequency
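
The filtering idea behind `popularity_of_words` (count in how many categories each word occurs, then keep only words that occur in few of them) can be sketched on a toy dictionary. The example skips the stemming and de-duplication steps; `keep_rare_words` and the toy keyword lists are hypothetical, not taken from the notebook.

```python
from collections import Counter

def keep_rare_words(keywords_per_category, max_categories):
    # Count in how many categories each word occurs (each category counted once)
    category_count = Counter()
    for words in keywords_per_category.values():
        category_count.update(set(words))
    # Keep, in order, only the words found in at most max_categories categories
    return {
        category: [w for w in words if category_count[w] <= max_categories]
        for category, words in keywords_per_category.items()
    }

# Toy keyword dictionary: "score" is shared by all categories, "team" by two
toy = {
    "sport": ["game", "team", "score"],
    "finance": ["market", "score", "bank"],
    "science": ["space", "team", "score"],
}
filtered = keep_rare_words(toy, max_categories=1)
```

Words shared by several categories carry little discriminative power for a similarity-based classifier, which is the motivation for dropping them.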

In [None]:
import ast
from nltk.stem import PorterStemmer

def convert_stringList_to_List(string_list):
    # parse the string form of a list, e.g. "['a', 'b']", and join its items with spaces
    buffer = ast.literal_eval(string_list)
    return (' '.join(buffer))

def convert_stringlist_to_string(dataset, column_to_apply):
    # apply the conversion row by row, leaving missing values untouched
    return (dataset.apply(lambda x: x[column_to_apply] if pd.isnull(x[column_to_apply]) else convert_stringList_to_List(x[column_to_apply]), axis=1))

def remove_common_words_from_list(list_of_unique_words, list_to_filter):
    # keep only the words of list_to_filter that belong to list_of_unique_words
    return(list(filter(lambda x: x in list_of_unique_words, list_to_filter)))

def stem_my_list(list_of_words):
    # stem every word with the Porter stemmer
    ps = PorterStemmer()
    list_of_words_stemmed = [ps.stem(word) for word in list_of_words]
    return list_of_words_stemmed

def popularity_of_words_per_category(data_frame, name_column_text):
    
    # 0. Dataframe copy save
    data_frame_s1 = data_frame.copy()
    
    # 1. Transform string rows to list rows
    # data_frame[name_column] = data_frame[name_column].apply(lambda line : line.split())
    data_frame_s1[name_column_text] = convert_stringlist_to_string(data_frame_s1, name_column_text).apply(lambda line : line.split())
    
    # 2. Stemm to put away spelling variants of a given word
    data_frame_s1[name_column_text] = data_frame_s1[name_column_text].apply(lambda line : stem_my_list(line))
    
    # 3. Keep only one word of a given spelling variant
    data_frame_s1[name_column_text] = data_frame_s1[name_column_text].apply(lambda line : sorted(set(line), key=lambda x:line.index(x)))
    
    # 4. In how many categories does each word appear? The maximum is the number of categories
    from collections import Counter
    frequency = Counter(data_frame_s1[name_column_text].sum())
    
    # 5. Return the category count per word as a plain dictionary
    popularity_per_words = dict(frequency)
    
    return popularity_per_words

def inverse_category_frequency(popularity_per_words, nb_labels):
    import math
    inverse_category_frequency_per_words = {k: math.log(nb_labels/v) for k, v in popularity_per_words.items()}
    return inverse_category_frequency_per_words

def calculate_how_many_words_appear_per_category(data_frame, name_column_text, name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category'):
        
    # 0. Dataframe copy save
    data_frame_s2 = data_frame.copy()
    
    # 1. Transform string rows to list rows
    data_frame_s2[name_column_text] = convert_stringlist_to_string(data_frame_s2, name_column_text).apply(lambda line : line.split())
    
    # 2. Stemm to put away spelling variants of a given word
    data_frame_s2[name_column_text] = data_frame_s2[name_column_text].apply(lambda line : stem_my_list(line))

    # 3. How a given word appears per category
    from collections import Counter
    data_frame_s2[name_new_colum_counter_to_create] = data_frame_s2[name_column_text].apply(lambda line : Counter(line))
    
    return data_frame_s2

def TFICF(data_frame, 
          name_column_text, 
          nb_labels,
          threshold,
          name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category'):
    
    # 0. Dataframe copy save
    data_frame_s3 = data_frame.copy()
    
    # 1. Calculate inverse_category_frequency from the number of categories containing each word
    popularity_per_words = popularity_of_words_per_category(data_frame_s3, name_column_text)
    inverse_category_frequency_per_words = inverse_category_frequency(popularity_per_words, nb_labels)
    
    # 2. Term frequency per word and per category
    data_frame_s4 = calculate_how_many_words_appear_per_category(data_frame_s3, name_column_text, name_new_colum_counter_to_create)
    
    # 3. Compute Term Frequency Inverse Category Frequency
    data_frame_s4['RESULT'] = data_frame_s4[name_new_colum_counter_to_create].apply(lambda line : {k: line[k]*inverse_category_frequency_per_words[k] for k in line})
    
    # 4. Keep only the words whose TF-ICF score exceeds the threshold (the result is a set of words)
    data_frame_s4['RESULT_v2'] = data_frame_s4['RESULT'].apply(lambda line : {k for k, v in line.items() if v > threshold})
    
    return data_frame_s4
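
To make the weighting concrete, here is a minimal worked example of the TF-ICF computation on a toy keyword dictionary. It uses the same ICF formula as `inverse_category_frequency`, log(number of labels / number of categories containing the word), but derives the number of labels from the input instead of taking `nb_labels` as a parameter and skips stemming; `tf_icf` and the toy data are hypothetical names.

```python
import math
from collections import Counter

def tf_icf(keywords_per_category):
    n_categories = len(keywords_per_category)
    # Category frequency: number of categories that contain each word
    category_freq = Counter()
    for words in keywords_per_category.values():
        category_freq.update(set(words))
    # ICF: log(number of labels / category frequency)
    icf = {w: math.log(n_categories / cf) for w, cf in category_freq.items()}
    # TF-ICF: raw count inside the category multiplied by the ICF of the word
    return {
        category: {w: tf * icf[w] for w, tf in Counter(words).items()}
        for category, words in keywords_per_category.items()
    }

# Toy dictionary: "team" appears in both categories, "game" twice in one
toy = {
    "sport": ["game", "game", "team"],
    "finance": ["market", "team"],
}
weights = tf_icf(toy)
```

A word present in every category, such as "team" here, gets an ICF of log(1) = 0, so any positive threshold removes it from all categories.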

In [None]:
import pandas as pd
dico = pd.read_excel('C:/Users/adsieg/Desktop/Google_snippet/dictionary_yahoo_with_word_freq.xlsx')
dico = dico[['class_name', 'all_keywords']]

In [None]:
df = TFICF(dico, 'all_keywords', 10, 5)

In [None]:
# the context manager saves and closes the file on exit
with pd.ExcelWriter('zied_dico.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')

In [None]:
df

# Function Aware Components 

In [None]:
def calculate_how_many_words_appear_per_category(data_frame, 
                                                 name_column_text,
                                                 name_colum_label,
                                                 name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category',
                                                 name_new_colum_label_buffer_to_create = 'label_to_use_for_outer_join'):
    
    # 0. Dataframe copy save
    data_frame_s1 = data_frame.copy()
    
    # 1. Transform string rows to list rows
    # data_frame[name_column] = data_frame[name_column].apply(lambda line : line.split())
    data_frame_s1[name_column_text] = convert_stringlist_to_string(data_frame_s1, name_column_text).apply(lambda line : line.split())
    
    # 2. Stemm to put away spelling variants of a given word
    data_frame_s1[name_column_text] = data_frame_s1[name_column_text].apply(lambda line : stem_my_list(line))

    # 3. How a given word appears per category
    from collections import Counter
    data_frame_s1[name_new_colum_counter_to_create] = data_frame_s1[name_column_text].apply(lambda line : Counter(line))
    
    # 4. For each row, list the other categories (every label except the row's own)
    all_labels = data_frame_s1[name_colum_label].unique().tolist()
    data_frame_s1[name_new_colum_label_buffer_to_create] = data_frame_s1[name_colum_label].apply(lambda line: [category for category in all_labels if category != line])
    
    # 5. Counts over the other categories are computed in calculate_how_many_words_appear_outer_category
    
    return data_frame_s1

def calculate_how_many_words_appear_outer_category(data_frame_from_before, 
                                                   name_colum_text,
                                                   name_colum_label, 
                                                   list_label_to_use_for_outer_join,
                                                   name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category',
                                                   name_new_colum_counter_to_create_OUTER = 'how_many_a_word_appear_per_category_OUTER'):
    
    # 1. Keep only the rows that belong to the other categories (copy to avoid modifying a view)
    dataframe_filtered = data_frame_from_before[data_frame_from_before[name_colum_label].isin(list_label_to_use_for_outer_join)].copy()
    
    # 2. Convert each Counter to a list of (word, count) pairs so the lists can be concatenated
    dataframe_filtered[name_new_colum_counter_to_create_OUTER] = dataframe_filtered[name_new_colum_counter_to_create].apply(lambda line : list(line.items()))
    
    # 3. Sum the counts of each word over the other categories
    from collections import Counter
    inp = [ dict([i]) for i in tuple(dataframe_filtered[name_new_colum_counter_to_create_OUTER].sum()) ]
    count = Counter()
    for y in inp:
        count += Counter(y)
    
    # 4. Divide by the total number of categories
    G = len(data_frame_from_before[name_colum_label].unique().tolist())
    sum_all_words_outer_a_given_category = {k: v/G for k, v in count.items()}
        
    return sum_all_words_outer_a_given_category

def FAC(data_frame, 
        name_column_text,
        name_colum_label,
        name_new_colum_counter_to_create = 'how_many_a_word_appear_per_category',
        name_new_colum_label_buffer_to_create = 'label_to_use_for_outer_join'): 
    
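    # per-category word counts plus, for each row, the list of the other categories
    dataframe_buffer = calculate_how_many_words_appear_per_category(data_frame, name_column_text, name_colum_label,
                                                                    name_new_colum_counter_to_create,
                                                                    name_new_colum_label_buffer_to_create)
    
    # for each row, aggregate the word counts of all the other categories
    OUTER = []
    for item_list_category in dataframe_buffer[name_new_colum_label_buffer_to_create]:
        OUTER.append(calculate_how_many_words_appear_outer_category(dataframe_buffer, name_column_text, name_colum_label, item_list_category, name_new_colum_counter_to_create))
    
    # assign as a list so the values align by position, whatever the index of the data frame
    dataframe_buffer['RESULT_OUTER'] = OUTER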
    return dataframe_buffer
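
The "outer" statistic computed by `FAC` can be checked by hand on a toy example: for each category, sum the counts of every word over all the other categories, then divide by the total number of categories, as `calculate_how_many_words_appear_outer_category` does. The sketch below assumes one row per category and skips stemming; `outer_category_average` and the toy counts are hypothetical names.

```python
from collections import Counter

def outer_category_average(counts_per_category):
    n_categories = len(counts_per_category)
    result = {}
    for category in counts_per_category:
        total = Counter()
        # Sum word counts over every category except the current one
        for other, counts in counts_per_category.items():
            if other != category:
                total.update(counts)
        # Divide by the total number of categories, as the notebook code does
        result[category] = {w: c / n_categories for w, c in total.items()}
    return result

# Toy per-category word counts
toy = {
    "sport": Counter({"game": 2, "team": 1}),
    "finance": Counter({"market": 3, "team": 1}),
    "science": Counter({"team": 2}),
}
outer = outer_category_average(toy)
```

For "sport", the other categories contribute "market" 3 times and "team" 3 times, so both words get 3 / 3 = 1.0, while "game" is absent because it never occurs outside "sport".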

In [None]:
import pandas as pd
dico = pd.read_excel('C:/Users/adsieg/Desktop/Google_snippet/dictionary_yahoo_with_word_freq.xlsx')
dico = dico[['class_name', 'all_keywords']]

list_to_consider = ['Business & Finance','Computers & Internet','Education & Reference']

In [None]:
adrien = FAC(dico, 'all_keywords', 'class_name')

In [None]:
adrien