# Gensim test notebook for the capstone project
Exploring the Gensim library.
The idea is to generate keywords for a few arXiv papers for which I already have the keywords.

## Initialisation

In [1]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
import spacy
import nltk

nltk.download('stopwords')

try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # First run: the model is not installed yet, so download it and then load it
    # (a kernel restart may be needed if the load still fails)
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load('en_core_web_sm')

# NLTK stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
#stop_words.extend(['from', 'subject', 're', 'edu', 'use']) # the list can be extended

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tof\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Gensim uses logging (unlike owlready2), so I set it up to see what is going on.

In [2]:
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

I use pandas to read a CSV data file.
It is a table containing the title, abstract, and keywords of 10 arXiv papers.

In [3]:
import pandas as pd

df = pd.read_csv("testGENSIM.csv", delimiter=';', header=0, encoding='Windows-1252')  # Excel's encoding on Windows

In [4]:
df.shape
NbTexte , _ = df.shape

In [5]:
df.head()

Unnamed: 0,Entry_id,title,summary,KeyWORDS(c),KeyWORDS
0,arXiv:1503.04220,Fuzzy Mixed Integer Optimization Model for Reg...,Mixed Integer Optimization has been a topic of...,"Mixed Integer Optimization, Fuzzy Sets, Regres...",–Mixed Integer Optimization; Fuzzy Sets; Regre...
1,arXiv:1511.02420,Design of an Alarm System for Isfahan Ozone Le...,The ozone level prediction is an important tas...,"ozone predictor, artificial intelligence, UV","– ozone predictor, artificial intelligence, UV..."
2,arXiv:1603.09728,A Survey of League Championship Algorithm: Pro...,The League Championship Algorithm (LCA) is spo...,"Global­Optimization,­ League­Championship­ Al...","­Global­Optimization,­ League­Championship­ A..."
3,arXiv:1509.00690,A Fuzzy Approach for Feature Evaluation and Di...,Web Usage Mining is the application of data mi...,"web usage mining; fuzzy c-means clustering, f...","web usage mining; fuzzy c-means clustering, f..."
4,arXiv:1501.05940,A New Efficient Method for Calculating Similar...,Web services allow communication between heter...,web service ; semantic similarity ; syntactic...,web service ; semantic similarity ; syntactic...


We start with a first cleanup pass on the text: removing parentheses, dates, years, and decimal numbers.
https://regex101.com/ can be used to test the regexes.
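
As a small illustration of this kind of cleanup, here is a sketch on a made-up sentence. The function name `strip_noise` and the sample text are assumptions for the example only, and the patterns are simple heuristics rather than a robust date or number parser.

```python
import re

def strip_noise(text):
    """Remove parentheses, dash-separated dates, 4-digit years and decimals."""
    text = text.replace("(", "").replace(")", "")
    text = re.sub(r"\d+-\d+-\d+", "", text)   # dates such as 12-05-2009
    text = re.sub(r"\b\d{4}\b", "", text)     # standalone 4-digit years
    text = re.sub(r"\d+\.\d+", "", text)      # decimal values such as 0.95
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

sample = "The model (LCA) was introduced on 12-05-2009 and revised in 2014 with accuracy 0.95."
cleaned = strip_noise(sample)
print(cleaned)

# Without word boundaries, \d{4} also eats four digits out of longer numbers
unanchored = re.sub(r"\d{4}", "", "id 12345")
anchored = re.sub(r"\b\d{4}\b", "", "id 12345")
print(repr(unanchored), repr(anchored))
```

The last two lines show why anchoring the year pattern with `\b` can matter: the unanchored version silently truncates longer numbers.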

In [8]:
import re
# Clean one summary: strip parentheses, dates, years, and decimal numbers
def correct_text(text):
    text = text.replace('(', '').replace(')', '')
    text = re.sub(r"\d+-\d+-\d+", '', text)  # remove dates (\d+ so that bare hyphens never match)
    text = re.sub(r"\d{4}", '', text)        # remove years
    text = re.sub(r"\d+\.\d+", '', text)     # remove decimal values
    return text

In [10]:
df['cleared_summary'] = df['summary'].apply(correct_text)

In [11]:
data = df.cleared_summary.values.tolist()

In [None]:
# Alternative: use the raw summaries instead (this overrides the cleaned list above)
data = df.summary.values.tolist()

In [13]:
data[2]

'The League Championship Algorithm LCA is sport-inspired optimization algorithm that was introduced by Ali Husseinzadeh Kashan in the year . It has since drawn enormous interest among the researchers because of its potential efficiency in solving many optimization problems and real-world applications. The LCA has also shown great potentials in solving non-deterministic polynomial time NP-complete problems. This survey presents a brief synopsis of the LCA literatures in peer-reviewed journals, conferences and book chapters. These research articles are then categorized according to indexing in the major academic databases Web of Science, Scopus, IEEE Xplore and the Google Scholar. The analysis was also done to explore the prospects and the challenges of the algorithm and its acceptability among researchers. This systematic categorization can be used as a basis for future studies.'

# Building the corpus

In [None]:
#text = ""
#for i in range(NbTexte):
#    text = text + " " + data.summary[i]

In [None]:
print(data)

In [None]:
data[0]

TODO: remove years, dates, and parentheses.

In [None]:
#import re
# Remove Emails
#data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]
# Remove new line characters
#data = [re.sub('\s+', ' ', sent) for sent in data]
# Remove distracting single quotes
#data = [re.sub("\'", "", sent) for sent in data]
#print(data[:1])

In [None]:
print(data)

In [None]:
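# Tokenize: split the first document into sentences, then each sentence into words
# (split() without an argument also drops the empty strings left by repeated spaces)
lmots = [phrase.split() for phrase in data[0].split(".")]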

# Create dictionary
dictionary = corpora.Dictionary(lmots)
print(dictionary)

In [None]:
# Show the word to id map
print(dictionary.token2id)

In [None]:
print(lmots)

In [None]:
from collections import defaultdict  # for word frequencies
word_freq = defaultdict(int)
for sent in data:
    print("sent:", sent)
    for mot in sent.split(" "):
        word_freq[mot] += 1
len(word_freq)

In [None]:
sorted(word_freq, key=word_freq.get, reverse=True)[:30]

In [None]:
# Phrases expects tokenized sentences (lists of words), not raw strings
phrases = gensim.models.Phrases([doc.split() for doc in data], min_count=30, progress_per=10000)

In [None]:
print(phrases)

# Tokenize words and cleanup the text

In [None]:
def sent_to_words(sentences):  # from https://medium.com/analytics-vidhya/topic-modeling-using-gensim-lda-in-python-48eaa2344920
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess itself lowercases and drops punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
data_words = list(sent_to_words(data))
print(data_words)

## Creating Bigram and Trigram models

Bigrams are two words that frequently occur together in a document; trigrams are three words that frequently occur together. Many other techniques that matter in an NLP pipeline are explained in part 1 of the blog, which is worth reading. The two main arguments of Phrases are min_count and threshold. The higher these values, the harder it is for words to be combined into a bigram.
The documentation at https://radimrehurek.com/gensim/models/phrases.html describes the module as follows: "Automatically detect common phrases – aka multi-word expressions, word n-gram collocations – from a stream of sentences."
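
To make the role of min_count and threshold more concrete, here is a pure-Python sketch of the default scoring formula given in the Gensim documentation, (count(ab) - min_count) * N / (count(a) * count(b)). It is a simplified illustration, not Gensim's implementation: `bigram_scores` is a hypothetical helper, the sentences are made up, and N is taken to be the number of distinct words, which is an assumption.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs with (count(ab) - min_count) * N / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # assumption: N = number of distinct words
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "is", "useful"],
    ["deep", "learning", "is", "fun"],
]
scores = bigram_scores(sentences, min_count=1)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))

# A pair is merged only if its score exceeds the threshold
threshold = 1.2
kept = [pair for pair, score in scores.items() if score > threshold]
print(kept)
```

Raising min_count lowers every score (pairs seen only min_count times drop to zero), and raising threshold shrinks the list of kept pairs, which is why both make bigrams harder to form.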

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=10)  # higher threshold, fewer phrases (previously 100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=10)  # previously 100
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

In [None]:
# Define function for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Load the spaCy English model. All pipeline components are kept here;
# passing disable=['parser', 'ner'] would skip the components not needed for lemmatization and run faster.
# python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # originally just 'en'

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])
print(data_words_nostops)

In [None]:
print(data_words_bigrams)

In [None]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

## Create Dictionary and Corpus needed for Topic Modeling
Make sure the dictionary (id2word) and the corpus are clean; otherwise the topics may be of poor quality.

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1])

    Gensim assigns a unique id to each word in the corpus. Each document becomes a list of (word_id, word_frequency) pairs. Example: a pair (8, 2) indicates that word_id 8 occurs twice in the document, and so on.
    This is used as input to the LDA model.

If you want to see which word corresponds to a given id, pass the id as a key to the dictionary. Example: id2word[4].

    A human-readable form of the corpus can be obtained by running the code block below.

In [None]:
[[(id2word[word_id], freq) for word_id, freq in cp] for cp in corpus[:1]]
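
The bag-of-words mapping described above can be sketched in plain Python. This is only an illustration of the idea, not how Gensim implements `Dictionary` or `doc2bow`: the helpers `build_vocab` and `to_bow` and the toy documents are assumptions, and Gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["fuzzy", "set", "regression", "fuzzy"], ["web", "service", "fuzzy"]]
token2id = build_vocab(docs)
bow = to_bow(docs[0], token2id)
print(bow)

# Map ids back to words to get a readable view of the same vector
id2token = {i: t for t, i in token2id.items()}
readable = [(id2token[i], n) for i, n in bow]
print(readable)
```

Word order is lost in this representation: only the counts per word id remain, which is all LDA uses.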

## Building topic model

Parameters of LDA

    Alpha and beta are hyperparameters: alpha controls document-topic density and beta controls topic-word density (Gensim exposes beta as the eta argument). chunksize is the number of documents used in each training chunk, update_every determines how often the model parameters are updated, and passes is the total number of training passes.
    The best number of topics depends on the kind of corpus, its size, and the number of topics you expect to see.
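
To give a feel for what "document-topic density" means, here is a small standard-library sketch that samples topic mixtures from a symmetric Dirichlet prior with a low and a high alpha. It only illustrates the prior and does not train anything; `sample_dirichlet` and `mean_max_share` are hypothetical helpers, and the values of alpha, k, and n are arbitrary choices.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_max_share(alpha, k=5, n=2000, seed=0):
    """Average weight of the largest topic across n sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

sparse = mean_max_share(alpha=0.1)   # low alpha: one topic tends to dominate
dense = mean_max_share(alpha=10.0)   # high alpha: weight is spread over topics
print(round(sparse, 2), round(dense, 2))
```

With a low alpha, each document is expected to be mostly about one topic; with a high alpha, documents are expected to mix many topics. The same intuition applies to beta for the words within a topic.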

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=NbTexte // 2,  # must be an integer
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

View topics in LDA model

    Each topic is a combination of keywords, and each keyword contributes a certain weight to the topic.
    You can see the keywords of each topic and the weight of each keyword using lda_model.print_topics().

In [None]:
# Print the keyword of topics
print(lda_model.print_topics())
doc_lda = lda_model[corpus]

In [None]:


lda_model.show_topic(4)



In [None]:


for topic_id in range(lda_model.num_topics):
    topk = lda_model.show_topic(topic_id, 10)
    topk_words = [ w for w, _ in topk ]

    print('{}: {}'.format(topic_id, ' '.join(topk_words)))



Topics 1 and 2 are very close. I produced too many topics, so I will generate only 5.
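
One possible way to check how close two topics are, instead of judging by eye, is to compare their top words with a Jaccard overlap. The sketch below uses made-up word lists; `jaccard` is a hypothetical helper, and in the notebook the lists could come from `lda_model.show_topic(i, 10)`.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of words common to both lists, relative to all words in either."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical top words per topic, e.g. [w for w, _ in lda_model.show_topic(i, 10)]
topics = {
    0: ["web", "service", "similarity", "semantic"],
    1: ["fuzzy", "model", "optimization", "regression"],
    2: ["fuzzy", "model", "optimization", "cluster"],
}
overlaps = {(i, j): jaccard(topics[i], topics[j]) for i, j in combinations(topics, 2)}
for pair, score in overlaps.items():
    print(pair, round(score, 2))

closest = max(overlaps, key=overlaps.get)
print("most similar pair:", closest)
```

A high overlap between two topics suggests that the number of topics could be reduced.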


## Predicting the topics for a document

If you have a new document, you can use the trained model to estimate the topic proportions for it.

This is done in two steps: first, the document is converted into a bag-of-words vector using the model's dictionary, and then the inference is carried out.


In [None]:
data[0]

In [None]:
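# Apply the same preprocessing as for training (stop words, bigrams, lemmas);
# raw tokens from data[0].split() may not match the dictionary entries
doc = lemmatization(make_bigrams(remove_stopwords([data[0]])))[0]

doc_vector = lda_model.id2word.doc2bow(doc)
doc_topics = lda_model[doc_vector]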



The result shows the predicted topic distribution. In most cases, there will be one or more dominant topics, and small probabilities for the rest of the topics.

For instance, for the document "this book describes Windows software", we would typically get a mix of a book-related topic and a software-related topic. (Compare with the topic list you got above.) Again, the exact result will vary between executions because of random number generation.
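
Picking the dominant topic from such a result can be sketched as follows. The numbers are invented and `dominant_topic` is a hypothetical helper. Note that when the model is built with per_word_topics=True, as in this notebook, indexing the model returns a 3-tuple whose first element is the topic distribution.

```python
def dominant_topic(topic_dist):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_dist, key=lambda pair: pair[1])

# Hypothetical result shaped like lda_model[doc_vector] when per_word_topics=True:
# (topic distribution, per-word topics, per-word phi values)
result = ([(0, 0.05), (1, 0.62), (2, 0.33)], [], [])
topic_dist = result[0]

best_id, best_prob = dominant_topic(topic_dist)
print(best_id, best_prob)
```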


In [None]:
doc_topics

In [None]:
# find_phrases is a method of a trained Phrases model and takes a list of tokenized sentences
for phrase, score in bigram.find_phrases([data_words[0]]).items():
    print(phrase, score)