Hello everyone,
I hope you are all doing well!
I am writing to share the subject of the final Natural Language Processing project,
whose theme is the exploration of Topic Modeling methodologies.

## Context

As mentioned during the course, Topic Modeling is a subfield of Natural
Language Processing that aims to extract the main discussion topics from a
text corpus. A discussion topic is a group of words or sentences that share a
specific theme.
The task is therefore unsupervised and closely resembles clustering.

Many clustering methods exist (K-means, DBSCAN, agglomerative clustering,
etc.). However, textual data has particularities that have led to the creation of
dedicated algorithms that take advantage of them.

Topic Modeling methodologies can be applied to any source of text: comments
posted on social networks, scientific articles, Wikipedia pages, product reviews,
etc. The type of corpus analyzed strongly conditions the methodology to use (you
will probably find that the shorter the texts, the harder the task).
For this project, you will work on the News Category Dataset used in
TD NLP #2 - Data Pipeline, which contains 200K English-language news headlines.

## Assignment

The objective of this project is to study the performance of Topic Modeling
algorithms on the data mentioned above. To do so, you will have to:

1. Analyze the text corpus to extract its specific characteristics (average length, types of words used, most frequent words, stopwords, etc.).

2. Select 3 Topic Modeling / Clustering methodologies that seem suited to the data at hand.
3. Define one or more metrics to measure the quality of your models.
   a. Note: you may use the headline categories provided in the dataset. However, this is first and foremost a clustering problem, so you must also put forward metrics that take this into account.

4. Run comparative tests of each of the models you selected.

5. Conclude on the best methodology for your case and outline possible improvements to your analysis.

Remember to justify your choices!
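
As one illustration of point 3, a label-based score such as cluster purity can be computed directly from the headline categories. This is only a sketch: `cluster_purity` and the toy labels are hypothetical names introduced here, and because purity relies on the provided categories it would need to be paired with an intrinsic clustering metric (a silhouette score, for example).

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, categories):
    """Fraction of items whose category is the majority category of their cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, category in zip(cluster_ids, categories):
        by_cluster[cluster_id].append(category)
    # For each cluster, count how many items belong to its most common category.
    majority_total = sum(
        Counter(cats).most_common(1)[0][1] for cats in by_cluster.values()
    )
    return majority_total / len(categories)


clusters = [0, 0, 1, 1]                         # hypothetical cluster assignments
labels = ["SPORTS", "SPORTS", "TECH", "SPORTS"]  # hypothetical true categories
print(cluster_purity(clusters, labels))  # 0.75
```

A purity of 1.0 means every cluster contains a single category; the score tends to rise with the number of clusters, which is one reason not to rely on it alone.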

## Submission guidelines

### • Team size: 3 people

### • Deliverable format: Jupyter Notebook presenting the results of the study
o Imports to be placed in the first cell of the project
o To be uploaded to Sharepoint at this link
o File name: Nom1_Nom2_Nom3_ProjectTopicModeling.ipynb

### • Evaluation criteria:

o Quality of the study of the corpus characteristics and of the selection of
methodologies to test /5

o Quality of the selected and/or created metrics /5

o Quality of the analysis of the different methodologies /5

o Quality of the final conclusion /5


### • Due date: 26/11/2021 (if this conflicts with your exams, let me know)

## Resources

In the Teams folder linked to this course you will find three documents that may
help you with this project:

• The final version of the course you attended, available here

• A brief presentation on Topic Modeling covering different types of
methodologies, available here

• The corrected NLP Data Pipeline TD, which will serve as the basis for this
project, available here

Could you please add your teams to the Excel file available at this link once
they are formed?

If you have any questions, do not hesitate to contact me by replying to this email!
Have a nice day,

# I. Dataset Exploration

In [1]:
# Import the libraries needed for our computations
import itertools
import os
import re
import secrets
import string

import numpy as np
import pandas as pd
import spacy

from itertools import chain

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models import Word2Vec, Phrases, KeyedVectors
from gensim.models.phrases import Phraser
from gensim.utils import simple_preprocess
from nltk.corpus import wordnet
# from pattern.en import pluralize, singularize
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from tqdm import tqdm

from spacy.parts_of_speech import IDS as POS_map

In [2]:
# Load the dataset
dataset = pd.read_json("News_Category_Dataset_v2.json", lines=True, dtype={"headline": str})

In [3]:
dataset

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
...,...,...,...,...,...,...
200848,TECH,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,"Reuters, Reuters",https://www.huffingtonpost.com/entry/rim-ceo-t...,Verizon Wireless and AT&T are already promotin...,2012-01-28
200849,SPORTS,Maria Sharapova Stunned By Victoria Azarenka I...,,https://www.huffingtonpost.com/entry/maria-sha...,"Afterward, Azarenka, more effusive with the pr...",2012-01-28
200850,SPORTS,"Giants Over Patriots, Jets Over Colts Among M...",,https://www.huffingtonpost.com/entry/super-bow...,"Leading up to Super Bowl XLVI, the most talked...",2012-01-28
200851,SPORTS,Aldon Smith Arrested: 49ers Linebacker Busted ...,,https://www.huffingtonpost.com/entry/aldon-smi...,CORRECTION: An earlier version of this story i...,2012-01-28


The dataset has 6 variables (category, headline, authors, link, short_description and date) and a shape of 200853 × 6. It contains English-language news articles labeled by category.
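
One corpus characteristic worth checking early is headline length, since short texts tend to be harder for topic models. The sketch below computes a few length statistics on three sample headlines from this dataset. `headline_length_stats` is a hypothetical helper; on the full DataFrame the same idea could be expressed as `dataset["headline"].str.split().str.len().describe()`.

```python
from statistics import mean, median


def headline_length_stats(headlines):
    """Return basic statistics on the number of whitespace-separated words per headline."""
    lengths = [len(headline.split()) for headline in headlines]
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


sample_headlines = [
    "there were 2 mass shootings in texas last week but only 1 on tv",
    "hugh grant marries for the first time at age 57",
    "what to watch on amazon prime thats new this week",
]
stats = headline_length_stats(sample_headlines)
print(stats)
```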

# Objective

Isolate the headline variable and process the texts in order to group them into themes using unsupervised Topic Modeling methods (clustering).

# Before word processing
### This part is not in the pipeline, but it lets us apply the pipeline to more readable texts

First change: we work only on the headlines of the dataset

In [4]:
df = pd.DataFrame(dataset["headline"])
df

Unnamed: 0,headline
0,There Were 2 Mass Shootings In Texas Last Week...
1,Will Smith Joins Diplo And Nicky Jam For The 2...
2,Hugh Grant Marries For The First Time At Age 57
3,Jim Carrey Blasts 'Castrato' Adam Schiff And D...
4,Julianna Margulies Uses Donald Trump Poop Bags...
...,...
200848,RIM CEO Thorsten Heins' 'Significant' Plans Fo...
200849,Maria Sharapova Stunned By Victoria Azarenka I...
200850,"Giants Over Patriots, Jets Over Colts Among M..."
200851,Aldon Smith Arrested: 49ers Linebacker Busted ...


We tokenize the headlines, lowercase the words and remove punctuation from the dataset

In [5]:
# Lowercase the headlines (df becomes a Series of strings)
df = df["headline"].str.lower()
df

0         there were 2 mass shootings in texas last week...
1         will smith joins diplo and nicky jam for the 2...
2           hugh grant marries for the first time at age 57
3         jim carrey blasts 'castrato' adam schiff and d...
4         julianna margulies uses donald trump poop bags...
                                ...                        
200848    rim ceo thorsten heins' 'significant' plans fo...
200849    maria sharapova stunned by victoria azarenka i...
200850    giants over patriots, jets over colts among  m...
200851    aldon smith arrested: 49ers linebacker busted ...
200852    dwight howard rips teammates after magic loss ...
Name: headline, Length: 200853, dtype: object

In [6]:
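# Remove punctuation; regex=True is required on recent pandas versions,
# where str.replace treats the pattern as a literal string by default
df = df.str.replace(r'[^\w\s]', '', regex=True)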

df

0         there were 2 mass shootings in texas last week...
1         will smith joins diplo and nicky jam for the 2...
2           hugh grant marries for the first time at age 57
3         jim carrey blasts castrato adam schiff and dem...
4         julianna margulies uses donald trump poop bags...
                                ...                        
200848    rim ceo thorsten heins significant plans for b...
200849    maria sharapova stunned by victoria azarenka i...
200850    giants over patriots jets over colts among  mo...
200851    aldon smith arrested 49ers linebacker busted f...
200852    dwight howard rips teammates after magic loss ...
Name: headline, Length: 200853, dtype: object
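
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace. A small sketch with the standard `re` module (the sample strings and the `strip_punctuation` name are illustrative) shows one side effect: apostrophes vanish too, so contractions and possessives are fused into a single token, which explains tokens such as `thats` later in the notebook.

```python
import re

# Same pattern as above: anything that is not a word character or whitespace.
PUNCTUATION = re.compile(r"[^\w\s]")


def strip_punctuation(text):
    """Delete every punctuation character from a string."""
    return PUNCTUATION.sub("", text)


print(strip_punctuation("jim carrey blasts 'castrato' adam schiff"))
# jim carrey blasts castrato adam schiff
print(strip_punctuation("what to watch on amazon prime that's new this week"))
# what to watch on amazon prime thats new this week
```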

In [8]:
# Split each headline into its words
def dummy_word_split(texts):
    """Identify the words of each text in a deliberately naive way.

    Argument:
        - texts (list of str): the raw texts in which we'd like to identify words

    Return:
        - list of lists, each containing the words of one text.
    """
    # str.split(" ") cuts on every single space character
    return [text.split(" ") for text in texts]


In [65]:
splitted_text =  dummy_word_split(df.tolist())

In [10]:
splitted_text

[['there',
  'were',
  '2',
  'mass',
  'shootings',
  'in',
  'texas',
  'last',
  'week',
  'but',
  'only',
  '1',
  'on',
  'tv'],
 ['will',
  'smith',
  'joins',
  'diplo',
  'and',
  'nicky',
  'jam',
  'for',
  'the',
  '2018',
  'world',
  'cups',
  'official',
  'song'],
 ['hugh',
  'grant',
  'marries',
  'for',
  'the',
  'first',
  'time',
  'at',
  'age',
  '57'],
 ['jim',
  'carrey',
  'blasts',
  'castrato',
  'adam',
  'schiff',
  'and',
  'democrats',
  'in',
  'new',
  'artwork'],
 ['julianna',
  'margulies',
  'uses',
  'donald',
  'trump',
  'poop',
  'bags',
  'to',
  'pick',
  'up',
  'after',
  'her',
  'dog'],
 ['morgan',
  'freeman',
  'devastated',
  'that',
  'sexual',
  'harassment',
  'claims',
  'could',
  'undermine',
  'legacy'],
 ['donald',
  'trump',
  'is',
  'lovin',
  'new',
  'mcdonalds',
  'jingle',
  'in',
  'tonight',
  'show',
  'bit'],
 ['what',
  'to',
  'watch',
  'on',
  'amazon',
  'prime',
  'thats',
  'new',
  'this',
  'week'],
 ['mike'

We notice that our headlines contain numbers, which will not be useful for clustering, so we remove them as well.

In [11]:
# Remove digits; regex=True is required on recent pandas versions
df = df.str.replace(r'\d+', '', regex=True)


In [64]:
splitted_text =  dummy_word_split(df.tolist())
splitted_text

[['there',
  'were',
  '',
  'mass',
  'shootings',
  'in',
  'texas',
  'last',
  'week',
  'but',
  'only',
  '',
  'on',
  'tv'],
 ['will',
  'smith',
  'joins',
  'diplo',
  'and',
  'nicky',
  'jam',
  'for',
  'the',
  '',
  'world',
  'cups',
  'official',
  'song'],
 ['hugh', 'grant', 'marries', 'for', 'the', 'first', 'time', 'at', 'age', ''],
 ['jim',
  'carrey',
  'blasts',
  'castrato',
  'adam',
  'schiff',
  'and',
  'democrats',
  'in',
  'new',
  'artwork'],
 ['julianna',
  'margulies',
  'uses',
  'donald',
  'trump',
  'poop',
  'bags',
  'to',
  'pick',
  'up',
  'after',
  'her',
  'dog'],
 ['morgan',
  'freeman',
  'devastated',
  'that',
  'sexual',
  'harassment',
  'claims',
  'could',
  'undermine',
  'legacy'],
 ['donald',
  'trump',
  'is',
  'lovin',
  'new',
  'mcdonalds',
  'jingle',
  'in',
  'tonight',
  'show',
  'bit'],
 ['what',
  'to',
  'watch',
  'on',
  'amazon',
  'prime',
  'thats',
  'new',
  'this',
  'week'],
 ['mike',
  'myers',
  'reveals',
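
The empty strings in the output above come from splitting on a single space: once a number is deleted, two spaces sit side by side and `split(" ")` returns an empty token between them. A minimal sketch of the difference with the argument-free `split()`, which collapses runs of whitespace (the sample string is illustrative):

```python
# Double space where the digit "2" was removed
headline = "there were  mass shootings in texas"

naive_tokens = headline.split(" ")
clean_tokens = headline.split()

print(naive_tokens)
# ['there', 'were', '', 'mass', 'shootings', 'in', 'texas']
print(clean_tokens)
# ['there', 'were', 'mass', 'shootings', 'in', 'texas']
```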


# Word frequency analysis

In [13]:
def compute_word_occurences(texts):
    """Count how many times each word appears across all tokenized texts."""

    words = itertools.chain.from_iterable(texts)  # Flatten the lists of words into a single iterable
    
    word_count = pd.Series(words).value_counts()
    word_count = pd.DataFrame({"Word": word_count.index, "Count": word_count.values})

    return word_count
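
For reference, the same count can be sketched with only the standard library. `count_words` is a hypothetical name, and `collections.Counter.most_common` returns the words sorted by decreasing frequency, much like `value_counts`.

```python
from collections import Counter
from itertools import chain


def count_words(tokenized_texts, top_n=None):
    """Return (word, count) pairs sorted by decreasing count."""
    # chain.from_iterable flattens the list of token lists without copying it.
    return Counter(chain.from_iterable(tokenized_texts)).most_common(top_n)


sample = [
    ["watch", "amazon", "prime", "new", "week"],
    ["watch", "hulu", "new", "week"],
]
top_words = count_words(sample, top_n=3)
print(top_words)
```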

In [14]:
occurences = compute_word_occurences(splitted_text).head(1000)
occurences

Unnamed: 0,Word,Count
0,the,62071
1,to,50125
2,,39884
3,of,31923
4,a,31613
...,...,...
995,movies,274
996,press,273
997,sunday,273
998,experts,273


The most frequent words carry no useful information for the processing to come. They are stopwords, so we remove them.

# Stopword removal
NLTK is a leading platform for building Python programs that work with human language data.

It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python" and "an amazing library to play with natural language". Natural Language Processing with Python provides a practical introduction to programming for language processing.


In [15]:
# Import stopwords with nltk.
import nltk
# Opens the interactive downloader; nltk.download("stopwords") would fetch only the stopword list.
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [16]:
from nltk.corpus import stopwords
# Display the stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [17]:
stop = stopwords.words('english')

In [78]:
# Remove the stopwords from splitted_text
the = [""," ","i", "me", "my", "myself", "we", "our","us", "ours", "ourselves", "you", "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "she's", "her", "hers", "herself", "it", "it's", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that","thats", "that'll", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't","the"]
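# Keep only the tokens that are not stopwords. Unlike list.remove, which drops
# only the first match, this filters out every occurrence in each headline.
stopword_set = set(the)
splitted_text = [[word for word in words if word not in stopword_set] for words in splitted_text]
splitted_text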
    

[['mass', 'shootings', 'texas', 'last', 'week', 'tv'],
 ['smith',
  'joins',
  'diplo',
  'nicky',
  'jam',
  'world',
  'cups',
  'official',
  'song'],
 ['hugh', 'grant', 'marries', 'first', 'time', 'age'],
 ['jim',
  'carrey',
  'blasts',
  'castrato',
  'adam',
  'schiff',
  'democrats',
  'new',
  'artwork'],
 ['julianna',
  'margulies',
  'uses',
  'donald',
  'trump',
  'poop',
  'bags',
  'pick',
  'dog'],
 ['morgan',
  'freeman',
  'devastated',
  'sexual',
  'harassment',
  'claims',
  'could',
  'undermine',
  'legacy'],
 ['donald',
  'trump',
  'lovin',
  'new',
  'mcdonalds',
  'jingle',
  'tonight',
  'show',
  'bit'],
 ['watch', 'amazon', 'prime', 'new', 'week'],
 ['mike',
  'myers',
  'reveals',
  'hed',
  'like',
  'fourth',
  'austin',
  'powers',
  'film'],
 ['watch', 'hulu', 'new', 'week'],
 ['justin', 'timberlake', 'visits', 'texas', 'school', 'shooting', 'victims'],
 ['south',
  'korean',
  'president',
  'meets',
  'north',
  'koreas',
  'kim',
  'jong',
  'un',
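
One detail to keep in mind when filtering tokens: `list.remove` deletes only the first occurrence of a value, so a stopword that appears twice in a headline survives a single pass. A minimal sketch on a made-up token list, comparing it with a list comprehension over a set of stopwords (`toy_stopwords` is illustrative, not the list used above):

```python
toy_stopwords = {"the", "and"}
tokens = ["the", "cat", "and", "the", "dog"]

# Approach 1: one list.remove call per stopword
with_remove = list(tokens)
for word in toy_stopwords:
    if word in with_remove:
        with_remove.remove(word)  # removes the first match only

# Approach 2: keep every token that is not a stopword
with_comprehension = [token for token in tokens if token not in toy_stopwords]

print(with_remove)         # ['cat', 'the', 'dog']
print(with_comprehension)  # ['cat', 'dog']
```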


In [79]:
occurences = compute_word_occurences(splitted_text).head(1000)
occurences

Unnamed: 0,Word,Count
0,photos,10277
1,trump,9569
2,new,8554
3,video,5770
4,donald,4589
...,...,...
995,vegas,253
996,actor,253
997,center,252
998,longer,252


The most frequent token is an empty or blank string, so we remove these tokens in order to stop processing them.

In [75]:
# Drop empty and blank tokens (every occurrence, not just the first one)
space = {"", " "}
splitted_text = [[word for word in words if word not in space] for words in splitted_text]
splitted_text

[['there',
  'were',
  'mass',
  'shootings',
  'in',
  'texas',
  'last',
  'week',
  'but',
  'only',
  'on',
  'tv'],
 ['will',
  'smith',
  'joins',
  'diplo',
  'and',
  'nicky',
  'jam',
  'for',
  'the',
  'world',
  'cups',
  'official',
  'song'],
 ['hugh', 'grant', 'marries', 'for', 'the', 'first', 'time', 'at', 'age'],
 ['jim',
  'carrey',
  'blasts',
  'castrato',
  'adam',
  'schiff',
  'and',
  'democrats',
  'in',
  'new',
  'artwork'],
 ['julianna',
  'margulies',
  'uses',
  'donald',
  'trump',
  'poop',
  'bags',
  'to',
  'pick',
  'up',
  'after',
  'her',
  'dog'],
 ['morgan',
  'freeman',
  'devastated',
  'that',
  'sexual',
  'harassment',
  'claims',
  'could',
  'undermine',
  'legacy'],
 ['donald',
  'trump',
  'is',
  'lovin',
  'new',
  'mcdonalds',
  'jingle',
  'in',
  'tonight',
  'show',
  'bit'],
 ['what',
  'to',
  'watch',
  'on',
  'amazon',
  'prime',
  'thats',
  'new',
  'this',
  'week'],
 ['mike',
  'myers',
  'reveals',
  'hed',
  'like',
  '

In [43]:
type(splitted_text)

list

In [21]:
occurences = compute_word_occurences(splitted_text).head(1000)
occurences

Unnamed: 0,Word,Count
0,photos,10277
1,trump,9569
2,new,8554
3,video,5770
4,donald,4589
...,...,...
995,iraq,253
996,fail,253
997,vegas,253
998,crash,252


The stopwords have indeed been removed!



# Pipeline

1. **Ensuring data quality.** You have to make sure that there's no N/A in your data and that everything is in the right format and shape. Having this as the entrance of your pipeline will save you a lot of time in the long run, so try defining it thoroughly.


2. **Filtering texts from unwanted characters**. Especially if you get data from the web, you'll end up with HTML tags or encoding stuff that you don't need in your texts. Before applying anything to them, you need to get them cleaned up. Here, try removing the dates and the punctuation for instance.


3. **Unify your texts**. (*This is topic modeling specific*). You don't want to make the difference between a word at the beginning of a phrase or in the middle of it here. You should unify all your words by lowercasing them and deaccenting them as well.


4. **Converting sentences to lists of words**. Some words aren't needed for our analyses, such as *your*, *my*, etc. In order to remove them easily, you have to convert your sentences to lists of words. You can use the dummy function defined above but I'd advise against it. Try finding a function that does that smoothly in [gensim.utils](https://radimrehurek.com/gensim/utils.html)!


5. **Remove useless words**. You need to remove useless words from your corpus. You have two approaches: [use a hard defined list of stopwords](https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/) or rely on TF-IDF to identify useless words. The first is the simplest, the second might yield better results!


6. **Creating n-grams**. If you look at New York, it is composed of two words. As a result, a word count wouldn't really return a true count for *New York* per se. In NLP, we represent New York as New_York, which is considered a single word. The n-gram creation consists in identifying words that occur together often and regrouping them. It boosts interpretability for topic modeling in this case.


7. **Stemming / Lemmatization**. Shouldn't run, running, runnable be grouped and counted as a single word when we're identifying discussion topics? Yes, they should. Stemming is the process of cutting words to their word root (run- for instance) quite brutally while lemmatization will do the same by identifying the kind of word it is working on. You should convert the corpus words into those truncated representations to have a more realistic word count.


8. **Part of speech tagging**. POS helps in the identification of verbs, nouns, adjectives, etc. For topic models, it is a good idea to work only on verbs and nouns. Adjectives don't convey info about the actual underlying topic discussed at hand.
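
To make the TF-IDF option of step 5 concrete, here is a minimal sketch of the inverse document frequency using the textbook formula idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. Words that occur in nearly every document get a score close to zero and are candidates for removal. The helper name and toy corpus are illustrative, and library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant, so their values will differ.

```python
import math
from collections import Counter


def inverse_document_frequency(tokenized_docs):
    """Textbook IDF: log(number of documents / number of documents containing the word)."""
    n_docs = len(tokenized_docs)
    # set(doc) ensures each word is counted at most once per document.
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {word: math.log(n_docs / freq) for word, freq in doc_freq.items()}


docs = [
    ["the", "trump", "summit"],
    ["the", "world", "cup"],
    ["the", "new", "film"],
]
idf = inverse_document_frequency(docs)
print(idf["the"])    # 0.0, present in every document
print(idf["trump"])  # log(3), present in one document out of three
```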

So far, we have ensured the quality of our data, unified our texts, converted our sentences into lists of words and removed the stopwords.

According to the pipeline, we still have to:

1. **Apply regexes**
2. **Create n-grams**
3. **Lemmatization / Stemming**
4. **Part-of-speech tagging**

# Regex filtering

In [80]:
def filter_text(texts_in):
    """Removes incorrect patterns from a list of texts, such as hyperlinks, bullet points and so on"""
    
    texts_out = re.sub(r'https?:\/\/[A-Za-z0-9_.-~\-]*', ' ', str(texts_in), flags=re.MULTILINE)
    texts_out = re.sub(r'[(){}\[\]<>]', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'&amp;#.*;', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'&gt;', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'â€™', "'", texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'\s+', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'&#x200B;', ' ', texts_out, flags=re.MULTILINE)
    # Mail regex
    # This regex is correct but WAY TOO LONG to process. So we skip it with a simpler version
    # texts_out = re.sub(r"(?i)(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])", '', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+\.[a-zA-Z0-9-_.]+', '', texts_out, flags=re.MULTILINE)
    # Phone regex
    # This regex is correct but WAY TOO LONG to process. So we skip it with a simpler version
    # texts_out = re.sub(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", '', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r"\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}", '', texts_out, flags=re.MULTILINE)
    # Remove names in twitter
    texts_out = re.sub(r'@\S+( |\n)', '', texts_out, flags=re.MULTILINE)

    # Remove stars commonly used on social media
    texts_out = re.sub(r'\*', '', texts_out, flags=re.MULTILINE)
    return texts_out
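
As a quick check of the first rule, the sketch below applies the same URL pattern, followed by the whitespace collapse, to a made-up sentence. Note that inside the character class, `.-~` is read as a range from `.` to `~`, which happens to include `/`, `:` and `?`, so the whole path of the link is removed. `strip_urls` is a hypothetical helper that isolates these two steps.

```python
import re

# Same URL pattern as in filter_text
URL_PATTERN = re.compile(r"https?:\/\/[A-Za-z0-9_.-~\-]*")


def strip_urls(text):
    """Replace URLs with a space, then collapse repeated whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()


print(strip_urls("read https://www.huffingtonpost.com/entry/example?x=1 now"))
# read now
```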


In [81]:
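# filter_text expects a string, so each token list is joined back into a sentence first
splitted_text1 = [filter_text(" ".join(tokens)) for tokens in splitted_text]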

In [82]:
def sent_to_words(sentences):
    """Converts sentences to words.

    Convert sentences in lists of words while removing the accents and the punctuation.

    @param:
        sentences: a list of strings, the sentences we want to convert
    @return
        A list of words' lists.
    """
    for sentence in tqdm(sentences):
        yield (simple_preprocess(str(sentence), deacc=True))


In [83]:
# Tokenize the regex-filtered texts so that the filtering step is actually used
texts = list(sent_to_words(splitted_text1))

100%|███████████████████████████████████████████████████████████████████████| 200853/200853 [00:06<00:00, 31663.03it/s]
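
`simple_preprocess` lowercases the text, splits it into tokens (by default discarding those shorter than 2 or longer than 15 characters) and, with `deacc=True`, strips accents. The accent-removal step can be approximated with the standard library as sketched below; `strip_accents` is a hypothetical stand-in written for illustration, not gensim's own code.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, e.g. 'é' becomes 'e'."""
    # NFD splits each accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    without_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", without_marks)


print(strip_accents("beyoncé visits são paulo"))
# beyonce visits sao paulo
```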


# Creating n-grams

In [86]:
def create_bigrams(texts, bigram_count=15, threshold=10, convert_sent_to_words=False, as_str=True):
    """Identify bigrams in texts and return the texts with bigrams integrated"""
    if convert_sent_to_words:
        texts = list(sent_to_words(texts))
    
    bigram_model = Phraser(Phrases(texts, min_count=bigram_count, threshold=threshold))
    
    if as_str:
        return [" ".join(bigram_model[t]) for t in texts]

    else:
        return [bigram_model[t] for t in texts]

In [87]:
texts_t = create_bigrams(texts)

In [89]:
texts_t

['mass_shootings texas last week tv',
 'smith joins diplo nicky jam world cups official song',
 'hugh grant marries first_time age',
 'jim_carrey blasts castrato adam schiff democrats new artwork',
 'julianna margulies uses donald_trump poop bags pick dog',
 'morgan freeman devastated sexual_harassment claims could undermine legacy',
 'donald_trump lovin new mcdonalds jingle tonight_show bit',
 'watch amazon prime new week',
 'mike myers reveals hed like fourth austin powers film',
 'watch hulu new week',
 'justin_timberlake visits texas school_shooting victims',
 'south_korean president meets north_koreas kim_jong un talk trump summit',
 'way life risk remote oystergrowing region called robots',
 'trumps crackdown immigrant parents puts kids already strained system',
 'trumps son concerned fbi obtained wiretaps putin ally met trump_jr',
 'edward_snowden theres one trump loves vladimir_putin',
 'booyah obama photographer hilariously trolls trumps spy claim',
 'ireland votes repeal abor
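
The `threshold` argument decides which word pairs get merged. Assuming the default scorer follows the formula from Mikolov et al. (2013), score(a, b) = (count(a, b) - min_count) * vocabulary_size / (count(a) * count(b)), a pair is merged when its score exceeds the threshold. The sketch below recomputes this score by hand on a toy corpus; `bigram_scores` is a hypothetical helper, not part of gensim.

```python
from collections import Counter


def bigram_scores(sentences, min_count=1):
    """Score each adjacent word pair with the assumed default formula."""
    unigrams = Counter(word for sentence in sentences for word in sentence)
    bigrams = Counter(
        pair for sentence in sentences for pair in zip(sentence, sentence[1:])
    )
    vocab_size = len(unigrams)
    return {
        (a, b): (count - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), count in bigrams.items()
    }


toy_corpus = [
    ["new", "york", "mayor"],
    ["new", "york", "times"],
    ["new", "york", "weather"],
    ["new", "car"],
]
scores = bigram_scores(toy_corpus, min_count=1)
# "new york" occurs 3 times: (3 - 1) / (4 * 3) * 6 = 1.0
# "new car" occurs once:     (1 - 1) / (4 * 1) * 6 = 0.0
print(scores[("new", "york")], scores[("new", "car")])
```

Pairs that co-occur far more often than their individual frequencies would suggest get high scores, which is why names such as `donald_trump` are merged in the output above.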