# TOPIC MODELLING

https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45

https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6

NMF (Non-negative Matrix Factorization): 
A ≈ W × H, with
A = articles by words (the original matrix)
W = articles by topics (weight of each topic in each article)
H = topics by words (the topics found)
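
As a rough illustration of the factorization, the sketch below runs scikit-learn's `NMF` on a tiny made-up count matrix. The matrix values, the number of topics (2) and the variable names are assumptions chosen for illustration only. In scikit-learn's convention, `fit_transform` returns the articles-by-topics matrix and `components_` holds the topics-by-words matrix.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical articles-by-words matrix: 4 articles, 5 words (made-up counts)
A = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 4, 3, 0],
    [0, 1, 3, 4, 0],
], dtype=float)

# n_components is the number of topics to extract (assumed to be 2 here)
model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=1000)
W = model.fit_transform(A)   # articles by topics
H = model.components_        # topics by words

print(W.shape, H.shape)
# The product only approximates the original matrix: A ≈ W @ H
print(np.round(W @ H, 1))
```

All entries of both factors are non-negative, which is what lets each row of the topics-by-words matrix be read as a weighted list of words.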


NMF is more scalable than LDA, but LDA is more frequently used.


In [96]:
# Initialisation

import re
import contractions
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import casual_tokenize
import string
from nltk.corpus import stopwords
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [109]:
# Text cleaning

punc=string.punctuation
nltk.download('stopwords')
stops=set(stopwords.words('english'))
#print(stops)


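stemmer = SnowballStemmer('english')  # build once instead of once per token


def process_text(t):
    """Tokenize, normalise and stem a raw text; return a list of tokens."""
    t = casual_tokenize(t)
    t = [e.lower() for e in t]
    # Strip digits
    t = [re.sub('[0-9]+', '', e) for e in t]
    # Expand contractions ("don't" -> "do not"); this can yield multi-word strings
    t = [contractions.fix(e) for e in t]
    t = [stemmer.stem(e) for e in t]
    # Drop punctuation, stop words, 1-character tokens and multi-word strings
    t = [w for w in t if w not in punc]
    t = [w for w in t if w not in stops]
    t = [e for e in t if len(e) > 1]
    t = [e for e in t if ' ' not in e]
    return t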


text = 'In the new system “Canton becomes Guangzhou and Tientsin becomes Tianjin.” Most importantly, the newspaper would now refer to the country’s capital as Beijing, not Peking. This was a step too far for some American publications. In an article on Pinyin around this time, the Chicago Tribune said that while it would be adopting the system for most Chinese words, some names had “become so ingrained'
print(text)
text_cleaned=process_text(text)
print(text_cleaned)




In the new system “Canton becomes Guangzhou and Tientsin becomes Tianjin.” Most importantly, the newspaper would now refer to the country’s capital as Beijing, not Peking. This was a step too far for some American publications. In an article on Pinyin around this time, the Chicago Tribune said that while it would be adopting the system for most Chinese words, some names had “become so ingrained
['new', 'system', 'canton', 'becom', 'guangzhou', 'tientsin', 'becom', 'tianjin', 'import', 'newspap', 'would', 'refer', 'countri', 'capit', 'beij', 'peke', 'step', 'far', 'american', 'public', 'articl', 'pinyin', 'around', 'time', 'chicago', 'tribun', 'said', 'would', 'adopt', 'system', 'chines', 'word', 'name', 'becom', 'ingrain']


[nltk_data] Downloading package stopwords to /Users/evan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
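
One detail of `process_text` worth checking is that stop words are filtered after stemming, so a stop word whose stem differs from its surface form is no longer recognised. The sketch below illustrates the effect with a small hand-picked stop list (an assumption, used here in place of the NLTK list so that no download is needed) and compares both orderings.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

# Hypothetical mini stop list, standing in for stopwords.words('english')
mini_stops = {'because', 'very', 'the'}
tokens = ['because', 'the', 'system', 'very']

# Order used in process_text: stem first, then filter on the stop list
stem_then_filter = [s for s in (stemmer.stem(t) for t in tokens)
                    if s not in mini_stops]

# Alternative order: filter on the stop list first, then stem what is left
filter_then_stem = [stemmer.stem(t) for t in tokens if t not in mini_stops]

print(stem_then_filter)
print(filter_then_stem)
```

Whether the leftover stems matter depends on the corpus; with TF-IDF, `max_df` may filter the most frequent ones anyway.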


In [108]:
# Data processing

texts = pd.DataFrame(text_cleaned)
texts=pd.concat([texts,texts],keys=['text1','text2'])
print(texts)
tfidf_vectorizer = TfidfVectorizer(
    min_df=3, # ignore words that appear in fewer than 3 different texts
    max_df=0.85, # ignore words that appear in more than 85% of the texts: not distinctive
    max_features=5000, # maximum number of words kept
    ngram_range=(1, 2),
    preprocessor=' '.join
)

tfidf = tfidf_vectorizer.fit_transform(texts)

                  0
text1 0         new
      1      system
      2      canton
      3       becom
      4   guangzhou
...             ...
text2 30     chines
      31       word
      32       name
      33      becom
      34    ingrain

[70 rows x 1 columns]


TypeError: can only join an iterable
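
The `TypeError` above most likely comes from passing a DataFrame to `fit_transform`: iterating over a DataFrame yields its column labels (here the integer `0`), so `' '.join` receives a non-iterable. With `preprocessor=' '.join`, the vectorizer expects each document to be a list of tokens. Below is a minimal sketch under that assumption, using a hypothetical three-document corpus of already cleaned tokens. `min_df` and `max_df` are relaxed to `1` and `1.0` because the original thresholds (at least 3 documents, at most 85% of them) cannot both be met on so small a corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: one list of cleaned tokens per document
docs = [
    ['new', 'system', 'canton', 'becom', 'guangzhou'],
    ['newspap', 'refer', 'countri', 'capit', 'beij'],
    ['chicago', 'tribun', 'adopt', 'system', 'chines', 'word'],
]

tfidf_vectorizer = TfidfVectorizer(
    min_df=1,               # relaxed for the toy corpus (assumption)
    max_df=1.0,             # relaxed for the toy corpus (assumption)
    max_features=5000,
    ngram_range=(1, 2),     # unigrams and bigrams
    preprocessor=' '.join,  # joins each token list back into one string
)

# Rows are documents, columns are the retained unigrams and bigrams
tfidf = tfidf_vectorizer.fit_transform(docs)
print(tfidf.shape)
```

On a real corpus, `docs` could be built as `[process_text(t) for t in raw_texts]`, where `raw_texts` is a hypothetical list of article strings, and the original `min_df` and `max_df` values restored.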