# 01-News-Modelling

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics of a dataset of news headlines.

Begin by importing the needed libraries:

In [22]:
# TODO: import needed libraries
import pandas as pd
import pickle
import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from gensim.models import LdaModel
from pprint import pprint

# English stopword list (requires the NLTK "stopwords" resource to be downloaded)
stop_words = stopwords.words("english")


Load the data in the file `random_headlines.csv`

In [5]:
# TODO: load the dataset
df = pd.read_csv('random_headlines.csv', on_bad_lines='skip')


It is always a good idea to perform some EDA on a dataset.

In [6]:
# TODO: Perform a short EDA
df.info()
# no null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


In [7]:
df.columns

Index(['publish_date', 'headline_text'], dtype='object')

In [8]:
## see V.MALARA's histogram plots on text data
df.headline_text[:6]

0         ute driver hurt in intersection crash
1                  6yo dies in cycling accident
2                 bumper olive harvest expected
3            replica replaces northernmost sign
4                  woods targets perfect season
5    leckie salvages dramatic draw for adelaide
Name: headline_text, dtype: object
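
For text data, two quick checks are often informative: the headline length in words and the most frequent raw words. The following is a minimal standard-library sketch on the six headlines printed above only; the variable names are illustrative and not part of the notebook. On the full dataset, something like `df.headline_text.str.split().str.len()` would likely be more convenient.

```python
from collections import Counter

# The six sample headlines printed above (not the full dataset)
sample_headlines = [
    "ute driver hurt in intersection crash",
    "6yo dies in cycling accident",
    "bumper olive harvest expected",
    "replica replaces northernmost sign",
    "woods targets perfect season",
    "leckie salvages dramatic draw for adelaide",
]

# Number of whitespace-separated words per headline
lengths = [len(headline.split()) for headline in sample_headlines]
length_hist = Counter(lengths)

# Raw word frequencies across the sample (no stopword removal yet)
word_freq = Counter(word for headline in sample_headlines for word in headline.split())

print(sorted(length_hist.items()))  # [(4, 3), (5, 1), (6, 2)]
print(word_freq.most_common(1))     # [('in', 2)]
```

Even on this tiny sample, the most frequent word is a stopword, which motivates the stopword removal in the preprocessing step.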

Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [13]:
# TODO: Preprocess the input data
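stemmer = PorterStemmer()
def processing(document):
    """Tokenize, lowercase, drop non-alphabetic tokens and stopwords, then stem."""
    tokens = word_tokenize(document)
    # Keep purely alphabetic tokens only (this also drops tokens such as "6yo")
    tokens = [t.lower() for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [stemmer.stem(t) for t in tokens]
    return tokens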

df['tokens'] = df.headline_text.apply(processing)
df.tokens[:7]

0     [ute, driver, hurt, intersect, crash]
1                        [die, cycl, accid]
2           [bumper, oliv, harvest, expect]
3     [replica, replac, northernmost, sign]
4           [wood, target, perfect, season]
5    [lecki, salvag, dramat, draw, adelaid]
6        [group, gaug, rail, servic, futur]
Name: tokens, dtype: object

In [14]:
# Create a corpus
corpus = df['tokens']

In [15]:
# Compute the dictionary: it maps each unique token to an integer id (also used later for visualisation)
id2word = Dictionary(corpus)

Now use Gensim to compute a bag-of-words (BoW) representation.

In [16]:
# TODO: Compute the BOW using Gensim
# Create a BOW
bow = [id2word.doc2bow(line) for line in corpus]  # convert corpus to BoW format
print(bow[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
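
To make the `(token_id, count)` pairs above concrete, here is a rough pure-Python sketch of what a dictionary plus a bag-of-words conversion do. It is not Gensim's implementation: the names `token2id` and `to_bow` are hypothetical, ids are assigned in order of first appearance (Gensim may order them differently), and the third document is made up to show a count greater than 1.

```python
from collections import Counter

docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],  # first tokenized headline
    ["die", "cycl", "accid"],                         # second tokenized headline
    ["driver", "die", "crash", "crash"],              # made-up document
]

# Assign an integer id to each unique token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

bow_sketch = [to_bow(doc) for doc in docs]
print(bow_sketch[0])  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
print(bow_sketch[2])  # [(1, 1), (4, 2), (5, 1)]
```

Each document becomes a sparse vector: only the tokens it contains are listed, together with how many times they occur.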


Compute the TF-IDF using Gensim

In [17]:
# TODO: Compute TF-IDF
# Instantiate a TF-IDF model
tfidf_model = TfidfModel(bow)

# Compute the TF-IDF
tf_idf_gensim = tfidf_model[bow]

In [18]:
tf_idf_gensim[0]

[(0, 0.30725466582280214),
 (1, 0.3528943781678455),
 (2, 0.42129048115131124),
 (3, 0.5992666854471201),
 (4, 0.49442279315598586)]
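
The weights above can be reproduced in spirit with a few lines of plain Python. The sketch below assumes the weighting is raw term frequency times `log2(N / df)`, followed by L2 normalisation of each document; this is believed to match Gensim's default settings and is consistent with the unit-length vector printed above, but it is an assumption rather than a copy of the library code. The function name `tfidf_sketch` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return one {token: weight} dict per document (tf * log2(N/df), L2-normalised)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token
    doc_freq = Counter(token for doc in docs for token in set(doc))
    result = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {tok: tf * math.log2(n_docs / doc_freq[tok]) for tok, tf in term_freq.items()}
        # Drop tokens whose weight is zero (they occur in every document)
        raw = {tok: w for tok, w in raw.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        result.append({tok: w / norm for tok, w in raw.items()})
    return result

# Made-up toy corpus of already preprocessed documents
toy_docs = [
    ["polic", "charg", "man"],
    ["man", "crash"],
    ["man", "plan", "council"],
]

weights = tfidf_sketch(toy_docs)
for doc_weights in weights:
    print(doc_weights)
```

In this sketch, "man" occurs in every toy document, so its IDF is zero and it disappears from all vectors, while rarer tokens carry the weight.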

Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics of your choosing.

In [20]:
# TODO: Compute LSA
lsi = LsiModel(tf_idf_gensim, id2word=id2word, num_topics=5)

pprint(lsi.print_topics())

[(0,
  '-0.455*"man" + -0.389*"polic" + -0.324*"charg" + -0.146*"murder" + '
  '-0.146*"court" + -0.131*"face" + -0.112*"new" + -0.112*"miss" + '
  '-0.109*"crash" + -0.107*"death"'),
 (1,
  '0.396*"second" + 0.334*"abc" + 0.329*"news" + -0.305*"man" + '
  '0.284*"weather" + 0.228*"busi" + -0.216*"charg" + 0.163*"sport" + '
  '0.139*"plan" + 0.124*"council"'),
 (2,
  '-0.372*"second" + -0.318*"man" + -0.296*"abc" + -0.261*"news" + '
  '-0.257*"weather" + -0.242*"charg" + 0.195*"plan" + 0.169*"govt" + '
  '0.167*"council" + -0.152*"busi"'),
 (3,
  '0.765*"polic" + -0.234*"man" + -0.230*"charg" + 0.164*"investig" + '
  '0.141*"probe" + -0.131*"council" + -0.122*"plan" + -0.119*"court" + '
  '-0.111*"face" + 0.106*"search"'),
 (4,
  '0.394*"kill" + 0.319*"crash" + 0.279*"fire" + -0.215*"charg" + '
  '-0.182*"council" + 0.172*"car" + -0.171*"court" + -0.161*"polic" + '
  '0.158*"rural" + -0.157*"plan"')]


For each topic, show the most significant words.

In [None]:
# TODO: Print the 3 or 4 most significant words of each topic
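
One possible way to approach this, sketched without Gensim so that it runs on its own: parse a topic string in the format printed by `print_topics()` above and keep the terms with the largest absolute weight. The helper name `top_terms` is hypothetical, and the input is the beginning of the topic 3 string from the output above. With the fitted model at hand, `lsi.show_topic(topic_id, topn=4)` should give `(word, weight)` pairs directly.

```python
def top_terms(topic_string, k=3):
    """Parse '0.765*"polic" + -0.234*"man" + ...' and return the k strongest (word, weight) pairs."""
    terms = []
    for part in topic_string.split(" + "):
        weight, word = part.split("*", 1)
        terms.append((word.strip().strip('"'), float(weight)))
    # Rank by absolute weight: in LSA, large negative weights matter as much as positive ones
    terms.sort(key=lambda term: abs(term[1]), reverse=True)
    return terms[:k]

# First four terms of topic 3, copied from the LSA output above
topic_3 = '0.765*"polic" + -0.234*"man" + -0.230*"charg" + 0.164*"investig"'

print(top_terms(topic_3, k=3))  # [('polic', 0.765), ('man', -0.234), ('charg', -0.23)]
```

Ranking by absolute value rather than raw value keeps strongly negative words, which LSA uses to contrast terms inside a topic.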


What do you think about those results?

### Topic 0 only contains words with negative LSA coefficients
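
When reading signed LSA weights, it may help to recall that LSA is a truncated SVD of the term-document matrix, and that the sign of each component is not determined by the decomposition. A short derivation, with $X$ the TF-IDF matrix, $K$ the number of topics, and $u_k$, $v_k$, $\sigma_k$ the singular vectors and values (notation introduced here for illustration only):

$$
X \approx \sum_{k=1}^{K} \sigma_k \, u_k v_k^\top = \sum_{k=1}^{K} \sigma_k \, (-u_k)(-v_k)^\top
$$

Flipping the sign of both singular vectors of a component leaves the approximation unchanged, so a topic whose weights are all negative is equivalent to the same topic with all-positive weights. Only the relative signs of words within a topic carry information.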

Now let's try to use LDA instead of LSA using Gensim

In [23]:
# TODO: Compute LDA
# Compute the LDA (trained here on the TF-IDF corpus; LDA is more commonly fitted on raw counts such as `bow`)
lda1 = LdaModel(corpus=tf_idf_gensim, num_topics=5, id2word=id2word, passes=10, random_state=0)


In [35]:
# TODO: print the most frequent words of each topic
# Print the main topics
for idx, topic in lda1.print_topics():
    print(str(idx) + ':  ' + topic.split('+')[0])

0:  0.005*"countri" 
1:  0.007*"man" 
2:  0.005*"plan" 
3:  0.007*"murder" 
4:  0.006*"second" 


Now, how does it work with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [None]:
# TODO: show visualization results of the LDA


Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, ...) and compare your results with those of others.