# News Aggregator Challenge

![](https://images.unsplash.com/photo-1504711434969-e33886168f5c?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1500&q=80)

This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014.

The dataset covers four news categories: business, science and technology, entertainment, and health. Articles that refer to the same news item (e.g., several articles about recently released employment statistics) are also grouped together.

## Content
The columns included in this dataset are:

- ID : the numeric ID of the article
- TITLE : the headline of the article
- URL : the URL of the article
- PUBLISHER : the publisher of the article
- CATEGORY : the category of the news item; one of:
    - b : business
    - t : science and technology
    - e : entertainment
    - m : health
- STORY : alphanumeric ID of the news story that the article discusses
- HOSTNAME : hostname where the article was posted
- TIMESTAMP : approximate timestamp of the article's publication, given in Unix time. The values in this file have 13 digits (e.g. 1394470370698), so they are milliseconds, not seconds, since midnight UTC on Jan 1, 1970
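A 13-digit value such as 1394470370698 (the first row of this dataset) reads as a Unix timestamp in milliseconds. Here is a minimal sketch of the conversion using only the standard library; the helper name `ms_to_datetime` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ms_to_datetime(ms):
    """Convert a Unix timestamp in milliseconds to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

# TIMESTAMP of the first article in the dataset
first_article = ms_to_datetime(1394470370698)
print(first_article.isoformat())
```

With pandas, `pd.to_datetime(data.TIMESTAMP, unit='ms')` should perform the same conversion on the whole column.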

➡️ Can we predict the category (business, entertainment, etc.) of a news article given only its headline?


👉 Source: https://www.kaggle.com/uciml/news-aggregator-dataset/data

In [2]:
import pandas as pd
import numpy as np

<h1> Load the data</h1>

In [7]:
# TODO: import the data
data = pd.read_csv('/home/guillaume/code/GGIML/vivadata-student/data/news_aggreg/uci-news-aggregator.csv', index_col='ID')

<h1> EDA</h1>

In [8]:
# TODO: data exploration
data.head()

ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422419 entries, 1 to 422937
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   TITLE      422419 non-null  object
 1   URL        422419 non-null  object
 2   PUBLISHER  422417 non-null  object
 3   CATEGORY   422419 non-null  object
 4   STORY      422419 non-null  object
 5   HOSTNAME   422419 non-null  object
 6   TIMESTAMP  422419 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 25.8+ MB


In [10]:
data.PUBLISHER.value_counts()

Reuters                         3902
Huffington Post                 2455
Businessweek                    2395
Contactmusic.com                2334
Daily Mail                      2254
                                ... 
The Busby Babe                     1
Australian Personal Computer       1
Camp Verde Journal                 1
East County Magazine               1
Moultrie News                      1
Name: PUBLISHER, Length: 10985, dtype: int64

In [20]:
data.CATEGORY.value_counts(normalize=True)

e    0.360943
b    0.274531
t    0.256485
m    0.108042
Name: CATEGORY, dtype: float64

In [4]:
# TODO: Identify the different values of news article category 

- CATEGORY : the category of the news item; one of:
    - b : business
    - t : science and technology
    - e : entertainment
    - m : health

In [12]:
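# TODO: sample your dataframe to 10000 lines
from sklearn.model_selection import train_test_split

# train_test_split returns (X_train, X_test, y_train, y_test) in that order;
# the stratified 10,000-row training part is our sample.
sample, excluded, y_sample, y_excluded = train_test_split(
    data, data.CATEGORY, train_size=10000, stratify=data.CATEGORY
)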

In [13]:
sample

ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
402885,"Girls Star Lands Coveted Title Role in NBC's ""...",http://www.playbill.com/news/article/194033-Gi...,Playbill.com,e,dhjYBAAXM5LZmzMOKOk9--4LddeuM,www.playbill.com,1406791818188
115198,The Game of Thrones Whodunit: Place Your Bets ...,http://www.tvovermind.com/game-of-thrones/game...,TVOvermind,e,dkrPWNf6SBWC_9MqKzu96qFdU6ZEM,www.tvovermind.com,1397521509773
258267,Queen brings Mercury back with unreleased songs,http://www.3news.co.nz/Queen-brings-Mercury-ba...,3News NZ,e,dZdtm5nDuct0BjMbRYhY-OxVZj63M,www.3news.co.nz,1401238293106
19939,NFL Trying to Make MIA Pay $16.6 Million for t...,http://www.designntrend.com/articles/11769/201...,Design & Trend,e,dXFy44ZHAHulrFM1kGkjd8Iu1IWFM,www.designntrend.com,1395167457099
153696,Facebook Wins FTC Approval for Oculus VR Acqui...,http://mashable.com/2014/04/23/facebook-oculus...,Mashable,t,dfv8CiGxM3K_MFMErjINPnBmIC7TM,mashable.com,1398279247739
...,...,...,...,...,...,...,...
257398,Seth Rogen And Judd Apatow Slam Film Critic Wh...,http://www.businessinsider.in/Seth-Rogen-And-J...,Businessinsider India,e,dkELF1oHuU8yHNMcIvkJIiaA24f2M,www.businessinsider.in,1401234107318
72794,Josh Elliott departing 'Good Morning America';...,http://www.oregonlive.com/movies/index.ssf/201...,The Oregonian,e,djGtHbOjePLQf7M0KvUt3zox5i_7M,www.oregonlive.com,1396293498868
324741,Sherri Shepherd and Jenny McCarthy Fired From ...,http://doslives.com/2014/06/sherri-shepherd-je...,Dos Lives,e,dTTQyyZL867gGOM-fUjwLUOkLn_AM,doslives.com,1403868232058
7328,Nikki Ferrell Responds To Haters After Juan Pa...,http://www.celebdirtylaundry.com/2014/nikki-fe...,Celebrity Dirty Laundry,e,dsxSK7cm0Bn72sM0Y-QP-7WD8mgLM,www.celebdirtylaundry.com,1394617907943


In [6]:
# TODO: check the ratio of each category in your new dataset

In [19]:
sample.CATEGORY.value_counts(normalize=True)

e    0.3610
b    0.2745
t    0.2565
m    0.1080
Name: CATEGORY, dtype: float64

<h1> Preprocessing of our news titles</h1>

In [21]:
# TODO: Preprocess the headlines of the news article
headlines = sample.TITLE

In [22]:
from nltk.corpus import wordnet

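def get_wordnet_pos(pos_tag):
    """Map the Penn Treebank tags in a list of (token, tag) pairs to WordNet POS constants.

    Returns a 2-D NumPy array of strings; column 0 holds the tokens and
    column 1 the WordNet tags. Anything that is not an adjective, verb or
    adverb is treated as a noun.
    """
    output = np.asarray(pos_tag)
    for i in range(len(pos_tag)):
        if pos_tag[i][1].startswith('J'):
            output[i][1] = wordnet.ADJ
        elif pos_tag[i][1].startswith('V'):
            output[i][1] = wordnet.VERB
        elif pos_tag[i][1].startswith('R'):
            output[i][1] = wordnet.ADV
        else:
            output[i][1] = wordnet.NOUN
    return output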

In [23]:
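# TODO: Preprocess the input data
# Requires the NLTK tokenizer, stop-word, POS-tagger and WordNet resources
# (installed with nltk.download; resource names depend on the NLTK version).
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))  # set for fast membership tests
lemmatizer = WordNetLemmatizer()              # created once and reused

def preproc(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stop words, then lemmatize."""
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    # POS tags let the lemmatizer pick the right base form (verb vs. noun, etc.)
    pos_tags = nltk.pos_tag(tokens)
    wordnet_tags = get_wordnet_pos(pos_tags)
    tokens = [lemmatizer.lemmatize(tokens[i], wordnet_tags[i,1]) for i in range(len(tokens))]
    return tokens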

In [24]:
tokens_headlines = headlines.apply(preproc)

In [25]:
tokens_headlines

ID
402885    [girl, star, land, coveted, title, role, nbc, ...
115198                 [game, throne, whodunit, place, bet]
258267      [queen, bring, mercury, back, unreleased, song]
19939     [nfl, try, make, mia, pay, million, middle, fi...
153696    [facebook, win, ftc, approval, oculus, vr, acq...
                                ...                        
257398    [seth, rogen, judd, apatow, slam, film, critic...
72794     [josh, elliott, depart, morning, america, dead...
324741      [sherri, shepherd, jenny, mccarthy, fire, view]
7328      [nikki, ferrell, respond, hater, juan, pablo, ...
290000           [feed, focus, investor, seek, reassurance]
Name: TITLE, Length: 10000, dtype: object

<h1> Build our gensim BOW</h1>

In [26]:
# TODO: Create a gensim BOW of your news article headlines 
from gensim.corpora import Dictionary

## Create a corpus

## Compute the dictionary: this is a dictionary mapping words and their corresponding numbers for later visualisation
id2word = Dictionary(tokens_headlines)

## Create a BOW
bow = [id2word.doc2bow(headline) for headline in tokens_headlines]
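The `(token_id, count)` pairs that `doc2bow` produces can be illustrated without gensim. The sketch below is a simplified stand-in, not gensim's implementation: the helper names `build_vocabulary` and `to_bow` and the toy headlines are made up, and gensim may assign token ids in a different order.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Two toy tokenized headlines; "game" appears in both
toy_docs = [
    ["game", "throne", "whodunit", "place", "bet"],
    ["game", "season", "game"],
]
vocab = build_vocabulary(toy_docs)
toy_bow = [to_bow(doc, vocab) for doc in toy_docs]
print(toy_bow)
```

A token that recurs across headlines keeps the same id, which is why some ids reappear in later documents of the real corpus.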

In [27]:
bow

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1)],
 [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
 [(15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
 [(21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1)],
 [(31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1)],
 [(38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)],
 [(44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1)],
 [(53, 1), (54, 1), (55, 1), (56, 1), (57, 1)],
 [(58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)],
 [(15, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1),
  (73, 1)],
 [(8, 1),
  (74, 1),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 1)],
 [(82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1)],
 [(90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1)

In [33]:
from gensim.models import TfidfModel

model = TfidfModel(bow)
tf_idf = model[bow]
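To make the TF-IDF weights less of a black box, here is a pure-Python sketch on a toy corpus. It assumes the weighting is the raw count times log2(N / document frequency), followed by L2 normalisation of each document, which is how gensim's documentation describes the defaults. Treat this as an illustration under that assumption rather than a reimplementation of `TfidfModel`; the function name `tfidf` and the toy corpus are made up.

```python
import math
from collections import Counter

def tfidf(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalise."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each token id occurs
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Tokens present in every document have weight 0 and are dropped
        weighted_corpus.append([(tid, w / norm) for tid, w in weights if w > 0])
    return weighted_corpus

# Toy corpus: token 0 occurs in all three documents
toy_bow = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1), (1, 1), (3, 1)]]
for doc in tfidf(toy_bow):
    print([(tid, round(w, 3)) for tid, w in doc])
```

Token 0 appears in every toy document, so it carries no weight and disappears from the output. Very common headline words are down-weighted in the same way.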

<h1> Latent Dirichlet Allocation</h1>

In [45]:
# TODO: define different topics within those news article using a LDA.
from gensim.models.ldamodel import LdaModel

## Compute the LDA
lda_bow = LdaModel(corpus=bow, num_topics=4, id2word=id2word)
lda_tf_idf = LdaModel(corpus=tf_idf, num_topics=4, id2word=id2word)

## Print the main topics
from pprint import pprint
pprint(lda_bow.print_topics())
pprint(lda_tf_idf.print_topics())

[(0,
  '0.009*"new" + 0.008*"google" + 0.007*"u" + 0.006*"kim" + 0.006*"kardashian" '
  '+ 0.005*"get" + 0.004*"say" + 0.004*"show" + 0.004*"video" + '
  '0.004*"market"'),
 (1,
  '0.008*"galaxy" + 0.008*"samsung" + 0.006*"one" + 0.005*"u" + 0.005*"launch" '
  '+ 0.004*"android" + 0.004*"new" + 0.004*"google" + 0.004*"say" + '
  '0.004*"rate"'),
 (2,
  '0.011*"new" + 0.006*"season" + 0.005*"u" + 0.004*"ebola" + 0.004*"throne" + '
  '0.004*"watch" + 0.004*"first" + 0.004*"game" + 0.004*"say" + 0.004*"stock"'),
 (3,
  '0.007*"star" + 0.007*"recall" + 0.007*"apple" + 0.006*"review" + 0.005*"gm" '
  '+ 0.005*"million" + 0.005*"microsoft" + 0.005*"office" + 0.005*"new" + '
  '0.004*"box"')]
[(0,
  '0.004*"price" + 0.004*"new" + 0.003*"launch" + 0.003*"amazon" + 0.003*"gas" '
  '+ 0.003*"u" + 0.003*"service" + 0.002*"apple" + 0.002*"samsung" + '
  '0.002*"heartbleed"'),
 (1,
  '0.004*"google" + 0.003*"new" + 0.003*"microsoft" + 0.003*"office" + '
  '0.003*"u" + 0.003*"android" + 0.003*"first

In [46]:
# TODO: check the coherence of your model
from gensim.models.coherencemodel import CoherenceModel

cm_bow = CoherenceModel(model=lda_bow, corpus=bow, coherence='u_mass')
coherence_bow = cm_bow.get_coherence()  # get coherence value

cm_tf_idf = CoherenceModel(model=lda_tf_idf, corpus=bow, coherence='u_mass')
coherence_tf_idf = cm_tf_idf.get_coherence()  # get coherence value

print(f'Coherence with BOW: {coherence_bow}\nCoherence with TF-IDF: {coherence_tf_idf}')

Coherence with BOW: -10.40448931258908
Coherence with TF-IDF: -11.212934271256383


In [48]:
# TODO: visualize your different topics
# Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [49]:
vis = pyLDAvis.gensim.prepare(lda_bow, bow, id2word)

In [50]:
vis

<h1> Compare LDA topics to our predefined categories</h1>

In [12]:
# TODO: Can we clearly identify the topics?
# If so, create a mapping dictionary between your LDA topics and the article categories
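When the topics are hard to label by eye, one possible way to build the mapping is a majority vote: each topic gets the category that is most frequent among the documents assigned to it. This is a sketch of that idea, not necessarily the approach the exercise has in mind; the function name `majority_mapping` and the toy data are made up.

```python
from collections import Counter, defaultdict

def majority_mapping(topics, categories):
    """Map each topic id to the category that occurs most often among its documents."""
    votes = defaultdict(Counter)
    for topic, category in zip(topics, categories):
        votes[topic][category] += 1
    return {topic: counter.most_common(1)[0][0] for topic, counter in votes.items()}

# Made-up dominant topics and true categories for eight headlines
toy_topics = [0, 0, 1, 1, 1, 2, 3, 3]
toy_categories = ["e", "e", "t", "t", "b", "m", "b", "b"]
topic_to_category = majority_mapping(toy_topics, toy_categories)
print(topic_to_category)
```

Nothing forces the mapping to be one-to-one: two topics can vote for the same category, which is itself a sign that the topics do not line up cleanly with the labels.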

In [13]:
# TODO: Assign each document its LDA topic using the get_document_topics() method

In [14]:
# TODO: create a function get_probas(i)
# return an array of LDA topics probability for a given index

In [15]:
# TODO: create a function get_category(i)
# return the LDA topic category for a given index (you can add a minimum probability threshold)
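`get_document_topics()` returns a sparse list of `(topic_id, probability)` pairs, so both helpers come down to two small steps: expand the pairs into a dense vector, then pick the most probable topic if it clears a threshold. The sketch below shows those steps on made-up data; `to_dense` and `pick_topic` are hypothetical names, and the pairs are invented rather than taken from the model above.

```python
def to_dense(topic_pairs, num_topics):
    """Expand sparse (topic_id, probability) pairs into a list of length num_topics."""
    probas = [0.0] * num_topics
    for topic_id, proba in topic_pairs:
        probas[topic_id] = proba
    return probas

def pick_topic(probas, min_proba=0.0):
    """Return the index of the most probable topic, or None if it is below min_proba."""
    best = max(range(len(probas)), key=probas.__getitem__)
    return best if probas[best] >= min_proba else None

# Made-up pairs in the shape returned by get_document_topics for one headline
toy_pairs = [(1, 0.72), (3, 0.25)]
toy_probas = to_dense(toy_pairs, num_topics=4)
print(toy_probas, pick_topic(toy_probas), pick_topic(toy_probas, min_proba=0.8))
```

In the notebook, `get_probas(i)` could wrap `to_dense(lda_bow.get_document_topics(bow[i]), 4)`, and `get_category(i)` could pass the result to `pick_topic` with a threshold of your choosing.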

In [16]:
# TODO: Check the category of a few random headlines and compare it with their LDA topic

In [17]:
# TODO: create your y_pred and y_test

<h1> Plot our results</h1>

In [18]:
# TODO: plot your results:
# Create a histogram of the distribution of LDA topics for each predefined category

In [19]:
# TODO: evaluate the accuracy for each category
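Accuracy per category can be read as "of the headlines that truly belong to a category, what fraction did the topic mapping get right" (this is per-class recall). Here is a minimal sketch on made-up labels; `per_category_accuracy` is a hypothetical helper, and a prediction of `None` stands for a headline whose best topic fell below the probability threshold.

```python
from collections import Counter

def per_category_accuracy(y_true, y_pred):
    """Return, for each true category, the fraction of its items predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {category: correct[category] / totals[category] for category in totals}

# Made-up true categories and predictions for eight headlines
toy_true = ["e", "e", "e", "b", "b", "t", "m", "m"]
toy_pred = ["e", "e", "b", "b", "t", "t", "m", None]
print(per_category_accuracy(toy_true, toy_pred))
```

Because the categories are imbalanced (entertainment is about 36% of the data, health about 11%), per-category figures are more informative here than a single overall accuracy.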

In [20]:
# TODO: add more rows to your sample to improve your results
# (up to 100000 rows if your CPU can handle it)