<a href="https://colab.research.google.com/github/Shehab-7/NLP/blob/main/News%20Categories%20Classifier/News_Categories_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Objective:
- The objective of this task is to use your NLP skills to build an ML model that can <b>classify news categories</b>.
- Try hard to maximize your preferred evaluation metric, using any technique.
- <b>Concatenate headlines and short descriptions</b> and use them for classification.

### Time:
- This task mustn't take more than <b>3 hours</b>.
    - Load data and EDA: 30 minutes
    - Cleaning and preprocessing: 60 minutes
    - Modelling and enhancement: 60 minutes
    - Extra time: 30 minutes

### Fixed Rules:
- train test split 80% : 20%
- all random seeds = 42
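
One way to apply the seed rule consistently is a small helper called once at the top of the notebook. This is a minimal sketch: `set_seeds` is a hypothetical helper name, and the NumPy and TensorFlow calls are only made if those libraries are installed.

```python
import os
import random

SEED = 42  # fixed by the task rules


def set_seeds(seed: int = SEED) -> None:
    """Seed the random number generators this notebook may rely on."""
    # Only affects hash randomization in Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


set_seeds()
```

scikit-learn functions such as `train_test_split` still need `random_state=42` passed explicitly.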

# Delivery:
## Through this [form](https://forms.gle/PshJQw2bTa48Ligz7)

> ### Take a deep breath, read the instructions again, and then start

## Load Libraries

In [1]:
import os
import random
import re
import string
from collections import Counter

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout
from tensorflow.keras.models import Sequential
from termcolor import colored
from wordcloud import STOPWORDS

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Load Dataset

In [2]:
data = pd.read_json("/content/Dataset.json", lines=True)
data.sample(5)

Unnamed: 0,category,headline,authors,link,short_description,date
83465,STYLE,"On This Week's Cheap Celeb Finds List, Reese W...",Michelle Persad,https://www.huffingtonpost.com/entry/cheap-cel...,"BRB, we're going shopping.",2015-08-08
16268,POLITICS,The GOP Congress Is Rushing Wildly Ahead With ...,Jeffrey Young,https://www.huffingtonpost.com/entry/gop-healt...,We've seen this movie before -- but it might e...,2017-09-18
48508,POLITICS,"Concern About Terrorist Attacks Is Growing, 15...",Grace Sparks,https://www.huffingtonpost.com/entry/terrorist...,Americans' belief that terrorists can launch a...,2016-09-08
187504,PARENTING,Life Lessons from My Children,"Signe Whitson, Contributor\nSigne Whitson is a...",https://www.huffingtonpost.com/entry/life-less...,Kids can be our greatest teachers. When I am r...,2012-06-19
18292,POLITICS,North Carolina Republicans Would Give Themselv...,Sam Levine,https://www.huffingtonpost.com/entry/north-car...,"“By historical standards, these are extraordin...",2017-08-22


In [3]:
data["Merged Columns"] = data['headline'].astype(str) +" "+ data["short_description"]
data.sample(5)

Unnamed: 0,category,headline,authors,link,short_description,date,Merged Columns
98158,TRAVEL,"If It's February, This Must Be Brussels","Magda Abu-Fadil, ContributorDirector of Media ...",https://www.huffingtonpost.com/entry/if-its-fe...,"Yes, it's a twist on the 1969 film starring Su...",2015-02-21,"If It's February, This Must Be Brussels Yes, i..."
130355,WELLNESS,How to Lose the Woman You Love... for Good,"Tamara Star, Contributor\nBest-Selling Author,...",https://www.huffingtonpost.com/entry/how-to-lo...,"""Love is the greatest refreshment in life."" --...",2014-02-18,"How to Lose the Woman You Love... for Good ""Lo..."
102390,COMEDY,The Truth About 'The Interview',"Susan Silver, ContributorTV writer and radio p...",https://www.huffingtonpost.com/entry/the-truth...,The movie insults anyone who has ever written ...,2015-01-03,The Truth About 'The Interview' The movie insu...
116129,HEALTHY LIVING,Do What You Love,"Stacie Huckeba, ContributorPhotographer, Film ...",https://www.huffingtonpost.com/entry/do-what-y...,"Maybe you're not an ""artist"" by anyone's stand...",2014-07-30,"Do What You Love Maybe you're not an ""artist"" ..."
19804,POLITICS,A Poet's Fight Against Money Bail,"Robert Greenwald and Alyesha Wise, Contributors",https://www.huffingtonpost.com/entry/a-poets-f...,I recently read the heartbreaking story of Ped...,2017-08-03,A Poet's Fight Against Money Bail I recently r...


In [4]:
data.drop(["headline","short_description"],axis=1,inplace=True)

In [5]:
data.sample(50)

Unnamed: 0,category,authors,link,date,Merged Columns
172266,IMPACT,"Verneda Adele White, Contributor\nFounder - Cr...",https://www.huffingtonpost.com/entry/world-aid...,2012-12-01,When Unprotected Sex Is No Longer Negotiable: ...
134317,WELLNESS,"Joe Boyd, Contributor\nFounder and CEO, Rebel ...",https://www.huffingtonpost.com/entry/happiness...,2014-01-07,7 Ways to Get Happy This New Year The great mo...
10079,POLITICS,Willa Frej,https://www.huffingtonpost.com/entry/jerusalem...,2017-12-07,Jerusalem Is Just The Latest Example Of Trump'...
175687,STYLE & BEAUTY,Ellie Krupnick,https://www.huffingtonpost.com/entry/hillary-c...,2012-10-25,Hillary Clinton's Fashion: 65 Looks For 65 Yea...
43568,POLITICS,Paul Blumenthal,https://www.huffingtonpost.com/entry/dark-mone...,2016-11-03,This Dark Money Group Is Spending Big On Judic...
93364,ARTS,"Peak Johnson, ContributorPhiladelphia Journalist",https://www.huffingtonpost.com/entry/ocaseys-p...,2015-04-17,O'Casey's Plays Return to Stage at Philly Iris...
17135,ARTS & CULTURE,Cavan Sieczkowski,https://www.huffingtonpost.com/entry/tina-feys...,2017-09-06,Tina Fey's 'Mean Girls' Musical Has Its Plasti...
4466,MEDIA,Rebecca Shapiro,https://www.huffingtonpost.com/entry/erin-burn...,2018-03-06,CNN's Erin Burnett To Sam Nunberg: I Smell Alc...
133959,QUEER VOICES,"Michael Alvear, Contributor\nAdvice Columnist",https://www.huffingtonpost.com/entry/whos-bett...,2014-01-11,Who's Better-Looking: Gay or Straight Porn Sta...
85358,FIFTY,"Dey Young, ContributorActress, sculptor in sto...",https://www.huffingtonpost.com/entry/please-do...,2015-07-18,How Theatre Helped Me Embrace My Age It is bec...


## EDA

In [6]:
#Check nulls
data.isnull().any()

category          False
authors           False
link              False
date              False
Merged Columns    False
dtype: bool

In [7]:
#check duplicates
data.duplicated().sum()

13

In [8]:
data = data.drop_duplicates()

In [9]:
data.duplicated().sum()

0

## Cleaning & Preprocessing

In [10]:
def remove_emoji(text):
    # Character class covering emoji and related symbol blocks.
    emoji_pattern = re.compile("["
          u"\U0001F600-\U0001F64F"  # emoticons
          u"\U0001F300-\U0001F5FF"  # symbols & pictographs
          u"\U0001F680-\U0001F6FF"  # transport & map symbols
          u"\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
          u"\U00002500-\U00002BEF"  # box drawing through misc. symbols and arrows
          u"\U00002702-\U000027B0"  # dingbats
          u"\U000024C2-\U0001F251"  # very broad range: also covers CJK and other non-Latin scripts
          u"\U0001f926-\U0001f937"
          u"\U00010000-\U0010ffff"  # all supplementary planes
          u"\u2640-\u2642"
          u"\u2600-\u2B55"
          u"\u200d"  # zero-width joiner
          u"\u23cf"
          u"\u23e9"
          u"\u231a"
          u"\ufe0f"  # variation selector-16
          u"\u3030"
                        "]+", re.UNICODE)
    return emoji_pattern.sub(r'', text)


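def remove_punct(text):
    # Strip every ASCII punctuation character.
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

def remove_number(text):
    # Despite the name, this replaces numeric tokens with the placeholder 'NUMBER'.
    # The pattern needs at least three characters, so short numbers such as "15" are left as-is.
    num = re.compile(r'[-+]?[.\d][\d]+[:,.\d]')
    return num.sub(r'NUMBER', text)

# Build the stop-word set once; calling stopwords.words('english') for every word is very slow.
STOP_WORDS = set(stopwords.words('english'))

def toremove_stopword(text):
    # Returns a list of tokens (not a string) with English stop words removed.
    remove_stopword = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return remove_stopword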
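
To see what the number pattern above actually matches, here is a small self-contained check. The sample sentence is made up for illustration.

```python
import re

# Same pattern as in remove_number above.
num = re.compile(r'[-+]?[.\d][\d]+[:,.\d]')

sample = "15 ideas from 2015 and 1000000 more"
print(num.sub('NUMBER', sample))
# -> 15 ideas from NUMBER and NUMBER more
```

Because the pattern requires at least three characters, one- and two-digit numbers pass through unchanged, which is consistent with the `15` token visible in a later sample of the cleaned data.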

In [11]:
data['Merged Columns'] = data['Merged Columns'].apply(remove_emoji)
data['Merged Columns'] = data['Merged Columns'].apply(remove_punct)
data['Merged Columns'] = data['Merged Columns'].apply(remove_number)
data['Merged Columns'] = data['Merged Columns'].apply(toremove_stopword)

In [12]:
data.sample(50)

Unnamed: 0,category,authors,link,date,Merged Columns
11099,POLITICS,Lee Moran,https://www.huffingtonpost.com/entry/tomi-lahr...,2017-11-24,"[Twitter, Users, Shred, Tomi, Lahren, Disrespe..."
129383,TRAVEL,"Sonia Gil, Contributor\nHost and co-creator of...",https://www.huffingtonpost.com/entry/roots-of-...,2014-02-28,"[Eyes, Rio, much, Rio, sand, sea]"
117271,RELIGION,"Christian Piatt, ContributorAuthor of 'postChr...",https://www.huffingtonpost.com/entry/fence-sit...,2014-07-17,"[FenceSitters, BoundaryPushers, Postmodern, Re..."
112621,HEALTHY LIVING,"Jane Simon, M.D., ContributorWrites a weekly b...",https://www.huffingtonpost.com/entry/permissio...,2014-09-07,"[Permission, Prohibition, Reexamining, behavio..."
177118,WEDDINGS,,https://www.huffingtonpost.comhttp://www.nytim...,2012-10-10,"[Marriage, Launched, Dream, Weaver, Julie, Cla..."
89986,POLITICS,Eliot Nelson,https://www.huffingtonpost.com/entry/huffpost-...,2015-05-26,"[HUFFPOST, HILL, Politicians, Follow, Weekend,..."
131854,STYLE & BEAUTY,Dana Oliver,https://www.huffingtonpost.com/entry/barely-th...,2014-02-02,"[Celebrities, Lighten, BarelyThere, Makeup, We..."
106297,RELIGION,"Mike Ghouse, ContributorSpeaker, thinker, writ...",https://www.huffingtonpost.com/entry/quran-is-...,2014-11-19,"[Quran, Muslims, God, God, Muslims, claims, Qu..."
198722,CRIME,,https://www.huffingtonpost.com/entry/george-hu...,2012-02-20,"[George, Huguely, Murder, Trial, Timeline, For..."
23156,POLITICS,Carol Kuruvilla,https://www.huffingtonpost.com/entry/ramadan-m...,2017-06-23,"[Ramadan, Draws, End, Mosques, Worry, Security..."


In [13]:
data['category_num'] = data['category'].factorize()[0]

In [14]:
data.sample(5)

Unnamed: 0,category,authors,link,date,Merged Columns,category_num
18640,POLITICS,"Bradley P. Moss, ContributorNational security ...",https://www.huffingtonpost.com/entry/is-the-pr...,2017-08-17,"[President, EnablerInChief, Tuesday, span, les...",4
51599,LATINO VOICES,Carolina Moreno,https://www.huffingtonpost.com/entry/15-profou...,2016-08-04,"[15, Profound, Thoughts, Whether, Burger, Avoc...",17
98466,MEDIA,"Janesh Rahlan, ContributorFulbright Scholar",https://www.huffingtonpost.com/entry/namaz-in-...,2015-02-17,"[Namaz, Fear, West, ramped, War, Terror, media...",13
60221,STYLE,Jamie Feldman,https://www.huffingtonpost.com/entry/modcloth-...,2016-04-28,"[Modcloths, Latest, BodyPositive, Swim, Shoot,...",22
164609,WELLNESS,"Paul Spector MD, Contributor\nHealth consultan...",https://www.huffingtonpost.com/entry/blood-sug...,2013-02-21,"[Need, Know, Blood, Sugar, Results, important,...",31


In [15]:
data['category_num'].nunique()

41

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_features=10000
tokenizer=Tokenizer(num_words=max_features,split=' ')
tokenizer.fit_on_texts(data['Merged Columns'].values)
X = tokenizer.texts_to_sequences(data['Merged Columns'].values)
X = pad_sequences(X)
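
With its default arguments, `pad_sequences` left-pads every sequence with zeros up to the length of the longest one, which is why `X.shape[1]` later serves as the model's input length. The pure-Python function below only mimics that default behavior for illustration; `pad_pre` is a hypothetical name, not part of Keras.

```python
def pad_pre(sequences, value=0):
    """Left-pad each sequence with `value` to the length of the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]


toy = [[4, 7, 2], [9], [5, 1]]
print(pad_pre(toy))
# -> [[4, 7, 2], [0, 0, 9], [0, 5, 1]]
```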

In [20]:
y = data['category_num']

In [26]:
# Split into train / validation / test sets (about 64% / 16% / 20% of the data)
Xtrain, X_test, ytrain, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(Xtrain, ytrain, test_size=0.2, random_state=42, stratify=ytrain)
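
Because the second split takes 20% of the training portion, the three sets end up at roughly 64%, 16%, and 20% of the data. The arithmetic, using a made-up row count of 1,000 for illustration:

```python
n_rows = 1000          # hypothetical dataset size, for illustration only
test_size = 0.2        # first split: hold out the test set
val_size = 0.2         # second split: hold out validation from the remainder

n_test = round(n_rows * test_size)
n_val = round((n_rows - n_test) * val_size)
n_train = n_rows - n_test - n_val

print(n_train, n_val, n_test)
# -> 640 160 200
```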

## Modelling

In [27]:
# Applying LSTM
from tensorflow import keras
from tensorflow.keras import models, layers

embed_dim = 32
lstm_out = 32
model_l = models.Sequential()
model_l.add(layers.Embedding(max_features, embed_dim,input_length = X.shape[1]))
model_l.add(layers.Dropout(0.3))
model_l.add(layers.LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.5))
model_l.add(layers.Dense(1024,activation='relu'))
model_l.add(layers.Dense(41,activation='softmax'))
adam = keras.optimizers.Adam(learning_rate=0.001)
model_l.compile(loss = 'sparse_categorical_crossentropy', optimizer=adam ,metrics = ['accuracy'])
print(model_l.summary())





Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 127, 32)           320000    
                                                                 
 dropout_1 (Dropout)         (None, 127, 32)           0         
                                                                 
 lstm_1 (LSTM)               (None, 32)                8320      
                                                                 
 dense_2 (Dense)             (None, 1024)              33792     
                                                                 
 dense_3 (Dense)             (None, 41)                42025     
                                                                 
Total params: 404,137
Trainable params: 404,137
Non-trainable params: 0
_________________________________________________________________
None
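
The parameter counts in the summary can be reproduced by hand, which is a quick sanity check on the architecture. The sketch below uses the standard formulas (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the variable names are chosen for illustration.

```python
vocab_size, embed_dim, lstm_units = 10000, 32, 32
dense_units, n_classes = 1024, 41

embedding = vocab_size * embed_dim                                 # one vector per token id
lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)    # 4 gates: weights + bias
dense_hidden = lstm_units * dense_units + dense_units              # weights + biases
dense_out = dense_units * n_classes + n_classes                    # weights + biases

print(embedding, lstm, dense_hidden, dense_out)
# -> 320000 8320 33792 42025
print(embedding + lstm + dense_hidden + dense_out)
# -> 404137
```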


In [28]:
model_l.fit(X_train, y_train, epochs = 10, batch_size=512, validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f1e53a50690>

In [29]:
score = model_l.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model_l.metrics_names[1], score[1]*100))

accuracy: 58.08%


## Enhancement

In [30]:
# Applying GRU
from tensorflow import keras
from tensorflow.keras import models, layers

embed_dim = 128

model_g = models.Sequential()
model_g.add(layers.Embedding(max_features, embed_dim,input_length = X.shape[1]))
# reset_after=False keeps the single-bias GRU variant, consistent with the 98,688 parameters in the summary
model_g.add(layers.GRU(embed_dim, dropout=0.4, recurrent_dropout=0.5, reset_after=False))
model_g.add(layers.Dense(1024,activation='relu'))
model_g.add(layers.Dense(41,activation='softmax'))
# Use a fresh optimizer instead of reusing the instance already attached to model_l
adam_g = keras.optimizers.Adam(learning_rate=0.001)
model_g.compile(loss = 'sparse_categorical_crossentropy', optimizer=adam_g ,metrics = ['accuracy'])
print(model_g.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 127, 128)          1280000   
                                                                 
 gru (GRU)                   (None, 128)               98688     
                                                                 
 dense_4 (Dense)             (None, 1024)              132096    
                                                                 
 dense_5 (Dense)             (None, 41)                42025     
                                                                 
Total params: 1,552,809
Trainable params: 1,552,809
Non-trainable params: 0
_________________________________________________________________
None
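
The GRU layer's 98,688 parameters correspond to three gates with a single bias vector each. Keras also has a GRU variant with separate input and recurrent biases (`reset_after=True`), which would give a slightly larger count. A small check, with variable names chosen for illustration:

```python
embed_dim, gru_units = 128, 128

weights = gru_units * (embed_dim + gru_units)   # input + recurrent weights per gate

single_bias = 3 * (weights + gru_units)         # one bias vector per gate
double_bias = 3 * (weights + 2 * gru_units)     # separate input and recurrent biases

print(single_bias, double_bias)
# -> 98688 99072
```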


In [32]:
model_g.fit(X_train, y_train, batch_size=512, epochs=10,verbose=1, validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f1e4d7c3f90>

In [33]:
score = model_g.evaluate(X_test, y_test, verbose=0)

print("%s: %.2f%%" % (model_g.metrics_names[1], score[1]*100))

accuracy: 60.42%
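
Accuracy is the only metric reported here. If the 41 categories are of uneven size (not checked in this notebook), a per-class view such as macro-averaged F1 can be informative as well. The sketch below computes it by hand on made-up labels; `macro_f1` is a hypothetical helper, and `sklearn.metrics.f1_score(y_true, y_pred, average='macro')` is the library route.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 0, 1]   # toy labels: class 1 is rare
y_pred = [0, 0, 0, 0]   # a model that always predicts class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # -> 0.75
print(round(macro_f1(y_true, y_pred), 3))  # -> 0.429
```

In this toy case the accuracy looks reasonable while macro F1 exposes that one class is never predicted.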


## Final Results and Conclusion

In [None]:
'''
- The time limit wasn't enough to tune and enhance the models further.

- Stop-word removal took a long time because of the size of the data,
  and fitting the LSTM and GRU models was also slow, even for just 10 epochs.

- The GRU achieved slightly higher test accuracy (60.42% vs. 58.08% for the LSTM).
  Accuracy could likely be improved further by reducing the batch size and adding more layers to the model architecture.
'''

## Best Wishes


## Again: Delivery through this [form](https://forms.gle/PshJQw2bTa48Ligz7)