# Objective: Predicting News Category from Article

In [0]:
import pandas as pd
import numpy as np
import nltk
import string
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2

# Part 1: NLTK

In [0]:
df_train = pd.read_excel('Data_Train.xlsx')

In [0]:
df_train.head()

Unnamed: 0,STORY,SECTION
0,But the most painful was the huge reversal in ...,3
1,How formidable is the opposition alliance amon...,0
2,Most Asian currencies were trading lower today...,3
3,"If you want to answer any question, click on ‘...",1
4,"In global markets, gold prices edged up today ...",3


In [0]:
df_test = pd.read_excel('Data_Test.xlsx')
df_test.head()

Unnamed: 0,STORY
0,2019 will see gadgets like gaming smartphones ...
1,It has also unleashed a wave of changes in the...
2,It can be confusing to pick the right smartpho...
3,The mobile application is integrated with a da...
4,We have rounded up some of the gadgets that sh...


# 1. Data Exploration

In [0]:
df_train['SECTION'].value_counts()

1    2772
2    1924
0    1686
3    1246
Name: SECTION, dtype: int64

This is a multiclass classification problem with 4 labels:
0 - Politics Story
1 - Technology Story
2 - Entertainment Story
3 - Business Story
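The classes are not evenly sized, so it helps to know what accuracy a trivial model would reach. The sketch below only reuses the counts printed by `value_counts()` above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
section_counts = {1: 2772, 2: 1924, 0: 1686, 3: 1246}

total = sum(section_counts.values())
majority_label = max(section_counts, key=section_counts.get)
baseline_accuracy = section_counts[majority_label] / total

for label in sorted(section_counts):
    share = section_counts[label] / total
    print(f"label {label}: {section_counts[label]:5d} stories ({share:.1%})")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting label 1 would score roughly 0.36 on the training distribution, which puts the later results in context.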

In [0]:
df_train['SECTION'].isnull().sum()

0

In [0]:
df_train['STORY'].isnull().sum()

0

# 2. Data Cleaning

In [0]:
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

In [0]:
stopword = nltk.corpus.stopwords.words('english')

In [0]:
wn = nltk.WordNetLemmatizer()

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
def clean_text(text):
    # Remove punctuation
    punctuation = "".join([char for char in text if char not in string.punctuation])
    # Tokenize on runs of non-word characters
    split = re.split(r"\W+", punctuation)
    # Remove stopwords
    text = [word for word in split if word not in stopword]
    # Lemmatize text
    lem_text = [wn.lemmatize(word) for word in text]
    return lem_text
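One detail of `clean_text` worth checking is that it never lowercases the text, while NLTK's English stopword list is all lowercase. The sketch below mirrors the function without the lemmatizer and uses a tiny hypothetical stopword set (`demo_stopwords`) so that it runs without NLTK data; `demo_clean` is an illustrative name, not part of the notebook.

```python
import re
import string

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
demo_stopwords = {"the", "on", "a", "is"}

def demo_clean(text, lowercase=False):
    """Mirror clean_text without the lemmatizer, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", no_punct)
    return [tok for tok in tokens if tok not in demo_stopwords]

sentence = "The cat sat on the mat."
print(demo_clean(sentence))                  # ['The', 'cat', 'sat', 'mat']
print(demo_clean(sentence, lowercase=True))  # ['cat', 'sat', 'mat']
```

Without lowercasing, capitalized stopwords such as "The" survive and "The"/"the" become separate vocabulary entries. Lowercasing first would probably shrink the vocabulary, but it would also change the feature count reported below.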

In [0]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(df_train['STORY'])

X_features = pd.DataFrame(X_tfidf.toarray())

A TF-IDF vectorizer is used to turn the stories into features. On `fit`, the vectorizer builds a vocabulary that maps every token in the corpus to a column index. On `transform`, it converts each document into a feature vector: the columns are the vocabulary indices, and each value is the term frequency-inverse document frequency (TF-IDF) of that token in the document. Because word order is discarded, this is a **Bag of Words** representation with TF-IDF weights.
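To make the fit/transform description concrete, here is a minimal hand-rolled sketch using the textbook formula tf × log(N / df) on a hypothetical three-document corpus. It is not what `TfidfVectorizer` computes exactly: by default scikit-learn smooths the IDF term and L2-normalizes each row, so its numbers differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["gold", "prices", "rose", "today"],
    ["phone", "prices", "fell", "today"],
    ["gold", "phone", "launch"],
]

# "fit": build the vocabulary and the document frequencies.
vocabulary = sorted({word for doc in docs for word in doc})
word_to_col = {word: col for col, word in enumerate(vocabulary)}
doc_freq = Counter(word for doc in docs for word in set(doc))
n_docs = len(docs)

def tfidf_row(doc):
    """"transform": one dense row with tf * log(N / df) per vocabulary word."""
    counts = Counter(doc)
    row = [0.0] * len(vocabulary)
    for word, count in counts.items():
        tf = count / len(doc)
        idf = math.log(n_docs / doc_freq[word])
        row[word_to_col[word]] = tf * idf
    return row

matrix = [tfidf_row(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print([round(v, 3) for v in row])
```

Words that appear in only one document ("rose") receive a higher weight than words shared across documents ("gold"), and words absent from a document stay at zero.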

In [0]:
X_features.shape

(7628, 40379)
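The shape above has a practical consequence: `toarray()` materializes every cell, including the zeros. A quick back-of-the-envelope estimate, assuming the default `float64` dtype:

```python
rows, cols = 7628, 40379      # X_features.shape from the cell above
bytes_per_value = 8           # float64

dense_bytes = rows * cols * bytes_per_value
print(f"{rows * cols:,} cells -> {dense_bytes / 1e9:.2f} GB as a dense float64 array")
# 308,011,012 cells -> 2.46 GB as a dense float64 array
```

scikit-learn's `RandomForestClassifier` accepts SciPy sparse matrices, so passing `X_tfidf` directly instead of the dense DataFrame is a possible way to reduce memory; that variant was not run in this notebook.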

# Model Training - RandomForest

In [0]:
rf = RandomForestClassifier()
param = {'n_estimators': [100, 120],
         'max_depth': [60, 80, 100, 120]}
gs = GridSearchCV(rf, param, cv=5, n_jobs=1)
gs_fit = gs.fit(X_features, df_train['SECTION'])

In [0]:
gs_fit.best_params_

{'max_depth': 120, 'n_estimators': 120}
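For a sense of how much work the search above does, the grid can be enumerated by hand. This is a sketch of the bookkeeping only, using `itertools.product` as a stand-in for the enumeration `GridSearchCV` performs internally.

```python
from itertools import product

param = {"n_estimators": [100, 120], "max_depth": [60, 80, 100, 120]}
cv_folds = 5

names = list(param)
candidates = [dict(zip(names, values)) for values in product(*param.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "cross-validation fits")
# 8 candidates, 40 cross-validation fits
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set. Note that the selected values (120 and 120) are both at the upper edge of the grid, so larger values might be worth trying.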

In [0]:
rf = RandomForestClassifier(max_depth=120, n_estimators=120, n_jobs=1,random_state=0)
rf.fit(X_features,df_train['SECTION'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=120, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=120, n_jobs=1,
                       oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

# Model Prediction

In [0]:
X_test_tfidf= tfidf_vect.transform(df_test['STORY'])
X_test_features = pd.DataFrame(X_test_tfidf.toarray())

In [0]:
y_pred = rf.predict(X_test_features)

In [0]:
df_result = pd.DataFrame({'SECTION':y_pred})
df_result.head()

Unnamed: 0,SECTION
0,1
1,2
2,1
3,0
4,1


In [0]:
df_result.to_excel('Submission3.xlsx')

Result: 0.958 Rank: 188

# Part 2: Neural Networks

In [2]:
import pandas as pd
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

Using TensorFlow backend.


Word Embedding Model: Converts words to dense feature vectors, mapping semantic meaning into geometric space. There are two ways to obtain word embeddings:

1) Train word embedding during training of neural network.

2) Use pretrained word embedding in the network. 

The Keras `Tokenizer` utility maps the corpus to a dictionary whose keys are the words and whose values are their indices. `num_words` limits the encoding to the most common words. The most common word is assigned the integer 1, the next 2, and so on; index 0 is reserved for padding. With the default settings used here (no `oov_token`), words that are not in the vocabulary are dropped when a text is converted to a sequence.

Unlike the TF-IDF vectorizer, each output vector has the length of its text (in tokens) rather than the size of the vocabulary, and each value is a word index rather than a weight.
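The index assignment can be sketched in a few lines of plain Python. This is a simplified stand-in, not the Keras implementation: it splits on whitespace only (the real `Tokenizer` also filters punctuation), and the function names are illustrative.

```python
from collections import Counter

def build_word_index(texts):
    """Assign 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words=None):
    """Replace each known word with its index; skip unknown or too-rare words."""
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is None or (num_words is not None and idx >= num_words):
            continue
        sequence.append(idx)
    return sequence

texts = ["gold prices rose", "gold prices fell", "gold fell sharply"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("gold prices soared", word_index))  # [1, 2]
```

The unseen word "soared" simply disappears from the sequence, which is why sequences can be shorter than the original text.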

In [3]:
df_train = pd.read_excel("Data_Train.xlsx")
df_train.head()

Unnamed: 0,STORY,SECTION
0,But the most painful was the huge reversal in ...,3
1,How formidable is the opposition alliance amon...,0
2,Most Asian currencies were trading lower today...,3
3,"If you want to answer any question, click on ‘...",1
4,"In global markets, gold prices edged up today ...",3


In [4]:
df_test = pd.read_excel("Data_Test.xlsx")
df_test.head()

Unnamed: 0,STORY
0,2019 will see gadgets like gaming smartphones ...
1,It has also unleashed a wave of changes in the...
2,It can be confusing to pick the right smartpho...
3,The mobile application is integrated with a da...
4,We have rounded up some of the gadgets that sh...


In [0]:
sentences_train = df_train['STORY'].values
y_train = df_train['SECTION'].values
sentences_test = df_test['STORY'].values



# Data Cleaning

In [0]:
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(sentences_train)

In [0]:
X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)

In [0]:
vocab_size = len(tokenizer.word_index) + 1

In [9]:
print(sentences_train[2])
print(X_train[2])

Most Asian currencies were trading lower today. South Korean won was down 0.4%, China renminbi 0.23%, China Offshore 0.15%, Malaysian ringgit 0.12%, Indonesian rupiah 0.11%, Taiwan dollar 0.06%. However, Japanese yen was up 0.32%.


The dollar index, which measures the US currency’s strength against major currencies, was trading at 97.26, down 0.14% from its previous close of 97.395.
[94, 1239, 991, 67, 428, 448, 345, 317, 1251, 338, 24, 149, 108, 139, 218, 3864, 108, 729, 218, 4392, 108, 376, 3625, 3727, 108, 318, 2705, 3142, 108, 287, 3728, 253, 108, 2650, 141, 2197, 2161, 24, 36, 108, 1602, 1, 253, 442, 27, 1159, 1, 72, 2975, 1174, 104, 327, 991, 24, 428, 17, 2443, 745, 149, 108, 506, 16, 23, 457, 340, 4, 2443, 13029]


In [0]:
maxlen = 8000
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
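As a rough illustration of what the padding step does, the sketch below pads at the end and, for sequences longer than `maxlen`, keeps the last `maxlen` items (this mirrors the Keras default `truncating='pre'`, which the cell above leaves unchanged). `pad_post` is a hypothetical helper, not a Keras function.

```python
def pad_post(sequences, maxlen, value=0):
    """Post-pad with `value`; longer sequences keep their last `maxlen` items."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # drop from the front when too long
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

print(pad_post([[5, 3], [7, 1, 4, 9, 2, 8]], maxlen=4))
# [[5, 3, 0, 0], [4, 9, 2, 8]]
```

With `maxlen = 8000` most stories are probably far shorter than the padded length, so much of each input row is padding; a smaller `maxlen` might train faster, although that was not tested here.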

In [0]:
y_train = to_categorical(y_train)
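`to_categorical` turns the integer labels into one-hot rows so they match the 4-unit softmax output and the `categorical_crossentropy` loss. A plain-Python sketch of the same idea (the real function returns a floating-point NumPy array; `one_hot` is an illustrative name):

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1 in the label's column."""
    return [[1 if col == label else 0 for col in range(num_classes)] for label in labels]

# First four SECTION values from df_train.head()
print(one_hot([3, 0, 3, 1], num_classes=4))
# [[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
```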

# Model Training

## Train word embedding during training


In [0]:
from keras.models import Sequential
from keras import layers

In [0]:
embedding_dim = 300

first_model = Sequential()
first_model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
first_model.add(layers.GlobalMaxPool1D())
first_model.add(layers.Dense(64, activation='relu'))
first_model.add(layers.Dense(4, activation='softmax'))
first_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
first_model.summary()






Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 8000, 300)         11486100  
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 300)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                19264     
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 260       
Total params: 11,505,624
Trainable params: 11,505,624
Non-trainable params: 0
_________________________________________________________________
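The parameter counts in the summary can be reproduced by hand. The vocabulary size of 38287 is inferred from the summary itself (11,486,100 / 300) and matches the `vocab_size` reported later by the grid search.

```python
vocab_size, embedding_dim = 38287, 300

embedding_params = vocab_size * embedding_dim   # one 300-d vector per index
dense_1_params = embedding_dim * 64 + 64        # weights + biases
dense_2_params = 64 * 4 + 4                     # weights + biases

total = embedding_params + dense_1_params + dense_2_params
print(embedding_params, dense_1_params, dense_2_params, total)
# 11486100 19264 260 11505624
```

Almost all parameters sit in the embedding layer. The vocabulary size covers every word seen by the tokenizer, even though `num_words=20000` means only the most frequent ones ever appear in the input sequences.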


In [0]:
history1 = first_model.fit(X_train, y_train,
                    epochs=5,
                    verbose=2,
                    batch_size=50)
loss, accuracy = first_model.evaluate(X_train, y_train, verbose=False)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Epoch 1/5
 - 208s - loss: 0.9518 - acc: 0.6020
Epoch 2/5
 - 208s - loss: 0.1230 - acc: 0.9744
Epoch 3/5
 - 208s - loss: 0.0426 - acc: 0.9887
Epoch 4/5
 - 207s - loss: 0.0198 - acc: 0.9942
Epoch 5/5
 - 208s - loss: 0.0113 - acc: 0.9967


In [0]:
pred_train_embedding = first_model.predict_classes(X_test)

In [0]:
df_result1 = pd.DataFrame({'SECTION':pred_train_embedding})
df_result1.head()

Unnamed: 0,SECTION
0,1
1,2
2,1
3,1
4,1


In [0]:
df_result1.to_excel('Submission-WordEmbeddingTrainign.xlsx')

Accuracy: 0.97998544396 Rank: 49

## Using a Pretrained Word Embedding

In [16]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2019-09-20 15:45:37--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2019-09-20 15:45:37--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2019-09-20 15:45:37--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2019-0

In [17]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [0]:
import numpy as np

def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # Add 1 because index 0 is reserved for padding
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    # The GloVe text files are UTF-8 encoded
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word] 
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix
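Words that are missing from the GloVe file keep an all-zero row, so it can be useful to check how much of the vocabulary is actually covered. The helper below is a hypothetical addition (`embedding_coverage` is not part of the notebook); it works on nested lists and should also work on the NumPy matrix returned above.

```python
def embedding_coverage(embedding_matrix):
    """Fraction of vocabulary rows (excluding reserved row 0) with a non-zero vector."""
    rows = list(embedding_matrix)[1:]
    covered = sum(1 for row in rows if any(value != 0 for value in row))
    return covered / len(rows)

# Hypothetical 4-word vocabulary with 3-d vectors; word 2 has no pretrained vector.
toy_matrix = [
    [0.0, 0.0, 0.0],   # index 0 is reserved for padding
    [0.1, -0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.5, 0.4, -0.1],
    [0.2, 0.2, 0.2],
]
print(embedding_coverage(toy_matrix))  # 0.75
```

A low coverage value would suggest that many corpus words (for example, names or tokens with attached punctuation) have no pretrained vector and start from zeros.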

In [0]:
embedding_dim = 300
embedding_matrix = create_embedding_matrix('glove.6B.300d.txt', tokenizer.word_index, embedding_dim)

In [0]:
embedding_dim = 300

second_model = Sequential()
second_model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=True))
second_model.add(layers.GlobalMaxPool1D())
second_model.add(layers.Dense(64, activation='relu'))
second_model.add(layers.Dense(4, activation='softmax'))
second_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
second_model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 8000, 300)         11486100  
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 300)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                19264     
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 260       
Total params: 11,505,624
Trainable params: 11,505,624
Non-trainable params: 0
_________________________________________________________________


In [0]:
history2 = second_model.fit(X_train, y_train,
                    epochs=5,
                    verbose=2,
                    batch_size=50)
loss, accuracy = second_model.evaluate(X_train, y_train, verbose=False)

Epoch 1/5
 - 203s - loss: 0.8114 - acc: 0.7465
Epoch 2/5
 - 202s - loss: 0.2355 - acc: 0.9491
Epoch 3/5
 - 203s - loss: 0.1152 - acc: 0.9725
Epoch 4/5
 - 202s - loss: 0.0721 - acc: 0.9831
Epoch 5/5
 - 202s - loss: 0.0489 - acc: 0.9887


In [0]:
pred_pre_embedding = second_model.predict_classes(X_test)
df_result2 = pd.DataFrame({'SECTION':pred_pre_embedding})
df_result2.to_excel('Submission-WordEmbeddingPre.xlsx')

Result: 0.97052 Rank: 48

In [0]:
embedding_dim = 300

third_model = Sequential()
third_model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=False))
third_model.add(layers.GlobalMaxPool1D())
third_model.add(layers.Dense(64, activation='relu'))
third_model.add(layers.Dense(4, activation='softmax'))
third_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
third_model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 8000, 300)         11486100  
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 300)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 64)                19264     
_________________________________________________________________
dense_6 (Dense)              (None, 4)                 260       
Total params: 11,505,624
Trainable params: 19,524
Non-trainable params: 11,486,100
_________________________________________________________________


In [0]:
history3 = third_model.fit(X_train, y_train,
                    epochs=10,
                    verbose=2,
                    batch_size=50)
loss, accuracy = third_model.evaluate(X_train, y_train, verbose=False)

Epoch 1/10
 - 38s - loss: 0.1885 - acc: 0.9380
Epoch 2/10
 - 38s - loss: 0.1736 - acc: 0.9421
Epoch 3/10
 - 38s - loss: 0.1650 - acc: 0.9465
Epoch 4/10
 - 39s - loss: 0.1570 - acc: 0.9477
Epoch 5/10
 - 38s - loss: 0.1517 - acc: 0.9478
Epoch 6/10
 - 38s - loss: 0.1489 - acc: 0.9507
Epoch 7/10
 - 38s - loss: 0.1477 - acc: 0.9510
Epoch 8/10
 - 38s - loss: 0.1498 - acc: 0.9494
Epoch 9/10
 - 39s - loss: 0.1376 - acc: 0.9532
Epoch 10/10
 - 38s - loss: 0.1366 - acc: 0.9544


In [0]:
pred_pre_embedding_no = third_model.predict_classes(X_test)
df_result3 = pd.DataFrame({'SECTION':pred_pre_embedding_no})
df_result3.to_excel('Submission-WordEmbeddingPreNo.xlsx')

Result: 0.95305677

# Word Embedding + CNN

In [0]:
embedding_dim = 300

cnn_model = Sequential()
cnn_model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
cnn_model.add(layers.Conv1D(128, 5, activation='relu'))
cnn_model.add(layers.GlobalMaxPool1D())
cnn_model.add(layers.Dropout(0.1))
cnn_model.add(layers.Dense(64, activation='relu'))
cnn_model.add(layers.Dense(4, activation='softmax'))

cnn_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
cnn_model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 8000, 300)         11486100  
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 7996, 128)         192128    
_________________________________________________________________
global_max_pooling1d_5 (Glob (None, 128)               0         
_________________________________________________________________
dropout_8 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_8 (Dense)              (None, 4)                 260       
Total params: 11,686,744
Trainable params: 11,686,744
Non-trainable params: 0
__________________________________________
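The convolution layer's output length and parameter count in the summary follow from the kernel size and filter count. A short check, assuming the Keras defaults of `'valid'` padding and stride 1:

```python
maxlen, embedding_dim = 8000, 300
num_filters, kernel_size = 128, 5

conv_output_len = maxlen - kernel_size + 1                             # no padding, stride 1
conv_params = kernel_size * embedding_dim * num_filters + num_filters  # weights + biases
dense_params = num_filters * 64 + 64                                   # first Dense layer

print(conv_output_len, conv_params, dense_params)
# 7996 192128 8256
```

Each of the 128 filters slides over windows of 5 consecutive word vectors, and `GlobalMaxPool1D` then keeps the strongest response per filter, which is why the pooled output has 128 values regardless of the sequence length.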

In [0]:
history_cnn = cnn_model.fit(X_train, y_train,
                    epochs=5,
                    verbose=2,
                    batch_size=50)
loss, accuracy = cnn_model.evaluate(X_train, y_train, verbose=False)

Epoch 1/5
 - 1662s - loss: 0.5986 - acc: 0.7672
Epoch 2/5
 - 1532s - loss: 0.0624 - acc: 0.9822
Epoch 3/5
 - 1537s - loss: 0.0140 - acc: 0.9963
Epoch 4/5
 - 1534s - loss: 0.0088 - acc: 0.9971
Epoch 5/5
 - 1532s - loss: 0.0066 - acc: 0.9975


In [0]:
pred_cnn = cnn_model.predict_classes(X_test)
df_result_cnn = pd.DataFrame({'SECTION':pred_cnn})
df_result_cnn.to_excel('Submission-WordEmbeddingCNN.xlsx')

Result: 0.97161572 Rank: 48

# Hyperparameter Optimization

In [0]:
y_train = df_train['SECTION'].values

In [0]:
def create_model(num_filters, kernel_size, vocab_size, embedding_dim, maxlen):
  # embedding_dim is taken from the argument; it must match the width of embedding_matrix (300 here)
  cnn_model = Sequential()
  cnn_model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=False))
  cnn_model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
  cnn_model.add(layers.GlobalMaxPool1D())
  cnn_model.add(layers.Dropout(0.1))
  cnn_model.add(layers.Dense(64, activation='relu'))
  cnn_model.add(layers.Dense(4, activation='softmax'))

  cnn_model.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
  return cnn_model

In [0]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

In [26]:
embedding_dim = 300
param_grid = dict(num_filters=[64, 128],
                      kernel_size=[5, 7],
                      vocab_size=[vocab_size],
                      embedding_dim=[embedding_dim],
                      maxlen=[maxlen])
model_cnn= KerasClassifier(build_fn=create_model,
                            epochs=4, batch_size=50,
                            verbose=2)
grid = GridSearchCV(estimator=model_cnn, param_grid=param_grid,
                              cv=3, verbose=1,scoring = 'accuracy')
grid_result = grid.fit(X_train, y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits






[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/4
 - 445s - loss: 0.4339 - acc: 0.8531
Epoch 2/4
 - 443s - loss: 0.1098 - acc: 0.9677
Epoch 3/4
 - 444s - loss: 0.0582 - acc: 0.9831
Epoch 4/4
 - 443s - loss: 0.0336 - acc: 0.9917
Epoch 1/4
 - 424s - loss: 0.4545 - acc: 0.8450
Epoch 2/4
 - 422s - loss: 0.1219 - acc: 0.9607
Epoch 3/4
 - 424s - loss: 0.0574 - acc: 0.9841
Epoch 4/4
 - 423s - loss: 0.0362 - acc: 0.9906
Epoch 1/4
 - 424s - loss: 0.4738 - acc: 0.8344
Epoch 2/4
 - 423s - loss: 0.1279 - acc: 0.9626
Epoch 3/4
 - 423s - loss: 0.0732 - acc: 0.9797
Epoch 4/4
 - 423s - loss: 0.0398 - acc: 0.9896
Epoch 1/4
 - 661s - loss: 0.3673 - acc: 0.8747
Epoch 2/4
 - 661s - loss: 0.0895 - acc: 0.9719
Epoch 3/4
 - 661s - loss: 0.0354 - acc: 0.9912
Epoch 4/4
 - 660s - loss: 0.0198 - acc: 0.9949
Epoch 1/4
 - 662s - loss: 0.3457 - acc

[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 546.1min finished


Epoch 1/4
 - 990s - loss: 0.2877 - acc: 0.9014
Epoch 2/4
 - 990s - loss: 0.0789 - acc: 0.9747
Epoch 3/4
 - 989s - loss: 0.0299 - acc: 0.9924
Epoch 4/4
 - 991s - loss: 0.0189 - acc: 0.9953


In [27]:
grid_result.best_params_

{'embedding_dim': 300,
 'kernel_size': 5,
 'maxlen': 8000,
 'num_filters': 128,
 'vocab_size': 38287}

In [28]:
grid_result.best_score_

0.9619821709491347