# **INTRODUCTION**

---


**Task-** Build a ML model in python that can Identify the category of news based on headlines and short descriptions.

**Data-**
This dataset contains around 125k news headlines from the year 2013 to 2018 obtained from HuffPost. 


```
{
    "short_description": "She left her husband. He killed their children. Just another day in America.",
    "headline" : "There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV",
    "date" : "2018-05-26",
    "link": "https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89"
    "authors": "Melissa Jeltsen",
    "category" : "CRIME"
}

```
# STEPS TAKEN

---


**Step-1** : Data Preprocessing- It involves all the preprocessing steps that were taken over Data in order to produce the desired results.
* Importing the Dataset
* Dealing with Categorical Data
* Scaling the Data
* Handling NaNs
* Getting Ready Training Data

**Step-2 ** : Splitting the Data into Training Set and Test Set

**Step-3** : Fitting the Model over Train and Test Set

**Step-4** : Testing with Test Data


These are the Basic Steps that are performed while making a ML Model but here we also apply some advance steps in order to make our classifier more accurate and at the same time prevent it from over fitting.






# DATA PREPROCESSING

In [0]:
'''
As the given Data is in json format firstly we need to format it according to our use.
The json data file is file converted into list type so that it can easily be consumed by pandas.
'''
import json
data = []
with open('News_Category_Dataset.json') as f:
    for line in f:
        data.append(json.loads(line))

In [0]:
'''
Here we import our list data into pandas for further processing.
The dataset consists of blank spaces which is replaced by NaNs.
As the quantity of data is good we can drop some unstructured data with the help of pandas.
finally dropna changes the index so we reindex them.

'''
import pandas as pd
import numpy as np
dataset = pd.DataFrame.from_dict(data)
dataset = dataset.replace(r'\s+( +\.)|#',np.nan,regex=True).replace('',np.nan)
dataset.dropna(inplace=True)
dataset.reset_index(drop=True, inplace=True)

In [0]:
'''
Make Training data Dictionary
'''
training_data = []
availableCategories = ['QUEER VOICES', 'STYLE', 'WEIRD NEWS', 'THE WORLDPOST', 'ARTS', 'LATINO VOICES', 'EDUCATION', 'ARTS & CULTURE',
                'POLITICS', 'WOMEN', 'TECH', 'BUSINESS', 'BLACK VOICES', 'COLLEGE', 'PARENTS', 'COMEDY', 'HEALTHY LIVING', 
                'SPORTS', 'RELIGION', 'GREEN', 'GOOD NEWS', 'TASTE', 'CRIME', 'ENTERTAINMENT', 'FIFTY', 'MEDIA', 'WORLDPOST', 
                'SCIENCE', 'TRAVEL', 'WORLD NEWS', 'IMPACT']
for i in range(0,len(dataset)):
    training_data.append({'data' : dataset['headline'][i] + ' ' + dataset['short_description'][i], 'flag' : availableCategories.index(dataset['category'][i])})
  

In [0]:
'''
Using training data dict for making Training Dataframe
Also Exporting it to CSV for future uses.
'''
training_data = pd.DataFrame(training_data, columns=['data', 'flag'])
training_data.to_csv("train_data.csv", sep=',', encoding='utf-8')


In [0]:
'''
Here we use pickle so that python objects can be saved on disk.
The CountVectorizer provides a simple way to both tokenize
a collection of text documents and build a vocabulary of known words,
but also to encode new documents using that vocabulary.

'''
import pickle
from sklearn.feature_extraction.text import CountVectorizer


#Getting the Vector Count
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_data.data)

#Saving the Generated Word Vector for Further Use
pickle.dump(count_vect.vocabulary_, open("count_vector.pkl","wb"))

In [0]:
'''
The TfidfVectorizer will tokenize documents, learn the vocabulary and 
inverse document frequency weightings, and allow to encode new documents.
Alternately, if you already have a learned CountVectorizer,
we can use it with a TfidfTransformer to just calculate the inverse
document frequencies and start encoding documents.

'''
from sklearn.feature_extraction.text import TfidfTransformer

#Transforming word vector to tf-idf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

#Saving the tf-idf
pickle.dump(tfidf_transformer, open("tfidf.pkl","wb"))

# SPLITTING DATA INTO TRAINING AND TESTING DATA

In [0]:
'''
Splitting the Dataset in train and test data
'''
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, training_data.flag, test_size=0.20, random_state=30)

# FITTING THE MODEL 

In [0]:
'''
Fitting the Model over our Training set.
Here we use Support Vector Machines as they give quite good results with Tf-idf.
Then Save the model File for Future use.
'''
from sklearn import svm
clf_svm = svm.LinearSVC()
clf_svm.fit(X_train_tfidf, training_data.flag)
pickle.dump(clf_svm, open("svm.pkl", "wb"))

# PREDICTING THE RESULT

In [15]:
'''
Now we predict over test data.
And Printing it along with the original data.
'''
predicted = clf_svm.predict(X_test)
result_svm = pd.DataFrame( {'true_labels': y_test,'predicted_labels': predicted})
result_svm.to_csv('res_svm.csv', sep = ',')
for predicted_item, result in zip(predicted, y_test):
    print(availableCategories[predicted_item], ' - ', availableCategories[result])

POLITICS  -  POLITICS
THE WORLDPOST  -  THE WORLDPOST
PARENTS  -  PARENTS
HEALTHY LIVING  -  HEALTHY LIVING
RELIGION  -  RELIGION
TASTE  -  TASTE
MEDIA  -  MEDIA
HEALTHY LIVING  -  HEALTHY LIVING
GREEN  -  GREEN
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  POLITICS
QUEER VOICES  -  QUEER VOICES
ENTERTAINMENT  -  ENTERTAINMENT
STYLE  -  STYLE
POLITICS  -  POLITICS
RELIGION  -  RELIGION
BLACK VOICES  -  BLACK VOICES
POLITICS  -  POLITICS
POLITICS  -  POLITICS
POLITICS  -  POLITICS
TASTE  -  TASTE
BLACK VOICES  -  BLACK VOICES
POLITICS  -  WORLD NEWS
BLACK VOICES  -  BLACK VOICES
POLITICS  -  BUSINESS
POLITICS  -  POLITICS
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  POLITICS
GREEN  -  GREEN
HEALTHY LIVING  -  HEALTHY LIVING
POLITICS  -  POLITICS
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  POLITICS
ENTERTAINMENT  -  ENTERTAINMENT
ENTERTAINMENT  -  ENTERTAINMENT
PARENTS  -  WOMEN
EDUCATION  -  EDUCATION
THE WORLDPOST  -  THE WORLDPOST
CRIME  -  CRIME
POLITICS  -  POLITICS
BLACK VOICES  - 

# EVALUATING THE MODEL ACCURACY

In [16]:
'''
Evaluating Model Accuracy with Different Metrics
'''
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
 
accuracy = accuracy_score(y_test, predicted)
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, predicted, average=None)
 
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1_score)

Accuracy:  0.9430808294540838
Precision:  [0.94334278 0.9847561  0.98863636 0.96100917 1.         0.97101449
 0.95973154 0.99180328 0.92390737 0.95209581 0.98095238 0.95804196
 0.95171026 0.95054945 0.93351801 0.93467337 0.92443572 0.94888179
 0.94230769 0.9375     0.98726115 0.97536946 0.94968553 0.92886457
 0.96825397 0.95121951 0.9765625  0.99453552 0.971875   0.97687861
 0.96694215]
Recall:  [0.94468085 0.9556213  0.94822888 0.94369369 0.99453552 0.88741722
 0.92857143 0.96414343 0.97429306 0.84875445 0.96261682 0.90578512
 0.88576779 0.9558011  0.93871866 0.84036145 0.942      0.96585366
 0.93929712 0.92783505 0.91715976 0.99       0.96178344 0.95949129
 0.93846154 0.86240786 0.95785441 0.93814433 0.98107256 0.87338501
 0.87531172]
F1 score:  [0.94401134 0.96996997 0.96801113 0.95227273 0.99726027 0.92733564
 0.94389439 0.97777778 0.9484315  0.89746002 0.97169811 0.93118097
 0.91755577 0.95316804 0.93611111 0.8850119  0.93313522 0.95729251
 0.9408     0.93264249 0.95092025 0.98263

# TESTING WITH OWN NEWS HEADLINE

In [17]:
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

sample =['Director David Nutter Talks \'Game Of Thrones\' Security On Set: Like The \'Gestapo\'']

X_new_counts = loaded_vec.transform(sample)
X_new_tfidf = loaded_tfidf.transform(X_new_counts)
predicted = loaded_model.predict(X_new_tfidf)

print(availableCategories[predicted[0]])

ENTERTAINMENT


# FINAL WORDS
All the necessary steps are implemented and the desired Result is produced.

Now Lets do some advance stuff.

# Lets Try Different Model and Algorithms

**First Is Multinomial Naive Bayes**

In [0]:
# Multinomial Naive Bayes

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

#clf = MultinomialNB().fit(X_train_tfidf, training_data.flag)
X_train_mult, X_test_mult, y_train_mult, y_test_mult = train_test_split(X_train_tfidf, training_data.flag, test_size=0.20, random_state=2)
clf_mult = MultinomialNB().fit(X_train, y_train)

#SAVE MODEL
pickle.dump(clf_mult, open("nb_model.pkl", "wb"))

In [20]:
predicted_mult = clf_mult.predict(X_test_mult)
result_bayes = pd.DataFrame( {'true_labels': y_test_mult,'predicted_labels': predicted_mult})
result_bayes.to_csv('res_bayes.csv', sep = ',')

for predicted_item, result in zip(predicted_mult, y_test_mult):
    print(availableCategories[predicted_item], ' - ', availableCategories[result])

POLITICS  -  POLITICS
POLITICS  -  POLITICS
POLITICS  -  CRIME
POLITICS  -  POLITICS
ENTERTAINMENT  -  ENTERTAINMENT
ENTERTAINMENT  -  STYLE
POLITICS  -  WORLD NEWS
ENTERTAINMENT  -  HEALTHY LIVING
ENTERTAINMENT  -  WOMEN
POLITICS  -  POLITICS
POLITICS  -  HEALTHY LIVING
POLITICS  -  SPORTS
POLITICS  -  POLITICS
POLITICS  -  ARTS
POLITICS  -  POLITICS
POLITICS  -  TASTE
POLITICS  -  POLITICS
POLITICS  -  POLITICS
POLITICS  -  BLACK VOICES
ENTERTAINMENT  -  ENTERTAINMENT
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  MEDIA
POLITICS  -  CRIME
POLITICS  -  COMEDY
POLITICS  -  WOMEN
POLITICS  -  QUEER VOICES
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  LATINO VOICES
POLITICS  -  BUSINESS
POLITICS  -  WORLDPOST
POLITICS  -  WOMEN
POLITICS  -  RELIGION
POLITICS  -  ENTERTAINMENT
POLITICS  -  POLITICS
POLITICS  -  FIFTY
POLITICS  -  WOMEN
POLITICS  -  GOOD NEWS
POLITICS  -  LATINO VOICES
POLITICS  -  BUSINESS
POLITICS  -  WORLDPOST
POLITICS  -  BUSINESS
POLITICS  -  POLITICS
POLITICS  -  THE WO

In [21]:
'''
Evaluating Model Accuracy with Different Metrics
'''
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
 
accuracy = accuracy_score(y_test_mult, predicted_mult)
print("Accuracy: ", accuracy)


Accuracy:  0.3818239526026238


**Second we Use a Neural Network**

In [23]:
from sklearn.neural_network import MLPClassifier

X_train_net, X_test_net, y_train_net, y_test_net = train_test_split(X_train_tfidf, training_data.flag, test_size=0.20, random_state=30)

clf_net = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)
clf_net.fit(X_train_net, y_train_net)

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(15,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

In [0]:
pickle.dump(clf_net, open("softmax.pkl", "wb"))

In [25]:
predicted_net = clf_net.predict(X_test)
result_softmax = pd.DataFrame( {'true_labels': y_test_net,'predicted_labels': predicted_net})
result_softmax.to_csv('res_softmax.csv', sep = ',')

for predicted_item, result in zip(predicted_net, y_test_net):
    print(availableCategories[predicted_item], ' - ', availableCategories[result])

BLACK VOICES  -  POLITICS
THE WORLDPOST  -  THE WORLDPOST
STYLE  -  PARENTS
HEALTHY LIVING  -  HEALTHY LIVING
SPORTS  -  RELIGION
WEIRD NEWS  -  TASTE
POLITICS  -  MEDIA
HEALTHY LIVING  -  HEALTHY LIVING
WEIRD NEWS  -  GREEN
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  POLITICS
POLITICS  -  QUEER VOICES
ENTERTAINMENT  -  ENTERTAINMENT
ENTERTAINMENT  -  STYLE
POLITICS  -  POLITICS
RELIGION  -  RELIGION
EDUCATION  -  BLACK VOICES
POLITICS  -  POLITICS
POLITICS  -  POLITICS
QUEER VOICES  -  POLITICS
TASTE  -  TASTE
BLACK VOICES  -  BLACK VOICES
POLITICS  -  WORLD NEWS
PARENTS  -  BLACK VOICES
POLITICS  -  BUSINESS
POLITICS  -  POLITICS
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  POLITICS
POLITICS  -  GREEN
HEALTHY LIVING  -  HEALTHY LIVING
POLITICS  -  POLITICS
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  POLITICS
ENTERTAINMENT  -  ENTERTAINMENT
POLITICS  -  ENTERTAINMENT
WOMEN  -  WOMEN
COLLEGE  -  EDUCATION
WOMEN  -  THE WORLDPOST
CRIME  -  CRIME
POLITICS  -  POLITICS
MEDIA  -  BLACK V

In [26]:
'''
Evaluating Model Accuracy with Different Metrics
'''
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
 
accuracy = accuracy_score(y_test_net, predicted_net)
print("Accuracy: ", accuracy)

Accuracy:  0.543377063055438


**Final Words**
As we see that the above two models are not that great.
May be we can get some higher accuracy with the hyper parameter tuning but its going to be less than (SVM + TF-IDF)

In [0]:
#The downloaded models can be loaded with these commands.
#LOAD MODEL
from sklearn.feature_extraction.text import CountVectorizer
import pickle

loaded_vec = CountVectorizer(vocabulary=pickle.load(open("count_vector.pkl", "rb")))
loaded_tfidf = pickle.load(open("tfidf.pkl","rb"))
loaded_model = pickle.load(open("nb_model.pkl","rb"))