This dataset contains references to news pages collected from a web aggregator between 10 March 2014 and 10 August 2014. The resources are grouped into categories that represent pages discussing the same story. The dataset covers four news categories: business (b), science and technology (t), entertainment (e), and health (m).

The task is to predict which category a resource belongs to, given only its title.

In [1]:
path = '/Users/herambdharmadhikari/OneDrive - stevens.edu/NLP/uci-news-aggregator.csv'

# Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix


In [2]:
# Load the data and take a first look at its structure
news = pd.read_csv(path)
print(news.head())
news.info()

   ID                                              TITLE  \
0   1  Fed official says weak data caused by weather,...   
1   2  Fed's Charles Plosser sees high bar for change...   
2   3  US open: Stocks fall after Fed official hints ...   
3   4  Fed risks falling 'behind the curve', Charles ...   
4   5  Fed's Plosser: Nasty Weather Has Curbed Job Gr...   

                                                 URL          PUBLISHER  \
0  http://www.latimes.com/business/money/la-fi-mo...  Los Angeles Times   
1  http://www.livemint.com/Politics/H2EvwJSK2VE6O...           Livemint   
2  http://www.ifamagazine.com/news/us-open-stocks...       IFA Magazine   
3  http://www.ifamagazine.com/news/fed-risks-fall...       IFA Magazine   
4  http://www.moneynews.com/Economy/federal-reser...          Moneynews   

  CATEGORY                          STORY             HOSTNAME      TIMESTAMP  
0        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM      www.latimes.com  1394470370698  
1        b  ddUyU0VZz0BRneMi

Since our aim is to classify the articles by title, let's keep only the relevant columns, TITLE and CATEGORY, and drop the rest.

In [3]:
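# Keep only the two relevant columns; .copy() avoids the SettingWithCopyWarning
# that pandas can raise when TITLE is reassigned later
news = news[['TITLE', 'CATEGORY']].copy()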

In [4]:
news.head()

                                               TITLE CATEGORY
0  Fed official says weak data caused by weather,...        b
1  Fed's Charles Plosser sees high bar for change...        b
2  US open: Stocks fall after Fed official hints ...        b
3  Fed risks falling 'behind the curve', Charles ...        b
4  Fed's Plosser: Nasty Weather Has Curbed Job Gr...        b


In [5]:
# Count how many titles fall into each category
dist = news['CATEGORY'].value_counts()

In [6]:
print(dist,news.head())

e    152469
b    115967
t    108344
m     45639
Name: CATEGORY, dtype: int64                                                TITLE CATEGORY
0  Fed official says weak data caused by weather,...        b
1  Fed's Charles Plosser sees high bar for change...        b
2  US open: Stocks fall after Fed official hints ...        b
3  Fed risks falling 'behind the curve', Charles ...        b
4  Fed's Plosser: Nasty Weather Has Curbed Job Gr...        b
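
The counts above show that the categories are not balanced, which gives a useful reference point for the accuracies reported later. As a minimal sketch, using only the counts printed above, this is the accuracy a classifier would reach by always predicting the most frequent category:

```python
# Class counts copied from the value_counts() output above.
counts = {"e": 152469, "b": 115967, "t": 108344, "m": 45639}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Share of each category in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent category.
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.3f}")
print("majority-class baseline:", majority_label, round(baseline_accuracy, 3))
```

Any model trained below should comfortably beat this baseline to be considered useful.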


In [7]:
# Preprocess the titles
# Requires the NLTK stop-word corpus (nltk.download('stopwords'))
stop = set(stopwords.words('english'))
# Keep letters only: replace every non-alphabetic character with a space
news['TITLE'] = news['TITLE'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))
# Convert to lower case and tokenize on whitespace
news['TITLE'] = news['TITLE'].apply(lambda x: x.lower().split())
# Remove stop words from every title
news['TITLE'] = news['TITLE'].apply(lambda x: [i for i in x if i not in stop])
# Join the remaining tokens back into a single string
news['TITLE'] = news['TITLE'].apply(lambda x: ' '.join(x))
# Split into training (80%) and test (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(
    news["TITLE"], news["CATEGORY"], test_size=0.2, random_state=3)
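
To see what the four preprocessing steps do to a single title, here is a small standalone sketch. The title and the tiny `STOP` set are made up for illustration; the notebook itself uses the full NLTK English stop-word list, and `clean_title` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook
# uses the full NLTK list via stopwords.words('english').
STOP = {"by", "for", "the", "a", "an", "of", "to", "has", "after"}

def clean_title(title, stop=STOP):
    """Apply the same four steps as the preprocessing cell to one title."""
    letters_only = re.sub("[^a-zA-Z]", " ", title)      # keep letters only
    tokens = letters_only.lower().split()               # lowercase and tokenize
    kept = [tok for tok in tokens if tok not in stop]   # drop stop words
    return " ".join(kept)                               # back to one string

# Made-up example title.
print(clean_title("U.S. Stocks Rise 2% After the Fed's Statement"))
# prints: u s stocks rise fed s statement
```

Note that digits and punctuation disappear entirely, and that abbreviations and possessives leave single-letter tokens behind unless the stop-word list happens to contain them.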

In [8]:
# Initialize CountVectorizer (bag-of-words on single words)
count_vectorizer = CountVectorizer()
# Initialize TfidfVectorizer with unigrams, bigrams and trigrams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
# Fit each vectorizer on the training titles only, then reuse the fitted
# vocabulary to transform the test titles.
# Bag-of-words features for X_train and X_test
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)
# TF-IDF features for X_train and X_test
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
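
The difference between the two feature sets can be sketched by hand on a toy corpus. The three titles below are made up, the sketch uses single words only (the notebook's TfidfVectorizer also adds bigrams and trigrams), and it assumes the smoothed IDF and L2 row normalization that, as far as we know, are TfidfVectorizer's defaults. It is an illustration of the idea, not a reimplementation of scikit-learn.

```python
import math
from collections import Counter

# Hypothetical cleaned titles, for illustration only.
docs = ["fed sees weak data", "stocks fall fed hints", "weak stocks"]

tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: in how many titles each term appears.
df = Counter(tok for doc in tokenized for tok in set(doc))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf(t) for t in vocab]     # term count times IDF
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]                # L2-normalize the row

# Bag-of-words: raw term counts per title.
count_rows = [[Counter(doc)[t] for t in vocab] for doc in tokenized]
# TF-IDF: the same counts, reweighted so rare terms count for more.
tfidf_rows = [tfidf_row(doc) for doc in tokenized]

print(vocab)
print(count_rows[0])
print([round(v, 3) for v in tfidf_rows[0]])
```

In the first title, "sees" and "data" occur in only one document and therefore receive a larger weight than "fed" and "weak", although all four have the same raw count of 1.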


In [9]:

# initialize multinomial naive bayes
nb_1 = MultinomialNB()
nb_2 = MultinomialNB()

# fit on count vectorizer training data
nb_1.fit(X_train_count, Y_train)

# fit on tfidf vectorizer training data
nb_2.fit(X_train_tfidf, Y_train)

# accuracy with count vectorizer (accuracy_score expects y_true first)
acc_count_nb = accuracy_score(Y_test, nb_1.predict(X_test_count))

# accuracy with tfidf vectorizer
acc_tfidf_nb = accuracy_score(Y_test, nb_2.predict(X_test_tfidf))

# display accuracies
print(acc_count_nb, acc_tfidf_nb)


0.9268618910089484 0.9323895648880262
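
A single accuracy figure does not show which categories get confused with each other. The sketch below builds a confusion matrix by hand from made-up labels to show the idea; the `y_true` and `y_pred` lists are hypothetical and are not results from the models above. scikit-learn's `confusion_matrix`, imported at the top of the notebook, produces the same kind of table with true categories as rows and predicted categories as columns.

```python
from collections import Counter

labels = ["b", "e", "m", "t"]

# Hypothetical true and predicted categories, for illustration only.
y_true = ["b", "b", "t", "t", "e", "e", "m", "m"]
y_pred = ["b", "t", "t", "t", "e", "e", "m", "b"]

# Count each (true, predicted) pair.
pairs = Counter(zip(y_true, y_pred))

# Rows are true categories, columns are predicted categories.
matrix = [[pairs[(t, p)] for p in labels] for t in labels]

# Accuracy is the diagonal (correct predictions) over all predictions.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

print("   " + "  ".join(labels))
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy)
```

In the real notebook, the analogous call would be `confusion_matrix(Y_test, nb_1.predict(X_test_count))`.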


# Predicting with Logistic Regression
Logistic regression is a binary classifier by nature, but wrapped in a OneVsRestClassifier it can handle multiclass problems as well: one binary model is trained per category, and the category whose model gives the highest score is predicted. We train this combination on both feature sets, bag-of-words and TF-IDF, and compare the accuracy on the test set.
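
The one-vs-rest idea can be sketched without any model fitting. In the snippet below, the labels and the per-category probabilities are made up for illustration; in the notebook, the probabilities would come from the fitted binary logistic regression models.

```python
# Hypothetical categories of four training titles, for illustration only.
classes = ["b", "e", "m", "t"]
y = ["b", "t", "e", "b"]

# Step 1: build one binary target vector per category
# ("this category" = 1, "the rest" = 0).
binary_targets = {c: [1 if label == c else 0 for label in y] for c in classes}

# Step 2 (not shown): fit one binary logistic regression per target vector.
# Hypothetical probabilities those four models might return for a new title.
scores = {"b": 0.15, "e": 0.05, "m": 0.10, "t": 0.80}

# Step 3: predict the category whose binary model is most confident.
prediction = max(scores, key=scores.get)

print(binary_targets["b"])  # [1, 0, 0, 1]
print(prediction)           # t
```

Because the binary models are trained independently, their scores need not sum to 1; only their ranking matters for the final prediction.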

In [12]:
# Initialize the one-vs-rest logistic regression classifiers.
# Two separate classifiers are needed because we compare the
# bag-of-words and TF-IDF approaches.
logreg_1 = OneVsRestClassifier(LogisticRegression(random_state=10, max_iter=10000))
logreg_2 = OneVsRestClassifier(LogisticRegression(random_state=10, max_iter=100000))
# Fit the models
# Bag-of-words
logreg_1.fit(X_train_count, Y_train)
# TF-IDF
logreg_2.fit(X_train_tfidf, Y_train)

# Compute the test accuracy (accuracy_score expects y_true first)
acc_count_logreg = accuracy_score(Y_test, logreg_1.predict(X_test_count))
acc_tfidf_logreg = accuracy_score(Y_test, logreg_2.predict(X_test_tfidf))

# Compare the results
print("The accuracy using Bag-of-words approach is:",acc_count_logreg)
print("\n")
print("The accuracy using TF-IDF approach is:", acc_tfidf_logreg)


The accuracy using Bag-of-words approach is: 0.9464632356422518


The accuracy using TF-IDF approach is: 0.9428649211685053
