# Build a Naive Bayes Model to Predict the Category of News

Other classifiers to try on the same features:

* Logistic Regression
* Decision Tree
* Random Forest
* Support Vector Classifier

## News Categorization using Multinomial Naive Bayes

The objective here is to show how to use Multinomial Naive Bayes to classify news headlines into a set of predefined classes.

The News Aggregator Data Set comes from the UCI Machine Learning Repository. 


This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. News categories in this dataset are labelled:

* b: business; 
* t: science and technology; 
* e: entertainment; and 
* m: health.

In [1]:
# import Libraries

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [2]:
# load dataset

df = pd.read_csv('uci-news-aggregator.csv')
df.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [3]:
df.shape

(422424, 8)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 422424 entries, 0 to 422423
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   ID         422424 non-null  int64 
 1   TITLE      422424 non-null  object
 2   URL        422424 non-null  object
 3   PUBLISHER  422422 non-null  object
 4   CATEGORY   422424 non-null  object
 5   STORY      422424 non-null  object
 6   HOSTNAME   422424 non-null  object
 7   TIMESTAMP  422424 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 25.8+ MB


In [5]:
df.isna().sum()

ID           0
TITLE        0
URL          0
PUBLISHER    2
CATEGORY     0
STORY        0
HOSTNAME     0
TIMESTAMP    0
dtype: int64

###### The goal is to predict CATEGORY from the TITLE text; the other columns are not used
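Since only `TITLE` and `CATEGORY` matter for this task, the other columns could be skipped at load time. The following is a minimal sketch, assuming the same column names as in the output above; the in-memory CSV and the name `small_df` are made-up stand-ins, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for uci-news-aggregator.csv
csv_text = (
    "ID,TITLE,URL,PUBLISHER,CATEGORY\n"
    "1,Fed official says weak data caused by weather,http://example.com/a,Example Times,b\n"
    "2,New phone released,http://example.com/b,Example Tech,t\n"
)

# usecols tells read_csv to parse and keep only the listed columns
small_df = pd.read_csv(io.StringIO(csv_text), usecols=["TITLE", "CATEGORY"])
print(small_df.columns.tolist())
```

The rest of the notebook keeps the full DataFrame, so this is only an optional way to reduce memory use.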

In [6]:
df.CATEGORY.unique()

array(['b', 't', 'e', 'm'], dtype=object)

In [7]:
# 'b' = Business, 't' = Science and Technology, 'e' = Entertainment, 'm' = Health

In [8]:
df.CATEGORY.value_counts()

CATEGORY
e    152469
b    115971
t    108344
m     45640
Name: count, dtype: int64

#### EDA -- Normalise the TITLE column

In [9]:
# Regex module
import re

In [10]:
def normalize_text(s):
    s = s.lower()
    
    # Remove punctuation that is not word-internal
    # (raw strings avoid invalid-escape warnings for \s and \W)
    s = re.sub(r'\s\W', ' ', s)  # whitespace followed by a non-word character
    s = re.sub(r'\W\s', ' ', s)  # non-word character (not a letter, digit or underscore) followed by whitespace
    
    s = re.sub(r'\s+', ' ', s)   # collapse any run of whitespace into a single space
    
    return s
    

In [11]:
# add a column with the normalized title text

df['Text'] = [normalize_text(s) for s in df['TITLE']]

In [12]:
df.head(1)

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP,Text
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698,fed official says weak data caused by weather ...


In [13]:
df.Text

0         fed official says weak data caused by weather ...
1         fed's charles plosser sees high bar for change...
2         us open stocks fall after fed official hints a...
3         fed risks falling behind the curve' charles pl...
4         fed's plosser nasty weather has curbed job growth
                                ...                        
422419                                    i am salesman ...
422420                                 i am businessman ...
422421                                   sales business ...
422422                              sick not well fever ...
422423                                    i am salesman ...
Name: Text, Length: 422424, dtype: object
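Row 3 above still ends `curve'` with an apostrophe, which is worth understanding. The sketch below repeats `normalize_text` on one headline, with a made-up ending so the string is complete. Each `re.sub` is a single left-to-right pass: in `curve', ` only the `, ` pair matches `\W\s`, so the apostrophe in front of the comma survives.

```python
import re

def normalize_text(s):
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)  # drop a non-word char that follows whitespace
    s = re.sub(r'\W\s', ' ', s)  # drop a non-word char that precedes whitespace
    s = re.sub(r'\s+', ' ', s)   # collapse repeated whitespace
    return s

# Illustrative headline; the ending after "Charles" is assumed for the example
sample = "Fed risks falling 'behind the curve', Charles Plosser says"
cleaned = normalize_text(sample)
print(cleaned)
```

This leftover punctuation is harmless here, because `CountVectorizer` with its default settings only keeps runs of word characters when it tokenizes.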

#### Feature Extraction using Count Vectoriser

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
vectorizer = CountVectorizer()

In [16]:
X = vectorizer.fit_transform(df['Text'])
X

<422424x54637 sparse matrix of type '<class 'numpy.int64'>'
	with 3747887 stored elements in Compressed Sparse Row format>

In [17]:
X.shape, X.size

((422424, 54637), 3747887)
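To make the shape of `X` concrete: each row is one title, each column is one vocabulary token, and each stored element is a non-zero word count. A small sketch on two made-up titles (the `toy_` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_titles = ["fed official says weak data", "fed risks falling behind"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_titles)

# vocabulary_ maps each token to its column index; columns follow sorted token order
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(columns)
print(toy_counts.toarray())  # dense view is fine for a toy matrix, not for 422424 x 54637
```

On the real data the matrix stays sparse: 3,747,887 stored elements across 422,424 rows is roughly nine distinct tokens per title.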

In [18]:
# Encode the output column 'CATEGORY' with LabelEncoder

from sklearn.preprocessing import LabelEncoder

In [19]:
a = df.CATEGORY.unique()  # storing unencoded unique values in variable a

In [20]:
a

array(['b', 't', 'e', 'm'], dtype=object)

In [21]:
encoder = LabelEncoder()

In [22]:
y = encoder.fit_transform(df['CATEGORY'])

In [23]:
y

array([0, 0, 0, ..., 0, 2, 0])

In [24]:
y.shape

(422424,)
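One detail to keep in mind when reading `y`: `LabelEncoder` assigns integer codes in sorted label order, not in order of first appearance. The fitted `classes_` attribute is the reliable source for the label-to-code mapping. A small sketch on made-up labels (the `toy_` names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Labels appear in the order b, t, e, m, as in df.CATEGORY.unique()
toy_labels = np.array(['b', 't', 'e', 'm', 'b'])
toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

print(toy_encoder.classes_)  # sorted: the position of each label is its code
print(toy_codes)

# Build the mapping from the fitted encoder itself
toy_mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(toy_mapping)
```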

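In [25]:
b = np.unique(y) # storing encoded unique values in variable b

In [26]:
# LabelEncoder sorts the classes alphabetically, so pair the codes with
# encoder.classes_ rather than with a, which is in order of first appearance
cat_uniques = dict(zip(encoder.classes_, b))

In [27]:
cat_uniques

{'b': 0, 'e': 1, 'm': 2, 't': 3}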

In [28]:
# train_test_split

from sklearn.model_selection import train_test_split

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=55)

In [30]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(316818, 54637)
(316818,)
(105606, 54637)
(105606,)
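The classes are imbalanced ('m' has far fewer rows than 'e'), so one option is a stratified split that keeps class proportions the same in train and test. This is a sketch of that variant on made-up labels, not what the notebook above ran; only the `stratify` argument differs from the call in cell 29.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class 0 and 20 rows of class 1
toy_target = np.array([0] * 80 + [1] * 20)
toy_features = np.arange(100).reshape(-1, 1)

Xtr, Xte, ytr, yte = train_test_split(
    toy_features, toy_target, test_size=0.25, random_state=55, stratify=toy_target
)

# Both splits keep the 80/20 class ratio
print(np.bincount(ytr), np.bincount(yte))
```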


### Naive - Bayes Algorithm

In [31]:
from sklearn.naive_bayes import MultinomialNB

In [32]:
nb_model = MultinomialNB()

In [33]:
nb_model.fit(X_train, y_train)

In [34]:
nb_model.score(X_test, y_test)

0.927381020017802

In [35]:
# The model above reaches ~92.74% accuracy on the held-out test set
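To show what the classifier computes, here is a plain-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`. It is an illustrative reimplementation on made-up titles, not scikit-learn's code; `train_nb` and `predict_nb` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    log_lik = {}
    for c in docs_per_class:
        # Denominator: all word occurrences in class c plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(text, log_prior, log_lik):
    """Pick the class with the highest log prior + summed log likelihoods."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in text.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)

toy_docs = [
    "stocks fall after fed hints",
    "fed sees high bar",
    "new movie tops box office",
    "singer releases new album",
]
toy_labels = ["b", "b", "e", "e"]
log_prior, log_lik = train_nb(toy_docs, toy_labels)
print(predict_nb("fed stocks rise", log_prior, log_lik))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.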

###### Build a helper to predict the category of unseen text

In [36]:
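def category_predictor(title):
    # Readable names for the single-letter dataset codes
    category_names = {'b':'Business', 't':'Technology', 'e':'Entertainment','m':'Health'}
    # Vectorize with the fitted CountVectorizer, predict the integer code, then decode it
    cod = nb_model.predict(vectorizer.transform([title]))
    return category_names[encoder.inverse_transform(cod)[0]]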

In [37]:
print(category_predictor("Prime Minister Narendra Modi on Friday inaugurated the Mumbai Trans-harbour link (MTHL), now named 'Atal Bihari Vajpayee Sewari-Nhava Sheva Atal Setu', constructed at a cost of over Rs 17,840 crore. Atal Setu is the longest bridge in India and also the longest sea bridge in the country. It will provide faster connectivity to Mumbai International Airport and Navi Mumbai International Airport and will also reduce the travel time from Mumbai to Pune, Goa and South India."))

Business
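The model always returns one of the four classes, even for text that fits none of them well, so it can help to look at class probabilities as well as the label. The sketch below is an alternative arrangement, not the notebook's code. It assumes made-up training titles and chains the vectorizer and classifier in a scikit-learn `Pipeline`, which accepts the string labels directly and so needs no separate `LabelEncoder`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set
toy_titles = [
    "stocks fall after fed hints",
    "fed sees high bar",
    "new movie tops box office",
    "singer releases new album",
]
toy_cats = ["b", "b", "e", "e"]

# Raw text in, category letter out
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_titles, toy_cats)

headline = "fed stocks rise"
predicted = toy_pipeline.predict([headline])[0]
probabilities = toy_pipeline.predict_proba([headline])[0]
classes = toy_pipeline.named_steps["multinomialnb"].classes_

print(predicted)
print(dict(zip(classes, probabilities.round(3))))
```

A low top probability is a hint that the input may not belong to any of the trained categories.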


### Predicting With Logistic Regression

In [38]:
from sklearn.linear_model import LogisticRegression

In [39]:
logreg = LogisticRegression()

In [40]:
logreg.fit(X_train, y_train)

In [41]:
y_pred = logreg.predict(X_test)

In [42]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [43]:
accuracy_score(y_test, y_pred)

0.947379883718728
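Cell 42 also imports `confusion_matrix` and `classification_report`, which give a per-class view that a single accuracy number hides. Here is a minimal sketch of how they could be called, using made-up labels and predictions (the `toy_` names are hypothetical). In the notebook the arguments would be `y_test` and `y_pred`, with `target_names` taken from `encoder.classes_`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Hypothetical true and predicted integer codes for six headlines
toy_true = [0, 0, 1, 1, 2, 3]
toy_pred = [0, 1, 1, 1, 2, 3]

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_cm = confusion_matrix(toy_true, toy_pred)  # rows: true class, columns: predicted class

print(toy_accuracy)
print(toy_cm)
# target_names must follow the sorted code order 0..3
print(classification_report(toy_true, toy_pred, target_names=["b", "e", "m", "t"]))
```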

In [44]:
def cat_pred(text):
    # Same helper as category_predictor, but backed by the logistic regression model
    cat_names = {'b':'Business', 't':'Technology', 'e':'Entertainment','m':'Health'}
    cod = logreg.predict(vectorizer.transform([text]))
    return cat_names[encoder.inverse_transform(cod)[0]]

In [45]:
print(cat_pred("Prime Minister Narendra Modi on Friday inaugurated the Mumbai Trans-harbour link (MTHL), now named 'Atal Bihari Vajpayee Sewari-Nhava Sheva Atal Setu', constructed at a cost of over Rs 17,840 crore. Atal Setu is the longest bridge in India and also the longest sea bridge in the country. It will provide faster connectivity to Mumbai International Airport and Navi Mumbai International Airport and will also reduce the travel time from Mumbai to Pune, Goa and South India."))

Business
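The remaining classifiers from the list at the top can be tried with the same fit and score pattern. This is a sketch on made-up titles, not results from the real data. `LinearSVC` is swapped in for the support vector classifier on the assumption that a linear model is the practical choice for a large sparse matrix. The scores printed are training accuracy on the toy set only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Hypothetical miniature dataset, two titles per category
toy_titles = [
    "stocks fall after fed hints", "bank profits beat forecasts",
    "new phone chip unveiled", "software update fixes bug",
    "new movie tops box office", "singer releases album",
    "flu vaccine trial begins", "study links diet to heart health",
]
toy_cats = ["b", "b", "t", "t", "e", "e", "m", "m"]
toy_matrix = CountVectorizer().fit_transform(toy_titles)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=55),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=55),
    "Linear SVC": LinearSVC(),
}

train_scores = {}
for name, model in candidates.items():
    model.fit(toy_matrix, toy_cats)
    # Training accuracy only; use X_test / y_test for a real comparison
    train_scores[name] = model.score(toy_matrix, toy_cats)

print(train_scores)
```

On the full dataset, tree ensembles over 54,637 sparse columns may take considerably longer to train than Naive Bayes or logistic regression.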

