## **1. Data Reading**

In [1]:
import pandas as pd
data=pd.read_excel("Prepared_data.xlsx")

In [2]:
data

Unnamed: 0,document_topic,document_title,document_content
0,business,Ad sales boost Time Warner profit,Quarterly profits at US media giant TimeWarner...
1,business,Dollar gains on Greenspan speech,The dollar has hit its highest level against t...
2,business,Yukos unit buyer faces loan claim,The owners of embattled Russian oil giant Yuko...
3,business,High fuel prices hit BA's profits,British Airways has blamed high fuel prices fo...
4,business,Pernod takeover talk lifts Domecq,Shares in UK drinks and food firm Allied Domec...
...,...,...,...
2220,tech,BT program to beat dialler scams,BT is introducing two initiatives to help beat...
2221,tech,Spam e-mails tempt net shoppers,Computer users across the world continue to ig...
2222,tech,Be careful how you code,A new European directive could put software wr...
2223,tech,US cyber security chief resigns,The man making sure US computer networks are s...


## **2. Data Cleaning and Preprocessing**

In [6]:
import re
import nltk
#nltk.download('all')

In [7]:
# Data cleaning and Preprocessing

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
corpus = []
for text in data['document_content']:
    # Keep letters only, lowercase, and split into tokens
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop stop words and lemmatize the remaining tokens
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))

In [8]:
corpus[0]

'quarterly profit u medium giant timewarner jumped bn three month december year earlier firm one biggest investor google benefited sale high speed internet connection higher advert sale timewarner said fourth quarter sale rose bn bn profit buoyed one gain offset profit dip warner bros le user aol time warner said friday owns search engine google internet business aol mixed fortune lost subscriber fourth quarter profit lower preceding three quarter however company said aol underlying profit exceptional item rose back stronger internet advertising revenue hope increase subscriber offering online service free timewarner internet customer try sign aol existing customer high speed broadband timewarner also restate result following probe u security exchange commission sec close concluding time warner fourth quarter profit slightly better analyst expectation film division saw profit slump helped box office flop alexander catwoman sharp contrast year earlier third final film lord ring trilogy 
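The cleaned text above contains bare tokens such as `bn` because the regular expression replaces every non-letter character, including digits and currency symbols, with a space. A minimal sketch of that single step, using a made-up sentence (not taken from the dataset) and only the standard library:

```python
import re

# Hypothetical sample sentence, used only to illustrate the regex step
sample = "Profits rose 76% to $1.13bn in Q4."

# Same pattern as the notebook: anything that is not a letter becomes a space
cleaned = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
print(cleaned)
```

Stop-word removal and lemmatization are then applied to these tokens, so numeric information never reaches the model.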

In [13]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

In [14]:
import pickle
vec_file = 'vectorizer.pkl'
# Save the fitted vectorizer so the same vocabulary can be reused at prediction time
with open(vec_file, 'wb') as f:
    pickle.dump(cv, f)

In [15]:
y=data['document_topic']

In [16]:
y.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: document_topic, dtype: int64

In [17]:
y=y.map({'sport':1, 'business':2, 'politics':3, 'tech':4, 'entertainment':5})
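Because the model is trained on integer labels, its predictions are integers too. One way to turn them back into topic names is to invert the mapping, as in the sketch below. The names `label_to_id`, `id_to_label`, and `predicted_ids` are illustrative and do not appear in the notebook.

```python
# Same mapping as the notebook, stored in a named dict so it can be inverted
label_to_id = {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}
id_to_label = {idx: name for name, idx in label_to_id.items()}

# Hypothetical model output, for illustration only
predicted_ids = [1, 4, 2]
predicted_names = [id_to_label[i] for i in predicted_ids]
print(predicted_names)
```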

In [18]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 6, 5, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 2, 0, ..., 0, 0, 0]])
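Each row of `X` is one document and each column is one vocabulary term; the cell holds how often that term occurs in that document. The pure-Python sketch below builds the same kind of matrix for three toy documents. It only illustrates the idea and is not how `CountVectorizer` is implemented: the real class applies its own tokenization, and `max_features=2500` keeps only the 2500 most frequent terms across the corpus.

```python
from collections import Counter

# Toy documents (hypothetical), already cleaned and space-separated
docs = ["profit rose profit", "team won match", "profit match"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, in alphabetical order
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary term
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```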

## **3. Model Building**

In [19]:
# Train Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

### **Model 1: Multinomial NB**

In [20]:
# Train the model using a Multinomial Naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB
NB_model = MultinomialNB().fit(X_train, y_train)
y_pred=NB_model.predict(X_test)

from sklearn.metrics import confusion_matrix,accuracy_score
print('Confusion Matrix\n')
print(confusion_matrix(y_test,y_pred))
print('\nAccuracy is:',accuracy_score(y_test,y_pred))

Confusion Matrix

[[106   1   1   0   0]
 [  0 103   4   3   1]
 [  1   1  84   0   0]
 [  0   2   0  64   0]
 [  0   0   2   1  71]]

Accuracy is: 0.9617977528089887


### **Model 2: Support Vector Machine**

In [21]:
from sklearn.svm import SVC
sm = SVC()
sm.fit(X_train, y_train)
y_pred = sm.predict(X_test)
print('Confusion Matrix\n')
print(confusion_matrix(y_test,y_pred))
print('\nAccuracy is:',accuracy_score(y_test,y_pred))

Confusion Matrix

[[107   1   0   0   0]
 [  0  99   4   7   1]
 [  0   1  84   1   0]
 [  1   3   0  62   0]
 [  0   0   0   2  72]]

Accuracy is: 0.952808988764045


### **Model 3: Decision Tree**

In [22]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print('Confusion Matrix\n')
print(confusion_matrix(y_test,y_pred))
print('\nAccuracy is:',accuracy_score(y_test,y_pred))

Confusion Matrix

[[103   2   3   0   0]
 [  2  93   7   7   2]
 [  4   6  74   0   2]
 [  4   3   0  57   2]
 [  6   2   2   3  61]]

Accuracy is: 0.8719101123595505


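Multinomial NB is the best of the three models: it has the highest test accuracy (0.962, versus 0.953 for the SVM and 0.872 for the decision tree) and the fewest misclassifications (17 of the 445 test documents, versus 21 and 57).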
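The accuracy and misclassification counts can be recomputed directly from the three confusion matrices printed above: correct predictions lie on the diagonal, and everything off the diagonal is an error. The helper `summarize` below is an illustrative name, not part of the notebook.

```python
# Confusion matrices copied from the outputs above (rows: true class, columns: predicted)
matrices = {
    'Multinomial NB': [[106, 1, 1, 0, 0], [0, 103, 4, 3, 1], [1, 1, 84, 0, 0],
                       [0, 2, 0, 64, 0], [0, 0, 2, 1, 71]],
    'SVM': [[107, 1, 0, 0, 0], [0, 99, 4, 7, 1], [0, 1, 84, 1, 0],
            [1, 3, 0, 62, 0], [0, 0, 0, 2, 72]],
    'Decision Tree': [[103, 2, 3, 0, 0], [2, 93, 7, 7, 2], [4, 6, 74, 0, 2],
                      [4, 3, 0, 57, 2], [6, 2, 2, 3, 61]],
}

def summarize(cm):
    """Return (accuracy, number of misclassified samples) for a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total, total - correct

results = {name: summarize(cm) for name, cm in matrices.items()}
for name, (accuracy, errors) in results.items():
    print(f'{name}: accuracy={accuracy:.4f}, misclassified={errors}')
```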

### **Create Pickle**

In [23]:
import pickle
# Save the best model (Multinomial NB); the with-block closes the file automatically
with open("classifier.pkl", "wb") as pickle_out:
    pickle.dump(NB_model, pickle_out)
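The saved vectorizer and classifier are meant to be used together at prediction time: new text has to be transformed with the already fitted vectorizer (`transform`, not `fit_transform`) before it is passed to the model. The sketch below shows that flow on a tiny made-up corpus and round-trips the objects through `pickle` in memory so that it is self-contained. In practice you would instead open `vectorizer.pkl` and `classifier.pkl` and apply the same cleaning steps to the new text first. All `toy_*` and `loaded_*` names are assumptions for illustration.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the notebook's corpus and labels (hypothetical data)
toy_corpus = ["team win match goal", "market profit share bank",
              "team goal player score", "bank profit market rise"]
toy_labels = [1, 2, 1, 2]  # 1 = sport, 2 = business, as in the notebook's mapping

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_corpus), toy_labels)

# Round-trip through pickle, standing in for reading the two .pkl files from disk
loaded_cv = pickle.loads(pickle.dumps(toy_cv))
loaded_model = pickle.loads(pickle.dumps(toy_model))

# Reuse the saved vocabulary: transform only, never refit on new text
new_doc = ["player score goal"]
prediction = loaded_model.predict(loaded_cv.transform(new_doc))
print(prediction)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.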