The plan is to install it for a logistics organisation that manages numerous apps by separate teams. Therefore, when a user complains about a delay or one of the applications, it can be difficult for the service desk to assign the incident to the right team. After we automate BOT, whenever the user inputs a description of their issue, our model will learn from the past and assign the ticket to the appropriate team.

As per the company's privacy concerns, we are constructing a text classification algorithm using the bbc news data that is currently available. In this scenario, the news are being classified using a similar tactic to what we want to do in practise. We can construct the model once we get the necessary data collection.

### Prototype

In [15]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import nltk

In [16]:
train=pd.read_csv(r'D:\BBC News Train.csv\BBC News Train.csv')
test=pd.read_csv(r'D:\BBC News Test.csv\BBC News Test.csv')
data=pd.concat([train,test],axis=0,ignore_index=True)

In [17]:
data.head()

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business
2,1101,bbc poll indicates economic gloom citizens in ...,business
3,1976,lifestyle governs mobile choice faster bett...,tech
4,917,enron bosses in $168m payout eighteen former e...,business


In [18]:
data.nunique()

ArticleId    2225
Text         2126
Category        5
dtype: int64

In [19]:
data.shape


(2225, 3)

In [20]:
data.duplicated().sum()

0

In [21]:
data.isna().sum()

ArticleId      0
Text           0
Category     735
dtype: int64

data is moderately well balanced 

In [22]:
data.drop('ArticleId',axis=1,inplace=True)

In [23]:
data['Text']=data['Text'].str.lower()

In [24]:
import re
def remove_html(Text):
    pattern=re.compile('<.#?>')
    return pattern.sub(r'',Text)
data['Text']=data['Text'].apply(remove_html)

In [25]:
import string 
exclude=string.punctuation
def remove_punct(text):
    for i in exclude:
        text=text.replace(i,'')
    return text
data['Text']=data['Text'].apply(remove_punct)

In [26]:
import re
def remove_url(text):
    return re.sub(r'https?:\S*''','',text)
data['Text']=data['Text'].apply(remove_url)

In [27]:
data

Unnamed: 0,Text,Category
0,worldcom exboss launches defence lawyers defen...,business
1,german business confidence slides german busin...,business
2,bbc poll indicates economic gloom citizens in ...,business
3,lifestyle governs mobile choice faster bett...,tech
4,enron bosses in 168m payout eighteen former en...,business
...,...,...
2220,eu to probe alitalia state aid the european ...,
2221,u2 to play at grammy awards show irish rock ba...,
2222,sport betting rules in spotlight a group of mp...,
2223,alfa romeos to get gm engines fiat is to sto...,


In [28]:
def remove_num(text):
    pattern=r'[^a-zA-z.,!?/:;\"\'\s]'
    return re.sub(pattern,'',text)
data['Text']=data['Text'].apply(remove_num)

In [29]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [30]:
data['text_token']=data['Text'].apply(lambda x:nltk.word_tokenize(x))

In [31]:
def stopword(text):
    y=[]
    for i in text:
        if i not in stopwords.words('english'):
            y.append(i)
    return " ".join(y)
data['text_token']=data['text_token'].apply(stopword)

In [32]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
data['text_token']=data['text_token'].apply(lambda x: lemmatizer.lemmatize(x))

In [33]:
data

Unnamed: 0,Text,Category,text_token
0,worldcom exboss launches defence lawyers defen...,business,worldcom exboss launches defence lawyers defen...
1,german business confidence slides german busin...,business,german business confidence slides german busin...
2,bbc poll indicates economic gloom citizens in ...,business,bbc poll indicates economic gloom citizens maj...
3,lifestyle governs mobile choice faster bett...,tech,lifestyle governs mobile choice faster better ...
4,enron bosses in m payout eighteen former enron...,business,enron bosses payout eighteen former enron dire...
...,...,...,...
2220,eu to probe alitalia state aid the european ...,,eu probe alitalia state aid european commissio...
2221,u to play at grammy awards show irish rock ban...,,u play grammy awards show irish rock band u pl...
2222,sport betting rules in spotlight a group of mp...,,sport betting rules spotlight group mps peers ...
2223,alfa romeos to get gm engines fiat is to sto...,,alfa romeos get gm engines fiat stop making si...


### text vectorization 

Count vectorization

In [34]:
'''from sklearn.feature_extraction.text import CountVectorizer
Countvec=CountVectorizer(min_df=2,max_df=5)
x_counts=Countvec.fit_transform(data['text_token'])
print(x_counts.shape)
print(Countvec.get_feature_names()[15:30])'''

"from sklearn.feature_extraction.text import CountVectorizer\nCountvec=CountVectorizer(min_df=2,max_df=5)\nx_counts=Countvec.fit_transform(data['text_token'])\nprint(x_counts.shape)\nprint(Countvec.get_feature_names()[15:30])"

TF-IDF Method

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(ngram_range=(2,2),min_df=2,max_df=5)
x_tf=tfidf.fit_transform(data['text_token'])
print(x_tf.shape)
print(tfidf.idf_)
print(tfidf.get_feature_names()[15:30])

(2225, 60149)
[7.60934924 7.60934924 7.60934924 ... 7.60934924 7.60934924 7.60934924]
['abc network', 'abc television', 'abdellatif kechiche', 'ability browse', 'ability collect', 'ability control', 'ability ensure', 'ability everyone', 'ability games', 'ability handle', 'ability helping', 'ability hijack', 'ability influence', 'ability joked', 'ability listen']


### Model creation

In [36]:
train.shape

(1490, 3)

In [37]:
train=data.iloc[:1490,:]
test=data.iloc[1490:,:]

In [38]:
X = tfidf.transform(train.text_token).toarray()

In [39]:
y = train['Category'].map({'business':0, 'entertainment':1, 'politics':2,'sport':3,'tech':4})

In [40]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [41]:
'''from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix
lr=LogisticRegression()
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
print("accuracy score", accuracy_score(y_pred,y_test))
print('precision {} -------- recall{} '.format(precision_score(y_pred,y_test,average='weighted'),recall_score(y_test,y_pred,average='weighted')))'''

'from sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix\nlr=LogisticRegression()\nlr.fit(x_train,y_train)\ny_pred=lr.predict(x_test)\nprint("accuracy score", accuracy_score(y_pred,y_test))\nprint(\'precision {} -------- recall{} \'.format(precision_score(y_pred,y_test,average=\'weighted\'),recall_score(y_test,y_pred,average=\'weighted\')))'

In [42]:
'''from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(x_train,y_train)
y_pred=rf.predict(x_test)
print("accuracy score", accuracy_score(y_pred,y_test))
print('precision {} -------- recall{} '.format(precision_score(y_pred,y_test,average='weighted'),recall_score(y_test,y_pred,average='weighted')))'''

'from sklearn.ensemble import RandomForestClassifier\nrf=RandomForestClassifier()\nrf.fit(x_train,y_train)\ny_pred=rf.predict(x_test)\nprint("accuracy score", accuracy_score(y_pred,y_test))\nprint(\'precision {} -------- recall{} \'.format(precision_score(y_pred,y_test,average=\'weighted\'),recall_score(y_test,y_pred,average=\'weighted\')))'

In [43]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix
mb=MultinomialNB()
mb.fit(x_train,y_train)
y_pred=mb.predict(x_test)
print("accuracy score", accuracy_score(y_pred,y_test))
print('precision {} -------- recall{} '.format(precision_score(y_pred,y_test,average='weighted'),recall_score(y_test,y_pred,average='weighted')))

accuracy score 0.7919463087248322
precision 0.8752032409011218 -------- recall0.7919463087248322 


In [44]:
'''from sklearn import svm
sv=svm.SVC()
sv.fit(x_train,y_train)
y_pred=sv.predict(x_test)
print("accuracy score", accuracy_score(y_pred,y_test))
print('precision {} -------- recall{} '.format(precision_score(y_pred,y_test,average='weighted'),recall_score(y_test,y_pred,average='weighted')))'''

'from sklearn import svm\nsv=svm.SVC()\nsv.fit(x_train,y_train)\ny_pred=sv.predict(x_test)\nprint("accuracy score", accuracy_score(y_pred,y_test))\nprint(\'precision {} -------- recall{} \'.format(precision_score(y_pred,y_test,average=\'weighted\'),recall_score(y_test,y_pred,average=\'weighted\')))'

In [45]:
#test=test.drop('Category',axis=1)
vect = tfidf.transform(test['text_token'])
test['Result']=mb.predict(vect)
test['Results']=test['Result'].map({0:'business', 1:'entertainment', 2: 'politics',3: 'sport',4: 'tech'})

In [46]:
test.loc[2220][0]

'eu to probe alitalia  state aid  the european commission has officially launched an indepth investigation into whether italian airline alitalia is receiving illegal state aid  commission officials are to look at rome s provision of a m euro m m loan to the carrier both the italian government and alitalia have repeatedly denied that the money  part of a vital restructuring plan  is state aid the investigation could take up to  months however  transport commissioner jacques barrot said he wanted it to be carried out as swiftly as possible  the italian authorities have presented a serious industrial plan   said mr barot  we now have to verify certain aspects to confirm that this plan contains no state aid i would like our analysis to be completed swiftly   the matter of possible state aid was brought to the commission s attention by eight of alitalia s rivals  including germany s lufthansa  british airways and spain s iberia while alitalia needs to restructure to bring itself back to pro

In [49]:
name=input("Your good name pls:")
print('Hi',name, ",what's your today's news for us")
news=[input()]
vect = tfidf.transform(news)
final=mb.predict(vect)
print('/n/n/n')
for i in final:
    if i==0:
        print("Thanks for the details, your news will lies under Business category")
    elif i==1:
        print("Thanks for the details, your news will lies under Entertainment category")
    elif i==2:
        print("Thanks for the details, your news will lies under Politics category")
    elif i==3:
        print("Thanks for the details, your news will lies under Sports category")
    else:
        print("Thanks for the details, your news will lies under Tech category")

Your good name pls:shahal
Hi shahal ,what's your today's news for us
'eu to probe alitalia  state aid  the european commission has officially launched an indepth investigation into whether italian airline alitalia is receiving illegal state aid  commission officials are to look at rome s provision of a m euro m m loan to the carrier both the italian government and alitalia have repeatedly denied that the money  part of a vital restructuring plan  is state aid the investigation could take up to  months however  transport commissioner jacques barrot said he wanted it to be carried out as swiftly as possible  the italian authorities have presented a serious industrial plan   said mr barot  we now have to verify certain aspects to confirm that this plan contains no state aid i would like our analysis to be completed swiftly   the matter of possible state aid was brought to the commission s attention by eight of alitalia s rivals  including germany s lufthansa  british airways and spain s i