# BBC News Article classification

**Business Problem :** Create a classification model that assigns each article to a category.

**Dataset location :**[Kaggle](https://www.kaggle.com/c/learn-ai-bbc/data)

**Description from Kaggle :**

Text documents are one of the richest sources of data for businesses.

We’ll use a public dataset from the BBC comprised of 2225 articles, each labeled under one of 5 categories: business, entertainment, politics, sport or tech.

The dataset is broken into 1490 records for training and 735 for testing. The goal will be to build a system that can accurately classify previously unseen news articles into the right category.

**Dataset Description:**

- ArticleId - unique id given to the record
- Article - text of the header and article
- Category - category of the article (tech, business, sport, entertainment, politics)

Let's get started. As this is a text classification (NLP) problem, we are going to solve it with the deep learning library Keras.

In [48]:
# Import libraries
import pandas as pd
import numpy as np
import re
import collections
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras import models
from keras import layers
import keras
from keras.initializers import Constant
from keras.layers import LSTM
import nltk
#nltk.download()
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import pandas_profiling as pp

### Load dataset

In [49]:
data = pd.read_csv('bbc-text.csv')
data.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


## Exploratory Data Analysis

1. The pandas-profiling library makes exploratory data analysis much easier.
2. A single profile report provides most of the statistics needed to understand the data.
3. From the report we learn that the dataset has no missing values, and we get the distribution and counts of the target variable.
4. The `text` column has 2126 distinct values across 2225 rows, so 99 rows are duplicates; let's remove them. Note that the `keep=False` option used below drops every copy of a duplicated article, not only the repeats.
5. We take a copy of the dataset and work on that copy.

In [50]:
pp.ProfileReport(data)

Dataset info:

Number of variables,2
Number of observations,2225
Total Missing (%),0.0%
Total size in memory,34.8 KiB
Average record size in memory,16.0 B

Variable types:

Numeric,0
Categorical,2
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

Variable `category`:

Distinct count,5
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

Value,Count,Frequency (%)
sport,511,23.0%
business,510,22.9%
politics,417,18.7%
tech,401,18.0%
entertainment,386,17.3%

Variable `text`:

Distinct count,2126
Unique (%),95.6%
Missing (%),0.0%
Missing (n),0
Most frequent values of `text` (article text truncated):

Value,Count,Frequency (%)
prince crowned top music earner prince earned more than any other pop star in 2004...,2,0.1%
sir paul rocks super bowl crowds sir paul mccartney wowed fans with a live mini-concert...,2,0.1%
stars pay tribute to actor davis hollywood stars including spike lee burt reynolds...,2,0.1%
commodore finds new lease of life the once-famous commodore computer brand could be resurrected...,2,0.1%
ocean s twelve raids box office ocean s twelve the crime caper sequel...,2,0.1%
musical treatment for capra film the classic film it s a wonderful life is to be turned into a musical...,2,0.1%
kennedy questions trust of blair lib dem leader charles kennedy has said voters...,2,0.1%
talks aim to avert pension strike talks aimed at averting a series of national strikes...,2,0.1%
s korea spending boost to economy south korea will boost state spending next year...,2,0.1%
pop band busted to take a break chart-topping pop band busted have confirmed...,2,0.1%

In the report's top-three summary, the remaining values are listed as: Other values (2123),2219


In [51]:
df = data.copy()

In [52]:
%time df.drop_duplicates(subset ="text",keep=False,inplace=True)
df.shape

Wall time: 15 ms


(2027, 2)
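
The row count falls from 2225 to 2027, a drop of 198 rows, although the profile reported only 99 duplicates. A small sketch on a toy frame (the data below is made up for illustration) suggests why: `keep=False` discards every copy of a duplicated text, while `keep="first"` would retain one copy of each.

```python
import pandas as pd

# Toy data (hypothetical): article "a" appears twice, "b" and "c" once.
toy = pd.DataFrame({
    "category": ["tech", "tech", "sport", "business"],
    "text": ["a", "a", "b", "c"],
})

# keep=False drops *all* rows whose text is duplicated.
dropped_all = toy.drop_duplicates(subset="text", keep=False)

# keep="first" keeps the first occurrence of each duplicated text.
kept_first = toy.drop_duplicates(subset="text", keep="first")

print(len(dropped_all))  # 2
print(len(kept_first))   # 3
```

Whether to keep one copy of each duplicated article is a modelling choice; the rest of this notebook continues with the 2027-row frame produced above.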

In [53]:
df = df.reset_index(drop=True)

In [54]:
df['category'].value_counts()

sport            497
business         496
politics         389
entertainment    352
tech             293
Name: category, dtype: int64

Let's look at the first article to understand what cleaning it needs before the next steps.

In [55]:
df['text'][0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

## Removing stopwords

In [56]:
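## Text cleaning
## Remove stopwords
# Build the stopword set once; set membership tests are faster than list lookups.
STOPWORDS = set(stopwords.words('english'))

def stopword(data):
    # Split on whitespace and keep only the words that are not English stopwords.
    words = data.split()
    clean_data = [word for word in words if word not in STOPWORDS]
    return " ".join(clean_data)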

In [57]:
df.text = df.text.apply(stopword)

In [58]:
df.text[0]

'tv future hands viewers home theatre systems plasma high-definition tvs digital video recorders moving living room way people watch tv radically different five years time. according expert panel gathered annual consumer electronics show las vegas discuss new technologies impact one favourite pastimes. us leading trend programmes content delivered viewers via home networks cable satellite telecoms companies broadband service providers front rooms portable devices. one talked-about technologies ces digital personal video recorders (dvr pvr). set-top boxes like us tivo uk sky+ system allow people record store play pause forward wind tv programmes want. essentially technology allows much personalised tv. also built-in high-definition tv sets big business japan us slower take europe lack high-definition programming. people forward wind adverts also forget abiding network channel schedules putting together a-la-carte entertainment. us networks cable satellite companies worried means terms a

## Lemmatization with TextBlob to reduce words to their root form

In [59]:
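### Lemmatization
from textblob import Word

def lem_apply(data):
    # Lemmatize each whitespace-separated word; TextBlob treats words as nouns
    # by default, which is why "us" becomes "u" in the output below.
    words = data.split()
    new_data = [Word(word).lemmatize() for word in words]
    return " ".join(new_data)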

In [60]:
df.text = df.text.apply(lem_apply)

In [61]:
df.text[0]

'tv future hand viewer home theatre system plasma high-definition tv digital video recorder moving living room way people watch tv radically different five year time. according expert panel gathered annual consumer electronics show la vega discus new technology impact one favourite pastimes. u leading trend programme content delivered viewer via home network cable satellite telecom company broadband service provider front room portable devices. one talked-about technology ce digital personal video recorder (dvr pvr). set-top box like u tivo uk sky+ system allow people record store play pause forward wind tv programme want. essentially technology allows much personalised tv. also built-in high-definition tv set big business japan u slower take europe lack high-definition programming. people forward wind advert also forget abiding network channel schedule putting together a-la-carte entertainment. u network cable satellite company worried mean term advertising revenue well brand identity

## Splitting the dataset for training and testing

In [62]:
X_train, X_test, Y_train, Y_test = train_test_split(df.text, df.category, test_size=0.2, random_state=37)
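
As a quick sanity check on the split sizes: with 2027 rows and `test_size=0.2`, scikit-learn rounds the test share up and gives the remainder to training. The arithmetic below mirrors that rule as we understand it; it is only a sketch and does not call the library.

```python
import math

n_samples = 2027          # rows left after removing duplicates
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test share is rounded up
n_train = n_samples - n_test                # the rest goes to training

print(n_train, n_test)
```

The 1621 training rows match the `count` reported by `describe()` further down.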

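## Tokenizing and filtering with Keras

1. Of all the words in the training text, we limit the vocabulary to the most frequent ones (num_words=10000).
2. The Keras Tokenizer API splits each article into words and also cleans the data: it lowercases the text and strips the punctuation and digits listed in filters.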

In [63]:
size_words=10000
ks_tok = Tokenizer(num_words=size_words,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n0123456789',lower=True,split=" ")
ks_tok.fit_on_texts(X_train)

## Converting the words into sequences of numbers

In [64]:
X_train_seq = ks_tok.texts_to_sequences(X_train) # convert each article into a list of word indices
X_test_seq = ks_tok.texts_to_sequences(X_test)
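To make the two steps above concrete, here is a minimal pure-Python sketch of what fitting a tokenizer and converting text to sequences amounts to. It is an illustration only, not the Keras implementation; the helper names (`split_words`, `build_word_index`, `texts_to_sequences`) and the toy sentences are assumptions made for this example.

```python
from collections import Counter

# Characters treated as separators; same string as the `filters` argument above
# (punctuation, tab, newline, and digits).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n0123456789'


def split_words(text):
    """Lowercase the text, turn filtered characters into spaces, and split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in split_words(text))
    # Sort by descending frequency; ties keep first-seen order (stable sort).
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their rank; drop unseen words and ranks >= num_words."""
    return [
        [word_index[w] for w in split_words(text)
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


# Toy data (hypothetical): the index is built from training text only.
train = ["tv future tv viewer", "tv viewer market"]
word_index = build_word_index(train)
sequences = texts_to_sequences(["tv market unseen"], word_index, num_words=4)
print(word_index)
print(sequences)
```

In this toy run, "market" has rank 4 and is dropped because only ranks below num_words are kept, and "unseen" is dropped because it never appeared in the training text. The same applies to X_test above: words that were not seen in X_train simply disappear from the test sequences.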

In [65]:
max_count = X_train.apply(lambda x: len(x.split(' ')))
max_count.describe()

count    1621.000000
mean      225.090685
std       137.161430
min        49.000000
25%       144.000000
50%       196.000000
75%       275.000000
max      2278.000000
Name: text, dtype: float64
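The summary above shows a long tail: three quarters of the articles have at most 275 words, while the longest has 2278. As a possible alternative to padding everything to the maximum, one could cap the length at a high percentile. The sketch below uses a nearest-rank percentile in pure Python; the function name and the toy lengths are assumptions for illustration, and this is not what the notebook does.

```python
def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point rounding issues.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical word counts: 99 typical articles and one very long outlier.
lengths = [200] * 99 + [2278]
longest = max(lengths)
p95 = nearest_rank_percentile(lengths, 95)
print(longest, p95)
```

With a cap like this, longer articles would be truncated, so it trades a little information for much shorter inputs.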

## Padding
Zeros are added at the start of each sequence so that every sequence has the same length, equal to the length of the longest article (2278 words).

In [66]:
X_train_seq_trunc = pad_sequences(X_train_seq, maxlen=2278)
X_test_seq_trunc = pad_sequences(X_test_seq, maxlen=2278)
X_test_seq_trunc[0]

array([   0,    0,    0, ..., 1545,  424,  137])
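The leading zeros in the output above come from the default behaviour of pad_sequences, which pads at the front and, for sequences longer than maxlen, keeps the last maxlen items. A minimal pure-Python sketch of that behaviour follows; `pad_pre` is a hypothetical helper written for this example, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


short = pad_pre([5, 8, 2], maxlen=6)
long = pad_pre([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)
print(short)
print(long)
```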

## Encoding the target feature as numbers

In [67]:
le = LabelEncoder()
y_train_le = le.fit_transform(Y_train)
y_test_le = le.transform(Y_test)
y_train_oh = to_categorical(y_train_le)
y_test_oh = to_categorical(y_test_le)
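The cell above does two things: it maps each category name to an integer, then turns that integer into a one-hot vector. The following pure-Python sketch shows the idea on a few made-up labels; `fit_label_index` and `one_hot` are hypothetical helpers, and the only library behaviour assumed is that LabelEncoder numbers the classes in sorted order.

```python
def fit_label_index(labels):
    """Assign integers to the distinct labels in sorted order."""
    classes = sorted(set(labels))
    return {label: i for i, label in enumerate(classes)}


def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Hypothetical sample of labels covering the five categories.
labels = ["tech", "business", "sport", "tech", "politics", "entertainment"]
label_index = fit_label_index(labels)
encoded = [one_hot(label_index[name], len(label_index)) for name in labels]
print(label_index)
print(encoded[0])
```

Fitting the mapping on the training labels only and reusing it for the test labels, as the notebook does with fit_transform and transform, keeps the integer assigned to each category consistent across both sets.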

## Neural Network with Keras
1. We use the Keras functional API to build the model.
2. First we add an input layer that accepts sequences of length 2278, the padded length.
3. Then we add an embedding layer sized for the 10000-word vocabulary limit chosen earlier; each word is mapped to an 8-dimensional vector.
4. We apply dropout and then flatten the embeddings into a single vector per article.
5. The output layer is a dense layer with 5 units, one per category, and uses softmax as its activation function.
6. Finally we compile the model; the overall summary of the model is shown below.

In [45]:
deep_inputs = layers.Input(shape=(2278,))
embedding= layers.Embedding(10000 , 8 ,input_length=2278 )(deep_inputs)
embedding = layers.Dropout(0.5)(embedding)
#embedding = layers.Dense(100, activation='sigmoid')(embedding)
embedding = layers.Flatten()(embedding)
embedding_out = layers.Dense(5, activation='softmax')(embedding)
deep_model = keras.Model(inputs=deep_inputs,outputs=embedding_out)
deep_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
deep_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 2278)              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 2278, 8)           80000     
_________________________________________________________________
dropout_2 (Dropout)          (None, 2278, 8)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 18224)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 91125     
Total params: 171,125
Trainable params: 171,125
Non-trainable params: 0
_________________________________________________________________
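The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the layer sizes used in the cell above; the variable names are chosen for this example.

```python
vocab_size, embed_dim, seq_len, num_classes = 10000, 8, 2278, 5

# Embedding: one 8-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Flatten: 2278 positions x 8 dimensions, no trainable parameters.
flatten_units = seq_len * embed_dim

# Dense: one weight per (input unit, class) pair plus one bias per class.
dense_params = flatten_units * num_classes + num_classes

total_params = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total_params)
```

More than half of the parameters sit in the final dense layer, because padding every article to 2278 positions makes the flattened vector very wide.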


## Training the model on the training data
The number of epochs and the batch size can be changed as required; it is also worth checking model performance for different values of both.

In [46]:
deep_model.fit(X_train_seq_trunc,y_train_oh,epochs=50,batch_size=512)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x2cb0b39fef0>

## Model evaluation on training data

In [47]:
result=deep_model.evaluate(X_train_seq_trunc,y_train_oh)
result



[0.031847974345235085, 1.0]

## Model evaluation on testing data

In [48]:
result=deep_model.evaluate(X_test_seq_trunc,y_test_oh)
result



[0.13823234855101027, 0.9704433497536946]
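The evaluate call returns the loss followed by the accuracy. Accuracy here means the share of articles whose highest-probability class matches the true class. The sketch below works this out in pure Python and also counts the errors per class; the probabilities are made up for illustration (3 classes instead of 5 to keep it short), and the helper names are assumptions.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(y_true, y_prob):
    """Fraction of rows where the predicted class equals the true class."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


def confusion_counts(y_true, y_prob, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_prob):
        matrix[argmax(t)][argmax(p)] += 1
    return matrix


# Hypothetical one-hot targets and softmax outputs.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]

acc = accuracy(y_true, y_prob)
matrix = confusion_counts(y_true, y_prob, num_classes=3)
print(acc)
print(matrix)
```

A per-class breakdown like this shows which categories get confused with each other, which a single accuracy figure hides.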

## Model building using a CNN with Keras

In [49]:

deep_inputs = layers.Input(shape=(2278,))
embedding= layers.Embedding(10000 , 8 ,input_length=2278 )(deep_inputs)
embedding = layers.Conv1D(256, 5, activation='relu')(embedding)
embedding = layers.GlobalMaxPooling1D()(embedding)
#embedding = layers.Dense(128, activation='relu')(embedding)
#embedding = layers.Dropout(0.5)(embedding)
#embedding = layers.Flatten()(embedding)
embedding_out = layers.Dense(5, activation='softmax')(embedding)
deep_model_CNN = keras.Model(inputs=deep_inputs,outputs=embedding_out)
deep_model_CNN.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
deep_model_CNN.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 2278)              0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 2278, 8)           80000     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 2274, 256)         10496     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 1285      
Total params: 91,781
Trainable params: 91,781
Non-trainable params: 0
_________________________________________________________________
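The shapes and counts in this summary follow from the convolution settings. Assuming the Conv1D defaults of stride 1 and no padding, a kernel of width 5 shortens the sequence by 4 positions, and each of the 256 filters has 5 x 8 weights plus a bias:

```python
vocab_size, embed_dim, seq_len, num_classes = 10000, 8, 2278, 5
filters, kernel_size = 256, 5

# Output length for stride 1 without padding.
conv_output_len = seq_len - kernel_size + 1

# Each filter spans kernel_size positions across all embedding dimensions.
conv_params = kernel_size * embed_dim * filters + filters

# Global max pooling keeps one value per filter, so the dense layer sees 256 inputs.
dense_params = filters * num_classes + num_classes

total_params = vocab_size * embed_dim + conv_params + dense_params
print(conv_output_len, conv_params, dense_params, total_params)
```

Because global max pooling reduces each filter to a single value, the dense layer here is far smaller than in the flattened model, even though the input length is the same.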


## Training the CNN model on the training data

In [50]:
deep_model_CNN.fit(X_train_seq_trunc,y_train_oh,epochs=50,batch_size=512)

Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x2cb0b376f28>

## Model evaluation on training data CNN

In [51]:
result=deep_model_CNN.evaluate(X_train_seq_trunc,y_train_oh)
result



[0.024931464101808958, 0.9987661937075879]

## Model evaluation on testing data CNN

In [52]:
result=deep_model_CNN.evaluate(X_test_seq_trunc,y_test_oh)
result



[0.1439388860475841, 0.9581280779368772]

## Model building using GRU

In [29]:
deep_inputs = layers.Input(shape=(2278,))
embedding= layers.Embedding(10000 , 8 ,input_length=2278 )(deep_inputs)
embedding = layers.GRU(100,dropout=0.5, recurrent_dropout=0.5)(embedding)
#embedding = layers.Dense(256, activation='sigmoid')(embedding)
#embedding = layers.Dropout(0.5)(embedding)
#embedding = layers.Flatten()(embedding)
embedding_out = layers.Dense(5, activation='softmax')(embedding)
deep_model_gru = keras.Model(inputs=deep_inputs,outputs=embedding_out)
deep_model_gru.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
deep_model_gru.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 2278)              0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 2278, 8)           80000     
_________________________________________________________________
gru_1 (GRU)                  (None, 100)               32700     
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 505       
Total params: 113,205
Trainable params: 113,205
Non-trainable params: 0
_________________________________________________________________
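The GRU parameter count can be reproduced with a small formula: each gate has a weight for every input dimension and every recurrent unit, plus one bias per unit. The helper below is a sketch written for this example. It assumes a single bias vector per gate, which matches the 32,700 shown above; Keras versions that use two bias vectors per GRU gate report a slightly larger number.

```python
def recurrent_params(gates, units, input_dim):
    """Parameters of a gated recurrent layer with one bias vector per gate."""
    per_gate = units * (input_dim + units) + units
    return gates * per_gate


units, embed_dim = 100, 8

gru_params = recurrent_params(gates=3, units=units, input_dim=embed_dim)
lstm_params = recurrent_params(gates=4, units=units, input_dim=embed_dim)
print(gru_params, lstm_params)
```

The same formula with four gates gives the count for an LSTM layer of the same size, which is why the LSTM model in this notebook is a little larger.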


In [30]:
deep_model_gru.fit(X_train_seq_trunc,y_train_oh,epochs=50,batch_size=512)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x2c61f558588>

## Model evaluation on testing data using GRU

In [31]:
result=deep_model_gru.evaluate(X_test_seq_trunc,y_test_oh)
result



[0.7343408161334777, 0.7685393275839559]

In [69]:
## LSTM 
deep_inputs = layers.Input(shape=(2278,))
embedding= layers.Embedding(10000 , 8 ,input_length=2278 )(deep_inputs)
embedding = layers.LSTM(100,dropout=0.5, recurrent_dropout=0.5)(embedding)
#embedding = layers.Dense(256, activation='sigmoid')(embedding)
#embedding = layers.Dropout(0.5)(embedding)
#embedding = layers.Flatten()(embedding)
embedding_out = layers.Dense(5, activation='softmax')(embedding)
deep_model_lstm = keras.Model(inputs=deep_inputs,outputs=embedding_out)
deep_model_lstm.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
deep_model_lstm.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 2278)              0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 2278, 8)           80000     
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               43600     
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 505       
Total params: 124,105
Trainable params: 124,105
Non-trainable params: 0
_________________________________________________________________


In [28]:
deep_model_lstm.fit(X_train_seq_trunc,y_train_oh,epochs=10,batch_size=512)

Epoch 1/10
...
Epoch 10/10


<keras.callbacks.History at 0x1e7e6f6b7f0>

In [70]:
deep_model_lstm.evaluate(X_test_seq_trunc,y_test_oh)



[1.6093975376025798, 0.19211822682119942]
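A loss of about 1.609 with an accuracy near 0.19 is what a 5-class model produces when it assigns roughly equal probability to every class, that is, when it has learned nothing yet. The check below computes the categorical cross-entropy of a uniform prediction; the function is a plain-Python sketch for a single example, not the Keras loss.

```python
import math

num_classes = 5
uniform_prediction = [1 / num_classes] * num_classes


def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy for one example with a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


chance_loss = categorical_crossentropy([0, 0, 1, 0, 0], uniform_prediction)
print(chance_loss)
```

The value equals ln 5, so the result above is best read as chance-level performance. It may be worth confirming that the model evaluated in this cell is the one that was actually trained.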

## Model selection

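### Model       Result 
1.   NN     =>   97.04%
2.   CNN    =>   95.81%
3.   GRU    =>   76.85%
4.   LSTM   =>   73%
5.   Bidire =>

The LSTM figure does not match the evaluation cell shown above, which reports 19.21% accuracy with a loss of about 1.609, roughly ln 5 and therefore chance level for 5 classes. The execution counts (fit in In [28], model re-created in In [69]) suggest that cell evaluated a freshly built, untrained model, so the 73% presumably comes from an earlier run.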


Other models could be tried and evaluated, but my system's RAM and processor are limited, so training them would take a long time.

So we will go with the NN model and build an API around it.
The result could be improved further with hyperparameter tuning, but the accuracy is already good, and a gain of 1 to 2% does not justify the extra code.

