**News articles classification**

News articles are one of the richest sources of data for many businesses. ABC company wants to build a web application that recommends content to its users. Whenever a new article or piece of content arrives, the company wants to classify it into one of five categories: business, entertainment, politics, sport, or tech. As an ML engineer, you are required to use a public dataset from the BBC in which each article is labelled with one of these five categories.


The goal is to build a system that can accurately classify previously unseen news articles
into the right category. The evaluation metric to use is accuracy.
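
As a minimal sketch of the metric, accuracy is the fraction of predictions that match the true labels. The helper name `accuracy` and the toy labels below are illustrative assumptions, not part of the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example (made-up labels): 3 of 4 predictions are correct.
y_true = ["sport", "tech", "business", "politics"]
y_pred = ["sport", "tech", "business", "tech"]
score = accuracy(y_true, y_pred)
print(score)
```

This should agree with `sklearn.metrics.accuracy_score` on the same inputs.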





In [103]:
import pandas as pd
from textblob import TextBlob

In [104]:
df = pd.read_csv('/content/BBC News.csv')
df.head()

   ArticleId                                                Text  Category
0       1833  worldcom ex-boss launches defence lawyers defe...  business
1        154  german business confidence slides german busin...  business
2       1101  bbc poll indicates economic gloom citizens in ...  business
3       1976     lifestyle governs mobile choice faster bett...      tech
4        917  enron bosses in $168m payout eighteen former e...  business


In [105]:
texts = df['Text']
texts

0       worldcom ex-boss launches defence lawyers defe...
1       german business confidence slides german busin...
2       bbc poll indicates economic gloom citizens in ...
3       lifestyle  governs mobile choice  faster  bett...
4       enron bosses in $168m payout eighteen former e...
                              ...                        
1485    double eviction from big brother model caprice...
1486    dj double act revamp chart show dj duo jk and ...
1487    weak dollar hits reuters revenues at media gro...
1488    apple ipod family expands market apple has exp...
1489    santy worm makes unwelcome visit thousands of ...
Name: Text, Length: 1490, dtype: object

In [106]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [107]:
df['Category'].value_counts()

sport            346
business         336
politics         274
entertainment    273
tech             261
Name: Category, dtype: int64
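
The classes are fairly balanced, which is one reason accuracy is a reasonable metric here. A quick sanity check is the score a trivial model would get by always predicting the most frequent class. The sketch below uses only the counts printed above; it is computed over the whole dataset rather than a held-out split, so treat it as a rough floor.

```python
# Class counts copied from the value_counts() output above.
counts = {"sport": 346, "business": 336, "politics": 274,
          "entertainment": 273, "tech": 261}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(total, majority_class, round(baseline_accuracy, 3))
```

Any classifier worth keeping should score well above this majority-class baseline.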

In [108]:
df[df['Category']=='sport']['Text']

6       wales silent on grand slam talk rhys williams ...
14      ireland 21-19 argentina an injury-time dropped...
15      wenger signs new deal arsenal manager arsene w...
17      hantuchova in dubai last eight daniela hantuch...
18      melzer shocks agassi in san jose second seed a...
                              ...                        
1467    charvis set to lose fitness bid flanker colin ...
1468    preview: ireland v england (sun) lansdowne roa...
1471    ferrero eyes return to top form former world n...
1473    dallaglio eyeing lions tour place former engla...
1481    liverpool pledge to keep gerrard liverpool chi...
Name: Text, Length: 346, dtype: object

In [109]:
import re
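#Define function for removing special characters
def remove_special_characters(text, remove_digits=True):
    # Keep only letters, digits and whitespace (the range must be A-Z, not A-z).
    # Note: remove_digits is accepted but not used.
    pattern=r'[^a-zA-Z0-9\s]'
    text=re.sub(pattern,'',text)
    # No punctuation is left after the step above, so the next line only turns
    # other whitespace into spaces and the later substitutions find nothing.
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)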
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"I'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\^^", "", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    return text
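
A note on character classes like the one used for stripping special characters: the range `A-z` is wider than `A-Z`, because in ASCII the characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`. The sketch below compares the two ranges on a made-up string; the variable names are illustrative.

```python
import re

typo_pattern = r"[^a-zA-z0-9\s]"    # "A-z" also spans [ \ ] ^ _ and backtick
strict_pattern = r"[^a-zA-Z0-9\s]"  # letters, digits and whitespace only

sample = "ex-boss_2004 [update] $168m"  # made-up string

kept_by_typo = re.sub(typo_pattern, "", sample)
kept_by_strict = re.sub(strict_pattern, "", sample)
print(kept_by_typo)
print(kept_by_strict)
```

With the wider range, the underscore and square brackets survive the cleanup, while the strict range removes them.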

In [110]:
df['Text'] = df['Text'].apply(remove_special_characters)

In [111]:
df['Text']

0       worldcom exboss launches defence lawyers defen...
1       german business confidence slides german busin...
2       bbc poll indicates economic gloom citizens in ...
3       lifestyle  governs mobile choice  faster  bett...
4       enron bosses in 168m payout eighteen former en...
                              ...                        
1485    double eviction from big brother model caprice...
1486    dj double act revamp chart show dj duo jk and ...
1487    weak dollar hits reuters revenues at media gro...
1488    apple ipod family expands market apple has exp...
1489    santy worm makes unwelcome visit thousands of ...
Name: Text, Length: 1490, dtype: object

In [112]:
import nltk

In [81]:
#Stemming the text
from nltk.stem import PorterStemmer
ps=PorterStemmer()  # create the stemmer once and reuse it for every row
def simple_stemmer(text):
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text

In [113]:
df['Text'] = df['Text'].apply(simple_stemmer)
df['Text']

0       worldcom exboss launch defenc lawyer defend fo...
1       german busi confid slide german busi confid fe...
2       bbc poll indic econom gloom citizen in a major...
3       lifestyl govern mobil choic faster better or f...
4       enron boss in 168m payout eighteen former enro...
                              ...                        
1485    doubl evict from big brother model capric and ...
1486    dj doubl act revamp chart show dj duo jk and j...
1487    weak dollar hit reuter revenu at media group r...
1488    appl ipod famili expand market appl ha expand ...
1489    santi worm make unwelcom visit thousand of web...
Name: Text, Length: 1490, dtype: object

In [114]:
nltk.download('stopwords')
# tokenize before stopword filtering
from nltk.tokenize.toktok import ToktokTokenizer
#Tokenization of text
tokenizer1=ToktokTokenizer()
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [115]:
#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer1.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
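
One thing to keep in mind about the order of the steps: stemming runs before stop-word removal in this notebook, so a stop word whose stem differs from its original form can slip past the filter. The stemmed output above shows this for "has", which became "ha". The sketch below illustrates the effect with a tiny hypothetical stop list instead of NLTK's full list, and with the stemmed tokens copied from the output rather than recomputed.

```python
# Tiny illustrative stop list; the notebook uses NLTK's full English list.
stop_words = {"has", "the", "of"}

raw_tokens = "market apple has".split()
# Stemmed forms as they appear in the notebook output ("market appl ha ...").
stemmed_tokens = ["market", "appl", "ha"]

def drop_stop_words(tokens, stop_words):
    """Keep tokens whose lower-cased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

filtered_raw = drop_stop_words(raw_tokens, stop_words)
filtered_stemmed = drop_stop_words(stemmed_tokens, stop_words)
print(filtered_raw)
print(filtered_stemmed)
```

Filtering the raw tokens removes "has", whereas filtering after stemming leaves "ha" in place. Removing stop words before stemming is one way to avoid this.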

In [116]:
df['Text'] = df['Text'].apply(remove_stopwords)

In [117]:
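from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
# stop_words accepts 'english', a list, or None; recent scikit-learn releases reject a frozenset such as ENGLISH_STOP_WORDS
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_features=400).fit(df.Text)
X = vect.transform(df.Text)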
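
To make the TF-IDF weighting concrete, here is a small pure-Python sketch of the formula that, to my understanding, scikit-learn applies by default: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The three-document corpus and all names are made up for illustration.

```python
import math
from collections import Counter

docs = ["win match win", "market share", "market win"]  # made-up corpus
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
# Smoothed idf (assumed to match scikit-learn's smooth_idf=True default).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Term count times idf for each vocabulary term, scaled to unit length."""
    counts = Counter(tokens)
    row = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

rows = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Each row has unit length, so the entries are floats between 0 and 1, unlike the integer counts a `CountVectorizer` produces.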

In [88]:
text_transformed = pd.DataFrame(X.toarray(),columns=vect.get_feature_names_out())

In [118]:
text_transformed

      000   10   12   20  2003  2004  2005  abl  access  accord  ...  way  websit  week  went  william  win  winner  work  world  year
0       0    0    0    0     0     1     0    0       0       0  ...    0       0     1     0        0    0       0     0      0     1
1       0    1    0    0     1     2     0    0       0       0  ...    0       0     0     0        0    0       0     0      0     1
2       1    0    0    0     0     2     2    0       0       0  ...    0       0     0     0        0    0       0     0      9     1
3       1    0    0    0     2     0     0    1       0       0  ...    5       1     0     0        0    0       0     0      0     1
4       0    1    0    0     0     0     0    0       0       0  ...    0       0     1     1        1    0       0     0      0     1
...   ...  ...  ...  ...   ...   ...   ...  ...     ...     ...  ...  ...     ...   ...   ...      ...  ...     ...   ...    ...   ...
1485    1    0    2    0     0     0     0    0       0       0  ...    0       0     0     0        0    1       2     0      0     0
1486    0    0    0    0     0     0     0    0       0       0  ...    1       0     1     0        0    1       0     1      0     1
1487    0    0    1    0     2     3     1    0       0       0  ...    0       0     0     0        0    0       0     0      0     6
1488    0    2    1    0     0     2     1    0       0       3  ...    1       1     0     0        0    0       0     0      0     1


In [119]:
df.shape

(1490, 3)

In [120]:
X = text_transformed
y = df['Category']

In [121]:
from sklearn.model_selection import train_test_split
Xtrain,Xtest, ytrain, ytest = train_test_split(X,y,test_size =0.2, random_state = 42)
Xtrain.shape, ytrain.shape

((1192, 400), (1192,))
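
The training shape follows directly from the split fraction. As a small worked check, assuming scikit-learn rounds the test share up when `test_size` is a float (which is my understanding of its behaviour):

```python
import math

n_samples = 1490      # rows in the dataset
test_size = 0.2       # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

This gives 1192 training rows, matching the shape above, and leaves 298 articles for testing.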

In [122]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(Xtrain,ytrain)
ypred = lr.predict(Xtest)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
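
The warning above means the solver stopped at its iteration limit before converging. One common remedy, sketched here on synthetic data rather than the BBC features, is to raise `max_iter` above its default of 100. The variable names and the synthetic dataset are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the notebook's feature matrix (not the BBC data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# max_iter defaults to 100; a larger budget gives the solver room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

print(clf.n_iter_)  # iterations actually used by the solver
```

If `n_iter_` stays below the limit, the solver converged within the budget. Scaling the features, as the warning suggests, is another option.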


In [123]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, ypred)  # signature is (y_true, y_pred)

0.9496644295302014

In [124]:
from sklearn.ensemble import RandomForestClassifier
rf =RandomForestClassifier()
rf.fit(Xtrain,ytrain)
ypred = rf.predict(Xtest)

In [125]:
accuracy_score(ytest, ypred)  # signature is (y_true, y_pred)

0.9731543624161074

COUNT VECTORIZATION

In [126]:
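# stop_words accepts 'english', a list, or None; recent scikit-learn releases reject a frozenset such as ENGLISH_STOP_WORDS
vect = CountVectorizer(stop_words='english', ngram_range=(1,2), max_features=400).fit(df.Text)
X = vect.transform(df.Text)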

In [127]:
text_transformed = pd.DataFrame(X.toarray(),columns=vect.get_feature_names_out())
text_transformed

      000   10   12   20  2003  2004  2005  abl  access  accord  ...  way  websit  week  went  william  win  winner  work  world  year
0       0    0    0    0     0     1     0    0       0       0  ...    0       0     1     0        0    0       0     0      0     1
1       0    1    0    0     1     2     0    0       0       0  ...    0       0     0     0        0    0       0     0      0     1
2       1    0    0    0     0     2     2    0       0       0  ...    0       0     0     0        0    0       0     0      9     1
3       1    0    0    0     2     0     0    1       0       0  ...    5       1     0     0        0    0       0     0      0     1
4       0    1    0    0     0     0     0    0       0       0  ...    0       0     1     1        1    0       0     0      0     1
...   ...  ...  ...  ...   ...   ...   ...  ...     ...     ...  ...  ...     ...   ...   ...      ...  ...     ...   ...    ...   ...
1485    1    0    2    0     0     0     0    0       0       0  ...    0       0     0     0        0    1       2     0      0     0
1486    0    0    0    0     0     0     0    0       0       0  ...    1       0     1     0        0    1       0     1      0     1
1487    0    0    1    0     2     3     1    0       0       0  ...    0       0     0     0        0    0       0     0      0     6
1488    0    2    1    0     0     2     1    0       0       3  ...    1       1     0     0        0    0       0     0      0     1


In [128]:
Xc = text_transformed
y = df['Category']
Xtrain,Xtest, ytrain, ytest = train_test_split(Xc,y,test_size =0.2, random_state = 42)
Xtrain.shape, ytrain.shape

((1192, 400), (1192,))
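
In the cells above, the vectorizer is fitted on every row before the data is split, so the vocabulary and document frequencies are influenced by the test articles. A possible alternative, sketched below on a tiny made-up corpus, is to split the raw text first and put the vectorizer inside a scikit-learn pipeline so that it only sees training data. The corpus, labels and variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for df['Text'] and df['Category'].
texts = ["team wins cup final", "striker scores late goal",
         "coach praises squad", "club signs new keeper",
         "shares fall on profit warning", "bank raises interest rates",
         "firm reports record sales", "oil prices hit markets"]
labels = ["sport"] * 4 + ["business"] * 4

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training texts only.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)
print(len(predictions))
```

The same pipeline object can then be passed to cross-validation utilities, which refit the vectorizer on each training fold.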

In [129]:
from sklearn.ensemble import RandomForestClassifier
rf1 =RandomForestClassifier()
rf1.fit(Xtrain,ytrain)
ypred = rf1.predict(Xtest)

In [130]:
accuracy_score(ytest, ypred)  # signature is (y_true, y_pred)

0.9765100671140939
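
To put the three scores in perspective, they can be converted back into counts of correctly classified test articles. This is plain arithmetic on the accuracies reported above and the 298-article test set (1490 rows minus 1192 training rows); the dictionary labels are shorthand of my own.

```python
n_test = 298  # 1490 rows minus the 1192 training rows

reported = {
    "logistic regression": 0.9496644295302014,
    "random forest": 0.9731543624161074,
    "random forest (count vectors)": 0.9765100671140939,
}

correct = {name: round(acc * n_test) for name, acc in reported.items()}
for name, n_correct in correct.items():
    print(f"{name}: {n_correct}/{n_test} correct, {n_test - n_correct} wrong")
```

The two random forest runs differ by a single article. Since `RandomForestClassifier()` is created without a `random_state`, a gap that small may well be within run-to-run variation.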