#### Since the first printed newspaper, every news item that makes it onto a page has been allotted a specific section. Although almost everything about newspapers has changed, from the ink to the type of paper used, this categorization of news has been carried over through generations and into the digital editions. Newspaper articles are not limited to a few topics or subjects; they cover a wide range of interests, from politics to sports to movies and so on. For a long time this sectioning was done manually by people, but technology can now do it with little effort. In this hackathon, Data Science and Machine Learning enthusiasts like you will use Natural Language Processing to predict, from the story text, which genre or category a piece of news falls into.
#### The sections are labelled as follows:
#### Politics: 0, Technology: 1, Entertainment: 2, Business: 3
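The label encoding above can be kept in a small lookup table so that numeric predictions can be read back as section names. This is only a sketch: `SECTION_NAMES` and `section_name` are hypothetical names, and only the label numbers come from the problem statement.

```python
# Hypothetical helper: only the number-to-section mapping comes from the problem statement.
SECTION_NAMES = {0: "Politics", 1: "Technology", 2: "Entertainment", 3: "Business"}


def section_name(label):
    """Return the section name for a numeric label (accepts ints or numpy integers)."""
    return SECTION_NAMES[int(label)]


print(section_name(3))  # Business
```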

#### Importing the required packages

In [2]:
import nltk
import pandas as pd



#### Reading the data

In [3]:
article = pd.read_excel(r"G:\R workplace\Data Files\NLP Article Prediction\Data_Train.xlsx" )

In [4]:
article.head()

Unnamed: 0,STORY,SECTION
0,But the most painful was the huge reversal in ...,3
1,How formidable is the opposition alliance amon...,0
2,Most Asian currencies were trading lower today...,3
3,"If you want to answer any question, click on ‘...",1
4,"In global markets, gold prices edged up today ...",3


In [5]:
type(article)

pandas.core.frame.DataFrame

#### Summary of the data

In [6]:
article.describe(include="all")

Unnamed: 0,STORY,SECTION
count,7628,7628.0
unique,7548,
top,This story has been published from a wire agen...,
freq,28,
mean,,1.357892
std,,0.999341
min,,0.0
25%,,1.0
50%,,1.0
75%,,2.0


#### Converting the target variable into category

In [7]:
article.SECTION = article.SECTION.astype("category")

In [8]:
article.describe(include="all")

Unnamed: 0,STORY,SECTION
count,7628,7628
unique,7548,4
top,This story has been published from a wire agen...,1
freq,28,2772


In [9]:
article.shape          # Shape of the data

(7628, 2)

#### Applying the function len to the STORY column gives the length of each story in characters.

In [10]:
article["LENGTH"] = article.STORY.apply(len)

In [11]:
article.head()

Unnamed: 0,STORY,SECTION,LENGTH
0,But the most painful was the huge reversal in ...,3,843
1,How formidable is the opposition alliance amon...,0,129
2,Most Asian currencies were trading lower today...,3,386
3,"If you want to answer any question, click on ‘...",1,587
4,"In global markets, gold prices edged up today ...",3,299


#### Importing stopwords and punctuation so they can be removed from the stories

In [12]:
from nltk.corpus import stopwords
import string

#### Creating a user-defined function to remove stopwords and punctuation from the data

In [13]:
STOP_WORDS = set(stopwords.words("english"))      # Build the stopword set once instead of once per word

def process(remove):
    """
    1. Remove the punctuation
    2. Remove the stopwords
    3. Return the list of clean words
    """
    no_punc = [char for char in remove if char not in string.punctuation]        # Dropping punctuation characters
    no_punc = "".join(no_punc)                                                   # Joining the characters back into one string

    return [word for word in no_punc.split() if word not in STOP_WORDS]          # Keeping non-stopwords (case-sensitive match)

In [14]:
article.STORY.apply(process)           # Important words from each story after removing punctuation and stopwords

0       [But, painful, huge, reversal, fee, income, un...
1       [How, formidable, opposition, alliance, among,...
2       [Most, Asian, currencies, trading, lower, toda...
3       [If, want, answer, question, click, ‘Answer’, ...
4       [In, global, markets, gold, prices, edged, tod...
5       [BEIJING, Chinese, tech, giant, Huawei, announ...
6       [Mumbai, India, Incs, external, commercial, bo...
7       [On, Wednesday, Federal, Reserve, Chairman, Je...
8       [What, give, audience, I, already, done, Yeh, ...
9       [com, Arbaaz, Khan, spoke, getting, back, Daba...
10      [“One, would, think, development, testing, pro...
11      [So, far, year, rupee, gained, 07, foreign, in...
12      [Xiaomi, however, sees, presence, Jio, rural, ...
13      [The, ad, reads, No, bells, whistles, No, Beze...
14      [On, Tuesday, Powell, said, healthy, US, econo...
15      [This, feature, help, make, display, responsiv...
16      [TikTok, popular, among, children, facing, cri...
17      [The, 
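Capitalized stopwords such as "But", "How" and "In" survive in the output above because the comparison against the stopword list is case-sensitive and the NLTK English list is lowercase. Curly quotes such as ‘Answer’ also survive because `string.punctuation` contains only ASCII characters. The sketch below shows one possible variant that lowercases each word before the comparison. `process_lower` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stopword set is an assumption standing in for the NLTK list.

```python
import string

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
SAMPLE_STOPWORDS = {"but", "the", "most", "was", "in", "is", "a"}


def process_lower(text, stop_words=SAMPLE_STOPWORDS):
    """Strip ASCII punctuation, then drop stopwords using a case-insensitive comparison."""
    no_punc = "".join(ch for ch in text if ch not in string.punctuation)
    return [word for word in no_punc.split() if word.lower() not in stop_words]


print(process_lower("But the most painful was the huge reversal in fee income."))
# ['painful', 'huge', 'reversal', 'fee', 'income']
```

Switching to this variant would change the vocabulary and every number reported later, so it is offered as an option rather than as a correction of the results shown.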

#### Importing Count Vectorizer to count the frequency of the words

In [15]:
from sklearn.feature_extraction.text import CountVectorizer   

In [16]:
count = CountVectorizer(analyzer= process).fit(article.STORY)

In [17]:
count.vocabulary_                    # Mapping from each word to its column index in the count matrix

{'But': 5752,
 'painful': 35985,
 'huge': 31157,
 'reversal': 39006,
 'fee': 28936,
 'income': 31605,
 'unheard': 43715,
 'among': 21522,
 'private': 37377,
 'sector': 39792,
 'lenders': 32996,
 'Essentially': 8153,
 'means': 33868,
 'Yes': 20557,
 'Bank': 4957,
 'took': 42919,
 'granted': 30254,
 'fees': 28951,
 'structured': 41685,
 'loan': 33249,
 'deals': 26104,
 'paid': 35980,
 'accounted': 20883,
 'upfront': 43934,
 'books': 23149,
 'As': 4422,
 'borrowers': 23189,
 'turned': 43387,
 'defaulters': 26241,
 'tied': 42773,
 'fell': 28957,
 'cracks': 25607,
 'Gill': 9100,
 'vowed': 44443,
 'shift': 40173,
 'safer': 39397,
 'accounting': 20884,
 'practice': 37054,
 'amortizing': 21529,
 'rather': 38046,
 'booking': 23144,
 'Gill’s': 9106,
 'move': 34560,
 'mend': 33970,
 'past': 36164,
 'ways': 44622,
 'nasty': 34792,
 'surprises': 42008,
 'future': 29792,
 'This': 18876,
 'good': 30151,
 'news': 34961,
 'considering': 25147,
 'investors': 32242,
 'love': 33403,
 'clean': 24414,
 'ima
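The numbers in `vocabulary_` are column indices, not word frequencies. The indices shown above are consistent with tokens being numbered in sorted order, with capitalized words before lowercase ones. A minimal pure-Python sketch of that idea follows; `build_vocabulary` is a hypothetical helper, not part of scikit-learn.

```python
def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, assigned in sorted token order."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}


docs = [["But", "painful", "huge"], ["huge", "fee", "income"]]
vocab = build_vocabulary(docs)
print(vocab)
# {'But': 0, 'fee': 1, 'huge': 2, 'income': 3, 'painful': 4}
```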

#### Creating a sparse matrix now

In [18]:
count_sparse = count.transform(article.STORY)

In [19]:
type(count_sparse)                                 # Sparse Matrix

scipy.sparse.csr.csr_matrix

In [20]:
count_sparse.shape                                 # One row per story, one column per vocabulary word

(7628, 48187)

In [21]:
count_sparse.nnz                                    # Number of non zeros

435565
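The shape and the non-zero count reported above show how sparse the matrix is. The arithmetic below uses only those three numbers; the variable names are illustrative.

```python
# Figures taken from the outputs above.
n_rows, n_cols, nnz = 7628, 48187, 435565

density = nnz / (n_rows * n_cols)   # Fraction of cells that hold a non-zero count
avg_per_story = nnz / n_rows        # Average number of distinct words per story

print(f"density: {density:.4%}")
print(f"distinct words per story: {avg_per_story:.1f}")
```

Roughly 0.12% of the cells are non-zero, which is why a sparse format is used instead of a dense array.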

#### Importing Tfidf Transformer to calculate the importance of the word

In [22]:
from sklearn.feature_extraction.text import TfidfTransformer

In [23]:
tfidf_transformer = TfidfTransformer()                     # Creating instance

In [24]:
tfidf_transformer.fit(count_sparse)       

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [25]:
tfidf_mess = tfidf_transformer.transform(count_sparse)       # calculating tf-idf on each word from sparse matrix

In [26]:
tfidf_mess                                          # Transformed a sparse matrix to a normalized tf or tf-idf representation

<7628x48187 sparse matrix of type '<class 'numpy.float64'>'
	with 435565 stored elements in Compressed Sparse Row format>

In [27]:
tfidf_mess.shape

(7628, 48187)
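With the settings printed above (`use_idf=True`, `smooth_idf=True`, `norm='l2'`), each count is multiplied by idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term, and each row is then scaled to unit length. The pure-Python sketch below reproduces that calculation on a toy count matrix. `tfidf_rows` is a hypothetical helper and works on dense lists, not on sparse matrices.

```python
import math


def tfidf_rows(counts):
    """Apply smoothed idf weighting and L2 row normalization to a dense count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term appears at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


toy_counts = [[2, 1, 0], [1, 0, 1]]   # 2 documents, 3 terms
for row in tfidf_rows(toy_counts):
    print([round(v, 3) for v in row])
```

In the second toy document both terms occur once, but the term that appears in only one document receives the larger weight.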

#### Now splitting the data into train and test

In [28]:
article.head()

Unnamed: 0,STORY,SECTION,LENGTH
0,But the most painful was the huge reversal in ...,3,843
1,How formidable is the opposition alliance amon...,0,129
2,Most Asian currencies were trading lower today...,3,386
3,"If you want to answer any question, click on ‘...",1,587
4,"In global markets, gold prices edged up today ...",3,299


In [29]:
article_section = article["SECTION"]          # Selecting the target column by name instead of by position
article_section.head()

0    3
1    0
2    3
3    1
4    3
Name: SECTION, dtype: category
Categories (4, int64): [0, 1, 2, 3]

In [30]:
article_section.shape

(7628,)

#### Train-Test data

In [31]:
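msg_x_train = tfidf_mess[0:6001]                 # First 6001 stories for training
msg_x_test = tfidf_mess[6001:7628]               # Remaining 1627 stories for testing
msg_y_train = article_section[0:6001]
msg_y_test = article_section[6001:7628]          # End index matches the 7628 rows in the data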

In [32]:
print(msg_x_train.shape)
print(msg_y_train.shape)
print(msg_x_test.shape)
print(msg_y_test.shape)

(6001, 48187)
(6001,)
(1627, 48187)
(1627,)
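The slices above split the stories in file order. If the file happens to be ordered in some way, a shuffled split may give a more representative test set. The sketch below builds shuffled row indices with the standard library only. `shuffled_split_indices` and the seed value are assumptions, and scikit-learn's `train_test_split` is the more usual tool for this.

```python
import random


def shuffled_split_indices(n_rows, n_train, seed=42):
    """Return (train_indices, test_indices) after shuffling the row numbers reproducibly."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = shuffled_split_indices(7628, 6001)
print(len(train_idx), len(test_idx))  # 6001 1627
```

The index lists could then select rows, for example `tfidf_mess[train_idx]` and `article_section.iloc[train_idx]`. The results reported in this notebook come from the sequential split, not from this variant.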


### Now creating a Model to classify the Stories

#### Model No.1 using Naive Bayes

In [33]:
from sklearn.naive_bayes import  MultinomialNB


In [34]:
naive = MultinomialNB()

In [35]:
mod1 = naive.fit(msg_x_train , msg_y_train)              # Model Building

In [36]:
mod1_pred = naive.predict(msg_x_test)                    # Prediction on test data
mod1_pred

array([3, 1, 0, ..., 1, 0, 2], dtype=int64)

#### Building a Confusion matrix to judge the model

In [37]:
from sklearn.metrics import confusion_matrix, classification_report

In [38]:
conf1 = confusion_matrix(msg_y_test , mod1_pred)         # True labels first: rows are actual sections, columns are predicted sections
conf1

array([[329,  15,   2,   2],
       [  0, 602,   0,   1],
       [ 13,  40, 349,   0],
       [  0,  36,   0, 238]], dtype=int64)

In [39]:
Accuracy = (sum(conf1.diagonal())/ conf1.sum())*100
Accuracy

93.30055316533497
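The accuracy above is the share of test stories on the diagonal of the confusion matrix, that is, stories whose predicted section equals the actual one. The same figure can be checked by hand from the diagonal entries and the test-set size; the variable names below are illustrative.

```python
# Diagonal of the Naive Bayes confusion matrix and the test-set size, from the outputs above.
correct_per_class = [329, 602, 349, 238]
total_test_stories = 1627

accuracy = 100 * sum(correct_per_class) / total_test_stories
print(round(accuracy, 2))  # 93.3
```

scikit-learn's `accuracy_score` in `sklearn.metrics` computes the same quantity as a fraction directly from the label arrays.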

In [40]:
print(classification_report(msg_y_test, mod1_pred))       # True labels first, predictions second

              precision    recall  f1-score   support

           0       0.96      0.95      0.95       348
           1       0.87      1.00      0.93       603
           2       0.99      0.87      0.93       402
           3       0.99      0.87      0.92       274

    accuracy                           0.93      1627
   macro avg       0.95      0.92      0.93      1627
weighted avg       0.94      0.93      0.93      1627
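Precision for a class is the share of stories predicted as that class that really belong to it. Recall is the share of stories that really belong to the class that were predicted as such. scikit-learn's metric functions expect the true labels as the first argument and the predictions as the second; passing them the other way round swaps the two columns. The sketch below computes both measures on a made-up example. `precision_recall` and the toy labels are hypothetical.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class from parallel lists of labels."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)   # Stories predicted as this class
    actual = sum(1 for t in y_true if t == label)      # Stories that really are this class
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 1, 2]
print(precision_recall(toy_true, toy_pred, 1))  # (0.75, 1.0)
print(precision_recall(toy_true, toy_pred, 0))  # (1.0, 0.5)
```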



#### Model No.2 using Random Forest

In [41]:
from sklearn.ensemble import RandomForestClassifier



In [42]:
forest = RandomForestClassifier()

In [43]:
mod2 = forest.fit(msg_x_train , msg_y_train)



In [44]:
mod2_pred = forest.predict(msg_x_test)

In [45]:
conf2 = confusion_matrix(msg_y_test , mod2_pred)         # True labels first: rows are actual sections, columns are predicted sections
conf2

array([[323,  10,  13,   2],
       [  6, 573,  18,   6],
       [ 10,  12, 379,   1],
       [  7,  39,  11, 217]], dtype=int64)

In [46]:
Accuracy1 = (sum(conf2.diagonal())/ conf2.sum())*100
Accuracy1

91.70251997541487

In [47]:
print(classification_report(msg_y_test, mod2_pred))       # True labels first, predictions second

              precision    recall  f1-score   support

           0       0.93      0.93      0.93       348
           1       0.90      0.95      0.93       603
           2       0.90      0.94      0.92       402
           3       0.96      0.79      0.87       274

    accuracy                           0.92      1627
   macro avg       0.92      0.90      0.91      1627
weighted avg       0.92      0.92      0.92      1627



#### The Naive Bayes model achieves higher overall accuracy (93.3%) than the Random Forest model (91.7%), and its f1-scores are at least as good for every class, so it is the better classifier here.

#### Naive Bayes was selected because it relies on Bayes' rule. The algorithm classifies each object by looking at all of its features individually: for each class, the conditional probability of every feature given that class is computed, and these probabilities are multiplied together with the class prior to give a score proportional to the posterior probability. The class with the highest score is the prediction.
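The calculation described above can be sketched in pure Python as a small multinomial Naive Bayes with add-one (Laplace) smoothing. Logarithms are summed instead of multiplying probabilities to avoid numerical underflow. `train_nb`, `predict_nb` and the toy stories are hypothetical, and the sketch works on word counts, not on the TF-IDF matrix used in the notebook.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods from tokenized documents."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, doc):
    """Return the class with the highest log prior plus summed log likelihoods."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


toy_docs = [
    ["election", "vote", "party"],
    ["phone", "launch", "camera"],
    ["vote", "minister"],
    ["phone", "app"],
]
toy_labels = [0, 1, 0, 1]   # 0 = Politics, 1 = Technology
model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, ["vote", "party", "phone"]))  # 0
```

In the toy example "vote" and "party" occur only in the politics stories, so their combined evidence outweighs the single technology word "phone".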