Since the first printed newspaper, every news item that makes it onto a page has been assigned to a specific section. Although almost everything about newspapers has changed, from the ink to the type of paper used, this categorization of news has been carried over through generations and into the digital editions. Newspaper articles are not limited to a few topics or subjects; they cover a wide range of interests, from politics to sports to movies and so on. For a long time this sectioning was done manually, but technology can now do it with little effort. In this hackathon, Data Science and Machine Learning enthusiasts like you will use Natural Language Processing to predict, from the story, which genre or category a piece of news falls into.

# FEATURES:

* STORY: A part of the main content of the article to be published as a piece of news.
* SECTION: The genre/category the STORY falls into.

Each story falls into one of four distinct sections. The sections are labelled as follows:

* Politics: 0
* Technology: 1
* Entertainment: 2
* Business: 3
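The label encoding above can be kept in a small lookup so that numeric predictions are easier to read. This is only a convenience sketch; the names `SECTION_LABELS` and `section_name` are illustrative and do not appear in the notebook.

```python
# Mapping from the numeric SECTION label to its section name,
# taken directly from the list above.
SECTION_LABELS = {
    0: "Politics",
    1: "Technology",
    2: "Entertainment",
    3: "Business",
}


def section_name(label):
    """Return the section name for a numeric label (hypothetical helper)."""
    return SECTION_LABELS[int(label)]


print(section_name(2))
```

A helper like this could be applied to a prediction array, for example `[section_name(p) for p in predictions]`.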

In [63]:
import numpy as np
import pandas as pd
import nltk
import warnings
warnings.filterwarnings("ignore")

In [64]:
dftrain = pd.read_csv('NLPTrainData.csv')

In [65]:
dftrain.head()

Unnamed: 0,STORY,SECTION
0,The roadshow and the filing of nomination pape...,0
1,These vulnerabilities could have allowed hacke...,1
2,"""People will now be able to include music in t...",1
3,Jersey is expected to have a good start at the...,2
4,Xiaomi’s unveiling also hints at how Samsung i...,1


In [66]:
dftest = pd.read_csv('NLPtestData.csv')

In [67]:
dftest.head()

Unnamed: 0,STORY
0,Privileged to have done this candid and COMPLE...
1,6) Some analysts expect volatility to remain h...
2,There is no stopping Marvel Cinematic Universe...
3,"According to Ravi Menon, analyst at Elara Secu..."
4,"A complaint against Nadiadwala, known for prod..."


In [68]:
dftrain['STORY']

0       The roadshow and the filing of nomination pape...
1       These vulnerabilities could have allowed hacke...
2       "People will now be able to include music in t...
3       Jersey is expected to have a good start at the...
4       Xiaomi’s unveiling also hints at how Samsung i...
                              ...                        
6097    Based on the video game franchise of the same ...
6098    The seven states have been neglected for decad...
6099    Shanthnu made the announcement on Twitter and ...
6100    The Rock and Jason Statham reprise their roles...
6101    But unlike past developments that never caught...
Name: STORY, Length: 6102, dtype: object

### Identifying stopwords

In [69]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [70]:
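# Count the stopwords in each story (case-sensitive match against NLTK's lowercase list)
dftrain['stopwords'] = dftrain['STORY'].apply(lambda text: len([word for word in text.split() if word in stop_words]))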
dftrain['stopwords']

0       60
1       60
2       43
3       19
4       48
        ..
6097    14
6098    45
6099    11
6100    13
6101    57
Name: stopwords, Length: 6102, dtype: int64

As the counts show, the stories contain a large number of stopwords, which should be removed before vectorization.
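One detail worth noting: the count in cell 70 compares raw tokens against the stopword list, whereas the removal step in cell 71 lowercases each token first, so a capitalized "The" is removed but not counted. The sketch below illustrates the difference on a single sentence. The function `count_stopwords` and the three-word stopword set are assumptions made for illustration; the notebook uses NLTK's full English list.

```python
def count_stopwords(text, stop_words, ignore_case=False):
    """Count whitespace-separated tokens that appear in stop_words."""
    tokens = text.split()
    if ignore_case:
        tokens = [token.lower() for token in tokens]
    return sum(1 for token in tokens if token in stop_words)


# Hypothetical, tiny stopword set used only for this example.
sample_stop_words = {"the", "and", "of"}
sample = "The roadshow and the filing of nomination papers"

print(count_stopwords(sample, sample_stop_words))                    # case-sensitive
print(count_stopwords(sample, sample_stop_words, ignore_case=True))  # case-insensitive
```

Converting the stopword list to a `set`, as done here, also makes each membership test faster than scanning a list.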

### Eliminating stopwords

In [71]:
# Drop stopwords from each story; tokens are lowercased only for the comparison
dftrain['STORY'] = dftrain['STORY'].apply(lambda text: ' '.join(word for word in text.split() if word.lower() not in stop_words))
dftrain['STORY'].head()

0    roadshow filing nomination papers also attempt...
1    vulnerabilities could allowed hackers access s...
2    "People able include music videos Facebook Ins...
3    Jersey expected good start box office, attenti...
4    Xiaomi’s unveiling also hints Samsung starting...
Name: STORY, dtype: object

### Lemmatization

In [72]:
from nltk.stem import WordNetLemmatizer
from textblob import Word
from textblob import TextBlob

In [73]:
dftrain['STORY'] = dftrain['STORY'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
dftrain['STORY'].head()

0    roadshow filing nomination paper also attempt ...
1    vulnerability could allowed hacker access sens...
2    "People able include music video Facebook Inst...
3    Jersey expected good start box office, attenti...
4    Xiaomi’s unveiling also hint Samsung starting ...
Name: STORY, dtype: object
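In this notebook, stopword removal and lemmatization are applied to `dftrain` only, while `dftest` is vectorized as-is later on. One way to keep the two sets consistent is to wrap both steps in a single function and apply it to each. The sketch below assumes a hypothetical `preprocess` helper with a pluggable lemmatizer (identity by default, so that it runs without the TextBlob corpora) and a small illustrative stopword set.

```python
def preprocess(text, stop_words, lemmatize=lambda word: word):
    """Remove stopwords (case-insensitive), then lemmatize the remaining tokens."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(lemmatize(word) for word in kept)


# Hypothetical subset of stopwords, for illustration only.
SAMPLE_STOP_WORDS = {"to", "in", "the", "of"}

train_stories = ["The filing of nomination papers"]
test_stories = ["Some analysts expect volatility to remain high in the near term"]

# The same function is applied to both splits.
train_clean = [preprocess(story, SAMPLE_STOP_WORDS) for story in train_stories]
test_clean = [preprocess(story, SAMPLE_STOP_WORDS) for story in test_stories]

print(train_clean)
print(test_clean)

# With the notebook's objects this might look like (not executed here):
#   lemma = lambda w: Word(w).lemmatize()
#   dftest['STORY'] = dftest['STORY'].apply(lambda s: preprocess(s, set(stop_words), lemma))
```

Whether this changes the scores reported later would have to be checked by re-running the notebook.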

In [74]:
TextBlob(dftest['STORY'][1]).ngrams(1)

[WordList(['6']),
 WordList(['Some']),
 WordList(['analysts']),
 WordList(['expect']),
 WordList(['volatility']),
 WordList(['to']),
 WordList(['remain']),
 WordList(['high']),
 WordList(['in']),
 WordList(['the']),
 WordList(['near']),
 WordList(['term']),
 WordList(['Sahaj']),
 WordList(['Agrawal']),
 WordList(['head']),
 WordList(['of']),
 WordList(['research']),
 WordList(['for']),
 WordList(['derivatives']),
 WordList(['at']),
 WordList(['Kotak']),
 WordList(['Securities']),
 WordList(['said']),
 WordList(['After']),
 WordList(['a']),
 WordList(['strong']),
 WordList(['rally']),
 WordList(['seen']),
 WordList(['in']),
 WordList(['the']),
 WordList(['recent']),
 WordList(['past']),
 WordList(['Nifty']),
 WordList(['currently']),
 WordList(['is']),
 WordList(['in']),
 WordList(['a']),
 WordList(['consolidation']),
 WordList(['phase']),
 WordList(['Tech']),
 WordList(['parameters']),
 WordList(['suggest']),
 WordList(['possibility']),
 WordList(['of']),
 WordList(['extended']),
 Word

### TF-IDF Vectorization

In [76]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import TweetTokenizer

In [77]:
tokenizer = TweetTokenizer()

In [78]:
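# Unigrams and bigrams, lowercased, tokenized with NLTK's TweetTokenizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, tokenizer=tokenizer.tokenize)
# Fit the vocabulary and IDF weights on the train and test stories together
full_text = list(dftrain['STORY'].values) + list(dftest['STORY'].values)
vectorizer.fit(full_text)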

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<bound method TweetTokenizer.tokenize of <nltk.tokenize.casual.TweetTokenizer object at 0x00000200F5B6DDA0>>,
                use_idf=True, vocabulary=None)

In [79]:
train_vector = vectorizer.transform(dftrain['STORY'])

In [80]:
test_vector = vectorizer.transform(dftest['STORY'])
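To make the weighting concrete, here is a minimal pure-Python sketch of the TF-IDF scheme that `TfidfVectorizer` uses with the defaults shown above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document. It is simplified on purpose: it uses whitespace tokenization and unigrams only, whereas the notebook uses `TweetTokenizer` and unigrams plus bigrams. The function name `tfidf` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


toy_docs = ["markets rally today", "film rally"]
toy_vectors = tfidf(toy_docs)
for vec in toy_vectors:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the toy output, "rally" occurs in both documents and therefore receives a lower weight than terms unique to one document.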

## Building Models

### Logistic Regression

In [81]:
X=train_vector

In [82]:
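# Use a 1-D Series as the target; a one-column DataFrame makes scikit-learn emit a DataConversionWarning
y=dftrain['SECTION']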

In [83]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=100)

In [84]:
from sklearn.linear_model import LogisticRegression

In [85]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,f1_score

In [86]:
log=LogisticRegression()

In [87]:
log.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [88]:
pred_log=log.predict(X_test)

In [89]:
print(classification_report(y_test,pred_log))

              precision    recall  f1-score   support

           0       0.99      0.89      0.94       425
           1       0.92      0.99      0.95       672
           2       0.95      1.00      0.97       455
           3       0.98      0.89      0.93       279

    accuracy                           0.95      1831
   macro avg       0.96      0.94      0.95      1831
weighted avg       0.96      0.95      0.95      1831



In [90]:
confusion_matrix(y_test,pred_log)

array([[380,  28,  14,   3],
       [  1, 664,   6,   1],
       [  1,   0, 454,   0],
       [  1,  28,   3, 247]], dtype=int64)

In [91]:
accuracy_score(y_test,pred_log)

0.9530311305297652
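The figures in the classification report can be recovered from the confusion matrix, where rows are true labels and columns are predicted labels. The following sketch recomputes accuracy and per-class precision and recall from the logistic regression matrix printed above; the helper name `per_class_metrics` is an assumption.

```python
# Confusion matrix of the logistic regression model, copied from the output above.
log_confusion = [
    [380, 28, 14, 3],
    [1, 664, 6, 1],
    [1, 0, 454, 0],
    [1, 28, 3, 247],
]


def per_class_metrics(matrix):
    """Return (accuracy, {label: (precision, recall)}) for a square confusion matrix."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    metrics = {}
    for i in range(size):
        true_positive = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: samples whose true label is i
        predicted = sum(row[i] for row in matrix)  # column sum: samples predicted as i
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        metrics[i] = (precision, recall)
    return correct / total, metrics


log_accuracy, log_metrics = per_class_metrics(log_confusion)
print(round(log_accuracy, 4))
for label, (precision, recall) in log_metrics.items():
    print(label, round(precision, 2), round(recall, 2))
```

Reading the matrix this way also shows where the errors concentrate: most misclassified Politics (0) and Business (3) stories are predicted as Technology (1).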

### Random Forest

In [93]:
from sklearn.ensemble import RandomForestClassifier

In [94]:
rf=RandomForestClassifier(n_estimators=50,max_depth=10,random_state=101,class_weight='balanced')

In [95]:
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=50, n_jobs=None, oob_score=False,
                       random_state=101, verbose=0, warm_start=False)

In [96]:
y_pred=rf.predict(X_test)

In [97]:
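# Note: scikit-learn's signature is f1_score(y_true, y_pred). With the arguments swapped as below,
# the "weighted" average uses the predicted-label counts as class weights.
print(f1_score(y_pred,y_test,average="weighted"))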

0.8740271399566795
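The value above differs slightly from the weighted F1 of 0.88 in the classification report because `f1_score` expects `(y_true, y_pred)` and the call passes them in the opposite order. Per-class F1 is unchanged by the swap, but the class weights become the predicted counts instead of the true supports. The sketch below reproduces both variants from the random forest confusion matrix printed further down; `weighted_f1` is a hypothetical helper, not a scikit-learn function.

```python
# Random forest confusion matrix (rows: true labels, columns: predicted labels),
# copied from the notebook output.
rf_confusion = [
    [375, 7, 38, 5],
    [6, 545, 89, 32],
    [5, 14, 434, 2],
    [1, 4, 24, 250],
]


def weighted_f1(matrix, weight_by="true"):
    """Weighted F1 using true supports ('true') or predicted counts ('predicted')."""
    size = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(row[j] for row in matrix) for j in range(size)]
    weights = row_sums if weight_by == "true" else col_sums
    score = 0.0
    for i in range(size):
        denominator = row_sums[i] + col_sums[i]
        f1 = 2 * matrix[i][i] / denominator if denominator else 0.0
        score += weights[i] * f1
    return score / sum(row_sums)


print(weighted_f1(rf_confusion, weight_by="predicted"))  # matches the swapped call
print(weighted_f1(rf_confusion, weight_by="true"))       # conventional weighted F1
```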


In [98]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.97      0.88      0.92       425
           1       0.96      0.81      0.88       672
           2       0.74      0.95      0.83       455
           3       0.87      0.90      0.88       279

    accuracy                           0.88      1831
   macro avg       0.88      0.89      0.88      1831
weighted avg       0.89      0.88      0.88      1831



In [99]:
print(confusion_matrix(y_test,y_pred))

[[375   7  38   5]
 [  6 545  89  32]
 [  5  14 434   2]
 [  1   4  24 250]]


In [100]:
accuracy_score(y_test,y_pred)

0.8760240305843802

### SVM (Support Vector Machine)

In [101]:
from sklearn.svm import LinearSVC

In [102]:
svm_model=LinearSVC()

In [103]:
svm_model.fit(X_train,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [104]:
svm_pred=svm_model.predict(X_test)
svm_pred

array([0, 1, 2, ..., 0, 3, 2], dtype=int64)

In [105]:
print(classification_report(y_test,svm_pred))

              precision    recall  f1-score   support

           0       0.99      0.94      0.96       425
           1       0.97      0.99      0.97       672
           2       0.95      1.00      0.97       455
           3       1.00      0.95      0.97       279

    accuracy                           0.97      1831
   macro avg       0.98      0.97      0.97      1831
weighted avg       0.97      0.97      0.97      1831



In [106]:
confusion_matrix(y_test,svm_pred)

array([[399,  14,  12,   0],
       [  2, 662,   7,   1],
       [  1,   0, 454,   0],
       [  0,  10,   4, 265]], dtype=int64)

In [107]:
accuracy_score(y_test,svm_pred)

0.9721463681048608

The SVM gives good predictions on the hold-out split. We first use it to predict the test data, then try a KNN model.

### Predicting on the Test Data with the SVM Model

In [109]:
svm_pred=svm_model.predict(test_vector)

In [124]:
svm_pred

array([2, 3, 2, ..., 1, 0, 2], dtype=int64)

In [None]:
submission = pd.DataFrame({'SVM_Predictions': svm_pred })

submission.to_csv('PredictNews_SVM.csv',index=False)

### KNN (K-Nearest Neighbors)

In [112]:
from sklearn.neighbors import KNeighborsClassifier

In [113]:
knn_model=KNeighborsClassifier()

In [114]:
knn_model.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [115]:
pred_knn=knn_model.predict(X_test)

In [116]:
print(classification_report(y_test,pred_knn))

              precision    recall  f1-score   support

           0       0.93      0.96      0.94       425
           1       0.92      0.97      0.94       672
           2       0.99      0.93      0.96       455
           3       0.95      0.89      0.92       279

    accuracy                           0.94      1831
   macro avg       0.95      0.94      0.94      1831
weighted avg       0.95      0.94      0.94      1831



In [117]:
confusion_matrix(y_test,pred_knn)

array([[407,  14,   1,   3],
       [ 12, 650,   1,   9],
       [ 11,  19, 423,   2],
       [  8,  22,   1, 248]], dtype=int64)

In [118]:
accuracy_score(y_test,pred_knn)

0.9437465865647188
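For a side-by-side view, the hold-out accuracies reported in this notebook can be collected and ranked. The numbers below are copied from the `accuracy_score` outputs above; the variable names are illustrative.

```python
# Hold-out accuracy of each model on the 30% validation split (from the outputs above).
holdout_accuracy = {
    "LogisticRegression": 0.9530311305297652,
    "RandomForestClassifier": 0.8760240305843802,
    "LinearSVC": 0.9721463681048608,
    "KNeighborsClassifier": 0.9437465865647188,
}

# Sort from best to worst accuracy.
ranked = sorted(holdout_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    print(f"{name:<24} {accuracy:.4f}")
```

These scores come from a single split with `random_state=100`, so small differences between models should be read with caution.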

### Predicting on the Test Data with the KNN Model

In [121]:
pred_knn=knn_model.predict(test_vector)
pred_knn

array([2, 3, 2, ..., 1, 0, 2], dtype=int64)

In [122]:
submission = pd.DataFrame({'KNN_Predictions': pred_knn })

submission.to_csv('PredictNews_KNN.csv',index=False)