# Data Preprocessing
Import the libraries required for preprocessing:

- pandas and numpy for data manipulation and handling
- seaborn for data visualization

In [None]:
!nvidia-smi

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Read the data stored in the Excel sheet and print the first five rows.

In [None]:
df=pd.read_excel('/content/drive/MyDrive/News_Article_Sorting/News_category.xlsx')
df.head()

In [6]:
pip install openpyxl

In [8]:
df=pd.read_excel('../input/news-category/News_category.xlsx')
df.head()

Visualising the number of datapoints in each class

The `SECTION` value represents the following categories:

0: Politics

1: Technology

2: Entertainment

3: Business
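The mapping above can be kept in a small dictionary so that plots and reports show category names instead of integers. This is a minimal sketch; `SECTION_LABELS` and `label_name` are illustrative names that do not appear in the original notebook.

```python
# Hypothetical helper names, introduced only for illustration.
SECTION_LABELS = {
    0: "Politics",
    1: "Technology",
    2: "Entertainment",
    3: "Business",
}

def label_name(section_id):
    """Return the readable name for a SECTION value, or 'Unknown' if unmapped."""
    return SECTION_LABELS.get(int(section_id), "Unknown")

print([label_name(i) for i in (0, 3, 7)])
# ['Politics', 'Business', 'Unknown']
```

Once the DataFrame is loaded, `df['SECTION'].map(SECTION_LABELS)` should give the same names as a column.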

In [9]:
df.SECTION.value_counts().plot.bar()

In [10]:
df.shape

Check for NaN values in the dataset. `df.isnull().mean()` reports the fraction of missing values in each column.

In [11]:
df.isnull().mean()

Analyze the dataset

In [12]:
df.describe()

Import the libraries needed for text preprocessing, such as the stopword list from NLTK.

In [13]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot

import nltk
import re
from nltk.corpus import stopwords

In [14]:
nltk.download('stopwords')

Stemming reduces each word to its base form, which brings uniformity to the dataset.

Here we remove all non-alphabetic characters from the news stories (the `STORY` column), as they hold no value for classification.

We also convert all letters to lowercase, since letter case carries little information for this task.

In [15]:
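# Data preprocessing: keep letters only, lowercase, drop stopwords, stem

from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()
# Build the stopword set once; calling stopwords.words() per word is very slow
stop_words=set(stopwords.words('english'))
corpus=[]
for i in range(0,len(df['STORY'])):
    news=re.sub('[^a-zA-Z]',' ',df['STORY'][i])  # replace non-letters with spaces
    news=news.lower()
    news=news.split()
    news=[ps.stem(word) for word in news if word not in stop_words]
    news=' '.join(news)
    corpus.append(news)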
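To see what the cleaning steps do to a single story, here is a small standard-library sketch. It assumes a tiny hand-written stop-word set (`SAMPLE_STOPWORDS`, illustrative only) instead of NLTK's full list, and it leaves out stemming, so its output only approximates what the cell above produces.

```python
import re

# Illustrative subset; the notebook uses NLTK's full English stopword list.
SAMPLE_STOPWORDS = {"the", "is", "a", "in", "of", "to", "and"}

def clean_text(text):
    """Keep letters only, lowercase, split, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in SAMPLE_STOPWORDS]

sample = "Apple's Q3 revenue rose 12% in 2019, the company said."
print(clean_text(sample))
# ['apple', 's', 'q', 'revenue', 'rose', 'company', 'said']
```

Removing digits and punctuation leaves single-letter fragments such as `s` and `q`. Whether to filter those out is a separate design choice.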

# **Feature Extraction**

In Natural Language Processing, feature extraction is one of the essential steps for capturing the context of the text we are dealing with. After the initial text is cleaned and normalized, we need to transform it into features that can be used for modeling. Some methods assign weights to particular words within a document before modeling. We use numerical representations of individual words because numbers are easy for a computer to process; word embeddings are one such representation.

### **One Hot Representation**
To analyze text, we need a numerical representation of each word. One-hot encoding provides one: each word is treated as a class, and in the vector for a given word the position of that word is set to 1 while all other positions are 0. This is similar to the bag-of-words model, except that each word is encoded on its own rather than counted within a document.

One-hot encoding produces a vector whose length equals the number of categories in the dataset, a category being a single distinct word. Note that the Keras `one_hot` function used below does not build such vectors: it hashes each word to an integer index between 1 and `voc_size - 1`, so two different words can end up sharing an index.

It has some drawbacks:
- Real-world vocabularies tend to be huge
- The order in which words appear in the review/document is lost
- Frequency information of words is lost
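The textbook form of one-hot encoding can be sketched in plain Python as follows. The helper names `build_vocab` and `one_hot_vector` are assumptions made for this illustration, and the two toy documents are not from the dataset.

```python
def build_vocab(docs):
    """Assign each distinct word an index in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot_vector(word, vocab):
    """Return a list with 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

docs = ["stock market rise", "market fall"]
vocab = build_vocab(docs)
print(vocab)                            # {'stock': 0, 'market': 1, 'rise': 2, 'fall': 3}
print(one_hot_vector("market", vocab))  # [0, 1, 0, 0]
```

Each vector is as long as the vocabulary, which is why the first drawback in the list matters for real corpora.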

Setting the vocabulary size

In [18]:
voc_size=5000

In [19]:
#One Hot representation

onehot_repr=[one_hot(words,voc_size)for words in corpus]
# onehot_repr

We pad the encoded sequences so that they all have the same length (`sent_length`), which lets them be processed efficiently in batches. Sequences longer than `sent_length` are truncated.

In [20]:
# Pad the encoded sequences so they can be passed to the embedding layer

sent_length=50
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
X=embedded_docs
X
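The effect of `padding='pre'` with a fixed `maxlen` can be illustrated on plain lists. This is a rough sketch of the behaviour, not the Keras implementation; `pad_pre` is a hypothetical helper. It assumes the default `truncating='pre'`, under which over-long sequences keep their last `maxlen` items.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([7, 12, 3], 5))              # [0, 0, 7, 12, 3]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))   # [3, 4, 5, 6, 7]
```

With `sent_length=50`, any story that still has more than 50 tokens after cleaning loses its opening tokens.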

### **Bag of Words**

The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms: each document is described by the counts of the words it contains, ignoring their order.

The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.

In [None]:
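from sklearn.feature_extraction.text import CountVectorizer
# Note: this cell and the TF-IDF cell overwrite X; re-run the padding cell
# before training the embedding-based models further down.
cv=CountVectorizer()
X=cv.fit_transform(corpus).toarray()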

In [None]:
X
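What the count matrix contains can be sketched with the standard library alone. `bag_of_words` is a hypothetical helper and the two toy documents are made up for the example.

```python
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows

vocab, rows = bag_of_words(["market rise market", "market fall"])
print(vocab)  # ['fall', 'market', 'rise']
print(rows)   # [[0, 2, 1], [1, 1, 0]]
```

One difference to keep in mind: `CountVectorizer` with its default token pattern ignores single-character tokens, while this sketch counts every whitespace-separated token.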

### **TF-IDF Vectorizer**

TF-IDF stands for term frequency-inverse document frequency. It highlights words that are not frequent across the corpus but are important to a particular document. The TF-IDF value increases proportionally to the number of times a word appears in the document and decreases with the number of documents in the corpus that contain the word. One drawback is that each word is still treated in isolation, without its context. It is composed of two sub-parts:

1. Term Frequency (TF)
2. Inverse Document Frequency (IDF)
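How the two parts combine can be worked through on a toy corpus. The sketch below assumes the smoothed IDF variant `ln((1 + N) / (1 + df)) + 1`, which is what `TfidfVectorizer` uses by default. It omits the L2 row normalization that scikit-learn applies afterwards, so the numbers will not match the library output exactly. `tf_idf` is a hypothetical helper.

```python
import math

def tf_idf(docs):
    """Return vocabulary, per-word IDF, and unnormalized TF-IDF rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({w for toks in tokenized for w in toks})
    # Document frequency: number of documents that contain the word
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    # Term frequency (raw count) multiplied by IDF
    rows = [[toks.count(w) * idf[w] for w in vocab] for toks in tokenized]
    return vocab, idf, rows

vocab, idf, rows = tf_idf(["market rise market", "market fall"])
print(idf["market"])  # 1.0, because the word occurs in every document
print(idf["rise"] > idf["market"])  # True: rarer words get a larger weight
```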

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv=TfidfVectorizer()
X=cv.fit_transform(corpus).toarray()
X

In [None]:
#create list of model and accuracy dicts
perform_list = [ ]

In [21]:
Y= df['SECTION']

In [22]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0, shuffle = True)

In [23]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB

In [24]:
def run_model(model_name, est_c, est_pnlty):
    # est_c and est_pnlty are accepted by the signature but not used here

    if model_name == 'Logistic Regression':
        mdl = LogisticRegression()
    elif model_name == 'Random Forest':
        mdl = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
    elif model_name == 'Multinomial Naive Bayes':
        mdl = MultinomialNB(alpha=1.0,fit_prior=True)
    elif model_name == 'Support Vector Classifer':
        mdl = SVC()
    elif model_name == 'Decision Tree Classifier':
        mdl = DecisionTreeClassifier()
    elif model_name == 'K Nearest Neighbour':
        mdl = KNeighborsClassifier(n_neighbors=10 , n_jobs=-1) # metric= 'minkowski' , p = 4
    elif model_name == 'Gaussian Naive Bayes':
        mdl = GaussianNB()
    else:
        # Fail early instead of wrapping an empty placeholder
        raise ValueError(f'Unknown model name: {model_name}')

    oneVsRest = OneVsRestClassifier(mdl)
    oneVsRest.fit(x_train, y_train)
    y_pred = oneVsRest.predict(x_test)

    # Performance metrics
    accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)

    # Get precision, recall, f1 scores
    # (with average='micro' on single-label data, all three equal the accuracy)
    precision, recall, f1score, support = score(y_test, y_pred, average='micro')

    print(f'Test Accuracy Score of Basic {model_name}: {accuracy}%')
    print(f'Precision : {precision}')
    print(f'Recall : {recall}')
    print(f'F1-score : {f1score}')

    # Add performance parameters to list
    perform_list.append({
        'Model': model_name,
        'Test Accuracy': round(accuracy, 2),
        'Precision': round(precision, 2),
        'Recall': round(recall, 2),
        'F1': round(f1score, 2),
    })

In [None]:
run_model('Logistic Regression', est_c=None, est_pnlty=None)

In [None]:
run_model('Random Forest', est_c=None, est_pnlty=None)

In [None]:
run_model('Multinomial Naive Bayes', est_c=None, est_pnlty=None)

In [None]:
run_model('Support Vector Classifer', est_c=None, est_pnlty=None)

In [None]:
run_model('Decision Tree Classifier', est_c=None, est_pnlty=None)

In [None]:
run_model('K Nearest Neighbour', est_c=None, est_pnlty=None)

In [None]:
run_model('Gaussian Naive Bayes', est_c=None, est_pnlty=None)

In [25]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense, SimpleRNN, Activation, Dropout, Conv1D
from tensorflow.keras.layers import Embedding, Flatten, LSTM, GRU

In [26]:
model=Sequential()
model.add(Embedding(voc_size,256,input_length=X.shape[1]))  # voc_size=5000, as set above
model.add(Dropout(0.3))
model.add(LSTM(256,return_sequences=True,dropout=0.3,recurrent_dropout=0.2))
model.add(LSTM(256,dropout=0.3,recurrent_dropout=0.2))
model.add(Dense(4,activation='softmax'))

In [27]:
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

In [28]:
model.fit(x_train,y_train,epochs=10,batch_size=32,verbose=2)

In [32]:
from tensorflow.keras.layers import Bidirectional

In [33]:
model=Sequential()
model.add(Embedding(voc_size,256,input_length=X.shape[1]))  # voc_size=5000, as set above
model.add(Dropout(0.3))
model.add(Bidirectional(LSTM(256,return_sequences=True,dropout=0.3,recurrent_dropout=0.2)))
model.add(Bidirectional(LSTM(256,dropout=0.3,recurrent_dropout=0.2)))
model.add(Dense(4,activation='softmax'))

In [34]:
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

In [35]:
model.fit(x_train,y_train,epochs=10,batch_size=32,verbose=2)

In [37]:
y_pred=np.argmax(model.predict(x_test), axis=-1)

In [39]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))