<a href="https://colab.research.google.com/github/OmarMeriwani/Fake-Financial-News-Detection/blob/master/News_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Classification
This notebook contains the source code for building the news classifier. The UCI News Aggregator dataset contains roughly 400K news titles with category labels for news classification. We use a count vectorizer (bag of words) as the feature extraction method.


In [0]:
import re
import pandas as pd
from nltk.corpus import stopwords
from string import punctuation
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import pickle
import nltk.tokenize

The function below normalizes a sentence by lowercasing it, removing punctuation that is not word-internal, and collapsing multiple spaces.

In [0]:
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
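
As a quick illustration of what this normalization does and does not remove, the sketch below repeats the function and runs it on a few made-up titles (the example strings are assumptions, not rows from the dataset). Note that punctuation at the very end of a string survives, because it is not followed by whitespace.

```python
import re

def normalize_text(s):
    s = s.lower()
    # remove punctuation that is adjacent to whitespace (not word-internal)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # collapse runs of whitespace into a single space
    s = re.sub(r'\s+', ' ', s)
    return s

# Made-up example titles
examples = [
    "Hello,  World - again",
    "State-of-the-art don't",
    "Stocks rally!",
]
normalized = [normalize_text(t) for t in examples]
for before, after in zip(examples, normalized):
    print(repr(before), '->', repr(after))
```

Word-internal hyphens and apostrophes are kept, so `state-of-the-art` and `don't` stay as single tokens at this stage.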

Each row in the dataset is normalized and tokenized, and all the tokens are then reduced to a vocabulary set that keeps only the unique words.

In [0]:
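alltokens = []
classifiedrows = 400000  # number of rows read from the dataset
df3 = pd.DataFrame(columns=['title'])  # normalized titles, filled by get_vocabulary
def get_vocabulary(doc, encoding, textIndex, encodeDecode):
    """Return the set of unique tokens in column `textIndex` of the CSV file `doc`."""
    # e.g. encoding='ISO-8859-1'
    df = pd.read_csv(doc, header=0, encoding=encoding)
    df = df[:classifiedrows]
    atokens = []
    for i in range(len(df)):
        # textIndex is a column position, so index positionally with iloc
        sentence = df.iloc[i, textIndex]
        sentence = normalize_text(sentence)
        if encodeDecode:
            # drop non-ASCII characters
            sentence = sentence.encode('ascii', errors='ignore').decode('utf-8')
        df3.loc[i] = sentence
        # word_tokenize needs the NLTK Punkt tokenizer data to be downloaded
        tokens = nltk.tokenize.word_tokenize(sentence)
        atokens.extend(tokens)
    return set(atokens)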
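
The vocabulary-building idea can be sketched on a small in-memory list of titles. This is a minimal sketch under two assumptions: the titles are made up, and plain whitespace splitting stands in for `nltk.tokenize.word_tokenize`, so it runs without any NLTK data. The helper name `build_vocabulary` is hypothetical.

```python
import re

def normalize_text(s):
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    s = re.sub(r'\s+', ' ', s)
    return s

def build_vocabulary(titles):
    """Collect the unique tokens over all titles (hypothetical helper)."""
    vocab = set()
    for title in titles:
        # Assumption: whitespace split instead of NLTK's word_tokenize
        vocab.update(normalize_text(title).split())
    return vocab

# Made-up titles, not taken from the dataset
titles = ["Fed raises rates", "Apple raises prices, again", "Fed holds rates"]
vocab = build_vocabulary(titles)
print(sorted(vocab))
```

Using a set from the start avoids holding every token occurrence in a list before deduplicating.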

The second normalization method prepares tokenized sentences for classification. It includes:
* Removing stop words.
* Removing punctuation.
* Keeping words with length larger than 1.
* Keeping alphabetical tokens only.

In [0]:
def clean_doc(doc):
    # doc is expected to already be a list of tokens
    tokens = doc
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
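
The four filtering steps can be traced on a small example. The sketch below is an assumption-laden stand-in: it replaces `stopwords.words('english')` with a tiny hard-coded stop word set so it runs without NLTK data, and the name `clean_tokens` and the input sentence are hypothetical.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
STOP_WORDS = {"the", "a", "is", "to", "of", "in"}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    # remove punctuation characters from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # keep alphabetic tokens only (drops numbers, mixed tokens, empty strings)
    tokens = [w for w in tokens if w.isalpha()]
    # filter out stop words
    tokens = [w for w in tokens if w not in stop_words]
    # keep tokens longer than one character
    return [w for w in tokens if len(w) > 1]

cleaned = clean_tokens("the fed's rate is up 3% in q2 !".split())
print(cleaned)
```

Because punctuation is stripped before the alphabetic check, `fed's` becomes `feds` and is kept, while `3%` and `q2` are dropped.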

Reading the UCI News Aggregator dataset and storing its vocabulary.

In [0]:
alltokens = get_vocabulary('uci-news-aggregator.csv','utf-8',1,False)
# one row per vocabulary word
df_store_vocab = pd.DataFrame({'word': list(alltokens)})


Creating a count vectorizer over the previously acquired vocabulary and then reading the dataset. Finally, creating an empty DataFrame for the required fields only (title and category).

In [0]:
vectorizer = CountVectorizer(vocabulary=alltokens)
news = pd.read_csv("uci-news-aggregator.csv")
news = news[:classifiedrows]
seq = 0
df2 = pd.DataFrame(columns=['title','category'])
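
To illustrate how a count vectorizer behaves when it is given a fixed vocabulary, here is a minimal sketch on made-up titles and a made-up five-word vocabulary. With `vocabulary=` supplied, words outside that vocabulary are ignored and no new terms are learned from the text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up vocabulary and titles for illustration
vocab = sorted({"fed", "raises", "rates", "apple", "iphone"})
vectorizer = CountVectorizer(vocabulary=vocab)

titles = ["fed raises rates", "apple unveils new iphone", "rates rates rates"]
# With a fixed vocabulary, transform can be called without a prior fit
x = vectorizer.transform(titles)

print(vocab)        # column order of the matrix
print(x.toarray())  # one row per title, one column per vocabulary word
```

Each entry counts how often a vocabulary word appears in a title; `unveils` and `new` contribute nothing because they are not in the vocabulary.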


Copying the normalized titles and their categories from the dataset into df2, which contains only the required fields.

In [0]:
for i in range(len(news)):
    # column 1 holds the title, column 4 the category
    sentence = news.iloc[i, 1]
    sentence = normalize_text(sentence)
    category = news.iloc[i, 4]
    r = [sentence, category]
    df2.loc[seq] = r
    seq += 1
print(df2)
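
Appending rows one at a time can be slow for several hundred thousand rows. A column-wise alternative is sketched below on a tiny made-up frame; the column names in this toy frame are assumptions, and only the column positions (1 and 4) follow the loop above.

```python
import re
import pandas as pd

def normalize_text(s):
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    s = re.sub(r'\s+', ' ', s)
    return s

# Toy stand-in for the CSV; column names are hypothetical
news = pd.DataFrame({
    'ID': [1, 2],
    'TITLE': ["Fed raises rates, again", "Apple unveils new iPhone"],
    'URL': ['', ''],
    'PUBLISHER': ['', ''],
    'CATEGORY': ['b', 't'],
})

# Select the title and category columns by position, then normalize in one pass
df2 = news.iloc[:, [1, 4]].copy()
df2.columns = ['title', 'category']
df2['title'] = df2['title'].map(normalize_text)
print(df2)
```

This produces the same two-column layout without growing the DataFrame row by row.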


Encoding the labels and fitting the count vectorizer on the news title column to create bag-of-words vector representations.

In [0]:
x = vectorizer.fit_transform(df2['title'])
with open('vocab.pkl', 'wb') as f:
    pickle.dump(vectorizer.vocabulary_, f)
print('SHAPE: ',x.shape)
encoder = LabelEncoder()
y = encoder.fit_transform(df2['category'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
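
The label encoding step maps each category string to an integer and can be reversed later. A minimal sketch, assuming illustrative single-letter category codes rather than the actual dataset column:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative category codes
categories = ['b', 't', 'e', 'm', 'b']

encoder = LabelEncoder()
y = encoder.fit_transform(categories)

print(list(encoder.classes_))  # classes are stored in sorted order
print(y.tolist())              # integer index of each label in classes_
print(list(encoder.inverse_transform([0, 3])))  # back to the original labels
```

Keeping the fitted encoder around is what makes it possible to turn the classifier's integer predictions back into category names.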

Training and evaluation.

In [0]:
mlp = MLPClassifier(activation='tanh', hidden_layer_sizes=(20,20,20))
mlp.fit(x_train,y_train)

with open('MLPClassifier4.pkl', 'wb') as f:
    pickle.dump(mlp, f)
score = mlp.score(x_test, y_test)
print(score)

# predict labels for the held-out test set
y2 = mlp.predict(x_test)
print(encoder.classes_)
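
To show how the trained pieces fit together at prediction time, here is a small end-to-end sketch on toy data. Everything in it is an assumption for illustration: the titles and category codes are made up, and `max_iter` and `random_state` are set only to keep the toy run short and repeatable. A new title must go through the same vectorizer, and the predicted integers are mapped back to labels with the encoder.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

# Made-up training data
titles = [
    "fed raises interest rates", "stocks fall on bank fears",
    "apple unveils new iphone", "google updates android software",
    "actor wins film award", "singer releases new album",
    "new flu vaccine approved", "study links diet to heart health",
]
categories = ['b', 'b', 't', 't', 'e', 'e', 'm', 'm']

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(titles)
encoder = LabelEncoder()
y = encoder.fit_transform(categories)

mlp = MLPClassifier(activation='tanh', hidden_layer_sizes=(20, 20, 20),
                    max_iter=500, random_state=0)
mlp.fit(x, y)

# Vectorize unseen titles with the SAME fitted vectorizer, then decode predictions
new_titles = ["fed raises rates"]
x_new = vectorizer.transform(new_titles)
predicted = encoder.inverse_transform(mlp.predict(x_new))
print(predicted)
```

With so little training data the predicted label is not meaningful; the point is only the order of operations.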
