# Filipino Fake News Detector

This project was made by:
- Justin Clyde Frongoso
- Medwin Devilleres
- Rae Gabriel Samonte
- Alquen Antonio Sarmiento

This project is implemented as a Chrome extension that helps identify whether an article contains fake content, whether in a paragraph, a phrase, or a sentence. It uses a Multinomial Naive Bayes model to predict the validity of Filipino news articles. This Jupyter notebook is for documentation and demonstration only.

The steps for the implementation are given below:

## 1. Import Required Libraries

We first import the necessary libraries.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer

from stops import stop_words

## 2. Import Dataset

We now load the dataset, which has a single feature (the article text) and a label. The label 0 corresponds to legitimate articles and the label 1 corresponds to fake articles.

In [3]:
path = 'full.csv'
data = pd.read_csv(path)
data.head()

Unnamed: 0,label,article
0,0,"Ayon sa TheWrap.com, naghain ng kaso si Krupa,..."
1,0,Kilala rin ang singer sa pagkumpas ng kanyang ...
2,0,"BLANTYRE, Malawi (AP) -- Bumiyahe patungong Ma..."
3,0,"Kasama sa programa ang pananalangin, bulaklak ..."
4,0,Linisin ang Friendship Department dahil dadala...


We check how many fake and legitimate articles the dataset contains using the `value_counts()` method.

In [4]:
count = data.label.value_counts()
count

0    1598
1    1598
Name: label, dtype: int64

We set the key values for the feature `article` and `label` to `X` and `y`, respectively.

In [5]:
X = data['article']
y = data['label']

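We compute the average article length, in words, across the whole dataset.

In [6]:
# Sum the word counts of all articles, then divide by the number of articles.
# Dividing by count[0] would only count the legitimate half of the dataset
# and double the reported average.
total_words = 0
for article in X:
    total_words += len(article.split())

print(f"Average Article Length: {total_words / len(X):.2f}")

Average Article Length: 182.97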


## 3. Splitting the Dataset (for training and testing)

The data is now split into two sets: the training set and the test set. Since there are only 3,196 rows, we use an 80:20 split for the training and test set, respectively.

In [8]:
# Hold out 20% of the articles for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(2556,)
(640,)
(2556,)
(640,)
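The split above is random, so the metrics reported later can change from run to run. A minimal sketch of a reproducible, stratified split follows. The toy DataFrame and the `random_state` value are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: 10 legitimate (0) and 10 fake (1) rows.
toy = pd.DataFrame({
    "article": [f"artikulo {i}" for i in range(20)],
    "label": [0] * 10 + [1] * 10,
})

# random_state fixes the shuffle (42 is an arbitrary choice);
# stratify keeps the 50:50 label ratio in both resulting sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["article"],
    toy["label"],
    test_size=0.20,
    random_state=42,
    stratify=toy["label"],
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```

With stratification, the test set keeps the same proportion of fake and legitimate articles as the full dataset, which makes accuracy figures easier to compare across runs.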


## 4. Vectorizing the Dataset

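We convert the articles into a document-term matrix of token counts so that the model can be fitted on numerical features. The vocabulary is learned from the training set only and then reused to transform the test set.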

In [9]:
# Drop the Filipino stop words listed in stops.py while counting tokens
vect = CountVectorizer(stop_words=list(stop_words))
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
X_test_dtm

<640x34427 sparse matrix of type '<class 'numpy.int64'>'
	with 52560 stored elements in Compressed Sparse Row format>
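To make the document-term matrix concrete, here is a small sketch on two made-up sentences. The names `docs` and `toy_stop_words` are hypothetical; the real notebook uses the list from `stops.py`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents and a tiny stop-word list (assumptions for illustration).
docs = ["ang balita ay totoo", "ang balita ay peke peke"]
toy_stop_words = ["ang", "ay"]

toy_vect = CountVectorizer(stop_words=toy_stop_words)
toy_dtm = toy_vect.fit_transform(docs)

# Order the learned vocabulary by column index.
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocab)              # column labels
print(toy_dtm.toarray())  # one row per document, one column per word
```

Each row counts how often each vocabulary word appears in one document. Stop words never become columns, and words that the vectorizer did not see during `fit_transform` are ignored by `transform`.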

## 5. Building the Model

We use the Multinomial Naive Bayes classifier because it is suitable for classifying discrete features such as word counts. It is also computationally efficient and is commonly used for text classification tasks such as labeling articles.

In [10]:
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB()
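The following pure-Python sketch illustrates the idea behind Multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, which is also the scikit-learn default). It is a simplified illustration on made-up tokens, not the library's implementation; the `train` data and the helper names are hypothetical.

```python
import math
from collections import Counter

# Toy training data: (tokens, label); 0 = legitimate, 1 = fake.
train = [
    (["opisyal", "ulat", "gobyerno"], 0),
    (["ulat", "balita", "opisyal"], 0),
    (["nakakagulat", "lihim", "balita"], 1),
    (["lihim", "nakakagulat", "lihim"], 1),
]

vocab = sorted({tok for tokens, _ in train for tok in tokens})
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    doc_counts[label] += 1


def log_posterior(tokens, label, alpha=1.0):
    """Unnormalized log P(label | tokens) with Laplace smoothing."""
    total = sum(word_counts[label].values())
    # Start from the log prior of the class.
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for tok in tokens:
        if tok not in vocab:
            continue  # words outside the vocabulary are ignored
        smoothed = (word_counts[label][tok] + alpha) / (total + alpha * len(vocab))
        score += math.log(smoothed)
    return score


def predict(tokens):
    """Return the label with the highest log posterior."""
    return max((0, 1), key=lambda label: log_posterior(tokens, label))


print(predict(["lihim", "balita"]))
print(predict(["opisyal", "ulat"]))
```

The classifier multiplies per-word probabilities for each class (added in log space to avoid underflow) and picks the class with the larger score.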

In [11]:
y_pred_class = nb.predict(X_test_dtm)

In [12]:
metrics.accuracy_score(y_test, y_pred_class)

0.9203125
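Accuracy alone does not show whether the errors are fake articles passed as legitimate or the reverse. A confusion matrix separates the two cases. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical and are not the notebook's results).

```python
from sklearn.metrics import confusion_matrix, classification_report

# Made-up ground truth and predictions; 0 = legitimate, 1 = fake.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"fake articles missed (false negatives): {fn}")
print(classification_report(y_true, y_hat))
```

In the notebook, the same calls could be applied to `y_test` and `y_pred_class`.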

## 6. Predicting Text

Now that the model is trained, we can use it to predict the label of new text. Enter any article to see whether the model classifies it as legitimate (0) or fake (1).

In [13]:
def manual_predict(model, vectorizer, text):
    # The vectorizer expects an iterable of documents, so wrap the text in a list
    inp_dtm = vectorizer.transform([text])
    return model.predict(inp_dtm)[0]

inp = input()
print(manual_predict(nb, vect, inp))


0
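A hard 0/1 label hides how confident the model is. As a sketch, `predict_proba` can return the estimated probability that a text is fake. The toy training texts and the helper `manual_predict_proba` are hypothetical and only stand in for the notebook's `nb` and `vect`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data; 0 = legitimate, 1 = fake.
toy_texts = [
    "opisyal na ulat ng gobyerno",
    "ulat ng opisyal na balita",
    "nakakagulat na lihim na balita",
    "lihim na nakakagulat na lihim",
]
toy_labels = [0, 0, 1, 1]

toy_vect = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vect.fit_transform(toy_texts), toy_labels)


def manual_predict_proba(model, vectorizer, text):
    """Return the estimated P(fake) for a single piece of text."""
    proba = model.predict_proba(vectorizer.transform([text]))[0]
    # Look up the column that corresponds to label 1 (fake).
    fake_index = list(model.classes_).index(1)
    return float(proba[fake_index])


p_fake = manual_predict_proba(toy_nb, toy_vect, "lihim na balita")
print(round(p_fake, 3))
```

Naive Bayes probabilities tend to be pushed toward 0 or 1 on long texts, so they are better read as a ranking signal than as calibrated likelihoods.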


## 7. Exporting the Model
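We export the model and its fitted vectorizer to pickle files that the API can load for prediction. Both files are needed, because the model only understands the vocabulary learned by that vectorizer.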


In [14]:
import pickle
with open('model_pickle', 'wb') as f:
    pickle.dump(nb, f)
with open('vect_pickle', 'wb') as f:
    pickle.dump(vect, f)

In [15]:
with open('model_pickle', 'rb') as f:
    imported_model = pickle.load(f)
with open('vect_pickle', 'rb') as f:
    imported_vect = pickle.load(f)

In [16]:
inp = str(input())
print(manual_predict(imported_model, imported_vect, inp))


0
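Keeping the model and the vectorizer in separate files makes it possible to pair a model with the wrong vectorizer. One alternative, shown here only as a sketch, is to bundle both in a scikit-learn `Pipeline` and pickle that single object. The toy data is made up, and the round trip uses `pickle.dumps` in memory instead of the notebook's files.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data; 0 = legitimate, 1 = fake.
toy_texts = [
    "opisyal na ulat ng gobyerno",
    "ulat ng opisyal na balita",
    "nakakagulat na lihim na balita",
    "lihim na nakakagulat na lihim",
]
toy_labels = [0, 0, 1, 1]

# The pipeline vectorizes raw text and then classifies it.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(toy_texts, toy_labels)

# Serialize and restore the whole pipeline as one object.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

sample = ["lihim na balita"]
print(pipe.predict(sample), restored.predict(sample))
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the scikit-learn version that produced them.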


## 8. Additional Features
In this section we introduce additional models for the extension. We train two more classifiers, a support vector machine (SVM) and Logistic Regression, to estimate whether an article is trustworthy.

### i. SVM Model

In [15]:
# Import model and numerical vectorizer
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

In [16]:
# Convert the articles into TF-IDF feature vectors
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(X_train)
test_features = vectorizer.transform(X_test)

In [17]:
# Fit and test
classifier = svm.SVC()
classifier.fit(train_features, y_train)
predictions = classifier.predict(test_features)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.92      0.94      0.93       323
           1       0.94      0.91      0.93       317

    accuracy                           0.93       640
   macro avg       0.93      0.93      0.93       640
weighted avg       0.93      0.93      0.93       640



In [18]:
# Export the SVM classifier and its TF-IDF vectorizer
# (not nb and vect, which belong to the Naive Bayes model)
import pickle
with open('model_pickle_svm', 'wb') as f:
    pickle.dump(classifier, f)
with open('vect_pickle_svm', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('model_pickle_svm', 'rb') as f:
    imported_model_svm = pickle.load(f)
with open('vect_pickle_svm', 'rb') as f:
    imported_vect_svm = pickle.load(f)

### ii. Logistic Regression

In [19]:
# Import model and numerical vectorizer
from sklearn.linear_model import LogisticRegression

In [20]:
# Convert the articles into TF-IDF feature vectors
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(X_train)
test_features = vectorizer.transform(X_test)

In [21]:
# Fit and test
classifier = LogisticRegression()
classifier.fit(train_features, y_train)
predictions = classifier.predict(test_features)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.91      0.92      0.91       323
           1       0.91      0.91      0.91       317

    accuracy                           0.91       640
   macro avg       0.91      0.91      0.91       640
weighted avg       0.91      0.91      0.91       640
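Logistic Regression exposes class probabilities directly through `predict_proba`, which suits the goal of reporting how likely an article is to be fake. (`svm.SVC()` as configured above only offers `predict_proba` when it is created with `probability=True`.) The sketch below uses made-up training texts; the `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data; 0 = legitimate, 1 = fake.
toy_texts = [
    "opisyal na ulat ng gobyerno",
    "ulat ng opisyal na balita",
    "nakakagulat na lihim na balita",
    "lihim na nakakagulat na lihim",
]
toy_labels = [0, 0, 1, 1]

toy_tfidf = TfidfVectorizer()
toy_lr = LogisticRegression().fit(toy_tfidf.fit_transform(toy_texts), toy_labels)

# One row per input text, one column per class in toy_lr.classes_.
proba = toy_lr.predict_proba(toy_tfidf.transform(["lihim na balita"]))
print(dict(zip(toy_lr.classes_.tolist(), proba[0].round(3).tolist())))
```

The columns of `proba` follow the order of `classes_`, so the second column is the estimated probability of label 1 (fake).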



In [22]:
# Export the Logistic Regression classifier and its TF-IDF vectorizer
# (not nb and vect, which belong to the Naive Bayes model)
import pickle
with open('model_pickle_logistic_regression', 'wb') as f:
    pickle.dump(classifier, f)
with open('vect_pickle_logistic_regression', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('model_pickle_logistic_regression', 'rb') as f:
    imported_model_logistic_regression = pickle.load(f)
with open('vect_pickle_logistic_regression', 'rb') as f:
    imported_vect_logistic_regression = pickle.load(f)