### **Using a Machine Learning Linear Support Vector Classifier to Detect Fake News Articles in Python**

##### **1. Download the Fake News Dataset from GitHub**

Fake News Dataset on GitHub: https://github.com/lutzhamel/fake-news

1. Go to the data folder, then download fake_or_real_news.csv.
2. Place it in the same folder as this notebook.

##### **2. Install the Libraries**
1. **NumPy** (Numerical Python): Provides powerful array manipulation tools for numerical computations, linear algebra, random number generation, and more. It's a foundational library for many other scientific Python packages.
2. **Pandas**: Offers high-performance data structures like DataFrames (similar to spreadsheets) and Series (like labeled arrays) for efficient data analysis, manipulation, and cleaning.
3. **Scikit-learn** (imported as `sklearn`): A comprehensive library of machine learning algorithms for classification, regression, clustering, model selection, and more, with easy-to-use implementations of each technique.

In [15]:
pip install numpy pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.





##### **3. Import the Libraries**

In [16]:
import numpy as np #for numerical computations
import pandas as pd #for data manipulation and analysis

from sklearn.model_selection import train_test_split #for splitting a dataset into training and testing sets (80% and 20% in this notebook)
from sklearn.feature_extraction.text import TfidfVectorizer #convert textual data into numerical features (Term Frequency - Inverse Document Frequency)
from sklearn.svm import LinearSVC #for using a machine learning algorithm called Linear Support Vector Classifier

##### **4. View Contents of the Dataset**

In [17]:
data = pd.read_csv("fake_or_real_news.csv")

In [18]:
data

| Unnamed: 0 | id | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |
| ... | ... | ... | ... | ... |
| 6330 | 4490 | State Department says it can't find emails fro... | The State Department told the Republican Natio... | REAL |
| 6331 | 8062 | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | FAKE |
| 6332 | 8622 | Anti-Trump Protesters Are Tools of the Oligarc... | Anti-Trump Protesters Are Tools of the Oligar... | FAKE |
| 6333 | 4021 | In Ethiopia, Obama seeks progress on peace, se... | ADDIS ABABA, Ethiopia —President Obama convene... | REAL |


##### **5. Convert the Labels to Numerical Values. If Real then the value is 0, If Fake then 1**

In [19]:
data['fake'] = data['label'].apply(lambda x: 0 if x == "REAL" else 1)
data = data.drop("label", axis=1)
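The lambda above turns every label that is not exactly "REAL" into 1, including typos or missing values. As a hedged alternative, the sketch below uses an explicit mapping so unexpected labels surface as errors. The `toy` DataFrame and the `label_to_int` name are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (these three rows are invented for illustration)
toy = pd.DataFrame({"label": ["FAKE", "REAL", "FAKE"]})

# Explicit mapping: a label missing from this dict becomes NaN instead of silently becoming 1
label_to_int = {"REAL": 0, "FAKE": 1}
toy["fake"] = toy["label"].map(label_to_int)

# Fail early if any label was not covered by the mapping
assert toy["fake"].notna().all(), "found a label that is neither REAL nor FAKE"
toy["fake"] = toy["fake"].astype(int)

print(toy["fake"].tolist())
```

The same pattern should work on the notebook's `data` DataFrame by replacing `toy` with `data`.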

##### **6. Divide the Dataset (Text-to-Fake Paired Values). Use 80% of the data for Training and use 20% for Testing**

In [20]:
X, y = data["text"], data["fake"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
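Because `train_test_split` shuffles randomly, each run of the cell above produces a different split and therefore a slightly different accuracy. The sketch below shows one way to make the split repeatable and class-balanced using the `random_state` and `stratify` parameters. The toy Series and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 made-up texts, half labelled real (0) and half fake (1)
X_toy = pd.Series([f"article {i}" for i in range(10)])
y_toy = pd.Series([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixed seed so every run produces the same split
    stratify=y_toy,     # keep the real/fake ratio the same in both sets
)

print(len(X_tr), len(X_te))
print(sorted(y_te.tolist()))
```

With ten rows, this leaves eight for training and two for testing, one from each class.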

**TF-IDF: Term Frequency - Inverse Document Frequency**

TF-IDF is a metric that indicates how important a word is to a document in a collection. It weighs the importance of each word in a document based on how often it appears in that document and how often it appears across all documents in the collection.

1. **TF**: the number of times a term t appears in a document.
2. **IDF**: the logarithm of the total number of documents divided by the number of documents that contain the term.
3. **TF-IDF**: TF * IDF

This lets us find the most relevant and distinctive words in each document.
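To make the definitions concrete, here is a minimal hand computation of TF-IDF on three made-up sentences, following the plain formulas above. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each document vector by default, so its numbers will differ from this sketch.

```python
import math

# Three invented one-line "documents"
docs = [
    "the senate passed the bill",
    "the bill shocked the senate",
    "aliens endorsed the bill",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    # Raw count of the term in one document
    return tokens.count(term)

def idf(term):
    # log(total documents / documents that contain the term)
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / doc_freq)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

print(tf_idf("the", tokenized[0]))               # "the" is in every document, so its IDF is log(1) = 0
print(round(tf_idf("aliens", tokenized[2]), 3))  # "aliens" is in one document only, so it scores higher
```

A word that appears everywhere gets a weight of zero, while a word unique to one document gets the largest weight.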

##### **7. Convert Textual Data that's in Sentence Form into Numerical Feature Vectors**
The code in this step does the following:
1. Removes unnecessary stop words.
2. Focuses on informative words based on document frequency.
3. Converts textual documents into numerical feature vectors suitable for classification algorithms.

In [21]:
# Remove English stop words (e.g. "a", "the", "an") and ignore words that appear in more than 70% of the documents
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_vectorized = vectorizer.fit_transform(X_train)  # learn the vocabulary and IDF weights from the training set only
X_test_vectorized = vectorizer.transform(X_test)  # reuse the learned vocabulary for the test set
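The effect of `stop_words` and `max_df` is easier to see on a tiny corpus. The sketch below uses four made-up sentences (not from the dataset) and a separate `toy_vectorizer`, then shows that `transform` only reuses the vocabulary learned by `fit_transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, only for illustration
train_docs = [
    "the election results were announced today",
    "the election was rigged claims blog",
    "the weather today is sunny",
    "election fraud claims spread online",
]

toy_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print("the" in vocab)       # False: removed as an English stop word
print("election" in vocab)  # False: appears in 3 of 4 documents (75% > 70%)

# transform() reuses the learned vocabulary; words it has never seen are ignored
new_matrix = toy_vectorizer.transform(["unseen zebra election"])
print(new_matrix.shape[1] == train_matrix.shape[1])  # True: same columns as the training matrix
print(new_matrix.nnz)                                # 0: this text contains no known words
```

This is why the notebook calls `fit_transform` on the training text and only `transform` on the test text: the test set must be described with the vocabulary learned during training.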

##### **8. Train the Linear SVC model to fit for fake news detection using the Text (Vectorized earlier) and Fake Status (Real/0 or Fake/1)**

During training, the Linear SVC model learns to identify patterns and relationships between the features (words and their importance) in the training documents and their corresponding class labels. It essentially creates a decision boundary in the feature space that can be used to separate documents belonging to different classes.

In short, this step fits the Linear SVC model to the labeled training data so that it can classify new text based on the patterns it has learned.

In [22]:
clf = LinearSVC()  # Linear SVC is a widely used, strong baseline for sparse, high-dimensional text features
clf.fit(X_train_vectorized, y_train)
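One way to see the learned decision boundary is to inspect the model's per-word weights in `coef_`. The sketch below trains on four invented headlines, so the weights are illustrative only. Positive weights push a document toward class 1 (fake) and negative weights toward class 0 (real).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Four made-up headlines; 1 = fake, 0 = real
texts = [
    "shocking secret exposed",
    "shocking miracle cure",
    "senate passes budget",
    "senate votes on treaty",
]
labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
features = toy_vectorizer.fit_transform(texts)

toy_clf = LinearSVC(random_state=0)
toy_clf.fit(features, labels)

# coef_[0] holds one weight per vocabulary word for a binary problem
weights = {word: toy_clf.coef_[0][idx] for word, idx in toy_vectorizer.vocabulary_.items()}
for word in sorted(weights, key=weights.get):
    print(f"{word:10s} {weights[word]:+.3f}")
```

On the real notebook objects, the same idea would use `clf.coef_[0]` together with `vectorizer.vocabulary_` to list the words that most strongly indicate fake or real news.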

##### **9. Check Performance or Accuracy of Fake News Detector Model**

Results show that, of the 1,267 test articles with text data labeled as fake or real, the model...

In [35]:
accuracy = clf.score(X_test_vectorized, y_test) * 100
print(f"Classified {accuracy:.2f}% of the articles correctly")

Classified 94.32% of the articles correctly


In [39]:
num_of_articles = round(len(y_test) * accuracy / 100)  # accuracy is a percentage, so divide by 100
total = len(y_test)
print(f"which is {num_of_articles} articles out of {total} total number of articles from the test data")

which is 1195 articles out of 1267 total number of articles from the test data
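Accuracy alone does not show which kind of mistake the model makes. As a sketch, the block below counts correct predictions directly and builds a confusion matrix with `sklearn.metrics`. The `y_true` and `y_pred` arrays are made up; in the notebook, the predictions would presumably come from `clf.predict(X_test_vectorized)` and the true labels from `y_test`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for six articles: 0 = real, 1 = fake
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Count correct predictions directly instead of deriving the count from a percentage
num_correct = int((y_true == y_pred).sum())
accuracy = accuracy_score(y_true, y_pred)
print(f"{num_correct} of {len(y_true)} correct ({accuracy * 100:.2f}%)")

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"real predicted real: {tn}, real predicted fake: {fp}")
print(f"fake predicted real: {fn}, fake predicted fake: {tp}")
```

Counting matches directly avoids the rounding and unit mistakes that can creep in when converting a percentage back into a number of articles.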


##### **10. Test the Fake News Detector Model using any Article**

In [24]:
article_text = X_test.iloc[10]
vectorized_text = vectorizer.transform([article_text])

In [25]:
clf.predict(vectorized_text)

array([0], dtype=int64)

In [42]:
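# Look up the true label of the same article (this is the ground truth from y_test, not the model's prediction)
fake_in_binary = y_test.iloc[10]

result = "The article is Fake News" if fake_in_binary == 1 else "The article is Real News"
print(result)


The article is Fake News

The true label here is Fake (1), while the prediction shown above was 0 (Real), so this particular article appears to be one that the model misclassified.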
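To test any article text rather than a row from the test set, the vectorizer and classifier can be bundled so that raw strings go straight in. The sketch below does this with `make_pipeline`, which is a swapped-in convenience and not what the notebook itself uses. The training headlines and the `classify_article` helper are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Four made-up headlines; 1 = fake, 0 = real
toy_texts = [
    "shocking secret exposed",
    "shocking miracle cure",
    "senate passes budget",
    "senate votes on treaty",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then classifies it in one call
toy_model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
toy_model.fit(toy_texts, toy_labels)

def classify_article(text, model):
    """Return a readable verdict for one raw article string (hypothetical helper)."""
    prediction = model.predict([text])[0]
    return "The article is Fake News" if prediction == 1 else "The article is Real News"

print(classify_article("shocking miracle exposed", toy_model))
print(classify_article("senate passes treaty", toy_model))
```

The key point is that the verdict is built from the model's prediction, so the printed message reflects what the detector decided rather than the stored label.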


##### **That's it! Thank you for listening!**