## **Problem Statement**

### **Introduction**
Fake news has emerged as one of the most significant challenges of our time, severely impacting both online and offline discourse. Its proliferation poses a direct threat to the democratic processes and societal stability, particularly in the western world. The ability to accurately identify and reduce the spread of fake news is essential to maintaining informed public discourse and safeguarding democratic institutions.

### **Problem Statement**
The primary challenge addressed by this project is the automatic detection of fake news articles using machine learning and natural language processing (NLP) techniques. By developing a reliable model to classify news articles as either fake or real, we aim to contribute to the efforts to curb the spread of misinformation and enhance the quality of information available to the public.

### **Aim of the Project**

The aim of this project is to build a robust and accurate fake news detection system.

### **How Does the Solution Solve the Problem?**

The proposed solution involves developing a machine learning model that leverages NLP and deep learning techniques to classify news articles as fake or real, allowing users to input news articles and classify them as fake or real, thereby providing a valuable tool for combating misinformation.


### **About the Dataset**

The dataset used in this project contains labeled news articles, categorized as either fake or real. This dataset is essential for training and evaluating the machine learning models developed to detect fake news.

### **Content**
The dataset comprises rows and columns that represent various attributes of news articles, including their textual content and labels indicating whether they are fake or real. The dataset includes information on how it was acquired and the time period it represents, providing valuable context for the analysis.




In [38]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [72]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import spacy

In [40]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/fake_or_real_news.csv', usecols=lambda col: col if 'Unnamed' not in col else None)


In [41]:
df.head()

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [42]:
df['input_text'] = df['title'] + ' ' + df['text']

In [43]:
df.head()

Unnamed: 0,title,text,label,input_text
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,You Can Smell Hillary’s Fear Daniel Greenfield...
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Watch The Exact Moment Paul Ryan Committed Pol...
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,Kerry to go to Paris in gesture of sympathy U....
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,Bernie supporters on Twitter erupt in anger ag...
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,The Battle of New York: Why This Primary Matte...


## **Machine Learning Task Instructions**

In this task, you will work with the provided `input_text` variable. Your objective is to apply any machine learning algorithm to process the data and achieve meaningful results.

### **Steps to Follow:**

1. **Preprocess the Data**: Clean and preprocess the `input_text` data as necessary. This might include actions such as tokenization, removing stop words, and lemmatization.

2. **Extract Features**: Transform the text data into numerical features suitable for machine learning algorithms. Consider using techniques like `TfidfVectorizer` or `CountVectorizer`.

3. **Select a Machine Learning Algorithm**: Choose an appropriate machine learning algorithm for your task. Options include classification algorithms (e.g., Logistic Regression, SVM, Random Forest, and others).

4. **Train Your Model**: Split your data into training and testing sets, then train your chosen model on the preprocessed data.

5. **Evaluate Your Model**: Measure the performance of your model using suitable metrics (e.g., accuracy, precision, recall, F1-score).

Good luck!


In [44]:
#spacy model
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [45]:
#perform tokenization, lemmitization and stop words

def spacy_tokenizer(sentence):
    tokens = nlp(sentence)
    #lemmatization
    tokens = [token.lemma_.lower().strip() if token.lemma_ != "-PRON-" else token.lower_ for token in tokens]
    #removal of stop words
    token = [token for token in tokens if token not in nlp.Defaults.stop_words]
    return " ".join(tokens)

In [46]:
#preprocessing the input
df['input_text'] = df['input_text'].apply(spacy_tokenizer)

In [47]:
df.head()

Unnamed: 0,title,text,label,input_text
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,you can smell hillary ’s fear daniel greenfiel...
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,watch the exact moment paul ryan committed pol...
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,kerry to go to paris in gesture of sympathy u....
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,bernie supporter on twitter erupt in anger aga...
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,the battle of new york : why this primary matt...


In [48]:
df['label'] = df['label'].apply(lambda x: 1 if x == 'REAL' else 0)

In [49]:
df['label'].value_counts()

label
1    3171
0    3164
Name: count, dtype: int64

In [50]:
#assign input and target
X = df['input_text']
y = df['label']

In [51]:
#count vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)

In [52]:
new_ = X

In [53]:
new_.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

## Machine Learning Algorithm

In [57]:
#instanciate the models
log_model = LogisticRegression()
svm_model = SVC()
rf_model = RandomForestClassifier()

In [58]:
#fit the logistic regression model
log_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [60]:
#make prediction
log_pred = log_model.predict(X_test)

In [74]:
#Evaluate the logistic regression model
log_model_acc = accuracy_score(y_test, log_pred)
log_model_acc
print(f'Accuracy score of logistic regression model is {log_model_acc}')

Accuracy score of logistic regression model is 0.9348054679284963


In [73]:
#generate classification report
report = classification_report(y_test, log_pred)
print(report)

              precision    recall  f1-score   support

           0       0.93      0.93      0.93       459
           1       0.94      0.94      0.94       492

    accuracy                           0.93       951
   macro avg       0.93      0.93      0.93       951
weighted avg       0.93      0.93      0.93       951



In [75]:
#fit the svm model
svm_model.fit(X_train, y_train)

In [64]:
#make prediction
svm_pred = svm_model.predict(X_test)

In [65]:
#check accuracy score for the svm model
svm_model_acc = accuracy_score(y_test, svm_pred)
svm_model_acc
print(f'Accuracy score of svm model is {svm_model_acc}')

Accuracy score of svm model is 0.8664563617245006


In [66]:
#fit the random forest classifier model
rf_model.fit(X_train, y_train)

In [67]:
#make prediction
rf_pred = rf_model.predict(X_test)

In [78]:
#check accuracy score for the random forest classifier model
rf_model_acc = accuracy_score(y_test, rf_pred)
rf_model_acc
print(f'Accuracy score of random forest classifier model is {rf_model_acc}')

Accuracy score of random forest classifier model is 0.9095688748685594
