<a href="https://colab.research.google.com/github/Mina-Rahmanian/Detecting-fake-news/blob/main/Detecting_fake_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fake or Real News Detection

Install the required libraries with pip. The scikit-learn package is published on PyPI as `scikit-learn`, not `sklearn`.

In [None]:
pip install numpy pandas scikit-learn

## Make the necessary imports

In [None]:
import numpy as np
import pandas as pd
import itertools
# itertools provides fast, memory-efficient building blocks for iterating over data structures in for-loops.
# Using it also tends to make looping code easier to read and maintain.

import sklearn.datasets
from sklearn.model_selection import train_test_split
# Splitting data arrays into two subsets: for training data and for testing data

from sklearn.feature_extraction.text import TfidfVectorizer
# Converts a collection of raw text documents into a matrix of TF-IDF features that ML algorithms can consume.

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix # Sklearn metrics let you assess the quality of your predictions

### Read the data into a DataFrame, and get the shape of the data and the first 10 records.

In [None]:
#Read the data

from google.colab import drive
%matplotlib inline
drive.mount('/content/drive')

import os
os.chdir("/content/drive/")

# df=pd.read_csv('D:\\Data\\news.csv')

#np.random.seed(7)
df = pd.read_csv('My Drive/fakenews/news.csv')

#Get shape and head
print(df.shape) # (number of rows, number of columns)
df.head(10)
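Beyond the shape and the first rows, it can help to check the class balance and look for empty articles before modelling. The sketch below does this with the standard library only, so it runs without the Drive file. The inline CSV rows are made up for illustration, and only the `text` and `label` column names come from this notebook. On the real DataFrame, `df['label'].value_counts()` and `df.isna().sum()` would be the pandas equivalents.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for news.csv (assumption: only the columns used in this notebook).
SAMPLE_CSV = """text,label
Some article body,FAKE
Another article body,REAL
,REAL
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Shape as (rows, columns), analogous to df.shape
shape = (len(rows), len(rows[0]))

# Class balance, analogous to df['label'].value_counts()
label_counts = Counter(row["label"] for row in rows)

# Rows whose text is empty and would contribute nothing to the TF-IDF features
empty_texts = sum(1 for row in rows if not row["text"].strip())

print(shape)
print(dict(label_counts))
print(empty_texts)
```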

### Get the labels from the DataFrame.

In [None]:
#DataFlair - Get the labels
labels=df.label
labels.head(10)

### Get the text from the DataFrame.

In [None]:
texts=df.text
texts.head(5)

## Split the dataset into training and testing sets.

In [None]:
#DataFlair - Split the dataset
# 80% of the rows go to training and 20% to testing; random_state fixes the shuffle so the split is reproducible
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
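To make the roles of `test_size` and `random_state` concrete, here is a conceptual sketch of a seeded shuffle-and-cut split in plain Python. The helper `seeded_split` is hypothetical and is not how scikit-learn implements `train_test_split`. It only illustrates that a fixed seed yields the same partition on every run and that texts and labels stay aligned.

```python
import math
import random

def seeded_split(items, labels, test_size=0.2, seed=7):
    """Shuffle the indices with a fixed seed, then reserve a test_size share for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # same seed, same order
    n_test = math.ceil(len(items) * test_size)    # number of held-out samples
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

# Made-up data for illustration
demo_texts = [f"article {i}" for i in range(10)]
demo_labels = ["FAKE" if i % 2 else "REAL" for i in range(10)]

demo_x_train, demo_x_test, demo_y_train, demo_y_test = seeded_split(demo_texts, demo_labels)
print(len(demo_x_train), len(demo_x_test))
```

Running the helper twice with the same seed returns identical lists, which is the property that `random_state=7` provides in the cell above.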

## Initialize a TfidfVectorizer with English stop words and a maximum document frequency of 0.7

Stop words are the most common words in a language; they are filtered out before the text is processed. With max_df=0.7, terms that appear in more than 70% of the documents are discarded as well. A TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.

- Fit and transform the vectorizer on the training set, then use the fitted vectorizer to transform the test set.

In [None]:
#DataFlair - Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
#DataFlair - Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)
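The effect of `stop_words` and `max_df` can be seen in a small hand-rolled TF-IDF. This is a simplified sketch, not the library's code: it splits on whitespace, uses a tiny made-up stop-word set, and applies the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the vectorizer's documented defaults. The function name `tfidf_sketch` and the sample sentences are illustrative assumptions.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of"}  # tiny stand-in for the built-in English list

def tfidf_sketch(docs, max_df=0.7):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS] for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Drop terms that occur in more than max_df of the documents
    vocab = sorted(t for t, df in doc_freq.items() if df <= max_df * n_docs)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

demo_docs = [
    "the senator said the report is false",
    "the senator praised the report",
    "aliens wrote the report",
]
demo_vocab, demo_matrix = tfidf_sketch(demo_docs)
print(demo_vocab)
```

Here "report" occurs in all three documents, which exceeds the 0.7 threshold, so it is excluded from the vocabulary along with the stop words.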

## Accuracy

Initialize a PassiveAggressiveClassifier, an online linear classifier, and fit it on tfidf_train and y_train.\
Predict on the TF-IDF features of the test set and calculate the accuracy with accuracy_score() from sklearn.metrics.

In [None]:
#DataFlair - Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
#DataFlair - Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
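For intuition about what the classifier does during fitting, the following is a minimal sketch of the passive-aggressive update with hinge loss (the PA-I variant). It is not scikit-learn's implementation: it omits the intercept, expects targets encoded as -1 and +1, and the helper names `pa_fit` and `pa_predict` as well as the toy data are assumptions made for illustration.

```python
def pa_fit(samples, targets, C=1.0, epochs=5):
    """Train weights with passive-aggressive (PA-I) updates; targets must be -1 or +1."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            margin = y * sum(w * xi for w, xi in zip(weights, x))
            loss = max(0.0, 1.0 - margin)  # hinge loss
            if loss == 0.0:
                continue  # passive: correct with enough margin, so leave the weights alone
            sq_norm = sum(xi * xi for xi in x)
            if sq_norm == 0.0:
                continue  # an all-zero vector carries no information
            tau = min(C, loss / sq_norm)  # aggressive: step size, capped by C
            weights = [w + tau * y * xi for w, xi in zip(weights, x)]
    return weights

def pa_predict(weights, x):
    """Return +1 or -1 depending on the sign of the linear score."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1

# Toy two-feature data, made up for illustration
demo_samples = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8]]
demo_targets = [1, -1, 1, -1]

demo_weights = pa_fit(demo_samples, demo_targets)
demo_predictions = [pa_predict(demo_weights, x) for x in demo_samples]
print(demo_predictions)
```

The model stays passive when an example is already classified with sufficient margin and otherwise moves the weights just far enough to correct the error, with `C` limiting the size of each step.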

## Confusion matrix
Build a confusion matrix to gain insight into the number of true and false positives and negatives.

In [None]:
#DataFlair - Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
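To read the resulting matrix, recall that rows correspond to the true labels and columns to the predicted labels, both in the order given by `labels`. The sketch below rebuilds that layout in plain Python and derives accuracy, precision and recall from it, treating FAKE as the positive class. The helper `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count (true label, predicted label) pairs; rows are truth, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

# Made-up labels for illustration
demo_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
demo_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]

demo_cm = confusion_counts(demo_true, demo_pred, ["FAKE", "REAL"])

# With FAKE as the positive class:
tp, fn = demo_cm[0]   # true FAKE predicted FAKE / predicted REAL
fp, tn = demo_cm[1]   # true REAL predicted FAKE / predicted REAL

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(demo_cm)
print(accuracy, precision, recall)
```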