### Report on Machine Learning Project
# FAKE NEWS CLASSIFICATION


## Introduction

- Fake news classification refers to the process of identifying and categorizing news articles or other types of media that contain intentionally misleading, false, or fabricated information. With the increasing prevalence of social media and online news sources, fake news has become a serious problem that can have far-reaching consequences. It can be used to spread misinformation, influence public opinion, and even manipulate political outcomes.

- To address this problem, researchers have developed a range of techniques and tools for detecting and classifying fake news. Commonly used methods include machine learning algorithms, natural language processing, and network analysis. These techniques examine the content and structure of news articles, identify patterns and features that distinguish fake from real news, and classify articles by their level of credibility.

- However, detecting and classifying fake news is not always straightforward. One of the key challenges is the dynamic and evolving nature of the phenomenon. Fake news can take many different forms, and new techniques for spreading false information are constantly being developed. As a result, fake news classification requires ongoing research and development of new methods and approaches.

- Despite these challenges, fake news classification matters. Fake news can have serious consequences in areas such as politics, public health, and security: it can undermine trust in the media and other information sources and propagate biased or discriminatory views. Detecting and labeling fake news reduces these effects and helps people make decisions based on accurate and trustworthy information.

- In addition, fake news classification can help promote a more informed and democratic society. When people have access to reliable information, they are better equipped to participate in public discourse, make informed decisions, and hold their leaders accountable. This can contribute to a more just and equitable society, where individuals are empowered to advocate for their own interests and those of their communities.

- The sections below walk through building the model.

## Importing the Libraries

In [None]:
# Data Pre-Processing Libraries
import pandas as pd
import numpy as np
# Data Visualisation Libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
# Handling Warnings
import warnings
warnings.filterwarnings('ignore')
# Text Pre-Processing Libraries
import re
import string

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from wordcloud import WordCloud
# Machine Learning Libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

## Reading the Files

In [None]:
fake = pd.read_csv('Fake.csv')
real = pd.read_csv('True.csv')

## Cleaning and Pre-processing  

In [None]:
fake.head()

In [None]:
fake.tail()

In [None]:
real.head()

In [None]:
real.tail()

In [None]:
real.info()

In [None]:
fake.info()

In [None]:
fake.duplicated().sum()

In [None]:
fake.drop_duplicates(inplace=True)

In [None]:
real.duplicated().sum()

In [None]:
real.drop_duplicates(inplace=True)

In [None]:
real['category'] = 0
fake['category'] = 1

In [None]:
news = pd.concat([real,fake],axis=0,ignore_index=True)
# previewing the new dataset
news.head()

In [None]:
news.tail()

In [None]:
news.info()

In [None]:
news.columns

In [None]:
# Dropping columns not to be used
news.drop(['title','subject','date'],axis=1,inplace=True)
# Removing common punctuation marks (hyphen, comma, period, exclamation mark, question mark)
news['text'] = news['text'].map(lambda x: re.sub(r'[-,.!?]', '', x))
# Converting the text data to lower case
news['text'] = news['text'].map(lambda x: x.lower())
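
The regular expression in the cell above removes only five characters (`-`, `,`, `.`, `!`, `?`), so quotes, colons, brackets, and similar symbols stay in the text. If the goal is to strip all ASCII punctuation, one possible approach is a translation table built from `string.punctuation`. This is a minimal sketch: the helper name `strip_punctuation` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# i.e. deletes it. Built once and reused for every document.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text: str) -> str:
    """Hypothetical helper: lower-case the text and drop ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)

# Made-up sample sentence, for illustration only
sample = 'BREAKING: "Senate" votes 52-48 (again)!'
print(strip_punctuation(sample))
```

On the DataFrame this could be applied with `news['text'].map(strip_punctuation)`. Non-ASCII punctuation such as curly quotes is not part of `string.punctuation` and would need separate handling.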

In [None]:
# Joining all the processed article texts into one long string.
long_string = ' '.join(news['text'])

# Creating a WordCloud object
wordcloud = WordCloud()

# Generating a word cloud
wordcloud.generate(long_string)

# Visualizing the word cloud
wordcloud.to_image()

In [None]:
# Loading the English language model in spaCy
import spacy
nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    # Parsing the text with spaCy
    doc = nlp(text)

    # Lemmatizing the tokens and removing stop words
    lemmas = [token.lemma_ for token in doc if not token.is_stop]

    # Joining the lemmas back into a string and returning it
    return " ".join(lemmas)

# applying the preprocess_text function to the text column
news['text'] = news['text'].apply(preprocess_text)

In [None]:
# Loading splitting library
from sklearn.model_selection import train_test_split

# Defining the independent variable
X = news['text']

# Defining the dependent variable
y = news['category']

# Splitting the data into training and testing set
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.8,random_state=42)

In [None]:
# Loading count vectorizer library
from sklearn.feature_extraction.text import CountVectorizer

# Instantiating count vectorizer
cv = CountVectorizer()

# Fitting and transforming X train 
X_train_vect = cv.fit_transform(X_train)

# Transforming X test with the vocabulary learned from X train
X_test_vect = cv.transform(X_test)
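
To make the vectorization step concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind `CountVectorizer`: learn a vocabulary from the training texts only, then count those words in any text. It is a simplified illustration, not scikit-learn's implementation. The function names `tokenize`, `fit_vocabulary`, and `transform` and the toy sentences are assumptions. The token pattern follows `CountVectorizer`'s documented default (lower-cased words of two or more characters).

```python
import re
from collections import Counter

# Words of two or more word characters, as in CountVectorizer's default pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(train_texts):
    """Map each distinct training word to a column index (alphabetical order)."""
    words = sorted({tok for text in train_texts for tok in tokenize(text)})
    return {word: idx for idx, word in enumerate(words)}

def transform(texts, vocabulary):
    """Turn each text into a row of counts; words outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        row = [0] * len(vocabulary)
        for word, idx in vocabulary.items():
            row[idx] = counts[word]
        rows.append(row)
    return rows

# Toy data, made up for illustration
train_texts = ["the senate passed the bill", "aliens passed the senate"]
test_texts = ["the bill shocked aliens"]

vocab = fit_vocabulary(train_texts)
print(vocab)
print(transform(train_texts, vocab))
print(transform(test_texts, vocab))
```

Words that never appeared in the training texts (here `shocked`) are dropped at transform time. This is why the test set is only transformed and never used for fitting.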

## Modeling the Data

In [None]:
# Instantiating logistic regression
logreg = LogisticRegression(random_state = 42)
logreg.fit(X_train_vect,y_train)

# Predicting the value of y_train using the model
y_pred_train = logreg.predict(X_train_vect)

# Predicting the value of y_test using the model
y_pred_test = logreg.predict(X_test_vect)


# Accuracy of the training and testing data
train_accuracy = accuracy_score(y_train,y_pred_train)
test_accuracy = accuracy_score(y_test,y_pred_test)
print(f'Train accuracy - {train_accuracy} \nTest accuracy - {test_accuracy}')

In [None]:
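from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_test)
print(cm)

In [None]:
# fmt='d' shows the cell counts as whole numbers
sns.heatmap(cm, annot=True, fmt='d')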
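
As a reading aid for the matrix above: scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, in sorted label order. With the encoding used in this notebook (0 = real, 1 = fake), the layout is `[[TN, FP], [FN, TP]]` when "fake" is treated as the positive class. The sketch below rebuilds that layout and the derived metrics in plain Python. The function name `binary_confusion` and the labels are made up for illustration and are not results from the dataset.

```python
def binary_confusion(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels 0 (real) and 1 (fake)."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        # Row index = true label, column index = predicted label
        matrix[truth][pred] += 1
    return matrix

# Toy labels, not taken from the news dataset
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

(tn, fp), (fn, tp) = binary_confusion(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)   # share of all articles classified correctly
precision = tp / (tp + fp)           # of articles flagged as fake, share truly fake
recall = tp / (tp + fn)              # of truly fake articles, share that was flagged

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

Looking at precision and recall alongside accuracy shows whether the errors are mostly real articles flagged as fake (FP) or fake articles that slipped through (FN).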

In [None]:
# Classification report for training data
categories=['real','fake']
print(classification_report(y_train,y_pred_train,target_names=categories,digits=4))

In [None]:
# Classification report for testing data
print(classification_report(y_test,y_pred_test,target_names=categories,digits=4))

## Conclusion

Overall, the field of fake news classification is constantly evolving as researchers develop new and more effective methods for detecting and classifying fake news. Continued investment in this field can help combat the spread of fake news and promote a more informed and democratic society that values accuracy, transparency, and accountability. In this project, the logistic regression model trained on word counts reached a 99% accuracy rate.