<a href="https://www.kaggle.com/code/mayukhhaldar1/kagglefakenewspredictionfinal?scriptVersionId=188966012" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Description
Develop a machine learning program to identify when an article might be fake news. Run by the UTK Machine Learning Club.

## Evaluation
The evaluation metric for this competition is accuracy, a straightforward metric: the fraction of predictions that are correct.

accuracy = correct predictions / (correct predictions + incorrect predictions)

Accuracy weights false positives and false negatives equally, so it should really only be used in simple cases where the classes are roughly equal in size.
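
To illustrate why class balance matters, here is a minimal sketch of the formula above. The labels are made up for illustration and do not come from the competition data; `accuracy` is a hypothetical helper, not the scikit-learn function used later in the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical imbalanced labels: 9 reliable (0) articles and 1 fake (1).
y_true = [0] * 9 + [1]
# A trivial "model" that always predicts the majority class.
y_always_reliable = [0] * 10

imbalanced_score = accuracy(y_true, y_always_reliable)
print(imbalanced_score)  # 0.9, although the only fake article is missed
```

With balanced classes, the same always-reliable predictor would score only about 0.5, which is why accuracy is more trustworthy when the classes are of similar size.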

## Submission Format
For every article in the test dataset, submission files should contain two columns: `id` and `label`. The `id` column should refer to a row in the `test.csv` file, and the `label` column should give its class: reliable (`0`) or potentially fake (`1`).

The file should contain a header and have the following format:

```
id,label
182041,1
182042,0
182043,1
182044,0
etc.
```
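
A quick sanity check of this format could look like the sketch below. `check_submission` is a hypothetical helper written for illustration, not part of the competition tooling, and it assumes the expected ids are known from `test.csv`.

```python
import csv
import io

def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission; empty means none found."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header must be exactly the two expected columns, in order.
    if reader.fieldnames != ["id", "label"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    rows = list(reader)
    ids = [row["id"] for row in rows]
    if sorted(ids) != sorted(str(i) for i in expected_ids):
        problems.append("ids do not match the test set")
    bad_labels = [row["label"] for row in rows if row["label"] not in {"0", "1"}]
    if bad_labels:
        problems.append(f"labels must be 0 or 1, found: {bad_labels}")
    return problems

sample = "id,label\n182041,1\n182042,0\n182043,1\n182044,0\n"
result = check_submission(sample, [182041, 182042, 182043, 182044])
print(result)  # an empty list means no problems were found
```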

## Citation

[@fake-news]: William Lifferth. (2018). *Fake News*. Kaggle. Available at: https://kaggle.com/competitions/fake-news


## Dataset Description

1. train.csv: A full training dataset with the following attributes:

- id: unique id for a news article
- title: the title of a news article
- author: author of the news article
- text: the text of the article; could be incomplete
- label: a label that marks the article as potentially unreliable
  - 1: unreliable
  - 0: reliable

2. test.csv: A test dataset with all the same attributes as train.csv, but without the label.

3. submit.csv: A sample submission file.

## Importing the Dependencies

In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>


False

In [3]:
# Printing the stopwords in english
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## Data Pre-Processing

In [4]:
#loading the datasets to Pandas DataFrames
df_train = pd.read_csv('/kaggle/input/fake-news/train.csv')
df_test=pd.read_csv('/kaggle/input/fake-news/test.csv')
df_train.shape, df_test.shape

((20800, 5), (5200, 4))

In [5]:
# Preparing the train and test input features
x_train=df_train.drop(columns=['id','label'])
x_test=df_test.drop(columns=['id'])

In [6]:
# Preparing the train target label
Y_train=df_train.label

In [7]:
# Looking for missing values
x_train.info(), x_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   20242 non-null  object
 1   author  18843 non-null  object
 2   text    20761 non-null  object
dtypes: object(3)
memory usage: 487.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   5078 non-null   object
 1   author  4697 non-null   object
 2   text    5193 non-null   object
dtypes: object(3)
memory usage: 122.0+ KB


(None, None)

In [8]:
# Handling Missing Values 
x_train=x_train.fillna('')
x_test=x_test.fillna('')

In [9]:
x_train.info(), x_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   20800 non-null  object
 1   author  20800 non-null  object
 2   text    20800 non-null  object
dtypes: object(3)
memory usage: 487.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   5200 non-null   object
 1   author  5200 non-null   object
 2   text    5200 non-null   object
dtypes: object(3)
memory usage: 122.0+ KB


(None, None)

## Stemming

Stemming is the process of reducing a word to its root form, typically by stripping suffixes.

Example:
connected, connecting, connection --> connect

In [10]:
# Concatenating the two input features author and title into one single input feature content
x_train['content'] = x_train['author']+' '+x_train['title']
x_test['content'] = x_test['author']+' '+x_test['title']

In [11]:
# Loading the Porter Stemmer
port_stem=PorterStemmer()

In [12]:
# Building the stop-word set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

# Defining the Stemming function
def stemming(content):
    # Keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # Drop stop words and stem the remaining words
    stemmed_words = [port_stem.stem(word) for word in words if word not in stop_words]
    return ' '.join(stemmed_words)
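
The cleaning steps in `stemming` can be traced on a single string. The sketch below is a simplified stand-in so that it runs without NLTK: it uses a tiny made-up stop-word set, and its `stem` argument defaults to a function that returns each word unchanged. `clean_tokens` and `DEMO_STOP_WORDS` are hypothetical names.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "of", "in", "a", "is", "to", "and"}

def clean_tokens(text, stop_words=DEMO_STOP_WORDS, stem=lambda word: word):
    """Keep letters only, lowercase, split, drop stop words, then stem."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [stem(token) for token in tokens if token not in stop_words]

example = "Jane Doe The Rise of A.I. in 2016!"
tokens = clean_tokens(example)
print(tokens)  # ['jane', 'doe', 'rise', 'i']
```

Note that digits and punctuation are discarded, so "2016" disappears and "A.I." is split into single letters. With NLTK's full list, the stray "i" would also be removed because it is an English stop word.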

## Applying Stemming to the DataFrames

In [13]:
x_train['content']=x_train['content'].apply(stemming)

In [14]:
x_test['content']=x_test['content'].apply(stemming)

## TF-IDF (Term Frequency - Inverse Document Frequency) Vectorization

In [15]:
# converting textual data to numerical data
vectorizer=TfidfVectorizer()
X_train=vectorizer.fit_transform(x_train['content'])
X_test=vectorizer.transform(x_test['content'])
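
To make the weighting concrete, here is a rough pure-Python sketch of TF-IDF. It assumes raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization, which to my understanding matches the `TfidfVectorizer` defaults. The whitespace tokenization is a simplification, and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = ["trump elect news", "clinton elect email", "weather news"]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "trump" receives a higher weight than "elect" or "news" because it appears in only one of the three documents.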

In [16]:
print(X_train)
print(X_test)

  (0, 15686)	0.28485063562728646
  (0, 2483)	0.3676519686797209
  (0, 7692)	0.24785219520671603
  (0, 8630)	0.29212514087043684
  (0, 2959)	0.2468450128533713
  (0, 13473)	0.2565896679337957
  (0, 4973)	0.23331696690935097
  (0, 267)	0.2701012497770876
  (0, 3792)	0.2705332480845492
  (0, 7005)	0.21874169089359144
  (0, 8909)	0.3635963806326075
  (0, 3600)	0.3598939188262559
  (1, 1894)	0.15521974226349364
  (1, 2223)	0.3827320386859759
  (1, 16799)	0.30071745655510157
  (1, 1497)	0.2939891562094648
  (1, 2813)	0.19094574062359204
  (1, 6816)	0.19046601982968486
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (2, 5389)	0.3866530551182615
  (2, 5968)	0.3474613386728292
  (2, 9620)	0.49351492943649944
  (2, 15611)	0.41544962664721613
  (2, 2943)	0.31798868006546904
  :	:
  (20797, 1287)	0.33538056804139865
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 14967)	0.31159453154880756
  (20797, 12138)	0.24778257724396507
  (20797, 9518)	0.29

## Training the Model: Logistic Regression

In [17]:
# Loading the Logistic Regression Model For Binary Classification Task
model=LogisticRegression()

In [18]:
# Training the Model
model.fit(X_train, Y_train)

## Evaluation

In [19]:
# accuracy score on the training data
X_train_prediction=model.predict(X_train)
training_data_accuracy=accuracy_score(Y_train, X_train_prediction)

In [20]:
print(f"Accuracy score of the training data: {training_data_accuracy}")

Accuracy score of the training data: 0.9883173076923077
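
The score above is measured on the same rows the model was trained on, so it likely overstates performance on unseen articles. One common remedy is to hold out part of the training data for validation. The sketch below shows the idea with a hypothetical pure-Python helper, `stratified_split_indices`, and made-up labels standing in for `Y_train`.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, test_size=0.2, seed=0):
    """Split row indices into train/validation, keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Hypothetical labels standing in for Y_train.
labels = [0] * 50 + [1] * 50
train_idx, val_idx = stratified_split_indices(labels, test_size=0.2)
print(len(train_idx), len(val_idx))  # 80 20
```

In the notebook itself, the already-imported `train_test_split` (with its `test_size`, `stratify`, and `random_state` parameters) would serve the same purpose. The model would then be fit on the training portion and scored on the held-out portion.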


## Preparing the Submission File

In [21]:
X_test_prediction=model.predict(X_test)

In [22]:
df_submit = pd.read_csv('/kaggle/input/fake-news/submit.csv')

In [23]:
df_submit['label'] = X_test_prediction
df_submit['label'] = df_submit['label'].astype(int)

In [24]:
df_submit.to_csv('submission.csv', index=False)