### Data Description

train.csv: A full training dataset with the following attributes:

id: unique id for a news article \
title: the title of a news article \
author: author of the news article \
text: the text of the article; could be incomplete \
label: a label that marks the article as potentially unreliable \
    1: unreliable \
    0: reliable 


### Importing the dependencies

In [None]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk
nltk.download('stopwords')
# print(stopwords.words(fileids='english'))

### Data pre-processing

#### Loading the dataset

In [None]:
news_dataset = pd.read_csv('train.csv')
news_dataset.shape

#### Data analysis

In [None]:
news_dataset.head()

Missing values can be handled in two ways: impute them, or remove the rows that contain them.
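
To make the trade-off concrete, the sketch below compares both options on a small made-up frame. The rows and values are invented for illustration and are not taken from train.csv. Imputing with an empty string keeps every row, while dropping discards any row that has a missing field.

```python
import numpy as np
import pandas as pd

# Toy data standing in for train.csv; values are invented for illustration.
toy = pd.DataFrame({
    'title': ['Headline A', np.nan, 'Headline C'],
    'author': ['Jane Doe', 'John Roe', np.nan],
    'label': [0, 1, 1],
})

# Option 1: impute missing text with an empty string (keeps every row).
imputed = toy.fillna('')

# Option 2: drop any row that has at least one missing value.
dropped = toy.dropna()

# Row counts: original, imputed, dropped.
print(len(toy), len(imputed), len(dropped))
```

On real data, comparing these row counts is a quick way to judge how much training material the drop option would cost.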

### First approach: use title and author to detect fake news, imputing missing values with an empty string

In [None]:
news_dataset = news_dataset.fillna('')
news_dataset.isna().sum()

Merge the author name and the news title into a single `content` column.

In [None]:
news_dataset['content'] = news_dataset['author'] + ' ' + news_dataset['title']
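
The empty-string imputation matters for this concatenation: in pandas, adding a string to a missing value yields a missing value, so a row lacking either field would lose its entire `content`. A minimal sketch on invented data (the frame below is illustrative only):

```python
import numpy as np
import pandas as pd

# Toy rows; the second one has no author.
toy = pd.DataFrame({
    'author': ['Jane Doe', np.nan],
    'title': ['Headline A', 'Headline B'],
})

# Without imputation, the missing author propagates to the combined column.
raw_content = toy['author'] + ' ' + toy['title']

# With empty-string imputation, the title survives (with a leading space).
filled = toy.fillna('')
filled_content = filled['author'] + ' ' + filled['title']

print(raw_content.isna().tolist())
print(filled_content.tolist())
```

The leading space left on rows without an author is harmless here, because later tokenization splits on whitespace.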

In [None]:
news_dataset.head()

#### Separating features and label

In [None]:
X = news_dataset.drop(columns='label')
y = news_dataset['label']

#### Data pre-processing

In [None]:
port_stem = PorterStemmer()

In [None]:
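# Build the stopword set once so each lookup is a constant-time set test.
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # Keep letters only; digits and punctuation become spaces.
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # Lowercase and tokenize on whitespace.
    stemmed_content = stemmed_content.lower().split()
    # Drop stopwords and reduce the remaining words to their Porter stems.
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    return ' '.join(stemmed_content)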
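
To see what the steps before stemming do to a raw string, here is a standard-library-only sketch. It uses a tiny hand-picked stopword set (`SAMPLE_STOPWORDS`, a hypothetical stand-in for NLTK's English list) and a hypothetical helper `clean_tokens`. It deliberately omits the Porter stemmer, so it illustrates only the cleaning and filtering stages.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); far smaller than NLTK's list.
SAMPLE_STOPWORDS = {'the', 'of', 'a', 'in', 'is'}

def clean_tokens(content):
    # Replace every non-letter character (digits, punctuation) with a space.
    letters_only = re.sub('[^a-zA-Z]', ' ', content)
    # Lowercase and split on whitespace.
    tokens = letters_only.lower().split()
    # Drop stopwords; a set makes each membership test constant time.
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]

print(clean_tokens("Jane Doe: The 5 Biggest Stories of 2016!"))
# → ['jane', 'doe', 'biggest', 'stories']
```

Note that the regular expression removes digits as well as punctuation, so numbers such as years never reach the vectorizer.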