### Data Description

train.csv: A full training dataset with the following attributes:

id: unique id for a news article \
title: the title of a news article \
author: author of the news article \
text: the text of the article; could be incomplete \
label: a label that marks the article as potentially unreliable \
    1: unreliable \
    0: reliable 


### Importing the dependencies

In [None]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk
nltk.download('stopwords')
# print(stopwords.words(fileids='english'))

### Data pre-processing

#### Loading the dataset

In [None]:
news_dataset = pd.read_csv('train.csv')
news_dataset.shape

#### Data analysis

In [None]:
news_dataset.head()

We can either impute the missing values or remove the rows that contain nulls.
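To decide between the two, it helps to first count how many values are missing in each column and how many rows would be lost by dropping. The sketch below uses a small hypothetical frame (`toy`) with the same columns as `train.csv`, not the real data; on the actual dataset the same calls would be made on `news_dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same columns as train.csv (not the real dataset)
toy = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'title': ['A', np.nan, 'C', 'D'],
    'author': ['X', 'Y', np.nan, np.nan],
    'text': ['some text', 'more text', np.nan, 'other text'],
    'label': [0, 1, 1, 0],
})

# Number of missing values per column
null_counts = toy.isna().sum()
print(null_counts)

# Number of rows that a plain dropna() would remove (rows with at least one null)
rows_with_nulls = int(toy.isna().any(axis=1).sum())
print(rows_with_nulls, 'of', len(toy), 'rows contain a null')
```

If most nulls sit in a column that is not used as a feature, dropping whole rows discards more data than necessary, which is an argument for imputing instead.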

In the first approach, we use only the title and author to detect fake news, so we simply impute missing values with an empty string.

In [None]:
# news_dataset.dropna(inplace=True)
news_dataset = news_dataset.fillna('')
news_dataset.isna().sum()
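One reason an empty string is a convenient fill value: if `author` and `title` are concatenated into a single text feature, a null in either column makes the whole result null, while an empty string leaves the other column's text intact. A minimal sketch on hypothetical toy data (the frame `toy` and the name `content` are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one is missing the author, the other is missing the title
toy = pd.DataFrame({
    'title': ['Budget vote delayed', np.nan],
    'author': [np.nan, 'Jane Doe'],
})

# Without imputation, string concatenation propagates the null
raw_content = toy['author'] + ' ' + toy['title']
print(raw_content.isna().sum())

# After filling with '', the text from the other column survives
filled = toy.fillna('')
content = filled['author'] + ' ' + filled['title']
print(content.tolist())
```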

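In the second approach, 'text' is treated as the most important column, so we remove rows with a null 'text' and impute the other columns. This is an alternative to the first approach, not a follow-up: once the nulls have been replaced with empty strings, there is nothing left to drop or impute.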

In [None]:
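# Alternative to the first approach: run this on the freshly loaded data,
# because fillna('') above already replaced every null.
# Imputation (assign the result back; inplace=True on a selected column
# may not update the DataFrame in recent pandas versions)
news_dataset['title'] = news_dataset['title'].fillna("Unknown Title")
news_dataset['author'] = news_dataset['author'].fillna("Anonymous")

# Drop rows with missing 'text', as it's a critical feature
news_dataset.dropna(subset=['text'], inplace=True)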
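The effect of this second approach can be checked on a small example. The sketch below runs the same two steps on hypothetical toy data (`toy` is not the real dataset), passing a per-column dictionary to `fillna` so that each column gets its own placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has no article text, row 0 has no author
toy = pd.DataFrame({
    'title': ['A', np.nan, 'C'],
    'author': [np.nan, 'Y', 'Z'],
    'text': ['body one', np.nan, 'body three'],
    'label': [0, 1, 1],
})

# Drop rows with no article text, then fill the remaining gaps with placeholders
cleaned = toy.dropna(subset=['text']).fillna(
    {'title': 'Unknown Title', 'author': 'Anonymous'}
)

print(cleaned.shape)
print(cleaned.isna().sum().sum())
```

Only the row without text is removed; the row with a missing author is kept and labelled "Anonymous".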