# Dataset Description

**train.csv**: A full training dataset with the following attributes:

- **id**: unique id for a news article
- **title**: the title of a news article
- **author**: author of the news article
- **text**: the text of the article; could be incomplete
- **label**: a label that marks the article as potentially unreliable
    - 1: unreliable
    - 0: reliable

**test.csv**: A testing dataset with all the same attributes as train.csv without the label.

# Importing Libraries

In [2]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gusta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# Print the list of English stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

# Data Preprocessing

In [5]:
# Load the dataset
news_dataset = pd.read_csv('./train.csv')

In [6]:
news_dataset.shape

(20800, 5)

20800 ---> rows (articles)

5 ---> columns

In [7]:
# Print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [21]:
# Counting the number of missing values in the dataset
print("Missing values in the dataset:")
print(news_dataset.isnull().sum())

print("-------------------------")
# Percentage of missing values in each column
print("Percentage of missing values in the dataset:")
print((news_dataset.isnull().sum() / news_dataset.shape[0]) * 100)

Missing values in the dataset:
id           0
title      558
author    1957
text        39
label        0
dtype: int64
-------------------------
Percentage of missing values in the dataset:
id        0.000000
title     2.682692
author    9.408654
text      0.187500
label     0.000000
dtype: float64


For this analysis, we fill missing values with empty strings. Fewer than 3% of titles and 0.2% of texts are missing, and although about 9.4% of authors are missing, an empty string still lets those rows be used. In larger or more critical projects you would likely handle missing values differently. Rather than blanking or deleting them, you might fill the gaps from other available information (a process called imputation) or first investigate why the data is missing before deciding on an approach.
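As a rough illustration of those alternatives, the sketch below uses a tiny made-up DataFrame (not the real `train.csv`) to check whether missing authors are spread evenly across labels, and to fill them with an explicit placeholder instead of an empty string. The `'unknown'` token and the `author_missing` column are assumptions chosen for this example.

```python
import pandas as pd

# Toy data standing in for train.csv (values are invented for illustration)
toy = pd.DataFrame({
    'author': ['Ann Lee', None, 'Bo Chen', None, None, 'Cy Diaz'],
    'title':  ['A', 'B', None, 'D', 'E', 'F'],
    'label':  [0, 1, 0, 1, 1, 0],
})

# Record missingness before filling, so the information is not lost
toy['author_missing'] = toy['author'].isnull()

# Share of rows with a missing author, per label
missing_rate_by_label = toy.groupby('label')['author_missing'].mean()
print(missing_rate_by_label)

# Fill with an explicit placeholder token rather than an empty string
filled = toy.fillna({'author': 'unknown', 'title': ''})
print(filled)
```

If missingness lines up with the label, as it does in this toy data, the missing-author flag is itself a signal worth keeping.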

In [22]:
# Replacing the null values with empty strings
news_dataset = news_dataset.fillna('')

In [81]:
# Merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [78]:
print('Content:', news_dataset['content'])

Content: 0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799    David Swanson What Keeps the F-35 Alive   Davi...
Name: content, Length: 20800, dtype: object


In [None]:
# Separating the data & label (Optional)
X = news_dataset.drop(columns='label')
y = news_dataset['label']

In [26]:
print(X)
print(y)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

# Stemming

Stemming reduces a word to its stem by stripping affixes (prefixes and suffixes). Unlike a lemma, which is the dictionary form produced by lemmatization, a stem need not be a real word. Stemming is a common normalization step in natural language processing (NLP) and natural language understanding (NLU).

e.g. intelligence, intelligent, intelligently, intelligible ---> intellig

In [28]:
port_stem = PorterStemmer()

In [29]:
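# Build the stopword set once; calling stopwords.words('english') for every word is very slow
english_stopwords = set(stopwords.words('english'))

# Function to clean, tokenize, and stem the content
def stemming(content):
    # Replace every non-letter character with a space
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # Drop stopwords, then stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content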

## Explanation of the Stemming Function

The `stemming` function cleans and simplifies text by following these steps:

- **Remove Non-Letters:** It replaces all characters that are not letters with spaces.
- **Lowercase Conversion:** It converts all text to lowercase for consistency.
- **Split into Words:** The text is divided into individual words.
- **Remove Common Words:** It discards common words (like "the", "is") that don't add much meaning.
- **Simplify Words:** Using a stemmer, each remaining word is reduced to its base form (e.g., "running" becomes "run").
- **Join Words:** The simplified words are combined back into a single string.

This process helps in analyzing text by focusing on the core meaning of each word.
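To make the steps concrete, here is a small trace of the same operations on one headline from the dataset. It is a sketch only: the three-word stopword set is a stand-in chosen for this example so that nothing needs to be downloaded, whereas the notebook uses NLTK's full English list.

```python
import re
from nltk.stem.porter import PorterStemmer

# Stand-in stopword set for illustration (assumption: the notebook uses NLTK's full list)
demo_stopwords = {'why', 'the', 'you'}
stemmer = PorterStemmer()

raw = 'Consortiumnews.com Why the Truth Might Get You Fired'

letters_only = re.sub('[^a-zA-Z]', ' ', raw)            # 1. non-letters become spaces
lowered = letters_only.lower()                          # 2. lowercase
tokens = lowered.split()                                # 3. split into words
kept = [w for w in tokens if w not in demo_stopwords]   # 4. drop stopwords
stems = [stemmer.stem(w) for w in kept]                 # 5. stem each word
result = ' '.join(stems)                                # 6. join back into one string

print(tokens)
print(kept)
print(result)
```

The final string should match the stemmed form of this row in the notebook's own output (`consortiumnew com truth might get fire`).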

In [79]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [31]:
print('Content:', news_dataset['content'])

Content: 0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object


In [32]:
# Separating the data & label
X = news_dataset['content'].values
y = news_dataset['label'].values

In [33]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [34]:
print(y)

[1 0 1 ... 0 1 1]


In [37]:
# Check the shape of X and y
print("X.shape:", X.shape)
print("y.shape:", y.shape)

X.shape: (20800,)
y.shape: (20800,)


In [38]:
# Converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)

In [39]:
print(X)

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420
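The sparse printout above lists `(row, column)` pairs with their TF-IDF weight. As a rough sketch of where such numbers come from, the plain-Python function below applies the formula documented for `TfidfVectorizer`'s default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row. The three toy documents, the whitespace tokenization, and the helper name `tfidf_rows` are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row maps column index -> weight."""
    # Simplification: split on whitespace instead of the vectorizer's token pattern
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    n_docs = len(docs)

    # Document frequency: number of documents containing each word
    df = Counter(w for tokens in tokenized for w in set(tokens))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {index[w]: counts[w] * idf[w] for w in counts}
        # L2-normalize the row; guard against an empty document
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({col: v / norm for col, v in weights.items()})
    return vocab, rows

docs = [
    'truth might get fire',
    'trump poster child',
    'trump truth keep aliv',
]
vocab, rows = tfidf_rows(docs)

# Print in the same "(row, column) weight" layout as the sparse matrix
for r, row in enumerate(rows):
    for col in sorted(row, reverse=True):
        print(f'  ({r}, {col})\t{row[col]:.4f}  <- {vocab[col]}')
```

Words that occur in fewer documents receive a larger weight within a row, which is why rare, distinctive terms dominate each article's vector.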

# Splitting the Data into Training and Testing Sets

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)
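
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both splits. A minimal sketch on made-up labels (the 70/30 class mix below is an assumption, not the real dataset's balance):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 70 of class 0 and 30 of class 1 (invented for illustration)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# With stratification, both splits keep the same share of class 1
print('train share of class 1:', y_tr.mean())
print('test share of class 1:', y_te.mean())
```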

# Training the Model: Logistic Regression

In [42]:
model = LogisticRegression()

In [43]:
# Training the Logistic Regression model with training data
model.fit(X_train, y_train)


# Evaluation

Accuracy Score

In [44]:
# Accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction)
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9863581730769231


In [45]:
# Accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction)
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9790865384615385
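
Accuracy is the fraction of predictions that match the true labels. Because it says nothing about which class the errors fall on, it can help to also count the four outcome types. The sketch below does this in plain Python on invented labels; the helper name `confusion_counts` is hypothetical, and label 1 means unreliable, as in the dataset description.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes, treating label 1 (unreliable) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fake, predicted fake
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # real, predicted real
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real, predicted fake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fake, predicted real
    return tp, tn, fp, fn

# Invented labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo)
demo_accuracy = (tp + tn) / len(y_true_demo)
demo_precision = tp / (tp + fp)   # of articles flagged fake, how many were fake
demo_recall = tp / (tp + fn)      # of fake articles, how many were flagged

print('tp, tn, fp, fn:', tp, tn, fp, fn)
print('accuracy:', demo_accuracy)
print('precision:', demo_precision)
print('recall:', demo_recall)
```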


# Making a Predictive System

In [46]:
X_new = X_test[0]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
    print('The news is Real')
else:
    print('The news is Fake')

[1]
The news is Fake


# Checking the Model with Test Cases

In [69]:
# Load test data
test_data = pd.read_csv('./test.csv')
result_data = pd.read_csv('./submit.csv')
test_data.head()


Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


In [71]:
# Fill the missing values with empty strings
test_data = test_data.fillna('')

In [72]:
def predict_fake_news(title, author, classifier, vectorizer):
    """
    Predict if a news article is fake or reliable based on its title and author.
    
    Parameters:
      title (str): The title of the news article.
      author (str): The author of the news article.
      classifier: The trained classifier model.
      vectorizer: The fitted TF-IDF vectorizer.
      
    Returns:
      int: 1 if the article is predicted to be fake, 0 if reliable.
    """
    # Combine author and title in the same order used to build the training 'content' column
    combined_text = author + " " + title

    # Preprocess the combined text using the stemming function
    processed_text = stemming(combined_text)

    # Transform the processed text into TF-IDF features
    features = vectorizer.transform([processed_text])

    # predict() returns a one-element array; unwrap it into a plain int
    prediction = classifier.predict(features)
    
    return int(prediction[0])

# Apply the function to every row of the test data set
test_data['label'] = test_data.apply(lambda x: predict_fake_news(x['title'], x['author'], model, vectorizer), axis=1)


In [73]:
test_data.head()

Unnamed: 0,id,title,author,text,label
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning...",0
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...,1
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...,1
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different...",0
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...,1


In [75]:
# Convert label columns to integers
# Assumes submit.csv holds the reference labels; test_data['label'] holds the model's predictions
y_true = result_data['label'].astype(int)
y_pred = test_data['label'].astype(int)

In [76]:
# Calculate and print the accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Accuracy: 64.37%


# Conclusion

The analysis shows that the Fake News Prediction pipeline is a workable, clearly structured baseline:

- **Data Preprocessing:** Missing values are filled with empty strings, and the combined author and title text is cleaned and standardized using a custom `stemming` function.
- **Feature Extraction:** TF-IDF vectorization converts the preprocessed text into numerical features that weight distinctive words more heavily than common ones.
- **Model Training & Evaluation:** A Logistic Regression classifier reaches about 98.6% accuracy on the training split and 97.9% on the held-out 20% split.
- **Testing & Predictions:** The prediction function chains preprocessing, feature extraction, and classification. Its predictions on `test.csv` agree with the labels in `submit.csv` only 64.37% of the time, well below the held-out score, so performance on new data should be treated with caution.

Overall, the system is a reasonable baseline for fake news detection with room for improvement. Future work could explore advanced feature engineering techniques, alternative classification algorithms, or ensemble methods to improve performance and robustness.