# **NEWS CLASSIFIER**

## DATA MINING - FALL 2022
## CUNY GRADUATE CENTER
## Edward Miller

## Overview

Over the past few decades, news sources have shifted drastically from print media to
online sites as the general public adopted the internet. As a
consequence of this shift, it has become easier for news articles from untrustworthy
sources to pass themselves off as real and for disinformation to spread rapidly. This
disinformation can cause widespread damage and is often used to push false narratives that
benefit a political party or a government. It has already harmed democratic
institutions by eroding trust in media sources and by motivating extreme actions based
on incorrect information, such as the Capitol riot on January 6, 2021. Therefore, it is now
more important than ever to develop sound methods for determining whether a news
story is real. This project uses deep learning to train a
model that classifies news stories as real or fake.

## Goals
1. Train a deep learning model for binary classification that determines whether a
news story is real.
2. Test the trained model on a held-out dataset to determine how well it differentiates between the
two classes, and report the findings.

## Dataset
The dataset, as well as the direct inspiration for this project, comes from the following Kaggle
notebook. Its author offers several useful insights that are reused here.

https://www.kaggle.com/code/urkchar/determine-if-news-is-fake-or-real/notebook

Progress will be tracked on GitHub here:

https://github.com/EdwardMMiller/Data-Mining-Project---Fall-2022/new/main

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# importing basic libraries

from sklearn import metrics
from sklearn.utils import shuffle
from sklearn.model_selection import GridSearchCV, train_test_split

# importing sklearn packages

import re
# importing text library

True_df = pd.read_csv('True.csv')
Fake_df = pd.read_csv('Fake.csv')
# getting the csv files into dataframe

print("Number of rows in True_df = ", True_df.shape[0])
print("Number of rows in Fake_df = ", Fake_df.shape[0])
print("Number of columns in True_df = ", True_df.shape[1])
print("Number of columns in Fake_df = ", Fake_df.shape[1])
# Checking number of rows and columns in each dataset

columns_list_T = True_df.columns.tolist()
columns_list_F = Fake_df.columns.tolist()
print("List of columns in True_df", columns_list_T)
print("List of columns in Fake_df", columns_list_F)
# checking to see if columns are the same
#%%
def data_file_explore(file, df):

  """
  This is a simple function designed to do surface-level
  exploration of the data in a csv file. It prints the
  first five rows of the data file, the data types, the number of rows and
  columns, the number of unique values per column, and the number of missing
  values per column. It also prints the minimum, maximum, mean and median
  for each numeric column, along with the mode of the column and
  the number of times the mode appears. Finally, it prints the
  most frequent string in each string column along with the number of
  times that string appears. It is assumed that the data
  being worked with is either numeric or a string.

  Args:
    file: string referencing the filename
    df: a data frame created from reading in the file
  Returns:
    None
  """


  print("*********************** FILE NAME: %s ***********************\n" %file)

  # This is just a file header

  print("First five rows in file: %s\n" %file)
  display(df.head())
  # looking at the first five rows of the data

  print("\nData types present in file: %s\n" %file)
  print(df.dtypes)
  # looking at data types found in the dataframe

  print("\nNumber of rows and columns in file: %s\n" %file)
  display(df.shape)
  # Getting the size of the dataset

  print("\nNumber of unique values for each column in file: %s\n" %file)
  display(df.nunique())
  # Looking at number of unique values

  print("\nCount of missing values in file: %s\n" %file)
  display(df.isnull().sum())
  print("")
  # Counting the missing values in the datasets

  for col in df.select_dtypes(include=np.number):
    # Looping through numeric columns in the data frame
    print("Numerical Stats for column - '%s' " % col)
    print("--------------------------------------------")
    print("Min: %s Mean: %s" % (df[col].min(), df[col].mean()))
    print("Max: %s Median: %s" % (df[col].max(), df[col].median()))
    # Printing the minimum, maximum, mean and median

    col_mode = df[col].value_counts().idxmax()
    # get the most frequent value in the column
    freq = df[col].value_counts()[col_mode]
    # get the count of the most frequent value in the column

    if freq > 1:
      # Only reporting the most frequent value if it appears more than once
      print("Most frequent value: %s found in column %s times.\n" % (col_mode, freq))
    else:
      print("No value repetitions found in column\n")

  for col in df.select_dtypes(include=object):
    # Only looking at columns with strings now
    print("Frequency counts for string column - '%s' " % col)
    print("--------------------------------------------------")
    col_mode = df[col].value_counts().idxmax()
    # get the most frequent value in the column
    freq = df[col].value_counts()[col_mode]
    # get the count of the most frequent value in the column
    if freq > 1:
      # Only reporting the most frequent value if it appears more than once
      print("Most frequent string value: '%s' found in column %s times.\n"
            % (col_mode, freq))
    else:
      print("No value repetitions found in column\n")

data_file_explore('True.csv', True_df)
data_file_explore('Fake.csv', Fake_df)

## OVERVIEW OF DATA SETS

A cursory glance at the data shows that there are no missing values in either data set and that both files have the same columns with matching data types. There are no numerical columns here, only text and dates. However, the author of the original project points out that articles in the **True.csv** file begin with a dateline such as **WASHINGTON (Reuters) -** or **SEATTLE/WASHINGTON (Reuters)**. Also, there are **6** unique values for **'subject'** in the **Fake.csv** file and only **2** unique values for **'subject'** in **True.csv**, so a model might learn to rely mainly on these two features to decide whether a news article is fake. We will want to remove them so that the two classes are not trivially separable and so that the model still works on news stories that lack this format. Otherwise, one could simply add the (Reuters) term to a fake news story to trick the model into classifying it as true.



In [22]:
cnt = 0
for row in True_df.index:
  if "(Reuters)" in True_df.loc[row]['text']:
     cnt = cnt + 1
print("Count of rows in True_df containing term (Reuters) = %s"% cnt )
print("Number of rows in True_df data frame = ", True_df.shape[0])
print("Percentage of rows in True_df containing (Reuters) = %s"% (cnt/True_df.shape[0]*100) )
print('')
# Counting the number of instances that contain the string "(Reuters)" in True_df

cnt = 0
for row in Fake_df.index:
  if "(Reuters)" in Fake_df.loc[row]['text']:
     cnt = cnt + 1
print("Count of rows in Fake_df containing term (Reuters) = %s"% cnt )
print("Number of rows in Fake_df data frame = ", Fake_df.shape[0])
print("Percentage of rows in Fake_df containing (Reuters) = %s"% (cnt/Fake_df.shape[0]*100) )
# Counting the number of instances that contain the string "(Reuters)" in Fake_df

Count of rows in True_df containing term (Reuters) = 21247
Number of rows in True_df data frame =  21417
Percentage of rows in True_df containing (Reuters) = 99.20623803520567

Count of rows in Fake_df containing term (Reuters) = 9
Number of rows in Fake_df data frame =  23481
Percentage of rows in Fake_df containing (Reuters) = 0.038328861632809505
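
The same count can be obtained without an explicit loop. The following is a minimal sketch, not the method used above: it assumes a `text` column as in the two files, and the toy data frames and the helper name `reuters_share` are hypothetical stand-ins introduced only for illustration.

```python
import pandas as pd

# Toy stand-ins for True_df and Fake_df (hypothetical rows, not the real data)
true_toy = pd.DataFrame({"text": [
    "WASHINGTON (Reuters) - Lawmakers return on Wednesday.",
    "(Reuters) - A governor apologized on Friday.",
    "No dateline in this article.",
]})
fake_toy = pd.DataFrame({"text": [
    "An opinion piece with no wire credit.",
    "A post that quotes (Reuters) in passing.",
]})

def reuters_share(df, column="text", term="(Reuters)"):
    """Return (count, percentage) of rows whose `column` contains `term`."""
    # regex=False treats the parentheses literally;
    # na=False counts missing text as "no match"
    mask = df[column].str.contains(term, regex=False, na=False)
    return int(mask.sum()), float(mask.mean() * 100)

true_count, true_pct = reuters_share(true_toy)
fake_count, fake_pct = reuters_share(fake_toy)
print("true_toy:", true_count, "rows,", round(true_pct, 2), "%")
print("fake_toy:", fake_count, "rows,", round(fake_pct, 2), "%")
```

Calling `reuters_share(True_df)` and `reuters_share(Fake_df)` on the real data frames should reproduce the counts printed above, since both approaches test for the same literal substring.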


As the original author also pointed out, this needs to be addressed before the data goes into the model: over **99%** of the **True_df** rows contain the term **'(Reuters)'**, while under **0.04%** of the rows in **Fake_df** do. Let us also look at the **'subject'** column.

In [23]:
display(True_df['subject'].unique())
display(Fake_df['subject'].unique())
# looking at the unique values in the 'subject' column

array(['politicsNews', 'worldnews'], dtype=object)

array(['News', 'politics', 'Government News', 'left-news', 'US_News',
       'Middle-east'], dtype=object)

Using the **'subject'** column in the model would be a dead giveaway: the unique values in **Fake_df** and **True_df** are completely different, so the model would simply learn to look up the subject when classifying an article as real or fake. Therefore, the text data needs to be pre-processed and the subject column needs to be removed. The date column will be removed as well, so that the model has to decide which class an article belongs to from the text alone.
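
To make the leak concrete, here is a small sketch using only the subject values shown in the two outputs above. The rule `label_from_subject` is a hypothetical shortcut written for illustration; it is not part of the project code.

```python
# Subject values copied from the two unique() outputs above
true_subjects = {"politicsNews", "worldnews"}
fake_subjects = {"News", "politics", "Government News",
                 "left-news", "US_News", "Middle-east"}

# If the two sets share no value, the subject alone identifies the file
overlap = true_subjects & fake_subjects
print("Subjects shared by both files:", overlap)

def label_from_subject(subject):
    """Hypothetical shortcut rule: predict 'real' from the subject alone."""
    return subject in true_subjects

# The shortcut is perfect if it says True for every real subject
# and False for every fake subject
shortcut_is_perfect = (
    all(label_from_subject(s) for s in true_subjects)
    and not any(label_from_subject(s) for s in fake_subjects)
)
print("Subject alone separates the classes:", shortcut_is_perfect)
```

A model given this column could reach a perfect score on this dataset without reading a single article, which is why the column is dropped.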

In [25]:
def text_clean(string):
  r"""
  This function removes the pattern "^[A-Z/]+ \(Reuters\) - "
  from the start of a text string
  :param string: A single string
  :return:  A single string with the pattern removed
  """
  reuters_pattern = r"^[A-Z/]+ \(Reuters\) - "
  # Raw string so the backslashes reach the regex engine unchanged.
  # The text at the start of True_df articles looks like
  # WASHINGTON (Reuters) - so saving this pattern
  return re.sub(reuters_pattern, "", string)
##########################################################################
test_string1 = True_df.loc[0]['text']
print("String before text_clean function applied\n")
print(test_string1)
print('')
# getting a test string to test function

print("String after text_clean function applied\n")
print(text_clean(test_string1))
print('')
# testing function to see if it removes string

def df_pre_processor(df_t, df_f, Shuffle = True):
  """ This is a basic pre-processing function that takes in
  the two dataframes True_df and Fake_df, adds correct labels
  to each, combines them, cleans the text column, removes unneeded
  columns and shuffles the dataframe

  :param df_t: a dataframe containing all true news articles
  :param df_f: a dataframe containing all fake news articles
  :param  Shuffle: bool = True
  :return: a pre-processed data frame combined from df_t & df_f
  """
  df_t['label'] = True
  df_f['label'] = False
  # adding correct labels to each

  df = pd.concat([df_t, df_f], axis=0)
  df['text'] = df['text'] .apply(text_clean)
  # combining both data frames and
  # applying text clean to the text column

  df = df.drop(["subject", 'date'],axis = 1)
  # removing two columns from df

  if Shuffle:
     df = shuffle(df).reset_index()
     # shuffles the dataframe before returning it; reset_index() keeps the
     # old row labels in a new 'index' column (pass drop=True to discard them)
  return df
##########################################################################
df_combined = df_pre_processor(True_df.copy(), Fake_df.copy())
# putting both dataframes in pre-processing function
display(df_combined.head())
# looking at new data frame
bool_test = True_df.shape[0] \
            + Fake_df.shape[0] \
            == df_combined.shape[0]
print("Will print %sTrue%s if total rows are equal - %s" % ("'","'",bool_test))
# checking to make sure new data frame has correct dimensions

String before text_clean function applied

WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary

Unnamed: 0,index,title,text,label
0,20889,U.S. ELECTIONS May Already Be In Serious Jeopa...,If you haven t already signed up to help prote...,False
1,22549,‘Journalistic Malpractice’: CNN Slammed for ‘B...,21st Century Wire says This definitely needed ...,False
2,16185,WATCH FULL RESPONSE FROM TRUMP AFTER OBAMACARE...,,False
3,10277,BURN! NYT TAKES JAB AT FOX NEWS…Watch How FOX ...,"Sometimes, the left just makes it far too easy...",False
4,13821,BEAUTIFUL! TRUMP Hits Liz Warren And HuffPo Wi...,One thing we re discovering is that Trump has ...,False


Will print 'True' if total rows are equal - True
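
Because the pattern in `text_clean` requires an upper-case prefix made of `A-Z` and `/`, a text that begins with a bare `(Reuters) - ` is left unchanged, and so is a dateline containing a space. The sketch below shows one possible, more permissive pattern. It is an assumption about what the datelines may look like, not a verified description of the dataset, and the name `clean_dateline` and the 40-character limit are hypothetical choices.

```python
import re

# Optional place name (a capital letter followed by up to 40 letters, spaces,
# dots, commas or slashes), then the literal "(Reuters) - ".
# The length limit is an arbitrary guard against stripping a whole sentence.
DATELINE = re.compile(r"^(?:[A-Z][A-Za-z.,/ ]{0,40})?\(Reuters\) - ")

def clean_dateline(text):
    """Remove a leading Reuters dateline, with or without a place name."""
    return DATELINE.sub("", text)

samples = [
    "WASHINGTON (Reuters) - The head of a faction",
    "SEATTLE/WASHINGTON (Reuters) - A judge ruled",
    "(Reuters) - Maine Governor Paul LePage",
    "NEW YORK (Reuters) - Stocks rose",
    "No dateline in this article",
]
cleaned = [clean_dateline(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The pattern is anchored at the start of the string, so a mention of (Reuters) later in an article is not touched. Whether the remaining in-text mentions also need handling would have to be checked against the data.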


In [26]:
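columns_list = df_combined.columns.tolist()
# saving columns list
# NOTE: this list still contains 'label' (the target) and the leftover 'index'
# column, so train_x and test_x below include both; they must be dropped
# before any model is fit on these frames
train_x, test_x, train_y, test_y = train_test_split(df_combined[columns_list],
                                                    df_combined['label'],
                                                    test_size=0.2,
                                                    random_state=5)
# Splitting data in train and test groups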
display(train_x)
display(train_y)
display(test_x)
display(test_y)

Unnamed: 0,index,title,text,label
41124,21102,Dozens of prisoners on the run in central Ivor...,Close to 100 prisoners escaped from prison in ...,True
22642,12088,Christmas market opens in Algerian capital,A small Christmas market has opened in Algeria...,True
5161,15983,Factbox: Catalonia crisis - What's next?,Catalonia s ousted leader Carles Puigdemont ag...,True
25434,19058,MEDIA ATTACKS TRUMP…IGNORES Obama’s Miserable ...,Here s Chris Matthews attacking Trump s speech...,False
2009,8543,Ben And Jerry’s Co-Founder Reveals Bernie San...,Supporters of Senator Bernie Sanders can now c...,False
...,...,...,...,...
5520,8365,Maine governor apologizes for obscenity-laced ...,(Reuters) - Maine Governor Paul LePage apologi...,True
35814,565,Trump pardons turkey in annual Thanksgiving tr...,"President Donald Trump, who has raised eyebrow...",True
20463,5403,GOP Official Was Just ‘Emotional’ When He Sai...,It s almost pathetic how quickly and easily ou...,False
18638,12516,BLACK AMERICAN On How I Became A Republican: “...,***WARNING***GRAPHIC LanguageThis video contai...,False


41124     True
22642     True
5161      True
25434    False
2009     False
         ...  
5520      True
35814     True
20463    False
18638    False
35683     True
Name: label, Length: 35918, dtype: bool

Unnamed: 0,index,title,text,label
10128,976,Trump tax overhaul under intensifying fire as ...,President Donald Trump’s plan for overhauling ...,True
43165,12237,Ohio State University Student Says Terrorist A...,,False
14627,11934,Indonesia labels calls for U.S. boycott over J...,Indonesia s vice president said on Tuesday tha...,True
17654,13711,Backlash as Beijing fire safety blitz forces e...,"In Xinjiancun, a ramshackle village of migrant...",True
28412,6269,HORRIBLE News For Do-Nothing GOP: Americans S...,Republican lawmakers oppose paid family leave....,False
...,...,...,...,...
16304,14839,Obama’s Delusion Continues In Vapid Address To...,Obama addressed the Nation In a nothing burger...,False
24306,2765,Fox News Rushes To ID Mosque Attacker As Moro...,Want evidence that Fox News is a functional mo...,False
44244,2202,"In Alabama's Senate race, contenders fight ove...","(Reuters) - At first glance, U.S. Representati...",True
27014,20948,#FeelTheBern: GUY WHO WANTS TO CLEAN UP CORRUP...,Hey Bernie The first step in fighting corrupti...,False


10128     True
43165    False
14627     True
17654     True
28412    False
         ...  
16304    False
24306    False
44244     True
27014    False
4424     False
Name: label, Length: 8980, dtype: bool
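
The split above passes every column of `df_combined` as features, so `train_x` and `test_x` still carry the `label` column, as the displayed frames show. A possible alternative is sketched below on a toy frame. The choice of `title` and `text` as the only features and the use of `stratify` are assumptions made for illustration, not decisions taken in the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_combined (hypothetical rows, same column names as above)
toy = pd.DataFrame({
    "index": range(10),
    "title": [f"title {i}" for i in range(10)],
    "text": [f"body {i}" for i in range(10)],
    "label": [True, False] * 5,
})

# Assumption: only these two columns are used as model inputs
feature_cols = ["title", "text"]

X_train, X_test, y_train, y_test = train_test_split(
    toy[feature_cols],
    toy["label"],
    test_size=0.2,
    random_state=5,
    stratify=toy["label"],  # keep the real/fake ratio the same in both splits
)
print(X_train.columns.tolist(), len(X_train), len(X_test))
```

Keeping the target and the old row index out of the feature frame means that any score measured on the test split reflects what the model learned from the article text rather than from a leaked column.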