# **CS 4701: Fake News vs Real News**
**Authors**: Simar Kohli (sk2523), Shefali Janorkar (skj28), Esther Lee (esl86)

This project compares four classifiers: unigram and bigram language models, a Naive Bayes classifier, and an SVM. The team's goal is to accurately determine whether a news article is fake or real.


## **Phase 0: Imports and Pre-processing!**

**Phase 0.1: Mounting and Imports**

In [28]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [29]:
import os
import io
import math
import random
import string
from string import punctuation

import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize, sent_tokenize
from sklearn import model_selection, svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Phase 0.2: Loading all Text Files**

In [30]:
root_path = os.path.join(os.getcwd(), "drive", "My Drive/AI_Project_Personal")
with io.open(os.path.join(root_path, "trueDataTrain.txt"), encoding='utf8') as real_file:
  real_news = real_file.read()
with io.open(os.path.join(root_path, "trueDataValidation.txt"), encoding='utf8') as real_file:
  real_news_validation = real_file.read()
with io.open(os.path.join(root_path, "fakeDataTrain.txt"), encoding='utf8') as fake_file:
  fake_news = fake_file.read()
with io.open(os.path.join(root_path, "fakeDataValidation.txt"), encoding='utf8') as fake_file:
  fake_news_validation = fake_file.read()

real_news = real_news.lower()
real_news_validation = real_news_validation.lower()
fake_news = fake_news.lower()
fake_news_validation = fake_news_validation.lower()

split_real_news = real_news.splitlines()
split_real_news_validation = real_news_validation.splitlines()
split_fake_news = fake_news.splitlines()
split_fake_news_validation = fake_news_validation.splitlines()

def article_list_tokenizer(corpus): 
  output = []
  for article in corpus:
    output.append(word_tokenize(article))
  return output

tokenize_RN_training = article_list_tokenizer(split_real_news)
tokenize_RN_validate = article_list_tokenizer(split_real_news_validation)
tokenize_FN_training = article_list_tokenizer(split_fake_news)
tokenize_FN_validate = article_list_tokenizer(split_fake_news_validation)

In [None]:
### Sanity check

**Phase 0.3: Preprocessing all Text Files**

In [7]:
## PREPROCESSING
## Drop numeric tokens, punctuation-only tokens, and single-character tokens other than "a" or "i";
## strip leading and trailing punctuation from everything else.
def preprocessing(lsts):
  output = []
  for lst in lsts:
    new_lst = []
    for word in lst:
      word = word.replace(" ", "")
      if len(word) == 1:
        if (word == 'a' or word == 'i'):
          new_lst.append(word)
      elif len(word) > 1:
        if (not word.isnumeric()):
          x = word.strip(punctuation)
          if (len(x) != 0):
            new_lst.append(x)
    output.append(new_lst)
  return output
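
As a quick check of the filtering rule above, the sketch below restates it for a single token list and runs it on a hand-made input. `clean_tokens` and the sample tokens are illustrative and not part of the notebook.

```python
from string import punctuation

def clean_tokens(tokens):
  """Apply the preprocessing rule to one list of tokens (illustrative helper)."""
  kept = []
  for word in tokens:
    word = word.replace(" ", "")
    if len(word) == 1:
      # Only "a" and "i" survive as single-character tokens.
      if word in ("a", "i"):
        kept.append(word)
    elif len(word) > 1 and not word.isnumeric():
      # Strip punctuation from both ends; drop the token if nothing is left.
      stripped = word.strip(punctuation)
      if stripped:
        kept.append(stripped)
  return kept

sample = ["a", "b", ",", "2016", "u.s.", "``", "president-elect", "i"]
print(clean_tokens(sample))  # ['a', 'u.s', 'president-elect', 'i']
```

Note that punctuation inside a token, such as the hyphen in `president-elect`, is kept; only leading and trailing punctuation is removed.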

In [8]:
## Set to be used for ngrams

P_tokenize_RN_training = preprocessing(tokenize_RN_training)
P_tokenize_RN_validate = preprocessing(tokenize_RN_validate)
P_tokenize_FN_training = preprocessing(tokenize_FN_training)
P_tokenize_FN_validate = preprocessing(tokenize_FN_validate)

P_tokenize_RN_training, P_tokenize_RN_test = train_test_split(P_tokenize_RN_training, train_size=0.8)
P_tokenize_FN_training, P_tokenize_FN_test = train_test_split(P_tokenize_FN_training, train_size=0.8)

P_tokenize_test = P_tokenize_RN_test + P_tokenize_FN_test
P_tokenize_test_labels = [0] * len(P_tokenize_RN_test) + [1]*len(P_tokenize_FN_test)

P_tokenize_validate = P_tokenize_RN_validate + P_tokenize_FN_validate
P_tokenize_validate_labels = [0]*len(P_tokenize_RN_validate) + [1]*len(P_tokenize_FN_validate)

## Set to be used for NB and SVM

nb_SVM_RN_training = split_real_news
nb_SVM_FN_training = split_fake_news
nb_SVM_RN_validate = split_real_news_validation
nb_SVM_FN_validate = split_fake_news_validation

nb_SVM_RN_training, nb_SVM_RN_test = train_test_split(nb_SVM_RN_training, train_size=0.8)
nb_SVM_FN_training, nb_SVM_FN_test = train_test_split(nb_SVM_FN_training, train_size=0.8)

nb_SVM_test = nb_SVM_RN_test + nb_SVM_FN_test
nb_SVM_test_labels = [0]*len(nb_SVM_RN_test) + [1]*len(nb_SVM_FN_test)

In [31]:
print(P_tokenize_RN_training[:10])

[['washington', 'reuters', 'president-elect', 'donald', 'trump', 'said', 'on', 'tuesday', 'that', 'a', 'briefing', 'he', 'is', 'to', 'receive', 'from', 'u.s', 'intelligence', 'officials', 'on', 'allegations', 'of', 'russian', 'hacking', 'of', 'the', 'u.s', 'election', 'had', 'been', 'delayed', 'until', 'friday', 'in', 'a', 'tweet', 'trump', 'voiced', 'continued', 'skepticism', 'about', 'the', 'extent', 'of', 'russia', 'cyber', 'hacking', 'he', 'and', 'top', 'advisers', 'believe', 'democrats', 'are', 'trying', 'to', 'delegitimize', 'his', 'nov', 'election', 'victory', 'by', 'accusing', 'russian', 'authorities', 'of', 'helping', 'him', 'the', 'intelligence', 'briefing', 'on', 'so-called', 'russian', 'hacking', 'was', 'delayed', 'until', 'friday', 'perhaps', 'more', 'time', 'needed', 'to', 'build', 'a', 'case', 'very', 'strange', 'trump', 'tweeted', 'it', 'was', 'not', 'clear', 'when', 'the', 'briefing', 'originally', 'had', 'been', 'scheduled', 'to', 'take', 'place', 'the', 'white', 'hou

## **Phase 1: Unigram/Bigram Models!**

**Phase 1.1: N-gram**

In [9]:
def unigram_counts(lsts):
  # Named `counts` rather than `map` to avoid shadowing the built-in.
  counts = {}
  for article in lsts:
    for word in article:
      counts[word] = counts.get(word, 0) + 1
  return counts

In [10]:
def bigram_counts(lsts):
  unigram = unigram_counts(lsts)
  bigram = {}
  for article in lsts:
    for idx, word in enumerate(article):
      if (idx != 0):
        f_word = article[idx-1]
        s_word = article[idx]
        key = str(f_word + " " + s_word)
        if (key not in bigram):
          bigram[key] = 1
        else:
          bigram[key] = bigram[key] + 1
  return unigram, bigram
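
The same counts can be gathered more compactly with `collections.Counter`. The following is a minimal sketch of that alternative, not the notebook's implementation; `count_ngrams` and the toy articles are made up for illustration.

```python
from collections import Counter

def count_ngrams(articles):
  """Return (unigram, bigram) Counters; bigram keys are 'w1 w2' strings."""
  unigram = Counter()
  bigram = Counter()
  for article in articles:
    unigram.update(article)
    # zip pairs each token with its successor, so bigrams never span two articles.
    bigram.update(f"{first} {second}" for first, second in zip(article, article[1:]))
  return unigram, bigram

toy = [["the", "court", "ruled"], ["the", "court"]]
uni, bi = count_ngrams(toy)
print(uni["the"], bi["the court"], bi["court ruled"])  # 2 2 1
```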


In [11]:
x = unigram_counts(P_tokenize_RN_training)
y, z = bigram_counts(P_tokenize_RN_training)


**Phase 1.2: Unknown-Handling**

In [12]:
def unk_unigram_handler(unigram, k):
  unk_handled = {}
  unk_handled["<UNK>"] = 1
  for word in unigram:
    if unigram[word] <= k:
      unk_handled["<UNK>"] = unk_handled["<UNK>"] + unigram[word]
    else:
      unk_handled[word] = unigram[word]
  return unk_handled

def unk_bigram_handler(unigram, bigram, k):
  unigram_unks = unk_unigram_handler(unigram, k)
  bigram_unks = {}
  bigram_unks["<UNK> <UNK>"] = 1
  for key in bigram:
    fw, sw = key.split(" ")
    if fw not in unigram_unks and sw not in unigram_unks:
      bigram_unks["<UNK> <UNK>"] = bigram_unks["<UNK> <UNK>"] + bigram[key]
    elif fw not in unigram_unks:
      # if statement here for if exists, place, otherwise increment
      skey = "<UNK> " + sw
      if (skey not in bigram_unks):
        bigram_unks[skey] = bigram[key]
      else:
        bigram_unks[skey] = bigram_unks[skey] + bigram[key]
    elif sw not in unigram_unks:
      # if statement here for if exists, place, otherwise increment 
      skey = fw + " <UNK>"
      if (skey not in bigram_unks):
        bigram_unks[skey] = bigram[key]
      else:
        bigram_unks[skey] = bigram_unks[skey] + bigram[key]
    else:
      bigram_unks[key] = bigram[key]
  return unigram_unks, bigram_unks
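
To make the cut-off concrete, here is a small sketch of the unigram case on made-up counts. `collapse_rare` is a hypothetical name that mirrors the logic of `unk_unigram_handler`.

```python
def collapse_rare(unigram, cut_off):
  """Fold words seen at most `cut_off` times into a single <UNK> entry."""
  collapsed = {"<UNK>": 1}  # starts at 1, as in the notebook, so <UNK> is never zero
  for word, count in unigram.items():
    if count <= cut_off:
      collapsed["<UNK>"] += count
    else:
      collapsed[word] = count
  return collapsed

counts = {"the": 5, "court": 3, "zyzzyva": 1, "quokka": 2}
print(collapse_rare(counts, 2))  # {'<UNK>': 4, 'the': 5, 'court': 3}
```

With a cut-off of 0 no word is collapsed, and `<UNK>` keeps only its initial count of 1.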



In [44]:
x1 = unk_unigram_handler(x, 2)
y2, z2 = unk_bigram_handler(y, z, 2)

In [22]:
sort_bigram = sorted(z2.items(), key=lambda x: x[1], reverse=True)

In [23]:
print(sort_bigram[:10])

[('of the', 26842), ('in the', 23347), ('to the', 12656), ('in a', 10308), ('on the', 9450), ('for the', 8675), ('the united', 8038), ('said the', 7055), ('united states', 6945), ('the u.s', 6846)]


**Phase 1.3: Add-K smoothing**

In [13]:
def add_K_smoothing_unigram(unigram, k):
  prob = {}
  total = sum(unigram.values())
  for word in unigram:
    prob[word] = (unigram[word] + k)/(total + k * len(unigram))
  return prob

In [14]:
def add_K_smoothing_bigram(unigram, bigram, k):
  # Conditional probability of the second word given the first, with add-k smoothing.
  prob = {}
  vocab_size = len(unigram)
  for key in bigram:
    first_word = key.split()[0]
    prob[key] = (bigram[key] + k)/(unigram[first_word] + k*vocab_size)
  return prob
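
Written out, the two smoothing functions compute the estimates below. The symbols are introduced here for illustration: $c(\cdot)$ is a count after unknown-handling, $N$ is the total number of unigram tokens, and $V$ is the number of distinct unigram types, including `<UNK>`.

$$
P(w) = \frac{c(w) + k}{N + kV},
\qquad
P(w_2 \mid w_1) = \frac{c(w_1\, w_2) + k}{c(w_1) + kV}
$$

Setting $k = 0$ recovers the unsmoothed relative-frequency estimates of Phase 1.4.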

**Phase 1.4: Probabilities**

In [26]:
def unigram_probabilities(unigram):
  prob = {} 
  total = sum(unigram.values())
  for word in unigram:
    prob[word] = unigram[word]/total
  return prob

In [37]:
def bigram_probabilities(unigram, bigram):
  # Unsmoothed estimate: count of the bigram divided by the count of its first word.
  prob = {}
  for word in bigram:
    fw, sw = word.split(" ")
    prob[word] = bigram[word]/unigram[fw]
  return prob

**Phase 1.5: Perplexity Computation**

In [15]:
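def perplexity_bigram(bigram, review):
  # Returns the summed negative log-probability of the article (lower = better fit).
  # Back-off order: exact bigram, "fw <UNK>", "<UNK> sw", then "<UNK> <UNK>".
  log_perplexity = 0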
  for idx, word in enumerate(review):
    if (idx > 0): 
      fw = review[idx-1]
      sw = review[idx] 
      key1 = fw + " " + sw
      key2 = fw + " <UNK>"
      key3 = "<UNK> " + sw
      key4 = "<UNK> <UNK>"
      if key1 in bigram:
        log_perplexity = log_perplexity - math.log(bigram[key1], math.e)
      elif key2 in bigram:
        log_perplexity = log_perplexity - math.log(bigram[key2], math.e)
      elif key3 in bigram:
        log_perplexity = log_perplexity - math.log(bigram[key3], math.e)
      else:
        log_perplexity = log_perplexity - math.log(bigram[key4], math.e)
  return log_perplexity 

def perplexity_unigram(unigram, review):
  # Returns the summed negative log-probability; unseen words score as "<UNK>".
  log_perplexity = 0
  for word in review:
    if (word in unigram):
      log_perplexity = log_perplexity - math.log(unigram[word], math.e)
    else:
      log_perplexity = log_perplexity - math.log(unigram["<UNK>"], math.e)
  return log_perplexity
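
Perplexity in the usual sense divides the summed negative log-probability by the number of scored tokens and then exponentiates. A minimal sketch of that conversion, assuming the sum and the token count are already available (`to_perplexity` is a hypothetical helper, not part of the notebook):

```python
import math

def to_perplexity(neg_log_prob_sum, n_tokens):
  """Convert a summed negative natural-log probability into perplexity."""
  if n_tokens == 0:
    raise ValueError("cannot compute perplexity of an empty sequence")
  return math.exp(neg_log_prob_sum / n_tokens)

# Four tokens, each assigned probability 0.25, give a perplexity of about 4.
nll = -sum(math.log(0.25) for _ in range(4))
print(to_perplexity(nll, 4))
```

Since both models score the same article, and therefore the same number of tokens, comparing the unnormalized sums ranks the models the same way as comparing normalized perplexities.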

**Phase 1.6: Training & Validation**

In [19]:
def train_bigram(articles):
  unigram, bigram = bigram_counts(articles)
  return unigram, bigram

def train_unigram(articles):
  unigram = unigram_counts(articles)
  return unigram

def unk_smooth_bigram(unigram, bigram, cut_off, k_smoothing): 
  unigram, bigram = unk_bigram_handler(unigram, bigram, cut_off)
  prob            = add_K_smoothing_bigram(unigram, bigram, k_smoothing)
  return prob

def unk_smooth_unigram(unigram, cut_off, k_smoothing):
  unigram = unk_unigram_handler(unigram, cut_off)
  prob = add_K_smoothing_unigram(unigram, k_smoothing)
  return prob

def predict_bigram(test_set, RUS_bigram, FUS_bigram):
  preds = []
  for article in test_set:
    RUS = perplexity_bigram(RUS_bigram, article)
    FUS = perplexity_bigram(FUS_bigram, article)
    if (RUS < FUS): 
      preds.append(0)
    else:
      preds.append(1)
  return preds

def predict_unigram(test_set, RUS_unigram, FUS_unigram):
  preds = []
  for article in test_set:
    RUS = perplexity_unigram(RUS_unigram, article)
    FUS = perplexity_unigram(FUS_unigram, article)
    if (RUS < FUS):
      preds.append(0)
    else:
      preds.append(1)
  return preds

def accuracy_calc(preds, labels):
  correct = 0
  for idx, pred in enumerate(preds):
    if (pred == labels[idx]):
      correct = correct + 1 
  return correct/len(labels)

def train_hyperparameterize_validate_bigram(real_set, fake_set, val_set, val_labels):
  R_unigram, R_bigram = train_bigram(real_set)
  F_unigram, F_bigram = train_bigram(fake_set)
  best_cut_off = -1
  best_k_smoothing = -1 
  best_accuracy = 0
  best_RUS_model = None
  best_FUS_model = None
  for cut_off in np.arange(0, 10, 1):
    for k_smoothing in np.arange(0, 0.05, 0.01):
      RUS_prob = unk_smooth_bigram(R_unigram, R_bigram, cut_off, k_smoothing)
      FUS_prob = unk_smooth_bigram(F_unigram, F_bigram, cut_off, k_smoothing)
      predictions = predict_bigram(val_set, RUS_prob, FUS_prob)
      accuracy = accuracy_calc(predictions, val_labels)
      print(accuracy)
      if (accuracy > best_accuracy):
        best_cut_off = cut_off
        best_k_smoothing = k_smoothing
        best_RUS_model = RUS_prob
        best_FUS_model = FUS_prob
        best_accuracy = accuracy
  return best_RUS_model, best_FUS_model, best_cut_off, best_k_smoothing, best_accuracy

def train_hyperparameterize_validate_unigram(real_set, fake_set, val_set, val_labels):
  R_unigram = train_unigram(real_set)
  F_unigram = train_unigram(fake_set)
  best_cut_off = -1 
  best_k_smoothing = -1
  best_accuracy = 0
  best_RUS_model = None
  best_FUS_model = None 
  for cut_off in np.arange(0, 10, 1):
    for k_smoothing in np.arange(0, .05, 0.01):
      RUS_prob = unk_smooth_unigram(R_unigram, cut_off, k_smoothing)
      FUS_prob = unk_smooth_unigram(F_unigram, cut_off, k_smoothing)
      predictions = predict_unigram(val_set, RUS_prob, FUS_prob)
      accuracy = accuracy_calc(predictions, val_labels)
      print(accuracy)
      if (accuracy > best_accuracy):
        best_cut_off = cut_off
        best_k_smoothing = k_smoothing
        best_RUS_model = RUS_prob
        best_FUS_model = FUS_prob
        best_accuracy = accuracy
  return best_RUS_model, best_FUS_model, best_cut_off, best_k_smoothing, best_accuracy
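
Both search functions follow the same pattern: try every (cut-off, k) pair and keep the pair with the highest validation accuracy. The sketch below isolates that pattern. `grid_search`, `evaluate`, and the toy scoring function are assumptions made for illustration, and the k grid is built from integers to avoid floating-point step drift.

```python
from itertools import product

def grid_search(cut_offs, k_values, evaluate):
  """Return (best_score, best_cut_off, best_k) over all parameter pairs.

  `evaluate` is a hypothetical callable mapping (cut_off, k) to a
  validation accuracy; ties keep the first pair encountered.
  """
  best = (float("-inf"), None, None)
  for cut_off, k in product(cut_offs, k_values):
    score = evaluate(cut_off, k)
    if score > best[0]:
      best = (score, cut_off, k)
  return best

# Toy scoring function (made up) that peaks at cut_off=1, k=0.0.
def toy_score(cut_off, k):
  return 1.0 - abs(cut_off - 1) * 0.1 - k

k_grid = [i / 100 for i in range(5)]  # 0.00, 0.01, ..., 0.04
print(grid_search(range(10), k_grid, toy_score))  # (1.0, 1, 0.0)
```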




In [20]:
bigram_best_RUS_model, bigram_best_FUS_model, bigram_best_cut_off, bigram_best_k_smoothing, bigram_best_accuracy = train_hyperparameterize_validate_bigram(P_tokenize_RN_training, P_tokenize_FN_training, P_tokenize_validate, P_tokenize_validate_labels)

0.17427616926503342
0.6654788418708241
0.6923162583518931
0.7012249443207127
0.705456570155902
0.9125835189309577
0.831403118040089
0.8042316258351893
0.7857461024498886
0.7712694877505568
0.8547884187082405
0.8415367483296213
0.8293986636971047
0.8221603563474388
0.8172605790645879
0.8306236080178173
0.8295100222717149
0.8238307349665924
0.8201559020044543
0.8199331848552338
0.8089086859688196
0.8165924276169265
0.8162583518930958
0.8160356347438753
0.8143652561247215
0.7978841870824054
0.811804008908686
0.810467706013363
0.811358574610245
0.8085746102449889
0.7859688195991091
0.8026726057906459
0.805011135857461
0.8053452115812918
0.8035634743875278
0.7783964365256125
0.7988864142538975
0.8006681514476615
0.8017817371937639
0.8012249443207127
0.7703786191536748
0.7923162583518931
0.7957683741648107
0.7964365256124721
0.7981069042316258
0.7638084632516704
0.7884187082405345
0.7894209354120267
0.7919821826280624
0.7922048997772828


In [21]:
unigram_best_RUS_model, unigram_best_FUS_model, unigram_best_cut_off, unigram_best_k_smoothing, unigram_best_accuracy = train_hyperparameterize_validate_unigram(P_tokenize_RN_training, P_tokenize_FN_training, P_tokenize_validate, P_tokenize_validate_labels)

0.9523385300668151
0.9523385300668151
0.9523385300668151
0.9523385300668151
0.9523385300668151
0.7974387527839644
0.7973273942093542
0.7972160356347439
0.7972160356347439
0.7972160356347439
0.7695991091314032
0.7694877505567929
0.7693763919821827
0.7693763919821827
0.7693763919821827
0.7475501113585746
0.7475501113585746
0.7475501113585746
0.7475501113585746
0.7475501113585746
0.7279510022271715
0.7279510022271715
0.7279510022271715
0.7279510022271715
0.7279510022271715
0.7134743875278396
0.7134743875278396
0.7134743875278396
0.7134743875278396
0.7134743875278396
0.694988864142539
0.694988864142539
0.694988864142539
0.694988864142539
0.694988864142539
0.6749443207126948
0.6749443207126948
0.6749443207126948
0.6749443207126948
0.6749443207126948
0.660022271714922
0.660022271714922
0.660022271714922
0.6601336302895323
0.6601336302895323
0.647772828507795
0.647772828507795
0.647772828507795
0.6476614699331849
0.6476614699331849


**Phase 1.7: Testing**

In [25]:
unigram_predictions = predict_unigram(P_tokenize_test, unigram_best_RUS_model, unigram_best_FUS_model)
bigram_predictions = predict_bigram(P_tokenize_test, bigram_best_RUS_model, bigram_best_FUS_model)
print()

In [27]:
unigram_accuracy = accuracy_calc(unigram_predictions, P_tokenize_test_labels)
bigram_accuracy = accuracy_calc(bigram_predictions, P_tokenize_test_labels)
print(unigram_accuracy)
print(bigram_accuracy)
print(unigram_best_cut_off) 
print(unigram_best_k_smoothing) 
print(unigram_best_accuracy)
print(bigram_best_cut_off)
print(bigram_best_k_smoothing)
print(bigram_best_accuracy)

0.954191188166057
0.91379036106251
0
0.0
0.9523385300668151
1
0.0
0.9125835189309577


## **Phase 2: Naive Bayes!**

**Phase 2.1: Generating pipeline, Training, Testing**

In [22]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

training = nb_SVM_RN_training + nb_SVM_RN_validate + nb_SVM_FN_training + nb_SVM_FN_validate
labels = [0] * (len(nb_SVM_RN_training) + len(nb_SVM_RN_validate)) + [1] * (len(nb_SVM_FN_training) + len(nb_SVM_FN_validate))
pipeline.fit(training, labels)
predictions = pipeline.predict(nb_SVM_test)

In [23]:
correct = 0
for idx, pred in enumerate(predictions):
  if (pred == nb_SVM_test_labels[idx]):
    correct = correct + 1

print(correct/len(nb_SVM_test_labels))

0.9339907746142835


## **Phase 3: SVM!**

In [38]:
pipeline2 = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', svm.LinearSVC()),
])

parameters = {
    'clf__C': np.arange(0.1, 3.1, 0.1)
}


training = nb_SVM_RN_training + nb_SVM_RN_validate + nb_SVM_FN_training + nb_SVM_FN_validate
labels = [0] * (len(nb_SVM_RN_training) + len(nb_SVM_RN_validate)) + [1] * (len(nb_SVM_FN_training) + len(nb_SVM_FN_validate))

gridsearch = GridSearchCV(pipeline2, parameters, n_jobs =-1, cv = 5, verbose=4, scoring='accuracy')
gridsearch.fit(training, labels)
z = gridsearch.predict(nb_SVM_test)
correct = 0

for idx, pred in enumerate(z): 
  if (pred == nb_SVM_test_labels[idx]):
    correct = correct + 1

print(correct/len(z))

Fitting 5 folds for each of 30 candidates, totalling 150 fits
0.9952282487672977


In [40]:
print(gridsearch.best_score_)
print(gridsearch.best_params_)

0.9945199962210513
{'clf__C': 2.2}


In [None]:
x = 'A federal appeals court Thursday ruled against former President Donald Trump in his effort to block his White House records from being released to the House select committee investigating January 6.However, the DC Circuit Court of Appeals paused its ruling for two weeks so that Trump could seek a Supreme Court intervention. The events of January 6th exposed the fragility of those democratic institutions and traditions that we had perhaps come to take for granted, said the DC Circuit opinion, which was written by Judge Patricia Millett, who was appointed by former President Barack Obama. "In response, the President of the United States and Congress have each made the judgment that access to this subset of presidential communication records is necessary to address a matter of great constitutional moment for the Republic. Former President Trump has given this court no legal reason to cast aside President Bidens assessment of the Executive Branch interests at stake, or to create a separation of powers conflict that the Political Branches have avoided."'
x = f'Five things that will help get the country where it needs to be and ensure that actual American citizens are duly represented are Voter ID and a Constitutional amendment for term limits on Congress – both the House and Senate. There should be no more career politicians, and no more pensions for politicians. A politician should be a finite service to the country, not a career. Also, biden/harris should be removed from office, and the immigration laws on the books need to be enforced. That is what America needs now.'
x = f'Biden and Harris owe America and the Chicago Police a big apology! Remember Harris was once a prosecutor and a DA, she was worthless then, more worthless and a real danger now. Shows she is NOT competent, got through school as a Affirmative Action Student, slept her way into politics.'
x = f'A gun did not kill Lincoln, a democrat did.'
x = 'A good place to start might not be Berlin, but London. It was here, in early April 2009, that the leaders of the G20 held their second summit, in a drab convention center not far from many of the banks that had brought about the global crisis that the assembled heads of state were now frantically trying to address. The then British prime minister Gordon Brown and Barack Obama sought to apply pressure on Merkel and French president Nicolas Sarkozy to secure a further round of fiscal stimulus in Europe. It had to be large and be comprised of mostly new discretional spending, rather than just automatic stabilizers (such as unemployment insurance) and tax cuts. Also on the table were a raft of reforms aimed at stabilizing the global financial system and addressing global trade imbalances.'
x = f'The artist and widow of John Lennon, who is in Los Angeles to present a collection of cups and saucers she is exhibiting at the Museum of Modern Art, totally took reporters by surprise by admitting she had not only met the former First Lady at various times during a series of protests against the Vietnam War in New York in the 1970s but also knew her “intimately”.'
x = f'On first glance, it would be easy to see the Supreme Court’s decision Friday in Whole Woman’s Health v. Jackson as a win for abortion rights. It would also be wrong. More than two months after the Supreme Court allowed SB 8, a Texas law that effectively bans abortions after the sixth week of pregnancy, to take effect, the Court followed it up with a 5-4 decision that is an even larger defeat to proponents of abortion rights, and a victory to anti-abortion lawmakers in Texas. The specific question in Jackson is whether abortion providers are allowed to bring a federal lawsuit seeking to block SB 8. Although Justice Neil Gorsuch’s majority opinion technically answers this question in the affirmative, it permits suits only against state health officials who play a very minimal role in enforcing the law. It does not allow suits to proceed against the Texas state officials who play the biggest role in enforcing SB 8: state court judges and clerks.'
# GridSearchCV fits clones of pipeline2, so predict with the fitted search object instead.
y = gridsearch.predict([x])
print(y)

[1]
