# CSCI 4022 Final Project 
By John Danekind and Daniel Hatakeyama

## Import Stuff


In [34]:
import os 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn 
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import pyminhash

# The assignment requires implementing the algorithms from scratch, so the library imports above are placeholders for now

In [35]:
# The full CSV files are huge, so work on a random subset of each
def make_subset(input_path, output_path, n=100):
    # Read in the full file
    df = pd.read_csv(input_path)
    
    # Draw a reproducible random sample of n rows (100 by default)
    df_sample = df.sample(n=n, random_state=42)
    
    # Save the sample under the given name. The row index is written too,
    # so it shows up as an 'Unnamed: 0' column when the file is reloaded.
    df_sample.to_csv(output_path)
    
    # to_csv returns None when given a path, so return the path instead
    return output_path

make_subset('../data/True.csv', '../data/sampled_True.csv')
make_subset('../data/Fake.csv', '../data/sampled_Fake.csv')



### Ideal questions to consider for now
- Can unsupervised clustering naturally separate fake from real news without using labels?
- What distinctive linguistic patterns emerge in clusters of fake vs. real news?
- Are there identifiable sub-categories within fake news that clustering can reveal?
- How do temporal patterns differ between fake and real news clusters?
- Which textual features most strongly contribute to the separation of clusters?


#### Things to think about
- The dataset is meant for supervised learning tasks (neural nets, SVMs, k-NN, etc.)
- Maybe use K-means, MinHashing, and GMMs to cluster news into different categories
- Compare these clusters to the labels later


#### Current plan (time is short)
- Combine the datasets into one big dataset and ignore the labels.
- Run K-means on that data and cluster articles by similarity (could be Jaccard, Euclidean distance on vector embeddings, or something else; decide later).
- See what patterns arise from this.
- Then take the labeled dataset and fit a simple, interpretable supervised method (logistic regression or random forest; no black-box models).
- Compare how each of the methods performed.
- Can fake vs. real news be clustered without explicit labels, or is it harder to detect?
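
Since the algorithms have to be written by hand, the K-means step of the plan could look roughly like the sketch below. It is plain Lloyd's algorithm with Euclidean distance on dense vectors. The function name `kmeans`, its parameters, and the toy data are illustrative assumptions, not code from this notebook.

```python
import numpy as np

def kmeans(X, k=2, n_iter=100, seed=42):
    """Plain Lloyd's algorithm on the rows of X with Euclidean distance."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster went empty
        new_centroids = np.array([
            X[assignments == c].mean(axis=0) if np.any(assignments == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assignments, centroids

# Toy data (hypothetical): two well-separated blobs
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

K-means only finds a local optimum, so on real data it is worth running it from several seeds and keeping the run with the lowest within-cluster distance.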

### Research Question: 
- Can document similarity patterns reveal distinctions between fake and real news? 
- If we cluster documents into two groups by some similarity measure, how well do the clusters match the dataset's real/fake labels?
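
One possible way to score clusters against the labels is cluster purity: the fraction of documents whose label matches the majority label of their cluster. The sketch below is an assumption about how that comparison could be done, not something the notebook already contains. `cluster_purity` and the toy arrays are hypothetical names.

```python
import numpy as np

def cluster_purity(labels, clusters):
    """Fraction of items whose true label is the majority label of their cluster."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    correct = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        # Size of the largest label group inside this cluster
        correct += np.bincount(members).max()
    return correct / len(labels)

# Toy example: 0 = fake, 1 = real (same encoding as df['label'])
true_labels = [0, 0, 0, 1, 1, 1]
cluster_ids = [1, 1, 0, 0, 0, 0]
purity = cluster_purity(true_labels, cluster_ids)
print(f"Purity: {purity:.3f}")
```

With binary labels purity can never fall below 0.5, so 0.5 is the baseline to beat rather than 0.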


## Data Preparation/Preprocessing

#### Loading Data

In [36]:
# Load the data sets
fake_df = pd.read_csv('../data/sampled_Fake.csv')
real_df = pd.read_csv('../data/sampled_True.csv')

# Add labels 
fake_df['label'] = 0
real_df['label'] = 1

# Combine the data sets 
df = pd.concat([fake_df, real_df], ignore_index=True)

# Create a combined text field
df['content'] = df['title'] + ' ' + df['text']

print(f"Total articles: {len(df)}")
print(f"Fake: {len(fake_df)}, Real: {len(real_df)}")

print(f"Fake example: {df['content'].iloc[0]}")
print(f"Real example: {df[df['label'] == 1]['content'].iloc[0]}")


df.head(10)

Total articles: 200
Fake: 100, Real: 100
Fake example: ABOUT HILLARY’S COUGH: We Discovered The Secret To Why She Keeps Coughing [Video]  
Real example: Europe rights watchdog says Turkey's emergency laws go too far BRUSSELS (Reuters) - A leading European rights watchdog called on Turkey on Friday to ease post-coup state of emergency laws that have seen thousands arrested and restore power to regional authorities. President Tayyip Erdogan has overseen a mass purge in the armed forces and the judiciary, as well as a crackdown on critics including academics and journalists since a failed military coup in July last year.  An advisory body to the Council of Europe, of which Turkey is a member, acknowledged in a report  the need for certain extraordinary steps taken by Turkish authorities to face a dangerous armed conspiracy .  However...Turkish authorities have interpreted these extraordinary powers too extensively,  said the experts, known as the Venice Commission, in an opinion that has 

Unnamed: 0.1,Unnamed: 0,title,text,subject,date,label,content
0,13474,ABOUT HILLARY’S COUGH: We Discovered The Secre...,,politics,"Jul 20, 2016",0,ABOUT HILLARY’S COUGH: We Discovered The Secre...
1,11994,BREAKING: OBAMACARE REPEAL Clears First Hurdle...,The Senate voted 51-48 this afternoon to proce...,politics,"Jan 4, 2017",0,BREAKING: OBAMACARE REPEAL Clears First Hurdle...
2,19179,‘SLEEPY’ JUSTICE GINSBURG: Excites Crowd By Sa...,So much for the SCOTUS not being political Che...,left-news,"Feb 7, 2017",0,‘SLEEPY’ JUSTICE GINSBURG: Excites Crowd By Sa...
3,501,WATCH: Kellyanne Conway Very Upset Hillary Cl...,White House counselor Kellyanne Conway crawled...,News,"August 24, 2017",0,WATCH: Kellyanne Conway Very Upset Hillary Cl...
4,3492,"GOP Gives Trump The Middle Finger, Prepares T...",Donald Trump may have decided that Russia is g...,News,"December 9, 2016",0,"GOP Gives Trump The Middle Finger, Prepares T..."
5,1510,Trump Displays Incredible Ignorance Yet Again...,Have you ever wondered where a phrase started?...,News,"May 11, 2017",0,Trump Displays Incredible Ignorance Yet Again...
6,3296,Anthony Bourdain Reveals The ‘ONE Good Thing’...,While Donald Trump is currently freaking out b...,News,"December 22, 2016",0,Anthony Bourdain Reveals The ‘ONE Good Thing’...
7,17798,TRUMP HITS BACK After Cowgirl Congresswoman Tr...,The left is going ballistic over supposed word...,left-news,"Oct 18, 2017",0,TRUMP HITS BACK After Cowgirl Congresswoman Tr...
8,9504,MEDIA DOWNPLAYS Attack By Unhinged Neighbor On...,"5 broken ribs with trouble breathing, lung con...",politics,"Nov 6, 2017",0,MEDIA DOWNPLAYS Attack By Unhinged Neighbor On...
9,6087,Why This Attorney General Is Going After Trum...,New York Attorney General Eric Schneiderman is...,News,"May 31, 2016",0,Why This Attorney General Is Going After Trum...


#### Text Preprocessing

#### Text processing pipeline (look into this more later)

In [37]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download required NLTK data (only needs to run once)
#nltk.download('stopwords')
#nltk.download('wordnet')

# Initialize lemmatizer and stopwords
# Lemmatization is the process of converting a word to its base form 
# e.g. 'running' -> 'run'
# Stop words are common words in a language that are usually filtered out
# e.g. 'the', 'a', 'is', etc.
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
print(f"Stop words: {stop_words}")

def preprocess_text(text):
    # Convert the text to lowercase
    text = str(text).lower()
    
    # Remove special characters, numbers, punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\d+', ' ', text)
    
    # Tokenize
    tokens = text.split()
    
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    
    
    # Rejoin tokens
    return ' '.join(tokens)

# Apply preprocessing
df['processed_content'] = df['content'].apply(preprocess_text)
df.head()

Stop words: {'at', 'there', 'by', 'being', 'have', 'themselves', 'after', 'hasn', 'those', "they've", "she's", 'does', 'having', 'because', "haven't", 'didn', 'd', 'what', 'who', 'am', 'did', 'ourselves', 'with', 'the', 'any', 'my', 'off', 'and', 'then', 's', 'until', "don't", "he'll", 'so', 'he', 'where', 'ma', 'its', 'of', 'your', 'that', 'to', "you'd", 'how', 'only', 'nor', 'below', 'own', "you're", "mustn't", 'him', 'we', 'y', 'through', 'yourselves', 'now', 'other', 'against', 'this', "it's", 'all', 'for', "doesn't", 'just', 'above', 'doesn', 'than', 'are', "didn't", 'don', 'i', "wouldn't", "you've", 'mustn', 'herself', 'about', 'wasn', 'has', 'same', "weren't", 'some', 'while', "i'd", 'over', 'ours', "isn't", 'or', "that'll", "i've", 'yours', "aren't", 'not', 'from', 'into', 'under', "he'd", "wasn't", 'is', 'shan', "we're", "she'll", 'between', 'm', "mightn't", 'should', "they're", 'an', 'these', 'a', 'needn', 'she', 're', 'very', 'down', 'our', 've', "it'll", "hadn't", "they'll"

Unnamed: 0.1,Unnamed: 0,title,text,subject,date,label,content,processed_content
0,13474,ABOUT HILLARY’S COUGH: We Discovered The Secre...,,politics,"Jul 20, 2016",0,ABOUT HILLARY’S COUGH: We Discovered The Secre...,hillary cough discovered secret keep coughing ...
1,11994,BREAKING: OBAMACARE REPEAL Clears First Hurdle...,The Senate voted 51-48 this afternoon to proce...,politics,"Jan 4, 2017",0,BREAKING: OBAMACARE REPEAL Clears First Hurdle...,breaking obamacare repeal clear first hurdle d...
2,19179,‘SLEEPY’ JUSTICE GINSBURG: Excites Crowd By Sa...,So much for the SCOTUS not being political Che...,left-news,"Feb 7, 2017",0,‘SLEEPY’ JUSTICE GINSBURG: Excites Crowd By Sa...,sleepy justice ginsburg excites crowd saying b...
3,501,WATCH: Kellyanne Conway Very Upset Hillary Cl...,White House counselor Kellyanne Conway crawled...,News,"August 24, 2017",0,WATCH: Kellyanne Conway Very Upset Hillary Cl...,watch kellyanne conway upset hillary clinton t...
4,3492,"GOP Gives Trump The Middle Finger, Prepares T...",Donald Trump may have decided that Russia is g...,News,"December 9, 2016",0,"GOP Gives Trump The Middle Finger, Prepares T...",gop give trump middle finger prepares launch p...


#### Implement MinHashing

In [38]:
class MinHash:
  def __init__(self, num_hashes=100, seed=42):
    """
    Initialize MinHash with specified number of hash functions  
    """
    self.num_hashes = num_hashes
    np.random.seed(seed)
    
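    # Large prime modulus for hashing
    self.prime = 2147483647 # Mersenne prime 2^31 - 1, the largest signed 32-bit integer
    
    # Generate random parameters for hash functions (a*x + b) % p
    # int64 keeps a*x + b (always below 2^63) from overflowing on every platform
    self.a = np.random.randint(1, self.prime, size=self.num_hashes, dtype=np.int64)
    self.b = np.random.randint(0, self.prime, size=self.num_hashes, dtype=np.int64)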
  
  def hash_function(self, x, index):
    """
    Hash function: (a * x + b) % p
    """
    return ((self.a[index] * x + self.b[index]) % self.prime)
  
  def compute_signiture(self, shingles):
    """
    Compute the MinHash signature for a set of shingles 
    """
    # Convert shingles to non-negative 31-bit integers using Python's built-in hash().
    # Note: str hashes are salted per process unless PYTHONHASHSEED is set,
    # so signatures are only comparable within a single run.
    shingle_hashes = [hash(s) & 0x7fffffff for s in shingles]
    
    # Initialize every slot to the prime, which is larger than any hash value
    # (hash values lie in [0, prime - 1]) and, unlike np.inf, is safe to cast to int
    signature = np.full(self.num_hashes, self.prime, dtype=np.int64)
    
    # For each shingle, update the signature 
    for shingle_hash in shingle_hashes:
      
      # For each hash function, keep the smallest hash value seen so far
      for i in range(self.num_hashes):
        hash_value = self.hash_function(shingle_hash, i)
        signature[i] = min(signature[i], hash_value)
        
    return signature.astype(np.int32)
  
  def jaccard_similarity(self, sig1, sig2):
    """
    Estimate the Jaccard similarity (|intersection| / |union|) of two shingle sets
    from their MinHash signatures: the fraction of signature positions that agree
    """
    # Count how many signature positions match 
    matches = np.sum(sig1 == sig2)
    
    return matches / self.num_hashes
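
The double loop in the class above can be replaced by NumPy broadcasting, and the built-in `hash()` by a hash that stays the same between runs. The sketch below is one way to do that under stated assumptions: `zlib.crc32` is swapped in for `hash()`, and `make_hash_params` and `minhash_signature` are hypothetical helper names rather than part of the class.

```python
import zlib
import numpy as np

PRIME = 2147483647  # 2^31 - 1, the same modulus used in the class above

def make_hash_params(num_hashes=100, seed=42):
    """Draw the (a, b) coefficients for num_hashes hash functions (a*x + b) % PRIME."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, PRIME, size=num_hashes, dtype=np.int64)
    b = rng.integers(0, PRIME, size=num_hashes, dtype=np.int64)
    return a, b

def minhash_signature(shingles, a, b):
    """MinHash signature of a set of string shingles, one minimum per hash function."""
    if not shingles:
        # Empty document: PRIME is a sentinel larger than any real hash value
        return np.full(len(a), PRIME, dtype=np.int64)
    # crc32 is stable across Python processes, unlike the built-in hash() for str
    x = np.array([zlib.crc32(s.encode("utf8")) & 0x7FFFFFFF for s in shingles],
                 dtype=np.int64)
    # Rows are hash functions, columns are shingles: (a_i * x_j + b_i) mod p
    hashed = (a[:, None] * x[None, :] + b[:, None]) % PRIME
    return hashed.min(axis=1)

a, b = make_hash_params()
doc1 = {"trump", "senate", "vote", "bill"}
doc2 = {"trump", "senate", "vote", "court"}
sig1 = minhash_signature(doc1, a, b)
sig2 = minhash_signature(doc2, a, b)
# Fraction of matching positions estimates the true Jaccard similarity (3/5 here)
estimate = np.mean(sig1 == sig2)
print(f"Estimated Jaccard: {estimate:.2f}")
```

The estimate is noisy with 100 hash functions, so it will usually land near the true value rather than exactly on it. Two empty documents would share the sentinel signature and look identical, which may or may not be what is wanted.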

   

### Use Minhashing on data

In [40]:
minhash = MinHash(num_hashes=100, seed=42)

# Compute signatures for all documents in the dataset
signitures = []
for i, row in df.iterrows():
    # Use the set of unique words (1-word shingles) as this document's shingles
    shingles = set(row['processed_content'].split())
    
    # Compute minhash signature
    sig = minhash.compute_signiture(shingles)
    
    # Append to list
    signitures.append(sig)

signitures = np.array(signitures)

# Store signatures in the dataframe for later use 
df['signature'] = signitures.tolist()
print(f"Signatures shape: {signitures.shape}")
print(df[['signature', 'label']].head(10))

Signatures shape: (200, 100)
                                           signature  label
0  [125626631, 128955190, 398614528, 189687261, 9...      0
1  [8452765, 1451797, 1259864, 2186467, 4668166, ...      0
2  [11899346, 37956018, 24625774, 155154217, 4668...      0
3  [25600516, 29695097, 42361211, 1884025, 466816...      0
4  [25804383, 3057192, 19125582, 2039817, 6089831...      0
5  [26640613, 1780257, 6064357, 11306456, 4668166...      0
6  [16276679, 2932647, 752375, 28135788, 25923204...      0
7  [5323780, 1780257, 8874933, 2791478, 4668166, ...      0
8  [5088999, 2644040, 104095, 1368453, 4668166, 2...      0
9  [22723904, 34735803, 24625774, 3400311, 380828...      0


### Compute Similarity Matrix

In [43]:
def compute_similarity_matrix(signitures):
  """Compute pairwise similarity (fraction of matching signature positions) for all documents"""
  
  n = len(signitures)
  similarity_matrix = np.zeros((n,n))
  
  for i in range(n): 
    # A document is always 100% similar to itself
    similarity_matrix[i,i] = float(1.0)
    
    for j in range(i+1, n):
      similarity = np.sum(signitures[i] == signitures[j]) / len(signitures[i])
      similarity_matrix[i,j] = similarity
      similarity_matrix[j,i] = similarity
      
    if i % 100 == 0 and i > 0:
      print(f"Processed {i}/{n} documents")
  
  return similarity_matrix

similarity_matrix = compute_similarity_matrix(signitures)

Processed 100/200 documents
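
For a few hundred documents the pairwise loop can also be written as a single broadcast comparison. This is a sketch of that alternative, and `similarity_matrix_vectorized` is a hypothetical name. It builds an (n, n, num_hashes) boolean array, which is about 4 MB for 200 documents with 100 hashes but would be far too large for tens of thousands of documents.

```python
import numpy as np

def similarity_matrix_vectorized(signatures):
    """Pairwise fraction of matching signature positions, computed by broadcasting."""
    signatures = np.asarray(signatures)
    # Compare every signature with every other: shape (n, n, num_hashes)
    matches = signatures[:, None, :] == signatures[None, :, :]
    return matches.mean(axis=2)

# Toy signatures (hypothetical values) with 4 hash positions each
toy_signatures = np.array([
    [1, 2, 3, 4],
    [1, 2, 9, 9],
    [7, 8, 9, 9],
])
sim = similarity_matrix_vectorized(toy_signatures)
print(sim)
```

Here documents 0 and 1 agree in two of four positions (0.5), documents 1 and 2 also in two (0.5), and documents 0 and 2 in none (0.0).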


### Cluster documents into real and fake by Jaccard measure
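
K-means needs vectors it can average, which a similarity matrix does not provide. One option that works directly on pairwise distances is k-medoids, using `1 - similarity` as the distance. The sketch below is an assumed approach, not the notebook's method. `k_medoids` is a hypothetical name, the toy matrix is made up, and the deterministic farthest-first start is a design choice to avoid seed-dependent results.

```python
import numpy as np

def k_medoids(dist, k=2, n_iter=100):
    """Alternating k-medoids on a precomputed, symmetric (n, n) distance matrix."""
    dist = np.asarray(dist, dtype=float)
    # Deterministic start: the most central document, then repeatedly the
    # document farthest from the medoids chosen so far
    medoids = [int(dist.sum(axis=1).argmin())]
    while len(medoids) < k:
        medoids.append(int(dist[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        # Assign every document to its nearest medoid
        assignments = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assignments == c)
            if len(members) == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total distance to the others
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return dist[:, medoids].argmin(axis=1), medoids

# Toy similarity matrix (hypothetical): documents 0-2 resemble each other, as do 3-4
toy_sim = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.9, 0.0, 0.1],
    [0.7, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.6],
    [0.0, 0.1, 0.1, 0.6, 1.0],
])
cluster_ids, medoid_ids = k_medoids(1.0 - toy_sim, k=2)
print(cluster_ids, medoid_ids)
```

On the notebook's data the call would be `k_medoids(1.0 - similarity_matrix, k=2)`, and the resulting cluster ids can then be compared with `df['label']`.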

### Feature Extraction (Figure out how it works later)

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.85)

# Fit and transform the processed content
X = vectorizer.fit_transform(df['processed_content'])
feature_names = vectorizer.get_feature_names_out()
y = df['label']  # Labels: 0 = fake, 1 = real

print(f"TF-IDF matrix shape: {X.shape}")

TF-IDF matrix shape: (20000, 5000)
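
To see what the vectorizer computes, the weights can be worked out by hand on a tiny corpus. The sketch below follows the formula scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): raw term count times `ln((1 + n) / (1 + df)) + 1`, then each row scaled to unit length. The three example documents are made up for illustration.

```python
import numpy as np

docs = ["trump senate vote", "trump court ruling", "senate vote vote"]
vocab = sorted({w for d in docs for w in d.split()})

# Raw term counts: one row per document, one column per vocabulary term
counts = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)             # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency

tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

for term, weight in zip(vocab, tfidf[1]):
    print(f"{term:>8}: {weight:.3f}")
```

In the second document, `court` and `ruling` get a higher weight than `trump` because they appear in only one document. In the cell above, `min_df=5` drops terms found in fewer than 5 documents, `max_df=0.85` drops terms found in more than 85% of documents, and `max_features=5000` keeps the 5000 most frequent remaining terms.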


### Setting up MinHash


In [8]:
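# Import under an alias so the library class does not shadow the MinHash class defined above
from datasketch import MinHash as DatasketchMinHash

# Create MinHash signatures 
def create_minhash_signature(shingles, num_perm=128):
    """
    Create a datasketch MinHash object for a set of string shingles
    """ 
    m = DatasketchMinHash(num_perm=num_perm)
    for s in shingles:
      m.update(s.encode('utf8'))
    
    return m 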

df['minhash'] = df['shingles'].apply(lambda x: create_minhash_signature(x, num_perm=128))
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,subject,date,label,content,processed_content,shingles,minhash
0,13474,ABOUT HILLARY’S COUGH: We Discovered The Secre...,,politics,"Jul 20, 2016",0,ABOUT HILLARY’S COUGH: We Discovered The Secre...,hillary cough discovered secret keep coughing ...,"{cough discovered, keep coughing, secret keep,...",<datasketch.minhash.MinHash object at 0x1969bd...
1,11994,BREAKING: OBAMACARE REPEAL Clears First Hurdle...,The Senate voted 51-48 this afternoon to proce...,politics,"Jan 4, 2017",0,BREAKING: OBAMACARE REPEAL Clears First Hurdle...,breaking obamacare repeal clear first hurdle d...,"{brady texas, republican responsible, consider...",<datasketch.minhash.MinHash object at 0x180350...
2,19179,‘SLEEPY’ JUSTICE GINSBURG: Excites Crowd By Sa...,So much for the SCOTUS not being political Che...,left-news,"Feb 7, 2017",0,‘SLEEPY’ JUSTICE GINSBURG: Excites Crowd By Sa...,sleepy justice ginsburg excites crowd saying b...,"{deal woman, still dark, abolition electoral, ...",<datasketch.minhash.MinHash object at 0x1969b6...
3,501,WATCH: Kellyanne Conway Very Upset Hillary Cl...,White House counselor Kellyanne Conway crawled...,News,"August 24, 2017",0,WATCH: Kellyanne Conway Very Upset Hillary Cl...,watch kellyanne conway upset hillary clinton t...,"{child around, major party, reality show, c tr...",<datasketch.minhash.MinHash object at 0x1969bd...
4,3492,"GOP Gives Trump The Middle Finger, Prepares T...",Donald Trump may have decided that Russia is g...,News,"December 9, 2016",0,"GOP Gives Trump The Middle Finger, Prepares T...",gop give trump middle finger prepares launch p...,"{team probe, worthy examination, going investi...",<datasketch.minhash.MinHash object at 0x1969bc...
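
The cell above reads `df['shingles']`, which none of the cells shown here create. The values in the output (for example `cough discovered` and `keep coughing`) look like two-word shingles of `processed_content`. Under that assumption, a helper along these lines would produce them. `word_shingles` is a hypothetical name.

```python
def word_shingles(text, k=2):
    """Return the set of k-word shingles (space-joined) of a whitespace-tokenized string."""
    tokens = str(text).split()
    if len(tokens) < k:
        # Too short for a k-word shingle: use the whole text as one shingle, if any
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

example = word_shingles("hillary cough discovered secret keep coughing")
print(sorted(example))
```

It would need to run before the MinHash cell, for instance as `df['shingles'] = df['processed_content'].apply(word_shingles)`.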
