# Misinformation Beat Notebook

This file is used to double-check the results of our research on the Misinformation Beat.

Since the main data was pulled via ProQuest TDM Studio, the raw XML data is not available for double-checking.

Therefore, this file should be used to check the logic of the code. Only the imports and the cells from the ANALYSIS portion onward can be run locally.

Run each cell in order.





==================================================================

"We used Proquest TDM Studio to access the ProQuest “U.S. Major Dailies” dataset which includes historical archives of The Chicago Tribune, Los Angeles Times, New York Times, Wall Street Journal, and Washington Post. We searched all articles that used the phrases “misinformation,” “disinformation,” “propaganda,” “conspiracy theory,” “conspiracy theories” and “fake news” from January 1, 1980 to April 24, 2025.The data received from ProQuest came in the form of XML files for each of the 231,992 articles that met the criteria."



Step 1: Extract From XML Files
All XML files resulting from the query are stored in the data/*search term* folder (*INACCESSIBLE*).
Results of each extraction will be in the FinalTDMOutputs/ folder.

Note: This step cannot be recreated here because the XML files are contained within the ProQuest TDM environment. It is meant for code review only.


Imports

In [None]:
#Imports
import xml.etree.ElementTree as ET
import pandas as pd
import re
import os
import random  # only needed for the optional sampling lines in MAIN
import nltk
import time
import datetime
import glob
import csv

from collections import Counter
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from tqdm import tqdm

# nltk.download('stopwords')  # uncomment on first run if the stopwords corpus is not installed


os.makedirs('FinalTDMOutputs', exist_ok=True)
os.makedirs('FullClean', exist_ok=True)
os.makedirs('MatrixResults', exist_ok=True)
os.makedirs('MatrixResults/Main', exist_ok=True)
os.makedirs('MatrixResults/TermSearchMatrix', exist_ok=True)


print('done with imports')

"The data received from ProQuest came in the form of XML files for each of the 231,992 articles that met the criteria. To accurately parse the data, we removed any markup tags in order to ensure that only the readable text was captured. To clean the dataset and correct for any issues from text identification during scanning, we removed all non-alphanumeric and non-ASCII characters other than hyphens and ending punctuations; question marks and periods were replaced with spaces. Since hyphens are often attached to the keywords listed above, we removed hyphens that preceded or trailed a given keyword."

===================================================================

#Clean the text


- Change keyword based on search as it defaults to 'disinformation'

In [None]:
def clean_and_correct_text(text, keyword='disinformation'):
    if not text or text.strip() == "":
        return "N/A"

    text = re.sub(r"<.*?>", "", text) #Removes markup text like <tag>
    text = re.sub(r"[^a-zA-Z0-9\s\-\.\?]+", '', text)  # Keeps ASCII letters, digits, whitespace, hyphens, periods and question marks
    text = text.replace('.', ' ') #Replace periods with space
    text = text.replace('?', ' ') #Replace question marks with space (they are kept by the line above so this step has an effect)
    pattern = rf'(?<=\b{re.escape(keyword)})-(?=\w)|(?<=\w)-(?=\b{re.escape(keyword)})' #Match hyphens attached to the keyword (keep them elsewhere)
    text = re.sub(pattern, ' ', text, flags=re.IGNORECASE) #Replace those hyphens with a space, whatever the keyword's capitalization

    return text

print('Done with building clean function!')
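
The hyphen rule in the cleaning function can be checked in isolation. The sketch below is only an illustration: `split_keyword_hyphens` is a hypothetical helper name, and it reuses the same lookbehind/lookahead pattern so that a hyphen is replaced only when it touches the keyword.

```python
import re

def split_keyword_hyphens(text, keyword):
    """Replace hyphens that touch `keyword` with a space; leave other hyphens alone."""
    kw = re.escape(keyword)
    # First alternative: hyphen directly after the keyword.
    # Second alternative: hyphen directly before the keyword.
    pattern = rf'(?<=\b{kw})-(?=\w)|(?<=\w)-(?=\b{kw})'
    return re.sub(pattern, ' ', text)

sample = "anti-disinformation and disinformation-related but well-known"
print(split_keyword_hyphens(sample, "disinformation"))
# anti disinformation and disinformation related but well-known
```

Hyphens in unrelated words such as "well-known" are kept, so only the keyword becomes a separate token.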

"We then extracted every use of each of these search terms throughout all of the articles from each dataset. We included the 25 words that preceded and followed each instance of each search term to capture its context. Within a single article, multiple occurrences of each search term were counted separately."


================================================
# Extract Data

This function extracts data from an individual XML file and returns a list of records, each containing the GOID, Publisher, Title, Numeric Date, and the 50 surrounding words (25 before and 25 after the keyword).
#This function calls the clean_and_correct_text function defined above.


- Change keyword based on search as it defaults to 'disinformation'


In [None]:
# Function to extract metadata and concordances
# Returns False if the file could not be processed, otherwise a list of concordances (empty if the keyword does not occur)
def extract_data_from_xml(file_path, keyword='disinformation', window=25):
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()

        # Extract metadata fields
        goid = root.find(".//GOID").text if root.find(".//GOID") is not None else "N/A"
        publisher = root.find(".//publisher/PublisherName").text if root.find(".//publisher/PublisherName") is not None else "N/A"
        title = root.find(".//Title").text if root.find(".//Title") is not None else "N/A"
        numeric_date = root.find(".//NumericDate").text if root.find(".//NumericDate") is not None else "N/A"

        # Extract article text
        text_elem = root.find(".//Text")
        full_text = text_elem.text if text_elem is not None and text_elem.text else "N/A"
        
        # If phrase is more than one word, capture its length
        phrase_words = keyword.lower().split()
        phrase_len = len(phrase_words)
        
        theConcordances = []

        # Split text into words
        words = clean_and_correct_text(full_text, keyword).split()
        
        # Find occurrences of the keyword and store concordances.
        # The phrase length determines how many words are compared at each position
        for i in range(len(words) - phrase_len + 1): #For every start position where the whole phrase still fits (including the last one)
            window_slice = words[i:i + phrase_len] #look at number of words equal to the length of the phrase
            if [w.lower() for w in window_slice] == phrase_words: #If the words in the window are equal to the phrase words
                left_context = " ".join(words[max(0, i - window): i])
                right_context = " ".join(words[i + phrase_len: min(len(words), i + phrase_len + window)]) #up to `window` words after the phrase
                
                # An empty context must stay empty: clean_and_correct_text would turn "" into "N/A"
                cleaned_left = clean_and_correct_text(left_context).lower().split() if left_context else []  #Split this into individual words
                cleaned_left += [""] * (25 - len(cleaned_left)) #Pad to the 25 output columns
                cleaned_right = clean_and_correct_text(right_context).lower().split() if right_context else [] #Split this into individual words
                cleaned_right += [""] * (25 - len(cleaned_right)) #Pad to the 25 output columns
                
                # Append extracted data: metadata, 25 left-context columns, the keyword, 25 right-context columns
                record = {
                    "GOID": goid,
                    "Publisher": publisher.lower(),
                    "Title": clean_and_correct_text(title).lower(),
                    "NumericDate": numeric_date,
                }
                record.update({f"Left Context{n + 1}": cleaned_left[n] for n in range(25)})
                record["Keyword"] = keyword
                record.update({f"Right Context{n + 1}": cleaned_right[n] for n in range(25)})
                theConcordances.append(record)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return False
    return theConcordances

def append_csv(source_path, target_path):
    with open(source_path, 'r', newline='') as src, open(target_path, 'a', newline='') as tgt:
        reader = csv.reader(src)
        writer = csv.writer(tgt)

        next(reader)  # Skip header in source
        for row in reader:
            writer.writerow(row)


print('extraction and split functions ready as well!!!')
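
The windowing logic can be tried on a short word list without any XML. This is a minimal sketch, not the extraction function itself: `find_contexts` is a hypothetical name, and it assumes the last possible start position should also be checked so that a keyword at the very end of an article is counted.

```python
def find_contexts(words, phrase, window=25):
    """Return (left, right) word lists for every occurrence of `phrase` in `words`."""
    phrase_words = phrase.lower().split()
    n = len(phrase_words)
    hits = []
    # "+ 1" so that a phrase ending on the last word is also checked
    for i in range(len(words) - n + 1):
        if [w.lower() for w in words[i:i + n]] == phrase_words:
            left = words[max(0, i - window):i]
            right = words[i + n:i + n + window]
            hits.append((left, right))
    return hits

words = "they called it fake news and more fake news".split()
for left, right in find_contexts(words, "fake news", window=2):
    print(left, right)
# ['called', 'it'] ['and', 'more']
# ['and', 'more'] []
```

Each occurrence produces its own record, which matches the rule that multiple occurrences within one article are counted separately.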


# MAIN 
#This is the main driver that uses the above functions. It loops through every XML file, ensures that every instance of the keyword is captured, then stores the results as a CSV.

#It is a problem if several "is in the text" or "is NOT in the text" messages appear, since each one marks a file that could not be processed.
#Prints progress every 1000 files.

#The result should be a folder called "FinalTDMOutputs" with the relevant CSVs corresponding to each search term.

- Change the folder path and keywords based on the term.

In [None]:
# Define the folder containing XML files
folder_path = "data/MisinformationTerms"  # Adjust if needed
keywords = ['disinformation', 'misinformation', 'conspiracy theory', 'conspiracy theories', 'propaganda', 'fake news']
window = 25

# List to store extracted data
print('Begin MAIN')

for word in keywords:
    concordances = [] 
    articlesMinusTerm = 0
    if os.path.exists(folder_path):
        xml_files = [f for f in os.listdir(folder_path) if f.endswith('.xml')]
        #Use this for sampling
        #sample_size = min(200, len(xml_files))
        #sampled_files = random.sample(xml_files, sample_size)
        print('begin loop')
        
        
        for i, file_name in enumerate(xml_files): #Change function to take in sampled_files for sampling
            file_path = os.path.join(folder_path, file_name)
            concordance = extract_data_from_xml(file_path, word, window)
        
            #extract_data_from_xml returns False only when the file could not be processed, so this branch is error checking:
            #it reports whether the keyword appears in the raw text of the failed file.
            if concordance is False:
                articlesMinusTerm += 1
                print('#################')
                try:
                    tree = ET.parse(file_path)
                    root = tree.getroot()

                    text_elem = root.find(".//Text")
                    full_text = text_elem.text if text_elem is not None and text_elem.text else "N/A"
                    if word in full_text.lower():
                        print(f'{word} is in the text')
                    else:
                        print(f'{word} is NOT in the text')
                except ET.ParseError as e:
                    print(f'Could not re-parse {file_path}: {e}')
            else:
                concordances.extend(concordance)
                
            if (i + 1) % 1000 == 0:  # Show progress every 1000 files
                print(f"Processed {i+1} out of {len(xml_files)} files")
                #print('current articlesMinusTerm', articlesMinusTerm)

    # Convert extracted data to DataFrame
    print(len(concordances))
    df_concordances = pd.DataFrame(concordances)

    # Save to CSV        
    filepath = 'FinalTDMOutputs'
    filename = 'Cleaned_Full_Win' + str(window) + '_' + word.replace(" ", "_") + '.csv'
        
    os.makedirs(filepath, exist_ok=True)
    out_path = os.path.join(filepath, filename)
    df_concordances.to_csv(out_path, index=False)
    
    # Rows for "conspiracy theories" are also appended to the "conspiracy theory" file
    if word == "conspiracy theories":
        theory_path = os.path.join(filepath, f'Cleaned_Full_Win{window}_conspiracy_theory.csv')
        append_csv(out_path, theory_path)
        print(f'Extracted data added to {theory_path}')
    else: 
        print(f"Extracted data saved to {out_path}")

From this we should have a folder called "FinalTDMOutputs" containing the csvs for each of the five terms. 
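
As a quick sanity check, the file names that MAIN writes can be derived from the keyword list and compared with what is on disk. This is a sketch under the naming scheme used in the save step; `output_name` is a hypothetical helper and the folder name is taken from the cell above.

```python
import os

keywords = ['disinformation', 'misinformation', 'conspiracy theory',
            'conspiracy theories', 'propaganda', 'fake news']
window = 25

def output_name(word, window):
    """Build the CSV name the same way the save step does."""
    return 'Cleaned_Full_Win' + str(window) + '_' + word.replace(' ', '_') + '.csv'

expected = [output_name(w, window) for w in keywords]
# List any expected file that is not present in the output folder
missing = [name for name in expected
           if not os.path.exists(os.path.join('FinalTDMOutputs', name))]
print('Missing files:', missing)
```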

Now we move to step 2

=====================================================

# -----Analysis----- 
===========================================================

========================================================

# Duplicate Removal

"We then removed any duplicate title/context combinations."

#Remove duplicates for all the files
#Outputs files into a folder called FullClean/

In [None]:
def remove_duplicates_and_save(file_paths, output_folder="FullClean"):
    os.makedirs(output_folder, exist_ok=True)
    report = {}

    for path in file_paths:
        df = pd.read_csv(path)
        original_count = len(df)

        # Normalize titles for comparison
        if 'Title' in df.columns:
            df['title_key'] = df['Title'].astype(str).str.strip().str.lower() #lowercase 
        else:
            df['title_key'] = ''

        # Concatenate all context columns into one string (empty cells become empty strings)
        context_cols = [col for col in df.columns if 'Context' in col]
        df['context_key'] = df[context_cols].fillna('').astype(str).apply(lambda x: ' '.join(x), axis=1)

        # Remove duplicates based on title and context content
        df_cleaned = df.drop_duplicates(subset=['title_key', 'context_key'], keep='first') #keep the first occurrence of each title/context pair

        # Drop helper columns
        df_cleaned = df_cleaned.drop(columns=['title_key', 'context_key'])

        # Save cleaned file
        filename = os.path.basename(path)
        cleaned_path = os.path.join(output_folder, filename)
        df_cleaned.to_csv(cleaned_path, index=False)

        # Record and report
        cleaned_count = len(df_cleaned)
        removed_count = original_count - cleaned_count
        report[filename] = removed_count
        print(f" {removed_count} duplicates removed — saved to {cleaned_path}")

    return report

file_list = glob.glob("FinalTDMOutputs/*.csv")
report = remove_duplicates_and_save(file_list)

# Print summary if wanted. 
# print("\n Duplicate Removal Report:")
# for file, removed in report.items():
#     print(f"{file}: {removed} duplicates removed")
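
The duplicate rule (same normalized title plus same context words) can also be expressed without pandas, which makes it easy to test on a few hand-written rows. This is a plain-Python sketch of the same idea, not the function used above; `dedupe_rows` and the sample rows are hypothetical.

```python
def dedupe_rows(rows):
    """Keep the first row for each (normalized title, joined context) pair."""
    seen = set()
    kept = []
    for row in rows:
        title_key = str(row.get('Title', '')).strip().lower()
        # Join every context column, in column order, into one comparison string
        context_key = ' '.join(str(v) for k, v in row.items() if 'Context' in k)
        key = (title_key, context_key)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [
    {'Title': 'Election Rumors ', 'Left Context1': 'spread', 'Right Context1': 'online'},
    {'Title': 'election rumors', 'Left Context1': 'spread', 'Right Context1': 'online'},
    {'Title': 'election rumors', 'Left Context1': 'fight', 'Right Context1': 'online'},
]
print(len(dedupe_rows(rows)))  # 2
```

The second row is dropped because it differs from the first only in title capitalization and trailing whitespace; the third is kept because its context differs.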

=======================================================================================

"We removed all multi-letter stopwords (e.g. “the” “and” “of”) using the Natural Language Toolkit package, and augmenting their stopwords with our own list developed during data exploration (See Appendix). We further cleaned the text using a set of stem mappings developed during exploration (i.e. turning ‘ukrainian’ to ‘ukraine’, ‘trumps’ to ‘trump’, or ‘coronavirus’ to ‘covid19’)(See Appendix). We also included a custom list of n-grams based on data exploration to better capture core concepts; (i.e., “fake news”, “white house”, “u k” or “u s s r”) (See Appendix). We then removed any remaining single-letter words that were not a part of an n-gram."


# Hardcoded ngrams and term mappings

These functions clean the text and apply ngrams in the process of analysis. They are used later in the script. 

In [None]:
# Hardcoded term mappings (can be expanded as needed)
TERM_MAPPINGS = {
    
    #Platforms
    'facebooks' : 'facebook',
    'twitters': 'twitter',
    'tweet': 'twitter',
    'tweeting': 'twitter',
    'tweeted': 'twitter',
    'tiktoker': 'tiktok',  #keys must be lowercase because words are lowercased before lookup
    'youtuber': 'youtube',
    'redditor': 'reddit',
    'militarys': 'military',
    'militaries': 'military',
    
    #Trump
    'trumps': 'trump',
    'presidents': 'president',
    'presidential': 'president',
    'republicans': 'republican',
    
     #Nations
    'koreas': 'korea',
    'korean': 'korea',
    
    'mexican': 'mexico',
    'mexicans': 'mexico',
  
    'russias': 'russia',
    'russian': 'russia',
    'kremlin': 'russia',
    'putins': 'putin',

    'chinese': 'china',
    'chinas': 'china',
   
    'ukraines': 'ukraine',
    'ukrainian': 'ukraine',
    'ukrainians': 'ukraine',
    

    'germanys': 'germany',
    'german': 'germany',
    
    'israels': 'israel',
    'israeli': 'israel',
    'israelis': 'israel',
    
    'iranian': 'iran',
    'irans': 'iran',
    'iranians': 'iran',
    
    'turkeys': 'turkey',
    'turkish': 'turkey',
    
    'french': 'france',
    'frances': 'france',
    
    'indian': 'india',
    'indias': 'india',
     
    'pakistani': 'pakistan',
    'pakistans': 'pakistan',
    
    'european': 'europe',
    'europes': 'europe',
    #Other countries added as needed
    
    #America
    'american': 'america',
    'americas': 'america',
    'americans': 'america',
    
    #COVID
    'vaccines': 'vaccine',
    'vaccination': 'vaccine',
    'vaccinations': 'vaccine',
    'vaccinated': 'vaccine',
    'vaccinate': 'vaccine',
    'coronavirus': 'covid19',
    'covid': 'covid19',
    
    #People
    'conways': 'conway',
    'clintons': 'clinton',
    'democrats': 'democrats',
    
    #Words
    'campaigns': 'campaign',
    'elections':'election',
    'spreading': 'spread',
    'spreads': 'spread',
    'platforms': 'platform',
    'officials': 'official',
    'governments': 'government',
    'users': 'user',
    'uses': 'use',
    'used': 'use',
    'journalists': 'journalist',
    'medias': 'media',
    'years': 'year',
    'wars': 'war',
    'groups': 'group',
    'terrorists': 'terrorist',
    'posts': 'post',
    'posting': 'post',
    'posted': 'post',
    'sites': 'site',
    'claimed': 'claim',
    'claims': 'claim',
    'companies': 'company',
    'companys': 'company',
    'accounts': 'account',
    'efforts': 'effort',
    'speeches': 'speech',
    'stories': 'story',
    'outlets': 'outlet',
    'including': 'include',
    'inclusion': 'include', 
    'worlds': 'world',
        
}


MANUAL_EXCLUDE = {'also', 'said', 'mr', 'us', 'would', 'one', 'people', 'like', 'many', 'could', '', '--', ' ', 'de', 'la'}
# N-GRAMS
BIGRAMS = {
    ('u', 'k'): 'uk',
    ('u', 'n'): 'un',
    ('u', 's'): 'us',
    ('e', 'u'): 'eu',
    ('jan', '6'): 'jan 6',
    ('fake', 'news'): 'fake news',
    ('white', 'house'): 'white house',
    ('social', 'media'): 'social media',
    ('new', 'media'): 'new media',
    ('islamic', 'state'): 'islamic state',
    ('press', 'secretary'): 'press secretary',
    ('donald', 'trump'): 'trump',
    ('president', 'trump'): 'trump',
    ('president', 'biden'): 'biden',
    
    ('alternative', 'facts'): 'alternative facts',
    ('alternative', 'fact'): 'alternative facts',
    ('conspiracy', 'theory'): 'conspiracy theory',
    ('conspiracy', 'theories'): 'conspiracy theory',
    ('kellyanne', 'conway'): 'conway',
    ('sean', 'spicer'): 'spicer',
    ('a', 'i'): 'artificial intelligence',
    ('artificial', 'intelligence'): 'artificial intelligence',
    ('state', 'sponsored'): 'state sponsored',
    ('soviet', 'union'): 'ussr',
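    ('european', 'union'): 'europe',
    ('europe', 'union'): 'europe',  #'european' is mapped to 'europe' by TERM_MAPPINGS before n-grams are applied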
    
    ('north', 'korea'): 'north korea',
    ('south', 'korea'): 'south korea',
    
    ('hillary', 'clinton'): 'clinton',
}
TRIGRAMS = {
    
    ('u', 's', 'a'): 'usa',
    ('f', 'b', 'i'): 'fbi',
    ('c', 'i', 'a'): 'cia',
    ('g', 'r', 'u'): 'gru',
    ('k', 'g', 'b'): 'kgb',
    ('u', 'f', 'o'): 'ufo',
    ('u', 'a', 'p'): 'uap',
    ('i', 'r', 'a'): 'ira',
    ('internet', 'research', 'agency'): 'ira',
    ('social', 'media', 'platform'): 'social media',
}
QUADGRAMS = {
    ('u', 's', 's', 'r'): 'ussr',
}

#taking these single letters out of the stopwords list. We need them to build n-grams such as ('u', 's'), ('a', 'i') and ('u', 'f', 'o')
fixed = stopwords.words('english')
for letter in ('s', 't', 'i', 'a', 'o'):
    if letter in fixed:
        fixed.remove(letter)
stop_words = set(fixed)
stop_words.update(MANUAL_EXCLUDE)

def apply_stemming_and_cleaning(word):
    """
    Lowercases and applies term mappings. Returns 0 for stopwords so that callers can skip them.
    """
    word = word.lower()
    if word in TERM_MAPPINGS:
        word = TERM_MAPPINGS[word]
    if word in stop_words:
        return 0
    return word


 #applies existing ngrams to a row of words. 

def apply_ngrams(word_list):
    """
    Scans a list of words and replaces matching bigrams, trigrams, quadgrams.
    """
    i = 0
    result = []
    while i < len(word_list):
        
        # Check quadgrams first
        if i + 3 < len(word_list) and tuple(word_list[i:i+4]) in QUADGRAMS: #If there are more opportunities for QGs, and the next four letters form a quadgram
            result.append(QUADGRAMS[tuple(word_list[i:i+4])]) #add the quadgram  to the main list
            i += 4
        # Then trigrams
        elif i + 2 < len(word_list) and tuple(word_list[i:i+3]) in TRIGRAMS:
            result.append(TRIGRAMS[tuple(word_list[i:i+3])])
            i += 3
        # Then bigrams
        elif i + 1 < len(word_list) and tuple(word_list[i:i+2]) in BIGRAMS:
            #print("found", (BIGRAMS[tuple(word_list[i:i+2])]))
            # if (BIGRAMS[tuple(word_list[i:i+2])] == 'fake news'):
            #     print('Fake news FOUND')
            result.append(BIGRAMS[tuple(word_list[i:i+2])])
            i += 2
        else:
            if len(word_list[i]) != 1: #keep the word unless it is a single letter
                result.append(word_list[i])
            i += 1
    return result

print('Stemming and mapping functions created')
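
The n-gram pass tries the longest match first and drops leftover single letters. The sketch below shows that behavior on a toy word list. It is an illustration only: `merge_ngrams` and the small `NGRAMS` table are hypothetical, and it folds the three lookup tables into one dictionary for brevity.

```python
# Small illustrative table; the notebook keeps separate BIGRAMS, TRIGRAMS and QUADGRAMS
NGRAMS = {
    ('u', 's', 's', 'r'): 'ussr',
    ('f', 'b', 'i'): 'fbi',
    ('u', 's'): 'us',
    ('white', 'house'): 'white house',
}

def merge_ngrams(words, ngrams=NGRAMS, max_n=4):
    """Greedily replace the longest matching n-gram at each position."""
    result = []
    i = 0
    while i < len(words):
        for n in range(max_n, 1, -1):  # try 4-grams, then 3-grams, then 2-grams
            key = tuple(words[i:i + n])
            if len(key) == n and key in ngrams:
                result.append(ngrams[key])
                i += n
                break
        else:
            # No n-gram starts here: keep the word unless it is a single letter
            if len(words[i]) > 1:
                result.append(words[i])
            i += 1
    return result

print(merge_ngrams(['the', 'u', 's', 's', 'r', 'and', 'u', 's', 'x', 'white', 'house']))
# ['the', 'ussr', 'and', 'us', 'white house']
```

Checking the longest n-gram first matters: otherwise "u s s r" would be consumed as "us" followed by two stray letters.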

# Helper to deal with filenames

In [None]:
def get_output_filename(filepath, start, end, termName=None):
    # Extract the filename without extension
    file_name = os.path.basename(filepath)
    base_name = os.path.splitext(file_name)[0]

    # Remove 'Cleaned_Full' prefix if present
    if base_name.startswith("Cleaned_Full"):
        base_name = base_name.replace("Cleaned_Full", "").strip('_')

    if termName:
        base_name = f'{termName}_{base_name}_Matrix_{start}_{end}.csv'
    else:
        base_name = f'{base_name}_Matrix_{start}_{end}.csv'

    return base_name
#todo, create new ngrams automatically
print('get filename function created')

# Process Single CSV 
Function to process individual csv and save the result in a table of the top terms by year

In [None]:
def process_single_csv(filepath, top_n=20, start_year=None, end_year=None):
    WordCountsByYear = {}
    #=== Step 1: Load CSV into DataFrame ===
    df = pd.read_csv(filepath)
    # === Step 2: Identify context columns ===
    context_columns = [col for col in df.columns if "Context" in col]
    # === Step 3: Loop through rows and check year consistency ===
    with tqdm(total=len(df), desc="Processing rows") as pbar:
        for index, row in df.iterrows():
            pbar.update(1)
            numeric_date = row.get("NumericDate")
            # Skip rows without a valid date
            try:
                row_year = pd.to_datetime(numeric_date).year
            except (ValueError, TypeError):
                continue  # Skip invalid dates
            
            # Skip rows whose year falls outside [start_year, end_year]
            if not (start_year <= row_year <= end_year):
                continue  # Skip rows outside the target year range
        
        # === Step 4: While in row loop, Loop through context cells to extract words ===
            row_words = []
            raw_row = []
            for col in context_columns:
                cell_value = str(row[col]) if pd.notnull(row[col]) else ""
                
                # Clean word 
                raw_row.append(cell_value)
                cleaned_word = apply_stemming_and_cleaning(cell_value)
                if not cleaned_word == 0:
                    row_words.append(cleaned_word)
        # == Step 5: While in row loop, apply NGram Search    
            filtered_words = apply_ngrams(row_words)
        # == Step 6: While in row loop, extract word counts for each year/word combo
            for word in filtered_words:
                #if word == 'u':
                    #print(raw_row)
                if row_year not in WordCountsByYear: #if new year in set, add new year dictionary to master dictionary for csv
                    WordCountsByYear[row_year] = {}
                if not word in WordCountsByYear[row_year]: # if word is not null and word not in the year/word combo, add a new one
                    WordCountsByYear[row_year][word] = 1
                else:
                    WordCountsByYear[row_year][word] += 1 # or add to existing year/word combo


        # == 6a: add the same occurrence once to the global year/word matrix shared across all CSVs
                if row_year not in GlobalMatrix: #if new year in set, add new year dictionary to the global dictionary
                    GlobalMatrix[row_year] = {}
                if not word in GlobalMatrix[row_year]: # if word not yet counted for this year, start at one
                    GlobalMatrix[row_year][word] = 1
                else:
                    GlobalMatrix[row_year][word] += 1 # or add to existing year/word combo

        # == 6b: add the occurrence to the overall word totals
                if not word in TotalMatrix: #if word hasn't been counted yet
                    TotalMatrix[word] = 1 #create initial count 
                else:
                    TotalMatrix[word] += 1 #add to total count
            
                    
        
    # == Step 7: Reorganize to sort by highest count 
    rows = []  # Prepare list to collect top words per year
    for year, word_dict in WordCountsByYear.items():
        # Convert word-count pairs to a sorted list (by count descending)
        sorted_words = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)[:top_n]
        
        # Build a row: Year + Word 1 + Count 1 + Word 2 + Count 2 + ...
        row = {'Year': year} #Every row will be a year
        for i, (word, count) in enumerate(sorted_words, start=1):
            row[f'Word {i}'] = word #Store the word under its ranked column name
            row[f'Count {i}'] = count #Store the count under its ranked column name
        rows.append(row)

    # Convert to DataFrame
    result_df = pd.DataFrame(rows).sort_values(by='Year')
    # Save

    output_path = get_output_filename(filepath, start_year, end_year)
    result_df.to_csv(f'MatrixResults/Main/{output_path}', index=False)

    print(f"DataFrame saved to {output_path}")
    
        
    
    
    # result_df.to_csv('top_20_words_by_year.csv', index=False)
print('process csv function created')
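
The per-year counting and the "Word 1 / Count 1" row layout can be reproduced with `collections.Counter`. This is a compact sketch of the same idea rather than the function above; `top_words_by_year` and the sample records are hypothetical, and the input is assumed to be `(year, words)` pairs that are already cleaned.

```python
from collections import Counter, defaultdict

def top_words_by_year(records, top_n=20):
    """Count words per year and return one row per year with the top_n words."""
    counts = defaultdict(Counter)
    for year, words in records:
        counts[year].update(words)  # add one count per word occurrence

    rows = []
    for year in sorted(counts):
        row = {'Year': year}
        for rank, (word, count) in enumerate(counts[year].most_common(top_n), start=1):
            row[f'Word {rank}'] = word
            row[f'Count {rank}'] = count
        rows.append(row)
    return rows

records = [
    (2016, ['russia', 'election', 'russia']),
    (2020, ['covid19', 'vaccine', 'covid19', 'covid19']),
    (2016, ['russia']),
]
for row in top_words_by_year(records, top_n=2):
    print(row)
# {'Year': 2016, 'Word 1': 'russia', 'Count 1': 3, 'Word 2': 'election', 'Count 2': 1}
# {'Year': 2020, 'Word 1': 'covid19', 'Count 1': 3, 'Word 2': 'vaccine', 'Count 2': 1}
```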

# Process all CSVs 

This function creates a matrix for every term and also builds a global matrix across all terms. The results are stored in the folder MatrixResults/Main.

In [None]:
def process_all_csvs(input_folder='FullClean', top_n=20, start_year=None, end_year=None): #Create Individual Matrices For All 
    """
    Loops through all CSVs in the input folder and processes them to create a matrix for each term
    """
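    # GlobalMatrix and TotalMatrix are module-level dictionaries filled by process_single_csv;
    # reset them here so that repeated runs do not accumulate counts
    global GlobalMatrix, TotalMatrix
    GlobalMatrix = {}
    TotalMatrix = {}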
    
    for file in os.listdir(input_folder):
        if file.endswith('.csv'):
            filepath = os.path.join(input_folder, file)
            process_single_csv(filepath, top_n=top_n, start_year=start_year, end_year=end_year)
    
    #Do same logic as individual csvs
    rows = []  # Prepare list to collect top words per year

    #Everytime a CSV is processed, it adds word to a global matrix file
    for year, word_dict in GlobalMatrix.items(): #Loop through all word/count pairs found overall 
        # Convert word-count pairs to a sorted list (by count descending)
        sorted_words = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)[:top_n]
        
        # Build a row for output csv: Year + Word 1 + Count 1 + Word 2 + Count 2 + ...
        row = {'Year': year} #Every row will be a year
        for i, (word, count) in enumerate(sorted_words, start=1):
            row[f'Word {i}'] = word #Store the word under its ranked column name
            row[f'Count {i}'] = count #Store the count under its ranked column name
        rows.append(row)

    # Convert to DataFrame
    result_df = pd.DataFrame(rows).sort_values(by='Year')
    # Save
    
    word_counts_series = pd.Series(TotalMatrix).sort_values(ascending=False)
    print(word_counts_series.head(30))
          
          
    output_path = 'GlobalMatrix.csv'
    result_df.to_csv(f'MatrixResults/Main/{output_path}', index=False)

print('process all csvs function created')
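
The global matrix is the sum of the per-file year/word counts. A minimal sketch of that merge, assuming each file's counts are available as a `{year: {word: count}}` dictionary (`merge_year_counts` and the two sample dictionaries are hypothetical):

```python
from collections import Counter

def merge_year_counts(per_file_counts):
    """Add up {year: {word: count}} dictionaries from several files."""
    merged = {}
    for year_counts in per_file_counts:
        for year, words in year_counts.items():
            # Counter.update with a mapping adds the counts instead of replacing them
            merged.setdefault(year, Counter()).update(words)
    return merged

file_a = {2016: {'russia': 2, 'election': 1}}
file_b = {2016: {'russia': 1}, 2020: {'covid19': 4}}

merged = merge_year_counts([file_a, file_b])
totals = sum(merged.values(), Counter())  # overall counts across all years
print(merged[2016]['russia'], totals['covid19'])  # 3 4
```

Each occurrence should enter the merged counts exactly once; counting it twice doubles every number in the global output without changing the ranking.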


# Word Search All Terms

This function runs specific word searches on the dataset

Results will be saved

In [None]:
def word_search_all_csvs(input_folder, search_terms, top_n=20, start_year=None, end_year=None, termNames='NO_NAME'):
    
    print('Searching for terms: ' + str(search_terms))
    TotalWordCounts = {}
    GlobalWordCountsByYear = {}
    
    search_terms = [term.lower() for term in search_terms]
    
    for file in os.listdir(input_folder):
        WordCountsByYear = {}
        if file.endswith('.csv'):
            filepath = os.path.join(input_folder, file)
    
            
            #=== Step 1: Load CSV into DataFrame ===
            df = pd.read_csv(filepath)
            # === Step 2: Identify context columns ===
            context_columns = [col for col in df.columns if "Context" in col]
            # === Step 3: Loop through rows and check year consistency ===
            #with tqdm(total=len(df), desc="Processing rows") as pbar:
            
            with tqdm(total=len(df), desc=f'Searching {filepath}') as pbar:
                for index, row in df.iterrows():
                    pbar.update(1)
                    numeric_date = row.get("NumericDate")
                    # Skip rows without a valid date
                    try:
                        row_year = pd.to_datetime(numeric_date).year
                    except (ValueError, TypeError):
                        continue  # Skip invalid dates
                    
                    # Skip row with years out of bounds
                    if not row_year in range(start_year, end_year + 1): #same as start_year <= row_year <= end_year
                        continue  # Skip rows outside the target year range
                    # === Step 4: Within the row loop, go through the context cells ===
                    row_words = []
                    raw_row = []
                    for col in context_columns:
                        cell_value = str(row[col]) if pd.notnull(row[col]) else ""
                        
                        # Keep the raw cell, then stem and clean it
                        raw_row.append(cell_value)
                        cleaned_word = apply_stemming_and_cleaning(cell_value)
                        if cleaned_word != 0:  # 0 is treated as the "nothing to keep" sentinel
                            row_words.append(cleaned_word)
                    # === Step 5: Within the row loop, apply the n-gram search ===
                    filtered_words = apply_ngrams(row_words)
                    # === Step 6: Within the row loop, count each search term per year ===
                    for word in filtered_words:
                        if word not in search_terms:  # only count our search terms
                            continue

                        # Running total across every CSV and year
                        TotalWordCounts[word] = TotalWordCounts.get(word, 0) + 1

                        # Year/word count for this CSV
                        year_counts = WordCountsByYear.setdefault(row_year, {})
                        year_counts[word] = year_counts.get(word, 0) + 1

                        # Year/word count across all CSVs
                        global_year_counts = GlobalWordCountsByYear.setdefault(row_year, {})
                        global_year_counts[word] = global_year_counts.get(word, 0) + 1
                            
                    
            # === Step 7: Sort each year's words by count, highest first ===
            rows = []  # Prepare list to collect top words per year
            
            for year, word_dict in WordCountsByYear.items():
                # Convert word-count pairs to a sorted list (by count descending)
                sorted_words = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)[:top_n]
                
                # Build a row: Year + Word 1 + Count 1 + Word 2 + Count 2 + ...
                row = {'Year': year}  # one output row per year
                for i, (word, count) in enumerate(sorted_words, start=1):
                    row[f'Word {i}'] = word  # i-th most frequent term for this year
                    row[f'Count {i}'] = count  # how often that term appeared
                rows.append(row)

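            # Convert to DataFrame; skip the sort when no search term matched in this CSV
            result_df = pd.DataFrame(rows)
            if not result_df.empty:
                result_df = result_df.sort_values(by='Year')

            # Save, creating the per-term output folder if it does not exist yet
            output_dir = f'MatrixResults/TermSearchMatrix/{termNames}'
            os.makedirs(output_dir, exist_ok=True)
            output_path = os.path.join(output_dir, get_output_filename(filepath, start_year, end_year, termName=termNames))
            result_df.to_csv(output_path, index=False)

            print(f"DataFrame saved to {output_path}")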
            
        
            
    
    word_counts_series = pd.Series(TotalWordCounts).sort_values(ascending=False)
    print(word_counts_series.head(30))    
    
    rows = []  # reset before building the combined table
    # Build the totals across all CSVs
    for year, word_dict in GlobalWordCountsByYear.items():
        # Convert word-count pairs to a sorted list (by count descending)
        sorted_words = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)[:top_n]
        
        # Build a row: Year + Word 1 + Count 1 + Word 2 + Count 2 + ...
        row = {'Year': year}  # one output row per year
        for i, (word, count) in enumerate(sorted_words, start=1):
            row[f'Word {i}'] = word  # i-th most frequent term for this year
            row[f'Count {i}'] = count  # how often that term appeared
        rows.append(row)

    # Convert to DataFrame; skip the sort when nothing matched in any CSV
    result_df = pd.DataFrame(rows)
    if not result_df.empty:
        result_df = result_df.sort_values(by='Year')

    # Save the combined table next to the per-CSV tables
    output_dir = f'MatrixResults/TermSearchMatrix/{termNames}'
    os.makedirs(output_dir, exist_ok=True)
    output_path = f'{output_dir}/{termNames}_GlobalMatrixTerms_{start_year}_{end_year}.csv'
    result_df.to_csv(output_path, index=False)

print('word search function created')
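
The counting and reshaping logic in the function above can be checked without the ProQuest data. The sketch below is a toy version only: the records are invented, the function names count_terms_by_year and top_terms_table are hypothetical, and it uses Counter.most_common in place of the sorted(...)[:top_n] call, which should give the same top words when counts are not tied.

```python
from collections import Counter, defaultdict

import pandas as pd


def count_terms_by_year(records, search_terms):
    """Count how often each search term appears per year.

    records: iterable of (year, list_of_tokens) pairs, a toy stand-in for CSV rows.
    """
    terms = {t.lower() for t in search_terms}  # same lowercasing as the notebook
    counts = defaultdict(Counter)
    for year, tokens in records:
        for token in tokens:
            if token in terms:
                counts[year][token] += 1
    return counts


def top_terms_table(counts_by_year, top_n):
    """Reshape {year: Counter} into one row per year: Year, Word 1, Count 1, ..."""
    rows = []
    for year, counter in counts_by_year.items():
        row = {'Year': year}
        for i, (word, count) in enumerate(counter.most_common(top_n), start=1):
            row[f'Word {i}'] = word
            row[f'Count {i}'] = count
        rows.append(row)
    return pd.DataFrame(rows).sort_values(by='Year').reset_index(drop=True)


# Toy data, invented for illustration only
toy_records = [
    (2017, ['facebook', 'twitter', 'facebook']),
    (2016, ['twitter', 'election']),
    (2017, ['reddit', 'reddit', 'reddit']),
]
toy_counts = count_terms_by_year(toy_records, ['Facebook', 'Twitter', 'Reddit'])
toy_table = top_terms_table(toy_counts, top_n=2)
print(toy_table)
```

Years with fewer than top_n matching terms leave the remaining Word/Count columns empty, which is also how the saved CSVs will look.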

# MAIN DRIVERS

In [None]:
# Create empty dictionaries to be filled in later
GlobalMatrix = {}
TotalMatrix = {}

# Define folder paths
INPUT_FOLDER = 'FullClean'

TOP_N_WORDS = 30
START_YEAR = 2016
END_YEAR = 2024
    
countries_list = ['Afghanistan', 'ussr', 'yugoslavia', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of the', 'Cook Islands', 'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', "North korea", 'South Korea', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 
'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Réunion', 'Romania', 'russia', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'South Sudan', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'uk', 'us', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Virgin Islands', 'Wallis and Futuna', 'Yemen', 'Zambia', 'Zimbabwe']
platforms_list = ['facebook', 'twitter', 'instagram', 'tiktok', 'youtube', 'snapchat', 'whatsapp', 'telegram', 'reddit']
world_leaders = ['biden', 'trump', 'clinton', 'xi', 'putin', 'netanyahu']
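
START_YEAR and END_YEAR are both inclusive in the word search. The sketch below shows one way to check a single NumericDate value against that window. It assumes ISO-style date strings such as '2020-01-15' (the real column format is not visible here), and year_in_window is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def year_in_window(numeric_date, start_year, end_year):
    """Return True when numeric_date parses to a year in [start_year, end_year]."""
    # errors='coerce' turns unparseable values into NaT instead of raising
    ts = pd.to_datetime(numeric_date, errors='coerce')
    if pd.isna(ts):
        return False  # missing or invalid dates are skipped
    return start_year <= ts.year <= end_year


# Assumed date format for illustration: ISO strings
print(year_in_window('2016-01-01', 2016, 2024))
print(year_in_window('2025-04-24', 2016, 2024))
print(year_in_window('not a date', 2016, 2024))
```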


This cell runs a full word search and creates tables for the various terms, keeping the top TOP_N_WORDS words for the years START_YEAR through END_YEAR.

In [None]:
process_all_csvs(top_n=TOP_N_WORDS, start_year=START_YEAR, end_year=END_YEAR)

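This cell goes through each CSV in INPUT_FOLDER and builds a year-by-term count matrix for the chosen term list. It saves one table per CSV plus a combined GlobalMatrixTerms table under MatrixResults/TermSearchMatrix.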

In [None]:
# Replace 'platforms_list' and 'platform' with the relevant list and label
word_search_all_csvs(INPUT_FOLDER, platforms_list, TOP_N_WORDS, START_YEAR, END_YEAR, 'platform')
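
Instead of editing the call by hand for each list, the same search could be looped over all three term lists. This is only a sketch: run_term_searches is a hypothetical helper, the labels 'country' and 'leader' are assumed folder names, and the search function is passed in as an argument so the helper can be tried without the FullClean CSVs.

```python
def run_term_searches(search_fn, term_lists, input_folder, top_n, start_year, end_year):
    """Call search_fn once per (label, terms) pair and return the labels processed."""
    processed = []
    for label, terms in term_lists.items():
        # Same positional order as the word_search_all_csvs call in the notebook
        search_fn(input_folder, terms, top_n, start_year, end_year, label)
        processed.append(label)
    return processed


# In the notebook this might be called as follows (left commented out because
# it needs the cleaned CSVs and the functions defined in earlier cells):
# run_term_searches(
#     word_search_all_csvs,
#     {'platform': platforms_list, 'country': countries_list, 'leader': world_leaders},
#     INPUT_FOLDER, TOP_N_WORDS, START_YEAR, END_YEAR,
# )
```

Each label ends up as its own subfolder under MatrixResults/TermSearchMatrix, so the outputs for different lists do not overwrite each other.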
