# Misinformation Beat Notebook

This file is used to double-check the results of our research on the Misinformation Beat.

Since the main data was pulled via ProQuest TDM Studio, the raw data at the XML level is not available for double-checking.

Therefore, this file should be used to check the logic of the code, but the code can only be run locally after the ANALYSIS portion (excluding the imports, which can be run).

Run each box in order.





==================================================================

"We used ProQuest TDM Studio to access the ProQuest “U.S. Major Dailies” dataset which includes historical archives of The Chicago Tribune, Los Angeles Times, New York Times, Wall Street Journal, and Washington Post. We searched all articles that used the phrases “misinformation,” “disinformation,” “propaganda,” “conspiracy theory,” “conspiracy theories” and “fake news” from January 1, 1980 to April 24, 2025. The data received from ProQuest came in the form of XML files for each of the 231,992 articles that met the criteria."



Step 1: Extract From XML Files
All XML files resulting from the query are stored in the data/*search term* folder (*INACCESSIBLE*).
Results of each extraction will be in the results/ folder.

Note: This cannot be recreated here because the XML files are contained within the ProQuest TDM environment. This is meant for code review only.
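
Because the real XML files cannot leave the TDM environment, a reviewer may want a stand-in record to exercise the parsing logic. The sketch below is only an illustration: the record layout is hypothetical and reproduces nothing but the tag names the notebook queries (GOID, publisher/PublisherName, Title, NumericDate, Text). The `field` helper is likewise an assumption, not part of the original notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the real ProQuest schema is not visible outside TDM Studio,
# so only the tag names queried by the notebook are reproduced here.
SAMPLE_XML = """
<Record>
  <GOID>0000000001</GOID>
  <Obj>
    <NumericDate>2020-01-15</NumericDate>
    <Title>Sample headline</Title>
    <publisher><PublisherName>Example Daily</PublisherName></publisher>
  </Obj>
  <Text>Officials warned that disinformation spread quickly online.</Text>
</Record>
""".strip()


def field(root, path):
    """Return the text at `path`, or "N/A" when the element is missing or empty."""
    elem = root.find(path)
    return elem.text if elem is not None and elem.text else "N/A"


root = ET.fromstring(SAMPLE_XML)
record = {
    "GOID": field(root, ".//GOID"),
    "Publisher": field(root, ".//publisher/PublisherName"),
    "Title": field(root, ".//Title"),
    "NumericDate": field(root, ".//NumericDate"),
    "Text": field(root, ".//Text"),
    "Missing": field(root, ".//NoSuchTag"),  # absent tags fall back to "N/A"
}
print(record)
```

Writing such a record to a temporary .xml file gives a reviewer something to pass through the extraction step without access to the ProQuest data.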


Imports

In [3]:
#Imports
import csv
import glob
import os
import random  # only needed for the optional sampling lines in MAIN
import re
import time
import xml.etree.ElementTree as ET
import zipfile
from collections import Counter, defaultdict
from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import plotly.graph_objects as go
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tqdm import tqdm



os.makedirs('FinalTDMOutputs', exist_ok=True)
os.makedirs('FullClean', exist_ok=True)
os.makedirs('MatrixResults', exist_ok=True)
os.makedirs('MatrixResults/Main', exist_ok=True)
os.makedirs('MatrixResults/TermSearchMatrix', exist_ok=True)


print('done with imports')

done with imports


"The data received from ProQuest came in the form of XML files for each of the 231,992 articles that met the criteria. To accurately parse the data, we removed any markup tags in order to ensure that only the readable text was captured. To clean the dataset and correct for any issues from text identification during scanning, we removed all non-alphanumeric and non-ASCII characters other than hyphens and ending punctuations; question marks and periods were replaced with spaces. Since hyphens are often attached to the keywords listed above, we removed hyphens that preceded or trailed a given keyword."

===================================================================

#Clean the text


- Change keyword based on search as it defaults to 'disinformation'

In [6]:
def clean_and_correct_text(text, keyword='disinformation'):
    if not text or text.strip() == "":
        return "N/A"

    text = re.sub(r"<.*?>", "", text)  # Remove markup tags like <tag>
    text = re.sub(r"[^a-zA-Z0-9\s\-\.\?]+", '', text)  # Keep ASCII letters, digits, whitespace, hyphens, periods and question marks
    text = text.replace('.', ' ')  # Replace periods with spaces
    text = text.replace('?', ' ')  # Replace question marks with spaces (they must survive the line above for this to have any effect)
    pattern = rf'(?<=\b{re.escape(keyword)})-(?=\w)|(?<=\w)-(?=\b{re.escape(keyword)})'  # Match hyphens attached to the keyword (hyphens elsewhere are kept)
    text = re.sub(pattern, ' ', text)  # Replace those hyphens with spaces

    return text

print('Done with building clean function!')

Done with building clean function!
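
To make the keyword-hyphen rule easier to review in isolation, here is a minimal sketch that applies only that regular expression to a sample sentence. The helper name `strip_keyword_hyphens` and the sample text are assumptions made for illustration; the pattern itself is the one used in the cleaning function.

```python
import re


def strip_keyword_hyphens(text, keyword):
    """Replace a hyphen that touches `keyword` with a space; other hyphens stay."""
    kw = re.escape(keyword)
    # First alternative: hyphen right after the keyword; second: hyphen right before it
    pattern = rf'(?<=\b{kw})-(?=\w)|(?<=\w)-(?=\b{kw})'
    return re.sub(pattern, ' ', text)


sample = "anti-disinformation and disinformation-related well-known claims"
print(strip_keyword_hyphens(sample, "disinformation"))
# anti disinformation and disinformation related well-known claims
```

Note that the match is case-sensitive, so a capitalized keyword at the start of a sentence keeps its hyphen under this pattern.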


"We then extracted every use of each of these search terms throughout all of the articles from each dataset. We included the 25 words that preceded and followed each instance of each search term to capture its context. Within a single article, multiple occurrences of each search term were counted separately."


================================================
# Extract Data

This function extracts data from an individual XML file and returns a list of objects that include GOID, Publisher, Title, Numeric Date, and the 50 surrounding words (25 before and 25 after each keyword).
#This function calls the clean_and_correct_text function above.


- Change keyword based on search as it defaults to 'disinformation'
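
The windowing idea described above (25 words on each side, every occurrence counted separately) can be sketched on a toy sentence. This is a simplified illustration rather than the notebook's function: `keyword_in_context` is a hypothetical helper, it skips the cleaning and padding steps, and it uses a small window so the output is easy to read.

```python
def keyword_in_context(words, phrase, window=25):
    """Return a (left, right) pair of context word lists for every occurrence of `phrase`."""
    phrase_words = phrase.lower().split()
    n = len(phrase_words)
    lowered = [w.lower() for w in words]
    hits = []
    # len(words) - n + 1 start positions, so a match ending on the last word is included
    for i in range(len(words) - n + 1):
        if lowered[i:i + n] == phrase_words:
            left = words[max(0, i - window):i]   # up to `window` words before the phrase
            right = words[i + n:i + n + window]  # up to `window` words after the phrase
            hits.append((left, right))
    return hits


text = "the fake news story was called fake news again"
for left, right in keyword_in_context(text.split(), "fake news", window=2):
    print(left, "|", right)
# ['the'] | ['story', 'was']
# ['was', 'called'] | ['again']
```

Both occurrences in the toy sentence are reported, each with its own context, which mirrors how repeated uses within one article are counted separately.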


In [7]:
# Function to extract metadata and concordances
# Returns either FALSE or a list of concordances
def extract_data_from_xml(file_path, keyword='disinformation', window=25):
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()

        # Extract metadata fields
        goid = root.find(".//GOID").text if root.find(".//GOID") is not None else "N/A"
        publisher = root.find(".//publisher/PublisherName").text if root.find(".//publisher/PublisherName") is not None else "N/A"
        title = root.find(".//Title").text if root.find(".//Title") is not None else "N/A"
        numeric_date = root.find(".//NumericDate").text if root.find(".//NumericDate") is not None else "N/A"

        # Extract article text
        text_elem = root.find(".//Text")
        full_text = text_elem.text if text_elem is not None and text_elem.text else "N/A"
        
        # If phrase is more than one word, capture its length
        phrase_words = keyword.lower().split()
        phrase_len = len(phrase_words)
        
        theConcordances = []

        # Split text into words
        words = clean_and_correct_text(full_text, keyword).split()
        
        # Find occurrences of the keyword and store concordances.
        # Phrase words determine length of window
        for i in range(len(words) - phrase_len + 1): #For every possible start position (the +1 lets a match that ends on the last word be found)
            window_slice = words[i:i + phrase_len] #look at number of words equal to the length of the phrase
            if [w.lower() for w in window_slice] == phrase_words: #If the words in the window are equal to the phrase words
                left_context = " ".join(words[max(0, i - window): i])
                right_context = " ".join(words[i + phrase_len: min(len(words), i + phrase_len + window)]) #Up to `window` words after the phrase

                # Clean each context, split it into words, and pad with "" up to `window` entries
                cleaned_left = clean_and_correct_text(left_context).lower().split()
                cleaned_left += [""] * (window - len(cleaned_left))
                cleaned_right = clean_and_correct_text(right_context).lower().split()
                cleaned_right += [""] * (window - len(cleaned_right))

                # Build one row: metadata, Left Context1..N, Keyword, Right Context1..N
                row = {
                    "GOID": goid,
                    "Publisher": publisher.lower(),
                    "Title": clean_and_correct_text(title).lower(),
                    "NumericDate": numeric_date,
                }
                row.update({f"Left Context{j + 1}": cleaned_left[j] for j in range(window)})
                row["Keyword"] = keyword
                row.update({f"Right Context{j + 1}": cleaned_right[j] for j in range(window)})

                # Append extracted data
                theConcordances.append(row)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return False
    return theConcordances

def append_csv(source_path, target_path):
    with open(source_path, 'r', newline='') as src, open(target_path, 'a', newline='') as tgt:
        reader = csv.reader(src)
        writer = csv.writer(tgt)

        next(reader)  # Skip header in source
        for row in reader:
            writer.writerow(row)


print('extraction and split functions ready as well!!!')

extraction and split functions ready as well!!!
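
As a quick check of how `append_csv` merges two result files, the sketch below runs the same logic on two small temporary CSVs. The file names, the two-column layout, and the `write_rows` / `read_rows` helpers are assumptions made for illustration only.

```python
import csv
import os
import tempfile


def append_csv(source_path, target_path):
    # Same logic as the notebook's append_csv: copy every row except the header
    with open(source_path, 'r', newline='') as src, open(target_path, 'a', newline='') as tgt:
        reader = csv.reader(src)
        writer = csv.writer(tgt)
        next(reader)  # Skip header in source
        for row in reader:
            writer.writerow(row)


def write_rows(path, rows):
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)


def read_rows(path):
    with open(path, newline='') as f:
        return list(csv.reader(f))


with tempfile.TemporaryDirectory() as tmp:
    singular = os.path.join(tmp, 'theory.csv')   # hypothetical file names
    plural = os.path.join(tmp, 'theories.csv')
    write_rows(singular, [['GOID', 'Keyword'], ['1', 'conspiracy theory']])
    write_rows(plural, [['GOID', 'Keyword'], ['2', 'conspiracy theories']])
    append_csv(plural, singular)  # fold the plural rows into the singular file
    merged = read_rows(singular)

print(merged)
# [['GOID', 'Keyword'], ['1', 'conspiracy theory'], ['2', 'conspiracy theories']]
```

The merged file keeps a single header row, which is why the source header is skipped before copying. The function assumes both files share the same column order.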



# MAIN 
#This is the main driver that uses the functions above. It loops through every XML file, captures every instance of the keyword, then stores the results as a CSV.

#Several "<keyword> is in the text" or "<keyword> is NOT in the text" messages in the output indicate a problem, because they are printed only when a file could not be processed.
#Progress is printed every 1000 files.

#The result should be a folder called "FinalTDMOutputs" with the relevant CSVs corresponding to each search term.

- Change the folder path and keywords based on the search terms.

In [None]:
# Define the folder containing XML files
folder_path = "data/MisinformationTerms"  # Adjust if needed
keywords = ['disinformation', 'misinformation', 'conspiracy theory', 'conspiracy theories', 'propaganda', 'fake news']
window = 25

# List to store extracted data
print('Begin MAIN')

for word in keywords:
    concordances = [] 
    articlesMinusTerm = 0
    if os.path.exists(folder_path):
        xml_files = [f for f in os.listdir(folder_path) if f.endswith('.xml')]
        #Use this for sampling
        #sample_size = min(200, len(xml_files))
        #sampled_files = random.sample(xml_files, sample_size)
        print('begin loop')
        
        
        for i, file_name in enumerate(xml_files): #Change function to take in sampled_files for sampling
            file_path = os.path.join(folder_path, file_name)
            concordance = extract_data_from_xml(file_path, word, window)
        
            # Error checking: extract_data_from_xml returns False only when it raised an exception.
            # An article with no matches returns an empty list and falls through to the else branch.
            if concordance is False:
                articlesMinusTerm += 1
                print('#################')
                try:
                    tree = ET.parse(file_path)
                    root = tree.getroot()

                    text_elem = root.find(".//Text")
                    full_text = text_elem.text if text_elem is not None and text_elem.text else "N/A"
                    if word in full_text:
                        print(f'{word} is in the text')
                    else:
                        print(f'{word} is NOT in the text')
                except ET.ParseError as e:
                    # Re-parsing can fail for the same reason the extraction did; report it and move on
                    print(f'Could not re-parse {file_path}: {e}')
            else:
                concordances.extend(concordance)
                
            if (i + 1) % 1000 == 0:  # Show progress every 1000 files
                print(f"Processed {i+1} out of {len(xml_files)} files")
                #print('current articlesMinusTerm', articlesMinusTerm)

    # Convert extracted data to DataFrame
    print(len(concordances))
    df_concordances = pd.DataFrame(concordances)

    # Save to CSV
    filepath = 'FinalTDMOutputs'
    filename = 'Cleaned_Full_Win' + str(window) + '_' + word.replace(" ", "_") + '.csv'
    out_path = os.path.join(filepath, filename)

    os.makedirs(filepath, exist_ok=True)
    df_concordances.to_csv(out_path, index=False)

    if word == "conspiracy theories":
        # Fold the plural results into the singular file, then rename the merged file.
        # This relies on 'conspiracy theory' appearing before 'conspiracy theories' in keywords.
        theory_path = os.path.join(filepath, f'Cleaned_Full_Win{window}_conspiracy_theory.csv')
        merged_path = os.path.join(filepath, f'Cleaned_Full_Win{window}_conspiracy.csv')
        append_csv(out_path, theory_path)
        os.remove(out_path)
        os.replace(theory_path, merged_path)  # like os.rename, but overwrites a merged file left by an earlier run
        print(f'Extracted data added to {merged_path}')

    else:
        print(f"Extracted data saved to {out_path}")

Begin MAIN
0
Extracted data saved to FinalTDMOutputs/Cleaned_Full_Win25_disinformation.csv
0
Extracted data saved to FinalTDMOutputs/Cleaned_Full_Win25_misinformation.csv
0
Extracted data saved to FinalTDMOutputs/Cleaned_Full_Win25_conspiracy_theory.csv
0
Extracted data added to FinalTDMOutputs//Cleaned_Full_Win25_conspiracy_theory.csv
0
Extracted data saved to FinalTDMOutputs/Cleaned_Full_Win25_propaganda.csv
0
Extracted data saved to FinalTDMOutputs/Cleaned_Full_Win25_fake_news.csv


In [4]:
def zip_selected_items(include_list, output_zip='archive.zip'):
    current_dir = os.getcwd()

    with zipfile.ZipFile(output_zip, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for item in include_list:
            full_path = os.path.join(current_dir, item)

            if not os.path.exists(full_path):
                print(f"Skipping: {item} (not found)")
                continue

            if os.path.isfile(full_path):
                # Single file
                zipf.write(full_path, arcname=item)
            elif os.path.isdir(full_path):
                # Walk through directory
                for root, _, files in os.walk(full_path):
                    for file in files:
                        file_path = os.path.join(root, file)
                        arcname = os.path.relpath(file_path, current_dir)
                        zipf.write(file_path, arcname)

    print(f"Created zip archive: {output_zip}")


output_zipfile='MisinfoBeatRawData.zip'
include_these = ['FinalTDMOutputs']
zip_selected_items(include_these, output_zip=output_zipfile)


Created zip archive: MisinfoBeatRawData.zip
