<a href="https://colab.research.google.com/github/Nastiiasaenko/Final-Project---Explainable-AI-/blob/main/Datasets_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Creating a bias traces dataset - Phase 1.
In this project phase, I’m creating a comprehensive, enriched dataset covering multiple topics like COVID-19, politics, wars, general news, and more. For now, I’m focusing on COVID-19-related content, pulling data from diverse sources (e.g., news articles, Reddit, Twitter) to capture various biases and perspectives.

My goal is to organize and enrich this data by adding metadata like source, political leaning, publication date, and themes. This dataset will allow me to analyze narratives, identify biases, and later compare AI-generated content to real-world data.

### Steps:

**Collect Data**: Identify and gather datasets focused on COVID-19 with diverse viewpoints and metadata potential.

**Preprocess Data:** Clean and standardize entries for consistency (e.g., removing duplicates, normalizing formats).

**Annotate:** Add labels for bias, source, and themes (e.g., vaccination, lockdowns).

**Integrate:** Merge datasets into one unified, structured dataset.

**Validate:** Double-check annotations and dataset quality to ensure accuracy.
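
The preprocessing step can be sketched on a toy frame. This is a minimal illustration and not the pipeline used later in the notebook: the `raw` rows, the `created` column name, and the `preprocess` helper are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy rows standing in for a raw source file
raw = pd.DataFrame(
    {
        "title": ["Vaccine rollout begins ", "Vaccine rollout begins", "Lockdown extended"],
        "created": ["2021-02-27 06:33:45", "2021-02-27 06:33:45", "not a date"],
    }
)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Trim whitespace, parse dates, and drop duplicate entries."""
    out = df.copy()
    out["title"] = out["title"].str.strip()
    # Unparseable dates become NaT instead of raising an error
    out["timestamp"] = pd.to_datetime(out["created"], errors="coerce")
    out = out.drop(columns=["created"])
    # Two rows count as duplicates when title and timestamp both match
    return out.drop_duplicates(subset=["title", "timestamp"]).reset_index(drop=True)


clean = preprocess(raw)
print(clean)
```

Stripping whitespace before deduplicating matters here: the first two toy rows only become identical once the trailing space is removed.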

Once I finish the COVID-19 segment, I’ll expand this process to other topics like politics and general news to build a robust, multi-topic dataset for later AI analysis.

In [None]:
!pip install kaggle




In [3]:
import pandas as pd
import spacy
from tqdm import tqdm
import os
from datetime import datetime

In [35]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Specify the path to your desired folder in Drive
drive_folder = '/content/drive/MyDrive/MIDS/XAI/XAI_Final_project'

# Ensure the folder exists
os.makedirs(drive_folder, exist_ok=True)

# Change the current working directory to the specified folder
os.chdir(drive_folder)

# Confirm the working directory
print("Current working directory:", os.getcwd())



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Current working directory: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project


## Global DF

In [6]:
# global_df = pd.DataFrame()
global_df = pd.read_csv('global_dataset.csv')
global_df.tail()


Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender,sentiment_category,Publisher,subtitle,prochoice_prolife
219744,Obama's Job Approval Rating Hits 50 Percent In...,The president's ratings have risen among Democ...,news,2016-03-10 00:00:00,,,,POLITICS,,"{'people': [], 'organizations': [], 'locations...",,,,Obama's Job Approval Rating Hits 50 Percent In...,,,,,,
219745,New Site Tells Tourists What Their Destination...,A soon-to-be-launched online platform will pro...,news,2015-08-24 00:00:00,,,,IMPACT,,"{'people': [], 'organizations': [], 'locations...",,,,New Site Tells Tourists What Their Destination...,,,,,,
219746,"This LBJ Fan Says You Should See The Film ""Selma""","Should you see the movie ""Selma,"" or should yo...",news,2015-01-13 00:00:00,,,,ENTERTAINMENT,,"{'people': [], 'organizations': ['LBJ', 'LBJ']...",,,,"This LBJ Fan Says You Should See The Film ""Selma""",,,,,,
219747,8 Reminders That The Best Is Yet To Come,The stress and strain of constantly being conn...,news,2014-04-03 00:00:00,,,,WELLNESS,,"{'people': [], 'organizations': [], 'locations...",,,,8 Reminders That The Best Is Yet To Come,,,,,,
219748,How This Mom Really Feels About Going Out To E...,How many of us celebrate Mother's Day by going...,news,2014-05-09 00:00:00,,,,PARENTS,,"{'people': [], 'organizations': [], 'locations...",,,,How This Mom Really Feels About Going Out To E...,,,,,,


## Construction of the global dataset.

## Topic one - COVID-19: misinformation on the pandemic and vaccines

Datasets used for this topic and their purpose:

* [Reddit Vaccine Myths](https://www.kaggle.com/code/fahadmehfoooz/reddit-vaccine-myths) - provides general vaccine misinformation, so we can trace bias about vaccines as a whole.

####  Preprocessing [Reddit Vaccine Myths](https://www.kaggle.com/code/fahadmehfoooz/reddit-vaccine-myths)

In [None]:
target_folder = '/content/drive/MyDrive/MIDS/XAI/XAI_Final_project/Datasets'
print(os.path.exists(target_folder))


True


In [None]:
global_df.head(1)

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender,sentiment_category,Publisher
0,Health Canada approves AstraZeneca COVID-19 va...,,Reddit,2021-02-27 06:33:45,1.0,myth,post_title,COVID,Vaccines,,,,,,,,,


In [None]:
## Reddit Vaccine Myths
reddit_vaccine_myths = pd.read_csv('./Datasets/Covid_Datasets/reddit_vm.csv')


In [19]:
nlp = spacy.load("en_core_web_sm")

In [20]:
import pandas as pd
import spacy


# Initialize an empty global dataframe
# global_df = pd.DataFrame(columns=["title", "body", "source", "timestamp"])


In [7]:
def extract_entities(text):
    """
    Extract named entities from text using SpaCy.
    """
    doc = nlp(text)
    entities = {"people": [], "organizations": [], "locations": []}
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            entities["people"].append(ent.text)
        elif ent.label_ == "ORG":
            entities["organizations"].append(ent.text)
        elif ent.label_ == "GPE":
            entities["locations"].append(ent.text)
    return entities
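
The bucketing logic above can also be written with a lookup table, which makes it easy to check without loading a spaCy model. This is only a sketch: `LABEL_TO_BUCKET` and `bucket_entities` are hypothetical names, and the input pairs are written by hand rather than produced by `nlp`.

```python
# spaCy label -> bucket name used in the notebook's "entities" column
LABEL_TO_BUCKET = {"PERSON": "people", "ORG": "organizations", "GPE": "locations"}


def bucket_entities(pairs):
    """Group (text, label) pairs into the people/organizations/locations dict."""
    buckets = {name: [] for name in LABEL_TO_BUCKET.values()}
    for text, label in pairs:
        bucket = LABEL_TO_BUCKET.get(label)
        if bucket is not None:  # ignore labels outside the three of interest
            buckets[bucket].append(text)
    return buckets


# Hand-written pairs for illustration; with spaCy they would come from
# [(ent.text, ent.label_) for ent in nlp(text).ents]
example_pairs = [("Pfizer", "ORG"), ("Canada", "GPE"), ("2021", "DATE")]
result = bucket_entities(example_pairs)
print(result)
```

Labels that are not in the table, such as `DATE` in the example, are dropped, which matches the behavior of the `if`/`elif` chain.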


In [9]:
from tqdm import tqdm
import os
from datetime import datetime

def process_and_add_to_global(
    dataset, dataset_name, column_mapping, topic=None, subtopic=None, additional_columns=None, save_dir=None
):
    """
    Process a dataset and integrate it into the global dataframe, dynamically updating the schema with a progress bar.

    Args:
    - dataset: pd.DataFrame, the input dataset.
    - dataset_name: str, the name of the dataset/source.
    - column_mapping: dict, mapping from global column names (keys) to the dataset-specific column names they are read from (values).
    - topic: str, the high-level topic (e.g., 'COVID').
    - subtopic: str, the sub-topic within the main topic (e.g., 'Vaccines').
    - additional_columns: dict, optional additional columns and their default values for this dataset only.
    - save_dir: str, directory to save the global dataframe. Defaults to the current working directory.
    """
    global global_df

    print(f"Processing dataset: {dataset_name}")
    # Progress bar for the number of rows
    with tqdm(total=len(dataset), desc=f"Processing rows in {dataset_name}", unit="row", disable=True) as pbar:
        # Apply column mapping to standardize dataset
        standardized_data = pd.DataFrame()
        for global_col, dataset_col in column_mapping.items():
            if dataset_col in dataset.columns:
                standardized_data[global_col] = dataset[dataset_col]
            else:
                raise ValueError(f"Column '{dataset_col}' missing in the dataset '{dataset_name}'.")

        # Add standard fields
        standardized_data["source"] = dataset_name
        standardized_data["timestamp"] = pd.to_datetime(standardized_data.get("timestamp", None), errors="coerce")
        standardized_data["topic"] = topic
        standardized_data["subtopic"] = subtopic

        # Extract entities using SpaCy
        if "body" in standardized_data.columns:
            standardized_data["entities"] = [
                extract_entities(text) if pd.notna(text) else None for text in tqdm(
                    standardized_data["body"], desc="Extracting entities", leave=False, unit="entry"
                )
            ]
        elif "title" in standardized_data.columns:
            standardized_data["entities"] = [
                extract_entities(text) if pd.notna(text) else None for text in tqdm(
                    standardized_data["title"], desc="Extracting entities", leave=False, unit="entry"
                )
            ]
        else:
            standardized_data["entities"] = None

        # Add additional columns dynamically
        if additional_columns:
            for col_name, default_value in additional_columns.items():
                standardized_data[col_name] = default_value

                # Ensure column exists in global_df with None for prior rows
                if col_name not in global_df.columns:
                    global_df[col_name] = None

        # Simulate processing rows to update the progress bar
        for _ in standardized_data.iterrows():
            pbar.update(1)

    # Append new data to global_df
    global_df = pd.concat([global_df, standardized_data], ignore_index=True)

    # Save the updated global dataframe
    save_dir = save_dir or os.getcwd()  # Default to current working directory if no directory specified
    save_path = os.path.join(save_dir, "global_dataset.csv")
    global_df.to_csv(save_path, index=False)
    print(f"Global dataframe saved to: {save_path}")
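
The direction of `column_mapping` is easy to get wrong, so here is the standardization step in isolation. It is a sketch under assumptions: `standardize_columns` is a hypothetical helper, the `comments` frame is toy data, and each source column is assumed to be mapped at most once.

```python
import pandas as pd


def standardize_columns(dataset: pd.DataFrame, column_mapping: dict) -> pd.DataFrame:
    """Return a frame whose columns are the global names in column_mapping.

    Keys are global column names; values are the source column names.
    """
    missing = [src for src in column_mapping.values() if src not in dataset.columns]
    if missing:
        raise ValueError(f"Columns missing in the dataset: {missing}")
    # Select the source columns, then rename them to their global names
    return dataset[list(column_mapping.values())].rename(
        columns={src: dst for dst, src in column_mapping.items()}
    )


# Hypothetical toy data shaped like the Reddit comments file
comments = pd.DataFrame(
    {
        "post_title": ["Vaccine news"],
        "comment_body": ["Thanks for sharing"],
        "comment_date": ["2021-02-27 06:33:45"],
        "score": [3],
    }
)
standardized = standardize_columns(
    comments,
    {"title": "post_title", "body": "comment_body", "timestamp": "comment_date"},
)
print(standardized.columns.tolist())
```

The result keeps the index of the input frame. That is what lets a pandas Series passed in `additional_columns` line up row by row, even after the input has been shuffled with `.sample()`.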


In [None]:
# Compute "type_reddit" inline and pass dynamically
process_and_add_to_global(
    reddit_vaccine_myths,
    dataset_name="Reddit",
    column_mapping={"title": "title", "body": "body", "timestamp": "timestamp"},
    topic="COVID",
    subtopic="Vaccines",
    additional_columns={
        "Misinfo_flag": 1, "type_of_misinfo": "myth",
        "type_reddit": reddit_vaccine_myths["title"].apply(
            lambda x: "comment_body" if "Comment" in x else "post_title"
        )
    }
)


Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv



In [None]:
global_df.head()

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities
0,Health Canada approves AstraZeneca COVID-19 va...,,Reddit,2021-02-27 06:33:45,1,myth,post_title,COVID,Vaccines,
1,COVID-19 in Canada: 'Vaccination passports' a ...,,Reddit,2021-02-26 07:11:07,1,myth,post_title,COVID,Vaccines,
2,Coronavirus variants could fuel Canada's third...,,Reddit,2021-02-21 07:50:08,1,myth,post_title,COVID,Vaccines,
3,Canadian government to extend COVID-19 emergen...,,Reddit,2021-02-20 06:35:13,1,myth,post_title,COVID,Vaccines,
4,Canada: Pfizer is 'extremely committed' to mee...,,Reddit,2021-02-16 11:36:28,1,myth,post_title,COVID,Vaccines,


## [COVID-19 Vaccine News Reddit Discussions](https://www.kaggle.com/datasets/xhlulu/covid19-vaccine-news-reddit-discussions)

In [None]:
# Load the Reddit comments
covid_vaccine_reddit_comments = pd.read_csv('./Datasets/Covid_Datasets/comments.csv')

In [None]:
process_and_add_to_global(
    covid_vaccine_reddit_comments,
    dataset_name="Reddit",
    column_mapping={"title": "post_title", "body": "comment_body", "timestamp": "comment_date"},
    topic="COVID",
    subtopic="Vaccines",
    additional_columns={
        "type_reddit": 'post_and_title'

    }
)

Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


### [COVID-19 Fake News Dataset](https://www.kaggle.com/datasets/arashnic/covid19-fake-news)




In [None]:
# Read every CSV in the folder and label it by its file name
import os
import pandas as pd

# Define the working directory containing the CSV files
working_directory = './Datasets/Covid_Datasets/Covid_realFake'

# Collect the frames for "claim" and "news" files separately
claim_frames, news_frames = [], []

for file in os.listdir(working_directory):
    if not file.endswith('.csv'):
        continue
    name = file.lower()

    # "fake" files are misinformation (1), "real" files are not (0)
    if 'fake' in name:
        misinfo_flag = 1
    elif 'real' in name:
        misinfo_flag = 0
    else:
        continue

    # Route the file to the claim or the news collection
    if 'claim' in name:
        target = claim_frames
    elif 'news' in name:
        target = news_frames
    else:
        continue

    df = pd.read_csv(os.path.join(working_directory, file))
    df['Misinfo_flag'] = misinfo_flag
    target.append(df)

claim_df = pd.concat(claim_frames, ignore_index=True) if claim_frames else pd.DataFrame()
news_df = pd.concat(news_frames, ignore_index=True) if news_frames else pd.DataFrame()




In [None]:
## append claims_df
process_and_add_to_global(
    claim_df,
    dataset_name="twitter",
    column_mapping={"title": "title"},
    topic="COVID",

    additional_columns={
        'Misinfo_flag': claim_df['Misinfo_flag']


    }
)

Processing dataset: twitter


Processing rows in twitter: 100%|██████████| 481/481 [00:00<00:00, 21553.37row/s]


Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


In [None]:
news_df.shape

(2914, 16)

In [None]:
## append news df

process_and_add_to_global(
    news_df,
    dataset_name="news_twitter",
    column_mapping={"title": "title"},
    topic="COVID",

    additional_columns={
        'Misinfo_flag': news_df['Misinfo_flag']


    }
)

Processing dataset: news_twitter



Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


In [None]:
global_df.shape

(36334, 10)

#### [COVID19 Fake News Dataset NLP](https://www.kaggle.com/datasets/elvinagammed/covid19-fake-news-dataset-nlp)

# Russia and Ukraine

Candidate datasets:

* sentiment on news about Russia and Ukraine (very good) - https://www.kaggle.com/datasets/subhajournal/political-sentiment-analysis (there is also one about disinformation)
* headlines and news (for training) - https://www.kaggle.com/datasets/hskhawaja/russia-ukraine-conflict
* Kremlin disinformation - https://www.kaggle.com/datasets/corrieaar/disinformation-articles
* Reddit with comments (can be used for prompts) - https://www.kaggle.com/datasets/bwandowando/ukrainesubredditthreadsandcomments
* Twitter - https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows
* Reddit - https://www.kaggle.com/datasets/tariqsays/reddit-russiaukraine-conflict-dataset
* https://www.kaggle.com/datasets/asaniczka/public-opinion-russia-ukraine-war-updated-daily

In [39]:
guardian_news = pd.read_csv('./Datasets/Russia_Ukraine/Guardians_Russia_Ukraine.csv')

In [42]:
# Append the Guardian articles

process_and_add_to_global(
    guardian_news,
    dataset_name="news",
    column_mapping={"title": "headlines", 'body':'articles'},
    topic="War",
    subtopic="Russia and Ukraine",

    additional_columns={
        'Polarization_flag': "Democratic",
        'Publisher': 'The Guardian'



    }
)

Processing dataset: news




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


In [41]:
global_df.columns

Index(['title', 'body', 'source', 'timestamp', 'Misinfo_flag',
       'type_of_misinfo', 'type_reddit', 'topic', 'subtopic', 'entities',
       'Polarization_flag', '\tMisinfo_flag', 'type_of_content',
       'potential_prompt0', 'hashtags', 'gender', 'sentiment_category',
       'Publisher', 'subtitle', 'prochoice_prolife'],
      dtype='object')
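
The column list contains both `Misinfo_flag` and a tab-prefixed `\tMisinfo_flag`. If the second one is a typo for the first (an assumption; the notebook does not say where it came from), the two can be merged as sketched below. `merge_stray_column` and the `toy` frame are hypothetical.

```python
import pandas as pd


def merge_stray_column(df: pd.DataFrame, stray: str, target: str) -> pd.DataFrame:
    """Fill gaps in `target` from `stray`, then drop `stray`."""
    out = df.copy()
    if stray in out.columns:
        if target in out.columns:
            # Keep existing target values; use stray values only where target is missing
            out[target] = out[target].combine_first(out[stray])
        else:
            out[target] = out[stray]
        out = out.drop(columns=[stray])
    return out


# Hypothetical toy frame reproducing the duplicated column names
toy = pd.DataFrame(
    {"Misinfo_flag": [1.0, None, None], "\tMisinfo_flag": [None, 0.0, None]}
)
fixed = merge_stray_column(toy, stray="\tMisinfo_flag", target="Misinfo_flag")
print(fixed)
```

Rows where both columns are empty stay empty, so no label is invented for datasets that never carried a misinformation flag.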

In [44]:
nyt_news = pd.read_csv('./Datasets/Russia_Ukraine/NYT_Russia_Ukraine.csv')


In [45]:

process_and_add_to_global(
    nyt_news,
    dataset_name="news",
    column_mapping={"title": "headlines", 'body':'articles'},
    topic="War",
    subtopic="Russia and Ukraine",

    additional_columns={
        'Polarization_flag': "Democratic",
        'Publisher': 'NYT'



    }
)

Processing dataset: news




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


In [46]:
global_df.head()

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender,sentiment_category,Publisher,subtitle,prochoice_prolife
0,Health Canada approves AstraZeneca COVID-19 va...,,Reddit,2021-02-27 06:33:45,1.0,myth,post_title,COVID,Vaccines,,,,,,,,,,,
1,COVID-19 in Canada: 'Vaccination passports' a ...,,Reddit,2021-02-26 07:11:07,1.0,myth,post_title,COVID,Vaccines,,,,,,,,,,,
2,Coronavirus variants could fuel Canada's third...,,Reddit,2021-02-21 07:50:08,1.0,myth,post_title,COVID,Vaccines,,,,,,,,,,,
3,Canadian government to extend COVID-19 emergen...,,Reddit,2021-02-20 06:35:13,1.0,myth,post_title,COVID,Vaccines,,,,,,,,,,,
4,Canada: Pfizer is 'extremely committed' to mee...,,Reddit,2021-02-16 11:36:28,1.0,myth,post_title,COVID,Vaccines,,,,,,,,,,,


## Disinformation tactics in pro-Russian media


In [74]:
rus_disinfo = pd.read_csv('./Datasets/Russia_Ukraine/russian_disinformation.csv')

In [75]:
rus_disinfo.columns

Index(['Unnamed: 0', 'claims_id', 'claim_published', 'first_appearance',
       'review_id', 'is_part_of', 'claim_reviewed', 'review_published',
       'review_name', 'html_text', 'text', 'issue_id', 'keyword_id',
       'keyword_name', 'country_id', 'country_name', 'appearances',
       'has_parts', 'creative_work_id', 'type', 'url', 'author', 'claim',
       'web_archive_url', 'abstract', 'in_language', 'start_time', 'end_time',
       'organization_id', 'location', 'organization_name', 'image_id',
       'image_type', 'image_content_url', 'language_id', 'language_name',
       'language_code'],
      dtype='object')

In [76]:
rus_disinfo.head()

Unnamed: 0.1,Unnamed: 0,claims_id,claim_published,first_appearance,review_id,is_part_of,claim_reviewed,review_published,review_name,html_text,...,end_time,organization_id,location,organization_name,image_id,image_type,image_content_url,language_id,language_name,language_code
0,0,/claims/100,2019-12-13T00:00:00+00:00,/news_articles/598,/claim_reviews/100,/issues/177,Ukraine has put itself in a situation when ext...,2019-12-16T00:00:00+00:00,Normandy summit results: the EU plays on Russi...,<p>This article misrepresents the actual Germa...,...,,/organizations/262,/countries/55,sputnik.by // lifenews.ru,/image_objects/23,http://schema.org/ImageObject,https://api.veedoo.io/images/5e3150bc27830_2fd...,/languages/3,Russian,rus
1,1,/claims/1000,2019-09-26T00:00:00+00:00,/news_articles/1835,/claim_reviews/1000,/issues/166,Regardless who was behind the recent attack on...,2019-09-27T00:00:00+00:00,The US benefits from the attack on the Saudi o...,<p>No evidence is provided to support the clai...,...,,/organizations/205,,southfront.org,,,,/languages/7,English,eng
2,2,/claims/1002,2019-09-23T00:00:00+00:00,/media_objects/1837,/claim_reviews/1002,/issues/166,"Pilsudski is a historical figure, who establis...",2019-09-27T00:00:00+00:00,The Polish Legions of Pilsudski organized the ...,<p>This message is a part of the Kremlin’s pol...,...,262.0,/organizations/122,/countries/4,Rossia 24,/image_objects/10,http://schema.org/ImageObject,https://api.veedoo.io/images/5e313449bca43_Ros...,/languages/3,Russian,rus
3,3,/claims/1003,2019-09-27T00:00:00+00:00,/media_objects/1838,/claim_reviews/1003,/issues/166,Washington (and to a large degree Brussels) ar...,2019-09-27T00:00:00+00:00,The West might use the Eastern Europeans as ca...,"<p><a href=""https://euvsdisinfo.eu/why-authori...",...,,/organizations/461,,fort-russ.com,,,,/languages/7,English,eng
4,4,/claims/1004,2019-09-25T00:00:00+00:00,/news_articles/1839,/claim_reviews/1004,/issues/166,The beneficiary of the resolution of the Europ...,2019-09-27T00:00:00+00:00,The resolution of the European Parliament reli...,<p>This message is part of the Kremlin’s polic...,...,,/organizations/335,,rubaltic.ru,,,,/languages/3,Russian,rus


In [72]:
rus_disinfo = rus_disinfo[[
    'claim_reviewed',
    'review_published',
    'review_name',
    'text',
    'keyword_name',
    'country_name',
    'abstract',
    'language_code',
]]
eng_rusdisinfo = rus_disinfo[rus_disinfo['language_code'] == 'eng']

In [73]:
eng_rusdisinfo

Unnamed: 0,claim_reviewed,review_published,review_name,text,keyword_name,country_name,abstract,language_code
1,Regardless who was behind the recent attack on...,2019-09-27T00:00:00+00:00,The US benefits from the attack on the Saudi o...,No evidence is provided to support the claim. ...,"['Conspiracy', 'Terrorism', 'Donald Trump']","['Iran', 'United States', 'Saudi Arabia']",,eng
3,Washington (and to a large degree Brussels) ar...,2019-09-27T00:00:00+00:00,The West might use the Eastern Europeans as ca...,"Conspiracy theory, presented without evidence....","['Russophobia', 'Encircling Russia', 'Operatio...","['United States', 'The West', 'Czech Republic'...",,eng
6,A growing backlash has begun across Hong Kong ...,2019-09-27T00:00:00+00:00,The protests in Hong Kong are US-funded,Conspiracy theory presented without evidence.\...,"['Conspiracy', 'Protest']","['United States', 'Hong Kong']",,eng
9,Under the guise of a concrete pressing issue (...,2019-09-26T00:00:00+00:00,Western propaganda masters control mass demons...,Conspiracy theory presented without evidence.\...,"['Conspiracy', 'Protest']","['The West', 'Hong Kong']",,eng
61,"Nowadays, Washington is playing the “religion ...",2019-09-23T00:00:00+00:00,"Washington is playing the ""religion card"" agai...",No evidence given. Conspiracy theory and recur...,"['Conspiracy', 'Religion']","['United States', 'The West', 'Georgia', 'CIS']",,eng
...,...,...,...,...,...,...,...,...
7298,Crimea carried out a referendum in 2014 after ...,2019-10-05T00:00:00+00:00,2014 Kyiv coup led to Crimea referendum and Do...,Recurring pro-Kremlin disinformation narrative...,"['Crimea', 'Donbas', 'War in Ukraine', 'Easter...","['Russia', 'Ukraine']",,eng
7299,"Since Ukraine declared independence in 1991, W...",2019-10-05T00:00:00+00:00,Ukraine has been ruled by a US-funded client r...,The story advances a recurring pro-Kremlin nar...,"['Ukrainian disintegration', 'US presence in E...","['Russia', 'US', 'Ukraine']",,eng
7303,Ukraine has filed a problematic complaint thro...,2019-10-03T00:00:00+00:00,ECHR might rule to treat Russian control of Cr...,Recurring pro-Kremlin narrative about alleged ...,"['Crimea', 'ECHR']","['Russia', 'Ukraine']",,eng
7323,While there is a significant grassroots moveme...,2019-10-01T00:00:00+00:00,The Climate Movement is distracting millions f...,"Conspiracy theory, presented without evidence....","['Conspiracy', 'Climate']",['United States'],,eng
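
The `keyword_name` and `country_name` cells look like Python lists, but after `pd.read_csv` they are most likely plain strings. A minimal sketch of turning them back into lists with the standard library follows; `parse_list_cell` is a hypothetical helper, and treating unparseable cells as empty lists is a design choice made for this example.

```python
import ast


def parse_list_cell(cell):
    """Turn a string such as "['Conspiracy', 'Protest']" into a Python list."""
    if not isinstance(cell, str):
        return []  # missing values (e.g. NaN) become an empty list
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # anything that is not a valid literal is treated as empty
    return list(value) if isinstance(value, (list, tuple)) else []


keywords = parse_list_cell("['Conspiracy', 'Protest']")
print(keywords)

# With pandas this could be applied column-wise, for example:
# eng_rusdisinfo['keyword_name'].apply(parse_list_cell)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text that comes from an external file.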


## Political biases


## Political polarization datasets

### [Democrat Vs. Republican Tweets](https://www.kaggle.com/datasets/kapastor/democratvsrepublicantweets)

In [None]:
dem_vs_rep_tweets = pd.read_csv('./Datasets/Polarization_datasets/DemocratVsRepublicanTweets.csv')

In [None]:
global_df.head()

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities
0,Health Canada approves AstraZeneca COVID-19 va...,,Reddit,2021-02-27 06:33:45,1.0,myth,post_title,COVID,Vaccines,
1,COVID-19 in Canada: 'Vaccination passports' a ...,,Reddit,2021-02-26 07:11:07,1.0,myth,post_title,COVID,Vaccines,
2,Coronavirus variants could fuel Canada's third...,,Reddit,2021-02-21 07:50:08,1.0,myth,post_title,COVID,Vaccines,
3,Canadian government to extend COVID-19 emergen...,,Reddit,2021-02-20 06:35:13,1.0,myth,post_title,COVID,Vaccines,
4,Canada: Pfizer is 'extremely committed' to mee...,,Reddit,2021-02-16 11:36:28,1.0,myth,post_title,COVID,Vaccines,


In [None]:
global_df.shape

(39729, 10)

In [None]:
dem_vs_rep_tweets['Party'].value_counts()

Unnamed: 0_level_0,count
Party,Unnamed: 1_level_1
Republican,44392
Democrat,42068


In [None]:
dem_vs_rep_tweets = dem_vs_rep_tweets.sample(25000)
process_and_add_to_global(
    dem_vs_rep_tweets,
    dataset_name="twitter",
    column_mapping={"body": "Tweet"},
    topic="Polarization",
    additional_columns={
        'Polarization_flag': dem_vs_rep_tweets['Party']
    }
)

Processing dataset: twitter



Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


In [None]:
global_df.shape

(64729, 11)
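
The 25,000-row sample above is drawn without a seed, so each run adds different tweets to the global dataset. A small sketch of a repeatable sample, and of a variant balanced by party, is shown below. The toy `tweets` frame, the seed value 42, and the balanced variant are assumptions for illustration and are not what the notebook ran.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the tweets file
tweets = pd.DataFrame(
    {
        "Party": ["Democrat"] * 6 + ["Republican"] * 4,
        "Tweet": [f"tweet {i}" for i in range(10)],
    }
)

# Plain random sample, repeatable across runs thanks to random_state
sample_a = tweets.sample(n=4, random_state=42)
sample_b = tweets.sample(n=4, random_state=42)

# Equal number of rows per party, also repeatable
balanced = tweets.groupby("Party").sample(n=2, random_state=42)
print(balanced["Party"].value_counts())
```

Fixing the seed makes it possible to rebuild the same `global_dataset.csv` later, which helps when annotations are validated against specific rows.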

##  [Reddit: /r/NotTheOnion](https://www.kaggle.com/datasets/thedevastator/discovering-fact-through-humor-investigating-r-n)

In [None]:
nottheonion = pd.read_csv('./Datasets/Overall_news/nottheonion.csv')

In [None]:
nottheonion.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
0,Human heart found in TDOT salt pile in Humphre...,49,znmpf0,https://www.wkrn.com/news/local-news/human-hea...,5,1671219000.0,,2022-12-16 21:22:06
1,[TwinsDaily] Twins Announce Best Efforts and P...,1,znly7o,https://twinsdaily.com/news-rumors/just-for-fu...,0,1671217000.0,,2022-12-16 20:49:01
2,"Another driver loses control, crashes into Pha...",18,znly28,https://www.fox35orlando.com/news/another-driv...,11,1671217000.0,,2022-12-16 20:48:50
3,Reporter ‘alarmed at the intelligence level of...,1898,znlmte,https://www.msnbc.com/katie-phang/watch/report...,109,1671216000.0,,2022-12-16 20:34:52
4,Aromatherapy spray that killed two people in a...,159,znllto,https://www.nbcnews.com/news/us-news/aromather...,14,1671216000.0,,2022-12-16 20:33:36


In [None]:
global_df.head(1)

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag
0,Health Canada approves AstraZeneca COVID-19 va...,,Reddit,2021-02-27 06:33:45,1.0,myth,post_title,COVID,Vaccines,,


In [None]:
global_df = pd.read_csv('global_dataset.csv')


In [None]:
process_and_add_to_global(
    nottheonion,
    dataset_name="reddit",
    column_mapping={"title": "title", "timestamp": "timestamp"},
    topic="All_news",
    additional_columns={
        'Misinfo_flag': 1,
        'type_of_misinfo': 'Onion'
    }
)

Processing dataset: reddit




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


## Political and gender bias

## Abortion and pro-choice
TODO: add the R code to the repo later, so I don't forget how the files were parsed.


In [None]:
!pip install pyreadr
!pip install rdata




In [None]:
gender_abortions = pd.read_csv('./Datasets/Gender_bias/genderwithna_abo_botp_subset.csv')

In [None]:
gender_abortions.head()

Unnamed: 0,created_at,text,gender,is_quote,hashtags,quoted_text
0,2017-11-01 15:15:14,"RT @PPact: Doubling down on health attacks, ex...",unknown,False,,
1,2017-11-03 02:30:37,RT @EmmarKirwan: It's almost 2018 and people s...,male,False,,
2,2017-11-03 14:31:31,RT @EmmarKirwan: It's almost 2018 and people s...,unknown,False,,
3,2017-12-17 03:56:30,@AGSchneiderman Republicans are against aborti...,female,False,,
4,2018-02-20 05:13:38,RT @RealKyleMorris: How do Democrats expect cr...,male,False,,


We will build two datasets here: one with plain tweets only, and one with quote tweets that could later be used for prompting.

## Dataset 1 (without quotes)

In [None]:
global_df.head(1)

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender
0,Health Canada approves AstraZeneca COVID-19 va...,,Reddit,2021-02-27 06:33:45,1.0,myth,post_title,COVID,Vaccines,,,,,,,


In [None]:
global_df.shape

(122989, 16)

In [None]:
abortion_1 = gender_abortions[['created_at','text','hashtags','gender']]

In [None]:
# Dataset 1 (without quotes): keep only the rows where is_quote is False
abortion_1 = gender_abortions[~gender_abortions['is_quote']][['created_at', 'text', 'hashtags', 'gender']]


In [None]:
abortion_1

Unnamed: 0,created_at,text,hashtags,gender
0,2017-11-01 15:15:14,"RT @PPact: Doubling down on health attacks, ex...",,unknown
1,2017-11-03 02:30:37,RT @EmmarKirwan: It's almost 2018 and people s...,,male
2,2017-11-03 14:31:31,RT @EmmarKirwan: It's almost 2018 and people s...,,unknown
3,2017-12-17 03:56:30,@AGSchneiderman Republicans are against aborti...,,female
4,2018-02-20 05:13:38,RT @RealKyleMorris: How do Democrats expect cr...,,male
...,...,...,...,...
49995,2018-01-04 19:08:24,March for Life Chicago 2018 to be Largest Midw...,,male
49996,2017-11-03 15:24:58,"RT @benwikler: Today, the party that wants to ...",,female
49997,2017-11-03 12:40:29,"RT @benwikler: Today, the party that wants to ...",,unknown
49998,2017-12-11 17:55:35,RT @SouthernKeeks: I see certain people I resp...,,unknown


In [None]:
sample_abortion1 = abortion_1.sample(10000)
# Process and add abortion_1 to the global dataset
process_and_add_to_global(
    sample_abortion1,
    dataset_name="twitter",
    column_mapping={"timestamp": "created_at", "body": "text"},
    topic="Abortion",
    additional_columns={
        "hashtags": sample_abortion1["hashtags"],
        "gender": sample_abortion1["gender"],
    }
)

Processing dataset: twitter



Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv
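
The calls to `process_and_add_to_global` pass `column_mapping` as `{global column: source column}`. The function itself is defined earlier in the notebook and is not repeated here, so the block below is only a minimal sketch of how such a mapping could be applied. The helper name `apply_column_mapping` and the toy frame are hypothetical, not the notebook's actual implementation.

```python
import pandas as pd

def apply_column_mapping(df: pd.DataFrame, column_mapping: dict) -> pd.DataFrame:
    """Hypothetical helper: return a frame that uses the global column names.

    column_mapping is {global_name: source_name}, as in the calls above.
    """
    missing = [src for src in column_mapping.values() if src not in df.columns]
    if missing:
        raise KeyError(f"Source columns not found: {missing}")
    # Copy each source column under its global name; the row index is preserved.
    return pd.DataFrame({dst: df[src] for dst, src in column_mapping.items()})

# Toy data standing in for the tweet dataset.
toy = pd.DataFrame({
    "created_at": ["2017-11-01 15:15:14"],
    "text": ["example tweet"],
    "gender": ["unknown"],
})
standardized = apply_column_mapping(toy, {"timestamp": "created_at", "body": "text"})
print(standardized.columns.tolist())
```

Building the frame from a dict also works when two global columns read from the same source column, as with `title` and `potential_prompt0` both taking `quoted_text`.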


In [None]:
global_df.sample(1)

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender
91921,,"RT @mercatus: ""Regardless of how one feels abo...",twitter,,,,,Polarization,,"{'people': [], 'organizations': [], 'locations...",Republican,,,,,


In [None]:
# Dataset 2: keep only quote tweets (is_quote is True)
abortion_with_quotes = gender_abortions[gender_abortions['is_quote']]

process_and_add_to_global(
    abortion_with_quotes,
    dataset_name="twitter",
    column_mapping={"timestamp": "created_at", "body": "text", 'title':'quoted_text'},
    topic="Abortion",
    additional_columns={
        "hashtags": abortion_with_quotes["hashtags"],
        "gender": abortion_with_quotes["gender"],
        "potential_prompt0": abortion_with_quotes['quoted_text']
    }
)



Processing dataset: twitter




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


In [None]:
# Inspect the quote-tweet subset
abortion_with_quotes


Unnamed: 0,created_at,text,gender,is_quote,hashtags,quoted_text
50,2018-01-27 15:56:02,"You're gonna lose, abortion Ken. https://t.co/...",unknown,True,,"You gotta see my view! Garland, TX https://t.c..."
53,2017-11-18 19:47:59,If our Governments vote massive money for weap...,male,True,,Anyone would think these high profile Shinners...
79,2017-12-17 21:08:45,It's also a good case for #abortion. https://t...,male,True,abortion,There was once an orange #fetus who grew into ...
86,2017-11-09 02:34:52,Just imagine how many babies are aborted at th...,female,True,prolife,"Born before 22 weeks, 'most premature' baby is..."
107,2017-12-19 17:55:07,Easy...they're not really pro-life. https://t....,female,True,,Serious question: how can anyone who identifi...
...,...,...,...,...,...,...
49795,2018-01-04 20:31:56,"""...so should abortion.."".\n\nBut here we are....",male,True,,"Campaign Trump: ""Marijuana and legalization ....."
49839,2017-12-17 19:49:37,"Abortion is a choice Lee, just like murder, st...",unknown,True,,@Maximus_4EVR -Alabama had the highest Infant ...
49891,2018-02-20 03:03:46,I wonder when @scooterbraun and @justinbieber ...,male,True,"PlannedParenthood, MarchForOurLives, fraud","Cameron we heard your call... On March 24, we ..."
49987,2018-01-27 09:06:02,This is a must read! Nails the Rights take on ...,female,True,,When Slavery Won’t Die: The Oppressive Biblica...
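
Most rows in the preview have no hashtags, and the ones that do store them as one comma-separated string. A minimal sketch of keeping only tagged tweets and splitting the tags into lists follows; it assumes that string format holds for the whole column, and the toy frame and the `hashtag_list` column name are made up for illustration.

```python
import pandas as pd

# Toy data mirroring the hashtag formats seen in the preview.
tweets = pd.DataFrame({
    "text": ["a", "b", "c"],
    "hashtags": [None, "abortion", "PlannedParenthood, MarchForOurLives, fraud"],
})

# Keep only rows that carry at least one hashtag.
with_hashtags = tweets[tweets["hashtags"].notna()].copy()

# Split the comma-separated string into a list of clean tags.
with_hashtags["hashtag_list"] = (
    with_hashtags["hashtags"]
    .str.split(",")
    .apply(lambda tags: [tag.strip() for tag in tags])
)
print(with_hashtags["hashtag_list"].tolist())
```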


## Gender and climate

In [None]:
gender_climate = pd.read_csv('./Datasets/Gender_bias/gender_climate.csv')

gender_climate = gender_climate.sample(20000)

In [None]:
gender_climate

Unnamed: 0,created_at,text,gender,is_quote,hashtags,quoted_text
0,2017-11-01 15:15:14,"RT @PPact: Doubling down on health attacks, ex...",unknown,False,,
1,2017-11-03 02:30:37,RT @EmmarKirwan: It's almost 2018 and people s...,male,False,,
2,2017-11-03 14:31:31,RT @EmmarKirwan: It's almost 2018 and people s...,unknown,False,,
3,2017-12-17 03:56:30,@AGSchneiderman Republicans are against aborti...,female,False,,
4,2018-02-20 05:13:38,RT @RealKyleMorris: How do Democrats expect cr...,male,False,,
...,...,...,...,...,...,...
49995,2018-01-04 19:08:24,March for Life Chicago 2018 to be Largest Midw...,male,False,,
49996,2017-11-03 15:24:58,"RT @benwikler: Today, the party that wants to ...",female,False,,
49997,2017-11-03 12:40:29,"RT @benwikler: Today, the party that wants to ...",unknown,False,,
49998,2017-12-11 17:55:35,RT @SouthernKeeks: I see certain people I resp...,unknown,False,,


In [None]:
process_and_add_to_global(
    gender_climate,
    dataset_name="twitter",
    column_mapping={"timestamp": "created_at", "body": "text", 'title':'quoted_text'},
    topic="Climate Change",
    additional_columns={
        "hashtags": gender_climate["hashtags"],
        "gender": gender_climate["gender"],
        "potential_prompt0": gender_climate['quoted_text']
    }
)

Processing dataset: twitter




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


## General climate datasets to balance the dimensions

### Climate change and sentiment

In [None]:
climate_sentiment = pd.read_csv('./Datasets/Climate_change/twitter_sentiment_data.csv')

In [None]:
climate_sentiment.shape

(43943, 3)

In [None]:
climate_sentiment = climate_sentiment.sample(20000)
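
`DataFrame.sample(n)` draws a different subset on every run, and raises a `ValueError` when `n` exceeds the number of rows (with the default `replace=False`). The sketch below shows one way to make the sampling reproducible and safe for small datasets. The helper `safe_sample` and the seed value are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

def safe_sample(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Sample up to n rows reproducibly; if df has fewer rows, return all of them shuffled."""
    return df.sample(n=min(n, len(df)), random_state=seed)

# Toy frame that is much smaller than the requested sample size.
small = pd.DataFrame({"message": [f"tweet {i}" for i in range(5)]})
first = safe_sample(small, 20000)
second = safe_sample(small, 20000)
print(len(first), first.index.tolist() == second.index.tolist())
```

A fixed `random_state` means that re-running the notebook adds the same rows to the global dataset each time.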

In [None]:
climate_sentiment.head()

Unnamed: 0,sentiment,message,tweetid
2955,1,"RT @Oxfam: Last year, 190+ countries signed th...",796476782216146945
31636,1,"RT @OMGno2trump: Hey #MAGA, think about this. ...",959541827551428609
39268,1,Cop21 in Paris: Commit to climate change and P...,667895923876343808
35150,1,RT @NGO_Reporting: What are the implications o...,957251467466829824
3757,1,The #ParisAgreement enters into force! Learn m...,797018604046843904


In [None]:
# Recode the numeric sentiment column into readable labels:
#  2 (News): the tweet links to factual news about climate change
#  1 (Pro): the tweet supports the belief of man-made climate change
#  0 (Neutral): the tweet neither supports nor refutes the belief of man-made climate change
# -1 (Anti): the tweet does not believe in man-made climate change

def recode_sentiment(sentiment):
    if sentiment == 2:
        return "News"  # Assuming 2 represents factual news
    elif sentiment == 1:
        return "Pro"
    elif sentiment == 0:
        return "Neutral"
    elif sentiment == -1:
        return "Anti"
    else:
        return "Unknown"  # Handle cases with unexpected sentiment values

# Apply the function to the 'sentiment' column
climate_sentiment['sentiment_category'] = climate_sentiment['sentiment'].apply(recode_sentiment)

In [None]:
climate_sentiment.head()

Unnamed: 0,sentiment,message,tweetid,sentiment_category
2955,1,"RT @Oxfam: Last year, 190+ countries signed th...",796476782216146945,Pro
31636,1,"RT @OMGno2trump: Hey #MAGA, think about this. ...",959541827551428609,Pro
39268,1,Cop21 in Paris: Commit to climate change and P...,667895923876343808,Pro
35150,1,RT @NGO_Reporting: What are the implications o...,957251467466829824,Pro
3757,1,The #ParisAgreement enters into force! Learn m...,797018604046843904,Pro
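
The same recoding can also be written as a dictionary lookup with `Series.map`, which avoids calling a Python function per row. This is an equivalent sketch on toy data; the dictionary name `SENTIMENT_LABELS` is introduced here for illustration.

```python
import pandas as pd

# Same label scheme as recode_sentiment.
SENTIMENT_LABELS = {2: "News", 1: "Pro", 0: "Neutral", -1: "Anti"}

# Toy data, including one unexpected value (5).
demo = pd.DataFrame({"sentiment": [2, 1, 0, -1, 5]})

# Values missing from the dictionary become NaN, which fillna turns into "Unknown".
demo["sentiment_category"] = demo["sentiment"].map(SENTIMENT_LABELS).fillna("Unknown")
print(demo["sentiment_category"].tolist())
```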


In [None]:
global_df.head(1)

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender
0,Health Canada approves AstraZeneca COVID-19 va...,,Reddit,2021-02-27 06:33:45,1.0,myth,post_title,COVID,Vaccines,,,,,,,


In [None]:
process_and_add_to_global(
    climate_sentiment,
    dataset_name="twitter",
    column_mapping={ "body": "message"},
    topic="Climate Change",
    additional_columns={
        "sentiment_category": climate_sentiment['sentiment_category']
    }
)

Processing dataset: twitter




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


## Climate Reddit data
https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset?select=the-reddit-climate-change-dataset-comments.csv

## Normal news about climate
https://www.kaggle.com/datasets/beridzeg45/guardian-environment-related-news

In [None]:
# TODO: load the Reddit climate change dataset linked above
reddit_climate = None  # placeholder until the data is downloaded

In [None]:
normal_climate = pd.read_csv('./Datasets/Climate_change/guardian_environment_news.csv')

In [None]:
normal_climate = normal_climate.sample(10000)

In [None]:
process_and_add_to_global(
    normal_climate,
    dataset_name="news",
    column_mapping={"title":"Title",
                    "body":"Article Text",
                    "timestamp":'Date Published'},
    topic="Climate Change",
    additional_columns={
        'subtitle': normal_climate['Intro Text']
    }
)

Processing dataset: news




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


## Abortion datasets from different sources

https://www.kaggle.com/datasets/mcantoni81/twitter-supervised-dataset-prochoice-vs-prolife

Twitter response to Roe v. Wade: https://www.kaggle.com/datasets/hamiltonwhite/twitter-response-to-roe-v-wade-scotus

## Pro-choice vs pro-life

In [None]:
pro_choice = pd.read_csv('./Datasets/prochoice_prolife.csv')

In [None]:
pro_choice = pro_choice.sample(10000)
process_and_add_to_global(
    pro_choice,
    dataset_name="twitter",
    column_mapping={
                    "body":"text",
                    "timestamp":'created_at'},
    topic="Abortion",
    additional_columns={
        'prochoice_prolife': pro_choice['target'].map({0:'prochoice',1:'prolife'})
    }
)

Processing dataset: twitter




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


### Baseline News and News with titles

## Political Emails

In [None]:
global_df.shape

(92983, 12)

In [None]:
political_emails = pd.read_csv('./Datasets/political-emails.csv')

In [None]:
political_emails.head()

Unnamed: 0.1,Unnamed: 0,sender_name,email,subject,datetime,cleaned_content
0,0,Indiana Republican Party,info@indiana.gop,Next Level Roads / Congress of Counties Survey...,2017-07-14 13:17:52,Next Level Roads / Congress of Counties Survey...
1,1,DWS,info@dwsforcongress.com,Time to turn up the heat on Republicans,2017-07-14 12:04:50,"I know it's Friday, friends.But before you hea..."
2,2,midnight-deadline@dscc.org,info@dscc.org,Your impact QUADRUPLED (limited time only!),2017-09-30 15:57:03,Time is almost up...deadline: MIDNIGHT!Pitch i...
3,3,Hillary Clinton,info@timkaine.com,Will you join me?,2017-09-29 10:46:22,Kaine for VirginiaWhen I was running for presi...
4,4,Team Ryan,info@speakerryan.com,Do you live paycheck to paycheck?,2017-10-10 16:31:15,"Friend,Do you live paycheck to paycheck? It'..."


In [None]:
political_emails.shape

(86517, 6)

In [None]:
political_emails = political_emails.sample(10000)

process_and_add_to_global(
    political_emails,
    dataset_name="political emails",
    column_mapping={"title": "subject", "timestamp": "datetime", "body": "cleaned_content"},
    topic="Polarization",
    additional_columns={
        'type_of_content': 'political email',
        'potential_prompt0': political_emails['sender_name'] + ": " + political_emails['subject'],
    }
)

Processing dataset: political emails



Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv
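
The `potential_prompt0` column is built by concatenating `sender_name` and `subject`. In pandas, the result of such a concatenation is missing whenever either part is missing, so some emails could end up without a prompt. A minimal sketch of filling the gaps first is shown below; the toy rows and the `"Unknown sender"` placeholder are assumptions.

```python
import pandas as pd

# Toy rows; the second one has no sender name.
emails = pd.DataFrame({
    "sender_name": ["Team Ryan", None],
    "subject": ["Do you live paycheck to paycheck?", "Will you join me?"],
})

# Fill missing parts before concatenating so every row gets a prompt.
sender = emails["sender_name"].fillna("Unknown sender")
subject = emails["subject"].fillna("")
emails["potential_prompt0"] = sender + ": " + subject
print(emails["potential_prompt0"].tolist())
```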


In [None]:
global_df.sample(10)

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0
12437,Cherokee Nation set to receive first doses of ...,"No, not really. But the rightwing is fucking *...",Reddit,2020-12-11 16:39:32,,,post_and_title,COVID,Vaccines,"{'people': [], 'organizations': [], 'locations...",,,,
100512,Donald doesn't want you to know you can still ...,The Democratic PartyThe deadline to get covere...,political emails,2017-01-30 05:33:00,,,,Polarization,,"{'people': ['Trump'], 'organizations': ['Party...",,,political email,Democrats.org: Donald doesn't want you to know...
89958,,"In the last year, the Trump Administration has...",twitter,,,,,Polarization,,"{'people': [], 'organizations': ['the Trump Ad...",Democrat,,,
70294,,RT @HouseVetAffairs: House passes legislation ...,twitter,,,,,Polarization,,"{'people': [], 'organizations': ['House'], 'lo...",Republican,,,
98283,Phase 3 guidance and plans for schools,"Your RI COVID-19 News Update Hello friend, I ...",political emails,2020-06-19 18:50:56,,,,Polarization,,"{'people': [], 'organizations': ['the Departme...",,,political email,Gina Raimondo : Phase 3 guidance and plans for...
90574,,Walker's Community Hero Award Goes To Williams...,twitter,,,,,Polarization,,"{'people': [], 'organizations': ['Walker's'], ...",Republican,,,
48663,,"RT @CourantPhotogs: At the Colt Armory, hundre...",twitter,,,,,Polarization,,"{'people': [], 'organizations': ['RT @CourantP...",Democrat,,,
22949,U.K. Will Start Immunizing People Against COVI...,Vaccinate as soon as humanly possible.\n\nDo i...,Reddit,2020-12-05 01:51:03,,,post_and_title,COVID,Vaccines,"{'people': [], 'organizations': ['Vaccinate'],...",,,,
102873,MSNBC ANNOUNCED: devastating news,I need your help right now. As I’m sure you’r...,political emails,2019-11-21 01:18:00,,,,Polarization,,"{'people': ['Julián Paid', 'Julián', 'Julián C...",,,political email,Julián Castro: MSNBC ANNOUNCED: devastating news
101140,"Tonight: Join AOC, Jamaal, and organizers for ...","Tonight, AOC, Jamaal, and organizers from Hous...",political emails,2020-07-15 19:31:00,,,,Polarization,,"{'people': ['John', 'Jamaal Bowman', 'John', '...",,,political email,"Team Bowman: Tonight: Join AOC, Jamaal, and or..."


## Political Ads on Facebook

# Gaza Conflict (optional)

### Different environmental issues

### YouTube commentary

## News headlines with outlet political labels

https://www.kaggle.com/datasets/diegoexe/2024-us-presidential-elections-top-news?select=fox-news_articles.txt

Political affiliation can be inferred from the outlet.

Putin and Carlson interview sentiment - https://www.kaggle.com/datasets/kanchana1990/social-media-sentiments-putin-and-carlson-interview

Political sentiment in YouTube comments - https://www.kaggle.com/datasets/rashidul0/political-sentiment-analysis-comments-from-youtube

## Google War news

https://www.kaggle.com/datasets/armitaraz/google-war-news

In [None]:
global_df.shape

(164095, 17)

In [None]:
war_news = pd.read_csv('./Datasets/war-news.csv', encoding='latin-1') # Try 'latin-1' encoding

In [None]:
war_news.head()

Unnamed: 0.1,Unnamed: 0,Headlines,Summary,Press,Date,Keyword
0,0,I served in Iraq and Afghanistan but the horro...,A WAR hero traumatised by the horrors of comba...,The Sun,1 day ago,Afghanistan
1,1,The forever war in Afghanistan is nowhere near...,Islamic State is seeking to overthrow the Tali...,ThePrint,2 weeks ago,Afghanistan
2,2,"Hell at Abbey Gate: Chaos, Confusion and Death...","In firsthand accounts, Afghan civilians and U....",ProPublica,1 month ago,Afghanistan
3,3,A second Afghanistan: Doubts over Russias w...,Russia's lack of progress in its war against U...,Al Jazeera,5 days ago,Afghanistan
4,4,Afghanistan: Former army general vows new war ...,Lt Gen Sami Sadat tells the BBC of planned ope...,BBC,1 week ago,Afghanistan
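
The `latin-1` encoding is passed explicitly for this file. A hedged sketch of a more general pattern follows: try UTF-8 first and fall back only when decoding fails. The helper `read_csv_with_fallback` and the demo file are hypothetical. Note that `latin-1` maps every byte to some character, so it never fails, and the result can still contain garbled characters that are worth checking by eye.

```python
import os
import tempfile
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (DataFrame, encoding that worked)."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            last_error = error
    raise last_error

# Demo on a throwaway file containing a byte sequence that is invalid in UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write(b"Headlines,Press\nCaf\xe9 reopens,BBC\n")
    demo_path = handle.name

demo_df, used_encoding = read_csv_with_fallback(demo_path)
os.remove(demo_path)
print(used_encoding, demo_df.loc[0, "Headlines"])
```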


In [None]:
war_news.shape

(5654, 6)

In [None]:
global_df.head(1)

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender,sentiment_category
0,Health Canada approves AstraZeneca COVID-19 va...,,Reddit,2021-02-27 06:33:45,1.0,myth,post_title,COVID,Vaccines,,,,,,,,


In [None]:
process_and_add_to_global(
    war_news,
    dataset_name="news",
    column_mapping={ "title":"Headlines",
                    "body":"Summary"},
    topic="War",

    additional_columns={
        "Publisher": war_news['Press'],
        "subtopic" : war_news['Keyword'],
        "type_of_content": "War news"

    }
)

Processing dataset: news






Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


## Google daily news


### Actual news

### Sample datasets with more prompt information (but also with content, of course)

different subheadings

https://www.kaggle.com/datasets/diegoexe/2024-us-presidential-elections-top-news?select=fox-news_articles.txt

https://www.kaggle.com/datasets/rmisra/news-category-dataset


https://www.kaggle.com/code/adityarahul/fakenewsdetection/input


https://www.kaggle.com/datasets/mrisdal/fake-news

https://www.kaggle.com/datasets/ruchi798/source-based-news-classification/data


includes half truth - https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data


https://www.kaggle.com/datasets/thedevastator/fakecovid-fact-checked-news-dataset - fact checked covid

## UK titles

In [26]:
uk_headlines = pd.read_csv('./Datasets/Overall_news/news_headlines_20_days.csv')
uk_headlines = uk_headlines.sample(10000)

In [27]:
uk_headlines['website'].unique()

array(['Telegraph', 'Times', 'Daily Mail', 'Independent',
       'Manchester Evening News', 'Liverpool Echo', 'Birmingham Live',
       'Guardian', 'BBC', 'Daily Express', 'Mirror', 'Sun', 'Sky News',
       'Metro', 'Evening Standard'], dtype=object)

In [28]:
uk_headlines.columns, global_df.columns

(Index(['id', 'website', 'timestamp scraped', 'headline'], dtype='object'),
 Index(['title', 'body', 'source', 'timestamp', 'Misinfo_flag',
        'type_of_misinfo', 'type_reddit', 'topic', 'subtopic', 'entities',
        'Polarization_flag', '\tMisinfo_flag', 'type_of_content',
        'potential_prompt0', 'hashtags', 'gender', 'sentiment_category',
        'Publisher', 'subtitle', 'prochoice_prolife'],
       dtype='object'))
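
The column list contains both `Misinfo_flag` and `'\tMisinfo_flag'`, a name that differs only by a leading tab, so the same information is split across two columns. One possible cleanup, sketched here under the assumption that the two columns never disagree on a row, is to strip whitespace from the names and keep the first non-missing value. The helper `coalesce_whitespace_duplicates` is hypothetical.

```python
import pandas as pd

def coalesce_whitespace_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Merge columns whose names differ only by surrounding whitespace."""
    stripped = df.columns.str.strip()
    merged = {}
    for name in dict.fromkeys(stripped):  # unique names, original order
        group = df.loc[:, stripped == name]
        # Take the first non-missing value across the duplicate columns.
        merged[name] = group.bfill(axis=1).iloc[:, 0]
    return pd.DataFrame(merged, index=df.index)

# Toy frame reproducing the tab-prefixed duplicate.
demo = pd.DataFrame({
    "title": ["a", "b", "c"],
    "Misinfo_flag": [1.0, None, None],
    "\tMisinfo_flag": [None, 0.0, None],
})
cleaned = coalesce_whitespace_duplicates(demo)
print(cleaned.columns.tolist())
```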

In [30]:
process_and_add_to_global(
    uk_headlines,
    dataset_name="news",
    column_mapping={
        "title": "headline",
        "timestamp": "timestamp scraped",
    },
    # topic = 'news',
    additional_columns={
        'potential_prompt0': uk_headlines['headline'],
        'Polarization_flag': uk_headlines['website'].map({
            'BBC': 'Neutral/Mixed',
            'Sun': 'Conservative/Republican-leaning',
            'Mirror': 'Labour/Democratic-leaning',
            'Daily Mail': 'Conservative/Republican-leaning',
            'Independent': 'Labour/Democratic-leaning',
            'Telegraph': 'Conservative/Republican-leaning',
            'Guardian': 'Labour/Democratic-leaning',
            'Manchester Evening News': 'Neutral/Mixed',
            'Sky News': 'Neutral/Mixed (slightly Conservative)',
            'Metro': 'Neutral/Mixed',
            'Daily Express': 'Conservative/Republican-leaning',
            'Times': 'Conservative/Republican-leaning',
            'Liverpool Echo': 'Labour/Democratic-leaning',
            'Birmingham Live': 'Neutral/Mixed',
            'Evening Standard': 'Conservative/Republican-leaning',
        }),
        "Publisher": uk_headlines['website'],
    }
)




Processing dataset: news




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


In [31]:
global_df.tail(15)

Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender,sentiment_category,Publisher,subtitle,prochoice_prolife
239734,You can see what drivers rate you on Uber: Fin...,,news,NaT,,,,,,"{'people': [], 'organizations': [], 'locations...",Conservative/Republican-leaning,,,You can see what drivers rate you on Uber: Fin...,,,,Times,,
239735,Norovirus levels ‘significantly higher’ than l...,,news,2023-02-16 13:00:00-10:00,,,,,,"{'people': [], 'organizations': [], 'locations...",Labour/Democratic-leaning,,,Norovirus levels ‘significantly higher’ than l...,,,,Independent,,
239736,Westhoughton town hall project could see cinem...,,news,2023-02-17 13:00:00-18:00,,,,,,"{'people': [], 'organizations': [], 'locations...",Neutral/Mixed,,,Westhoughton town hall project could see cinem...,,,,Manchester Evening News,,
239737,Eggs and margarine drive food inflation to rec...,,news,NaT,,,,,,"{'people': [], 'organizations': [], 'locations...",Neutral/Mixed,,,Eggs and margarine drive food inflation to rec...,,,,BBC,,
239738,Two elderly people trapped in their homes in C...,,news,NaT,,,,,,"{'people': [], 'organizations': ['Chaah'], 'lo...",Conservative/Republican-leaning,,,Two elderly people trapped in their homes in C...,,,,Sun,,
239739,The final email and an abandoned Teams call: T...,,news,NaT,,,,,,"{'people': [], 'organizations': [], 'locations...",Labour/Democratic-leaning,,,The final email and an abandoned Teams call: T...,,,,Independent,,
239740,Mortified woman buys portraits - then discover...,,news,NaT,,,,,,"{'people': [], 'organizations': [], 'locations...",Labour/Democratic-leaning,,,Mortified woman buys portraits - then discover...,,,,Mirror,,
239741,"\n\tMore than 10,000 pride pin badges ordered ...",,news,NaT,,,,,,"{'people': [], 'organizations': ['NHS'], 'loca...",Conservative/Republican-leaning,,,"\n\tMore than 10,000 pride pin badges ordered ...",,,,Daily Mail,,
239742,Argos to close 50 stores in 2023 amid Sainsbur...,,news,2023-02-25 14:00:00-19:00,,,,,,"{'people': [], 'organizations': ['Sainsbury'],...",Conservative/Republican-leaning,,,Argos to close 50 stores in 2023 amid Sainsbur...,,,,Times,,
239743,EasyJet pilot does mid-air 360-degree turn so ...,,news,NaT,,,,,,"{'people': [], 'organizations': ['EasyJet'], '...",Neutral/Mixed,,,EasyJet pilot does mid-air 360-degree turn so ...,,,,Metro,,
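
Many of the rows above end up with `NaT` in `timestamp`, which suggests that `pd.to_datetime(..., errors="coerce")` could not parse a large share of the scraped values. A small diagnostic sketch follows: parse with an explicit format, then list the raw strings that failed. The format string and the toy values are assumptions, since the real format of `timestamp scraped` is not shown here.

```python
import pandas as pd

# Toy values standing in for the scraped timestamps.
raw = pd.Series(["2023-02-16 13:00:00", "not a date", None, "2023-02-25 14:00:00"])

# An explicit format makes parsing predictable; unparseable values become NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Rows that had a raw value but still failed to parse.
failed = raw[parsed.isna() & raw.notna()]
nat_share = parsed.isna().mean()
print(failed.tolist(), nat_share)
```

Looking at `failed` before appending to the global dataset shows which format the parser should actually be given.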


# News with short descriptions (may be used for prompting)


In [None]:
news_category = pd.read_json('./Datasets/Overall_news/News_Category_Dataset_v3.json', lines=True)

In [None]:
news_category = news_category.sample(10000)

In [None]:
news_category.head()

Unnamed: 0,link,headline,category,short_description,authors,date
153,https://www.huffpost.com/entry/bc-lt-mexico-mi...,6 Missing College Students In Mexico Were Held...,WORLD NEWS,Six of 43 students kidnapped in 2014 were kept...,"Fabiola Sánchez and Christopher Sherman, AP",2022-08-27
57281,https://www.huffingtonpost.com/entry/nickelode...,Even Kids Know Equality For Everyone Is Important,POLITICS,The top issues for kids in this election inclu...,Amber Ferguson,2016-09-07
135354,https://www.huffingtonpost.com/entry/nike-shoe...,"If You Don't Have A Pair Of Badass Nike Shoes,...",STYLE & BEAUTY,This week Nike celebrated the 27th anniversary...,Megan Mayer,2014-03-30
171993,https://www.huffingtonpost.com/entry/paris-fas...,Who Needs Fashion When You Really Need Sleep?,STYLE & BEAUTY,It is the end of Paris Fashion Week and I am f...,"Eddie Parsons, Contributor\nFashion Marketing",2013-03-06
181422,https://www.huffingtonpost.comhttp://www.care2...,"If A Teenage Girl Is Unhappy, It's Mom's Fault...",PARENTING,Just what mothers need: one more thing to feel...,,2012-11-26


In [2]:
process_and_add_to_global(
    news_category,
    dataset_name="news",
    column_mapping={
        "title": "headline",
        "timestamp": "date",
        "body": "short_description",
    },
    # topic = 'news',
    additional_columns={
        'potential_prompt0': news_category['headline'],
        'topic': news_category['category'],
        # 'type_of_content': 'news',
    }
)



NameError: name 'process_and_add_to_global' is not defined

In [1]:
global_df.tail()

NameError: name 'global_df' is not defined
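
Both `NameError`s are what a restarted kernel produces: the cells that define `process_and_add_to_global` and `global_df` have to be re-run first. Since the pipeline saves the global dataframe to `global_dataset.csv` after every step, `global_df` could be restored from that file. The sketch below illustrates the idea with a temporary file; the helper `load_global_df` is hypothetical.

```python
import os
import tempfile
import pandas as pd

def load_global_df(path: str) -> pd.DataFrame:
    """Reload the saved global dataset, or start empty if no file exists yet."""
    if os.path.exists(path):
        return pd.read_csv(path, low_memory=False)
    return pd.DataFrame()

# Demo with a temporary file standing in for global_dataset.csv.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "global_dataset.csv")
    pd.DataFrame({"title": ["x"], "topic": ["COVID"]}).to_csv(demo_path, index=False)
    restored = load_global_df(demo_path)
    empty = load_global_df(os.path.join(folder, "missing.csv"))

print(restored.shape, empty.empty)
```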

## Fake news

In [None]:
fakenews = pd.read_csv('./Datasets/fakeNews/WELFake_Dataset.csv')
fakenews = fakenews.sample(10000)
# Append fakenews to the global dataset (done in the call below)



In [None]:
fakenews.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
60909,60909,Trove of Bin Laden documents released,U.S. intelligence officials on Wednesday relea...,0
16707,16707,Secret Service Just Killed The Fun: No Open C...,Despite a petition calling for guns to be allo...,1
40031,40031,Democrats Are Pushing A Bill On Tax Returns T...,What is the one thing that might actually scar...,1
53583,53583,"Zimbabwe truck accident kills 21, injures others",HARARE (Reuters) - At least 21 people died in ...,0
31336,31336,Clinton Gets Back In The Game After Blowout Lo...,Clinton Gets Back In The Game After Blowout Lo...,0


In [None]:
process_and_add_to_global(
    fakenews,
    dataset_name="news",
    column_mapping={"title": "title", "body": "text"},
    additional_columns={
        'Misinfo_flag': fakenews['label'],
    }
)

Processing dataset: news





Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv


## Headlines for prompting

https://www.kaggle.com/datasets/sagunsh/coronavirus-news-headline

https://www.kaggle.com/datasets/fringewidth/climate-change-news

https://www.kaggle.com/datasets/dexmasa/uk-news-headlines

Good for fake news in politics - https://www.kaggle.com/datasets/techykajal/fakereal-news

## Coronavirus headlines with publisher and political affiliation


In [None]:
covid_headlines = pd.read_csv('./Datasets/Covid_Datasets/corona_news.csv', on_bad_lines='skip')
covid_headlines = covid_headlines.sample(10000)





In [None]:
process_and_add_to_global(
    covid_headlines,
    dataset_name="news",
    column_mapping={
        "title": "title",
        "timestamp": "published_date",
    },
    topic='COVID',
    additional_columns={
        'potential_prompt0': covid_headlines['title'],
        'Publisher': covid_headlines['source'],
        'type_of_content': 'Covid news',
        'Polarization_flag': covid_headlines['source'].map({
            'nationalpost.com': 'Neutral/Mixed',
            'news.yahoo.com': 'Democratic',
            'dailymail.co.uk': 'Republican',
            'theguardian.com': 'Democratic',
            'feeds.foxnews.com': 'Republican',
        }),
    }
)

Processing dataset: news




Global dataframe saved to: /content/drive/MyDrive/MIDS/XAI/XAI_Final_project/global_dataset.csv
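
Only five publishers appear in the mapping, so `Series.map` leaves `Polarization_flag` missing for every other source (for example `business-standard.com`). The sketch below lists the unmapped publishers so they can be labelled or excluded deliberately. The shortened dictionary, the toy series and the `"Unlabeled"` placeholder are assumptions for illustration.

```python
import pandas as pd

# Shortened version of the publisher mapping.
LEANING_BY_SOURCE = {
    "theguardian.com": "Democratic",
    "feeds.foxnews.com": "Republican",
}

sources = pd.Series(["theguardian.com", "business-standard.com", "feeds.foxnews.com"])
leaning = sources.map(LEANING_BY_SOURCE)

# Publishers that the mapping does not cover.
unmapped = sorted(sources[leaning.isna()].unique())

# Optionally make the gap explicit instead of leaving NaN.
leaning_filled = leaning.fillna("Unlabeled")
print(unmapped, leaning_filled.tolist())
```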


Unnamed: 0,title,body,source,timestamp,Misinfo_flag,type_of_misinfo,type_reddit,topic,subtopic,entities,Polarization_flag,\tMisinfo_flag,type_of_content,potential_prompt0,hashtags,gender,sentiment_category,Publisher,subtitle,prochoice_prolife
185614,,Life is always best!!! #chooselife #ProLife ht...,twitter,2022-06-26 21:31:04+00:00,,,,Abortion,,"{'people': [], 'organizations': [], 'locations...",,,,,,,,,,prolife
125839,,RT @NARAL: “There is nothing ‘pro-life’ about ...,twitter,2018-02-09 00:05:53,,,,Abortion,,"{'people': [], 'organizations': ['RT @NARAL'],...",,,,,,unknown,,,,
12216,People who suffer from ‘significant’ allergic ...,Those are entirely valid concerns. \n\nDo keep...,Reddit,2020-12-10 16:40:06,,,post_and_title,COVID,Vaccines,"{'people': [], 'organizations': ['CVS'], 'loca...",,,,,,,,,,
162166,,An interesting article for those who have been...,twitter,,,,,Climate Change,,"{'people': ['¦'], 'organizations': [], 'locati...",,,,,,,Pro,,,
164142,How sanctions became Bidens weapon of choice ...,"Sanctions, long before the war in Ukraine, hav...",news,,,,,War,Afghanistan,"{'people': ['Biden'], 'organizations': [], 'lo...",,,War news,,,,,Vox,,
52649,,"The Broward Co. Sheriff, in front of a nationw...",twitter,,,,,Polarization,,"{'people': ['https://t.co/DsSuPpiyGs'], 'organ...",Republican,,,,,,,,,
203432,"Maharashtra Covid-19 toll up by 1,328 after st...",,news,2020-06-17 03:09:49,,,,COVID,,"{'people': [], 'organizations': ['Maharashtra ...",,,Covid news,"Maharashtra Covid-19 toll up by 1,328 after st...",,,,business-standard.com,,
189211,,It is what it is 🤷🏻‍♀️#roevwade #prochoice #fu...,twitter,2022-07-03 18:23:07+00:00,,,,Abortion,,"{'people': [], 'organizations': [], 'locations...",,,,,,,,,,prochoice
208194,Navy opens full investigation into coronavirus...,,news,2020-04-29 23:38:25,,,,COVID,,"{'people': ['USS Theodore Roosevelt'], 'organi...",Republican,,Covid news,Navy opens full investigation into coronavirus...,,,,feeds.foxnews.com,,
158121,,RT @LarrySabato: The term $q$cafeteria Catholi...,twitter,,,,,Climate Change,,"{'people': ['q$cafeteria', 'Rs'], 'organizatio...",,,,,,,Neutral,,,


# Super-powered tweets
https://www.kaggle.com/datasets/jackksoncsie/twitter-dataset-keywords-likes-and-tweets - with many topics already labelled

Political events classifier: https://www.kaggle.com/datasets/akashhiremath25/eventclassifier-twitter-data-set - may be useful to see which events it would generate?

## Truth Social - alternative platform for political discussions
https://www.kaggle.com/datasets/kashishashah/truthsocial-2024-election-integrity-initiative

## Trump, the 2024 campaign and elections

https://www.kaggle.com/datasets/muhammetakkurt/trump-2024-campaign-truthsocial-truths-tweets




## Gaza

https://www.kaggle.com/datasets/emirslspr/israel-hamas-conflict-news-dataset

reddit - https://www.kaggle.com/datasets/asaniczka/reddit-on-israel-palestine-daily-updated


## Sentiments

Hatred on Twitter during the MeToo movement: https://www.kaggle.com/datasets/rahulgoel1106/hatred-on-twitter-during-metoo-movement