# Newspaper portrayal analysis of Israel and Palestine
In this project, we train a word embedding model (a model that assigns meaningful vectors to words), specifically Word2Vec, on corpora from multiple newspapers. We build these corpora by scraping the websites of different news sources. To see how the scraping was done, check out the GitHub repository for the project: https://github.com/McGill-AI-Lab/news-bias-model

#### Note:
In the following Jupyter notebook, we had to delete some of the cell outputs, either because they were too large or because of copyright constraints. All of the code in the notebook was run.

### Abstract
lorem ipsum




### Access and preprocess our data

We load our data from 'data/news-data-extracted.json', which is also available in the repository. The file holds a dictionary whose keys are newspaper names. The value for each newspaper is a list of dictionaries, one per article, with the keys: url, title, authors, date, text.
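
To make this layout concrete, here is a minimal sketch that uses a made-up miniature dictionary in place of the real file. The newspaper name `example.com`, all article values, and the helper `articles_per_newspaper` are placeholders for illustration and are not part of the repository.

```python
# Toy stand-in for the contents of data/news-data-extracted.json.
# Every value below is an invented placeholder.
toy_data = {
    "example.com": [
        {
            "url": "https://example.com/a",
            "title": "First placeholder title",
            "authors": ["A. Author"],
            "date": "2020-01-01 00:00:00",
            "text": "Placeholder text of the first article.",
        },
        {
            "url": "https://example.com/b",
            "title": "Second placeholder title",
            "authors": [],
            "date": "2020-01-02 00:00:00",
            "text": "Placeholder text of the second article.",
        },
    ],
}


def articles_per_newspaper(data):
    """Map each newspaper key to the number of articles stored under it."""
    return {name: len(articles) for name, articles in data.items()}


counts = articles_per_newspaper(toy_data)
print(counts)
```

The same helper should work on the real `data` dictionary once it has been loaded with `json.load`.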

In [1]:
import json

# Open and read the JSON file
with open('data/news-data-extracted.json', 'r') as file:
    data = json.load(file)

# Print the first CNN article
# data["cnn.com"] is a list of article dictionaries; take the first one
first_article_data = data["cnn.com"][0]
first_article = first_article_data["text"]
print(first_article_data)

{'url': 'https://www.cnn.com/2020/01/23/opinions/auschwitz-anniversary-anti-semitism-fears-linger-andelman/index.html', 'title': 'World leaders in Jerusalem show battle against anti-Semitism not yet a victory (opinion)', 'authors': ['David A. Andelman'], 'date': '2020-01-23 00:00:00', 'text': 'Editors Note: David A. Andelman, Executive Director of The RedLines Project, is a contributor to CNN where his columns won the Deadline Club Award for Best Opinion Writing. Author of A Shattered Peace: Versailles 1919 and the Price We Pay Today, and the forthcoming A Red Line in the Sand: Diplomacy, Strategy and a History of Wars That Almost Happened, he was formerly a foreign correspondent for The New York Times and CBS News in Europe and Asia. Follow him on Twitter @DavidAndelman. The views expressed in this commentary are his own. View more opinion on CNN.CNN The world converged on Jerusalem this week to observe the 75th anniversary of the liberation of the Auschwitz death camp  and with a col

In [2]:
first_article[0]

'E'

We need to preprocess our data. First, we divide each article into sentences with the nltk library, since Word2Vec is trained on lists of words (sentences). Nltk uses a machine learning model to decide where to split an article into sentences, so there will be some inaccuracies, but we can ignore these. Next, we lowercase every word. We then remove extremely high-frequency words: because they co-occur with a large portion of the corpus, they contribute little to the embeddings of the words around them. These are called "stop words"; examples are "I", "you", "of", and "there".

Then we lemmatize, i.e. try to convert each word to its root form (runs -> run), so that the information about "run" is concentrated in one token instead of being spread across its various forms ("runs", "running", "ran"). Note that nltk's WordNetLemmatizer treats every word as a noun unless it is given a part-of-speech tag, so in the code below verb forms such as "happened" are left unchanged. For more information on lemmatizers: https://www.geeksforgeeks.org/python-lemmatization-with-nltk/. Finally, we remove punctuation and combine all of these steps in a single "preprocess_article" function.
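
The order of the steps described above can be sketched without any external library. This is only an illustration of the shape of the output (a list of sentences, each a list of words): the stop-word set, the lemma table, the regex-based sentence splitter, and the name `toy_preprocess` are all hypothetical stand-ins for the nltk and gensim tools the notebook actually uses.

```python
import re

# Tiny hand-written stand-ins; the notebook uses NLTK's stop-word list and
# WordNetLemmatizer instead.
TOY_STOP_WORDS = {"the", "a", "is", "are", "of", "and", "in"}
TOY_LEMMAS = {"wars": "war", "leaders": "leader"}


def toy_preprocess(article):
    """Return a list of sentences, each a list of cleaned words."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    result = []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z]+", sentence)              # keep alphabetic tokens only
        words = [w.lower() for w in words]                      # lowercase
        words = [w for w in words if w not in TOY_STOP_WORDS]   # remove stop words
        words = [TOY_LEMMAS.get(w, w) for w in words]           # toy lemmatization
        if words:
            result.append(words)
    return result


example = toy_preprocess("The leaders met in Jerusalem. Wars are costly!")
print(example)
```

With the toy tables above, the example prints `[['leader', 'met', 'jerusalem'], ['war', 'costly']]`.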

In [3]:
from gensim.utils import tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Tokenizes an article into sentences, each of which is tokenized into words
def tokenize_article(article):
    tokenized_article = []
    sentences = sent_tokenize(article, language="english")  # divide article into sentences

    for sentence in sentences:
        # gensim's tokenize returns a generator, so materialize it as a list of words
        tokenized_sentence = list(tokenize(sentence))
        tokenized_article.append(tokenized_sentence)
    return tokenized_article

# makes each word lowercase
def lowercase(tokenized_article):
    lowercase_article = []

    for sentence in tokenized_article:
        current_sentence = []
        for word in sentence:
            current_sentence.append(word.lower())
        lowercase_article.append(current_sentence)

    return lowercase_article

stop_words = set(stopwords.words("english"))

def remove_stopwords(tokenized_article):
    # Iterate over the index and content of each sentence
    for i in range(len(tokenized_article)):
        # Create a new list for the filtered sentence
        filtered_sentence = []
        for word in tokenized_article[i]:
            if word not in stop_words:
                filtered_sentence.append(word)
        # Replace the original sentence with the filtered sentence
        tokenized_article[i] = filtered_sentence
    return tokenized_article

def lemmatization(tokenized_article):
    lemmatizer = WordNetLemmatizer()

    lemmatized_article = []

    for sentence in tokenized_article:
        current_sentence = []
        for word in sentence:
            # without a part-of-speech tag, lemmatize() treats the word as a noun
            current_sentence.append(lemmatizer.lemmatize(word))
        lemmatized_article.append(current_sentence)

    return lemmatized_article


def remove_punctuation(tokenized_article):
    punc_removed_article = []

    for sentence in tokenized_article:
        punc_removed_sentence = []
        for word in sentence:
            # Drop every non-word character from the token
            cleaned_word = re.sub(r"\W+", "", word)
            if cleaned_word:  # Add non-empty words only
                punc_removed_sentence.append(cleaned_word)

        punc_removed_article.append(punc_removed_sentence)

    return punc_removed_article

def preprocess_article(article):
    t_article = tokenize_article(article)
    l_article = lowercase(t_article)
    r_article = remove_stopwords(l_article)
    la_article = lemmatization(r_article)
    re_article = remove_punctuation(la_article)
    return re_article

In [4]:
print(stop_words)

{'whom', 'doing', "should've", 'only', 'his', "she's", "aren't", 'the', 'which', 'they', 'because', "haven't", 'weren', 'but', 'there', 'were', 'has', 'should', "hadn't", 'do', 'after', 'ourselves', 'myself', 'this', 'was', 'shan', 'few', 'having', "shouldn't", 'am', 'i', 'me', 'nor', 'or', 'about', 'won', 'shouldn', 'hasn', "you're", 'no', "mustn't", "wouldn't", 'aren', 'him', 'have', 'from', 'in', 'hers', 'yourselves', 'above', 'and', "couldn't", 'through', 'them', 'same', 'y', 'against', 've', 'hadn', 'just', 'with', 'at', 'what', 'themselves', 'their', "you'd", 'its', 'under', 'wasn', 'other', 'not', "you've", 'those', 'now', 'ain', 'couldn', 'once', 'any', 'how', 'as', 'both', 'don', "didn't", 'd', 'is', 'all', 're', 'we', 'if', 'had', 'until', 'too', 'isn', 'own', 'himself', 'off', 'herself', "doesn't", "needn't", 'be', "it's", "hasn't", "wasn't", 'down', 'does', 'than', 'ours', 'to', 'on', 'our', 'here', 'been', "won't", 'very', 'being', 's', 'theirs', 'it', 'he', 'into', "shan'

In [5]:
preprocessed = preprocess_article(first_article)
print(preprocessed)  # this is what a preprocessed article looks like

[['editor', 'note', 'david', 'andelman', 'executive', 'director', 'redlines', 'project', 'contributor', 'cnn', 'column', 'deadline', 'club', 'award', 'best', 'opinion', 'writing'], ['author', 'shattered', 'peace', 'versailles', 'price', 'pay', 'today', 'forthcoming', 'red', 'line', 'sand', 'diplomacy', 'strategy', 'history', 'war', 'almost', 'happened', 'formerly', 'foreign', 'correspondent', 'new', 'york', 'time', 'cbs', 'news', 'europe', 'asia'], ['follow', 'twitter', 'davidandelman'], ['view', 'expressed', 'commentary'], ['view', 'opinion', 'cnn', 'cnn', 'world', 'converged', 'jerusalem', 'week', 'observe', 'th', 'anniversary', 'liberation', 'auschwitz', 'death', 'camp', 'collective', 'determination', 'battle', 'anti', 'semitism', 'many', 'form'], ['time', 'gathering', 'exposed', 'number', 'old', 'festering', 'political', 'wound', 'threatened', 'weaken', 'impact', 'head', 'state', 'government', 'russian', 'president', 'vladimir', 'putin', 'french', 'president', 'emmanuel', 'macron',

In [None]:

def create_article_list(extracted_file, newspaper_name):
    """
    Takes in the file of extracted news and the newspaper name.
    Outputs an article (text) list for the given newspaper.
    """
    import json

    # Load the JSON file
    with open(extracted_file, "r") as json_file:
        data = json.load(json_file)

    # Extract newspaper data
    newspaper = data.get(newspaper_name, [])  # Default to an empty list if not found
    newspaper_articles = []

    # Loop through articles in the newspaper
    for article in newspaper:
        # Check if article has a valid "text" key
        if article and isinstance(article, dict) and "text" in article:
            newspaper_articles.append(article["text"])  # Use append to add the text to the list

    print(f"Extracted {len(newspaper_articles)} articles from {newspaper_name}.")
    return newspaper_articles



In [None]:
def preprocess_newspaper(article_list):
    """
    Takes in a list of article texts and returns one flat list of preprocessed
    sentences, where each sentence is a list of words.
    """

    if not article_list:  # Handle empty or None input
        print("No articles provided for preprocessing.")
        return []

    preprocessed_article_list = []

    for i, article in enumerate(article_list):
        # add this article's preprocessed sentences to the newspaper's sentence list
        preprocessed_article_list.extend(preprocess_article(article))
        print(f"article {i} preprocessed")  # progress indicator

    return preprocessed_article_list

In [None]:
# Let's try it with CNN
article_list = create_article_list("data/news-data-extracted.json", "cnn.com")
print(article_list)
preprocessed = preprocess_newspaper(article_list)
print(preprocessed)

In [None]:
# Helper functions for corpus statistics

def no_of_articles(article_list):
    return len(article_list)

def corpus_size_before(article_list):
    # number of whitespace-separated tokens across all raw articles
    return sum(len(article.split()) for article in article_list)


def corpus_size_after(preprocessed_article_list):
    # number of words left after preprocessing
    return sum(len(sentence) for sentence in preprocessed_article_list)


def no_of_unique_words(preprocessed_article_list):
    # a set gives constant-time membership checks, unlike a list
    unique_words = set()
    for sentence in preprocessed_article_list:
        unique_words.update(sentence)

    return len(unique_words)

def no_of_sentences(preprocessed_article_list):
    return len(preprocessed_article_list)

In [None]:
# Occurrence counter: how many times target_word appears in the preprocessed corpus
def occurance(target_word, preprocessed_article_list):
    counter = 0

    for sentence in preprocessed_article_list:
        for word in sentence:
            if word == target_word:
                counter += 1

    return counter
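
Since the same corpus is scanned once per target word, one possible alternative is to count all target words in a single pass. The sketch below uses `collections.Counter` from the standard library. The function name `count_targets` and the toy sentences are assumptions made for illustration and do not appear elsewhere in the notebook.

```python
from collections import Counter


def count_targets(target_words, preprocessed_article_list):
    """Count every target word in a single pass over the corpus."""
    targets = set(target_words)
    counts = Counter(
        word
        for sentence in preprocessed_article_list
        for word in sentence
        if word in targets
    )
    # Counter returns 0 for missing keys, so absent targets are reported as 0.
    return {word: counts[word] for word in target_words}


toy_sentences = [["israel", "hamas", "talk"], ["hamas", "ceasefire"]]
target_counts = count_targets(["israel", "hamas", "idf"], toy_sentences)
print(target_counts)
```

On the toy sentences this prints `{'israel': 1, 'hamas': 2, 'idf': 0}`.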

### Training function

In [None]:
def train(newspaper_name, sentence_list):
    from gensim.models import Word2Vec
    import os

    # Ensure the output directory exists
    os.makedirs(newspaper_name, exist_ok=True)

    model_path = os.path.join(newspaper_name, f"{newspaper_name}_w2v.model")
    vectors_txt_path = os.path.join(newspaper_name, f"{newspaper_name}_w2v_vectors.txt")
    vectors_bin_path = os.path.join(newspaper_name, f"{newspaper_name}_w2v_vectors.bin")

    # Initialize the model without a corpus. Passing sentences to the constructor
    # would already train it once, and the train() call below would train it again.
    model = Word2Vec(vector_size=300, window=5, min_count=10, sg=1, workers=4, negative=20)
    model.build_vocab(sentence_list)

    # Train and save the full model
    model.train(sentence_list, total_examples=model.corpus_count, epochs=20)
    model.save(model_path)

    # Save just the word vectors in text and binary formats
    model.wv.save_word2vec_format(vectors_txt_path, binary=False)
    model.wv.save_word2vec_format(vectors_bin_path, binary=True)

    return model_path, vectors_txt_path, vectors_bin_path


### Calculate portrayal

In [None]:
def calculate_portrayal(model, palestinian_words, israeli_words, positive_portrayal_words, negative_portrayal_words): # all word arguments are lists
    # Keep only the portrayal words that are in the model's vocabulary
    vocabulary = model.wv.key_to_index
    positives = [word for word in positive_portrayal_words if word in vocabulary]
    negatives = [word for word in negative_portrayal_words if word in vocabulary]

    def portrayal_scores(target_words):
        # score = mean similarity to positive words - mean similarity to negative words
        scores = {}
        for word in target_words:
            scores[word] = 0
            for positive in positives:
                # float() turns the NumPy scalar into a plain, JSON-serializable float
                scores[word] += float(model.wv.similarity(word, positive)) / len(positives)
            for negative in negatives:
                scores[word] -= float(model.wv.similarity(word, negative)) / len(negatives)
        return scores

    palestine_portrayal_scores = portrayal_scores(palestinian_words)
    israel_portrayal_scores = portrayal_scores(israeli_words)

    return palestine_portrayal_scores, israel_portrayal_scores
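
The score computed above is the mean cosine similarity between a target word and the positive portrayal words, minus its mean cosine similarity to the negative portrayal words, so it lies between -2 and 2. The following sketch reproduces that arithmetic on invented two-dimensional vectors using only the standard library. The names `cosine`, `toy_portrayal`, and `toy_vectors` are hypothetical, and the vectors are chosen only to make the numbers easy to check by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_portrayal(target, positive_words, negative_words, vectors):
    """Mean similarity to positive words minus mean similarity to negative words."""
    # Portrayal words without a vector are ignored, as in the notebook's function.
    pos = [w for w in positive_words if w in vectors]
    neg = [w for w in negative_words if w in vectors]
    pos_mean = sum(cosine(vectors[target], vectors[w]) for w in pos) / len(pos)
    neg_mean = sum(cosine(vectors[target], vectors[w]) for w in neg) / len(neg)
    return pos_mean - neg_mean


# Invented 2-D vectors, not taken from any trained model.
toy_vectors = {
    "target": [1.0, 0.0],
    "good": [1.0, 0.0],  # same direction as "target": similarity 1
    "bad": [0.0, 1.0],   # orthogonal to "target": similarity 0
}

score = toy_portrayal("target", ["good", "missing"], ["bad"], toy_vectors)
print(score)
```

Here the word "missing" has no vector and is skipped, so the score is 1.0 - 0.0 = 1.0.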

Open question: should "gaza" be included as a target word? If so, count its occurrences and add it to the target word list used for the portrayal score.

### Save newspaper dictionary

In [None]:
import json
import os

def save_newspaper_dict(newspaper_dict):
    # File path for the JSON file
    file_path = "preprocessed_newspapers_dict.json"

    # Load existing data, or start from an empty dictionary on the first run
    data = {}
    if os.path.exists(file_path):
        with open(file_path, "r") as json_file:
            data = json.load(json_file)

    # Add only the newspapers that are not already saved
    for key, value in newspaper_dict.items():
        if key not in data:
            data[key] = value

    # Save updated data back to the file
    with open(file_path, "w") as json_file:
        json.dump(data, json_file, indent=4)

In [None]:
import json
import os

def load_preprocessed_newspapers(json_file):
    """
    Load preprocessed newspapers from a JSON file.
    """
    if os.path.exists(json_file):
        try:
            with open(json_file, 'r') as file:
                data = json.load(file)
                if isinstance(data, dict):
                    print(f"Successfully loaded preprocessed newspapers from {json_file}.")
                    return data
                else:
                    print("Error: JSON data is not a dictionary. Returning an empty dictionary.")
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON file {json_file}: {e}")
    else:
        print(f"File {json_file} does not exist. Starting with an empty dictionary.")

    return {}
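
Taken together, the save and load helpers behave like a small on-disk cache: newspapers already in the JSON file are kept as they are, and only new ones are added. The sketch below shows that merge-without-overwrite behaviour as a round trip in a temporary directory. The function `merge_into_json`, the file name `toy_dict.json`, and the newspaper keys are hypothetical names used only for this illustration.

```python
import json
import os
import tempfile


def merge_into_json(file_path, new_entries):
    """Add entries whose keys are not yet in the JSON file; keep existing ones."""
    data = {}
    if os.path.exists(file_path):
        with open(file_path, "r") as json_file:
            data = json.load(json_file)
    for key, value in new_entries.items():
        data.setdefault(key, value)  # existing keys are left untouched
    with open(file_path, "w") as json_file:
        json.dump(data, json_file, indent=4)
    return data


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dict.json")
    # First call creates the file; second call tries to overwrite paper-a.
    merge_into_json(path, {"paper-a.example": {"no_of_articles": 1}})
    merged = merge_into_json(
        path,
        {
            "paper-a.example": {"no_of_articles": 99},
            "paper-b.example": {"no_of_articles": 2},
        },
    )

print(merged)
```

The second call leaves `paper-a.example` at its first value and only adds `paper-b.example`. One consequence of this design is that a newspaper has to be removed from the file by hand before it can be reprocessed.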


In [None]:
newspaper_list = ["cnn.com", "WashingtonPost.com"]

def master(extracted_file, newspaper_list):
    """
    Process every newspaper in newspaper_list and return a dictionary keyed by
    newspaper name. Each value is a dictionary with the following entries:
    - number of articles and corpus size before preprocessing
    - corpus size, number of unique words and number of sentences after preprocessing
    - how many times each target word appears (palestine, israel, hamas, idf, netanyahu, sinwar, etc.)
    - locations of the trained Word2Vec model and its saved vectors
    - portrayal scores for the Palestinian and Israeli target words
    - list of articles (text only) and the preprocessed sentences (lists of words)
    """
    from gensim.models import Word2Vec

    preprocessed_newspapers = load_preprocessed_newspapers("preprocessed_newspapers_dict.json")


    # skip newspapers that have already been preprocessed
    for newspaper in newspaper_list:
        if f"{newspaper}" not in preprocessed_newspapers:
            preprocessed_newspapers[newspaper] = {}
            dict_newspaper = preprocessed_newspapers[newspaper]

            article_list = create_article_list(extracted_file, newspaper)
            sentence_list = preprocess_newspaper(article_list)

            dict_newspaper["no_of_articles"] = no_of_articles(article_list)
            dict_newspaper["corpus_size_before_preprocess"] = corpus_size_before(article_list)
            dict_newspaper["corpus_size"] = corpus_size_after(sentence_list)
            dict_newspaper["no_of_unique_words"] = no_of_unique_words(sentence_list)
            dict_newspaper["no_of_sentences"] = no_of_sentences(sentence_list)
            dict_newspaper["occurance_palestine"] = occurance("palestine", sentence_list)
            dict_newspaper["occurance_palestinian"] = occurance("palestinian", sentence_list)
            dict_newspaper["occurance_hamas"] = occurance("hamas", sentence_list)
            dict_newspaper["occurance_sinwar"] = occurance("sinwar", sentence_list)
            dict_newspaper["occurance_israel"] = occurance("israel", sentence_list)
            dict_newspaper["occurance_israeli"] = occurance("israeli", sentence_list)
            dict_newspaper["occurance_idf"] = occurance("idf", sentence_list)
            dict_newspaper["occurance_netanyahu"] = occurance("netanyahu", sentence_list)
            dict_newspaper["model_location"] = ""
            dict_newspaper["vectors_txt_location"] = ""
            dict_newspaper["vectors_bin_location"] = ""
            dict_newspaper["portrayal_palestine"] = {}
            dict_newspaper["portrayal_palestine_score"] = 0
            dict_newspaper["portrayal_israel"] = {}
            dict_newspaper["portrayal_israel_score"] = 0
            dict_newspaper["palestine-israel_score"] = 0
            dict_newspaper["articles"] = article_list
            dict_newspaper["preprocessed"] = sentence_list

            # actually fill out the values for model-related keys
            dict_newspaper["model_location"], dict_newspaper["vectors_txt_location"], dict_newspaper["vectors_bin_location"] = train(newspaper, sentence_list)

            # Load the trained model from the location returned by train()
            model = Word2Vec.load(dict_newspaper["model_location"])


            palestinian_words = ["palestine", "palestinian", "hamas", "sinwar"]
            israeli_words = ["israel", "israeli", "idf", "netanyahu"]

            # positive categories: general (good etc), victim, 
            positive_portrayal_words = ["positive", "good", "victim", "resilient", "justified", "defend", "innocent", "rightful", "humane"]
            negative_portrayal_words = ["negative", "bad", "aggressor", "attacker", "brutal", "illegal", "terrorist", "barbaric", "massacre", "invade"]

            dict_newspaper["portrayal_palestine"], dict_newspaper["portrayal_israel"] = calculate_portrayal(model,  palestinian_words, israeli_words, positive_portrayal_words, negative_portrayal_words)
            print(f"{newspaper}", dict_newspaper["portrayal_palestine"], dict_newspaper["portrayal_israel"])

            # average the per-word scores over the target words
            for value in dict_newspaper["portrayal_palestine"].values():
                dict_newspaper["portrayal_palestine_score"] += value / len(palestinian_words)

            for value in dict_newspaper["portrayal_israel"].values():
                dict_newspaper["portrayal_israel_score"] += value / len(israeli_words)

            dict_newspaper["palestine-israel_score"] = dict_newspaper["portrayal_palestine_score"] - dict_newspaper["portrayal_israel_score"]
            print("Palestinian portrayal score minus Israeli portrayal score: ", dict_newspaper["palestine-israel_score"])

            save_newspaper_dict(preprocessed_newspapers)
            
    return preprocessed_newspapers

In [None]:
processed_newspapers = master("data/news-data-extracted.json", newspaper_list)
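
Once several newspapers have been processed, one way to compare them is to sort by the "palestine-israel_score" key. The sketch below assumes a dictionary shaped like the one returned by `master`. The helper `rank_by_score`, the newspaper names, and the numbers are placeholders for illustration only and are not results from this project.

```python
def rank_by_score(newspapers, key="palestine-israel_score"):
    """Return (newspaper, score) pairs sorted from highest to lowest score."""
    pairs = [(name, stats[key]) for name, stats in newspapers.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers for illustration only; they are not results.
toy_results = {
    "paper-a.example": {"palestine-israel_score": -0.02},
    "paper-b.example": {"palestine-israel_score": 0.05},
}

ranking = rank_by_score(toy_results)
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

A positive score means the Palestinian target words scored higher on average than the Israeli ones for that newspaper, and a negative score means the opposite.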