# Newspaper portrayal analysis of Israel and Palestine
In this project, we train a word embedding model (a model that assigns meaningful vectors to words), specifically Word2Vec, on corpora from multiple newspapers. We create these corpora by scraping the websites of different news sources. To see how we scraped them, check out the GitHub repository for the project: https://github.com/McGill-AI-Lab/news-bias-model

#### Note:
In the following Jupyter notebook, we had to delete some of the cell outputs, either because they were too large or because of copyright constraints. All of the code in the notebook was run.

### Abstract
lorem ipsum





### Access and preprocess our data

Our data is stored in 'data/news-data-extracted.json', which is also available through the repository. The file is a dictionary whose keys are newspapers. The value for each newspaper is a list of dictionaries, one per article, with the keys url, title, authors, date, and text.
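
To make the layout concrete, here is a minimal sketch of a dictionary with the same shape. The sample entries, the "example-news.com" key, and the helper `summarize_sources` are hypothetical and only illustrate the structure; they are not taken from the real file.

```python
# Hypothetical miniature of the structure described above. The real file,
# data/news-data-extracted.json, holds many more articles per newspaper.
sample_data = {
    "cnn.com": [
        {
            "url": "https://example.com/article-1",
            "title": "First example title",
            "authors": ["Jane Doe"],
            "date": "2020-01-23 00:00:00",
            "text": "First example text.",
        },
        {
            "url": "https://example.com/article-2",
            "title": "Second example title",
            "authors": ["John Doe"],
            "date": "2020-01-24 00:00:00",
            "text": "Second example text.",
        },
    ],
    "example-news.com": [
        {
            "url": "https://example.com/article-3",
            "title": "Third example title",
            "authors": [],
            "date": "2020-01-25 00:00:00",
            "text": "Third example text.",
        },
    ],
}


def summarize_sources(data):
    """Return {newspaper: number of articles} for a dictionary shaped like the one above."""
    return {newspaper: len(articles) for newspaper, articles in data.items()}


summary = summarize_sources(sample_data)
print(summary)
```

Running the same helper on the loaded `data` dictionary should give a quick overview of how many articles each newspaper contributes.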

In [1]:
import json

# Open and read the JSON file
with open('data/news-data-extracted.json', 'r') as file:
    data = json.load(file)

# Inspect the first CNN article
first_article_data = data["cnn.com"][0]  # data["cnn.com"] is a list of article dictionaries; take the first one
first_article = first_article_data["text"]  # keep only the article text, not the url, title, authors, etc.
print(first_article_data)

{'url': 'https://www.cnn.com/2020/01/23/opinions/auschwitz-anniversary-anti-semitism-fears-linger-andelman/index.html', 'title': 'World leaders in Jerusalem show battle against anti-Semitism not yet a victory (opinion)', 'authors': ['David A. Andelman'], 'date': '2020-01-23 00:00:00', 'text': 'Editors Note: David A. Andelman, Executive Director of The RedLines Project, is a contributor to CNN where his columns won the Deadline Club Award for Best Opinion Writing. Author of A Shattered Peace: Versailles 1919 and the Price We Pay Today, and the forthcoming A Red Line in the Sand: Diplomacy, Strategy and a History of Wars That Almost Happened, he was formerly a foreign correspondent for The New York Times and CBS News in Europe and Asia. Follow him on Twitter @DavidAndelman. The views expressed in this commentary are his own. View more opinion on CNN.CNN The world converged on Jerusalem this week to observe the 75th anniversary of the liberation of the Auschwitz death camp  and with a col

In [2]:
first_article[0] # since we get a string, this should return the first letter in the article

'E'

We need to preprocess our data. First, we split each article into sentences with the NLTK library, because Word2Vec is trained on sentences, each represented as a list of words. NLTK uses a trained model to decide where sentences end, so some splits will be inaccurate, but we can ignore these errors. Next, we lowercase every word. We then remove extremely high-frequency words: because they co-occur with a large portion of the corpus, they contribute little to the embeddings of the words around them. These are called "stop words"; examples include "I", "you", "of", and "there". Then we lemmatize, i.e. try to convert each word to its root form (running -> run), so that the information about the word "run" is concentrated in one token instead of being spread across its various forms ("runs", "running", "ran"). For more information on lemmatizers: https://www.geeksforgeeks.org/python-lemmatization-with-nltk/. Finally, we remove punctuation and combine all of these steps into a single "preprocess_article" function.

In [3]:
from gensim.utils import tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Tokenizes an article into sentences, each of which is tokenized into words
def tokenize_article(article):
    tokenized_article = []
    sentences = sent_tokenize(article, language="english")  # divide article into sentences

    for sentence in sentences:
        # gensim's tokenize returns a generator, so materialize it as a list of words
        tokenized_sentence = list(tokenize(sentence))
        tokenized_article.append(tokenized_sentence)
    return tokenized_article

# makes each word lowercase
def lowercase(tokenized_article):
    lowercase_article = []

    for sentence in tokenized_article:
        current_sentence = []
        for word in sentence:
            current_sentence.append(word.lower())
        lowercase_article.append(current_sentence)

    return lowercase_article

stop_words = set(stopwords.words("english"))

def remove_stopwords(tokenized_article):
    # Iterate over the index and content of each sentence
    for i in range(len(tokenized_article)):
        # Create a new list for the filtered sentence
        filtered_sentence = []
        for word in tokenized_article[i]:
            if word not in stop_words:
                filtered_sentence.append(word)
        # Replace the original sentence with the filtered sentence
        tokenized_article[i] = filtered_sentence
    return tokenized_article

def lemmatization(tokenized_article):
    lemmatizer = WordNetLemmatizer()

    lemmatized_article = []

    for sentence in tokenized_article:
        current_sentence = []
        for word in sentence:
            current_sentence.append(lemmatizer.lemmatize(word))
        lemmatized_article.append(current_sentence)

    return lemmatized_article


def remove_punctuation(tokenized_article):
    punc_removed_article = []

    for sentence in tokenized_article:
        punc_removed_sentence = []
        for word in sentence:
            # Strip every non-word character from the token
            cleaned_word = re.sub(r"[^\w]+", "", word)
            if cleaned_word:  # Keep non-empty words only
                punc_removed_sentence.append(cleaned_word)

        punc_removed_article.append(punc_removed_sentence)

    return punc_removed_article

# Runs every preprocessing step in order on a single article
def preprocess_article(article):
    t_article = tokenize_article(article)
    l_article = lowercase(t_article)
    r_article = remove_stopwords(l_article)
    la_article = lemmatization(r_article)
    re_article = remove_punctuation(la_article)
    return re_article

In [6]:
print(stop_words)

{'up', 'my', 'am', 'how', 'through', 'yourself', 'so', 'you', 'yourselves', 'by', 'is', 'had', 'itself', "shouldn't", "that'll", 'has', 'about', 'each', 'him', "she's", 'been', 'didn', 'its', 'our', 'that', 't', 'will', 'same', 'then', 'isn', 'shan', 'any', 'which', 'out', 'down', 'of', "you're", 'who', 'mustn', 'me', 'at', 'the', "won't", 'weren', 'herself', 'haven', 'if', 'hasn', 'her', 'doesn', 'his', 'as', 'again', 'and', 'an', 'because', 'are', 'their', "you've", "should've", 'ain', 'won', "don't", 've', 'll', 'to', 'do', 'once', "it's", 'ourselves', 'just', 'whom', 're', 'further', 'only', "isn't", 'wouldn', 'for', 'needn', 'he', 'have', 'below', "doesn't", 'against', 'aren', 'here', 'they', 'both', 'a', 'm', 'or', 'mightn', "mightn't", 'more', "hadn't", 'after', 'above', "wouldn't", 'she', 'theirs', 'with', 'over', 'than', 'own', 'such', 'we', 'nor', "shan't", "aren't", 'himself', 'too', 'while', 'these', 'now', 'them', 'into', 'those', 'was', 'all', 'couldn', "wasn't", 'until',

In [7]:
preprocessed = preprocess_article(first_article)
print(preprocessed)  # this is what a preprocessed article looks like

[['editor', 'note', 'david', 'andelman', 'executive', 'director', 'redlines', 'project', 'contributor', 'cnn', 'column', 'deadline', 'club', 'award', 'best', 'opinion', 'writing'], ['author', 'shattered', 'peace', 'versailles', 'price', 'pay', 'today', 'forthcoming', 'red', 'line', 'sand', 'diplomacy', 'strategy', 'history', 'war', 'almost', 'happened', 'formerly', 'foreign', 'correspondent', 'new', 'york', 'time', 'cbs', 'news', 'europe', 'asia'], ['follow', 'twitter', 'davidandelman'], ['view', 'expressed', 'commentary'], ['view', 'opinion', 'cnn', 'cnn', 'world', 'converged', 'jerusalem', 'week', 'observe', 'th', 'anniversary', 'liberation', 'auschwitz', 'death', 'camp', 'collective', 'determination', 'battle', 'anti', 'semitism', 'many', 'form'], ['time', 'gathering', 'exposed', 'number', 'old', 'festering', 'political', 'wound', 'threatened', 'weaken', 'impact', 'head', 'state', 'government', 'russian', 'president', 'vladimir', 'putin', 'french', 'president', 'emmanuel', 'macron',

In [None]:
import json

def create_article_list(extracted_file, newspaper_name):
    with open(extracted_file, "r") as json_file:
        data = json.load(json_file)

    newspaper = data[newspaper_name]  # a list of article dictionaries with keys url, title, authors, date, text
    newspaper_articles = []

    for article in newspaper:
        newspaper_articles.append(article["text"])

    # Return the list of article texts so callers can use it
    return newspaper_articles

In [None]:
def preprocess_newspaper(newspaper, newspaper_name, newspaper_dict):
    newspaper_dict[newspaper_name] = []
    for i, article_data in enumerate(newspaper):
        text = article_data['text']
        # extend the newspaper's sentence list with this article's preprocessed sentences
        newspaper_dict[newspaper_name].extend(preprocess_article(text))
        print(f"{newspaper_name}: article {i} preprocessed")
    return newspaper_dict

In [None]:
# Let's try with CNN
newspaper = data["cnn.com"]
newspaper_dict = {}
newspaper_dict = preprocess_newspaper(newspaper, "cnn.com", newspaper_dict)
print(newspaper_dict)

### Save and load preprocessed newspapers

In [None]:
import json
import os

def save_newspaper_dict(newspaper_dict):
    # File path for the JSON file
    file_path = "preprocessed_newspaper_articles.json"

    # Step 1: Load existing data if the file exists, otherwise start from newspaper_dict
    if os.path.exists(file_path):
        with open(file_path, "r") as json_file:
            data = json.load(json_file)  # Load existing data
        # Step 2: Add newspapers that are not in the file yet
        for key, value in newspaper_dict.items():
            if key not in data:
                data[key] = value

    else:
        data = newspaper_dict

    # Step 3: Write the updated data back to the file
    with open(file_path, "w") as json_file:
        json.dump(data, json_file, indent=4)


In [None]:
save_newspaper_dict(newspaper_dict)

In [None]:
import json
with open("preprocessed_newspaper_articles.json", "r") as json_file:
    loaded_newspaper_dict = json.load(json_file)
    print(loaded_newspaper_dict)

Find the corpus size for CNN

In [None]:
def corpus_size(newspaper_dict, newspaper):
    corpus = newspaper_dict[newspaper]

    # Count every word in every sentence of the corpus
    total_words = 0
    for sentence in corpus:
        total_words += len(sentence)

    return total_words

cs = corpus_size(loaded_newspaper_dict, "cnn.com")
print(cs)

### Train word2vec on cnn.com

In [None]:
from gensim.models import Word2Vec

# Prepare sentences for Word2Vec
sentences = loaded_newspaper_dict["cnn.com"]  # the newspaper's corpus: a list of sentences, each a list of words
print(sentences)

# Initialize the model with parameters. No sentences are passed here,
# because passing them to the constructor would already train the model.
model = Word2Vec(vector_size=300, window=5, min_count=10, sg=1, workers=4, negative=20)

# Build the vocabulary, then train the model
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=20)

In [None]:
# Save the full model so it can be reloaded later with Word2Vec.load
model.save("cnn_w2v.model")
# Save just the word vectors in a text format
model.wv.save_word2vec_format("cnn_w2v_vectors.txt", binary=False)

# To save in binary format:
model.wv.save_word2vec_format("cnn_w2v_vectors.bin", binary=True)


### Load the model

In [None]:
from gensim.models import KeyedVectors
from gensim.models import Word2Vec

# Load the model from a file
model = Word2Vec.load("cnn_w2v.model")

# Now you can use the model
print(model.wv.most_similar("israeli"))  # words most similar to "israeli"

# Load the word vectors (the file name must match the one used when saving)
word_vectors = KeyedVectors.load_word2vec_format("cnn_w2v_vectors.txt", binary=False)

In [None]:
# Get the vector for a word
vector = model.wv["idf"]

# Find most similar words
similar_words = model.wv.most_similar("bad")
print(similar_words)

# Calculate similarity
similarity = model.wv.similarity("palestine", "victim")
print(f"Similarity between 'palestine' and 'victim': {similarity}")

# Calculate similarity
similarity = model.wv.similarity("israel", "victim")
print(f"Similarity between 'israel' and 'victim': {similarity}")
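
The `similarity` calls above return the cosine similarity between two word vectors. As a rough illustration of what that number means, here is a small pure-Python sketch. The helper `cosine_similarity` and the two-dimensional `toy_vectors` are made up for this example and are unrelated to the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional vectors; real Word2Vec vectors here have 300 dimensions.
toy_vectors = {
    "same_direction_short": [1.0, 0.0],
    "same_direction_long": [2.0, 0.0],
    "orthogonal": [0.0, 1.0],
}

same = cosine_similarity(toy_vectors["same_direction_short"], toy_vectors["same_direction_long"])
unrelated = cosine_similarity(toy_vectors["same_direction_short"], toy_vectors["orthogonal"])
print(same, unrelated)
```

Vectors pointing in the same direction score 1 regardless of their length, while orthogonal vectors score 0, so the measure reflects direction in the embedding space and ignores magnitude.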

### Potential Portrayal Words
Positive: positive, good, victim, humane, heroic, brave, noble, resilient, justified, courageous, victorious, liberating, righteous, defenders

Negative: negative, bad, aggressor, attacker, aggressive, brutal, oppressive, merciless, barbaric, ruthless, massacre, invaders, terrorist, terroristic, dictatorial, destructive, illegal, corrupt, authoritarian, regressive, settler

Find word frequency for these words
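
One possible way to do this is sketched below, assuming the corpus is a list of sentences where each sentence is a list of words, as in the preprocessed data. The helper `word_frequencies`, the toy corpus, and the toy threshold are hypothetical. The threshold matters because the Word2Vec training cell above sets `min_count=10`, so rarer words get no vector.

```python
from collections import Counter


def word_frequencies(corpus, words):
    """Count how often each word in `words` occurs in a corpus (a list of word lists)."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return {word: counts[word] for word in words}


# Hypothetical toy corpus in the same shape as a preprocessed newspaper.
toy_corpus = [
    ["israel", "victim", "attack"],
    ["palestine", "victim"],
    ["brutal", "attack"],
]
portrayal_words = ["victim", "brutal", "heroic"]

frequencies = word_frequencies(toy_corpus, portrayal_words)

# Toy threshold for this small example; the training cell uses min_count=10.
min_count = 2
too_rare = [word for word, count in frequencies.items() if count < min_count]

print(frequencies)
print(too_rare)
```

Since the corpus is lowercased and lemmatized, the words to look up should be in the same form. For instance, a plural such as "defenders" would likely appear in the corpus as "defender".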


# MASTER FUNCTION

In [None]:
import json
with open("data/news-data-extracted.json", "r") as json_file:
    data = json.load(json_file)

newspaper = data["cnn.com"]  # newspaper is a list of article dictionaries with keys url, date, authors, text etc.
newspaper_articles = []

for article in newspaper:
    newspaper_articles.append(article["text"])

print(newspaper_articles)

In [None]:
newspaper

In [None]:
# take a scraped newspaper in, which will be a key to the dictionary we download: example: data['cnn.com']
# get corpus size before preprocessing
# preprocess the newspaper
# get corpus size after preprocessing
# find word frequency for israel, palestine, idf, hamas, gaza, west bank
# save it to the preprocessed_newspaper_articles dictionary
# train word2vec on it
# measure portrayal for both sides
# all this metadata & results in a dict, and preprocessed corpus to preprocessed_newspaper_articles
def master(extracted_file, preprocessed_file, newspaper_name):

    import json
    with open(extracted_file, "r") as json_file:
        data = json.load(json_file)

    newspaper = data[newspaper_name]  # newspaper is a list of dictionaries, each representing an article with keys url, date, authors, text etc.
    newspaper_articles = [article["text"] for article in newspaper]  # the text of every article in the newspaper

    # The remaining steps listed above are not implemented yet
    pass
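
The "measure portrayal for both sides" step is not implemented in this notebook. The sketch below shows one way it could work, under the assumption that portrayal is scored as the mean cosine similarity to the positive words minus the mean cosine similarity to the negative words. That scoring rule, the helper `portrayal_score`, and the toy vectors are all hypothetical and are not the project's method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def portrayal_score(vectors, target, positive_words, negative_words):
    """Mean similarity to positive words minus mean similarity to negative words.

    `vectors` maps words to vectors. Words missing from it are skipped,
    which mimics words dropped from the vocabulary for being too rare.
    """
    def mean_similarity(words):
        present = [word for word in words if word in vectors]
        if not present:
            return 0.0
        total = sum(cosine_similarity(vectors[target], vectors[word]) for word in present)
        return total / len(present)

    return mean_similarity(positive_words) - mean_similarity(negative_words)


# Hypothetical 2-dimensional vectors standing in for trained embeddings.
toy_vectors = {
    "side_a": [1.0, 0.0],
    "side_b": [0.0, 1.0],
    "good": [1.0, 0.0],
    "bad": [0.0, 1.0],
}

# "heroic" has no vector in toy_vectors, so it is skipped.
score_a = portrayal_score(toy_vectors, "side_a", ["good", "heroic"], ["bad"])
score_b = portrayal_score(toy_vectors, "side_b", ["good", "heroic"], ["bad"])
print(score_a, score_b)
```

With a trained model, `vectors` could be replaced by a lookup into the model's word vectors. A positive score would suggest the target sits closer to the positive words, and a negative score the opposite.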