# Mitigating Stereotypical Bias and Political Polarization  
A tutorial notebook that walks through the execution of each component of our code.

## 1. Setting up the environment
Here, we clone the GitHub repository and download the files our code needs to run.

In [None]:
!git clone https://github.com/Kriti-K/NLP_W2022.git
%cd NLP_W2022

Cloning into 'NLP_W2022'...
remote: Enumerating objects: 10548, done.
remote: Counting objects: 100% (10548/10548), done.
remote: Compressing objects: 100% (6838/6838), done.
remote: Total 10548 (delta 3308), reused 10434 (delta 3270), pack-reused 0
Receiving objects: 100% (10548/10548), 50.79 MiB | 15.50 MiB/s, done.
Resolving deltas: 100% (3308/3308), done.
/content/NLP_W2022


In [None]:
!pip install -r requirements.txt
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1pVViO4phYWIJ2UgC_xaZrU4Y_1fVcNDF' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1pVViO4phYWIJ2UgC_xaZrU4Y_1fVcNDF" -O /content/NLP_W2022/Models/IBC_BERT/variables/variables.data-00000-of-00001 && rm -rf /tmp/cookies.txt
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1A4cSYIi5fak-dMmP5uNaYeGTAbbk-rg9' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1A4cSYIi5fak-dMmP5uNaYeGTAbbk-rg9" -O /content/NLP_W2022/Code/gcp_creds.json && rm -rf /tmp/cookies.txt
!python -m nltk.downloader all

Collecting spacy==3.2.3
  Downloading spacy-3.2.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting google.cloud
  Downloading google_cloud-0.34.0-py2.py3-none-any.whl (1.8 kB)
Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Collecting tensorflow_text
  Downloading tensorflow_text-2.8.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.9 MB)
Collecting gensim==4.1.2
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting scipy==1.7.3
  Downloading scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
Collecting spacy-loggers<2.
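
A failed download can leave a required file missing or empty, which only surfaces later when the model or the credentials are loaded. The sketch below is one way to check for that before moving on. The helper name `find_missing_or_empty` is hypothetical and not part of the repository; the two paths are the download targets used in the commands above.

```python
from pathlib import Path

# Download targets taken from the wget commands in the setup cell
REQUIRED_FILES = [
    "/content/NLP_W2022/Models/IBC_BERT/variables/variables.data-00000-of-00001",
    "/content/NLP_W2022/Code/gcp_creds.json",
]

def find_missing_or_empty(paths):
    """Return the paths that are not regular files or are zero bytes long."""
    problems = []
    for p in paths:
        path = Path(p)
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(p)
    return problems

# An empty list means every required file is present and non-empty
print(find_missing_or_empty(REQUIRED_FILES))
```

This only checks presence and size; it does not validate the contents of either file.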

## 2. Political Ideology Detection

The first step of our project is to detect the political ideology of the given articles.

In [None]:
import re

import numpy as np
import tensorflow as tf
import tensorflow_text  # imported for its side effect: registers the text ops used by the saved model
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

# Load the BERT model that we trained on the Ideological Books Corpus (IBC) dataset
model = tf.keras.models.load_model(r'/content/NLP_W2022/Models/IBC_BERT')

def preProcessText(text):
    '''
    Takes text (string) as input and performs the following steps:
    1. Converts the string to lowercase
    2. Removes punctuation
    3. Removes the stop words
    '''
    text = text.lower()
    # str.replace matches its pattern literally, so the regex needs re.sub.
    # This also removes '@', which is neither a word character nor whitespace.
    text = re.sub(r'[^\w\s]+', '', text)
    text = ' '.join(word for word in text.split() if word not in stop)
    return text

def pred(t):
    '''
    A simple prediction function that pre-processes the input text and
    detects whether it is biased towards liberals or conservatives,
    or whether it is neutral.
    '''
    tx = preProcessText(t)
    out = np.argmax(model.predict([tx]))  # index of the highest-scoring class
    if out == 1:
        return 'Neutral'
    elif out == 2:
        return 'Liberal'
    elif out == 0:
        return 'Conservative'

In [None]:
preProcessText('The uncontrolled profit motive is destroying health and increasing medical costs dramatically as it poisons its customers with adulterated and unhealthy foods.')

'uncontrolled profit motive destroying health increasing medical costs dramatically poisons customers adulterated unhealthy foods'

In [None]:
pred('The uncontrolled profit motive is destroying health and increasing medical costs dramatically as it poisons its customers with adulterated and unhealthy foods.')

'Conservative'
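
The mapping from model output to label in `pred` can be tried out without loading the model. The sketch below takes the index-to-label assignment from the `if`/`elif` chain in `pred`; the helper name `label_from_scores` and the score vectors are hypothetical and are not real model output.

```python
# Index-to-label mapping as used in pred
LABELS = {0: "Conservative", 1: "Neutral", 2: "Liberal"}

def label_from_scores(scores):
    """Return the label of the highest score (a pure-Python argmax)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best]

# Hypothetical score vectors, for illustration only
print(label_from_scores([0.7, 0.2, 0.1]))  # Conservative
print(label_from_scores([0.1, 0.3, 0.6]))  # Liberal
```

Keeping the mapping in a dictionary puts the class order in one place, which is easier to audit than a chain of comparisons.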

## 3. Getting the entities and their sentiments.
`StereosetEntities.pkl` is a dataset that we synthesized from the Social Bias Frames and StereoSet datasets. It contains a list of racial and religious entities.
Using this dataset, we keep only the detected entities that closely match a racial or religious entry, where closeness is measured by the Levenshtein distance.
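
To make the matching rule concrete, here is a minimal, self-contained sketch of Levenshtein-based filtering. It uses a hand-written edit-distance function rather than the `Levenshtein` package, and the `reference` list and the `keep_close` helper are hypothetical stand-ins for the real entity dataset and filter.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def keep_close(entities, reference, max_distance=1):
    """Keep entities within max_distance edits of any reference entry."""
    return [e for e in entities
            if min(levenshtein(e, r) for r in reference) <= max_distance]

# Hypothetical reference list standing in for the real entity dataset
reference = ["muslim", "christian", "hispanic"]
print(keep_close(["muslims", "customers", "christian"], reference))
# ['muslims', 'christian']
```

A threshold of one edit tolerates plurals and small typos. The comparison is case-sensitive, so differently capitalized forms count as edits unless the strings are normalized first.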

In [None]:
import pickle
from google.cloud import language_v1
import os
import numpy as np
from Levenshtein import distance as lsd
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/content/NLP_W2022/Code/gcp_creds.json"
client = language_v1.LanguageServiceClient()
language = "en"
encoding_type = "UTF8"

with open(r'/content/NLP_W2022/Entities/StereosetEntities.pkl','rb') as file_handle:
    ss_ents = pickle.load(file_handle) # loading our own entity dataset
    
def filter_entities(entities):
    '''
    Takes a list of all the detected entities and keeps only the ones that are
    close to an entry in our entity dataset (ss_ents).
    Closeness is measured with the Levenshtein distance between the words:
    an entity is kept if it is within one edit of any dataset entry.
    (Named filter_entities to avoid shadowing the built-in filter.)
    '''
    return_ents = []
    for entity in entities:
        distances = [lsd(entity, word) for word in ss_ents]
        if min(distances) < 2:
            return_ents.append(entity)
    return return_ents

def get_entity_sentiment(text):
    '''
    Uses the Google Cloud Natural Language API to extract all the entities (using the WikiData Knowledge Graph),
    performs sentiment analysis with respect to each entity, and then normalizes the overall sentiment
    by the number of entities that pass the filter.
    '''
    ents = []
    sents = []
    document = {"content": text, "type": "PLAIN_TEXT", "language": language}
    response = client.analyze_entity_sentiment(document, encoding_type)
    for entity in response.entities:
        sentiment = entity.sentiment
        ents.append(entity.name)
        sents.append(sentiment.score)
    fents = filter_entities(ents)
    if not fents:
        return 'It does not speak about race or religion'

    # Sum the sentiment scores of all detected entities and divide by the number of filtered entities
    avg_sent = sum(sents) / len(fents)
    if avg_sent < 0:
        return 'It speaks negatively about {}'.format(fents)
    elif avg_sent > 0:
        return 'It speaks positively about {}'.format(fents)
    else:
        return 'It is mostly neutral about {}'.format(fents)

In [None]:
get_entity_sentiment('uncontrolled profit motive destroying health increasing medical costs dramatically poisons customers adulterated unhealthy foods.')

'It does not speak about race or religion'
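
The normalization step in `get_entity_sentiment` can be explored without calling the API. The sketch below mirrors that arithmetic on hard-coded inputs; the helper name `summarize`, the scores, and the entity names are hypothetical and do not come from a real API response.

```python
def summarize(scores, filtered_entities):
    """Mirror the averaging step: sum of all scores / number of filtered entities."""
    if not filtered_entities:
        return "no race or religion entities"
    avg = sum(scores) / len(filtered_entities)
    if avg < 0:
        return "negative"
    if avg > 0:
        return "positive"
    return "neutral"

# Hypothetical per-entity sentiment scores, for illustration only
print(summarize([-0.6, 0.1, -0.2], ["muslims"]))  # negative
print(summarize([0.5, -0.5], ["christians"]))     # neutral
print(summarize([0.3], []))                       # no race or religion entities
```

Because the numerator sums the scores of every detected entity and the divisor is always positive, the sign of the verdict follows the overall sentiment of the text, while the number of filtered entities only scales the magnitude.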