# Dictionaries
Authors: Petro Tolochko & Fabienne Lind
Date: November 2024

## Preparation
### Required Packages
First, install and import the required packages for text analysis.

In [6]:
#Make sure to install the pandas library if you haven't already, using:
!pip install pandas numpy krippendorff

import pandas as pd
import re # regular expressions
import numpy as np
import krippendorff
import requests
from io import StringIO


## Data

For this task, we will work with selected headlines from news articles about migration. The data set is a subset of the [REMINDER media corpus](https://doi.org/10.11587/IEGQ1B).

Let's load the data first and take a look. Each row represents one news article.

For our exercise, we work again with the English headlines (published in UK newspapers). 
Now, we load the data directly.


In [7]:
# Read the CSV file from the URL
# articles_en = pd.read_csv("https://raw.githubusercontent.com/fabiennelind/text-as-data-in-R/main/data/articles_en.csv")

# Fetch the CSV content
url = "https://raw.githubusercontent.com/fabiennelind/text-as-data-in-R/main/data/articles_en.csv"
response = requests.get(url, timeout=30)  # SSL verification stays enabled (the default)
response.raise_for_status()  # Stop here if the download failed

# Load into a pandas DataFrame
articles_en = pd.read_csv(StringIO(response.text))


# Check corpus size
corpus_size = len(articles_en)
print(f'Corpus size: {corpus_size}')

# Display the dataset
print(articles_en.head())

# Display column names
print(articles_en.columns)



Corpus size: 500
   Unnamed: 0  id country publication_date           source source_type  \
0           1   1      UK       2013-02-09     Daily Mirror       Print   
1           4   4      UK       2012-03-16  telegraph.co.uk      Online   
2           5   5      UK       2012-08-27  telegraph.co.uk      Online   
3           8   8      UK       2016-12-13     mirror.co.uk      Online   
4          11  11      UK       2016-03-03     The Guardian       Print   

                                            headline  \
0                 Asylum girl 'fed up' in UK;\nCOURT   
1  Archbishop of Canterbury, Dr Rowan Williams: C...   
2  France's 'scandalous' expulsion of Roma camps ...   
3  Labour's stance on EU immigration is not susta...   
4  'It was petrifying': lorry driver attacked nea...   

                                         headline_mt  m_fr_eco  m_fr_lab  \
0                 Asylum girl 'fed up' in UK;\nCOURT         0         0   
1  Archbishop of Canterbury, Dr Rowan Willi

## Automated Classification with a Dictionary

For this tutorial, we would like to identify all articles that mention political actors in their headlines. The salience of 'political actors' is the concept that we want to measure with an automated text analysis method, a dictionary. As a first step, we define the concept more closely.

### Concept Definition

**Political actors** are defined here as political parties represented in the House of Commons between 2000 and 2017, which is the period in which the articles in our sample were published. In addition to these parties, we define UK politicians with a leading role as political actors. To keep the task manageable for this exercise, we focus only on actors that were highly relevant between 2000 and 2017.

We intend to measure the salience of political actors as a simple binary variable:

- 1 = At least one political actor is mentioned.
- 0 = No political actor is mentioned.

### Dictionary creation

A dictionary is a set of keywords or phrases that represent the concept of interest. 

We now start to collect relevant keywords for the dictionary. We start with a list of keywords that we consider most relevant. An example of a relevant keyword is "Boris Johnson".
For clarity, we work here with two keyword sets: we collect the keywords related to politicians in one list (here named `politicians`), and the keywords related to political parties in another list (here named `parties`).

The keywords are written as regular expressions. A ‘regular expression’ is a pattern that describes a string. To test regular expressions quickly, visit https://regex101.com/
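
As a quick illustration of how such a pattern behaves, the sketch below tests the party keyword `tor(y|ies)` against three made-up headlines (the example strings are assumptions and are not taken from the corpus). The `\b` word boundaries are the same ones used in the counting function later in this tutorial.

```python
import re

keyword = "tor(y|ies)"         # matches "tory" as well as "tories"
pattern = rf"\b{keyword}\b"    # \b marks a word boundary

examples = [
    "tory rebels challenge the home secretary",  # made-up headline
    "tories split over migration target",        # made-up headline
    "a short history of migration",              # "tory" only inside another word
]

# True if the keyword occurs as a separate word in the text
matches = [bool(re.search(pattern, text)) for text in examples]
print(matches)
```

Without the word boundaries, the third example would also count as a hit, because "history" contains the letters "tory".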

In [8]:
# List of politicians
politicians = [
    "tony blair", 
    "gordon brown", 
    "david cameron", 
    "theresa may", 
    "boris johnson", 
    "prime minister"
]

In [9]:
# List of parties with a regular expression
parties = [
    "conservative party", 
    "tor(y|ies)",  
    "ukip", 
    "labour party", 
    "liberal democrats", 
    "scottish national party", 
    "green party"
]

Some questions:

Alternative ways to store the keywords?

What other keywords are relevant to measure the concept?
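
One possible answer to the first question, sketched under the assumption that a plain Python dictionary is acceptable: store each keyword set under a category name and compile one pattern per category. The names `keyword_sets` and `compiled` are hypothetical, the keyword lists are shortened, and the headline is made up.

```python
import re

# Hypothetical container: category name -> list of keywords (shortened lists)
keyword_sets = {
    "politicians": ["theresa may", "david cameron", "prime minister"],
    "parties": ["tor(y|ies)", "ukip", "labour party"],
}

# Compile one case-insensitive pattern per category;
# (?:...) groups the alternatives between the two word boundaries
compiled = {
    name: re.compile(r"\b(?:" + "|".join(words) + r")\b", flags=re.IGNORECASE)
    for name, words in keyword_sets.items()
}

headline = "David Cameron tells Tories to back migration plan"  # made-up example
hits = {name: len(pattern.findall(headline)) for name, pattern in compiled.items()}
print(hits)
```

Keeping the categories in one object makes it easier to add further keyword sets later without writing new code for each of them.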


Before we search for the keywords in the headlines, we apply some pre-processing steps to the headlines. For this exercise, we wrote all keywords in lower case, so the headlines have to be in lower case too.

In [10]:
articles_en['headline'] = articles_en['headline'].str.lower() # Convert the 'headline' column to lowercase

print(articles_en['headline'].head()) # Display the first few values of the 'headline' column

0                   asylum girl 'fed up' in uk;\ncourt
1    archbishop of canterbury, dr rowan williams: c...
2    france's 'scandalous' expulsion of roma camps ...
3    labour's stance on eu immigration is not susta...
4    'it was petrifying': lorry driver attacked nea...
Name: headline, dtype: object


We now search for the keywords in the article headlines. The `re.findall()` function finds all occurrences of a pattern in a text, and the pattern can be a regular expression. Here we count the matches in the column `headline` of the dataframe `articles_en`.

The patterns to count are the politician keywords and the party keywords.

In [11]:
# Function to count keywords in a text
def count_keywords(text, keywords):
    # Count occurrences of each keyword (case-insensitive)
    keyword_counts = [len(re.findall(rf"(?i)\b{keyword}\b", text)) for keyword in keywords]
    return sum(keyword_counts)

# Add columns for counts of politicians and parties
articles_en['politicians_count'] = articles_en['headline'].apply(lambda x: count_keywords(x, politicians))
articles_en['parties_count'] = articles_en['headline'].apply(lambda x: count_keywords(x, parties))

# Display frequency tables for politicians_count and parties_count
print(articles_en['politicians_count'].value_counts())
print(articles_en['parties_count'].value_counts())


politicians_count
0    473
1     23
2      4
Name: count, dtype: int64
parties_count
0    478
1     15
2      6
3      1
Name: count, dtype: int64
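
As an alternative to the row-wise `apply`, the same kind of count can be sketched with pandas' vectorized `Series.str.count`. This is an illustration on made-up headlines with a shortened keyword list, not a replacement for the cell above. Note the assumption that the keywords do not overlap: a single combined pattern counts non-overlapping matches only.

```python
import pandas as pd

parties = ["conservative party", "tor(y|ies)", "ukip", "labour party"]  # shortened list

# Join the keywords into one alternation between word boundaries
pattern = r"(?i)\b(?:" + "|".join(parties) + r")\b"

# Made-up headlines, only for illustration
toy = pd.DataFrame({"headline": [
    "ukip and tories clash over migration",
    "asylum girl 'fed up' in uk",
    "labour party conference: tory policy attacked",
]})

# Count all keyword matches per headline in one vectorized call
toy["parties_count"] = toy["headline"].str.count(pattern)
print(toy["parties_count"].tolist())
```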


Next, we check which keywords were found in each row and store them, per keyword group, in a single column.

In [12]:
# Function to find and list keywords in text
def check_keywords(text, keywords):
    # Find keywords that are present in the text
    found_keywords = [keyword for keyword in keywords if re.search(rf"(?i)\b{keyword}\b", text)]
    # Return the found keywords as a comma-separated string
    return ", ".join(found_keywords)

# Apply the function to find keywords in the 'headline' column
articles_en['politicians_keywords_found'] = articles_en['headline'].apply(lambda x: check_keywords(x, politicians))
articles_en['parties_keywords_found'] = articles_en['headline'].apply(lambda x: check_keywords(x, parties))

# Display frequency tables for found keywords
print(articles_en['politicians_keywords_found'].value_counts())
print(articles_en['parties_keywords_found'].value_counts())

politicians_keywords_found
                                 473
theresa may                       12
prime minister                     5
david cameron                      5
david cameron, prime minister      3
boris johnson                      2
Name: count, dtype: int64
parties_keywords_found
                478
ukip             11
tor(y|ies)       10
labour party      1
Name: count, dtype: int64


So far, we have obtained a count that represents how often the keywords were detected per text. Since we initially proposed a simple binary measurement, we now do some recoding.

We add a new column to the dataframe called `actors_d`. This column contains a 1 if at least one of the defined keywords produces a hit, and a 0 if no keyword was found.

In [13]:
# Add a new column 'actors_d' based on conditions
articles_en['actors_d'] = np.where(
    (articles_en['parties_count'] >= 1) | (articles_en['politicians_count'] >= 1), 1, 0
)

# Ensure missing values in 'actors_d' are replaced with 0
articles_en['actors_d'] = articles_en['actors_d'].fillna(0).astype(int)

According to our automated measurement, how many articles mention political actors in their headlines?

In [14]:
# Descriptive overview of the 'actors_d' column
print(articles_en['actors_d'].value_counts())

actors_d
0    453
1     47
Name: count, dtype: int64
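
To express the result as a share rather than a count, `value_counts(normalize=True)` can be used. The sketch below rebuilds the binary variable from the counts printed above (453 zeros, 47 ones) so that it runs on its own; in the notebook, the same call would be applied directly to `articles_en['actors_d']`.

```python
import pandas as pd

# Rebuild the binary variable from the printed counts (453 zeros, 47 ones)
actors_d = pd.Series([0] * 453 + [1] * 47, name="actors_d")

# Relative frequencies instead of absolute counts
share = actors_d.value_counts(normalize=True)
print(share.round(3))
```

According to the dictionary, roughly 9% of the headlines mention a political actor.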


We have now managed to get an automated measurement for the variable. **But how valid is this measurement?** Does our small set of keywords represent the concept adequately?

A common procedure in automated content analysis is to test construct validity. We ask:
How close is this automated measurement to a more trusted measurement, namely human understanding of text?
Let's put this into practice.

## Dictionary validation with a human coded baseline

To validate the dictionary, we compare the classifications of the dictionary with the classifications of human coders. 

We create the human coded baseline together. 

### Intercoder reliability test

To ensure the quality of our manual coding, we first perform an intercoder reliability test. For this tutorial, we select a random set of 10 articles. In a real study the number of observations coded by several coders should be higher. 

In [15]:
# Set the random seed for reproducibility
random_state = 57

# Sample 10 rows from the DataFrame
intercoder_set = articles_en.sample(n=10, random_state=random_state)

# Show the sampled DataFrame
print(intercoder_set)

     Unnamed: 0    id country publication_date               source  \
131         375   375      UK       2006-05-24         The Guardian   
445        1314  1314      UK       2013-07-13      telegraph.co.uk   
408        1225  1225      UK       2005-01-25         The Guardian   
439        1294  1294      UK       2016-05-30         Daily Mirror   
7            16    16      UK       2017-11-02  The Daily Telegraph   
9            23    23      UK       2001-12-01         The Guardian   
362        1098  1098      UK       2016-02-19         mirror.co.uk   
328        1008  1008      UK       2003-09-30         Daily Mirror   
253         788   788      UK       2015-03-06         The Guardian   
443        1312  1312      UK       2013-11-28  The Daily Telegraph   

    source_type                                           headline  \
131       Print      dublin urged to translate road safety message   
445      Online  keith vaz: immigration backlog 'totally unnacc...   
408     

We now add an empty column called `actors_m`, so that coders can enter the manual codes. We drop all columns that are not necessary.

In [16]:
# Add a new column 'actors_m' initialized with empty strings
intercoder_set['actors_m'] = ""

# Select specific columns (id, actors_m, headline)
intercoder_set = intercoder_set[['id', 'actors_m', 'headline']]

# Display the resulting DataFrame
print(intercoder_set)

       id actors_m                                           headline
131   375               dublin urged to translate road safety message
445  1314           keith vaz: immigration backlog 'totally unnacc...
408  1225           howard stirs up migrant storm: un and eu conde...
439  1294                                   blair: out not the answer
7      16                       stowaways leap from bus into raf base
9      23                          in brief: 2,745 lose asylum battle
362  1098           david cameron warns eu summit it's suicide to ...
328  1008           life jail for refugee: killed by dad for being...
253   788           orange lifeboats used to return asylum seekers...
443  1312                     boris: some people too stupid to get on


We then create several duplicates of the intercoder reliability set, one for each coder. We create separate files so that coders code individually and do not peek by mistake.
To each of these sets we add the coder name in a new column called `coder_name`.
For this example, we now need 2 volunteers. Who would like to code?

In [17]:
# For Coder 1
intercoder_set_coder1 = intercoder_set.copy()  # Create a copy of the DataFrame
intercoder_set_coder1['coder_name'] = "Coder1"  # Add the 'coder_name' column

# For Coder 2
intercoder_set_coder2 = intercoder_set.copy()  # Create a copy of the DataFrame
intercoder_set_coder2['coder_name'] = "Coder2"  # Add the 'coder_name' column

# Display the resulting DataFrames
print(intercoder_set_coder1.head())
print(intercoder_set_coder2.head())

       id actors_m                                           headline  \
131   375               dublin urged to translate road safety message   
445  1314           keith vaz: immigration backlog 'totally unnacc...   
408  1225           howard stirs up migrant storm: un and eu conde...   
439  1294                                   blair: out not the answer   
7      16                       stowaways leap from bus into raf base   

    coder_name  
131     Coder1  
445     Coder1  
408     Coder1  
439     Coder1  
7       Coder1  
       id actors_m                                           headline  \
131   375               dublin urged to translate road safety message   
445  1314           keith vaz: immigration backlog 'totally unnacc...   
408  1225           howard stirs up migrant storm: un and eu conde...   
439  1294                                   blair: out not the answer   
7      16                       stowaways leap from bus into raf base   

    coder_name  
131

## Write and Read Google Sheets

We then want to save the data sets in Google Sheets. Detailed instructions on connecting **Python** and **Google Sheets** can be found here:

- https://google-auth.readthedocs.io/en/master/
- https://google-auth.readthedocs.io/en/master/user-guide.html
- https://medium.com/@jb.ranchana/write-and-append-dataframes-to-google-sheets-in-python-f62479460cf0
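
For readers without Google credentials, the same write-and-read-back round trip can be sketched with local CSV files. This is an alternative under stated assumptions, not the workflow used in this tutorial: the file name `intercoder_set_coder1.csv` and the two-row toy frame are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for one coder's sheet (two headlines from the sample shown earlier)
coder_sheet = pd.DataFrame({
    "id": [375, 1294],
    "actors_m": ["", ""],
    "headline": ["dublin urged to translate road safety message",
                 "blair: out not the answer"],
    "coder_name": ["Coder1", "Coder1"],
})

# Write the sheet to a CSV file in a temporary folder (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "intercoder_set_coder1.csv")
coder_sheet.to_csv(path, index=False)

# ... the coder fills in actors_m in a spreadsheet program and saves the file ...

# Read the file back; empty cells arrive as NaN until they are coded
coded = pd.read_csv(path)
coded["actors_m"] = pd.to_numeric(coded["actors_m"], errors="coerce")
print(coded[["id", "actors_m"]])
```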

In [18]:
!pip install gspread google-auth
!pip install gspread_dataframe
!pip install google
!pip install pydrive


In [20]:
!pip install tensorflow


In [19]:
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.oauth2.service_account import Credentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# Define the scope of the API access
scopes = ['https://www.googleapis.com/auth/spreadsheets',
          'https://www.googleapis.com/auth/drive']

SERVICE_ACCOUNT_FILE = '/Users/fabiennelind/ucloud/Research/APIs/gsheets_creds.json'

# Authenticate using service account
creds = Credentials.from_service_account_file(SERVICE_ACCOUNT_FILE, scopes=scopes)

# Connect to Google Sheets
gc = gspread.authorize(creds)

gauth = GoogleAuth()
drive = GoogleDrive(gauth)

We now save the data sets for the intercoder reliability test as Google Sheets.

In [20]:
# Different methods to open a google sheet
# open a google sheet from its name
gs = gc.open('Text as data')

# use a key (which can be extracted from the spreadsheet's URL)
# gs = gc.open_by_key('15ulUYe0zu2aDw9_PQ9f_GFQ42WwTGvmPrk5hWd53Zsk')

# paste the entire spreadsheet's URL
# gs = gc.open_by_url('https://docs.google.com/spreadsheets/d/15ulUYe0zu2aDw9_PQ9f_GFQ42WwTGvmPrk5hWd53Zsk/edit?gid=0#gid=0')


In [21]:
# select a specific worksheet by name from the sheet
worksheet1 = gs.worksheet('Sheet1')
worksheet2 = gs.worksheet('Sheet2')

In [22]:
# write the dataframe for coder 1 to the first worksheet
worksheet1.clear()
set_with_dataframe(worksheet=worksheet1, dataframe=intercoder_set_coder1, include_index=False,
                   include_column_header=True, resize=True)


In [23]:
# write the dataframe for coder 2 to the second worksheet
worksheet2.clear()
set_with_dataframe(worksheet=worksheet2, dataframe=intercoder_set_coder2, include_index=False,
                   include_column_header=True, resize=True)

Ready to code? We will post links for the different files. Read the column `headline`. If the headline mentions a political actor, insert `1` in the column `actors_m`. Enter a `0` in `actors_m` if the headline does not mention a political actor.

After you have finished coding, we read all sheets back (now with manual classifications for `actors_m`).

In [24]:
# Get all values from the sheet of coder 1
intercoder_set_coder1_c = pd.DataFrame(worksheet1.get_all_values())

# Use values of the first row as column names
headers = intercoder_set_coder1_c.iloc[0].values
intercoder_set_coder1_c.columns = headers
intercoder_set_coder1_c.drop(index=0, axis=0, inplace=True)

intercoder_set_coder1_c

# convert relevant column to numeric
intercoder_set_coder1_c['actors_m'] = pd.to_numeric(intercoder_set_coder1_c['actors_m'], errors='coerce')

In [25]:
# Get all values from the sheet of coder 2
intercoder_set_coder2_c = pd.DataFrame(worksheet2.get_all_values())

# Use values of the first row as column names
headers = intercoder_set_coder2_c.iloc[0].values
intercoder_set_coder2_c.columns = headers
intercoder_set_coder2_c.drop(index=0, axis=0, inplace=True)

intercoder_set_coder2_c


# convert relevant column to numeric
intercoder_set_coder2_c['actors_m'] = pd.to_numeric(intercoder_set_coder2_c['actors_m'], errors='coerce')

To calculate the agreement between coders, we first restructure the dataframes.

In [26]:
# Merge the two dataframes on 'id' to align the codings for each coder

merged_df_actors = pd.merge(intercoder_set_coder1_c[['id', 'actors_m']],
                            intercoder_set_coder2_c[['id', 'actors_m']],
                            on='id',
                            suffixes=('_Coder1', '_Coder2')
                        )
merged_df_actors



Unnamed: 0,id,actors_m_Coder1,actors_m_Coder2
0,375,,
1,1314,,
2,1225,,
3,1294,,
4,16,,
5,23,,
6,1098,,
7,1008,,
8,788,,
9,1312,,
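
The merged table printed here still shows empty cells, which suggests that the codes had not yet been entered when the cell was run. A small check, sketched below with made-up codes, can flag such gaps before reliability is computed. The frame `toy_merged` is hypothetical; only the column names follow the merge above.

```python
import numpy as np
import pandas as pd

# Hypothetical merged codings; NaN marks a headline that a coder has not coded yet
toy_merged = pd.DataFrame({
    "id": [375, 1314, 1225],
    "actors_m_Coder1": [0, 1, np.nan],
    "actors_m_Coder2": [0, 1, 1],
})

code_columns = ["actors_m_Coder1", "actors_m_Coder2"]

# Rows where at least one coder's code is missing
incomplete = toy_merged[toy_merged[code_columns].isna().any(axis=1)]
print(incomplete["id"].tolist())
```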


In [75]:
# Create a matrix where, after transposing, rows are coders and columns are items
ratings_actors = merged_df_actors[['actors_m_Coder1', 'actors_m_Coder2']].values.T  # Transpose to match input format
ratings_actors

array([[1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1, 1]])

We calculate Krippendorff's alpha for this example.

In [None]:
# Calculate Krippendorff's alpha for nominal data
alpha_actors = krippendorff.alpha(reliability_data=ratings_actors, level_of_measurement='nominal')

print(f"Krippendorff's alpha for 'actors_m': {alpha_actors}")


Krippendorff's alpha for 'actors_m': 0.6274509803921569
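
To see where this number comes from, the sketch below recomputes alpha by hand for the ratings printed above. It assumes the simplest case only: two coders, nominal codes, and no missing values. Under these assumptions alpha equals one minus the ratio of observed to expected disagreement.

```python
import numpy as np

# Ratings as printed above: one row per coder, one column per headline
ratings = np.array([
    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
])

n_values = ratings.size          # 20 codes in total
n_ones = int(ratings.sum())      # how often the code 1 was given
n_zeros = n_values - n_ones      # how often the code 0 was given

# Observed disagreement: each headline with differing codes contributes two
# mismatching ordered pairs (coder 1 vs coder 2 and coder 2 vs coder 1)
n_disagreements = int((ratings[0] != ratings[1]).sum())
d_observed = 2 * n_disagreements / n_values

# Expected disagreement if all given codes were paired at random
d_expected = 2 * n_zeros * n_ones / (n_values * (n_values - 1))

alpha_by_hand = 1 - d_observed / d_expected
print(round(alpha_by_hand, 4))
```

The coders disagree on only one of ten headlines, yet alpha is far from 1. Because the code 0 is rare, little disagreement is expected by chance, so a single mismatch weighs heavily.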


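If alpha is large enough, we consider the quality of our manual coding sufficient. A commonly cited rule of thumb, following Krippendorff, is to aim for an alpha of at least 0.80 and to treat values between 0.667 and 0.80 as acceptable only for tentative conclusions. We can then start with the creation of a larger manual baseline to be compared with the dictionary classifications.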

## Creating a manually coded baseline

We pick 100 headlines randomly. 

In [79]:
# Set the random seed for reproducibility
random_state = 576

# Sample 100 rows from the DataFrame
manual_set = articles_en.sample(n=100, random_state=random_state)


We again add an empty column called `actors_m` for coders to enter the manual codes. This time, we also add an empty column for the coder names. We split the work: each of us gets some headlines to code. (In a real application, each of the coders would need to take part in the intercoder test.)

In [88]:
# Add a new column 'actors_m' initialized with empty strings
manual_set['actors_m'] = ""
manual_set['coder_name'] = ""

# Select specific columns (id, actors_m, headline)
manual_set = manual_set[['id', 'actors_m', 'headline', 'coder_name']]


We create a Google Sheet for the task.

In [89]:
# select a specific worksheet by name from the sheet
worksheet3 = gs.worksheet('Sheet3')

# write data to sheet
worksheet3.clear()
set_with_dataframe(worksheet=worksheet3, dataframe=manual_set, include_index=False,
include_column_header=True, resize=True)

Please open the sheet in your browser. First, enter a coder name (free to pick) in the column `coder_name` for a couple of rows. Then enter 1 (political actor mentioned in headline) or 0 (not mentioned) in the column `actors_m` for the rows with your coder name. Our goal is to finish coding all headlines.


After you finish coding, we read all sheets back (now with manual classifications for `actors_m`).

In [104]:
# Get all values from the shared manual coding sheet
manual_set_c = pd.DataFrame(worksheet3.get_all_values())

# Use values of the first row as column names
headers = manual_set_c.iloc[0].values
manual_set_c.columns = headers
manual_set_c.drop(index=0, axis=0, inplace=True)

manual_set_c


# convert relevant column to numeric
manual_set_c['actors_m'] = pd.to_numeric(manual_set_c['actors_m'], errors='coerce')
manual_set_c['id'] = pd.to_numeric(manual_set_c['id'], errors='coerce')

We now create an object that contains both the manual and the automated classifications.

In [None]:
# Select and merge
manual_set_c = manual_set_c[['id', 'actors_m']]
articles_d_m = pd.merge(manual_set_c, articles_en, on='id')
len(articles_d_m)



Unnamed: 0.1,id,actors_m,Unnamed: 0,country,publication_date,source,source_type,headline,headline_mt,m_fr_eco,m_fr_lab,m_fr_wel,m_fr_sec,politicians_count,parties_count,politicians_keywords_found,parties_keywords_found,actors_d
0,345,0,345,UK,2017-06-26,The Guardian,Print,theresa may's attacks on human rights laws are...,Theresa May's attacks on human rights laws are...,0,0,0,1,1,0,theresa may,,1
1,8,1,8,UK,2016-12-13,mirror.co.uk,Online,labour's stance on eu immigration is not susta...,Labour's stance on EU immigration is not susta...,0,1,0,0,0,0,,,0
2,800,1,800,UK,2013-07-29,Daily Mirror,Print,ad nausea;\nvoice of the voice@mirror.co.uk,Ad nausea;\nVOICE OF THE voice@mirror.co.uk,0,0,0,1,0,0,,,0
3,1430,1,1430,UK,2017-09-30,telegraph.co.uk,Online,"racists nearly killed ukip this week, but we l...","Racists nearly killed Ukip this week, but we l...",0,0,0,0,0,1,,ukip,1
4,177,1,177,UK,2014-02-11,The Daily Telegraph,Print,salmond 'not honest' about border controls;\ns...,Salmond 'not honest' about border controls;\nS...,0,1,1,0,0,0,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,981,1,981,UK,2017-10-24,The Guardian,Print,the guardian view on universities and brexit: ...,The Guardian view on universities and Brexit: ...,0,0,1,0,0,0,,,0
96,371,1,371,UK,2002-12-28,The Guardian,Print,britain 'takes more refugees than is fair': un...,Britain 'takes more refugees than is fair': UN...,0,0,0,1,0,0,,,0
97,956,1,956,UK,2007-10-24,The Guardian,Print,europe: eu moves to bring in skilled foreign w...,Europe: EU moves to bring in skilled foreign w...,0,1,0,0,0,0,,,0
98,561,1,561,UK,2000-04-20,Daily Mirror,Print,bishop: we're no racists,BISHOP: WE'RE NO RACISTS,0,0,0,0,0,0,,,0
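
The merge relies on `id` having the same type in both frames, which is why the earlier cell converts it to numeric. As a hedged sketch, the `validate` argument of `pd.merge` can additionally confirm that each `id` occurs only once on either side. The toy frames below reuse three ids and their codes from the printed table; the variable names are hypothetical.

```python
import pandas as pd

# Manual codes for two headlines and dictionary codes for three headlines
manual = pd.DataFrame({"id": [345, 8], "actors_m": [0, 1]})
automated = pd.DataFrame({"id": [8, 345, 800], "actors_d": [0, 1, 0]})

# validate="one_to_one" raises a MergeError if an id is duplicated on either side
merged = pd.merge(manual, automated, on="id", validate="one_to_one")
print(merged)
```

Only headlines that appear in both frames are kept, so the merged frame should have as many rows as the manually coded set.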


## Compare automated with manual classifications 

To compare the automated classifications (column `actors_d`) with the manual classifications (column `actors_m`), we use three metrics: recall, precision, and F1.
These metrics inform us about the quality of the dictionary. All three range from 0 to 1.
We assume that our manual classification identified all relevant articles (here: headlines that mention a political actor).


In [108]:
# Calculate True Positives, False Positives, and False Negatives
tp = ((articles_d_m['actors_m'] == 1) & (articles_d_m['actors_d'] == 1)).sum()  # True Positives
fp = ((articles_d_m['actors_m'] == 0) & (articles_d_m['actors_d'] == 1)).sum()  # False Positives
fn = ((articles_d_m['actors_m'] == 1) & (articles_d_m['actors_d'] == 0)).sum()  # False Negatives

# Precision and Recall
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")


Precision: 0.76
Recall: 0.15
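
The cell above prints precision and recall only. F1, the third metric, is the harmonic mean of the two. The sketch below computes it from the rounded values printed above, so the result is approximate; in the notebook, the unrounded `precision` and `recall` variables would be used instead.

```python
# Rounded values as printed above
precision = 0.76
recall = 0.15

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
print(f"F1: {f1:.2f}")
```

Because the harmonic mean is pulled towards the smaller value, the low recall keeps F1 low even though precision is fairly high.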


### Recall 

By inspecting recall we can say how many relevant articles are retrieved by the dictionary.
A recall of 1.0 means that our dictionary retrieved all relevant articles. 
A recall of 0.8 means that our dictionary retrieved 80% of all relevant articles. 

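To obtain recall, we divide the number of true positives by the number of all relevant articles, that is, true positives plus false negatives: `recall = tp / (tp + fn)`.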

### Precision 

By inspecting precision we can say how many retrieved articles are relevant.
A precision of 1.0 means that all articles retrieved by the dictionary are relevant. 
A precision of 0.8 means that 80% of the articles that our dictionary retrieved are relevant articles. 
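
Both definitions can be checked on a small made-up example. The sketch below cross-tabulates ten hypothetical manual and dictionary codes with `pd.crosstab` and reads true positives, false positives, and false negatives from the table. The data are invented for illustration and do not come from the corpus.

```python
import pandas as pd

# Made-up codes for ten headlines (1 = political actor mentioned)
toy = pd.DataFrame({
    "actors_m": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # manual coding
    "actors_d": [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # dictionary
})

# Cross-tabulate manual (rows) against dictionary (columns) classifications
table = pd.crosstab(toy["actors_m"], toy["actors_d"])
print(table)

tp = table.loc[1, 1]  # relevant and retrieved
fp = table.loc[0, 1]  # retrieved but not relevant
fn = table.loc[1, 0]  # relevant but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

In this toy example, two of the three retrieved headlines are relevant (precision 0.67), but only two of the five relevant headlines are retrieved (recall 0.40).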