# Filtering tweets for Tracking Pandemic Borderscapes

This document explains how the tweets are filtered and selected from the collection of tweets from various news outlets.

We use Python for most data management tasks, including filtering, so this document contains a fair amount of Python code. I explain the parts of the code that are essential to understanding the filtering but do not go into detail about the rest.

## Reading in the data

The code below reads in all of the tweets from the news outlets. In total, 423,803 tweets were collected.

In [1]:
# import packages

import pandas as pd
import ast 
import numpy as np
import json
import re

# load data
with open('../data/tpb_tweets_news-outlets_20211208.json', "r") as infile:
    all_data = json.load(infile)

In [2]:
len(all_data.get("data"))

423803
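
For orientation, the loading code implies that the JSON file holds a dictionary with a `data` key containing a list of tweet entries, each with a `text` field. The minimal sketch below mimics that shape with a tiny made-up sample; the tweet texts and the `id` field are invented for illustration and are not taken from the real file.

```python
import json

# Made-up sample mimicking the assumed file layout: {"data": [{"text": ...}, ...]}
sample_json = (
    '{"data": ['
    '{"id": "1", "text": "Border closed due to covid"}, '
    '{"id": "2", "text": "Weather update"}'
    ']}'
)

sample_data = json.loads(sample_json)

# Each entry is a dictionary; the filters only look at its "text" field
sample_texts = [entry.get("text") for entry in sample_data.get("data")]

print(len(sample_texts))
print(sample_texts[0])
```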

## Creating filters

We are using so-called "regular expressions" to filter the data. Regular expressions are text patterns that use special characters to match specific words, phrases, sentences and so on.

Just as a literature search lets us use "wildcards" (such as \*) and operators like AND and OR, regular expressions can be written to match certain criteria in texts.

In this case, we are using regular expressions to create different filters. We are currently working with three filters:
- A covid filter: Matches words related to the pandemic, whether used in the tweet text or as a hashtag in the tweet
- A migration filter: Matches words related to migration, whether used in the tweet text or as a hashtag in the tweet
- A geography filter: Matches words related to the geographies of interest, whether used in the tweet text or as a hashtag in the tweet

When a filter is applied, at least *one* of its words has to be present in the tweet.


### Using regular expressions as filters

When writing filters like this, we want to make sure that we capture variations of the same word (e.g. "boat" and "boats") while avoiding irrelevant words containing the same sequence of characters (e.g. "boat" and "showboat"). It is very difficult to be 100% precise, because it sometimes depends on the context whether the word in question is actually the word of interest (e.g. telling "corona" used about the virus apart from "corona" used about the beer).

In these filters we use regular expressions for two things:
- Setting up word boundaries where relevant
- Specifying that the filter should just match one of the words

**Word boundaries**

Regular expressions use a variety of special sequences to match text. `\b` matches a word boundary: the position between a word character (a letter, digit or underscore) and anything else, such as whitespace (space, newline, etc.), punctuation, or the start or end of the text. This is what we use to make sure we match "boat" and "boats" but not "showboat".

In the example below, we create the regular expression "boat" and see what happens when we use it on the text:

> "I do not mean to showboat but I have a pretty cool car with shiny rims."

In [3]:
text = "I do not mean to showboat but I have a pretty cool car with shiny rims."
regex_boats_nob = re.compile(r"boat")

if regex_boats_nob.search(text):
    print("It's a match")
else:
    print("It's not a match")

It's a match


The code prints "It's a match" because "showboat" contains the character sequence "boat", so the filter matches the text.

We avoid that by specifying a word boundary at the beginning of the word using the regular expression "\bboat" instead:

In [4]:
text = "I do not mean to showboat but I have a pretty cool car with shiny rims."
regex_boats_b = re.compile(r"\bboat")

if regex_boats_b.search(text):
    print("It's a match")
else:
    print("It's not a match")

It's not a match


Because we told the filter to match a word boundary before "boat", it is no longer a match.

To sum up using different regular expressions:
- "boat" will match "boat", "boats" and "showboats"
- "\bboat" will match "boat" and "boats" (because of the word boundary on the left side)
- "\bboat\b" will match "boat" (because of word boundaries on both left and right side)
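
The summary above can be checked directly. The sketch below runs the three patterns against the three example words and collects which words each pattern matches; the variable names are only for illustration.

```python
import re

words = ["boat", "boats", "showboats"]
patterns = [r"boat", r"\bboat", r"\bboat\b"]

results = {}
for pattern in patterns:
    regex = re.compile(pattern)
    # Keep the words in which the pattern is found anywhere
    results[pattern] = [word for word in words if regex.search(word)]
    print(pattern, "matches", results[pattern])
```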

**Matching one of several words**

The character `|` is used in regular expressions to specify an "or" condition, meaning that the pattern matches the text as long as at least one of the words separated by `|` is present.

The example below uses the regular expression "Italy|Cyprus" to look for either "Italy" or "Cyprus" in the sentence:

> I do not get out of Scandinavia much but if I did, I would like to go to Cyprus.

In [5]:
text = "I do not get out of Scandinavia much but if I did, I would like to go to Cyprus."
regex_itacyp = re.compile(r"Italy|Cyprus")

if regex_itacyp.search(text):
    print("It's a match")
else:
    print("It's not a match")

It's a match
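
To see which of the alternatives was actually found, `findall` can be used instead of `search`. The sketch below uses made-up sentences: one containing a single country, one containing both, and one containing neither.

```python
import re

regex_itacyp = re.compile(r"Italy|Cyprus")

texts = [
    "I would like to go to Cyprus.",
    "Italy and Cyprus are both in the Mediterranean.",
    "I do not get out of Scandinavia much.",
]

found = []
for text in texts:
    # findall returns every non-overlapping match, in order of appearance
    found.append(regex_itacyp.findall(text))

print(found)
```

An empty list means that none of the words were found, which corresponds to "It's not a match" in the example above.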


### The covid filter

The covid filter matches one of the following (notice placement of word boundaries):
- \bpandemic\b
- \bcovid\b
- \bcovid-19\b 
- \bcorona
- \bvaccine
- \bquarantine
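
The placement of the word boundaries decides which variations are caught. The sketch below applies the covid patterns listed above to a handful of made-up texts (they are not taken from the data) to illustrate the effect: a closing `\b` excludes endings such as a plural "s" or digits following directly after the word, while a pattern without a closing `\b` accepts any ending.

```python
import re

# Pattern assembled from the covid word list above
regex_covid = re.compile(
    r"\bpandemic\b|\bcovid\b|\bcovid-19\b|\bcorona|\bvaccine|\bquarantine",
    re.IGNORECASE,
)

# Made-up example texts, for illustration only
examples = [
    "#covid cases rise",       # "#" is not a word character, so \b holds before "covid"
    "#COVID19 update",         # the digit right after "COVID" blocks the closing \b
    "coronavirus spreads",     # no closing \b on "corona", so longer words match
    "pandemics in history",    # the plural "s" blocks the closing \b
    "vaccinated today",        # "vaccinated" does not contain "vaccine"
    "quarantined travellers",  # no closing \b on "quarantine"
]

covid_hits = {text: bool(regex_covid.search(text)) for text in examples}
for text, hit in covid_hits.items():
    print(hit, text)
```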

### The migration filter

The migration filter matches one of the following (notice placement of word boundaries):
- \bmigrant
- \brefugee
- \btransit
- \bdisplacement\b
- \bborder
- \breturn\b
- \bpushback
- \bboat
- \bdrowning\b
- \bhunger

### The geographies filter

The geographies filter matches one of the following (notice placement of word boundaries):
- \blebanon\b
- \blebanese\b
- \bsyria
- \bjordan
- \biraq
- \bgreece\b
- \bgreek
- \bturkey\b
- \bturkish\b
- \bcyprus\b
- \bcypriot
- \bmediterranean\b
- \bEU\b
- \btunisia
- \bitaly\b
- \bitalian\b
- \beuropean\b
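
As noted earlier, a pattern cannot tell from context whether a word is used in the intended sense. The sketch below applies a small subset of the geography patterns to made-up texts, of which only the first is about the region of interest; all four are matched.

```python
import re

# Subset of the geography filter, copied from the list above
regex_geos_subset = re.compile(r"\bjordan|\bturkey\b|\bgreek", re.IGNORECASE)

# Made-up texts, for illustration only
sample_texts = [
    "Refugees cross from Syria into Jordan",
    "Michael Jordan documentary breaks records",
    "Roast turkey recipes for the holidays",
    "Greek yoghurt sales are up",
]

geo_hits = [bool(regex_geos_subset.search(text)) for text in sample_texts]
print(geo_hits)
```

Combining the geography filter with the other filters, as done later in this document, is one way to reduce matches of this kind.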

## Code for creating the filters

The code below creates the filters using regular expressions. The `re.IGNORECASE` flag tells each filter to match the words regardless of casing (both upper-case and lower-case are matched).

In [10]:
regex_string_covid = r"\bpandemic\b|\bcovid\b|\bcovid-19\b|\bcorona|\bvaccine|\bquarantine"
regex_string_migrations = r"\bmigrant|\brefugee|\btransit|\bdisplacement\b|\bborder|\breturn\b|\bpushback|\bboat|\bdrowning\b|\bhunger"
regex_string_geos = r"\blebanon\b|\blebanese\b|\bsyria|\bjordan|\biraq|\bgreece\b|\bgreek|\bturkey\b|\bturkish\b|\bcyprus\b|\bcypriot|\bmediterranean\b|\bEU\b|\btunisia|\bitaly\b|\bitalian\b|\beuropean\b"

regex_covid = re.compile(regex_string_covid, re.IGNORECASE)
regex_migrations = re.compile(regex_string_migrations, re.IGNORECASE)
regex_geos = re.compile(regex_string_geos, re.IGNORECASE)
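
As a possible alternative to writing the long pattern strings by hand, the words can be kept in a Python list and joined with `|`. This is only a sketch of one way to organise the word lists, not the approach used above; the names `covid_terms`, `regex_string_covid_joined` and `regex_covid_joined` are made up for the example.

```python
import re

# Word list copied from the covid filter above
covid_terms = [
    r"\bpandemic\b",
    r"\bcovid\b",
    r"\bcovid-19\b",
    r"\bcorona",
    r"\bvaccine",
    r"\bquarantine",
]

# Joining with "|" gives the same string as writing the alternation by hand
regex_string_covid_joined = "|".join(covid_terms)
regex_covid_joined = re.compile(regex_string_covid_joined, re.IGNORECASE)

print(regex_string_covid_joined)
print(bool(regex_covid_joined.search("New QUARANTINE rules announced")))
```

Keeping one word per line makes it easier to add or remove a word and to see where the word boundaries are placed.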

## Applying the filters

The code below applies various combinations of the filters. Because the words are split into three filters, we can combine them in different ways, requiring that all three filters match, that two of them match, or that just one of them matches.

Notice the line of code:

```python
if regex_covid.search(entry.get('text')) and regex_migrations.search(entry.get('text')) and regex_geos.search(entry.get('text')):
```

The names `regex_covid`, `regex_migrations` and `regex_geos` refer to the covid, migration and geography filters respectively. `and` specifies that the filters on both sides of it must match, so because `and` is used between all three filters, all three must match. Other combinations use `or`, meaning that at least one of the filters must match. When `and` and `or` are mixed, parentheses are needed to group the `or` part: "filter1 *and* (filter2 *or* filter3)" requires that filter1 matches *and* that either filter2 *or* filter3 matches. Without the parentheses, Python evaluates the `and` first, so the condition would read as "(filter1 *and* filter2) *or* filter3".
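
The effect of grouping with parentheses can be seen with plain boolean values. In the minimal sketch below, the three variables are hypothetical stand-ins for the results of the three filters on a single tweet that only matches the geography filter.

```python
# Stand-ins for the results of the three filter checks on one tweet
covid_match = False
migration_match = False
geo_match = True

# Without parentheses, `and` is evaluated before `or`:
# (False and False) or True
without_parens = covid_match and migration_match or geo_match

# With parentheses, the `or` part is evaluated first:
# False and (False or True)
with_parens = covid_match and (migration_match or geo_match)

print(without_parens)  # True
print(with_parens)     # False
```

Only the version with parentheses requires the covid filter to match.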

**All filters must match: 387 tweets**

In [11]:
## all filters
data_filter_cig = []

for entry in all_data.get('data'):
    if regex_covid.search(entry.get('text')) and regex_migrations.search(entry.get('text')) and regex_geos.search(entry.get('text')):
        data_filter_cig.append(entry)
len(data_filter_cig)

387

**Migration and geographies: 3516 tweets**

In [12]:
## migration and geos
data_filter_ig = []

for entry in all_data.get('data'):
    if regex_migrations.search(entry.get('text')) and regex_geos.search(entry.get('text')):
        data_filter_ig.append(entry)
len(data_filter_ig)

3516

**Migration and covid: 1129 tweets**

In [15]:
## migration and covid
data_filter_ci = []

for entry in all_data.get('data'):
    if regex_covid.search(entry.get('text')) and regex_migrations.search(entry.get('text')):
        data_filter_ci.append(entry)
len(data_filter_ci)

1129

**Migration only: 13157 tweets**

In [13]:
## migration only
data_filter_i = []

for entry in all_data.get('data'):
    if regex_migrations.search(entry.get('text')):
        data_filter_i.append(entry)
len(data_filter_i)

13157

**Geographies only: 22671 tweets**

In [14]:
## geos only
data_filter_g = []

for entry in all_data.get('data'):
    if regex_geos.search(entry.get('text')):
        data_filter_g.append(entry)
len(data_filter_g)

22671

**Migration or geographies: 25234 tweets**

In [18]:
## migration or geo
data_filter_iorg = []

for entry in all_data.get('data'):
    if regex_migrations.search(entry.get('text')) or regex_geos.search(entry.get('text')):
        data_filter_iorg.append(entry)
len(data_filter_iorg)

25234

**Covid and (migration or geographies): 2152 tweets**

In [19]:
## covid and migration or geo
data_filter_ciorg = []

for entry in all_data.get('data'):
    if regex_covid.search(entry.get('text')) and (regex_migrations.search(entry.get('text')) or regex_geos.search(entry.get('text'))):
        data_filter_ciorg.append(entry)
len(data_filter_ciorg)

2152
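
The cells above loop over the data once per combination. As a possible alternative, the three filters can be evaluated once per tweet and the combinations counted in the same pass. The sketch below is an illustration only: it uses shortened stand-in patterns and made-up tweets rather than the real filters and data, and the `counts` labels are hypothetical.

```python
import re
from collections import Counter

# Shortened stand-in patterns and made-up tweets, for illustration only
regex_covid = re.compile(r"\bcovid\b|\bpandemic\b", re.IGNORECASE)
regex_migrations = re.compile(r"\bmigrant|\bborder", re.IGNORECASE)
regex_geos = re.compile(r"\bgreece\b|\bitaly\b", re.IGNORECASE)

sample_tweets = [
    {"text": "Greece closes border during covid lockdown"},
    {"text": "Migrants arrive in Italy"},
    {"text": "Pandemic update for Monday"},
    {"text": "Football results"},
]

counts = Counter()
for entry in sample_tweets:
    text = entry.get("text") or ""  # guard against entries without text
    c = bool(regex_covid.search(text))
    i = bool(regex_migrations.search(text))
    g = bool(regex_geos.search(text))

    # True counts as 1 and False as 0 when added to a counter
    counts["covid and migration and geo"] += c and i and g
    counts["migration and geo"] += i and g
    counts["migration or geo"] += i or g
    counts["covid and (migration or geo)"] += c and (i or g)

print(dict(counts))
```

Evaluating each filter once per tweet also keeps the combinations consistent with each other, since they are all computed from the same three results.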