# Twitter Query Reformulation 
A step-by-step guide to building a custom pipeline that handles queries for Tweet retrieval.

This notebook is divided into the following steps:
1. Query preprocessing using spaCy
2. Word embeddings to find expansion terms
3. Composing the query for Elasticsearch

---

In [1]:
import pandas as pd
import spacy

from spacy import displacy

Download one of the predefined German models.

In [2]:
# !python -m spacy download de_core_news_sm
# !python -m spacy download de_core_news_lg

In [3]:
# select the German package
MODEL = 'de_core_news_lg'

# load the German language model
nlp = spacy.load(MODEL)

Define a user query to test the whole pipeline

In [4]:
QUERY = "@amthor Ist die große Koalition gescheitert unter Merkel? #Groko#SPD #CDU"

---
# 1. Preprocessing
The initial query is inspected and preprocessed using several components of the pipeline:
- Tokenizer
- Matcher
- Named Entities
- POS Tagging

First, inspect the tokens that spaCy produces without any custom modifications.

In [5]:
doc = nlp(QUERY)

# displacy.render(doc, style="dep", jupyter=True)
print([token.text for token in doc])


['@amthor', 'Ist', 'die', 'große', 'Koalition', 'gescheitert', 'unter', 'Merkel', '?', '#', 'Groko#SPD', '#', 'CDU']


As we can see, the default tokenizer handles hashtags poorly: `#` is split off as its own token and the compound `#Groko#SPD` is not separated. We want to detect hashtags and prevent the tokenizer from splitting them. More precisely:
- don't split a hashtag from its text
- split compound hashtags
- mark hashtags in spaCy

The user mentions are already kept as one token, so we only need to:
- mark them as well

## 1.1 Tokenizer
Modify the tokenizer such that hashtags are not split at `#`

In [6]:
from pipeline.tokenizer.tweet_tokenizer import add_pattern

pattern = r"#\w+|\w+-\w+"
nlp = add_pattern(nlp=nlp, pattern=pattern)

In [7]:
print([token.text for token in nlp(QUERY)])

['@amthor', 'Ist', 'die', 'große', 'Koalition', 'gescheitert', 'unter', 'Merkel', '?', '#Groko#SPD', '#CDU']


Then insert whitespace between adjacent hashtags so that compound hashtags are separated.

In [8]:
from pipeline.tokenizer.tweet_tokenizer import separate_hashtags

QUERY = separate_hashtags(nlp(QUERY))

print(QUERY)

@amthor Ist die große Koalition gescheitert unter Merkel ? #Groko #SPD #CDU
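The implementation of `separate_hashtags` is not shown in this notebook. As a rough illustration only, the same effect can be sketched on a plain string with a regular expression. The function name `separate_hashtags_text` is hypothetical, and unlike the pipeline helper it works on text rather than on a spaCy `Doc`, so it does not add the space before `?` seen in the output above.

```python
import re


def separate_hashtags_text(text: str) -> str:
    """Insert a space before every '#' that directly follows a word character."""
    # (?<=\w) is a lookbehind: it only matches a '#' glued to preceding text
    return re.sub(r"(?<=\w)#", " #", text)


print(separate_hashtags_text("gescheitert unter Merkel? #Groko#SPD #CDU"))
```

Hashtags that are already preceded by whitespace are left untouched, so applying the function twice gives the same result.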


---
## 1.2 Matcher
Add custom matcher components that handle Tweet-specific syntax, i.e. hashtags and user mentions:
- Mark Hashtag (#)
- Mark Twitter User (@)

In [9]:
from pipeline.matcher.hashtag_matcher import create_hashtag_matcher
from pipeline.matcher.user_matcher import create_user_matcher

nlp.add_pipe("hashtag_matcher") 
nlp.add_pipe("user_matcher") 

<pipeline.matcher.user_matcher.UserMatcher at 0x7fb8859b69b0>

In [10]:
doc = nlp(QUERY)
data = []

for token in doc:
    data.append([token, token._.is_hashtag])
pd.DataFrame(data, columns=["Text", "is_hashtag"])

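|   | Text | is_hashtag |
|---|------|------------|
| 0 | @amthor | False |
| 1 | Ist | False |
| 2 | die | False |
| 3 | große | False |
| 4 | Koalition | False |
| 5 | gescheitert | False |
| 6 | unter | False |
| 7 | Merkel | False |
| 8 | ? | False |
| 9 | #Groko | True |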


In [11]:
data = []

for token in doc:
    data.append([token, token._.is_user])
pd.DataFrame(data, columns=["Text", "is_user"])

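|   | Text | is_user |
|---|------|---------|
| 0 | @amthor | True |
| 1 | Ist | False |
| 2 | die | False |
| 3 | große | False |
| 4 | Koalition | False |
| 5 | gescheitert | False |
| 6 | unter | False |
| 7 | Merkel | False |
| 8 | ? | False |
| 9 | #Groko | False |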


---
## 1.3 Named Entities
How are named entities detected? Especially those that are hashtags.

In [12]:
doc = nlp(QUERY)
data = []

for ent in doc.ents:
    data.append([ent.text, spacy.explain(ent.label_)])
    
# displacy.render(doc, style="ent")
pd.DataFrame(data, columns=["Text", "NER Label"])

|   | Text | NER Label |
|---|------|-----------|
| 0 | Merkel | Named person or family. |


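Named entity recognition is not optimal on Tweets. In this query only `Merkel` is recognized, while the hashtags `#Groko`, `#SPD` and `#CDU` and the mention `@amthor` are not detected as entities. In other cases the detected entity spans don't make sense.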

---
## 1.4 Part of Speech Tagging

In [13]:
data = []

for token in doc:
    data.append([token.text, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop, token._.is_hashtag, token._.is_user])

pd.DataFrame(data, columns=["Text", "UPOS Tag", "Tag", "Syntactics", "Shape", "Alpha Token", "Stop Token", "Hashtag", "User"], index=None)

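|   | Text | UPOS Tag | Tag | Syntactics | Shape | Alpha Token | Stop Token | Hashtag | User |
|---|------|----------|-----|------------|-------|-------------|------------|---------|------|
| 0 | @amthor | NOUN | NE | sb | @xxxx | False | False | False | True |
| 1 | Ist | AUX | VAFIN | ROOT | Xxx | True | True | False | False |
| 2 | die | DET | ART | nk | xxx | True | True | False | False |
| 3 | große | ADJ | ADJA | nk | xxxx | True | True | False | False |
| 4 | Koalition | NOUN | NN | sb | Xxxxx | True | False | False | False |
| 5 | gescheitert | VERB | VVFIN | oc | xxxx | True | False | False | False |
| 6 | unter | ADP | APPR | mo | xxxx | True | True | False | False |
| 7 | Merkel | PROPN | NE | nk | Xxxxx | True | False | False | False |
| 8 | ? | PUNCT | $. | punct | ? | False | False | False | False |
| 9 | #Groko | PROPN | NE | nk | #Xxxxx | False | False | True | False |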


---
## 1.5 Candidate Selection
Extract the terms that are used to find relevant expansion terms. These words should be:
- verbs, nouns, proper nouns or adjectives (configured via `pos_tags`)
- no hashtags or users
- alphabetic characters only
- no e-mail addresses, URLs or currency symbols

In [14]:
def select_candidate_terms(doc: spacy.tokens.doc.Doc, pos_tags):
    """
    Select the tokens that should be used for finding similar terms.
    """
    candidate_terms = []

    for token in doc:
        if token.pos_ not in pos_tags:
            continue

        if token._.is_hashtag is True:
            continue

        if token._.is_user is True:
            continue

        if token.is_alpha is False:
            continue

        if token.like_email:
            continue

        if token.like_url:
            continue

        if token.is_currency:
            continue

        candidate_terms.append(token.text)
    
    return candidate_terms

In [15]:
pos_tags = ["VERB", "NOUN", "PROPN", "ADJ"]
candidate_terms = select_candidate_terms(doc, pos_tags)

print(candidate_terms)

['große', 'Koalition', 'gescheitert', 'Merkel']


---
# 2. Word Embeddings
To find suitable expansion terms, word embeddings are used: for every candidate term, the most similar words in the embedding space are looked up.

The following embeddings are applied to the selected terms:
- FastText
- Word2Vec
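Both libraries rank neighbors by the similarity of their word vectors. The following minimal sketch shows the idea with cosine similarity on toy vectors. The three-dimensional vectors and the helper names `cosine` and `nearest_neighbors` are invented for illustration and are not taken from either model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(term, vectors, k=3):
    """Return the k most similar words as (score, word) tuples, best first."""
    query = vectors[term]
    scored = [(cosine(query, vec), word) for word, vec in vectors.items() if word != term]
    return sorted(scored, reverse=True)[:k]


# Toy vectors, made up purely for this example
toy_vectors = {
    "Merkel": [0.9, 0.1, 0.0],
    "Kanzlerin": [0.8, 0.2, 0.1],
    "Koalition": [0.1, 0.9, 0.2],
    "Fussball": [0.0, 0.1, 0.9],
}

for score, word in nearest_neighbors("Merkel", toy_vectors, k=2):
    print(f"{word}: {score:.3f}")
```

Real models hold hundreds of thousands of vectors with 300 dimensions and use optimized lookups, but the ranking principle is the same.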

In [16]:
# number of most similar terms
NUM_SIM_TERMS = 3

## 2.1 FastText

Load the FastText model with the **fasttext** library

In [17]:
# Download the German model from the fastText website
# !wget -P ./data/fasttext https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz

In [18]:
# unzip the fasttext model
# !gunzip -d ./data/fasttext/cc.de.300.bin.gz

In [19]:
import fasttext
ft_model = fasttext.load_model('data/fasttext/cc.de.300.bin')



Find the most similar terms...

In [20]:
ft_most_similar = {}

# look up the nearest neighbors of each candidate term and store them in a dict
for term in candidate_terms:
    # get_nearest_neighbors returns (score, word) tuples
    similar_terms = ft_model.get_nearest_neighbors(term, k=NUM_SIM_TERMS)
    ft_most_similar[term] = [n[1] for n in similar_terms]
    
print(ft_most_similar)

{'große': ['größere', 'grosse', 'riesengroße'], 'Koalition': ['Regierungskoalition', 'Koalitionsrunde', 'Koalitionspartei'], 'gescheitert': ['scheitert', 'Gescheitert', 'gescheitert.'], 'Merkel': ['Kanzlerin', 'Merkels', 'Bundeskanzlerin']}


In [21]:
del ft_model

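FastText returns plausible neighbors for all four candidate terms. Because it builds word vectors from character n-grams, it can also produce vectors for out-of-vocabulary words. The neighbors do include near-duplicates such as `Gescheitert` and `gescheitert.`, though.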
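FastText can handle out-of-vocabulary words because it represents a word through its character n-grams, with `<` and `>` marking the word boundaries. The sketch below only extracts these n-grams. The function name `char_ngrams` and the bounds of 3 to 6 characters are assumptions for illustration, and the real library additionally hashes the n-grams into buckets and combines their vectors.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return all character n-grams of the boundary-wrapped word."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("where", 3, 3))
print(len(char_ngrams("große")))
```

An unseen word still shares many n-grams with known words, which is why its composed vector tends to land close to related terms.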


---
## 2.2 Word2Vec


Load Word2Vec model via **Gensim**

In [22]:
# Download the German model from the devmount website
# !wget -P ./data/word2vec https://cloud.devmount.de/d2bc5672c523b086/german.model

In [23]:
from gensim.models import KeyedVectors

gensim_w2v_model = KeyedVectors.load_word2vec_format(fname="data/word2vec/german.model", no_header=False, binary=True)

In [24]:
w2v_most_similar = {}

# look up the most similar terms for each candidate and store them in a dict
for term in candidate_terms:
    if not gensim_w2v_model.has_index_for(term):
        print(f"The word '{term}' does not appear in this model")
    else:
        similar_terms = gensim_w2v_model.most_similar(term, topn=NUM_SIM_TERMS)
        # multi-word entries are joined with underscores in this model
        w2v_most_similar[term] = [n[0].replace("_", " ") for n in similar_terms]

print(w2v_most_similar)

The word 'große' does not appear in this model
{'Koalition': ['Grosse Koalition', 'Grossen Koalition', 'Regierungskoalition'], 'gescheitert': ['scheitert', 'Gescheitert', 'scheitern'], 'Merkel': ['Kanzlerin Merkel', 'Merkel CDU', 'Bundeskanzlerin']}


In [25]:
del gensim_w2v_model

The model works, but lookups are case-sensitive and match exact word forms only: `große`, for example, is not in the vocabulary. The terms should therefore be lemmatized before the lookup; otherwise the model can't find the matching word vector.
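One way to soften the exact-match requirement is to try a few normalized spellings before giving up. The sketch below is an assumption, not part of the pipeline: the helper `find_in_vocab` is hypothetical, the toy vocabulary is invented, and the `ß` to `ss` transliteration is only suggested by neighbors such as `Grosse Koalition` in the output above. Lemmatization itself would still come from spaCy (`token.lemma_`).

```python
def find_in_vocab(term, vocab):
    """Return the first spelling variant of `term` found in `vocab`, else None."""
    transliterated = term.replace("ß", "ss")
    variants = [
        term,
        term.lower(),
        term.capitalize(),
        transliterated,
        transliterated.capitalize(),
    ]
    for variant in variants:
        if variant in vocab:
            return variant
    return None


# Toy vocabulary, invented for this example
toy_vocab = {"Koalition", "Grosse", "scheitern"}

print(find_in_vocab("große", toy_vocab))
print(find_in_vocab("koalition", toy_vocab))
print(find_in_vocab("Merkel", toy_vocab))
```

With a loaded Gensim model, the `key_to_index` mapping of the `KeyedVectors` object could be passed as `vocab`.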

---
# 3. Elasticsearch

Finally, the obtained terms are used to retrieve Tweets from the Elasticsearch index. Beforehand, the most relevant expansion terms must be determined. For this purpose, the [adjacency matrix aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-adjacency-matrix-aggregation.html) is used. It returns the number of documents in which an initial user term and a similar term occur together. Based on this count we can decide whether the similar term is a suitable expansion, which is the case if the two terms co-occur frequently.

In [26]:
from pipeline.utils import es_connect
import configparser

config = configparser.ConfigParser()
config.read('auth/es-credentials.ini')

es_client = es_connect(credentials=config["ELASTIC"])

Connecting to Elastic Search...
Successfully connected to https://localhost:9200


In [27]:
# Name of the Elastic Search index 
INDEX = "tweets"

---
## 3.1 Aggregation Query
Next, decide which terms of the initial query should be replaced or expanded. To do so, the co-occurrence of the expansion terms and the initial terms is examined. Terms that frequently occur together are likely to be suitable expansions for the final query.

In [28]:
import json

# load the predefined aggregation query
with open('config/es-adjacency-matrix.conf') as f:
    es_agg_query = json.load(f)

In [29]:
# To avoid loading the embedding models into memory again, reuse the outputs copied from the cells above
w2v_most_similar = {'Koalition': ['Grosse Koalition', 'Grossen Koalition', 'Regierungskoalition'], 'gescheitert': ['scheitert', 'Gescheitert', 'scheitern'], 'Merkel': ['Kanzlerin Merkel', 'Merkel CDU', 'Bundeskanzlerin']}
ft_most_similar = {'große': ['größere', 'grosse', 'riesengroße'], 'Koalition': ['Regierungskoalition', 'Koalitionsrunde', 'Koalitionspartei'], 'gescheitert': ['scheitert', 'Gescheitert', 'gescheitert.'], 'Merkel': ['Kanzlerin', 'Merkels', 'Bundeskanzlerin']}

In [30]:
from pipeline.querybuilder.aggregation_query import compose_aggregation_query, get_expansion_terms

compose_aggregation_query(es_agg_query, ft_most_similar)

{'size': 0,
 'aggs': {'interactions': {'adjacency_matrix': {'filters': {'große': {'term': {'txt': 'große'}},
     'größere': {'term': {'txt': 'größere'}},
     'grosse': {'term': {'txt': 'grosse'}},
     'riesengroße': {'term': {'txt': 'riesengroße'}},
     'Koalition': {'term': {'txt': 'koalition'}},
     'Regierungskoalition': {'term': {'txt': 'regierungskoalition'}},
     'Koalitionsrunde': {'term': {'txt': 'koalitionsrunde'}},
     'Koalitionspartei': {'term': {'txt': 'koalitionspartei'}},
     'gescheitert': {'term': {'txt': 'gescheitert'}},
     'scheitert': {'term': {'txt': 'scheitert'}},
     'Gescheitert': {'term': {'txt': 'gescheitert'}},
     'gescheitert.': {'term': {'txt': 'gescheitert.'}},
     'Merkel': {'term': {'txt': 'merkel'}},
     'Kanzlerin': {'term': {'txt': 'kanzlerin'}},
     'Merkels': {'term': {'txt': 'merkels'}},
     'Bundeskanzlerin': {'term': {'txt': 'bundeskanzlerin'}}}}}}}
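The internals of `compose_aggregation_query` are not shown here. A request body with the same shape as the output above could be assembled roughly as follows. The function name `build_adjacency_query` is hypothetical, and lowercasing the filter values is inferred from the printed output rather than from the pipeline code.

```python
def build_adjacency_query(similar_terms: dict[str, list[str]], field: str = "txt") -> dict:
    """Create one named term filter per initial term and per similar term."""
    filters = {}
    for term, neighbors in similar_terms.items():
        for name in [term, *neighbors]:
            # the filter keeps its original name, the value is lowercased
            filters[name] = {"term": {field: name.lower()}}
    return {
        "size": 0,  # only the aggregation buckets are needed, no hits
        "aggs": {"interactions": {"adjacency_matrix": {"filters": filters}}},
    }


body = build_adjacency_query({"Merkel": ["Kanzlerin", "Merkels", "Bundeskanzlerin"]})
print(list(body["aggs"]["interactions"]["adjacency_matrix"]["filters"]))
```

Elasticsearch then returns one bucket per filter and one bucket per non-empty pair of filters, labeled with both names joined by `&`.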

In [32]:
# execute the search aggregation query
res = es_client.search(index=INDEX, size=es_agg_query["size"], aggregations=es_agg_query["aggs"])

# get the aggregations and their score from the response
aggregations = {}
aggregations.update((t["key"], t["doc_count"]) for t in res["aggregations"]["interactions"]["buckets"])

# sort the aggregations by their document count and print the results
print("Took", res["took"], "ms\n")
pd.DataFrame(sorted(aggregations.items(), key=lambda x: x[1], reverse=True), columns=["Term Aggregation", "Document Count"])

Took 233 ms



|   | Term Aggregation | Document Count |
|---|------------------|----------------|
| 0 | Merkel | 2908 |
| 1 | Koalition | 2516 |
| 2 | Kanzlerin | 906 |
| 3 | Gescheitert | 514 |
| 4 | Gescheitert&gescheitert | 514 |
| 5 | gescheitert | 514 |
| 6 | Bundeskanzlerin | 486 |
| 7 | Kanzlerin&Merkel | 394 |
| 8 | Bundeskanzlerin&Merkel | 225 |
| 9 | scheitert | 176 |


In [33]:
ALPHA = 0.1
expansion_terms = get_expansion_terms(candidate_terms, ft_most_similar, aggregations, ALPHA)

query_terms = candidate_terms + expansion_terms
query_terms

['große', 'Koalition', 'gescheitert', 'Merkel', 'Gescheitert', 'Kanzlerin']
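How `get_expansion_terms` uses `ALPHA` is not shown in this notebook. One plausible reading, sketched below as an assumption, is a threshold on the share of an initial term's documents that also contain the similar term. The function name `select_expansion_terms` is hypothetical, and the counts are copied from the aggregation table above.

```python
def select_expansion_terms(candidates, similar_terms, counts, alpha):
    """Keep neighbors whose co-occurrence share with the initial term reaches alpha."""
    selected = []
    for term in candidates:
        base = counts.get(term, 0)
        if base == 0:
            continue  # the initial term has no bucket, nothing to compare against
        for neighbor in similar_terms.get(term, []):
            # pair buckets are labeled "A&B"; try both orders to be safe
            joint = counts.get(f"{neighbor}&{term}", counts.get(f"{term}&{neighbor}", 0))
            if joint / base >= alpha:
                selected.append(neighbor)
    return selected


counts = {
    "Merkel": 2908,
    "Kanzlerin": 906,
    "Bundeskanzlerin": 486,
    "Kanzlerin&Merkel": 394,
    "Bundeskanzlerin&Merkel": 225,
}
similar = {"Merkel": ["Kanzlerin", "Merkels", "Bundeskanzlerin"]}

print(select_expansion_terms(["Merkel"], similar, counts, alpha=0.1))
```

Under this reading, `Kanzlerin` passes (394 of 2908 documents, about 13.5 %) while `Bundeskanzlerin` does not (225 of 2908, about 7.7 %).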

## 3.2 Data Preparation
Finally, the hashtags, Twitter users and entities are extracted from the query. Together with the final expansion terms, they are inserted into the Elasticsearch query template, and the query is executed on the specified `INDEX`.

In [39]:
hashtags = [t.text.lower() for t in doc if t._.is_hashtag ]

pd.DataFrame(hashtags, columns=["Hashtag"])

|   | Hashtag |
|---|---------|
| 0 | #groko |
| 1 | #spd |
| 2 | #cdu |


In [34]:
users = [t.text.lower() for t in doc if t._.is_user ]

pd.DataFrame(users, columns=["User"])

|   | User |
|---|------|
| 0 | @amthor |


In [35]:
entities = [ent.text.lower() for ent in doc.ents]

pd.DataFrame(entities, columns=["Entity"])

|   | Entity |
|---|--------|
| 0 | merkel |


---
## 3.3 Query Formulation 
The resulting terms must be arranged in an Elasticsearch query. For this purpose, a query template for retrieving relevant Tweets is defined in `config/es-query.conf`.

The template uses Elasticsearch query syntax such as:
- boolean operators (`AND`, `OR`)
- boosting (`^`)
- filters

The following parameters configure the query template:

In [40]:
# configure parameters for query composition
params = {
    "retweet": False,
    "hashtag_boost": 0.5,
    "tweet_range": ("2021-01-01", "2023-01-01")
}

In [41]:
from pipeline.querybuilder.search_query import compose_search_query

# Load the pre-configured template for an Elasticsearch query
with open('config/es-query.conf') as f:
    search = json.load(f)
compose_search_query(search, query_terms, hashtags, entities, params)

{'size': 10,
 'query': {'bool': {'should': [{'match': {'txt': {'query': 'große Koalition gescheitert Merkel Gescheitert Kanzlerin',
       'operator': 'OR'}}},
    {'terms': {'hashtags': ['große',
       'koalition',
       'gescheitert',
       'merkel',
       'gescheitert',
       'kanzlerin'],
      'boost': 0.5}}],
   'must': {'terms_set': {'hashtags': {'terms': ['groko', 'spd', 'cdu'],
      'minimum_should_match_script': {'source': 'Math.min(params.num_terms, 1)'}}}},
   'must_not': {'term': {'txt': '_retweet_'}},
   'filter': [{'range': {'created_at': {'gte': '2021-01-01'}}},
    {'range': {'created_at': {'lte': '2023-01-01'}}}]}},
 'aggs': {'sample': {'sampler': {'shard_size': 500},
   'aggs': {'keywords': {'significant_terms': {'field': 'hashtags'}}}}},
 'collapse': {},
 'sort': {}}
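The template and `compose_search_query` are not reproduced in this notebook. The `query` part of the output above could be built along these lines. The function name `build_search_query` is hypothetical, the structure is inferred from the printed result, and the `aggs`, `collapse` and `sort` sections as well as the entities are left out.

```python
def build_search_query(query_terms, hashtags, params, size=10):
    """Assemble a bool query from query terms, hashtags and parameters."""
    start, end = params["tweet_range"]
    bool_query = {
        "should": [
            # full-text match on the Tweet text
            {"match": {"txt": {"query": " ".join(query_terms), "operator": "OR"}}},
            # the same terms matched against the hashtag field, with a boost
            {"terms": {"hashtags": [t.lower() for t in query_terms],
                       "boost": params["hashtag_boost"]}},
        ],
        "filter": [
            {"range": {"created_at": {"gte": start}}},
            {"range": {"created_at": {"lte": end}}},
        ],
    }
    if hashtags:
        # require at least one of the hashtags from the user query
        bool_query["must"] = {"terms_set": {"hashtags": {
            "terms": [h.lstrip("#") for h in hashtags],
            "minimum_should_match_script": {"source": "Math.min(params.num_terms, 1)"},
        }}}
    if not params["retweet"]:
        bool_query["must_not"] = {"term": {"txt": "_retweet_"}}
    return {"size": size, "query": {"bool": bool_query}}


example_params = {"retweet": False, "hashtag_boost": 0.5,
                  "tweet_range": ("2021-01-01", "2023-01-01")}
query = build_search_query(["Merkel", "Kanzlerin"], ["#groko", "#spd"], example_params)
print(query["query"]["bool"]["must"])
```

Keeping the hashtag condition in `must` and the expansion terms in `should` means the expansion terms only influence the ranking, while the hashtags restrict the result set.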

### Execute the final Query

In [42]:
res = es_client.search(index=INDEX, size=search['size'], query=search['query'], aggregations=search["aggs"])

print(f'Total of {res["hits"]["total"]["value"]} hits in {res["took"]}ms \n')

# use `hit` as the loop variable so the spaCy `doc` from above is not overwritten
for i, hit in enumerate(res["hits"]["hits"]):
    print("Tweet", i, "\n", hit["_source"], "\n")


Total of 4561 hits in 509ms 

Tweet 0 
 {'retweet_count': 42, 'reply_count': 17, 'like_count': 350, 'created_at': '2021-08-29T15:25:10+02:00', 'txt': 'Von den letzten 16 Jahren hat die #SPD 12 Jahre mit der #CDU regiert. Die #SPD hat Scholz mit großem Getöse nicht zum Parteivorsitzenden gewählt,mit dem Argument,er stünde für die #GroKo Jetzt ist er Kanzlerkandidat und kokettiert offen damit merkellike zu sein.Die Wahrheit ist:', 'hashtags': ['spd', 'spd', 'cdu', 'groko'], 'word_count': 44} 

Tweet 1 
 {'retweet_count': 49, 'reply_count': 41, 'like_count': 435, 'created_at': '2021-09-05T17:39:20+02:00', 'txt': 'Da Herr #Söder vor einem Linksrutsch warnt, der größte Linksrutsch fand unter den #Groko-Regierungen unter Angela #Merkel statt. ☝️🤔\n👉Sie war die beste #CDU-Kanzlerin, welche die #SPD je hatte.🤷\u200d♂️ https://t.co/UwrxHT8yDu', 'hashtags': ['spd', 'cdu', 'söder', 'merkel', 'groko'], 'word_count': 29} 

Tweet 2 
 {'retweet_count': 28, 'reply_count': 7, 'like_count': 191, 'create