# Twitter Query Expansion
© Jason Pyanowski

This demo file explains how to use the **Twitter Query Expansion** project. It starts with loading the Tweets into a search index and downloading the word embedding models. It then explains the pipeline configuration, runs the pipeline, and inspects the results.

## (1) Twitter Data Preparation
Feed Tweets from the PostgreSQL database into an Elastic Search index. The script `scripts/tweet_feeder.py` handles this task; its options are listed below. A running Elastic Search cluster and a PostgreSQL database are required.

In [5]:
!python3 scripts/tweet_feeder.py -h

usage: tweet_feeder.py [-h] -i INDEX -t TABLE [-ec ELASTIC_CREDENTIALS]
                       [-pc POSTGRES_CREDENTIALS] [-es ELASTIC_SETTINGS]
                       [-wc WORDCOUNT]

Feed Postgres data into Elastic Search Index

options:
  -h, --help            show this help message and exit
  -i INDEX, --index INDEX
                        Elastic Search index
  -t TABLE, --table TABLE
                        Postgres table
  -ec ELASTIC_CREDENTIALS, --elastic_credentials ELASTIC_CREDENTIALS
                        Path to Elastic Search credentials file
  -pc POSTGRES_CREDENTIALS, --postgres_credentials POSTGRES_CREDENTIALS
                        Path to Postgres credentials file
  -es ELASTIC_SETTINGS, --elastic_settings ELASTIC_SETTINGS
                        Settings for new Index; Look at "/templates/es-
                        config.conf"
  -wc WORDCOUNT, --wordcount WORDCOUNT
                        Minimum number of words per Tweet
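
The options above can be combined into a concrete call. The following is a minimal sketch that only assembles the command line from the flags shown in the help text. The helper `build_feeder_command`, the table name `tweets`, and the word count `5` are assumptions made for illustration; they are not part of the project.

```python
import shlex


def build_feeder_command(index, table, elastic_credentials=None,
                         postgres_credentials=None, elastic_settings=None,
                         wordcount=None):
    """Assemble the argument list for scripts/tweet_feeder.py (hypothetical helper)."""
    # -i and -t are the two required options according to the usage line.
    cmd = ["python3", "scripts/tweet_feeder.py", "-i", index, "-t", table]
    optional = [
        ("-ec", elastic_credentials),
        ("-pc", postgres_credentials),
        ("-es", elastic_settings),
        ("-wc", wordcount),
    ]
    # Append only the optional flags that were actually given.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Example values are assumptions: adjust index, table and word count to your setup.
cmd = build_feeder_command("tweets_all", "tweets", wordcount=5)
command_line = shlex.join(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the database and the cluster are available.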


## (2) Download Word Embedding Models
This pipeline can use different word embedding models. Pass the download link of the desired model to `load_model` as shown below. The model types `fasttext` and `word2vec` are currently supported. To speed up the query expansion pipeline, the downloaded models are subsequently compressed.

|Parameter|Possible Values|Datatype|
|---|---|---|
|type|`'fasttext'`, `'word2vec'`|`str`|
|url|`'url to model'`|`str`|

**Download Word2Vec model**

In [None]:
from scripts.model_loader import load_model
load_model(type="word2vec", url="https://cloud.devmount.de/d2bc5672c523b086/german.model")

**Download FastText model**

In [None]:
load_model(type="fasttext", url="https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz")

## (3) Pipeline Configuration
### Define Queries
Specify queries on which to evaluate the pipeline. Queries may include Twitter-specific syntax like hashtags `#EU` or user mentions `@bundestag`.

In [1]:
QUERIES = [
    "große Koalition gescheitert unter Merkel? #Groko #SPD #CDU",
    "Lauterbach Deutschland Corona-Maßnahmen #Impfung",
    "@bundestag Bundestagswahl 2021 Ergebnisse",
    #"EU Brexit Boris Johnson",
    "Gesetzliche Rentenversicherung Rente Mit 67",
    #"Klimapolitik Deutschland #Grüne",
    #"Asylpolitik Merkel",
    #"Soli abschaffen Westen",
    "Bundeswehr Afghanistan Krieg stoppen",
    #"Deutschland Energiewende mit SPD CDU"
]
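
To illustrate the Twitter-specific syntax, the sketch below separates hashtags and user mentions from the remaining query terms using regular expressions. It is an assumption about how such a split could look, not the pipeline's actual parsing code; the function `split_query` is hypothetical.

```python
import re

# Hashtags and mentions are a marker followed by word characters.
# In Python 3, \w also matches umlauts and other Unicode letters.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def split_query(query):
    """Split a query into plain terms, hashtags and user mentions (hypothetical helper)."""
    hashtags = HASHTAG_RE.findall(query)
    mentions = MENTION_RE.findall(query)
    # Remove the Twitter-specific tokens, then split the rest on whitespace.
    plain = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", query))
    return {"terms": plain.split(), "hashtags": hashtags, "mentions": mentions}


parts = split_query("@bundestag Bundestagswahl 2021 Ergebnisse")
tagged = split_query("Lauterbach Deutschland Corona-Maßnahmen #Impfung")
```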

### Set Word Embedding Parameters
Adjust the word embedding parameters to obtain the desired results. These settings determine which of the initial query terms are actually used to find related terms.

| Parameter | Possible Values | Datatype |
|---|---|---|
|type|`'word2vec'`, `'fasttext'`|`str`|
|model| `'path to model'`|`str`|
|pos_list|`['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'CONJ', 'DET', 'EOL', 'IDS', 'INTJ', 'NAMES', 'NOUN', 'NO_TAG', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SPACE', 'SYM', 'VERB', 'X']`| `list[str]`|
|entity_list|`['LOC', 'MISC', 'ORG', 'PER']`|`list[str]`|
|hashtag|`True, False`|`bool`|
|user|`True, False`|`bool`|
|num_nearest_terms|`1...N`|`int`|


In [2]:
EMBEDDING_PARAMS = {
    "type": "word2vec",
    "model": "models/word2vec/german.model",
    "pos_list": ["NOUN","ADJ","VERB","PROPN"],
    "entity_list": ['LOC', 'ORG'],
    "hashtag": True,
    "user": False,
    "num_nearest_terms": 3
}
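
The sketch below shows one plausible reading of how these parameters could select query terms before related terms are looked up. It assumes that a token is kept when its part-of-speech tag is in `pos_list` or its entity label is in `entity_list`, and that `hashtag` and `user` switch hashtags and mentions on or off. The `Token` class, the `select_terms` function, and the hand-assigned tags are hypothetical; in the pipeline the tags come from SpaCy, and its actual selection logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    pos: str                   # coarse part-of-speech tag, e.g. "NOUN"
    ent: Optional[str] = None  # entity label such as "ORG", or None


def select_terms(tokens, params):
    """Return the texts of all tokens that pass the configured filters (assumed logic)."""
    selected = []
    for tok in tokens:
        if tok.text.startswith("#"):
            keep = params["hashtag"]
        elif tok.text.startswith("@"):
            keep = params["user"]
        else:
            keep = tok.pos in params["pos_list"] or tok.ent in params["entity_list"]
        if keep:
            selected.append(tok.text)
    return selected


params = {
    "pos_list": ["NOUN", "ADJ", "VERB", "PROPN"],
    "entity_list": ["LOC", "ORG"],
    "hashtag": True,
    "user": False,
}
# Tags are assigned by hand here for illustration only.
tokens = [
    Token("@bundestag", "PROPN"),
    Token("Bundestagswahl", "NOUN"),
    Token("2021", "NUM"),
    Token("Ergebnisse", "NOUN"),
    Token("#SPD", "PROPN", "ORG"),
]
selected = select_terms(tokens, params)
```

With these settings the mention and the number are dropped, while the nouns and the hashtag remain as candidates for expansion.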

### Set Elastic Search Parameters
After query expansion, the Tweets are retrieved from an Elastic Search index. Specify the parameters below and make sure that the Elastic Search instance holding the index is running on your machine.

| Parameter | Possible Values | Datatype |
|---|---|---|
|index|`'tweets'`|`str`|
|num_of_tweets|`1...N`| `int`|
|retweet|`True, False`|`bool`|
|hashtag_boost|`0...N`|`float`|
|tweet_range|`(date, date)`|`tuple`|

In [3]:
ELASTIC_PARAMS = {
    "index": "tweets_all",
    "num_of_tweets": 20,
    "retweet": False,
    "hashtag_boost": 1.0,
    "tweet_range": ("2020-01-01", "2023-01-01")
}
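
As an illustration of how these parameters could map onto a search request, the sketch below builds an Elastic Search request body as a plain dictionary. Only the field name `txt` appears in this demo; the field names `hashtags`, `created_at`, and `is_retweet`, the inclusive date range, and the function `build_search_body` are assumptions, and the query the pipeline actually sends may look different.

```python
def build_search_body(query_text, hashtags, params):
    """Translate the Elastic Search parameters into a bool query (assumed mapping)."""
    start, end = params["tweet_range"]
    bool_query = {
        # Full-text match on the Tweet text.
        "must": [{"match": {"txt": query_text}}],
        # Restrict results to the configured date range (assumed inclusive).
        "filter": [{"range": {"created_at": {"gte": start, "lte": end}}}],
    }
    if hashtags:
        # Tweets sharing a hashtag with the query score higher, weighted by hashtag_boost.
        bool_query["should"] = [
            {"terms": {"hashtags": hashtags, "boost": params["hashtag_boost"]}}
        ]
    if not params["retweet"]:
        # Exclude retweets when the retweet flag is off.
        bool_query["must_not"] = [{"term": {"is_retweet": True}}]
    return {"size": params["num_of_tweets"], "query": {"bool": bool_query}}


example_params = {
    "index": "tweets_all",
    "num_of_tweets": 20,
    "retweet": False,
    "hashtag_boost": 1.0,
    "tweet_range": ("2020-01-01", "2023-01-01"),
}
body = build_search_body("Lauterbach Deutschland Corona-Maßnahmen", ["Impfung"], example_params)
```

A dictionary of this shape is what a client would send as the search request against the index named in `index`.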

## (4) Execute Pipeline
Run the pipeline. The results are stored in the `/output` directory.

In [4]:
# run pipeline
from scripts import pipeline

res = pipeline.run(
    queries=QUERIES, 
    embedding_params=EMBEDDING_PARAMS,
    elastic_params=ELASTIC_PARAMS)

Processing text using SpaCy...
Evaluating word2vec model...
Connecting to Elastic Search...
Retrieving Tweets...
Writing results to output/word2vec/18-01-23_16-51-42
Finished!


## (5) Inspect Results
Load results and have a look through the retrieved Tweets. 

In [14]:
for tweets, query in zip(res, QUERIES):
    print("Query:",query)
    
    for tweet in tweets["tweets"][:3]:
        print("Tweet: ", tweet["_source"]["txt"])
    print("\n")

Query: große Koalition gescheitert unter Merkel? #Groko #SPD #CDU
Tweet:  Die erste #GroKo in Deutschland vereinte 1966 noch 86,9% der Wähler:innen hinter sich. Das lässt sich heute nicht mal mit einer Mosambik-Koalition erreichen. Aber wenn man #CDU und #FDP einerseits und #SPD und #Grüne anderseits beobachtet, könnte man das glatt als Lösung sehen.
Tweet:  Aber…. Ach so, das wollten ja #SPD #CDU und #CSU explizit nicht. Obwohl Grüne, FDP und Linke das beantragt hatten. Die #GroKo nimmt es schlicht billigend in Kauf, dass bis zu 300 Abgeordnete mehr in den Bundestag kommen. Es wäre schlicht verheerend.
Tweet:  Bis zu 1000 Abgeordnete könnte der neue #Bundestag nach der Wahl bekommen! Die #FDP hatte mit anderen Oppositionsparteien eine effektive #Wahlrechtsreform auf den Weg gebracht. Stattdessen hat sich die #GroKo aus #CDUCSU und #SPD für dieses Reförmchen ohne Wirkung gefeiert. @fdp https://t.co/VZJ1k0dbyW


Query: Lauterbach Deutschland Corona-Maßnahmen #Impfung
Tweet:  3-G bleibt 