# Twitter Query Expansion
In this demo file, the project **Twitter Query Expansion** is ...

## Download Models
This pipeline supports different word embedding models. Pass the download link of the desired model to `load_model` below to fetch it. The model types `fasttext` and `word2vec` are currently supported. To speed up the query expansion pipeline, the downloaded models are compressed.

Download Word2Vec model

In [7]:
from scripts.model_loader import load_model
load_model(type="word2vec", url="https://cloud.devmount.de/d2bc5672c523b086/german.model")

Download FastText model

In [None]:
load_model(type="fasttext", url="https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz")

## Define Queries
Specify queries on which to evaluate the pipeline. Queries may include Twitter-specific syntax like hashtags `#EU` or user mentions `@bundestag`.

In [1]:
QUERIES = [
    "große Koalition gescheitert unter Merkel? #Groko #SPD #CDU",
    "Lauterbach Deutschland Corona-Maßnahmen #Impfung",
    "@bundestag Bundestagswahl 2021 Ergebnisse",
    "Europäische Union Brexit Boris Johnson",
    "Gesetzliche Rentenversicherung Rente Mit 67",
    "Klimapolitik Deutschland #Grüne",
    "Asylpolitik Merkel",
    "Soli abschaffen im Westen",
    "Bundeswehr Afghanistan Krieg stoppen",
    "Deutschland Energiewende unter SPD CDU"
]
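
To illustrate the Twitter-specific syntax, the following minimal sketch separates hashtags and user mentions from the plain terms of a query. The helper `split_query` and its regular expressions are hypothetical and only meant for illustration; the project's own parsing may work differently.

```python
import re

# Hypothetical helper for illustration; not part of the project's scripts.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def split_query(query):
    """Separate hashtags and user mentions from the remaining plain terms."""
    hashtags = HASHTAG_RE.findall(query)
    mentions = MENTION_RE.findall(query)
    # Everything that is neither a hashtag nor a mention counts as a plain term.
    terms = [tok for tok in query.split() if not tok.startswith(("#", "@"))]
    return {"terms": terms, "hashtags": hashtags, "mentions": mentions}


parts = split_query("@bundestag Bundestagswahl 2021 Ergebnisse #EU")
print(parts)
```

Since `\w` matches Unicode letters in Python 3, hashtags with umlauts such as `#Grüne` are captured as well.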

## Set Word Embedding Parameters
Adjust the word embedding parameters to tune the results. These settings determine which of the initial query terms are used to find related terms.

| Parameter | Possible Values | Datatype |
|---|---|---|
|type|`'word2vec', 'fasttext'`|`str`|
|model|`'path to model'`|`str`|
|pos_list|`['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'CONJ', 'DET', 'EOL', 'IDS', 'INTJ', 'NAMES', 'NOUN', 'NO_TAG', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SPACE', 'SYM', 'VERB', 'X']`| `list[str]`|
|entity_list|`['LOC', 'MISC', 'ORG', 'PER']`|`list[str]`|
|hashtag|`True, False`|`bool`|
|user|`True, False`|`bool`|
|num_nearest_terms|`1...N`|`int`|


In [2]:
EMBEDDING_PARAMS = {
    "type": "fasttext",
    "model": "models/fasttext/cc.de.300.model",
    "pos_list": ["NOUN", "ADJ", "VERB", "PROPN"],
    "entity_list": ["LOC", "ORG"],
    "hashtag": False,
    "user": False,
    "num_nearest_terms": 5
}
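
The role of `num_nearest_terms` can be sketched with a toy nearest-neighbour lookup based on cosine similarity. The vectors and the helper `nearest_terms` below are invented for illustration; the pipeline itself presumably queries the loaded word2vec or fastText model instead.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
TOY_VECTORS = {
    "rente": [0.9, 0.1, 0.0],
    "pension": [0.8, 0.2, 0.1],
    "ruhestand": [0.7, 0.3, 0.0],
    "klima": [0.0, 0.9, 0.4],
    "energie": [0.1, 0.8, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_terms(term, vectors, num_nearest_terms):
    """Return the terms most similar to `term`, best match first."""
    target = vectors[term]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:num_nearest_terms]]


expansion = nearest_terms("rente", TOY_VECTORS, num_nearest_terms=2)
print(expansion)
```

A larger `num_nearest_terms` adds more related terms per query term, which tends to widen the search at the cost of precision.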

## Set Elasticsearch Parameters
| Parameter | Possible Values | Datatype |
|---|---|---|
|index|`'tweets'`|`str`|
|num_of_tweets|`1...N`| `int`|
|retweet|`True, False`|`bool`|
|hashtag_boost|`0...N`|`float`|
|tweet_range|`(date, date)`|`tuple`|

In [3]:
ELASTIC_PARAMS = {
    "index": "tweets",
    "num_of_tweets": 10,
    "retweet": False,
    "hashtag_boost": 0.5,
    "tweet_range": ("2021-01-01", "2023-01-01")
}
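
As a rough sketch of how such parameters could translate into an Elasticsearch request body, the hypothetical helper `build_query` below assembles a `bool` query. The field names `hashtags`, `created_at` and `is_retweet` are assumptions about the index mapping, and the sketch assumes that `retweet: False` excludes retweets; only the `txt` field appears in this demo. The project's actual query may be built differently.

```python
def build_query(terms, hashtags, params):
    """Assemble an Elasticsearch request body from the given parameters.

    Hypothetical helper; field names other than `txt` are assumptions.
    """
    start, end = params["tweet_range"]
    bool_query = {
        # Plain terms must match the tweet text.
        "must": [{"match": {"txt": " ".join(terms)}}],
        # Hashtag matches are optional and weighted by `hashtag_boost`.
        "should": [
            {"match": {"hashtags": {"query": tag, "boost": params["hashtag_boost"]}}}
            for tag in hashtags
        ],
        # Restrict results to the configured date range.
        "filter": [{"range": {"created_at": {"gte": start, "lte": end}}}],
    }
    if not params["retweet"]:
        # Assumption: retweets are flagged by a boolean field.
        bool_query["must_not"] = [{"term": {"is_retweet": True}}]
    return {"size": params["num_of_tweets"], "query": {"bool": bool_query}}


example_params = {
    "index": "tweets",
    "num_of_tweets": 10,
    "retweet": False,
    "hashtag_boost": 0.5,
    "tweet_range": ("2021-01-01", "2023-01-01"),
}
body = build_query(["Klimapolitik", "Deutschland"], ["Grüne"], example_params)
print(body)
```

Placing the date range in `filter` rather than `must` keeps it out of relevance scoring, so only the text and hashtag matches influence the ranking.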

## Execute Pipeline
Run the pipeline. The results are stored in the `/out` directory.

In [None]:
# run pipeline
from scripts import pipeline

res = pipeline.run(
    queries=QUERIES, 
    embedding_params=EMBEDDING_PARAMS,
    elastic_params=ELASTIC_PARAMS)

## Inspect Results
Load the results and look through the retrieved tweets.

In [None]:
# pair each query with its result instead of indexing by position
for query, result in zip(QUERIES, res):
    print("Query:", query, "\n")
    # print the text of every retrieved tweet
    for tweet in result["tweets"]:
        print("->", tweet["_source"]["txt"])