# Twitter Query Expansion
In this demo file, the project **Twitter Query Expansion** is ...

## Download Models
First, different models for text processing and word embeddings are downloaded. 

For text processing, spaCy is used. The word embeddings of Word2Vec and FastText are 

In [1]:
SPACY_MODEL = "de_core_news_lg"

In [7]:
import subprocess
import sys

# Download the spaCy model with the same interpreter that runs this notebook
subprocess.run([sys.executable, "-m", "spacy", "download", SPACY_MODEL], check=True)
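
spaCy pipelines are installed as regular Python packages, so the download can be skipped when the model is already present. The following is a minimal sketch of such a check; the helper `is_installed` is hypothetical and not part of this project.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Possible use in the notebook (assumption: the model name equals the package name):
# if not is_installed("de_core_news_lg"):
#     ...run the download cell...
print(is_installed("json"))
```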

Download FastText model

In [None]:
# Download the German model from the fastText website
subprocess.run(["wget", "-P", "./data/fasttext", "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz"], check=True)

# Decompress the fastText model (gunzip replaces the .gz archive with the .bin file)
subprocess.run(["gunzip", "./data/fasttext/cc.de.300.bin.gz"], check=True)
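
On systems without `wget` or `gunzip`, the same two steps can be done with the Python standard library alone. This is a sketch under that assumption; the helper names `gunzip_file` and `download_and_unpack` are hypothetical, and unlike `gunzip` the sketch keeps the original archive.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

FASTTEXT_URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz"


def gunzip_file(archive: Path) -> Path:
    """Decompress a .gz file next to itself and return the path of the result."""
    target = archive.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks instead of loading the whole file
    return target


def download_and_unpack(url: str, directory: Path) -> Path:
    """Download a .gz file into `directory` and decompress it there."""
    directory.mkdir(parents=True, exist_ok=True)
    archive = directory / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive)
    return gunzip_file(archive)


# The model is several gigabytes, so the call is left commented out here:
# download_and_unpack(FASTTEXT_URL, Path("./data/fasttext"))
```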

Download Word2Vec model

In [None]:
# Download the German model from the devmount website
subprocess.run(["wget", "-P", "./data/fasttext", "https://cloud.devmount.de/d2bc5672c523b086/german.model"], check=True)

## Define Queries
Define the queries on which the query expansion is evaluated.

In [2]:
QUERIES = [
    "@amthor Ist die große Koalition gescheitert unter Merkel? #Groko#SPD #CDU",
    "#Lauterbach Virus! gelogen #Covid19 #Corona"
]
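
The queries mix free text with hashtags and user mentions. To illustrate how these parts can be told apart, here is a small sketch based on regular expressions. The function `split_query` is hypothetical; the actual pipeline may tokenize queries differently.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def split_query(query: str) -> dict:
    """Separate hashtags and user mentions from the remaining free text."""
    hashtags = HASHTAG.findall(query)
    mentions = MENTION.findall(query)
    # Remove the tagged tokens, then collapse the leftover whitespace.
    text = MENTION.sub(" ", HASHTAG.sub(" ", query))
    return {"hashtags": hashtags, "mentions": mentions, "text": " ".join(text.split())}


parts = split_query("@amthor Ist die große Koalition gescheitert unter Merkel? #Groko#SPD #CDU")
print(parts)
```

Because the pattern matches each `#` separately, adjacent hashtags without a space, such as `#Groko#SPD`, are still returned as two hashtags.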

## Set Pipeline Parameters
Modify the following parameters to change how the word embeddings are used and how the Elasticsearch query is built.

In [3]:
EMBEDDING_PARAMS = {
    "type": "word2vec",
    "pos_list": ["NOUN","ADJ","VERB"],
    "entity": True,
    "hashtag": False,
    "user": False,
    "num_nearest_terms": 3
}
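
The parameter `num_nearest_terms` presumably sets how many nearest neighbours in the embedding space are added for each selected term. The sketch below shows the idea with cosine similarity on toy vectors. The function `nearest_terms` and the vectors are invented for illustration; the pipeline itself would query the Word2Vec or FastText model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_terms(term, vectors, k):
    """Return the k vocabulary terms most similar to `term`, excluding the term itself."""
    if term not in vectors:
        return []  # out-of-vocabulary terms are left unexpanded
    scored = [
        (cosine(vectors[term], vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Toy 2-dimensional vectors, invented purely for illustration.
TOY_VECTORS = {
    "koalition": [1.0, 0.0],
    "bündnis": [0.9, 0.1],
    "regierung": [0.7, 0.3],
    "virus": [0.0, 1.0],
}

print(nearest_terms("koalition", TOY_VECTORS, k=2))
```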

ELASTIC_PARAMS = {
    "index": "tweets",
    "retweet": False,
    "hashtag_boost": 0.5,
    "tweet_range": ("2021-01-01", "2023-01-01")
}
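
To make the role of these settings concrete, the following sketch assembles an Elasticsearch query body from expanded terms and `ELASTIC_PARAMS`-style settings. It is an assumption about how such a query could look, not the project's actual query: `build_query` is hypothetical, the field names `txt`, `hashtags`, and `created_at` are taken from the result documents shown in this notebook, and the `retweet` flag is ignored because the source does not show how retweets are marked.

```python
import json


def build_query(terms, hashtags, params):
    """Assemble a bool query from expanded terms, hashtags, and the given settings."""
    start, end = params["tweet_range"]
    should = []
    if hashtags:
        # Assumption: hashtag matches are optional and weighted by hashtag_boost.
        should.append({"terms": {"hashtags": hashtags, "boost": params["hashtag_boost"]}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {"txt": {"query": " ".join(terms), "operator": "or"}}}],
                "should": should,
                "filter": [{"range": {"created_at": {"gte": start, "lte": end}}}],
            }
        }
    }


example_params = {
    "index": "tweets",
    "retweet": False,
    "hashtag_boost": 0.5,
    "tweet_range": ("2021-01-01", "2023-01-01"),
}
body = build_query(["koalition", "bündnis"], ["groko"], example_params)
print(json.dumps(body, indent=2, ensure_ascii=False))
```

Placing the date range in `filter` keeps it from influencing the relevance score, while the `should` clause only raises the score of tweets that carry a matching hashtag.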

## Execute Pipeline
Run the pipeline. The results are stored in the `out` directory.

In [4]:
# run pipeline
from scripts import pipeline

pipeline.run(
    queries=QUERIES, 
    spacy_model=SPACY_MODEL,
    embedding_params=EMBEDDING_PARAMS,
    elastic_params=ELASTIC_PARAMS)

Successfully connected to https://localhost:9200


## Inspect Results
Load the results and look through the retrieved tweets.

In [14]:
import os
import json

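# Load the log file written by the pipeline run
with open(os.path.join("out", "word2vec", "09-01-23_11:08:31.log.json"), "r") as f:
    results = json.load(f)

# Collect the retrieved tweets of every entry in the log
[entry["tweets"] for entry in results["tweets"]]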

[[{'_index': 'tweets',
   '_id': '1435949446152531970',
   '_score': 6.772748,
   '_source': {'retweet_count': 7,
    'reply_count': 0,
    'like_count': 22,
    'created_at': '2021-09-09T14:53:21+02:00',
    'txt': 'Die Große Koalition hat einfach zugeschaut, wie die Netzbetreiber sich ein Zwei-Klassen-Internet basteln. Bis die vom EuGH gerügten Angebote von Markt verschwinden, wird es noch eine Weile dauern. #Netzneutralität\n@netzpolitik_org @TabeaRoessner \nhttps://t.co/6g6aVWtHv9',
    'hashtags': ['netzneutralität'],
    'word_count': 31}},
  {'_index': 'tweets',
   '_id': '1433155153272705026',
   '_score': 6.7635736,
   '_source': {'retweet_count': 0,
    'reply_count': 8,
    'like_count': 16,
    'created_at': '2021-09-01T21:49:49+02:00',
    'txt': 'Stand jetzt läuft es darauf hinaus, dass @c_lindner entscheidet, wer Bundeskanzler werden wird oder es gibt wieder eine "Große Koalition". Macht das nur mir Bauchweh und schlaflose Nächte? #btw21',
    'hashtags': ['btw21'],
    