# SNRM Extension Steindl

## Preparations

*   Check out the original SNRM code with the extended functions
*   Download the datasets and embeddings
*   Extract the downloaded files
*   Move the required data files into the project directory
*   Set up Anaconda with the package dependencies

### OPTIONAL: Checkout 'snrm-extension' project from Google Drive

Mount Google Drive only if you do not want to connect to GitHub directly. The next step clones the project from GitHub, so this step is not required.
It assumes that your Google Drive already holds the snrm-extension project or some of the datasets used.

In [0]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
!ls -lah "/content/drive/My Drive"
# !unzip -q "/content/drive/My Drive/snrm-extension.zip"
# !cp -R "/content/drive/My Drive/snrm-extension" "/content/snrm-extension"

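### Checkout GitHub Repo 'snrm-extension'
You will be asked for your GitHub credentials. GitHub requires a personal access token in place of the account password for Git operations over HTTPS.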

In [0]:
from getpass import getpass
import os
import urllib.parse # a plain `import urllib` does not reliably expose urllib.parse

%cd /content/
!rm -rf /content/snrm-extension # delete previous checkout of repo
!pwd
repository_owner = 'Bernhard-St'
repository_name = 'snrm-extension'
# ask for username and password (the input is not echoed)
user = getpass('enter GitHub username: ')
password = getpass('enter GitHub password: ')
# percent-encode special characters so that they cannot break the clone URL
user = urllib.parse.quote(user, safe='')
password = urllib.parse.quote(password, safe='')

cmd_string = 'git clone https://{0}:{1}@github.com/{2}/{3}.git'.format(user, password, repository_owner, repository_name)
os.system(cmd_string) # execute git clone with provided arguments
user, password, cmd_string = "", "", "" # overwrite the credentials so they do not linger in the session
%cd /content/snrm-extension
! git status
! git checkout extension # switch to implementation branch
!chmod 770 -R /content/snrm-extension # set permissions
!ls -lah /content/snrm-extension
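
The clone cell above interpolates the credentials into a shell command string. As a minimal sketch of an alternative, the helpers below percent-encode both parts of the credentials and build an argument list that can be handed to `subprocess.run` without a shell. `build_clone_url` and `clone_command` are hypothetical names used for illustration only; they are not part of the snrm-extension project.

```python
from urllib.parse import quote


def build_clone_url(user, token, owner, repo):
    """Return an HTTPS clone URL with percent-encoded credentials."""
    # safe='' also encodes '/', '@' and ':', which would otherwise break the URL
    return 'https://{0}:{1}@github.com/{2}/{3}.git'.format(
        quote(user, safe=''), quote(token, safe=''), owner, repo)


def clone_command(url, target_dir, branch='extension'):
    """Hypothetical helper: argument list for subprocess.run (no shell involved)."""
    return ['git', 'clone', '--branch', branch, url, target_dir]


url = build_clone_url('me', 'p@ss/w:rd', 'Bernhard-St', 'snrm-extension')
command = clone_command(url, '/content/snrm-extension')
```

In the notebook this could be run as `subprocess.run(command, check=True)`, which raises an exception if the clone fails instead of continuing silently.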

In [0]:
%cd /content/snrm-extension
!git pull

In [0]:
%cd /content/snrm-extension
!git reset --hard

### Download Datasets, Embedding
We download the MS MARCO files collection.tsv and triples.train.small.tsv, as well as GloVe embeddings, which are available in several vector dimensions.

In [0]:
%cd /content/
!pwd
# download collection.tsv: PID\tPASSAGE
!wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz

# cache file if needed in Google Drive
#!cp /content/collection.tar.gz /content/drive/My\ Drive 

In [0]:
%cd /content/
!pwd
# download triples.train.small: QUERY\tPOSITIVE_PASSAGE\tNEGATIVE_PASSAGE
!wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz

# cache file if needed in Google Drive
#!cp /content/triples.train.small.tar.gz /content/drive/My\ Drive 

In [0]:
%cd /content/
!pwd
# download GloVe embeddings - choose whichever you like
# mind the different vector dimensions and embedding file names
# you have to edit the application if you choose a different embedding
# !wget http://nlp.stanford.edu/data/glove.42B.300d.zip

!wget http://nlp.stanford.edu/data/glove.6B.zip

# cache file if needed in Google Drive
#!cp "/content/glove.6B.zip" "/content/drive/My Drive"
#!cp "/content/glove.42B.300d.zip" "/content/drive/My Drive"

If you like, you can instead copy the datasets from your Google Drive, provided you cached them there (this requires enough Drive space).

In [0]:
#!cp "/content/drive/My Drive/glove.6B.zip" "/content/glove.6B.zip"
#!cp "/content/drive/My Drive/glove.42B.300d.zip" "/content/glove.42B.300d.zip"
#!cp "/content/drive/My Drive/triples.train.small.tar.gz" "/content/triples.train.small.tar.gz"
#!cp "/content/drive/My Drive/collection.tar.gz" "/content/collection.tar.gz"
#!cp "/content/drive/My Drive/collection.tsv" "/content/collection.tsv"

### Unzip downloads

In [0]:
%cd /content/
!tar -zxvf  "/content/collection.tar.gz"
!tar -zxvf  "/content/triples.train.small.tar.gz"
!unzip -q "/content/glove.6B.zip"

#!unzip -q "/content/glove.42B.300d.zip"

!ls -lah /content/
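
The `unzip` call above extracts every file in glove.6B.zip, although the project only needs one of them. A minimal sketch of extracting a single member with Python's `zipfile` module follows. `extract_member` is a hypothetical helper, and the demo archive is a tiny stand-in created on the fly so the snippet runs without the real download.

```python
import os
import tempfile
import zipfile


def extract_member(zip_path, member, target_dir):
    """Extract a single file from a zip archive and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise KeyError('{0} not found in {1}'.format(member, zip_path))
        return archive.extract(member, path=target_dir)


# Stand-in archive with two fake embedding files (assumption: demo data only)
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w') as archive:
    archive.writestr('glove.6B.50d.txt', 'the 0.1 0.2\n')
    archive.writestr('glove.6B.100d.txt', 'the 0.1 0.2 0.3\n')

extracted = extract_member(demo_zip, 'glove.6B.100d.txt', demo_dir)
print(extracted)
```

With the real download, the call would be `extract_member('/content/glove.6B.zip', 'glove.6B.100d.txt', '/content')`.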

### Move required files into snrm-extension project directory

In [0]:
!cp "/content/glove.6B.100d.txt" "/content/snrm-extension/data/embeddings"

# you have to edit the application if you choose to use a different embedding
#!cp "/content/glove.6B.200d.txt" "/content/snrm-extension/data/embeddings"
#!cp "/content/glove.6B.300d.txt" "/content/snrm-extension/data/embeddings"
#!cp "/content/glove.42B.300d.txt" "/content/snrm-extension/data/embeddings"
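
Because the application has to be edited when a different embedding is used, it can help to confirm the vector size of the copied file before training. The sketch below infers the dimension from the first line of a GloVe text file, assuming each line holds a token followed by space-separated vector components. `embedding_dimension` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import io


def embedding_dimension(handle):
    """Infer the vector size from the first line of a GloVe text file."""
    first_line = handle.readline().rstrip('\n')
    # One token, then one field per vector component
    return len(first_line.split(' ')) - 1


# Stand-in for open('/content/snrm-extension/data/embeddings/glove.6B.100d.txt', encoding='utf-8')
sample = io.StringIO('the 0.418 0.24968 -0.41242 0.1217\n')
dimension = embedding_dimension(sample)
print(dimension)  # → 4
```

For glove.6B.100d.txt the result should be 100, matching the `EMB_DIM=100` value printed by the training task.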

In [0]:
!cp "/content/collection.tsv" "/content/snrm-extension/data/document_collection"
!cp "/content/triples.train.small.tsv" "/content/snrm-extension/data/training_data"
!cp "/content/snrm-extension/triples.train.small.tsv" "/content/snrm-extension/data/training_data"
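
Before training, a quick look at the first rows can confirm that the copied files have the expected tab-separated layout (PID and passage for the collection; query, positive passage and negative passage for the triples). This is a minimal sketch: `read_tsv_rows` is a hypothetical helper, and the in-memory samples stand in for the real files.

```python
import io


def read_tsv_rows(handle, expected_fields, limit=3):
    """Return the first `limit` rows, checking the tab-separated field count."""
    rows = []
    for line_number, line in enumerate(handle, start=1):
        if line_number > limit:
            break
        fields = line.rstrip('\n').split('\t')
        if len(fields) != expected_fields:
            raise ValueError('line {0}: expected {1} fields, found {2}'.format(
                line_number, expected_fields, len(fields)))
        rows.append(fields)
    return rows


# Stand-ins (assumption: demo data); in the notebook, open the real files instead, e.g.
# open('/content/snrm-extension/data/document_collection/collection.tsv', encoding='utf-8')
collection_sample = io.StringIO('0\tfirst passage text\n1\tsecond passage text\n')
triples_sample = io.StringIO('a query\ta relevant passage\tan irrelevant passage\n')

passages = read_tsv_rows(collection_sample, expected_fields=2)
triples = read_tsv_rows(triples_sample, expected_fields=3)
print(passages[0], triples[0])
```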

### Setup Anaconda and install project dependencies

In [0]:
!wget -c https://repo.anaconda.com/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
!chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
!bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

In [0]:
!conda --version

In [0]:
!conda install -q -y -c conda-forge python=3.6.9 tensorflow=1.4.0 nltk=3.4.5 numpy=1.16.4 

In [0]:
!conda env list
!conda list -n base

## Run executable Tasks
**Available tasks:**

*   create dictionary (dictionary.py)
*   train and create model (train.py)
*   create inverted index (index_construction.py)
*   run retrieval on the test query collection and write the retrieval result candidate file for evaluation against qrels (retrieval.py)

A task usually depends on the results of the preceding tasks, so run them in the order listed.

In [0]:
%cd /content/snrm-extension/
!python code/params.py

In [0]:
%cd /content/snrm-extension/
# test only: checks that the dictionary needed by the subsequent tasks can be created
!python code/dictionary.py 

In [0]:
%cd /content/snrm-extension/
!python code/train.py

/content/snrm-extension
BASE_PATH=
BATCH_SIZE=128
DICT_FILE_NAME=data/allen_vocab_lower_10/tokens.txt
DICT_MIN_FREQ=20
DOCUMENT_COLLECTION_FILE=data/document_collection/collection.tsv
DROPOUT_PARAMETER=1.0
EMB_DIM=100
EVALUATION_QUERY_FILE=data/evaluation/queries.dev.small.tsv
EVALUATION_RESULT_CANDIDATE_FILE_PREFIX=results/evaluation_candidate_
EXPERIMENT_MODE=False
HIDDEN_1=80
HIDDEN_2=500
HIDDEN_3=5000
HIDDEN_4=-1
HIDDEN_5=-1
LEARNING_RATE=0.0001
LOG_PATH=tf-log/
MAX_DOC_LEN=128
MAX_Q_LEN=10
MODEL_PATH=model/
NUM_TRAIN_STEPS=10000
NUM_VALID_STEPS=1000
PRE_TRAINED_EMBEDDING_FILE_NAME=data/embeddings/glove.6B.100d.txt
REGULARIZATION_TERM=0.0001
RESULT_PATH=results/
RUN_NAME=snrm-extension-example-run
SAVE_SNAPSHOT_EVERY_N_STEPS=10000
TRAINING_DATA_TRIPLES_FILE=data/training_data/triples.train.small.tsv
VALIDATE_EVERY_N_STEPS=20000
VALIDATION_DATA_TRIPLES_FILE=

Running with TensorFlow version: 1.4.0

Python System information: 3.6.9 |Anaconda, Inc.| (default,

In [0]:
%cd /content/snrm-extension/
!python code/index_construction.py

In [0]:
%cd /content/snrm-extension/
!python code/retrieval.py

In [0]:
# sequential batch execution
%cd /content/snrm-extension/
!python code/train.py
!python code/index_construction.py
!python code/retrieval.py
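
In the batch cell above, each `!python` line runs even if the previous one failed. Since the tasks build on each other, a small runner that stops at the first failure may be preferable. The following is a minimal sketch: `run_pipeline` is a hypothetical helper, and the demo commands are stand-ins so the snippet runs anywhere.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run commands in order and stop at the first non-zero exit code."""
    completed = []
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            print('stopped: {0} exited with code {1}'.format(
                ' '.join(command), result.returncode))
            break
        completed.append(command)
    return completed


# Stand-in commands: the second one fails, so the third never runs
ok = [sys.executable, '-c', 'pass']
bad = [sys.executable, '-c', 'import sys; sys.exit(3)']
finished = run_pipeline([ok, bad, ok])
print(len(finished))  # → 1
```

In the notebook, with /content/snrm-extension as the working directory, the call could be `run_pipeline([['python', 'code/train.py'], ['python', 'code/index_construction.py'], ['python', 'code/retrieval.py']])`.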