# Embeddings Exercise
## Preparation
- 1 Download the two Word2Vec embedding files (itWaC and Twitter) from [link](http://www.italianlp.it/resources/italian-word-embeddings/). Both are stored as SQLite databases, and each embedding has 128 dimensions.
- 2 Get the [HaSpeeDe 2020](https://ceur-ws.org/Vol-2765/paper162.pdf) dataset.
- 3 Install the pandas and scikit-learn libraries.

## Exercise
Use the word embeddings from step 1 to classify hate speech in the dataset from step 2. The HaSpeeDe dataset contains Twitter data labeled as "Hate" or "No-Hate".
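
One common baseline for this kind of task, offered here only as a minimal sketch and not as the intended solution, is to represent each tweet by the average of its word vectors and train a linear classifier on those averages. The helper `tweet_vector`, the whitespace tokenization, the choice of `LogisticRegression`, and the toy 2-dimensional vectors below are all assumptions made for illustration; with the real data the vectors would come from the downloaded embeddings and have 128 dimensions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def tweet_vector(text, lookup, dim):
    """Average the vectors of the tokens found in `lookup` (hypothetical helper).

    Tokens are obtained by lowercasing and splitting on whitespace, which is
    a simplifying assumption. Texts with no known token map to a zero vector.
    """
    vectors = [lookup[tok] for tok in text.lower().split() if tok in lookup]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy word vectors (made up for illustration, not taken from the real files)
toy_lookup = {
    "odio": np.array([1.0, 0.0]),
    "schifo": np.array([0.9, 0.1]),
    "amo": np.array([0.0, 1.0]),
    "bello": np.array([0.1, 0.9]),
}

# Labels are strings, matching the "0"/"1" labels read from the dataset
train_texts = ["odio schifo", "amo bello", "schifo", "bello"]
train_labels = ["1", "0", "1", "0"]

X_train = np.vstack([tweet_vector(t, toy_lookup, 2) for t in train_texts])
clf = LogisticRegression().fit(X_train, train_labels)

X_test = np.vstack([tweet_vector(t, toy_lookup, 2) for t in ["odio", "amo"]])
y_pred = clf.predict(X_test)
```

When applying this idea to the real dataset, it is usually worth holding out part of the examples for evaluation, for instance with `sklearn.model_selection.train_test_split`, so that the reported scores are not measured on the training data.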


In [1]:
# Paths to the embedding files
itwac_path = 'data/word_2_vec/itwac128.sqlite'
twitter_path = 'data/word_2_vec/twitter128.sqlite'

# Path to the HaSpeeDe dataset
haspedee_dataset_path = 'hate_speech/haspeede2020/haspeede2_dev_taskAB.tsv'

# Embedding file used in the rest of the notebook
data_path = twitter_path

In [2]:
# Imports and initializations
import pandas as pd
import sqlite3
import random
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

id_to_label = {"0":"NoHate", "1": "Hate"}
text_to_id_map = {}

def read_embedings(sqllite_path):
    """Read the embeddings stored in the SQLite file at sqllite_path
    and return them as a pandas DataFrame.
    """
    con = sqlite3.connect(sqllite_path)
    df = pd.read_sql_query("SELECT * FROM store", con)
    con.close()
    return df

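def read_dataset(input_file):
    """Read the TSV dataset and return the shuffled texts and their labels.

    The first three tab-separated columns of each row are read as
    the example id, the text, and the hate label.
    """
    examples = []
    labels = []
    # The with statement closes the file, so no explicit close() is needed
    with open(input_file, 'r', encoding='utf-8', errors='ignore') as f:
        file_as_list = f.read().splitlines()
    random.shuffle(file_as_list)
    for line in file_as_list:
        # Skip the header row
        if line.startswith("id"):
            continue
        split = line.split("\t")
        text = split[1]
        label = split[2]
        # Remember which id each text came from
        text_to_id_map[text] = split[0]
        labels.append(label)
        examples.append(text)
    return examples, labels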


In [3]:
# Reading embeddings
# Each row contains a word and the corresponding embedding (128 dimensions) 
df = read_embedings(data_path)
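
For fast per-word access it can help to turn the DataFrame into a dictionary from word to vector. The sketch below assumes that the word is stored in a column named `key` and the components in columns named `dim0`, `dim1`, and so on; this layout is an assumption, so check `df.columns` on the real data first. The helper `build_lookup` and the toy frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def build_lookup(df, word_col="key", dim_prefix="dim"):
    """Map each word to its embedding vector (column names are assumed)."""
    # Keep only columns of the form <prefix><number>, e.g. dim0, dim1, ...
    dim_cols = [
        c for c in df.columns
        if c.startswith(dim_prefix) and c[len(dim_prefix):].isdigit()
    ]
    # Sort numerically so that dim10 does not come before dim2
    dim_cols.sort(key=lambda c: int(c[len(dim_prefix):]))
    matrix = df[dim_cols].to_numpy(dtype=np.float32)
    # Iterating over a 2-D array yields one row (one vector) per word
    return dict(zip(df[word_col], matrix))


# Toy frame with 2 dimensions instead of 128 (made-up values)
toy_df = pd.DataFrame({
    "key": ["ciao", "mondo"],
    "dim0": [0.1, 0.3],
    "dim1": [0.2, 0.4],
    "ranking": [1, 2],
})
lookup = build_lookup(toy_df)
```

With a dictionary, checking whether a token has an embedding is a simple `tok in lookup` test, which avoids filtering the DataFrame once per word.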

In [4]:
# x contains the texts of the dataset
# y_ref contains the reference labels as strings, i.e., "0" for no-hate and "1" for hate
x, y_ref = read_dataset(haspedee_dataset_path)

In [None]:
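# ***** PUT YOUR CODE HERE *****



# y_hyp must hold one predicted label ("0" or "1") for each reference label it is compared against
y_hyp = None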

In [None]:
# Evaluate your results with these metrics
a = accuracy_score(y_ref,y_hyp)
p = precision_score(y_ref, y_hyp, pos_label="1")
r = recall_score(y_ref, y_hyp, pos_label="1")
f1 = f1_score(y_ref, y_hyp, pos_label="1")
print("precision: " + str(p) )
print("recall: " + str(r) )
print("accuracy: " + str(a) )
print("f1: " + str(f1) )