<h1 id="Introduction-to-Python-and-Natural-Language-Technologies">Introduction to Python and Natural Language Technologies</h1>
<h2 id="Laboratory-06,-NLP-Introduction">Laboratory 06, NLP Introduction</h2>
<p><strong>March 18, 2020</strong></p>
<p><strong>&Aacute;d&aacute;m Kov&aacute;cs</strong></p>
<p>During this laboratory we are going to use the classification dataset of SemEval 2019 Task 6, called Identifying and Categorizing Offensive Language in Social Media.</p>
<h2 id="Preparation">Preparation</h2>
<p style="padding-left: 40px;"><a href="http://sandbox.hlt.bme.hu/~adaamko/glove.6B.100d.txt" target="_blank" rel="noopener">Download GloVe</a> (and place it into this directory)</p>
<p style="padding-left: 40px;">Download the dataset (with python code)</p>

In [1]:
import os
if not os.path.isdir('./data'):
    os.mkdir('./data')

import urllib.request
# urlretrieve downloads the file and returns (local filename, response headers)
urllib.request.urlretrieve("http://sandbox.hlt.bme.hu/~adaamko/offenseval.tsv", "data/offenseval.tsv")

('data/offenseval.tsv', <http.client.HTTPMessage at 0x7fe7f8183100>)

# 1. Train a Logistic Regression on the dataset

Use a CountVectorizer for featurizing your data. You can reuse the code presented during the lecture

## 1.1 Read in the dataset into a Pandas DataFrame
Use `pd.read_csv` with the correct parameters to read in the dataset. If done correctly, the `DataFrame` should have 3 columns:
`id`, `tweet`, `subtask_a`.
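Before calling `pd.read_csv`, it can help to peek at the raw file to confirm the separator and the header row. The sketch below does this with the standard library on a made-up sample (the rows are invented for illustration, not taken from the dataset); it assumes the file is tab-separated and starts with a header line.

```python
import csv
import io

# Invented sample mimicking the assumed layout of offenseval.tsv.
sample = "id\ttweet\tsubtask_a\n1\t@USER hello there\tNOT\n2\t@USER go away, now\tOFF\n"

# DictReader takes the first line as the header when fieldnames is omitted.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)  # column names read from the header line
print(rows[1]["tweet"])   # the comma stays inside the tweet field
```

With pandas, the same layout corresponds to `sep="\t"`; passing `header=0` together with `names=[...]` replaces the header row instead of keeping it as a data row.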

In [2]:
import pandas as pd
import numpy as np

In [3]:
def read_dataset():
    # The file is tab-separated. A quick way to find the separator: read the file
    # without `sep`, print the result, and compare the unsplit text with the raw file.
    dataset = pd.read_csv("./data/offenseval.tsv", sep="\t", names=["id", "tweet", "subtask_a"])

    # Because `names` is given, the file's own header line is read as data row 0, so drop it.
    # .copy() returns an independent DataFrame, so adding columns later is safe.
    final_dataset = dataset.iloc[1:].copy()
    return final_dataset
d = read_dataset()
type(d)
d
    #raise NotImplementedError()

Unnamed: 0,id,tweet,subtask_a
1,86426,@USER She should ask a few native Americans wh...,OFF
2,90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,OFF
3,16820,Amazon is investigating Chinese employees who ...,NOT
4,62688,"@USER Someone should'veTaken"" this piece of sh...",OFF
5,43605,@USER @USER Obama wanted liberals &amp; illega...,NOT
...,...,...,...
13236,95338,@USER Sometimes I get strong vibes from people...,OFF
13237,67210,Benidorm ✅ Creamfields ✅ Maga ✅ Not too sh...,NOT
13238,82921,@USER And why report this garbage. We don't g...,OFF
13239,27429,@USER Pussy,OFF


In [4]:
train_data_unprocessed = read_dataset()

assert type(train_data_unprocessed) == pd.core.frame.DataFrame
assert len(train_data_unprocessed.columns) == 3
assert (train_data_unprocessed.columns == ['id', 'tweet', 'subtask_a']).all()

## 1.2 Convert `subtask_a` into a binary label
The task is to classify the given tweets into two categories: _offensive (OFF)_ and _not offensive (NOT)_. Machine learning algorithms need integer labels instead of strings. Add a new column called `label` to the dataframe, and transform the `subtask_a` column into a binary integer label.
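A minimal sketch of the string-to-integer mapping on a plain Python list. Which class gets 1 is a convention; the mapping below (`NOT` to 1, `OFF` to 0) is an assumption for this example, and the example labels are invented.

```python
from collections import Counter

# Assumed convention: NOT (not offensive) -> 1, OFF (offensive) -> 0.
LABEL_MAP = {"NOT": 1, "OFF": 0}

subtask_a = ["OFF", "OFF", "NOT", "OFF", "NOT"]  # invented example labels
labels = [LABEL_MAP[value] for value in subtask_a]

print(labels)
print(Counter(labels))  # class balance, similar to groupby("label").size()
```

In pandas the equivalent would be `train_data["subtask_a"].map(LABEL_MAP)`.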

In [5]:
def transform(train_data):
    # YOUR CODE HERE
    # Mapping used here: NOT (not offensive) -> 1, OFF (offensive) -> 0
    train_data["label"] = train_data.subtask_a.apply(lambda x: 1 if x == "NOT" else 0)
    # reference for 'if, else' in lambda: https://thispointer.com/python-how-to-use-if-else-elif-in-lambda-functions/
    return train_data
    #raise NotImplementedError()
    
d=transform(train_data_unprocessed)
d
type(d.iloc[0, 3])

numpy.int64

In [6]:
from pandas.api.types import is_numeric_dtype

train_data = transform(train_data_unprocessed)

assert "label" in train_data
assert is_numeric_dtype(train_data.label)
assert (train_data.label.isin([0,1])).all()

In [7]:
train_data.groupby("label").size()

label
0    4400
1    8840
dtype: int64

## 1.3 Initialize CountVectorizer and _train_ it on the _tweet_ column of the dataset
The _training_ will prepare the vocabulary for us, so we will be able to use it for training a LogisticRegression algorithm later. Set `max_features` to 5000 so the vocabulary won't be too big for training. Also filter out English `stop_words`.
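To make the role of `max_features` and `stop_words` concrete, here is a bag-of-words sketch in plain Python. It is not how `CountVectorizer` is implemented: the tokenizer (lowercased whitespace split), the tiny stop-word set and the helper names `fit_vocabulary` and `to_counts` are all assumptions for illustration.

```python
from collections import Counter

STOP_WORDS = {"the", "is", "a", "to", "this"}  # toy stop-word list

def fit_vocabulary(texts, max_features):
    """Keep the max_features most frequent non-stop words, indexed alphabetically."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in STOP_WORDS
    )
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: index for index, word in enumerate(sorted(kept))}

def to_counts(text, vocabulary):
    """Return one count per vocabulary entry; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

texts = ["this is the nlp lab", "the nlp course is fun", "nlp lab nlp"]
vocabulary = fit_vocabulary(texts, max_features=2)
print(vocabulary)
print(to_counts("hello this is the nlp lab nlp", vocabulary))
```

Every input text becomes a vector of the same length as the vocabulary, which is why the lab's assertion expects shape `(1, 5000)` regardless of the sentence.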

In [8]:
# We will need to use a random seed for our methods so they will be reproducible
SEED = 1234

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# BAG OF WORDS:

def prepare_vectorizer(train_data):
    # YOUR CODE HERE
    # Keep the 5000 most frequent words and drop the words on
    # scikit-learn's built-in English stop-word list.
    vectorizer = CountVectorizer(max_features=5000, stop_words="english")

    # fit() builds the vocabulary from the tweets and returns the fitted vectorizer itself.
    # Every word in the vocabulary is one feature (one column of the output).
    vectorizer.fit(train_data.tweet)
    return vectorizer

    #raise NotImplementedError()
vectorizer = prepare_vectorizer(train_data)
vectorizer.vocabulary_
vectorizer

# transform() turns each input text into a vector with one entry per vocabulary word
# (5000 here); each entry counts how often that word occurs in the text.
transformed = vectorizer.transform(["hello this is the intro to nlp"])
transformed

In [10]:
vectorizer = prepare_vectorizer(train_data)

transformed = vectorizer.transform(["hello this is the intro to nlp"])
assert transformed.dtype == np.dtype('int64')
assert transformed.shape == (1, 5000)

## 1.4 Featurize the dataset with the prepared CountVectorizer, and split it into _train_ and _test_ dataset
You should use the random seed when you are splitting the dataset. The ratio of the training dataset to the test dataset should be 70% to 30%.
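A sketch of a seeded 70/30 split in plain Python, to show why fixing the seed makes the split reproducible. `split_70_30` is a hypothetical helper; in the lab itself `train_test_split` with `test_size=0.3` and a fixed `random_state` plays this role, and its shuffling will not match this one.

```python
import random

def split_70_30(items, seed):
    """Shuffle a copy of items with a fixed seed, then cut at 70%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train_a, test_a = split_70_30(data, seed=1234)
train_b, test_b = split_70_30(data, seed=1234)

print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```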

In [11]:
import gensim
from tqdm import tqdm
from sklearn.model_selection import train_test_split as split

def vectorize_to_bow(tr_data, tst_data, vectorizer):   # bow: bag of words
    # YOUR CODE HERE
    # Turn the raw texts into numerical feature vectors that the classifier can consume.
    tr_vectors = vectorizer.transform(tr_data)
    tst_vectors = vectorizer.transform(tst_data)
    return tr_vectors, tst_vectors
    #raise NotImplementedError()

def get_features_and_labels(data, labels, vectorizer):
    # YOUR CODE HERE
    # 70% train / 30% test; the fixed seed makes the split reproducible.
    tr_data, tst_data, tr_labels, tst_labels = split(data, labels, test_size=0.3, random_state=SEED)

    tr_vecs, tst_vecs = vectorize_to_bow(tr_data, tst_data, vectorizer)
    return tr_vecs, tr_labels, tst_vecs, tst_labels
    #raise NotImplementedError()
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(train_data.tweet, train_data.label, vectorizer)
tr_vecs.shape
type(tr_vecs)

scipy.sparse.csr.csr_matrix

In [12]:
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(train_data.tweet, train_data.label, vectorizer)
assert tr_vecs.shape == (9268, 5000)
assert tr_labels.shape == (9268,)
assert tst_vecs.shape == (3972, 5000)
assert tst_labels.shape == (3972,)
assert tr_vecs[0].toarray().shape == (1, 5000)

In [13]:
# Import a bunch of stuff from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# We will train a LogisticRegression algorithm for the classification
lr  = LogisticRegression(n_jobs=-1)

## 1.5 Train and evaluate your method!

In [14]:
# Training on the train dataset
# YOUR CODE HERE
lr.fit(tr_vecs, tr_labels)
#raise NotImplementedError()

LogisticRegression(n_jobs=-1)

In [15]:
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError

try:
    check_is_fitted(lr)
except NotFittedError as e:
    assert None, repr(e)

In [16]:
from sklearn.metrics import accuracy_score

# Evaluation on the test dataset
def preds(lr, tst_vecs):
    # YOUR CODE HERE
    #print(type(tst_vecs))

    lr_pred = lr.predict(tst_vecs)
    #print("Logistic Regression Test accuracy : {}".format(accuracy_score(tst_labels, lr_pred)))
    return lr_pred
    #raise NotImplementedError()

In [17]:
# If you have done everything right, the accuracy should be around 75%
lr_pred = preds(lr, tst_vecs)
assert lr_pred.shape == (3972,)
print("Logistic Regression Test accuracy : {}".format(
    accuracy_score(tst_labels, lr_pred)))

Logistic Regression Test accuracy : 0.7560422960725075


## 1.1 Change to TfidfVectorizer, and also change the configuration

Look up the documentation of [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). It has a lot of parameters to play with. 

This time, change the parameters to include _maximum_ of __10000__ features. Also include filtering of _stopwords_ and _lowercasing_ the features. (hint: look at the parameter names in the documentation)

[_ngram_](https://en.wikipedia.org/wiki/N-gram) features can also improve the performance of the model. A bigram is an n-gram for n=2, a trigram is an n-gram for n=3, etc.


Bigram features include not only the single words in the vocabulary, but also the frequency of every bigram occurring in the text (e.g. not only the words _brown_ and _dog_ but also __brown dog__).

Change the configuration of the _TfidfVectorizer_ to also include the _bigrams_ and _trigrams_ in the vocabulary.
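The n-gram idea can be sketched in a few lines. `ngrams` is a hypothetical helper that works on whitespace tokens; `TfidfVectorizer` builds its n-grams internally through `ngram_range=(min_n, max_n)`, so `(1, 3)` would cover unigrams, bigrams and trigrams together.

```python
def ngrams(text, min_n, max_n):
    """Return all word n-grams of text for min_n <= n <= max_n."""
    tokens = text.lower().split()
    result = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

features = ngrams("the brown dog barks", 1, 2)
print(features)
```

Note that a range starting at 2 would leave out the single words entirely.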


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

def prepare_tfidf_vectorizer(train_data):
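    # Note: ngram_range=(2, 3) keeps only bigrams and trigrams; ngram_range=(1, 3)
    # would also keep the single words, as the task description asks.
    vectorizer = TfidfVectorizer(max_features=10000, stop_words="english", lowercase=True, ngram_range=(2, 3))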
    feature_vector = vectorizer.fit(train_data.tweet)
    
    return feature_vector
    
    # YOUR CODE HERE
    #raise NotImplementedError()


In [19]:
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(
    train_data.tweet, train_data.label, prepare_tfidf_vectorizer(train_data))

In [22]:
# Train and evaluate! 
lr  = LogisticRegression(n_jobs=-1)


#lr.fit...
lr.fit(tr_vecs, tr_labels)

# Evaluation on the test dataset
#lr_pred = ..
lr_pred = lr.predict(tst_vecs)
# YOUR CODE HERE
print("Logistic Regression Test accuracy : {}".format(
    accuracy_score(tst_labels, lr_pred)))
#raise NotImplementedError()

Logistic Regression Test accuracy : 0.6852970795568983


In [23]:
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError

try:
    check_is_fitted(lr)
except NotFittedError as e:
    assert None, repr(e)

## 1.2 Write a custom tokenizer for TfidfVectorizer

Right now, the vectorizer uses its own tokenizer for creating the vocabulary. You can also create a custom function and tell the vectorizer to use that when tokenizing the text.

Use [spacy](https://spacy.io/) for tokenization. Write your own function.

Your function should:
- get a sentence as an input
- run spacy on the input text
- return a token list that includes:
    - filtering of stop words
    - filtering of punctuation
    - lemmatizing the text
    - lowercasing the text
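As a rough picture of what such a tokenizer returns, here is a sketch that does the filtering and lowercasing steps with the standard library only. It skips lemmatization entirely (that is what spaCy's `token.lemma_` is for), and the stop-word set and the name `simple_tokenizer` are made up for this example.

```python
import re

TOY_STOP_WORDS = {"this", "is", "the", "a", "and", "should", "not", "any", "be"}

def simple_tokenizer(sentence):
    """Lowercase, drop punctuation and stop words; no lemmatization."""
    # \w+ keeps runs of letters, digits and underscores, so punctuation is dropped.
    words = re.findall(r"\w+", sentence.lower())
    return [word for word in words if word not in TOY_STOP_WORDS]

print(simple_tokenizer("This is the NLP lab, and this text should be lowercased."))
```

Any callable with this shape (string in, list of tokens out) can be passed as `tokenizer=` to `TfidfVectorizer`.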

In [24]:
import spacy

nlp = spacy.load('en_core_web_sm')


def spacy_tokenizer(sentence):
    # YOUR CODE HERE

    # Run the spaCy pipeline on the sentence and collect the lemma of every token.
    doc = nlp(sentence)
    tokens_lemma = [token.lemma_ for token in doc]

    # Re-parse the lemmatized text (" ".join turns the list back into one string)
    # so that stop words and punctuation are detected on the lemmas.
    doc1 = nlp(" ".join(tokens_lemma))

    # Remove stop words:
    for token in doc1:
        if token.is_stop:
            rw = token.orth_.lower()  # orth_ is the token's text as a string
            if rw in tokens_lemma:
                tokens_lemma.remove(rw)

    # Remove punctuation:
    for token in doc1:
        if token.pos_ == "PUNCT":
            rw = token.orth_
            if rw in tokens_lemma:
                tokens_lemma.remove(rw)

    # Lowercase the remaining tokens:
    return [token.lower() for token in tokens_lemma]


# SELF_TEST:
f=spacy_tokenizer("This is the NLP lab, this text should not contain any punctuations and stopwords, and the text should be lowercased.")
print(f)

vectorizer_with_spacy = TfidfVectorizer(
    max_features=10000, tokenizer=spacy_tokenizer)

['nlp', 'lab', 'text', 'contain', 'punctuation', 'stopword', 'text', 'lowercase']


In [25]:
assert (spacy_tokenizer("This is the NLP lab, this text should not contain any punctuations and stopwords, and the text should be lowercased.") == [
        'nlp', 'lab', 'text', 'contain', 'punctuation', 'stopword', 'text', 'lowercase'])

In [26]:
X = vectorizer_with_spacy.fit(train_data.tweet)

tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(train_data.tweet, train_data.label, X)



In [27]:
# Train and evaluate! 
# If you have done everything right you should get the same or a little better performance than the standard
# TfidfVectorizer and CountVectorizer
lr  = LogisticRegression(n_jobs=-1) # LogisticRegression is a Classifier

#lr.fit...
lr.fit(tr_vecs, tr_labels)
#lr_pred = ..
lr_pred = lr.predict(tst_vecs)
print("Logistic Regression Test accuracy : {}".format(
    accuracy_score(tst_labels, lr_pred)))
# YOUR CODE HERE
#raise NotImplementedError()

Logistic Regression Test accuracy : 0.7565458207452165


# 2. Word embeddings

## 2.1 Transform word vectors to sentence vector taking the average of the word vectors
Word vectors map words into a vector space where similar words have similar vectors.
These vectors can be used as features for ML algorithms. To featurize a sentence, however, you first need to create a _sentence vector_ from the vectors of its words. The easiest way of turning word vectors into a sentence vector is to take the average of all the word vectors.

![ww](https://www.researchgate.net/profile/Md-Shajalal/publication/329394770/figure/fig1/AS:701809937088513@1544335936936/A-framework-for-learning-word-vectors-7_W640.jpg)

In [29]:
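# Load the embedding file
embedding_file = "glove.6B.100d.txt"

model = gensim.models.KeyedVectors.load_word2vec_format(embedding_file, binary=False)
# load_word2vec_format already returns a KeyedVectors object,
# so the deprecated `.wv` indirection is not needed.
vectorizer = model
# `model.vectors` has one row per vocabulary word.
vocab_length = len(model.vectors)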




**Your transform function should:**
- tokenize the sentence with the spacy tokenizer
- get the embedding vector:
    - get the embedding vector from the model if the word is in the vocabulary
    - initialize a vector with zeros with the same dimension if the word is not in the vocabulary
- take the mean of the word vectors to return a sentence vector
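The averaging step can be sketched with toy data. The three-dimensional `toy_embeddings` dictionary and the helper `sentence_vector` are invented for illustration; the real GloVe vectors have 100 dimensions and the tokens would come from the spaCy tokenizer.

```python
toy_embeddings = {
    "nlp": [1.0, 0.0, 2.0],
    "lab": [3.0, 2.0, 0.0],
}
DIM = 3

def sentence_vector(tokens, embeddings, dim):
    """Average the token vectors; unknown tokens count as zero vectors."""
    if not tokens:
        return [0.0] * dim  # avoid dividing by zero for empty input
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token, [0.0] * dim)
        total = [t + v for t, v in zip(total, vector)]
    return [t / len(tokens) for t in total]

print(sentence_vector(["nlp", "lab", "unknownword", "anotherone"], toy_embeddings, DIM))
```

Out-of-vocabulary tokens still count in the denominator, so a sentence with many unknown words is pulled towards the zero vector.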

In [30]:
vocab_length

# Check which words are in the embedding vocabulary
# (KeyedVectors supports the `in` operator directly).
sen = ['nlp', 'lab', 'tapphao', 'lalung']

Dct = []
print(type(Dct))
inDct = 0
notDct = 0
for word in sen:
    if word in model:
        inDct += 1
        t = str(word)
        print(type(t))
        Dct.append(t)
        print(Dct)
    else:
        notDct += 1
print(inDct)
print(notDct)
Dct

<class 'list'>
<class 'str'>
['nlp']
<class 'str'>
['nlp', 'lab']
2
2




['nlp', 'lab']

In [31]:
def transform(words):
    # YOUR CODE HERE
    tokens_spacy = spacy_tokenizer(words)

    # Split the tokens into those the embedding model knows and those it does not.
    InVocab = []   # tokens found in the model vocabulary
    notVocab = 0   # number of out-of-vocabulary tokens
    for token in tokens_spacy:
        if token in model:
            InVocab.append(str(token))
        else:
            notVocab += 1

    # Look up the embeddings of the known tokens (one row per token).
    embedded_vector = model[InVocab]
    print(embedded_vector[0].shape)
    print(type(embedded_vector[0]))

    # Unknown tokens get zero vectors; 100 is the dimension of the GloVe vectors used here.
    # The stacked matrix has shape (len(tokens_spacy), 100).
    if notVocab == 0:
        FinalMatrix = embedded_vector
    else:
        FinalMatrix = np.concatenate((embedded_vector, np.zeros((notVocab, 100))), axis=0)

    # The sentence vector is the average of all token vectors.
    s = 0
    for i in range(len(tokens_spacy)):
        s = s + FinalMatrix[i]
    return_vector = s / len(tokens_spacy)

    return return_vector
#t=np.zeros((1,2))
#m=np.zeros((1,2))
#TBlock= np.concatenate((t, m), axis=0)

#print(TBlock.shape)
r=transform("this is a nlp lab ")
r.shape
r
rl=np.array(r)
rl.shape
rl
type(rl)
    #raise NotImplementedError()

(100,)
<class 'numpy.ndarray'>




numpy.ndarray

In [32]:
assert transform("this is a nlp lab").shape == (100,)

(100,)
<class 'numpy.ndarray'>




**We can calculate similarities between sentences now the same way that we did between words! For this we need to use the cosine_similarity function!**
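For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A stdlib sketch follows; the function name `cosine` and the choice to return 0.0 for an all-zero vector are ours. The lab itself uses `sklearn.metrics.pairwise.cosine_similarity`, which works on 2-D arrays and returns a matrix.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction
```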

In [33]:
from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity(transform("hello my name is adam").reshape(
    1, -1), transform("hello my name is andrea").reshape(1, -1))[0][0])



(100,)
<class 'numpy.ndarray'>
(100,)
<class 'numpy.ndarray'>
0.7142756824337592


In [34]:
assert cosine_similarity(transform("hello my name is adam").reshape(
    1, -1), transform("hello my name is andrea").reshape(1, -1)).shape == (1, 1)

(100,)
<class 'numpy.ndarray'>
(100,)
<class 'numpy.ndarray'>




## 2.4 Finding Analogies
Word vectors have been shown to sometimes have the ability to solve analogies.

We discussed during the lecture that for the analogy "man : king :: woman : x" (read: man is to king as woman is to x), x is _queen_.

Find more examples of analogies that hold according to these vectors (i.e. the intended word is ranked top)!

Also find an example of an analogy that does not hold according to these vectors!

Summarize your findings in a few sentences.
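The arithmetic behind an analogy query can be sketched on hand-made vectors. Everything here is invented: the three dimensions (person, royalty, female) and the five words are chosen so that the analogy holds exactly, and `solve_analogy` is a hypothetical helper. Note also that gensim's `most_similar` normalizes the vectors before combining them, which this sketch does not.

```python
import math

toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

print(solve_analogy("man", "king", "woman", toy))
```

Excluding the query words matters: without it, one of the inputs is often the nearest neighbour of the target vector.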

In [35]:
# YOUR CODE HERE
def analogy(word1, word2, word3, n=10):
    """Answer 'word1 is to word2 as word3 is to ...' and return the n best candidates."""

    # Candidates for the analogy: the words closest to word2 - word1 + word3
    analogy_vector = model.most_similar(positive=[word3, word2], negative=[word1], topn=n)

    # The words closest to the negated sum of all three vectors,
    # i.e. words that point away from all three inputs
    not_analogy_vector = model.most_similar(negative=[word1, word2, word3], topn=n)

    print(word1 + " is to " + word2 + " as " + word3 + " is to...")

    return analogy_vector, not_analogy_vector

a,n=analogy('man', 'king', 'woman')
# The top-ranked answer is 'queen' (similarity 0.7698541283607483),
# followed by 'monarch' (0.6843380928039551)
# and 'throne' (0.6755737066268921).
print(a)
print('\n')
print('Vectors do not hold analogy to these vectors') # the words least related to 'man', 'king' and 'woman':
print(n)
# Ranked second among the least related words: 'ryryryryryry' (0.6911153793334961),
# then 'nanobiotechnology' (0.6836133003234863)
# and 'shoshani' (0.683387041091919).


man is to king as woman is to...
[('queen', 0.7698541283607483), ('monarch', 0.6843380928039551), ('throne', 0.6755737066268921), ('daughter', 0.6594556570053101), ('princess', 0.6520533561706543), ('prince', 0.6517034769058228), ('elizabeth', 0.6464517116546631), ('mother', 0.631171703338623), ('emperor', 0.6106470823287964), ('wife', 0.6098655462265015)]


Vectors do not hold analogy to these vectors
[('___________________________________________________________', 0.7018510699272156), ('ryryryryryry', 0.6911153793334961), ('nanobiotechnology', 0.6836133003234863), ('shoshani', 0.683387041091919), ('tom.fowler@chron.com', 0.6821105480194092), ('soejima', 0.6800203323364258), ('brett.clanton@chron.com', 0.6775467991828918), ('geoinformatics', 0.6769911050796509), ('methoni', 0.6744081377983093), ('zety', 0.6733587384223938)]


## 2.5 Bias in word vectors

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias in word vectors can be dangerous because applications that employ these models can pick up and reproduce the stereotypes.

Run the cell below to examine a sample of the gender bias present in the data. Try to come up with other examples that reflect biases in datasets (gender, race, sexual orientation etc.).

Summarize your findings in a few sentences.

In [36]:
print(model.most_similar(positive=['woman', 'doctor'], negative=['man']))

print(model.most_similar(positive=['man', 'doctor'], negative=['woman']))

[('nurse', 0.7735227346420288), ('physician', 0.7189429998397827), ('doctors', 0.6824328303337097), ('patient', 0.6750683188438416), ('dentist', 0.6726033687591553), ('pregnant', 0.6642460227012634), ('medical', 0.6520450115203857), ('nursing', 0.645348072052002), ('mother', 0.6393327116966248), ('hospital', 0.6387495398521423)]
[('dr.', 0.65594881772995), ('brother', 0.6274003982543945), ('him', 0.6236444711685181), ('he', 0.6169813871383667), ('himself', 0.6075390577316284), ('physician', 0.6073206067085266), ('father', 0.5924621820449829), ('master', 0.5798596143722534), ('friend', 0.5763945579528809), ('taken', 0.5739063024520874)]


In [37]:
# YOUR CODE HERE
print(model.most_similar(positive=['woman', 'parent' ], negative=['man']))

print(model.most_similar(positive=['man', 'parent'], negative=['woman']))

# Result: with 'woman' on the positive side, the neighbours of 'parent' are family words
# ('parents', 'sister', 'spouse', 'mothers', ...), which stay close to the meaning of 'parent'.

# With 'man' on the positive side, the neighbours are business words
# ('company', 'subsidiary', 'unit', 'group', ...), which are far from the family meaning of 'parent'
# (they probably reflect the "parent company" sense of the word).

# => CONCLUSION: the two queries give very different neighbourhoods. In these embeddings,
# 'parent' combined with 'woman' stays in the family domain,
# while 'parent' combined with 'man' drifts towards the corporate domain;
# family-related words may only show up further down the list.

# =>> THERE IS A GENDER BIAS around parenting (the vectors associate parenting more with women than with men).

print('\n')
print(model.most_similar(positive=['asia', 'timid' ], negative=['europe']))

print(model.most_similar(positive=['europe', 'timid'], negative=['asia']))

# Result: with 'asia' on the positive side, the neighbours of 'timid' are mostly close in meaning
# ('hesitant', 'feeble', 'diffident', 'introverted', ...).

# With 'europe' on the positive side, the top neighbours include 'pushy', 'cynical' and 'arrogant',
# which are far from the meaning of 'timid'.

# => CONCLUSION: the two queries give very different neighbourhoods. In these embeddings,
# 'timid' combined with 'asia' is surrounded by words with a similar meaning,
# while 'timid' combined with 'europe' is surrounded by words with a rather different meaning;
# words closer to 'timid' may only show up further down the list.

# =>> THERE IS A RACIAL (REGIONAL) BIAS around timidity (the vectors associate being timid more with Asia than with Europe).

#raise NotImplementedError()

[('parents', 0.5668851137161255), ('sister', 0.5646761059761047), ('employer', 0.5568981766700745), ('spouse', 0.5436244010925293), ('mothers', 0.5349369645118713), ('pregnant', 0.53443843126297), ('adult', 0.5250305533409119), ('sibling', 0.5245696902275085), ('provider', 0.5196497440338135), ('mother', 0.5136243104934692)]
[('company', 0.6763399839401245), ('subsidiary', 0.6488295793533325), ('unit', 0.6444345712661743), ('group', 0.6404563784599304), ('brothers', 0.6023896932601929), ('owner', 0.582736611366272), ('owned', 0.5813060402870178), ('based', 0.5802786350250244), ('executive', 0.5775868892669678), ('owns', 0.5734580755233765)]


[('hesitant', 0.5911028385162354), ('feeble', 0.5903016924858093), ('diffident', 0.5849113464355469), ('introverted', 0.5825530886650085), ('cocky', 0.5734557509422302), ('cheerful', 0.5699166655540466), ('lethargic', 0.5676059126853943), ('indecisive', 0.567166805267334), ('witless', 0.5650360584259033), ('headstrong', 0.5644875764846802)]
[('pus

# ================ PASSING LEVEL ====================

# 3. Logistic regression using word vectors

These sentence vectors can be used as feature vectors for classifiers. Rewrite the featurizing process and transform each sentence into a sentence vector using the embedding model!

__Note: it is OK if your model is not better than the other classifiers__

In [38]:

def transform(words):
    # YOUR CODE HERE
    tokens_spacy = spacy_tokenizer(words)

    # A text without any usable token gets the zero vector (avoids dividing by zero).
    if len(tokens_spacy) == 0:
        return np.zeros(100)

    # Split the tokens into those the embedding model knows and those it does not.
    InVocab = []   # tokens found in the model vocabulary
    notVocab = 0   # number of out-of-vocabulary tokens
    for token in tokens_spacy:
        if token in model:
            InVocab.append(str(token))
        else:
            notVocab += 1

    # Look up the embeddings of the known tokens (one row per token).
    if len(InVocab) == 0:
        embedded_vector = np.zeros((1, 100))
    else:
        embedded_vector = model[InVocab]

    # Unknown tokens get zero vectors; 100 is the dimension of the GloVe vectors used here.
    if notVocab == 0:
        FinalMatrix = embedded_vector
    else:
        FinalMatrix = np.concatenate((embedded_vector, np.zeros((notVocab, 100))), axis=0)

    # The sentence vector is the average of all token vectors.
    s = 0
    for i in range(len(tokens_spacy)):
        s = s + FinalMatrix[i]
    return_vector = s / len(tokens_spacy)

    return return_vector  # np.ndarray of shape (100,)


def vectorize_to_embedding(tr_data, tst_data):
    # YOUR CODE HERE
    # Turn every sentence (a string) into its 100-dimensional sentence vector with
    # transform() and stack the vectors into a matrix with one row per sentence.
    tr_WordEmbedVectors = np.array([transform(sentence) for sentence in tr_data])
    tst_WordEmbedVectors = np.array([transform(sentence) for sentence in tst_data])

    return tr_WordEmbedVectors, tst_WordEmbedVectors
    #raise NotImplementedError()
    
def get_features_and_labels(data, labels):
    # YOUR CODE HERE
    # 70% train / 30% test; the fixed seed makes the split reproducible.
    tr_data, tst_data, tr_labels, tst_labels = split(data, labels, test_size=0.3, random_state=SEED)

    # Convert the pandas Series to plain lists of strings before vectorizing.
    # Afterwards every sentence is a numeric vector with 100 features.
    finish_tr_data, finish_tst_data = vectorize_to_embedding(tr_data.tolist(), tst_data.tolist())

    return finish_tr_data, tr_labels, finish_tst_data, tst_labels

In [39]:
tr_vecs, tr_labels, tst_vecs, tst_labels = get_features_and_labels(train_data.tweet, train_data.label)



In [40]:
assert tr_vecs[0].shape == (100,)

In [41]:
# Train and evaluate! 
lr  = LogisticRegression(n_jobs=-1)

# Fit on the training vectors
print(tr_vecs.shape)
print(tr_labels.shape)
lr.fit(tr_vecs, tr_labels)

# Predict on the test vectors and report the accuracy
lr_pred = lr.predict(tst_vecs)
print("Logistic Regression Test accuracy : {}".format(
    accuracy_score(tst_labels, lr_pred)))
# YOUR CODE HERE

(9268, 100)
(9268,)
Logistic Regression Test accuracy : 0.7449647532729103


## 3.1 Ensemble model

Try out other classifiers from [sklearn](https://scikit-learn.org/stable/supervised_learning.html). Choose three and build a [VotingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) from the chosen classifiers. If the _voting_ strategy is set to _hard_, it performs majority voting among the classifiers and chooses the class with the most votes.

Make a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) with a TfidfVectorizer and your ensemble model. Pipeline objects make it easy to assemble several steps and make your machine learning pipeline executable in just one step.
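
The hard voting rule described above can be sketched in plain Python. This is only an illustration of the majority rule, not the scikit-learn implementation; `hard_vote` and the three prediction lists are hypothetical, and the labels `OFF` / `NOT` are used purely as example class names.

```python
from collections import Counter

def hard_vote(predictions_per_classifier):
    """Return the majority label for each sample.

    predictions_per_classifier: list of equally long label lists,
    one list per classifier.
    """
    voted = []
    for sample_votes in zip(*predictions_per_classifier):
        # most_common(1) returns [(label, count)] for the most frequent label
        label, _ = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical predictions of three classifiers for four tweets.
tree_pred = ["OFF", "NOT", "NOT", "OFF"]
forest_pred = ["NOT", "NOT", "OFF", "OFF"]
boost_pred = ["OFF", "NOT", "NOT", "NOT"]

majority = hard_vote([tree_pred, forest_pred, boost_pred])
print(majority)  # ['OFF', 'NOT', 'NOT', 'OFF']
```

With three voters and two classes a tie cannot occur, which is one reason an odd number of classifiers is a convenient choice for hard voting.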

In [45]:
# Reference:https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier
# Reference: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
# Reference: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

# IN GENERAL, a classical machine learning workflow has two parts:
# FIRST, feature extraction:
# each element of the raw dataset is converted into a vector of numerical values.
# The length of this vector is the number of characteristics (features) we use to describe an element.
# IMPORTANT: THIS NUMBER OF FEATURES IS ALSO THE INPUT DIMENSION OF THE CLASSIFIER.

# SECOND, the classifier:
# it predicts the labels (or classes) of the input vectors.
# The classifier gains this ability only after we have trained it
# with our training data and our training labels.
# There is a variety of classifiers to choose from, so we can pick the one that best satisfies
# our requirements for the outputs (predictions).


# IN TASK 1 AND TASK 2 we always used the LogisticRegression classifier.

from sklearn.pipeline import Pipeline
from sklearn.ensemble import VotingClassifier

# scikit-learn offers many classifiers, grouped into modules, for example:
# linear_model: contains e.g. the LogisticRegression classifier
## (the classifier we used in Task 1 and Task 2)

# naive_bayes: contains e.g. the GaussianNB classifier

# ensemble: contains e.g. RandomForestClassifier, AdaBoostClassifier and VotingClassifier

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import tree



## TRY OUT 3 BASE CLASSIFIERS BESIDES LOGISTIC REGRESSION, PLUS A VOTING ENSEMBLE OF THEM:

# DecisionTreeClassifier:

clf = tree.DecisionTreeClassifier()
clf = clf.fit(tr_vecs, tr_labels)
clf_pred = clf.predict(tst_vecs)  # predicted labels for the test vectors
print("DecisionTreeClassifier Test accuracy : {}".format(
    accuracy_score(tst_labels, clf_pred)))


# RandomForestClassifier:
clf1 = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
clf1 = clf1.fit(tr_vecs, tr_labels)
clf1_pred = clf1.predict(tst_vecs)
print("RandomForestClassifier Test accuracy : {}".format(
    accuracy_score(tst_labels, clf1_pred)))

# AdaBoostClassifier:
clf2 = AdaBoostClassifier()
clf2 = clf2.fit(tr_vecs, tr_labels)
clf2_pred = clf2.predict(tst_vecs)
print("AdaBoostClassifier Test accuracy : {}".format(
    accuracy_score(tst_labels, clf2_pred)))

# VotingClassifier:
eclf1 = VotingClassifier(estimators=[('dstr', clf), ('rdf', clf1), ('ada', clf2)], voting='hard')
    # 'dstr', 'rdf' and 'ada' are arbitrary names for the classifiers; we can rename them
    # without affecting the VotingClassifier.
    # Note: VotingClassifier fits clones of the given estimators, so they are trained again here.
eclf1 = eclf1.fit(tr_vecs, tr_labels)
eclf1_pred = eclf1.predict(tst_vecs)
print("VotingClassifier Test accuracy : {}".format(
    accuracy_score(tst_labels, eclf1_pred)))


# Pipeline declaration: pipe = Pipeline([<step 1>, <step 2>])
# A Pipeline only
# combines all the steps (step 1: how to process the raw data, e.g. bag of words, word embedding, etc.)
# and (step 2: choose the classifier) into one object.
# However, declaring it only describes the steps; no data is passed through them yet,
# so our raw data is still not vectorized and our classifier is still not trained.
# Consequently, we have to fit the pipeline on the training data and the training labels as follows:
# pipe.fit(train_data, train_labels)
# After that we can also evaluate the predictions of our model against the true test labels:
# pipe.score(test_data, test_labels)
def make_pipeline_ensemble(tweet, label):
    
    # YOUR CODE HERE
    # Declaration only; the pipeline is trained later with pipeline.fit(...).
    # VotingClassifier uses voting='hard' by default.
    pipe = Pipeline([('Tfid', TfidfVectorizer(max_features=10000, stop_words="english", lowercase=True, ngram_range=(2, 3))), 
                     ('VoteClass', VotingClassifier(estimators=[('dstr', clf), ('rdf', clf1), ('ada', clf2)]))])
    
    return pipe

DecisionTreeClassifier Test accuracy : 0.6143001007049346
RandomForestClassifier Test accuracy : 0.6759818731117825
AdaBoostClassifier Test accuracy : 0.7225579053373615
VotingClassifier Test accuracy : 0.7056898288016112
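
The pipeline workflow described in the comments of the cell above (declare, then fit on raw text, then score against the true labels) can be sketched on a tiny hypothetical dataset. The texts, labels and the choice of base classifiers below are assumptions made for illustration only; `MultinomialNB` is used instead of `GaussianNB` because it accepts the sparse matrices that `TfidfVectorizer` produces.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = offensive, 0 = not offensive.
texts = [
    "you are an idiot", "what a stupid idea", "I hate you so much",
    "have a nice day", "thank you very much", "what a lovely idea",
]
labels = [1, 1, 1, 0, 0, 0]

# Step 1 vectorizes the raw strings, step 2 classifies the resulting vectors.
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("vote", VotingClassifier(
        estimators=[
            ("lr", LogisticRegression()),
            ("nb", MultinomialNB()),
            ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
        ],
        voting="hard",
    )),
])

# fit() runs both steps: it learns the TF-IDF vocabulary and trains the ensemble.
toy_pipe.fit(texts, labels)

# predict() and score() take raw strings; score() compares with the TRUE labels.
preds = toy_pipe.predict(["you are great"])
score = toy_pipe.score(texts, labels)
print(preds, score)
```

Because the vectorizer is part of the pipeline, no separate `transform` call is needed before `predict` or `score`.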


In [46]:
pipeline = make_pipeline_ensemble(train_data.tweet, train_data.label)

In [47]:
assert type(pipeline) == Pipeline
assert type(pipeline.steps[0][1]) == TfidfVectorizer
assert type(pipeline.steps[1][1]) == VotingClassifier

In [48]:
# Train and evaluate! 
# YOUR CODE HERE


# Prepare the train data, test data, train labels and test labels from the raw data
raw_tr_data,raw_tst_data,raw_tr_labels,raw_tst_labels = split(train_data.tweet, train_data.label, test_size=0.3, random_state=1234)

# train:
pipeline.fit(raw_tr_data,raw_tr_labels)

# evaluate:
pipeline_pred = pipeline.predict(raw_tst_data)
print("Ensemble_machine Test accuracy : {}".format(
    accuracy_score(raw_tst_labels, pipeline_pred)))



Ensemble_machine Test accuracy : 0.6840382678751259


## 3.2 __Evaluate your classifiers separately as well. Summarize your results in a cell below. Did the ensemble model improve your performance?__

In [49]:
# YOUR CODE HERE

# DecisionTreeClassifier (uses the word-embedding vectors tr_vecs / tst_vecs computed above)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(tr_vecs, tr_labels)
clf_pred = clf.predict(tst_vecs) # get the predicted labels of the machine when we use the test data vectors.
print("DecisionTreeClassifier Test accuracy : {}".format(
    accuracy_score(tst_labels, clf_pred)))


DecisionTreeClassifier Test accuracy : 0.6092648539778449


In [50]:
# 'RandomForestClassifier' classifier
clf1= RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
clf1 = clf1.fit(tr_vecs, tr_labels)
clf1_pred = clf1.predict(tst_vecs)
print("RandomForestClassifier Test accuracy : {}".format(
    accuracy_score(tst_labels, clf1_pred)))

RandomForestClassifier Test accuracy : 0.6779959718026183


In [51]:
# AdaBoostClassifier (uses the word-embedding vectors tr_vecs / tst_vecs computed above)
clf2=  AdaBoostClassifier()
clf2 = clf2.fit(tr_vecs, tr_labels)
clf2_pred = clf2.predict(tst_vecs)
print("AdaBoostClassifier Test accuracy : {}".format(
    accuracy_score(tst_labels, clf2_pred)))

AdaBoostClassifier Test accuracy : 0.7225579053373615


In [52]:
# VotingClassifier (uses the word-embedding vectors tr_vecs / tst_vecs computed above)
eclf1 = VotingClassifier(estimators=[('dstr', clf), ('rdf', clf1), ('ada', clf2)], voting='hard')
    # 'dstr', 'rdf' and 'ada' are arbitrary names for the classifiers; we can rename them
    # without affecting the VotingClassifier.
eclf1 = eclf1.fit(tr_vecs, tr_labels)
eclf1_pred = eclf1.predict(tst_vecs)
print("VotingClassifier Test accuracy : {}".format(
    accuracy_score(tst_labels, eclf1_pred)))

VotingClassifier Test accuracy : 0.7046827794561934


In [53]:
# The ensemble machine (pipeline); it vectorizes the raw tweets with TfidfVectorizer (= pipeline.steps[0][1]):
pipeline_pred = pipeline.predict(raw_tst_data)
print("Ensemble_machine Test accuracy : {}".format(
    accuracy_score(raw_tst_labels, pipeline_pred)))

Ensemble_machine Test accuracy : 0.6840382678751259


In [None]:
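## CONCLUSION:
# Among the five models evaluated in this section, word-embedding vectorization with AdaBoostClassifier
# gives the best test accuracy: 0.7225579053373615

# The hard-voting ensemble on the same vectors reaches 0.7046827794561934: better than the decision tree
# and the random forest, but worse than AdaBoost alone, so the ensemble did not improve on the best single classifier.

# The model built with 'pipeline' (TF-IDF + voting ensemble) gives a moderate accuracy (rank 3/5): 0.6840382678751259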
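
The ranking used in the conclusion can be reproduced with a few lines of Python. The accuracies below are copied from the outputs of the cells in section 3.2; the dictionary keys are just descriptive names chosen for this sketch.

```python
# Test accuracies copied from the cell outputs of section 3.2.
results = {
    "DecisionTree (embeddings)": 0.6092648539778449,
    "RandomForest (embeddings)": 0.6779959718026183,
    "AdaBoost (embeddings)": 0.7225579053373615,
    "VotingClassifier (embeddings)": 0.7046827794561934,
    "TF-IDF + VotingClassifier pipeline": 0.6840382678751259,
}

# Sort from best to worst accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)

for rank, (name, accuracy) in enumerate(ranked, start=1):
    print(f"{rank}. {name:<36} {accuracy:.4f}")
```

Because `RandomForestClassifier` and `AdaBoostClassifier` are created without a fixed `random_state`, these numbers can change slightly between runs.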

# ================ EXTRA LEVEL ====================