### Word Embeddings

- We'll be using the [spaCy](https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/) library for embeddings.

In [1]:
import spacy

Run the following cell once; it downloads the relevant spaCy embeddings.

In [226]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [227]:
nlp = spacy.load("en_core_web_lg")

In [228]:
# create sentence.
sentence = nlp('The grass is green .')

# now check out the embedded tokens.
for token in sentence:
    print(token.text, token.vector.shape, token.vector)

The (300,) [ 2.7204e-01 -6.2030e-02 -1.8840e-01  2.3225e-02 -1.8158e-02  6.7192e-03
 -1.3877e-01  1.7708e-01  1.7709e-01  2.5882e+00 -3.5179e-01 -1.7312e-01
  4.3285e-01 -1.0708e-01  1.5006e-01 -1.9982e-01 -1.9093e-01  1.1871e+00
 -1.6207e-01 -2.3538e-01  3.6640e-03 -1.9156e-01 -8.5662e-02  3.9199e-02
 -6.6449e-02 -4.2090e-02 -1.9122e-01  1.1679e-02 -3.7138e-01  2.1886e-01
  1.1423e-03  4.3190e-01 -1.4205e-01  3.8059e-01  3.0654e-01  2.0167e-02
 -1.8316e-01 -6.5186e-03 -8.0549e-03 -1.2063e-01  2.7507e-02  2.9839e-01
 -2.2896e-01 -2.2882e-01  1.4671e-01 -7.6301e-02 -1.2680e-01 -6.6651e-03
 -5.2795e-02  1.4258e-01  1.5610e-01  5.5510e-02 -1.6149e-01  9.6290e-02
 -7.6533e-02 -4.9971e-02 -1.0195e-02 -4.7641e-02 -1.6679e-01 -2.3940e-01
  5.0141e-03 -4.9175e-02  1.3338e-02  4.1923e-01 -1.0104e-01  1.5111e-02
 -7.7706e-02 -1.3471e-01  1.1900e-01  1.0802e-01  2.1061e-01 -5.1904e-02
  1.8527e-01  1.7856e-01  4.1293e-02 -1.4385e-02 -8.2567e-02 -3.5483e-02
 -7.6173e-02 -4.5367e-02  8.9281e-02  3.

### Topic Modeling

- Given a document, determine the topic of the document
- For this task, we'll use the Brown corpus of texts accessible via NLTK

In [229]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     /Users/hunterbarclay/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [230]:
from nltk.corpus import brown
import numpy as np
from collections import defaultdict
from tqdm.notebook import tqdm  # tqdm displays a progress bar in the notebook

category_vectors = []

cats = brown.categories()
    
# for each category
for cat in cats:
    print(cat)
    # grab all of the documents
    for fileid in tqdm(brown.fileids(categories=[cat])):
        sents = brown.sents(fileids=[fileid])
        sent_vecs = []
        for sent in sents:
            sent = ' '.join(sent)
            sent = nlp(sent)
            # grab all of the words, find their embeddings, and sum them
            # axis=0 sums across tokens, leaving one 300-dimensional vector
            word_sum = np.sum([tok.vector for tok in sent], axis=0)
            # add the summed sentence embedding to the list for this document
            sent_vecs.append(word_sum)
        # one (category, summed document vector) pair per file
        category_vectors.append((cat,np.sum(sent_vecs, axis=0)))
    

adventure




  0%|          | 0/29 [00:00<?, ?it/s]

belles_lettres


  0%|          | 0/75 [00:00<?, ?it/s]

editorial


  0%|          | 0/27 [00:00<?, ?it/s]

fiction


  0%|          | 0/29 [00:00<?, ?it/s]

government


  0%|          | 0/30 [00:00<?, ?it/s]

hobbies


  0%|          | 0/36 [00:00<?, ?it/s]

humor


  0%|          | 0/9 [00:00<?, ?it/s]

learned


  0%|          | 0/80 [00:00<?, ?it/s]

lore


  0%|          | 0/48 [00:00<?, ?it/s]

mystery


  0%|          | 0/24 [00:00<?, ?it/s]

news


  0%|          | 0/44 [00:00<?, ?it/s]

religion


  0%|          | 0/17 [00:00<?, ?it/s]

reviews


  0%|          | 0/17 [00:00<?, ?it/s]

romance


  0%|          | 0/29 [00:00<?, ?it/s]

science_fiction


  0%|          | 0/6 [00:00<?, ?it/s]
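The `axis=0` used in the loop above can be checked on a toy array. This is a minimal sketch with made-up 4-dimensional vectors standing in for spaCy's 300-dimensional ones; the values and the name `token_vectors` are illustrative and not taken from the notebook.

```python
import numpy as np

# Three hypothetical "token vectors" of dimension 4 (spaCy's would be 300).
token_vectors = np.array([
    [1.0, 0.0, 2.0, 0.5],
    [0.0, 1.0, 1.0, 0.5],
    [2.0, 2.0, 0.0, 0.0],
])

# axis=0 collapses the token axis: one value per embedding dimension.
sentence_vector = np.sum(token_vectors, axis=0)

# axis=1 would instead collapse the embedding axis: one value per token.
per_token_totals = np.sum(token_vectors, axis=1)

print(sentence_vector.shape, sentence_vector)
print(per_token_totals.shape, per_token_totals)
```

Summing over `axis=0` keeps the embedding dimension intact, which is why every sentence and document ends up as a single vector of the same length regardless of how many tokens it contains.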

In [234]:
import pandas as pd

keys,values=zip(*category_vectors) # unzip using a *

data = pd.DataFrame({'cat':keys,'vectors':values})

In [235]:
data[:3]

| | cat | vectors |
|---|---|---|
| 0 | adventure | [-22.51716, 449.9568, -423.67093, -223.53218, ... |
| 1 | adventure | [33.477146, 375.79608, -379.33813, -253.57874,... |
| 2 | adventure | [82.20694, 250.26768, -268.21576, -123.84788, ... |


In [236]:
total = len(data)

#### compute the baselines

In [237]:
print('random baseline {}'.format(1.0/len(cats)))

print('most common baseline?')
for cat in cats:
    print(cat, len(data[data.cat==cat])/total)

random baseline 0.06666666666666667
most common baseline?
adventure 0.058
belles_lettres 0.15
editorial 0.054
fiction 0.058
government 0.06
hobbies 0.072
humor 0.018
learned 0.16
lore 0.096
mystery 0.048
news 0.088
religion 0.034
reviews 0.034
romance 0.058
science_fiction 0.012
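The two baselines can also be reproduced without the DataFrame. The sketch below assumes the per-category document counts shown in the progress bars above (29, 75, 27, and so on); the names `doc_counts`, `random_baseline`, and `majority_baseline` are illustrative.

```python
from collections import Counter

# Documents per category, read off the progress bars above (500 in total).
doc_counts = {
    "adventure": 29, "belles_lettres": 75, "editorial": 27, "fiction": 29,
    "government": 30, "hobbies": 36, "humor": 9, "learned": 80, "lore": 48,
    "mystery": 24, "news": 44, "religion": 17, "reviews": 17, "romance": 29,
    "science_fiction": 6,
}

total_docs = sum(doc_counts.values())

# Random baseline: guess uniformly among the categories.
random_baseline = 1.0 / len(doc_counts)

# Most-common baseline: always predict the largest category.
majority_label, majority_count = Counter(doc_counts).most_common(1)[0]
majority_baseline = majority_count / total_docs

print(f"random baseline   {random_baseline:.3f}")
print(f"majority baseline {majority_baseline:.3f} ({majority_label})")
```

Any classifier worth keeping should beat the larger of the two numbers, which here is the majority baseline.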


#### split the data into train/test

In [238]:
test = data.sample(frac=0.1,random_state=200)
train = data.drop(test.index)

test.shape, train.shape 

((50, 2), (450, 2))

#### train a classifier

In [239]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(data.cat) 
X = [x for x in train.vectors]
y = le.transform(train.cat)

In [240]:
from sklearn.linear_model import LogisticRegression

In [241]:
clfr = LogisticRegression(multi_class='multinomial', solver='lbfgs')

In [242]:
clfr.fit(X,y)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
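The warning above suggests either raising `max_iter` or scaling the data. The summed embeddings have values in the hundreds, and unscaled features of that size may be part of why the solver ran out of iterations. Below is a minimal sketch of z-score scaling in NumPy on made-up numbers; `standardize` is a hypothetical helper, and scikit-learn's `StandardScaler` follows the same fit-on-train, apply-to-test pattern.

```python
import numpy as np

def standardize(train_X, test_X, eps=1e-8):
    """Scale features to zero mean and unit variance using train statistics only."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    # eps guards against division by zero for constant features.
    return (train_X - mean) / (std + eps), (test_X - mean) / (std + eps)

# Made-up data with large, uneven feature magnitudes.
rng = np.random.default_rng(0)
train_X = rng.normal(loc=[0.0, 300.0, -400.0], scale=[1.0, 50.0, 200.0], size=(450, 3))
test_X = rng.normal(loc=[0.0, 300.0, -400.0], scale=[1.0, 50.0, 200.0], size=(50, 3))

train_scaled, test_scaled = standardize(train_X, test_X)

print(train_scaled.mean(axis=0).round(3))  # approximately 0 per feature
print(train_scaled.std(axis=0).round(3))   # approximately 1 per feature
```

The test set is scaled with the training mean and standard deviation so that no information from the held-out documents leaks into preprocessing.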


#### evaluate 

In [243]:
from sklearn.metrics import accuracy_score

In [244]:
test_y = le.transform(test.cat)
test_X = [x for x in test.vectors]

score = accuracy_score(test_y, clfr.predict(test_X))  # (y_true, y_pred)
score

0.54

### Results

- GoogleNews-vectors-negative300.magnitude: 0.4 (word2vec)
- wiki-news-300d-1M.magnitude: 0.56 (fastText)
- glove.6B.300d.magnitude: 0.52 (GloVe)


# My Horrific Neural Network

I got it to about 65% at one point, but that's the best I've seen. With the large embedding set, the logistic regression got significantly better, so I tinkered for an hour or so until the network finally performed better than any of the numbers above.

### One-hot encoding for data

In [269]:
from sklearn import preprocessing

#test = data.sample(frac=0.1,random_state=200)
#train = data.drop(test.index)

le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder()
le.fit(data.cat)
data_y = le.transform(data.cat).reshape(-1, 1) # OneHotEncoder expects a 2D column, one label per row
ohe.fit(data_y)
y = ohe.transform(le.transform(train.cat).reshape(-1, 1)).toarray() # dense ndarray for Keras

X = np.array([x for x in train.vectors])

print(X.shape, y.shape)

test_y = ohe.transform(le.transform(test.cat).reshape(-1, 1)).toarray()
test_X = np.array([x for x in test.vectors])

print(test_X.shape, test_y.shape)

(450, 300) (450, 15)
(50, 300) (50, 15)
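What the encoder produces can be illustrated in plain NumPy. This is a sketch with made-up integer labels, not the notebook's `LabelEncoder`/`OneHotEncoder` pipeline; `labels` and `one_hot` are illustrative names.

```python
import numpy as np

n_classes = 15                      # number of Brown categories
labels = np.array([0, 7, 14, 7])    # hypothetical integer labels from a label encoder

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes)[labels]

print(one_hot.shape)
# argmax recovers the original integer labels.
print(one_hot.argmax(axis=1))
```

Each row has a single 1 in the column of its category, which is why `y` above has 15 columns.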


In [256]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input, Dropout

In [265]:
model = Sequential()

#Input Layer
model.add(Input((300,)))
# Hidden Layers
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(200, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(120, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(70, activation='relu'))
model.add(Dense(40, activation='relu'))
# Output Layer
model.add(Dense(15, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=1000, batch_size=10, verbose=0)

<keras.src.callbacks.history.History at 0x3efb20050>
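The output layer above pairs `sigmoid` with `binary_crossentropy`, which scores each of the 15 categories as an independent yes/no decision. Since every document has exactly one category, a common alternative is a `softmax` output with `categorical_crossentropy`; that variant was not run in this notebook, so treat it as a suggestion. The NumPy sketch below uses made-up scores to show how the two activations differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up pre-activation scores for 5 classes (the notebook has 15).
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])

sig = sigmoid(logits)
soft = softmax(logits)

print("sigmoid sum:", sig.sum())   # independent scores, need not sum to 1
print("softmax sum:", soft.sum())  # a probability distribution over classes

# Categorical cross-entropy for a one-hot target on class 0.
target = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
cce = -np.sum(target * np.log(soft))
print("categorical cross-entropy:", cce)
```

With softmax, raising the score of one class necessarily lowers the probability of the others, which matches a task where the categories are mutually exclusive.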

In [266]:
print("\tTraining Data Accuracy")
model.evaluate(X, y)
print("\tTesting Data Accuracy")
_, _ = model.evaluate(test_X, test_y)

	Training Data Accuracy
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 926us/step - accuracy: 0.9971 - loss: 0.0024
	Testing Data Accuracy
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.5742 - loss: 0.2918


## Questions and Answers
#### 1. What would you say is the neural network "learning"?

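The network is "learning" a mapping from the training inputs (the summed word embeddings for each document) to the training outputs (the document's category) by adjusting its weights to reduce the loss.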

#### 2. How does the depth or width of the network affect the training and the results?

Higher width tends to result in overfitting. It allows the model to fit specific examples in the training set rather than gaining an understanding of the data.
Higher depth encourages more varied solutions when fitting the training set, which helps accuracy on the test set, but it requires more training.

#### 3. As you made changes to the network, what did you notice about the hyperparameters (network depth, number of nodes, learning rate, etc.) and how they interact with each other? We said that neural networks are learning non-convex problems, but what about finding the best parameters? Is that a convex problem?

Each hyperparameter brings something to the table. Some work against each other in certain respects and complement each other in others. It's definitely not a convex problem: several quite different sets of hyperparameters can produce similar results, which suggests multiple separate optima rather than a single one.

#### 4. What is regularization? Why is it important?

Regularization helps prevent overfitting. It hinders the network from learning in a way that essentially copies what it has seen, so it can better anticipate data it has never seen before.
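The network above uses `Dropout` layers as its regularizer. The sketch below shows the usual "inverted dropout" formulation in NumPy; the `dropout` function and its arguments are illustrative, not Keras code.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10)

train_out = dropout(activations, rate=0.2, rng=rng)
test_out = dropout(activations, rate=0.2, rng=rng, training=False)

print(train_out)  # entries are either 0.0 or 1.25
print(test_out)   # unchanged at inference time
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit to memorize a training example.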

#### 5. Which activation functions did you choose (besides logistic/sigmoid)? For one of the activation functions you tried, spend some time learning about it. Whereas logistic/sigmoid maps from inputs to a probability between 0-1, what does the activation function you chose do?

I used the ReLU activation function. It zeros out all node values that are negative while keeping the rest the same. This break from a purely linear activation function introduces non-linearity into the model.
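ReLU is short enough to write out directly. This is a minimal NumPy sketch with made-up inputs, including a quick check that the function is not linear.

```python
import numpy as np

def relu(x):
    # Negative values become 0; non-negative values pass through unchanged.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))

# A linear function f satisfies f(a + b) == f(a) + f(b); ReLU does not.
a, b = np.array([-1.0]), np.array([2.0])
print(relu(a + b), relu(a) + relu(b))
```

The two values on the last line differ, which is the non-linearity that lets stacked `Dense` layers represent more than a single linear map.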