# Lesson 4: NLP, Tabular Data, Collaborative Filtering and Embeddings

## NLP -> IMDB

IMDB is a dataset of movie reviews, each with a binary label saying whether the review is positive or not.

In general we will be training this model in a very particular way. In fact, we will have **3 models**.

An important point is that we have 25k movie reviews in our IMDB dataset.

If we try to train a model from scratch on this data alone, we will certainly not get very good accuracy. It would be much better to use transfer learning, since otherwise the model basically has to learn English, which is extremely hard given this amount of data.

- We will start with a pretrained model trained to do something different: a **Language Model**, which tries to predict the next word of a sentence, trained on Wikitext 103 (a subset of Wikipedia).

- Then we will fine tune such model.

- Finally we will create our classifier.

![image.png](attachment:image.png)

The first model helps us by understanding the English language and how words interact in it. Its job is not so much to make our final prediction as to know English.

The second model helps us by teaching the previous model how movie reviews are written.

### Imports and initial configuration

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.text import *

### Language Model

#### Data

In [9]:
bs = 48

In [10]:
path = untar_data(URLs.IMDB)
path.ls()

[PosixPath('/home/ubuntu/.fastai/data/imdb/README'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/unsup'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/imdb.vocab'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/test'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/train'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/tmp_lm'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/tmp_clas')]

In [11]:
(path/'train').ls()

[PosixPath('/home/ubuntu/.fastai/data/imdb/train/pos'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/train/unsupBow.feat'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/train/neg'),
 PosixPath('/home/ubuntu/.fastai/data/imdb/train/labeledBow.feat')]

Note that this text data set gets preprocessed in two important ways:

- Tokenization: the text is split into tokens and normalized, so that a word such as don't becomes 2 tokens and our model receives more consistent data.

- Numericalization: we transform each token into a number, since numbers are what we will and can feed our neural net.
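
The two steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not fastai's actual tokenizer: the helper names `tokenize`, `build_vocab` and `numericalize` are hypothetical, and index 0 is reserved for unknown tokens (labelled `xxunk` here).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    text = text.lower()
    # Separate the "n't" of a contraction so "don't" yields two tokens.
    text = re.sub(r"n't\b", " n't", text)
    return re.findall(r"n't|\w+|[^\w\s]", text)

def build_vocab(token_lists, min_freq=1):
    """Build index-to-string and string-to-index mappings."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Index 0 is reserved for tokens that are not in the vocabulary.
    itos = ["xxunk"] + sorted(t for t, c in counts.items() if c >= min_freq)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace every token by its vocabulary index (0 if unknown)."""
    return [stoi.get(tok, 0) for tok in tokens]

tokens = tokenize("I don't like it.")
itos, stoi = build_vocab([tokens])
ids = numericalize(tokens, stoi)
print(tokens)
print(ids)
```

Mapping the ids back through `itos` recovers the tokens, which is why the classifier later has to reuse the exact same vocabulary as the language model.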

In [None]:
data_lm = (TextList.from_folder(path)
          .filter_by_folder(include=['train', 'test', 'unsup'])
          .split_by_rand_pct(0.1)
          .label_for_lm()
          .databunch(bs = bs))
data_lm.save('data_lm.pkl')

In [None]:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)

In [None]:
data_lm.show_batch()

Visualizing tokenization:

In [None]:
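# texts.csv ships with the smaller IMDB_SAMPLE dataset; point path at that folder if the file is missing
data = TextClasDataBunch.from_csv(path, 'texts.csv')
data.show_batch()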

Visualizing numericalization:

In [None]:
data.vocab.itos[:10]

In [2]:
data.train_ds[0][0]


In [None]:
data.train_ds[0][0].data[:10]

#### Model training

Something important to note is that we are using the AWD_LSTM architecture, which, as far as I know, comes with weights pretrained on Wikitext 103.

Another important thing to note is **drop_mult=0.3**, which scales the amount of dropout. We will cover dropout later; for now, a low value helps us avoid underfitting.

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot(skip_end=15)

One important thing to note in the training below is **moms**, which stands for momentums: the pair of momentum values that one-cycle training moves between.

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))

In [None]:
learn.save('fit_head')

In [None]:
learn.load('fit_head')

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7))

As we can see from the results above, we are predicting the correct next word about 1/3 of the time, which is pretty good when you think about it; I would say that's great even for a regular person.

In [None]:
learn.save('fine_tunned')

In [None]:
learn.load('fine_tunned')

Let's try a prediction:

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

#### Saving the model for future use

Before jumping to the classifier we should save the **encoder**, the first part of our model, which builds a representation of the sentence so far (rather than predicting the next word), so that we can load it into our classifier.

In [4]:
learn.save_encoder('fine_tuned_enc')


### Classifier Model

This will be the model that really helps us classify our reviews as positive or negative.

#### Data

In [None]:
path = untar_data(URLs.IMDB)

Note that we are also passing a **vocab** to our TextList. This is because we don't want to create a new vocab (numericalization); we want to use the exact same one as our **language model**.

In [None]:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
            # use the 'test' folder as the validation set
            .split_by_folder(valid='test')
            .label_from_folder(classes=['neg', 'pos'])
            .databunch(bs=bs))

In [None]:
data_clas.save('data_clas.pkl')

In [None]:
data_clas = load_data(path, 'data_clas.pkl', bs = bs)

In [None]:
data_clas.show_batch()

#### Model Training

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))

In [None]:
learn.save('first')

In [None]:
learn.load('first')

Calling **freeze_to(-N)** with a negative number freezes everything except the last N layer groups, so only those last N groups are trained.
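
The effect of the negative index can be sketched with plain Python slicing. This is a toy model, not fastai's implementation: the layer groups are hypothetical dictionaries, and the count of five groups is an assumption made for illustration.

```python
def freeze_to(layer_groups, n):
    """Freeze every group before index n and leave the rest trainable."""
    for group in layer_groups[:n]:
        group["trainable"] = False
    for group in layer_groups[n:]:
        group["trainable"] = True

# Five hypothetical layer groups, all frozen to begin with.
groups = [{"name": f"group_{i}", "trainable": False} for i in range(5)]

freeze_to(groups, -2)
trainable = [g["name"] for g in groups if g["trainable"]]
print(trainable)
```

With `-2` only the last two groups end up trainable, and with `-3` the last three, which matches the gradual unfreezing done in the cells that follow.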

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2), moms=(0.8, 0.7))

In [None]:
learn.save('second')

In [None]:
learn.load('second')

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))

In [None]:
learn.save('third')

In [None]:
learn.load('third')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3), moms=(0.8, 0.7))

Testing the model:

In [None]:
learn.predict("I really loved that movie, it was awesome!")

## Tabular Data -> Adults

Many people say neural networks (NN) aren't that useful for tabular data, but that's not the case. With a NN we spend less effort on feature engineering, which is a large part of the work in classical machine learning. The usual alternatives are random forests, logistic regression and gradient boosting machines, which may require a bit more feature engineering than a NN.

### Data

For the data we will be using pandas since this is the most common form for tabular data.

In [2]:
from fastai.tabular import *

In [4]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

As we can see below, we need to give fastai some extra information about the data:

- cat_names: the categorical/discrete variables.

- cont_names: the continuous variables.

- procs: the transformations to apply to our data. The main difference from the transformations we applied to images is that these processes are applied ahead of time: the data is preprocessed once with these processors rather than on the fly. Transformations are for data augmentation, where we want to randomize and do something different each time, whereas processes are things we want to do once, ahead of time. For **normalization** it's important to note that whatever we do to the training set must be done in exactly the same way to the validation and/or test sets. Here normalization means subtracting the mean and dividing by the standard deviation.
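
The point about normalization can be made concrete with a small sketch. It is not fastai's `Normalize` processor: the helper names, the `eps` term and the sample ages are all assumptions chosen for illustration. The statistics are computed on the training values only and then reused unchanged for the validation values.

```python
from statistics import mean, pstdev

def fit_normalizer(values):
    """Compute the mean and standard deviation from the training data only."""
    return mean(values), pstdev(values)

def apply_normalizer(values, mu, sigma, eps=1e-7):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mu) / (sigma + eps) for v in values]

train_age = [25, 35, 45, 55]   # hypothetical training column
valid_age = [30, 60]           # hypothetical validation column

mu, sigma = fit_normalizer(train_age)
train_norm = apply_normalizer(train_age, mu, sigma)
# The validation set reuses the training statistics; it is never re-fitted.
valid_norm = apply_normalizer(valid_age, mu, sigma)
print(mu, sigma)
print(valid_norm)
```

Re-fitting on the validation set would put the two sets on different scales, so the same raw age would reach the model as two different numbers.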


It is usually a good idea to make the validation set a contiguous block of rows rather than a random sample, so that it resembles the kind of data we will eventually be predicting on.

In [5]:
deep_var = 'salary'  # dependent variable: the column we want to predict
# categorical variables
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
# continuous variables
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

In [8]:
test = TabularList.from_df(df.iloc[800:1000].copy(), path = path, cat_names=cat_names, cont_names=cont_names)

In [11]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
       .split_by_idx(list(range(800, 1000)))
       .label_from_df(cols=deep_var)
       .add_test(test, label=0)
       .databunch())

In [12]:
data.show_batch(rows = 10)

| workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | target |
|---|---|---|---|---|---|---|---|---|---|---|
| Private | 10th | Divorced | Machine-op-inspct | Not-in-family | White | False | -0.2629 | -0.5334 | -1.5958 | <50k |
| Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 0.7632 | -0.4702 | -0.4224 | <50k |
| Private | 10th | Never-married | Other-service | Own-child | White | False | -1.5823 | 0.1214 | -1.5958 | <50k |
| Private | 5th-6th | Never-married | Other-service | Other-relative | White | False | -0.8493 | 0.6234 | -2.7692 | <50k |
| Federal-gov | Bachelors | Married-civ-spouse | Adm-clerical | Wife | Asian-Pac-Islander | False | 0.0303 | -0.7499 | 1.1422 | <50k |
| Self-emp-not-inc | Masters | Married-civ-spouse | Sales | Husband | White | False | 2.3758 | -0.4388 | 1.5334 | >=50k |
| Private | Some-college | Never-married | Handlers-cleaners | Not-in-family | White | False | -1.0692 | -0.2338 | -0.0312 | <50k |
| Private | Some-college | Married-civ-spouse | Other-service | Wife | Black | False | 0.2502 | -1.5677 | -0.0312 | <50k |
| Private | HS-grad | Never-married | Exec-managerial | Not-in-family | White | False | 0.3968 | 0.1516 | -0.4224 | <50k |
| Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 0.6166 | -0.4456 | -0.4224 | <50k |


The layers argument describes our architecture: it is a list with the size of each hidden layer.

### Training our model

In [None]:
learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)

In [None]:
learn.fit(1, 1e-2)

In [None]:
row = df.iloc[0]

In [None]:
learn.predict(row)

## Collab Filtering -> Movielens

Collaborative filtering is about who bought what or who reviewed what. In its simplest version we have just 2 columns, such as user id and movie id, where each row says that a user bought that movie; we can then add further information such as reviews or timestamps. There are 2 ways to represent this structure.
One is as an adjacency list of (user, movie) pairs and the other is a full adjacency matrix. We usually store the data in the first form, since the matrix can get quite sparse.
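
The difference between the two representations can be seen in a small sketch. The rating triples and the helper `to_dense` are hypothetical, and `None` marks a movie the user has not rated.

```python
# List form: one (user, movie, rating) row per known rating.
ratings = [(0, 2, 4.0), (1, 0, 3.5), (1, 3, 5.0), (2, 1, 2.0)]

def to_dense(ratings, n_users, n_movies):
    """Expand the list form into a users x movies matrix with gaps as None."""
    matrix = [[None] * n_movies for _ in range(n_users)]
    for user, movie, rating in ratings:
        matrix[user][movie] = rating
    return matrix

dense = to_dense(ratings, n_users=3, n_movies=4)
n_cells = sum(len(row) for row in dense)
n_filled = sum(cell is not None for row in dense for cell in row)
print(f"{n_filled} of {n_cells} cells are filled")
```

The list stores only the known ratings, while the matrix allocates a cell for every user and movie combination, most of which stay empty as the number of users and movies grows.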

The dataset we will be using is **movielens**, which contains users, movies, ratings and timestamps.

In [16]:
path = untar_data(URLs.ML_SAMPLE)
path

Downloading http://files.fast.ai/data/examples/movie_lens_sample


PosixPath('/home/ubuntu/.fastai/data/movie_lens_sample')

In [24]:
ratings = pd.read_csv(path/'ratings.csv')
series2cat(ratings, 'userId', 'movieId')
ratings.head()

|  | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 73 | 1097 | 4.0 | 1255504951 |
| 1 | 561 | 924 | 3.5 | 1172695223 |
| 2 | 157 | 260 | 3.5 | 1291598691 |
| 3 | 358 | 1210 | 5.0 | 957481884 |
| 4 | 130 | 316 | 2.0 | 1138999234 |


In [None]:
data = CollabDataBunch.from_df(ratings, seed=42)

In [None]:
y_range = [0, 5.5]

- For the architecture we have to say how many factors we want.
- The y range is the range of values our predictions can take. We need it because the last layer uses a scaled sigmoid function to produce a result between those values.

In [None]:
learn = collab_learner(data, n_factors=50, y_range=y_range)

In [None]:
learn.fit_one_cycle(3, 5e-3)

Unfortunately, this collab_learner doesn't work for new users or new movies (this is called the cold start problem). There are some workarounds:

- Use a separate, metadata-based tabular model for new users or new movies.
- Use UX to make your new user not that new. For instance, Netflix used to show new users 20 very common movies and ask whether they had seen them and how they would rate them.

## Collab Filtering Theory : Behind the scenes.

For illustrative purposes we first take a good (non-sparse) matrix sample:
![image.png](attachment:image.png)


So somehow we have to come up with a function that can fill in the blanks. Here's a simple possible approach using matrix multiplications:


![image.png](attachment:image.png)


In the image above we create 2 random matrices of size (n X 5) and (5 X m). Multiplying them gives a matrix of size (n X m), where each value is the predicted score that the corresponding user gives the corresponding movie.


Now, given a loss function (RMSE), what's left is to use gradient descent to adjust both matrices until the model predicts accurately.
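
The idea of two random factor matrices adjusted by gradient descent can be sketched in plain Python. This is a toy version under stated assumptions, not what fastai does internally: the ratings are made up, the movie factors are stored one row per movie (the transpose of the 5 X m matrix), and the update minimizes the squared error of each known rating, which also lowers the RMSE.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rmse(ratings, users, movies):
    """Root mean squared error over the known ratings only."""
    squared = [(dot(users[u], movies[m]) - r) ** 2 for u, m, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5

def factorize(ratings, n_users, n_movies, n_factors=5, lr=0.05, epochs=200, seed=0):
    """Fit user and movie factors with stochastic gradient descent."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.5, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    movies = [[rng.uniform(-0.5, 0.5) for _ in range(n_factors)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in ratings:
            err = dot(users[u], movies[m]) - r  # prediction minus target
            for k in range(n_factors):
                u_k, m_k = users[u][k], movies[m][k]
                users[u][k] -= lr * err * m_k   # gradient of 0.5 * err**2 w.r.t. u_k
                movies[m][k] -= lr * err * u_k  # gradient of 0.5 * err**2 w.r.t. m_k
    return users, movies

# Hypothetical (user, movie, rating) triples; the missing pairs are the blanks.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 4.5)]

start_users, start_movies = factorize(ratings, 3, 3, epochs=0)  # random start
users, movies = factorize(ratings, 3, 3)
print(rmse(ratings, start_users, start_movies), rmse(ratings, users, movies))
```

Once trained, `dot(users[u], movies[m])` can be evaluated for any pair, including the ones with no known rating, which is how the blanks get filled in.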

### A look into collab learner:

In [2]:
from fastai.collab import *

We'll have a look at **collab_learner** from fast.ai, which uses **EmbeddingDotBias**, a fast.ai model built on PyTorch.

In [3]:
collab_learner??

In [5]:
EmbeddingDotBias??

As we can see, EmbeddingDotBias inherits from nn.Module, which is PyTorch's base class for a layer (or a whole model) in a NN.

This class defines 2 methods, `__init__` and `forward`.


- `forward` is what calculates the output of this module/layer.

- `__init__` is what gets called when an instance of the class is created.


An important thing to note is that PyTorch calculates the gradients of the model for us. We only tell PyTorch how to calculate the output of our model and it takes care of the gradient computation.



In this case the model contains:

- weights for user : random embedding (matrix) that we generate using pytorch (randomized initially)
- weights for item : random embedding (matrix) that we generate using pytorch (randomized initially)
- bias for user : embedding (vector) of bias for users 
- bias for item : embedding (vector) of bias for items


And what our model basically calculates is the following:
- res = (WU * WI).sum() + BU + BI, that is, the dot product of the user and item weights plus both biases
- return **sigmoid(res) * (max_score - min_score) + min_score**
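
The formula above can be written out for a single user and item pair. This is a plain-Python sketch rather than the fastai source: the function name `predict_rating` and the example numbers are hypothetical, and the default range reuses the `[0, 5.5]` y range from earlier.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_rating(user_w, item_w, user_b, item_b, y_range=(0.0, 5.5)):
    """Dot product of the factor vectors plus both biases, squashed into y_range."""
    res = sum(u * i for u, i in zip(user_w, item_w)) + user_b + item_b
    min_score, max_score = y_range
    return sigmoid(res) * (max_score - min_score) + min_score

# Hypothetical factors and biases for one user and one movie.
score = predict_rating([0.3, -0.2, 0.5], [0.4, 0.1, 0.6], user_b=0.2, item_b=-0.1)
print(score)
```

When the factors and biases are all zero the sigmoid returns 0.5, so the prediction sits exactly in the middle of the range.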


### How do ANNs work using linear algebra?

Well, let's take a look at an example:
- As we may notice, the difference between purple and blue is that the blue values are results of matrix multiplications and the purple ones are sigmoid functions.
- It's also important to note that at the end we have a scaled sigmoid function, which takes our result and puts it within the range of our desired output.

![image.png](attachment:image.png)
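
The alternation of matrix multiplications and sigmoid functions described above can be sketched as a tiny forward pass. The layer sizes, weights and helper names below are hypothetical and are not read off the figure; the output range `(0.0, 5.5)` is reused from the collaborative filtering example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, x):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def forward(x, w1, w2, y_range=(0.0, 5.5)):
    hidden = [sigmoid(h) for h in matvec(w1, x)]  # matrix multiplication, then sigmoid
    out = matvec(w2, hidden)[0]                   # second matrix multiplication
    low, high = y_range
    return sigmoid(out) * (high - low) + low      # scaled sigmoid on the last layer

# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output.
w1 = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
w2 = [[0.7, -0.6]]
prediction = forward([1.0, 2.0, 3.0], w1, w2)
print(prediction)
```

Only the weight matrices are learned; the sigmoid functions have no parameters, and the final scaling keeps the output inside the desired range.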