># Content-based filtering

Unlike collaborative filtering, content-based filtering uses the features of the items themselves to make recommendations. Recommendation systems of this type rely on standard ML classification and similarity techniques.

To implement such a system, there are three main parts to consider.

1. Content analyzer - Basically offline training of a model on the items. We may build vectors/profiles for each item, which can be used later during the recommendation process.

2. User Profiler - We create profiles for users as well, to identify each person's unique interests and match them against the items in our system.

3. Item Retriever - The inference process, where we match the user profile with items to get the recommendations. This is an online process.
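
The three parts above can be sketched end to end in a few lines. This is only a minimal illustration: the item names, the vector values, and the choice of averaging liked items for the user profile and ranking by cosine similarity are all assumptions made for the example, not something the notes prescribe.

```python
import math

# Content analyzer output (assumed): one vector per item, e.g. TF-IDF or topic weights.
# The names and numbers below are made up for illustration.
item_vectors = {
    "item_a": [1.0, 0.0, 0.5],
    "item_b": [0.9, 0.1, 0.4],
    "item_c": [0.0, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def build_user_profile(liked_items):
    """User profiler: average the vectors of the items the user liked."""
    vectors = [item_vectors[name] for name in liked_items]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def retrieve(user_profile, exclude=(), top_n=2):
    """Item retriever: rank the remaining items by similarity to the user profile."""
    scored = [
        (name, cosine(user_profile, vec))
        for name, vec in item_vectors.items()
        if name not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

profile = build_user_profile(["item_a"])
recommendations = retrieve(profile, exclude={"item_a"})
```

Here `item_b` ranks ahead of `item_c` because its vector points in nearly the same direction as the profile built from `item_a`.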


>## Content Analyzer

This step is mainly based on what we call metadata: tags, descriptions, reviews, etc. about an item. One of the important yet annoying parts is deciding what to use and what to leave out. Adding everything is expensive (computationally, in development effort, and in actual cost), while adding too few or the wrong features yields bad or weird recommendations.

This closely relates to NLP, so many NLP techniques can be used, such as Bag of Words, word2vec, TF-IDF, and transformers, along with data cleaning steps (lemmatization, stop word removal). Some interesting pointers to related techniques are below.


- *There's a Python package named `stop-words` which includes stop words from various languages. Install it using `pip install stop-words`.*

- *We can remove the highest-occurring and lowest-occurring tokens from the dataset if we are using a token-based string vectorization method to transform strings.*

- *We can also use a stemmer/lemmatizer to reduce the number of token forms in the dataset. But again, usefulness may depend on the application and the data you have.*
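
A rough sketch of the first two pointers, using only the standard library. The tiny `STOP_WORDS` set, the function name `clean_corpus`, and the `min_df` / `max_df_ratio` thresholds are hypothetical choices for this example; a real project would likely load a fuller stop word list.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption for this sketch).
STOP_WORDS = {"is", "at", "my", "i"}

def clean_corpus(documents, min_df=1, max_df_ratio=1.0):
    """Lowercase, strip punctuation, drop stop words, then filter tokens by document frequency."""
    tokenized = []
    for doc in documents:
        tokens = [tok.strip(".,!?").lower() for tok in doc.split()]
        tokenized.append([tok for tok in tokens if tok and tok not in STOP_WORDS])

    # Document frequency: in how many documents each token appears.
    df = Counter(tok for tokens in tokenized for tok in set(tokens))
    max_df = max_df_ratio * len(documents)
    keep = {tok for tok, count in df.items() if min_df <= count <= max_df}
    return [[tok for tok in tokens if tok in keep] for tokens in tokenized]

docs = ["my name is dilan", "my name is dinushka", "I suck at statistics.", "I love my dog"]
# Drop tokens that appear in more than a quarter of the documents ("name" here).
cleaned = clean_corpus(docs, min_df=1, max_df_ratio=0.25)
```

Whether such aggressive filtering helps depends on the data; on a corpus this small it mostly shows the mechanics.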



>### TF-IDF

- TF = Term Frequency (how many times the word appears in the document).
- IDF = Inverse Document Frequency (a measure of how many documents contain the word; the fewer documents contain it, the higher the value).

We take the log forms of the above measures for calculations.

__<center>TF-IDF = TF(word, document) × IDF(word, documents)</center>__

<center>TF(word, document) = 1 + log(frequency of the word in the document)</center>
<center>IDF(word, documents) = log(total number of documents) - log(number of documents containing the word)</center>
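
As a small worked example of the log-scaled formulas above, the sketch below computes the score for one word. The natural logarithm and the example counts are assumptions made for illustration; the helper names `log_tf` and `idf` are hypothetical.

```python
import math

def log_tf(term_count):
    """TF(word, document) = 1 + log(count); taken as 0 when the word is absent."""
    return 1 + math.log(term_count) if term_count > 0 else 0.0

def idf(total_docs, docs_with_term):
    """IDF(word, documents) = log(N) - log(number of documents containing the word)."""
    return math.log(total_docs) - math.log(docs_with_term)

# Hypothetical numbers: the word appears 3 times in one document
# and occurs in 2 of the 4 documents overall.
score = log_tf(3) * idf(4, 2)
```

A word that appears in every document gets an IDF of zero, so its TF-IDF score vanishes no matter how often it occurs.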




>### LDA (Latent Dirichlet Allocation)

In this ML technique, words are assigned to hidden (latent) topics, and each document is then described as a mixture of those topics, expressed as percentages.


<center><image src="./images/LDA training.jpg" width="500px" /></center>

* We can use the `pyLDAvis` package to visualize the LDA topic distribution. This helps in choosing a suitable number of topics "k" for the algorithm.
* LDA has the parameters `alpha` and `beta`, which can be used to fine-tune the topic and word distributions.
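
To get a feel for what a parameter like `alpha` does, the sketch below samples topic mixtures from a symmetric Dirichlet prior with a small and a large concentration value. It assumes the usual reading of `alpha` as the concentration of the document-topic prior; the helper `sample_dirichlet` and the chosen values are hypothetical and only meant as an illustration.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
k = 3                   # number of topics

# Small alpha: most of the mass tends to land on one or two topics.
sparse_mix = sample_dirichlet(0.1, k, rng)
# Large alpha: the mass tends to spread fairly evenly over all topics.
even_mix = sample_dirichlet(10.0, k, rng)
```

The same intuition applies to `beta` on the topic-word side: smaller values favour topics dominated by a few words.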


## TF-IDF Implementation

In [53]:
from collections import Counter, defaultdict
import math

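def tf_idf(documents):
    """Return one TF-IDF vector per document, using whitespace tokenization.

    Note: TF here is the relative frequency (count / document length), not the
    1 + log(count) form given in the TF-IDF section above.
    """
    document_tfs = []                  # per-document token counts
    token_counter = defaultdict(int)   # document frequency of each token

    for document in documents:
        tokens = document.split(" ")
        counts = Counter(tokens)

        # Count each token once per document. Set iteration order is arbitrary,
        # so the column order of the vectors can vary between runs.
        for token in set(tokens):
            token_counter[token] += 1
        document_tfs.append(counts)

    base_vector = list(token_counter.keys())  # vocabulary; fixes the column order
    document_vectors = []
    for doc in document_tfs:
        doc_length = sum(doc.values())
        vec = []
        for key in base_vector:
            tf = doc[key] / doc_length
            idf = math.log(len(documents) / token_counter[key])
            vec.append(tf * idf)
        document_vectors.append(vec)

    return document_vectors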


tf_idf(["my name is dilan", "my name is dinushka", "I suck at statistics.", "I love my dog"])

[[0.34657359027997264,
  0.17328679513998632,
  0.17328679513998632,
  0.07192051811294521,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 [0.0,
  0.17328679513998632,
  0.17328679513998632,
  0.07192051811294521,
  0.34657359027997264,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 [0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.34657359027997264,
  0.34657359027997264,
  0.17328679513998632,
  0.34657359027997264,
  0.0,
  0.0],
 [0.0,
  0.0,
  0.0,
  0.07192051811294521,
  0.0,
  0.0,
  0.0,
  0.17328679513998632,
  0.0,
  0.34657359027997264,
  0.34657359027997264]]

> TF-IDF is meant to be used in applications where we know the inputs beforehand. As you can see, we need the counts over all the documents to get the IDF value. Also, once we add a new document whose tokens overlap with other documents, the IDF values of those tokens change.
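
The effect described in the note above is easy to reproduce. The sketch below is a minimal illustration with a hypothetical helper `idf_table` and a made-up extra document; it uses `log(N / df)` as in the implementation cell.

```python
import math

def idf_table(documents):
    """Return {token: log(N / df)} for whitespace-tokenized documents."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for token in set(doc.split()):
            df[token] = df.get(token, 0) + 1
    return {token: math.log(n_docs / count) for token, count in df.items()}

corpus = ["my name is dilan", "my name is dinushka", "I love my dog"]

before = idf_table(corpus)["dog"]                     # log(3 / 1)
after = idf_table(corpus + ["my dog barks"])["dog"]   # log(4 / 2)
```

The IDF of "dog" drops once the new document arrives, so every stored vector that includes "dog" would need to be recomputed.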

## LDA Implementation

Based on the following two YouTube videos regarding the theory of LDA and Gibbs sampling.

1. [First Video](https://www.youtube.com/watch?v=T05t-SqKArY)
2. [Second Video](https://www.youtube.com/watch?v=BaM1uiCpj_E) 

[Gibbs Sampling Explanation](https://ethen8181.github.io/machine-learning/clustering_old/topic_model/LDA.html#latent-dirichlet-allocation)  --> This is a very good article with implementation details on Gibbs sampling for LDA.

In [81]:
documents =[ "my name is dilan", 
             "my name is dinushka", 
             "I suck at statistics.", 
             "I love my dog very much"]

tokenized_documents = [doc.split(" ") for doc in documents]

# Build the vocabulary and a token -> column index mapping (first-seen order).
vocabulary = set()
token2index = {}
for doc in tokenized_documents:
    for token in doc:
        vocabulary.add(token)
        if token not in token2index:
            token2index[token] = len(token2index)

import numpy as np
from random import choice
    

k = 3 # num of topics


word_topic = np.zeros((k, len(vocabulary)))
document_topic = np.zeros((len(tokenized_documents), k))


# Randomly assign a topic to each word in every document and
# update the related counts in the word_topic and document_topic matrices.
topic_assignments = []
for doc in range(len(tokenized_documents)):
    topics = []
    for idx in range(len(tokenized_documents[doc])):
        topic_index = choice(range(k))

        token = tokenized_documents[doc][idx]
        token_index = token2index[token]

        topics.append(topic_index)
        word_topic[topic_index][token_index] += 1
        document_topic[doc][topic_index] += 1

    topic_assignments.append(topics)




print(topic_assignments)
print("-------------------------------")
print(token2index.keys())
print("-------------------------------")
print(word_topic)
print("-------------------------------")
print(document_topic)

[[1, 0, 2, 1], [1, 2, 0, 1], [0, 2, 1, 1], [0, 1, 1, 2, 2, 0]]
-------------------------------
dict_keys(['my', 'name', 'is', 'dilan', 'dinushka', 'I', 'suck', 'at', 'statistics.', 'love', 'dog', 'very', 'much'])
-------------------------------
[[0. 1. 1. 0. 0. 2. 0. 0. 0. 0. 0. 0. 1.]
 [3. 0. 0. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0.]]
-------------------------------
[[1. 2. 1.]
 [1. 2. 1.]
 [1. 2. 1.]
 [2. 2. 2.]]


<center><image src="./images/LDA Gibbs 1.jpg" width="650px" /></center>
<center><image src="./images/LDA Gibbs 2.jpg" width="500px" /></center>
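
For reference, the standard collapsed Gibbs sampling update for LDA, which the next cell follows, can be written as below. The symbols are chosen here to mirror the variable names in the code (`Cwt_Wij`, `Cdt_Dij`) and may differ from the notation in the figures: $C^{WT}$ is the word-topic count matrix, $C^{DT}$ the document-topic count matrix, $V$ the vocabulary size, and $K$ the number of topics. All counts exclude the current assignment of the token being resampled.

$$
P(z_{ij} = t \mid z_{-ij}, w) \;\propto\;
\frac{C^{WT}_{w_{ij},\,t} + \beta}{\sum_{w} C^{WT}_{w,\,t} + V\beta}
\cdot
\frac{C^{DT}_{d_i,\,t} + \alpha}{\sum_{t'} C^{DT}_{d_i,\,t'} + K\alpha}
$$

The first factor says how strongly topic $t$ favours the word; the second says how prevalent topic $t$ already is in the document.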

In [90]:
from copy import deepcopy

alpha = 1
beta = 1

for doc_idx, token_lst in enumerate(topic_assignments):
    for token_idx, topic in enumerate(token_lst):
        token = tokenized_documents[doc_idx][token_idx]
        vocab_index = token2index[token]
        
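        # Word-topic counts for this token, excluding its current assignment.
        Cwt_Wij = deepcopy(word_topic[:, vocab_index])
        Cwt_Wij[topic] = Cwt_Wij[topic] - 1

        # Total tokens per topic, again excluding the current assignment.
        Sum_Cwt_Wij = np.sum(word_topic, axis=1)
        Sum_Cwt_Wij[topic] = Sum_Cwt_Wij[topic] - 1

        # The word-topic term is smoothed with beta (not alpha).
        left = (Cwt_Wij + beta)/(Sum_Cwt_Wij + (len(vocabulary)*beta))

        # Document-topic counts for this document, excluding the current assignment.
        Cdt_Dij = deepcopy(document_topic[doc_idx])
        Cdt_Dij[topic] = Cdt_Dij[topic] - 1

        Sum_Cdt_Dij = np.sum(document_topic[doc_idx]) - 1
        right = (Cdt_Dij + alpha)/(Sum_Cdt_Dij + (k*alpha))

        # Normalize the product into a probability distribution over topics.
        prob_dist = (left*right)/sum(left*right)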

        new_topic = np.random.choice(range(k), 1, p=prob_dist)
        print(topic, new_topic)
    


# Note: this cell only prints the old and the newly sampled topic for each token.
# After drawing the new topic we also have to update the topic assignment list for token w
# and re-increment the word-topic and document-topic count matrices with the new topic.




1 [0]
0 [1]
2 [0]
1 [2]
1 [1]
2 [0]
0 [2]
1 [0]
0 [2]
2 [1]
1 [2]
1 [0]
0 [2]
1 [2]
1 [2]
2 [0]
2 [2]
0 [2]
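
The cell above samples a new topic for each token but never writes it back. A complete sweep that also applies the updates could look roughly like the sketch below. It is a self-contained, standard-library version written for illustration, not the notebook's own code: the function name `gibbs_lda`, the `sweeps` and `seed` parameters, and the use of `random.choices` instead of NumPy are all assumptions.

```python
import random

def gibbs_lda(tokenized_docs, k, alpha=1.0, beta=1.0, sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns assignments and count tables."""
    rng = random.Random(seed)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    token2index = {tok: i for i, tok in enumerate(vocab)}
    v = len(vocab)

    word_topic = [[0] * v for _ in range(k)]       # topic x word counts
    doc_topic = [[0] * k for _ in tokenized_docs]  # document x topic counts
    topic_totals = [0] * k                         # tokens assigned to each topic
    assignments = []

    # Random initialisation, as in the earlier cell.
    for d, doc in enumerate(tokenized_docs):
        topics = []
        for tok in doc:
            t = rng.randrange(k)
            topics.append(t)
            word_topic[t][token2index[tok]] += 1
            doc_topic[d][t] += 1
            topic_totals[t] += 1
        assignments.append(topics)

    for _ in range(sweeps):
        for d, doc in enumerate(tokenized_docs):
            for i, tok in enumerate(doc):
                w = token2index[tok]
                old = assignments[d][i]

                # Remove the current token from all counts.
                word_topic[old][w] -= 1
                doc_topic[d][old] -= 1
                topic_totals[old] -= 1

                # Unnormalized conditional per topic. The document-side
                # denominator is the same for every topic, so it is dropped.
                weights = [
                    (word_topic[t][w] + beta) / (topic_totals[t] + v * beta)
                    * (doc_topic[d][t] + alpha)
                    for t in range(k)
                ]
                new = rng.choices(range(k), weights=weights)[0]

                # Write the new topic back and re-increment the counts.
                assignments[d][i] = new
                word_topic[new][w] += 1
                doc_topic[d][new] += 1
                topic_totals[new] += 1

    return assignments, word_topic, doc_topic

docs = [d.split(" ") for d in ["my name is dilan", "my name is dinushka",
                               "I suck at statistics.", "I love my dog very much"]]
assignments, word_topic, doc_topic = gibbs_lda(docs, k=3)
```

With only four tiny documents the resulting topics are not meaningful; the point is that the count matrices stay consistent with the assignments after every step.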


>## Pros & Cons of Content-based Filtering

Pros:

- New items are easy to add: calculate the item vector and we are good to go.
- Don't need historical data, just details about the item.
- Not tied to popularity, so recommendations can be wider.

Cons:

- Limited understanding of the items may give bad results.