<a href="https://colab.research.google.com/github/Foutse/Recommendation_systems/blob/master/Recommendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Inspired by: https://heartbeat.fritz.ai/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 


In [6]:
dataset = pd.read_csv("/sample-data.csv")

We start by creating a *TF-IDF* vectorizer, which weighs each term in a document: a term's importance grows with the number of times it appears in that document and shrinks with the number of documents that contain it. The TF-IDF score of a term is the product of the two quantities below.
- TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
- IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
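
To make the two formulas concrete, here is a minimal sketch that computes them by hand in plain Python. The toy corpus and the helper names `tf`, `idf` and `tf_idf` are made up for illustration and are not part of the notebook.

```python
import math

# Toy corpus (hypothetical, not taken from sample-data.csv); each document is a list of tokens.
docs = [
    "organic cotton boxer".split(),
    "organic cotton cotton shirt".split(),
    "nylon rain jacket".split(),
]

def tf(term, doc):
    """Number of times `term` appears in `doc`, divided by the number of terms in `doc`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / number of documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF score: the product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)

# "cotton" is frequent in the second document but also appears in 2 of the 3 documents.
score_cotton = tf_idf("cotton", docs[1], docs)
# "shirt" appears once in the second document and nowhere else.
score_shirt = tf_idf("shirt", docs[1], docs)
```

Note that scikit-learn's `TfidfVectorizer` does not use these textbook formulas exactly: with its default settings it takes raw counts as the term frequency, uses the smoothed IDF ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row, so its scores will differ from the ones computed here.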

![Alternative text…](https://miro.medium.com/max/455/1*3Ig7VSgscBzXaYa0Q-UM1w.png)

In [8]:
dataset.head(10)

| Unnamed: 0 | id | description |
|---|---|---|
| 0 | 1 | Active classic boxers - There's a reason why o... |
| 1 | 2 | Active sport boxer briefs - Skinning up Glory ... |
| 2 | 3 | Active sport briefs - These superbreathable no... |
| 3 | 4 | Alpine guide pants - Skin in, climb ice, switc... |
| 4 | 5 | Alpine wind jkt - On high ridges, steep ice an... |
| 5 | 6 | Ascensionist jkt - Our most technical soft she... |
| 6 | 7 | Atom - A multitasker's cloud nine, the Atom pl... |
| 7 | 8 | Print banded betina btm - Our fullest coverage... |
| 8 | 9 | Baby micro d-luxe cardigan - Micro D-Luxe is a... |
| 9 | 10 | Baby sun bucket hat - This hat goes on when th... |


Scikit-learn’s *TfidfTransformer* and *TfidfVectorizer* aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The main difference is that:
- With *TfidfTransformer*, you first compute the word counts using *CountVectorizer*, then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores.
- With *TfidfVectorizer*, on the contrary, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and TF-IDF scores all using the same dataset.

There are cases where you want to use *TfidfTransformer* over *TfidfVectorizer*, and it is sometimes not that obvious. Here is a general guideline:
- If you need the term frequency (term count) vectors for different tasks, use *TfidfTransformer*.
- If you need to compute TF-IDF scores on documents within your “training” dataset, use *TfidfVectorizer*.
- If you need to compute TF-IDF scores on documents outside your “training” dataset, use either one; *both will work*.
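
The sketch below illustrates that the two routes give the same matrix when both are left at their default settings. The three-sentence corpus is a made-up stand-in for the product descriptions, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus standing in for the product descriptions.
corpus = [
    "alpine wind jacket for steep ice",
    "baby sun bucket hat",
    "alpine guide pants for ice climbing",
]

# Route 1: word counts first, then IDF weighting. The counts stay available for other tasks.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: counts, IDF values and TF-IDF scores in a single step.
one_step = TfidfVectorizer().fit_transform(corpus)

# With default parameters on both sides, the resulting matrices should match.
same = np.allclose(two_step.toarray(), one_step.toarray())
```

The two-step route is mostly worth the extra line when the raw `counts` matrix is needed elsewhere; otherwise the one-step vectorizer is shorter.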

In [9]:
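# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')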
tfidf_matrix = tf.fit_transform(dataset['description'])

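The `tfidf_matrix` holds one row per document (an item, in this case) and one column per term; each entry is the TF-IDF score of that term in that document. Since `ngram_range=(1, 3)`, a term can be a single word, a pair of consecutive words, or a triple of consecutive words.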

## Vector space model

In the vector space model a document D is represented as an m-dimensional vector, where each dimension corresponds to a distinct term and m is the total number of terms used in the collection of documents. 
- Now, we have a representation of every item in terms of its description. 
- Next, we need to calculate the relevance or similarity of one document to another.

A user’s preference for an item is measured by the cosine of the angle between the user profile vector (Ui) and the document vector; in our case, it is the angle between two document vectors. The following image illustrates this.

![Alternative text…](https://miro.medium.com/max/518/1*LWoRop9T6hC7zhi32UxhCQ.png)

- The value of *cosine* will increase as the angle between vectors decreases, which signifies more similarity.
- The vectors are length-normalized, after which they become vectors of length 1.
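
The two points above can be checked numerically. This is a minimal NumPy sketch with made-up three-dimensional vectors; the local helper `cosine_similarity` is written out here only for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: their dot product over the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, so the angle is 0
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a, so the angle is 90 degrees

sim_ab = cosine_similarity(a, b)  # largest possible value: the vectors point the same way
sim_ac = cosine_similarity(a, c)  # zero: the vectors share no direction

# After length normalization each vector has length 1,
# and the cosine reduces to a plain dot product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_units = float(np.dot(a_unit, b_unit))
```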

## Calculating Cosine Similarity

![Alternative text…](https://miro.medium.com/max/871/1*Q4xQoV8k_7S7xB-NfvFdrw.png)

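We calculate the *cosine similarity* of each item with every other item in the dataset, rank the other items by their similarity to item i, and store the ranked list in `results`. Because `TfidfVectorizer` length-normalizes each row by default, `linear_kernel` (a plain dot product) already gives the cosine similarity.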

In [10]:
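# Pairwise dot products between all item vectors.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {}
for idx, row in dataset.iterrows():
    # Indices of the 99 highest-scoring items, most similar first.
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], dataset['id'][i]) for i in similar_indices]
    # Drop the first entry, which is the item itself.
    results[row['id']] = similar_items[1:]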
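
As a sanity check on the cell above, the following sketch compares `linear_kernel` with scikit-learn's `cosine_similarity` on a tiny made-up corpus and shows the ranking step in isolation. The corpus and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical item names standing in for the full descriptions.
corpus = [
    "baby sunshade top",
    "sunshade hoody",
    "alpine wind jacket",
    "alpine guide pants",
]
tfidf = TfidfVectorizer().fit_transform(corpus)

# TF-IDF rows are L2-normalized by default, so both kernels should agree.
dot = linear_kernel(tfidf, tfidf)
cos = cosine_similarity(tfidf, tfidf)
kernels_match = np.allclose(dot, cos)

# Rank all items by similarity to item 0, most similar first.
order = dot[0].argsort()[::-1]
# Exclude item 0 explicitly instead of assuming it is ranked first.
most_similar = [int(i) for i in order if i != 0][:2]
```

Filtering out the query index explicitly is a slightly more defensive choice than slicing off the first result, since two items with identical descriptions would tie with the item itself.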

## Making a recommendation

Let's build a function that takes an *item_id* and the number of recommendations we want, *num*. It looks up the `results` entry for that *item_id* and prints our recommendations on screen. Voilà!
![Alternative text…](https://miro.medium.com/max/1334/1*oYpMnPQFZaiZQizgVWBpoA.png)

In [11]:
# Returns the item name, i.e. the part of the description before the first " - ".
def item_is(item_id, data):
    # Look the item up in the DataFrame that was passed in, not in the global dataset.
    item = data.loc[data['id'] == item_id, 'description'].tolist()[0].split(' - ')[0]
    return item

# Just reads the results out of the dictionary.
def recommendation(item_id, num, data):
    print("Recommending " + str(num) + " products similar to " + item_is(item_id, data) + "...")
    print("-------")
    print("We get the following results:")
    print("-------")
    recs = results[item_id][:num]
    # Each entry of recs is a (score, id) pair.
    for rec in recs:
        print("Recommended: " + item_is(rec[1], data) + " (with a score of:" + str(rec[0]) + ")")

In [12]:
recommendation(item_id=11, num=5, data=dataset)

Recommending 5 products similar to Baby sunshade top...
-------
We get the following results:
-------
Recommended: Sunshade hoody (with a score of:0.21330296021085024)
Recommended: Baby baggies apron dress (with a score of:0.10975311296284812)
Recommended: Runshade t-shirt (with a score of:0.09988151262780731)
Recommended: Runshade t-shirt (with a score of:0.09530698241688207)
Recommended: Runshade top (with a score of:0.08510550093018411)
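
The function above prints its results, which is convenient for reading but awkward for further processing. As one possible variant, the sketch below returns the same information as a DataFrame. The function name `recommendations_frame`, the toy data and the toy `results` dictionary are all hypothetical; only the `(score, id)` layout of each entry is taken from the notebook.

```python
import pandas as pd

def recommendations_frame(item_id, num, data, results):
    """Return the top `num` recommendations for `item_id` as a DataFrame."""
    # Item name = the part of the description before the first " - ", indexed by id.
    names = data.set_index("id")["description"].str.split(" - ").str[0]
    rows = [
        {"id": rec_id, "name": names[rec_id], "score": score}
        for score, rec_id in results[item_id][:num]
    ]
    return pd.DataFrame(rows, columns=["id", "name", "score"])

# Hypothetical stand-ins for `dataset` and `results`.
toy_data = pd.DataFrame({
    "id": [1, 2, 3],
    "description": ["Sun hat - wide brim", "Sun hoody - light", "Ice axe - steel"],
})
toy_results = {1: [(0.4, 2), (0.0, 3)]}

frame = recommendations_frame(item_id=1, num=1, data=toy_data, results=toy_results)
```

Passing `results` in as an argument, rather than reading a global, keeps the function usable with any precomputed similarity dictionary.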


## Analyzing the Results

![Alternative text…](https://i2.wp.com/evidentiasoftware.com/wp-content/uploads/sites/2/2014/05/JobSeekingDescription_crop380w1.jpg?w=380&ssl=1)

##### Advantages of Content Based Filtering

- *User independence:* This method only has to analyze the items and a single user’s profile to make a recommendation, which makes the process less cumbersome. It therefore produces more reliable results when the system has few users.
- *Transparency:* Items are recommended on a feature-level basis.
- *No cold start:* New items can be suggested before being rated by a substantial number of users.

##### Disadvantages of Content Based Filtering

- *Limited content analysis:* If the content doesn’t contain enough information to discriminate the items precisely, the recommendation itself risks being imprecise.
- *Over-specialization:* It provides a limited degree of novelty, since it has to match the features of a user’s profile with the available items. In item-based filtering, only item profiles are created, and users are suggested items similar to what they rate or search for rather than items based on their past history. A perfect content-based filtering system may suggest nothing unexpected or surprising.

You now know how to build a fully functional recommender system in Python with content-based filtering. More is yet to come ;)

![Alternative text…](https://images.rapgenius.com/481633f1184fed769ed4f7aef5d5ff36.500x281x8.gif)
