<h1><center>Content-based recommender systems</center></h1>

## Basic imports

In [1]:
# Imports
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [2]:
df = pd.read_excel("medical.xlsx")
df.head()

| Unnamed: 0 | Name | description |
|---|---|---|
| 0 | ketolac | Analgesic belongs to Nonsteroidal Anti-inflamm... |
| 1 | cataflam | Nonsteroidal Anti-inflammatory Drug (NSAID) us... |
| 2 | Catafast | Catafast Powder is used for the treatment,cont... |
| 3 | fast-flam | Inflammation of the bones and joints such as r... |
| 4 | Adwiflam | Used to treat actinic keratosis, a skin proble... |


Calculate the TF-IDF transform

In [3]:
# Define a TF-IDF vectorizer that drops English stop words such as 'the' and 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
df['description'] = df['description'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['description'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(26, 255)

Compute the pair-wise cosine similarity


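Recall that the dot product satisfies

$$x \cdot y = \lVert x \rVert \, \lVert y \rVert \cos(\theta)$$

so

$$\cos(\theta) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

`TfidfVectorizer` L2-normalizes every row by default (`norm='l2'`), so $\lVert x \rVert = \lVert y \rVert = 1$ and the cosine similarity reduces to the plain dot product. We can therefore use `linear_kernel` from sklearn, which skips the normalization step that `cosine_similarity` performs and is faster for that reason.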

In [4]:
# Compute the cosine Similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
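
As a sanity check on the shortcut used here, the following sketch compares `linear_kernel` with `cosine_similarity` on a hypothetical toy corpus (again, made-up descriptions rather than the notebook's data). It relies on the default `norm='l2'` of `TfidfVectorizer`; with `norm=None` the two results would generally differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy descriptions, not taken from medical.xlsx
toy_docs = [
    "pain relief for joint inflammation",
    "anti-inflammatory drug for joint pain",
    "treats skin problems caused by sun exposure",
]

# Default norm='l2' gives every row unit length
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

dot_sim = linear_kernel(toy_matrix, toy_matrix)      # plain dot products
cos_sim = cosine_similarity(toy_matrix, toy_matrix)  # normalizes, then dot products

# On unit-length rows both matrices agree up to floating-point error
kernels_match = np.allclose(dot_sim, cos_sim)
print(kernels_match)
```

The diagonal of either matrix is 1, since every description is perfectly similar to itself; this is why the recommendation function later has to leave the query medicine out of its results.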

Implement the function `get_recommendations`, which returns the two medicines most similar to a given one, based on the cosine similarity between the TF-IDF vectors of their descriptions

In [5]:
# Construct a reverse map from medicine name to row index, keeping the first row for any repeated name
indices = pd.Series(df.index, index=df['Name'])
indices = indices[~indices.index.duplicated(keep='first')]

In [6]:
# Function that takes in medicine title as input and outputs most similar medicines
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the medicine that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all medicines with that medicine
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the medicines based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the two most similar medicines, skipping the first entry (the medicine itself)
    sim_scores = sim_scores[1:3]

    # Get the medicine indices
    medicine_indices = [i[0] for i in sim_scores]

    # Return the names of the most similar medicines
    return df['Name'].iloc[medicine_indices]

In [7]:
print(get_recommendations('fast-flam'))

1    cataflam
0     ketolac
Name: Name, dtype: object
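
The function above hard-codes two results and assumes the query medicine always sorts first. A possible variant, sketched below under stated assumptions, takes the number of results as a parameter, excludes the query row explicitly, and raises a clear error for an unknown name. The function name `recommend_similar`, the `top_n` parameter, and the toy data are all hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy data standing in for medical.xlsx
toy_df = pd.DataFrame({
    "Name": ["drug_a", "drug_b", "drug_c", "drug_d"],
    "description": [
        "pain relief for joint inflammation",
        "anti-inflammatory drug for joint pain",
        "treats skin problems caused by sun exposure",
        "cream for skin irritation",
    ],
})

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_df["description"])
toy_sim = linear_kernel(toy_matrix, toy_matrix)


def recommend_similar(name, names, sim, top_n=2):
    """Return the top_n names most similar to `name` (hypothetical helper)."""
    matches = np.flatnonzero(names.to_numpy() == name)
    if matches.size == 0:
        raise KeyError(f"unknown medicine name: {name!r}")
    idx = matches[0]  # use the first row if the name is repeated

    scores = sim[idx].copy()
    scores[idx] = -np.inf  # exclude the query medicine itself

    top = np.argsort(scores)[::-1][:top_n]  # indices of the highest scores first
    return names.iloc[top]


result = recommend_similar("drug_a", toy_df["Name"], toy_sim, top_n=1)
print(result)
```

Masking the query row with `-np.inf` avoids relying on sort order, which matters when another medicine has an identical description and ties with the query at a similarity of 1.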
