# **Content-Based Recommendation System Notebook**
- In this notebook, we explore and implement a content-based recommendation system. Such systems suggest items to users based on the characteristics of the items and a profile of the user's preferences.
- This approach is particularly useful when rich descriptive information about the items is available. We will build a simple content-based recommendation system for medicines using Python and the scikit-learn library.

## **1. Introduction**
- **What is a Content-Based Recommendation System?**
    - A content-based recommendation system recommends items to users based on the content or characteristics of the items. This type of recommendation system focuses on understanding the properties of items and learning user preferences from the items they have interacted with in the past.


- **How Does it Work?**
    - The working principle of a content-based recommendation system can be summarized in a few steps:
        1. **Feature Extraction**: Extract relevant features from the items. For example, in a movie recommendation system, features could include genre, director, actors, and plot keywords.

        2. **User Profile**: Create a user profile based on their interactions with items. This profile is essentially a summary of the features of items the user has liked or interacted with in the past.

        3. **Recommendation**: Calculate the similarity between the user profile and each item's features. Items that are most similar to the user profile are recommended.
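
The three steps above can be sketched on a toy example. This is a minimal illustration only: the movie titles, the feature sets, and the helper names `to_vector` and `cosine` are hypothetical and are not used elsewhere in this notebook.

```python
import math

# Hypothetical toy catalogue: each item is described by a set of content features.
items = {
    "Movie A": {"action", "sci-fi"},
    "Movie B": {"action", "thriller"},
    "Movie C": {"romance", "drama"},
    "Movie D": {"sci-fi", "thriller"},
}

# Step 1: feature extraction -> one binary vector per item over a fixed vocabulary.
vocabulary = sorted(set().union(*items.values()))

def to_vector(features):
    return [1.0 if term in features else 0.0 for term in vocabulary]

item_vectors = {name: to_vector(feats) for name, feats in items.items()}

# Step 2: user profile -> average of the vectors of the items the user liked.
liked = ["Movie A"]
profile = [sum(col) / len(liked) for col in zip(*(item_vectors[n] for n in liked))]

# Step 3: recommendation -> rank unseen items by cosine similarity to the profile.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

ranking = sorted(
    ((name, cosine(profile, vec)) for name, vec in item_vectors.items() if name not in liked),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

With a single liked item, the profile is simply that item's vector, which is the situation in the rest of this notebook: one medicine is used as the query.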

In [1]:
# Import needed modules
import numpy as np
import pandas as pd
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import precision_score, f1_score, recall_score

In [2]:
# Read data
df = pd.read_csv('A_Z_medicines_dataset_of_India.csv')

- Let's take a first look at the data.

In [3]:
# printing the first 5 rows of the dataframe
df.head()

Unnamed: 0,id,name,price(₹),Is_discontinued,manufacturer_name,type,pack_size_label,short_composition1,short_composition2
0,1,Augmentin 625 Duo Tablet,223.42,False,Glaxo SmithKline Pharmaceuticals Ltd,allopathy,strip of 10 tablets,Amoxycillin (500mg),Clavulanic Acid (125mg)
1,2,Azithral 500 Tablet,132.36,False,Alembic Pharmaceuticals Ltd,allopathy,strip of 5 tablets,Azithromycin (500mg),
2,3,Ascoril LS Syrup,118.0,False,Glenmark Pharmaceuticals Ltd,allopathy,bottle of 100 ml Syrup,Ambroxol (30mg/5ml),Levosalbutamol (1mg/5ml)
3,4,Allegra 120mg Tablet,218.81,False,Sanofi India Ltd,allopathy,strip of 10 tablets,Fexofenadine (120mg),
4,5,Avil 25 Tablet,10.96,False,Sanofi India Ltd,allopathy,strip of 15 tablets,Pheniramine (25mg),


In [4]:
# Get data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253973 entries, 0 to 253972
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  253973 non-null  int64  
 1   name                253973 non-null  object 
 2   price(₹)            253973 non-null  float64
 3   Is_discontinued     253973 non-null  bool   
 4   manufacturer_name   253973 non-null  object 
 5   type                253973 non-null  object 
 6   pack_size_label     253973 non-null  object 
 7   short_composition1  253973 non-null  object 
 8   short_composition2  112171 non-null  object 
dtypes: bool(1), float64(1), int64(1), object(6)
memory usage: 15.7+ MB


In [5]:
# Selecting the relevant features for recommendation
selected_features = ['short_composition1', 'short_composition2']
print(selected_features)

['short_composition1', 'short_composition2']


**Data Preprocessing**
- Before building the recommendation system, we need to preprocess the data. This may include text cleaning, handling missing values, and tokenization.
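- Computing pairwise similarities over all 253,973 rows is too expensive, because the similarity matrix grows quadratically with the number of rows. We therefore shuffle the data and keep a random sample of 5,000 rows.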
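
As an illustration of the text-cleaning step, the sketch below lowercases a composition string and removes the dosage in parentheses. The helper `clean_composition` is hypothetical and is not applied anywhere in this notebook, which keeps the dosage text as it is.

```python
import re

def clean_composition(text):
    """Lowercase a composition string and drop the dosage in parentheses."""
    # Missing values (None or NaN) are not strings; treat them as empty text.
    if not isinstance(text, str):
        return ""
    without_dosage = re.sub(r"\([^)]*\)", " ", text)
    # Collapse the leftover whitespace into single spaces.
    return " ".join(without_dosage.lower().split())

print(clean_composition("Amoxycillin (500mg)"))
print(clean_composition("Chlorpheniramine Maleate (2mg/5ml)"))
```

Whether to drop the dosage is a design choice. Keeping it, as this notebook does, means medicines that share a strength such as `10mg` get a small similarity score even when their ingredients differ.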

In [21]:
# Shuffle all rows; seeding NumPy's global RNG makes the shuffle reproducible
np.random.seed(42)
df = df.sample(frac=1)

In [22]:
# Keep the first 5,000 rows of the shuffled DataFrame
df = df[:5000]
df


Unnamed: 0,name,price(₹),manufacturer_name,pack_size_label,short_composition1,short_composition2
1501,Synoret 10mg Capsule,124.00,Synnove Pharmaceuticals Pvt Ltd,strip of 10 capsules,Isotretinoin (10mg),
2586,Dapamac M 10mg/1000mg Tablet,132.20,Macleods Pharmaceuticals Pvt Ltd,strip of 10 tablets,Dapagliflozin (10mg),Metformin (1000mg)
2653,Glispo M 1mg/500mg Tablet SR,48.00,Wells Biosciences,strip of 10 tablet sr,Glimepiride (1mg),Metformin (500mg)
1055,Tendoachilles Tablet,180.00,Achilles Healthcare Pvt Ltd,strip of 10 tablets,Chondroitin (200mg),Collagen Peptide (40mg)
705,Ciclocan Cream,210.00,Canbro Healthcare,tube of 30 gm Cream,Ciclopirox (1% w/w),
...,...,...,...,...,...,...
4426,Alex Junior Syrup,95.00,Glenmark Pharmaceuticals Ltd,bottle of 60 ml Syrup,Chlorpheniramine Maleate (2mg/5ml),Dextromethorphan Hydrobromide (5mg/5ml)
466,Raberide 20mg Tablet,10.00,Zeelab Pharmacy Pvt Ltd,strip of 10 tablets,Rabeprazole (20mg),
3092,Piromac 20mg Tablet DT,33.00,Blubell Pharma,strip of 10 tablet dt,Piroxicam (20mg),
3772,Pizone 15mg Tablet,75.23,Anthus Pharmaceuticals Pvt Ltd,strip of 30 tablets,Pioglitazone (15mg),


In [23]:
# Renumber the rows 0..4999 so that row positions line up with the similarity matrix built later
df.reset_index(drop=True, inplace=True)

In [24]:
df

Unnamed: 0,name,price(₹),manufacturer_name,pack_size_label,short_composition1,short_composition2
0,Synoret 10mg Capsule,124.00,Synnove Pharmaceuticals Pvt Ltd,strip of 10 capsules,Isotretinoin (10mg),
1,Dapamac M 10mg/1000mg Tablet,132.20,Macleods Pharmaceuticals Pvt Ltd,strip of 10 tablets,Dapagliflozin (10mg),Metformin (1000mg)
2,Glispo M 1mg/500mg Tablet SR,48.00,Wells Biosciences,strip of 10 tablet sr,Glimepiride (1mg),Metformin (500mg)
3,Tendoachilles Tablet,180.00,Achilles Healthcare Pvt Ltd,strip of 10 tablets,Chondroitin (200mg),Collagen Peptide (40mg)
4,Ciclocan Cream,210.00,Canbro Healthcare,tube of 30 gm Cream,Ciclopirox (1% w/w),
...,...,...,...,...,...,...
4995,Alex Junior Syrup,95.00,Glenmark Pharmaceuticals Ltd,bottle of 60 ml Syrup,Chlorpheniramine Maleate (2mg/5ml),Dextromethorphan Hydrobromide (5mg/5ml)
4996,Raberide 20mg Tablet,10.00,Zeelab Pharmacy Pvt Ltd,strip of 10 tablets,Rabeprazole (20mg),
4997,Piromac 20mg Tablet DT,33.00,Blubell Pharma,strip of 10 tablet dt,Piroxicam (20mg),
4998,Pizone 15mg Tablet,75.23,Anthus Pharmaceuticals Pvt Ltd,strip of 30 tablets,Pioglitazone (15mg),


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                5000 non-null   object 
 1   price(₹)            5000 non-null   float64
 2   manufacturer_name   5000 non-null   object 
 3   pack_size_label     5000 non-null   object 
 4   short_composition1  5000 non-null   object 
 5   short_composition2  5000 non-null   object 
dtypes: float64(1), object(5)
memory usage: 234.5+ KB


- Dropping the unnecessary columns

In [26]:
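# errors="ignore" makes this cell safe to re-run: once the columns are gone,
# a plain drop raises KeyError: "['type', 'Is_discontinued', 'id'] not found in axis"
df = df.drop(columns=["type", "Is_discontinued", "id"], errors="ignore")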

In [None]:
df

In [None]:
# Replace missing values with a single space so that the string concatenation below does not produce NaN
for feature in selected_features:
    df[feature] = df[feature].fillna(" ")
    

In [None]:
df

In [27]:
# Combine the two composition columns into one text string per medicine
combined_features = df['short_composition1'] + " " + df['short_composition2']
combined_features

0                                   Isotretinoin (10mg)  
1               Dapagliflozin (10mg)   Metformin (1000mg)
2                   Glimepiride (1mg)   Metformin (500mg)
3          Chondroitin (200mg)   Collagen Peptide (40mg) 
4                                   Ciclopirox (1% w/w)  
                              ...                        
4995    Chlorpheniramine Maleate (2mg/5ml)   Dextromet...
4996                                 Rabeprazole (20mg)  
4997                                   Piroxicam (20mg)  
4998                                Pioglitazone (15mg)  
4999                                Ceftriaxone (250mg)  
Length: 5000, dtype: object

## **3. Building the Content-Based Recommendation System**
**TF-IDF Vectorization**
- We use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert the text features (here, the combined composition strings) into numerical vectors.
- TF-IDF gives more weight to terms that occur often in a given document but rarely across the whole collection, and less weight to terms that are common everywhere.
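
To make the weighting concrete, here is a small hand-rolled sketch in plain Python. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which is how scikit-learn documents the defaults of `TfidfVectorizer` as far as I understand them. The three toy documents and the helper names `tokenize` and `tfidf` are illustrative only.

```python
import math
import re
from collections import Counter

# Toy documents in the style of the composition strings used in this notebook.
docs = [
    "Isotretinoin (10mg)",
    "Metformin (500mg)",
    "Glimepiride (1mg) Metformin (500mg)",
]

def tokenize(text):
    # Lowercase and keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    # Term count multiplied by the smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    # L2-normalize so that every document vector has length 1.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the third document, `glimepiride` and `1mg` appear in only one document and therefore receive a higher weight than `metformin` and `500mg`, which appear in two.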


In [28]:
# converting the text data to feature vectors
vectorizer = TfidfVectorizer()

feature_vectors = vectorizer.fit_transform(combined_features)

In [29]:
print(feature_vectors)

  (0, 555)	0.9127272692893265
  (0, 26)	0.4085693721947954
  (1, 26)	0.34318867150930693
  (1, 364)	0.7065183003965594
  (1, 627)	0.4485696281917355
  (1, 18)	0.4264255100440899
  (2, 627)	0.5147932005225202
  (2, 495)	0.5517364983010509
  (2, 60)	0.5538853873928794
  (2, 135)	0.35183202635762784
  (3, 322)	0.5467059770707919
  (3, 65)	0.21707029866595895
  (3, 352)	0.5467059770707919
  (3, 725)	0.5467059770707919
  (3, 120)	0.23710379444935878
  (4, 325)	1.0
  (5, 275)	0.6171439508221576
  (5, 80)	0.4469532714646837
  (5, 308)	0.5663112812089528
  (5, 149)	0.31409815323575124
  (6, 149)	0.25810292279441055
  (6, 362)	0.7104823484956804
  (6, 93)	0.6546737460147956
  (7, 666)	0.6589968944107542
  (7, 369)	0.6554639153905568
  :	:
  (4991, 305)	0.5332752248064835
  (4991, 830)	0.5944438133801264
  (4992, 68)	0.5905332525406808
  (4992, 851)	0.8070133069805755
  (4993, 135)	0.3519772125008345
  (4993, 713)	0.4134139361832042
  (4993, 68)	0.4112258218052606
  (4993, 386)	0.732184596075497

**Cosine Similarity**
- We compute the cosine similarity between the TF-IDF vectors of items. Cosine similarity measures the cosine of the angle between two non-zero vectors and is used to determine how similar two items are based on their feature vectors.
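
The definition can be written out directly. The sketch below computes cosine similarity for sparse vectors stored as dictionaries. The function name `cosine` and the unweighted toy vectors are assumptions made for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # The cosine is undefined for a zero vector; return 0.0 in that case.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors with unit weights (real TF-IDF weights would differ).
a = {"metformin": 1.0, "500mg": 1.0}
b = {"metformin": 1.0, "1000mg": 1.0}
c = {"piroxicam": 1.0, "20mg": 1.0}

print(cosine(a, a))  # identical vectors: close to 1
print(cosine(a, b))  # one of two terms shared: close to 0.5
print(cosine(a, c))  # no shared terms: 0.0
```

Because TF-IDF weights are non-negative, the scores here fall between 0 and 1, and two medicines with no term in common always score 0.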

In [30]:
# getting the similarity scores using cosine similarity
similarity = cosine_similarity(feature_vectors, feature_vectors)

In [31]:
print(similarity)

[[1.         0.14021638 0.         ... 0.         0.         0.        ]
 [0.14021638 1.         0.23092059 ... 0.         0.         0.        ]
 [0.         0.23092059 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]


**Test your Recommendation System**

In [32]:
# creating a list with all the medicines
all_meds = df["name"].tolist()
all_meds

['Synoret 10mg Capsule',
 'Dapamac M 10mg/1000mg Tablet',
 'Glispo M 1mg/500mg Tablet SR',
 'Tendoachilles Tablet',
 'Ciclocan Cream',
 'Benimcold Tablet',
 'Dactino 0.5mg Injection',
 'Lifedec 50mg Injection',
 'Xpen X Syrup',
 'Mucon 150mg Tablet',
 'Cilarem 5 Tablet',
 'Paracip Suspension',
 'Miconya NT 75mg/10mg Tablet',
 'Hemolok Solution',
 'Max-D3  60K Soft Gelatin Capsule',
 'Psoranext C Ointment',
 'Ecoclav Duo Tablet',
 'Bacdroxyl 250mg Tablet',
 'Samstar 2.5mg Tablet',
 'Wavlon Antiseptic Liquid',
 'Maglivo ML 5mg/10mg Tablet',
 'Laciclav 200mg/28.5mg Tablet DT',
 'Zoecef O 200mg/200mg Tablet',
 'Calmpose 5mg Injection',
 'E-Panta Injection',
 'Cefoder 50mg Dry Syrup',
 'Copod-B Syrup',
 'Ticocin 200mg Injection',
 'Aarpik 20mg Tablet',
 'Xime 50mg Tablet',
 'Ligocain 2% Jelly',
 'Dilgel',
 'Glowzi Cream',
 'Torib 120mg Tablet',
 'Azilee 500mg Tablet',
 'Eridol 50 Tablet',
 'Oflokon 200mg Tablet',
 'Gynodan 200mg Capsule',
 'Ceftazim 250 Injection',
 'Almox C Tablet',
 'Domi

In [33]:
# getting the medicine name from the user
med_name = input(' Enter Medicine name : ')

 Enter Medicine name :  Samstar 2.5mg Tablet


In [34]:
# finding the close match for the medicine given by the user
find_similar_med = difflib.get_close_matches(med_name, all_meds)
print(find_similar_med)

['Samstar 2.5mg Tablet', 'Ramistar 2.5 Tablet', 'Ramisa 2.5mg Tablet']
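
`difflib.get_close_matches` compares strings case-sensitively and, by default, returns at most `n=3` candidates whose similarity ratio is at least `cutoff=0.6`. The sketch below shows one possible way to make the lookup case-insensitive; the helper `best_match` is hypothetical and the short `names` list stands in for `all_meds`.

```python
import difflib

# A short stand-in for the full list of medicine names.
names = ["Samstar 2.5mg Tablet", "Ramistar 2.5 Tablet", "Ramisa 2.5mg Tablet", "Ciclocan Cream"]

def best_match(query, candidates, cutoff=0.6):
    """Return the closest candidate name ignoring case, or None if nothing passes the cutoff."""
    # Map each lowercased name back to its original spelling.
    lowered = {name.lower(): name for name in candidates}
    hits = difflib.get_close_matches(query.strip().lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(best_match("samstar 2.5MG tablet", names))  # Samstar 2.5mg Tablet
print(best_match("zzzz", names))                  # None
```

Returning `None` when nothing passes the cutoff lets the calling code tell the user that no medicine was found instead of failing on an empty list.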


In [35]:
# finding the index of the matched medicines
for medicine_name in find_similar_med:
    idx = df.index[df['name'] == medicine_name].tolist()
    if idx:  # Check if the index is found
        print("Index of", medicine_name, "is", idx[0])
    else:
        print(medicine_name, "not found in DataFrame")

Index of Samstar 2.5mg Tablet is 18
Index of Ramistar 2.5 Tablet is 189
Index of Ramisa 2.5mg Tablet is 1590


## **Full Recommendation System**

In [36]:
import pickle
import difflib

In [37]:
med_name = input("Enter Medicine Name: ")
all_meds = df["name"].tolist()
find_similar_med = difflib.get_close_matches(med_name, all_meds)

print(find_similar_med)

# Finding the index of the matched medicines
med_idx = {}
for medicine_name in find_similar_med:
    idx = df.index[df['name'] == medicine_name].tolist()
    if idx:  # Check if the index is found
        med_idx[medicine_name] = idx[0]
        print("Index of", medicine_name, "is", idx[0])
    else:
        med_idx[medicine_name] = None
        print(medicine_name, "not found in DataFrame")

# Save the name-to-index mapping (med_idx) as a pickle file
with open('med_idx.pkl', 'wb') as f:
    pickle.dump(med_idx, f)


Enter Medicine Name:  Samstar 2.5mg Tablet'


['Samstar 2.5mg Tablet', 'Ramistar 2.5 Tablet', 'Ramisa 2.5mg Tablet']
Index of Samstar 2.5mg Tablet is 18
Index of Ramistar 2.5 Tablet is 189
Index of Ramisa 2.5mg Tablet is 1590
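
The cells above stop once the index of the matched medicine is known. One possible way to turn that index into recommendations is to sort the corresponding row of the similarity matrix. The function `recommend` and the toy data below are hypothetical; the scores are made up for illustration.

```python
def recommend(index, similarity, names, top_n=5):
    """Return the top_n (name, score) pairs most similar to the item at `index`, excluding the item itself."""
    scores = list(enumerate(similarity[index]))
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(names[i], float(score)) for i, score in scores if i != index][:top_n]

# Toy stand-ins for the notebook's `all_meds` and `similarity` (values are made up).
toy_names = ["Med A", "Med B", "Med C", "Med D"]
toy_similarity = [
    [1.0, 0.8, 0.0, 0.3],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.0],
    [0.3, 0.0, 0.0, 1.0],
]

print(recommend(0, toy_similarity, toy_names, top_n=2))
```

In this notebook the same function could presumably be called as `recommend(med_idx[find_similar_med[0]], similarity, all_meds)`, provided that a match was found and its index is not `None`.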
