Cosine Similarity

Cosine Similarity is a metric used to measure how similar two vectors are, irrespective of their magnitude. It calculates the cosine of the angle between two non-zero vectors in an inner product space. The formula is:

$$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}$$

Where:

$A \cdot B$ is the dot product of vectors A and B

$\|A\|$ and $\|B\|$ are the magnitudes (or Euclidean norms) of the vectors

Value Range:

1 → Vectors point in the same direction (they need not be identical, since magnitude is ignored)

0 → Vectors are orthogonal (no similarity)

-1 → Vectors are diametrically opposite
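
The formula and the three boundary values above can be checked directly with NumPy. This is a minimal sketch; the helper name `cosine_sim_np` and the sample vectors are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

def cosine_sim_np(a, b):
    """Cosine of the angle between two non-zero 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        # The formula divides by the norms, so zero vectors are excluded
        raise ValueError("cosine similarity is undefined for zero vectors")
    return float(np.dot(a, b) / norm_product)

same_direction = cosine_sim_np([1, 2, 3], [2, 4, 6])  # parallel, different length
orthogonal = cosine_sim_np([1, 0], [0, 1])            # right angle
opposite = cosine_sim_np([1, 2], [-1, -2])            # reversed direction
print(same_direction, orthogonal, opposite)
```

The first pair differs in magnitude but not in direction, which is why the score stays at the top of the range.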

Use Cases:

Text similarity in NLP

Recommendation systems

Clustering and classification

It’s especially useful in high-dimensional spaces, like comparing TF-IDF vectors in document similarity tasks.
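
As a small illustration of the TF-IDF case, the sketch below compares three made-up sentences. The `docs` list and the variable names are assumptions for demonstration only; they are not taken from the drug review dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy documents: the first two share most of their vocabulary
docs = [
    "acid reflux and heartburn",
    "heartburn and acid reflux after meals",
    "seasonal allergies and sneezing",
]

tfidf_toy = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one row per document
sim_toy = cosine_similarity(tfidf_toy)             # 3 x 3 pairwise similarity matrix
print(sim_toy.round(2))
```

Because TF-IDF weights are non-negative, the scores here fall between 0 and 1 rather than spanning the full range down to -1.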

In [2]:
import pandas as pd

# Load the dataset (update path accordingly)
df = pd.read_csv("drugLibTrain_raw.tsv", sep="\t")

# Display basic info
print(df.info())

# Show first few rows
print(df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3107 entries, 0 to 3106
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         3107 non-null   int64 
 1   urlDrugName        3107 non-null   object
 2   rating             3107 non-null   int64 
 3   effectiveness      3107 non-null   object
 4   sideEffects        3107 non-null   object
 5   condition          3106 non-null   object
 6   benefitsReview     3089 non-null   object
 7   sideEffectsReview  3032 non-null   object
 8   commentsReview     3095 non-null   object
dtypes: int64(2), object(7)
memory usage: 218.6+ KB
None
   Unnamed: 0       urlDrugName  rating         effectiveness  \
0        2202         enalapril       4      Highly Effective   
1        3117  ortho-tri-cyclen       1      Highly Effective   
2        1146           ponstel      10      Highly Effective   
3        3947          prilosec       3  Marginally Effectiv

In [3]:
# Check for missing values
print(df.isnull().sum())


Unnamed: 0            0
urlDrugName           0
rating                0
effectiveness         0
sideEffects           0
condition             1
benefitsReview       18
sideEffectsReview    75
commentsReview       12
dtype: int64


In [4]:
df.fillna({"condition": "Unknown",
           "benefitsReview": "No Review",
           "sideEffectsReview": "No Review",
           "commentsReview": "No Review"}, inplace=True)

In [5]:
print(df.isnull().sum())

Unnamed: 0           0
urlDrugName          0
rating               0
effectiveness        0
sideEffects          0
condition            0
benefitsReview       0
sideEffectsReview    0
commentsReview       0
dtype: int64


In [6]:
import nltk
import re
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(r"[^a-zA-Z0-9]", " ", text) # Remove special characters
    text = " ".join([word for word in text.split() if word not in stop_words]) # Remove stopwords
    return text

df["cleaned_benefits"] = df["benefitsReview"].apply(clean_text)
df["cleaned_sideEffects"] = df["sideEffectsReview"].apply(clean_text)
df["cleaned_comments"] = df["commentsReview"].apply(clean_text)

print(df[["cleaned_benefits", "cleaned_sideEffects", "cleaned_comments"]].head())

[nltk_data] Downloading package stopwords to C:\Users\Arman
[nltk_data]     Gusain/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                    cleaned_benefits  \
0  slowed progression left ventricular dysfunctio...   
1  although type birth control cons pros help cra...   
2  used cramps badly would leave balled bed least...   
3  acid reflux went away months days drug heartbu...   
4  think lyrica starting help pain side effects s...   

                                 cleaned_sideEffects  \
0  cough hypotension proteinuria impotence renal ...   
1  heavy cycle cramps hot flashes fatigue long la...   
2                   heavier bleeding clotting normal   
3  constipation dry mouth mild dizziness would go...   
4  felt extremely drugged dopey could drive med a...   

                                    cleaned_comments  
0  monitor blood pressure weight asses resolution...  
1            hate birth control would suggest anyone  
2  took 2 pills onset menstrual cramps every 8 12...  
3  given prilosec prescription dose 45mg per day ...  
4                                                se
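
The cleaning function above calls `text.lower()`, so it relies on the earlier `fillna` step having removed every missing value. A defensive variant might look like the sketch below. It swaps in scikit-learn's built-in `ENGLISH_STOP_WORDS` list instead of the NLTK list so that no download is needed, which means the removed words can differ slightly; `clean_text_safe` and the sample series are hypothetical names.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_text_safe(text):
    """Lowercase, strip non-alphanumerics, drop stop words; tolerate missing values."""
    if not isinstance(text, str):
        return ""  # None / NaN become an empty string instead of raising
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)

sample = pd.Series(["The headache AND nausea!", None, float("nan")])
cleaned_sample = sample.apply(clean_text_safe)
print(cleaned_sample.tolist())
```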

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Combine important features
df["combined_features"] = df["condition"] + " " + df["effectiveness"]

# Convert text data into numerical form using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["combined_features"].fillna(""))

# Compute similarity scores
cosine_sim = cosine_similarity(tfidf_matrix)



In [8]:
# Function to get similar drugs
def recommend_drug(drug_name, df, cosine_sim):
    idx = df[df["urlDrugName"] == drug_name].index[0] # Get index of drug
    sim_scores = list(enumerate(cosine_sim[idx])) # Get similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True) # Sort by similarity
    sim_scores = sim_scores[1:6] # Get top 5 similar drugs
    drug_indices = [i[0] for i in sim_scores]
    return df.iloc[drug_indices][["urlDrugName", "condition", "effectiveness"]]

# Example usage
print(recommend_drug("prilosec", df, cosine_sim))

     urlDrugName    condition         effectiveness
2699    prevacid  acid reflux  Marginally Effective
87        zantac  acid reflux      Highly Effective
135     protonix  acid reflux      Highly Effective
319     protonix  acid reflux      Highly Effective
669     prilosec  acid reflux      Highly Effective
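
The output above lists protonix twice and includes prilosec itself, because each row is one review and several reviews can share the same drug name and the same text. One possible way to return distinct drugs is sketched below on a toy frame. The function `recommend_unique`, the `toy` data, and the extra `score` column are assumptions for illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the review table: one row per review, names can repeat
toy = pd.DataFrame({
    "urlDrugName": ["prilosec", "prevacid", "protonix", "protonix", "prilosec", "claritin"],
    "condition": ["acid reflux"] * 5 + ["allergies"],
    "effectiveness": ["Marginally Effective", "Marginally Effective", "Highly Effective",
                      "Highly Effective", "Highly Effective", "Highly Effective"],
})
toy_sim = cosine_similarity(
    TfidfVectorizer().fit_transform(toy["condition"] + " " + toy["effectiveness"])
)

def recommend_unique(drug_name, df, sim, top_n=5):
    """Rank rows by similarity to the first review of drug_name, one row per drug."""
    matches = df.index[df["urlDrugName"] == drug_name]
    if len(matches) == 0:
        return pd.DataFrame(columns=["urlDrugName", "condition", "effectiveness", "score"])
    pos = df.index.get_loc(matches[0])  # positional row in the similarity matrix
    ranked = df.assign(score=sim[pos]).sort_values("score", ascending=False, kind="stable")
    ranked = ranked[ranked["urlDrugName"] != drug_name]    # drop the query drug itself
    ranked = ranked.drop_duplicates(subset="urlDrugName")  # keep the best row per drug
    return ranked.head(top_n)[["urlDrugName", "condition", "effectiveness", "score"]]

toy_recs = recommend_unique("prilosec", toy, toy_sim)
print(toy_recs)
```

Filtering by name rather than skipping the first sorted entry avoids relying on the query row being ranked first, which is not guaranteed when several rows tie at the same score.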


In [9]:
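# Ordinal encoding; "Considerably Effective" also occurs in this dataset,
# and any label missing from the mapping would become NaN under .map()
effectiveness_mapping = {
    "Highly Effective": 4,
    "Considerably Effective": 3,
    "Moderately Effective": 2,
    "Marginally Effective": 1,
    "Ineffective": 0
}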

df["effectiveness_score"] = df["effectiveness"].map(effectiveness_mapping)
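
`Series.map` with a dictionary returns NaN for every label that has no key, so it can help to list unmapped labels right after encoding. The sketch below uses a deliberately incomplete, hypothetical mapping and three sample labels to show the check; `levels`, `partial_mapping`, and `unmapped` are illustrative names.

```python
import pandas as pd

levels = pd.Series(["Highly Effective", "Considerably Effective", "Ineffective"])

# Deliberately incomplete mapping, for demonstration only
partial_mapping = {
    "Highly Effective": 3,
    "Moderately Effective": 2,
    "Marginally Effective": 1,
    "Ineffective": 0,
}

scores = levels.map(partial_mapping)          # labels without a key become NaN
unmapped = sorted(set(levels[scores.isna()]))
print(unmapped)
```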

In [10]:
print(df[["sideEffectsReview", "commentsReview"]].dtypes)

sideEffectsReview    object
commentsReview       object
dtype: object


In [11]:
df["sideEffectsReview"] = pd.to_numeric(df["sideEffectsReview"], errors="coerce")
df["commentsReview"] = pd.to_numeric(df["commentsReview"], errors="coerce")
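
Both columns hold free-text reviews, as the cleaned samples printed earlier show, so it is worth checking what `errors="coerce"` does to such values before relying on the result. A minimal sketch on made-up strings (the `sample_reviews` series is an assumption, not data from the file):

```python
import pandas as pd

sample_reviews = pd.Series(["mild nausea for two days", "3", "No Review"])

# Anything that cannot be parsed as a number is replaced by NaN
coerced = pd.to_numeric(sample_reviews, errors="coerce")
print(coerced.tolist())
print(coerced.isna().mean())  # share of values lost to coercion
```

If nearly every value comes back as NaN, the column carries little numeric signal, and the median computed from it may itself be NaN.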

In [12]:
df["sideEffectsReview"] = df["sideEffectsReview"].fillna(df["sideEffectsReview"].median())
df["commentsReview"] = df["commentsReview"].fillna(df["commentsReview"].median())


In [13]:
from sklearn.preprocessing import MinMaxScaler

# Fill NaN values with 0 or a neutral value before scaling
df[["sideEffectsReview", "commentsReview"]] = df[["sideEffectsReview", "commentsReview"]].fillna(0)

# Now apply MinMaxScaler
scaler = MinMaxScaler()
df[["sideEffectsReview", "commentsReview"]] = scaler.fit_transform(df[["sideEffectsReview", "commentsReview"]])
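
For reference, `MinMaxScaler` rescales each column independently to the range 0 to 1. The sketch below, on a made-up array, also shows the edge case of a column whose values are all equal, which is what a column filled entirely with 0 would be.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First column varies, second column is constant
X_demo = np.array([[1.0, 0.0],
                   [2.0, 0.0],
                   [3.0, 0.0]])

scaled_demo = MinMaxScaler().fit_transform(X_demo)
print(scaled_demo)
```

A constant column stays at 0 after scaling, so it adds no information to any similarity computed later.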



In [15]:
import numpy as np

# Convert numerical features to a NumPy array
numerical_features = df[["effectiveness_score", "sideEffectsReview", "commentsReview"]].values

# Combine TF-IDF matrix with numerical features
combined_features = np.hstack((tfidf_matrix.toarray(), numerical_features))
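
Calling `.toarray()` turns the sparse TF-IDF matrix into a dense one, which can use a lot of memory for large vocabularies. One alternative is to keep everything sparse with `scipy.sparse.hstack`, as in the sketch below. The toy `texts` and `numeric` arrays are assumptions; in the notebook the numeric block would likely need its NaN values filled first, since scikit-learn's `cosine_similarity` rejects NaN input.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "acid reflux Highly Effective",
    "acid reflux Ineffective",
    "allergies Highly Effective",
]
numeric = np.array([[3.0], [0.0], [3.0]])  # e.g. an ordinal effectiveness score

text_part = TfidfVectorizer().fit_transform(texts)                  # sparse text features
combined_sparse = hstack([text_part, csr_matrix(numeric)]).tocsr()  # still sparse
combined_sim = cosine_similarity(combined_sparse)
print(combined_sparse.shape, combined_sim.round(2))
```

Note that an unscaled numeric column with values up to 3 can dominate TF-IDF weights that are at most 1, so scaling choices affect the resulting similarities.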

In [16]:
df["combined_features"] = df["combined_features"].fillna("")

In [17]:
print(df["combined_features"].isnull().sum())

0


In [18]:
print(df["combined_features"].apply(lambda x: x if pd.notna(x) else "NaN found").unique())

['management of congestive heart failure Highly Effective'
 'birth prevention Highly Effective' 'menstrual cramps Highly Effective'
 ... 'hives, itching, swelling due to allergies Highly Effective'
 'muscle relaxant - spinal disorder Considerably Effective'
 'total hysterctomy Marginally Effective']


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: Convert text to numerical representation
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["combined_features"])  # Convert text into numerical vectors

# Step 2: Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix)  # Now compute similarity correctly

In [20]:
def recommend_drug(drug_name, df, cosine_sim):
    idx = df[df["urlDrugName"] == drug_name].index[0] # Get index of the drug
    sim_scores = list(enumerate(cosine_sim[idx])) # Get similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True) # Sort by similarity
    sim_scores = sim_scores[1:6] # Get top 5 similar drugs

    drug_indices = [i[0] for i in sim_scores]
    return df.iloc[drug_indices][["urlDrugName", "condition", "effectiveness"]]

In [21]:
print(recommend_drug("prilosec", df, cosine_sim))

     urlDrugName    condition         effectiveness
2699    prevacid  acid reflux  Marginally Effective
87        zantac  acid reflux      Highly Effective
135     protonix  acid reflux      Highly Effective
319     protonix  acid reflux      Highly Effective
669     prilosec  acid reflux      Highly Effective


In [22]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: Drop nulls and combine text
df_filtered = df.dropna(
    subset=['condition', 'urlDrugName', 'commentsReview', 'benefitsReview', 'sideEffectsReview']
).copy()
df_filtered['full_review'] = (
    df_filtered['commentsReview'].astype(str) + " "
    + df_filtered['benefitsReview'].astype(str) + " "
    + df_filtered['sideEffectsReview'].astype(str)
)

In [23]:
# Step 2: Create TF-IDF matrix
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_filtered['full_review'])

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix)

In [25]:
# Step 3: Function to recommend medicine based on disease
def recommend_medicine_by_disease(disease_name, top_n=5):
    # Filter entries with matching disease
    disease_df = df_filtered[df_filtered['condition'].str.lower() == disease_name.lower()]
   
    if disease_df.empty:
        print("Disease not found in dataset.")
        return
   
    # Use average of all matching disease reviews
    disease_reviews_tfidf = tfidf.transform(disease_df['full_review'])
    avg_vector = np.asarray(disease_reviews_tfidf.mean(axis=0)).ravel()
   
    # Compute cosine similarity with all medicines
    similarities = cosine_similarity(avg_vector.reshape(1, -1), tfidf_matrix).flatten()
    similar_indices = similarities.argsort()[-top_n:][::-1]
   
    # Get recommended medicine names
    recommended_medicines = df_filtered.iloc[similar_indices]['urlDrugName'].unique()
   
    print(f"Recommended medicines for '{disease_name}':")
    for med in recommended_medicines:
        print(f"- {med}")
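
The averaging idea in the function above can be seen in isolation on a toy table. The sketch below is a simplified variant that returns a list instead of printing; the `reviews` frame, `drugs_for_condition`, and the other names are hypothetical. As in the original, similarity is computed against every review, so drugs reviewed for other conditions can appear when their wording overlaps.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = pd.DataFrame({
    "urlDrugName": ["metformin", "glucophage", "claritin", "zyrtec"],
    "condition": ["diabetes", "diabetes", "allergies", "allergies"],
    "full_review": [
        "lowered blood sugar levels steadily",
        "blood sugar under control some nausea",
        "stopped sneezing and itchy eyes",
        "less sneezing but felt drowsy",
    ],
})
toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(reviews["full_review"])

def drugs_for_condition(condition, top_n=2):
    """Return drug names whose reviews are closest to the condition's average review."""
    mask = (reviews["condition"].str.lower() == condition.lower()).to_numpy()
    if not mask.any():
        return []
    rows = np.flatnonzero(mask)                              # positions of matching reviews
    centroid = np.asarray(toy_matrix[rows].mean(axis=0))     # 1 x vocabulary average vector
    scores = cosine_similarity(centroid, toy_matrix).ravel()
    top = np.argsort(scores)[::-1][:top_n]                   # highest scores first
    return list(pd.unique(reviews["urlDrugName"].to_numpy()[top]))

print(drugs_for_condition("Diabetes"))
print(drugs_for_condition("gout"))
```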

In [26]:
recommend_medicine_by_disease("Diabetes")

Recommended medicines for 'Diabetes':
- metformin
- glucophage
- byetta


In [27]:
recommend_medicine_by_disease("Cold")

Recommended medicines for 'Cold':
- claritin
- celexa
- zyrtec-d
- allegra
- valtrex


In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer again
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['condition'].astype(str))

In [30]:
import joblib

# Save vectorizer
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer_cosine.pkl')

# Save the DataFrame that the vectorizer was fitted on
df.to_csv('cosine_similarity_data.csv', index=False)
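
A saved vectorizer and CSV are typically reloaded together in a separate script or app. The round trip might look like the sketch below, which writes to a temporary directory with demo file names rather than the files saved above; `demo_df`, `demo_vectorizer`, and the query string are assumptions.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

demo_df = pd.DataFrame({
    "urlDrugName": ["prilosec", "claritin"],
    "condition": ["acid reflux", "seasonal allergies"],
})
demo_vectorizer = TfidfVectorizer().fit(demo_df["condition"])

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "tfidf_vectorizer_demo.pkl")
    csv_path = os.path.join(tmp, "cosine_data_demo.csv")
    joblib.dump(demo_vectorizer, vec_path)
    demo_df.to_csv(csv_path, index=False)

    # Later, possibly in another process: load both artifacts back
    loaded_vectorizer = joblib.load(vec_path)
    loaded_df = pd.read_csv(csv_path)

# Rebuild the matrix from the reloaded data and score a new query against it
matrix = loaded_vectorizer.transform(loaded_df["condition"].astype(str))
query_vec = loaded_vectorizer.transform(["acid reflux"])
best_row = cosine_similarity(query_vec, matrix).ravel().argmax()
best_match = loaded_df.iloc[best_row]["urlDrugName"]
print(best_match)
```

Only the fitted vectorizer is pickled here, so the TF-IDF matrix is recomputed with `transform` after loading; the column used must be the same one the vectorizer was fitted on.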