<a href="https://colab.research.google.com/github/RachelNderitu/Recommender-Systems/blob/main/Doctor_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from geopy.distance import geodesic
import random

## **Step 1: Load Data**
- The code loads two CSV files: one with doctor information (Doctors_Dataset.csv) and one with patient information (Patient_Dataset.csv).

- These files are read into pandas DataFrames using pd.read_csv().

In [4]:
# Load Data
DOCTORS_FILE = "/content/Doctors_Dataset.csv"
PATIENTS_FILE = "/content/Patient_Dataset.csv"

doctors_df = pd.read_csv(DOCTORS_FILE, encoding="ISO-8859-1")
patients_df = pd.read_csv(PATIENTS_FILE, encoding="ISO-8859-1")

## **Step 2: Clean and Transform Data**
- The column names are standardized by stripping whitespace and converting them to lowercase.

- Columns like "specialization", "specialist_needed", and "speciality" are renamed to a consistent name, "specialist".

- A new column named "content" is created for both doctors and patients by combining their specialty and location. This column will be used for TF-IDF (Term Frequency-Inverse Document Frequency) vectorization.

- Latitude and longitude columns are converted to numeric values. Entries that cannot be parsed become NaN, and rows with a missing latitude or longitude are dropped.

In [5]:
# Standardize column names
doctors_df.columns = doctors_df.columns.str.strip().str.lower()
patients_df.columns = patients_df.columns.str.strip().str.lower()

# Rename columns for consistency
column_mapping = {"specialization": "specialist", "specialist_needed": "specialist", "speciality": "specialist"}
doctors_df.rename(columns={k: v for k, v in column_mapping.items() if k in doctors_df.columns}, inplace=True)
patients_df.rename(columns={k: v for k, v in column_mapping.items() if k in patients_df.columns}, inplace=True)
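
As a minimal sketch of the renaming step, the toy frame below shows how stripping, lowercasing, and the mapping bring differently labelled columns to one name. The column names and values in `toy_df` are made up for illustration and are not taken from the real CSV files.

```python
import pandas as pd

# Hypothetical toy data; the real column names come from the CSV files.
toy_df = pd.DataFrame({" Doctor ID ": ["D1"], "Specialization ": ["Neurologist"]})

# Same normalization as above: trim whitespace, then lowercase
toy_df.columns = toy_df.columns.str.strip().str.lower()

# Rename only the keys that are actually present in this frame
column_mapping = {
    "specialization": "specialist",
    "specialist_needed": "specialist",
    "speciality": "specialist",
}
toy_df = toy_df.rename(
    columns={k: v for k, v in column_mapping.items() if k in toy_df.columns}
)

print(list(toy_df.columns))
```

Filtering the mapping by the columns that exist keeps the same dictionary usable for both the doctor and the patient frame.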

In [6]:
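# Create content column for TF-IDF. Fill missing parts first so that one
# missing field does not blank out the whole string.
doctors_df["content"] = (doctors_df["specialist"].fillna("") + " " + doctors_df["location"].fillna("")).str.strip()
patients_df["content"] = (patients_df["specialist"].fillna("") + " " + patients_df["location"].fillna("")).str.strip()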

# Convert coordinates to numeric
doctors_df["latitude"] = pd.to_numeric(doctors_df["latitude"], errors="coerce")
doctors_df["longitude"] = pd.to_numeric(doctors_df["longitude"], errors="coerce")
patients_df["latitude"] = pd.to_numeric(patients_df["latitude"], errors="coerce")
patients_df["longitude"] = pd.to_numeric(patients_df["longitude"], errors="coerce")

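# Drop rows with missing coordinates, then reset the index so that row labels
# match the row positions of the TF-IDF and similarity matrices built later
doctors_df = doctors_df.dropna(subset=["latitude", "longitude"]).reset_index(drop=True)
patients_df = patients_df.dropna(subset=["latitude", "longitude"]).reset_index(drop=True)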
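
The coordinate cleaning can be illustrated on a small frame. This is only a sketch: the rows in `toy_df` are invented, with one unparseable latitude and one missing longitude, to show which rows survive.

```python
import pandas as pd

# Hypothetical raw coordinates, including an unparseable entry and a missing one
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "latitude": ["-1.29", "unknown", "0.52"],
    "longitude": ["36.82", "36.07", None],
})

for col in ("latitude", "longitude"):
    # errors="coerce" turns anything that cannot be parsed into NaN
    toy_df[col] = pd.to_numeric(toy_df[col], errors="coerce")

# Keep only rows where both coordinates are present
clean_df = toy_df.dropna(subset=["latitude", "longitude"])

print(clean_df["name"].tolist())
```

Only row A keeps both coordinates, so rows B and C are dropped.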

## **Step 3: TF-IDF Vectorization**
- The TfidfVectorizer from sklearn.feature_extraction.text is used to convert the "content" column (which contains text descriptions of the specialists and locations) into numerical representations (TF-IDF matrices).

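- The vectorizer is fitted on the doctors' content and then applied to the patients' content, so both TF-IDF matrices share one vocabulary; terms that appear only in patient records are ignored. These matrices are used to calculate the similarity between doctors and patients based on their content.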

In [7]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words="english")
doctor_tfidf_matrix = vectorizer.fit_transform(doctors_df["content"])
patient_tfidf_matrix = vectorizer.transform(patients_df["content"])
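
A small sketch of the fit-then-transform pattern, using invented content strings, shows what the shared vocabulary looks like and what happens to a patient term the doctors never use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical content strings (specialty + location)
doctor_content = ["Neurologist Nairobi", "Cardiologist Kisumu"]
patient_content = ["Neurologist Mombasa"]

vectorizer = TfidfVectorizer(stop_words="english")
doctor_matrix = vectorizer.fit_transform(doctor_content)  # learns the vocabulary
patient_matrix = vectorizer.transform(patient_content)    # reuses that vocabulary

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(patient_matrix.shape, patient_matrix.nnz)
```

Here "mombasa" is absent from the doctors' text, so the patient row has a single non-zero entry (for "neurologist"). In this setup a patient's location only contributes to the similarity when some doctor shares it.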

## **Step 4: Compute Cosine Similarity**
- The cosine_similarity function from sklearn.metrics.pairwise is used to compute the cosine similarity between the patients' TF-IDF matrix and the doctors' TF-IDF matrix. The result has one row per patient and one column per doctor; each entry measures how closely the patient's specialty and location text matches the doctor's.

In [8]:
# Compute Cosine Similarity
similarity_matrix = cosine_similarity(patient_tfidf_matrix, doctor_tfidf_matrix)
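
To make the measure concrete, the sketch below computes cosine similarity by hand with NumPy on invented vectors. The helper `cosine` is hypothetical and only illustrates the formula (dot product divided by the product of the vector lengths); it is not how the notebook computes the matrix.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0

patient_vec = np.array([1.0, 1.0, 0.0])
doctor_same = np.array([2.0, 2.0, 0.0])     # same direction, different length
doctor_partial = np.array([1.0, 0.0, 1.0])  # shares one term with the patient
doctor_other = np.array([0.0, 0.0, 3.0])    # shares no terms with the patient

scores = [cosine(patient_vec, d) for d in (doctor_same, doctor_partial, doctor_other)]
print(scores)
```

The three scores come out at roughly 1, 0.5, and 0: vector length does not matter, only the overlap in weighted terms.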

## **Step 5: Function to Find Nearest Doctors Based on Location and Specialization**
- The function find_nearest_doctors first keeps only the doctors whose specialty matches the patient's (case-insensitive), then computes the distance from the patient to each of them using the geodesic function from the geopy library. It returns the num_nearby closest doctors (5 by default), or an empty list if no doctor has that specialty.

In [9]:
# Function to find nearest doctors based on location and specialization
def find_nearest_doctors(patient_lat, patient_lon, patient_specialist, num_nearby=5):
    matching_doctors = doctors_df[doctors_df["specialist"].str.lower() == patient_specialist.lower()].copy()
    if matching_doctors.empty:
        return []

    matching_doctors["distance_km"] = matching_doctors.apply(
        lambda row: geodesic((patient_lat, patient_lon), (row["latitude"], row["longitude"])).km, axis=1
    )

    return matching_doctors.nsmallest(num_nearby, "distance_km")[["doctor id", "name", "specialist", "location", "distance_km"]].to_dict(orient="records")
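
For readers without geopy, the nearest-first ordering can be sketched with the haversine formula. Note the swap: haversine treats the Earth as a sphere, whereas geopy's geodesic uses an ellipsoid, so the distances differ slightly. The helper `haversine_km`, the coordinates, and the patient location below are all assumptions made for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km on a sphere (an approximation of geodesic)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Hypothetical doctors with approximate city coordinates
doctors = [
    {"name": "A", "latitude": -1.2921, "longitude": 36.8219},  # roughly Nairobi
    {"name": "B", "latitude": -4.0435, "longitude": 39.6682},  # roughly Mombasa
    {"name": "C", "latitude": -0.0917, "longitude": 34.7680},  # roughly Kisumu
]
patient = (-1.3667, 36.6500)  # hypothetical patient location

# Sort doctors by distance from the patient, closest first
nearest = sorted(
    doctors,
    key=lambda d: haversine_km(*patient, d["latitude"], d["longitude"]),
)
nearest_names = [d["name"] for d in nearest[:2]]
print(nearest_names)
```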


## **Step 6: Generate Recommendations**
- For each patient, the code attempts to find doctors who match the patient's required specialty.

- If matching doctors are found, it sorts them by their similarity score (from the cosine similarity matrix) and selects the top k most similar doctors.

- If no doctor matches the specialty, the code falls back on find_nearest_doctors. Because that function also filters by specialty, the fallback returns an empty list in this case, and the patient ends up with no recommendation.

In [10]:
# Generate recommendations
recommendations = {}

top_k = 3
for i, patient in patients_df.iterrows():
    patient_id = patient["patient id"]
    patient_lat, patient_lon = patient["latitude"], patient["longitude"]
    patient_specialist = patient["specialist"].lower()

    # Get doctors that match the specialist requirement
    matching_doctors = doctors_df[doctors_df["specialist"].str.lower() == patient_specialist]

    if not matching_doctors.empty:
        # Get similarity scores for matching doctors only
        matching_indices = matching_doctors.index
        matching_similarities = similarity_matrix[i, matching_indices]

        # Get top-k matching doctors
        top_indices = matching_indices[np.argsort(matching_similarities)[-top_k:][::-1]]
        recommended_doctors = doctors_df.loc[top_indices, ["doctor id", "name", "specialist", "location", "latitude", "longitude"]].to_dict(orient="records")
    else:
        recommended_doctors = []  # No matching specialists found in TF-IDF

    # Calculate distance for each recommended doctor
    for doctor in recommended_doctors:
        doctor_lat, doctor_lon = doctor["latitude"], doctor["longitude"]
        doctor["distance_km"] = geodesic((patient_lat, patient_lon), (doctor_lat, doctor_lon)).km

    # If no strong matches, use nearest specialist-based recommendations
    if not recommended_doctors:
        recommended_doctors = find_nearest_doctors(patient_lat, patient_lon, patient_specialist)

    recommendations[patient_id] = recommended_doctors
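
The top-k selection in the loop relies on an argsort idiom that is easy to misread. A minimal sketch with invented scores and doctor IDs:

```python
import numpy as np

similarities = np.array([0.10, 0.80, 0.35, 0.60])  # hypothetical scores for four doctors
doctor_ids = np.array(["D1", "D2", "D3", "D4"])
top_k = 3

# argsort is ascending, so take the last top_k positions and reverse them
top_positions = np.argsort(similarities)[-top_k:][::-1]
top_ids = doctor_ids[top_positions].tolist()
print(top_ids)

# With fewer than top_k candidates the slice simply returns all of them
few_positions = np.argsort(similarities[:2])[-top_k:][::-1]
print(doctor_ids[few_positions].tolist())
```

The positions returned by argsort refer to the array that was sorted, which is why the loop above maps them back through matching_indices before looking up the doctors.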


## **Step 7: Display a sample**
- A random sample of patient recommendations is displayed. For each patient, the recommended doctors' names, specialties, locations, and distances from the patient are printed.

In [14]:
# Display sample recommendations
sample_patient_ids = random.sample(list(recommendations.keys()), min(10, len(recommendations)))

for patient_id in sample_patient_ids:
    patient_info = patients_df.loc[patients_df["patient id"] == patient_id, ["specialist", "location"]].values[0]
    specialist_needed, patient_location = patient_info

    print(f"\n📌 Patient ID: {patient_id} | Location: {patient_location} | Needs: {specialist_needed}")
    recommended_doctors = recommendations.get(patient_id, [])

    if recommended_doctors:
        for i, doctor in enumerate(recommended_doctors, 1):
            print(f"  {i}. Dr. {doctor['name']} ({doctor['specialist']}) - {doctor['location']} ({doctor['distance_km']:.1f} km away)")
    else:
        print("❌ No suitable doctor found.")


📌 Patient ID: E499 | Location: Ngong | Needs: General Surgeon
  1. Dr. Paul Kiptoo (General Surgeon) - Nairobi (19.0 km away)
  2. Dr. Daniel Njoroge (General Surgeon) - Nairobi (19.0 km away)
  3. Dr. Emily Njoroge (General Surgeon) - Nairobi (19.0 km away)

📌 Patient ID: C555 | Location: Kericho | Needs: General Surgeon
  1. Dr. Paul Kiptoo (General Surgeon) - Nairobi (198.6 km away)
  2. Dr. Daniel Njoroge (General Surgeon) - Nairobi (198.6 km away)
  3. Dr. Emily Njoroge (General Surgeon) - Nairobi (198.6 km away)

📌 Patient ID: D437 | Location: Kajiado | Needs: Neurologist
  1. Dr. Joseph Karanja (Neurologist) - Nairobi (62.8 km away)
  2. Dr. Paul Akinyi (Neurologist) - Kisumu (296.2 km away)
  3. Dr. John Ndegwa (Neurologist) - Mombasa (402.5 km away)

📌 Patient ID: 9887 | Location: Athi River | Needs: Pulmonologist
❌ No suitable doctor found.

📌 Patient ID: 0E9B | Location: Mandera | Needs: Infectious Disease Specialist
  1. Dr. James Mutua (Infectious Disease Specialist) - Na

## **Step 8: Evaluate Recommender System**
Various evaluation metrics are defined to assess the quality of the recommendation system:

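- Precision@k: Measures the fraction of the top k slots filled by relevant doctors (those who have the specialty the patient needs). The code divides by k even when fewer than k doctors were recommended.

- Recall@k: Measures the fraction of all relevant doctors that appear in the top k recommended doctors. Patients with no relevant doctors are skipped.

- Mean Reciprocal Rank (MRR): Averages the reciprocal of the rank at which the first relevant doctor appears, over patients with at least one relevant recommendation.

- Mean Average Precision (MAP): Averages the average precision of each patient's ranked list, over patients with at least one relevant recommendation.

- Geodesic Distance Error: Despite the name, this is the mean geodesic distance (in km) between each patient and their recommended doctors, averaged over patients. Lower values mean closer recommendations.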

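The function weighted_recommendations re-ranks each patient's list by a weighted score: alpha times the closeness term 1 / (1 + distance_km), plus (1 - alpha) times the similarity score. The recommendation records built in Step 6 do not store a similarity value, so the similarity term defaults to 0 and the ranking is driven by distance alone.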

In [15]:
from sklearn.metrics import average_precision_score
import numpy as np

def precision_at_k(recommendations, ground_truth, k=3):
    precision_scores = []
    for patient_id, recommended_doctors in recommendations.items():
        relevant_doctors = set(ground_truth.get(patient_id, []))
        recommended_set = set([doc["doctor id"] for doc in recommended_doctors[:k]])
        precision = len(recommended_set & relevant_doctors) / k
        precision_scores.append(precision)
    return np.mean(precision_scores) if precision_scores else 0

In [16]:
def recall_at_k(recommendations, ground_truth, k=3):
    recall_scores = []
    for patient_id, recommended_doctors in recommendations.items():
        relevant_doctors = set(ground_truth.get(patient_id, []))
        if not relevant_doctors:
            continue  # Skip if no relevant doctors available
        recommended_set = set([doc["doctor id"] for doc in recommended_doctors[:k]])
        recall = len(recommended_set & relevant_doctors) / len(relevant_doctors)
        recall_scores.append(recall)
    return np.mean(recall_scores) if recall_scores else 0

In [17]:
def mean_reciprocal_rank(recommendations, ground_truth):
    reciprocal_ranks = []
    for patient_id, recommended_doctors in recommendations.items():
        relevant_doctors = set(ground_truth.get(patient_id, []))
        for rank, doc in enumerate(recommended_doctors, start=1):
            if doc["doctor id"] in relevant_doctors:
                reciprocal_ranks.append(1 / rank)
                break
    return np.mean(reciprocal_ranks) if reciprocal_ranks else 0

In [18]:
def mean_average_precision(recommendations, ground_truth):
    average_precisions = []
    for patient_id, recommended_doctors in recommendations.items():
        relevant_doctors = set(ground_truth.get(patient_id, []))
        y_true = [1 if doc["doctor id"] in relevant_doctors else 0 for doc in recommended_doctors]
        if sum(y_true) > 0:
            average_precisions.append(average_precision_score(y_true, list(range(len(y_true), 0, -1))))
    return np.mean(average_precisions) if average_precisions else 0
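
A worked example for a single patient may help when reading the four ranking metrics above. The ranked list and relevant set are invented, and the variable names are placeholders. Average precision here is normalized by the number of relevant doctors that appear in the list, which is my understanding of what average_precision_score does when given only the recommended items.

```python
# Hypothetical ranked list for one patient and the set of doctors deemed relevant
recommended = ["D7", "D2", "D9"]
relevant = {"D2", "D9", "D4", "D5"}
k = 3

# One flag per ranked position: is the doctor at this position relevant?
hits = [doc_id in relevant for doc_id in recommended[:k]]

p_at_k = sum(hits) / k              # relevant items among the top k slots
r_at_k = sum(hits) / len(relevant)  # share of all relevant items retrieved

# Reciprocal rank: 1 / position of the first relevant item (0.0 if none)
reciprocal_rank = next(
    (1 / rank for rank, hit in enumerate(hits, start=1) if hit), 0.0
)

# Average precision: mean of precision@i at each position i holding a relevant item
precisions_at_hits = [sum(hits[:i]) / i for i, hit in enumerate(hits, start=1) if hit]
average_precision = (
    sum(precisions_at_hits) / len(precisions_at_hits) if precisions_at_hits else 0.0
)

print(p_at_k, r_at_k, reciprocal_rank, average_precision)
```

With the first slot irrelevant and the next two relevant, precision@3 is 2/3, recall@3 is 2/4, the reciprocal rank is 1/2, and average precision is the mean of 1/2 and 2/3.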

In [19]:
def geodesic_distance_error(recommendations, patients_df):
    distance_errors = []
    for patient_id, recommended_doctors in recommendations.items():
        patient = patients_df[patients_df["patient id"] == patient_id].iloc[0]
        patient_location = (patient["latitude"], patient["longitude"])
        distances = [geodesic(patient_location, (doc["latitude"], doc["longitude"])).km for doc in recommended_doctors]
        if distances:
            distance_errors.append(np.mean(distances))
    return np.mean(distance_errors) if distance_errors else 0

In [20]:
def weighted_recommendations(recommendations, alpha=0.5):
    for patient_id, recommended_doctors in recommendations.items():
        for doc in recommended_doctors:
            doc['score'] = alpha * (1 / (1 + doc['distance_km'])) + (1 - alpha) * doc.get('similarity', 0)
        recommendations[patient_id] = sorted(recommended_doctors, key=lambda x: x['score'], reverse=True)
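
The effect of alpha is easiest to see on two contrasting candidates. The sketch below reuses the same scoring formula in a hypothetical helper, `weighted_score`, with invented distances and similarity values.

```python
def weighted_score(distance_km, similarity, alpha=0.5):
    """Blend closeness, 1 / (1 + distance), with similarity using weight alpha."""
    return alpha * (1 / (1 + distance_km)) + (1 - alpha) * similarity

# Hypothetical candidates: (name, distance in km, similarity)
candidates = [("near_but_dissimilar", 1.0, 0.2), ("far_but_similar", 99.0, 0.9)]

# Rank the same two candidates under a low and a high alpha
ranking_by_alpha = {}
for alpha in (0.2, 0.9):
    ranked = sorted(
        candidates,
        key=lambda c: weighted_score(c[1], c[2], alpha),
        reverse=True,
    )
    ranking_by_alpha[alpha] = [name for name, _, _ in ranked]

print(ranking_by_alpha)
```

A low alpha favours the similar but distant doctor, and a high alpha favours the nearby one. The closeness term shrinks quickly with distance: it is 0.5 at 1 km and 0.01 at 99 km, so doctors who are all far away receive almost identical closeness values.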


## **Step 9: Display Results**
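- The evaluation metrics are calculated and displayed. The ground truth used here is every doctor whose specialty equals the patient's required specialty. Because the recommendations are already filtered to that specialty, every recommended doctor counts as relevant, so MRR and MAP values of 1.0 are expected by construction rather than being evidence of ranking quality. Precision@3 then mainly reflects how many patients received fewer than three recommendations, and Recall@3 reflects how many doctors share each specialty.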

In [21]:
ground_truth = {
    patient["patient id"]: set(doctors_df[doctors_df["specialist"] == patient["specialist"]]["doctor id"])
    for _, patient in patients_df.iterrows()
}

weighted_recommendations(recommendations, alpha=0.7)

precision = precision_at_k(recommendations, ground_truth, k=3)
recall = recall_at_k(recommendations, ground_truth, k=3)
mrr = mean_reciprocal_rank(recommendations, ground_truth)
map_score = mean_average_precision(recommendations, ground_truth)
gde = geodesic_distance_error(recommendations, patients_df)

print(f"Precision@3: {precision:.4f}")
print(f"Recall@3: {recall:.4f}")
print(f"MRR: {mrr:.4f}")
print(f"MAP: {map_score:.4f}")
print(f"Geodesic Distance Error: {gde:.2f} km")


Precision@3: 0.6523
Recall@3: 0.2159
MRR: 1.0000
MAP: 1.0000
Geodesic Distance Error: 229.72 km


## **Step 10: Display Popup Notification**
- A JavaScript alert is triggered to show a popup notification to the user using IPython's Javascript class. This is typically used in a Jupyter notebook environment.

In [23]:
from IPython.display import Javascript, display

def show_popup():
    js_code = 'alert("Welcome! Please check your recommended doctors.");'
    display(Javascript(js_code))

# Call the function to show a browser pop-up
show_popup()

<IPython.core.display.Javascript object>

## **Step 11: Collect User Feedback**
- The system asks the user to rate their satisfaction with the recommendations on a scale from 1 to 5.

- The feedback is saved to a text file (user_feedback.txt) for future analysis or improvement.

In [24]:
# Collecting feedback
feedback = input("How satisfied are you with the recommendations? (Rate 1-5): ")

# Save feedback to a file
with open("user_feedback.txt", "a") as f:
    f.write(f"User rating: {feedback}\n")

print("Thank you for your feedback!")

How satisfied are you with the recommendations? (Rate 1-5): 4
Thank you for your feedback!
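
The cell above stores whatever the user types. A minimal sketch of checking the rating before it is written to the file, using a hypothetical helper `parse_rating` that is not part of the notebook:

```python
def parse_rating(raw, low=1, high=5):
    """Return the rating as an int, or None if it is not a whole number in [low, high]."""
    try:
        value = int(raw.strip())
    except ValueError:
        return None
    return value if low <= value <= high else None

# A few example inputs and how they are interpreted
for raw in ("4", " 5 ", "7", "good"):
    print(repr(raw), "->", parse_rating(raw))
```

A caller could keep prompting until parse_rating returns a value other than None and write only that value to user_feedback.txt.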
