# Medicare Physician Similarity Analysis

Find the 10 nearest neighbors for the following physicians:
 - 1568588127 - Santhosh Mathews
 - 1528283249 - Brett Gidney

<br>

**Author:** Rohit Ganji<br>
**Date:** 01/15/2024<br>
**Dataset:** [Medicare Physician and Other Practitioners](https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service/data)

## Introduction
The Medicare Physician and Other Practitioners dataset is a comprehensive collection of information regarding the services and procedures provided to Medicare beneficiaries by physicians and other healthcare practitioners.

## Key Features
**Rndrng_NPI:** Physician identified by their National Provider Identifier (NPI).<br>
**HCPCS_Cd:** Procedure coded using the Healthcare Common Procedure Coding System (HCPCS).<br>
**Tot_Srvcs:** Total number of times a particular procedure was performed.<br>

In [1]:
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
from sklearn.decomposition import TruncatedSVD
import warnings
warnings.filterwarnings('ignore')

I first load the complete dataset from the source file and extract the columns needed for the analysis.

In [2]:
dataset = pd.read_csv('/Users/rohitganji/Downloads/Medicare_Physician_Other_Practitioners_by_Provider_and_Service_2021.csv')
df = dataset[["Rndrng_NPI", "HCPCS_Cd", "Tot_Srvcs"]]
names = dataset[["Rndrng_NPI", "Rndrng_Prvdr_Last_Org_Name", "Rndrng_Prvdr_First_Name"]].drop_duplicates()
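
The load above reads every column of the file before selecting three of them. As a minimal sketch, assuming memory is the constraint, `pd.read_csv` can be told which columns to read and how to type them. The helper name `load_services` and the two-row in-memory CSV are hypothetical and only illustrate the call; the name columns used for `names` would need to be added to `COLUMNS` if they are wanted.

```python
import io

import pandas as pd

# Columns used for the similarity analysis (taken from the notebook)
COLUMNS = ["Rndrng_NPI", "HCPCS_Cd", "Tot_Srvcs"]


def load_services(source):
    """Hypothetical helper: read only the analysis columns with explicit dtypes."""
    return pd.read_csv(
        source,
        usecols=COLUMNS,
        # HCPCS codes mix digits and letters (e.g. G2066), so keep them as text
        dtype={"Rndrng_NPI": "int64", "HCPCS_Cd": str, "Tot_Srvcs": "float64"},
    )


# Tiny stand-in for the real CSV; the values are made up for illustration
sample_csv = io.StringIO(
    "Rndrng_NPI,Rndrng_Prvdr_Last_Org_Name,HCPCS_Cd,Tot_Srvcs\n"
    "1000000001,Doe,99213,191\n"
    "1000000001,Doe,G2066,12.5\n"
)
toy = load_services(sample_csv)
print(toy)
```

Typing `HCPCS_Cd` as text up front avoids a column that is partly parsed as numbers and partly as strings.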

In [3]:
df

,Rndrng_NPI,HCPCS_Cd,Tot_Srvcs
0,1003000126,99213,191.0
1,1003000126,99214,47.0
2,1003000126,99217,39.0
3,1003000126,99220,21.0
4,1003000126,99222,12.0
...,...,...,...
9886172,1992999825,99214,152.0
9886173,1992999874,99223,51.0
9886174,1992999874,99232,259.0
9886175,1992999874,99233,606.0


## Data Exploration and Cleaning

Let's explore the columns and check for null values.

In [4]:
print(df["Rndrng_NPI"].value_counts())

Rndrng_NPI
1538144910    656
1891731626    640
1932145778    622
1063497451    613
1366479099    603
             ... 
1528180171      1
1528180197      1
1760559272      1
1528180361      1
1821491598      1
Name: count, Length: 1123589, dtype: int64


In [5]:
df["HCPCS_Cd"].value_counts()

HCPCS_Cd
99213    471853
99214    471650
99204    221681
99232    171455
99203    170993
          ...  
26952         1
66600         1
54220         1
50553         1
64630         1
Name: count, Length: 6253, dtype: int64

In [6]:
df["Tot_Srvcs"].value_counts()

Tot_Srvcs
12.0        305871
13.0        295450
11.0        286231
14.0        278838
15.0        261066
             ...  
13390.7          1
1817.4           1
19943.3          1
128760.0         1
1726.1           1
Name: count, Length: 30816, dtype: int64

In [7]:
df.isnull().sum()

Rndrng_NPI    0
HCPCS_Cd      0
Tot_Srvcs     0
dtype: int64

## Data Sampling and Nearest Neighbors Computation

The Medicare Physician and Other Practitioners dataset is computationally demanding: it covers 1,123,589 physicians (about 1.1 million) and 6,253 distinct procedures. A full physician-by-procedure matrix would therefore have 1,123,589 rows and 6,253 columns, roughly 7 billion cells, which is impractical to build through a dense pivot table.
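
To make the size concrete, the arithmetic can be worked out directly. This is a back-of-the-envelope sketch: the helper `dense_bytes` is hypothetical, and it assumes every cell is stored as an 8-byte float, which is what a dense pivot of `Tot_Srvcs` would typically produce.

```python
def dense_bytes(n_rows: int, n_cols: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense n_rows x n_cols array of itemsize-byte values."""
    return n_rows * n_cols * itemsize


n_physicians = 1_123_589  # distinct NPIs reported by value_counts()
n_procedures = 6_253      # distinct HCPCS codes reported by value_counts()

cells = n_physicians * n_procedures
gigabytes = dense_bytes(n_physicians, n_procedures) / 1e9
print(f"{cells:,} cells, about {gigabytes:.1f} GB as float64")
# 7,025,802,017 cells, about 56.2 GB as float64
```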

**Strategy for Efficient Analysis**<br>
To manage this, a targeted sampling approach is adopted: only rows for procedures that the target physician performs are kept, so the candidate pool is limited to physicians who share at least one procedure with the target. This shrinks the dataset to a manageable size while retaining every physician with a nonzero similarity to the target.
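
The filter can be illustrated on a toy table. The NPIs and service counts below are made up; only the column names come from the dataset.

```python
import pandas as pd

# Toy data: physician 1 is the target, 2 shares one code with it, 3 shares none
toy = pd.DataFrame({
    "Rndrng_NPI": [1, 1, 2, 2, 3],
    "HCPCS_Cd": ["93000", "99213", "99213", "11042", "11042"],
    "Tot_Srvcs": [40.0, 120.0, 300.0, 25.0, 60.0],
})

target = 1
target_codes = toy.loc[toy["Rndrng_NPI"] == target, "HCPCS_Cd"].unique()

# Keep only rows whose procedure code the target also performs
sampled = toy[toy["HCPCS_Cd"].isin(target_codes)]
kept_npis = sorted(sampled["Rndrng_NPI"].unique().tolist())
print(kept_npis)
```

Note that the filter works on rows, not physicians: physician 2 stays in the sample, but its `11042` row is dropped, so similarity is later measured only over the target's own codes.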

**Nearest Neighbors Computation**<br>
After building a sparse matrix from the sampled data, cosine similarity is computed between the target physician and every other physician in the sample. The 10 physicians with the highest similarity, based on their procedure profiles, are reported as the nearest neighbors.
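
As a small illustration of the ranking step, the sketch below applies cosine similarity to four hypothetical physicians (`A` to `D`) with three hypothetical procedure codes. It assumes the same `cosine_similarity` function from scikit-learn that the notebook imports.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are physicians, columns are service counts per procedure (made-up values)
profiles = pd.DataFrame(
    [[10.0, 0.0, 5.0], [20.0, 0.0, 10.0], [0.0, 7.0, 0.0], [10.0, 1.0, 5.0]],
    index=["A", "B", "C", "D"],
    columns=["code_1", "code_2", "code_3"],
)

# Similarity of target "A" against every row, then drop "A" itself
sims = cosine_similarity(profiles.loc[["A"]].to_numpy(), profiles.to_numpy())[0]
ranking = pd.Series(sims, index=profiles.index).drop("A")
top2 = ranking.nlargest(2)
print(top2)
```

`B` ranks first because it has the same mix of procedures as `A` at twice the volume; cosine similarity compares the shape of a profile, not its size.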

In [4]:
target_NPI = 1568588127 # Santhosh Mathews

In [5]:
# Extract unique procedures performed by the target physician
unique_procedures = df[df["Rndrng_NPI"] == target_NPI]["HCPCS_Cd"].unique()
unique_procedures

array(['33361', '37184', '37185', '37224', '37228', '37252', '37253',
       '75625', '75710', '75716', '76937', '78452', '92928', '92933',
       '92973', '92978', '92979', '93000', '93010', '93016', '93018',
       '93280', '93294', '93296', '93298', '93306', '93308', '93321',
       '93325', '93458', '93459', '93460', '93571', '93798', '93880',
       '93925', '93970', '93978', '99204', '99213', '99214', '99222',
       '99223', '99232', 'G2066'], dtype=object)

In [6]:
len(unique_procedures)

45

In [7]:
# Retrieve the name of the target physician
target_name = " ".join(names[names["Rndrng_NPI"] == target_NPI].values[0][::-1][:2])
print(f"{target_NPI} ({target_name}) performs {len(unique_procedures)} unique procedures.")

1568588127 (Santhosh Mathews) performs 45 unique procedures.


In [11]:
# Sample the dataset for only those procedures performed by the target physician
sampled_df = df[df['HCPCS_Cd'].isin(unique_procedures)]
sampled_df

,Rndrng_NPI,HCPCS_Cd,Tot_Srvcs
0,1003000126,99213,191.0
1,1003000126,99214,47.0
4,1003000126,99222,12.0
5,1003000126,99223,18.0
8,1003000126,99232,204.0
...,...,...,...
9886170,1992999825,99204,71.0
9886171,1992999825,99214,79.0
9886172,1992999825,99214,152.0
9886173,1992999874,99223,51.0


In [12]:
sampled_df['Rndrng_NPI'].value_counts()

Rndrng_NPI
1568588127    47
1720180433    40
1831484773    39
1396788022    38
1245288505    37
              ..
1740358886     1
1740359017     1
1740359298     1
1740359363     1
1780115766     1
Name: count, Length: 696528, dtype: int64

In [13]:
sampled_df['Tot_Srvcs'].value_counts()

Tot_Srvcs
13.0      40243
12.0      39969
14.0      38692
15.0      37242
11.0      36577
          ...  
3647.0        1
3650.0        1
3174.0        1
3703.0        1
2159.0        1
Name: count, Length: 4062, dtype: int64

In [14]:
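# Create a sparse matrix from the sampled dataset
# Note: pivot_table defaults to aggfunc='mean', so an NPI that lists the same
# code on several rows gets the average of those rows, not their total.
pivot = sampled_df.pivot_table(index='Rndrng_NPI', columns='HCPCS_Cd', values='Tot_Srvcs', fill_value=0)
sparse_matrix = sparse.csr_matrix(pivot.values)
pivot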

Rndrng_NPI,33361,37184,37185,37224,37228,37252,37253,75625,75710,75716,...,93925,93970,93978,99204,99213,99214,99222,99223,99232,G2066
1003000126,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,191.0,47.0,12.0,18.0,204.0,0.0
1003000142,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,57.0,150.0,203.0,0.0,0.0,0.0,0.0
1003000480,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,21.0,0.0,0.0,0.0,0.0,0.0
1003000530,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,90.0,735.0,0.0,0.0,0.0,0.0
1003000597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,172.0,192.0,716.0,48.0,74.0,16.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1992999148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1992999270,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,17.0,58.0,0.0,115.0,0.0
1992999551,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,342.0,327.0,0.0,0.0,0.0,0.0
1992999825,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,71.5,0.0,115.5,0.0,0.0,0.0,0.0
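
The fractional values in the last row (71.5 and 115.5 for NPI 1992999825) come from `pivot_table` averaging rows that share an NPI and a code. The sketch below reproduces the 115.5 from the two `99214` rows shown earlier (79 and 152) and contrasts it with `aggfunc="sum"`. Whether the mean or the total is the better input is a modeling choice; summing is shown only as an alternative, not as the method used in this notebook.

```python
import pandas as pd

# Three rows copied from the sampled output above
toy = pd.DataFrame({
    "Rndrng_NPI": [1992999825, 1992999825, 1992999874],
    "HCPCS_Cd": ["99214", "99214", "99223"],
    "Tot_Srvcs": [79.0, 152.0, 51.0],
})

kwargs = dict(index="Rndrng_NPI", columns="HCPCS_Cd", values="Tot_Srvcs", fill_value=0)
mean_pivot = toy.pivot_table(**kwargs)                # default aggfunc="mean"
sum_pivot = toy.pivot_table(aggfunc="sum", **kwargs)  # totals duplicate rows instead

print(mean_pivot.loc[1992999825, "99214"])  # 115.5
print(sum_pivot.loc[1992999825, "99214"])   # 231.0
```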


In [15]:
# Normalize the rows in the sparse matrix
normalized_matrix = normalize(sparse_matrix, norm='l2', axis=1)
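
A short check of what row-wise L2 normalization does, on made-up counts: once every row has unit length, a plain dot product between two rows equals their cosine similarity. Since `cosine_similarity` normalizes its inputs itself, the explicit `normalize` call mainly affects the decomposition step that follows.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Row 1 is row 0 at double the volume; row 2 uses a different procedure
counts = sparse.csr_matrix(
    np.array([[3.0, 4.0, 0.0], [6.0, 8.0, 0.0], [0.0, 0.0, 5.0]])
)

unit_rows = normalize(counts, norm="l2", axis=1)
dot_products = (unit_rows @ unit_rows.T).toarray()  # dot products of unit rows
cosines = cosine_similarity(counts)                 # cosine on the raw counts

print(np.allclose(dot_products, cosines))
```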

In [16]:
# Apply truncated SVD for dimensionality reduction (PCA-like, but the data is not centered)
svd = TruncatedSVD(n_components=20)
reduced_matrix = svd.fit_transform(normalized_matrix)
target_row = reduced_matrix[pivot.index.get_loc(target_NPI)]
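
One way to judge whether 20 components is a reasonable choice is to look at the cumulative `explained_variance_ratio_` of the fitted `TruncatedSVD`. The sketch below uses a random stand-in matrix with 45 columns, not the Medicare data, so the numbers it prints say nothing about the real dataset. It also passes `random_state`, which the notebook does not; this is an assumption added so the default randomized solver gives repeatable results.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse stand-in: 200 rows, 45 columns, roughly 10% nonzero entries
rng = np.random.default_rng(0)
dense = rng.random((200, 45))
dense[dense < 0.9] = 0.0
toy_matrix = sparse.csr_matrix(dense)

svd = TruncatedSVD(n_components=20, random_state=0)
reduced = svd.fit_transform(toy_matrix)

# Share of variance retained by the first k components, for k = 1..20
cumulative = np.cumsum(svd.explained_variance_ratio_)
print(reduced.shape, round(float(cumulative[-1]), 3))
```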

In [17]:
# Compute cosine similarity between the target physician and all others
cosine_similarities = cosine_similarity(target_row.reshape(1, -1), reduced_matrix)
similarity_series = pd.Series(cosine_similarities[0], index=pivot.index)
similarity_series.drop(target_NPI, inplace=True)

In [18]:
print("Top 10 nearest neighbors:")
for neighbor_NPI in similarity_series.nlargest(10).index:
    neighbor_name = " ".join(names[names["Rndrng_NPI"] == neighbor_NPI].values[0][::-1][:2])
    print(f"{neighbor_NPI} - {neighbor_name}")

Top 10 nearest neighbors:
1841441474 - James Nguyen
1235390535 - Michael Cammarata
1265501985 - Biren Parikh
1194799502 - George Daniel
1184865982 - Aman Saw
1871502492 - Bhaskar Reddy
1093791287 - Bruce Lipskind
1033110234 - John Pappas
1194771188 - Joel Garcia
1629030127 - Hoshedar Tamboli


In [19]:
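def find_nearest_neighbors(target_NPI, use_PCA=False, n_components=20):
    """
    Finds the top 10 nearest neighbors for a given physician based on procedure profiles.

    Args:
    target_NPI (int): The NPI number of the target physician.
    use_PCA (bool): If True, reduce the normalized matrix with TruncatedSVD
        before computing similarities. Defaults to False.
    n_components (int): Number of SVD components; used only when use_PCA is True.

    Prints the target physician's details and their top 10 nearest neighbors.
    Relies on the module-level `df` and `names` DataFrames.
    """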

    # Extract unique procedures performed by the target physician
    unique_procedures = df[df["Rndrng_NPI"] == target_NPI]["HCPCS_Cd"].unique()

    # Retrieve the name of the target physician
    target_name = " ".join(names[names["Rndrng_NPI"] == target_NPI].values[0][::-1][:2])
    print(f"{target_NPI} ({target_name}) performs {len(unique_procedures)} unique procedures.")

    # Sample the dataset for only those procedures performed by the target physician
    sampled_df = df[df['HCPCS_Cd'].isin(unique_procedures)]

    # Create a sparse matrix from the sampled dataset
    pivot = sampled_df.pivot_table(index='Rndrng_NPI', columns='HCPCS_Cd', values='Tot_Srvcs', fill_value=0)
    sparse_matrix = sparse.csr_matrix(pivot.values)

    # Normalize the rows in the sparse matrix
    normalized_matrix = normalize(sparse_matrix, norm='l2', axis=1)

    # Apply truncated SVD for dimensionality reduction if requested
    if use_PCA:
        svd = TruncatedSVD(n_components=n_components)
        reduced_matrix = svd.fit_transform(normalized_matrix)
        target_row = reduced_matrix[pivot.index.get_loc(target_NPI)]
    else:
        reduced_matrix = normalized_matrix
        target_row = reduced_matrix[pivot.index.get_loc(target_NPI), :]

    # Compute cosine similarity between the target physician and all others
    cosine_similarities = cosine_similarity(target_row.reshape(1, -1), reduced_matrix)
    similarity_series = pd.Series(cosine_similarities[0], index=pivot.index)
    similarity_series.drop(target_NPI, inplace=True)

    # Print the top 10 nearest neighbors
    print("\nTop 10 nearest neighbors:")
    for neighbor_NPI in similarity_series.nlargest(10).index:
        neighbor_name = " ".join(names[names["Rndrng_NPI"] == neighbor_NPI].values[0][::-1][:2])
        print(f"{neighbor_NPI} - {neighbor_name}")
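
The function above builds a dense `pivot_table` and then converts it to CSR. As a possible alternative, the sparse matrix can be assembled directly from the long-format rows. The helper `build_sparse_profiles` is hypothetical and behaves differently from the notebook in one respect: converting COO to CSR sums repeated (NPI, code) pairs, whereas `pivot_table` averages them by default.

```python
import pandas as pd
from scipy import sparse


def build_sparse_profiles(frame: pd.DataFrame):
    """Hypothetical helper: return (csr_matrix, npi_index, code_index)."""
    # Map each NPI / code to a consecutive integer position (sorted order)
    row_ids, npi_index = pd.factorize(frame["Rndrng_NPI"], sort=True)
    col_ids, code_index = pd.factorize(frame["HCPCS_Cd"], sort=True)
    values = frame["Tot_Srvcs"].to_numpy(dtype=float)
    matrix = sparse.coo_matrix(
        (values, (row_ids, col_ids)),
        shape=(len(npi_index), len(code_index)),
    ).tocsr()  # repeated (row, col) entries are summed during conversion
    return matrix, npi_index, code_index


# Three rows copied from the sampled output earlier in the notebook
toy = pd.DataFrame({
    "Rndrng_NPI": [1992999825, 1992999825, 1992999874],
    "HCPCS_Cd": ["99214", "99214", "99223"],
    "Tot_Srvcs": [79.0, 152.0, 51.0],
})
matrix, npi_index, code_index = build_sparse_profiles(toy)
print(matrix.toarray())
```

The row position of a physician is then `npi_index.get_loc(target_NPI)`, which mirrors the `pivot.index.get_loc` lookup used in the function.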

In [20]:
find_nearest_neighbors(1568588127, use_PCA=True, n_components=30)

1568588127 (Santhosh Mathews) performs 45 unique procedures.

Top 10 nearest neighbors:
1841441474 - James Nguyen
1235390535 - Michael Cammarata
1265501985 - Biren Parikh
1194799502 - George Daniel
1184865982 - Aman Saw
1871502492 - Bhaskar Reddy
1093791287 - Bruce Lipskind
1033110234 - John Pappas
1194771188 - Joel Garcia
1629030127 - Hoshedar Tamboli


In [21]:
find_nearest_neighbors(1568588127)

1568588127 (Santhosh Mathews) performs 45 unique procedures.

Top 10 nearest neighbors:
1841441474 - James Nguyen
1265501985 - Biren Parikh
1194799502 - George Daniel
1871502492 - Bhaskar Reddy
1235390535 - Michael Cammarata
1194771188 - Joel Garcia
1093791287 - Bruce Lipskind
1033110234 - John Pappas
1629030127 - Hoshedar Tamboli
1891719175 - David Grossman


In [22]:
find_nearest_neighbors(1528283249, use_PCA=True, n_components=20)

1528283249 (Brett Gidney) performs 33 unique procedures.

Top 10 nearest neighbors:
1124019526 - Steven Hao
1841382421 - Andrea Natale
1417273129 - Stephen Rechenmacher
1982786083 - Patrick Hranitzky
1366880825 - David Okada
1700994662 - John Sturdivant
1821241134 - Saumya Sharma
1669760021 - Diego Alcivar Franco
1013151653 - Kishore Subnani
1871551630 - Rajesh Malik


In [23]:
find_nearest_neighbors(1528283249)

1528283249 (Brett Gidney) performs 33 unique procedures.

Top 10 nearest neighbors:
1417273129 - Stephen Rechenmacher
1124019526 - Steven Hao
1841382421 - Andrea Natale
1982786083 - Patrick Hranitzky
1013151653 - Kishore Subnani
1669760021 - Diego Alcivar Franco
1700994662 - John Sturdivant
1871551630 - Rajesh Malik
1831366731 - Ashish Patel
1821241134 - Saumya Sharma
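
To compare the two settings, the overlap between the reduced and unreduced top-10 lists can be counted. The NPIs below are copied from the outputs above; the helper `overlap` is only for illustration.

```python
# Top-10 NPIs with use_PCA=True (reduced) and with the default (plain)
results = {
    1568588127: {
        "reduced": [1841441474, 1235390535, 1265501985, 1194799502, 1184865982,
                    1871502492, 1093791287, 1033110234, 1194771188, 1629030127],
        "plain": [1841441474, 1265501985, 1194799502, 1871502492, 1235390535,
                  1194771188, 1093791287, 1033110234, 1629030127, 1891719175],
    },
    1528283249: {
        "reduced": [1124019526, 1841382421, 1417273129, 1982786083, 1366880825,
                    1700994662, 1821241134, 1669760021, 1013151653, 1871551630],
        "plain": [1417273129, 1124019526, 1841382421, 1982786083, 1013151653,
                  1669760021, 1700994662, 1871551630, 1831366731, 1821241134],
    },
}


def overlap(first, second):
    """Number of NPIs that appear in both lists, ignoring order."""
    return len(set(first) & set(second))


overlaps = {npi: overlap(r["reduced"], r["plain"]) for npi, r in results.items()}
for npi, shared in overlaps.items():
    print(f"{npi}: {shared} of 10 neighbors shared")
```

For both physicians the two lists share 9 of 10 neighbors; the differences are in ordering and in one neighbor per list.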
