# Predicting Medicaid Drug Spending Using State Drug Utilization Data

## Business Understanding.
Medicaid is a government health program in the United States that is funded jointly by the state and federal governments. It helps cover the medical expenses of low-income individuals, making health care and medication more accessible and affordable. It subsidizes the cost of medication by paying the claims submitted by hospitals and pharmacies for drugs that are covered under the rebate agreement with the federal government.

Key stakeholders are the state and federal governments that fund the program, along with policymakers and decision makers in the healthcare industry. This project uses the data collected over the years to identify patterns, predict spending, distinguish high-cost from low-cost drugs, and flag possible loopholes, such as reimbursements involving manufacturers that are not part of the rebate agreement, which lose money for Medicaid.

Moreover, it can help Medicaid forecast drug spending for better budget allocation and negotiate drug costs in the future. The challenges we faced in this project include the size of the dataset, which contains data collected over the last decade and is difficult to run models on without a powerful CPU, a large number of missing values, and the fluctuating prices of drugs in the market.



## Problem Statement.


Medicaid spends millions of dollars paying for drugs prescribed to patients all over the country. Spending has increased, making it difficult to monitor and control costs effectively. The data collected over the years is large and scattered, so it is hard to identify the most used drugs, the most expensive drugs, how drug use differs between states, and how to save money and make better decisions in the future.

The current reporting system mainly describes past spending but fails to accurately predict future expenditures. Without predictive machine learning tools, stakeholders struggle to identify high-cost drugs early and allocate budgets efficiently.

This project aims to develop a machine learning model that predicts drug spending from the available utilization and reimbursement data. In the long run, this helps stakeholders make better decisions and allocate budgets through interactive dashboards and a website where users can check the projected spending for specific drugs according to various descriptions.

## Objectives.
### General Objective
To develop a machine learning model that predicts total drug spending using Medicaid drug utilization data.

### Specific Objectives
1. Which are the high-cost and low-cost drugs?
2. Which are the most popular drugs?
3. How do drug use and spending differ between states?
4. How can Medicaid save money or make better decisions about its spending?

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("../data/Medicaid_data.csv")
df.head()

Unnamed: 0,Utilization Type,State,NDC,Labeler Code,Product Code,Package Size,Year,Quarter,Suppression Used,Product Name,Units Reimbursed,Number of Prescriptions,Total Amount Reimbursed,Medicaid Amount Reimbursed,Non Medicaid Amount Reimbursed
0,FFSU,AK,2143380,2,1433,80,2025,2,False,TRULICITY,216.0,107.0,102976.4,98630.87,4345.53
1,FFSU,AK,2143480,2,1434,80,2025,2,False,TRULICITY,218.0,109.0,104481.92,101806.64,2675.28
2,FFSU,AK,2143611,2,1436,11,2025,2,False,EMGALITY P,21.0,20.0,15227.25,15227.25,0.0
3,FFSU,AK,2144511,2,1445,11,2025,2,False,TALTZ AUTO,33.0,30.0,231532.28,231532.28,0.0
4,FFSU,AK,2145780,2,1457,80,2025,2,False,MOUNJARO,208.0,104.0,108908.8,105953.32,2955.48


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313397 entries, 0 to 1313396
Data columns (total 15 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   Utilization Type                1313397 non-null  object 
 1   State                           1313397 non-null  object 
 2   NDC                             1313397 non-null  int64  
 3   Labeler Code                    1313397 non-null  int64  
 4   Product Code                    1313397 non-null  int64  
 5   Package Size                    1313397 non-null  int64  
 6   Year                            1313397 non-null  int64  
 7   Quarter                         1313397 non-null  int64  
 8   Suppression Used                1313397 non-null  bool   
 9   Product Name                    1313397 non-null  object 
 10  Units Reimbursed                1313397 non-null  float64
 11  Number of Prescriptions         1313397 non-null  float64
 12  

In [5]:
df.isna().sum()

Utilization Type                  0
State                             0
NDC                               0
Labeler Code                      0
Product Code                      0
Package Size                      0
Year                              0
Quarter                           0
Suppression Used                  0
Product Name                      0
Units Reimbursed                  0
Number of Prescriptions           0
Total Amount Reimbursed           0
Medicaid Amount Reimbursed        0
Non Medicaid Amount Reimbursed    0
dtype: int64

In [6]:
df.duplicated().value_counts()

False    1313397
dtype: int64

## Checking for outliers using Interquartile Range

In [7]:
# Select your target column
column = "Total Amount Reimbursed"

# Calculate Q1 and Q3
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
print("Number of Outliers:", outliers.shape[0])
print("Percentage of Outliers:", (outliers.shape[0] / df.shape[0]) * 100)

Lower Bound: -9221.155
Upper Bound: 16410.325
Number of Outliers: 210679
Percentage of Outliers: 16.040770612389093
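The IQR rule used above can be wrapped in a small helper so it is easy to reuse on other columns. The following is a minimal sketch on invented toy values, not on the Medicaid file; the function name `iqr_bounds` and the parameter `k` are assumptions introduced here for illustration.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series.

    Hypothetical helper: k=1.5 reproduces the multiplier used in the cell above.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (made up for illustration only)
toy = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 500.0])

lower, upper = iqr_bounds(toy)
toy_outliers = toy[(toy < lower) | (toy > upper)]

print("Lower bound:", lower)
print("Upper bound:", upper)
print("Outliers:", toy_outliers.tolist())
```

In the notebook output the lower bound is negative, so, assuming reimbursement amounts are never negative, every flagged row lies above the upper bound. For heavily right-skewed spending data, the rule therefore acts as a filter for high-value records rather than a check for data errors.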


Top 10 highest reimbursement records among the outliers. Each row is one record rather than one drug, so the same product can appear more than once.

In [8]:
outliers[["Product Name", "Total Amount Reimbursed"]].sort_values(
    by="Total Amount Reimbursed", ascending=False
).head(10)


Unnamed: 0,Product Name,Total Amount Reimbursed
735981,Biktarvy,451565600.0
676260,Biktarvy,448880100.0
676261,Biktarvy,371033000.0
735982,Biktarvy,368949200.0
700544,HUMIRA PEN,352411400.0
640995,HUMIRA PEN,314077800.0
640028,DUPIXENT S,273321600.0
700543,HUMIRA PEN,271622800.0
699541,DUPIXENT S,255818500.0
640994,HUMIRA PEN,252276300.0
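The table above lists individual records, so the same product appears several times. To rank drugs rather than records, one option is to sum the reimbursed amount per product first. The sketch below uses a small invented DataFrame whose column names mirror the dataset; in the notebook the same pattern would be applied to `df`, ideally after the product names have been standardized.

```python
import pandas as pd

# Toy records (invented values); column names mirror the dataset's
toy = pd.DataFrame({
    "Product Name": ["DRUG A", "DRUG A", "DRUG B", "DRUG C", "DRUG B"],
    "Total Amount Reimbursed": [100.0, 250.0, 300.0, 50.0, 25.0],
})

# Sum spending per product, then keep the highest totals
top_by_drug = (
    toy.groupby("Product Name")["Total Amount Reimbursed"]
       .sum()
       .sort_values(ascending=False)
       .head(2)
)

print(top_by_drug)
```

Note that the single largest record in the toy data belongs to "DRUG B", yet "DRUG A" ranks first once its records are summed, which is why the two views can disagree.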


Standardizing the product names so that the same drug written in slightly different ways is treated as one value. This keeps the count of unique drugs accurate by avoiding double counting caused by differences in letter case or spacing.

In [9]:
import re


df["Product Name_raw"] = df["Product Name"]

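def clean_product_name(x):
    # Keep missing values as NaN
    if pd.isna(x):
        return np.nan
    x = str(x).strip()                 # trim leading and trailing whitespace
    x = re.sub(r"\s+", " ", x)         # collapse runs of whitespace into one space
    x = x.casefold()                   # normalize letter case
    return x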

df["Product Name_std"] = df["Product Name_raw"].apply(clean_product_name)
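On a dataset with over a million rows, `apply` with a Python function can be slow. A vectorized version using the pandas `.str` accessor should give the same result; the sketch below demonstrates it on a few invented names, and the variable names `raw` and `std` are placeholders rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy names (invented) showing case and spacing variants of one product
raw = pd.Series(["  Biktarvy ", "BIKTARVY", "Humira   Pen", np.nan])

std = (
    raw.str.strip()                             # trim outer whitespace
       .str.replace(r"\s+", " ", regex=True)    # collapse inner whitespace
       .str.casefold()                          # normalize letter case
)

print(std.tolist())
print("Unique names:", std.nunique())
```

Missing values stay missing through the `.str` chain, so no explicit `pd.isna` check is needed. This normalization only merges case and spacing variants; genuine misspellings or abbreviations would still count as separate products.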

In [10]:
# Remove the national/suppressed "XX" rows
df = df[df['State'] != 'XX'].copy()
print("Dataset shape after removing XX:", df.shape)
print("Remaining states:", sorted(df['State'].unique()))

Dataset shape after removing XX: (1194315, 17)
Remaining states: ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']


In [11]:
df.columns

Index(['Utilization Type', 'State', 'NDC', 'Labeler Code', 'Product Code',
       'Package Size', 'Year', 'Quarter', 'Suppression Used', 'Product Name',
       'Units Reimbursed', 'Number of Prescriptions',
       'Total Amount Reimbursed', 'Medicaid Amount Reimbursed',
       'Non Medicaid Amount Reimbursed', 'Product Name_raw',
       'Product Name_std'],
      dtype='object')

Filtering the dataset to the 300 drugs with the highest total spending, then comparing the number of rows kept with the number of rows in the original dataset.

In [12]:
top_drugs = (
    df.groupby("Product Name_std")["Total Amount Reimbursed"]
      .sum()
      .sort_values(ascending=False)
      .head(300)
      .index
)

df_filtered = df[df["Product Name_std"].isin(top_drugs)].copy()

print("Original rows:", len(df))
print("Filtered rows:", len(df_filtered))
print("Unique drugs after filter:", df_filtered["Product Name_std"].nunique())

Original rows: 1194315
Filtered rows: 261177
Unique drugs after filter: 300


Calculating the percentage of spending retained after filtering, to check whether the filter was appropriate or so aggressive that it discarded spending data needed for modelling.

In [13]:
original_total = df["Total Amount Reimbursed"].sum()
filtered_total = df_filtered["Total Amount Reimbursed"].sum()

print("Spending retained (%):", (filtered_total / original_total) * 100)


Spending retained (%): 69.30136929521737


The percentage retained is about 69.3%. As a rule of thumb used here, retention above 90% is excellent for modelling, 70-90% is acceptable because the model still learns from most of the spending and can forecast properly, and below 70% suggests the filtering was too aggressive. At 69.3%, the top-300 filter falls just under the acceptable range.

A better approach is to select drugs by their share of spending. Instead of limiting the selection to a fixed number of drugs such as 300, we keep the highest-spending drugs that together account for 80-90% of total spending. This is known as a cumulative spending threshold.

In [14]:
drug_spending = (
    df.groupby("Product Name_std")["Total Amount Reimbursed"]
      .sum()
      .sort_values(ascending=False)
)

cumulative_spending = drug_spending.cumsum() / drug_spending.sum()

top_drugs = cumulative_spending[cumulative_spending <= 0.85].index  # 85% threshold

df_filtered = df[df["Product Name_std"].isin(top_drugs)].copy()

print("Unique drugs kept:", len(top_drugs))


Unique drugs kept: 683


The number of unique drugs has increased from 300 to 683, which is more than double. This gives the model a broader set of high-spending drugs to learn from.
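A natural follow-up is to recompute the share of spending retained, as was done for the top-300 filter. The sketch below illustrates the cumulative threshold on invented per-drug totals; the values and the name `retained_pct` are assumptions for illustration, not results from the Medicaid data.

```python
import pandas as pd

# Toy per-drug totals (invented), already sorted from highest to lowest
drug_spending = pd.Series(
    {"drug a": 50.0, "drug b": 30.0, "drug c": 15.0, "drug d": 5.0}
)

# Running share of total spending, from the top drug downward
cumulative_share = drug_spending.cumsum() / drug_spending.sum()

# Keep drugs whose running share stays at or below the 85% threshold
kept = cumulative_share[cumulative_share <= 0.85].index

# Share of spending actually retained by the kept drugs
retained_pct = drug_spending.loc[kept].sum() / drug_spending.sum() * 100

print("Drugs kept:", list(kept))
print("Spending retained (%):", retained_pct)
```

Because the `<=` comparison stops before the drug that crosses the threshold, the retained share is usually somewhat below 85% (80% in this toy case). If at least 85% is required, the drug that crosses the threshold would also need to be included.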