# Predicting Medicaid Drug Spending Using State Drug Utilization Data

## Business Understanding
Medicaid is a government health program in the United States that is funded jointly by the state and federal governments. It helps cover the medical expenses of low-income individuals, making health care and medication more accessible and affordable. It subsidizes the cost of medication by paying the claims submitted by hospitals and pharmacies that are covered under the rebate agreement with the federal government.

Key stakeholders are the state and federal governments, which fund the program, alongside policymakers and decision makers in the healthcare industry. This project uses the data collected over the years to identify patterns, predict spending, distinguish high-cost from low-cost drugs, and flag possible loopholes, such as reimbursing manufacturers that are not part of the rebate agreement, which leads to a loss of money for Medicaid.

Moreover, it can help Medicaid forecast drug spending for better budget allocation and negotiate drug costs in the future. The challenges we faced in this project include the size of the dataset, which contains data collected over the last decade and is difficult to run models on without a powerful CPU, a large number of missing values, and the fluctuating prices of drugs in the market.



## Problem Statement
Medicaid spends millions of dollars paying for drugs prescribed to patients all over the country. Spending has increased, making it difficult to monitor and control costs effectively. The data collected over the years is large and scattered, so it is difficult to identify the most used drugs, the most expensive drugs, how drug use differs between states, and how to save money and make better decisions in the future. The current reporting system mainly describes past spending but fails to accurately predict future expenditures. Without predictive machine learning tools, stakeholders struggle to identify high-cost drugs early and allocate budgets efficiently.

This project aims to develop a machine learning model that predicts drug spending based on the available utilization and reimbursement data. In the long run, this helps stakeholders make better decisions and allocate budgets through interactive dashboards and a website where users can check the future spending on specific drugs according to various descriptions.

## Objectives
### General Objective
To develop a machine learning model that predicts total drug spending using Medicaid drug utilization data.
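
To make the general objective concrete, here is a minimal sketch of a baseline that predicts `Total Amount Reimbursed` from `Units Reimbursed` and `Number of Prescriptions`. The feature choice, the log transform, and the plain least-squares fit are assumptions made for illustration, not the project's final model, and the helper names `fit_log_linear` and `predict_log_linear` are hypothetical. The five rows are copied from the `df.head()` preview of this dataset, so the fit only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd

# Five rows copied from the df.head() preview of the dataset.
sample = pd.DataFrame({
    "Units Reimbursed": [216.0, 218.0, 21.0, 33.0, 208.0],
    "Number of Prescriptions": [107.0, 109.0, 20.0, 30.0, 104.0],
    "Total Amount Reimbursed": [102976.4, 104481.92, 15227.25, 231532.28, 108908.8],
})

def _design_matrix(frame, features):
    """Build [1, log1p(feature_1), log1p(feature_2), ...] for each row."""
    logged = np.log1p(frame[features].to_numpy(dtype=float))
    return np.column_stack([np.ones(len(logged)), logged])

def fit_log_linear(frame, features, target):
    """Fit log1p(target) against log1p(features) by ordinary least squares."""
    X = _design_matrix(frame, features)
    y = np.log1p(frame[target].to_numpy(dtype=float))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_log_linear(coef, frame, features):
    """Predict spending in dollars by undoing the log1p transform."""
    return np.expm1(_design_matrix(frame, features) @ coef)

features = ["Units Reimbursed", "Number of Prescriptions"]
coef = fit_log_linear(sample, features, "Total Amount Reimbursed")
pred = predict_log_linear(coef, sample, features)
```

On the full dataset, the same functions would be fitted on a training split and evaluated on held-out quarters before any more complex model is compared against them.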

### Specific Objectives
1. Identify the high-cost and low-cost drugs.
2. Identify the most popular drugs.
3. Determine how drug use and spending differ between states.
4. Identify ways to save money or make better decisions about Medicaid spending.
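
The first three objectives can be approached with grouped aggregations. The sketch below is one possible way to do it, assuming that "cost" means spending per prescription and "popularity" means the number of prescriptions; both definitions, and the helper names `drug_summary` and `state_summary`, are assumptions for illustration. The rows are copied from the `df.head()` preview of this dataset.

```python
import pandas as pd

# Five rows copied from the df.head() preview of the dataset.
sample = pd.DataFrame({
    "State": ["AK", "AK", "AK", "AK", "AK"],
    "Product Name": ["TRULICITY", "TRULICITY", "EMGALITY P", "TALTZ AUTO", "MOUNJARO"],
    "Number of Prescriptions": [107.0, 109.0, 20.0, 30.0, 104.0],
    "Total Amount Reimbursed": [102976.4, 104481.92, 15227.25, 231532.28, 108908.8],
})

def drug_summary(frame):
    """Total spend, prescriptions, and spend per prescription for each drug."""
    summary = frame.groupby("Product Name").agg(
        total_spend=("Total Amount Reimbursed", "sum"),
        prescriptions=("Number of Prescriptions", "sum"),
    )
    # Assumed cost measure: dollars reimbursed per prescription.
    summary["spend_per_prescription"] = (
        summary["total_spend"] / summary["prescriptions"]
    )
    return summary

def state_summary(frame):
    """Total spend and prescriptions for each state."""
    return frame.groupby("State").agg(
        total_spend=("Total Amount Reimbursed", "sum"),
        prescriptions=("Number of Prescriptions", "sum"),
    )

by_drug = drug_summary(sample)
highest_cost = by_drug["spend_per_prescription"].idxmax()  # objective 1
lowest_cost = by_drug["spend_per_prescription"].idxmin()   # objective 1
most_popular = by_drug["prescriptions"].idxmax()           # objective 2
by_state = state_summary(sample)                           # objective 3
```

Rows with zero prescriptions would produce infinite or undefined ratios, so they would need to be filtered or handled before ranking drugs on the full dataset.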

In [2]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv("../data/Medicaid_data.csv")
df.head()

|   | Utilization Type | State | NDC | Labeler Code | Product Code | Package Size | Year | Quarter | Suppression Used | Product Name | Units Reimbursed | Number of Prescriptions | Total Amount Reimbursed | Medicaid Amount Reimbursed | Non Medicaid Amount Reimbursed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FFSU | AK | 2143380 | 2 | 1433 | 80 | 2025 | 2 | False | TRULICITY | 216.0 | 107.0 | 102976.4 | 98630.87 | 4345.53 |
| 1 | FFSU | AK | 2143480 | 2 | 1434 | 80 | 2025 | 2 | False | TRULICITY | 218.0 | 109.0 | 104481.92 | 101806.64 | 2675.28 |
| 2 | FFSU | AK | 2143611 | 2 | 1436 | 11 | 2025 | 2 | False | EMGALITY P | 21.0 | 20.0 | 15227.25 | 15227.25 | 0.0 |
| 3 | FFSU | AK | 2144511 | 2 | 1445 | 11 | 2025 | 2 | False | TALTZ AUTO | 33.0 | 30.0 | 231532.28 | 231532.28 | 0.0 |
| 4 | FFSU | AK | 2145780 | 2 | 1457 | 80 | 2025 | 2 | False | MOUNJARO | 208.0 | 104.0 | 108908.8 | 105953.32 | 2955.48 |
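
In the preview, `NDC` is read as an integer (for example `2143380`) while `Labeler Code`, `Product Code`, and `Package Size` hold its parts (`2`, `1433`, `80`), which suggests that leading zeros were dropped. The sketch below rebuilds a zero-padded string identifier, assuming the 5-4-2 digit layout of an 11-digit NDC and purely numeric codes; the column name `NDC11` and the helper `add_ndc11` are hypothetical.

```python
import pandas as pd

# Two rows copied from the df.head() preview of the dataset.
sample = pd.DataFrame({
    "NDC": [2143380, 2143611],
    "Labeler Code": [2, 2],
    "Product Code": [1433, 1436],
    "Package Size": [80, 11],
})

def add_ndc11(frame):
    """Add a zero-padded NDC string built from its three components.

    Assumes a 5-digit labeler, 4-digit product, and 2-digit package code.
    """
    out = frame.copy()
    out["NDC11"] = (
        out["Labeler Code"].astype(str).str.zfill(5)
        + out["Product Code"].astype(str).str.zfill(4)
        + out["Package Size"].astype(str).str.zfill(2)
    )
    return out

padded = add_ndc11(sample)

# Cross-check: padding the integer NDC to 11 digits should give the same string.
matches = padded["NDC11"] == padded["NDC"].astype(str).str.zfill(11)
```

If the source CSV still contains the leading zeros, passing `dtype={"NDC": str}` to `pd.read_csv` would preserve them at load time instead.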


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313397 entries, 0 to 1313396
Data columns (total 15 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   Utilization Type                1313397 non-null  object 
 1   State                           1313397 non-null  object 
 2   NDC                             1313397 non-null  int64  
 3   Labeler Code                    1313397 non-null  int64  
 4   Product Code                    1313397 non-null  int64  
 5   Package Size                    1313397 non-null  int64  
 6   Year                            1313397 non-null  int64  
 7   Quarter                         1313397 non-null  int64  
 8   Suppression Used                1313397 non-null  bool   
 9   Product Name                    1313397 non-null  object 
 10  Units Reimbursed                1313397 non-null  float64
 11  Number of Prescriptions         1313397 non-null  float64
 12  
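
The `df.info()` output above reports 1,313,397 rows stored as `object`, `int64`, and `float64` columns, which ties back to the dataset-size challenge. One possible way to reduce the memory footprint is to store repeated strings as categories and downcast integers. This is a sketch under the assumption that columns such as `State` and `Utilization Type` have few distinct values; the helper name `shrink_frame` is hypothetical.

```python
import pandas as pd

def shrink_frame(frame, category_cols):
    """Return a copy with categorical strings and downcast integer columns."""
    out = frame.copy()
    # Repeated strings are stored once when the column is categorical.
    for col in category_cols:
        out[col] = out[col].astype("category")
    # Use the smallest integer type that can hold each column's values.
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# A few columns in the same shape as the dataset preview.
sample = pd.DataFrame({
    "Utilization Type": ["FFSU"] * 5,
    "State": ["AK"] * 5,
    "Year": [2025] * 5,
    "Quarter": [2] * 5,
})

small = shrink_frame(sample, ["Utilization Type", "State"])

# Compare memory use before and after the conversion.
before = sample.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
```

Comparing `before` and `after` on the full DataFrame shows whether the conversion is worthwhile before any model is trained.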

In [7]:
df.isna().sum()

Utilization Type                  0
State                             0
NDC                               0
Labeler Code                      0
Product Code                      0
Package Size                      0
Year                              0
Quarter                           0
Suppression Used                  0
Product Name                      0
Units Reimbursed                  0
Number of Prescriptions           0
Total Amount Reimbursed           0
Medicaid Amount Reimbursed        0
Non Medicaid Amount Reimbursed    0
dtype: int64

In [9]:
df.duplicated().value_counts()

False    1313397
dtype: int64