# Data Prep for Evergreen Patent Identification

## Script purpose
Building an evergreen patent flagger requires datasets from several different sources. Combining them lets a model capture more information, but it also makes the data harder to work with, because not all of the datasets can be easily linked together. The purpose of this script is to merge the datasets so that they correctly link to each other and so that their dates line up. For example, the **Orange Book** patent database will be used for its metadata and its link to the proper patent. The issue is that the Orange Book is more up to date than the UC Hastings Evergreen Drug Patent database, which only contains information from **2005 to 2018**.

The datasets being cleaned and merged in this file are:
- UC Hastings Evergreen Drug Patent database
- The Department of Veterans Affairs' contract pricing information
- Filings from the European Patent Office and Japanese Patent Office
- FDA Orange Book


## Code

In [1]:
import pandas as pd
import numpy as np

### Linking Evergreen Drug Database to Orange Book
The first step in the data prep process is linking the Evergreen drug database to the Orange Book. Since we only care whether a drug appears in the Evergreen database, we only need the NDA and patent number of each drug that appears in it.

In [2]:
#Dataframe for Evergreen Database
Evgn = pd.read_csv("EvergreenDatasetRaw_Dataset_2005-2018_v02.csv")
#Dataframe for individual drugs
Product = pd.read_csv('products.txt',delimiter="~")
#Dataframe for Orange Book
Patent = pd.read_csv("patent.txt",delimiter="~")
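
Because only the NDA and patent number are needed from the Evergreen database, the linking keys can be reduced to a small lookup table. The following is a minimal sketch on a toy frame that mirrors the `NDA #` and `Patent Number` column names. The helper name `evergreen_keys` is hypothetical, and both the whitespace stripping and the dropping of rows without a patent number are assumptions about how the raw values should be treated, not steps taken from this notebook.

```python
import pandas as pd

def evergreen_keys(evgn: pd.DataFrame) -> pd.DataFrame:
    """Return the unique (NDA #, Patent Number) pairs found in the Evergreen data."""
    # Assumption: rows without a patent number cannot be linked, so they are dropped
    keys = evgn[['NDA #', 'Patent Number']].dropna(subset=['Patent Number']).copy()
    # Assumption: patent numbers are strings and may carry stray whitespace
    keys['Patent Number'] = keys['Patent Number'].astype(str).str.strip()
    return keys.drop_duplicates().reset_index(drop=True)

# Toy rows mirroring the Evergreen column names (for illustration only)
toy_evgn = pd.DataFrame({
    'NDA #': [20977, 20977, 20977, 22328],
    'Patent Number': ['5034394', ' 5034394', None, '7682628'],
})
keys = evergreen_keys(toy_evgn)
print(keys)
```

Keeping the keys in their own frame makes it easy to inspect how many distinct patents the Evergreen database contributes before anything is merged.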

In [3]:
Evgn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21292 entries, 0 to 21291
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Active Ingredient  21292 non-null  object 
 1   NDA #              21292 non-null  int64  
 2   Product Name       21292 non-null  object 
 3   Company            21292 non-null  object 
 4   Approval Date      2112 non-null   object 
 5   P or E             21292 non-null  object 
 6   Date Added         21292 non-null  object 
 7   Patent Number      17329 non-null  object 
 8   Expiration Date    21292 non-null  object 
 9   Codes              17949 non-null  object 
 10  Strengths          21264 non-null  object 
 11  Delist Request     167 non-null    object 
 12  Orig               12584 non-null  object 
 13  Analysis           20774 non-null  object 
 14  Added strength     1082 non-null   object 
 15  # added strengths  0 non-null      float64
 16  Applied to UC      113

Let's utilize the patent number to create an `EvergreenFlag` column on the Orange Book patent table.

In [4]:
# Flag every Orange Book patent whose number also appears in the Evergreen database
Patent['EvergreenFlag'] = Patent['Patent_No'].isin(Evgn['Patent Number']).astype(int)

Implementing this change leads to duplicate values. The duplicate entries will need to be removed in order to use the patent number as a link to the I-MAK database.
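
One way the duplicate removal could look is sketched below on a toy stand-in for the patent table. It assumes that a "duplicate" means any repeated `Patent_No` (for example, the same patent listed for several products) and that keeping the first row is acceptable; the notebook may define duplicates differently. Because the flag is derived from the patent number alone, rows that share a `Patent_No` carry the same `EvergreenFlag`, so no flag information is lost.

```python
import pandas as pd

# Toy stand-in for the Orange Book patent table (for illustration only)
toy_patent = pd.DataFrame({
    'Appl_No': [20977, 20977, 22328],
    'Product_No': [1, 2, 1],
    'Patent_No': ['6294540', '6294540', '7682628'],
    'EvergreenFlag': [1, 1, 1],
})

# Assumption: a duplicate is any repeated Patent_No; the first occurrence is kept
unique_patents = toy_patent.drop_duplicates(subset='Patent_No', keep='first')
unique_patents = unique_patents.reset_index(drop=True)

# After deduplication, each patent number identifies exactly one row
assert unique_patents['Patent_No'].is_unique
print(unique_patents)
```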

In [5]:
print(len(Patent))
Patent['EvergreenFlag'].value_counts(normalize=True)

17835


1    0.621082
0    0.378918
Name: EvergreenFlag, dtype: float64

In [6]:
Patent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17835 entries, 0 to 17834
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Appl_Type                17835 non-null  object
 1   Appl_No                  17835 non-null  int64 
 2   Product_No               17835 non-null  int64 
 3   Patent_No                17835 non-null  object
 4   Patent_Expire_Date_Text  17835 non-null  object
 5   Drug_Substance_Flag      3080 non-null   object
 6   Drug_Product_Flag        8936 non-null   object
 7   Patent_Use_Code          10040 non-null  object
 8   Delist_Flag              48 non-null     object
 9   Submission_Date          14425 non-null  object
 10  EvergreenFlag            17835 non-null  int64 
dtypes: int64(3), object(8)
memory usage: 1.5+ MB


In [7]:
len(Patent['Appl_No'].unique())

1183

In [8]:
# Drop rows with no submission date; .copy() avoids a SettingWithCopyWarning below
Patent = Patent[Patent['Submission_Date'].notna()].copy()
Patent['Submission_Date'] = pd.to_datetime(Patent['Submission_Date'])
# Keep only submissions inside the Evergreen database window (2005-2018, inclusive)
Patent = Patent[Patent['Submission_Date'].between('2005-01-01', '2018-12-31')]
Patent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 2 to 15885
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Appl_Type                8073 non-null   object        
 1   Appl_No                  8073 non-null   int64         
 2   Product_No               8073 non-null   int64         
 3   Patent_No                8073 non-null   object        
 4   Patent_Expire_Date_Text  8073 non-null   object        
 5   Drug_Substance_Flag      1529 non-null   object        
 6   Drug_Product_Flag        4581 non-null   object        
 7   Patent_Use_Code          5170 non-null   object        
 8   Delist_Flag              43 non-null     object        
 9   Submission_Date          8073 non-null   datetime64[ns]
 10  EvergreenFlag            8073 non-null   int64         
dtypes: datetime64[ns](1), int64(3), object(7)
memory usage: 756.8+ KB
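
Since the other datasets also need their dates lined up with the 2005 to 2018 window, the same filter could be wrapped in a small reusable helper. This is only a sketch: the function name `filter_to_window` is hypothetical, and using `errors='coerce'` to turn unparseable dates into missing values is an assumption that differs slightly from the cell above, which drops missing dates first and parses strictly.

```python
import pandas as pd

def filter_to_window(df, date_col, start='2005-01-01', end='2018-12-31'):
    """Keep rows whose date_col falls inside [start, end], both ends inclusive."""
    out = df.copy()
    # Assumption: unparseable or missing dates become NaT and are filtered out below
    out[date_col] = pd.to_datetime(out[date_col], errors='coerce')
    return out[out[date_col].between(start, end)]

# Toy rows probing both edges of the window plus a missing date
toy = pd.DataFrame({
    'Patent_No': ['A', 'B', 'C', 'D'],
    'Submission_Date': ['2004-12-31', '2005-01-01', '2018-12-31', None],
})
windowed = filter_to_window(toy, 'Submission_Date')
print(windowed)
```

Working on a copy leaves the caller's frame untouched, which makes it safe to apply the helper to each dataset before merging.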


In [9]:
print(len(Patent))
Patent['EvergreenFlag'].value_counts(normalize=True)

8073


1    0.991453
0    0.008547
Name: EvergreenFlag, dtype: float64

In [10]:
Evgn

Unnamed: 0,Active Ingredient,NDA #,Product Name,Company,Approval Date,P or E,Date Added,Patent Number,Expiration Date,Codes,Strengths,Delist Request,Orig,Analysis,Added strength,# added strengths,Applied to UC,2nd add,Comments
0,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,12/17/98,P,pre-2005,5034394,12/18/11,,1,,,Pre-2005,,,,,10/26/72
1,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,,P,pre-2005,5089500,6/26/09,U-248,1,,,Pre-2005,,,,,
2,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,,P,pre-2005,6294540,5/14/18,U-65,1,,,Pre-2005,,,,,
3,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,,P,pre-2005,5034394*PED,6/18/12,,1,,,Pre-2005,,,,,12/22/15
4,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,,P,pre-2005,5089500*PED,12/26/09,,1,,,Pre-2005,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21287,Zolpidem Tartrate,22328,Intermezzo,PURDUE PHARMA,,P,11/28/11,7682628,2/16/25,U-1194,"001, 002",,Yes,P:UC,,,,,
21288,Zolpidem Tartrate,22328,Intermezzo,PURDUE PHARMA,,E,11/28/11,,11/23/14,NP,"001, 002",,Yes,NP,,,,,
21289,Zolpidem Tartrate,22328,Intermezzo,PURDUE PHARMA,,P,8/28/12,8242131,8/20/29,U-1266,"001, 002",,No,P:UCnew,,,,,
21290,Zolpidem Tartrate,22328,Intermezzo,PURDUE PHARMA,,P,8/28/12,8252809,2/16/25,DP,"001, 002",,No,P:DP,,,,,
