# Data Prep for Evergreen Patent Identification

## Script purpose
To build an evergreen patent flagger, multiple datasets from different sources are used. Combining them lets a model capture more information, but it makes the data harder to work with, because not all of the datasets can be linked easily. The purpose of this script is to merge the datasets so that they link to each other correctly and so that their dates line up. For example, the **Orange Book** patent database is used for its metadata and its link to the proper patent, but it is more up to date than the UC Hastings Evergreen Drug Patent database, which only contains information from **2005 to 2018**.

The datasets being cleaned and merged in this file are:
- UC Hastings Evergreen Drug Patent database
- The Department of Veterans Affairs' contract pricing information
- Filings from the European Patent Office and Japanese Patent Office
- FDA Orange Book


## Code

In [1]:
import pandas as pd
import numpy as np

### Linking Evergreen Drug Database to Orange Book
The first step in the data prep process is linking the evergreen drug database to the Orange Book. Since we only care whether a drug appears in the evergreen drug database, we only need the NDA and patent numbers of the drugs that appear in it.

In [2]:
# DataFrame for the Evergreen database
Evgn = pd.read_csv("EvergreenDatasetRaw_Dataset_2005-2018_v02.csv")
# DataFrame for the Orange Book product listings (individual drugs)
Product = pd.read_csv("products.txt", delimiter="~")
# DataFrame for the Orange Book patent listings
Orange = pd.read_csv("patent.txt", delimiter="~")

In [3]:
Orange.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17835 entries, 0 to 17834
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Appl_Type                17835 non-null  object
 1   Appl_No                  17835 non-null  int64 
 2   Product_No               17835 non-null  int64 
 3   Patent_No                17835 non-null  object
 4   Patent_Expire_Date_Text  17835 non-null  object
 5   Drug_Substance_Flag      3080 non-null   object
 6   Drug_Product_Flag        8936 non-null   object
 7   Patent_Use_Code          10040 non-null  object
 8   Delist_Flag              48 non-null     object
 9   Submission_Date          14425 non-null  object
dtypes: int64(2), object(8)
memory usage: 1.4+ MB
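
Both date columns ("Patent_Expire_Date_Text" and "Submission_Date") load as `object`, so they need parsing before they can be lined up with the evergreen dataset's 2005 to 2018 window. The following is a minimal sketch, not the method used in this notebook. It assumes the dates are formatted like "Oct 7, 2014" and that "Submission_Date" is the column to filter on. The helper `restrict_to_window` and the sample rows are hypothetical.

```python
import pandas as pd

def restrict_to_window(df, date_col, start="2005-01-01", end="2018-12-31"):
    """Keep rows whose date_col falls inside [start, end].

    Missing or unparseable dates become NaT and are dropped (an assumption;
    they could instead be kept and handled separately).
    """
    parsed = pd.to_datetime(df[date_col], format="%b %d, %Y", errors="coerce")
    mask = parsed.between(pd.Timestamp(start), pd.Timestamp(end))
    # Return the surviving rows with the date column converted to datetime
    return df.loc[mask].assign(**{date_col: parsed[mask]})

# Illustrative rows only, not real Orange Book data
sample = pd.DataFrame({
    "Patent_No": ["1111111", "2222222", "3333333"],
    "Submission_Date": ["Oct 7, 2014", "Oct 18, 2019", None],
})
in_window = restrict_to_window(sample, "Submission_Date")
```

Note that 3,410 of the 17,835 rows have no "Submission_Date", so dropping rows with missing dates is a real modelling choice rather than a formality.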


Let's isolate the NDA number in the evergreen dataset, as well as the patent and NDA numbers in the Orange Book. In the evergreen dataset the NDA number is the column called "NDA #". In the Orange Book the patent number is "Patent_No" and the NDA number is "Appl_No".
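
As a rough sketch of what isolating the key columns can look like, the snippet below uses small stand-in frames, assuming the evergreen NDA column is labelled "NDA #" as in the displayed frame. The lowercase names `evgn` and `orange` and all values are illustrative, not the notebook's real data.

```python
import pandas as pd

# Tiny stand-ins for Evgn and Orange (illustrative values only)
evgn = pd.DataFrame({
    "NDA #": [20977, 20977, 22328],
    "Patent Number": ["5034394", "5089500", "7682628"],
})
orange = pd.DataFrame({
    "Appl_No": [20977, 22328, 99999],
    "Product_No": [1, 1, 1],
    "Patent_No": ["5034394", "7682628", "1234567"],
})

# Unique NDA numbers present in the evergreen dataset
evgn_ndas = evgn["NDA #"].drop_duplicates()

# Unique (NDA, patent) pairs from the Orange Book
orange_keys = orange[["Appl_No", "Patent_No"]].drop_duplicates()
```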

In [4]:
Evgn

Unnamed: 0,Active Ingredient,NDA #,Product Name,Company,Approval Date,P or E,Date Added,Patent Number,Expiration Date,Codes,Strengths,Delist Request,Orig,Analysis,Added strength,# added strengths,Applied to UC,2nd add,Comments
0,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,12/17/98,P,pre-2005,5034394,12/18/11,,1,,,Pre-2005,,,,,10/26/72
1,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,,P,pre-2005,5089500,6/26/09,U-248,1,,,Pre-2005,,,,,
2,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,,P,pre-2005,6294540,5/14/18,U-65,1,,,Pre-2005,,,,,
3,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,,P,pre-2005,5034394*PED,6/18/12,,1,,,Pre-2005,,,,,12/22/15
4,Abacavir Sulfate,20977,Ziagen*,VIIV HLTHCARE,,P,pre-2005,5089500*PED,12/26/09,,1,,,Pre-2005,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21287,Zolpidem Tartrate,22328,Intermezzo,PURDUE PHARMA,,P,11/28/11,7682628,2/16/25,U-1194,"001, 002",,Yes,P:UC,,,,,
21288,Zolpidem Tartrate,22328,Intermezzo,PURDUE PHARMA,,E,11/28/11,,11/23/14,NP,"001, 002",,Yes,NP,,,,,
21289,Zolpidem Tartrate,22328,Intermezzo,PURDUE PHARMA,,P,8/28/12,8242131,8/20/29,U-1266,"001, 002",,No,P:UCnew,,,,,
21290,Zolpidem Tartrate,22328,Intermezzo,PURDUE PHARMA,,P,8/28/12,8252809,2/16/25,DP,"001, 002",,No,P:DP,,,,,
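
Some "Patent Number" values in the output above carry a suffix (for example "5034394*PED" in row 3), and some rows have no patent number at all. Before patent numbers are used as join keys, they likely need normalizing. The sketch below assumes any suffix is introduced by `*`. The helper `normalize_patent_no` is hypothetical, and it keeps a flag recording which rows had a suffix so that information is not lost.

```python
import pandas as pd

def normalize_patent_no(series):
    """Return (base patent number, had-suffix flag) for a patent-number column."""
    text = series.astype("string").str.strip()
    # Drop everything from the first '*' onward, e.g. '5034394*PED' -> '5034394'
    base = text.str.replace(r"\*.*$", "", regex=True)
    # True where the original value contained a '*' suffix; missing values count as False
    had_suffix = text.str.contains("*", regex=False).fillna(False)
    return base, had_suffix

# Illustrative values taken from the displayed rows, plus one missing entry
raw = pd.Series(["5034394", "5034394*PED", None])
base, had_suffix = normalize_patent_no(raw)
```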


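From here, the Orange Book rows whose application number ("Appl_No") appears in the evergreen dataset need to be identified. Rather than subsetting with an inner join, the cell below adds an "EvergreenFlag" column to Orange and sets it to 1 for rows that pass an `isin` membership test against the Evgn dataframe.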

In [5]:
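# Default every Orange Book row to 0 (not found in the evergreen dataset)
Orange['EvergreenFlag'] = 0
# Set the flag to 1 where Appl_No matches a value in Evgn's 'Patent Number' column
Orange.loc[Orange['Appl_No'].isin(Evgn['Patent Number']), 'EvergreenFlag'] = 1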

Implementing this change leads to duplicate values. The duplicate entries will need to be removed in order to use the patent number as a link to the I-MAK database.
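
A minimal sketch of the de-duplication step, assuming a duplicate means two rows sharing the same "Appl_No" and "Patent_No" (a patent may be listed once per "Product_No", so the same pair can appear on several rows). The sample frame is illustrative only.

```python
import pandas as pd

orange = pd.DataFrame({
    "Appl_No": [20977, 20977, 20977, 22328],
    "Product_No": [1, 2, 1, 1],
    "Patent_No": ["5034394", "5034394", "5089500", "7682628"],
})

# Mark every repeat of an (Appl_No, Patent_No) pair after its first occurrence
dup_mask = orange.duplicated(subset=["Appl_No", "Patent_No"], keep="first")

# Keep one row per pair so Patent_No can serve as a link key
orange_unique = orange.loc[~dup_mask].reset_index(drop=True)
```

Which columns define a duplicate is a design choice: de-duplicating on "Patent_No" alone would also collapse a patent listed under more than one application.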

In [6]:
print(len(Orange))
Orange['EvergreenFlag'].value_counts(normalize=True)

17835


0    1.0
Name: EvergreenFlag, dtype: float64
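
Every row is flagged 0, which suggests the two columns being compared never share a value: "Appl_No" holds NDA numbers, while "Patent Number" holds patent numbers. The sketch below shows two like-for-like comparisons, NDA against NDA and patent against patent, on illustrative data. Casting both sides to stripped strings is an assumption made to sidestep dtype mismatches ("Appl_No" is `int64`, "Patent_No" is `object`). The helper `flag_membership` and the new column names are hypothetical.

```python
import pandas as pd

# Illustrative stand-ins for Evgn and Orange
evgn = pd.DataFrame({
    "NDA #": [20977, 22328],
    "Patent Number": ["5034394", "7682628"],
})
orange = pd.DataFrame({
    "Appl_No": [20977, 22328, 205613],
    "Patent_No": ["5034394", "9999999", "7682628"],
})

def flag_membership(left, right):
    """Return 1 where a value of `left` appears in `right`, comparing as stripped strings."""
    right_keys = right.dropna().astype(str).str.strip().unique().tolist()
    return left.astype(str).str.strip().isin(right_keys).astype(int)

# NDA number compared with NDA number
orange["NdaInEvergreen"] = flag_membership(orange["Appl_No"], evgn["NDA #"])
# Patent number compared with patent number
orange["PatentInEvergreen"] = flag_membership(orange["Patent_No"], evgn["Patent Number"])
```

If either key column loads as float (for example because of missing values), the string cast would produce values like "20977.0", so converting to a nullable integer type first would be needed in that case.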

In [7]:
Product

Unnamed: 0,Ingredient,DF;Route,Trade_Name,Applicant,Strength,Appl_Type,Appl_No,Product_No,TE_Code,Approval_Date,RLD,RS,Type,Applicant_Full_Name
0,BUDESONIDE,"AEROSOL, FOAM;RECTAL",UCERIS,SALIX,2MG/ACTUATION,N,205613,1,,"Oct 7, 2014",Yes,Yes,RX,SALIX PHARMACEUTICALS INC
1,MINOCYCLINE HYDROCHLORIDE,"AEROSOL, FOAM;TOPICAL",AMZEEQ,JOURNEY,EQ 4% BASE,N,212379,1,,"Oct 18, 2019",Yes,Yes,RX,JOURNEY MEDICAL CORP
2,AZELAIC ACID,"AEROSOL, FOAM;TOPICAL",AZELAIC ACID,TEVA PHARMS USA,15%,A,210928,1,,"Oct 7, 2020",No,No,DISCN,TEVA PHARMACEUTICALS USA INC
3,BETAMETHASONE VALERATE,"AEROSOL, FOAM;TOPICAL",BETAMETHASONE VALERATE,NOVAST LABS,0.12%,A,207144,1,,"May 24, 2017",No,No,DISCN,NOVAST LABORATORIES LTD
4,BETAMETHASONE VALERATE,"AEROSOL, FOAM;TOPICAL",BETAMETHASONE VALERATE,PADAGIS ISRAEL,0.12%,A,78337,1,AB,"Nov 26, 2012",No,No,RX,PADAGIS ISRAEL PHARMACEUTICALS LTD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42395,FENTANYL CITRATE,TROCHE/LOZENGE;TRANSMUCOSAL,FENTANYL CITRATE,SPECGX LLC,EQ 0.4MG BASE,A,78907,2,AB,"Oct 30, 2009",No,No,RX,SPECGX LLC
42396,FENTANYL CITRATE,TROCHE/LOZENGE;TRANSMUCOSAL,FENTANYL CITRATE,SPECGX LLC,EQ 0.6MG BASE,A,78907,3,AB,"Oct 30, 2009",No,No,RX,SPECGX LLC
42397,FENTANYL CITRATE,TROCHE/LOZENGE;TRANSMUCOSAL,FENTANYL CITRATE,SPECGX LLC,EQ 0.8MG BASE,A,78907,4,AB,"Oct 30, 2009",No,No,RX,SPECGX LLC
42398,FENTANYL CITRATE,TROCHE/LOZENGE;TRANSMUCOSAL,FENTANYL CITRATE,SPECGX LLC,EQ 1.2MG BASE,A,78907,5,AB,"Oct 30, 2009",No,No,RX,SPECGX LLC
