The cell below imports the libraries used in this notebook.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sma

The cell below reads in the cleaned data created in other notebooks and then subsets it to include only drugs with a price greater than 0.

In [2]:
df = pd.read_csv("CleanedData.csv")
df = df[df["Price"] > 0]

Below is a quick `info()` call to check the datatypes of the cleaned columns and to look for null values. Every column is fully populated except Analysis, which has 1614 non-null values out of 1643 rows.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1643 entries, 0 to 1642
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Price                   1643 non-null   float64
 1   PriceStartDate          1643 non-null   object 
 2   Date Added              1643 non-null   object 
 3   InflationAdjustedPrice  1643 non-null   float64
 4   Analysis                1614 non-null   object 
 5   P or E                  1643 non-null   object 
 6   Pre2005Flag             1643 non-null   int64  
 7   PreviousPatents         1643 non-null   int64  
 8   LatestExpiration        1643 non-null   object 
 9   MonthsUntilExpiration   1643 non-null   float64
dtypes: float64(3), int64(2), object(5)
memory usage: 141.2+ KB


The cell below computes the mean and standard deviation of Inflation Adjusted Price, which are then used to create the z-score of inflation adjusted price for our models. We also filter out all rows whose price lies more than 3 standard deviations from the mean in either direction.

In [1]:
# Summary statistics used in the z-score formula below.
mean = df['InflationAdjustedPrice'].mean()
print(f"mean for the inflation adjusted price: {mean}")
std = df['InflationAdjustedPrice'].std()
print(f"std for the inflation adjusted price: {std} \n")

# Compute the z-score, then compare the row count before and after the filter.
df['InflationAdjustedPriceZScore'] = (df['InflationAdjustedPrice'] - mean) / std
print(f"Size before z-score filter: {df.shape[0]}")
df = df[(df['InflationAdjustedPriceZScore'] <= 3) & (df['InflationAdjustedPriceZScore'] >= -3)]
print(f"Size after z-score filter: {df.shape[0]}")

NameError: name 'df' is not defined
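To make the filtering step concrete, here is a minimal sketch of the same z-score filter on a small made-up series. The toy prices and the `ZScore` and `filtered` names are hypothetical and chosen only for illustration; they are not taken from CleanedData.csv.

```python
import pandas as pd

# Hypothetical toy prices: twenty ordinary values plus one extreme value.
toy = pd.DataFrame(
    {"InflationAdjustedPrice": [10.0] * 10 + [12.0] * 10 + [500.0]}
)

mean = toy["InflationAdjustedPrice"].mean()
std = toy["InflationAdjustedPrice"].std()  # pandas default is the sample std (ddof=1)
toy["ZScore"] = (toy["InflationAdjustedPrice"] - mean) / std

# between() is inclusive on both ends, matching the <= 3 and >= -3 conditions.
filtered = toy[toy["ZScore"].between(-3, 3)]

print(f"Size before z-score filter: {len(toy)}")
print(f"Size after z-score filter: {len(filtered)}")
```

Only the extreme value is dropped here. Note that the mean and standard deviation are computed with the outlier included, so a single very large price inflates the standard deviation it is measured against.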

Here we look at the distribution of our target variable classes after this filter has been applied.

In [5]:
df['P or E'].value_counts(normalize=True)

P    0.806187
E    0.193813
Name: P or E, dtype: float64

We assign the binary dependent variable based on the Analysis variable. The following categories are classified as suspected evergreen:
- P:PED
- PTAorPTE
- DlistRequest
- NPP
- PED
- P-PEDExtension
- UCsamemonth
- DP
- DS/DP
- DS/DP/UCnew
- DP/UCnew
- DS
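
As a rough sketch of how the category list above could be cross-checked against the 'P or E' column, the snippet below flags rows whose Analysis value is in the list and compares that flag to the 'P or E' flag. The toy rows, the `SomeOtherCode` value, and the `FlagFromAnalysis` / `FlagFromPorE` column names are hypothetical, and whether 'P or E' was actually derived from Analysis in this way in the cleaning notebook is an assumption.

```python
import pandas as pd

# Category labels copied from the list above.
EVERGREEN_CATEGORIES = [
    "P:PED", "PTAorPTE", "DlistRequest", "NPP", "PED", "P-PEDExtension",
    "UCsamemonth", "DP", "DS/DP", "DS/DP/UCnew", "DP/UCnew", "DS",
]

# Hypothetical rows for illustration only; not taken from CleanedData.csv.
toy = pd.DataFrame(
    {
        "Analysis": ["DP", "SomeOtherCode", None, "PED"],
        "P or E": ["E", "P", "P", "E"],
    }
)

# isin() returns False for missing values, so null Analysis rows get a 0 flag.
toy["FlagFromAnalysis"] = toy["Analysis"].isin(EVERGREEN_CATEGORIES).astype(int)
toy["FlagFromPorE"] = (toy["P or E"] == "E").astype(int)

# Rows where the two definitions disagree would need a manual look.
mismatches = toy[toy["FlagFromAnalysis"] != toy["FlagFromPorE"]]
print(f"Rows where the two flags disagree: {len(mismatches)}")
```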

Create a new column with a direct evergreen flag, whose value is 1 only when the 'P or E' column contains 'E' and 0 otherwise.

In [6]:
df['EvergreenFlag'] = [0] * len(df)
df.loc[df['P or E']=='E','EvergreenFlag'] = 1

We recount the distribution to make sure it is unchanged.

In [7]:
df['EvergreenFlag'].value_counts(normalize=True)

0    0.806187
1    0.193813
Name: EvergreenFlag, dtype: float64

This is the creation of a specific dataframe for use in the model, which includes the following variables:

* Inflation Adjusted Price (InflationAdjustedPrice)
* Months Until Exclusivity Expiration (MonthsUntilExpiration)
* Number of Previous Patent Filings (PreviousPatents)
* Evergreen Flag (EvergreenFlag)

Please add any variable changes to this list to reflect the current conceptual model.

In [8]:
model_df = df[['InflationAdjustedPrice','MonthsUntilExpiration',
              'PreviousPatents','EvergreenFlag']]
#model_df['Initial Filing'] = [0] * len(model_df)
#model_df.loc[model_df['PreviousPatents']==0,'Initial Filing'] = 1

Another quick info run to check the variables and datatypes in the model dataframe.

In [9]:
model_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1584 entries, 0 to 1642
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   InflationAdjustedPrice  1584 non-null   float64
 1   MonthsUntilExpiration   1584 non-null   float64
 2   PreviousPatents         1584 non-null   int64  
 3   EvergreenFlag           1584 non-null   int64  
dtypes: float64(2), int64(2)
memory usage: 61.9 KB


This cell defines a small function that standardizes a numerical column to a z-score representation of its values. It is used in the next step.

In [10]:
def ZStandardize(array):
    # Center on the mean and scale by the standard deviation.
    mean = array.mean()
    std = array.std()
    new_array = (array - mean) / std
    return new_array
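
One detail worth keeping in mind: when `ZStandardize` is given a pandas Series, `.std()` defaults to the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population standard deviation (`ddof=0`). The sketch below, using made-up values, shows the difference and confirms that a column standardized the pandas way has a pandas standard deviation of 1.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = values.std()                    # pandas default: ddof=1
population_std = np.std(values.to_numpy())   # NumPy default: ddof=0

# Same formula as ZStandardize, applied to a pandas Series.
z_sample = (values - values.mean()) / sample_std

print(f"sample std (ddof=1):     {sample_std:.6f}")
print(f"population std (ddof=0): {population_std:.6f}")
print(f"std of z-scores:         {z_sample.std():.6f}")
```

The choice rarely matters with around 1,600 rows, but it explains small discrepancies if the same standardization is later reproduced with a NumPy-based tool.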

Here we standardize all the columns except for the binary target using the above z-score conversion.

In [11]:
# Take an explicit copy so the assignments below do not write to a slice of df.
model_df = model_df.copy()
for column in model_df.drop(columns='EvergreenFlag').columns:
    model_df[column] = ZStandardize(model_df[column])


Here we run describe to check that the standardization worked: each standardized column should have a mean of approximately 0 and a standard deviation of 1.

In [12]:
model_df.describe()

       InflationAdjustedPrice  MonthsUntilExpiration  PreviousPatents  EvergreenFlag
count                  1584.0                 1584.0           1584.0         1584.0
mean             3.059491e-15           4.243239e-16    -3.452485e-15       0.193813
std                       1.0                    1.0              1.0       0.395409
min                -0.4903532              -2.899132       -0.6724874            0.0
25%                -0.4566217             -0.7671145       -0.6724874            0.0
50%                -0.2895706              0.2385543       -0.4052805            0.0
75%                -0.0194588              0.8218422         0.307271            0.0
max                  5.851313               1.807398         4.627115            1.0


Here we check the target variable proportions again.

In [13]:
model_df['EvergreenFlag'].value_counts(normalize=True)

0    0.806187
1    0.193813
Name: EvergreenFlag, dtype: float64

Here we fit a quick statsmodels logistic regression (Logit) model for initial analysis.

In [14]:
model_df = sma.add_constant(model_df)
clf = sma.Logit(model_df['EvergreenFlag'],model_df.drop(columns='EvergreenFlag')).fit()
print(clf.summary())

Optimization terminated successfully.
         Current function value: 0.475172
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:          EvergreenFlag   No. Observations:                 1584
Model:                          Logit   Df Residuals:                     1580
Method:                           MLE   Df Model:                            3
Date:                Tue, 21 Mar 2023   Pseudo R-squ.:                 0.03362
Time:                        12:35:21   Log-Likelihood:                -752.67
converged:                       True   LL-Null:                       -778.86
Covariance Type:            nonrobust   LLR p-value:                 2.490e-11
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     -1.4856      0.067    -22.251      0.000      -1.616      
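
As a sanity check on the summary above, the reported figures can be tied together with a little arithmetic. The sketch below uses only numbers printed in the output; the variable names are my own. It assumes that statsmodels' "Current function value" is the average negative log-likelihood per observation, and that "Pseudo R-squ." is McFadden's pseudo R-squared, 1 - LL / LL-Null.

```python
import math

# Values copied from the Logit summary above.
n_obs = 1584
current_function_value = 0.475172  # assumed: average negative log-likelihood
log_likelihood = -752.67
ll_null = -778.86
const_coef = -1.4856

# The total log-likelihood should be about -n_obs * current_function_value.
ll_from_objective = -n_obs * current_function_value

# McFadden's pseudo R-squared from the two log-likelihoods.
pseudo_r2 = 1 - log_likelihood / ll_null

# The predictors were z-scored, so all of them equal 0 at their means and the
# intercept alone gives the predicted probability for an "average" drug.
baseline_probability = 1 / (1 + math.exp(-const_coef))

print(f"log-likelihood from objective: {ll_from_objective:.2f}")
print(f"pseudo R-squared:              {pseudo_r2:.4f}")
print(f"baseline P(EvergreenFlag = 1): {baseline_probability:.3f}")
```

The baseline probability comes out near 0.18, close to the 0.193813 share of evergreen rows in the data, which is what one would expect from a model with a pseudo R-squared this low.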

In [None]:
from sklearn.Neih