# Data Prep for Models
This notebook prepares the data for modeling: filtering records, choosing features, and binning continuous variables into dummy variables.

In [1]:
import pandas as pd 
import numpy as np

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/SullyRC/Drug-Patents/PriceDelta/CleanedData.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1643 entries, 0 to 1642
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Company                 1643 non-null   object 
 1   Price                   1643 non-null   float64
 2   PriceStartDate          1643 non-null   object 
 3   Date Added              1643 non-null   object 
 4   InflationAdjustedPrice  1643 non-null   float64
 5   Analysis                1614 non-null   object 
 6   P or E                  1643 non-null   object 
 7   Pre2005Flag             1643 non-null   int64  
 8   PreviousPatents         1643 non-null   int64  
 9   LatestExpiration        1643 non-null   object 
 10  MonthsUntilExpiration   1643 non-null   float64
 11  PriceDelta              1643 non-null   float64
 12  PercentageE             1643 non-null   float64
dtypes: float64(5), int64(2), object(6)
memory usage: 167.0+ KB


We'll subset the dataset to exclude pre-2005 patents.

In [4]:
df = df[df['Pre2005Flag']!=1]
df = df.drop(columns=['Pre2005Flag'])

We'll also keep only the records whose inflation-adjusted price is greater than 0.

In [5]:
df = df[df['InflationAdjustedPrice'] > 0]
df = df.drop(columns=['InflationAdjustedPrice'])
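
The two filters above could also be written as a single boolean mask. The following is a minimal sketch on a small made-up frame (the name toy and its values are illustrative, not taken from CleanedData.csv). It also takes a .copy() of the result, on the assumption that new columns will be added later and a SettingWithCopyWarning should be avoided.

```python
import pandas as pd

# Toy data standing in for the real frame; the values are made up for illustration.
toy = pd.DataFrame({
    'Pre2005Flag': [0, 1, 0, 0],
    'InflationAdjustedPrice': [12.5, 30.0, 0.0, 7.25],
    'PriceDelta': [-0.05, 0.10, -0.20, 0.00],
})

# Keep rows that are not pre-2005 and have a positive inflation-adjusted price.
mask = (toy['Pre2005Flag'] != 1) & (toy['InflationAdjustedPrice'] > 0)

# Drop the two helper columns and copy, so later column assignments
# act on an independent frame rather than a slice of toy.
filtered = toy.loc[mask].drop(columns=['Pre2005Flag', 'InflationAdjustedPrice']).copy()
print(filtered)
```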

We'll encode P or E as a new numeric column, EvergreenFlag, with 1 representing an extension and 0 representing a patent.

In [6]:
df.loc[df['P or E'] == 'E','EvergreenFlag'] = 1
df.loc[df['P or E'] == 'P','EvergreenFlag'] = 0
df['EvergreenFlag'] = pd.to_numeric(df['EvergreenFlag'])
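
The same encoding can be sketched in one step with Series.map. The frame toy below is a made-up example, not the notebook's data.

```python
import pandas as pd

# Made-up example column; the real values come from the 'P or E' column.
toy = pd.DataFrame({'P or E': ['P', 'E', 'P', 'E', 'P']})

# Map the letter codes straight to integers.
# Any code that is not a key in the dict would become NaN.
toy['EvergreenFlag'] = toy['P or E'].map({'E': 1, 'P': 0})
print(toy['EvergreenFlag'].tolist())
```

Because unmapped codes become NaN, a quick toy['EvergreenFlag'].isna().sum() would reveal any unexpected values in the source column.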

Next we'll subset the dataframe to the columns we want.

In [7]:
df = df[['PercentageE','PriceDelta','MonthsUntilExpiration','PreviousPatents','EvergreenFlag']]

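Now we'll bin our continuous data. The summary statistics below show the range of each column, which guides the choice of bin start and step size.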

In [8]:
df.describe()

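|       | PercentageE | PriceDelta | MonthsUntilExpiration | PreviousPatents | EvergreenFlag |
|-------|-------------|------------|-----------------------|-----------------|---------------|
| count | 1381.0      | 1381.0     | 1381.0                | 1381.0          | 1381.0        |
| mean  | 0.117934    | -0.113799  | 136.984794            | 17.717596       | 0.188269      |
| std   | 0.12217     | 0.225417   | 50.295336             | 23.069772       | 0.391069      |
| min   | 0.0         | -0.845604  | 9.0                   | 0.0             | 0.0           |
| 25%   | 0.0         | -0.092968  | 99.0                  | 1.0             | 0.0           |
| 50%   | 0.12973     | -0.043701  | 146.0                 | 9.0             | 0.0           |
| 75%   | 0.183036    | -0.026729  | 182.0                 | 25.0            | 0.0           |
| max   | 1.0         | 1.502865   | 228.0                 | 119.0           | 1.0           |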


In [9]:
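def binContinuous(column,start,stepsize,df=df):
    """Add one 0/1 dummy column per bin [binStart, binEnd) of width stepsize, modifying df in place."""
    # Note: df=df binds the frame that exists when this cell is run.
    colMax = df[column].max()
    binStart = start
    # <= means a maximum that lands exactly on a bin edge also gets one extra bin starting at that edge.
    while binStart <= colMax:
        binEnd = binStart+stepsize
        # Names are built by repeated float addition, so they can show drift (e.g. 0.6000000000000001).
        binName = column+ str(binStart) +":"+ str(binEnd)
        # Flag rows inside the half-open interval; every other row gets 0.
        df.loc[(df[column]>=binStart)&(df[column]<binEnd),binName] = 1
        df.loc[df[binName]!=1,binName]=0
        # If the maximum sits exactly on this bin's upper edge, include it in this bin.
        if binEnd == colMax:
            df.loc[df[column]==binEnd,binName]=1
        binStart += stepsize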

In [10]:
binContinuous('PercentageE',0,.2)

To confirm that the function works properly we'll create a check column that sums the PercentageE bins from 0 to 1.0. Every record should total exactly 1, meaning it falls into exactly one of those bins.

In [11]:
df['Check'] = (df['PercentageE0:0.2'] + df['PercentageE0.2:0.4']
               + df['PercentageE0.4:0.6000000000000001']
               + df['PercentageE0.6000000000000001:0.8']
               + df['PercentageE0.8:1.0'])
df['Check'].value_counts()

1.0    1381
Name: Check, dtype: int64
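
Listing the bin columns by hand gets tedious and can miss a bin. A possible generalization is sketched below: check_bins is a hypothetical helper (not part of the original notebook) that finds every bin column for a given source column by its name and counts how many bins flag each row. The toy frame is made up, and its last row is deliberately flagged in two bins to show what a failed check looks like.

```python
import pandas as pd

def check_bins(frame, column):
    """Count how many bin columns flag each row for `column` (hypothetical helper)."""
    # Bin columns are named column + "start:end", so they share the prefix
    # and contain a colon; the source column itself has no colon.
    bin_cols = [c for c in frame.columns if c.startswith(column) and ':' in c]
    return frame[bin_cols].sum(axis=1).value_counts()

# Made-up data: rows 0 and 1 sit in one bin each, row 2 sits in two bins.
toy = pd.DataFrame({
    'PercentageE': [0.1, 0.3, 1.0],
    'PercentageE0:0.2': [1.0, 0.0, 0.0],
    'PercentageE0.2:0.4': [0.0, 1.0, 0.0],
    'PercentageE0.8:1.0': [0.0, 0.0, 1.0],
    'PercentageE1.0:1.2': [0.0, 0.0, 1.0],
})
counts = check_bins(toy, 'PercentageE')
print(counts)
```

A result containing any total other than 1 means some record falls into no bin or into more than one.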

Every record totals 1, so we'll apply the same binning to the rest of the continuous columns.

In [12]:
binContinuous('PriceDelta',-1,.2)
binContinuous('MonthsUntilExpiration',0,20)
binContinuous('PreviousPatents',0,20)
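
As an alternative to a hand-written loop, fixed-width binning can be sketched with pandas' built-in pd.cut and pd.get_dummies. This is not the method used in this notebook; the frame toy, the edge range, and the resulting column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values within the observed range of MonthsUntilExpiration.
toy = pd.DataFrame({'MonthsUntilExpiration': [9.0, 99.0, 146.0, 182.0, 228.0]})

# Bin edges 0, 20, ..., 240; np.arange excludes the stop value, so go one step past.
edges = np.arange(0, 260, 20)

# right=False makes each bin closed on the left: [start, end),
# the same rule as ">= start and < end".
binned = pd.cut(toy['MonthsUntilExpiration'], bins=edges, right=False)

# One indicator column per bin; empty bins are kept because the result is categorical.
dummies = pd.get_dummies(binned, prefix='MonthsUntilExpiration', dtype=int)
print(dummies.sum(axis=1).tolist())
```

With explicit edges every value belongs to exactly one interval, and integer edges keep floating-point drift out of the column names.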

In [13]:
df.shape

(1381, 43)

In [14]:
df.to_csv('mbdata.csv',index=False)