---
---
# Loading Dataset & Modules

Modules

In [1]:
import pandas as pd

Loading Data

In [2]:
xls = pd.ExcelFile('Healthcare_dataset.xlsx')
df1 = pd.read_excel(xls, 'Feature Description')
df2 = pd.read_excel(xls, 'Dataset')

Dataset Sheet 1/2

In [3]:
df1

Unnamed: 0,Bucket,Variable,Variable Description
0,Unique Row Id,Patient ID,Unique ID of each patient
1,Target Variable,Persistency_Flag,Flag indicating if a patient was persistent or...
2,Demographics,Age,Age of the patient during their therapy
3,,Race,Race of the patient from the patient table
4,,Region,Region of the patient from the patient table
5,,Ethnicity,Ethnicity of the patient from the patient table
6,,Gender,Gender of the patient from the patient table
7,,IDN Indicator,Flag indicating patients mapped to IDN
8,Provider Attributes,NTM - Physician Specialty,Specialty of the HCP that prescribed the NTM Rx
9,Clinical Factors,NTM - T-Score,T Score of the patient at the time of the NTM ...


Dataset Sheet 2/2

In [4]:
df2.head(2)

Unnamed: 0,Ptid,Persistency_Flag,Gender,Race,Ethnicity,Region,Age_Bucket,Ntm_Speciality,Ntm_Specialist_Flag,Ntm_Speciality_Bucket,...,Risk_Family_History_Of_Osteoporosis,Risk_Low_Calcium_Intake,Risk_Vitamin_D_Insufficiency,Risk_Poor_Health_Frailty,Risk_Excessive_Thinness,Risk_Hysterectomy_Oophorectomy,Risk_Estrogen_Deficiency,Risk_Immobilization,Risk_Recurring_Falls,Count_Of_Risks
0,P1,Persistent,Male,Caucasian,Not Hispanic,West,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0
1,P2,Non-Persistent,Male,Asian,Not Hispanic,West,55-65,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0


--- 
---
# Problem Understanding

Aim: 
* Understand the persistency of a drug from physician prescription data.
* Automate the identification of persistent and non-persistent patients.

Methodology:
* Gather insights from the factors that influence persistency
* Use ML techniques to classify the dataset

---
---
# Preprocess Part 0
## `Data Understanding`
### `Management of Data` 


## Organizing Data
`Dataset` (`df2`) has 69 columns, which fall into six groups according to `Feature Description` (`df1`):

ID, Target, Demographics, Provider, Clinical, Disease
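
The six groups can also be held in a single mapping from group name to column list, which makes it easy to confirm that every column is assigned exactly once. This is a minimal sketch on a toy frame; `toy`, `groups`, and the shortened column lists are illustrative assumptions, not the notebook's full 69 columns.

```python
import pandas as pd

# Toy stand-in for df2 with one column per group (assumed subset).
toy = pd.DataFrame({
    "Ptid": ["P1", "P2"],
    "Persistency_Flag": ["Persistent", "Non-Persistent"],
    "Age_Bucket": [">75", "55-65"],
    "Ntm_Speciality": ["GENERAL PRACTITIONER", "Unknown"],
    "Dexa_During_Rx": ["N", "Y"],
    "Adherent_Flag": ["Adherent", "Adherent"],
})

# Hypothetical mapping: group name -> columns belonging to that group.
groups = {
    "ID": ["Ptid"],
    "Target": ["Persistency_Flag"],
    "Demographics": ["Age_Bucket"],
    "Provider": ["Ntm_Speciality"],
    "Clinical": ["Dexa_During_Rx"],
    "Disease": ["Adherent_Flag"],
}

# Flatten the mapping and compare it against the frame's columns.
assigned = [col for cols in groups.values() for col in cols]
unassigned = [col for col in toy.columns if col not in assigned]
duplicated = sorted({col for col in assigned if assigned.count(col) > 1})

# Reorder the frame by group, similar in spirit to concatenating the groups.
organized = toy[assigned]
print("Unassigned:", unassigned)
print("Duplicated:", duplicated)
print("Column order:", list(organized.columns))
```

With the real data, an empty `unassigned` list would indicate that no raw column was left out of the six groups.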

---
Unique Row Id

In [5]:
# Patient ID
p_df = df2.iloc[:,0]    

In [6]:
ID = pd.concat([p_df], axis = 1 ) 
ID.head() 

Unnamed: 0,Ptid
0,P1
1,P2
2,P3
3,P4
4,P5


---
### Target Variable


In [7]:
# Target 
t_df = df2.iloc[:,1]  

In [8]:
Target = pd.concat([t_df], axis = 1 ) 
Target.head() 

Unnamed: 0,Persistency_Flag
0,Persistent
1,Non-Persistent
2,Non-Persistent
3,Non-Persistent
4,Non-Persistent


---
### Demographics

In [9]:
# Demographics
d_df = df2.loc[:, ['Age_Bucket', 'Race', 'Region', 'Gender', 'Ethnicity', 'Idn_Indicator' ] ]  

In [10]:
Demographics = pd.concat([d_df], axis = 1 ) 
Demographics.head() 

Unnamed: 0,Age_Bucket,Race,Region,Gender,Ethnicity,Idn_Indicator
0,>75,Caucasian,West,Male,Not Hispanic,N
1,55-65,Asian,West,Male,Not Hispanic,N
2,65-75,Other/Unknown,Midwest,Female,Hispanic,N
3,>75,Caucasian,Midwest,Female,Not Hispanic,N
4,>75,Caucasian,Midwest,Female,Not Hispanic,N


---
### Provider Attributes

In [11]:
# Provider Attributes
p_df = df2.loc[:,[
     'Ntm_Speciality', 'Ntm_Specialist_Flag', 'Ntm_Speciality_Bucket' ] ]  

In [12]:
Provider = p_df 
Provider.head()  

Unnamed: 0,Ntm_Speciality,Ntm_Specialist_Flag,Ntm_Speciality_Bucket
0,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown
1,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown
2,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown
3,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown
4,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown


---
### Clinical Factors
`!!!` 
* `NTM - Multiple Risk Factors` is handled under Disease/Treatment Factors. 
* `NTM - Dexa Scan Recency` has no matching column in the raw data. 

In [13]:
# NTM - T-Score 
ntm_t = df2.loc[:,[
    'Tscore_Bucket_Prior_Ntm','Tscore_Bucket_During_Rx'] ]  

# Change in T Score 
ct = df2.loc[:,[
    'Change_T_Score'] ] 

# NTM - Risk Segment
ntm_r = df2.loc[:,[
'Risk_Segment_Prior_Ntm', 'Risk_Segment_During_Rx' ] ]

# Change in Risk Segment
c_r = df2.loc[:,[
'Change_Risk_Segment' ] ] 

# NTM - Multiple Risk Factors: handled under Disease/Treatment Factors


# NTM - Dexa Scan Frequency
ntm_d_f = df2.loc[:,[
'Dexa_Freq_During_Rx' ] ] 

# NTM - Dexa Scan Recency: no matching column in the raw data


# Dexa During Therapy
d_d = df2.loc[:,[
    'Dexa_During_Rx' ] ] 

# NTM - Fragility Fracture Recency
ntm_f = df2.loc[:,[
'Frag_Frac_Prior_Ntm'] ]  

# Fragility Fracture During Therapy
f_f = df2.loc[:,[
'Frag_Frac_During_Rx'] ]  

# NTM - Glucocorticoid Recency
ntm_glu = df2.loc[:,[
       'Gluco_Record_Prior_Ntm'] ] 
       
# Glucocorticoid Usage During Therapy
glu = df2.loc[:,[
'Gluco_Record_During_Rx'] ]  

In [14]:
Clinical = pd.concat( [ntm_t,ct,ntm_r, c_r,ntm_d_f ,d_d,ntm_f,f_f, ntm_glu, glu], axis = 1 ) 
Clinical.head() 

Unnamed: 0,Tscore_Bucket_Prior_Ntm,Tscore_Bucket_During_Rx,Change_T_Score,Risk_Segment_Prior_Ntm,Risk_Segment_During_Rx,Change_Risk_Segment,Dexa_Freq_During_Rx,Dexa_During_Rx,Frag_Frac_Prior_Ntm,Frag_Frac_During_Rx,Gluco_Record_Prior_Ntm,Gluco_Record_During_Rx
0,>-2.5,<=-2.5,No change,VLR_LR,VLR_LR,Unknown,0,N,N,N,N,N
1,>-2.5,Unknown,Unknown,VLR_LR,Unknown,Unknown,0,N,N,N,N,N
2,<=-2.5,<=-2.5,No change,HR_VHR,HR_VHR,No change,0,N,N,N,N,N
3,>-2.5,<=-2.5,No change,HR_VHR,HR_VHR,No change,0,N,N,N,N,Y
4,<=-2.5,Unknown,Unknown,HR_VHR,Unknown,Unknown,0,N,N,N,Y,Y


---
### Disease/Treatment Factor

In [15]:
# NTM - Injectable Experience 
ntm_i = df2.loc[:,[
    'Injectable_Experience_During_Rx'] ] 

# Risk Factors 
r_f = df2.loc[:,[ 
       #'Risk_Segment_Prior_Ntm', 'Risk_Segment_During_Rx','Change_Risk_Segment',
       
       'Risk_Type_1_Insulin_Dependent_Diabetes',
       'Risk_Osteogenesis_Imperfecta', 'Risk_Rheumatoid_Arthritis',
       'Risk_Untreated_Chronic_Hyperthyroidism',
       'Risk_Untreated_Chronic_Hypogonadism', 'Risk_Untreated_Early_Menopause',
       'Risk_Patient_Parent_Fractured_Their_Hip', 'Risk_Smoking_Tobacco',
       'Risk_Chronic_Malnutrition_Or_Malabsorption',
       'Risk_Chronic_Liver_Disease', 'Risk_Family_History_Of_Osteoporosis',
       'Risk_Low_Calcium_Intake', 'Risk_Vitamin_D_Insufficiency',
       'Risk_Poor_Health_Frailty', 'Risk_Excessive_Thinness',
       'Risk_Hysterectomy_Oophorectomy', 'Risk_Estrogen_Deficiency',
       'Risk_Immobilization', 'Risk_Recurring_Falls', 'Count_Of_Risks'] ] 
       
# NTM - Comorbidity  
ntm_com = df2.loc[:,[ 
       'Comorb_Encounter_For_Screening_For_Malignant_Neoplasms',
       'Comorb_Encounter_For_Immunization',
       'Comorb_Encntr_For_General_Exam_W_O_Complaint,_Susp_Or_Reprtd_Dx',
       'Comorb_Vitamin_D_Deficiency',
       'Comorb_Other_Joint_Disorder_Not_Elsewhere_Classified',
       'Comorb_Encntr_For_Oth_Sp_Exam_W_O_Complaint_Suspected_Or_Reprtd_Dx',
       'Comorb_Long_Term_Current_Drug_Therapy', 'Comorb_Dorsalgia',
       'Comorb_Personal_History_Of_Other_Diseases_And_Conditions',
       'Comorb_Other_Disorders_Of_Bone_Density_And_Structure',
       'Comorb_Disorders_of_lipoprotein_metabolism_and_other_lipidemias',
       'Comorb_Osteoporosis_without_current_pathological_fracture',
       'Comorb_Personal_history_of_malignant_neoplasm',
       'Comorb_Gastro_esophageal_reflux_disease'  ] ] 

# NTM - Concomitancy
ntm_con = df2.loc[:,[ 
       'Concom_Cholesterol_And_Triglyceride_Regulating_Preparations',
       'Concom_Narcotics', 'Concom_Systemic_Corticosteroids_Plain',
       'Concom_Anti_Depressants_And_Mood_Stabilisers',
       'Concom_Fluoroquinolones', 'Concom_Cephalosporins',
       'Concom_Macrolides_And_Similar_Types',
       'Concom_Broad_Spectrum_Penicillins', 'Concom_Anaesthetics_General',
       'Concom_Viral_Vaccines' ]  ]  

# Adherence 
ad = df2.loc[:,[ 
       'Adherent_Flag'] ]  

In [16]:
Disease = pd.concat( [ ntm_i,r_f ,ntm_com,ntm_con, ad ], axis =1 )
Disease.head() 

Unnamed: 0,Injectable_Experience_During_Rx,Risk_Type_1_Insulin_Dependent_Diabetes,Risk_Osteogenesis_Imperfecta,Risk_Rheumatoid_Arthritis,Risk_Untreated_Chronic_Hyperthyroidism,Risk_Untreated_Chronic_Hypogonadism,Risk_Untreated_Early_Menopause,Risk_Patient_Parent_Fractured_Their_Hip,Risk_Smoking_Tobacco,Risk_Chronic_Malnutrition_Or_Malabsorption,...,Concom_Narcotics,Concom_Systemic_Corticosteroids_Plain,Concom_Anti_Depressants_And_Mood_Stabilisers,Concom_Fluoroquinolones,Concom_Cephalosporins,Concom_Macrolides_And_Similar_Types,Concom_Broad_Spectrum_Penicillins,Concom_Anaesthetics_General,Concom_Viral_Vaccines,Adherent_Flag
0,Y,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,Adherent
1,Y,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,Adherent
2,Y,N,N,N,N,N,N,Y,N,N,...,N,N,N,N,N,N,N,N,N,Adherent
3,Y,N,N,N,N,N,N,N,Y,N,...,Y,Y,N,N,N,N,N,N,Y,Adherent
4,Y,N,N,N,N,N,N,N,Y,N,...,Y,Y,Y,N,N,N,N,N,N,Adherent


---
# Preprocess Part 0 Result
## Organized Data

`! Important Notes` 
* `NTM - Multiple Risk Factors` is handled under Disease/Treatment Factors. 
* `NTM - Dexa Scan Recency` has no matching column in the raw data.  

In [17]:
df = pd.concat( [ID, Target, Demographics, Provider, Clinical, Disease],axis=1 )  
df.head() 

Unnamed: 0,Ptid,Persistency_Flag,Age_Bucket,Race,Region,Gender,Ethnicity,Idn_Indicator,Ntm_Speciality,Ntm_Specialist_Flag,...,Concom_Narcotics,Concom_Systemic_Corticosteroids_Plain,Concom_Anti_Depressants_And_Mood_Stabilisers,Concom_Fluoroquinolones,Concom_Cephalosporins,Concom_Macrolides_And_Similar_Types,Concom_Broad_Spectrum_Penicillins,Concom_Anaesthetics_General,Concom_Viral_Vaccines,Adherent_Flag
0,P1,Persistent,>75,Caucasian,West,Male,Not Hispanic,N,GENERAL PRACTITIONER,Others,...,N,N,N,N,N,N,N,N,N,Adherent
1,P2,Non-Persistent,55-65,Asian,West,Male,Not Hispanic,N,GENERAL PRACTITIONER,Others,...,N,N,N,N,N,N,N,N,N,Adherent
2,P3,Non-Persistent,65-75,Other/Unknown,Midwest,Female,Hispanic,N,GENERAL PRACTITIONER,Others,...,N,N,N,N,N,N,N,N,N,Adherent
3,P4,Non-Persistent,>75,Caucasian,Midwest,Female,Not Hispanic,N,GENERAL PRACTITIONER,Others,...,Y,Y,N,N,N,N,N,N,Y,Adherent
4,P5,Non-Persistent,>75,Caucasian,Midwest,Female,Not Hispanic,N,GENERAL PRACTITIONER,Others,...,Y,Y,Y,N,N,N,N,N,N,Adherent


---
---
# Preprocess Part 1
## `Data Cleaning and Feature engineering`
### `Unique & Missing Values`


---
## Preprocessing 1.0

Necessary Functions

Modules

In [18]:
from collections import defaultdict

from collections import Counter 

Taking out Unique Variables

In [19]:
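def uvc(data):
    """Return a mapping of column name -> array of that column's unique values."""
    temp = defaultdict() 
    for i in data:  # iterating a DataFrame yields its column names
        temp[i] = pd.unique(data[i])
    return temp 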
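
To make the return value of `uvc` concrete, the sketch below builds the same kind of mapping on a toy frame with a dict comprehension and `Series.unique()`. The frame `toy` and the name `unique_by_column` are assumptions made for illustration; the values are converted to lists only so they print compactly.

```python
import pandas as pd

# Toy frame standing in for a slice of the dataset (assumed values).
toy = pd.DataFrame({
    "Race": ["Caucasian", "Asian", "Caucasian"],
    "Gender": ["Male", "Female", "Female"],
})

# Column name -> unique values, in order of first appearance.
unique_by_column = {col: toy[col].unique().tolist() for col in toy.columns}
print(unique_by_column)
```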

Creating Dataframe for Unique & NaN Values

In [20]:
def un(data, conf = 1):
    """Summarise unique and missing values per column.

    conf = 0 counts real NaN values; any other value counts the
    string 'NaN' that ext() writes.
    """
    u_temp = [] # Number of unique values per column
    n_temp = [] # Number of missing values per column

    for i in data.columns:
        u_temp.append( len( pd.unique(data[i])) ) 
        if conf == 0:
            n_temp.append( data[i].isna().sum() ) 
        else:
            # Count the 'NaN' strings; 0 if none are present
            n_temp.append( Counter(data[i]).get('NaN', 0) )

    un_temp = pd.DataFrame(list(zip(u_temp, n_temp)),
                columns =['Unique Values', 'Nan Values'])
    un_temp = un_temp.T
    un_temp.columns = data.columns

    # Row totals across all columns (the ID column is included)
    sum_temp = [un_temp.loc['Unique Values'].sum() , un_temp.loc['Nan Values'].sum()] 
    s_temp = pd.DataFrame(sum_temp, columns=['Sum'], index= un_temp.index ) 

    un_temp = pd.concat( [s_temp, un_temp] ,axis=1) 
    return  un_temp
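
The same kind of summary can be sketched with vectorized pandas calls instead of a Python loop. This is an alternative under stated assumptions, not the notebook's function: `toy` is a made-up frame that already holds the `'M'` / `'NaN'` strings, and `summary` is a hypothetical name.

```python
import pandas as pd

# Toy frame already converted to 'M' (meaningful) / 'NaN' (missing) strings.
toy = pd.DataFrame({
    "Race": ["M", "NaN", "M", "NaN"],
    "Gender": ["M", "M", "M", "M"],
})

# One row per statistic, one column per feature.
summary = pd.DataFrame({
    "Unique Values": toy.nunique(),      # distinct values per column
    "Nan Values": toy.eq("NaN").sum(),   # 'NaN' strings per column
}).T

# Prepend the row totals as a 'Sum' column.
summary.insert(0, "Sum", summary.sum(axis=1))
print(summary)
```

Note that `nunique()` ignores real `NaN` values by default, whereas `pd.unique` includes them, so the two approaches only agree when missing values are stored as strings, as they are here.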


Extracting Meaningful Values & Converting to Binary (Meaningful = 'M', Missing = 'NaN')

In [21]:
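def ext(data):
    """Replace every value with 'M' (meaningful) or 'NaN' (listed in nv)."""
    # nv is the global list of missing-value labels defined in In [23]
    cv = [] 
    for i in list(uvc(data).items()):
        for j in list(i)[1]:  # unique values of the current column
            if j not in nv:
                cv.append(j) 
    
    data = data.replace(cv,'M') 
    data = data.replace(nv,'NaN') 

    return data 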
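
The effect of `ext` can be illustrated on a toy frame with `DataFrame.isin` and `numpy.where`, which flag placeholder labels in one pass. This is a sketch of the same idea rather than the notebook's implementation; `toy`, `is_missing`, and `binary` are hypothetical names, and `nv` repeats the list the notebook defines in a later cell.

```python
import numpy as np
import pandas as pd

# Placeholder labels treated as missing (same list the notebook defines as nv).
nv = ["Other/Unknown", "Unknown"]

# Toy frame with a few placeholder entries (assumed values).
toy = pd.DataFrame({
    "Race": ["Caucasian", "Other/Unknown", "Asian"],
    "Ethnicity": ["Unknown", "Hispanic", "Not Hispanic"],
})

# True where the cell holds one of the placeholder labels.
is_missing = toy.isin(nv)

# 'NaN' for placeholders, 'M' for everything else.
binary = pd.DataFrame(
    np.where(is_missing, "NaN", "M"),
    index=toy.index,
    columns=toy.columns,
)
print(binary)
```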


---
## Preprocessing 1.1

Determining Missing Values

### `Check Unique Values to find NaN / Missing Values`

In [22]:
uvc(df)  

defaultdict(None,
            {'Ptid': array(['P1', 'P2', 'P3', ..., 'P3422', 'P3423', 'P3424'], dtype=object),
             'Persistency_Flag': array(['Persistent', 'Non-Persistent'], dtype=object),
             'Age_Bucket': array(['>75', '55-65', '65-75', '<55'], dtype=object),
             'Race': array(['Caucasian', 'Asian', 'Other/Unknown', 'African American'],
                   dtype=object),
             'Region': array(['West', 'Midwest', 'South', 'Other/Unknown', 'Northeast'],
                   dtype=object),
             'Gender': array(['Male', 'Female'], dtype=object),
             'Ethnicity': array(['Not Hispanic', 'Hispanic', 'Unknown'], dtype=object),
             'Idn_Indicator': array(['N', 'Y'], dtype=object),
             'Ntm_Speciality': array(['GENERAL PRACTITIONER', 'Unknown', 'ENDOCRINOLOGY', 'RHEUMATOLOGY',
                    'ONCOLOGY', 'PATHOLOGY', 'OBSTETRICS AND GYNECOLOGY',
                    'PSYCHIATRY AND NEUROLOGY', 'ORTHOPEDIC SURGERY',
        

`Define NaN / Missing Values`

In [23]:
nv = ['Other/Unknown', 'Unknown' ]   
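
Once the placeholder labels are known, they can also be turned into real missing values so that pandas' own tools (`isna`, `dropna`) apply, which corresponds to the `conf = 0` branch of `un`. A minimal sketch, assuming a toy frame; `toy`, `with_nan`, and `missing_per_column` are illustrative names.

```python
import pandas as pd

nv = ["Other/Unknown", "Unknown"]

# Toy frame with placeholder labels in two columns (assumed values).
toy = pd.DataFrame({
    "Region": ["West", "Other/Unknown", "South", "Other/Unknown"],
    "Ethnicity": ["Unknown", "Hispanic", "Not Hispanic", "Hispanic"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# mask() replaces entries where the condition is True with NaN.
with_nan = toy.mask(toy.isin(nv))

# Real missing values can now be counted per column.
missing_per_column = with_nan.isna().sum()
print(missing_per_column)
```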

---
## Preprocessing 1.2

Investigation of Missing Values

### Demographics & Target

* `Persistency_Flag`, `Age`, `Race`, `Region`, `Ethnicity`, `Gender`, `IDN Indicator`, `NTM - Physician Specialty`

In [24]:
temp = pd.concat( [Target, Demographics,Provider],axis=1)  

DataFrame of Unique & Missing Values

In [25]:
print('Number of Rows:',len(temp)) 
print('Number of Columns:',len(temp.columns)) 
un( ext(temp) ).T.head(1+len(temp.columns)).T

Number of Rows: 3424
Number of Columns: 10


Unnamed: 0,Sum,Persistency_Flag,Age_Bucket,Race,Region,Gender,Ethnicity,Idn_Indicator,Ntm_Speciality,Ntm_Specialist_Flag,Ntm_Speciality_Bucket
Unique Values,14,1,1,2,2,1,2,1,2,1,1
Nan Values,558,0,0,97,60,0,91,0,310,0,0


In [26]:
print('Number of Rows:',len(df2)) 
print('Number of Columns:',len(df2.columns)) 
un( ext(df2) ).T.head(1+len(df2.columns)).T

Number of Rows: 3424
Number of Columns: 69


Unnamed: 0,Sum,Ptid,Persistency_Flag,Gender,Race,Ethnicity,Region,Age_Bucket,Ntm_Speciality,Ntm_Specialist_Flag,...,Risk_Family_History_Of_Osteoporosis,Risk_Low_Calcium_Intake,Risk_Vitamin_D_Insufficiency,Risk_Poor_Health_Frailty,Risk_Excessive_Thinness,Risk_Hysterectomy_Oophorectomy,Risk_Estrogen_Deficiency,Risk_Immobilization,Risk_Recurring_Falls,Count_Of_Risks
Unique Values,77,1,1,1,2,2,2,1,2,1,...,1,1,1,1,1,1,1,1,1,1
Nan Values,7278,0,0,0,97,91,60,0,310,0,...,0,0,0,0,0,0,0,0,0,0


### Clinical 

* `NTM - T-Score`, `Change in T Score`, `NTM - Risk Segment`, `Change in Risk Segment`, `NTM - Multiple Risk Factors`, `NTM - Dexa Scan Frequency`, `NTM - Dexa Scan Recency`, `Dexa During Therapy`, `NTM - Fragility Fracture Recency`, `Fragility Fracture During Therapy`, `NTM - Glucocorticoid Recency`, `Glucocorticoid Usage During Therapy`


In [27]:
temp = Clinical.copy() 

In [28]:
print('Number of Rows:',len(temp)) 
print('Number of Columns:',len(temp.columns)) 
un( ext(temp) ).T.head(1+len(temp.columns)).T

Number of Rows: 3424
Number of Columns: 12


Unnamed: 0,Sum,Tscore_Bucket_Prior_Ntm,Tscore_Bucket_During_Rx,Change_T_Score,Risk_Segment_Prior_Ntm,Risk_Segment_During_Rx,Change_Risk_Segment,Dexa_Freq_During_Rx,Dexa_During_Rx,Frag_Frac_Prior_Ntm,Frag_Frac_During_Rx,Gluco_Record_Prior_Ntm,Gluco_Record_During_Rx
Unique Values,16,1,2,2,1,2,2,1,1,1,1,1,1
Nan Values,6720,0,1497,1497,0,1497,2229,0,0,0,0,0,0


### Disease/Treatment Factor

* `NTM - Injectable Experience`, `NTM - Comorbidity`, `NTM - Concomitancy`, `NTM - Risk Factors`, `Adherence`


In [29]:
temp = pd.concat([ntm_i, r_f ,ad],axis=1)

In [30]:
print('Number of Rows:',len(temp)) 
print('Number of Columns:',len(temp.columns)) 
un( ext(temp) ).T.head(1+len(temp.columns)).T

Number of Rows: 3424
Number of Columns: 22


Unnamed: 0,Sum,Injectable_Experience_During_Rx,Risk_Type_1_Insulin_Dependent_Diabetes,Risk_Osteogenesis_Imperfecta,Risk_Rheumatoid_Arthritis,Risk_Untreated_Chronic_Hyperthyroidism,Risk_Untreated_Chronic_Hypogonadism,Risk_Untreated_Early_Menopause,Risk_Patient_Parent_Fractured_Their_Hip,Risk_Smoking_Tobacco,...,Risk_Low_Calcium_Intake,Risk_Vitamin_D_Insufficiency,Risk_Poor_Health_Frailty,Risk_Excessive_Thinness,Risk_Hysterectomy_Oophorectomy,Risk_Estrogen_Deficiency,Risk_Immobilization,Risk_Recurring_Falls,Count_Of_Risks,Adherent_Flag
Unique Values,22,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
Nan Values,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


`NTM - Comorbidity`

In [31]:
temp = ntm_com 

In [32]:
print('Number of Rows:',len(temp)) 
print('Number of Columns:',len(temp.columns)) 
un( ext(temp) ).T.head(1+len(temp.columns)).T

Number of Rows: 3424
Number of Columns: 14


Unnamed: 0,Sum,Comorb_Encounter_For_Screening_For_Malignant_Neoplasms,Comorb_Encounter_For_Immunization,"Comorb_Encntr_For_General_Exam_W_O_Complaint,_Susp_Or_Reprtd_Dx",Comorb_Vitamin_D_Deficiency,Comorb_Other_Joint_Disorder_Not_Elsewhere_Classified,Comorb_Encntr_For_Oth_Sp_Exam_W_O_Complaint_Suspected_Or_Reprtd_Dx,Comorb_Long_Term_Current_Drug_Therapy,Comorb_Dorsalgia,Comorb_Personal_History_Of_Other_Diseases_And_Conditions,Comorb_Other_Disorders_Of_Bone_Density_And_Structure,Comorb_Disorders_of_lipoprotein_metabolism_and_other_lipidemias,Comorb_Osteoporosis_without_current_pathological_fracture,Comorb_Personal_history_of_malignant_neoplasm,Comorb_Gastro_esophageal_reflux_disease
Unique Values,14,1,1,1,1,1,1,1,1,1,1,1,1,1,1
Nan Values,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


`NTM - Concomitancy`

In [33]:
temp = ntm_con

In [34]:
print('Number of Rows:',len(temp)) 
print('Number of Columns:',len(temp.columns)) 
un( ext(temp) ).T.head(1+len(temp.columns)).T

Number of Rows: 3424
Number of Columns: 10


Unnamed: 0,Sum,Concom_Cholesterol_And_Triglyceride_Regulating_Preparations,Concom_Narcotics,Concom_Systemic_Corticosteroids_Plain,Concom_Anti_Depressants_And_Mood_Stabilisers,Concom_Fluoroquinolones,Concom_Cephalosporins,Concom_Macrolides_And_Similar_Types,Concom_Broad_Spectrum_Penicillins,Concom_Anaesthetics_General,Concom_Viral_Vaccines
Unique Values,10,1,1,1,1,1,1,1,1,1,1
Nan Values,0,0,0,0,0,0,0,0,0,0,0


---
## Preprocessing 1.3.0
Extraction of NaN Values

In [35]:
df_view = un( ext(df) ).T.head(1+len(df.columns)) 

Print the full summary without row or column truncation

In [36]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
    print(df_view)

                                                    Unique Values  Nan Values
Sum                                                            77        7278
Ptid                                                            1           0
Persistency_Flag                                                1           0
Age_Bucket                                                      1           0
Race                                                            2          97
Region                                                          2          60
Gender                                                          1           0
Ethnicity                                                       2          91
Idn_Indicator                                                   1           0
Ntm_Speciality                                                  2         310
Ntm_Specialist_Flag                                             1           0
Ntm_Speciality_Bucket                                           
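
An alternative to `pd.option_context` for viewing a long frame in full is `DataFrame.to_string()`, which renders every row regardless of the display options. The sketch below uses a made-up 100-row frame; `toy_view` and its `col_*` labels are hypothetical.

```python
import pandas as pd

# Hypothetical 100-row summary; long enough that the default display would truncate it.
toy_view = pd.DataFrame(
    {"Unique Values": [1] * 100, "Nan Values": [0] * 100},
    index=[f"col_{i}" for i in range(100)],
)

# to_string() returns the complete table as text, one line per row plus a header.
full_text = toy_view.to_string()
print(full_text)
```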

---
## Preprocessing 1.3.1

Drop Operations

In [37]:
temp = df.copy() 

In [38]:
# Values that mark a row for dropping
will_drop = nv 

# Columns excluded from the check
ignore_col = ['Change_Risk_Segment', 'Change_T_Score','Tscore_Bucket_During_Rx','Risk_Segment_During_Rx'] 

# Columns to be checked
temp_col = [i for i in temp.columns if i not in ignore_col]

# Drop rows that contain any value listed in will_drop
for i in temp_col:
    temp = temp[ ~temp[i].isin(will_drop) ] 
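
The column-by-column filter can also be written as a single boolean mask over the checked columns. This is a sketch of an equivalent approach on a toy frame, with `toy`, `check_cols`, `has_placeholder`, and `cleaned` as illustrative names and a shortened `ignore_col` list.

```python
import pandas as pd

will_drop = ["Other/Unknown", "Unknown"]
ignore_col = ["Change_T_Score"]  # shortened list, for illustration only

# Toy frame: P2 and P3 hold a placeholder in a checked column.
toy = pd.DataFrame({
    "Ptid": ["P1", "P2", "P3", "P4"],
    "Race": ["Caucasian", "Other/Unknown", "Asian", "Caucasian"],
    "Region": ["West", "West", "Other/Unknown", "South"],
    "Change_T_Score": ["Unknown", "No change", "Unknown", "Unknown"],
})

# Columns that take part in the check.
check_cols = [c for c in toy.columns if c not in ignore_col]

# True for rows with a placeholder in at least one checked column.
has_placeholder = toy[check_cols].isin(will_drop).any(axis=1)

# Keep only the rows without placeholders.
cleaned = toy[~has_placeholder]
print(cleaned["Ptid"].tolist())
```

Rows P1 and P4 survive even though `Change_T_Score` is `Unknown` for both, because that column is in `ignore_col`.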


---
# Preprocess Part 1 Result
NaN & Unique Value Extraction

In [39]:
print('Number of Rows of Raw Data:', len(df)) 
print('Number of Rows of Cleaned Data:', len(temp)) 
print('Number of Dropped Rows:', (len(df) - len(temp)) ) 

Number of Rows of Raw Data: 3424
Number of Rows of Cleaned Data: 2912
Number of Dropped Rows: 512
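
The earlier summary counts 558 placeholder values in the target, demographic and provider columns, while 512 rows are removed here. The difference is consistent with some rows holding a placeholder in more than one column. The toy sketch below shows how the two counts can diverge; `toy`, `flags`, `placeholder_values`, and `affected_rows` are illustrative names.

```python
import pandas as pd

nv = ["Other/Unknown", "Unknown"]

# Toy frame: the first row holds two placeholders, the third row holds one.
toy = pd.DataFrame({
    "Race": ["Other/Unknown", "Asian", "Caucasian"],
    "Region": ["Other/Unknown", "West", "Other/Unknown"],
})

flags = toy.isin(nv)

# Total number of placeholder cells across all columns.
placeholder_values = int(flags.sum().sum())

# Number of rows with at least one placeholder cell.
affected_rows = int(flags.any(axis=1).sum())

print(placeholder_values, affected_rows)
```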


In [40]:
df = temp 

---
---
# Preprocess Part 2 

### Model Development

In [48]:
df.head(2).T 

                                              0               1
Ptid                                         P1              P2
Persistency_Flag                     Persistent  Non-Persistent
Age_Bucket                                  >75           55-65
Race                                  Caucasian           Asian
Region                                     West            West
...                                         ...             ...
Concom_Macrolides_And_Similar_Types           N               N
Concom_Broad_Spectrum_Penicillins             N               N
Concom_Anaesthetics_General                   N               N
Concom_Viral_Vaccines                         N               N
Adherent_Flag                          Adherent        Adherent

[69 rows x 2 columns]
