----
# Inpatient Admissions Train
----

## **Dataset Description:**

- **Internalpatientid:** This is the unique identifier for each patient in the dataset.
- **Age at admission:** This feature indicates the age of the patient at the time of admission to the hospital.
- **Admission date:** This represents the date and time when the patient was admitted to the hospital.
- **Discharge date:** This represents the date and time when the patient was discharged from the hospital after receiving inpatient care.
- **Admitting unit service:** This feature indicates the unit or department of the hospital where the patient was admitted.
- **Discharging unit service:** This feature indicates the unit or department of the hospital where the patient was discharged from.
- **Admitting specialty:** This feature indicates the medical specialty of the physician who admitted the patient to the hospital.
- **Discharging specialty:** This feature indicates the medical specialty of the physician who discharged the patient from the hospital.
- **First listed discharge diagnosis icd10 subcategory:** This feature represents the primary diagnosis for which the patient received treatment during their hospital stay.
- **Second listed discharge diagnosis icd10 subcategory:** This feature represents any additional secondary diagnoses for which the patient received treatment during their hospital stay.
- **Discharge disposition:** This feature indicates the status of the patient at the time of discharge, such as whether they were discharged to home or to another healthcare facility.
- **Died during admission:** This feature indicates whether the patient passed away during their hospital stay (stored as `True`/`False` in the data).
- **Outpatient referral flag:** This feature indicates whether the patient was referred to outpatient care after their hospital stay.
- **Service-connected flag:** This feature indicates whether the patient's health condition is related to their military service.
- **Agent Orange flag:** This feature indicates whether the patient's health condition is related to exposure to Agent Orange, a herbicide used during the Vietnam War.
- **State:** This feature indicates the state where the hospital is located.
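
The descriptions above use spaced names for the three flag features, while the loaded data spells them without spaces (`Outpatientreferralflag`, `Serviceconnectedflag`, `Agentorangeflag`). A minimal sketch of a column check follows; the helper `missing_columns` and the toy frame are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Column names as they appear in the loaded DataFrame (flag names have no spaces).
EXPECTED_COLUMNS = [
    "Internalpatientid", "Age at admission", "Admission date", "Discharge date",
    "Admitting unit service", "Discharging unit service",
    "Admitting specialty", "Discharging specialty",
    "First listed discharge diagnosis icd10 subcategory",
    "Second listed discharge diagnosis icd10 subcategory",
    "Discharge disposition", "Died during admission",
    "Outpatientreferralflag", "Serviceconnectedflag", "Agentorangeflag", "State",
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Hypothetical helper: list the expected column names absent from `frame`."""
    return [name for name in expected if name not in frame.columns]

# Toy frame (illustrative only) that uses the spaced flag name from the description.
toy_columns = [c for c in EXPECTED_COLUMNS if c != "Agentorangeflag"] + ["Agent Orange flag"]
toy = pd.DataFrame(columns=toy_columns)

missing = missing_columns(toy)
print(missing)
```

Running a check like this right after loading makes a renamed or misspelled column visible before any later `drop` or `groupby` call fails on it.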

## Azure notebook Setup

In [1]:
# Dataset.Tabular is a class attribute that gives access to the TabularDatasetFactory methods
# for creating new TabularDataset objects, e.g. Dataset.Tabular.from_delimited_files().
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b' # subscription ID, shown in Azure ML studio
resource_group = 'VChamp-Team3' # resource group name
workspace_name = 'vchamp-team3' # workspace name

# storage account: Algorithmia, resource group: VChamp-Team3, workspace: vchamp-team3
# Workspace constructor
workspace = Workspace(subscription_id, resource_group, workspace_name)

In [4]:
#['data_team3_synthetic_train']
datastore = workspace.datastores['data_team3_synthetic_train']

In [5]:
# from_delimited_files creates a TabularDataset that represents tabular data in delimited files (e.g. CSV and TSV)

dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'inpatient_admissions_train.csv')])

# preview the first 3 rows of the dataset
# dataset.take(3).to_pandas_dataframe()

In [6]:
# Convert the Azure TabularDataset into a pandas DataFrame, the format needed for the steps below
inpatient_admissions_train_data = dataset.to_pandas_dataframe()

In [7]:
type(inpatient_admissions_train_data)

pandas.core.frame.DataFrame

In [8]:
inpatient_admissions_train_data.head()

,Column1,Internalpatientid,Age at admission,Admission date,Discharge date,Admitting unit service,Discharging unit service,Admitting specialty,Discharging specialty,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory,Discharge disposition,Died during admission,Outpatientreferralflag,Serviceconnectedflag,Agentorangeflag,State
0,0,10,65.649075,2015-11-28 17:41:09,2015-11-29 01:43:14,NON-COUNT,NON-COUNT,DRUG DEPENDENCE TRMT UNIT,MEDICAL OBSERVATION,"Pneumonia, unspecified organism",Hypokalemia,Regular,False,False,,True,Utah
1,1,100001,83.767138,2009-10-01 21:19:50,2009-10-04 16:51:33,MEDICINE,MEDICINE,PSYCHIATRIC MENTALLY INFIRM,GENERAL(ACUTE MEDICINE),"Pneumonia, unspecified organism",Essential (primary) hypertension,Regular,False,True,False,False,North Carolina
2,2,100001,84.873295,2010-11-10 04:32:39,2010-11-19 08:49:45,SURGERY,SURGERY,SUBSTANCE ABUSE RES TRMT PROG,ORTHOPEDIC,"Osteoarthritis, unspecified site",Type 2 diabetes mellitus with neurological com...,Regular,False,False,,False,North Carolina
3,3,10001,70.900369,2020-03-20 02:02:26,2020-03-28 08:47:01,SURGERY,SURGERY,PLASTIC SURGERY,SURGICAL STEPDOWN,Nonrheumatic mitral (valve) prolapse,Postprocedural shock,Regular,False,True,,True,Florida
4,5,100016,83.054993,1999-11-20 14:23:45,1999-12-01 03:57:43,MEDICINE,MEDICINE,HEMATOLOGY/ONCOLOGY,GENERAL(ACUTE MEDICINE),"Pneumonia, unspecified organism",Unspecified mental disorder due to known physi...,Regular,False,False,,,Idaho


## **Importing Libraries**

In [9]:
# Importing essential libraries
import pandas as pd                 # Library for data manipulation and analysis
import numpy as np                  # Library for mathematical operations

## **Data Exploration**

In [10]:
# changing variable name for dataframe
df = inpatient_admissions_train_data

In [11]:
# Display the first few rows of a DataFrame
df.head()

,Column1,Internalpatientid,Age at admission,Admission date,Discharge date,Admitting unit service,Discharging unit service,Admitting specialty,Discharging specialty,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory,Discharge disposition,Died during admission,Outpatientreferralflag,Serviceconnectedflag,Agentorangeflag,State
0,0,10,65.649075,2015-11-28 17:41:09,2015-11-29 01:43:14,NON-COUNT,NON-COUNT,DRUG DEPENDENCE TRMT UNIT,MEDICAL OBSERVATION,"Pneumonia, unspecified organism",Hypokalemia,Regular,False,False,,True,Utah
1,1,100001,83.767138,2009-10-01 21:19:50,2009-10-04 16:51:33,MEDICINE,MEDICINE,PSYCHIATRIC MENTALLY INFIRM,GENERAL(ACUTE MEDICINE),"Pneumonia, unspecified organism",Essential (primary) hypertension,Regular,False,True,False,False,North Carolina
2,2,100001,84.873295,2010-11-10 04:32:39,2010-11-19 08:49:45,SURGERY,SURGERY,SUBSTANCE ABUSE RES TRMT PROG,ORTHOPEDIC,"Osteoarthritis, unspecified site",Type 2 diabetes mellitus with neurological com...,Regular,False,False,,False,North Carolina
3,3,10001,70.900369,2020-03-20 02:02:26,2020-03-28 08:47:01,SURGERY,SURGERY,PLASTIC SURGERY,SURGICAL STEPDOWN,Nonrheumatic mitral (valve) prolapse,Postprocedural shock,Regular,False,True,,True,Florida
4,5,100016,83.054993,1999-11-20 14:23:45,1999-12-01 03:57:43,MEDICINE,MEDICINE,HEMATOLOGY/ONCOLOGY,GENERAL(ACUTE MEDICINE),"Pneumonia, unspecified organism",Unspecified mental disorder due to known physi...,Regular,False,False,,,Idaho


In [12]:
# Shape of the dataset
df.shape

num_rows = df.shape[0]  # Number of rows
num_cols = df.shape[1]  # Number of columns

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 522740
Number of columns: 17


In [13]:
# Get the number of unique values in the 'Internalpatientid' column
print("Number of Unique Internalpatientid")
df['Internalpatientid'].nunique()

Number of Unique Internalpatientid


84536

In [14]:
# Drop the first column ('Column1'), which looks like a leftover row index from the source file
df = df.drop(df.columns[0], axis=1)

In [15]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522740 entries, 0 to 522739
Data columns (total 16 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Internalpatientid                                    522740 non-null  int64         
 1   Age at admission                                     522740 non-null  float64       
 2   Admission date                                       522740 non-null  datetime64[ns]
 3   Discharge date                                       522246 non-null  datetime64[ns]
 4   Admitting unit service                               522740 non-null  object        
 5   Discharging unit service                             522740 non-null  object        
 6   Admitting specialty                                  522740 non-null  object        
 7   Discharging specialty                                522740 non-null  obje

- The 'Internalpatientid' column contains integer values (int64), the 'Age at admission' column is in float format (float64), and the 'Admission date' and 'Discharge date' columns are datetime64[ns]. The remaining features are represented as objects.

## **Checking for Missing Values**

In [16]:
# Count the number of missing values in each column
df.isnull().sum()

Internalpatientid                                           0
Age at admission                                            0
Admission date                                              0
Discharge date                                            494
Admitting unit service                                      0
Discharging unit service                                    0
Admitting specialty                                         0
Discharging specialty                                       0
First listed discharge diagnosis icd10 subcategory          0
Second listed discharge diagnosis icd10 subcategory         0
Discharge disposition                                       0
Died during admission                                       0
Outpatientreferralflag                                  26443
Serviceconnectedflag                                   487689
Agentorangeflag                                        104512
State                                                       0
dtype: i

- The dataset has missing values, mainly in 'Serviceconnectedflag' (487,689 of 522,740 rows, about 93%), 'Agentorangeflag' (104,512 rows, about 20%), and 'Outpatientreferralflag' (26,443 rows, about 5%). 'Discharge date' is also missing in 494 rows.
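
Raw counts are easier to compare once they are expressed as a share of all rows. The sketch below shows one way to do that; the toy frame and its values are made up for illustration and are not taken from the admissions data.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only), not the real admissions data.
toy = pd.DataFrame({
    "Discharge date": pd.to_datetime(["2015-11-29", None, "2010-11-19", "2020-03-28"]),
    "Serviceconnectedflag": [np.nan, False, np.nan, np.nan],
    "State": ["Utah", "North Carolina", "North Carolina", "Florida"],
})

# isnull().mean() gives the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the sparsest columns first, which helps when deciding which features to drop.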

---

## **Data Preprocessing**

In [17]:
# changing variable name for dataframe
inpatient_admissions = df

#### Drop the low potential columns

In [18]:
# Drop the specified columns from the DataFrame
inpatient_admissions.drop(['Admission date','Discharge date','Admitting unit service','Discharging unit service','Admitting specialty','Discharging specialty','Discharge disposition', 'Died during admission', 'Outpatientreferralflag','Serviceconnectedflag','Agentorangeflag','State'], axis=1,inplace=True)
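
Note that `inpatient_admissions = df` does not copy the data: both names refer to the same DataFrame, so an in-place `drop` changes `df` as well. The sketch below illustrates this on a toy frame and shows one possible alternative, selecting the columns to keep into an independent copy. The variable names here are assumptions for illustration.

```python
import pandas as pd

# Toy frame (illustrative only).
df_toy = pd.DataFrame({"Internalpatientid": [1, 2], "State": ["Utah", "Idaho"]})

alias = df_toy                                  # second name for the same object
alias.drop(["State"], axis=1, inplace=True)     # in-place drop through the alias
print(list(df_toy.columns))                     # the original lost the column too

# Alternative: select the wanted columns and copy them, leaving the source intact.
df_toy2 = pd.DataFrame({"Internalpatientid": [1, 2], "State": ["Utah", "Idaho"]})
keep = ["Internalpatientid"]
subset = df_toy2[keep].copy()
print(list(df_toy2.columns))
print(list(subset.columns))
```

Keeping a short list of columns to retain is also easier to read than a long list of columns to drop.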

#### Checking Missing Values for Potential Attributes

In [19]:
# Count the number of missing values in each column
inpatient_admissions.isnull().sum()

Internalpatientid                                      0
Age at admission                                       0
First listed discharge diagnosis icd10 subcategory     0
Second listed discharge diagnosis icd10 subcategory    0
dtype: int64

- There are no missing values in the 'First listed discharge diagnosis icd10 subcategory' and 'Second listed discharge diagnosis icd10 subcategory' columns.

#### Sort the Dataset based on the 'Internalpatientid' and 'Age at admission' column in ascending order

In [20]:
# Sort the dataset based on the patientid and 'age' column in ascending order
inpatient_admissions.sort_values(["Internalpatientid","Age at admission"],inplace=True)

### **Round off the Ages to One Decimal Place (00.0)**

In [21]:
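# Round 'Age at admission' to one decimal place, keeping the column numeric so that
# the per-patient max() below compares numbers rather than text ("99.5" > "100.2" as strings)
inpatient_admissions["Age at admission"] = inpatient_admissions["Age at admission"].round(1)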
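
Whether the ages are kept numeric or turned into text matters for the per-patient `max()` taken in the next step, because text is compared character by character. The sketch below contrasts the two on made-up ages; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([99.46, 100.21])       # illustrative ages, not from the dataset

as_text = ages.map("{:.1f}".format)     # strings, dtype object
as_number = ages.round(1)               # floats, dtype float64

text_max = as_text.max()                # character-by-character comparison
number_max = as_number.max()            # numeric comparison
print(text_max, number_max)
```

With text, "99.5" sorts above "100.2" because "9" is greater than "1", so a patient with admissions on both sides of age 100 would get the wrong maximum.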

In [22]:
# Print the updated DataFrame
inpatient_admissions

,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
20794,1,67.7,Unspecified atrial fibrillation and atrial flu...,Essential (primary) hypertension
23988,1,68.3,Other chest pain,Essential (primary) hypertension
23989,1,72.3,"Malignant neoplasm of bladder, unspecified",Other intraoperative and postprocedural compli...
16811,1,78.7,Contusion of hip,Unspecified atrial fibrillation and atrial flu...
232316,2,55.0,Unspecified convulsions,Paranoid schizophrenia
...,...,...,...,...
177008,169062,73.6,Encounter for other specified aftercare,Malignant neoplasm of sigmoid colon
227589,169062,73.8,Other and unspecified noninfective gastroenter...,"Malignant neoplasm of colon, unspecified"
222043,169062,73.8,"Chronic obstructive pulmonary disease, unspeci...",Hypovolemia
226861,169062,74.1,Encounter for palliative care,"Malignant neoplasm of lower lobe, bronchus or ..."


### **Getting only Maximum Ages of 'Internalpatientid'**

In [23]:
# Find the maximum age for each internalpatientid
max_ages = inpatient_admissions.groupby('Internalpatientid')['Age at admission'].max().reset_index()

In [24]:
# Merge with the original dataframe to get the rows with the highest age
inpatient_admission = pd.merge(inpatient_admissions, max_ages, on =['Internalpatientid','Age at admission'], how = 'inner')

In [25]:
inpatient_admission

,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,1,78.7,Contusion of hip,Unspecified atrial fibrillation and atrial flu...
1,2,69.0,"Acute kidney failure, unspecified",Systolic (congestive) heart failure
2,2,69.0,"Acute kidney failure, unspecified","Chronic obstructive pulmonary disease, unspeci..."
3,2,69.0,Age-related physical debility,Systolic (congestive) heart failure
4,3,81.2,Encounter for other specified aftercare,Acquired absence of leg below knee
...,...,...,...,...
102655,169057,85.8,Hypertensive heart and chronic kidney disease ...,Systolic (congestive) heart failure
102656,169057,85.8,Hypertensive heart and chronic kidney disease ...,Systolic (congestive) heart failure
102657,169060,71.3,"Heart failure, unspecified",Hypo-osmolality and hyponatremia
102658,169062,74.1,Encounter for palliative care,"Malignant neoplasm of lower lobe, bronchus or ..."
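
The group-then-merge step keeps every row whose age equals the patient's maximum, so a patient can still have several rows (patient 2 above has three at age 69.0). As a sketch of an equivalent approach under the same assumption, `groupby(...).transform("max")` yields the per-patient maximum aligned to each row, which can then be used as a filter. The toy frame and the `diagnosis` column are made up for illustration.

```python
import pandas as pd

# Toy frame (illustrative only).
toy = pd.DataFrame({
    "Internalpatientid": [1, 1, 2, 2, 2],
    "Age at admission": [67.7, 78.7, 55.0, 69.0, 69.0],
    "diagnosis": ["A", "B", "C", "D", "E"],
})

# Per-patient maximum age, repeated on every row of that patient.
max_age = toy.groupby("Internalpatientid")["Age at admission"].transform("max")

# Keep only the rows at the maximum age; ties are all retained.
latest = toy[toy["Age at admission"] == max_age].reset_index(drop=True)
print(latest)
```

This avoids the intermediate `max_ages` frame, at the cost of being slightly less explicit about what is being matched.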


### Round off the age

In [26]:
# Convert 'Age at admission' to float in case it is stored as text
inpatient_admission["Age at admission"] = inpatient_admission["Age at admission"].astype(float)

# Round 'Age at admission' to the nearest whole year
inpatient_admission["Age at admission"] = inpatient_admission["Age at admission"].round()

In [27]:
inpatient_admission['Age at admission'] = inpatient_admission['Age at admission'].astype('int')

In [28]:
inpatient_admission

,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,1,79,Contusion of hip,Unspecified atrial fibrillation and atrial flu...
1,2,69,"Acute kidney failure, unspecified",Systolic (congestive) heart failure
2,2,69,"Acute kidney failure, unspecified","Chronic obstructive pulmonary disease, unspeci..."
3,2,69,Age-related physical debility,Systolic (congestive) heart failure
4,3,81,Encounter for other specified aftercare,Acquired absence of leg below knee
...,...,...,...,...
102655,169057,86,Hypertensive heart and chronic kidney disease ...,Systolic (congestive) heart failure
102656,169057,86,Hypertensive heart and chronic kidney disease ...,Systolic (congestive) heart failure
102657,169060,71,"Heart failure, unspecified",Hypo-osmolality and hyponatremia
102658,169062,74,Encounter for palliative care,"Malignant neoplasm of lower lobe, bronchus or ..."
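
One detail of this rounding step: `Series.round()` follows NumPy's round-half-to-even rule, so values ending in exactly .5 go to the nearest even integer rather than always up. A small sketch on made-up ages:

```python
import pandas as pd

ages = pd.Series([68.5, 69.5, 78.7])      # illustrative values only

# round() to whole years, then cast to int as in the cells above.
rounded = ages.round().astype("int")
print(rounded.tolist())
```

Here 68.5 becomes 68 while 69.5 becomes 70. If strict half-up rounding is wanted instead, the values would need a different treatment, for example adding 0.5 and taking the floor.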


In [29]:
inpatient_admission['Internalpatientid'].nunique()

84536

### Checking Value Counts

In [30]:
inpatient_ad_value_counts = inpatient_admission['Internalpatientid'].value_counts().to_frame()

In [31]:
inpatient_ad_value_counts = inpatient_ad_value_counts.reset_index()

In [32]:
inpatient_ad_value_counts.columns=['Internalpatientid','counts for inpatient admission']

In [33]:
inpatient_ad_value_counts

,Internalpatientid,counts for inpatient admission
0,127315,8
1,139058,7
2,9622,7
3,64296,7
4,25763,6
...,...,...
84531,154531,1
84532,25508,1
84533,153342,1
84534,936,1
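
The three cells above can also be written as a single chain. The sketch below uses `groupby(...).size()` and names the count column directly, which sidesteps the intermediate column names that `value_counts()` produces (these differ between pandas versions). The toy frame is illustrative only.

```python
import pandas as pd

# Toy frame (illustrative only): one row per retained admission.
toy = pd.DataFrame({"Internalpatientid": [1, 2, 2, 2, 3, 3]})

counts = (
    toy.groupby("Internalpatientid")
       .size()                                             # rows per patient
       .reset_index(name="counts for inpatient admission")
       .sort_values("counts for inpatient admission", ascending=False)
       .reset_index(drop=True)
)
print(counts)
```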


### Merging 'Age at admission' Column and Potential Columns with underscore

In [34]:
inpatient_admission

,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,1,79,Contusion of hip,Unspecified atrial fibrillation and atrial flu...
1,2,69,"Acute kidney failure, unspecified",Systolic (congestive) heart failure
2,2,69,"Acute kidney failure, unspecified","Chronic obstructive pulmonary disease, unspeci..."
3,2,69,Age-related physical debility,Systolic (congestive) heart failure
4,3,81,Encounter for other specified aftercare,Acquired absence of leg below knee
...,...,...,...,...
102655,169057,86,Hypertensive heart and chronic kidney disease ...,Systolic (congestive) heart failure
102656,169057,86,Hypertensive heart and chronic kidney disease ...,Systolic (congestive) heart failure
102657,169060,71,"Heart failure, unspecified",Hypo-osmolality and hyponatremia
102658,169062,74,Encounter for palliative care,"Malignant neoplasm of lower lobe, bronchus or ..."


In [35]:
# To create a new column that combines the 'Age at admission' and 'First listed discharge diagnosis icd10 subcategory' columns into a single string representation
inpatient_admission['inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory'] = inpatient_admission['Age at admission'].astype(str) + '_' + inpatient_admission['First listed discharge diagnosis icd10 subcategory'].astype(str)

In [36]:
# To create a new column that combines the 'Age at admission' and 'Second listed discharge diagnosis icd10 subcategory' columns into a single string representation
inpatient_admission['inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory'] = inpatient_admission['Age at admission'].astype(str) + '_' + inpatient_admission['Second listed discharge diagnosis icd10 subcategory'].astype(str)
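
Since the age and the diagnosis are joined with an underscore, they can later be separated again by splitting on the first underscore. The following sketch shows the round trip on a toy frame; the names `combined` and `parts` are assumptions for illustration.

```python
import pandas as pd

# Toy frame (illustrative only).
toy = pd.DataFrame({
    "Age at admission": [79, 69],
    "First listed discharge diagnosis icd10 subcategory": [
        "Contusion of hip",
        "Acute kidney failure, unspecified",
    ],
})

# Same construction as above: "<age>_<diagnosis>".
combined = (
    toy["Age at admission"].astype(str)
    + "_"
    + toy["First listed discharge diagnosis icd10 subcategory"].astype(str)
)

# Split on the first underscore only, in case a diagnosis text contains one.
parts = combined.str.split("_", n=1, expand=True)
parts.columns = ["age", "diagnosis"]
parts["age"] = parts["age"].astype(int)
print(parts)
```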

#### Drop unwanted Columns

In [37]:
# Drop the specified columns from the DataFrame
inpatient_admission.drop(['Age at admission','First listed discharge diagnosis icd10 subcategory','Second listed discharge diagnosis icd10 subcategory'], axis=1,inplace=True)

In [38]:
inpatient_admission

,Internalpatientid,inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory,inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory
0,1,79_Contusion of hip,79_Unspecified atrial fibrillation and atrial ...
1,2,"69_Acute kidney failure, unspecified",69_Systolic (congestive) heart failure
2,2,"69_Acute kidney failure, unspecified","69_Chronic obstructive pulmonary disease, unsp..."
3,2,69_Age-related physical debility,69_Systolic (congestive) heart failure
4,3,81_Encounter for other specified aftercare,81_Acquired absence of leg below knee
...,...,...,...
102655,169057,86_Hypertensive heart and chronic kidney disea...,86_Systolic (congestive) heart failure
102656,169057,86_Hypertensive heart and chronic kidney disea...,86_Systolic (congestive) heart failure
102657,169060,"71_Heart failure, unspecified",71_Hypo-osmolality and hyponatremia
102658,169062,74_Encounter for palliative care,"74_Malignant neoplasm of lower lobe, bronchus ..."


In [39]:
# Shape of the dataset
inpatient_admission.shape
num_rows = inpatient_admission.shape[0]
num_cols = inpatient_admission.shape[1]
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 102660
Number of columns: 3


### **Group the DataFrame by 'Internalpatientid'**

In [40]:
# Group the DataFrame by 'Internalpatientid' and join each patient's icd10 values into one comma-separated string.
# x.dropna() removes missing values from the Series first, so only non-null values end up in the concatenated string.
df_grouped = inpatient_admission.groupby('Internalpatientid').agg(lambda x: ','.join(x.dropna()))

In [41]:
# Reset the index of the grouped DataFrame
inpatient_admissions_df_grouped = df_grouped.reset_index()
inpatient_admissions_df_grouped

,Internalpatientid,inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory,inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory
0,1,79_Contusion of hip,79_Unspecified atrial fibrillation and atrial ...
1,2,"69_Acute kidney failure, unspecified,69_Acute ...","69_Systolic (congestive) heart failure,69_Chro..."
2,3,81_Encounter for other specified aftercare,81_Acquired absence of leg below knee
3,4,84_Acute gastric ulcer with hemorrhage,84_Non-ST elevation (NSTEMI) myocardial infarc...
4,5,76_Acute and subacute infective endocarditis,76_Acute respiratory failure
...,...,...,...
84531,169055,59_ST elevation (STEMI) myocardial infarction ...,59_Atherosclerotic heart disease of native cor...
84532,169057,86_Hypertensive heart and chronic kidney disea...,"86_Systolic (congestive) heart failure,86_Syst..."
84533,169060,"71_Heart failure, unspecified",71_Hypo-osmolality and hyponatremia
84534,169062,74_Encounter for palliative care,"74_Malignant neoplasm of lower lobe, bronchus ..."
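
Several diagnosis texts contain commas themselves (for example "Acute kidney failure, unspecified"), so a comma-joined string cannot be split back into individual diagnoses without ambiguity. As a sketch of one option, the join could use a separator that does not occur in the text. The `|` separator and the column name `first_dx` are assumptions for illustration, not what the notebook uses.

```python
import pandas as pd

# Toy frame (illustrative only); patient 3 has a missing diagnosis.
toy = pd.DataFrame({
    "Internalpatientid": [2, 2, 3],
    "first_dx": [
        "69_Acute kidney failure, unspecified",
        "69_Age-related physical debility",
        None,
    ],
})

SEP = "|"  # assumed separator; any character absent from the diagnosis text works

grouped = (
    toy.groupby("Internalpatientid")["first_dx"]
       .agg(lambda x: SEP.join(x.dropna()))   # drop missing values, then join
       .reset_index()
)
print(grouped)
```

A patient whose values are all missing ends up with an empty string rather than NaN, which is worth keeping in mind for later missing-value checks.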


### **Saving Inpatient Admission Grouped File**

In [49]:
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574'

In [42]:
inpatient_admissions_df_grouped.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_train/Potential_files_train/df_inpatient_admission_train.csv')
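
By default `to_csv` also writes the row index as a first, unnamed column, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False` using in-memory buffers instead of real files; the toy frame is illustrative only.

```python
import io
import pandas as pd

# Toy frame (illustrative only).
toy = pd.DataFrame({
    "Internalpatientid": [1, 2],
    "first_dx": ["79_Contusion of hip", "69_Age-related physical debility"],
})

# Default: the row index is written as an extra leading column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to pass `index=False` here depends on what the downstream notebooks expect, so this is a point to check rather than a required change.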

-----------