----
# Inpatient Admissions Test
----

## **Dataset Description:**

- **Internalpatientid:** This is the unique identifier for each patient in the dataset.
- **Age at admission:** This feature indicates the age of the patient at the time of admission to the hospital.
- **Admission date:** This represents the date and time when the patient was admitted to the hospital.
- **Discharge date:** This represents the date and time when the patient was discharged from the hospital after receiving inpatient care.
- **Admitting unit service:** This feature indicates the unit or department of the hospital where the patient was admitted.
- **Discharging unit service:** This feature indicates the unit or department of the hospital where the patient was discharged from.
- **Admitting specialty:** This feature indicates the medical specialty of the physician who admitted the patient to the hospital.
- **Discharging specialty:** This feature indicates the medical specialty of the physician who discharged the patient from the hospital.
- **First listed discharge diagnosis icd10 subcategory:** This feature represents the primary diagnosis for which the patient received treatment during their hospital stay.
- **Second listed discharge diagnosis icd10 subcategory:** This feature represents any additional secondary diagnoses for which the patient received treatment during their hospital stay.
- **Discharge disposition:** This feature indicates the status of the patient at the time of discharge, such as whether they were discharged to home or to another healthcare facility.
- **Died during admission:** This feature indicates whether the patient passed away during their hospital stay.
 - **Yes/No**
- **Outpatient referral flag:** This feature indicates whether the patient was referred to outpatient care after their hospital stay.
- **Service-connected flag:** This feature indicates whether the patient's health condition is related to their military service.
- **Agent Orange flag:** This feature indicates whether the patient's health condition is related to exposure to Agent Orange, a herbicide used during the Vietnam War.
- **State:** This feature indicates the state where the hospital is located.

## Azure notebook Setup

In [1]:
# Dataset.Tabular is a class attribute that exposes the TabularDatasetFactory methods for creating new TabularDataset objects.
# Usage: Dataset.Tabular.from_delimited_files()
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b' # subscription ID (shown in Azure ML studio)
resource_group = 'VChamp-Team3' # resource group name
workspace_name = 'vchamp-team3' # workspace name

# Storage account: Algorithmia, resource group: VChamp-Team3, workspace: vchamp-team3
# Construct the Workspace object
workspace = Workspace(subscription_id, resource_group, workspace_name)

In [2]:
# Use the test datastore (the training counterpart is 'data_team3_synthetic_train')
datastore = workspace.datastores['data_team3_synthetic_test']

In [3]:
# from_delimited_files creates a TabularDataset that represents tabular data in delimited files (e.g. CSV and TSV)

dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'inpatient_admissions_test.csv')])

# Preview the first 3 rows of the dataset
# dataset.take(3).to_pandas_dataframe()

In [4]:
# Convert the Azure TabularDataset into a pandas DataFrame, the format needed for the rest of the analysis
inpatient_admissions_test_data= dataset.to_pandas_dataframe()

In [5]:
type(inpatient_admissions_test_data)

pandas.core.frame.DataFrame

In [6]:
inpatient_admissions_test_data.head()

Unnamed: 0,Column1,Internalpatientid,Age at admission,Admission date,Discharge date,Admitting unit service,Discharging unit service,Admitting specialty,Discharging specialty,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory,Discharge disposition,Died during admission,Outpatientreferralflag,Serviceconnectedflag,Agentorangeflag,State
0,14,100041,83.927801,2009-03-09 14:40:14,2009-03-20 14:25:08,MEDICINE,SURGERY,HALFWAY HOUSE,Not specified (no value),Atherosclerotic heart disease of native corona...,Unstable angina,Regular,False,True,,,Minnesota
1,18,10005,49.697229,2001-04-30 08:05:33,2001-05-01 10:13:18,MEDICINE,MEDICINE,REHABILITATION MEDICINE,TELEMETRY,Other chest pain,Pure hypercholesterolemia,Regular,False,True,False,,Michigan
2,33,100106,65.239819,2015-04-13 03:41:17,2015-04-14 19:59:28,MEDICINE,MEDICINE,DOD BEDS IN VA FACILITY,GENERAL(ACUTE MEDICINE),"Anemia, unspecified",Unspecified protein-calorie malnutrition,Regular,False,True,,False,Ohio
3,34,100123,65.905689,2006-09-28 12:42:05,2006-10-30 19:14:57,NHCU,NHCU,NH LONG-STAY CONTINUING CARE,NH SHORT STAY REHABILITATION,Encounter for other specified aftercare,Unspecified atrial fibrillation and atrial flu...,Regular,False,True,False,False,Oklahoma
4,35,100126,71.81896,2020-11-23 19:55:17,2020-11-25 21:16:25,MEDICINE,MEDICINE,HALFWAY HOUSE,GENERAL(ACUTE MEDICINE),Burkitt lymphoma,Secondary malignant neoplasm of lung,Regular,False,True,,False,California


## **Importing Libraries**

In [7]:
# Importing essential libraries
import pandas as pd                 # Library for data manipulation and analysis
import numpy as np                  # Library for mathematical operations

## **Data Exploration**

In [8]:
# Give the DataFrame a shorter name (this is an alias, not a copy)
df = inpatient_admissions_test_data

In [9]:
# Display the first few rows of a DataFrame
df.head()

Unnamed: 0,Column1,Internalpatientid,Age at admission,Admission date,Discharge date,Admitting unit service,Discharging unit service,Admitting specialty,Discharging specialty,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory,Discharge disposition,Died during admission,Outpatientreferralflag,Serviceconnectedflag,Agentorangeflag,State
0,14,100041,83.927801,2009-03-09 14:40:14,2009-03-20 14:25:08,MEDICINE,SURGERY,HALFWAY HOUSE,Not specified (no value),Atherosclerotic heart disease of native corona...,Unstable angina,Regular,False,True,,,Minnesota
1,18,10005,49.697229,2001-04-30 08:05:33,2001-05-01 10:13:18,MEDICINE,MEDICINE,REHABILITATION MEDICINE,TELEMETRY,Other chest pain,Pure hypercholesterolemia,Regular,False,True,False,,Michigan
2,33,100106,65.239819,2015-04-13 03:41:17,2015-04-14 19:59:28,MEDICINE,MEDICINE,DOD BEDS IN VA FACILITY,GENERAL(ACUTE MEDICINE),"Anemia, unspecified",Unspecified protein-calorie malnutrition,Regular,False,True,,False,Ohio
3,34,100123,65.905689,2006-09-28 12:42:05,2006-10-30 19:14:57,NHCU,NHCU,NH LONG-STAY CONTINUING CARE,NH SHORT STAY REHABILITATION,Encounter for other specified aftercare,Unspecified atrial fibrillation and atrial flu...,Regular,False,True,False,False,Oklahoma
4,35,100126,71.81896,2020-11-23 19:55:17,2020-11-25 21:16:25,MEDICINE,MEDICINE,HALFWAY HOUSE,GENERAL(ACUTE MEDICINE),Burkitt lymphoma,Secondary malignant neoplasm of lung,Regular,False,True,,False,California


In [10]:
# Shape of the dataset
df.shape

num_rows = df.shape[0]  # Number of rows
num_cols = df.shape[1]  # Number of columns

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 139086
Number of columns: 17


In [11]:
# Get the number of unique values in the 'Internalpatientid' column
print("Number of Unique Internalpatientid")
df['Internalpatientid'].nunique()

Number of Unique Internalpatientid


22255

In [12]:
# Drop the first column ('Column1'), which is not part of the dataset description
df = df.drop(df.columns[0], axis=1)

In [13]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139086 entries, 0 to 139085
Data columns (total 16 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Internalpatientid                                    139086 non-null  int64         
 1   Age at admission                                     139086 non-null  float64       
 2   Admission date                                       139086 non-null  datetime64[ns]
 3   Discharge date                                       138948 non-null  datetime64[ns]
 4   Admitting unit service                               139086 non-null  object        
 5   Discharging unit service                             139086 non-null  object        
 6   Admitting specialty                                  139086 non-null  object        
 7   Discharging specialty                                139086 non-null  obje

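- The 'Internalpatientid' column contains integer values (int64), while the 'Age at admission' column is in float format (float64). 'Admission date' and 'Discharge date' are stored as datetime64[ns]. The other columns visible in the summary above are represented as objects.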
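
The dtype summary can also be checked in code instead of being read off `df.info()`. The sketch below groups column names by dtype with `select_dtypes`; the `toy` frame and its values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Toy frame mimicking a few of the column types reported by df.info() (values are made up)
toy = pd.DataFrame({
    "Internalpatientid": [1, 2],
    "Age at admission": [70.5, 64.2],
    "Admission date": pd.to_datetime(["2009-03-09 14:40:14", "2001-04-30 08:05:33"]),
    "State": ["Minnesota", "Michigan"],
})

# Column names grouped by kind of dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
datetime_cols = toy.select_dtypes(include="datetime").columns.tolist()
other_cols = toy.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print("numeric :", numeric_cols)
print("datetime:", datetime_cols)
print("other   :", other_cols)
```

Running the same three `select_dtypes` calls on `df` would list every column by type, including the ones cut off in the truncated summary.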

## **Checking for Missing Values**

In [14]:
# Count the number of missing values in each column
df.isnull().sum()

Internalpatientid                                           0
Age at admission                                            0
Admission date                                              0
Discharge date                                            138
Admitting unit service                                      0
Discharging unit service                                    0
Admitting specialty                                         0
Discharging specialty                                       0
First listed discharge diagnosis icd10 subcategory          0
Second listed discharge diagnosis icd10 subcategory         0
Discharge disposition                                       0
Died during admission                                       0
Outpatientreferralflag                                   7000
Serviceconnectedflag                                   129649
Agentorangeflag                                         27832
State                                                       0
dtype: i

- The dataset has missing values in four columns: 'Serviceconnectedflag' (129,649 of 139,086 rows, about 93%), 'Agentorangeflag' (27,832, about 20%), 'Outpatientreferralflag' (7,000, about 5%), and 'Discharge date' (138, about 0.1%).
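
Raw counts are easier to compare when shown next to the share of rows they represent. A minimal sketch, assuming a small made-up frame (`toy`) in place of the real data:

```python
import pandas as pd

# Toy frame with a few missing entries (values are made up)
toy = pd.DataFrame({
    "Discharge date": pd.to_datetime(["2009-03-20", None, "2015-04-14", "2006-10-30"]),
    "Serviceconnectedflag": [None, False, None, None],
    "State": ["Minnesota", "Michigan", "Ohio", "Oklahoma"],
})

# Count of missing values and percentage of rows missing, per column
missing_counts = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing_counts, "missing_pct": missing_pct})
summary = summary.sort_values("missing_pct", ascending=False)
print(summary)
```

Sorting by percentage puts the sparsest columns first, which helps when deciding which columns to drop.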

---

## **Data Preprocessing**

In [15]:
# Refer to the DataFrame by a more descriptive name (an alias of df, not a copy)
inpatient_admissions = df

#### Drop the Low-Potential Columns

In [16]:
# Drop the specified columns from the DataFrame
inpatient_admissions.drop(
    ['Admission date', 'Discharge date', 'Admitting unit service', 'Discharging unit service',
     'Admitting specialty', 'Discharging specialty', 'Discharge disposition', 'Died during admission',
     'Outpatientreferralflag', 'Serviceconnectedflag', 'Agentorangeflag', 'State'],
    axis=1, inplace=True)
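
Because `inpatient_admissions` and `df` refer to the same object, the in-place drop also removes the columns from `df`. If the full frame is still needed later, one alternative is to select the columns to keep and take a copy. A minimal sketch on a made-up frame (`toy`, `keep_cols`, and `selected` are illustrative names):

```python
import pandas as pd

# Toy frame with the four retained columns plus one column to discard (values are made up)
toy = pd.DataFrame({
    "Internalpatientid": [7, 9],
    "Age at admission": [67.7, 51.8],
    "First listed discharge diagnosis icd10 subcategory": ["Acute pancreatitis", "Other chest pain"],
    "Second listed discharge diagnosis icd10 subcategory": ["Cholangitis", "Not specified"],
    "State": ["Ohio", "Michigan"],
})

keep_cols = [
    "Internalpatientid",
    "Age at admission",
    "First listed discharge diagnosis icd10 subcategory",
    "Second listed discharge diagnosis icd10 subcategory",
]

# Select the wanted columns into an independent copy; the original frame is left untouched
selected = toy[keep_cols].copy()

print(selected.columns.tolist())
print("State" in toy.columns)
```

Listing the columns to keep is also shorter than listing the twelve to drop.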

#### Checking Missing Values for Potential Attributes

In [17]:
# Count the number of missing values in each column
inpatient_admissions.isnull().sum()

Internalpatientid                                      0
Age at admission                                       0
First listed discharge diagnosis icd10 subcategory     0
Second listed discharge diagnosis icd10 subcategory    0
dtype: int64

- There are no missing values in the 'First listed discharge diagnosis icd10 subcategory' and 'Second listed discharge diagnosis icd10 subcategory' columns.

#### Sort the Dataset by the 'Internalpatientid' and 'Age at admission' columns in ascending order

In [18]:
# Sort the dataset by 'Internalpatientid' and 'Age at admission' in ascending order
inpatient_admissions.sort_values(["Internalpatientid","Age at admission"],inplace=True)

### **Round off the Ages to One Decimal Place (00.0)**

In [19]:
# Format 'Age at admission' values to one decimal place (00.0); note that this turns the column into strings
inpatient_admissions["Age at admission"] = inpatient_admissions["Age at admission"].map("{:.1f}".format)
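
Formatting with `"{:.1f}".format` produces strings, so any later `max()` compares text rather than numbers. This may not matter if every age has two digits before the decimal point, but it can pick the wrong row when that does not hold. A small sketch with made-up ages showing the difference, and `round(1)` as a numeric alternative:

```python
import pandas as pd

# Made-up ages, including a single-digit and a three-digit value
ages = pd.Series([9.46, 67.71, 100.04])

as_text = ages.map("{:.1f}".format)  # strings: '9.5', '67.7', '100.0'
as_number = ages.round(1)            # floats: 9.5, 67.7, 100.0

print(as_text.max())    # '9.5'  (text comparison: '9' sorts after '6' and '1')
print(as_number.max())  # 100.0  (numeric comparison)
```

Keeping the column numeric would also make the later conversion back to float unnecessary.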

In [20]:
# Print the updated DataFrame
inpatient_admissions

Unnamed: 0,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
103944,7,57.1,Acute pancreatitis,Calculus of bile duct without cholangitis or c...
101704,7,61.1,"Acute kidney failure, unspecified",Cholangitis
90423,7,61.1,Encounter for other specified aftercare,"Acute kidney failure, unspecified"
103468,7,63.2,Incisional hernia without obstruction or gangrene,Essential (primary) hypertension
102431,7,66.8,Intestinal adhesions [bands] with obstruction ...,Not specified
...,...,...,...,...
59552,169065,45.5,Quadriplegia,Unspecified atrial fibrillation and atrial flu...
88195,169065,47.4,Other hemorrhoids,Encounter for general adult medical examination
59366,169065,51.5,Quadriplegia,Encounter for general adult medical examination
59738,169065,53.1,"Acute pericarditis, unspecified",Quadriplegia


### **Getting only Maximum Ages of 'Internalpatientid'**

In [21]:
# Find the maximum age for each internalpatientid
max_ages = inpatient_admissions.groupby('Internalpatientid')['Age at admission'].max().reset_index()

In [22]:
# Merge with the original dataframe to get the rows with the highest age
inpatient_admission = pd.merge(inpatient_admissions, max_ages, on =['Internalpatientid','Age at admission'], how = 'inner')
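
The group-then-merge step keeps every row whose age equals the patient's maximum, so a patient with several admissions at the same (rounded) age keeps several rows. The same selection can be sketched in one pass with `groupby(...).transform("max")`; the frame below is made up and the `diagnosis` column is a stand-in for the two diagnosis columns.

```python
import pandas as pd

# Made-up admissions: patient 9 has two rows at the same maximum age
toy = pd.DataFrame({
    "Internalpatientid": [7, 7, 9, 9, 9],
    "Age at admission": [66.8, 67.7, 51.8, 51.8, 49.0],
    "diagnosis": ["A", "B", "C", "D", "E"],
})

# Per-row maximum age of the row's own patient
group_max = toy.groupby("Internalpatientid")["Age at admission"].transform("max")

# Keep only the rows at that maximum age (ties are all kept)
latest = toy[toy["Age at admission"] == group_max].reset_index(drop=True)
print(latest)
```

Ties are the reason the result above has more rows (27,055 in a later cell) than unique patients (22,255).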

In [23]:
inpatient_admission

Unnamed: 0,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,7,67.7,Intestinal adhesions [bands] with obstruction ...,Type 2 diabetes mellitus without complications
1,9,51.8,Other restrictive cardiomyopathy,Nonrheumatic mitral (valve) insufficiency
2,9,51.8,"Acute kidney failure, unspecified","Atrioventricular block, complete"
3,12,74.0,Other forms of chronic ischemic heart disease,"Volume depletion, unspecified"
4,17,82.4,"Acute kidney failure, unspecified",Hypertensive chronic kidney disease with stage...
...,...,...,...,...
27050,169011,69.1,"Major depressive disorder, recurrent",Other symptoms and signs involving emotional s...
27051,169037,85.0,Fracture of acetabulum,Fracture of acetabulum
27052,169037,85.0,(Censored),Fracture of pubis
27053,169059,79.9,Embolism and thrombosis of other specified veins,Other hyperlipidemia


### Round off the age

In [24]:
# Convert the 'Age at admission' column from object (string) back to float
inpatient_admission["Age at admission"] = inpatient_admission["Age at admission"].astype(float)

# Round the values in 'Age at admission' to the nearest whole number
inpatient_admission["Age at admission"] = inpatient_admission["Age at admission"].round()

In [25]:
inpatient_admission['Age at admission'] = inpatient_admission['Age at admission'].astype('int')
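
One detail worth knowing about `Series.round()`: values exactly halfway between two integers are rounded to the nearest even integer, not always up. A short sketch with made-up ages; the `np.floor(ages + 0.5)` line is an alternative that always rounds halves up, shown only for comparison.

```python
import numpy as np
import pandas as pd

# Made-up ages; 52.5 and 53.5 sit exactly halfway between two integers
ages = pd.Series([51.8, 52.5, 53.5, 79.9])

# Same two steps as in the notebook: round, then cast to integer
rounded = ages.round().astype("int")
print(rounded.tolist())  # [52, 52, 54, 80]

# Alternative: always round halves up
half_up = np.floor(ages + 0.5).astype("int")
print(half_up.tolist())  # [52, 53, 54, 80]
```

Either convention works for building the age prefix, as long as the same one is applied to the training and test files.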

In [26]:
inpatient_admission

Unnamed: 0,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,7,68,Intestinal adhesions [bands] with obstruction ...,Type 2 diabetes mellitus without complications
1,9,52,Other restrictive cardiomyopathy,Nonrheumatic mitral (valve) insufficiency
2,9,52,"Acute kidney failure, unspecified","Atrioventricular block, complete"
3,12,74,Other forms of chronic ischemic heart disease,"Volume depletion, unspecified"
4,17,82,"Acute kidney failure, unspecified",Hypertensive chronic kidney disease with stage...
...,...,...,...,...
27050,169011,69,"Major depressive disorder, recurrent",Other symptoms and signs involving emotional s...
27051,169037,85,Fracture of acetabulum,Fracture of acetabulum
27052,169037,85,(Censored),Fracture of pubis
27053,169059,80,Embolism and thrombosis of other specified veins,Other hyperlipidemia


In [27]:
inpatient_admission['Internalpatientid'].nunique()

22255

### Checking Value Counts

In [28]:
inpatient_ad_value_counts = inpatient_admission['Internalpatientid'].value_counts().to_frame()

In [29]:
inpatient_ad_value_counts = inpatient_ad_value_counts.reset_index()

In [30]:
inpatient_ad_value_counts.columns=['Internalpatientid','counts for inpatient admission']

In [31]:
inpatient_ad_value_counts

Unnamed: 0,Internalpatientid,counts for inpatient admission
0,125810,6
1,111964,5
2,21301,5
3,43011,5
4,45037,5
...,...,...
22250,102648,1
22251,104697,1
22252,33018,1
22253,35067,1
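
The three cells above (`value_counts().to_frame()`, `reset_index()`, and the column rename) can be condensed into one chained expression. A minimal sketch on made-up IDs:

```python
import pandas as pd

# Made-up patient IDs; 125810 appears three times
ids = pd.Series([125810, 125810, 7, 125810, 9, 9], name="Internalpatientid")

counts = (
    ids.value_counts()                      # rows per patient, most frequent first
       .rename_axis("Internalpatientid")    # name the index so it becomes a named column
       .reset_index(name="counts for inpatient admission")
)
print(counts)
```

Naming the index and the value column explicitly avoids depending on the default column names, which differ between pandas versions.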


### Combining 'Age at admission' with Each Diagnosis Column, Joined by an Underscore

In [32]:
inpatient_admission

Unnamed: 0,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,7,68,Intestinal adhesions [bands] with obstruction ...,Type 2 diabetes mellitus without complications
1,9,52,Other restrictive cardiomyopathy,Nonrheumatic mitral (valve) insufficiency
2,9,52,"Acute kidney failure, unspecified","Atrioventricular block, complete"
3,12,74,Other forms of chronic ischemic heart disease,"Volume depletion, unspecified"
4,17,82,"Acute kidney failure, unspecified",Hypertensive chronic kidney disease with stage...
...,...,...,...,...
27050,169011,69,"Major depressive disorder, recurrent",Other symptoms and signs involving emotional s...
27051,169037,85,Fracture of acetabulum,Fracture of acetabulum
27052,169037,85,(Censored),Fracture of pubis
27053,169059,80,Embolism and thrombosis of other specified veins,Other hyperlipidemia


In [33]:
# To create a new column that combines the 'Age at admission' and 'First listed discharge diagnosis icd10 subcategory' columns into a single string representation
inpatient_admission['inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory'] = inpatient_admission['Age at admission'].astype(str) + '_' + inpatient_admission['First listed discharge diagnosis icd10 subcategory'].astype(str)

In [34]:
# To create a new column that combines the 'Age at admission' and 'Second listed discharge diagnosis icd10 subcategory' columns into a single string representation
inpatient_admission['inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory'] = inpatient_admission['Age at admission'].astype(str) + '_' + inpatient_admission['Second listed discharge diagnosis icd10 subcategory'].astype(str)
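
The two cells above apply the same pattern to two columns, so it can be written once and looped. In the sketch below, `prefix_with_age` is a hypothetical helper (not part of the notebook) and the frame is made up.

```python
import pandas as pd

# Made-up rows with integer ages, as produced by the rounding step
toy = pd.DataFrame({
    "Age at admission": [68, 52],
    "First listed discharge diagnosis icd10 subcategory": ["Acute pancreatitis", "Other chest pain"],
    "Second listed discharge diagnosis icd10 subcategory": ["Cholangitis", "Not specified"],
})


def prefix_with_age(frame, column, prefix="inpatient_admissions_"):
    """Hypothetical helper: return the new column name and '<age>_<value>' strings."""
    new_name = prefix + column.replace(" ", "_")
    values = frame["Age at admission"].astype(str) + "_" + frame[column].astype(str)
    return new_name, values


diagnosis_cols = [
    "First listed discharge diagnosis icd10 subcategory",
    "Second listed discharge diagnosis icd10 subcategory",
]
for col in diagnosis_cols:
    name, values = prefix_with_age(toy, col)
    toy[name] = values

print(toy.iloc[:, -2:])
```

The helper builds the same column names as the notebook by replacing spaces with underscores and adding the `inpatient_admissions_` prefix.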

#### Drop unwanted Columns

In [35]:
# Drop the specified columns from the DataFrame
inpatient_admission.drop(['Age at admission','First listed discharge diagnosis icd10 subcategory','Second listed discharge diagnosis icd10 subcategory'], axis=1,inplace=True)

In [36]:
inpatient_admission

Unnamed: 0,Internalpatientid,inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory,inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory
0,7,68_Intestinal adhesions [bands] with obstructi...,68_Type 2 diabetes mellitus without complications
1,9,52_Other restrictive cardiomyopathy,52_Nonrheumatic mitral (valve) insufficiency
2,9,"52_Acute kidney failure, unspecified","52_Atrioventricular block, complete"
3,12,74_Other forms of chronic ischemic heart disease,"74_Volume depletion, unspecified"
4,17,"82_Acute kidney failure, unspecified",82_Hypertensive chronic kidney disease with st...
...,...,...,...
27050,169011,"69_Major depressive disorder, recurrent",69_Other symptoms and signs involving emotiona...
27051,169037,85_Fracture of acetabulum,85_Fracture of acetabulum
27052,169037,85_(Censored),85_Fracture of pubis
27053,169059,80_Embolism and thrombosis of other specified ...,80_Other hyperlipidemia


In [37]:
# Shape of the dataset
inpatient_admission.shape
num_rows = inpatient_admission.shape[0]
num_cols = inpatient_admission.shape[1]
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 27055
Number of columns: 3


### **Group the DataFrame by 'Internalpatientid'**

In [38]:
# Group the DataFrame by 'Internalpatientid' and concatenate each patient's icd10 values into one comma-separated string.
# x.dropna() removes missing values from each Series first, so only non-null values end up in the joined string.
df_grouped = inpatient_admission.groupby('Internalpatientid').agg(lambda x: ','.join(x.dropna()))
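
To make the effect of the aggregation concrete, here is the same `groupby(...).agg(...)` call on a made-up frame with short column names (`first_dx` and `second_dx` are stand-ins, not names used in the notebook).

```python
import pandas as pd

# Made-up rows: patient 9 has two admissions, one with a missing second diagnosis
toy = pd.DataFrame({
    "Internalpatientid": [9, 9, 12],
    "first_dx": [
        "52_Other restrictive cardiomyopathy",
        "52_Acute kidney failure, unspecified",
        "74_Other forms of chronic ischemic heart disease",
    ],
    "second_dx": [
        "52_Nonrheumatic mitral (valve) insufficiency",
        None,
        "74_Volume depletion, unspecified",
    ],
})

# One row per patient; non-null values of each column joined with commas
grouped = (
    toy.groupby("Internalpatientid")
       .agg(lambda x: ",".join(x.dropna()))
       .reset_index()
)
print(grouped.loc[0, "first_dx"])
print(grouped.loc[0, "second_dx"])
```

Note that some diagnosis names contain commas themselves (for example "Acute kidney failure, unspecified"), so the joined string cannot be split back reliably on `,`. If the values need to be separated again later, a separator that does not occur in the data, such as `|`, may be a safer choice.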

In [39]:
# Reset the index of the grouped DataFrame
inpatient_admissions_df_grouped = df_grouped.reset_index()
inpatient_admissions_df_grouped

Unnamed: 0,Internalpatientid,inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory,inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory
0,7,68_Intestinal adhesions [bands] with obstructi...,68_Type 2 diabetes mellitus without complications
1,9,"52_Other restrictive cardiomyopathy,52_Acute k...","52_Nonrheumatic mitral (valve) insufficiency,5..."
2,12,74_Other forms of chronic ischemic heart disease,"74_Volume depletion, unspecified"
3,17,"82_Acute kidney failure, unspecified",82_Hypertensive chronic kidney disease with st...
4,22,61_Unspecified complication of internal prosth...,"61_Not specified,61_Unspecified complication o..."
...,...,...,...
22250,168995,75_Chronic obstructive pulmonary disease with ...,75_Acute and chronic respiratory failure
22251,169011,"69_Major depressive disorder, recurrent",69_Other symptoms and signs involving emotiona...
22252,169037,"85_Fracture of acetabulum,85_(Censored)","85_Fracture of acetabulum,85_Fracture of pubis"
22253,169059,80_Embolism and thrombosis of other specified ...,80_Other hyperlipidemia


### **Saving Inpatient Admission Grouped File**

In [40]:
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574/Best_Files'

In [41]:
inpatient_admissions_df_grouped.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_test/Potential_files_test/df_inpatient_admission_test.csv')
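
By default `to_csv` also writes the row index as an extra unnamed first column. If that column is not wanted when the file is read back, `index=False` leaves it out. A minimal sketch, assuming a temporary directory in place of the real output folder and a made-up two-row frame:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up grouped frame standing in for inpatient_admissions_df_grouped
grouped = pd.DataFrame({
    "Internalpatientid": [7, 9],
    "first_dx": ["68_A", "52_B,52_C"],
})

# A temporary directory stands in for the real output folder (assumption for this sketch)
output_dir = Path(tempfile.mkdtemp()) / "Potential_files_test"
output_dir.mkdir(parents=True, exist_ok=True)

output_path = output_dir / "df_inpatient_admission_test.csv"
grouped.to_csv(output_path, index=False)  # do not write the row index

# Read the file back to confirm that only the data columns were stored
reloaded = pd.read_csv(output_path)
print(reloaded.columns.tolist())
```

Building the path with `pathlib` and creating the folder first also avoids a failure when the output directory does not exist yet.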

-----------