----
# Inpatient Admissions Quality
----

## **Dataset Description:**

- **Internalpatientid:** This is the unique identifier for each patient in the dataset.
- **Age at admission:** This feature indicates the age of the patient at the time of admission to the hospital.
- **Admission date:** This represents the date and time when the patient was admitted to the hospital.
- **Discharge date:** This represents the date and time when the patient was discharged from the hospital after receiving inpatient care.
- **Admitting unit service:** This feature indicates the unit or department of the hospital where the patient was admitted.
- **Discharging unit service:** This feature indicates the unit or department of the hospital where the patient was discharged from.
- **Admitting specialty:** This feature indicates the medical specialty of the physician who admitted the patient to the hospital.
- **Discharging specialty:** This feature indicates the medical specialty of the physician who discharged the patient from the hospital.
- **First listed discharge diagnosis icd10 subcategory:** This feature represents the primary diagnosis for which the patient received treatment during their hospital stay.
- **Second listed discharge diagnosis icd10 subcategory:** This feature represents any additional secondary diagnoses for which the patient received treatment during their hospital stay.
- **Discharge disposition:** This feature indicates the status of the patient at the time of discharge, such as whether they were discharged to home or to another healthcare facility.
- **Died during admission:** This feature indicates whether the patient passed away during their hospital stay (Yes/No).
- **Outpatient referral flag:** This feature indicates whether the patient was referred to outpatient care after their hospital stay.
- **Service-connected flag:** This feature indicates whether the patient's health condition is related to their military service.
- **Agent Orange flag:** This feature indicates whether the patient's health condition is related to exposure to Agent Orange, a herbicide used during the Vietnam War.
- **State:** This feature indicates the state where the hospital is located.

## Azure notebook Setup

In [1]:
# Workspace connects to the Azure ML workspace. Dataset.Tabular is a class attribute that exposes the
# TabularDatasetFactory methods for creating TabularDataset objects, e.g. Dataset.Tabular.from_delimited_files().
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b'  # subscription ID (shown in Azure ML studio)
resource_group = 'VChamp-Team3'  # resource group name
workspace_name = 'vchamp-team3'  # workspace name

# Storage account: Algorithmia, resource group: VChamp-Team3, workspace: vchamp-team3.
# Connect to the workspace
workspace = Workspace(subscription_id, resource_group, workspace_name)

In [2]:
#['data_team3_synthetic_train']
datastore = workspace.datastores['data_team3_synthetic_quality_check']

In [3]:
# from_delimited_files creates a TabularDataset that represents tabular data in delimited files (e.g. CSV and TSV).

dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'inpatient_admissions_qual.csv')])

# Optional: preview the first 3 rows of the dataset
# dataset.take(3).to_pandas_dataframe()

In [4]:
# Convert the Azure TabularDataset into a pandas DataFrame, the format needed for the analysis below
inpatient_admissions_qual_data= dataset.to_pandas_dataframe()

In [5]:
type(inpatient_admissions_qual_data)

pandas.core.frame.DataFrame

In [6]:
inpatient_admissions_qual_data.head()

Unnamed: 0,Column1,Internalpatientid,Age at admission,Admission date,Discharge date,Admitting unit service,Discharging unit service,Admitting specialty,Discharging specialty,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory,Discharge disposition,Died during admission,Outpatientreferralflag,Serviceconnectedflag,Agentorangeflag,State
0,4,100012,55.31702,54:01.0,05:06.0,SURGERY,SURGERY,NEUROSURGERY,GENERAL SURGERY,Other and unspecified noninfective gastroenter...,Other specified disorders of white blood cells,Regular,,True,,,New Mexico
1,82,100399,85.70674,31:38.0,50:48.0,NHCU,NHCU,DOMICILIARY,NH HOSPICE,Unspecified mental disorder due to known physi...,"Malignant neoplasm of stomach, unspecified",Death without autopsy,,,False,False,Minnesota
2,154,100694,83.92612,55:24.0,55:24.0,NON-COUNT,NON-COUNT,SPINAL CORD INJURY,MEDICAL OBSERVATION,Abnormal levels of other serum enzymes,Other acute ischemic heart diseases,Regular,,True,,False,Idaho
3,155,100694,88.611203,28:13.0,36:18.0,NON-COUNT,NON-COUNT,SPINAL CORD INJURY,MEDICAL OBSERVATION,"Viral intestinal infection, unspecified",Hypo-osmolality and hyponatremia,Regular,,True,,False,Idaho
4,322,101407,88.925931,15:50.0,05:08.0,MEDICINE,MEDICINE,SPINAL CORD INJURY OBSERVATION,GENERAL(ACUTE MEDICINE),Unspecified dementia,Hypertensive chronic kidney disease with stage...,Regular,,True,,False,Louisiana


## **Importing Libraries**

In [7]:
# Importing essential libraries
import pandas as pd                 # Library for data manipulation and analysis
import numpy as np                  # Library for mathematical operations

## **Data Exploration**

In [8]:
# Use a shorter name for the DataFrame (an alias for the same object, not a copy)
df = inpatient_admissions_qual_data

In [9]:
# Display the first few rows of a DataFrame
df.head()

Unnamed: 0,Column1,Internalpatientid,Age at admission,Admission date,Discharge date,Admitting unit service,Discharging unit service,Admitting specialty,Discharging specialty,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory,Discharge disposition,Died during admission,Outpatientreferralflag,Serviceconnectedflag,Agentorangeflag,State
0,4,100012,55.31702,54:01.0,05:06.0,SURGERY,SURGERY,NEUROSURGERY,GENERAL SURGERY,Other and unspecified noninfective gastroenter...,Other specified disorders of white blood cells,Regular,,True,,,New Mexico
1,82,100399,85.70674,31:38.0,50:48.0,NHCU,NHCU,DOMICILIARY,NH HOSPICE,Unspecified mental disorder due to known physi...,"Malignant neoplasm of stomach, unspecified",Death without autopsy,,,False,False,Minnesota
2,154,100694,83.92612,55:24.0,55:24.0,NON-COUNT,NON-COUNT,SPINAL CORD INJURY,MEDICAL OBSERVATION,Abnormal levels of other serum enzymes,Other acute ischemic heart diseases,Regular,,True,,False,Idaho
3,155,100694,88.611203,28:13.0,36:18.0,NON-COUNT,NON-COUNT,SPINAL CORD INJURY,MEDICAL OBSERVATION,"Viral intestinal infection, unspecified",Hypo-osmolality and hyponatremia,Regular,,True,,False,Idaho
4,322,101407,88.925931,15:50.0,05:08.0,MEDICINE,MEDICINE,SPINAL CORD INJURY OBSERVATION,GENERAL(ACUTE MEDICINE),Unspecified dementia,Hypertensive chronic kidney disease with stage...,Regular,,True,,False,Louisiana


In [10]:
# Shape of the dataset
df.shape

num_rows = df.shape[0]  # Number of rows
num_cols = df.shape[1]  # Number of columns

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 4010
Number of columns: 17


In [11]:
# Get the number of unique values in the 'Internalpatientid' column
print("Number of Unique Internalpatientid")
df['Internalpatientid'].nunique()

Number of Unique Internalpatientid


632

In [12]:
# Drop the first column ('Column1'), which is not needed for the analysis
df = df.drop(df.columns[0], axis=1)

In [13]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4010 entries, 0 to 4009
Data columns (total 16 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Internalpatientid                                    4010 non-null   int64  
 1   Age at admission                                     4010 non-null   float64
 2   Admission date                                       4010 non-null   object 
 3   Discharge date                                       4006 non-null   object 
 4   Admitting unit service                               4010 non-null   object 
 5   Discharging unit service                             4010 non-null   object 
 6   Admitting specialty                                  4010 non-null   object 
 7   Discharging specialty                                4010 non-null   object 
 8   First listed discharge diagnosis icd10 subcategory   4010 non-null  

- The 'Internalpatientid' column contains integer values, while the 'Age at admission' column is in float format. The rest of the features are represented as objects.

## **Checking for Missing Values**

In [14]:
# Count the number of missing values in each column
df.isnull().sum()

Internalpatientid                                         0
Age at admission                                          0
Admission date                                            0
Discharge date                                            4
Admitting unit service                                    0
Discharging unit service                                  0
Admitting specialty                                       0
Discharging specialty                                     0
First listed discharge diagnosis icd10 subcategory        0
Second listed discharge diagnosis icd10 subcategory       0
Discharge disposition                                     0
Died during admission                                  4010
Outpatientreferralflag                                  194
Serviceconnectedflag                                   3750
Agentorangeflag                                         730
State                                                     0
dtype: int64

- Several columns have missing values. 'Died during admission' is empty in all 4010 rows, 'Serviceconnectedflag' is missing in 3750 rows, 'Agentorangeflag' in 730, and 'Outpatientreferralflag' in 194. 'Discharge date' is missing in 4 rows.
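
The raw counts above are easier to compare as a share of all rows. The following is a minimal sketch on a small made-up frame (the `toy` data and the `missing_summary` and `all_null_cols` names are illustrative assumptions, not part of the original notebook) that reports the count and percentage of missing values per column and lists columns that are entirely empty.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the admissions frame; the values are illustrative only.
toy = pd.DataFrame({
    "Internalpatientid": [1, 2, 3, 4],
    "Died during admission": [np.nan, np.nan, np.nan, np.nan],
    "Outpatientreferralflag": [True, np.nan, True, True],
})

# Count and percentage of missing values per column, worst first.
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("percent", ascending=False)

# Columns that contain no usable values at all.
all_null_cols = missing_summary.index[missing_summary["percent"] == 100.0].tolist()

print(missing_summary)
print(all_null_cols)
```

Applied to the real `df`, the same summary would make it explicit that a fully empty column such as 'Died during admission' carries no information.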

---

## **Data Preprocessing**

In [15]:
# Use a more descriptive name for the DataFrame (an alias for the same object, not a copy)
inpatient_admissions = df

#### Drop the low potential columns

In [16]:
# Drop the specified columns from the DataFrame
inpatient_admissions.drop(['Admission date','Discharge date','Admitting unit service','Discharging unit service','Admitting specialty','Discharging specialty','Discharge disposition', 'Died during admission', 'Outpatientreferralflag','Serviceconnectedflag','Agentorangeflag','State'], axis=1,inplace=True)

#### Checking Missing Values for Potential Attributes

In [17]:
# Count the number of missing values in each column
inpatient_admissions.isnull().sum()

Internalpatientid                                      0
Age at admission                                       0
First listed discharge diagnosis icd10 subcategory     0
Second listed discharge diagnosis icd10 subcategory    0
dtype: int64

- There are no missing values in the 'First listed discharge diagnosis icd10 subcategory' and 'Second listed discharge diagnosis icd10 subcategory' columns.

#### Sort the Dataset based on the 'Internalpatientid' and 'Age at admission' column in ascending order

In [18]:
# Sort the dataset by 'Internalpatientid' and then by 'Age at admission', both in ascending order
inpatient_admissions.sort_values(["Internalpatientid","Age at admission"],inplace=True)

### **Round off the Ages to One Decimal Place (00.0)**

In [19]:
# Format the 'Age at admission' values as 00.0 (note: this produces strings, so the column becomes object dtype)
inpatient_admissions["Age at admission"] = inpatient_admissions["Age at admission"].map("{:.1f}".format)
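
Because `"{:.1f}".format` returns text, later comparisons on this column (such as a per-patient maximum) are done as string comparisons. The sketch below uses made-up ages to show the difference between text formatting and numeric rounding with `Series.round(1)`. It is an illustration under assumed values, not a claim that the dataset contains ages where the two approaches disagree.

```python
import pandas as pd

# Hypothetical ages chosen to show the difference; not taken from the dataset.
ages = pd.Series([9.54, 57.04, 100.26])

as_text = ages.map("{:.1f}".format)  # strings: '9.5', '57.0', '100.3'
as_number = ages.round(1)            # floats: 9.5, 57.0, 100.3

# Strings compare character by character, so '9.5' sorts after '100.3'.
text_max = as_text.max()
number_max = as_number.max()

print(text_max, number_max)
```

When every age has the same number of digits before the decimal point, the two maxima agree. Keeping the column numeric avoids having to rely on that.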

In [20]:
# Print the updated DataFrame
inpatient_admissions

Unnamed: 0,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
1667,67,57.0,COVID-19,Hypertensive heart and chronic kidney disease ...
1408,200,68.5,Malignant neoplasm of prostate,"Retention of urine, unspecified"
3423,200,77.1,Diaphragmatic hernia without obstruction or ga...,Gastro-esophageal reflux disease without esoph...
1760,200,77.7,Pulmonary embolism without acute cor pulmonale,Morbid (severe) obesity with alveolar hypovent...
1772,200,83.8,Acute and chronic respiratory failure,"Acute kidney failure, unspecified"
...,...,...,...,...
1492,168008,73.2,Calculus of kidney,"Chronic kidney disease, stage 5"
1617,168008,73.3,End stage renal disease,Unspecified bacterial pneumonia
3308,168008,73.3,Other specified diseases of intestine,End stage renal disease
1684,168008,73.3,Unspecified bacterial pneumonia,End stage renal disease


### **Getting only Maximum Ages of 'Internalpatientid'**

In [21]:
# Find the maximum age for each internalpatientid
max_ages = inpatient_admissions.groupby('Internalpatientid')['Age at admission'].max().reset_index()

In [22]:
# Merge with the original dataframe to keep the rows at each patient's highest age (rows that tie at that age are all kept)
inpatient_admission = pd.merge(inpatient_admissions, max_ages, on =['Internalpatientid','Age at admission'], how = 'inner')
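
The inner merge keeps every row whose age equals the patient's maximum, so a patient with several admissions at the same rounded age contributes several rows. The sketch below reproduces that behaviour on made-up data and contrasts it with `idxmax`, which returns a single row per patient. The `toy` frame and its `diagnosis` column are assumptions for illustration. Which behaviour is wanted is a modelling choice, and the notebook itself keeps the ties.

```python
import pandas as pd

toy = pd.DataFrame({
    "Internalpatientid": [200, 200, 200, 351],
    "Age at admission": [77.1, 84.1, 84.1, 85.5],
    "diagnosis": ["A", "B", "C", "D"],  # hypothetical column
})

# Same approach as the notebook: an inner merge on the per-patient maximum keeps ties.
max_ages = toy.groupby("Internalpatientid")["Age at admission"].max().reset_index()
with_ties = toy.merge(max_ages, on=["Internalpatientid", "Age at admission"], how="inner")

# Alternative: idxmax gives one row label per patient (the first row at the maximum).
one_per_patient = toy.loc[toy.groupby("Internalpatientid")["Age at admission"].idxmax()]

print(len(with_ties), len(one_per_patient))
```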

In [23]:
inpatient_admission

Unnamed: 0,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,67,57.0,COVID-19,Hypertensive heart and chronic kidney disease ...
1,200,84.1,Acute and chronic respiratory failure,(Censored)
2,291,83.2,"Malignant neoplasm of middle lobe, bronchus or...",Secondary malignant neoplasm of lung
3,330,72.7,Other specified sepsis,Other specified bacterial agents as the cause ...
4,351,85.5,Hypertensive heart disease with heart failure,Diastolic (congestive) heart failure
...,...,...,...,...
763,167917,49.6,Hypertensive emergency,Systolic (congestive) heart failure
764,168008,73.3,End stage renal disease,Unspecified bacterial pneumonia
765,168008,73.3,Other specified diseases of intestine,End stage renal disease
766,168008,73.3,Unspecified bacterial pneumonia,End stage renal disease


### Round off the age

In [24]:
# Convert the 'Age at admission' column from object (string) to float
inpatient_admission["Age at admission"] = inpatient_admission["Age at admission"].astype(float)

# Round the 'Age at admission' values to the nearest whole number
inpatient_admission["Age at admission"] = inpatient_admission["Age at admission"].round()

In [25]:
# Cast the rounded ages to integers
inpatient_admission['Age at admission'] = inpatient_admission['Age at admission'].astype('int')

In [26]:
inpatient_admission

Unnamed: 0,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,67,57,COVID-19,Hypertensive heart and chronic kidney disease ...
1,200,84,Acute and chronic respiratory failure,(Censored)
2,291,83,"Malignant neoplasm of middle lobe, bronchus or...",Secondary malignant neoplasm of lung
3,330,73,Other specified sepsis,Other specified bacterial agents as the cause ...
4,351,86,Hypertensive heart disease with heart failure,Diastolic (congestive) heart failure
...,...,...,...,...
763,167917,50,Hypertensive emergency,Systolic (congestive) heart failure
764,168008,73,End stage renal disease,Unspecified bacterial pneumonia
765,168008,73,Other specified diseases of intestine,End stage renal disease
766,168008,73,Unspecified bacterial pneumonia,End stage renal disease


In [27]:
inpatient_admission['Internalpatientid'].nunique()

632

### Checking Value Counts

In [28]:
inpatient_ad_value_counts = inpatient_admission['Internalpatientid'].value_counts().to_frame()

In [29]:
inpatient_ad_value_counts = inpatient_ad_value_counts.reset_index()

In [30]:
inpatient_ad_value_counts.columns=['Internalpatientid','counts for inpatient admission']
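
The three cells above can also be written as one chain. The intermediate column names produced by `value_counts().to_frame().reset_index()` have differed between pandas versions, which is why the notebook overwrites them. The sketch below, on made-up IDs, uses `groupby(...).size()` with an explicit output name as one possible alternative; the `toy` frame and the `counts` name are illustrative assumptions.

```python
import pandas as pd

# Made-up patient IDs; each row stands for one retained admission.
toy = pd.DataFrame({"Internalpatientid": [13910, 13910, 67, 13910, 67, 351]})

counts = (
    toy.groupby("Internalpatientid")
       .size()                                              # rows per patient
       .reset_index(name="counts for inpatient admission")  # name the count column explicitly
       .sort_values("counts for inpatient admission", ascending=False)
       .reset_index(drop=True)
)

print(counts)
```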

In [31]:
inpatient_ad_value_counts

Unnamed: 0,Internalpatientid,counts for inpatient admission
0,13910,5
1,157947,4
2,39707,4
3,48647,4
4,89178,4
...,...,...
627,75714,1
628,29315,1
629,108167,1
630,51215,1


### Joining 'Age at admission' to Each Diagnosis Column with an Underscore

In [32]:
inpatient_admission

Unnamed: 0,Internalpatientid,Age at admission,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory
0,67,57,COVID-19,Hypertensive heart and chronic kidney disease ...
1,200,84,Acute and chronic respiratory failure,(Censored)
2,291,83,"Malignant neoplasm of middle lobe, bronchus or...",Secondary malignant neoplasm of lung
3,330,73,Other specified sepsis,Other specified bacterial agents as the cause ...
4,351,86,Hypertensive heart disease with heart failure,Diastolic (congestive) heart failure
...,...,...,...,...
763,167917,50,Hypertensive emergency,Systolic (congestive) heart failure
764,168008,73,End stage renal disease,Unspecified bacterial pneumonia
765,168008,73,Other specified diseases of intestine,End stage renal disease
766,168008,73,Unspecified bacterial pneumonia,End stage renal disease


In [33]:
# Create a new column that joins 'Age at admission' and 'First listed discharge diagnosis icd10 subcategory' into one string, separated by an underscore
inpatient_admission['inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory'] = inpatient_admission['Age at admission'].astype(str) + '_' + inpatient_admission['First listed discharge diagnosis icd10 subcategory'].astype(str)

In [34]:
# Create a new column that joins 'Age at admission' and 'Second listed discharge diagnosis icd10 subcategory' into one string, separated by an underscore
inpatient_admission['inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory'] = inpatient_admission['Age at admission'].astype(str) + '_' + inpatient_admission['Second listed discharge diagnosis icd10 subcategory'].astype(str)
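
As a small illustration of the `age_diagnosis` encoding, the sketch below builds the combined string on made-up rows and then splits it back into its two parts. The `toy` frame and the `age_diagnosis` and `parts` names are assumptions for this example. Splitting on the first underscore only is a precaution in case a diagnosis label itself contains an underscore.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age at admission": [57, 84],
    "diagnosis": ["COVID-19", "Acute and chronic respiratory failure"],
})

# Same pattern as the notebook: cast both parts to text and join them with '_'.
toy["age_diagnosis"] = toy["Age at admission"].astype(str) + "_" + toy["diagnosis"].astype(str)

# Recover the two parts by splitting on the first underscore only.
parts = toy["age_diagnosis"].str.split("_", n=1, expand=True)

print(toy["age_diagnosis"].tolist())
print(parts)
```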

#### Drop unwanted Columns

In [35]:
# Drop the specified columns from the DataFrame
inpatient_admission.drop(['Age at admission','First listed discharge diagnosis icd10 subcategory','Second listed discharge diagnosis icd10 subcategory'], axis=1,inplace=True)

In [36]:
inpatient_admission

Unnamed: 0,Internalpatientid,inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory,inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory
0,67,57_COVID-19,57_Hypertensive heart and chronic kidney disea...
1,200,84_Acute and chronic respiratory failure,84_(Censored)
2,291,"83_Malignant neoplasm of middle lobe, bronchus...",83_Secondary malignant neoplasm of lung
3,330,73_Other specified sepsis,73_Other specified bacterial agents as the cau...
4,351,86_Hypertensive heart disease with heart failure,86_Diastolic (congestive) heart failure
...,...,...,...
763,167917,50_Hypertensive emergency,50_Systolic (congestive) heart failure
764,168008,73_End stage renal disease,73_Unspecified bacterial pneumonia
765,168008,73_Other specified diseases of intestine,73_End stage renal disease
766,168008,73_Unspecified bacterial pneumonia,73_End stage renal disease


In [37]:
# Shape of the dataset
inpatient_admission.shape
num_rows = inpatient_admission.shape[0]
num_cols = inpatient_admission.shape[1]
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 768
Number of columns: 3


### **Group the DataFrame by 'Internalpatientid'**

In [38]:
# Group by 'Internalpatientid' and join each patient's values into one comma-separated string per column.
# x.dropna() removes missing values from each Series first, so only non-null values are joined.
df_grouped = inpatient_admission.groupby('Internalpatientid').agg(lambda x: ','.join(x.dropna()))
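
The aggregation can be checked on a small made-up frame. The sketch below applies the same `groupby(...).agg(...)` pattern and takes the separator from a variable, `sep`, which is a hypothetical name introduced here. Several diagnosis labels in this dataset contain commas themselves (for example "Retention of urine, unspecified"), so a different separator such as `"|"` may be easier to split later; that substitution is a suggestion, not something the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Internalpatientid": [168008, 168008, 67, 67],
    "first_dx": [
        "73_End stage renal disease",
        "73_Unspecified bacterial pneumonia",
        "57_COVID-19",
        np.nan,  # missing value, removed by dropna() before joining
    ],
})

sep = ","  # the notebook's separator; "|" would avoid clashing with commas inside labels

grouped = (
    toy.groupby("Internalpatientid")
       .agg(lambda x: sep.join(x.dropna()))
       .reset_index()
)

print(grouped)
```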

In [39]:
# Reset the index of the grouped DataFrame
inpatient_admissions_df_grouped = df_grouped.reset_index()
inpatient_admissions_df_grouped

Unnamed: 0,Internalpatientid,inpatient_admissions_First_listed_discharge_diagnosis_icd10_subcategory,inpatient_admissions_Second_listed_discharge_diagnosis_icd10_subcategory
0,67,57_COVID-19,57_Hypertensive heart and chronic kidney disea...
1,200,84_Acute and chronic respiratory failure,84_(Censored)
2,291,"83_Malignant neoplasm of middle lobe, bronchus...",83_Secondary malignant neoplasm of lung
3,330,73_Other specified sepsis,73_Other specified bacterial agents as the cau...
4,351,86_Hypertensive heart disease with heart failure,86_Diastolic (congestive) heart failure
...,...,...,...
627,166881,76_Secondary malignant neoplasm of liver and i...,"76_Chronic obstructive pulmonary disease, unsp..."
628,167102,75_Unspecified atrial fibrillation and atrial ...,"75_Thoracic aortic aneurysm, without rupture"
629,167404,"78_Nonrheumatic aortic valve disorder, unspeci...",78_Non-ST elevation (NSTEMI) myocardial infarc...
630,167917,50_Hypertensive emergency,50_Systolic (congestive) heart failure


### **Saving Inpatient Admission Grouped File**

In [40]:
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574/Best_Files'

In [41]:
inpatient_admissions_df_grouped.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_quality/df_inpatient_admission_qual.csv')
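
By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates this in a temporary directory with made-up rows. The file names are hypothetical, and whether the index should be dropped with `index=False` depends on how downstream steps read the file.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Internalpatientid": [67, 200],
    "first_dx": ["57_COVID-19", "84_Acute and chronic respiratory failure"],
})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")        # hypothetical file name
    without_index = os.path.join(tmp, "without_index.csv")  # hypothetical file name

    toy.to_csv(with_index)                  # default: index written as an unnamed column
    toy.to_csv(without_index, index=False)  # index omitted

    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```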

-----------