----
## **Generating the Readmission Test Target File from the Inpatient Admissions File**
----

### Azure notebook Setup

In [1]:
# Dataset.Tabular is a class attribute that exposes the TabularDatasetFactory
# methods for creating new TabularDataset objects.
# Usage: Dataset.Tabular.from_delimited_files().
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b'  # subscription ID; available after launching the studio
resource_group = 'VChamp-Team3'  # resource group name
workspace_name = 'vchamp-team3'  # workspace name

# Storage account: Algorithmia; resource group: VChamp-Team3; workspace: vchamp-team3.
# Construct the Workspace object from the identifiers above.
workspace = Workspace(subscription_id, resource_group, workspace_name)

In [2]:
# Select the test datastore (the other datastore name noted here was 'data_team3_synthetic_train')
datastore = workspace.datastores['data_team3_synthetic_test']

In [3]:
# from_delimited_files creates a TabularDataset that represents tabular data in delimited files (e.g. CSV and TSV).

dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'inpatient_admissions_test.csv')])

# Optional: preview the first 3 rows of the dataset
# dataset.take(3).to_pandas_dataframe()

In [4]:
# Convert the Azure TabularDataset into a pandas DataFrame, the format needed for the steps below
inpatient_admissions_test_data = dataset.to_pandas_dataframe()

----

### **Importing Libraries**

In [5]:
# Importing essential libraries
import pandas as pd                 # Library for data manipulation and analysis
import numpy as np                  # Library for mathematical operations

### **Data Exploration**

In [6]:
# Use a shorter name for the DataFrame (this binds a new name; it does not copy the data)
df = inpatient_admissions_test_data

In [7]:
# Display the first few rows of the DataFrame
df.head()

Unnamed: 0,Column1,Internalpatientid,Age at admission,Admission date,Discharge date,Admitting unit service,Discharging unit service,Admitting specialty,Discharging specialty,First listed discharge diagnosis icd10 subcategory,Second listed discharge diagnosis icd10 subcategory,Discharge disposition,Died during admission,Outpatientreferralflag,Serviceconnectedflag,Agentorangeflag,State
0,14,100041,83.927801,2009-03-09 14:40:14,2009-03-20 14:25:08,MEDICINE,SURGERY,HALFWAY HOUSE,Not specified (no value),Atherosclerotic heart disease of native corona...,Unstable angina,Regular,False,True,,,Minnesota
1,18,10005,49.697229,2001-04-30 08:05:33,2001-05-01 10:13:18,MEDICINE,MEDICINE,REHABILITATION MEDICINE,TELEMETRY,Other chest pain,Pure hypercholesterolemia,Regular,False,True,False,,Michigan
2,33,100106,65.239819,2015-04-13 03:41:17,2015-04-14 19:59:28,MEDICINE,MEDICINE,DOD BEDS IN VA FACILITY,GENERAL(ACUTE MEDICINE),"Anemia, unspecified",Unspecified protein-calorie malnutrition,Regular,False,True,,False,Ohio
3,34,100123,65.905689,2006-09-28 12:42:05,2006-10-30 19:14:57,NHCU,NHCU,NH LONG-STAY CONTINUING CARE,NH SHORT STAY REHABILITATION,Encounter for other specified aftercare,Unspecified atrial fibrillation and atrial flu...,Regular,False,True,False,False,Oklahoma
4,35,100126,71.81896,2020-11-23 19:55:17,2020-11-25 21:16:25,MEDICINE,MEDICINE,HALFWAY HOUSE,GENERAL(ACUTE MEDICINE),Burkitt lymphoma,Secondary malignant neoplasm of lung,Regular,False,True,,False,California


In [8]:
# Shape of the dataset: df.shape is a (rows, columns) tuple
num_rows, num_cols = df.shape

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 139086
Number of columns: 17


In [9]:
# Get the number of unique values in the 'Internalpatientid' column
print("Number of Unique Internalpatientid")
df['Internalpatientid'].nunique()

Number of Unique Internalpatientid


22255

In [10]:
# Drop the first column ('Column1'), which appears to be a leftover row index
df = df.drop(df.columns[0], axis=1)

In [11]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139086 entries, 0 to 139085
Data columns (total 16 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Internalpatientid                                    139086 non-null  int64         
 1   Age at admission                                     139086 non-null  float64       
 2   Admission date                                       139086 non-null  datetime64[ns]
 3   Discharge date                                       138948 non-null  datetime64[ns]
 4   Admitting unit service                               139086 non-null  object        
 5   Discharging unit service                             139086 non-null  object        
 6   Admitting specialty                                  139086 non-null  object        
 7   Discharging specialty                                139086 non-null  obje

- The 'Internalpatientid' column holds integers (int64) and 'Age at admission' holds floats (float64). 'Admission date' and 'Discharge date' are datetime64[ns], and the remaining columns shown in the output above are stored as objects.
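
Since the `df.info()` listing can be long, the column types can also be grouped programmatically. The following is a minimal sketch on a made-up frame: `toy` and the three list names are hypothetical, and only the column names are borrowed from the data.

```python
import pandas as pd

# Toy frame standing in for the admissions data (values are illustrative only).
toy = pd.DataFrame({
    "Internalpatientid": [1, 2],
    "Age at admission": [83.9, 49.7],
    "Admission date": pd.to_datetime(["2009-03-09 14:40:14", "2001-04-30 08:05:33"]),
    "Admitting unit service": ["MEDICINE", "MEDICINE"],
})

# Pick out columns by kind of dtype.
datetime_cols = toy.select_dtypes(include="datetime").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is neither datetime nor numeric (text columns here).
other_cols = [c for c in toy.columns if c not in datetime_cols + numeric_cols]

print("datetime:", datetime_cols)
print("numeric: ", numeric_cols)
print("other:   ", other_cols)
```

On the real data the same calls would be applied to `df` instead of `toy`.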

### **Checking for Missing Values**

In [12]:
# List comprehension to find columns with missing values
[features for features in df.columns if df[features].isnull().sum()>0]

['Discharge date',
 'Outpatientreferralflag',
 'Serviceconnectedflag',
 'Agentorangeflag']

In [13]:
# Count the number of missing values in each column
df.isnull().sum()

Internalpatientid                                           0
Age at admission                                            0
Admission date                                              0
Discharge date                                            138
Admitting unit service                                      0
Discharging unit service                                    0
Admitting specialty                                         0
Discharging specialty                                       0
First listed discharge diagnosis icd10 subcategory          0
Second listed discharge diagnosis icd10 subcategory         0
Discharge disposition                                       0
Died during admission                                       0
Outpatientreferralflag                                   7000
Serviceconnectedflag                                   129649
Agentorangeflag                                         27832
State                                                       0
dtype: i

- Four columns have missing values. 'Serviceconnectedflag' is missing in 129,649 of the 139,086 rows (about 93%), 'Agentorangeflag' in 27,832 (about 20%), and 'Outpatientreferralflag' in 7,000 (about 5%). 'Discharge date' is missing in only 138 rows (about 0.1%).
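
The counts above are easier to compare as shares of the total row count. The sketch below shows one way to build such a summary. It runs on a small made-up frame (`toy`), and the names `missing`, `missing_count`, and `missing_pct` are hypothetical.

```python
import pandas as pd

# Toy frame with a few gaps (values are illustrative only).
toy = pd.DataFrame({
    "Discharge date": pd.to_datetime(["2009-03-20", None, "2015-04-14", "2006-10-30"]),
    "Serviceconnectedflag": [None, None, None, False],
    "State": ["Minnesota", "Michigan", "Ohio", "Oklahoma"],
})

# Missing count and percentage per column.
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
})

# Keep only columns that have gaps, largest first.
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)
print(missing)
```

Replacing `toy` with `df` would give the same summary for the admissions data.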

---

### **Checking Readmission Count**

In [14]:
# Count the admissions per patient: one row per unique Internalpatientid
df_readmission = pd.DataFrame(df['Internalpatientid'].value_counts()).reset_index()
df_readmission.columns = ['Internalpatientid', 'counts']  # Rename the columns

In [15]:
# Flag a patient as readmitted (1) if they have more than one admission in the file, otherwise 0
df_readmission['Readmission'] = df_readmission['counts'].apply(lambda x: 1 if x > 1 else 0)

In [16]:
df_readmission

Unnamed: 0,Internalpatientid,counts,Readmission
0,49791,209,1
1,2125,123,1
2,48536,113,1
3,122832,103,1
4,98143,82,1
...,...,...,...
22250,119434,1,0
22251,70733,1,0
22252,24989,1,0
22253,41349,1,0


In [17]:
# The counts are no longer needed; keep only the patient ID and the target flag
df_readmission.drop(['counts'], axis=1, inplace=True)

In [18]:
df_readmission

Unnamed: 0,Internalpatientid,Readmission
0,49791,1
1,2125,1
2,48536,1
3,122832,1
4,98143,1
...,...,...
22250,119434,0
22251,70733,0
22252,24989,0
22253,41349,0
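
The same per-patient flag can also be derived in one vectorized step, without `apply`. This is a sketch of an alternative, not the notebook's own code: it uses `groupby(...).size()` on a made-up frame (`toy`), and it keeps the notebook's definition that any patient with more than one admission counts as readmitted. Unlike `value_counts()`, the result is ordered by patient ID rather than by count.

```python
import pandas as pd

# Toy admissions: patient 101 has three rows, 102 has two, 103 has one.
toy = pd.DataFrame({"Internalpatientid": [101, 102, 101, 103, 101, 102]})

# Number of admissions per patient.
admission_counts = toy.groupby("Internalpatientid").size()

# 1 if the patient appears more than once, else 0.
readmission = (
    (admission_counts > 1)
    .astype(int)
    .rename("Readmission")
    .reset_index()
)
print(readmission)
```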


### **Saving Readmission Target file**

In [19]:
# Check the current working directory
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574/Best_Files'

In [20]:
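# Save the target file; with the default index=True, the row index is also written as an unnamed first column
df_readmission.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_test/Potential_files_test/df_readmission_test.csv')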
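
By default `to_csv` writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. If that column is not wanted downstream, `index=False` avoids it. The sketch below illustrates the round trip on a made-up two-row frame written to a temporary directory. `toy_target` and `out_path` are hypothetical names, and whether later steps expect the index column is an assumption to check before changing the real save.

```python
import os
import tempfile

import pandas as pd

# Toy target frame (values are illustrative only).
toy_target = pd.DataFrame({
    "Internalpatientid": [49791, 119434],
    "Readmission": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "df_readmission_test.csv")
    # index=False leaves the row index out of the file.
    toy_target.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())  # ['Internalpatientid', 'Readmission']
```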

----------