---
# **Measurements Train File**
---

## **Dataset Description:**

- **Internalpatientid:** Internal identifier for each patient, likely unique to each individual in the dataset.
- **Age at measurement:** The patient's age at the time of the measurement, stored as a number.
- **Measurement date:** The date and time at which the measurement was taken.
- **Measurement:** The type of measurement taken, such as pulse, weight, blood pressure, respiratory rate, or pain level.
- **Result numeric:** The numeric value of the result for the given measurement type.
- **Result textual:** The textual form of the result. In this dataset it holds only blood pressure values (for example, `140/96`).
- **State:** The geographical state in which the measurement was recorded.
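
Because blood pressure appears only as text in `Result textual`, it cannot be used numerically without parsing. The sketch below is one possible way to split such a value into two numbers on a toy frame. The column names `systolic` and `diastolic` are hypothetical and are not part of the dataset, and this notebook itself drops the column rather than parsing it.

```python
import pandas as pd

# Toy rows mirroring the columns described above (values are illustrative).
sample = pd.DataFrame({
    "Measurement": ["Pulse", "Blood pressure"],
    "Result numeric": [74.0, None],
    "Result textual": [None, "140/96"],
})

# Split "systolic/diastolic" text into two numeric columns.
# `systolic` and `diastolic` are hypothetical names chosen for this sketch.
bp = sample["Result textual"].str.extract(r"^(?P<systolic>\d+)/(?P<diastolic>\d+)$")
bp = bp.apply(pd.to_numeric, errors="coerce")  # non-matching rows become NaN
sample = sample.join(bp)
print(sample)
```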


### Azure notebook Setup

In [1]:
# Dataset.Tabular is a class attribute that provides access to the TabularDatasetFactory
# methods for creating new TabularDataset objects.
# Usage: Dataset.Tabular.from_delimited_files()
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b'  # subscription ID, available from the studio launch page
resource_group = 'VChamp-Team3'  # resource group name
workspace_name = 'vchamp-team3'  # workspace name


# Storage account: Algorithmia, resource group: VChamp-Team3, workspace: vchamp-team3.
# Construct the Workspace object
workspace = Workspace(subscription_id, resource_group, workspace_name)

In [2]:
# Look up the registered datastore 'data_team3_synthetic_train' by name
datastore = workspace.datastores['data_team3_synthetic_train'] 

In [3]:
# from_delimited_files creates a TabularDataset to represent tabular data in delimited files (e.g. CSV and TSV).

dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'measurements_train.csv')])

# preview the first 3 rows of the dataset
# dataset.take(3).to_pandas_dataframe()

In [4]:
# Convert the Azure TabularDataset into a pandas DataFrame, the format needed for the analysis below
measurements_train_data = dataset.to_pandas_dataframe()

In [5]:
type(measurements_train_data)

pandas.core.frame.DataFrame

In [6]:
measurements_train_data.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric,Result textual,State
0,0,1,59.124538,2003-05-21 00:27:01,Temperature,95.804066,,Indiana
1,1,1,59.124538,2003-05-21 00:27:01,Pain,3.0,,Indiana
2,2,1,59.124538,2003-05-21 00:27:01,Pulse,74.0,,Indiana
3,3,1,59.124538,2003-05-21 00:27:01,Respiratory rate,16.0,,Indiana
4,4,1,59.124538,2003-05-21 00:27:01,Blood pressure,,140/96,Indiana


## **Importing Libraries**

In [7]:
# Importing essential libraries
import pandas as pd        # Library for data manipulation and analysis
import numpy as np         # Library for mathematical operations

## **Data Exploration**

In [8]:
# Refer to the DataFrame by a shorter name (this is an alias, not a copy)
df = measurements_train_data

In [9]:
# Display the first few rows of a DataFrame
df.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric,Result textual,State
0,0,1,59.124538,2003-05-21 00:27:01,Temperature,95.804066,,Indiana
1,1,1,59.124538,2003-05-21 00:27:01,Pain,3.0,,Indiana
2,2,1,59.124538,2003-05-21 00:27:01,Pulse,74.0,,Indiana
3,3,1,59.124538,2003-05-21 00:27:01,Respiratory rate,16.0,,Indiana
4,4,1,59.124538,2003-05-21 00:27:01,Blood pressure,,140/96,Indiana


In [10]:
# Shape of the dataset
num_rows, num_cols = df.shape  # (number of rows, number of columns)
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 125247162
Number of columns: 8


In [11]:
# Get the number of unique values in the 'Internalpatientid' column
print("Number of Unique Internalpatientid")
df['Internalpatientid'].nunique()

Number of Unique Internalpatientid


132289

In [12]:
# Drop the columns that are not used in the rest of the analysis
df.drop(['Column1', 'Result textual', 'State'], axis=1, inplace=True)

In [13]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125247162 entries, 0 to 125247161
Data columns (total 5 columns):
 #   Column              Dtype         
---  ------              -----         
 0   Internalpatientid   int64         
 1   Age at measurement  float64       
 2   Measurement date    datetime64[ns]
 3   Measurement         object        
 4   Result numeric      float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 4.7+ GB
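
At this size, the `object` dtype of the `Measurement` column can account for a large share of the memory footprint. As a hedged aside, one common pandas technique for a column with few distinct values is the `category` dtype. The sketch below uses a toy frame, not the real data, and the actual saving on this dataset is not measured here.

```python
import pandas as pd

# Toy column with a handful of repeated labels, as in 'Measurement'.
toy = pd.DataFrame({"Measurement": ["Pulse", "Pain", "Pulse", "Temperature"] * 1000})

before = toy.memory_usage(deep=True).sum()
# Store each distinct label once and keep small integer codes per row.
toy["Measurement"] = toy["Measurement"].astype("category")
after = toy.memory_usage(deep=True).sum()

print("bytes before:", before)
print("bytes after:", after)
```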


## **Checking Missing Values**

In [14]:
# Count the number of missing values in each column
df.isnull().sum()

Internalpatientid            0
Age at measurement           0
Measurement date             0
Measurement                  0
Result numeric        22866376
dtype: int64
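
Before dropping these rows, it can help to see which measurement types the missing values belong to. The following is a minimal sketch on a toy frame, assuming the same `Measurement` and `Result numeric` column names. The toy values are illustrative and do not reproduce the counts above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Measurement": ["Pulse", "Blood pressure", "Pain", "Blood pressure"],
    "Result numeric": [74.0, np.nan, 3.0, np.nan],
})

# Count missing numeric results separately for each measurement type.
missing_by_type = toy["Result numeric"].isna().groupby(toy["Measurement"]).sum()
print(missing_by_type)
```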

## **Removing NaN values**

In [15]:
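# Drop rows without a numeric result; this includes blood pressure rows, whose values are stored as text
df = df.dropna(subset=['Result numeric'])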

In [16]:
df.Internalpatientid.nunique()

132281

In [17]:
df.isnull().sum()

Internalpatientid     0
Age at measurement    0
Measurement date      0
Measurement           0
Result numeric        0
dtype: int64

### **Removing Outliers**

In [18]:
# Define the filter conditions for each category
filters = {
    'Pulse': (40, 250),
    'Respiratory rate': (1, 30),
    'Pain': (0, 15),
    'Temperature': (90, 107)
}

# Apply the filters for each category, then concatenate the parts once
filtered_parts = []
for category, (min_val, max_val) in filters.items():
    category_data = df[df['Measurement'] == category]
    in_range = category_data['Result numeric'].between(min_val, max_val)  # bounds are inclusive
    filtered_parts.append(category_data[in_range])
filtered_data = pd.concat(filtered_parts)

In [19]:
# Reset the index of the filtered data
filtered_data = filtered_data.reset_index(drop=True)
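
The per-category range filter can also be written without a loop by mapping each row's measurement type to its bounds. This is only a sketch of an alternative on toy data, not the method used in this notebook. Unlike the loop, it keeps rows in their original order instead of grouping them by category. The names `bounds`, `lo`, `hi`, and `kept` are introduced here for illustration.

```python
import pandas as pd

filters = {
    "Pulse": (40, 250),
    "Respiratory rate": (1, 30),
    "Pain": (0, 15),
    "Temperature": (90, 107),
}

toy = pd.DataFrame({
    "Measurement": ["Pulse", "Pulse", "Pain", "Temperature", "Weight"],
    "Result numeric": [74.0, 300.0, 3.0, 85.0, 180.0],
})

# One row of (lo, hi) bounds per measurement type.
bounds = pd.DataFrame.from_dict(filters, orient="index", columns=["lo", "hi"])

# Look up the bounds for every row; types without a filter get NaN bounds.
lo = toy["Measurement"].map(bounds["lo"])
hi = toy["Measurement"].map(bounds["hi"])

# Comparisons against NaN are False, so unfiltered types are dropped,
# matching the loop, which only keeps the four listed categories.
in_range = toy["Result numeric"].between(lo, hi)
kept = toy[in_range].reset_index(drop=True)
print(kept)
```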

In [20]:
# Display the final filtered dataframe
filtered_data

Unnamed: 0,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric
0,1,59.124538,2003-05-21 00:27:01,Pulse,74.000000
1,1,61.907156,2006-03-03 01:34:22,Pulse,51.000000
2,1,67.698135,2011-12-18 16:02:35,Pulse,77.000000
3,1,68.304132,2012-07-27 03:50:40,Pulse,45.000000
4,1,68.344198,2012-08-10 19:18:08,Pulse,50.000000
...,...,...,...,...,...
79334689,99999,96.139480,2013-01-19 23:31:04,Temperature,95.092719
79334690,99999,96.188334,2013-02-06 20:03:44,Temperature,91.271475
79334691,99999,96.314751,2013-03-25 00:59:32,Temperature,102.000000
79334692,99999,96.326070,2013-03-29 04:17:14,Temperature,93.000000


In [21]:
# Check minimum and maximum values for each category in 'Measurement' column
min_max_values = filtered_data.groupby('Measurement')['Result numeric'].agg(['min', 'max'])
min_max_values

Unnamed: 0_level_0,min,max
Measurement,Unnamed: 1_level_1,Unnamed: 2_level_1
Pain,0.0,14.0
Pulse,40.0,250.0
Respiratory rate,1.0,30.0
Temperature,90.0,107.0


### **Keeping the Latest 'Measurement date' per Patient and Measurement**

In [22]:
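# Latest measurement date for each patient and measurement type
latest_dates = filtered_data.groupby(['Internalpatientid', 'Measurement'], as_index=False)['Measurement date'].max()
# Join back to the filtered data to recover the age and result recorded at that date
df_group = pd.merge(latest_dates, filtered_data, on=['Internalpatientid', 'Measurement date', 'Measurement'], how='left')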

In [23]:
df_group

Unnamed: 0,Internalpatientid,Measurement,Measurement date,Age at measurement,Result numeric
0,1,Pain,2024-06-25 02:06:00,80.208174,0.000000
1,1,Pulse,2024-06-25 02:06:00,80.208174,61.000000
2,1,Respiratory rate,2024-06-25 02:06:00,80.208174,17.000000
3,1,Temperature,2024-04-03 14:19:10,79.982481,102.765598
4,2,Pain,2024-04-05 06:54:25,69.530199,0.000000
...,...,...,...,...,...
526644,169063,Temperature,2004-06-01 18:35:35,76.811969,97.026778
526645,169064,Pain,2014-11-04 07:55:08,87.877784,5.000000
526646,169064,Pulse,2014-11-19 22:14:38,87.920456,81.000000
526647,169064,Respiratory rate,2014-11-04 07:55:08,87.877784,21.000000
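
Note that a merge on the maximum date keeps every row that shares that latest timestamp, so a patient can still have more than one row per measurement type. As a hedged alternative, the sketch below sorts by date and keeps the last row per group, which guarantees a single row. It runs on toy data, and the tie between the two `2012-08-10` rows is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Internalpatientid": [1, 1, 1, 2],
    "Measurement": ["Pulse", "Pulse", "Pulse", "Pulse"],
    "Measurement date": pd.to_datetime(
        ["2003-05-21", "2012-08-10", "2012-08-10", "2011-01-01"]
    ),
    "Result numeric": [74.0, 50.0, 52.0, 80.0],
})

latest = (
    toy.sort_values("Measurement date", kind="stable")  # ties keep input order
       .drop_duplicates(["Internalpatientid", "Measurement"], keep="last")
       .sort_values(["Internalpatientid", "Measurement"])
       .reset_index(drop=True)
)
print(latest)
```

With a stable sort, the row that appears last in the input wins a tie, which is the same choice `aggfunc='last'` makes in a pivot.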


In [24]:
# Check minimum and maximum values for each category in 'Measurement' column
min_max_values = df_group.groupby('Measurement')['Result numeric'].agg(['min', 'max'])
min_max_values

Unnamed: 0_level_0,min,max
Measurement,Unnamed: 1_level_1,Unnamed: 2_level_1
Pain,0.0,10.0
Pulse,40.0,220.0
Respiratory rate,1.0,30.0
Temperature,90.0,107.0


## **Creating Pivot Table**

In [25]:
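# Keep the four measurement types of interest
categories = ['Pain', 'Pulse', 'Respiratory rate', 'Temperature']
filtered_df = df_group[df_group['Measurement'].isin(categories)]
# One row per patient, one column per measurement type; 'last' resolves any remaining duplicates
pivot_table = filtered_df.pivot_table(index='Internalpatientid', columns='Measurement', values='Result numeric', aggfunc='last')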

In [26]:
# Print the pivot table
pivot_table

Measurement,Pain,Pulse,Respiratory rate,Temperature
Internalpatientid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0.0,61.0,17.0,102.765598
2,0.0,91.0,19.0,92.922517
3,0.0,52.0,19.0,93.886986
4,0.0,69.0,20.0,94.900520
5,0.0,100.0,28.0,100.819041
...,...,...,...,...
169060,0.0,76.0,17.0,99.637162
169061,0.0,51.0,17.0,97.737097
169062,0.0,86.0,21.0,96.000000
169063,0.0,69.0,19.0,97.026778
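
To make the reshaping concrete, here is a small sketch of the same `pivot_table` call on toy data. It shows how `aggfunc='last'` picks one value when a patient has several rows for a type, and how a patient with a missing type ends up with `NaN`, which a later `dropna` would remove. The values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Internalpatientid": [1, 1, 1, 2],
    "Measurement": ["Pulse", "Pulse", "Pain", "Pulse"],
    "Result numeric": [50.0, 52.0, 0.0, 80.0],
})

# Long to wide: one row per patient, one column per measurement type.
wide = toy.pivot_table(
    index="Internalpatientid",
    columns="Measurement",
    values="Result numeric",
    aggfunc="last",
)
print(wide)

# Patient 2 has no 'Pain' value, so dropping incomplete rows removes them.
complete = wide.dropna(axis=0)
print(complete)
```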


### **Checking Pivot Table Missing Values**

In [27]:
pivot_table.isnull().sum()

Measurement
Pain                 578
Pulse                 65
Respiratory rate    1091
Temperature         1373
dtype: int64

In [28]:
# Keep only patients that have a value for all four measurement types
pivot_table.dropna(axis=0, inplace=True)

In [29]:
pivot_table

Measurement,Pain,Pulse,Respiratory rate,Temperature
Internalpatientid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0.0,61.0,17.0,102.765598
2,0.0,91.0,19.0,92.922517
3,0.0,52.0,19.0,93.886986
4,0.0,69.0,20.0,94.900520
5,0.0,100.0,28.0,100.819041
...,...,...,...,...
169060,0.0,76.0,17.0,99.637162
169061,0.0,51.0,17.0,97.737097
169062,0.0,86.0,21.0,96.000000
169063,0.0,69.0,19.0,97.026778


### **Saving Measurements file**

In [30]:
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574/Best_Files'

In [31]:
pivot_table.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_train/Potential_files_train/df_measurements_pivot_train_v1.csv')
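
Because the patient ID is the index of the pivot table, `to_csv` writes it as the first column. The sketch below, which uses a temporary directory rather than the real output path, shows one way to read such a file back with the ID restored as the index. The toy values are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

wide = pd.DataFrame(
    {"Pain": [0.0], "Pulse": [61.0]},
    index=pd.Index([1], name="Internalpatientid"),
)

with tempfile.TemporaryDirectory() as tmp:
    # Temporary location used only for this sketch.
    out_path = Path(tmp) / "df_measurements_pivot_train_v1.csv"
    wide.to_csv(out_path)
    # Restore the patient ID as the index when loading.
    restored = pd.read_csv(out_path, index_col="Internalpatientid")

print(restored)
```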

-----