---
# **Measurements Test File**
---

## **Dataset Description:**

- **Internalpatientid:** An internal identifier for each patient, likely unique to each individual in the dataset.
- **Age at measurement:** The patient's age at the time of the measurement, stored as a decimal number (for example, 52.629598).
- **Measurement date:** The date and time at which the measurement was taken (for example, 2014-05-15 02:50:03).
- **Measurement:** The type of measurement taken, such as pulse, weight, blood pressure, respiratory rate, or pain level.
- **Result numeric:** The numeric value of the measurement.
- **Result textual:** The measurement result as text. Only blood pressure readings use this column (for example, 128/60).
- **State:** The state (geographical location) where the measurement was recorded.
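
As an illustration of how `Result numeric` and `Result textual` relate, the sketch below splits a blood pressure string such as `128/60` into two numeric columns. The sample rows and the column names `Systolic` and `Diastolic` are assumptions made for this example; the notebook itself drops `Result textual` and does not perform this step.

```python
import pandas as pd

# Hypothetical sample rows mirroring the columns described above.
sample = pd.DataFrame({
    "Measurement": ["Blood pressure", "Pulse"],
    "Result numeric": [None, 106.0],
    "Result textual": ["128/60", None],
})

# Capture the two numbers around the slash; rows that do not match yield NaN.
# "Systolic" and "Diastolic" are illustrative names, not columns of the dataset.
bp = sample["Result textual"].str.extract(r"^(?P<Systolic>\d+)/(?P<Diastolic>\d+)$")
bp = bp.apply(pd.to_numeric)  # convert the captured digit strings to numbers

sample = sample.join(bp)
print(sample[["Measurement", "Systolic", "Diastolic"]])
```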


### Azure notebook Setup

In [1]:
# Dataset.Tabular is a class attribute that provides access to the TabularDatasetFactory
# methods for creating new TabularDataset objects.
# Usage: Dataset.Tabular.from_delimited_files()
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b'  # subscription ID, available from the Azure ML studio
resource_group = 'VChamp-Team3'  # resource group name
workspace_name = 'vchamp-team3'  # workspace name


# Storage account: Algorithmia, resource group: VChamp-Team3, workspace: vchamp-team3.
# Construct the Workspace object
workspace = Workspace(subscription_id, resource_group, workspace_name)

In [2]:
# Select the test datastore (the other datastore name noted here is 'data_team3_synthetic_train')
datastore = workspace.datastores['data_team3_synthetic_test'] 

In [3]:
# from_delimited_files creates a TabularDataset to represent tabular data in delimited files (e.g. CSV and TSV).

dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'measurements_test.csv')])

# Preview the first 3 rows of the dataset
# dataset.take(3).to_pandas_dataframe()

In [4]:
# Convert the Azure TabularDataset into a pandas DataFrame, the format needed for the analysis below
measurements_test_data= dataset.to_pandas_dataframe()

In [5]:
type(measurements_test_data)

pandas.core.frame.DataFrame

In [6]:
measurements_test_data.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric,Result textual,State
0,88,100,52.629598,2014-05-15 02:50:03,Respiratory rate,15.0,,New York
1,89,100,52.629598,2014-05-15 02:50:03,Blood pressure,,128/60,New York
2,90,100,52.629598,2014-05-15 02:50:03,Pain,0.0,,New York
3,91,100,52.629598,2014-05-15 02:50:03,Temperature,94.879216,,New York
4,92,100,52.629598,2014-05-15 02:50:03,Pulse,106.0,,New York


## **Importing Libraries**

In [7]:
# Importing essential libraries
import pandas as pd        # Library for data manipulation and analysis
import numpy as np         # Library for mathematical operations

## **Data Exploration**

In [8]:
# Use a shorter name for the DataFrame (this is an alias, not a copy)
df = measurements_test_data

In [9]:
# Display the first few rows of a DataFrame
df.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric,Result textual,State
0,88,100,52.629598,2014-05-15 02:50:03,Respiratory rate,15.0,,New York
1,89,100,52.629598,2014-05-15 02:50:03,Blood pressure,,128/60,New York
2,90,100,52.629598,2014-05-15 02:50:03,Pain,0.0,,New York
3,91,100,52.629598,2014-05-15 02:50:03,Temperature,94.879216,,New York
4,92,100,52.629598,2014-05-15 02:50:03,Pulse,106.0,,New York


In [10]:
# Shape of the dataset
num_rows, num_cols = df.shape  # (number of rows, number of columns)
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 33598003
Number of columns: 8


In [11]:
# Get the number of unique values in the 'Internalpatientid' column
print("Number of Unique Internalpatientid")
df['Internalpatientid'].nunique()

Number of Unique Internalpatientid


34593

In [12]:
# Drop the columns that are not used in the rest of the analysis
df.drop(['Column1', 'Result textual', 'State'], axis=1, inplace=True)

In [13]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33598003 entries, 0 to 33598002
Data columns (total 5 columns):
 #   Column              Dtype         
---  ------              -----         
 0   Internalpatientid   int64         
 1   Age at measurement  float64       
 2   Measurement date    datetime64[ns]
 3   Measurement         object        
 4   Result numeric      float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 1.3+ GB


## **Checking Missing Values**

In [14]:
# Count the number of missing values in each column
df.isnull().sum()

Internalpatientid           0
Age at measurement          0
Measurement date            0
Measurement                 0
Result numeric        6170986
dtype: int64
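
One way to see where the missing `Result numeric` values come from is to count them per measurement type. The preview rows suggest that blood pressure readings store their value in `Result textual` and leave `Result numeric` empty, though that has not been verified on the full dataset here. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical miniature of the measurements table.
toy = pd.DataFrame({
    "Measurement": ["Pulse", "Blood pressure", "Pain", "Blood pressure"],
    "Result numeric": [106.0, None, 0.0, None],
})

# Flag missing numeric results, then count the flags per measurement type.
missing_by_type = (
    toy["Result numeric"].isna()
    .groupby(toy["Measurement"])
    .sum()
    .sort_values(ascending=False)
)
print(missing_by_type)
```

Running the same grouping on `df` would show whether every missing value belongs to a single measurement type before the rows are dropped.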

## **Removing NaN values**

In [15]:
# Keep only the rows that have a numeric result
df = df.dropna(subset=['Result numeric'])

In [16]:
# Confirm that the number of unique patients is unchanged after dropping rows
df['Internalpatientid'].nunique()

34593

In [17]:
df.isnull().sum()

Internalpatientid     0
Age at measurement    0
Measurement date      0
Measurement           0
Result numeric        0
dtype: int64

### **Removing Outliers**

In [18]:
# Define the filter conditions for each category
filters = {
    'Pulse': (40, 250),
    'Respiratory rate': (1, 30),
    'Pain': (0, 15),
    'Temperature': (90, 107)
}

# Apply the inclusive [min, max] filter for each category.
# Measurement types that are not listed in `filters` are dropped.
filtered_parts = []
for category, (min_val, max_val) in filters.items():
    category_data = df[df['Measurement'] == category]
    in_range = category_data['Result numeric'].between(min_val, max_val)
    filtered_parts.append(category_data[in_range])
filtered_data = pd.concat(filtered_parts)

In [19]:
# Reset the index of the filtered data
filtered_data = filtered_data.reset_index(drop=True)

In [20]:
# Display the final filtered dataframe
filtered_data

Unnamed: 0,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric
0,100,52.629598,2014-05-15 02:50:03,Pulse,106.000000
1,100,53.385242,2015-02-15 07:20:28,Pulse,108.000000
2,100,53.769035,2015-07-05 13:58:43,Pulse,108.000000
3,100,54.299602,2016-01-15 12:06:42,Pulse,89.000000
4,100,54.796701,2016-07-15 04:39:45,Pulse,88.000000
...,...,...,...,...,...
21253489,99997,78.577741,2007-09-18 03:56:32,Temperature,93.000000
21253490,99997,83.001798,2012-02-21 03:46:34,Temperature,102.183060
21253491,99997,84.397668,2013-07-15 08:20:31,Temperature,100.000000
21253492,99997,84.996155,2014-02-19 02:16:29,Temperature,98.306201


In [21]:
# Check minimum and maximum values for each category in 'Measurement' column
min_max_values = filtered_data.groupby('Measurement')['Result numeric'].agg(['min', 'max'])
min_max_values

Measurement,min,max
Pain,0.0,14.0
Pulse,40.0,250.0
Respiratory rate,1.0,30.0
Temperature,90.0,107.0
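
The per-category range filter can also be written without a loop by looking up each row's bounds from the same dictionary. The sketch below is an alternative formulation on hypothetical toy data, not the notebook's own method; `toy`, `lower`, `upper`, and `filtered_toy` are names introduced for illustration. Unlike the loop, it keeps rows in their original order instead of grouping them by category.

```python
import pandas as pd

# Same inclusive bounds as the notebook's `filters` dictionary.
filters = {
    "Pulse": (40, 250),
    "Respiratory rate": (1, 30),
    "Pain": (0, 15),
    "Temperature": (90, 107),
}

# Hypothetical toy data standing in for `df`.
toy = pd.DataFrame({
    "Measurement": ["Pulse", "Pulse", "Pain", "Temperature", "Weight"],
    "Result numeric": [106.0, 300.0, 3.0, 85.0, 180.0],
})

# Look up each row's bounds; types without an entry in `filters` get NaN.
lower = toy["Measurement"].map({name: lo for name, (lo, hi) in filters.items()})
upper = toy["Measurement"].map({name: hi for name, (lo, hi) in filters.items()})

# Comparisons against NaN are False, so unlisted types are dropped,
# which matches what the loop does.
mask = (toy["Result numeric"] >= lower) & (toy["Result numeric"] <= upper)
filtered_toy = toy[mask].reset_index(drop=True)
print(filtered_toy)
```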


### **Keeping the Latest 'Measurement date' per Patient and Measurement**

In [22]:
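# For each patient and measurement type, find the most recent 'Measurement date',
# then join back to filtered_data to recover the matching rows.
latest_dates = filtered_data.groupby(['Internalpatientid', 'Measurement'], as_index=False)['Measurement date'].max()
df_group = pd.merge(latest_dates, filtered_data, on=['Internalpatientid', 'Measurement date', 'Measurement'], how='left')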

In [23]:
df_group

Unnamed: 0,Internalpatientid,Measurement,Measurement date,Age at measurement,Result numeric
0,6,Pain,2014-01-18 14:24:43,87.358688,0.000000
1,6,Pulse,2014-01-18 14:24:43,87.358688,51.000000
2,6,Respiratory rate,2014-01-18 14:24:43,87.358688,16.000000
3,6,Temperature,2014-01-18 14:24:43,87.358688,97.030389
4,7,Pain,2020-10-19 02:58:45,74.276673,0.000000
...,...,...,...,...,...
137681,169059,Temperature,2013-11-21 14:10:52,90.622684,97.727853
137682,169065,Pain,2011-06-11 13:04:26,53.320232,5.000000
137683,169065,Pulse,2011-06-10 22:14:45,53.318541,109.000000
137684,169065,Respiratory rate,2011-06-10 22:14:45,53.318541,20.000000


In [24]:
# Check minimum and maximum values for each category in 'Measurement' column
min_max_values = df_group.groupby('Measurement')['Result numeric'].agg(['min', 'max'])
min_max_values

Measurement,min,max
Pain,0.0,10.0
Pulse,40.0,226.0
Respiratory rate,1.0,30.0
Temperature,90.0,107.0
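
The "latest row per patient and measurement" step can also be expressed as a sort followed by `drop_duplicates`. The sketch below uses hypothetical toy data and is an alternative to the group-and-merge approach, not a replacement the notebook relies on. One difference worth noting: if a patient has two rows with the same measurement type and the same timestamp, the merge returns both rows, whereas this version keeps exactly one.

```python
import pandas as pd

# Hypothetical toy data standing in for `filtered_data`.
toy = pd.DataFrame({
    "Internalpatientid": [1, 1, 1, 2],
    "Measurement": ["Pulse", "Pulse", "Pain", "Pulse"],
    "Measurement date": pd.to_datetime([
        "2014-05-15 02:50:03", "2016-07-15 04:39:45",
        "2015-02-15 07:20:28", "2011-06-10 22:14:45",
    ]),
    "Result numeric": [106.0, 88.0, 0.0, 109.0],
})

# Sort by date (stable, so tied timestamps keep their original order),
# then keep the last row of each (patient, measurement) pair.
latest = (
    toy.sort_values("Measurement date", kind="stable")
       .drop_duplicates(["Internalpatientid", "Measurement"], keep="last")
       .sort_values(["Internalpatientid", "Measurement"])
       .reset_index(drop=True)
)
print(latest)
```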


## **Creating Pivot Table**

In [25]:
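# The four measurement types kept by the outlier filter
categories = ['Pain', 'Pulse', 'Respiratory rate', 'Temperature']
filtered_df = df_group[df_group['Measurement'].isin(categories)]
# One row per patient and one column per measurement type; aggfunc='last' keeps the
# last row when a patient has several rows for the same measurement type
pivot_table = filtered_df.pivot_table(index='Internalpatientid', columns='Measurement', values='Result numeric', aggfunc='last')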

In [26]:
# Print the pivot table
pivot_table

Internalpatientid,Pain,Pulse,Respiratory rate,Temperature
6,0.0,51.0,16.0,97.030389
7,0.0,67.0,19.0,97.567325
9,0.0,91.0,21.0,98.600910
12,3.0,64.0,24.0,97.693046
17,0.0,63.0,25.0,101.122904
...,...,...,...,...
169037,0.0,65.0,14.0,95.063409
169045,0.0,58.0,17.0,92.387524
169058,5.0,84.0,18.0,93.916635
169059,3.0,66.0,20.0,97.727853
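
To make the reshaping concrete, the sketch below runs the same `pivot_table` call on a few hypothetical long-format rows (`long_df` and `wide` are names introduced for this example). Patient 7 has two pulse rows, so `aggfunc='last'` keeps the later one in row order, and patient 9 has no pain row, so that cell comes out as NaN.

```python
import pandas as pd

# Hypothetical long-format rows: one row per recorded measurement.
long_df = pd.DataFrame({
    "Internalpatientid": [6, 6, 7, 7, 7, 9],
    "Measurement": ["Pain", "Pulse", "Pain", "Pulse", "Pulse", "Pulse"],
    "Result numeric": [0.0, 51.0, 0.0, 70.0, 67.0, 91.0],
})

# Reshape to one row per patient and one column per measurement type.
wide = long_df.pivot_table(
    index="Internalpatientid",
    columns="Measurement",
    values="Result numeric",
    aggfunc="last",
)
print(wide)
```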


### **Checking Pivot Table Missing Values**

In [27]:
pivot_table.isnull().sum()

Measurement
Pain                146
Pulse                22
Respiratory rate    310
Temperature         340
dtype: int64

In [28]:
# Keep only the patients that have a value for all four measurements
pivot_table.dropna(axis=0, inplace=True)

In [29]:
pivot_table

Internalpatientid,Pain,Pulse,Respiratory rate,Temperature
6,0.0,51.0,16.0,97.030389
7,0.0,67.0,19.0,97.567325
9,0.0,91.0,21.0,98.600910
12,3.0,64.0,24.0,97.693046
17,0.0,63.0,25.0,101.122904
...,...,...,...,...
169037,0.0,65.0,14.0,95.063409
169045,0.0,58.0,17.0,92.387524
169058,5.0,84.0,18.0,93.916635
169059,3.0,66.0,20.0,97.727853


### **Saving Measurements file**

In [30]:
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574/Best_Files'

In [31]:
# Save the pivot table as a CSV file; Internalpatientid is written as the index column
pivot_table.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_test/Potential_files_test/df_measurements_pivot_test_v1.csv')
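
When the saved file is read back later, the patient identifier has to be restored as the index. The sketch below shows a round trip on a hypothetical stand-in for `pivot_table`; the temporary directory is used only so the example runs anywhere, whereas the notebook writes to a fixed path on the compute instance.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the notebook's `pivot_table`.
toy_pivot = pd.DataFrame(
    {"Pain": [0.0, 3.0], "Pulse": [51.0, 64.0]},
    index=pd.Index([6, 12], name="Internalpatientid"),
)

# Build the output path from its parts instead of hard-coding one long string.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "df_measurements_pivot_test_v1.csv")
toy_pivot.to_csv(out_path)

# Read the file back, restoring the patient id as the index.
restored = pd.read_csv(out_path, index_col="Internalpatientid")
print(restored)
```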

-----