---
# **Measurements Quality File**
---

## **Dataset Description:**

- **Internalpatientid:** Internal identifier for each patient, presumably unique to each individual in the dataset.
- **Age at measurement:** The patient's age at the time of the measurement, stored as a numeric value.
- **Measurement date:** The date and time at which the measurement was taken (for example, `2002-05-15 18:48:46`).
- **Measurement:** The type of measurement taken, such as pulse, weight, blood pressure, respiratory rate, or pain level.
- **Result numeric:** The numeric value of the result for the given measurement type.
- **Result textual:** The result in textual form. It is populated only for blood pressure readings (for example, `135/81`).
- **State:** The state (geographical location) in which the measurement was recorded.
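
As an illustration of how the two result columns relate, the sketch below splits a textual blood pressure reading into two numeric values. The toy rows and the `Systolic`/`Diastolic` column names are assumptions made for this example; the notebook itself does not perform this step and later drops `Result textual`.

```python
import pandas as pd

# Toy rows shaped like the dataset; values are illustrative, not read from the file.
sample = pd.DataFrame({
    "Measurement": ["Pulse", "Blood pressure", "Blood pressure"],
    "Result numeric": [86.0, None, None],
    "Result textual": [None, "135/81", "120/70"],
})

# Select the textual blood pressure readings only.
bp = sample.loc[sample["Measurement"] == "Blood pressure", "Result textual"]

# Split "systolic/diastolic" into two numeric columns (hypothetical column names).
parts = bp.str.extract(r"^\s*(\d+)\s*/\s*(\d+)\s*$").astype(float)
parts.columns = ["Systolic", "Diastolic"]
print(parts)
```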


### Azure notebook Setup

In [1]:
# Dataset.Tabular is a class attribute that exposes the TabularDatasetFactory methods
# for creating new TabularDataset objects.
# Usage: Dataset.Tabular.from_delimited_files().
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b'  # subscription ID, found in Azure ML studio
resource_group = 'VChamp-Team3'  # resource group name
workspace_name = 'vchamp-team3'  # workspace name


# Storage account: Algorithmia, resource group: VChamp-Team3, workspace: vchamp-team3.
# Connect to the workspace
workspace = Workspace(subscription_id, resource_group, workspace_name)
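
Hard-coding the subscription ID in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the three settings from environment variables. The variable names (`AZUREML_SUBSCRIPTION_ID` and so on) and the helper `load_workspace_settings` are hypothetical and not part of the Azure ML SDK.

```python
import os

# Hypothetical environment variable names; adjust to your own convention.
ENV_KEYS = {
    "subscription_id": "AZUREML_SUBSCRIPTION_ID",
    "resource_group": "AZUREML_RESOURCE_GROUP",
    "workspace_name": "AZUREML_WORKSPACE_NAME",
}

def load_workspace_settings(env=None):
    """Return the three workspace settings, read from environment variables."""
    env = os.environ if env is None else env
    missing = [var for var in ENV_KEYS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in ENV_KEYS.items()}

# Example with an explicit mapping standing in for os.environ
settings = load_workspace_settings({
    "AZUREML_SUBSCRIPTION_ID": "<subscription-id>",
    "AZUREML_RESOURCE_GROUP": "VChamp-Team3",
    "AZUREML_WORKSPACE_NAME": "vchamp-team3",
})
print(settings["workspace_name"])
```

Because the dictionary keys match the parameter names used in the cell above, the result could be passed on as `Workspace(**settings)`.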

In [2]:
#['data_team3_synthetic_train']
datastore = workspace.datastores['data_team3_synthetic_quality_check'] 

In [3]:
# from_delimited_files creates a TabularDataset that represents tabular data in delimited files (e.g. CSV and TSV).

dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'measurements_qual.csv')])

# Preview the first 3 rows of the dataset
# dataset.take(3).to_pandas_dataframe()

In [4]:
# Convert the Azure TabularDataset into a pandas DataFrame, the format needed for the analysis below
measurements_qual_data = dataset.to_pandas_dataframe()

In [5]:
type(measurements_qual_data)

pandas.core.frame.DataFrame

In [7]:
measurements_qual_data.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric,Result textual,State
0,581,100012,52.872186,2002-05-15 18:48:46,Pulse,86.0,,New Mexico
1,582,100012,52.872186,2002-05-15 18:48:46,Weight,387.467147,,New Mexico
2,583,100012,52.872186,2002-05-15 18:48:46,Blood pressure,,135/81,New Mexico
3,584,100012,52.872186,2002-05-15 18:48:46,Respiratory rate,20.0,,New Mexico
4,585,100012,52.872186,2002-05-15 18:48:46,Pain,2.0,,New Mexico


## **Importing Libraries**

In [8]:
# Importing essential libraries
import pandas as pd        # Library for data manipulation and analysis
import numpy as np         # Library for mathematical operations

## **Data Exploration**

In [9]:
# Use a shorter name for the DataFrame (this is an alias, not a copy)
df = measurements_qual_data

In [10]:
# Display the first few rows of a DataFrame
df.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric,Result textual,State
0,581,100012,52.872186,2002-05-15 18:48:46,Pulse,86.0,,New Mexico
1,582,100012,52.872186,2002-05-15 18:48:46,Weight,387.467147,,New Mexico
2,583,100012,52.872186,2002-05-15 18:48:46,Blood pressure,,135/81,New Mexico
3,584,100012,52.872186,2002-05-15 18:48:46,Respiratory rate,20.0,,New Mexico
4,585,100012,52.872186,2002-05-15 18:48:46,Pain,2.0,,New Mexico


In [11]:
# Shape of the dataset: (number of rows, number of columns)
num_rows, num_cols = df.shape
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 1070001
Number of columns: 8


In [12]:
# Get the number of unique values in the 'Internalpatientid' column
print("Number of Unique Internalpatientid")
df['Internalpatientid'].nunique()

Number of Unique Internalpatientid


994

In [13]:
# Drop the columns that are not used in the rest of the analysis
df.drop(['Column1', 'Result textual', 'State'], axis=1, inplace=True)

In [14]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1070001 entries, 0 to 1070000
Data columns (total 5 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   Internalpatientid   1070001 non-null  int64         
 1   Age at measurement  1070001 non-null  float64       
 2   Measurement date    1070001 non-null  datetime64[ns]
 3   Measurement         1070001 non-null  object        
 4   Result numeric      870923 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 40.8+ MB


## **Checking Missing Values**

In [15]:
# Count the number of missing values in each column
df.isnull().sum()

Internalpatientid          0
Age at measurement         0
Measurement date           0
Measurement                0
Result numeric        199078
dtype: int64
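
To see which measurement types account for the missing numeric results, the null counts can be broken down by `Measurement`. The sketch below uses a small toy frame with illustrative values rather than the real data. In the preview rows earlier, the blood pressure reading is the one with an empty `Result numeric`, so it is used as the missing case here.

```python
import pandas as pd

# Toy frame mirroring two of the notebook's columns; values are illustrative only.
toy = pd.DataFrame({
    "Measurement": ["Pulse", "Blood pressure", "Pain", "Blood pressure"],
    "Result numeric": [86.0, None, 2.0, None],
})

# Count missing numeric results per measurement type.
missing_by_type = (
    toy["Result numeric"].isna()
    .groupby(toy["Measurement"])
    .sum()
    .sort_values(ascending=False)
)
print(missing_by_type)
```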

## **Removing NaN values**

In [16]:
# Keep only the rows that have a numeric result
df = df.dropna(subset=['Result numeric'])

In [17]:
# Confirm that no patient was lost by dropping rows
df.Internalpatientid.nunique()

994

In [18]:
# Confirm that no missing values remain
df.isnull().sum()

Internalpatientid     0
Age at measurement    0
Measurement date      0
Measurement           0
Result numeric        0
dtype: int64

### **Removing Outliers**

In [19]:
# Accepted (min, max) range for each measurement type; both bounds are inclusive.
# Measurement types not listed here (e.g. Weight) are left out of the result.
filters = {
    'Pulse': (40, 250),
    'Respiratory rate': (1, 30),
    'Pain': (0, 15),
    'Temperature': (90, 107)
}

# For each measurement type, keep only the rows whose value lies within its range
filtered_parts = []
for category, (min_val, max_val) in filters.items():
    category_data = df[df['Measurement'] == category]
    in_range = category_data['Result numeric'].between(min_val, max_val)
    filtered_parts.append(category_data[in_range])
filtered_data = pd.concat(filtered_parts)
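
The same range filter can also be written without a loop by looking up each row's bounds from the `filters` dictionary. This is a sketch on toy data, not the notebook's method. Unlike the loop above, it keeps rows in their original order instead of grouping them by measurement type.

```python
import pandas as pd

filters = {
    "Pulse": (40, 250),
    "Respiratory rate": (1, 30),
    "Pain": (0, 15),
    "Temperature": (90, 107),
}

# Toy data with illustrative values: one out-of-range pulse, one out-of-range
# temperature, and one measurement type that has no filter.
toy = pd.DataFrame({
    "Measurement": ["Pulse", "Pulse", "Pain", "Temperature", "Weight"],
    "Result numeric": [86.0, 300.0, 2.0, 120.0, 180.0],
})

# Look up the bounds for each row; types without a filter map to NaN.
lo = toy["Measurement"].map({name: bounds[0] for name, bounds in filters.items()})
hi = toy["Measurement"].map({name: bounds[1] for name, bounds in filters.items()})

# Comparisons against NaN are False, so unlisted types are dropped as well.
mask = (toy["Result numeric"] >= lo) & (toy["Result numeric"] <= hi)
filtered = toy[mask].reset_index(drop=True)
print(filtered)
```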

In [20]:
# Reset the index of the filtered data
filtered_data = filtered_data.reset_index(drop=True)

In [21]:
# Display the final filtered dataframe
filtered_data

Unnamed: 0,Internalpatientid,Age at measurement,Measurement date,Measurement,Result numeric
0,100012,52.872186,2002-05-15 18:48:46,Pulse,86.000000
1,100012,53.280554,2002-10-12 01:01:26,Pulse,93.000000
2,100012,53.512267,2003-01-04 17:36:15,Pulse,86.000000
3,100012,54.815596,2004-04-25 02:24:39,Pulse,86.000000
4,100012,54.894050,2004-05-23 18:36:16,Pulse,92.000000
...,...,...,...,...,...
681993,99941,73.344576,2015-11-28 10:37:14,Temperature,94.021018
681994,99941,73.350285,2015-11-30 12:42:13,Temperature,95.279459
681995,99941,73.369421,2015-12-07 12:33:33,Temperature,101.981426
681996,99941,73.377264,2015-12-10 09:21:42,Temperature,97.391722


In [22]:
# Check minimum and maximum values for each category in 'Measurement' column
min_max_values = filtered_data.groupby('Measurement')['Result numeric'].agg(['min', 'max'])
min_max_values

Measurement,min,max
Pain,0.0,10.0
Pulse,40.0,231.0
Respiratory rate,1.0,30.0
Temperature,90.0,107.0


### **Keeping the Latest 'Measurement date' per Patient and Measurement**

In [23]:
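# Find the latest measurement date per patient and measurement type, then join back to recover age and result
latest_dates = filtered_data.groupby(['Internalpatientid', 'Measurement'], as_index=False)['Measurement date'].max()
df_group = pd.merge(latest_dates, filtered_data, on=['Internalpatientid', 'Measurement date', 'Measurement'], how='left')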
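
An alternative way to keep the most recent row per patient and measurement type is to sort by date and drop duplicates. The sketch below runs on toy data with illustrative values and is not the notebook's approach. The two differ when several rows share the latest timestamp: the merge above keeps all of them, while `drop_duplicates` keeps exactly one.

```python
import pandas as pd

# Toy data; IDs and values are illustrative.
toy = pd.DataFrame({
    "Internalpatientid": [1, 1, 1, 2],
    "Measurement": ["Pulse", "Pulse", "Pain", "Pulse"],
    "Measurement date": pd.to_datetime([
        "2002-05-15 18:48:46",
        "2004-05-23 18:36:16",
        "2002-05-15 18:48:46",
        "2010-01-01 00:00:00",
    ]),
    "Result numeric": [86.0, 92.0, 2.0, 70.0],
})

# Sort chronologically, then keep the last (latest) row of each group.
latest = (
    toy.sort_values("Measurement date")
    .drop_duplicates(subset=["Internalpatientid", "Measurement"], keep="last")
    .sort_values(["Internalpatientid", "Measurement"])
    .reset_index(drop=True)
)
print(latest)
```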

In [24]:
df_group

Unnamed: 0,Internalpatientid,Measurement,Measurement date,Age at measurement,Result numeric
0,67,Pain,2020-10-18 07:53:28,58.468752,5.000000
1,67,Pulse,2020-10-18 07:53:28,58.468752,68.000000
2,67,Respiratory rate,2020-10-18 07:53:28,58.468752,18.000000
3,67,Temperature,2020-10-18 07:53:28,58.468752,99.304386
4,200,Pain,2023-02-07 11:57:42,87.785221,0.000000
...,...,...,...,...,...
3946,168496,Temperature,2022-09-22 13:07:19,98.600675,95.000000
3947,168899,Pain,2018-02-08 02:41:22,95.280744,0.000000
3948,168899,Pulse,2018-02-08 02:41:22,95.280744,98.000000
3949,168899,Respiratory rate,2018-02-08 02:41:22,95.280744,19.000000


In [25]:
# Check minimum and maximum values for each category in 'Measurement' column
min_max_values = df_group.groupby('Measurement')['Result numeric'].agg(['min', 'max'])
min_max_values

Measurement,min,max
Pain,0.0,10.0
Pulse,40.0,162.0
Respiratory rate,10.0,30.0
Temperature,90.0,106.576364


## **Creating Pivot Table**

In [26]:
# Reshape to one row per patient, with one column per measurement type
categories = ['Pain', 'Pulse', 'Respiratory rate', 'Temperature']
filtered_df = df_group[df_group['Measurement'].isin(categories)]
pivot_table = filtered_df.pivot_table(index='Internalpatientid', columns='Measurement', values='Result numeric', aggfunc='last')
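
To make the reshaping step concrete, here is a minimal sketch of the same `pivot_table` call on a toy long-format frame. The values are illustrative. Where a patient has more than one row for a measurement type, `aggfunc='last'` keeps the value from the last such row in the frame's row order.

```python
import pandas as pd

# Toy long-format data: patient 200 has two Pulse rows.
long_df = pd.DataFrame({
    "Internalpatientid": [67, 67, 200, 200, 200],
    "Measurement": ["Pain", "Pulse", "Pain", "Pulse", "Pulse"],
    "Result numeric": [5.0, 68.0, 0.0, 90.0, 92.0],
})

# One row per patient, one column per measurement type.
wide = long_df.pivot_table(
    index="Internalpatientid",
    columns="Measurement",
    values="Result numeric",
    aggfunc="last",
)
print(wide)
```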

In [27]:
# Print the pivot table
pivot_table

Internalpatientid,Pain,Pulse,Respiratory rate,Temperature
67,5.0,68.0,18.0,99.304386
200,0.0,92.0,18.0,93.422097
291,0.0,75.0,18.0,98.475425
330,2.0,67.0,16.0,95.141939
351,0.0,73.0,19.0,95.922920
...,...,...,...,...
167907,0.0,91.0,19.0,97.000000
167917,0.0,84.0,18.0,98.130331
168008,0.0,114.0,16.0,96.112248
168496,0.0,59.0,16.0,95.000000


### **Checking Pivot Table Missing Values**

In [28]:
pivot_table.isnull().sum()

Measurement
Pain                 2
Pulse                1
Respiratory rate     8
Temperature         11
dtype: int64

In [29]:
# Drop patients that are missing any of the four measurements
pivot_table.dropna(axis=0, inplace=True)

In [30]:
pivot_table

Internalpatientid,Pain,Pulse,Respiratory rate,Temperature
67,5.0,68.0,18.0,99.304386
200,0.0,92.0,18.0,93.422097
291,0.0,75.0,18.0,98.475425
330,2.0,67.0,16.0,95.141939
351,0.0,73.0,19.0,95.922920
...,...,...,...,...
167907,0.0,91.0,19.0,97.000000
167917,0.0,84.0,18.0,98.130331
168008,0.0,114.0,16.0,96.112248
168496,0.0,59.0,16.0,95.000000


### **Saving Measurements file**

In [31]:
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574/Best_Files'

In [32]:
# Save the pivot table; the Internalpatientid index is written as the first column
pivot_table.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_quality/df_measurements_pivot_qual_v1.csv')
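
`to_csv` fails if the target folder does not exist, so it can help to build the path with `pathlib` and create the folder first. The sketch below is one way to do that. It writes a toy table to a temporary directory, which stands in for the real mount path, and then reads the file back to check the round trip.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy pivot table with illustrative values.
wide = pd.DataFrame(
    {"Pain": [5.0, 0.0], "Pulse": [68.0, 92.0]},
    index=pd.Index([67, 200], name="Internalpatientid"),
)

# A temporary directory stands in for the real output location (assumption).
output_dir = Path(tempfile.mkdtemp()) / "Output_files_quality"
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "df_measurements_pivot_qual_v1.csv"

# The index (Internalpatientid) is written as the first column.
wide.to_csv(output_path)

# Read the file back, restoring the patient ID as the index.
reloaded = pd.read_csv(output_path, index_col="Internalpatientid")
print(reloaded)
```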

-----