----
# **Measurements Blood Pressure Quality**
----

## **Dataset Description:**

- **Internalpatientid:** The unique identifier for each patient in the dataset.
- **Age at measurement bp:** The age of the patient at the time the blood pressure measurement was taken.
- **Measurement date:** The date and time when the blood pressure measurement was taken.
- **Diastolic bp:** The diastolic blood pressure value, which represents the pressure in the arteries when the heart is at rest between beats.
- **Systolic bp:** The systolic blood pressure value, which represents the pressure in the arteries when the heart is actively pumping blood.
- **State:** The state where the measurement was taken or the patient's residence.
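
As a rough illustration of the schema described above, the sketch below builds a two-row frame with the same column names and checks that none of the described columns is absent. The values are copied from the preview rows shown later in this notebook; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "Internalpatientid",
    "Age at measurement bp",
    "Measurement date",
    "Diastolic bp",
    "Systolic bp",
    "State",
]


def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame` (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]


# Two rows copied from the preview output of this notebook.
sample = pd.DataFrame({
    "Internalpatientid": [100012, 100012],
    "Age at measurement bp": [53.086669, 53.583655],
    "Measurement date": pd.to_datetime(["2002-08-02 04:15:26", "2003-01-30 19:49:29"]),
    "Diastolic bp": [75.0, 87.0],
    "Systolic bp": [137.0, 161.0],
    "State": ["New Mexico", "New Mexico"],
})

print(missing_columns(sample))                        # all described columns present
print(missing_columns(sample.drop(columns="State")))  # 'State' is reported as missing
```

A check like this could run right after loading, before any column is dropped.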

### Azure Notebook Setup

In [1]:
# Dataset.Tabular is a class attribute that provides access to the TabularDatasetFactory
# methods for creating new TabularDataset objects, e.g. Dataset.Tabular.from_delimited_files().
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b' # subscription ID (shown when launching the studio)
resource_group = 'VChamp-Team3' # resource group name
workspace_name = 'vchamp-team3' # workspace name


# Storage account: Algorithmia; resource group: VChamp-Team3; workspace: vchamp-team3.
# Construct the Workspace object
workspace = Workspace(subscription_id, resource_group, workspace_name)

In [2]:
#['data_team3_synthetic_train']
datastore = workspace.datastores['data_team3_synthetic_quality_check']

In [3]:
# from_delimited_files creates a TabularDataset that represents tabular data in delimited files (e.g. CSV and TSV).
dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'measurements_blood_pressure_qual.csv')])

# Preview the first 3 rows of the dataset
# dataset.take(3).to_pandas_dataframe()

In [4]:
# Convert the dataset to a pandas DataFrame (Azure ML loads it as a TabularDataset by default, so it must be converted to the needed format)
measurements_blood_pressure_qual_data = dataset.to_pandas_dataframe()

In [5]:
type(measurements_blood_pressure_qual_data)

pandas.core.frame.DataFrame

In [6]:
measurements_blood_pressure_qual_data.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement bp,Measurement date,Diastolic bp,Systolic bp,State
0,110,100012,53.086669,2002-08-02 04:15:26,75.0,137.0,New Mexico
1,111,100012,53.583655,2003-01-30 19:49:29,87.0,161.0,New Mexico
2,112,100012,53.837326,2003-05-03 13:01:16,77.0,144.0,New Mexico
3,113,100012,53.898581,2003-05-25 22:21:03,73.0,136.0,New Mexico
4,114,100012,54.044102,2003-07-18 02:51:48,68.0,143.0,New Mexico


## **Importing Libraries**

In [7]:
# Importing essential libraries
import pandas as pd        # Library for data manipulation and analysis
import numpy as np         # Library for mathematical operations

## **Data Exploration**

In [8]:
# Bind a shorter name to the same DataFrame (an alias, not a copy)
df = measurements_blood_pressure_qual_data

In [9]:
# Display the first few rows of a DataFrame
df.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement bp,Measurement date,Diastolic bp,Systolic bp,State
0,110,100012,53.086669,2002-08-02 04:15:26,75.0,137.0,New Mexico
1,111,100012,53.583655,2003-01-30 19:49:29,87.0,161.0,New Mexico
2,112,100012,53.837326,2003-05-03 13:01:16,77.0,144.0,New Mexico
3,113,100012,53.898581,2003-05-25 22:21:03,73.0,136.0,New Mexico
4,114,100012,54.044102,2003-07-18 02:51:48,68.0,143.0,New Mexico


In [10]:
# Shape of the dataset
df.shape

num_rows = df.shape[0]  # Number of rows
num_cols = df.shape[1]  # Number of columns

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 192324
Number of columns: 7


In [11]:
# Get the number of unique values in the 'Internalpatientid' column
df['Internalpatientid'].nunique()

993

In [12]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192324 entries, 0 to 192323
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   Column1                192324 non-null  int64         
 1   Internalpatientid      192324 non-null  int64         
 2   Age at measurement bp  192324 non-null  float64       
 3   Measurement date       192324 non-null  datetime64[ns]
 4   Diastolic bp           192324 non-null  float64       
 5   Systolic bp            192324 non-null  float64       
 6   State                  192324 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 10.3+ MB


- The 'Column1' and 'Internalpatientid' columns contain integer values; 'Age at measurement bp', 'Diastolic bp' and 'Systolic bp' are floats; 'Measurement date' is a datetime column; and 'State' is the only object (string) column.
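
The dtype grouping reported by `df.info()` can also be obtained programmatically. The following is a minimal sketch using `DataFrame.select_dtypes` on a two-row stand-in frame built from the preview rows; the variable names (`sample`, `numeric_cols`, `datetime_cols`, `other_cols`) are assumptions made for this example.

```python
import pandas as pd

# Two rows copied from the preview output, with the same columns as the loaded file.
sample = pd.DataFrame({
    "Column1": [110, 111],
    "Internalpatientid": [100012, 100012],
    "Age at measurement bp": [53.086669, 53.583655],
    "Measurement date": pd.to_datetime(["2002-08-02 04:15:26", "2003-01-30 19:49:29"]),
    "Diastolic bp": [75.0, 87.0],
    "Systolic bp": [137.0, 161.0],
    "State": ["New Mexico", "New Mexico"],
})

# Integer and float columns.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Timestamp columns.
datetime_cols = sample.select_dtypes(include="datetime").columns.tolist()
# Everything that is neither numeric nor datetime (here, the text column).
other_cols = sample.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print("numeric :", numeric_cols)
print("datetime:", datetime_cols)
print("other   :", other_cols)
```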

## **Checking Missing Values**

In [13]:
# Count the number of missing values in each column
df.isnull().sum()

Column1                  0
Internalpatientid        0
Age at measurement bp    0
Measurement date         0
Diastolic bp             0
Systolic bp              0
State                    0
dtype: int64

- There are no missing values in this file.

## **Data Preprocessing**

In [14]:
# Bind a more descriptive name to the same DataFrame (an alias, not a copy)
measurement_bp = df

In [15]:
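# Drop the columns that are not needed for the analysis.
# Because measurement_bp and df refer to the same object, inplace=True also changes df.
measurement_bp.drop(['Column1', 'Measurement date', 'State'], axis=1, inplace=True)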

In [16]:
measurement_bp.head()

Unnamed: 0,Internalpatientid,Age at measurement bp,Diastolic bp,Systolic bp
0,100012,53.086669,75.0,137.0
1,100012,53.583655,87.0,161.0
2,100012,53.837326,77.0,144.0
3,100012,53.898581,73.0,136.0
4,100012,54.044102,68.0,143.0
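
Since `measurement_bp = df` binds a second name to the same object rather than copying it, the in-place drop above also affects `df`. The sketch below demonstrates the difference between an alias and a copy on a small stand-in frame; the names `alias` and `independent` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [110, 111],
    "Internalpatientid": [100012, 100012],
    "Systolic bp": [137.0, 161.0],
})

alias = df               # same object, as with `measurement_bp = df`
independent = df.copy()  # separate object with its own data

# An in-place drop through the alias changes the original as well.
alias.drop(["Column1"], axis=1, inplace=True)

print(alias is df)                # True: both names point to one DataFrame
print(list(df.columns))           # 'Column1' is gone from df too
print(list(independent.columns))  # the copy still has 'Column1'
```

If the unmodified data is needed later, taking a `.copy()` before dropping columns is one way to keep it.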


### Checking Minimum and Maximum of the Blood Pressure Columns

In [17]:
# Checking maximum and minimum values for Systolic and Diastolic bp columns
max_systolic_bp = measurement_bp['Systolic bp'].max()
min_systolic_bp = measurement_bp['Systolic bp'].min()

max_diastolic_bp = measurement_bp['Diastolic bp'].max()
min_diastolic_bp = measurement_bp['Diastolic bp'].min()

print("Systolic bp:")
print(f"Maximum value: {max_systolic_bp}")
print(f"Minimum value: {min_systolic_bp}")
print("--------------------")
print("Diastolic bp:")
print(f"Maximum value: {max_diastolic_bp}")
print(f"Minimum value: {min_diastolic_bp}")

Systolic bp:
Maximum value: 293.0
Minimum value: 46.0
--------------------
Diastolic bp:
Maximum value: 192.0
Minimum value: 30.0
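
The same range check can be written more compactly. As a sketch, `DataFrame.agg` returns the minimum and maximum of both columns in a single table; the five rows below are copied from the preview output, so the numbers differ from the full-dataset extremes printed above.

```python
import pandas as pd

# The five preview rows of the two blood pressure columns.
sample = pd.DataFrame({
    "Diastolic bp": [75.0, 87.0, 77.0, 73.0, 68.0],
    "Systolic bp": [137.0, 161.0, 144.0, 136.0, 143.0],
})

# One row per statistic, one column per measurement.
summary = sample[["Systolic bp", "Diastolic bp"]].agg(["min", "max"])
print(summary)
```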


### Removing Outliers in Systolic and Diastolic bp

In [18]:
# Apply the filters for 'Diastolic bp' and 'Systolic bp'
filtered_data = measurement_bp[
    (measurement_bp['Diastolic bp'] >= 50)
    & (measurement_bp['Diastolic bp'] <= 150)
    & (measurement_bp['Systolic bp'] >= 80)
    & (measurement_bp['Systolic bp'] <= 200)
]

# Reset the index of the filtered data
filtered_data_bp = filtered_data.reset_index(drop=True)

# Display the final filtered dataframe
filtered_data_bp

Unnamed: 0,Internalpatientid,Age at measurement bp,Diastolic bp,Systolic bp
0,100012,53.086669,75.0,137.0
1,100012,53.583655,87.0,161.0
2,100012,53.837326,77.0,144.0
3,100012,53.898581,73.0,136.0
4,100012,54.044102,68.0,143.0
...,...,...,...,...
185967,99941,70.999496,71.0,148.0
185968,99941,73.381373,81.0,153.0
185969,99941,73.384256,63.0,120.0
185970,99944,80.132857,63.0,130.0
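
To see how many readings each bound removes, the filter can be split into one mask per column. The sketch below uses `Series.between` (inclusive on both ends by default) with the same limits as the cell above. The first two rows come from the preview output; the three out-of-range rows are invented for illustration and reuse the extreme values reported earlier. `DIASTOLIC_RANGE`, `SYSTOLIC_RANGE` and `readings` are hypothetical names.

```python
import pandas as pd

# Same limits as the filter in this notebook.
DIASTOLIC_RANGE = (50, 150)
SYSTOLIC_RANGE = (80, 200)

readings = pd.DataFrame({
    "Internalpatientid": [100012, 100012, 1, 2, 3],   # ids 1-3 are made up
    "Diastolic bp": [75.0, 87.0, 30.0, 192.0, 70.0],
    "Systolic bp": [137.0, 161.0, 120.0, 150.0, 293.0],
})

# One boolean mask per column; True means the value is inside the range.
diastolic_ok = readings["Diastolic bp"].between(*DIASTOLIC_RANGE)
systolic_ok = readings["Systolic bp"].between(*SYSTOLIC_RANGE)

kept = readings[diastolic_ok & systolic_ok].reset_index(drop=True)

print("rows kept:", len(kept))
print("out of range (diastolic):", int((~diastolic_ok).sum()))
print("out of range (systolic):", int((~systolic_ok).sum()))
```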


In [19]:
# Re-check the minimum and maximum values after filtering
max_systolic_bp = filtered_data_bp['Systolic bp'].max()
min_systolic_bp = filtered_data_bp['Systolic bp'].min()

max_diastolic_bp = filtered_data_bp['Diastolic bp'].max()
min_diastolic_bp = filtered_data_bp['Diastolic bp'].min()

print("Systolic bp:")
print(f"Maximum value: {max_systolic_bp}")
print(f"Minimum value: {min_systolic_bp}")
print("--------------------")
print("Diastolic bp:")
print(f"Maximum value: {max_diastolic_bp}")
print(f"Minimum value: {min_diastolic_bp}")

Systolic bp:
Maximum value: 200.0
Minimum value: 80.0
--------------------
Diastolic bp:
Maximum value: 150.0
Minimum value: 50.0


In [20]:
# Confirm that every patient still has at least one reading after filtering
filtered_data_bp['Internalpatientid'].nunique()

993

### Getting Maximum Age for each 'Internalpatientid'

In [21]:
# Find the maximum age for each internal patient id
max_ages = filtered_data_bp.groupby('Internalpatientid')['Age at measurement bp'].max().reset_index()

In [22]:
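# Merge the per-patient maximum ages back onto the readings to keep each patient's latest row.
# Note: df here is the unfiltered frame (with columns already dropped in place), not filtered_data_bp.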
measurement_bp = pd.merge(df, max_ages, on =['Internalpatientid','Age at measurement bp'], how = 'inner')

measurement_bp

Unnamed: 0,Internalpatientid,Age at measurement bp,Diastolic bp,Systolic bp
0,100012,73.030604,68.0,114.0
1,100229,82.852885,68.0,187.0
2,100314,74.346569,51.0,102.0
3,100694,90.700582,75.0,128.0
4,101530,89.520523,64.0,95.0
...,...,...,...,...
988,93550,91.507833,54.0,133.0
989,9535,76.290901,80.0,126.0
990,94080,76.136998,54.0,112.0
991,98416,78.998030,85.0,195.0
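
An alternative to the group-then-merge approach, offered here only as a sketch and not as the method used in this notebook, is to select each patient's latest row directly with `idxmax`. This avoids joining on a floating-point age column. The readings for ages 53.086669, 73.030604 and 82.852885 are copied from outputs in this notebook; the fourth row is invented for illustration.

```python
import pandas as pd

readings = pd.DataFrame({
    "Internalpatientid": [100012, 100012, 100229, 100229],
    "Age at measurement bp": [53.086669, 73.030604, 82.852885, 80.0],  # 80.0 is made up
    "Diastolic bp": [75.0, 68.0, 68.0, 70.0],
    "Systolic bp": [137.0, 114.0, 187.0, 150.0],
})

# Index label of the row with the highest age within each patient.
latest_idx = readings.groupby("Internalpatientid")["Age at measurement bp"].idxmax()

# Keep exactly one row per patient: the most recent reading.
latest = readings.loc[latest_idx].reset_index(drop=True)
print(latest)
```

Because `idxmax` returns a single index label per group, the result has one row per patient even if two readings share the same age.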


In [23]:
# Confirm that all patients are still present after the merge
measurement_bp['Internalpatientid'].nunique()

993

In [24]:
# Drop the age column, which is no longer needed once the latest reading per patient is selected
measurement_bp.drop(['Age at measurement bp'], axis=1, inplace=True)

In [25]:
measurement_bp

Unnamed: 0,Internalpatientid,Diastolic bp,Systolic bp
0,100012,68.0,114.0
1,100229,68.0,187.0
2,100314,51.0,102.0
3,100694,75.0,128.0
4,101530,64.0,95.0
...,...,...,...
988,93550,54.0,133.0
989,9535,80.0,126.0
990,94080,54.0,112.0
991,98416,85.0,195.0


### **Saving Measurement blood pressure file**

In [26]:
# Show the current working directory
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574/Best_Files'

In [28]:
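# Write the result to CSV. The DataFrame index is written as an unnamed first column because index=False is not passed.
measurement_bp.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_quality/df_measurement_blood_pressure_qual_v1.csv')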
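
When the file is read back, an index written by `to_csv` shows up as an extra `Unnamed: 0` column. The sketch below shows one way to avoid that by passing `index=False`, and checks the round trip. It writes to a temporary directory rather than the notebook's output path, and `result`, `out_dir` and `out_path` are names assumed for this example; the two rows are copied from the final output above.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the final per-patient output.
result = pd.DataFrame({
    "Internalpatientid": [100012, 100229],
    "Diastolic bp": [68.0, 68.0],
    "Systolic bp": [114.0, 187.0],
})

with tempfile.TemporaryDirectory() as out_dir:
    out_path = os.path.join(out_dir, "df_measurement_blood_pressure_qual_v1.csv")
    # index=False keeps the row index out of the file.
    result.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```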

------