----
# **Measurements Blood Pressure Train**
----

## **Dataset Description:**

- **Internalpatientid:** The unique identifier for each patient in the dataset.
- **Age at measurement bp:** The age of the patient at the time the blood pressure measurement was taken.
- **Measurement date:** The date and time when the blood pressure measurement was taken.
- **Diastolic bp:** The diastolic blood pressure value, which represents the pressure in the arteries when the heart is at rest between beats.
- **Systolic bp:** The systolic blood pressure value, which represents the pressure in the arteries when the heart is actively pumping blood.
- **State:** The state where the measurement was taken or the patient's residence.

### Azure Notebook Setup

In [1]:
# Dataset.Tabular is a class attribute that exposes the TabularDatasetFactory methods for creating new TabularDataset objects.
# Usage: Dataset.Tabular.from_delimited_files()
from azureml.core import Workspace, Dataset

subscription_id = 'bcfe0c62-8ebe-4df0-a46d-1efcf8739a5b' # subscription ID (shown after launching the studio)
resource_group = 'VChamp-Team3' # resource group name
workspace_name = 'vchamp-team3' # workspace name


# Storage account: Algorithmia; resource group: VChamp-Team3; workspace: vchamp-team3.
# Connect to the workspace through the Workspace constructor
workspace = Workspace(subscription_id, resource_group, workspace_name)

In [2]:
# Look up the datastore 'data_team3_synthetic_train' by name
datastore = workspace.datastores['data_team3_synthetic_train']

In [3]:
# from_delimited_files creates a TabularDataset that represents tabular data in delimited files (e.g. CSV and TSV).
dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'measurements_blood_pressure_train.csv')])

# To preview the first 3 rows of the dataset:
# dataset.take(3).to_pandas_dataframe()

In [4]:
# Convert the Azure TabularDataset into a pandas DataFrame, the format needed for the analysis below
measurements_blood_pressure_train_data = dataset.to_pandas_dataframe()

In [5]:
type(measurements_blood_pressure_train_data)

pandas.core.frame.DataFrame

In [6]:
measurements_blood_pressure_train_data.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement bp,Measurement date,Diastolic bp,Systolic bp,State
0,0,1,61.55404,2005-10-25 00:02:08,75.0,140.0,Indiana
1,1,1,67.03726,2011-04-21 02:50:27,72.0,116.0,Indiana
2,2,1,68.30414,2012-07-27 03:54:47,100.0,145.0,Indiana
3,3,1,68.347339,2012-08-11 22:51:23,89.0,155.0,Indiana
4,4,1,68.781623,2013-01-17 16:23:39,72.0,143.0,Indiana


## **Importing Libraries**

In [7]:
# Importing essential libraries
import pandas as pd        # Library for data manipulation and analysis
import numpy as np         # Library for mathematical operations

## **Data Exploration**

In [8]:
# Bind a shorter name to the DataFrame (an alias to the same object, not a copy)
df = measurements_blood_pressure_train_data

In [9]:
# Display the first few rows of a DataFrame
df.head()

Unnamed: 0,Column1,Internalpatientid,Age at measurement bp,Measurement date,Diastolic bp,Systolic bp,State
0,0,1,61.55404,2005-10-25 00:02:08,75.0,140.0,Indiana
1,1,1,67.03726,2011-04-21 02:50:27,72.0,116.0,Indiana
2,2,1,68.30414,2012-07-27 03:54:47,100.0,145.0,Indiana
3,3,1,68.347339,2012-08-11 22:51:23,89.0,155.0,Indiana
4,4,1,68.781623,2013-01-17 16:23:39,72.0,143.0,Indiana


In [10]:
# Shape of the dataset
df.shape

num_rows = df.shape[0]  # Number of rows
num_cols = df.shape[1]  # Number of columns

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 21997558
Number of columns: 7


In [11]:
# Get the number of unique values in the 'Internalpatientid' column
df['Internalpatientid'].nunique()

132210

In [12]:
# Display the concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21997558 entries, 0 to 21997557
Data columns (total 7 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   Column1                int64         
 1   Internalpatientid      int64         
 2   Age at measurement bp  float64       
 3   Measurement date       datetime64[ns]
 4   Diastolic bp           float64       
 5   Systolic bp            float64       
 6   State                  object        
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 1.1+ GB


- The 'Column1' and 'Internalpatientid' columns contain integer values; 'Age at measurement bp', 'Diastolic bp' and 'Systolic bp' are in float format; 'Measurement date' is a datetime column; and 'State' is the only column in object format.
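
The dtype summary above can also be cross-checked programmatically. The following is a minimal sketch on a toy frame (named `toy` here, an illustrative stand-in built from the first two rows shown earlier, not the real 22-million-row dataset) that groups columns by kind with `select_dtypes`.

```python
import pandas as pd

# Toy stand-in that mirrors the column names reported by df.info().
toy = pd.DataFrame({
    "Column1": [0, 1],
    "Internalpatientid": [1, 1],
    "Age at measurement bp": [61.55404, 67.03726],
    "Measurement date": pd.to_datetime(["2005-10-25 00:02:08", "2011-04-21 02:50:27"]),
    "Diastolic bp": [75.0, 72.0],
    "Systolic bp": [140.0, 116.0],
    "State": ["Indiana", "Indiana"],
})

# Split the columns into numeric, datetime, and everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
datetime_cols = toy.select_dtypes(include="datetime").columns.tolist()
other_cols = toy.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print("numeric:", numeric_cols)
print("datetime:", datetime_cols)
print("other:", other_cols)
```

On the real DataFrame the same three calls could be run on `df` directly.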

## **Checking Missing Values**

In [13]:
# Count the number of missing values in each column
df.isnull().sum()

Column1                  0
Internalpatientid        0
Age at measurement bp    0
Measurement date         0
Diastolic bp             0
Systolic bp              0
State                    0
dtype: int64

- There are no missing values in this file.

## **Data Preprocessing**

In [14]:
# Bind a new name to the same DataFrame (an alias, not a copy, so in-place changes below also affect df)
measurement_bp = df

In [15]:
# Drop the specified columns from the DataFrame
measurement_bp.drop(['Column1','Measurement date','State'], axis=1,inplace=True)

In [16]:
measurement_bp.head()

Unnamed: 0,Internalpatientid,Age at measurement bp,Diastolic bp,Systolic bp
0,1,61.55404,75.0,140.0
1,1,67.03726,72.0,116.0
2,1,68.30414,100.0,145.0
3,1,68.347339,89.0,155.0
4,1,68.781623,72.0,143.0
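
Cells 14 and 15 assign `measurement_bp = df` and then drop columns with `inplace=True`. In pandas, plain assignment binds a second name to the same object, so the drop also removes those columns from `df`. The sketch below illustrates this on a toy frame; the names `original`, `alias` and `independent` and the second state value are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

original = pd.DataFrame({
    "Internalpatientid": [1, 2],
    "State": ["Indiana", "Ohio"],  # hypothetical values
})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

# An in-place drop through the alias changes the original as well.
alias.drop(columns=["State"], inplace=True)

alias_is_same_object = alias is original
original_columns = list(original.columns)
copy_columns = list(independent.columns)

print(alias_is_same_object, original_columns, copy_columns)
```

If the untouched frame is needed later, taking a `.copy()` before dropping is one option, at the cost of extra memory on a dataset of this size.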


### Checking Minimum and Maximum for Potential Columns

In [17]:
# Checking maximum and minimum values for Systolic and Diastolic bp columns
max_systolic_bp = measurement_bp['Systolic bp'].max()
min_systolic_bp = measurement_bp['Systolic bp'].min()

max_diastolic_bp = measurement_bp['Diastolic bp'].max()
min_diastolic_bp = measurement_bp['Diastolic bp'].min()

print("Systolic bp:")
print(f"Maximum value: {max_systolic_bp}")
print(f"Minimum value: {min_systolic_bp}")
print("--------------------")
print("Diastolic bp:")
print(f"Maximum value: {max_diastolic_bp}")
print(f"Minimum value: {min_diastolic_bp}")

Systolic bp:
Maximum value: 312.0
Minimum value: 40.0
--------------------
Diastolic bp:
Maximum value: 208.0
Minimum value: 29.0
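
The same range check can be written more compactly. As a sketch, assuming a toy frame built from the first five readings shown earlier, `DataFrame.agg` returns the minimum and maximum of both columns in one table.

```python
import pandas as pd

# First five readings from the preview above, used as toy data.
toy = pd.DataFrame({
    "Diastolic bp": [75.0, 72.0, 100.0, 89.0, 72.0],
    "Systolic bp": [140.0, 116.0, 145.0, 155.0, 143.0],
})

# One row per statistic, one column per measurement.
bp_range = toy[["Systolic bp", "Diastolic bp"]].agg(["min", "max"])
print(bp_range)
```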


### Removing Outliers in Systolic and Diastolic bp

In [18]:
# Keep only readings with 'Diastolic bp' in [50, 150] and 'Systolic bp' in [80, 200]
filtered_data = measurement_bp[
    (measurement_bp['Diastolic bp'] >= 50) & (measurement_bp['Diastolic bp'] <= 150)
    & (measurement_bp['Systolic bp'] >= 80) & (measurement_bp['Systolic bp'] <= 200)
]

# Reset the index of the filtered data
filtered_data_bp = filtered_data.reset_index(drop=True)

# Display the final filtered dataframe
filtered_data_bp

Unnamed: 0,Internalpatientid,Age at measurement bp,Diastolic bp,Systolic bp
0,1,61.554040,75.0,140.0
1,1,67.037260,72.0,116.0
2,1,68.304140,100.0,145.0
3,1,68.347339,89.0,155.0
4,1,68.781623,72.0,143.0
...,...,...,...,...
21201155,99999,96.324828,62.0,147.0
21201156,99999,96.326070,57.0,123.0
21201157,99999,96.331619,68.0,154.0
21201158,99999,96.357136,79.0,147.0


In [19]:
max_systolic_bp = filtered_data_bp['Systolic bp'].max()
min_systolic_bp = filtered_data_bp['Systolic bp'].min()

max_diastolic_bp = filtered_data_bp['Diastolic bp'].max()
min_diastolic_bp = filtered_data_bp['Diastolic bp'].min()

print("Systolic bp:")
print(f"Maximum value: {max_systolic_bp}")
print(f"Minimum value: {min_systolic_bp}")
print("--------------------")
print("Diastolic bp:")
print(f"Maximum value: {max_diastolic_bp}")
print(f"Minimum value: {min_diastolic_bp}")

Systolic bp:
Maximum value: 200.0
Minimum value: 80.0
--------------------
Diastolic bp:
Maximum value: 150.0
Minimum value: 50.0


In [20]:
filtered_data_bp['Internalpatientid'].nunique()

132032
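
The patient count falls from 132210 before filtering to 132032 after it, so 178 patients had no reading inside the accepted ranges. The sketch below shows one way to apply the same inclusive bounds with `Series.between` and to report how many rows and patients a filter removes. The toy rows are made up for illustration, apart from reusing the extreme values printed above.

```python
import pandas as pd

# Toy data: hypothetical patients, including the extreme values seen earlier.
toy = pd.DataFrame({
    "Internalpatientid": [1, 1, 2, 3, 3, 4],
    "Diastolic bp": [75.0, 208.0, 29.0, 50.0, 150.0, 72.0],
    "Systolic bp": [140.0, 312.0, 40.0, 80.0, 200.0, 201.0],
})

# between() is inclusive on both ends by default, matching >= and <=.
in_range = toy["Diastolic bp"].between(50, 150) & toy["Systolic bp"].between(80, 200)
kept = toy[in_range].reset_index(drop=True)

removed_rows = int((~in_range).sum())
patients_lost = toy["Internalpatientid"].nunique() - kept["Internalpatientid"].nunique()

print("rows removed:", removed_rows)
print("patients with no remaining reading:", patients_lost)
```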

### Getting Maximum Age for each 'Internalpatientid'

In [21]:
# Find the maximum age for each internal patient id
max_ages = filtered_data_bp.groupby('Internalpatientid')['Age at measurement bp'].max().reset_index()

In [22]:
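# Merge back to recover the blood pressure readings taken at each patient's maximum age.
# Merging against filtered_data_bp (not the unfiltered df) keeps outlier readings out of the result.
measurement_bp = pd.merge(filtered_data_bp, max_ages, on =['Internalpatientid','Age at measurement bp'], how = 'inner')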

measurement_bp

Unnamed: 0,Internalpatientid,Age at measurement bp,Diastolic bp,Systolic bp
0,100000,67.833933,90.0,152.0
1,100060,56.340829,73.0,112.0
2,100062,80.022034,61.0,108.0
3,100079,90.039410,60.0,111.0
4,100081,60.925214,62.0,119.0
...,...,...,...,...
132027,99928,83.835264,60.0,130.0
132028,99954,95.538267,62.0,113.0
132029,99960,74.647564,71.0,128.0
132030,99977,81.617250,83.0,127.0


In [28]:
measurement_bp['Internalpatientid'].nunique()

132032
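
Merging on the maximum age returns every row that ties for that age, so a patient with two readings at the same age would appear twice. Whether such ties occur in this dataset is not shown here. As an alternative sketch, `groupby(...).idxmax()` picks exactly one row per patient. The toy data below is hypothetical, with patient 2 given a deliberate tie.

```python
import pandas as pd

toy = pd.DataFrame({
    "Internalpatientid": [1, 1, 1, 2, 2],
    "Age at measurement bp": [61.55404, 67.03726, 68.781623, 70.0, 70.0],
    "Diastolic bp": [75.0, 72.0, 72.0, 80.0, 85.0],
    "Systolic bp": [140.0, 116.0, 143.0, 120.0, 130.0],
})

# Merge approach, as in the notebook: ties produce more than one row per patient.
max_ages = toy.groupby("Internalpatientid")["Age at measurement bp"].max().reset_index()
merged = pd.merge(toy, max_ages, on=["Internalpatientid", "Age at measurement bp"], how="inner")

# idxmax approach: the index label of the first maximum in each group.
latest_idx = toy.groupby("Internalpatientid")["Age at measurement bp"].idxmax()
latest = toy.loc[latest_idx].reset_index(drop=True)

print(len(merged), "rows from merge;", len(latest), "rows from idxmax")
```

Comparing the row count with `nunique()` of `Internalpatientid`, as the notebook does for the patient count, is a quick way to detect such duplicates.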

In [24]:
# Drop the age column, which is no longer needed once the latest reading is selected
measurement_bp.drop(['Age at measurement bp'], axis=1,inplace=True)

In [25]:
measurement_bp

Unnamed: 0,Internalpatientid,Diastolic bp,Systolic bp
0,100000,90.0,152.0
1,100060,73.0,112.0
2,100062,61.0,108.0
3,100079,60.0,111.0
4,100081,62.0,119.0
...,...,...,...
132027,99928,60.0,130.0
132028,99954,62.0,113.0
132029,99960,71.0,128.0
132030,99977,83.0,127.0


### **Saving Measurement blood pressure file**

In [29]:
import os
cwd = os.getcwd()
cwd

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/2211574/Best_Files'

In [30]:
measurement_bp.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/team3-lavanya-gpu2/code/Users/900379/Output_files_train/Potential_files_train/df_measurements_blood_pressure_train_v1.csv')
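
By default `to_csv` also writes the DataFrame index, and it comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below contrasts this with `index=False`, using a temporary directory and illustrative file names rather than the notebook's real output path.

```python
import os
import tempfile

import pandas as pd

# Two rows from the final table above, used as toy data.
toy = pd.DataFrame({
    "Internalpatientid": [100000, 100060],
    "Diastolic bp": [90.0, 73.0],
    "Systolic bp": [152.0, 112.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")        # hypothetical name
    without_index_path = os.path.join(tmp_dir, "without_index.csv")  # hypothetical name

    toy.to_csv(with_index_path)                   # default: index is written
    toy.to_csv(without_index_path, index=False)   # index is omitted

    columns_with_index = pd.read_csv(with_index_path).columns.tolist()
    columns_without_index = pd.read_csv(without_index_path).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

Whether to keep the index depends on what downstream notebooks expect, so this is a design choice rather than a required change.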

------