# UNDERSTANDING THE PROBLEM STATEMENT


The aim is to detect the presence or absence of cardiovascular disease in a person based on the given features.
The available features are:


| Feature | Feature type | Column | Data type / values |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

Note that:
- Objective: factual information;
- Examination: results of medical examination;
- Subjective: information given by the patient.

Data Source: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

# 1-Data Preprocessing 


- Drop the irrelevant 'id' column and convert age from days to years.
- Drop the 'cardio' column, as this is the target for future inference.
- Drop categorical data, since PCA works best with numerical data.
- Check for null values.
- Drop outliers, using a z-score of 3 as the threshold.
- Review the dataset overview.
- Scale and vectorize the data.

In [30]:
# import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing


In [31]:
# read the csv file 
cardio_df = pd.read_csv("cardio_train.csv", sep=";")

In [32]:
cardio_df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


In [33]:
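# keep only the numerical columns; .copy() avoids pandas chained-assignment warnings later
cardio_df = cardio_df[['age', 'height', 'weight', 'ap_hi', 'ap_lo']].copy()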
cardio_df.head()

Unnamed: 0,age,height,weight,ap_hi,ap_lo
0,18393,168,62.0,110,80
1,20228,156,85.0,140,90
2,18857,165,64.0,130,70
3,17623,169,82.0,150,100
4,17474,156,56.0,100,60


In [34]:
# convert age from days to years
cardio_df['age'] = cardio_df['age'] / 365

In [35]:
cardio_df.shape

(70000, 5)

In [36]:
cardio_df.isnull().sum()

age       0
height    0
weight    0
ap_hi     0
ap_lo     0
dtype: int64

### outliers 

In [37]:
# keep only rows whose absolute z-score is below the threshold in every column
z_scores = np.abs((cardio_df - cardio_df.mean()) / cardio_df.std())
threshold = 3
cardio_df = cardio_df[(z_scores < threshold).all(axis=1)].copy()
cardio_df

Unnamed: 0,age,height,weight,ap_hi,ap_lo
0,50.391781,168,62.0,110,80
1,55.419178,156,85.0,140,90
2,51.663014,165,64.0,130,70
3,48.282192,169,82.0,150,100
4,47.873973,156,56.0,100,60
...,...,...,...,...,...
69994,57.736986,165,80.0,150,80
69995,52.712329,168,76.0,120,80
69997,52.235616,183,105.0,180,90
69998,61.454795,163,72.0,135,80


In [38]:
cardio_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68067 entries, 0 to 69999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     68067 non-null  float64
 1   height  68067 non-null  int64  
 2   weight  68067 non-null  float64
 3   ap_hi   68067 non-null  int64  
 4   ap_lo   68067 non-null  int64  
dtypes: float64(2), int64(3)
memory usage: 3.1 MB


In [39]:
cardio_df.describe()

Unnamed: 0,age,height,weight,ap_hi,ap_lo
count,68067.0,68067.0,68067.0,68067.0,68067.0
mean,53.333878,164.372515,73.566468,126.171228,81.287452
std,6.758602,7.691645,13.201634,17.852058,10.233601
min,39.109589,140.0,32.0,-150.0,-70.0
25%,48.383562,159.0,65.0,120.0,80.0
50%,53.978082,165.0,72.0,120.0,80.0
75%,58.421918,170.0,81.0,140.0,90.0
max,64.967123,188.0,117.0,401.0,602.0
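
Note that the summary above still contains implausible blood-pressure values (a minimum ap_hi of -150 and a maximum ap_lo of 602) even after the z-score filter. A likely explanation is that a few extreme entries inflate the standard deviation, so moderately wrong values still fall within three standard deviations of the mean. The sketch below reproduces this effect on hypothetical readings; the data, the `zscore_mask` helper, and the 0 to 300 plausibility bounds are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def zscore_mask(values, threshold=3.0):
    """Return True where |z-score| is below the threshold (sample std, as in pandas)."""
    values = np.asarray(values, dtype=float)
    z = np.abs((values - values.mean()) / values.std(ddof=1))
    return z < threshold

# Hypothetical systolic readings: 1000 plausible values plus three entry errors.
rng = np.random.default_rng(0)
ap_hi = np.concatenate([rng.normal(125, 15, size=1000), [-150.0, 401.0, 16020.0]])

# One z-score pass removes only the most extreme value, because that value
# inflates the standard deviation used to judge everything else.
kept = ap_hi[zscore_mask(ap_hi)]

# An explicit range check (bounds chosen for illustration only) removes the rest.
plausible = kept[(kept > 0) & (kept < 300)]

print(len(ap_hi), len(kept), len(plausible))
```

Repeating the z-score pass, or using a robust statistic such as the median absolute deviation, would be alternative ways to handle the same problem.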


Because the remaining columns are all numerical, the data is later converted to a NumPy array, which is more efficient than a pandas DataFrame for numerical computation and linear algebra.

### standardize the dataset 

In [42]:
from sklearn.preprocessing import StandardScaler

# Select the columns to be standardized
columns_to_standardize = ['age','height','weight','ap_hi','ap_lo']

# Create a StandardScaler object and fit it to the selected columns
scaler = StandardScaler()
scaler.fit(cardio_df[columns_to_standardize])

# Transform the selected columns using the scaler object
cardio_df[columns_to_standardize] = scaler.transform(cardio_df[columns_to_standardize])


In [43]:
cardio_df

Unnamed: 0,age,height,weight,ap_hi,ap_lo
0,-0.435315,0.471617,-0.876146,-0.905853,-0.125807
1,0.308542,-1.088529,0.866076,0.774637,0.851373
2,-0.247222,0.081581,-0.724648,0.214474,-1.102988
3,-0.747451,0.601629,0.638830,1.334801,1.828553
4,-0.807851,-1.088529,-1.330638,-1.466017,-2.080168
...,...,...,...,...,...
69994,0.651487,0.081581,0.487332,1.334801,-0.125807
69995,-0.091965,0.471617,0.184337,-0.345690,-0.125807
69997,-0.162500,2.421799,2.381051,3.015292,0.851373
69998,1.201576,-0.178444,-0.118658,0.494556,-0.125807


In [44]:
df_matrix = cardio_df.to_numpy() 

In [45]:
# scaler = preprocessing.StandardScaler().fit(df_matrix)
# df_matrix = scaler.transform(df_matrix)

### Creating a SageMaker session and retrieving the IAM role to access AWS resources

In [46]:
import sagemaker
import boto3
from sagemaker import Session

In [47]:
# Let's create a Sagemaker session
sagemaker_session = sagemaker.Session()

In [48]:
role = sagemaker.get_execution_role()

### Data serialization 

Writing the NumPy array df_matrix to the BytesIO object buf stores the array's serialized bytes in memory rather than on disk. The write_numpy_to_dense_tensor helper encodes the array in the RecordIO-protobuf format that SageMaker's built-in algorithms accept as training input.
Keeping the data in an in-memory buffer avoids writing a temporary file: the buffer can be uploaded to S3 directly, and buf.seek(0) rewinds it so that the upload reads from the start.

In [49]:
import io  
import numpy as np
import sagemaker.amazon.common as smac 


# This line of code creates an empty BytesIO object buf. 
# BytesIO is a class in Python's io module that provides a file-like interface for working with bytes data in memory
buf = io.BytesIO() 

# writes the data to the buffer 
smac.write_numpy_to_dense_tensor(buf, df_matrix)
buf.seek(0)

0
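
The `buf.seek(0)` call above matters because a file-like object is read from its current position, which sits at the end of the data right after a write. Below is a minimal standard-library sketch of the same pattern, with plain bytes standing in for the serialized matrix (the payload is a made-up placeholder, not real RecordIO-protobuf data).

```python
import io

payload = b"placeholder bytes standing in for the serialized matrix"

buf = io.BytesIO()
buf.write(payload)

# After writing, the position is at the end, so a read returns nothing.
read_without_rewind = buf.read()

# Rewinding to the start makes the full content available to the next reader,
# for example an upload call that consumes the buffer as a file object.
buf.seek(0)
read_after_rewind = buf.read()

print(len(read_without_rewind), read_after_rewind == payload)
```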

# Data and Model Artifact pathway 

In [50]:
# bucket = Session().default_bucket()    # use this if no bucket has been created yet
bucket = 'deteam4'
prefix = 'pca_folder' 
subfolder = 'train'
key = 'pca_file'
output_file = 'output'

In [51]:
import os


boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, subfolder, key)).upload_fileobj(buf)

s3_data_location = 's3://{}/{}/{}/{}'.format(bucket, prefix ,subfolder, key)


print('uploaded data location: {}'.format(s3_data_location))

uploaded data location: s3://deteam4/pca_folder/train/pca_file


In [52]:
# create output placeholder in S3 bucket to store the PCA output

output_location = 's3://{}/{}/{}'.format(bucket, prefix,output_file)
print('model artifacts will be uploaded to: {}'.format(output_location))

model artifacts will be uploaded to: s3://deteam4/pca_folder/output
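
S3 object keys always use forward slashes, regardless of the operating system the notebook runs on. The sketch below builds the key and both URIs from the same parts using `posixpath`, so the result does not depend on the local path separator. The helper names `s3_key` and `s3_uri` are hypothetical and introduced only for this illustration.

```python
import posixpath

def s3_key(*parts):
    """Join key components with forward slashes, regardless of the local OS."""
    return posixpath.join(*parts)

def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and key components."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))

bucket = "deteam4"
prefix = "pca_folder"

train_key = s3_key(prefix, "train", "pca_file")
train_uri = s3_uri(bucket, prefix, "train", "pca_file")
output_uri = s3_uri(bucket, prefix, "output")

print(train_key)
print(train_uri)
print(output_uri)
```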


# Retrieve the container image URI for the "PCA" algorithm

In [53]:
# image_uris.retrieve is the SageMaker SDK v2 replacement for the deprecated get_image_uri
from sagemaker import image_uris

container = image_uris.retrieve(framework='pca', region=boto3.Session().region_name)

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


# Training the model using sagemaker library

In [54]:
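# SDK v2 parameter names: instance_count / instance_type replace train_instance_count / train_instance_type
pca = sagemaker.estimator.Estimator(image_uri=container,
                                    role=role,
                                    instance_count=1,
                                    instance_type='ml.c4.xlarge',
                                    output_path=output_location,
                                    sagemaker_session=sagemaker_session)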



pca.set_hyperparameters(feature_dim=5,
                        num_components=3,
                        subtract_mean=False,
                        algorithm_mode='regular',
                        mini_batch_size=100)


# Pass in the training data from S3 to train the pca model


pca.fit({subfolder : s3_data_location})

# Let's see the progress using cloudwatch logs

INFO:sagemaker:Creating training-job with name: pca-2023-04-28-16-23-54-591


2023-04-28 16:23:55 Starting - Starting the training job...
2023-04-28 16:24:20 Starting - Preparing the instances for training......
2023-04-28 16:25:20 Downloading - Downloading input data...
2023-04-28 16:25:46 Training - Downloading the training image......
2023-04-28 16:27:06 Training - Training image download completed. Training in progress...
Docker entrypoint called with argument(s): train
Running default environment configuration script
[04/28/2023 16:27:15 INFO 140251673741120] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'algorithm_mode': 'regular', 'subtract_mean': 'true', 'extra_components': '-1', 'force_dense': 'true', 'epochs': 1, '_log_level': 'info', '_kvstore': 'dist_sync', '_num_kv_servers': 'auto', '_num_gpus': 'auto'}
[04/28/2023 16:27:15 INFO 140251673741120] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'algorithm_mode': 'reg


2023-04-28 16:27:33 Uploading - Uploading generated training model
2023-04-28 16:27:33 Completed - Training job completed
Training seconds: 132
Billable seconds: 132
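
The internals of the built-in algorithm are not shown in this notebook, but the effect of `num_components=3` on `feature_dim=5` can be approximated locally. The sketch below is a plain NumPy version of PCA via singular value decomposition, run on hypothetical standardized data; the `pca_project` helper and the random data are assumptions for illustration. Setting `subtract_mean=False` is reasonable when the columns were already standardized to zero mean, as in this notebook. The order and sign of the components returned by the endpoint may differ from this sketch.

```python
import numpy as np

def pca_project(X, num_components, subtract_mean=False):
    """Project rows of X onto the top principal directions, found via SVD."""
    X = np.asarray(X, dtype=float)
    if subtract_mean:
        X = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:num_components]
    return X @ components.T, singular_values

# Hypothetical standardized data: 200 rows, 5 correlated features.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

projected, singular_values = pca_project(X, num_components=3)

# Share of total variance captured by the three retained components.
explained = (singular_values[:3] ** 2).sum() / (singular_values ** 2).sum()
print(projected.shape, round(float(explained), 3))
```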


# DEPLOY THE TRAINED PCA MODEL 

In [55]:
# Deploy the model to perform inference 

deployed_pca = pca.deploy(initial_instance_count=1,
                          instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: pca-2023-04-28-16-32-21-951
INFO:sagemaker:Creating endpoint-config with name pca-2023-04-28-16-32-21-951
INFO:sagemaker:Creating endpoint with name pca-2023-04-28-16-32-21-951


--------!

In [56]:
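# SDK v2 serializer classes replace the deprecated csv_serializer / json_deserializer objects
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

deployed_pca.serializer = CSVSerializer()
deployed_pca.deserializer = JSONDeserializer()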

In [58]:
# project the first 5,000 rows of the standardized matrix

result = deployed_pca.predict(df_matrix[:5000])



In [59]:
result  # results are in JSON format

{'projections': [{'projection': [0.22405357658863068,
    -0.28495800495147705,
    1.034017562866211]},
  {'projection': [0.20192334055900574,
    0.7894298434257507,
    -1.2551134824752808]},
  {'projection': [0.04656071215867996,
    0.06124899908900261,
    0.8853034973144531]},
  {'projection': [1.333034873008728, -0.4813711643218994, -2.076537609100342]},
  {'projection': [0.1388053297996521, 0.5129266381263733, 3.080711841583252]},
  {'projection': [-0.5794967412948608,
    1.8201987743377686,
    0.5062223672866821]},
  {'projection': [-1.0726518630981445,
    0.4192901849746704,
    -0.8005136251449585]},
  {'projection': [-1.424269437789917,
    -1.4341429471969604,
    -1.9284309148788452]},
  {'projection': [0.237821564078331, 0.0648123100399971, 1.6171625852584839]},
  {'projection': [-0.9334927797317505, -0.2209094911813736, 1.94552743434906]},
  {'projection': [-1.414697527885437,
    -0.2929782271385193,
    -0.3370371162891388]},
  {'projection': [0.13097093999385834,

In [60]:
# Since the results are in JSON format, collect the 'projection' list of each entry into a NumPy array
pca_result = np.array([r['projection'] for r in result['projections']])

In [61]:
pca_result

array([[ 0.22405358, -0.284958  ,  1.03401756],
       [ 0.20192334,  0.78942984, -1.25511348],
       [ 0.04656071,  0.061249  ,  0.8853035 ],
       ...,
       [-0.09081471,  0.09926381,  2.6663363 ],
       [-0.52963275,  0.0320712 ,  0.21081814],
       [-1.28588235, -0.64360505, -1.14546871]])

In [62]:
pca_result.shape , df_matrix.shape

((5000, 3), (68067, 5))
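
The shapes above show that only the first 5,000 of the 68,067 rows were projected. One way to cover the whole matrix is to send it in batches and stack the parsed responses. The sketch below uses a stand-in `fake_predict` function that mimics the response structure shown above; with a live endpoint it would be replaced by `deployed_pca.predict`. The helper names, the batch size, and the example matrix are assumptions for illustration.

```python
import numpy as np

def fake_predict(batch):
    """Stand-in for the endpoint: returns the first 3 columns in the JSON layout above."""
    return {"projections": [{"projection": row[:3].tolist()} for row in batch]}

def project_in_batches(matrix, predict, batch_size=5000):
    """Call predict on consecutive row batches and stack the parsed projections."""
    parts = []
    for start in range(0, len(matrix), batch_size):
        response = predict(matrix[start:start + batch_size])
        parts.append(np.array([r["projection"] for r in response["projections"]]))
    return np.vstack(parts)

# Hypothetical matrix with a row count that is not a multiple of the batch size.
matrix = np.arange(12_345 * 5, dtype=float).reshape(12_345, 5)
all_projections = project_in_batches(matrix, fake_predict, batch_size=5000)
print(all_projections.shape)
```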

# Run Prediction 

In [63]:
# Note: this row is passed in raw (unscaled) units, unlike the standardized training data
deployed_pca.predict([[55.419178,156,85.0,140,90]])




{'projections': [{'projection': [-30.023483276367188,
    -92.96271514892578,
    -213.91310119628906]}]}
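
The projection above is much larger in magnitude than the earlier ones, most likely because the row was sent in raw units while the model was trained on standardized features. The sketch below standardizes the same row first, using the column means and standard deviations printed by `describe()` earlier. Those are sample statistics, so the result matches the scaler's output only approximately; in the notebook itself, `scaler.transform` applied to the raw row would serve the same purpose.

```python
import numpy as np

# Column statistics copied from the describe() output above
# (order: age, height, weight, ap_hi, ap_lo).
train_mean = np.array([53.333878, 164.372515, 73.566468, 126.171228, 81.287452])
train_std = np.array([6.758602, 7.691645, 13.201634, 17.852058, 10.233601])

raw_row = np.array([[55.419178, 156, 85.0, 140, 90]])

# Apply the training-set statistics to the new sample before projecting it.
scaled_row = (raw_row - train_mean) / train_std
print(scaled_row.round(4))

# With a live endpoint one would then call, for example:
# deployed_pca.predict(scaled_row)
```

The scaled values agree closely with the second row of the standardized table shown earlier, which corresponds to the same patient record.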

# Clean Up

In [64]:
# Delete the endpoint to stop incurring charges

deployed_pca.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: pca-2023-04-28-16-32-21-951
INFO:sagemaker:Deleting endpoint with name: pca-2023-04-28-16-32-21-951
