# Assessment 1: Predict diabetes using Perceptron

## Overview

The first assignment is to implement, describe, and test
the Perceptron algorithm (which can be interpreted as a
neural network with a single dense layer) for predicting
diabetes, using the provided diabetes dataset.
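
To make the algorithm concrete before any data is loaded, here is a minimal sketch of the classic perceptron learning rule in NumPy. It is not necessarily the implementation used later in this assessment: the function names `train_perceptron` and `predict`, the `epochs` and `lr` parameters, and the toy data are all assumptions made for illustration. Labels are assumed to be +1 or -1, matching the label convention of the dataset.

```python
import numpy as np


def train_perceptron(X, y, epochs=10, lr=1.0):
    """Perceptron learning rule for labels in {-1, +1} (illustrative sketch)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # A sample counts as misclassified when the sign of the
            # score disagrees with its label (or the score is zero).
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b


def predict(X, w, b):
    """Return +1 where the score is non-negative, otherwise -1."""
    return np.where(X @ w + b >= 0, 1, -1)


# Tiny linearly separable toy set (hypothetical, not the diabetes data).
X_toy = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])

w, b = train_perceptron(X_toy, y_toy)
toy_predictions = predict(X_toy, w, b)
print(toy_predictions)
```

The weights change only when a sample is misclassified, so on linearly separable data the updates stop once every sample is on the correct side of the decision boundary.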

## Data Loading

In [2]:
# Common imports
import numpy as np
import pandas as pd

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning & deep learning imports
import sklearn
import tensorflow as tf
from tensorflow import keras


In [3]:
# Load the pre-processed data diabetes_scale.txt
data = []
labels = []

# Step 1: Set the number of features to 8, per the dataset description
feature_num = 8

with open('diabetes_scale.txt', 'r') as f:
    for line in f:
        line = line.strip().split()
        labels.append(int(line[0]))  # First item is the label (+1 or -1)
        
        # Parse each "index:value" pair into a {feature_index: value} dict,
        # splitting each item only once
        features = {int(k): float(v) for k, v in (item.split(":") for item in line[1:])}
        data.append(features)

# Step 2: Create a NumPy array pre-filled with np.nan, then populate it
# with the feature values present in each row; absent features stay np.nan
X = np.full((len(data), feature_num), np.nan)

for i, features in enumerate(data):
    for idx, value in features.items():
        X[i, idx - 1] = value  # Subtract 1 since feature indices are 1-based in the file

y = np.array(labels)

# Inspect X before further processing or model training
print("Shape of the data:", X.shape)
print("Sample row:", X[0])

Shape of the data: (768, 8)
Sample row: [-0.294118    0.487437    0.180328   -0.292929   -1.          0.00149028
 -0.53117    -0.0333333 ]
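
The loading steps above can also be wrapped in a single reusable function, which makes the parsing logic easy to check on a small in-memory sample. This is a sketch only: the helper name `parse_libsvm_lines` and the two-line sample are hypothetical, and the function assumes the same `label index:value ...` layout with 1-based feature indices that the cell above relies on.

```python
import numpy as np


def parse_libsvm_lines(lines, n_features):
    """Parse 'label idx:value ...' lines into dense (X, y); absent features become NaN."""
    rows, labels = [], []
    for line in lines:
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        labels.append(int(parts[0]))  # first item is the label (+1 or -1)
        row = np.full(n_features, np.nan)
        for item in parts[1:]:
            idx, value = item.split(":")
            row[int(idx) - 1] = float(value)  # indices in the file are 1-based
        rows.append(row)
    return np.vstack(rows), np.array(labels)


# Hypothetical two-line sample; the second line omits feature 2.
sample = ["-1 1:0.5 2:-0.25 3:1", "+1 1:-1 3:0.75"]
X_sample, y_sample = parse_libsvm_lines(sample, n_features=3)
print(X_sample)
print(y_sample)
```

Because the function accepts any iterable of strings, an open file handle for `diabetes_scale.txt` could be passed in directly with `n_features=8`.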


In [4]:
# Check for rows with missing values (np.nan)
rows_with_nan = np.any(np.isnan(X), axis=1)  # Returns a boolean array: True if the row has any NaN values

# Get the indices of rows with NaN values
nan_indices = np.where(rows_with_nan)[0]

print(f"Rows with NaN values: {nan_indices}")

Rows with NaN values: [ 14  24 236 259 285 401 458 517 658]
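
Once the incomplete rows are known, they have to be handled before training, since NaN values propagate through dot products. Two simple options are sketched below: dropping the affected rows, or filling the gaps with a constant. The helper names `drop_nan_rows` and `fill_nan` and the demo arrays are hypothetical. Note that the LIBSVM sparse format usually omits zero-valued entries, so filling with 0 is one plausible reading of the gaps; that is an assumption about this file rather than something confirmed here.

```python
import numpy as np


def drop_nan_rows(X, y):
    """Keep only the rows of X (and matching labels) that contain no NaN."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]


def fill_nan(X, fill_value=0.0):
    """Return a copy of X with every NaN replaced by fill_value."""
    return np.where(np.isnan(X), fill_value, X)


# Hypothetical demo data: rows 0 and 2 each have one missing feature.
X_demo = np.array([[0.1, np.nan], [0.2, 0.3], [np.nan, 0.5]])
y_demo = np.array([1, -1, 1])

X_kept, y_kept = drop_nan_rows(X_demo, y_demo)
X_filled = fill_nan(X_demo)
print(X_kept.shape, X_filled.shape)
```

Dropping rows discards a small part of the dataset, while filling keeps every sample at the cost of assuming a value for the gaps; which is preferable depends on what the missing entries actually mean.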
