# Hands-on Assignment: Data Exploration with NumPy

In this hands-on assignment, we'll use the NumPy Python library to explore a dataset. The dataset is a medical one: for each patient it records metrics such as glucose and insulin levels, along with other measurements related to diabetes. The assignment has two primary objectives: (a) practice NumPy on a realistic task, and (b) learn how to get a feel for a large dataset (a process also known as data cleaning and data exploration).

The following are the column names: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome.
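Because the loaded array carries no column labels, a name-to-index lookup can make later indexing easier to read. This is a minimal sketch that assumes the CSV columns appear in the order listed above; `COLUMNS` and `col` are illustrative names, not part of the assignment.

```python
# Column names in the order listed above (assumed to match the CSV header).
COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# Hypothetical helper: map each column name to its 0-indexed position.
col = {name: index for index, name in enumerate(COLUMNS)}

print(col["BloodPressure"])  # 2
print(col["Outcome"])        # 8
```

With such a lookup, an expression like `matrix[5, col["BloodPressure"]]` reads the same value as `matrix[5, 2]`.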

In [1]:
import numpy as np
 
# skiprows=1 skips the header row with the column names;
# delimiter=',' tells NumPy that the values are separated by commas
matrix = np.loadtxt('diabetes.csv', skiprows=1, delimiter=',')

In [3]:
matrix

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

How many patients does the dataset have information about?

In [5]:
matrix.shape

(768, 9)
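The shape tuple is ordered as (rows, columns), so the first entry is the number of patients. The sketch below shows how the tuple could be unpacked; it uses a made-up all-zeros array as a stand-in for the real data, and the variable names are illustrative.

```python
import numpy as np

# Toy stand-in for the real matrix: 4 made-up patients, 9 columns.
toy = np.zeros((4, 9))

# shape is (rows, columns): rows are patients, columns are features.
n_patients, n_columns = toy.shape

print(n_patients)  # 4
print(n_columns)   # 9
```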

What is the blood pressure of patient number 5 (0-indexed)?

In [6]:
matrix[5, 2]

74.0

What is the age of patient number 112 (0-indexed)?

In [7]:
matrix[112, 7]

23.0

In this dataset, Outcome = 0 denotes that the patient does not have diabetes, and Outcome = 1 denotes that the patient has diabetes.

Does patient number 227 (0-indexed) have diabetes?

In [8]:
matrix[227, 8]

1.0

Of the 768 patients, how many have diabetes?

In [10]:
# The comparison yields a boolean array; np.sum counts its True entries
ans = np.sum(matrix[:, 8] == 1)
ans

268
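Counting rows that satisfy a condition comes down to counting the `True` entries of a boolean array. As a rough illustration on made-up values (not taken from `diabetes.csv`), the sketch below shows two equivalent ways to count, plus the proportion:

```python
import numpy as np

# Made-up Outcome column for six patients (not taken from diabetes.csv).
outcome = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

mask = outcome == 1              # boolean array, True where Outcome is 1
count = np.count_nonzero(mask)   # number of True entries
same_count = np.sum(mask)        # True counts as 1, so the sum is the same count
fraction = mask.mean()           # mean of a boolean array is the proportion of True

print(count)     # 3
print(fraction)  # 0.5
```

Either counting call works here; `np.count_nonzero` states the intent a little more directly.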

For the features Glucose, BloodPressure, SkinThickness, Insulin, and BMI (columns 1, 2, 3, 4, and 5, 0-indexed), the values are missing for some of the patients. Instead of the actual value, the dataset simply has a 0.

For how many patients is the Insulin value missing?

In [11]:
# Column 4 is Insulin; a 0 there marks a missing value
ans = np.sum(matrix[:, 4] == 0)
ans

374
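The same count can be taken for all five affected columns at once by comparing a column slice to 0 and summing down each column. This is a sketch on made-up rows arranged in the dataset's column order; the values are not real patients.

```python
import numpy as np

# Made-up rows in the same column order as the dataset (not real patients).
toy = np.array([
    [0, 120, 70, 30,   0, 25.0, 0.5, 30, 0],
    [2,   0, 80,  0, 100, 30.0, 0.3, 40, 1],
    [1, 110, 60, 20,  90, 28.0, 0.4, 25, 0],
])

# Compare columns 1..5 to 0, then count the True values down each column.
missing_per_column = np.sum(toy[:, 1:6] == 0, axis=0)

print(missing_per_column)  # [1 0 1 1 0]
```

The result has one entry per column in the slice, in the order Glucose, BloodPressure, SkinThickness, Insulin, BMI.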

For how many patients is at least one of these features missing? (Be careful: it is valid for a patient to have been pregnant 0 times, so a 0 in the Pregnancies column is not a missing value.)

In [12]:
# Only columns 1-5 use 0 as a missing-value marker
np.sum(
    (matrix[:, 1] == 0) |
    (matrix[:, 2] == 0) |
    (matrix[:, 3] == 0) |
    (matrix[:, 4] == 0) |
    (matrix[:, 5] == 0)
)

376

Filter the dataset so that only the patients with no missing data remain. You might find the np.logical_not() function useful. Verify that the shape of the resulting matrix is (392, 9).

For all future questions, use the filtered data.

In [13]:
# True for every patient with a 0 (missing value) in any of columns 1-5
bad = (
    (matrix[:, 1] == 0) |
    (matrix[:, 2] == 0) |
    (matrix[:, 3] == 0) |
    (matrix[:, 4] == 0) |
    (matrix[:, 5] == 0)
)
# Keep only the rows where bad is False
filtered = matrix[np.logical_not(bad), :]
filtered.shape

(392, 9)
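A more compact way to build the same kind of row mask is `np.any` over a column slice, with `~` as the operator form of `np.logical_not`. The sketch below uses made-up rows rather than the real data, and the names `toy` and `kept` are illustrative.

```python
import numpy as np

# Made-up rows in the same column order as the dataset (not real patients).
toy = np.array([
    [0, 120, 70, 30,   0, 25.0, 0.5, 30, 0],
    [2,   0, 80,  0, 100, 30.0, 0.3, 40, 1],
    [1, 110, 60, 20,  90, 28.0, 0.4, 25, 0],
])

# True for rows that have a 0 anywhere in columns 1..5.
bad = np.any(toy[:, 1:6] == 0, axis=1)

# ~bad flips the mask, so only complete rows are kept.
kept = toy[~bad]

print(bad)         # [ True  True False]
print(kept.shape)  # (1, 9)
```

The first toy row has 0 pregnancies, but it is dropped because of its missing Insulin value, not because of the Pregnancies column.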

Of the 392 patients in the filtered dataset, how many have diabetes?

The filtered variable is already defined from the previous question, so use it for this and all later questions.

In [14]:
ans = np.sum(filtered[:, 8] == 1)
ans

130

What is the average glucose level in the filtered dataset?

In [15]:
# axis=0 gives one mean per column; Glucose is column 1, so the answer is ans[1]
ans = np.mean(filtered, axis=0)
ans

array([  3.30102041, 122.62755102,  70.66326531,  29.14540816,
       156.05612245,  33.08622449,   0.52304592,  30.86479592,
         0.33163265])
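Passing `axis=0` to `np.mean` averages down the rows and returns one value per column, so the mean for a single feature is one entry of the result. This sketch on a made-up two-column array shows that picking that entry agrees with averaging the column directly:

```python
import numpy as np

# Made-up array: column 0 and column 1 stand in for two features.
toy = np.array([
    [1.0, 100.0],
    [3.0, 120.0],
])

column_means = np.mean(toy, axis=0)   # one mean per column
single_mean = np.mean(toy[:, 1])      # mean of column 1 only

print(column_means[1])  # 110.0
print(single_mean)      # 110.0
```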

What is the average glucose level among the diabetic patients?

In [16]:
# Keep only the rows where Outcome (column 8) is 1
diabetic = filtered[filtered[:, 8] == 1, :]
# Column-wise means; the Glucose mean is ans[1]
ans = np.mean(diabetic, axis=0)
ans

array([  4.46923077, 145.19230769,  74.07692308,  32.96153846,
       206.84615385,  35.77769231,   0.62558462,  35.93846154,
         1.        ])

What is the average glucose level among the non-diabetic patients?

In [17]:
# Keep only the rows where Outcome (column 8) is 0
non_diabetic = filtered[filtered[:, 8] == 0, :]
# Column-wise means; the Glucose mean is ans[1]
ans = np.mean(non_diabetic, axis=0)
ans

array([  2.72137405, 111.43129771,  68.96946565,  27.2519084 ,
       130.85496183,  31.75076336,   0.47216794,  28.34732824,
         0.        ])
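The two group averages can also be computed for a single feature by indexing that feature's column with a boolean mask built from the Outcome column. This is a minimal sketch on made-up values (not taken from `diabetes.csv`), with illustrative variable names:

```python
import numpy as np

# Made-up Glucose and Outcome values for four patients (not real data).
glucose = np.array([150.0, 100.0, 140.0, 110.0])
outcome = np.array([1.0, 0.0, 1.0, 0.0])

# Select the glucose values of each group, then average them.
mean_diabetic = glucose[outcome == 1].mean()
mean_non_diabetic = glucose[outcome == 0].mean()

print(mean_diabetic)      # 145.0
print(mean_non_diabetic)  # 105.0
```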