# Unit 2: Data for AI

## Data

For this tutorial we'll use the <b>Diabetes</b> dataset (available at https://aka.ms/diabetes-data). Note that we modified the original dataset for this tutorial.<br>

Available columns in the dataset:<br>
* Pregnancies: Number of times pregnant
* PlasmaGlucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* DiastolicBloodPressure: Diastolic blood pressure (mm Hg)
* TricepsThickness: Triceps skin fold thickness (mm)
* SerumInsulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2)
* DiabetesPedigree: Diabetes pedigree function
* Age: Age (years)
* Ethnicity: <span style="color:red">This feature is not available in the original dataset</span>
* Diabetic: Class variable (No or Yes) <span style="color:red">In the original dataset the class variable is 0/1</span>


A similar dataset is available at https://www.kaggle.com/datasets/mathchi/diabetes-data-set

## Load data into the Python workspace

Load required libraries

In [1]:
import pandas as pd

Load data from a comma-separated values (CSV) file

In [2]:
df = pd.read_csv('diabetes_mod.csv')

Check available features/columns

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
PatientID                 10000 non-null int64
Pregnancies               9500 non-null float64
PlasmaGlucose             10000 non-null int64
DiastolicBloodPressure    10000 non-null int64
TricepsThickness          8500 non-null float64
SerumInsulin              10000 non-null int64
BMI                       10000 non-null float64
DiabetesPedigree          9500 non-null float64
Age                       10000 non-null int64
Diabetic                  10000 non-null object
Ethnicity                 9500 non-null object
dtypes: float64(4), int64(5), object(2)
memory usage: 859.5+ KB


Check the first few lines of the data

In [4]:
df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Ethnicity
0,1354778,,171,80,34.0,23,43.509726,1.213191,21,No,White
1,1147438,8.0,92,93,47.0,36,21.240576,0.158365,23,No,White
2,1640031,7.0,115,47,52.0,35,41.511523,,23,No,White
3,1883350,9.0,103,78,25.0,304,29.582192,1.28287,43,Yes,White
4,1424119,1.0,85,59,27.0,35,42.604536,0.549542,22,No,Hispanic


Get a glimpse of the dataset, including the number of rows and columns

In [5]:
df

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Ethnicity
0,1354778,,171,80,34.0,23,43.509726,1.213191,21,No,White
1,1147438,8.0,92,93,47.0,36,21.240576,0.158365,23,No,White
2,1640031,7.0,115,47,52.0,35,41.511523,,23,No,White
3,1883350,9.0,103,78,25.0,304,29.582192,1.282870,43,Yes,White
4,1424119,1.0,85,59,27.0,35,42.604536,0.549542,22,No,Hispanic
5,1619297,0.0,82,92,9.0,253,19.724160,0.103424,26,No,White
6,1660149,,133,47,,227,21.941357,0.174160,21,No,White
7,1458769,0.0,67,87,43.0,36,18.277723,0.236165,26,No,White
8,1201647,8.0,80,95,33.0,24,26.624929,0.443947,53,Yes,White
9,1403912,1.0,72,31,,42,36.889576,0.103944,26,No,White


You can also check the dimensions of the dataframe with `dataframe.shape`

In [37]:
df.shape

(10000, 11)

### Check data distribution

Check data distribution for numeric columns

In [6]:
df.describe()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age
count,10000.0,9500.0,10000.0,10000.0,8500.0,10000.0,10000.0,9500.0,10000.0
mean,1502122.0,3.259789,107.8502,71.2075,28.788,139.2436,31.567022,0.400806,30.1341
std,289286.8,3.401616,31.920909,16.801478,14.492659,133.777919,9.804366,0.380767,12.106047
min,1000038.0,0.0,44.0,24.0,7.0,14.0,18.200807,0.078044,21.0
25%,1251672.0,0.0,84.0,58.0,15.0,39.0,21.247427,0.137321,22.0
50%,1504394.0,2.0,105.0,72.0,30.0,85.0,31.922421,0.199888,24.0
75%,1754608.0,6.0,129.0,85.0,41.0,197.0,39.328921,0.621,35.0
max,1999997.0,14.0,192.0,117.0,92.0,796.0,56.034628,2.301594,77.0


Check data distributions for categorical features

In [7]:
df['Ethnicity'].value_counts()

White       5738
Asian       1436
Hispanic    1433
Other        893
Name: Ethnicity, dtype: int64

In [8]:
df['Diabetic'].value_counts()

No     6656
Yes    3344
Name: Diabetic, dtype: int64
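Absolute counts can hide how imbalanced the classes are. As a minimal sketch, `value_counts(normalize=True)` reports each class as a fraction of the rows. The small series below is a made-up stand-in for the `Diabetic` column, not the tutorial data.

```python
import pandas as pd

# Toy stand-in for the `Diabetic` column (illustrative values only)
labels = pd.Series(['No', 'No', 'Yes', 'No'], name='Diabetic')

# Absolute number of rows per class
counts = labels.value_counts()

# Fraction of rows per class (the fractions sum to 1)
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

The same call should work on `df['Diabetic']` or `df['Ethnicity']`; passing `dropna=False` additionally lists missing values as their own category.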

### Check missing data

Missing data are loaded as `NaN` (Not a Number) in a pandas dataframe. Pandas treats `missing`, `null`, `not available` and `NA` values in a similar way. Count the `missing` values for each feature

In [9]:
df.isnull().sum()

PatientID                    0
Pregnancies                500
PlasmaGlucose                0
DiastolicBloodPressure       0
TricepsThickness          1500
SerumInsulin                 0
BMI                          0
DiabetesPedigree           500
Age                          0
Diabetic                     0
Ethnicity                  500
dtype: int64

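Count the total number of `null` values across the whole dataframe. A row with several `null` values is counted once per missing cell, so this is the number of missing cells, not the number of affected rows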

In [10]:
df.isnull().values.ravel().sum()

3000
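The count above is per cell. To count rows that contain at least one missing value instead, one option is to reduce each row with `any(axis=1)` before summing. The sketch below uses a small made-up dataframe (`toy`) to show the difference between the two counts.

```python
import numpy as np
import pandas as pd

# Toy dataframe (illustrative values only): row 1 has two missing cells
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', None, 'z', 'w'],
})

# Every missing cell is counted once
total_missing_cells = int(toy.isnull().sum().sum())

# Each row is counted at most once, however many cells it is missing
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(total_missing_cells, rows_with_missing)
```

Applied to the tutorial dataframe, the row-based expression would be `df.isnull().any(axis=1).sum()`.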

## Process Missing data

Depending on the missing data category and number of missing values, you can choose to:
* remove instances/rows with missing values
* remove features/columns 
* impute the missing values

### Option 1: Instance deletion (remove instances/rows)

The pandas dataframe method `dropna()` removes every row that has at least one `missing` value.

In [11]:
df = df.dropna()

Check whether any `null` values remain

In [12]:
df.isnull().values.ravel().sum()

0

Check the info of the remaining data

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7287 entries, 1 to 9999
Data columns (total 11 columns):
PatientID                 7287 non-null int64
Pregnancies               7287 non-null float64
PlasmaGlucose             7287 non-null int64
DiastolicBloodPressure    7287 non-null int64
TricepsThickness          7287 non-null float64
SerumInsulin              7287 non-null int64
BMI                       7287 non-null float64
DiabetesPedigree          7287 non-null float64
Age                       7287 non-null int64
Diabetic                  7287 non-null object
Ethnicity                 7287 non-null object
dtypes: float64(4), int64(5), object(2)
memory usage: 683.2+ KB
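Dropping every incomplete row removed more than a quarter of the data here. `dropna()` also accepts `subset` and `thresh` arguments that make the deletion less aggressive. The following is a small sketch on a made-up dataframe; the column names are borrowed from the tutorial, but the values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy dataframe (illustrative values only)
toy = pd.DataFrame({
    'Pregnancies': [1.0, np.nan, 3.0, np.nan],
    'TricepsThickness': [20.0, 30.0, np.nan, np.nan],
    'Age': [25, 31, 40, 52],
})

# Drop a row only if `Pregnancies` is missing; other columns are ignored
only_pregnancies = toy.dropna(subset=['Pregnancies'])

# Keep rows that have at least 2 non-missing values
at_least_two = toy.dropna(thresh=2)

print(len(only_pregnancies), len(at_least_two))
```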


### Option 2: Feature deletion (remove features/columns)

Load the data again (we removed the rows with `missing` values in the previous section)

In [14]:
df = pd.read_csv('diabetes_mod.csv')

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
PatientID                 10000 non-null int64
Pregnancies               9500 non-null float64
PlasmaGlucose             10000 non-null int64
DiastolicBloodPressure    10000 non-null int64
TricepsThickness          8500 non-null float64
SerumInsulin              10000 non-null int64
BMI                       10000 non-null float64
DiabetesPedigree          9500 non-null float64
Age                       10000 non-null int64
Diabetic                  10000 non-null object
Ethnicity                 9500 non-null object
dtypes: float64(4), int64(5), object(2)
memory usage: 859.5+ KB


Drop the column `DiabetesPedigree` from the dataframe

In [17]:
df = df.drop(columns=['DiabetesPedigree'])

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
PatientID                 10000 non-null int64
Pregnancies               9500 non-null float64
PlasmaGlucose             10000 non-null int64
DiastolicBloodPressure    10000 non-null int64
TricepsThickness          8500 non-null float64
SerumInsulin              10000 non-null int64
BMI                       10000 non-null float64
Age                       10000 non-null int64
Diabetic                  10000 non-null object
Ethnicity                 9500 non-null object
dtypes: float64(3), int64(5), object(2)
memory usage: 781.3+ KB
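Feature deletion is usually reserved for columns where a large share of the values is missing. One way to select such columns, sketched below on a made-up dataframe, is to compute the missing fraction per column and drop those above a cut-off. The `threshold` of 0.5 is an arbitrary assumption for illustration, not a recommendation from this tutorial.

```python
import numpy as np
import pandas as pd

# Toy dataframe (illustrative values only)
toy = pd.DataFrame({
    'A': [1.0, np.nan, np.nan, np.nan],  # 75% missing
    'B': [1.0, 2.0, np.nan, 4.0],        # 25% missing
    'C': [1, 2, 3, 4],                   # complete
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_fraction = toy.isnull().mean()

# Hypothetical cut-off: drop columns with more than half of the values missing
threshold = 0.5
to_drop = missing_fraction[missing_fraction > threshold].index

reduced = toy.drop(columns=to_drop)
print(list(reduced.columns))
```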


### Option 3: Impute missing data

You can impute data in 2 primary ways:
* Univariate - impute missing values of a feature based on that feature only. You can do so by different ways (not exhaustive):
    * constant - a constant value for all missing
    * mean - mean of the non-missing values
    * median - median of the non-missing values
    * mode - mode or most frequent value of the non-missing values
    * interpolate - based on the relationship between data within the feature
* Multivariate - impute missing values of a feature based on all/selected features in the dataframe
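Two of the options listed above are not demonstrated in the cells that follow, so here is a minimal sketch of each on made-up data. The first part uses `Series.interpolate()`, which is linear by default. The second part fills a missing `BMI` with the median of rows sharing the same `Ethnicity`, which is only a very simple way of using a second feature and is an assumption chosen for illustration; dedicated multivariate imputers are more involved.

```python
import numpy as np
import pandas as pd

# Interpolate: fill gaps from neighbouring values within the same feature
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
interpolated = s.interpolate()  # linear interpolation by default

# Using another feature: fill BMI with the median BMI of the same Ethnicity
toy = pd.DataFrame({
    'Ethnicity': ['White', 'White', 'Asian', 'Asian', 'White'],
    'BMI': [30.0, np.nan, 22.0, np.nan, 34.0],
})
group_median = toy.groupby('Ethnicity')['BMI'].transform('median')
toy['BMI'] = toy['BMI'].fillna(group_median)

print(interpolated.tolist())
print(toy)
```

Interpolation only makes sense when the row order is meaningful (for example, a time series); for an unordered table such as this one, a statistic-based fill is usually the safer choice.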
        

Load the data again (we removed the `DiabetesPedigree` column in the previous section)

In [19]:
df = pd.read_csv('diabetes_mod.csv')

Check data

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
PatientID                 10000 non-null int64
Pregnancies               9500 non-null float64
PlasmaGlucose             10000 non-null int64
DiastolicBloodPressure    10000 non-null int64
TricepsThickness          8500 non-null float64
SerumInsulin              10000 non-null int64
BMI                       10000 non-null float64
DiabetesPedigree          9500 non-null float64
Age                       10000 non-null int64
Diabetic                  10000 non-null object
Ethnicity                 9500 non-null object
dtypes: float64(4), int64(5), object(2)
memory usage: 859.5+ KB


Impute `DiabetesPedigree` with a constant value `0.2`

In [21]:
df['DiabetesPedigree'] = df['DiabetesPedigree'].fillna(0.2)

Impute `TricepsThickness` with the `median` (replace `.median()` with `.mean()` to impute with the mean instead)

In [22]:
df['TricepsThickness'] = df['TricepsThickness'].fillna(df['TricepsThickness'].median())
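Mean and median imputation differ only in the statistic passed to `fillna()`, but the filled value can differ a lot when the feature is skewed. The sketch below uses a small made-up series with one large value to show the effect.

```python
import numpy as np
import pandas as pd

# Toy series with one outlier (illustrative values only)
toy = pd.Series([10.0, 12.0, np.nan, 50.0])

# The mean is pulled towards the outlier: (10 + 12 + 50) / 3
mean_filled = toy.fillna(toy.mean())

# The median is the middle value of 10, 12, 50
median_filled = toy.fillna(toy.median())

print(mean_filled[2], median_filled[2])
```

Both `mean()` and `median()` skip missing values by default, so the statistic is computed from the non-missing entries only.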

Impute `Pregnancies` with `median`

In [23]:
df['Pregnancies'] = df['Pregnancies'].fillna(df['Pregnancies'].median())

Impute `Ethnicity` with the `mode` (most frequent value)

In [24]:
df['Ethnicity'] = df['Ethnicity'].fillna(df['Ethnicity'].mode()[0])

Check data

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
PatientID                 10000 non-null int64
Pregnancies               10000 non-null float64
PlasmaGlucose             10000 non-null int64
DiastolicBloodPressure    10000 non-null int64
TricepsThickness          10000 non-null float64
SerumInsulin              10000 non-null int64
BMI                       10000 non-null float64
DiabetesPedigree          10000 non-null float64
Age                       10000 non-null int64
Diabetic                  10000 non-null object
Ethnicity                 10000 non-null object
dtypes: float64(4), int64(5), object(2)
memory usage: 859.5+ KB


## Process available data

In [28]:
df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Ethnicity
0,1354778,2.0,171,80,34.0,23,43.509726,1.213191,21,No,White
1,1147438,8.0,92,93,47.0,36,21.240576,0.158365,23,No,White
2,1640031,7.0,115,47,52.0,35,41.511523,0.2,23,No,White
3,1883350,9.0,103,78,25.0,304,29.582192,1.28287,43,Yes,White
4,1424119,1.0,85,59,27.0,35,42.604536,0.549542,22,No,Hispanic


### Convert categorical value to numeric

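Convert categorical values to user-defined numeric values. The keys of the mapping must match the values stored in the column exactly. `Diabetic` holds `No`/`Yes`, so the mapping below matches nothing and leaves the column unchanged, as the outputs that follow show. Use `{'No': 0, 'Yes': 1}` as the mapping to actually convert the column.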

In [29]:
df = df.replace({'Diabetic': {'Non-Diabetic': 0, 'Diabetic':1}})
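A mapping only takes effect for keys that occur in the column. The sketch below, on a made-up `Diabetic` column holding `No`/`Yes`, contrasts a mapping whose keys do not occur with one whose keys do. It uses `Series.map()` for the working conversion, which is a swapped-in alternative to `replace()` rather than the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for the `Diabetic` column (illustrative values only)
toy = pd.DataFrame({'Diabetic': ['No', 'Yes', 'No']})

# Keys that do not occur in the column: nothing is replaced
unmatched = toy.replace({'Diabetic': {'Non-Diabetic': 0, 'Diabetic': 1}})

# Keys that match the stored values: every value is converted
matched = toy['Diabetic'].map({'No': 0, 'Yes': 1})

print(unmatched['Diabetic'].tolist())
print(matched.tolist())
```

Unlike `replace()`, `map()` turns any value missing from the mapping into `NaN`, which makes an incomplete mapping easy to spot.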

In [30]:
# Convert Ethnicity to numeric labels:
# the following code does not change the data, because the result is not assigned back to df
df.replace({'Ethnicity': {'White': 0, 'Asian':1, 'Hispanic':2, 'Other':3}})

# to change the data, uncomment the following line
# df = df.replace({'Ethnicity': {'White': 0, 'Asian':1, 'Hispanic':2, 'Other':3}})

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Ethnicity
0,1354778,2.0,171,80,34.0,23,43.509726,1.213191,21,No,0
1,1147438,8.0,92,93,47.0,36,21.240576,0.158365,23,No,0
2,1640031,7.0,115,47,52.0,35,41.511523,0.200000,23,No,0
3,1883350,9.0,103,78,25.0,304,29.582192,1.282870,43,Yes,0
4,1424119,1.0,85,59,27.0,35,42.604536,0.549542,22,No,2
5,1619297,0.0,82,92,9.0,253,19.724160,0.103424,26,No,0
6,1660149,2.0,133,47,30.0,227,21.941357,0.174160,21,No,0
7,1458769,0.0,67,87,43.0,36,18.277723,0.236165,26,No,0
8,1201647,8.0,80,95,33.0,24,26.624929,0.443947,53,Yes,0
9,1403912,1.0,72,31,30.0,42,36.889576,0.103944,26,No,0


Check current data

In [31]:
df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Ethnicity
0,1354778,2.0,171,80,34.0,23,43.509726,1.213191,21,No,White
1,1147438,8.0,92,93,47.0,36,21.240576,0.158365,23,No,White
2,1640031,7.0,115,47,52.0,35,41.511523,0.2,23,No,White
3,1883350,9.0,103,78,25.0,304,29.582192,1.28287,43,Yes,White
4,1424119,1.0,85,59,27.0,35,42.604536,0.549542,22,No,Hispanic


### One-hot encoding

You can use pandas `get_dummies()` to convert each unique value in a column into its own column with a 1/0 value (i.e. one-hot encoding)

In [32]:
import pandas as pd

df = pd.get_dummies(df, columns=["Ethnicity"], prefix=["Ethnicity"] )
df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Ethnicity_Asian,Ethnicity_Hispanic,Ethnicity_Other,Ethnicity_White
0,1354778,2.0,171,80,34.0,23,43.509726,1.213191,21,No,0,0,0,1
1,1147438,8.0,92,93,47.0,36,21.240576,0.158365,23,No,0,0,0,1
2,1640031,7.0,115,47,52.0,35,41.511523,0.2,23,No,0,0,0,1
3,1883350,9.0,103,78,25.0,304,29.582192,1.28287,43,Yes,0,0,0,1
4,1424119,1.0,85,59,27.0,35,42.604536,0.549542,22,No,0,1,0,0
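Two optional arguments of `get_dummies()` are often useful here. Recent pandas versions return boolean columns by default, so `dtype=int` is one way to get the 1/0 values shown above. `drop_first=True` drops the first category, since it is implied by the others. The sketch below uses a made-up `Ethnicity` column.

```python
import pandas as pd

# Toy stand-in for the `Ethnicity` column (illustrative values only)
toy = pd.DataFrame({'Ethnicity': ['White', 'Asian', 'Hispanic', 'White']})

# One column per category, encoded as integers 1/0
encoded = pd.get_dummies(toy, columns=['Ethnicity'], prefix='Ethnicity', dtype=int)

# Same encoding, but without the first (alphabetically sorted) category
reduced = pd.get_dummies(
    toy, columns=['Ethnicity'], prefix='Ethnicity', dtype=int, drop_first=True
)

print(list(encoded.columns))
print(list(reduced.columns))
```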


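## Other raw data formats

Besides CSV, pandas can load other formats, such as tab-separated values (TSV) with `read_csv(..., sep="\t")` and JSON with `read_json()`.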

In [33]:
df = pd.read_csv('diabetes_mod.tsv',sep="\t")

In [34]:
df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Ethnicity
0,1354778,,171,80,34.0,23,43.509726,1.213191,21,No,White
1,1147438,8.0,92,93,47.0,36,21.240576,0.158365,23,No,White
2,1640031,7.0,115,47,52.0,35,41.511523,,23,No,White
3,1883350,9.0,103,78,25.0,304,29.582192,1.28287,43,Yes,White
4,1424119,1.0,85,59,27.0,35,42.604536,0.549542,22,No,Hispanic


In [35]:
df = pd.read_json('diabetes_mod.json')

In [36]:
df.head()

Unnamed: 0,Age,BMI,DiabetesPedigree,Diabetic,DiastolicBloodPressure,Ethnicity,PatientID,PlasmaGlucose,Pregnancies,SerumInsulin,TricepsThickness
0,21,43.509726,1.213191,No,80,White,1354778,171,,23,34.0
1,23,21.240576,0.158365,No,93,White,1147438,92,8.0,36,47.0
2,23,41.511523,,No,47,White,1640031,115,7.0,35,52.0
3,43,29.582192,1.28287,Yes,78,White,1883350,103,9.0,304,25.0
4,22,42.604536,0.549542,No,59,Hispanic,1424119,85,1.0,35,27.0
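To try these readers without the tutorial files, the same calls can be pointed at in-memory text. The sketch below wraps two made-up snippets in `io.StringIO`; the records are illustrative and are not taken from `diabetes_mod.tsv` or `diabetes_mod.json`.

```python
import io

import pandas as pd

# Made-up TSV content: columns are separated by tab characters
tsv_text = "PatientID\tAge\tDiabetic\n1\t21\tNo\n2\t43\tYes\n"
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Made-up JSON content: a list of records, one object per row
json_text = (
    '[{"PatientID": 1, "Age": 21, "Diabetic": "No"},'
    ' {"PatientID": 2, "Age": 43, "Diabetic": "Yes"}]'
)
from_json = pd.read_json(io.StringIO(json_text))

print(from_tsv)
print(from_json)
```

Both calls produce the same rows, although the column order of a dataframe read from JSON can depend on the pandas version, as the alphabetically sorted output above suggests.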
