# Unit 2: Data for AI

## Data

For this tutorial we'll use the <b>Diabetes</b> dataset (available at https://aka.ms/diabetes-data). The copy used here is a modified version of the original dataset.<br>

Available columns in the dataset:<br>
* Pregnancies: Number of times pregnant
* PlasmaGlucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* DiastolicBloodPressure: Diastolic blood pressure (mm Hg)
* TricepsThickness: Triceps skin fold thickness (mm)
* SerumInsulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2)
* DiabetesPedigree: Diabetes pedigree function
* Age: Age (years)
* Ethnicity: <span style="color:red">This feature is not available in the original dataset</span>
* Diabetic: Class variable (No or Yes) <span style="color:red">In the original dataset the class variable is 0/1</span>


A similar dataset is available at https://www.kaggle.com/datasets/mathchi/diabetes-data-set

## Load data into Python workspace

Load required libraries

In [None]:
import pandas as pd
import numpy as np

Load data from a comma-separated values (CSV) file

In [None]:
df = pd.read_csv('diabetes_mod.csv')

Check available features/columns

In [None]:
df.info()

Check the first few rows of the data

In [None]:
df.head()

Get a glimpse of the dataset, including the number of rows and columns

In [None]:
df

You can also check the dimensions of the dataframe with `dataframe.shape`

In [None]:
df.shape

### Check data distribution

Check data distribution for numeric columns

In [None]:
df.describe()

Check data distributions for categorical features

In [None]:
df['Ethnicity'].value_counts()

In [None]:
df['Diabetic'].value_counts()
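
By default `value_counts()` leaves missing values out of the counts. The sketch below, on a small made-up column (the values are illustrative and not taken from `diabetes_mod.csv`), shows how the `dropna` and `normalize` parameters change the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
ethnicity = pd.Series(['White', 'Asian', 'White', np.nan, 'Hispanic', 'White'])

# Default behaviour: missing values are excluded from the counts.
counts = ethnicity.value_counts()

# dropna=False adds an entry for the missing values.
counts_with_missing = ethnicity.value_counts(dropna=False)

# normalize=True returns proportions of the non-missing values instead of counts.
shares = ethnicity.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(shares)
```

Comparing the two totals is a quick way to notice that a categorical column contains missing values.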

### Check missing data

Missing data are loaded as `NaN` (Not a Number) in a pandas dataframe. Pandas treats `missing`, `null`, `not available` and `NA` values in the same way. You can check which values are `missing` with `isnull()`

In [None]:
df.isnull()

Count `missing` values for all features

In [None]:
df.isnull().sum()

Count the total number of `null` values in the dataframe

In [None]:
df.isnull().values.sum()

Count rows having at least one `null` value

In [None]:
np.sum(df.isnull().sum(axis=1)>0)
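
The three counts above can be checked by hand on a small example. The following is a minimal sketch on a made-up dataframe (`toy` and its values are assumptions for illustration); it also uses `any(axis=1)` as an equivalent way to flag rows with at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three missing cells spread over two rows.
toy = pd.DataFrame({
    'BMI': [21.5, np.nan, 30.1, 27.3],
    'Age': [25, 41, np.nan, 33],
    'Ethnicity': ['White', None, 'Asian', 'Other'],
})

# Missing values per column.
missing_per_column = toy.isnull().sum()

# Total number of missing cells in the dataframe.
total_missing = int(toy.isnull().values.sum())

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_per_column)
print(total_missing, rows_with_missing)
```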

## Process Missing data

Depending on the missing data category and number of missing values, you can choose to:
* remove instances/rows with missing values
* remove features/columns 
* impute the missing values

### Option 1: Instance deletion (remove instances/rows)

The pandas dataframe method `dropna()` removes all rows that have at least one `missing` value.

In [None]:
df = df.dropna()

Check whether any rows with `null` values remain

In [None]:
np.sum(df.isnull().values.sum(axis=1)>0)

Check available data info

In [None]:
df.info()
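
Dropping every row with any missing value can discard a lot of data. As a hedged sketch on made-up data (the `toy` dataframe is an assumption, not the tutorial file), the `subset` and `thresh` parameters of `dropna()` allow a more selective deletion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; row 3 is missing two of its three values.
toy = pd.DataFrame({
    'BMI': [21.5, np.nan, 30.1, np.nan],
    'Age': [25, 41, np.nan, np.nan],
    'Diabetic': ['No', 'Yes', 'No', 'Yes'],
})

# Drop rows with a missing value in any column.
drop_any = toy.dropna()

# Drop rows only when BMI is missing.
drop_if_bmi_missing = toy.dropna(subset=['BMI'])

# Keep rows that have at least two non-missing values.
keep_if_two_present = toy.dropna(thresh=2)

print(len(drop_any), len(drop_if_bmi_missing), len(keep_if_two_present))
```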

### Option 2: Feature deletion (remove features/columns)

Load the data again (we removed the rows having `missing` values in the previous section)

In [None]:
df = pd.read_csv('diabetes_mod.csv')

In [None]:
df.info()

Drop the column `TricepsThickness` from the dataframe

In [None]:
df = df.drop(columns=['TricepsThickness'])

In [None]:
df.info()

### Option 3: Impute missing data

You can impute data in two primary ways:
* Univariate - impute missing values of a feature based on that feature only. You can do so in different ways (not exhaustive):
    * constant - a constant value for all missing
    * mean - mean of the non-missing values
    * median - median of the non-missing values
    * mode - mode or most frequent value of the non-missing values
    * random - fill null values by random sampling of non-missing values
    * interpolate - based on the relationship between data within the feature
* Multivariate - impute missing values of a feature based on all/selected features in the dataframe
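
The random and interpolate options in the list above can be sketched as follows on a made-up series (the values and the seed are assumptions for illustration). Linear interpolation mostly makes sense when the rows have a meaningful order, such as measurements over time.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with three missing values.
s = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

# interpolate: fill gaps linearly between neighbouring observed values.
interpolated = s.interpolate()

# random: fill each gap with a value sampled from the observed values.
rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
observed = s.dropna()
mask = s.isnull()
random_filled = s.copy()
random_filled[mask] = rng.choice(observed.to_numpy(), size=mask.sum())

print(interpolated.tolist())
print(random_filled.tolist())
```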
        

Load the data again (we removed a column in the previous section)

In [None]:
df = pd.read_csv('diabetes_mod.csv')

Check data

In [None]:
df.info()

Impute `DiabetesPedigree` with a constant value `0.2`

In [None]:
df['DiabetesPedigree'] = df['DiabetesPedigree'].fillna(0.2)

Impute `TricepsThickness` with `mean`

In [None]:
df['TricepsThickness'] = df['TricepsThickness'].fillna(df['TricepsThickness'].mean())

Impute `Pregnancies` with `median`

In [None]:
df['Pregnancies'] = df['Pregnancies'].fillna(df['Pregnancies'].median())

Impute `Ethnicity` with the `mode` (most frequent value)

In [None]:
df['Ethnicity'] = df['Ethnicity'].fillna(df['Ethnicity'].mode()[0])

Check data

In [None]:
df.info()
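
The per-column imputations above can also be collected into one dictionary and applied in a single `fillna()` call. This is a minimal sketch on made-up data (the `toy` dataframe and its values are assumptions), not a replacement for the steps above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    'TricepsThickness': [20.0, np.nan, 30.0, 40.0],
    'Pregnancies': [0.0, 2.0, np.nan, 10.0],
    'Ethnicity': ['White', 'Asian', None, 'White'],
})

# One fill value per column, each computed from the non-missing values.
fill_values = {
    'TricepsThickness': toy['TricepsThickness'].mean(),
    'Pregnancies': toy['Pregnancies'].median(),
    'Ethnicity': toy['Ethnicity'].mode()[0],
}

# fillna accepts a dictionary that maps column names to fill values.
filled = toy.fillna(fill_values)
print(filled)
```

Keeping the fill values in a dictionary also makes it easy to reuse the same values on new data later.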

## Process available data

In [None]:
df.head()

### Convert categorical value to numeric

Convert categorical values to user-defined numeric values

In [None]:
df['Diabetic'] = df['Diabetic'].replace('No', 0)
df['Diabetic'] = df['Diabetic'].replace('Yes', 1)

# You can replace all categorical values at once by using a dictionary
# df['Diabetic'] = df['Diabetic'].replace({'No': 0, 'Yes': 1})

# You can also replace values per column across the whole dataframe with a nested dictionary
# df = df.replace({'Diabetic': {'No': 0, 'Yes': 1}})

In [None]:
df

In [None]:
# Convert Ethnicity to numeric labels:
# the following code will not change the data, because the result is not assigned back with the assignment operator (=)
df.replace({'Ethnicity': {'White': 0, 'Asian':1, 'Hispanic':2, 'Other':3}})

# to change the data, you can use the following code (after removing the comment)
# df = df.replace({'Ethnicity': {'White': 0, 'Asian': 1, 'Hispanic': 2, 'Other': 3}})
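
`replace()` leaves any value that is not in the dictionary unchanged, so a misspelled label can slip through unnoticed. As an alternative sketch (the toy series and its lowercase `'yes'` entry are made up for illustration), `Series.map()` turns unmapped values into `NaN`, which makes them easy to find.

```python
import pandas as pd

# Hypothetical toy labels; 'yes' is a deliberate inconsistency.
diabetic = pd.Series(['No', 'Yes', 'yes', 'No'])

mapping = {'No': 0, 'Yes': 1}

# map() returns NaN for every value that is missing from the mapping.
encoded = diabetic.map(mapping)

# List the original labels that were not covered by the mapping.
unmapped = diabetic[encoded.isnull()].unique()

print(encoded.tolist())
print(list(unmapped))
```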

Check current data

In [None]:
df.head()

### One-hot encoding

You can use pandas `get_dummies()` to convert each unique value in a column to its own column with 1/0 values (i.e. one-hot encoding)

In [None]:
import pandas as pd

df = pd.get_dummies(df, columns=["Ethnicity"], prefix=["Ethnicity"] )
df.head()
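
To see what `get_dummies()` produces, here is a minimal sketch on a made-up dataframe (the `toy` data is an assumption). Recent pandas versions return boolean indicator columns by default, so the sketch passes `dtype=int` to get 1/0 values.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    'Age': [25, 41, 33],
    'Ethnicity': ['White', 'Asian', 'White'],
})

# Each unique Ethnicity value becomes its own 1/0 column named Ethnicity_<value>.
encoded = pd.get_dummies(toy, columns=['Ethnicity'], prefix='Ethnicity', dtype=int)

print(encoded)
```

Each row has exactly one indicator column set to 1, and the original `Ethnicity` column is removed.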

### Save processed data to a csv file

In [None]:
df.to_csv('diabetes_mod2.csv')
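
By default `to_csv()` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below uses in-memory buffers and a made-up dataframe (both are assumptions, so no file is written) to compare the default with `index=False`.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the processed dataframe.
toy = pd.DataFrame({'Age': [25, 41], 'Diabetic': [0, 1]})

# Default: the index is written as the first, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```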

## Other raw data formats

In [None]:
df = pd.read_csv('diabetes_mod.tsv',sep="\t")

In [None]:
df.head()

In [None]:
df = pd.read_json('diabetes_mod.json')

In [None]:
df.head()
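
The cells above read a tab-separated values (TSV) file and a JSON file. As a self-contained sketch, the same two readers can be tried on small in-memory strings. The contents below are made up, and the JSON is assumed to be a list of records; the actual layout of `diabetes_mod.json` may differ.

```python
import io

import pandas as pd

# Hypothetical TSV content: columns separated by tab characters.
tsv_text = "Age\tBMI\tDiabetic\n25\t21.5\tNo\n41\t30.1\tYes\n"

# Hypothetical JSON content: a list of records, one object per row.
json_text = (
    '[{"Age": 25, "BMI": 21.5, "Diabetic": "No"},'
    ' {"Age": 41, "BMI": 30.1, "Diabetic": "Yes"}]'
)

# read_csv handles TSV when the separator is set to a tab.
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# read_json parses a list of records into one row per record.
from_json = pd.read_json(io.StringIO(json_text))

print(from_tsv)
print(from_json)
```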