<a href="https://colab.research.google.com/github/yohanesnuwara/machine_learning/blob/master/01_dataload/load_csv_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Loading Data**

Clone the repository that contains the datasets.

In [1]:
!git clone https://github.com/yohanesnuwara/machine-learning

Cloning into 'machine-learning'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 30 (delta 10), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (30/30), done.


## From Scratch

Create function `load_csv`

In [0]:
from csv import reader

# Load a CSV file into a list of rows (each row is a list of strings)
def load_csv(filename):
  dataset = list()
  with open(filename, 'r') as file:
    csv_reader = reader(file)
    for row in csv_reader:
      if not row: # skip empty rows
        continue
      dataset.append(row)
  return dataset
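
To try the loader without cloning the repository, the following is a minimal sketch that writes a few rows to a temporary file and reads them back. The sample text and the names `sample_text`, `sample_path` and `sample_rows` are made up for illustration; the function body mirrors `load_csv` above.

```python
import csv
import os
import tempfile

def load_csv(filename):
    """Read a CSV file into a list of rows, skipping empty rows."""
    dataset = []
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file):
            if not row:  # csv.reader yields [] for a blank line
                continue
            dataset.append(row)
    return dataset

# Hypothetical sample data with one blank line in the middle
sample_text = "6,148,72\n\n1,85,66\n"
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

sample_rows = load_csv(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(sample_rows)  # [['6', '148', '72'], ['1', '85', '66']]
```

The blank line is dropped and every value is still a string, which is why a conversion step is needed before modelling.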

Load the `pima-indians-diabetes` dataset using the function.

N.B.: The `pima-indians-diabetes` dataset is used to predict the onset of diabetes within 5 years in Pima Indians. Its columns are:
0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).

In [15]:
from csv import reader

# Load dataset
filename = '/content/machine-learning/datasets/pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset),
len(dataset[0])))

# other implementation
print('Loaded data file', filename, 'with', len(dataset), 'rows and', len(dataset[0]), 'columns')

Loaded data file /content/machine-learning/datasets/pima-indians-diabetes.csv with 768 rows and 9 columns
Loaded data file /content/machine-learning/datasets/pima-indians-diabetes.csv with 768 rows and 9 columns


Inspect the first row of the loaded dataset

In [16]:
print(dataset[0])

['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']


The dataset is a list of lists, and every value in a row is a string. Most machine learning algorithms operate on numbers, so the strings need to be converted to floating point numbers.

In [0]:
# Convert string column to float
def str_column_to_float(dataset, column):
  for row in dataset:
    row[column] = float(row[column].strip())

In [18]:
# convert string columns to float
for i in range(len(dataset[0])):
  str_column_to_float(dataset, i)
print(dataset[0])

[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
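
`str_column_to_float` raises a `ValueError` if a cell is empty or not numeric. The following is a rough sketch of a more forgiving conversion that turns such cells into `NaN` instead. The helper `to_float_or_nan` and the sample rows are hypothetical and not part of the original notebook.

```python
import math

def to_float_or_nan(text):
    """Convert a string to float; return NaN if it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return math.nan

# Hypothetical rows containing an empty cell and a non-numeric cell
rows = [['6', ' 148 ', ''], ['1', 'n/a', '66']]
converted = [[to_float_or_nan(value) for value in row] for row in rows]
print(converted)
```

Unlike the in-place version above, this builds a new list and leaves `rows` unchanged, which can make it easier to compare before and after.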


Load another dataset, `iris`

In [20]:
# Load dataset
filename = '/content/machine-learning/datasets/iris.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset),
len(dataset[0])))

Loaded data file /content/machine-learning/datasets/iris.csv with 150 rows and 5 columns


In [21]:
print(dataset[0])

['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']


The last column of the dataset contains strings (the class names). Most machine learning algorithms expect numeric inputs, so each distinct class name should be **mapped** to an integer (0, 1 or 2 for the three iris classes).

In [0]:
# Convert string column to integer
def str_column_to_int(dataset, column):
  class_values = [row[column] for row in dataset]
  unique = set(class_values)
  lookup = dict()
  for i, value in enumerate(unique):
    lookup[value] = i
  for row in dataset:
    row[column] = lookup[row[column]]
  return lookup

In [23]:
# convert string columns to float
for i in range(4):
  str_column_to_float(dataset, i)
# convert class column to int
lookup = str_column_to_int(dataset, 4)

print(lookup)

{'Iris-versicolor': 0, 'Iris-virginica': 1, 'Iris-setosa': 2}
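
The mapping above follows the iteration order of a `set`, which for strings can differ from one Python session to the next. If a reproducible mapping matters, one option is to sort the unique values first. This is a sketch under that assumption; `str_column_to_int_sorted` and the `toy` rows are illustrative names, not part of the original notebook.

```python
def str_column_to_int_sorted(dataset, column):
    """Map each distinct string in a column to an integer, in sorted order."""
    unique = sorted(set(row[column] for row in dataset))
    lookup = {value: i for i, value in enumerate(unique)}
    for row in dataset:
        row[column] = lookup[row[column]]  # replace the string in place
    return lookup

# Hypothetical rows: one feature plus the class name
toy = [[5.1, 'Iris-setosa'], [7.0, 'Iris-versicolor'],
       [6.3, 'Iris-virginica'], [4.9, 'Iris-setosa']]
toy_lookup = str_column_to_int_sorted(toy, 1)
print(toy_lookup)  # {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
```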


## Extension: Implementation with `Pandas` and `NumPy` to detect **missing values**

In [27]:
import pandas as pd

dataset = pd.read_csv('/content/machine-learning/datasets/pima-indians-diabetes.csv', header=None)

# describe data
dataset.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


Several columns have a `min` value of 0. For columns `1, 2, 3, 4, 5` (glucose, blood pressure, skinfold thickness, insulin and BMI) a value of 0 is not physically plausible, so these zeros are treated as missing values. Replace them with `NaN`.

In [32]:
import numpy as np

# mark zero values as missing (NaN); np.nan is used because the np.NaN alias was removed in NumPy 2.0
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.nan)

dataset.head(10)

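| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125.0 | 96.0 | NaN | NaN | NaN | 0.232 | 54 | 1 |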


Count the missing values. 

In [33]:
# count the number of NaN values in each column
print(dataset.isnull().sum())

0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
dtype: int64
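
Raw counts are easier to judge relative to the number of rows. As a small sketch on a made-up DataFrame (the `toy` values are hypothetical, not taken from the dataset), the mean of the boolean mask returned by `isnull()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: zeros in columns 1 and 2 stand for missing measurements
toy = pd.DataFrame({0: [6, 1, 8, 1], 1: [148, 0, 183, 89], 2: [72, 66, 0, 0]})
toy[[1, 2]] = toy[[1, 2]].replace(0, np.nan)

# True counts as 1 and False as 0, so the mean is the missing fraction
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

Applied to the counts above, column `4` would be missing in 374 of 768 rows, roughly 49%, which is worth knowing before deciding whether to drop rows.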


Drop the rows that contain missing values (`NaN`). The dataset starts with 768 rows, so the number of rows will be reduced.

In [35]:
# drop rows with missing values
dataset.dropna(inplace=True)
# summarize the number of rows and columns in the dataset
print('The number of observations after missing values deleted is now:', dataset.shape[0])

The number of observations after missing values deleted is now: 392
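
Dropping every row with any `NaN` removes almost half of the observations here. One possible alternative, sketched below on made-up data, is to restrict the check to selected columns with the `subset` parameter of `dropna`. The `toy` DataFrame and the choice of column `1` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 1 is mostly present, column 4 is mostly missing
toy = pd.DataFrame({1: [148.0, np.nan, 183.0], 4: [np.nan, 94.0, np.nan]})

all_complete = toy.dropna()                # a row must be complete in every column
glucose_complete = toy.dropna(subset=[1])  # only column 1 must be present

print(len(all_complete), len(glucose_complete))  # 0 2
```

Rows kept this way can still contain `NaN` in the other columns, so they need to be handled in a later step.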


Another way to handle missing data is **data imputation**, which replaces each missing value with a chosen substitute value.

## Extension: Data Imputation

There are many ways of imputing the data:
* A constant value that has meaning within the domain, such as 0, distinct from all other values.
* A value from another randomly selected record.
* A mean, median or mode value for the column.
* A value estimated by another predictive model.

For example, instead of deleting rows, the missing values in `dataset` can be filled with the mean of each column.

In [37]:
dataset = pd.read_csv('/content/machine-learning/datasets/pima-indians-diabetes.csv', header=None)
# mark zero values as missing (NaN); np.nan is used because the np.NaN alias was removed in NumPy 2.0
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.nan)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)

dataset.head(10)

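| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 155.548223 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 155.548223 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.15342 | 155.548223 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116.0 | 74.0 | 29.15342 | 155.548223 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115.0 | 72.405184 | 29.15342 | 155.548223 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125.0 | 96.0 | 29.15342 | 155.548223 | 32.457464 | 0.232 | 54 | 1 |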
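
The mean is only one of the options listed above. As a sketch of the median variant, the snippet below fills a made-up DataFrame with per-column medians, which are less sensitive to extreme values than the mean. The column names `glucose` and `insulin` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one extreme glucose value (500.0)
toy = pd.DataFrame({
    'glucose': [148.0, np.nan, 183.0, 89.0, 500.0],
    'insulin': [np.nan, 94.0, 168.0, np.nan, 88.0],
})

# median() skips NaN by default; fillna aligns the medians by column name
median_filled = toy.fillna(toy.median())

print(toy.mean()['glucose'], toy.median()['glucose'])  # 230.0 165.5
print(median_filled)
```

Here the single extreme value pulls the glucose mean up to 230.0, while the median stays at 165.5, so the two strategies would fill the gap with noticeably different numbers.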


Reference: https://machinelearningmastery.com/handle-missing-data-python/