## Statistical Imputation for Missing Values
#### A popular approach to data imputation is to calculate a statistic (such as the mean) for each column and replace every missing value in that column with it. The approach is popular because the statistic is easy to calculate from the training dataset and because it often results in good performance.
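
#### As a minimal sketch of the idea, the snippet below fills the gaps in a tiny, made-up DataFrame by hand. The frame `toy` and its column names `a` and `b` are assumptions for illustration only and are not part of the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: two numeric columns with one gap each
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, np.nan]})
# per-column mean; missing entries are skipped by default
col_means = toy.mean()
# replace each missing entry with the mean of its own column
filled = toy.fillna(col_means)
print(filled)
```

#### Passing a Series to `fillna` matches its index against the column names, so each column receives its own statistic.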

#### The horse colic dataset (horse-colic.csv) has missing values in many of its columns, and each missing value is marked with a question mark character ("?").
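
#### To illustrate why the marker matters when loading, here is a small sketch on a made-up two-row CSV string (the string `raw` is hypothetical, not taken from horse-colic.csv). Without `na_values`, a column containing "?" is read as text; with it, the same column is numeric and the marker becomes NaN.

```python
from io import StringIO
import pandas as pd

# hypothetical two-row CSV that uses '?' as the missing-value marker
raw = "2,?,38.5\n1,88,?\n"
# default load: '?' is kept as a string, so the affected columns are not numeric
as_text = pd.read_csv(StringIO(raw), header=None)
# telling pandas that '?' means missing turns those entries into NaN
as_nan = pd.read_csv(StringIO(raw), header=None, na_values='?')
print(as_text.dtypes)
print(as_nan.isnull().sum())
```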

In [74]:
import pandas as pd
# specify na_values so that '?' entries are loaded as missing (NaN)
df = pd.read_csv('horse-colic.csv', header=None, na_values='?')
df.head()

     0  1        2     3      4     5    6    7    8    9  ...    18    19   20   21   22  23     24  25  26  27
0  2.0  1   530101  38.5   66.0  28.0  3.0  3.0  NaN  2.0  ...  45.0   8.4  NaN  NaN  2.0   2  11300   0   0   2
1  1.0  1   534817  39.2   88.0  20.0  NaN  NaN  4.0  1.0  ...  50.0  85.0  2.0  2.0  3.0   2   2208   0   0   2
2  2.0  1   530334  38.3   40.0  24.0  1.0  1.0  3.0  1.0  ...  33.0   6.7  NaN  NaN  1.0   2      0   0   0   1
3  1.0  9  5290409  39.1  164.0  84.0  4.0  1.0  6.0  2.0  ...  48.0   7.2  3.0  5.3  2.0   1   2208   0   0   1
4  2.0  1   530255  37.3  104.0  35.0  NaN  NaN  6.0  2.0  ...  74.0   7.4  NaN  NaN  2.0   2   4300   0   0   2


In [75]:
# displaying missing values quickly
df.isnull().sum()

0       1
1       0
2       0
3      60
4      24
5      58
6      56
7      69
8      47
9      32
10     55
11     44
12     56
13    104
14    106
15    247
16    102
17    118
18     29
19     33
20    165
21    198
22      1
23      0
24      0
25      0
26      0
27      0
dtype: int64

In [76]:
# displaying missing values in all columns with percentage
for i in range(df.shape[1]):
    # counting number of rows with missing values in column i
    # (df[i] yields a scalar count; df[[i]] would yield a one-element Series)
    n_miss = df[i].isnull().sum()
    perc = n_miss / df.shape[0] * 100
    print('%d, Missing_Values= %d (%.1f%%)' % (i, n_miss, perc))

0, Missing_Values= 1 (0.3%)
1, Missing_Values= 0 (0.0%)
2, Missing_Values= 0 (0.0%)
3, Missing_Values= 60 (20.0%)
4, Missing_Values= 24 (8.0%)
5, Missing_Values= 58 (19.3%)
6, Missing_Values= 56 (18.7%)
7, Missing_Values= 69 (23.0%)
8, Missing_Values= 47 (15.7%)
9, Missing_Values= 32 (10.7%)
10, Missing_Values= 55 (18.3%)
11, Missing_Values= 44 (14.7%)
12, Missing_Values= 56 (18.7%)
13, Missing_Values= 104 (34.7%)
14, Missing_Values= 106 (35.3%)
15, Missing_Values= 247 (82.3%)
16, Missing_Values= 102 (34.0%)
17, Missing_Values= 118 (39.3%)
18, Missing_Values= 29 (9.7%)
19, Missing_Values= 33 (11.0%)
20, Missing_Values= 165 (55.0%)
21, Missing_Values= 198 (66.0%)
22, Missing_Values= 1 (0.3%)
23, Missing_Values= 0 (0.0%)
24, Missing_Values= 0 (0.0%)
25, Missing_Values= 0 (0.0%)
26, Missing_Values= 0 (0.0%)
27, Missing_Values= 0 (0.0%)


#### We can see that some columns (e.g. column indexes 1 and 2) have no missing values and other columns (e.g. column indexes 15 and 21) have many or even a majority of missing values.
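
#### The same per-column percentages can also be computed without a loop. The sketch below uses a small made-up frame (`toy` is hypothetical) and assumes a 50% cutoff purely for illustration; the cutoff is not something this notebook prescribes.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with integer column labels, as with header=None
toy = pd.DataFrame({
    0: [1.0, np.nan, np.nan, 4.0],
    1: [1.0, 2.0, 3.0, 4.0],
    2: [np.nan, np.nan, np.nan, 1.0],
})
# the mean of a boolean mask is the fraction of missing rows per column
pct_missing = toy.isnull().mean() * 100
# columns where more than half of the rows are missing (assumed cutoff)
mostly_missing = pct_missing[pct_missing > 50].index.tolist()
print(pct_missing)
print(mostly_missing)
```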

#### There are many fields we could select to predict in this dataset. In this case, we will predict whether the problem was surgical or not (column index 23), making it a binary classification problem.

In [77]:
from numpy import isnan
from sklearn.impute import SimpleImputer
# imputing missing values using mean
imputer = SimpleImputer(strategy='mean')
# separating labels (col 23) from features
ix = [i for i in range(df.shape[1]) if i != 23]
# splitting input and output
X, y = df.values[:, ix], df.values[:, 23]
# the imputer is fit on a dataset to calculate the statistic for each column
imputer.fit(X)
# the fit imputer is then applied to a dataset to create a copy
# in which every missing value is replaced with its column's statistic
Xtrans = imputer.transform(X)
# total missing values before and after imputation
print('Missing_values= %d' % sum(isnan(X).flatten()))
print('Missing_values= %d' % sum(isnan(Xtrans).flatten()))

Missing_values= 1605
Missing_values= 0


#### Running the example first loads the dataset and reports the total number of missing values in the dataset as 1,605.

#### The transform is configured, fit, and performed and the resulting new dataset has no missing values, confirming it was performed as we expected.

#### Each missing value was replaced with the mean value of its column.
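
#### One way to check this is to inspect the fitted imputer's `statistics_` attribute, which holds the value learned for each column. The sketch below does so on a small made-up array (`X_toy` and `X_new` are hypothetical) and also shows that a later `transform` call reuses the stored means rather than recomputing them.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical toy inputs: two columns with one gap each
X_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
imputer = SimpleImputer(strategy='mean')
# fit learns one mean per column, ignoring NaN entries
X_filled = imputer.fit_transform(X_toy)
print(imputer.statistics_)
print(X_filled)
# new data is filled with the means learned during fit
X_new = np.array([[np.nan, np.nan]])
print(imputer.transform(X_new))
```

#### Because the statistics are stored at fit time, the imputer can be fit on training rows only and then applied unchanged to held-out rows.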