# Lesson 3

1. In real-world data, tuples with *missing values* for some attributes are a common occurrence. Describe various methods for handling this problem.

    > There are six common ways to handle it:
    > 1. Ignore the incomplete tuples.
    > 2. Fill in the missing value manually.
    > 3. Use a global constant to fill in the missing value.
    > 4. Use a measure of central tendency for the attribute (e.g., the *mean* or *median*) to fill in the missing value.
    > 5. Use the attribute *mean* or *median* for all samples belonging to the same class as the given tuple.
    > 6. Use the most probable value to fill in the missing value.
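
Several of the strategies above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the records, the `income` and `risk` field names and all helper names are hypothetical, and `None` stands in for a missing value.

```python
from statistics import mean

# Hypothetical records: `income` is sometimes missing (None),
# `risk` plays the role of the class label.
records = [
    {"income": 30.0, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 50.0, "risk": "low"},
    {"income": 90.0, "risk": "high"},
    {"income": None, "risk": "high"},
]


def drop_incomplete(rows, attr):
    """Method 1: ignore tuples whose value for `attr` is missing."""
    return [row for row in rows if row[attr] is not None]


def fill_constant(rows, attr, constant):
    """Method 3: replace every missing value with one global constant."""
    return [
        {**row, attr: constant if row[attr] is None else row[attr]}
        for row in rows
    ]


def fill_mean(rows, attr):
    """Method 4: replace missing values with the mean of the observed values."""
    observed = mean(row[attr] for row in rows if row[attr] is not None)
    return fill_constant(rows, attr, observed)


def fill_class_mean(rows, attr, label):
    """Method 5: replace missing values with the mean of the same class."""
    filled = []
    for row in rows:
        if row[attr] is None:
            peers = [
                r[attr] for r in rows
                if r[label] == row[label] and r[attr] is not None
            ]
            row = {**row, attr: mean(peers)}
        filled.append(row)
    return filled


print(fill_class_mean(records, "income", "risk"))
```

Every helper returns new dictionaries, so the original records stay untouched and the strategies can be compared side by side.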

2. *Exercise 2.2* gave the following data (in increasing order) for the attribute *age*: `13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70`.
    1. Use *smoothing by bin means* to smooth these data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
        > 1. Divide the sorted data into 9 buckets of 3 elements each (27 values, bin depth 3).
        > 2. Calculate the *mean* of each bucket.
        > 3. Replace every element with the *mean* of its bucket.
        >
        > | Bucket | Data_1 | Data_2 | Data_3 | Mean |
        > | :----- | -----: | -----: | -----: | ---: |
        > | 1      |     13 |     15 |     16 | 14.7 |
        > | 2      |     16 |     19 |     20 | 18.3 |
        > | 3      |     20 |     21 |     22 | 21.0 |
        > | 4      |     22 |     25 |     25 | 24.0 |
        > | 5      |     25 |     25 |     30 | 26.7 |
        > | 6      |     33 |     33 |     35 | 33.7 |
        > | 7      |     35 |     35 |     35 | 35.0 |
        > | 8      |     36 |     40 |     45 | 40.3 |
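        > | 9      |     46 |     52 |     70 | 56.0 |
        >
        > Smoothing by bin means removes the variation inside each bucket, so small fluctuations disappear. The cost is lost detail, and an extreme value shifts its neighbours: in bucket 9, the value 70 pulls 46 and 52 up to 56.0.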
    
    2. How might you determine *outliers* in the data?
        > 1. Using *clustering*. Intuitively, values that fall outside of the set of clusters may be considered outliers.
        > 2. Using the *five-number summary*. The previous chapters computed the five-number summary of the dataset. Taking the box-plot convention, the *minimum observation* is *Q1* - 1.5 × *IQR* and the *maximum observation* is *Q3* + 1.5 × *IQR*. A value smaller than the former or larger than the latter may be considered an outlier.
    
    3. What other methods are there for data smoothing?
        > Besides smoothing by bin means, there are other binning variants (smoothing by bin medians or by bin boundaries), regression, and clustering.
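
Two ideas from this exercise, smoothing by bin means and flagging outliers from the quartiles, can be sketched with the standard library. The function names are hypothetical. The 1.5 × IQR fences and the `inclusive` quantile method are assumptions of this sketch; other quartile conventions give slightly different fences.

```python
from statistics import mean, quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def smooth_by_bin_means(sorted_values, depth):
    """Replace each value with the mean of its equal-depth bin."""
    smoothed = []
    for start in range(0, len(sorted_values), depth):
        bin_values = sorted_values[start:start + depth]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


smoothed = smooth_by_bin_means(ages, depth=3)
print([round(value, 1) for value in smoothed])
print(iqr_outliers(ages))
```

On the *age* data, the quartiles under this convention are 20.5 and 35, so the fences are -1.25 and 56.75 and only 70 is flagged.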

3. Discuss issues to consider during *data integration*.
    > Issues to consider include the *Entity Identification Problem*, *Redundancy*, *Tuple Duplication* and *Data Value Conflict*.
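
Two of these issues lend themselves to a quick check. The sketch below assumes that redundancy between two numeric attributes is probed with the Pearson correlation coefficient (a common choice, not the only one) and that tuple duplication means exact repeats. The attribute names and data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def duplicated_tuples(rows):
    """Return the tuples that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for row in rows:
        if row in seen and row not in duplicates:
            duplicates.append(row)
        seen.add(row)
    return duplicates


# Hypothetical attributes: the annual salary is derivable from the monthly
# salary, so one of the two is redundant.
monthly_salary = [2000, 2500, 3000, 4500]
annual_salary = [12 * m for m in monthly_salary]
print(pearson(monthly_salary, annual_salary))

rows = [("Ann", 2000), ("Bob", 2500), ("Ann", 2000)]
print(duplicated_tuples(rows))
```

A coefficient close to 1 or -1 suggests that one attribute can be derived from the other. Whether to drop it remains a judgment call.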

4. What are the value ranges of the following *normalization methods*?

    1. min-max normalization
    2. z-score normalization
    3. z-score normalization using the mean absolute deviation instead of standard deviation.
    4. normalization by decimal scaling.

> *Min-max normalization* maps all values into the target range `[new_min, new_max]`, which is `[0, 1]` by default.
>
> The value range of *z-score normalization* depends on the *mean* and *standard deviation* of the raw data: it is `[(min - mean) / std, (max - mean) / std]`. The normalized values have mean 0 and standard deviation 1, but there are no fixed bounds.
>
> Likewise, the value range of *z-score normalization using the mean absolute deviation instead of standard deviation* is `[(min - mean) / s, (max - mean) / s]`, where `s` is the mean absolute deviation, so it has no fixed bounds either.
>
> *Normalization by decimal scaling* maps all values into the open range `(-1, 1)`.
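
For reference, the usual textbook definitions behind these ranges can be written as follows. The notation is an assumption made here rather than taken from the exercise: $v_i$ is a value of attribute $A$ with $n$ values, $\bar{A}$ its mean, $\sigma_A$ its standard deviation and $s_A$ its mean absolute deviation.

$$
\begin{aligned}
\text{min-max:}\quad & v_i' = \frac{v_i - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \\
\text{z-score:}\quad & v_i' = \frac{v_i - \bar{A}}{\sigma_A} \\
\text{z-score with mean absolute deviation:}\quad & v_i' = \frac{v_i - \bar{A}}{s_A}, \qquad s_A = \frac{1}{n}\sum_{k=1}^{n}\lvert v_k - \bar{A}\rvert \\
\text{decimal scaling:}\quad & v_i' = \frac{v_i}{10^{j}}, \qquad j = \min\{\, m \in \mathbb{Z}_{\ge 0} : \max_k \lvert v_k \rvert < 10^{m} \,\}
\end{aligned}
$$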

5. Use these methods to *normalize* the following group of data: $$200, 300, 400, 600, 1000$$

    1. min-max normalization by setting *min* = 0 and *max* = 1
    2. z-score normalization
    3. z-score normalization using the mean absolute deviation instead of standard deviation.
    4. normalization by decimal scaling.

```python
import numpy
import pandas

array = numpy.array([200, 300, 400, 600, 1000])
frame = pandas.DataFrame({'Raw Data': array})

# Min-max normalization onto [0, 1].
minimum = array.min()
maximum = array.max()
frame['Min-max'] = (array - minimum) / (maximum - minimum)

# Z-score normalization. numpy's std() defaults to the population
# standard deviation (ddof=0).
average = array.mean()
standard_deviation = array.std()
frame['Z-score.std'] = (array - average) / standard_deviation

# Z-score variant that divides by the mean absolute deviation.
absolute_deviation = numpy.abs(array - average).mean()
frame['Z-score.abs'] = (array - average) / absolute_deviation

# Decimal scaling: j = 4 is the smallest exponent with max(|v|) / 10**j < 1,
# because 1000 / 10**3 equals 1, which is not below 1.
frame['Dec-Scale'] = array / 10 ** 4
frame
```

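|     | Raw Data | Min-max | Z-score.std | Z-score.abs | Dec-Scale |
| :-- | -------: | ------: | ----------: | ----------: | --------: |
| 0   |      200 |     0.0 |    -1.06066 |       -1.25 |      0.02 |
| 1   |      300 |   0.125 |   -0.707107 |   -0.833333 |      0.03 |
| 2   |      400 |    0.25 |   -0.353553 |   -0.416667 |      0.04 |
| 3   |      600 |     0.5 |    0.353553 |    0.416667 |      0.06 |
| 4   |     1000 |     1.0 |    1.767767 |    2.083333 |       0.1 |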
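
The divisor `10 ** 4` used for decimal scaling is specific to this data set. As a minimal sketch, the exponent could be derived from the data instead. `decimal_scaling` is a hypothetical helper name, and the loop assumes the usual definition: the smallest non-negative integer `j` for which every scaled absolute value is below 1.

```python
def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |v| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    # Increase the exponent until the largest magnitude drops below 1.
    while largest / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


j, scaled = decimal_scaling([200, 300, 400, 600, 1000])
print(j, scaled)
```

For the data above the loop stops at `j = 4`, because 1000 / 10³ is exactly 1 and therefore not yet inside the open range (-1, 1).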


6. Suppose a group of 12 *sales price* records has been sorted as follows: $$5,10,11,13,15,35,50,55,72,92,204,215$$ Partition them into three bins by each of the following methods:

    1. equal-frequency (equal-depth) partitioning.
    2. equal-width partitioning
    3. clustering

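```python
import pandas
from sklearn.cluster import KMeans

frame = pandas.DataFrame({
    'Data': [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
})

# Equal-frequency partitioning: 12 values in 3 bins, so 4 values per bin.
frame['EFP'] = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
# Equal-width partitioning: width (215 - 5) / 3 = 70, giving the intervals
# [5, 75), [75, 145) and [145, 215].
frame['EWP'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2]

# Clustering: k-means with 3 clusters on the single attribute. The cluster
# numbers are arbitrary labels and can change between runs.
model = KMeans(n_clusters=3)
cluster = model.fit_predict(frame.loc[:, ['Data']])
frame['Cluster'] = cluster
frame.style
```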

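|     | Data | EFP | EWP | Cluster |
| :-- | ---: | --: | --: | ------: |
| 0   |    5 |   0 |   0 |       1 |
| 1   |   10 |   0 |   0 |       1 |
| 2   |   11 |   0 |   0 |       1 |
| 3   |   13 |   0 |   0 |       1 |
| 4   |   15 |   1 |   0 |       1 |
| 5   |   35 |   1 |   0 |       1 |
| 6   |   50 |   1 |   0 |       2 |
| 7   |   55 |   1 |   0 |       2 |
| 8   |   72 |   2 |   0 |       2 |
| 9   |   92 |   2 |   1 |       2 |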
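
The `EFP` and `EWP` labels can also be computed rather than typed by hand. The sketch below is one possible way, with hypothetical function names. It assumes the number of values is divisible by the number of bins and that the data are not all equal. The equal-width intervals are left-closed, so a value lying exactly on an inner edge goes to the upper bin.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Assign bin indices so that every bin holds the same number of values.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    depth = len(sorted_values) // n_bins
    return [index // depth for index in range(len(sorted_values))]


def equal_width_bins(values, n_bins):
    """Assign bin indices using intervals of equal width over [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    # The maximum falls exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_frequency_bins(prices, 3))
print(equal_width_bins(prices, 3))
```

pandas offers `pandas.qcut` and `pandas.cut` for the same two tasks. They use right-closed intervals by default, so results can differ from this sketch for values that sit exactly on a bin edge.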
