## Noisy data

What is noisy data?
* Noisy data is meaningless data. 
* Cannot be understood and interpreted correctly by machines
* Can be caused by faulty data collection instruments, human or computer errors occurring at data entry, data transmission errors, etc

Techniques to handle noisy data?



### Binning
    - A data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall into a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization.
    - Binning methods smooth a **sorted** data value by consulting the values around it.
    - Because binning methods consult the values around it, they perform local smoothing
    - In general, the larger the width of bins, the greater the effect of the smoothing
    - Binning is also used as a discretization technique
    
    * Ways of binning
        * mean
        * boundary (replaced by the closer boudary)
        * median

In [55]:
# Binning
import numpy as np

data = np.random.randint(200, size=100)
data = np.sort(data)

bin1 = np.zeros((20, 5))
bin2 = np.zeros((20, 5))

for i in range(0, 100, 5):
    mean = np.sum( data[i : i + 5] ) / 5
    for j in range(5):
        bin1[int( i / 5 ), j] = mean
print('Binning using mean\n\n', bin1)

for i in range(0, 100, 5):
    for j in range(5):
        if data[i+j] - data[j] < data[i+4] - data[j]:
            bin2[int(i/5), j] = data[i]
        else:
            bin2[int(i/5), j] = data[i+4]
print('\nBinning using boundary\n\n', bin2)
    

Binning using mean

 [[  4.2   4.2   4.2   4.2   4.2]
 [ 14.8  14.8  14.8  14.8  14.8]
 [ 25.4  25.4  25.4  25.4  25.4]
 [ 34.8  34.8  34.8  34.8  34.8]
 [ 39.6  39.6  39.6  39.6  39.6]
 [ 47.8  47.8  47.8  47.8  47.8]
 [ 59.6  59.6  59.6  59.6  59.6]
 [ 68.6  68.6  68.6  68.6  68.6]
 [ 76.4  76.4  76.4  76.4  76.4]
 [ 86.   86.   86.   86.   86. ]
 [ 97.6  97.6  97.6  97.6  97.6]
 [113.4 113.4 113.4 113.4 113.4]
 [128.4 128.4 128.4 128.4 128.4]
 [141.6 141.6 141.6 141.6 141.6]
 [146.8 146.8 146.8 146.8 146.8]
 [155.8 155.8 155.8 155.8 155.8]
 [163.6 163.6 163.6 163.6 163.6]
 [169.6 169.6 169.6 169.6 169.6]
 [182.2 182.2 182.2 182.2 182.2]
 [190.6 190.6 190.6 190.6 190.6]]

Binning using boundary

 [[  1.   1.   1.   8.   8.]
 [  9.   9.   9.   9.  22.]
 [ 23.  23.  23.  23.  28.]
 [ 33.  33.  33.  33.  37.]
 [ 39.  39.  39.  39.  42.]
 [ 44.  44.  44.  51.  51.]
 [ 55.  55.  55.  64.  64.]
 [ 66.  66.  66.  66.  71.]
 [ 72.  72.  72.  79.  79.]
 [ 81.  81.  81.  81.  91.]
 [ 92.  92. 

### Regression

LOESS

* What is LOESS? - locally weighted running line smoother

![Image](https://miro.medium.com/max/1188/1*g91pI0qM-q4TjQrkgVWvhA.png)

*First of all, think of the red line as an ordered sequence of equally spaced x values, in this case between 0 and 2π. For each of these values, select an appropriate neighborhood of sampled points, and use them as the training set for a linear regression problem. With the resulting model, estimate the new value for your point.* [link](https://towardsdatascience.com/loess-373d43b03564)

* Package to use [statsmodels](https://www.statsmodels.org/dev/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html)

For LOESS, the input is the original x values. LOESS estimates the corresponding y values by using [local regression](https://en.wikipedia.org/wiki/Local_regression)

### Clustering

Will cover more in clustring section

### Reference

[1](https://datascience.stackexchange.com/questions/42014/how-to-handle-noisy-data)
[2](https://www.geeksforgeeks.org/python-binning-method-for-data-smoothing/)
[More reading on smoothing](https://rafalab.github.io/dsbook/smoothing.html)