## wk1_1 The overall process of the Cross-industry standard process for data mining (CRISP-DM)
https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining

<img src="img/data_mining.png" width="50%" />




## Data Cleansing + Feature Engineering =
<img src="img/Data-Preparation.jpg" width= "70%" />


## 1.1 Steps to follow 

### 1.1.1 Data Preprocessing
Data preprocessing is a technique for converting raw data into a clean data set. Data gathered from different sources arrives in a raw format that is not suitable for analysis.
Therefore, a series of steps is applied to turn the raw data into a smaller, clean data set. Preprocessing is performed before __Iterative Analysis__ begins.
> ### + Data Validation - Quality
> ### + Data Cleansing 
> Material to read https://en.wikipedia.org/wiki/Data_cleansing

>> #### Dealing with Missing Values
- dropping instances
- dropping attributes
- imputing the attribute mean for all missing values
- imputing the attribute median for all missing values
- imputing the attribute mode for all missing values
- using regression to impute attribute missing values
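
The first five options above can be sketched with NumPy and scikit-learn's `SimpleImputer`. This is a minimal illustration on a hypothetical toy array (the values are made up, not taken from any data set in these notes); regression-based imputation is not covered here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: rows are instances, columns are attributes.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, np.nan, 6.0],
              [np.nan, 30.0, 7.0],
              [4.0, 40.0, 8.0]])

# Dropping instances: keep only the rows without any missing value.
rows_kept = X[~np.isnan(X).any(axis=1)]

# Dropping attributes: keep only the columns without any missing value.
cols_kept = X[:, ~np.isnan(X).any(axis=0)]

# Imputing the attribute mean, median, or mode for all missing values.
X_mean = SimpleImputer(strategy='mean').fit_transform(X)
X_median = SimpleImputer(strategy='median').fit_transform(X)
X_mode = SimpleImputer(strategy='most_frequent').fit_transform(X)

print(rows_kept.shape, cols_kept.shape)
print(X_mean)
```

Dropping is the simplest choice but discards information; imputation keeps every row at the cost of introducing estimated values.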

>> #### Dealing with Outliers
- univariate: detect with a boxplot
- multivariate: detect with regression
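
A rough sketch of both ideas, on hypothetical data. The univariate part applies the usual boxplot rule (points beyond 1.5 times the interquartile range from the quartiles); the multivariate part fits a linear regression and flags points with unusually large residuals. The 1.5 and 2.0 cut-offs are common conventions assumed here, not values prescribed by these notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Univariate: boxplot (IQR) rule on a hypothetical sample.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
univariate_outliers = values[(values < lower) | (values > upper)]

# Multivariate: regress y on x and flag large standardized residuals.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0
y[7] = 40.0  # injected outlier for illustration
model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)
z_scores = (residuals - residuals.mean()) / residuals.std()
multivariate_outliers = np.where(np.abs(z_scores) > 2.0)[0]

print(univariate_outliers)
print(multivariate_outliers)
```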

>> #### Dealing with Imbalanced Data | Skew distribution
- weighting (for imbalanced classes)
- log transform (for skewed distributions)
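
As a small illustration of both remedies, the sketch below computes "balanced" class weights for a hypothetical label vector and applies a log transform to a hypothetical skewed feature. `np.log1p` is used instead of a plain log so that zero values would also be handled; that choice is an assumption, not something these notes specify.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 samples of class 0, 2 of class 1.
y = np.array([0] * 8 + [1] * 2)
classes = np.unique(y)

# 'balanced' weights are n_samples / (n_classes * class_count),
# so the rare class receives the larger weight.
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical right-skewed feature, compressed with log(1 + x).
skewed = np.array([1.0, 10.0, 100.0, 1000.0])
logged = np.log1p(skewed)

print(class_weight)
print(logged)
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter.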

> ### + Data Transformation
- Binarization
- Mean Removal
- Label Encoder
- Scale
- Normalization

### 1.1.2 Data Wrangling
Data wrangling is a technique applied while building an interactive model. In other words, it converts raw data into a format that is convenient to consume.

> ### + EDA
At a high level, EDA is the practice of using __visual and quantitative methods to understand and summarize a dataset__ without making any assumptions about its contents. It is a crucial step to take before diving into machine learning or statistical modeling because __it provides the context needed to develop an appropriate model__ for the problem at hand and to correctly interpret its results.

> ### + Data Partitioning
- Model Set
    - Training Set
    - Validation Set
    - Test Set
- Score Set
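
One possible way to produce the three parts of the model set is to call scikit-learn's `train_test_split` twice. The data below is hypothetical, and the 60/20/20 proportions are an assumption chosen for illustration. The score set is read here as new, unlabeled data that the finished model is applied to, so it is not part of this split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labeled model set: 50 instances, 2 attributes.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# First split: hold out 20% of the model set as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: 25% of the remainder (20% of the total) becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```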

## Data Preprocessing
To get good results from a model in Machine Learning and Deep Learning projects, __the data must be in a proper format__. Some Machine Learning and Deep Learning models need data in a specific format, for example:
- Many implementations of the Random Forest algorithm do not support null values, so null values must be handled in the raw data set before the algorithm is run.
- Neural networks generally train best when input values are scaled to a small range, such as -1 to 1.

Another aspect is that the data set should __be formatted so that more than one Machine Learning or Deep Learning algorithm can be run on the same data set__, and the best of them chosen.
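
A minimal sketch of running several algorithms on one prepared data set and keeping the best. The iris data set, the two candidate models, and 5-fold cross-validated accuracy as the selection criterion are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small data set bundled with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Candidate models; the linear model gets a scaling step in its pipeline.
candidates = {
    'logistic_regression': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate on the same data.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

print(scores)
print('best:', best_name)
```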
    
 


In [2]:
import numpy as np
from sklearn import preprocessing

input_data = np.array([[5.1, -2.9, 3.3], [-1.2, 7.8, -6.1], [3.9, 0.4, 2.1], [7.3, -9.9, -4.5]])

In [3]:
# Binarization
# Use 2.1 as the threshold: values greater than 2.1 become 1, all others become 0

data_binarized = preprocessing.Binarizer(threshold=2.1).transform(input_data)
print(data_binarized)

[[1. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]


In [3]:
# Mean Removal
print( 'Before Mean Removal, Mean - {}'.format(input_data.mean(axis=0)))
print( 'Before Mean Removal, Std  - {}'.format(input_data.std(axis=0)))
data_mr = preprocessing.scale(input_data)
print( 'After Mean Removal, Mean  - {}'.format(data_mr.mean(axis=0)))
print( 'After Mean Removal, Std   - {}'.format(data_mr.std(axis=0)))


Before Mean Removal, Mean - [ 3.775 -1.15  -1.3  ]
Before Mean Removal, Std  - [3.12039661 6.36651396 4.0620192 ]
After Mean Removal, Mean  - [1.11022302e-16 0.00000000e+00 2.77555756e-17]
After Mean Removal, Std   - [1. 1. 1.]


In [5]:
# Scale Min to Max
# With feature_range=(0, 1), MinMaxScaler maps the minimum value of each column to 0
# and the maximum value of each column to 1

data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
# fit_transform learns each column's min/max and returns the scaled array
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print('Before MinMax Scaling - {}'.format(input_data))
print('After  MinMax Scaling - {}'.format(data_scaled_minmax))


Before MinMax Scaling - [[ 5.1 -2.9  3.3]
 [-1.2  7.8 -6.1]
 [ 3.9  0.4  2.1]
 [ 7.3 -9.9 -4.5]]
After  MinMax Scaling - [[0.74117647 0.39548023 1.        ]
 [0.         1.         0.        ]
 [0.6        0.5819209  0.87234043]
 [1.         0.         0.17021277]]


In [None]:
# L1 Normalization: the absolute values in each row sum to 1

# L2 Normalization: the squared values in each row sum to 1
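
The cell above is only a placeholder. A minimal sketch of both normalizations, assuming `preprocessing.normalize` is the intended tool and reusing the numeric `input_data` array from the earlier cells:

```python
import numpy as np
from sklearn import preprocessing

# Same numeric array as in the earlier cells.
input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# L1: divide each row by the sum of its absolute values.
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')

# L2: divide each row by its Euclidean length.
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')

print(np.abs(data_normalized_l1).sum(axis=1))  # per-row L1 norm
print((data_normalized_l2 ** 2).sum(axis=1))   # per-row squared L2 norm
```

Unlike scaling, which works column by column, normalization here works row by row, so each instance ends up with unit norm.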

In [6]:
# Label Encoding
import numpy as np
from sklearn import preprocessing

input_data = ['Paris', 'Toronto', 'Beijing','Beijing']
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(input_data)

for i, item in enumerate(label_encoder.classes_):
    print('{}->{}'.format(i, item))
print(label_encoder.transform(['Toronto', 'Beijing']))
print(label_encoder.inverse_transform([0, 1, 2]))

0->Beijing
1->Paris
2->Toronto
[2 0]
['Beijing' 'Paris' 'Toronto']


In [None]:
# OneHotEncoder
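
The cell above is left empty in the notes. As a sketch of what it might contain, the example below one-hot encodes the same city names used for label encoding. `OneHotEncoder` expects a 2-D input, so the list is reshaped into a single column; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same categories as in the label encoding cell, as one column.
cities = np.array(['Paris', 'Toronto', 'Beijing', 'Beijing']).reshape(-1, 1)

encoder = OneHotEncoder()
# The default result is a sparse matrix; convert it to a dense array for printing.
one_hot = encoder.fit_transform(cities).toarray()

print(encoder.categories_[0])
print(one_hot)
```

Each category gets its own 0/1 column, which avoids the artificial ordering that integer labels imply.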



## EDA
__Typical types of EDA__
- Univariate visualization of, and summary statistics for, each field in the raw dataset
- Bivariate visualization and summary statistics for assessing the relationship between each variable in the dataset and the target variable of interest (e.g. time until churn, spend)
- Multivariate visualizations to understand interactions between different fields in the data
- Dimensionality reduction to understand the fields in the data that account for the most variance between observations and allow for the processing of a reduced volume of data
- Clustering of similar observations in the dataset into differentiated groupings; collapsing the data into a few representative groups makes patterns of behavior easier to identify
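
The quantitative side of these EDA types can be sketched as follows. The iris data set, two principal components, and three clusters are assumptions chosen for illustration; the visual side (plots) is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target

# Univariate: summary statistics for each field.
for name, column in zip(iris.feature_names, X.T):
    print('{}: mean={:.2f}, std={:.2f}, min={:.1f}, max={:.1f}'.format(
        name, column.mean(), column.std(), column.min(), column.max()))

# Bivariate: correlation of each field with the target variable.
target_corr = np.array([np.corrcoef(column, y)[0, 1] for column in X.T])

# Multivariate: correlation matrix between all fields.
corr_matrix = np.corrcoef(X, rowvar=False)

# Dimensionality reduction: project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print('explained variance ratio:', pca.explained_variance_ratio_)

# Clustering: group similar observations.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print('cluster sizes:', np.bincount(cluster_labels))
```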

__Represent GEO Data__
- vincent | conda install -c conda-forge vincent
- folium  | conda install -c conda-forge folium