# Data Mining Process 
The goal of this lecture is to understand and implement the entire data mining process chain according to the [Cross-industry standard process for data mining (CRISP-DM)](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining). This process chain is sketched in the picture below:  

<img src="./Pics/crispIndallnodep.png" style="width:600px" align="middle">

**Example application of this notebook:**

The `Data`-folder contains the dataset `churnPrediction.csv`. The dataset includes information about:
* Customers who left within the last month – the column is named `Churn`.
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information – how long they’ve been customers, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents

The overall task is to predict churn from the other observable features.

---
## Course of Action

* Please write all executable Python code in ```Code```-Cells (```Cell```->```Cell Type```->```Code```) and all text as [Markdown](http://commonmark.org/help/) in ```Markdown```-Cells
* Describe your thinking and your decisions (where appropriate) in an extra Markdown Cell or via Python comments
* In general: discuss all your results and comment on them (are they good/bad/unexpected, could they be improved, how?, etc.). Furthermore, visualise your data (input and output).
* Write a short general conclusion at the end of the notebook
* Further experiments are encouraged. However, don't forget to comment on your reasoning.
* Use a scientific approach for all experiments (i.e. develop a hypothesis or concrete question, make observations, evaluate results)

## Submission

E-mail your complete notebook to [maucher@hdm-stuttgart.de](mailto:maucher@hdm-stuttgart.de) by the start of the next lecture. One notebook per group is sufficient. Edit the team member table below.

**Important**: Also attach an HTML version of your notebook (```File```->```Download as```->```HTML```) in addition to the ```.ipynb```-File.

| Teammember |                    |
|------------|--------------------|
| 1.         | Geoffrey Hinton    |
| 2.         | Yoshua Bengio      |
| 3.         | Yann LeCun         |
| 4.         | Jürgen Schmidhuber |
---

## Prerequisites

1. The main packages applied in this lecture are: 
    * [Pandas](https://pandas.pydata.org/pandas-docs/stable/)
    * [Scikit-Learn](http://scikit-learn.org/stable/)
    * [Matplotlib](https://matplotlib.org/)
    * [Seaborn](https://seaborn.pydata.org/)
    
    Familiarity with the general concepts of these packages is assumed.
---

## Tasks
### Data Access
#### Task 1: Access Data
Load the file `churnPrediction.csv` into a pandas dataframe by applying the pandas function `read_csv()`. Use this function's argument `na_values` to define which fields in the input file (e.g. empty strings) shall be mapped to `NaN`. Display the shape and the head of this dataframe.
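The effect of `na_values` can be tried out on a small in-memory CSV before loading the real file. The following is only a sketch: the CSV text and its values are made up for illustration, and which strings actually denote missing values in `churnPrediction.csv` has to be checked in the file itself.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for churnPrediction.csv (made-up values).
csv_text = """customerID,tenure,TotalCharges,Churn
A-1,1,29.85,No
A-2,0, ,No
A-3,34,,Yes
"""

# na_values lists additional strings that read_csv shall map to NaN.
# Here a single blank and the empty string are treated as missing.
df = pd.read_csv(io.StringIO(csv_text), na_values=[" ", ""])

print(df.shape)
print(df.head())
```

For the real data the first argument would be the path to the file in the `Data`-folder instead of the `StringIO` object.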

#### Task 2: Process missing values 
Check if there is missing data (`NaN`) in this file. If so, delete all rows with missing values. How many rows remain in the dataset?
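One possible way to count and remove missing values is sketched below on a made-up toy dataframe. The column names are borrowed from the dataset description, the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
df = pd.DataFrame({
    "tenure": [1, 0, 34, 2],
    "TotalCharges": [29.85, np.nan, 1889.5, 108.15],
    "Churn": ["No", "No", "No", "Yes"],
})

# Number of NaN entries per column.
print(df.isna().sum())

# Drop every row that contains at least one NaN.
df_clean = df.dropna(axis=0, how="any")
print("Remaining rows:", len(df_clean))
```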

### Preprocess and Understand Data

#### Task 3: Check domains of columns
For each of the dataframe's columns display the value-range. For columns with a large value-range display only the first 5 values.
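A possible starting point is sketched below. The helper `domain_preview` is a hypothetical name introduced for this illustration, and the toy dataframe is made up.

```python
import pandas as pd

# Hypothetical toy data; the real columns come from churnPrediction.csv.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female", "Male", "Female", "Male"],
    "tenure": [1, 34, 2, 45, 8, 22, 10],
})

def domain_preview(frame, max_values=5):
    """Return, per column, the number of distinct values and at most max_values of them."""
    preview = {}
    for col in frame.columns:
        values = frame[col].unique()
        preview[col] = (len(values), list(values[:max_values]))
    return preview

for col, (n_distinct, shown) in domain_preview(df).items():
    print(f"{col}: {n_distinct} distinct values, e.g. {shown}")
```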

#### Task 4: Transformation of non-numeric data
There are many columns with non-numeric values in the dataframe. Transform them into numeric representations by applying the `LabelEncoder` from scikit-learn. Before this transformation the column `customerID` can be removed from the dataframe.
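A minimal sketch of this transformation on made-up data is given below. Keeping one fitted encoder per column (here in the dictionary `encoders`, a name chosen for this example) is optional, but it allows mapping the integer codes back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; values are invented.
df = pd.DataFrame({
    "customerID": ["A-1", "A-2", "A-3"],
    "Contract": ["Month-to-month", "Two year", "One year"],
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

# The ID carries no information for the prediction task.
df = df.drop(columns=["customerID"])

encoders = {}
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        enc = LabelEncoder()
        # Classes are sorted, then mapped to 0, 1, 2, ...
        df[col] = enc.fit_transform(df[col])
        encoders[col] = enc

print(df)
print(list(encoders["Contract"].classes_))
```

Note that `LabelEncoder` assigns the integers according to the sorted order of the labels, so the resulting numbers do not express any meaningful ranking.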

#### Task 5: Understand data by calculating descriptive statistics
For pandas dataframes simple descriptive statistics can be calculated by applying the `describe()`-method. Display the statistics returned by this method. 

#### Task 6: Understand data by univariate distribution visualisation - numeric variables
For the numeric features `tenure`, `MonthlyCharges` and `TotalCharges` visualize the value-distribution by applying *violinplots* from *seaborn*.

#### Task 7: Understand data by univariate distribution visualisation - discrete variables
For the 16 non-numeric features visualize the value-distribution by applying *countplots* from *seaborn*. The 16 countplots shall be arranged in a (4x4)-grid.

#### Task 8: Understand data by conditional distribution visualisation
Repeat the distribution-visualisations of the two previous tasks. However, now for each variable the distribution in class `churn=0` and in class `churn=1` shall be calculated separately. Apply seaborn's *FacetGrid*-class for this. Are there features for which the distributions in the 2 classes are significantly different? If yes: Which ones? For columns with significantly different distributions in the two classes: Do you expect these features to be informative with respect to the classification task? Why? 

#### Task 9: Understand data by correlation-analysis
Use the pandas-dataframe method `corr()` for calculating the pairwise correlations of all columns. Visualize the calculated pairwise correlations by applying seaborn's `heatmap()`.

### Task 10: Univariate Feature Selection
The goal of univariate feature selection is to select a set of most informative features, based on univariate statistical tests. In scikit-learn the following tests are available:
* **Regression:** [Mutual Information for Regression](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression), [F-test for regression](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)
* **Classification:** [Mutual Information for Classification](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif), [$\chi^2$-test](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2), [F-test for classification](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif).

For unsupervised learning, for example, the [sklearn.feature_selection.VarianceThreshold class](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) can be applied. This method only analyses the variance of each single feature and does not require a class-label or regression target-value.

**Subtasks:**
* Calculate the feature importance of all features in the churn-prediction data w.r.t. all 3 feature importance tests for classification. Display the results in a single dataframe, whose rows are the features and whose columns are the distinct feature-importance-tests. For the $\chi^2$-test and the F-test the value of the test statistic and its p-value shall be contained in this dataframe. 
* Discuss the result.
* Apply scikit-learn's class `SelectKBest` for extracting the $k=8$ most relevant features with respect to *mutual information*.
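The following sketch shows how the three scores could be collected in one dataframe and how `SelectKBest` can be combined with mutual information. It uses synthetic data with invented feature names and $k=2$ instead of $k=8$, because the toy data has only three features. Treating all features as discrete (`discrete_features=True`) is an assumption that fits this toy data and must be reconsidered for the real data.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

# Synthetic data: one feature depends on the label, two are pure noise.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y * 3 + rng.integers(0, 3, size=n),
    "noise_a": rng.integers(0, 5, size=n),
    "noise_b": rng.integers(0, 5, size=n),
})

# chi2 requires non-negative feature values.
chi2_stat, chi2_p = chi2(X, y)
f_stat, f_p = f_classif(X, y)
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)

scores = pd.DataFrame(
    {"chi2": chi2_stat, "chi2_p": chi2_p, "F": f_stat, "F_p": f_p, "MI": mi},
    index=X.columns,
)
print(scores)

# SelectKBest accepts any score function; partial() fixes its extra arguments.
mi_score = partial(mutual_info_classif, discrete_features=True, random_state=0)
selector = SelectKBest(score_func=mi_score, k=2)
X_selected = selector.fit_transform(X, y)
selected = list(X.columns[selector.get_support()])
print("Selected features:", selected)
```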

### Task 11: Transform Data: One-Hot Encoding
The following picture displays the different data types:
![data types](./Pics/dataTypes.png)


Non-binary nominal data should be one-hot encoded. Determine all columns with non-binary nominal data and transform the feature-array of the churnPrediction-dataframe (with all features) into a version in which non-binary nominal features are one-hot encoded. 
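One possible approach is sketched below with pandas' `get_dummies()`; scikit-learn's `OneHotEncoder` would be an alternative. The toy data is made up, and the list `nominal_cols` is an assumption: which columns are nominal (and not ordinal or numeric) cannot be derived from the label-encoded integers and must be decided from the meaning of each column.

```python
import pandas as pd

# Hypothetical label-encoded toy data.
df = pd.DataFrame({
    "Partner": [1, 0, 1, 0],      # binary nominal
    "Contract": [0, 2, 1, 0],     # nominal with 3 categories
    "tenure": [1, 34, 2, 45],     # numeric
})

# Assumption: columns identified as nominal by inspecting their meaning.
nominal_cols = ["Partner", "Contract"]

# Only nominal columns with more than two categories need one-hot encoding.
non_binary = [c for c in nominal_cols if df[c].nunique() > 2]

encoded = pd.get_dummies(df, columns=non_binary, dtype=int)
print(encoded)
```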

### Task 12: Transform data: Scaling
With the exception of decision trees and ensemble methods that are built from decision trees, nearly all machine learning algorithms require input features of similar scale. Since the value ranges of practical data can be very different, a corresponding scaling must be performed in the preprocessing chain. The most common scaling approaches are *normalization (MinMax-scaling)* and *standardization*.

**Normalization:** In order to normalize feature *x* its minimum $x_{min}$ and maximum $x_{max}$ must be determined. Then the normalized values $x_n^{(i)}$ are calculated from the original values $x^{(i)}$ by  
$$x_n^{(i)}=\frac{x^{(i)}-x_{min}}{x_{max}-x_{min}}.$$
The range of normalized values is $[0,1]$. A problem of this type of scaling is that outliers determine $x_{min}$ or $x_{max}$, so that all non-outliers may be compressed into a very small part of this range. 

**Standardization:** In order to standardize feature *x* its mean value $\mu_x$ and standard deviation $\sigma_x$ must be determined. Then the standardized values $x_s^{(i)}$ are calculated from the original values $x^{(i)}$ by
$$x_s^{(i)}=\frac{x^{(i)}-\mu_x}{\sigma_x}$$
All standardized features have zero mean and a standard deviation of one.

Calculate 
* a normalized
* a standardized 

representation of the feature-array of the churn-prediction data (from the original data, without features-selection and one-hot encoding).
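Both formulas can be checked against scikit-learn's `MinMaxScaler` and `StandardScaler` on a small made-up array, as sketched below. The first column contains an outlier (100), which illustrates how normalization compresses the remaining values towards 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature array: rows are samples, columns are features.
X = np.array([
    [1.0, 20.0],
    [2.0, 50.0],
    [3.0, 110.0],
    [100.0, 80.0],
])

X_norm = MinMaxScaler().fit_transform(X)
X_std = StandardScaler().fit_transform(X)

# The same results computed directly from the formulas above.
manual_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
manual_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.round(X_norm, 3))
print(np.round(X_std, 3))
```

`StandardScaler` divides by the population standard deviation, which corresponds to NumPy's default `std()` with `ddof=0`.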

### Task 13: Definition and Evaluation of multiple data mining processing chains 
The entire data mining process usually comprises a sequence of modules, e.g.: 

*data access -> cleaning -> feature selection -> transformations -> modelling -> visualisation -> evaluation*

In scikit-learn such sequences of modules can conveniently be encapsulated within a single [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). A Pipeline-object can be configured as a sequence of other scikit-learn objects. The restriction is that all but the last module in a pipeline must be of **Transformer**-type. All *Transformers* have a `fit()`-method for training and a `transform()`-method to transform data. The last module in the sequence is of **Estimator**-type. All *Estimators* have a `fit()`-method for training and a `predict()`-method to estimate an output for the given input data. The main benefits of the `Pipeline`-class are:

* For training, the `fit()`-method must be invoked only once to fit the whole sequence of modules in the pipeline.
* After training, the `predict()`-method must also be invoked only once per pipeline.
* Parameter optimisation, e.g. by Grid-Search can be performed over all parameters in the pipeline. 
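The pipeline mechanics described above can be sketched as follows. The data is synthetic (generated with `make_classification`), and the step names `scale`, `select` and `clf` as well as the chosen modules are arbitrary choices for this illustration, not a prescribed configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature array X and a label vector y.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Two Transformers followed by one Estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One call fits all steps in sequence.
pipe.fit(X, y)

# One call transforms the input through all steps and predicts.
pred = pipe.predict(X[:5])
print(pred)

# Parameters of every step are addressable as <step name>__<parameter>,
# which is what grid search over a pipeline relies on.
print(pipe.get_params()["clf__C"], pipe.get_params()["select__k"])
```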

Define multiple pipelines:
* All of them shall apply [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for classification. However, different pipes can be defined for different parameter settings of this classifier. In particular the parameter `class_weight` shall be varied. Describe the meaning of this parameter.
* Define pipes with and without scaling (normalization and standardization)
* Define pipes with and without one-hot encoding
* Define pipes with and without feature-selection (`SelectKBest`)

All of the pipes shall be trained with training data, which shall comprise 70% of the entire data. The remaining 30% shall be applied for testing. For all of the pipes
* the confusion matrix
* accuracy
* precision
* recall
* f1-measure

shall be determined. Display all results concisely. Which configuration yields the best result? 
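A possible skeleton for comparing several pipes is sketched below. It is only an outline under stated assumptions: the data is synthetic and imbalanced (`make_classification` with `weights`), the three pipes are examples rather than the full set requested above, and stratified splitting with a fixed `random_state` is a choice made here for reproducibility.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.75, 0.25], random_state=0)

# 70% training data, 30% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipes = {
    "plain": Pipeline([("clf", LogisticRegression(max_iter=1000))]),
    "standardized": Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
    "standardized+balanced": Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ]),
}

rows = {}
for name, pipe in pipes.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    rows[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }

# One row per pipe, one column per metric.
results = pd.DataFrame.from_dict(rows, orient="index")
print(results)
```

Collecting the metrics in one dataframe makes it easy to sort by a metric or to plot the comparison.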
