# Lecture Programming for Data Science, Part Python

* Author: Prof. Dr. Johannes Maucher
* Email: maucher@hdm-stuttgart.de
* Last Update: 22.06.2017

The goal of this workshop is to demonstrate how *Python* and its main machine learning framework [Scikit-Learn](http://scikit-learn.org/stable/index.html) can be applied to data mining, in particular to learning, testing and evaluating models. *Scikit-Learn* is based on [NumPy](http://www.numpy.org/), the fundamental package for scientific computing in Python. Convenient methods for data access and data analysis are provided by [Pandas](http://pandas.pydata.org/), which is also based on *NumPy*. The main library for 2- and 3-dimensional data visualisation is [Matplotlib](http://matplotlib.org/). NumPy data structures can easily be visualised with Matplotlib.
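As a minimal sketch of how these libraries fit together, the following snippet builds a small Pandas `DataFrame` and extracts the underlying NumPy array. The column names and values are made up for illustration only and do not come from the workshop data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; names and values are illustrative only
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                   "y": [0.0, 1.0, 4.0, 9.0]})

# Pandas stores its data in NumPy arrays and can hand them out directly
values = df.to_numpy()

# NumPy operations work on the extracted array ...
col_means = values.mean(axis=0)

# ... and agree with the equivalent Pandas computation
pandas_means = df.mean().to_numpy()
```

An array obtained this way can be passed on to Scikit-Learn or plotted with Matplotlib.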

Selected concepts, classes and methods of *Scikit-Learn* are demonstrated in the main part of this workshop, [Data Mining with Python Modules](#data_mining). The basic concepts of the supporting modules *NumPy, Pandas* and *Matplotlib* are introduced in the Jupyter notebooks of the chapter [Basic Modules](#basic_modules).

<a id='basic_modules'></a>
## Basic Modules and Concepts
The main Python modules that are typically used together with scikit-learn are introduced in the following notebooks:

* [Basics in Numpy (.ipynb)](NP01numpyBasics.ipynb) / [[.html]](NP01numpyBasics.html)
* [Basics in Matplotlib (.ipynb)](PLT01visualization.ipynb) / [[.html]](PLT01visualization.html)
* [Basics in Pandas (.ipynb)](PD01Pandas.ipynb) / [[.html]](PD01Pandas.html)

The main concepts of the Python machine learning framework [Scikit-Learn](http://scikit-learn.org/stable/index.html) are:

* It is primarily built on NumPy. In particular, its internal and external data structures are [NumPy arrays](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html).
* All algorithms that transform data belong to the **<span class="mark">Transformer</span>** class, e.g. *PCA, Normalizer, StandardScaler, OneHotEncoder*, etc. A transformer is trained by calling its *<span class="mark">.fit(traindata)</span>* method. Once it is trained, its *<span class="mark">.transform(data)</span>* method can be called in order to transform *data*. If the data used for training the transformer is to be transformed immediately after training, the *.<span class="mark">fit_transform(data)</span>* method can be applied instead.
* All machine learning algorithms for supervised and unsupervised learning belong to the **<span class="mark">Estimator</span>** class, e.g. *LogisticRegression, SVM, MLP, KMeans*, etc. An estimator is trained by calling its *.fit(trainfeatures)* or *<span class="mark">.fit(trainfeatures,trainlabels)</span>* method. The former signature is used for unsupervised learning, the latter for supervised learning. Once an estimator is trained, it can be applied for clustering, classification or regression by invoking its *.<span class="mark">predict(data)</span>* method.
* At their interfaces, all **Transformers** and **Estimators** use *NumPy arrays*.
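The fit/transform/predict pattern described above can be sketched as follows. The toy data and all variable names are hypothetical and chosen only for illustration; `StandardScaler`, `LogisticRegression` and `KMeans` stand in for any transformer, supervised estimator and unsupervised estimator, respectively.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points
train_features = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                           [8.0, 9.0], [8.5, 9.5], [9.0, 8.8]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
new_data = np.array([[1.1, 2.1], [8.7, 9.1]])

# Transformer: fit() learns mean and standard deviation per column,
# transform() applies the learned scaling
scaler = StandardScaler()
scaler.fit(train_features)
train_scaled = scaler.transform(train_features)

# fit_transform() combines both steps on the training data
train_scaled_alt = StandardScaler().fit_transform(train_features)

# New data is transformed with the parameters learned from the training data
new_scaled = scaler.transform(new_data)

# Supervised estimator: fit(features, labels), then predict(data)
clf = LogisticRegression()
clf.fit(train_scaled, train_labels)
predicted_labels = clf.predict(new_scaled)

# Unsupervised estimator: fit(features) without labels, then predict(data)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(train_scaled)
cluster_ids = km.predict(new_scaled)
```

Note that the scaler is fitted on the training data only and then reused for `new_data`, so that unseen data is scaled with the same parameters as the training data.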

<a id='data_mining'></a>
## Data Mining with Python Modules

1. [Data Access, Preprocessing and Understanding (.ipynb)](01DataAccess.ipynb) / [[.html]](01DataAccess.html)
    * Pandas for Data Access and Preprocessing
    * Dealing with missing data
    * Data understanding by
        - descriptive statistics
        - visualisation
    * Transformations for ordinal and nominal data
    * Data Scaling
    * Feature Selection
        
2. [Learning and Visualization of Decision Trees (.ipynb)](02DecisionTree.ipynb) / [[.html]](02DecisionTree.html)
    * Label-Encoding and One-Hot-Encoding
    * Train and test with scikit-learn (basic approach)
    * Decision Tree based calculation of feature importance
    * Visualisation of decision tree
    
3. [Building Data Mining Processing Chains (.ipynb)](03ProcessingPipeline.ipynb) / [[.html]](03ProcessingPipeline.html)
    * Example Data: Cleveland Heart Disease Dataset
    * Preprocessing
    * Building a pipeline of modules for scaling, transformation and classification
    * Evaluation of a Classifier by accuracy, confusion matrix, precision, recall, f1-score
    * Cross-Validation
    * Determine feature importance
    * Fast and efficient model comparison
    
4. [Model Evaluation (.ipynb)](04EvaluationCurves.ipynb) / [[.html]](04EvaluationCurves.html)
    * Calculation and visualisation of Learning Curve
    * Calculation and visualisation of Validation Curve
    * Logistic Regression
    * Support Vector Machine
    * Hyperparameter tuning with
        * GridSearch
        * RandomSearch
    * Calculation and visualisation of ROC
    * Analyse and visualise influence of Regularisation on weights
    
5. [Ensemble Methods: General Concept (.ipynb)](05EnsembleMethods.ipynb)  / [[.html]](05EnsembleMethods.html)
    * Categorisation of ensemble machine learning algorithms and the main concepts

5. [Random Forest Regression (.ipynb)](05RandomForestRegression.ipynb)  / [[.html]](05RandomForestRegression.html)
    * Example Data: Predict bike rental
    * Train and evaluate Random Forest Regression model
    * Error visualisation
    * Determining feature importance
    * Hyperparameter Tuning
    * Fast and efficient model comparison
    * Comparison with Extremely Randomized Trees
    * Combined Learning and Hyperparameter-Tuning in Linear Regression modules

5. [Gradient Boosting Regression (.ipynb)](05GradientBoostingRegression.ipynb)  / [[.html]](05GradientBoostingRegression.html)
    * Example Data: Predict bike rental
    * Train and evaluate Gradient Boosting Regression model
    * Error visualisation
    * Determining feature importance
    * Hyperparameter Tuning
    * Fast and efficient model comparison
    * Comparison with Ada Boost Regression

6. [Clustering Energy Consumption (.ipynb)](06ClusteringEnergy.ipynb) / [[.html]](06ClusteringEnergy.html)
    * Boxplots
    * Enhance data with geo-information
    * Normalization
    * Clustering algorithms
        - Hierarchical Clustering
        - KMeans
        - DBSCAN
        - Affinity Propagation
    * Visualisation of clusters
    * Dimensionality Reduction
    * Visualisation in Google Maps

## Exercise

1. [Classification and Regression on StepStone Data (.ipynb)](100Exercise.ipynb) / [[.html]](100Exercise.html)