# Practical 1: Installing and Introduction to Python's Data Mining Libraries

### In this practical
1. [Why Python?](#whypython)
2. [Installing Anaconda](#install)
3. [Process flow of predictive mining in Python](#processflow)
4. [Interactive prototyping in ipython](#ipython)
5. [Defining problem and purpose of data mining](#purpose)
---

**Written by Hendi Lie (h2.lie@qut.edu.au) and Richi Nayak (r.nayak@qut.edu.au). All rights reserved.**

This practical note introduces you to Python and its common machine learning libraries. Python is a high-level, interpreted programming language. It is used for a wide range of purposes, from web servers to scientific computing. Its syntax is easy to read and quick to write.

The practical sessions in this subject use Python for data manipulation and analytics. This is **NOT** a Python unit, so it will not cover the basic syntax of Python. Fortunately, there are many resources available online for learning elementary Python. If you are new to Python, we recommend that you familiarise yourself with it before Practical 2.

More specifically, we will use Python 3 in this unit. All examples are written using Python 3.5.2, but any version of Python 3 above 3.4 should work just fine.
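
To confirm which interpreter you are running, a quick check such as the following can help. It is a minimal sketch: the helper name `python_version_ok` is made up for this note, and it assumes that "above 3.4" means version 3.4 or later.

```python
import sys


def python_version_ok(version_info=sys.version_info, minimum=(3, 4)):
    """Return True if the (major, minor) version is at least `minimum`.

    Assumption: "above 3.4" is read as "3.4 or later".
    """
    return tuple(version_info[:2]) >= minimum


print("Running Python", sys.version.split()[0])
print("Suitable for this unit:", python_version_ok())
```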

## 1. Why Python? <a name="whypython"></a>

There exist many different languages, tools and software to perform data mining tasks. Some of them are R, Weka, Orange, Knime Analytics, SAS, Oracle Data Miner, Microsoft SQLServer Data Mining and Rapid Miner. Python is arguably the fastest growing and most widely used programming language/tool alongside R and Julia. There are a number of reasons for this popularity:

### 1.1. Interpreted language
Python is designed as an interpreted language, which allows users to test and prototype models easily and quickly. With the IPython interpreter, the result of each line of code can be checked immediately. An interactive Jupyter notebook can also be created to share insights and code through a browser interface. These tutorial notes are written in Jupyter notebooks.

### 1.2. Open-source
Python is free and has no ties to any proprietary/corporate technologies, which makes it a top choice for students, academics and startups. Some Python libraries and technologies are tied to a company (e.g. Jupyter, tensorflow), but they are still free for almost all use cases.

### 1.3. Wide, cutting edge support for almost anything

There is a vast range of actively updated libraries for almost every data mining task.

* **pandas** for data wrangling and preprocessing ([link](http://pandas.pydata.org/))
* **scikit-learn** for supervised and unsupervised learning ([link](http://scikit-learn.org/stable/))
* **numpy** for matrix manipulation ([link](http://www.numpy.org/))
* **seaborn** and **matplotlib** for visualisation ([link](https://seaborn.pydata.org/)) ([link2](https://matplotlib.org/))
* **ipython** for interactive prototyping ([link](https://ipython.org/))
* **jupyter notebook** for interactive, web-based prototyping ([link](http://jupyter.org/))
* **graphviz** and **pydot** for visualising sklearn model structures.

### 1.4. Production ready
Models and pipelines built with Python are well suited to deployment in production systems. We will talk about deploying models and data pipelines in a practical later in this unit.

## 2. Installing Anaconda <a name="install"></a>

In this unit, we will use many data science libraries in Python. We recommend that you install Anaconda, a data science distribution of Python. It contains many essential data science libraries and aims to simplify the installation process. At the time of writing, all libraries mentioned above are included in the Anaconda distribution except Seaborn.

### 2.1. For Windows users

For Windows users, simply download Anaconda ([link](https://www.anaconda.com/download/)) and install it. Choose the latest Python 3 version for Windows.

Once you have installed it, go to **Start-Anaconda3-Anaconda Prompt**. Type `conda install seaborn` to install Seaborn.

To install visualisation libraries for decision trees (practical 2), follow these steps:
1. Download graphviz (Stable 2.38 Windows install package) from http://www.graphviz.org/download/ and install it.
2. Add the graphviz bin folder to the PATH system environment variable (example: "C:\Graphviz2.38\bin"). Tutorial here: http://windowsitpro.com/systems-management/how-can-i-add-new-folder-my-system-path.
3. Make sure git is installed on your system.
4. Open Anaconda Prompt from the Start menu. Make sure to right-click and select "Run as Administrator"; you may get permission issues if the Prompt is not opened as Administrator.
5. Execute the command: `conda install graphviz`
6. Execute the command: `pip install git+https://github.com/nlhepler/pydot.git`
7. Execute the command `conda list` and make sure the pydot and graphviz modules are listed.


### 2.2. For Linux Users

For Linux users, go to the Anaconda website to get the installation file ([link](https://www.anaconda.com/download/)). Choose the latest Python 3 version. It should download an `.sh` file. Once the download has finished, open your terminal and give the file execute permission with `chmod +x [path_to_downloaded_file]`, then run the installer with `./[path_to_downloaded_file]`.

Once the installation process is finished, ensure that Anaconda is included in your bash `PATH`. Open your terminal and install Seaborn by typing `conda install seaborn`.

To install visualisation libraries for decision trees (practical 2), follow these steps:
1. Install graphviz for Linux through `sudo apt-get install graphviz`
2. Install graphviz for Python through `conda install graphviz`
3. Install pydot using `pip install pydot`.
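
On either platform, the installation can also be checked from Python itself. The sketch below is illustrative only: the names `LIBRARIES` and `check_libraries` are made up for this note, and the list uses import names, which can differ from the conda/pip package names (for example, scikit-learn is imported as `sklearn`). It reports what is missing instead of failing.

```python
import importlib.util
import shutil

# Import names of the libraries used in this unit (hypothetical list for this sketch).
LIBRARIES = ["pandas", "sklearn", "numpy", "seaborn", "matplotlib", "IPython", "pydot"]


def check_libraries(names=LIBRARIES):
    """Map each import name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_libraries().items():
    print("{:<12} {}".format(name, "OK" if found else "MISSING"))

# pydot renders graphs by calling the graphviz `dot` executable,
# so that executable needs to be reachable through PATH.
print("graphviz dot:", shutil.which("dot") or "not found on PATH")
```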

## 3. Process flow for predictive mining using Python <a name="processflow"></a>
![Predictive mining process flow in Python](https://s3-ap-southeast-2.amazonaws.com/dataminingtuts/process_flow_python.png  "Predictive mining process flow in Python")

The diagram above presents the steps we will take in this unit to perform predictive mining on the dataset. It is a standard data mining process that includes business understanding, data understanding, data pre-processing, data mining and post-processing. Some of these steps are done iteratively. The first and most important step is to define the underlying task and purpose of data mining. You need to ask questions such as:

* What kind of data do we have?
* Why are we performing predictive mining on this data?
* What information are we trying to predict?
* How could the stakeholders (including yourself) use the insights we gained from the data mining?


Once the business problem is defined, and the corresponding data is identified and acquired, the next step is to explore the data. In this step, we attempt to understand common patterns and distributions in the data. We should also identify data quality problems in the dataset, such as noise and missing values, so that they can be cleaned or removed. The data exploration and quality enhancement steps will be performed mainly using `pandas` with some help from `sklearn`'s preprocessing modules.
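
As a rough illustration of this exploration step, the sketch below uses `pandas` alone on a tiny, invented DataFrame (the column names and values are hypothetical and are not taken from the unit's dataset). It summarises the numeric columns, counts missing values, and applies one simple cleaning choice, median imputation, which is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 29],
    "gift_amount": [10.0, np.nan, 25.0, 15.0, np.nan],
    "region": ["north", "south", "south", None, "north"],
})

# Summary statistics for the numeric columns (count, mean, std, quartiles).
print(toy.describe())

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple cleaning choice: fill numeric gaps with the column median.
cleaned = toy.copy()
for column in ["age", "gift_amount"]:
    cleaned[column] = cleaned[column].fillna(cleaned[column].median())
```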

After the data is clean, models can be built to perform in-depth analysis. There are two broad categories of data mining: **Predictive Mining** (e.g. classification and regression with decision trees, regression models and neural networks) and **Descriptive Mining** (e.g. clustering and association mining). Many algorithms belonging to each category are available in `sklearn`, each with its own characteristics. The upcoming practical notes will explore some of these algorithms in detail.
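
The difference between the two categories can be sketched with `sklearn` on invented toy data. This is only an illustration under stated assumptions: the data, the choice of `DecisionTreeClassifier` and `KMeans`, and the parameter values are examples, not the models prescribed by later practicals.

```python
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two features per record, invented for illustration.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]  # known labels, used only by the predictive model

# Predictive mining: learn a mapping from features to a known target.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X, y)
predicted = classifier.predict([[2, 2], [9, 9]])
print("Predicted labels:", predicted)

# Descriptive mining: group similar records without using any target.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)
print("Cluster assignments:", cluster_ids)
```

Note that the cluster numbers themselves are arbitrary; only the grouping of records is meaningful.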

Data mining outcomes are best understood when accompanied by graphs and charts of the patterns and trends identified in the data. Visualisation allows us to understand the data better. In this unit, all visualisations will be done using `seaborn` and `matplotlib` with data presented by `pandas` DataFrames.
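
A minimal sketch of plotting from a DataFrame is shown below. It uses `matplotlib` only and an invented column, and it saves the chart to a temporary file instead of opening a window; the `Agg` backend, the file name and the data are all assumptions made for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame({"gift_amount": [5, 10, 10, 15, 20, 20, 20, 25, 50]})

# Histogram of one numeric column.
fig, ax = plt.subplots()
ax.hist(toy["gift_amount"], bins=5)
ax.set_xlabel("gift_amount")
ax.set_ylabel("Number of records")
ax.set_title("Distribution of gift_amount (toy data)")

# Save the figure to a temporary location.
output_path = os.path.join(tempfile.gettempdir(), "gift_amount_hist.png")
fig.savefig(output_path)
plt.close(fig)
print("Saved chart to", output_path)
```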

## 4. Interactive prototyping with ipython <a name="ipython"></a>

`ipython` is an interactive Python shell designed for fast prototyping. In data mining/machine learning, many analysts use ipython to quickly review the data and process they are working on.

### If you are using Anaconda

To start the ipython console, go to Start-Anaconda3-IPython. By default, it starts in your documents folder. If you wish to save your projects in another directory, change the current directory using `cd "your directory path"`.

![Starting IPython from Anaconda in Windows](http://dataminingtuts.s3.amazonaws.com/anaconda_ipython.png "Starting IPython from Anaconda in Windows")

### If you are using Linux/Unix and installed the libraries manually

We can call ipython the same way as we call the python interpreter itself:

```bash
ipython
```

```bash
# Output
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 
```

All examples in this unit are shown using the ipython console.

## 5. Defining problem and purpose of data mining process <a name="purpose"></a>

The following business problem will be solved through predictive modelling over the next few practical notes.

**Business Scenario:** A national veterans organisation is seeking to improve its donation solicitations by targeting potential donors. By focusing only on likely donors, less money can be spent on solicitation efforts and more money can be available for charitable endeavours. Of particular interest is the class of individuals identified as lapsing donors. The organisation has run a greeting card mailing campaign called **Veteran**. It now seeks to classify its lapsing donors based on their responses to this campaign. With this classification, a decision can be made to either solicit or ignore a lapsing individual in next year's campaign.

Your task, as a data science professional, is to use this dataset to understand the patterns of donation and identify likely donors in order to improve the solicitation effort.

The `Veteran` dataset is available in the `datasets/veteran.csv` file. Before we build the predictive models, we will examine the dataset to understand its basic characteristics, such as its dimensions, the nature of its attributes, data distributions, outliers and data quality.

Import the dataset into the `ipython` console with `pandas`.

In [1]:

```python
import pandas as pd
df = pd.read_csv('datasets/veteran.csv')
```

Once the dataset is imported, we can start by looking at the columns/variables available. We can use the `.info()` method for this purpose.

> #### Essential classes and functions of Pandas

> * Function **pandas.read_csv()** reads a `.csv` file and returns a **DataFrame** object. There are other read functions in pandas, including `read_json()`, `read_excel()`, `read_sas()` and even `read_sql()`.

> * **pandas.DataFrame** is the primary data object in `pandas`. It is basically a two-dimensional tabular data structure that you can modify, perform arithmetic operations on and give labels to axes (rows and columns). Each column in a DataFrame is a Series.

> * Function **pandas.DataFrame.info()** prints a concise summary of a DataFrame, such as the number of entries (rows), the data columns with their respective data types, and memory usage.
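
These three pieces can be tried without any file on disk. The sketch below is a small illustration with invented CSV content (the column names and values are hypothetical); it relies on the fact that `read_csv()` also accepts a file-like object such as `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical CSV content; column names and values are invented.
csv_text = "donor_id,gift_count,gift_avg\n1,3,12.5\n2,1,\n3,7,20.0\n"

# read_csv accepts a file path or a file-like object and returns a DataFrame.
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))              # the DataFrame class
print(type(toy["gift_avg"]))  # a single column is a Series
print(toy.shape)              # (number of rows, number of columns)

# info() prints the summary itself and returns None.
toy.info()
```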

In [2]:

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9686 entries, 0 to 9685
Data columns (total 28 columns):
TargetB             9686 non-null int64
ID                  9686 non-null int64
TargetD             4843 non-null float64
GiftCnt36           9686 non-null int64
GiftCntAll          9686 non-null int64
GiftCntCard36       9686 non-null int64
GiftCntCardAll      9686 non-null int64
GiftAvgLast         9686 non-null float64
GiftAvg36           9686 non-null float64
GiftAvgAll          9686 non-null float64
GiftAvgCard36       7906 non-null float64
GiftTimeLast        9686 non-null int64
GiftTimeFirst       9686 non-null int64
PromCnt12           9686 non-null int64
PromCnt36           9686 non-null int64
PromCntAll          9686 non-null int64
PromCntCard12       9686 non-null int64
PromCntCard36       9686 non-null int64
PromCntCardAll      9686 non-null int64
StatusCat96NK       9686 non-null object
StatusCatStarAll    9686 non-null int64
DemCluster          9686 non-null int64
...
```

The dataset contains 28 variables, including ID, demographics of members, donation history of members and many others. There are two possible target variables that can be used for predicting donor patterns: **TargetB** (a binary variable stating whether a person is a lapsing donor or not) and **TargetD** (a continuous variable that records the donation amount given in response to the mailing campaign).
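
The `.info()` output also shows that not every column is complete: `TargetD` and `GiftAvgCard36` have fewer non-null entries than the 9686 rows. The arithmetic can be sketched as follows, using only the counts printed above; the helper name `missing_summary` is made up for this note.

```python
# Figures copied from the .info() output above.
total_rows = 9686
non_null_counts = {"TargetD": 4843, "GiftAvgCard36": 7906}


def missing_summary(total, non_null):
    """Return {column: (missing_count, missing_percent)}."""
    return {
        column: (total - count, 100.0 * (total - count) / total)
        for column, count in non_null.items()
    }


summary = missing_summary(total_rows, non_null_counts)
for column, (missing, percent) in summary.items():
    print("{}: {} missing ({:.1f}%)".format(column, missing, percent))
```

This prints 4843 missing values (50.0%) for `TargetD` and 1780 (18.4%) for `GiftAvgCard36`.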

Data exploration helps us answer questions such as:

**What kind of data do we have?**

A total of 28 variables with various information about the donors.

**Why are we performing predictive mining on this data?**

We would like to find possible lapsing donors to improve the donation solicitation campaign of the concerned organisation.

**What information are we trying to predict?**

Whether a person is a possible lapsing donor or not, corresponding to **TargetB**.

**How could the stakeholders use the insights gained from the data mining?**

1. Find underlying characteristics of lapsing donors, leading to better understanding of what makes donors return.

2. Improve the accuracy of the solicitation campaign, so that the response rate is higher and less effort is wasted.

Looks like we have got an interesting and useful data mining project in hand. :)

## End notes
In this practical, we learned how to install Python and its libraries with Anaconda. We also learned about the typical data mining process flow in Python and took a first look at the dataset, in preparation for building data mining models on it and understanding its common patterns and trends.