# 3 Classification Problem
## Example 2

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pandas import set_option

## 3. Understand Data With Descriptive Statistics
### 3.1. Peek at Your Data
Looking at the raw data can reveal insights that you cannot get any other way. It can also plant seeds that may later grow into ideas on how to better pre-process and handle the data for machine learning tasks. 

In [None]:
df = pd.read_csv('user_visit_duration.csv')

In [None]:
df.head()

### 3.2. Dimensions of Your Data
You should have a good handle on how much data you have, in terms of both rows and columns.

- With too many rows, algorithms may take too long to train. With too few, you may not have enough data to train them.
- With too many features, some algorithms can be distracted or perform poorly because of the curse of dimensionality.

In [None]:
df.shape

### 3.3. Data Type For Each Attribute
The type of each attribute is important. Strings may need to be converted to floating point values or integers to represent categorical or ordinal values. You can get an idea of the types of attributes by peeking at the raw data, as above. You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property.

In [None]:
df.dtypes
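
As a minimal sketch of the string-to-integer conversion mentioned above, the following uses a small made-up DataFrame. The column names `plan` and `minutes` are hypothetical and are not taken from `user_visit_duration.csv`.

```python
import pandas as pd

# Hypothetical data; these columns are for illustration only.
demo = pd.DataFrame({
    "plan": ["free", "pro", "free", "team"],
    "minutes": [3.5, 12.0, 1.2, 8.8],
})
print(demo.dtypes)

# Treat the strings as an ordered category, then take the integer codes.
order = ["free", "pro", "team"]
demo["plan_code"] = pd.Categorical(demo["plan"], categories=order, ordered=True).codes
print(demo)
```

The explicit `categories` list fixes the mapping (free=0, pro=1, team=2) rather than leaving it to alphabetical order, which matters when the values are ordinal.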

### 3.4. Descriptive Statistics
Descriptive statistics can give you great insight into the properties of each attribute. Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each numeric attribute: count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile and maximum.

In [None]:
df.describe()

In [None]:
# Optional display settings; note the option key is 'display.precision'
#set_option('display.width', 100)
#set_option('display.precision', 3)
description = df.describe()
print(description)
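
To make the eight properties concrete, here is a small sketch on made-up numbers. The `minutes` column is a hypothetical stand-in, not the real dataset.

```python
import pandas as pd

# Hypothetical visit durations in minutes (illustrative values only).
demo = pd.DataFrame({"minutes": [1.0, 2.0, 3.0, 4.0, 10.0]})

pd.set_option("display.precision", 3)  # show three decimal places
summary = demo.describe()
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary)
```

Comparing the mean (4.0) with the median (3.0) already hints that the single large value pulls the distribution to the right.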

### 3.5. Class Distribution (Classification Only)
On classification problems you need to know how balanced the class values are. Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project. 

In [None]:
df.groupby('Buy').size()
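
Raw counts are easier to judge as proportions. The sketch below assumes a binary label column named `Buy`, as in the cell above, but fills it with made-up values.

```python
import pandas as pd

# Hypothetical labels standing in for the 'Buy' column.
demo = pd.DataFrame({"Buy": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]})

counts = demo["Buy"].value_counts()                # observations per class
shares = demo["Buy"].value_counts(normalize=True)  # fraction per class
imbalance_ratio = counts.max() / counts.min()      # majority size / minority size

print(counts)
print(shares)
print("imbalance ratio:", round(imbalance_ratio, 2))
```

A ratio close to 1 indicates balanced classes. What counts as "highly imbalanced" depends on the problem, so no fixed threshold is assumed here.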

### 3.6. Correlations Between Attributes
Correlation refers to the relationship between two variables and how they may or may not change together. The most common method for calculating correlation is Pearson’s Correlation Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a perfect negative or positive correlation respectively, whereas a value of 0 shows no linear correlation at all. Some machine learning algorithms, such as linear and logistic regression, can perform poorly if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pairwise correlations of the attributes in your dataset. You can use the corr() function on the Pandas DataFrame to calculate a correlation matrix.

In [None]:
# Optional display settings; note the option key is 'display.precision'
#set_option('display.width', 100)
#set_option('display.precision', 3)
correlations = df.corr(method='pearson')
print(correlations)
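
For intuition about what a single entry of the correlation matrix means, the sketch below computes Pearson's coefficient by hand and checks it against NumPy. The two arrays are made-up values, not columns from the dataset.

```python
import numpy as np

# Hypothetical paired measurements (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson's r: sum of products of the centered values, divided by the
# square root of the product of their sums of squares.
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# NumPy's correlation matrix should agree with the manual value.
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

Because `y` is almost a straight-line function of `x`, the coefficient comes out very close to 1.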

### 3.7. Skew of Univariate Distributions
Skew refers to a distribution that is assumed to be Gaussian (normal, or bell curve) but is shifted or squashed in one direction or the other. Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute is skewed lets you apply data preparation to correct the skew, which may later improve the accuracy of your models. You can calculate the skew of each attribute using the skew() function on the Pandas DataFrame.

In [None]:
#The skew results show a positive (right) or negative (left) skew. Values closer to zero show less skew.
df.skew()
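
One common way to reduce a positive skew is a log transform. The sketch below is only an illustration on synthetic data: the log-normal sample and the choice of `np.log1p` are assumptions, not steps taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample drawn from a log-normal distribution.
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

# log1p(x) = log(1 + x) compresses the long right tail.
transformed = np.log1p(raw)

print("skew before:", raw.skew())
print("skew after: ", transformed.skew())
```

The skew after the transform should be noticeably closer to zero. Whether such a transform helps a given model is something to verify empirically.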

## 4. Understand Data With Visualization
### 4.1. Univariate Plots
In this section we will look at three techniques that you can use to understand each attribute of your dataset independently:

- Histograms
- Density plots
- Box and whisker plots

#### Histograms
A fast way to get an idea of the distribution of each attribute is to look at histograms. Histograms group data into bins and provide you a count of the number of observations in each bin. From the shape of the bins you can quickly get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution. It can also help you see possible outliers.

In [None]:
df.hist()
plt.show()

When reading the histograms, look for attributes whose counts fall away steadily from a peak at the low end, which may indicate an exponential distribution, and for attributes with a roughly symmetric bell shape, which may be Gaussian or nearly Gaussian. The distinction matters because many machine learning techniques assume a Gaussian univariate distribution on the input variables.
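
To see the binning that a histogram performs, the sketch below counts made-up values into equal-width bins with NumPy and prints a text version of the bars. The values and the choice of four bins are assumptions for illustration.

```python
import numpy as np

# Hypothetical visit durations in minutes (illustrative values only).
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Four equal-width bins spanning the data range: [1,3), [3,5), [5,7), [7,9]
counts, edges = np.histogram(values, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f}: {'#' * int(n)}")
```

The lone value in the last bin, separated from the rest by a nearly empty bin, is the kind of pattern that suggests a possible outlier.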

#### Density Plots
Density plots are another way of getting a quick idea of the distribution of each attribute. The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much like your eye tried to do with the histograms.

In [None]:
# sharex=False gives each attribute its own x-axis scale
df.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()
#We can see the distribution for each attribute is clearer than the histograms.
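
The smooth curve in a density plot is a kernel density estimate. The sketch below builds one by hand with NumPy, assuming a Gaussian kernel and a fixed bandwidth of 1.0. Pandas relies on SciPy and chooses the bandwidth automatically, so this is not the exact computation behind the plot above.

```python
import numpy as np

# Hypothetical observations (illustrative only).
values = np.array([1.0, 2.0, 2.5, 3.0, 7.0])
bandwidth = 1.0  # assumed smoothing width; larger values give smoother curves
grid = np.linspace(-3.0, 11.0, 281)

# Place one Gaussian bump on each observation and average the bumps.
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(values) * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 over a wide enough grid.
step = grid[1] - grid[0]
area = density.sum() * step
mode = grid[np.argmax(density)]
print("area under curve:", round(area, 3))
print("location of the peak:", round(mode, 2))
```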

#### Box and Whisker Plots
Another useful way to review the distribution of each attribute is to use Box and Whisker Plots, or boxplots for short. Boxplots summarize the distribution of each attribute, drawing a line for the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of the data). The whiskers give an idea of the spread of the data, and dots outside of the whiskers show candidate outlier values (values that lie more than 1.5 times the interquartile range, the spread of the middle 50% of the data, beyond the edge of the box).

In [None]:
# sharex=False and sharey=False give each attribute its own axis scales
df.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()
# Compare the spread of each attribute: a median or box pushed towards one
# end of the whiskers suggests a skewed attribute.
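
The numbers behind a boxplot can be computed directly. This sketch applies the 1.5 times interquartile range rule to made-up values to flag candidate outliers.

```python
import numpy as np

# Hypothetical values with one obvious extreme (illustrative only).
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                   # spread of the middle 50% of the data
lower_fence = q1 - 1.5 * iqr    # points below this are candidate outliers
upper_fence = q3 + 1.5 * iqr    # points above this are candidate outliers

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("box:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("candidate outliers:", outliers)
```

Here the box runs from 3 to 7, the fences sit at -3 and 13, and only the value 30 falls outside them.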

## 5. Prepare Your Data For Machine Learning 

Many machine learning algorithms make assumptions about your data.
It is often a very good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use.
In this section you will discover how to prepare your data for machine learning in Python using scikit-learn.
After completing this lesson you will know how to:

- Rescale data.
- Standardize data.
- Normalize data.
- Binarize data.

### 5.1. Rescale Data
When your data is composed of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all share the same scale. This is often referred to as normalization, and attributes are typically rescaled into the range between 0 and 1. Rescaling is useful for the optimization algorithms at the core of many machine learning algorithms, such as gradient descent. You can rescale your data with scikit-learn using the MinMaxScaler class.

In [None]:
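from sklearn.preprocessing import MinMaxScaler
array = df.values
# separate array into input and output components
X = array[:,0:1]  # first column as a 2-D feature matrix (the 0:1 slice keeps the column axis)
Y = array[:,1]    # second column as a 1-D array of class labels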

In [None]:
X

In [None]:
Y

In [None]:
scaler = MinMaxScaler()
rescaledX = scaler.fit_transform(X)
rescaledX
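
The arithmetic behind min-max rescaling is short enough to write out. The sketch below uses made-up arrays (`train` and `new` are hypothetical names) and learns the minimum and maximum from one set of rows before applying them to another, mirroring the separate fit and transform steps of a scaler.

```python
import numpy as np

# Hypothetical single-feature data (illustrative only).
train = np.array([[2.0], [4.0], [6.0], [10.0]])
new = np.array([[8.0], [12.0]])

# "Fit": learn the per-column minimum and maximum from the training rows.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "Transform": x_scaled = (x - min) / (max - min)
train_scaled = (train - col_min) / (col_max - col_min)
new_scaled = (new - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(new_scaled.ravel())
```

The training rows land exactly in the range 0 to 1, while unseen rows can fall outside it (12 maps to 1.25 here) because the range was learned from the training data only.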

## 6. Evaluate the Performance of Machine Learning Algorithms with Resampling
### 6.1. Split into Train and Test Sets 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

seed = 7
# hold out 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=seed)
model = LogisticRegression(solver='liblinear')
# Alternative models to try (each needs its own import from scikit-learn):
#model = LinearDiscriminantAnalysis()
#model = KNeighborsClassifier()
#model = GaussianNB()
#model = DecisionTreeClassifier()
#model = SVC()
#model = RandomForestClassifier(n_estimators=100, max_features=3)
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)  # mean accuracy on the held-out rows
print("Accuracy: %.3f%%" % (result*100.0))
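
Conceptually, a shuffled train/test split is a random permutation of the row indices cut at a chosen point. The sketch below shows that idea with NumPy on made-up data. It is not scikit-learn's exact implementation, and the `_demo` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed so the split is repeatable
n_samples = 10

# Hypothetical feature matrix and labels (illustrative only).
X_demo = np.arange(n_samples, dtype=float).reshape(-1, 1)
y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Shuffle the row indices, then cut off 30% of them for testing.
indices = rng.permutation(n_samples)
n_test = int(round(0.3 * n_samples))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]
print("train rows:", sorted(train_idx.tolist()))
print("test rows: ", sorted(test_idx.tolist()))
```

Every row ends up in exactly one of the two sets, which is what keeps the test accuracy an honest estimate of performance on unseen data.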

## 7. Machine Learning Algorithm Performance Metrics


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
seed = 7
# same 70/30 split as before, so the results are comparable
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
# rows are the actual classes, columns are the predicted classes
matrix = confusion_matrix(Y_test, predicted)
print(matrix)
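
To read a confusion matrix it helps to build one by hand. The sketch below tallies made-up true and predicted labels into a 2x2 table, with rows for the actual class and columns for the predicted class.

```python
import numpy as np

# Hypothetical true and predicted labels (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# demo_matrix[i, j] counts rows whose actual class is i and predicted class is j.
demo_matrix = np.zeros((2, 2), dtype=int)
for actual, guess in zip(y_true, y_pred):
    demo_matrix[actual, guess] += 1

accuracy = np.trace(demo_matrix) / demo_matrix.sum()  # correct predictions lie on the diagonal
print(demo_matrix)
print("accuracy:", accuracy)
```

The off-diagonal cells separate the two kinds of error: the top-right cell counts false positives and the bottom-left cell counts false negatives.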

### Classification Report

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
seed = 7
# same 70/30 split as before, so the results are comparable
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
# precision, recall, F1-score and support for each class
report = classification_report(Y_test, predicted)
print(report)
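
The columns of the classification report follow from the confusion-matrix counts. The sketch below works through the formulas for the positive class, using made-up counts rather than results from this dataset.

```python
# Hypothetical confusion-matrix counts for the positive class (illustrative only).
tn, fp, fn, tp = 50, 10, 5, 35

precision = tp / (tp + fp)  # of the rows predicted positive, how many were right
recall = tp / (tp + fn)     # of the rows actually positive, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = tp + fn           # number of actual positive rows

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} support={support}")
```

With these counts precision is 35/45, recall is 35/40 and the F1-score is 70/85, so the model misses fewer positives than it wrongly raises.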