### Tips:

* Review the numbers. Generating the summary statistics is not enough. Take a moment
to pause, read and really think about the numbers you are seeing.

* Ask why. Review your numbers and ask a lot of questions. How and why are you seeing
specific numbers? Think about how the numbers relate to the problem domain in general
and to the specific entities that the observations relate to.

* Write down ideas. Write down your observations and ideas.

In [None]:
# View first 20 rows
from pandas import read_csv
data = read_csv('https://oml-data.s3.amazonaws.com/kaggle-give-me-credit-train.csv', index_col=0)
peek = data.head(20)
print(peek)

In [None]:
# Dimensions of your data
shape = data.shape
print(shape)

In [None]:
# Data Types for Each Attribute
types = data.dtypes
print(types)

In [None]:
# Statistical summary for each column: it helps to review all of the data
from pandas import set_option

set_option('display.width', 100)
set_option('display.precision', 3)  # full option name; the short 'precision' is ambiguous in recent pandas
description = data.describe()
print(description)

In [None]:
# Just for classification data
# Class Distribution: to have a quick idea of the distribution

class_counts = data.groupby('SeriousDlqin2yrs').size()
print(class_counts)

In [None]:
# Pairwise Pearson correlations: correlation between all pairs of attributes

set_option('display.width', 100)
set_option('display.precision', 3)  # full option name; the short 'precision' is ambiguous in recent pandas
correlations = data.corr(method='pearson')
print(correlations)

In [None]:
# Skew for each attribute

skew = data.skew()
print(skew)

# The skew result shows a positive (right) or negative (left) skew. Values closer to zero show less skew.

If one tail is longer than another, the distribution is skewed. These distributions are sometimes called asymmetric.
A left-skewed distribution has a long left tail. Left-skewed distributions are also called negatively-skewed distributions. That’s because there is a long tail in the negative direction on the number line. The mean is also to the left of the peak.
A right-skewed distribution has a long right tail. Right-skewed distributions are also called positive-skew distributions. That’s because there is a long tail in the positive direction on the number line. The mean is also to the right of the peak.
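
To make the sign convention concrete, here is a minimal sketch that computes skewness by hand with only the standard library. It uses the simple population formula (third central moment divided by the variance raised to the power 1.5); `data.skew()` in pandas applies a bias correction, so its values will differ slightly, but the sign reads the same way. The `skewness` helper and the sample lists are made up for illustration.

```python
from statistics import fmean


def skewness(values):
    """Population skewness: third central moment / variance ** 1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples, chosen only to show the sign of the result
right_tailed = [1, 1, 2, 2, 3, 20]       # one large value stretches the right tail
left_tailed = [-20, -3, -2, -2, -1, -1]  # mirror image of the list above
symmetric = [1, 2, 3, 4, 5]

print("right-tailed:", skewness(right_tailed))  # positive value
print("left-tailed: ", skewness(left_tailed))   # negative value
print("symmetric:   ", skewness(symmetric))     # zero
```

Mirroring a sample flips the sign of its skew without changing the magnitude, which is why the first two results are opposites.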

In [None]:
# Univariate Histograms

from matplotlib import pyplot


data.hist(figsize=(15, 25))
pyplot.show()

# The age is nearly a Gaussian distribution
# The number of dependents may have an exponential distribution

In [None]:
# Univariate Density Plots

data.plot(kind='density', subplots=True, layout=(8,3), sharex=False, figsize=(15, 25))
pyplot.show()

In [None]:
# Box and Whisker Plots

data.plot(kind='box', subplots=True, sharex=False, layout=(8,3), sharey=False, figsize=(15, 25))
pyplot.show()

Boxplots summarize the distribution of each attribute, drawing a line for
the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of
the data). The whiskers give an idea of the spread of the data, and dots outside of the whiskers
show candidate outlier values (by default, values that lie more than 1.5 times the interquartile
range, the size of the middle 50% of the data, beyond the edges of the box).
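
The outlier rule behind those dots can be spelled out as a short sketch. It assumes the common convention of flagging points that lie more than 1.5 times the interquartile range beyond the quartiles; the `boxplot_outliers` helper and the sample values are hypothetical, and the quartile method used here may differ in small details from the one a plotting library applies.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return (q1, q3, outliers) under the whisker * IQR convention."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                 # spread of the middle 50% of the data
    lower = q1 - whisker * iqr    # anything below this is a candidate outlier
    upper = q3 + whisker * iqr    # anything above this is a candidate outlier
    outliers = [v for v in values if v < lower or v > upper]
    return q1, q3, outliers


# Made-up sample with one value far from the rest
sample = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
q1, q3, outliers = boxplot_outliers(sample)
print("box edges:", q1, q3)
print("candidate outliers:", outliers)
```

Flagged points are only candidates: in a skewed attribute, many legitimate values can fall outside the whiskers.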

Interactions between multiple variables:
    * Correlation Matrix Plot.
    * Scatter Plot Matrix.

Correlation matrix: gives an indication of how related the changes are between two variables.
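
As a rough illustration of what a single cell of the correlation matrix holds, the sketch below computes the Pearson coefficient for pairs of columns by hand. The `pearson` helper and the toy columns are made up for this example.

```python
from math import sqrt
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical columns
base = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with base: coefficient close to +1
falling = [10, 8, 6, 4, 2]   # moves against base: coefficient close to -1
v_shaped = [2, 1, 0, 1, 2]   # related to base, but not linearly: close to 0

for name, column in [("rising", rising), ("falling", falling), ("v_shaped", v_shaped)]:
    print(name, round(pearson(base, column), 3))
```

The last case is a reminder that a coefficient near zero only rules out a linear relationship, not every relationship.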

In [None]:
#Correlation matrix: correlation between each pair of attributes 
import numpy
from matplotlib import pyplot

correlations = data.corr()

# plot correlation matrix
fig = pyplot.figure(figsize=(15, 15))
ax = fig.add_subplot(111)  # three digits as one integer: "111" means a 1x1 grid, first subplot; "234" would mean a 2x3 grid, fourth subplot
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(len(correlations.columns))  # one tick per attribute
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(correlations.columns, rotation=40)
ax.set_yticklabels(correlations.columns)
pyplot.show()

In [None]:
# Correlation Matrix Plot (generic)
import numpy
from matplotlib import pyplot
correlations = data.corr()

# plot correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
pyplot.show()

In [None]:
# Scatterplot Matrix
from matplotlib import pyplot
from pandas.plotting import scatter_matrix

scatter_matrix(data, figsize=(15, 25))
pyplot.show()

In [None]:
# Rescale data (between 0 and 1)

from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

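# separate the data into input (X) and output (Y) components by column name,
# so the target column is never mixed into the inputs
X = data.drop(columns='SeriousDlqin2yrs').values
Y = data['SeriousDlqin2yrs'].values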
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

Standardization is a useful technique to transform attributes with a Gaussian distribution and
differing means and standard deviations to a standard Gaussian distribution with a mean of
0 and a standard deviation of 1.

In [None]:
# Standardize data (0 mean, 1 stdev)

from numpy import set_printoptions
from sklearn.preprocessing import StandardScaler

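# separate the data into input (X) and output (Y) components by column name,
# so the target column is never mixed into the inputs
X = data.drop(columns='SeriousDlqin2yrs').values
Y = data['SeriousDlqin2yrs'].values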
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

The values for each attribute now have a mean value of 0 and a standard deviation of 1.
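
That statement can be checked by hand. The following sketch standardizes one made-up column with the standard library, using the population standard deviation (the convention `StandardScaler` follows); the `standardize` helper and the `ages` values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Subtract the column mean, then divide by the population std."""
    mean = fmean(column)
    std = pstdev(column)
    return [(v - mean) / std for v in column]


# Made-up column
ages = [25, 35, 45, 55, 65]
z = standardize(ages)

print("z-scores:", [round(v, 3) for v in z])
print("mean is ~0:", abs(fmean(z)) < 1e-12)
print("std is ~1: ", abs(pstdev(z) - 1.0) < 1e-12)
```

Each standardized value reads as "how many standard deviations from the column mean", which puts attributes measured in different units on a comparable scale.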

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called
a unit norm or a vector with the length of 1 in linear algebra).
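
In other words, each row is divided by its own length. A minimal sketch of that arithmetic follows, assuming the Euclidean (L2) norm, which is the `Normalizer` default; the `normalize_row` helper and the rows are made up for illustration.

```python
from math import sqrt


def normalize_row(row):
    """Scale one observation so that its Euclidean length is 1."""
    length = sqrt(sum(v * v for v in row))
    if length == 0:
        return list(row)  # an all-zero row has no direction; leave it unchanged
    return [v / length for v in row]


# Hypothetical observations (rows), not columns
rows = [[3.0, 4.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0]]
for row in rows:
    unit = normalize_row(row)
    print(unit, "length:", sqrt(sum(v * v for v in unit)))
```

Unlike rescaling and standardization, which work column by column, this operates on each row independently, so no statistics are learned from the rest of the data.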

In [None]:
# Replace missing values with 0: required for the normalization and binarization steps
data.fillna(0,inplace=True)
print(data)

In [None]:
# Normalize data (length of 1)
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
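# separate the data into input (X) and output (Y) components by column name,
# so the target column is never mixed into the inputs
X = data.drop(columns='SeriousDlqin2yrs').values
Y = data['SeriousDlqin2yrs'].values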
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

The rows are normalized to length 1.

In [None]:
# Binarize data (values above the threshold become 1, all others become 0)
from numpy import set_printoptions
from sklearn.preprocessing import Binarizer
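# separate the data into input (X) and output (Y) components by column name,
# so the target column is never mixed into the inputs
X = data.drop(columns='SeriousDlqin2yrs').values
Y = data['SeriousDlqin2yrs'].values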
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])