#### Summary Statistics 


The pandas package offers several methods that help summarize data. The
DataFrame method describe() gives an overview of the entire set of variables in the data.
The methods mean(), std(), min(), max(), and median(), together with the built-in
function len(), are also very helpful for learning about the characteristics of each
variable. First, they tell us about the scale and type of values that the variable takes.
The min and max statistics can be used to detect extreme values that might be errors.
The mean and median give a sense of the central values of the variable, and a large gap
between the two indicates skew. The standard deviation gives a sense of how dispersed
the data are around the mean. Further options, such as the combination .isnull().sum(),
which gives the number of null values, tell us about missing values.
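
As a minimal sketch of how these summaries can be read together, the toy data frame below (hypothetical values, not taken from the Boston Housing data) contains one right-skewed column with a single missing entry. The column names and numbers are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" is right-skewed and has one missing value
toy_df = pd.DataFrame({
    "income": [20, 22, 25, 27, 30, 400, np.nan],
    "rooms": [4, 5, 5, 6, 6, 7, 8],
})

summary = toy_df.describe()        # count, mean, std, min, quartiles, max
missing = toy_df.isnull().sum()    # number of missing values per column

# A mean far above the median hints at right skew (here driven by the value 400)
mean_income = toy_df["income"].mean()
median_income = toy_df["income"].median()

print(summary)
print(missing)
print("mean:", mean_income, "median:", median_income)
```

Here the single extreme value pulls the mean well above the median, and the same value shows up as the max in the describe() output, which is the kind of entry worth checking for errors.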

Approaches to reducing the number of variables or categories include:

(1) incorporating domain knowledge to remove or combine categories,
(2) using data summaries to detect information overlap between variables
(and remove or combine redundant variables or categories), and
(3) using data conversion techniques such as converting categorical variables into numerical variables.
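
One possible illustration of item (3) is converting a categorical variable into 0/1 dummy variables with pd.get_dummies. The column name RAD_level and its values are hypothetical and serve only as an example; dropping the first category is one common choice, not the only one.

```python
import pandas as pd

# Hypothetical categorical variable (not a column of the Boston Housing data)
toy_df = pd.DataFrame({"RAD_level": ["low", "high", "medium", "low"]})

# One 0/1 column per category; drop_first removes the first (alphabetical) category
# so that the remaining columns are not perfectly redundant
dummies = pd.get_dummies(toy_df, columns=["RAD_level"], drop_first=True, dtype=int)

print(dummies)
```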

In [None]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import matplotlib.pyplot as plt

It is likely that subsets of variables are highly correlated with each other.
Including highly correlated variables in a classification or prediction model, or including
variables that are unrelated to the outcome of interest, can lead to overfitting, and
accuracy and reliability can suffer. A large number of variables also poses computational
problems for some supervised as well as unsupervised algorithms (aside from questions
of correlation). In model deployment, superfluous variables can increase costs due to the
collection and processing of these variables.
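
A rough way to screen for highly correlated pairs is sketched below on synthetic data. The 0.9 cutoff and the variable names a, b, and c are assumptions made for illustration; an appropriate threshold depends on the application.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is almost a linear copy of a, c is unrelated
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.9  # assumed cutoff for "highly correlated"
corr = toy_df.corr().abs()

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]

print(high_pairs)
```

Pairs flagged this way are candidates for removal or combination, subject to domain knowledge about which variable is more useful or cheaper to collect.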

In [None]:
bostonHousing_df = pd.read_csv('https://raw.githubusercontent.com/reisanar/datasets/master/BostonHousing.csv')

In [None]:
Description of Variables in the Boston Housing Dataset

| Variable | Description |
|---|---|
| CRIM | Crime rate |
| ZN | Percentage of residential land zoned for lots over 25,000 ft² |
| INDUS | Percentage of land occupied by nonretail business |
| CHAS | Does tract bound Charles River? (= 1 if tract bounds river, = 0 otherwise) |
| NOX | Nitric oxide concentration (parts per 10 million) |
| RM | Average number of rooms per dwelling |
| AGE | Percentage of owner-occupied units built prior to 1940 |
| DIS | Weighted distances to five Boston employment centers |
| RAD | Index of accessibility to radial highways |
| TAX | Full-value property tax rate per $10,000 |
| PTRATIO | Pupil-to-teacher ratio by town |
| LSTAT | Percentage of lower status of the population |
| MEDV | Median value of owner-occupied homes in $1000s |
| CAT.MEDV | Is median value of owner-occupied homes in tract above $30,000 (CAT.MEDV = 1) or not (CAT.MEDV = 0)? |

In [None]:
bostonHousing_df = bostonHousing_df.rename(columns={"CAT. MEDV": "CAT_MEDV"})
bostonHousing_df.head(5)

In [None]:
bostonHousing_df.describe()

In [None]:
# Compute mean, standard deviation, min, max, median, length, and missing values of
# CRIM
print('Mean : ', bostonHousing_df.CRIM.mean())
print('Std. dev : ', bostonHousing_df.CRIM.std())
print('Min : ', bostonHousing_df.CRIM.min())
print('Max : ', bostonHousing_df.CRIM.max())
print('Median : ', bostonHousing_df.CRIM.median())
print('Length : ', len(bostonHousing_df.CRIM))
print('Number of missing values : ', bostonHousing_df.CRIM.isnull().sum())

In [None]:
# Compute mean, standard dev., min, max, median, length, and missing values for all
# variables
pd.DataFrame({
    'mean': bostonHousing_df.mean(),
    'sd': bostonHousing_df.std(),
    'min': bostonHousing_df.min(),
    'max': bostonHousing_df.max(),
    'median': bostonHousing_df.median(),
    'length': len(bostonHousing_df),
    'miss.val': bostonHousing_df.isnull().sum(),
})

In [None]:
We can also summarize relationships between two or more variables. For numerical
variables, we can compute a complete matrix of correlations between each pair of
variables, using the pandas method corr(). We see that most correlations are low and that
many are negative. Recall also the visual display of a correlation matrix via a heatmap.

In [None]:
bostonHousing_df.corr().round(2)

In [None]:
#### Heatmaps: Visualizing Correlations and Missing Values

A heatmap is a graphical display of numerical data where color is used to denote values.
In a data mining context, heatmaps are especially useful for two purposes: for visualizing
correlation tables and for visualizing missing values in the data. In both cases, the
information is conveyed in a two-dimensional table. A correlation table for p variables has
p rows and p columns. A data table contains p columns (variables) and n rows
(observations). If the number of rows is huge, then a subset can be used. In both cases, it
is much easier and faster to scan the color-coding than the values. Note that
heatmaps are useful when examining a large number of values, but they are not a
replacement for more precise graphical displays, such as bar charts, because color
differences cannot be perceived accurately.
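
The missing-values use case can be sketched as follows. The toy data frame is hypothetical, and the plot uses plain matplotlib imshow rather than seaborn to keep the example small; rows are observations, columns are variables, and dark cells mark missing entries.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
toy_df = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0],
    "x2": [np.nan, np.nan, 1.5, 2.5],
    "x3": [7.0, 8.0, 9.0, 10.0],
})

missing_mask = toy_df.isnull()  # True where a value is missing

# Draw the mask as an image: 1 (dark) = missing, 0 (light) = present
fig, ax = plt.subplots()
ax.imshow(missing_mask.to_numpy().astype(int), aspect="auto",
          cmap="Greys", vmin=0, vmax=1)
ax.set_xticks(range(len(toy_df.columns)))
ax.set_xticklabels(toy_df.columns)
ax.set_ylabel("observation")
```

With many rows, patterns such as a variable that is missing for a whole block of observations become visible at a glance.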

In [None]:
## simple heatmap of correlations (without values)
corr = bostonHousing_df.corr()

In [None]:
import seaborn as sns

In [None]:
#sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)

In [None]:
# Change the colormap to a divergent scale and fix the range of the colormap
#sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, vmin=-1,vmax=1, cmap = 'RdBu')
# Include information about values (this example also demonstrates how to control the size of the plot)
fig, ax = plt.subplots()
fig.set_size_inches(11, 7)
sns.heatmap(corr, annot=True, fmt='.1f', cmap='RdBu', center=0, ax=ax)

In [None]:
#### Aggregation and Pivot Tables

In [None]:
bostonHousing_df.CHAS.value_counts()

The following code aggregates MEDV by CHAS and RM. First, create bins of size 1 for RM
using the method pd.cut. By default, the method creates a categorical variable whose
values are intervals, e.g. (6, 7]. The argument labels=False returns integer bin
indices instead, e.g. 6:

bostonHousing_df["RM_bin"] = pd.cut(bostonHousing_df.RM, range(0, 10), labels=False)

Then compute the average of MEDV by (binned) RM and CHAS. Group the data frame using
the groupby method, restrict the analysis to MEDV, and determine the mean for each group:

bostonHousing_df.groupby(["RM_bin", "CHAS"])["MEDV"].mean()
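
To make the effect of labels=False concrete, here is a small sketch on three hypothetical room counts, using the same bin edges range(0, 10) as above.

```python
import pandas as pd

rooms = pd.Series([5.5, 6.2, 7.9])  # hypothetical values of RM

# Default: a categorical variable whose values are right-closed intervals
as_intervals = pd.cut(rooms, range(0, 10))

# labels=False: the integer position of the bin each value falls into
as_integers = pd.cut(rooms, range(0, 10), labels=False)

print(as_intervals.tolist())
print(as_integers.tolist())
```

Because the bin edges start at 0 and have width 1, the integer bin index coincides with the lower edge of the interval, which is why a value of 6.2 ends up in bin 6.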

In [None]:
bostonHousing_df["RM_bin"] = pd.cut(bostonHousing_df.RM, range(0, 10), labels=False)  # bin edges 0, 1, ..., 9

In [None]:
bostonHousing_df["RM_bin"]

In [None]:
bostonHousing_df.groupby(["RM_bin", "CHAS"])["MEDV"].mean()

Another useful method is pivot_table() in the pandas package, which creates pivot tables by reshaping the data according to the aggregating variables of our choice. For example, the code below computes the average of MEDV by CHAS and RM and presents it as a pivot table.

In [None]:
# aggfunc="mean" avoids the FutureWarning that recent pandas versions raise for np.mean
pd.pivot_table(bostonHousing_df, values="MEDV", index=["RM_bin"], columns=["CHAS"], aggfunc="mean", margins=True)
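
As a small check of how the two approaches relate, the sketch below builds the same table twice on hypothetical data: once with groupby followed by unstack, and once with pivot_table (without margins). The values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with the same column names as in the text
toy_df = pd.DataFrame({
    "RM_bin": [5, 5, 6, 6, 6],
    "CHAS": [0, 1, 0, 0, 1],
    "MEDV": [20.0, 30.0, 22.0, 26.0, 40.0],
})

# groupby gives a long result; unstack moves CHAS into the columns
grouped = toy_df.groupby(["RM_bin", "CHAS"])["MEDV"].mean().unstack("CHAS")

# pivot_table produces the wide layout directly
pivoted = pd.pivot_table(toy_df, values="MEDV", index="RM_bin",
                         columns="CHAS", aggfunc="mean")

print(grouped)
print(pivoted)
```

Adding margins=True to pivot_table appends row and column averages labeled "All", which the groupby version does not provide on its own.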