# Peek at Your Data
There is no substitute for looking at the raw data; it can reveal insights you cannot get any other way.


In [4]:
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
data = read_csv(filename, names = names)
data.head(20)

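|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |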


# Data Dimensions
You must have a good handle on how much data you have, in terms of both rows and columns.
- Too many rows and algorithms may take too long to train; too few and you may not have enough data to train them.
- Too many features and some algorithms can be distracted or perform poorly due to the curse of dimensionality.

In [6]:
#Print dimensions
print(data.shape)

(768, 9)


# Data Type for Each Attribute
Understanding the type of each attribute is important. Strings may need to be converted to floating point values or integers, or encoded to represent categorical or ordinal values.

In [8]:
#Print attribute data types
data.dtypes

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object
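
The Pima attributes are already numeric, so no conversion is needed here. As a minimal sketch of the conversions mentioned above, the example below uses a small made-up DataFrame; the column names `dose` and `grade` and their values are hypothetical and are not part of this dataset.

```python
import pandas as pd

# Hypothetical toy data: both columns arrive as strings.
raw = pd.DataFrame({
    "dose": ["1.5", "2.0", "n/a"],        # numeric values stored as text
    "grade": ["low", "high", "medium"],   # ordinal categories
})

# Convert text to floats; values that cannot be parsed become NaN.
raw["dose"] = pd.to_numeric(raw["dose"], errors="coerce")

# Represent an ordinal attribute with an explicit category order.
grade_type = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
raw["grade"] = raw["grade"].astype(grade_type)

print(raw.dtypes)
# Integer codes follow the declared order of the categories.
print(raw["grade"].cat.codes.tolist())
```

Using `errors="coerce"` is one choice among several; it silently turns unparseable entries into missing values, so it is worth counting the resulting NaNs afterwards.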

# Descriptive Statistics
Descriptive statistics can give you great insight into the shape of each attribute. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute. They are:
- Count
- Mean
- Standard Deviation
- Minimum Value
- 25th, 50th (median), and 75th Percentiles
- Maximum Value

In [11]:
from pandas import set_option
set_option('display.width', 100)
set_option('display.precision', 3)  # full option name; the short 'precision' is ambiguous in newer pandas
data.describe()

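|       | preg | plas | pres | skin | test | mass | pedi | age | class |
|-------|------|------|------|------|------|------|------|-----|-------|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845 | 120.895 | 69.105 | 20.536 | 79.799 | 31.993 | 0.472 | 33.241 | 0.349 |
| std | 3.37 | 31.973 | 19.356 | 15.952 | 115.244 | 7.884 | 0.331 | 11.76 | 0.477 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.244 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.372 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.626 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |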


The calls to pandas.set_option() change the displayed precision of the numbers and the preferred width of the output, which makes the table more readable.
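
Display options set this way stay in effect for the rest of the session. If that is not wanted, one alternative (not used in this notebook) is `pandas.option_context`, which applies options only inside a `with` block. A minimal sketch, using a made-up one-column DataFrame:

```python
import pandas as pd

# Made-up values, only to have something to print.
df = pd.DataFrame({"x": [1.23456789, 2.3456789]})

before = pd.get_option("display.precision")

# Options listed here apply only inside the with-block.
with pd.option_context("display.precision", 3, "display.width", 100):
    inside = pd.get_option("display.precision")
    print(df)

# After the block, the previous setting is restored.
after = pd.get_option("display.precision")
print(before, inside, after)
```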

# Class Distribution
In classification problems you need to know how balanced the class values are. Highly imbalanced problems are common and may need special handling in the data preparation stage of your project.

In [12]:
class_counts = data.groupby('class').size()
class_counts

class
0    500
1    268
dtype: int64
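
Raw counts are easier to judge as proportions. The sketch below rebuilds a label column from the counts reported above (500 and 268) rather than reading the file, so it runs on its own; `imbalance_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Rebuild a label column with the counts reported above: 500 zeros, 268 ones.
labels = pd.Series([0] * 500 + [1] * 268, name="class")

counts = labels.value_counts().sort_index()
proportions = labels.value_counts(normalize=True).sort_index()

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(round(imbalance_ratio, 2))
```

On the real DataFrame the same proportions should be available from `data['class'].value_counts(normalize=True)`.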

# Correlations Between Attributes
The most common method for calculating correlation is Pearson's correlation coefficient, which assumes that the attributes involved are normally distributed. A correlation of -1 or 1 indicates a perfect negative or positive correlation, respectively. A value of 0 indicates no linear correlation at all.

In [14]:
#Pairwise Pearson correlations
set_option('display.width', 100)
set_option('display.precision', 3)  # full option name; the short 'precision' is ambiguous in newer pandas
correlations = data.corr(method = 'pearson')
correlations

|       | preg | plas | pres | skin | test | mass | pedi | age | class |
|-------|------|------|------|------|------|------|------|-----|-------|
| preg | 1.0 | 0.129 | 0.141 | -0.082 | -0.074 | 0.018 | -0.034 | 0.544 | 0.222 |
| plas | 0.129 | 1.0 | 0.153 | 0.057 | 0.331 | 0.221 | 0.137 | 0.264 | 0.467 |
| pres | 0.141 | 0.153 | 1.0 | 0.207 | 0.089 | 0.282 | 0.041 | 0.24 | 0.065 |
| skin | -0.082 | 0.057 | 0.207 | 1.0 | 0.437 | 0.393 | 0.184 | -0.114 | 0.075 |
| test | -0.074 | 0.331 | 0.089 | 0.437 | 1.0 | 0.198 | 0.185 | -0.042 | 0.131 |
| mass | 0.018 | 0.221 | 0.282 | 0.393 | 0.198 | 1.0 | 0.141 | 0.036 | 0.293 |
| pedi | -0.034 | 0.137 | 0.041 | 0.184 | 0.185 | 0.141 | 1.0 | 0.034 | 0.174 |
| age | 0.544 | 0.264 | 0.24 | -0.114 | -0.042 | 0.036 | 0.034 | 1.0 | 0.238 |
| class | 0.222 | 0.467 | 0.065 | 0.075 | 0.131 | 0.293 | 0.174 | 0.238 | 1.0 |
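
To make the coefficient less of a black box, the sketch below computes Pearson's r by hand (the covariance of two columns divided by the product of their standard deviations) and compares it with `DataFrame.corr`. The columns `x` and `y` and their values are made up for illustration, and `pearson` is a helper written here, not a pandas function.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not taken from the dataset).
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.1, 5.9, 8.2, 9.8],
})

def pearson(a, b):
    """Sum of co-deviations divided by the root of the product of squared deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

manual = pearson(df["x"], df["y"])
builtin = df.corr(method="pearson").loc["x", "y"]

# The two values should agree up to floating point error.
print(manual, builtin)
```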


# Skew of Univariate Distributions
Skew refers to a distribution that is assumed to be Gaussian but is shifted or squashed in one direction or another. Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute is skewed lets you apply data preparation techniques to correct the skew and later improve the accuracy of your models. A positive value indicates a right skew (long upper tail), a negative value a left skew, and values near zero indicate little skew.

In [15]:
#Calculate skew
skew = data.skew()
skew

preg     0.902
plas     0.174
pres    -1.844
skin     0.109
test     2.272
mass    -0.429
pedi     1.920
age      1.130
class    0.635
dtype: float64
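
One common way to reduce a positive skew is a log transform. The sketch below is an illustration on made-up, exponentially spaced values rather than on the Pima attributes; several of those contain zeros, for which a plain `log` is undefined and something like `numpy.log1p` would be needed instead.

```python
import numpy as np
import pandas as pd

# Made-up, strongly right-skewed values: exponentially spaced points.
values = pd.Series(np.exp(np.linspace(0, 5, 50)))

skew_before = values.skew()

# Taking the log makes these points evenly spaced, hence symmetric.
transformed = np.log(values)
skew_after = transformed.skew()

print(skew_before, skew_after)
```

Real attributes will rarely become perfectly symmetric; the point is only that the skew statistic should move toward zero after a suitable transform.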