# Understand Your Data

> The pipeline for understanding your data is below. Use it as a general guide, because everything in data science is case by case. Dataset-specific quirks can creep in and break a rigid pipeline, which is why I try to make this tutorial as broadly applicable as possible.

> Where possible, I emphasize the specific methods that I think are essential to the pipeline.

1. Use the **head**, **tail**, or **sample** method to take a peek at your raw data.
2. Review the dimensions of your dataset via the **shape** attribute.
3. See the data types and non-null value counts of each column using the **info** method.
4. Summarize your data using descriptive statistics via the **describe** method.
5. Summarize the distribution of instances across classes in your dataset, usually using the **groupby**, **value_counts**, and **size** methods.
6. Understand the relationships in your data using correlations via the **corr** method.
7. Review the skew of the distributions of each attribute via the **skew** method.
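As a rough sketch, the seven steps can be chained on any DataFrame. The toy data and the `quick_look` helper below are made up for illustration; they are not part of the Facebook dataset used in the rest of this tutorial.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for this sketch
toy = pd.DataFrame({
    "kind": ["a", "b", "a", "a", "b", "a"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.1, 5.9, 8.2, 9.8, 12.1],
})


def quick_look(frame, class_col):
    """Collect the results of the seven inspection steps in a dict."""
    numeric = frame.select_dtypes(include="number")  # corr and skew need numeric columns
    return {
        "peek": frame.head(3),                            # step 1
        "shape": frame.shape,                             # step 2
        "non_null": frame.notnull().sum(),                # step 3 (info prints these counts)
        "dtypes": frame.dtypes,                           # step 3
        "summary": frame.describe(),                      # step 4
        "class_counts": frame[class_col].value_counts(),  # step 5
        "corr": numeric.corr(method="pearson"),           # step 6
        "skew": numeric.skew(),                           # step 7
    }


report = quick_look(toy, class_col="kind")
print(report["shape"])
print(report["class_counts"])
```

Returning a dict instead of printing keeps each result available for later checks. The sections below walk through the same steps one at a time on the real dataset.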

In [2]:
import pandas as pd 
import numpy as np 
from pandas import Series, DataFrame

### Preview the DataFrame via Head, Tail, or Sample

We're going to continue with the Facebook dataset that we previewed earlier.

   (Moro et al., 2016) S. Moro, P. Rita and B. Vala. Predicting social media performance metrics and evaluation 
   of the impact on brand building: A data mining approach. Journal of Business Research, Elsevier, In press.

In [3]:
file_path = './data_files/dataset_Facebook.csv' # specify the file_path
df = pd.read_csv(file_path, delimiter=';') # specify the delimiter as semi-colon and read the file via read_csv method

In [4]:
df.head(3) # return first three rows of DataFrame via head method

Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
0,139441,Photo,2,12,4,3,0.0,2752,5091,178,109,159,3078,1640,119,4,79.0,17.0,100
1,139441,Status,2,12,3,10,0.0,10460,19057,1457,1361,1674,11710,6112,1108,5,130.0,29.0,164
2,139441,Photo,3,12,3,3,0.0,2413,4373,177,113,154,2812,1503,132,0,66.0,14.0,80


### Check Dimensionality Via Shape Method

Knowing the dimensions up front matters for two reasons:

1. Too many rows can make some algorithms very slow to train.
2. Too many features can lead to what is called the curse of dimensionality, which makes the data much harder to model.

In [5]:
df.shape # the dimensions (rows x columns)

(500, 19)
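
If the row count ever becomes a problem, the sample method from step 1 can draw a random subset to work on. This is a small sketch on a made-up frame; the column names, the sizes, and the `random_state` value are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical frame with many rows; the values are arbitrary
big = pd.DataFrame({"row_id": range(10000), "value": range(10000)})

n_rows, n_cols = big.shape                  # shape is a (rows, columns) tuple
subset = big.sample(n=500, random_state=0)  # 500 random rows, reproducible via random_state

print(n_rows, n_cols)
print(subset.shape)
```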

In [6]:
df.columns # view our columns which are also called features

Index([u'Page total likes', u'Type', u'Category', u'Post Month',
       u'Post Weekday', u'Post Hour', u'Paid', u'Lifetime Post Total Reach',
       u'Lifetime Post Total Impressions', u'Lifetime Engaged Users',
       u'Lifetime Post Consumers', u'Lifetime Post Consumptions',
       u'Lifetime Post Impressions by people who have liked your Page',
       u'Lifetime Post reach by people who like your Page',
       u'Lifetime People who have liked your Page and engaged with your post',
       u'comment', u'like', u'share', u'Total Interactions'],
      dtype='object')

### Observe How Many Non-Null Values You Have

Null values are placeholders that denote the absence of a value. They are "empty," so to speak.

In [7]:
df.info() # tells us how many non-null values we have for each feature, along with the data type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
Page total likes                                                       500 non-null int64
Type                                                                   500 non-null object
Category                                                               500 non-null int64
Post Month                                                             500 non-null int64
Post Weekday                                                           500 non-null int64
Post Hour                                                              500 non-null int64
Paid                                                                   499 non-null float64
Lifetime Post Total Reach                                              500 non-null int64
Lifetime Post Total Impressions                                        500 non-null int64
Lifetime Engaged Users                                                 500 non-nul

Type is the only feature that has the data type "object." We'll see what that means very shortly. As a rule of thumb, object columns are typically categorical variables.
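
The info method only prints its report. When you want the same facts as values you can work with, one option is to count nulls with `isnull().sum()` and list the non-numeric columns with `select_dtypes`. The frame below is made-up toy data, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few deliberately missing values
toy = pd.DataFrame({
    "Type": ["Photo", "Status", "Photo", "Link"],
    "Paid": [0.0, 1.0, np.nan, 0.0],
    "like": [10.0, 25.0, 7.0, np.nan],
    "comment": [1, 3, 0, 2],
})

missing = toy.isnull().sum()  # number of null values per column
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()  # candidate categoricals

print(missing)
print(non_numeric)
```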

In [8]:
df.describe() # overview of aggregate statistics 

Unnamed: 0,Page total likes,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
count,500.0,500.0,500.0,500.0,500.0,499.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,499.0,496.0,500.0
mean,123194.176,1.88,7.038,4.15,7.84,0.278557,13903.36,29585.95,920.344,798.772,1415.13,16766.38,6585.488,609.986,7.482,177.945892,27.266129,212.12
std,16272.813214,0.852675,3.307936,2.030701,4.368589,0.448739,22740.78789,76803.25,985.016636,882.505013,2000.594118,59791.02,7682.009405,612.725618,21.18091,323.398742,42.613292,380.233118
min,81370.0,1.0,1.0,1.0,1.0,0.0,238.0,570.0,9.0,9.0,9.0,567.0,236.0,9.0,0.0,0.0,0.0,0.0
25%,112676.0,1.0,4.0,2.0,3.0,0.0,3315.0,5694.75,393.75,332.5,509.25,3969.75,2181.5,291.0,1.0,56.5,10.0,71.0
50%,129600.0,2.0,7.0,4.0,9.0,0.0,5281.0,9051.0,625.5,551.5,851.0,6255.5,3417.0,412.0,3.0,101.0,19.0,123.5
75%,136393.0,3.0,10.0,6.0,11.0,1.0,13168.0,22085.5,1062.0,955.5,1463.0,14860.5,7989.0,656.25,7.0,187.5,32.25,228.5
max,139441.0,3.0,12.0,7.0,23.0,1.0,180480.0,1110282.0,11452.0,11328.0,19779.0,1107833.0,51456.0,4376.0,372.0,5172.0,790.0,6334.0
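
By default, describe only summarizes numeric columns, which is why Type is missing from the table above. Calling describe on a non-numeric column directly gives a different summary: the count, the number of unique values, the most frequent value, and its frequency. The series here is a small made-up example.

```python
import pandas as pd

# Hypothetical values for a Type-like column
kinds = pd.Series(["Photo", "Photo", "Status", "Photo", "Link"], name="Type")

summary = kinds.describe()  # count, unique, top, freq for non-numeric data
print(summary)
```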


### Class Distribution For Categorical Variables 

In data science you need to know how balanced the class values are for categorical variables. To understand why, suppose you have a highly imbalanced dataset of assembly line information where 99% of your labels are "not defect". If you train your model on that dataset, it will be very familiar with what constitutes "not defect" and have little information on what constitutes "defect". Put another way, the class priors are biased strongly in favor of the majority class. The minority class does not have enough data points to push the classifier toward correcting its misclassifications on that class.

To fix the class imbalance problem you need special handling in the data preparation stage of your project, which we won't get into in this module. First, though, you can quickly get an idea of the distribution of the class attribute in Pandas.

### Unique 

> When it comes to categorical variables, I use the unique method to view how many categories we have for that feature.

> Let's first confirm our suspicion that Type is a categorical feature via the unique method.

In [9]:
df['Type'].unique() # we see that Type is a categorical variable with four values: Photo, Status, Link and Video

array(['Photo', 'Status', 'Link', 'Video'], dtype=object)

Ok, so we've confirmed that Type is categorical, but how can we tell Pandas that this is a categorical data type, not just "object"? Luckily, we have the astype method that we can call.

In [10]:
df["Type"] = df["Type"].astype('category') # tell the DataFrame that Type is a category via the astype method

Now let's check the data type of the Type feature via the dtype attribute.

In [11]:
df['Type'].dtype # bam! 

category
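
Once a column is categorical, the `.cat` accessor exposes the category labels and the integer codes pandas stores for each row. A minimal sketch on a made-up series:

```python
import pandas as pd

# Hypothetical values for a Type-like column
kinds = pd.Series(["Photo", "Status", "Link", "Video", "Photo"]).astype("category")

categories = list(kinds.cat.categories)  # labels, sorted alphabetically by default
codes = kinds.cat.codes.tolist()         # position of each row's label in categories

print(categories)
print(codes)
```

Each code is an index into the category list, so every row is stored as a small integer rather than a repeated string.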

In [12]:
df.info() # note that the Type feature now has a data type of "category" instead of "object"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
Page total likes                                                       500 non-null int64
Type                                                                   500 non-null category
Category                                                               500 non-null int64
Post Month                                                             500 non-null int64
Post Weekday                                                           500 non-null int64
Post Hour                                                              500 non-null int64
Paid                                                                   499 non-null float64
Lifetime Post Total Reach                                              500 non-null int64
Lifetime Post Total Impressions                                        500 non-null int64
Lifetime Engaged Users                                                 500 non-n

### Value Counts

Now we're curious about the distribution of Type, so we can use the value_counts method.

In [13]:
df['Type'].value_counts() # clearly the majority of the posts are photos

Photo     426
Status     45
Link       22
Video       7
Name: Type, dtype: int64
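
Raw counts are often easier to judge as proportions. The sketch below rebuilds a Type column from the four counts printed above (so it runs without the CSV file), then uses `value_counts(normalize=True)` for the class shares and `groupby` with `size` as an equivalent way to get the counts.

```python
import pandas as pd

# Counts copied from the value_counts output above
counts = {"Photo": 426, "Status": 45, "Link": 22, "Video": 7}
posts = pd.DataFrame({"Type": [kind for kind, n in counts.items() for _ in range(n)]})

shares = posts["Type"].value_counts(normalize=True)  # fraction of rows per class
sizes = posts.groupby("Type").size()                 # same counts via groupby + size

print(shares)
print(sizes)
```

Photo makes up about 85% of the posts, so Type is a clearly imbalanced feature.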

### Correlations

> Machine learning algorithms like linear and logistic regression can suffer from
poor performance if there is high collinearity among features in your dataset. 

In [14]:
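# corr returns the correlation matrix; pearson is the default method, and 'kendall' and 'spearman' are alternatives. select_dtypes keeps only the numeric columns, because pandas 2.0 and later raise an error on non-numeric ones such as Type
df.select_dtypes(include='number').corr(method='pearson')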

Unnamed: 0,Page total likes,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
Page total likes,1.0,-0.091142,0.941192,-0.005401,-0.143807,0.005341,-0.083245,-0.10254,-0.111922,-0.149129,-0.12824,-0.096109,-0.060516,0.033699,0.031891,0.053276,-0.004859,0.045231
Category,-0.091142,1.0,-0.12769,-0.053239,-0.107383,-0.022474,-0.142073,-0.094368,0.003392,-0.031172,-0.149443,-0.047803,-0.104456,0.021569,0.027842,0.126786,0.149211,0.127307
Post Month,0.941192,-0.12769,1.0,0.01705,-0.17639,-0.018934,-0.102506,-0.101616,-0.115898,-0.147083,-0.142829,-0.094624,-0.092012,0.010956,0.006174,0.025633,-0.021859,0.018362
Post Weekday,-0.005401,-0.053239,0.01705,1.0,0.045857,-0.001963,-0.050155,-0.033674,-0.048382,-0.029602,-0.021565,-0.046442,-0.068741,0.001144,-0.077209,-0.082322,-0.048713,-0.081049
Post Hour,-0.143807,-0.107383,-0.17639,0.045857,1.0,-0.069464,0.003338,0.012747,0.003879,0.012222,0.078759,0.038892,0.052412,0.038011,0.000922,-0.024523,-0.05868,-0.027421
Paid,0.005341,-0.022474,-0.018934,-0.001963,-0.069464,1.0,0.146631,0.062564,0.117014,0.097679,0.097462,0.003211,0.110043,0.054163,0.075761,0.110694,0.076821,0.107739
Lifetime Post Total Reach,-0.083245,-0.142073,-0.102506,-0.050155,0.003338,0.146631,1.0,0.694926,0.570629,0.477908,0.324362,0.322254,0.743053,0.400756,0.427155,0.545185,0.456312,0.538597
Lifetime Post Total Impressions,-0.10254,-0.094368,-0.101616,-0.033674,0.012747,0.062564,0.694926,1.0,0.368553,0.315201,0.226081,0.850787,0.651933,0.323843,0.316612,0.345091,0.286829,0.343358
Lifetime Engaged Users,-0.111922,0.003392,-0.115898,-0.048382,0.003879,0.117014,0.570629,0.368553,1.0,0.968213,0.67684,0.260346,0.61208,0.839279,0.505806,0.569565,0.531261,0.572159
Lifetime Post Consumers,-0.149129,-0.031172,-0.147083,-0.029602,0.012222,0.097679,0.477908,0.315201,0.968213,1.0,0.706666,0.222941,0.503847,0.81351,0.334621,0.349152,0.343048,0.354502


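Note that the diagonal entries are 1.0, because every feature is perfectly correlated with itself, and that the matrix is symmetric, because the correlation of A with B equals the correlation of B with A.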
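
Because the matrix is symmetric, only the entries above the diagonal are needed to list the strongly correlated pairs. The following is one possible sketch on made-up toy data; the 0.9 threshold is an arbitrary choice for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is nearly a multiple of a, c is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [3, 1, 4, 1, 5, 2],
    "label": ["x", "y", "x", "y", "x", "y"],
})

# Correlate numeric columns only
corr = toy.select_dtypes(include="number").corr(method="pearson")

# Keep the upper triangle (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Comparisons with NaN are False, so masked-out entries never pass the filter
strong = pairs[pairs.abs() > 0.9]
print(strong)
```

In the real output above, pairs such as Page total likes with Post Month (0.94) and Lifetime Engaged Users with Lifetime Post Consumers (0.97) would pass this kind of filter.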

### Skew

> Some models assume a nearly normal distribution. Knowing that an attribute is skewed lets you apply data preparation, such as a log transformation, to correct the skew. If the data is not normal, visualizing it can help us hypothesize which distribution it came from.

In [15]:
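df.select_dtypes(include='number').skew() # values closer to zero show less skew; numeric columns only, since pandas 2.0 and later raise an error on non-numeric ones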

Page total likes                                                       -0.982448
Category                                                                0.231967
Post Month                                                             -0.122262
Post Weekday                                                           -0.102518
Post Hour                                                               0.213850
Paid                                                                    0.990928
Lifetime Post Total Reach                                               3.679156
Lifetime Post Total Impressions                                         8.351008
Lifetime Engaged Users                                                  4.515920
Lifetime Post Consumers                                                 5.033075
Lifetime Post Consumptions                                              4.817636
Lifetime Post Impressions by people who have liked your Page           14.723360
Lifetime Post reach by peopl

It looks like Post Weekday is the feature with skew closest to zero. We'll visualize this later to confirm our belief that this feature has the smallest skew and is nearly symmetric. It might be normal or it might look uniform; we'll see.
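
For the heavily skewed features, the log transformation mentioned above can be sketched as follows. The values are made up to mimic a right-skewed count, and `np.log1p` (the log of 1 + x) is my choice here because it also handles zeros. Treat it as one option, not the only way to correct skew.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: mostly small, with a couple of large outliers
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

skew_before = values.skew()
skew_after = np.log1p(values).skew()  # log(1 + x) compresses the long right tail

print(skew_before, skew_after)
```

The transformed values are still right-skewed, but noticeably less so than the originals.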

### Final Note

- Generating summary statistics is not enough. In the next module we're actually going to explore the data.
- Data exploration as we did it here is nice, but we want to start thinking critically about how these findings relate to our problem.
- Another question I like to keep in mind is why we might be seeing the data we are seeing. Does it confirm or rebut some of our preconceived notions?
- Also, let's have fun and work hard!


# We're Done!

This tutorial closely follows my Medium blog [@dhexonian](http://medium.com/@dhexonian).

If you have any questions or requests, please tweet them to me, also [@dhexonian](https://twitter.com/dhexonian).