# Exploratory Data Analysis - Describing & Filtering data

In the example below we'll look at how to describe, filter, and select data in pandas.

In [3]:
import pandas as pd

In [4]:
random_data = pd.read_csv("random_data.csv")

In [11]:
random_data.head()

Unnamed: 0,A,B,C,D,E
0,0.437349,0.144989,0.841138,1.257286,-1.459595
1,-0.872279,-0.756656,0.198512,-0.231623,0.006993
2,2.292596,-0.644215,-0.320572,-1.746933,0.08632
3,-0.207686,0.274467,0.330393,1.697657,-0.235343
4,-0.325484,-2.078535,0.867357,-1.149836,0.1121


In [12]:
list(random_data.columns)

['A', 'B', 'C', 'D', 'E']

# Describing Data in Pandas

Now that we have some fake data to look at, let's see how we can describe it effectively. Describing data is all about finding a minimal description of the dataset that gives you a sense of its general distribution and shape. For this we look at things like:

* center
    * mean
    * median
    * mode
    * trimean
    * geometric mean
    * harmonic mean
    * weighted arithmetic mean
    * truncated mean
    * midrange
    * midhinge
    * winsorized mean
    * geometric median
    * quadratic mean
    * simplicial depth
    * Tukey median

[Descriptions of each measure of center listed above](https://en.wikipedia.org/wiki/Central_tendency)

* spread
     * standard deviation
     * interquartile range
     * range
     * mean absolute difference
     * median absolute deviation
     * average absolute deviation
     * distance standard deviation
     * coefficient of variation
     * quartile coefficient of dispersion
     * relative mean difference
     * entropy
     * variance
     * variance-to-mean ratio
     
[Descriptions of each measure of spread listed above](https://en.wikipedia.org/wiki/Statistical_dispersion) 
 
     
* Additional measures
     * biweight midvariance
     * absolute pairwise differences
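
Several of the measures listed above are not single pandas methods, but can be built from quantiles. The sketch below is one way to compute a few of them; it runs on a synthetic normal sample (an assumption standing in for a column of `random_data`), and the names `values`, `center`, and `spread` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one numeric column (assumption: not the original CSV)
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=0.0, scale=1.0, size=1000), name="A")

# First quartile, median, and third quartile
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# A few measures of center
center = {
    "mean": values.mean(),
    "median": q2,
    "midrange": (values.min() + values.max()) / 2,
    "midhinge": (q1 + q3) / 2,
    "trimean": (q1 + 2 * q2 + q3) / 4,
}

# A few measures of spread
spread = {
    "standard deviation": values.std(),
    "variance": values.var(),
    "range": values.max() - values.min(),
    "interquartile range": q3 - q1,
    "median absolute deviation": (values - q2).abs().median(),
}

summary = pd.Series({**center, **spread})
print(summary)
```

Quantile-based measures such as the trimean and the interquartile range are less sensitive to extreme values than the mean and the range, which is one reason to look at more than one measure of each kind.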
 


In [17]:
columns = random_data.columns.tolist()
# DataFrame.append was removed in pandas 2.0, so compute every statistic in one agg call
descriptions = random_data[columns].agg(["mean", "median", "std", "var", "min", "max"])
# Rename the rows to more readable labels
descriptions.index = ["mean", "median", "stdev", "variance", "min", "max"]
descriptions

Unnamed: 0,A,B,C,D,E
mean,-0.035185,0.005638,-0.018034,0.016843,0.044505
median,-0.014933,-0.005839,0.033285,0.036266,0.071855
stdev,0.978908,1.033043,1.014287,0.977759,1.01108
variance,0.958261,1.067179,1.028778,0.956012,1.022283
min,-3.774899,-3.913617,-2.721996,-3.180298,-3.156781
max,3.578237,3.138728,3.883681,3.080082,3.099781


From this table we get a sense of where each column is centered and how widely it is spread.

Based on the above, all of the columns appear to have very similar distributions: their means and medians are close to 0, and their standard deviations are close to 1.
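
A similar summary is available in one call through `DataFrame.describe()`. The sketch below uses a synthetic frame named `demo` (an assumption, since `random_data.csv` is not reproduced here) and then measures how far apart the columns are on mean and standard deviation, as a rough check on whether they look alike.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for random_data (assumption: five standard normal columns)
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(500, 5)), columns=list("ABCDE"))

# count, mean, std, min, quartiles, and max for every numeric column
summary = demo.describe()
print(summary)

# Gap between the largest and smallest column value for each statistic;
# small gaps suggest the columns have similar centers and spreads
stats = summary.loc[["mean", "std"]]
gap = stats.max(axis=1) - stats.min(axis=1)
print(gap)
```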

# Filtering in Pandas

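Filtering in pandas is fast because the comparison is vectorized: the loop over the elements happens in compiled C code rather than in the Python interpreter, which gives a substantial speed-up. The scalar being compared against is broadcast across the whole column, so a single expression tests every row.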
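
To make this concrete, here is a minimal sketch on a tiny hand-made frame (`demo` and its values are made up for illustration). Comparing a column to a scalar produces a boolean Series, often called a mask, and indexing the frame with that mask keeps only the rows where it is `True`.

```python
import pandas as pd

# Tiny illustrative frame (assumption: values chosen by hand)
demo = pd.DataFrame({"A": [-0.5, 0.05, 0.3, 2.0], "B": [1.0, 2.0, 3.0, 4.0]})

# The scalar 0.1 is compared against every element of column A at once
mask = demo["A"] < 0.1
print(mask.tolist())  # [True, True, False, False]

# Indexing with the mask keeps only the rows where it is True
filtered = demo[mask]
print(filtered)
```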

Let's look at two implementations of the same filter. First, a standard Python loop:

In [26]:
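# DataFrame.append was removed in pandas 2.0, so collect matching rows in a list
rows = []
for index in random_data.index:
    if random_data.loc[index, "A"] < 0.1:
        rows.append(random_data.loc[index])
# Build the result once; each row keeps its original index label
subset = pd.DataFrame(rows)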

Next, the vectorized pandas version:

In [27]:
fast_subset = random_data[random_data["A"] < 0.1]

In [28]:
subset.equals(fast_subset)

True
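
Filters can also be combined. The sketch below, again on a small made-up frame named `demo`, shows one common pattern: wrap each condition in parentheses and join them with `&` (and), `|` (or), or `~` (not), and use `.loc` to filter rows and select columns in one step.

```python
import pandas as pd

# Tiny illustrative frame (assumption: values chosen by hand)
demo = pd.DataFrame({"A": [-0.5, 0.05, 0.3, 2.0], "B": [1.0, -2.0, 3.0, -4.0]})

# Rows where both conditions hold
both = demo[(demo["A"] < 0.1) & (demo["B"] > 0)]

# Rows where at least one condition holds
either = demo[(demo["A"] < 0.1) | (demo["B"] > 0)]

# Rows where the condition does not hold
negated = demo[~(demo["A"] < 0.1)]

# Filter rows and select a subset of columns at the same time
picked = demo.loc[demo["A"] < 0.1, ["B"]]

print(both, either, negated, picked, sep="\n\n")
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python; the plain `and` / `or` keywords do not work element-wise on a Series.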