# Exploratory Data Analysis  

## What is EDA?
EDA is the unstructured process of probing the data we haven’t seen before to understand more about it with a view to thinking about how we can use the data, and to discover what it reveals as insights at first glance.  

Sometimes, we need to analyze some data with no particular objective in mind except to find out if it could be useful for anything at all.  
Consider a situation where your manager points you to some data and asks you to do some analysis on it.  The data could be in a Google Drive, or a Github repo, or on a thumb drive.  It may have been received from a client, a customer or a vendor.  You may have a high level pointer to what the data is, for example you may know there is order history data, or invoice data, or web log data.  The ask may not be very specific, nor the goal clarified, but we would like to check the data out to see if there is something useful we can do with it.

In other situations, we are looking for something specific, and are looking for the right data to analyze.  For example, we may be trying to identify zip codes in which to market our product.  We may be able to get data on income, consumption, population characteristics, etc. that could help us with our task.  When we receive such data, we would like to find out if it is fit for purpose.  


### Inquiries to conduct
So when you get data that you do not know much about in advance, you start with exploratory data analysis, or EDA.  Possible inquiries you might like to conduct are:


 - How much data do we have - number of rows in the data?
 - How many columns, or fields do we have in the dataset?
 - Data types - which of the columns appear to be numeric, dates or strings?
 - Names of the columns, and do they tell us anything?
 - A visual review of a sample of the dataset
 - Completeness of the dataset, are missing values obvious?  Columns that are largely empty?
 - Unique values for columns that appear to be categorical, and how many observations of each category?
 - For numeric columns, the range of values (calculated from min and max values)
 - Distributions for the different columns, possibly graphed
 - Correlations between the different columns
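
As a rough illustration, most of the inquiries listed above map onto one-line pandas calls.  The sketch below assumes the data is already in a dataframe; the toy columns (`order_date`, `region`, `units`, `price`) are hypothetical and only stand in for a freshly received dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a freshly received dataset
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-05", "2023-01-06", "2023-01-06", "2023-01-09"]),
    "region": ["East", "West", "East", None],
    "units": [10, 3, 7, 5],
    "price": [2.5, 4.0, np.nan, 3.0],
})

n_rows, n_cols = df.shape                      # how much data do we have?
column_types = df.dtypes                       # numeric, dates or strings?
sample = df.head(3)                            # visual review of a few rows
missing = df.isna().sum()                      # missing values per column
region_counts = df["region"].value_counts()    # observations per category
summary = df.describe(include=[np.number])     # min, max, mean, quartiles of numeric columns
correlations = df[["units", "price"]].corr()   # pairwise correlations

print(n_rows, n_cols)
print(column_types)
print(missing)
print(region_counts)
```

Graphing the distributions would be the natural next step, and is left out here to keep the sketch short.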


Exploratory Data Analysis (EDA) is generally the first activity performed to get a high level understanding of new data.  It employs a variety of graphical and summarization techniques to get a ‘sense of the data’.

The purpose of Exploratory Data Analysis is to interrogate the data in an open-minded way with a view to understanding the structure of the data, uncover any prominent themes, identify important variables, detect obvious anomalies, consider missing values, review data types, obtain a visual understanding of the distribution of the data, understand correlations between variables, etc.  Not all these things can be discovered during EDA, but these are generally the things we look for when performing EDA. 

EDA is unstructured exploration; there is no defined set of activities you must perform.  Generally, you probe the data, and depending upon what you discover, you ask more questions.

## Introduction to Arrays

Arrays, or collections of numbers, are fundamental to analytics at scale.  We will cover arrays exclusively through a NumPy lens, given how much NumPy dominates array-based manipulation.  

NumPy is the underlying library for manipulating arrays in Python, and arrays are really important for analytics.  Many analytical algorithms will only accept arrays as input.  Deep learning networks accept only arrays as input, though arrays are called **_tensors_** in the deep learning world.  In addition to this practical issue, data is much easier to manipulate, transform and perform mathematical operations on if it is expressed as an array.  

NumPy underpins pandas as well as many other libraries.  So we may not use it directly a great deal, but there will be situations where NumPy is unavoidable.

Below is a high level overview of what arrays are, and some basic array operations.  

***
### Multi-dimensional data

Arrays have structure in the form of dimensions, and numbers sit at the intersection of these dimensions.  In a spreadsheet, you see two dimensions - one being the rows, represented as 1, 2, 3..., and the other the columns, represented as A, B, C.  Numpy arrays can have any number of dimensions, even though dimensions beyond the third are humanly impossible to visualize.

A numpy array when printed in Python encloses data for a dimension in square brackets.  The fundamental unit of an array of any size is a single one-dimensional row where numbers are separated by commas and enclosed in a set of square brackets, for example, `[1, 2, 3, 1]`.  Several of these will then be arranged within additional nested square brackets to make up the complete array.  To understand the idea of an array, mentally visualize a 2-dimensional array similar to a spreadsheet.  Every number within the array exists at the intersection of all of its dimensions.  In Python, each position along a dimension, more commonly called an _axis_, is represented by numbers starting with the first element being 0.  These positions are called indexes.

The number of opening square brackets `[` at the start of a printed array gives the number of dimensions in the array.  Two are represented on screen, the rows and columns, like a 2D matrix.  But the screen is two-dimensional, and cannot display additional dimensions.  Therefore all other dimensions appear as repeats of rows and columns.  For example, in an array of shape (2, 3, 4), the last two dimensions, 3 and 4, represent rows and columns.  The 2, the first one, means there are two sets of these rows and columns in the array.

***
### Creating arrays with Numpy  

Everything that Numpy touches ends as an array, just like everything from a pandas function is a dataframe.  The easiest way to generate a random array is `np.random.randn(2,3)` which will give an array with dimensions 2,3.  You can pick any other dimensions too.  `randn` draws its numbers from the standard normal distribution.

In [1]:
# import some libraries

import pandas as pd
import os
import random
import numpy as np
import scipy
import math
import joblib 



In [2]:
import os
os.chdir('/home/jovyan')

In [3]:
# Create a one dimensional array

np.random.randn(4)

array([ 0.53010939, -0.26013979, -1.36789056,  0.09318644])

In [4]:
# Create a 2-dimensional array with random normal variables
# np.random.seed(123)

np.random.randn(2,3)

array([[ 1.06635634, -1.85709758, -1.06510315],
       [-0.73035291,  0.89551841,  0.13411568]])
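
Because these numbers are random, they change on every run.  One way to make examples reproducible, sketched here as an alternative to the commented-out `np.random.seed(123)` call, is NumPy's `default_rng` generator created with a fixed seed.  The seed value and variable names below are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed yield identical draws
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)

normal_a = rng_a.standard_normal((2, 3))   # counterpart of np.random.randn(2, 3)
normal_b = rng_b.standard_normal((2, 3))

# Counterpart of np.random.randint: low is included, high is excluded
ints = rng_a.integers(low=1, high=5, size=(2, 3, 4))

print(np.array_equal(normal_a, normal_b))
print(ints.shape)
```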

In [5]:
# Create a 3-dimensional array with random integers

x = np.random.randint(low = 1, high = 5, size = (2,3,4))
print('Shape: ', x.shape)
x

Shape:  (2, 3, 4)


array([[[1, 3, 2, 4],
        [2, 3, 2, 4],
        [3, 1, 4, 4]],

       [[2, 1, 4, 3],
        [4, 1, 4, 1],
        [1, 4, 4, 1]]])

    
Numpy axes numbers run from left to right, starting with the index 0.  So `x.shape` gives me 2, 3, 4 which means 2 is the 0th axis, 3 rows are the 1st axis and 4 columns are the 2nd axis.  
  
The shape of the above array is (2, 3, 4)  
  
axis = 0 means : (**2**, 3, 4)  
axis = 1 means : (2, **3**, 4)  
axis = 2 means : (2, 3, **4**)  

In [6]:
# Create a 3-dimensional array

data = np.random.randn(2, 3, 4)
print('The shape of the array is:', data.shape)
data

The shape of the array is: (2, 3, 4)


array([[[-0.29751345, -0.79209536, -0.30800011, -0.11003366],
        [ 0.89254248, -2.5404233 ,  0.11387802, -0.28652507],
        [ 0.09047351,  1.89614923, -0.06056426, -0.15000823]],

       [[ 0.10524661,  1.36260319, -0.69532639, -0.23330721],
        [-1.78602706, -0.12680377, -0.81624271,  0.08893306],
        [ 0.00924079,  0.5099789 ,  0.07351083,  0.70030037]]])

> The number of `[` gives the number of dimensions in the array.  
Two are represented on screen, the rows and columns.  All others appear afterwards.
The last two dimensions, eg here 3, 4 represent rows and columns.  The 2, the first one, means there are two 
sets of these rows and columns in the array.

In [7]:
# Now let us add another dimension.  But this time random integers rather than random normal.
# The random integer function (randint) requires specifying low and high for the uniform distribution (high is excluded).

data = np.random.randint(low = 1, high = 100, size = (2,3,2,4))
data

array([[[[20, 87, 51, 48],
         [29, 76, 28, 42]],

        [[26,  6, 16, 69],
         [94, 38, 27, 60]],

        [[86, 38, 16, 85],
         [74, 88, 32, 98]]],


       [[[57, 30, 33, 71],
         [60, 16, 22,  4]],

        [[21, 24, 31, 69],
         [49, 56, 30, 55]],

        [[14, 85, 22, 77],
         [ 3, 24, 73, 37]]]])

So there will be a collection of 2 rows x 4 columns matrices, repeated 3 times, and that entire set another 2 times. <br><br>
And the 4 occurrences of `[[[[` means there are 4 dimensions to the array.

In [8]:
type(data)

numpy.ndarray

In [9]:
# Converting a list to an array

list1 = list(range(12))
list1

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [10]:
array1 = np.array(list1)
array1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [11]:
# This array1 is one dimensional, let us convert to a 3x4 array.
array1.shape = (3,4)
array1

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [12]:
# Create arrays of zeros
array1 = np.zeros((2,3)) # The dimensions must be a tuple inside the brackets
array1

array([[0., 0., 0.],
       [0., 0., 0.]])

In [13]:
# Create arrays from a range
array1 = np.arange((12))
array1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [14]:
#You can reshape the dimensions of an array
array1.reshape(3,4) 

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [15]:
array1.reshape(3,2,2)

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])
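
When reshaping, one of the dimensions can be given as `-1`, in which case NumPy infers it from the total number of elements.  A small sketch, with variable names chosen only for illustration:

```python
import numpy as np

array1 = np.arange(12)

inferred_cols = array1.reshape(3, -1)    # NumPy works out that -1 must be 4
inferred_rows = array1.reshape(-1, 6)    # ... and here that -1 must be 2
flattened = inferred_cols.reshape(-1)    # back to one dimension

print(inferred_cols.shape, inferred_rows.shape, flattened.shape)
```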

In [16]:
# Create an array of 1's
array1 = np.ones((3,5))
array1

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [17]:
# Creates the identity matrix 
array1 = np.eye(4) 
array1

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [18]:
# Create an empty array - useful if you need a place to keep data that will be generated later in the code.
# The contents are uninitialized: it may show zeros, but the values are arbitrary until you fill them in

np.empty([2,3])

array([[0., 0., 0.],
       [0., 0., 0.]])

### Summarizing data along an axis  
Putting the `axis = n` argument with a summarization function (eg, sum) makes the axis _n_ disappear, having been summarized into the function's results, leaving only the rest of the dimensions.  So `np.sum(array_name, axis = n)`, similarly `mean()`, `min()`, `median()`, `std()` etc will calculate the aggregation function by collapsing all the elements of the selected axis number into one and performing that operation.  See below using the sum function.  
  

In [19]:
x = data = np.random.randint(low = 1, high = 100, size = (2,3))
x

array([[76, 34, 43],
       [42, 55, 78]])

In [20]:
# So with axis = 0, the very first dimension, ie the 2 rows, will collapse leaving an array of shape (3,)
x.sum(axis = 0)

array([118,  89, 121])

In [21]:
# So with axis = 1, the second dimension, ie the 3 columns, will collapse leaving an array of shape (2,)
x.sum(axis = 1)

array([153, 175])
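
The same rule extends to arrays with more dimensions.  The sketch below uses a (2, 3, 4) array built with `np.arange` so the result is deterministic, and prints the shape left behind after collapsing each axis in turn.  The `keepdims` argument is an extra not covered above.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

print(x.sum(axis=0).shape)   # axis 0 (size 2) collapses
print(x.sum(axis=1).shape)   # axis 1 (size 3) collapses
print(x.sum(axis=2).shape)   # axis 2 (size 4) collapses

# keepdims=True retains the collapsed axis with length 1
kept = x.mean(axis=1, keepdims=True)
print(kept.shape)
```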

### Subsetting arrays ('slices')
Python starts numbering things starting with zero, which means the first item is the 0th item.  

The portion of the dimension you wish to select is given in the form `start:finish` where the `start` element is included, but the `finish` is excluded.  So `1:3` means include 1 and 2 but not 3.

`:` means include everything

In [22]:
array1 = np.random.randint(0, 100, (3,5))
array1

array([[99, 90, 44, 72, 55],
       [24,  5, 38, 81, 12],
       [61, 35, 25, 46, 63]])

In [23]:
array1[0:2, 0:2]

array([[99, 90],
       [24,  5]])

In [24]:
array1[:,0:2] # ':' means include everything

array([[99, 90],
       [24,  5],
       [61, 35]])

In [25]:
array1[0:2]

array([[99, 90, 44, 72, 55],
       [24,  5, 38, 81, 12]])

In [26]:
#Slices are views on the original array, so changing a slice changes the original.  If you need a copy, use the below:
array1[0:2].copy()

array([[99, 90, 44, 72, 55],
       [24,  5, 38, 81, 12]])

Generally, use the above 'Long Form' way for slicing where you specify the indices for each dimension. Where everything is to be included, use `:`.  There are other short-cut methods of slicing, but we will not cover them here.
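
To see why the point about copies matters, the sketch below modifies a slice and a copy of a slice and then checks the original array.  The array is built with `np.arange` purely so that the values are predictable.

```python
import numpy as np

original = np.arange(15).reshape(3, 5)

view = original[0:2, 0:2]          # basic slice: shares memory with original
view[0, 0] = 999                   # this also changes original[0, 0]

safe = original[0:2, 0:2].copy()   # independent copy
safe[0, 1] = -1                    # original is left untouched

print(original[0, 0], original[0, 1])
print(np.shares_memory(view, original), np.shares_memory(safe, original))
```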

Imagine an array a1 with dimensions (3, 4, 2, 5).  This means:
 - This array has 3 arrays in it that have the dimensions (4, 2, 5)
 - Each of these 3 arrays has 4 additional arrays in it of the dimension (2, 5).  (So there are 3*4=12 of these 2x5 arrays)
 - Each of these (2, 5) arrays has 2 one-dimensional arrays with 5 columns.
 
If in the slice notation only a portion of what to include is specified, eg `a1[0]`, then the indices are applied to the axes from the left of (3, 4, 2, 5).  So `a1[0]` means give me the first of the 3 arrays with size (4, 2, 5).  

If the slice notation says `a1[0,1]`, then it means the 0th element of the first dimension, and the 1st element of the second dimension.

Check it out using the following code:

In [27]:
a1 = np.random.randint(0, 100, (3,4,2,5))
a1

array([[[[49, 32, 85,  3, 36],
         [65, 67, 68, 82, 61]],

        [[36, 79, 74, 71, 75],
         [64, 86, 93, 51, 92]],

        [[79, 76, 78, 51, 96],
         [37, 56,  4, 46, 71]],

        [[82, 67, 28, 72, 44],
         [24, 89, 71,  2, 86]]],


       [[[21, 89, 71, 38,  9],
         [51, 32, 10, 38, 52]],

        [[38, 11, 79, 54, 79],
         [23, 24, 16, 88, 61]],

        [[ 3,  4, 28, 60, 94],
         [63, 83, 81,  1, 80]],

        [[12, 14, 40, 63, 23],
         [69, 79, 45, 90, 29]]],


       [[[61, 51, 76, 72, 79],
         [21, 40, 23, 73, 88]],

        [[13, 15, 65, 76, 79],
         [50, 24, 84,  4, 74]],

        [[11, 39, 19, 61,  0],
         [98, 87, 52, 77,  7]],

        [[65, 46, 45, 52, 76],
         [25, 17, 50, 55,  5]]]])

In [28]:
a1[0].shape

(4, 2, 5)

In [29]:
a1[0]

array([[[49, 32, 85,  3, 36],
        [65, 67, 68, 82, 61]],

       [[36, 79, 74, 71, 75],
        [64, 86, 93, 51, 92]],

       [[79, 76, 78, 51, 96],
        [37, 56,  4, 46, 71]],

       [[82, 67, 28, 72, 44],
        [24, 89, 71,  2, 86]]])

In [30]:
a1[0,1]

array([[36, 79, 74, 71, 75],
       [64, 86, 93, 51, 92]])

### More slicing: Picking selected rows or columns

In [31]:
a1 = np.random.randint(0, 100, (8,9))
a1

array([[29, 31, 36, 98, 27, 23, 91,  3, 65],
       [36,  1, 17, 51, 45, 92, 70, 55, 65],
       [85, 47, 84, 94, 30, 70, 17, 54,  8],
       [69, 65, 94, 70, 99, 69, 19, 50, 76],
       [60, 52,  8, 93, 68, 71, 72, 62, 32],
       [94, 88, 13,  0, 45, 21, 30,  1, 26],
       [83, 11, 12, 29, 41, 22, 69, 35, 22],
       [55, 66, 54, 92, 14, 51, 30, 69, 60]])

In [32]:
# Select the first row
a1[0]

array([29, 31, 36, 98, 27, 23, 91,  3, 65])

In [33]:
# Select the fourth row
a1[3]

array([69, 65, 94, 70, 99, 69, 19, 50, 76])

In [34]:
# Select the first and the fourth row together
a1[[0,3]]

array([[29, 31, 36, 98, 27, 23, 91,  3, 65],
       [69, 65, 94, 70, 99, 69, 19, 50, 76]])

In [35]:
# Select the first and the fourth column
a1[:,[0,3]]

array([[29, 98],
       [36, 51],
       [85, 94],
       [69, 70],
       [60, 93],
       [94,  0],
       [83, 29],
       [55, 92]])

In [36]:
# Select subset of named rows and columns

a1[[0, 3]][:,[0, 1]] # Named rows and columns.  

# Note that a1[[0, 3],[0, 1]] does not work as expected, it selects two points (0,0) and (3,1).  
# This is because the two index lists are paired up element by element rather than crossed.

array([[29, 31],
       [69, 65]])
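
NumPy also provides `np.ix_`, which builds the cross product of the row and column lists in a single indexing step.  The sketch below compares it with the chained form and with the paired form, using a deterministic array in place of the random one above.

```python
import numpy as np

a1 = np.arange(72).reshape(8, 9)    # element at [r, c] equals 9*r + c

block = a1[np.ix_([0, 3], [0, 1])]  # rows 0 and 3 crossed with columns 0 and 1
chained = a1[[0, 3]][:, [0, 1]]     # same result via two indexing steps
points = a1[[0, 3], [0, 1]]         # paired indices: elements (0, 0) and (3, 1)

print(block)
print(points)
```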

 ### Operations on arrays
 **All math on arrays is element wise, and scalars are multiplied/added with each element.**

In [37]:
array1 + 4

array([[103,  94,  48,  76,  59],
       [ 28,   9,  42,  85,  16],
       [ 65,  39,  29,  50,  67]])

In [38]:
array1 > np.random.randint(0, 2, (3,5))

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

In [39]:
array1 + 2

array([[101,  92,  46,  74,  57],
       [ 26,   7,  40,  83,  14],
       [ 63,  37,  27,  48,  65]])

In [40]:
np.sum(array1) # adds all the elements of an array

750

In [41]:
np.sum(array1, axis = 0) # adds all elements of the array along a particular axis

array([184, 130, 107, 199, 130])

### Matrix math

Numpy has arrays as well as matrices.  Matrices are 2D, arrays can have any number of dimensions. The only real difference between a matrix (type = `numpy.matrix`) and an array (type = `numpy.ndarray`) is that all array operations are element wise, ie the special R x C matrix multiplication does not apply to arrays when using `*`.  However, for an array that is two-dimensional you can use the `@` operator to do matrix math.

So that leaves matrices and arrays interchangeable in a practical sense.  Except that you can't do an inverse of an array using `.I` which you can for a matrix; for an array, use `np.linalg.inv` instead.  Note that the NumPy documentation no longer recommends `numpy.matrix` for new code, so arrays with `@` are the safer default.

In [42]:
# Create a matrix 'm' and an array 'a' that are identical
m = np.matrix(np.random.randint(0,10,(3,3)))
a = np.array(m)

In [43]:
m

matrix([[1, 5, 0],
        [6, 3, 8],
        [7, 4, 3]])

In [44]:
a

array([[1, 5, 0],
       [6, 3, 8],
       [7, 4, 3]])

#### Transpose with a `.T`

In [45]:
m.T

matrix([[1, 6, 7],
        [5, 3, 4],
        [0, 8, 3]])

In [46]:
a.T

array([[1, 6, 7],
       [5, 3, 4],
       [0, 8, 3]])

#### Inverse with a `.I` 
**Does not work for arrays**

In [47]:
m.I

matrix([[-0.13772455, -0.08982036,  0.23952096],
        [ 0.22754491,  0.01796407, -0.04790419],
        [ 0.01796407,  0.18562874, -0.16167665]])
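
For a plain array, the inverse can be obtained with `np.linalg.inv`.  The sketch below uses the same values as the matrix `m` shown above, typed in by hand as an array, and checks that multiplying by the inverse gives the identity matrix.

```python
import numpy as np

a = np.array([[1, 5, 0],
              [6, 3, 8],
              [7, 4, 3]])

a_inv = np.linalg.inv(a)                    # works on plain ndarrays
print(a_inv)
print(np.allclose(a @ a_inv, np.eye(3)))    # product should be the identity
```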

#### Matrix multiplication
For matrices, just a `*` suffices for matrix multiplication.  If using arrays, use `@` for matrix multiplication, which also works for matrices.  So just to be safe, just use `@`.

**Dot-product** is the same as row-by-column matrix multiplication, and is not elementwise.

In [48]:
a=np.matrix([[4, 3], [2, 1]])
b=np.asmatrix([[1, 2], [3, 4]])  # np.asmatrix replaces the older alias np.mat, which was removed in NumPy 2.0

In [49]:
a

matrix([[4, 3],
        [2, 1]])

In [50]:
b

matrix([[1, 2],
        [3, 4]])

In [51]:
a*b

matrix([[13, 20],
        [ 5,  8]])

In [52]:
a@b

matrix([[13, 20],
        [ 5,  8]])

In [53]:
# Now check with arrays
a=np.array([[4, 3], [2, 1]])
b=np.array([[1, 2], [3, 4]])

In [54]:
a@b # does matrix multiplication.  

array([[13, 20],
       [ 5,  8]])

In [55]:
a

array([[4, 3],
       [2, 1]])

In [56]:
b

array([[1, 2],
       [3, 4]])

In [57]:
a*b # element-wise multiplication as a and b are arrays

array([[4, 6],
       [6, 4]])

For 1-D and 2-D arrays, `@` gives the same result as `np.dot(a, b)`, which is just a longer fully spelled out function.

In [58]:
np.dot(a,b)

array([[13, 20],
       [ 5,  8]])

#### Exponents with matrices and arrays `**`.

In [59]:
a = np.array([[4, 3], [2, 1]])
m = np.matrix(a)
m

matrix([[4, 3],
        [2, 1]])

In [60]:
a**2 # Because a is an array, this will square each element of a.

array([[16,  9],
       [ 4,  1]])

In [61]:
m**2 # Because m is a matrix, this will be read as m*m, and dot product of the matrix with itself will result.

matrix([[22, 15],
        [10,  7]])

which is same as `a@a`

In [62]:
a@a

array([[22, 15],
       [10,  7]])

#### Modulus, or size 
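The modulus is just `sqrt(a^2 + b^2 + ....n^2)`, where a, b...n are elements of the vector, matrix or array.  For a matrix this is known as the Frobenius norm, which is what `np.linalg.norm` returns by default.  It can be calculated using `np.linalg.norm(a)`.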

In [63]:
a = np.array([4,3,2,1])
np.linalg.norm(a)

5.477225575051661

In [64]:
# Same as calculating manually
(4**2 + 3**2 + 2**2 + 1**2) ** 0.5

5.477225575051661

In [65]:
b


array([[1, 2],
       [3, 4]])

In [66]:
np.linalg.norm(b)

5.477225575051661

In [67]:
m

matrix([[4, 3],
        [2, 1]])

In [68]:
np.linalg.norm(m)

5.477225575051661

In [69]:
m = np.matrix(np.random.randint(0,10,(3,3)))
m

matrix([[0, 3, 0],
        [8, 7, 5],
        [0, 0, 6]])

In [70]:
np.linalg.norm(m)

13.527749258468683

In [71]:
print(np.ravel(m))
print(type(np.ravel(m)))
print('Manual calculation for norm')
((np.ravel(m)**2).sum())**.5

[0 3 0 8 7 5 0 0 6]
<class 'numpy.ndarray'>
Manual calculation for norm


13.527749258468683

#### Determinant of a matrix `np.linalg.det(a)`
Used for calculating the inverse of a matrix, and only applies to square matrices.

In [72]:
np.linalg.det(m)

-143.9999999999999

#### Converting from matrix to array and vice-versa
`np.asmatrix` and `np.asarray` allow you to convert one to the other. Though above we have just used np.array and np.matrix without any issue.

The above references: https://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u


#### Distances and angles between vectors
**Size of a vector, angle between vectors, distance between vectors**

In [73]:
# We set up two vectors a and b

a = np.array([1,2,3]); b = np.array([5,4,3])
print('a =',a)
print('b =',b)

a = [1 2 3]
b = [5 4 3]


In [74]:
# Size of the vector, computed as the square root of the sum of the squares of the elements
np.linalg.norm(a) 

3.7416573867739413

In [75]:
# Distance between two vectors
np.linalg.norm(a - b) 

4.47213595499958

In [76]:
# Which is the same as 
print(np.sqrt(np.dot(a, a) - 2 * np.dot(a, b) + np.dot(b, b)))
(a@a + b@b - 2*a@b)**.5

4.47213595499958


4.47213595499958

In [77]:
# Combine the two vectors
X = np.concatenate((a,b)).reshape(2,3)
X

array([[1, 2, 3],
       [5, 4, 3]])

In [78]:
# Euclidean distance is the default metric for this function
# from sklearn
from sklearn.metrics import pairwise_distances
pairwise_distances(X)

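
If scikit-learn is not installed, the same pairwise Euclidean distances can be computed with NumPy alone.  This is a minimal sketch using broadcasting, not the implementation that `pairwise_distances` uses internally.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [5, 4, 3]])

# X[:, None, :] has shape (2, 1, 3) and X[None, :, :] has shape (1, 2, 3);
# broadcasting the subtraction gives every row-to-row difference, shape (2, 2, 3)
diffs = X[:, None, :] - X[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

print(distances)
```

The result is a symmetric matrix with zeros on the diagonal, where entry `[i, j]` is the distance between row `i` and row `j` of `X`.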

In [None]:
# Angle in radians between two vectors. To get the
# answer in degrees, multiply by 180/pi, or 180/math.pi (after import math).  Also there is a function in math called
# math.radians to get radians from degrees, or math.degrees(x) to convert angle x from radians to degrees.

import math
angle_in_radians = np.arccos(np.dot(a,b) / (np.linalg.norm(a) * np.linalg.norm(b))) 
angle_in_degrees = math.degrees(angle_in_radians)

print('Angle in degrees =', angle_in_degrees)
print('Angle in radians =', angle_in_radians)

In [None]:
# Same as above using math.acos instead of np.arccos

math.acos(np.dot(a,b) / (np.linalg.norm(a) * np.linalg.norm(b))) 
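
The angle calculation can be wrapped in a small helper.  The function name `angle_between` is made up for this sketch.  The `np.clip` call is a defensive addition: floating point rounding can push the cosine slightly outside [-1, 1], and `arccos` would then return `nan`.

```python
import numpy as np

def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Guard against rounding error before taking the inverse cosine
    return np.arccos(np.clip(cosine, -1.0, 1.0))

a = np.array([1, 2, 3])
b = np.array([5, 4, 3])

print(np.degrees(angle_between(a, b)))
print(np.degrees(angle_between(a, 2 * a)))                              # parallel vectors
print(np.degrees(angle_between(np.array([1, 0]), np.array([0, 1]))))    # perpendicular vectors
```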

#### Sorting with `argsort` 
`argsort` performs the same sort as `sort`, but returns the index positions of the sorted elements instead of the values.

In [None]:
# We set up an array

a = np.array([20,10,30,0])

In [None]:
# Sorted indices

np.argsort(a)

In [None]:
# Using the indices to get the sorted values

a[np.argsort(a)]

In [None]:
# Descending sort indices

np.argsort(a)[::-1]

In [None]:
# Descending sort values

a[np.argsort(a)[::-1]]
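
One place where the indices are more useful than the sorted values is when a second array has to be reordered in the same way.  The sketch below ranks a set of hypothetical labels by their scores; the names and numbers are invented for illustration.

```python
import numpy as np

# Hypothetical labels and scores, index-aligned
names = np.array(["north", "south", "east", "west"])
scores = np.array([20, 10, 30, 0])

order = np.argsort(scores)[::-1]    # indices from highest to lowest score
ranked_names = names[order]         # reorder the labels with the same indices
top_two = names[order[:2]]

print(ranked_names)
print(top_two)
```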

## Understanding DataFrames
As we discussed in the prior section, understanding and manipulating arrays of numbers is fundamental to the data science process.  This is because nearly all ML and AI algorithms insist on being provided data arrays as inputs, and the NumPy library underpins almost all of data science.

As we discussed, a NumPy array is essentially a collection of numbers.  This collection is organized along ‘dimensions’.  So NumPy objects are n-dimensional array objects, or _ndarray_, a fast and efficient container for large datasets in Python.  

But arrays have several limitations.  One huge limitation is that they are raw containers with numbers: they don't have 'headers', or labels that describe the columns, rows, or the additional dimensions.  This means we need to track separately somewhere what each of the dimensions means.  Another limitation is that after 3 dimensions, the additional dimensions are impossible to visualize in the human mind.  For most practical purposes, humans like to think of data in the tabular form, with just rows and columns.  If there are more dimensions, one can have multiple tables.

This is where _pandas_ steps in.  Pandas uses dataframes, a spreadsheet-like construct where there are rows and columns, and these rows and columns can have names or headings.  Pandas dataframes are easily converted to NumPy arrays, and algorithms will mostly accept a dataframe as an input just as they would an array.
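
As a small sketch of that relationship, the code below wraps an array in a dataframe with labels, selects a value by name, and then converts back to a plain array.  The column and row labels are hypothetical.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)

# Hypothetical column and row labels
df = pd.DataFrame(values, columns=["height", "weight"], index=["a", "b", "c"])

by_label = df.loc["b", "weight"]   # select by names rather than positions
back_to_array = df.to_numpy()      # drop the labels again

print(df)
print(by_label)
print(type(back_to_array), back_to_array.shape)
```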

### Exploring Tabular Data with Pandas  

Tabular data is often the most common data type that is encountered, though ‘unstructured’ data is increasingly becoming common.  Tabular data is two dimensional data – with rows and columns.  The columns are defined and understood, and we generally understand what they contain.  

 - Data is laid out as a 2-dimensional matrix, whether in a spreadsheet, or R/Python dataframes, or in a database table.  
 - Rows generally represent individual observations, while columns are the fields/variables.  
 - Variables can be numeric, or categorical.  
 - Numerical variables can be integers (discrete) or floats (continuous).  
 - Categorical variables may be nominal (eg, species, gender), or ordinal (eg, low, medium, high), and belong to a discrete set.  
 - Categorical variables are also called factors, and the distinct values they take are called levels.  
 - Algorithms often require categorical variables to be converted to numerical variables.  
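
The last point in the list above, converting categorical variables to numbers, can be sketched with pandas.  The example assumes one indicator column per category for a nominal variable and integer codes for an ordinal one; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data with one nominal and one ordinal column
df = pd.DataFrame({
    "species": ["cat", "dog", "cat"],
    "risk": ["low", "high", "medium"],
})

# Nominal: one indicator column per category
dummies = pd.get_dummies(df["species"])

# Ordinal: declare the order, then use the integer codes
risk = pd.Categorical(df["risk"], categories=["low", "medium", "high"], ordered=True)
df["risk_code"] = risk.codes

print(dummies)
print(df)
```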
 
Unstructured data includes audio, video and other kinds of data that is useful for problems of perception.  Unstructured data will almost invariably need to be converted into structured arrays with defined dimensions, but for the moment we will skip that.  
  
### Reading data with Pandas

Pandas offers several different functions for reading different types of data.
> `read_csv` : Load comma separated files  
> `read_table` : Load tab separated files  
> `read_fwf` : Read data in fixed-width column format (i.e., no delimiters)  
> `read_clipboard` : Read data from the clipboard; useful for converting tables from web pages  
> `read_excel` : Read Excel files  
> `read_html` : Read all tables found in the given HTML document  
> `read_json` : Read data from a JSON (JavaScript Object Notation) file  
> `read_pickle` : Read a pickle file  
> `read_sql` : Read results of an SQL query  
> `read_sas` : Read SAS files  
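
As a quick sketch of `read_csv` in use, the code below reads CSV text from an in-memory string so that it runs without any file on disk.  The CSV content and column names are hypothetical; in practice you would pass a file path or URL instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path or URL
csv_text = """order_id,order_date,amount
1,2023-01-05,19.99
2,2023-01-06,5.00
3,2023-01-09,42.50
"""

# parse_dates asks pandas to convert the named column to datetimes
orders = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

print(orders.dtypes)
print(orders.shape)
```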

## Other data types in Python

 - Lists are represented as ```[]```.  Lists are a changeable collection of elements, and the elements can be any Python data, eg strings, numbers, dictionaries, or even other lists.<br>
 - Dictionaries are enclosed in ```{}```.  These are 'key:value' pairs, where 'key' is almost like a name given to a 'value'.<br>
 - Sets are also enclosed in ```{}```, except they don't have the colons separating the key:value pairs.  These are unordered collections of unique items.  An empty set is created with `set()`, because `{}` creates an empty dictionary. <br>
 - Tuples are ordered collections of elements, and enclosed in ```()```.  They are different from lists in that they are unchangeable.

In [None]:
# Example - creating a list

empty_list = []
list1 = ['a', 2,4, 'python']
list1

In [None]:
# Example - creating a dictionary

dict1 = {'first': ['John', 'Jane'], 'something_else': (1,2,3)}
dict1

In [None]:
dict1['first']

In [None]:
dict1['something_else']

In [None]:
# Checking the data type of the new variable we created

type(dict1)

In [None]:
# Checking the data type

type(list1)

In [None]:
# Set operations

set1 = {1,2,4,5} # Sets can do intersect, union and difference

In [None]:
# Tuple example
tuple1 = 1, 3, 4 # or
tuple1 = (1, 3, 4)
tuple1
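
The set operations and the unchangeable nature of tuples can be seen in a short sketch.  The second set, `set2`, is introduced here only for illustration.

```python
set1 = {1, 2, 4, 5}
set2 = {4, 5, 6}

print(set1 | set2)    # union
print(set1 & set2)    # intersection
print(set1 - set2)    # difference: in set1 but not in set2

empty_dict = {}       # note: {} is an empty dictionary ...
empty_set = set()     # ... an empty set needs set()

tuple1 = (1, 3, 4)
try:
    tuple1[0] = 99    # tuples cannot be modified in place
except TypeError as err:
    print("TypeError:", err)
```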

## Loading built-in data sets in Python  

Before we get into the details of EDA, we will first take a small detour to talk about data sets.  

In order to experiment with EDA, we need some data.  We can bring our own data, but for exploration and experimentation, it is often easy to load up one of the many in-built datasets accessible through Python.  These datasets cover the spectrum - from really small datasets to those with many thousands of records, and include text data such as movie reviews and tweets.  

We will leverage these built in datasets for the rest of the discussion as they provide a good path to creating reproducible examples. These datasets are great for experimenting, testing, doing tutorials and exercises.

The next few headings will cover these in-built datasets.

 - The Statsmodels library provides access to several interesting inbuilt datasets in Python.  
 - The datasets available in R can also be accessed through statsmodels.  
 - The Seaborn library has several toy datasets available to explore.  
 - The Scikit Learn (sklearn) library also has in-built datasets.   
 - Scikit Learn also provides a function to generate random datasets with described characteristics (`make_blobs` function)  

In the rest of this discussion, we will use these data sets and explore the data.  

Some of these are described below, together with information on how to access and use such datasets.  

In [None]:
# Load the regular libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Loading data from Statsmodels
Statsmodels allows access to several datasets for use in examples, model testing, tutorials, testing functions etc.  These can be accessed using `sm.datasets.macrodata.load_pandas()['data']`, where `macrodata` is just one example of a dataset.  Pressing `TAB` after `sm.datasets` should bring up a pick-list of datasets to choose from.  
  
The commands `print(sm.datasets.macrodata.DESCRLONG)` and `print(sm.datasets.macrodata.NOTE)` provide additional details on the datasets.

In [None]:
# Load macro economic data from Statsmodels

import statsmodels.api as sm
df = sm.datasets.macrodata.load_pandas()['data']
df

In [None]:
# Print the description of the data

print(sm.datasets.macrodata.DESCRLONG)

In [None]:
# Print the data-dictionary for the different columns/fields in the data 

print(sm.datasets.macrodata.NOTE)

***
### Importing R datasets using Statsmodels
Datasets available in R can also be imported using the command `sm.datasets.get_rdataset('mtcars').data`, where `mtcars` can be replaced by the appropriate dataset name.

In [None]:
# Import the mtcars dataset which contains attributes for 32 models of cars

mtcars = sm.datasets.get_rdataset('mtcars').data

In [None]:
# Be sure to change directory to a writeable place before running this
# eg:
# os.chdir('/home/jovyan')
# mtcars.to_excel('mtcars.xlsx')

In [None]:
mtcars.describe()

In [None]:
# Load the famous Iris dataset 
iris = sm.datasets.get_rdataset('iris').data

In [None]:
iris


***
### Datasets in Seaborn
Several datasets are accessible through the Seaborn library

In [None]:
# Get the names of all the datasets that are available through Seaborn

import seaborn as sns
sns.get_dataset_names()

In [None]:
# Load the diamonds dataset

diamonds = sns.load_dataset('diamonds')

In [None]:
diamonds.head(20)

In [None]:
# Load the mpg dataset from Seaborn.  This is similar to the mtcars dataset,
# but has a higher count of observations.

sns.load_dataset('mpg')

In [None]:
# Look at how many cars from each country in the mpg dataset

sns.load_dataset('mpg').origin.value_counts()

In [None]:
# Build a histogram of the model year

sns.load_dataset('mpg').model_year.astype('category').hist();

In [None]:
# Create a random dataframe with random data
n = 25
df = pd.DataFrame(
    {'state': list(np.random.choice(["New York", "Florida", "California"], size=(n))), 
     'gender': list(np.random.choice(["Male", "Female"], size=(n), p=[.4, .6])),
     'education': list(np.random.choice(["High School", "Undergrad", "Grad"], size=(n))),
     'housing': list(np.random.choice(["Rent", "Own"], size=(n))),     
     'height': list(np.random.randint(140,200,n)),
     'weight': list(np.random.randint(100,150,n)),
     'income': list(np.random.randint(50,250,n)),
     'computers': list(np.random.randint(0,6,n))
    })

In [None]:
df

In [None]:
# Load the 'Old Faithful' eruption data

sns.load_dataset('geyser')

***
### Datasets in sklearn
Scikit Learn has several datasets that are built-in as well that can be used to experiment with functions and algorithms.  Some are listed below:


`load_boston(*[, return_X_y])` Load and return the boston house-prices dataset (regression).  Note that this loader was deprecated in scikit-learn 1.0 and removed in 1.2, so it is not available in current releases.  
`load_iris(*[, return_X_y, as_frame])` Load and return the iris dataset (classification).  
`load_diabetes(*[, return_X_y, as_frame])` Load and return the diabetes dataset (regression).   
`load_digits(*[, n_class, return_X_y, as_frame])` Load and return the digits dataset (classification).  
`load_linnerud(*[, return_X_y, as_frame])` Load and return the physical exercise linnerud dataset.  
`load_wine(*[, return_X_y, as_frame])` Load and return the wine dataset (classification).  
`load_breast_cancer(*[, return_X_y, as_frame])` Load and return the breast cancer wisconsin dataset (classification).  

Let us import the _wine dataset_ next, and the _California housing dataset_ after that.  

In [None]:
from sklearn import datasets

# Load the dataset once and reuse the returned object
wine = datasets.load_wine()
X = wine['data']
y = wine['target']
features = wine['feature_names']
DESCR = wine['DESCR']
classes = wine['target_names']


wine_df = pd.DataFrame(X, columns = features)
wine_df.insert(0,'WineType', y)

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

In [None]:
df = wine_df[(wine_df['WineType'] != 2)]


In [None]:
# Let us look at the DESCR for the dataframe we just loaded

print(DESCR)

In [None]:
# California housing dataset. medv is the median house value,
# expressed in units of $100,000

from sklearn import datasets

# Fetch the dataset once (it is downloaded on first use) and reuse the result
cali = datasets.fetch_california_housing()
X = cali['data']
y = cali['target']
features = cali['feature_names']
DESCR = cali['DESCR']

cali_df = pd.DataFrame(X, columns = features)
cali_df.insert(0,'medv', y)
cali_df

In [None]:
# Again, we can look at what the various columns mean

print(DESCR)

***
### Create Artificial Data using sklearn

In addition to the built-in datasets, it is possible to create artificial data of arbitrary size to test or explain different algorithms for solving classification (both binary and multi-class) as well as regression problems.

One example using the `make_blobs` function is provided below, but a great deal more detail is available at https://scikit-learn.org/stable/datasets/sample_generators.html#sample-generators

`make_blobs` and `make_classification` can create multiclass datasets, and `make_regression` can be used for creating regression datasets with specified characteristics.  Refer to the sklearn documentation link above to learn more.
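
As a rough sketch of the other two generators, the cell below builds one small classification dataset and one small regression dataset.  The sample sizes, feature counts, noise level and column names are arbitrary choices made for this illustration, not values used elsewhere in this chapter.

```python
import pandas as pd
from sklearn.datasets import make_classification, make_regression

# Binary classification data: 4 features, of which 2 carry signal
X_cls, y_cls = make_classification(n_samples=200, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   n_classes=2, random_state=0)
cls_df = pd.DataFrame(X_cls, columns=['x1', 'x2', 'x3', 'x4'])
cls_df['label'] = y_cls

# Regression data: 3 features and a noisy continuous target
X_reg, y_reg = make_regression(n_samples=200, n_features=3,
                               noise=10.0, random_state=0)
reg_df = pd.DataFrame(X_reg, columns=['x1', 'x2', 'x3'])
reg_df['target'] = y_reg

print(cls_df.head())
print(reg_df.head())
```

Fixing `random_state` makes the generated data reproducible, which helps when comparing algorithms on the same artificial dataset.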


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
X, y, centers = make_blobs(n_samples=1000, centers=3, n_features=2,
                      random_state=0, return_centers=True, center_box=(0,20),
                          cluster_std = 1.1)

In [None]:
df = pd.DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
df = round(df,ndigits=2)
df

In [None]:
plt.figure(figsize=(6,6))
sns.scatterplot(data = df, x = 'x1', y = 'x2', hue = 'label', 
                alpha = .8, palette="deep",edgecolor = 'None');

***
## Exploratory Data Analysis using Python

After all of this lengthy introduction, we are finally ready to get started with actually performing some EDA.

As mentioned earlier, EDA is unstructured exploration: there is no fixed set of activities you must perform.  Generally, you probe the data, and depending upon what you discover, you ask more questions.

Things we will do:

 - Look at how to read different types of data  
 - Understand how to access in-built datasets in Python  
 - Calculate summary statistics covered in the prior class  
 - Perform basic graphing using Pandas to explore the data  
 - Understand group-by and pivoting functions (the split-apply-combine process)  
 - Look at pandas-profiling (now distributed as ydata-profiling), a library that can perform many data exploration tasks  


Pandas is a library we will be using often, and is something we will use to explore data and perform EDA.  We will also use NumPy and SciPy.  

In [None]:
# Load the regular libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### A note on managing working directories

A very basic problem one runs into when trying to load data files is the file path: the load fails if the file is not located in Python's current working directory.  

Generally, reading a CSV file is simple - `pd.read_csv` and pointing to the filename does the trick.  If the file is there but pandas returns an error, that could be because the file may not be located in your working directory.  In such a case, enter the complete path to the file.  

Alternatively, you can bring the file to your working directory.  To check and change your working directory, use the following code:  

In [None]:
import os

# To check current working directory:
os.getcwd()

Or, you could type `pwd` in a cell.  Be aware that pwd should be on the first line of the cell!

In [None]:
pwd

In [None]:
# To change current working directory

os.chdir('/home/jovyan')
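
The idea of reading a file through its complete path can be sketched as follows.  The example writes a tiny CSV into a temporary folder first so that it is self-contained; the file name `orders.csv` and its contents are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a small CSV in a temporary folder (hypothetical file and data)
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / 'orders.csv'
pd.DataFrame({'order_id': [1, 2], 'amount': [9.5, 20.0]}).to_csv(csv_path, index=False)

# Reading with the complete path works whatever the working directory is
orders = pd.read_csv(csv_path)

print(Path.cwd())          # the current working directory
print(csv_path.exists())   # confirm the file is where we think it is
print(orders)
```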

### EDA on the diamonds dataset

#### Questions we might like answered
Below is a repeat of what was said in the introduction to this chapter, just to avoid having to go back to check what we are trying to do. When performing EDA, we want to explore data in an unstructured way, and try to get a 'feel' for the data.  The kinds of questions we may want to answer are:  

 - How much data do we have - number of rows in the data?
 - How many columns, or fields do we have in the dataset?
 - Data types - which of the columns appear to be numeric, dates or strings?
 - Names of the columns, and do they tell us anything?
 - A visual review of a sample of the dataset
 - Completeness of the dataset, are missing values obvious?  Columns that are largely empty?
 - Unique values for columns that appear to be categorical, and how many observations of each category?
 - For numeric columns, the range of values (calculated from min and max values)
 - Distributions for the different columns, possibly graphed
 - Correlations between the different columns



#### Load data 

We will start our exploration with the diamonds dataset.  

The ‘diamonds’ dataset has 50k+ records, each representing a single diamond.  The weight and other attributes are available, and so is the price.

The dataset allows us to experiment with a variety of prediction techniques and algorithms.  Below are the columns in the dataset, and their description.

| Column | Description |
| --- | --- |
| price | price in US dollars (\\$326--\\$18,823) |
| carat | weight of the diamond (0.2--5.01) |
| cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| color | diamond colour, from J (worst) to D (best) |
| clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
| x | length in mm (0--10.74) |
| y | width in mm (0--58.9) |
| z | depth in mm (0--31.8) |
| depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79) |
| table | width of top of diamond relative to widest point (43--95) |


In [None]:
# Load data from seaborn

df = sns.load_dataset('diamonds')
df

#### Descriptive stats
The pandas `describe()` function provides a variety of summary statistics.  Review the table below.  Notice the categorical variables were ignored.  This is because numeric summaries such as the mean and standard deviation are not defined for categorical variables.


In [None]:
# Let us look at some descriptive statistics for the numerical variables

df.describe()
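
Categorical columns can still be summarized if we ask for them explicitly.  The sketch below uses a small made-up dataframe (not the diamonds data) and the `include` parameter of `describe()` to report the count, number of unique values, most frequent value and its frequency.

```python
import pandas as pd

# A toy dataframe with one categorical and one numeric column
toy = pd.DataFrame({
    'cut': pd.Categorical(['Ideal', 'Premium', 'Ideal', 'Good', 'Ideal']),
    'price': [326, 334, 423, 351, 500],
})

numeric_summary = toy.describe()                         # numeric columns only
categorical_summary = toy.describe(include='category')   # categorical columns only

print(numeric_summary)
print(categorical_summary)
```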

`df.info()` lists each column with its non-null count and data type, along with the memory used by the dataset.

In [None]:
df.info()
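
One of the questions listed earlier is how complete the data is.  A minimal sketch of a direct check, using a made-up dataframe with a few missing entries, counts the missing values per column with `isna()`.

```python
import numpy as np
import pandas as pd

# Toy data with some missing values (illustrative only)
toy = pd.DataFrame({
    'carat': [0.2, np.nan, 0.5, 1.0],
    'cut': ['Ideal', None, 'Good', 'Ideal'],
    'price': [300, 400, np.nan, np.nan],
})

missing_count = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows missing per column

print(missing_count)
print(missing_share)
```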

  
  
Similarly, `df.shape` gives you a tuple with the counts of rows and columns.

Trivia:  
 - Note there is no `()` after `df.shape`, as it is a property.  Properties are attributes of the object: they hold or compute a value and are accessed without parentheses.
 - Methods are functions defined as part of an object's class.  They act on the object and are called with parentheses, for example `df.describe()`.

In [None]:
df.shape

`df.columns` gives you the names of the columns.

In [None]:
df.columns

**Exploring individual columns**  
Pandas provides a large number of functions that allow us to explore several statistics relating to individual variables.


| Measures | Function (from Pandas, unless otherwise stated) |
| --- | --- |
| **Central Tendency** |  |
| Mean | `mean()` |
| Geometric Mean | `gmean()` (from scipy.stats) |
| Median | `median()` |
| Mode | `mode()` |
| **Measures of Variability** |  |
| Range | `max()` - `min()` |
| Variance | `var()` |
| Standard Deviation | `std()` |
| Coefficient of Variation | `std()` / `mean()` |
| **Measures of Association** |  |
| Covariance | `cov()` |
| Correlation | `corr()` |
| **Analyzing Distributions** |  |
| Percentiles | `quantile()` |
| Quartiles | `quantile()` |
| Z-Scores | `zscore()` (from scipy.stats) |

We examine many of these in action below.  

#### Functions for descriptive stats

In [None]:
# Mean
df.mean(numeric_only=True)

In [None]:
# Median
df.median(numeric_only=True)

In [None]:
# Mode
df.mode()

In [None]:
# Min; max() works the same way

df.min(numeric_only=True)

In [None]:
# Variance
df.var(numeric_only=True)

In [None]:
# Standard Deviation
df.std(numeric_only=True)
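
The coefficient of variation from the table above has no dedicated pandas function, but it follows directly from `std()` and `mean()`.  A small sketch on made-up numbers:

```python
import pandas as pd

# Toy numeric data (illustrative values only)
toy = pd.DataFrame({'carat': [0.2, 0.3, 0.5, 1.0],
                    'price': [300, 400, 1500, 5000]})

# Coefficient of variation = standard deviation relative to the mean
cv = toy.std() / toy.mean()
print(cv)
```

Because it is a ratio, the coefficient of variation lets us compare the spread of columns measured in very different units.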

#### Some quick histograms  
Histograms allow us to look at the distribution of the data.  The `df.colname.hist()` function allows us to create quick histograms (or column charts in case of categorical variables).  

Visualization using Matplotlib is covered in a different chapter.  


In [None]:
# A quick histogram

df.carat.hist();

In [None]:
df.depth.hist();

In [None]:
df.cut.hist();

In [None]:
# All together
df.hist(figsize=(16,10));

#### Calculate range

In [None]:
# Let us calculate the range manually

df.depth.max() - df.depth.min()

#### Covariance and correlations

In [None]:
# Let us do the covariance matrix, which is a one-liner with pandas

df.cov(numeric_only=True)

In [None]:
# Now the correlation matrix - another one-liner

df.corr(numeric_only=True)

In [None]:
# We can also calculate the correlations individually between given variables

df[['carat', 'depth']].corr()
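
Covariance and correlation are closely related: the Pearson correlation is the covariance divided by the product of the two standard deviations.  The sketch below checks this on a small made-up dataframe.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'carat': [0.2, 0.3, 0.5, 1.0],
                    'price': [300, 400, 1500, 5000]})

# Correlation computed by hand from the covariance
cov_xy = toy['carat'].cov(toy['price'])
manual_corr = cov_xy / (toy['carat'].std() * toy['price'].std())

# Correlation computed by pandas
pandas_corr = toy['carat'].corr(toy['price'])

print(manual_corr, pandas_corr)
```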

In [None]:
# We can create a heatmap of correlations

In [None]:
plt.figure(figsize = (8,8))
sns.heatmap(df.corr(numeric_only=True), annot=True);
plt.show()

In [None]:
# We can calculate phi-k correlations as well
import phik
X = df.phik_matrix()
X

In [None]:
sns.heatmap(X, annot=True);

**Detailed Phi-k correlation report**  
```python
from phik import report
report.correlation_report(df)
```

#### Quantiles to analyze the distribution

In [None]:
# Calculating quantiles
# Here we calculate the 30th percentile (the 0.30 quantile)


df.quantile(0.30, numeric_only=True)

In [None]:
# Calculating multiple quantiles

df.quantile([.1,.3,.5,.75], numeric_only=True)
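
Quartiles are simply the 0.25, 0.50 and 0.75 quantiles, and the interquartile range is the difference between the third and the first.  A short sketch on a made-up series, using the default linear interpolation of `quantile()`:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First quartile, median and third quartile
q1, q2, q3 = s.quantile([0.25, 0.50, 0.75])

# Interquartile range: the spread of the middle half of the data
iqr = q3 - q1
print(q1, q2, q3, iqr)
```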

#### Z-scores

In [None]:
# Z-scores for two of the columns (x - mean(x))/std(x)

from scipy.stats import zscore

zscores = zscore(df[['carat', 'depth']])


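# Verify z-scores have mean of 0 and standard deviation of 1.
# zscore uses the population standard deviation (ddof=0) by default, so a
# sample standard deviation (ddof=1) of the result differs very slightly from 1: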
print('Z-scores: \n', zscores, '\n')

print('Mean is: ', zscores.mean(axis = 0), '\n')

print('Std Deviation is: ', zscores.std(axis = 0), '\n')
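
The same calculation can be written out by hand with pandas alone, which makes the formula explicit.  This sketch uses a short made-up series and the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z = (x - mean(x)) / std(x), using the population standard deviation
z = (s - s.mean()) / s.std(ddof=0)

print(z)
print(z.mean(), z.std(ddof=0))
```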


#### Dataframe information

In [None]:
# Look at some dataframe information

df.info()

#### Names of columns

In [None]:
# Column names
df.columns

#### Other useful functions  
> Sort: `df.sort_values(['price', 'table'], ascending = [False, True]).head()`  
> Unique values: `df.cut.unique()`  
> Count of unique values: `df.cut.nunique()`  
> Value Counts: `df.cut.value_counts()`  
> Take a sample from a dataframe: `df.sample(4)` (equivalently `df.sample(n=4)`)  
> Rename columns: `df.rename(columns = {'price':'dollars'}, inplace = True)`  


## _Split-Apply-Combine_ 

The phrase Split-Apply-Combine was made popular by Hadley Wickham, who is the author of the popular dplyr package in R.  His original paper on the topic can be downloaded at https://www.jstatsoft.org/article/download/v040i01/468

Conceptually, it involves:  
 - Splitting the data into sub-groups based on some grouping criteria  
 - Applying a function to each sub-group and obtaining a result  
 - Combining the results into one single dataframe.  

Split-Apply-Combine does not represent three separate steps in data analysis, but a way to think about solving problems: breaking them up into manageable pieces, operating on each piece independently, and putting all the pieces back together.  

In Python, the Split-Apply-Combine operations are implemented using different functions such as pivot, pivot_table, crosstab, groupby and possibly others. 
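
To make the three conceptual steps concrete, the sketch below performs them by hand on a tiny made-up dataframe and then compares the result with the single `groupby` call that does the same thing.  The column names and values are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({'state': ['NY', 'FL', 'NY', 'FL'],
                    'income': [100, 50, 200, 150]})

# Split the rows by state, then apply the mean to each piece
pieces = {}
for state in toy['state'].unique():
    part = toy[toy['state'] == state]        # split
    pieces[state] = part['income'].mean()    # apply

# Combine the per-group results into one object
manual = pd.Series(pieces, name='income').sort_index()

# The same computation with groupby
grouped = toy.groupby('state')['income'].mean()

print(manual)
print(grouped)
```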

_Ref: http://www.jstatsoft.org/v40/i01/_



### Stack
Even though `stack` and `unstack` do not pivot data, they reshape data in a fundamental way that deserves a reference alongside the standard split-apply-combine techniques.

What _stack_ does is to completely flatten out a dataframe by bringing all columns down against the index.  The index becomes a multi-level index, and all the columns show up against every single row.  

The result is a pandas series, with as many rows as the rows times columns in the original dataset.

You can then move the index into the columns of a dataframe by doing `reset_index()`.

Let us first consider a simpler dataframe with just a few entries.  

**Example 1**  

In [None]:
df = pd.DataFrame([[9, 10], [14, 30]],
                                    index=['cat', 'dog'],
                                    columns=['weight-lbs', 'height-in'])

In [None]:
df

In [None]:
df.stack()

In [None]:
# Convert this to a dataframe
pd.DataFrame(df.stack()).reset_index().rename({'level_0': 'animal', 'level_1':'measure', 0: 'value'}, axis=1)

In [None]:
type(df.stack())

In [None]:
df.stack().index
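
A quick way to convince yourself that `stack` loses no information is to reverse it.  In this sketch (which rebuilds the small animals dataframe under the assumed name `animals`), calling `unstack()` on the stacked series recovers the original wide layout.

```python
import pandas as pd

animals = pd.DataFrame([[9, 10], [14, 30]],
                       index=['cat', 'dog'],
                       columns=['weight-lbs', 'height-in'])

stacked = animals.stack()       # long form: a series with a two-level index
restored = stacked.unstack()    # back to wide form

print(stacked)
print(restored)
```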

**Example 2:**  
Now we look at a larger dataframe.

In [None]:
import statsmodels.api as sm
iris = sm.datasets.get_rdataset('iris').data

In [None]:
# Let us look at the original data before we stack it
iris

In [None]:
iris.stack()

We had 150 rows and 5 columns in our original dataset, and we would therefore expect to have 150*5 = 750 items in our stacked series, which we can verify.

In [None]:
iris.shape[0] * iris.shape[1]

**Example 3:**  
We stack the mtcars dataset.

In [None]:
mtcars = sm.datasets.get_rdataset('mtcars').data

In [None]:
mtcars

In [None]:
mtcars.stack()

### Unstack
For a dataframe with a single-level index, unstack gives the same result as the stack of the transpose of the dataframe.

So you flip the rows and columns of the dataframe, and you then do a stack.

In [None]:
mtcars.transpose()

In [None]:
mtcars.unstack()

In [None]:
mtcars.transpose().stack()

In [None]:
# Check the row count
mtcars.stack().shape

In [None]:
# Expected row count in stack
mtcars.shape[0] * mtcars.shape[1]
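
The equivalence between `unstack` and the stack of the transpose can be checked on a small example.  This sketch uses a made-up two-row dataframe with a single-level index, named `animals` for illustration, rather than mtcars.

```python
import pandas as pd

animals = pd.DataFrame([[9, 10], [14, 30]],
                       index=['cat', 'dog'],
                       columns=['weight-lbs', 'height-in'])

via_unstack = animals.unstack()            # column label first, then row label
via_transpose = animals.transpose().stack()

print(via_unstack)
print(via_transpose)
```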

### Pivot table

A powerful way the idea behind _split-apply-combine_ is implemented is through pivot tables.  Pivot tables allow reshaping the data into useful summaries.  Pivot tables are widely used by Excel users, and you will find them used in reports, presentations and analysis of all types.  Pandas offers a great deal of flexibility for creating pivot tables using the pivot_table function.  

The pivot_table function is essentially a copy of the Excel functionality.  

- `index` - On the left is the index, and you can specify multiple columns there.  Each unique value in that index column will have a separate line.  Under each of these lines, there will be a line for each value of the second column in the index, and so on. 
- `columns` - On the top are the columns, again in the order in which specified in the parameters to the function.  The first column specified is on the top, and underneath will be all unique values of that column.  This is followed by the next column in the list, and so on.  
- `values` - Inside the table itself are values derived from the columns named in the _values_ parameter.  By default, the table shows the mean of the value columns, but you can change it to other functions using aggfunc.  
- `aggfunc` - Next is aggfunc, the aggregation function applied to the values.  You can specify any function, from any library, that reduces a set of values to a single value.  

**CAUTION**  
It is really easy to get pivot tables wrong and get something incomprehensible.  To create a sensible pivot table, it makes sense to:
- have categorical columns in both index and columns.  If you use numerical variables in either, the length of your columns/rows will explode unless the number of unique values is limited.  
- have columns in the values parameter that lend themselves to the aggregation function specified.  So if you specify a categorical column for values, and ask pandas to show the mean, you will be setting yourself up for disappointment.  If you are using a categorical column for values, be sure to use an appropriate aggregation function eg `count`.
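
The two cautions can be illustrated on a small made-up dataframe: a categorical column in `values` is paired with `count`, while a numeric column is paired with `mean`.  All names and values below are assumptions for the sake of the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['NY', 'NY', 'FL', 'FL', 'FL'],
    'housing': ['Rent', 'Own', 'Rent', 'Rent', 'Own'],
    'gender': ['M', 'F', 'F', 'M', 'F'],
    'income': [100, 200, 150, 50, 250],
})

# Categorical values column: counting makes sense, averaging would not
counts = toy.pivot_table(index='state', columns='housing',
                         values='gender', aggfunc='count')

# Numeric values column: the mean is a sensible aggregation
means = toy.pivot_table(index='state', columns='housing',
                        values='income', aggfunc='mean')

print(counts)
print(means)
```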


In [None]:
mtcars.head()

In [None]:
# Some transformations to help understand pivots better
mtcars.cyl = mtcars.cyl.replace({4: 'Four', 6: 'Six', 8: 'Eight'} )
# In mtcars, am = 0 is an automatic transmission and am = 1 is a manual one
mtcars.am = mtcars.am.replace({0: 'Automatic', 1: 'Manual'} )

In [None]:
mtcars = mtcars.head(8)
mtcars

In [None]:
mtcars.pivot_table(index = ['gear','cyl'],
                   values = ['wt'])

In [None]:
mtcars.pivot_table(index = ['am', 'gear'],
                   columns = ['cyl'],
                   values = ['wt'])

In [None]:
mtcars.pivot_table(index = ['am', 'gear'],
                  columns = ['cyl'],
                  values = ['wt'],
                  aggfunc = ['mean', 'count', 'median', 'sum'])

In [None]:
diamonds = sns.load_dataset('diamonds')

diamonds.pivot_table(index = ['clarity', 'cut'],
              columns = ['color'],
              values = ['depth', 'price', 'x'],
              aggfunc = {'depth': 'mean',
                        'price': ['min', 'max', 'median'],
                        'x': 'median'}
              )

In [None]:
# Let us create a dataframe with random variables
np.random.seed(1)
n = 2500
df = pd.DataFrame(
    {'state': list(np.random.choice(["New York", "Florida", "California"], size=(n))), 
     'gender': list(np.random.choice(["Male", "Female"], size=(n), p=[.4, .6])),
     'education': list(np.random.choice(["High School", "Undergrad", "Grad"], size=(n))),
     'housing': list(np.random.choice(["Rent", "Own"], size=(n))),     
     'height': list(np.random.randint(140,200,n)),
     'weight': list(np.random.randint(100,150,n)),
     'income': list(np.random.randint(50,250,n)),
     'computers': list(np.random.randint(0,6,n))
    })

In [None]:
df

In [None]:
df.pivot_table(index = ['gender'],
               columns = ['education'],
               values = ['income'],
               aggfunc = ['mean'])

In [None]:
df.pivot_table(index = ['state'],
               columns = ['education', 'housing'],
               values = ['gender', 'computers'],
               aggfunc = {'gender': [len], 'computers': ['median', 'mean']})

### Pivot
Pivot is a simpler version of pivot_table.  It does not perform any aggregation; it just shows the values of the 'value' columns at the intersection of the 'index' and the 'columns'.

There are three parameters for pivot:
1. index - which columns in the dataframe should be the index.  This is optional. If not specified, it uses the index of the dataframe.  
2. columns - which dataframe columns should appear on the top as columns in the result.  For each entry in the column parameter, it will create a separate column for each unique value of that column.  So if 'carb' can be 1, 2 or 4, it will show 1, 2 and 4 on the top.
3. values - which column's values to show at the intersection of index and columns.  If there is more than one value at an intersection (even if the multiple values are identical), pivot will throw an error.  (For example, in mtcars_small, if you put cyl 4,6,8 on the left as index, and am 0,1 on the top as columns, and mpg as values, you have two cars at their intersection.)

Pivot can be better than pivot_table as it brings in the value at the intersection of index and columns as-is, which is what you need sometimes without having to add, mean, or count them.

In [None]:
mtcars = sm.datasets.get_rdataset('mtcars').data

In [None]:
mtcars.head()

In [None]:
mtcars = mtcars.reset_index().rename(columns={'rownames': 'car'})
mtcars

In [None]:
mtcars_small = mtcars.iloc[1:8, [0, 1, 2, 4, 8 , 9]]
mtcars_small

In [None]:
mtcars_small.pivot(index = 'car', columns = 'cyl', values = 'mpg')

In [None]:
mtcars_small.pivot(index = 'car', columns = 'cyl')

In [None]:
mtcars_small.pivot(index = 'car', columns = ['am'], values=['mpg', 'vs'])

**Sometimes you may wish to use the index of a dataframe directly, as opposed to moving it into its own column first.**

In [None]:
df = pd.DataFrame([[0, 1, 2], [2, 3, 5], [6,7,8],],
                                    index=['cat', 'dog', 'cow'],
                                    columns=['weight', 'height', 'age'])

In [None]:
df

In [None]:
df.pivot(index = [ 'weight'], columns = ['height'])

In [None]:
# Now also use the native index of the dataframe

df.pivot(index = [df.index, 'weight'], columns = ['height'])

Now the same thing fails if there are duplicates

In [None]:
df = pd.DataFrame([['cat', 0, 1, 2], ['dog', 2, 3, 5], ['cow', 6,7,8], ['pig', 6,7,8],],
                                    columns=['animal', 'weight', 'height', 'age'])

In [None]:
df

The below will fail as there are duplicates.
```python
df.pivot(index = [ 'weight'], columns = ['height'])
```
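
When duplicates make `pivot` fail, one possible way out is to fall back on `pivot_table`, which aggregates the duplicate entries instead of raising an error.  The sketch below rebuilds the animals data under the assumed name `animals` and uses the mean as the aggregation, which is only one of several reasonable choices.

```python
import pandas as pd

animals = pd.DataFrame({'animal': ['cat', 'dog', 'cow', 'pig'],
                        'weight': [0, 2, 6, 6],
                        'height': [1, 3, 7, 7],
                        'age': [2, 5, 8, 8]})

# pivot refuses to reshape because cow and pig share weight 6 and height 7
pivot_failed = False
try:
    animals.pivot(index='weight', columns='height', values='age')
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table resolves the duplicates by aggregating them
table = animals.pivot_table(index='weight', columns='height',
                            values='age', aggfunc='mean')
print(table)
```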


In [None]:
# We consider only the first 3 rows of this new dataframe.
# Look how in the values we have a categorical variable.

df.iloc[:3].pivot(index = 'weight', columns = 'height', values = 'animal')

### Crosstab

Crosstab computes a frequency table given an index and columns of categorical variables (as a data frame column, series, or numpy array).  However it is possible to specify an aggfunc as well, which makes it behave like a pivot_table.  

You can pass `normalize = True`, `normalize = 'index'` or `normalize = 'columns'`, and it will normalize by the grand total, by the row totals or by the column totals respectively.

In [None]:
df = sns.load_dataset('diamonds')
df

In [None]:
# Basic
pd.crosstab(df.cut, df.color)

In [None]:
# With margins
pd.crosstab(df.cut, df.color, margins = True)

In [None]:
# With margins and normalized
pd.crosstab(df.cut, df.color, margins = True, normalize = True)

In [None]:
# Normalized by index, so each row totals to 1.  See how the 'All' column has
# disappeared (it would be 1 for every row), while the 'All' row has remained
pd.crosstab(df.cut, df.color, margins = True, normalize = 'index')

In [None]:
# Normalized by columns
pd.crosstab(df.cut, df.color, margins = True, normalize = 'columns')

In [None]:
# You can also pass multiple series for both the index and columns
pd.crosstab([df.cut, df.color], [df.clarity])
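
The aggfunc option mentioned above needs a `values` series to aggregate.  A minimal sketch on made-up data compares the plain frequency table with a table of average prices.

```python
import pandas as pd

toy = pd.DataFrame({'cut': ['Ideal', 'Ideal', 'Good', 'Good', 'Ideal'],
                    'color': ['E', 'J', 'E', 'E', 'E'],
                    'price': [300, 500, 200, 400, 700]})

# Default behaviour: counts of each cut/color combination
freq = pd.crosstab(toy['cut'], toy['color'])

# With values and aggfunc: average price for each combination
avg_price = pd.crosstab(toy['cut'], toy['color'],
                        values=toy['price'], aggfunc='mean')

print(freq)
print(avg_price)
```

Combinations that never occur show up as 0 in the frequency table and as NaN in the aggregated table.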

### Melt

Melt is similar to `stack()`, but unlike stack it returns a dataframe, not a series with a multi-level index.  A huge advantage is that, unlike stack, you can _freeze_ some of the columns and stack the rest.  

In melt, you specify id_vars (identifier variables) - these are the columns that stay untouched, and then the value_vars, that get stacked.  If value_vars are not specified, all columns other than id_vars get stacked.

**Opposite of melt is pivot.  Pivot applies no aggfunc, just lists the values at the intersection of categorical vars it picks up from a melted dataset.**

In [None]:
# Let us create a dataframe with random variables
np.random.seed(1)
n = 10
df = pd.DataFrame(
    {'state': list(np.random.choice(["New York", "Florida", "California"], size=(n))), 
     'gender': list(np.random.choice(["Male", "Female"], size=(n), p=[.4, .6])),
     'education': list(np.random.choice(["High School", "Undergrad", "Grad"], size=(n))),
     'housing': list(np.random.choice(["Rent", "Own"], size=(n))),     
     'height': list(np.random.randint(140,200,n)),
     'weight': list(np.random.randint(100,150,n)),
     'income': list(np.random.randint(50,250,n)),
     'computers': list(np.random.randint(0,6,n))
    })

In [None]:
df

In [None]:
# Just to demonstrate, melt-ing the first five rows of the df
df.head().melt(id_vars = ['state', 'gender'], value_vars = ['computers', 'income'])
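
To illustrate the point that pivot is the opposite of melt, the sketch below melts a small made-up dataframe and then pivots it back.  The `variable` and `value` column names are the defaults that melt produces.

```python
import pandas as pd

wide = pd.DataFrame({'person': ['a', 'b'],
                     'height': [170, 180],
                     'weight': [60, 80]})

# Wide to long: one row per person and measure
long = wide.melt(id_vars='person', value_vars=['height', 'weight'])

# Long back to wide: one column per measure again
back = long.pivot(index='person', columns='variable', values='value')

print(long)
print(back)
```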

### Groupby  
Groupby returns a groupby object, to which other agg functions can be applied.  

- Groupby does the 'split' part in the split-apply-combine framework.

- You do the 'apply' using an aggregation function against the groupby object.  

- 'Combine' doesn't need to be done separately as it is done automatically after the aggregation function is applied.


In [None]:
# Simple example

df.groupby(['state', 'gender']).agg({"height": "mean", "weight": "sum", "housing": "count", "education": "count"})

In [None]:
# Aggregation is done only for the columns for which an aggregation function is specified

df.groupby(['state', 'gender']).agg({"height": "mean", "weight": "sum", "housing": "count"})
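
If you also want to control the names of the aggregated columns, pandas offers named aggregation, where each keyword becomes an output column.  This is a sketch on made-up data; the output names `total_price` and `diamond_count` are arbitrary.

```python
import pandas as pd

toy = pd.DataFrame({'cut': ['Ideal', 'Ideal', 'Good'],
                    'price': [300, 500, 200],
                    'clarity': ['SI1', 'VS2', 'SI1']})

# Each keyword is output_name=(input_column, aggregation)
summary = toy.groupby('cut').agg(total_price=('price', 'sum'),
                                 diamond_count=('clarity', 'count'))
print(summary)
```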

In [None]:
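# Note: head(1) returns the first row of each group as a regular dataframe,
# so this agg summarizes those rows as a whole rather than group by group
df.groupby(['state', 'gender']).head(1).agg({"height": "mean", "weight": "sum", "housing": "count", "education": "count"})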

In [None]:
group = df.groupby(['state', 'gender'])

In [None]:
# How to look at groups in a groupby:
list(group)

In [None]:
list(group)[0][1]

In [None]:
type(group)

In [None]:
# Look at groups in a groupby - more elegant version:
for group_name, combined in group:
    print(group_name)
    print(combined)
    print('\n')

In [None]:
# How to look at a specific group - the group categorical values have to be entered as a tuple
group.get_group(('New York', 'Male'))

In [None]:
# Get the first non-null value of each column within each group

group.first()

In [None]:
# Get the first record of each group.
# For this to be useful, sort the original df by the right columns before groupby.

group.head(1)

In [None]:
# Summary stats for all groups

group.describe()

In [None]:
# Or, if you prefer this

group.describe().reset_index()

In [None]:
# Get the count of rows in each group.
# You can pd.DataFrame it, and reset_index() to clean up

group.size()

In [None]:
# Getting min and max values in each group using groupby

mtcars = sm.datasets.get_rdataset('mtcars').data

In [None]:
mtcars

In [None]:
mtcars.groupby(['cyl']).agg('mean')

In [None]:
# See which rows have the min values in each column of a groupby.  The index of the row is returned
# Which in this case is happily the car name, not an integer

mtcars.groupby(['cyl']).idxmin()

In [None]:
mtcars.groupby(['cyl']).idxmax()

**`rename` columns with Groupby**

In [None]:
# We continue the above examples to rename the aggregated columns we created using groupby

diamonds.groupby('cut').agg({"price": "sum", 
                             "clarity": "count"}).rename(columns = {"price": "total_price", "clarity": "diamond_count"})

## Pandas Profiling
### Profiling our toy dataframe

In [None]:
import os
os.chdir('/home/jovyan')

In [None]:
import ydata_profiling
# to_file() returns None, so keep the report object and write it out separately
profile = ydata_profiling.ProfileReport(df, title = 'My EDA', minimal=True)
profile.to_file("output.html")

  
Now check out output.html in your folder.  You can right click and open output.html in the browser.

### Pandas Profiling on the Diamonds Dataset

In [None]:
# Import libraries and the diamonds dataset
import pandas as pd
import numpy as np
import seaborn as sns
import os
import ydata_profiling
import phik
import matplotlib.pyplot as plt

df = sns.load_dataset('diamonds')

In [None]:
# to_file() returns None, so keep the report object and write it out separately
profile = ydata_profiling.ProfileReport(df, title = 'My EDA', minimal=True)
profile.to_file("output.html")

***
With this, we end our discussion on EDA.  We have seen how we can analyze data, get statistics, distributions and identify key themes.  Since this is a problem that has to be solved every day by many analysts, there are many libraries devoted to EDA that automate much of the work.  We looked at one - `ydata_profiling` (formerly known as pandas-profiling).  If you search, you will find several more, and may even find something that works best for your use case.

If you have been able to follow thus far, you are all set to explore any numerical data in a tabular form.   