## Pandas for Exploratory Data Analysis

WHO alcohol consumption data:  
* article: http://fivethirtyeight.com/datalab/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/    
* original data: https://github.com/fivethirtyeight/data/tree/master/alcohol-consumption  
* file: drinks.csv (with additional 'continent' column)

In [1]:
# imports

In [2]:
# where are we?


In [3]:
# what's the parent folder?


In [4]:
# what's in the data folder?


In [5]:
# read in the dataset.


### EDA: alcohol consumption dataset

In [6]:
# print the head and the tail


In [7]:
# examine the default index, data types, and shape


In [8]:
# what are the columns?


In [9]:
# print the 'beer_servings' Series


In [10]:
# remember that there are two ways to select a column: bracket notation and dot notation


### Quick note about "dot notation" in `pandas`
The "dot notation", i.e. `df.col2`, is attribute access that `pandas` exposes as a convenience for `df['col2']`.

A couple of caveats about attribute access:
* you cannot add a column with it: `df.new_col = x` does not create a column. Instead it sets a plain Python attribute on the DataFrame object (think monkey-patching); recent `pandas` versions emit a warning when `x` is list-like, but the column is still not created.
* it won't work if the column name contains spaces or if the column name is an integer.
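
The caveats above can be checked with a minimal sketch. The DataFrame here is toy data invented for illustration (only the column names follow `drinks.csv`), and the warning filter is there just to keep the output quiet.

```python
import warnings

import pandas as pd

# Hypothetical toy data; column names follow drinks.csv, values are made up.
df = pd.DataFrame({"country": ["France", "Japan"], "beer_servings": [127, 77]})

# Reading an existing column: both notations return the same Series.
same = df["beer_servings"].equals(df.beer_servings)

# Attribute assignment does not create a column; it only sets a Python attribute.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about list-like values here
    df.new_col = [1, 2]
added_by_dot = "new_col" in df.columns

# Bracket assignment does create the column.
df["other_col"] = [1, 2]
added_by_bracket = "other_col" in df.columns

print(same, added_by_dot, added_by_bracket)
```

Bracket notation is the safer default for assignment; dot notation is best kept for quick reads of columns with simple names.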

In [11]:
# calculate the average 'beer_servings' for the entire dataset


In [12]:
# count the number of occurrences of each 'continent' value and see if it looks correct
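
As a rough sketch of the inspection steps in the cells above, the following uses a small made-up DataFrame in place of `drinks.csv` (the values are invented; only the column names match the real file).

```python
import pandas as pd

# Made-up rows with drinks.csv column names; not the real WHO figures.
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer_servings": [100, 200, 300, 0],
    "continent": ["EU", "EU", "AS", "AF"],
})

first_rows = drinks.head(2)      # first 2 rows (the default is 5)
last_rows = drinks.tail(2)       # last 2 rows
n_rows, n_cols = drinks.shape    # shape is an attribute, not a method
column_names = drinks.columns.tolist()

# A single column is a Series; mean and value_counts are Series methods.
beer_mean = drinks["beer_servings"].mean()
continent_counts = drinks["continent"].value_counts()

print(n_rows, n_cols, beer_mean, continent_counts["EU"])
```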


### EDA: filtering and sorting

In [13]:
# filter DataFrame to only include European countries


In [14]:
# filter DataFrame to only include European countries with wine_servings > 300


In [15]:
# calculate the average 'beer_servings' for all of Europe


In [16]:
# determine which 10 countries have the highest total_litres_of_pure_alcohol
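
One possible way to write the filters and the top-n query described in the cells above, assuming the continent code for Europe is `'EU'`. The rows are invented for illustration, and `n=2` stands in for the 10 the notebook asks for.

```python
import pandas as pd

# Made-up rows that mimic the drinks.csv columns; not the real WHO figures.
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer_servings": [100, 200, 300, 50],
    "wine_servings": [350, 120, 310, 10],
    "total_litres_of_pure_alcohol": [9.0, 7.5, 11.2, 1.0],
    "continent": ["EU", "EU", "EU", "AS"],
})

# Boolean mask: keep only the European rows.
europe = drinks[drinks.continent == "EU"]

# Combine conditions with & and wrap each one in parentheses.
wine_lovers = drinks[(drinks.continent == "EU") & (drinks.wine_servings > 300)]

# Mean of one column over the filtered rows.
europe_beer_mean = europe.beer_servings.mean()

# Sort descending and keep the top rows (2 here, 10 in the notebook).
top = drinks.sort_values("total_litres_of_pure_alcohol", ascending=False).head(2)

print(wine_lovers.country.tolist(), europe_beer_mean, top.country.tolist())
```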

### Renaming, Adding, and Removing Columns

In [17]:
# renaming one or more columns


In [18]:
# rename returns a new DataFrame; use inplace=True (or reassign the result) for the change to stick


In [19]:
# what if I want to replace all column names?
# first, make a list.

In [20]:
# replace during file reading

In [21]:
# replace after file reading

In [22]:
# add a new column as a function of existing columns


#### removing columns

In [23]:
# axis=0 for rows, 1 for columns

In [24]:
# drop multiple columns

In [25]:
# make it permanent
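
The rename, add, and drop steps above might look like the sketch below. The short column names (`beer`, `spirit`, `wine`, `servings`) are hypothetical choices for illustration, and reassignment is used in place of `inplace=True`; both make the change permanent.

```python
import pandas as pd

# Made-up values; column names follow drinks.csv.
drinks = pd.DataFrame({
    "country": ["A", "B"],
    "beer_servings": [100, 200],
    "spirit_servings": [10, 20],
    "wine_servings": [30, 40],
})

# rename returns a new DataFrame, so assign the result back to keep it.
drinks = drinks.rename(columns={"beer_servings": "beer", "wine_servings": "wine"})

# Replace every column name at once by assigning a list of the same length.
drinks.columns = ["country", "beer", "spirit", "wine"]

# New column computed from existing columns.
drinks["servings"] = drinks.beer + drinks.spirit + drinks.wine

# axis=1 drops columns; axis=0 (the default) drops rows by index label.
drinks = drinks.drop(["spirit", "wine"], axis=1)

print(drinks.columns.tolist(), drinks.servings.tolist())
```

To replace the names while reading a file, `pd.read_csv` accepts `names=` together with `header=0`, which discards the header row already in the file.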


### Handling Missing Values

#### missing values are usually excluded by default

In [26]:
# excludes missing values

In [27]:
# includes missing values

In [28]:
# Hmmm, something is suspicious about those missing values: 'NA' (North America) was read in as missing.

#### find missing values in a Series

In [29]:
# True if missing, False if not missing

In [30]:
# count the missing values

In [31]:
# True if not missing, False if missing

In [32]:
# only show rows where continent is not missing

#### side note: understanding axes

In [33]:
# sums "down" the 0 axis (rows)

In [34]:
# axis=0 is the default

In [35]:
# sums "across" the 1 axis (columns)
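
A small sketch of the two axis directions, using a made-up two-column DataFrame.

```python
import pandas as pd

# Tiny made-up frame to show the direction of each axis.
df = pd.DataFrame({"beer": [1, 2, 3], "wine": [10, 20, 30]})

# axis=0 (the default): move down the rows, one result per column.
col_totals = df.sum(axis=0)

# axis=1: move across the columns, one result per row.
row_totals = df.sum(axis=1)

print(col_totals.to_dict())
print(row_totals.tolist())
```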

#### find missing values in a DataFrame

In [36]:
# DataFrame of booleans

In [37]:
# count the missing values in each column

#### drop missing values

In [38]:
# How many rows and columns?

In [39]:
# drop a row if ANY values are missing


In [40]:
# drop a row only if ALL values are missing

#### fill in missing values

In [41]:
# fill in missing values with 'NA'

In [42]:
# modifies 'drinks' in-place

In [43]:
# turn off the missing value filter
# Now North America is back!
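
The missing-value steps above can be sketched end to end with an in-memory CSV. The CSV text is invented for illustration; the assumption (suggested by the comments above) is that the continent code `'NA'` for North America is being parsed as a missing value by `read_csv`.

```python
import io

import pandas as pd

# Made-up CSV text; 'NA' stands for North America, as in drinks.csv.
csv_text = "country,beer_servings,continent\nA,10,EU\nB,20,NA\nC,,AS\n"

# By default read_csv treats the string 'NA' as a missing value.
drinks = pd.read_csv(io.StringIO(csv_text))

# isnull gives a DataFrame of booleans; summing counts the True values per column.
missing_per_column = drinks.isnull().sum()

# value_counts skips NaN unless dropna=False is passed.
counts_with_nan = drinks.continent.value_counts(dropna=False)

# how='any' drops rows with at least one NaN; how='all' requires every value missing.
any_dropped = drinks.dropna(how="any")
all_dropped = drinks.dropna(how="all")

# Put the label back in the continent column.
filled = drinks.continent.fillna("NA")

# na_filter=False disables missing-value detection at read time,
# so 'NA' survives as a string (and empty fields become empty strings).
raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(missing_per_column.to_dict(), any_dropped.shape, all_dropped.shape)
print(filled.tolist(), raw.continent.tolist())
```

Note the trade-off: with `na_filter=False` the genuinely empty `beer_servings` field is read as an empty string, so that column is no longer numeric.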

### Split-Apply-Combine
![Split-Apply-Combine diagram](http://i.imgur.com/yjNkiwL.png)

In [44]:
# for each continent, calculate the mean beer servings

In [45]:
# for each continent, calculate the mean of all numeric columns

In [46]:
# for each continent, describe beer servings

In [47]:
# similar, but outputs a DataFrame and can be customized

In [48]:
# for each continent, describe all numeric columns

In [49]:
# for each continent, count the number of occurrences
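
A minimal sketch of the split-apply-combine cells above, on made-up data. The choice of aggregation functions passed to `agg` is illustrative only.

```python
import pandas as pd

# Made-up values; column names follow drinks.csv.
drinks = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS", "AF"],
    "beer_servings": [200, 300, 50, 70, 30],
    "wine_servings": [100, 200, 5, 15, 2],
})

# Split by continent, apply mean to one column, combine into a Series.
beer_mean = drinks.groupby("continent").beer_servings.mean()

# Same idea for every numeric column; the result is a DataFrame.
all_means = drinks.groupby("continent").mean(numeric_only=True)

# agg accepts a list of functions and returns one column per function.
beer_stats = drinks.groupby("continent").beer_servings.agg(["count", "mean", "min", "max"])

# Number of rows in each group.
group_sizes = drinks.groupby("continent").size()

print(beer_mean.to_dict())
print(group_sizes.to_dict())
```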

### Other Commonly Used Features

In [50]:
# change the data type of a column

In [51]:
# create dummy variables for 'continent' and exclude first dummy column

In [52]:
# concatenate two DataFrames (axis=0 for rows, axis=1 for columns)
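
The three features above could be combined as in this sketch. The data and the `cont` prefix are invented for illustration, and `drop_first=True` is one way (not necessarily the notebook's) to exclude the first dummy column.

```python
import pandas as pd

# Made-up values; column names follow drinks.csv.
drinks = pd.DataFrame({
    "country": ["A", "B", "C"],
    "beer_servings": [100, 200, 300],
    "continent": ["EU", "AS", "EU"],
})

# astype returns a converted copy; assign it back to keep the new dtype.
drinks["beer_servings"] = drinks.beer_servings.astype(float)

# One indicator column per category; drop_first=True drops the first level (AS here).
dummies = pd.get_dummies(drinks.continent, prefix="cont", drop_first=True)

# axis=1 places the frames side by side, aligned on the index.
combined = pd.concat([drinks, dummies], axis=1)

print(combined.columns.tolist())
```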

### Other Less Commonly Used Features

#### convert a range of values into descriptive groups

In [53]:
# initially set all values to 'low'
# change 101-200 to 'med'
# change 201-400 to 'high'

In [54]:
# display a cross-tabulation of two Series

In [55]:
# convert 'beer_level' into the 'category' data type
# sorts by the categorical ordering (low to high)
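
A sketch of the `beer_level` workflow described above, on made-up data. Using `between` for the ranges and `pd.CategoricalDtype` for the ordering are assumptions about how one might implement these steps.

```python
import pandas as pd

# Made-up values; column names follow drinks.csv.
drinks = pd.DataFrame({
    "beer_servings": [50, 150, 250, 400, 100],
    "continent": ["AS", "EU", "EU", "EU", "AF"],
})

# Start every row at 'low', then overwrite the higher ranges with .loc.
drinks["beer_level"] = "low"
drinks.loc[drinks.beer_servings.between(101, 200), "beer_level"] = "med"
drinks.loc[drinks.beer_servings.between(201, 400), "beer_level"] = "high"

# Cross-tabulation: counts for each (continent, beer_level) pair.
table = pd.crosstab(drinks.continent, drinks.beer_level)

# An ordered categorical sorts by the declared order, not alphabetically.
level_type = pd.CategoricalDtype(categories=["low", "med", "high"], ordered=True)
drinks["beer_level"] = drinks.beer_level.astype(level_type)
ordered = drinks.sort_values("beer_level").beer_level.tolist()

print(ordered)
```

Without the categorical conversion, sorting the string column would give the alphabetical order high, low, med.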

## A few other useful `pandas` tricks

In [56]:
# write a DataFrame out to a CSV
# index is used as first column
# ignore index

In [57]:
# create a DataFrame from a dictionary

In [58]:
# create a DataFrame from a list of lists
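
Both constructions above can be sketched as follows; the example values are chosen for illustration.

```python
import pandas as pd

# From a dictionary: keys become column names, values become the columns.
from_dict = pd.DataFrame({"capital": ["Montgomery", "Juneau"], "state": ["AL", "AK"]})

# From a list of lists: each inner list is a row; column names are passed separately.
from_rows = pd.DataFrame(
    [["Montgomery", "AL"], ["Juneau", "AK"]],
    columns=["capital", "state"],
)

print(from_dict.equals(from_rows))
```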

In [59]:
# limit which rows are read when reading in a file
# only read first 10 rows
# skip the first two rows of data
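
The writing and row-limiting options above can be tried without touching the disk. This sketch uses made-up data, keeps the CSV text in memory via `io.StringIO`, and reads 2 rows where the notebook reads 10.

```python
import io

import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({"country": ["A", "B", "C", "D"], "beer_servings": [1, 2, 3, 4]})

# to_csv returns the CSV text as a string when no path is given.
with_index = df.to_csv()                 # index is written as the first column
without_index = df.to_csv(index=False)   # index is left out

# nrows limits how many data rows are parsed (2 here, 10 in the notebook).
first_two = pd.read_csv(io.StringIO(without_index), nrows=2)

# skiprows takes line numbers (0 is the header line), so [1, 2] skips
# the first two data rows while keeping the header.
skipped = pd.read_csv(io.StringIO(without_index), skiprows=[1, 2])

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
print(first_two.country.tolist(), skipped.country.tolist())
```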

In [60]:
# change the maximum number of rows and columns printed ('None' means unlimited)
# default is 60 rows
# default is 20 columns
   

In [61]:
# reset options to defaults


In [62]:
# change the options temporarily (settings are restored when you exit the 'with' block)
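
A sketch of the display-option cells above. The temporary limit of 5 rows is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Raise the limits; None means unlimited.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# Restore the defaults for these two options.
pd.reset_option("display.max_rows")
pd.reset_option("display.max_columns")
default_rows = pd.get_option("display.max_rows")

# Temporary change: the previous value is restored when the block exits.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")
after = pd.get_option("display.max_rows")

print(default_rows, inside, after)
```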
