## Learning objectives
* Students should be able to assess the structure and cleanliness of their dataset, including size and shape of data, number of variables of each type 
* Students should be able to describe their findings, translate results from code to text using Markdown comments in the Jupyter Notebook, and summarize their thought process in a narrative
* Students should be able to modify the raw data to prepare a clean data set -- including copying data, removing or replacing missing and incoherent data, dropping columns, removing duplicates in Pandas and Jupyter -- and explain and justify their decisions in markdown in their Jupyter notebook
* Students should be able to assess whether their data is “Tidy”, identify appropriate steps, and write and execute code to arrange it into a tidy format -- including merging, reshaping, subsetting, grouping, sorting, and making appropriate new columns -- and explain and justify their decisions in markdown in their Jupyter notebook
* Students should be able to identify several relevant summary measures, illustrate data using appropriate plots, and explain and justify their decisions in markdown in their Jupyter notebook
* Students should be able to assess the summaries and plots, appraise the need for repeated or further analysis, and justify their decisions in markdown


## Describe findings, translate results into Markdown text
This is an overarching goal that should be woven throughout the lesson rather than taught as a separate step.

## Assess the structure and cleanliness

* Look at the head of the data and check the number of columns and rows, length, shape, and size
* Cheat sheet for commands
* Make sure each column has the right type: categorical, continuous numerical, string, datetime
* Use markdown to comment
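
The checks in this list can be sketched on a small stand-in table. This is a minimal illustration, not part of the lesson data: the frame `df` and its values are hypothetical, chosen so that one column is read with the wrong type.

```python
import pandas as pd

# Hypothetical toy data standing in for a freshly loaded dataset.
df = pd.DataFrame({
    "year": [1952, 1957, 1962],
    "pop": ["8425333", "9240934", "10267083"],  # numbers stored as text
    "region": ["Asia_Afghanistan"] * 3,
})

print(df.shape)   # (rows, columns)
print(len(df))    # number of rows
print(df.size)    # rows * columns
print(df.dtypes)  # type of each column

# Convert columns that were read with the wrong type.
df["pop"] = pd.to_numeric(df["pop"])
df["region"] = df["region"].astype("category")
df["date"] = pd.to_datetime(df["year"].astype(str), format="%Y")
print(df.dtypes)
```

Printing `dtypes` before and after the conversions gives students something concrete to describe in a Markdown cell.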


In [1]:
import pandas as pd

In [3]:
url = "https://raw.githubusercontent.com/STAT545-UBC/STAT545-UBC.github.io/master/gapminderDataFiveYear_dirty.txt"
gapminder = pd.read_table(url, sep = "\t")
gapminder.head()

| | year | pop | lifeExp | gdpPercap | region |
|---|---|---|---|---|---|
| 0 | 1952 | 8425333.0 | 28.801 | 779.445314 | Asia_Afghanistan |
| 1 | 1957 | 9240934.0 | 30.332 | 820.85303 | Asia_Afghanistan |
| 2 | 1962 | 10267083.0 | 31.997 | 853.10071 | Asia_Afghanistan |
| 3 | 1967 | 11537966.0 | 34.02 | 836.197138 | Asia_Afghanistan |
| 4 | 1972 | 13079460.0 | 36.088 | 739.981106 | Asia_Afghanistan |


In [4]:
gapminder.shape

(1704, 5)

The describe() method will take the numeric columns and give a summary of their values. This is useful for getting a sense of the ranges of values and seeing if there are any unusual or suspicious numbers.


In [5]:
gapminder.describe()

| | year | pop | lifeExp | gdpPercap |
|---|---|---|---|---|
| count | 1704.0 | 1704.0 | 1704.0 | 1704.0 |
| mean | 1979.5 | 29601210.0 | 59.474439 | 7215.327081 |
| std | 17.26533 | 106157900.0 | 12.917107 | 9857.454543 |
| min | 1952.0 | 60011.0 | 23.599 | 241.165877 |
| 25% | 1965.75 | 2793664.0 | 48.198 | 1202.060309 |
| 50% | 1979.5 | 7023596.0 | 60.7125 | 3531.846989 |
| 75% | 1993.25 | 19585220.0 | 70.8455 | 9325.462346 |
| max | 2007.0 | 1318683000.0 | 82.603 | 113523.1329 |



describe() summarizes every numeric column, whether or not the summary is meaningful. The mean of year, for instance, tells us nothing useful. Let's select only the columns where these statistics make sense.

In [17]:
gapminder[['pop', 'lifeExp', 'gdpPercap']].describe()

| | pop | lifeExp | gdpPercap |
|---|---|---|---|
| count | 1704.0 | 1704.0 | 1704.0 |
| mean | 29601210.0 | 59.474439 | 7215.327081 |
| std | 106157900.0 | 12.917107 | 9857.454543 |
| min | 60011.0 | 23.599 | 241.165877 |
| 25% | 2793664.0 | 48.198 | 1202.060309 |
| 50% | 7023596.0 | 60.7125 | 3531.846989 |
| 75% | 19585220.0 | 70.8455 | 9325.462346 |
| max | 1318683000.0 | 82.603 | 113523.1329 |


Why look at a value_counts table?
- How many distinct values are there?
- Do the data look how we would expect? We anticipate 12 measurements for each region, one per survey year.
- It reveals inconsistencies: some are legitimate (for example, Congo changed names) and some are not.

In [16]:
print(len(gapminder['region'].unique())) # How many unique regions are in the data?
gapminder['region'].value_counts() # How many times does each unique region occur?

151


Asia_Oman                                  12
Asia_Cambodia                              12
Africa_Namibia                             12
Oceania_New Zealand                        12
Africa_Gabon                               12
Europe_Czech Republic                      12
Africa_Nigeria                             12
Africa_Sao Tome and Principe               12
Africa_Mali                                12
Africa_Guinea                              12
Asia_Bangladesh                            12
Africa_Chad                                12
Africa_Kenya                               12
Asia_Nepal                                 12
Africa_Zambia                              12
Europe_Sweden                              12
Asia_Iraq                                  12
Americas_Bolivia                           12
Africa_Mauritania                          12
Europe_Norway                              12
Asia_Syria                                 12
Asia_Yemen, Rep.                  

This table reveals some problems: every country/region should appear 12 times, but some appear fewer than 12 times. For example, _Canada has no continent prefix, and Asia_China also appears as Asia_china. Cleaning this up will require some string processing.
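
One possible shape for that string processing is sketched below. The series `regions` holds hypothetical examples of the problems described, and the `fixes` lookup is an assumption made for illustration. Which continent to assign is a decision to justify in Markdown.

```python
import pandas as pd

# Hypothetical examples of the kinds of problems described above.
regions = pd.Series(["Asia_China", "Asia_china", " Asia_China ", "_Canada"])

cleaned = regions.str.strip()                     # drop stray whitespace
parts = cleaned.str.split("_", n=1, expand=True)  # continent, country
parts.columns = ["continent", "country"]

# Normalize capitalization. Note that str.title() also capitalizes small
# words ("and" becomes "And"), so check multi-word names afterwards.
parts["country"] = parts["country"].str.title()

# An empty continent needs a manual decision; here we look it up.
fixes = {"Canada": "Americas"}  # assumption for illustration
missing = parts["continent"] == ""
parts.loc[missing, "continent"] = parts.loc[missing, "country"].map(fixes)

print(parts)
```

Re-running value_counts() on the cleaned columns is a quick way to confirm that the variants have collapsed into one entry.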

## Modify the dataset -- cleaning

* Make a copy of the data
  * In place vs. not
  * pandas.DataFrame.copy() - make a copy of the data frame
* Assigning to new variable/df names
* Dealing with missing data
  * Pros and cons of dropping NAs and inconsistent data
* Dealing with incoherent data (NA, na, N/A, n/a, ND, not done, XXXX), misspellings, etc.
  * Regex, data transformation to address inconsistency (fillna())
* Dropping columns
  * df.drop()
* Removing duplicates
  * df.drop_duplicates()
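
The cleaning steps in this list could look roughly like the following. This is a sketch on a hypothetical frame `raw`. The column names, values, and the list of missing-value spellings are assumptions, and whether to drop or fill is a judgment call for the narrative.

```python
import pandas as pd

# Hypothetical raw data with inconsistent missing-value codes.
raw = pd.DataFrame({
    "country": ["Chad", "Chad", "Mali", "Oman", "Peru"],
    "lifeExp": ["38.1", "38.1", "n/a", "37.6", "ND"],
    "notes": ["", "", "check", "", ""],
})

clean = raw.copy()  # work on a copy so the raw data stays untouched

# Turn the many spellings of "missing" into real NaN, then fix the type.
missing_tokens = ["NA", "na", "N/A", "n/a", "ND", "not done", "XXXX"]
is_missing = clean["lifeExp"].isin(missing_tokens)
clean["lifeExp"] = pd.to_numeric(clean["lifeExp"].mask(is_missing))

clean = clean.drop(columns=["notes"])  # returns a new frame, not in place
clean = clean.drop_duplicates()        # removes the repeated Chad row

# Two ways to handle what is still missing; each has pros and cons.
dropped = clean.dropna(subset=["lifeExp"])
filled = clean.fillna({"lifeExp": clean["lifeExp"].mean()})
```

Because each step assigns its result to a name, `raw` can always be compared against `clean` to show what changed.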


## Prepare the data structure -- tidy data

* Describe Tidy Data - each variable has its own column, each observation has its own row.
* Reshaping (if necessary)
  * Renaming columns (from names that don’t make sense, names with strange characters, etc. to names that make sense)
* Merging datasets
  * Pandas - merge() [left, right, inner, outer], concat()
* Subsetting data
  0) Refer back to the way we selected columns in .describe() above
  1) Indexing - numerically (zero indexing) vs. by names, boolean indices
  2) Slicing - base Python and/or pandas
  3) List comprehensions
* Regular expressions (complicated! But useful and important)
* Grouping data, indexing DataFrames
  * df.groupby(by="col")
* Sorting data
  * df.sort_values()
  * df.sort_index()
* Creating new variables/columns for transformations (log, sqrt, etc.)
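
Several of these operations can be strung together in one short sketch. The tables `wide` and `continents` and all of their values are hypothetical, made up only to show a non-tidy layout being reshaped, merged, subset, grouped, sorted, and extended.

```python
import numpy as np
import pandas as pd

# Hypothetical "wide" table: one column per year is not tidy.
wide = pd.DataFrame({
    "country": ["Chad", "Mali"],
    "1952": [38.1, 33.7],
    "1957": [39.9, 35.3],
})
continents = pd.DataFrame({
    "country": ["Chad", "Mali"],
    "continent": ["Africa", "Africa"],
})

# Reshape: one row per (country, year) observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="lifeExp")
tidy["year"] = tidy["year"].astype(int)

# Merge: add the continent for each country.
tidy = tidy.merge(continents, on="country", how="left")

# Subset with a boolean index, then sort.
recent = tidy[tidy["year"] >= 1957].sort_values("lifeExp", ascending=False)

# Group and summarize.
mean_by_country = tidy.groupby("country")["lifeExp"].mean()

# New column from a transformation.
tidy["log_lifeExp"] = np.log(tidy["lifeExp"])
```

After `melt`, each variable (country, year, lifeExp) has its own column and each observation its own row, which is the tidy layout described in the first bullet.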


## Summarize and plot

Summaries (but can’t *say* statistics…)
* Sort data
* Can make a note about using numpy functions and the difference between a dataframe and an array

Good plots for the data/variable type
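
The summary ideas above can be sketched as follows. The frame `df` and its values are hypothetical. The last lines show one practical difference between pandas and plain NumPy that is worth a note.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Asia", "Asia"],
    "lifeExp": [40.0, 50.0, 60.0, 70.0],
})

# Several summary measures at once, per group.
summary = df.groupby("continent")["lifeExp"].agg(["min", "median", "max"])
print(summary)

# Sorting makes the extremes easy to read off.
ranked = df.sort_values("lifeExp", ascending=False)
print(ranked.head(2))

# A Series keeps its index labels; .to_numpy() gives a plain array.
values = df["lifeExp"].to_numpy()
print(type(df["lifeExp"]), type(values))

# pandas skips NaN by default, while plain NumPy functions do not.
with_gap = pd.Series([1.0, np.nan, 3.0])
print(with_gap.mean())                 # 2.0
print(np.mean(with_gap.to_numpy()))    # nan
```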



Plots
* of subsets
* of single variables
* of pairs of variables
* Matplotlib syntax, with seaborn for defaults (prettier, and the package is also good for more analysis later...)
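
A minimal Matplotlib sketch of a single-variable plot and a two-variable plot is shown here. The values in `df` are hypothetical, and the `Agg` backend line is only there so the example runs outside a notebook. Seaborn is left out to keep the example dependency-light.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for illustration.
df = pd.DataFrame({
    "gdpPercap": [779.0, 9867.0, 2449.0, 33693.0],
    "lifeExp": [28.8, 66.8, 43.1, 68.8],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Single variable: a histogram shows the distribution.
ax_hist.hist(df["lifeExp"], bins=4)
ax_hist.set_xlabel("lifeExp")
ax_hist.set_ylabel("count")

# Pair of variables: a scatter plot shows the relationship.
ax_scatter.scatter(df["gdpPercap"], df["lifeExp"])
ax_scatter.set_xscale("log")  # a log axis helps with skewed variables
ax_scatter.set_xlabel("gdpPercap (log scale)")
ax_scatter.set_ylabel("lifeExp")

fig.tight_layout()
```

To plot a subset, filter the frame first (for example with a boolean index) and pass the filtered columns to the same calls.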

Exploring is often iterative - summarize, plot, summarize, plot, etc. - sometimes it branches…


## Interpret plots