<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Pandas Data Munging Full Overview

_Authors: Joseph Nelson (DC)_

---


### Lesson Guide
- [Basics of pandas dataframes](#basics)
    - [Loading data](#loading)
    - [Basic examination of data](#examine)
    - [Selecting columns](#selecting)
    - [Describing the data](#describing)
- [Exercise 1](#exercise-1)
- [Sorting and filtering dataframes](#filtering-sorting)
    - [Filtering](#filtering)
    - [Sorting](#sorting)
- [Exercise 2](#exercise-2)
- [Renaming, adding, and removing columns](#columns)
    - [Renaming columns](#renaming-columns)
    - [Adding columns](#adding-columns)
    - [Removing columns](#removing-columns)
- [Handling missing values](#missing)
    - [Find missing values](#find-missing)
    - [Drop missing values](#drop-missing)
    - [Fill in missing values](#fill-missing)
- [Exercise 3](#exercise-3)
- [Split-apply-combine](#split-apply-combine)
    - [Groupby](#groupby)
    - [Apply and combine](#apply-combine)
- [Exercise 4](#exercise-4)
- [Indexing](#indexing)
    - [Location indexing with .loc](#loc)
    - [Position indexing with .iloc](#iloc)
- [Other frequently used features](#frequent)
    - [Use map functions with replacement dictionaries](#map-dict)
    - [Encode strings as integers with .factorize](#factorize)
    - [Determine unique values](#unique)
    - [Replace values with .replace](#replace)
    - [Series string methods with .str](#series-str)
    - [Datetime conversion and arithmetic](#datetime)
    - [Setting and resetting the index](#set-reset-index)
    - [Sort by index](#sort-by-index)
    - [Change data type of a column](#change-dtype)
    - [Create dummy-coded columns](#dummy)
    - [Concatenate dataframes](#concatenate)
    - [Detect and drop duplicate rows](#duplicate-rows)
    - [Write a dataframe to a csv](#write-csv)
    - [Pickle a dataframe](#pickle)
    - [Randomly sample a dataframe](#sample)
- [Infrequently used features](#infrequent)
    - [Creating dataframes from dictionaries and lists of lists](#toy-dataframes)
    - [Doing cross-tabulations](#crosstab)
    - [Query filtering syntax](#query)
    - [Calculate memory usage](#memory-usage)
    - [Converting column to category type](#category-type)
    - [Creating columns with the assign function](#assign)
    - [Limit number of rows to load on file read](#limit-rows-read)
    - [Manually set number of rows and columns to print](#manual-print)

<a id='basics'></a>

## Reading Files, Selecting Columns, and Summarizing

---

In [1]:
import pandas as pd

<a id='loading'></a>
### Loading data

**Q.1** You can read a file from your local computer or directly from a URL.

In [2]:
# Local:
# pd.read_table('u.user')

# Remote:
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user')

In [3]:
users.head(2)

Unnamed: 0,user_id|age|gender|occupation|zip_code
0,1|24|M|technician|85711
1,2|53|F|other|94043


**Q.2** Use kwargs to set appropriate data-reading parameters.

In [4]:
# read 'u.user' into 'users'
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|', index_col='user_id')
users.head(2)

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
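
The effect of `sep` and `index_col` can be reproduced offline. The following is a minimal sketch that assumes a hypothetical three-row sample in the same pipe-delimited layout as `u.user`; the string `raw` and the variable names are illustrative, not part of the lesson data.

```python
import io
import pandas as pd

# Hypothetical sample in the same pipe-delimited layout as u.user.
raw = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# read_table splits on tabs by default, so every line lands in one column.
one_col = pd.read_table(io.StringIO(raw))

# Passing sep and index_col gives one column per field and uses user_id as the index.
parsed = pd.read_table(io.StringIO(raw), sep="|", index_col="user_id")

print(one_col.shape)
print(parsed.shape)
```

Comparing the two shapes is a quick way to confirm that the delimiter was picked up correctly.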


<a id='examine'></a>
### Basic examination of dataframes

**Q.1** Print the type of `users`.

In [5]:
type(users)

pandas.core.frame.DataFrame

**Q.2** Print the first 5 rows, first 10 rows, and last 2 rows of `users`.

In [6]:
users.head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [7]:
users.head(10)

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
6,42,M,executive,98101
7,57,M,administrator,91344
8,36,M,administrator,05201
9,29,M,student,01002
10,53,M,lawyer,90703


In [8]:
users.tail(2)

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
942,48,F,librarian,78209
943,22,M,student,77841


**Q.3** Print the index and columns.

In [9]:
print(users.index[0:5])
print(users.columns)

Int64Index([1, 2, 3, 4, 5], dtype='int64', name='user_id')
Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')


**Q.4** Find the dtypes of the columns.

In [10]:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object

**Q.5** Find the dimensions of the dataframe.

In [11]:
users.shape

(943, 4)

**Q.6** Extract the underlying numpy array as a new variable.

In [12]:
X = users.values
print(type(X), X.shape)

<class 'numpy.ndarray'> (943, 4)


<a id='selecting'></a>
### Selecting columns

**Q.1** Assign the `gender` column to a variable.

In [13]:
gender = users['gender']
gender = users.gender

**Q.2** What is the type of `gender`?

In [14]:
type(gender)

pandas.core.series.Series

**Q.3** Select `gender` and `occupation` as a new dataframe.

In [15]:
gen_occ = users[['gender','occupation']]
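
Single and double brackets return different types, which is easy to miss. Here is a small sketch on a hypothetical two-row frame (`toy` is an illustrative name):

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as `users`.
toy = pd.DataFrame({
    "gender": ["M", "F"],
    "occupation": ["technician", "other"],
    "age": [24, 53],
})

as_series = toy["gender"]     # a single label returns a Series
as_frame = toy[["gender"]]    # a list of labels returns a DataFrame, even for one column

print(type(as_series).__name__, type(as_frame).__name__)
```

Attribute access such as `toy.gender` is shorthand for `toy["gender"]`, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.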

<a id='describing'></a>
### Describing the data

**Q.1** Calculate the descriptive statistics for the numeric columns in the dataframe.

In [16]:
users.describe()

Unnamed: 0,age
count,943.0
mean,34.051962
std,12.19274
min,7.0
25%,25.0
50%,31.0
75%,43.0
max,73.0


**Q.2** Describe the "object" (string) columns.

In [17]:
users.describe(include=['object'])

Unnamed: 0,gender,occupation,zip_code
count,943,943,943
unique,2,21,795
top,M,student,55414
freq,670,196,9


**Q.3** Describe all the columns regardless of type.

In [18]:
users.describe(include='all')

Unnamed: 0,age,gender,occupation,zip_code
count,943.0,943,943,943
unique,,2,21,795
top,,M,student,55414
freq,,670,196,9
mean,34.051962,,,
std,12.19274,,,
min,7.0,,,
25%,25.0,,,
50%,31.0,,,
75%,43.0,,,


**Q.4** Describe the `gender` Series from the `users` dataframe.

In [19]:
users.gender.describe()

count     943
unique      2
top         M
freq      670
Name: gender, dtype: object

**Q.5** Calculate the mean of the `age` column.

In [20]:
users.age.mean()

34.05196182396607

**Q.6** Calculate the counts of distinct values in the `gender` and `age` columns.

In [21]:
# most useful for categorical variables.
users.gender.value_counts()

M    670
F    273
Name: gender, dtype: int64

In [23]:
# You can also use it on numeric, however:
users.age.value_counts()[0:5]

30    39
25    38
22    37
28    36
27    35
Name: age, dtype: int64

<a id='exercise-1'></a>
## Exercise 1

---

Load the `drinks.csv` data provided in the url below.

**Perform the following:**
1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the `beer_servings` column/Series to a variable.
4. Calculate summary statistics for `beer_servings`.
5. Calculate the median of `beer_servings`.
6. Count the values of unique categories in `continent`.
7. Print the dimensions of the drinks dataframe.
8. Find the first 3 items of the value counts of the `occupation` column in `users`.

**BONUS:**
- Create the 'users' DataFrame from the `user_file` provided (which lacks a header row).
- Supply a header: `['user_id', 'age', 'gender', 'occupation', 'zip_code']`


In [35]:
drinks_csv = 'https://raw.githubusercontent.com/josephnelson93/GA-DSI/master/example-lessons/plotting-with-pandas/drinks.csv'
user_file = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user_original'

In [36]:
'''
EXERCISE ONE
'''

# read drinks.csv into a DataFrame called 'drinks'
drinks = pd.read_table('https://raw.githubusercontent.com/josephnelson93/GA-DSI/master/example-lessons/plotting-with-pandas/drinks.csv', sep=',')
drinks = pd.read_csv('https://raw.githubusercontent.com/josephnelson93/GA-DSI/master/example-lessons/plotting-with-pandas/drinks.csv')              # assumes separator is comma

# print the head and the tail
drinks.head()
drinks.tail()

# examine the default index, data types, and shape
drinks.index
drinks.dtypes
drinks.shape

# print the 'beer_servings' Series
drinks['beer_servings']
drinks.beer_servings

# summarize 'beer_servings', then calculate its mean and median
drinks.describe()                   # summarize all numeric columns
drinks.beer_servings.describe()     # summarize only the 'beer_servings' Series
drinks.beer_servings.mean()         # only calculate the mean
drinks.beer_servings.median()       # only calculate the median

# count the number of occurrences of each 'continent' value and see if it looks correct
drinks.continent.value_counts()

# BONUS: display only the number of rows of the 'users' DataFrame
users.shape[0]

# BONUS: display the 3 most frequent occupations in 'users'
users.occupation.value_counts().head(3)
users.occupation.value_counts()[:3]

# BONUS: create the 'users' DataFrame from the u.user_original file (which lacks a header row)
# Hint: read the pandas.read_table documentation
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user_original', sep='|', header=None, names=user_cols, index_col='user_id')


<a id='filtering-sorting'></a>

## Filtering and sorting dataframes and series

---


<a id='filtering'></a>
### Boolean filtering

**Q.1** Show users with age < 20 using a boolean mask.

In [38]:
# boolean filtering: only show users with age < 20
young_bool = users.age < 20         # create a Series of booleans...
users[young_bool]                   # ...and use that Series to filter rows

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30,7,M,student,55436
36,19,F,student,93117
52,18,F,student,55105
57,16,M,none,84010
67,17,M,student,60402
68,19,M,student,22904
101,15,M,student,05146
110,19,M,student,77840
142,13,M,other,48118
179,15,M,entertainment,20755


**Q.2** Calculate value counts of occupation for users age < 20.

In [39]:
users[users.age < 20]               # or, combine into a single step
users[users.age < 20].occupation    # select one column from the filtered results
users[users.age < 20].occupation.value_counts()     # value_counts of resulting Series

student          64
other             4
none              3
writer            2
entertainment     2
salesman          1
artist            1
Name: occupation, dtype: int64

**Q.3** Print the male users age < 20. 

In [40]:
# boolean filtering with multiple conditions
users[(users.age < 20) & (users.gender=='M')]       # ampersand for AND condition

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30,7,M,student,55436
57,16,M,none,84010
67,17,M,student,60402
68,19,M,student,22904
101,15,M,student,05146
110,19,M,student,77840
142,13,M,other,48118
179,15,M,entertainment,20755
221,19,M,student,20685
246,19,M,student,28734


**Q.4** Print the users age < 10 or age > 70.

In [45]:
users[(users.age < 10) | (users.age > 70)]          # pipe for OR condition


Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30,7,M,student,55436
481,73,M,retired,37771
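
The three mask operators (`&`, `|`, `~`) can be tried on a tiny frame. This sketch assumes a hypothetical four-row frame named `toy`; each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `<` and `==` in Python.

```python
import pandas as pd

# Hypothetical four-row frame for illustration.
toy = pd.DataFrame({"age": [7, 19, 35, 73], "gender": ["M", "F", "F", "M"]})

young_men = toy[(toy.age < 20) & (toy.gender == "M")]   # AND: both conditions hold
extremes = toy[(toy.age < 10) | (toy.age > 70)]         # OR: either condition holds
not_young = toy[~(toy.age < 20)]                        # NOT: invert the mask

print(young_men)
print(extremes)
print(not_young)
```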


<a id='sorting'></a>
### Sorting

**Q.1** Return the age column sorted (ascending order)

In [47]:
users.age.sort_values()                   # sort a column

user_id
30      7
471    10
289    11
880    13
609    13
142    13
674    13
628    13
813    14
206    14
887    14
849    15
281    15
461    15
618    15
179    15
101    15
57     16
580    16
550    16
451    16
434    16
621    17
619    17
761    17
375    17
904    17
646    17
582    17
257    17
       ..
90     60
308    60
931    60
752    60
469    60
464    60
234    60
694    60
934    61
351    61
106    61
520    62
266    62
858    63
777    63
364    63
845    64
423    64
318    65
651    65
564    65
211    66
349    68
573    68
559    69
585    69
767    70
803    70
860    70
481    73
Name: age, dtype: int64

**Q.2** Sort the users dataframe by the age column (ascending).

In [48]:
users.sort_values('age').head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30,7,M,student,55436
471,10,M,student,77459
289,11,M,none,94619
880,13,M,student,83702
609,13,F,student,55106


**Q.3** Sort the users dataframe by the age column *descending*.

In [49]:
users.sort_values('age', ascending=False).head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
481,73,M,retired,37771
803,70,M,administrator,78212
767,70,M,engineer,00000
860,70,F,retired,48322
585,69,M,librarian,98501
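
`sort_values` also accepts a list of columns with a matching list of sort directions. This is a minimal sketch on a hypothetical four-row frame (`toy` and `ordered` are illustrative names):

```python
import pandas as pd

# Hypothetical four-row frame for illustration.
toy = pd.DataFrame({
    "occupation": ["student", "lawyer", "student", "lawyer"],
    "age": [19, 53, 22, 28],
})

# Sort by occupation A-Z, then by age from oldest to youngest within each occupation.
ordered = toy.sort_values(["occupation", "age"], ascending=[True, False])

print(ordered)
```

The original index labels travel with their rows, so the index of `ordered` is no longer in ascending order.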


<a id='exercise-2'></a>

## Exercise 2

---

**Using the drinks dataframe from the previous exercise:**
1. Filter drinks to include only European countries.
2. Filter drinks to include only European countries with `wine_servings` > 300.
3. Calculate the mean `beer_servings` for all of Europe.
4. Which 10 countries have the highest `total_litres_of_pure_alcohol`?

**Using the users dataframe:**
1. Sort users by occupation and then by age in a single command.
2. Filter users to only include doctors and lawyers without using a `|`.

> *Hint:* look up `pandas.Series.isin`

In [50]:
'''
EXERCISE TWO
'''

# filter 'drinks' to only include European countries
drinks[drinks.continent=='EU']

# filter 'drinks' to only include European countries with wine_servings > 300
drinks[(drinks.continent=='EU') & (drinks.wine_servings > 300)]

# calculate the mean 'beer_servings' for all of Europe
drinks[drinks.continent=='EU'].beer_servings.mean()

# determine which 10 countries have the highest total_litres_of_pure_alcohol
drinks.sort_values('total_litres_of_pure_alcohol').tail(10)

# BONUS: sort 'users' by 'occupation' and then by 'age' (in a single command)
users.sort_values(['occupation', 'age'])

# BONUS: filter 'users' to only include doctors and lawyers without using a |
# Hint: read the pandas.Series.isin documentation
users[users.occupation.isin(['doctor', 'lawyer'])].head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
10,53,M,lawyer,90703
125,30,M,lawyer,22202
126,28,F,lawyer,20015
138,46,M,doctor,53211
161,50,M,lawyer,55104


<a id='columns'></a>

## Renaming, adding, and removing columns

---

<a id='renaming-columns'></a>
### Renaming columns

**Q.1** Rename "beer_servings" to "beer" and "wine_servings" to "wine" in the drinks dataframe, returning a *new* dataframe.

In [51]:
# rename one or more columns
renamed_drinks = drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})

**Q.2** Do the same renaming for drinks, but inplace.

In [52]:
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'}, inplace=True)


**Q.3** Replace the column names of drinks with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`

In [53]:
# replace all column names
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
drinks.columns = drink_cols

# Side note: you can replace when loading a csv or other file:
# drinks = pd.read_csv('drinks.csv', header=0, names=drink_cols)

<a id='adding-columns'></a>
### Adding columns

**Q.1** Make a "servings" column that is beer + spirit + wine

In [54]:
drinks['servings'] = drinks.beer + drinks.spirit + drinks.wine

**Q.2** Make a "mL" column that is the liters column times 1000.

In [55]:
drinks['mL'] = drinks.liters * 1000

<a id='removing-columns'></a>
### Removing columns

**Q.1** Remove the "mL" column returning a new dataframe.

In [56]:
dropped = drinks.drop('mL', axis=1) # axis=0 for rows, 1 for columns


**Q.2** Remove the "mL" and "servings" columns from drinks inplace.

In [57]:
drinks.drop(['mL', 'servings'], axis=1, inplace=True)   # drop multiple columns
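
The same rename, add, and drop steps can also be chained without `inplace=True`, which leaves the original dataframe untouched. The sketch below assumes a hypothetical two-row frame; `toy` and `result` are illustrative names.

```python
import pandas as pd

# Hypothetical two-row frame with two of the drinks columns.
toy = pd.DataFrame({"beer_servings": [0, 89], "wine_servings": [0, 54]})

result = (
    toy.rename(columns={"beer_servings": "beer", "wine_servings": "wine"})
       .assign(servings=lambda df: df.beer + df.wine)   # add a derived column
       .drop(columns=["wine"])                          # same as drop('wine', axis=1)
)

print(result)
print(toy.columns.tolist())   # the original frame keeps its column names
```

Each method returns a new dataframe, so the chain reads top to bottom and `toy` can be reused afterwards.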

<a id='missing'></a>
## Handling missing values

---

<a id='find-missing'></a>
### Finding missing values

**Q.1** Include missing values of the continent variable in the drinks dataframe when counting unique values.

In [58]:
# missing values are usually excluded by default
drinks.continent.value_counts()              # excludes missing values
drinks.continent.value_counts(dropna=False)  # includes missing values

AF     53
EU     45
AS     44
NaN    23
OC     16
SA     12
Name: continent, dtype: int64

**Q.2** Create boolean Series marking which values of `continent` are missing and which are not.

In [59]:
# find missing values in a Series
is_null = drinks.continent.isnull() # True if missing
is_not_null = drinks.continent.notnull() # True if not missing

**Q.3** Subset to rows in drinks where continent is missing and where continent is not missing.

In [60]:
# use a boolean Series to filter DataFrame rows
drinks_continent_null = drinks[drinks.continent.isnull()]   # only show rows where continent is missing
drinks_continent_notnull = drinks[drinks.continent.notnull()]  # only show rows where continent is not missing

**Q.4** [Side note] Calculate the sum of drink *columns* and the sum of *rows*.

In [62]:
# side note: understanding axes
drinks.sum(numeric_only=True)                 # sums "down" the 0 axis (rows)
drinks.sum(axis=0, numeric_only=True)         # equivalent (since axis=0 is the default)
drinks.sum(axis=1, numeric_only=True).head()  # sums "across" the 1 axis (columns)

0      0.0
1    279.9
2     39.7
3    707.4
4    324.9
dtype: float64

In [63]:
# side note: adding booleans
# pd.Series([True, False, True])          # create a boolean Series
# pd.Series([True, False, True]).sum()    # converts False to 0 and True to 1

**Q.5** Find the number of missing values by column in drinks.

In [64]:
# find missing values in a DataFrame
drinks.isnull()             # DataFrame of booleans
drinks.isnull().sum()       # count the missing values in each column

country       0
beer          0
spirit        0
wine          0
liters        0
continent    23
dtype: int64

<a id='drop-missing'></a>
### Dropping missing values

**Q.1** Drop rows where *ANY* values are missing in drinks (return new dataframe)

In [65]:
print(drinks.shape)
d = drinks.dropna()
print(d.shape)

(193, 6)
(170, 6)


**Q.2** Drop rows only where *ALL* values are missing in drinks.

In [66]:
print(drinks.shape)
d = drinks.dropna(how='all')
print(d.shape)

(193, 6)
(193, 6)
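
Besides `how`, `dropna` can be limited to certain columns with `subset` or to a minimum number of non-missing values with `thresh`. This is a small sketch on a hypothetical four-row frame (the data and variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, rows 1 and 2 miss one value, row 3 misses two.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer": [1.0, np.nan, 5.0, np.nan],
    "continent": ["EU", "AS", None, None],
})

any_missing = toy.dropna()                       # keep rows with no missing values
by_subset = toy.dropna(subset=["continent"])     # only look at 'continent'
at_least_two = toy.dropna(thresh=2)              # keep rows with >= 2 non-missing values
all_missing = toy.dropna(how="all")              # drop rows only if every value is missing

print(len(any_missing), len(by_subset), len(at_least_two), len(all_missing))
```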


<a id='fill-missing'></a>
### Fill in missing values

**Q.1** Fill in the missing values of the continent column with string "NA"

In [67]:
# fill in missing values
drinks['continent'] = drinks.continent.fillna(value='NA')   # fill in missing values with 'NA'

**Q.2** Turn off the missing value filter when loading the drinks csv.

In [68]:
# turn off the missing value filter
drinks = pd.read_csv(drinks_csv, header=0, names=drink_cols, na_filter=False)
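
The missing-value filter matters here because `read_csv` treats the literal string `NA` as missing by default, which is why North America shows up as `NaN`. Below is a minimal sketch using a hypothetical two-row CSV string (`raw` is illustrative):

```python
import io
import pandas as pd

# Hypothetical two-row CSV where 'NA' means North America.
raw = "country,continent\nCanada,NA\nFrance,EU\n"

default = pd.read_csv(io.StringIO(raw))                      # 'NA' is parsed as missing
unfiltered = pd.read_csv(io.StringIO(raw), na_filter=False)  # 'NA' stays a string

print(default.continent.isnull().sum())
print(unfiltered.continent.tolist())
```

Note that `na_filter=False` disables missing-value detection for every column. When only some markers should be kept as text, `keep_default_na=False` combined with an explicit `na_values` list is a more targeted alternative.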

<a id='exercise-3'></a>
## Exercise 3

---

**Using the ufo data provided below:**
1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most reported colors.
4. Find the most frequent city for reports in state VA.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where city is missing.
8. Count the number of rows with no null values.
9. Replace spaces in column names with underscores.
10. Make a new column that is a combination of city and state.

In [70]:
ufo_csv = 'https://raw.githubusercontent.com/josephofiowa/DAT8/master/data/ufo.csv'

In [71]:
'''
EXERCISE THREE
'''

# read ufo.csv into a DataFrame called 'ufo'
ufo = pd.read_table('https://raw.githubusercontent.com/josephofiowa/DAT8/master/data/ufo.csv', sep=',')
ufo = pd.read_csv('https://raw.githubusercontent.com/josephofiowa/DAT8/master/data/ufo.csv')

# check the shape of the DataFrame
ufo.shape

# calculate the most frequent value for each of the columns (in a single command)
ufo.describe()

# what are the four most frequent colors reported?
ufo['Colors Reported'].value_counts().head(4)

# for reports in VA, what's the most frequent city?
ufo[ufo.State=='VA'].City.value_counts().head(1)

# show only the UFO reports from Arlington, VA
ufo[(ufo.City=='Arlington') & (ufo.State=='VA')]

# count the number of missing values in each column
ufo.isnull().sum()

# show only the UFO reports in which the City is missing
ufo[ufo.City.isnull()]

# how many rows remain if you drop all rows with any missing values?
ufo.dropna().shape[0]

# replace any spaces in the column names with an underscore
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)

# BONUS: redo the task above, writing generic code to replace spaces with underscores
# In other words, your code should not reference the specific column names
ufo.columns = [col.replace(' ', '_') for col in ufo.columns]
ufo.columns = ufo.columns.str.replace(' ', '_')

# Create a new column called 'Location' that includes both City and State
# For example, the 'Location' for the first row would be 'Ithaca, NY'
ufo['Location'] = ufo.City + ', ' + ufo.State

<a id='split-apply-combine'></a>
## Split-apply-combine

---

![](http://i.imgur.com/yjNkiwL.png)

<a id='groupby'></a>
### Grouping

**Q.1** With the drinks dataframe, calculate the mean beer servings by continent.

In [72]:
# for each continent, calculate the mean beer servings
drinks.groupby('continent').beer.mean()

continent
AF     61.471698
AS     37.045455
EU    193.777778
NA    145.434783
OC     89.687500
SA    175.083333
Name: beer, dtype: float64
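
To make the three steps of split-apply-combine explicit, the sketch below performs them by hand on a hypothetical four-row frame and compares the result with the built-in `groupby` call. The names `toy`, `pieces`, `manual`, and `built_in` are illustrative.

```python
import pandas as pd

# Hypothetical four-row frame for illustration.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "beer": [100, 200, 10, 30],
})

pieces = {}
for name, group in toy.groupby("continent"):   # split: one sub-frame per continent
    pieces[name] = group.beer.mean()           # apply: reduce each sub-frame to a number
manual = pd.Series(pieces).sort_index()        # combine: collect the results in a Series

built_in = toy.groupby("continent").beer.mean()

print(manual)
print(built_in)
```

The built-in version does the same work in one expression and is usually the better choice; the loop is only there to show what happens conceptually.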

**Q.2** Describe the beer column by continent.

In [73]:
# for each continent, describe beer servings
drinks.groupby('continent').beer.describe()

continent       
AF         count     53.000000
           mean      61.471698
           std       80.557816
           min        0.000000
           25%       15.000000
           50%       32.000000
           75%       76.000000
           max      376.000000
AS         count     44.000000
           mean      37.045455
           std       49.469725
           min        0.000000
           25%        4.250000
           50%       17.500000
           75%       60.500000
           max      247.000000
EU         count     45.000000
           mean     193.777778
           std       99.631569
           min        0.000000
           25%      127.000000
           50%      219.000000
           75%      270.000000
           max      361.000000
NA         count     23.000000
           mean     145.434783
           std       79.621163
           min        1.000000
           25%       80.000000
           50%      143.000000
           75%      198.000000
           max      28

<a id='apply-combine'></a>
### Apply functions to groups and combine

**Q.1** Find the count, mean, minimum, and maximum of the beer column by continent.

In [74]:
# similar, but outputs a DataFrame and can be customized
drinks.groupby('continent').beer.agg(['count', 'mean', 'min', 'max'])

Unnamed: 0_level_0,count,mean,min,max
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AF,53,61.471698,0,376
AS,44,37.045455,0,247
EU,45,193.777778,0,361
NA,23,145.434783,1,285
OC,16,89.6875,0,306
SA,12,175.083333,93,333


**Q.2** Do the same as in Q.1, but sort the output by the mean column.

In [75]:
drinks.groupby('continent').beer.agg(['count', 'mean', 'min', 'max']).sort_values('mean')


Unnamed: 0_level_0,count,mean,min,max
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AS,44,37.045455,0,247
AF,53,61.471698,0,376
OC,16,89.6875,0,306
NA,23,145.434783,1,285
SA,12,175.083333,93,333
EU,45,193.777778,0,361


**Q.3** Apply a custom function to all the columns of the drinks dataframe, grouped by continent.

In [77]:
# first value of each column by continent:
drinks.groupby('continent').apply(lambda x: x.iloc[0,:])

Unnamed: 0_level_0,country,beer,spirit,wine,liters,continent
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AF,Algeria,25,0,14,0.7,AF
AS,Afghanistan,0,0,0,0.0,AS
EU,Albania,89,132,54,4.9,EU
NA,Antigua & Barbuda,102,128,45,4.9,NA
OC,Australia,261,72,212,10.4,OC
SA,Argentina,193,25,221,8.3,SA


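**Q.4** [Note] If you don't specify a column, the aggregation function is applied to every column. Pass `numeric_only=True` to restrict `mean` to the numeric columns; older pandas versions dropped non-numeric columns silently.

In [78]:
drinks.groupby('continent').mean(numeric_only=True)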
drinks.groupby('continent').describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,beer,liters,spirit,wine
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AF,count,53.0,53.0,53.0,53.0
AF,mean,61.471698,3.007547,16.339623,16.264151
AF,std,80.557816,2.647557,28.102794,38.846419
AF,min,0.0,0.0,0.0,0.0
AF,25%,15.0,0.7,1.0,1.0
AF,50%,32.0,2.3,3.0,2.0
AF,75%,76.0,4.7,19.0,13.0
AF,max,376.0,9.1,152.0,233.0
AS,count,44.0,44.0,44.0,44.0
AS,mean,37.045455,2.170455,60.840909,9.068182


<a id='exercise-4'></a>

## Exercise 4

---

**Using the users dataframe**:
1. Count the number of distinct occupations in users.
2. Calculate the mean age by occupation.
3. Calculate the minimum and maximum age by occupation.
4. Calculate the mean age by cross-sections of occupation and gender.

> *Tip: multiple columns can be passed to the groupby function for granular cross-sections.*

In [79]:
'''
EXERCISE FOUR
'''

# for each occupation in 'users', count the number of occurrences
users.occupation.value_counts()

# for each occupation, calculate the mean age
users.groupby('occupation').age.mean()

# For each occupation, calculate the minimum and maximum ages
users.groupby('occupation').age.agg(['min', 'max'])

# For each combination of occupation and gender, calculate the mean age
users.groupby(['occupation', 'gender']).age.mean()

occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.16666

<a id='indexing'></a>
## Indexing

---

<a id='loc'></a>
### Label indexing with `.loc`

**Q.1** Select all rows and the "City" column from the ufo dataset with `.loc`.

In [80]:
d = ufo.loc[:, 'City'] # colon means "all rows", then select one column

**Q.2** Select all rows and columns "City" and "State"

In [81]:
d = ufo.loc[:, ['City', 'State']]   # select two columns

**Q.3** Select all rows and columns from "City" *through* "State"

In [82]:
d = ufo.loc[:, 'City':'State'] # select a range of columns
d.columns

Index(['City', 'Colors_Reported', 'Shape_Reported', 'State'], dtype='object')

**Q.4** Select:
- all columns at row 0
- all columns at rows 0:2
- columns "City" through "State" at rows 0:2

In [83]:
# loc can also filter rows by "name" (the index)
d = ufo.loc[0, :]                   # row 0, all columns
d = ufo.loc[0:2, :]                 # rows 0/1/2, all columns
d = ufo.loc[0:2, 'City':'State']    # rows 0/1/2, range of columns

<a id='iloc'></a>
### Position indexing with `.iloc`

**Q.1** Select all rows and columns in position 0 and 3.

In [84]:
d = ufo.iloc[:, [0, 3]] # all rows, columns in position 0/3

**Q.2** Select all rows and the columns in positions 0 through 3 (the slice `0:4` excludes position 4).

In [85]:
d = ufo.iloc[:, 0:4] # all rows, columns in position 0/1/2/3

**Q.3** Select rows in position 0:3 and all columns.

In [87]:
d = ufo.iloc[0:3, :] # rows in position 0/1/2, all columns
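
The key difference between the two indexers is that `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position. The sketch below shows this on a hypothetical four-row frame (`toy` and `shuffled` are illustrative names):

```python
import pandas as pd

# Hypothetical four-row frame with a default 0..3 index.
toy = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene"],
    "State": ["NY", "NJ", "CO", "KS"],
})

by_label = toy.loc[0:2, :]       # labels 0, 1 and 2 (end label included)
by_position = toy.iloc[0:2, :]   # positions 0 and 1 (end position excluded)

# After sorting, labels and positions no longer line up.
shuffled = toy.sort_values("City")
first_by_position = shuffled.iloc[0]["City"]   # first row in the sorted order
first_by_label = shuffled.loc[0]["City"]       # row whose index label is 0

print(len(by_label), len(by_position))
print(first_by_position, first_by_label)
```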

<a id='frequent'></a>
## Frequently used features

---

<a id='map-dict'></a>
### The `.map` function with replacement dictionaries

In [88]:
# map existing values to a different set of values
users['is_male'] = users.gender.map({'F':0, 'M':1})
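
One thing to watch with `.map` and a dictionary: values that are not keys in the dictionary become `NaN`. Here is a small sketch with a hypothetical Series that contains an unexpected code `'X'`:

```python
import pandas as pd

# Hypothetical Series with one value ('X') that the dictionary does not cover.
gender = pd.Series(["M", "F", "X"])

mapped = gender.map({"F": 0, "M": 1})

print(mapped)
print(mapped.isnull().sum())   # number of values the dictionary did not cover
```

Counting the nulls after mapping is a cheap check that the dictionary covered every category.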

<a id='factorize'></a>
### Encode strings as integers with `.factorize`

In [89]:
# encode strings as integer values (automatically starts at 0)
users['occupation_num'] = users.occupation.factorize()[0]

users.head()

Unnamed: 0_level_0,age,gender,occupation,zip_code,is_male,occupation_num
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,24,M,technician,85711,1,0
2,53,F,other,94043,0,1
3,23,M,writer,32067,1,2
4,24,M,technician,43537,1,0
5,33,F,other,15213,0,1
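
`.factorize()` returns two objects: the integer codes and the unique labels those codes refer to, which is why the cell above takes element `[0]`. A short sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical Series for illustration.
occupation = pd.Series(["technician", "other", "writer", "technician"])

codes, labels = occupation.factorize()

print(codes)            # one integer per row, numbered in order of first appearance
print(list(labels))     # the label each integer stands for
print(labels[codes[3]]) # decode the code of the last row back to its label
```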


<a id='unique'></a>
### Determine unique values

In [90]:
# determine unique values in a column
users.occupation.nunique()      # count the number of unique values

21

In [91]:
users.occupation.unique()       # return the unique values

array(['technician', 'other', 'writer', 'executive', 'administrator',
       'student', 'lawyer', 'educator', 'scientist', 'entertainment',
       'programmer', 'librarian', 'homemaker', 'artist', 'engineer',
       'marketing', 'none', 'healthcare', 'retired', 'salesman', 'doctor'], dtype=object)

<a id='replace'></a>
### Replace values with `.replace`

In [92]:
# replace all instances of a value in a column (must match entire value)
ufo['State'] = ufo.State.replace('Fl', 'FL')

<a id='series-str'></a>
### Series string methods with `.str`

In [94]:
# string methods are accessed via 'str'
ufo.State.str.upper()                               # converts to uppercase
ufo.Colors_Reported.str.contains('RED', na=False).head(2) # checks for a substring; missing values count as False

0    False
1    False
Name: Colors_Reported, dtype: bool
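
The `na` and `case` arguments of `.str.contains` control how missing values and letter case are handled. The following sketch uses a hypothetical four-element Series (`colors` is an illustrative name):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value and mixed case.
colors = pd.Series(["RED", "ORANGE RED", np.nan, "Red"])

has_red = colors.str.contains("RED", na=False)                       # case-sensitive
has_red_any_case = colors.str.contains("red", case=False, na=False)  # ignore case

print(has_red.tolist())
print(has_red_any_case.tolist())
```

With `na=False` the result contains only `True` and `False`, so it can be used directly as a filter mask.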

<a id='datetime'></a>
### `datetime` conversion and arithmetic

In [95]:
# convert a string to the datetime format
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.Time.dt.hour                        # datetime format exposes convenient attributes
(ufo.Time.max() - ufo.Time.min()).days  # also allows you to do datetime "math"
ufo[ufo.Time > pd.Timestamp(2014, 1, 1)].head(2) # boolean filtering with datetime format

Unnamed: 0,City,Colors_Reported,Shape_Reported,State,Time,Location
75177,Clarksville,ORANGE,SPHERE,TN,2014-01-01 00:01:00,"Clarksville, TN"
75178,Henderson,,SPHERE,NV,2014-01-01 00:01:00,"Henderson, NV"
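
The same datetime operations can be tried on a hypothetical two-element Series of timestamp strings (the values below are illustrative, not taken from the UFO data):

```python
import pandas as pd

# Hypothetical timestamps for illustration.
times = pd.to_datetime(pd.Series(["1930-06-01 22:00", "2014-01-01 00:01"]))

hours = times.dt.hour                                  # .dt exposes datetime parts
span_days = (times.max() - times.min()).days           # subtraction yields a Timedelta
recent = times[times > pd.Timestamp(2014, 1, 1)]       # compare against a Timestamp

print(hours.tolist())
print(span_days)
print(recent)
```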


<a id='set-reset-index'></a>
### Setting and resetting the index

In [96]:
# setting and then removing an index
ufo.set_index('Time', inplace=True)
ufo.reset_index(inplace=True)

<a id='sort-by-index'></a>
### Sorting by the index

In [98]:
# sort a Series by its index
ufo.State.value_counts().sort_index()[0:3]

AK    403
AL    808
AR    748
Name: State, dtype: int64

<a id='change-dtype'></a>
### Changing the data type of a column

In [101]:
# change the data type of a column
drinks['beer'] = drinks.beer.astype('float')

# change the data type of a column when reading in a file
d = pd.read_csv(drinks_csv, dtype={'beer_servings':float})

<a id='dummy'></a>
### Create dummy-coded columns from a categorical column

In [102]:
# create dummy variables for 'continent' and exclude first dummy column
continent_dummies = pd.get_dummies(drinks.continent, prefix='cont').iloc[:, 1:]
continent_dummies.head(3)

Unnamed: 0,cont_AS,cont_EU,cont_NA,cont_OC,cont_SA
0,1,0,0,0,0
1,0,1,0,0,0
2,0,0,0,0,0
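
Dropping the first dummy column avoids perfectly collinear columns, with the dropped category acting as the baseline. `get_dummies` can do this itself through `drop_first=True`; the sketch below compares both approaches on a hypothetical three-element Series:

```python
import pandas as pd

# Hypothetical Series for illustration.
continent = pd.Series(["AS", "EU", "AF"])

manual = pd.get_dummies(continent, prefix="cont").iloc[:, 1:]           # slice off the first column
built_in = pd.get_dummies(continent, prefix="cont", drop_first=True)    # let pandas drop it

print(manual.columns.tolist())
print(manual.equals(built_in))
```

The dummy columns are created in sorted category order, so `cont_AF` is the one that gets dropped here.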


<a id='concatenate'></a>
### Concatenate dataframes together

In [103]:
# concatenate two DataFrames (axis=0 for rows, axis=1 for columns)
drinks = pd.concat([drinks, continent_dummies], axis=1)

In [104]:
drinks.head(2)

Unnamed: 0,country,beer,spirit,wine,liters,continent,cont_AS,cont_EU,cont_NA,cont_OC,cont_SA
0,Afghanistan,0.0,0,0,0.0,AS,1,0,0,0,0
1,Albania,89.0,132,54,4.9,EU,0,1,0,0,0


<a id='duplicate-rows'></a>
### Detect and drop duplicate rows

In [106]:
# detecting duplicate rows
d = users.duplicated()          # True if a row is identical to a previous row
d = users.duplicated().sum()    # count of duplicates
d = users[users.duplicated()]   # only show duplicates
d = users.drop_duplicates()     # drop duplicate rows
d = users.age.duplicated()      # check a single column for duplicates
d = users.duplicated(['age', 'gender', 'zip_code']).sum()   # specify columns for finding duplicates
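
To see exactly which rows get flagged, it can help to run the same calls on a frame small enough to check by eye. The sketch below uses a made-up four-row `toy` frame rather than `users`; the `keep=False` variant is an extra option not used above that marks every copy of a duplicated row, not only the later ones.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({'age': [24, 53, 24, 24], 'gender': ['M', 'F', 'M', 'F']})

dup_mask = toy.duplicated()                       # flags row 2 only (the repeat of row 0)
all_copies = toy.duplicated(keep=False)           # flags rows 0 and 2 (every copy)
deduped = toy.drop_duplicates()                   # keeps rows 0, 1, and 3
age_dups = toy.duplicated(subset=['age']).sum()   # 24 appears three times, so 2 repeats
```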

<a id='write-csv'></a>
### Write a dataframe to a csv

In [117]:
# write a DataFrame out to a CSV
# drinks.to_csv('drinks_updated.csv')                 # index is used as first column
# drinks.to_csv('drinks_updated.csv', index=False)    # ignore index
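
The effect of `index=False` is easiest to see by looking at the text that gets written. As a rough sketch, calling `to_csv` with no path returns the CSV as a string, so nothing touches the disk; the `toy` frame here is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'country': ['Albania', 'Algeria'], 'beer': [89, 25]})

with_index = toy.to_csv()                 # header line starts with an empty field for the index
without_index = toy.to_csv(index=False)   # header line contains only the real columns

# Reading the first version back turns the saved index into an 'Unnamed: 0' column
round_trip = pd.read_csv(io.StringIO(with_index))
```

This is one common source of stray `Unnamed: 0` columns when a file is saved with its index and then reloaded.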

<a id='pickle'></a>
### Write a dataframe to a pickle object

In [118]:
# save a DataFrame to disk (aka 'pickle') and read it from disk (aka 'unpickle')
# drinks.to_pickle('drinks_pickle')
# pd.read_pickle('drinks_pickle')

<a id='sample'></a>
### Randomly sample a dataframe

In [116]:
# randomly sample a DataFrame
train = drinks.sample(frac=0.75, random_state=1)    # will contain 75% of the rows
test = drinks[~drinks.index.isin(train.index)]      # will contain the other 25%
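
A quick sanity check on a split like this is that the two pieces do not overlap and together cover every row. The sketch below repeats the pattern on a made-up 20-row frame; note that the `isin` trick assumes the index values are unique.

```python
import pandas as pd

# Hypothetical toy data with a default (unique) integer index
toy = pd.DataFrame({'x': range(20)})

train = toy.sample(frac=0.75, random_state=1)   # 15 of the 20 rows, in shuffled order
test = toy[~toy.index.isin(train.index)]        # the remaining 5 rows

no_overlap = set(train.index).isdisjoint(test.index)
covers_all = len(train) + len(test) == len(toy)
```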

<a id='infrequent'></a>
## Infrequently used features

---


<a id='toy-dataframes'></a>
### Create dataframes from dictionaries and lists of lists

In [105]:
# create a DataFrame from a dictionary
d = pd.DataFrame({'capital':['Montgomery', 'Juneau', 'Phoenix'], 'state':['AL', 'AK', 'AZ']})

# create a DataFrame from a list of lists
d = pd.DataFrame([['Montgomery', 'AL'], ['Juneau', 'AK'], ['Phoenix', 'AZ']], columns=['capital', 'state'])

<a id='crosstab'></a>
### Do a cross-tabulation between Series

In [107]:
# display a cross-tabulation of two Series
pd.crosstab(users.occupation, users.gender)

gender,F,M
occupation,Unnamed: 1_level_1,Unnamed: 2_level_1
administrator,36,43
artist,13,15
doctor,0,7
educator,26,69
engineer,2,65
entertainment,2,16
executive,3,29
healthcare,11,5
homemaker,6,1
lawyer,2,10


<a id='query'></a>
### Query syntax for filtering

In [108]:
# alternative syntax for boolean filtering (noted as "experimental" in the documentation)
d = users.query('age < 20')                 # users[users.age < 20]
d = users.query("age < 20 and gender=='M'") # users[(users.age < 20) & (users.gender=='M')]
d = users.query('age < 20 or age > 60')     # users[(users.age < 20) | (users.age > 60)]
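
Query strings can also refer to Python variables by prefixing the name with `@`, which avoids building the string by hand. A minimal sketch comparing the two styles on a made-up frame (`toy` and `max_age` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'age': [17, 25, 64, 19], 'gender': ['M', 'F', 'M', 'F']})
max_age = 20

via_query = toy.query("age < @max_age and gender == 'M'")         # '@' pulls in the local variable
via_mask = toy[(toy.age < max_age) & (toy.gender == 'M')]         # equivalent boolean filtering
```

Both expressions select the same single row (index 0).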

<a id='memory-usage'></a>
### Calculate memory usage

In [109]:
# display the memory usage of a DataFrame
ufo.info()          # prints a summary, including total usage (returns None)
ufo.memory_usage()  # usage by column, in bytes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80543 entries, 0 to 80542
Data columns (total 6 columns):
Time               80543 non-null datetime64[ns]
City               80496 non-null object
Colors_Reported    17034 non-null object
Shape_Reported     72141 non-null object
State              80543 non-null object
Location           80496 non-null object
dtypes: datetime64[ns](1), object(5)
memory usage: 3.7+ MB


<a id='category-type'></a>
### Convert a column to type 'category'

In [110]:
# change a Series to the 'category' data type (reduces memory usage and can speed up operations when there are few unique values)
ufo['State'] = ufo.State.astype('category')
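
The memory saving can be measured directly with `memory_usage(deep=True)`. The sketch below uses a made-up Series of repeated state codes rather than the `ufo` data, so the exact byte counts will differ from what you see on the real column.

```python
import pandas as pd

# Hypothetical toy data: 4,000 values but only 3 distinct strings
states = pd.Series(['TN', 'NV', 'CA', 'TN'] * 1000)
as_category = states.astype('category')

object_bytes = states.memory_usage(deep=True)         # every string is stored separately
category_bytes = as_category.memory_usage(deep=True)  # small integer codes plus 3 category labels

categories = list(as_category.cat.categories)         # the distinct values, sorted
```

The saving shrinks as the number of distinct values approaches the length of the column.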

<a id='assign'></a>
### Define column with `.assign`

In [112]:
# define a new column as a function of existing columns (returns a new DataFrame; drinks itself is unchanged)
drinks.assign(servings = drinks.beer + drinks.spirit + drinks.wine).head(2)

Unnamed: 0,country,beer,spirit,wine,liters,continent,cont_AS,cont_EU,cont_NA,cont_OC,cont_SA,servings
0,Afghanistan,0.0,0,0,0.0,AS,1,0,0,0,0,0.0
1,Albania,89.0,132,54,4.9,EU,0,1,0,0,0,275.0


<a id='limit-rows-read'></a>
### Limit rows when reading a file

In [115]:
# limit which rows are read when reading in a file
d = pd.read_csv(drinks_csv, nrows=10)           # only read first 10 rows
d = pd.read_csv(drinks_csv, skiprows=[1, 2])    # skip the first two rows of data

<a id='manual-print'></a>
### Manually set maximum rows and columns to print

In [120]:
# change the maximum number of rows and columns printed (None means unlimited)
pd.set_option('display.max_rows', 2)     # default is 60 rows
pd.set_option('display.max_columns', 2)  # default is 20 columns
print(drinks)

         country   ...     cont_SA
0    Afghanistan   ...           0
..           ...   ...         ...
192     Zimbabwe   ...           0

[193 rows x 11 columns]


In [121]:
# reset options to defaults
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

In [None]:
# change the options temporarily (settings are restored when you exit the 'with' block)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(drinks)
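
To confirm that the setting really is restored after the block, you can read the option before, inside, and after it with `pd.get_option`. A small sketch on a made-up 100-row frame:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'x': range(100)})

default_max = pd.get_option('display.max_rows')

with pd.option_context('display.max_rows', 4):
    inside = pd.get_option('display.max_rows')   # 4 while inside the block
    short_repr = repr(toy)                       # truncated text representation

after = pd.get_option('display.max_rows')        # back to the earlier value
```

The truncated representation still ends with the full shape, `[100 rows x 1 columns]`.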