
# `pandas` Data Munging Overview: Part 2


---

### Lesson Guide
- [Exercise #3](#exercise-3)
- [Split-Apply-Combine](#split-apply-combine)
    - [`.groupby()`](#groupby)
    - [Apply Functions to Groups and Combine](#apply-combine)
- [Exercise #4](#exercise-4)
- [Indexing](#indexing)
    - [Location Indexing With `.loc[]`](#loc)
    - [Position Indexing With `.iloc[]`](#iloc)
- [Other Frequently Used Features](#frequent)
    - [Using Map Functions With Replacement Dictionaries](#map-dict)
    - [Encoding Strings as Integers With `.factorize()`](#factorize)
    - [Determining Unique Values](#unique)
    - [Replacing Values With `.replace()`](#replace)
    - [Series String Methods With `.str`](#series-str)
    - [Datetime Conversion and Arithmetic](#datetime)
    - [Setting and Resetting the Index](#set-reset-index)
    - [Sorting by Index](#sort-by-index)
    - [Changing the Data Type of a Column](#change-dtype)
    - [Creating Dummy-Coded Columns](#dummy)
    - [Concatenating DataFrames](#concatenate)
    - [Detecting and Dropping Duplicate Rows](#duplicate-rows)
    - [Writing a DataFrame to a `.csv`](#write-csv)
    - [Pickling a DataFrame](#pickle)
    - [Randomly Sampling a DataFrame](#sample)
- [Infrequently Used Features](#infrequent)
    - [Creating DataFrames From Dictionaries and Lists of Lists](#toy-dataframes)
    - [Performing Cross-Tabulations](#crosstab)
    - [Query-Filtering Syntax](#query)
    - [Calculating Memory Usage](#memory-usage)
    - [Converting Column to Category Type](#category-type)
    - [Creating Columns With `.assign()`](#assign)
    - [Limiting the Number of Rows to Load in a File Read](#limit-rows-read)
    - [Manually Setting the Number of Rows and Columns to Print](#manual-print)

In [1]:
import pandas as pd

<a id='exercise-3'></a>
## Exercise #3

---

**Using the UFO data provided below:**
1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most frequently reported colors.
4. Find the most frequent city for reports in state `VA`.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where `city` is missing.
8. Count the number of rows with no null values.
9. Amend column names with spaces to have underscores.
10. Make a new column that is a combination of `city` and `state`.

In [2]:
ufo_csv = '../../../../../resource-datasets/ufo_sightings/ufo.csv'

In [3]:
# Read `ufo.csv` into a DataFrame called `ufo`.
# The two calls below are equivalent; `read_csv` uses a comma separator by default.
ufo = pd.read_table(ufo_csv, sep=',')
ufo = pd.read_csv(ufo_csv)

In [4]:
# Check the shape of the DataFrame.
ufo.shape

(80543, 5)

In [5]:
# Calculate the most frequent value for each of the columns in a single command.
ufo.describe()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
count,80496,17034,72141,80543,80543
unique,13504,31,27,52,68901
top,Seattle,ORANGE,LIGHT,CA,7/4/2014 22:00
freq,646,5216,16332,10743,45


In [6]:
# What are the four most frequently reported colors?
ufo['Colors Reported'].value_counts().head(4)

ORANGE    5216
RED       4809
GREEN     1897
BLUE      1855
Name: Colors Reported, dtype: int64

In [7]:
# For reports in `VA`, what's the most frequently listed city?
ufo[ufo.State=='VA'].City.value_counts().head(1)

Virginia Beach    110
Name: City, dtype: int64

In [8]:
# Show only the UFO reports from Arlington, VA.
ufo[(ufo.City=='Arlington') & (ufo.State=='VA')].head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
202,Arlington,GREEN,OVAL,VA,7/13/1952 21:00
6300,Arlington,,CHEVRON,VA,5/5/1990 21:40
10278,Arlington,,DISK,VA,5/27/1997 15:30
14527,Arlington,,OTHER,VA,9/10/1999 21:41
17984,Arlington,RED,DISK,VA,11/19/2000 22:00


In [9]:
# Count the number of missing values in each column.
ufo.isnull().sum()

City                  47
Colors Reported    63509
Shape Reported      8402
State                  0
Time                   0
dtype: int64

In [10]:
# Show only the UFO reports in which the `city` is missing.
ufo[ufo.City.isnull()].head(10)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00
1877,,YELLOW,CIRCLE,AZ,8/15/1969 1:00
2013,,,,NH,8/1/1970 9:30
2546,,,FIREBALL,OH,10/25/1973 23:30
3123,,RED,TRIANGLE,WV,11/25/1975 23:00
4736,,,SPHERE,CA,6/23/1982 23:00


In [11]:
# How many rows remain if you drop all rows with any missing values?
ufo.dropna().shape[0]

15510

In [12]:
# Replace any spaces in the column names with underscores.
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)

In [13]:
# BONUS: Redo the task above, writing generic code to replace spaces with underscores.
# In other words, your code should not reference the specific column names.
ufo.columns = [col.replace(' ', '_') for col in ufo.columns]
ufo.columns = ufo.columns.str.replace(' ', '_')

In [14]:
# Create a new column called `Location` that combines `City` and `State`.
# For example, the `location` for the first row would be `Ithaca, NY`.
ufo['Location'] = ufo.City + ', ' + ufo.State
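
One thing to watch with string concatenation is that a missing value in either column makes the combined value missing as well. The sketch below uses a made-up two-row frame (the `toy` data and the `'Unknown'` placeholder are assumptions for illustration) to show the effect and one possible workaround:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the UFO data: the second row has no city.
toy = pd.DataFrame({'City': ['Ithaca', np.nan], 'State': ['NY', 'LA']})

# Concatenation propagates the missing value into the new column.
toy['Location'] = toy.City + ', ' + toy.State

# One option: fill the gap with a placeholder before concatenating.
filled = toy.City.fillna('Unknown') + ', ' + toy.State
```

Whether a placeholder is appropriate depends on how the column will be used later; leaving the value missing is often the safer default.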

<a id='split-apply-combine'></a>
## Split-Apply-Combine

---

![](../assets/split_apply_combine.png)
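
The three steps in the diagram can be written out by hand and compared with `.groupby()`. This is a minimal sketch on a made-up frame (the `toy` data and its values are assumptions, not the `drinks` file used below):

```python
import pandas as pd

# Hypothetical toy data standing in for the `drinks` file.
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AF', 'AF', 'AS'],
    'beer': [100, 200, 10, 30, 50],
})

# Split: one sub-frame per continent.
pieces = {name: group for name, group in toy.groupby('continent')}

# Apply: compute the mean of `beer` within each piece.
means = {name: group['beer'].mean() for name, group in pieces.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name='beer').sort_index()

# `.groupby()` followed by an aggregation performs all three steps at once.
grouped = toy.groupby('continent')['beer'].mean()
```

Both `manual` and `grouped` hold one mean per continent; the one-line version is what the rest of this section relies on.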

<a id='groupby'></a>
### `.groupby()`

**Q.1** Using the `drinks` DataFrame, calculate the mean `beer` servings by continent.

In [15]:
drinks = pd.read_csv('../../../../../resource-datasets/alcohol_by_country/drinks.csv')

In [16]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [17]:
# Rename the columns
drinks.rename({'beer_servings':'beer','spirit_servings':'spirit',
               'wine_servings':'wine','total_litres_of_pure_alcohol':'total'},
             axis=1, inplace=True)

In [18]:
drinks.head()

Unnamed: 0,country,beer,spirit,wine,total,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [19]:
# For each continent, calculate the mean `beer` servings.
drinks.groupby('continent').beer.mean()

continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer, dtype: float64

**Q.2** Describe the `beer` column by continent.

In [20]:
# For each continent, describe `beer` servings.
drinks.groupby('continent').beer.describe()

continent,count,mean,std,min,25%,50%,75%,max
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0


<a id='apply-combine'></a>
### Apply Functions to Groups and Combine

**Q.1** Find the `count`, `mean`, `minimum`, and `maximum` of the `beer` column by continent.

In [21]:
# Similar to `.describe()`, but you choose which aggregations to compute.
drinks.groupby('continent').beer.agg(['count', 'mean', 'min', 'max'])

continent,count,mean,min,max
AF,53,61.471698,0,376
AS,44,37.045455,0,247
EU,45,193.777778,0,361
OC,16,89.6875,0,306
SA,12,175.083333,93,333


**Q.2** Perform the same task as in Q.1, but now sort the output by the `mean` column.

In [22]:
drinks.groupby('continent').beer.agg(['count', 'mean', 'min', 'max']).sort_values('mean')

continent,count,mean,min,max
AS,44,37.045455,0,247
AF,53,61.471698,0,376
OC,16,89.6875,0,306
SA,12,175.083333,93,333
EU,45,193.777778,0,361
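
If the default column names (`count`, `mean`, and so on) are not descriptive enough, named aggregation lets you choose them. A minimal sketch, assuming a made-up `toy` frame and illustrative output names (`n`, `mean_beer`, `max_beer`):

```python
import pandas as pd

# Hypothetical toy data standing in for the `drinks` file.
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AF', 'AF', 'AS'],
    'beer': [100, 200, 10, 30, 50],
})

# Named aggregation: each keyword is an output column, given as (input column, function).
summary = (
    toy.groupby('continent')
       .agg(n=('beer', 'count'), mean_beer=('beer', 'mean'), max_beer=('beer', 'max'))
       .sort_values('mean_beer', ascending=False)   # largest mean first
)
```

Sorting on a name you picked yourself (`mean_beer`) tends to read more clearly than sorting on a generic `mean` column.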


**Q.3** Apply a custom function to all columns of the `drinks` DataFrame, grouping by continent.

In [23]:
# Find the first value of each column by continent:
drinks.groupby('continent').apply(lambda x: x.iloc[0,:])

continent,country,beer,spirit,wine,total,continent
AF,Algeria,25,0,14,0.7,AF
AS,Afghanistan,0,0,0,0.0,AS
EU,Albania,89,132,54,4.9,EU
OC,Australia,261,72,212,10.4,OC
SA,Argentina,193,25,221,8.3,SA


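**Q.4** **Note:** If you don't specify a column, the aggregation function is applied to every column. In pandas 2.0 and later, non-numeric columns such as `country` are no longer skipped automatically, so pass `numeric_only=True` to `.mean()`.

In [24]:
drinks.groupby('continent').mean(numeric_only=True)
drinks.groupby('continent').describe()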

,beer,beer,beer,beer,beer,beer,beer,beer,spirit,spirit,...,total,total,wine,wine,wine,wine,wine,wine,wine,wine
continent,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0,53.0,16.339623,...,4.7,9.1,53.0,16.264151,38.846419,0.0,1.0,2.0,13.0,233.0
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0,44.0,60.840909,...,2.425,11.5,44.0,9.068182,21.667034,0.0,0.0,1.0,8.0,123.0
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0,45.0,132.555556,...,10.9,14.4,45.0,142.222222,97.421738,0.0,59.0,128.0,195.0,370.0
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0,16.0,58.4375,...,6.15,10.4,16.0,35.625,64.55579,0.0,1.0,8.5,23.25,212.0
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0,12.0,114.75,...,7.375,8.3,12.0,62.416667,88.620189,1.0,3.0,12.0,98.5,221.0


<a id='exercise-4'></a>

## Exercise #4

---

**Using the `users` DataFrame**:
1. Count the number of occurrences of each occupation in `users`.
2. Calculate the mean age by occupation.
3. Calculate the minimum and maximum age by occupation.
4. Calculate the mean age by cross-sections of `occupation` and `gender`.

> **Tip**: Multiple columns can be passed to the `.groupby()` function for more granular cross-sections.

In [25]:
users = pd.read_table('../../../../../resource-datasets/users/users.txt', sep='|')

In [26]:
# For each occupation in `users`, count the number of occurrences.
users.occupation.value_counts()

student          196
other            105
educator          95
administrator     79
engineer          67
programmer        66
librarian         51
writer            45
executive         32
scientist         31
artist            28
technician        27
marketing         26
entertainment     18
healthcare        16
retired           14
lawyer            12
salesman          12
none               9
doctor             7
homemaker          7
Name: occupation, dtype: int64

In [27]:
# For each occupation, calculate the mean age.
users.groupby('occupation').age.mean()

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

In [28]:
# For each occupation, calculate the minimum and maximum ages.
users.groupby('occupation').age.agg(['min', 'max'])

occupation,min,max
administrator,21,70
artist,19,48
doctor,28,64
educator,23,63
engineer,22,70
entertainment,15,50
executive,22,69
healthcare,22,62
homemaker,20,50
lawyer,21,53


In [29]:
# For each combination of `occupation` and `gender`, calculate the mean age.
users.groupby(['occupation', 'gender']).age.mean()

occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.16666
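
Grouping by two columns returns a Series with a two-level index, which can be hard to scan. One way to reshape it is `.unstack()`, sketched here on a made-up frame (`toy_users` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the `users` file.
toy_users = pd.DataFrame({
    'occupation': ['artist', 'artist', 'doctor', 'doctor', 'doctor'],
    'gender': ['F', 'M', 'M', 'M', 'F'],
    'age': [30, 40, 50, 60, 45],
})

# Mean age for each (occupation, gender) pair; the result has a MultiIndex.
by_both = toy_users.groupby(['occupation', 'gender'])['age'].mean()

# Move the `gender` level into the columns: one row per occupation.
wide = by_both.unstack('gender')
```

Combinations that do not occur in the data show up as missing values in the wide layout.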

<a id='indexing'></a>
## Indexing

---
<a id='loc'></a>
### Location Indexing With `.loc[]`

**Q.1** Select all rows and the `City` column from the UFO data set using `.loc[]`.

In [30]:
d = ufo.loc[:, 'City'] # Colon means "all rows"; then select one column.
d.head(10)

0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
5             Valley City
6             Crater Lake
7                    Alma
8                 Eklutna
9                 Hubbard
Name: City, dtype: object

**Q.2** Select all rows and the `City` and `State` columns.

In [31]:
d = ufo.loc[:, ['City', 'State']]   # Select two columns
d.head(10)

Unnamed: 0,City,State
0,Ithaca,NY
1,Willingboro,NJ
2,Holyoke,CO
3,Abilene,KS
4,New York Worlds Fair,NY
5,Valley City,ND
6,Crater Lake,CA
7,Alma,MI
8,Eklutna,AK
9,Hubbard,OR


**Q.3** Select all rows and columns from `city` *through* `state`.

In [32]:
d = ufo.loc[:, 'City':'State'] # Select a range of columns.
d.columns

Index(['City', 'Colors_Reported', 'Shape_Reported', 'State'], dtype='object')

**Q.4** Select:
- All columns at row 0.
- All columns at rows 0:2.
- Columns `city` through `state` at rows 0:2.

In [33]:
# `.loc[]` can also filter rows by "name" (the index label); label slices include both endpoints.
d = ufo.loc[0, :]                   # Row 0, all columns
d = ufo.loc[0:2, :]                 # Rows 0/1/2, all columns
d = ufo.loc[0:2, 'City':'State']    # Rows 0/1/2, range of columns

<a id='iloc'></a>
### Position Indexing With `.iloc[]`

**Q.1** Select all rows and columns in position 0 and 3.

In [34]:
d = ufo.iloc[:, [0, 3]] # All rows, columns in position 0/3
d.head(10)

Unnamed: 0,City,State
0,Ithaca,NY
1,Willingboro,NJ
2,Holyoke,CO
3,Abilene,KS
4,New York Worlds Fair,NY
5,Valley City,ND
6,Crater Lake,CA
7,Alma,MI
8,Eklutna,AK
9,Hubbard,OR


**Q.2** Select all rows and columns in positions 0 through 3 (the slice `0:4` excludes position 4).

In [35]:
d = ufo.iloc[:, 0:4] # All rows, columns in position 0/1/2/3
d.head()

Unnamed: 0,City,Colors_Reported,Shape_Reported,State
0,Ithaca,,TRIANGLE,NY
1,Willingboro,,OTHER,NJ
2,Holyoke,,OVAL,CO
3,Abilene,,DISK,KS
4,New York Worlds Fair,,LIGHT,NY


**Q.3** Select rows in positions 0:3, along with all columns.

In [36]:
d = ufo.iloc[0:3, :] # rows in position 0/1/2, all columns
d.head()

Unnamed: 0,City,Colors_Reported,Shape_Reported,State,Time,Location
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,"Ithaca, NY"
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,"Willingboro, NJ"
2,Holyoke,,OVAL,CO,2/15/1931 14:00,"Holyoke, CO"
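
The difference between label slices and position slices is easiest to see when the index is not `0, 1, 2, ...`. A minimal sketch, assuming a made-up frame with a deliberately unusual index:

```python
import pandas as pd

# Hypothetical toy data; the index labels are chosen so they differ from positions.
toy = pd.DataFrame(
    {'City': ['Ithaca', 'Holyoke', 'Abilene', 'Alma'],
     'State': ['NY', 'CO', 'KS', 'MI']},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[10:30, :]     # label slice: both endpoints included, 3 rows
by_position = toy.iloc[0:2, :]   # position slice: stop excluded, 2 rows
```

On the UFO data the default index happens to match the row positions, which is why `ufo.loc[0:2, :]` returns three rows while `ufo.iloc[0:3, :]` is needed to get the same result by position.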


<a id='frequent'></a>
## Frequently Used Features

---
<a id='map-dict'></a>
### Using Map Functions With Replacement Dictionaries

In [37]:
# Map existing values to a different set of values.
users['is_male'] = users.gender.map({'F':0, 'M':1})
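
Values that are absent from the dictionary are mapped to missing values rather than raising an error, so it is worth checking the result. A short sketch on a made-up Series (the `'X'` entry is an assumption added to trigger the behavior):

```python
import pandas as pd

# Hypothetical data with one value ('X') that the dictionary does not cover.
gender = pd.Series(['F', 'M', 'M', 'X'])

is_male = gender.map({'F': 0, 'M': 1})

# Count how many values fell outside the dictionary.
n_unmapped = int(is_male.isnull().sum())
```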

<a id='factorize'></a>
### Encoding Strings as Integers With `.factorize()`

In [38]:
# Encode strings as integer values. (This function automatically starts at 0).
users['occupation_num'] = users.occupation.factorize()[0]

users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_male,occupation_num
0,1,24,M,technician,85711,1,0
1,2,53,F,other,94043,0,1
2,3,23,M,writer,32067,1,2
3,4,24,M,technician,43537,1,0
4,5,33,F,other,15213,0,1
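
The `[0]` above keeps only the integer codes. `factorize` also returns the unique values, which lets you translate codes back to the original strings. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical data standing in for `users.occupation`.
occupation = pd.Series(['technician', 'other', 'writer', 'technician'])

# Codes are assigned in order of first appearance, starting at 0.
codes, uniques = pd.factorize(occupation)

# Indexing the unique values by the codes recovers the original strings.
decoded = uniques[codes]
```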


<a id='unique'></a>
### Determining Unique Values

In [39]:
# Determine unique values in a column.
users.occupation.nunique()      # Count the number of unique values.

21

In [40]:
users.occupation.unique()       # Return the unique values.

array(['technician', 'other', 'writer', 'executive', 'administrator',
       'student', 'lawyer', 'educator', 'scientist', 'entertainment',
       'programmer', 'librarian', 'homemaker', 'artist', 'engineer',
       'marketing', 'none', 'healthcare', 'retired', 'salesman', 'doctor'],
      dtype=object)

<a id='replace'></a>
### Replacing Values With `.replace()`

In [41]:
# Replace all instances of a value in a column (must match the entire value).
ufo['State'] = ufo.State.replace('Fl', 'FL')
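
To make the "entire value" point concrete, the sketch below contrasts `Series.replace` with the substring-based `Series.str.replace` on a made-up Series (the `'Florida'` entry is an assumption added for contrast):

```python
import pandas as pd

# Hypothetical data: one exact match, one already-correct value, one longer string.
state = pd.Series(['Fl', 'FL', 'Florida'])

# `.replace()` only changes cells whose whole value equals 'Fl'.
whole = state.replace('Fl', 'FL')

# `.str.replace()` changes the substring wherever it occurs.
partial = state.str.replace('Fl', 'FL', regex=False)
```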

<a id='series-str'></a>
### Series String Methods With `.str`

In [42]:
# String methods are accessed via `.str`.
ufo.State.str.upper()                               # Converts to uppercase
ufo.Colors_Reported.str.contains('RED', na=False).head(2) # Checks for a substring; missing values count as False

0    False
1    False
Name: Colors_Reported, dtype: bool

<a id='datetime'></a>
### Datetime Conversion and Arithmetic

In [43]:
# Convert a string to the datetime format.
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.Time.dt.hour                        # Datetime format exposes convenient attributes.
(ufo.Time.max() - ufo.Time.min()).days  # It also allows you to do datetime "math."
ufo[ufo.Time > pd.Timestamp('2014-01-01')].head(2) # Boolean filtering with the datetime format

Unnamed: 0,City,Colors_Reported,Shape_Reported,State,Time,Location
75177,Clarksville,ORANGE,SPHERE,TN,2014-01-01 00:01:00,"Clarksville, TN"
75178,Henderson,,SPHERE,NV,2014-01-01 00:01:00,"Henderson, NV"
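
The same steps can be tried on a few hand-written timestamps. This is a sketch under assumptions: the three strings are made up, and the explicit `format` string is a guess at the layout of the `Time` column that is not required by `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical timestamps written in the same style as the UFO `Time` column.
raw = pd.Series(['6/1/1930 22:00', '7/4/2014 22:00', '1/1/2014 0:01'])

# An explicit format avoids ambiguity between month-first and day-first dates.
when = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')

hours = when.dt.hour                                # hour of each timestamp
span_days = (when.max() - when.min()).days          # whole days between first and last
recent = when[when > pd.Timestamp('2014-01-01')]    # Boolean filtering
```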


<a id='set-reset-index'></a>
### Setting and Resetting the Index

In [44]:
# Setting and then removing an index
ufo.set_index('Time', inplace=True)
ufo.reset_index(inplace=True)

<a id='sort-by-index'></a>
### Sorting by Index

In [45]:
# Sort a Series by its index.
ufo.State.value_counts().sort_index()[0:3]

AK    403
AL    808
AR    748
Name: State, dtype: int64

<a id='change-dtype'></a>
### Changing the Data Type of a Column

In [46]:
# Change the data type of a column.
drinks['beer'] = drinks.beer.astype('float')

# Change the data type of a column when reading in a file.
d = pd.read_csv('../../../../../resource-datasets/alcohol_by_country/drinks.csv', dtype={'beer_servings':float})

<a id='dummy'></a>
### Creating Dummy-Coded Columns

In [47]:
# Create dummy variables for `continent` and exclude the first dummy column.
continent_dummies = pd.get_dummies(drinks.continent, prefix='cont').iloc[:, 1:]
continent_dummies.head(3)

Unnamed: 0,cont_AS,cont_EU,cont_OC,cont_SA
0,1,0,0,0
1,0,1,0,0
2,0,0,0,0
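
Dropping the first dummy column can also be requested directly with the `drop_first` parameter. A minimal sketch on a made-up Series; note that, depending on the pandas version, the dummy values may display as `True`/`False` rather than `1`/`0`:

```python
import pandas as pd

# Hypothetical data standing in for `drinks.continent`.
continent = pd.Series(['AS', 'EU', 'AF', 'EU'])

# Manual approach: build every dummy column, then slice off the first one.
manual = pd.get_dummies(continent, prefix='cont').iloc[:, 1:]

# Built-in approach: let `get_dummies` drop the first category.
built_in = pd.get_dummies(continent, prefix='cont', drop_first=True)
```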


<a id='concatenate'></a>
### Concatenating DataFrames

In [48]:
# Concatenate two DataFrames (axis=0 for rows, axis=1 for columns).
drinks = pd.concat([drinks, continent_dummies], axis=1)

In [49]:
drinks.head(2)

Unnamed: 0,country,beer,spirit,wine,total,continent,cont_AS,cont_EU,cont_OC,cont_SA
0,Afghanistan,0.0,0,0,0.0,AS,1,0,0,0
1,Albania,89.0,132,54,4.9,EU,0,1,0,0


<a id='duplicate-rows'></a>
### Detecting and Dropping Duplicate Rows

In [50]:
# Detecting duplicate rows:
d = users.duplicated()          # True if a row is identical to a previous row.
d = users.duplicated().sum()    # Count of duplicates.
d = users[users.duplicated()]   # Only shows duplicates.
d = users.drop_duplicates()     # Drops duplicate rows.
d = users.age.duplicated()      # Checks a single column for duplicates.
d = users.duplicated(['age', 'gender', 'zip_code']).sum()   # Specifies columns for finding duplicates.
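
By default only the later copies of a row are flagged. The `keep` parameter changes which copies count as duplicates, as this sketch on a made-up frame shows:

```python
import pandas as pd

# Hypothetical data: rows 0, 1, and 3 are identical.
toy = pd.DataFrame({'age': [24, 24, 33, 24], 'gender': ['M', 'M', 'F', 'M']})

later_copies = toy.duplicated()             # default keep='first': first copy is not flagged
all_copies = toy.duplicated(keep=False)     # flag every member of a duplicate set
deduped = toy.drop_duplicates(keep='last')  # keep the last copy instead of the first
```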

<a id='write-csv'></a>
### Writing a DataFrame to a `.csv`
```python
# Write a DataFrame out to a `.csv`.
drinks.to_csv('drinks_updated.csv')  # Index is used as the first column
drinks.to_csv('drinks_updated.csv', index=False) # Ignore index
```
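
The effect of `index=False` is easiest to see by reading the file back. The sketch below writes to in-memory buffers instead of disk (the `toy` frame and the use of `io.StringIO` are assumptions for illustration):

```python
import io

import pandas as pd

# Hypothetical toy data standing in for `drinks`.
toy = pd.DataFrame({'country': ['Albania', 'Algeria'], 'beer': [89, 25]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reread_with = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_without = pd.read_csv(without_index)
```

Reading back the first version produces an extra `Unnamed: 0` column, which is a common source of stray columns in saved data sets.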

<a id='pickle'></a>
### Pickling a DataFrame
```python
# Save a DataFrame to disk (a.k.a., "pickle") and read it from disk (a.k.a., "unpickle").
drinks.to_pickle('drinks_pickle')
pd.read_pickle('drinks_pickle')
```

<a id='sample'></a>
### Randomly Sampling a DataFrame

In [51]:
# Randomly sample a DataFrame.
train = drinks.sample(frac=0.75, random_state=1)    # Will contain 75% of the rows
test = drinks[~drinks.index.isin(train.index)]      # Will contain the other 25%
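
A quick sanity check confirms that the two pieces are disjoint and together cover every row. This sketch assumes a made-up 20-row frame with a unique index; with repeated index labels the `isin` filter would not separate the rows cleanly:

```python
import pandas as pd

# Hypothetical toy data with a default (unique) index.
toy = pd.DataFrame({'x': range(20)})

train = toy.sample(frac=0.75, random_state=1)   # 75% of the rows
test = toy[~toy.index.isin(train.index)]        # the remaining 25%

# No row should appear in both pieces.
overlap = set(train.index) & set(test.index)
```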

<a id='infrequent'></a>
## Infrequently Used Features

---

<a id='toy-dataframes'></a>
### Creating DataFrames From Dictionaries and Lists of Lists

In [52]:
# Create a DataFrame from a dictionary.
d = pd.DataFrame({'capital':['Montgomery', 'Juneau', 'Phoenix'], 'state':['AL', 'AK', 'AZ']})
d.head(2)

Unnamed: 0,capital,state
0,Montgomery,AL
1,Juneau,AK


In [53]:
# Create a DataFrame from a list of lists.
d = pd.DataFrame([['Montgomery', 'AL'], ['Juneau', 'AK'], ['Phoenix', 'AZ']], columns=['capital', 'state'])
d.head(2)

Unnamed: 0,capital,state
0,Montgomery,AL
1,Juneau,AK


<a id='crosstab'></a>
### Performing Cross-Tabulations

In [54]:
# Display a cross-tabulation of two Series.
pd.crosstab(users.occupation, users.gender)

gender,F,M
occupation,,
administrator,36,43
artist,13,15
doctor,0,7
educator,26,69
engineer,2,65
entertainment,2,16
executive,3,29
healthcare,11,5
homemaker,6,1
lawyer,2,10


<a id='query'></a>
### Query-Filtering Syntax

In [55]:
# Alternative syntax for Boolean filtering (noted as "experimental" in the documentation):
d = users.query('age < 20')                 # users[users.age < 20]
d = users.query("age < 20 and gender=='M'") # users[(users.age < 20) & (users.gender=='M')]
d = users.query('age < 20 or age > 60')     # users[(users.age < 20) | (users.age > 60)]
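
Query strings can also refer to Python variables by prefixing them with `@`. A minimal sketch on a made-up frame (`toy_users` and `cutoff` are illustrative names, not part of the lesson data):

```python
import pandas as pd

# Hypothetical toy data standing in for `users`.
toy_users = pd.DataFrame({'age': [15, 25, 65, 18], 'gender': ['M', 'F', 'M', 'F']})

cutoff = 20

# `@cutoff` refers to the Python variable defined above.
young = toy_users.query('age < @cutoff')
young_men = toy_users.query("age < @cutoff and gender == 'M'")

# Equivalent Boolean filtering for comparison.
same = toy_users[(toy_users.age < cutoff) & (toy_users.gender == 'M')]
```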

<a id='memory-usage'></a>
### Calculating Memory Usage

In [56]:
# Display the memory usage of a DataFrame.
ufo.info()          # Total usage (prints a summary and returns `None`)
ufo.memory_usage()  # Usage by column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80543 entries, 0 to 80542
Data columns (total 6 columns):
Time               80543 non-null datetime64[ns]
City               80496 non-null object
Colors_Reported    17034 non-null object
Shape_Reported     72141 non-null object
State              80543 non-null object
Location           80496 non-null object
dtypes: datetime64[ns](1), object(5)
memory usage: 3.7+ MB


Index                  80
Time               644344
City               644344
Colors_Reported    644344
Shape_Reported     644344
State              644344
Location           644344
dtype: int64
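
Every column above reports the same size because, for `object` columns, the default count covers only the references to the Python strings and not the strings themselves. Passing `deep=True` includes them. A sketch on a made-up column, forced to `object` dtype for the sake of the comparison:

```python
import pandas as pd

# Hypothetical string column, explicitly stored with `object` dtype.
toy = pd.DataFrame({'state': pd.Series(['CA', 'NY', 'CA', 'TX'] * 1000, dtype=object)})

shallow = toy.memory_usage()['state']          # counts only the references
deep = toy.memory_usage(deep=True)['state']    # also counts the string objects
```

The `+` in the `memory usage: 3.7+ MB` line printed by `.info()` is a hint that the reported figure is a lower bound of this kind.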

<a id='category-type'></a>
### Converting Column to Category Type

In [57]:
# Change a Series to the `category` data type. (For columns with few distinct values, this reduces memory usage and can speed up operations such as grouping.)
ufo['State'] = ufo.State.astype('category')
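
A categorical column stores each distinct value once and represents the rows as small integer codes. The sketch below compares memory before and after conversion on a made-up column with three distinct values (the data and the `object` dtype are assumptions for illustration):

```python
import pandas as pd

# Hypothetical column with many rows but only three distinct values.
state = pd.Series(['CA', 'NY', 'CA', 'TX'] * 1000, dtype=object)

as_cat = state.astype('category')
codes = as_cat.cat.codes    # integer code for each row

object_bytes = state.memory_usage(deep=True)
category_bytes = as_cat.memory_usage(deep=True)
```

The saving shrinks as the number of distinct values grows, so a column of mostly unique strings gains little from the conversion.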

<a id='assign'></a>
### Creating Columns With `.assign()`

In [58]:
# Temporarily define a new column as a function of the existing columns.
drinks.assign(servings = drinks.beer + drinks.spirit + drinks.wine).head(2)

Unnamed: 0,country,beer,spirit,wine,total,continent,cont_AS,cont_EU,cont_OC,cont_SA,servings
0,Afghanistan,0.0,0,0,0.0,AS,1,0,0,0,0.0
1,Albania,89.0,132,54,4.9,EU,0,1,0,0,275.0
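
`.assign()` also accepts a function, which is handy in method chains where the intermediate DataFrame has no name. A minimal sketch on a made-up frame; the original DataFrame is left unchanged:

```python
import pandas as pd

# Hypothetical toy data standing in for `drinks`.
toy = pd.DataFrame({'beer': [0.0, 89.0], 'spirit': [0, 132], 'wine': [0, 54]})

# The lambda receives the DataFrame that `.assign()` is called on.
with_servings = toy.assign(servings=lambda df: df.beer + df.spirit + df.wine)
```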


<a id='limit-rows-read'></a>
### Limiting the Number of Rows to Load in a File Read

In [59]:
# Limit which rows are included when reading in a file.
d = pd.read_csv('../../../../../resource-datasets/alcohol_by_country/drinks.csv', nrows=10)           # Only read the first 10 rows.
d = pd.read_csv('../../../../../resource-datasets/alcohol_by_country/drinks.csv', skiprows=[1, 2])    # Skip the first two rows of data.
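
The row numbers passed to `skiprows` count lines in the file, with the header as line 0. This sketch reads a small in-memory CSV (the `csv_text` contents are made up) to show which rows survive:

```python
import io

import pandas as pd

# Hypothetical file contents: a header line followed by four data rows.
csv_text = "country,beer\nA,1\nB,2\nC,3\nD,4\n"

first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)         # rows A and B
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])   # header kept; A and B skipped
```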

<a id='manual-print'></a>
### Manually Setting the Number of Rows and Columns to Print

In [60]:
# Change the maximum number of rows and columns printed. (`None` means unlimited).
pd.set_option('display.max_rows', 2)     # Default is 60 rows
pd.set_option('display.max_columns', 2)  # Default is 20 columns
print(drinks)

         country   ...     cont_SA
0    Afghanistan   ...           0
..           ...   ...         ...
192     Zimbabwe   ...           0

[193 rows x 10 columns]


In [61]:
# Reset the options to defaults.
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

In [62]:
# Change the options temporarily. (Settings are restored when you exit the `with` block).
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(drinks[:10])

             country   beer  spirit  wine  total continent  cont_AS  cont_EU  \
0        Afghanistan    0.0       0     0    0.0        AS        1        0   
1            Albania   89.0     132    54    4.9        EU        0        1   
2            Algeria   25.0       0    14    0.7        AF        0        0   
3            Andorra  245.0     138   312   12.4        EU        0        1   
4             Angola  217.0      57    45    5.9        AF        0        0   
5  Antigua & Barbuda  102.0     128    45    4.9       NaN        0        0   
6          Argentina  193.0      25   221    8.3        SA        0        0   
7            Armenia   21.0     179    11    3.8        EU        0        1   
8          Australia  261.0      72   212   10.4        OC        0        0   
9            Austria  279.0      75   191    9.7        EU        0        1   

   cont_OC  cont_SA  
0        0        0  
1        0        0  
2        0        0  
3        0        0  
4        