#### Table 5-1. Possible data inputs to DataFrame constructor
| Type | Notes | 
|---|--------------------------|
| 2D ndarray | A matrix of data, passing optional row and column labels |
| dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame. All sequences must be the same length. |
| NumPy structured/record array | Treated as the “dict of arrays” case |
| dict of Series | Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed. |
| dict of dicts | Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case. |
| list of dicts or Series | Each item becomes a row in the DataFrame. The union of dict keys or Series indexes becomes the DataFrame’s column labels. |
| List of lists or tuples | Treated as the “2D ndarray” case |
| Another DataFrame | The DataFrame’s indexes are used unless different ones are passed |
| NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result |
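
To make a few rows of Table 5-1 concrete, here is a minimal sketch that builds small DataFrames from different kinds of input. The variable names and values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# 2D ndarray, with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# dict of equal-length lists: each sequence becomes a column
from_lists = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# dict of Series: the Series indexes are unioned to form the row index
from_series = pd.DataFrame({'s1': pd.Series([1, 2], index=['a', 'b']),
                            's2': pd.Series([3, 4], index=['b', 'c'])})

# list of dicts: each item becomes a row, keys are unioned into columns
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

print(from_series)   # labels missing from one Series show up as NaN
print(from_records)
```

Note how the "dict of Series" and "list of dicts" cases introduce missing values wherever a label or key is absent from one of the inputs.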

#### Table 5-2. Main Index objects in pandas
|Class|Description|
|--|-------------------------------|
| Index | The most general Index object, representing axis labels in a NumPy array of Python objects. |
| Int64Index | Specialized Index for integer values. |
| MultiIndex | “Hierarchical” index object representing multiple levels of indexing on a single axis. Can be thought of as similar to an array of tuples. |
| DatetimeIndex | Stores nanosecond timestamps (represented using NumPy’s datetime64 dtype). |
| PeriodIndex | Specialized Index for Period data (timespans). |
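
The sketch below creates one small example of several of the Index types in Table 5-2 and prints the class and dtype of each. One caveat, stated as an assumption about your environment: recent pandas releases (2.0 and later) no longer have a separate Int64Index class, and integer labels are held by a plain Index with an integer dtype, so the output depends on the installed version.

```python
import pandas as pd

plain = pd.Index(['a', 'b', 'c'])     # general-purpose labels
ints = pd.Index([1, 2, 3])            # integer labels
multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
dates = pd.date_range('2000-01-01', periods=3, freq='D')    # DatetimeIndex
periods = pd.period_range('2000-01', periods=3, freq='M')   # PeriodIndex

for idx in (plain, ints, multi, dates, periods):
    print(type(idx).__name__, idx.dtype)
```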

#### Table 5-3. Index methods and properties
|Method|Description|
|--|-------------------------------|
| append | Concatenate with additional Index objects, producing a new Index | 
| diff | Compute set difference as an Index |
| intersection | Compute set intersection |
| union | Compute set union |
| isin | Compute boolean array indicating whether each value is contained in the passed collection |
| delete | Compute new Index with element at index i deleted |
| drop | Compute new index by deleting passed values |
| insert | Compute new Index by inserting element at index i |
| is_monotonic | Returns True if each element is greater than or equal to the previous element |
| is_unique | Returns True if the Index has no duplicate values |
| unique | Compute the array of unique values in the Index |
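
Here is a short sketch of the set-like Index methods in Table 5-3, using two made-up Index objects. Note that current pandas releases spell the set difference method `difference` rather than `diff`, and the monotonicity check `is_monotonic_increasing` rather than `is_monotonic`, so the sketch uses those names.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))            # labels found in either Index
print(left.intersection(right))     # labels found in both
print(left.difference(right))       # labels in left but not in right
print(left.isin(['a', 'c']))        # boolean array, one entry per label
print(left.insert(1, 'z'))          # new Index with 'z' at position 1
print(left.delete(0))               # new Index without position 0
print(left.drop(['b']))             # new Index without the label 'b'

combined = left.append(right)       # concatenation keeps duplicates
print(combined.is_unique)           # False, since 'b' and 'c' repeat
print(combined.unique())            # duplicates removed
print(left.is_monotonic_increasing)
```

Every one of these calls returns a new object; the original `left` and `right` are left unchanged.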

#### Table 5-4. reindex method (interpolation) options
|Argument |Description|
|--|-------------------------------|
| ffill or pad | Fill (or carry) values forward |
| bfill or backfill | Fill (or carry) values backward |

#### Table 5-5. reindex function arguments
|Argument |Description|
|--|-------------------------------|
| index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying |
| method | Interpolation (fill) method, see Table 5-4 for options. |
| fill_value | Substitute value to use when introducing missing data by reindexing |
| limit | When forward- or backfilling, maximum size gap to fill |
| level | Match simple Index on level of MultiIndex, otherwise select subset of |
| copy | If True, always copy underlying data even if new index is equivalent to old index.  Otherwise, do not copy the data when the indexes are equivalent. |
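
The following sketch shows how the `method`, `limit`, and `fill_value` arguments from Tables 5-4 and 5-5 change the result of `reindex`. The Series and their values are invented for this example.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' carries the last known value forward into new labels
filled = colors.reindex(range(6), method='ffill')
print(filled)

# limit=1 fills at most one consecutive new label after each known value
sparse = pd.Series([1.0, 2.0], index=[0, 3])
limited = sparse.reindex(range(6), method='ffill', limit=1)
print(limited)

# fill_value substitutes a constant wherever reindexing introduces a gap
values = pd.Series([4.5, 7.2], index=['d', 'b'])
zeros = values.reindex(['a', 'b', 'c', 'd'], fill_value=0)
print(zeros)
```

Without `method` or `fill_value`, the new labels would simply hold NaN.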

#### Table 5-6. Indexing options with DataFrame
|Argument |Description|
|------------------|----------------------------|
| obj[val] | Select single column or sequence of columns from the DataFrame. Special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion). |
| obj.ix[val] | Selects single row or subset of rows from the DataFrame. |
| obj.ix[:, val] | Selects single column or subset of columns. |
| obj.ix[val1, val2] | Select both rows and columns. |
| reindex method | Conform one or more axes to new indexes. |
| xs method | Select single row or column as a Series by label. |
| icol, irow methods | Select single column or row, respectively, as a Series by integer location. |
| get_value, set_value methods | Select single value by row and column label. |
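
Table 5-6 reflects the pandas version the book was written against. Current pandas releases no longer provide `.ix`, `icol`, `irow`, `get_value`, or `set_value`, so the sketch below uses what we take to be the closest present-day equivalents: `.loc` (labels), `.iloc` (integer positions), and `.at` (a single value). The frame itself is a made-up example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=['Ohio', 'Utah', 'Texas'],
                     columns=['one', 'two', 'three', 'four'])

col = frame['two']                     # single column as a Series
cols = frame[['one', 'three']]         # sequence of columns
rows = frame[frame['three'] > 5]       # boolean array filters rows
row = frame.loc['Utah']                # single row by label
block = frame.loc[['Ohio', 'Texas'], ['one', 'two']]  # rows and columns
first_col = frame.iloc[:, 0]           # column by integer location
value = frame.at['Texas', 'four']      # single value by row/column label
same_row = frame.xs('Utah')            # xs also selects a row by label
reordered = frame.reindex(['Texas', 'Ohio'])  # conform rows to new labels

print(rows)
print(block)
```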

#### Table 5-7. Flexible arithmetic methods
|Method |Description|
|------------------|----------------------------|
| add | Method for addition (+) |
| sub | Method for subtraction (-) |
| div | Method for division (/) |
| mul | Method for multiplication (*) |
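
The main reason to use the methods in Table 5-7 instead of the plain operators is the `fill_value` argument, which supplies a stand-in for labels that exist in only one operand. Here is a minimal sketch with two made-up Series whose indexes only partly overlap.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

plain = s1 + s2                      # non-overlapping labels become NaN
added = s1.add(s2, fill_value=0)     # missing side treated as 0
diffed = s1.sub(s2, fill_value=0)
scaled = s1.mul(s2, fill_value=1)    # missing side treated as 1
ratio = s1.div(s2, fill_value=1)

print(plain)
print(added)
```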

#### Table 5-8. Tie-breaking methods with rank
|Method |Description|
|------------------|----------------------------|
| 'average' | Default: assign the average rank to each entry in the equal group. |
| 'min' | Use the minimum rank for the whole group. |
| 'max' | Use the maximum rank for the whole group. |
| 'first' | Assign ranks in the order the values appear in the data. |
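
To see how the tie-breaking methods in Table 5-8 differ, the sketch below ranks one small Series (with two pairs of tied values) under each method.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

avg = obj.rank()                    # default 'average': ties share the mean rank
low = obj.rank(method='min')        # ties all get the lowest rank in the group
high = obj.rank(method='max')       # ties all get the highest rank in the group
first = obj.rank(method='first')    # ties ranked in order of appearance

print(pd.DataFrame({'value': obj, 'average': avg, 'min': low,
                    'max': high, 'first': first}))
```

The two 7s occupy ranks 6 and 7, so they get 6.5 under 'average', 6 under 'min', 7 under 'max', and 6 then 7 under 'first'.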

#### Table 5-9. Options for reduction methods
|Argument |Description|
|------------------|----------------------------|
| axis | Axis to reduce over. 0 for DataFrame’s rows and 1 for columns. |
| skipna | Exclude missing values, True by default. |
| level | Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex). |
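
The sketch below shows the effect of `axis` and `skipna` from Table 5-9 on a small frame with missing values. For `level`, recent pandas releases have dropped that argument from the reduction methods, so the last lines use `groupby(level=...)` instead, which we take to be the present-day equivalent. All data here is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                          # axis=0: one result per column
row_sums = df.sum(axis=1)                    # axis=1: one result per row
strict_means = df.mean(axis=1, skipna=False) # any NaN in a row gives NaN

print(col_sums)
print(row_sums)
print(strict_means)

# Reducing within the outer level of a hierarchical index
hier = pd.Series([1, 2, 3, 4],
                 index=pd.MultiIndex.from_product([['x', 'y'], [1, 2]]))
by_outer = hier.groupby(level=0).sum()
print(by_outer)
```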

#### Table 5-10. Descriptive and summary statistics
|Method |Description|
|------------------|----------------------------|
| count | Number of non-NA values |
| describe | Compute set of summary statistics for Series or each DataFrame column |
| min, max | Compute minimum and maximum values |
| argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively  |
| idxmin, idxmax | Compute index values at which minimum or maximum value obtained, respectively |
| quantile | Compute sample quantile ranging from 0 to 1 |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median (50% quantile) of values |
| mad | Mean absolute deviation from mean value |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness (3rd moment) of values |
| kurt | Sample kurtosis (4th moment) of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum or maximum of values, respectively |
| cumprod | Cumulative product of values |
| diff | Compute 1st arithmetic difference (useful for time series) |
| pct_change | Compute percent changes |
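
Here is a small sketch that exercises several of the methods in Table 5-10 on one made-up Series. One assumption to flag: `mad` is not available in recent pandas releases, so the sketch computes the mean absolute deviation by hand.

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0], index=list('abcde'))

print(s.describe())                  # count, mean, std, min, quartiles, max
print(s.idxmax(), s.argmax())        # label vs. integer position of the max
print(s.idxmin())                    # first label at which the min occurs
print(s.quantile(0.5), s.median())   # the median is the 50% quantile
print(s.var(), s.std())              # sample variance and standard deviation
print(s.cumsum().tolist())           # running total
print(s.diff().tolist())             # first difference; first entry is NaN
print(s.pct_change().tolist())       # fractional change from previous value

# Mean absolute deviation from the mean, computed manually
mean_abs_dev = (s - s.mean()).abs().mean()
print(mean_abs_dev)
```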

#### Table 5-11. Unique, value counts, and binning methods
|Method |Description|
|------------------|----------------------------|
| isin | Compute boolean array indicating whether each Series value is contained in the passed sequence of values. |
| unique | Compute array of unique values in a Series, returned in the order observed. |
| value_counts | Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order. |
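
The three methods in Table 5-11 are easiest to compare side by side. This sketch applies them to a made-up Series of single-letter labels.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()               # distinct values, in order first seen
counts = obj.value_counts()          # frequencies, largest count first
mask = obj.isin(['b', 'c'])          # True where the value is 'b' or 'c'

print(list(uniques))
print(counts)
print(obj[mask])                     # the mask can filter the Series
```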

#### Table 5-12. NA handling methods
|Method |Description|
|------------------|----------------------------|
| dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
| fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. | 
| isnull | Return like-type object containing boolean values indicating which values are missing / NA. | 
| notnull | Negation of isnull. |
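
As a sketch of the detection and filtering methods in Table 5-12, the code below counts missing values in a small made-up frame and then drops rows under different tolerance settings. The `how` and `thresh` arguments are the "varying thresholds" the table refers to.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0],
                   'b': [5.0, 2.0, np.nan, np.nan]})

missing = df.isnull()                # boolean frame, True where NA
present = df.notnull()               # the negation of isnull
print(missing.sum())                 # number of NA values per column

complete = df.dropna()               # keep rows with no NA at all
not_empty = df.dropna(how='all')     # drop only rows that are entirely NA
at_least_two = df.dropna(thresh=2)   # keep rows with >= 2 non-NA values

print(complete)
print(not_empty)
```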

#### Table 5-13. fillna function arguments
| Argument |Description|
|------------------|----------------------------|
| value | Scalar value or dict-like object to use to fill missing values |
| method | Interpolation, by default 'ffill' if function called with no other arguments |
| axis | Axis to fill on, default axis=0 |
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |
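
Finally, a sketch of the filling options in Table 5-13 on a made-up frame. One assumption to flag: recent pandas releases deprecate the `method` argument of `fillna` in favor of the dedicated `ffill` and `bfill` methods, so the forward-fill example calls `ffill` directly; its `limit` argument plays the same role as in the table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0],
                   'b': [5.0, 2.0, np.nan, np.nan]})

zeros = df.fillna(0)                          # one scalar for every gap
per_col = df.fillna({'a': 0.0, 'b': -1.0})    # dict: a fill value per column
forward = df.ffill(limit=1)                   # carry forward at most one step

print(zeros)
print(per_col)
print(forward)
```

All three calls return new objects, and `df` itself still contains its missing values afterward.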

## Working with Pandas -- an in-class working session

Below is a series of questions, with the answers left for you to fill in using pandas expressions that draw on the methods in Chapter 5. You should not need anything beyond the content of this chapter (a subset of the methods summarized above) to do this exercise. You should be able to complete it within class if you've been keeping up with the reading.

Let's start by importing pandas. Note that we import pandas as pd, so the DataFrame class is referenced as pd.DataFrame.

```python
import pandas as pd
```

OK, let's create a DataFrame from a dictionary, following the example on pg 116 of Python for Data Analysis (PDA).

```python
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
```

Explain the contents and structure of 'data'

What does 'pd.DataFrame(data)' do? What if we did not begin that line with 'df ='?

Look at the contents of df, using just df by itself, and 'print(df)'.

How can we get a quick statistical profile of all the numeric columns?

Can you get a profile of a column that is not numeric, like state? Try it.

How can we print the data types of each column?

How can we print just the column containing state names?

How can we get a list of the states in the DataFrame, without duplicates?

How can we get a count of how many rows we have in each state?

How can we compute the mean of population across all the rows?

How can we compute the maximum population across all the rows?

How can we compute the 20th percentile value of population?

How can we compute a Boolean indicating whether the state is 'Ohio'?

How can we select and print just the rows for Ohio?

How can we create a new DataFrame containing only the Ohio records?

How can we select and print just the rows in which population is more than 2?

How could we compute the mean population for Ohio, averaging across years?

How can we print the DataFrame, sorted by state and, within each state, by population?

How can we print the row for Ohio, 2002, selecting on its values (not on row and column indexes)?

How can we use row and column indexing to set the population of Ohio in 2002 to 3.4?

How can we use row and column indexing to append a new record for Utah, initially with no population or year?

How can we set the population to 2.5 and year to 2001 for the new record?