### 10 minutes to Pandas

In [1]:
# importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

### Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

```
mySeries = pd.Series(data, index=index)
```

Syntax for declaring a Series object

Here, data can be many different things:

- a Python dict
- an ndarray
- a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

#### From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [None]:
s = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])

print(s)

print(s.index)

print(pd.Series(np.random.randn(5)))
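
The length requirement can be checked directly. The following is a minimal sketch (the names `values`, `default_indexed` and `error_message` are illustrative and not part of the original notebook) showing the default integer index and the error raised for a mismatched index:

```python
import numpy as np
import pandas as pd

values = np.random.randn(5)

# Without an index, pandas builds the default integer index 0..len(data)-1
default_indexed = pd.Series(values)
print(list(default_indexed.index))

# An index whose length differs from the data is rejected with a ValueError
error_message = None
try:
    pd.Series(values, index=['a', 'b', 'c'])
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```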

#### From dict

If data is a dict and an index is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, the index is built from the dict keys: in insertion order on Python 3.6+ with pandas 0.23 or later, and from the sorted keys (if possible) on older versions.

In [None]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}

print(pd.Series(d))

print(pd.Series(d, index=['b', 'c', 'd', 'a']))

**Note: NaN (not a number) is the standard missing data marker used in pandas.**

#### From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [None]:
pd.Series(5., index=['a', 'b', 'c'])

Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index.

In [None]:
print(s)
# s has string labels, so use .iloc for access by integer position
print(s.iloc[0])
print(s[:3])

print(s.median())

print(s > s.median())

print(s[s > s.median()])

print(s.iloc[[4, 3, 1]])


A Series is like a fixed-size dict in that you can get and set values by index label:

In [None]:
print(s['a'])

s['e'] = 100

print(s)

print('e' in s)

Using the get method, a missing label will return None or specified default:

In [None]:
print(s.get('f'))

# if not found return -1
print(s.get('f', -1))

print(s.get('f', np.nan))

### Vectorized operations and label alignment with Series

When doing data analysis, as with raw NumPy arrays, looping through a Series value by value is usually not necessary. Series can also be passed into most NumPy functions expecting an ndarray.

In [None]:
print(s + s)

print(s * 2)


A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [None]:
print(s[1:])

print(s[:-1])

print(s[1:] + s[:-1])

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
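
As a small worked case of this alignment, the sketch below uses two hand-made Series (`a`, `b` and the labels `x`, `y`, `z` are illustrative names, not from the original notebook). It also shows `Series.add` with `fill_value`, which is one way to avoid the NaN when a label is missing on one side:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# Labels are matched by name; 'x' exists only in `a`, so the sum is NaN there
total = a + b
print(total)

# Series.add with fill_value treats a label missing on one side as 0
filled = a.add(b, fill_value=0)
print(filled)
```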

#### Name attribute

Series can also have a name attribute:

In [None]:
s = pd.Series(np.random.randn(5), name="something")
print(s)

# renaming a series
s2 = s.rename("different")
print(s2)

### DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a **dict of Series objects**. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

1. Dict of 1D ndarrays, lists, dicts or Series
2. 2-D numpy.ndarray
3. Structured or record ndarray
4. A Series

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

#### From dict of Series or dicts

The result index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the dict keys, in insertion order on Python 3.6+ with pandas 0.23 or later and sorted on older versions.

In [None]:
d = {'one': pd.Series([1,2,3], index=['a', 'b', 'c']), 'two': pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

print(df)

#### The row and column labels of a DataFrame can be accessed through the index and columns attributes, respectively

In [None]:
print(pd.DataFrame(d, index=['d', 'b', 'a']))

# there is no col 'three' hence returns NaNs
print(pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three']))

print(df.index)

print(df.columns)

#### From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

In [None]:
d = {
    'one': [1, 2, 3, 4],
    'two': [4, 3, 2, 1]
}

print(pd.DataFrame(d))

print(pd.DataFrame(d, index=['a', 'b', 'c', 'd']))

#### From a list of dicts

In [None]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

print(pd.DataFrame(data2))

print(pd.DataFrame(data2, index=['first', 'second']))

print(pd.DataFrame(data2, columns=['a', 'b']))

#### DataFrame from items

DataFrame.from_items took a sequence of (key, value) pairs, where the keys are column (or row, in the case of orient='index') names, and the values are the column values (or row values). It was deprecated in pandas 0.23 and removed in 1.0. DataFrame.from_dict applied to a dict built from the same pairs gives the same result, and because dicts preserve insertion order (Python 3.7+), the columns keep their order without an explicit list of columns:

In [None]:
print(pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [5, 6, 7])]), orient='columns'))

# If you pass orient='index', the keys will be the row labels. In this case you can also pass the desired column names

print(pd.DataFrame.from_dict(dict([('A', [1, 5]), ('B', [2, 6]), ('C', [3, 7])]), orient='index', columns=[1, 2]))

#### Simple operations on DataFrames: Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [None]:
print(df)

print(df['one'])

df['three'] = df['one'] * df['two']

print(df)

df['flag'] = df['one'] > 2

print(df)


Columns can be deleted or popped like with a dict:

In [None]:
# del df['three']

df['three'] = df['one'] * df['two']
print(df)

# pop deletes and returns the column
three = df.pop('three')

print(three)

**When inserting a scalar value, it will naturally be propagated to fill the column:**

In [None]:
df['foo'] = 'bar'

print(df)

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:

In [None]:
print(df['one'][:2])

df['one_truncated'] = df['one'][:2]
print(df)

#### Assigning New Columns in Method Chains

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.

In [None]:
# read in the iris dataset

iris = pd.read_csv('data/iris.csv')

print(iris.head())

# mutate a new column

iris.assign(sepal_ratio = iris['sepal_width']/iris['sepal_length']).head()

Above was an example of inserting a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.

In [None]:
iris.assign(sepal_ratio = lambda x: (x['sepal_width'])/x['sepal_length']).head()

**assign always returns a copy of the data, leaving the original DataFrame untouched.**

In [None]:
iris.head()

# original data unchanged
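
The copy semantics can be verified without the CSV file. This is a minimal sketch; the `flowers` frame is a hypothetical two-row stand-in for the iris data and `with_ratio` is an illustrative name:

```python
import pandas as pd

# Hypothetical stand-in for the iris data, so the example needs no CSV file
flowers = pd.DataFrame({
    'sepal_length': [5.0, 4.0],
    'sepal_width': [2.5, 3.0],
})

# assign returns a new DataFrame that carries the extra column
with_ratio = flowers.assign(sepal_ratio=lambda x: x['sepal_width'] / x['sepal_length'])

print(with_ratio.columns.tolist())  # includes 'sepal_ratio'
print(flowers.columns.tolist())     # still only the two original columns
```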

Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to the DataFrame at hand. This is common when using assign in chains of operations. For example, we can limit the DataFrame to just those observations with a petal length greater than 5, calculate the ratios, and plot:

In [None]:
# .query(): like subset

%matplotlib inline

print(iris.query('petal_length > 6'))

iris.query('petal_length > 5').assign(sepal_ratio = lambda x: x.sepal_width/x.sepal_length,
                                     petal_ratio = lambda x: x.petal_width/x.petal_length).plot(kind = 'scatter', x = 'sepal_ratio', y = 'petal_ratio')


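**Warning: In pandas versions before 0.23 (or on Python older than 3.6), all expressions are computed first, and then assigned, so you can’t refer to another column being assigned in the same call to assign. From pandas 0.23 on Python 3.6+, keyword order is preserved and a later expression can refer to a column created earlier in the same call. For example:**

```
In [74]: # Fails on pandas < 0.23: `C` does not exist yet when `D` is computed
        df.assign(C = lambda x: x['A'] + x['B'],
                  D = lambda x: x['A'] + x['C'])
In [2]: # Works on every version: break it into two assigns
        (df.assign(C = lambda x: x['A'] + x['B'])
           .assign(D = lambda x: x['A'] + x['C']))
```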
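
A runnable version of the two-step form, as a sketch on a small hand-made frame (`frame` and `result` are illustrative names; the column values are made up for the example):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# The second assign sees column C because the first assign has already returned
result = (frame.assign(C=lambda x: x['A'] + x['B'])
               .assign(D=lambda x: x['A'] + x['C']))
print(result)
```

Splitting the call this way does not depend on keyword ordering, so it behaves the same regardless of the pandas version.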

### Indexing/Selection

1. Selecting column: df[col]: Series
2. Select row by label: df.loc[label]: Series
3. Select row by int location: df.iloc[loc]: Series
4. Slice rows: df[5:10]: DataFrame
5. Select rows by boolean vector (like in R): df[bool_vector]: DataFrame

Row selection, for example, returns a Series whose index is the columns of the DataFrame:

In [None]:
print(df)

print(df.loc['b'])

print(df.iloc[1])

one_row = df.loc['b']

# simple loop
for value in one_row:
    print(value)
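
The five selection forms listed in this section can be compared side by side. The sketch below uses a small hand-made frame (`frame` and the result names are illustrative) and prints the type each form returns:

```python
import pandas as pd

frame = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]},
    index=['a', 'b', 'c', 'd'],
)

column = frame['one']                # 1. select a column -> Series
row_by_label = frame.loc['b']        # 2. select a row by label -> Series
row_by_position = frame.iloc[1]      # 3. select a row by integer position -> Series
row_slice = frame[1:3]               # 4. slice rows -> DataFrame
filtered = frame[frame['one'] > 2]   # 5. boolean vector -> DataFrame

for name, obj in [('column', column), ('row_by_label', row_by_label),
                  ('row_by_position', row_by_position),
                  ('row_slice', row_slice), ('filtered', filtered)]:
    print(name, type(obj).__name__)
```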


### Data alignment and arithmetic

Operations between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.

In [None]:
# (10, 4) => 10 rows and 4 cols
df = pd.DataFrame(np.random.randn(10, 4), columns=['A','B','C','D'])

df1 = pd.DataFrame(np.random.randn(7, 3), columns=['A','B','C'])

print(df)

print(df1)

print(df + df1)

# When doing an operation between DataFrame and Series, the default behavior is to align the Series index
# on the DataFrame columns, thus broadcasting row-wise. For example:

print(df.iloc[0])

print(df - df.iloc[0])



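Older pandas versions special-cased time series data: when the DataFrame index contained dates, the broadcasting was column-wise. That behavior was deprecated and later removed; to broadcast column-wise, use an arithmetic method with an explicit axis, such as df.sub(df['A'], axis=0). The following DataFrame has a date index: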

In [None]:
index = pd.date_range("1/1/2000", periods=8)

df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

print(df)
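
To broadcast a Series down the rows (column-wise) explicitly, the arithmetic methods accept an axis argument. A minimal sketch, assuming a date-indexed frame like the one above (`frame` and `centered` are illustrative names):

```python
import numpy as np
import pandas as pd

index = pd.date_range("1/1/2000", periods=8)
frame = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

# axis=0 aligns the Series on the row labels, so column A is subtracted from every column
centered = frame.sub(frame['A'], axis=0)
print(centered)
```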

Operations with scalars are as expected:

In [None]:
print("df:", df)

print("df*2 + 5", df*2 + 5)

print("1/df", 1/df)

#### Boolean Operators

In [None]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

print(df1)

print(df2)

# print(df1 & df2)

# print(df1 | df2)

# print(df1 ^ df2)

# use ~ for logical NOT; NumPy rejects unary minus on boolean arrays
print(~df1)
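
The element-wise results of the four boolean operators can be worked through on the same inputs. This is a sketch with illustrative names (`left`, `right` and the result variables are not from the original notebook):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
right = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

both = left & right         # element-wise AND
either = left | right       # element-wise OR
exactly_one = left ^ right  # element-wise XOR
negated = ~left             # element-wise NOT

print(both)
print(either)
print(exactly_one)
print(negated)
```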

#### Transposing and Data Interoperability

To transpose, access the T attribute (also the transpose function), similar to an ndarray:

Elementwise NumPy ufuncs (log, exp, sqrt, ...) and various other NumPy functions can be used with no issues on DataFrame, assuming the data within are numeric:

In [None]:
print(df[:5])

print(df[:5].T)

print(np.exp(df))

# convert DataFrame to a NumPy ndarray

print(np.asarray(df))

The dot method on DataFrame implements matrix multiplication:

In [None]:
df.T.dot(df)

Similarly, the dot method on Series implements dot product:

In [None]:
print(np.arange(5, 10))

s1 = pd.Series(np.arange(5, 10))
print(s1)

s1.dot(s1)
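
Both forms of dot can be checked by hand on small inputs. In this sketch the 2x2 `matrix` and the names `vector` and `gram` are made up for illustration:

```python
import numpy as np
import pandas as pd

vector = pd.Series(np.arange(5, 10))
# 5*5 + 6*6 + 7*7 + 8*8 + 9*9 = 255
print(vector.dot(vector))

matrix = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
# The columns of matrix.T are aligned with the index of matrix before multiplying
gram = matrix.T.dot(matrix)
print(gram)
```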

#### Console Display

Very large DataFrames will be truncated to display them in the console. You can also get a summary using info(). 

```
In [101]: baseball = pd.read_csv('data/baseball.csv')

In [102]: print(baseball)
       id     player  year  stint  ...   hbp   sh   sf  gidp
0   88641  womacto01  2006      2  ...   0.0  3.0  0.0   0.0
1   88643  schilcu01  2006      1  ...   0.0  0.0  0.0   0.0
..    ...        ...   ...    ...  ...   ...  ...  ...   ...
98  89533   aloumo01  2007      1  ...   2.0  0.0  3.0  13.0
99  89534  alomasa02  2007      1  ...   0.0  0.0  0.0   0.0

[100 rows x 23 columns]

In [103]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
id        100 non-null int64
player    100 non-null object
year      100 non-null int64
stint     100 non-null int64
team      100 non-null object
lg        100 non-null object
g         100 non-null int64
```

#### Displaying DataFrames nicely

In [None]:
# Wide DataFrames will be printed across multiple rows by default:

print(pd.DataFrame(np.random.randn(3, 12)))

# You can change how much to print on a single row by setting the display.width option:

pd.set_option('display.width', 40) # default is 80

print(pd.DataFrame(np.random.randn(3, 12)))


### End of Intro to Data Structures

In [None]:
# Creating a Series

s = pd.Series([1, 3, 5, np.nan, 6, 8])

s

We now want to create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [None]:
# resetting the width

pd.set_option('display.width', 80) 

dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'), index=dates)

print(df)

#### A DataFrame may also be created by passing in a dict of objects of various data types

In [None]:
# 1. is a float literal, so column A gets a float dtype rather than int

df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})

df2

# print(df2.dtypes)

#### Viewing Data

In [None]:
# print(df.head())

# print(df.tail(3))

# print(df.index)

# print(df.columns)

# print(df.values) # returns a NumPy ndarray

# Describe shows a quick statistic summary of your data

df.describe()

# Transpose

# print(df.T)

Sorting by axis

In [None]:
print(df)

print(df.sort_index(axis=1, ascending=False))

Sorting by values

In [None]:
df.sort_values(by='B')

#### Selection

Selecting a single column, which yields a Series, equivalent to `df.A`:


In [None]:
print(df['A'])

Selecting via [], which slices the rows.

In [None]:
print(df[0:3])

#### Selection using Label

For getting a cross section using a label

In [None]:
print(df)

# print all cols of the row pointed to by dates[0]

print(df.loc[dates[0]])

# print all rows and only columns A and B

print(df.loc[:, ['A', 'B']])

# print cols B and D for the rows '20130102' through '20130104'

print(df.loc['20130102':'20130104', ['B', 'D']])

For getting a scalar value


In [None]:
# returns a scalar value

print(df.loc[dates[0], 'A'])

# returns a Series (one row label with a list of columns)

print(df.loc[dates[0], ['A']])
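
For a single scalar, pandas also provides the `.at` accessor, which takes exactly one row label and one column label. A minimal sketch, assuming a frame built like the one in this section (`frame`, `by_loc`, `by_at` and `one_column_list` are illustrative names):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
frame = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'), index=dates)

by_loc = frame.loc[dates[0], 'A']  # general label-based lookup
by_at = frame.at[dates[0], 'A']    # scalar-only accessor
print(by_loc, by_at)

# Passing the column as a list keeps a labeled container instead of a scalar
one_column_list = frame.loc[dates[0], ['A']]
print(type(one_column_list).__name__)
```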

#### Selecting by Position

Select via the position of the passed integers

In [3]:
print(df)

print(df.iloc[3])

# By integer slices, acting similar to numpy/python

print(df.iloc[3:5, 0:2])

# By lists of integer position locations, similar to the numpy/python style
print(df.iloc[[1, 2, 4], [0, 2]])