# Getting Started with pandas
## DAT540 Introduction to Data Science
## University of Stavanger
### L07
### 19/09/2018

#### Antorweep Chakravorty (antorweep.chakravorty@uis.no)

In [1]:
import numpy as np
import pandas as pd

- In order to load a text/CSV file into pandas, we use the top-level function **pd.read_csv**
- pandas offers several other ways to read and write files; these are covered in later lectures
- For now we just need *pd.read_csv*
- A 2-D pandas DataFrame can easily be converted to an ndarray by passing it to *np.array*

In [3]:
df = pd.read_csv('./IRIS.csv', header=None)
print('shape:', df.shape)
df.head()

shape: (150, 5)


Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [4]:
arr = np.array(df)
print('arr.shape:', arr.shape)
arr[:5]

arr.shape: (150, 5)


array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ],
       [5. , 3.6, 1.4, 0.2, 0. ]])
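
The conversion above can also be done with the DataFrame's own *to_numpy* method (available in pandas 0.24 and later). The following is a minimal sketch, not part of the original notebook: `csv_text` and `small_df` are hypothetical names, and the in-memory text stands in for `./IRIS.csv` so that the cell runs without the file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory stand-in for './IRIS.csv' (first three rows shown above)
csv_text = "5.1,3.5,1.4,0.2,0\n4.9,3.0,1.4,0.2,0\n4.7,3.2,1.3,0.2,0\n"
small_df = pd.read_csv(io.StringIO(csv_text), header=None)

arr_a = np.array(small_df)    # NumPy constructor, as in the cell above
arr_b = small_df.to_numpy()   # DataFrame method (pandas 0.24 and later)

# Mixed float and integer columns are upcast to one common dtype
print(arr_a.shape, arr_a.dtype)
```

Both routes should give the same 2-D array; the integer class column is upcast to float so that the array has a single dtype.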

- **Pandas Essential Functionality**
- **Re-indexing**: data is conformed to a new index. Using the **reindex** instance method, the data can be filtered and rearranged into a new order
- reindex introduces missing values for any index values that were not already present
- reindex creates a copy of the data

In [5]:
# Let us have the 1st column with top 5 rows of the dataframe as a series. Obtains a view on df
sr1 = df[0][:5]
print('type(sr1):', type(sr1))
print('sr1.shape:', sr1.shape)
sr1

type(sr1): <class 'pandas.core.series.Series'>
sr1.shape: (5,)


0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: 0, dtype: float64

In [6]:
sr2 = sr1.reindex([4,2,'a',3,0])
# Changing the value at label 4 of sr2. Does it get reflected back to sr1?
sr2[4] = np.nan

In [6]:
sr2

4    NaN
2    4.7
a    NaN
3    4.6
0    5.1
Name: 0, dtype: float64

In [7]:
sr1

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: 0, dtype: float64

- *reindex* can also interpolate or fill missing values
- The *method* argument of reindex specifies how the missing values are filled
- *ffill* copies the previous value into the missing value
- *bfill* copies the next value into the missing value
- *nearest* copies the nearest value into the missing value
- when *method* is used, the index must be monotonically increasing or decreasing
- index labels must be unique

In [8]:
sr1.reindex(index=range(7), method='ffill')

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
5    5.0
6    5.0
Name: 0, dtype: float64
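
To compare the three fill methods side by side, here is a small sketch on a hypothetical Series `s` whose index has gaps (the values and labels are made up for illustration and chosen so that no label is equally close to two neighbours).

```python
import pandas as pd

# Hypothetical Series with gaps between the labels 0, 3 and 6
s = pd.Series([5.1, 4.9, 4.7], index=[0, 3, 6])

forward = s.reindex(range(8), method='ffill')    # take the previous known value
backward = s.reindex(range(8), method='bfill')   # take the next known value
nearest = s.reindex(range(8), method='nearest')  # take the closest known value

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'nearest': nearest}))
```

Note that *bfill* leaves label 7 as NaN, because there is no later value to copy from.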

- reindexing on DataFrames works on both rows and columns
- the *index* argument specifies the order of the rows and the *columns* argument specifies the order of the columns

In [19]:
df1 = df[0:5]
df1.reindex(index=range(7), columns=range(6, 0, -1))

Unnamed: 0,6,5,4,3,2,1
0,,,0.0,0.2,1.4,3.5
1,,,0.0,0.2,1.4,3.0
2,,,0.0,0.2,1.3,3.2
3,,,0.0,0.2,1.5,3.1
4,,,0.0,0.2,1.4,3.6
5,,,,,,
6,,,,,,


In [24]:
df1.reindex(index=range(7), columns=range(6, 0, -1), method='nearest')

Unnamed: 0,6,5,4,3,2,1
0,0,0,0,0.2,1.4,3.5
1,0,0,0,0.2,1.4,3.0
2,0,0,0,0.2,1.3,3.2
3,0,0,0,0.2,1.5,3.1
4,0,0,0,0.2,1.4,3.6
5,0,0,0,0.2,1.4,3.6
6,0,0,0,0.2,1.4,3.6


- reindex function arguments
<img src='reindex.png' width='500'>
- An alternative way to reindex a DataFrame is to use the *loc* attribute

In [54]:
df1_indices = list(df1.index)
np.random.shuffle(df1_indices)
df1.index = df1_indices
df1

Unnamed: 0,0,1,2,3,4
1,5.1,3.5,1.4,0.2,0
4,4.9,3.0,1.4,0.2,0
0,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
2,5.0,3.6,1.4,0.2,0


In [58]:
df1.loc[[2,1,3], [4,3]]

Unnamed: 0,4,3
2,0,0.2
1,0,0.2
3,0,0.2


In [59]:
df1.iloc[[2,1,3], [4,3]]

Unnamed: 0,4,3
0,0,0.2
4,0,0.2
3,0,0.2
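
One practical difference between *reindex* and *loc* concerns labels that do not exist. The sketch below uses a hypothetical DataFrame `frame`; it assumes pandas 1.0 or later, where *loc* raises a `KeyError` if any label in the list is missing (older versions only warned).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a shuffled integer index
frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[1, 4, 0], columns=['p', 'q'])

# reindex tolerates the absent label 9 and fills that row with NaN
via_reindex = frame.reindex([0, 9])

# .loc with a list containing an absent label raises KeyError (pandas 1.0 and later)
try:
    frame.loc[[0, 9]]
    loc_raised = False
except KeyError:
    loc_raised = True

print(via_reindex)
print('loc raised KeyError:', loc_raised)
```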


- **Dropping** entries from an Axis
- The *drop* instance method of a Series or a DataFrame can be used to drop one or more rows/columns
- It returns a new object/copy after performing the required operations
- To perform the drop *in place*, we pass *inplace=True* to the drop call

In [60]:
sr2.drop([2,3])

4    NaN
a    NaN
0    5.1
Name: 0, dtype: float64

In [61]:
# Drops row at index 2
df1.drop(2)

Unnamed: 0,0,1,2,3,4
1,5.1,3.5,1.4,0.2,0
4,4.9,3.0,1.4,0.2,0
0,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0


In [62]:
# Drops columns 0, 3
df1.drop([0,3], axis=1)

Unnamed: 0,1,2,4
1,3.5,1.4,0
4,3.0,1.4,0
0,3.2,1.3,0
3,3.1,1.5,0
2,3.6,1.4,0
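
The difference between the default behaviour and *inplace=True* can be sketched as follows, using a hypothetical DataFrame `frame`. The `columns=` keyword is an alternative to `axis=1`.

```python
import pandas as pd

# Hypothetical frame for illustration
frame = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])

# Default: a new object is returned and frame keeps all three rows
copy_without_b = frame.drop('b')

# In place: frame itself loses column 'y' and the call returns None
result = frame.drop(columns='y', inplace=True)

print(copy_without_b)
print(frame)
```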


- **.loc and .iloc**: DataFrame indexing by label and by integer position
- Allow selection of a subset of rows and columns from a DataFrame with NumPy-like notation, using either axis labels (*loc*) or integer positions (*iloc*)
- Slicing can be used with both indexers
- Whether a selection is a view or a copy depends on the kind of selection and the pandas version rather than on which indexer is used, so do not rely on *loc* returning a view or *iloc* returning a copy


In [63]:
arr = np.random.randint(0, 20, 16).reshape(4,4)
data = pd.DataFrame(arr) # Create a DataFrame from ndarray
data.index = ['a', 'b', 'c', 'd'] # Label the rows
data.columns = ['one', 'two', 'three', 'four'] # Label the cols
data

Unnamed: 0,one,two,three,four
a,6,3,5,10
b,14,18,15,17
c,8,6,9,12
d,12,2,14,18


In [64]:
data.loc['a', ['one', 'three']]

one      6
three    5
Name: a, dtype: int64

In [65]:
data.iloc[0, [1, 3]]

two      3
four    10
Name: a, dtype: int64

In [66]:
# Using slicing and boolean filtering here. Can be used for both rows or columns
data.iloc[:, 1:3][data.three > 5]

Unnamed: 0,two,three
b,18,15
c,6,9
d,2,14
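
Slices behave slightly differently with the two indexers: a label slice with *loc* includes its end label, while a positional slice with *iloc* excludes its end position, as in plain Python. A minimal sketch on a hypothetical DataFrame `frame`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same labels as `data` above
frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

by_label = frame.loc['a':'c', 'one':'three']  # end labels 'c' and 'three' are included
by_position = frame.iloc[0:2, 0:2]            # end positions 2 are excluded

print(by_label.shape, by_position.shape)
```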


- Indexing options with DataFrame

<img src='indexing_ops.png'>

- **Arithmetic and Data Alignment**
- When objects are added together, their labels are aligned; if the two indexes differ, the index of the result is the union of the two (similar to an outer join), and labels present in only one object yield NaN
- For a DataFrame, alignment is performed on both rows and columns
- What happens if you add two DataFrames without any rows or columns in common?

In [67]:
sr1 = pd.Series({'a': 7.3, 'c': -2.3, 'd': 3.4, 'e': 1.5})
sr2 = pd.Series({'a': -2.1, 'c': 3.6, 'e': -1.5, 'f': 4, 'g': 3.1})
sr1 + sr2

a    5.2
c    1.3
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [72]:
df1 = pd.DataFrame(np.arange(9.).reshape(3,3), columns=list('bcd'), index=['x', 'y', 'z'])
df2 = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('bde'), index=['u', 'x', 'y', 'o'])
df1 + df2

Unnamed: 0,b,c,d,e
o,,,,
u,,,,
x,3.0,,6.0,
y,9.0,,12.0,
z,,,,
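
For the case of two DataFrames with no labels in common, a quick experiment is sketched below. The frames `left` and `right` are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share neither row nor column labels
left = pd.DataFrame({'A': [1, 2]}, index=['r1', 'r2'])
right = pd.DataFrame({'B': [3, 4]}, index=['r3', 'r4'])

total = left + right

# The result spans the union of the labels, but no cell has a value on both sides
print(total)
```

Every cell of the result is NaN, since alignment finds no pair of values to add.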


- *Arithmetic methods with Fill Value*
- Missing values that arise because an axis label is absent from one of the two DataFrames can be filled with a special value such as 0, using the *fill_value* argument of the arithmetic methods

In [73]:
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
o,9.0,,10.0,11.0
u,0.0,,1.0,2.0
x,3.0,1.0,6.0,5.0
y,9.0,4.0,12.0,8.0
z,6.0,7.0,8.0,


- Why are [o, c], [u, c] and [z, e] still NaN?

- Flexible arithmetic methods

<img src='arithmetics.png' width='450'>

- Each method with the prefix *r* is a counterpart with the arguments flipped

```python
1 / df1
# Equivalent to
df1.rdiv(1)
```
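
The equivalence can be checked directly. This is a small sketch on a hypothetical DataFrame `frame` with no zeros, so that the division is well defined.

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding the values 1.0 to 6.0
frame = pd.DataFrame(np.arange(1., 7.).reshape(2, 3), columns=list('bcd'))

reciprocal_op = 1 / frame            # operator form
reciprocal_method = frame.rdiv(1)    # flipped method: computes 1 / frame
unchanged = frame.div(1)             # non-flipped method: computes frame / 1

print(reciprocal_op.equals(reciprocal_method))
```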

- *Operations between DataFrame and Series*
- As in NumPy, where we can perform operations between arrays of different dimensions, we can operate between a DataFrame and a Series
- The concept is similar to broadcasting with ndarrays
- By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and broadcasts down the rows

In [74]:
df2

Unnamed: 0,b,d,e
u,0,1,2
x,3,4,5
y,6,7,8
o,9,10,11


In [75]:
sr1 = df2.iloc[0]
sr1

b    0
d    1
e    2
Name: u, dtype: int64

In [76]:
df2 - sr1

Unnamed: 0,b,d,e
u,0,0,0
x,3,3,3
y,6,6,6
o,9,9,9


- To broadcast over the columns instead, matching on the rows, we have to use the arithmetic methods
- the *axis* argument should be set to either 0 or 'index'

In [77]:
df2

Unnamed: 0,b,d,e
u,0,1,2
x,3,4,5
y,6,7,8
o,9,10,11


In [78]:
sr1 = df2['b']
sr1

u    0
x    3
y    6
o    9
Name: b, dtype: int64

In [80]:
df2.subtract(sr1, axis='index')

Unnamed: 0,b,d,e
u,0,1,2
x,0,1,2
y,0,1,2
o,0,1,2


- **Function Application and Mapping**
- NumPy *ufuncs* (element-wise array methods) also work with pandas objects
- Functions that operate on 1-D arrays, such as lambdas, can be applied to each column or row using the *apply* instance method
- The *axis* argument of apply selects whether the function is applied to each column (*axis=0*, the default) or to each row (*axis=1*)

In [81]:
np.sum(df2, axis=1)

u     3
x    12
y    21
o    30
dtype: int64

In [82]:
data

Unnamed: 0,one,two,three,four
a,6,3,5,10
b,14,18,15,17
c,8,6,9,12
d,12,2,14,18


In [83]:
f = lambda x: x.max() - x.min()
data.apply(f, axis=0)

one       8
two      16
three    10
four      8
dtype: int64
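
The cells above use a reduction (`np.sum`) and column-wise *apply*. As a complement, the following sketch shows element-wise ufuncs and row-wise *apply* on a hypothetical DataFrame `frame`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration
frame = pd.DataFrame([[1, -4], [-9, 16]], index=['r1', 'r2'], columns=['a', 'b'])

# Element-wise ufuncs return a DataFrame with the same labels
roots = np.sqrt(np.abs(frame))

# axis=1 passes each row to the function and yields one value per row
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(roots)
print(row_range)
```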

- The function passed to *apply* need not return only a scalar value, it can also return a Series with multiple values

In [84]:
def f(x):
  return pd.Series([x.min(), x.max()], index=['min', 'max'])
data.apply(f)

Unnamed: 0,one,two,three,four
min,6,2,5,10
max,14,18,15,18


- *applymap* and *map* instance methods
- perform an element-wise computation with a given function
- *applymap* is specific to DataFrames and *map* is used for Series

In [85]:
df1 = pd.DataFrame(np.random.randn(9).reshape(3,3), index=['b', 'a', 'c'], columns=list('xzy'))
df1

Unnamed: 0,x,z,y
b,-0.990686,0.230607,0.313111
a,0.52917,1.197852,-0.306571
c,-0.157807,-0.126531,0.672516


In [89]:
# Suppose we want to compute a formatted string from each floating point value in df1
f = lambda x:  '%.2f' % x
df1.applymap(f)

Unnamed: 0,x,z,y
b,-0.99,0.23,0.31
a,0.53,1.2,-0.31
c,-0.16,-0.13,0.67
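
In pandas 2.1 and later, *applymap* is deprecated in favour of *DataFrame.map*. A version-independent alternative, sketched here on a hypothetical DataFrame `frame`, is to apply *Series.map* to each column.

```python
import pandas as pd

# Hypothetical frame for illustration
frame = pd.DataFrame({'x': [0.529, -0.158], 'y': [1.198, -0.127]}, index=['a', 'c'])

fmt = lambda v: '%.2f' % v

# apply hands each column to the lambda as a Series; Series.map then works element-wise
formatted = frame.apply(lambda col: col.map(fmt))

print(formatted)
```

The result holds strings, so a value such as 1.198 is shown as `1.20` with its trailing zero.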


- **Sorting and Ranking**
- The *sort_index* instance method sorts a pandas object by its row or column labels
- It returns a new object
- Passing *inplace=True* sorts the object in place instead of returning a new one
- By default only the row index is sorted (*axis=0*)
- Use the *axis* argument to choose the axis whose labels are sorted, e.g. *axis=1* for the column labels
- Sorting is ascending by default; pass *ascending=False* for descending order

In [90]:
sr1 = pd.Series(np.random.randn(9))
sr1.map(f)

0    -0.55
1    -0.03
2     0.81
3     0.40
4     0.18
5     0.11
6    -0.63
7     3.28
8    -1.47
dtype: object

In [91]:
df1

Unnamed: 0,x,z,y
b,-0.990686,0.230607,0.313111
a,0.52917,1.197852,-0.306571
c,-0.157807,-0.126531,0.672516


In [92]:
df1.sort_index()

Unnamed: 0,x,z,y
a,0.52917,1.197852,-0.306571
b,-0.990686,0.230607,0.313111
c,-0.157807,-0.126531,0.672516


In [93]:
df1.sort_index(axis=1, ascending=False, inplace=True)
df1

Unnamed: 0,z,y,x
b,0.230607,0.313111,-0.990686
a,1.197852,-0.306571,0.52917
c,-0.126531,0.672516,-0.157807


- The *sort_values* instance method sorts by values
- A Series can be sorted directly with sort_values; for a DataFrame
  - the *by* argument is required and takes one label or a list of labels: column labels for *axis=0*, row labels for *axis=1*
  - the *axis* argument selects the direction of the sort (default *axis=0*, which reorders the rows)
- As with sort_index, the *inplace=True* and *ascending=False* arguments can also be used with sort_values

In [95]:
sr1.sort_values(inplace=True)
sr1

8   -1.468870
6   -0.626835
0   -0.551446
1   -0.032993
5    0.111995
4    0.178398
3    0.403374
2    0.805964
7    3.280559
dtype: float64

In [96]:
df1 = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
df1

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [97]:
df1.sort_values(by=0, axis=1)

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


In [98]:
df1.sort_values(by=['a', 'b'], axis=0)

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
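
When sorting by several columns, *ascending* may also be a list with one flag per column. A short sketch, reusing the values of `df1` above under the hypothetical name `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# 'a' ascending, then 'b' descending within each value of 'a'
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(mixed)
```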


- *Ranking* assigns numerical ranks (1 through n) to the data along an axis
- Equal values are assigned a rank that is the average of the ranks of those values
  - the *method='first'* argument assigns ranks to ties in the order in which they appear in the data
- By default ranks run from the lowest value (1) to the highest (n); passing *ascending=False* reverses this
- The *rank* instance method of Series and DataFrame performs ranking
- Tie-breaking methods with rank

<img src='ranking.png' width='550'>

In [101]:
sr1 = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [102]:
sr1.rank(method='first', ascending=False)

0    1.0
1    7.0
2    2.0
3    3.0
4    5.0
5    6.0
6    4.0
dtype: float64
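
To see how the tie-breaking methods differ, the sketch below ranks the same values with each method in the default ascending order. The names `values` and `ranks` are chosen for this illustration only.

```python
import pandas as pd

values = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method
ranks = pd.DataFrame({m: values.rank(method=m)
                      for m in ['average', 'min', 'max', 'first', 'dense']})

print(ranks)
```

The two 7s, for example, share rank 6.5 under *average*, both get 6 under *min* and 7 under *max*, get 6 and 7 under *first*, and both get 5 under *dense*.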

- DataFrame can compute ranks over the rows or columns

In [103]:
df1 = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, 2.5]})
df1

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,2.5


In [107]:
df1.rank(axis=1)

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,2.0,1.0,3.0


- *Axis Indices with Duplicate Labels*
- When a Series or a DataFrame has duplicate labels, selecting such a label retrieves the values for all of its occurrences
- Care is needed to detect and handle duplicate labels
  - *sort_index* can be used to bring equal labels together
  - summary statistics can be computed per label

In [108]:
sr1 = pd.Series(range(5), index=list('aabbc'))
sr1

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [109]:
sr1['a']

a    0
a    1
dtype: int64

In [110]:
df2 = pd.DataFrame(np.random.randint(0, 10, 9).reshape(3,3))
df2.index = list('abb')
df2.columns = list('yyx')
df2

Unnamed: 0,y,y,x
a,2,3,5
b,7,8,6
b,7,8,1


In [111]:
df2['y']

Unnamed: 0,y,y
a,2,3
b,7,8
b,7,8


In [112]:
df2.loc['b']

Unnamed: 0,y,y,x
b,7,8,6
b,7,8,1
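
Duplicate labels can be detected and summarised as sketched below. The Series has the same content as `sr1` above and is named `s` here for illustration.

```python
import pandas as pd

s = pd.Series(range(5), index=list('aabbc'))

has_unique_labels = s.index.is_unique        # False when any label repeats
dup_mask = s.index.duplicated()              # True for repeats after the first occurrence
per_label_total = s.groupby(level=0).sum()   # one summary value per distinct label

print(has_unique_labels)
print(dup_mask)
print(per_label_total)
```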


- **Summarizing and Computing Descriptive Statistics**
- pandas objects come with a set of common mathematical and statistical methods
- most fall into the category of reductions or summary statistics
- Unlike NumPy methods, pandas statistical methods have built-in handling of missing data
- Calling the *sum* instance method of a DataFrame returns a Series containing the column sums
- Passing *axis=1* or *axis='columns'* sums across the columns instead
- NA values are excluded by default; for a slice (row or column) that is entirely NA, *mean* returns NaN, while *sum* returns 0 in pandas 0.22 and later
- The exclusion of NA values can be disabled with the argument *skipna=False*
<img src='reductions.png' width='450'>

In [113]:
df1 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=list('abcd'), columns=['one', 'two'])
df1

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [114]:
df1.sum()

one    9.25
two   -5.80
dtype: float64

In [115]:
df1.sum(axis=1,  skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
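
The treatment of a row that contains only NaN can be checked with a small sketch. It assumes pandas 0.22 or later, where *sum* accepts the *min_count* argument; `frame` is a hypothetical two-row extract of `df1`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.4, np.nan], [np.nan, np.nan]],
                     index=['a', 'c'], columns=['one', 'two'])

row_sum = frame.sum(axis=1)                       # NaN skipped; the all-NaN row sums to 0.0
row_sum_strict = frame.sum(axis=1, min_count=1)   # require at least one value; else NaN
row_mean = frame.mean(axis=1)                     # the all-NaN row has no mean

print(row_sum, row_sum_strict, row_mean, sep='\n')
```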

- Methods like *idxmin* and *idxmax* return indirect statistics, such as the index label at which the minimum or maximum value is attained
- Methods like *cumsum* perform accumulations
- Some methods are neither reductions nor accumulations, such as *describe*, which produces multiple summary statistics at once
- *describe* produces different summary statistics depending on whether the data is numeric or non-numeric
- Most of these methods can also be applied along an axis by providing the *axis* argument

In [116]:
df1.idxmax()

one    b
two    d
dtype: object

In [117]:
df1.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [118]:
df1.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


- Descriptive and Summary Statistics

<img src='summary.png' width='450'>

- **Correlation and Covariance**
- The *corr* instance method of a Series computes the correlation of the overlapping, non-NA, index-aligned values of two Series
- Similarly, *cov* computes their covariance
- A DataFrame's *corr* and *cov* methods return a full correlation or covariance matrix as a DataFrame
- The *corr* and *cov* methods of a DataFrame operate on the DataFrame's own columns
- To compute pairwise correlations with another DataFrame or Series, we use the *corrwith* instance method
- Passing a Series returns a Series with the correlation computed for each column or row, depending on the *axis* argument

In [119]:
df.columns = ['slength', 'swidth', 'plength', 'pwidth', 'class']
df.head()

Unnamed: 0,slength,swidth,plength,pwidth,class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [120]:
slength = df['slength']
swidth = df['swidth']
print('types: ', type(slength), type(swidth))
print('corr:', slength.corr(swidth), 'cov:', slength.cov(swidth))

types:  <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
corr: -0.10936924995064937 cov: -0.03926845637583893


In [121]:
df.corr()

Unnamed: 0,slength,swidth,plength,pwidth,class
slength,1.0,-0.109369,0.871754,0.817954,0.782561
swidth,-0.109369,1.0,-0.420516,-0.356544,-0.419446
plength,0.871754,-0.420516,1.0,0.962757,0.949043
pwidth,0.817954,-0.356544,0.962757,1.0,0.956464
class,0.782561,-0.419446,0.949043,0.956464,1.0
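
The cells above cover *corr* and *cov*; *corrwith* can be sketched as follows on a hypothetical DataFrame `frame`, where column `v` is an exact decreasing linear function of column `u`.

```python
import pandas as pd

# Hypothetical data: v = 10 - 2 * u, w is arbitrary
frame = pd.DataFrame({'u': [1.0, 2.0, 3.0, 4.0],
                      'v': [8.0, 6.0, 4.0, 2.0],
                      'w': [1.0, 3.0, 2.0, 5.0]})

# Correlation of every column of frame with the Series frame['u']
against_u = frame.corrwith(frame['u'])

print(against_u)
```

The result is a Series indexed by column name, with a value close to 1 for `u` and close to -1 for `v`.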


- **Unique Values, Value Counts, and Membership**
- pandas provides a class of related methods to extract information about the values contained in a Series
- *unique()* returns an array of the unique values in a Series, in order of first appearance
- The returned values are not sorted; to sort them, call the array's *sort()* method, which sorts in place
- *value_counts()* computes a Series containing value frequencies
- *isin* performs a vectorized set membership check

In [122]:
sr1 = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
sr1.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [123]:
sr1.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [124]:
print(sr1.isin(['b', 'c']))

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
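
Two common follow-ups are sorting the unique values and using the boolean mask from *isin* as a filter. This sketch uses `np.sort`, which returns a new sorted array instead of sorting in place; the Series has the same content as `sr1` above and is named `s` for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

sorted_uniques = np.sort(s.unique())   # new sorted array of the distinct values
subset = s[s.isin(['b', 'c'])]         # keep only the entries equal to 'b' or 'c'

print(sorted_uniques)
print(subset)
```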


- These methods can be applied to a DataFrame through its *apply* instance method, which applies a function to each column or row as specified by the *axis* argument

In [127]:
df1 = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                    'Qu4': [2, 3, 1, 2, 3],
                    'Qu3': [1, 5, 2, 4, 4]})
df1

Unnamed: 0,Qu1,Qu4,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [128]:
df1.apply(pd.value_counts)

Unnamed: 0,Qu1,Qu4,Qu3
1,1.0,1.0,1.0
2,,2.0,1.0
3,2.0,2.0,
4,2.0,,2.0
5,,,1.0


- **Categorical Data**
- pandas has a special *Categorical* type for holding data that uses an integer-based categorical representation or encoding


In [129]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits, 
                   'basket_id': np.arange(N), 
                   'count': np.random.randint(3, 5, size=N), 
                   'weight': np.random.uniform(0, 4, size=N)}, 
                   columns = ['basket_id', 'fruit', 'count', 'weight'])
df

Unnamed: 0,basket_id,fruit,count,weight
0,0,apple,3,2.807408
1,1,orange,4,3.341679
2,2,apple,4,1.09733
3,3,apple,3,1.974291
4,4,apple,4,2.502689
5,5,orange,3,3.950903
6,6,apple,4,3.63697
7,7,apple,4,3.749495


- Here df['fruit'] is an array of Python string objects that can be converted to categorical type using *astype* instance method

In [130]:
df['fruit'] = df['fruit'].astype('category')

- Here the Categorical object has *categories* and *codes* attributes

In [131]:
df['fruit'].values.categories

Index(['apple', 'orange'], dtype='object')

In [132]:
df['fruit'].values.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

- Alternatively, a Categorical can be created directly from other Python sequences with *pd.Categorical*
- The constructor *pd.Categorical.from_codes(codes, categories)* creates a Categorical from existing codes and categories
- To indicate that the categories have a meaningful order, we can pass the argument *ordered=True*; the order is that of the categories as given
- An unordered categorical instance can be made ordered with the instance method *as_ordered*

In [133]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]
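
The other construction routes can be sketched as follows. The `categories` and `codes` lists are hypothetical example data.

```python
import pandas as pd

# Hypothetical categories and integer codes pointing into them
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]

unordered = pd.Categorical.from_codes(codes, categories)
ordered = pd.Categorical.from_codes(codes, categories, ordered=True)  # foo < bar < baz
made_ordered = unordered.as_ordered()  # returns a new, ordered Categorical

print(ordered)
```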

- **Computation with Categoricals**

In [134]:
np.random.seed(12345)
draws = np.random.randn(1000)
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [135]:
"""
Quantile-based discretization function. Discretize variable into
equal-sized buckets based on rank or based on sample quantiles. For example
1000 values for 10 quantiles would produce a Categorical object indicating
quantile membership for each data point.
"""
pd.qcut(draws, 4)

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [136]:
# Giving proper labels to the quantiles
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [137]:
# The labeled bins categorical does not contain information about the bin edges, so we use groupby to extract some summary statistics
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws)
          .groupby(bins)
          .agg(['count', 'min', 'max'])
          .reset_index())
results

Unnamed: 0,quartile,count,min,max
0,Q1,250,-2.949343,-0.685484
1,Q2,250,-0.683066,-0.010115
2,Q3,250,-0.010032,0.628894
3,Q4,250,0.634238,3.927528


- Better performance with categoricals
- can provide substantial performance gains on some operations
- will often use less memory
- drawback: a one-time conversion cost
- GroupBy operations can be significantly faster with categoricals
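
The memory claim can be checked with *memory_usage*. The sketch below uses hypothetical data with many rows and few distinct values, the situation in which categoricals tend to help most; exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical data: 100,000 strings drawn from only 4 distinct values
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * 25_000)
as_category = labels.astype('category')

# deep=True also counts the memory held by the string objects themselves
bytes_plain = labels.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(bytes_plain, bytes_category)
```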

- **Categorical Methods**

<img src='categoricals.png'>

- *Creating dummy variables for modeling*
- When using statistics or ML tools, categorical data is often transformed into *dummy variables*, also known as *one-hot* encoding
- This involves creating a DataFrame with a column for each distinct category
- These columns contain 1s for occurrence of a given category and 0 otherwise
- Top level *pd.get_dummies* performs this transformation

In [139]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [140]:
pd.get_dummies(cat_s)

Unnamed: 0,a,b,c,d
0,1,0,0,0
1,0,1,0,0
2,0,0,1,0
3,0,0,0,1
4,1,0,0,0
5,0,1,0,0
6,0,0,1,0
7,0,0,0,1
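
In practice the dummy columns are usually joined back to the other columns. Here is a minimal sketch on a hypothetical DataFrame `frame`. Recent pandas versions (2.0 and later) return boolean columns by default, so `dtype=int` is passed to obtain 0/1 values as in the output above.

```python
import pandas as pd

# Hypothetical frame with one categorical-like column and one numeric column
frame = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'value': range(4)})

# prefix adds 'key_' to each dummy column name; dtype=int requests 0/1 values
dummies = pd.get_dummies(frame['key'], prefix='key', dtype=int)

# Attach the indicator columns to the remaining data
with_dummies = frame[['value']].join(dummies)

print(with_dummies)
```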
