# Getting Started with pandas
## DAT540 Introduction to Data Science
## University of Stavanger
### L07
#### Antorweep Chakravorty (antorweep.chakravorty@uis.no)

- **pandas** contains data structures and data manipulation tools designed to make data cleaning and analysis faster
- it is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib
- pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without loops
- pandas is particularly designed for working with tabular (2-D) heterogeneous data
- In order to use pandas, we need to first import it and other required modules
- *DataFrame* and *Series* are the two pandas data structures that are used most extensively

```python 
import pandas as pd
# import DataFrame and Series, when required as follows
from pandas import Series, DataFrame
```

In [1]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

## pandas **Data Structures**

### Series
    - A one dimensional array-like object containing a sequence of values
    - It also contains an associated array of data labels called *index* 
    - By default, it automatically assigns indices from 0 to N - 1 (where N is the length of the data), when a Series is created over a list of values
    - A Series can also be created with an additional argument that specifies the list of index labels for the values

In [3]:
# index labels not provided. Therefore, a default list would be created, starting from zero index
obj = Series([4, 'A', '6.0', -1])
# dtype is object when the data are heterogeneous. 
# If all the data are of the same type, the Series takes that dtype 
print(obj) 
print('\nList of indices can be retrieved using \'obj.index\'. \nIn this case, it is an object like Python range:', obj.index)
print('\nList of values can be retrieved using \'obj.values\':', obj.values)


# Custom index labels provided
obj2 = Series([3, 2, -5, 0, 4], index=['a', 'b', 'c', 'd', 'e'] )
# Here all values are integers, so the dtype of the Series is int64
print("\nObject with custom indices\n")
print(obj2)

0      4
1      A
2    6.0
3     -1
dtype: object

List of indices can be retrieved using 'obj.index'. 
In this case, it is an object like Python range: RangeIndex(start=0, stop=4, step=1)

List of values can be retrieved using 'obj.values': [4 'A' '6.0' -1]

Object with custom indices

a    3
b    2
c   -5
d    0
e    4
dtype: int64


- Element(s) from a Series could be retrieved using similar mechanisms to that with 1-D ndarrays in NumPy, provided the underlying dtype supports it
  - on single Index
  - Basic Indexing       
  - Fancy Indexing
  - Boolean Indexing    
- Additionally, elements can also be retrieved using the labels and combined with most of the above indexing methods
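
The indexing styles listed above can be sketched with the explicit *iloc* (position) and *loc* (label) accessors. This is a minimal illustration with made-up variable names, not part of the original notebook. It spells out the accessors because recent pandas releases warn when plain integer keys in square brackets are used as positions on a label-indexed Series.

```python
import pandas as pd

s = pd.Series([3, 2, -5, 0, 4], index=['a', 'b', 'c', 'd', 'e'])

by_position = s.iloc[2]          # single element by integer position
by_label = s.loc['c']            # the same element by label
pos_slice = s.iloc[2:5]          # positional slice: end position excluded
label_slice = s.loc['b':'d']     # label slice: end label included
fancy = s.iloc[[2, 1, 2, 3]]     # fancy indexing with a list of positions
positive = s[s > 0]              # Boolean indexing with a mask

print(by_position, by_label)
print(label_slice)
```

Note the asymmetry: a positional slice excludes its end point, while a label slice includes it.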

In [5]:
print('obj2:\n', obj2)
print('retrieved a single element:', obj2[2])
print('retrieved a slice:\n', obj2[2:5])
print('retrieved a slice with all values except the last:\n', obj2[:-1])

obj2:
 a    3
b    2
c   -5
d    0
e    4
dtype: int64
retrieved a single element: -5
retrieved a slice:
 c   -5
d    0
e    4
dtype: int64
retrieved a slice with all values except the last:
 a    3
b    2
c   -5
d    0
dtype: int64


In [18]:
print('obj2:\n', obj2)
print('\nUsing Fancy Indexing:\n', obj2[[2,1,2,3]])
print('\nUsing Boolean Indexing:\n', obj2[obj2 > 0])

obj2:
 a    3
b    2
c   -5
d    0
e    4
dtype: int64

Using Fancy Indexing:
 c   -5
b    2
c   -5
d    0
dtype: int64

Using Boolean Indexing:
 a    3
b    2
e    4
dtype: int64


In [6]:
print('obj2:\n', obj2)
print('retrieved a single element by label:', obj2['b'])
print('retrieved a slice:\n', obj2['b':'d'])
print('\nUsing Fancy Indexing:\n', obj2[['b', 'a', 'b']])
print('\nUsing Boolean Indexing:\n', obj2[[True,True,False,False,True]])

obj2:
 a    3
b    2
c   -5
d    0
e    4
dtype: int64
retrieved a single element by label: 2
retrieved a slice:
 b    2
c   -5
d    0
dtype: int64

Using Fancy Indexing:
 b    2
a    3
b    2
dtype: int64

Using Boolean Indexing:
 a    3
b    2
e    4
dtype: int64


- A Series can also be thought of as a fixed-length, ordered dict that maps index labels to data values

```python
'b' in obj2  # membership is checked against the index labels (keys): True

'z' in obj2  # False
```
- A Series can be created by passing a dict directly to the Series constructor
- In this case, the keys of the dict become the index labels of the Series
- Alternatively, the index argument can be provided to select which dict entries go into the Series, and in what order
- Labels in the index that are present in the dict are included in the Series with their values
- Labels in the index that are not in the dict are assigned missing (NA) values

In [20]:
my_dict = {'a': 1, 'b': 2, 'f': 9, 'c': -1}
obj3 = Series(my_dict)
print('obj3:\n', obj3)

# Labels found among the dict keys keep their values; labels that are not found are assigned NaN (Not a Number)
obj4 = Series(my_dict, index=['b', 'c', 'x'])
print('obj4:\n', obj4)

obj3:
 a    1
b    2
f    9
c   -1
dtype: int64
obj4:
 b    2.0
c   -1.0
x    NaN
dtype: float64


- pandas provides the top-level functions *pd.isnull* and *pd.isna*, and the matching Series instance methods, to detect missing data

In [21]:
obj4.isnull()

b    False
c    False
x     True
dtype: bool

- A Series automatically aligns by index label in arithmetic operations
- The alignment behaves like an outer join: the result index is the union of the labels of both operands
- If a label is missing from either input Series, the result for that label is NA, because the operation cannot be computed for it
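
The outer-join behaviour can be illustrated with two small Series. This is a hedged sketch with made-up names: it contrasts the `+` operator with the *add* method and its `fill_value` argument, which treats an operand that is missing on one side as the given value.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20], index=['b', 'd'])

plain = left + right                     # labels a, c, d have no partner -> NaN
filled = left.add(right, fill_value=0)   # a missing operand is treated as 0

print(plain)
print(filled)
```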

In [8]:
a = Series(np.random.randint(0,10, 5), index=['c', 'd', 'b', 'e', 'b'])
b = Series(np.random.randint(0,10, 4), index=['x', 'c', 'e', 'b'])
print('a:\n', a)
print('b:\n', b)
print('a+b:')
a + b

a:
 c    0
d    9
b    7
e    3
b    0
dtype: int64
b:
 x    3
c    7
e    6
b    7
dtype: int64
a+b:


b    14.0
b     7.0
c     7.0
d     NaN
e     9.0
x     NaN
dtype: float64

- The Series object itself and its index have a **name** attribute

In [23]:
obj4.name = 'cname'
obj4.index.name = 'cindex'
obj4

cindex
b    2.0
c   -1.0
x    NaN
Name: cname, dtype: float64

- The Series index and values can be altered in place
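
Whether a slice of a Series behaves as a view or as a copy depends on the pandas version and on whether Copy-on-Write is enabled, so code that relies on a slice writing through to its parent can be fragile. The sketch below (illustrative names, not from the lecture) shows two patterns that do not depend on that behaviour: take an explicit copy when independence is wanted, and assign through *loc* when the original should change.

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'f': 9, 'c': -1})

# Explicit copy: changes to `part` never reach `s`
part = s.iloc[1:3].copy()
part.iloc[:] = 10

# Direct assignment through loc: changes `s` itself
s.loc['f'] = 100

print(part)
print(s)
```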

In [18]:
obj4 = Series({'a': 1, 'b': 2, 'f': 9, 'c': -1})
# Indexing on a Series follows rules similar to those of NumPy arrays when creating a view or a copy
# Basic indexing (slicing) does not create a copy
print('Initial values for obj4:\n', obj4)

x = obj4[1:3]
x[:] = 10
print('Slice stored in x:\n', x)

y = obj4[:'b'] 
y[:] = -10
print('Slice stored in y:\n', y)

# Fancy Indexing and Boolean indexing create a copy
z = obj4[[2,1]]
z[:] = 99
print('Slice stored in z:\n', z)

# only index f now has value 10
b = obj4[obj4.values == 10]
b[:] = 100
print('Slice stored in b:\n', b)

# We can also completely change the index labels in place
obj4.index = ['a', 'b', 'f', 'c']
print('modded obj4:\n', obj4 )

Initial values for obj4:
 a    1
b    2
f    9
c   -1
dtype: int64
Slice stored in x:
 b    10
f    10
dtype: int64
Slice stored in y:
 a   -10
b   -10
dtype: int64
Slice stored in z:
 f    99
b    99
dtype: int64
Slice stored in b:
 f    100
dtype: int64
modded obj4:
 a   -10
b   -10
f    10
c    -1
dtype: int64


### DataFrame
  - Represents a rectangular table of data with rows and columns
  - Each column in a DataFrame can have a different data type 
  - A DataFrame has labels/indices for both rows and columns. Alternatively, it can be conceptualized as a dict of Series all sharing the same index
  - Typically, a DataFrame is constructed from a dict of equal-length lists or NumPy arrays
  - A sequence of column names can be passed to the *columns* argument when creating a DataFrame to 
    - filter the selected columns
    - order the columns
  - A sequence of labels can be passed to the *index* argument to label the rows. If the index argument is not specified, the rows are indexed from 0 to N-1 (N being the total number of rows)
  - If a specified column isn't contained in the dict, it appears with missing values
  

In [20]:
data = {'name': ['abc', 'def', 'ghi', 'jkl', 'mno'], 'age': [23, 74, 31, 16, 34], 'height': [170.2, 164.0, 168.0, 140.0, 170.0]}
frame = pd.DataFrame(data, columns = ['age', 'name', 'xyz'], index = ['one', 'two', 'three', 'four', 'five'])
frame

Unnamed: 0,age,name,xyz
one,23,abc,
two,74,def,
three,31,ghi,
four,16,jkl,
five,34,mno,


  - DataFrames have multiple helper instance methods that allow easy inspection of a frame
  - The *head* instance method shows the first 5 rows of a DataFrame. An integer can be passed to head to retrieve that many rows
  - Similarly, the *tail* method shows the last 5, or the specified number of, rows of a DataFrame
  - A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute

In [21]:
frame = pd.DataFrame(data, columns = ['name', 'height', 'age'], index = ['one', 'two', 'three', 'four', 'five'])
print(frame.head())
print('\n', frame.name.head(2))
print('\n', frame['name'].tail(2))

      name  height  age
one    abc   170.2   23
two    def   164.0   74
three  ghi   168.0   31
four   jkl   140.0   16
five   mno   170.0   34

 one    abc
two    def
Name: name, dtype: object

 four    jkl
five    mno
Name: name, dtype: object


  - Rows of a DataFrame can be retrieved by label with the *loc* attribute, or by integer position with the *iloc* attribute
  - Columns can be modified in-place by assignment
  - A new column can also be created by assignment, using a new column name as the key
  - When a list or array is assigned to a column, whether new or existing, its length must match the number of rows of the DataFrame
  - If a Series is assigned to a column, its labels are realigned exactly to the DataFrame's index, inserting missing values where labels are absent
  - The *del* keyword or the *drop* instance method can be used to remove a column

In [22]:
frame = pd.DataFrame(data, columns = ['name', 'height', 'age'], index = ['one', 'two', 'three', 'four', 'five'])
print('retrieved a row:\n', frame.loc['one'])

# Changing the values of an existing column
frame['height'] = np.random.randint(150, 180, 5)

# Adding a new column
frame['asd'] = np.arange(5)

# Updating a column via indexed Series
val = Series(np.random.randint(10, 20, 4), index=['one', 'three', 'five', 'seven'])
frame['nha'] = val
frame

retrieved a row:
 name        abc
height    170.2
age          23
Name: one, dtype: object


Unnamed: 0,name,height,age,asd,nha
one,abc,165,23,0,16.0
two,def,159,74,1,
three,ghi,151,31,2,12.0
four,jkl,168,16,3,
five,mno,171,34,4,17.0


In [24]:
del frame['asd'] # performs in-place
frame.drop('nha', axis=1) # returns a copy with the specified column removed. Passing 'inplace=True' persists the change; otherwise, reassign the result to the same or a different variable

Unnamed: 0,name,height,age
one,abc,165,23
two,def,159,74
three,ghi,151,31
four,jkl,168,16
five,mno,171,34


  - Nested dict can also be used to create a DataFrame
  - pandas will interpret the outer dict keys as columns and the inner keys as the row indices
  - A DataFrame can be transposed (swap rows and columns) similar to NumPy array
  - A new DataFrame can be created from an existing DF, by selecting the columns that are required to be inserted

In [27]:
pop = {'Stavanger': {2001: 2.4, 2002: 2.9},
       'Oslo': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame1 = DataFrame(pop)

In [28]:
frame1

Unnamed: 0,Stavanger,Oslo
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [29]:
# Transposing a DF
frame1.T

Unnamed: 0,2001,2002,2000
Stavanger,2.4,2.9,
Oslo,1.7,3.6,1.5


In [30]:
# Creating a new DF from an existing one
frame2 = DataFrame({'Stavanger': frame1['Stavanger'][:-1],
                   'Oslo': frame1['Oslo'][:2]})
frame2

Unnamed: 0,Stavanger,Oslo
2001,2.4,1.7
2002,2.9,3.6


- All row labels and column names can also be replaced by assigning to the index and columns attributes
- Additionally, the titles of the row index and of the columns can be set directly through their name attributes
- Assigning a completely new index or set of columns creates a new Index object without a name, so the titles should always be set after such an assignment
- The *values* attribute returns all the data contained in a DataFrame as a 2-D ndarray
- If the columns of the DataFrame are heterogeneous, the dtype of the values array will be *object*

In [31]:
# Changing all column and row names
frame2.columns =  ['x', 'y']
frame2.index = [0, 1]

# Changing column and row titles / axis names
frame2.index.name = 'year'
frame2.columns.name = 'city'

frame2

city,x,y
year,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.4,1.7
1,2.9,3.6


In [32]:
# retrieve all values from the DF. A homogeneous DF
frame2.values

array([[2.4, 1.7],
       [2.9, 3.6]])

In [33]:
# the same attribute on heterogeneous data (note that obj is a Series, not a DF)
obj.values

array([4, 'A', '6.0', -1], dtype=object)

  - Input types for DataFrames

<img src="./images/df_inputs.png">

  - **Index Objects**
  - pandas Index objects are responsible for holding the axis labels and other metadata
    - like the axis name or names
  - Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index
  - Index objects are immutable, which makes it safer to share them among data structures
  - Index objects also behave like a fixed-size set, meaning we can perform certain set operations
  - However, unlike a set, a pandas Index can contain duplicate labels
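
The set-like behaviour, immutability and tolerance of duplicates can be checked with a short sketch. The names below are illustrative; the methods used (`intersection`, `union`, `difference`, `is_unique`) are standard Index methods.

```python
import pandas as pd

first = pd.Index(['a', 'b', 'c'])
second = pd.Index(['b', 'c', 'd'])

common = first.intersection(second)     # labels present in both
combined = first.union(second)          # labels present in either
only_first = first.difference(second)   # labels only in `first`

# Index objects are immutable: item assignment raises TypeError
try:
    first[0] = 'z'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Unlike a Python set, an Index may hold duplicates
dup = pd.Index(['a', 'a', 'b'])

print(list(common), list(combined), list(only_first))
print(assignment_failed, dup.is_unique)
```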

  - Index Objects Methods

<img src='./images/index_methods.png'>

In [174]:
obj = Series(range(5), index=['a', 'b', 'a' ,'c', 'e'])
index = obj.index

index

Index(['a', 'b', 'a', 'c', 'e'], dtype='object')

In [24]:
index[1:]

Index(['b', 'a', 'c', 'e'], dtype='object')

In [25]:
# index[1] = 'd' # TypeError

In [26]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [27]:
obj2 = Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [34]:
# Performing simple checks on index and columns
frame2

city,x,y
year,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.4,1.7
1,2.9,3.6


In [39]:
print(frame2.columns)
print('\'y\' in frame2.columns:', 'y' in frame2.columns )
print('0 in frame2.index:', 0 in frame2.index)

Index(['x', 'y'], dtype='object', name='city')
'y' in frame2.columns: True
0 in frame2.index: True


### Reading CSV files in pandas
- In order to load a text/csv file into pandas, we use the top-level function **pd.read_csv**
- pandas offers several other ways to read and write files; these topics will be covered in later lectures
- For now we just need *pd.read_csv*
- A pandas 2-D DataFrame can be easily converted to an ndarray using *np.array* (or the DataFrame's *to_numpy* instance method)
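
For experimenting without the course data file, the same two steps can be tried on an in-memory string. This is a sketch under assumptions: the three-row sample is invented for illustration, and `io.StringIO` stands in for the file path used in the lecture.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the lecture reads './data/Irisv1.csv' instead
csv_text = "5.1,3.5,setosa\n4.9,3.0,setosa\n6.3,3.3,virginica\n"

# header=None: the data has no header row, so columns are numbered 0..N-1
df_small = pd.read_csv(io.StringIO(csv_text), header=None)
arr_small = df_small.to_numpy()   # 2-D ndarray holding all the values

print(df_small.shape)
print(arr_small[0])
```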

In [36]:
df = pd.read_csv('./data/Irisv1.csv', header=None)
print('shape:', df.shape)
df.head()


shape: (150, 6)


Unnamed: 0,0,1,2,3,4,5
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [37]:
arr = np.array(df)
print('arr.shape:', arr.shape)
arr[:5]

arr.shape: (150, 6)


array([[1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
       [2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
       [3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
       [4, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
       [5, 5.0, 3.6, 1.4, 0.2, 'Iris-setosa']], dtype=object)

### Pandas Essential Functionality
- **Re-indexing**: data is conformed to a new index. Using **reindex** instance method the data can be filtered and rearranged into a new order
- reindex may introduce missing values if any of the index values were not already present
- reindex creates a copy of the data

In [38]:
# Take the first 5 rows of the column labelled 1 (the second column) as a Series. Obtains a view on df
sr1 = df[1][:5]
print('type(sr1):', type(sr1))
print('sr1.shape:', sr1.shape)
sr1


type(sr1): <class 'pandas.core.series.Series'>
sr1.shape: (5,)


0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: 1, dtype: float64

In [39]:
sr2 = sr1.reindex([4,2,'a',3,0])
# Changing the value at label 4 of sr2. Does it get reflected back in sr1?
sr2[4] = np.nan

In [40]:
sr2

4    NaN
2    4.7
a    NaN
3    4.6
0    5.1
Name: 1, dtype: float64

In [75]:
sr1

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: 1, dtype: float64

- *reindex* can also be used to interpolate or fill missing values
- The *method* argument of reindex specifies how the missing values are filled
- *ffill* copies the previous value into the missing value
- *bfill* copies the next value into the missing value
- *nearest* copies the nearest value into the missing value
- when *method* is used, the index must be monotonically increasing or decreasing
- index labels must be unique
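
The three fill methods can be compared side by side on a Series with gaps in its index. This is a minimal sketch with made-up data, chosen so that no new label is equally distant from two existing ones.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=[0, 3, 6])
new_index = range(7)

plain = s.reindex(new_index)                      # new labels become NaN
forward = s.reindex(new_index, method='ffill')    # carry the previous value forward
backward = s.reindex(new_index, method='bfill')   # pull the next value backward
nearest = s.reindex(new_index, method='nearest')  # take the closest existing label

print(pd.DataFrame({'plain': plain, 'ffill': forward,
                    'bfill': backward, 'nearest': nearest}))
```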

In [179]:
sr1.reindex(index=range(7), method='ffill')

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
5    5.0
6    5.0
Name: 1, dtype: float64

- reindex also works on DataFrames, for both rows and columns
- the *index* and *columns* arguments specify the new order of the rows and columns, respectively

In [42]:
df1 = df[0:5]

In [43]:
df1

Unnamed: 0,0,1,2,3,4,5
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [44]:

df1.reindex(index=range(7), columns=range(6, 0, -1))

Unnamed: 0,6,5,4,3,2,1
0,,Iris-setosa,0.2,1.4,3.5,5.1
1,,Iris-setosa,0.2,1.4,3.0,4.9
2,,Iris-setosa,0.2,1.3,3.2,4.7
3,,Iris-setosa,0.2,1.5,3.1,4.6
4,,Iris-setosa,0.2,1.4,3.6,5.0
5,,,,,,
6,,,,,,


In [49]:
df1.reindex(index=range(7), columns=range(6, 0, -1), method='nearest')

Unnamed: 0,6,5,4,3,2,1
0,Iris-setosa,Iris-setosa,0.2,1.4,3.5,5.1
1,Iris-setosa,Iris-setosa,0.2,1.4,3.0,4.9
2,Iris-setosa,Iris-setosa,0.2,1.3,3.2,4.7
3,Iris-setosa,Iris-setosa,0.2,1.5,3.1,4.6
4,Iris-setosa,Iris-setosa,0.2,1.4,3.6,5.0
5,Iris-setosa,Iris-setosa,0.2,1.4,3.6,5.0
6,Iris-setosa,Iris-setosa,0.2,1.4,3.6,5.0


- reindex function arguments

<img src='./images/reindex.png' width='500'>

- An alternative way to reindex a DataFrame is to use the *loc* attribute

In [52]:
df1_indices = list(df1.index)
np.random.shuffle(df1_indices)
df1.index = df1_indices
df1

Unnamed: 0,0,1,2,3,4,5
1,1,5.1,3.5,1.4,0.2,Iris-setosa
2,2,4.9,3.0,1.4,0.2,Iris-setosa
4,3,4.7,3.2,1.3,0.2,Iris-setosa
0,4,4.6,3.1,1.5,0.2,Iris-setosa
3,5,5.0,3.6,1.4,0.2,Iris-setosa


In [81]:
df1.loc[[2,1,3], [4,3]]

Unnamed: 0,4,3
2,0.2,1.4
1,0.2,1.3
3,0.2,1.4


In [83]:
df1.iloc[[2,1,3], [4,3]]

Unnamed: 0,4,3
1,0.2,1.3
2,0.2,1.4
4,0.2,1.5


- **Dropping** entries from an Axis
- The *drop* instance method of a Series or a DataFrame can be used to drop one or more rows/columns
- It returns a new object/copy after performing the required operations
- In order to perform drop operations *in-place*, we pass *inplace=True* to the drop call

In [84]:
sr2.drop([2,3])

4    NaN
a    NaN
0    5.1
Name: 1, dtype: float64

In [85]:
# Drops row at index 2
df1.drop(2)

Unnamed: 0,0,1,2,3,4,5
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,3,4.7,3.2,1.3,0.2,Iris-setosa
4,4,4.6,3.1,1.5,0.2,Iris-setosa
3,5,5.0,3.6,1.4,0.2,Iris-setosa


In [86]:
# Drops columns 0, 3
df1.drop([0,3], axis=1)

Unnamed: 0,1,2,4,5
0,5.1,3.5,0.2,Iris-setosa
2,4.9,3.0,0.2,Iris-setosa
1,4.7,3.2,0.2,Iris-setosa
4,4.6,3.1,0.2,Iris-setosa
3,5.0,3.6,0.2,Iris-setosa


- **.loc and .iloc**: DataFrame indexing by label and by position
- Allows selection of a subset of rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integer positions (iloc)
- Slicing can be used with both indexing methods
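
One difference between the two accessors is easy to miss: label slices include the end label, positional slices exclude the end position. The sketch below uses an illustrative frame (not the random one from the lecture) to show this, plus one possible way to combine a Boolean row mask with a column selection in a single *loc* call.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

by_label = frame.loc['a':'c', 'two':'three']   # both end labels included
by_position = frame.iloc[0:2, 1:3]             # end positions excluded

# Boolean mask on the rows and a list of column labels in one step
filtered = frame.loc[frame['three'] > 5, ['two', 'three']]

print(by_label)
print(by_position)
print(filtered)
```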

In [53]:
arr = np.random.randint(0, 20, 16).reshape(4,4)
data = pd.DataFrame(arr) # Create a DataFrame from ndarray
data.index = ['a', 'b', 'c', 'd'] # Label the rows
data.columns = ['one', 'two', 'three', 'four'] # Label the cols
data

Unnamed: 0,one,two,three,four
a,13,5,17,5
b,1,19,3,8
c,12,11,9,17
d,13,3,11,8


In [54]:
data.loc['a', ['one', 'three']]

one      13
three    17
Name: a, dtype: int64

In [55]:
data.iloc[0, [1, 3]]

two     5
four    5
Name: a, dtype: int64

In [56]:
# Using slicing and boolean filtering here. Can be used for both rows or columns
data.iloc[:, 1:3][data.three > 5]


Unnamed: 0,two,three
a,5,17
c,11,9
d,3,11


- Indexing options with DataFrame

<img src='./images/indexing_ops.png'>

- **Arithmetic and Data Alignment**
- When adding objects together, if the index labels of the operands are not the same, the index of the result is the union of both indexes (similar to an outer join)
- In the case of a DataFrame, alignment is performed on both rows and columns
- What happens if you add two DataFrames without any rows or columns in common?
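
The question above can be explored with two tiny frames that share neither row labels nor column labels. This is a sketch with made-up data: the result has the union of both sets of labels, and every cell is NaN because no cell has a partner.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'B': [3, 4]}, index=['u', 'v'])

total = left + right   # union of rows and columns, nothing overlaps
print(total)
```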

In [58]:
sr1 = pd.Series({'a': 7.3, 'c': -2.3, 'd': 3.4, 'e': 1.5})
sr2 = pd.Series({'a': -2.1, 'c': 3.6, 'e': -1.5, 'f': 4, 'g': 3.1})
sr1 + sr2

a    5.2
c    1.3
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [59]:
df1 = pd.DataFrame(np.arange(9.).reshape(3,3), columns=list('bcd'), index=['x', 'y', 'z'])
df2 = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('bde'), index=['u', 'x', 'y', 'o'])


In [60]:
df1

Unnamed: 0,b,c,d
x,0.0,1.0,2.0
y,3.0,4.0,5.0
z,6.0,7.0,8.0


In [101]:
df2

Unnamed: 0,b,d,e
u,0,1,2
x,3,4,5
y,6,7,8
o,9,10,11


In [61]:
df1 + df2


Unnamed: 0,b,c,d,e
o,,,,
u,,,,
x,3.0,,6.0,
y,9.0,,12.0,
z,,,,


- *Arithmetic methods with Fill Value*
- Missing values that arise because an axis label is available in one DataFrame but not in the other can be avoided by passing a fill value, such as 0, that stands in for the absent operand

In [103]:
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
o,9.0,,10.0,11.0
u,0.0,,1.0,2.0
x,3.0,1.0,6.0,5.0
y,9.0,4.0,12.0,8.0
z,6.0,7.0,8.0,


- Why are [o, c], [u, c] and [z, e] still NaN?
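
One way to read this: `fill_value` only replaces a value that is missing in one operand while present in the other. When a cell is absent from both frames, there is nothing to add and the result stays NaN. The sketch below reproduces the effect on smaller, made-up frames.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(4.).reshape(2, 2), index=['x', 'y'], columns=['b', 'c'])
right = pd.DataFrame(np.arange(4.).reshape(2, 2), index=['x', 'z'], columns=['b', 'd'])

result = left.add(right, fill_value=0)

# ('y', 'd'): left has no column d, right has no row y -> missing in both -> NaN
# ('z', 'c'): left has no row z, right has no column c -> missing in both -> NaN
print(result)
```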

- Flexible arithmetic methods

<img src='./images/arithmetics.png' width='450'>

- Each method with the prefix *r* is a counterpart of the method without it, with the arguments flipped

```python
1 / df1
# Equivalent to
df1.rdiv(1)
```

- *Operations between DataFrame and Series*
- Just as NumPy allows operations between arrays of different dimensions, pandas allows operations between a DataFrame and a Series
- The concept is similar to the idea of broadcasting with ndarrays
- By default, arithmetic between a DataFrame and Series matches the index of the Series on the DataFrame's columns, broadcasting on rows

In [62]:
df2

Unnamed: 0,b,d,e
u,0,1,2
x,3,4,5
y,6,7,8
o,9,10,11


In [63]:
sr1 = df2.iloc[0]
sr1

b    0
d    1
e    2
Name: u, dtype: int64

In [64]:
df2 - sr1

Unnamed: 0,b,d,e
u,0,0,0
x,3,3,3
y,6,6,6
o,9,9,9


- In order to broadcast over columns, matching on the rows, we have to use one of the arithmetic methods
- the *axis* argument should be set to either 0 or 'index'

In [65]:
df2

Unnamed: 0,b,d,e
u,0,1,2
x,3,4,5
y,6,7,8
o,9,10,11


In [66]:
sr1 = df2['b']
sr1

u    0
x    3
y    6
o    9
Name: b, dtype: int64

In [67]:
df2.subtract(sr1, axis='index')

Unnamed: 0,b,d,e
u,0,1,2
x,0,1,2
y,0,1,2
o,0,1,2


- **Function Application and Mapping**
- NumPy *ufuncs* (element-wise array methods) also work with pandas objects
- Functions that operate on 1-D arrays, such as lambda functions, can be applied to each column or row using the *apply* instance method
- The *axis* argument of apply selects whether the function is applied to each column (axis=0, the default) or to each row (axis=1)

In [68]:
np.sum(df2, axis=1)

u     3
x    12
y    21
o    30
dtype: int64

In [69]:
data

Unnamed: 0,one,two,three,four
a,13,5,17,5
b,1,19,3,8
c,12,11,9,17
d,13,3,11,8


In [111]:
f = lambda x: x.max() - x.min()
data.apply(f, axis=0)

one      12
two      16
three    14
four     12
dtype: int64

- The function passed to *apply* need not return only a scalar value, it can also return a Series with multiple values

In [70]:
def f(x):
  return pd.Series([x.min(), x.max()], index=['min', 'max'])
data.apply(f)

Unnamed: 0,one,two,three,four
min,1,3,3,5
max,13,19,17,17


- *applymap* and *map* instance methods
- perform an element-wise computation with a given function
- applymap is specific to DataFrames and map is used for Series (pandas 2.1 and later deprecate applymap in favour of the DataFrame *map* method)
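
Because the name of the element-wise DataFrame method differs between pandas versions, a version-neutral sketch is to apply the Series *map* method column by column. The data and the helper `fmt` below are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({'x': [0.658599, 2.178664], 'z': [0.687217, -0.347739]},
                     index=['b', 'a'])

def fmt(value):
    """Format one number with two decimals."""
    return '%.2f' % value

formatted_column = frame['x'].map(fmt)                    # Series: element-wise
formatted_frame = frame.apply(lambda col: col.map(fmt))   # DataFrame: column by column

print(formatted_column)
print(formatted_frame)
```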

In [71]:
df1 = pd.DataFrame(np.random.randn(9).reshape(3,3), index=['b', 'a', 'c'], columns=list('xzy'))
df1

Unnamed: 0,x,z,y
b,0.658599,0.687217,1.526099
a,2.178664,-0.347739,0.262946
c,0.207371,-0.023608,-0.691519


In [72]:
# Suppose we want to compute a formatted string from each floating point value in df1
f = lambda x:  '%.2f' % x
df1.applymap(f)

Unnamed: 0,x,z,y
b,0.66,0.69,1.53
a,2.18,-0.35,0.26
c,0.21,-0.02,-0.69


In [73]:
sr1 = pd.Series(np.random.randn(9))
sr1.map(f)

0    -1.24
1     1.13
2    -0.27
3    -0.72
4     0.73
5     0.15
6    -0.62
7     1.04
8     2.30
dtype: object

- **Sorting and Ranking**
- The *sort_index* instance method sorts a pandas data object by its row or column labels
- It returns a new object
- We can provide the *inplace=True* argument to modify the object itself instead of returning a new one
- The *axis* argument defines the axis whose labels are sorted
- The sorting is ascending by default. It can be changed to descending by passing the argument 'ascending=False'

In [74]:
df1

Unnamed: 0,x,z,y
b,0.658599,0.687217,1.526099
a,2.178664,-0.347739,0.262946
c,0.207371,-0.023608,-0.691519


In [75]:
df1.sort_index()

Unnamed: 0,x,z,y
a,2.178664,-0.347739,0.262946
b,0.658599,0.687217,1.526099
c,0.207371,-0.023608,-0.691519


In [76]:
df1.sort_index(axis=1, ascending=False, inplace=True)
df1

Unnamed: 0,z,y,x
b,0.687217,1.526099,0.658599
a,-0.347739,0.262946,2.178664
c,-0.023608,-0.691519,0.207371


- The *sort_values* instance method can be used to sort by values
- A Series can be sorted directly using sort_values; for a DataFrame, however, 
  - we have to specify the *by* argument with one label or a list of column or row labels
  - we can specify the *axis* argument (default axis=0, which sorts the rows by column values)
- like sort_index, sort_values accepts the *inplace=True* and *ascending=False* arguments

In [77]:
sr1.sort_values(inplace=True)
sr1

0   -1.240502
3   -0.718641
6   -0.620547
2   -0.270810
5    0.146883
4    0.731677
7    1.043120
1    1.126914
8    2.301014
dtype: float64

In [83]:
df1 = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
df1

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [86]:
df1.sort_values(by=1, axis=1)

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


In [200]:
df1.sort_values(by=['a', 'b'], axis=0)

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


- *Ranking* computes numerical data ranks (1 through n) along an axis
- Equal values are assigned a rank that is the average of the ranks of those values
  - the *method='first'* argument can be used to assign ranks according to the order in which the values are observed 
- By default, ranks run from the lowest value (rank 1) to the highest (rank N); this can be reversed by passing the argument *ascending=False*
- The *rank* instance method for Series and DataFrame performs ranking
- Tie breaking methods with rank

<img src='images/ranking.png' width='550'>
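
A few of the tie-breaking options can be compared on the Series used in this section. This is a short sketch; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = s.rank()                  # default: ascending, ties share the average rank
first = s.rank(method='first')      # ties ranked in order of appearance
lowest = s.rank(method='min')       # every member of a tie gets the group's lowest rank
descending = s.rank(ascending=False, method='min')  # largest value gets rank 1

print(pd.DataFrame({'value': s, 'average': average, 'first': first,
                    'min': lowest, 'descending': descending}))
```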

In [120]:
sr1 = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [121]:
sr1.rank(method='first', ascending=False)

0    1.0
1    7.0
2    2.0
3    3.0
4    5.0
5    6.0
6    4.0
dtype: float64

- DataFrame can compute ranks over the rows or columns

In [88]:
df1 = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, 2.5]})
df1

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,2.5


In [90]:
df1.rank(axis=1, ascending=False)

Unnamed: 0,b,a,c
0,1.0,2.0,3.0
1,1.0,3.0,2.0
2,3.0,2.0,1.0
3,2.0,3.0,1.0


- *Axis Indices with Duplicate Labels*
- When a Series or a DataFrame has duplicate labels, accessing such a label retrieves the values for all of its occurrences
- Care needs to be taken to verify and address duplicate labels
  - sort_index can be used
  - summary statistics can be performed on labels  
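
One way to verify and address duplicate labels is through the `is_unique` property and the `duplicated` method of the index. The sketch below is an illustration with made-up names, not the approach prescribed by the lecture; it keeps the first value for each label.

```python
import pandas as pd

s = pd.Series(range(5), index=list('aabbc'))

has_duplicates = not s.index.is_unique
dup_mask = s.index.duplicated()   # True for the second and later occurrences
first_only = s[~dup_mask]         # keep the first value per label

many = s.loc['a']                 # duplicated label -> Series
single = s.loc['c']               # unique label -> scalar

print(has_duplicates)
print(first_only)
```

The return type of a label lookup therefore depends on whether the label is duplicated, which is one reason to check for duplicates up front.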

In [91]:
sr1 = pd.Series(range(5), index=list('aabbc'))
sr1

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [92]:
sr1['a']

a    0
a    1
dtype: int64

In [93]:
df2 = pd.DataFrame(np.random.randint(0, 10, 9).reshape(3,3))
df2.index = list('abb')
df2.columns = list('yyx')
df2

Unnamed: 0,y,y.1,x
a,8,3,3
b,9,5,6
b,7,7,3


In [94]:
df2['y']

Unnamed: 0,y,y.1
a,8,3
b,9,5
b,7,7


In [95]:
df2.loc['b']

Unnamed: 0,y,y.1,x
b,9,5,6
b,7,7,3


- **Summarizing and Computing Descriptive Statistics**
- pandas objects come with a set of common mathematical and statistical methods
- most fall into the category of reductions or summary statistics
- Unlike NumPy methods, pandas statistical methods have built-in handling of missing data
- Calling the *sum* instance method on a DataFrame returns a Series containing the column sums
- Passing *axis=1* or *axis='columns'* sums across the columns instead
- NA values are excluded unless the entire slice (row or column) is NA
- This can be disabled with the argument *skipna=False*

<img src='images/reductions.png' width='450'>

In [96]:
df1 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=list('abcd'), columns=['one', 'two'])
df1

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [97]:
df1.sum()

one    9.25
two   -5.80
dtype: float64

In [98]:
df1.sum(axis=1,  skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

- Methods like *idxmin* and *idxmax* return indirect statistics, such as the index label where the minimum or maximum value is attained
- Methods like *cumsum* perform accumulations
- Some methods perform neither reductions nor accumulations, such as *describe*, which provides multiple summary statistics
- describe provides a different summary depending on whether the data are numeric or non-numeric
- Most of these methods can also be applied along a chosen axis by providing the *axis* argument
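
The difference between the numeric and the non-numeric summary of *describe* can be seen on two small Series. This sketch uses made-up data.

```python
import pandas as pd

numeric = pd.Series([1.0, 2.0, 3.0, 4.0])
labels = pd.Series(['a', 'a', 'b', 'c'] * 4)

numeric_summary = numeric.describe()   # count, mean, std, min, quartiles, max
label_summary = labels.describe()      # count, unique, top, freq

print(numeric_summary)
print(label_summary)
```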

In [99]:
df1.idxmax()

one    b
two    d
dtype: object

In [100]:
df1.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [101]:
df1.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


- Descriptive and Summary Statistics

<img src='images/summary.png' width='450'>

- **Correlation and Covariance**
- The *corr* instance method of a Series computes the correlation of the overlapping, non-NA, index-aligned values in two Series
- Similarly, *cov* computes their covariance
- A DataFrame's *corr* and *cov* methods return a full correlation or covariance matrix as a DataFrame
- The *corr* and *cov* methods of a DataFrame work on the columns of the DataFrame itself 
- In order to compute pairwise correlations with another DataFrame or Series, we use the *corrwith* instance method
- Passing a Series returns a Series with the correlation values computed for each column or row, based on the argument *axis*
In [102]:
df.columns = ['index', 'slength', 'swidth', 'plength', 'pwidth', 'class']
df.head()

Unnamed: 0,index,slength,swidth,plength,pwidth,class
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [103]:
slength = df['slength']
swidth = df['swidth']
print('types: ', type(slength), type(swidth))
print('corr:', slength.corr(swidth), 'cov:', slength.cov(swidth))

types:  <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
corr: -0.10936924995064937 cov: -0.03926845637583894


In [104]:
df.corr()

Unnamed: 0,index,slength,swidth,plength,pwidth
index,1.0,0.716676,-0.397729,0.882747,0.899759
slength,0.716676,1.0,-0.109369,0.871754,0.817954
swidth,-0.397729,-0.109369,1.0,-0.420516,-0.356544
plength,0.882747,0.871754,-0.420516,1.0,0.962757
pwidth,0.899759,0.817954,-0.356544,0.962757,1.0


- **Unique Values, Value Counts, and Membership**
- pandas provides a class of related methods to extract information about the values contained in a Series
- The *unique()* instance method gives an array of the unique values in a Series
- To retrieve the unique values in sorted order, we can call the *sort()* method on the array returned by unique()
- *value_counts()* computes a Series containing value frequencies. 
- *isin* performs a vectorized set membership check 

In [105]:
sr1 = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
sr1.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [106]:
sr1.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [107]:
print(sr1.isin(['b', 'c']))

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
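
- As a small sketch of how these methods are typically combined, the boolean result of *isin* can be used to filter a Series. Here the built-in `sorted` is used instead of the in-place *sort()*, an alternative that returns a new list; the variable names are chosen only for illustration

```python
import pandas as pd

sr = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# unique() keeps the order of first appearance; sorted() returns a new sorted list
sorted_uniques = sorted(sr.unique())
print(sorted_uniques)

# The boolean mask from isin selects only the matching rows
mask = sr.isin(['b', 'c'])
subset = sr[mask]
print(subset)

# value_counts() orders by frequency; sort_index() reorders by label instead
counts_by_label = sr.value_counts().sort_index()
print(counts_by_label)
```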


- These methods can be applied to a DataFrame through its *apply* instance method, which calls the given function on each column or row, as specified by the *axis* argument

In [110]:
df1 = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                    'Qu4': [2, 3, 1, 2, 3],
                    'Qu3': [1, 5, 2, 4, 4]})
df1

Unnamed: 0,Qu1,Qu4,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [111]:
# pd.value_counts is deprecated in recent pandas; the Series method is equivalent here
df1.apply(pd.Series.value_counts)

Unnamed: 0,Qu1,Qu4,Qu3
1,1.0,1.0,1.0
2,,2.0,1.0
3,2.0,2.0,
4,2.0,,2.0
5,,,1.0
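
- The NaN entries in such a table mark values that never occur in a column. A possible follow-up, sketched here as one option rather than a required step, is to replace them with 0 and restore an integer dtype

```python
import pandas as pd

df1 = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                    'Qu4': [2, 3, 1, 2, 3],
                    'Qu3': [1, 5, 2, 4, 4]})

# Count values per column, then fill the missing combinations with 0
counts = df1.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts)
```

- The row labels of `counts` are the distinct values found across all columns, and each column sums to the number of rows in `df1`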


- **Categorical Data**
- pandas has a special *Categorical* type for holding data that uses an integer-based categorical representation or encoding


In [112]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits, 
                   'basket_id': np.arange(N), 
                   'count': np.random.randint(3, 5, size=N), 
                   'weight': np.random.uniform(0, 4, size=N)}, 
                   columns = ['basket_id', 'fruit', 'count', 'weight'])
df

Unnamed: 0,basket_id,fruit,count,weight
0,0,apple,4,2.935551
1,1,orange,4,1.720475
2,2,apple,3,1.87479
3,3,apple,4,2.752824
4,4,apple,4,0.605051
5,5,orange,3,1.471078
6,6,apple,3,3.407676
7,7,apple,4,3.951286


- Here df['fruit'] is a Series of Python string objects that can be converted to the categorical type using the *astype* instance method

In [113]:
df['fruit'].astype('category')

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [114]:
df['fruit'] = df['fruit'].astype('category')

- The underlying Categorical object (available through *values*) has *categories* and *codes* attributes

In [115]:
df['fruit'].values.categories

Index(['apple', 'orange'], dtype='object')

In [116]:
df['fruit'].values.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
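
- To make the relationship between the two attributes concrete, the sketch below decodes the integer codes back into labels by using them as positions into the categories. It goes through the `.cat` accessor of the Series, an alternative to `.values` used above; the small `fruit` Series is illustrative only

```python
import pandas as pd

fruit = pd.Series(['apple', 'orange', 'apple', 'apple']).astype('category')

codes = fruit.cat.codes            # one small integer per element
categories = fruit.cat.categories  # Index holding the distinct labels

# Each code is a position into `categories`
decoded = categories.take(codes.to_numpy())
print(list(decoded))
```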

- Alternatively, Categorical types can be created directly from other Python sequences
- The *pd.Categorical.from_codes(codes, categories)* constructor can be used to build a Categorical from existing categories and codes
- By default, a categorical conversion assumes no specific ordering of the categories. To indicate that the categories have a meaningful order, we can pass the argument *ordered=True*
- An unordered categorical instance can be made ordered with the instance method *as_ordered*

In [117]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']
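
- The cell above only covers construction from a sequence. The following sketch illustrates *from_codes*, *ordered=True* and *as_ordered*; the category labels and codes are made-up values

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']   # hypothetical labels
codes = [0, 1, 2, 0, 0, 1]           # positions into `categories`

# Unordered by default: the categories carry no ranking
unordered = pd.Categorical.from_codes(codes, categories)
print(unordered)

# ordered=True declares that 'foo' < 'bar' < 'baz' (the order given in `categories`)
ordered = pd.Categorical.from_codes(codes, categories, ordered=True)
print(ordered)

# as_ordered() returns an ordered copy of an unordered categorical
also_ordered = unordered.as_ordered()
print(also_ordered.ordered)
```

- Note that the order comes from the position in `categories`, not from alphabetical sorting of the labels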

- **Computation with Categoricals**

In [118]:
np.random.seed(12345)
draws = np.random.randn(1000)
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [119]:
"""
Quantile-based discretization function. Discretize variable into
equal-sized buckets based on rank or based on sample quantiles. For example
1000 values for 10 quantiles would produce a Categorical object indicating
quantile membership for each data point.
"""
pd.qcut(draws, 4)

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [120]:
# Giving proper labels to the quantiles
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [121]:
# The labeled bins categorical does not contain information about the bin edges, so we use groupby to extract some summary statistics
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws)
          .groupby(bins)
          .agg(['count', 'min', 'max'])
          .reset_index())
results

Unnamed: 0,quartile,count,min,max
0,Q1,250,-2.949343,-0.685484
1,Q2,250,-0.683066,-0.010115
2,Q3,250,-0.010032,0.628894
3,Q4,250,0.634238,3.927528


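- Better performance with categoricals
- Converting a column to categorical can provide substantial performance gains when it has few distinct values relative to its length
- The categorical representation will often use less memory
- The drawback is a one-time conversion cost
- GroupBy operations can be significantly faster with categoricals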
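
- The memory claim can be checked with *memory_usage*. The sketch below uses a made-up Series of repeated labels; the exact byte counts depend on the pandas version and string storage, so none are quoted here

```python
import numpy as np
import pandas as pd

N = 100_000
# Hypothetical data: four distinct labels repeated many times
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')   # the one-time conversion cost is paid here

# deep=True also counts the memory held by the string objects themselves
obj_bytes = labels.memory_usage(deep=True)
cat_bytes = categories.memory_usage(deep=True)
print(obj_bytes, cat_bytes)

# With only four categories the codes fit into 8-bit integers
print(categories.cat.codes.dtype)
```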

- **Categorical Methods**

<img src='images/categoricals.png'>

- *Creating dummy variables for modeling*
- When using statistics or ML tools, categorical data is often transformed into *dummy variables*, also known as *one-hot* encoding
- This involves creating a DataFrame with a column for each distinct category
- These columns contain 1s for occurrences of a given category and 0s otherwise
- The top-level function *pd.get_dummies* performs this transformation

In [122]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [123]:
pd.get_dummies(cat_s)

Unnamed: 0,a,b,c,d
0,1,0,0,0
1,0,1,0,0
2,0,0,1,0
3,0,0,0,1
4,1,0,0,0
5,0,1,0,0
6,0,0,1,0
7,0,0,0,1
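
- In practice the dummy columns are usually joined back to the remaining data. A minimal sketch, assuming a hypothetical DataFrame with a `key` column; `prefix` names the new columns and `dtype=int` requests 0/1 integers, since recent pandas versions return booleans by default

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# One indicator column per distinct value of 'key', named key_a, key_b, key_c
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Keep the other columns and attach the indicators by index
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
```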
