<center> <font face= "Neuville" style= "font-size:150px" color= "grey"> Pandas </font> </center>

In [1]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

### → A quick refresher on Series

In [5]:
S = Series(np.arange(22, 26), index= ['a', 'b', 4, 2])
S

a    22
b    23
4    24
2    25
dtype: int32

In [7]:
S[0:3]

a    22
b    23
4    24
dtype: int32
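
The slice above is positional. As a small sketch of the difference (reusing the same `S`; the variable names `by_position` and `by_label` are only for illustration), `.iloc` makes the positional intent explicit, while `.loc` slices by label and includes the end label:

```python
import numpy as np
import pandas as pd

S = pd.Series(np.arange(22, 26), index=['a', 'b', 4, 2])

# Position-based: the end position (3) is excluded, as with a Python list.
by_position = S.iloc[0:3]

# Label-based: both end labels ('a' and 4) are included.
by_label = S.loc['a':4]

print(by_position.tolist())
print(by_label.tolist())
```

Both slices happen to select the same three rows here, but only because the label `4` sits at position 2.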

In [11]:
S[['a', 2]]

a    22
2    25
dtype: int32

In [14]:
# ↓ Gotcha: `in` checks the KEYS (index labels), not the VALUES (like a dict)
4 in S

True
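
To test membership among the values instead, one option is to look at `S.values` or use `isin`. A minimal sketch on the same `S` (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

S = pd.Series(np.arange(22, 26), index=['a', 'b', 4, 2])

label_present = 4 in S                    # checks the index labels
value_as_label = 24 in S                  # 24 is a value, not a label
value_present = 24 in S.values            # checks the underlying values
value_present_isin = S.isin([24]).any()   # same check, staying in pandas

print(label_present, value_as_label, value_present, value_present_isin)
```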

# 

###  → Assigning Series vs List/Array to existing DF

In [19]:
df = DataFrame(np.random.randint(0, 100, (10, 3)), columns= ['a', 'b', 'c'])
df

Unnamed: 0,a,b,c
0,55,3,75
1,88,6,69
2,76,57,52
3,25,2,21
4,55,48,53
5,80,38,8
6,1,36,98
7,92,97,75
8,80,88,50
9,58,6,76


In [21]:
df["d"] = "NA"
df

Unnamed: 0,a,b,c,d
0,55,3,75,NA
1,88,6,69,NA
2,76,57,52,NA
3,25,2,21,NA
4,55,48,53,NA
5,80,38,8,NA
6,1,36,98,NA
7,92,97,75,NA
8,80,88,50,NA
9,58,6,76,NA


#### ↓ This will fail (because the lengths do not match)

In [26]:
df["d"] = np.array([1,2,3])

ValueError: Length of values (3) does not match length of index (10)

#### ↓ This won't, because it is a Series!

In [28]:
df["d"] = Series([1,2,3])
df

Unnamed: 0,a,b,c,d
0,55,3,75,1.0
1,88,6,69,2.0
2,76,57,52,3.0
3,25,2,21,NaN
4,55,48,53,NaN
5,80,38,8,NaN
6,1,36,98,NaN
7,92,97,75,NaN
8,80,88,50,NaN
9,58,6,76,NaN


But the values are assigned only to the rows whose index labels `match`; every other row gets NaN...

In [29]:
df["d"] = Series([1,2,3], index= [4,5,6])
df

Unnamed: 0,a,b,c,d
0,55,3,75,NaN
1,88,6,69,NaN
2,76,57,52,NaN
3,25,2,21,NaN
4,55,48,53,1.0
5,80,38,8,2.0
6,1,36,98,3.0
7,92,97,75,NaN
8,80,88,50,NaN
9,58,6,76,NaN


In [30]:
df["d"] = Series([1,2,3], index= ['a', 'b', 'c'])
df

Unnamed: 0,a,b,c,d
0,55,3,75,NaN
1,88,6,69,NaN
2,76,57,52,NaN
3,25,2,21,NaN
4,55,48,53,NaN
5,80,38,8,NaN
6,1,36,98,NaN
7,92,97,75,NaN
8,80,88,50,NaN
9,58,6,76,NaN


What if the DF had an `EXPLICIT INDEX`?

In [39]:
df = DataFrame(np.random.randint(0, 100, (5, 3)), columns= ['a', 'b', 'c'], index= [1,3,2,5,4])
df["d"] = "NA"
df

Unnamed: 0,a,b,c,d
1,22,65,81,NA
3,75,17,8,NA
2,59,9,15,NA
5,5,61,25,NA
4,15,52,23,NA


In [41]:
df["d"] = Series([1,2,3])
df

Unnamed: 0,a,b,c,d
1,22,65,81,2.0
3,75,17,8,NaN
2,59,9,15,3.0
5,5,61,25,NaN
4,15,52,23,NaN


It turns out that the assignment matches on the EXPLICIT index, i.e. the labels shown on the left of the table, and not on the position of the rows.
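
If the intent is to fill the column in row order regardless of labels, one way is to strip the index from the Series first. A minimal sketch, assuming a Series of the same length as the DataFrame (the column names `by_label` and `by_position` are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(15).reshape(5, 3),
    columns=['a', 'b', 'c'],
    index=[1, 3, 2, 5, 4],
)
values = pd.Series([10, 20, 30, 40, 50])  # default index 0..4

# Label alignment: labels 1, 2, 3, 4 exist on both sides; label 5 gets NaN.
df['by_label'] = values

# Positional assignment: a plain array has no index, so values go in row order.
df['by_position'] = values.to_numpy()

print(df)
```

The array route brings back the length check: a plain array with the wrong number of elements raises the same `ValueError` as before.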

# 

###  → `Reindex` made simple
    It is just a way to reorder the index (of rows or columns) the way we want.
    Existing labels are put in the requested order; any new label gets a row or column of NaN.

In [47]:
df = DataFrame(np.random.randint(0, 10, (5, 3)), columns= list('abc'), index= list('bacde'))
df

Unnamed: 0,a,b,c
b,7,7,3
a,4,4,9
c,4,8,7
d,4,2,2
e,5,3,6


In [51]:
df.reindex(index= list('abfcde'), columns= list('bca'))

Unnamed: 0,b,c,a
a,4.0,9.0,4.0
b,7.0,3.0,7.0
f,NaN,NaN,NaN
c,8.0,7.0,4.0
d,2.0,2.0,4.0
e,3.0,6.0,5.0


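Many people prefer the unofficial way of doing it, using `.loc[]`. This only reorders labels that already exist; in recent pandas versions a missing label such as 'f' raises a KeyError instead of producing NaN.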

```python
df.loc[list('abcde'), list('bca')]
```
The same result, as long as every label exists!
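
A small sketch comparing the two routes. It uses fixed numbers instead of random ones so the result is reproducible, and it assumes pandas 1.0 or later, where `.loc` raises a `KeyError` for a list containing a missing label. `fill_value` is a real `reindex` parameter; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(15).reshape(5, 3),
    columns=list('abc'),
    index=list('bacde'),
)

# Same labels, new order: reindex and .loc agree.
reordered = df.reindex(index=list('abcde'), columns=list('bca'))
same_as_loc = reordered.equals(df.loc[list('abcde'), list('bca')])

# A new label 'f': reindex adds the row, filled here with 0 instead of NaN.
with_new_row = df.reindex(index=list('abfcde'), fill_value=0)

# .loc rejects a list that contains a missing label (assumption: pandas >= 1.0).
try:
    df.loc[list('abfcde')]
    loc_raised = False
except KeyError:
    loc_raised = True

print(same_as_loc, loc_raised)
print(with_new_row)
```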

In [2]:
# Trying `reindex` with repeated (duplicate) index labels
array = np.random.randint(0, 100, (5,2))
df = DataFrame(array, index= list('aabbc'))
df

Unnamed: 0,0,1
a,87,67
a,83,72
b,66,35
b,39,89
c,89,63


In [9]:
df.reindex(list('ababc'))

ValueError: cannot reindex from a duplicate axis

In [11]:
# See? ↑ fails because reindexing needs unique labels: pandas can't decide which 'a' row to take.
# But ↓ works when the requested labels are identical to the existing index
df.reindex(list('aabbc'))

Unnamed: 0,0,1
a,87,67
a,83,72
b,66,35
b,39,89
c,89,63


In [13]:
# And... `.loc` simply returns every matching row for each requested label!
df.loc[list('aabbc')]

Unnamed: 0,0,1
a,87,67
a,83,72
a,87,67
a,83,72
b,66,35
b,39,89
b,66,35
b,39,89
c,89,63
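
One possible way around the duplicate-label problem is to check `index.is_unique` and drop the repeats before reindexing. This is only a sketch: keeping the first row per label is an arbitrary choice made for illustration, and the numbers are fixed rather than random.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=list('aabbc'))

has_duplicates = not df.index.is_unique

# Keep only the first row for each label; after that, reindex works freely.
deduplicated = df[~df.index.duplicated(keep='first')]
reordered = deduplicated.reindex(list('cba'))

print(has_duplicates)
print(reordered)
```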


# 

### → `Apply` semantics

In [79]:
df = DataFrame(np.random.randint(0, 1000, (4,5)), columns= list('ABCDE'), index= range(1, 5))
df

Unnamed: 0,A,B,C,D,E
1,947,664,241,515,855
2,574,645,935,309,367
3,193,939,104,721,661
4,708,185,460,451,55


It is usual to use `apply` **either per row or per column** by setting the *axis* argument.  
But it also helps to understand how *apply* works when we want to RETURN more than one value.

In [82]:
df.apply(lambda x: Series([x.min(), x.max()], index= ["min", "max"]))

Unnamed: 0,A,B,C,D,E
min,193,185,104,309,55
max,947,939,935,721,855


In [83]:
df.apply(lambda x: Series([x.min(), x.max()], index= ["min", "max"]), axis= 1)

Unnamed: 0,min,max
1,241,947
2,309,935
3,104,939
4,55,708


> Remember, returning *`Multiple`* values has nothing to do with lists, tuples, etc. By default, to get several labelled values back from `apply`, you need to return them as a SERIES.

In [85]:
df.apply(lambda x: [x.min(), x.max()], axis= 1)

1    [241, 947]
2    [309, 935]
3    [104, 939]
4     [55, 708]
dtype: object

**See ↑**
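
For completeness, `apply` also accepts `result_type='expand'` with `axis=1`, which spreads a returned list across columns. A small sketch using the first three columns of two rows from the frame above; the resulting columns are numbered 0 and 1, so they are renamed by hand here:

```python
import pandas as pd

df = pd.DataFrame(
    [[947, 664, 241], [574, 645, 935]],
    columns=list('ABC'),
    index=[1, 2],
)

# Default: each list is stored whole, giving a single Series of lists.
as_lists = df.apply(lambda row: [row.min(), row.max()], axis=1)

# result_type='expand' spreads each list across columns 0 and 1.
expanded = df.apply(
    lambda row: [row.min(), row.max()], axis=1, result_type='expand'
)
expanded.columns = ['min', 'max']

print(as_lists)
print(expanded)
```

Returning a Series still has the advantage that the labels travel with the values.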

Now, I am wondering: what would happen if there were a MultiIndex?

In [89]:
index = pd.MultiIndex.from_product([["Male", "Female"], ["Adult", "Non-Adult"]])
df = DataFrame(np.random.randint(0, 100, (4, 2)), columns= ["Height", "Weight"], index= index)
df

Unnamed: 0,Unnamed: 1,Height,Weight
Male,Adult,85,79
Male,Non-Adult,72,72
Female,Adult,20,95
Female,Non-Adult,96,42


In [95]:
df.apply(lambda x: Series([x.min(), x.max()], index= ["min", "max"]))

Unnamed: 0,Height,Weight
min,20,42
max,96,95


Oh man! A MultiIndex works just the same! The display only hides the repetitive index labels!
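
The row-wise direction behaves the same way. A short sketch with the values from the table above (`per_row` and `keeps_multiindex` are illustrative names): each row arrives as a Series indexed by column name, and the result keeps the two-level row index.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['Male', 'Female'], ['Adult', 'Non-Adult']])
df = pd.DataFrame(
    [[85, 79], [72, 72], [20, 95], [96, 42]],
    columns=['Height', 'Weight'],
    index=index,
)

# Row-wise apply: the function sees one row at a time.
per_row = df.apply(
    lambda row: pd.Series([row.min(), row.max()], index=['min', 'max']),
    axis=1,
)

# The result still carries the original MultiIndex on its rows.
keeps_multiindex = per_row.index.equals(df.index)

print(per_row)
print(keeps_multiindex)
```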

# 

### → A minor difference that makes a difference
`apply` vs `applymap`

In [97]:
df = DataFrame(np.random.randn(4,3))
df

Unnamed: 0,0,1,2
0,-1.275362,1.328398,-0.266117
1,0.536035,-0.568472,0.166333
2,-0.333306,-0.018705,1.715436
3,0.449415,-0.367741,-0.605842


In [102]:
# Trying `apply` to format each element to 2 decimal places (fails)
df.apply(lambda x: '%.2f' % x)

TypeError: cannot convert the series to <class 'float'>

In [103]:
# Using `applymap` to format each number to 2 decimal places (success)
df.applymap(lambda x: '%.2f' % x)

Unnamed: 0,0,1,2
0,-1.28,1.33,-0.27
1,0.54,-0.57,0.17
2,-0.33,-0.02,1.72
3,0.45,-0.37,-0.61


**Why?**: The reason is simple: `apply` passes each column (or each row, with `axis=1`) to the function as a SERIES. So any function you pass **must work** on a SERIES.

It failed with `apply` because `'%.2f' % x` tried to *convert* a whole Series to a single float, which can't be done.

**So what?**: So we used `applymap`. It calls the function on each element individually (which is what you need most of the time, right?), so scalar operations work. Note that pandas 2.1 and later deprecate `applymap` in favor of `DataFrame.map`, which does the same job.
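
A hedged sketch of the alternatives, using a few of the numbers from the table above. The `hasattr` check is just one way to cope with both old and new pandas versions, and the variable names are made up:

```python
import pandas as pd

df = pd.DataFrame([[-1.275362, 1.328398], [0.536035, -0.568472]])

# Use DataFrame.map where it exists (pandas >= 2.1), else fall back to applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap
formatted = elementwise(lambda x: '%.2f' % x)

# apply works too, once the function treats its input as a Series.
formatted_via_apply = df.apply(lambda col: col.map(lambda x: '%.2f' % x))

# If numbers (not strings) are wanted, round() keeps the float dtype.
rounded = df.round(2)

print(formatted)
print(rounded.dtypes)
```

The formatted frames hold strings, so `round` is usually the better choice when the values are needed for further arithmetic.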

# 

**One more example**

In [104]:
temp = Series(['Aayush', 'Shah', 'BAA'])

In [107]:
df = DataFrame({"A": temp, "B": temp})
df

Unnamed: 0,A,B
0,Aayush,Aayush
1,Shah,Shah
2,BAA,BAA


In [110]:
df.apply(lambda x: x.lower())

AttributeError: 'Series' object has no attribute 'lower'

In [109]:
df.applymap(lambda x: x.lower())

Unnamed: 0,A,B
0,aayush,aayush
1,shah,shah
2,baa,baa


COOL!
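
`apply` can also do this job if the function uses the Series-level `.str` accessor instead of the plain string method. A minimal sketch on the same data:

```python
import pandas as pd

temp = pd.Series(['Aayush', 'Shah', 'BAA'])
df = pd.DataFrame({'A': temp, 'B': temp})

# Each column arrives as a Series, so use the vectorized .str.lower() on it.
lowered = df.apply(lambda col: col.str.lower())

print(lowered)
```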

Next up, we will talk some more... Still the basics, okay?

# 