
# Python 101 : Pandas : Data Preprocessing [Part 1]
by Limpapat Bussaban

04/09/2021

[pandas] Python Data Analysis Library https://pandas.pydata.org 

[Datacamp] Cheat sheets http://datacamp-community-prod.s3.amazonaws.com/f04456d7-8e61-482f-9cc9-da6f7f25fc9b

[pandas] Cheat sheets https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf 

In [1]:
!pip list -v | grep [Pp]andas

pandas                        1.1.5          /usr/local/lib/python3.7/dist-packages pip
pandas-datareader             0.9.0          /usr/local/lib/python3.7/dist-packages pip
pandas-gbq                    0.13.3         /usr/local/lib/python3.7/dist-packages pip
pandas-profiling              1.4.1          /usr/local/lib/python3.7/dist-packages pip
sklearn-pandas                1.8.0          /usr/local/lib/python3.7/dist-packages pip


In [2]:
import numpy as np

import pandas as pd
pd.__version__

'1.1.5'

## Why do we need to prepare data?

## Series

```
pd.Series(data=None, index=None, dtype=None, name=None)
```
**Parameters** : 

            data : array-like, Iterable, dict, or scalar value
            index : array-like or Index (1d) 
            dtype : str, numpy.dtype, or ExtensionDtype, optional
            name : str, optional

In [3]:
data = np.array([1,2,3,4,5])
sr = pd.Series(data, index=list('abcde'), name='integer')
sr

a    1
b    2
c    3
d    4
e    5
Name: integer, dtype: int64
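
The `data` parameter listed above also accepts a dict or a scalar, not only an array. The following is a minimal sketch of those two cases; the variable names `from_dict` and `from_scalar` are illustrative and not part of the original notebook.

```python
import pandas as pd

# From a dict: the keys become the index labels, the values become the data.
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3}, name='from_dict')

# From a scalar: the value is repeated once for every label in `index`.
from_scalar = pd.Series(7, index=['x', 'y', 'z'], name='from_scalar')

print(from_dict, from_scalar, sep='\n\n')
```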

### Indexing, Slicing & Filtering

```
.index
.values
```

In [4]:
print(sr.index, sr.values, sep='\n\n')

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

[1 2 3 4 5]


```
sr[index]
index : a single label or a list of labels
```

In [5]:
sr['a']

1

In [6]:
sr[['b', 'c']]

b    2
c    3
Name: integer, dtype: int64

```
sr[start:stop:step]
```

In [7]:
sr['a':'e':2]

a    1
c    3
e    5
Name: integer, dtype: int64
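
One detail worth checking when slicing by label: the stop label is included, whereas a positional slice excludes the stop position. A small sketch, assuming the same `sr` as above and using `.iloc` (covered in the loc & iloc section) for the positional case:

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.array([1, 2, 3, 4, 5]), index=list('abcde'), name='integer')

by_label = sr['b':'d']       # labels b, c, d: the stop label 'd' is included
by_position = sr.iloc[1:3]   # positions 1 and 2: the stop position 3 is excluded

print(by_label.tolist())     # [2, 3, 4]
print(by_position.tolist())  # [2, 3]
```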

```
sr[condition]   # combine conditions with & (and), | (or); wrap each condition in parentheses
```

In [8]:
sr[(sr < 2) | (sr >= 4)]

a    1
d    4
e    5
Name: integer, dtype: int64
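
The cell above combines conditions with `|`. As a complement, here is a short sketch of `&`, plus `~` for negation (an operator not shown in the original notebook). Each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than `<` or `>=`.

```python
import pandas as pd

sr = pd.Series([1, 2, 3, 4, 5], index=list('abcde'), name='integer')

inside = sr[(sr >= 2) & (sr < 5)]       # both conditions must hold
outside = sr[~((sr >= 2) & (sr < 5))]   # ~ inverts the boolean mask

print(inside.tolist())   # [2, 3, 4]
print(outside.tolist())  # [1, 5]
```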

### loc & iloc

In [9]:
sr = pd.Series(data=[22, 45, 69, 999, -1, 10], index=[1, 3, 4, 5, 6, 10])
sr

1      22
3      45
4      69
5     999
6      -1
10     10
dtype: int64

```
position index = [0,   1,  2,   3,  4,  5]
label index    = [1,   3,  4,   5,  6, 10]
values         = [22, 45, 69, 999, -1, 10]
```

In [10]:
sr[1:4]

3     45
4     69
5    999
dtype: int64

```
sr.loc[label_index] # -> Access a group of rows and columns by label(s) or a boolean array.
```

In [11]:
sr.loc[1:4]

1    22
3    45
4    69
dtype: int64

```
sr.iloc[position_index] # -> Purely integer-location based indexing for selection by position.
```

In [12]:
sr.iloc[1:4] # == sr[1:4]

3     45
4     69
5    999
dtype: int64
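
To make the label/position distinction concrete for single elements and lists, here is a minimal sketch on the same Series. With an integer index, the same number can mean two different things, so being explicit with `.loc` or `.iloc` avoids surprises.

```python
import pandas as pd

sr = pd.Series(data=[22, 45, 69, 999, -1, 10], index=[1, 3, 4, 5, 6, 10])

print(sr.loc[1])                   # label 1    -> 22
print(sr.iloc[1])                  # position 1 -> 45
print(sr.loc[[3, 10]].tolist())    # labels 3 and 10             -> [45, 10]
print(sr.iloc[[0, -1]].tolist())   # first and last by position  -> [22, 10]
```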

### Useful functions

**Math operators :**
```
 + - * ** / // %
```

In [13]:
sr1 = pd.Series(data=[2, 4, 8, 6])
sr2 = pd.Series(data=[1, 3, 5, 7])
print(sr1+sr2, sr1-sr2, sr1*sr2, sr1**sr2, sr1/sr2, sr1//sr2, sr1%sr2, sep='\n\n')

0     3
1     7
2    13
3    13
dtype: int64

0    1
1    1
2    3
3   -1
dtype: int64

0     2
1    12
2    40
3    42
dtype: int64

0         2
1        64
2     32768
3    279936
dtype: int64

0    2.000000
1    1.333333
2    1.600000
3    0.857143
dtype: float64

0    2
1    1
2    1
3    0
dtype: int64

0    0
1    1
2    3
3    6
dtype: int64


In [14]:
# Tip: arithmetic aligns on index labels; a label present in only one Series yields NaN
s1 = pd.Series(data=[2, 4, 8, 6], index=list('abcd'))
s2 = pd.Series(data=[1, 3, 5, 7], index=['b', 'c', 'f', 'g'])
s1 + s2

a     NaN
b     5.0
c    11.0
d     NaN
f     NaN
g     NaN
dtype: float64
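
If the NaN results are not wanted, one option is the method form `Series.add` with `fill_value`, which substitutes a default for a label missing from one side. A minimal sketch, assuming that treating a missing entry as 0 is appropriate for the data at hand:

```python
import pandas as pd

s1 = pd.Series(data=[2, 4, 8, 6], index=list('abcd'))
s2 = pd.Series(data=[1, 3, 5, 7], index=['b', 'c', 'f', 'g'])

# A label missing from one Series is treated as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)
```

The result is still float64, but every label from either Series now carries a number.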

**Functions :**
``` 
.sum()
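.count()   # number of non-NaN values
.size      # total number of elements, NaN included (an attribute, not a method)
.mean(), .std()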
.min(), .max()
.describe()
.idxmin(), .idxmax()
```

In [15]:
print('sr1.sum: {}'.format(sr1.sum()))

sr1.sum: 20


In [16]:
print('sr1.count: {}'.format(sr1.count()), 'sr1.size: {}'.format(sr1.size), sep='\n\n')

sr1.count: 4

sr1.size: 4
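
`count()` and `size` agree here only because `sr1` has no missing values. A small sketch with a hypothetical Series `with_gap` shows where they differ:

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([2, 4, np.nan, 6])

print(with_gap.count())  # 3: NaN is not counted
print(with_gap.size)     # 4: every element is counted
```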


In [17]:
print('sr1.mean: {}'.format(sr1.mean()), 
      'sr1.std: {}'.format(sr1.std()), 
      'sr1.min: {}, sr1.max: {}'.format(sr1.min(), sr1.max()), 
      sep='\n\n')

sr1.mean: 5.0

sr1.std: 2.581988897471611

sr1.min: 2, sr1.max: 8


In [18]:
sr1.describe()

count    4.000000
mean     5.000000
std      2.581989
min      2.000000
25%      3.500000
50%      5.000000
75%      6.500000
max      8.000000
dtype: float64

In [19]:
print('sr1.idxmin: {}, sr1.idxmax: {}'.format(sr1.idxmin(), sr1.idxmax()))

sr1.idxmin: 0, sr1.idxmax: 2


```
(condition).all(), (condition).any()
```

In [20]:
(sr1 > 3).all(), (sr1 > 3).any()

(False, True)

```
.append(to_append, ignore_index=False)   # deprecated in pandas 1.4 and removed in 2.0; prefer pd.concat
pd.concat(objs, axis=0, ignore_index=False)
```

In [21]:
print(sr1.append(sr2), pd.concat([sr1, sr2]), sep='\n\n')

0    2
1    4
2    8
3    6
0    1
1    3
2    5
3    7
dtype: int64

0    2
1    4
2    8
3    6
0    1
1    3
2    5
3    7
dtype: int64
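
Both results above keep the original labels, so the index contains duplicates (0 to 3 twice). A minimal sketch of `ignore_index=True`, which discards the old labels and renumbers from 0:

```python
import pandas as pd

sr1 = pd.Series(data=[2, 4, 8, 6])
sr2 = pd.Series(data=[1, 3, 5, 7])

# Old labels are dropped; the result is labelled 0..7.
combined = pd.concat([sr1, sr2], ignore_index=True)

print(combined.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(combined.tolist())        # [2, 4, 8, 6, 1, 3, 5, 7]
```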


```
.apply()
```

In [22]:
my_function = lambda x : x**2
sr1.apply(my_function)

0     4
1    16
2    64
3    36
dtype: int64
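
`apply` also accepts a named function. For simple arithmetic like squaring, the math operators from the earlier cell give the same values without calling a Python function per element. A short sketch comparing the two (the function name `square` is illustrative):

```python
import pandas as pd

sr1 = pd.Series(data=[2, 4, 8, 6])

def square(x):
    """Return x squared; called once per element by apply."""
    return x ** 2

applied = sr1.apply(square)   # element-by-element function call
vectorized = sr1 ** 2         # same values via the ** operator

print(applied.tolist())     # [4, 16, 64, 36]
print(vectorized.tolist())  # [4, 16, 64, 36]
```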

```
.unique()
.nunique()
.value_counts()
.is_unique      # an attribute, not a method
```

In [23]:
sr3 = pd.Series([2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 9])
print('sr3.unique: {}'.format(sr3.unique()), 
      'sr3.nunique: {}'.format(sr3.nunique()), 
      'sr3.value_count: \n{}'.format(sr3.value_counts()),
      'sr3.is_unique: {}'.format(sr3.is_unique),
      'sr1.is_unique: {}'.format(sr1.is_unique),
      sep='\n\n')

sr3.unique: [2 3 4 5 9]

sr3.nunique: 5

sr3.value_count: 
4    4
3    3
5    2
9    1
2    1
dtype: int64

sr3.is_unique: False

sr1.is_unique: True
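
When proportions are more useful than raw counts, `value_counts` takes `normalize=True`. A minimal sketch on the same data (the name `shares` is illustrative):

```python
import pandas as pd

sr3 = pd.Series([2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 9])

# Each count is divided by the number of values, so the shares sum to 1.
shares = sr3.value_counts(normalize=True)

print(shares)
print(shares.loc[4])  # the value 4 accounts for 4 of the 11 entries
```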



```
.head(n) 
.tail(n)
```



In [24]:
sr3.head(3)

0    2
1    3
2    3
dtype: int64

```
.drop(index)
```

In [25]:
sr4 = pd.Series(data=[9, 8, 7, 6], index=list('abcd'))
sr4.drop(['b', 'd'])

a    9
c    7
dtype: int64

```
.map(dict)
```

In [26]:
sr4.map({9:1, 8:2})

a    1.0
b    2.0
c    NaN
d    NaN
dtype: float64
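
As the output shows, values missing from the dict become NaN. Two related variations, sketched under the assumption that `sr4` is defined as above: `map` also accepts a function, and `replace` (a different method, not used in the original notebook) leaves unmatched values unchanged.

```python
import pandas as pd

sr4 = pd.Series(data=[9, 8, 7, 6], index=list('abcd'))

times_ten = sr4.map(lambda v: v * 10)   # map with a function instead of a dict
replaced = sr4.replace({9: 1, 8: 2})    # 7 and 6 have no entry and stay as they are

print(times_ten.tolist())  # [90, 80, 70, 60]
print(replaced.tolist())   # [1, 2, 7, 6]
```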

```
.rename()
```

In [27]:
sr4.rename('my_name')

a    9
b    8
c    7
d    6
Name: my_name, dtype: int64

In [28]:
sr4.rename({'a':0, 'c':2}) # relabels only the listed keys; to relabel every entry, assign sr4.index = [...]

0    9
b    8
2    7
d    6
dtype: int64

```
.sort_values()
.items()       # ~ enumerate(); .iteritems() is the older alias, removed in pandas 2.0
```

In [29]:
sr4.sort_values()

d    6
c    7
b    8
a    9
dtype: int64
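
`sort_values` sorts ascending by default and returns a new Series. A short sketch of the `ascending` parameter, together with `sort_index` for restoring label order:

```python
import pandas as pd

sr4 = pd.Series(data=[9, 8, 7, 6], index=list('abcd'))

ascending = sr4.sort_values()                  # smallest value first
descending = sr4.sort_values(ascending=False)  # largest value first
by_label = ascending.sort_index()              # back to label order a, b, c, d

print(ascending.index.tolist())   # ['d', 'c', 'b', 'a']
print(descending.tolist())        # [9, 8, 7, 6]
print(by_label.index.tolist())    # ['a', 'b', 'c', 'd']
```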

In [30]:
for i, j in sr4.items():  # .iteritems() was removed in pandas 2.0; .items() works in both
  print('Index: {}, Value: {}'.format(i, j))

Index: a, Value: 9
Index: b, Value: 8
Index: c, Value: 7
Index: d, Value: 6


### NaN

```
.dropna()
.fillna(value)
.hasnans
```

In [31]:
sr5 = sr4.map({9:1, 8:2})
print(sr4, sr5, sep='\n\n')

a    9
b    8
c    7
d    6
dtype: int64

a    1.0
b    2.0
c    NaN
d    NaN
dtype: float64


In [32]:
sr4.hasnans, sr5.hasnans

(False, True)

In [33]:
sr6 = sr5.dropna()
sr6

a    1.0
b    2.0
dtype: float64

In [34]:
sr7 = sr5.fillna(-1)
sr7

a    1.0
b    2.0
c   -1.0
d   -1.0
dtype: float64
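
Before dropping or filling, it often helps to count the missing entries, and a constant such as -1 is only one possible fill value. A minimal sketch, assuming the mean of the observed values is a reasonable substitute for this data:

```python
import pandas as pd

sr4 = pd.Series(data=[9, 8, 7, 6], index=list('abcd'))
sr5 = sr4.map({9: 1, 8: 2})   # c and d become NaN

n_missing = sr5.isna().sum()       # isna() gives a boolean mask; True counts as 1
filled = sr5.fillna(sr5.mean())    # mean() skips NaN, so it is (1 + 2) / 2

print(n_missing)        # 2
print(filled.tolist())  # [1.0, 2.0, 1.5, 1.5]
```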