<img src="img/01.png">

# 01. Pandas - Series

## 01.01 What is Pandas?

#### According to the Pandas website: https://pandas.pydata.org

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

#### According to [REF1](../README.md):

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a **DataFrame**. DataFrames are essentially multidimensional arrays with **attached row and column labels**, and often with **heterogeneous** types and/or missing data.
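
To make the description above concrete, here is a minimal sketch of a DataFrame with row labels, column labels, mixed column types and one missing value. The data and the names `df`, `name`, `age` and `height_m` are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows, columns of different types
df = pd.DataFrame(
    {
        "name": ["Ada", "Bob", "Cy"],        # strings
        "age": [36, 41, 29],                 # integers
        "height_m": [1.70, np.nan, 1.82],    # floats with a missing value
    },
    index=["a", "b", "c"],                   # explicit row labels
)

print(df.dtypes)            # one dtype per column
print(df.loc["b", "age"])   # label-based access: row "b", column "age"
```

Each column of such a DataFrame is a `Series`, which is the object this chapter focuses on.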

## 01.02 Pandas Series - basics

A Pandas Series is a one-dimensional array of indexed data.
* Create a `Series` object

In [1]:
import numpy as np
import pandas as pd

In [2]:
series = pd.Series([103, 204, 406, 256])
series

0    103
1    204
2    406
3    256
dtype: int64

In [3]:
# from list
pd.Series([1, 4, 9, 16, 25, 36])

0     1
1     4
2     9
3    16
4    25
5    36
dtype: int64

In [4]:
# from dict
pd.Series({x: x**2 for x in range(1, 7)})

1     1
2     4
3     9
4    16
5    25
6    36
dtype: int64

In [5]:
# from list of values and list of indices (we can pass dtype as well)
pd.Series([x**2 for x in range(1,7)], index=[x for x in range(1,7)], dtype=np.float32)

1     1.0
2     4.0
3     9.0
4    16.0
5    25.0
6    36.0
dtype: float32

In [6]:
# integers are not obligatory as index values
pd.Series({name:len(name) for name in ['John', 'Mark', 'Daisy', 'Alexis', 'J.D.', 'Constantine']})

John            4
Mark            4
Daisy           5
Alexis          6
J.D.            4
Constantine    11
dtype: int64

In [7]:
# read from csv
# .squeeze("columns") replaces the `squeeze=True` argument, which was removed in pandas 2.0
states_pop = pd.read_csv("../92_data/usa_states_population.csv", names=['population']).squeeze("columns")
states_pop

Alabama            4833722
Alaska              735132
Arizona            6626624
Arkansas           2959373
California        38332521
Colorado           5268367
Connecticut        3596080
Delaware            925749
Florida           19552860
Georgia            9992167
Hawaii             1404054
Idaho              1612136
Illinois          12882135
Indiana            6570902
Iowa               3090416
Kansas             2893957
Kentucky           4395295
Louisiana          4625470
Maine              1328302
Maryland           5928814
Massachusetts      6692824
Michigan           9895622
Minnesota          5420380
Mississippi        2991207
Missouri           6044171
Montana            1015165
Nebraska           1868516
Nevada             2790136
New Hampshire      1323459
New Jersey         8899339
New Mexico         2085287
New York          19651127
North Carolina     9848060
North Dakota        723393
Ohio              11570808
Oklahoma           3850568
Oregon             3930065
P
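
In pandas 2.0 and later, `read_csv` no longer accepts the `squeeze` argument. The sketch below shows one possible way to get a `Series` from a two-column CSV in current versions. It reads from an in-memory string instead of the real file, and the column name `state` is an assumption made for this example.

```python
import io

import pandas as pd

# Hypothetical two-column CSV without a header row (stand-in for the real file)
csv_text = "Alabama,4833722\nAlaska,735132\nArizona,6626624\n"

pop = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                    # the file has no header line
    names=["state", "population"],  # assumed column names
    index_col="state",              # use the state name as the index
)["population"]                     # selecting one column yields a Series

print(type(pop).__name__)
print(pop["Alaska"])
```

Selecting the single data column by name makes the intent explicit; calling `.squeeze("columns")` on the one-column DataFrame is an equivalent alternative.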

* `Series` properties

In [8]:
states_pop.index

Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object')

In [9]:
states_pop.values

array([ 4833722,   735132,  6626624,  2959373, 38332521,  5268367,
        3596080,   925749, 19552860,  9992167,  1404054,  1612136,
       12882135,  6570902,  3090416,  2893957,  4395295,  4625470,
        1328302,  5928814,  6692824,  9895622,  5420380,  2991207,
        6044171,  1015165,  1868516,  2790136,  1323459,  8899339,
        2085287, 19651127,  9848060,   723393, 11570808,  3850568,
        3930065, 12773801,  1051511,  4774839,   844877,  6495978,
       26448193,  2900872,   626630,  8260405,  6971406,  1854304,
        5742713,   582658])

In [10]:
states_pop.dtype

dtype('int64')

In [11]:
states_pop.name

'population'

## 01.02 Series - Indexing and selection

In [12]:
# accessing a label as an attribute - not recommended (it cannot work for labels such as 'New York')
states_pop.Alabama

4833722

In [13]:
# indexing with a label references the explicit index (just like a Python dictionary)
states_pop['Alabama']

4833722

In [14]:
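# indexing with an integer references the implicit, position-based index (just like a Python list)
# note: recent pandas versions deprecate this fallback; states_pop.iloc[0] is the explicit form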
states_pop[0]

4833722

In [15]:
# this can be troublesome for a series with an integer index, such as:
series_index_shift = pd.Series({x: x**2 for x in range(1, 7)})
series_index_shift

1     1
2     4
3     9
4    16
5    25
6    36
dtype: int64

In [16]:
series_index_shift[1]

1

In [17]:
series_index_shift[2]

4

In [18]:
# it is very useful to use masks when working with series
series_index_shift_mask = series_index_shift > 10
series_index_shift_mask

1    False
2    False
3    False
4     True
5     True
6     True
dtype: bool

In [19]:
# filtering using mask
series_index_shift[series_index_shift_mask]

4    16
5    25
6    36
dtype: int64

In [20]:
# you can try calling this (it raises a KeyError, because there is no label 0 in the index)
# series_index_shift[0]
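
As a small sketch of what that call does: with an integer index, square brackets look up labels, not positions, so asking for `0` fails while positional access through `iloc` still works.

```python
import pandas as pd

series_index_shift = pd.Series({x: x**2 for x in range(1, 7)})

try:
    series_index_shift[0]      # there is no label 0 in the index 1..6
except KeyError as err:
    print("KeyError:", err)

# Positional access is unambiguous: this is the first element
print(series_index_shift.iloc[0])
```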

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes.  
`series.loc` - always references the explicit index (labels, like keys of a Python dictionary)  
`series.iloc` - always references the implicit, position-based index (like a Python list)  

**IMPORTANT**:
assuming you have index [0,1,2,3,4,5]  
`series.loc[1:3]` will return a series with a length of 3 (the end label is included)  
`series.iloc[1:3]` will return a series with a length of 2 (the end position is excluded)  

In [21]:
# get the value whose index label is 2
series_index_shift.loc[2]

4

In [22]:
# get the value of the third element (we start counting from 0) in our series
series_index_shift.iloc[2]

9

In [34]:
pd.Series([0,1,2,3,4,5]).loc[1:3]

1    1
2    2
3    3
dtype: int64

In [36]:
pd.Series([0,1,2,3,4,5]).iloc[1:3]

1    1
2    2
dtype: int64

All the NumPy slicing and indexing tricks work in this case as well.

In [23]:
# slicing
states_pop['Alabama':'California']

Alabama        4833722
Alaska          735132
Arizona        6626624
Arkansas       2959373
California    38332521
Name: population, dtype: int64

In [24]:
# more slicing with `loc`
states_pop.loc['Alabama':'California':2]

Alabama        4833722
Arizona        6626624
California    38332521
Name: population, dtype: int64

In [25]:
# fancy indexing
states_pop[['Alabama', 'Delaware', 'Alabama', 'Nevada', 'Alabama']]

Alabama     4833722
Delaware     925749
Alabama     4833722
Nevada      2790136
Alabama     4833722
Name: population, dtype: int64

In [26]:
# let's try to shuffle the index
states_pop_shuffled = states_pop.sample(frac=1.0)
states_pop_shuffled

New Jersey         8899339
Kansas             2893957
Delaware            925749
New York          19651127
Florida           19552860
Hawaii             1404054
Oklahoma           3850568
Maryland           5928814
Alaska              735132
Idaho              1612136
Rhode Island       1051511
Minnesota          5420380
Nebraska           1868516
Wyoming             582658
Vermont             626630
Wisconsin          5742713
Illinois          12882135
Connecticut        3596080
Texas             26448193
Arkansas           2959373
Montana            1015165
Oregon             3930065
Washington         6971406
Alabama            4833722
North Dakota        723393
South Carolina     4774839
South Dakota        844877
Utah               2900872
Kentucky           4395295
Georgia            9992167
Tennessee          6495978
West Virginia      1854304
Virginia           8260405
Arizona            6626624
Michigan           9895622
Louisiana          4625470
New Hampshire      1323459
N

In [27]:
states_pop_shuffled['New Hampshire':'Montana']

Series([], Name: population, dtype: int64)

**IMPORTANT NOTE:** Many operations, such as label-based slicing, may fail or return unexpected results when the index is not sorted. This matters for multi-indexing as well.
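
One possible remedy is to sort the index before slicing by label. The sketch below uses a small made-up series (the name `s` and the chosen four states are only for illustration) to show the same empty-slice effect and how `sort_index` changes it.

```python
import pandas as pd

# Hypothetical small series with a deliberately unsorted string index
s = pd.Series(
    [1323459, 4833722, 1015165, 735132],
    index=["New Hampshire", "Alabama", "Montana", "Alaska"],
    name="population",
)

# "Montana" sits after "New Hampshire" in this order, so the slice is empty
unsorted_slice = s.loc["Montana":"New Hampshire"]

# After sorting the index, the slice follows alphabetical order
sorted_slice = s.sort_index().loc["Montana":"New Hampshire"]

print(len(unsorted_slice))
print(sorted_slice)
```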

## 01.03 Series - Operations

In [28]:
# Python arithmetic operators
states_pop / 1_000_000

Alabama            4.833722
Alaska             0.735132
Arizona            6.626624
Arkansas           2.959373
California        38.332521
Colorado           5.268367
Connecticut        3.596080
Delaware           0.925749
Florida           19.552860
Georgia            9.992167
Hawaii             1.404054
Idaho              1.612136
Illinois          12.882135
Indiana            6.570902
Iowa               3.090416
Kansas             2.893957
Kentucky           4.395295
Louisiana          4.625470
Maine              1.328302
Maryland           5.928814
Massachusetts      6.692824
Michigan           9.895622
Minnesota          5.420380
Mississippi        2.991207
Missouri           6.044171
Montana            1.015165
Nebraska           1.868516
Nevada             2.790136
New Hampshire      1.323459
New Jersey         8.899339
New Mexico         2.085287
New York          19.651127
North Carolina     9.848060
North Dakota       0.723393
Ohio              11.570808
Oklahoma           3

In [29]:
# numpy functions
np.log10(states_pop)

Alabama           6.684282
Alaska            5.866365
Arizona           6.821292
Arkansas          6.471200
California        7.583567
Colorado          6.721676
Connecticut       6.555829
Delaware          5.966493
Florida           7.291210
Georgia           6.999660
Hawaii            6.147384
Idaho             6.207402
Illinois          7.109988
Indiana           6.817625
Iowa              6.490017
Kansas            6.461492
Kentucky          6.642988
Louisiana         6.665156
Maine             6.123297
Maryland          6.772968
Massachusetts     6.825609
Michigan          6.995443
Minnesota         6.734030
Mississippi       6.475846
Missouri          6.781337
Montana           6.006537
Nebraska          6.271497
Nevada            6.445625
New Hampshire     6.121710
New Jersey        6.949358
New Mexico        6.319166
New York          7.293387
North Carolina    6.993351
North Dakota      5.859374
Ohio              7.063364
Oklahoma          6.585525
Oregon            6.594400
P
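
A NumPy universal function applied to a `Series` returns a new `Series` that keeps the original index. A minimal sketch with made-up values (the names `s` and `logs` are only for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so that the logarithms are easy to read
s = pd.Series([100, 1000, 10000], index=["a", "b", "c"])

logs = np.log10(s)   # element-wise, the index labels are preserved

print(logs)
```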

In [30]:
# let's try to get more intuition about population
states_pop_milions = states_pop // 1_000_000
states_pop_milions

Alabama            4
Alaska             0
Arizona            6
Arkansas           2
California        38
Colorado           5
Connecticut        3
Delaware           0
Florida           19
Georgia            9
Hawaii             1
Idaho              1
Illinois          12
Indiana            6
Iowa               3
Kansas             2
Kentucky           4
Louisiana          4
Maine              1
Maryland           5
Massachusetts      6
Michigan           9
Minnesota          5
Mississippi        2
Missouri           6
Montana            1
Nebraska           1
Nevada             2
New Hampshire      1
New Jersey         8
New Mexico         2
New York          19
North Carolina     9
North Dakota       0
Ohio              11
Oklahoma           3
Oregon             3
Pennsylvania      12
Rhode Island       1
South Carolina     4
South Dakota       0
Tennessee          6
Texas             26
Utah               2
Vermont            0
Virginia           8
Washington         6
West Virginia

In [31]:
# this `value_counts` command is EXTREMELY useful!
states_pop_milions.value_counts()

1     8
6     6
2     6
0     6
5     4
4     4
3     4
9     3
19    2
12    2
8     2
38    1
26    1
11    1
Name: population, dtype: int64
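
Two variations of `value_counts` that are often handy are sketched below on made-up data (the variable names are only for illustration): ordering the result by value instead of by frequency, and returning shares instead of counts.

```python
import pandas as pd

counts_src = pd.Series([1, 0, 6, 1, 2, 6, 1])  # hypothetical values

by_frequency = counts_src.value_counts()            # most frequent value first
by_value = counts_src.value_counts().sort_index()   # ordered by the value itself
shares = counts_src.value_counts(normalize=True)    # fractions instead of counts

print(by_frequency)
print(by_value)
print(shares)
```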

**EXERCISE 04.01**

1. What is the mean population for all states?
2. How many states have a population below the mean calculated in point 1?
3. What is the standard deviation for all states, for the above-mean subgroup, and for the below-mean subgroup (using the mean calculated in point 1)?

TIPS:
to get `not` in `pandas` use:
```python
mask1 = pd.Series([True, False, False])
mask2 = ~mask1 # mask2 = pd.Series([False, True, True])
```
to calculate standard deviation use:
```python
series1.std()
```
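
One detail worth knowing about `std`: pandas uses `ddof=1` (sample standard deviation) by default, while NumPy's `np.std` uses `ddof=0`. The sketch below illustrates the difference on made-up numbers that are unrelated to the exercise data.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data

sample_std = values.std()              # pandas default: ddof=1
population_std = values.std(ddof=0)    # divide by n instead of n - 1
numpy_std = np.std(values.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```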

In [32]:
### YOUR CODE HERE:
pass
### END YOUR CODE

In [33]:
### TO SHOW SOLUTION USE LINE BELOW ###
# %load ../91_solutions/ex4_1.py