# Learning Pandas

## Series

- A *Series* in Pandas is like a one-dimensional NumPy array
- It has a single data type (`dtype`) shared by all of its values
- We can set the `name` attribute of a Series, which is really helpful when:
    - Other people need to know details of what is stored in the Series
    - The Series needs to be used as a column in a table
- Python lists are always indexed by consecutive integers starting from zero, and no other indexing scheme is possible. A Pandas Series, however, lets us define our own index by setting its `index` attribute, so each value can be associated with a label (often a string):
    - This is different from a Python dictionary. A Series typically stores its values in a contiguous, typed array and supports both label-based and position-based access, whereas a dictionary is a hash table that only supports lookup by key (it preserves insertion order since Python 3.7, but it cannot be sliced or indexed by position).
    - This matters because contiguous, typed storage is what makes vectorised operations over the whole Series fast.
    - Also, without a custom index we would have to remember that France sits at position 1 every time we needed its population. With a custom index, we can access the value by label using the same syntax as a Python dictionary, e.g. `series["France"]`.
    - Even with a custom index, we can still get elements by integer position (as we did earlier) using `.iloc[<integer_location>]`. This means that `.iloc[0]` still returns the first element and `.iloc[-1]` returns the last.
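
The points above can be sketched in a few lines. This is only a minimal illustration: the shortened three-country Series and the variable names are assumptions made for the example, not part of the notebook cells that follow.

```python
import pandas as pd

# Values, labels and name can all be supplied when the Series is created.
populations = pd.Series(
    [35, 64, 81],
    index=["Canada", "France", "Germany"],
    name="Populations in millions",
)

# Label-based access uses the same syntax as a dictionary lookup.
france_by_label = populations["France"]

# Position-based access still works through .iloc.
france_by_position = populations.iloc[1]
last_value = populations.iloc[-1]

# A Series holds a single dtype: mixing integers with a float
# upcasts every value to a floating-point dtype.
mixed = pd.Series([35, 64.5, 81])

print(france_by_label, france_by_position, last_value)
print(populations.dtype, mixed.dtype)
```

Both lookups return the same value for France; the label form simply removes the need to remember its position.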

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Declaring a Pandas Series
series = pd.Series([35, 64, 81, 61, 127, 65, 319])
series

0     35
1     64
2     81
3     61
4    127
5     65
6    319
dtype: int64

In [3]:
series.name = "Populations of G7 countries in millions"
series

0     35
1     64
2     81
3     61
4    127
5     65
6    319
Name: Populations of G7 countries in millions, dtype: int64

In [4]:
series.index = ["Canada", "France", "Germany", "Italy", "Japan", "UK", "USA"]
series

Canada      35
France      64
Germany     81
Italy       61
Japan      127
UK          65
USA        319
Name: Populations of G7 countries in millions, dtype: int64

In [5]:
series.iloc[0]

35

In [6]:
# Selecting several positions at once with a list still works, as it did for NumPy arrays
series.iloc[[0,1]]

Canada    35
France    64
Name: Populations of G7 countries in millions, dtype: int64

In [7]:
# Note: unlike positional slicing, slicing by label includes the upper bound as well!
series["France":"Italy"]

France     64
Germany    81
Italy      61
Name: Populations of G7 countries in millions, dtype: int64
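
To make the difference between label slicing and positional slicing concrete, here is a small sketch. It rebuilds the same Series so it can run on its own; the names `by_label` and `by_position` are chosen just for this example.

```python
import pandas as pd

series = pd.Series(
    [35, 64, 81, 61, 127, 65, 319],
    index=["Canada", "France", "Germany", "Italy", "Japan", "UK", "USA"],
)

# Label-based slice: both endpoints, "France" and "Italy", are included.
by_label = series.loc["France":"Italy"]

# Position-based slice: the stop position 3 (Italy) is excluded,
# just as with Python lists and NumPy arrays.
by_position = series.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```

Using `.loc` and `.iloc` explicitly makes it obvious which of the two rules applies.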

### Vectorised Operations and Boolean Selection

In [8]:
# The same boolean selection techniques from NumPy arrays still apply, along with the operators & | ~ (i.e. and, or, not)
series > series.mean()

Canada     False
France     False
Germany    False
Italy      False
Japan       True
UK         False
USA         True
Name: Populations of G7 countries in millions, dtype: bool

In [9]:
series[series > series.mean()]

Japan    127
USA      319
Name: Populations of G7 countries in millions, dtype: int64
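
The `&`, `|` and `~` operators mentioned above are not exercised in the cells, so here is a hedged sketch of how they might be combined. The thresholds 60 and 100 are arbitrary values picked for illustration. Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import pandas as pd

series = pd.Series(
    [35, 64, 81, 61, 127, 65, 319],
    index=["Canada", "France", "Germany", "Italy", "Japan", "UK", "USA"],
)

# and: populations strictly between 60 and 100 million
between = series[(series > 60) & (series < 100)]

# or: populations below 60 million or above 100 million
outside = series[(series < 60) | (series > 100)]

# not: everything that is not above 100 million
not_large = series[~(series > 100)]

print(list(between.index))
print(list(outside.index))
print(list(not_large.index))
```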

In [10]:
# Vectorised operations work the same way as they did for NumPy arrays
series * 1_000_000

Canada      35000000
France      64000000
Germany     81000000
Italy       61000000
Japan      127000000
UK          65000000
USA        319000000
Name: Populations of G7 countries in millions, dtype: int64

In [11]:
# We can still use NumPy functions on a Pandas Series because its values are typically backed by a NumPy array
np.log(series)

Canada     3.555348
France     4.158883
Germany    4.394449
Italy      4.110874
Japan      4.844187
UK         4.174387
USA        5.765191
Name: Populations of G7 countries in millions, dtype: float64
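
As a rough illustration of the relationship with NumPy, the sketch below pulls the underlying values out with `Series.to_numpy()` and checks that a NumPy function applied to a Series returns a Series with the same index. The shortened Series is an assumption made to keep the example small.

```python
import numpy as np
import pandas as pd

series = pd.Series([35, 64, 81], index=["Canada", "France", "Germany"])

# to_numpy() returns the values as a plain NumPy array (labels are dropped).
values = series.to_numpy()
print(type(values))

# NumPy functions applied to the Series return a Series with the index intact.
logged = np.log(series)
print(type(logged))
print(list(logged.index))
```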

### Modifying a Series

In [12]:
# We can modify the Series just as we used to do with Python dictionaries
series["Canada"] = 40
series

Canada      40
France      64
Germany     81
Italy       61
Japan      127
UK          65
USA        319
Name: Populations of G7 countries in millions, dtype: int64

In [13]:
# We can also modify using the .iloc[] indexer
series.iloc[0] = 36
series

Canada      36
France      64
Germany     81
Italy       61
Japan      127
UK          65
USA        319
Name: Populations of G7 countries in millions, dtype: int64

In [14]:
# Modifications can also be made with boolean selection
series[series < 70] = 100
series

Canada     100
France     100
Germany     81
Italy      100
Japan      127
UK         100
USA        319
Name: Populations of G7 countries in millions, dtype: int64
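
All of the assignments above change the Series in place. If the original values need to be kept, one option is to work on a copy. The sketch below is a suggestion rather than something the notebook itself does: it uses `.copy()` and the explicit `.loc` indexer, and the name `updated` is made up for the example.

```python
import pandas as pd

series = pd.Series(
    [35, 64, 81, 61],
    index=["Canada", "France", "Germany", "Italy"],
)

# Work on a copy so the original Series is left untouched.
updated = series.copy()

# Assign by label, then by boolean mask, using the explicit .loc indexer.
updated.loc["Canada"] = 40
updated.loc[updated < 70] = 100

print(updated.tolist())
print(series.tolist())
```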

## DataFrames

- A DataFrame looks a lot like an Excel sheet: a table with labelled rows and columns
- It is very common to create DataFrames from .csv files
- Each DataFrame column is a Series
- So we can think of a DataFrame as a collection of Series that share the same index
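
A minimal sketch of these ideas follows. The CSV text is held in memory instead of in a file, and the column names `country`, `population_millions` and `population` are assumptions made for the example. The figures reuse the population values from the Series section.

```python
import io

import pandas as pd

# A tiny CSV held in memory; with a real file we would pass its path instead.
csv_text = """country,population_millions
Canada,35
France,64
Germany,81
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="country")

# Selecting a single column gives back a Series named after that column.
column = df["population_millions"]
print(type(column))
print(column["France"])

# Going the other way: several Series sharing an index form a DataFrame.
populations = pd.Series([35, 64, 81], index=["Canada", "France", "Germany"])
table = pd.DataFrame(
    {
        "population_millions": populations,
        "population": populations * 1_000_000,
    }
)
print(table.shape)
```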