# 05.01.01 - Pandas

## Purpose
This notebook covers Pandas basics: creating pandas objects from existing structures, creating new ones from scratch, and updating them.

The next module covers how to load and work with datasets.

In [1]:
# Standard naming convention
import pandas as pd

In [2]:
# Generate 4 elements, just a range/sequence
explicitDataSet = pd.Series([5, 10, 15, 20])
explicitDataSet

0     5
1    10
2    15
3    20
dtype: int64

In [3]:
# To pull a single element
explicitDataSet[2]

15

In [4]:
# To pull a range
explicitDataSet[2:]

2    15
3    20
dtype: int64

In [5]:
# We can assign to another object, just like we've seen before
explicitDataSetSubSet = explicitDataSet[2:]
explicitDataSetSubSet

2    15
3    20
dtype: int64

In [6]:
# Uncomment the line below and run it.
# explicitDataSetSubSet[0] # Note this gives an error: label 0 is not in the subset!
explicitDataSetSubSet[3]  # only 2 rows, yet label 3 works

20

In [7]:
# This lists the index labels currently in the Series
explicitDataSetSubSet.index

RangeIndex(start=2, stop=4, step=1)

In [8]:
# Gives the length of the Series
len(explicitDataSetSubSet)

2

In [9]:
# We can look up by position, as we would in a plain list,
# by using iloc
explicitDataSetSubSet.iloc[0] # This is what we expect

15

In [10]:
# We may want to reset the index to make things consistent
# Note: this works, but it returns a new Series and leaves the original unchanged
explicitDataSetSubSet.reset_index(drop=True)

0    15
1    20
dtype: int64

In [11]:
explicitDataSetSubSet

2    15
3    20
dtype: int64

In [12]:
# We can use inplace=True to modify the current Series instead of returning a new one
explicitDataSetSubSet.reset_index(inplace=True, drop=True)
explicitDataSetSubSet

0    15
1    20
dtype: int64

The key takeaway from the cells above is that Pandas keeps its own index of labels, separate from the positional index we are used to. This can cause confusion, hence the rather verbose list of examples. Just remember: an index in Pandas is not the same thing as an index into a list or a numpy array.
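The label versus position distinction can be condensed into a short sketch. The variable names here (`s`, `tail`, `renumbered`) are made up for illustration and are not part of the notebook; `.loc` is used for label lookups and `.iloc` for positional ones.

```python
import pandas as pd

s = pd.Series([5, 10, 15, 20])
tail = s.iloc[2:]              # positional slice; the labels 2 and 3 are kept

by_label = tail.loc[3]         # label-based lookup
by_position = tail.iloc[0]     # position-based lookup

# reset_index returns a new Series, so reassigning is an alternative to inplace=True
renumbered = tail.reset_index(drop=True)

print(by_label, by_position)
print(list(tail.index), list(renumbered.index))
```

Being explicit with `.loc` and `.iloc` tends to make it obvious to a reader which kind of index is meant.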

Index labels can be more than just numbers, so let's try that.

In [13]:
explicitDataSet

0     5
1    10
2    15
3    20
dtype: int64

In [14]:
explicitDataSet.index = ["Foo", "Bar", "Car", "Zar"]
explicitDataSet

Foo     5
Bar    10
Car    15
Zar    20
dtype: int64

In [15]:
# Now we can reference the rows two different ways:
print(f"explicitDataSet['Foo'] => {explicitDataSet['Foo']}")
print(f"explicitDataSet.iloc[0] => {explicitDataSet.iloc[0]}")

explicitDataSet['Foo'] => 5
explicitDataSet.iloc[0] => 5
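Labels do not have to be assigned after the fact. As a minimal sketch (the names `labeled` and `from_dict` are illustrative, not from the notebook), a Series can also be built with its labels in one step, either through the `index` argument or from a dict:

```python
import pandas as pd

# Labels supplied at construction time through the index argument
labeled = pd.Series([5, 10, 15, 20], index=["Foo", "Bar", "Car", "Zar"])

# A dict works too: keys become the labels, values become the data
from_dict = pd.Series({"Foo": 5, "Bar": 10, "Car": 15, "Zar": 20})

same = labeled.equals(from_dict)     # same labels, same values, same dtype
car_by_label = labeled.loc["Car"]    # lookup by label
car_by_position = labeled.iloc[2]    # lookup by position

print(same, car_by_label, car_by_position)
```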


In [16]:
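# In some cases we don't have to cast: in the output below the string "105" is stored
# as the integer 105 and the dtype stays int64 (this coercion may depend on the pandas version)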
explicitDataSet['Foo'] = "105"

In [17]:
explicitDataSet


Foo    105
Bar     10
Car     15
Zar     20
dtype: int64

In [18]:
# Python is still dynamically typed, but watch the dtype in this case
# It went from dtype: int64 -> object
explicitDataSet["Foo"] = "Hello, world"
explicitDataSet

Foo    Hello, world
Bar              10
Car              15
Zar              20
dtype: object

In general, mixing data types in one Series is a bad idea.  Pay close attention to the "filter" below.  It selects
all values greater than 10, but it fails here because the value at Foo is the string 'Hello, world', which cannot be compared with an integer.

In [19]:
# Uncomment and run to see the error message
# explicitDataSet[(explicitDataSet > 10)]

In [20]:
explicitDataSet = pd.Series([5, 10, 15, 20])
explicitDataSet.index = ["Foo", "Bar", "Car", "Zar"]
explicitDataSet[(explicitDataSet > 10)]

Car    15
Zar    20
dtype: int64
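Rebuilding the Series from scratch is one fix. Another option, sketched here under the assumption that non-numeric entries can simply be treated as missing, is `pd.to_numeric` with `errors="coerce"`. The names `mixed`, `numeric`, and `filtered` are illustrative only.

```python
import pandas as pd

# A Series with mixed types, like the one produced earlier (dtype: object)
mixed = pd.Series(["Hello, world", 10, 15, 20], index=["Foo", "Bar", "Car", "Zar"])

# Comparing a str with an int fails, so the filter raises
try:
    mixed[mixed > 10]
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Coerce anything that is not a number to NaN; the result is a float64 Series
numeric = pd.to_numeric(mixed, errors="coerce")

# NaN > 10 is False, so the bad entry drops out of the filter
filtered = numeric[numeric > 10]
print(filtered)
```

Note that the coerced values are floats, because NaN has no integer representation in a plain int64 Series.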

So far we have only worked with a Series, which is of limited use on its own. Let's build a more complex data source: a DataFrame.

In [21]:
import numpy.random as rnd
# Generate some fixture data
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
years = [2015, 2016, 2017, 2018, 2019, 2020]

npTemps = rnd.randint(-20, 100, size=(len(months), len(years)))
npTemps

array([[ 14,  64,  21,  -5, -16,  26],
       [ 25,  57,  87, -16,  73,  -1],
       [ 49,  72,  65, -17, -17,   7],
       [ 65,  20,  91,  70,  37,  27],
       [-19,  74,  45,   2,  66,  22],
       [ 55,   0, -13,  83,  -8,  62],
       [  9,  54,  35,  49,  79,  48],
       [ 84,  51,  92,  67,  77,  63],
       [ -8,   3,  83,  -1, -17,   5],
       [ 44,  44,  19,  -3,  89, -10],
       [ -4,   4,  19,  30, -17,  80],
       [ 70,  48,   7,  62,  54,  14]])
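Because the values above are random, rerunning the cell produces different numbers. If reproducible fixture data is wanted, one possible sketch uses NumPy's `default_rng` with a fixed seed. The seed value 42 and the names `rng` and `np_temps` are arbitrary choices for illustration, and the generated values will not match the array shown above.

```python
import numpy as np

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
years = [2015, 2016, 2017, 2018, 2019, 2020]

# A seeded generator yields the same values on every run (the seed is arbitrary)
rng = np.random.default_rng(seed=42)

# As with randint, the upper bound (100) is exclusive by default
np_temps = rng.integers(-20, 100, size=(len(months), len(years)))
print(np_temps.shape)
```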

In [22]:
wonkyWeather = pd.DataFrame(npTemps, columns=years, index=months)
wonkyWeather

      2015  2016  2017  2018  2019  2020
Jan     14    64    21    -5   -16    26
Feb     25    57    87   -16    73    -1
Mar     49    72    65   -17   -17     7
Apr     65    20    91    70    37    27
May    -19    74    45     2    66    22
Jun     55     0   -13    83    -8    62
Jul      9    54    35    49    79    48
Aug     84    51    92    67    77    63
Sept    -8     3    83    -1   -17     5
Oct     44    44    19    -3    89   -10
