This is meant as a quick introduction to Python, NumPy, Pandas, and Seaborn.

# Python

https://docs.python.org/3/tutorial/index.html

### Variables

Everything in Python is an object. You can check what methods are defined for a specific object using `dir` and can check the type with `type`.

In [None]:
a = 12
b = 12.0

print('a is', type(a))
print('b is', type(b))
print(dir(a))

f-strings are your friend. At some point you'll want to print out the values of your variables. You can do this easily with f-strings.

In [None]:
a = 12
b = 'dog'
c = 3.14

print(f'a={a}, b={b}, c={c}')

They can also contain expressions.

In [None]:
print(f'a + 12 = {a+12}')

Similar to C++, you can overload the operators that classes use. In Python this is done with special methods, often called `dunder` methods because their names start and end with double underscores.

In [None]:
class MyObj:
    def __init__(self, name):
        self.name = name
    
    def __eq__(self, other):
        print('-checking equality-')
        return self.name == other.name

    def __add__(self, other):
        print('-adding-')
        return MyObj(self.name + ' ' + other.name)
    
    def __radd__(self, other):
        print('-other obj found-')
        if isinstance(other, str):
            return MyObj(other + ' ' + self.name)  # `other` is the left operand, so it goes first
        else:
            print('uh oh')
            return None


These get invoked automatically when the operators are used.

In [None]:
m1 = MyObj('taco')
m2 = MyObj('bell')
print('eq check')
print(m1 == m2)
print('addition check')
print(m1 + m2)
print('reverse addition check')
print('liberty' + m2)
print('reverse addition with int')
print(12 + m2)

By default, the string printed comes from `__str__()`. If it isn't defined, you get the class name and the object's memory location instead. You can also call these methods explicitly. Note that `__radd__` only gets called if the left operand does not support the operation. When you call the methods explicitly, you have to take that into account yourself; when you use the operator, Python handles it for you.

In [None]:
m1.__add__(m2)
m1.__radd__('liberty')
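
As a minimal sketch of the two points above: defining `__str__` controls what `print` shows, and returning `NotImplemented` from `__add__` is how a left operand signals that it doesn't support the operation, which lets Python fall back to the right operand's `__radd__`. The `Label` class is hypothetical and only used for illustration.

```python
class Label:
    """Hypothetical class used only for illustration."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Used by print() and str()
        return f'Label({self.name})'

    def __add__(self, other):
        if isinstance(other, Label):
            return Label(self.name + ' ' + other.name)
        # Signal that this operand type is unsupported
        return NotImplemented

    def __radd__(self, other):
        # Called for `other + self` when `other` cannot handle the addition
        if isinstance(other, str):
            return Label(other + ' ' + self.name)
        return NotImplemented


combined = Label('taco') + Label('bell')
reverse = 'liberty' + Label('bell')  # str can't add a Label, so __radd__ runs
print(combined)  # Label(taco bell)
print(reverse)   # Label(liberty bell)
```

With `NotImplemented`, an unsupported combination such as `12 + Label('bell')` raises a `TypeError` instead of silently producing `None`.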

### Lists

In [None]:
l = [1,2,3]
print(len(l))

Lists can easily be concatenated together.

In [None]:
new_list = l + [4,5,6]
print(new_list)

Lists can be created using `range` with `list`.
`range` also supports a `step` argument, i.e. how much to increment by at each step.


In [None]:
print(list(range(1, 10)))
print(list(range(1, 10, 2)))

You can access portions of your list with a slice.

In [None]:
l = list(range(1,10))
print(l[0:4])  # Print elements at indices 0, 1, 2, 3
print(l[:4])   # Print elements at indices 0, 1, 2, 3
print(l[4:])   # Print elements at indices 4, 5, 6....
print(l[0::2]) # Print every other element
print(l[0:len(l):2]) # Print every other element

You can use slices to delete elements as well.

In [None]:
print(l)
del l[0::2]
print(l)

Shallow copy with slices.

In [None]:
l = list(range(1, 10))
new_l = l[:]  # Shallow copy
del new_l[0::2]
print(l)
print(new_l)

Shallow copies are still dangerous in Python.

In [None]:
l = [
    list(range(1, 10)),
    list(range(2, 8, 2))
]
print(l)
new_l = l[:]  # A shallow copy of a list of lists only copies references to the inner lists
del new_l[0][0]
print(l) # Where'd the 1 go??
print(new_l)

You want `copy.deepcopy` instead.

In [None]:
from copy import deepcopy
l = [
    list(range(1, 10)),
    list(range(2, 8, 2))
]
print(l)
new_l = deepcopy(l) 
del new_l[0][0]
print(l) # Found the 1
print(new_l)
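
One way to see the difference is to compare object identity with `is`. The sketch below uses made-up variable names (`nested`, `shallow`, `deep`) and is only meant to illustrate which objects are shared.

```python
from copy import deepcopy

nested = [[1, 2, 3], [4, 5, 6]]
shallow = nested[:]
deep = deepcopy(nested)

# The outer list is a new object in both cases
print(shallow is nested)        # False
# ...but the shallow copy still points at the same inner lists
print(shallow[0] is nested[0])  # True
print(deep[0] is nested[0])     # False

# Mutating an inner list through the shallow copy changes the original too
shallow[0].append(99)
print(nested[0])  # [1, 2, 3, 99]
print(deep[0])    # [1, 2, 3]
```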

Strings can be converted to lists of characters by calling `list` on them.

In [None]:
l = list('hello')
print(l)

List comprehensions are cool..... you comprehend? Syntax is similar to a for loop.

In [None]:
l = []
for i in range(10):
    l.append(i)

new_l = [i for i in range(10)]
print(l)
print(new_l)

Let's toss in an if statement

In [None]:
l = []
for i in range(10):
    if i % 2 == 0:
        l.append(i)

new_l = [i for i in range(10) if i % 2 == 0]
print(l)
print(new_l)

Let's make like birds and get nested.

In [None]:
l = []
for i in range(3):
    for j in range(2):
        l.append((i, j))

new_l = [(i,j) for i in range(3) for j in range(2)]
print(l)
print(new_l)

Let's iterate over the items in a list

In [None]:
l = list(range(2,10))
for item in l:
    print(item)

What if you want to keep track of the index of the item? Option 1.

In [None]:
for i in range(len(l)):
    print(f'idx={i}, item={l[i]}')

Option 2 - using `enumerate` to get the index and the item.

In [None]:
for i, item in enumerate(l):
    print(f'idx={i}, item={item}')

You can also use negative indices to reference the end of the list starting at -1.

In [None]:
print('last element is', l[-1])
print('last 3 elements are', l[-3:])

### Dictionaries

Dictionaries are wonderful. They store key-value pairs: values can be arbitrary objects, and keys can be any hashable object.

In [None]:
d = {
    'a': 1,
    'b': 2,
    'c': 3
}

print(d)
print(d['a'])

Iterate over the dictionary.

In [None]:
for k,v in d.items():
    print('k', k, end = '\t')
    print('v', v)

Do you... comprehend dictionaries?

In [None]:
d = {k:v for k,v in enumerate(list('abc'), 1)}
print(d)
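
A dictionary comprehension can also be used to invert a mapping, and `.get` lets you look up a key that may be missing. This is a small sketch with illustrative variable names.

```python
letters = list('abc')

# Position -> letter, starting the count at 1
by_position = {i: ch for i, ch in enumerate(letters, 1)}

# Swap keys and values: letter -> position
inverted = {ch: i for i, ch in by_position.items()}

# .get returns the default instead of raising KeyError for a missing key
missing = inverted.get('z', 0)

print(by_position)  # {1: 'a', 2: 'b', 3: 'c'}
print(inverted)     # {'a': 1, 'b': 2, 'c': 3}
print(missing)      # 0
```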

Python also offers `generator` objects. A generator function looks like a normal function, except that each call to `next()` runs it up to the next `yield` statement and then pauses, remembering the current state of the iteration. This is especially handy if you don't want to store all the elements in RAM at once.

Here's a simple generator:

In [None]:
def simple_generator():
    yield 1
    yield 2
    print('hello')
    yield 3


gen = simple_generator()
print('type is', type(gen))
print(next(gen))
print(next(gen))

What do you think happens on the next call to `next()`?

In [None]:
print(next(gen))

Once the generator has been exhausted, it will raise a `StopIteration`.

In [None]:
print(next(gen))
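
In practice you rarely call `next()` by hand. The sketch below (with a hypothetical `countdown` generator) shows a loop consuming a generator, the optional default argument of `next()`, and a generator expression that never builds the full list in memory.

```python
def countdown(start):
    """Hypothetical generator that yields start, start - 1, ..., 1."""
    while start > 0:
        yield start
        start -= 1


# Iteration calls next() for you and stops cleanly on StopIteration
collected = [value for value in countdown(3)]
print(collected)  # [3, 2, 1]

# next() accepts a default that is returned instead of raising StopIteration
gen = countdown(1)
first = next(gen, None)
second = next(gen, None)
print(first, second)  # 1 None

# A generator expression is lazy: no list of a million squares is created
total = sum(i * i for i in range(1_000_000))
print(total)
```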

You can also use `map` to apply a function to an iterable sequence. For instance you can add 2 to every element like this:

In [None]:
l = list(range(5))
def add_two(val):
    return val + 2

m = map(add_two, l)
print(type(m))
for item in m:
    print(item)

You can also define a `lambda` function as the anonymous function to apply. If you want to get the `map` object back to a list you can call `list` on it without explicitly having to iterate over each item.

In [None]:
l = list(range(5))
list(map(lambda val: val + 2, l))
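
For comparison, here is a short sketch showing that the same result can be written as a list comprehension, and that a `map` object can only be iterated once. Variable names are illustrative.

```python
values = list(range(5))

mapped = list(map(lambda val: val + 2, values))
comprehended = [val + 2 for val in values]
print(mapped == comprehended)  # True

# A map object is a single-use iterator: the second pass yields nothing
m = map(str, values)
first_pass = list(m)
second_pass = list(m)
print(first_pass)   # ['0', '1', '2', '3', '4']
print(second_pass)  # []
```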

# NumPy

For a great quickstart, check out NumPy's official [quickstart](https://numpy.org/doc/stable/user/quickstart.html), [beginners](https://numpy.org/doc/stable/user/absolute_beginners.html) and [fundamentals](https://numpy.org/doc/stable/user/basics.html) sections.

NumPy is great for working with multidimensional arrays.

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4])
print(type(a))
print(a.dtype)
print(a)

Specify the `dtype` at creation time or convert it later.

In [None]:
a = np.array([1, 2, 3, 4], dtype=np.int16)
print(a.dtype)
a = a.astype(np.float64)
print(a.dtype)

Perform arithmetic with [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html). Broadcasting allows operations on different shapes. General rules (from their website):



> When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when
>
> 1. they are equal, or
> 2. one of them is 1



In [None]:
a = np.arange(5)
print(a)
print(a + 2)

You can easily reshape arrays and can even use `-1` to infer the size assuming you give the other dimensions.

In [None]:
a = np.arange(12).reshape(-1, 4)
print(a.shape)
a

In [None]:
a = np.arange(12).reshape(-1, 3, 4)
print(a.shape)
a

In [None]:
a = np.arange(12).reshape(-1, 3) + np.array([1,2,3])
a

Broadcasting will fail if a pair of trailing dimensions is neither equal nor 1. The cell below raises a `ValueError` because the trailing dimensions are 3 and 4.

In [None]:
a = np.arange(12).reshape(-1, 3)  # (4,3)
b = np.arange(4)  # (4,)
a + b
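
A common way around this, sketched below, is to give the 1D array an explicit trailing dimension of size 1 so that the shapes line up as `(4, 3)` and `(4, 1)`. The variable names `column` and `result` are illustrative.

```python
import numpy as np

a = np.arange(12).reshape(-1, 3)  # shape (4, 3)
b = np.arange(4)                  # shape (4,)

# Trailing dimensions 3 and 4 are incompatible
try:
    a + b
    failed = False
except ValueError:
    failed = True
print(failed)  # True

# Adding an axis turns b into a column of shape (4, 1),
# which broadcasts across the 3 columns of a
column = b[:, np.newaxis]
result = a + column
print(result.shape)  # (4, 3)
print(result)
```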

You can sort your arrays.

In [None]:
a = np.random.randint(0, 10, (10,))
a

In [None]:
np.sort(a)

Or you can just get the indices that would be needed to sort the array.

In [None]:
np.argsort(a)
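
The indices from `np.argsort` are most useful when you need to reorder a second array the same way. A small sketch with made-up `scores` and `names` arrays:

```python
import numpy as np

scores = np.array([7, 2, 9, 4])
names = np.array(['a', 'b', 'c', 'd'])

order = np.argsort(scores)
print(order)          # [1 3 0 2]
print(scores[order])  # [2 4 7 9], same as np.sort(scores)
print(names[order])   # ['b' 'd' 'a' 'c']
```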

Let's access some data from a 2D array.

In [None]:
a = np.arange(12).reshape(-1, 3)
print(a)

Access the last row:

In [None]:
a[-1, :]

Access the last column:

In [None]:
a[:, -1]

Access multiple indices at once:

In [None]:
rows = np.array([0, 0, 1])
cols = np.array([0, 1, 1])
a[rows, cols]

In [None]:
indices = np.array([0, 1, 4])
a.ravel()[indices]  # Can use ravel to create a 1D view of the data

You can perform logical operations to help separate data.

In [None]:
a > 5

In [None]:
a[a > 5]

If you need the indices associated with these, you can use `np.where`.

In [None]:
rows, cols = np.where(a > 5)
print(rows)
print(cols)
print(a[rows, cols])
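
Two related tools, sketched here on the same array: the three-argument form of `np.where` picks between two values element by element, and `np.argwhere` returns the matching indices as `(row, col)` pairs rather than as separate arrays.

```python
import numpy as np

a = np.arange(12).reshape(-1, 3)

# Keep values greater than 5, replace everything else with 0
clipped = np.where(a > 5, a, 0)
print(clipped)

# One (row, col) pair per matching element
pairs = np.argwhere(a > 5)
print(pairs.shape)  # (6, 2)
print(pairs[0])     # [2 0]
```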

You can stack your arrays as well.

In [None]:
a = np.arange(3).reshape(-1, 1)
b = np.arange(3, 6).reshape(-1, 1)
c = np.arange(6, 9).reshape(-1, 1)

np.hstack((a, b, c))

In [None]:
np.vstack((a,b,c))

You can expand dimensions. This is useful to get shapes to match (batch size, etc).

In [None]:
a = np.arange(5)
print(a.shape)
print(a)

a = np.expand_dims(a, axis=0)
print(a.shape)
print(a)
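
The same result can be written with indexing, and `np.squeeze` undoes it. The sketch below compares the spellings; the variable names are illustrative.

```python
import numpy as np

a = np.arange(5)                  # shape (5,)

row = np.expand_dims(a, axis=0)   # shape (1, 5)
row_alt = a[np.newaxis, :]        # same thing via indexing

col = np.expand_dims(a, axis=-1)  # shape (5, 1)
col_alt = a[:, None]              # None is an alias for np.newaxis

squeezed = np.squeeze(row)        # back to shape (5,)

print(row.shape, col.shape, squeezed.shape)
```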

# Pandas

Pandas is a way to store tabular data. For a great quickstart, check out their [official getting started](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) tutorial. Pandas has rows and columns, very similar to Excel, except pandas is actually useful.

There are a number of ways to read in data with pandas. You can create a `DataFrame` from a csv, numpy array, dictionary, etc.

In [None]:
import pandas as pd

a = np.arange(12).reshape(-1, 3)
df = pd.DataFrame(a, columns=['Col A', 'Col B', 'Col C'])
df
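
For completeness, here is a sketch of two of the other routes mentioned above: building a `DataFrame` from a dictionary and reading one from CSV. The CSV text is made up and read from an in-memory string via `io.StringIO` so that the example does not depend on a file on disk.

```python
import io

import pandas as pd

# From a dictionary: keys become column names
from_dict = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [31, 17]})

# From CSV: read_csv accepts a path or any file-like object
csv_text = 'name,age\nAnn,31\nBob,17\n'
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict)
print(from_csv)
```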

Each column in the data frame is a `Series`. A series can be accessed by its column name (a string). Series also provide a number of high-level functions to help gain insight into your data.

In [None]:
col_a = df['Col A']
print(type(col_a))
col_a

In [None]:
col_a.max()  # A method applied to a series

In [None]:
df.max()  # A method applied to a DataFrame

Let's switch to a real dataset to gather some actual insight.

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd

data =  fetch_openml(data_id=40945, as_frame=True)
df = data.data
target = data.target

This is the "Titanic" dataset. Basically a dataset to predict whether someone lived or died based on a number of features. You can use `df.head()` to look at some of the data.

In [None]:
df.head()

The total number of entries will be the length of the dataframe.

In [None]:
len(df)

To gather basic statistics on a Series or a DataFrame, you can use `.describe()`.

In [None]:
df['age'].describe()

In [None]:
df['cabin'].describe()

For categorical data, you might want to know what values there are and the frequency of each. For this you can use `value_counts()`.

In [None]:
df['cabin'].value_counts()

You can also select a subset of columns by passing a list of column names.

In [None]:
# Notice the inner list
df[['age', 'fare']] 

Similar to what we did with `numpy`, you can grab a subset of data based on logical operations. For instance, what if you only want to look at passengers that are younger than 18?

In [None]:
df[df['age'] < 18]

You can use additional operators for boolean indexing. For instance, you might want to find the entries (rows or people in this case) that were in 1st or 2nd class.

In [None]:
df[ (df['pclass'] == 1.0) | (df['pclass'] == 2.0) ]

You can also use `isin` to check whether each value of a Series is in a list of values. So you can instead do the following to check for 1st or 2nd class:

In [None]:
df[ df['pclass'].isin([1, 2]) ]

Maybe you're interested in data where the age is known.

In [None]:
df[ df['age'].notna() ]

Maybe you're looking for people younger than 18 but only care about their age and name.

In [None]:
df[df['age'] < 18][['age', 'name']]  # A weird and awkward way to do it

You can instead use `.loc[]`, which takes the row selection and the column selection in a single call.

In [None]:
df.loc[df['age'] < 18, ['age', 'name']]  # Cleaner
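
The same pattern on a tiny made-up frame (the `people` data below is invented for illustration) makes it easier to see what is selected. Note that each condition needs its own parentheses when combined with `&` or `|`, and that comparisons against missing values evaluate to `False`.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Di'],
    'age': [31.0, 17.0, np.nan, 8.0],
    'pclass': [1, 3, 2, 1],
})

# Rows where age < 18, restricted to two columns
minors = people.loc[people['age'] < 18, ['age', 'name']]
print(minors)

# Combine conditions with & (and); a single column label returns a Series
young_first = people.loc[(people['age'] < 18) & (people['pclass'] == 1), 'name']
print(young_first.tolist())  # ['Di']
```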

If needed, you can access rows (or columns) directly by integer position using `iloc`.

In [None]:
df.iloc[0]

In [None]:
df.iloc[:3]

In [None]:
bool_indices = [True] + [False] * (len(df) - 1)
df.iloc[bool_indices]

In general, if you want to modify a `Series`, try to avoid iterating over the rows, and instead opt for `.apply()`. For instance if you needed to change the age of everyone by +2, you could do this:

In [None]:
df['age'].apply(lambda age: age + 2)
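
For simple arithmetic like this, operators on a `Series` already work element by element, so `.apply()` may not be needed at all. The sketch below uses a small made-up `ages` series to show that both approaches give the same result; `.apply()` remains useful for logic that can't be expressed with built-in operations.

```python
import pandas as pd

ages = pd.Series([22.0, 35.5, None, 4.0], name='age')

via_apply = ages.apply(lambda age: age + 2)
vectorized = ages + 2  # element-wise, missing values stay missing

print(vectorized)
print(via_apply.equals(vectorized))  # True
```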

It's often very helpful to view your data. Pandas has some built-in plotting functions that make it quite easy.

In [None]:
df['age'].plot.box(figsize=(8, 8))

In [None]:
df['age'].plot.kde(figsize=(8,8))

In [None]:
df['fare'].plot.hist(figsize=(8,8))

In [None]:
df = pd.concat((df, target), axis=1)  # Add the "survived" column

Let's say we have a hypothesis that only the rich survived. We could create a machine learning model and check the accuracy, but we can also just visualize if that's the case.

In [None]:
df_fare_survived = df.loc[ df['fare'].notna(), ['fare', 'survived']]
df_fare_survived.plot.scatter('fare', 'survived', figsize=(8,8))

You can then perform some basic statistics to find out more.

In [None]:
indices_survived = df_fare_survived['survived'] == '1'
df_fare_survived.loc[indices_survived, 'fare'].describe()

In [None]:
df_fare_survived.loc[~indices_survived, 'fare'].describe()

In [None]:
df_fare_survived['fare'].describe()

In [None]:
# Thresholds: the mean fare (33.295) plus 0, 1, 2 and 3 standard deviations (51.758),
# both taken from the describe() output above
stds = [33.295 + i*51.758 for i in range(4)]

for amount in stds:
    print(f'--- {amount} ---')
    paid_eq_or_more = df_fare_survived[df_fare_survived['fare'] >= amount]
    avg_survived = pd.to_numeric(paid_eq_or_more['survived']).mean()
    print(f'average survival: {avg_survived}')


### Tasks

Below is a set of tasks to attempt in order to familiarize yourself with pandas.

In [None]:
df_recovery = df.copy()  # For restoring later

1. How many first-class passengers were there that were younger than 18?

In [None]:
# Task 1

In [None]:
#@title Solution to task 1

len(df[ (df['age'] < 18) & (df['pclass'] == 1) ])

2. Get some descriptive statistics for `parch` and `sibsp`

In [None]:
# Task 2

In [None]:
#@title Solution to task 2

df[['parch', 'sibsp']].describe()

3. It turns out that our dataset is taking up too much space. Convert "female" to "F" and "male" to "M" within the `sex` column.

In [None]:
# Task 3

In [None]:
#@title Solution to task 3

def convert_sex(sex):
    if sex == 'female':
        return 'F'
    elif sex == 'male':
        return 'M'
    return sex

df['sex'] = df['sex'].apply(convert_sex)

4. What is the average fare paid by people in the age ranges `[0, 18), [18, 25), [25, 35), [35, )`?



In [None]:
# Task 4

In [None]:
# Restore
df = df_recovery

In [None]:
#@title Solution to task 4

# Create a list of half-open "ranges" [low, high) that we're interested in.
# 300 is used as an upper bound for the last one so that it includes everything else
ranges = [(0, 18), (18, 25), (25, 35), (35, 300)]

# Create a temporary list that contains the fares of each age subset. Each subset is renamed using .rename(i).
# Comparisons are used instead of isin(range(...)) so that fractional ages are kept and the upper bound is excluded
temp = [df.loc[(df['age'] >= low) & (df['age'] < high), 'fare'].rename(i) for i, (low, high) in enumerate(ranges)]

# Create a concatenation of the list above. Finally rename the columns to give us something helpful to look at.
# Otherwise the columns would still be named 0, 1, 2, 3
concat = pd.concat(temp, axis=1).rename(columns={i: f'[{r[0]}, {r[1]})' for i, r in enumerate(ranges)})

# Lastly you can just call the mean() over the newly created dataframe
concat.mean()
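
An alternative sketch for this kind of task is to bin the ages with `pd.cut` and then group by the bins. The `toy` frame below is made up so the example runs without the Titanic data; `right=False` makes each interval closed on the left and open on the right, matching the ranges in the task, and `np.inf` serves as the open upper bound.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age': [5.0, 17.5, 18.0, 24.0, 30.0, 60.0],
    'fare': [10.0, 20.0, 30.0, 50.0, 8.0, 100.0],
})

# Bin edges for [0, 18), [18, 25), [25, 35), [35, inf)
bins = [0, 18, 25, 35, np.inf]
age_group = pd.cut(toy['age'], bins=bins, right=False)

# Average fare per age group
mean_fare = toy.groupby(age_group, observed=True)['fare'].mean()
print(mean_fare)
```

Rows with a missing age fall into no bin and are left out of the groups.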

# Seaborn

In [None]:
import seaborn as sns

In [None]:
df_temp = df.copy()
sex2int = {'female': 0, 'male': 1}
survived2int = {'1': 1, '0': 0}

df_temp['sex'] = df_temp['sex'].apply(lambda x: sex2int[x])
df_temp['sex'] = pd.to_numeric(df_temp['sex'])
df_temp['survived'] = df_temp['survived'].apply(lambda x: survived2int[x])
df_temp['survived'] = pd.to_numeric(df_temp['survived'])
# df_temp.dtypes
sns.pairplot(df_temp, vars=['pclass', 'age', 'parch', 'sibsp', 'fare', 'sex'], hue='survived')

In [None]:
df_new = df.copy()
df_new['pclass'] = df_new['pclass'].astype('category')  # Convert to a category
sns.catplot(data=df_new, x="age", y="pclass", kind="box", height=8)

In [None]:
sns.catplot(
    data=df_new, x="age", y="pclass", hue="sex",
    kind="violin", bw=.25, cut=0, split=True, height=8
)

In [None]:
df_new['survived'] = pd.to_numeric(df_new['survived'])

sns.catplot(
    data=df_new, x="pclass", y="survived", col="sex",
    kind="bar", height=8, aspect=0.6
)

In [None]:
sns.catplot(data=df_new, x="age", y="pclass", kind="violin", color=".9", inner=None, height=8)
sns.swarmplot(data=df_new, x="age", y="pclass", size=4)

In [None]:
# `height` is a figure-level option (e.g. for catplot), so it is not passed to the axes-level violinplot
sns.violinplot(data=df_temp, x='pclass', y='age', hue='survived', split=True)

### Tasks

Take a look at the different [plotting options](https://seaborn.pydata.org/api.html) under Seaborn's API. Try to plot some different features.