# The Python Ecosystem

Here are some extra resources for learning Python:

**Getting Started with Python**:

* https://www.codecademy.com/learn/python
* http://docs.python-guide.org/en/latest/intro/learning/
* https://learnpythonthehardway.org/book/
* https://www.codementor.io/learn-python-online

**Learning Python in Notebooks**:

* http://mbakker7.github.io/exploratory_computing_with_python/

This is handy to always have available for reference:

**Python Reference**:

* https://docs.python.org/3/reference/


There are also many Python courses available via Datacamp. You can access all of their courses with the invite link on our [resources page](https://www.mdst.club/resources)!

## 0. Jupyter Notebook

Welcome to Jupyter Notebook! Jupyter lets you develop documents that combine code, visualizations, and explanatory text.

At MDST, we use Jupyter Notebooks for: 
- data cleaning and transformation
- statistical modeling
- data visualization
- machine learning
- ...

Cells are the basic units of organization in Jupyter Notebooks. You can start editing each cell by pressing ENTER or double clicking.

All our cells so far are _Markdown_ cells, meaning they just contain text! 

What is [Markdown](https://en.wikipedia.org/wiki/Markdown), you ask? It is a plain-text format where the information about the text's formatting is stored in the text itself.

That means (enter edit mode to see the actual Markdown text):

- To make something bold, put two asterisks on each side, **like so**.
- To italicize something, put an asterisk on each side, *like so*.
- To cross something out, put two tildes on each side, ~~like so~~.
- To embed a link in words, put the words in square brackets and put the link immediately after that in parentheses, [like so](https://www.yout-ube.com/watch?v=dQw4w9WgXcQ).

Most crucially, simply pressing ENTER once does NOT do anything in Markdown. You have to leave an empty line before every new paragraph.

Whereas these operations are done by clicking a button in MS Word or Google Docs, they are a part of the text in Markdown. 

Here is a Markdown [cheatsheet](https://www.markdownguide.org/cheat-sheet/). 

In [2]:
# Jupyter also has code cells for writing and running Python code.
# What's in this cell is not Python code but comments. You can start a comment by putting a hash sign (#) at the beginning of a line.
# Comments are for other humans only. The computer will ignore them when executing the rest of the code.
# Pro Tip: You can comment and uncomment many lines at once by highlighting them and pressing CTRL + / or CMD + /

You can run a cell by pressing CTRL + ENTER or CMD + ENTER. Running a cell will either render the contained Markdown as nice-looking text or execute the contained code.

You should run every cell in this notebook.

## 1. Data Types

### 1.0 Your First Python Program

In [3]:
# Tradition demands that we do this
# Try running this cell

print("Hello World")

Hello World


The `print()` function is how you output things for people to see in Python.

In [89]:
# Notebooks will automatically print the output of the last line of each cell when it is run

413 * 5791

2391683

### 1.1 Data Types

#### 1.1.0 Ints and Floats

Python distinguishes between integers and decimal numbers (floats).

In [15]:
type(0)

int

In [18]:
type(0.0)

float

Basic arithmetic is straightforward in Python.

In [19]:
3 + 2

5

In [20]:
1.1 - 9.0

-7.9

In [21]:
1 * 5

5

In [8]:
# When two numbers, regardless of whether they are int or float, are divided, Python returns the result as if the operation is done on a calculator
# This is known as float division
print(1/2)
print(1.5/2.4)

0.5
0.625


In [11]:
# There is also integer division (aka floor division)
# In Python, it always rounds the float division result down to the nearest integer
14 // 5

2

In [24]:
# You can also find the remainders of divisions
# Also known as taking the modulus
13 % 5

3
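
"Rounding down" matters most with negative numbers. The short sketch below (the variable names are only illustrative) shows how `//` and `%` behave there, and how the built-in `divmod()` returns both results at once.

```python
# Floor division rounds toward negative infinity, not toward zero
print(14 // 5)     # 2
print(-14 // 5)    # -3, because -2.8 rounds down to -3

# The remainder is chosen so that (a // b) * b + (a % b) == a
print(-14 % 5)     # 1, since (-3) * 5 + 1 == -14

# divmod() returns the quotient and the remainder together
quotient, remainder = divmod(14, 5)
print(quotient, remainder)   # 2 4
```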

In [12]:
# exponent
10 ** 3

1000

ints and floats are mostly interchangeable

In [26]:
3 * 3.0

9.0

In [27]:
9.8 // 2

4.0

and can also be cast (i.e. converted) to the other type

In [28]:
float(3)

3.0

In [29]:
int(2.9)

2
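
Note that `int()` does not round: it drops the decimal part. A brief sketch comparing it with the built-in `round()`, and showing that strings holding numbers can be cast as well:

```python
# int() truncates toward zero instead of rounding
print(int(2.9))      # 2
print(int(-2.9))     # -2

# round() picks the nearest integer
print(round(2.9))    # 3

# Strings that look like numbers can be cast too
print(int("42") + 1)       # 43
print(float("3.5") * 2)    # 7.0
```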

#### 1.1.1 Strings

Strings are Python's internal representation of text.

In [14]:
# They can either be surrounded by double quotes...
type("apple")

str

In [15]:
# or single quotes.
type('apple')

str

In [34]:
# You can piece two strings together (aka concatenate) using the plus sign
"Hello" + " World"

'Hello World'

Python provides many functions for manipulating strings. 

In [16]:
# Uppercase
"like so".upper()

'LIKE SO'

In [17]:
# Lowercase
"LIKE SO".lower()

'like so'

In [18]:
# Title case
"like so".title()

'Like So'

In [19]:
# Count the number of characters, including whitespace
len("like so")

7

In [20]:
# Remove spaces on either side of a string
"    like so  ".strip()

'like so'

In [21]:
# Split a string into a list of words 
"like so".split()

['like', 'so']

You can find a comprehensive list of these functions [here](https://www.w3schools.com/python/python_ref_string.asp).
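
A few more string methods that tend to come up often, as a small sketch (the `sentence` variable is only an illustration). Each method returns a new string and leaves the original untouched.

```python
sentence = "like so"

# Replace part of a string
replaced = sentence.replace("so", "this")
print(replaced)                       # like this

# Join a list of words back together with a separator
joined = "-".join(sentence.split())
print(joined)                         # like-so

# Check how a string starts, or whether it contains a piece of text
print(sentence.startswith("like"))    # True
print("so" in sentence)               # True

# The original string is unchanged
print(sentence)                       # like so
```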

#### 1.1.2 Boolean Values

There are two boolean values in Python: `True` and `False`. They are case sensitive and must be typed exactly as shown.

(There is also `None`, which represents the absence of a value, but don't worry about that right now.)

Now time for some basic [boolean algebra](https://en.wikipedia.org/wiki/Boolean_algebra).

You can flip a boolean value to its opposite with `not`

In [40]:
print(not True)
print(not False)

False
True


In [22]:
# and, or conjunction, only evaluates to True when every boolean value involved is True
print(True and True)
print(True and False)
print(False and False)

True
False
False


In [23]:
# or, or disjunction, evaluates to True whenever at least one involved boolean value is True
print(True or True)
print(True or False)
print(False or False)

True
True
False


In [28]:
# All the non-zero numbers are treated as True
print(bool(1 and True))
print(bool(0 and True))

True
False


In [30]:
# All non-empty strings, even if the string is all whitespaces, are treated as True
print(bool(True and ""))
print(bool(True and "    "))
print(bool(True and "False"))

False
True
True


We will use boolean values much more extensively when we encounter control flow and `if` statements.
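
The cells above wrap their expressions in `bool()` for a reason: when the operands are not booleans, `and` and `or` return one of the operands rather than `True` or `False`. A minimal sketch with illustrative variable names:

```python
first = 1 and True        # both operands are truthy, so `and` returns the last one
second = 0 and True       # 0 is falsy, so `and` stops early and returns 0
third = "" or "default"   # "" is falsy, so `or` moves on and returns "default"

print(first)     # True
print(second)    # 0
print(third)     # default

# bool() converts any value to True or False
print(bool(0.0), bool(-1))   # False True
```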

#### 1.1.3 Variables

You can store data inside named variables, and refer back to the data with its name.

Variable names cannot begin with a digit but there are not many restrictions beside that.

In [47]:
# Python automatically figures out what type your variables are
# Once the cell is run, the variables are made available everywhere else in the notebook

x = 4
y = 5


In [33]:
# We can do arithmetic with those variables in another cell
4*x + 5*y

41

In [49]:
# There are some shorthands for updating variables. 
# Instead of x = x+2
# We can simply do:

x += 2
x

# You can do the same for -, *, and /

6

In [None]:
# In Python, snake case is the norm for multi-word variable names

michigan_data_science_club_abbreviation = "MDST"

In [35]:
# The values stored inside the variables can be overwritten later by referring back to the variable name
# Python allows changing the data type of the variable when it is overwritten

x = "like"
y = " so"

x + y

'like so'


### 1.2 Containers

#### 1.2.0 List

A list is a collection of data. In Python, a list can contain different types of data.

In [7]:
# You can create (aka initialize) an empty list with the square brackets
empty_list = []

# or with the list() command
another_empty_list = list()

In [1]:
# Or you can create lists with elements already inside by listing them in the square brackets
nonempty_list = [32, 'MDST', True]

Once a list is created, you can retrieve elements inside with its index.

Python (like most other languages) uses 0-indexing, meaning the first element is at index 0.

In [2]:
# Retrieve an element by putting its index in a square bracket after the list's name
nonempty_list[1]

'MDST'

In [3]:
# This works similarly for strings
mdst = "MDST"
mdst[2]

'S'

In [4]:
# You can chain indices as well
nonempty_list[1][2]

'S'

Negative numbers index from the end. Think of it as -1 wrapping around to the last element in the list. -2 is then the second last element in the list etc.

In [5]:
nonempty_list[-2]

'MDST'

Be careful to not use an index that doesn't exist in a list. Python won't know what to do and will throw an error.

In [8]:
# Getting the first element in an empty list doesn't make sense.

print(empty_list[0])


IndexError: list index out of range


In [None]:
# Neither does finding the fifth element in a three-element list

print(nonempty_list[4])

You can use indexing to get sublists/substrings. This is called slicing.

Syntax: `[start:end:step]`

The slice includes the start index (inclusive) but not the end index (exclusive).

In [9]:
sample_list = [0, 1, 2, 3 , 4, 5, 6, 7, 8, 9, 10]

In [10]:
# Getting the fourth to eighth element
# If you don't specify the step, Python assumes you want every element

sample_list[3:8]

[3, 4, 5, 6, 7]

In [11]:
# When end is not specified, Python includes everything including and after the start index
sample_list[5:]

[5, 6, 7, 8, 9, 10]

In [12]:
# Similarly, when start is not specified, Python includes everything before the end index but excludes the end index itself
sample_list[:-5]

[0, 1, 2, 3, 4, 5]

In [13]:
# When neither start nor end is specified, Python applies the step argument to the entire list
# step = 2 means to take 2 steps forward each time an element is selected. In other words, it selects every other element

sample_list[::2]

[0, 2, 4, 6, 8, 10]

In [14]:
# A neat trick for reversing a list, try to understand what it's doing
sample_list[::-1]

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
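
To see what the negative step is doing, here is a small sketch on a fresh list (named `numbers` only for illustration). A negative step walks backwards, so `start` should be larger than `end`.

```python
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Start at index 8, walk backwards 2 at a time, stop before index 2
print(numbers[8:2:-2])    # [8, 6, 4]

# Negative start: the last three elements
print(numbers[-3:])       # [8, 9, 10]

# Slicing works the same way on strings
print("MDST"[::-1])       # TSDM

# Unlike single indices, slices that run past the end do not raise an error
print(numbers[5:100])     # [5, 6, 7, 8, 9, 10]
```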

You can add elements to an existing list ...

In [15]:
# at the end ...
sample_list.append(11)

# or somewhere in the middle
# syntax: insert(index, new_value)
sample_list.insert(1, 0.5)

print(sample_list)

[0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]


or remove an element ...

In [16]:
# remove the first instance of a given value in the list
sample_list.remove(0.5)

# or remove the element at a specified index
sample_list.pop(0)

sample_list

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

or change an element using its index ...

In [17]:
sample_list[-1] = 12
sample_list

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]

or many other things ...

See the full range of possibility [here](https://www.w3schools.com/python/python_ref_list.asp).

If you thought typing out every number from 0 to 10 was an inefficient way of creating a list, you will be glad to learn about the `range()` function. 

Syntax: `range(start (inclusive), end (exclusive), step)`

Pro tip: if you only specify `end`, Python will give you every integer from 0 up to the one before `end`.

In [45]:
# let's recreate the list of numbers from 0 to 10 using range()
# The output of range()'s type is range, not list. We need to convert it with list()
sample_list = list(range(11))
sample_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
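
The `start` and `step` arguments work much like they do in slicing. A short sketch with illustrative variable names:

```python
# Even numbers from 2 up to (but not including) 11
evens = list(range(2, 11, 2))
print(evens)        # [2, 4, 6, 8, 10]

# A negative step counts down
countdown = list(range(10, 0, -3))
print(countdown)    # [10, 7, 4, 1]

# If start is not below end (with a positive step), the range is empty
print(list(range(5, 5)))   # []
```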

#### 1.2.1 Tuple

Python tuples are list-like data structures with a few important differences.

In [19]:
# You can create them with parentheses or with the tuple() command

empty_tuple = tuple()

sample_tuple = (1, 2, 3, 4)

print(empty_tuple, sample_tuple)

() (1, 2, 3, 4)


Indexing tuples is just like indexing lists

In [21]:
print(sample_tuple[1], sample_tuple[-3])

2 2


Crucially, tuples can NOT be modified once created. 

Tuples are *immutable*. While this property makes them less versatile than lists, it sometimes comes in handy. For example, tuples can be used as keys in dictionaries (next section).

In [22]:
# try to overwrite an item in a tuple

try:
    sample_tuple[-1] = 10
except TypeError as e:
    print(e)

'tuple' object does not support item assignment
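
As a preview of why immutability is useful, the sketch below uses tuples as dictionary keys (dictionaries are covered in the next section). The names `point` and `labels` are made up for illustration. It also shows tuple unpacking, a convenient way to pull the items of a tuple into separate variables.

```python
point = (3, 4)

# Unpacking assigns each item of the tuple to its own variable
x_coord, y_coord = point
print(x_coord, y_coord)    # 3 4

# Tuples can be dictionary keys because they can never change
labels = {(0, 0): "origin", (3, 4): "corner"}
print(labels[point])       # corner

# Lists are mutable, so Python refuses to use them as keys
try:
    labels[[0, 0]] = "list key"
except TypeError as e:
    print(e)
```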


#### 1.2.2 Dictionary

A dictionary stores pairs of keys and values, where each key is associated with a value.

In [23]:
# You can create an empty dictionary in two ways
empty_dict1 = dict()
empty_dict2 = {}

print(empty_dict1, empty_dict2)

{} {}


In [24]:
# You can also create dictionaries with key:value pairs already inside 
panda_express_pricing = {"Bowl":5.80, "Plate":6.80, "Bigger Plate":8.30}

You index a dictionary with a key and get its associated value.

In [25]:
bowl_price = panda_express_pricing["Bowl"]
bowl_price

5.8

Be careful to not index a key that doesn't exist in the dictionary because that will cause an error.

If you are not sure whether a key is in the dictionary or not, use the [get](https://www.w3schools.com/python/ref_dictionary_get.asp) method to be safe.

In [29]:
# try to eat buffet at Panda express
buffet_price = panda_express_pricing["Buffet"]

KeyError: 'Buffet'
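
A minimal sketch of the safer `get` approach, using a copy of the pricing dictionary under the illustrative name `menu_prices`:

```python
menu_prices = {"Bowl": 5.80, "Plate": 6.80, "Bigger Plate": 8.30}

# get() returns None instead of raising an error when the key is missing
print(menu_prices.get("Buffet"))         # None

# You can also supply your own fallback value
print(menu_prices.get("Buffet", 0.0))    # 0.0

# For keys that exist, get() behaves like normal indexing
print(menu_prices.get("Bowl", 0.0))      # 5.8
```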

It follows that you can change the value associated with a key


In [26]:
# let's say Panda Express has a sale on the bowls
panda_express_pricing["Bowl"] = 5.00
bowl_price = panda_express_pricing["Bowl"]
bowl_price

5.0

There is, however, no easy way to modify the key associated with a value.
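
One common workaround, sketched here on a copy of the dictionary (the name `menu_prices` and the new key are only illustrative), is to remove the old key with `pop()` and store its value under a new key.

```python
menu_prices = {"Bowl": 5.80, "Plate": 6.80, "Bigger Plate": 8.30}

# pop() removes the key and returns its value, which we store under the new key
menu_prices["Large Plate"] = menu_prices.pop("Bigger Plate")

print(menu_prices)   # {'Bowl': 5.8, 'Plate': 6.8, 'Large Plate': 8.3}
```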

In [30]:
# You can see a list of all the keys in a dictionary
panda_express_pricing.keys()

dict_keys(['Bowl', 'Plate', 'Bigger Plate'])

In [31]:
# Or a list of all values
panda_express_pricing.values()

dict_values([5.0, 6.8, 8.3])

In [32]:
# Or a list of key value pairs, represented as tuples
panda_express_pricing.items()

dict_items([('Bowl', 5.0), ('Plate', 6.8), ('Bigger Plate', 8.3)])

See a list of everything you can do with dictionaries [here](https://www.w3schools.com/python/python_ref_dictionary.asp).

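#### 1.2.3 Set

Sets store unique elements.

In [34]:
# An empty set must be created with set() because {} creates an empty dictionary
# Non-empty sets can be built from a list with set(...) or written with braces, e.g. {1, 2, 3}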

s = set([1,2,3,1,2,3])
s

{1, 2, 3}

In [35]:
# Add new elements to a set
s.add(3)
s.add(4)
s

{1, 2, 3, 4}

In [36]:
# Remove elements in the set 
s.discard(1)
s.discard(2)

There are many set operations that can be performed between two sets. We will not go into them here. You can see a list on this [page](https://www.w3schools.com/python/python_ref_set.asp).
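
For the curious, here is a quick sketch of the four most common set operations. The sets `a` and `b` are made up for illustration, and `sorted()` is used only to print the results in a predictable order.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(sorted(a | b))   # union: [1, 2, 3, 4, 5]
print(sorted(a & b))   # intersection: [3, 4]
print(sorted(a - b))   # difference (in a but not in b): [1, 2]
print(sorted(a ^ b))   # symmetric difference (in exactly one set): [1, 2, 5]
```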

#### 1.2.4 Container Utilities

You can use `len()` to find the number of items in each of the above four containers.

In [37]:
l = [1,2,3]
t = (1,2,3)
d = {1:'a', 2:'b', 3:'c'}
s = set([1, 2, 3])

print(len(l), len(t), len(d), len(s))

3 3 3 3


And use the `in` keyword to check if an element is in the container or not.

(For dictionaries, you can only use this to check whether a key is in the dictionary or not)

In [38]:
print(1 in l)
print(4 in t)
print(2 in d)
print(0 in s)

True
False
True
False


## 2. Control Flow

You can use `if` statements to execute different actions in different scenarios.

Before we dive in, a quick aside on comparing numbers:
- Use `==` to check equality
- Use `!=` to check inequality
- Use `<`, `>`, `>=`, and `<=` to compare two numbers.

In [41]:
# Here is the general idea of if statements
# if (condition evaluates to true):
#   execute code here

to_print_or_not_to_print = True

if to_print_or_not_to_print:
    # Most code editors will automatically indent the lines inside an if statement for you 
    # It doesn't matter whether you use tabs or spaces to indent or how much you indent (two or four spaces are common)
    # Just be consistent! Your code will not work without consistent indentation!
    
    print("The first block of code is executed")

to_print_or_not_to_print = False

if to_print_or_not_to_print:
    print("The second block of code is executed")


The first block of code is executed


We can use more complex conditions for `if` statements.

In [42]:
if 4 < 5 and 6 >= 6 and len(list(range(3))) == 3:
    print("The first block of code is executed")

if 4 != 4 or 6 > 7 or -1 < 0:
    print("The second block of code is executed")

The first block of code is executed
The second block of code is executed


An `if ... else` scheme can handle both when the condition is true and false.

In [43]:
to_print_or_not_to_print = True

if to_print_or_not_to_print:
    # indented
    print("printing")
# unindented
else:
    # indented
    print("not printing")

to_print_or_not_to_print = False

if to_print_or_not_to_print:
    print("printing")
else:
    print("not printing")

printing
not printing


`if ... elif ... else` schemes can handle many different scenarios.

You can have `elif` without `else` but all `elif` must appear before `else`.

In [77]:
uniqname = "ENTER YOUR UNIQNAME HERE"

if len(uniqname) <= 4:
    print("Short")
elif len(uniqname) < 8:
    print("Medium")
else: 
    print("Long")

Long


## 3. Iterating

### 3.0 For Loops

Lists, tuples, sets, dictionaries, strings, and ranges are all *iterables*. That just means we can move through them in a certain order.

This property is useful for simplifying repeated actions. 

Say we have a list of numbers and we want to print each of them, doubled. 

We can use the index to access, multiply, and print each of them but that's inefficient.

For loops to the rescue.

In [60]:
nums = list(range(5))

for num in nums: 
    print(num*2)


0
2
4
6
8


In [None]:
# What is actually going on here?
#
# in nums specifies the iterable to go through, nums in this case
# num is what is called the loop variable. i, j, and k are common loop variable names but num makes more sense here
#
# for num in nums: 
#     indent!
#     num is set to an element in the nums list and the action is executed
#     print(num*2)
#     num is set to the next element in the nums list
#
# in this case, we iterated through the elements of the list

In [52]:
# Another common pattern is to iterate through the indices 
# let's print out the indices that have an even number on them

for i in range(len(nums)):
    # range(len(nums)) gives all the indices in the nums list
    # nums has 5 elements so range(len(nums)) looks like 0, 1, 2, 3, 4
    # you will see this all the time in for loops

    if nums[i] % 2 == 0:
        print(i)

0
2
4


One more example: 

Make a new list containing the items in nums squared


In [61]:
squared_nums = []

for num in nums:
    squared_nums.append(num ** 2)

squared_nums

[0, 1, 4, 9, 16]

Sometimes it is useful to iterate through both the element and index at the same time. 

Look into [`enumerate`](https://realpython.com/python-enumerate/).
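
A minimal sketch of `enumerate`, using a made-up list of cereal names so that the indices and the elements are easy to tell apart:

```python
cereals = ["Trix", "Kix", "Life"]

# enumerate() yields (index, element) pairs, which we unpack into two variables
for i, cereal in enumerate(cereals):
    print(i, cereal)

# The optional start argument changes where the counting begins
numbered = list(enumerate(cereals, start=1))
print(numbered)   # [(1, 'Trix'), (2, 'Kix'), (3, 'Life')]
```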

### 3.1. List Comprehension (Optional)

Here we present a nice feature of Python that allows creating lists using a shorthand of for loops

In [57]:
# every letter in MDST
letters = [letter for letter in "MDST"]
letters

['M', 'D', 'S', 'T']

In [62]:
# Transform each element
# Let's redo the squared_nums example from the previous section

squared_nums = [num**2 for num in nums]
squared_nums

[0, 1, 4, 9, 16]

In [63]:
# Transform each element differently based on some conditions
# Square the number if it is even, else cube it 

squares_and_cubes = [num**2 if num % 2 == 0 else num**3 for num in nums]
squares_and_cubes

[0, 1, 4, 27, 16]

In [64]:
# Filter the elements
# Triple the number if it is odd 

triples = [num*3 for num in nums if num % 2 == 1]
triples

[3, 9]

In [66]:
# chained comprehension 
# numbers from 0 to 20, in 3 number segments
# you really shouldn't nest more than 2 levels

segments = [[i for i in range(start, start+3)] for start in range(0, 20, 3)]
segments

[[0, 1, 2],
 [3, 4, 5],
 [6, 7, 8],
 [9, 10, 11],
 [12, 13, 14],
 [15, 16, 17],
 [18, 19, 20]]

## 4. Functions

### 4.0 Import & Libraries

Libraries (aka packages) are collections of code that other people have developed for you to use. Python has tons of cool and interesting libraries.

You can start using them in your notebooks with the `import` keyword.

In [71]:
# There is always a relevant xkcd 

import antigravity

Most libraries are more elaborate and contain many functionalities.

In [74]:
# Once a library is imported, you can start using the functions and methods they have.
import random
random.randint(1, 10)

4

In [75]:
# If you know what function you need, you can also import it specifically. 
from random import randint

# If you do it this way, you can use randint directly instead of typing out random.randint()

randint(1, 10)

6

In [76]:
# Sometimes function or library names are very long and you might not want to type them out every time
# You can use the as keyword to rename imports

from random import randrange as r 

r(1, 10)

2

### 4.1 Built-in functions

We present some more built-in functions that may be useful for completing the checkpoints.

You can find documentation for all of them [here](https://docs.python.org/3/library/functions.html).

In [68]:
# check the type of a variable
print(isinstance('a',str))

x = []
print(isinstance(x, int))

True
False


In [77]:
max([3,4,5])

5

In [78]:
min([-3,3,9])

-3

In [None]:
sum([1,3,5])

9

In [81]:
round(3.8)

4

In [82]:
round(3.3)

3

In [None]:
abs(-3)

3

In [69]:
# format allows easy modification of strings

name = "enter your name here"

print("My name is {}".format(name))

My name is enter your name here
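
Since Python 3.6, f-strings offer a shorter way to do the same thing. In this sketch the values of `name`, `count`, and `price` are made up for illustration.

```python
name = "Ada"
count = 3
price = 5.8

# Put an f before the opening quote, then use variables directly inside {}
message = f"My name is {name} and I ate {count} bowls"
print(message)    # My name is Ada and I ate 3 bowls

# Expressions and number formatting work inside the braces too
total = f"Total: ${price * count:.2f}"
print(total)      # Total: $17.40
```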


### 4.2 Custom functions

Functions are great ways to reduce code duplication and repetition.

Functions can be used to carry out specific actions. We will slowly build up to a function that outputs custom greeting messages.

Let's start by having the function just print "hi".

In [90]:
# The first line in a function is the function header. It starts with the def keyword, followed by the function name
def greet():
    # Indent!
    print("Hi")

greet()

Hi


Not exactly a custom message. It would be nice if we could greet people by their names.

In [92]:
# We can shape a function's behavior by adding arguments. These appear in the parenthesis after the function name
# Note: the name on this line names an argument to the greet function
def greet(name):
    print("Hi " + name)

# Note: the name on this line refers to the name variable, which you should've changed to your own name in the previous section
greet(name)

Hi enter your name here


Maybe you are excited to see the person, in which case some exclamation marks are in order. 

Usually, 1 is good. 

In [94]:
# You can set default values for arguments. The function will use those defaults if the argument is not provided.
# In contrast, arguments without default values have to be specified
def greet(name, num_exclamation=1):
    print("Hi " + name + '!'*num_exclamation)

greet(name)

Hi enter your name here!


In [96]:
# You can of course use different values for all your default arguments.
# Python will try to match arguments using the order listed in the header
greet(name, 3)

# or you can mix up the order by referring to the arguments by their names
greet(num_exclamation=2, name=name)

Hi enter your name here!!!
Hi enter your name here!!


Functions don't have to interface with users directly. They can also be used to perform computations and return the results.

In [97]:
def round_to_hundreds(num):
    rounded = round(num / 100) * 100

print(round_to_hundreds(168))

Weird, we expected 200 but received `None`.

This is because we forgot to get the function to make its output available for other parts of the program to use.

In its current state, the output of the function (`rounded`) is inaccessible.

This is where `return` comes into play.

In [99]:
def round_to_hundreds(num):
    rounded = round(num / 100) * 100

    # returning makes the output available to other code
    return rounded

print(round_to_hundreds(168))

200
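
Putting default arguments and `return` together, here is a sketch of a more general version. The function name `round_to_nearest` and its `base` argument are hypothetical, not part of this notebook's exercises.

```python
def round_to_nearest(num, base=100):
    # Divide by the base, round to a whole number, then scale back up
    return round(num / base) * base

# The returned value can be stored and reused like any other value
result = round_to_nearest(168)
print(result)          # 200
print(result + 1)      # 201

# Overriding the default, by position or by name
print(round_to_nearest(168, 10))        # 170
print(round_to_nearest(168, base=50))   # 150

# Careful: round() sends exact halves to the nearest even number
print(round_to_nearest(250))            # 200, because round(2.5) is 2
```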


## 5. Numpy

Numpy is short for *numerical python*, a library built for optimized operations on large arrays and matrices. It is the first of the three big libraries used for data science!

In [80]:
import numpy as np

### 5.0 Arrays

Numpy arrays can be created from a Python list

In [81]:
a = [1,2,3,4,5,6]
b = np.array(a)
b

array([1, 2, 3, 4, 5, 6])

Right now, it looks an awful lot like a Python list, but there are some key points you should know.

Numpy arrays are:
- homogeneous (all elements in an array have the same type)
- multidimensional

In [82]:
# Homogeneous: all numpy arrays have an associated data type
# numbers are usually ints or floats
b.dtype

dtype('int32')

In [83]:
# Multidimensional: numpy arrays can have arbitrarily many dimensions
# We can reshape b into a 3x2 matrix. This means 3 rows and 2 columns
# Note: this doesn't change b. That's why we assign it to a new variable: m
m = b.reshape(3, 2)
m

array([[1, 2],
       [3, 4],
       [5, 6]])

In [84]:
# Each dimension is called an axis
# The size across each axis is called the shape
# These are two very important concepts!
m.shape

(3, 2)

In [85]:
# One numpy function worth highlighting is transpose 
# Essentially, the first row becomes the first column, the second row becomes the second column etc.

m = m.transpose()
m

array([[1, 3, 5],
       [2, 4, 6]])
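
Two shortcuts are worth knowing here, sketched below with illustrative names (`values`, `grid`): passing `-1` to `reshape` lets numpy work out that dimension for you, and `.T` is shorthand for `transpose()`.

```python
import numpy as np

values = np.arange(1, 7)        # like range(), but returns an array: 1 through 6

# -1 tells numpy to infer the size of that axis from the number of elements
grid = values.reshape(2, -1)
print(grid.shape)      # (2, 3)

# .T is shorthand for transpose()
print(grid.T.shape)    # (3, 2)

# Index a 2D array with [row, column]
print(grid[1, 2])      # 6
```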

### 5.1 Math

Numpy gives us a lot of math functions to work with. You can find them all in the [documentation](https://numpy.org/doc/stable/reference/routines.math.html).

In [97]:
np.sum(b)

21

In [98]:
np.mean(b)

3.5

In [None]:
# for convenience, you can also call
b.mean()

You can also apply these functions by axis

In general, `axis=0` means to operate down the rows (giving one result per column) and `axis=1` means to operate across the columns (giving one result per row).

In [86]:
# summing by rows 
print(np.sum(m, axis=1))

# summing by columns
print(np.sum(m, axis=0))

[ 9 12]
[ 3  7 11]


In [88]:
# Unlike a regular list, you can do arithmetic on numpy arrays directly.
# In most cases, numpy will apply the arithmetic operations to each element. 
# Sometimes it can get a bit more complicated...

print(m*3)
print(m+3)
print(np.power(m,2))

[[ 3  9 15]
 [ 6 12 18]]
[[4 6 8]
 [5 7 9]]
[[ 1  9 25]
 [ 4 16 36]]
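
The "more complicated" case is arithmetic between arrays of different shapes. Numpy calls this broadcasting: it stretches the smaller array to match the larger one when the shapes are compatible. A minimal sketch that rebuilds `m` by hand; `row` and `col` are illustrative names.

```python
import numpy as np

m = np.array([[1, 3, 5],
              [2, 4, 6]])

# A length-3 array is added to every row of the 2x3 matrix
row = np.array([10, 20, 30])
print(m + row)
# [[11 23 35]
#  [12 24 36]]

# A 2x1 column is added to every column of the matrix
col = np.array([[100], [200]])
print(m + col)
# [[101 103 105]
#  [202 204 206]]

# For comparison, multiplying a regular list repeats it instead
print([1, 2, 3] * 2)   # [1, 2, 3, 1, 2, 3]
```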


## 6. Pandas

Pandas is another Python library which we will be using _a lot!_ It lets us look at data in tabular format and is well integrated with other libraries for plotting, machine learning, etc.

In [100]:
import pandas as pd

### 6.0 Dataframes & Series

Pandas puts data into dataframes, which are made up of series.

In [101]:
# here, we're reading in data from a 'csv', or comma-separated value, file 
df = pd.read_csv("../data/cereal.csv")
type(df)


pandas.core.frame.DataFrame

A dataframe is like a table and series are the columns:

In [102]:
df

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.00,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.50,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,Triples,G,C,110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
73,Trix,G,C,110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.00,27.753301
74,Wheat Chex,R,C,100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
75,Wheaties,G,C,100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.00,51.592193


We can use head(), tail(), or sample() to take a look at the data

In [103]:
# head returns the first 5 rows in the dataframe, tail returns the last 5
df.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [104]:
df.sample()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
17,Corn Pops,K,C,110,1,0,90,1.0,13.0,12,20,25,2,1.0,1.0,35.782791


Each column is a pandas Series (pd.Series)

In [105]:
df["name"]

0                     100% Bran
1             100% Natural Bran
2                      All-Bran
3     All-Bran with Extra Fiber
4                Almond Delight
                ...            
72                      Triples
73                         Trix
74                   Wheat Chex
75                     Wheaties
76          Wheaties Honey Gold
Name: name, Length: 77, dtype: object

In [106]:
type(df["name"])

pandas.core.series.Series
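
To make the dataframe/series relationship concrete, the sketch below builds a tiny dataframe by hand from a dictionary of columns. The name `small_df` is illustrative; the values are copied from three of the cereal rows shown above.

```python
import pandas as pd

# Each dictionary key becomes a column name, each list becomes a column
small_df = pd.DataFrame({
    "name": ["100% Bran", "All-Bran", "Trix"],
    "calories": [70, 70, 110],
    "protein": [4, 4, 1],
})

print(small_df.shape)                # (3, 3): 3 rows, 3 columns

# Selecting one column gives a Series
print(type(small_df["calories"]))

# Selecting a list of columns gives a smaller DataFrame
print(type(small_df[["name", "protein"]]))
```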

Series are similar to numpy arrays

In [107]:
df["carbo"].mean()

14.597402597402597

In [108]:
# we can turn pd.Series into a numpy array
df["carbo"].to_numpy()

array([ 5. ,  8. ,  7. ,  8. , 14. , 10.5, 11. , 18. , 15. , 13. , 12. ,
       17. , 13. , 13. , 12. , 22. , 21. , 13. , 12. , 10. , 21. , 21. ,
       11. , 18. , 11. , 14. , 14. , 12. , 14. , 13. , 11. , 15. , 15. ,
       17. , 13. , 12. , 11.5, 14. , 17. , 20. , 21. , 12. , 12. , 16. ,
       16. , 16. , 17. , 15. , 15. , 21. , 18. , 13.5, 11. , 20. , 13. ,
       10. , 14. , -1. , 14. , 10.5, 15. , 23. , 22. , 16. , 19. , 20. ,
        9. , 16. , 15. , 21. , 15. , 16. , 21. , 13. , 17. , 17. , 16. ])

The key difference is that Series are indexed

In [109]:
# See the 0, 1, ... 76 on the left? That is the index of each item.
# Right now they are just positions, but theoretically they can be any identifier for the row

df["carbo"].index

RangeIndex(start=0, stop=77, step=1)
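
As a small sketch of an index that is not just a position, here is a Series built by hand with cereal names as its index. The variable name `protein` is illustrative; the values come from the rows shown earlier.

```python
import pandas as pd

protein = pd.Series([4, 4, 1],
                    index=["100% Bran", "All-Bran", "Trix"],
                    name="protein")

# The index now holds labels instead of 0, 1, 2
print(protein.index.tolist())   # ['100% Bran', 'All-Bran', 'Trix']

# Values can be looked up by their label
print(protein["Trix"])          # 1

# The underlying data is still available as a numpy array
print(protein.to_numpy())       # [4 4 1]
```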

### 6.1 Pandas Indexing

The index in a pandas series/dataframe can be any list of values (row number, ID, time, etc.)

In [110]:
# a range index is just a numeric index
df.index

RangeIndex(start=0, stop=77, step=1)

In [111]:
# see how the leftmost column is now replaced with the cereal names
df_ = df.set_index('name')
df_.head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [112]:
df_.index

Index(['100% Bran', '100% Natural Bran', 'All-Bran',
       'All-Bran with Extra Fiber', 'Almond Delight',
       'Apple Cinnamon Cheerios', 'Apple Jacks', 'Basic 4', 'Bran Chex',
       'Bran Flakes', 'Cap'n'Crunch', 'Cheerios', 'Cinnamon Toast Crunch',
       'Clusters', 'Cocoa Puffs', 'Corn Chex', 'Corn Flakes', 'Corn Pops',
       'Count Chocula', 'Cracklin' Oat Bran', 'Cream of Wheat (Quick)',
       'Crispix', 'Crispy Wheat & Raisins', 'Double Chex', 'Froot Loops',
       'Frosted Flakes', 'Frosted Mini-Wheats',
       'Fruit & Fibre Dates; Walnuts; and Oats', 'Fruitful Bran',
       'Fruity Pebbles', 'Golden Crisp', 'Golden Grahams', 'Grape Nuts Flakes',
       'Grape-Nuts', 'Great Grains Pecan', 'Honey Graham Ohs',
       'Honey Nut Cheerios', 'Honey-comb', 'Just Right Crunchy  Nuggets',
       'Just Right Fruit & Nut', 'Kix', 'Life', 'Lucky Charms', 'Maypo',
       'Muesli Raisins; Dates; & Almonds', 'Muesli Raisins; Peaches; & Pecans',
       'Mueslix Crispy Blend', 'Multi-Gr

Indexing in pandas is a bit different than in built-in Python

`iloc` is used to index by row number in a dataframe

In [113]:
# this returns the first row of the dataframe
df_.iloc[0]

mfr              N
type             C
calories        70
protein          4
fat              1
sodium         130
fiber           10
carbo            5
sugars           6
potass         280
vitamins        25
shelf            3
weight           1
cups          0.33
rating      68.403
Name: 100% Bran, dtype: object

`loc` is used to index by the series/dataframe index

In [114]:
df_.loc['All-Bran']

mfr               K
type              C
calories         70
protein           4
fat               1
sodium          260
fiber             9
carbo             7
sugars            5
potass          320
vitamins         25
shelf             3
weight            1
cups           0.33
rating      59.4255
Name: All-Bran, dtype: object
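
To make the difference between position-based and label-based indexing concrete, here is a minimal sketch on a small made-up dataframe (the `toy` data and its labels are hypothetical, not the cereal dataset). It also illustrates one detail worth knowing: slices with `iloc` exclude the end position, while slices with `loc` include the end label.

```python
import pandas as pd

# Hypothetical toy data, not the cereal dataset
toy = pd.DataFrame(
    {"calories": [70, 120, 50], "protein": [4, 3, 4]},
    index=["Bran", "Granola", "Fiber"],
)

first_by_position = toy.iloc[0]    # row at position 0
first_by_label = toy.loc["Bran"]   # row whose index label is "Bran"

# Both lookups select the same row here
same_row = first_by_position.equals(first_by_label)

# Slices behave differently:
by_position = toy.iloc[0:2]          # positions 0 and 1 (end excluded)
by_label = toy.loc["Bran":"Fiber"]   # "Bran" through "Fiber" (end included)
```

In this sketch `by_position` has two rows and `by_label` has three, even though both slices start at the first row.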

In [115]:
# passing a list of positions selects multiple rows
df.iloc[[1, 2, 3]]

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912


We can also use boolean indexing to conditionally select data.

In [116]:
df[[True] + [False] * 76]

# [True] + [False] * 76 gives us a list that looks like [True, False, ..., False] with 1 True and 76 Falses
# This matches the number of rows in our data (77)
# pandas returns all the rows with a corresponding True (in this case, only the first one)

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973


This is powerful because comparing a Series with a value produces a boolean Series, with one True/False per row, that can be used just like the list above.

In [117]:
df["protein"] > 3

0      True
1     False
2      True
3      True
4     False
      ...  
72    False
73    False
74    False
75    False
76    False
Name: protein, Length: 77, dtype: bool

Combining boolean indexing with Series comparisons gives us a very expressive way of filtering rows.

In [118]:
# This gives us all the rows in which the protein is greater than 3.
df[df["protein"] > 3]

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
11,Cheerios,G,C,110,6,2,290,2.0,17.0,1,105,25,1,1.0,1.25,50.764999
41,Life,Q,C,100,4,2,150,2.0,12.0,6,95,25,2,1.0,0.67,45.328074
43,Maypo,A,H,100,4,1,0,0.0,16.0,3,95,25,2,1.0,1.0,54.850917
44,Muesli Raisins; Dates; & Almonds,R,C,150,4,3,95,3.0,16.0,11,170,25,3,1.0,1.0,37.136863
45,Muesli Raisins; Peaches; & Pecans,R,C,150,4,3,150,3.0,16.0,11,170,25,3,1.0,1.0,34.139765
56,Quaker Oat Squares,Q,C,100,4,1,135,2.0,14.0,6,110,25,3,1.0,0.5,49.511874
57,Quaker Oatmeal,Q,H,100,5,2,0,2.7,-1.0,-1,110,0,1,1.0,0.67,50.828392
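
The filtering pattern above can be sketched on a small made-up dataframe (the `toy` data is hypothetical). As an assumption about where you might go next, the sketch also combines two conditions with `&`; each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical toy data, not the cereal dataset
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "protein": [4, 2, 6, 3],
        "sugars": [6, 8, 1, 12],
    }
)

# A comparison gives a boolean Series with one entry per row
mask = toy["protein"] > 3

# Indexing with the mask keeps only the rows where it is True
high_protein = toy[mask]

# Combine conditions with & (and) or | (or), wrapping each in parentheses
high_protein_low_sugar = toy[(toy["protein"] > 3) & (toy["sugars"] < 5)]
```

Storing the mask in a variable first, as done here, can make longer filters easier to read and reuse.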


## 5.3 Manipulating Data

Often when we're preprocessing data, we want to make changes to a specific column. We can do this by applying functions.

In [119]:
# Suppose we want to make the cereals more appetizing.
# Let's add "Delicious " to the beginning of every name.

# The pattern is we define a function for a single entry
def make_delicious(name):
    return "Delicious " + name

# and then call apply on the series to apply the function to each element in the series
df["name"].apply(make_delicious)

0                     Delicious 100% Bran
1             Delicious 100% Natural Bran
2                      Delicious All-Bran
3     Delicious All-Bran with Extra Fiber
4                Delicious Almond Delight
                     ...                 
72                      Delicious Triples
73                         Delicious Trix
74                   Delicious Wheat Chex
75                     Delicious Wheaties
76          Delicious Wheaties Honey Gold
Name: name, Length: 77, dtype: object

In [120]:
# apply returns a new Series with the changes; it doesn't modify the dataframe in place.
# that means in our original dataframe, the cereals are still bland
df.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [121]:
# we can fix this by assigning the new names to the column.
df["name"] = df["name"].apply(make_delicious)
df.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,Delicious 100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,Delicious 100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,Delicious All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,Delicious All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Delicious Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
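
As a compact recap of the define-apply-assign pattern, here is a minimal sketch on a made-up dataframe (the `toy` data is hypothetical). It also shows, as an aside that is not part of the walkthrough above, that simple string concatenation can be written directly on the Series without `apply`.

```python
import pandas as pd

# Hypothetical toy data, not the cereal dataset
toy = pd.DataFrame({"name": ["Bran", "Granola"]})

def make_delicious(name):
    """Prefix a single cereal name with 'Delicious '."""
    return "Delicious " + name

# apply calls the function on each element and returns a new Series;
# toy itself is unchanged at this point
renamed = toy["name"].apply(make_delicious)

# For simple concatenation, operating on the whole Series gives the same values
vectorized = "Delicious " + toy["name"]

# Assign back to the column to keep the change
updated = toy.copy()
updated["name"] = renamed
```

`apply` is the more general tool, since the function can contain arbitrary Python logic; operating on the whole Series is usually shorter when a built-in operation already does the job.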


## 5.4 Groups and Aggregates

When we have lots and lots of data, it's more useful to look at aggregate statistics like the mean or median. But sometimes we lose too much detail aggregating across the whole dataset.

The solution is to aggregate across groups. For example, maybe we're less interested in the mean calorie count of all cereals and more interested in the mean for each manufacturer.

In [122]:
# First, we can see how many (and which) unique manufacturers there are
# Note: this gives us a numpy array
df["mfr"].unique()

array(['N', 'Q', 'K', 'R', 'G', 'P', 'A'], dtype=object)

In [123]:
# Now let's group by the manufacturers
# This gives us a groupby object across the dataframe
mfrs = df.groupby("mfr")
mfrs

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ffdc0069040>

In [124]:
# what happens if we try to access the calories column?
mfrs["calories"]

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7ffdc008ca60>

In [125]:
# now let's try to get the mean
mfrs["calories"].mean()

mfr
A    100.000000
G    111.363636
K    108.695652
N     86.666667
P    108.888889
Q     95.000000
R    115.000000
Name: calories, dtype: float64

In [126]:
# we can also aggregate across multiple columns, and even use different aggregations
# let's get the average calorie count but the maximum protein
mfrs[["calories", "protein"]].agg({"calories": "mean", "protein": "max"})

mfr,calories,protein
A,100.0,4
G,111.363636,6
K,108.695652,6
N,86.666667,4
P,108.888889,3
Q,95.0,5
R,115.0,4
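
The group-then-aggregate steps can be sketched end to end on a small made-up dataframe (the `toy` data is hypothetical). This version uses pandas' named-aggregation form of `agg`, which is an alternative to the dictionary form used above and lets you pick the output column names (`mean_calories` and `max_protein` are names chosen for this illustration).

```python
import pandas as pd

# Hypothetical toy data, not the cereal dataset
toy = pd.DataFrame(
    {
        "mfr": ["K", "K", "Q", "Q"],
        "calories": [70, 50, 120, 100],
        "protein": [4, 4, 3, 5],
    }
)

# One row per manufacturer, with a different aggregation per column
summary = toy.groupby("mfr").agg(
    mean_calories=("calories", "mean"),
    max_protein=("protein", "max"),
)
```

The result is indexed by `mfr`, so `summary.loc["K"]` gives the aggregates for manufacturer K.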
