<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Intro:</span> Python Crash Course</h1>

<br><hr id="toc">

### Table of Contents

* [Lesson 1: Jupyter Notebook Basics](#l0)
* [Lesson 2: Python Basics](#l1)
* [Lesson 3: Data Structures](#l2)
* [Lesson 4: Flow and Functions](#l3)
* [Lesson 5: Pandas](#l5)

<br><hr>

### Jupyter Notebook is an open-source web application that allows you to create and share documents that contain:
- live code 
- equations 
- visualizations 
- narrative text 

### Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

<br id="l0">

# Lesson 1: Jupyter Notebook Basics

When running **multiple lines of code** in Jupyter Notebooks, only the last result is shown.

In [51]:
3 * 5  # Not shown
16 / 4 # Shown

4.0

But you can display each line by explicitly printing them with the <code style="color:steelblue">print()</code> function.

In [52]:
print( 3 * 5 ) # Shown
print( 16 / 4 ) # Shown

15
4.0


Hey, did you see that <code style="color:dimgray; font-weight:bold">Gray</code> text in the code cell? 

That's called a **comment**.
* Comments add extra information, and they are not executed
* In Python, comments start with the pound sign (a.k.a. hashtag): 

<pre style="color:dimgray"># This is a comment</pre>

In [53]:
# print( 3 * 5 )  <-- This code does not get run

By the way, if you <code style="color:steelblue">print()</code> multiple objects, separated by commas, it will convert each one to text and **join** them on a single line, separated by spaces (more on strings later).

In [54]:
# Print concatenation
print( 'Testing', 1, 2, 3 )

Testing 1 2 3
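
As a rough sketch of what <code style="color:steelblue">print()</code> is doing in the cell above (the names `parts` and `joined` are made up for this illustration): each object is converted to a string, and the pieces are joined with a separator. The separator defaults to a single space and can be changed with the `sep` argument.

```python
# Hypothetical values that mirror the cell above
parts = ['Testing', 1, 2, 3]

# Roughly what print() does by default:
# convert each item to a string, then join the pieces with spaces
joined = ' '.join(str(p) for p in parts)
print(joined)

# The sep argument swaps the space for another separator
print('Testing', 1, 2, 3, sep='-')
```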


Great, now let's start our tour of the basics of the Python programming language.

By the way, you'll practice each of these topics throughout the course, so don't worry too much about remembering every little detail on your first pass.

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0;">
[**Back to Contents**](#toc)
</div>

<br id="l1">

# Lesson 2: Python Basics

A Python library is a collection of functions and methods that allows you to perform lots of actions without writing your own code.

In general, it's good practice to keep all of your library imports at the top of your notebook or program.

Let's import the math library to calculate the amount of water that can be carried in a cylindrical container with a radius of 5 cm and a height of 17 cm.

* The formula for the volume of a cylinder is $V = \pi r^2 h$

In [55]:
import math
total_volume = math.pi*math.pow(5, 2)*17
total_volume

1335.1768777756622

**Do you have enough space to accommodate 2000 cm$^3$? Print <code style="color:steelblue">True</code> if you have enough space or <code style="color:steelblue">False</code> if you do not.**
* Use the **greater-than-or-equal-to** operator.

In [56]:
# Do you have space for at least 2000 cm^3
if total_volume >= 2000:
    print(True)
else:
    print(False)

False
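
A side note, sketched under the assumption that you only need the yes/no answer: a comparison such as `>=` already evaluates to <code style="color:steelblue">True</code> or <code style="color:steelblue">False</code>, so the result can be stored or printed directly. The name `has_space` is made up for this example.

```python
import math

# Same container as above: radius 5 cm, height 17 cm
total_volume = math.pi * math.pow(5, 2) * 17

# The comparison itself produces a boolean value
has_space = total_volume >= 2000
print(has_space)
```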


Repeat the calculations from <span style="color:RoyalBlue">Above</span>, but this time use variables with descriptive names.

* You have **3** empty cylinder-shaped bottles. 
* Each bottle has a height of **16** cm and a radius of **4** cm.
* Each bottle can be completely filled (ignore the thickness of the bottle).
* The formula for the volume of a cylinder is $V = \pi r^2 h$

<br>
**Start by setting variables with descriptive names for bottle dimensions, the number of bottles, and $\pi$.**

In [57]:
# Set variables
bottles = 3 
bottle_height = 16 
bottle_radius = 4 

<br>
**Next, calculate the intermediary step of a single bottle's volume.**
* Set it to a new variable.

In [58]:
# Volume of one bottle (in cm^3)
bottle_volume = math.pi*math.pow(bottle_radius, 2)*bottle_height
# Print volume of a single bottle
print(bottle_volume)

804.247719318987


<br>
**Finally, calculate total volume.**
* Set it to a new variable.

In [59]:
# Total volume
total_volume = bottle_volume * bottles
# Print total volume
print(total_volume)

2412.743157956961
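
If you find yourself repeating the same formula, one option is to wrap it in a function (functions are covered in Lesson 4). The sketch below is only an illustration: `cylinder_volume` is a hypothetical helper, not something provided by the `math` library.

```python
import math

def cylinder_volume(radius, height):
    """Return the volume of a cylinder, V = pi * r^2 * h."""
    return math.pi * radius ** 2 * height

# Same bottles as above: 3 bottles, radius 4 cm, height 16 cm
bottle_volume = cylinder_volume(radius=4, height=16)
total_volume = 3 * bottle_volume
print(total_volume)
```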


<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0;">
[**Back to Contents**](#toc)
</div>

<br id="l2">
# Lesson 3: Data Structures

***In the previous lesson...***

> *In the previous lesson, you learned about importing libraries, declaring variables and conditional statements.*

> *You also learned how to use libraries by making calculations.*


In this lesson we'll learn about importing files along with data structures while planning a trip to breweries in California.

Before we start, let's import lists of breweries by locations. 
* These are stored in text files that we have provided for you. 
* Python has a variety of **input/output** methods. We won't cover them here, but you can learn more about them in the [documentation](https://docs.python.org/2/tutorial/inputoutput.html).

<br>
**First, run this code.**

In [1]:
# Read lists of locations (simply run this code block)
with open('data/bay_area.txt', 'r') as f:
    bay_area = f.read().splitlines()
    
with open('data/los_angeles_area.txt', 'r') as f:
    los_angeles = f.read().splitlines()

with open('data/san_diego.txt', 'r') as f:
    san_diego = f.read().splitlines()

Note that when text files are read using the <code style="color:steelblue">splitlines()</code> function, the resulting object is a list.
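
To see this without touching any files, here is a small sketch that calls `splitlines()` on an in-memory string. The text stored in `text` is a made-up stand-in for the contents of one of the data files.

```python
# Hypothetical file contents: one brewery name per line
text = "AleSmith Brewing\nStone Brewing\nCoronado Brewing\n"

# splitlines() breaks the string at line boundaries and returns a list
names = text.splitlines()
print(type(names))
print(names)
```

Note that the trailing newline does not produce an empty string at the end of the list.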

So the three objects you just created from the files - <code style="color:steelblue">bay_area</code>, <code style="color:steelblue">los_angeles</code>, and <code style="color:steelblue">san_diego</code> - are all lists. 

In [2]:
print( type(san_diego) )

<class 'list'>


Let's start exploring this data.

<br>
**Print the first 5 locations in San Diego.**

In [3]:
# Print the first 5 locations in San Diego
san_diego[0:5]

['AleSmith Brewing',
 'Amplified Ale Works',
 'Ballast Point Brewing',
 'Coronado Brewing',
 'Gordon Biersch']
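
The slice `[0:5]` above starts at index 0 and stops just before index 5. A few more slicing patterns, sketched on the five names shown in the output (the variable names are made up for this example):

```python
first_five = ['AleSmith Brewing', 'Amplified Ale Works', 'Ballast Point Brewing',
              'Coronado Brewing', 'Gordon Biersch']

first_two = first_five[:2]   # leaving out the start means "from the beginning"
middle = first_five[1:4]     # indexes 1, 2 and 3 (the stop index is excluded)
last_one = first_five[-1]    # negative indexes count from the end

print(first_two)
print(middle)
print(last_one)
```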

Next, we need to know how many breweries are in each location. 

<br>
**Print the number of breweries in each list.**
* Which city has the most locations?

In [4]:
# Print length of each list
b_length = len(bay_area)
l_length = len(los_angeles)
s_length = len(san_diego)

print('Bay area has ' + str(b_length) +  ' breweries.')
print('Los Angeles has ' + str(l_length) +  ' breweries.')
print('San Diego has ' + str(s_length) +  ' breweries.')

Bay area has 24 breweries.
Los Angeles has 23 breweries.
San Diego has 21 breweries.


Next, your friend has a couple questions...

They ask you to:
* **Print <code style="color:steelblue">True</code> if <code>'Stone Brewing'</code> is in San Diego or <code style="color:steelblue">False</code> if it's not.**
* **Print <code style="color:steelblue">True</code> if <code>'Area 51 Craft Brewing'</code> is in the Bay area or <code style="color:steelblue">False</code> if it's not.**

In [64]:
# Is 'Stone Brewing' in San Diego?
if "Stone Brewing" in san_diego: 
    print(True) 
else: 
    print(False)

# Is 'Area 51 Craft Brewing' in the Bay area?
if "Area 51 Craft Brewing" in bay_area: 
    print(True) 
else: 
    print(False) 

True
False


In [65]:
# Print minimum value in san_diego
print('Minimum - ', sorted(san_diego)[0])
# Print maximum value in san_diego
print('Maximum - ', sorted(san_diego)[-1])

Minimum -  AleSmith Brewing
Maximum -  Thorn Street Brewery


Let's continue planning locations to visit. Before we continue, we need to remove duplicates from our lists because we don't have time to visit the same location twice.

<br>
**For each of the 3 lists of locations, print <code style="color:steelblue">True</code> if it has duplicate locations and <code style="color:steelblue">False</code> if it doesn't.**
* Hint: A list with duplicates will have a greater length than a set of the same locations.

In [66]:
# Bay area has duplicates?
b_len = len(bay_area)
print (b_len > len(set(bay_area)))
    
# Los Angeles has duplicates?
l_len = len(los_angeles)
print(l_len > len(set(los_angeles)))

# San Diego has duplicates?
s_len = len(san_diego)
print(s_len > len(set(san_diego)))

True
False
True


<br>
**For the lists with duplicates, remove duplicates by converting them into sets. Then, convert them back into lists.**
* Hint: <code style="color:steelblue">set()</code> and <code style="color:steelblue">list()</code> are your friends.

In [67]:
# Convert lists to sets to remove duplicates, then convert them back to lists 

bay_area = list(set(bay_area))
san_diego = list(set(san_diego))
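
One thing to keep in mind: sets are unordered, so `list(set(...))` may return the names in a different order than the original list. If order matters for your trip, one alternative (shown here as a sketch on a made-up `stops` list) is `dict.fromkeys()`, which keeps the first occurrence of each item in its original position.

```python
# Hypothetical list with repeated stops
stops = ['Stone Brewing', 'AleSmith Brewing', 'Stone Brewing',
         'Coronado Brewing', 'AleSmith Brewing']

# A list with duplicates is longer than the set of its items
has_duplicates = len(stops) > len(set(stops))
print(has_duplicates)

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance
unique_stops = list(dict.fromkeys(stops))
print(unique_stops)
```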

Great, now let's double-check to make sure the duplicates were removed.

In [68]:
# Bay area has duplicates?
b_len = len(bay_area)
print (b_len > len(set(bay_area)))
    
# Los Angeles has duplicates?
l_len = len(los_angeles)
print(l_len > len(set(los_angeles)))

# San Diego has duplicates?
s_len = len(san_diego)
print(s_len > len(set(san_diego)))

False
False
False


Looks good! Now, let's look at a simple way to store the breweries in one place.

We're almost ready to visit the breweries! 

However, it's too cumbersome to lug around the 3 different lists we created.

<br>
**Create a single dictionary named <code style="color:steelblue">brewery_dict</code> for the breweries in each location.**
* Each key should be the name of the location.
* Their values should be the lists of unique locations.

In [4]:
# Create brewery_dict
brewery_dict = {"Bay Area": bay_area, "Los Angeles": los_angeles, "San Diego": san_diego}

Next, let's make sure the dictionary has the correct keys. 

<br>
**Run the cell below and check the output.**
* What do you think the code below is doing?
* You'll learn more about <code style="color:steelblue">for</code> loops in the next lesson.

In [70]:
# Run this cell
for brewery in ['Bay Area', 'Los Angeles', 'San Diego']:
    print( brewery in brewery_dict )

True
True
True


<br>
Did you get the expected output? If not, check the answer key before moving on.

Suddenly, your friend walks over to you and says...

> "Hmm... if you set up the dictionary correctly, you won't need the original lists anymore."

> "Please get rid of them."

<br>
**Run this next code cell to overwrite the original location lists with <code style="color:steelblue">None</code>.**

In [71]:
# Run this cell
bay_area, los_angeles, san_diego = None, None, None

By the way, <code style="color:steelblue">None</code> is its own object type in Python.

<br>
> *<span style="color:tomato; font-weight:bold">None</span> is an object that denotes emptiness.*

<br>
For example:

In [72]:
print( type(None) )

<class 'NoneType'>


Now, we want to split our visit to California into two trips: one for Southern California and one for Northern California.

<br>
**Add two new items to your dictionary:**
1. **Key:** <code style="color:steelblue">'Southern California'</code>... **Value:** All locations in <code style="color:steelblue">'San Diego'</code> and <code style="color:steelblue">'Los Angeles'</code>.
2. **Key:** <code style="color:steelblue">'Northern California'</code>... **Value:** All locations in <code style="color:steelblue">'Bay Area'</code>.

Since you got rid of your original lists, you'll have to use the values you've already stored in your dictionary.

In [73]:
# Create a new key-value pair for 'Southern California'
brewery_dict["Southern California"] = brewery_dict["Los Angeles"] + brewery_dict["San Diego"] 
        
# Create a new key-value pair for 'Northern California'
brewery_dict["Northern California"] = brewery_dict["Bay Area"] 
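
A detail worth knowing about the two assignments above, sketched on a tiny made-up dictionary called `trips`: joining lists with `+` builds a brand-new list, while assigning an existing list to a second key makes both keys refer to the same list object. If you ever need an independent copy, `list(...)` is one way to make it.

```python
# Hypothetical miniature version of brewery_dict
trips = {'Bay Area': ['A', 'B'], 'Los Angeles': ['C'], 'San Diego': ['D', 'E']}

# + concatenates into a new list; the originals are untouched
trips['Southern California'] = trips['Los Angeles'] + trips['San Diego']

# Plain assignment reuses the very same list object
trips['Northern California'] = trips['Bay Area']
same_object = trips['Northern California'] is trips['Bay Area']

# list(...) makes a separate (shallow) copy instead
independent_copy = list(trips['Bay Area'])
is_separate = independent_copy is not trips['Bay Area']

print(trips['Southern California'])
print(same_object, is_separate)
```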

Finally, let's just check that we have the right number of locations for each trip.
* You should have 41 for Southern California
* You should have 22 for Northern California

<br>
**Run the cell below and check that you get the expected output.**

In [74]:
print( len(brewery_dict['Southern California']) )
print( len(brewery_dict['Northern California']) )

41
22


If you don't have the right number of locations, double-check that you removed duplicates and that you're concatenating the correct lists. You can also check the answer key for the solution.

<br>
**Once you have the right number of locations, let's save this object so we can use it in the next lesson. Run the cell below.**
* We'll use a Python built-in package called <code style="color:steelblue">pickle</code> to do so.
* Pickle saves an entire object in a file on your computer.

In [5]:
# Import pickle library
import pickle

# Save object to disk
with open('./data/brewery_dict.pkl', 'wb') as f:
    pickle.dump(brewery_dict, f)
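
If you'd like to see the round trip without writing a file, here is a minimal sketch using `pickle.dumps()` and `pickle.loads()`, which work on bytes in memory. The dictionary `original` is a made-up example. As a general precaution, only unpickle data that comes from a source you trust.

```python
import pickle

# Hypothetical dictionary standing in for brewery_dict
original = {'Bay Area': ['Example Brewing'], 'San Diego': ['Another Brewing']}

# dumps() serializes the object to bytes; loads() rebuilds it
payload = pickle.dumps(original)
restored = pickle.loads(payload)

print(type(payload))
print(restored == original)
```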

<br>

> *Now you have a dictionary of breweries in California.*

> *In the next lesson, we'll look through the locations and pick one to start with.*

<br>

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0;">
[**Back to Contents**](#toc)
</div>

<br id="l3">
# Lesson 4: Flow and Functions

***In the previous lesson...***

> *In the previous lesson, you created a brewery dictionary for the locations we're interested in visiting.*

Now we're ready to pick a location to start with.

<br>
**First, let's import <code style="color:steelblue">brewery_dict</code> again using <code style="color:steelblue">pickle</code>. Run this cell.**

In [6]:
import pickle

# Read object from disk
with open('./data/brewery_dict.pkl', 'rb') as f:
    brewery_dict = pickle.load(f)

Now we have the <code style="color:steelblue">brewery_dict</code> object again, but what if we forgot which keys are in the dictionary? 

<br>
**Print the keys in <code style="color:steelblue">brewery_dict</code>.**

In [7]:
# Print the keys in brewery_dict
for key in brewery_dict.keys():
    print(key)

Bay Area
Los Angeles
San Diego
Southern California
Northern California


Ah, yes... 

Now, we need to choose between starting with <code style="color:steelblue">'Southern California'</code> or with <code style="color:steelblue">'Northern California'</code>. We should start with the list with more locations, so let's find which one it is.

<br>
**Write code, using <code style="color:steelblue">if</code> statements, that does the following:**
* **If** our Southern California list has more locations than our Northern California list, print the message:


<pre style="color:steelblue">I want to start in Southern California.</pre>


* **Else if** our Northern California list has more locations than our Southern California list, print the message:

    
<pre style="color:steelblue">I want to start in Northern California.</pre>


* **Else** (i.e. they have the same number of locations), print the message:


<pre style="color:steelblue">Either is fine. Flip a coin!</pre>



In [78]:
# Write code here
if len(brewery_dict["Southern California"]) > len(brewery_dict["Northern California"]):
    print("I want to start in Southern California.")
elif len(brewery_dict["Northern California"]) > len(brewery_dict["Southern California"]):
    print("I want to start in Northern California.")
else:
    print("Either is fine. Flip a coin!")

I want to start in Southern California.


<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

Remember we said that we wanted to start with the list with the most locations.
* We already knew that it would be either Southern California or Northern California because those lists are built from the other three.
* However, what if we didn't know that?

<br>
**For each key in <code style="color:steelblue">brewery_dict</code>, print the number of locations in its list, like so:**

<pre style="color:steelblue">
Bay Area has 13 locations.
San Diego has ...
</pre> 

* **Tip:** You can iterate through keys and values of a dictionary at the same time using <code style="color:steelblue">.items()</code>, like so:

<pre style="color:#bbb">
for <strong style="color:steelblue">key, value</strong> in brewery_dict<strong style="color:steelblue">.items()</strong>:
    <span style="color:dimgray"># code block</span>
</pre>      
        
* **Tip:** Remember, to insert multiple dynamic values into a string, you can just add more places to <code style="color:steelblue">.format()</code>, like so:

<pre style="color:#bbb">
'<strong style="color:steelblue">{}</strong> has <strong style="color:steelblue">{}</strong> locations.'.format(<strong style="color:steelblue">first_value, second_value</strong>)
</pre>

In [79]:
# For each key in brewery_dict, print the number of breweries in its list
for key, value in brewery_dict.items(): 
    print ("{} has {} breweries".format(key, len(value)))

Bay Area has 22 breweries
Los Angeles has 23 breweries
San Diego has 18 breweries
Southern California has 41 breweries
Northern California has 22 breweries
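
As an alternative sketch to `.format()`, f-strings (available in Python 3.6 and later) let you put the values directly inside the string. The `counts` dictionary below simply reuses three of the numbers printed above, and `max()` with a `key` function is one way to find the largest entry without reading the output by eye.

```python
# Counts copied from the output above (three of the five keys)
counts = {'Bay Area': 22, 'Los Angeles': 23, 'San Diego': 18}

# f-string version of '{} has {} breweries'.format(name, count)
lines = [f"{name} has {count} breweries" for name, count in counts.items()]
for line in lines:
    print(line)

# The key whose value is largest
largest = max(counts, key=counts.get)
print(largest)
```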


Now, let's give each brewery in Southern California a first impression based on its name. 

<br>
**Combine <code style="color:steelblue">if</code> and <code style="color:steelblue">for</code> statements. For each brewery in Southern California...**
* **If** its name has <code style="color:steelblue">'51'</code>, <code style="color:steelblue">'Coronado'</code>, or <code style="color:steelblue">'Noble'</code> in it, print:

<pre style="color:steelblue">{<strong>name</strong>} sounds good.</pre>

* **Else If** its name has <code style="color:steelblue">'Stone'</code> or <code style="color:steelblue">'Ballast'</code> in it, print:

<pre style="color:steelblue">{<strong>name</strong>} sounds awesome.</pre>
    
* If its name doesn't sound good or awesome, just ignore it.
* **Tip:** If you want to check if any word from a list is found in a string, you can use <code style="color:steelblue">any()</code>, like so:

<pre style="color:steelblue">
any( word in name for word in <strong>list_of_words</strong> )
</pre>

In [80]:
sounds_good = ['51', 'Coronado', 'Noble']
sounds_awesome = ['Stone', 'Ballast']

# Print first impression of each brewery in Southern California based on its name
for brewery in brewery_dict["Southern California"]:
    if any(word in brewery for word in sounds_good):
        print(brewery + " sounds good.")
    elif any(word in brewery for word in sounds_awesome):
        print(brewery + " sounds awesome.")

Noble Ale Works sounds good.
Area 51 Craft Brewing sounds good.
Stone Brewing sounds awesome.
Ballast Point Brewing sounds awesome.
Coronado Brewing sounds good.
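
To make the `any()` check less mysterious, here is a small sketch that spells out the individual tests for a single name. The variable names `checks`, `found` and `none_found` are made up for this example.

```python
name = 'Ballast Point Brewing'

# One True/False result per word: is the word a substring of the name?
checks = [word in name for word in ['Stone', 'Ballast']]
print(checks)

# any() is True as soon as at least one result is True
found = any(checks)
print(found)

# None of these words appear in the name, so any() is False
none_found = any(word in name for word in ['51', 'Coronado', 'Noble'])
print(none_found)
```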


<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

**Using a list comprehension, create a new list called <code style="color:steelblue">good_breweries</code>.**
* It should contain the breweries in Southern California whose names include a word from <code style="color:steelblue">sounds_good</code>.
* Then print the list.
* **Tip:** To check if any word from a list is found in a string, you can use <code style="color:steelblue">any()</code>. 

In [81]:
# Create good_breweries list using a list comprehension
good_breweries = [brewery for brewery in brewery_dict["Southern California"] if any(word in brewery for word in sounds_good)]
# Print the good-sounding breweries
print(good_breweries) 

['Noble Ale Works', 'Area 51 Craft Brewing', 'Coronado Brewing']
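
A list comprehension is a compact way of writing a loop that builds a list. The sketch below shows both forms side by side on a short made-up list (`names` and `keywords` are hypothetical stand-ins for the Southern California list and `sounds_good`).

```python
# Hypothetical inputs
names = ['Noble Ale Works', 'Stone Brewing', 'Coronado Brewing']
keywords = ['Coronado', 'Noble']

# Loop version: start with an empty list and append the matches
matches_loop = []
for name in names:
    if any(word in name for word in keywords):
        matches_loop.append(name)

# Comprehension version: the same logic in a single expression
matches_comp = [name for name in names if any(word in name for word in keywords)]

print(matches_loop)
print(matches_loop == matches_comp)
```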


**Print the number of good-sounding locations we have.**

In [82]:
# Print number of good-sounding locations
print(len(good_breweries))

3


<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

**Write a function called <code style="color:steelblue">filter_breweries</code> that takes two arguments:**
1. <code style="color:steelblue">location_list</code>
2. <code style="color:steelblue">word_list</code>

The function should return the list of names in <code style="color:steelblue">location_list</code> that contain any word in <code style="color:steelblue">word_list</code>.


In [83]:
# Code here
def filter_breweries(location_list, word_list):
    # Keep only the names that contain at least one of the words
    names = [brewery for brewery in location_list if any(word in brewery for word in word_list)]
    return names

Next, let's test that function. 

<br>
**Create a new <code style="color:steelblue">good_breweries</code> list using the function you just wrote.**
* Pass in the list of Southern California breweries and the list of good-sounding words.
* You should get the same breweries that you got just above.
* Print the new list.

In [84]:
# Create good_breweries using filter_breweries()
good_breweries = filter_breweries(brewery_dict["Southern California"], sounds_good)

# Print list of good-sounding breweries
print(good_breweries)


['Noble Ale Works', 'Area 51 Craft Brewing', 'Coronado Brewing']


**Next, let's use this handy function to create an <code style="color:steelblue">awesome_breweries</code> list for breweries that sound awesome.**
* Pass in the list of Southern California breweries and the list of awesome-sounding words.
* Print the new list and confirm the expected output.

In [85]:
# Create awesome_breweries using filter_breweries()
awesome_breweries = filter_breweries(brewery_dict["Southern California"], sounds_awesome)

# Print list of awesome-sounding breweries
print(awesome_breweries)

['Stone Brewing', 'Ballast Point Brewing']


Great, we'll start with these for our visit.

<br>

> *In this lesson, we filtered our lists of breweries.*

<br>
<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0;">
[**Back to Contents**](#toc)
</div>

<br id="l5">
# Lesson 5: Pandas

Let's start by importing Pandas.

In [1]:
import pandas as pd

Read Iris dataset:

In [3]:
# Read the iris dataset from a CSV file
df = pd.read_csv('./data/iris.csv')

<br>
**First, create a new DataFrame called <code style="color:steelblue">toy_df</code>. It should contain the first 5 rows plus the last 5 rows from our original Iris dataset.**
* **Tip:** You already have a <code style="color:steelblue">.head()</code>, but what about a <code style="color:steelblue">.tail()</code>?
* **Tip:** <code style="color:steelblue">pd.concat()</code> is your [friend](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html).

In [4]:
# Create toy_df
df_h = df.head()
df_t = df.tail()
toy_df = pd.concat([df_h, df_t])
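
Here is a small sketch of the same idea on a made-up 20-row DataFrame called `numbers`, so you can see exactly which rows `pd.concat()` keeps. It assumes Pandas is installed. By default the original index labels are preserved; passing `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Hypothetical 20-row frame standing in for the Iris data
numbers = pd.DataFrame({"value": range(20)})

# Stack the first 3 and last 3 rows; the original index labels are kept
ends = pd.concat([numbers.head(3), numbers.tail(3)])
print(ends.index.tolist())

# ignore_index=True replaces the labels with 0, 1, 2, ...
renumbered = pd.concat([numbers.head(3), numbers.tail(3)], ignore_index=True)
print(renumbered.index.tolist())
```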

**Next, display <code style="color:steelblue">toy_df</code>.** 
* After all, it will only be 10 rows.
* In <code style="color:steelblue">toy_df</code>, you should have data from 2 different species of flower. Which are they?

In [5]:
# Display toy_df
toy_df

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |


You should have 'setosa' and 'virginica' flowers.

**Next, display a summary table for <code style="color:steelblue">toy_df</code>.**
* It should have the mean, standard deviation, and quartiles for each of the columns

In [90]:
# Describe toy_df
toy_df.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 10.0 |
| mean | 5.59 | 3.13 | 3.29 | 1.13 |
| std | 0.807534 | 0.316403 | 1.995244 | 0.992248 |
| min | 4.6 | 2.5 | 1.3 | 0.2 |
| 25% | 4.925 | 3.0 | 1.4 | 0.2 |
| 50% | 5.5 | 3.05 | 3.25 | 1.0 |
| 75% | 6.275 | 3.35 | 5.175 | 1.975 |
| max | 6.7 | 3.6 | 5.4 | 2.3 |


Since <code style="color:steelblue">toy_df</code> is only 10 rows, you can manually check the **mins** and **maxes**. Are they correct?

Elementwise operations are very useful in machine learning, especially for feature engineering.

<br>

> *<span style="color:tomato; font-weight:bold">Feature engineering</span> is the process of creating new features (model input variables) from existing ones.*

<br>

We'll cover this topic in much more detail later, but let's first use our <code style="color:steelblue">toy_df</code> to illustrate the concept.

In the Iris dataset, we have petal width and length, but what if we wanted to know petal area? Well, we can create a new <code style="color:steelblue">petal_area</code> feature (yes, the petals are not perfect rectangles, but that's fine).

<br>
**First, display the two columns of <code style="color:steelblue">petal_width</code> and <code style="color:steelblue">petal_length</code> in <code style="color:steelblue">toy_df</code>.**
* **Tip:** You can index a DataFrame using a list of column names too, like so:


<pre style="color:steelblue">df[['column_1', 'column_2']]</pre>

In [91]:
# Display petal_width and petal_length
petal_width = toy_df["petal_width"]
petal_length = toy_df["petal_length"]
toy_df[["petal_width", "petal_length"]]

| | petal_width | petal_length |
|---|---|---|
| 0 | 0.2 | 1.4 |
| 1 | 0.2 | 1.4 |
| 2 | 0.2 | 1.3 |
| 3 | 0.2 | 1.5 |
| 4 | 0.2 | 1.4 |
| 145 | 2.3 | 5.2 |
| 146 | 1.9 | 5.0 |
| 147 | 2.0 | 5.2 |
| 148 | 2.3 | 5.4 |
| 149 | 1.8 | 5.1 |


**Next, create a new <code style="color:steelblue">petal_area</code> feature in <code style="color:steelblue">toy_df</code>.**
* Multiply the <code style="color:steelblue">petal_width</code> column by the <code style="color:steelblue">petal_length</code> column.
* Display <code style="color:steelblue">toy_df</code> after creating the new feature.
* Are the values for <code style="color:steelblue">petal_area</code> correct? Manually spot check a few of them just to make sure.

In [92]:
# Create a new petal_area column
toy_df["petal_area"] = toy_df["petal_width"] * toy_df["petal_length"]

# Display toy_df
toy_df

| | sepal_length | sepal_width | petal_length | petal_width | species | petal_area |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.28 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.28 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.26 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.3 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.28 |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 11.96 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 9.5 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 10.4 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 12.42 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 9.18 |


**Finally, what do we now know about Iris flowers?**

By creating a <code style="color:steelblue">petal_area</code> feature, it's now much easier to see that virginica flowers have significantly larger petals than setosa flowers do!

Often, by creating new features, you can learn more about the data (and improve your machine learning models).
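
To make "elementwise" concrete, here is a sketch on a three-row DataFrame (the frame `flowers` is made up, with values borrowed from the table above, and Pandas is assumed to be installed). Multiplying two columns pairs up the values row by row, so each row gets its own product.

```python
import pandas as pd

# Three rows borrowed from toy_df
flowers = pd.DataFrame({
    "petal_length": [1.4, 5.2, 5.0],
    "petal_width": [0.2, 2.3, 1.9],
})

# Row-by-row multiplication: 0.2 * 1.4, 2.3 * 5.2, 1.9 * 5.0
flowers["petal_area"] = flowers["petal_width"] * flowers["petal_length"]

# Round to two decimals to hide floating-point noise
areas = flowers["petal_area"].round(2).tolist()
print(areas)
```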

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">


Let's say we wanted to display observations where <code style="color:steelblue">petal_area > 10</code> and <code style="color:steelblue">sepal_width > 3</code>. How could we do so?

<br>
**First, display <code style="color:steelblue">toy_df</code> again just to have it in front of you.**

In [93]:
# Display toy_df
toy_df

| | sepal_length | sepal_width | petal_length | petal_width | species | petal_area |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.28 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.28 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.26 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.3 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.28 |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 11.96 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 9.5 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 10.4 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 12.42 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 9.18 |


**Take a look at the DataFrame and manually count the number that satisfy our conditions.**
* How many observations have <code style="color:steelblue">petal_area > 10</code>?
* How many observations have <code style="color:steelblue">sepal_width > 3</code>?
* How many satisfy both conditions?

Great. Now we'll see what's going on under the hood when we use our boolean masks.

<br>
**Create a boolean mask for <code style="color:steelblue">petal_area > 10</code>.**
* Name it <code style="color:steelblue">petal_area_mask</code>.
* Display the mask after you create it.
* Does the result make sense?

In [94]:
# Mask for petal_area > 10
petal_area_mask = toy_df["petal_area"] > 10 

# Display petal_area_mask
petal_area_mask

0      False
1      False
2      False
3      False
4      False
145     True
146    False
147     True
148     True
149    False
Name: petal_area, dtype: bool

**Next, create a boolean mask for <code style="color:steelblue">sepal_width > 3</code>.**
* Name it <code style="color:steelblue">sepal_width_mask</code>.
* Display the mask after you create it.
* Does the result make sense?

In [45]:
# Mask for sepal_width > 3
sepal_width_mask = toy_df["sepal_width"] > 3

# Display sepal_width_mask
sepal_width_mask

0       True
1      False
2       True
3       True
4       True
145    False
146    False
147    False
148     True
149    False
Name: sepal_width, dtype: bool

**Next, display the two masks combined using the <code style="color:steelblue">&</code> operator.**
* Note how their combination results in another boolean mask!

In [46]:
# Display both masks, combined
sepal_width_mask & petal_area_mask

0      False
1      False
2      False
3      False
4      False
145    False
146    False
147    False
148     True
149    False
dtype: bool
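
The two masks can also be built and combined in a single expression. The sketch below uses a made-up five-row frame called `flowers` (values borrowed from the virginica rows above; Pandas is assumed to be installed). Note the parentheses around each comparison: in Python, `&` binds more tightly than `>`, so leaving them out changes the meaning of the expression.

```python
import pandas as pd

# The five virginica rows from toy_df, reduced to the two relevant columns
flowers = pd.DataFrame(
    {"sepal_width": [3.0, 2.5, 3.0, 3.4, 3.0],
     "petal_area": [11.96, 9.5, 10.4, 12.42, 9.18]},
    index=[145, 146, 147, 148, 149],
)

# Each comparison gives a boolean Series; & keeps rows where both are True
both = (flowers["petal_area"] > 10) & (flowers["sepal_width"] > 3)

# Indexing with the mask keeps only the rows marked True
selected = flowers[both]
print(selected)
```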

**Finally, select the observations from <code style="color:steelblue">toy_df</code> where both conditions are met.**

In [47]:
# Index with both masks
toy_df[sepal_width_mask & petal_area_mask]

| | sepal_length | sepal_width | petal_length | petal_width | species | petal_area |
|---|---|---|---|---|---|---|
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 12.42 |


<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

Now, armed with the power of **groupby**, let's just bring back our <code style="color:steelblue">toy_df</code> for one last hurrah, just to make sure we know what's going on under the hood.

<br>
Let's calculate the median <code style="color:steelblue">petal_area</code> for each species. 
* Since <code style="color:steelblue">toy_df</code> is small, we can do this manually as well and check to make sure the values are correct.

<br>
**First, let's manually calculate the median <code style="color:steelblue">petal_area</code> for the virginica flowers in our <code style="color:steelblue">toy_df</code>.**
* Display all observations of the virginica species.
* Sort them by <code style="color:steelblue">petal_area</code> in ascending order.
* **Tip:** Check out the <code style="color:steelblue">.sort_values()</code> function ([documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)).

In [48]:
# Display all 'virginica' species, sorted by petal_area
toy_df[toy_df["species"] == "virginica"].sort_values("petal_area", ascending=True)

| | sepal_length | sepal_width | petal_length | petal_width | species | petal_area |
|---|---|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 9.18 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 9.5 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 10.4 |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 11.96 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 12.42 |


Based on the output above, what's the median <code style="color:steelblue">petal_area</code> for the virginica species?

<br>
**Next, let's manually calculate the median <code style="color:steelblue">petal_area</code> for the setosa flowers in our <code style="color:steelblue">toy_df</code>.**
* Display all observations of the setosa species.
* Sort them by <code style="color:steelblue">petal_area</code> in ascending order.

In [49]:
# Display all 'setosa' species
toy_df[toy_df["species"] == "setosa"].sort_values("petal_area", ascending=True)

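| | sepal_length | sepal_width | petal_length | petal_width | species | petal_area |
|---|---|---|---|---|---|---|
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.26 |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.28 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.28 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.28 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.3 |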


Based on the output above, what's the median <code style="color:steelblue">petal_area</code> for the setosa species?

<br>
**Finally, let's calculate the median values using a <code style="color:steelblue">.groupby()</code>.**
* Do you get the same result?

In [50]:
# Median petal_area in toy_df
toy_df.groupby("species")["petal_area"].median()

species
setosa        0.28
virginica    10.40
Name: petal_area, dtype: float64
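
If you want to double-check a groupby result in code rather than by eye, one option is to compute each group's median separately and compare. The sketch below rebuilds the two relevant columns in a made-up frame called `toy` (values copied from the tables above; Pandas is assumed to be installed) and uses Python's built-in `statistics.median()` for the manual side.

```python
import statistics

import pandas as pd

# species and petal_area copied from toy_df
toy = pd.DataFrame({
    "species": ["setosa"] * 5 + ["virginica"] * 5,
    "petal_area": [0.28, 0.28, 0.26, 0.3, 0.28,
                   11.96, 9.5, 10.4, 12.42, 9.18],
})

# One median per species, computed by Pandas
medians = toy.groupby("species")["petal_area"].median()

# The same medians, computed one species at a time with plain Python
manual = {
    name: statistics.median(toy.loc[toy["species"] == name, "petal_area"].tolist())
    for name in ["setosa", "virginica"]
}

print(medians)
print(manual)
```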

***Congratulations... You've completed the Python Crash Course!***

> *In this lesson, you explored the Iris dataset using Pandas.*

<br>
<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0;">
[**Back to Contents**](#toc)
</div>