# Course Notes
Use this workspace to take notes, store code snippets, or build your own interactive cheatsheet! For courses that use data, the datasets will be available in the `datasets` folder.


In [None]:
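# Import the packages used throughout these notes
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumption: `pollution` is a DataFrame loaded from a file in the `datasets`
# folder (e.g. with pd.read_csv()) before the cells below are run.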



# Hardcoding a highlight

In [None]:
houston_pollution = pollution[pollution.city == 'Houston']

# Make array orangered for day 330 of year 2014, otherwise lightgray
houston_colors = ['orangered' if (day == 330) and (year == 2014) else 'lightgray'
                  for day, year in zip(houston_pollution.day, houston_pollution.year)]

sns.regplot(x = 'NO2',
            y = 'SO2',
            data = houston_pollution,
            fit_reg = False, 
            # Send scatterplot argument to color points 
            scatter_kws = {'facecolors': houston_colors, 'alpha': 0.7})
plt.show()
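
The same color vector can also be built without a Python loop. The sketch below is one possible alternative, not part of the course solution: it uses a boolean mask and `np.where`, and the small `houston_pollution` frame is a hypothetical stand-in with invented values. Here the element-wise `&` is appropriate because it combines whole columns rather than one pair of scalars.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for houston_pollution (values invented for illustration)
houston_pollution = pd.DataFrame({
    'day':  [329, 330, 330, 331],
    'year': [2014, 2014, 2013, 2014],
    'NO2':  [20.1, 35.4, 18.2, 22.7],
    'SO2':  [1.2, 9.8, 0.9, 1.5],
})

# Boolean mask: True only for day 330 of year 2014
is_highlight = (houston_pollution.day == 330) & (houston_pollution.year == 2014)

# np.where picks one color per row based on the mask
houston_colors = np.where(is_highlight, 'orangered', 'lightgray').tolist()
print(houston_colors)
```

The resulting list can be passed to `scatter_kws` as `'facecolors'` in the same way as the list comprehension version.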

# Programmatically creating a highlight

- Find the value corresponding to the highest observed O3 value in the houston_pollution DataFrame. Make sure to type the letter O and not the number zero!
- Append the column 'point_type' to the houston_pollution DataFrame to mark if the row contains the highest observed O3.
- Pass this newly created column to the hue argument of sns.scatterplot() to color the points.

In [None]:
houston_pollution = pollution[pollution.city == 'Houston'].copy()

# Find the highest observed O3 value
max_O3 = houston_pollution.O3.max()

# Make a column that denotes which day had highest O3
houston_pollution['point_type'] = ['Highest O3 Day' if O3 == max_O3 else 'Others'
                                   for O3 in houston_pollution.O3]

# Encode the hue of the points with the O3 generated column
sns.scatterplot(x = 'NO2',
                y = 'SO2',
                hue = 'point_type',
                data = houston_pollution)
plt.show()
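
As a rough sketch of the same idea in vectorized form, the label column can be built with `np.where`, and `idxmax()` can pull out the full row for the highest reading. The three-row frame below is hypothetical, with invented values. The `.copy()` in the cell above matters for the same reason it would here: adding a column to a filtered slice of another DataFrame can otherwise trigger pandas' `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for houston_pollution (values invented for illustration)
houston_pollution = pd.DataFrame({
    'O3':  [0.031, 0.058, 0.044],
    'NO2': [20.1, 35.4, 18.2],
    'SO2': [1.2, 9.8, 0.9],
})

# Highest observed O3 value
max_O3 = houston_pollution.O3.max()

# Label every row in one vectorized step
houston_pollution['point_type'] = np.where(houston_pollution.O3 == max_O3,
                                           'Highest O3 Day', 'Others')

# idxmax() returns the index label of the first maximum, so .loc gives that row
highest_row = houston_pollution.loc[houston_pollution.O3.idxmax()]
print(houston_pollution)
```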

![image](image.png)


# Comparing with two KDEs

- Filter the data in the first sns.kdeplot() call to include only the year 2012.
- Shade under the first KDE with the shade argument (seaborn 0.11 and later deprecate shade in favor of fill).
- Add the label '2012' for the plot legend.
- Repeat the first three steps for the second sns.kdeplot() call, but filter the data to not include 2012. Use the label 'other years'.

In [None]:
# Filter dataset to the year 2012
sns.kdeplot(pollution[pollution.year == 2012].O3,
            # Shade under kde and add a helpful label
            # (seaborn 0.11+ deprecates shade in favor of fill = True)
            shade = True,
            label = '2012')

# Filter dataset to everything except the year 2012
sns.kdeplot(pollution[pollution.year != 2012].O3,
            # Again, shade under kde and add a helpful label
            shade = True,
            label = 'other years')

# Draw the legend explicitly so the labels are shown
plt.legend()
plt.show()

![image-2](image-2.png)


# Improving your KDEs

- Turn off the histogram overlay for the first plot.
- Make the Vandenberg plot 'steelblue'.
- Turn on rug plot functionality in the Vandenberg plot.
- Remove histogram from the non-Vandenberg plot and set its color to 'gray'.

In [None]:
sns.distplot(pollution[pollution.city == 'Vandenberg Air Force Base'].O3, 
             label = 'Vandenberg', 
             # Turn off the histogram and color blue to stand out
             hist = False,
             color = 'steelblue', 
             # Turn on rugplot
             rug = True)

sns.distplot(pollution[pollution.city != 'Vandenberg Air Force Base'].O3, 
             label = 'Other cities',
             # Turn off histogram and color gray
             hist = False,  
             color = 'gray')
plt.show()

![image-3](image-3.png)


# Beeswarms
Build a beeswarm plot using sns.swarmplot() that looks at the ozone levels for all the cities in the pollution data for the month of March. To make the beeswarm more legible, decrease the point size to reduce the overcrowding caused by the many points drawn on the screen. Finally, since you've manipulated the data to make this plot, add a title to help readers orient themselves to what they are viewing.

**Instructions**

- Subset the pollution data to include just the observations in March.
- Plot the O3 levels as the continuous value in the swarmplot().
- Decrease the point size to 3 to avoid crowding of the points.
- Title the plot 'March Ozone levels by city'.

In [None]:
# Filter data to just March
pollution_mar = pollution[pollution.month == 3]

# Plot beeswarm with x as O3
sns.swarmplot(y = "city",
              x = 'O3', 
              data = pollution_mar, 
              # Decrease the size of the points to avoid crowding 
              size = 3)

# Give a descriptive title
plt.title('March Ozone levels by city')
plt.show()

![image-4](image-4.png)


# A basic text annotation
On the current scatter plot, you can see a particularly prominent point that contains the largest SO2 value observed for August. This point is Cincinnati on August 11th, 2013; however, you would not be able to learn this from the plot in its current form. Basic text annotations are great for pointing out interesting outliers and giving a bit more information. Draw the reader's attention to this Cincinnati value by adding a basic text annotation that gives some background about this outlier.

**Instructions**

1. Filter the data plotted in scatter plot to just August.
2. Draw text annotation at x = 0.57 and y = 41 to call out the highest SO2 value.
3. Label annotation with 'Cincinnati had highest observed\nSO2 value on Aug 11, 2013' (note the line break).
4. Change the font-size to 'large' for the annotation.

In [None]:
# Draw basic scatter plot of pollution data for August
sns.scatterplot(x = 'CO', y = 'SO2', data = pollution[pollution.month == 8])

# Label highest SO2 value with text annotation
plt.text(0.57, 41,
         'Cincinnati had highest observed\nSO2 value on Aug 11, 2013', 
         # Set the font to large
         fontdict = {'ha': 'left', 'size': 'large'})
plt.show()
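
The annotation coordinates above are typed in by hand. One way to derive them from the data instead is sketched below, using plain matplotlib and a hypothetical three-row `aug_pollution` frame with invented values. The offsets of 0.02 and 1 are arbitrary choices that nudge the text away from the point.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical August readings (values invented for illustration)
aug_pollution = pd.DataFrame({
    'city': ['Cincinnati', 'Houston', 'Denver'],
    'CO':   [0.55, 0.30, 0.42],
    'SO2':  [40.0, 3.1, 5.6],
})

# Row holding the highest SO2 value
top = aug_pollution.loc[aug_pollution.SO2.idxmax()]

fig, ax = plt.subplots()
ax.scatter(aug_pollution.CO, aug_pollution.SO2)

# Place the text slightly up and to the right of the outlier
label = ax.text(top.CO + 0.02, top.SO2 + 1,
                f'{top.city} had highest observed\nSO2 value',
                fontdict={'ha': 'left', 'size': 'large'})
```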

![image-5](image-5.png)


# Arrow annotations

Imagine you are a city planner for Long Beach, California. Long Beach is located on the Pacific Ocean and has a large fireworks show every New Year's Eve. You want to look into whether this show negatively impacts the air quality of the city. To do this, you will look at CO and NO2 levels on New Year's Day. However, it turns out that New Year's Day is not one of the outliers in the scatter plot; it sits in one of the more crowded areas.

To help guide the reader to this point, you'll use an annotation along with an arrow that points to the New Year's Day value. This explains what the viewer is looking at while printing the text in a less crowded region of the plot.

**Instructions**

- Grab the row from jan_pollution that corresponds to New Year's Day 2012 in the city of Long Beach using the pandas .query() method.
- Set the endpoint of the arrow (xy) by using the CO and NO2 column values from the lb_newyears DataFrame.
- Use the argument xytext to place the annotation arrow's text in the bottom left corner of the display at x = 2, y = 15.
- 'shrink' the arrow to 0.03, so it doesn't occlude the point of interest.




In [None]:
# Query and filter to New Years in Long Beach
jan_pollution = pollution.query("(month == 1) & (year == 2012)")
lb_newyears = jan_pollution.query("(day == 1) & (city == 'Long Beach')")

sns.scatterplot(x = 'CO', y = 'NO2',
                data = jan_pollution)

# Point arrow to lb_newyears & place text in lower left
# (.iloc[0] passes plain numbers to xy rather than one-row Series)
plt.annotate('Long Beach New Years',
             xy = (lb_newyears.CO.iloc[0], lb_newyears.NO2.iloc[0]),
             xytext = ( 2, 15), 
             # Shrink the arrow to avoid occlusion
             arrowprops = {'facecolor':'gray', 'width': 3, 'shrink': 0.03},
             backgroundcolor = 'white')
plt.show()
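
A self-contained sketch of the same arrow annotation is given below, using plain matplotlib and a hypothetical `jan_pollution` frame with invented values. It shows how `.query()` returns a one-row DataFrame and how `.iloc[0]` turns each column of that row into a single number for `xy`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical January 2012 readings (values invented for illustration)
jan_pollution = pd.DataFrame({
    'city': ['Long Beach', 'Long Beach', 'Houston'],
    'day':  [1, 2, 1],
    'CO':   [1.1, 0.6, 0.4],
    'NO2':  [38.0, 22.0, 15.0],
})

# One-row DataFrame for New Year's Day in Long Beach
lb_newyears = jan_pollution.query("(day == 1) & (city == 'Long Beach')")

# Pull plain scalars out of the one-row result
x_point = lb_newyears.CO.iloc[0]
y_point = lb_newyears.NO2.iloc[0]

fig, ax = plt.subplots()
ax.scatter(jan_pollution.CO, jan_pollution.NO2)

# Arrow ends at the data point; text sits at xytext
arrow_note = ax.annotate('Long Beach New Years',
                         xy=(x_point, y_point),
                         xytext=(2, 15),
                         arrowprops={'facecolor': 'gray', 'width': 3, 'shrink': 0.03},
                         backgroundcolor='white')
```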

![image-6](image-6.png)


# Combining annotations and color

1. Using a list comprehension, make a vector of colors for each point with 'orangered' if the point belongs to Long Beach, and 'lightgray' if it doesn't.

In [None]:
# Make a vector where Long Beach is orangered; else lightgray
is_lb = ['orangered' if city == 'Long Beach' else 'lightgray' for city in pollution['city']]

2. Use the is_lb vector to provide custom colors for each point using the additional keyword argument facecolors in the scatter_kws argument. In the same scatter_kws dictionary, set the opacity to 0.3.

In [None]:
# Make a vector where Long Beach is orangered; else lightgray
is_lb = ['orangered' if city == 'Long Beach' else 'lightgray' for city in pollution['city']]

# Map facecolors to the list is_lb and set alpha to 0.3
sns.regplot(x = 'CO',
            y = 'O3',
            data = pollution,
            fit_reg = False, 
            scatter_kws = {'facecolors': is_lb, 'alpha': 0.3})
plt.show() 

3. Add an annotation at x = 1.6 and y = 0.072 using the text 'April 30th, Bad Day' to draw attention to a specific point in the data.

In [None]:
# Make a vector where Long Beach is orangered; else lightgray
is_lb = ['orangered' if city == 'Long Beach' else 'lightgray' for city in pollution['city']]

# Map facecolors to the list is_lb and set alpha to 0.3
sns.regplot(x = 'CO',
            y = 'O3',
            data = pollution,
            fit_reg = False,
            scatter_kws = {'facecolors': is_lb, 'alpha': 0.3})

# Add annotation to plot
plt.text(1.6, 0.072, 'April 30th, Bad Day')
plt.show()