# Coding Outreach Group Summer Workshop
# Jupyter Notebook
06/03/2021

__**Content creator:**__ Kim Nguyen

__**Content reviewers:**__ Haroon Popal



## Description
This workshop will be an introduction to [Jupyter](https://jupyter.org/) Notebooks, an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

## Prerequisites
1. Basic knowledge of Python
2. A computer with Jupyter installed
  1. You can install Jupyter by [itself](https://jupyter.org/install), or you can install [Anaconda](https://www.anaconda.com/distribution/) (recommended), which includes Jupyter, Python (with the Spyder IDE), and RStudio

## To-do before the workshop
1. Download the data.csv and the jupyter_workshop2021.ipynb files in this folder and save these files in one location
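2. If you have Anaconda installed, you should already have the required Python packages for this workshop. If you are not using Jupyter through Anaconda, install these packages: pandas, seaborn, matplotlib, and numpy. The os module is also used, but it is part of Python's standard library and does not need to be installed.
  1. You can do this in the Terminal with `conda install [package name]`, or with `pip install [package name]` if you do not have conda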
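
To check which of these packages are already available before the workshop, something like the following could be run in a code cell. This is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for illustration.

```python
import importlib.util

# Packages used in this workshop; os ships with Python, so it never needs installing
REQUIRED = ["pandas", "seaborn", "matplotlib", "numpy", "os"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Install these before the workshop:", ", ".join(missing))
else:
    print("All workshop packages are available.")
```

Anything listed as missing can then be installed from the Terminal as described above.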

## Workshop objectives:
1. Familiarize you with Jupyter Notebook
2. Demonstrate how Jupyter Notebook can be used for data management and data visualization

# Intro

In [None]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/d5RUUOYMYEg?controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

# I. Markdowns

Markdown cells convert your text into HTML.
You can format *text* in Markdown cells with Markdown syntax, or write **HTML code** directly.

Markdown cells are great for adding text that explains what your code is doing or that helps organize your Notebook.

[Markdown cheatsheet](https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet)

## Exercise 1: Create markdown cells
#### Instructions: 
1. Add a markdown cell below with answers to the trivia questions 

### Trivia Question 1! 
What three programming languages do Jupyter Notebooks support? Hint: "Jupyter" is a mashup of all three!

### Trivia Question 2!
What social media platform allowed you to edit HTML code to customize your page?

# II. Code Cells

In [None]:
# This is a cell
# You will put your code in cells (Python or R)
# To include in-cell text that is not code, include a # at the start of the line
print("hi!")

## Exercise 2: Printing answers
#### Instructions:
1. Add a code cell below that will print this trivia answer

### Trivia Question 3!
What is the largest planet in our solar system?

# III. Restarting the Kernel
Hm, I think I want to start my code over. Let's restart the kernel and clear the output.
Go to **Kernel** > **Restart & Clear Output**

Notice that the In [ ] prompt is now empty and your code cells no longer have output. Restarting the kernel restarts your Notebook session, so any variables and imports from earlier cells are gone until you run those cells again.

You may exit your Notebook without restarting and clearing the output, and come back to the Notebook as it was, output and all.

Now add a code cell below that will print this trivia answer

### Trivia question 4!
What two animals are on the Pennsylvania flag?

See how the number next to the cell you just ran is now 1 (In [1])?
When you restart the kernel and then execute code, the Notebook's execution count starts over.

# IV. Data Management

## Setting your working directory
This isn't necessary today if this Notebook is in the same folder/directory as the data.csv. But if you need to work with data from multiple directories and don't want a separate Notebook for each directory, this is useful. 

In [None]:
# Mac users can run command-line commands such as pwd and ls within the notebook; just make sure the command is in its own cell
# Otherwise, the notebook will treat the cell as Python code
pwd

In [None]:
ls

## Exercise 3: Change directories
#### Instructions:
1. Change directories to where the data.csv file for this workshop is located

In [None]:
# Windows users can import os to change the working directory
import os
os.chdir(r"C:\Users\youruser\folder\folder\etc")
# Mac users can also use os.chdir in this format: os.chdir('/Users/KimNguyen/Desktop/Jupyter_Workshop')

# Get current working directory
os.getcwd()

In [None]:
# List everything in current directory
os.listdir()
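
If you only need another directory for a moment, one option is to switch into it temporarily and then return to where you started. The sketch below assumes a made-up helper named `working_directory` and uses a throwaway temporary folder as a stand-in for your data folder.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, then return to the starting directory."""
    start = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(start)   # runs even if the code inside the block raises an error

before = os.getcwd()
with tempfile.TemporaryDirectory() as demo_dir:   # stand-in for your data folder
    with working_directory(demo_dir):
        inside = os.getcwd()
        files_inside = os.listdir()               # empty, because the temp folder is new
after = os.getcwd()

print("Back where we started:", before == after)
```

This keeps the rest of the Notebook from depending on which directory an earlier cell happened to leave you in.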

### Trivia Question 5!
What does "pwd" stand for?

# V. Introduction to Pandas
Pandas is a Python library for data organization and analysis.

It uses a DataFrame object for data manipulation with integrated indexing.

Note that Python is 0-indexed, meaning the first element has index 0 and the second has index 1.
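
As a quick illustration of 0-indexing, here is a small sketch with made-up values. The same convention applies to position-based access in pandas through `.iloc`.

```python
import pandas as pd

ages = [7, 9, 24]            # made-up values for illustration
first = ages[0]              # index 0 is the first element
second = ages[1]             # index 1 is the second element
last = ages[-1]              # negative indices count from the end

# pandas follows the same convention for position-based access with .iloc
series = pd.Series(ages)
first_by_position = series.iloc[0]

print(first, second, last, first_by_position)
```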

In [None]:
# Import the pandas and numpy modules
import pandas as pd
import numpy as np


## Importing and viewing your data

In [None]:
# Reading in your data file
data = "data.csv"

# Convert the data file from a csv to a pandas dataframe
DF = pd.read_csv(data, header = 0, index_col = 0)


In [None]:
# Some ways you can view your data to check that everything was imported and converted correctly

# Prints the first 5 rows of the dataframe, you can also insert a specific number of rows in the parentheses DF.head(2)
DF.head()


In [None]:
# How will the dataframe change if we ran this line instead: DF = pd.read_csv(data, header = 1, index_col = 1)?
# Test it out here!
DF = pd.read_csv(data, header = 1, index_col = 1)
DF.head()


In [None]:
# But remember to reset to the correct dataframe format
DF = pd.read_csv(data, header = 0, index_col = 0)
DF.head()
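
To see what `header` and `index_col` do without touching the workshop file, here is a small sketch that reads a made-up three-row CSV from a string. The column names and values are invented for illustration and are not the contents of data.csv.

```python
import io
import pandas as pd

# Made-up stand-in for data.csv; the real file has different columns and values
csv_text = "Participant,Age,Sex\n1,7,F\n2,9,M\n3,24,F\n"

# header=0: the first line supplies the column names
# index_col=0: the first column becomes the row labels
good = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

# header=1: the second line (the first participant!) is used as the column names
# index_col=1: the second column becomes the row labels
shifted = pd.read_csv(io.StringIO(csv_text), header=1, index_col=1)

print(list(good.columns), good.shape)
print(list(shifted.columns), shifted.shape)
```

With `header=1`, the real header line is skipped and one row of data is lost to the column names, which is why the workshop resets to `header = 0, index_col = 0`.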

In [None]:
# Gives the dimensions of your dataframe (#rows, #columns)
DF.shape


In [None]:
# Gives the types of data of the columns
DF.dtypes


In [None]:
# If you put all three previous lines together, you have to use print() for each, otherwise, the notebook will only print the last line.
print(DF.head())
print(DF.shape)
print(DF.dtypes)


## Selecting specific data

In [None]:
# We can cut out a single column like this
# With single brackets, what's really returned here is a pandas Series
DF['Age']

# Or multiple columns by inserting a list of columns ['Age', 'Sex']
# What's returned in this case is a DataFrame (DF[['Age']] also returns a DataFrame, with one column)
#DF[['Age', 'Sex']]

# If you want to save any selection as its own Series or DataFrame
#DF2 = DF[['Age', 'Sex']]
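
The difference between single and double brackets is easy to check with `type()`. The sketch below uses a tiny made-up dataframe rather than the workshop data.

```python
import pandas as pd

demo = pd.DataFrame({"Age": [7, 9, 24], "Sex": ["F", "M", "F"]})  # made-up data

one_column_series = demo["Age"]        # single brackets -> Series
one_column_frame = demo[["Age"]]       # double brackets -> DataFrame with one column
two_columns = demo[["Age", "Sex"]]     # list of names -> DataFrame

print(type(one_column_series).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```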


In [None]:
# We can also select specific rows by their index label using .loc
# For this dataframe, the index is the participant number, so this gives us the row for participant 1
DF.loc[1]


In [None]:
# We can specify a row label and a column label by passing in two arguments
#DF.loc[1, 'Age']

# Or multiple row and columns!
DF.loc[[1,2], ['Age', 'Sex']]
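
Because `.loc` works with labels and `.iloc` works with positions, the same number can point at different rows. Here is a minimal sketch with made-up data, assuming participant numbers that start at 1.

```python
import pandas as pd

# Made-up data indexed by participant number, assumed to start at 1
demo = pd.DataFrame(
    {"Age": [7, 9, 24], "Sex": ["F", "M", "F"]},
    index=pd.Index([1, 2, 3], name="Participant"),
)

by_label = demo.loc[1, "Age"]        # row labelled 1 -> the first participant
by_position = demo.iloc[1, 0]        # row at position 1 -> the second participant
subset = demo.loc[[1, 2], ["Age", "Sex"]]

print(by_label, by_position)
```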


In [None]:
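# Getting stats of specific columns
# Only the last expression in a cell is displayed automatically, so use print() for earlier ones
print(DF['Age'].describe())

# Getting stats of all columns, this will help later on when you graph the data
DF.describe()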
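
The output of `.describe()` is itself a dataframe, so individual statistics can be pulled out with `.loc`. The sketch below uses made-up data and shows one possible way to turn the min and max into axis limits for a plot; the padding of 1 is an arbitrary choice.

```python
import pandas as pd

demo = pd.DataFrame({"Age": [7, 9, 24], "Score": [-1.0, 0.0, 1.0]})  # made-up data

stats = demo.describe()                 # one column of statistics per numeric column
age_min = stats.loc["min", "Age"]
age_max = stats.loc["max", "Age"]

# Pad the observed range a little so points do not sit on the plot border
x_limits = (age_min - 1, age_max + 1)
print(x_limits)
```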


In [None]:
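# If you make changes to your dataframe or create a new dataframe, you can save it as a csv/excel/text file
# index=False leaves out the row labels (here, the participant numbers); drop that argument to keep them
DF.to_csv('new.csv', index=False)
# DF.to_excel('new.xlsx', index=False)  # recent pandas versions write .xlsx rather than .xls (needs openpyxl)
# DF.to_csv('new.txt', index=False)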
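
To see what `index=False` changes, here is a small round-trip sketch with made-up data. It writes to a temporary folder so nothing is left behind in your working directory.

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame(
    {"Age": [7, 9, 24]},
    index=pd.Index([1, 2, 3], name="Participant"),   # made-up participant numbers
)

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    demo.to_csv(with_index)                   # the index is written as the first column
    demo.to_csv(without_index, index=False)   # the participant numbers are dropped

    kept = pd.read_csv(with_index, index_col=0)
    dropped = pd.read_csv(without_index)

print(list(kept.index), list(dropped.columns))
```

If the index holds information you need later, such as participant numbers, saving without `index=False` keeps it in the file.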

More on Pandas: https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html

### Trivia Question 6!
What US city got the first panda in their zoo in 1936?

# VI. Data Visualization

## matplotlib
Plotting library for Python and its extension NumPy

Very similar to the MATLAB interface, but free

More on matplotlib: https://matplotlib.org/gallery/index.html

## seaborn
Visualization library built on matplotlib

Closely integrated with pandas data structures

More on seaborn: https://seaborn.pydata.org/

More on seaborn [color palettes](https://seaborn.pydata.org/tutorial/color_palettes.html) and a [comparison of palettes](https://gist.github.com/mwaskom/b35f6ebc2d4b340b4f64a4e28e778486) to see which are colorblind-friendly.

### Trivia Question 7!
What is different about the Pantone 2021 color of the year?

Bonus! What is the 2021 color of the year (or general color family)?

In [None]:
# Import seaborn and matplotlib modules for data visualization
# seaborn is built on matplotlib, so we can use functions from both on the same plot
import seaborn as sns
import matplotlib.pyplot as plt 


Colormap values: Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cividis, cividis_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, icefire, icefire_r, inferno, inferno_r, jet, jet_r, magma, magma_r, mako, mako_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, rocket, rocket_r, seismic, seismic_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, viridis, viridis_r, vlag, vlag_r, winter, winter_r

In [None]:
# Set seaborn context. 
# This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style. 
# The base context is “notebook”, and the other contexts are “paper”, “talk”, and “poster”
sns.set_context("paper", font_scale = 1.5)


More about set context: https://seaborn.pydata.org/generated/seaborn.set_context.html

## Scatterplots
Visualizing two continuous variables and adding in grouping dimensions

In [None]:
# A good ole vanilla scatterplot
plt.gcf().subplots_adjust(bottom=0.15)  # adds room below the plot so the x-axis label is not cut off
# palette only takes effect together with hue, so it is left out here
plot1 = sns.scatterplot(x="Age", y="Composite (z)", data=DF)
# Saving the graph to your current directory; the higher the dpi value, the longer it'll take for the cell to run
plt.savefig("plot1.png", dpi=100, bbox_inches="tight")


In [None]:
# You can also plot data by groups using hue = ""
plot2 = sns.scatterplot(x="Preposition",y="Composite (z)", palette = 'Set2', data=DF, hue = "Age Bins")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)  # moves the legend outside the plot
# bbox_inches="tight" expands the saved area to include the legend outside the plot, so it is not cut off in the png
plt.savefig("plot2.png",dpi=100,bbox_inches="tight")
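
One way to check how `bbox_inches="tight"` treats a legend placed outside the axes is to save the same figure with and without it and compare the image widths. This is a sketch with made-up points that uses matplotlib only; the off-screen `Agg` backend is an assumption so that it runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")            # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [2, 1, 3], label="group A")   # made-up points
ax.scatter([1, 2, 3], [3, 2, 1], label="group B")
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)  # legend outside the axes

with tempfile.TemporaryDirectory() as tmp:
    tight_path = os.path.join(tmp, "tight.png")
    default_path = os.path.join(tmp, "default.png")
    fig.savefig(tight_path, dpi=100, bbox_inches="tight")   # saved area is fitted to all artists
    fig.savefig(default_path, dpi=100)                      # saved area is the fixed figure size
    tight_width = plt.imread(tight_path).shape[1]           # image width in pixels
    default_width = plt.imread(default_path).shape[1]

plt.close(fig)
print("tight:", tight_width, "px wide; default:", default_width, "px wide")
```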


In [None]:
# Scatterplots by two grouping dimensions (Age Bins and Sex)
plot3 = sns.FacetGrid(DF, col="Age Bins", hue="Sex", palette = "Set2")
plot3.map(plt.scatter, "Preposition", "Composite (z)", alpha=1)
plot3.add_legend();
plot3.savefig("plot3.png",dpi=50,bbox_inches="tight")


In [None]:
# Scatterplots with bivariate distributions
plot4 = sns.jointplot(x="Preposition", y="Composite (z)", data=DF, kind="reg")
plt.savefig("plot4.png", dpi=100,bbox_inches="tight")

# Hexbin plot: A really cool plot that also includes distributions, but it works best with large datasets
with sns.axes_style("white"):
    plot5= sns.jointplot(x= "Preposition", y= "Composite (z)", kind="hex", color="Pink", data=DF)
plt.savefig("plot5.png", dpi=100,bbox_inches="tight")


More about scatterplot: https://seaborn.pydata.org/generated/seaborn.scatterplot.html

## Regression Plots
Plotting data with regression model fit

In [None]:
# Bivariate regression plot
# x and y are data variables, palette is your color scheme, data is your dataset (usually a dataframe), and aspect is the width-to-height ratio of each facet (larger values give a wider plot)
plot6 = sns.lmplot(x="Age", y="Composite (z)", palette = 'Set2',data=DF, aspect=1.5, hue="Sex") 

# Optional: setting specific axes limits. 
# You can check your .describe() output from earlier to see the min and max of your variables
axes = plot6.axes 
axes[0,0].set_ylim(-3,3)  # set graph y-axis limits
axes[0,0].set_xlim(5,25)  # set graph x-axis limits
plt.savefig("plot6.png",dpi=100,bbox_inches="tight") 


In [None]:
# Regression plot by two groups (Age Bins and Sex)
plot7 = sns.lmplot(x="Preposition", y="Composite (z)", palette = 'Set2', data=DF, aspect=1.5, hue = "Age Bins", col = "Sex") 
axes = plot7.axes 
axes[0,0].set_ylim(-2,2)
axes[0,0].set_xlim(0,50)
plt.savefig("plot7.png",dpi=100,bbox_inches="tight")


More about lm plot: https://seaborn.pydata.org/generated/seaborn.lmplot.html

## Bar Plots

In [None]:
# Regular bar plot with capped 95% confidence interval bars
plot8 = sns.barplot(x="Age Bins", y="Composite (z)", data=DF, palette= "Set2",ci=95, capsize= .15)
plt.savefig("plot8.png",dpi=100,bbox_inches="tight")


In [None]:
# Barplot with individual datapoints
# order = [] takes in a list of the specified categorical order; use the same order for the points and the bars so they line up
age_order = ["6-7 years","8-9 years","10-12 years","Adult"]
plot9 = sns.catplot(x="Age Bins", y="Composite (z)", data=DF, palette="Pastel2", order=age_order)
plot9.map(sns.barplot, x="Age Bins", y="Composite (z)", data=DF, palette="Set2", ci=95, capsize=.15, order=age_order)
plot9.set_axis_labels("Age Bins", "Composite (z)")
plt.savefig("plot9.png",dpi=100,bbox_inches="tight")
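
An alternative to passing `order` to every plot is to store the order in the dataframe itself with an ordered categorical column. The sketch below uses made-up rows with the same bin labels as above; it is offered as one possible approach, not as part of the original workshop code.

```python
import pandas as pd

age_order = ["6-7 years", "8-9 years", "10-12 years", "Adult"]

# Made-up rows; the column name mirrors the workshop data
demo = pd.DataFrame({"Age Bins": ["Adult", "10-12 years", "6-7 years", "8-9 years"]})

alphabetical = sorted(demo["Age Bins"])   # plain strings sort character by character

# Store the intended order in the column itself
demo["Age Bins"] = pd.Categorical(demo["Age Bins"], categories=age_order, ordered=True)
by_age = list(demo.sort_values("Age Bins")["Age Bins"])

print(alphabetical)
print(by_age)
```

Sorting the plain strings puts "10-12 years" first, while the categorical column sorts from youngest to oldest.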


In [None]:
# Bar plots across two grouping dimensions
with sns.color_palette("Set2"):
    plot10 = sns.FacetGrid(DF, col="Sex", height=10, aspect=1.5)
    plot10.map(sns.barplot, "Age Bins", "Composite (z)", ci=95);
plt.savefig("plot10.png",dpi=50,bbox_inches="tight")


More on bar plots: https://seaborn.pydata.org/generated/seaborn.barplot.html

More on multi-plot grids: https://seaborn.pydata.org/tutorial/axis_grids.html

## Other Cool Plots

In [None]:
# Violin plots
plot11 = sns.violinplot(x="Age Bins", y="Composite (z)", data=DF, palette="Set2", hue="Sex")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.savefig("plot11.png", dpi=100,bbox_inches="tight")  # bbox_inches="tight" keeps the legend that's outside the plot in the saved png


More about violin plots: https://seaborn.pydata.org/generated/seaborn.violinplot.html

In [None]:
# Heatmaps, very useful for RSA or correlation comparison visualization
fake_data = np.random.rand(51, 51)
plot12 = sns.heatmap(fake_data, cmap="Blues")
plt.savefig("plot12.png", dpi=100,bbox_inches="tight")
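
For a correlation heatmap, the input would be a square correlation matrix rather than raw random numbers. Here is a minimal sketch with made-up, seeded data that builds such a matrix with NumPy; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)          # seeded so the matrix is the same on every run
samples = rng.normal(size=(100, 5))          # 100 made-up observations of 5 variables

corr = np.corrcoef(samples, rowvar=False)    # columns are variables -> 5 x 5 matrix

print(corr.shape)
# corr could then be passed to sns.heatmap(corr, cmap="Blues", vmin=-1, vmax=1)
```

The matrix is symmetric with ones on the diagonal, since every variable correlates perfectly with itself.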


More on heatmaps: https://seaborn.pydata.org/generated/seaborn.heatmap.html

More on clustermaps: https://seaborn.pydata.org/generated/seaborn.clustermap.html#seaborn.clustermap

# Outro

In [None]:
from IPython.display import HTML  # re-import in case the kernel was restarted since the Intro cell
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/6QhH7cFcyoo?controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')