# Graduate Society of Black Engineers and Scientists
### March 20, 2019
### Programming Workshop by Girls Who Code at UM DCMB
#### Adapted from Software Carpentry Curriculum http://swcarpentry.github.io/python-novice-gapminder/

### Intro to Python and Jupyter Notebook
Python is an object-oriented programming language, which just means we care a lot about the objects that we want to manipulate. Every object has a data type. The basic data types are integer, float, string, and boolean. Assignment works from right to left: the value on the right of `=` is stored under the variable name on the left. Jupyter Notebook lets us edit and run Python code in cells. Learn more about Jupyter Notebook [here](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20Basics.html).
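
As a small illustration of assignment and the basic data types, here is a minimal sketch. The variable names (`count`, `ratio`, `label`, `is_even`) are made up for this example and are not used elsewhere in the workshop.

```python
# Hypothetical variable names, chosen only for this illustration
count = 3                          # the integer 3 is stored under the name count
count = count + 1                  # the right side (3 + 1) is evaluated first, then stored back in count
ratio = count / 2                  # dividing with / always produces a float
label = "count is " + str(count)   # str() turns the integer into a string so it can be joined
is_even = count % 2 == 0           # a comparison produces a boolean
print(count, ratio, label, is_even)
```

Note that the same name can be reused: after the second line, `count` no longer holds 3.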

In [0]:
#Data Types
w = 1 #integer
x = 1.0 #float
y = "hello world" #string
z = True #boolean

In [0]:
#The print function will print its argument to the screen
#Python has lots of built in functions like print
#You can also define your own functions when you learn more
#Think of functions like f(x) in math class
print(w+z)

In [0]:
#Running the notebook cell gives the evaluation of the last command
w+z

In [0]:
#If we define a variable as the last command, we do not see the output. 
#But the variable test now has a value which we can access later.
test = w+z

In [0]:
print(test)

In [0]:
#What happens if we add x and y?
#How can we determine the data type of z?

Data can be stored in containers. Common containers in Python are lists, NumPy arrays, pandas data frames, dictionaries, and tuples.

In [0]:
#Each container has specific manipulations we can perform
my_list = [1,2,3]
#We can use the list method append to add a value to the end
my_list.append(4)
print(my_list)
#We can index a list to pull a value out
my_list[0]
#Python is 0-based so the 1st position is accessed with 0
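
Pulling out one value (indexing) and pulling out a range of values (slicing) look similar but return different things. A minimal sketch, using a made-up list called `letters`:

```python
letters = ["a", "b", "c", "d"]   # hypothetical list, for illustration only

first = letters[0]       # indexing: one element, counting from 0
last = letters[-1]       # negative indices count from the end
middle = letters[1:3]    # slicing: a new list with positions 1 and 2 (the stop position 3 is excluded)

print(first, last, middle)
```

Indexing returns the element itself, while slicing always returns a new list, even when it holds a single element.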

In [0]:
#Dictionaries help us keep track of key and value pairs
my_dict = {"Brooke":"Bioinformatics", "Sierra":"MIP", "Allie":"Pharmacology"}
#We can pull out a value if we know the key
print(my_dict["Brooke"])
#We can also look up all the keys with a method
my_dict.keys()
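
Dictionaries can also be changed after they are created. The sketch below uses a made-up dictionary, `office_hours`, to show adding a pair, replacing a value, and looping over the contents.

```python
# Hypothetical keys and values, for illustration only
office_hours = {"Monday": "10am", "Wednesday": "2pm"}

office_hours["Friday"] = "1pm"            # assigning to a new key adds a pair
office_hours["Monday"] = "11am"           # assigning to an existing key replaces its value
has_tuesday = "Tuesday" in office_hours   # membership tests look at the keys

for day, time in office_hours.items():    # items() yields (key, value) pairs
    print(day, time)
```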

In [0]:
#To use numpy arrays and pandas data frames we need to import packages
#Packages are tidy ways of organizing special functionalities
import numpy as np
my_arr = np.array([1,2,3])
print(my_arr)
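
One reason to reach for a NumPy array instead of a list is that arithmetic applies to every element. A minimal sketch comparing the two, with made-up variable names:

```python
import numpy as np

plain_list = [1, 2, 3]
arr = np.array([1, 2, 3])

doubled_list = plain_list * 2   # repeats the list: [1, 2, 3, 1, 2, 3]
doubled_arr = arr * 2           # multiplies each element: array([2, 4, 6])
arr_mean = arr.mean()           # arrays come with numeric methods such as mean

print(doubled_list)
print(doubled_arr)
print(arr_mean)
```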

### Mount Google Drive
If you're running this .ipynb from a Google Drive folder that also holds the .csv files needed for the rest of the exercise, you need to mount the drive and set the file path first.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
# set the file path 
# after "My Drive", change the path so it points to this folder in your own Google Drive
# e.g. 'codeDemos-master' is specific to my GDrive -- yours may be different
gdrive_path = '/content/gdrive/My Drive/codeDemos-master/' # don't leave out the trailing slash /
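
If remembering the trailing slash feels fragile, one alternative is to let `os.path.join` from the standard library add the separator. This is only a sketch: the names `folder` and `csv_path` are made up here so they do not clash with the variables used in the rest of the notebook, and the folder is the same example path as above.

```python
import os

# Hypothetical names; the folder is the example path from above, yours may differ
folder = '/content/gdrive/My Drive/codeDemos-master'   # no trailing slash needed here

# os.path.join inserts the separator between the pieces
csv_path = os.path.join(folder, 'monthly_global_temps.csv')
print(csv_path)
```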

### Manipulating data frames with pandas
Pandas lets us easily work with data frames, which often come from comma separated value (.csv) files. Documentation can be found [here](http://pandas.pydata.org/pandas-docs/stable/). We will be using a .csv file from [datahub.io](https://datahub.io/collections/climate-change) with the monthly global mean temperature in Celsius from 1880 to 2016.

In [0]:
import pandas as pd
#Let's read in a comma separated value (.csv) file
#The filename as a string is passed as an argument to the function read_csv which is from the package pandas
filename = f"{gdrive_path}monthly_global_temps.csv"
print(filename)
df = pd.read_csv(filename)
#Let's look at the first 10 lines
#I'm passing the data frame and the argument 10 to the pandas.DataFrame.head function
pd.DataFrame.head(df,10)
#How do you think we can see the last 5 rows of the data frame?

In [0]:
#I can also call head as a method and specify the number of lines shown
#Default is 5 lines
df.head(3)

In [0]:
#Let's get some information about the data frame. How many rows are there total?
pd.DataFrame.info(df)

In [0]:
#Pandas has a special data type for Dates (datetime64)
#It will help us later if we represent the Date column in this format, so convert it using the function to_datetime
df['Date']=pd.to_datetime(df['Date'])
pd.DataFrame.info(df)
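
To see what the conversion does without needing the real file, here is a minimal sketch on a tiny in-memory table. The rows are made up and only reuse the column names `Source`, `Date`, and `Mean`; the date format is an assumption, and `toy` and `Year` are hypothetical names.

```python
import io
import pandas as pd

# Made-up rows that only mimic the column names; not real measurements
csv_text = """Source,Date,Mean
GCAG,2016-12-06,0.5
GISTEMP,2016-12-06,0.6
GCAG,2016-11-06,0.7
"""

toy = pd.read_csv(io.StringIO(csv_text))   # read_csv accepts file-like objects as well as paths
print(toy['Date'].dtype)                   # before conversion the dates are stored as text

toy['Date'] = pd.to_datetime(toy['Date'])
print(toy['Date'].dtype)                   # after conversion this is a datetime64 type

# Datetime columns expose date parts through the .dt accessor
toy['Year'] = toy['Date'].dt.year
print(toy)
```

Once the column is a datetime type, plots can treat it as a real time axis instead of a list of labels.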

In [0]:
#We can look at a specific column
#Note that loc is an indexer rather than a function, so we use [] instead of ()
#And we always have to think about [row,column]
df.loc[:,'Mean']

In [0]:
#We can pull a specific row or series of rows using loc
df.loc[0:3]

In [0]:
#And we can do this for a specific column. What data type is the Mean column?
print(df.loc[0:3,"Mean"])

In [0]:
#Experiment with how df.iloc works. What is different?

In [0]:
#We can also perform conditional subsetting of a data frame 
#Let's subset based on source 
#Data from GISS Surface Temperature (GISTEMP) analysis & the global component of Climate at a Glance (GCAG)
df.loc[:,'Source']=="GCAG"
#This gives us a pandas Series of booleans

In [0]:
#We can achieve the same thing without loc 
df['Source']=="GCAG"

In [0]:
#We can use the booleans to subset the dataframe to only the lines that are True
subset=df[df['Source']=="GCAG"]
#Let's look at the first 5 rows of the new data frame variable called subset
subset.head()

In [0]:
#Let's look at the dimensions
print("The total dataset has shape:",df.shape)
print("The conditionally subsetted dataset has shape:",subset.shape)
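
Conditions can also be combined. The sketch below uses a small made-up data frame (`toy`, with invented values) to keep only rows that satisfy two conditions at once.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    'Source': ['GCAG', 'GISTEMP', 'GCAG', 'GISTEMP'],
    'Mean': [0.2, 0.9, 0.8, 0.1],
})

# Each condition gets its own parentheses; & means "and", | means "or"
warm_gcag = toy[(toy['Source'] == 'GCAG') & (toy['Mean'] > 0.5)]
print(warm_gcag)
print(warm_gcag.shape)
```

The parentheses matter: without them Python applies `&` before the comparisons and raises an error.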

### Plotting with matplotlib
Matplotlib is a 2D plotting library which allows you to visualize your data in many customizable ways! We are going to focus on one module of the library, pyplot, which has documentation [here](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html?highlight=pyplot#module-matplotlib.pyplot). A really awesome tutorial on plotting is available through [Software Carpentry](https://swcarpentry.github.io/python-novice-gapminder/09-plotting/).

In [0]:
#To make plots we need to use the matplotlib package and we are going to learn the submodule pyplot
import matplotlib.pyplot as plt

#Let's visualize the mean temperature over time
#Here plt.plot takes 3 arguments: series for x axis, series for y axis, and a format string ('b' means a blue line)
#We are also saving this to a file called test.png
plt.figure(figsize=(6,4))
plt.plot(subset.loc[:,'Date'],subset.loc[:,'Mean'],'b')
plt.xlabel('Date')
plt.ylabel('Mean Temperature (Celsius)')
plt.savefig("test.png")
plt.show()

In [0]:
#Try to change the x-axis labels to 45 degree angles using plt.xticks()
#Customize the plot any other way you would like

In [0]:
#Alternatively we can use the plot method on Series and DataFrame which is just a simple wrapper around plt.plot
subset.plot(x='Date',y='Mean')
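
To compare both sources in one figure, one option is to draw one line per source on the same axes. This is a minimal sketch on made-up values, not the workshop data; `toy`, `fig`, and `ax` are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend so this also runs as a script; skip this line in Jupyter or Colab
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    'Source': ['GCAG', 'GISTEMP', 'GCAG', 'GISTEMP'],
    'Date': pd.to_datetime(['2015-01-15', '2015-01-15', '2016-01-15', '2016-01-15']),
    'Mean': [0.8, 0.9, 1.0, 1.1],
})

fig, ax = plt.subplots(figsize=(6, 4))
# groupby yields one (name, sub-frame) pair per source
for source, group in toy.groupby('Source'):
    ax.plot(group['Date'], group['Mean'], label=source)
ax.set_xlabel('Date')
ax.set_ylabel('Mean Temperature (Celsius)')
ax.legend()   # uses the label given to each line
```

Working with an explicit `ax` object instead of the `plt.` functions makes it clearer which plot each command changes.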

In [0]:
#The data has temperature measures from 2 sources
#GISTEMP Global Land-Ocean Temperature Index from NASA
#Global component of Climate at a Glance (GCAG) from NOAA
#Let's look at the temperature distribution 

#Boxplots can be created for every column in the dataframe by df.boxplot() or indicating the columns to be used
df.boxplot(column=['Mean'],by='Source')

#We see something similar when we calculate quartiles for numeric values in the data frame using the quantile function
#numeric_only=True skips non-numeric columns such as Source
df.quantile([0.25,0.5,0.75], numeric_only=True)
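
The boxplot is split by source, so a closer numeric match is to compute the quartiles per source as well. A minimal sketch using `groupby` on made-up values (`toy` and `quartiles` are hypothetical names):

```python
import pandas as pd

# Made-up values, chosen so the quartiles are easy to check by eye
toy = pd.DataFrame({
    'Source': ['GCAG'] * 5 + ['GISTEMP'] * 5,
    'Mean': [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# One set of quartiles per source; the result is indexed by (Source, quantile)
quartiles = toy.groupby('Source')['Mean'].quantile([0.25, 0.5, 0.75])
print(quartiles)
```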


### Extension
Read in a .csv file from [datahub.io](https://datahub.io/collections/climate-change) with the monthly global Carbon Dioxide measurements over time and perform some plotting!

In [0]:
#Read in and play with a second data frame
df2 = pd.read_csv(f"{gdrive_path}co2-mm-mlo.csv")
pd.DataFrame.head(df2)