# Are any of you sleeping?

Following up on yesterday's lesson, we're going to collect and analyze data about ourselves.

If you haven't yet, please fill out this form: https://forms.gle/QDLZ2yhzQ94kTcZZ9

## How do we store information that we might want to change? Variables!

Here, we need to store the location (the url) of our data. We use a variable, which, just like in math, has a value assigned to it.

In [1]:
# https://kanoki.org/2018/12/25/read-google-spreadsheet-data-into-pandas-dataframe/
# fakedata = "https://raw.githubusercontent.com/story645/EAS213/master/bootcamp/fake_survey.csv"
url = "https://docs.google.com/spreadsheets/d/1liJc0zDH5y5Aa02TnmfCvIG08pEQlv7rbi9aSswEuP0/export?format=csv&gid=577241993"

In [2]:
# let's print the value stored in our variable (click on it!)
url

'https://docs.google.com/spreadsheets/d/1liJc0zDH5y5Aa02TnmfCvIG08pEQlv7rbi9aSswEuP0/export?format=csv&gid=577241993'

# How do we use other people's code to get stuff done? Functions!

We want to open our spreadsheet in Python, but that's a lot of code and pretty hard code at that. What we can do instead is use code somebody else wrote that already knows how to transform our url into a spreadsheet. 

This is analogous to math, where the function $f(x) = x^2$ has:
* function name: $f$
* function arguments (input): $x$
* function return values (output): $x^2$
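
The same three pieces show up when we write $f(x) = x^2$ as a Python function. This is a small sketch just to line up the vocabulary; the name `f` is borrowed from the math above.

```python
# f(x) = x^2 written as a Python function
def f(x):          # function name: f, argument (input): x
    return x ** 2  # return value (output): x squared

print(f(3))  # 9
```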

What are some other examples of functions?

Here, we are going to use the pandas library (https://pandas.pydata.org/) to work with our spreadsheet. A library is basically a collection of functions other folks wrote. Can you identify the following:

* function name:
* function argument(s):
* function return value(s): 


In [None]:
import pandas as pd
df = pd.read_csv(url)

# How do we store more complicated data (like spreadsheets and not urls)? Objects!

`df` is a special type of variable called an object. Not only does it store our spreadsheet, it also comes with its own functions, called methods. They work the same as regular functions, except they're used (called) in the format `object_name.function_name(input)`.
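
Strings are objects too, so they make a quick way to see the difference between a function and a method. This sketch uses a made-up variable called `borough`; `len`, `.upper`, and `.startswith` are built into Python.

```python
borough = "queens"

# a regular function: the object goes inside the parentheses
length = len(borough)

# a method: called in the format object_name.method_name(input)
shout = borough.upper()
starts = borough.startswith("qu")

print(length, shout, starts)  # 6 QUEENS True
```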

Here we use the head method with input = 3 to see 3 rows in our spreadsheet:

In [None]:
df.head(3)

### Practice
What's the data for row 5?


## How do we find out a little bit more info about our spreadsheet? `.info` & `.describe`

### `.info` 
`.info` tells us the datatype (is it a number? a word?) of the values in each column. This is really important when trying to figure out what calculations we can make. 

Here we use the `.info` method to look at our variable types

In [None]:
df.info()

### `.describe`
Just like in R, we have a `.describe` method that prints out our descriptive statistics:

In [None]:
df.describe()

`.describe` defaults to computing the quantitative statistics. We can include categorical data by using the `include='all'` keyword argument:

In [None]:
df.describe(include='all')

# How are y'all distributed?

The way we select a column in Python is `dataframe_name['column_name']`, whereas in R it is `dataframe_name$column_name` (yesterday we saw it inside `class(dataframe_name$column_name)`).

Here, we're gonna compute & then plot how many of you are from which borough. We use `%matplotlib inline` to embed the image in our environment

In [None]:
# compute distribution by borough
df['Borough'].value_counts()

In [None]:
%matplotlib inline
df['Borough'].value_counts().plot.bar()

### Let's look at multiple categories at once - groupby

In pandas we can say, hey, we want to know information about two or more groups at once, and we use `groupby` to do it. For example, here let's separate our data by borough:

In [None]:
df.groupby('Borough')

That looks like soup, so let's convert it into a more readable form by turning it into a list of the form [(Borough name, all rows (people) in that category)]

In [None]:
list(df.groupby('Borough'))

In [None]:
# we can compute something like # of people in both categories using groupby
df.groupby(['Borough', 'Class']).count()

#### `.unstack`
This is a bit complicated, but we want to have boroughs as rows and classes as columns, so we use `.unstack` to reshape our table by moving one level of the hierarchy of rows up to a level of columns (which is why it's called unstack).

In [None]:
dfbc = df.groupby(['Borough', 'Class']).count().unstack()
dfbc
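
To see what `.unstack` does on a table small enough to check by hand, here is a sketch with made-up data. The `toy` dataframe and its values are invented for illustration. Unlike the cell above, it picks out the `Wake` column before counting, so the result has plain class names as its columns.

```python
import pandas as pd

# made-up survey rows, only for illustration
toy = pd.DataFrame({
    "Borough": ["Queens", "Queens", "Bronx", "Bronx", "Bronx"],
    "Class": ["freshman", "senior", "freshman", "freshman", "senior"],
    "Wake": ["7:00:00AM", "8:00:00AM", "6:30:00AM", "7:00:00AM", "9:00:00AM"],
})

# long form: one row per (Borough, Class) pair
counts = toy.groupby(["Borough", "Class"])["Wake"].count()
print(counts)

# wide form: Borough stays as rows, Class moves up into the columns
wide = counts.unstack()
print(wide)
```

In the long form, Bronx appears twice (once per class); in the wide form, it appears once, with one column per class.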

In [None]:
# let's look at class by borough
# stacked takes the boolean True, not the string 'True'
dfbc['Wake'].plot.bar(stacked=True)

The column we choose to plot doesn't really matter - verify it yourself by trying a different column!

In [None]:
# Sometimes it's easier to see a pattern if we flip the bar sections with the x-axis. 
# We do this using `.T`
dfbc['Wake'].T

The default Pandas visualization is a wrapper on matplotlib (https://matplotlib.org/) so you can tweak the figures by assigning the Axes object returned by the `.plot` function to a variable and then using the methods on that variable. 

In [None]:
ax = dfbc['Wake'].T.plot.bar(stacked=True)
ax.legend(ncol=3, loc='upper left')

### Scavenger Hunt:
1. What class are most of you in?
2. What time does almost nobody go to bed?
3. Do seniors wake up later than sophomores?

## Is there a correlation between wake and bed?

pandas has a correlation function, `.corr`, but remember our sleep times are currently strings.

We can select multiple columns by passing in a list of column names.

In [None]:
df[['Wake', 'Bed']].corr()

### Let's write a function to convert our "time" to a number:
Facts about our number:
* hours and minutes only, no seconds
* it's in AM and PM form

### How can we do different things if it's AM or PM? Selection/Control Flow!

We can use `if` statements to decide to do different things depending on if it's AM or PM. We can also use `elif` if we want another option and `else` if we want a default.

We can use string methods to find out about strings. https://docs.python.org/3/library/stdtypes.html#string-methods

In [None]:
# let's write an "is it AM?" check:
test_time = "11:00:00AM"
# let's use the `.endswith` method
test_time.endswith('AM')

In [None]:
if test_time.endswith('AM'):
    print("AM")
elif test_time.endswith('PM'):
    print("PM")
else:
    print('😕')

In [None]:
# let's wrap that in a function:
def whatis(time):
    if time.endswith('AM'):
        print("AM")
    elif time.endswith('PM'):
        print("PM")
    else:
        print('😕')

In [None]:
whatis(test_time)

In [None]:
whatis("hannah")

### How do we split our time into hours/minutes/etc? `.split`

In [None]:
test_time.split(":")

Python allows what's called unpacking: assigning each item in a list to its own variable in one line.

In [None]:
# the pieces are the hour, the minutes, and the seconds with AM/PM attached
hour, minute, local = test_time.split(":")
print(hour, type(hour))

### Converting values is called casting, and we can cast to a number by using a casting function

In [None]:
h = int(hour)
print(h, type(h))

### Let's put it all together to write a `to24` function:
* input: a time such as 11:00:00PM
* output: a 4 digit number representing the 24hr representation: 2300

In [None]:
def to24(time):
    
    return time
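
One possible way to fill in `to24`, offered as a sketch rather than the only answer. It assumes every value is a string that ends in AM or PM and starts with `hours:minutes`, like `test_time` above; blank or differently formatted survey answers would need extra handling. The variable names inside are made up for illustration.

```python
def to24(time):
    """Convert a string like '11:00:00PM' to a number like 2300."""
    # the last two characters are AM or PM; everything before is the clock part
    is_pm = time.endswith("PM")
    parts = time[:-2].strip().split(":")
    h = int(parts[0])    # cast the hour to a number
    m = int(parts[1])    # cast the minutes to a number
    if is_pm and h != 12:
        h = h + 12       # 1PM..11PM become 13..23
    elif not is_pm and h == 12:
        h = 0            # 12AM is midnight
    return h * 100 + m

print(to24("11:00:00PM"))  # 2300
print(to24("12:30:00AM"))  # 30
```

Noon and midnight are the two special cases: 12PM stays 12, and 12AM becomes 0.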

In [None]:
# let's test our function
df['Wake24'] = df['Wake'].apply(to24)
df['Bed24'] = df['Bed'].apply(to24)
df[['Wake24', 'Bed24']].head()

### Compute the correlation again
Scroll up and find the code for computing the correlation, then please report out the correlation between bed time and waking time. 


### How do I get data for just one borough? Selection/Indexing

In pandas, we can do what's called boolean indexing to get all rows that meet a condition:

![](../2016/figs/masking.png)

In [None]:
# our condition is everywhere `borough` is queens:
df['Borough'].str.match('Queens')

In [None]:
# And we put that condition inside the dataframe to get all rows that meet that condition
dsubset = df[df['Borough'].str.match('Staten Island')]
dsubset.head()

In [None]:
# We can combine conditions
dsubsubset = df[df['Borough'].str.match('Staten Island') & df['Class'].str.match('freshman')]
dsubsubset.head()
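
A few details about conditions are easier to see on a tiny table. This sketch uses a made-up `toy` dataframe and `==` for an exact comparison; `.str.match` treats its input as a pattern anchored at the start of the string, so `'Queens'` would also match a longer value that begins with Queens.

```python
import pandas as pd

# made-up rows, only for illustration
toy = pd.DataFrame({
    "Borough": ["Queens", "Bronx", "Queens", "Brooklyn"],
    "Class": ["freshman", "senior", "senior", "freshman"],
})

# the condition is a column of True/False values, one per row
is_queens = toy["Borough"] == "Queens"
print(is_queens.tolist())  # [True, False, True, False]

# & means "and", | means "or"; each comparison gets its own parentheses
queens_seniors = toy[(toy["Borough"] == "Queens") & (toy["Class"] == "senior")]
queens_or_bronx = toy[(toy["Borough"] == "Queens") | (toy["Borough"] == "Bronx")]
print(len(queens_seniors), len(queens_or_bronx))  # 1 3
```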

# Break Out: Tell me about a Borough
Note: this might change on the day, depending on whether the data is there and how many of y'all there are.

1. Each group is assigned to a borough: Bronx, Queens, Brooklyn, Manhattan, Staten Island/Other
2. Develop a function to compute hours of sleep
3. Compute hours of sleep and store it in a new column called `sleep`
4. Create a mini report on these variables for your borough:
    * charts, correlations, tell me a story about sleep and grades in your Borough
    * is anybody in your group in your subset? how can you tell?
5. Create a list of other information you'd like to collect
6. Skip ahead and make a map to compare information
7. EC: Post it to instagram & tag @matplotlib 😉
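
If your group gets stuck on the hours-of-sleep function, here is one hedged starting point. It assumes bed and wake times are already 24hr numbers like 2300 (the output of `to24`) and that nobody sleeps more than 24 hours; the function name `hours_of_sleep` is made up for this sketch.

```python
def hours_of_sleep(bed24, wake24):
    """Hours between a bed time and a wake time, both in 24hr form like 2300."""
    # split each number into hours and minutes, then count minutes since midnight
    bed_minutes = (bed24 // 100) * 60 + bed24 % 100
    wake_minutes = (wake24 // 100) * 60 + wake24 % 100
    # % (24 * 60) wraps around midnight when the wake time is "smaller" than the bed time
    slept = (wake_minutes - bed_minutes) % (24 * 60)
    return slept / 60

print(hours_of_sleep(2300, 700))  # 8.0
print(hours_of_sleep(130, 900))   # 7.5
```

To fill a whole column, one option is `df.apply(lambda row: hours_of_sleep(row['Bed24'], row['Wake24']), axis=1)`, which runs the function once per row.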

# Let's map it

We have geo data, so we can make maps! We'll be using https://geopandas.org because it works almost exactly like Pandas (which is what we were using for spreadsheets), with added support for geographic data. It also works inside ArcGIS and QGIS if you want to automate generating visualizations.

In [None]:
# let's load our sample dataset
import geopandas as gpd
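# note: gpd.datasets was removed in geopandas 1.0; on newer versions, install the
# geodatasets package and use geodatasets.get_path('nybb') instead
gdf = gpd.read_file(gpd.datasets.get_path('nybb'))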

In [None]:
gdf.head()

In [None]:
# we can directly plot the map and identify the boro & make the map big
gdf.plot('BoroName', legend=True, figsize=(8,8))

## How do we combine our data with maps? Merge/Join

This gets wildly out of scope for this workshop, but you can read more at https://geopandas.org/mergingdata.html and https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [None]:
# let's look at our data
gdf.head()

In [None]:
df.head()

Let's aggregate our data so that we have one value to plot per geographic unit. We'll convert the result into a table (`.to_frame`) and make sure its headers are lined up (`.reset_index`).

In [None]:
pop = df.groupby(['Borough']).count()['Wake'].to_frame().reset_index()
pop

We have boro in common so we can join on that. We use the `.merge` method from the geopandas dataframe so that what we get back will still be a geopandas dataframe.

In [None]:
geodf = gdf.merge(pop, right_on='Borough', left_on='BoroName')
geodf.head()
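
What happens to boroughs that show up in only one of the two tables? This sketch uses plain pandas and two made-up tables (`shapes` and `people`) to show the default behavior of `.merge` and the `how='left'` alternative.

```python
import pandas as pd

# made-up tables, only for illustration
shapes = pd.DataFrame({"BoroName": ["Bronx", "Queens", "Brooklyn"],
                       "area": [1, 2, 3]})
people = pd.DataFrame({"Borough": ["Queens", "Bronx", "Mars"],
                       "Wake": [4, 2, 1]})

# the default merge keeps only rows whose key shows up in both tables
both = shapes.merge(people, left_on="BoroName", right_on="Borough")
print(both)

# how="left" keeps every row of the left table and fills the gaps with NaN
all_shapes = shapes.merge(people, left_on="BoroName", right_on="Borough", how="left")
print(all_shapes)
```

With the default merge, a borough nobody in the class lives in would drop off the map; with `how='left'`, it stays on the map with a missing value.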

In [None]:
# our data is categorical, so let's plot by # of folks in each boro
geodf.plot('Wake', legend=True)

# Try out one of the other variables 