![DSL_logo](dsl_logo.png)


# Python 2.0!

Welcome to the Digital Scholarship Lab Level 2 Python workshop. Before proceeding, please make sure you've completed [part 1](https://github.com/BrockDSL/Intro_to_Python_Workshop), which covers:

- variables
- math
- conditionals
- loops
- functions


Today we'll learn about:
- importing libraries
- analyzing data with pandas
- plotting data with matplotlib

Join the [Etherpad](http://139.57.126.30:32780/p/Python2)






Before we get going, the next cell should look totally familiar to you.

In [None]:
scores = [3,5,6,2,1,6]

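def find_mean(scores):
    # Add up every score, then divide by how many scores there are.
    # The running total is called `total` so it doesn't hide Python's built-in sum().
    total = 0
    for s in scores:
        total = total + s

    return total/len(scores)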


find_mean(scores)


## Importing Libraries

- Our end goal is to re-use as much code as possible
- To do this we load in different Libraries using the `import` command
- For this example we want to load in the [statistics](https://docs.python.org/3/library/statistics.html) library


In [None]:
import statistics

print(statistics.mean(scores))
print(statistics.median(scores))
print(statistics.mode(scores))



How would we use the [math](https://docs.python.org/3/library/math.html) library to find the square root of the variable `number` defined below?

In [None]:
import 

number = 81

We'll be focusing on data analysis for the rest of this workshop so let's import some libraries: [pandas](https://pandas.pydata.org/), [numpy](https://numpy.org) & [matplotlib](https://matplotlib.org)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
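
The `as` part of those imports gives each library a shorter nickname. As a small sketch of the common import styles, here they are side by side using the standard `statistics` library (the `values` list is made up for illustration):

```python
# Three common ways to import, shown with the standard-library statistics module
import statistics               # use it as statistics.mean(...)
import statistics as st         # give it a nickname: st.mean(...)
from statistics import median   # pull in a single function by name

values = [3, 5, 6, 2, 1, 6]     # made-up sample values

print(statistics.mean(values))
print(st.mean(values))          # same function, reached through the nickname
print(median(values))
```

All three styles reach the same code; the nickname form is simply what most pandas and matplotlib examples use.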


# EXERCISE: Analyzing Data

We'll be exploring how to do analysis with a riff on a data set taken from [Kaggle](https://www.kaggle.com/carlolepelaars/toy-dataset/). It has been localized with Canadian cities and shortened a tad. Let's view the [file](canadian_toy_dataset.csv).



## Brainstorming: what questions we'd like to ask


Let's take a look at the file and figure out what types of questions we can ask of it.
What would we like to graph?


Add your thoughts to the [etherpad](http://139.57.126.30:32780/p/Python2)

## Loading the data

We'll load the data into a pandas `dataframe`. A dataframe has a lot of properties we can look into, such as its column names and its first few rows.

In [None]:
data = pd.read_csv("canadian_toy_dataset.csv")
data.columns = ["city","gender","age","income","ill"]
data.head(5)

## Grouping and Counting

When we want to find out how many of something there are in a dataframe we use `.count()`. We also need to gather the entries we want by grouping them together with `.groupby()`. We can chain these together to make longer, more complicated enquiries of the data.


How many people are `ill`?

In [None]:
data.groupby("ill").count()
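
To see what the chain is doing, here is a minimal sketch on a tiny made-up dataframe (the `toy` data is hypothetical, not the workshop file). It also shows `.size()`, a related pandas method that gives a single row count per group instead of one count per column:

```python
import pandas as pd

# Hypothetical toy data, much smaller than the workshop file
toy = pd.DataFrame({
    "city": ["Toronto", "Toronto", "Waterloo", "Waterloo", "Waterloo"],
    "ill": ["Yes", "No", "No", "No", "Yes"],
})

# .count() reports the number of non-missing entries in every remaining column
per_column = toy.groupby("ill").count()
print(per_column)

# .size() reports one number per group: how many rows fell into it
per_group = toy.groupby("ill").size()
print(per_group)
```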

How many people are `Male` in this dataset?

## Grouping and Averaging

If we want to do some math on the data we need to cluster it together a bit. We use `.groupby()` and `.mean()` to accomplish this.


What is the average income of people in `Waterloo`?

In [None]:
data.groupby("city")["income"].mean()
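
The result of a grouped average is indexed by the group names, so a single group can be picked out by its label. A small sketch, assuming a made-up `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "city": ["Toronto", "Toronto", "Waterloo", "Waterloo"],
    "income": [40000, 60000, 30000, 50000],
})

# One average per city
mean_income = toy.groupby("city")["income"].mean()
print(mean_income)

# Pick out one city by its label
print(mean_income.loc["Waterloo"])
```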

What is the average age of people in each `city`?

Other useful functions to apply to dataframes:

- `.max()`
- `.min()`

What are the minimum and maximum ages seen in the data?

In [144]:
print(data["age"].max())
print(data["age"].min())

65
25



## Sorting

We can sort the results of our dataframe actions by adding `.sort_values()` to the end of our count and average statements and telling it which column to sort by with the argument `by = "column"`.

What city has the most `ill` people?

In [None]:
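# Keep only the rows where ill is "Yes" first; otherwise .count() counts everyone in each city
data[data["ill"] == "Yes"].groupby("city").count().sort_values(by = "ill",ascending = False)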

What city has the highest average income?

In [None]:
# numeric_only=True skips text columns such as gender, which cannot be averaged
data.groupby("city").mean(numeric_only = True).sort_values(by = "income", ascending = False)
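
Sorting works the same way on any dataframe, grouped or not. Here is a small sketch on made-up data (the `toy` dataframe is hypothetical); note that a single column has only one set of values, so it needs no `by`:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "city": ["Toronto", "Waterloo", "Montreal"],
    "income": [50000, 40000, 45000],
    "age": [41, 38, 45],
})

# Sort whole rows by one column, largest first
by_income = toy.sort_values(by="income", ascending=False)
print(by_income)

# A single column is sorted by its own values, smallest first by default
ages = toy["age"].sort_values()
print(ages)
```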

## Unique entries

Here we use `.unique()` to keep only the first instance of each item. Results are returned as an array, which behaves much like a list and is useful for us later.

In [None]:
data["city"].unique()
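
A quick sketch of how `.unique()` behaves, using a made-up column of cities. The values come back in the order they first appear, and the related `.nunique()` method simply counts them:

```python
import pandas as pd

# Hypothetical column of cities with repeats
cities = pd.Series(["Toronto", "Waterloo", "Toronto", "Montreal", "Waterloo"])

unique_cities = cities.unique()
print(list(unique_cities))   # values in order of first appearance

print(cities.nunique())      # how many different values there are
```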

What are the unique values for the `age` field?

## Value Counts

`.value_counts()` counts how many times each unique value appears in a column.

In [None]:
data["city"].value_counts()

## Making selections into lists

We add `.tolist()` to the end of a selection to get the results as a list. This is useful for making graphs, as we'll see later.

In [None]:
data["city"].value_counts().tolist()
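
The counts on their own lose track of which value they belong to. As a small sketch on made-up data, the labels can be taken from the same result through its `.index`, so the two lists stay in matching order:

```python
import pandas as pd

# Hypothetical column of cities with repeats
cities = pd.Series(["Toronto", "Waterloo", "Waterloo", "Montreal", "Waterloo", "Toronto"])

counts = cities.value_counts()    # sorted from most to least common

labels = counts.index.tolist()    # the city names, in the same order as the counts
values = counts.tolist()          # the counts themselves

print(labels)
print(values)
```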

## Selecting subsets of data

To make life easier we can create dataframes that have just the values we are interested in.

Say we want to make a dataframe of only those who are `ill`. We'd do the following:

In [None]:
ill_people = data[data['ill'] == "Yes"]
ill_people
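
The expression inside the square brackets is a column of `True`/`False` values, and only the `True` rows are kept. A minimal sketch on made-up data (the `toy` dataframe is hypothetical) that also combines two conditions with `&`:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "city": ["Toronto", "Waterloo", "Waterloo", "Montreal"],
    "age": [30, 52, 47, 61],
    "ill": ["Yes", "No", "Yes", "Yes"],
})

# The comparison produces one True/False per row
is_ill = toy["ill"] == "Yes"
print(is_ill.tolist())

# Combine conditions with & and wrap each comparison in parentheses
ill_over_45 = toy[is_ill & (toy["age"] > 45)]
print(len(ill_over_45))
```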

Can you make a new dataframe that has only people from `Waterloo` in it? Display the first 5 entries.

waterloo_people = 


With our `ill_people` dataframe, how can we find out how many people are `ill` in each city?

*Hint:* we can use `.groupby()` and `.count()` to do this

Can we sort our previous results? We can use `.sort_values()` to accomplish this.

## Putting more pieces together

We can put all of these calculations together to ask more complex questions of the data.



How many people are `ill` and above the average age of all people in the data set?

In [None]:
# Work the average out from the data instead of typing it in by hand
average_age = data["age"].mean()

above_average_ill = ill_people[ill_people['age'] > average_age]['ill'].count()

print(above_average_ill)

What percentage of `ill` people are above the average age?


total_ill =

(above_average_ill / total_ill ) * 100


## Graphing Results

We can use the `matplotlib` library to generate some graphs of our results

## Pie Graphs
Let's draw a pie graph of the number of people that are `ill` as a proportion of everyone

In [None]:

total_ill_people = ill_people.count()['ill']
total_people = data.count()['ill']


# matplotlib expects the slice sizes as a list, so we'll make one.
# Slices must not overlap, so we compare ill people with everyone else.
pie_data = [total_ill_people, total_people - total_ill_people]


plt.pie(pie_data)
plt.show()

Say we want to add some more details to our pie graph, like `labels` and a `title`

In [None]:
# Ill people versus everyone else, so the two slices add up to the whole data set
pie_data = [total_ill_people, total_people - total_ill_people]
labels = ["Ill", "Not ill"]

plt.pie(pie_data, labels=labels)
plt.title("Percentage of people who are ill")
plt.show()
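
It can also help to print the percentage on each slice. This is a small sketch using the `autopct` option of `plt.pie`; the two counts are made-up numbers, not results from the workshop data:

```python
import matplotlib.pyplot as plt

# Made-up counts for illustration only
ill_count = 25
not_ill_count = 75
shares = [ill_count, not_ill_count]

# autopct is a format string: "%1.1f%%" prints one decimal place and a percent sign
plt.pie(shares, labels=["Ill", "Not ill"], autopct="%1.1f%%")
plt.title("Share of people who are ill")
plt.show()
```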

Can you create a pie graph that shows the gender balance of who is `ill`?

In [None]:
# Hint: you'll need to select and count

females_ill = 
males_ill = 

pie_data = [females_ill,males_ill]
labels = ["Females","Males"]

plt.pie(pie_data,labels=labels)
plt.title("Gender distribution of those ill")
plt.show()

## Automatic Histograms


Say we want to plot the income distribution of our data set:

In [None]:
plt.hist(data["income"])
plt.show()

Let's add some labels to our axes

In [None]:
plt.hist(data["income"])

plt.title("Income distribution")
plt.xlabel("Income")
plt.show()
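
A histogram sorts the values into ranges called bins and draws one bar per bin. As a sketch, the number of bins can be set with `bins`, and the y axis can be labelled too; the `incomes` list below is made up for illustration:

```python
import matplotlib.pyplot as plt

# Made-up incomes for illustration only
incomes = [32000, 41000, 45000, 47000, 52000, 58000, 61000, 75000]

# plt.hist returns the count in each bin and the edges of the bins
counts, bin_edges, _ = plt.hist(incomes, bins=4)

plt.title("Income distribution")
plt.xlabel("Income")
plt.ylabel("Number of people")
plt.show()

print(counts)      # how many values landed in each bin
print(bin_edges)   # 4 bins have 5 edges
```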

Can you draw a histogram of the age distribution? Make sure to give it a `title` and a good `xlabel`

## Bar Graphs

Let's try to draw how many people are `ill` in each city

In [None]:

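# value_counts() returns the counts together with their city labels,
# so taking both from the same result keeps every bar matched to its city
ill_counts = ill_people['city'].value_counts()
cities = ill_counts.index.tolist()
ill_by_city = ill_counts.tolist()

plt.bar(cities, ill_by_city)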
plt.show()


# Bringing it all together

Can you make a dataframe of people who are not `ill` and graph them based on what city they are in? Some of the work is written out for you.


In [None]:

healthy_people = data

healthy_people_by_city = 
cities = healthy_people.city.unique()

plt.bar()
plt.title()
plt.show()