![DSL_logo](dsl_logo.png)


# Python 2.0!

Welcome to the Digital Scholarship Lab Level 2 Python workshop. Before proceeding, please make sure you've completed [part 1](https://brockdsl.github.io/Intro_to_Python_Workshop/), which covers:
- variables
- math
- conditionals
- loops
- functions


What we'll learn today is:
- importing libraries
- analyzing data with pandas
- plotting data with matplotlib


We'll be using Python as a data analysis tool.
This is how the [Kaggle](https://kaggle.com) website works.



Before we get going, the next cell should look totally familiar to you.

In [None]:
scores = [3,5,6,2,1,6]

def find_mean(scores):
    
    #Use "total" rather than "sum" so we don't hide Python's built-in sum()
    total = 0
    for s in scores:
        total = total + s
        
    return total/len(scores)


print(find_mean(scores))

----

## Importing Libraries

- Our end goal is to re-use as much code as possible
- To do this we load in different Libraries using the `import` command
- For this example we want to load in the [statistics](https://docs.python.org/3/library/statistics.html) library


In [None]:
import statistics

print(statistics.mean(scores))
print(statistics.median(scores))
print(statistics.mode(scores))
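
The `import` command comes in a few forms, and some of them show up later in this workshop. The sketch below shows three of them with the same `statistics` library. The alias `st` is a made-up name chosen only for illustration.

```python
# Three common ways to import from the standard-library statistics module
import statistics                      # use as statistics.mean(...)
import statistics as st                # "st" is a made-up short alias
from statistics import median          # pull in just one function

scores = [3, 5, 6, 2, 1, 6]

mean_score = statistics.mean(scores)   # full library name
mean_again = st.mean(scores)           # same function through the alias
middle = median(scores)                # no prefix needed

print(mean_score, mean_again, middle)
```

All three forms load the same code. They only change what you have to type in front of the function name.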


- Try Q1 - Q2 below and type "Got it" in the chat when you are done.

- **Q1** How would we use the [math](https://docs.python.org/3/library/math.html) library to find the square root of the variable called _number_ in the following cell?

In [None]:
import 

number = 81

print(number)

- **Q2** The `string` library ships with every Python installation, so there is nothing extra to install, but it still needs to be imported before use. Try to print all of the lowercase letters that [string](https://docs.python.org/3/library/string.html) knows about.

In [None]:
print()


# EXERCISE: Analyzing Data

![sick](https://upload.wikimedia.org/wikipedia/commons/9/97/Caladrius2.jpg)

We'll be focusing on data analysis for the rest of this workshop so let's import some libraries: [pandas](https://pandas.pydata.org/), [numpy](https://numpy.org) & [matplotlib](https://matplotlib.org)

We'll be exploring how to do analysis with a riff on a dataset taken from [Kaggle](https://www.kaggle.com/carlolepelaars/toy-dataset/). It has been localized with Canadian cities and shortened a tad. Let's view the [file](canadian_toy_dataset.csv).

The data has 5 columns:
- _City_ is a Canadian city
- _Gender_ is the self-reported gender of the person
- _Age_ is an integer that represents how old the person in the record is
- _Income_ is the annual salary of the person as an integer
- _ill_ is a 'Yes' or 'No' to indicate if the person is suffering from our mystery illness

We want to explore the data to see whether we can predict if a person is sick based on different factors.


To get everything ready we need to run the following cell.

In [9]:
#This line is for Jupyter's benefit
%matplotlib inline

#Load the Library Pandas, that works with data
import pandas as pd

#Load the Library Numpy, that works with numerical calculations
import numpy as np

#Import Matplotlib to graph some results
import matplotlib.pyplot as plt

- **Q3** What types of questions might we want to ask of the data? Provide some ideas in the Zoom chat box.


## Loading the data

We'll load the data into a pandas `dataframe`. A dataframe has a lot of properties we can look into
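
As a rough illustration of the properties a dataframe exposes, here is a minimal sketch on a tiny made-up dataframe. The rows are invented for the example and are not taken from the workshop file; only the column names match.

```python
import pandas as pd

# A tiny made-up dataframe with the same column names as the workshop data
people = pd.DataFrame({
    "city": ["Montreal", "Waterloo", "Toronto"],
    "gender": ["Male", "Female", "Female"],
    "age": [41, 36, 52],
    "income": [40367, 50786, 61200],
    "ill": ["No", "No", "Yes"],
})

print(people.shape)    # (number of rows, number of columns)
print(people.columns)  # the column names
print(people.dtypes)   # the type of data stored in each column
```

The same three properties are available on the `data` dataframe loaded in the next cell.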

In [13]:
#Load the file
data = pd.read_csv("https://brockdsl.github.io/Python_2.0_Workshop/canadian_toy_dataset.csv")

#Tell it what our columns are
data.columns = ["city","gender","age","income","ill"]

#Show the first 10 rows
data.head(10)

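| | city | gender | age | income | ill |
|---|---|---|---|---|---|
| 0 | Montreal | Male | 41 | 40367 | No |
| 1 | Montreal | Male | 54 | 45084 | No |
| 2 | Montreal | Male | 42 | 52483 | No |
| 3 | Montreal | Male | 40 | 40941 | No |
| 4 | Montreal | Male | 46 | 50289 | No |
| 5 | Montreal | Female | 36 | 50786 | No |
| 6 | Montreal | Female | 32 | 33155 | No |
| 7 | Montreal | Male | 39 | 30914 | No |
| 8 | Montreal | Male | 51 | 68667 | No |
| 9 | Montreal | Female | 30 | 50082 | No |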


In [14]:
#A quantitative summary of the dataframe

data.describe()

| | age | income |
|---|---|---|
| count | 150000.0 | 150000.0 |
| mean | 44.9502 | 91252.798273 |
| std | 11.572486 | 24989.500948 |
| min | 25.0 | -654.0 |
| 25% | 35.0 | 80867.75 |
| 50% | 45.0 | 93655.0 |
| 75% | 55.0 | 104519.0 |
| max | 65.0 | 177157.0 |


# Asking questions of the data

Dataframes are great because we can ask for more complicated data and analysis, and they'll do the hard work for us.

## Counting

When we want to find out how many of something are in a dataframe we use the `.count()` method, which counts the non-missing values in each column.

How many records are in the dataset?

In [None]:
data.count()
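
Because `.count()` works column by column and skips missing values, its numbers can differ from the number of rows. The sketch below uses made-up records with one missing income to show the difference; `len()` is the built-in Python function, used here as one possible way to count rows.

```python
import pandas as pd

# Made-up records; the second person's income is missing (None)
people = pd.DataFrame({
    "city": ["Montreal", "Waterloo", "Toronto"],
    "income": [40367, None, 61200],
})

per_column = people.count()   # non-missing values in each column
total_rows = len(people)      # number of rows, missing values or not

print(per_column)
print(total_rows)
```

When no values are missing, every column's count matches the number of rows.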

## Grouping and  Counting

We can also gather the entries we need by grouping them together with the `.groupby()` method. We can chain these together to make longer, more complicated queries of the data.

*NB:* Grouping a dataframe will cause a new index to be applied
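
To make the note about the new index concrete, here is a small sketch on made-up rows. It prints the index after grouping and then uses `.reset_index()`, one possible way to turn the grouped values back into an ordinary column.

```python
import pandas as pd

# Made-up rows, for illustration only
people = pd.DataFrame({
    "city": ["Montreal", "Montreal", "Waterloo"],
    "ill": ["No", "Yes", "No"],
    "age": [41, 54, 36],
})

by_ill = people.groupby("ill").count()
print(by_ill.index)          # the index now holds the "ill" values

flat = by_ill.reset_index()  # turn "ill" back into an ordinary column
print(flat)
```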


How many people are `ill`?

In [None]:
data.groupby("ill").count()

How many people are `Male` in this dataset?

## Grouping and Averaging

If we want to do some math on the data we need to cluster it together a bit. We use `.groupby()` and `.mean()` to accomplish this.


What is the average income of people in `Waterloo`?

In [None]:
data.groupby("city")["income"].mean()
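
The grouped result holds one average per city. If only one city is of interest, it can be looked up by its label with `.loc`. This is a sketch on made-up incomes, not the workshop data.

```python
import pandas as pd

# Made-up incomes, for illustration only
people = pd.DataFrame({
    "city": ["Montreal", "Montreal", "Waterloo", "Waterloo"],
    "income": [40000, 50000, 60000, 80000],
})

mean_income = people.groupby("city")["income"].mean()
print(mean_income)                            # one average per city

waterloo_mean = mean_income.loc["Waterloo"]   # look up a single city by name
print(waterloo_mean)
```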

What is the average age of people in each `city`?

Other useful functions to apply to dataframes:

- `.max()`
- `.min()`

What are the minimum and maximum ages seen in the data?

In [None]:
print(data["age"].max())
print(data["age"].min())

What is the maximum and minimum income seen in the data set?

## Sorting

We can apply sorting to our dataframe actions by adding `.sort_values()` to the end of our count and average statements and telling it what `column` to sort by with the added statement `by = "column"`.
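
One detail worth knowing: when a single column is selected before averaging, the result is a pandas Series rather than a dataframe, and a Series is sorted without the `by` argument. The sketch below uses made-up ages to show this case.

```python
import pandas as pd

# Made-up ages, for illustration only
people = pd.DataFrame({
    "city": ["Montreal", "Montreal", "Waterloo", "Toronto"],
    "age": [41, 54, 36, 29],
})

# Selecting one column before .mean() gives a Series, so no "by" is needed
mean_age = people.groupby("city")["age"].mean()
oldest_first = mean_age.sort_values(ascending=False)
print(oldest_first)
```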

What city has the most `ill` people?

In [None]:
#Keep only the rows where ill is "Yes", then count per city and sort
data[data["ill"] == "Yes"].groupby("city").count().sort_values(by = "ill", ascending = False)

Answer:

What city has the highest average income?

In [None]:
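#numeric_only = True skips the text columns, which recent pandas versions refuse to average
data.groupby("").mean(numeric_only = True).sort_values(by = "", ascending = False) #FIX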

Answer: 

What city has the oldest people?

Answer:

## Unique entries

Here we use `.unique()` to return each distinct value once, in the order it first appears. Results are returned as a NumPy array, which behaves much like a list and is useful for us later.

In [None]:
data["city"].unique()

What are the unique values for the `age` field?

What unique values are in the `income` field?

That's a lot of text on the screen... How can we find the number?
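
One possible approach, sketched on a made-up list of ages, is to count the unique values instead of printing them: either take the length of the `.unique()` result or call `.nunique()`.

```python
import pandas as pd

ages = pd.Series([41, 54, 41, 36, 54, 30])   # made-up ages

distinct = ages.unique()
print(len(distinct))     # length of the array of unique values
print(ages.nunique())    # the same number in one step
```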

## Value Counts

`.value_counts()` tells you how many times each unique value appears in a column.

In [None]:
data["city"].value_counts()
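
If shares are more useful than raw counts, `.value_counts()` accepts `normalize=True`. The sketch below uses a made-up list of cities to show both forms side by side.

```python
import pandas as pd

# Made-up city names, for illustration only
cities = pd.Series(["Montreal", "Waterloo", "Montreal", "Toronto"])

counts = cities.value_counts()                 # how often each city appears
shares = cities.value_counts(normalize=True)   # the same as a fraction of all rows
print(counts)
print(shares)
```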

## Selecting subsets of data

To make life easier we can create dataframes that just have the values we are interested in.
This is a bit more complicated but follows this type of pattern:

`dataframe[search criteria]`

We are selecting a subset of the dataframe's rows with `search criteria`,
which can be any conditional expression on a column, such as `dataframe["column"] > value`.


E.g. People with an income over $100000

In [None]:
data[data["income"] > 100000]
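
Search criteria can also be combined. The sketch below, on made-up rows, uses `&` for "and" and `|` for "or". Each condition needs its own parentheses. The variable names are invented for the example.

```python
import pandas as pd

# Made-up rows, for illustration only
people = pd.DataFrame({
    "city": ["Montreal", "Waterloo", "Waterloo", "Toronto"],
    "age": [41, 36, 58, 29],
    "income": [40367, 120500, 98000, 101000],
})

# Wrap each condition in parentheses; & means "and", | means "or"
high_earners_over_30 = people[(people["income"] > 100000) & (people["age"] > 30)]
waterloo_or_toronto = people[(people["city"] == "Waterloo") | (people["city"] == "Toronto")]

print(high_earners_over_30)
print(waterloo_or_toronto)
```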

If we want we can put this selection in a new dataframe.

E.g. To make a dataframe of only those that are `ill`, we'd do the following:

In [None]:
ill_people = data[data['ill'] == "Yes"]
ill_people

Can you make a new dataframe that just has people from `Waterloo` in it? Display the first 5 entries.

In [None]:
waterloo_people = 


# Some questions now

With our `ill_people` dataframe, how can we find out how many people are `ill` in each city?

*Hint:* we can use `.groupby()` and `.count()` to do this

Can we sort our previous results? We can use `.sort_values()` to accomplish this.

What percentage of people in the dataset have a salary over $100000?

What is the average age of people that are ill?

What is the average salary of those that are ill?

What is the average salary of those that are not ill?

# Graphing Results

We can use the `matplotlib` library to generate some graphs of our results. We'll give the graphing functions their data as lists.

## Pie Graphs
Let's draw a pie graph of the number of people that are `ill` as a proportion of everyone.

In [None]:

total_ill_people = ill_people.count()['ill']
total_people = data.count()['ill']


# Matplotlib accepts a list of slice sizes, so we'll make one.
# The slices are "ill" and "not ill", so together they add up to everyone.
pie_data = [total_ill_people, total_people - total_ill_people]


plt.pie(pie_data)
plt.show()

Say we want to add some more details to our pie graph, like `labels` and a `title`

In [None]:
pie_data = [total_ill_people, total_people - total_ill_people]
labels = ["Ill People","People Who Are Not Ill"]

plt.pie(pie_data,labels=labels)
plt.title("Proportion of people who are ill")
plt.show()
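
Labels name the slices but do not show their size. As a sketch, `plt.pie` also accepts an `autopct` format string that prints each slice's percentage. The counts below are made up for illustration and are not the workshop results.

```python
import matplotlib.pyplot as plt

# Made-up counts, for illustration only
pie_data = [30, 70]
labels = ["Ill", "Not ill"]

# autopct formats each slice's share; "%1.1f%%" shows one decimal place
wedges, texts, autotexts = plt.pie(pie_data, labels=labels, autopct="%1.1f%%")
plt.title("Share of people who are ill (made-up numbers)")
plt.show()
```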

Can you create a pie graph that shows the gender balance of who is `ill`?

In [None]:
# Hint: you'll need to select and count

females_ill = #FILL
males_ill = #FILL

pie_data = [females_ill,males_ill]
labels = ["Females","Males"]

plt.pie(pie_data,labels=labels)
plt.title("Gender distribution of those ill")
plt.show()

## Automatic Histograms


Say we wanted to plot out the income distribution of our data set as a [histogram](https://en.wikipedia.org/wiki/Histogram).

In [None]:
# this represents how many pieces we want to chop our series into; more bins give a higher resolution
bins = 250

plt.hist(data["income"],bins)
plt.show()

Let's add some labels to our axes

In [None]:
# this represents how many pieces we want to chop our series into; more bins give a higher resolution
bins = 250

plt.hist(data["income"],bins)

plt.title("Income distribution")
plt.xlabel("Income")

plt.show()
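
To see what the bins actually contain, it can help to look at what `plt.hist` returns: the count in each bin and the bin edges. The sketch below uses a short made-up list of ages and also adds a label for the vertical axis with `plt.ylabel`.

```python
import matplotlib.pyplot as plt

ages = [25, 27, 31, 31, 38, 44, 44, 44, 52, 65]   # made-up ages

# plt.hist returns the count in each bin and the bin edges
counts, edges, _ = plt.hist(ages, bins=4)
plt.title("Age distribution (made-up numbers)")
plt.xlabel("Age")
plt.ylabel("Number of people")
plt.show()

print(counts)
print(edges)
```

With 4 bins there are 5 edges, and the counts add up to the number of values plotted.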

Can you draw a histogram of the `age` distribution? Make sure to give it a `title` and a good `xlabel` and use an appropriate number of bins

In [None]:
bins = #FILL

plt.hist() #FILL
plt.title() #FILL
plt.xlabel() #FILL

plt.show()

# Congrats!

You now know a bit about Python libraries and using advanced features of the language.
