![DSL_logo](dsl_logo.png)


# Python 2.0!

Welcome to the Digital Scholarship Lab Level 2 Python workshop. Before proceeding, please make sure you've completed [part 1](https://brockdsl.github.io/Intro_to_Python_Workshop/), which covers:
- variables
- math
- conditionals
- loops
- functions


What we'll learn today is:
- importing libraries
- analyzing data with pandas
- plotting data with matplotlib


We'll be using Python as a data analysis tool.
The [Kaggle](https://kaggle.com) website works the same way, with notebooks like this one.



Before we get going, the next cell should look totally familiar to you.

In [None]:
scores = [3,5,6,2,1,6]

def find_mean(scores):
    
    # "total" avoids shadowing Python's built-in sum() function
    total = 0
    for s in scores:
        total = total + s
        
    return total/len(scores)


print(find_mean(scores))

----

## Importing Libraries

- Our end goal is to re-use as much code as possible
- To do this we load in different Libraries using the `import` command
- For this example we want to load in the [statistics](https://docs.python.org/3/library/statistics.html) library


In [None]:
import statistics

print(statistics.mean(scores))
print(statistics.median(scores))
print(statistics.mode(scores))
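
The `import` command comes in a few forms. The sketch below is an illustration rather than part of the exercise: it reuses the same `scores` list, and the alias `st` is just a short name picked for this example.

```python
import statistics                # plain import: call functions as statistics.name()
import statistics as st          # alias: "st" is an arbitrary short name chosen for this sketch
from statistics import median    # import a single function by name

scores = [3, 5, 6, 2, 1, 6]

# All three forms reach the same functions
print(statistics.mean(scores))
print(st.mode(scores))
print(median(scores))
```

The alias form is the one used for `pandas` and `numpy` later in this workshop.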


- Try Q1 - Q2 below and type "Got it" in the chat when you are done.

- **Q1** How would we use the [math](https://docs.python.org/3/library/math.html) library to find the square root of the variable `number` below?

In [None]:
import 

number = 81

print(number)

- **Q2** `str` is a built-in type, so its methods are always available without an `import`. Try to print the contents of the variable `all_caps` to the screen in lower case letters.

Details on the related [string](https://docs.python.org/3/library/string.html) library.

In [None]:
all_caps = "HELLO"
print()


# EXERCISE: Analyzing Data

![sick](https://upload.wikimedia.org/wikipedia/commons/9/97/Caladrius2.jpg)

We'll be focusing on data analysis for the rest of this workshop so let's import some libraries: [pandas](https://pandas.pydata.org/), [numpy](https://numpy.org) & [matplotlib](https://matplotlib.org)

We'll be exploring how to do analysis with a variation on a dataset taken from [Kaggle](https://www.kaggle.com/carlolepelaars/toy-dataset/). It has been localized with Canadian cities and shortened a tad. Let's view the [file](canadian_toy_dataset.csv)

---

![excel preview](https://raw.githubusercontent.com/BrockDSL/Python_2.0_Workshop/master/data_in_excel.png)

You could do some of this analysis in Excel, true, but a large dataset quickly becomes difficult to work with there.


The data has 5 columns:
- _City_ is a Canadian city
- _Gender_ is the self-reported gender of the person
- _Age_ is an integer that represents how old the person in the record is
- _Income_ is the annual salary of the person as an integer
- _ill_ is a 'Yes' or 'No' to indicate whether the person is suffering from our mystery illness



We want to explore the data to see if we can pick up any insights about who is ill and who is not ill. In part 3 of this workshop we will use machine learning to see if we can make predictions with the data.


- **Q3** What types of questions might we want to ask of the data? Provide some ideas in the Zoom chat box

## Loading the Libraries

To get everything ready we need to run the following cell

In [None]:
#Load the Library Pandas, that works with data
import pandas as pd

#Load the Library Numpy, that works with numerical calculations
import numpy as np

#These two libraries are often used together!

## Loading the data

- We'll load the data into a pandas `dataframe`. ([More Info](https://realpython.com/pandas-dataframe/)) A dataframe has a lot of properties we can use.
- It is a more complex type of variable. Think of the `dictionary` we looked at last time: each column can be looked up by its name.
- This data is complete, so we don't need to worry about incomplete rows in our observations
- We'll take a look at the first 10 lines of the dataset

In [None]:
#Load the file into a dataframe using the pandas read_csv function
data = pd.read_csv("https://brockdsl.github.io/Python_2.0_Workshop/canadian_toy_dataset.csv")

#Tell it what our columns are by passing along a list of that information
data.columns = ["city","gender","age","income","ill"]

#Show the first 10 lines
data.head(10)
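
To make the dictionary comparison concrete, here is a minimal sketch that builds a tiny dataframe by hand. The three rows are invented for illustration and are not taken from the workshop file; only the column names match.

```python
import pandas as pd

# Hypothetical rows, invented for illustration; they are not taken from the workshop CSV
sample = pd.DataFrame({
    "city": ["Waterloo", "Toronto", "Waterloo"],
    "gender": ["Female", "Male", "Male"],
    "age": [34, 51, 27],
    "income": [52000, 88000, 104000],
    "ill": ["No", "Yes", "No"],
})

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # the column names
print(sample.dtypes)         # the type pandas inferred for each column
print(sample["age"])         # a column is selected by name, like a dictionary key
```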

Pandas can provide us some nice quantitative details about our data by calling the `describe()` function

In [None]:
data.describe()
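
By default `describe()` only summarizes the numeric columns. The sketch below uses a small made-up dataframe (not the workshop data) to show the default next to `include="all"`, which also reports on text columns.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
sample = pd.DataFrame({
    "city": ["Waterloo", "Toronto", "Waterloo", "Ottawa"],
    "age": [34, 51, 27, 40],
    "income": [52000, 88000, 104000, 76000],
})

numeric_summary = sample.describe()            # numeric columns only: age and income
full_summary = sample.describe(include="all")  # also summarizes the text column "city"

print(numeric_summary)
print(full_summary)
```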

## Grouping and Counting

- We gather the entries we need by grouping them together with the `.groupby()` function. We can chain these calls together to ask very specific questions of the data.
- We pass the column we'd like to group the data by
- We add `.count()` if we are just interested in the counts and not the dataframe


How many people are `ill`?

In [None]:
data.groupby("ill")

In [None]:
data.groupby("ill").count()
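
Notice that `.count()` reports a count for every remaining column. As a side note, the sketch below uses made-up rows to compare it with `.size()`, which returns a single number per group. This is offered as an alternative for illustration, not as the workshop's required approach.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
sample = pd.DataFrame({
    "city": ["Waterloo", "Toronto", "Waterloo", "Ottawa", "Toronto"],
    "ill": ["No", "Yes", "No", "No", "Yes"],
})

counts_per_column = sample.groupby("ill").count()  # DataFrame: one count per remaining column
rows_per_group = sample.groupby("ill").size()      # Series: one number per group

print(counts_per_column)
print(rows_per_group)
```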

Try questions Q4 & Q5 below and type "Finished!" in the chat box when you are done

- **Q4** How many people are `Male` in this dataset?

- **Q5** How many different cities are in the dataset?

## Grouping and applying functions

- If we want to do some math on the data we need to group it first. We use `.groupby()` and then apply our mathematical functions to the result
- Here we'll use the following 3 functions:
 - `mean()` finds the arithmetic mean of the data
 - `max()` finds the largest value in that column
 - `min()` finds the smallest value in that column


What is the average income of people in `Waterloo`?

In [None]:
data.groupby("city")["income"].mean()
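
The result above has one row per city. The sketch below, on made-up rows, shows one way to pick a single city out of that result with `.loc`, and how `.agg()` can compute several summaries at once. The variable names are chosen for this example only.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
sample = pd.DataFrame({
    "city": ["Waterloo", "Toronto", "Waterloo", "Toronto"],
    "income": [50000, 80000, 70000, 90000],
})

mean_income = sample.groupby("city")["income"].mean()
print(mean_income)                  # one mean per city
print(mean_income.loc["Waterloo"])  # pick out a single city by its label

# Several summaries at once with .agg()
summary = sample.groupby("city")["income"].agg(["min", "max", "mean"])
print(summary)
```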

Try questions Q6-Q8 and type "All done" into the chat when you are finished

- **Q6** What is the average age of people in each `city`?

- **Q7** What is the minimum and maximum age seen in the data?

- **Q8** What is the maximum and minimum income seen in the data set?

## Sorting

- We can sort our dataframe results by using the function `.sort_values()`
- We tell it which column to sort on with `by =`
- Sorting is ascending (smallest first) by default, so we pass `ascending = False` to display the largest values first

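Which city has the most people in the dataset? Because `.count()` counts every non-empty entry, sorting on the `ill` column ranks the cities by their total number of records, not by how many people answered 'Yes'. Here we do it in two steps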

In [None]:
by_city = data.groupby("city").count()

sorted_city = by_city.sort_values(by = "ill",ascending = False)

sorted_city

We could also do it in one step:

In [None]:
data.groupby("city").count().sort_values(by = "ill",ascending = False)
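
To see what the `ascending` parameter does, here is a minimal sketch on three made-up rows, sorted both ways.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
sample = pd.DataFrame({
    "city": ["Waterloo", "Toronto", "Ottawa"],
    "income": [70000, 90000, 50000],
})

low_to_high = sample.sort_values(by="income")                   # ascending=True is the default
high_to_low = sample.sort_values(by="income", ascending=False)  # largest values first

print(low_to_high)
print(high_to_low)
```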

Try questions Q9 - Q10 and type "Finished" in the chat when you are done

- **Q9** What city has the highest average income?

In [None]:
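#numeric_only = True skips the text columns, which recent versions of pandas refuse to average
data.groupby("").mean(numeric_only = True).sort_values(by = "", ascending = False)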

Answer: 

- **Q10** What city has the oldest people?

Answer:

## Unique entries & values counts

- Here we use `.unique()` to list each distinct value once, in the order it first appears. Results are returned as an array, which we can loop over like a list and which is useful for us later
- This is useful for seeing which values appear in a categorical column

In [None]:
data["city"].unique()

What are unique values for the `age` field?

In [None]:
data["age"].unique()

- To get the frequency of each unique value in the data we use `value_counts()`

In [None]:
data["city"].value_counts()
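
The sketch below puts the three related calls side by side on a short made-up column: `.unique()` for the distinct values, `.nunique()` for how many there are, and `.value_counts()` for how often each one appears.

```python
import pandas as pd

# Hypothetical values, invented for illustration
cities = pd.Series(["Waterloo", "Toronto", "Waterloo", "Ottawa", "Waterloo", "Toronto"])

distinct = cities.unique()         # an array of distinct values, in order of first appearance
how_many = cities.nunique()        # the number of distinct values
frequency = cities.value_counts()  # a Series sorted from most to least frequent

print(list(distinct))  # convert to a plain Python list if needed
print(how_many)
print(frequency)
```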

## Selecting subsets of data

- To make life easier we can create dataframes that just have the values we are interested in
- This is a bit more complicated but follows this type of pattern:

```
dataframe[dataframe[search criteria]]
```

- We are basically creating a subset of the dataframe by matching all entries that match `search criteria`
- That search criteria can be anything that is a conditional
- Doing this gives you a new dataframe

E.g. a new dataframe of people with an income over $100000

In [None]:
over_100k = data[data["income"] > 100000]
print(over_100k)

E.g. if we want the count of people with an income over $100000, we apply the `.count()` function to what we selected

In [None]:
over_100k.count()

This can be done in 1 line as well

In [None]:
data[data["income"] > 100000].count()
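
It can help to look at the search criteria on their own. In the sketch below (made-up rows, names chosen for illustration) the condition is first stored in a variable called `mask`, which holds one `True` or `False` per row, and is then used to select rows. The last step combines two conditions.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
sample = pd.DataFrame({
    "city": ["Waterloo", "Toronto", "Waterloo", "Ottawa"],
    "income": [120000, 88000, 104000, 76000],
})

mask = sample["income"] > 100000  # a Series of True/False values, one per row
print(mask)

over_100k = sample[mask]          # keeps only the rows where mask is True
print(over_100k)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
toronto_over_80k = sample[(sample["income"] > 80000) & (sample["city"] == "Toronto")]
print(len(toronto_over_80k))
```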

Try Q11 below and type "I got it" into the chat when you are done

- **Q11** Can you make a new dataframe that just has people from `Waterloo` in it? Display the first 5 entries.

In [None]:
waterloo_people = 

# Some questions now

Let's first make a dataframe of all of the ill people

In [None]:
ill_people = data[data["ill"] == "Yes"]
ill_people

Try answering Q12 - Q15, type "Finished" into the chat when you are done

- **Q12** How can we sort our `ill_people` dataframe?

- **Q13** What percentage of people in the `ill_people` dataset have a salary over $100000?

- **Q14** What is the average age of people in the `ill_people` dataset?

- **Q15** What is the average salary of those in the `ill_people` dataset?

# Another Library, MatplotLib

If we have time, let's take a look at graphing our results

We can use the `matplotlib` library to generate some graphs of our results. We'll always pass lists as parameters to the graphing functions.


In [None]:
#This line is for Jupyter's benefit
%matplotlib inline
#Import Matplotlib to graph some results
import matplotlib.pyplot as plt

Let's reload our data into a new dataframe

In [None]:
#Load the file
graph_data = pd.read_csv("https://brockdsl.github.io/Python_2.0_Workshop/canadian_toy_dataset.csv")

#Tell it what our columns are
graph_data.columns = ["city","gender","age","income","ill"]

## Pie Graphs
Let's draw a pie graph of the number of people that are `ill` as a proportion of everyone

In [None]:
#All of the ill people
total_ill = graph_data[graph_data["ill"] == "Yes"]["ill"].count()
#print(total_ill)

#All the people in the dataset
total_people = graph_data.count()['ill']
#print(total_people)

#Everyone who is not ill: the pie slices must be separate groups that add up to the whole
total_not_ill = total_people - total_ill

# Matplotlib always wants data in a list, so we'll make one
pie_data = [total_ill,total_not_ill]
pie_labels = ["Ill", "Not Ill"]
plt.pie(pie_data,labels=pie_labels)

plt.show()
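
A pie graph expects slices that are separate groups adding up to the whole. As an alternative sketch on made-up data, `value_counts()` produces exactly that kind of split in one step, and the `autopct` parameter of `plt.pie` prints the percentage on each slice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical answers, invented for illustration
ill = pd.Series(["Yes", "No", "No", "No", "Yes", "No", "No", "No"])

counts = ill.value_counts()  # one count per distinct answer

# Convert to plain lists before handing the data to Matplotlib
plt.pie(counts.tolist(), labels=list(counts.index), autopct="%1.1f%%")
plt.title("Ill vs. not ill (made-up sample)")

plt.show()
```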

Try questions Q16  - Q17 and type "Completed" in the chat when you're done.

- **Q16** Can you create a pie graph that shows the gender distribution in the data?

In [None]:
#Fill in the following
females = graph_data[graph_data[""] ==""]["gender"].count()

#Fill in the following
males = graph_data[graph_data[""] ==""]["gender"].count()

pie_data = [females,males]
pie_labels = ["Females","Males"]
plt.pie(pie_data,labels=pie_labels)

plt.show()

- **Q17** Can you create a pie graph that shows how many people in the dataset make over 100000 in annual income?

In [None]:
#Fill in the following
over_100k = graph_data[graph_data[""]  100000]["income"].count()

#Fill in the following
under_100k = graph_data[graph_data[""]  100000]["income"].count()

pie_data = [over_100k, under_100k]
pie_labels = ["Over 100k","Under 100k"]
plt.pie(pie_data,labels=pie_labels)

plt.show()

## Automatic Histograms


Say we wanted to plot out the income distribution of our data set as a [histogram](https://en.wikipedia.org/wiki/Histogram)

In [None]:
# bins is the number of containers we'll split our x-axis values into
bins = 250

plt.hist(graph_data["income"],bins)

plt.title("Income distribution")
plt.xlabel("Income")
plt.ylabel("Occurrences")

plt.show()
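
To get a feel for what `bins` does, the sketch below uses NumPy's `histogram` function on ten made-up ages to print the bin edges and the count in each bin, then draws the same thing with `plt.hist`. The ages and the choice of 5 bins are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ages, invented for illustration
ages = [22, 25, 31, 38, 41, 44, 47, 52, 58, 63]

# np.histogram does the counting without drawing anything
counts, edges = np.histogram(ages, bins=5)
print(edges)   # 5 bins are defined by 6 edges, evenly spaced from the minimum to the maximum
print(counts)  # how many ages fall into each bin

plt.hist(ages, bins=5)
plt.title("Age distribution (made-up sample)")
plt.xlabel("Age")
plt.ylabel("Occurrences")

plt.show()
```

Too few bins hide the shape of the data, while too many leave most bins nearly empty, so it is worth trying a few values.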

Try Q18 below and type "All done!" in the chat when you're done

**Q18** Can you draw a histogram of the `age` distribution? Make sure to give it a `title` and other descriptive text and use an appropriate number of bins. (The example above should help you)

In [None]:
bins = #FILL

plt.hist() #FILL
plt.title() #FILL
plt.xlabel() #FILL
plt.ylabel() #FILL

plt.show()

# Congrats!

You now know a bit about Python libraries and using advanced features of the language. Try adding new cells to this page and asking yourself more questions.


## Further Reading

- Now that we've handled the basics, here are some interesting next steps you can pursue.

[Kaggle](https://www.kaggle.com/) - An online portal that teaches data science using Notebooks, also has contests for cash prizes

[Python the Hard Way](https://learntocodetogether.com/learn-python-the-hard-way-free-ebook-download/) - Don't let the name fool you, this book is a great introduction to Python and programming more generally

[Data Analysis with Python and Sci Hub](https://brockdsl.github.io/SciHub_Workshop/) - A tutorial on using Python to analyze Sci-Hub data. Similar to what we saw today, but with real data.

[Thinking in Pandas](https://www.apress.com/gp/book/9781484258385) - A short book that looks at how to use Pandas for analysis.