# Introduction to Statistics Part II


Now that we have learned how to use the mean and median, we'll talk about some more advanced statistics.

In [None]:
# import pandas and numpy



## Count Statistics

*Count variables* represent the number of times an event of a specific category occurs. This can be anything, like the number of dogs in a park or how many people went to a concert. In both of these examples, each count must be a *whole number*, i.e. an `int` data type.

Run the cell below to load a listing of the weather in Detroit for every day since 1950:

In [None]:
data_table = pd.read_csv('../SampleData/detroit_weather.csv') # Data from Mathematica WeatherData, 2019

Take a look at the contents of `data_table`:

In [None]:
# Run head on data_table to look at its contents



In [None]:
# Run tail on data_table to look at its contents



This table records whether it snowed and whether it rained on each day in Detroit since 1950. We will use it as an example dataset.

In [None]:
# Lookup the weather for May 1, 2019:


In [None]:
# another way to do the same thing is to chain together multiple calls to query
data_table.query('YEAR == 2019').query('MONTH == 5').query('DAY == 1')

As we can see, it was raining, but not snowing that day!
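
The row lookup above can be written in more than one way. The sketch below uses a small made-up table rather than the Detroit data: the `YEAR`, `MONTH`, and `DAY` columns match the query shown earlier, while the `SNOW` and `RAIN` column names and every value are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the weather table. SNOW and RAIN are hypothetical
# column names, and all values here are made up.
toy = pd.DataFrame({
    "YEAR":  [2019, 2019, 2018],
    "MONTH": [5, 5, 5],
    "DAY":   [1, 2, 1],
    "SNOW":  [False, False, True],
    "RAIN":  [True, False, False],
})

# One query string joined with `and` selects the same rows as three chained calls.
combined = toy.query("YEAR == 2019 and MONTH == 5 and DAY == 1")
chained = toy.query("YEAR == 2019").query("MONTH == 5").query("DAY == 1")

# A boolean mask is a third equivalent way to select the row.
mask = (toy["YEAR"] == 2019) & (toy["MONTH"] == 5) & (toy["DAY"] == 1)
masked = toy[mask]

print(combined)
```

All three approaches return the same single row, so which one to use is mostly a matter of readability.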

Now, let's create some count statistics! To do this, we will use the `Counter` **class** from the `collections` **module**. Let's import it!

In [None]:
# Import the Counter class from collections to help us do the counting

from collections import Counter

`Counter` summarizes any `list` with the counts of each of its unique values:

In [None]:
# Create a list and count it using Counter

Counter([1,1,1,1,1,2,2,2,2,2])

Now, let's count the weather data!

In [None]:
# Count how many days it has snowed in Detroit since 1950:



It looks like it has snowed on 4,235 days in that time period. That is a lot!

This `Counter` variable functions a lot like a dictionary object. We haven't talked about this data type in this course, but essentially it's a way of mapping **keys** to **values**. We can access the **value** associated with each **key** in much the same way that we index lists. For example, if we wanted to get the total number of snow days in our data set:
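
To make the key-based lookup concrete, here is a minimal sketch on a made-up list of snow flags (not the Detroit data). The name `toy_snow` and its values are assumptions for illustration.

```python
from collections import Counter

# Hypothetical snow flags for six days (made-up values).
toy_snow = [True, False, False, True, False, False]

toy_counts = Counter(toy_snow)

# Look up a count by its key, just like a dictionary.
print(toy_counts[True])    # number of True entries
print(toy_counts[False])   # number of False entries

# A key that never appeared returns 0 instead of raising a KeyError.
print(toy_counts["hail"])
```

Unlike a plain dictionary, a `Counter` returns 0 for a missing key, which is convenient when a category might never occur.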

In [None]:
# snow days where value == True



Let's break this down a little more granularly. How are these 4,235 total snow days distributed across our 12 months?

In [None]:
# Count how many days *per month* it has snowed since 1950 and print



What about the number of days per month that it *has not* snowed?
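
One possible way to count per month is to filter the rows first and then pass the month column to `Counter`. The sketch below runs on a small made-up table. `SNOW` is a hypothetical column name, so the real table's column may be called something else.

```python
from collections import Counter

import pandas as pd

# Toy table. SNOW is a hypothetical column name and the values are made up.
toy = pd.DataFrame({
    "MONTH": [1, 1, 1, 6, 6, 12],
    "SNOW":  [True, True, False, False, False, True],
})

# Keep only the snowy rows, then count how often each month appears.
toy_snow_by_month = Counter(toy[toy["SNOW"]]["MONTH"])

# Invert the boolean column with ~ to count the days without snow.
toy_clear_by_month = Counter(toy[~toy["SNOW"]]["MONTH"])

print(toy_snow_by_month)
print(toy_clear_by_month)
```

Months with no matching rows simply do not appear as keys, and looking them up returns 0.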

In [None]:
# How many days *per month* has it NOT snowed since 1950 and print?



## Percentages

A *percentage* represents the fraction of a given variable that meets a given condition. In this lesson we write it as a number between 0 and 1; multiply by 100 to express it as a percent. For example, if there are 28 dogs and 45 cats at the humane society, the percentage of adoptable animals that are dogs is:

In [None]:
28/(28 + 45)

First, let's calculate the percentage of all days since 1950 that have been snow days using the variable `snow_days` from above.
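
As a rough sketch of the arithmetic, the example below divides one count by the sum of all counts in a `Counter`. The counts are made up, and `toy_snow_days` is a hypothetical stand-in, not the real `snow_days` variable.

```python
from collections import Counter

# Made-up counts standing in for a real snow-day Counter.
toy_snow_days = Counter({True: 3, False: 7})

# The total number of observations is the sum of every count.
toy_total = sum(toy_snow_days.values())

# Fraction of days with snow, as a number between 0 and 1.
toy_fraction = toy_snow_days[True] / toy_total

# Format the fraction as a percent for display.
print(f"{toy_fraction:.1%}")
```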

In [None]:
# percentage of all days since 1950 that have been snow days


Now, let's calculate the percent of January days that have had snow since 1950. To do this, we first need the total number of January days since 1950.

In [None]:
# How many days TOTAL have there been in each month since 1950?



Now let's use the `snow_days_by_month` and `days_by_month` variables to isolate the **values** associated with the **key** for January to calculate the percentage.
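
The per-month calculation divides two values that share the same key. The sketch below uses made-up counts and assumes the keys are month numbers (1 for January, 6 for June), as in the `MONTH == 5` query earlier. The `toy_` names are hypothetical stand-ins for the real variables.

```python
from collections import Counter

# Made-up monthly counts. Keys are month numbers (1 = January, 6 = June).
toy_snow_days_by_month = Counter({1: 16, 6: 0, 12: 12})
toy_days_by_month = Counter({1: 31, 6: 30, 12: 31})

# Divide the snow-day count for one key by the total-day count for the same key.
january_fraction = toy_snow_days_by_month[1] / toy_days_by_month[1]
print(january_fraction)

# The same division for every month at once.
fraction_by_month = {
    month: toy_snow_days_by_month[month] / total
    for month, total in toy_days_by_month.items()
}
print(fraction_by_month)
```

Looping over the totals rather than the snow counts avoids skipping months in which it never snowed.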

In [None]:
# Find the percentage of days in January where it snowed:



A percentage of 51% means that about half of the January days since 1950 have seen snowfall.

Now let's do the same for June.

In [None]:
# Now do the same for June:




It shouldn't come as much of a surprise that it doesn't snow much in summer!

In this lesson you learned how to:

* Calculate count statistics using data from `pandas`
* Calculate percentages from count statistics
     
Now, let's continue to practice with your partner!