<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Intro to Python: Files, CSV Library, numpy, and distributions
Week 2 | Day 1

---

### LEARNING OBJECTIVES

- Understand the measures of Central Tendency (mean, median, and mode)
- Understand how mean, median and mode are affected by skewness in data
- Understand measures of variability (variance and standard deviation)
- Intro to files & csv library
- Read a csv file using pandas
- Viewing data: head, columns, values, describe
- Selection: a single column, slicing by row, by position
- Perform boolean indexing on dataframes



# Introduction: Stats review (5 mins)
There are two main fields of statistics: **descriptive** and **inferential**.

Right now, we're going to focus on descriptive statistics: describing, summarizing, and understanding data.

Our focus today is on the **Measures of Central Tendency**. Measures of central tendency provide descriptive information about the single numerical value that is considered most typical of the values of a quantitative variable.

That may sound complicated, but you're probably already familiar with some measures of central tendency: the **mean**, **median**, and **mode**.

We'll also discuss **skewness**, which is the lack of symmetry in a distribution of data and which affects the mean, median, and mode.

Lastly we'll take a look at measures of variability, namely the **range**, **variance**, and **standard deviation**.

NumPy has functions to calculate all of these, but before we let NumPy do the work, it's important to understand the fundamental concepts.



# Guided Practice: Mean, median, and mode (20 mins)
The mean is the sum of the numbers divided by the length of the list.

**Check**: Find the **mean** of this list using Python:



In [2]:
n = [1,2,3,4,5]
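
One possible way to work this out by hand, before reaching for NumPy, is sketched below. It assumes Python 3, where `/` always performs true division; the name `mean_n` is only illustrative.

```python
# Mean = sum of the values divided by how many values there are.
n = [1, 2, 3, 4, 5]

mean_n = sum(n) / len(n)
print(mean_n)  # 3.0
```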



---
#### Median
- For odd-length lists: the median is the middle number of the ordered list.
- For even-length lists: the median is the average of the two middle numbers of the ordered list.

**Check**: Find the **median** of each list using Python:


In [3]:
n_odd = [1,5,9,2,8,3,10,15,7]
n_even = [8,2,3,1,0,-1,-5,20]
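
The two rules above can be sketched as a small helper. The function name `median` is an illustrative choice, not something defined elsewhere in this lesson, and the sketch assumes Python 3 division.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)      # the rules apply to the ordered list
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # odd length: the single middle number
        return ordered[mid]
    # even length: average of the two middle numbers
    return (ordered[mid - 1] + ordered[mid]) / 2


n_odd = [1, 5, 9, 2, 8, 3, 10, 15, 7]
n_even = [8, 2, 3, 1, 0, -1, -5, 20]

print(median(n_odd))   # 7
print(median(n_even))  # 1.5
```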


---
#### Mode
The mode is the most frequently occurring number.

Finding the **mode** is not as trivial as the mean or median, so here it is calculated using `scipy.stats.mode()`.

**Note**: doing this without `scipy.stats.mode()` is a challenge problem in the independent practice section.



In [1]:

n = [0,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5]
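
A minimal sketch of the `scipy.stats.mode()` call is shown below, assuming SciPy is installed. Depending on the SciPy release, the result fields are either scalars or length-1 arrays, so the sketch wraps them in `np.atleast_1d` to cope with both. Note that 2 and 5 both occur four times in this list; `scipy.stats.mode()` reports only one of them.

```python
import numpy as np
from scipy import stats

n = [0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5]

result = stats.mode(n)

# Normalise to plain Python ints regardless of the SciPy version in use.
mode_value = int(np.atleast_1d(result.mode)[0])
mode_count = int(np.atleast_1d(result.count)[0])

print(mode_value, mode_count)  # 2 4
```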


In [37]:
# calculate mean and median
A = [10,10,50,90,90]
B = [40,40,50,60,60]
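
One way to fill in this cell is sketched below with NumPy. Both lists turn out to share the same mean and median even though `A` is far more spread out than `B`, which is why measures of variability come up later in this lesson.

```python
import numpy as np

A = [10, 10, 50, 90, 90]
B = [40, 40, 50, 60, 60]

mean_a, median_a = np.mean(A), np.median(A)
mean_b, median_b = np.mean(B), np.median(B)

print(mean_a, median_a)  # 50.0 50.0
print(mean_b, median_b)  # 50.0 50.0
```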



# Guided Practice: Skewness (20 mins)
**Skewness** is lack of symmetry in a distribution of data.

![Skewness](images/skewness.png)

A **positive-skewed** distribution means the right side tail of the distribution is longer or fatter than the left.

Likewise a **negative-skewed** distribution means the left side tail is longer or fatter than the right.

Symmetric distributions have no skewness!

---
**Skewness and measures of central tendency**

The mean, median, and mode are affected by skewness.

When a distribution is symmetrical and has a single peak, the mean, median, and mode are the same number.

When a distribution is negatively skewed, the mean is less than the median, which is less than the mode.

**Negative skew: mean < median < mode**

When a distribution is positively skewed, the mean is greater than the median, which is greater than the mode!

**Positive skew: mode < median < mean**

This rule of thumb can help you, especially if you can't plot the data. Comparing the mean and the median is usually enough. Nice!

1. If the mean < median, the data are skewed left.
2. If the mean > median, the data are skewed right.

**Check**: Using this information, does the list of numbers form a symmetric distribution? Is it skewed left or right?

In [4]:
n = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 18, 19, 20, 21, 22, 23, 24,
     22, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 18, 19, 20, 21, 22, 23,
     24, 20, 21, 22, 23, 24, 22, 16, 17, 18, 19, 20, 21, 22, 23, 24, 0, 1, 2, 3, 4, 5, 6, 7,
     8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 0, 1, 2,
     3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
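
A sketch of the mean-versus-median comparison for this list, using NumPy. The variable names are illustrative, and the comparison is only the rule of thumb from above, not a formal skewness statistic.

```python
import numpy as np

n = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 18, 19, 20, 21, 22, 23, 24,
     22, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 18, 19, 20, 21, 22, 23,
     24, 20, 21, 22, 23, 24, 22, 16, 17, 18, 19, 20, 21, 22, 23, 24, 0, 1, 2, 3, 4, 5, 6, 7,
     8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 0, 1, 2,
     3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]

mean_n = np.mean(n)      # roughly 16.59
median_n = np.median(n)  # 18.0

# Apply the rule of thumb: compare the mean with the median.
if mean_n < median_n:
    direction = "skewed left"
elif mean_n > median_n:
    direction = "skewed right"
else:
    direction = "roughly symmetric"

print(mean_n, median_n, direction)
```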




# Guided Practice: Range, Variance and Standard Deviation (20 mins)

Measures of variability like the **range**, **variance**, and **standard deviation** tell you about the spread of your data.

These measurements give complementary (and no less important!) information to the measures of central tendency (mean, median, mode).



---
#### Range

The **range** is the difference between the lowest and highest values of a distribution.

Calculate the range of the list of numbers below.

In [5]:
n = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100, 356, 4, 23, 55, 327, 64, 6, 20]


In [6]:
# Range code here
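
One possible version of the range calculation, using only built-ins. The name `data_range` is illustrative and chosen to avoid shadowing Python's built-in `range`.

```python
n = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100, 356, 4, 23, 55, 327, 64, 6, 20]

# Range = highest value minus lowest value.
data_range = max(n) - min(n)
print(data_range)  # 354
```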


---
#### Variance

The **variance** is a numeric value used to describe how widely the numbers in a distribution vary.

![Variance](images/variance.png)

Calculate the variance of the list as well.

In [7]:
marks = [13, 17, 19, 20, 23, 24]


In [9]:
# Variance code here



The variance is **the average of the squared distances of each number from the mean of the numbers.**
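
That definition can be sketched directly and checked against NumPy. This assumes the population variance (dividing by the number of values), which is what `np.var` computes by default with `ddof=0`; the sample variance would divide by one less.

```python
import numpy as np

marks = [13, 17, 19, 20, 23, 24]

# Step 1: the mean of the numbers.
mean_marks = sum(marks) / len(marks)

# Step 2: squared distance of each number from the mean.
squared_distances = [(m - mean_marks) ** 2 for m in marks]

# Step 3: average those squared distances.
variance = sum(squared_distances) / len(marks)

print(variance)        # about 13.56
print(np.var(marks))   # same value from NumPy
```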

![Distribution with Variance](images/dist_with_variance.png)

**Check**: What could a distribution with a large variance look like? A small?

**Check**: What does a variance of 0 mean?




#### Standard deviation
The **standard deviation** is the square root of the variance.

Because the variance is the average of the squared distances from the mean, its square root (the standard deviation) tells us approximately how far, on average, the numbers in a distribution lie from the mean.

The standard deviation can be calculated with:

In [8]:
# Standard deviation code here
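
A short sketch for the `marks` list used in the variance section, assuming the population form again (`np.std` defaults to `ddof=0`, matching `np.var`).

```python
import numpy as np

marks = [13, 17, 19, 20, 23, 24]

# Standard deviation = square root of the variance.
std_manual = np.sqrt(np.var(marks))
std_numpy = np.std(marks)

print(std_manual, std_numpy)  # both about 3.68
```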


![Dist, variance and Standard Deviation](images/dist_with_var_std.png)

<a name="Series and DataFrame data types"></a>
## Introduction: Series and DataFrame data types (10 mins)

- Series is a one-dimensional labeled array capable of holding any data type (integers, strings,
floating point numbers, Python objects, etc.). The axis labels are collectively referred to as
the index. The basic method to create a Series is to call:

```Python
s = pd.Series(data, index=index)
```

- Here, data can be many different things:
    - a Python dict
    - an ndarray
    - a scalar value (like 5)

- The passed index is a list of axis labels.
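
As a rough illustration of those three kinds of input, the sketch below builds one Series from each. The values and labels are made up for the example.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# From an ndarray: pass the labels explicitly via index.
from_array = pd.Series(np.array([10, 20, 30]), index=['x', 'y', 'z'])

# From a scalar: the value is repeated for every label in index.
from_scalar = pd.Series(5, index=['a', 'b', 'c'])

print(from_dict)
print(from_array)
print(from_scalar)
```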



- DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table, or a dict
of Series objects. It is generally the most commonly used pandas object.

- Like Series, DataFrame accepts many different kinds of input:
    - Dict of 1D ndarrays, lists, dicts, or Series
    - 2-D numpy.ndarray
    - Structured or record ndarray
    - A Series
    - Another DataFrame

- Along with the data, you can optionally pass index (row labels) and columns
(column labels) arguments. If you pass an index and / or columns, you are
guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict
of Series plus a specific index will discard all data not matching up to the
passed index.

- If axis labels are not passed, they will be constructed from the input data based on common sense rules.
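
The effect of passing an explicit index alongside a dict of Series can be sketched as follows. The column names `one` and `two` and their values are illustrative only.

```python
import pandas as pd

d = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}

# No index passed: the row labels are the union of the Series indexes,
# and 'one' has no value for 'd', so that cell is NaN.
df_all = pd.DataFrame(d)

# Explicit index: only rows 'd', 'b', 'a' are kept, in that order;
# the data labelled 'c' is discarded.
df_subset = pd.DataFrame(d, index=['d', 'b', 'a'])

print(df_all)
print(df_subset)
```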

Here is more information on [series and dataframes](http://pandas.pydata.org/pandas-docs/stable/dsintro.html).

**Check:** What are some differences between Series and DataFrame?



<a name="pd.Series"></a>
## Demo / Guided Practice: pd.Series (25 mins)

Let's create a series and see what `pandas.Series` can do.


In [2]:
# create a series using a numpy random number generator

s = pd.Series(np.random.randint(5, 25, 7), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])  
s

a    17
b     5
c    13
d     9
e    20
f    14
g    16
dtype: int32


Now we have a series of 7 random numbers. Let's try out the same things we did with
a data frame back in W2 L1.1. First, let's look at the series head.

In [9]:
# head of series


<details>
    <summary>Solution</summary>
    <code>s.head()</code>
</details>

In [10]:
# tail of series


In [11]:
# summary stats


In [12]:
# select by location c to g


In [13]:
# select just b


In [14]:
# slice for rows 1-3


**Check:** How would you select just 'd'?


<a name="Boolean indexing"></a>
## Demo / Guided Practice: Boolean indexing (25 mins)

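Another common operation is the use of boolean vectors to filter the data. The operators
are `|` for or, `&` for and, and `~` for not. When conditions are combined, each one must be wrapped in parentheses, because these operators bind more tightly than comparisons such as `<` and `>`.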

Let's create another series and use pandas to do some Boolean indexing.

In [12]:
# create another series ranging from -3 to 3

s = pd.Series(range(-3, 4))
s

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [15]:
# find the values that are > 0. 


In [16]:
# find the values that are < -1 or > 0.5


In [17]:
# find the values that are not < 0.
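
The three filters above could be written as in the sketch below. Variable names are illustrative; note the parentheses around each comparison when `|` or `~` is involved.

```python
import pandas as pd

s = pd.Series(range(-3, 4))

# A comparison produces a boolean Series, which then selects rows.
positive = s[s > 0]

# Combine conditions with | (or); each side needs its own parentheses.
outer = s[(s < -1) | (s > 0.5)]

# Negate a condition with ~ (not).
not_negative = s[~(s < 0)]

print(positive.tolist())      # [1, 2, 3]
print(outer.tolist())         # [-3, -2, 1, 2, 3]
print(not_negative.tolist())  # [0, 1, 2, 3]
```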


Here is some further information on [boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges).

**Check:** How would you find all the numbers that are < 2?


In [32]:
# find the values that are < 2



# Demo: csv module (10 mins)
Let's take a look at the Python `csv` module. The module's reader and writer objects read and write sequences. We'll be using a small Sales data set to practice. Let's read a csv file first.
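
A minimal sketch of reading with `csv.reader` follows. Because the real `sales.csv` is not reproduced in this lesson text, the sketch first writes a tiny stand-in file to a temporary directory; its column names and rows are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for sales.csv so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sales.csv')
with open(path, 'w', newline='') as f:
    f.write('Name,Product,Quantity\nacme,widget,3\nglobex,gadget,5\n')

# csv.reader yields each line of the file as a list of strings.
rows = []
with open(path, newline='') as f:
    for row in csv.reader(f):
        rows.append(row)
        print(row)
```

Every field comes back as a string, including the numbers, so any arithmetic needs an explicit conversion such as `int(row[2])`.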

The output will be the contents of the `sales.csv` file.

Now, let's write to a csv file.

In [24]:
data = ['123456', 'cosmos', 'neil', 'lucy', 'universe', '1', '1,000,000', 'presented']
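
One way to append this row with `csv.writer` is sketched below. The file path is a temporary stand-in, not the lesson's actual `sales.csv`. Opening in `'a'` mode appends instead of overwriting, and the writer quotes `'1,000,000'` automatically because it contains the delimiter.

```python
import csv
import os
import tempfile

data = ['123456', 'cosmos', 'neil', 'lucy', 'universe', '1', '1,000,000', 'presented']

# Hypothetical path; point this at the real file when following along.
path = os.path.join(tempfile.mkdtemp(), 'sales.csv')

# 'a' appends a row to the end of the file (creating it if needed).
with open(path, 'a', newline='') as f:
    csv.writer(f).writerow(data)

# Read it back, both raw and parsed, to see what was written.
with open(path, newline='') as f:
    raw = f.read()
with open(path, newline='') as f:
    written = list(csv.reader(f))

print(raw)
print(written)
```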


Now, let's see the file again, with the data you just added:

In [25]:
# read in csv file and create a pandas dataframe
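
With pandas, reading a csv comes down to a single `pd.read_csv` call. The sketch below feeds it an in-memory string with hypothetical columns so that it runs without the lesson's data file; with a real file you would pass the path instead, for example `pd.read_csv('sales.csv')`.

```python
import io

import pandas as pd

# Hypothetical csv content standing in for the sales data.
csv_text = (
    'Name,Product,Quantity,Price\n'
    'acme,widget,3,9.5\n'
    'globex,gadget,5,20.0\n'
)

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```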


**Check:** This looks familiar...didn't we already learn how to read in csv files?
Yes, but that was using Python without any libraries or packages. It took 5 lines of
Python, but using Pandas it only takes one line. Nice!

<a name="Viewing data: head/tail, describe"></a>
## Demo / Guided Practice: Viewing data: head/tail, describe (25 mins)


In [26]:
# head of dataset


In [27]:
# tail of dataset


**Check:** What can looking at the head and tail of a dataset tell us?


In [28]:
# summary stats
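
The three viewing calls in this section can be sketched on a small made-up DataFrame (the column names and values below are hypothetical, not the sales data).

```python
import pandas as pd

df = pd.DataFrame({
    'Quantity': list(range(1, 11)),
    'Price': [float(q * 10) for q in range(1, 11)],
})

head = df.head()       # first five rows by default
tail = df.tail(3)      # last three rows
stats = df.describe()  # summary statistics for the numeric columns

print(head)
print(tail)
print(stats)
```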


This gives us: count, mean, std, min, 25%, 50%, 75%, and max. Awesome!

**Check:** What was the cautionary tale about relying too heavily on summary stats again?


<a name="Selection: a single column, slicing by row, by position"></a>
## Demo / Guided Practice: Selection: a single column, slicing by row, by position (25 mins)


In [29]:
# select a single column
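
A sketch of single-column selection on a hypothetical DataFrame. Single brackets return a Series, while double brackets return a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data; the real sales file has its own columns.
df = pd.DataFrame({
    'Name': ['acme', 'globex', 'initech'],
    'Quantity': [3, 5, 2],
})

name_series = df['Name']    # a pandas Series
name_frame = df[['Name']]   # a DataFrame with one column

print(type(name_series).__name__)  # Series
print(type(name_frame).__name__)   # DataFrame
```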


**Check:** How would you select the 'Quantity' and 'Price' columns separately?


In [30]:
# slice certain rows 


**Check:** How would you slice for rows 9 to 14?


In [31]:
# slice for rows 9-14


##### Now, let's try selecting by position.

In [12]:
# First, let's slice some rows.



**Check:** How would you slice for rows 9 to 14?


In [32]:
# slice some columns


**Check:** How would you slice for the 'Manager' and 'Product' columns?


In [15]:
# slice for the 'manager' column



In [33]:
# select for an explicit value only
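
The position-based selections in this section can be sketched with `.iloc`, which takes integer positions for rows and columns. The DataFrame below is hypothetical and much smaller than the sales data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['acme', 'globex', 'initech', 'umbrella'],
    'Product': ['widget', 'gadget', 'stapler', 'serum'],
    'Quantity': [3, 5, 2, 7],
})

# Rows at positions 1 and 2 (the end position is excluded).
some_rows = df.iloc[1:3]

# All rows, columns at positions 0 and 1.
some_cols = df.iloc[:, 0:2]

# One explicit value: row position 1, column position 1.
one_value = df.iloc[1, 1]

print(some_rows)
print(some_cols)
print(one_value)  # gadget
```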

