### Python Resources
This document points you to resources that you may find useful if you decide to take a deeper dive into Python. This course is not meant to be an introduction to programming or to Python, but if you want to explore Python further, or feel that it is a useful skill, the resources below are a good place to start. If you already have a background in Python or programming, the style guides included below show how Python may differ from other programming languages and give you a launching point for diving into more advanced packages. This course does not endorse the use or non-use of any particular resource, but the author has found these resources useful in their exploration of programming and Python in particular.

### The Python Documentation
Any reference that does not begin with the Python documentation would not be complete. The authors of the language, as well as the community that supports it, have developed a great set of tutorials, documentation, and references around Python. When in doubt, this is often the first place that you should look if you run into a scary error or would like to learn more about a specific function. The documentation can be found here: [Python Documentation](https://docs.python.org/3/)




### Python Programming Introductions

Below are resources to help you along your way in learning Python. While it is great to consume material, in programming there is no substitute for actually writing code. For every hour that you spend learning, you should spend about twice that amount of time writing code for cool problems or working out examples. Coding is best learned through actually coding!

* [Coursera](https://www.coursera.org/courses?query=python) has several Python offerings that you can take in addition to this course. These courses go into depth on Python programming and how to use it in an applied setting.
* [Codecademy](https://www.codecademy.com/learn/learn-python) is another resource that is great for learning Python (and other programming languages). While not as focused as Coursera, it is a quick way to get up and running with Python.
* YouTube is another great resource for online learning and there are several "courses" for learning Python. We recommend trying several sets of videos to see which you like best and using multiple video series to learn since each will present the material in a slightly different way
* There are dozens of books on programming in Python that are great if you prefer to read. More so than with the other resources, be sure to code what you learn. It is easy to read about coding, but you really learn to code by coding!
* If you have a background in coding, the authors have found the tutorial at [Tutorials Point](https://www.tutorialspoint.com/python/index.htm) to be useful in getting started with Python. This tutorial assumes that you have some background in coding in another language

### Python Style Guides

As you learn to code, you will find that you begin to develop your own style. Sometimes this is good. More often, a purely personal style hurts the readability of your code and, in extreme cases, can keep you from finding bugs in your own code.

It is best to learn good coding habits from the beginning and the [Google Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md) is a great place to start. We will mention some of these best practices here.

# Python Libraries

Python, like other programming languages, has an abundance of additional modules or libraries that augment the base framework and functionality of the language.

Think of a library as a collection of functions that can be accessed to complete certain programming tasks without having to write your own algorithm.

For this course, we will focus primarily on the following libraries:

* **NumPy** is a library for working with arrays of data.

* **Pandas** provides high-performance, easy-to-use data structures and data analysis tools.

* **SciPy** is a library of techniques for numerical and scientific computing.

* **Matplotlib** is a library for making graphs.

* **Seaborn** is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks.

* **Statsmodels** is a library that implements many statistical techniques.

# Documentation

Reliable and accessible documentation is an absolute necessity when it comes to knowledge transfer for programming languages.  Luckily, Python provides a significant amount of detailed documentation that explains the ins and outs of the language syntax, libraries, and more.  

Understanding how to read documentation is crucial for any programmer, as it will serve as a fantastic resource when learning the intricacies of Python.

Here is the link to the documentation of the Python standard library: [Python Standard Library](https://docs.python.org/3/library/index.html#library-index)
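Documentation is also available from inside Python itself. The sketch below is only an illustration: it uses the built-in `help()` function and the `__doc__` attribute, with `statistics.mean` from the standard library picked as an arbitrary example function.

```python
import statistics

# help(statistics.mean) prints the full documentation in an interactive session.
# The same text is stored on the function itself as its docstring:
docstring = statistics.mean.__doc__

# The first line of a docstring is normally a one-sentence summary
summary_line = docstring.strip().splitlines()[0]
print(summary_line)
```

The same approach works for functions from imported libraries, e.g. `help(np.mean)` once NumPy has been imported as `np`.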

### Cheatsheets and References

There are a variety of one-pagers and cheat-sheets available for Python that summarize the language in a few simple pages. These resources tend to be more aimed at someone who knows the language, or has experience in the language, but would like a refresher course in how the language works. 

* [Cheatsheet for Numpy](https://www.datacamp.com/community/blog/python-numpy-cheat-sheet#gs.AK5ZBgE)
* [Cheatsheet for Datawrangling](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.HPFoRIc)
* [Cheatsheet for Pandas](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet#gs.oundfxM)
* [Cheatsheet for SciPy](https://www.datacamp.com/community/blog/python-scipy-cheat-sheet#gs.JDSg3OI)
* [Cheatsheet for Matplotlib](https://www.datacamp.com/community/blog/python-matplotlib-cheat-sheet#gs.uEKySpY)




### Importing Libraries

When using Python, it is conventional to begin your scripts by importing the libraries that you will be using; a library must be imported before any of its functions can be called. 

The following statements import the NumPy and pandas libraries and give them abbreviated names:

In [1]:
import numpy as np
import pandas as pd

### Utilizing Library Functions

After importing a library, its functions can then be called from your code by prepending the library name to the function name.  For example, to use the '`dot`' function from the '`numpy`' library, you would enter '`numpy.dot`'.  To avoid repeatedly having to type the library name in your scripts, it is conventional to define a two- or three-letter abbreviation for each library, e.g. '`numpy`' is usually abbreviated as '`np`'.  This allows us to use '`np.dot`' instead of '`numpy.dot`'.  Similarly, the Pandas library is typically abbreviated as '`pd`'.

The next cell shows how to call functions within an imported library:

In [2]:
a = np.array([0,1,2,3,4,5,6,7,8,9,10]) 
np.mean(a)

5.0

As you can see, we used the mean() function within the numpy library to calculate the mean of the numpy 1-dimensional array.
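As a small illustration of the naming convention described above, the sketch below imports NumPy both under its full name and under the `np` alias and calls `dot` through each. The arrays `a` and `b` are made-up values used only for this example.

```python
import numpy
import numpy as np

# Two small illustrative arrays (values chosen only for this example)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Both names refer to the same library, so both calls run the same function
full_name_result = numpy.dot(a, b)
alias_result = np.dot(a, b)

print(full_name_result, alias_result)  # 32 32
```

The alias is purely a convenience: `np` and `numpy` point at the same module object.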

# Data Management

Data management is a crucial component to statistical analysis and data science work.  The following code will show how to import data via the pandas library, view your data, and transform your data.

The main data structure that Pandas works with is called a **Data Frame**.  This is a two-dimensional table of data in which the rows typically represent cases (e.g. Cartwheel Contest Participants), and the columns represent variables.  Pandas also has a one-dimensional data structure called a **Series** that we will encounter when accessing a single column of a Data Frame.

Pandas has a variety of functions named '`read_xxx`' for reading data in different formats.  Right now we will focus on reading '`csv`' files, where csv stands for comma-separated values. Other supported formats include Excel, JSON, and SQL, just to name a few.

This is a link to the .csv that we will be exploring in this tutorial: [Cartwheel Data](https://www.coursera.org/learn/understanding-visualization-data/resources/0rVxx) (Link goes to the dataset section of the Resources for this course)

There are many other options to '`read_csv`' that are very useful.  For example, you would use the option `sep='\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas.  See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for '`read_csv`'.
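To make the `sep` option concrete, here is a minimal sketch that reads tab-delimited text. It assumes a tiny inline sample (`tsv_text`, a hypothetical stand-in for a real file) wrapped in `io.StringIO` so that no file on disk is needed.

```python
import io

import pandas as pd

# A tiny tab-delimited sample standing in for a real file (illustrative only)
tsv_text = "ID\tAge\tHeight\n1\t56\t62.0\n2\t26\t62.0\n"

# sep="\t" tells read_csv that fields are separated by tabs instead of commas
df_tabs = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_tabs.shape)  # (2, 3)
```

With a real file you would pass the file path as the first argument instead of the `StringIO` object.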

### Importing Data

In [4]:
# Store the path (or url) of our .csv file (note that this is different from the url used in the video)
url = "Cartwheeldata.csv"

# Read the .csv file and store it as a pandas Data Frame
df = pd.read_csv(url)

# Output object type
type(df)

pandas.core.frame.DataFrame

### Viewing Data

In [6]:
# We can view our Data Frame by calling the head() function
df.head()

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4


The head() function simply shows the first 5 rows of our Data Frame.  If we wanted to show the entire Data Frame we would simply write the following:

In [8]:
# Output entire Data Frame
df

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3
6,7,28,M,2,N,0,75.0,76.0,107,Y,1,10
7,8,22,F,1,N,0,65.0,62.0,98,Y,1,9
8,9,29,M,2,Y,1,74.0,73.0,106,N,0,5
9,10,33,F,1,Y,1,63.0,60.0,65,Y,1,8


As you can see, we have a 2-Dimensional object where each row is an independent observation of our cartwheel data.
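If only a quick look is needed, `head()` also accepts the number of rows to show, and the `shape` attribute reports the dimensions directly. The sketch below assumes a small stand-in Data Frame named `sample`, built from a few columns of the first five rows shown above, so that it runs without the .csv file.

```python
import pandas as pd

# Stand-in for df: a few columns from the first five rows of the cartwheel data
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [56, 26, 33, 39, 27],
    "Gender": ["F", "F", "F", "F", "M"],
    "CWDistance": [79, 70, 85, 87, 72],
})

first_three = sample.head(3)   # head() accepts the number of rows to return
n_rows, n_cols = sample.shape  # shape is a (rows, columns) tuple

print(n_rows, n_cols)  # 5 4
```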

To gather more information regarding the data, we can view the column names with the `columns` attribute (the data type of each column is covered further below):

In [9]:
df.columns

Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

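Let's say we would like to slice our data frame and select only specific portions of our data.  There are three different ways of doing so.

1. .loc()
2. .iloc()
3. .ix()

We will cover the .loc() and .iloc() slicing functions. Note that .ix() was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not available in current versions. Although written here with parentheses, .loc and .iloc are used with square brackets, as in df.loc[...].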

### .loc()
.loc() takes two arguments separated by ','. Each can be a single label, a list of labels, or a slice (range). The first one selects rows and the second one selects columns.

In [10]:
# Return all observations of CWDistance
df.loc[:,"CWDistance"]

0      79
1      70
2      85
3      87
4      72
5      81
6     107
7      98
8     106
9      65
10     96
11     79
12     92
13     66
14     72
15    115
16     90
17     74
18     64
19     85
20     66
21    101
22     82
23     63
24     67
Name: CWDistance, dtype: int64

In [11]:
# Select all rows for multiple columns, ["CWDistance", "Height", "Wingspan"]
df.loc[:,["CWDistance", "Height", "Wingspan"]]

Unnamed: 0,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


In [12]:
# Select few rows for multiple columns, ["CWDistance", "Height", "Wingspan"]
df.loc[:9, ["CWDistance", "Height", "Wingspan"]]

Unnamed: 0,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


In [13]:
# Select range of rows for all columns
df.loc[10:15]

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
10,11,30,M,2,Y,1,69.5,66.0,96,Y,1,6
11,12,28,F,1,Y,1,62.75,58.0,79,Y,1,10
12,13,25,F,1,Y,1,65.0,64.5,92,Y,1,6
13,14,23,F,1,N,0,61.5,57.5,66,Y,1,4
14,15,31,M,2,Y,1,73.0,74.0,72,Y,1,9
15,16,26,M,2,Y,1,71.0,72.0,115,Y,1,6


The .loc() function takes up to two arguments: the index labels of the rows and the column names you wish to observe. If the column argument is omitted, as in df.loc[10:15], all columns are returned.

In the first example above, **:** specifies all rows, and our column is **CWDistance**. df.loc[**:**,**"CWDistance"**]

Now, let's say we only want to return the first 10 observations:

In [15]:
df.loc[:9, "CWDistance"]

0     79
1     70
2     85
3     87
4     72
5     81
6    107
7     98
8    106
9     65
Name: CWDistance, dtype: int64
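Note that `df.loc[:9, ...]` returned ten rows: a `.loc` slice is label-based and includes its end label. The sketch below shows the same behavior on a small stand-in Data Frame (`sample`, built from the first five rows shown earlier, so it runs without the .csv file).

```python
import pandas as pd

# Stand-in for df: a few columns from the first five rows of the cartwheel data
sample = pd.DataFrame({
    "Height": [62.0, 62.0, 66.0, 64.0, 73.0],
    "Wingspan": [61.0, 60.0, 64.0, 63.0, 75.0],
    "CWDistance": [79, 70, 85, 87, 72],
})

# A label slice includes both ends, so :2 returns the rows labeled 0, 1 and 2
first_three = sample.loc[:2, "CWDistance"]

# Lists of labels work for both rows and columns
subset = sample.loc[[0, 4], ["Height", "Wingspan"]]

print(first_three.tolist())  # [79, 70, 85]
print(subset.shape)          # (2, 2)
```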

### .iloc()
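.iloc() is integer-position-based slicing, whereas .loc() uses labels/column names. Unlike .loc(), an .iloc() slice excludes its end position, so df.iloc[:4] returns rows 0 through 3. Here are some examples: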

In [16]:
df.iloc[:4]

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10


In [17]:
df.iloc[1:5, 2:4]

Unnamed: 0,Gender,GenderGroup
1,F,1
2,F,1
3,F,1
4,M,2


In [20]:
# .iloc() only accepts integer positions, so passing column names raises an error:
#df.iloc[1:5, ["Gender", "GenderGroup"]]
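One way to combine positional row selection with column names is to translate the names into positions first. The sketch below is one possible approach, not the only one: it uses `Index.get_indexer` on a small stand-in Data Frame (`sample`, built from the first five rows shown earlier) and also contrasts the end-exclusive `.iloc` slice with the end-inclusive `.loc` slice.

```python
import pandas as pd

# Stand-in for df: a few columns from the first five rows of the cartwheel data
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [56, 26, 33, 39, 27],
    "Gender": ["F", "F", "F", "F", "M"],
    "GenderGroup": [1, 1, 1, 1, 2],
})

# iloc slices exclude the end position; loc slices include the end label
rows_by_position = sample.iloc[1:3]  # positions 1 and 2
rows_by_label = sample.loc[1:3]      # labels 1, 2 and 3

# Translate column names into integer positions, then pass those to iloc
col_positions = sample.columns.get_indexer(["Gender", "GenderGroup"])
by_position = sample.iloc[1:5, col_positions]

print(len(rows_by_position), len(rows_by_label))  # 2 3
print(list(col_positions))                        # [2, 3]
```

Here `by_position` holds the same rows and columns as `sample.iloc[1:5, 2:4]`.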

We can view the data types of our data frame columns by calling .dtypes on our data frame:

In [21]:
df.dtypes

ID                 int64
Age                int64
Gender            object
GenderGroup        int64
Glasses           object
GlassesGroup       int64
Height           float64
Wingspan         float64
CWDistance         int64
Complete          object
CompleteGroup      int64
Score              int64
dtype: object

The output indicates we have integers, floats, and objects within our Data Frame.

We may also want to observe the different unique values within a specific column. Let's do this for Gender:
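Data types can also be used to pick out columns. As a hedged sketch, the example below uses `select_dtypes`, a pandas method not used elsewhere in this tutorial, to keep only the numeric columns of a small stand-in Data Frame (`sample`, built from the first five rows shown earlier).

```python
import pandas as pd

# Stand-in for df: a few columns from the first five rows of the cartwheel data
sample = pd.DataFrame({
    "Age": [56, 26, 33, 39, 27],
    "Gender": ["F", "F", "F", "F", "M"],
    "Height": [62.0, 62.0, 66.0, 64.0, 73.0],
})

# Keep only the integer and float columns; the text column is dropped
numeric_only = sample.select_dtypes(include="number")
numeric_cols = numeric_only.columns.tolist()

print(numeric_cols)  # ['Age', 'Height']
```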

In [22]:
# List unique values in the df['Gender'] column
df.Gender.unique()

array(['F', 'M'], dtype=object)

In [23]:
# Let's explore df["GenderGroup"] as well
df.GenderGroup.unique()

array([1, 2], dtype=int64)

It seems that these fields may serve the same purpose, which is to specify male vs. female. Let's check this quickly by observing only these two columns:

In [24]:
# Use .loc() to specify a list of multiple column names
df.loc[:,["Gender", "GenderGroup"]]

Unnamed: 0,Gender,GenderGroup
0,F,1
1,F,1
2,F,1
3,F,1
4,M,2
5,M,2
6,M,2
7,F,1
8,M,2
9,F,1


From eyeballing the output, it seems to check out.  We can streamline this by utilizing the groupby() and size() functions.

In [25]:
df.groupby(['Gender','GenderGroup']).size()

Gender  GenderGroup
F       1              12
M       2              13
dtype: int64

This output indicates that only two combinations occur in the data:

* Case 1: Gender = F & GenderGroup = 1
* Case 2: Gender = M & GenderGroup = 2

Because no other combinations appear, this validates our initial assumption that these two fields essentially portray the same information.
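The same check can be made programmatically rather than by reading the table. The sketch below is one possible approach, assuming a small stand-in Data Frame (`sample`, built from the first five rows shown earlier): it counts how many distinct GenderGroup values occur for each Gender using `nunique()`, which is not used elsewhere in this tutorial.

```python
import pandas as pd

# Stand-in for df: the two columns of interest from the first five rows
sample = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M"],
    "GenderGroup": [1, 1, 1, 1, 2],
})

# Count the rows for each (Gender, GenderGroup) combination, as in the text
counts = sample.groupby(["Gender", "GenderGroup"]).size()

# If the two columns carry the same information, each Gender maps to one group
groups_per_gender = sample.groupby("Gender")["GenderGroup"].nunique()
columns_agree = bool((groups_per_gender == 1).all())

print(columns_agree)  # True
```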