# 01 - Python and Pandas Introduction
<p class="lead">
Michelle Brown Notes v 1.5
</p>

# Outline

<!-- MarkdownTOC autolink=true autoanchor=true bracket=round -->

- [Why Python?](#why)
- [Introduction to DataFrames](#df)
- [Quickly viewing the data](#view)
- [Filtering the dataframe](#filter)
- [Renaming, adding, and dropping columns](#rename)
- [Pivot Tables](#pivot)
<!-- /MarkdownTOC -->

<a name="why"></a>
# Python and Pandas

<b>What is Jupyter Notebook?</b><br>
Jupyter Notebook, called "IPython Notebook" before version 4.0, is a way of interacting with Python code in a web browser. It is a very useful instructional tool that we will use to introduce you to analysis with Python. Notebook files have the extension ".ipynb", an abbreviation of "IPython Notebook". Some websites, such as nbviewer.ipython.org or www.github.com, can render .ipynb files directly as HTML. However, these rendered pages are not interactive: they are not running the Python kernel that evaluates the code, so what you see is just a static copy of the notebook's contents.

To interact with a notebook and start coding, you first need to open a command line: Terminal for Mac and Linux users, or the Command Prompt (the `cmd.exe` program) for Windows users.<br>

<b>Why Python?</b><br>
Python is an easy language to get started with, and it handles data wrangling tasks in a simple and straightforward way. Python code is often said to read almost like pseudocode, since it lets you express powerful ideas in a few very readable lines. It costs nothing to install and use. Python has an open community that is supportive of new users, and many well-developed, well-documented packages that you can import and use. Jupyter notebooks are a further advantage: you can easily mix comments and instructions with your code, and the notebooks are easy to share. Python is also versatile and can be used to build applications. It is a high-level, <a href="/files/images/matrix.gif">dynamically typed multiparadigm</a> programming language.

<b>Pandas Data Analysis Library</b><br>
Python and <a href="http://pandas.pydata.org/">Pandas</a> can easily read in and handle very large datasets (like voter lists with millions of records) that Excel cannot handle. The pandas package was built around tabular data structures (a common structure for election datasets). Here is a <a href="https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf">Pandas Cheat Sheet</a> for common tasks.

A Note on the 2.7 version of Python:<br>
There are two major Python versions: Python 2.X and Python 3.X. The final release series of Python 2.X is 2.7, which is the Python version used in this notebook. Python 3.X is the newer line (3.6 was the latest release as of Jan 2017); it is stable, but at the time of writing not every package had been updated to work with it. For now, assume code you write for 2.7 will not run unchanged in Python 3. (The term used to describe this is to say that Python 3 breaks <i>backwards compatibility</i>.)


<a name="df"></a>
# Dataframes

First we are going to import the analysis package Pandas under the alias 'pd' so we can use its associated methods. We also import NumPy as 'np', which is used in the pivot table section.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# As an aside if you want to see what version of pandas you have, uncomment (remove the #) in the last line below. 
# I had version 0.18
# pd.__version__

Let's start with a small and simple dataset, the Polling Station List 2008 file. We'll use pandas to read the file as a dataframe and we'll call the dataframe ps2008:

In [None]:
ps2008 = pd.read_csv('data/Polling_Station_List_2008_1.csv')
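The data file itself is not reproduced in these notes, so here is a rough sketch of what `read_csv` does, using a tiny CSV held in a string. The two rows and their values are invented for illustration and are not taken from the 2008 file.

```python
import io
import pandas as pd

# Invented two-row sample standing in for a CSV file on disk.
csv_text = """PS_Code,PS_Name,Registered_Voters
X00001,Example School,512
X00002,Example Clinic,804
"""

# read_csv accepts a file path or any file-like object.
# The first line of the file becomes the column names.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Passing a path such as 'data/Polling_Station_List_2008_1.csv' works the same way; the string buffer is only used here so the example needs no file.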

We now have a <b>DataFrame</b>, one of the basic Pandas data structures.
A dataframe is similar to an Excel worksheet: you have column names referring to columns, and you have rows, which can be accessed by row number. Each row is an observation or measurement and each column is a variable (stored as a vector of values).
<br>Those who are familiar with R know the data frame as a way to store data in rectangular (or 'tabular') grids that are easy to overview, with the same layout of one row per instance and one column per variable.
The essential difference from a spreadsheet is that in a dataframe the column names and row numbers are known as the column _index_ and the row _index_.
<br>A tricky thing to keep in mind is that the row index starts its numbering at 0, not 1.
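As a minimal sketch of the row index and column index described above, the toy dataframe below is built by hand. Its values are made up; only the column names echo the polling station file.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'PS_Name': ['School A', 'Clinic B', 'Hall C'],
    'Registered_Voters': [500, 820, 760],
})

print(list(toy.index))     # row index labels, numbered from 0
print(list(toy.columns))   # column index labels

# Position 0 is the first row, not position 1.
first_row = toy.iloc[0]
print(first_row['PS_Name'])
```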


<b>Previewing the Data</b><br>The shape attribute gives you the dimensions of your DataFrame. It is an attribute rather than a method, so it is written without parentheses. In this case it tells us how many rows (20,928) and how many columns (6) are in our dataframe:

In [None]:
ps2008.shape
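Because `shape` is a plain (rows, columns) tuple, it can be unpacked into two variables. The sketch below uses an invented three-row dataframe rather than the real file.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'PS_Name': ['School A', 'Clinic B', 'Hall C'],
    'Registered_Voters': [500, 820, 760],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() gives the number of rows on its own.
print(len(toy))
```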

Let's use the .head method to look at the first few rows of all the columns: 

In [None]:
ps2008.head()
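By default `head` returns the first five rows, and it also accepts a row count; `tail` does the same from the bottom of the table. A small sketch on invented numbers:

```python
import pandas as pd

# Seven made-up values so that the default of five rows is visible.
toy = pd.DataFrame({'Registered_Voters': [500, 820, 760, 430, 910, 640, 700]})

first_default = toy.head()    # first 5 rows
first_two = toy.head(2)       # first 2 rows
last_two = toy.tail(2)        # last 2 rows

print(first_two)
print(last_two)
```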

Note that pandas automatically created a row index on the left-hand side (and again note that it started at 0).

Remember, if you want to learn more about the methods available for your dataframe, type its name followed by a "." and hit tab, and it will show you a list like the one below: <br>
<img src='images/tabhelp_screenshot.png' width=300 style="float: left;">

If you want to learn more about a specific method, type a "?" after its name and run the cell; the help should show up. Try it by running the code below:

In [None]:
ps2008.astype?
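The "?" shortcut only works inside Jupyter/IPython. In a plain Python script the same documentation can be reached through the built-in `help` function or the method's docstring, as this short sketch shows:

```python
import pandas as pd

# help(pd.DataFrame.astype) would print the full help text;
# the docstring itself is available as an ordinary string.
doc = pd.DataFrame.astype.__doc__

# Show just the first line of the documentation.
print(doc.strip().splitlines()[0])
```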

<a name="view"></a>
# Viewing the dataframe 

Let's see a list of all the column names:

In [None]:
ps2008.columns

View a specific column of the dataframe:

In [None]:
ps2008['PS_Name']

Let's look at the first 20 rows of all the columns:

In [None]:
ps2008[:20]

Let's view two specific columns, PS_Name and Registered_Voters:

In [None]:
ps2008[['PS_Name', 'Registered_Voters']]
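The number of brackets matters: a single column name returns a one-dimensional Series, while a list of names returns a DataFrame. A minimal sketch on invented data:

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'PS_Name': ['School A', 'Clinic B', 'Hall C'],
    'Registered_Voters': [500, 820, 760],
    'Region_Name': ['WESTERN', 'EASTERN', 'WESTERN'],
})

one_column = toy['PS_Name']                            # a Series
two_columns = toy[['PS_Name', 'Registered_Voters']]    # a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```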

We can also get just the first 20 rows of a specific column. Note that the output also tells us the type of data for that variable. In this case, we see that the data type is an integer (i.e., int64):

In [None]:
ps2008['Registered_Voters'][:20]
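The data type can also be checked directly instead of reading it off the printed output. This sketch uses invented values; the exact integer width reported may differ by platform.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'PS_Name': ['School A', 'Clinic B', 'Hall C'],
    'Registered_Voters': [500, 820, 760],
})

# dtype of one column, and dtypes of every column at once.
print(toy['Registered_Voters'].dtype)
print(toy.dtypes)

# A check that does not depend on the exact integer width.
is_int = pd.api.types.is_integer_dtype(toy['Registered_Voters'])
print(is_int)
```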

<a name="filter"></a>
# Filtering the dataset 

Now let's view only those rows where the number of registered voters is greater than 750:

In [None]:
ps2008[ps2008['Registered_Voters'] > 750 ]
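The filter above works in two steps: the comparison produces a True/False value for every row, and indexing with that "mask" keeps only the True rows. A small sketch with invented numbers makes the two steps visible:

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'PS_Name': ['School A', 'Clinic B', 'Hall C'],
    'Registered_Voters': [500, 820, 760],
})

# Step 1: the comparison yields one boolean per row.
mask = toy['Registered_Voters'] > 750
print(mask.tolist())

# Step 2: indexing with the mask keeps the rows where it is True.
large = toy[mask]
print(large)
```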

<b>Viewing based on a row's specific value in one variable:</b><br>
Now let's look in the PS_Code column, find the row whose value is W00770, and show its values for the other columns:

In [None]:
ps2008.loc[ps2008['PS_Code'] == 'W00770'] 

Filter for those rows where the Region_Name is WESTERN or EASTERN:

In [None]:
ps2008[ps2008['Region_Name'].isin(['WESTERN', 'EASTERN'])]
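Conditions can also be combined, using `&` for "and" and `|` for "or" between boolean masks. The sketch below assumes an invented four-row table and reuses the 750 threshold purely as an example.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'Region_Name': ['WESTERN', 'EASTERN', 'NORTHERN', 'WESTERN'],
    'Registered_Voters': [500, 820, 760, 900],
})

in_regions = toy['Region_Name'].isin(['WESTERN', 'EASTERN'])
is_large = toy['Registered_Voters'] > 750

# Rows that satisfy both conditions.
both = toy[in_regions & is_large]

# Rows that satisfy at least one condition.
either = toy[in_regions | is_large]

print(both)
print(len(either))
```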

<a name="rename"></a>
# Renaming, adding, and dropping columns


<b>Rename a column</b><br>Let's say we want to take the column currently named 'Registered_Voters' and rename it to 'rv':

In [None]:
ps2008.rename(columns={'Registered_Voters': 'rv'}, inplace=True)

In [None]:
ps2008.columns
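With `inplace=True` the dataframe itself is changed. Without it, `rename` leaves the original alone and returns a renamed copy, which is worth knowing if you want to keep both versions. A sketch on invented data:

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'Registered_Voters': [500, 820],
    'Region_Name': ['WESTERN', 'EASTERN'],
})

# No inplace=True: the result is a new dataframe.
renamed = toy.rename(columns={'Registered_Voters': 'rv'})

print(list(toy.columns))       # original names are unchanged
print(list(renamed.columns))   # the copy carries the new name
```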

In [None]:
# Define the new names of several columns at once 
newcols = {
    'Region_Name': 'region', 
    'District_Name': 'district', 
}

ps2008.rename(columns=newcols, inplace=True)

In [None]:
ps2008.columns

Let's add a new column where all values are 1:

In [None]:
ps2008['new_column'] = 1

In [None]:
ps2008['new_column']
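Assigning a single value fills every row with it. A new column can also be calculated from an existing one. In the sketch below the `is_large` column name and the 750 cut-off are hypothetical choices made only for this example.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({'rv': [500, 820, 760]})

# A single value is repeated for every row.
toy['new_column'] = 1

# A column derived from another column, row by row.
toy['is_large'] = toy['rv'] > 750

print(toy)
```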

Make a new dataframe called "smaller_ps2008" but without a specific column: 

In [None]:
smaller_ps2008 = ps2008.drop('new_column', axis = 1)
smaller_ps2008.head()
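Note that `drop` returns a new dataframe and leaves the original untouched, which is why the result is stored under a new name. A minimal sketch on invented data:

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({'rv': [500, 820], 'new_column': [1, 1]})

# axis=1 means "drop a column" (axis=0 would drop rows).
smaller = toy.drop('new_column', axis=1)

print(list(toy.columns))       # still has new_column
print(list(smaller.columns))   # new_column is gone
```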

<a name="pivot"></a>
# Pivot Tables

Total (sum) number of registered voters for each Constituency:

In [None]:
# This table holds sums, so the variable is named sum_const
sum_const = ps2008.pivot_table('rv', columns='Constituency_Name', aggfunc='sum')

In [None]:
sum_const.head()

Average number of registered voters for each region:

In [None]:
mean_rvregion = ps2008.pivot_table('rv', columns='region', aggfunc='mean')
print(mean_rvregion)

Total number of registered voters for each region:

In [None]:
sum_rvregion = ps2008.pivot_table('rv', columns='region', aggfunc='sum')
print(sum_rvregion)

Next let's redo the total by region, but set up the table with region as the row index so it is labeled and easier to read:

In [None]:
tablesumrv_region = pd.pivot_table(ps2008, index=["region"],
               values=["rv"],
               aggfunc=[np.sum],fill_value=0)

In [None]:
tablesumrv_region
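To see what the `columns=` and `index=` arguments change, here is a small sketch on an invented four-row table. It also shows `groupby`, a different pandas tool that is not used elsewhere in this notebook but gives the same totals.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'region': ['WESTERN', 'EASTERN', 'WESTERN', 'EASTERN'],
    'rv': [500, 820, 760, 900],
})

# columns= spreads the regions across the top (one wide row).
wide = toy.pivot_table('rv', columns='region', aggfunc='sum')

# index= lists the regions down the side (one row per region).
tall = toy.pivot_table(values='rv', index='region', aggfunc='sum')

# groupby gives the same totals as a Series.
grouped = toy.groupby('region')['rv'].sum()

print(wide)
print(tall)
print(grouped)
```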

That is the end of this introductory notebook. In the next notebook we will start summarizing a bigger dataset.