# Introduction to Pandas

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

# Pandas 

Pandas is one of the most widely used Python libraries for data analysis. It offers tools for working with many kinds of data, and many other packages are built on top of it. If you have worked with R before in another class, you are probably familiar with **data frames**. Pandas offers a DataFrame class of objects that has very similar properties to R data frames and will be our main way of storing **tabular data** (data with rows and columns). 

If you'd like to read more about the libraries that we're loading here, see the following link for [Pandas](http://pandas-docs.github.io/pandas-docs-travis/). 

In [None]:
import numpy as np 
import pandas as pd 

## Reading in the Data Set
Before we can start analyzing data, we first need to bring the data into Python so that we can work with it. We'll start by reading in a data set from a csv (comma-separated values) file. 

We use the `read_csv` function from pandas to read in the csv file.

In [None]:
data_file = '201807-CAH_PulseOfTheNation_Raw.csv'
df = pd.read_csv(data_file)
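If you want to try `read_csv` without the survey file on hand, here is a minimal sketch that reads a small, made-up CSV from an in-memory string. The column names and values are invented for illustration and are not taken from the Pulse of the Nation data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (not the survey data).
csv_text = """age,gender,political_party
34,Female,Democrat
58,Male,Republican
41,Female,Independent
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
print(toy.shape)  # (number of rows, number of columns)
```

The first line of the text is treated as the header, so it becomes the column names rather than a row of data.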

You can use the `.head` method for DataFrames to see the top few lines of the dataset. The rows represent individual observations, while the columns represent the variables. Note that there are row names as well as column names. The column names are the variable names, while the row names are by default just incrementing up from 0. 

In [None]:
df.head()

<font color ='red'>**Question 1: Bring in the Codebook file as a DataFrame object called `codebook`. This should have two columns: the name of the variable in the dataset as well as the description of that variable. Look at the first few lines.**</font>

## Data from the Internet

We can also bring in data from a URL if the data is made available in this way. The Pulse of the Nation data was available on their [website](https://thepulseofthenation.com/#bottom). We can copy the link from the dataset we want (this one is from the July 2018 survey) and use that to bring in the data directly. 

In [None]:
data_url = 'https://thepulseofthenation.com/downloads/201807-CAH_PulseOfTheNation_Raw.csv'
df_from_url = pd.read_csv(data_url)

In [None]:
df_from_url.head()

Note that this data looks a little different from the file we read in earlier, because that CSV was cleaned a bit before being made available to you. The main difference is in the column names. The file that Cards Against Humanity provides uses the full question text as each column name, while the cleaned CSV uses shorter variable names that are easier to read.

## Exploring the Data Frame

There's a lot that you can do with Data Frames. The documentation for Data Frames (https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) describes all of the methods and attributes associated with them. We'll go over a few here that are helpful for exploring and looking at the data. 

First, let's check the number of rows and columns. We can do this by looking at the `shape` attribute of a data frame. 

In [None]:
df.shape

Let's also find out the names of all the variables in this data set. 

In [None]:
df.columns

The `codebook` DataFrame has information about which question each of these variables corresponds to.

To get more information about the contents of the Data Frame, we can also use the `.info()` method. This will give us the number of non-null values and the data type of each column. Check the types carefully: most columns in this dataset hold text responses rather than numbers.

In [None]:
df.info()

Notice that the count of non-null values is 800 for most columns, but smaller for others. A column with a smaller count has missing values in some of its rows. We'll discuss how to deal with missing values later, but this is a good way to identify which variables have missing values and how many there are. 
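As a rough sketch of how to read those counts, the example below builds a small, made-up DataFrame with a few missing entries and counts them per column. The data and column names here are hypothetical; `isna().sum()` is one common way to get the number of missing values directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries (not the survey data).
toy = pd.DataFrame({
    "age": [34, 58, 41, 27],
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "gender": ["Female", "Male", None, "Female"],
})

# info() prints the non-null count and data type of each column.
toy.info()

# isna() marks missing cells as True; summing counts them per column.
missing = toy.isna().sum()
print(missing)
```

The non-null count reported by `info()` plus the missing count for a column adds up to the number of rows.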

We have already used the `head` method, but we can also use the `tail` method to look at the last few rows of the data frame. 

In [None]:
df.head() # Default is to show first 5 rows.

In [None]:
df.head(10) # We can specify how many rows we want to see.

In [None]:
df.tail(10) # Same as head, except the last 10 instead of first 10

## Accessing the Data Frame
What if we want to only look at certain cells, or certain columns? We can use a variety of commands to do just that.

### Accessing Columns

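To access individual columns, we can use square brackets or dot notation. Dot notation only works when the column name is a valid Python identifier (no spaces, for example) and does not clash with an existing DataFrame attribute or method.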

In [None]:
df["gender"] 

In [None]:
df.gender 
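The sketch below, on a made-up DataFrame, shows that the two notations return the same column, and that a column name containing a space can only be reached with square brackets. The column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; "political party" deliberately contains a space.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "political party": ["Democrat", "Republican", "Independent"],
})

# Both notations return the same Series.
same = toy["gender"].equals(toy.gender)
print(same)  # True

# A name with a space is not a valid attribute, so use brackets.
party = toy["political party"]
print(party.tolist())
```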

### Accessing Rows 

What if we want to get certain rows? We can use bracket notation to get the rows we want.

In [None]:
df[10:15] 

We can also use `loc` with square brackets. We use a colon to indicate that we want a range of row labels with a start and an end. Unlike ordinary Python slicing, `loc` includes the end label in the result. We can also leave one side of the colon empty to indicate that we want the rest of the values on that end.

In [None]:
df.loc[10:20] 

In [None]:
df.loc[:10]

In [None]:
df.loc[:] # This gives all rows

In addition, we can use `loc` to access certain columns as well as certain indices in the Data Frame.

In [None]:
df.loc[10:20,"race"] 

To get a range of columns, we can use the same colon notation. 

In [None]:
df.loc[10:20,'gender':'race']
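One detail worth checking for yourself is how the endpoints are handled. In this small sketch on made-up data, plain bracket slicing works by position and excludes the end, while `loc` works by label and includes it, for both rows and columns.

```python
import pandas as pd

# Hypothetical toy data with the default row labels 0..19.
toy = pd.DataFrame({
    "age": range(20, 40),
    "gender": ["F", "M"] * 10,
    "race": ["A", "B", "C", "D"] * 5,
})

by_position = toy[10:15]   # positions 10 to 14, end excluded
by_label = toy.loc[10:15]  # labels 10 to 15, end included
print(len(by_position), len(by_label))  # 5 6

# Column slices with loc also include the end column.
cols = toy.loc[10:12, "age":"gender"]
print(cols)
```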

An alternative to `loc` is `iloc`. This takes rows from specific positions in the Data Frame rather than by row label. Most of the time, row labels are numbered sequentially starting from 0, so `loc` and `iloc` act similarly. However, sometimes, especially when creating subsets of the data, you might end up with row labels that no longer increase by one from row to row. In those cases, `iloc` might be more useful.

Another use case for `iloc` is in using negative numbers.

In [None]:
df.iloc[-5:,1:5]

In this case, we were able to use the "`-5:`" to indicate that we want the last 5 rows of the data frame. Note that we can't do the same with `.loc`. This is because `.loc` retrieves the rows from a particular *label* in the Data Frame, while `.iloc` retrieves them from particular *positions*. 
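To make the label versus position distinction concrete, here is a short sketch using a made-up DataFrame. After filtering, the surviving rows keep their original labels, so `loc` and `iloc` no longer agree.

```python
import pandas as pd

# Hypothetical toy data (not the survey data).
toy = pd.DataFrame({"age": [25, 62, 47, 33, 71, 55]})

# Filtering keeps the original row labels: 1, 2, 4, 5.
older = toy[toy.age > 40]
print(older.index.tolist())

by_label = older.loc[2]      # the row whose label is 2 (age 47)
by_position = older.iloc[2]  # the third row of the subset (age 71)
last_two = older.iloc[-2:]   # negative positions count from the end

print(by_label.age, by_position.age)
print(last_two)
```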

## Summaries of Data



When we look at an individual column, we get a **Series** object rather than a DataFrame object. 

In [None]:
type(df.gender)

This can be useful for working with individual columns, because we can then use Series methods or use arithmetic.

In [None]:
df.age.max()

In [None]:
df.age.median()
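As a small illustration of Series methods and arithmetic, the sketch below uses a made-up Series of ages. Arithmetic is applied element by element, so the result is another Series of the same length.

```python
import pandas as pd

# Hypothetical ages (not the survey data).
ages = pd.Series([25, 62, 47, 33, 71], name="age")

# Summary methods return a single value.
print(ages.min(), ages.max(), ages.median())

# Arithmetic works element by element and returns a new Series.
years_above_min = ages - ages.min()
decade = ages // 10 * 10  # e.g. 47 becomes 40
print(years_above_min.tolist())
print(decade.tolist())
```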

<font color = 'red'>**Question 2: What is the mean age of people in this dataset? What is the standard deviation of age?**</font>

You can also use the `describe` method with DataFrames to calculate many of the summary statistics for numerical variables at once. In this dataset, we only have one numerical variable, so the output will only have one column.

In [None]:
df.describe()

We can get summaries of numerical variables using methods like `min`, `median`, and `max`, but what about categorical variables? We want to get counts instead. To do this, we can use the `value_counts` method.

In [None]:
df.race.value_counts()

To get the counts as proportions instead, use the `normalize=True` argument. 

In [None]:
df.race.value_counts(normalize=True)
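Here is a minimal sketch of both forms on a made-up categorical Series. The category labels are invented for illustration. Results are sorted from the most common value to the least common, and the proportions add up to 1.

```python
import pandas as pd

# Hypothetical categorical data (not the survey data).
race = pd.Series(
    ["White", "Black", "White", "Latino", "White", "Black"], name="race"
)

counts = race.value_counts()               # raw counts per category
props = race.value_counts(normalize=True)  # counts divided by the total

print(counts)
print(props)
```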

<font color = 'red'>**Question 3: What is the distribution of political party like? Which political party was the most common answer? What proportion of people said they didn't know or refused to answer?**</font>

## Subsets of Data

You can create subsets of the data by using conditional statements within the square brackets. This can help you look at summaries for certain subsets.

In [None]:
older_than_40 = df[df.age > 40]
older_than_40.head()

In [None]:
older_than_40['political_party'].value_counts(normalize = True)

In [None]:
df[df.age <= 40]['political_party'].value_counts(normalize = True)
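Conditions can also be combined. The sketch below, on made-up data, uses `&` for "and", `|` for "or", and `isin` to match any value from a list. Each individual condition needs its own parentheses, because `&` and `|` bind more tightly than comparisons like `>`.

```python
import pandas as pd

# Hypothetical toy data (not the survey data).
toy = pd.DataFrame({
    "age": [25, 62, 47, 33, 71],
    "political_party": [
        "Democrat", "Republican", "Democrat", "Independent", "Democrat",
    ],
})

# Both conditions must hold.
older_democrats = toy[(toy.age > 40) & (toy.political_party == "Democrat")]

# At least one condition must hold.
young_or_independent = toy[
    (toy.age <= 40) | (toy.political_party == "Independent")
]

# Match any of several values in one column.
selected = toy[toy.political_party.isin(["Democrat", "Independent"])]

print(older_democrats)
print(young_or_independent)
print(selected)
```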

<font color = 'red'>**Question 4: What is the average age of people who have a college degree or a graduate degree? What is the distribution of political party like for those people?**</font>

## Comparison between Variables

To do comparison between two categorical variables, we can use the `crosstab` function from `pandas`. This allows us to look at the breakdown of the number of people within the combinations of categories of multiple variables.

In [None]:
pd.crosstab(df.race, df.gender)

To find the row proportion, we can use `normalize = 'index'`. Using `normalize='columns'` gives you the column proportions.

In [None]:
pd.crosstab(df.race, df.gender, normalize = 'index')

In [None]:
pd.crosstab(df.race, df.gender, normalize = 'columns')
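A quick way to keep the two options straight is to check which direction sums to 1. This sketch uses a small made-up DataFrame with invented category labels: with `normalize='index'` each row sums to 1, and with `normalize='columns'` each column does.

```python
import pandas as pd

# Hypothetical toy data (not the survey data).
toy = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "White", "Latino"],
    "gender": ["Female", "Male", "Female", "Female", "Male", "Male"],
})

counts = pd.crosstab(toy.race, toy.gender)
row_props = pd.crosstab(toy.race, toy.gender, normalize="index")
col_props = pd.crosstab(toy.race, toy.gender, normalize="columns")

print(counts)
print(row_props.sum(axis=1))  # one total per row, each equal to 1
print(col_props.sum(axis=0))  # one total per column, each equal to 1
```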

<font color = 'red'>**Question 5: Find the crosstab of race and political party. Does there seem to be an association between race and political party?**</font>

If we want to look at the association between a numerical variable and a categorical variable, we can use the `pivot_table` method to calculate the value of some statistic of the numerical variable for each category of the categorical variable. By default, the mean is calculated.

In [None]:
df.pivot_table(values = 'age',index ='political_party')

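We can provide other functions as well through the `aggfunc` argument. For example, we can find the standard deviation by passing the string `'std'`. Passing the name as a string is preferable to passing `np.std`, which triggers a warning in recent versions of pandas.

In [None]:
df.pivot_table(values = 'age',index ='political_party', aggfunc = 'std')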
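For intuition about what `pivot_table` computes, the sketch below compares it with a `groupby` on made-up data. Both give one value of the statistic per category. The data here is hypothetical, and `groupby` is shown only as a point of comparison.

```python
import pandas as pd

# Hypothetical toy data (not the survey data).
toy = pd.DataFrame({
    "age": [30, 50, 40, 60, 20],
    "political_party": [
        "Democrat", "Democrat", "Republican", "Republican", "Independent",
    ],
})

# Default statistic is the mean of 'age' within each party.
mean_age = toy.pivot_table(values="age", index="political_party")

# The same numbers from a groupby.
mean_age_groupby = toy.groupby("political_party")["age"].mean()

# A different statistic, named with a string.
max_age = toy.pivot_table(values="age", index="political_party", aggfunc="max")

print(mean_age)
print(mean_age_groupby)
print(max_age)
```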

<font color = 'red'>**Question 6: What is the median age of respondents by their response to whether they supported Trump or not? Does there seem to be an association?**</font>

### Simple Visualizations

We can use the `plot` methods for DataFrames with these summaries to create simple visualizations. We'll discuss how to make nicer, more customizable graphs later, but for now, these should be very helpful in seeing relationships and exploring the data.

For a bar chart, you can use `plot.bar`.

In [None]:
race_by_gender = pd.crosstab(df.race, df.gender, normalize = 'index')
race_by_gender.plot.bar()

<font color = 'red'>**Question 7: Create a bar graph of education and political party. Does it look like there is a relationship based on the graph?**</font>