In [None]:
import pandas as pd
import numpy as np

## Introduction to Pandas 🐼

What is pandas? It's a Python library for manipulating data. It is widely used in the Python community for data analysis, including cleaning data, making quick visualizations, and preparing data for advanced analysis or machine learning algorithms. If you're interested in data science, you'll likely end up using pandas quite a lot.

Fortunately, pandas is well integrated with other important data libraries like numpy (for vector and matrix operations), matplotlib (plotting and visualizations), and scikit-learn (machine learning). Pandas is also well documented (http://pandas.pydata.org/pandas-docs/stable/).

### Data Structures: Series and DataFrames

The main data structures to be aware of in pandas are Series and DataFrames. A Series is a 1-dimensional array of data with a single data type (its dtype). So you can have a Series made up of integers, or a Series made up of strings. You can also create a Series that mixes floats and strings, but pandas will store it with the generic object dtype instead of a numeric one, which gives up the speed and convenience of a numeric dtype.

In [None]:
some_numbers = pd.Series([1, 2, 3])
some_strings = pd.Series(["cat", "dog", "mouse"])
mixed = pd.Series([1, "cat", 3.0])

In [None]:
print(some_numbers)
print(some_strings)
print(mixed)
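
To check what pandas actually did with the mixed Series, you can look at its dtype attribute. This is a small sketch that rebuilds two of the Series from above so it runs on its own. The exact integer dtype name can vary by platform, so no output is shown.

```python
import pandas as pd

some_numbers = pd.Series([1, 2, 3])
mixed = pd.Series([1, "cat", 3.0])

# A homogeneous Series gets a specific dtype (an integer type here)
print(some_numbers.dtype)

# A mixed Series falls back to the generic object dtype
print(mixed.dtype)

# The individual values keep their original Python types
print([type(value).__name__ for value in mixed])
```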

In contrast, dataframes are 2-dimensional arrays of data where the columns can be different data types. A dataframe might look familiar, as it's similar to how you might arrange data in a spreadsheet.

In [None]:
df = pd.DataFrame({"animal":some_strings, "score":some_numbers*3})

In [None]:
df

There are a lot of different ways to construct a dataframe (from a Series, a dictionary, a numpy array, etc.), but I find that most often you are reading data in from a CSV file or a database. Fortunately, pandas has a lot of different methods for reading in data. For the rest of this tutorial, let's use some data available at [Kaggle](https://www.kaggle.com/c/titanic/data).
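
As a rough sketch of the construction routes mentioned above, here is a small table built from a dictionary, from a numpy array, and from CSV text. The column names and values are made up for illustration, and io.StringIO stands in for a file on disk so the example is self-contained.

```python
import io

import numpy as np
import pandas as pd

# From a dictionary: keys become column names
from_dict = pd.DataFrame({"animal": ["cat", "dog"], "score": [3, 6]})

# From a 2-D numpy array: column names are supplied separately
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From CSV text: read_csv accepts a path or any file-like object
csv_text = "animal,score\ncat,3\ndog,6\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict.shape, from_array.shape, from_csv.shape)
```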

In [None]:
titanic = pd.read_csv("train.csv")

In [None]:
titanic.info()

In [None]:
titanic.head()

Let's say I just want to view 1 column of the dataframe. Notice anything about this output?

In [None]:
titanic["Name"]

It's a Series! The columns and the rows in DataFrames are both Series. This means anything you can do to a Series, you can do to a row or column in a dataframe, which turns out to be pretty handy.
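
Here is a minimal sketch of that idea on a made-up three-row dataframe (the names and ages are hypothetical stand-ins, not Titanic data). Both a column and a row come back as a Series, so the usual Series methods work on either.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": [30.0, 4.0, 51.0]})

age_column = toy["Age"]   # one column
first_row = toy.loc[0]    # one row, selected by its index label

print(type(age_column).__name__, type(first_row).__name__)

# Series methods work directly on the column
print(age_column.max())
print(age_column.mean())
```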

In [None]:
# Useful for numeric data
titanic.describe()

In [None]:
# Useful for categorical data
titanic["Sex"].value_counts()

What if I need to see only a specific slice of rows and columns? The most common way to achieve this is the .loc indexer, which selects by row and column labels:

In [None]:
# View only Name and Survived for rows 10-20 (.loc slices by label, so row 20 is included)
titanic.loc[10:20,["Name", "Survived"]]

In [None]:
# Also lets us easily filter; say I want to see the names of passengers under age 18
titanic.loc[titanic["Age"] < 18, ["Name", "Age"]]

In [None]:
# Same as above but only where Survived = 1:
titanic.loc[(titanic["Age"] < 18) & (titanic["Survived"] == 1), :]
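
Two details of .loc are easy to trip over: label slices include both endpoints, and combined conditions need parentheses. The sketch below shows both on a made-up five-row dataframe (hypothetical values, with the default 0 to 4 index), and contrasts .loc with the position-based .iloc.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "Age": [30.0, 4.0, 51.0, 16.0, 22.0],
    "Survived": [1, 1, 0, 0, 1],
})

# .loc slices by label and includes both endpoints: rows 1, 2 and 3
by_label = toy.loc[1:3, ["Name", "Age"]]

# .iloc slices by position like a Python list: rows 1 and 2 only
by_position = toy.iloc[1:3]

# Each condition gets its own parentheses because & binds tighter than < or ==
young_survivors = toy.loc[(toy["Age"] < 18) & (toy["Survived"] == 1), "Name"]

print(len(by_label), len(by_position))
print(young_survivors.tolist())
```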

### Creating new columns
You'll often want to create new variables or features based on the data you have. In this dataset, passengers under 1 year old have their age listed as a fraction of a year. Let's create a new variable that is age in months.

In [None]:
titanic["age_mos"] = titanic["Age"]*12
# to round to whole months (missing ages stay NaN):
#titanic["age_mos"] = (titanic["Age"]*12).round()
titanic.loc[titanic["Age"]< 1, :]

Maybe I want to create a categorical variable for whether a passenger is an adult (age 18 or older). I can do this really simply with a boolean:

In [None]:
titanic["adult"] = titanic["Age"] >= 18

In [None]:
titanic.head()
# if need to convert to int:
# titanic["adult"] = titanic["adult"]*1
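
One thing to keep in mind with a boolean column like this is how missing values behave: a comparison against NaN is always False. The sketch below uses a made-up three-value Series instead of the Titanic data, and the where call is just one possible way to keep the missing entries missing.

```python
import numpy as np
import pandas as pd

ages = pd.Series([30.0, 4.0, np.nan])

# NaN >= 18 evaluates to False, so a missing age looks like "not an adult"
adult = ages >= 18

# Keep the comparison result only where the age is known; NaN elsewhere
adult_or_missing = adult.where(ages.notna())

print(adult.tolist())
print(adult_or_missing.isna().tolist())
```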

One quick aside: many pandas methods return a transformed copy of the data without altering the dataframe you're working on. This is mostly good, since it makes it harder to accidentally mess up your data. But it also makes it easy to **think** you changed some data when you really didn't. The solution is either to assign the result back to your dataframe or dataframe column (see titanic["adult"] = titanic["adult"]*1 above) or, when the method you're using offers one, to pass inplace=True.
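
A minimal sketch of this pitfall, using a made-up dataframe with one missing age and a throwaway column named tmp (both are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [30.0, np.nan, 4.0], "tmp": [1, 2, 3]})

# fillna returns a new Series; the result is discarded and toy is unchanged
toy["Age"].fillna(0)
missing_before = toy["Age"].isna().sum()

# Option 1: assign the result back to the column
toy["Age"] = toy["Age"].fillna(0)
missing_after = toy["Age"].isna().sum()

# Option 2: use the inplace parameter where the method offers one
toy.drop(columns=["tmp"], inplace=True)

print(missing_before, missing_after, list(toy.columns))
```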

In [None]:
# Another example: recode male/female to numbers
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 1
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 0

In [None]:
titanic.head()
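
The same kind of recoding can also be written with Series.map and a dictionary. This is an alternative sketch on a made-up Series, not a replacement for the cell above.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])

# map looks each value up in the dictionary; values without a key become NaN
sex_code = sex.map({"male": 1, "female": 0})

print(sex_code.tolist())
```

Unlike the .loc version, running map a second time on an already recoded column would turn every value into NaN, because 0 and 1 are not keys in the dictionary.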

### Plotting

In [None]:
import matplotlib.pyplot as plt
# if using a notebook:
%matplotlib inline 

Pandas is well integrated with matplotlib, the most commonly used Python plotting library. It's really handy when you want to visualize your data with a quick histogram or make a nice-looking bar chart. Dataframes and Series even have a built-in histogram method!

In [None]:
titanic["Age"].hist() 

In [None]:
# let's compare age distribution among surviving passengers vs. those who didn't
titanic[titanic['Survived'] == 0]['Age'].hist(label="Non-survivors")
titanic[titanic['Survived'] == 1]['Age'].hist(label="Survivors")
plt.legend()
plt.show()

In [None]:
titanic.head()

In [None]:
# Let's make a nice bar chart of sex breakdown of passengers in each class
pd.crosstab(titanic["Sex"], titanic["Pclass"])
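
If crosstab is new to you, here is a small sketch of what it computes, using a made-up five-passenger dataframe (hypothetical values, with Sex left as strings). It counts how often each combination of the two columns occurs.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male"],
    "Pclass": [1, 1, 3, 3, 3],
})

# Rows are the unique Sex values, columns the unique Pclass values
table = pd.crosstab(toy["Sex"], toy["Pclass"])
print(table)

# Selecting rows by label avoids depending on the row order
women = table.loc["female"]
men = table.loc["male"]
print(women.tolist(), men.tolist())
```

Calling .T.plot(kind="bar", stacked=True) on a table like this is another way to get a stacked bar chart straight from pandas.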

In [None]:
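# crosstab rows are sorted by Sex code, so after the recoding above row 0 is women and row 1 is men
women, men = pd.crosstab(titanic["Sex"], titanic["Pclass"]).values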
classes = np.arange(3)
width = 0.5
plt.bar(classes, women, width, color="DarkSeaGreen", label="Women")
plt.bar(classes, men, width, bottom=women, color="SlateBlue", label="Men")
# bars are centered on their x positions by default (matplotlib 2.0+), so the ticks go at classes
plt.xticks(classes, ("First", "Second", "Third"))
plt.legend(loc="upper left")
plt.xlabel("Passenger Class")
plt.ylabel("Count")
plt.title("Titanic Passenger Count by Class and Sex")
plt.show()