# <font face="times"><font size="6pt"><p style = 'text-align: center;'> The City University of New York, Queens College

<font face="times"><font size="6pt"><p style = 'text-align: center;'><b>Introduction to Computational Social Science</b><br/><br/>

<p style = 'text-align: center;'><font face="times"><b>Lesson 4 | Coding with Python III </b><br/><br/>

<p style = 'text-align: center;'><font face="times"><b>11 Checkpoints</b><br/><br/>


Source: https://www.coursera.org/learn/python-programming-introduction

***
***


# Begin Lesson 04

## Installing Modules in Python

Before we do anything more advanced, we need to learn how to install `modules` in Python. For many, this is a bit challenging, as it requires some knowledge of the Linux command line for it to work. Whereas in R, using `install.packages()` is seamless, `Python` unfortunately has a less user-friendly integrated environment. 

To remedy this, we'll focus on `pip`. What is `pip`? `pip` is the standard package manager for `Python`. It allows you to install and manage additional packages that are not part of the Python standard library. For R users, it's similar to `install.packages()`, but requires a bit more knowledge of Linux for it to work. 

`pip3` is the version of the pip installer for `Python 3`, which can download and configure new Python modules with a simple one-line command. Since we're using `Python 3.6`, we'll use `pip3.6`: 

                                        pip3.6 install --user pyserial  

**Note:** Modules are dependent on the version of `Python` you're using. So, if you installed using `pip3.6` and wanted to import a module using `Python 3.5`, you wouldn't be able to find the module, because its directory is specific to `Python 3.6`. 
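
Because installed modules are tied to one specific interpreter, it can help to check which `Python` your notebook is actually running and whether that interpreter can see a given module. Here is a minimal sketch using only the standard library. The helper name `is_installed` is made up for this illustration and is not part of `pip` or `Python`.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the running interpreter can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Which interpreter (and which version) is this session using?
print(sys.executable)
print("Python %d.%d" % sys.version_info[:2])

# `json` ships with Python, so it should always be found.
print(is_installed("json"))
# A made-up name that should not exist on any machine.
print(is_installed("not_a_real_module_xyz"))
```

If a module you just installed is not found, a likely cause is that `pip` installed it for a different interpreter than the one printed above. Running pip as `python -m pip` with that same interpreter is a common way to avoid the mismatch.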

`Pip3` relies on PyPI (the Python Package Index) which is a software repository where versions of community-managed modules are maintained.

When you issue a command like `pip3 install --user pyserial` it checks an online repository and downloads the `pyserial` package (which may contain multiple modules). `Pip3` then puts the modules in a designated place in your computer, and updates a local index to record where they can be found. Then you will be able to use them from python3.

You'll need the flag `--user` to indicate that this module is just for you, otherwise you'll likely get an error saying that you don't have permission. (It's sort of like ordering delivery just to your apartment, rather than the whole building.) 

From Jupyter Notebook, you can use an `!` to access the Linux command line to run `pip3`. This is similar to the magic command character `%`. So, from Jupyter Notebook, we'd run the above command as: 

                                        !pip3.6 install --user pyserial  


The `Python` installer installs `pip`, so it should be ready for you to use, unless you installed an old version of `Python`. You can verify that pip is available by running the following command in your console:

In [None]:
!pip3.6 --version

`PyPI` hosts a very popular library for performing HTTP requests called `requests`. You can learn all about it on its official documentation site.

The first step is to install the requests package into your environment. You can learn about `pip` supported commands by running it with help:

In [None]:
!pip3.6 help

In this lecture, we'll install our first module using `pip3`. 

***
***

## Importing Data with `pandas`

Okay, now that we've covered the basics of manipulating `DataFrames` and `Series`, let's turn to one of the most important features of both `pandas` and `Python`: importing in data. 

Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

First, let's see what your working directory is right now. 

In [None]:
%pwd

However, we're looking for the data folder---where our data file is saved. So, we'll need to navigate there to import the file.

We can see which files and folders are in our directory with the `ls` function.

In [None]:
%ls

We can inspect the top few rows of this CSV file using the `head` command-line tool. 

A CSV file is a comma-separated values file, where the values in each row are separated into columns by a delimiter (the character that divides one value from the next). 

In [None]:
!head Data/microbiome.csv

****
****

# Checkpoint 1 of 11

## Now you try!

### Use tail to see what's at the bottom of this CSV. 
### List out all of the other CSV files in the Data directory. 
### Pick one and explore it using head and tail. 

****
****

This is just standard command-line fare so far. Remember, outside of this class you can run all your code in a terminal or at the command line if you prefer that to IPython. 

Now, let's import `pandas`. And remember, we'll import it as "pd", since that will save us keystrokes every time we want to use a tool from `pandas`.

In [None]:
import pandas as pd

Now, let's read the csv with the `.read_csv` command. With this, our table can be read straight into a DataFrame:

In [None]:
mb = pd.read_csv("Data/microbiome.csv")
mb

This is really long, but we can look at the top few rows with the `.head()` function. 

In [None]:
mb.head() #First 5 rows by default.

By default, this shows the first five rows of the `DataFrame`. If you want to see more rows, you can include a number with the `.head()` function, indicating how many rows to show. For instance:

In [None]:
mb.head(15) #First 15 rows.

We can also use the `.tail()` function in just the same way as the `.head()` function, but show the last few rows of the `DataFrame`.

In [None]:
mb.tail(10) #Last 10 rows

Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can override the default behavior by customizing some of the arguments, like `header`, `names` or `index_col`.

In [None]:
pd.read_csv("Data/microbiome.csv", header=None).head()
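
To see how `header`, `names` and `index_col` work together, here is a small sketch that reads from an in-memory string instead of a file. The values and the column labels are made up for illustration and are not the course data.

```python
import io

import pandas as pd

# A tiny CSV with no header row (made-up values).
text = "Firmicutes,1,632\nFirmicutes,2,136\nProteobacteria,1,1638\n"

# header=None says "the first line is data"; names= supplies the labels.
toy = pd.read_csv(io.StringIO(text), header=None,
                  names=["Taxon", "Patient", "Count"])
print(toy)

# index_col= promotes a column to the row index while reading.
indexed = pd.read_csv(io.StringIO(text), header=None,
                      names=["Taxon", "Patient", "Count"],
                      index_col="Taxon")
print(indexed)
```

Without `names`, `header=None` simply numbers the columns 0, 1, 2, which is what the cell above shows for the real file.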

`read_csv` is just a convenience function for `read_table`, since CSV is such a common format. The `read_table` function works in the same way, but the delimiter (what separates the values in each row) defaults to a tab rather than a comma. Because of this `read_table` tends to be slightly slower than a specialized function like `read_csv`, though you will only notice with large files.

If you specify a separator, that difference disappears. So we'll go ahead and tell `python` that our file is comma-separated using the `sep` argument.

In [None]:
mb = pd.read_table("Data/microbiome.csv", sep=',')

The `sep` argument can be customized as needed to accommodate arbitrary separators.
    
    sep='\t' # If your file is tab separated
    sep=r'\s+' # If your file is separated by "whitespace" in between values. 
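
Here is a short sketch of both separators in action, again on made-up in-memory text rather than a real file. The same two rows are written once with tabs and once with runs of spaces, and both come back as the same `DataFrame`.

```python
import io

import pandas as pd

tab_text = "Taxon\tCount\nFirmicutes\t632\nProteobacteria\t1638\n"
space_text = "Taxon   Count\nFirmicutes     632\nProteobacteria 1638\n"

tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")
# The raw string (r"...") keeps Python from interpreting the backslash itself.
space_df = pd.read_csv(io.StringIO(space_text), sep=r"\s+")

# Both parses should yield identical tables.
print(tab_df.equals(space_df))
```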

***
For a more useful index, we can specify the first two columns, which together provide a unique index to the data.

This is called a *hierarchical* index, but don't worry about that yet. We will revisit this later; just know that it's one of the unique quirks associated with `pandas`.

In [None]:
mb = pd.read_csv("Data/microbiome.csv", index_col=['Taxon','Patient'])
mb.head()

***

Often, our data will be corrupted or bad. If we know which parts of the data are corrupt or bad, we can populate the `skiprows` argument. As you might guess, this just tells `python` to skip those rows when it imports your data. For example: 

In [None]:
pd.read_csv("Data/microbiome.csv", skiprows=[3,4,6]).head()

Conversely, if we only want to import a small number of rows from, say, a very large data file we can use `nrows`:

In [None]:
pd.read_csv("Data/microbiome.csv", nrows=4)

***
***

# Checkpoint 2 of 11

## Now you try!

### Re-read in the CSV `microbiome.csv` just as you did before, but try the following configurations. 
1. Skip the first, eighth, and tenth rows and only display the first ten rows. 
2. Display the last six rows. 
3. Read in the file with a 'tab' delimiter and display the first five rows. Why does this look weird?

***
***

***
Alternately, our data might be really big. So big, that reading it in all at once may be too much for our machine. 

To that end, we can process our data in reasonable chunks. The `chunksize` argument will return an iterable object that can be employed in a data processing loop. 


For example, our microbiome data are small, but they can be read in chunks nonetheless. 


In [None]:
data_chunks = pd.read_csv("Data/microbiome.csv", chunksize=15)
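
The object returned with `chunksize` is not a `DataFrame` but an iterator that hands you one `DataFrame` per chunk. A minimal sketch of the kind of processing loop this enables, using made-up in-memory data as a stand-in for a large file:

```python
import io

import pandas as pd

# Made-up stand-in for a large file: a single column with the values 1..10.
text = "Count\n" + "\n".join(str(n) for n in range(1, 11)) + "\n"

total = 0
n_chunks = 0
# Each `chunk` is an ordinary DataFrame holding at most 4 rows.
for chunk in pd.read_csv(io.StringIO(text), chunksize=4):
    total += chunk["Count"].sum()
    n_chunks += 1

print(n_chunks, total)
```

Only one chunk is held in memory at a time, which is the whole point. Keep in mind that the iterator is used up after one pass, so you need to call `read_csv` again to loop a second time.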

Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`.

In [None]:
!cat Data/microbiome_missing.csv

In [None]:
pd.read_csv("Data/microbiome_missing.csv").head(20)

Above, Pandas recognized `NA` and an empty field as missing data.

In [None]:
pd.isnull(pd.read_csv("Data/microbiome_missing.csv")).head(20)

Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the `na_values` argument:
   

In [None]:
pd.read_csv("Data/microbiome_missing.csv", na_values=['?', -99999]).head(20)

These can be specified on a column-wise basis using an appropriate dict as the argument for `na_values`.
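
As a sketch of the column-wise form, the dict keys are column names and the values list the markers that count as missing in that column only. The data below are made up, and the markers are written as strings so that they match the raw text in the file.

```python
import io

import pandas as pd

text = "Patient,Tissue,Stool\n1,632,?\n2,-99999,305\n3,136,4182\n"

# "-99999" is only treated as missing in Tissue, "?" only in Stool.
toy_missing = pd.read_csv(io.StringIO(text),
                          na_values={"Tissue": ["-99999"], "Stool": ["?"]})
print(toy_missing)
print(toy_missing.isnull().sum())
```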

****
****

# Checkpoint 3 of 11

## Now you try!

### Read in the CSV file with the missing data, just as you did before. 
### Here, replace all values that have either a "?", -99999, or 0 with NA and display the last four rows. 

***
***

***
***

***
***
## Working with Microsoft Excel

Since so much financial and scientific data ends up in Excel spreadsheets (regrettably), `pandas`' ability to directly import Excel spreadsheets is valuable. This support is contingent on having one or two dependencies installed (depending on what version of Excel file is being imported): `xlrd` and `openpyxl` (both may be installed with `pip3`).

First, let's install the package `xlrd` and import it in. If you've already installed it previously, ignore the next cell. 

In [None]:
!pip3.6 install --user xlrd

In [None]:
import xlrd

Importing Excel data to Pandas is a two-step process. First, we create an `ExcelFile` object using the path of the file:                                             

In [None]:
mb_file = pd.ExcelFile('Data/Excel_MID/MID1.xls')
mb_file

Then, since modern spreadsheets consist of one or more "sheets", we parse the sheet with the data of interest:

In [None]:
mb1 = mb_file.parse("Sheet 1", header=None)
mb1.columns = ["Taxon", "Count"]
mb1.head()

There is now a `read_excel` convenience function in Pandas that combines these steps into a single call:

In [None]:
mb2 = pd.read_excel('Data/Excel_MID/MID2.xls', sheet_name='Sheet 1', header=None)
mb2.head()

There are several other data formats that can be imported into Python and converted into DataFrames, with the help of built-in or third-party libraries. These include JSON, XML, HDF5, relational and non-relational databases, and various web APIs. These are beyond the scope of this tutorial, but are covered in [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do), any number of online tutorials, and of course, the online documentation around `python` and `pandas`.

***
***

# Checkpoint 4 of 11

## Now you try!

### Repeat what you just did, but for the files `MID1` and `MID2`, read in the second sheet and display the first three rows. What do you see?

***
***

***
***

## Writing Data to Files

As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. We will bring your attention to just a couple of these.

In [None]:
mb1.to_csv("Data/mb_test.csv")

The `to_csv` method writes a `DataFrame` to a comma-separated values (csv) file. You can specify custom delimiters (via the `sep` argument), how missing values are written (via the `na_rep` argument), whether the index is written (via the `index` argument), and whether the header is included (via the `header` argument), among other options.
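
To see what those options do without touching the disk, you can call `to_csv` with no path, in which case it returns the text it would have written. A small sketch with a made-up table:

```python
import pandas as pd

toy = pd.DataFrame({"Taxon": ["Firmicutes", "Proteobacteria"],
                    "Count": [632.0, None]})

# Semicolon delimiter, "NA" for missing values, and no index column.
csv_text = toy.to_csv(sep=";", na_rep="NA", index=False)
print(csv_text)
```

Leaving out `index=False` would add an extra first column holding the row labels 0 and 1, which is a common surprise when the file is read back in.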

An efficient way of storing data to disk is in binary format. Pandas supports this using Python’s built-in pickle serialization.

In [None]:
mb1.to_pickle("Data/mb_test_pickle")

The complement to `to_pickle` is the `read_pickle` function, which restores the pickle to a `DataFrame` or `Series`:

In [None]:
pd.read_pickle("Data/mb_test_pickle")

As Wes McKinney warns in his book, it is recommended that binary storage of data via pickle only be used as a temporary storage format, in situations where speed is relevant. This is because there is no guarantee that the pickle format will not change with future versions of Python.
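
Here is a self-contained sketch of the round trip, written to a temporary directory so nothing is left behind. The table and the file name `toy_pickle` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Taxon": ["Firmicutes", "Proteobacteria"],
                    "Count": [632, 1638]})

# The temporary directory is deleted when the `with` block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Values, dtypes and the index should all come back unchanged.
print(restored.equals(toy))
```

One more caution: only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.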

***
***

# Checkpoint 5 of 11

## Now you try!

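### Recall the `data_chunks` object you made earlier. It is an iterator over `DataFrame` chunks rather than a single `DataFrame`, so first combine the chunks into one `DataFrame` (for example, with `pd.concat`). Then let's pickle that `DataFrame` for future use. That way we don't need to convert it to a CSV. 

1. Create a `pickle` file called `data_chunks_pickle` and write it to the `Data` folder. 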
2. Now, read that file back in. Does it match the original `DataFrame`?

***
***

***
***

## Bring data into `Python` with `pandas` from the Internet.
Let's go ahead and load some data.
There are a series of `read_*` methods (e.g., `read_excel`) in `pandas`. Usually, you can start reading data by giving the `read_*` method an appropriate file path. For example, if there is a `csv` file named `data.csv` in the same directory as your notebook, you can read it like
```Python
pd.read_csv('./data.csv')
```

For now, we'll play with the [wine quality dataset][wine_quality] from the UCI machine learning database. You can either download it to your machine from [here][wine_quality], or use the URL (http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv) directly in the `read_csv()` method to have `pandas` load it from the source.

[wine_quality]: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

In [None]:
data_src = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
wine_data = pd.read_csv(data_src)
wine_data.head()

The first line above saves the address to the data in a variable we name `data_src`. We can also write the link directly in the `read_csv` call, i.e.,
```python
pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv')
```
but saving the address in a separate string is generally good practice (and will make your life easier if you decide to change that address some day). The second line loads the data into a variable we named `wine_data`.  The third line tells `pandas` to print the first couple lines (`head`) of `wine_data`.

Notice from the output above that the data seems to be chunked into a single column. This is probably because we used `read_csv`, which expects the data to be separated by commas (,), hence the name **c**omma **s**eparated **v**alues, but the data (if you look carefully) is actually separated by semicolons (;).
Instead of bloating the library with a `read_semi_colon_separated_value` method, the `read_csv` method in `pandas` lets you specify what the separating character is, if not the expected comma. The syntax for that looks like
```python
pd.read_csv(path_to_data, sep=';')
```
(`read_csv` actually has a whole lot more options. Read the docs with <kbd>Shift</kbd>+<kbd>Tab</kbd> - a nice feature of Jupyter)

In [None]:
wine_data = pd.read_csv(data_src, sep=';')
wine_data.head()

That looks much better. Now that we've loaded a proper `pandas` [`DataFrame`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame) we can use methods provided with `pandas` to take a closer look at things. We've already seen `head()`, but as we mentioned above, sometimes it's useful to see the last bit of your data too. 

In [None]:
wine_data.tail()

***
***

## Exploring Your Data in Pandas (Continued)

Let's do some descriptive statistics. We will read a dataset based on Trilling (2013). It contains some sociodemographic variables as well as the number of days the respondent uses a specific medium to access information about news and current affairs.

In [None]:
df = pd.read_csv('Data/mediause.csv')

In [None]:
df.keys()

In [None]:
df

Get a quick summary with the method `describe()`, and while we're at it, let's get some basic distribution information for some of the variables using the `value_counts` method.

In [None]:
df.describe()

In [None]:
df['gender'].value_counts()

In [None]:
df['education'].value_counts(sort=False)

***
***

# Checkpoint 6 of 11

## Now you try!

### Pick two other variables (i.e., columns) and explore them using the method `.value_counts`.

Of course, you can integrate all this in normal Python structures like for-loops. That is, we can just loop through each variable in the dataset and ask for the counts:

***
***

In [None]:
for medium in ['radio','newspaper','tv','internet']:
    print(medium.upper()) # The iterator, in upper-case letters... in case this still seems a bit unfamiliar
    print(df[medium].value_counts(sort=True, normalize=True))
    print('-------------------------------------------\n')
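
The `normalize=True` argument used in the loop switches `value_counts` from raw counts to proportions. A quick sketch on a made-up `Series` of answers shows the difference:

```python
import pandas as pd

# Made-up answers: number of days per week a medium is used.
answers = pd.Series([7, 7, 0, 3, 7, 0])

counts = answers.value_counts()                # raw counts, largest first
shares = answers.value_counts(normalize=True)  # the same, as proportions
print(counts)
print(shares)
```

The proportions always add up to 1, which makes columns with different numbers of valid answers easier to compare.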

Also, you can 'chain' different methods together. This is really helpful when you want to do something a little more involved, like comparing the characteristics of your data according to the gender of respondents.

In [None]:
df.groupby('gender').describe()
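
Chaining works because each method returns an object that the next method can act on. Here is a minimal sketch with a made-up table (the column names mirror the media-use data, but the numbers are invented):

```python
import pandas as pd

toy = pd.DataFrame({"gender": [0, 0, 1, 1],
                    "internet": [2, 4, 6, 7],
                    "tv": [5, 5, 3, 1]})

# groupby -> pick columns -> mean -> round, all in one chain.
means = toy.groupby("gender")[["internet", "tv"]].mean().round(1)
print(means)
```

Reading the chain from left to right gives you the recipe: split by gender, keep two columns, average within each group, then round.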

You can see already that some additional variation is being captured with this move. But we can also do the same kind of chaining to create a series of plots. We'll come back to that in a bit, since we haven't imported `matplotlib` to do the heavy lifting yet.

***
***

## Statistical Tests and Subsetting Your Data

Let's stop with the descriptive stuff and do some statistical tests. We'll run a t-test, since you are probably familiar with these already. The result contains the test statistic and the p-value. 

A note on syntax: notice that we filter the dataframe using the arcane-looking `df[df['gender']==1]['internet']` command. Recall that `df[conditions]` is how we subset data in `pandas`. So here, we are finding the subset of df where the gender column equals "1". The `['internet']` at the end just tells `python` to pull just the data from the internet variable. 
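
If the subsetting syntax still looks arcane, it may help to break it into steps. A small sketch on a made-up table:

```python
import pandas as pd

toy = pd.DataFrame({"gender": [1, 0, 1, 0],
                    "internet": [7, 3, 5, 6]})

mask = toy["gender"] == 1   # a Series of True/False, one entry per row
print(mask.tolist())

subset = toy[mask]          # keep only the rows where the mask is True
print(subset["internet"].tolist())

# .loc does the row filter and the column selection in a single step.
print(toy.loc[toy["gender"] == 1, "internet"].tolist())
```

The last line is an equivalent spelling that many people find easier to read, since the rows and the column are named side by side.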

In [None]:
from scipy.stats import ttest_ind 

In [None]:
males_internet = df[df['gender']==1]['internet']
females_internet = df[df['gender']==0]['internet']

In [None]:
males_internet.describe()

In [None]:
females_internet.describe()

In [None]:
ttest_ind(males_internet,females_internet)

We see that males use the internet significantly more often than females. Let's extract the t-statistic and the p-value from this test. Each part can be pulled out by indexing.

In [None]:
ttest_ind(males_internet,females_internet)[0]

In [None]:
ttest_ind(males_internet,females_internet)[1]
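
To demystify the first number, here is a sketch that computes a two-sample t-statistic by hand on two made-up samples. It assumes the pooled-variance (equal variances) form of the test, which is what `ttest_ind` uses with its default settings.

```python
import math

import pandas as pd

# Two small made-up samples.
a = pd.Series([1, 2, 3, 4, 5])
b = pd.Series([2, 4, 6, 8, 10])

n_a, n_b = len(a), len(b)
# Series.var() is the sample variance (ddof=1) by default.
pooled_var = ((n_a - 1) * a.var() + (n_b - 1) * b.var()) / (n_a + n_b - 2)
std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_stat = (a.mean() - b.mean()) / std_err
dof = n_a + n_b - 2  # degrees of freedom for the pooled test
print(round(t_stat, 4), dof)
```

The sign only tells you which group has the larger mean. The p-value then comes from comparing this statistic against a t-distribution with `dof` degrees of freedom.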

We won't belabor the point here, but if you need to run a statistical test, `Python` is likely able to do it. For all but the most cutting-edge (as in, someone invented it in the past 1-2 years) techniques, the function is probably built into `pandas` or one of its accompanying statistical packages (such as `scipy`, which we used above). The syntax may change from test to test, though, so read the documentation to be sure you are getting out what you expect.

***
***

## Merging Multiple DataFrames Together

At this point you're able to import your data, explore it a little bit, and even run some basic statistical tests to see if some of the differences you see in the data are significant.

This lesson has all been about trying to show you how to do things that you're going to encounter on a regular basis. There are still a couple of major tasks left in front of us now. The one we're going to deal with first is how to merge datasets.

Say we want to know the average usage figures in our dataset, but across different types of devices. The data we have---across a few different `.csv` files---give us that information. We need to get the user's device code from one dataset (user_device.csv) and add it as a column to another dataset (user_usage.csv). Even then, we might be interested in the manufacturer of all these different phones---maybe it isn't the model that affects usage, but the brand. So we also need to add the device's manufacturer (from android_devices.csv) as a column on the result.

In [None]:
user_usage = pd.read_csv("Data/user_usage.csv")
user_device = pd.read_csv("Data/user_device.csv")
devices = pd.read_csv("Data/android_devices.csv")

***

##### Rename Columns
Rename columns using the `.rename()` method and using a `Dictionary`. Also, you can use the `inplace` parameter so you don't need to reassign it. 

In [None]:
devices.rename(columns={"Retail Branding": "manufacturer"}, inplace=True)

We can check for ourselves what's happening. 

In [None]:
user_usage.head()

In [None]:
user_device.head()

In [None]:
devices.head(10)

Now, let's merge the two `DataFrames` together. We do this using the `merge` function in `pandas`. All you need to do is tell `python` which two `DataFrames` you want to merge, and what variable to use to link the two sets of data together (using the "on" argument):

In [None]:
result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id')

In [None]:
result.head()

While at first glance this may seem perfect, all is not as it seems. First, let's check the dimensions of `user_usage` with the `.shape` attribute. 

In [None]:
user_usage.shape

In [None]:
user_device[['use_id', 'platform', 'device']].shape

In [None]:
user_usage['use_id'].isin(user_device['use_id']).value_counts()

In [None]:
result.shape

Notice that we have a whole bunch of use_id's that appear in user_usage.csv but not in user_device.csv. As a result these have been excluded entirely from our result dataset. We have to do something about this if we want to keep our missing data (which is usually a smart thing to do).
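
The same effect is easier to see on a toy example. The two small tables below are made up, but they mimic the situation above: some `use_id` values appear in only one table.

```python
import pandas as pd

usage = pd.DataFrame({"use_id": [1, 2, 3, 4],
                      "monthly_mb": [500, 1200, 300, 800]})
device = pd.DataFrame({"use_id": [2, 4, 5],
                       "device": ["A", "B", "C"]})

# The default is how='inner': only keys present in BOTH tables survive.
inner = pd.merge(usage, device, on="use_id")
print(inner)

# Which usage rows found no partner in the device table?
unmatched = usage[~usage["use_id"].isin(device["use_id"])]
print(unmatched["use_id"].tolist())
```

Four usage rows went in, but only two came out, because `use_id` 1 and 3 have no match on the device side.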

***
***

# Checkpoint 7 of 11

## Now you try!

### Try your own merge using these data, but on a different column than `use_id` to merge them on. 

### Repeat the same steps. 
1. Check the first few rows. 
2. Check the shape. 
3. Check the value counts on the merging column. 

### What do you notice? How does it differ?

***
***

***

##### Left Merge

A left merge, or left join, between two dataframes keeps all of the rows and values from the left dataframe, in this case "user_usage". Rows from the right dataframe will be kept in the result only where there is a match in the merge variable in the left dataframe, and NaN values will be in the result where not. We do this using the "how" argument in our `merge` method.


In [None]:
result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='left')

Now let's check the dimensions again.

In [None]:
user_usage.shape

In [None]:
result.shape

That looks a lot better! We've added two columns, and kept all our rows. That means there is a lot of missing data, though. So let's see how much:

In [None]:
result['device'].isnull().sum()

In [None]:
result.head()

In [None]:
result.tail()

***

##### Right Merge

Another way to do this would have been to use what we call a "right merge". A right merge, or right join, between two dataframes keeps all of the rows and values from the right dataframe, in this case "user_device". Rows from the left dataframe will be kept where there is a match in the merge variable, and NaN values will be in the result where not. Let's give it a try.

In [None]:
result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='right')

In [None]:
user_device.shape

In [None]:
result.shape

In [None]:
result['monthly_mb'].isnull().sum()

In [None]:
result['platform'].isnull().sum()

***

##### One more alternative: Outer Merge

A full outer join, or outer merge, keeps all rows from the left and right dataframes in the result. Rows will be aligned where there are shared join values between the left and right, and rows with NaN values, in either the left-originating or right-originating columns, will be left in the result where there is no shared join value.

In the final result, a subset of rows should have no missing values. These rows are the rows where there was a match between the merge column in the left and right dataframes. These rows are the same values as found by our inner merge result before.


In [None]:
pd.concat([user_usage['use_id'], user_device['use_id']]).unique().shape[0]

In [None]:
result = pd.merge(user_usage,
                user_device[['use_id', 'platform', 'device']],
                on='use_id', 
                how='outer', 
                indicator=True)

In [None]:
result.shape

In [None]:
(result.apply(lambda x: x.isnull().sum(), axis=1) == 0).sum()

This time around we've kept more rows than in the other two merges. Those additional rows have missing data, which makes it less useful. But especially in large datasets, we can get around this problem using some kind of imputation (filling in missing values using smart guesses). So there are often good reasons for keeping as much data as you can. 
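
To compare all of the merge types side by side, here is a sketch on two small made-up tables. It also shows what the `indicator=True` argument adds to the result.

```python
import pandas as pd

usage = pd.DataFrame({"use_id": [1, 2, 3, 4],
                      "monthly_mb": [500, 1200, 300, 800]})
device = pd.DataFrame({"use_id": [2, 4, 5],
                       "device": ["A", "B", "C"]})

# Same two tables, four different values of `how`.
shapes = {how: pd.merge(usage, device, on="use_id", how=how).shape
          for how in ["inner", "left", "right", "outer"]}
print(shapes)

# indicator=True adds a `_merge` column recording where each row came from.
outer = pd.merge(usage, device, on="use_id", how="outer", indicator=True)
print(outer["_merge"].value_counts())
```

The `_merge` column takes the values `both`, `left_only` and `right_only`, so you can see at a glance which rows were matched and which side the unmatched ones came from.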

***
***

# Checkpoint 8 of 11

## Now you try!

### Repeat the same merge you did for the previous checkpoint, but now do:

1. left merge
2. right merge
3. outer merge

### Use the `.shape` attribute to see the difference in sizes of the resulting `DataFrames`. 

***
***

***
##### One Final Merge - Adding Device Manufacturer

Remember, we wanted to have the manufacturer in our dataset, on the chance that it's the brand and not the model that matters for usage. So, first, let's add the platform and device to the user usage.

In [None]:
result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id',
                 how='left')

Now, based on the "device" column in result, match the "Model" column in devices.

In [None]:
devices.rename(columns={"Retail Branding": "manufacturer"}, inplace=True)

In [None]:
result = pd.merge(result, 
                  devices[['manufacturer', 'Model']],
                  left_on='device',
                  right_on='Model',
                  how='left')

In [None]:
result.head()
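
When the key column has a different name in each table, `left_on` and `right_on` do the job of `on`. A small sketch with made-up tables and made-up manufacturer names:

```python
import pandas as pd

phones = pd.DataFrame({"use_id": [1, 2, 3],
                       "device": ["X1", "Y2", "Z9"]})
catalog = pd.DataFrame({"Model": ["X1", "Y2"],
                        "manufacturer": ["Acme", "Globex"]})

# left_on / right_on name the key column in each table separately.
labeled = pd.merge(phones, catalog,
                   left_on="device", right_on="Model", how="left")
print(labeled)

# Both key columns are kept, so drop the redundant one if you like.
labeled = labeled.drop(columns="Model")
print(labeled.columns.tolist())
```

Device `Z9` is not in the catalog, so its manufacturer comes back as NaN, just like the unmatched rows in the left merge earlier.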

Good! We should be able to do some interesting things with this. We can, of course, search through this data in various ways, too:

In [None]:
devices[devices.Model == 'SM-G930F']

In [None]:
devices[devices.Device.str.startswith('GT')]

***
***

## Reshaping DataFrames from Wide to Long

In the context of a single DataFrame, we are often interested in re-arranging the layout of our data. 

In [None]:
mb = pd.read_csv("Data/microbiome.csv")
mb.head()

This dataset includes repeated measurements of the same individuals (longitudinal data). It's possible to present such information in (at least) two ways: showing each repeated measurement in its own row, or in multiple columns representing multiple measurements.


The `stack` method rotates the data frame so that columns are represented in rows:

In [None]:
stacked = mb.stack()
stacked.head()

To complement this, `unstack` pivots from rows back to columns.

In [None]:
stacked.unstack().head()

To convert our "wide" format back to long, we can use the `melt` function. It takes two important parameters: 
`id_vars` and `value_vars`. 

`id_vars` names the columns that identify each row and are kept as they are, while `value_vars` names the columns whose values are unpivoted into rows. For example:

In [None]:
long_format = pd.melt(mb, 
        id_vars=['Patient','Taxon'], #Row Values
        value_vars=['Tissue','Stool']) #Column Values

Notice that the label `Tissue` is repeated in the `variable` column for each row, while the `value` column holds what was in the `Tissue` column before the `melt()` call. 

In [None]:
long_format.head()

Also notice that the same happens for the second `value_vars` entry, 'Stool'.

In [None]:
long_format.tail()

In short, we took two columns (`Stool` and `Tissue`) and "flipped" them such that they are now rows. We took two columns in a `DataFrame` that were once in a **wide** format and made them **long**. 
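
The same move on a tiny made-up table makes the result easy to check by eye. The sketch also goes back from long to wide with `set_index` and `unstack`, which we met above.

```python
import pandas as pd

wide = pd.DataFrame({"Patient": [1, 2],
                     "Tissue": [632, 136],
                     "Stool": [305, 4182]})

# Two value columns times two patients gives four rows in long format.
long_df = pd.melt(wide, id_vars=["Patient"], value_vars=["Tissue", "Stool"])
print(long_df)

# Back to wide: one row per Patient, one column per `variable` label.
back = long_df.set_index(["Patient", "variable"])["value"].unstack()
print(back)
```

By default `melt` names the two new columns `variable` and `value`, which is why those names appear in the output.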

This illustrates the two formats for longitudinal data: **long** and **wide** formats. The preferable format for analysis depends entirely on what is planned for the data, so it is important to be able to move easily between them.

***
***

# Checkpoint 9 of 11

## Now you try!

### Reshaping data from wide to long format (and vice-versa) is a critical skill to master. 

### Take the merged `DataFrame` called `result` from the previous example on merges and change it from "wide" to "long" format. 

### Hint: You can also set the categorical variables as the `DataFrame's` index and then play around with the `.stack()` method, like so:

`result.set_index(['platform','device']).stack()`


****
****

***
***

## Pivoting

If you are used to working in Excel or something similar, you might be wondering why we didn't just use a pivot table to get the job done. Well, we can do that too! The `pivot` method allows a DataFrame to be transformed easily between long and wide formats in the same way as a pivot table is created in a spreadsheet. It takes three arguments: `index`, `columns` and `values`, corresponding to the DataFrame index (the row headers), columns and cell values, respectively.

In [None]:
mb.pivot(index='Patient', columns='Taxon', values='Stool').head()

If we omit the `values` argument, we get a `DataFrame` with hierarchical columns, just as when we applied `unstack` to the hierarchically-indexed table:

In [None]:
mb.pivot(index='Patient', columns='Taxon')

A related method, `pivot_table`, creates a spreadsheet-like table with a hierarchical index, and allows the values of the table to be populated using an arbitrary aggregation function.

For a simple cross-tabulation of group frequencies, the `crosstab` function (not a method) aggregates counts of data according to factors in rows and columns. The factors may be hierarchical if desired.
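
Neither of these is demonstrated above, so here is a minimal sketch of both on a made-up table. The column names are borrowed from the Titanic data used later in this lesson, but the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({"sex": ["f", "f", "m", "m", "m"],
                    "pclass": [1, 2, 1, 1, 2],
                    "fare": [80.0, 20.0, 60.0, 100.0, 10.0]})

# pivot_table aggregates repeated combinations; here, the mean fare per cell.
mean_fare = pd.pivot_table(toy, index="sex", columns="pclass",
                           values="fare", aggfunc="mean")
print(mean_fare)

# crosstab simply counts how often each combination occurs.
counts = pd.crosstab(toy["sex"], toy["pclass"])
print(counts)
```

Note that plain `pivot` would fail on this table, because the combination of `m` and class 1 occurs twice. `pivot_table` handles that by aggregating.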

***
***

## Dealing with Duplicates

We can easily identify and remove duplicate values from `DataFrame` objects. For example, say we want to remove people from our dataset that have the same stool characteristics (maybe we suspect them of being clerical errors). We can identify those rows with the `duplicated` method, and can delete those duplicates from the dataset with the `drop_duplicates` method.

In [None]:
mb.duplicated(subset = 'Stool')

In [None]:
mb.drop_duplicates(['Stool'])

Notice that the number of rows in our dataset decreased by a dozen. The duplicates have been discarded.
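
A small sketch on a made-up table shows exactly which rows get flagged, and how the `keep` argument changes which copy survives:

```python
import pandas as pd

toy = pd.DataFrame({"Patient": [1, 2, 3, 4],
                    "Stool": [305, 305, 4182, 305]})

# True marks a row whose Stool value already appeared in an earlier row.
flags = toy.duplicated(subset="Stool")
print(flags.tolist())

# drop_duplicates returns a NEW DataFrame; `toy` itself is unchanged.
deduped = toy.drop_duplicates(subset="Stool")
print(len(toy), len(deduped))

# keep='last' retains the final occurrence instead of the first.
last_kept = toy.drop_duplicates(subset="Stool", keep="last")
print(last_kept["Patient"].tolist())
```

Because the original is left untouched, remember to assign the result to a name if you want to keep working with the de-duplicated data.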

***
***

## Plotting in Pandas

You might also be wondering how to go about making informative plots to go along with these results (or with any other type of analysis you want to use). Pandas includes methods for DataFrame and Series objects that are relatively high-level, and that make reasonable assumptions about how the plot should look.

In [None]:
# This way we can see the output in this notebook
%matplotlib inline

import matplotlib # More on this in a bit
import numpy as np # Just so we can create some random data
normals = pd.Series(np.random.normal(size=10))
normals.plot()

Notice that by default a line plot is drawn, and a light grid is included. All of this can be changed, however:

In [None]:
normals.cumsum().plot(grid=False)

Similarly, for a DataFrame:

In [None]:
variables = pd.DataFrame({'normal': np.random.normal(size=100), 
                       'gamma': np.random.gamma(1, size=100), 
                       'poisson': np.random.poisson(size=100)})
variables.cumsum(0).plot()

As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for `plot`:

In [None]:
variables.cumsum(0).plot(subplots=True)

Or, we may want to have some series displayed on the secondary y-axis, which can allow for greater detail and less empty space (as a general rule, though, try not to use more than one labeling system on a single axis---it can be misleading):

In [None]:
variables.cumsum(0).plot(secondary_y='normal')

Let's talk more about the different kinds of plots you can create, and what you might use them for. There are countless books out there on visualization that can give you more detail. Tufte's book "The Visual Display of Quantitative Information" is a good (and classic) place to start, though.

***
***

## Bar Plots

Bar plots are useful for displaying and comparing measurable quantities, such as counts or volumes. In Pandas, we just use the `plot` method with a `kind='bar'` argument (or `kind='barh'` for horizontal bars).

For this series of examples, let's load up the Titanic dataset:

In [None]:
titanic = pd.read_excel("Data/titanic.xls", "titanic")
titanic.head()

In [None]:
titanic.groupby(['sex','pclass']).survived.sum().plot(kind='barh')

You can also "stack" the bars, in case that helps convey additional information in a given case:

In [None]:
death_counts = pd.crosstab([titanic.pclass, titanic.sex], titanic.survived.astype(bool))
death_counts.plot(kind='bar', stacked=True, color=['black','gold'], grid=False)

Another way of comparing the groups is to look at the survival *rate*, by adjusting for the number of people in each group.

In [None]:
death_counts.div(death_counts.sum(1).astype(float), axis=0).plot(kind='barh', stacked=True, color=['black','gold'])
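
The row-wise division above can be checked on a tiny example. This is a sketch on invented data (five made-up passengers), and it also shows the `normalize='index'` option of `crosstab`, which should give the same rates in one step.

```python
import pandas as pd

# Five invented passengers: three in class 1, two in class 2
toy = pd.DataFrame({
    'pclass':   [1, 1, 1, 2, 2],
    'survived': [1, 1, 0, 0, 1],
})

# Raw counts per class (rows) and survival status (columns)
counts = pd.crosstab(toy['pclass'], toy['survived'])

# Divide each row by its row total to turn counts into rates
rates = counts.div(counts.sum(axis=1), axis=0)

# Same result in one call, letting crosstab normalize each row
rates_alt = pd.crosstab(toy['pclass'], toy['survived'], normalize='index')

print(rates)
```

Each row of `rates` sums to 1, so groups of very different sizes can be compared on the same scale.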

***
***

## Histograms

Frequently it is useful to look at the *distribution* of data before you analyze it. Histograms are a sort of bar graph that display relative frequencies of data values; hence, the y-axis is always some measure of frequency. This can either be raw counts of values or scaled proportions.

For example, we might want to see how the fares were distributed aboard the Titanic:

In [None]:
titanic.fare.hist(grid=False)

The `hist` method puts the continuous fare values into **bins**, trying to make a sensible decision about how many bins to use (or equivalently, how wide the bins are). We can override the default value (10):

In [None]:
titanic.fare.hist(bins=30)

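There are several rules of thumb for choosing an "optimal" number of bins, each of which depends on the number of observations in the data series (and, for some rules, on the shape of the data). Three of them are Sturges' rule, the square-root rule, and Doane's formula: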

In [None]:
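sturges = lambda n: int(np.log2(n) + 1)
square_root = lambda n: int(np.sqrt(n))
from scipy.stats import skew

def doanes(data):
    # Doane's formula extends Sturges' rule with a correction based on sample skewness
    n = len(data)
    sigma_g1 = np.sqrt(6. * (n - 2) / ((n + 1) * (n + 3)))
    return int(1 + np.log2(n) + np.log2(1 + abs(skew(data)) / sigma_g1))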

n = len(titanic)
sturges(n), square_root(n), doanes(titanic.fare.dropna())

In [None]:
titanic.fare.hist(bins=doanes(titanic.fare.dropna()))
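
NumPy also ships its own versions of several binning rules, which can serve as a cross-check. The sketch below uses `np.histogram_bin_edges` (available in NumPy 1.15 and later) on a randomly generated, right-skewed sample rather than the Titanic fares. NumPy rounds up where the helpers above round down, so the counts may differ by one.

```python
import numpy as np

# Invented right-skewed sample standing in for the fare data
rng = np.random.RandomState(0)
sample = rng.gamma(2.0, size=500)

# Ask NumPy for the bin edges under three named rules
edges = {rule: np.histogram_bin_edges(sample, bins=rule)
         for rule in ('sturges', 'sqrt', 'doane')}

# k bins are delimited by k + 1 edges
n_bins = {rule: len(e) - 1 for rule, e in edges.items()}

print(n_bins)
```

The same rule names can be passed straight to Matplotlib, for example `plt.hist(sample, bins='doane')`.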

A **density plot** is similar to a histogram in that it describes the distribution of the underlying data, but rather than being a pure empirical representation, it is an *estimate* of the underlying "true" distribution. As a result, it is smoothed into a continuous line plot. We create them in Pandas using the `plot` method with `kind='kde'`, where `kde` stands for **kernel density estimate**.

In [None]:
titanic.fare.dropna().plot(kind='kde', xlim=(0,600))

Often, histograms and density plots are shown together:

In [None]:
titanic.fare.hist(bins=doanes(titanic.fare.dropna()), density=True, color='lightseagreen')
titanic.fare.dropna().plot(kind='kde', xlim=(0,600), style='r--')

Here, we had to normalize the histogram (`density=True`; older Matplotlib versions called this argument `normed`), since the kernel density is normalized by definition (it is a probability distribution).
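
What "normalized" means can be verified numerically: the bar heights are rescaled so that the total area of the bars is 1. The sketch below uses `np.histogram` on randomly generated values that are assumed for illustration, not the Titanic fares.

```python
import numpy as np

# Invented positive, skewed values standing in for fares
rng = np.random.RandomState(1)
values = rng.exponential(scale=30.0, size=1000)

# density=True rescales the bar heights so the bars enclose unit area
heights, edges = np.histogram(values, bins=20, density=True)
area = np.sum(heights * np.diff(edges))

# Without it, the heights are raw counts that add up to the sample size
counts, _ = np.histogram(values, bins=20)

print(area, counts.sum())
```

Because the density estimate also encloses unit area, the two curves share a y-axis scale only once the histogram is normalized.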

We will explore kernel density estimates more in the next section.

***
***

## Boxplots

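A different way of visualizing the distribution of data is the boxplot, which is a display of common quantiles: the box spans the quartiles, with a line at the median, and the whiskers mark the extent of the non-outlying data (by default in Matplotlib, the most extreme points within 1.5 times the interquartile range of the box; other conventions use fixed quantiles such as the lower and upper 5 percent values).

You can think of the box plot as viewing the distribution from above. The individual points drawn beyond the whiskers are "outlier" points that occur outside that range.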

In [None]:
titanic.boxplot(column='fare', by='pclass', grid=False)
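
The numbers behind a single box can be computed by hand. This is a minimal sketch assuming Matplotlib's default whisker rule (1.5 times the interquartile range); the fare values are invented for illustration.

```python
import numpy as np

# Nine invented fares, sorted for readability
fares = np.array([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 80.0, 500.0])

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(fares, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = fares[(fares < lower_fence) | (fares > upper_fence)]

print(q1, median, q3, outliers)
```

Here the two largest fares fall above the upper fence, so a boxplot of these values would draw them as separate points rather than stretching the whisker to reach them.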

***
***

# Checkpoint 10 of 11

## Now you try!

### Use any of the previous `DataFrames` and select one or two variables to focus on. 

1. Create a simple boxplot using one categorical variable and one continuous variable. 
2. Create a simple histogram of one continuous variable. 

***
***


## Plotting and Visualization

There are a handful of third-party Python packages that are suitable for creating scientific plots and visualizations. These include packages like:

* matplotlib
* Seaborn
* Chaco
* PyX
* Bokeh

Here, we'll briefly explore `Matplotlib` and `Seaborn`.

***
***

## Matplotlib

The easiest way to interact with matplotlib is via `pylab` in IPython. By starting IPython (or the IPython notebook) in "pylab mode", both matplotlib and numpy are pre-loaded into the IPython session:

    ipython notebook --pylab
    
You can specify a custom graphical backend (*e.g.* qt, gtk, osx), but IPython generally does a good job of auto-selecting. Now matplotlib is ready to go, and you can access the matplotlib API via `plt`. If you do not start IPython in pylab mode (recent Jupyter releases discourage it in favor of the `%matplotlib` magic), you can do this manually with the following convention:

    import matplotlib.pyplot as plt

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

In [None]:
help(plt.hist)

To display plots inside the notebook, use the IPython "magic" command `%matplotlib inline` (magic commands are prefixed with `%`), as you see below: 

In [None]:
%matplotlib inline

In [None]:
plt.plot(np.random.normal(size=100), np.random.normal(size=100), 'ro')

The above plot simply shows two sets of random numbers taken from a normal distribution plotted against one another. The `'ro'` argument is a shorthand argument telling matplotlib that I wanted the points represented as red circles.

This plot was expedient. We can exercise a little more control by breaking the plotting into a workflow:

In [None]:
with mpl.rc_context(rc={'font.family': 'serif', 'font.weight': 'bold', 'font.size': 8}):
    fig = plt.figure(figsize=(6,3))
    # Left panel: scatterplot of two independent normal samples
    ax1 = fig.add_subplot(121)
    ax1.set_xlabel('some random numbers')
    ax1.set_ylabel('more random numbers')
    ax1.set_title("Random scatterplot")
    ax1.plot(np.random.normal(size=100), np.random.normal(size=100), 'r.')
    # Right panel: histogram of a single normal sample
    ax2 = fig.add_subplot(122)
    ax2.hist(np.random.normal(size=100), bins=15)
    ax2.set_xlabel('sample')
    ax2.set_ylabel('frequency')
    ax2.set_title("Normal distribution")
    plt.tight_layout()
    plt.savefig("My_Saved_File.png", dpi=150)

matplotlib is a low-level plotting package compared with others. It makes very few assumptions about what constitutes good layout (by design), but has a lot of flexibility to allow the user to completely customize the look of the output.

If you want to make your plots look pretty like mine, steal the *matplotlibrc* file from [Huy Nguyen](http://www.huyng.com/posts/sane-color-scheme-for-matplotlib/).

***
***

## Boxplots with `Matplotlib`

One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.

In [None]:
bp = titanic.boxplot(column='age', by='pclass', grid=False)
for i in [1,2,3]:
    y = titanic.age[titanic.pclass==i].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)

When data are dense, a couple of tricks used above help the visualization:

1. reducing the alpha level to make the points partially transparent
2. adding random "jitter" along the x-axis to avoid overstriking

***
***

## Scatterplots with `Matplotlib`

To look at scatterplots, let's reload the baseball sample dataset.

In [None]:
baseball = pd.read_csv("Data/baseball.csv")
baseball.head()

Scatterplots are useful for data exploration, where we seek to uncover relationships among variables. Recent versions of Pandas provide a `DataFrame.plot.scatter` method, but here we use the matplotlib function `scatter` directly.

In [None]:
plt.scatter(baseball.ab, baseball.h)
plt.xlim(0, 700)
plt.ylim(0, 200)

We can add additional information to scatterplots by assigning variables to either the size of the symbols or their colors.

In [None]:
plt.scatter(baseball.ab, baseball.h, s=baseball.hr*10, alpha=0.5)
plt.xlim(0, 700)
plt.ylim(0, 200)

In [None]:
plt.scatter(baseball.ab, baseball.h, c=baseball.hr, s=40, cmap='hot')
plt.xlim(0, 700)
plt.ylim(0, 200)

To view scatterplots of a large number of variables simultaneously, we can use the `scatter_matrix` function in `pd.plotting` (older Pandas versions exposed it as `pd.scatter_matrix`). It generates a matrix of pair-wise scatterplots, optionally with histograms or kernel density estimates on the diagonal.

In [None]:
_ = pd.plotting.scatter_matrix(baseball.loc[:,'r':'sb'], figsize=(12,8), diagonal='kde')

***
***

## Heatmaps in `Seaborn`

The `seaborn` library offers a lot of cool stuff, but first we'll have to install it. We'll be making heatmaps. 

First, install `seaborn`. 

In [None]:
!pip install --user seaborn

In [None]:
import numpy as np
import seaborn as sns
import statsmodels as sm
from scipy.stats import kendalltau, mannwhitneyu, pearsonr

In [None]:
corrmatrix = df[['internet','tv','radio','newspaper']].corr()

In [None]:
corrmatrix

In [None]:
sns.heatmap(corrmatrix)

In [None]:
sns.set(style="white")
mask = np.zeros_like(corrmatrix, dtype=bool)  # `np.bool` was removed from recent NumPy; the built-in `bool` works everywhere
mask[np.triu_indices_from(mask)] = True
cmap = sns.light_palette("red",as_cmap=True)
sns.heatmap(corrmatrix,mask=mask,cmap=cmap,vmin=0,vmax=.2)
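
The mask hides the upper triangle and the diagonal, because a correlation matrix is symmetric and those cells repeat information. Here is a small sketch of what the mask contains, using a randomly generated three-column table that is assumed for illustration in place of `df`.

```python
import numpy as np
import pandas as pd

# Invented data: three unrelated random columns
rng = np.random.RandomState(2)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])
corr = toy.corr()

# Start with an all-False array of the same shape as the matrix
mask = np.zeros_like(corr.values, dtype=bool)

# Mark the diagonal and everything above it as "hide this cell"
mask[np.triu_indices_from(mask)] = True

# Blank out the masked cells to see what a masked heatmap would still show
visible = corr.mask(mask)

print(mask)
print(visible)
```

Only the three cells below the diagonal remain visible, one for each distinct pair of variables.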

In [None]:
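# Note: recent seaborn releases no longer accept `stat_func`. On those versions,
# drop that argument and compute the statistic directly, e.g. kendalltau(df['age'], df['meanmedia']).
sns.jointplot(x=df['age'], y=df['meanmedia'],
              kind="hex", stat_func=kendalltau, color="#4CB391")

In [None]:
# Same plot, annotated with Pearson's r instead of Kendall's tau (see the note above)
sns.jointplot(x=df['age'], y=df['meanmedia'],
              kind="hex", stat_func=pearsonr, color="#4CB391")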

***
***

# Checkpoint 11 of 11

## Now you try! 

### Use a previous `DataFrame` and replicate the steps above to create your own heatmap. (You should include continuous variables in order for the correlations to work.)

***
***

# Installing `ggplot` for Python

Making plots is very repetitive: draw this line, add these colored points, then add these, etc. Instead of re-using the same code over and over, `ggplot` (a Python port of the very famous R package `ggplot2`) implements these steps using a high-level but very expressive API. 

The result is less time spent creating your charts, and more time interpreting what they mean.

`ggplot` in `Python` is not a good fit for people trying to make highly customized data visualizations. While you can make some very intricate, great-looking plots, `ggplot` sacrifices heavy customization in favor of generally doing "what you’d expect."

First, install `ggplot`.
**Note:** This will take a few minutes. . . (be patient!)

In [None]:
!pip3.6 install --user ggplot

## Data
`ggplot` has a symbiotic relationship with `pandas`.

If you’re planning on using `ggplot`, it’s best to keep your data in `DataFrames`. Think of a
`DataFrame` in this context as a tabular data object. 

For example, let’s look at the `meat` dataset which ships with `ggplot`.

In [None]:
from ggplot import *

In [None]:
meat.head()

## Aesthetics

Aesthetics describe how your data will relate to your plots. Some common aesthetics are: `x`, `y`, and `color`. 

Aesthetics are specific to the type of plot (or layer) you’re adding to your visual. 

For example, a scatterplot (geom_point) and a line (geom_line) will share x and y, but only a line chart has a linetype aesthetic.

For more information about which geoms have which aesthetics, see the documentation at http://ggplot.yhathq.com/docs/index.html.

***

## Layers
`ggplot` lets you combine or add different types of visualization components (or layers) together. 

I think this is easiest to understand with an example.

Start with a blank canvas.

In [None]:
p = ggplot(aes(x='date', y='beef'), data=meat)

Let's plot it! 

In [None]:
p

Add some points.

In [None]:
p + geom_point()

Add a line.

In [None]:
p + geom_point() + geom_line()

Add a trendline.


In [None]:
p + geom_point() + geom_line() + stat_smooth(color='blue')

As you can see, you can quite literally add components of your visualization together. For more info on available components, see http://ggplot.yhathq.com/docs/index.html.

Finally, let’s make the aesthetics look a bit nicer.


In [None]:
p + geom_point() + geom_line() + stat_smooth(color='blue') + theme_bw()

# Checkpoint 12 of 12 
## Now you try!

### Take any DataFrame we’ve used previously and create a plot using ggplot. What you plot doesn’t actually matter. Have fun with it! 

***
***