# Python for the Office
_No downtime - just results_


Here is a glimpse at the content within _Python for the Office_! This preview allows you to contribute directly to the development of the tools featured in both the ebook and the web resources. Your feedback will help shape the final product. We hope to share this adventure with you! <a href="mailto:office@pythonfortheoffice.com?Subject=Python%20Feedback">Email us</a> at office@pythonfortheoffice.com!


Here is what we are going to tackle:

- Part 1: **Python Installation**. Here we'll look at how installation can be completed in a few minutes - not an entire day. We will quickly walk you through running Python within a Jupyter Notebook.


- Part 2: **Beginner's Guide to Plotting**. Instantly start creating plots from an Excel file! This tutorial will immediately teach you how to load a simple dataset and then use the powerful `pandas` module to quickly create plots. This section is an example of how the material in the full ebook will be presented.

>In Part 2 we assume that you have no prior knowledge of Python and no experience programming. If you've already been plotting with `pandas` then feel free to skim this section and move on to the advanced plotting tutorial.

- Part 3: **Make Awesome Plots!** This section will dive into more complex datasets and show how to produce great looking visualizations. We'll look into variations in financial data and end it with an interactive plot.

>In Part 3, descriptions will be shorter than in Part 2. This part is meant to provide you with some examples of what you will eventually be able to accomplish. The full ebook will expand on the topics presented in this section and a whole lot more!

## Part 1: Installing Python

We are going to install Python by using the Anaconda download. Anaconda bundles Python, a set of popular Python frameworks, and a working environment in a single, simple download. You won't be typing things into a command prompt.

1. Visit [Anaconda's Download Page](https://www.anaconda.com/distribution/#windows), pick your operating system, and download the latest Python 3 version.

2. Read each step of the installation process, but the recommended settings should work best! You should only need to click "**Next**".

3. Once finished, search for *Anaconda* and find a new application named **Anaconda Navigator**.

4. Launching the *Jupyter Notebook* will allow you to create a new folder and then a new Python3 notebook.

If you are having any trouble, then have a look at our detailed, step-by-step guide which includes images for each step. [Beyond Python: How To Install Python Notebooks](https://www.beyondpython.com/article/how+to+install+python+notebooks)
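Once a notebook is open, a quick sanity check can confirm that it is running Python 3. This is a minimal sketch that uses only Python's standard library; the variable name `major_version` is just an illustration.

```python
import sys

# sys.version_info describes the Python interpreter running this cell
major_version = sys.version_info.major
print("Running Python", major_version)
```

If the number printed is 3, the installation is ready for the examples in this guide.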

### Follow along in the Jupyter Notebook
You might find it easier to follow along reading this guide within a Jupyter Notebook. This allows you to run each cell already typed out for you. If you prefer to type through the code yourself then you always have the PDF version. [Interactive Jupyter Notebook](https://www.bit.ly/OfficeNotebook)


## Part 2: Beginner's Guide to Plotting

Coding is supposed to be hard, right? Maybe you've tried taking an online course or read another book, but just never got anywhere with it. While some of those other resources are great, they aren't practical for the office professional who doesn't have time to dive into the nuances of programming. Most resources for beginners start with topics like "what are strings", or even worse "what are tuples". This guide will get into those topics when you need them. Starting there just isn't practical.

Let's start with something that can improve your workflow today. This is how you plot a chart:

```python
%matplotlib inline
import pandas as pd

datasheet = pd.read_excel('http://bit.ly/PyRandom-xls')
datasheet.plot()
```

You've just created your first plot in __4 lines of code!__ We'll break the code down so you'll understand how it works, but hopefully this will help you see our approach to teaching you Python. Start working with it now and figure out the details as you go.

Remember the steps in Excel: Open the data, select some columns, insert a chart, find where that chart went, never sort your data, and then fight when Excel has set the axes wrong. Let's break down a better way.

### Your First Chart

So, it's kinda a small plot, and missing a title and axis labels. We'll fix that shortly, but for now hopefully you are amazed at how easy this was. Before we make this plot beautiful, a bit of explanation is required.

```python
%matplotlib inline
```

This first statement is a bit complicated to get into. That "`%`" sign at the start of the line makes this a magic command. Magic commands help configure your Jupyter Notebook. No need to worry about all the details; this line simply tells the notebook to display the plot inline, below the box of code. Without this line the plot may show up in another window (or not at all, depending on your configuration). A good rule of thumb is that if you are plotting, then just add this line as the first line of code in your notebook. We'll dive into more magic commands in the book. They are magic because they tend to make your work so much easier!

```python
import pandas as pd
```

Here we add the pandas module for use within our notebook. Pandas is a popular framework for doing almost any kind of work with data. Think of pandas as your replacement for Excel. The `as pd` part of the line simply adds an alias. Now instead of using the module's full name, we can use the shortened `pd`. Coders use aliases to simplify their code. Many of the common modules used in Python have well-known aliases. To follow the convention used by other coders, we recommend that you always import pandas as `pd`. Just be aware that these two code snippets will function the same:

```python
import pandas
datasheet = pandas.read_excel('http://bit.ly/PyRandom-xls')
```

```python
import pandas as pd
datasheet = pd.read_excel('http://bit.ly/PyRandom-xls')
```

```python
datasheet.plot()
```

This last line is where the interesting stuff happens. We've defined the data in the Excel spreadsheet as "`datasheet`" in the code. By reading the data using the pandas module (`pd.read_excel()`), it was automatically stored as a pandas dataframe. The easiest way to picture this is to think of the pandas dataframe as Python's version of an Excel sheet. The dataframe is an extremely powerful data structure that simplifies working with almost any type of data. One of the main benefits of using the dataframe is that it has a lot of built-in functions, such as plotting. Our full ebook will be diving into other functions of the dataframe and other plotting modules, but in a majority of cases the built-in plotting of the dataframe is good enough for initial analysis.

So what actually happened when that line of code ran? The variable `datasheet` is a dataframe. `plot()` is a function built into the pandas dataframe. By typing `datasheet.plot()` we've said to run the `plot()` function on the `datasheet` dataframe. In this case you may have noticed that a legend was automatically added to the plot. This is because the pandas dataframe was built to simplify your work. Headers were found in the file and pandas automatically populates the legend with those headers.
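To see the link between column headers and the legend without downloading anything, here is a minimal sketch that builds a tiny dataframe by hand. The column names and numbers are made up for illustration and are not the contents of the spreadsheet used above.

```python
import pandas as pd

# A small, made-up table standing in for the spreadsheet
demo = pd.DataFrame({
    "Series A": [0.2, 0.5, 0.9, 1.4],
    "Series B": [1.1, 0.8, 0.6, 0.3],
})

# plot() returns the matplotlib axes that the lines were drawn on
ax = demo.plot()

# Read back the legend entries that pandas generated from the headers
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

The legend entries should match the column headers, which is why descriptive headers in your data pay off when plotting.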

### Labels, Titles, and Formatting 

Making the plot above is relatively easy, but it certainly isn't the prettiest plot. Adding some formatting is critical for a good visualization. Luckily, this is straightforward in Python. As you practice plotting, you'll tend to have many of the basic formatting commands memorized. As your own projects mature, you might even want to build your own plotting module to automate a report.

Below we'll focus on just quickly producing an acceptable plot. The full ebook will cover an example that will walk you through building your own plotting module.

Here are some updates to those 4 lines of code for that first plot:

```python
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

datasheet = pd.read_excel('http://bit.ly/PyRandom-xls')
datasheet.plot(figsize=(14, 10), fontsize=14, grid=True)
plt.title('Random Data for Plotting Example', fontsize=18)
plt.ylabel('Units (arb.)', fontsize=14)
plt.xlabel('Index (arb.)', fontsize=14)
plt.legend(fontsize=14);
```

Now that's a lot better! The chart is larger, axes are labeled, and the text is readable.

First change to note is the additional import statement:

```python
import matplotlib.pyplot as plt
```

This time we need to import a plotting module. Note that the first example also used `matplotlib.pyplot`, but since pandas handled all the formatting, there was no need to import the `pyplot` module directly. We'll come back to this below.

Next is the call to the `plot()` function. There are some additional instructions that weren't there before.

```python
datasheet.plot(figsize=(14, 10), fontsize=14, grid=True)
```
Those added variables inside of the parentheses are called arguments. `figsize`, `fontsize`, and `grid` are arguments that we've decided to manually configure this time. In the first example we just let pandas use the default values. Most arguments are relatively self-explanatory. Here we've set the figure size (note that it is `(width, height)` in units of inches), set the font size of the x and y tick labels, and enabled the grid lines. Functions such as `plot()` have many arguments that can be set to improve your plots. How do you know which arguments are available? An easy way is to just check the reference documentation. Type `pandas plot` into Google and the documentation for the pandas plot function should be among the first results.

Don't worry about all the options for now. The important thing to know is that functions tend to have arguments and being able to look them up is an important skill for working with various modules.
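The idea of default values can be sketched with a small function of our own. `describe_plot` below is a hypothetical helper written only for this illustration, and its default values are made up; it is not part of pandas.

```python
def describe_plot(figsize=(6.4, 4.8), fontsize=10, grid=False):
    """Return a short description of the settings a plot would use."""
    width, height = figsize
    grid_text = "on" if grid else "off"
    return f"{width} x {height} inches, font size {fontsize}, grid {grid_text}"

# Called with no arguments, every setting falls back to its default
default_settings = describe_plot()

# Naming an argument overrides just that default
custom_settings = describe_plot(figsize=(14, 10), fontsize=14, grid=True)

print(default_settings)
print(custom_settings)
```

Calling a function with no arguments and then overriding them one at a time is a handy way to learn what each argument does.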

```python
plt.title('Random Data for Plotting Example', fontsize=18)
plt.ylabel('Units (arb.)', fontsize=14)
plt.xlabel('Index (arb.)', fontsize=14)
plt.legend(fontsize=14);
```

These four lines are the other formatting options that were added to the plot for readability. This time we are using the `plt` alias to run functions from the `matplotlib.pyplot` module. These functions are self-explanatory. We're adding a title, a y-axis label, and an x-axis label. As mentioned with the `plot()` function above, there are also a couple of arguments for each of these. The arguments are the text that should be displayed and the font size to use. Lastly, the font size was also updated for the legend to improve readability. Note that `plt.legend()` could be used to create the legend or change the legend labels, but the dataframe `plot()` function took care of that for us so we only used the `fontsize` argument here.

There might be something bothering you about these last four lines. If so, then great! You're paying attention. When we called the `plot()` function, it was used on a variable that we created. `datasheet.plot()` works, but `pd.plot()` would not have worked. So why does `plt.xlabel()` work? A variable was never defined for the `plt` module. This works because the creators of matplotlib.pyplot were trying to mimic the command-style behavior that people are familiar with in MATLAB. In the background, matplotlib.pyplot keeps track of the current figure. After calling `datasheet.plot()`, a reference to that figure was created and any call using `plt` will operate on the current figure. Defining the current figure explicitly is possible, but isn't necessary in this simple example. We'll take a look at a case where it is necessary in the next example.
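The "current figure" bookkeeping can be checked directly. The sketch below uses a small made-up dataframe rather than the spreadsheet, and relies on `plt.gca()`, the pyplot function that returns the axes pyplot currently considers active.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data, only for illustration
demo = pd.DataFrame({"Series A": [0.2, 0.5, 0.9, 1.4]})
ax = demo.plot()

# pyplot's "current axes" is the one pandas just drew on
is_current = plt.gca() is ax

# So a plt call and a call on the axes object change the same plot
plt.xlabel("Index (arb.)")
xlabel_seen_by_axes = ax.get_xlabel()

print(is_current, xlabel_seen_by_axes)
```

The label set through `plt` shows up on the axes that pandas returned, because both refer to the same plot.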

One last note is that `;` at the end of the last line. The semicolon here isn't necessary; it just cleans up the output. Feel free to try running this code yourself with the semicolon deleted. You'll see an output similar to this: `<matplotlib.legend.Legend at 0x7fcdf36feb38>`. The code at the end will be different for you. This is the return value of the `plt.legend()` function. In the Jupyter Notebook, the value of the last line in a cell is printed to the output if it isn't assigned to a variable. The semicolon is just an easy way to suppress this output.

### Subplots

So far we haven't really done anything that isn't relatively easy to do in Excel. Hopefully you have found plotting with Python to be easier than you imagined and may even see this as being easier than working with Excel (we at Python for the Office certainly do!).

Let's look at something that would be much more challenging to quickly do in Excel, yet Python can pull off with ease. Sometimes plotting all of your datasets on one plot can be messy and creating separate plots is the best option. Python allows you to quickly try different plotting styles with minimal effort. Below we'll take a look at creating subplots.

```python
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

datasheet = pd.read_excel('http://bit.ly/PyRandom-xls')
axes = datasheet.plot(subplots=True, figsize=(12, 8), fontsize=14, grid=True)
plt.suptitle('Random Data for Plotting Example', fontsize=18)
plt.xlabel('Index (arb.)', fontsize=14)

for ax in axes:
    ax.set_ylabel('Units (arb.)', fontsize=14)
    ax.set_ylim(bottom=0, top=2)
    ax.legend(fontsize=14)
```

Isn't that awesome! If you don't think that is awesome then please go and try that in Excel and rethink your reaction. Of course, this is a really simple random data set, but this should give you a glimpse into how powerful plotting can be with minimal effort in Python.

There are a few new lines of code here that should be explained.

```python
axes = datasheet.plot(subplots=True, figsize=(12, 8), fontsize=14, grid=True)
```
We did something different with the plot function this time. First, we added an argument for `subplots`. That is all that was needed to create the three individual plots above. The rest of this example is just formatting. Second, we assigned the output of the `datasheet.plot()` function to a variable called `axes`. This is because we are now working with multiple plots. Above we mentioned that matplotlib.pyplot keeps track of the current figure in the background. In this case it has managed to keep track of the current figure, but that figure now has three plots. If we are trying to format the plots, we need to start keeping track of them ourselves.

Let's take a look at what the `axes` variable is by printing it to an output:

```python
axes
```

This is our first look at an array. `axes` is simply a 3-element array where each element is a reference to one of the individual plots in our figure.
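Here is a minimal sketch for inspecting that container, using a made-up three-column dataframe in place of the spreadsheet. With current versions of pandas the container type printed should be a NumPy `ndarray`, though that detail is an observation about the library rather than something this guide depends on.

```python
import pandas as pd

# Three made-up columns, so pandas creates three subplots
demo = pd.DataFrame({
    "Series A": [0.2, 0.5, 0.9],
    "Series B": [1.1, 0.8, 0.6],
    "Series C": [0.4, 0.4, 0.7],
})
axes = demo.plot(subplots=True)

container_type = type(axes).__name__   # what kind of container we got back
number_of_plots = len(axes)            # one entry per column

print(container_type, number_of_plots)
```

Each entry, such as `axes[0]`, is a single plot that can be formatted on its own.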

```python
plt.suptitle('Random Data for Plotting Example', fontsize=18)
```

The title command had a slight modification. The function is now called `suptitle` instead of just `title`. This is because we are now working with subplots. `title` can still be used, but it sets the title of an individual subplot rather than the whole figure. Here we simply want to label the figure with a title, which can be done with `suptitle` (this is short for super title).

The last new addition to our example is the `for` loop. Loops are a fundamental element of all programming languages and are truly one of the most useful controls in programming.
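The difference between the two kinds of title can be sketched on an empty figure. This example creates the figure with `plt.subplots()` and calls `suptitle` on the figure object instead of on `plt`; that is a choice made for this sketch, not the approach used in the example above.

```python
import matplotlib.pyplot as plt

# Two empty plots stacked in one figure
fig, axes = plt.subplots(2, 1)

# suptitle labels the whole figure
figure_title = fig.suptitle("Title for the whole figure")

# set_title labels only the plot it is called on
axes[0].set_title("Title for the top plot only")

print(figure_title.get_text())
print(axes[0].get_title())
print(repr(axes[1].get_title()))  # the bottom plot has no title of its own
```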

```python
for ax in axes:
    pass  # do something with ax here
```

`for` loops in Python try to mimic language. In this example, `for ax in axes` simply states that for each element in `axes` we want to do something. Each iteration of the loop will access the next item in `axes` and assign it to the variable `ax`. Recall that `axes` is simply an array of references to each plot in the figure. So this `for` loop will be used to do something to each plot.
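The same pattern works on any collection, not just plots. Here is a minimal sketch with a plain list of made-up names, so it can be run without any plotting at all:

```python
plot_names = ["top plot", "middle plot", "bottom plot"]

messages = []
for name in plot_names:
    # On each pass, name holds the next item from plot_names
    messages.append("Formatting the " + name)

print(messages)
```

The indented lines run once per item, which is exactly what happens to each plot in the subplot example.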

```python
ax.set_ylabel('Units (arb.)', fontsize=14)
```
The first action is to add the y-axis labels. Note that we were able to use the same call for the x-axis label as we used in the last example because the x axis in this example is shared between all the plots. That means we can use the reference to `plt` to set the x-axis label. For each iteration of the loop, the y-axis label will be set. Also note that the function is no longer just `ylabel()`, but has changed to `set_ylabel()`. This is because the `ax` object is different from the `plt` object and has different functions. We won't dive into this here, but just be aware that when working with axes the functions are different from the figure functions. This topic can be confusing for beginners. While this is out of the scope of this guide, this is a topic that will be covered in the full ebook.

```python
ax.set_ylim(bottom=0, top=2)
```

Lastly, there is a function call added to change the y axis limits. The default plotting style is to have the y limits automatically set by the data range, but having different scales for each dataset will make visual comparisons challenging. Therefore, by setting each plot to have the same y axis limits, we have made it much easier to see the differences in the datasets. 
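The effect of fixing the limits can be checked by reading them back before and after the loop. This sketch uses two made-up columns with deliberately different ranges:

```python
import pandas as pd

# Made-up columns with very different value ranges
demo = pd.DataFrame({
    "Small values": [0.1, 0.2, 0.3],
    "Large values": [1.2, 1.5, 1.9],
})
axes = demo.plot(subplots=True)

# By default each subplot picks limits that fit its own data
limits_before = [ax.get_ylim() for ax in axes]

for ax in axes:
    ax.set_ylim(bottom=0, top=2)

# Now every subplot uses the same scale
limits_after = [ax.get_ylim() for ax in axes]

print(limits_before)
print(limits_after)
```

Before the loop the two plots have different limits; afterwards both run from 0 to 2, so the heights of the lines can be compared directly.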

## Part 3: Make Awesome Plots!

So far we've shown you that Python is probably easier than you imagined. Learning Python should no longer seem like a time-consuming challenge that provides a benefit sometime in the future. The short segments of code above can be used to help you today (or at least the next time you need to make a plot)!

These simple plots aren't really the focus of Python for the Office. We want to teach you how to automate your workflow and generate visualizations that will better serve you, your team, or your clients. Part 3 of this introductory material will look at some of the advanced plots that are easy to create and understand using Python.

In this section the explanations will not be as detailed. While Part 2 was an example of the material and style of the full ebook, Part 3 is meant to present a few of the more advanced plots that you'll be able to create with a bit more learning. We still try to make the code below understandable and build off of the material presented in Part 2. While these plots certainly look more complicated than the ones above, they really aren't that much harder to create!

Part 3 will give you an idea of what is possible. The full ebook will teach you how to make these plots and many more!

### Distributions and Correlations

The first set of plots that we'll take a look at are distribution plots. A common analysis for a univariate dataset is to plot a histogram or density plot. A histogram cuts the data into multiple bins and displays a count of the number of values in each bin. Density plots are similar to histograms, but instead of bins they show a continuous distribution of the data.
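To make the idea of binning concrete, here is a minimal sketch that counts values into bins by hand using only Python's standard library. The values and the bin width are made up for illustration; the plotting modules used below do this work for you.

```python
from collections import Counter

values = [3, 7, 12, 15, 18, 21, 22, 25, 29, 34]
bin_width = 10

# Integer division maps each value to the lower edge of its bin
bin_counts = Counter((value // bin_width) * bin_width for value in values)

for lower_edge in sorted(bin_counts):
    print(f"{lower_edge}-{lower_edge + bin_width}: {bin_counts[lower_edge]}")
```

A histogram simply draws one bar per bin, with the bar height set by these counts.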

While this is a common plotting technique, trying this in Excel can be troublesome. In Excel you'll likely need the data analysis add-in and then you'll need to define your own bins. Python has modules, such as `seaborn`, that provide an incredibly easy way to generate a histogram with a density estimate in one plot.

The dataset being used for this topic is an engineering dataset and represents a set of experiments run to help control a process. This guide is meant to present the simplicity of plotting in Python, therefore there is no need to worry too much about what the dataset is. Just be aware that making plots with other datasets is just as easy! The full ebook will cover multiple types of data (financial, economic indicators, engineering measurements, and even some map data) and we will provide plenty of pointers on how to wrangle these other data sources. Most of the time the flow is similar to what is presented in this guide. Figure out how to get the data into a pandas dataframe and then use modules like seaborn to generate visualizations. Simple as that!

For the dataset used below, there are two independent variables:
1. Substrate
2. Ag Thickness (nm)

Every other column is a type of measurement (aka dependent variables). A common analysis in engineering (and basically every other field) is trying to identify the relationship between the independent variables and dependent variables.

First, let's import a few modules that we want to use and load the data:

```python
%matplotlib inline

# matplotlib is the main plotting library and
#   is needed to help with formatting the figures
import matplotlib.pyplot as plt

# Pandas provides the dataframe we use to store and work with the data
import pandas as pd

# Seaborn is a powerful plotting library built on matplotlib.pyplot
import seaborn as sns

# Note: Warnings is being imported to help clean up the output
#      This module doesn't generally need to be imported
import warnings; warnings.simplefilter('ignore')

nanodata = pd.read_excel('http://bit.ly/PyNanoParticle-xls')
```

There are two new topics that should be discussed before moving on. There is a new module being used called `seaborn` and there are these weird lines starting with `#`.

`seaborn` is a plotting library built on top of `matplotlib.pyplot` and you should have noticed that we are using an alias. `sns` is the common alias used for the `seaborn` module. This module tends to make advanced plotting a piece of cake. This module also works very well with the pandas dataframe, as will be presented below.

So how about those hashtags? Those represent a one-line comment. Any text on a line of code that comes after `#` is a comment and will not be run as code. Comments are simply to help others understand your code and for you to remember what you did when you come back to a piece of code after 3 months (trust me, that happens a lot and you'll hate yourself for not using more comments). Always comment your code! The comments in this section are a bit of overkill. Once you learn more about Python, there isn't a need to have a comment for every import or every function call. Comments tend to be most important for explaining the purpose of a section of code or to mention why code was written in a certain way. This section, and the code in our ebook, will be heavily commented to help with the learning process.

There are multiple types of comments that can be used in Python. The first is the hashtag used above:
```python
# This is a comment
```

Note that the hashtag does not have to be at the start of the line. You can place the comment at the end of a line of code.

```python
datasheet.plot() # This is a comment
```

The `datasheet.plot()` is from our first example in Part 2, but this time we added a comment on the end. Even though there is a comment on the line, the plot will still be created. Note that the rule is that everything after the `#` is a comment, so if we had the line shown below, the plot would not be created.

```python
# This is a comment datasheet.plot()
```
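Lastly, the `#` is used for single-line comments. As was done in the import code above, each line of a comment starts with a `#`. Python has no dedicated multiline comment syntax, but text wrapped in three single quotes at the start and end is commonly used as one. Technically it is a string that is never assigned to anything, so it has no effect on the rest of the code: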

```python
'''
This is a multiline comment. More lines 
of text can be added until there are 
another 3 single quotes.
'''
```
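One practical difference between the two styles is worth a quick sketch. Python discards `#` comments entirely, while a triple-quoted block is really a string, and when it is the first thing inside a function Python keeps it as that function's documentation (a "docstring"). The function name `plot_title` below is hypothetical and exists only for this illustration.

```python
def plot_title():
    '''
    Return the title used in the examples.
    A triple-quoted string placed first in a function
    is stored as the function's documentation.
    '''
    # A hash comment, by contrast, is thrown away by Python
    return 'Random Data for Plotting Example'

title = plot_title()
has_docstring = plot_title.__doc__ is not None

print(title)
print(has_docstring)
```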

#### Violin Plot

We've started with a few import statements to prepare for reading and plotting the data. So far not much has happened except for that extremely simple call to load the data with `pd.read_excel()`.

The first plot that we'll create is a violin plot. As mentioned above, there are multiple independent variables. When using a violin plot, we can easily group the data into unique sets and then view the relationship between them.

You may have used boxplots before to show summary statistics (median and quartile ranges). A violin plot goes a step further and shows the estimated probability density of a numeric dataset. A basic boxplot is just as easy to make, but since the violin plot looks more interesting that's what we've put in the guide.

One of the columns in the data is _Feret Diameter (nm)_. This is one of the dependent variables of interest in the dataset. We could have chosen another column to plot, but this one was already known to show a pattern. The ebook will give you a better idea of how to approach new datasets and find variables of interest. In this case, we are focused on creating the plots.

```python
# Create a new figure and set the figure size
plt.figure(figsize=(10, 8))

# Create the violin plot using seaborn (sns)
sns.violinplot(x="Ag Thickness (nm)", y="Feret Diameter (nm)",
               hue="Substrate", data=nanodata, split=True)

plt.title('Nanoparticle Diameter Distributions', fontsize=20);
```

Just as in Part 2 of this guide, the plotting was done in a single command! There are a couple of formatting instructions, but the majority of the work was done by using `seaborn` (aliased as `sns`). One of the handy features of seaborn is its ability to automatically add labels. In this case the x and y axes are labeled appropriately because the data columns have descriptive headers. Even better is that a legend for the two different substrates being analyzed was automatically added.

Let's break down this command a bit:
```python
sns.violinplot(x="Ag Thickness (nm)", y="Feret Diameter (nm)", 
               hue="Substrate", data=nanodata, split=True)
```
`data=nanodata` is the main focus here. We've said that the data that we want to use for the plot is in `nanodata`. A plot would be created even if you only had this argument in the `sns.violinplot()` call. In that case, `seaborn` does its best to create a plot. However, if you have too many columns, as `nanodata` does, then the plot won't be all that useful. So we give seaborn a bit more information to select the data that we want to see, by passing these arguments:
```python
x="Ag Thickness (nm)", 
y="Feret Diameter (nm)", 
hue="Substrate"
```
Seaborn will automatically select those columns from the dataframe for the correct axes. `x` and `y` are pretty self explanatory, but the `hue` parameter is a bit more special. This is one of those features that really makes seaborn stand out as a powerful plotting module. By setting `hue="Substrate"` we've told seaborn to group the data by the substrate column. That means our `nanodata` dataframe will be grouped by the unique values in the Substrate column, which are _ITO_ and _a-Si_. 

The last argument is `split=True` and is just there to improve readability. Go ahead, delete that argument and see what happens. You'll simply get 6 separate violin plots instead of the side-by-side configuration above.

What did we learn from the violin plot? Two things:
1. As the Ag Film Thickness increases, the nanoparticle diameter also increases.
2. The ITO substrate has a wider (i.e. less controllable) distribution than the a-Si substrate.

With just a few lines of code it was easy to quickly learn something about the data (and have a nice graphic for the boss!).

#### Correlation Matrix

The next plot is an amazing tool for exploratory data analysis. When starting with a new dataset you may need to quickly identify relationships between variables. A correlation matrix (aka correlogram) allows for the visualization of the interaction between pairs of variables. 

Seaborn provides a function to quickly produce a correlation matrix called `pairplot()`. This function will create a visualization to show pairwise relationships between variables.

```python
# Create the pair plot using seaborn (sns)
sns.pairplot(data=nanodata[nanodata['Substrate']=='a-Si'],
             vars=['Feret Diameter (nm)', 'Area', 'Circ.', 'Perim.'],
             hue='Ag Thickness (nm)');
```

With one function we've created a matrix of plots! There's even some pretty amazing features going on here. The data was first simplified by just selecting the _a-Si_ substrate data.  

```python
nanodata[nanodata['Substrate']=='a-Si']
```

This is an indexing (or slicing) operation. The data has too many variables to plot with a single `pairplot`, so we've reduced the dimension. Remember that the two main experimental factors are _Ag Thickness (nm)_ and _Substrate_. By selecting only _Substrate=a-Si_, the number of factors has been reduced to only _Ag Thickness (nm)_.

Indexing is an important part of dealing with data and this is a basic example of how to select a subset of the data. Indexing, grouping, and pivoting are critical tools when working with data and those topics will be an important part of the full ebook.
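The indexing step can be split into two parts to see what is going on. The sketch below uses a tiny made-up dataframe that borrows the column names from the dataset; the numbers are invented for illustration only.

```python
import pandas as pd

# Made-up rows, using the same column names as the dataset
demo = pd.DataFrame({
    "Substrate": ["a-Si", "ITO", "a-Si", "ITO"],
    "Ag Thickness (nm)": [6, 6, 8, 8],
    "Feret Diameter (nm)": [41.0, 55.0, 63.0, 90.0],
})

# Step 1: the comparison gives one True/False answer per row
is_asi = demo["Substrate"] == "a-Si"

# Step 2: indexing with those answers keeps only the True rows
asi_only = demo[is_asi]

print(is_asi.tolist())
print(len(asi_only))
```

Writing `demo[demo["Substrate"] == "a-Si"]` on one line does both steps at once.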

`vars` is a new argument and is simply stating which variables we want to use for the pairwise comparisons. This argument can be left blank and seaborn will add plots for all columns in the dataframe, but those tend to be hard to read because of how much data there is. So by defining a list of columns for `vars`, we've simply reduced the number of plots to create. 

Lastly, the `hue` argument serves the same purpose as in the violin plot above. By defining `hue='Ag Thickness (nm)'`, each film thickness displays as a different color. Not only can we see the pairwise relationship between variables, but we can also see that same relationship across the main groups in the data!

_Note:
This time there is no call to create and set the size of the figure like there was with the violin plot. This is simply because the pair plot is generated with a larger figure size and manually setting it is unnecessary in this case. The command wasn't needed with the violin plot either, but due to the figure being small, it was desirable to increase the size before plotting._

#### Distribution Plot with Style

The last plot we'll look at with this dataset is a distribution plot with some extra formatting. The critical topic with this data is that we need to understand the distribution of the nanoparticle diameter under the various conditions. So let's create a clean looking plot that clearly shows diameter versus the experimental groups.

```python
# Create the figure and axis handles
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# To simplify the code, grab subsets of the data for each axis
# This is the same indexing used in the pairplot example
aSi = nanodata[nanodata['Substrate'] == 'a-Si']
ito = nanodata[nanodata['Substrate'] == 'ITO']

# Set the title for the first axis
ax1.set_title('Ag Directly on a-Si', fontsize=14)
# Loop over a parameter to create individual distributions
for name, group in aSi.groupby('Ag Thickness (nm)'):
    sns.distplot(group['Feret Diameter (nm)'],
                 label=(str(name) + 'nm Ag'), ax=ax1)

# Set the title for the second axis
ax2.set_title('10nm ITO Between Ag and a-Si', fontsize=14)
# Loop over a parameter to create individual distributions
for name, group in ito.groupby('Ag Thickness (nm)'):
    sns.distplot(group['Feret Diameter (nm)'],
                 label=(str(name) + 'nm Ag'), ax=ax2)

# Plot Formatting

# Add a main title for the figure
fig.suptitle('Nanoparticle Size Distributions', fontsize=20)

# For both axes, add a legend and change the limits to improve viewing
for ax in [ax1, ax2]:
    ax.legend(frameon=True, fontsize=14)
    ax.axis([0, 300, 0, 0.03])
    ax.grid(True)

# Make the text larger to improve readability
plt.rcParams.update({'font.size': 14})
```

Now we're ready for a presentation! This is a clean looking plot with a lot of data. Be sure to read the comments in the code. This time we won't walk through every command, but the comments should provide useful tips.

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
```
This command should look familiar. It is similar to the `plt.figure()` command, but this time we used `subplots` to create 2 plots in the same figure. The first two arguments say that we want `1` row and `2` columns, and the last argument is our standard `figsize`. Unlike in previous examples, we assigned the output of this command to a few variables. The earlier examples simply called `plt.figure()` and the output was ignored. Remember how in Part 2 (_Labels, Titles, and Formatting_) we mentioned that matplotlib.pyplot will keep track of the current figure for you? Well, this time we are doing something a bit more complex and we need to keep track of the figure and axes ourselves. Understanding how to create, store, and manage your axes is a topic that will be covered in much more detail in the ebook. For now just know that the line of code above allows us to access the two plots that are created in the figure.
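What the three variables hold can be checked on an empty figure. This is a minimal sketch; the titles are placeholders chosen for illustration.

```python
import matplotlib.pyplot as plt

# One row, two columns: fig is the whole figure, ax1 and ax2 are the two plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Each plot can now be addressed by name
ax1.set_title("Left plot")
ax2.set_title("Right plot")

number_of_axes = len(fig.axes)
width, height = fig.get_size_inches()

print(number_of_axes, width, height)
```

Because each plot has its own variable, later commands can say exactly which plot they should draw on.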

```python
for name, group in aSi.groupby('Ag Thickness (nm)'):
    pass  # do something with each group here
```

Here we've encountered another `for` loop. This `for` loop is much more exciting than the first one because we are using one of the most useful features of `pandas`: the `groupby()` function. Similar to how `hue` worked in the earlier seaborn examples, the `groupby()` command will separate the dataframe by the unique values in the entered column. So by using `groupby('Ag Thickness (nm)')`, we are creating subsets of the data for each value in _Ag Thickness (nm)_. The ebook will contain examples of how these dataframes are segmented to help provide a visual understanding of how functions like `groupby()` work.

The `groupby()` function can then be used with a `for` loop to do an operation on each group of data. When using `groupby()` in this manner, it will return a `name` and the `group`. The name is simply the value of the specified column for the group and the `group` is the subset of data.

So if `6`, `8`, and `10` are the unique values in the _Ag Thickness (nm)_ column, then here is an example of the values for one iteration of the `for` loop where the group is for _Ag Thickness (nm)_=6:

```python
name=6
group=aSi[aSi['Ag Thickness (nm)'] == 6]
```
Notice that the group is simply a subset of the dataframe and is equivalent to using the indexing command presented earlier. As mentioned earlier, grouping, indexing, pivoting, and other data manipulation operations are an important part of analyzing data. These topics tend to be confusing for beginners. This guide isn't meant to provide a full description of these operations, but this is a topic that will be fully covered in the ebook.
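If you would like to see this for yourself, the sketch below loops over a tiny dataframe and compares each `group` with the matching indexing command. The `demo` dataframe and its numbers are invented for this illustration. They are not the real `aSi` measurements.

```python
import pandas as pd

# Hypothetical stand-in for the aSi dataframe (values are made up)
demo = pd.DataFrame({
    'Ag Thickness (nm)': [6, 6, 8, 10, 10],
    'Feret Diameter (nm)': [41.0, 55.5, 62.3, 80.1, 95.7],
})

names = []
matches = []
for name, group in demo.groupby('Ag Thickness (nm)'):
    # Select the same rows by hand with the indexing command
    manual = demo[demo['Ag Thickness (nm)'] == name]
    names.append(name)
    matches.append(group.equals(manual))
    # Show the group name and how many rows landed in the group
    print(name, len(group))
```

Every entry in `matches` should be `True`, which is just another way of saying that `groupby()` did the indexing for us.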

```python
sns.distplot(group['Feret Diameter (nm)'], 
             label=(str(name) + 'nm Ag'), ax=ax1)
```
Once again we've used seaborn to make plotting easy. Remember that the `group` variable is a subset of the `aSi` dataframe, and by using `group['Feret Diameter (nm)']` we are selecting a single column for plotting. The `distplot()` function doesn't need the `data=` that we used earlier because there is only one variable being plotted. We've also added `ax=ax1` to tell Python which plot to put the data in. Notice that the next `for` loop uses `ax=ax2`. Lastly, this time we had to modify the label for readability.

```python
label=(str(name) + 'nm Ag')
```
The `label` is what will show up in the legend. Here we've built our own label from the `name` variable created by the loop. The `str()` command is a built-in Python function that will convert a number into a string and then we can simply add 2 strings together. String operations are another important set of operations that will be covered in the full ebook.
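Here is a small sketch of that label being built outside of the plotting call. The value `6` is just an example group name, and the f-string on the last line is an alternative spelling that we have not used elsewhere in this guide.

```python
name = 6  # a number, like the group names produced by the loop

# Convert the number to a string, then join the two strings
label = str(name) + 'nm Ag'

# An f-string is another way to build the same text
same_label = f'{name}nm Ag'

print(label)
```

Leaving out `str()` and writing `name + 'nm Ag'` would raise a `TypeError`, because Python will not add a number to a string.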

Finally we have a new function:
```python
plt.rcParams.update({'font.size': 14})
```

We won't discuss how this function works here, but what you should understand is that it will modify the font size on the plots built with `matplotlib.pyplot`. Remember that the `seaborn` and `pandas` plotting functions are built on that library, so their font size will also change. Keep this command handy when you need to improve the readability of all your plot labels and titles.

All the plots above would generally take a lot of coding to make from scratch, or even to build from `matplotlib.pyplot`. The `seaborn` module takes a great deal of the work out of creating complex plots. Labels were automatically added, legends created, and colors set. All of that was done behind the scenes. Of course we added a few minor formatting changes ourselves, but the heavy lifting was done by `seaborn`. That's how it should be. Creating a plot for your data should be easy, then you'll have all the time you need to beautify it for your manager. 

Check out the [seaborn gallery](https://seaborn.pydata.org/examples/index.html) for more ideas.

### Working with a Time Series: Financial Data

Before finishing the guide, let's use your newfound knowledge to look at another dataset, this time a financial dataset. As we mentioned, the full ebook will be covering a variety of datasets in different formats. Here we just want to prove that Python is also incredibly useful when working with time series data, such as the 2018 Dow Jones Industrial Average historical data.

In [None]:
%matplotlib inline

# Import pandas for data handling, plus matplotlib and seaborn for plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Read the 2018 Dow Jones dataset into a dataframe
dowdata = pd.read_csv('http://bit.ly/DowJones2018')

# Change the Date column into time data
dowdata['Date'] = pd.to_datetime(dowdata['Date'])

# Set the index to the date to make the dataset easier to work with
dowdata = dowdata.set_index('Date')


# Make the plot text larger to improve readability
plt.rcParams.update({'font.size': 14})

# Create the figure and define size
plt.figure(figsize=(12, 8))

# Create the violin plot using seaborn (sns)
sns.violinplot(x=dowdata.index.month, y=dowdata['Open'])


# Correct a label and add a title
plt.ylabel('Open Price (USD)')
plt.title('Monthly Dow Jones Industrial Average Open Price Variation');

We've just taken a full year of data for the Dow Jones Industrial Average opening price and plotted the variation by month! What's even more impressive is that the bulk of the work was done in these 4 lines of code:

```python
dowdata = pd.read_csv('http://bit.ly/DowJones2018')
dowdata['Date'] = pd.to_datetime(dowdata['Date'])
dowdata = dowdata.set_index('Date')
sns.violinplot(x=dowdata.index.month, y=dowdata['Open'])
```

Everything else is comments or formatting. Also, this code should look very similar to what you've already encountered. There are a couple new functions which are associated with working on time-based data. 

```python
dowdata['Date'] = pd.to_datetime(dowdata['Date'])
```

The first new function is `pd.to_datetime()`. This function is pretty much self-explanatory: we are converting something to a `datetime` using the `pandas` module. When the data was read in with `pd.read_csv()`, there was no way for pandas to know that the _Date_ column was a special type of data referred to as `datetime`. So after we read the data, we change the _Date_ column to the `datetime` format using the line of code shown above. This line of code will make working with the _Date_ column a lot easier. The importance of this will be shown in the plotting function.
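To see what the conversion changes, the sketch below checks a _Date_ column before and after `pd.to_datetime()`. The `prices` dataframe is a made-up stand-in with three rows. It is not the Dow Jones file.

```python
import pandas as pd

# Made-up rows in the style of a CSV file that has just been read in
prices = pd.DataFrame({'Date': ['2018-01-02', '2018-01-03', '2018-02-01'],
                       'Open': [100.0, 101.5, 99.0]})

# Before the conversion the dates are plain text
was_datetime = pd.api.types.is_datetime64_any_dtype(prices['Date'])

# Convert the text into real dates
prices['Date'] = pd.to_datetime(prices['Date'])
is_datetime = pd.api.types.is_datetime64_any_dtype(prices['Date'])

# Real dates know things about themselves, such as their month
months = list(prices['Date'].dt.month)
print(was_datetime, is_datetime, months)
```

The text version of the column has no idea what a month is, while the converted column can report it directly.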

```python
dowdata = dowdata.set_index('Date')
```

Similar to the `datetime` conversion, this `set_index()` function is simply being used to make handling the data easier. When we read the data, `pandas` created an arbitrary index for us. That arbitrary index isn't all that useful, but luckily we can set our own index. An index is used to better organize the data, and in this case we want to use the `Date` column as the index. Note that the ebook will be going into more detail on columns and indexes since they are critical for data handling.

```python
sns.violinplot(x=dowdata.index.month, y=dowdata['Open'])
```

This command should look familiar, but we did change the arguments a bit. This time we didn't pass a `data` argument and passed only the x and y data. `y` is the _Open_ column of the dataset, and should look the same as the indexing operations that we've done earlier. What we set as `x` is something new and is where our work with the _Date_ column really pays off. `dowdata.index` is a way to select the index column of the dataset. Remember how we used the `set_index()` command earlier? So now by using `dowdata.index`, we are selecting the _Date_. Also, since we did the work to define the _Date_ as a `datetime`, we are able to do even more. By setting `x=dowdata.index.month`, the data is automatically grouped by month and seaborn creates a violin plot for each month in the dataset.
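It can help to look at what an expression like `dowdata.index.month` actually contains. The sketch below uses a tiny made-up dataframe named `demo` in place of the real data, so the prices shown are placeholders.

```python
import pandas as pd

# Made-up rows standing in for the Dow Jones data
demo = pd.DataFrame({
    'Date': pd.to_datetime(['2018-01-02', '2018-01-31', '2018-02-01']),
    'Open': [100.0, 101.5, 99.0],
})

# Use the dates as the index, as in the example above
demo = demo.set_index('Date')

# One month number for every row in the dataset
months = list(demo.index.month)
print(months)
```

Each row gets a month number, and rows that share a number end up in the same violin when that list of months is used as `x`.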

## Bonus Section: Interacting with the Data

We want to leave you with something that is a bit more advanced. Working in Jupyter Notebooks allows you to quickly create a wide variety of incredibly useful plots. There are features beyond just the static plotting examples above. Let's conclude by adding something dynamic.

Sorry! No detailed walkthrough on this example. There are some new modules being used and we created a function with the `def` statement. The key piece here is the final call to `interact()`. The first argument, `update`, links the interact session to the `update` function that we defined. The second argument, `val=year_widget`, says to run `update` whenever the value of `year_widget` changes and to send the new value to the `update` function as `val`.

Hopefully the rest of this code is starting to look familiar and the walkthroughs have made the code understandable. If you benefited from this guide, then the ebook will have you building interactive plots like the one below and so much more!

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact
import ipywidgets as widgets

coffeedata = pd.read_csv('http://bit.ly/icoCoffee')

# Format the YEAR to be just the start of the production year
coffeedata['YEAR'] = [int(x.split('/')[0]) for x in coffeedata['YEAR']]

# Rename the columns that we know we are plotting for formatting
coffeedata = coffeedata.rename(columns={
    'DOMESTIC_CONSUMPTION':'Domestic Consumption',
    'EXPORTABLE_PRODUCTION':'Exportable Production'})


# Create a widget to allow for year selection
year_widget = widgets.Dropdown(options=coffeedata['YEAR'].unique(), 
                               value=coffeedata['YEAR'].min(),
                               description='Year')

# Define a function to call when the year widget is changed
def update(val):
    # Select the data based on the year
    tmp = coffeedata[coffeedata.YEAR == val]
    # Filter out Brazil for viewing
    # Brazil produces a lot more than other countries
    tmp = tmp[tmp.COUNTRY != 'Brazil']
    # Plot the selected data with a stacked bar plot
    tmp.plot(x='COUNTRY', y=['Exportable Production', 'Domestic Consumption'], 
             kind='bar', stacked=True, figsize=(14, 6))

    # Format the plot to make it more readable
    ax = plt.gca()
    ax.set_xlabel('')
    ax.set_ylabel('Total Production')
    ax.set_ylim(bottom=0, top=18000)
    ax.legend(loc='upper right')
    plt.title('Coffee Exports and Consumption by Country\nTotal')

# Start the interactive plotting by using 'interact'
# Interact will call the 'update' function whenever the
#    value of the widget is changed. The value of the widget
#    will be sent to 'update' as 'val'
interact(update, val=year_widget);

There it is! An interactive plot! We've added a simple piece of functionality that makes the data much easier to explore. We can now navigate through the years and see how coffee production has changed in each country. We've even made this a stacked bar plot so we can quickly visualize which countries are exporting most of their coffee versus those that are consuming most of their coffee.
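One line in the code above that may look unfamiliar is the one that cleans up the `YEAR` column. Here is a small sketch of what it does, assuming each entry is text in a form like `'1990/91'`. That format is our assumption for this illustration, and the `raw_years` list is made up.

```python
# Assumed format: each YEAR entry is text such as '1990/91'
raw_years = ['1990/91', '1991/92', '2000/01']

# Split each entry on '/', keep the first piece, and convert it to an integer
start_years = [int(x.split('/')[0]) for x in raw_years]

print(start_years)
```

Storing the year as a plain integer is what lets the dropdown show a tidy list of years.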

# Thanks and Please Provide Feedback!

Thanks for reading our short introduction to plotting with Python! If you've made it this far, then you are probably interested in the material that we will be producing. Here at _Python for the Office_ we have a broad set of skills and can cover basically anything. We have an idea of what we believe will be most valuable to our readers, but hearing from you directly would be even better. If you have a topic that you are struggling with, a file that you can't seem to work with, or even a dataset that just seems impossible to wrangle, then please let us know. We want to provide you with useful and practical information so you can start using Python as soon as possible.

office@pythonfortheoffice.com