# Data Visualization in pandas

Today, we will continue our coverage of data visualization in Python, focusing on the functionality within pandas.

Friendly Reminders:

* Homework #5 is due tonight by 11:59 p.m.
* DataCamp Modules for matplotlib and customizing visualizations are due Thursday by 11:59 p.m. (last ones!)

In [1]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


As we have seen, matplotlib provides the low-level functionality for data visualization in Python. The advantage of matplotlib is that it offers full flexibility for developing visualizations ranging from very simple to very complex. The disadvantage is that we still need to write several-to-many lines of code to create a visualization.

pandas offers a simpler interface for creating visualizations. In particular, the **.plot** method makes it easy to translate Series and DataFrame objects into visual representations. The .plot method can be used to create our fundamental set of visualizations, as well as several other types (not listed):

* Line chart (line, default)
* Scatter plots (scatter)
* Histograms (hist) & boxplots (box)
* Column (bar) and bar (barh) charts
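As a minimal sketch of the default behavior (the Series `s` and its values are made up for illustration, not taken from the course data), calling .plot with no arguments draws a line chart and returns a matplotlib Axes object that can be customized further:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 50 cumulative random steps
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(50).cumsum(), name="walk")

# With no kind given, Series.plot draws a line chart and returns the Axes
ax = s.plot()

# The returned Axes accepts the usual matplotlib calls
ax.set_xlabel("step")
ax.set_ylabel("value")
n_lines = len(ax.get_lines())
```

Because the return value is an ordinary Axes, everything covered in the matplotlib sessions still applies after the pandas call.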

Similar to the previous class, let's create some synthetic data to explore the basic functionality:

In [None]:
n = 100
x = np.random.randn(n)
y = np.random.randn(n)

In [None]:
# Assemble data into DataFrame
df = pd.DataFrame({'x': x, 'y': y})
df.head()

There are two ways to create a visualization with .plot: pass the plot type through the **kind** argument, or call the corresponding function on the .plot accessor:

In [None]:
# kind argument
df.plot(kind='line')

In [None]:
# plot function
df.plot.line()

Note that the name of the plot passed to the **kind** argument and the name of the function match in all cases.
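One way to check this correspondence is to draw the same data both ways and compare what ends up on the axes. The sketch below uses a hypothetical DataFrame `demo` and looks each function up by name with `getattr`; scatter is left out of the loop only because it also requires the x and y arguments.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data, kept small so the bar charts stay readable
rng = np.random.default_rng(1)
demo = pd.DataFrame({"x": rng.standard_normal(5), "y": rng.standard_normal(5)})

kinds = ["line", "hist", "box", "bar", "barh"]
matches = {}
for kind in kinds:
    ax_kw = demo.plot(kind=kind)        # kind argument
    ax_fn = getattr(demo.plot, kind)()  # function of the same name
    # Compare the number of line and patch artists each call produced
    drawn_kw = (len(ax_kw.lines), len(ax_kw.patches))
    drawn_fn = (len(ax_fn.lines), len(ax_fn.patches))
    matches[kind] = drawn_kw == drawn_fn
    plt.close("all")  # avoid accumulating open figures

print(matches)
```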

The .plot method includes arguments for many of the same formatting and labeling tasks that we performed in matplotlib. Either approach can be applied to format and label your visualization.

* Figure formatting and layout: figsize, subplots/layout, sharex/y, title, legend
* Axes formatting: x/yticks, log scaling (logx/y, loglog), x/ylim, grid, rot
* Plot styling: matplotlib arguments such as color, marker, linestyle, linewidth (if applicable)
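The layout arguments from the first bullet are not used elsewhere in these notes, so here is a small sketch of them under stated assumptions: the DataFrame `demo` and its columns `a` and `b` are hypothetical. The first call does everything through .plot arguments; the second builds the figure in matplotlib and hands each Axes to pandas through `ax`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: two columns of 20 random values
rng = np.random.default_rng(2)
demo = pd.DataFrame({"a": rng.random(20), "b": rng.random(20)})

# Approach 1: one subplot per column, arranged and formatted via .plot arguments
axes = demo.plot(subplots=True, layout=(1, 2), sharey=True,
                 figsize=(8, 3), grid=True, xlim=(0, 19))

# Approach 2: create the layout in matplotlib, then plot each column into it
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
for column, ax in zip(demo.columns, axs):
    demo[column].plot(ax=ax, legend=True)
    ax.grid(True)
    ax.set_xlim(0, 19)
```

With `subplots=True` and a `layout`, pandas returns an array of Axes with the same shape as the layout, so individual panels can be addressed as `axes[0, 0]` and `axes[0, 1]`.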

In [None]:
df.plot(kind='scatter', x='x', y='y', color='k', marker='.', grid=True);

In [None]:
df.plot(kind='scatter', x='x', y='y', color='k', marker='.');
plt.grid(True)

Let's visualize our full random walk:

In [None]:
df.cumsum().plot(x='x', y='y', marker='.', color='k', title='Random Walk Analysis', legend=False, figsize=(8,8))
plt.axhline(0, color='0.5', alpha=0.5)
plt.axvline(0, color='0.5', alpha=0.5)
plt.ylabel('y');
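To see why `cumsum` turns random steps into a walk, it can help to trace a few steps by hand. In this sketch the `steps` values are made up for illustration: each row is one step, and the cumulative sum gives the position after that step.

```python
import pandas as pd

# Hypothetical steps: each row is a move in the x and y directions
steps = pd.DataFrame({"x": [1, -1, 2, 0], "y": [0, 1, 1, -3]})

# Cumulative sums convert per-step moves into positions along the path
path = steps.cumsum()
print(path)
#    x  y
# 0  1  0
# 1  0  1
# 2  2  2
# 3  2 -1
```

Passing `x='x'` and `y='y'` to .plot then connects these positions in order, which traces the walk in the plane rather than plotting each column against the index.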

Let's reload our tips data to continue exploring data visualization in pandas:

In [None]:
path = '/Users/seanbarnes/Dropbox/Teaching/Courses/BUDT758X/data/'
tips = pd.read_csv(path + 'tips.csv')
tips.head()

Similar to before, let's calculate the tip_pct to facilitate our analysis:

In [None]:
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
tips['tip_pct'].describe()

Let's visualize the distribution of tip_pct:

In [None]:
# Create figure with subplots
fig, ax = plt.subplots(2, 1, sharex=True, figsize=(8,6))

# Histogram
tips['tip_pct'].plot.hist(bins=np.arange(0, tips['tip_pct'].max() + 1), ax=ax[0])  # +1 so the last bin edge covers the maximum

# Boxplot
tips['tip_pct'].plot.box(vert=False, ax=ax[1])

# Adjust figure labels
ax[0].set_ylabel('')
ax[1].set_yticklabels('')
ax[1].set_xlabel('Tip Percentage', fontweight='bold', fontsize=12);

# Adjust subplot
plt.tight_layout()

In addition to the boxplot available through .plot, DataFrames have a separate boxplot method, which supports more flexible visualizations such as grouped boxplots.

In [None]:
# GroupBy + Boxplots
tips.boxplot(column='total_bill', by='size', flierprops={'marker': '.'}, vert=False, grid=False)
plt.xlabel('Total Bill')
plt.ylabel('Size')
plt.title('');
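As a rough sketch of what the `by` argument does, the example below uses a hypothetical DataFrame with a `group` column and a `value` column instead of the tips data. One box is drawn per group, and the same split can be summarized numerically with groupby.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 30 random values in each of three groups
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "value": rng.standard_normal(90),
})

# One box per unique value of the 'by' column
demo.boxplot(column="value", by="group", grid=False)
ax = plt.gca()
plt.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
ax.set_title("")  # remove the automatic column-name title
tick_labels = [t.get_text() for t in ax.get_xticklabels()]

# The medians drawn inside the boxes, computed directly
medians = demo.groupby("group")["value"].median()
```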

Bar/column charts are one of the most natural visualizations to apply to analysis involving DataFrames, as we often perform descriptive analysis involving categorical variables.

In [None]:
# Who's paying?
tips['sex'].value_counts(normalize=True).plot(kind='bar')
plt.ylim([0,1]);

In [None]:
# Who's smoking? - Grouped
pd.crosstab(index=tips.sex, columns=tips.smoker).loc[['Male','Female']].plot(kind='bar')
plt.xlabel('');

In [None]:
# Who's smoking? - Stacked
pd.crosstab(index=tips.sex, columns=tips.smoker).loc[['Male','Female']].plot(kind='bar', stacked=True)
plt.xlabel('');
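The difference between the two charts above is easiest to see on a tiny table. This sketch assumes a made-up seven-row DataFrame (not the tips data): `crosstab` counts each combination, the grouped chart draws every count from zero, and the stacked chart places each count on top of the previous one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical observations
demo = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male", "Female", "Male"],
    "smoker": ["Yes", "No", "No", "Yes", "No", "No", "Yes"],
})

# Rows are sex, columns are smoker, cells are counts
counts = pd.crosstab(index=demo["sex"], columns=demo["smoker"])

ax_grouped = counts.plot(kind="bar")                # bars side by side
ax_stacked = counts.plot(kind="bar", stacked=True)  # bars on top of each other

# Every grouped bar starts at zero; stacked bars reach the row totals
grouped_bottoms = {p.get_y() for p in ax_grouped.patches}
stacked_top = max(p.get_y() + p.get_height() for p in ax_stacked.patches)
```

Here the tallest stacked bar equals the largest row total in `counts`, which is why stacking suits comparisons of totals while grouping suits comparisons of individual categories.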

In [None]:
# Summarize tip_pct by day of the week and time of day
order = [('Thur','Lunch'),('Thur','Dinner'),('Fri','Lunch'),('Fri','Dinner'),('Sat','Dinner'),('Sun','Dinner')]
tips.groupby(by=['day','time'])['tip_pct'].mean().loc[order].plot(kind='bar', figsize=(12,8), table=True)
plt.xticks([])
plt.xlabel('');
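The `.loc[order]` step matters because groupby sorts its keys alphabetically, which would put Friday before Thursday. A small sketch with made-up values (the `demo` DataFrame is hypothetical) shows the reordering on its own, without the plot:

```python
import pandas as pd

# Hypothetical data with two grouping columns
demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Sat"],
    "time": ["Lunch", "Lunch", "Dinner", "Lunch", "Dinner", "Dinner"],
    "tip_pct": [14.0, 16.0, 17.0, 18.0, 16.0, 14.0],
})

# Grouping by two columns gives a Series with a (day, time) MultiIndex,
# sorted alphabetically by default
means = demo.groupby(by=["day", "time"])["tip_pct"].mean()

# Selecting with a list of tuples returns the rows in the requested order
order = [("Thur", "Lunch"), ("Thur", "Dinner"), ("Fri", "Lunch"),
         ("Fri", "Dinner"), ("Sat", "Dinner")]
ordered = means.loc[order]
print(ordered)
```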


## Next Time: Data Visualization in Seaborn