# Matplotlib: Histograms and Bar Charts

### Histograms

Histograms are a very useful tool for exploring your data, particularly for understanding the distribution of specific variables. To construct a histogram, the first step is to "bin" the range of values, that is, divide the entire range into a series of intervals, and then count how many values fall into each interval.

Let's see how it all works with an example using the `plt.hist()` function. 

In [87]:
# importing our packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# create a random set of data
mu = 100 # mean of our data
sigma = 15 # standard deviation of our data
data = mu + sigma*np.random.randn(10000) # draw 10,000 samples from a normal distribution with mean mu and standard deviation sigma

# print data
print(data)

# plot histogram and add labels/title


[106.95903004 120.7111259  104.90378099 ...  89.70318816  90.75283697
 105.28193768]
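The plotting step of the cell above is left blank. A minimal sketch of how it might be filled in follows; the fixed seed and the label and title text are assumptions added for illustration, not part of the original exercise. In a notebook with `%matplotlib inline` the figure is displayed automatically.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # assumption: fixed seed so the sketch is reproducible
mu, sigma = 100, 15
data = mu + sigma * np.random.randn(10000)

# plt.hist returns the per-bin counts, the bin edges, and the drawn bars
counts, bin_edges, patches = plt.hist(data)  # default is 10 equal-width bins

# label and title text are placeholders chosen for this sketch
plt.xlabel("Value")
plt.ylabel("Count")
plt.title("Histogram of data (10 bins)")
```

The returned `counts` and `bin_edges` arrays are handy for checking what the plot shows: there is always one more edge than there are bins.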


Here we see a roughly bell-shaped distribution, which we refer to in statistics as a normal distribution. The peak of the distribution is near the value 100, which corresponds to our mean parameter `mu`, and most (~68%) of the data lies within +/- 15 of the mean, corresponding to our standard deviation parameter `sigma`.

In this plot we can see that our data is split into 10 equal-width bins, each spanning roughly a tenth of the data's range, as evidenced by the 10 bars. 10 bins is the default value. To get a better sense of the distribution, we can use a larger number of more fine-grained bins.

In [86]:
# plot histogram and add labels/title with 50 bins
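One possible way to complete this cell is to pass the `bins` argument to `plt.hist()`. The sketch below regenerates the data so that it runs on its own; the seed and label text are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # assumption: fixed seed for reproducibility
mu, sigma = 100, 15
data = mu + sigma * np.random.randn(10000)

# bins=50 splits the data range into 50 equal-width intervals
counts, bin_edges, _ = plt.hist(data, bins=50)

plt.xlabel("Value")  # placeholder label text
plt.ylabel("Count")
plt.title("Histogram of data (50 bins)")
```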


Using 50 bins instead of 10 gives a clearer bell-shaped normal distribution.

In [88]:
# plot histogram and add labels/title with 100 bins


However, using bins that are too fine-grained might take away from the overall shape of the distribution. 

In [89]:
# plot histogram and add labels/title with 150 bins
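The 100-bin and 150-bin cells could be filled in the same way. As a sketch, the loop below draws one figure per bin count so the two can be compared; the seed, labels, and the `bar_counts` dictionary are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # assumption: fixed seed for reproducibility
mu, sigma = 100, 15
data = mu + sigma * np.random.randn(10000)

bar_counts = {}  # hypothetical helper: per-bin counts keyed by number of bins
for n_bins in (100, 150):
    plt.figure()  # start a new figure for each bin count
    counts, bin_edges, _ = plt.hist(data, bins=n_bins)
    bar_counts[n_bins] = counts
    plt.xlabel("Value")
    plt.ylabel("Count")
    plt.title(f"Histogram of data ({n_bins} bins)")
```

With very narrow bins, random fluctuations between neighbouring bars become more visible, which is what makes the overall shape harder to read.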


You'll notice that the y-axis on all of our histogram plots shows the count of the values of `data` that fall into each bin. As we increase the number of bins and decrease the bin size, the range of the y-axis also decreases: smaller bins mean fewer values fall in each bin. This can make comparing between histograms of different variables difficult.
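The shrinking y-axis can be checked numerically without plotting. The sketch below uses `np.histogram()`, which computes the same counts that `plt.hist()` draws, and records the tallest bar for several bin counts; the seed and the `peak` dictionary are assumptions made for illustration.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed for reproducibility
mu, sigma = 100, 15
data = mu + sigma * np.random.randn(10000)

peak = {}  # hypothetical helper: tallest bar height keyed by number of bins
for n_bins in (10, 50, 100, 150):
    counts, bin_edges = np.histogram(data, bins=n_bins)
    peak[n_bins] = counts.max()
    print(f"{n_bins:>3} bins: tallest bar holds {peak[n_bins]} values")
```

The total number of values is the same in every case; only how they are divided among the bins changes.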

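One solution to this issue is to normalize the y-axis so that it shows a probability *density* rather than raw counts: the bar heights are rescaled so that the total area of the bars equals 1. This is called a **density plot**, and with `plt.hist()` it is done by passing `density=True`. (To show the *fraction* of values in each bin instead, divide each count by the total number of values.)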

In [90]:
# plot density and add labels/title with 50 bins
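A sketch of how this cell might be completed is shown below. It also checks that the bar areas sum to 1 and, as an alternative, plots the fraction of values per bin using the `weights` argument. The seed, labels, and variable names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # assumption: fixed seed for reproducibility
mu, sigma = 100, 15
data = mu + sigma * np.random.randn(10000)

# density=True rescales the bar heights so the total bar area equals 1
density, bin_edges, _ = plt.hist(data, bins=50, density=True)
plt.xlabel("Value")  # placeholder label text
plt.ylabel("Probability density")
plt.title("Density histogram of data (50 bins)")

# area of each bar = height * width; the areas should add up to 1
bin_widths = np.diff(bin_edges)
total_area = np.sum(density * bin_widths)
print(total_area)  # close to 1.0, up to floating-point rounding

# alternative: weight every value by 1/N so each bar shows a fraction of the data
plt.figure()
weights = np.full(len(data), 1.0 / len(data))
fractions, _, _ = plt.hist(data, bins=50, weights=weights)
plt.xlabel("Value")
plt.ylabel("Fraction of values")
print(fractions.sum())  # also close to 1.0
```

Note that density values depend on the bin width, so they are not percentages; the weighted version is the one to use if fractions per bin are wanted.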


### Bar Charts

Bar charts are useful for comparing a variable across several groups. In a bar chart, the x-axis is usually a *discrete* variable, such as different groups in a study or specific categories of things, and the y-axis is usually a continuous variable, such as height or age. 

Let's practice with a really simple example of the `plt.bar()` function.

In [91]:
# create data vector and labels vector
means = (45, 67, 64, 55, 52)
groups = ('Group1', 'Group2', 'Group3', 'Group4', 'Group5')

# get indexes for xticks

# plot bar graph and add labels/title
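The two blank steps of this cell might be filled in as follows. This is a sketch: the variable name `x_pos` and the label and title text are assumptions, since the source does not say what the means measure.

```python
import numpy as np
import matplotlib.pyplot as plt

means = (45, 67, 64, 55, 52)
groups = ('Group1', 'Group2', 'Group3', 'Group4', 'Group5')

# one integer position per group: 0, 1, 2, 3, 4
x_pos = np.arange(len(groups))

# draw one bar per group, then replace the numeric ticks with group names
bars = plt.bar(x_pos, means)
plt.xticks(x_pos, groups)
plt.ylabel("Mean")  # placeholder label text
plt.title("Mean value by group")
```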


Sometimes we want to compare multiple variables - such as ages of men and women across our 5 study groups. One way we can do this is using the `subplots()` function, which enables us to put two plots side by side. 

In [92]:
# create vectors with age means for men and women across groups
men_means = (20, 35, 30, 35, 27)
women_means = (25, 32, 34, 20, 25)

# subplots, 1 row, 2 columns, figsize = (12,6)

# fig: full plot object
# ax: array containing each subplot
# we identify each individual plot by its index (i.e. ax[0] for the first)


# add the y-label to the plot

# add xticks and labels to the first plot


# add xticks and labels to the second plot


# add legend to both plots


# add suptitle to fig
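One way the steps in this cell could be written is sketched below. The colors, labels, and title text are assumptions; `groups` and `x_pos` are redefined so the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

groups = ('Group1', 'Group2', 'Group3', 'Group4', 'Group5')
men_means = (20, 35, 30, 35, 27)
women_means = (25, 32, 34, 20, 25)
x_pos = np.arange(len(groups))

# 1 row, 2 columns; ax is an array holding the two subplots
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# men on the first subplot, women on the second
ax[0].bar(x_pos, men_means, color="tab:blue", label="Men")
ax[1].bar(x_pos, women_means, color="tab:orange", label="Women")

ax[0].set_ylabel("Mean age")  # placeholder label text

# same tick positions and group names on both subplots
for axis in ax:
    axis.set_xticks(x_pos)
    axis.set_xticklabels(groups)
    axis.legend()

fig.suptitle("Mean age by group")
```

Because the two subplots have independent y-axes by default, bars of the same height can represent different values; passing `sharey=True` to `plt.subplots()` is one option if a common scale is preferred.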



However, this layout makes it difficult to directly compare the means of men and women in each group. A better way to visualize this data would be by plotting the bars side by side on a single graph. 

In [93]:
# defining the width of our bars. We'll see why this is important in a second.
width = 0.35

# plot men's bars on the left side of the x-tick

# plot women's bars on the right side of the x-tick

# plot bar chart and add labels/title/xticks/legend
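A sketch of the grouped layout: each pair of bars is centered on its x-tick by shifting the men's bars left and the women's bars right by half a bar width. Label and title text are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

groups = ('Group1', 'Group2', 'Group3', 'Group4', 'Group5')
men_means = (20, 35, 30, 35, 27)
women_means = (25, 32, 34, 20, 25)
x_pos = np.arange(len(groups))
width = 0.35

# shift each series by half a bar width so the pair straddles the tick
men_bars = plt.bar(x_pos - width / 2, men_means, width, label="Men")
women_bars = plt.bar(x_pos + width / 2, women_means, width, label="Women")

plt.xticks(x_pos, groups)
plt.ylabel("Mean age")  # placeholder label text
plt.title("Mean age by group and sex")
plt.legend()
```

This is why `width` matters: it sets both how wide the bars are and how far each series is offset, and keeping it below 0.5 leaves a gap between neighbouring groups.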



As we've seen in the histogram exercise, the mean is not the only defining feature of a distribution: we often also look at the standard deviation to know how spread out the distribution is about the mean. This can also help us judge statistically whether two groups are significantly different: if the ranges mean +/- standard deviation of two groups overlap, the groups are less likely to be significantly different.

We can add error bars to the above graph to help make this visual comparison easier. 
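The overlap rule of thumb can also be worked through numerically. The sketch below uses the group means from earlier together with the standard deviations defined in the next cell, and checks for each group whether the men's and women's mean +/- standard deviation ranges overlap. It is only an informal check, not a significance test, and the variable names are assumptions.

```python
import numpy as np

men_means = np.array([20, 35, 30, 35, 27])
women_means = np.array([25, 32, 34, 20, 25])
men_std = np.array([2, 3, 4, 1, 2])
women_std = np.array([3, 5, 2, 3, 3])

# lower and upper ends of each mean +/- std range
men_low, men_high = men_means - men_std, men_means + men_std
women_low, women_high = women_means - women_std, women_means + women_std

# two ranges overlap when each one starts before the other one ends
overlap = (men_low <= women_high) & (women_low <= men_high)
print(overlap.tolist())  # [True, True, True, False, True]
```

Only Group4 (35 +/- 1 for men against 20 +/- 3 for women) has clearly separated ranges; in Group1 the ranges just touch at 22.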

In [94]:
# create vectors for std ages for men/women across groups
men_std = (2, 3, 4, 1, 2)
women_std = (3, 5, 2, 3, 3)

# plot men's bars with error bars on the left side of the x-tick


# plot women's bars with error bars on the right side of the x-tick

# plot bar chart and add labels/title/xticks/legend
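Error bars can be added by passing the standard deviations through the `yerr` argument of `plt.bar()`. The sketch below repeats the grouped layout with that one change; `capsize` and the label text are assumptions chosen for readability.

```python
import numpy as np
import matplotlib.pyplot as plt

groups = ('Group1', 'Group2', 'Group3', 'Group4', 'Group5')
men_means = (20, 35, 30, 35, 27)
women_means = (25, 32, 34, 20, 25)
men_std = (2, 3, 4, 1, 2)
women_std = (3, 5, 2, 3, 3)
x_pos = np.arange(len(groups))
width = 0.35

# yerr draws a vertical line of +/- one standard deviation on top of each bar
men_bars = plt.bar(x_pos - width / 2, men_means, width,
                   yerr=men_std, capsize=4, label="Men")
women_bars = plt.bar(x_pos + width / 2, women_means, width,
                     yerr=women_std, capsize=4, label="Women")

plt.xticks(x_pos, groups)
plt.ylabel("Mean age")  # placeholder label text
plt.title("Mean age by group and sex (+/- 1 std)")
plt.legend()
```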

