## Histogram, Boxplots and Pyramid

This notebook uses the Social Media dataset. Its subjects are students aged between 7 and 25, both female and male. For each student, the data records the amount of time spent on each social media platform, i.e. Facebook, Instagram and Twitter.

The dataset contains a fair number of users who spend 0 minutes on all social media, and also users who spend 200 minutes on all three social accounts.

### Explore the dataset

Import the required packages.

In [1]:
import pygal
import pandas as pd
from datetime import datetime 
import numpy as np

In [2]:
social_media = pd.read_csv('../datasets/social-media_dataset.csv')

In [3]:
social_media.head()

Unnamed: 0,student_id,age,gender,facebook_time,instagram_time,twitter_time
0,s_id0,8,female,13,7,44
1,s_id1,15,male,37,55,55
2,s_id2,18,female,23,31,32
3,s_id3,25,male,17,20,52
4,s_id4,12,male,10,40,37


In [4]:
social_media.shape

(200, 6)

### Histogram 

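Create age groups from the `age` column, then plot a histogram of the number of students in each group. Students are grouped into three bins:

- 0 to 13
- 13 to 18
- 18 to 25

`pd.cut` builds right-closed intervals by default, so a student aged 13 falls in `(0, 13]` and a student aged 18 falls in `(13, 18]`.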
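To make the bin edges concrete, the sketch below shows how right-closed bins such as `(0, 13]` assign a few sample ages. It uses only the standard library instead of `pd.cut`, and the helper name `age_bin` is hypothetical, introduced purely for illustration.

```python
from bisect import bisect_left

BIN_EDGES = [0, 13, 18, 25]  # same edges as the bins used in this notebook


def age_bin(age, edges=BIN_EDGES):
    """Return the right-closed interval (left, right] that contains age."""
    if not edges[0] < age <= edges[-1]:
        raise ValueError(f"age {age} is outside ({edges[0]}, {edges[-1]}]")
    # bisect_left finds the first edge >= age, which is the right edge
    right_index = bisect_left(edges, age)
    return edges[right_index - 1], edges[right_index]


for sample_age in (8, 13, 18, 25):
    left, right = age_bin(sample_age)
    print(f"age {sample_age} -> ({left}, {right}]")
```

Ages that sit exactly on an edge (13, 18, 25) land in the lower of the two neighbouring groups, which matches the `age_group` column shown for the first few students.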

In [5]:
social_media['age_group'] = pd.cut(x = social_media['age'], 
                                   bins = [0, 13, 18, 25])

social_media.head()

Unnamed: 0,student_id,age,gender,facebook_time,instagram_time,twitter_time,age_group
0,s_id0,8,female,13,7,44,"(0, 13]"
1,s_id1,15,male,37,55,55,"(13, 18]"
2,s_id2,18,female,23,31,32,"(13, 18]"
3,s_id3,25,male,17,20,52,"(18, 25]"
4,s_id4,12,male,10,40,37,"(0, 13]"


In [6]:
age_group_counts = social_media.groupby(['age_group']).size()

age_group_counts

age_group
(0, 13]     88
(13, 18]    79
(18, 25]    33
dtype: int64

In [7]:
age_group_counts.index[0].left

0

In [8]:
age_group_counts.index[0].right

13

In [9]:
from IPython.display import display, HTML

html_skeleton = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" 
          src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js">
  </script>
  <script type="text/javascript" 
          src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js">
  </script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""

def display_chart(chart):
    rendered_chart = chart.render(is_unicode=True)
    plot_html = html_skeleton.format(rendered_chart=rendered_chart)
    display(HTML(plot_html))

In [10]:
histogram_chart = pygal.Histogram(width = 640,
                                  height = 360,
                                  explicit_size = True)

histogram_chart.title = "No. of Students by Age Group"
histogram_chart.x_title = "Age"
histogram_chart.y_title = "Number of Students"

In [11]:
bars = []

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, value in age_group_counts.items():
    bars.append((value, index.left, index.right))
    
bars

[(88, 0, 13), (79, 13, 18), (33, 18, 25)]

In [12]:
histogram_chart.add('Students', 
                    bars)

display_chart(histogram_chart)

### Pyramid

Plot a pyramid to show the gender distribution by age in the dataset.

In [13]:
social_media.describe()

Unnamed: 0,age,facebook_time,instagram_time,twitter_time
count,200.0,200.0,200.0,200.0
mean,14.475,27.925,31.22,30.81
std,4.396135,19.199544,26.06624,19.748771
min,7.0,0.0,0.0,0.0
25%,11.0,11.0,14.0,13.75
50%,14.0,26.0,28.5,31.5
75%,18.0,44.25,45.0,47.0
max,25.0,73.0,240.0,120.0


List all the distinct ages in the dataset.

In [14]:
social_media['age'].unique()

array([ 8, 15, 18, 25, 12, 16,  9, 19, 13,  7, 17, 14, 11, 10, 20, 24, 22,
       21, 23])

In [15]:
age_gender_counts = social_media.groupby(['gender', 'age']).size()

age_gender_counts

gender  age
female  7       6
        8       7
        9       6
        10      3
        11      6
        12      9
        13      7
        14      8
        15      5
        16      6
        17     10
        18     11
        19     10
        20      1
        21      2
        22      4
        23      2
        24      1
        25      2
male    7       3
        8       5
        9       5
        10      3
        11      7
        12     10
        13     11
        14     10
        15      8
        16      6
        17      8
        18      7
        19      3
        20      1
        22      2
        24      3
        25      2
dtype: int64

In [16]:
ages = age_gender_counts.index.get_level_values('age').unique()
ages

Int64Index([7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
            24, 25],
           dtype='int64', name='age')

In [17]:
range(ages.min(), ages.max())

range(7, 25)

In [18]:
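# Add a zero count for every (gender, age) pair missing from the data,
# so that both genders cover the same range of ages.
for gender in age_gender_counts.index.get_level_values('gender').unique():
    for age in range(ages.min(), ages.max() + 1):
        try:
            age_gender_counts.loc[gender, age]
        except KeyError:
            age_gender_counts.loc[gender, age] = 0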

In [19]:
age_gender_counts = age_gender_counts.sort_index()

age_gender_counts

gender  age
female  7       6
        8       7
        9       6
        10      3
        11      6
        12      9
        13      7
        14      8
        15      5
        16      6
        17     10
        18     11
        19     10
        20      1
        21      2
        22      4
        23      2
        24      1
        25      2
male    7       3
        8       5
        9       5
        10      3
        11      7
        12     10
        13     11
        14     10
        15      8
        16      6
        17      8
        18      7
        19      3
        20      1
        21      0
        22      2
        23      0
        24      3
        25      2
dtype: int64
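The zero counts for males aged 21 and 23 matter because each series handed to the chart is just a list of values matched to the x labels by position, so both genders need one count per age in the same order. A minimal standard-library sketch of the same zero-filling idea follows; the sample counts and the `fill_missing_ages` helper are made up for illustration and are not taken from the dataset.

```python
def fill_missing_ages(counts, min_age, max_age):
    """Return a count for every age in [min_age, max_age], using 0 for gaps."""
    return {age: counts.get(age, 0) for age in range(min_age, max_age + 1)}


# Hypothetical counts with no entries for ages 21 and 23
sample_counts = {20: 1, 22: 2, 24: 3}

filled = fill_missing_ages(sample_counts, 20, 24)
print(filled)
```

Building the full range first and looking up each age avoids the try/except pattern, at the cost of working on a plain dictionary rather than on the grouped pandas result.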

In [20]:
pyramid_chart = pygal.Pyramid(width = 640,
                              height = 360,
                              explicit_size = True,
                              legend_at_bottom=True,
                              )

pyramid_chart.title = 'Age Distribution by gender'

In [21]:
pyramid_chart.x_labels =  range(ages.min(), ages.max() + 1)

In [22]:
pyramid_chart.add('Female',
                  age_gender_counts['female'])
pyramid_chart.add('Male',
                  age_gender_counts['male'])

display_chart(pyramid_chart)

The chart is small inside the notebook, so open it in the browser for a larger view.

In [23]:
pyramid_chart.render_in_browser(explicit_size = False)

file:///var/folders/k1/cs73nb7s0157jb0g25yz3y6h0000gn/T/tmpihvbzlg2.html


### Boxplot 

In [24]:
social_media.head()

Unnamed: 0,student_id,age,gender,facebook_time,instagram_time,twitter_time,age_group
0,s_id0,8,female,13,7,44,"(0, 13]"
1,s_id1,15,male,37,55,55,"(13, 18]"
2,s_id2,18,female,23,31,32,"(13, 18]"
3,s_id3,25,male,17,20,52,"(18, 25]"
4,s_id4,12,male,10,40,37,"(0, 13]"


Here, the time spent on each social media platform is represented in the form of box plots.

**A box and whisker plot gives you a quick overview of a distribution.** It shows the five-number summary, i.e. the **minimum, first quartile, median, third quartile, and maximum.**
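As a rough illustration of the five-number summary, the sketch below computes it with Python's `statistics` module. The `five_number_summary` helper and the sample values are hypothetical, and the quartile method chosen here (`method="inclusive"`) is an assumption: different tools compute quartiles slightly differently, so the numbers may not match a given chart exactly.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" interpolates linearly between data points;
    # other tools may use a slightly different quartile definition
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical minutes spent by five students
minutes = [0, 10, 20, 30, 40]
print(five_number_summary(minutes))
```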


Get the individual social media time.

In [25]:
social_media.describe()

Unnamed: 0,age,facebook_time,instagram_time,twitter_time
count,200.0,200.0,200.0,200.0
mean,14.475,27.925,31.22,30.81
std,4.396135,19.199544,26.06624,19.748771
min,7.0,0.0,0.0,0.0
25%,11.0,11.0,14.0,13.75
50%,14.0,26.0,28.5,31.5
75%,18.0,44.25,45.0,47.0
max,25.0,73.0,240.0,120.0


Initialize the pygal object and give it a title. 

In [26]:
box_whisker_plot = pygal.Box(width = 640,
                              height = 360,
                              explicit_size = True)

box_whisker_plot.title = 'Social Media time'

In [27]:
box_whisker_plot.add('Facebook', 
                     social_media['facebook_time'])

display_chart(box_whisker_plot)

The whiskers mark the extreme ends of an attribute, i.e. the minimum and maximum. The line in the middle of the box is the median.

In [28]:
box_whisker_plot.add('Instagram', social_media['instagram_time'])
box_whisker_plot.add('Twitter', social_media['twitter_time'])

display_chart(box_whisker_plot)

**The values at 25%, 50% and 75% can be compared against the dataset description.** For example, the 25% value of Facebook time is 11, which is the bottom line of the box, and the minimum value of Facebook time is 0.

In [29]:
social_media.describe()

Unnamed: 0,age,facebook_time,instagram_time,twitter_time
count,200.0,200.0,200.0,200.0
mean,14.475,27.925,31.22,30.81
std,4.396135,19.199544,26.06624,19.748771
min,7.0,0.0,0.0,0.0
25%,11.0,11.0,14.0,13.75
50%,14.0,26.0,28.5,31.5
75%,18.0,44.25,45.0,47.0
max,25.0,73.0,240.0,120.0


Now that we have box plots, the `box_mode` option can be used to specify what the whiskers represent.

Here, the whiskers will be at the **first quartile minus 1.5 times the interquartile range** and **the third quartile plus 1.5 times the interquartile range.**

In [40]:
box_plot = pygal.Box(box_mode="1.5IQR",
                     width = 640,
                     height = 360,
                     explicit_size = True)

box_plot.title = 'Social Media time'

In [41]:
box_plot.add('Facebook', social_media['facebook_time'])
box_plot.add('Instagram', social_media['instagram_time'])
box_plot.add('Twitter', social_media['twitter_time'])

display_chart(box_plot)

The whiskers are at the first quartile minus 1.5 times the interquartile range (IQR) and the third quartile plus 1.5 times the interquartile range. The figures below use approximate quartiles (44.25 is taken as 44 and 13.75 as 13).

**Top whisker** = third quartile + 1.5 * IQR

**Facebook time** = 44 + 1.5(44 - 11) = 93.5

**Instagram time** = 45 + 1.5(45 - 14) = 91.5

**Twitter time** = 47 + 1.5(47 - 13) = 98

**Bottom whisker** = first quartile - 1.5 * IQR

**Facebook time** = 11 - 1.5(44 - 11) = -38.5

**Instagram time** = 14 - 1.5(45 - 14) = -32.5

**Twitter time** = 13 - 1.5(47 - 13) = -38

Note: the negative side is not visible in the pygal chart in "1.5IQR" mode.
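The whisker arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the formula, using the approximate quartiles quoted in the text; the `iqr_whiskers` helper is a hypothetical name and is not part of pygal.

```python
def iqr_whiskers(q1, q3, factor=1.5):
    """Return (lower, upper) whisker positions for the 1.5 IQR rule."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


# Approximate (Q1, Q3) pairs as used in the worked figures above
quartiles = {
    "Facebook": (11, 44),
    "Instagram": (14, 45),
    "Twitter": (13, 47),
}

for name, (q1, q3) in quartiles.items():
    lower, upper = iqr_whiskers(q1, q3)
    print(f"{name}: lower = {lower}, upper = {upper}")
```

All three lower whiskers come out negative, even though no student can spend a negative amount of time, because this mode places the whiskers by formula rather than at actual data points.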

In [32]:
social_media.describe()

Unnamed: 0,age,facebook_time,instagram_time,twitter_time
count,200.0,200.0,200.0,200.0
mean,14.475,27.925,31.22,30.81
std,4.396135,19.199544,26.06624,19.748771
min,7.0,0.0,0.0,0.0
25%,11.0,11.0,14.0,13.75
50%,14.0,26.0,28.5,31.5
75%,18.0,44.25,45.0,47.0
max,25.0,73.0,240.0,120.0


### Tukey Boxplot

The **"Tukey"** box plot is the **default box plot in most visualization frameworks**.

The "Tukey" box plot identifies outliers.

It follows a rule of thumb: **anything above the *third quartile + 1.5 IQR* is an outlier** and **anything below the *first quartile - 1.5 IQR* is an outlier**.

In the graph below, the upper whisker is at the maximum value that is still under the ***third quartile + 1.5 IQR***.

The lower whisker is at the minimum value that is still above the ***first quartile - 1.5 IQR***. Here all the lower whiskers are at zero, because **time cannot be negative in this case.**



In [42]:
box_plot = pygal.Box(box_mode="tukey",
                     width = 640,
                     height = 360,
                     explicit_size = True)

box_plot.title = 'Social Media time'

In [43]:
box_plot.add('Facebook', social_media['facebook_time'])
box_plot.add('Instagram', social_media['instagram_time'])
box_plot.add('Twitter', social_media['twitter_time'])

display_chart(box_plot)

The whiskers are the **lowest datum within 1.5 IQR of the lower quartile** and **the highest datum still within 1.5 IQR of the upper quartile**. The outliers are shown too.
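A minimal sketch of the Tukey rule is shown below. It is not pygal's implementation: the `tukey_whiskers` helper, the sample values and the choice of `statistics.quantiles(..., method="inclusive")` for the quartiles are all assumptions made for illustration.

```python
import statistics


def tukey_whiskers(values, factor=1.5):
    """Return (lower_whisker, upper_whisker, outliers) using Tukey's rule."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - factor * iqr
    high_fence = q3 + factor * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # Whiskers stop at actual data points, not at the fences themselves
    return inside[0], inside[-1], outliers


# Hypothetical minutes, with one unusually heavy user
minutes = [0, 5, 10, 12, 15, 18, 20, 100]
print(tukey_whiskers(minutes))
```

Unlike the "1.5IQR" mode, the whiskers here always land on values that occur in the data, which is why the lower whisker cannot go below zero for time measurements.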

In [44]:
social_media.describe()

Unnamed: 0,age,facebook_time,instagram_time,twitter_time
count,200.0,200.0,200.0,200.0
mean,14.475,27.925,31.22,30.81
std,4.396135,19.199544,26.06624,19.748771
min,7.0,0.0,0.0,0.0
25%,11.0,11.0,14.0,13.75
50%,14.0,26.0,28.5,31.5
75%,18.0,44.25,45.0,47.0
max,25.0,73.0,240.0,120.0


### Boxplot mode "Stdev" - Sample Standard Deviation

In [45]:
box_plot_std = pygal.Box(box_mode="stdev",
                         width = 640,
                         height = 360,
                         explicit_size = True)

box_plot_std.title = 'Social Media time'

In [46]:
box_plot_std.add('Facebook', social_media['facebook_time'])
box_plot_std.add('Instagram', social_media['instagram_time'])
box_plot_std.add('Twitter', social_media['twitter_time'])

display_chart(box_plot_std)

The box plots with standard deviation are plotted above.

Q1, Q2 and Q3 are the same as in the usual box plot, i.e. Q2 is the median, Q1 is the 25th percentile of the data and Q3 is the 75th percentile.

**Note: the standard deviation used here is the sample standard deviation.**

### The whiskers are the interesting point here

**Upper Whisker** = highest data point that is within "**Q2 + Sample Standard Deviation**"

**Lower Whisker** = lowest data point that is greater than "**Q2 - Sample Standard Deviation**"

For example, for Facebook time:

**Sample Standard Deviation** = 19.19

**Lower Whisker** = 7

**Q1** = 11

**Q2** = 26

**Q3** = 44.5

**Upper Whisker** = 45

**26 + 19.19 = 45.19**, which is close to the value 45 in the dataset. Similarly, **26 - 19.19 = 6.81**, which is just below the lower whisker at 7.
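The whisker rule described above can be sketched in plain Python as follows. The `stdev_whiskers` helper and the sample values are hypothetical, and the code only mirrors the description in the text rather than pygal's internals.

```python
import statistics


def stdev_whiskers(values):
    """Return (lower_whisker, upper_whisker) at median +/- one sample stdev."""
    data = sorted(values)
    median = statistics.median(data)
    sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    inside = [x for x in data if median - sd <= x <= median + sd]
    # Whiskers are the extreme data points inside the band around the median
    return inside[0], inside[-1]


# Hypothetical minutes: the median is 4.5 and the sample stdev is about 2.14
minutes = [2, 4, 4, 4, 5, 5, 7, 9]
print(stdev_whiskers(minutes))
```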
 
 
 
### Boxplot mode "pstdev" - Population Standard Deviation
 

In [47]:
box_plot_psd = pygal.Box(box_mode="pstdev",
                         width = 640,
                         height = 360,
                         explicit_size = True)

box_plot_psd.title = 'Social Media time'

In [48]:
box_plot_psd.add('Facebook', social_media['facebook_time'])
box_plot_psd.add('Instagram', social_media['instagram_time'])
box_plot_psd.add('Twitter', social_media['twitter_time'])

display_chart(box_plot_psd)

The plot is the same as above; the only difference is that it uses the "**Population Standard Deviation**".

**Upper Whisker** = highest data point that is within "**Q2 + Population Standard Deviation**"

**Lower Whisker** = lowest data point that is greater than "**Q2 - Population Standard Deviation**"
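The two standard deviations differ only in the divisor used for the variance (n - 1 for the sample version, n for the population version). The sketch below compares them on made-up values and then applies the same relation to the Facebook figure from `describe()`, which reports the sample standard deviation; the variable names are illustrative only.

```python
import math
import statistics

minutes = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

sample_sd = statistics.stdev(minutes)       # divides by n - 1
population_sd = statistics.pstdev(minutes)  # divides by n
print(sample_sd, population_sd)

# Convert the sample standard deviation of Facebook time (200 students)
# into the corresponding population standard deviation.
n = 200
facebook_sample_sd = 19.199544
facebook_population_sd = facebook_sample_sd * math.sqrt((n - 1) / n)
print(round(facebook_population_sd, 2))
```

With 200 students the two values differ by only a few hundredths of a minute, so the whiskers in the two charts are expected to look nearly identical.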