## Module 3 Practice

### This is a Python Notebook

This notebook shows basic plots made with the **ggplot**, **matplotlib**, and **seaborn** packages.

The data used in the notebook is taken from [here (external link)](https://www.causeweb.org/cause/research/literature/sexual-activity-and-lifespan-male-fruitflies-dataset-gets-attention). 


## About the data set
From the readme file, a cost of increased reproduction in terms of reduced longevity has been shown for female fruitflies, but not for males.
The flies used were an outbred stock.
Sexual activity was manipulated by supplying individual males with one or eight receptive virgin females per day.  
The longevity of these males was compared with that of two control types.
The first control consisted of two sets of individual males kept with one or eight newly inseminated females.
Newly inseminated females will not usually remate for at least two days, and thus served as a control for any effect of competition with the male for food or space.

The second control was a set of individual males kept with no females. 
There were 25 males in each of the five groups, which were treated identically in number of anaesthetizations (using CO2) and provision of fresh food medium.

The dataset has 125 observations and 5 variables, plus an ID column.

In [None]:
import pandas as pd

fruitfly_data = pd.read_csv('/dsa/data/all_datasets/fruitfly/fruitfly.txt', sep=" ",
                            names=["ID", "Partners", "Type", "Longevity", "Thorax", "Sleep"])
fruitfly_data.head(5)

In [None]:
# Check the descriptive statistics for the dataset
fruitfly_data.describe()

In [None]:
# Save the distribution of variable Partners in a variable called "counts"
counts = fruitfly_data.Partners.value_counts()
counts

In [None]:
# Create a dataframe "No_of_Partners" to make a bar graph for the Partners variable.
# No_of_Partners has two columns: 'Partners' holds the number of partners and 'count' the respective frequency.
# Building it from the index of "counts" keeps each label paired with its own count,
# instead of relying on a hard-coded order such as [8, 1, 0].
No_of_Partners = counts.rename_axis('Partners').reset_index(name='count')
No_of_Partners

---

## Grammar of Graphics


In [None]:
from ggplot import *

# Warnings expected

In [None]:
# Create a bar chart for Partners variable using ggplot package.
ggplot(aes(x="Partners", weight="count"), No_of_Partners) + geom_bar()

In [None]:
# Check the dtype of the Partners column (type() would only report that it is a pandas Series)
fruitfly_data.Partners.dtype

Convert the dtype of Partners from integer to object so that its values are treated as discrete categories and can be used to color the data points.

In [None]:
fruitfly_data['Partners'] = fruitfly_data['Partners'].astype(object)
# Confirm that the column now has the object dtype
fruitfly_data.Partners.dtype

In [None]:
from ggplot import *

# Scatter plot of Longevity against Thorax, colored by the levels of Partners and faceted by Partners and Type.
p = ggplot(fruitfly_data, aes(x='Longevity', y='Thorax',colour="Partners"))
p + geom_point(aes(size=10)) + facet_grid("Partners", "Type")

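## Matplotlib

Matplotlib is the most widely used plotting library for Python; pandas and seaborn both build on it.
The code below uses the matplotlib package to draw a histogram.
As you know from statistics, a histogram is a type of bar chart in which the height of each bar is the frequency of the values that fall into the corresponding bin.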
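To make the link between a histogram and a bar chart concrete, the sketch below bins a handful of values by hand and prints the resulting frequencies, which are the bar heights a histogram would draw. The `bin_counts` helper and the sample values are hypothetical and are not taken from the fruit fly data. The sketch assumes equal-width bins with a closed last bin, which mirrors NumPy's default convention, and it assumes the values are not all identical.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals.

    Assumes max(values) > min(values), so the bin width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right so the maximum value is included
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical longevity values in days (illustrative only)
sample = [16, 19, 21, 35, 40, 42, 44, 56, 60, 63, 70, 97]
counts = bin_counts(sample, bins=3)
print(counts)  # [6, 4, 2]
```

Each count is the height of one bar; increasing `bins` gives narrower intervals and a more detailed, but noisier, picture of the distribution.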

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Histogram for Longevity variable with 20 bins
plt.hist(fruitfly_data.Longevity, bins=20)
plt.xlabel('Longevity')
plt.ylabel('count')
plt.title('Histogram of Longevity')

## Seaborn

##### From the documentation

Seaborn is a library for making attractive and informative statistical graphics in Python. 
It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.

Some of the features that seaborn offers are:
  * Several built-in themes that improve on the default matplotlib aesthetics
  * Tools for choosing color palettes to make beautiful plots that reveal patterns in your data
  * Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data
  * Tools that fit and visualize linear regression models for different kinds of independent and dependent variables
  * Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices
  * A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate
  * High-level abstractions for structuring grids of plots that let you easily build complex visualizations

...

Seaborn should be thought of as a complement to matplotlib, not a replacement for it. 
When using seaborn, it is likely that you will often invoke matplotlib functions directly to draw simpler plots already available through the pyplot namespace. 
Further, while the seaborn functions aim to make plots that are reasonably “production ready” (including extracting semantic information from Pandas objects to add informative labels), full customization of the figures will require a sophisticated understanding of matplotlib objects.

https://seaborn.pydata.org/introduction.html

  * [Local Mirror](https://indigo.sgn.missouri.edu/static/mirror_sites/seaborn.pydata.org/introduction.html)


### Faceted Exploration

Faceted exploration will _automagically_ lay out multiple plots, allowing the comparison of different combinations of factors in the data set.
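Conceptually, faceting splits the data into one subset per combination of the row and column factors and draws the same plot for each subset. The sketch below shows only that grouping step in plain Python; the records are hypothetical and are not rows of the fruit fly data.

```python
from collections import defaultdict

# Hypothetical records: (Partners, Type, Longevity); values are illustrative only
records = [
    (0, 9, 40), (0, 9, 37),
    (1, 0, 46), (1, 1, 21),
    (8, 0, 35), (8, 1, 16), (8, 1, 19),
]

# One panel per (row factor, column factor) combination,
# each holding only the observations that belong to it
panels = defaultdict(list)
for partners, fly_type, longevity in records:
    panels[(partners, fly_type)].append(longevity)

for (partners, fly_type), values in sorted(panels.items()):
    print(f"Partners={partners}, Type={fly_type}: {values}")
```

A faceting tool performs this split for you and then arranges the per-subset plots in a grid with shared axes, which is what makes the panels directly comparable.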

In [None]:
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

g = sns.FacetGrid(fruitfly_data, row="Partners", col="Type")
g.map(plt.hist, "Longevity")

The facets work with other plot types as well.

In [None]:

g = sns.FacetGrid(fruitfly_data, row="Partners", col="Type")
g.map(plt.scatter, "Longevity", "Sleep")

## Bar charts

Bar charts use horizontal or vertical bars to show comparisons among categories.

One axis of the chart shows the categories being compared, and the other axis represents a quantitative value. 
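As a rough illustration of the two axes, the sketch below counts how often each category occurs and prints a text-only horizontal bar per category. The list of partner counts is hypothetical and much shorter than the real column.

```python
from collections import Counter

# Hypothetical Partners values (illustrative only)
partners = [0, 1, 8, 8, 1, 0, 8, 1, 8, 1]

# Category axis: the distinct values; quantitative axis: how often each occurs
counts = Counter(partners)

for category in sorted(counts):
    print(f"{category}: {'#' * counts[category]}")
```

The categories play the role of one axis and the bar lengths the role of the other, which is the same information a plotted bar chart encodes.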


In [None]:
# Using the No_of_Partners dataframe computed above
sns.barplot(x="Partners", y="count",data=No_of_Partners)

## Box and Whisker Plots

A box plot is a convenient way of graphically depicting common descriptive statistics of data divided into categories. Box plots may also have lines extending from the boxes (**whiskers**) that indicate variability outside the upper and lower quartiles, hence the term box-and-whisker plot.

![Boxplot_vs_PDF](../images/Boxplot_vs_PDF.png)
###### from: By Jhguch at en.wikipedia, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=14524285
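The statistics behind a box plot can be computed by hand. The sketch below uses a small hypothetical sample (not the fruit fly data) and the common 1.5 × IQR rule for the whiskers, which is also the default in matplotlib and seaborn. The quartile interpolation method is an assumption of this sketch, so the exact quartile values may differ slightly from what a plotting library reports.

```python
import statistics

# Hypothetical longevity values in days (illustrative only)
sample = [16, 19, 21, 35, 40, 42, 44, 56, 60, 63, 70, 120]

# The box spans q1..q3 and the line inside it marks the median
q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Whiskers reach the most extreme data points within 1.5 * IQR of the box;
# anything beyond the fences is drawn as an individual outlier point
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = min(v for v in sample if v >= lower_fence)
whisker_high = max(v for v in sample if v <= upper_fence)
outliers = [v for v in sample if v < lower_fence or v > upper_fence]

print(q1, median, q3)             # 31.5 43.0 60.75
print(whisker_low, whisker_high)  # 16 70
print(outliers)                   # [120]
```

Here the value 120 lies above the upper fence of 104.625, so it would appear as a separate point rather than stretching the whisker.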


In [None]:
# Draw a boxplot of Longevity for each level of Type
sns.boxplot(x="Type", y="Longevity", data=fruitfly_data, palette="PRGn")

In [None]:
# Draw a nested boxplot of Longevity by Partners, split by Type
sns.boxplot(x="Partners", y="Longevity",hue="Type",  data=fruitfly_data, palette="PRGn")

**Notice** that by adding the `hue` parameter as a new visual variable, the plot splits each Partners group into subsets by Type.

Below, we will turn the box and whiskers sideways.

In [None]:
# Draw the same nested boxplot horizontally, with Longevity on the x-axis
sns.boxplot(x="Longevity",y="Partners",hue="Type", orient="h",  data=fruitfly_data, palette="PRGn")