# Distributions / Data Visualization
- Definitions
    - Sample / Population
    - Statistics / Parameters 
        - `.describe()` and other aggregation functions
- `import plotly.express as px`
    - VS matplotlib(pandas), seaborn, ggplot (R but Python), shiny, bokeh, [The Pudding](https://pudding.cool) (`D3.js`), https://informationisbeautiful.net ("Data Journalism")
    - `px.histogram` (`nbins`) 
        - skew, modality, heavy tailed, outliers... 
        - Relates ***statistics*** $\bar x$ and ***sample variance*** $s^2$ (***sample standard deviation*** $s$)
            - Corresponding ***parameters*** $\mu$ and $\sigma^2$ (and $\sigma$)
            - Parameters / Populations `from scipy import stats`
    - Kernel Density Estimation: `import plotly.figure_factory as ff`
    - `px.box` and `pd.melt` 
        - How boxplots define ***outliers***
        - `px.box` wants long-format data (hence `pd.melt`), whereas `ff.create_distplot` takes a list of arrays


    

In [11]:
import pandas as pd
ab = pd.read_csv("../../amazonbooks.csv", encoding="ISO-8859-1")#.dropna()
print(ab.shape)
ab.isnull().sum()

(325, 13)


Title            0
Author           1
List Price       1
Amazon Price     0
Hard_or_Paper    0
NumPages         2
Publisher        1
Pub year         1
ISBN-10          0
Height           4
Width            5
Thick            1
Weight_oz        9
dtype: int64

In [13]:
ab_noNaN = ab.drop(['Weight_oz','Width','Height'], axis=1).dropna()
ab_noNaN['Pub year'] = ab_noNaN['Pub year'].astype(int)
ab_noNaN['NumPages'] = ab_noNaN['NumPages'].astype(int)
ab_noNaN['Hard_or_Paper'] = ab_noNaN['Hard_or_Paper'].astype("category")
print(ab_noNaN.shape)
ab_noNaN.dtypes

(319, 10)


Title              object
Author             object
List Price        float64
Amazon Price      float64
Hard_or_Paper    category
NumPages            int64
Publisher          object
Pub year            int64
ISBN-10            object
Thick             float64
dtype: object

In [5]:
ab_noNaN
ab_noNaN.Thick

0      0.8
1      0.7
2      0.3
3      1.6
4      1.4
      ... 
320    1.1
321    0.7
322    0.7
323    0.9
324    1.0
Name: Thick, Length: 319, dtype: float64

In [15]:
import plotly.express as px
px.histogram(ab_noNaN, x='Thick', nbins=4)

In [16]:
px.histogram(ab_noNaN, x='Thick')
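
The effect of the bin count can also be checked numerically. The sketch below uses `np.histogram` on a hypothetical stand-in sample (simulated, not the book data) to show how the same 319 values are summarized by 4 bins versus 20 bins. Note that `np.histogram` uses exactly the number of bins requested, while Plotly may pick "nicer" bin edges, so the bars in the figures above need not match these counts exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a thickness-like variable (NOT the book data)
x = rng.normal(loc=1.0, scale=0.35, size=319)

for bins in (4, 20):
    counts, edges = np.histogram(x, bins=bins)
    # Equal-width bins: the width shrinks as the number of bins grows
    print(bins, "bins, each of width", round(edges[1] - edges[0], 3))
    print(counts)
```

With few bins the shape is coarse (skew and modality are hard to see); with many bins the counts per bin get small and noisy.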

In [8]:
px.histogram(ab_noNaN, x='Pub year')

In [14]:
px.histogram(ab_noNaN, x='Hard_or_Paper')

In [17]:
px.histogram(ab_noNaN, x='Thick')

- Sample: thickness of 319 amazon.com book listings, 2011
    - Statistic: a function of data characterizing the data
        - sample mean 

            $\bar x = \frac{1}{n} \sum_{i=1}^n x_i \longrightarrow \text{ estimates population mean parameter } \mu$  
            
        - sample variance  
        
           $s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar x)^2 \longrightarrow \text{ estimates population variance parameter } \sigma^2$
        - sample standard deviation 
        
           $s = \sqrt{s^2}$
      
      
      
**Statistics estimate parameters**      
      
- Population: ALL amazon.com book listings 2011
    - Parameter: the value the statistic would take if it were calculated on the ENTIRE population
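
The formulas for $\bar x$, $s^2$ and $s$ can be checked by hand. The sketch below uses only the first five `Thick` values printed earlier (a toy subset, chosen to keep the arithmetic easy to follow) and compares the hand-computed values with the pandas methods. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First five Thick values shown earlier; a toy subset, not the full sample
x = np.array([0.8, 0.7, 0.3, 1.6, 1.4])
n = len(x)

x_bar = x.sum() / n                        # sample mean
s2 = ((x - x_bar) ** 2).sum() / (n - 1)    # sample variance (n-1 denominator)
s = np.sqrt(s2)                            # sample standard deviation

# pandas .var() and .std() default to ddof=1, matching the formulas above
series = pd.Series(x)
print(x_bar, series.mean())
print(s2, series.var())
print(s, series.std())

# numpy defaults to ddof=0 (divide by n), so ddof=1 must be requested explicitly
print(np.var(x), np.var(x, ddof=1))
```

The `ddof=1` written explicitly in `ab_noNaN.Thick.std(ddof=1)` is therefore already the pandas default; it matters when switching to `np.std` or `np.var`.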
    
  

In [19]:
ab_noNaN['Pub year'].max()

2011

In [21]:
ab_noNaN.Thick.mean(), ab_noNaN.Thick.std(ddof=1) # ab_noNaN.Thick.var()



(0.9034482758620691, 0.365261201002392)

In [23]:
from scipy import stats
mu,sd = 1,.35

my_normal_theoretical_population = stats.norm(loc=mu, scale=sd)

In [24]:
my_theoretical_sample = my_normal_theoretical_population.rvs(size = 319)

In [25]:
df = pd.DataFrame({'my_theoretical_sample': my_theoretical_sample}) 
px.histogram(df, x = "my_theoretical_sample")
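
Because this population is specified by hand, its parameters ($\mu=1$, $\sigma=0.35$) are known exactly, which makes it possible to watch statistics estimate parameters. The sketch below draws samples of increasing size and prints $\bar x$ and $s$ for each. The sample sizes and the fixed `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

mu, sd = 1, 0.35
population = stats.norm(loc=mu, scale=sd)

for n in (10, 319, 100_000):
    # random_state fixes the seed so the run is reproducible
    sample = population.rvs(size=n, random_state=0)
    # x-bar and s should drift toward mu and sd as n grows
    print(n, sample.mean(), sample.std(ddof=1))
```

For the real book data there is no such check: the population parameters are unknown, and the statistics are the only estimates available.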

In [30]:
import numpy as np
support = np.linspace(-1,3,100)
df = pd.DataFrame({'x':support, 
                   'density':my_normal_theoretical_population.pdf(support)})
px.line(df, x='x', y='density')

In [33]:
import plotly.figure_factory as ff

hist_data = [my_theoretical_sample, ab_noNaN.Thick]

group_labels = ['My theoretical sample', 
                'My actually observed sample']

# (assuming the population is N(mu=1, sigma=0.35))
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2)
fig.show()
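
The smooth curves drawn over the histograms are kernel density estimates. A minimal sketch of the idea, on a hypothetical simulated sample: place a small normal density on every data point and average them. It is compared against `scipy.stats.gaussian_kde` with its default bandwidth; whether `ff.create_distplot` uses exactly these settings is an assumption here, not something verified.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=0.35, size=319)   # hypothetical sample
grid = np.linspace(-1, 3, 401)

kde = stats.gaussian_kde(x)          # default bandwidth (Scott's rule)
h = kde.factor * x.std(ddof=1)       # standard deviation of each kernel

# Manual KDE: average of normal pdfs centred on every data point
manual = stats.norm.pdf(grid[:, None], loc=x[None, :], scale=h).mean(axis=1)

print(np.allclose(manual, kde(grid)))           # the two curves agree
area = (manual * (grid[1] - grid[0])).sum()     # Riemann sum of the density
print(area)                                     # close to 1, as a density should be
```

A smaller `h` gives a wigglier curve and a larger `h` a smoother one, playing the same role for a KDE that the bin width plays for a histogram.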

In [34]:
ab_noNaN['my_theoretical_sample'] = my_theoretical_sample
ab_noNaN

Unnamed: 0,Title,Author,List Price,Amazon Price,Hard_or_Paper,NumPages,Publisher,Pub year,ISBN-10,Thick,my_theoretical_sample
0,"1,001 Facts that Will Scare the S#*t Out of Yo...",Cary McNeal,12.95,5.18,P,304,Adams Media,2010,1605506249,0.8,1.106623
1,21: Bringing Down the House - Movie Tie-In: Th...,Ben Mezrich,15.00,10.20,P,273,Free Press,2008,1416564195,0.7,0.972245
2,100 Best-Loved Poems (Dover Thrift Editions),Smith,1.50,1.50,P,96,Dover Publications,1995,486285537,0.3,0.977264
3,1421: The Year China Discovered America,Gavin Menzies,15.99,10.87,P,672,Harper Perennial,2008,61564893,1.6,1.471118
4,1493: Uncovering the New World Columbus Created,Charles C. Mann,30.50,16.77,P,720,Knopf,2011,307265722,1.4,1.483449
...,...,...,...,...,...,...,...,...,...,...,...
320,Where the Sidewalk Ends,Shel Silverstein,18.99,12.24,H,192,HarperCollins,2004,60572345,1.1,0.437138
321,White Privilege,Paula S. Rothenberg,27.55,27.55,P,160,Worth Publishers,2011,1429233443,0.7,0.772833
322,Why I wore lipstick,Geralyn Lucas,12.95,5.18,P,224,St Martin's Griffin,2005,031233446X,0.7,1.030440
323,"Worlds Together, Worlds Apart: A History of th...",Robert Tignor,97.50,97.50,P,480,W. W. Norton & Company,2010,393934942,0.9,1.084754


In [36]:
# note: the function is px.box; there is no px.boxplot
px.box(ab_noNaN, x='Thick')

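- center line: 50th percentile (median)
- box edges: 25th / 75th percentiles (quartile 1 / quartile 3)
- whiskers: extend to the most extreme data points within $1.5 \times$ IQR (inter-quartile range, Q3 - Q1) of the box edges
- bonus: this gives a convenient definition of ***outliers***, namely any points beyond the whiskers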
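
The boxplot rules above can be sketched directly with numpy. The data below are hypothetical values picked so that one point falls outside the upper fence. Plotly's quartile convention may differ slightly from `np.percentile`'s default linear interpolation, so the figure and this calculation can disagree in the last decimal places.

```python
import numpy as np

# Hypothetical thickness-like values, already sorted for readability
x = np.array([0.3, 0.6, 0.7, 0.7, 0.8, 0.9, 1.0, 1.1, 1.4, 1.6, 2.1])

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1                          # inter-quartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = x[(x >= lower_fence) & (x <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high)
print(outliers)
```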

In [37]:
px.box(ab_noNaN, x='my_theoretical_sample')

In [41]:
# wide format -> long (a.k.a. tall/narrow) format: one row per (variable, value) pair

samples_actual_and_theoretical = pd.melt(ab_noNaN, value_vars=['Thick','my_theoretical_sample'])
px.box(samples_actual_and_theoretical, x='value',
       y='variable')
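
What `pd.melt` does is easiest to see on a tiny table. The sketch below uses a hypothetical three-row DataFrame with the same two column names; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one column per sample
wide = pd.DataFrame({"Thick": [0.8, 0.7, 0.3],
                     "my_theoretical_sample": [1.1, 0.9, 1.0]})

# Long format: a 'variable' column naming the source column
# and a 'value' column holding the measurement
long = pd.melt(wide, value_vars=["Thick", "my_theoretical_sample"])
print(long)
```

The three rows and two columns become six rows: the `variable` column is what `px.box` uses on the `y` axis to draw one box per sample.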

In [None]:
# x cannot be passed twice; px.box accepts a list of columns (wide-form input) instead
px.box(ab_noNaN, x=['Thick', 'my_theoretical_sample'])