# Python Companion to *Visualizing Data* by William S. Cleveland with plotnine and pandas
## Chapter 2 - Univariate Data - Section 2.6
#### Dataset: Fusion Times for Stereograms

#### Contents

+ Setup, Data Preparation
+ Introduction
+ 2.6 Log Transformations

---

# Setup, Data Preparation

### Imports

In [1]:
# Show which Python interpreter (and therefore which environment) this notebook runs in
!which python

/home/david/mambaforge/envs/cleveland/bin/python
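
To record package versions alongside the interpreter path, a small helper along these lines could be used. This is a minimal sketch, not part of the original notebook: `package_versions` is a hypothetical name, and the package list is only an assumption based on the imports used here.

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to the string 'not installed'.
    (Hypothetical helper, for illustration only.)
    """
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


# Assumed list: the third-party packages imported in this notebook.
versions = package_versions(["pandas", "numpy", "plotnine", "rpy2", "scipy"])
for name, ver in versions.items():
    print(f"{name:10s} {ver}")
```

Reading versions from package metadata avoids importing each package just to inspect `__version__`.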


In [2]:
import os
import math
from pathlib import Path

import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import plotnine
from plotnine import *
# import matplotlib.pyplot as plt
import rpy2
from scipy.stats import norm, probplot

### Defaults

In [3]:
# plotnine.themes.theme_set(theme_bw())
plotnine.options.current_theme=theme_bw()
plotnine.options.figure_size=(4, 4)
pd.set_option('display.max_columns', 100)

### Data Directory

In [4]:
%load_ext dotenv
%dotenv
PROJECT_DIR = Path(os.environ['PROJECT_DIR'])  # raises KeyError if PROJECT_DIR is not set in .env
DATA_DIR = PROJECT_DIR / 'book' / 'data'
DATA_DIR_STR = str(DATA_DIR)
DATA_DIR

PosixPath('/media/david/T7/code/cleveland-visualizing-data/book/data')

## Get the Data from the R `ggcleveland` package

In [5]:
%load_ext rpy2.ipython



In [6]:
%%R -i DATA_DIR_STR
# Export the fusion dataset shipped with ggcleveland to CSV so pandas can read it
library(ggcleveland)
filepath <- file.path(DATA_DIR_STR, "fusion.csv")
write.csv(fusion, filepath, row.names=FALSE)

## Load the Data and setup datatypes

In [7]:
# Load the exported CSV: `time` is the fusion time, `nv.vv` is the group label
df_orig = pd.read_csv(DATA_DIR / "fusion.csv")
df_orig.head()

,time,nv.vv
0,47.20001,NV
1,21.99998,NV
2,20.39999,NV
3,19.70001,NV
4,17.4,NV


## Profile the Data

In [8]:
df_orig.isna().sum()

time     0
nv.vv    0
dtype: int64

In [9]:
profile = df_orig.groupby(by='nv.vv').describe()
profile

,time,time,time,time,time,time,time,time
nv.vv,count,mean,std,min,25%,50%,75%,max
NV,43.0,8.560465,8.085412,1.7,3.1,6.9,10.0,47.20001
VV,35.0,5.551429,4.801739,1.0,2.15,3.6,6.85,19.70001
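
In both groups the mean exceeds the median and the maximum lies far above the upper quartile. That points to right-skewed fusion times and is part of the motivation for the log transformation in Section 2.6. The sketch below shows how the same profile could be computed on a log base 2 scale. It makes two assumptions: the `toy` frame holds a few made-up values for illustration (not the real data), and `log2_time` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_orig: illustrative values only, NOT the real fusion data.
toy = pd.DataFrame({
    "time": [1.0, 2.0, 4.0, 16.0, 2.0, 4.0, 8.0, 32.0],
    "nv.vv": ["VV"] * 4 + ["NV"] * 4,
})

# On a log base 2 scale, an increase of one unit means the time has doubled.
toy["log2_time"] = np.log2(toy["time"])

# Same per-group profile as for the raw times, but on the transformed values.
log_profile = toy.groupby("nv.vv")["log2_time"].describe()
print(log_profile[["count", "mean", "50%", "max"]])
```

With the real data, `df_orig` would take the place of `toy`.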




---

# Introduction
About the data:
+ [fusion: Fusion times for random dot stereograms in ggcleveland](https://cran.r-project.org/web/packages/ggcleveland/ggcleveland.pdf)
+ [Stereogram: DASL - The Data and Story Library](https://dasl.datadescription.com/datafiles/?_sf_s=Stereogram&_sfm_cases=4+59943)
+ Experiment to determine the effect of prior knowledge of an object's form on fusion time.

## Analysis Question
Is there a difference in fusion time between the two groups, NV and VV?
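
One way to start on this question, in the spirit of the quantile-based displays in the book, is to pair up matching quantiles of the two groups. The sketch below is an illustration under stated assumptions: the `toy` frame contains made-up values (not the real data), `quantiles_at` is a hypothetical helper, and quantiles use the f-value convention (i - 0.5) / n with linear interpolation for the larger group.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_orig: illustrative values only, NOT the real fusion data.
toy = pd.DataFrame({
    "time": [2.0, 4.0, 8.0, 16.0, 32.0, 1.0, 2.0, 4.0],
    "nv.vv": ["NV"] * 5 + ["VV"] * 3,
})


def quantiles_at(values, f):
    """Interpolate quantiles of `values` at the f-values `f`.

    The sorted values are treated as the (i - 0.5) / n quantiles.
    (Hypothetical helper, for illustration only.)
    """
    x = np.sort(np.asarray(values, dtype=float))
    f_x = (np.arange(1, len(x) + 1) - 0.5) / len(x)
    return np.interp(f, f_x, x)


nv = toy.loc[toy["nv.vv"] == "NV", "time"]
vv = toy.loc[toy["nv.vv"] == "VV", "time"]

# Use the f-values of the smaller group so its quantiles are its sorted values.
n = min(len(nv), len(vv))
f = (np.arange(1, n + 1) - 0.5) / n

qq = pd.DataFrame({"f": f, "nv": quantiles_at(nv, f), "vv": quantiles_at(vv, f)})
# Difference of the paired quantiles on the log base 2 scale.
qq["log2_ratio"] = np.log2(qq["nv"] / qq["vv"])
print(qq)
```

A positive `log2_ratio` means the NV quantile is the larger of the pair. If the values were roughly constant across `f`, that would suggest the groups differ by a multiplicative factor rather than an additive one.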

---

# Section 2.6 Log Transformations

## 2.6 - Fig 2.19 Quantile Plots