# Chapter overview

Structure:

* Samples versus populations
* Comments on software
* R Basics
  * Entering data
  * Arithmetic operations
  * Storage types and modes
  * Identifying and analyzing special cases
* R packages
* Access to data used in this book
* Accessing more detailed answers to exercises
* Exercises

Comments:

* Statistical methods and techniques impact a wide variety of fields
* Description and summary of events
* Example: Rats placed in ozone - Weight change?
  * What averages would we have for larger sample sizes?

In this book we will:

* Obtain a conceptual foundation for understanding when commonly used techniques perform well, and when they don't
* Not dig deep into the mathematical underpinnings
* Focus on concepts and principles

# Samples versus populations

How well does a *sample* represent the *population*?

* *Population mean* - Average over all rats.
  * The problem is that it is hard to measure every individual
* How well does the sample extrapolate to the full population?
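
To make the sample versus population question concrete, here is a minimal sketch in R. The population is simulated: the distribution, the seed, and names such as `population` and `sample_sizes` are assumptions for illustration, not data from the book. It draws one sample at each of several sizes and compares the sample mean with the population mean.

```r
# Hypothetical population: simulated weight changes for 100,000 rats.
# The distribution and its parameters are made up for illustration.
set.seed(1)
population <- rnorm(100000, mean = 10, sd = 20)
pop_mean <- mean(population)

# Draw one random sample at each size and record its mean.
sample_sizes <- c(10, 100, 1000, 10000)
sample_means <- sapply(sample_sizes, function(n) mean(sample(population, n)))

# Absolute error of each sample mean relative to the population mean.
data.frame(n = sample_sizes,
           sample_mean = sample_means,
           abs_error = abs(sample_means - pop_mean))
```

Larger samples tend to land closer to the population mean, although any single small sample can be close or far by chance.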

Example: Small set of depressed individuals. Does their average generalize?

Example: Norman Conquest, year 1100. Coins produced per day. We want to generalize to all coins.

## Three fundamental components of statistics

* **Design** - Procedure for planning experiments so that the data yield valid and objective conclusions. **A well-chosen experimental design *maximizes* the information obtained for a given amount of experimental effort.**
* **Description** - Numerical and graphical methods for summarizing data.
* **Inference** - Making predictions or generalizations about a population based on sample observations.
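
The difference between description and inference can be sketched in a few lines of R. The data below are simulated (the variable `weight_change` and its values are hypothetical). The one-sample t-test is used only as a familiar illustration of an inferential method, not as the technique the book recommends.

```r
# Hypothetical sample: simulated weight changes for 20 rats (not real data).
set.seed(2)
weight_change <- rnorm(20, mean = 10, sd = 20)

# Description: summarize the sample itself.
sample_mean <- mean(weight_change)
sample_sd <- sd(weight_change)

# Inference: a 95% confidence interval for the population mean,
# here from a one-sample t-test (which assumes roughly normal data).
ci <- t.test(weight_change)$conf.int

sample_mean
sample_sd
ci
```

The mean and standard deviation describe only the 20 observations. The interval is a statement about the unobserved population.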

Health example: Factors affecting health.

* Fat amount is higher in North America than in rural China.
* Death rate is much higher in North America.
* This is a descriptive study - it describes the data but does not by itself allow inference.
* We would also need to consider perturbing factors (weight, age, ...)

The inferential methods covered later are designed to assess how well generalizations from a sample to the population hold.

# Comments on Software

* R - Best and most up-to-date. It has it all.
* SAS - Good package providing power and flexibility. Not always cutting edge.
* Minitab - Fairly simple to use and reasonably flexible. Standard methods (developed before 1960) are available. Modern methods can require special code.
* SPSS - Lacks flexibility. Modern methods are hard to apply.
* Excel - Easy to use but limited. Not adequately maintained.

# R Basics

## Entering data

In [32]:
blob = 5
blob

In [33]:
blob = c(2,4,6,8,12)
blob

In [34]:
length(blob)

The **scan** command is used to read a simple string of values, for example from a file.

In [35]:
# ice = scan(file.choose())  # Nice thing! Not working on AWS though.
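
Where an interactive file picker is not available, `scan` can also read from an inline string through its `text` argument. A minimal sketch follows; the variable name `values` and the input string are made up for illustration.

```r
# scan() reads whitespace-separated values; by default it expects numbers.
# The `text` argument supplies the input directly instead of a file.
values <- scan(text = "2 4 6 8 12", quiet = TRUE)
values
length(values)
```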

In [36]:
blob

We can remove objects.

In [37]:
rm(blob)
# blob

Reading tables (now we are talking!)

In [38]:
quake <- read.table("quake.dat", skip=1, sep=" ")
head(quake)

V1,V2,V3
7.8,360,130
7.7,400,110
7.5,M,27
M,70,24
7.0,40,7
6.9,50,15


Get first column using name or index.

In [39]:
quake$V1

In [40]:
quake[,1]

Get first row

In [41]:
quake[1,]

V1,V2,V3
7.8,360,130


In [42]:
quake_annot <- read.table("quake.dat", sep=" ", header=TRUE)
head(quake_annot)

magnitude,length,duration
7.8,360,130
7.7,400,110
7.5,M,27
M,70,24
7.0,40,7
6.9,50,15


Print labels, for both rows and columns

In [43]:
labels(quake_annot)
str(quake_annot)

'data.frame':	16 obs. of  3 variables:
 $ magnitude: Factor w/ 13 levels "5.8","5.9","6.1",..: 12 11 10 13 9 8 7 7 6 5 ...
 $ length   : Factor w/ 15 levels "14","15","16",..: 8 10 15 14 9 12 3 1 5 6 ...
 $ duration : int  130 110 27 24 7 15 8 7 15 6 ...
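
The `str` output shows `magnitude` and `length` stored as factors because each column contains the non-numeric entry `M`. The sketch below reproduces this with the first six rows shown above, passed inline via `text` so that it runs without `quake.dat` (the names `quake_txt`, `quake_raw`, and `magnitude_num` are hypothetical). From R 4.0.0 onward, `read.table` reads such columns as character rather than factor by default.

```r
# First rows of the quake table, supplied inline instead of from quake.dat.
quake_txt <- "magnitude length duration
7.8 360 130
7.7 400 110
7.5 M 27
M 70 24
7.0 40 7
6.9 50 15"

quake_raw <- read.table(text = quake_txt, header = TRUE)

# Columns containing "M" cannot be numeric; depending on the R version
# they are read as factor (before R 4.0.0) or character.
sapply(quake_raw, class)

# Converting via character turns "M" into NA (the coercion warning is suppressed).
magnitude_num <- suppressWarnings(as.numeric(as.character(quake_raw$magnitude)))
magnitude_num
```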


In [46]:
quake_na <- read.table("quake.dat", na.strings="M", header=TRUE)
head(quake_na)

magnitude,length,duration
7.8,360.0,130
7.7,400.0,110
7.5,,27
,70.0,24
7.0,40.0,7
6.9,50.0,15
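
Once `M` is read as `NA`, the missing values can be located and handled explicitly. The following sketch uses the six rows shown above, supplied inline (the names `quake_txt` and `complete_rows` are assumptions made for this example).

```r
# Same six rows as in the table, read with "M" treated as missing.
quake_txt <- "magnitude length duration
7.8 360 130
7.7 400 110
7.5 M 27
M 70 24
7.0 40 7
6.9 50 15"

quake_na <- read.table(text = quake_txt, header = TRUE, na.strings = "M")

# Count the missing entries in each column.
colSums(is.na(quake_na))

# Most summaries return NA when any value is missing, unless told to drop them.
mean(quake_na$magnitude)
mean(quake_na$magnitude, na.rm = TRUE)

# Keep only rows without missing values.
complete_rows <- na.omit(quake_na)
nrow(complete_rows)
```

Whether to drop incomplete rows or to use `na.rm = TRUE` column by column depends on the analysis. Dropping rows discards the observed values in those rows as well.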


## Arithmetic operations

In [47]:
1 + 5^2

In [51]:
range(c(1, 5, 3, 7, 2))
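
Arithmetic in R is vectorized, so the same operators work element by element on whole vectors. As a small sketch, the sample variance is assembled here from basic operations and compared with the built-in `var` (the names `x` and `manual_var` are chosen for this example only).

```r
x <- c(1, 5, 3, 7, 2)

# Arithmetic applies element by element.
x^2
x - mean(x)

# Sample variance built from basic operations, compared with var().
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
manual_var
var(x)

# range() returns the minimum and maximum; diff() gives their spread.
diff(range(x))
```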

## Storage types and modes

A data frame is not a vector. Neither is a row, which is still a one-row data frame. But a column extracted with `[, 1]` is.

In [60]:
is.vector(quake_na)
is.vector(quake_na[1,])
is.vector(quake_na[,1])
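
The behavior of `is.vector` on data frames can be checked with a small stand-alone example. The data frame `df` below is a hypothetical stand-in for `quake_na`, so the sketch runs without the data file.

```r
# Small hypothetical data frame standing in for quake_na.
df <- data.frame(magnitude = c(7.8, 7.7, 7.5), duration = c(130L, 110L, 27L))

is.vector(df)                     # the data frame itself
is.vector(df[1, ])                # a row is still a one-row data frame
is.vector(df[, 1])                # a column extracted this way is a plain vector
is.vector(df[, 1, drop = FALSE])  # drop = FALSE keeps a one-column data frame

class(df[1, ])
class(df[, 1])
```

`is.vector` returns `TRUE` only for objects with no attributes other than names, which is why a data frame (with its class and row names) does not qualify.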

In [61]:
m <- cbind(c(1,2,3), c(4,5,6))
m

[,1],[,2]
1,4
2,5
3,6


In [62]:
apply(m, 2, mean)

In [63]:
apply(m, 1, mean)
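
The second argument of `apply` selects the margin: `2` applies the function to each column and `1` to each row. A short sketch, using the same matrix, compares the results with the built-in shortcuts `colMeans` and `rowMeans` (the names `col_means` and `row_means` are introduced here for illustration).

```r
m <- cbind(c(1, 2, 3), c(4, 5, 6))

# MARGIN = 2 applies the function to each column, MARGIN = 1 to each row.
col_means <- apply(m, 2, mean)
row_means <- apply(m, 1, mean)

# colMeans() and rowMeans() are built-in shortcuts for the same results.
all.equal(col_means, colMeans(m))
all.equal(row_means, rowMeans(m))

# apply() accepts any function, for example the range of each column.
apply(m, 2, range)
```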

In [65]:
matrix(seq(1,6), ncol=2)

[,1],[,2]
1,4
2,5
3,6


Here, we force the matrix to be filled row-wise.

In [66]:
matrix(seq(1,6), ncol=2, byrow=TRUE)

[,1],[,2]
1,2
3,4
5,6
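
As a quick check of what `byrow` does, the sketch below compares column-wise and row-wise filling. It also shows that filling row-wise gives the same matrix as filling a two-row matrix column-wise and transposing it (the names `by_col`, `by_row`, and `via_t` are made up for this example).

```r
by_col <- matrix(1:6, ncol = 2)                # default: fill column by column
by_row <- matrix(1:6, ncol = 2, byrow = TRUE)  # fill row by row

# Filling row-wise is equivalent to filling a 2-row matrix
# column-wise and then transposing it.
via_t <- t(matrix(1:6, nrow = 2))
all(by_row == via_t)
dim(by_row)
```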


## TO BE CONTINUED, PAGE 15...

# R Packages

# Access to data used in this book

# Exercises