# Week 1: Introduction to Statistics and R

**PLS 120 - Applied Statistics in Agriculture**

This lab will cover the basics of R, including loading data, creating vectors, data frames and tables, and making simple plots. Please follow along with the code chunks provided.

## Data Types in R

In R, data can be stored in different types, depending on the kind of information it represents. The basic data types in R are:

1. **Integer**: Whole numbers without decimals, such as 1 or 355. Numbers like 2.5 and 34.7 are NOT integers.
2. **Numeric**: Numbers that may have a decimal part, such as 3.14. R stores any number as numeric unless told otherwise.
3. **Character**: Text or string data.
4. **Logical**: TRUE or FALSE values.
5. **Factor**: Categorical data with levels.

We can find out the data type of a variable by using the `class()` function. It returns the name of the data type, or class, as a character string.

**Expected output**: You'll see data type names like "numeric", "integer", "character", "logical"

In [None]:
# Example of a whole number typed without a suffix
# Notice that R stores this variable as numeric, not integer
count <- 10
class(count)

In [None]:
# To define 'count' as integer, you can add 'L' after the value of the variable.
count <- 10L
class(count)

In [None]:
# Example of a numeric variable
x <- 3.14
class(x)  # Returns the type of the object

In [None]:
# An integer literal cannot have a decimal part. R does not stop with an error here:
# it prints a warning and stores the value as numeric instead.
x <- 3.14L
class(x)  # Returns "numeric", not "integer"

In [None]:
# Example of a character variable
name <- "Statistics"
class(name)

In [None]:
# Example of a logical variable
is_student <- TRUE
class(is_student)
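
Factors are listed above but not shown in the cells, so here is a minimal sketch. The `treatment` vector and its labels are made up for illustration and are not part of the lab data.

```r
# A hypothetical vector of treatment labels (illustration only)
treatment <- factor(c("control", "fertilizer", "control", "fertilizer"))

class(treatment)   # the class of the object
levels(treatment)  # the distinct categories stored in the factor
```

`levels()` lists each category once, no matter how many times it appears in the data.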

## Vectors

Vectors are one-dimensional: an ordered series of values of the same type, most often numbers. You can picture a vector as a single row with several columns. Let's define a vector (a series of numbers) and assign it to a variable called `vector_1`:

**Mathematical notation**: A vector can be written as **v** = (v₁, v₂, v₃, ..., vₙ)

**Expected output**: You'll see sequences of numbers displayed in brackets like [1] 0 5 6 3 6 9 3

In [None]:
vector_1 <- c(0, 5, 6, 3, 6, 9, 3) # here we make a vector containing a random series of numbers
vector_1

In [None]:
# make another vector with seven different numbers by filling in the parentheses
# (left empty, c() returns NULL)
example_1 <- c()
example_1

In [None]:
# We can use the seq() function to create a vector of evenly spaced numbers.
# The arguments, in order, are seq(from, to, by): the start, the end, and the increment.
# To create a vector of numbers from 4 to 6 in increments of 0.2, we write:
vector_2 <- seq(4, 6, 0.2)
vector_2
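
As a small sketch, the same kind of call can be written with the named arguments `from`, `to`, and `by`. When you know how many values you want rather than the increment, `length.out` can replace `by`. The variable names `steps` and `five_points` are only illustrative.

```r
# Same sequence as above, written with named arguments
steps <- seq(from = 4, to = 6, by = 0.2)
length(steps)  # how many values the sequence contains

# Ask for a fixed number of evenly spaced values instead of an increment
five_points <- seq(from = 4, to = 6, length.out = 5)
five_points
```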

In [None]:
# putting ? before a function name opens its help page
?seq

In [None]:
example_2 <- seq(0, 10, 0.15)
example_2

In [None]:
# now that we have two vectors, we can stack them as rows with the rbind() function
# (fill in example_1 above first; note that rbind() on vectors returns a matrix, not a data frame)
df <- rbind(vector_1, example_1)
df
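
To see the difference between the two structures, the sketch below combines the same pair of vectors with `rbind()` and with `data.frame()`. The vectors `a` and `b` are stand-ins invented for this example; `b` in particular is hypothetical.

```r
a <- c(0, 5, 6, 3, 6, 9, 3)
b <- c(2, 4, 1, 8, 7, 5, 0)  # hypothetical second vector

# rbind() stacks the vectors as rows and returns a matrix
m <- rbind(a, b)
is.matrix(m)
dim(m)  # rows, then columns

# data.frame() treats each vector as a column and returns a data frame
d <- data.frame(a, b)
is.data.frame(d)
dim(d)
```

If you want the row layout of `rbind()` but need a data frame, `as.data.frame(m)` converts the matrix.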

## Data Frames

We often work with data frames, which have two dimensions: rows and columns. Each column holds one variable, and different columns can hold different data types.

To work with data frames, we first need some data. R ships with several built-in data sets for practice. For the labs, we will most often use the `iris` data set, which contains measurements of flowers from three iris species. We start by loading the data into our environment, then use the `str()` function to look at its structure. `str()` also tells you whether the object is a data frame, table, or matrix, as well as its dimensions.

**Expected output**: You'll see a summary showing data types, dimensions, and first few values of each column

In [None]:
flower <- iris # here we assign the built-in iris data to an object called flower

# look at the 'Environment' tab in the top right corner: if the object is listed there, the data loaded successfully
str(flower) # now use str() to look at the structure of the data
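
Besides `str()`, a few other base R functions give a quick look at a data frame. This is a short sketch using the built-in `iris` data; the choice of three rows for `head()` is arbitrary.

```r
dim(iris)      # number of rows, then number of columns
names(iris)    # column names
head(iris, 3)  # first three rows
```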

In [None]:
# data can also be loaded using a comma separated value (csv) file
# Note: We'll skip the LA_Crime example since we don't have that file
# Instead, let's look at our sample crop data
crop_data <- read.csv("sample_crop_data.csv")
str(crop_data)

The `str()` function tells you about data types:
- **num** are numeric values
- **chr** are character values
- **Factor** are categorical values with a fixed set of levels (for example, the treatments in an experiment)
- **int** are whole integer values
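
To see all four labels at once, the sketch below builds a small data frame by hand and passes it to `str()`. The `plots` data frame and every value in it are invented for illustration; they do not come from `sample_crop_data.csv`.

```r
# Hypothetical field-plot data (illustration only)
plots <- data.frame(
  yield   = c(4.2, 5.1, 3.8),                      # shown as num
  variety = c("A", "B", "A"),                      # shown as chr
  block   = factor(c("north", "south", "north")),  # shown as Factor
  plants  = c(12L, 15L, 11L),                      # shown as int
  stringsAsFactors = FALSE  # keep text columns as character in every R version
)

str(plots)
sapply(plots, class)  # the class of each column
```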

## R as a Calculator

R doesn't always need a data set to be useful. It also works like a calculator.

**Mathematical operations**: +, -, ×, ÷ (typed in R as `+`, `-`, `*`, `/`)

**Expected output**: You'll see the numerical results of each calculation

In [None]:
# addition
3 + 4

In [None]:
# subtraction
5 - 2

In [None]:
# multiplication
3 * 6

In [None]:
# division
8 / 2

In [None]:
# what is 22 times 56 plus 8 minus 200?
22 * 56 + 8 - 200
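
R follows the usual order of operations, so multiplication and division are done before addition and subtraction. The sketch below shows how parentheses change the result of the last calculation; the names `without_parens` and `with_parens` are just for illustration.

```r
# Multiplication happens first: (22 * 56) + 8 - 200
without_parens <- 22 * 56 + 8 - 200
without_parens  # 1040

# Parentheses force the addition to happen first: 22 * 64 - 200
with_parens <- 22 * (56 + 8) - 200
with_parens     # 1208
```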

## Data Visualization

Whenever looking at data, a good first step is to take a big picture approach, by looking at the distribution of the data points, and through simple visualization. It's often easier to look at a picture than it is to look at a bunch of numbers.

We will start with a frequency table, which shows how many observations fall into each category or, with two variables, into each combination of categories. These are the "counts" of the data. Using the iris data, we will make frequency tables of species and sepal length.

**Expected output**: You'll see tables showing counts and distributions

In [None]:
# We've already installed tidyverse for you in this Binder environment
library(tidyverse) # load the tidyverse packages; table(), cut() and hist() used below are part of base R

In [None]:
# How many samples (rows) belong to each species in the Species column (selected with $)?
table(iris$Species)

In [None]:
# When you pass two variables to table(), it counts each combination of their values.
# For example, to see how many flowers of each species have each specific sepal length, we write:
table(iris$Species, iris$Sepal.Length)
# This can be impractical when Sepal.Length has many unique values. In that case you can group
# the values into categories with cut(), which converts continuous data into categorical data
# based on specified breaks.

In [None]:
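# Here we store the table in an object called frequency_table. The rows are species;
# the columns are bins of Sepal.Length made by cut(), with bin edges from seq(4, 6, 0.2).
# Values outside the break range (here, lengths above 6) become NA and are left out of the table.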

frequency_table <- table(iris$Species, cut(iris$Sepal.Length, seq(4, 6, 0.2)))
frequency_table # run this line to print the table

In [None]:
# Sepal.Length only ranges from 4.3 to 7.9, so with breaks from 10 to 12 every count is zero
frequency_table <- table(iris$Species, cut(iris$Sepal.Length, seq(10, 12, 0.2)))
frequency_table
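
The choice of breaks decides which observations are counted. As a sketch, the breaks below run from 4 to 8 in steps of 0.5, which covers every sepal length in `iris` (4.3 to 7.9). The step size and the names `breaks_full`, `length_bins`, and `full_table` are choices made for this example.

```r
# Breaks that span the whole range of Sepal.Length
breaks_full <- seq(4, 8, 0.5)

# cut() assigns each measurement to a bin; nothing falls outside the breaks
length_bins <- cut(iris$Sepal.Length, breaks = breaks_full)

full_table <- table(iris$Species, length_bins)
full_table

sum(full_table)  # total count in the table; compare with nrow(iris)
```

When the total matches the number of rows in the data, no observation was dropped by the binning.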

In [None]:
# how would you remake this table for a different data column in the iris data?
# Try it yourself!

## Histograms

The table is useful, but it can be difficult to interpret. A simple graph can show the same information. A histogram shows the range of the data, where the measurements are most concentrated, and the shape of the distribution (symmetric or skewed). Histograms are also convenient because they require only a single function call.

**Mathematical concept**: A histogram displays the frequency distribution of a dataset

**Expected output**: You'll see bar charts showing the distribution of data values

In [None]:
# hist() makes a histogram from a vector of data. Here we use the Sepal.Length column of iris, selected with $
hist(iris$Sepal.Length)

In [None]:
# now that we have a histogram, we can also adjust how many "bins" are made, which changes how the counts are grouped.
hist(iris$Sepal.Length, breaks = 15)
# Here, breaks = 15 asks for about 15 equal-width bins. R treats a single number as a suggestion
# and picks nearby "pretty" cut points, so the actual number of bins can differ.

In [None]:
# now we have increased the number of bins. What happens when we decrease the number of bins?
hist(iris$Sepal.Length, breaks = 2)
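
To check which bins R actually used, `hist()` can be called with `plot = FALSE`, which returns the bin edges and counts without drawing anything. This is a small sketch; the object name `h` is arbitrary.

```r
# Compute the histogram without plotting it
h <- hist(iris$Sepal.Length, breaks = 15, plot = FALSE)

h$breaks          # the bin edges R chose
length(h$counts)  # the number of bins actually used
sum(h$counts)     # every observation lands in exactly one bin
```

Comparing `length(h$counts)` with the value passed to `breaks` shows how closely R followed the request.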