# Class Notebook 1

This notebook will feature an introduction to the statistical programming language R and some of its basic features. The general topic here is **exploratory data analysis**.

## Reading in a Dataframe from URL

Before R can begin its work for us, we need to load some data. Quite often, data sets are stored as CSV (comma-separated values) files, a plain-text format that spreadsheet programs such as Excel can open and save. Here is the command to load a CSV file. Note that we can simply pass the URL where the file is stored to the versatile command:

<center><b>read.csv</b></center>

The **personality** data were collected from 2010 to 2012. The respondents were UNG Dahlonega students in Elementary Statistics courses. The data set was developed pursuant to an undergraduate research project about sense of humor. Let's load the data into the variable "pers".

In [None]:
pers <- read.csv("https://faculty.ung.edu/rsinn/personality.csv")
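
To see what **read.csv** does without relying on a network connection, the sketch below reads a few lines of CSV text directly. The three rows of values are made up for illustration and are not taken from the personality data; only the column names **Caff** and **Sleep** are borrowed from it.

```r
# A tiny, made-up CSV (hypothetical values, not the real personality data)
csv_text <- "Caff,Sleep
2,7.5
0,8
4,6"

# read.csv accepts a file path, a URL, or literal text via the `text` argument
toy <- read.csv(text = csv_text)

print(toy)         # the first line of the CSV becomes the column names
print(class(toy))  # read.csv returns a data frame
```

The same call with a URL or file path in place of `text = ...` should behave the same way, provided the file has a header row.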

To view the data, we can type the variable name "pers". However, this displays the whole data frame, which may be undesirable for large data sets. Instead, we can use the

<center><b>head</b></center>

command which allows us to print only the top few rows of the data frame. Try this:

In [None]:
head(pers, 5)

## Exploring the Data Set

Each column is a variable in our data set, and each row is an individual. To explore the data set and find out what items are in it, we can use

<center><b>colnames</b></center>
<center><b>nrow</b></center>
<center><b>ncol</b></center>

Let's take a look under the hood of the personality data frame.

In [None]:
nrow(pers)

In [None]:
ncol(pers)

We find that we have data from 129 students on 36 different variables. To see the names of the variables (i.e. column titles), run the code below.

In [None]:
colnames(pers)
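
The three commands above can be cross-checked with **dim**, which returns the row and column counts together, and **str**, which prints one line per column with its type and first few values. The sketch below uses a small made-up data frame (the name `toy` and its values are assumptions for illustration) so that it runs on its own; the same calls should work on `pers`.

```r
# Hypothetical stand-in for the personality data
toy <- data.frame(
  Caff  = c(2, 0, 4, 1),
  Sleep = c(7.5, 8, 6, 7)
)

dim(toy)      # number of rows, then number of columns
str(toy)      # compact overview: one line per column
head(toy, 2)  # first 2 rows
tail(toy, 2)  # last 2 rows
```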

## Accessing a Specific Column in a Data Frame

To work with individual variables, we use the dollar symbol (\$) after the data frame name. In the example below, we are accessing a variable **Caff** where students reported the number of 8 oz. servings of caffeine they had consumed in the previous 24 hours.

In [None]:
Caffeine <- pers$Caff
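
The dollar symbol is one of several equivalent ways to pull a column out of a data frame. The sketch below compares three of them on a made-up data frame (`toy` and its values are hypothetical).

```r
# Hypothetical stand-in for the personality data
toy <- data.frame(Caff = c(2, 0, 4, 1), Sleep = c(7.5, 8, 6, 7))

a  <- toy$Caff        # dollar-sign access
b  <- toy[["Caff"]]   # double-bracket access by name (handy when the name is stored in a variable)
c3 <- toy[, "Caff"]   # matrix-style access: all rows, one column

identical(a, b) && identical(b, c3)  # all three give the same vector
```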

Let's find the average and standard deviation of this data.

In [None]:
mean(Caffeine)

In [None]:
sd(Caffeine)
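
If a column contains missing values (`NA`), **mean** and **sd** return `NA` unless told to skip them. We have not checked here whether **Caff** has any missing responses, so the sketch below uses a made-up vector to show the behavior and the `na.rm = TRUE` option.

```r
x <- c(2, 0, NA, 4, 1)   # hypothetical responses with one missing value

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # mean of the four observed values
sd(x, na.rm = TRUE)      # standard deviation of the observed values
sum(is.na(x))            # how many values are missing
```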

### The cat Function

The **cat** function combines two functions into one:
- concatenation 
- print

The **cat** function turns out to be quite useful. While the print function can be unwieldy at times, the **cat** function tends to work very well.

Concatenation means to "join together." Thus, the **cat** function allows us to join together strings of text and even variable outputs to display results from R nicely in our notebook. Here's an example where we store the mean and standard deviation in the variables **m** and **s** respectively.

Note that the command **summary** reports the values of the **5-Number Summary** along with the mean.

In [None]:
m <- mean(Caffeine)

In [None]:
s <- sd(Caffeine)

In [None]:
cat("For this data, we have: \nmean =", m, "\nstandard deviation =", s, "\n\nThe five number summary is shown below.\n")
summary(Caffeine)
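
By default, **cat** prints numbers with many decimal places. As a sketch, the output can be tidied with **round**, or built from a template with **sprintf**. The values of `m` and `s` below are made up so the example runs on its own.

```r
m <- 2.3456789   # hypothetical mean
s <- 1.9876543   # hypothetical standard deviation

# round() trims the digits before cat() prints them
cat("mean =", round(m, 2), "\nstandard deviation =", round(s, 2), "\n")

# sprintf() builds the whole string from a template; %.2f means two decimal places
msg <- sprintf("mean = %.2f, standard deviation = %.2f", m, s)
cat(msg, "\n")
```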

## Data Visualizations

With the advent of data science, data visualizations have grown quite sophisticated. We will address the basic visualizations here, the traditional graphics that have been used in statistics for decades:

1. Histograms
2. Box Plots
3. Scatter Plots

### Histograms

We can use the variable we created (i.e. **Caffeine**), or we can access the data for our histogram directly as is shown below:

In [None]:
hist(pers$Caff)

The function **hist** has several options, the most often used of which is **breaks**. The default number of bins chosen by statistical software is not always ideal. Often, the results are readable. However, sometimes using more or fewer intervals (called "bins") along the $x$-axis improves the readability of the histogram dramatically.

In [None]:
hist(pers$Caff, breaks = 5)

Notice how having fewer breaks or bins enlarges the intervals along the $x$-axis so we have fewer bars in our histogram. The process of finding the correct number of breaks is more art than science and can be frustrating. However, let's leave that for later. Right now, the standard histograms will work well for us.
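
One detail worth knowing: when **breaks** is a single number, R treats it as a suggestion and picks round cut points, so the actual number of bins may differ from the request. Passing a vector of cut points gives exact control. The sketch below uses made-up values and `plot = FALSE`, which returns the bin information without drawing anything.

```r
x <- c(0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8)  # hypothetical caffeine servings

# A single number is only a suggestion for the number of bins
h1 <- hist(x, breaks = 5, plot = FALSE)
h1$breaks   # the cut points R actually chose
h1$counts   # how many values fell into each bin

# A vector of cut points is used exactly as given
h2 <- hist(x, breaks = seq(0, 8, by = 2), plot = FALSE)
h2$breaks
h2$counts
```

Dropping `plot = FALSE` draws the histogram with the same bins.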

### Box Plots

The **5-Number Summary** displays the following:
- min
- Q1 or 25th percentile
- median
- Q3 or 75th percentile
- max

If you consider these values for a moment, you'll realize that they divide the data into four quarters. While the *mean* and *standard deviation* are the most often used descriptive statistics, the 5-Number Summary turns out to be especially useful when the data are skewed or have outliers.

In [None]:
boxplot(pers$Caff)

The standard box plot also flags outliers, which are drawn as small circles above or below the box plot. We can also plot the **Sleep** variable which asked for the total number of sleep hours in the past 48 hours including naps. The responses were divided by $2$ to indicate an average amount of sleep per day.

In [None]:
boxplot(pers$Sleep)
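
The numbers behind a box plot can also be inspected directly. As a sketch with made-up data, **fivenum** returns the five-number summary, and **boxplot.stats** reports the values used to draw the box and whiskers along with any points flagged as outliers. By default, a point is flagged when it lies more than 1.5 times the box length beyond the box.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 12)  # hypothetical data with one unusually large value

fivenum(x)     # min, lower hinge, median, upper hinge, max

bs <- boxplot.stats(x)
bs$stats       # whisker ends, hinges, and median used to draw the box plot
bs$out         # points drawn as circles (outliers)
```

Note that the upper whisker end in `bs$stats` is the largest value that is not an outlier, so it can be smaller than the max reported by **fivenum**.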

Please note that there are **FIVE** horizontal lines in the box plot, one each for the values of the 5-Number Summary (when outliers are present, the whisker ends mark the most extreme values that are not outliers rather than the min or max). A few options make the plot more informative. The following shows how to add titles and how to change the colors of the box and of the lines. Also, we can flip the box plot to horizontal instead of vertical positioning.

In [None]:
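boxplot(pers$Sleep,
        main = "Hours of Sleep per Day for College Students",  # plot title
        xlab = "Sleep Hours",   # label for the value axis (horizontal here)
        col = "red",            # fill color of the box
        border = "blue",        # color of the outline, whiskers, and median line
        horizontal = TRUE       # draw the box plot sideways; no count axis is needed
)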

The titles are useful and work similarly for other plots we create.

### Scatter Plots

Are two numeric variables associated in some way? How would we be able to determine this? The answer is the scatter plot which plots a cloud of points. If there is a pattern to the data, we will be able to see it. The example below shows the basic format on the top line with titles added below that.

In [None]:
plot(x = pers$Sleep, y = pers$Caff,
     main = "Sleep vs. Caffeine Consumption",
     xlab = "Sleep Hours", 
     ylab = "Servings of Caffeine")

We can add a trendline to our scatter plot because, quite often, the cloud of points has no discernible pattern. We create a linear model using the **lm** function and superimpose its line on the scatter plot.

In [None]:
plot(x = pers$Sleep, y = pers$Caff,
     main = "Sleep vs. Caffeine Consumption",
     xlab = "Sleep Hours", 
     ylab = "Servings of Caffeine")
model <- lm(Caff ~ Sleep, data = pers)
abline(model, col = "red") 
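
The slope of the trendline can also be read off numerically with **coef**. The sketch below fits the same kind of model to a small made-up data frame (the values in `toy` are hypothetical) so that it runs on its own.

```r
# Hypothetical stand-in for the personality data
toy <- data.frame(
  Sleep = c(5, 6, 7, 8, 9),
  Caff  = c(4, 3, 3, 2, 1)
)

model <- lm(Caff ~ Sleep, data = toy)   # response ~ predictor

coef(model)                       # intercept and slope of the trendline
slope <- coef(model)[["Sleep"]]   # extract just the slope
slope                             # negative here: more sleep goes with less caffeine in this toy data
```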

We see that there is a slight negative association between sleep hours and caffeine intake. As sleep hours increase, the amount of caffeine consumed tends to decrease. The following rules apply to interpreting the trendline (i.e. line of best fit):

1. If the trendline is horizontal, there is no linear relationship between the variables.
2. If the trendline has positive slope, we have a positive association between the variables.
3. If the trendline has negative slope, we have a negative association between the variables.

If the trendline has a slope that is significantly different from zero, we have reason to investigate the correlation between the two variables.
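
One way to judge whether a slope differs significantly from zero is to look at the coefficient table from **summary** of the model, and to compare it with the correlation reported by **cor** and **cor.test**. This is a sketch on made-up data (the values in `toy` are hypothetical), not a full treatment of hypothesis testing.

```r
# Hypothetical stand-in for the personality data
toy <- data.frame(
  Sleep = c(5, 6, 6, 7, 7, 8, 8, 9),
  Caff  = c(4, 3, 5, 3, 2, 2, 3, 1)
)

model <- lm(Caff ~ Sleep, data = toy)

# Coefficient table: estimate, standard error, t value, and p-value
coefs   <- summary(model)$coefficients
slope_p <- coefs["Sleep", "Pr(>|t|)"]

# Pearson correlation and its test; with a single predictor, the
# p-value matches the p-value for the slope
r  <- cor(toy$Sleep, toy$Caff)
ct <- cor.test(toy$Sleep, toy$Caff)

cat("r =", round(r, 3),
    "\nslope p-value =", slope_p,
    "\ncor.test p-value =", ct$p.value, "\n")
```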

## Time to Explore

You need to be able to implement the functions shown above quickly and accurately for this course. 

### Single-Variable Visualizations

Try using the following variables for single-variable visualizations and for descriptive statistics:
- Narcissism (Narc)
- GPA (GPA)
- Thrill-Seeking (Thrill)
- Extroversion (Extro)
- Toxic Relationship Beliefs (TxRel)
- Optimism (Opt)
- Perfectionism (Perf)
- Coping Humor (CHS)
- OCD Indicators (OCD)
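
To avoid retyping the same commands for each variable, the steps above can be wrapped in a small function. The helper below, `describe_variable`, is a hypothetical convenience written for this notebook, not a built-in R function. It also assumes missing values should be skipped (`na.rm = TRUE`).

```r
# Hypothetical helper (not part of R or of the course materials)
describe_variable <- function(df, name) {
  x <- df[[name]]                       # pull the column out by name
  stats <- c(mean = mean(x, na.rm = TRUE),
             sd   = sd(x, na.rm = TRUE))
  cat(name, ": mean =", round(stats[["mean"]], 2),
      ", sd =", round(stats[["sd"]], 2), "\n")
  print(summary(x))                     # five-number summary plus the mean
  hist(x, main = paste("Histogram of", name), xlab = name)
  boxplot(x, main = paste("Box plot of", name), xlab = name, horizontal = TRUE)
  invisible(stats)                      # return the numbers without printing them again
}
```

With the personality data loaded, a call such as `describe_variable(pers, "Narc")` should print the statistics and draw both plots for the Narcissism variable.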

### Two-Variable Visualizations

Try using the following variables to detect potential correlation.
- Perf vs. Opt
- Thrill vs. Extro
- CHS vs. Opt
- Extro vs. CHS
- Any other pair of variables you think may be correlated
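
Alongside the scatter plots, the correlation coefficient for each pair can be computed with **cor**. The sketch below loops over a list of pairs using a made-up data frame; the column names match the list above, but the values are hypothetical. The option `use = "complete.obs"` drops rows where either value is missing.

```r
# Hypothetical stand-in with four of the column names listed above
toy <- data.frame(
  Perf   = c(10, 12, 15, 11, 18, 14),
  Opt    = c(20, 19, 15, 21, 12, NA),
  Thrill = c(3, 5, 4, 6, 8, 7),
  Extro  = c(2, 6, 3, 5, 9, 8)
)

pairs_to_check <- list(c("Perf", "Opt"), c("Thrill", "Extro"))

# Compute one correlation per pair
results <- sapply(pairs_to_check, function(p) {
  cor(toy[[p[1]]], toy[[p[2]]], use = "complete.obs")
})
names(results) <- sapply(pairs_to_check, paste, collapse = " vs. ")

round(results, 2)
```

Replacing `toy` with `pers` and extending `pairs_to_check` should cover the remaining pairs.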

### Start Working Below