# Visualization, Data Wrangling and Modeling

## Visualization

## Front Matter

## Example 1 

Avoid displaying full datasets in your Markdown document because the
output is difficult to read.


In [None]:
# iris is a built-in dataset containing info on a sample of iris flowers

Instead, use a function such as `head`, `str`, or `glimpse` (from
`dplyr`) that shows only a preview or summary of the dataset.
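
As a minimal sketch of this advice, the built-in `iris` data can be previewed with base R's `head()` and `str()`. The object name `iris_preview` is only illustrative, and the `glimpse()` line is commented out because it assumes `dplyr` is installed.

```r
# Show only the first six rows rather than all 150
iris_preview <- head(iris)
print(iris_preview)

# Compact summary: one line per variable with its type and first few values
str(iris)

# dplyr alternative (assumes dplyr is installed):
# dplyr::glimpse(iris)
```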

## Example 2 - Common Visualizations

The following illustrates some common types of plots and the
corresponding code. The examples require the `dplyr` and `ggplot2`
libraries, both of which are loaded as part of the `tidyverse`. The
examples also use the `penguins` data set from the `palmerpenguins` library.

In [None]:
#Bar Plot

#NOTE: Inside aes(), no y axis is specified. This is because the y axis is not a variable contained in the dataset. In a bar plot, the y axis is the count which is determined by the geom_bar procedure
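
A minimal sketch of such a bar plot, assuming the `palmerpenguins` package is installed. The choice of `species` as the plotted variable and the name `bar_plot` are assumptions made for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # provides the penguins data set

# Only x is mapped; geom_bar() counts the rows in each species for the y axis
bar_plot <- ggplot(penguins, aes(x = species)) +
  geom_bar()
print(bar_plot)
```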

In [None]:
#Histogram

#NOTE: Inside aes(), no y axis is specified. This is because the y axis is not a variable contained in the dataset. In a histogram, the y axis is the count which is determined by the geom_histogram procedure
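
A comparable sketch for a histogram, again assuming `palmerpenguins` is installed. The variable `body_mass_g`, the choice of 30 bins, and the names `penguins_mass` and `hist_plot` are assumptions made for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Drop rows with a missing body mass so geom_histogram() does not warn about them
penguins_mass <- subset(penguins, !is.na(body_mass_g))

# Only x is mapped; geom_histogram() bins the values and counts rows per bin
hist_plot <- ggplot(penguins_mass, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)
print(hist_plot)
```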

In [None]:
#Density Plot

In [None]:
#Scatterplot

In [None]:
#Side by side boxplots
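
The density plot, scatterplot, and side-by-side boxplots follow the same pattern: pick the aesthetics, then add the matching geom. The sketch below assumes `palmerpenguins` is installed; the variables chosen (`body_mass_g`, `flipper_length_mm`, `species`) and the object names are illustrative assumptions.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only rows where the plotted measurements are present
penguins_complete <- subset(
  penguins,
  !is.na(body_mass_g) & !is.na(flipper_length_mm)
)

# Density plot: a smoothed version of the histogram for one numeric variable
density_plot <- ggplot(penguins_complete, aes(x = body_mass_g)) +
  geom_density()

# Scatterplot: two numeric variables, so both x and y are mapped
scatter_plot <- ggplot(penguins_complete,
                       aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Side-by-side boxplots: one box per species
box_plot <- ggplot(penguins_complete, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

print(density_plot)
print(scatter_plot)
print(box_plot)
```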

## Example 3 - Loading and previewing a dataset from a package

## Example 4 - Layers of ggplot demonstration

## Example 5 - Using aesthetics

## Example 6 - Faceting using two categorical variables

## Example 7 - Faceting using one categorical variable

## Example 8 - Boxplots

### Data Manipulation

In [None]:
#Remove all objects from Environment
remove(list = ls()) #CAUTION: This deletes everything in the environment

## Example 1 - Supporting data visualization with numerical summaries

#### Example 1 - Incorrect code

#### Example 1 - Corrected code

## Example 2 - Alternative method for finding the summary statistics

In [None]:
#Alternative Method

#Create dataset for each species and calculate values; repeat for each species
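
One way this species-by-species approach might look is sketched below. It assumes the built-in `iris` data as a stand-in (the data set used in this example is not shown here), and the summary columns are illustrative choices.

```r
library(dplyr)

# Step 1: create a data set containing a single species
setosa <- iris %>% filter(Species == "setosa")

# Step 2: calculate the summary values for that species
setosa_summary <- setosa %>%
  summarize(
    n = n(),
    mean_petal_length = mean(Petal.Length),
    sd_petal_length = sd(Petal.Length)
  )
print(setosa_summary)

# Repeat steps 1 and 2 for "versicolor" and "virginica"
```

A single `group_by(Species)` followed by `summarize()` would give the same values for all species in one step.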

## Example 5 - dplyr and ggplot practice in R

#### Example 5 Part a

In [None]:
#Calculate medical charges by region for smokers over 40
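
A minimal sketch of this calculation with `dplyr` follows. The data frame `insurance_toy`, its column names (`age`, `smoker`, `region`, `charges`), its values, and the use of the mean as the summary are all hypothetical stand-ins, since the real data set is not shown here.

```r
library(dplyr)

# Hypothetical toy data; the real data set and its column names are assumptions
insurance_toy <- tibble::tibble(
  age     = c(45, 52, 38, 61, 47, 29),
  smoker  = c("yes", "yes", "yes", "no", "yes", "no"),
  region  = c("northeast", "southwest", "northeast",
              "southwest", "northeast", "southeast"),
  charges = c(40000, 35000, 20000, 12000, 42000, 3000)
)

charges_by_region <- insurance_toy %>%
  filter(smoker == "yes", age > 40) %>%  # smokers over 40 only
  group_by(region) %>%                   # one summary row per region
  summarize(mean_charges = mean(charges), n = n())
print(charges_by_region)
```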

In [None]:
#Create data frames required for the plot

In [None]:
#Create visualization

#### Example 5 Part b

In [None]:
#Creating Obesity Classification variable
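
A classification variable like this can be sketched with `mutate()` and `if_else()`. The toy data frame `bmi_toy`, the `bmi` column name, and the labels are hypothetical; the cutoff assumes the commonly used definition of obesity as a BMI of 30 or higher.

```r
library(dplyr)

# Hypothetical toy data; the bmi column name is an assumption
bmi_toy <- tibble::tibble(bmi = c(22.5, 30.0, 27.9, 35.2))

# Assumes the common cutoff of BMI >= 30 for the obese category
bmi_toy <- bmi_toy %>%
  mutate(obesity = if_else(bmi >= 30, "obese", "not obese"))
print(bmi_toy)
```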

Some observations:

-   Smokers appear to have higher charges than non-smokers

-   For non-smokers, there does not appear to be much of a difference in
    the charges for those who are obese or not

-   For smokers, being obese is associated with higher charges. Note
    that this does not suggest that being obese CAUSES higher charges

## Example 6 - A note about masking

#### Example 6c

Add `Hmisc` library to front matter. Then run the code provided below.
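
Both `dplyr` and `Hmisc` export a function called `summarize()`, so whichever package is attached later masks the other. A minimal sketch of the usual workaround, qualifying the call with its package name, is shown below. It uses the built-in `iris` data as a stand-in and does not itself load `Hmisc`.

```r
library(dplyr)

# Writing dplyr::summarize() makes the call unambiguous even when another
# attached package (for example Hmisc) also defines summarize()
petal_means <- iris %>%
  group_by(Species) %>%
  dplyr::summarize(mean_petal_length = mean(Petal.Length))
print(petal_means)
```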

## Another Example

In [None]:
rm(list = ls()) #CAUTION: Removes all objects in your environment
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Example 2 Part c - Creating Superhero and Publisher Datasets

In [None]:
#Create superheroes table (requires tibble package)
superheroes <- tibble::tribble(
  ~name, ~alignment, ~gender, ~publisher,
  "Magneto", "bad", "male", "Marvel",
  "Storm", "good", "female", "Marvel",
  "Mystique", "bad", "female", "Marvel",
  "Batman", "good", "male", "DC",
  "Joker", "bad", "male", "DC",
  "Catwoman", "bad", "female", "DC",
  "Hellboy", "good", "male", "Dark Horse Comics"
)

#Create publishers table
publishers <- tibble::tribble(
  ~publisher, ~yr_founded,
  "DC", 1934,
  "Marvel", 1939,
  "Image", 1992
)
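
The specific joins asked for in the exercises that follow are not spelled out here, so the sketch below simply illustrates four common `dplyr` joins on these two tables. It rebuilds both tables so that it can run on its own; the result names (`inner`, `left`, `full`, `unmatched`) are illustrative.

```r
library(dplyr)

# Same two tables as above, rebuilt so this block runs on its own
superheroes <- tibble::tribble(
  ~name, ~alignment, ~gender, ~publisher,
  "Magneto", "bad", "male", "Marvel",
  "Storm", "good", "female", "Marvel",
  "Mystique", "bad", "female", "Marvel",
  "Batman", "good", "male", "DC",
  "Joker", "bad", "male", "DC",
  "Catwoman", "bad", "female", "DC",
  "Hellboy", "good", "male", "Dark Horse Comics"
)
publishers <- tibble::tribble(
  ~publisher, ~yr_founded,
  "DC", 1934,
  "Marvel", 1939,
  "Image", 1992
)

# inner_join: keeps only rows whose publisher appears in both tables
inner <- inner_join(superheroes, publishers, by = "publisher")

# left_join: keeps every superhero; yr_founded is NA when no publisher matches
left <- left_join(superheroes, publishers, by = "publisher")

# full_join: keeps all rows from both tables
full <- full_join(superheroes, publishers, by = "publisher")

# anti_join: superheroes with no matching publisher
unmatched <- anti_join(superheroes, publishers, by = "publisher")

print(inner)
print(left)
print(full)
print(unmatched)
```

Hellboy's publisher is missing from `publishers`, and Image has no superheroes, which is why the joins return different numbers of rows.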

## Example 2 Part c - Performing join from 2b

## Example 2d - 1st join

## Example 2d - second join

## Example 2d - third join

## Example 3a

## Example 3b

## Example 3c

### Regression

In [None]:
#CAUTION: This removes all objects from Environment
rm(list = ls())

#Add libraries as needed
library(tidyverse)

#Read in Dataset
Vehicles <- read.csv("./data/L04_Vehicles_m.csv", header=FALSE)

## Example 5b - Rename Variables and remove observations with missing values
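
A file read with `header = FALSE` gets default column names `V1`, `V2`, and so on. The sketch below shows one way to rename such columns and drop incomplete rows. The data frame `vehicles_toy`, its values, and the assumed column order (model, miles, price) are hypothetical, since the layout of the real file is not shown here.

```r
library(dplyr)

# Hypothetical stand-in for a file read with header = FALSE (columns V1, V2, ...)
# The real column order in the Vehicles file is an assumption
vehicles_toy <- data.frame(
  V1 = c("Camry", "Volt", "Tacoma", "Camry"),
  V2 = c(42000, NA, 61000, 15500),
  V3 = c(17900, 15200, 28900, NA)
)

vehicles_clean <- vehicles_toy %>%
  rename(model = V1, miles = V2, price = V3) %>%  # new_name = old_name
  na.omit()                                       # drop rows containing any NA
print(vehicles_clean)
```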

## Example 5c - Which variable is most likely to explain the other?

I would expect miles to influence price. We will call miles the
explanatory variable (x, input, feature, etc.) and price the
response variable (y, output, target, etc.).

## Example 5d - Scatterplot Showing Price as a Function of Miles

In [None]:
#Find correlation between miles and price
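
A minimal sketch of the correlation calculation with base R's `cor()` follows. The `miles` and `price` vectors are hypothetical toy values, not the Vehicles data, so the result will not match the correlation reported below.

```r
# Hypothetical toy values standing in for the miles and price columns
miles <- c(12000, 35000, 48000, 60000, 90000)
price <- c(21000, 18500, 19000, 15000, 12500)

# Pearson correlation; use = "complete.obs" skips pairs containing NA
r <- cor(miles, price, use = "complete.obs")
print(r)
```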

Observations:

-   If there were no linear association, I would expect the line to be
    horizontal.
-   In this case, the line slopes slightly downward, indicating a
    negative linear association: as miles increases, price tends to
    decrease. The relationship looks weak (the correlation is -0.08).

NOTE: The correlation is a number between -1 and +1. The closer the
absolute value of the correlation is to 1, the stronger the linear
relationship. When the correlation is close to 0, there is little or no
linear relationship.

## Example 5e - What other variables could be used to help explain the variation in price?

-   year car was built
-   past accidents/history
-   brand of car
-   number of owners
-   model of car
-   type of vehicle (car vs truck, 2 dr vs 4 dr, 4wd vs not)
-   gas vs electric
-   condition of car
-   features of the car (ac, sunroof, etc.)

## Example 5f - Scatterplot Showing Price as a Function of Miles with Model

Some observations:

-   Within each model, there is a negative linear association. As miles
    increases, price decreases.
-   The rate of change in price with respect to miles (i.e., the slopes)
    may be different for the different models. It looks like Camrys and
    Volts have similar slopes, but the slope for Tacomas may not be as
    steep.
-   For a given number of miles, Tacomas are more expensive than
    Camrys and Camrys are more expensive than Volts.
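
A plot of this kind can be sketched by mapping color to the model, which makes `geom_smooth()` fit a separate line for each group. The data frame `vehicles_toy` and its values are hypothetical, and the use of `geom_smooth(method = "lm")` is an assumption about how the lines were drawn.

```r
library(ggplot2)

# Hypothetical toy data standing in for the Vehicles data set
vehicles_toy <- data.frame(
  model = rep(c("Camry", "Volt", "Tacoma"), each = 3),
  miles = c(20000, 50000, 90000, 15000, 40000, 70000, 30000, 60000, 100000),
  price = c(19000, 16000, 12000, 17000, 14000, 11000, 30000, 28000, 25000)
)

# Mapping color to model makes geom_smooth() fit one line per model
model_plot <- ggplot(vehicles_toy, aes(x = miles, y = price, color = model)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(model_plot)
```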

## Example 5g - Create new dataset for Camrys only

## Example 5h - Build Simple Linear Regression Model

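The estimated regression equation is given below, where $\hat{y}_{i}$ is
the predicted price for vehicle $i$, $x_{i}$ is its miles, $a$ is the
estimated intercept, and $b$ is the estimated slope: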

$$\hat{y}_{i} = a + bx_{i}$$
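
A minimal sketch of fitting this equation with base R's `lm()` follows. The data frame `camry_toy` and its values are hypothetical, so the estimates are for illustration only; real values would come from the Camry-only data set.

```r
# Hypothetical toy data for Camrys only
camry_toy <- data.frame(
  miles = c(20000, 35000, 50000, 70000, 90000),
  price = c(19500, 17800, 16100, 14300, 12000)
)

# Fit price = a + b * miles by least squares
camry_fit <- lm(price ~ miles, data = camry_toy)

# coef() returns a (the intercept) and b (the slope for miles)
a <- coef(camry_fit)[["(Intercept)"]]
b <- coef(camry_fit)[["miles"]]
print(c(a = a, b = b))

# Predicted price (y-hat) for a Camry with 60,000 miles
predicted <- predict(camry_fit, newdata = data.frame(miles = 60000))
print(predicted)
```

The prediction is simply `a + b * 60000`, which is the regression equation evaluated at $x_{i} = 60000$.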