From 174a6c91b9b1fabe287efb1a5c8bf9b078a9a55a Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Tue, 9 Oct 2018 12:26:20 -0400
Subject: [PATCH 01/18] Create README.md

Added new folder "Week 05" and added a readme file to it.
---
 Week 05/README.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100644 Week 05/README.md

diff --git a/Week 05/README.md b/Week 05/README.md
new file mode 100644
index 0000000..2cfd975
--- /dev/null
+++ b/Week 05/README.md
@@ -0,0 +1,26 @@
+# Functions Pre-Class Work
+
+Please complete the following work.
+
+
+### Objectives
+
+1. Gain further practice on loops.
+2. Gain further practice on functions.
+3. Apply functions and loops to actual data.
+
+
+
+## Flipped Material
+
+- Sign into [Datacamp](https://www.datacamp.com/)
+- Complete [Chapter 4: Functions](https://campus.datacamp.com/courses/1118/) from PHP 1560/2560 Statistical Programming with R.
+- Complete [Writing Functions in R Course](https://www.datacamp.com/courses/writing-functions-in-r)
+- Complete these steps for the assignment:
+  - Create a new folder in your pre-class Repository called: `Week 05`
+  - Copy the `readme.md` and `pre-class-05.Rmd` documents into this folder.
+  - Follow the instructions and commit often.
+
+
+
+Then proceed to the Pre Week 05 RMarkdown file; complete it and commit your work often to practice making small changes and committing them.

From 0a9396fd889eb3d22149eeee84eb6007328dc6e1 Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Tue, 9 Oct 2018 12:27:20 -0400
Subject: [PATCH 02/18] Create pre-class-05.Rmd

Added the pre-class-05.Rmd file.
---
 Week 05/pre-class-05.Rmd | 109 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Week 05/pre-class-05.Rmd

diff --git a/Week 05/pre-class-05.Rmd b/Week 05/pre-class-05.Rmd
new file mode 100644
index 0000000..21ea6bb
--- /dev/null
+++ b/Week 05/pre-class-05.Rmd
@@ -0,0 +1,109 @@
+# pre-class
+
+
+Make sure you commit this often with meaningful messages.
+
+
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+
+
+
+
+
+
+Standardizing a variable means subtracting the mean, and then dividing by the standard deviation. Let’s use a loop to standardize the numeric columns in the [Western Collaborative Group Study](https://clinicaltrials.gov/ct2/show/NCT00005174). This study began in 1960 with 3154 men ages 39-59, who were employed in one of 11 California-based companies. They were followed until 1969; during this time, 257 of these men developed coronary heart disease (CHD). You can read this dataset in with the following code:
+
+```{R}
+
+suppressMessages(library(foreign))
+wcgs <- read.dta("https://drive.google.com/uc?export=download&id=0B8CsRLdwqzbzYWxfN3ExQllBQkU")
+```
+
+The WCGS data has the following variables:
+
+
+
+
+-----------------------------------------------------------
+Name      Description
+--------  -------------------------------------------
+id        Subject identification number
+
+age       Age in years
+
+height    Height in inches
+
+weight    Weight in lbs.
+
+sbp       Systolic blood pressure in mm Hg
+
+dbp       Diastolic blood pressure in mm Hg
+
+chol      Fasting serum cholesterol in mg/100 ml
+
+behpat    Behavior pattern
+
+          1 A1
+
+          2 A2
+
+          3 B3
+
+          4 B4
+
+ncigs     Cigarettes per day
+
+dibpat    Dichotomous behavior pattern
+
+          1 type A
+
+          2 type B
+
+chd69     Coronary heart disease
+
+          1 Yes
+
+          0 No
+
+typchd69  Type of CHD
+
+          1 myocardial infarction or death
+
+          2 silent myocardial infarction
+
+          3 angina pectoris
+
+time169   Time of CHD event or end of follow-up
+
+arcus     Arcus senilis
+
+          0 absent
+
+          1 present
+
+bmi       Body Mass Index
+-----------------------------------------------------------
+
+
+
+
+### Question 1: Standardize Function
+
+A. Create a function called standardize.me() that takes a numeric vector as an argument, and returns the standardized version of the vector.
+B. Assign all the numeric columns of the original WCGS dataset to a new dataset called WCGS.new.
+C. Using a loop and your new function, standardize all the variables in the WCGS.new dataset.
+D. What should the mean and standard deviation of all your new standardized variables be? Test your prediction by running a loop.
+
+
+
+
+### Question 2: Looping to Calculate
+
+A. Using a loop, calculate the mean weight of the subjects separated by the type of CHD they have.
+B. Now do the same thing, but don’t use a loop.

From 3470cafd99fc7805051fef3b00e55dc3b7b6a36d Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Tue, 9 Oct 2018 12:32:29 -0400
Subject: [PATCH 03/18] Delete pre-class-05.Rmd

---
 Week 05/pre-class-05.Rmd | 109 ---------------------------------------
 1 file changed, 109 deletions(-)
 delete mode 100644 Week 05/pre-class-05.Rmd

diff --git a/Week 05/pre-class-05.Rmd b/Week 05/pre-class-05.Rmd
deleted file mode 100644
index 21ea6bb..0000000
--- a/Week 05/pre-class-05.Rmd
+++ /dev/null
@@ -1,109 +0,0 @@
-# pre-class
-
-
-Make sure you commit this often with meaningful messages.
- - - -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) -``` - - - - - - - - -Standardizing a variable means subtracting the mean, and then dividing by the standard deviation. Let’s use a loop to standardize the numeric columns in the [Western Collaborative Group Study](https://clinicaltrials.gov/ct2/show/NCT00005174). This study began in 1960 with 3154 men ages 39-59, who were employed in one of 11 California based companies. They were followed until 1969 during this time, 257 of these men developed coronary heart disease (CHD). You can read this data in with the code below. You can access this dataset with the following code: - -```{R} - -suppressMessages(library(foreign)) -wcgs <- read.dta("https://drive.google.com/uc?export=download&id=0B8CsRLdwqzbzYWxfN3ExQllBQkU") -``` - -The data has the following variables: - - - -WCGS has the following variables: - ------------------------------------------------------------ -Name Description -------- ------------------------------------------- -id Subject identification number - -age Age in years - -height Height in inches - -weight Weight in lbs. - -sbp Systolic blood pressure in mm - -dbp Diastolic blood pressure in mm Hg - -chol Fasting serum cholesterol in mm - -behpat Behavior - - 1 A1 - - 2 A2 - - 3 B3 - - 4 B4 - -ncigs Cigarettes per day - -dibpat Behavior - -1 type A - -2 type B - -chd69 Coronary heart disease - -1 Yes - -0 no - -typechd Type of CHD - -1 myocardial infarction or death - -2 silent myocardial infarction - -3 angina perctoris - -time169 Time of CHD event or end of follow-up - -arcus Arcus senilis - -0 absent - -1 present - -bmi Body Mass Index ------------------------------------------------------------ - - - - -### Question 1: Standardize Function - -A. Create a function called standardize.me() that takes a numeric vector as an argument, and returns the standardized version of the vector. -B. Assign all the numeric columns of the original WCGS dataset to a new dataset called WCGS.new. 
-C. Using a loop and your new function, standardize all the variables WCGS.new dataset. -D. What should the mean and standard deviation of all your new standardized variables be? Test your prediction by running a loop - - - - -### Question 2: Looping to Calculate - -A. Using a loop, calculate the mean weight of the subjects separated by the type of CHD they have. -B. Now do the same thing, but now don’t use a loop From 30d937f3e406f735db6e06aee15a6ade62c0f9bd Mon Sep 17 00:00:00 2001 From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com> Date: Tue, 9 Oct 2018 12:32:37 -0400 Subject: [PATCH 04/18] Delete README.md --- Week 05/README.md | 26 -------------------------- 1 file changed, 26 deletions(-) delete mode 100644 Week 05/README.md diff --git a/Week 05/README.md b/Week 05/README.md deleted file mode 100644 index 2cfd975..0000000 --- a/Week 05/README.md +++ /dev/null @@ -1,26 +0,0 @@ -# Functions Pre-Class Work - -Please complete the following work. - - -### Objectives - -1. Gain further practice on loops. -2. Gain further practice on functions. -3. Apply functions and loops to actual data. - - - -## Flipped Material - -- Sign into [Datacamp](https://www.datacamp.com/) -- Complete [Chapters 4: Functions](https://campus.datacamp.com/courses/1118/) from PHP 1560/2560 Statistical Programing with R. -- Complete [Writing Functions in R Course](https://www.datacamp.com/courses/writing-functions-in-r) -- Complete these steps for the assignment: - - Create a new folder in your pre-class Repository called: `Week 05` - - Copy the `readme.md` and `pre-class-05.Rmd` document into this. - - Follow the instructions and commit often. - - - -Then proceed to the Pre Week 05 RMarkdown file, complete this and commit your work often to begin to learn how to make small changes and committing. 
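Question 1 in the pre-class file above asks for a standardize.me() function. A minimal sketch follows; the `na.rm = TRUE` handling is an added assumption (the assignment does not require it), included so that a column containing NA values, such as `chol`, is not turned entirely into NAs.

```r
# Sketch of a standardize.me() function; na.rm = TRUE is an assumption
# added here so columns containing NA values (e.g. chol) are not
# reduced to all-NA after standardizing.
standardize.me <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

z <- standardize.me(c(60, 70, NA, 80))
mean(z, na.rm = TRUE)  # effectively 0
sd(z, na.rm = TRUE)    # 1
```

Applied in a loop over the columns of a data frame, this gives each numeric variable mean 0 and standard deviation 1 while leaving individual NAs in place.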
From 71f37870434c4e3e9c3b800abba9a963abbadc74 Mon Sep 17 00:00:00 2001 From: Peter Shewmaker Date: Tue, 9 Oct 2018 13:07:03 -0400 Subject: [PATCH 05/18] Added solution to Question 1 --- Week 05/pre-class-05.Rmd | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/Week 05/pre-class-05.Rmd b/Week 05/pre-class-05.Rmd index 0f4015a..131f5c2 100644 --- a/Week 05/pre-class-05.Rmd +++ b/Week 05/pre-class-05.Rmd @@ -95,11 +95,40 @@ bmi Body Mass Index ### Question 1: Standardize Function -A. Create a function called standardize.me() that takes a numeric vector as an argument, and returns the standardized version of the vector. +A. Create a function called standardize.me() that takes a numeric vector as an argument, and returns the standardized version of the vector. +```{r} +#The standardize.me function takes each element of a numeric vector x, subtracts the mean of x from it and then divides by the standard deviation of x. This creates a standardized version of the vector x. +standardize.me <- function(x){ + standardized <- (x - mean(x))/sd(x) + standardized +} +``` B. Assign all the numeric columns of the original WCGS dataset to a new dataset called WCGS.new. +```{r} +#Using the dplyr function "select_if", the columns of the original dataset that are numeric are assigned to a new dataset WCGS.new. +library(dplyr) +WCGS.new <- select_if(wcgs, is.numeric) +``` C. Using a loop and your new function, standardize all the variables WCGS.new dataset. +```{r} +#This loop standardizes each column of the new dataset containing only numeric columns. +for(i in seq_along(WCGS.new)){ + WCGS.new[,i] <- standardize.me(WCGS.new[,i]) +} +``` D. What should the mean and standard deviation of all your new standardized variables be? Test your prediction by running a loop - +```{r} +#After being standardized, the mean should be 0 and the standard deviation should be 1 for each of the standardized variables. 
Let's check this: +means <- numeric(0) +sds <- numeric(0) +for(i in seq_along(WCGS.new)){ + means[i] <- mean(WCGS.new[,i]) + sds[i] <- sd(WCGS.new[,i]) +} +means +sds +#In fact, excluding the NA values, the mean is extremely close to 0 for each of the standardized variables, and the standard deviations are all exactly 1. +``` From 3b158c1d782a65274c4b5a111530bf21ba138381 Mon Sep 17 00:00:00 2001 From: Peter Shewmaker Date: Tue, 9 Oct 2018 13:30:30 -0400 Subject: [PATCH 06/18] Added solution to Question 2. --- Week 05/pre-class-05.Rmd | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/Week 05/pre-class-05.Rmd b/Week 05/pre-class-05.Rmd index 131f5c2..483d292 100644 --- a/Week 05/pre-class-05.Rmd +++ b/Week 05/pre-class-05.Rmd @@ -108,6 +108,7 @@ B. Assign all the numeric columns of the original WCGS dataset to a new dataset #Using the dplyr function "select_if", the columns of the original dataset that are numeric are assigned to a new dataset WCGS.new. library(dplyr) WCGS.new <- select_if(wcgs, is.numeric) + ``` C. Using a loop and your new function, standardize all the variables WCGS.new dataset. ```{r} @@ -115,6 +116,7 @@ C. Using a loop and your new function, standardize all the variables WCGS.new da for(i in seq_along(WCGS.new)){ WCGS.new[,i] <- standardize.me(WCGS.new[,i]) } +#Notice that since the "chol" column contains an NA value, the whole column becomes NA when the standardize function is applied to it, since the mean is not calculated with NA values. The function could be fixed to deal with the cases of NA value - since this is not explicitly asked for, I'll leave it out. ``` D. What should the mean and standard deviation of all your new standardized variables be? Test your prediction by running a loop ```{r} @@ -135,5 +137,41 @@ sds ### Question 2: Looping to Calculate A. Using a loop, calculate the mean weight of the subjects separated by the type of CHD they have. 
+```{r} +#First we initiate vectors that will contain the number of people with each type of CHD, and then the sum of their weights. Then these two vectors can be divided to produce the mean weight of the subjects by the type of CHD they have. +num_type <- numeric(4) +names(num_type) <- c("no CHD", "MI or SD", "silent MI", "angina") +sum_weight <- numeric(4) +names(sum_weight) <- c("no CHD", "MI or SD", "silent MI", "angina") +#This for loop looks at each row, determines the type of CHD the patient has, and adds 1 to the number of people with that type of CHD, then adds the weight of the patient to the sum of the weights of the patients with that type of CHD. +for(i in 1:nrow(wcgs)){ + if(wcgs$typchd69[i] == "no CHD"){ + num_type[1] <- num_type[1] + 1 + sum_weight[1] <- sum_weight[1] + wcgs$weight[i] + } + else if(wcgs$typchd69[i] == "MI or SD"){ + num_type[2] <- num_type[2] + 1 + sum_weight[2] <- sum_weight[2] + wcgs$weight[i] + } + else if(wcgs$typchd69[i] == "silent MI"){ + num_type[3] <- num_type[3] + 1 + sum_weight[3] <- sum_weight[3] + wcgs$weight[i] + } + else if(wcgs$typchd69[i] == "angina"){ + num_type[4] <- num_type[4] + 1 + sum_weight[4] <- sum_weight[4] + wcgs$weight[i] + } +} +#Once the loop is finished, the sum of the weights is divided by the number of people with each type of CHD, giving the mean weight of the subjects separated by the type of CHD they have. +mean_by_type <- sum_weight/num_type +names(mean_by_type) <- c("no CHD", "MI or SD", "silent MI", "angina") +mean_by_type +``` B. Now do the same thing, but now don’t use a loop +```{r} +#Now, using dplyr functions and piping, we group by the type of CHD and summarise by the mean weight. 
+wcgs %>%
+  group_by(typchd69) %>%
+  summarise(mean_weight = mean(weight))
+```

From cd8a787b10fa642f1f3776a09f256e00b8f07546 Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Mon, 22 Oct 2018 12:29:05 -1000
Subject: [PATCH 07/18] Create README.md

New folder and Readme.
---
 Week 06/README.md | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 Week 06/README.md

diff --git a/Week 06/README.md b/Week 06/README.md
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/Week 06/README.md
@@ -0,0 +1 @@
+

From 0b94c5b9699d8395587216d3116fc6432c8f80bf Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Mon, 22 Oct 2018 12:29:37 -1000
Subject: [PATCH 08/18] Update README.md

---
 Week 06/README.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/Week 06/README.md b/Week 06/README.md
index 8b13789..2f8715c 100644
--- a/Week 06/README.md
+++ b/Week 06/README.md
@@ -1 +1,27 @@
+# Functions Pre-Class Work
+
+Please complete the following work.
+
+
+### Objectives
+
+1. Gain further practice on functions.
+
+
+
+
+## Required Reading
+
+
+You should read all of these:
+
+- [R For Data Science: Functions Chapter](http://r4ds.had.co.nz/functions.html)
+- [Functional Programming - Advanced R](http://adv-r.had.co.nz/Functional-programming.html)
+- [Functionals](http://adv-r.had.co.nz/Functionals.html)
+- [Function Operators](http://adv-r.had.co.nz/Function-operators.html)
+
+
+
+
+Then proceed to the Pre Week 06 RMarkdown file; complete it and commit your work often to practice making small changes and committing them.

From ef219fbe93b661222159c1a6db8cdbacd7d9a169 Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Mon, 22 Oct 2018 12:30:41 -1000
Subject: [PATCH 09/18] Create pre-class-06.Rmd

Added pre-class-06 solutions.
---
 Week 06/pre-class-06.Rmd | 70 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 Week 06/pre-class-06.Rmd

diff --git a/Week 06/pre-class-06.Rmd b/Week 06/pre-class-06.Rmd
new file mode 100644
index 0000000..72c0c84
--- /dev/null
+++ b/Week 06/pre-class-06.Rmd
@@ -0,0 +1,70 @@
+# pre-class
+
+
+Make sure you commit this often with meaningful messages.
+
+
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+1. Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.
+
+```
+f1 <- function(string, prefix) {
+  substr(string, 1, nchar(prefix)) == prefix
+}
+```
+*This function takes the inputs "string" and "prefix", and checks if "prefix" is a prefix of "string", e.g. "ab" is a prefix of "abc", "e" is not a prefix of "abc". A better name for the function might be "check_prefix()".*
+```
+f2 <- function(x) {
+  if (length(x) <= 1) return(NULL)
+  x[-length(x)]
+}
+```
+*This function takes a vector and checks if it has a length longer than 1 (if it does not, it returns NULL). If it does have length longer than 1, it returns the vector with the last element removed. A good name might be "remove_last()".*
+
+```
+f3 <- function(x, y) {
+  rep(y, length.out = length(x))
+}
+```
+*This function takes two input vectors x and y, and repeats the vector y up to the length of x, e.g. if y = 1:5 and x = 1:20, then the function returns 1:5 repeated 4 times; if y = 1:7 and x = 1:20, then the function returns 1:7 repeated twice, then 1:6, since the length of x is 20. A better name might be "repeat_y_with_length_x()".*
+
+2. Compare and contrast rnorm() and MASS::mvrnorm(). How could you make them more consistent?
+
+*Firstly, the MASS::mvrnorm() function produces samples from a multivariate normal distribution, whereas the rnorm() function only simulates from the univariate normal distribution.
If the MASS::mvrnorm() function is given single values rather than a vector of values, it should produce similar results to the rnorm() function.*
+
+*The function rnorm() has no default value for n, whereas mvrnorm() has default value n = 1. On the other hand, rnorm() has default "mean = 0", whereas mvrnorm() has no default value for the mean "mu"; similarly, rnorm() has default "sd = 1", whereas mvrnorm() has no default for the covariance matrix "Sigma". In other words, rnorm() defaults to the standard normal distribution, whereas mvrnorm() needs to have the mean and covariance specified. Finally, mvrnorm() returns a matrix whereas rnorm() returns a vector of values.*
+
+*It makes sense for mvrnorm() to return a matrix, since each sample from a multivariate distribution is itself a vector. The best way to make these two functions more consistent would be to give both functions the same argument names and default values, e.g. n = 1, mean = 0, sd = 1.*
+
+3. Use `lapply()` and an anonymous function to find the coefficient of variation (the standard deviation divided by the mean) for all columns in the mtcars dataset.
+```{r}
+#First load the mtcars data set.
+mtcars <- mtcars
+#Using lapply with an anonymous function that gives the standard deviation divided by the mean returns the desired value for each column, since a data frame is treated as a list of its columns, so the function is applied to each column of mtcars.
+lapply(mtcars, function(x) sd(x)/mean(x))
+```
+4. Use vapply() to:
+ a. Compute the standard deviation of every column in a numeric data frame.
+```{r}
+#This function uses vapply to calculate the standard deviation of each column, with FUN.VALUE = numeric(1) to ensure that the function returns a numeric vector.
+column_sd <- function(df){
+  vapply(df, sd, FUN.VALUE = numeric(1))
+}
+```
+ b.
Compute the standard deviation of every numeric column in a mixed data frame. (Hint: you’ll need to use vapply() twice.)
+```{r}
+#This function works similarly to the one in part a, but first uses vapply to check which columns are numeric (the second use of vapply the hint asks for), then only returns the standard deviation of the columns which are numeric.
+num_column_sd <- function(df){
+  num_columns <- vapply(df, is.numeric, logical(1))
+  vapply(df[, num_columns, drop = FALSE], sd, numeric(1))
+}
+```

From a1416e61d4d78908507e81400db32df6462509d7 Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Mon, 22 Oct 2018 12:36:29 -1000
Subject: [PATCH 10/18] Create README.md

Added readme file to new folder Week 07
---
 Week 07/README.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 Week 07/README.md

diff --git a/Week 07/README.md b/Week 07/README.md
new file mode 100644
index 0000000..d0ded87
--- /dev/null
+++ b/Week 07/README.md
@@ -0,0 +1,23 @@
+# Simulating Gambler's Ruin
+
+## Description of Problem
+
+[Wikipedia]() describes the Gambler's Ruin as follows:
+
+The term gambler's ruin is a statistical concept expressed in a variety of forms:
+
+- The original meaning is that a persistent gambler who raises his bet to a fixed fraction of bankroll when he wins, but does not reduce it when he loses, will eventually and inevitably go broke, even if he has a positive expected value on each bet.
+- Another common meaning is that a persistent gambler with finite wealth, playing a fair game (that is, each bet has expected value zero to both sides) will eventually and inevitably go broke against an opponent with infinite wealth. Such a situation can be modeled by a random walk on the real number line. In that context it is provable that the agent will return to his point of origin or go broke and is ruined an infinite number of times if the random walk continues forever.
+- The result above is a corollary of a general theorem by Christiaan Huygens, which is also known as gambler's ruin. That theorem shows how to compute the probability of each player winning a series of bets that continues until one's entire initial stake is lost, given the initial stakes of the two players and the constant probability of winning. This is the oldest mathematical idea that goes by the name gambler's ruin, but not the first idea to which the name was applied.
+- The most common use of the term today is that a gambler playing a negative expected value game will eventually go broke, regardless of betting system. This is another corollary to Huygens' result.
+- The concept may be stated as an ironic paradox: Persistently taking beneficial chances is never beneficial at the end. This paradoxical form of gambler's ruin should not be confused with the gambler's fallacy, a different concept.
+
+The concept has specific relevance for gamblers; however, it also leads to mathematical theorems with wide application and many related results in probability and statistics. Huygens' result in particular led to important advances in the mathematical theory of probability.
+
+
+
+
+## This project
+
+
+An advantage of computational approaches is that, even without knowing how to solve a math problem analytically, we can simulate the result to as much precision as we would like. In this project we will work through simulating the answer to this problem.
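The direct simulation the README describes can be sketched in a few lines of R; the function name and the default stakes here are illustrative assumptions, not part of the assignment.

```r
# Minimal sketch of gambler's ruin: a fixed bet on a fair coin from a
# finite bankroll, stopping at bust or after max_hands hands.
# (simulate_ruin and its default values are illustrative assumptions.)
simulate_ruin <- function(bankroll = 1000, bet = 100, max_hands = 5000) {
  for (hand in seq_len(max_hands)) {
    bankroll <- bankroll + sample(c(-bet, bet), 1)
    if (bankroll <= 0) {
      return(list(busted = TRUE, hand = hand, bankroll = 0))
    }
  }
  list(busted = FALSE, hand = NA, bankroll = bankroll)
}

# Repeating the simulation many times and averaging the `busted` flags
# estimates the probability of ruin; more repetitions give more precision.
set.seed(1)
p_ruin <- mean(replicate(1000, simulate_ruin(max_hands = 100)$busted))
p_ruin
```

The early-exit `return()` is how the "must stop playing once bust" rule is enforced inside the loop.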
From 96815397ea1b2d95fe4f0c4802abb3d532f9bffd Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Mon, 22 Oct 2018 12:37:39 -1000
Subject: [PATCH 11/18] Create pre-class-07.Rmd

Added pre-class-07.Rmd file
---
 Week 07/pre-class-07.Rmd | 44 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 Week 07/pre-class-07.Rmd

diff --git a/Week 07/pre-class-07.Rmd b/Week 07/pre-class-07.Rmd
new file mode 100644
index 0000000..fa4997d
--- /dev/null
+++ b/Week 07/pre-class-07.Rmd
@@ -0,0 +1,44 @@
+---
+title: "Simulations Pre-Class Project"
+date: "Due March 13, 2017 at 5:00pm"
+output:
+  html_document
+
+
+---
+
+
+```{r,setup, echo=FALSE, cache=TRUE}
+## numbers >= 10^5 will be denoted in scientific notation,
+## and rounded to 2 digits
+options(scipen = 3, digits = 3)
+```
+
+
+
+
+# Project Goals
+
+
+With this project we will simulate a famous probability problem. This will not require knowledge of probability or statistics, only the logic to follow the steps needed to simulate the problem. This is one way to solve problems by using the computer.
+
+ 1. **Gambler's Ruin**: Suppose you have a bankroll of $1000 and make bets of $100 on a fair game. By simulating the outcome directly for at most 5000 iterations of the game (or hands), estimate:
+ a. the probability that you have "busted" (lost all your money) by the time you have placed your one hundredth bet.
+ b. the probability that you have busted by the time you have placed your five hundredth bet by simulating the outcome directly.
+ c. the mean time you go bust, given that you go bust within the first 5000 hands.
+ d. the mean and variance of your bankroll after 100 hands (including busts).
+ e. the mean and variance of your bankroll after 500 hands (including busts).
+
+Note: you *must* stop playing if your player has gone bust. How will you handle this in the `for` loop?
+
+2. **Markov Chains**.
Suppose you have a game where the probability of winning on your first hand is 48%; each time you win, that probability goes up by one percentage point for the next game (to a maximum of 100%, where it must stay), and each time you lose, it goes back down to 48%. Assume you cannot go bust and that the size of your wager is a constant $100. + a. Is this a fair game? Simulate one hundred thousand sequential hands to determine the size of your return. Then repeat this simulation 99 more times to get a range of values to calculate the expectation. + b. Repeat this process but change the starting probability to a new value within 2% either way. Get the expected return after 100 repetitions. Keep exploring until you have a return value that is as fair as you can make it. Can you do this automatically? + c. Repeat again, keeping the initial probability at 48%, but this time change the probability increment to a value different from 1%. Get the expected return after 100 repetitions. Keep changing this value until you have a return value that is as fair as you can make it. 
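The win-probability ladder in the Markov Chains question can be sketched as follows; the helper name play_hands() and its return convention (net winnings in dollars) are assumptions for illustration.

```r
# Sketch of the ladder: the win probability starts at `base`, rises by
# `step` after each win (capped at 1), and resets to `base` after each
# loss; the wager is a constant $100 and busting is impossible.
# (play_hands and its argument names are illustrative assumptions.)
play_hands <- function(n, base = 0.48, step = 0.01, bet = 100) {
  p <- base
  winnings <- 0
  for (i in seq_len(n)) {
    win <- runif(1) < p
    winnings <- winnings + if (win) bet else -bet
    p <- if (win) min(p + step, 1) else base
  }
  winnings
}

set.seed(2560)
returns <- replicate(100, play_hands(1000))
mean(returns)  # typically well below zero: the 1% ladder does not offset the 48% edge
```

Wrapping the repetition in an outer loop over candidate `base` or `step` values is one way to search automatically for parameters that make the expected return close to zero.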
From 267c8f42e44f0e5075ee8b643747f9ae64bd06fe Mon Sep 17 00:00:00 2001
From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com>
Date: Sun, 4 Nov 2018 15:20:14 -0500
Subject: [PATCH 12/18] Create README.md

Created README file for Week 09
---
 Week 09/README.md | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 Week 09/README.md

diff --git a/Week 09/README.md b/Week 09/README.md
new file mode 100644
index 0000000..a4a5111
--- /dev/null
+++ b/Week 09/README.md
@@ -0,0 +1,27 @@
+# This Project
+
+## Flipped Material
+
+- Sign into [Datacamp](https://www.datacamp.com/)
+- Complete [Working with Web Data in R](https://campus.datacamp.com/courses/working-with-web-data-in-r/downloading-files-and-using-api-clients?ex=1)
+- Complete [Webscraping in R from PHP 2560](https://campus.datacamp.com/courses/php-15602560-statistical-programming-in-r).
+
+
+## Exercises
+
+1. Read the HTML content of the following URL with a variable called webpage: https://money.cnn.com/data/us_markets/ At this point, it will also be useful to open this web page in your browser.
+2. Get the session details (status, type, size) of the above mentioned URL.
+3. Extract all of the sector names from the “Stock Sectors” table (bottom left of the web page.)
+4. Extract all of the “3 Month % Change” values from the “Stock Sectors” table.
+5. Extract the table “What’s Moving” (top middle of the web page) into a data-frame.
+6. Re-construct all of the links from the first column of the “What’s Moving” table.
+Hint: the base URL is “https://money.cnn.com”
+7. Extract the titles under the “Latest News” section (bottom middle of the web page.)
+8. To understand the structure of the data in a web page, it is often useful to know what the underlying attributes are of the text you see.
+Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table.
+9.
Extract the values of the blue percentage-bars from the “Trending Tickers” table (bottom right of the web page.) +Hint: in this case, the values are stored under the “class” attribute. +10. Get the links of all of the “svg” images on the web page. From 036e83328e0e13ed2645e5ac02449fd6db527f21 Mon Sep 17 00:00:00 2001 From: Peter-Shewmaker <43147190+Peter-Shewmaker@users.noreply.github.com> Date: Sun, 4 Nov 2018 15:21:32 -0500 Subject: [PATCH 13/18] Create pre-class-09.Rmd Created pre-class-09.Rmd file --- Week 09/pre-class-09.Rmd | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) create mode 100644 Week 09/pre-class-09.Rmd diff --git a/Week 09/pre-class-09.Rmd b/Week 09/pre-class-09.Rmd new file mode 100644 index 0000000..fe424d2 --- /dev/null +++ b/Week 09/pre-class-09.Rmd @@ -0,0 +1,35 @@ +--- +title: "Basic Webscraping" +--- + + +```{r,setup, echo=FALSE, cache=TRUE} +## numbers >= 10^5 will be denoted in scientific notation, +## and rounded to 2 digits +options(scipen = 3, digits = 3) +``` + + + +## Exercises + +1. Read the HTML content of the following URL with a variable called webpage: https://money.cnn.com/data/us_markets/ At this point, it will also be useful to open this web page in your browser. +2. Get the session details (status, type, size) of the above mentioned URL. +3. Extract all of the sector names from the “Stock Sectors” table (bottom left of the web page.) +4. Extract all of the “3 Month % Change” values from the “Stock Sectors” table. +5. Extract the table “What’s Moving” (top middle of the web page) into a data-frame. +6. Re-construct all of the links from the first column of the “What’s Moving” table. +Hint: the base URL is “https://money.cnn.com” +7. Extract the titles under the “Latest News” section (bottom middle of the web page.) +8. To understand the structure of the data in a web page, it is often useful to know what the underlying attributes are of the text you see. 
+Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table. +9. Extract the values of the blue percentage-bars from the “Trending Tickers” table (bottom right of the web page.) +Hint: in this case, the values are stored under the “class” attribute. +10. Get the links of all of the “svg” images on the web page. From ef5e703d95dc406d72d974da94b611cb79d24b4e Mon Sep 17 00:00:00 2001 From: Peter Shewmaker Date: Sun, 4 Nov 2018 18:27:09 -0500 Subject: [PATCH 14/18] Solutions to 1-6 --- Week 09/pre-class-09.Rmd | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/Week 09/pre-class-09.Rmd b/Week 09/pre-class-09.Rmd index fe424d2..74023d3 100644 --- a/Week 09/pre-class-09.Rmd +++ b/Week 09/pre-class-09.Rmd @@ -21,12 +21,49 @@ options(scipen = 3, digits = 3) ## Exercises 1. Read the HTML content of the following URL with a variable called webpage: https://money.cnn.com/data/us_markets/ At this point, it will also be useful to open this web page in your browser. +```{r} +#After installing the 'rvest' package, we can use the read_html function on the the given url to save the content of the URL. + +library(rvest) +url <- "https://money.cnn.com/data/us_markets/" +webpage <- read_html(url) +``` 2. Get the session details (status, type, size) of the above mentioned URL. +```{r} +#Running the html_session function on the given URL will give the session details. +html_session(url) +``` 3. Extract all of the sector names from the “Stock Sectors” table (bottom left of the web page.) +```{r} +#Running html_nodes on the webpage and specifying table and then running html_table on the nodes gives data frames containing the data on the webpage. +tables <- html_table(html_nodes(webpage, "table")) +#Three tables are then saved from the page. The second table is the "Stock Sectors" table, so we can then select that one. 
+Stock_Sectors <- tables[[2]] +#The first column of the Stock_Sectors table contains the names. +Stock_Sectors[,1] +``` 4. Extract all of the “3 Month % Change” values from the “Stock Sectors” table. +```{r} +#The second column of the Stock_Sectors table contains the "3 Month % Change" values. +Stock_Sectors[,2] +``` 5. Extract the table “What’s Moving” (top middle of the web page) into a data-frame. +```{r} +#We saved the "What's Moving" table in the "tables" list of data frames above. It is the first table in the list. +Whats_Moving <- tables[[1]] +Whats_Moving +``` 6. Re-construct all of the links from the first column of the “What’s Moving” table. Hint: the base URL is “https://money.cnn.com” +```{r} +#The URLs are in the form "https://money.cnn.com/quote/quote.html?symb=" followed by the 4 letter stock identifying abbreviation. These names are the first 4 letters in the first column of the Whats_Moving table. The urls are collected by pasting together the base URL with the 4 letter symbol by using substr to extract the abbreviations. +urls <- c() +names <- Whats_Moving[,1] +for(i in seq_along(names)){ + urls <- c(urls, paste(paste("https://money.cnn.com/quote/quote.html?symb=", substr(names[i], 1, 4), sep = ""))) +} +urls +``` 7. Extract the titles under the “Latest News” section (bottom middle of the web page.) 8. To understand the structure of the data in a web page, it is often useful to know what the underlying attributes are of the text you see. Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table.
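Editor's note: the solutions in patch 14 can only be checked against the live CNN page, so the `html_table()` pattern is hard to verify offline. The sketch below is an editor's illustration of the same two-step workflow on an invented two-row table; the HTML string, sector names, and percentage values are made up and do not come from the real page.

```r
# Offline sketch of the html_nodes() + html_table() pattern used above.
# The markup, sector names, and percentages are invented for illustration.
library(rvest)

page <- read_html('<html><body>
  <table>
    <tr><th>Sector</th><th>3 Month % Change</th></tr>
    <tr><td>Energy</td><td>+2.1%</td></tr>
    <tr><td>Utilities</td><td>-0.4%</td></tr>
  </table>
</body></html>')

# html_nodes() selects every <table>; html_table() turns each into a data frame.
tables <- html_table(html_nodes(page, "table"))
sectors <- tables[[1]]
sectors[[1]]  # the sector names column
sectors[[2]]  # the "3 Month % Change" column
```

Against the live page the same call produces one data frame per `<table>` element; only the list indices depend on how many tables the page happens to contain.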
From ce96b99003d8bd718287c55e777ea18fd4b07dd9 Mon Sep 17 00:00:00 2001 From: Peter Shewmaker Date: Sun, 4 Nov 2018 19:33:07 -0500 Subject: [PATCH 15/18] Completed --- Week 09/pre-class-09.Rmd | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/Week 09/pre-class-09.Rmd b/Week 09/pre-class-09.Rmd index 74023d3..e941834 100644 --- a/Week 09/pre-class-09.Rmd +++ b/Week 09/pre-class-09.Rmd @@ -56,15 +56,15 @@ Whats_Moving 6. Re-construct all of the links from the first column of the “What’s Moving” table. Hint: the base URL is “https://money.cnn.com” ```{r} -#The URLs are in the form "https://money.cnn.com/quote/quote.html?symb=" followed by the 4 letter stock identifying abbreviation. These names are the first 4 letters in the first column of the Whats_Moving table. The urls are collected by pasting together the base URL with the 4 letter symbol by using substr to extract the abbreviations. -urls <- c() -names <- Whats_Moving[,1] -for(i in seq_along(names)){ - urls <- c(urls, paste(paste("https://money.cnn.com/quote/quote.html?symb=", substr(names[i], 1, 4), sep = ""))) -} -urls +#I used the selectorgadget to pick out the appropriate CSS selector - in this case it is "tr .wsod_symbol" - and the "href" attribute combined with the base URL recreates the links. +url_suffixes <- html_attr(html_nodes(webpage, "tr .wsod_symbol"), "href") +paste("https://money.cnn.com", url_suffixes, sep = "") ``` 7. Extract the titles under the “Latest News” section (bottom middle of the web page.) +```{r} +#This bit of code extracts the nodes from the "Latest News" section, and the html_text function extracts the titles. +html_text(html_nodes(webpage, "#section_latestnews a")) +``` 8. To understand the structure of the data in a web page, it is often useful to know what the underlying attributes are of the text you see. Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table. 9. 
Extract the values of the blue percentage-bars from the “Trending Tickers” table (bottom right of the web page.) From 44ec71c54af566c3ab84e2b41e3a2f18868bae8f Mon Sep 17 00:00:00 2001 From: Peter-Shewmaker Date: Tue, 6 Nov 2018 10:04:05 -1000 Subject: [PATCH 16/18] Completed 1 - 9 --- Week 09/pre-class-09.Rmd | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/Week 09/pre-class-09.Rmd b/Week 09/pre-class-09.Rmd index e941834..fd09657 100644 --- a/Week 09/pre-class-09.Rmd +++ b/Week 09/pre-class-09.Rmd @@ -57,16 +57,30 @@ Whats_Moving Hint: the base URL is “https://money.cnn.com” ```{r} #I used the selectorgadget to pick out the appropriate CSS selector - in this case it is "tr .wsod_symbol" - and the "href" attribute combined with the base URL recreates the links. -url_suffixes <- html_attr(html_nodes(webpage, "tr .wsod_symbol"), "href") +url_suffixes <- html_attr(html_nodes(webpage, cs = "tr .wsod_symbol"), "href") paste("https://money.cnn.com", url_suffixes, sep = "") ``` 7. Extract the titles under the “Latest News” section (bottom middle of the web page.) ```{r} #This bit of code extracts the nodes from the "Latest News" section, and the html_text function extracts the titles. -html_text(html_nodes(webpage, "#section_latestnews a")) +html_text(html_nodes(webpage, css = ".HeadlineList a")) ``` 8. To understand the structure of the data in a web page, it is often useful to know what the underlying attributes are of the text you see. Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table. +```{r} +#The html_attrs function gives the attributes of the HTML element; the element is selected with the CSS selector found using the SelectorGadget app. +html_attrs(html_node(webpage, css = ".wsod_disclaimer span")) + +``` 9.
Hint: in this case, the values are stored under the “class” attribute. +```{r} +#This time we want the "class" attribute of the nodes selected by ".bars" (found using SelectorGadget). This returns a vector of strings of the form "bars pctX", where X is the percentage value of the bar. Removing the "bars pct" prefix leaves a vector of just the values. +values <- html_attr(html_nodes(webpage, ".bars"), "class") +as.numeric(gsub("bars pct", "", values)) +``` 10. Get the links of all of the “svg” images on the web page. + +```{r} +html_nodes(webpage, "svg") +``` \ No newline at end of file From 3eb072a8ebde6a193d307e90ec176c9f3b53de3d Mon Sep 17 00:00:00 2001 From: Peter-Shewmaker Date: Tue, 6 Nov 2018 10:04:52 -1000 Subject: [PATCH 17/18] Fixed typo. --- Week 09/pre-class-09.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Week 09/pre-class-09.Rmd b/Week 09/pre-class-09.Rmd index fd09657..69dbe74 100644 --- a/Week 09/pre-class-09.Rmd +++ b/Week 09/pre-class-09.Rmd @@ -57,7 +57,7 @@ Whats_Moving Hint: the base URL is “https://money.cnn.com” ```{r} #I used the selectorgadget to pick out the appropriate CSS selector - in this case it is "tr .wsod_symbol" - and the "href" attribute combined with the base URL recreates the links. -url_suffixes <- html_attr(html_nodes(webpage, cs = "tr .wsod_symbol"), "href") +url_suffixes <- html_attr(html_nodes(webpage, css = "tr .wsod_symbol"), "href") paste("https://money.cnn.com", url_suffixes, sep = "") ``` 7. Extract the titles under the “Latest News” section (bottom middle of the web page.)
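Editor's note: the "class"-attribute trick from exercise 9 in the patches above can also be checked offline. In this sketch the `div` markup and the percentage numbers are invented; only the select-then-`gsub` pattern mirrors the solution.

```r
# Offline sketch of reading values encoded in a "class" attribute.
# The markup and pct numbers are invented for illustration.
library(rvest)

page <- read_html('<html><body>
  <div class="bars pct73"></div>
  <div class="bars pct41"></div>
</body></html>')

# Each matching node carries a class of the form "bars pctX".
values <- html_attr(html_nodes(page, ".bars"), "class")
as.numeric(gsub("bars pct", "", values))  # 73 41
```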
From 09d8d7620a0fc99086c38df6b2c1e8f180867e54 Mon Sep 17 00:00:00 2001 From: Peter-Shewmaker Date: Wed, 7 Nov 2018 04:39:10 -1000 Subject: [PATCH 18/18] Finished assignment --- Week 09/pre-class-09.Rmd | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/Week 09/pre-class-09.Rmd b/Week 09/pre-class-09.Rmd index 69dbe74..5383cea 100644 --- a/Week 09/pre-class-09.Rmd +++ b/Week 09/pre-class-09.Rmd @@ -82,5 +82,8 @@ as.numeric(gsub("bars pct", "", values)) 10. Get the links of all of the “svg” images on the web page. ```{r} -html_nodes(webpage, "svg") +#We use html_nodes to select the <img> elements, and html_attr collects their URL paths. The paths with a .svg extension are then selected (the pattern is anchored so that "svg" elsewhere in a path does not match), and the full URLs are built. +images <- html_attr(html_nodes(webpage, "img"), "src") +svg_images <- images[grep("\\.svg$", images)] +paste("https://money.cnn.com", svg_images, sep = "") ```
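Editor's note: the image-filtering step in the final patch can likewise be exercised without the live page. The image paths below are invented; anchoring the pattern at `\\.svg$` keeps a stray "svg" elsewhere in a path from slipping through.

```r
# Offline sketch of collecting image URLs and keeping only the .svg files.
# The image paths are invented for illustration.
library(rvest)

page <- read_html('<html><body>
  <img src="/img/logo.svg">
  <img src="/img/svg-promo.png">
</body></html>')

images <- html_attr(html_nodes(page, "img"), "src")
svg_images <- images[grep("\\.svg$", images)]  # keeps only true .svg paths
paste("https://money.cnn.com", svg_images, sep = "")
```

Here only `/img/logo.svg` survives the filter, even though the second path also contains the substring "svg".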