### Practical 5

You are studying a fish phenotypic trait, "T," which you hypothesize is dominant over the alternative phenotype "t." In classical Mendelian genetics, the offspring of two heterozygous parents (Tt) should exhibit the dominant and recessive traits in a 3:1 ratio (three individuals with the dominant phenotype for every one individual with the recessive phenotype).

In a tank containing only heterozygous parents (Tt), you inspect 350 juveniles and observe that 254 display the dominant trait (T) and 96 display the recessive trait (t). You aim to use simulation to test whether there's a statistically significant difference between the observed numbers of dominant and recessive traits (254:96) and what you would expect if the trait T is truly dominant in a 3:1 ratio (approximately 263 dominant: 87 recessive, given the sample size of 350).

In other words, imagine a scenario where you have a large number of jars. Each jar contains an immense quantity of marbles that have an exact 3:1 ratio of black (representing the dominant trait) to white (indicative of the recessive trait) marbles. From each jar, you randomly select a sample of 350 marbles. Under the most typical circumstances, given the 3:1 ratio, you would expect to retrieve approximately 263 black and 87 white marbles from each jar.
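A single jar draw can be sketched in R as follows. This is a minimal illustration that assumes the jar is large enough for each marble to be treated as an independent draw with probability 3/4 of being black. The names `n_marbles`, `black`, and `white` are illustrative, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n_marbles <- 350

# Number of black (dominant) marbles in one sample of 350,
# assuming each marble is black with probability 3/4
black <- rbinom(1, size = n_marbles, prob = 3/4)
white <- n_marbles - black

c(black = black, white = white)
```

Repeating this draw many times gives a picture of how far the counts typically wander from 263:87 by chance alone.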

What you want to do here is assess how probable a deviation from this expected outcome is. Specifically, how plausible is it to draw a sample of 254 black and 96 white marbles, as was the case in your fish tank? And how plausible is it to draw a sample that diverges more substantially from the expected ratio, such as 200 black and 150 white marbles? This evaluation helps determine whether the observed deviation is within the range of normal sampling fluctuation or whether it signals an unusual event that is inconsistent with the 3:1 dominance ratio.

Recall that the steps to carry out this analysis are as follows:

1. Compute a test statistic to describe the observed difference between the expected and observed values.
   Hint: this was covered in the `pdf`.
2. Quantify what is considered normal sampling variation. In other words, use simulation to determine the outcomes that result from ordinary statistical fluctuation. This involves simulating many instances of drawing 350 marbles from jars with a 3:1 ratio and recording, with the test statistic above, the values you would expect from the randomness inherent in sampling alone.
3. Compute an empirical p-value and explain your findings.

Note that the approach described above is similar to the methodology discussed during our class exercise. However, unlike the procedure we followed in class, where we employed permutations as part of simulating a t-test-like process, this example doesn't necessitate permutations.

In [6]:
# Expect 3:1 ratio, 263 juveniles with the dominant trait and 87 with the recessive trait 
dominant_expected <- 263
recessive_expected <- 87

# Observed is 254 juveniles with dominant trait and 96 with the recessive trait 
dominant_observed <- 254
recessive_observed <- 96

In [2]:
dominant_observed - dominant_expected 

In [8]:
recessive_observed - recessive_expected

In [13]:
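# Dominant component of the test statistic: (observed - expected)^2 / expected
dominant_test_statistic <- (dominant_observed - dominant_expected)^2 / dominant_expected
dominant_test_statistic

In [14]:
# Recessive component of the test statistic: (observed - expected)^2 / expected
recessive_test_statistic <- (recessive_observed - recessive_expected)^2 / recessive_expected
recessive_test_statistic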
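
The two components above are usually combined into a single number. Assuming the intended statistic is the chi-square goodness-of-fit statistic (an assumption, since the `pdf` is not reproduced here), it is the sum of the two components, worked out here with the rounded expected counts used in the cells above:

$$
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
       = \frac{(254 - 263)^2}{263} + \frac{(96 - 87)^2}{87}
       = \frac{81}{263} + \frac{81}{87}
       \approx 0.308 + 0.931
       = 1.239
$$

With the unrounded expected counts (262.5 and 87.5) the same formula gives approximately 1.101.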

In [59]:
# Simulate 1000 draws of 350 marbles each from a jar with a 3:1 ratio
total_marbles <- 350

# Marble type 2 = dominant, type 1 = recessive.
# The sampling probabilities come from the hypothesized 3:1 ratio
# (1/4 recessive, 3/4 dominant), not from the test statistic components.
for (i in 1:1000) {
  sampled_marbles <- sample(c(1, 2), size = total_marbles, replace = TRUE,
                            prob = c(1/4, 3/4))

  count_type_dominant <- sum(sampled_marbles == 2)
  count_type_recessive <- sum(sampled_marbles == 1)

  # Print the dominant count, then the recessive count, for this draw
  print(count_type_dominant)
  print(count_type_recessive)
}




[1] 272
[1] 78
[1] 281
[1] 69
[1] 246
[1] 104
[1] 262
[1] 88
[1] 258
[1] 92
[1] 246
[1] 104
[1] 269
[1] 81
[1] 256
[1] 94
[1] 267
[1] 83
[1] 255
[1] 95
[1] 260
[1] 90
[1] 250
[1] 100
[1] 259
[1] 91
[1] 268
[1] 82
[1] 276
[1] 74
[1] 265
[1] 85
[1] 251
[1] 99
[1] 264
[1] 86
[1] 263
[1] 87
[1] 270
[1] 80
[1] 251
[1] 99
[1] 266
[1] 84
[1] 254
[1] 96
[1] 277
[1] 73
[1] 276
[1] 74
[1] 259
[1] 91
[1] 277
[1] 73
[1] 255
[1] 95
[1] 265
[1] 85
[1] 253
[1] 97
[1] 266
[1] 84
[1] 277
[1] 73
[1] 268
[1] 82
[1] 273
[1] 77
[1] 264
[1] 86
[1] 258
[1] 92
[1] 270
[1] 80
[1] 265
[1] 85
[1] 259
[1] 91
[1] 275
[1] 75
[1] 259
[1] 91
[1] 262
[1] 88
[1] 261
[1] 89
[1] 266
[1] 84
[1] 271
[1] 79
[1] 257
[1] 93
[1] 272
[1] 78
[1] 256
[1] 94
[1] 255
[1] 95
[1] 259
[1] 91
[1] 262
[1] 88
[1] 265
[1] 85
[1] 261
[1] 89
[1] 245
[1] 105
[1] 262
[1] 88
[1] 267
[1] 83
[1] 258
[1] 92
[1] 277
[1] 73
[1] 264
[1] 86
[1] 267
[1] 83
[1] 273
[1] 77
[1] 257
[1] 93
[1] 261
[1] 89
[1] 272
[1] 78
[1] 270
[1] 80
[1] 271
[1] 79
[1] 26

In [22]:
t_test_result <- t.test(sampled_marbles)
t_test_result


	One Sample t-test

data:  sampled_marbles
t = 74.893, df = 349, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.699869 1.791559
sample estimates:
mean of x 
 1.745714 


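This one-sample t-test asks whether the mean of the marble codes (1 and 2) differs from 0, so its very small p-value is expected and says nothing about whether the observed 254:96 split departs from the 3:1 ratio. Note also that `sampled_marbles` holds only the final iteration of the loop, not all 1000 draws.

Answering the question requires comparing the observed test statistic with the distribution of the same statistic across the simulated draws. The simulated dominant counts printed above range from roughly 245 to 281, and the observed count of 254 falls well inside that range, which suggests the observed split is consistent with ordinary sampling variation.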
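
Steps 2 and 3 of the procedure can be sketched as follows. This is a minimal sketch rather than the method prescribed in class: it assumes the chi-square statistic as the test statistic, uses the unrounded expected counts (262.5 and 87.5, so the statistic differs slightly from the rounded-count value), and draws from a binomial distribution instead of `sample()`. The names `chi_sq_from_dominant`, `n_sims`, and `empirical_p` are illustrative.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

n_juveniles <- 350
expected_dominant <- n_juveniles * 3/4   # 262.5
expected_recessive <- n_juveniles * 1/4  # 87.5

# Chi-square statistic for a given dominant count (the recessive count follows)
chi_sq_from_dominant <- function(dominant) {
  recessive <- n_juveniles - dominant
  (dominant - expected_dominant)^2 / expected_dominant +
    (recessive - expected_recessive)^2 / expected_recessive
}

# Step 1: statistic for the observed 254:96 split
observed_stat <- chi_sq_from_dominant(254)

# Step 2: distribution of the statistic under the 3:1 ratio
n_sims <- 10000
sim_dominant <- rbinom(n_sims, size = n_juveniles, prob = 3/4)
sim_stats <- chi_sq_from_dominant(sim_dominant)

# Step 3: share of simulated draws at least as extreme as the observed one
empirical_p <- mean(sim_stats >= observed_stat)

observed_stat
empirical_p
```

An empirical p-value well above 0.05 would indicate that a 254:96 split is an unremarkable outcome under the 3:1 hypothesis.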

In [61]:
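# Another way to demonstrate the simulated versus observed draws
# 1000 simulated dominant counts under the hypothesized 3:1 ratio
simulated_draws <- rbinom(1000, 350, 3/4)
# 1000 simulated dominant counts using the observed proportion (254/350);
# these are simulated values, not additional observations
observed_draws <- rbinom(1000, 350, 254/350)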

In [64]:
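# Two-sample t-test comparing the means of the two sets of simulated counts.
# Caution: this p-value shrinks as the number of simulated draws grows, so it
# is not an empirical p-value for the observed 254:96 split.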
t.test(simulated_draws, observed_draws)


	Welch Two Sample t-test

data:  simulated_draws and observed_draws
t = 23.969, df = 1996.5, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 7.906438 9.315562
sample estimates:
mean of x mean of y 
  262.525   253.914 
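
As a cross-check on a simulation-based answer, base R's `chisq.test()` can run the classical chi-square goodness-of-fit test on the observed counts. This is offered only as a comparison, since the practical asks for a simulation; the name `gof` is illustrative.

```r
# Observed counts of dominant and recessive juveniles
observed_counts <- c(dominant = 254, recessive = 96)

# Goodness-of-fit test against the hypothesized 3:1 ratio
gof <- chisq.test(observed_counts, p = c(3/4, 1/4))

gof$expected   # expected counts under the 3:1 ratio
gof$statistic  # chi-square statistic
gof$p.value    # p-value from the chi-square distribution with 1 df
```

If the simulation is set up correctly, its empirical p-value should be close to the one reported here.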
