In [1]:
# Add libraries
library(tidyverse)
library(repr)
library(digest)
library(gridExtra)
library(dplyr)
library(broom)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘gridExtra’


The following object is masked from ‘package:dplyr’:

    combine




In [2]:
# Read data from .csv file
survey <- read.csv("https://raw.githubusercontent.com/Herman-Liao/stat-201-group-project/main/survey%20lung%20cancer.csv")
head(survey)

,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC.DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL.CONSUMING,COUGHING,SHORTNESS.OF.BREATH,SWALLOWING.DIFFICULTY,CHEST.PAIN,LUNG_CANCER
,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<chr>
1,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
2,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
3,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
4,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
5,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO
6,F,75,1,2,1,1,2,2,2,2,1,2,2,1,1,YES


In [3]:
# Recode the 1/2-coded survey columns as logicals (2 becomes TRUE) and LUNG_CANCER ("YES"/"NO") as a logical
# transmute() keeps only the columns listed here, so PEER_PRESSURE is dropped
survey_2 <- survey %>%
    transmute(gender = GENDER,
              age = AGE,
              smoking = SMOKING == 2,
              yellow_fingers = YELLOW_FINGERS == 2,
              anxiety = ANXIETY == 2,
              chronic_disease = CHRONIC.DISEASE == 2,
              fatigue = FATIGUE == 2,
              allergy = ALLERGY == 2,
              wheezing = WHEEZING == 2,
              alcohol_consuming = ALCOHOL.CONSUMING == 2,
              coughing = COUGHING == 2,
              shortness_of_breath = SHORTNESS.OF.BREATH == 2,
              swallowing_difficulty = SWALLOWING.DIFFICULTY == 2,
              chest_pain = CHEST.PAIN == 2,
              lung_cancer = LUNG_CANCER == "YES")

head(survey_2)

,gender,age,smoking,yellow_fingers,anxiety,chronic_disease,fatigue,allergy,wheezing,alcohol_consuming,coughing,shortness_of_breath,swallowing_difficulty,chest_pain,lung_cancer
,<chr>,<int>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
1,M,69,False,True,True,False,True,False,True,True,True,True,True,True,True
2,M,74,True,False,False,True,True,True,False,False,False,True,True,True,True
3,F,59,False,False,False,False,True,False,True,False,True,True,False,True,False
4,M,63,True,True,True,False,False,False,False,True,False,False,True,True,False
5,F,63,False,True,False,False,False,False,True,False,True,True,False,False,False
6,F,75,False,True,False,True,True,True,True,False,True,True,False,False,True


In [4]:
# Clean and wrangle data: keep only patients with lung cancer, along with whether they smoked and/or consumed alcohol
# Flag the "only smoking" and "only drinking" groups explicitly so that patients who do both or neither can be dropped
# Among the remaining patients, comparing the only-smoking proportion to 0.5 is equivalent to asking whether
# the difference in proportions (only smoking minus only drinking) is zero
survey_clean_wrangled <- survey_2 %>%
    filter(lung_cancer) %>%
    select(gender, smoking, alcohol_consuming) %>%
    mutate(only_smoking = smoking & !alcohol_consuming,
           only_drinking = !smoking & alcohol_consuming) %>%
    select(-alcohol_consuming, -smoking) %>%
    filter(only_smoking | only_drinking)

# Collapse only_smoking and only_drinking into one character variable: "only smoke" when only_smoking is TRUE, "only drink" otherwise
# This is safe because every remaining patient either only smokes or only drinks (never both, never neither)
survey_clean_wrangled <- survey_clean_wrangled %>%
    mutate(only_smoke_only_drink = ifelse(only_smoking, "only smoke", "only drink")) %>%
    select(gender, only_smoke_only_drink)

head(survey_clean_wrangled)

,gender,only_smoke_only_drink
,<chr>,<chr>
1,M,only drink
2,M,only smoke
3,F,only smoke
4,F,only smoke
5,M,only drink
6,F,only smoke


In [5]:
# Summarize the counts and the only-smoking proportion by gender; the row for female patients appears above the row for male patients
survey_summary <- survey_clean_wrangled %>%
    group_by(gender) %>%
    summarize(num_only_drinking = sum(only_smoke_only_drink == "only drink"),
              num_only_smoking = sum(only_smoke_only_drink == "only smoke"),
              prop_only_smoking = mean(only_smoke_only_drink == "only smoke"))

survey_summary

gender,num_only_drinking,num_only_smoking,prop_only_smoking
<chr>,<int>,<int>,<dbl>
F,27,50,0.6493506
M,48,15,0.2380952


The summary above shows that the data meet the conditions for applying the central limit theorem (CLT) within each gender. First, the observations must be independent; it is reasonable to assume that one respondent's selection did not depend on another's. Second, the sample must be no larger than 10% of the population; this clearly holds, because the survey cannot cover anything close to 10% of all lung cancer patients. Third, the sample must not be too small: each gender has at least 10 patients who only smoke and at least 10 who only drink (50 and 27 for women, 15 and 48 for men), and each gender's sample size (77 and 63) is above 30.
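
The size conditions can also be checked in code. The following is a minimal sketch, not part of the original analysis: the counts are copied by hand from the `survey_summary` output, `p0 = 0.5` is assumed as the null proportion (the same 0.5 used in the calculation cell), and the names `counts`, `expected_successes`, `expected_failures`, and `size_ok` are illustrative.

```r
# Counts copied from the survey_summary output (not recomputed from the raw data)
counts <- data.frame(
  gender = c("F", "M"),
  num_only_drinking = c(27, 48),
  num_only_smoking = c(50, 15)
)

p0 <- 0.5  # assumed null proportion, matching the 0.5 used in the calculation cell

# Sample size per gender: patients who only smoke plus patients who only drink
counts$n <- counts$num_only_drinking + counts$num_only_smoking

# Expected counts under the null hypothesis
counts$expected_successes <- counts$n * p0
counts$expected_failures <- counts$n * (1 - p0)

# Thresholds (10 per group, n above 30) follow the conditions stated in the text
counts$size_ok <- counts$num_only_smoking >= 10 &
  counts$num_only_drinking >= 10 &
  counts$expected_successes >= 10 &
  counts$expected_failures >= 10 &
  counts$n > 30

print(counts)
```

Checking the expected counts under the null alongside the observed counts is a common variant of the "not too small" condition; either version is satisfied here.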

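For each gender, we run a one-sample z-test for a proportion. Let $p$ be the proportion of lung cancer patients who only smoke, among those who either only smoke or only drink, and let $p_{0} = 0.5$ be its value under $H_{0}$. With sample proportion $\hat{p}$ and sample size $n$, the standard error under $H_{0}$ and the test statistic are:

\begin{align*}
    \mathrm{SE} &= \sqrt{\frac{p_{0}(1 - p_{0})}{n}} \\
    Z &= \frac{\hat{p} - p_0}{\mathrm{SE}} = \frac{\hat{p} - p_0}{\sqrt{\frac{p_{0}(1 - p_{0})}{n}}}
\end{align*}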
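
As a rough illustration of the $Z$ statistic, the sketch below wraps the calculation in a small helper and cross-checks it against base R's `prop.test()`. The helper name `one_sample_prop_z` and the counts (60 successes out of 100) are hypothetical and chosen only to make the arithmetic easy to follow; the upper-tailed alternative is an assumption that mirrors the `pnorm(..., lower.tail = FALSE)` call used in the notebook.

```r
# Hypothetical helper (not from the notebook): one-sample z-test for a proportion
one_sample_prop_z <- function(successes, n, p0 = 0.5) {
  p_hat <- successes / n                    # sample proportion
  se <- sqrt(p0 * (1 - p0) / n)             # standard error under the null
  z <- (p_hat - p0) / se                    # test statistic
  p_value <- pnorm(z, lower.tail = FALSE)   # upper-tailed p-value (assumed alternative: p > p0)
  list(p_hat = p_hat, se = se, z = z, p_value = p_value)
}

# Made-up counts: 60 successes in 100 trials gives z of approximately 2
res <- one_sample_prop_z(successes = 60, n = 100)
print(res$z)
print(res$p_value)

# Cross-check: without continuity correction, prop.test() uses the same statistic
pt <- prop.test(60, 100, p = 0.5, alternative = "greater", correct = FALSE)
print(unname(pt$p.value))
```

Note that `n` here is the total number of observations, not the number of successes; the standard error shrinks with the full sample size.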

In [6]:
# Set the significance level (type I error rate) to 5%; change this value to use a different level
alpha <- 0.05

In [7]:
# Calculate a confidence interval for the proportion of lung cancer patients who only smoke
# Calculate the z-score using worksheet 8 section 3.2, then calculate the p-value and check if p-value < alpha
# n is the sample size for each gender (only smoking + only drinking), not just the number who only smoke
# The standard error is computed under the null (p0 = 0.5) and is used for both the interval and the z-score
survey_calcs <- survey_summary %>%
    mutate(n = num_only_drinking + num_only_smoking,
           standard_error = sqrt(0.5 * (1 - 0.5) / n),
           lower_ci = prop_only_smoking - qnorm(alpha / 2, lower.tail = FALSE) * standard_error,
           upper_ci = prop_only_smoking + qnorm(alpha / 2, lower.tail = FALSE) * standard_error,
           z_score = (prop_only_smoking - 0.5) / standard_error,
           p_value = pnorm(z_score, lower.tail = FALSE),
           reject_null = p_value < alpha)

survey_calcs

gender,num_only_drinking,num_only_smoking,prop_only_smoking,n,standard_error,lower_ci,upper_ci,z_score,p_value,reject_null
<chr>,<int>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>
F,27,50,0.6493506,77,0.05698029,0.5376713,0.7610300,2.621093,0.004382414,True
M,48,15,0.2380952,63,0.06299408,0.1146291,0.3615614,-4.157609,0.9999839,False
