## Setup

### import packages

In [1]:
library(tidyr)
library(gridExtra)
library(dplyr)
library(datasets)
library(ggplot2)
library(Ecdat)
library(car)
library(multcomp)
library(gmodels)
library(rcompanion)  # provides plotNormalHistogram() and transformTukey() used below



"package 'tidyr' was built under R version 3.6.3"
"package 'gridExtra' was built under R version 3.6.3"
"package 'dplyr' was built under R version 3.6.3"

Attaching package: 'dplyr'


The following object is masked from 'package:gridExtra':

    combine


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


"package 'Ecdat' was built under R version 3.6.3"
Loading required package: Ecfun


Attaching package: 'Ecfun'


The following object is masked from 'package:base':

    sign



Attaching package: 'Ecdat'


The following object is masked from 'package:datasets':

    Orange


Loading required package: carData


Attaching package: 'carData'


The following object is masked from 'package:Ecdat':

    Mroz



Attaching package: 'car'


The following object is masked from 'package:dplyr':

    recode


Loading required package: mvtnorm

"package 'mvtnorm' was built under

### import datasets

In [2]:

WinningNumbers <- read.csv("Lottery_Mega_Millions_Winning_Numbers__Beginning_2002_Wrangled.csv")
Wins <- read.csv("AllWinners.csv")
View(Wins)

X,Draw.Date,Amount,cash.prize,Location,State,Gender,Win,Jackpot
0,2002-05-17,2.33e+07,,"Chatham, Ill.",IL,Mixed,Y,Y
1,2002-05-24,2.33e+07,,"Chicago, Ill.",IL,F,Y,Y
2,2002-07-16,2.33e+07,,"Cliffside Park, N.J.",NJ,M,Y,Y
3,2002-08-27,2.33e+07,,"New York City, N.Y.",NY,Mixed,Y,Y
4,2002-09-06,2.33e+07,,"Kentwood, Mich.",MI,M,Y,Y
5,2002-09-27,2.33e+07,,"Mount Prospect, Ill.",IL,Mixed,Y,Y
6,2002-11-08,2.33e+07,,"Hoquiam, Wash.",WA,Mixed,Y,Y
7,2002-11-19,2.33e+07,,"New York City, N.Y.",NY,Mixed,Y,Y
8,2002-12-24,2.33e+07,,Unclaimed in N.Y.,NY,Unk,Y,Y
9,2003-02-11,2.33e+07,,"Brooklyn, N.Y.",NY,M,Y,Y


In [3]:
# The prize column in Wins is named Amount (there is no JackpotAmount column)
JackpotByGender <- aggregate(Amount ~ Gender, Wins, mean)
JackpotByGender


In [None]:
Jackpots <- filter(Wins, Jackpot == "Y")


In [None]:
NonJackpots <- filter(Wins, Jackpot == "N")

## Analysis

## 1. In the Mega Millions, what are the optimal numbers to select in order to achieve a return on investment (ROI)?

## 2. Does gender influence Prize Amount? 

### Comparing all Gender Categories

In [None]:
Wins$Gender <- as.factor(Wins$Gender)

In [None]:
Wins$Jackpot <- as.factor(Wins$Jackpot)

#### Testing Assumptions



##### 1. Normality


In [None]:
plotNormalHistogram(Wins$Amount)
#  positive skew


In [None]:
Wins$AmountSQRT <- sqrt(Wins$Amount)


In [None]:
plotNormalHistogram(Wins$AmountSQRT)
# still positively skewed


In [None]:
Wins$AmountLOG <- log(Wins$Amount)


In [None]:
plotNormalHistogram(Wins$AmountLOG)
# Better, but platykurtic (negative kurtosis). Try Tukey's Ladder of Powers transformation next.


In [None]:
Wins$AmountTUK <- transformTukey(Wins$Amount, plotit=FALSE)
plotNormalHistogram(Wins$AmountTUK)
# The plot matches the log transformation, so AmountLOG is used from here on.
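
The visual comparison of transformations above can be backed by a number. The following is a minimal sketch on made-up, right-skewed data (not the lottery file); the `skewness()` helper is a hypothetical base-R function written for this illustration, not part of any package loaded in this notebook.

```r
# Synthetic, right-skewed "prize amounts" (made-up data for illustration)
set.seed(42)
amount <- rlnorm(500, meanlog = 12, sdlog = 1.5)

# Hypothetical helper: sample skewness as the mean cubed z-score
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Skewness of the raw values and of each candidate transformation
skew_by_transform <- c(
  raw  = skewness(amount),
  sqrt = skewness(sqrt(amount)),
  log  = skewness(log(amount))
)
print(round(skew_by_transform, 2))
```

Values near zero indicate a roughly symmetric distribution, so the transformation with the smallest absolute skewness is a reasonable candidate to carry forward.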

##### 2. Homogeneity of Variance


In [None]:
leveneTest(AmountLOG ~ Gender, data = Wins)

The result was significant, so the homogeneity-of-variance assumption is not met. Use Anova() with white.adjust=TRUE when running the analysis to correct for the violation.
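
To make the test less of a black box, here is a rough base-R sketch of the idea behind the median-centered Levene test: a one-way ANOVA on each value's absolute deviation from its group median. The data frame `toy` and its columns are made up for this illustration.

```r
set.seed(1)
# Made-up groups; group C is given a deliberately larger spread
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 40)),
  y     = c(rnorm(40, sd = 1), rnorm(40, sd = 1), rnorm(40, sd = 4))
)

# Absolute deviation of each value from its own group's median
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$group, FUN = median))

# One-way ANOVA on the deviations; a small p-value suggests unequal variances
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
levene_p <- levene_by_hand[["Pr(>F)"]][1]
print(levene_by_hand)
```

A small `levene_p` here plays the same role as the significant result reported above: the groups do not share a common variance.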

##### 3. Homogeneity of Regression Slopes

In [None]:
Homogeneity_RegrSlp = lm(AmountLOG ~ Jackpot, data = Wins)
anova(Homogeneity_RegrSlp)

Unfortunately, because the p-value is significant, the data do not meet the assumption of homogeneity of regression slopes. In other words, whether a prize was a jackpot does affect its size, so Jackpot should NOT be used as a covariate. Instead, include it in the model as a second independent variable.
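
For comparison, a common alternative way to check this assumption is to fit the covariate-by-factor interaction and look at its row in the ANOVA table. The sketch below uses made-up data and hypothetical column names (`gender`, `jackpot`, `log_amount`); it is not the notebook's own check, and the simulated jackpot effect is the same in every group by construction.

```r
set.seed(7)
n <- 300
# Made-up data: three gender categories and a yes/no jackpot flag
toy <- data.frame(
  gender  = factor(sample(c("F", "M", "Mixed"), n, replace = TRUE)),
  jackpot = factor(sample(c("N", "Y"), n, replace = TRUE))
)
# Jackpot shifts the outcome by the same amount in every gender group
toy$log_amount <- 12 + 3 * (toy$jackpot == "Y") + rnorm(n)

# Main effects plus the jackpot-by-gender interaction
slopes_fit <- lm(log_amount ~ jackpot * gender, data = toy)
slopes_tab <- anova(slopes_fit)
print(slopes_tab)

# p-value for the interaction row; a significant value would suggest the
# jackpot effect differs across gender groups
interaction_p <- slopes_tab["jackpot:gender", "Pr(>F)"]
jackpot_p     <- slopes_tab["jackpot", "Pr(>F)"]
```

In this setup a significant `jackpot` row only says that jackpot status predicts the outcome. The `jackpot:gender` row is the one that speaks to whether that effect is the same across groups.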

##### 4. Sample size


This assumption is met. The rule of thumb is at least 20 cases per independent variable or covariate. There are 2 predictors, so at least 40 cases are needed, and there are 497.
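
The arithmetic behind this rule of thumb can be scripted so it is rechecked whenever the data change. This is a small sketch on a made-up data frame with 497 rows. The per-cell table is an extra check assumed to be useful here and is not part of the rule quoted above.

```r
set.seed(3)
# Made-up stand-in with the same number of rows as reported in the text
toy <- data.frame(
  gender  = factor(sample(c("F", "M", "Mixed", "Unk"), 497, replace = TRUE)),
  jackpot = factor(sample(c("N", "Y"), 497, replace = TRUE))
)

n_predictors <- 2                  # one independent variable + one covariate
min_cases    <- 20 * n_predictors  # rule of thumb: 20 cases per predictor
enough_cases <- nrow(toy) >= min_cases

# Extra check (an assumption of this sketch): cases per gender/jackpot cell
cell_counts <- table(toy$gender, toy$jackpot)
print(cell_counts)
print(enough_cases)
```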



#### Running the Analysis


In [None]:
ANCOVA = lm(AmountLOG~Jackpot + Gender*Jackpot, data=Wins)
# car::Anova() takes a lowercase `type` argument and supports only "II" or "III"
Anova(ANCOVA, type="II", white.adjust=TRUE)

Gender does appear to influence the size of the prize. However, after controlling for whether or not the prize was a jackpot, that influence disappears.
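
The model formula can look redundant, since `Gender*Jackpot` already expands to both main effects plus their interaction. The sketch below, on made-up data that only borrows the column names, checks that the formula as written and the shorter `Jackpot * Gender` describe the same fit.

```r
set.seed(11)
# Made-up data reusing the notebook's column names for readability
toy <- data.frame(
  Gender  = factor(sample(c("F", "M"), 200, replace = TRUE)),
  Jackpot = factor(sample(c("N", "Y"), 200, replace = TRUE))
)
toy$AmountLOG <- 10 + 2 * (toy$Jackpot == "Y") + rnorm(200)

as_written <- lm(AmountLOG ~ Jackpot + Gender * Jackpot, data = toy)
shorthand  <- lm(AmountLOG ~ Jackpot * Gender, data = toy)

# Terms R actually fits: two main effects and one interaction
model_terms <- attr(terms(as_written), "term.labels")
print(model_terms)

# Both formulas produce identical fitted values
same_fit <- isTRUE(all.equal(fitted(as_written), fitted(shorthand)))
print(same_fit)
```

Listing `Jackpot` first does matter for sequential (Type I) tables from `anova()`, because term order determines which effect is entered first.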

### Comparing Men and Women only

In [None]:
MenVWomen <- Wins %>% filter(Gender %in% c("M", "F"))
View(MenVWomen)

In [None]:
# droplevels() removes the now-empty "Mixed" and "Unk" levels left over from Wins
MenVWomen$Gender <- droplevels(as.factor(MenVWomen$Gender))

In [None]:
MenVWomen$Jackpot <- as.factor(MenVWomen$Jackpot)
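
One detail worth checking at this step: subsetting a factor keeps all of its original levels, and `as.factor()` on something that is already a factor leaves them in place. A tiny sketch on a made-up vector shows the difference `droplevels()` makes.

```r
# Made-up gender vector with the same four categories as the data
gender <- factor(c("M", "F", "Mixed", "Unk", "F", "M"))

# Keep only men and women
kept <- gender[gender %in% c("M", "F")]

levels_after_filter    <- levels(kept)              # all four levels survive
levels_after_as_factor <- levels(as.factor(kept))   # as.factor() changes nothing
levels_after_drop      <- levels(droplevels(kept))  # only the levels still in use

print(levels_after_filter)
print(levels_after_drop)
```

Empty levels are usually harmless inside `lm()`, which drops them when building the model frame, but they can show up as empty groups in tables and plots.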

#### Testing Assumptions



##### 1. Normality


In [None]:
plotNormalHistogram(MenVWomen$Amount)
#  positive skew

In [None]:
MenVWomen$AmountSQRT <- sqrt(MenVWomen$Amount)

In [None]:
plotNormalHistogram(MenVWomen$AmountSQRT)
# still positively skewed

In [None]:
MenVWomen$AmountLOG <- log(MenVWomen$Amount)

In [None]:
plotNormalHistogram(MenVWomen$AmountLOG)
# Better, but platykurtic (negative kurtosis). Try Tukey's Ladder of Powers transformation next.

In [None]:
MenVWomen$AmountTUK <- transformTukey(MenVWomen$Amount, plotit=FALSE)

In [None]:
plotNormalHistogram(MenVWomen$AmountTUK)
# The plot matches the log transformation, so AmountLOG is used from here on.

##### 2. Homogeneity of Variance


In [None]:
leveneTest(AmountLOG ~ Gender, data = MenVWomen)

The result was significant, so the homogeneity-of-variance assumption is not met. Use the Anova() function with white.adjust=TRUE when running the analysis to correct for the violation.

##### 3. Sample size


This assumption is met. The rule of thumb is at least 20 cases per independent variable or covariate. There are 2 predictors, so at least 40 cases are needed, and there are 320.



#### Running the Analysis


In [None]:
ANCOVA = lm(AmountLOG~Jackpot + Gender*Jackpot, data=MenVWomen)
# car::Anova() takes a lowercase `type` argument and supports only "II" or "III"
Anova(ANCOVA, type="II", white.adjust=TRUE)

When comparing men to women only, gender does not influence the size of the prize, even when controlling for whether or not the prize was a jackpot.