# ANOVA Test

## Problem 1 - Use 5% as a significance level
In the last decade, stockbrokers have drastically changed the way they do business. Internet trading has become quite common and online trades can cost as little as $7. It is now easier and cheaper to invest in the stock market than ever before. What are the effects of these changes? To help answer this question, a financial analyst randomly sampled 366 American households and asked each to report the age of the head of the household and the proportion of their financial assets that are invested in the stock market. The age categories are:

Young (under 35)
Early middle age (35 to 49)
Late middle age (50 to 65)
Senior (over 65)

The analyst was particularly interested in determining whether the ownership of stocks varied by age. Do these data allow the analyst to determine that there are differences in stock ownership between the four age groups? Check the required conditions.

In [41]:
PROC IMPORT DATAFILE='Total Assets Invested Stacked.xlsx'
	DBMS=XLSX
	OUT=WORK.IMPORT;
	GETNAMES=YES;
RUN;

data stacked;
	set work.import;
run;

proc print data=stacked (obs=5) noobs; run;

TotalAssets,AgeGroup
24.8,Young
35.5,Young
68.7,Young
42.2,Young
49.5,Young


### Analysis

In this analysis, we look at the Total Assets Invested dataset, which contains 366 records, each with the age group of the household head and the proportion of the household's financial assets invested in the stock market. Age was divided into four categories:

Young (under 35)
Early middle age (35 to 49)
Late middle age (50 to 65)
Senior (over 65)

This analysis aims to determine whether stock ownership varies by age. To answer that question, the following hypotheses were tested at the 5% significance level (95% confidence level).

H0: μ Young = μ Early Middle Age = μ Late Middle Age = μ Senior

Ha: Not all means are equal

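First, the P-P plot of TotalAssets showed the points falling roughly along a straight line, so the data were treated as approximately normally distributed. Across all 366 households, the mean is 50.2, the standard deviation is 21.3, and the variance is 453.7. The skewness of -0.66 and the cluster of zero values in the lower tail indicate some departure from normality.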

In [42]:
proc univariate
	data=stacked;
	ppplot TotalAssets;
run;


Moments,Moments.1,Moments.2,Moments.3
N,366.0,Sum Weights,366.0
Mean,50.1800273,Sum Observations,18365.89
Std Deviation,21.3009965,Variance,453.732453
Skewness,-0.6627725,Kurtosis,0.35760488
Uncorrected SS,1087213.21,Corrected SS,165612.345
Coeff Variation,42.4491529,Std Error Mean,1.11342092

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,50.18003,Std Deviation,21.301
Median,52.04,Variance,453.73245
Mode,0.0,Range,99.97
,,Interquartile Range,26.12

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,45.06834,Pr > |t|,<.0001
Sign,M,170.5,Pr >= |M|,<.0001
Signed Rank,S,29155.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,99.97
99%,91.57
95%,80.73
90%,74.01
75% Q3,65.39
50% Median,52.04
25% Q1,39.27
10%,20.62
5%,0.0
1%,0.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
0,326,91.19,145
0,316,91.57,94
0,313,92.37,290
0,293,94.87,117
0,236,99.97,339
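
The P-P plot above pools all four age groups, whereas the required conditions for one-way ANOVA concern each group: approximate normality within groups and similar variances across groups. A minimal sketch of how these could be checked, assuming the `stacked` dataset created earlier (the choice of Levene's test is an assumption, not part of the original analysis):

```sas
/* Sketch: normality checks within each age group (not in the original analysis) */
proc univariate data=stacked normal;
	class AgeGroup;
	var TotalAssets;
	qqplot TotalAssets / normal(mu=est sigma=est);
run;

/* Sketch: Levene's test for equal variances across the age groups */
proc glm data=stacked;
	class AgeGroup;
	model TotalAssets = AgeGroup;
	means AgeGroup / hovtest=levene;
run;
quit;
```

A large P value in Levene's test would be consistent with the equal-variance condition; a small one would suggest treating the F test below with more caution.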


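Second, a one-way ANOVA was run on the dataset with PROC GLM, followed by Tukey's pairwise comparisons. The test gave F = 2.79 with 3 and 362 degrees of freedom and a P value of 0.0405, which is lower than α (0.05), so we reject the null hypothesis and conclude that the mean proportion of assets invested in the stock market is not the same in all four age groups. The Tukey comparisons show that the only significant pairwise difference is between the Early Middle Age and Young groups (difference 8.074, adjusted P value 0.0333); no other pair differs significantly at the 5% level. The following tables were generated.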

In [44]:
ods graphics off;
proc glm
	data=stacked;
	class AgeGroup;
	model TotalAssets = AgeGroup;
	means AgeGroup/ tukey;
	lsmeans AgeGroup/ adjust=tukey;
run;

Class Level Information,Class Level Information,Class Level Information
Class,Levels,Values
AgeGroup,4,Early_Middle_Age Late_Middle_Age Senior Young

Number of Observations Read,366
Number of Observations Used,366

Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,3,3741.3636,1247.1212,2.79,0.0405
Error,362,161870.9817,447.1574,,
Corrected Total,365,165612.3453,,,

R-Square,Coeff Var,Root MSE,TotalAssets Mean
0.022591,42.14046,21.1461,50.18003

Source,DF,Type I SS,Mean Square,F Value,Pr > F
AgeGroup,3,3741.36361,1247.121203,2.79,0.0405

Source,DF,Type III SS,Mean Square,F Value,Pr > F
AgeGroup,3,3741.36361,1247.121203,2.79,0.0405

Alpha,0.05
Error Degrees of Freedom,362.0
Error Mean Square,447.1574
Critical Value of Studentized Range,3.65009

Comparisons significant at the 0.05 level are indicated by ***.
AgeGroup Comparison,Difference Between Means,Lower Simultaneous 95% Confidence Limit,Upper Simultaneous 95% Confidence Limit,Significance
Early_Middle_Age - Senior,0.634,-7.974,9.242,
Early_Middle_Age - Late_Middle_Age,1.333,-6.067,8.734,
Early_Middle_Age - Young,8.074,0.445,15.703,***
Senior - Early_Middle_Age,-0.634,-9.242,7.974,
Senior - Late_Middle_Age,0.699,-8.433,9.831,
Senior - Young,7.44,-1.878,16.757,
Late_Middle_Age - Early_Middle_Age,-1.333,-8.734,6.067,
Late_Middle_Age - Senior,-0.699,-9.831,8.433,
Late_Middle_Age - Young,6.741,-1.475,14.956,
Young - Early_Middle_Age,-8.074,-15.703,-0.445,***

AgeGroup,TotalAssets LSMEAN,LSMEAN Number
Early_Middle_Age,52.4724427,1
Late_Middle_Age,51.1390323,2
Senior,51.8381034,3
Young,44.3983333,4

Least Squares Means for effect AgeGroup Pr > |t| for H0: LSMean(i)=LSMean(j) Dependent Variable: TotalAssets
i/j,1,2,3,4
1,,0.9666,0.9976,0.0333
2,0.9666,,0.9973,0.1494
3,0.9976,0.9973,,0.1681
4,0.0333,0.1494,0.1681,
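
The entries in the ANOVA table above can be cross-checked by hand. As a worked check, using only the sums of squares and degrees of freedom reported by PROC GLM and rounding to the precision shown:

$$
\begin{aligned}
MS_{\text{Model}} &= \frac{SS_{\text{Model}}}{df_{\text{Model}}} = \frac{3741.3636}{3} \approx 1247.12 \\
MS_{\text{Error}} &= \frac{SS_{\text{Error}}}{df_{\text{Error}}} = \frac{161870.9817}{362} \approx 447.16 \\
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{1247.12}{447.16} \approx 2.79 \\
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{3741.3636}{165612.3453} \approx 0.0226
\end{aligned}
$$

The small R-Square indicates that age group accounts for only about 2.3% of the variation in TotalAssets, even though the F test is significant at the 5% level.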


## Problem 2

One measure of the health of a national economy is how quickly it creates jobs. One aspect of
this issue is the number of jobs individuals hold. As part of a study on job tenure, a survey was
conducted wherein Americans aged between 17 and 45 were asked how many jobs they have
held in their lifetimes. Also recorded were gender and educational attainment. The categories
are:

Less than high school (E1)
High school (E2)
Some college/university but not degree (E3)
At least one university degree (E4)

a. Test to determine whether there is an interaction between gender and education in holding
jobs.

b. Test to determine whether there are differences in holding jobs between men and women.

c. Test to determine whether there are differences in holding jobs between the educational
levels.

In [45]:
PROC IMPORT DATAFILE='Lifetime_of_Jobs_by_Educational_Level_stacked.xlsx'
	DBMS=XLSX
	OUT=WORK.IMPORT_1;
	GETNAMES=YES;
RUN;


data stacked_1;
	set work.import_1;
run;

proc print data=stacked_1 (obs=5) noobs; run;

Gender,Education,JobLifetime
Male,E1,10
Male,E1,9
Male,E1,12
Male,E1,16
Male,E1,14


### Analysis

In this report, we analyze the Lifetime of Jobs by Educational Level dataset. It contains 80 records, each with the participant's gender, educational level, and the number of jobs held in their lifetime. The report aims to answer three questions.

Is there an interaction between gender and education in holding jobs?

H0: An interaction is absent

Ha: An interaction is present

Are there any differences in holding jobs between men and women?

H0: μ Men = μ Women

Ha: μ Men ≠ μ Women

Are there any differences in holding jobs between the educational levels?

H0: μ E1 = μ E2 = μ E3 = μ E4

Ha: Not all means are equal

First, we used a P-P plot to check whether the data are normally distributed. The points followed the theoretical normal line fairly closely, so we assume an approximately normal distribution.


In [47]:
proc univariate
	data=stacked_1;
	ppplot JobLifetime;
run;

Moments,Moments.1,Moments.2,Moments.3
N,80.0,Sum Weights,80.0
Mean,10.425,Sum Observations,834.0
Std Deviation,3.33669662,Variance,11.1335443
Skewness,-0.2249118,Kurtosis,-0.6938761
Uncorrected SS,9574.0,Corrected SS,879.55
Coeff Variation,32.0066822,Std Error Mean,0.37305402

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,10.425,Std Deviation,3.3367
Median,11.0,Variance,11.13354
Mode,11.0,Range,14.0
,,Interquartile Range,5.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,27.94501,Pr > |t|,<.0001
Sign,M,40.0,Pr >= |M|,<.0001
Signed Rank,S,1620.0,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,17
99%,17
95%,15
90%,15
75% Q3,13
50% Median,11
25% Q1,8
10%,6
5%,5
1%,3

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
3,73,15,67
3,64,15,78
4,79,16,4
5,68,16,16
5,61,17,6
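
The P-P plot above is drawn on the raw JobLifetime values. In a two-way design the normality condition is more commonly checked on the model residuals, which remove the group effects first. A minimal sketch, assuming the `stacked_1` dataset created earlier; the output dataset `resid_1` and the variable names `residual` and `predicted` are hypothetical names chosen for illustration:

```sas
/* Sketch: fit the two-way model and save residuals (names are illustrative) */
proc glm data=stacked_1;
	class Gender Education;
	model JobLifetime = Gender Education Gender*Education;
	output out=resid_1 r=residual p=predicted;
run;
quit;

/* Normality tests and a Q-Q plot for the saved residuals */
proc univariate data=resid_1 normal;
	var residual;
	qqplot residual / normal(mu=est sigma=est);
run;
```

If the residual Q-Q plot is roughly linear, that supports the same conclusion as the P-P plot of the raw data.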


The means for the two genders are close to each other (10.05 for women and 10.80 for men), as are the standard deviations (3.57 and 3.08).

For the education levels, on the other hand, the means differ more, falling from 12.05 for E1 to 8.55 for E4, and E3 has a higher standard deviation (3.70) than the other levels (2.86 to 2.95).

To test these observations formally, we ran a two-way ANOVA with an interaction term. The model had 7 degrees of freedom and the error had 72 degrees of freedom.

For the first question, the P value is 0.8915, which is higher than α (0.05); therefore we fail to reject the null hypothesis.

For the second question, the P value is 0.2944, which is also higher than α (0.05); therefore we fail to reject the null hypothesis.

For these two questions, there is no statistically significant evidence of an interaction between gender and education, or of a difference between men and women in the number of jobs held.

As for the last question, the P value is 0.0060, which is lower than α (0.05); therefore we reject the null hypothesis and conclude that there is a statistically significant difference between education levels in the number of jobs held in a lifetime.


In [48]:
proc anova
	data=stacked_1;
	class Gender Education;
	model JobLifetime = Gender Education Gender*Education;
	means Gender Education;
run;

Class Level Information,Class Level Information,Class Level Information
Class,Levels,Values
Gender,2,Female Male
Education,4,E1 E2 E3 E4

Number of Observations Read,80
Number of Observations Used,80

Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,7,153.35,21.9071429,2.17,0.0467
Error,72,726.2,10.0861111,,
Corrected Total,79,879.55,,,

R-Square,Coeff Var,Root MSE,JobLifetime Mean
0.174351,30.46392,3.175864,10.425

Source,DF,Anova SS,Mean Square,F Value,Pr > F
Gender,1,11.25,11.25,1.12,0.2944
Education,3,135.85,45.2833333,4.49,0.006
Gender*Education,3,6.25,2.0833333,0.21,0.8915

Level of Gender,N,JobLifetime,JobLifetime
Level of Gender,N,Mean,Std Dev
Female,40,10.05,3.57304725
Male,40,10.8,3.08179102

Level of Education,N,JobLifetime,JobLifetime
Level of Education,N,Mean,Std Dev
E1,20,12.05,2.85574214
E2,20,11.1,2.95403382
E3,20,10.0,3.69921756
E4,20,8.55,2.92853475
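
Each F value in the table above is the effect's mean square divided by the error mean square (10.0861 on 72 degrees of freedom). As a worked check using only the values reported by PROC ANOVA:

$$
\begin{aligned}
F_{\text{Gender}} &= \frac{11.25}{10.0861} \approx 1.12 \\
F_{\text{Education}} &= \frac{45.2833}{10.0861} \approx 4.49 \\
F_{\text{Gender} \times \text{Education}} &= \frac{2.0833}{10.0861} \approx 0.21
\end{aligned}
$$

The three effect sums of squares also add up to the model sum of squares, $11.25 + 135.85 + 6.25 = 153.35$, as expected for a balanced design.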



After rejecting the null hypothesis, a deeper analysis of the education levels was needed to determine which levels had means different from the others, so we used the Tukey method.

The Tukey grouping placed E1 in group A only and E4 in group B only, while E2 and E3 belong to both groups. This means that E2 and E3 do not differ significantly from either E1 or E4, but E1 and E4 have different means. For the pair E1 and E4 we tested

H0: μ E1 = μ E4

Ha: μ E1 ≠ μ E4

and the Tukey-adjusted P value was 0.0038, which is less than α (0.05), so we reject the null hypothesis.

This means that the mean number of jobs held in a lifetime for education level E4 is significantly different from the mean for E1 only. The differences between E4 and E2 (P value 0.0564) and between E4 and E3 (P value 0.4630) are not significant at the 5% level.

In [49]:
ods graphics off;
proc glm
	data=stacked_1;
	class Education;
	model JobLifetime = Education;
	means Education/ tukey;
	lsmeans Education/ adjust= tukey;
run;

Class Level Information,Class Level Information,Class Level Information
Class,Levels,Values
Education,4,E1 E2 E3 E4

Number of Observations Read,80
Number of Observations Used,80

Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,3,135.85,45.2833333,4.63,0.005
Error,76,743.7,9.7855263,,
Corrected Total,79,879.55,,,

R-Square,Coeff Var,Root MSE,JobLifetime Mean
0.154454,30.00655,3.128183,10.425

Source,DF,Type I SS,Mean Square,F Value,Pr > F
Education,3,135.85,45.2833333,4.63,0.005

Source,DF,Type III SS,Mean Square,F Value,Pr > F
Education,3,135.85,45.2833333,4.63,0.005

Alpha,0.05
Error Degrees of Freedom,76.0
Error Mean Square,9.785526
Critical Value of Studentized Range,3.71485
Minimum Significant Difference,2.5985

Means with the same letter are not significantly different.
Tukey Grouping,Tukey Grouping.1,Mean,N,Education
,A,12.05,20.0,E1
,A,,,
B,A,11.1,20.0,E2
B,A,,,
B,A,10.0,20.0,E3
B,,,,
B,,8.55,20.0,E4

Education,JobLifetime LSMEAN,LSMEAN Number
E1,12.05,1
E2,11.1,2
E3,10.0,3
E4,8.55,4

Least Squares Means for effect Education Pr > |t| for H0: LSMean(i)=LSMean(j) Dependent Variable: JobLifetime
i/j,1,2,3,4
1,,0.7722,0.1715,0.0038
2,0.7722,,0.6833,0.0564
3,0.1715,0.6833,,0.463
4,0.0038,0.0564,0.463,
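
The Minimum Significant Difference reported above can be reproduced from the other values in the Tukey output. As a worked check, assuming the standard Tukey formula for equal group sizes (here $n = 20$ per education level), with $q$ the critical value of the studentized range:

$$
\text{MSD} = q \sqrt{\frac{MS_{\text{Error}}}{n}} = 3.71485 \sqrt{\frac{9.785526}{20}} \approx 3.71485 \times 0.6995 \approx 2.5985
$$

Comparing the differences in means against this threshold gives the same grouping as the table:

$$
\begin{aligned}
\bar{x}_{E1} - \bar{x}_{E4} &= 12.05 - 8.55 = 3.50 > 2.5985 \quad \text{(significant)} \\
\bar{x}_{E2} - \bar{x}_{E4} &= 11.10 - 8.55 = 2.55 < 2.5985 \quad \text{(not significant)} \\
\bar{x}_{E1} - \bar{x}_{E3} &= 12.05 - 10.00 = 2.05 < 2.5985 \quad \text{(not significant)}
\end{aligned}
$$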
