# Summary / Essentials

## Lesson 1


A **single-sample t-test** examines whether **the mean of a sample differs from a particular (hypothesized) value**, such as a known population mean.   
*Example:* A dataset with a continuous variable is given, along with a single value to test the dataset's mean against for a significant difference.
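
As a rough illustration of what a single-sample t-test computes (in R, `t.test(x, mu = ...)` does this for you), here is a minimal Python sketch using only the standard library. The consumption values and the test value `mu0 = 7.0` are made up for illustration, and the sketch stops at the t statistic and degrees of freedom; it does not compute a p-value.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)            # sample standard deviation (n - 1)
    t = (mean - mu0) / (sd / math.sqrt(n))   # distance from mu0 in standard errors
    return t, n - 1

# Hypothetical fuel-consumption values (made up for illustration)
consumption = [7.1, 6.8, 7.4, 7.9, 6.5, 7.2, 7.0, 7.6]
t_stat, df = one_sample_t(consumption, mu0=7.0)
print(f"t = {t_stat:.3f}, df = {df}")
```

The further `t` is from 0, the less plausible it is that the sample comes from a population with mean `mu0`.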

An **independent *t*-test** is used when you have one categorical independent variable that serves as the grouping variable, and one continuous dependent variable. Use it when you want to determine whether the **means of two different, unrelated groups are the same or different**.   
*Example:* Variable Car Class with compact and large cars. Question: Do compact and large cars consume the same amount of gas? Solution: Car Class is one variable containing both groups (compact and large), so first extract these two groups into separate datasets along with the gas-consumption information. Then run the t-test on the gas-consumption field of these two datasets.
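
The two steps of the car example (split by group, then compare the means) can be sketched in plain Python. The rows below are invented, and the statistic shown is Welch's t, which is also what R's `t.test()` uses by default; treat this as an illustration of the idea rather than a replacement for the R function.

```python
import math
import statistics

# Hypothetical rows: (car class, fuel consumption); values are made up
cars = [("compact", 6.1), ("large", 9.4), ("compact", 5.8), ("large", 10.2),
        ("compact", 6.5), ("large", 8.9), ("compact", 6.0), ("large", 9.7)]

# Step 1: extract the two groups into separate lists
compact = [value for car_class, value in cars if car_class == "compact"]
large = [value for car_class, value in cars if car_class == "large"]

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Step 2: compare the group means
t_stat = welch_t(compact, large)
print(f"t = {t_stat:.3f}")
```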

**Dependent t-tests** are used when your **samples are related in some way**, but you still want to see if the **means change**. *Example:* The same variable measured on the same units in different years or locations, compared against each other.
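
A dependent (paired) t-test boils down to a single-sample t-test on the per-pair differences. The sketch below shows that idea in plain Python; the two lists are hypothetical measurements of the same five units in two years.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, df) for paired samples, based on per-pair differences."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical measurements of the same five units in two different years
year_1 = [12.0, 15.5, 9.8, 14.1, 11.3]
year_2 = [13.1, 15.9, 10.9, 14.0, 12.6]
t_stat, df = paired_t(year_1, year_2)
print(f"t = {t_stat:.3f}, df = {df}")
```

Because each unit acts as its own control, the order of the values matters: position `i` in both lists must refer to the same unit.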

An **independent Chi-Square** is used when you want to **determine whether two categorical variables influence each other**. It works with a *contingency table*. *Example:* In a dataset of cars, do tyre size and vehicle size influence each other?

## Lesson 2

Transform non-normally distributed data into normally distributed data.


<table style="width:100%">
  <tr>
    <th><img src="../Media/skewedness.png" width="300"></th>
    <th><img src="../Media/skew3.png" width="300"></th>
  </tr>
</table>

<img src="../Media/DataTransformationTypes.png" width="500">

Which method to apply depending on how the data is distributed (cf. graph):
<img src="../Media/DataTransformationTypesExamples.png" width="500">
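
To get a feeling for how a transformation changes the shape of positively skewed data, the following sketch compares the skewness of a made-up sample before and after a square-root and a log transformation. The `skewness()` helper and the data are assumptions for illustration; in the lessons the judgement is made visually with histograms.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment); > 0 means a right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed data (all values > 0, so log is defined)
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

sqrt_t = [math.sqrt(v) for v in raw]   # milder correction
log_t = [math.log(v) for v in raw]     # stronger correction

for name, data in [("raw", raw), ("sqrt", sqrt_t), ("log", log_t)]:
    print(f"{name:>4}: skewness = {skewness(data):.2f}")
```

For this sample the log transformation pulls the long right tail in more strongly than the square root does.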


## Lesson 3
### One Proportion Testing
One proportion testing is used when you want to see whether a single observed proportion differs from a hypothesized value.

Example: An Easter basket full of candy, which is filled with both jellybeans and chocolate eggs. A random sample of 43 pieces of candy is taken from the Easter basket. There are 15 jellybeans and 28 chocolate eggs in the sample. You can use the function `prop.test()` to test whether the proportion of jellybeans (and therefore of chocolate eggs) in the basket is equal to 0.5, i.e. whether there are as many jellybeans as chocolate eggs.
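
The Easter basket numbers can be run through a simple one-proportion z-test by hand. This is a sketch of the normal approximation in plain Python, not a reimplementation of `prop.test()`: R's function applies a continuity correction by default, so its p-value will be somewhat larger than the one computed here.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, p0):
    """Two-sided z-test of H0: true proportion == p0 (normal approximation)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)              # standard error under H0
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # both tails
    return p_hat, z, p_value

# 15 jellybeans out of 43 candies, tested against a proportion of 0.5
p_hat, z, p_value = one_proportion_z(15, 43, 0.5)
print(f"p_hat = {p_hat:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```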

### Two Proportion Testing
You will use a two proportion z test when you want to compare a proportion between two different groups, for example the share of pink candies among the jelly beans versus among the chocolate eggs.

Example: Go back to the Easter basket full of candy, which is filled with both jelly beans and chocolate eggs. Each is available in several colors. With a two proportion test, you can determine whether the proportion of pink candies differs between the jelly beans and the chocolate eggs.
There are 15 jelly beans and 28 chocolate eggs. Of the jelly beans, 7 are pink. Of the chocolate eggs, 12 are pink.

```{r}
prop.test(x = c(7, 12), n = c(15, 28),
          alternative = "two.sided")
```
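
For comparison, here is the same pink-candy question as a hand-computed two-proportion z-test in Python. It uses the pooled proportion for the standard error and no continuity correction, so the p-value will not match `prop.test()` exactly; it is meant to show which quantities go into the test.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test of H0: both groups share the same true proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                 # proportion if H0 is true
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p1, p2, z, p_value

# 7 of 15 jelly beans are pink, 12 of 28 chocolate eggs are pink
p1, p2, z, p_value = two_proportion_z(7, 15, 12, 28)
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, z = {z:.3f}, p = {p_value:.3f}")
```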


### Chi Square
There are two assumptions associated with independent Chi-Squares. The first is that you need to have independent data. This is a theoretical requirement: each person or object must fit in only one cell. The second is that your expected frequencies (i.e. the expected element count) must be greater than 5 in each cell. You can check this second assumption by running your Chi-Square and asking for expected frequencies.

```{r}
library("gmodels")  # provides CrossTable()
CrossTable(SW_survey_renamed$Age, SW_survey_renamed$RankI, fisher=TRUE, chisq = TRUE, expected = TRUE, sresid=TRUE, format="SPSS")
```
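
The expected frequencies that `CrossTable(..., expected = TRUE)` reports follow a simple rule: row total times column total, divided by the grand total. The sketch below computes them, flags cells below 5, and computes the Chi-Square statistic for a hypothetical 2 x 3 table (the counts are invented, not taken from the survey data).

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

# Hypothetical 2 x 3 contingency table (rows: age group, columns: rank)
observed = [[20, 15, 5],
            [10, 25, 25]]

expected = expected_counts(observed)
# Cells that violate the expected-frequency assumption
small_cells = [(i, j) for i, row in enumerate(expected)
               for j, e in enumerate(row) if e < 5]
chi2 = chi_square_statistic(observed)
print(expected, small_cells, round(chi2, 3))
```

Note that the assumption is about the expected counts, not the observed ones: the observed cell with 5 elements is fine here because its expected count is well above 5.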


### McNemar Chi-Squares
The McNemar Chi-Square is used when you are trying to look at something over time, and have only two timepoints; maybe a pre and a post. The timepoints are your independent variable. You are also limited to two levels of your dependent variable. You can think of a McNemar Chi-Square like a dependent t-test for categorical data.

Example: Is coffee sold more at the beginning or at the end of the month (both with 0/1 levels)?
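
A McNemar Chi-Square only looks at the units that changed between the two timepoints. The following sketch computes the statistic from the two "discordant" counts; the counts `b` and `c` are hypothetical, and no continuity correction is applied (R's `mcnemar.test()` applies one by default, so its result will differ slightly).

```python
import math

def mcnemar(b, c):
    """McNemar chi-square (no continuity correction) from the two discordant counts.

    b: units that were 1 at time 1 and 0 at time 2
    c: units that were 0 at time 1 and 1 at time 2
    """
    if b + c == 0:
        raise ValueError("no discordant pairs, the test is undefined")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical: 5 units switched from 1 to 0, 15 switched from 0 to 1
chi2, p_value = mcnemar(b=5, c=15)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```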

### Summary
Although continuous data is usually easier to work with and you can extract more data from it, there will be times when you come up against a large pile of categorical data (especially if your company collected it themselves!) and you will need some advanced categorical analysis tools in your arsenal! 

You will use **one proportion testing** when you are comparing the rate of one item to a gold standard rate. 

**Two proportion testing** will be used to compare the rates of two groups with each other. 

**Goodness of fit Chi-Squares** are used to test whether your sample data could feasibly come from the population as a whole, and you can use a **McNemar Chi-Square** to look at a two-level categorical variable that is measured repeatedly at two timepoints.



## Lesson 4
### ANOVA

Steps:
*  Data wrangling: bring the data into shape, e.g. extract the groups of interest (such as 3 types of apps out of 15) into a separate data frame, or make the DV numeric
*  Test Assumptions
    *  Normality of DV
        *  Histograms
    *  Homogeneity of Variance of DV:
        *  Bartlett's test is for when your data is normally distributed
        *  Fligner's test is for when your data is not normally distributed (non-parametric)
        *  Correcting for Violations of Homogeneity of Variance: There are two ways to correct for a violation of homogeneity of variance. The first is a Box-Cox transformation of your data; the second is running a slightly different type of ANOVA that was created specifically to handle this violation, the Welch One-Way Test.
    *  Sample size: ANOVA requires a sample size of at least 20 per independent variable.  
    *  Independence: There is no statistical test for the assumption of independence.  
       
In Python, if homogeneity of variance of the DV is not met (p < 0.05), do another analysis:
*   stats.f_oneway()
*   Post Hoc Test

Example: 

```{python}
from scipy import stats

# One-way ANOVA: compare VideoViews5RT across the four Grade levels
stats.f_oneway(YTC2['VideoViews5RT'][YTC2['Grade']==0],
               YTC2['VideoViews5RT'][YTC2['Grade']==1],
               YTC2['VideoViews5RT'][YTC2['Grade']==2],
               YTC2['VideoViews5RT'][YTC2['Grade']==3])
```

and 

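```{python}
from statsmodels.stats.multicomp import MultiComparison

# Tukey HSD post hoc test on the same DV and grouping variable
postHoc = MultiComparison(YTC2['VideoViews5RT'], YTC2['Grade'])
postHocResults = postHoc.tukeyhsd()
print(postHocResults)
```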
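
To see what a one-way ANOVA actually compares, here is a hand-rolled sketch of the F statistic: the variance between the group means divided by the variance within the groups. The three groups are made-up numbers, and the function name `one_way_f` is only an illustration; for real data a library function is the better choice because it also returns the p-value.

```python
def one_way_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical scores for three groups
f, df_between, df_within = one_way_f([4, 5, 6], [6, 7, 8], [8, 9, 10])
print(f"F({df_between}, {df_within}) = {f:.2f}")
```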

## Lesson 5 
### Repeated Measures ANOVA
Repeated Measures ANOVAs, also known as within subjects ANOVAs, are when you are measuring the same person or thing repeatedly over time. 

This is an example for R
```{r}
library("rcompanion")
library("fastR2")
library("car")

# read in data
honey <- read.csv("Library/Mobile Documents/com~apple~CloudDocs/Privat/Weiterbildung/Bethel-DataScience/Module6_Intermediate_Statistics/DS105-Intermediate-Statistics/Data/honey.csv")
View(honey)
str(honey)

#Task:
#Determine whether honey production totalprod has changed over the years (year) using a repeated measures ANOVA
summary(honey)
#checking for missing (NA) values
sapply(honey, function(x) sum(is.na(x)))

#DataWrangling for year:
honey$year <- as.character(honey$year)
honey$year <- as.factor(honey$year)

#Normality
plotNormalHistogram(honey$totalprod)
#fairly positively skewed --> sqrt
honey$totalprodSQRT <- sqrt(honey$totalprod)
plotNormalHistogram(honey$totalprodSQRT)
#not so nice yet - try log

honey$totalprodLog <- log(honey$totalprod)
plotNormalHistogram(honey$totalprodLog)
#way better!!

#Homogeneity of Variance
leveneTest(totalprodLog ~ year, data=honey)
# Levene's Test for Homogeneity of Variance (center = median)
#        Df F value Pr(>F)
# group   4  0.0318  0.998
#       196
#--> no violation of homogeneity of variance!! p=0.998
#
# Sample Size
# A repeated measures ANOVA requires a sample size of at least 20 per independent variable.
# We have that, so this assumption has been met.

#ANOVA Analysis - globally, not considering a grouping by state:
RManova1 <- aov(totalprodLog~year, honey)
summary(RManova1)
#Result:
#               Df Sum Sq Mean Sq F value Pr(>F)
# year          4    0.4   0.112   0.058  0.994
# Residuals   196  378.0   1.929
# There seems to be no significance.
# Ergo: Globally in the U.S. honey production has not changed over the years

#taking the state as Error into account:
RManova2 <- aov(totalprodLog~year+Error(state), honey)
summary(RManova2)

#Result:
# Error: state
# Df Sum Sq Mean Sq F value Pr(>F)
# year       1    2.6   2.571   0.272  0.605
# Residuals 39  368.6   9.452
#
# Error: Within
#              Df  Sum Sq Mean Sq F value  Pr(>F)
# year        4    0.677 0.16913   4.007 0.00402 **
#   Residuals 156  6.584 0.04221
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# This is significant at the 0.01 level (p = 0.00402)!
# Ergo: Locally, state-wise, honey production in the U.S. has changed over the years.
#
#Post Hocs
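# Post hocs compare the levels of the IV (year), so year is the grouping factor
pairwise.t.test(honey$totalprodLog, honey$year, p.adjust="none")
# Grouping by honey$state instead did not give me any results... probably too few observations per state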

```
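
The honey example shows an effect of year only once the state is taken into account. The reason is that a repeated measures ANOVA removes the stable differences between subjects (here: states) from the error term. The sketch below demonstrates this with a small made-up, balanced dataset of 4 states and 3 years; the function name and the numbers are assumptions for illustration and have nothing to do with the real `honey.csv`.

```python
def repeated_measures_f(data):
    """data[s][t] = value for subject s at time t (complete, balanced design).

    Returns (F with subject variation removed, naive F ignoring subjects).
    """
    n_subj, n_time = len(data), len(data[0])
    n_total = n_subj * n_time
    grand = sum(sum(row) for row in data) / n_total
    subj_means = [sum(row) / n_time for row in data]
    time_means = [sum(col) / n_subj for col in zip(*data)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subjects = n_time * sum((m - grand) ** 2 for m in subj_means)
    ss_time = n_subj * sum((m - grand) ** 2 for m in time_means)
    ss_error = ss_total - ss_subjects - ss_time   # what is left after both

    ms_time = ss_time / (n_time - 1)
    f_within = ms_time / (ss_error / ((n_subj - 1) * (n_time - 1)))
    f_naive = ms_time / ((ss_total - ss_time) / (n_total - n_time))
    return f_within, f_naive

# Hypothetical production values: rows are states, columns are years
production = [[10, 11, 12.5],
              [20, 21.5, 22],
              [30, 30.5, 32],
              [40, 41, 42.5]]
f_within, f_naive = repeated_measures_f(production)
print(f"F (repeated measures) = {f_within:.1f}, F (ignoring state) = {f_naive:.3f}")
```

Every state rises a little each year, but the states differ hugely from each other. Ignoring the state buries the small yearly change in that between-state noise; accounting for it makes the change visible.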

## Lesson 6 
### Mixed Measures ANOVA

A mixed measure ANOVA includes both a *within* and a *between subject* variable. Follow a very similar process to do a mixed measures ANOVA as a repeated measures or within subjects ANOVA, but an additional factor is added, a second IV.   
A sample question:   
```text
Did those who ate breakfast in the morning improve their resting metabolic rate from baseline to follow up compared to those who skipped breakfast?  
```

IV:   
I     Eating Habit, Categorical, 2 Levels (ate breakfast; no breakfast); ('Treatment.Group')   
(II)  Time; Categorical; 2 Levels (Baseline; Followup); ('contrasts')    
DV:   resting metabolic rate; ('repdat')

**Testing Assumptions** 

The assumptions for a mixed measure ANOVA are the same as the ones for a repeated measures ANOVA.  The only thing that differs is the sample size, because now there are two IVs. 

**Sample Size**

There must be at least 20 cases per IV; with two IVs that means at least 2*20 = 40.

**Analysis**

```text
RManova1 <- aov(repdat~(Treatment.Group*contrasts)+Error(Participant.Code/(contrasts)), breakfast6)
summary(RManova1)
```


## Lesson 7 
### ANCOVA

ANCOVA stands for analysis of covariance. It is an analysis in the family of ANOVAs, and the big difference is that "C." An ANCOVA takes into account, or adjusts for, yet another factor in the model, aptly named a **covariate**. Put another way, an ANCOVA controls for the changes that might come up naturally. For instance, men are slightly better than women at spatial and analytic reasoning on average. If you collected information from everyone about their gender, you could then use it to control for the natural differences between men and women, using it as a covariate.

Covariates are typically continuous, but it is possible to use categorical covariates if you dummy code them.

Example question:

```text
Controlling for students' research participation in undergrad, does the rating of the students' undergraduate university impact their chance of admittance into graduate school? 
```

IV: University Rating; categorical, 5 Levels   
DV: Chance of admittance; continuous   
CV: Student's Research Participation; categorical, 2 Levels

***Ensure the IV and CV are factors***

### Testing Assumptions
#### homogeneity of regression slopes
The assumptions for an ANCOVA are similar to those for your basic ANOVAs. However, one assumption is added: the assumption of homogeneity of regression slopes, which requires that the relationship between the covariate and the DV is the same at every level of the IV (that is, the IV and the covariate do not interact). 

#### Normality
Check for Normality of DV

```{r}
plotNormalHistogram(graduate_admissions$Chance.of.Admit)
```

#### Homogeneity of Variance

```{r}
leveneTest(Chance.of.AdmitSQ~University.Rating, data=graduate_admissions)
```

#### Homogeneity of Regression Slopes

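In order to test for homogeneity of regression slopes, run a one-way ANOVA with the covariate as the IV and the DV that will be used for the ANCOVA. If the *F* test is non-significant, then you are good to go! Note that this checks whether the covariate is related to the DV; a direct test of equal slopes looks at the `University.Rating:Research` interaction term in a model containing both.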

Here is the basic ANOVA information: 

```{r}
Homogeneity_RegrSlp = lm(Chance.of.AdmitSQ~Research, data=graduate_admissions)
anova(Homogeneity_RegrSlp)
```

If the *p* value is significant, the data does not meet the assumption of homogeneity of regression slopes.  That means that the CV has an influence on the DV and thus should not be used as a CV but as another IV in the model.

#### Sample Size

The last assumption for ANCOVAs is sample size. There have to be at least 20 cases for every IV or CV.


#### Violation of Homogeneity of Variance   
If you failed the assumption of homogeneity of variance, there is a quick and easy change you can make to the base model. Instead of running the anova() function, you can run the **Anova()** function from the ```car``` package. With ```white.adjust=TRUE``` it corrects for a violation of homogeneity of variance.

Here is what using the big A Anova() function looks like:
```{r}
# ANCOVA is the fitted ANCOVA model object
# car::Anova() offers type II and type III tests (type II is the default)
Anova(ANCOVA, type="II", white.adjust=TRUE)
```

#### Post Hocs


For an ANCOVA, you will still run a post hoc with the Tukey's correction, but you will need to do so using functions from the ```multcomp``` package instead because you now need to handle the covariate and interaction effects.  You will do this using the ```glht()``` function: 

```{r}
postHocs <- glht(ANCOVA,linfct=mcp(University.Rating = "Tukey"))
summary(postHocs)
```

The independent variable goes in the second set of parentheses, before the equals sign.  ```linfct=mcp``` is standard code that you will use routinely.

#### Determine Means and Draw Conclusions 

Because a covariate is included in the model, it is important to look at **adjusted** means, rather than the raw means.  The means are adjusted by controlling for the covariate.   

```{r}
adjMeans <- effect("University.Rating", ANCOVA)
adjMeans
```
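
What "adjusting" a mean means can be shown with the classic textbook formula: each group's raw DV mean is shifted along the pooled within-group regression slope to the point where the covariate equals its overall mean. The sketch below assumes a single numeric covariate and made-up data with two hypothetical rating groups; it illustrates the idea and is not how the `effect()` function is implemented.

```python
def adjusted_means(groups):
    """groups: {name: [(covariate, dv), ...]} -> {name: covariate-adjusted DV mean}."""
    all_x = [x for rows in groups.values() for x, _ in rows]
    grand_x = sum(all_x) / len(all_x)

    sxy = sxx = 0.0
    raw = {}
    for name, rows in groups.items():
        mean_x = sum(x for x, _ in rows) / len(rows)
        mean_y = sum(y for _, y in rows) / len(rows)
        raw[name] = (mean_x, mean_y)
        sxy += sum((x - mean_x) * (y - mean_y) for x, y in rows)
        sxx += sum((x - mean_x) ** 2 for x, _ in rows)

    slope = sxy / sxx   # pooled within-group slope of DV on covariate
    return {name: mean_y - slope * (mean_x - grand_x)
            for name, (mean_x, mean_y) in raw.items()}

# Hypothetical data: (covariate, chance of admission) per university rating
data = {
    "rating_1": [(1, 0.50), (2, 0.55), (3, 0.60)],
    "rating_2": [(3, 0.70), (4, 0.75), (5, 0.80)],
}
adj = adjusted_means(data)
print({name: round(value, 3) for name, value in adj.items()})
```

In this made-up example the raw group means are 0.55 and 0.75, but part of that gap comes from the second group having higher covariate values. After adjustment the gap shrinks, which is exactly why conclusions should be drawn from adjusted means.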


## Lesson 8 
### MANOVA

#### Assumptions for MANOVA:
*  Sample size
*  Multivariate Normality
*  Homogeneity of Variance
*  Absence of Multicollinearity
*  Independence

##### Sample Size

There must be more cases than dependent variables in every cell.  In addition, there must be at least 20 cases per independent variable, as per ANOVAs.

---

##### Multivariate Normality

The dependent variables need to be normally distributed when they are lumped all together in one uber-variable that will be used for the MANOVA.

---

##### Homogeneity of Variance 

Like ANOVAs, the variables being used must have relatively equal variance. 

---

##### Absence of Multicollinearity

*Multicollinearity* is when there is a significant relationship between the dependent variables in the model.  It is to be avoided, since having a lot of overlap between DVs can again increase the chances of Type I error; finding a significant relationship between the IV and the DV when one really isn't there. Testing for multicollinearity just requires a correlation matrix, although there are specific statistics designed to test for it as well.

---

##### Independence

The assumption of independence is the same for ANOVAs as it is for MANOVAs. In a nutshell, the different levels of the independent variable should NOT be related to each other! This isn't something typically tested for, but rather assessed by using the user's noggin as they think about the data they are about to analyze.  If there's a chance that a participant or an object will fit into more than one level of the IV, then chances are there is a violation of the assumption of independence and a MANOVA should not be run!

---

Example Question:   

```text
Does the country the project originated in influence the number of backers and the amount of money pledged?
```

#### Data Wrangling

Although no data wrangling is actually required for the MANOVA itself, some wrangling is required to test for assumptions. In order to test for multivariate normality, you will need to create a dataset containing **only the two dependent variables** that is in a **matrix** format, and ensure that they are numeric. Unfortunately, the *test for normality can only handle 5,000 records*, so you will also need to limit your data to 5,000 rows as well.

---

##### Ensure Variables are Numeric

Check the structure of the data to see what format your dependent variables are in.

```{r}
str(kickstarter$pledged)
str(kickstarter$backers)
```

#### Subsetting 

Next, keep only the two dependent variables, e.g. ```pledged``` and ```backers```.

```{r}
keeps <- c("pledged", "backers")
kickstarter1 <- kickstarter[keeps]
```

Then limit the number of rows: 

```{r}
kickstarter2 <- kickstarter1[1:5000,]
```

---

#### Format as a Matrix

Lastly, format the data as a matrix: 

```{r}
kickstarter3 <- as.matrix(kickstarter2)
```

### Test Assumptions
#### Multivariate Normality

To test for multivariate normality, use the wrangled dataset, ```kickstarter3```, in the Shapiro-Wilk test. Use the function ```mshapiro.test()``` from the ```mvnormtest``` library: 

```{r}
mshapiro.test(t(kickstarter3))
```

#### Homogeneity of Variance

Use Levene's Test from the ```car``` library to test for homogeneity of variance on **both** of your dependent variables: 

```{r}
leveneTest(pledged ~ country, data=kickstarter)
leveneTest(backers ~ country, data=kickstarter)
```

#### Absence of Multicollinearity

Typically, multicollinearity can be assessed simply by running correlations of the dependent variables with each other. A general rule of thumb is that anything above approximately .7 for correlation (i.e. a strong correlation) indicates the presence of multicollinearity.  Check out the correlation between ```pledged``` and ```backers``` with a simple ```cor.test()``` function: 

```{r}
# cor.test() drops incomplete pairs itself; use= is an argument of cor(), not cor.test()
cor.test(kickstarter$pledged, kickstarter$backers, method="pearson")
```
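
The multicollinearity check itself is just a Pearson correlation compared against the rule-of-thumb threshold of .7. Here is a small Python sketch of that check; the `pledged` and `backers` values are invented, and `None` stands in for missing values so that only complete pairs are used.

```python
import math

def pearson_r(x, y):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical project data; None marks a missing value
pledged = [100, 250, None, 900, 1200, 40]
backers = [3, 8, 5, 25, 31, None]

r = pearson_r(pledged, backers)
multicollinear = abs(r) > 0.7   # rule of thumb from the text above
print(f"r = {r:.3f}, multicollinearity suspected: {multicollinear}")
```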
