In [49]:
"""
Name: Lauren Nguyen
Course: CPSC 222
Assignment: Data Assignment 6
Date: 11/17/22
Description: Program displays different hypothesis testing exercises
             and checking with SciPy
"""
import pandas as pd
import numpy as np
from scipy import stats
import json


### *Independent hypothesis testing equations:*
$\overline{X}$ = $\frac{\sum X}{n}$  
n = number of elements in a set  
s = $\sqrt{\frac{\sum {(x - \overline{x})}^2}{n-1}}$  
Degrees of Freedom = $(n_1 + n_2) - 2$
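
A minimal sketch of the definitions above in plain Python. The function name `sample_mean_and_sd` and the sample values are made up for illustration and do not come from any dataset in this assignment.

```python
import math


def sample_mean_and_sd(values):
    """Return (x-bar, s) using the n - 1 (sample) denominator for s."""
    n = len(values)
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    return x_bar, s


# Hypothetical data, used only to exercise the formulas
group_1 = [2.0, 4.0, 4.0, 6.0]
group_2 = [1.0, 3.0, 5.0]

x_bar, s = sample_mean_and_sd(group_1)
degrees_of_freedom = (len(group_1) + len(group_2)) - 2
print(x_bar, s, degrees_of_freedom)
```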


### *Dependent hypothesis testing equations:*
n = number of elements in a set  
Degrees of Freedom = $n-1$  
$\overline{d}$ **(Mean Difference)** = $\frac{\sum d}{n}$  
$\mu_d$ = the hypothesized mean difference, equals 0  
$S_{\overline{d}}$ = $\frac{S_d}{\sqrt{n}}$

## **One tailed two sample independent hypothesis t-test**
The hypothesis we are testing: is the mean age of women who had a stroke greater than the mean age of men who had a stroke? 

The data we are analyzing come from a csv file containing information on these patients. We use the RIC attribute in this file to select the stroke patients, then split them into men and women for the analysis.

**Setting up our 5 Step process**

1. Identifying the null($H_{0}$) and alternative hypothesis($H_{1}$)
    * $H_{0}$: $\mu_f \le \mu_m$
    * $H_{1}$: $\mu_f > \mu_m$
1. Select Level of significance
    * Our level of significance will be 0.01
1. Select the appropriate test statistic
    * t-computed = $\frac{\overline{X_{1}} - \overline{X_{2}}}{\sqrt{{S_{p}}^2(\frac{1}{n_1} + \frac{1}{n_2})}}$
    * ${S_{p}}^2$ = $\frac{(n_1 - 1){S_1}^2 + (n_2 - 1){S_2}^2}{n_1 + n_2 - 2}$
1. Formulate the decision rule
    * degrees of freedom = 1167
    * t-critical = 2.326
    * If t-computed is > 2.326, reject $H_{0}$
    * If t-computed is $\le$ 2.326, do not reject $H_{0}$
1. Make a decision
    * t-computed: 2.90
    * Since t-computed is > 2.326, reject $H_{0}$

**End Result:** At a 0.01 level of significance, we can conclude that the average age of female stroke patients is greater than the average age of male stroke patients.
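
As a rough check of step 3, the pooled-variance formula can be evaluated by hand from the summary statistics used in this exercise (means 72.705882 and 70.331126, standard deviations 14.566613 and 13.396726, sample sizes 562 and 607). The helper name `pooled_t` is an illustrative choice, not part of SciPy.

```python
import math


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Pooled-variance t statistic for two independent samples."""
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


# Female stroke patients first, male stroke patients second
t_by_hand = pooled_t(72.705882, 14.566613, 562, 70.331126, 13.396726, 607)
print(round(t_by_hand, 2))
```

The result should agree with the value reported by `stats.ttest_ind_from_stats` when `equal_var=True`, since both use the same pooled formula.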

In [50]:
Xbar_f = 72.705882
Xbar_m = 70.331126
s_f = 14.566613
s_m = 13.396726
n_f = 562
n_m = 607

t_computed, pval = stats.ttest_ind_from_stats(Xbar_f, s_f, n_f, Xbar_m, s_m, n_m, equal_var=True)
pval /= 2
print(t_computed, pval)


2.9035946868134674 0.0018792474513853405


## **Two tailed two sample independent hypothesis test**
Hypothesis: determine if there is a difference in the number of days 222 students are active on Ed (e.g. "Days Active") compared to 322 students

The data we are analyzing for this test come from two JSON files containing information from CPSC222 students and CPSC322 students. The files record attributes including "Views", "Questions", and "Days Active" on the ED puzzle platform. For this hypothesis we focus on "Days Active" and compare the mean of the two courses to see if they are equal.

**Setting up our 5 Step process**

1. Identifying the null($H_{0}$) and alternative hypothesis($H_{1}$)
    * $H_{0}$: $\mu_{222}$ = $\mu_{322}$
    * $H_{1}$: $\mu_{222}$ $\neq$ $\mu_{322}$
1. Select Level of significance
    * Our level of significance will be 0.01
1. Select the appropriate test statistic
    * t-computed = $\frac{\overline{X_{1}} - \overline{X_{2}}}{\sqrt{{S_{p}}^2(\frac{1}{n_1} + \frac{1}{n_2})}}$
    * ${S_{p}}^2$ = $\frac{(n_1 - 1){S_1}^2 + (n_2 - 1){S_2}^2}{n_1 + n_2 - 2}$
1. Formulate the decision rule
    * degrees of freedom = 92
    * t-critical = 2.630 (two-tailed, so $\alpha/2 = 0.005$ in each tail)
    * If -2.630 $\le$ t-computed $\le$ 2.630, do not reject $H_{0}$
    * If t-computed is < -2.630 or t-computed is > 2.630, reject $H_{0}$
1. Make a decision
    * t-computed: -5.49
    * Since t-computed is less than -2.630, reject $H_{0}$

**End Result:** At the 0.01 significance level, we can conclude that there is a difference in the days active in CPSC222 and the days active in CPSC322.
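
Critical values read from a t-table can be cross-checked with `scipy.stats.t.ppf`, the inverse CDF of the t distribution. The sketch below assumes a hypothetical helper named `t_critical`; for a two-tailed test it splits the significance level evenly between the two tails.

```python
from scipy import stats


def t_critical(alpha, df, two_tailed):
    """Positive critical t value for the given significance level and df."""
    # A two-tailed test puts alpha / 2 in each tail
    tail_area = alpha / 2 if two_tailed else alpha
    return stats.t.ppf(1 - tail_area, df)


two_tailed_crit = t_critical(0.01, 92, two_tailed=True)
one_tailed_crit = t_critical(0.01, 92, two_tailed=False)
print(round(two_tailed_crit, 3), round(one_tailed_crit, 3))
```

The two-tailed value is always larger than the one-tailed value at the same significance level, so it matters which row and column of the table is used.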

In [51]:
# opening 222 data (the with-block closes the file automatically)
with open("C:/Users/dzuy/Desktop/CPSC222/DAs/da_6/ed_222.json", "r") as infile:
    cpsc_222 = json.load(infile)
days_active_222 = cpsc_222["Days Active"]
# opening 322 data
with open("C:/Users/dzuy/Desktop/CPSC222/DAs/da_6/ed_322.json", "r") as infile:
    cpsc_322 = json.load(infile)
days_active_322 = cpsc_322["Days Active"]

# finding df
n1 = len(days_active_222)
n2 = len(days_active_322)
df = (n1+n2)-2
print(df)

# turning dictionaries into series
ser_222 = pd.Series(days_active_222)
ser_322 = pd.Series(days_active_322)

t_computed, pval = stats.ttest_ind(ser_222, ser_322)
print(t_computed, pval)


92
-5.487771363199516 3.5684494006989487e-07


## **One tailed two sample independent hypothesis test** 
Hypothesis: Is the mean duration for students who took the quiz remotely greater than the mean duration for students who took the quiz in the classroom?

This data comes from two class sections: CPSC222 and CPSC315. The csv file contains how long each student in both sections took to complete a quiz. It also records whether the student completed the quiz in person or online, with 0 meaning online and 1 meaning in person.

**Setting up our 5 Step process**

1. Identifying the null($H_{0}$) and alternative hypothesis($H_{1}$)
    * $H_{0}$: $\mu_0$ $\le$ $\mu_1$
    * $H_{1}$: $\mu_0 > \mu_1$
1. Select Level of significance
    * Our level of significance will be 0.005
1. Select the appropriate test statistic
    * t-computed = $\frac{\overline{X_{1}} - \overline{X_{2}}}{\sqrt{{S_{p}}^2(\frac{1}{n_1} + \frac{1}{n_2})}}$
    * ${S_{p}}^2$ = $\frac{(n_1 - 1){S_1}^2 + (n_2 - 1){S_2}^2}{n_1 + n_2 - 2}$
1. Formulate the decision rule
    * degrees of freedom = 92
    * t-critical = 2.630
    * If t-computed is > 2.630, reject $H_{0}$
    * If t-computed is $\le$ 2.630, do not reject $H_{0}$
1. Make a decision
    * t-computed: 4.13
    * Since t-computed > 2.630, we reject $H_{0}$

**End Result:** At a significance level of 0.005, we can conclude that students online took longer on the quiz compared to those who took it in person.

In [52]:
iq1_df = pd.read_csv("IQ1_quiz_durations.csv", index_col=0)

online_df = iq1_df[iq1_df.index == 0]
inperson_df = iq1_df[iq1_df.index == 1]

ser_online = pd.Series(online_df.loc[:,"Hours Start to Finish"])
ser_inperson = pd.Series(inperson_df.loc[:,"Hours Start to Finish"])

n1 = len(online_df)
n2 = len(inperson_df)
df = (n1+n2)-2
print(df)

t_computed, pval = stats.ttest_ind(ser_online, ser_inperson)
pval = pval/2
print(t_computed, pval)

92
4.12720417112991 4.029306042293943e-05


## **One tailed two sample dependent hypothesis test**
Hypothesis: is the mean circuit duration for subjects at trial B less than it was at trial A (meaning, did the subjects perform the circuit faster after one week of physical therapy)?

This data comes from a csv file containing the durations of 27 subjects completing a circuit. Each subject performed the circuit twice, trial A and trial B, with one week separating the two. During that week the subjects received physical therapy to improve their ability to complete the circuit.

**Setting up our 5 Step process**

1. Identifying the null($H_{0}$) and alternative hypothesis($H_{1}$)
    * $H_{0}$: $\mu_B \ge \mu_A$
    * $H_{1}$: $\mu_B < \mu_A$
1. Select Level of significance
    * Our level of significance will be 0.01
1. Select the appropriate test statistic
    * t-computed = $\frac{\overline{d} - \mu_d}{S_{\overline{d}}}$
1. Formulate the decision rule
    * degrees of freedom = 26
    * t-critical = -2.479 (left-tailed test, since $H_{1}$ is $\mu_B < \mu_A$)
    * If t-computed is < -2.479, reject $H_{0}$
    * If t-computed is $\ge$ -2.479, do not reject $H_{0}$
1. Make a decision
    * t-computed: -3.34
    * Since t-computed is less than -2.479, we reject $H_{0}$

**End Result:** At a significance level of 0.01, we can conclude that the mean duration of trial B is less than the mean duration of trial A, meaning the subjects completed the circuit faster after one week of physical therapy.
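
The dependent-test formula can be sketched by hand as follows. The four paired durations are made-up values used only to show the arithmetic (they are not taken from `circuit_trials.csv`), and `paired_t` is an illustrative name.

```python
import math
import statistics


def paired_t(after, before, mu_d=0.0):
    """t statistic for paired samples, using d = after - before."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    d_bar = sum(d) / n
    s_d = statistics.stdev(d)        # sample standard deviation (n - 1)
    s_d_bar = s_d / math.sqrt(n)     # standard error of the mean difference
    return (d_bar - mu_d) / s_d_bar


# Hypothetical durations for four subjects
trial_a = [10.0, 12.0, 9.0, 11.0]
trial_b = [8.0, 11.0, 9.0, 8.0]

t_paired = paired_t(trial_b, trial_a)
print(round(t_paired, 3))
```

A negative statistic here means the trial B durations are smaller on average, which is the direction $H_{1}$ looks for in a left-tailed test.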

In [53]:
circuit_trials_df = pd.read_csv("circuit_trials.csv", index_col=0)
a_trial = circuit_trials_df.loc[circuit_trials_df["Trial ID"]=="A"]
b_trial = circuit_trials_df.loc[circuit_trials_df["Trial ID"]=="B"]
a_duration = pd.Series(a_trial.loc[:,"Duration"])
b_duration = pd.Series(b_trial.loc[:,"Duration"])

# Calculating df
n = len(b_duration)
df = n-1
print(df)

t_computed, pval = stats.ttest_rel(b_duration, a_duration)
pval /= 2
print(t_computed, pval)


26
-3.336688368513952 0.0012809826011843611


## **One Tailed Two Sample Independent Hypothesis Test**
Hypothesis: Is the returning visitor mean greater than the new visitor mean in the Gonzaga daily website visitor dataset?

This dataset comes from a GU website that tracks the daily number of new or returning visitors to the Gonzaga University website. It ranges from 2018-2022. 

**Setting up our 5 Step process**

1. Identifying the null($H_{0}$) and alternative hypothesis($H_{1}$)
    * $H_{0}$: $\mu_R \le \mu_N$
    * $H_{1}$: $\mu_R > \mu_N$
1. Select Level of significance
    * Our level of significance will be 0.05
1. Select the appropriate test statistic
    * t-computed = $\frac{\overline{X_{1}} - \overline{X_{2}}}{\sqrt{{S_{p}}^2(\frac{1}{n_1} + \frac{1}{n_2})}}$
    * ${S_{p}}^2$ = $\frac{(n_1 - 1){S_1}^2 + (n_2 - 1){S_2}^2}{n_1 + n_2 - 2}$
1. Formulate the decision rule
    * degrees of freedom = 2920
    * t-critical = 1.645
    * If t-computed is > 1.645, reject $H_{0}$
    * If t-computed is $\le$ 1.645, do not reject $H_{0}$
1. Make a decision
    * t-computed: -0.17
    * Since t-computed is less than 1.645, do not reject $H_{0}$

**End Result:** At a significance level of 0.05, we cannot conclude that the mean number of returning visitors is greater than the mean number of new visitors to the Gonzaga University website.
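
When checking a one-tailed test with a two-sided p-value, halving the p-value is only valid if the sign of the t statistic points in the direction of $H_{1}$. The sketch below uses a hypothetical helper, `one_sided_pvalue`, to handle both cases. The inputs are the rounded t statistic from this exercise and a two-sided p-value of about 0.8645 (twice the halved value printed by the code cell).

```python
def one_sided_pvalue(t_stat, two_sided_p, alternative):
    """Convert a two-sided t-test p-value into a one-sided p-value.

    alternative is "greater" (H1: mu_1 > mu_2) or "less" (H1: mu_1 < mu_2).
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    if alternative == "greater":
        favors_h1 = t_stat > 0
    else:
        favors_h1 = t_stat < 0
    # Half the two-sided p-value lies in each tail
    return two_sided_p / 2 if favors_h1 else 1 - two_sided_p / 2


p_greater = one_sided_pvalue(-0.1707, 0.8645, "greater")
print(round(p_greater, 3))
```

Recent SciPy releases also accept an `alternative` argument on `stats.ttest_ind` and `stats.ttest_rel`, which returns the one-sided p-value directly. Whether that argument is available depends on the installed SciPy version.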

In [54]:
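# named visitors_df so the DataFrame is not confused with df (degrees of freedom) below
visitors_df = pd.read_csv("GU_website_daily_visitors_2018-2022.csv")

return_df = visitors_df.loc[:,"Returning Visitor"]
new_df = visitors_df.loc[:,"New Visitor"]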

return_ser = pd.Series(return_df)
new_ser = pd.Series(new_df)


n1 = len(return_df)
n2 = len(new_df)
df = (n1 + n2) -2
print(df)

t_computed, pval = stats.ttest_ind(return_ser, new_ser)
pval /= 2  # halving gives the one-sided p-value only when the sign of t matches H1
print(t_computed, pval)


2920
-0.17068777111586536 0.4322405523174508
