In [3]:
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
import math

## Note
I had a 102 degree fever while doing this

## Using the inpatient rehabilitation dataset from DA5, is the mean age of women who had a stroke greater than the mean age of men who had a stroke? Use a level of significance of 0.01.

What tricked me for this question was that because the datasets are not the same size, I needed to use the unequal variance t-test

This is a one tailed t-test

H0 is that the mean age of women who had a stroke is not greater than the mean age of men who had a stroke.

H1 is that the mean age of women who had a stroke is greater than the mean age of men who had a stroke.

LOS is 0.01.

DF = 1167

When we look this value up on a t-table, we get around 2.320

if t-computed >= 2.320, we do not reject H0  
else we reject H0


In [4]:
patient_data_stats = pd.Series(dtype=float)

patient_data_cleaned = pd.read_csv("patient_data_cleaned.csv", index_col=0)

# Copied from DA5
# Get the men_stroke df
new_df = pd.DataFrame(dtype=float)
men_stroke = pd.DataFrame(dtype=float)
entry = patient_data_cleaned.loc[patient_data_cleaned["Gender"] == "M"]
new_df = pd.concat([new_df,entry])
entry2 = new_df.loc[new_df["RIC"] == "Stroke"]
men_stroke = pd.concat([men_stroke,entry2])

# Get the women_stroke df
new_df = pd.DataFrame(dtype=float)
women_stroke = pd.DataFrame(dtype=float)
entry = patient_data_cleaned.loc[patient_data_cleaned["Gender"] == "F"]
new_df = pd.concat([new_df,entry])
entry2 = new_df.loc[new_df["RIC"] == "Stroke"]
women_stroke = pd.concat([women_stroke,entry2])


t,p = (stats.ttest_ind(women_stroke["Age"],men_stroke["Age"],equal_var=False))
print(t/2) # divide by 2 to get the one tailed result

1.5068840848862006


Since t-computed is less than 2.320, we do not reject the null hypothesis.

This means that on average, with a level of significance of 0.01, there is no difference between the mean age of men or women who get strokes.

## Using this Piazza dataset, determine if there is a difference in the amount of posts 222 students made compare to 315 students. Use a level of significance of 0.01.
Download the piazza_222_users.json and piazza_315_users.json files from the DAs repo on Github: https://github.com/GonzagaCPSC222/DAs/blob/master/files. These JSON files contain the student activity from my CPSC 222 and CPSC 315 classes in Fall 2020. Note that I have removed identifying information and shuffled the order of the JSON array. The attributes include "days" (number of days online), "posts", "asks", "answers", and "views."

This is an independant t-test with two datasets. Since we do not care if the difference is positive or negative, this is also a 2 tailed test.

LOS = 0.01

df = sum(students in 222) + sum(students in 315) - 2
== 89

using a t-chart, we find that t-critical is somewhere around 2.646

$t=\frac{\overline{X_1} - \overline{X_2}}{\sqrt{s_p^2(\frac{1}{n_1}+\frac{1}{n_2})}}$

In [5]:
piazza_222_users = pd.Series(dtype=float)
piazza_315_users = pd.Series(dtype=float)

piazza_222_users = pd.read_json("piazza_222_users.json")
piazza_315_users = pd.read_json("piazza_315_users.json")

num_222_users = piazza_222_users.count()[0]
num_315_users = piazza_315_users.count()[0]

df = num_222_users + num_315_users - 2

print(df)
num_posts_222 = piazza_222_users["posts"].sum()
num_posts_315 = piazza_315_users["posts"].sum()

pooled_var = ((num_222_users - 1)*(piazza_222_users["posts"].std()**2) + (num_315_users -1 )*(piazza_315_users["posts"].std()**2))/(df)

t = (piazza_222_users["posts"].mean() - piazza_315_users["posts"].mean())/(math.sqrt((1/num_222_users) + (1/num_315_users)))

print(t)

print("The amount of posts in 222 is",num_posts_222)
print("The amount of posts in 315 is",num_posts_315) # wow already there looks like there is alot more 



89
-5.43360973700861
The amount of posts in 222 is 17
The amount of posts in 315 is 116


Since T-computed is >= 2.646 or <= -2.646, we reject the null hypothesis, meaning that on average, CPSC315 students post more on piazza than 222 students.

## Using this IQ1 dataset, is the mean duration for students who took the quiz remotely greater than the mean duration for students who took the quiz in the classroom? Use a level of significance of 0.005.

Download the IQ1_quiz_durations.csv file from the DAs repo on Github: https://github.com/GonzagaCPSC222/DAs/blob/master/files. This CSV file contains all of the IQ1 durations from my CPSC 222 and CPSC 315 classes in Fall 2020. Note that I have removed identifying information and shuffled the order of the values. Each IQ1 duration (expressed as a fraction of an hour) is coupled with whether or not the student was in-person when they took the quiz or not (0 means they took it remotely, 1 means they took it in-person in the classroom).

Null hypothesis: There is no difference between the duration of the students who took the quiz remotely and the duration for the students who took it in the classroom.

Alt hypothesis: The students who took the quiz remotley had a higher mean duration than the students who took the quiz in the classroom.

LOS = 0.005

df = 92 (len(in_person + len(online) -2))

this means that t-critical is around 2.636

reject null-hypothesis if |t-computed| > 2.636

$t=\frac{\overline{X_1} - \overline{X_2}}{\sqrt{s_p^2(\frac{1}{n_1}+\frac{1}{n_2})}}$

In [6]:
IQ1_quiz_durations = pd.Series(dtype=float)

IQ1_quiz_durations = pd.read_csv("IQ1_quiz_durations.csv")

in_person = IQ1_quiz_durations.loc[IQ1_quiz_durations["In-person"] == 1]

online = IQ1_quiz_durations.loc[IQ1_quiz_durations["In-person"] == 0] 

print("df =",len(in_person) + len(online) - 2)

pooled_var = ((len(online) - 1)*(online["Hours Start to Finish"].std()**2) + (len(in_person) -1 )*(in_person["Hours Start to Finish"].std()**2))/(df)

t = (online["Hours Start to Finish"].mean() - in_person["Hours Start to Finish"].mean())/(math.sqrt((1/len(online)) + (1/len(in_person))))

print(t)

df = 92
0.13837344172216143


Since t-computed is <= 2.636, we do not reject the null hypothesis. This means that on average there was no significant difference between the mean time to complete IQ1 for online students vs in person students.

# Note (Please read)!
**I currently have a 102 fever so for the next 2 questions I am actually going to not go into that much detail, if any. I know it's not doing the assignment but you once said that school is a game where you chose your battles. Im choosing to get a worse grade on this (hopefully while still proving my compitance with jupyter notebook and python with the amount of detail I was using before), in order to work on the project check in/other assignments for other classes**

## Using this circuit dataset, is the mean circuit duration for subjects at trial B less than it was at trial A (meaning, did the subjects perform the circuit faster after one week of physical therapy)? Use a level of significance of 0.01.

Download the circuit_trials.csv file from the DAs repo on Github: https://github.com/GonzagaCPSC222/DAs/blob/master/files. This CSV file contains circuit durations (in seconds) for 27 subjects. A circuit consisted of performing several tasks like standing up from a chair, walking, and loading into a vehicle. Each subject completed the circuit at two different points in time, one week apart, producing two trials, A and B. During the week between trials, subjects received therapy services to improve their ability to perform the circuit.

LOS = 0.01

df = 26

t-crit =  2.479 (one tailed)

In [7]:
circuit_trials = pd.Series(dtype=float)

circuit_trials = pd.read_csv("circuit_trials.csv", index_col=0)

circuit_trials_a = circuit_trials.loc[circuit_trials["Trial ID"] == "A"] 

circuit_trials_b = circuit_trials.loc[circuit_trials["Trial ID"] == "B"] 


# thought this was cool too
# circuit_trials_dif = pd.Series(dtype=float)

# circuit_trials_dif = circuit_trials_a["Duration"] - circuit_trials_b["Duration"]

# print(circuit_trials_dif)

t,p = stats.ttest_rel(circuit_trials_a["Duration"],circuit_trials_b["Duration"])

print(t)


3.336688368513952


Since t-computed is greater than t-critical, that means that one average, the subjects performed faster after one week of physical therapy.

## Download the GU_website_daily_vistors_2018-2021.csv from the DAs repo on Github: https://github.com/GonzagaCPSC222/DAs/blob/master/files. This file contains daily number of new or returning users to the GU website (thank you to Lyle in GU IT for sharing this with us!!). Using this dataset, what interesting statistical inferences and conclusions do you find? Write up your approach and findings using data storytelling (e.g. narrative before and after code cells describing your experiment design for reproducibility, data visualization(s), write-up of key findings, etc.).

My prediction is that the better we preformed in march madness, the more visitors the Gonzaga website had.

In [8]:
# GU_website_daily_visitors = pd.Series(dtype=float)
GU_website_daily_visitors = pd.read_csv("GU_website_daily_vistors_2018-2021.csv")

max_day = (GU_website_daily_visitors.iloc[GU_website_daily_visitors["New Visitor"].idxmax()])
print("The day that had the most new visitors was",max_day["Date"])

# Commented out because i already did it once, and it takes a while to export each time
# plt.bar(GU_website_daily_visitors["Date"],GU_website_daily_visitors["New Visitor"])
# plt.bar(GU_website_daily_visitors["Date"],GU_website_daily_visitors["Returning Visitor"],bottom=GU_website_daily_visitors["New Visitor"])
# plt.savefig('Comparison.png',dpi=1000)

The day that had the most new visitors was 2021-04-05


From the results above, we know that the max date of new visitors is 04/05/2021, which happens to be the the day of the baylor x gonzaga mens basketball game (the final game in the march madness tournament)

With this knowledge I can confirm that my prediction is correct and that the most traffic Gonzagas website gets is during march madness season. (Specifically the better we do, the more traffic)