In [1]:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import env



**Comparison tests compare the mean (or median) of a quantitative variable across different groups (often in a single categorical variable)**

1 group to the population
- 1-Sample T-Test: scipy.stats.ttest_1samp
    
1 group to itself at different points in time
- Paired T-Test: scipy.stats.ttest_rel
- Wilcoxon signed rank test: scipy.stats.wilcoxon

2 groups 
- 2-Sample or Independent T-Test: scipy.stats.ttest_ind
- Mann-Whitney U Test: scipy.stats.mannwhitneyu

3 or more groups to each other 
- One-Way ANOVA: scipy.stats.f_oneway
- Kruskal Wallis H Test: scipy.stats.kruskal

Again, the "groups" often exist in a categorical variable, such as "churn", "job title", "survived", "has_pool"
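As a rough sketch, here is how each of the tests listed above can be called. The data is synthetic and made up purely for illustration (the means, sizes, and the population mean of 100 are all assumptions), so only the call patterns matter, not the numbers.

```python
import numpy as np
from scipy import stats

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(42)
before = rng.normal(loc=100, scale=10, size=50)        # one group, first measurement
after = before + rng.normal(loc=2, scale=5, size=50)   # same subjects, measured later
group_a = rng.normal(loc=100, scale=10, size=40)
group_b = rng.normal(loc=105, scale=10, size=60)
group_c = rng.normal(loc=110, scale=10, size=45)

p_values = {}

# 1 group to the population (population mean assumed to be 100)
_, p_values['ttest_1samp'] = stats.ttest_1samp(before, popmean=100)

# 1 group to itself at different points in time
_, p_values['ttest_rel'] = stats.ttest_rel(before, after)
_, p_values['wilcoxon'] = stats.wilcoxon(before, after)

# 2 independent groups
_, p_values['ttest_ind'] = stats.ttest_ind(group_a, group_b, equal_var=False)
_, p_values['mannwhitneyu'] = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# 3 or more groups
_, p_values['f_oneway'] = stats.f_oneway(group_a, group_b, group_c)
_, p_values['kruskal'] = stats.kruskal(group_a, group_b, group_c)

for name, p in p_values.items():
    print(f'{name}: p = {p:.4f}')
```

The t-tests and ANOVA compare means and assume roughly normal data; the Wilcoxon, Mann-Whitney, and Kruskal-Wallis tests are the rank-based alternatives when that assumption does not hold.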


**Relationship tests test the relationship or dependency between two variables**

Are two quantitative variables linearly correlated? 

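A correlation test tests the relationship between two quantitative variables
- Pearson's Correlation Test (linear relationship): scipy.stats.pearsonr
- Spearman's Correlation Test (monotonic relationship, based on ranks): scipy.stats.spearmanr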


A chi-squared test is used to test the relationship between 2 categorical variables. It determines whether the observed frequencies in a cross-tabulation of the two variables differ significantly from the frequencies we would expect if the variables were independent.
- Chi-Squared Test for Independence: scipy.stats.chi2_contingency
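A minimal sketch of the chi-squared test for independence, assuming the two categorical variables live in a DataFrame. The column names `has_pool` and `churn` are borrowed from the examples above, and the values are randomly generated placeholders rather than real data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Placeholder data: two unrelated yes/no columns, generated at random.
rng = np.random.default_rng(0)
df_cat = pd.DataFrame({
    'has_pool': rng.choice(['yes', 'no'], size=200),
    'churn': rng.choice(['yes', 'no'], size=200),
})

# Observed frequencies: one row per has_pool value, one column per churn value.
observed = pd.crosstab(df_cat.has_pool, df_cat.churn)

# chi2_contingency computes the expected frequencies under independence
# and tests the observed table against them.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(observed)
print(f'chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}')
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default; pass `correction=False` to turn it off.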


The scipy.stats documentation contains really good information about these statistical tests, so don't forget to use it as a resource. 

A one-way ANOVA is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. 

The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all of the groups are equal. It is a non-parametric version of ANOVA. The test works on 2 or more independent samples, which may have different sizes. Note that rejecting the null hypothesis does not indicate which of the groups differ. Post hoc comparisons between groups are required to determine which groups are different.
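Here is one way the post hoc step could look. This sketch runs Kruskal-Wallis and then compares every pair of groups with a Mann-Whitney U test, using a Bonferroni-corrected threshold. That pairing is a simple choice made for this example, not the only option (Dunn's test is a common alternative), and the three groups are synthetic.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Synthetic, skewed groups; 'c' is drawn from a wider distribution on purpose.
rng = np.random.default_rng(1)
groups = {
    'a': rng.exponential(scale=1.0, size=40),
    'b': rng.exponential(scale=1.0, size=50),
    'c': rng.exponential(scale=3.0, size=45),
}

# Omnibus test: is at least one group different?
h_stat, p_value = stats.kruskal(*groups.values())
print(f'Kruskal-Wallis: H = {h_stat:.3f}, p = {p_value:.4f}')

# Post hoc: pairwise Mann-Whitney U tests with a Bonferroni correction.
pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # stricter threshold because we run several tests
posthoc = {}
for g1, g2 in pairs:
    _, p = stats.mannwhitneyu(groups[g1], groups[g2], alternative='two-sided')
    posthoc[(g1, g2)] = p
    print(f'{g1} vs {g2}: p = {p:.4f}, significant at corrected alpha: {p < alpha}')
```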


In [2]:
def get_connection(db, user=env.user, host=env.host, password=env.password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

Is there a significant difference in salary between titles? 

In [3]:
query = """
    select s.salary, t.from_date, datediff(now(), t.from_date)/365 as tenure_years, t.title
    from salaries s
    join titles t using (emp_no)
    where s.to_date = '9999-01-01' and t.to_date = '9999-01-01';
"""

salaries = pd.read_sql(query, get_connection('employees'))
salaries.title.value_counts()

Senior Engineer       85939
Senior Staff          82024
Engineer              30983
Staff                 25526
Technique Leader      12055
Assistant Engineer     3588
Manager                   9
Name: title, dtype: int64

Because there are more than 2 titles to compare, we can use a one-way ANOVA test. 
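A sketch of how that ANOVA could be run on a DataFrame with one row per employee. The helper `anova_by_group` is a hypothetical name introduced here, and the demo frame is synthetic so the cell runs without the database; with the query result above, the call would be `anova_by_group(salaries, 'title', 'salary')`.

```python
import numpy as np
import pandas as pd
from scipy import stats


def anova_by_group(df, group_col, value_col):
    """Run a one-way ANOVA on value_col, split by the categories in group_col.

    Hypothetical helper for illustration; it just feeds one array per
    category into scipy.stats.f_oneway.
    """
    samples = [g[value_col].to_numpy() for _, g in df.groupby(group_col)]
    return stats.f_oneway(*samples)


# Synthetic stand-in for the salaries table (values are made up).
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    'title': np.repeat(['Engineer', 'Staff', 'Manager'], 30),
    'salary': np.concatenate([
        rng.normal(60000, 8000, 30),
        rng.normal(62000, 8000, 30),
        rng.normal(75000, 8000, 30),
    ]),
})

f_stat, p_value = anova_by_group(demo, 'title', 'salary')
print(f'F = {f_stat:.3f}, p = {p_value:.4g}')
```

One thing to keep in mind with the real data: the Manager group has only 9 rows, so it is worth checking the ANOVA assumptions (roughly normal groups with similar variances) before trusting the result, and falling back to `stats.kruskal` if they look doubtful.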

In [None]:
df = pd.read_csv("bmi_activity.csv")

In [None]:
df.head()

In [None]:
# sns.scatterplot(x='steps', y='bmi', data=df, c='city', cmap='Greens')
sa = df[df.city == 'san antonio']
aus = df[df.city == 'austin']
plt.scatter(sa.bmi, sa.steps, c='lightgreen', label='san antonio')
plt.scatter(aus.bmi, aus.steps, c='darkgreen', label='austin')
plt.xlabel('bmi')
plt.ylabel('steps')
plt.legend()
plt.show()

# To Explore

1. Is there a significant difference in the number of steps taken by Austin residents vs. San Antonio residents? Is there anything else you want to conclude from the dataset?  
2. Does a negative correlation exist between the number of steps and BMI for San Antonio residents? Is there anything else you want to conclude from the dataset? 
3. Does a positive correlation exist between the number of steps and BMI for Austin residents? Is there anything else you want to conclude from the dataset? 
4. What do you conclude from the dataset? 


## Is there a significant difference in the number of steps taken by Austin residents vs. San Antonio residents? Is there anything else you want to conclude from the dataset?

In [None]:
sns.boxplot(x='city', y='steps', data=df)

The boxplot suggests a clear difference. But what is our sample size? 
And what are the descriptive statistics? The mean, median, and standard deviation especially. 

In [None]:
df.groupby('city')['steps'].describe().T

41 and 63 subjects. With samples that size, I can run a test to see how confident we can be that the difference is real. 
I have 2 groups and 1 continuous variable. This means I can compare means or medians to see if there is a significant difference. Should I use an independent t-test or a Mann-Whitney U test (aka the Mann-Whitney-Wilcoxon test or Wilcoxon rank-sum test)? That depends on the assumptions, primarily the distribution of the data. 

In [None]:
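# alpha keeps the overlapping histograms readable
plt.hist(sa.steps, color='lightgreen', alpha=0.7, label='san antonio')
plt.hist(aus.steps, color='darkgreen', alpha=0.7, label='austin')
plt.legend()
plt.show()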

These are definitely NOT normally distributed! So we should do a non-parametric test, like the Mann-Whitney U test.
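The eyeball judgment from a histogram can be backed up with a formal normality check. A minimal sketch using the Shapiro-Wilk test (`scipy.stats.shapiro`), run here on two synthetic samples that stand in for step counts; the distributions and their parameters are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins: one skewed sample, one roughly normal sample.
rng = np.random.default_rng(3)
samples = {
    'skewed': rng.exponential(scale=3000, size=60),
    'normal': rng.normal(loc=8000, scale=1500, size=60),
}

# Null hypothesis of Shapiro-Wilk: the sample comes from a normal distribution.
# A small p-value is evidence against normality.
shapiro_p = {}
for name, sample in samples.items():
    stat, p = stats.shapiro(sample)
    shapiro_p[name] = p
    print(f'{name}: W = {stat:.3f}, p = {p:.4g}')
```

With the notebook's data the calls would be `stats.shapiro(sa.steps)` and `stats.shapiro(aus.steps)`. A small p-value supports the choice of a non-parametric test.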

In [None]:
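# alternative is set explicitly because its default has differed between SciPy versions
stats.mannwhitneyu(sa.steps, aus.steps, alternative='two-sided')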

**While we know the t-test isn't appropriate here due to violation of assumptions,** what happens when we use it? 

In [None]:
stats.ttest_ind(sa.steps, aus.steps, equal_var=False)

## Does a negative correlation exist between the number of steps and BMI for SA Residents? Is there anything else you want to conclude from the dataset? 

In [None]:
# sns.scatterplot(x='steps', y='bmi', data=df, c='city', cmap='Greens')
sa = df[df.city == 'san antonio']
plt.scatter(sa.bmi, sa.steps, c='lightgreen')
plt.xlabel('bmi')
plt.ylabel('steps')
plt.show()
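To go from the scatter plot to an answer, a correlation test is needed. The sketch below uses Spearman's rank correlation, on the assumption that steps are not normally distributed, and reports whether the correlation is both negative and significant. `negative_correlation` is a hypothetical helper name, and the demo data is synthetic; with the notebook's data the call would be `negative_correlation(sa.steps, sa.bmi)`.

```python
import numpy as np
from scipy import stats


def negative_correlation(x, y, alpha=0.05):
    """Return (rho, p, is_negative) from Spearman's rank correlation.

    Hypothetical helper: is_negative is True only when rho is below zero
    and the two-sided p-value is below alpha.
    """
    rho, p = stats.spearmanr(x, y)
    return rho, p, bool(rho < 0 and p < alpha)


# Synthetic stand-in where BMI falls as steps rise (values are made up).
rng = np.random.default_rng(11)
steps = rng.uniform(2000, 12000, size=50)
bmi = 35 - 0.001 * steps + rng.normal(scale=1.0, size=50)

rho, p, is_negative = negative_correlation(steps, bmi)
print(f'rho = {rho:.3f}, p = {p:.4g}, negative and significant: {is_negative}')
```

If both variables looked roughly normal and the relationship looked linear, `stats.pearsonr` could be swapped in for `stats.spearmanr` with the same call pattern.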