### Mann-Whitney U test
Let's look at another statistical test. We have a dataset of real estate prices in California.

We will use the Mann-Whitney U test to check the null hypothesis that a randomly chosen property in SACRAMENTO is more expensive than a randomly chosen property in ELK GROVE with probability 0.5, i.e. that neither city tends to have higher prices.
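
To make the tested quantity concrete, here is a minimal sketch that estimates the probability that a value from one sample exceeds a value from another by comparing every pair (ties count as one half). The function name `prob_superiority` and the two tiny price lists are hypothetical, chosen only for illustration. This pairwise count is what the U statistic for the first sample measures, so dividing U by the number of pairs gives the same estimate.

```python
def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X == Y) by comparing all pairs."""
    wins = sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )
    return wins / (len(x) * len(y))


# Hypothetical prices, not taken from the dataset
city_a = [300, 500, 700]
city_b = [100, 400, 600]

p_hat = prob_superiority(city_a, city_b)
print(f'Estimated P(A > B) = {p_hat:.3f}')
```

A value close to 0.5 is what the null hypothesis predicts; values far from 0.5 suggest one city tends to be more expensive.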

#### Load the dataset with prices. Check it out:
 - measure the sample size for SACRAMENTO and ELK GROVE
 - measure the mean and median of prices in these towns
 - plot histograms of prices in SACRAMENTO and ELK GROVE separately

In [None]:
# load dataset and check out columns
import pandas as pd

prices_df = pd.read_csv('data/Sacramentorealestatetransactions.csv')
prices_df.head()

Related reading on designing A/B tests in Python: https://towardsdatascience.com/a-growth-marketer-guide-to-designing-a-b-tests-using-python-5c0729d8eacc

In [None]:
from statsmodels.stats.power import TTestIndPower 
# parameters for power analysis (change as needed)
effect = 0.05
alpha = 0.05
power = 0.8
# perform power analysis #
# change to TTestPower() in case of a paired sample t-test
analysis = TTestIndPower()  
result = analysis.solve_power(effect, power=power, nobs1=None, 
                              ratio=1.0, alpha=alpha)
print('Sample Size: %.2f' % result)
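
As a rough cross-check of the power analysis above, the required sample size per group can be approximated with the normal distribution instead of the t distribution. This is only a sketch: the helper name `approx_n_per_group` is hypothetical, and the formula assumes two equal-sized groups and a two-sided test, so it should land slightly below the value reported by `solve_power`.

```python
from statistics import NormalDist


def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided, two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2


n = approx_n_per_group(0.05, alpha=0.05, power=0.8)
print(f'Approximate sample size per group: {n:.2f}')
```

Because the effect size enters squared in the denominator, halving the effect size roughly quadruples the required sample size.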

In [None]:
# measure the sample size for SACRAMENTO and ELK GROVE
prices_df[prices_df['city'].isin(['SACRAMENTO', 'ELK GROVE'])]['city'].value_counts()

In [None]:
# measure the mean and median of prices in these towns
prices_df[prices_df['city']=='SACRAMENTO']['price'].agg(['mean', 'median'])

In [None]:
prices_df[prices_df['city']=='ELK GROVE']['price'].agg(['mean', 'median'])

In [None]:
# plot histograms of prices in SACRAMENTO and ELK GROVE separately
prices_df[prices_df['city']=='SACRAMENTO'][['price']].plot.hist()

In [None]:
prices_df[prices_df['city']=='ELK GROVE'][['price']].plot.hist()

### Apply the Mann-Whitney U test to prices in SACRAMENTO and ELK GROVE. What does the result mean? Do we reject the null hypothesis or not?

In [None]:
# select the price column as a 1-D Series so the test returns scalar values
sacramento = prices_df[prices_df['city']=='SACRAMENTO']['price']
elk_grove = prices_df[prices_df['city']=='ELK GROVE']['price']

from scipy.stats import mannwhitneyu

In [None]:
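# state the alternative explicitly; the default has changed between SciPy versions
stat, p = mannwhitneyu(sacramento, elk_grove, alternative='two-sided')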
print('Statistics=%.3f, p=%.3f' % (stat, p))

In [None]:
p

In [None]:
# interpret
alpha = 0.05
if p > alpha:
    print('Fail to reject H0: no evidence that the distributions differ')
else:
    print('Reject H0: the distributions differ')
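
The choice of alternative hypothesis changes the p-value. The sketch below, which uses synthetic normally distributed samples rather than the real-estate data (the sample sizes, seed, and shift are arbitrary assumptions), compares the two-sided test with the two one-sided variants offered by `mannwhitneyu`.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic samples (assumption): b is shifted upward by one standard deviation
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=1.0, scale=1.0, size=200)

# 'two-sided': the distributions differ in either direction
_, p_two_sided = mannwhitneyu(a, b, alternative='two-sided')
# 'less': values in a tend to be smaller than values in b
_, p_less = mannwhitneyu(a, b, alternative='less')
# 'greater': values in a tend to be larger than values in b
_, p_greater = mannwhitneyu(a, b, alternative='greater')

print(f'two-sided p = {p_two_sided:.3g}')
print(f'less      p = {p_less:.3g}')
print(f'greater   p = {p_greater:.3g}')
```

A one-sided alternative should be chosen before looking at the data; picking the direction afterwards inflates the false positive rate.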

In [None]:
# Now do the same but for SACRAMENTO and RIO LINDA

In [None]:
# select the price column as a 1-D Series so the test returns scalar values
sacramento = prices_df[prices_df['city']=='SACRAMENTO']['price']
riolinda = prices_df[prices_df['city']=='RIO LINDA']['price']

stat, p = mannwhitneyu(sacramento, riolinda, alternative='two-sided')
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Fail to reject H0: no evidence that the distributions differ')
else:
    print('Reject H0: the distributions differ')

### Learn more: https://machinelearningmastery.com/nonparametric-statistical-significance-tests-in-python/

### Welch’s t-test

Welch’s t-test is a parametric univariate test for a significant difference between the means of two unrelated groups. It is an alternative to the independent (Student's) t-test when the assumption of equal variances is violated.

The hypothesis being tested is:

- Null hypothesis (H0): μ1 = μ2, i.e. the mean of group 1 is equal to the mean of group 2
- Alternative hypothesis (HA): μ1 ≠ μ2, i.e. the mean of group 1 is not equal to the mean of group 2

If the p-value is less than the chosen significance level (most commonly 0.05), one can reject the null hypothesis.

More: https://pythonfordatascienceorg.wordpress.com/welch-t-test-python-pandas/

#### Welch’s t-test assumptions

Like every test, this inferential statistic test has assumptions. The assumptions that the data must meet in order for the test results to be valid are:

- The independent variable (IV) is categorical with two levels (groups)
- The dependent variable (DV) is continuous and measured on an interval or ratio scale
- The values in each of the two groups should be approximately normally distributed

If any of these assumptions are violated then another test should be used.

In [None]:
riolinda = prices_df[prices_df['city']=='RIO LINDA']['price']
sacramento = prices_df[prices_df['city']=='SACRAMENTO']['price']

In [None]:
from scipy import stats

In [None]:
stats.shapiro(riolinda)

In [None]:
stats.shapiro(sacramento)
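
To help read the Shapiro-Wilk output, here is a small sketch on synthetic data (a log-normal sample with arbitrary, assumed parameters, not the real prices). The null hypothesis of the test is that the sample comes from a normal distribution, so a small p-value is evidence against normality. Right-skewed data such as prices often fail the test on the raw scale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic right-skewed sample (assumption): log-normal values
skewed = rng.lognormal(mean=12.0, sigma=1.0, size=200)

# Test the raw values and the log-transformed values
w_raw, p_raw = stats.shapiro(skewed)
w_log, p_log = stats.shapiro(np.log(skewed))

print(f'raw values: W = {w_raw:.3f}, p = {p_raw:.3g}')
print(f'log values: W = {w_log:.3f}, p = {p_log:.3g}')

alpha = 0.05
if p_raw < alpha:
    print('Raw values: reject normality')
else:
    print('Raw values: no evidence against normality')
```

When normality is rejected, a rank-based test such as the Mann-Whitney U test is a common alternative to the t-test.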

In [None]:
stats.ttest_ind(riolinda, sacramento, equal_var = False)

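If the p-value is below the significance level (0.05), one can reject the null hypothesis in favor of the alternative; otherwise we fail to reject it. For RIO LINDA and SACRAMENTO the p-value is 0.2719 (see the interpretation at the end), so the null hypothesis is not rejected.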

Another piece of information you will need to report is the degrees of freedom (DoF). Older SciPy releases do not return it from `ttest_ind` (recent releases expose it as the `df` attribute of the result), so below are two functions that compute it. The first calculates the Welch-Satterthwaite DoF and prints it. The second runs Welch’s t-test, calculates the DoF, and prints all the needed information.
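
For reference, the degrees of freedom computed by both functions follow the Welch-Satterthwaite approximation, where $s_1^2, s_2^2$ are the sample variances and $n_1, n_2$ the sample sizes of the two groups (these symbols are introduced here for illustration):

$$
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

The result is usually not an integer, and it is never larger than $n_1 + n_2 - 2$, the DoF of the equal-variance t-test.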

In [None]:
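def welch_dof(x, y):
    """Print and return the Welch-Satterthwaite degrees of freedom."""
    vx, vy = x.var() / x.size, y.var() / y.size  # sample variance / n for each group
    dof = (vx + vy)**2 / (vx**2 / (x.size - 1) + vy**2 / (y.size - 1))
    print(f"Welch-Satterthwaite Degrees of Freedom= {dof:.4f}")
    return dof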

In [None]:
welch_dof(riolinda, sacramento)

In [None]:
print(prices_df[prices_df['city']=='RIO LINDA']['price'].mean())
print(prices_df[prices_df['city']=='RIO LINDA']['price'].std())

In [None]:
# print mean first, then standard deviation, in the same order as for RIO LINDA
print(prices_df[prices_df['city']=='SACRAMENTO']['price'].mean())
print(prices_df[prices_df['city']=='SACRAMENTO']['price'].std())

In [None]:
def welch_ttest(x, y): 
    ## Welch-Satterthwaite Degrees of Freedom ##
    dof = (x.var()/x.size + y.var()/y.size)**2 / ((x.var()/x.size)**2 / \
                                                  (x.size-1) + (y.var()/y.size)**2 / (y.size-1))
   
    t, p = stats.ttest_ind(x, y, equal_var = False)
    
    print("\n",
          f"Welch's t-test= {t:.4f}", "\n",
          f"p-value = {p:.4f}", "\n",
          f"Welch-Satterthwaite Degrees of Freedom= {dof:.4f}")

welch_ttest(riolinda, sacramento)
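
To see where the t statistic and the DoF come from, the same quantities can be computed by hand with only the standard library. This is a sketch on two small hypothetical samples (not the price data); the helper name `welch_t_and_dof` is made up for this example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_and_dof(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Sample variance divided by sample size for each group
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    dof = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, dof


# Hypothetical samples with different sizes and variances
group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 4, 6, 8, 10, 12]

t_stat, dof = welch_t_and_dof(group_1, group_2)
print(f't = {t_stat:.4f}, dof = {dof:.4f}')
```

The sign of t follows the order of the arguments: a negative value means the first group has the lower mean.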

#### Welch’s t-test interpretation

This analysis tested whether the mean price differs between RIO LINDA and SACRAMENTO. RIO LINDA has a lower average price (M = 172,727, SD = 76,711) than SACRAMENTO (M = 197,735, SD = 100,870). The difference in price between the two areas is **not** statistically significant (Welch's t(13.2593) = -1.1464, **p = 0.2719 > 0.05**).

- p = 0.2719 is not below the 0.05 significance level


- Null hypothesis (H0): μ1 = μ2, i.e. the mean of group 1 is equal to the mean of group 2
- Alternative hypothesis (HA): μ1 ≠ μ2, i.e. the mean of group 1 is not equal to the mean of group 2

--> Hence we fail to reject H0
