# Lab 3
---------------------------------------------------------------------
Author: Kevin Paganini    
Assignment: Lab 3     
Description: In this lab we review statistical tests. We look at chi-squared, Pearson correlation, and Kruskal-Wallis tests and apply them to the Sacramento real estate data set.    
Date: 1/3/2023   

In [119]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
import sys





pd.set_option('display.max_rows', 250)
sacramento_real_estate_path = os.path.join('Data', 'csd.csv')


## Part 1: Review of statistical tests

A. People who play video games have similar GPAs to people who don't play video games.    

B. A two-sample t-test can be used to determine whether two means are significantly different. Here we compare the mean GPAs of two independent groups, so the two-sample t-test is appropriate.   

C. Null Hypothesis: There is no significant difference between the means of the gamer population vs. non-gamer population.    
Alternative Hypothesis: There is a significant difference between the means of the two populations.     

D. Use the ttest_ind_from_stats function in Scipy to perform a t-test on your data above 
and report the p-value.  Interpret your p-value using a significance threshold (alpha) of 0.01.  
Are you able to reject the null hypothesis?  Are the differences in GPAs of the two groups statistically significant?


In [120]:
from scipy.stats import ttest_ind_from_stats

ttest_ind_from_stats(mean1=3.4, std1=1.2, nobs1=68, mean2=3.3, std2=1.1, nobs2=32)



Ttest_indResult(statistic=0.39893881176878243, pvalue=0.6908062583072547)

Since the p-value is ~0.69 (> 0.01), we fail to reject the null hypothesis: the difference in GPAs between the two groups is not statistically significant. 
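
As a cross-check, the statistic can be reproduced by hand from the summary statistics. This is a minimal sketch that assumes the equal-variance (pooled) form of the test, which is the default of `ttest_ind_from_stats`; the variable names are illustrative.

```python
import math

from scipy.stats import t as t_dist
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the cell above.
mean1, std1, n1 = 3.4, 1.2, 68
mean2, std2, n2 = 3.3, 1.1, 32

# Pooled variance and standard error of the difference in means.
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / dof
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic and two-sided p-value from the t distribution.
t_manual = (mean1 - mean2) / std_err
p_manual = 2 * t_dist.sf(abs(t_manual), dof)

# Compare with SciPy's result (equal_var=True by default).
result = ttest_ind_from_stats(mean1, std1, n1, mean2, std2, n2)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

The two lines should agree, which confirms that the reported p-value comes from a pooled two-sided test.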

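E. The statistical test is consistent with my hypothesis: no significant difference was found between the GPAs of people who play video games and people who do not. Failing to reject the null hypothesis does not prove that the two means are equal.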

## Part 2: Exploring additional Statistical Tests

Statistical tests: Pearson's correlation, Kruskal-Wallis test, Chi-squared goodness of fit     

### Pearson's correlation: 

a. List the two types of variables for which the test is appropriate. Indicate any assumptions that you would need to be aware of. 

- Both variables should be continuous, measured on an interval or ratio scale, so that there is a clear notion of larger and smaller. A good example would be weight in kg and height in m. The test assumes the relationship between the variables is linear, and the p-value assumes approximately normally distributed data. The test is not appropriate if one or both of the variables are categorical. 
 
b. Write down the general forms of the null and alternative hypotheses (one sentence per 
hypothesis).    

- The general form of the null hypothesis is that there is no linear relationship between the two variables, i.e. H0: r = 0, where r is the Pearson coefficient. The alternative hypothesis can take a couple of forms. It can be two-sided, i.e. H1: r != 0, or it can specify a direction: r > 0 or r < 0.

 
c. In your own words, write what it would mean if the test did and did not indicate 
statistical significance.

- If the test is statistically significant (p-value below alpha), we reject the null hypothesis that r = 0 and conclude there is a linear relationship between the two variables. If it is not significant, there is not enough evidence of a linear relationship. The coefficient itself describes direction and strength: close to 0 means little or no linear relationship, close to 1 means a positive linear relationship (as one variable increases, so does the other), and close to -1 means a negative linear relationship.
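
To make the distinction between the coefficient and its significance concrete, here is a small sketch using `scipy.stats.pearsonr`. The data are synthetic (generated with a fixed seed purely for illustration) and are not part of the lab data set.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data for illustration only (fixed seed for repeatability).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_related = 2.0 * x + rng.normal(size=200)  # linear relationship plus noise
y_unrelated = rng.normal(size=200)          # generated independently of x

alpha = 0.01
r_related, p_related = pearsonr(x, y_related)
r_unrelated, p_unrelated = pearsonr(x, y_unrelated)

# r gives direction and strength; p is what gets compared against alpha.
print(f"related:   r = {r_related:.3f}, p = {p_related:.3g}, "
      f"significant = {p_related < alpha}")
print(f"unrelated: r = {r_unrelated:.3f}, p = {p_unrelated:.3g}, "
      f"significant = {p_unrelated < alpha}")
```

Significance is decided by comparing the p-value with alpha, not by checking whether r falls inside a fixed band around 0.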

### Kruskal-Wallis Test

a. List the two types of variables for which the test is appropriate. Indicate any assumptions that you would need to be aware of. 

- The Kruskal-Wallis test compares an ordinal or continuous variable across the groups defined by a categorical variable. The measured variable cannot itself be categorical (like gender), because its values must be rankable. The test assumes the observations are independent.
 
b. Write down the general forms of the null and alternative hypotheses (one sentence per 
hypothesis).

- The null hypothesis is H0: R1 = R2 = R3 = ... = Rk, where Ri is the average rank of group i; in other words, all groups come from the same distribution. The alternative hypothesis is H1: at least one group's average rank differs from the others. 
 
c. In your own words, write what it would mean if the test did and did not indicate 
statistical significance.

- If the Kruskal-Wallis test is statistically significant, the null hypothesis can be rejected. This means that the average rank of at least one group is significantly different from the others. Conversely, if the test is not significant, the null hypothesis cannot be rejected, and there is no evidence of a difference between the distributions of the groups.
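
A short sketch of the test with `scipy.stats.kruskal` follows. The three groups are synthetic (fixed seed, illustration only); one of them is deliberately shifted so that the null hypothesis is false.

```python
import numpy as np
from scipy.stats import kruskal

# Synthetic groups for illustration only (fixed seed for repeatability).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, size=50)
group_b = rng.normal(loc=0.0, size=50)
group_c = rng.normal(loc=2.0, size=50)  # shifted relative to a and b

# The test ranks all observations together and compares ranks per group.
stat, p = kruskal(group_a, group_b, group_c)

alpha = 0.01
print(f"H = {stat:.2f}, p = {p:.3g}, reject H0 = {p < alpha}")
```

A significant result only says that at least one group differs; it does not say which one.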

### Chi-squared goodness of fit

a. List the two types of variables for which the test is appropriate. Indicate any 
assumptions that you would need to be aware of. 

- The chi-squared goodness of fit test requires one categorical variable and a set of expected frequencies for its categories. It determines whether the observed frequencies in the sample are statistically different from the expected frequencies. The observations should be independent, and a common rule of thumb is that each expected count is at least 5. 
 
b. Write down the general forms of the null and alternative hypotheses (one sentence per 
hypothesis). 

- Null hypothesis: H0: The observed frequencies of the categorical variable do not differ from the expected frequencies.
- Alternative hypothesis: H1: The observed frequencies of the categorical variable differ from the expected frequencies.
 
c. In your own words, write what it would mean if the test did and did not indicate 
statistical significance.

- If the test comes back as statistically significant, then the observed category frequencies differ from the expected values by more than chance alone would explain. 
- If the test comes back as not statistically significant, then there is not enough evidence that the observed frequencies differ from the expected values. 
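
As an illustration, the sketch below runs `scipy.stats.chisquare` on made-up counts for 120 rolls of a six-sided die, where a fair die would give 20 rolls per face. The counts are invented for this example.

```python
from scipy.stats import chisquare

# Expected counts for 120 rolls of a fair six-sided die.
expected = [20, 20, 20, 20, 20, 20]

# Invented observed counts (each list sums to 120, matching the expected total).
observed_fair = [20, 19, 21, 20, 22, 18]
observed_loaded = [18, 22, 16, 14, 12, 38]

alpha = 0.01
stat_fair, p_fair = chisquare(observed_fair, f_exp=expected)
stat_loaded, p_loaded = chisquare(observed_loaded, f_exp=expected)

# The statistic is sum((observed - expected)^2 / expected) over the categories.
print(f"fair:   chi2 = {stat_fair:.2f}, p = {p_fair:.3g}, "
      f"significant = {p_fair < alpha}")
print(f"loaded: chi2 = {stat_loaded:.2f}, p = {p_loaded:.3g}, "
      f"significant = {p_loaded < alpha}")
```

Only the second set of counts departs from the expected frequencies enough to be significant at alpha = 0.01.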

## Part 3: Regression on Price

a. Load the CSV file of the cleaned data set you created in Lab 1

In [121]:
sac_df = pd.read_csv(sacramento_real_estate_path)
sac_df.drop(['Unnamed: 0'], inplace=True, axis=1)
print(sac_df.columns)
sac_df.head()


Index(['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type',
       'sale_date', 'price', 'latitude', 'longitude', 'empty_lot',
       'street_type'],
      dtype='object')


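|   | street           | city       | zip   | state | beds | baths | sq__ft | type        | sale_date                    | price | latitude  | longitude   | empty_lot | street_type |
|---|------------------|------------|-------|-------|------|-------|--------|-------------|------------------------------|-------|-----------|-------------|-----------|-------------|
| 0 | 3526 HIGH ST     | SACRAMENTO | 95838 | CA    | 2    | 1     | 836    | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 | False     | ST          |
| 1 | 51 OMAHA CT      | SACRAMENTO | 95823 | CA    | 3    | 1     | 1167   | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 | False     | CT          |
| 2 | 2796 BRANCH ST   | SACRAMENTO | 95815 | CA    | 2    | 1     | 796    | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 | False     | ST          |
| 3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA    | 2    | 1     | 852    | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 | False     | WAY         |
| 4 | 6001 MCMAHON DR  | SACRAMENTO | 95824 | CA    | 2    | 1     | 797    | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.51947  | -121.435768 | False     | DR          |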


b. For each continuous variable, use the scipy.stats.linregress() to fit a simple (one 
variable) linear regression model, estimate the Pearson's correlation coefficient, and 
the statistical significance (p-value) of the correlation against the price of the property. 
 
slope, intercept, r, p, stderr = stats.linregress(df["price"], df["latitude"]) 
 
In a table, indicate the variable name, p-value, and whether there is a statistically 
significant relationship between that variable and price at a threshold of alpha = 0.01. 

Continuous variables: sq__ft, price, latitude, longitude, beds, baths   


In [122]:
from scipy.stats import linregress

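# Note: price is passed as x and sq__ft as y, so the slope and intercept below
# describe sq__ft as a function of price. r and p are the same in either order.
slope, intercept, r, p, stderr = linregress(sac_df["price"], sac_df["sq__ft"]) 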

print(f'Slope: {slope}')
print(f'Intercept: {intercept}')
print(f'R: {r}')
print(f'p: {p}')
print(f'stderr: {stderr}')


Slope: 0.0020615672685718
Intercept: 833.6344944683731
R: 0.3347796424579801
p: 3.386589667562419e-27
stderr: 0.0001851698514061476


| Column name | p-value | significant |
|-------------|---------|-------------|
| longitude   | ~0      | True        |
| latitude    | 0.21    | False       |
| beds        | ~0      | True        |
| baths       | ~0      | True        |
| sq__ft      | 3.4e-27 | True        |
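
A table like this can be built in one pass by looping over the continuous variables. The following is a minimal sketch, not the code that produced the numbers in this lab: `regress_against_price` is a hypothetical helper, and `demo_df` is a small synthetic stand-in for `sac_df`. Each predictor is passed as x and price as y; the values of r and p do not depend on that order.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress


def regress_against_price(df, columns, target="price", alpha=0.01):
    """Fit one simple linear regression per column and tabulate r and p."""
    rows = []
    for col in columns:
        pair = df[[col, target]].dropna()          # drop rows with missing values
        fit = linregress(pair[col], pair[target])  # predictor as x, price as y
        rows.append({
            "column": col,
            "r": fit.rvalue,
            "p_value": fit.pvalue,
            "significant": bool(fit.pvalue < alpha),  # significant when p < alpha
        })
    return pd.DataFrame(rows).set_index("column")


# Synthetic stand-in for sac_df (illustration only, fixed seed).
rng = np.random.default_rng(2)
sq_ft = rng.uniform(600, 3000, size=300)
demo_df = pd.DataFrame({
    "sq__ft": sq_ft,
    "noise": rng.normal(size=300),
    "price": 100.0 * sq_ft + rng.normal(scale=20000, size=300),
})

table = regress_against_price(demo_df, ["sq__ft", "noise"])
print(table)
```

With the real data, the call would presumably be `regress_against_price(sac_df, ['sq__ft', 'latitude', 'longitude', 'beds', 'baths'])`.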

c.  We can test for association between categorical and continuous variables using a 
Kruskal-Wallis test using the Scipy kruskal() function.  In this example, we want to know 
if the distribution of prices for condos is different from the distribution for other 
property types: 
 
i. Use Pandas masks to select the prices for each type of property 
 
samples_by_group = []    
for value in set(df["type"]):    
    mask = df["type"] == value    
    samples_by_group.append(df["price"][mask])    
 
ii. Perform Kruskal-Wallis test: 
 
stat, p = stats.kruskal(*samples_by_group) 
 
In a table, indicate the variable name, p-value, and whether there is a statistically 
significant relationship between that variable and price at a threshold of alpha = 0.01. 

In [123]:
sac_df.columns

Index(['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type',
       'sale_date', 'price', 'latitude', 'longitude', 'empty_lot',
       'street_type'],
      dtype='object')

In [124]:
from scipy.stats import kruskal

cat_vars = ['city', 'zip', 'type', 'empty_lot', 'street_type', 'beds', 'baths']

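for col in cat_vars:
    # Collect the prices for each level of the grouping variable.
    samples_by_group = []
    for value in set(sac_df[col]):
        mask = sac_df[col] == value
        samples_by_group.append(sac_df["price"][mask])

    # Kruskal-Wallis test across the groups; NaN prices are omitted.
    stat, p = kruskal(*samples_by_group, nan_policy='omit')
    print(f'{col} p: {p}')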

city p: 3.713804860597625e-49
zip p: 2.792740902787263e-65
type p: 3.207382704900952e-07
empty_lot p: 0.04271004986704472
street_type p: nan
beds p: 7.750684335831685e-38
baths p: 9.614316922056859e-51


| column name | p-value  | significant |
|-------------|----------|-------------|
| city        | 3.71e-49 | True        |
| zip         | 2.79e-65 | True        |
| type        | 3.2e-7   | True        |
| empty_lot   | 0.04     | False       |
| street_type | nan      | NA          |
| beds        | 7.7e-38  | True        |
| baths       | 9.6e-51  | True        |


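The results are not in line with what I expected. I thought there would be higher p-values for longitude, sq__ft, beds and baths, but they all received p-values far below 0.01, meaning their relationships with price are statistically significant. The only variables with p-values above the threshold were empty_lot and latitude, so no significant relationship with price was detected for them, which I find peculiar. The street_type test returned nan, so no conclusion can be drawn for it.   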
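
One plausible cause of the `nan` result for street_type is missing values in that column. This is an assumption about the data, not something verified here. When a column contains NaN, `set(...)` includes it as a value, but the mask `column == NaN` selects no rows, so one of the samples is empty. The sketch below uses a hypothetical helper, `kruskal_by_group`, and a tiny made-up DataFrame to show a `groupby`-based alternative that skips rows with a missing group key.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal


def kruskal_by_group(df, group_col, value_col="price"):
    """Kruskal-Wallis test of value_col across the levels of group_col."""
    # groupby drops rows whose group key is NaN (dropna=True by default).
    samples = [
        values.dropna().to_numpy()
        for _, values in df.groupby(group_col)[value_col]
    ]
    samples = [s for s in samples if len(s) > 0]  # guard against empty groups
    return kruskal(*samples)


# Made-up data for illustration; the last row has a missing street_type.
demo_df = pd.DataFrame({
    "street_type": ["ST", "ST", "ST", "CT", "CT", "CT",
                    "WAY", "WAY", "WAY", np.nan],
    "price": [100, 110, 120, 200, 210, 220, 300, 310, 320, 150],
})

# NaN never compares equal to anything, so this mask selects zero rows.
nan_mask = demo_df["street_type"] == np.nan
print(f"rows selected by the NaN mask: {int(nan_mask.sum())}")

stat, p = kruskal_by_group(demo_df, "street_type")
print(f"H = {stat:.2f}, p = {p:.4f}")
```

If missing values are indeed the cause, calling `kruskal_by_group(sac_df, 'street_type')` should return a usable p-value.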

## Part 4: Classification on Property type

a. Run Kruskal-Wallis test for each continuous variable versus the property type. In a 
table, indicate the variable name, p-value, and whether there is a statistically significant 
relationship between that variable and property type at a threshold of alpha = 0.01

In [125]:

cont_var = ['beds', 'baths', 'sq__ft', 'price', 'latitude', 'longitude']

for x in cont_var:
    # Collect the values of x for each property type.
    samples_by_group = []
    for y in set(sac_df['type']):
        mask = sac_df['type'] == y
        samples_by_group.append(sac_df[x][mask])

    stat, p = kruskal(*samples_by_group, nan_policy='omit')
    print(f'{x} p: {p}')

beds p: 9.744917163987025e-20
baths p: 7.395202985621193e-10
sq__ft p: 1.6939194179245937e-12
price p: 3.207382704900952e-07
latitude p: 0.30624396471011495
longitude p: 0.8019445584702664


| column name | p-value | significant |
|-------------|---------|-------------|
| beds        | 9.7e-20 | True        |
| baths       | 7.4e-10 | True        |
| sq__ft      | 1.7e-12 | True        |
| price       | 3.2e-7  | True        |
| latitude    | 0.31    | False       |
| longitude   | 0.80    | False       |

In [126]:
sac_df.columns

Index(['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type',
       'sale_date', 'price', 'latitude', 'longitude', 'empty_lot',
       'street_type'],
      dtype='object')

b. Run a Chi2 test of independence between each categorical variable versus the property type. In a table, indicate the variable name,  p-value, and whether there is a statistically 
significant relationship between that variable and property type at a threshold of alpha = 0.01

In [127]:
from scipy.stats import chi2_contingency

for val in cat_vars:
    comb_counts = pd.crosstab(sac_df[val], sac_df['type'])
    chi2, p, _, _ = chi2_contingency(comb_counts)
    
    print(f'{val} chi2: {chi2}')
    print(f'{val} p: {p}')
    

city chi2: 52.99924802253293
city p: 0.9690599778204446
zip chi2: 203.22283050088117
zip p: 0.00010690898007077516
type chi2: 1968.0
type p: 0.0
empty_lot chi2: 3.6406022453620115
empty_lot p: 0.16197696865046218
street_type chi2: 125.091834430683
street_type p: 3.2273633291568606e-17
beds chi2: 356.4586026562881
beds p: 1.816876113747052e-67
baths chi2: 225.83626587573647
baths p: 6.406801519550199e-43


| column name | p-value | significant |
|-------------|---------|-------------|
| baths       | 6.4e-43 | True        |
| city        | 0.97    | False       |
| zip         | 0.0001  | True        |
| type        | NA      | NA          |
| empty_lot   | 0.16    | False       |
| street_type | 3.2e-17 | True        |
| beds        | 1.8e-67 | True        |
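
The loop and the table can be combined so that the significance column is computed rather than filled in by hand. This is a sketch under assumptions: `chi2_against_target` is a hypothetical helper and `demo_df` is made-up data. It also reports the smallest expected cell count, because the chi-squared approximation becomes unreliable when expected counts are small (a common rule of thumb is at least 5), which may matter for variables with many sparse levels.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_against_target(df, columns, target="type", alpha=0.01):
    """Chi-squared test of independence of each column versus the target."""
    rows = []
    for col in columns:
        if col == target:
            continue  # a variable is trivially dependent on itself
        counts = pd.crosstab(df[col], df[target])
        chi2, p, dof, expected = chi2_contingency(counts)
        rows.append({
            "column": col,
            "chi2": chi2,
            "p_value": p,
            "significant": bool(p < alpha),        # significant when p < alpha
            "min_expected": float(expected.min()),  # rule of thumb: >= 5
        })
    return pd.DataFrame(rows).set_index("column")


# Made-up data: 60 "Residential" rows followed by 40 "Condo" rows.
demo_df = pd.DataFrame({
    "type": ["Residential"] * 60 + ["Condo"] * 40,
    # Mostly "A" for Residential and mostly "B" for Condo.
    "related": ["A"] * 50 + ["B"] * 10 + ["A"] * 5 + ["B"] * 35,
    # Alternates A/B, so it is split evenly within each property type.
    "unrelated": ["A", "B"] * 50,
})

table = chi2_against_target(demo_df, ["related", "unrelated", "type"])
print(table)
```

On the real data, a small `min_expected` for a variable would be a reason to treat its p-value with caution.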

c. How do the results of the statistical tests compare with your analysis of the visualizations of variable pairs from Lab 2?     

City had a high p-value (0.97) in the chi-squared test, meaning no significant association with property type was detected, even though in the visualizations it looked like it would be highly predictive. Empty_lot also had a high p-value (0.16), whereas the visualizations suggested a stronger relationship. Latitude and longitude did not look very predictive in the visualizations, and their high p-values (0.31 and 0.80) agree with that. It makes sense that longitude has a higher p-value than latitude; visually it looked the same way.   