- NAMUGENYI LISA LUSINGA
- S24B38/010
- B30300

# ANOVA test / F-test
- This test is used to determine whether the means of three or more groups are equal or significantly different.

- null hypothesis: the means of all the groups are the same.
- alternative hypothesis: at least one group has a different mean from the rest.

- if the p value is less than 0.05, we reject the null hypothesis, meaning at least one group has a different mean from the others.
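
To make the F-statistic concrete, the sketch below computes it by hand on three small made-up groups (not taken from the diamonds data) and compares the result with `stats.f_oneway`. The F-statistic is the ratio of the between-group mean square to the within-group mean square; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Three small illustrative groups (made-up numbers, not from the diamonds data)
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 10.0]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: distance of each group mean from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = between-group mean square / within-group mean square
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# p-value = upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F={f_manual}, p={p_manual}")
print(f"scipy:  F={f_scipy}, p={p_scipy}")
```

A large F means the group means are far apart relative to the spread inside each group, which is what produces a small p value.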

In [33]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
#loading the dataset
diamonds=pd.read_excel("diamonds_new.xlsx")
diamonds.head()

Unnamed: 0,price,carat,cut,color,clarity,depth,table,x,y,z
0,326,0.23,Ideal,E,SI2,61.5,55.0,3.95,3.98,2.43
1,326,0.21,Premium,E,SI1,59.8,61.0,3.89,3.84,2.31
2,327,0.23,Good,E,VS1,56.9,65.0,4.05,4.07,2.31
3,334,0.29,Premium,I,VS2,62.4,58.0,4.2,4.23,2.63
4,335,0.31,Good,J,SI2,63.3,58.0,4.34,4.35,2.75


In [76]:
# Capping the outliers at the IQR boundaries, since the ANOVA test is parametric and is affected by outliers
# Note: the capped values are stored in cont_data; the tests below still read from diamonds
cont_data = diamonds.select_dtypes(exclude='object').copy()

def remove_outliers(data):
    # Iterating over a DataFrame yields its column names
    for column in data:
        lower_quantile = data[column].quantile(.25)
        upper_quantile = data[column].quantile(.75)

        IQR = upper_quantile - lower_quantile

        upper_boundary = upper_quantile + 1.5 * IQR
        lower_boundary = lower_quantile - 1.5 * IQR

        # Values beyond a boundary are replaced by that boundary (capped), not dropped
        data[column] = np.where(data[column] > upper_boundary, upper_boundary, data[column])
        data[column] = np.where(data[column] < lower_boundary, lower_boundary, data[column])

remove_outliers(cont_data)
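
As a small illustration of the same IQR rule, the sketch below caps a made-up series with pandas' `Series.clip`, which replaces values outside the boundaries with the nearest boundary. The numbers are invented for the example and are not from the diamonds data.

```python
import pandas as pd

# Made-up values with one obvious outlier (100)
values = pd.Series([10, 12, 11, 13, 12, 100], name="example")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# clip() replaces anything outside [lower, upper] with the nearest boundary;
# the number of rows stays the same
capped = values.clip(lower=lower, upper=upper)
print(f"boundaries: [{lower}, {upper}]")
print(capped.tolist())
```

Because the rows are kept and only extreme values are pulled in, this is capping rather than dropping observations.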

In [78]:
# ANOVA test for the prices across different cut categories
cut_groups = [diamonds[diamonds["cut"] == cut]["price"] for cut in diamonds["cut"].unique()]
f_stat, p_value = stats.f_oneway(*cut_groups)

print("ANOVA Test for price across cut categories:")
print(f"F-statistic: {f_stat}, \nP-value: {p_value}")
if p_value<0.05:
    print("We reject the null hypothesis.\nTherefore, the average price across different cut categories is not the same.")
else:
    print("We fail to reject the null hypothesis.")


ANOVA Test for price across cut categories:
F-statistic: 175.7010806869664, 
P-value: 8.23247409367994e-150
We reject the null hypothesis.
Therefore, the average price across different cut categories is not the same.
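
One-way ANOVA also assumes that the groups have roughly equal variances. A possible way to check this is Levene's test (`stats.levene`), whose null hypothesis is that all groups have the same variance. The sketch below uses made-up groups chosen so that one pair has identical spread and the other pair clearly does not.

```python
import numpy as np
from scipy import stats

# Made-up groups: 'narrow' and 'shifted' have the same spread, 'wide' is far more spread out
narrow = np.array([9, 10, 11, 10, 9, 11, 10, 10], dtype=float)
shifted = narrow + 5
wide = np.array([0, 20, 5, 15, 10, 25, -5, 10], dtype=float)

# Null hypothesis of Levene's test: the groups have equal variances
stat_equal, p_equal = stats.levene(narrow, shifted)
stat_unequal, p_unequal = stats.levene(narrow, wide)

print(f"same spread:      statistic={stat_equal}, p={p_equal}")
print(f"different spread: statistic={stat_unequal}, p={p_unequal}")
```

If the p value is below 0.05 for the real groups, the equal-variance assumption is doubtful and a non-parametric test such as Kruskal-Wallis may be a safer choice.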


# CHI-SQUARE TEST
- This is a statistical test for categorical data. It is used to determine whether the categorical variables are independent or related.

- null hypothesis: the two variables are not related
- alternative hypothesis: the two variables are related, i.e. there is a significant relationship between them.

- if the p value is less than 0.05, we reject the null hypothesis and conclude that the two variables are related.
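
The test compares the observed counts with the counts expected if the variables were independent. The sketch below works this out by hand on a small made-up 2x3 table and checks it against `stats.chi2_contingency`. The table values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Made-up 2x3 contingency table of counts
observed = np.array([
    [30, 20, 10],
    [20, 30, 40],
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals @ col_totals / n
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"manual: chi2={chi2_manual}, dof={dof_manual}, p={p_manual}")
print(f"scipy:  chi2={chi2_stat}, dof={dof}, p={p_value}")
```

The table is deliberately larger than 2x2: for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so the hand calculation would not match unless `correction=False` is passed.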

In [39]:
# checking for the categorical data
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53920 entries, 0 to 53919
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   price    53920 non-null  int64  
 1   carat    53920 non-null  float64
 2   cut      53920 non-null  object 
 3   color    53920 non-null  object 
 4   clarity  53920 non-null  object 
 5   depth    53920 non-null  float64
 6   table    53920 non-null  float64
 7   x        53920 non-null  float64
 8   y        53920 non-null  float64
 9   z        53920 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In [71]:
crosstab = pd.crosstab(diamonds["cut"], diamonds["color"])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(crosstab)
print("Chi-Square Test")
print(f"Chi2-statistic: {chi2_stat}, \nP-value: {p_value}\n")
if p_value<0.05:
    print("We reject the null hypothesis.\nThe cut and color of diamonds are significantly related.")
else:
    print("We fail to reject the null hypothesis.\nThere is no evidence that the cut and color of diamonds are related.")


Chi-Square Test
Chi2-statistic: 310.47285053969705, 
P-value: 1.2976089313827201e-51

We reject the null hypothesis.
The cut and color of diamonds are significantly related.
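
With a large sample, a very small p value does not by itself say how strong the relationship is. One common effect-size measure for a contingency table is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The helper `cramers_v` below is a hypothetical name written for this sketch, and the two tables are made up.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramér's V for a contingency table of counts (illustrative helper)."""
    observed = np.asarray(table)
    # correction=False so that the plain chi-square statistic is used for 2x2 tables too
    chi2 = stats.chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# Made-up tables: one perfectly associated, one perfectly independent
perfect = np.array([[10, 0], [0, 10]])
independent = np.array([[10, 20], [20, 40]])

v_perfect = cramers_v(perfect)
v_independent = cramers_v(independent)
print(v_perfect, v_independent)
```

The same helper could be applied to a table built with `pd.crosstab`, for example `cramers_v(crosstab)`.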


# MANN-WHITNEY U TEST
- This is a non-parametric test that is used to compare differences between two independent groups when the dependent variable is not normally distributed. 

- null hypothesis: The two groups have the same distribution.
- alternative hypothesis: The two groups have different distributions

- if the p value is less than 0.05, the two groups differ significantly.
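
The U statistic can be understood as a count of pairs: for every pair made of one value from each group, count how often the value from the first group is larger (ties count as one half). The sketch below does this on two small made-up samples and compares with `stats.mannwhitneyu`. In recent SciPy versions the reported statistic corresponds to the first sample passed in; that is an assumption about the installed version.

```python
import numpy as np
from scipy import stats

# Two small made-up independent samples
x = np.array([1.1, 2.3, 3.8, 4.4, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.5, 8.3, 9.9])

# Compare every x with every y
diff = x[:, None] - y[None, :]
u_x = (diff > 0).sum() + 0.5 * (diff == 0).sum()   # pairs where x is larger
u_y = x.size * y.size - u_x                        # the remaining pairs

u_scipy, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")
print(f"U for x (by counting pairs): {u_x}")
print(f"U for y: {u_y}")
print(f"SciPy: U={u_scipy}, p={p_value}")
```

The two U values always add up to the number of pairs, so a U close to 0 or close to that total indicates that one group tends to have larger values than the other.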

***Comparing the price difference between two specific color categories ('E' and 'J')***

In [47]:
diamonds['color'].unique()

array(['E', 'I', 'J', 'H', 'D', 'F', 'G'], dtype=object)

In [51]:
# Select two color categories
group1 = diamonds[diamonds["color"] == "E"]["price"]
group2 = diamonds[diamonds["color"] == "J"]["price"]

# Perform Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")

print(f"Mann-Whitney U Test for 'price' between E and J colors:")
print(f"U-statistic: {u_stat}, P-value: {p_value}")
if p_value<0.05:
    print("We reject the null hypothesis.\nColors E and J have different distribution.")
else:
    print("We fail to reject the null hypothesis.\nThere is no evidence that colors E and J have different distributions.")


Mann-Whitney U Test for 'price' between E and J colors:
U-statistic: 9219699.5, P-value: 2.880591005168507e-156
We reject the null hypothesis.
Colors E and J have different distribution.


# SHAPIRO-WILK'S TEST
- This is a statistical test that is used to check if data is normally distributed.

- null hypothesis: the data follows a normal distribution.
- alternative hypothesis: the data does not have a normal distribution.

- If p value is less than 0.05, the data is not normally distributed.
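
The SciPy documentation notes that for samples larger than 5000 the W statistic is accurate but the p value may not be. One possible workaround, sketched below on synthetic data, is to run the test on a random subsample of at most 5000 values. The data, the seed, and the subsample size are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, clearly non-normal (right-skewed) data standing in for a large column
skewed = rng.exponential(scale=1.0, size=20_000)

# Draw a random subsample of 5000 values without replacement
subsample = rng.choice(skewed, size=5000, replace=False)

w_stat, p_value = stats.shapiro(subsample)
print(f"Shapiro-Wilk statistic: {w_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("We reject the null hypothesis: the subsample is not normally distributed.")
else:
    print("We fail to reject the null hypothesis.")
```

For a pandas column, `Series.sample(n=5000, random_state=0)` would give a comparable subsample.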

In [52]:
# checking whether depth follows a normal distribution
shapiro_stat, p_value = stats.shapiro(diamonds["depth"])
print("Shapiro-Wilk Test")
print(f"Shapiro-Wilk statistic: {shapiro_stat}, P-value: {p_value}\n")
if p_value<0.05:
    print("We reject the null hypothesis.\nThe data in depth is not normally distributed.")
else:
    print("We fail to reject the null hypothesis.\nThere is no evidence that the data in depth departs from a normal distribution.")


Shapiro-Wilk Test
Shapiro-Wilk statistic: 0.9533876822535035, P-value: 3.450662103332805e-80

We reject the null hypothesis.
The data in depth is not normally distributed.



# WILCOXON SIGNED-RANK TEST
- This is a non-parametric test used to compare two related data samples.

- null hypothesis: the distributions of the two paired samples are the same.
- alternative hypothesis: the distributions are different.

- if p value is less than 0.05, we conclude that the groups follow different distributions.
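
The test works on the differences within each pair, so the two samples must have the same length and be matched row by row (for example, the same items measured twice). The sketch below uses made-up before and after measurements to show the typical setup.

```python
import numpy as np
from scipy import stats

# Made-up paired measurements: the same 10 subjects measured twice
after = np.array([70, 71, 67, 75, 74, 73, 65, 77, 70, 72], dtype=float)
drop = np.array([1.5, 2.0, 0.5, 3.0, 2.5, 1.0, 4.0, 3.5, 4.5, 5.0])
before = after + drop            # every subject was higher before

# The test ranks the absolute paired differences (before - after)
w_stat, p_value = stats.wilcoxon(before, after)
print(f"Wilcoxon statistic: {w_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("We reject the null hypothesis: the paired measurements differ.")
else:
    print("We fail to reject the null hypothesis.")
```

For the default two-sided test, the statistic is the smaller of the rank sums of the positive and negative differences, so it is 0 here because every difference has the same sign.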

In [57]:
# we first check to ensure that there is no missing data.
diamonds.isnull().sum()

price      0
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
dtype: int64

In [59]:
w_stat, p_value = stats.wilcoxon(diamonds["depth"], diamonds["table"])
print("Wilcoxon Signed-Rank Test:")
print(f"Wilcoxon statistic: {w_stat}, P-value: {p_value}\n")
if p_value<0.05:
    print("We reject the null hypothesis.\nThere is a significant difference between depth and table")
else:
    print("We fail to reject the null hypothesis.\nThere is no significant difference between depth and table")


Wilcoxon Signed-Rank Test:
Wilcoxon statistic: 48520868.5, P-value: 0.0

We reject the null hypothesis.
There is a significant difference between depth and table


# KRUSKAL-WALLIS H TEST
- This is a non-parametric alternative to ANOVA that is used when the data is not normally distributed.

- null hypothesis: The distributions of all the groups are the same.
- alternative hypothesis: At least one group has a different distribution from the rest.

- if p value is less than 0.05, then at least one of the groups has a different distribution from the rest.
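
Because the test is computed from ranks rather than raw values, it is unaffected by any transformation that preserves the ordering of the data, and it is far less sensitive to extreme values than ANOVA. The sketch below illustrates this on made-up groups: taking logarithms leaves the H statistic unchanged while the ANOVA F statistic changes.

```python
import numpy as np
from scipy import stats

# Made-up positive-valued groups, each with one extreme value
g1 = np.array([1, 2, 3, 4, 50], dtype=float)
g2 = np.array([5, 6, 7, 8, 90], dtype=float)
g3 = np.array([9, 10, 11, 12, 200], dtype=float)

# Kruskal-Wallis on raw and on log-transformed values (log keeps the ordering)
h_raw, p_raw = stats.kruskal(g1, g2, g3)
h_log, p_log = stats.kruskal(np.log(g1), np.log(g2), np.log(g3))

# One-way ANOVA on the same two versions of the data
f_raw, _ = stats.f_oneway(g1, g2, g3)
f_log, _ = stats.f_oneway(np.log(g1), np.log(g2), np.log(g3))

print(f"Kruskal-Wallis H: raw={h_raw}, log={h_log}")
print(f"ANOVA F:          raw={f_raw}, log={f_log}")
```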

In [61]:
h_stat, p_value = stats.kruskal(*cut_groups)
print("Kruskal-Wallis Test:")
print(f"H-statistic: {h_stat}, P-value: {p_value}\n")
if p_value<0.05:
    print("We reject the null hypothesis.\nThere is at least one cut group that has a different distribution from the rest.")
else:
    print("We fail to reject the null hypothesis.\nThere is no evidence that the cut groups have different distributions.")

Kruskal-Wallis Test:
H-statistic: 978.8119645529175, P-value: 1.393921273660654e-210

We reject the null hypothesis.
There is at least one cut group that has a different distribution from the rest.


# KOLMOGOROV-SMIRNOV TEST
- This test compares two samples to determine whether they come from the same underlying distribution.

- null hypothesis: The two distributions are the same.
- alternative hypothesis: The two distributions are different.

- if p value is less than 0.05, the two variables have significantly different distributions.
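
The two-sample KS statistic is the largest vertical gap between the empirical cumulative distribution functions (ECDFs) of the two samples. The sketch below computes that gap by hand on two small made-up samples and compares it with `stats.ks_2samp`.

```python
import numpy as np
from scipy import stats

# Two small made-up samples
a = np.array([0.2, 0.9, 1.4, 2.1, 3.3, 3.9])
b = np.array([1.0, 2.5, 4.0, 4.8, 5.5, 6.1, 7.0])

# Evaluate both ECDFs at every observed value
pooled = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(np.sort(a), pooled, side="right") / a.size
ecdf_b = np.searchsorted(np.sort(b), pooled, side="right") / b.size

# KS statistic: the largest absolute difference between the two ECDFs
d_manual = np.max(np.abs(ecdf_a - ecdf_b))

ks_stat, p_value = stats.ks_2samp(a, b)
print(f"manual D: {d_manual}")
print(f"SciPy:    D={ks_stat}, p={p_value}")
```

Since the statistic looks at the whole distribution, two variables measured on different scales will differ under this test even if their shapes are similar.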

In [62]:
cancer=pd.read_csv("Cancer_data.csv")
cancer

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


In [69]:
ks_stat, p_value = stats.ks_2samp(cancer["radius_mean"], cancer["texture_mean"])
print("Kolmogorov-Smirnov Test:")
print(f"KS-statistic: {ks_stat}, P-value: {p_value}\n")
if p_value<0.05:
    print("We reject the null hypothesis.\nThe radius mean and texture mean have significantly different distributions.")
else:
    print("We fail to reject the null hypothesis.\nThere is no evidence that the radius mean and texture mean have different distributions.")

Kolmogorov-Smirnov Test:
KS-statistic: 0.5430579964850615, P-value: 2.5903523506209854e-77

We reject the null hypothesis.
The radius mean and texture mean have significantly different distributions.
