In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [2]:
df = pd.read_excel('cleandata2911.xlsx', index_col=0)

I want to pick one of the filtered datasets (z-score, IQR, etc.) and then run these association tests on the numeric data.

In [7]:
df['avgscore_binned'] = pd.cut(df['avgscore'], bins=10)

In [8]:
from scipy.stats import chi2_contingency

In [9]:
# Create a contingency table
contingency_table = pd.crosstab(df['ASBH02A'], df['avgscore_binned'])

In [10]:
contingency_table.head()

ASBH02A / avgscore_binned,"(83.869, 151.789]","(151.789, 219.036]","(219.036, 286.283]","(286.283, 353.53]","(353.53, 420.777]","(420.777, 488.024]","(488.024, 555.271]","(555.271, 622.517]","(622.517, 689.764]","(689.764, 757.011]"
No,3,55,182,358,540,718,629,331,70,4
Yes,135,855,2163,3395,5255,7834,9656,7313,1783,86


In [11]:
def cramers_v(table):
    """Cramér's V for a contingency table (a DataFrame from pd.crosstab)."""
    chi2 = chi2_contingency(table)[0]  # keep only the test statistic
    n = table.to_numpy().sum()         # total number of observations
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Reuse the contingency table built above (it is not a confusion matrix)
cramers_v_value = cramers_v(contingency_table)
print(f"Cramér's V: {cramers_v_value}")

Cramér's V: 0.07709123028408166



Cramér's V is a statistical measure used to assess the strength of association between two nominal (categorical) variables. It is derived from the Chi-square statistic and ranges from 0 to 1, where:

0 indicates no association between the variables.
1 indicates a perfect association between the variables.
The value obtained here, 0.077, is close to 0, indicating a very weak association between the two variables.

A common rule of thumb for interpreting Cramér's V (the exact thresholds vary by field):

0 to 0.1: Little or no association\
0.1 to 0.3: Weak association\
0.3 to 0.5: Moderate association\
Above 0.5: Strong association\

Since your Cramér's V is around 0.077, it suggests that the relationship between the two categorical variables in your analysis is negligible or very weak.
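
As a quick check, the value can be reproduced directly from the counts printed in the contingency table above. This is a minimal sketch that assumes the two rows shown by `head()` are the whole table; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the contingency table printed above (rows: No, Yes).
observed = np.array([
    [3, 55, 182, 358, 540, 718, 629, 331, 70, 4],
    [135, 855, 2163, 3395, 5255, 7834, 9656, 7313, 1783, 86],
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
n = observed.sum()
r, k = observed.shape

# With two rows, min(r, k) - 1 = 1, so V reduces to sqrt(chi2 / n).
v = np.sqrt(chi2_stat / (n * (min(r, k) - 1)))

print(f"n = {n}, dof = {dof}, V = {v:.4f}")
```

With only two rows the formula reduces to the square root of the statistic divided by the sample size, which is why a large statistic can still give a small V when there are tens of thousands of rows.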

In [12]:
# Perform Chi-Square Test of Independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

In [13]:
print(f"Chi-Square Test Statistic: {chi2}")
print(f"P-value: {p}")

Chi-Square Test Statistic: 245.83458534739597
P-value: 7.553360938216542e-48


1. Chi-Square Test Statistic: 245.83458534739597
The Chi-Square test statistic quantifies how much the observed frequencies in your data deviate from the expected frequencies under the assumption that the two categorical variables are independent.
A higher Chi-Square value indicates a greater difference between the observed and expected frequencies. Because the statistic also grows with sample size, it measures the strength of evidence against independence rather than the strength of the association itself.
In your case, the Chi-Square statistic is quite large, indicating a noticeable deviation from the expected distribution if the variables were independent.
2. P-value: 7.553360938216542e-48
The P-value represents the probability of observing a Chi-Square statistic as extreme as, or more extreme than, the one calculated, under the null hypothesis (which states that the two variables are independent).
A very small P-value (like the one here, about 7.55e-48, which is effectively 0) indicates that a deviation this large would be extremely unlikely if the two variables were truly independent.
Common thresholds for significance are 0.05, 0.01, and 0.001. Your P-value is far smaller than any of these, meaning you can reject the null hypothesis with extremely high confidence, suggesting a statistically significant association between the two variables.
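
The P-value is the upper-tail probability of a Chi-square distribution at the observed statistic. A short sketch of that calculation follows; it assumes a 2 x 10 table and therefore 9 degrees of freedom, which the unfiltered run above did not print.

```python
from scipy.stats import chi2

chi2_stat = 245.83458534739597  # statistic reported above
dof = (2 - 1) * (10 - 1)        # assumed 2 x 10 table, so 9 degrees of freedom

# Survival function: P(X >= chi2_stat) for X ~ chi-square(dof)
p_value = chi2.sf(chi2_stat, dof)
print(f"p = {p_value:.3e}")
```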
Summary
Chi-Square Test Statistic (245.83): Indicates a substantial deviation from what would be expected under independence.
P-value (~0): Shows that this deviation is extremely unlikely to be due to random chance, indicating a statistically significant association.
However, the earlier Cramér's V value of 0.077 suggests that while the association is statistically significant, it is not strong. This can happen when the sample size is large, leading to statistically significant results even for weak associations.
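
To illustrate how sample size drives significance, the sketch below scales a hypothetical 2 x 3 table by a factor of 100. The counts are made up for illustration; the point is that the Chi-square statistic grows with the number of observations while Cramér's V stays the same.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_and_v(table):
    """Return (chi-square statistic, Cramér's V) for a 2-D array of counts."""
    table = np.asarray(table)
    stat = chi2_contingency(table)[0]
    n = table.sum()
    v = np.sqrt(stat / (n * (min(table.shape) - 1)))
    return stat, v

# Hypothetical table with a weak association (not taken from the dataset).
small = np.array([[48, 50, 52],
                  [52, 50, 48]])
large = small * 100  # same proportions, 100 times as many observations

stat_small, v_small = chi2_and_v(small)
stat_large, v_large = chi2_and_v(large)
print(f"small: chi2 = {stat_small:.2f}, V = {v_small:.3f}")
print(f"large: chi2 = {stat_large:.2f}, V = {v_large:.3f}")
```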

In [14]:
dfzscore = pd.read_excel('cleandatazscore2911.xlsx', index_col=0)
dfiqr = pd.read_excel('cleandataiqr2911.xlsx', index_col=0)

In [15]:
def perform_chi_square_test(df, score_column, category_column, bins=10):
    """
    Perform a Chi-Square Test of Independence on a given DataFrame.

    Parameters:
        df (pd.DataFrame): The input DataFrame.
        score_column (str): The column name containing the numeric scores to be binned.
        category_column (str): The column name of the categorical variable.
        bins (int): The number of bins for discretizing the score column. Default is 10.

    Returns:
        tuple: Chi-Square test statistic, p-value, degrees of freedom, expected frequencies.
    """
    # Bin the scores without adding a column to the caller's DataFrame
    binned_scores = pd.cut(df[score_column], bins=bins)

    # Create a contingency table
    contingency_table = pd.crosstab(df[category_column], binned_scores)

    # Perform Chi-Square Test of Independence
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    # Print results
    print(f"Chi-Square Test Statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of Freedom: {dof}")

    return chi2, p, dof, expected

In [16]:
perform_chi_square_test(dfzscore, 'avgscore', 'ASBH02A', bins=10)

Chi-Square Test Statistic: 3.9381881428980847
P-value: 0.9154334828579949
Degrees of Freedom: 9


(3.9381881428980847,
 0.9154334828579949,
 9,
 array([[  1.05510535,   1.65802269,   1.65802269,   2.61264182,
           3.36628849,   2.86385737,   3.26580227,   4.62236629,
           4.11993517,   5.77795786],
        [ 19.94489465,  31.34197731,  31.34197731,  49.38735818,
          63.63371151,  54.13614263,  61.73419773,  87.37763371,
          77.88006483, 109.22204214]]))

In [17]:
perform_chi_square_test(dfiqr, 'avgscore', 'ASBH02A', bins=10)

Chi-Square Test Statistic: 248.9372235851522
P-value: 1.6723183136152255e-48
Degrees of Freedom: 9


(248.9372235851522,
 1.6723183136152255e-48,
 9,
 array([[6.13825745e+01, 1.42875650e+02, 2.16590796e+02, 3.21067302e+02,
         4.70809952e+02, 6.04506244e+02, 6.23565674e+02, 3.64932018e+02,
         7.39253608e+01, 4.34442879e+00],
        [8.14617426e+02, 1.89612435e+03, 2.87440920e+03, 4.26093270e+03,
         6.24819005e+03, 8.02249376e+03, 8.27543433e+03, 4.84306798e+03,
         9.81074639e+02, 5.76555712e+01]]))
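
Because the expected frequencies sum to the table total, the two outputs above also reveal how many rows each filtered dataset contributed. The sketch below copies the (rounded) expected frequencies from those outputs and counts how many cells fall below 5, a common rule of thumb for when the Chi-square approximation becomes unreliable. The helper name `summarize_expected` is illustrative.

```python
import numpy as np

# Expected frequencies copied (rounded) from the two outputs above.
expected_zscore = np.array([
    [1.055, 1.658, 1.658, 2.613, 3.366, 2.864, 3.266, 4.622, 4.120, 5.778],
    [19.945, 31.342, 31.342, 49.387, 63.634, 54.136, 61.734, 87.378, 77.880, 109.222],
])
expected_iqr = np.array([
    [61.38, 142.88, 216.59, 321.07, 470.81, 604.51, 623.57, 364.93, 73.93, 4.34],
    [814.62, 1896.12, 2874.41, 4260.93, 6248.19, 8022.49, 8275.43, 4843.07, 981.07, 57.66],
])

def summarize_expected(expected, threshold=5):
    """Return (approximate sample size, number of cells below `threshold`)."""
    n = int(round(expected.sum()))
    n_small = int((expected < threshold).sum())
    return n, n_small

for name, expected in [("z-score", expected_zscore), ("IQR", expected_iqr)]:
    n, n_small = summarize_expected(expected)
    print(f"{name}: n ~ {n}, cells with expected count < 5: {n_small} of {expected.size}")
```

The z-score filtered table holds about 617 rows and the IQR-filtered table about 41,158, so the two tests are not operating on comparable sample sizes.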

Unfiltered Dataset
Chi-Square Test Statistic: 245.83 (very large value)
P-value: $7.55 \times 10^{-48}$ (extremely small, essentially 0)
Interpretation: The null hypothesis is rejected with very high confidence. There is a significant association or effect detected in the unfiltered data.
Z-Score Filtered Dataset
Chi-Square Test Statistic: 3.94 (very small value)
P-value: 0.915 (very high, close to 1)
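Interpretation: The null hypothesis cannot be rejected, but this should not be read as the association having disappeared. The expected frequencies printed above sum to only about 617 rows (31 "No" and 586 "Yes"), compared with roughly 41,000 rows in the unfiltered data, so the test has far less power. Cramér's V for this table is about $\sqrt{3.94 / 617} \approx 0.08$, nearly the same as in the unfiltered data. Nine of the twenty expected counts are also below 5, so the Chi-square approximation itself is unreliable here.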
IQR Filtered Dataset
Chi-Square Test Statistic: 248.94 (very large value, even larger than unfiltered)
P-value: $1.67 \times 10^{-48}$ (extremely small, essentially 0)
Interpretation: As with the unfiltered data, the null hypothesis is rejected with very high confidence. The expected frequencies sum to roughly 41,000 rows, so IQR filtering removed only a small share of the data and left the result essentially unchanged.
Comparison of Results
Effect of Filtering:

The z-score filtered file contains only about 617 rows, so its much smaller test statistic and larger p-value mainly reflect the loss of sample size rather than the removal of a few extreme values. It is worth checking why that filter discarded so many rows.
IQR filtering kept almost all rows, and the test statistic is nearly identical to that of the unfiltered dataset.
Significance:

The unfiltered and IQR-filtered datasets show a statistically significant but weak association (Cramér's V of about 0.08).
The z-score filtered dataset gives a non-significant result with a similar effect size, which is what a much smaller sample would be expected to produce.
Interpretation of Filtering Methods:

Z-score Filtering: This method excludes points that lie more than a chosen number of standard deviations from the mean (often 3). Because the mean and standard deviation are themselves sensitive to outliers, it can behave poorly on skewed or heavy-tailed data.
IQR Filtering: This method excludes points that lie more than 1.5 × IQR below the first quartile or above the third quartile. It does not keep only the central 50% of the data; for roughly normal data it retains about 99% of the observations.
Conclusion
The unfiltered and IQR-filtered datasets provide strong evidence against the null hypothesis of independence, although the association itself is weak.
The z-score filtered dataset has lost most of its rows, so its non-significant result says little about the underlying relationship. It’s important to consider the nature of the data and the appropriateness of the filtering technique for the given analysis.





You said:
So do you think I should continue my analysis with the IQR-filtered dataset? Is it normal to get such different results between the z-score filtered df and IQR-filtered?
ChatGPT said:
Yes, continuing your analysis with the IQR-filtered dataset is a reasonable choice, particularly if your goal is to balance removing extreme outliers without losing critical information about the dataset's central tendencies. Here’s why:

1. The Case for the IQR-Filtered Dataset
Preservation of Signal: IQR filtering retains almost all of the data while removing extreme outliers (points more than 1.5 × IQR below the first quartile or above the third). The fact that the Chi-square test statistic is nearly identical to that of the unfiltered dataset suggests that the key relationships in the data remain intact.
Outlier Management: IQR filtering removes extreme points that could skew results without discarding moderate variations that might still carry useful information.
2. Is it Normal to Get Different Results Between Z-Score and IQR Filtering?
Yes, it’s quite common to see different results between these two methods because they handle outliers differently:

Z-Score Filtering:

Focuses on removing data points that deviate significantly (e.g., beyond 3 standard deviations) from the mean.
May remove both large and small values indiscriminately, especially in datasets with non-normal distributions or heavy tails.
Can lead to the loss of critical variations or signals if your dataset inherently has high variability or non-Gaussian characteristics.
IQR Filtering:

Targets the tails of the distribution, but only those points beyond the 1.5x IQR from the first or third quartile.
More robust for non-Gaussian distributions or datasets with skewed distributions because it doesn't depend on the mean and standard deviation.
The stark difference in p-values (0.915 for z-score filtering versus $1.67 \times 10^{-48}$ for IQR filtering) most likely reflects sample size rather than the removal of a few influential points: the z-score filtered table contains only about 617 rows, against roughly 41,000 in the other two datasets, while the effect size (Cramér's V of about 0.08) is similar in all three. If the removed rows are not true "noise" but meaningful data, the z-score filtered dataset underrepresents your data.
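
The two rules described above can be sketched on a single numeric column as follows. The notebook does not show how the filtered files were actually produced, so the function names, the thresholds (3 standard deviations, 1.5 × IQR), and the synthetic scores are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def zscore_filter(series, threshold=3.0):
    """Keep values within `threshold` standard deviations of the mean."""
    z = (series - series.mean()) / series.std()
    return series[z.abs() <= threshold]

def iqr_filter(series, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[series.between(q1 - k * iqr, q3 + k * iqr)]

# Synthetic, roughly normal scores plus a few extreme values (illustrative only).
rng = np.random.default_rng(0)
scores = pd.Series(np.concatenate([rng.normal(500, 100, 10_000), [1500, 1600, -700]]))

kept_z = zscore_filter(scores)
kept_iqr = iqr_filter(scores)
print(f"original: {len(scores)}, z-score: {len(kept_z)}, IQR: {len(kept_iqr)}")
```

Applied to one roughly normal column, both rules keep the large majority of rows. A filtered file that retains only a small fraction of the data may therefore have been filtered on many columns at once or with a much stricter threshold, which is worth verifying before comparing results.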

3. Considerations
Before proceeding exclusively with the IQR-filtered dataset:

Understand Your Data:

Examine the distributions of the variables and determine if they are skewed, have heavy tails, or contain clusters of outliers. This will guide you in confirming whether z-score filtering was overly aggressive.
Validate Robustness:

Cross-check key findings across both filtered datasets to ensure that your conclusions aren’t overly dependent on the choice of filtering. The fact that the unfiltered and IQR-filtered datasets yield similar results is reassuring.
Interpret Outliers:

Investigate the nature of the outliers. If outliers represent valid observations (e.g., rare but meaningful phenomena), they should not be filtered out too aggressively.
4. Recommendation
Based on your results and standard practices:

Use the IQR-filtered dataset for your main analysis, as it preserves the data's key relationships and balances outlier removal.
Optionally, provide a sensitivity analysis with the z-score filtered dataset in your final report. This will show how sensitive your findings are to the preprocessing method. Compare effect sizes such as Cramér's V rather than p-values alone, since the datasets differ greatly in size.
In summary, the differences between filtering methods highlight the importance of choosing the one most aligned with the characteristics of your data. The IQR-filtered dataset appears to strike the right balance in your case!
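
One way to set up such a sensitivity analysis is sketched below. It reports the sample size and Cramér's V alongside the test result for each dataset, and it uses the same bin edges everywhere so the tables are comparable. The helper names `association_summary` and `compare_datasets` are hypothetical, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def association_summary(df, score_column, category_column, bin_edges):
    """Return n, chi-square, p-value, dof and Cramér's V using fixed bin edges."""
    binned = pd.cut(df[score_column], bins=bin_edges, include_lowest=True)
    table = pd.crosstab(df[category_column], binned)
    # Drop empty rows/columns, which would make expected counts zero.
    keep_rows = (table.sum(axis=1) > 0).to_numpy()
    keep_cols = (table.sum(axis=0) > 0).to_numpy()
    table = table.loc[keep_rows, keep_cols]
    chi2_stat, p_value, dof, _ = chi2_contingency(table)
    n = int(table.to_numpy().sum())
    v = np.sqrt(chi2_stat / (n * (min(table.shape) - 1)))
    return {"n": n, "chi2": chi2_stat, "p": p_value, "dof": dof, "cramers_v": v}

def compare_datasets(datasets, score_column, category_column, bins=10):
    """Summarize each DataFrame in `datasets` (a dict of name -> DataFrame)."""
    all_scores = pd.concat([d[score_column] for d in datasets.values()])
    # Shared, equal-width bin edges spanning every dataset.
    bin_edges = np.linspace(all_scores.min(), all_scores.max(), bins + 1)
    rows = {name: association_summary(d, score_column, category_column, bin_edges)
            for name, d in datasets.items()}
    return pd.DataFrame.from_dict(rows, orient="index")

# Synthetic demo: a full dataset and a much smaller subset of it.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "avgscore": rng.normal(500, 100, 2000),
    "ASBH02A": rng.choice(["Yes", "No"], 2000, p=[0.9, 0.1]),
})
summary = compare_datasets(
    {"full": demo, "subset": demo.sample(200, random_state=1)},
    "avgscore", "ASBH02A",
)
print(summary)
```

With the notebook's own DataFrames, the call would look like `compare_datasets({"unfiltered": df, "z-score": dfzscore, "IQR": dfiqr}, "avgscore", "ASBH02A")`, assuming all three are loaded as in the cells above.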