# Inferential Analysis (Chi-Squared)

A short tutorial on applying the chi-squared test of independence to hypotheses about possible associations between variables. We will use fake survey data (generated below) and the **`chi2_contingency`** function from the **`scipy`** package [(source)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) for our chi-squared calculations.
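
As a quick orientation before the full setup, here is a minimal sketch of calling `chi2_contingency` directly on a small contingency table. The 2x3 counts are made up purely for illustration. The function returns the test statistic, the p-value, the degrees of freedom, and the table of expected frequencies.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts (rows: two groups, columns: three answer options)
observed = np.array([[20, 15, 25],
                     [30, 20, 10]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f'chi^2 = {chi2:.3f}, p-value = {p_value:.4f}, dof = {dof}')
print('Expected frequencies under independence:')
print(expected)
```

A small p-value (for example, at or below 0.05) would lead us to reject the hypothesis that the row and column variables are independent.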


## Hypotheses (from fake data) on students' views of Professor Faaris:
> Note: "Selected variables" are the survey-questions, or columns (as you'll find later) picked for answering the respective hypothesis.

* H1: **The student-professor connection with Professor Faaris differs between students of different backgrounds.**  
<u>Selected variables</u>: "Prof Relations" with "Age" and "Gender".
<br>
* H2: **Student relations with Professor Faaris depend on the course subject taken with him.**  
<u>Selected variables</u>: "Prof Relations" - "Prof Subjects"
<br>
* H3: **Students' delivery preference for Faaris's classes (online/in-person) depends on the number of classes taken with him.**  
<u>Selected variables</u>: "Delivery Pref" - "Total Classes Taken"

---
## SETUP
The packages and functions below help with **calculating**, **displaying** and **saving** the tabular results of our chi-squared calculations:

In [1]:
# ====================================================
# ===== Display tables side-by-side (Jupyter) =====

from IPython.display import HTML

def sbs(dfs):
    # Put each table's HTML into its own cell of a single-row table
    cells = ''.join(f'<td>{table._repr_html_()}</td>' for table in dfs)
    return HTML(f'<table><tr style="background-color:white;">{cells}</tr></table>')


# ===================================================
# ============= For Chi-square findings =============

from scipy.stats import chi2_contingency
import pandas as pd

def chisq(df, var1, var2,
          show_details=False,
          both_a=False,
          show_tables=False,
          save_tables=False,
          name=''):
    
    # H0: There is NO association between var1 & var2 (independent).
    # HA: There IS an association between var1 & var2 (dependent).
    
    
    # X^2 Contingency table from our dataframe
    contingency_table = pd.crosstab(df[var1], df[var2])

    # Perform chi-squared test
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)


    
    # -------------Format our tables (for display / image-saving)
    if show_tables:
        
        # Expected Frequencies Table
        expected_df = pd.DataFrame(expected)
        expected_df.columns = contingency_table.columns
        expected_df.index = contingency_table.index
        
        
        # Results Table (chi^2, p-value, degrees-of-freedom)
        chi2_df = pd.DataFrame([['chi^2',chi2],
                                ['p-value',p_value],
                                ['degrees of freedom',dof]],
                              columns=['',' '])
        chi2_df.set_index('', inplace=True)
        
         
        # Combine Contingency + Expected_freq. Tables 
        separator = pd.DataFrame(columns=['||'])
        
        contingency_expected_comined_df = pd.concat([contingency_table,
                                                     separator,
                                                     expected_df], axis=1)
        
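        # Assign the result back instead of calling fillna(inplace=True) on a
        # selected column, which newer pandas versions warn about
        contingency_expected_comined_df['||'] = (
            contingency_expected_comined_df['||'].fillna('||'))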
        
        
        
        # ----------- Side-by-side display of our: -----------
        display(sbs([
                    contingency_expected_comined_df, # combined-table
                    chi2_df                          # results-table
                    ])
               ) 
    
        
        # ------- Save-option for the tables to Excel --------
        
        if save_tables:
            with pd.ExcelWriter(f'{name}.xlsx', engine='xlsxwriter') as writer:

                # Write the Combined-table to an Excel file
                contingency_expected_comined_df.to_excel(writer,
                                                         sheet_name='chisq_tables',
                                                         index=True)

                # Results-table in the same Excel file (different sheet)
                chi2_df.to_excel(writer, sheet_name='chisq_values', index=True)
    
    
    # -------------- Conclusion (by P-value) ---------------
    
    reject = "<= a, NOT Independent (Association)"
    fail_to_reject = "> a, Independent (No Association)"
    
    a1 = reject if p_value <= 0.01 else fail_to_reject
    a5 = reject if p_value <= 0.05 else fail_to_reject
 
    # -------------- Print Chi-square findings & Conclusion
    if show_details:

        print('-'*50+f'\nTest of Independence: [{var1}] & [{var2}]\n')
        # Print results
        print(f'Chi-squared statistic: {chi2}\nDegrees of freedom: {dof}')
        print(f'Expected frequencies:\n{expected}\nP-value: {p_value}\n')
        print(f'For a = 0.01 (1%):\t P-value {a1}!\nFor a = 0.05 (5%):\t P-value {a5}!')
        print('-'*50)
    
    # ------------------ Return p-value (float) and Conclusion (str)
    return [p_value, a5]




# ===================================================
# ============== Table-results + with Style =========

import dataframe_image as dfi # For saving tables as images

# Applying colored cells (Green/Red) in column (p-values)
def color_index(val):
    
    if pd.isna(val):
        return 'color: white; background-color: white'
    
    return 'background-color: #a1e8a0' if val <= 0.05 else 'background-color: #e8a0a0'



# Automate the Chi-squared tests per hypothesis
# (via variables), with options to export tables
# as excel/images
def chisq_hypotheses(hyp_dict, df,
                     show_tables=False,
                     save_tables=False,
                     save_imgs=False):
    
    variable_indices = df.attrs['variable_indices']

    
    for h_index, hvars in enumerate(hyp_dict.values(),1):
        print("-"*50)
        
        variable_1, variable_2 = hvars

        pvals = []
        comments = []

        v1_name = variable_indices[variable_1]

        # Testing between more than two variables (for the same Hypothesis);
        # if variable-2 is a list of more, against variable-1:
        if isinstance(variable_2, list):

            for i, var2_i in enumerate(variable_2,1):

                v2_name = variable_indices[var2_i]
                file_name = f'Hypothesis {h_index}: ({v1_name} and {v2_name})'
                
                print(file_name)
                
                # Test per iteration
                chisq_result = chisq(df, v1_name, v2_name,
                                     show_details=False,
                                     show_tables=show_tables,
                                     save_tables=save_tables,
                                     name=file_name
                                    )

                # Collect each test's p-value
                pvals.append([v2_name, chisq_result[0]])


                # Collect each test's comment (Conclusion)
                # Trailing space keeps "significant" and "association" apart
                comment = (f'There IS a statistically significant '+
                           f'association/dependence between [{v1_name}] and [{v2_name}]')

                if chisq_result[0] > 0.05:
                    comment = comment.replace('a statistically', 'NO').replace('/dependence','')
                comments.append(comment)


        # Test between only two variables
        else:
            v2_name = variable_indices[variable_2]
            file_name = f'Hypothesis {h_index}: ({v1_name} and {v2_name})'
            
            print(file_name)
            
            chisq_result = chisq(df, v1_name, v2_name,
                                 show_details=False,
                                 show_tables=show_tables,
                                 save_tables=save_tables,
                                 name=file_name)

            # Collect each test's p-value
            pvals.append([v2_name, chisq_result[0]])


            # Collect each test's comment (Conclusion)
            # Trailing space keeps "significant" and "association" apart
            comment = (f'There IS a statistically significant '+
                       f'association/dependence between [{v1_name}] and [{v2_name}]')

            if chisq_result[0] > 0.05:
                comment = comment.replace('a statistically', 'NO').replace('/dependence','')
            comments.append(comment)
            
            
            
        # Get a coloured P-values table
        # (Green: Association between variables, Red: Not)
        chisq_df = pd.DataFrame(pvals, columns=['','P-value (X^2)'])
        chisq_df.set_index('', inplace=True)
        chisq_df.rename_axis(f'Association with "{v1_name}"', inplace=True)

            
        styled_chisq = chisq_df.style.applymap(color_index).set_properties(**{'border': '1px solid black'})

        display(styled_chisq)
        print('\nChi-squared Interpretation:\n* '+'\n* '.join(comments))
        
        
        # For saving the coloured p-values table as an image
        if save_imgs:
            dfi.export(styled_chisq, f'H{h_index}_CHISQ_p-values.png')
        
        print('\n\n')
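
The expected frequencies, statistic, and degrees of freedom that `chisq` reports can also be reproduced by hand, which may help when reading the combined tables. The sketch below is only an illustration on a made-up 2x3 table, not part of the tutorial's pipeline. Note that for 2x2 tables SciPy applies Yates' continuity correction by default, so a manual calculation would match only with `correction=False`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 table of observed counts
observed = np.array([[12, 18, 10],
                     [ 8, 22, 30]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count per cell under independence: row total * column total / N
expected_manual = np.outer(row_totals, col_totals) / n

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
stat_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper tail of the chi-squared distribution
p_manual = chi2.sf(stat_manual, dof_manual)

# Compare with SciPy's result
stat, p_value, dof, expected = chi2_contingency(observed)
print(np.allclose(expected, expected_manual),
      np.isclose(stat, stat_manual),
      np.isclose(p_value, p_manual),
      dof == dof_manual)
```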
        


---
## Generate (*fake*) Survey Data

In [2]:
from random import choice

# Number of Students (Respondents)
n_students = 200

# List of Survey Questions
questions = ['1. What Age-group are you in?',
             '2. What is your Gender?',
             '3. For how many years have you studied in Fake School?',
             '4. Rate your overall relationship with your professor:',
             '5. How many classes have you taken with your professor?',
             '6. Which of your classes did your professor conduct?',
             '7. Do you prefer In-Person or/and Remote classes with your professor?',
            ]

# Survey Response-Choices
choices = dict(
                age = ['Below 17', '17-21', '22-28', '29-35', '36-49', '50+'],
                gender = ['Male', 'Female', 'Prefer not to say'],
                study_duration = ['0-1 year', '1-2 years', '3-4 years', '4-5 years', '5+ years'],
                prof_relations = ["Very Good", "Good", "Ok", "Not Good", "Quite Bad",
                                  "N/A - Not Applicable"],
                total_classes_taken = ['1-3 classes', '4-6 classes', '7-9 classes', '10+ classes'],
                prof_subjects = ['1. Python Programming', '2. Calculus', '3. Physics',
                                 '1. & 2.', '1. & 3.', '2. & 3.', 'All three'],
                delivery_pref = ['In-Person', 'Remote', 'Both', 'None']
            )


# Random response/choice made per question (in a dictionary)
survey_data_dict = {q : [choice(a) for _ in range(n_students)]
                        for q,a in zip(questions, choices.values())}


# Save the data to an Excel file (from Pandas)
pd.DataFrame(survey_data_dict).to_excel('fake_survey_data.xlsx',
                                                   index=False)
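
Because `choice` draws every answer uniformly at random, each run of the cell above produces a different data set, so the p-values shown later will not be reproduced exactly. A minimal sketch of one way to make the draws repeatable follows. The helper `draw_answers` and the seed value are hypothetical and not part of the original notebook; calling `random.seed(...)` once before the generation cell would be a simpler alternative.

```python
import random

def draw_answers(options, n, seed):
    """Draw n answers uniformly at random from options, reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return [rng.choice(options) for _ in range(n)]

genders = ['Male', 'Female', 'Prefer not to say']

first_run = draw_answers(genders, 200, seed=42)
second_run = draw_answers(genders, 200, seed=42)

print(first_run == second_run)  # same seed, same answers
```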

---
## Data Load & Processing
We load our (previously generated) **Excel data**, then **select survey questions as variables** for each hypothesis and give each question a **shortened title**.

In [3]:
print_comments = ['\nPreview of our loaded raw survey-data (first 5 rows):',
                 '\nPreview of our processed survey-data (first 5 rows):']


# --------- Data Load --------

df = pd.read_excel('fake_survey_data.xlsx', sheet_name=0)


# Preview data:
print(f'{"-"*len(print_comments[0])}{print_comments[0]}')
display(df.head())




# ------- Select Questions, then Shorten them to titles

selected_variables = [1,2,4,5,6,7] # Questions 1,2,4 ... 7.

# Pair each question with its short key from the earlier `choices` dictionary
question_titles = {q : short.replace('_',' ').title()
                       for q,short in zip(df.columns, choices)}

# Keep selected questions in Pandas (by question's number/index)
df = df[[q for q in df.columns if int(q.split('.')[0]) in selected_variables]]

# Apply new title-names per question-column
df.rename(columns = question_titles,
          inplace = True)


#Preview data:
print(f'\n{"-"*len(print_comments[-1])}{print_comments[-1]}')
display(df.head())

------------------------------------------------------
Preview of our loaded raw survey-data (first 5 rows):


| | 1. What Age-group are you in? | 2. What is your Gender? | 3. For how many years have you studied in Fake School? | 4. Rate your overall relationship with your professor: | 5. How many classes have you taken with your professor? | 6. Which of your classes did your professor conduct? | 7. Do you prefer In-Person or/and Remote classes with your professor? |
|---|---|---|---|---|---|---|---|
| 0 | 29-35 | Female | 3-4 years | Good | 1-3 classes | All three | Both |
| 1 | 36-49 | Female | 1-2 years | Not Good | 4-6 classes | 2. Calculus | In-Person |
| 2 | Below 17 | Male | 4-5 years | Ok | 7-9 classes | All three | Remote |
| 3 | 36-49 | Prefer not to say | 5+ years | Ok | 4-6 classes | 2. & 3. | In-Person |
| 4 | 36-49 | Male | 5+ years | Not Good | 7-9 classes | All three | In-Person |



-----------------------------------------------------
Preview of our processed survey-data (first 5 rows):


| | Age | Gender | Prof Relations | Total Classes Taken | Prof Subjects | Delivery Pref |
|---|---|---|---|---|---|---|
| 0 | 29-35 | Female | Good | 1-3 classes | All three | Both |
| 1 | 36-49 | Female | Not Good | 4-6 classes | 2. Calculus | In-Person |
| 2 | Below 17 | Male | Ok | 7-9 classes | All three | Remote |
| 3 | 36-49 | Prefer not to say | Ok | 4-6 classes | 2. & 3. | In-Person |
| 4 | 36-49 | Male | Not Good | 7-9 classes | All three | In-Person |
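
The selection step relies on every column name starting with its question number followed by a dot. As a small, self-contained sketch of that idea (the two-row frame and the names `raw`, `short_names`, and `keep` are stand-ins for illustration only):

```python
import pandas as pd

# Tiny stand-in for the survey frame (hypothetical rows)
raw = pd.DataFrame({
    '1. What Age-group are you in?': ['17-21', '22-28'],
    '3. For how many years have you studied in Fake School?': ['0-1 year', '5+ years'],
    '4. Rate your overall relationship with your professor:': ['Good', 'Ok'],
})
short_names = ['age', 'study_duration', 'prof_relations']
keep = [1, 4]  # question numbers to keep

# Map each full question to a title-cased short name
titles = {q: s.replace('_', ' ').title() for q, s in zip(raw.columns, short_names)}

# Keep a column only if the number before the first '.' is selected
kept_cols = [q for q in raw.columns if int(q.split('.')[0]) in keep]
processed = raw[kept_cols].rename(columns=titles)

print(list(processed.columns))
```

Using `rename` without `inplace=True` returns a new frame, which avoids modifying a slice of the original data.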


In [4]:
# Keeping a reference of Variable-Indices (Question numbers)
# within our processed data:
df.attrs['variable_indices'] = dict(zip(selected_variables,
                                        df.columns))

print('Question-numbers for our selected variables:')
display(df.attrs['variable_indices'])




hypotheses = {'H1':[4, [1,2]], # "Prof Relations" - "Age" and "Gender".
              'H2':[4, 6], # "Prof Relations" - "Prof Subjects"
              'H3':[7, 5], # "Delivery Pref" - "Total Classes Taken"
             }




chisq_hypotheses(hyp_dict=hypotheses,
                 df=df,
                 show_tables=True,
                 save_tables=True,
                 save_imgs=True)

Question-numbers for our selected variables:


{1: 'Age',
 2: 'Gender',
 4: 'Prof Relations',
 5: 'Total Classes Taken',
 6: 'Prof Subjects',
 7: 'Delivery Pref'}

--------------------------------------------------
Hypothesis 1: (Prof Relations and Age)


Hypothesis 1: (Prof Relations and Gender)


| Association with "Prof Relations" | P-value (X^2) |
|---|---|
| Age | 0.247727 |
| Gender | 0.765123 |



Chi-squared Interpretation:
* There IS NO significant association between [Prof Relations] and [Age]
* There IS NO significant association between [Prof Relations] and [Gender]



--------------------------------------------------
Hypothesis 2: (Prof Relations and Prof Subjects)


| Association with "Prof Relations" | P-value (X^2) |
|---|---|
| Prof Subjects | 0.009805 |



Chi-squared Interpretation:
* There IS a statistically significant association/dependence between [Prof Relations] and [Prof Subjects]



--------------------------------------------------
Hypothesis 3: (Delivery Pref and Total Classes Taken)


| Association with "Delivery Pref" | P-value (X^2) |
|---|---|
| Total Classes Taken | 0.520861 |



Chi-squared Interpretation:
* There IS NO significant association between [Delivery Pref] and [Total Classes Taken]
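
One caveat when reading these p-values: the chi-squared approximation is usually considered reliable only when the expected counts are not too small (a common rule of thumb asks for at least 5 in most cells). With 200 respondents spread over a 6 x 7 table, the average expected count falls below that. The sketch below shows one possible check; `share_of_small_cells` is a hypothetical helper, and the data is regenerated with an arbitrary seed rather than taken from the run above.

```python
import random
import pandas as pd
from scipy.stats import chi2_contingency

def share_of_small_cells(expected, threshold=5):
    """Fraction of cells whose expected count is below threshold."""
    return float((expected < threshold).mean())

rng = random.Random(0)  # arbitrary seed, for a repeatable illustration
relations = ["Very Good", "Good", "Ok", "Not Good", "Quite Bad",
             "N/A - Not Applicable"]
subjects = ['1. Python Programming', '2. Calculus', '3. Physics',
            '1. & 2.', '1. & 3.', '2. & 3.', 'All three']

demo = pd.DataFrame({
    'Prof Relations': [rng.choice(relations) for _ in range(200)],
    'Prof Subjects': [rng.choice(subjects) for _ in range(200)],
})

table = pd.crosstab(demo['Prof Relations'], demo['Prof Subjects'])
_, _, _, expected = chi2_contingency(table)

print(f'Mean expected count: {expected.mean():.2f}')
print(f'Share of cells with expected count < 5: {share_of_small_cells(expected):.0%}')
```

If many cells fall below the threshold, merging sparse answer categories before testing is one commonly used remedy.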





## Check your saved results (Excel tables / images)!