### Univariate Analysis ([reference blog](https://www.jmp.com/en_in/statistics-knowledge-portal/chi-square-test/chi-square-test-of-independence.html#:~:text=What%20is%20the%20Chi%2Dsquare,to%20be%20related%20or%20not.))

**Univariate analysis** deals with a single variable and assesses characteristics such as the mean or proportion.

#### 1. **One-Sample t-Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The mean of the population the sample was drawn from is equal to the known value (μ0).
  - **Alternative Hypothesis (Ha):** The mean of the population the sample was drawn from is different from the known value (μ0).
- **Type of Data Needed:** Continuous data.
- **Example:** Testing if the average height of a sample of people is different from the national average height.

#### 2. **One-Sample z-Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The mean of the population the sample was drawn from is equal to the reference mean (μ0).
  - **Alternative Hypothesis (Ha):** The mean of the population the sample was drawn from is different from the reference mean (μ0).
- **Type of Data Needed:** Continuous data; large sample size; known population standard deviation.
- **Example:** Testing if the average score of a sample differs from a benchmark score when sample size is large.
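
For reference, the one-sample z statistic is usually written as follows. The symbols $\bar{x}$ (sample mean), $\sigma$ (known population standard deviation), and $n$ (sample size) are introduced here for illustration; μ0 is the hypothesized mean from the hypotheses above.

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
```

The two-tailed p-value is then the probability that a standard normal variable exceeds $|z|$ in absolute value.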

#### 3. **Chi-Square Goodness-of-Fit Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The observed frequencies fit the expected frequencies.
  - **Alternative Hypothesis (Ha):** The observed frequencies do not fit the expected frequencies.
- **Type of Data Needed:** Categorical data.
- **Example:** Testing if a die is fair by comparing the observed frequencies of each face to the expected frequencies.

#### 4. **Wilcoxon Signed-Rank Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The median of the population is equal to a specified value.
  - **Alternative Hypothesis (Ha):** The median of the population is different from the specified value.
- **Type of Data Needed:** Continuous or ordinal data; a single sample compared with a hypothesized median, or the differences of paired (related) samples.
- **Example:** Assessing if the median of test scores differs from a hypothesized value.

#### 5. **Kolmogorov-Smirnov Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The sample follows a specific distribution (e.g., normal distribution).
  - **Alternative Hypothesis (Ha):** The sample does not follow the specified distribution.
- **Type of Data Needed:** Continuous data.
- **Example:** Testing if a sample of test scores follows a normal distribution.
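
A rough sketch of the test-score example using `scipy.stats.kstest`. The samples are synthetic, and the helper name `ks_normal` is hypothetical. Note one assumption: the normal distribution's mean and standard deviation are estimated from the sample itself, which tends to make the standard KS p-value too large (the Lilliefors variant corrects for this).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_scores = rng.normal(loc=70, scale=10, size=500)  # synthetic, roughly normal
skewed_scores = rng.exponential(scale=10, size=500)     # synthetic, clearly skewed

def ks_normal(sample):
    """KS test against a normal with the sample's own mean and std."""
    return stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))

res_normal = ks_normal(normal_scores)
res_skewed = ks_normal(skewed_scores)
print(f"normal sample p = {res_normal.pvalue:.3f}")
print(f"skewed sample p = {res_skewed.pvalue:.3g}")
```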

### Bivariate Analysis

**Bivariate analysis** examines the relationship between two variables.

#### 1. **Independent Samples t-Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The means of two independent groups are equal (µ1 = µ2).
  - **Alternative Hypothesis (Ha):** The means of the two groups are different (µ1 ≠ µ2).
- **Type of Data Needed:** Continuous data; two independent groups.
- **Example:** Comparing test scores of students from two different teaching methods.
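
As an illustration, the teaching-method comparison could be run with `scipy.stats.ttest_ind`. The scores are synthetic, and `equal_var=False` (Welch's variant, which does not assume equal variances) is a choice made for this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
method_a = rng.normal(loc=70, scale=5, size=40)  # synthetic scores, method A
method_b = rng.normal(loc=80, scale=5, size=40)  # synthetic scores, method B

# Welch's t-test: two independent groups, unequal variances allowed
t_stat, p_value = stats.ttest_ind(method_a, method_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```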

#### 2. **Paired Samples t-Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The mean difference between paired observations is zero.
  - **Alternative Hypothesis (Ha):** The mean difference between paired observations is not zero.
- **Type of Data Needed:** Continuous data; two related groups (e.g., pre-test and post-test).
- **Example:** Comparing test scores of students before and after an intervention.
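
A small sketch of the before/after example with synthetic data. It also shows that the paired test is equivalent to a one-sample t-test on the per-student differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.normal(loc=60, scale=8, size=30)         # synthetic pre-test scores
after = before + rng.normal(loc=5, scale=2, size=30)  # synthetic post-test scores

# Paired test on the two related measurements
t_stat, p_value = stats.ttest_rel(after, before)

# Same test expressed as a one-sample t-test on the differences
t_check, p_check = stats.ttest_1samp(after - before, 0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```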

#### 3. **Chi-Square Test of Independence**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** There is no association between two categorical variables.
  - **Alternative Hypothesis (Ha):** There is an association between the two categorical variables.
- **Type of Data Needed:** Categorical data; two variables.
- **Example:** Testing if there is a relationship between gender and preference for a type of product.
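
The gender and product-preference example might look like this with `scipy.stats.chi2_contingency`. The survey counts are hypothetical. For a 2x2 table SciPy applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy import stats

# Hypothetical survey counts (illustrative only)
table = pd.DataFrame(
    {"Product A": [30, 15], "Product B": [10, 25]},
    index=["female", "male"],
)

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.4f}")
print(expected)  # counts expected if the two variables were independent
```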

#### 4. **ANOVA (Analysis of Variance)**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The means of three or more groups are equal.
  - **Alternative Hypothesis (Ha):** At least one group mean is different.
- **Type of Data Needed:** Continuous data; three or more independent groups.
- **Example:** Comparing average test scores among students from different teaching methods.
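
A one-way ANOVA sketch for three teaching methods, using `scipy.stats.f_oneway`. The group names and scores are synthetic and only serve as an illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
lecture = rng.normal(loc=70, scale=5, size=30)  # synthetic scores per method
seminar = rng.normal(loc=75, scale=5, size=30)
online = rng.normal(loc=85, scale=5, size=30)

f_stat, p_value = stats.f_oneway(lecture, seminar, online)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A significant result only says that at least one mean differs; a post-hoc comparison is needed to find out which.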

#### 5. **Mann-Whitney U Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The distributions of two independent groups are equal.
  - **Alternative Hypothesis (Ha):** The distributions of the two groups are different.
- **Type of Data Needed:** Ordinal data or non-normally distributed continuous data; two independent groups.
- **Example:** Comparing median salaries between two different job categories.
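
A possible sketch of the salary comparison with `scipy.stats.mannwhitneyu`. The salaries are synthetic draws from skewed (log-normal) distributions, chosen because that is the kind of data where a rank-based test is typically preferred.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
salaries_a = rng.lognormal(mean=10.5, sigma=0.4, size=50)  # synthetic, job category A
salaries_b = rng.lognormal(mean=11.0, sigma=0.4, size=50)  # synthetic, job category B

u_stat, p_value = stats.mannwhitneyu(salaries_a, salaries_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3g}")
```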

#### 6. **Kruskal-Wallis Test**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** The distributions of three or more independent groups are equal.
  - **Alternative Hypothesis (Ha):** At least one group distribution is different.
- **Type of Data Needed:** Ordinal data or non-normally distributed continuous data; three or more independent groups.
- **Example:** Comparing median ratings of three different products.
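
The product-rating example could be sketched with `scipy.stats.kruskal`. The 1-5 ratings are invented for illustration.

```python
from scipy import stats

# Hypothetical 1-5 ratings for three products (illustrative data)
product_a = [5, 4, 5, 4, 5, 3, 4, 5]
product_b = [3, 3, 2, 4, 3, 2, 3, 3]
product_c = [1, 2, 1, 2, 1, 3, 2, 1]

h_stat, p_value = stats.kruskal(product_a, product_b, product_c)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```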

#### 7. **Spearman's Rank Correlation**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** There is no monotonic relationship between the two ranked variables.
  - **Alternative Hypothesis (Ha):** There is a monotonic relationship between the two ranked variables.
- **Type of Data Needed:** Ordinal data or continuous data that does not meet parametric assumptions.
- **Example:** Examining the relationship between students’ rankings in two different exams.
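
A short sketch of the exam-ranking example with `scipy.stats.spearmanr`. The rankings are made up; neighbouring students simply swap places between the two exams.

```python
import numpy as np
from scipy import stats

# Hypothetical rankings of eight students in two exams
rank_exam_1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
rank_exam_2 = np.array([2, 1, 4, 3, 6, 5, 8, 7])

rho, p_value = stats.spearmanr(rank_exam_1, rank_exam_2)
print(f"rho = {rho:.3f}, p = {p_value:.4f}")
```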

#### 8. **Pearson's Correlation Coefficient**
- **Assumed Hypothesis:**
  - **Null Hypothesis (H0):** There is no linear relationship between the two continuous variables.
  - **Alternative Hypothesis (Ha):** There is a linear relationship between the two continuous variables.
- **Type of Data Needed:** Continuous data; two variables.
- **Example:** Assessing the relationship between hours studied and exam scores.
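
The hours-studied example might be checked with `scipy.stats.pearsonr`, as in the sketch below. The values are hypothetical and close to a straight line on purpose.

```python
import numpy as np
from scipy import stats

# Hypothetical study hours and exam scores for eight students
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 55, 61, 64, 70, 74, 79, 85])

r, p_value = stats.pearsonr(hours, scores)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```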



### Hypothesis Test Table with Data Types and Test Characteristics

| **Test Type / Data Type** | **Continuous Data (Ratio)** | **Continuous Data (Interval)** | **Categorical Data (Nominal)** | **Categorical Data (Ordinal)** | **Categorical Data (Binary)** |
|---------------------------|-----------------------------|-------------------------------|-------------------------------|-------------------------------|-------------------------------|
| **Continuous Data (Ratio)** | - **Parametric** <br> One-Sample t-Test <br> One-Sample z-Test <br> Paired Samples t-Test <br> ANOVA <br> Pearson's Correlation Coefficient <br> <br> - **Non-Parametric** <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Parametric** <br> One-Sample t-Test <br> One-Sample z-Test <br> Paired Samples t-Test <br> ANOVA <br> Pearson's Correlation Coefficient <br> <br> - **Non-Parametric** <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Non-Parametric** <br> Chi-Square Goodness-of-Fit Test <br> Chi-Square Test of Independence | - **Non-Parametric** <br> Spearman's Rank Correlation <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Non-Parametric** <br> Chi-Square Test of Independence <br> Mann-Whitney U Test |
| **Continuous Data (Interval)** | - **Parametric** <br> One-Sample t-Test <br> One-Sample z-Test <br> Paired Samples t-Test <br> ANOVA <br> Pearson's Correlation Coefficient <br> <br> - **Non-Parametric** <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Parametric** <br> One-Sample t-Test <br> One-Sample z-Test <br> Paired Samples t-Test <br> ANOVA <br> Pearson's Correlation Coefficient <br> <br> - **Non-Parametric** <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Non-Parametric** <br> Chi-Square Goodness-of-Fit Test <br> Chi-Square Test of Independence | - **Non-Parametric** <br> Spearman's Rank Correlation <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Non-Parametric** <br> Chi-Square Test of Independence <br> Mann-Whitney U Test |
| **Categorical Data (Nominal)** | - **Non-Parametric** <br> Chi-Square Goodness-of-Fit Test <br> Chi-Square Test of Independence | - **Non-Parametric** <br> Chi-Square Goodness-of-Fit Test <br> Chi-Square Test of Independence | - **Non-Parametric** <br> Chi-Square Goodness-of-Fit Test <br> Chi-Square Test of Independence | - **Non-Parametric** <br> Chi-Square Test of Independence | - **Non-Parametric** <br> Chi-Square Test of Independence |
| **Categorical Data (Ordinal)** | - **Non-Parametric** <br> Wilcoxon Signed-Rank Test <br> Spearman's Rank Correlation <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Non-Parametric** <br> Wilcoxon Signed-Rank Test <br> Spearman's Rank Correlation <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Non-Parametric** <br> Chi-Square Test of Independence | - **Non-Parametric** <br> Spearman's Rank Correlation <br> Mann-Whitney U Test <br> Kruskal-Wallis Test | - **Non-Parametric** <br> Chi-Square Test of Independence <br> Mann-Whitney U Test |
| **Categorical Data (Binary)** | - **Non-Parametric** <br> Chi-Square Test of Independence <br> Mann-Whitney U Test | - **Non-Parametric** <br> Chi-Square Test of Independence <br> Mann-Whitney U Test | - **Non-Parametric** <br> Chi-Square Test of Independence | - **Non-Parametric** <br> Chi-Square Test of Independence <br> Mann-Whitney U Test | - **Non-Parametric** <br> Chi-Square Test of Independence <br> Mann-Whitney U Test |

### Explanations:

1. **Continuous Data (Ratio & Interval)**
   - **Parametric Tests:** Assume the data follow a specific distribution (typically the normal distribution) and are used when those assumptions hold.
   - **Non-Parametric Tests:** Used when the data do not meet parametric assumptions, for example when the distribution is clearly non-normal or the data are ordinal.

2. **Categorical Data (Nominal)**
   - **Non-Parametric Tests:** Used to analyze frequencies and relationships between categories without assuming any specific distribution.

3. **Categorical Data (Ordinal)**
   - **Non-Parametric Tests:** Used to analyze ordered categories and rank-based relationships without assuming specific distribution properties.

4. **Categorical Data (Binary)**
   - **Non-Parametric Tests:** Focus on proportions and relationships between two categories, typically not requiring distributional assumptions.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
df = pd.read_csv("dataset/test.csv")
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [3]:
df.shape

(418, 11)

In [4]:
stats.mode(df["Fare"]).mode

7.75

### Univariate Analysis

In [8]:
class univariate_analysis:
    """ 
    Univariate analysis deals with a single variable and assesses characteristics such as the mean or proportion.
    """

    def __init__(self, df: pd.DataFrame):
        self.df = df 
        self.p_value = None 
        self.null_hypo = None 
        self.alter_hypo = None 
        self.hypothesized_mean = None 
        self.hypothesized_median = None 
        self.hypothesized_std = None
        self.sample_mean = None 
        self.sample_median = None
        self.sample_std = None

    def measure_of_stats(self, variable: str):
        """
        Objective: calculates measure of central tendency

        Parameters:

        variable (str): column for calculation

        Returns:

        dict(): mean, median, mode
        """
        mean = self.df[variable].mean()
        median = self.df[variable].median()
        mode = stats.mode(self.df[variable]).mode

        return {"mean": mean, "median": median, "mode": mode}

    def normality_check(self, variable: str):
        """ 
        Objective: Checks whether data is normally distributed or not (Shapiro-Wilk test).

        Parameters:

        variable (str): column for test

        Returns:

        dict(): 
            h_0: null hypothesis
            h_1: alternate hypothesis
            p_val: p values
            cc: conclusion on test
        """
        # Drop missing values first; NaNs would otherwise make the result NaN
        stat, p_value = stats.shapiro(self.df[variable].dropna())
        h_0 = "H0: data normally distributed"
        h_1 = "H1: data not normally distributed"
        result = ""
        if p_value < 0.05:
            result = "Reject null hypothesis"
        else:
            result = "Fail to reject null hypothesis"

        return {"h0":h_0, "h1":h_1,"p_val":p_value, "cc":result}

    def one_sample_t_test(self, variable: str, hypothesized_mean: float = None):
        """
        Objective: To assess whether the population mean differs from a specific known value or hypothesized mean.
        Typically used when the population standard deviation is unknown (often with small samples, n < 30).

        Parameters:

        variable (str): column for test
        
        hypothesized_mean (float): default is None, in which case the sample mean is used
        """
        sample = self.df[variable].dropna().values

        # Fall back to the sample mean when no value is given, as documented above.
        # The test is then trivial (statistic 0, p-value 1).
        if hypothesized_mean is None:
            hypothesized_mean = sample.mean()
        self.hypothesized_mean = hypothesized_mean

        # Perform the One-Sample t-Test
        statistic, p_value = stats.ttest_1samp(sample, self.hypothesized_mean)
        h_0 = "H0: population mean == hypothesized mean"
        h_1 = "H1: population mean != hypothesized mean"
        result = ""
        if p_value < 0.05:
            result = "Reject null hypothesis"
        else:
            result = "Fail to reject null hypothesis"

        return {"h0":h_0, "h1":h_1,"p_val":p_value, "cc":result}
    

    def one_sample_z_test(self, variable: str, hypothesized_mean: float, hypothesized_std: float):
        """
        Objective: To assess whether the population mean differs from a specific known value or hypothesized mean,
        given a known population standard deviation.
        Sample size > 30

        Parameters:

        variable (str): column for test
        
        hypothesized_mean (float): known or hypothesized population mean

        hypothesized_std (float): known population standard deviation
        """
        # population parameter
        self.hypothesized_mean = hypothesized_mean
        self.hypothesized_std = hypothesized_std

        # sample parameter
        self.sample_mean = self.df[variable].mean()
        self.sample_std = self.df[variable].std()
        sample_size = self.df[variable].count()  # non-missing observations only

        # Perform the One-Sample z-Test
        z_score = (self.sample_mean - self.hypothesized_mean) / (self.hypothesized_std / np.sqrt(sample_size))
        p_value = stats.norm.sf(np.abs(z_score)) * 2  # Two-tailed p-value

        h_0 = "H0: population mean == hypothesized mean"
        h_1 = "H1: population mean != hypothesized mean"
        result = ""
        if p_value < 0.05:
            result = "Reject null hypothesis"
        else:
            result = "Fail to reject null hypothesis"

        return {"h0":h_0, "h1":h_1,"p_val":p_value, "cc":result}

    def chi_square_goodness_of_fit_test(self, variable: str):
        """
        Objective: compare the observed frequencies of categories within the variable
        to the frequencies expected under H0. Here the expected distribution is uniform
        (every category equally likely).
        """
        h_0="H0: observed == expected"
        h_1="H1: observed != expected"
        observed = np.array(self.df[variable].value_counts())
        total_observed = np.sum(observed)
        expected = np.array([total_observed/self.df[variable].nunique()] * self.df[variable].nunique())
        # Perform chi-square goodness-of-fit test
        chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
        result = ""
        if p_value < 0.05:  # compare the unrounded p-value; rounding can flip the decision
            result = "Reject null hypothesis"
        else:
            result = "Fail to reject null hypothesis"
        return {"h0":h_0, "h1":h_1,"p_val":p_value, "cc":result}
    
    def wilcoxon_signed_rank_test(self, variable: str, hypothesized_median: float = None):
        """
        Objective: To assess whether the population median differs from a specific hypothesized median.
        Non-parametric alternative to the one-sample t-test.

        Parameters:

        variable (str): column for test
        
        hypothesized_median (float): default is None, in which case the sample median is used
        """
        sample = self.df[variable].dropna().values
        if hypothesized_median is None:
            hypothesized_median = np.median(sample)
        self.hypothesized_median = hypothesized_median

        # Perform the test on the differences from the hypothesized median
        statistic, p_value = stats.wilcoxon(sample - self.hypothesized_median)
        h_0 = "H0: population median == hypothesized median"
        h_1 = "H1: population median != hypothesized median"
        result = ""
        if p_value < 0.05:
            result = "Reject null hypothesis"
        else:
            result = "Fail to reject null hypothesis"

        return {"h0":h_0, "h1":h_1,"p_val":p_value, "cc":result}

    def kolmogorov_smirnov_test(self, variable: str):
        pass 

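    def plot_univariate_numerical(self, variable: str, perform_test: str, max_height: float = None, **test_kwargs):
        """ 
        visualization of qq plot and kde plot of distribution

        test_kwargs are forwarded to the selected test (e.g. hypothesized_mean).
        """
        tests = {
            "normality": self.normality_check,
            "one_sample_t_test": self.one_sample_t_test,
            "one_sample_z_test": self.one_sample_z_test,
            "chi_square": self.chi_square_goodness_of_fit_test,
            "wilcoxon_signed_rank_test": self.wilcoxon_signed_rank_test,
            "kolmogorov_smirnov_test": self.kolmogorov_smirnov_test,
        }
        if perform_test not in tests:
            print("No such test exists.")
            return None

        # Every test takes the column name first; extra arguments pass through
        result = tests[perform_test](variable, **test_kwargs)
        plot_title = perform_test if result is None else f"{perform_test}: {result['cc']} (p = {result['p_val']:.4f})"

        data = self.df[variable].dropna()
        fig, (ax_qq, ax_kde) = plt.subplots(1, 2, figsize=(12, 4))
        stats.probplot(data, dist="norm", plot=ax_qq)  # Q-Q plot against a normal
        sns.kdeplot(x=data, ax=ax_kde)
        if max_height is not None:
            # Assumption: max_height is meant as the upper y-limit of the KDE plot
            ax_kde.set_ylim(0, max_height)
        fig.suptitle(plot_title)
        plt.show()
        return result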

    def plot_univariate_categorical(self, variable: str, perform_test: str):
        """ 
        visualize countplot: (observed vs expected) and pie chart of distribution
        """
        pass

In [6]:
obj = univariate_analysis(df)

### Bivariate Analysis

In [10]:
class bivariate_analysis:
    """ 
    Bivariate analysis examines the relationship between two variables.
    """
    def __init__(self, df: pd.DataFrame, target: str):
        self.df = df
        self.target = target

    def independent_sample_t_test(self, variable: str):
        pass

    def chi_square_of_independence(self, variable: str):
        pass

    def anova_test(self, variable: str):
        pass

    def mann_whitney_u_test(self, variable: str):
        pass

    def kruskal_wallis_test(self, variable: str):
        pass

    def spearman_rank_correlation(self, variable: str):
        pass

    def pearson_correlation_coefficient(self, variable: str):
        pass
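
The methods above are still stubs. As one possible direction, here is a standalone sketch of how a chi-square test of independence could be written so that it returns the same result dictionary as the univariate tests. The function signature, the `alpha` parameter, and the `toy` frame are assumptions made for illustration; only the column names `Sex` and `Pclass` come from the dataset. With counts this small the chi-square approximation is unreliable, so the output is only meant to show the shape of the result.

```python
import pandas as pd
from scipy import stats

def chi_square_of_independence(df: pd.DataFrame, variable: str, target: str, alpha: float = 0.05) -> dict:
    """Chi-square test of independence between two categorical columns."""
    # Contingency table of observed counts
    observed = pd.crosstab(df[variable], df[target])
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
    result = "Reject null hypothesis" if p_value < alpha else "Fail to reject null hypothesis"
    return {
        "h0": f"H0: {variable} and {target} are independent",
        "h1": f"H1: {variable} and {target} are associated",
        "p_val": p_value,
        "cc": result,
    }

# Tiny illustrative frame using two column names from the dataset
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
})
outcome = chi_square_of_independence(toy, "Sex", "Pclass")
print(outcome)
```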