## Independent Samples T-Test vs Spearman's Rank Correlation Coefficient

Understanding when to use an independent samples t-test versus Spearman's rank correlation coefficient involves recognizing the objectives of your analysis and the nature of your data.

### Independent Samples T-Test

Use an independent samples t-test to compare the means of two independent groups.

`from scipy.stats import ttest_ind`

**When to Use:**

- **Objective:** Use an independent samples t-test when your goal is to compare the means of two independent groups to see whether they differ significantly. One example is comparing the average test scores of students from two different schools. Another is testing whether wines from France have higher average ratings than wines from Italy; here the two independent groups are French wines and Italian wines.

- **Data Requirements:**
  - You have two groups that are independent of each other, meaning the participants or entities in one group have no relation to those in the other group.
  - The dependent variable (the one you're comparing across groups) should be continuous (e.g., height, weight, test scores).
  - Ideally, the data should be normally distributed within each group, though the t-test is robust to deviations from normality with large sample sizes.
  - Equal or similar variances between the two groups are assumed, although there are variations of the t-test (like Welch's t-test) that can be used when variances are unequal.
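
The requirements above can be illustrated with a minimal sketch. The ratings are synthetic and the variable names (`french_ratings`, `italian_ratings`) are hypothetical; only `ttest_ind` and its `equal_var` parameter come from SciPy.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic ratings for two independent groups (assumed data, not real reviews).
rng = np.random.default_rng(0)
french_ratings = rng.normal(loc=90.0, scale=3.0, size=200)
italian_ratings = rng.normal(loc=88.0, scale=3.0, size=200)

# Student's t-test: assumes both groups share the same variance.
t_student, p_student = ttest_ind(french_ratings, italian_ratings)

# Welch's t-test: drops the equal-variance assumption.
t_welch, p_welch = ttest_ind(french_ratings, italian_ratings, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4g}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4g}")
```

Both calls run a two-sided test by default. For a directional question such as "French ratings are higher", recent SciPy versions accept `alternative="greater"`.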

### Spearman's Rank Correlation Coefficient

Spearman's rank correlation coefficient measures the strength and direction of the association between two ranked variables.

`from scipy.stats import spearmanr`

**When to Use:**

- **Objective:** Spearman's rank correlation coefficient is used when you want to measure the strength and direction of association between two ranked variables. It's particularly useful when your data are ordinal or when the relationship between variables is monotonic but not linear.

- **Data Requirements:**
  - Both variables should be at least ordinal scaled, meaning they can be ranked. It can also be used with continuous data that do not meet the assumptions of Pearson's correlation (e.g., non-normally distributed data or presence of outliers).
  - The relationship between variables is monotonic, meaning that as one variable increases, the other variable tends to increase or decrease consistently, but not necessarily at a constant rate.
  - It is not required for the data to follow a normal distribution, making Spearman's correlation a non-parametric measure.
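
As a rough sketch of these requirements, the example below applies `spearmanr` to two ordinal variables. The variables (`satisfaction`, `loyalty_tier`) and their values are invented for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical ordinal data: a satisfaction score (1-5) and a loyalty tier (1-4).
satisfaction = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
loyalty_tier = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]

# spearmanr ranks both variables (ties get average ranks) and correlates the ranks.
rho, p_value = spearmanr(satisfaction, loyalty_tier)

print(f"Spearman rho = {rho:.3f}, p = {p_value:.4g}")
```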

**Key Differences:**

- **Purpose:** The t-test is for comparing means across two independent groups to see if the groups are significantly different, while Spearman's correlation is for assessing how well the relationship between two variables can be described using a monotonic function.

- **Data Structure:** The t-test deals with one independent categorical variable (with two levels/groups) and one dependent continuous variable. Spearman's correlation requires two variables that can be at least ranked (ordinal, interval, or ratio).

- **Assumptions:** The t-test assumes normally distributed data (with large samples, this assumption can be relaxed) and independent samples, while Spearman's correlation does not assume normality and is suitable for monotonic relationships that are not linear.

Choosing between these two statistical methods depends on your research question and the nature of your data. If you're comparing group means, the t-test is appropriate. If you're investigating the relationship between two variables, especially if they're not linearly related or are ordinal, then Spearman's rank correlation is the better choice.

## (Pearson Correlation Coefficient + Linear Regression analysis) vs Spearman's Rank Correlation Coefficient

Choosing between Pearson's correlation coefficient and Spearman's rank correlation coefficient depends on the characteristics of your data and the nature of the relationship you're investigating. Here are the key considerations for each:

### Pearson Correlation Coefficient

Calculate the Pearson correlation coefficient to measure the linear relationship between two continuous variables.

`from scipy.stats import pearsonr`

**When to Use:**

- **Objective:** Pearson's correlation coefficient (r) is used to measure the strength and direction of the linear relationship between two continuous variables.

- **Data Requirements:**
  - Both variables should be continuous and ideally normally distributed, though Pearson's correlation can tolerate deviations from normality, especially with large sample sizes.
  - The relationship between the variables should be linear, meaning that changes in one variable are associated with proportional changes in the other variable.
  - The data should not have significant outliers, as Pearson's correlation is sensitive to outliers which can disproportionately influence the correlation coefficient.

**Ideal Conditions:**
- Use Pearson when you know or have reason to believe that the relationship between your variables is linear and your data meet the assumptions of normality and homoscedasticity (equal variances across the range of values).
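
A minimal sketch of the linear case, assuming synthetic data: `points` and `price` are generated here with a linear trend plus noise and are not taken from any real dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic, roughly linear data (assumed for illustration).
rng = np.random.default_rng(1)
points = rng.uniform(80, 100, size=300)
price = 5.0 * points - 350.0 + rng.normal(0, 10, size=300)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(points, price)

print(f"Pearson r = {r:.3f}, p = {p_value:.4g}")
```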

### Linear Regression analysis
This analysis is a useful follow-up to the Pearson correlation coefficient: it models one continuous variable as a linear function of the other, so you can quantify how much the response changes per unit change in the predictor. Like correlation, it does not by itself show which variable influences the other.

Plot: `sns.lmplot(x="Points", y="Price", data=reviews_hybrid)`

Model: `import statsmodels.api as sm`

### Spearman's Rank Correlation Coefficient

Spearman's rank correlation coefficient measures the strength and direction of the association between two ranked variables.

`from scipy.stats import spearmanr`

**When to Use:**

- **Objective:** Spearman's rank correlation coefficient is used when you want to assess the monotonic relationship between two variables. A monotonic relationship is one in which, as one variable increases, the other consistently increases or consistently decreases, but not necessarily at a constant rate.

- **Data Requirements:**
  - The variables can be ordinal, interval, or ratio scales. Spearman's correlation is particularly useful for ordinal data where ranking is possible but precise differences between ranks are not meaningful.
  - The relationship between the variables does not need to be linear, as long as it is monotonic.
  - Spearman's correlation is less sensitive to outliers than Pearson's correlation because it uses rank order rather than actual values.

**Ideal Conditions:**
- Use Spearman when your data are ordinal, not normally distributed, or when the relationship between variables is monotonic but not linear. Spearman is also a good choice when your data contain outliers or when the scale of measurement is not consistent across the range of data.

### Key Differences and Considerations

- **Linearity vs. Monotonicity:** Pearson's correlation requires a linear relationship, while Spearman's correlation only requires a monotonic relationship.
- **Data Scale and Distribution:** Pearson's correlation is best for continuous data that are normally distributed. Spearman's correlation is more flexible and can be used with ordinal data or continuous data that do not meet the assumptions required for Pearson's correlation.
- **Sensitivity to Outliers:** Pearson's correlation can be significantly affected by outliers, whereas Spearman's correlation, which relies on ranks rather than actual values, is more robust to outliers.
- **Interpretation:** Both coefficients provide values between -1 and 1, where values closer to -1 or 1 indicate a strong negative or positive relationship, respectively, and values near 0 indicate a weak or no linear/monotonic relationship. However, the nature of the relationship (linear vs. monotonic) they measure is different.

In summary, choose Pearson's correlation when your data are continuous, normally distributed, and have a linear relationship without significant outliers. Opt for Spearman's correlation when dealing with ordinal data, non-linear but monotonic relationships, non-normally distributed data, or when your data contain outliers.
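
The linearity versus monotonicity distinction can be seen in a small sketch. The data are constructed for illustration: `y` grows exponentially with `x`, so the relationship is strictly increasing but far from linear.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Strictly increasing but strongly non-linear relationship (constructed data).
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 2.0)

r, _ = pearsonr(x, y)     # measures how well a straight line fits
rho, _ = spearmanr(x, y)  # measures how well the rank orders agree

print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Because the rank order of `y` matches that of `x` exactly, Spearman's coefficient reaches its maximum, while Pearson's coefficient is pulled down by the curvature.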

## ANOVA test and Tukey's HSD

### ANOVA test

Examples of when I used it:
- Do Price Categories have a significant effect on Price?
- Does Country have a significant effect on Price?

`from scipy.stats import f_oneway`

Converting Price_Category to a numerical scale and then calculating the Pearson correlation with Price could provide some insights, but it might not be the most accurate or meaningful way to understand the relationship. This is because Price_Category is inherently categorical, and its numerical encoding might not reflect a linear relationship with Price. A more appropriate method would be to use ANOVA to compare means across categories or regression analysis with dummy variables for each category to assess their impact on price.

An ANOVA test tells you whether the categorical variable is associated with differences in the mean of the continuous variable. However, ANOVA doesn't tell you which specific categories differ from each other, only that at least one category's mean price is different. To determine that, you would need to conduct post-hoc tests, such as Tukey's HSD, which compares all pairs of groups.
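
A one-way ANOVA on the Country question might look like the sketch below. The country names and prices are invented; `f_oneway` takes one array of observations per group.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic prices for three hypothetical country groups (assumed data).
rng = np.random.default_rng(3)
italy_prices = rng.normal(30, 10, size=100)
spain_prices = rng.normal(32, 10, size=100)
france_prices = rng.normal(45, 10, size=100)

# Null hypothesis: all group means are equal.
f_stat, p_value = f_oneway(italy_prices, spain_prices, france_prices)

print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

A small p-value only says that at least one group mean differs; it does not identify which one.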

### Tukey's HSD

Example: Tukey's HSD test compares all pairs of groups to determine which Price Categories differ significantly from each other in mean Price.

`from statsmodels.stats.multicomp import pairwise_tukeyhsd`
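
A hedged sketch of the pairwise comparison, using synthetic prices grouped by hypothetical country labels. `pairwise_tukeyhsd` expects one array of values and a parallel array of group labels.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic prices with one label per observation (assumed data).
rng = np.random.default_rng(4)
prices = np.concatenate([
    rng.normal(30, 10, size=100),  # Italy
    rng.normal(31, 10, size=100),  # Spain
    rng.normal(45, 10, size=100),  # France
])
countries = np.repeat(["Italy", "Spain", "France"], 100)

# Compare every pair of groups while controlling the family-wise error rate.
result = pairwise_tukeyhsd(endog=prices, groups=countries, alpha=0.05)

# The summary lists each pair, the mean difference, and whether to reject equality.
print(result.summary())
```

Groups are reported in sorted label order, so the pairs here are France vs Italy, France vs Spain, and Italy vs Spain; `result.reject` holds the matching True/False decisions.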