# Inferential Statistics in Python
# Continued

## In inferential statistics, we seek the following
1. Connection
2. Correlation
3. Causation
4. Prediction 

# Step 4 - Data Type

Knowing the type of data is crucial for accurate statistical analysis.
In Python, you can use the built-in function `type()` to determine the data type of a variable.

Here's an example:

```python
# Example 1: Integer
num = 42
print(type(num))  # Output: <class 'int'>

# Example 2: Float
pi = 3.14159
print(type(pi))  # Output: <class 'float'>

# Example 3: String
name = "John Doe"
print(type(name))  # Output: <class 'str'>

# Example 4: List
fruits = ["apple", "banana", "cherry"]
print(type(fruits))  # Output: <class 'list'>

# Example 5: Tuple
coordinates = (10, 20)
print(type(coordinates))  # Output: <class 'tuple'>

# Example 6: Dictionary
person = {"name": "Alice", "age": 30}
print(type(person))  # Output: <class 'dict'>

# Example 7: Set
numbers = {1, 2, 3, 4, 5}
print(type(numbers))  # Output: <class 'set'>

# Example 8: Boolean
is_active = True
print(type(is_active))  # Output: <class 'bool'>
```

# Step 5 - Choosing the right statistical test

There are three main families of statistical tests to choose from:

1. Chi-sq test
2. T-test/ANOVA
3. Correlation/Regression

### 1. Chi-sq test

The chi-sq test is used to determine if there is a significant difference between the observed frequencies and the frequencies expected under the null hypothesis. It applies to categorical data; here we use it to compare two categorical variables. There are two types of chi-sq tests:
1. Chi-sq test of independence
2. Chi-sq test of homogeneity

The chi-sq test of independence is used to determine if there is a significant association between two categorical variables measured on a single sample.\
The chi-sq test of homogeneity is used to determine if the distribution of a categorical variable differs significantly across different groups.

The chi-squared statistic can be calculated with the `scipy.stats.chi2_contingency` function in Python.\
NumPy has no dedicated chi-squared test function; `np.corrcoef` computes Pearson correlation coefficients and is not a chi-squared test.
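A minimal sketch of a chi-sq test of independence with `scipy.stats.chi2_contingency` is shown here. The contingency table is made-up illustration data (an assumption, not from any real study): rows are two hypothetical groups and columns are two hypothetical answers.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts:
# rows = group (A, B), columns = answer (yes, no)
observed = np.array([[30, 10],
                     [20, 40]])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the frequencies expected if the two variables were independent
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
print("Expected frequencies under independence:")
print(expected)
```

A small p-value (commonly below 0.05) suggests that the two categorical variables are associated. Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default.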

### 2. T-test

The t-test is used to determine if there is a significant difference between means. This is used when we want to compare a continuous variable across the levels of a categorical variable (at most two groups). It includes both one-sample and two-sample t-tests.
   1. One-sample t-test is used to determine if there is a significant difference between the mean of a single sample and a known population mean.
   2. Two-sample t-test is used to determine if there is a significant difference between the means of two groups.
      1. Paired t-test is used to determine if there is a significant difference between the means of two related groups.
      2. Independent t-test is used to determine if there is a significant difference between the means of two unrelated groups.
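The three variants can be sketched with SciPy's `ttest_1samp`, `ttest_rel`, and `ttest_ind`. All numbers here are made-up illustration data, and the reference mean of 70 is an assumed value.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (illustration only)
before = np.array([72, 75, 68, 80, 77, 74, 69, 73])       # same subjects, first measurement
after = np.array([75, 78, 70, 83, 79, 78, 72, 75])        # same subjects, second measurement
other_group = np.array([65, 70, 66, 72, 68, 64, 71, 67])  # unrelated subjects

# One-sample t-test: sample mean vs. an assumed known population mean of 70
t_one, p_one = stats.ttest_1samp(before, popmean=70)

# Paired t-test: two related measurements on the same subjects
t_paired, p_paired = stats.ttest_rel(before, after)

# Independent t-test: two unrelated groups (equal variances assumed by default)
t_ind, p_ind = stats.ttest_ind(before, other_group)

print(f"One-sample:  t = {t_one:.3f}, p = {p_one:.4f}")
print(f"Paired:      t = {t_paired:.3f}, p = {p_paired:.4f}")
print(f"Independent: t = {t_ind:.3f}, p = {p_ind:.4f}")
```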

The difference between Student's t-test and Welch's t-test is that Student's t-test assumes that the variances of the two groups are equal, while Welch's t-test does not. Welch's t-test is therefore more robust to violations of the equal-variance assumption, making it the preferred choice when that assumption is not met. Here, the variance of a group measures how much its data points differ from the group mean; it is calculated as the average of the squared differences from the mean (the sample variance divides by n - 1 instead of n).
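As a rough sketch, the variance of each group can be computed with NumPy, and both tests are available through the `equal_var` parameter of `scipy.stats.ttest_ind`. The two groups are made-up data chosen so that their spreads clearly differ.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with similar means but very different spreads
group_a = np.array([20.1, 19.8, 20.3, 20.0, 19.9, 20.2])
group_b = np.array([18.0, 25.5, 21.0, 15.2, 27.3, 22.4, 19.1, 24.6])

# Sample variance: sum of squared deviations from the mean divided by n - 1
var_a = group_a.var(ddof=1)
var_b = group_b.var(ddof=1)
print(f"Variance A = {var_a:.3f}, Variance B = {var_b:.3f}")

# Student's t-test: assumes equal variances
t_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4f}")
```

When the group variances and sample sizes differ, the two tests give different statistics and p-values, which is why the choice matters.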



### 3. ANOVA

ANOVA (Analysis of Variance) is used to compare the means of three or more groups. This is used when we want to compare a continuous variable across different categories. The main types are one-way ANOVA, two-way ANOVA, repeated measures ANOVA, and mixed ANOVA.
   1. One-way ANOVA is used to determine if there is a significant difference between the means of three or more groups defined by a single categorical variable.
   2. Two-way ANOVA is used to determine the effect of two independent categorical variables on a continuous dependent variable, including any significant interaction between them.
   3. Repeated measures ANOVA is used when the same subjects are measured multiple times under different conditions. 
   4. Mixed ANOVA is used when there are both within-subjects and between-subjects factors, allowing for the analysis of data that involves multiple groups and repeated measures.

A one-way ANOVA can be run in Python with the `scipy.stats.f_oneway` function. NumPy has no ANOVA function; `np.corrcoef` computes correlation coefficients, not an ANOVA.
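A one-way ANOVA with `scipy.stats.f_oneway` might look like the following sketch. The three groups of scores are made-up illustration data.

```python
from scipy.stats import f_oneway

# Hypothetical scores for three independent groups
group_1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group_2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group_3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

# Null hypothesis: all three group means are equal
f_stat, p_value = f_oneway(group_1, group_2, group_3)

print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

A small p-value only says that at least one group mean differs; it does not say which one.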
Post-hoc tests are often conducted after a significant ANOVA to determine which specific groups differ from each other.
Post-hoc tests can be performed with the `statsmodels` library in Python, specifically with the `pairwise_tukeyhsd` function for Tukey's HSD multiple comparisons.
Common post-hoc procedures include Tukey's HSD, Bonferroni-corrected pairwise comparisons, and Scheffé's test.
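A minimal sketch of Tukey's HSD with `pairwise_tukeyhsd` from `statsmodels.stats.multicomp` is given here. The scores and group labels are made-up illustration data, and the 0.05 significance level is an assumed choice.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores: 10 observations for each of three groups
scores = np.array([85, 86, 88, 75, 78, 94, 98, 79, 71, 80,
                   91, 92, 93, 85, 87, 84, 82, 88, 95, 96,
                   79, 78, 88, 94, 92, 85, 83, 85, 82, 81])
groups = np.repeat(["group_1", "group_2", "group_3"], 10)

# Compare every pair of groups while controlling the family-wise error rate
result = pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05)

# One row per pair: mean difference, adjusted p-value, confidence interval, reject flag
print(result.summary())
```

The `reject` column of the summary indicates, for each pair, whether the difference in means is significant at the chosen level.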

### 4. Correlation
Correlation is used to assess the strength and direction of the relationship between two variables. This is used when we want to compare continuous variables.
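As an illustration, the Pearson correlation coefficient can be computed with `np.corrcoef`, while `scipy.stats.pearsonr` also returns a p-value. The variables `hours` and `score` are hypothetical example data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: hours studied and exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 73, 79, 85])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(hours, score)[0, 1]

# pearsonr returns r together with a p-value for the null hypothesis r = 0
r_scipy, p_value = pearsonr(hours, score)

print(f"r (NumPy) = {r_numpy:.4f}")
print(f"r (SciPy) = {r_scipy:.4f}, p = {p_value:.6f}")
```

Values of r close to +1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one.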

### 5. Regression
This is a statistical method used to model the relationship between a dependent variable and one or more independent variables, helping to understand how the latter influences the former. This is used when we want to predict outcomes based on this relationship.
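A simple linear regression with one independent variable can be sketched with `scipy.stats.linregress`. The data are hypothetical, and the prediction at 9 hours is only an illustration of using the fitted line.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data: hours studied (independent) and exam score (dependent)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 73, 79, 85])

# Fit score = intercept + slope * hours by ordinary least squares
fit = linregress(hours, score)

print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}")
print(f"R-squared = {fit.rvalue ** 2:.4f}, p = {fit.pvalue:.6f}")

# Predict the score for a new, assumed value of 9 hours
new_hours = 9
predicted = fit.intercept + fit.slope * new_hours
print(f"Predicted score for {new_hours} hours: {predicted:.1f}")
```

For several independent variables, a dedicated modelling library is usually a better fit than `linregress`, which handles a single predictor.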



# A simple guide to selecting and applying a test according to the data

Normality ---> Parametric tests ---> t-test, ANOVA, etc.\
Non-normality ---> Non-parametric tests ---> Mann-Whitney U test, Kruskal-Wallis test, etc.
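The guide above can be sketched for the two-group case as follows. This is one possible approach under stated assumptions: normality is checked with the Shapiro-Wilk test (`scipy.stats.shapiro`), the 0.05 threshold is a conventional choice, Welch's variant is used for the parametric branch, and the helper name `compare_two_groups` is hypothetical.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Choose a two-sample test based on a Shapiro-Wilk normality check.

    Hypothetical helper for illustration, not a definitive procedure.
    """
    _, p_norm_a = stats.shapiro(a)
    _, p_norm_b = stats.shapiro(b)

    if p_norm_a > alpha and p_norm_b > alpha:
        # No evidence against normality in either group: parametric test
        name = "Welch's t-test"
        stat, p = stats.ttest_ind(a, b, equal_var=False)
    else:
        # At least one group looks non-normal: non-parametric test
        name = "Mann-Whitney U test"
        stat, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    return name, stat, p

# Simulated example data (seeded so the run is repeatable)
rng = np.random.default_rng(42)
normal_a = rng.normal(loc=10, scale=2, size=40)
normal_b = rng.normal(loc=11, scale=2, size=40)
skewed_a = rng.exponential(scale=2.0, size=40)
skewed_b = rng.exponential(scale=3.0, size=40)

for label, (a, b) in {"normal": (normal_a, normal_b),
                      "skewed": (skewed_a, skewed_b)}.items():
    name, stat, p = compare_two_groups(a, b)
    print(f"{label}: {name}, statistic = {stat:.3f}, p = {p:.4f}")
```

For three or more groups the same idea applies with `scipy.stats.f_oneway` on the parametric side and `scipy.stats.kruskal` on the non-parametric side. A normality test is only one piece of evidence; plots such as histograms or Q-Q plots are commonly checked as well.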