
**Week-8**
# Class 15-16: Introduction to Statistics
# Statistics ++
**Details:**
* Basic Statistics - Plan to have a session by an external resource
* ANOVA Test
* Chi-Square - Plan to have a session by an external resource

# Introduction to Statistics in Python

**Statistics in Python** means using the Python language and its libraries for data analysis, statistical calculations, hypothesis testing, and data visualization. Python's extensive library ecosystem makes it a popular choice among data scientists and analysts. The pieces fit together as follows:

**Python Language:** Python provides a versatile and easy-to-learn programming language that serves as the foundation for conducting statistical operations and data manipulation.

**Libraries and Packages:** Python's strength in statistics lies in its rich collection of libraries and packages, including but not limited to:

**NumPy:** For numerical operations and working with arrays and matrices.

**pandas:** For data manipulation and analysis using DataFrame and Series structures.

**SciPy:** For advanced scientific and statistical computations.

**statsmodels:** For statistical modeling and hypothesis testing.

**matplotlib and Seaborn:** For data visualization and plotting.

**scikit-learn:** For machine learning and predictive modeling, which often involves statistical concepts.

**Community and Resources:** Python's large and active community provides extensive documentation, tutorials, and resources related to statistics, which makes it accessible to both beginners and experienced data analysts.

**Flexibility:** Python allows you to integrate statistical analysis with other data science tasks, such as data preprocessing, machine learning, and model deployment.

By leveraging Python and its libraries, data professionals can conduct a wide range of statistical tasks, from basic descriptive statistics to advanced hypothesis testing, regression analysis, and more. Python's versatility and the availability of libraries make it a valuable tool for those working with data and statistics in various domains, including research, business, healthcare, and social sciences.






# ANOVA

**Groups:** ANOVA compares the means of multiple groups. The groups are the levels of a factor (the independent variable) and are often called "treatments." For example, you might compare the effect of three different teaching methods on student test scores, where each teaching method is one group.

**Null Hypothesis (H0):** The null hypothesis in ANOVA states that all group (population) means are equal.

**Alternative Hypothesis (Ha):** The alternative hypothesis states that at least one group mean differs from the others. It does not say which groups differ.

**Variation:** ANOVA decomposes the total variation in the data into two components: variation between groups (explained variation) and variation within groups (unexplained or residual variation).
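The decomposition can be checked by hand. The sketch below is a minimal illustration, assuming the same three small groups used in the one-way example in these notes; the variable names (`ss_between`, `ss_within`, and so on) are chosen here for readability and are not part of any library. It computes the between-group and within-group sums of squares with NumPy, confirms that they add up to the total, and forms the F-statistic as the ratio of the two mean squares.

```python
import numpy as np
from scipy import stats

# Illustrative data: the three groups from the one-way example in these notes.
groups = [
    np.array([22, 24, 28, 30, 32]),
    np.array([18, 20, 24, 26, 28]),
    np.array([15, 16, 18, 20, 22]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group (explained) variation: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group (residual) variation: how far each value is from its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Total variation around the grand mean.
ss_total = ((all_values - grand_mean) ** 2).sum()

# Degrees of freedom: (number of groups - 1) and (observations - number of groups).
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the between-group mean square to the within-group mean square.
f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)

print("SS between:", ss_between)
print("SS within: ", ss_within)
print("SS total:  ", ss_total)
print("F-statistic:", f_stat)
print("p-value:", p_value)
```

The F-statistic and p-value printed here should agree with what `stats.f_oneway` reports for the same data.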

**Types of ANOVA:**

There are different types of ANOVA, including:

**One-Way ANOVA**: Used when you have one independent variable, typically with three or more groups (with exactly two groups it is equivalent to an equal-variance two-sample t-test). It assesses whether there are any statistically significant differences among the group means.

**Two-Way ANOVA:** Used when you have two independent variables and want to determine how they interact and affect the dependent variable. It assesses main effects and interaction effects.

**Repeated Measures ANOVA**: Used when you have repeated measurements on the same subjects or entities (e.g., longitudinal data) and want to assess changes over time or under different conditions.

**Steps for Performing ANOVA:**

**Formulate Hypotheses**: Define your null and alternative hypotheses based on your research question.

**Collect Data:** Gather data from your groups or treatments.

**Data Preparation:** Ensure your data meet the assumptions of ANOVA, such as normality and homogeneity of variances. Transformations may be needed.

**Perform ANOVA:** Use statistical software (e.g., Python with SciPy or statsmodels, or R) to perform the ANOVA test. The software will calculate the F-statistic and p-value.

**Interpret Results:** Evaluate the p-value. If p < alpha (usually 0.05), you reject the null hypothesis, indicating that there are significant differences among the groups. If p ≥ alpha, you fail to reject the null hypothesis.

**Post-Hoc Tests:** If ANOVA indicates significant differences, post-hoc tests (e.g., Tukey's HSD, Bonferroni) can be conducted to identify which specific group pairs differ significantly.
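The Data Preparation step can be sketched in code. The example below is one possible way to check the two assumptions named above, not the only one: it assumes the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality within each group and Levene's test (`scipy.stats.levene`) for homogeneity of variances, applied to the same illustrative groups used in the one-way example in these notes.

```python
from scipy import stats

# Illustrative data: the three groups from the one-way example in these notes.
samples = {
    "group1": [22, 24, 28, 30, 32],
    "group2": [18, 20, 24, 26, 28],
    "group3": [15, 16, 18, 20, 22],
}
alpha = 0.05

# Shapiro-Wilk test per group. H0: the sample comes from a normal distribution.
normality_p = {}
for name, values in samples.items():
    w_stat, p_normal = stats.shapiro(values)
    normality_p[name] = p_normal
    print(f"{name}: Shapiro-Wilk W={w_stat:.3f}, p={p_normal:.3f}")

# Levene's test across groups. H0: all groups have equal variances.
levene_stat, p_levene = stats.levene(*samples.values())
print(f"Levene statistic={levene_stat:.3f}, p={p_levene:.3f}")

if p_levene < alpha:
    print("Variances appear unequal; consider a transformation or Welch's ANOVA.")
else:
    print("No evidence against equal variances.")
```

With only five observations per group these tests have little power, so a non-significant result is weak support for the assumptions rather than proof of them.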

The following code performs a one-way ANOVA:


In [None]:
import scipy.stats as stats

# Sample data for three groups (replace this with your own data)
group1 = [22, 24, 28, 30, 32]
group2 = [18, 20, 24, 26, 28]
group3 = [15, 16, 18, 20, 22]

# Perform one-way ANOVA
statistic, p_value = stats.f_oneway(group1, group2, group3)

# Display the results
print("ANOVA F-statistic:", statistic)
print("p-value:", p_value)

# Determine if the p-value is statistically significant (common significance level is 0.05)
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis; there are significant differences among group means.")
else:
    print("Fail to reject the null hypothesis; there is not enough evidence of differences among group means.")


ANOVA F-statistic: 7.15962441314554
p-value: 0.00898351994543857
Reject the null hypothesis; there are significant differences among group means.
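Because the test rejects the null hypothesis, a post-hoc test can show which pairs of groups differ. The sketch below assumes Tukey's HSD via `scipy.stats.tukey_hsd`, which is available in recent SciPy releases (1.8 or later, as far as I know); `statsmodels` offers a comparable function if your SciPy version is older. The data are the same three groups as above.

```python
from itertools import combinations
from scipy import stats

# Same illustrative groups as in the one-way ANOVA above.
group1 = [22, 24, 28, 30, 32]
group2 = [18, 20, 24, 26, 28]
group3 = [15, 16, 18, 20, 22]
labels = ["group1", "group2", "group3"]
alpha = 0.05

# Tukey's HSD compares every pair of groups while controlling the family-wise error rate.
result = stats.tukey_hsd(group1, group2, group3)

# result.pvalue is a square matrix; entry [i, j] is the p-value for group i vs group j.
for i, j in combinations(range(len(labels)), 2):
    p_pair = result.pvalue[i, j]
    verdict = "differ" if p_pair < alpha else "no significant difference"
    print(f"{labels[i]} vs {labels[j]}: p={p_pair:.4f} ({verdict})")
```

For these data only the pair with the largest gap in means (group1 and group3) falls below the 0.05 threshold.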


The following code performs a two-way ANOVA:


In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace with your own data)
data = {
    'A': ['A1', 'A2', 'A1', 'A2', 'A1', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1'],
    'Value': [23, 30, 32, 27, 18, 25],
}

# Create a DataFrame
df = pd.DataFrame(data)

# Fit a two-way ANOVA model
formula = 'Value ~ C(A) + C(B) + C(A):C(B)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the ANOVA results
print(anova_table)


              sum_sq   df         F    PR(>F)
C(A)       13.500000  1.0  1.080000  0.407843
C(B)       40.333333  1.0  3.226667  0.214286
C(A):C(B)  48.000000  1.0  3.840000  0.189115
Residual   25.000000  2.0       NaN       NaN


# Chi-Square

Chi-Square (χ²) analysis is a statistical technique for determining whether two categorical variables in a contingency table are related or independent. Using libraries such as scipy.stats and pandas, you can run a Chi-Square test of independence in Python. A step-by-step walkthrough follows, starting with the imports:


In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency


Start by building a contingency table (also called a cross-tabulation or crosstab), which summarizes the relationship between the two categorical variables you want to study. Suppose you have poll data recording each participant's favorite color and beverage:

In [None]:
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green'],
    'Beverage': ['Tea', 'Coffee', 'Tea', 'Coffee', 'Tea', 'Coffee']
}

df = pd.DataFrame(data)
contingency_table = pd.crosstab(df['Color'], df['Beverage'])
print(contingency_table)


Beverage  Coffee  Tea
Color                
Blue           1    1
Green          1    1
Red            1    1


Use the chi2_contingency function from scipy.stats to perform the Chi-Square test:

In [None]:
chi2, p, _, _ = chi2_contingency(contingency_table)
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")


Chi-Square Statistic: 0.0
P-value: 1.0


If the p-value is less than a significance level (e.g., 0.05), you can reject the null hypothesis, indicating that there is a significant association between the two categorical variables.

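If the p-value is greater than or equal to the significance level, you fail to reject the null hypothesis: the data do not provide enough evidence of an association between the variables.

In the example above, the p-value is 1.0, so you fail to reject the null hypothesis: the table gives no evidence of a relationship between color preference and beverage choice. With only six responses, every expected count is 1, well below the usual rule of thumb of at least 5 per cell, so the example illustrates the mechanics rather than a reliable test.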
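To see what a rejection looks like, the sketch below uses a larger table of hypothetical counts (invented for illustration, not real survey data) for the same color and beverage categories. It also recomputes the expected counts and the statistic by hand, using expected = (row total × column total) / grand total and χ² = Σ (observed − expected)² / expected, to show what `chi2_contingency` calculates.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, invented for illustration only.
observed = pd.DataFrame(
    {"Coffee": [30, 15, 20], "Tea": [10, 25, 20]},
    index=["Red", "Blue", "Green"],
)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the table of expected counts under independence.
chi2, p, dof, expected = chi2_contingency(observed)

# Manual check of the expected counts and the statistic.
obs = observed.to_numpy()
row_totals = obs.sum(axis=1, keepdims=True)
col_totals = obs.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / obs.sum()
chi2_manual = ((obs - expected_manual) ** 2 / expected_manual).sum()

print(f"Chi-Square Statistic: {chi2:.4f} (manual: {chi2_manual:.4f})")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p:.4f}")

alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis; color and beverage preference appear associated.")
else:
    print("Fail to reject the null hypothesis; not enough evidence of an association.")
```

The degrees of freedom equal (rows − 1) × (columns − 1), which is 2 for this 3×2 table. For 2×2 tables, note that `chi2_contingency` applies Yates' continuity correction by default, so a manual calculation without the correction will not match.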

Chi-Square analysis is useful for exploring relationships between categorical variables, such as gender and product preference, survey responses, and more. It helps you determine if there is a statistically significant association between these variables.