# Statistical Analysis in Python

This notebook covers essential concepts and techniques for statistical analysis using Python. You'll learn how to describe, explore, and infer from data using libraries like NumPy, pandas, SciPy, and statsmodels.

## Topics Covered:
1. Introduction to Statistical Analysis
2. Descriptive Statistics
3. Probability Distributions
4. Inferential Statistics (Hypothesis Testing)
5. Correlation and Covariance
6. Regression Analysis
7. ANOVA and Group Comparisons
8. Real-Life Use Cases and Best Practices
9. Effect Size and Confidence Intervals
10. Non-Parametric Tests
11. Categorical Data Analysis (Chi-Square Test)
12. Practical Exercises

## 1. Introduction to Statistical Analysis

Statistical analysis helps us summarize, interpret, and draw conclusions from data. It is foundational for data science, business analytics, scientific research, and more.

**Real-life use case:** A healthcare analyst uses statistical analysis to determine if a new drug is more effective than the current standard treatment.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. Common measures include mean, median, mode, variance, and standard deviation.

**Real-life use case:** A school administrator summarizes students' test scores to understand overall performance.

In [None]:
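# Create a sample dataset (seeded so the results are reproducible)
np.random.seed(42)
data = np.random.normal(loc=70, scale=10, size=100)
df = pd.DataFrame({'score': data})

# Calculate descriptive statistics
mean = df['score'].mean()
median = df['score'].median()
# Continuous values are almost all unique, so take the mode of the rounded scores
mode = df['score'].round().mode()[0]
std = df['score'].std()  # sample standard deviation (ddof=1)
var = df['score'].var()  # sample variance (ddof=1)

print(f'Mean: {mean:.2f}, Median: {median:.2f}, Mode: {mode:.0f}, Std: {std:.2f}, Var: {var:.2f}')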

# Visualize the distribution
plt.figure(figsize=(8, 4))
sns.histplot(df['score'], kde=True)
plt.title('Distribution of Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()
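
pandas `.var()` and `.std()` divide by n - 1 (the sample estimate), while NumPy's `np.var` and `np.std` divide by n unless `ddof=1` is passed. The sketch below works through the difference by hand. The five scores are made up for illustration so the arithmetic is easy to follow; they are not the simulated data above.

```python
import numpy as np
import pandas as pd

# Hypothetical scores chosen so the arithmetic is easy to check
scores = np.array([62.0, 70.0, 71.0, 75.0, 82.0])
n = scores.size

mean = scores.sum() / n                    # 360 / 5 = 72.0
squared_dev = (scores - mean) ** 2         # squared deviations from the mean

pop_var = squared_dev.sum() / n            # population variance: divide by n
sample_var = squared_dev.sum() / (n - 1)   # sample variance: divide by n - 1

print(f'Population variance: {pop_var:.2f} (np.var gives {np.var(scores):.2f})')
print(f'Sample variance: {sample_var:.2f} (pandas gives {pd.Series(scores).var():.2f})')
```

For a sample drawn from a larger population, the n - 1 version is usually the one to report.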

## 3. Probability Distributions

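A probability distribution describes how likely each value, or range of values, of a random variable is. Common distributions include the normal (continuous measurements), the binomial (number of successes in a fixed number of trials), and the Poisson (event counts in a fixed interval).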

**Real-life use case:** A manufacturer models the probability of defects in a batch using the binomial distribution.

In [None]:
# Normal distribution example
x = np.linspace(-4, 4, 100)
y = stats.norm.pdf(x, loc=0, scale=1)
plt.plot(x, y, label='Normal Distribution')
plt.title('Normal Distribution (mean=0, std=1)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

# Binomial distribution example
n, p = 10, 0.3
binom_rv = stats.binom(n, p)
x = np.arange(0, n+1)
plt.bar(x, binom_rv.pmf(x))
plt.title('Binomial Distribution (n=10, p=0.3)')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.show()
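
The manufacturing use case can be sketched with the same `stats.binom` object. The batch size and defect rate below are hypothetical values picked for illustration, not figures from a real process.

```python
from scipy import stats

# Hypothetical batch: 50 items, each defective with probability 0.02
n_items, p_defect = 50, 0.02
batch = stats.binom(n_items, p_defect)

p_none = batch.pmf(0)            # P(exactly 0 defects)
p_at_most_2 = batch.cdf(2)       # P(X <= 2)
p_more_than_2 = batch.sf(2)      # P(X > 2), the survival function 1 - cdf(2)
expected_defects = batch.mean()  # n * p

print(f'P(no defects): {p_none:.3f}')
print(f'P(at most 2 defects): {p_at_most_2:.3f}')
print(f'P(more than 2 defects): {p_more_than_2:.3f}')
print(f'Expected defects per batch: {expected_defects:.1f}')
```

`pmf` answers "exactly k" questions, while `cdf` and `sf` answer "at most k" and "more than k".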

## 4. Inferential Statistics (Hypothesis Testing)

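Inferential statistics let us draw conclusions about a population from a sample. Hypothesis testing assesses whether an observed difference is larger than sampling variation alone would plausibly produce. By convention, a result with a p-value below a chosen threshold (commonly 0.05) is called statistically significant.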

**Real-life use case:** A marketing team tests if a new ad campaign increases sales compared to the previous campaign.

In [None]:
# Example: One-sample t-test
# H0: the population mean equals 70; H1: it differs from 70 (two-sided test)
sample = np.random.normal(loc=72, scale=10, size=30)
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)
print(f'T-statistic: {t_stat:.2f}, p-value: {p_value:.4f}')
if p_value < 0.05:
    print('Result: Statistically significant difference from mean 70')
else:
    print('Result: No significant difference from mean 70')
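
To see what the one-sample t-test computes, the statistic can be rebuilt by hand: the distance between the sample mean and the hypothesized mean, measured in standard errors. This is a minimal sketch; the seed and the `rng` generator are arbitrary choices for reproducibility.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
sample = rng.normal(loc=72, scale=10, size=30)
popmean = 70

n = sample.size
std_err = sample.std(ddof=1) / np.sqrt(n)            # standard error of the mean
t_manual = (sample.mean() - popmean) / std_err       # t statistic
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-sided p-value

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=popmean)
print(f'Manual: t = {t_manual:.4f}, p = {p_manual:.4f}')
print(f'SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}')
```

The two lines should agree, which confirms that the p-value is the two-tailed area of a t distribution with n - 1 degrees of freedom.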

## 5. Correlation and Covariance

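Correlation measures the strength and direction of a linear relationship between two variables on a unit-free scale from -1 to 1. Covariance measures how two variables vary together, but its magnitude depends on the units of both variables, so it is hard to compare across datasets.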

**Real-life use case:** An economist studies the correlation between education level and income.

In [None]:
# Create two related variables
np.random.seed(0)
x = np.random.normal(50, 10, 100)
y = 2 * x + np.random.normal(0, 10, 100)

df_corr = pd.DataFrame({'x': x, 'y': y})
corr = df_corr.corr().iloc[0, 1]
cov = df_corr.cov().iloc[0, 1]
print(f'Correlation: {corr:.2f}, Covariance: {cov:.2f}')

plt.figure(figsize=(6, 4))
sns.scatterplot(x='x', y='y', data=df_corr)
plt.title('Scatter Plot of x and y')
plt.show()
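
Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks that relationship and shows that rescaling one variable changes the covariance but not the correlation. The data are simulated in the same way as above, with `np.random.default_rng` as an assumed choice of generator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(50, 10, 100)
y = 2 * x + rng.normal(0, 10, 100)

# Correlation = covariance / (std_x * std_y), using sample (ddof=1) estimates
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
r_scipy, p_value = stats.pearsonr(x, y)

# Rescale x (for example, a change of units) and recompute both measures
x_scaled = x * 100
cov_scaled = np.cov(x_scaled, y, ddof=1)[0, 1]
r_scaled, _ = stats.pearsonr(x_scaled, y)

print(f'r (manual): {r_manual:.4f}, r (pearsonr): {r_scipy:.4f}')
print(f'Covariance: {cov_xy:.2f} -> {cov_scaled:.2f} after rescaling x')
print(f'Correlation after rescaling x: {r_scaled:.4f}')
```

`stats.pearsonr` also returns a p-value for the null hypothesis of zero correlation, which `DataFrame.corr()` does not.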

## 6. Regression Analysis

Regression analysis models the relationship between a dependent variable and one or more independent variables. Linear regression is the most common type.

**Real-life use case:** A real estate agent predicts house prices based on features like size, location, and number of bedrooms.

In [None]:
# Simple linear regression with statsmodels
X = sm.add_constant(df_corr['x'])
model = sm.OLS(df_corr['y'], X).fit()
print(model.summary())

# Plot regression line
plt.figure(figsize=(6, 4))
sns.scatterplot(x='x', y='y', data=df_corr, label='Data')
plt.plot(df_corr['x'], model.predict(X), color='red', label='Regression Line')
plt.title('Linear Regression Fit')
plt.legend()
plt.show()

## 7. ANOVA and Group Comparisons

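ANOVA (Analysis of Variance) tests whether the means of three or more groups differ by comparing the variation between groups with the variation within groups. A significant result indicates that at least one group mean differs, but not which one.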

**Real-life use case:** An HR analyst compares average salaries across different departments.

In [None]:
# Create sample data for three groups
group1 = np.random.normal(70, 5, 30)
group2 = np.random.normal(75, 5, 30)
group3 = np.random.normal(80, 5, 30)

f_stat, p_val = stats.f_oneway(group1, group2, group3)
print(f'ANOVA F-statistic: {f_stat:.2f}, p-value: {p_val:.4f}')
if p_val < 0.05:
    print('Result: At least one group mean is significantly different')
else:
    print('Result: No significant difference between group means')
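
The F-statistic is the ratio of between-group variance to within-group variance. The sketch below computes it from sums of squares and compares the result with `stats.f_oneway`. The groups are simulated with an arbitrary seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed
groups = [rng.normal(70, 5, 30), rng.normal(75, 5, 30), rng.normal(80, 5, 30)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Between-group sum of squares: group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between, df_within = k - 1, n_total - k
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f'Manual: F = {f_manual:.3f}, p = {p_manual:.3g}')
print(f'SciPy:  F = {f_scipy:.3f}, p = {p_scipy:.3g}')
```

A large F means the group means are spread out more than the noise within each group would explain.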

## 8. Real-Life Use Cases and Best Practices

- **Healthcare:** Clinical trials use hypothesis testing to validate new treatments.
- **Business:** A/B testing for website optimization uses t-tests and ANOVA.
- **Finance:** Portfolio managers use correlation and regression to manage risk.
- **Social Science:** Surveys use descriptive and inferential statistics to draw conclusions about populations.

### Best Practices:
- Always visualize your data before and after analysis.
- Check assumptions (normality, independence, etc.) before applying statistical tests.
- Report effect sizes and confidence intervals, not just p-values.
- Use appropriate statistical methods for your data type and research question.
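
As an illustration of the assumption checks mentioned above, the sketch below applies the Shapiro-Wilk test for normality and Levene's test for equal variances to two simulated groups, then runs Welch's t-test, which does not assume equal variances. The group names and data are hypothetical, and these tests are one reasonable choice among several.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
samples = {
    'group_a': rng.normal(70, 5, 30),
    'group_b': rng.normal(75, 5, 30),
}

# Shapiro-Wilk: a small p-value suggests the data are not normally distributed
normality_p = {}
for name, values in samples.items():
    _, normality_p[name] = stats.shapiro(values)
    print(f'{name}: Shapiro-Wilk p = {normality_p[name]:.3f}')

# Levene: a small p-value suggests the group variances differ
_, p_equal_var = stats.levene(samples['group_a'], samples['group_b'])
print(f'Levene p = {p_equal_var:.3f}')

# Welch's t-test drops the equal-variance assumption
t_stat, p_val = stats.ttest_ind(samples['group_a'], samples['group_b'], equal_var=False)
print(f'Welch t = {t_stat:.2f}, p = {p_val:.4f}')
```

Formal tests like these complement, rather than replace, a histogram or Q-Q plot of each group.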

## 9. Effect Size and Confidence Intervals

Effect size quantifies the magnitude of a difference, while confidence intervals provide a range of plausible values for a parameter.

**Real-life use case:** Reporting the effect of a new teaching method on student scores, not just whether the difference is significant.

In [None]:
# Calculate Cohen's d for effect size
group_a = np.random.normal(70, 10, 50)
group_b = np.random.normal(75, 10, 50)
mean_diff = group_a.mean() - group_b.mean()
# NumPy's std() defaults to ddof=0; use ddof=1 for the sample standard deviation.
# Averaging the two variances is valid here because the groups are the same size.
pooled_std = np.sqrt((group_a.std(ddof=1) ** 2 + group_b.std(ddof=1) ** 2) / 2)
cohens_d = mean_diff / pooled_std
print(f"Cohen's d: {cohens_d:.2f}")

# 95% confidence interval for the mean of group_a (t distribution, n - 1 degrees of freedom)
conf_int = stats.t.interval(0.95, len(group_a)-1, loc=group_a.mean(), scale=stats.sem(group_a))
print(f'95% Confidence interval for group_a mean: ({conf_int[0]:.2f}, {conf_int[1]:.2f})')
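
When the two groups differ in size, the pooled standard deviation is usually weighted by each group's degrees of freedom. The sketch below wraps that version in a helper and also rebuilds the confidence interval by hand as mean ± t critical value × standard error. The function name `cohens_d` and the small arrays are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d with a pooled SD weighted by degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Means are 4 and 3, both sample variances are 4, so d = 1 / 2 = 0.5
d_example = cohens_d([2, 4, 6], [1, 3, 5])
print(f"Cohen's d: {d_example:.2f}")

# 95% confidence interval for a mean, built by hand
sample = np.array([62.0, 70.0, 71.0, 75.0, 82.0])  # made-up scores
n = sample.size
sem = sample.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value
ci_manual = (sample.mean() - t_crit * sem, sample.mean() + t_crit * sem)
ci_scipy = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=stats.sem(sample))
print(f'Manual CI: ({ci_manual[0]:.2f}, {ci_manual[1]:.2f})')
print(f'SciPy CI:  ({ci_scipy[0]:.2f}, {ci_scipy[1]:.2f})')
```

By Cohen's commonly cited rule of thumb, values of d near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, though what counts as meaningful depends on the field.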

## 10. Non-Parametric Tests

Non-parametric tests are used when data doesn't meet the assumptions of parametric tests (e.g., normality).

**Real-life use case:** Comparing customer satisfaction ratings (ordinal data) between two stores.

In [None]:
# Mann-Whitney U test (non-parametric alternative to the independent two-sample t-test)
ratings_store1 = np.random.randint(1, 6, 30)  # ratings 1-5 (upper bound is exclusive)
ratings_store2 = np.random.randint(2, 6, 30)  # ratings 2-5
u_stat, p_val = stats.mannwhitneyu(ratings_store1, ratings_store2, alternative='two-sided')
print(f'Mann-Whitney U statistic: {u_stat}, p-value: {p_val:.4f}')
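
The U statistic has a direct interpretation: it counts the pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half. The sketch below verifies this on small made-up ratings. It assumes SciPy 1.7 or later, where `mannwhitneyu` reports U for the first sample.

```python
import numpy as np
from scipy import stats

# Hypothetical satisfaction ratings (1-5) for two stores
store1 = np.array([1, 2, 2, 3, 4, 5])
store2 = np.array([2, 3, 4, 4, 5, 5, 5])

# Compare every store1 rating with every store2 rating
diff = store1[:, None] - store2[None, :]
u_manual = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)

u_scipy, p_val = stats.mannwhitneyu(store1, store2, alternative='two-sided')

# Share of pairs in which store1 rates higher (ties count one half)
prob_store1_higher = u_manual / (store1.size * store2.size)

print(f'U (manual): {u_manual}, U (SciPy): {u_scipy}')
print(f'P(store1 rating > store2 rating): {prob_store1_higher:.3f}')
```

Dividing U by the number of pairs gives a simple effect size: a value near 0.5 means neither store tends to rate higher.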

## 11. Categorical Data Analysis (Chi-Square Test)

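The chi-square test of independence examines whether two categorical variables are associated by comparing the observed counts in a contingency table with the counts expected if the variables were independent.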

**Real-life use case:** Testing if product preference is independent of gender.

In [None]:
# Create a contingency table: rows = gender (Male, Female), columns = product (A, B)
obs = np.array([[30, 10], [20, 40]])
# For 2x2 tables, chi2_contingency applies Yates' continuity correction by default
chi2, p, dof, expected = stats.chi2_contingency(obs)
print(f'Chi-square statistic: {chi2:.2f}, p-value: {p:.4f}, degrees of freedom: {dof}')
print('Expected frequencies:', expected)
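
The expected frequencies come from the row and column totals: each expected count is (row total × column total) / grand total. The sketch below rebuilds them for the same table, recomputes the statistic without the continuity correction, and adds Cramér's V as an effect size. Cramér's V is an addition for illustration and is not part of the cell above.

```python
import numpy as np
from scipy import stats

obs = np.array([[30, 10], [20, 40]])  # same table as above
n = obs.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n

# Pearson chi-square statistic, without Yates' continuity correction
chi2_manual = ((obs - expected_manual) ** 2 / expected_manual).sum()
chi2_scipy, p, dof, expected = stats.chi2_contingency(obs, correction=False)

# Cramer's V: 0 means no association, 1 means perfect association
cramers_v = np.sqrt(chi2_manual / (n * (min(obs.shape) - 1)))

print('Expected (manual):', expected_manual.tolist())
print(f'Chi-square (manual): {chi2_manual:.2f}, (SciPy): {chi2_scipy:.2f}')
print(f"Cramer's V: {cramers_v:.3f}")
```

Passing `correction=False` is what makes the SciPy statistic match the hand calculation; with the default correction the statistic for a 2x2 table is somewhat smaller.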

## 12. Practical Exercises

Try these exercises to reinforce your understanding:

1. Calculate and interpret the effect size for two groups of your own data.
2. Use a non-parametric test to compare two samples that are not normally distributed.
3. Analyze a real or simulated contingency table using the chi-square test.
4. Visualize the confidence interval for a sample mean using matplotlib.
5. Find a real dataset and perform a full statistical analysis: descriptive stats, hypothesis test, effect size, and visualization.