# Advanced Statistics Notes

In this notebook, we cover the following topics:
1. Covariance  
2. Pearson Correlation Coefficient  
3. QQ Plot  
4. Confidence Interval  
5. Hypothesis Testing  
6. Chi-square Test  
7. ANOVA Test

Each section includes an explanation (with formulas) and step-by-step Python examples.

## 1. Covariance

**Definition:**  
Covariance measures how two random variables vary together.

**Population Covariance:**  
<p style="margin-left:20px;">
Cov(X, Y) = E[ (X − E(X)) (Y − E(Y)) ]
</p>

**Sample Covariance:**  
<p style="margin-left:20px;">
s<sub>XY</sub> = (1/(n − 1)) &#931;<sub>i=1</sub><sup>n</sup> (x<sub>i</sub> − <span style="text-decoration: overline;">x</span>) (y<sub>i</sub> − <span style="text-decoration: overline;">y</span>)
</p>
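
The two formulas differ only in the divisor. As a minimal sketch, assuming the population version is estimated from data by dividing by n instead of n − 1, the snippet below computes both on a small made-up data set (the names `x_demo` and `y_demo` are illustrative) and compares them with the `ddof` argument of `np.cov`.

```python
import numpy as np

# Illustrative data (hypothetical values chosen for easy arithmetic)
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Sum of cross-deviations, shared by both formulas
cross_dev = np.sum((x_demo - x_demo.mean()) * (y_demo - y_demo.mean()))
n_demo = len(x_demo)

cov_sample = cross_dev / (n_demo - 1)   # sample covariance: divide by n - 1
cov_population = cross_dev / n_demo     # population-style estimate: divide by n

# np.cov exposes the same choice through its ddof argument
print("Sample covariance:", cov_sample, "| np.cov ddof=1:", np.cov(x_demo, y_demo, ddof=1)[0, 1])
print("Population-style covariance:", cov_population, "| np.cov ddof=0:", np.cov(x_demo, y_demo, ddof=0)[0, 1])
```

For small samples the two values differ noticeably; as n grows, the gap shrinks.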

Below is Python code that computes the sample covariance.
```python
import numpy as np

# Define sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Calculate sample means
x_mean = np.mean(x)
y_mean = np.mean(y)

# Compute sample covariance manually
n = len(x)
cov_xy = np.sum((x - x_mean) * (y - y_mean)) / (n - 1)
print("Sample Covariance (manual):", cov_xy)

# Alternatively, use numpy's built-in function:
cov_matrix = np.cov(x, y)  # returns a 2x2 covariance matrix
print("Covariance Matrix (numpy.cov):\n", cov_matrix)
```

## 2. Pearson Correlation Coefficient

**Definition:**  
The Pearson correlation coefficient standardizes the covariance by the product of the standard deviations, giving a unitless value between −1 and 1.

**Population Formula:**  
<p style="margin-left:20px;">
ρ = Cov(X, Y) / (σ<sub>X</sub> × σ<sub>Y</sub>)
</p>

**Sample Formula:**  
<p style="margin-left:20px;">
r = (&#931;<sub>i=1</sub><sup>n</sup> (x<sub>i</sub> − <span style="text-decoration: overline;">x</span>) (y<sub>i</sub> − <span style="text-decoration: overline;">y</span>)) / ( sqrt(&#931;<sub>i=1</sub><sup>n</sup> (x<sub>i</sub> − <span style="text-decoration: overline;">x</span>)<sup>2</sup>) × sqrt(&#931;<sub>i=1</sub><sup>n</sup> (y<sub>i</sub> − <span style="text-decoration: overline;">y</span>)<sup>2</sup>) )
</p>

*Here, <span style="text-decoration: overline;">x</span> and <span style="text-decoration: overline;">y</span> denote the sample means of X and Y respectively.*
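
To connect the sample formula to code, here is a minimal sketch that evaluates it term by term on made-up data (`x_demo` and `y_demo` are illustrative values, deliberately not perfectly linear) and compares the result with `np.corrcoef`.

```python
import numpy as np

# Illustrative data (hypothetical; not perfectly linear, so r < 1)
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the sample means
dx = x_demo - x_demo.mean()
dy = y_demo - y_demo.mean()

# Numerator: sum of cross-deviations
# Denominator: product of the square roots of the sums of squared deviations
r_manual = np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

print("Pearson r (manual):", r_manual)
print("Pearson r (np.corrcoef):", np.corrcoef(x_demo, y_demo)[0, 1])
```

Note that the (n − 1) factors in the covariance and the standard deviations cancel, which is why they do not appear in the sample formula.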

Below is Python code to compute Pearson’s r.
```python
# Calculate Pearson correlation using numpy
r_matrix = np.corrcoef(x, y)
r = r_matrix[0, 1]
print("Pearson Correlation (numpy):", r)

# Alternatively, use scipy.stats:
from scipy import stats

r_scipy, p_value = stats.pearsonr(x, y)
print("Pearson Correlation (scipy):", r_scipy)
print("p-value:", p_value)
```

## 3. QQ Plot (Quantile-Quantile Plot)

**Definition:**  
A QQ plot compares the quantiles of sample data with those of a theoretical distribution (often the normal distribution).

**Construction Steps:**
<ol style="margin-left:20px;">
  <li>Order the sample data and compute the empirical quantiles, Q<sub>sample</sub>(p).</li>
  <li>Compute the corresponding theoretical quantiles, Q<sub>theoretical</sub>(p).</li>
  <li>Plot the points: Q<sub>sample</sub>(p) versus Q<sub>theoretical</sub>(p). Points falling close to a straight line suggest that the data are consistent with the theoretical distribution.</li>
</ol>
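
The three construction steps can be sketched by hand as follows. This is an illustration under stated assumptions, not the method used by any particular library: the sample is simulated, the theoretical distribution is the standard normal, and the plotting positions p<sub>i</sub> = (i − 0.5)/n are one common convention among several.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
sample = rng.normal(loc=5.0, scale=2.0, size=50)  # hypothetical data

# Step 1: order the sample; the sorted values are the empirical quantiles
sample_q = np.sort(sample)

# Step 2: theoretical standard-normal quantiles at the plotting positions
# p_i = (i - 0.5) / n  (an assumed convention)
n_pts = len(sample_q)
p = (np.arange(1, n_pts + 1) - 0.5) / n_pts
theoretical_q = norm.ppf(p)

# Step 3: the pairs (theoretical_q, sample_q) are the QQ-plot points.
# For roughly normal data they lie near a line whose slope and intercept
# approximate the sample's standard deviation and mean.
slope, intercept = np.polyfit(theoretical_q, sample_q, 1)
print("Fitted slope (approx. std dev):", slope)
print("Fitted intercept (approx. mean):", intercept)
```

Passing `theoretical_q` and `sample_q` to a scatter plot gives the QQ plot itself.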

Below is Python code to generate a QQ plot.
```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Generate a QQ plot for the data in x
sm.qqplot(x, line='s')
plt.title("QQ Plot for sample data (x)")
plt.show()

# For additional demonstration, plot a QQ plot for normally distributed data
normal_data = np.random.normal(loc=0, scale=1, size=100)
sm.qqplot(normal_data, line='45')
plt.title("QQ Plot for Normally Distributed Data")
plt.show()
```

## 4. Confidence Interval for the Mean

**Definition:**  
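A confidence interval (CI) is a range, computed from the sample, that is expected to contain the true population parameter with a specified level of confidence. For example, a 95% procedure produces intervals that cover the true value in about 95% of repeated samples.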

**For a Population Mean:**  
<p style="margin-left:20px;">
CI = <span style="text-decoration: overline;">x</span> ± t<sub>&alpha;/2, n−1</sub> × ( s / sqrt(n) )
</p>

Where:
<ul style="margin-left:40px;">
  <li><span style="text-decoration: overline;">x</span> = sample mean</li>
  <li>s = sample standard deviation</li>
  <li>n = sample size</li>
  <li>t<sub>&alpha;/2, n−1</sub> = critical value from the t-distribution</li>
</ul>
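
The confidence level can be read as a long-run coverage rate. The simulation below is a minimal sketch of that reading; the true mean, standard deviation, sample size, and number of trials are made-up values chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Hypothetical population and experiment settings
true_mean, true_sd = 10.0, 2.0
n_obs, n_trials = 20, 2000

# Critical value for a 95% interval with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, n_obs - 1)

# Draw many independent samples and build one interval per sample
samples = rng.normal(true_mean, true_sd, size=(n_trials, n_obs))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n_obs)  # s / sqrt(n)
lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print("Fraction of intervals containing the true mean:", coverage)
```

The printed fraction should land close to 0.95, although any single interval either contains the true mean or does not.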

Below is Python code that computes a 95% confidence interval for the mean.
```python
from scipy.stats import t

# Use the same sample data 'x'
n = len(x)
x_mean = np.mean(x)
s = np.std(x, ddof=1)  # sample standard deviation

# Set confidence level to 95%
alpha = 0.05
df = n - 1
t_crit = t.ppf(1 - alpha/2, df)

# Calculate margin of error
margin_error = t_crit * s / np.sqrt(n)
ci_lower = x_mean - margin_error
ci_upper = x_mean + margin_error

print("95% Confidence Interval for the mean of x: [{:.3f}, {:.3f}]".format(ci_lower, ci_upper))
```

## 5. Hypothesis Testing

**General Steps:**
<ol style="margin-left:20px;">
  <li><strong>State the Hypotheses:</strong>
    <ul style="margin-left:20px;">
      <li>H<sub>0</sub>: No effect (e.g., ρ = 0)</li>
      <li>H<sub>1</sub>: There is an effect or difference</li>
    </ul>
  </li>
  <li><strong>Choose a Significance Level:</strong> Typically, &alpha; = 0.05.</li>
  <li><strong>Compute the Test Statistic:</strong>  
    For example, for a population mean:
    <p style="margin-left:20px;">
    t = (<span style="text-decoration: overline;">x</span> − μ<sub>0</sub>) / ( s / sqrt(n) )
    </p>
  </li>
  <li><strong>Determine the p-value:</strong> The probability, assuming H<sub>0</sub> is true, of observing a test statistic at least as extreme as the one computed.</li>
  <li><strong>Make a Decision:</strong> Reject H<sub>0</sub> if p &lt; &alpha;.</li>
</ol>
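
The steps above can be traced by hand for a one-sample, two-sided t-test. The sketch below uses made-up measurements (`data` is hypothetical) and compares the manual result with `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements
data = np.array([2.9, 3.4, 3.1, 3.8, 3.6, 3.3])

# Steps 1 and 2: H0 is "population mean equals mu_0"; significance level alpha
mu_0 = 3.0
alpha = 0.05

# Step 3: test statistic t = (x_bar - mu_0) / (s / sqrt(n))
n_obs = len(data)
t_manual = (data.mean() - mu_0) / (data.std(ddof=1) / np.sqrt(n_obs))

# Step 4: two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_obs - 1)

# Step 5: decision
decision = "reject H0" if p_manual < alpha else "fail to reject H0"

# Cross-check against scipy's built-in test
t_ref, p_ref = stats.ttest_1samp(data, mu_0)
print("manual:", t_manual, p_manual)
print("scipy: ", t_ref, p_ref)
print("Decision at alpha =", alpha, ":", decision)
```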

Below is an example using a one-sample t-test.
```python
# One-sample t-test: testing if the mean of x is equal to a specified value, say, 3.
mu_0 = 3
t_stat, p_val = stats.ttest_1samp(x, mu_0)
print("t-statistic:", t_stat)
print("p-value:", p_val)

if p_val < 0.05:
    print("Reject H0: The sample mean is significantly different from", mu_0)
else:
    print("Fail to reject H0: No significant difference from", mu_0)
```

## 6. Chi-square Test (Goodness-of-fit)

**Definition:**  
The chi-square test compares observed frequencies with expected frequencies under a given hypothesis.

**Formula (Goodness-of-fit):**  
<p style="margin-left:20px;">
&chi;<sup>2</sup> = &Sigma;<sub>i=1</sub><sup>k</sup> ((O<sub>i</sub> − E<sub>i</sub>)<sup>2</sup> / E<sub>i</sub>)
</p>

Where:
<ul style="margin-left:40px;">
  <li>O<sub>i</sub> = observed frequency for category i</li>
  <li>E<sub>i</sub> = expected frequency for category i</li>
  <li>k = number of categories</li>
</ul>

*Note: As a rule of thumb, each expected frequency should be at least 5 for the chi-square approximation to be reliable.*
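
As a minimal sketch of the formula, the snippet below evaluates the sum directly on made-up counts (`obs_demo` is hypothetical, with equal expected counts assumed under the null hypothesis) and obtains the p-value from the chi-square distribution with k − 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for k = 5 categories (total = 100)
obs_demo = np.array([18, 22, 20, 25, 15])

# Expected counts under the assumed null of equal probabilities
exp_demo = np.full(len(obs_demo), obs_demo.sum() / len(obs_demo))

# Chi-square statistic: sum of (O_i - E_i)^2 / E_i
chi2_manual = np.sum((obs_demo - exp_demo) ** 2 / exp_demo)

# p-value: upper tail of the chi-square distribution with k - 1 degrees of freedom
dof = len(obs_demo) - 1
p_manual = stats.chi2.sf(chi2_manual, dof)

# Cross-check against scipy's built-in test
chi2_ref, p_ref = stats.chisquare(obs_demo, exp_demo)
print("manual:", chi2_manual, p_manual)
print("scipy: ", chi2_ref, p_ref)
```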

Below is an example using Python.
```python
from scipy.stats import chisquare, f_oneway

# Chi-square goodness-of-fit example:
# Suppose we roll a die many times. For a fair die, expected count for each face is equal.
observed = np.array([8, 9, 10, 8, 9, 6])
expected = np.full(6, np.sum(observed)/6)  # equal expected count for each category
chi2_stat, chi2_p = chisquare(observed, expected)
print("Chi-square statistic:", chi2_stat)
print("Chi-square p-value:", chi2_p)
```
## 7. ANOVA Test

**Definition:**  
ANOVA is used to determine if there are any statistically significant differences between the means of three or more independent groups.

**F-Statistic:**  
<p style="margin-left:20px;">
F = MS<sub>between</sub> / MS<sub>within</sub>
</p>

Where:
<ul style="margin-left:40px;">
  <li>MS<sub>between</sub> = (Sum of Squares Between) / (number of groups − 1)</li>
  <li>MS<sub>within</sub> = (Sum of Squares Within) / (n − number of groups), where n is the total number of observations across all groups</li>
</ul>

A significant F (p &lt; &alpha;) suggests that at least one group mean is different.
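
To make the two mean squares concrete, here is a minimal sketch that builds F from the sums of squares on three small made-up groups (the values in `groups` are hypothetical) and compares it with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of four observations each
groups = [
    np.array([9.0, 10.0, 11.0, 10.0]),
    np.array([12.0, 13.0, 11.0, 12.0]),
    np.array([10.0, 11.0, 12.0, 11.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Sum of squares between: group size times squared distance of group mean from grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Sum of squares within: squared deviations from each group's own mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# p-value: upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against scipy's built-in one-way ANOVA
f_ref, p_ref = stats.f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy: ", f_ref, p_ref)
```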

Below is Python code that runs a one-way ANOVA.
```python
# ANOVA example:
# Create sample data for three groups with different means.
group1 = np.random.normal(10, 2, size=30)
group2 = np.random.normal(12, 2, size=30)
group3 = np.random.normal(11, 2, size=30)

f_stat, anova_p = f_oneway(group1, group2, group3)
print("ANOVA F-statistic:", f_stat)
print("ANOVA p-value:", anova_p)
```