# DSC 215: Probability and Statistics for Data Science

## Module 1 Summary: Introduction to Data

## 1. Introduction to Statistics

### What is Statistics?
- **Definition**: A discipline that focuses on the collection, analysis, and interpretation of data
- **Applications**: Used in various fields including:
  - "Hard" Sciences (physics, chemistry, biology)
  - Social Sciences (politics, economics, education)
  - Medicine
  - Business and industry

### Types of Data
- **Numerical Data**: Quantitative measurements
  - **Discrete**: Countable values (e.g., number of students in a class)
  - **Continuous**: Measurements on a continuous scale (e.g., height, weight)
- **Categorical Data**: Qualitative classifications
  - **Nominal**: Categories with no natural ordering (e.g., dog breeds, political party)
  - **Ordinal**: Categories with a natural ordering (e.g., education level, satisfaction ratings)

## 2. Data Organization

### Data Matrix Structure
- **Observations/Cases**: Individual entities being measured (rows in a data matrix)
- **Variables**: Characteristics being measured for each observation (columns)

**Example**: Dog Breed Study
```
| Dog ID | Breed     | Weight (kg) | Height (cm) | Fur Length (cm) |
|--------|-----------|-------------|-------------|-----------------|
| 1      | Husky     | 25          | 58          | 5               |
| 2      | Bulldog   | 22          | 40          | 2               |
| 3      | Chihuahua | 2.5         | 20          | 3               |
```

In this example:
- Each row represents one dog (an observation/case)
- The variables are Breed (categorical), Weight (numerical-continuous), Height (numerical-continuous), and Fur Length (numerical-continuous)
- There are 3 categorical levels for the Breed variable: Husky, Bulldog, and Chihuahua

## 3. Relationships Between Variables

### Types of Associations
- **Positive Association**: Variables increase or decrease together
  - Example: House size and price tend to increase together
- **Negative Association**: One variable increases as the other decreases
  - Example: Homeownership rate decreases as percentage of multi-unit structures increases
- **No Association**: No discernible pattern between variables

### Variable Roles
- **Explanatory Variable**: Variable that might affect or explain changes in another variable
  - Also called independent variable or predictor
- **Response Variable**: Variable that may be affected by the explanatory variable
  - Also called dependent variable or outcome

**Important Note**: Association does not imply causation

### Visualizing Relationships
- **Scatterplots**: Used to visualize relationships between two numerical variables
  - Each point represents a single observation
  - Pattern of points indicates type and strength of relationship

**Example**: Homeownership and Multi-Unit Structures
```
The scatterplot shows a negative association between homeownership rate and 
percentage of multi-unit structures in counties. As the percentage of multi-unit 
structures increases, the homeownership rate tends to decrease.
```

## 4. Data Collection Methods

### Observational Studies
- **Definition**: Data collected without interference in how variables arise
- **Characteristics**:
  - Good for identifying natural associations
  - Cannot establish causation
  - Subject to confounding variables
- **Applications**: Surveys, existing records, cohort studies

### Experiments
- **Definition**: Designed investigations where researchers control variables
- **Characteristics**:
  - Can establish causal connections
  - Involves treatment and control groups
  - Uses randomization to control for confounding variables
- **Key Components**:
  - **Treatment Group**: Receives the intervention being studied
  - **Control Group**: Provides a baseline for comparison

## 5. Sampling Principles

### Population vs. Sample
- **Population**: The entire set of cases about which we want to draw conclusions
- **Sample**: A subset of the population from which we collect data
- **Sampling Frame**: List of cases from which the sample is drawn

### Sampling Methods
- **Simple Random Sampling**: Each case has an equal probability of selection
  - Formula for probability of selection: $P(\text{selection}) = \frac{n}{N}$
  - Where n = sample size, N = population size
- **Stratified Sampling**: Population divided into groups, then random sampling within groups
- **Cluster Sampling**: Population divided into clusters, then entire clusters selected
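
A simple random sample can be drawn in a few lines of Python. This is a minimal sketch, assuming the sampling frame is available as a list; the frame of 100 case IDs, the function name `simple_random_sample`, and the seed are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases; each case is selected with probability n/N."""
    rng = random.Random(seed)  # the seed only makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical sampling frame of N = 100 case IDs
frame = list(range(1, 101))
sample = simple_random_sample(frame, n=10, seed=0)

# Probability that any given case ends up in the sample: n / N
selection_probability = len(sample) / len(frame)
```

Sampling without replacement (as `random.sample` does) guarantees that no case appears twice in the sample.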

### Sampling Bias
- **Non-response Bias**: When certain types of subjects are less likely to respond
  - Example: Only 30% of people respond to a survey, potentially skewing results
- **Convenience Sampling**: Using easily accessible subjects (often leads to bias)
  - Example: Surveying only people walking in a particular neighborhood
- **Voluntary Response Bias**: When sample consists of self-selected volunteers
  - Example: Online product reviews typically come from very satisfied or very dissatisfied customers

## 6. Experimental Design

### Key Principles
- **Control**: Managing differences between treatment and control groups
- **Randomization**: Random assignment to account for uncontrollable variables
- **Replication**: Using more cases for better estimation
- **Blocking**: Subdividing based on variables that may affect response

### Reducing Bias in Human Experiments
- **Blind Studies**: Participants unaware of their treatment status
- **Double-Blind Studies**: Both participants and researchers unaware of treatment status
- **Placebos**: Fake treatments given to control groups
  - Helps account for the placebo effect (improvement due to expectation)

### Example of Experimental Design
```
Study: Effect of a sleeping pill on people with trouble sleeping

Design elements:
- 80 participants with trouble sleeping
- Blocking variable: Age (40 people aged 50 or older, 40 people under 50)
- Random assignment within blocks to treatment or control
- Treatment: One sleeping pill per week
- Control: Placebo pill
- Blinding: Participants don't know which pill they received (single-blind)
- Duration: 10 weeks
- Response variable: Quality of sleep
```
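
The random assignment within blocks described in this design could be sketched as follows. The participant IDs, block labels, and the function name `assign_within_blocks` are hypothetical; the sketch also assumes each block is split evenly between treatment and control, which the design does not state explicitly.

```python
import random

def assign_within_blocks(blocks, seed=None):
    """Randomly split each block in half: treatment vs. control."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, participants in blocks.items():
        shuffled = participants[:]  # copy so the input list is not modified
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for pid in shuffled[:half]:
            assignment[pid] = (block_name, "treatment")
        for pid in shuffled[half:]:
            assignment[pid] = (block_name, "control")
    return assignment

# Hypothetical participant IDs: 40 per age block
blocks = {
    "older": [f"P{i:02d}" for i in range(1, 41)],
    "younger": [f"P{i:02d}" for i in range(41, 81)],
}
assignment = assign_within_blocks(blocks, seed=1)
```

Shuffling inside each block, rather than across all 80 participants, keeps the age groups balanced between treatment and control.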

## 7. Drawing Valid Conclusions

### Correlation vs. Causation
- Association between variables does not imply causation
- Causation can only be established through well-designed experiments
- Observational studies can suggest but not prove causal relationships

### Generalizability
- Results from a sample can only be generalized to the population it represents
- Random sampling improves generalizability
- External validity refers to how well results apply to other situations

### Example: Valid and Invalid Conclusions
```
Study: Survey of 50 students in a statistics class about voluntary work participation

Valid conclusion: "X% of students in this specific statistics class participate in 
voluntary work."

Invalid conclusion: "Studying statistics causes students to participate in voluntary 
work." (Correlation doesn't imply causation)

Invalid conclusion: "X% of all university students participate in voluntary work." 
(Cannot generalize beyond the specific class)
```

## 8. Key Formulas and Concepts

### Simple Random Sampling
- Each case has equal probability of selection: $P(\text{selection}) = \frac{n}{N}$
- Where n = sample size, N = population size

### Randomization in Experiments
- Random assignment helps ensure treatment and control groups are comparable
- Helps control for confounding variables
- Can be done using random number generators, coin flips, etc.

## 9. Common Misconceptions

1. **Correlation implies causation**: Just because two variables are associated doesn't mean one causes the other.

2. **Larger samples are always better**: While larger samples generally provide more precision, a large biased sample can be more misleading than a small unbiased one.

3. **Statistical significance equals practical importance**: A statistically significant result may not be practically meaningful.

4. **Anecdotal evidence is reliable**: Individual stories or experiences are not statistically valid evidence.

5. **All studies are equally valid**: The design and methodology of a study greatly affect the validity of its conclusions.


## Module 2 Summary: Summarizing Data

## 1. Visualizing Numerical Data

### Scatterplots
- **Purpose**: Visualize relationships between two numerical variables
- **Features**:
  - Each point represents a single case with coordinates (x, y)
  - Help identify associations (positive, negative, or none)
  - Reveal whether relationships are simple (linear) or complex (non-linear)
- **Example**: A scatterplot of total income versus loan amount shows that borrowers with higher incomes tend to take larger loans, though the relationship isn't perfectly linear.

### Histograms
- **Purpose**: Visualize the distribution of a single numerical variable
- **Features**:
  - Data values are grouped into bins (intervals)
  - Height of bars represents frequency or density
  - Higher bars indicate where data are more common
  - Provide a view of data density across the range of values
- **Example**: A histogram of interest rates for loans might show that most loans have rates between 5-10%, with fewer loans having rates above 15%.

## 2. Shape of Distributions

### Skewness
- **Right-skewed (Positively Skewed)**:
  - Data trail off to the right with a longer right tail
  - Mean is typically greater than median
  - Example: Income distributions, home prices
  
- **Left-skewed (Negatively Skewed)**:
  - Data trail off to the left with a longer left tail
  - Mean is typically less than median
  - Example: Age-at-death distributions, exam scores with ceiling effects
  
- **Symmetric**:
  - Data show roughly equal trailing off in both directions
  - Mean and median are approximately equal
  - Example: Height distributions in adult populations

### Modality
- **Unimodal**: One prominent peak (most common)
- **Bimodal**: Two prominent peaks (may suggest two subpopulations)
- **Multimodal**: More than two prominent peaks
- **Uniform**: No peaks, approximately equal frequency across all values

## 3. Measures of Center

### Mean (Average)
- **Definition**: Sum of all values divided by number of observations
- **Formula**: 
  $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \ldots + x_n}{n}$$
- **Properties**:
  - Useful for comparisons between groups
  - Affected by all values in the dataset
  - Not resistant to outliers (non-robust)
  - Appropriate for symmetric distributions
- **Example**: For interest rates of 10.9%, 9.92%, ..., 6.08% across 50 loans, the mean is 11.57%.

### Median
- **Definition**: Middle value when data are ordered
- **Calculation**:
  - If n is odd: middle value
  - If n is even: average of two middle values
- **Properties**:
  - Robust statistic (resistant to outliers)
  - Better measure of center for skewed distributions
  - Not affected by extreme values
- **Example**: For the ordered data {1, 5, 6, 7, 10}, the median is 6. For {1, 5, 6, 7}, the median is (5+6)/2 = 5.5.

### Mode
- **Definition**: Most frequently occurring value (the location of a prominent peak in the distribution)
- **Properties**:
  - Can have multiple modes
  - Useful for categorical data
  - Less commonly used for numerical data
- **Example**: In the dataset {2, 3, 3, 4, 5, 5, 5, 6, 7}, the mode is 5.
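
The examples in this section can be checked with Python's standard `statistics` module. This is only a quick verification sketch; the variable names are chosen for illustration.

```python
import statistics

odd_data = [1, 5, 6, 7, 10]
even_data = [1, 5, 6, 7]
mode_data = [2, 3, 3, 4, 5, 5, 5, 6, 7]

median_odd = statistics.median(odd_data)    # middle value of an odd-length list
median_even = statistics.median(even_data)  # average of the two middle values
mode_value = statistics.mode(mode_data)     # most frequent value
mean_value = statistics.mean(mode_data)     # sum of values / number of values
```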

## 4. Measures of Spread

### Variance
- **Definition**: Roughly the average squared deviation from the mean (the sum of squared deviations divided by n - 1)
- **Formula**: 
  $$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
- **Properties**:
  - Measures how much data deviates from the mean
  - Units are squared (making interpretation difficult)
  - Not resistant to outliers
- **Example**: For the data {2, 4, 6, 8, 10}, the mean is 6, and the variance is:
  $$s^2 = \frac{(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2}{5-1} = \frac{16 + 4 + 0 + 4 + 16}{4} = \frac{40}{4} = 10$$

### Standard Deviation
- **Definition**: Square root of variance
- **Formula**: 
  $$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
- **Properties**:
  - Same units as original data
  - Useful for considering how far data are distributed from the mean
  - Approximately 68% of data falls within 1 standard deviation of mean in normal distributions
  - Not resistant to outliers
- **Example**: For the data {2, 4, 6, 8, 10}, the standard deviation is $\sqrt{10} = 3.16$.

### Range
- **Definition**: Difference between maximum and minimum values
- **Formula**: Range = max(x) - min(x)
- **Properties**:
  - Simple to calculate
  - Highly sensitive to outliers
  - Provides limited information about distribution
- **Example**: For the data {2, 4, 6, 8, 10}, the range is 10 - 2 = 8.
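
The variance, standard deviation, and range examples above can be reproduced in Python. The sketch below computes the sample variance both by hand (to show the n - 1 divisor) and with the standard `statistics` module; the helper name `sample_variance_manual` is illustrative.

```python
import statistics

def sample_variance_manual(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [2, 4, 6, 8, 10]

manual_variance = sample_variance_manual(data)
sample_variance = statistics.variance(data)  # also divides by n - 1
sample_sd = statistics.stdev(data)           # square root of the sample variance
data_range = max(data) - min(data)
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by n instead and would give different values for the same data.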

### Quartiles and IQR
- **First quartile (Q₁)**: 25% of data falls below this value
- **Second quartile (Q₂)**: Median (50% of data falls below)
- **Third quartile (Q₃)**: 75% of data falls below this value
- **Interquartile Range (IQR)**: Q₃ - Q₁ (range of middle 50% of data)
- **Properties**:
  - Robust statistics (resistant to outliers)
  - Useful for describing skewed distributions
  - Used to identify potential outliers
- **Example**: For the data {2, 4, 6, 8, 10, 12, 14}, Q₁ = 4, Q₂ = 8, Q₃ = 12, and IQR = 12 - 4 = 8.
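
Quartile conventions differ between textbooks and software, so computed values may not always match hand calculations. As a sketch, the `"exclusive"` method of `statistics.quantiles` (Python 3.8+) reproduces the example above; treating that method as the convention used in these notes is an assumption.

```python
import statistics

data = [2, 4, 6, 8, 10, 12, 14]

# n=4 splits the data into four equal-probability groups (three cut points)
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1
```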

## 5. Box Plots and Outliers

### Box Plots
- **Components**:
  - Box: Represents IQR (middle 50% of data)
  - Line inside box: Median
  - Whiskers: Extend to smallest/largest data points within 1.5×IQR of Q₁/Q₃
  - Individual points: Potential outliers beyond whiskers
- **Uses**:
  - Comparing distributions across groups
  - Identifying skewness and outliers
  - Summarizing key statistics visually

### Outliers
- **Definition**: Observations that appear extreme relative to the rest of the data
- **Common identification method**: Values beyond Q₁-1.5×IQR or Q₃+1.5×IQR
- **Importance**:
  - May indicate data collection or recording errors
  - Could represent important rare events
  - Can significantly affect non-robust statistics
  - Should be investigated, not automatically removed
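
The 1.5×IQR rule can be sketched in Python as follows. The data set and the function name `iqr_outliers` are hypothetical, and the quartiles use the `"exclusive"` convention of `statistics.quantiles`, which may differ slightly from other software.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_fence, upper_fence, flagged) using the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    flagged = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, flagged

# Hypothetical data: the value 40 sits far from the rest
values = [2, 4, 6, 8, 10, 12, 14, 40]
lower_fence, upper_fence, flagged = iqr_outliers(values)
```

Flagged values are candidates for investigation, not automatic removal.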

## 6. Summarizing Categorical Data

### Tables for Categorical Data
- **Frequency Tables**: Count occurrences of each category
- **Relative Frequency Tables**: Show proportion or percentage in each category
- **Contingency Tables**: Summarize data for two categorical variables
  - Each cell represents the number of times a particular combination occurred
  - Can be modified to show proportions (row, column, or overall)

### Example of a Contingency Table
```
                      homeownership
app_type      rent   mortgage   own    Total
individual    3496   3839       1170   8505
joint         362    950        183    1495
Total         3858   4789       1353   10000
```

### Row and Column Proportions
- **Row proportions**: Each count divided by its row total
  - Example: 3496/8505 = 0.411 (41.1% of individual applicants rent)
- **Column proportions**: Each count divided by its column total
  - Example: 3496/3858 = 0.906 (90.6% of renters applied as individuals)
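
Row and column proportions for the contingency table above can be computed with plain dictionaries. This is a minimal sketch; the variable names are illustrative.

```python
counts = {
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
}

# Row totals: one per application type
row_totals = {app: sum(cells.values()) for app, cells in counts.items()}

# Column totals: one per homeownership category
col_totals = {}
for cells in counts.values():
    for home, count in cells.items():
        col_totals[home] = col_totals.get(home, 0) + count

# Each count divided by its row total / column total
row_props = {
    app: {home: c / row_totals[app] for home, c in cells.items()}
    for app, cells in counts.items()
}
col_props = {
    app: {home: c / col_totals[home] for home, c in cells.items()}
    for app, cells in counts.items()
}
```

Row proportions sum to 1 across each row, while column proportions sum to 1 down each column, so the two answer different questions.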

### Visualizing Categorical Data
- **Bar Plots**:
  - Display counts or proportions for categories
  - Bars should be separated (unlike histograms)
  - Height represents frequency or proportion
- **Variations**:
  - Stacked bar plots: Show composition within categories
  - Side-by-side bar plots: Compare groups directly
  - Standardized stacked bar plots: Show proportions within each category

## 7. Comparing Distributions

### Comparing Numerical Data Across Groups
- **Side-by-side Box Plots**:
  - Traditional tool for comparing distributions across groups
  - Allow comparison of center, spread, and outliers
- **Hollow Histograms**:
  - Outlines of histograms for each group on the same plot
  - Useful for comparing shapes of distributions

### Comparing Categorical Data Across Groups
- **Side-by-side Bar Plots**: Compare frequencies across groups
- **Stacked Bar Plots**: Compare composition within categories
- **Mosaic Plots**: Area represents frequency in contingency tables

## 8. Statistical Transformations

### Linear Transformations
- **Adding a constant (x + c)**:
  - Changes center but not spread
  - Mean increases by the constant: $\bar{x}_{new} = \bar{x} + c$
  - Median increases by the constant: $\text{median}_{new} = \text{median} + c$
  - Range and standard deviation remain unchanged
  - Example: Converting Celsius to Kelvin (K = C + 273.15)

- **Multiplying by a constant (c × x)**:
  - Changes both center and spread
  - Mean is multiplied by the constant: $\bar{x}_{new} = c \times \bar{x}$
  - Median is multiplied by the constant: $\text{median}_{new} = c \times \text{median}$
  - Range and standard deviation are multiplied by |c|
  - Example: Converting inches to centimeters (cm = 2.54 × inches)

### Example of Transformations
For the data {10, 20, 30, 40, 50}:
- Original: mean = 30, median = 30, range = 40, standard deviation ≈ 15.81
- After adding 5: {15, 25, 35, 45, 55}
  - New mean = 35, new median = 35, range = 40, standard deviation ≈ 15.81
- After multiplying by 2: {20, 40, 60, 80, 100}
  - New mean = 60, new median = 60, range = 80, standard deviation ≈ 31.62
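
The effect of these transformations can be verified numerically. The sketch below uses the standard `statistics` module; the helper name `summarize` is illustrative.

```python
import statistics

def summarize(xs):
    """Collect the center and spread statistics discussed above."""
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "range": max(xs) - min(xs),
        "sd": statistics.stdev(xs),
    }

data = [10, 20, 30, 40, 50]

original = summarize(data)
shifted = summarize([x + 5 for x in data])  # adding a constant
scaled = summarize([2 * x for x in data])   # multiplying by a constant
```

Comparing `shifted` and `scaled` against `original` shows that the shift moves only the mean and median, while the scaling multiplies all four statistics by 2.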

## 9. Practical Examples

### Example 1: Analyzing Loan Interest Rates
A dataset contains interest rates for 50 loans with the following statistics:
- Mean: 11.57%
- Median: 9.93%
- Standard Deviation: 5.05%
- Range: 21.9% (from 5% to 26.9%)
- Q₁: 7.5%
- Q₃: 15%
- IQR: 7.5%

The histogram shows a right-skewed distribution, indicating that most loans have rates under 15%, with a few loans having rates above 20%. Since the distribution is skewed, the median (9.93%) is a better measure of central tendency than the mean (11.57%).

### Example 2: Comparing Student Quiz Scores
A class of students took a quiz, and the 5-number summaries are given for 18 freshmen and 15 sophomores:

```
Summary   Min  Q1   Median  Q3   Max
Freshmen   3   4.5   6.5    8.5  9.5
Sophomores 4   6     7.5    9    10
```

Observations:
- Sophomores have the highest score (10 > 9.5)
- Freshmen have a greater range (6.5 > 6)
- Freshmen have a greater IQR (4 > 3)
- If the mean of freshmen scores is 6.5 and sophomores is 7, the overall mean is:
  $$\text{Overall Mean} = \frac{6.5 \times 18 + 7 \times 15}{33} = \frac{117 + 105}{33} = \frac{222}{33} = 6.73$$

### Example 3: Effect of Salary Changes on Statistics
If an employee with the lowest salary in a company of three employees becomes part-time and has a salary reduction:

- **Effect on measures of center**:
  - Mean will decrease
  - Median will not change (since the lowest value remains the lowest)
  
- **Effect on measures of spread**:
  - Range will increase
  - Standard deviation will increase (the data points become more spread out from the mean)
  - IQR may not change (depends on whether the lowest salary is below Q₁)

## 10. Key Takeaways

1. Different statistics are appropriate for different types of data and distributions:
   - For skewed distributions, median is often more representative than mean
   - For symmetric distributions, mean and median are similar

2. Robust statistics (median, IQR) are less affected by outliers than non-robust statistics (mean, standard deviation)

3. Visualizations help identify patterns and outliers in data:
   - Histograms show the shape of distributions
   - Box plots summarize key statistics and identify outliers
   - Scatterplots show relationships between variables

4. Categorical data requires different analysis approaches than numerical data:
   - Contingency tables and bar plots for categorical data
   - Proportions often more informative than raw counts

5. When comparing groups, consider both measures of center and spread:
   - Side-by-side box plots or hollow histograms for numerical comparisons
   - Stacked or side-by-side bar plots for categorical comparisons

6. Transformations affect statistics in predictable ways:
   - Adding a constant shifts the center but doesn't change the spread
   - Multiplying by a constant changes both center and spread


## Module 3 Summary: Introduction to Probability

## 1. Foundations of Probability

### Why Study Probability?
- Probability is the foundation upon which statistics is built
- Essential for machine learning, artificial intelligence, game theory, and information theory
- Provides a formal framework for understanding uncertainty and randomness
- Enables deeper understanding of statistical tools and techniques

### Sample Space
- **Definition**: The set of all possible outcomes of an experiment, denoted by Ω
- **Examples**:
  - Rolling a die: Ω = {1, 2, 3, 4, 5, 6}
  - Flipping a coin: Ω = {Heads, Tails}
  - Tossing a coin three times: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

### Event Space
- **Definition**: A set whose elements are themselves sets (subsets of Ω)
- Every event A in the event space ℱ is a subset of Ω (A ⊂ Ω)
- The event space must satisfy certain properties:
  - The empty set ∅ must be in ℱ
  - If A is in ℱ, then its complement A<sup>c</sup> must also be in ℱ
  - If A₁, A₂, ... are all in ℱ, then their union must also be in ℱ

### Probability Measure
- **Definition**: A function ℙ: ℱ → [0,1] that assigns probabilities to events
- Must satisfy the axioms of probability:
  - ℙ(A) ≥ 0 for all A ∈ ℱ (non-negativity)
  - ℙ(Ω) = 1 (normalization)
  - If A₁, A₂, ... are disjoint events (A<sub>i</sub> ∩ A<sub>j</sub> = ∅ for i ≠ j), then ℙ(A₁ ∪ A₂ ∪ ...) = Σᵢ ℙ(A<sub>i</sub>)

### Properties of Probability Measures
- If A ⊂ B, then ℙ(A) ≤ ℙ(B)
- ℙ(A ∩ B) ≤ min(ℙ(A), ℙ(B))
- ℙ(A ∪ B) ≤ ℙ(A) + ℙ(B) (union bound)
- ℙ(A<sup>c</sup>) = 1 - ℙ(A)

### Example: Die Rolling
For a fair six-sided die:
- Sample space: Ω = {1, 2, 3, 4, 5, 6}
- Event space: ℱ = P(Ω) (the power set of Ω)
- Probability measure: If a set A ∈ ℱ has i elements, then ℙ(A) = i/6
- Example: ℙ({1, 4, 6}) = 3/6 = 0.5
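
The probability measure for the fair die can be written as a small Python function. This is a sketch using exact fractions; the names `OMEGA` and `prob` are chosen for illustration.

```python
from fractions import Fraction

OMEGA = frozenset({1, 2, 3, 4, 5, 6})  # sample space of a six-sided die

def prob(event):
    """P(A) = |A| / 6 for any subset A of the sample space."""
    event = frozenset(event)
    if not event <= OMEGA:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(OMEGA))

p_example = prob({1, 4, 6})
p_complement = prob(OMEGA - {1, 4, 6})
```

With this definition the axioms can be checked directly: `prob(OMEGA)` is 1, every probability is non-negative, and probabilities of disjoint events add.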

## 2. Conditional Probability and Independence

### Conditional Probability
- **Definition**: The probability of event A occurring given that event B has occurred
- Formula: 
  $$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$$
  where ℙ(B) ≠ 0

### Independence
- **Definition**: Two events A and B are independent if and only if:
  $$\mathbb{P}(A \cap B) = \mathbb{P}(A) \times \mathbb{P}(B)$$
- Equivalently, A and B are independent if:
  $$\mathbb{P}(A|B) = \mathbb{P}(A)$$

### Example: Left-Handed People
If 9% of people are left-handed and 2 people are selected at random from a large population:
- Probability both are left-handed (assuming independence):
  $$\mathbb{P}(\text{both left-handed}) = \mathbb{P}(\text{first left-handed}) \times \mathbb{P}(\text{second left-handed}) = 0.09^2 = 0.0081$$

### Example: Smallpox in Boston, 1721
A dataset of 6,224 individuals exposed to smallpox in Boston:

| | Inoculated | Not Inoculated | Total |
|---|---|---|---|
| Lived | 238 | 5,136 | 5,374 |
| Died | 6 | 844 | 850 |
| Total | 244 | 5,980 | 6,224 |

- Probability a non-inoculated person died from smallpox:
  $$\mathbb{P}(\text{died}|\text{not inoculated}) = \frac{\mathbb{P}(\text{died} \cap \text{not inoculated})}{\mathbb{P}(\text{not inoculated})} = \frac{0.1356}{0.9608} \approx 0.1411$$

- Probability an inoculated person died from smallpox:
  $$\mathbb{P}(\text{died}|\text{inoculated}) = \frac{\mathbb{P}(\text{died} \cap \text{inoculated})}{\mathbb{P}(\text{inoculated})} = \frac{0.000964}{0.039203} \approx 0.0246$$

### Bayes' Theorem
- **Formula**:
  $$\mathbb{P}(A|B) = \frac{\mathbb{P}(B|A)\mathbb{P}(A)}{\mathbb{P}(B)}$$

- When ℙ(B) is not directly accessible:
  $$\mathbb{P}(B) = \sum_i \mathbb{P}(B|A_i)\mathbb{P}(A_i)$$
  where A<sub>i</sub> ∩ A<sub>j</sub> = ∅ for i ≠ j and ∪<sub>i</sub> A<sub>i</sub> = Ω

### Example: Bayes' Theorem with Smallpox Data
Using the smallpox data to find the probability that a person who died was not inoculated:
$$\mathbb{P}(\text{not inoculated}|\text{died}) = \frac{\mathbb{P}(\text{died}|\text{not inoculated})\mathbb{P}(\text{not inoculated})}{\mathbb{P}(\text{died})} = \frac{0.1411 \times 0.9608}{0.1366} \approx 0.993$$
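
The conditional probabilities and the Bayes' theorem calculation for the smallpox table can be reproduced from the raw counts. This is a sketch with exact fractions; the helper names `p` and `p_given` are illustrative.

```python
from fractions import Fraction

counts = {
    ("lived", "inoculated"): 238,
    ("lived", "not inoculated"): 5136,
    ("died", "inoculated"): 6,
    ("died", "not inoculated"): 844,
}
total = sum(counts.values())

def p(outcome=None, status=None):
    """Probability of all table cells matching the given outcome and/or status."""
    matching = sum(
        c for (o, s), c in counts.items()
        if (outcome is None or o == outcome) and (status is None or s == status)
    )
    return Fraction(matching, total)

def p_given(outcome, status):
    """P(outcome | status) = P(outcome and status) / P(status)."""
    return p(outcome, status) / p(status=status)

p_died_not_inoc = p_given("died", "not inoculated")
p_died_inoc = p_given("died", "inoculated")

# Bayes' theorem: P(not inoculated | died)
p_not_inoc_given_died = (
    p_died_not_inoc * p(status="not inoculated") / p(outcome="died")
)
```

Working with counts rather than rounded decimals avoids small rounding discrepancies in the final ratios.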

## 3. Random Variables

### Definition and Types
- **Random Variable**: A function from the sample space Ω to another set (typically ℝ or ℕ)
- **Discrete Random Variable**: Takes values from a countable set
- **Continuous Random Variable**: Takes values from an uncountable set (typically an interval of real numbers)

### Example: Coin Tossing
For a sequence of 3 coin tosses:
- Sample space: Ω = {(H,H,H), (H,H,T), (H,T,H), (H,T,T), (T,H,H), (T,H,T), (T,T,H), (T,T,T)}
- Random variable X = number of heads in the sequence
- For ω = (H,H,T), X(ω) = 2
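
Viewing a random variable as a function on the sample space translates directly into code. The sketch below enumerates Ω for three tosses and tabulates the distribution of X; it assumes a fair coin (all eight outcomes equally likely), which the example above does not state.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all sequences of three tosses
omega = list(product("HT", repeat=3))

def X(outcome):
    """Random variable: number of heads in the sequence."""
    return outcome.count("H")

# Distribution of X, assuming each outcome has probability 1/8 (fair coin)
value_counts = Counter(X(w) for w in omega)
pmf = {k: Fraction(c, len(omega)) for k, c in value_counts.items()}
```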

## 4. Probability Distributions for Discrete Random Variables

### Probability Mass Function (PMF)
- **Definition**: A function p<sub>X</sub>: ℝ → [0,1] that gives the probability that a discrete random variable equals a specific value
- **Formula**: p<sub>X</sub>(a) = ℙ(X = a)
- **Properties**:
  - p<sub>X</sub>(x) ≥ 0 for all x
  - Σ<sub>x</sub> p<sub>X</sub>(x) = 1

### Example: Fair Coin Toss
For a random variable X associated with a fair coin toss (X = 1 for heads, X = 0 for tails):
$$p_X(x) = \begin{cases}
1/2 & \text{if } x = 0 \\
1/2 & \text{if } x = 1 \\
0 & \text{otherwise}
\end{cases}$$

## 5. Probability Distributions for Continuous Random Variables

### Cumulative Distribution Function (CDF)
- **Definition**: A function F<sub>X</sub>: ℝ → [0,1] that gives the probability that a random variable is less than or equal to a specific value
- **Formula**: F<sub>X</sub>(x) = ℙ(X ≤ x)
- **Properties**:
  - 0 ≤ F<sub>X</sub>(x) ≤ 1
  - lim<sub>x→-∞</sub> F<sub>X</sub>(x) = 0
  - lim<sub>x→+∞</sub> F<sub>X</sub>(x) = 1
  - x ≤ y ⟹ F<sub>X</sub>(x) ≤ F<sub>X</sub>(y) (non-decreasing)
  - ℙ(a < X ≤ b) = F<sub>X</sub>(b) - F<sub>X</sub>(a); for a continuous X this also equals ℙ(a ≤ X ≤ b)

### Probability Density Function (PDF)
- **Definition**: For a continuous random variable with a differentiable CDF, the PDF is the derivative of the CDF
- **Formula**: f<sub>X</sub>(x) = d/dx F<sub>X</sub>(x)
- **Properties**:
  - f<sub>X</sub>(x) ≥ 0
  - ∫<sub>-∞</sub><sup>∞</sup> f<sub>X</sub>(x)dx = 1
  - For a set A, ∫<sub>A</sub> f<sub>X</sub>(x)dx = ℙ(X ∈ A)
  - ℙ(a ≤ X ≤ b) = ∫<sub>a</sub><sup>b</sup> f<sub>X</sub>(x)dx
  - F<sub>X</sub>(x) = ∫<sub>-∞</sub><sup>x</sup> f<sub>X</sub>(z)dz
  - Important: In general, f<sub>X</sub>(x) ≠ ℙ(X = x)

### Example: Uniform Random Variable
For a uniform random variable on [0,1]:
$$f_X(x) = \begin{cases}
1 & \text{if } x \in [0,1] \\
0 & \text{otherwise}
\end{cases}$$

The CDF is:
$$F_X(x) = \begin{cases}
0 & \text{if } x < 0 \\
x & \text{if } 0 \leq x \leq 1 \\
1 & \text{if } x > 1
\end{cases}$$

## 6. Expectation, Variance, and Standard Deviation

### Expectation (Expected Value)
- **For Discrete Random Variables**:
  $$\mathbb{E}(X) = \sum_{x \in S} x \cdot p_X(x)$$
  where S is the set of values that X can take

- **For Continuous Random Variables**:
  $$\mathbb{E}(X) = \int_{-\infty}^{\infty} x \cdot f_X(x) dx$$

- **For Functions of Random Variables**:
  - Discrete: 𝔼(g(X)) = Σ<sub>x∈S</sub> g(x) · p<sub>X</sub>(x)
  - Continuous: 𝔼(g(X)) = ∫<sub>-∞</sub><sup>∞</sup> g(x) · f<sub>X</sub>(x) dx

### Example: Expected Value of a Die Roll
For a fair six-sided die:
$$\mathbb{E}(X) = \sum_{i=1}^{6} i \times \frac{1}{6} = \frac{1}{6} + \frac{2}{6} + \frac{3}{6} + \frac{4}{6} + \frac{5}{6} + \frac{6}{6} = \frac{21}{6} = 3.5$$

### Example: Expected Value of a Uniform Random Variable
For a uniform random variable on [0,1]:
$$\mathbb{E}(X) = \int_{-\infty}^{\infty} x \cdot f_X(x) dx = \int_{0}^{1} x \cdot 1 \, dx = \left[ \frac{x^2}{2} \right]_{0}^{1} = \frac{1}{2}$$

### Linearity of Expectation
- If X and Y are random variables, and a and b are constants:
  $$\mathbb{E}(a \cdot g(X) + b \cdot h(Y)) = a \cdot \mathbb{E}(g(X)) + b \cdot \mathbb{E}(h(Y))$$

### Example: Linearity of Expectation
If the expected sales price of an apple is $1 and an orange is $2, then the expected sales price of 2 apples and 3 oranges is:
$$\mathbb{E}(2 \cdot \text{apple price} + 3 \cdot \text{orange price}) = 2 \cdot \mathbb{E}(\text{apple price}) + 3 \cdot \mathbb{E}(\text{orange price}) = 2 \times \$1 + 3 \times \$2 = \$8$$

### Variance and Standard Deviation
- **Variance**: Measures the spread or dispersion of a random variable around its mean
  $$\text{Var}(X) = \mathbb{E}((X - \mu)^2) = \mathbb{E}(X^2) - \mu^2$$
  where μ = 𝔼(X)

- **Standard Deviation**: Square root of the variance
  $$\sigma = \sqrt{\text{Var}(X)}$$

### Example: Variance of a Die Roll
For a fair six-sided die with μ = 3.5:
$$\text{Var}(X) = \mathbb{E}((X - \mu)^2) = \sum_{i=1}^{6} (i - 3.5)^2 \times \frac{1}{6} = \frac{105}{36} \approx 2.92$$
$$\sigma = \sqrt{\frac{105}{36}} \approx 1.71$$
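
The expectation and variance of the die roll can be checked with exact arithmetic. This is a minimal sketch; the function name `expectation` and its optional argument `g` are illustrative.

```python
from fractions import Fraction

# PMF of a fair six-sided die
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E(g(X)) = sum over x of g(x) * p_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expectation(pmf)                                    # E(X)
var = expectation(pmf, lambda x: (x - mu) ** 2)          # E((X - mu)^2)
var_shortcut = expectation(pmf, lambda x: x ** 2) - mu ** 2  # E(X^2) - mu^2
sd = float(var) ** 0.5
```

Both variance formulas give the same result, 35/12, which is 105/36 in lowest terms.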

### Example: Variance of a Uniform Random Variable
For a uniform random variable on [0,1] with μ = 1/2:
$$\text{Var}(X) = \mathbb{E}((X - \mu)^2) = \int_{0}^{1} (x - \frac{1}{2})^2 dx = \frac{1}{12}$$
$$\sigma = \sqrt{\frac{1}{12}} \approx 0.289$$

### Properties of Variance
- For independent random variables X and Y, and constants a and b:
  $$\text{Var}(a \cdot g(X) + b \cdot h(Y)) = a^2 \cdot \text{Var}(g(X)) + b^2 \cdot \text{Var}(h(Y))$$

## 7. Practical Examples

### Example 1: Card Drawing
For a standard deck of 52 cards:
- The sample space for drawing two cards and recording their sum (Ace = 1, Jack/Queen/King = 10) is:
  {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
- The sample space for the number of spades when drawing five cards is:
  {0, 1, 2, 3, 4, 5}

### Example 2: Die Rolling and Event Independence
For a fair 6-sided die with events:
- A: The number rolled is odd = {1, 3, 5}
- B: The number rolled is greater than or equal to 4 = {4, 5, 6}
- C: The number rolled doesn't start with the letters "f" or "t" = {1, 6}

To determine if A and C are independent:
- ℙ(A) = 3/6 = 1/2
- ℙ(C) = 2/6 = 1/3
- ℙ(A ∩ C) = |{1}|/6 = 1/6
- Since ℙ(A) × ℙ(C) = (1/2) × (1/3) = 1/6 = ℙ(A ∩ C), events A and C are independent
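
The independence check can be automated by comparing ℙ(A ∩ C) with ℙ(A) × ℙ(C). The sketch below uses exact fractions so the comparison is not affected by floating-point rounding; the function names are illustrative.

```python
from fractions import Fraction

OMEGA = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event for a fair six-sided die."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def independent(a, b):
    """Events are independent iff P(A and B) equals P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = {1, 3, 5}  # odd
B = {4, 5, 6}  # greater than or equal to 4
C = {1, 6}     # name does not start with "f" or "t"

a_c_independent = independent(A, C)
a_b_independent = independent(A, B)
```

By the same test, A and B are not independent: ℙ(A ∩ B) = 1/6, while ℙ(A) × ℙ(B) = 1/4.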

### Example 3: Blood Type Probabilities
Given that 44% of Americans have type O blood, 42% have type A, 10% have type B, and the rest have type AB:
- Probability of not having type A blood: 1 - 0.42 = 0.58
- Probability of having type A or AB blood: 0.42 + (1 - 0.44 - 0.42 - 0.10) = 0.42 + 0.04 = 0.46

### Example 4: Betting Game Expected Value
In a betting game where you roll a die and:
- Win $200 if you get a 5 on the first roll
- Win $100 if you don't get a 5 on the first roll but get a 5 on the second roll
- Win $0 otherwise

The expected winnings are:
- ℙ(X = 200) = 1/6
- ℙ(X = 100) = (5/6) × (1/6) = 5/36
- ℙ(X = 0) = 1 - 1/6 - 5/36 = 25/36
- 𝔼(X) = 200 × (1/6) + 100 × (5/36) + 0 × (25/36) = 33.33 + 13.89 = $47.22
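
The expected winnings can be recomputed from the payout distribution. This is a sketch with exact fractions; the variable names are chosen for illustration.

```python
from fractions import Fraction

p_five = Fraction(1, 6)  # probability of rolling a 5 on a single roll

# Payout distribution: amount won -> probability
payout_pmf = {
    200: p_five,                 # 5 on the first roll
    100: (1 - p_five) * p_five,  # no 5 first, then a 5 on the second roll
}
payout_pmf[0] = 1 - sum(payout_pmf.values())  # everything else

expected_winnings = sum(amount * p for amount, p in payout_pmf.items())
```

The exact value is 425/9 dollars, which rounds to $47.22.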

## 8. Key Takeaways

1. Probability provides a formal framework for quantifying uncertainty and randomness
2. The three components of a probability space are:
   - Sample space (Ω): All possible outcomes
   - Event space (ℱ): Collection of subsets of Ω
   - Probability measure (ℙ): Function assigning probabilities to events

3. Conditional probability allows us to update probabilities based on new information
4. Independence of events means the occurrence of one event doesn't affect the probability of another
5. Random variables map outcomes to numbers, allowing mathematical analysis
6. Probability distributions describe the likelihood of different values of a random variable:
   - PMF for discrete random variables
   - PDF and CDF for continuous random variables

7. Expected value represents the long-run average of a random variable
8. Variance and standard deviation measure the spread or dispersion around the expected value
9. Linearity of expectation and properties of variance simplify calculations for combinations of random variables

## Module 4 Summary: Distributions of Random Variables

## 1. Introduction to Probability Distributions

### Common Distributions
- **Discrete Random Variables**:
  - Bernoulli distribution
  - Binomial distribution
  - Geometric distribution
  - Negative binomial distribution
  - Poisson distribution

- **Continuous Random Variables**:
  - Normal distribution
  - Chi-squared distribution
  - t-distribution
  - F-distribution
  - Logistic distribution

## 2. Bernoulli Distribution

### Definition
- Models processes with only two outcomes: "success" (1) and "failure" (0)
- A random variable X with a Bernoulli distribution takes:
  - Value 1 with probability p
  - Value 0 with probability 1-p

### Probability Mass Function (PMF)
$$p_X(x) = \begin{cases}
p & \text{if } x = 1 \\
1-p & \text{if } x = 0 \\
0 & \text{otherwise}
\end{cases}$$

### Expected Value and Variance
- Expected value: $\mu = \mathbb{E}(X) = p$
- Variance: $\sigma^2 = p(1-p)$

### Applications
- Coin flips
- Voting preference in a two-party system
- Whether a product is defective

## 3. Binomial Distribution

### Definition
- Describes the probability of having exactly k successes in n independent Bernoulli trials
- Each trial has the same probability p of success
- Notation: $X \sim B(n,p)$ means X follows a binomial distribution with parameters n and p

### Probability Mass Function (PMF)
$$\mathbb{P}(X = k) = \binom{n}{k}p^k(1-p)^{n-k} \text{ for } k = 0,1,...,n$$

Where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.

### Expected Value and Variance
- Expected value: $\mu = np$
- Variance: $\sigma^2 = np(1-p)$

### Example: Peanut Allergies
If the probability of a child having a peanut allergy is 2%, and a classroom has 30 children:
- Probability that none of them has a peanut allergy:
  $\mathbb{P}(X = 0) = \binom{30}{0}0.02^0 \times 0.98^{30} \approx 0.5455$
- Probability that 3 of them have a peanut allergy:
  $\mathbb{P}(X = 3) = \binom{30}{3}0.02^3 \times 0.98^{27} \approx 0.0188$
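
The two probabilities above can be reproduced directly from the PMF. The helper name `binomial_pmf` is an assumption made for this sketch; it uses only the standard library.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ B(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Classroom of 30 children, each with a 2% chance of a peanut allergy
p_none = binomial_pmf(0, 30, 0.02)
p_three = binomial_pmf(3, 30, 0.02)
print(round(p_none, 4), round(p_three, 4))
```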

### Example: International Students
If 23.2% of UCSD students are international students, and there are 50 students on a dorm floor:
- Expected number of international students: $E[X] = np = 50 \times 0.232 = 11.6$
- Standard deviation: $SD(X) = \sqrt{np(1-p)} = \sqrt{50 \times 0.232 \times 0.768} \approx 2.99$

## 4. Normal Distribution

### Definition
- Symmetric, unimodal, bell-shaped curve
- Parametrized by mean μ and standard deviation σ
- Notation: $X \sim \mathcal{N}(\mu, \sigma)$

### Probability Density Function (PDF)
$$f_X(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$

### Standard Normal Distribution
- When $\mu = 0$ and $\sigma = 1$, we have the standard normal distribution: $\mathcal{N}(0,1)$
- Often denoted as Z

### Z-scores
- The Z-score of an observation x is the number of standard deviations it falls above or below the mean:
  $$Z = \frac{x - \mu}{\sigma}$$
- Z-scores allow us to standardize data for comparison
- If X is normally distributed, its Z-scores follow the standard normal distribution

### Finding Tail Areas
- Methods to calculate probabilities:
  - Statistical software (R, Python, MATLAB)
  - Probability tables
  - Graphing calculators

### Example: SAT Scores
If SAT scores follow $\mathcal{N}(1100, 200)$ and Ann scores 1300:
- Z-score: $Z = \frac{1300-1100}{200} = 1$
- Probability of scoring below 1300: $\mathbb{P}(X \leq 1300) \approx 0.8413$
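
As one way to find this tail area with software, the sketch below uses `statistics.NormalDist` from the Python standard library. The variable names are illustrative.

```python
from statistics import NormalDist

# SAT scores modeled as normal with mean 1100 and standard deviation 200
sat = NormalDist(mu=1100, sigma=200)

# Z-score of Ann's score, then the area to the left of 1300
z = (1300 - sat.mean) / sat.stdev
p_below = sat.cdf(1300)
print(z, round(p_below, 4))
```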

### Example: Test Scores
For a test with scores normally distributed with mean 100 and standard deviation 15:
- The interquartile range (IQR) can be calculated using Z-scores:
  - Q1 corresponds to Z = -0.6745
  - Q3 corresponds to Z = 0.6745
  - Q1 = 100 + (-0.6745 × 15) = 89.88
  - Q3 = 100 + (0.6745 × 15) = 110.12
  - IQR = Q3 - Q1 = 20.24
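
The quartile Z-scores used above come from the inverse CDF of the standard normal distribution. A minimal sketch, assuming the standard library's `NormalDist.inv_cdf` as the quantile function:

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=15)

# Z-score that cuts off the lowest 25% of a standard normal distribution
z_q1 = NormalDist().inv_cdf(0.25)

# Quartiles of the test-score distribution and the resulting IQR
q1 = scores.inv_cdf(0.25)
q3 = scores.inv_cdf(0.75)
iqr = q3 - q1
print(round(z_q1, 4), round(q1, 2), round(q3, 2), round(iqr, 2))
```

Because the quartiles here are not rounded before subtracting, the last digit of the IQR may differ slightly from the hand calculation.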

## 5. Approximating Binomial with Normal Distribution

### Conditions for Approximation
- The binomial distribution $B(n,p)$ is approximately normal when:
  - $np \geq 10$
  - $n(1-p) \geq 10$

### Parameters for Approximation
- Use $\mathcal{N}(\mu, \sigma)$ where:
  - $\mu = np$
  - $\sigma = \sqrt{np(1-p)}$

### Example: Defective Light Bulbs
If 2% of light bulbs are defective, what is the probability of getting 30 or fewer defective bulbs in a batch of 1000?
- Check conditions: $np = 1000 \times 0.02 = 20 \geq 10$ and $n(1-p) = 980 \geq 10$
- Use normal approximation with $\mu = 20$ and $\sigma = \sqrt{1000 \times 0.02 \times 0.98} \approx 4.43$
- Calculate Z-score: $Z = \frac{30-20}{4.43} \approx 2.26$
- Probability: $\mathbb{P}(X \leq 30) = \mathbb{P}(Z \leq 2.26) \approx 0.988$
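
To see how close the approximation is, the exact binomial probability can be summed term by term and compared with the normal value. This is a sketch using only the standard library; no continuity correction is applied, matching the calculation above.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, cutoff = 1000, 0.02, 30

# Exact binomial probability P(X <= 30), summing the PMF from 0 to 30
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(cutoff + 1))

# Normal approximation with mu = np and sigma = sqrt(np(1 - p))
mu = n * p
sigma = sqrt(n * p * (1 - p))
approx = NormalDist(mu, sigma).cdf(cutoff)

print(round(exact, 4), round(approx, 4))
```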

## 6. Theoretical Foundations

### Law of Large Numbers (LLN)
- If $X_1, X_2, ...$ is a sequence of independent and identically distributed random variables with expected value $\mu$, then the sample average $\bar{X}_n = \frac{X_1 + ... + X_n}{n}$ satisfies:
  $$\mathbb{P}(\lim_{n\to\infty} \bar{X}_n = \mu) = 1$$
- Intuitively: As we increase the number of trials, the sample mean converges to the expected value
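
A small simulation can illustrate this behaviour. The sketch below assumes a fair six-sided die (expected value 3.5) as the example; the seed and sample sizes are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def mean_of_die_rolls(n: int) -> float:
    """Sample mean of n simulated rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# The sample mean should settle near the expected value 3.5 as n grows
for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```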

### Central Limit Theorem (CLT)
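- If $X_1, X_2, ...$ is a sequence of independent and identically distributed random variables with expected value $\mu$ and variance $\sigma^2 < \infty$, then:
  $$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma)$$
  where, as elsewhere in these notes, the second parameter of $\mathcal{N}$ is the standard deviation (so the limiting variance is $\sigma^2$)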
- Intuitively: The distribution of the sample mean approaches a normal distribution as sample size increases
- This is why the normal distribution is so prevalent in statistics
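
The theorem can be illustrated by simulation. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1), chosen only because it is clearly non-normal; the sample size and number of repetitions are arbitrary.

```python
import random
from math import sqrt
from statistics import stdev

random.seed(1)  # fixed seed so the run is reproducible

mu, sigma = 1.0, 1.0  # mean and SD of the Exponential(1) population
n, reps = 50, 5000    # sample size and number of simulated samples

# Collect sqrt(n) * (sample mean - mu) over many independent samples
scaled = []
for _ in range(reps):
    sample_mean = sum(random.expovariate(1.0) for _ in range(n)) / n
    scaled.append(sqrt(n) * (sample_mean - mu))

# If the scaled means are roughly normal with SD sigma,
# about 95% of them should fall within 1.96 * sigma of 0
within = sum(abs(v) <= 1.96 * sigma for v in scaled) / reps
print(round(stdev(scaled), 3), round(within, 3))
```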

## 7. Other Important Distributions

### Chi-Squared Distribution
- If $Z_1, Z_2, ..., Z_k$ are independent standard normal random variables, then $Q = \sum_{i=1}^{k} Z_i^2$ follows a chi-squared distribution with k degrees of freedom
- Notation: $Q \sim \chi^2_k$
- Used in goodness-of-fit tests and confidence intervals for variance
- PDF becomes less skewed as degrees of freedom increase
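
The construction from squared standard normals can be simulated directly. The sketch below uses k = 3 as an arbitrary example and relies on the standard fact, not stated above, that a chi-squared variable with k degrees of freedom has mean k.

```python
import random

random.seed(2)  # fixed seed so the run is reproducible

k, reps = 3, 20_000

def chi_squared_draw(k: int) -> float:
    """One draw of Q: the sum of k squared independent N(0, 1) values."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))

draws = [chi_squared_draw(k) for _ in range(reps)]
sample_mean = sum(draws) / reps
print(round(sample_mean, 3))  # should be close to k
```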

### t-Distribution
- Used when the population variance is unknown and estimated from the data
- If $X_1, ..., X_n$ are drawn from a normal distribution and the sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is used instead of $\sigma^2$, then $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ follows a t-distribution with n-1 degrees of freedom
- Has heavier tails than the normal distribution
- Approaches the standard normal distribution as degrees of freedom increase

### F-Distribution
- Ratio of two independent chi-squared random variables, each divided by its degrees of freedom:
  $$F = \frac{U/d_1}{V/d_2}$$
- Where $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$ are independent
- Used in ANOVA and testing equality of variances
- Parametrized by two degrees of freedom parameters: $d_1$ and $d_2$

### Example: F-Distribution
If $W = \frac{\chi^2_3/3}{\chi^2_2/2}$, where the numerator and denominator are independent, then W follows an F-distribution with degrees of freedom 3 and 2, denoted as $F_{3,2}$.

## 8. Practical Applications in Statistical Inference

### Hypothesis Testing
- Test statistics often follow specific distributions under the null hypothesis
- The distribution of the test statistic is used to calculate p-values
- Example: For testing whether a coin is biased toward heads after observing 60 heads in 100 flips:
  - Test statistic: $Z = \frac{60-50}{5} = 2$, where $50 = np$ and $5 = \sqrt{np(1-p)}$ under $p = 0.5$
  - Under the null hypothesis (p = 0.5), Z follows approximately $\mathcal{N}(0,1)$
  - One-sided p-value: $\mathbb{P}(Z \geq 2) \approx 0.0228$
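
The coin example can be reproduced numerically. The sketch below computes the upper-tail probability quoted above and, for comparison, the doubled value that a two-sided alternative would use; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, heads, p0 = 100, 60, 0.5

# Mean and standard deviation of the number of heads under H0
expected = n * p0             # 50
sd = sqrt(n * p0 * (1 - p0))  # 5

z = (heads - expected) / sd
p_upper_tail = 1 - NormalDist().cdf(z)  # P(Z >= z)
p_two_sided = 2 * p_upper_tail          # doubled for a two-sided alternative
print(z, round(p_upper_tail, 4), round(p_two_sided, 4))
```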

### Confidence Intervals
- Different distributions are used depending on what parameter is being estimated
- Normal distribution: Used when population variance is known
- t-distribution: Used when population variance is unknown
- Chi-squared distribution: Used for variance estimation

## 9. Key Properties of Random Variables

### Discrete vs. Continuous Random Variables
- **Discrete Random Variables**:
  - Take values from a countable set
  - Have a probability mass function (PMF)
  - Example: Number of points scored in a basketball game
  - For discrete RVs, $\mathbb{P}(X = x)$ can be positive

- **Continuous Random Variables**:
  - Take values from an uncountable set (typically an interval)
  - Have a probability density function (PDF)
  - Example: Distance walked in a day
  - For continuous RVs, $\mathbb{P}(X = x) = 0$ for any specific value x

### Properties of Symmetric Distributions
- If Z has a continuous distribution that is symmetric around 0 and $\mathbb{P}(Z < a) = 0.25$, then:
  - $\mathbb{P}(Z > -a) = \mathbb{P}(Z < a) = 0.25$, because reflecting around 0 maps the event $Z < a$ onto the event $Z > -a$
  - $\mathbb{P}(Z > a) = 1 - \mathbb{P}(Z \leq a) = 1 - \mathbb{P}(Z < a) = 1 - 0.25 = 0.75$, and by the same symmetry $\mathbb{P}(Z < -a) = \mathbb{P}(Z > a) = 0.75$

## 10. Key Takeaways

1. Different distributions model different types of random phenomena
2. The normal distribution is central to statistical inference due to the Central Limit Theorem
3. Z-scores standardize observations for comparison across different distributions
4. The binomial distribution can be approximated by the normal distribution under certain conditions
5. The t-distribution, chi-squared distribution, and F-distribution are crucial for statistical inference when population parameters are unknown
6. Understanding the properties of these distributions is essential for hypothesis testing and confidence interval construction

## Midterm Exam Analysis and Solution Guide

## Part A: Exam Summary and Solution Guide

### Overview
This midterm exam covers material from Modules 1-4, testing students' understanding of:
- Research design and variable identification
- Descriptive statistics and data distribution
- Probability and random variables
- Normal distribution and standardization

The exam consists of 5 questions with multiple parts, totaling approximately 50 points.

### Question 1: Research Design Analysis

**Topic: Research Design and Variables (Module 1)**

This question presents a study on children's honesty and cheating behavior, asking students to:
- Identify the main research question
- Identify the subjects and sample size
- Identify variables and their types

#### Solution Guide:

**Part (a): Identify the main research question**
- The main research question is: "Does explicitly telling children not to cheat affect their likelihood to cheat?"
- This question identifies the explanatory variable (explicit instruction vs. no instruction) and the response variable (cheating behavior).

**Part (b): Identify subjects and sample size**
- Subjects: Children between the ages of 5 and 15 (or 10 and 15 in version 2)
- Sample size: 160 children (or 100 children in version 2)

**Part (c): Identify variables and their types**
Four variables were recorded:
1. Age (numerical, continuous)
2. Sex (categorical)
3. Whether they were an only child or not (categorical)
4. Whether they cheated or not (categorical)

**Key Concepts Applied:**
- Distinguishing between numerical and categorical variables
- Identifying explanatory and response variables
- Understanding research design elements

### Question 2: Descriptive Statistics and Distribution Shape

**Topic: Summarizing Data (Module 2)**

This question provides summary statistics for exam grades and asks students to:
- Determine if the distribution is left-skewed, right-skewed, or symmetric
- Identify if there are any outliers

#### Solution Guide:

**Part (a): Determine the shape of the distribution**

For Version 1:
- Mean = 78, Median = 76
- Since Mean > Median, this suggests a right-skewed distribution
- However, Q1 (60.25) is farther from the median than Q3 (86.5), which suggests left skew
- The distribution is slightly left-skewed

For Version 2:
- Mean = 80, Median = 84
- Since Mean < Median, this suggests a left-skewed distribution
- However, Q3 (95) is farther from the median than Q1 (76), which suggests right skew
- The distribution is slightly right-skewed

**Part (b): Identify outliers**

For Version 1:
- Calculate the IQR = Q3 - Q1 = 86.5 - 60.25 = 26.25
- Lower fence = Q1 - 1.5 × IQR = 60.25 - 1.5 × 26.25 = 20.875
- Upper fence = Q3 + 1.5 × IQR = 86.5 + 1.5 × 26.25 = 125.875
- Min = 30, Max = 95
- Since Min > Lower fence and Max < Upper fence, there are no outliers

For Version 2:
- Calculate the IQR = Q3 - Q1 = 95 - 76 = 19
- Lower fence = Q1 - 1.5 × IQR = 76 - 1.5 × 19 = 47.5
- Upper fence = Q3 + 1.5 × IQR = 95 + 1.5 × 19 = 123.5
- Min = 45, Max = 99
- Since Min < Lower fence, there is at least one outlier (on the lower end)
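
The fence calculations for both versions follow the same steps, which can be sketched as a pair of small helpers. The function names `iqr_fences` and `has_outliers` are assumptions made for this example.

```python
def iqr_fences(q1: float, q3: float):
    """Lower and upper fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def has_outliers(minimum: float, maximum: float, q1: float, q3: float) -> bool:
    """True if the min or max of the data falls outside the fences."""
    lower, upper = iqr_fences(q1, q3)
    return minimum < lower or maximum > upper

# Version 1: Q1 = 60.25, Q3 = 86.5, Min = 30, Max = 95
print(iqr_fences(60.25, 86.5), has_outliers(30, 95, 60.25, 86.5))
# Version 2: Q1 = 76, Q3 = 95, Min = 45, Max = 99
print(iqr_fences(76, 95), has_outliers(45, 99, 76, 95))
```

Checking only the minimum and maximum is enough to decide whether any outlier exists, because every other observation lies between them.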

**Key Concepts Applied:**
- Relationship between mean, median, and skewness
- Interquartile range (IQR) calculation
- Outlier identification using the 1.5 × IQR rule

### Question 3: Expected Value and Variance of Random Variables

**Topic: Random Variables (Module 3)**

This question tests understanding of expected value and variance properties for independent random variables, asking students to find the mean and standard deviation of linear combinations.

#### Solution Guide:

**Properties Used:**
For independent random variables X and Y:
- E[aX + bY + c] = aE[X] + bE[Y] + c
- Var(aX + bY + c) = a²Var(X) + b²Var(Y)
- SD(aX + bY + c) = √Var(aX + bY + c)

**Version 1 Solutions:**

Given:
- E[X] = 10, SD(X) = 2, E[Y] = 20, SD(Y) = 3

Part (a): Find mean and SD of X + 3Y
- E[X + 3Y] = E[X] + 3E[Y] = 10 + 3(20) = 10 + 60 = 70
- Var(X + 3Y) = Var(X) + 9Var(Y) = 2² + 9(3²) = 4 + 9(9) = 4 + 81 = 85
- SD(X + 3Y) = √85 ≈ 9.22

Part (b): Find mean and SD of 2X - Y - 6
- E[2X - Y - 6] = 2E[X] - E[Y] - 6 = 2(10) - 20 - 6 = 20 - 20 - 6 = -6
- Var(2X - Y - 6) = 4Var(X) + Var(Y) = 4(4) + 9 = 16 + 9 = 25
- SD(2X - Y - 6) = √25 = 5

**Version 2 Solutions:**

Given:
- E[X] = 40, Var(X) = 16, E[Y] = 20, Var(Y) = 9

Part (a): Find mean and SD of 4X - 40
- E[4X - 40] = 4E[X] - 40 = 4(40) - 40 = 160 - 40 = 120
- Var(4X - 40) = 16Var(X) = 16(16) = 256
- SD(4X - 40) = √256 = 16

Part (b): Find mean and SD of X - Y
- E[X - Y] = E[X] - E[Y] = 40 - 20 = 20
- Var(X - Y) = Var(X) + Var(Y) = 16 + 9 = 25
- SD(X - Y) = √25 = 5
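
All four parts apply the same two rules, so they can be checked with one helper. This is a sketch; the function name `linear_combination` and its argument order are assumptions, and the variance rule used is valid only for independent X and Y.

```python
from math import sqrt

def linear_combination(a, mean_x, var_x, b=0.0, mean_y=0.0, var_y=0.0, c=0.0):
    """Mean and SD of aX + bY + c for independent X and Y."""
    mean = a * mean_x + b * mean_y + c
    variance = a**2 * var_x + b**2 * var_y  # the constant c adds no variance
    return mean, sqrt(variance)

# Version 1: E[X] = 10, Var(X) = 2^2 = 4, E[Y] = 20, Var(Y) = 3^2 = 9
print(linear_combination(1, 10, 4, 3, 20, 9))         # X + 3Y
print(linear_combination(2, 10, 4, -1, 20, 9, c=-6))  # 2X - Y - 6

# Version 2: E[X] = 40, Var(X) = 16, E[Y] = 20, Var(Y) = 9
print(linear_combination(4, 40, 16, c=-40))           # 4X - 40
print(linear_combination(1, 40, 16, -1, 20, 9))       # X - Y
```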

**Key Concepts Applied:**
- Properties of expected value for linear combinations
- Properties of variance for independent random variables
- Relationship between variance and standard deviation

### Question 4: Normal Distribution Probabilities

**Topic: Normal Distribution (Module 4)**

This question tests understanding of the normal distribution and standardization, asking students to find probabilities for normally distributed test scores.

#### Solution Guide:

**Key Formula:**
For a normal random variable X with mean μ and standard deviation σ:
- Z = (X - μ)/σ follows the standard normal distribution
- Use Z-scores to find probabilities using standard normal tables or calculators

**Version 1 Solutions:**

Given:
- Test scores are normally distributed with mean μ = 100 and standard deviation σ = 15

Part (a): Find P(X > 90)
- Z = (90 - 100)/15 = -0.67
- P(X > 90) = P(Z > -0.67) = 1 - P(Z < -0.67) = 1 - 0.251 = 0.749

Part (b): Find P(112 < X < 132)
- Z₁ = (112 - 100)/15 = 0.8
- Z₂ = (132 - 100)/15 = 2.13
- P(112 < X < 132) = P(0.8 < Z < 2.13) = P(Z < 2.13) - P(Z < 0.8) = 0.983 - 0.788 = 0.195

**Version 2 Solutions:**

Given:
- Test scores are normally distributed with mean μ = 1500 and standard deviation σ = 300

Part (a): Find P(X < 1600)
- Z = (1600 - 1500)/300 = 0.33
- P(X < 1600) = P(Z < 0.33) = 0.629

Part (b): Find P(1200 < X < 1700)
- Z₁ = (1200 - 1500)/300 = -1
- Z₂ = (1700 - 1500)/300 = 0.67
- P(1200 < X < 1700) = P(-1 < Z < 0.67) = P(Z < 0.67) - P(Z < -1) = 0.749 - 0.159 = 0.59
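
These table-based answers can be checked with software. The sketch below uses `statistics.NormalDist`; because it does not round the Z-scores to two decimals, its results can differ from the table values in the third decimal place.

```python
from statistics import NormalDist

# Version 1: mean 100, standard deviation 15
v1 = NormalDist(mu=100, sigma=15)
p_v1_a = 1 - v1.cdf(90)              # P(X > 90)
p_v1_b = v1.cdf(132) - v1.cdf(112)   # P(112 < X < 132)

# Version 2: mean 1500, standard deviation 300
v2 = NormalDist(mu=1500, sigma=300)
p_v2_a = v2.cdf(1600)                # P(X < 1600)
p_v2_b = v2.cdf(1700) - v2.cdf(1200) # P(1200 < X < 1700)

print(round(p_v1_a, 3), round(p_v1_b, 3), round(p_v2_a, 3), round(p_v2_b, 3))
```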

**Key Concepts Applied:**
- Standardizing normal random variables
- Using Z-scores to find probabilities
- Finding probabilities for intervals

### Question 5: Probability and Independence

**Topic: Probability Concepts (Module 3)**

This question tests understanding of probability, independence, and conditional probability in the context of a classroom scenario.

#### Solution Guide:

Given:
- 20 students total
- 10 students have brown eyes (event E)
- 8 students are left-handed (event F)
- 3 students have brown eyes and are left-handed (event E ∩ F)

**Part (a): Determine if events E and F are independent**

Two events are independent if P(E ∩ F) = P(E) × P(F)

Calculate:
- P(E) = 10/20 = 0.5
- P(F) = 8/20 = 0.4
- P(E ∩ F) = 3/20 = 0.15
- P(E) × P(F) = 0.5 × 0.4 = 0.2

Since P(E ∩ F) = 0.15 ≠ 0.2 = P(E) × P(F), the events E and F are not independent.

**Part (b): Find the probability that a left-handed student does not have brown eyes**

This is asking for P(E^c | F), where E^c is the complement of E (not having brown eyes).

Method 1:
- P(E^c | F) = P(E^c ∩ F)/P(F)
- Number of left-handed students without brown eyes = 8 - 3 = 5
- P(E^c ∩ F) = 5/20 = 0.25
- P(E^c | F) = 0.25/0.4 = 5/8 = 0.625

Method 2:
- P(E^c | F) = 1 - P(E | F)
- P(E | F) = P(E ∩ F)/P(F) = 3/8 = 0.375
- P(E^c | F) = 1 - 0.375 = 0.625

The probability that a left-handed student does not have brown eyes is 5/8 or 0.625.
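
Both parts can be verified from the raw counts with exact fractions. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction

total = 20
brown, left, both = 10, 8, 3  # counts for E, F, and E ∩ F

p_e = Fraction(brown, total)
p_f = Fraction(left, total)
p_e_and_f = Fraction(both, total)

# Part (a): independence holds only if P(E ∩ F) equals P(E) * P(F)
independent = p_e_and_f == p_e * p_f

# Part (b): P(E^c | F) = P(E^c ∩ F) / P(F), where P(E^c ∩ F) = P(F) - P(E ∩ F)
p_not_e_given_f = (p_f - p_e_and_f) / p_f

print(independent, p_not_e_given_f, float(p_not_e_given_f))
```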

**Key Concepts Applied:**
- Definition of independence
- Conditional probability
- Complement of events

## Part B: Trend Analysis

### Relationship Between Exam Questions and Module Content

#### Module 1 Coverage
**Question 1** directly tests concepts from Module 1:
- Research design and methodology
- Identifying variables and their types (categorical vs. numerical)
- Understanding subjects and sampling

This question aligns with Module 1's focus on the foundations of statistics, data collection methods, and variable classification. The question tests students' ability to analyze a research study and identify its key components.

#### Module 2 Coverage
**Question 2** tests concepts from Module 2:
- Descriptive statistics (mean, median, quartiles)
- Distribution shape (skewness)
- Outlier identification using IQR

This question directly applies the concepts of summarizing numerical data, understanding distribution shapes, and identifying outliers using the 1.5 × IQR rule, which are core topics in Module 2.

#### Module 3 Coverage
**Questions 3 and 5** test concepts from Module 3:
- Expected value and variance of random variables (Question 3)
- Properties of linear combinations of random variables (Question 3)
- Probability concepts, independence, and conditional probability (Question 5)

These questions assess students' understanding of probability theory, random variables, and their properties, which are the main focus of Module 3.

#### Module 4 Coverage
**Question 4** tests concepts from Module 4:
- Normal distribution
- Standardization (Z-scores)
- Finding probabilities using the standard normal distribution

This question directly applies the concepts of the normal distribution and using Z-scores to find probabilities, which are central topics in Module 4.

### Trends and Patterns in Exam Emphasis

1. **Equal Coverage Across Modules**
   - The exam provides balanced coverage of all four modules, with approximately one question per module (with Module 3 having slightly more emphasis with two questions).

2. **Focus on Computational Skills**
   - Most questions require calculations and application of formulas rather than theoretical explanations.
   - Students need to demonstrate their ability to apply statistical concepts to solve problems.

3. **Real-World Applications**
   - Questions are framed in real-world contexts (research studies, test scores, classroom scenarios).
   - This emphasizes the practical application of statistical concepts.

4. **Multiple Versions of Questions**
   - The exam includes multiple versions of the same question with different numerical values.
   - This suggests an emphasis on understanding the underlying concepts rather than memorizing specific solutions.

5. **Progressive Difficulty**
   - The exam progresses from more straightforward conceptual questions (Question 1) to more complex computational problems (Questions 3-5).
   - This allows students to demonstrate both basic understanding and advanced application.

6. **Integration of Concepts**
   - Some questions require combining several related concepts in a single problem.
   - For example, Question 5 combines basic probability, independence, and conditional probability.

7. **Emphasis on Core Statistical Skills**
   - The exam focuses on fundamental statistical skills that form the foundation for more advanced topics:
     - Understanding research design
     - Summarizing and interpreting data
     - Working with probability and random variables
     - Using the normal distribution

### Conclusion

The midterm exam provides a comprehensive assessment of students' understanding of the first four modules of the course. It emphasizes computational skills, application of statistical concepts to real-world scenarios, and integration of concepts across modules. The balanced coverage ensures that students have a solid foundation in the fundamental principles of probability and statistics before moving on to more advanced topics in later modules.

## Module 5 Summary: Introduction to Statistical Inference

## 1. Introduction to Statistical Inference

### What is Statistical Inference?
- **Definition**: The process by which we estimate parameters of interest from data and quantify the uncertainty in our estimates
- **Key Components**:
  - Point estimates: Single values that estimate population parameters
  - Confidence intervals: Ranges of plausible values for population parameters
  - Hypothesis tests: Methods to evaluate claims about the population

### Population Parameters vs. Sample Statistics
- **Population Parameter**: A numerical characteristic of the entire population (usually unknown)
  - Example: The true proportion of all voters who support a candidate (p)
- **Sample Statistic**: A numerical characteristic calculated from a sample (known)
  - Example: The proportion of sampled voters who support a candidate (p̂)

### Sources of Error in Estimation
- **Sampling Error**: Variability that occurs due to random sampling
  - Related to sample size (n)
  - Unavoidable but can be quantified
- **Bias**: Systematic error that causes estimates to consistently deviate from the true parameter
  - Examples: Non-response bias, selection bias, poorly worded questions
  - Can be minimized through proper study design

## 2. Sampling Distribution of a Proportion

### Sampling Distribution
- **Definition**: The distribution of a sample statistic over all possible samples of the same size from the same population
- **Properties of the Sampling Distribution of p̂**:
  - Center: The mean of p̂ is p (the population parameter)
  - Spread: The standard error of p̂ is $\sqrt{\frac{p(1-p)}{n}}$
  - Shape: Approximately normal for large enough samples

### Standard Error
- **Definition**: The standard deviation of the sampling distribution
- **Formula for Proportion**: 
  $$SE_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
- **Interpretation**: Measures the typical deviation between the sample proportion and the population proportion
- **Example**: For a sample of 1000 people with p = 0.75:
  $$SE_{\hat{p}} = \sqrt{\frac{0.75 \times 0.25}{1000}} \approx 0.0137$$

### Central Limit Theorem (CLT) for Proportions
- **Statement**: For large enough samples, the sampling distribution of p̂ is approximately normal with:
  - Mean: μ = p
  - Standard error: $SE_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

- **Success/Failure Condition**: For the CLT to apply, we need:
  - np ≥ 10 (at least 10 successes)
  - n(1-p) ≥ 10 (at least 10 failures)

### Using the CLT in Practice
- Since p is unknown in real applications, we use p̂ to:
  1. Check the success/failure condition: np̂ ≥ 10 and n(1-p̂) ≥ 10
  2. Estimate the standard error: $SE_{\hat{p}} \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

### Example: Applying the CLT
If 761 out of 1000 randomly sampled people support candidate A:
- Sample proportion: p̂ = 761/1000 = 0.761
- Estimated standard error: $SE_{\hat{p}} \approx \sqrt{\frac{0.761 \times 0.239}{1000}} \approx 0.0135$
- Success/failure check: 
  - np̂ = 1000 × 0.761 = 761 ≥ 10 ✓
  - n(1-p̂) = 1000 × 0.239 = 239 ≥ 10 ✓
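
The two practical steps (estimating the standard error and checking the success/failure condition) can be sketched as small helpers. The function names are assumptions made for this example.

```python
from math import sqrt

def standard_error(p_hat: float, n: int) -> float:
    """Estimated standard error of a sample proportion."""
    return sqrt(p_hat * (1 - p_hat) / n)

def success_failure_ok(p_hat: float, n: int, threshold: float = 10) -> bool:
    """True if both n*p_hat and n*(1 - p_hat) are at least the threshold."""
    return n * p_hat >= threshold and n * (1 - p_hat) >= threshold

p_hat = 761 / 1000
se = standard_error(p_hat, 1000)
print(round(se, 4), success_failure_ok(p_hat, 1000))
```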

## 3. Confidence Intervals for a Proportion

### Definition and Interpretation
- **Confidence Interval**: A range of plausible values for the population parameter
- **General Form**: Point estimate ± Margin of error
- **Interpretation**: If we repeatedly collect samples and construct confidence intervals using the same method, approximately (confidence level)% of these intervals would contain the true population parameter

### Formula for Confidence Interval
- **Formula**: 
  $$\hat{p} \pm z^* \times SE_{\hat{p}}$$
  where z* is the critical value corresponding to the desired confidence level

- **Margin of Error**: 
  $$MOE = z^* \times SE_{\hat{p}}$$

### Common Critical Values (z*)
- 90% confidence: z* = 1.645
- 95% confidence: z* = 1.96
- 99% confidence: z* = 2.58
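
The tabulated critical values are rounded quantiles of the standard normal distribution: for a confidence level C, z* cuts off an area of (1 + C)/2 to its left. A sketch, with `critical_value` as an assumed helper name:

```python
from statistics import NormalDist

def critical_value(confidence_level: float) -> float:
    """z* such that the central area under N(0, 1) equals the confidence level."""
    return NormalDist().inv_cdf((1 + confidence_level) / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(critical_value(level), 3))
```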

### Example: Constructing a 95% Confidence Interval
For a sample where 761 out of 1000 people support candidate A:
- p̂ = 0.761
- SE = 0.0135
- 95% confidence interval:
  $$0.761 \pm 1.96 \times 0.0135 = 0.761 \pm 0.0265 = (0.7345, 0.7875)$$

- Interpretation: We are 95% confident that the true proportion of people who support candidate A is between 73.45% and 78.75%.
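
The interval above can be reproduced with a short function. This is a sketch under the assumptions of the example (a random sample and the success/failure condition holding); `proportion_ci` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes: int, n: int, confidence_level: float = 0.95):
    """Confidence interval p_hat +/- z* x SE for a population proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z_star = NormalDist().inv_cdf((1 + confidence_level) / 2)
    margin = z_star * se
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(761, 1000)
print(round(low, 4), round(high, 4))
```

Since the function keeps full precision for z* and the standard error, the endpoints may differ from the hand calculation in the fourth decimal place.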

### Effect of Confidence Level on Interval Width
- Higher confidence level → Wider interval
- Lower confidence level → Narrower interval
- Trade-off between precision (narrow interval) and confidence (high percentage)

## 4. Sample Size Determination

### Determining Sample Size for a Desired Margin of Error
- **Formula**: 
  $$n = \frac{z^{*2} \times p(1-p)}{MOE^2}$$

- Since p is unknown before sampling, we can:
  1. Use a previous estimate of p
  2. Use p = 0.5 (which maximizes p(1-p) and gives the largest, most conservative sample size)

### Example: Sample Size Calculation
To achieve a margin of error of 5% with 95% confidence:
- z* = 1.96
- Using p = 0.5 (conservative approach):
  $$n = \frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} = \frac{0.9604}{0.0025} = 384.16$$
- We would need at least 385 participants.
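
The same calculation as a reusable sketch. The function name and its defaults (z* = 1.96 and the conservative p = 0.5) are assumptions made for this example; rounding is always upward because a fractional participant is not possible.

```python
from math import ceil

def required_sample_size(margin_of_error: float, z_star: float = 1.96, p: float = 0.5) -> int:
    """Smallest n whose margin of error does not exceed the target."""
    return ceil(z_star**2 * p * (1 - p) / margin_of_error**2)

print(required_sample_size(0.05))  # 5% margin of error at 95% confidence
```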

## 5. Common Misconceptions about Confidence Intervals

### What a Confidence Interval IS:
- A range of plausible values for the population parameter
- A procedure that, when repeated, captures the true parameter at the specified rate
- A measure of the reliability of our estimation method

### What a Confidence Interval IS NOT:
- A probability statement about the parameter (once calculated, the interval either contains the parameter or it doesn't)
- A statement that X% of the population falls within the interval
- A guarantee that the true parameter is in any specific interval
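
The repeated-sampling interpretation can be illustrated by simulation: draw many samples from a population with a known proportion, build a 95% interval from each, and count how often the interval captures the true value. The true proportion (0.3), sample size (200), and number of repetitions (2000) below are arbitrary assumptions.

```python
import random
from math import sqrt

random.seed(3)  # fixed seed so the run is reproducible

true_p, n, reps, z_star = 0.3, 200, 2000, 1.96

covered = 0
for _ in range(reps):
    # Simulate one sample of n yes/no responses and compute p_hat
    successes = sum(random.random() < true_p for _ in range(n))
    p_hat = successes / n
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    # Each individual interval either contains true_p or it does not
    if p_hat - margin <= true_p <= p_hat + margin:
        covered += 1

coverage = covered / reps
print(coverage)  # fraction of intervals that captured the true proportion
```

The 95% describes the long-run capture rate of the procedure, not the probability that any one computed interval contains the parameter.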

## 6. Practical Examples

### Example 1: Teenage Phone Usage
A random sample of 1000 teenagers were interviewed about their average daily phone use. About 60% said they spent around 5-7 hours on their phones per day.

- Sample proportion: p̂ = 0.6
- Standard error: $SE_{\hat{p}} = \sqrt{\frac{0.6 \times 0.4}{1000}} \approx 0.0155$
- 95% confidence interval: 
  $$0.6 \pm 1.96 \times 0.0155 = 0.6 \pm 0.030 = (0.570, 0.630)$$

- Interpretation: We are 95% confident that the true proportion of teenagers who spend 5-7 hours on their phones daily is between 57.0% and 63.0%.

### Example 2: Roller Coaster Survey
According to a survey, around 38% of American teenagers are scared to ride roller coasters. This survey was conducted based on a random sample of 800 teenagers.

- Checking CLT conditions:
  - np̂ = 800 × 0.38 = 304 ≥ 10 ✓
  - n(1-p̂) = 800 × 0.62 = 496 ≥ 10 ✓
- Standard error: $SE_{\hat{p}} = \sqrt{\frac{0.38 \times 0.62}{800}} \approx 0.0172$
- 95% confidence interval:
  $$0.38 \pm 1.96 \times 0.0172 = 0.38 \pm 0.034 = (0.346, 0.414)$$

### Example 3: E-book Preferences
A research firm wants to find the proportion of adults who prefer traditional books to E-books. They took a random sample of 200 adults and found that 65% like E-books better.

- Standard error: $SE_{\hat{p}} = \sqrt{\frac{0.65 \times 0.35}{200}} \approx 0.0337$
- For an 80% confidence interval, z* = 1.28
- 80% confidence interval:
  $$0.65 \pm 1.28 \times 0.0337 = 0.65 \pm 0.043 = (0.607, 0.693)$$

- Interpretation: We are 80% confident that the true proportion of adults who prefer E-books is between 60.7% and 69.3%.

### Example 4: Sample Size for Delivery Company
A delivery company wants to estimate the proportion of on-time deliveries with a margin of error of 3% at 95% confidence. How many customers should they survey?

- Using p̂ = 0.5 (conservative approach):
  $$n = \frac{1.96^2 \times 0.5 \times 0.5}{0.03^2} = \frac{0.9604}{0.0009} = 1067.11$$
- They should survey at least 1068 customers.

## 7. Key Takeaways

1. Statistical inference allows us to make educated guesses about population parameters based on sample data.

2. The Central Limit Theorem tells us that for large enough samples, the sampling distribution of p̂ is approximately normal, regardless of the shape of the population distribution.

3. Confidence intervals provide a range of plausible values for the population parameter, along with a measure of confidence in that range.

4. The width of a confidence interval is affected by:
   - Sample size (n): Larger samples lead to narrower intervals
   - Confidence level: Higher confidence requires wider intervals
   - Population variability: More variable populations require wider intervals

5. When interpreting confidence intervals, be careful to avoid common misconceptions:
   - A 95% confidence interval does not mean there is a 95% probability that the parameter is in that specific interval
   - Instead, it means that if we repeated the sampling process many times, about 95% of the resulting intervals would contain the true parameter

6. Sample size calculations help us determine how many observations we need to achieve a desired level of precision in our estimates.

## Module 6 Summary: Hypothesis Testing for a Proportion

## 1. Introduction to Hypothesis Testing

### The Hypothesis Testing Framework
- **Definition**: A formal procedure for evaluating claims about a population parameter
- **Purpose**: To determine whether sample data provides sufficient evidence against a specified claim
- **Key Components**:
  - Null Hypothesis (H₀): The "status quo" or "skeptical" perspective
  - Alternative Hypothesis (H₁ or H<sub>A</sub>): The claim being investigated

### Types of Hypotheses
- **Null Hypothesis (H₀)**:
  - Makes a specific claim about the parameter (often that there is "no effect" or "no difference")
  - Always contains an equality (=, ≤, or ≥)
  - Example: H₀: p = 0.5 (the proportion equals 0.5)

- **Alternative Hypothesis (H<sub>A</sub>)**:
  - The claim we are looking for evidence to support
  - Contains only inequalities (<, >, or ≠)
  - Example: H<sub>A</sub>: p > 0.5 (the proportion is greater than 0.5)

### Types of Hypothesis Tests
- **Two-sided test**: H<sub>A</sub> claims the parameter is different from the null value (≠)
  - Example: H₀: p = 0.5 vs. H<sub>A</sub>: p ≠ 0.5
- **One-sided test**: H<sub>A</sub> claims the parameter is either greater than (>) or less than (<) the null value
  - Example: H₀: p ≤ 0.5 vs. H<sub>A</sub>: p > 0.5
  - Example: H₀: p ≥ 0.5 vs. H<sub>A</sub>: p < 0.5

## 2. The Logic of Hypothesis Testing

### Decision Process
- We start by assuming the null hypothesis is true
- We collect sample data and calculate a test statistic
- We determine how likely our observed result (or more extreme) would be if H₀ were true
- Based on this probability (p-value), we either:
  - Reject H₀ if the evidence against it is strong enough
  - Fail to reject H₀ if the evidence is not strong enough

### Important Note
- Failing to reject H₀ does not mean we accept or prove H₀
- It simply means we don't have enough evidence to reject it
- This is similar to "innocent until proven guilty" in a legal system

## 3. Errors in Hypothesis Testing

### Type I Error
- **Definition**: Rejecting a true null hypothesis
- **Probability**: α (significance level)
- **Example**: Concluding a medical treatment is effective when it actually isn't

### Type II Error
- **Definition**: Failing to reject a false null hypothesis
- **Probability**: β
- **Example**: Failing to detect that a medical treatment is effective when it actually is

### Relationship Between Errors
- There is a trade-off between Type I and Type II errors
- Decreasing α (making it harder to reject H₀) increases β
- Increasing α (making it easier to reject H₀) decreases β

### Power of a Test
- **Definition**: The probability of correctly rejecting a false null hypothesis (1 - β)
- **Factors affecting power**:
  - Sample size (larger samples increase power)
  - Effect size (larger differences from H₀ are easier to detect)
  - Significance level (higher α increases power but also increases Type I error risk)
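The effect of these three factors can be explored numerically. The following is a minimal sketch, assuming a right-tailed one-proportion z-test under the normal approximation. The function name `power_right_tailed` and the scenario values (p₀ = 0.5, true p = 0.55) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed(p0, p_true, n, alpha=0.05):
    """Approximate power of a right-tailed z-test of H0: p = p0 when the true proportion is p_true."""
    std_normal = NormalDist()
    se_null = sqrt(p0 * (1 - p0) / n)            # SE that sets the rejection cutoff
    se_true = sqrt(p_true * (1 - p_true) / n)    # SE of p-hat under the true proportion
    cutoff = p0 + std_normal.inv_cdf(1 - alpha) * se_null  # reject H0 when p-hat exceeds this
    return 1 - std_normal.cdf((cutoff - p_true) / se_true)

# Hypothetical scenario: H0: p = 0.5, true p = 0.55
for n, alpha in [(100, 0.05), (400, 0.05), (400, 0.01)]:
    power = power_right_tailed(0.5, 0.55, n, alpha)
    print(f"n={n}, alpha={alpha}: power={power:.3f}, beta={1 - power:.3f}")
```

Comparing the printed rows shows power rising with sample size and falling when α is made smaller, which is the trade-off described above.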

## 4. Testing Hypotheses Using Confidence Intervals

### Procedure
1. Calculate a confidence interval for the parameter
2. Check if the null value (from H₀) falls within the interval
3. If the null value is outside the interval, reject H₀
4. If the null value is inside the interval, fail to reject H₀

### Example: Multiple Choice Question
A multiple choice question has 4 options. We want to test whether adults perform differently from random guessing.
- H₀: p = 0.25 (adults are as accurate as random guessing)
- H<sub>A</sub>: p ≠ 0.25 (adults perform differently than random guessing)

**Scenario 1**: 21 out of 100 adults answer correctly
- Sample proportion: p̂ = 0.21
- Standard error: SE = √[0.21(1-0.21)/100] = 0.0407
- 95% confidence interval: p̂ ± 1.96 × SE = 0.21 ± 1.96 × 0.0407 = (0.1302, 0.2898)
- Since p₀ = 0.25 falls within the interval, we fail to reject H₀

**Scenario 2**: 37 out of 100 adults answer correctly
- Sample proportion: p̂ = 0.37
- Standard error: SE = √[0.37(1-0.37)/100] = 0.0483
- 95% confidence interval: p̂ ± 1.96 × SE = 0.37 ± 1.96 × 0.0483 = (0.2754, 0.4646)
- Since p₀ = 0.25 falls outside the interval, we reject H₀
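The two scenarios can be reproduced with a short script. This is a sketch under the same normal-approximation assumptions as the hand calculation; the function name `ci_test` is hypothetical.

```python
from math import sqrt

def ci_test(successes, n, p_null, z_star=1.96):
    """Return the confidence interval for p and whether the null value lies outside it."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)      # the interval uses p-hat, not the null value
    lower, upper = p_hat - z_star * se, p_hat + z_star * se
    reject = not (lower <= p_null <= upper)
    return (lower, upper), reject

for correct in (21, 37):                    # Scenario 1 and Scenario 2
    (lo, hi), reject = ci_test(correct, 100, p_null=0.25)
    decision = "reject H0" if reject else "fail to reject H0"
    print(f"{correct}/100: ({lo:.4f}, {hi:.4f}) -> {decision}")
```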

## 5. Testing Hypotheses Using P-values

### P-value
- **Definition**: The probability of observing a test statistic as extreme as, or more extreme than, the one observed, assuming H₀ is true
- **Interpretation**: A measure of the strength of evidence against H₀
- **Decision rule**: Reject H₀ if p-value < α

### Calculating P-values for a Proportion Test
1. Calculate the test statistic:
   $$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$
   where p₀ is the null value, p̂ is the sample proportion, and n is the sample size

2. Find the p-value:
   - For H<sub>A</sub>: p ≠ p₀ (two-sided): P(Z ≤ -|z|) + P(Z ≥ |z|) = 2 × P(Z ≥ |z|)
   - For H<sub>A</sub>: p > p₀ (right-tailed): P(Z ≥ z)
   - For H<sub>A</sub>: p < p₀ (left-tailed): P(Z ≤ z)

### Example: Policy Support
We want to test whether the proportion of adults who support Policy A differs from 50%.
- H₀: p = 0.5 (50% support)
- H<sub>A</sub>: p ≠ 0.5 (support differs from 50%)

A random sample of 1000 adults shows 42% support the policy.
- Test statistic: z = (0.42 - 0.5)/√[0.5(1-0.5)/1000] = -0.08/0.016 = -5
- P-value (two-sided): 2 × P(Z ≥ 5) ≈ 5.73 × 10⁻⁷
- Since p-value < 0.05, we reject H₀
- Conclusion: There is strong evidence that the proportion of adults who support Policy A differs from 50% (specifically, it appears to be less than 50%)

## 6. Steps for Conducting a Hypothesis Test for a Proportion

### Step 1: Prepare
- Identify the parameter of interest
- State the null and alternative hypotheses
- Choose the significance level α
- Identify the sample proportion p̂ and sample size n

### Step 2: Check
- Verify that the sample is random or representative
- Check the success/failure condition using the null value p₀:
  - np₀ ≥ 10
  - n(1-p₀) ≥ 10

### Step 3: Calculate
- Compute the standard error using the null value: SE = √[p₀(1-p₀)/n]
- Calculate the test statistic: z = (p̂ - p₀)/SE
- Find the p-value based on the form of H<sub>A</sub>

### Step 4: Conclude
- Compare the p-value to α
- Make a decision: reject H₀ or fail to reject H₀
- State the conclusion in the context of the problem
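The four steps can be sketched as a single function. This is an illustrative outline, not a replacement for checking the study design: the function name `one_proportion_test` and the data in the usage lines (60 successes in 100 trials) are hypothetical, and the random-sample requirement cannot be verified in code.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_test(successes, n, p0, alpha=0.05, alternative="two-sided"):
    """Prepare, check, calculate, conclude for H0: p = p0 (normal approximation)."""
    # Step 1 (prepare): sample proportion
    p_hat = successes / n
    # Step 2 (check): the success/failure condition uses the null value p0
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("success/failure condition not met")
    # Step 3 (calculate): SE under H0, z statistic, p-value
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    std_normal = NormalDist()
    if alternative == "two-sided":
        p_value = 2 * std_normal.cdf(-abs(z))
    elif alternative == "greater":
        p_value = std_normal.cdf(-z)        # upper tail P(Z >= z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)         # lower tail P(Z <= z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    # Step 4 (conclude): compare the p-value to alpha
    return {"p_hat": p_hat, "z": z, "p_value": p_value, "reject": p_value < alpha}

# Hypothetical data: 60 successes in 100 trials, testing H0: p = 0.5
result = one_proportion_test(60, 100, p0=0.5)
print(result)
```

The returned decision still has to be translated into a conclusion in the context of the problem.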

## 7. Choosing the Significance Level (α)

### Common Choices
- α = 0.05 (5%): Standard in many fields
- α = 0.01 (1%): More stringent, used when Type I errors are costly
- α = 0.10 (10%): More lenient, used when Type II errors are costly

### Considerations
- **Type I Error Concerns**: Use smaller α when falsely rejecting H₀ is serious
  - Example: Approving an ineffective drug (α = 0.01 or lower)
- **Type II Error Concerns**: Use larger α when failing to detect an effect is serious
  - Example: Missing a potential environmental hazard (α = 0.10)

## 8. Practical Examples

### Example 1: College Dropout Rate
A study aims to determine if the dropout rate for undergraduate college students has changed from the 2018 rate of 40%.

- Parameter of interest: p = proportion of undergraduate students who drop out
- H₀: p = 0.4
- H<sub>A</sub>: p ≠ 0.4
- Significance level: α = 0.05

Suppose a random sample of 500 students shows 175 dropped out (p̂ = 0.35).
- Check conditions: np₀ = 500 × 0.4 = 200 ≥ 10; n(1-p₀) = 500 × 0.6 = 300 ≥ 10 ✓
- SE = √[0.4(1-0.4)/500] = 0.022
- Test statistic: z = (0.35 - 0.4)/0.022 = -2.27
- P-value (two-sided): 2 × P(Z ≥ 2.27) ≈ 0.023
- Decision: Since p-value < 0.05, reject H₀
- Conclusion: There is sufficient evidence to conclude that the dropout rate has changed from 40% (appears to have decreased).

### Example 2: New Drink Flavor
A company will introduce a new drink to the market if more than 65% of people like the flavor.

- Parameter of interest: p = proportion of people who like the flavor
- H₀: p = 0.65
- H<sub>A</sub>: p > 0.65
- Significance level: α = 0.05

Suppose a random sample of 200 people shows 140 like the flavor (p̂ = 0.7).
- Check conditions: np₀ = 200 × 0.65 = 130 ≥ 10; n(1-p₀) = 200 × 0.35 = 70 ≥ 10 ✓
- SE = √[0.65(1-0.65)/200] = 0.034
- Test statistic: z = (0.7 - 0.65)/0.034 = 1.47
- P-value (right-tailed): P(Z ≥ 1.47) ≈ 0.071
- Decision: Since p-value > 0.05, fail to reject H₀
- Conclusion: There is insufficient evidence to conclude that more than 65% of people like the flavor.

### Example 3: Law Support in a Small Town
A study investigates whether the proportion of residents in a small town who support a certain law is 68%.

- Parameter of interest: p = proportion of residents who support the law
- H₀: p = 0.68
- H<sub>A</sub>: p ≠ 0.68
- Significance level: α = 0.05

A random sample of 200 residents shows 140 support the law (p̂ = 0.7).
- Check conditions: np₀ = 200 × 0.68 = 136 ≥ 10; n(1-p₀) = 200 × 0.32 = 64 ≥ 10 ✓
- 95% confidence interval: p̂ ± 1.96 × √[p̂(1-p̂)/n] = 0.7 ± 1.96 × √[0.7(1-0.7)/200] = 0.7 ± 0.064 = (0.636, 0.764)
- Since p₀ = 0.68 falls within the interval, fail to reject H₀
- Conclusion: There is insufficient evidence to conclude that the proportion of residents who support the law differs from 68%.

## 9. Common Misconceptions and Pitfalls

### Misconception 1: Failing to reject H₀ means H₀ is true
- Correct interpretation: We don't have enough evidence to reject H₀, not that H₀ is proven true

### Misconception 2: P-value is the probability that H₀ is true
- Correct interpretation: P-value is the probability of observing data at least as extreme as ours if H₀ were true

### Misconception 3: Statistical significance implies practical significance
- A result can be statistically significant but too small to matter in practice
- Always consider the context and magnitude of the effect

### Pitfall 1: Using p̂ instead of p₀ to calculate SE in hypothesis testing
- For confidence intervals: Use p̂ to calculate SE
- For hypothesis tests: Use p₀ to calculate SE

### Pitfall 2: Incorrect formulation of hypotheses
- H₀ must contain an equality (=, ≤, or ≥)
- H<sub>A</sub> must contain only inequalities (<, >, or ≠)
- Parameters (p, μ) should be used, not statistics (p̂, x̄)

## 10. Key Takeaways

1. Hypothesis testing provides a formal framework for making decisions based on data

2. The null hypothesis (H₀) represents the status quo or skeptical perspective, while the alternative hypothesis (H<sub>A</sub>) represents the claim we're looking for evidence to support

3. Two approaches to hypothesis testing:
   - Using confidence intervals: Reject H₀ if the null value falls outside the interval
   - Using p-values: Reject H₀ if the p-value is less than the significance level α

4. Type I error occurs when we reject a true H₀; Type II error occurs when we fail to reject a false H₀

5. The significance level α represents the probability of making a Type I error

6. When conducting a hypothesis test for a proportion:
   - Use p₀ (not p̂) to calculate the standard error
   - Check the success/failure condition using p₀
   - State conclusions in the context of the problem

7. The choice of significance level should balance the risks of Type I and Type II errors based on the specific context of the problem

## Module 7 Summary: Inference for Comparing Two Proportions

## 1. Introduction to Comparing Two Proportions

### Why Compare Two Proportions?
- **Purpose**: To determine if there is a significant difference between two population proportions
- **Applications**:
  - Medical studies (treatment vs. control groups)
  - Marketing (comparing effectiveness of two campaigns)
  - Social research (comparing behaviors across different demographics)
  - Quality control (comparing defect rates between manufacturing processes)

### Key Parameters and Statistics
- **Population Parameters**: p₁ and p₂ (true proportions in each population)
- **Sample Statistics**: p̂₁ and p̂₂ (observed proportions in each sample)
- **Parameter of Interest**: p₁ - p₂ (difference between population proportions)
- **Point Estimate**: p̂₁ - p̂₂ (difference between sample proportions)

## 2. Confidence Intervals for the Difference of Two Proportions

### Conditions for Valid Inference
1. **Independence**:
   - Data are independent within each group
   - Data are independent between groups
   - Satisfied by random sampling or random assignment

2. **Success/Failure Condition**:
   - For each group i (i = 1, 2):
     - nᵢp̂ᵢ ≥ 10 (at least 10 successes)
     - nᵢ(1-p̂ᵢ) ≥ 10 (at least 10 failures)

### Formula for the Standard Error
- **Standard Error for Difference in Proportions**:
  $$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

- **Estimated Standard Error** (using sample proportions):
  $$SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

### Confidence Interval Formula
- **General Formula**:
  $$(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE$$

- **Expanded Form**:
  $$(\hat{p}_1 - \hat{p}_2) \pm z^* \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

### Example: Blood Thinner Study
A study examined the effect of blood thinners on survival after heart attacks:

| Group | Survived | Died | Total |
|-------|----------|------|-------|
| Control | 11 | 39 | 50 |
| Treatment | 14 | 26 | 40 |

**Step 1**: Calculate sample proportions
- p̂ₜ (treatment) = 14/40 = 0.35
- p̂ₖ (control) = 11/50 = 0.22
- Difference: p̂ₜ - p̂ₖ = 0.35 - 0.22 = 0.13

**Step 2**: Check conditions
- Independence: Satisfied (randomized experiment)
- Success/Failure: All groups have at least 10 successes and failures

**Step 3**: Calculate standard error
$$SE = \sqrt{\frac{0.35(1-0.35)}{40} + \frac{0.22(1-0.22)}{50}} = 0.095$$

**Step 4**: Construct 90% confidence interval (z* = 1.65)
$$0.13 \pm 1.65 \times 0.095 = 0.13 \pm 0.157 = (-0.027, 0.287)$$

**Interpretation**: We are 90% confident that the effect of blood thinners on the survival rate lies between -2.7 percentage points (slightly harmful) and +28.7 percentage points (beneficial). Since the interval contains 0, we cannot conclude at this confidence level whether blood thinners help or harm in this context.
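The interval can be recomputed from the raw counts. The sketch below assumes the unpooled standard error given earlier in this section; the function name `two_proportion_ci` is hypothetical.

```python
from math import sqrt

def two_proportion_ci(x1, n1, x2, n2, z_star):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    # Success/failure check: at least 10 successes and 10 failures in each group
    conditions_ok = all(count >= 10 for count in (x1, n1 - x1, x2, n2 - x2))
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se, conditions_ok

# Blood thinner study: 14 of 40 survived (treatment), 11 of 50 survived (control)
lower, upper, ok = two_proportion_ci(14, 40, 11, 50, z_star=1.65)
print(f"90% CI: ({lower:.3f}, {upper:.3f}), conditions met: {ok}")
```

The endpoints may differ from the hand calculation in the third decimal place because the hand calculation rounds SE to 0.095 before multiplying.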

## 3. Hypothesis Testing for the Difference of Two Proportions

### Hypothesis Formulation
- **Null Hypothesis (H₀)**: p₁ - p₂ = 0 (no difference between proportions)
- **Alternative Hypothesis (H<sub>A</sub>)**:
  - Two-sided: p₁ - p₂ ≠ 0 (proportions are different)
  - One-sided: p₁ - p₂ > 0 or p₁ - p₂ < 0 (one proportion is larger)

### Pooled Proportion
- **When to Use**: When testing H₀: p₁ = p₂ (null hypothesis assumes equal proportions)
- **Formula**:
  $$\hat{p}_{pooled} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{\text{total successes}}{\text{total sample size}}$$

### Standard Error Under the Null Hypothesis
- **Formula**:
  $$SE = \sqrt{\hat{p}_{pooled}(1-\hat{p}_{pooled})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

### Test Statistic
- **Z-statistic**:
  $$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE} = \frac{\hat{p}_1 - \hat{p}_2}{SE}$$

### P-value Calculation
- **Two-sided test**: P(|Z| ≥ |observed z|)
- **Right-tailed test**: P(Z ≥ observed z)
- **Left-tailed test**: P(Z ≤ observed z)

### Example: Survival Rate Study
A large-scale study examined survival rates in treatment and control groups:

| Group | Died | Survived | Total |
|-------|----------|------|-------|
| Control | 505 | 44,405 | 44,910 |
| Treatment | 500 | 44,425 | 44,925 |

**Step 1**: State hypotheses
- H₀: pₜ - pₖ = 0 (no difference in death rates)
- H<sub>A</sub>: pₜ - pₖ ≠ 0 (death rates are different)

**Step 2**: Calculate sample proportions
- p̂ₜ = 500/44,925 = 0.01113
- p̂ₖ = 505/44,910 = 0.01124
- Difference: p̂ₜ - p̂ₖ = -0.00012

**Step 3**: Calculate pooled proportion
$$\hat{p}_{pooled} = \frac{500 + 505}{44,925 + 44,910} = \frac{1,005}{89,835} = 0.0112$$

**Step 4**: Check conditions
- Independence: Satisfied (randomized experiment)
- Success/Failure: All values (nₜ × p̂<sub>pooled</sub>, nₜ × (1 - p̂<sub>pooled</sub>), etc.) are greater than 10

**Step 5**: Calculate standard error
$$SE = \sqrt{0.0112 \times 0.9888 \times \left(\frac{1}{44,925} + \frac{1}{44,910}\right)} = 0.00070$$

**Step 6**: Calculate test statistic
$$Z = \frac{-0.00012 - 0}{0.00070} = -0.17$$

**Step 7**: Find p-value
For a two-sided test with Z = -0.17, p-value = 0.865

**Step 8**: Make decision
Since p-value = 0.865 > 0.05, we fail to reject H₀.

**Interpretation**: The difference in deaths between the control and treatment groups can be reasonably explained by chance. There is insufficient evidence to conclude that the treatment affects survival rates.
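The pooled test can be sketched in a few lines. The function name `two_proportion_z_test` is hypothetical; the counts are the ones used for the sample proportions in Step 2 (500 of 44,925 in the treatment group and 505 of 44,910 in the control group).

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided test of H0: p1 - p2 = 0 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)        # total successes / total sample size
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z)) # both tails, by symmetry
    return z, p_value

z, p_value = two_proportion_z_test(500, 44925, 505, 44910)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

Because the script carries full precision, its output can differ slightly from hand values that round the proportions and the standard error along the way; the decision is the same.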

## 4. Sample Size Determination for Confidence Intervals

### Determining Sample Size for a Desired Margin of Error
- **Formula for a Single Proportion**:
  $$n = \frac{z^{*2} \times p(1-p)}{MOE^2}$$

- **Conservative Approach** (when p is unknown):
  - Use p = 0.5 (maximizes p(1-p))
  - Results in the largest, most conservative sample size
  - Formula simplifies to:
    $$n = \frac{z^{*2} \times 0.25}{MOE^2}$$

### Example: Sample Size Calculation
To achieve a margin of error of 5% with 95% confidence:
- z* = 1.96
- Using p = 0.5 (conservative approach):
  $$n = \frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} = \frac{0.9604}{0.0025} = 384.16$$
- We would need at least 385 participants.

### Reducing the Margin of Error
To cut the margin of error in half while maintaining the same confidence level:
- Original MOE = z* × √[p(1-p)/n]
- To halve the MOE: n<sub>new</sub> = 4 × n<sub>original</sub>

**Example**: If a 95% CI based on 100 students has a certain margin of error, to cut that margin of error in half (while maintaining 95% confidence), we would need 4 × 100 = 400 students.
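The sample size formula is easy to script. In this sketch the function name `sample_size_for_moe` is hypothetical, and the critical value is computed from the standard normal distribution rather than read from a table.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_moe(moe, confidence=0.95, p=0.5):
    """Smallest n whose margin of error is at most moe; p = 0.5 is the conservative default."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z_star ** 2 * p * (1 - p) / moe ** 2)   # always round up

n_5 = sample_size_for_moe(0.05)      # 5% margin of error
n_2_5 = sample_size_for_moe(0.025)   # half of that margin
print(n_5, n_2_5)
```

Halving the margin of error roughly quadruples the required sample size; the ratio is not exactly 4 only because each result is rounded up to a whole participant.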

## 5. Practical Examples

### Example 1: Non-profit vs. For-profit Employee Happiness
A study compared happiness levels between employees in non-profit and for-profit organizations:
- Non-profit: 423 out of 467 employees reported being happy (p̂₁ = 0.906)
- For-profit: 446 out of 531 employees reported being happy (p̂₂ = 0.840)

**Confidence Interval Calculation**:
- Difference in proportions: p̂₁ - p̂₂ = 0.906 - 0.840 = 0.066
- Standard error:
  $$SE = \sqrt{\frac{0.906 \times 0.094}{467} + \frac{0.840 \times 0.160}{531}} = 0.0209$$
- 99% confidence interval (z* = 2.576):
  $$0.066 \pm 2.576 \times 0.0209 = 0.066 \pm 0.054 = (0.012, 0.120)$$

**Interpretation**: We are 99% confident that the proportion of employees who are happy working in non-profit organizations is between 1.2 and 12.0 percentage points higher than the proportion of employees who are happy working in for-profit organizations.

### Example 2: Foreign-born Residents Comparison
A study compared the proportion of foreign-born residents in the U.S. and China:
- U.S.: 196 out of 980 residents were foreign-born (p̂₁ = 0.2)
- China: 212 out of 1560 residents were foreign-born (p̂₂ = 0.136)

**Hypothesis Test**:
- H₀: p₁ - p₂ = 0 (same proportion of foreign-born residents)
- H<sub>A</sub>: p₁ - p₂ ≠ 0 (different proportions)

- Pooled proportion:
  $$\hat{p}_{pooled} = \frac{196 + 212}{980 + 1560} = \frac{408}{2540} = 0.1606$$

- Standard error:
  $$SE = \sqrt{0.1606 \times 0.8394 \times \left(\frac{1}{980} + \frac{1}{1560}\right)} = 0.0150$$

- Test statistic:
  $$Z = \frac{0.2 - 0.136}{0.0150} = 4.267$$

- P-value (two-sided): 2 × P(Z > 4.267) < 0.0001

**Conclusion**: Since p-value < 0.05, we reject H₀. There is strong evidence that the proportion of foreign-born residents differs between the U.S. and China.

## 6. Key Takeaways

1. **Comparing Two Proportions**:
   - Allows us to determine if there's a significant difference between two groups
   - Uses the difference in sample proportions (p̂₁ - p̂₂) as the point estimate

2. **Confidence Intervals**:
   - Provide a range of plausible values for the true difference in proportions
   - Use individual sample proportions to calculate the standard error
   - Interpretation should address both magnitude and direction of the difference

3. **Hypothesis Testing**:
   - Typically tests whether the difference in proportions equals zero
   - Uses the pooled proportion to calculate standard error under the null hypothesis
   - Follows the same general framework as single-proportion hypothesis tests

4. **Sample Size Considerations**:
   - Larger samples provide more precise estimates (narrower confidence intervals)
   - To halve the margin of error, quadruple the sample size
   - When planning studies, use conservative estimates (p = 0.5) if no prior information is available

5. **Practical Significance**:
   - Statistical significance doesn't always imply practical importance
   - Consider the context and magnitude of the difference when interpreting results
   - Confidence intervals provide more information about effect size than p-values alone

## Module 8 Summary: Goodness of Fit Tests

## 1. Introduction to Goodness of Fit Tests

### Purpose and Applications
- **Definition**: Statistical procedures that determine whether observed data conform to a theoretical distribution or expected frequencies
- **Applications**:
  - Testing if a sample is representative of a population
  - Determining if data follow a particular distribution (e.g., normal, binomial)
  - Evaluating whether observed frequencies match expected frequencies
  - Assessing whether categorical variables are distributed as expected

### Types of Goodness of Fit Tests
- **One-Way Chi-Square Test**: Tests if observed frequencies in a single categorical variable match expected frequencies
- **Chi-Square Test for Independence**: Tests if two categorical variables are related (covered in later modules)
- **Kolmogorov-Smirnov Test**: Tests if a sample comes from a specific continuous distribution (not covered in this module)

## 2. The Chi-Square Distribution

### Definition and Properties
- **Definition**: The distribution of a sum of squares of independent standard normal random variables
- **Formula**: If Z₁, Z₂, ..., Zₖ are independent standard normal random variables, then:
  $$\chi^2_k = Z_1^2 + Z_2^2 + ... + Z_k^2$$
  follows a chi-square distribution with k degrees of freedom

- **Properties**:
  - Always non-negative (sum of squared values)
  - Right-skewed, especially with low degrees of freedom
  - Becomes more symmetric and approximately normal as degrees of freedom increase
  - Mean equals the degrees of freedom
  - Variance equals twice the degrees of freedom

### Degrees of Freedom
- **Definition**: The number of values that are free to vary in the calculation of a statistic
- **Interpretation**: In a chi-square test, degrees of freedom = (number of categories - 1)
- **Reason**: When we know the total sample size and all but one category count, the final count is determined

### Example: Reading Chi-Square Tables
For a chi-square distribution with 3 degrees of freedom, the probability that the chi-square value exceeds 6.25 is:
$$P(\chi^2_3 \geq 6.25) \approx 0.1001$$

This means that if we have a chi-square test statistic of 6.25 with 3 degrees of freedom, the p-value would be approximately 0.1001.
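Instead of a table, the upper-tail probability can be computed directly. The sketch below is one possible approach, assuming an integer number of degrees of freedom: it starts from the closed forms for 1 and 2 degrees of freedom and applies the standard recurrence that adds 2 degrees of freedom per step. The function name `chi2_sf` is hypothetical.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square_df >= x) for a positive integer df."""
    if x <= 0:
        return 1.0
    y = x / 2
    # Closed-form starting points: df = 1 (odd case) or df = 2 (even case)
    if df % 2 == 1:
        a, tail = 0.5, erfc(sqrt(y))
    else:
        a, tail = 1.0, exp(-y)
    # Each pass adds 2 degrees of freedom: Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1)
    while a < df / 2:
        tail += y ** a * exp(-y) / gamma(a + 1)
        a += 1
    return tail

print(round(chi2_sf(6.25, 3), 4))   # the table lookup from the example above
```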

## 3. Chi-Square Goodness of Fit Test

### Test Statistic
- **Formula**:
  $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
  where:
  - Oᵢ = Observed count in category i
  - Eᵢ = Expected count in category i under the null hypothesis
  - k = Number of categories

- **Interpretation**: Measures the discrepancy between observed and expected frequencies
  - Larger values indicate greater discrepancy
  - Values close to zero suggest good fit

### Hypothesis Testing Framework
- **Null Hypothesis (H₀)**: The observed frequencies match the expected frequencies (or the data follow the specified distribution)
- **Alternative Hypothesis (H₁)**: The observed frequencies do not match the expected frequencies (or the data do not follow the specified distribution)
- **Decision Rule**: Reject H₀ if p-value < α (significance level)

### Conditions for Valid Chi-Square Test
1. **Independence**:
   - Observations must be independent of each other
   - Satisfied by random sampling or random assignment

2. **Sample Size**:
   - Each expected count must be at least 5
   - Ensures the chi-square approximation is valid

## 4. Testing for Specific Distributions

### Testing Categorical Distributions
- **Example**: Testing if jurors represent the racial demographics of registered voters
- **Procedure**:
  1. Calculate expected counts based on population proportions
  2. Compare observed counts to expected counts using chi-square statistic
  3. Determine p-value using chi-square distribution with (k-1) degrees of freedom

### Testing Probability Distributions
- **Example**: Testing if dice are fair
- **Procedure**:
  1. Determine the theoretical probabilities under the null hypothesis
  2. Calculate expected counts by multiplying probabilities by total sample size
  3. Compare observed counts to expected counts using chi-square statistic
  4. Determine p-value using chi-square distribution with appropriate degrees of freedom

### Binning Continuous Data
- For continuous distributions, data must be binned (grouped into categories)
- Bins should be chosen to ensure expected counts ≥ 5 in each bin
- Test is sensitive to choice of bins, but reasonable choices should yield similar results

## 5. Detailed Examples

### Example 1: Jury Representation Test

A random sample of 275 jurors from a small county had jurors identify their racial group. We want to test if the sample is representative of the population of registered voters.

| Race | White | Black | Hispanic | Other | Total |
|------|-------|-------|----------|-------|-------|
| On Juries (observed) | 205 | 26 | 25 | 19 | 275 |
| Registered Voters (proportion) | 0.72 | 0.07 | 0.12 | 0.09 | 1.00 |
| Expected Counts | 198 | 19.25 | 33 | 24.75 | 275 |

**Step 1**: State hypotheses
- H₀: The jurors are a random sample (no racial bias)
- H₁: The jurors are not a random sample (racial bias exists)

**Step 2**: Calculate expected counts
- Expected White jurors: 275 × 0.72 = 198
- Expected Black jurors: 275 × 0.07 = 19.25
- Expected Hispanic jurors: 275 × 0.12 = 33
- Expected Other jurors: 275 × 0.09 = 24.75

**Step 3**: Calculate Z-scores for each category
- Z₁ (White): (205 - 198)/√198 = 0.5
- Z₂ (Black): (26 - 19.25)/√19.25 = 1.54
- Z₃ (Hispanic): (25 - 33)/√33 = -1.39
- Z₄ (Other): (19 - 24.75)/√24.75 = -1.16

**Step 4**: Calculate chi-square statistic
$$\chi^2 = (0.5)^2 + (1.54)^2 + (-1.39)^2 + (-1.16)^2 = 5.8993$$

**Step 5**: Determine degrees of freedom and p-value
- Degrees of freedom = 4 - 1 = 3
- P-value = P(χ²₃ ≥ 5.8993) ≈ 0.117

**Step 6**: Make decision
- Since p-value = 0.117 > 0.05, we fail to reject H₀
- Conclusion: There is insufficient evidence of racial bias in juror selection
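Steps 2 through 4 of this example can be checked with a short script. The variable names are hypothetical; the counts and proportions come from the table above.

```python
from math import sqrt

observed = {"White": 205, "Black": 26, "Hispanic": 25, "Other": 19}
voter_share = {"White": 0.72, "Black": 0.07, "Hispanic": 0.12, "Other": 0.09}

n = sum(observed.values())
expected = {group: n * share for group, share in voter_share.items()}

# Sample-size condition: every expected count should be at least 5
condition_ok = all(count >= 5 for count in expected.values())

# Z-score per category, then the chi-square statistic as the sum of their squares
z_scores = {g: (observed[g] - expected[g]) / sqrt(expected[g]) for g in observed}
chi_square = sum(z ** 2 for z in z_scores.values())
df = len(observed) - 1

print(f"chi-square = {chi_square:.4f}, df = {df}, condition met: {condition_ok}")
```

Carrying full precision gives χ² ≈ 5.89. The hand calculation squares Z-scores that were already rounded to two decimals, which is why it lands slightly higher.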

### Example 2: Testing if Dice are Fair

A player rolls two dice 200 times and records the number of sixes that appear on each roll:

| Number of Sixes | 0 | 1 | 2 | Total |
|-----------------|---|---|---|-------|
| Observed Count | 130 | 58 | 12 | 200 |

**Step 1**: State hypotheses
- H₀: The dice are fair
- H₁: The dice are not fair

**Step 2**: Calculate expected probabilities under H₀
- P(0 sixes) = (5/6)² = 25/36 ≈ 0.6944
- P(1 six) = 2 × (1/6) × (5/6) = 10/36 ≈ 0.2778
- P(2 sixes) = (1/6)² = 1/36 ≈ 0.0278

**Step 3**: Calculate expected counts
- Expected count for 0 sixes: 200 × 25/36 = 138.889
- Expected count for 1 six: 200 × 10/36 = 55.556
- Expected count for 2 sixes: 200 × 1/36 = 5.556

**Step 4**: Calculate chi-square statistic
$$\chi^2 = \frac{(130-138.889)^2}{138.889} + \frac{(58-55.556)^2}{55.556} + \frac{(12-5.556)^2}{5.556} \approx 8.15$$

**Step 5**: Determine degrees of freedom and p-value
- Degrees of freedom = 3 - 1 = 2
- P-value = P(χ²₂ ≥ 8.15) ≈ 0.017

**Step 6**: Make decision
- Since p-value = 0.017 < 0.05, we reject H₀
- Conclusion: There is evidence that the dice are not fair
- Note: There were more rolls with two sixes than expected (12 observed vs. about 5.6 expected), suggesting the dice might be biased toward sixes
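The dice example can be reproduced as follows. This sketch builds the expected probabilities from the binomial formula and uses the fact that, for exactly 2 degrees of freedom, the chi-square upper-tail probability has the closed form e^(−x/2); that shortcut does not apply to other degrees of freedom.

```python
from math import comb, exp

observed = [130, 58, 12]                 # rolls showing 0, 1, and 2 sixes
n_rolls, n_dice, p_six = sum(observed), 2, 1 / 6

# Binomial probabilities for the number of sixes on two fair dice
probs = [comb(n_dice, k) * p_six ** k * (1 - p_six) ** (n_dice - k)
         for k in range(n_dice + 1)]
expected = [n_rolls * p for p in probs]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = exp(-chi_square / 2)           # exact upper tail for df = 2 only

print(f"chi-square = {chi_square:.2f}, df = {df}, p-value = {p_value:.3f}")
```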

### Example 3: Hospital Check-ins by Weekday

A hospital administrator wants to find out if patient check-ins are evenly distributed across weekdays. They randomly sample 210 records:

| Day | Monday | Tuesday | Wednesday | Thursday | Friday | Total |
|-----|--------|---------|-----------|----------|--------|-------|
| Observed Count | 32 | 40 | 36 | 45 | 57 | 210 |

**Step 1**: State hypotheses
- H₀: Check-ins are evenly distributed across weekdays
- H₁: Check-ins are not evenly distributed across weekdays

**Step 2**: Calculate expected counts under H₀
- Expected count for each day: 210 ÷ 5 = 42

**Step 3**: Calculate chi-square statistic
$$\chi^2 = \frac{(32-42)^2}{42} + \frac{(40-42)^2}{42} + \frac{(36-42)^2}{42} + \frac{(45-42)^2}{42} + \frac{(57-42)^2}{42} = \frac{374}{42} \approx 8.90$$

**Step 4**: Determine degrees of freedom and p-value
- Degrees of freedom = 5 - 1 = 4
- P-value = P(χ²₄ ≥ 8.90) ≈ 0.064

**Step 5**: Make decision
- Since p-value = 0.064 > 0.05, we fail to reject H₀
- Conclusion: There is insufficient evidence that patient check-ins are unevenly distributed across weekdays

## 6. Common Misconceptions and Pitfalls

### Misconception 1: Chi-Square Tests Prove the Null Hypothesis
- Correct understanding: Failing to reject H₀ does not prove that the observed data follow the expected distribution; it only means we lack evidence to conclude otherwise

### Misconception 2: Chi-Square Tests Work with Small Samples
- Correct understanding: Chi-square tests require expected counts of at least 5 in each category for valid results

### Pitfall 1: Inappropriate Binning
- Problem: Results can vary based on how continuous data is binned
- Solution: Use consistent, reasonable binning strategies and ensure expected counts ≥ 5 in each bin

### Pitfall 2: Ignoring Independence Assumption
- Problem: Non-independent observations can lead to invalid results
- Solution: Ensure random sampling or appropriate experimental design

## 7. Key Takeaways

1. **Chi-Square Goodness of Fit Test**:
   - Tests whether observed frequencies match expected frequencies
   - Uses the chi-square statistic: χ² = Σ[(Oᵢ - Eᵢ)²/Eᵢ]
   - Larger values indicate greater discrepancy between observed and expected

2. **Chi-Square Distribution**:
   - Right-skewed distribution that approaches normal as degrees of freedom increase
   - Used to determine p-values for chi-square test statistics
   - Degrees of freedom = number of categories - 1

3. **Conditions for Valid Chi-Square Test**:
   - Independence of observations
   - Expected count ≥ 5 in each category

4. **Applications**:
   - Testing if a sample represents a population
   - Testing if data follow a specific distribution
   - Testing if categorical data are distributed as expected

5. **Interpretation**:
   - Small p-values (< α) suggest the observed data do not match the expected distribution
   - Large p-values (≥ α) suggest insufficient evidence to conclude the data don't match the expected distribution

## Module 9 Summary: Inference for Numerical Data

## 1. Introduction to Inference for Numerical Data

### From Categorical to Numerical Data
- **Previous Modules**: Focused on inference for categorical data
  - Single proportion
  - Difference of two proportions
  - Multiple groups (goodness of fit)
- **This Module**: Focuses on inference for numerical data
  - Single mean
  - Paired data
  - Difference of two means
  - Many means

### Key Differences in Approach

| Categorical Data | Numerical Data |
|------------------|----------------|
| Sample proportion: p̂ | Sample mean: x̄ |
| Population proportion: p | Population mean: μ |
| Normal distribution with SE = √[p(1-p)/n] | t-distribution with SE = s/√n |
| z-statistic | t-statistic |

## 2. The t-Distribution

### Why Use the t-Distribution?
- When working with numerical data, we typically don't know the population standard deviation (σ)
- We must estimate σ using the sample standard deviation (s)
- This additional uncertainty is accounted for by using the t-distribution instead of the normal distribution

### Properties of the t-Distribution
- Always centered at 0 (like the standard normal)
- Parametrized by a single parameter: degrees of freedom (df)
- More spread out than the normal distribution (heavier tails)
- As df increases, the t-distribution approaches the standard normal distribution
- For df > 30, the t-distribution is very similar to the normal distribution
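The convergence toward the normal distribution can be seen by computing critical values numerically. The sketch below is one way to do this with the standard library only: it integrates the t density with Simpson's rule and finds t* by bisection. The function names are hypothetical, and the search assumes t* < 50, which holds for df ≥ 2 at the usual 90% to 99% confidence levels.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * (log_df_pi := 0) if False else \
        lgamma((df + 1) / 2) - lgamma(df / 2)
    return exp(log_coef) / sqrt(df * pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=400):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    h = abs(x) / steps
    total = t_pdf(0, df) + t_pdf(abs(x), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                  # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(confidence, df):
    """t* with P(-t* <= T <= t*) = confidence, found by bisection on [0, 50]."""
    target = 0.5 + confidence / 2
    low, high = 0.0, 50.0
    for _ in range(40):
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

z_star = NormalDist().inv_cdf(0.975)      # normal critical value for 95% confidence
for df in (2, 10, 30, 100):
    print(f"df={df}: t* = {t_critical(0.95, df):.3f} (z* = {z_star:.3f})")
```

The printed t* values are always larger than z*, reflecting the heavier tails, and the gap shrinks as df grows.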

### Degrees of Freedom
- For a single sample: df = n - 1
- For two independent samples: df = min(n₁ - 1, n₂ - 1) (conservative approach)
- Represents the number of independent pieces of information available

## 3. One-Sample t-Confidence Intervals

### Conditions for Valid Inference
1. **Independence**: The sample observations must be independent
   - Satisfied by random sampling from a large population
2. **Normality**: 
   - If n < 30: Data should come from a nearly normal population; in practice, check that the sample has no clear outliers
   - If n ≥ 30: The Central Limit Theorem applies, provided there are no extreme outliers

### Formula for t-Confidence Interval
$$\bar{x} \pm t^*_{df} \times \frac{s}{\sqrt{n}}$$

Where:
- x̄ is the sample mean
- s is the sample standard deviation
- n is the sample size
- t*<sub>df</sub> is the critical value from the t-distribution with df degrees of freedom
- df = n - 1

### Steps for Constructing a t-Confidence Interval
1. **Prepare**: Identify or calculate x̄, s, n, and determine the confidence level
2. **Check**: Verify the conditions for using the t-distribution
3. **Calculate**: Compute SE = s/√n and find the critical value t*<sub>df</sub>
4. **Conclude**: Construct and interpret the confidence interval

### Example: Height of 18-Year-Olds
A random sample of 25 eighteen-year-olds has a mean height of 67.73 inches with a standard deviation of 2.00 inches.

**Step 1**: We have x̄ = 67.73, s = 2.00, n = 25, and we want a 95% confidence interval.

**Step 2**: The sample is random, and there are no clear outliers, so conditions are satisfied.

**Step 3**: 
- SE = s/√n = 2.00/√25 = 0.4
- df = n - 1 = 25 - 1 = 24
- t*₂₄ = 2.064 (for 95% confidence)

**Step 4**: 
- 95% CI = 67.73 ± 2.064 × 0.4 = 67.73 ± 0.83 = (66.90, 68.56)
- Interpretation: We are 95% confident that the average height of 18-year-olds in the population is between 66.90 and 68.56 inches.

## 4. One-Sample t-Tests

### Hypothesis Testing Framework
- **Null Hypothesis (H₀)**: μ = μ₀ (population mean equals a specific value)
- **Alternative Hypothesis (H₁)**:
  - Two-sided: μ ≠ μ₀
  - Right-tailed: μ > μ₀
  - Left-tailed: μ < μ₀

### Test Statistic
$$T = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}$$

### Steps for Conducting a One-Sample t-Test
1. **Prepare**: Identify or calculate x̄, s, n, and determine the significance level α
2. **Check**: Verify the conditions for using the t-distribution
3. **Calculate**: Compute the test statistic T and find the p-value
4. **Conclude**: Compare the p-value to α and make a decision

### Example: Sleep Duration of UCSD Students
We want to determine if UCSD students sleep less than 7 hours per night on average. A random sample of 50 students has a mean sleep duration of 6.74 hours with a standard deviation of 0.71 hours.

**Step 1**: The hypotheses are H₀: μ = 7 and H₁: μ < 7 (left-tailed). We have x̄ = 6.74, s = 0.71, n = 50, and we'll use α = 0.05.

**Step 2**: The sample is random and n ≥ 30, so conditions are satisfied.

**Step 3**: 
- SE = s/√n = 0.71/√50 = 0.1004
- df = n - 1 = 50 - 1 = 49
- T = (6.74 - 7)/0.1004 = -2.59
- p-value = P(T < -2.59) = 0.0063

**Step 4**: 
- Since p-value = 0.0063 < 0.05, we reject H₀
- Conclusion: There is strong evidence that UCSD students sleep less than 7 hours per night on average.
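
As a rough check of Step 3, the left-tail p-value can be approximated without a statistics package by numerically integrating the t density. The function names below are hypothetical, and Simpson's rule is just one convenient way to approximate the area; statistical software would normally be used instead.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=2000):
    """Approximate P(T < t) with Simpson's rule on [0, |t|] plus symmetry."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # approximates P(0 < T < |t|)
    return 0.5 - area if t < 0 else 0.5 + area

def one_sample_t(xbar, mu0, s, n):
    """Return the test statistic T and its degrees of freedom."""
    se = s / math.sqrt(n)
    return (xbar - mu0) / se, n - 1

# Sleep example: H0: mu = 7 against H1: mu < 7
t_stat, df = one_sample_t(xbar=6.74, mu0=7, s=0.71, n=50)
p_value = t_lower_tail(t_stat, df)
print(f"T = {t_stat:.2f}, df = {df}, p-value = {p_value:.4f}")
```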

## 5. Paired Data Analysis

### What is Paired Data?
- Two sets of observations are paired if each observation in one set has a special correspondence with exactly one observation in the other set
- Examples:
  - Before and after measurements on the same subjects
  - Measurements on matched pairs (e.g., twins)
  - Prices of the same items at two different stores

### Analyzing Paired Data
- Calculate the differences between paired observations
- Analyze these differences using one-sample t-methods
- The parameter of interest is μd (the mean difference)

### Example: Grocery Store Prices
Comparing prices of the same items at two different grocery stores:

| Item | Whole Foods | Vons | Difference (WF - V) |
|------|-------------|------|---------------------|
| Fuji Apples | $1.89 | $1.49 | $0.40 |
| Whole Milk | $2.49 | $3.99 | -$1.50 |
| Yogurt | $5.89 | $5.99 | -$0.10 |

We would analyze the differences using a one-sample t-test or confidence interval.
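
A minimal sketch of that analysis on the three rows shown above. Three pairs are far too few for real inference, so this only illustrates the mechanics of reducing paired data to one sample of differences.

```python
import math
import statistics

whole_foods = [1.89, 2.49, 5.89]   # Fuji apples, whole milk, yogurt
vons = [1.49, 3.99, 5.99]

# One difference per item (WF - V): this is the single sample we analyze
diffs = [wf - v for wf, v in zip(whole_foods, vons)]

d_bar = statistics.mean(diffs)             # estimate of the mean difference
s_d = statistics.stdev(diffs)              # sample SD of the differences
se = s_d / math.sqrt(len(diffs))           # standard error of d_bar
t_stat = d_bar / se                        # test statistic for H0: mu_d = 0
print(f"mean diff = {d_bar:.2f}, sd = {s_d:.3f}, T = {t_stat:.2f}")
```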

## 6. Confidence Intervals for Difference of Means

### Conditions for Valid Inference
1. **Independence (extended)**: 
   - Data are independent within each group
   - Data are independent between groups
   - Satisfied by random sampling or random assignment
2. **Normality**: 
   - If n < 30: Data should come from normally distributed populations
   - If n ≥ 30: The Central Limit Theorem applies

### Formula for Standard Error
$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

### Formula for Confidence Interval
$$({\bar{x}_1 - \bar{x}_2}) \pm t^*_{df} \times SE$$

Where:
- x̄₁ and x̄₂ are the sample means
- s₁ and s₂ are the sample standard deviations
- n₁ and n₂ are the sample sizes
- df = min(n₁ - 1, n₂ - 1) (conservative approach)

### Example: Treatment Effect Study
A small randomized control trial gives the following results for treating a particular condition:

| Group | n | Sample Mean | Sample SD |
|-------|---|-------------|-----------|
| Treatment | 9 | 3.5 | 5.17 |
| Control | 9 | -4.33 | 2.76 |

**Step 1**: We want a 95% confidence interval for the treatment effect.

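**Step 2**: The data are from a randomized trial, so independence is satisfied. With only n = 9 per group, we also need to assume the outcomes in each group are approximately normally distributed.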

**Step 3**: 
- SE = √[(5.17²/9) + (2.76²/9)] = 1.95
- df = min(9-1, 9-1) = 8
- t*₈ = 2.31 (for 95% confidence)
- Point estimate = x̄₁ - x̄₂ = 3.5 - (-4.33) = 7.83

**Step 4**: 
- 95% CI = 7.83 ± 2.31 × 1.95 = 7.83 ± 4.50 = (3.33, 12.33)
- Interpretation: We are 95% confident that the true difference in mean outcomes between the treatment and control groups is between 3.33 and 12.33 units.
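
The same calculation as a short Python sketch. `two_sample_ci` is a hypothetical helper, and the critical value 2.306 is the table value for df = 8 at 95% confidence, supplied by hand.

```python
import math

def two_sample_ci(x1, s1, n1, x2, s2, n2, t_star):
    """CI for mu1 - mu2 using the unpooled standard error."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    diff = x1 - x2
    return diff - t_star * se, diff + t_star * se, se

df = min(9 - 1, 9 - 1)   # conservative degrees of freedom
low, high, se = two_sample_ci(3.5, 5.17, 9, -4.33, 2.76, 9, t_star=2.306)
print(f"df = {df}, SE = {se:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Small differences in the last decimal place relative to a hand calculation come from rounding SE and t* before multiplying.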

## 7. Hypothesis Testing for Difference of Means

### Hypothesis Testing Framework
- **Null Hypothesis (H₀)**: μ₁ - μ₂ = 0 (no difference between population means)
- **Alternative Hypothesis (H₁)**:
  - Two-sided: μ₁ - μ₂ ≠ 0
  - Right-tailed: μ₁ - μ₂ > 0
  - Left-tailed: μ₁ - μ₂ < 0

### Test Statistic
$$T = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{SE}$$

### Example: Birth Weight and Smoking
A study investigates whether newborns from mothers who smoke have different average birth weights than newborns from mothers who don't smoke.

| Group | n | Sample Mean | Sample SD |
|-------|---|-------------|-----------|
| Non-smoker | 100 | 7.18 | 1.6 |
| Smoker | 50 | 6.78 | 1.43 |

**Step 1**: The hypotheses are H₀: μₙ - μₛ = 0 and H₁: μₙ - μₛ ≠ 0 (two-sided), and we use α = 0.05.

**Step 2**: The data come from a random sample, and n ≥ 30 for both groups, so conditions are satisfied.

**Step 3**: 
- Point estimate = x̄ₙ - x̄ₛ = 7.18 - 6.78 = 0.4
- SE = √[(1.6²/100) + (1.43²/50)] = 0.26
- T = 0.4/0.26 = 1.54
- df = min(100-1, 50-1) = 49
- p-value = P(|T| ≥ 1.54) = 0.1304

**Step 4**: 
- Since p-value = 0.1304 > 0.05, we fail to reject H₀
- Conclusion: There is not enough evidence to conclude that there is a difference in average birth weight between newborns from mothers who smoke and those who don't.
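
A sketch of the test statistic in Python, compared against a critical value instead of computing a p-value. The helper name is hypothetical, and 2.01 is the approximate two-sided table value for df = 49 at α = 0.05.

```python
import math

def two_sample_t(x1, s1, n1, x2, s2, n2):
    """Return T for H0: mu1 - mu2 = 0 and the conservative df."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    t_stat = (x1 - x2) / se
    df = min(n1 - 1, n2 - 1)
    return t_stat, df

t_stat, df = two_sample_t(7.18, 1.6, 100, 6.78, 1.43, 50)
T_CRIT = 2.01                      # t-table value, df = 49, two-sided 5%
reject = abs(t_stat) > T_CRIT
print(f"T = {t_stat:.2f}, df = {df}, reject H0: {reject}")
```

Because the code keeps the unrounded standard error, T comes out slightly larger than a hand calculation that rounds SE to 0.26; the decision is the same.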

## 8. Statistical Power for Difference of Means

### Definition of Statistical Power
- The probability of correctly rejecting the null hypothesis when a specific alternative hypothesis is true
- Mathematically: power = P(reject H₀ | H₁ is true)
- Equivalently: power = 1 - P(Type II error)

### Factors Affecting Power
1. **Sample size**: Larger samples increase power
2. **Effect size**: Larger differences are easier to detect
3. **Variability**: Less variability increases power
4. **Significance level**: Higher α increases power (but also increases Type I error risk)

### Calculating Power
1. Determine the rejection region under H₀
2. Calculate the probability of falling in the rejection region under H₁

### Example: Blood Pressure Medication
A study is designed to test if a new blood pressure medication reduces blood pressure compared to a standard medication. We want to detect a difference of 3 mmHg.

- Sample size: n₁ = n₂ = 100
- Estimated standard deviation: s₁ = s₂ = 12
- Significance level: α = 0.05

**Step 1**: Calculate the standard error
- SE = √[(12²/100) + (12²/100)] = 1.7

**Step 2**: Determine the rejection region
- For α = 0.05 (two-sided), reject H₀ if |T| > 1.96
- This corresponds to x̄₁ - x̄₂ < -3.332 or x̄₁ - x̄₂ > 3.332

**Step 3**: Calculate power for detecting a 3 mmHg reduction
- Under H₁, x̄₁ - x̄₂ follows approximately N(-3, 1.7²)
- Power = P(x̄₁ - x̄₂ < -3.332 | μ₁ - μ₂ = -3)
- Z = (-3.332 - (-3))/1.7 = -0.2
- Power = P(Z < -0.2) = 0.42 or 42%

**Step 4**: Determine sample size for 80% power
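- For 80% power, the alternative mean must lie 0.84 standard errors beyond the rejection boundary, since P(Z < 0.84) ≈ 0.80
- Distance between the null and alternative means = (1.96 + 0.84) × SE = 2.8 × SE = 3
- 2.8 × √[(12²/n) + (12²/n)] = 3
- Solving for n: n = (2.8/3)² × 288 ≈ 250.9, so n = 251 per group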
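
The power and sample-size steps can be sketched with the standard library's `statistics.NormalDist`. The function names are hypothetical, both groups are assumed to share the same size and standard deviation, and the sample-size helper deliberately rounds the z-values to two decimals to mirror the hand calculation.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_two_sided(delta, sd, n, alpha=0.05):
    """Approximate power to detect a true difference delta with n per group."""
    se = math.sqrt(sd**2 / n + sd**2 / n)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / se
    # Chance of landing in either rejection tail when the true mean is delta
    return Z.cdf(-z_crit + shift) + Z.cdf(-z_crit - shift)

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Smallest n per group, using z-values rounded as in a z-table."""
    z_total = round(Z.inv_cdf(1 - alpha / 2), 2) + round(Z.inv_cdf(power), 2)
    return math.ceil(2 * (sd * z_total / delta) ** 2)

print(f"power at n = 100: {power_two_sided(3, 12, 100):.2f}")
print(f"n per group for 80% power: {n_per_group(3, 12)}")
```

The second term in `power_two_sided` is the far rejection tail; it is negligible here, which is why the worked example ignores it.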

## 9. Common Misconceptions and Pitfalls

### Misconception 1: t-Distribution vs. Normal Distribution
- **Misconception**: The t-distribution is always very different from the normal distribution
- **Reality**: For large degrees of freedom (df > 30), the t-distribution is very similar to the normal distribution

### Misconception 2: Paired vs. Independent Samples
- **Misconception**: Any comparison of two groups should use the two-sample t-test
- **Reality**: Paired data should be analyzed using paired methods (one-sample t-test on differences)

### Pitfall 1: Ignoring Conditions
- **Problem**: Using t-methods when conditions are not satisfied
- **Solution**: Always check independence and normality conditions

### Pitfall 2: Misinterpreting p-values
- **Problem**: Interpreting a non-significant result as "proving" the null hypothesis
- **Solution**: A non-significant result only means there is insufficient evidence to reject H₀

## 10. Key Takeaways

1. **t-Distribution**:
   - Used when the population standard deviation is unknown
   - Accounts for the additional uncertainty from estimating σ with s
   - Approaches the normal distribution as sample size increases

2. **One-Sample Inference**:
   - Confidence interval: x̄ ± t*ᵈᶠ × (s/√n)
   - Hypothesis test: T = (x̄ - μ₀)/(s/√n)
   - Degrees of freedom: df = n - 1

3. **Paired Data**:
   - Analyze the differences between paired observations
   - Use one-sample methods on these differences

4. **Two-Sample Inference**:
   - Confidence interval: (x̄₁ - x̄₂) ± t*ᵈᶠ × SE
   - Hypothesis test: T = (x̄₁ - x̄₂)/SE
   - Standard error: SE = √[(s₁²/n₁) + (s₂²/n₂)]
   - Degrees of freedom: df = min(n₁ - 1, n₂ - 1) (conservative approach)

5. **Statistical Power**:
   - The probability of correctly rejecting H₀ when H₁ is true
   - Increases with larger sample sizes, larger effect sizes, and higher significance levels
   - Important for study design and sample size determination

# DSC 215: Probability and Statistics for Data Science
## Module 10 Summary: Comparing Many Means with ANOVA

## 1. Introduction to ANOVA

### What is ANOVA?
- **Definition**: Analysis of Variance (ANOVA) is a statistical method used to compare means across multiple groups
- **Purpose**: Tests whether there are significant differences between the means of three or more independent groups
- **Advantage over Multiple t-tests**: 
  - Reduces the risk of Type I errors that would accumulate when conducting multiple pairwise comparisons
  - If you have k groups, you would need k(k-1)/2 pairwise comparisons, increasing the chance of finding differences by random chance

### Hypothesis Framework
- **Null Hypothesis (H₀)**: All population means are equal
  $$H_0: \mu_1 = \mu_2 = \ldots = \mu_k$$
- **Alternative Hypothesis (H₁)**: At least one population mean is different from the others
  $$H_1: \mu_i \neq \mu_j \text{ for some } i \neq j$$

## 2. Conditions for Valid ANOVA

### Three Key Conditions
1. **Independence**: 
   - Observations are independent within each group
   - Observations are independent across groups
   - Typically satisfied by random sampling or random assignment

2. **Normality**: 
   - The data within each group is approximately normally distributed
   - Less critical for larger sample sizes due to the Central Limit Theorem
   - Can be assessed using histograms, Q-Q plots, or formal tests

3. **Variability** (Homogeneity of Variance): 
   - The variability across groups is approximately equal
   - Can be assessed by comparing standard deviations or using formal tests like Levene's test
   - ANOVA is somewhat robust to violations of this assumption when sample sizes are equal

## 3. The F-Statistic and F-Distribution

### The Basic Idea
- ANOVA compares two sources of variation:
  - **Between-group variation**: Variation of group means around the overall mean
  - **Within-group variation**: Variation of individual observations around their group means
- If the between-group variation is large relative to the within-group variation, we have evidence that the group means differ

### The F-Statistic
- **Formula**:
  $$F = \frac{\text{Mean Square Between Groups (MSG)}}{\text{Mean Square Error (MSE)}}$$

- **Mean Square Between Groups (MSG)**:
  $$\text{MSG} = \frac{1}{k-1} \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$$
  where:
  - k = number of groups
  - n_i = sample size of group i
  - $\bar{x}_i$ = sample mean of group i
  - $\bar{x}$ = overall sample mean

- **Mean Square Error (MSE)**:
  $$\text{MSE} = \frac{1}{n-k} \left( \sum_{j=1}^{n} (x_j - \bar{x})^2 - \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \right)$$
  where n = total sample size
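
The subtraction inside the MSE formula relies on the fact that the total sum of squares splits into a between-group part and a within-group part. Writing $x_{ij}$ for observation $j$ in group $i$ (notation introduced here only for illustration), the decomposition is:

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2}_{\text{SST}}
= \underbrace{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2}_{\text{SSG}}
+ \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}_{\text{SSE}}
$$

so that $\text{MSG} = \text{SSG}/(k-1)$ and $\text{MSE} = \text{SSE}/(n-k) = (\text{SST} - \text{SSG})/(n-k)$. Either route to SSE, summing squared deviations within each group or subtracting SSG from SST, should give the same value.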

### The F-Distribution
- The F-distribution is parametrized by two degrees of freedom:
  - **df₁** = k - 1 (degrees of freedom for MSG)
  - **df₂** = n - k (degrees of freedom for MSE)
- Properties:
  - Always non-negative (ratio of variances)
  - Right-skewed, especially with low degrees of freedom
  - Becomes less skewed as degrees of freedom increase
  - Under H₀, the F-statistic follows an F-distribution with df₁ and df₂ degrees of freedom

## 4. Conducting an ANOVA Test

### Step-by-Step Procedure
1. **Prepare**: Identify the groups and collect data
2. **Check**: Verify the conditions for ANOVA
3. **Calculate**: Compute the F-statistic
4. **Conclude**: Determine the p-value and make a decision

### Example: Course Delivery Methods
A professor taught the same class in three different formats (remote, hybrid, and in-person) and wants to know if the mean final scores differ. Here are the scores:

| Remote | Hybrid | In-person |
|--------|--------|-----------|
| 80     | 94     | 78        |
| 84     | 85     | 83        |
| 90     | 87     | 93        |
| 84     | 90     | 81        |
| 89     | 76     | -         |
| -      | 89     | -         |
| **Mean: 85.4** | **Mean: 86.83** | **Mean: 83.75** |

**Step 1**: Check conditions
- Independence: Assume students were randomly assigned to different formats
- Normality: Sample sizes are small, but assume scores within each group are approximately normal
- Variability: The standard deviations (about 4.1, 6.1, and 6.5) are of comparable size

**Step 2**: Calculate the overall mean
- $\bar{x} = \frac{427 + 521 + 335}{5+6+4} = \frac{1283}{15} \approx 85.53$

**Step 3**: Calculate MSG
- MSG = $\frac{1}{3-1} [5(85.4-85.53)^2 + 6(86.83-85.53)^2 + 4(83.75-85.53)^2]$
- MSG = $\frac{1}{2} [5(-0.13)^2 + 6(1.30)^2 + 4(-1.78)^2]$
- MSG $\approx \frac{1}{2} [0.09 + 10.14 + 12.72]$ (using unrounded means)
- MSG = $\frac{22.95}{2} \approx 11.48$

**Step 4**: Calculate MSE
- First, calculate the sum of squares within each group:
  - Remote: $(80-85.4)^2 + (84-85.4)^2 + (90-85.4)^2 + (84-85.4)^2 + (89-85.4)^2 = 67.2$
  - Hybrid: $(94-86.83)^2 + (85-86.83)^2 + (87-86.83)^2 + (90-86.83)^2 + (76-86.83)^2 + (89-86.83)^2 \approx 186.83$
  - In-person: $(78-83.75)^2 + (83-83.75)^2 + (93-83.75)^2 + (81-83.75)^2 = 126.75$
- Total within-group sum of squares = 67.2 + 186.83 + 126.75 = 380.78
- MSE = $\frac{380.78}{15-3} = \frac{380.78}{12} \approx 31.73$

**Step 5**: Calculate the F-statistic
- F = $\frac{MSG}{MSE} = \frac{11.48}{31.73} \approx 0.36$

**Step 6**: Determine the p-value
- df₁ = 3 - 1 = 2
- df₂ = 15 - 3 = 12
- Using an F-distribution table or software: p-value = P(F ≥ 0.36) ≈ 0.70

**Step 7**: Make a decision
- Since p-value = 0.70 > 0.05, we fail to reject H₀
- Conclusion: There is insufficient evidence to conclude that the mean final scores differ across the three teaching formats

## 5. Multiple Comparisons

### The Multiple Comparisons Problem
- If ANOVA rejects H₀, we know that at least one mean differs, but we don't know which ones
- To identify specific differences, we need to conduct pairwise comparisons
- However, conducting multiple tests increases the risk of Type I errors (false positives)

### Bonferroni Correction
- **Purpose**: Controls the overall Type I error rate when conducting multiple comparisons
- **Adjusted Significance Level**:
  $$\alpha^* = \frac{\alpha}{K}$$
  where K = k(k-1)/2 is the number of pairwise comparisons

- **Justification**: By the union bound in probability theory:
  $$\mathbb{P}\left( \bigcup_{i=1}^{K} (p_i \leq \alpha/K) \right) \leq \sum_{i=1}^{K} \mathbb{P}(p_i \leq \alpha/K) \leq \alpha$$

### Example: Pairwise Comparisons with Bonferroni Correction
Suppose ANOVA indicates significant differences among four groups (A, B, C, D) with α = 0.05.

**Step 1**: Calculate the number of pairwise comparisons
- K = 4(4-1)/2 = 6 comparisons (A-B, A-C, A-D, B-C, B-D, C-D)

**Step 2**: Calculate the adjusted significance level
- α* = 0.05/6 = 0.0083

**Step 3**: Conduct pairwise t-tests
- For each pair, calculate the t-statistic and p-value
- Compare each p-value to α* = 0.0083
- Only pairs with p-values < 0.0083 are considered significantly different
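
The bookkeeping in Steps 1 to 3 can be sketched as follows. The function names are hypothetical, and the pairwise p-values themselves would come from separate t-tests that are not shown here.

```python
from itertools import combinations

def bonferroni_pairs(groups, alpha=0.05):
    """List all pairwise comparisons and the adjusted significance level."""
    pairs = list(combinations(groups, 2))   # K = k(k-1)/2 pairs
    return pairs, alpha / len(pairs)

def significant_pairs(p_values, alpha_star):
    """Keep the pairs whose p-value falls below the adjusted level."""
    return [pair for pair, p in p_values.items() if p < alpha_star]

pairs, alpha_star = bonferroni_pairs(["A", "B", "C", "D"])
print(len(pairs), "comparisons, adjusted alpha =", round(alpha_star, 4))
```

`significant_pairs` expects a dictionary that maps each pair, such as `("A", "B")`, to the p-value of its pairwise test.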

### Important Note
- It is possible to reject the null hypothesis in ANOVA but not find significant differences in any pairwise comparisons
- This does not invalidate the ANOVA result
- It simply means we cannot identify which specific groups differ with the given sample sizes

## 6. Practical Example: Travel Routes

A student is interested in the times it takes (in minutes) to get to campus using three different routes. She wants to test whether the mean times are equal.

| Route 1 | Route 2 | Route 3 |
|---------|---------|---------|
| 30      | 27      | 16      |
| 32      | 29      | 41      |
| 27      | 28      | 22      |
| 35      | 36      | 31      |
| **Mean: 31.0** | **Mean: 30.0** | **Mean: 27.5** |

**Step 1**: State hypotheses
- H₀: μ₁ = μ₂ = μ₃ (mean travel times are equal across routes)
- H₁: At least one mean is different

**Step 2**: Check conditions (assume all conditions are met)

**Step 3**: Calculate the overall mean
- $\bar{x} = \frac{4(31.0) + 4(30.0) + 4(27.5)}{12} = 29.5$

**Step 4**: Calculate MSG
- MSG = $\frac{1}{3-1} [4(31.0-29.5)^2 + 4(30.0-29.5)^2 + 4(27.5-29.5)^2]$
- MSG = $\frac{1}{2} [4(1.5)^2 + 4(0.5)^2 + 4(-2.0)^2]$
- MSG = $\frac{1}{2} [9 + 1 + 16]$
- MSG = $\frac{26}{2} = 13$

**Step 5**: Calculate MSE
- Total sum of squares = $(30-29.5)^2 + (32-29.5)^2 + ... + (31-29.5)^2 = 467$
- Between-group sum of squares = 26
- Within-group sum of squares = 467 - 26 = 441
- MSE = $\frac{441}{12-3} = \frac{441}{9} = 49$

**Step 6**: Calculate the F-statistic
- F = $\frac{MSG}{MSE} = \frac{13}{49} = 0.265$

**Step 7**: Determine the p-value
- df₁ = 3 - 1 = 2
- df₂ = 12 - 3 = 9
- Using software: p-value = P(F ≥ 0.265) ≈ 0.773

**Step 8**: Make a decision
- Since p-value = 0.773 > 0.05, we fail to reject H₀
- Conclusion: There is insufficient evidence to conclude that the mean travel times differ across the three routes
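
The whole calculation can be reproduced with a short pure-Python sketch. The function names are hypothetical. The p-value helper only covers the special case df₁ = 2, where the upper tail of the F-distribution has the closed form (1 + 2f/df₂)^(−df₂/2); for other df₁ values, statistical software would be needed.

```python
def one_way_anova(groups):
    """Return (F, df1, df2) for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msg = ssg / (k - 1)
    mse = sse / (n - k)
    return msg / mse, k - 1, n - k

def f_upper_tail_df1_2(f, df2):
    """P(F >= f) for an F-distribution with df1 = 2 (closed form)."""
    return (1 + 2 * f / df2) ** (-df2 / 2)

routes = [[30, 32, 27, 35], [27, 29, 28, 36], [16, 41, 22, 31]]
f_stat, df1, df2 = one_way_anova(routes)
p_value = f_upper_tail_df1_2(f_stat, df2)
print(f"F = {f_stat:.3f}, df = ({df1}, {df2}), p-value = {p_value:.3f}")
```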

## 7. Common Misconceptions and Pitfalls

### Misconception 1: ANOVA Tests for Any Difference
- **Misconception**: ANOVA tests for any type of difference between groups
- **Reality**: ANOVA specifically tests for differences in means, not medians, variances, or other parameters

### Misconception 2: Significant ANOVA Means All Groups Differ
- **Misconception**: A significant ANOVA result means all group means are different
- **Reality**: A significant result only indicates that at least one mean differs from the others

### Pitfall 1: Ignoring Conditions
- **Problem**: Using ANOVA when conditions are not satisfied
- **Solution**: Always check independence, normality, and equal variance assumptions

### Pitfall 2: Conducting Multiple t-tests Without Correction
- **Problem**: Performing multiple pairwise comparisons without adjusting the significance level
- **Solution**: Use the Bonferroni correction or other multiple comparison procedures

## 8. Key Takeaways

1. **ANOVA Purpose**:
   - Compares means across multiple groups with a single hypothesis test
   - Controls the overall Type I error rate better than multiple pairwise comparisons

2. **F-Statistic**:
   - Ratio of between-group variance to within-group variance
   - Large values suggest significant differences between group means
   - Under H₀, follows an F-distribution with df₁ = k-1 and df₂ = n-k degrees of freedom

3. **Conditions**:
   - Independence: Observations are independent within and across groups
   - Normality: Data within each group is approximately normally distributed
   - Equal Variance: Variability across groups is comparable

4. **Multiple Comparisons**:
   - After a significant ANOVA, use pairwise comparisons to identify specific differences
   - Apply the Bonferroni correction to control the overall Type I error rate
   - Adjusted significance level: α* = α/K, where K = k(k-1)/2

5. **Interpretation**:
   - Failing to reject H₀: Insufficient evidence of differences between means
   - Rejecting H₀: At least one group mean differs from the others
   - Follow-up with multiple comparisons to identify specific differences

# DSC 215: Probability and Statistics for Data Science
## Comprehensive Final Exam Practice Questions

This document contains 50 practice questions representative of what might appear on the DSC 215 final exam. Questions cover all modules, with emphasis on Modules 5-10 as specified for the final exam. Each question includes the relevant module(s), necessary equations, and a complete solution.

---

### Question 1
**Module(s): 1**

A researcher wants to study the effect of background music on student performance on math tests. She randomly assigns 100 students to one of two groups: one group takes the test with classical music playing, and the other takes the test in silence. Identify:

a) The explanatory and response variables
b) Whether this is an observational study or an experiment
c) The population of interest

**Solution:**

a) Explanatory variable: Presence of background music (categorical: classical music or silence)
   Response variable: Math test performance (numerical)

b) This is an experiment because the researcher is actively manipulating the explanatory variable (presence of music) by randomly assigning students to conditions.

c) The population of interest is likely all students (or potentially all students at a particular school/level depending on the context).

---

### Question 2
**Module(s): 2**

The following data represents the number of hours 10 students spent studying for an exam:
3, 5, 2, 8, 4, 6, 3, 7, 5, 12

a) Calculate the mean, median, and mode.
b) Calculate the range, variance, and standard deviation.
c) Are there any outliers? Use the 1.5 × IQR rule.

**Solution:**

a) Mean = (3 + 5 + 2 + 8 + 4 + 6 + 3 + 7 + 5 + 12) / 10 = 55 / 10 = 5.5 hours
   
   To find the median, arrange the data in ascending order:
   2, 3, 3, 4, 5, 5, 6, 7, 8, 12
   
   Since n = 10 (even), median = (5 + 5) / 2 = 5 hours
   
   Mode = 3 and 5 (both appear twice)

b) Range = Maximum - Minimum = 12 - 2 = 10 hours

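   Variance (treating the 10 values as the entire population) = Σ(x - μ)² / n
   = [(3-5.5)² + (5-5.5)² + (2-5.5)² + (8-5.5)² + (4-5.5)² + (6-5.5)² + (3-5.5)² + (7-5.5)² + (5-5.5)² + (12-5.5)²] / 10
   = [6.25 + 0.25 + 12.25 + 6.25 + 2.25 + 0.25 + 6.25 + 2.25 + 0.25 + 42.25] / 10
   = 78.5 / 10 = 7.85

   Standard deviation = √Variance = √7.85 ≈ 2.80 hours

   If the 10 students are instead treated as a sample from a larger population, divide by n - 1: s² = 78.5 / 9 ≈ 8.72 and s ≈ 2.95 hours.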

c) To find outliers, we need Q1, Q3, and IQR:
   Q1 = median of lower half (2, 3, 3, 4, 5) = 3
   Q3 = median of upper half (5, 6, 7, 8, 12) = 7
   IQR = Q3 - Q1 = 7 - 3 = 4
   
   Lower fence = Q1 - 1.5 × IQR = 3 - 1.5 × 4 = 3 - 6 = -3
   Upper fence = Q3 + 1.5 × IQR = 7 + 1.5 × 4 = 7 + 6 = 13
   
   Since all values are between -3 and 13, there are no outliers according to the 1.5 × IQR rule.
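
These summaries can be checked with Python's `statistics` module. The sketch below computes the quartiles as medians of the lower and upper halves, which is one of several quartile conventions; software defaults may differ slightly.

```python
import statistics

hours = [3, 5, 2, 8, 4, 6, 3, 7, 5, 12]
data = sorted(hours)
half = len(data) // 2                      # n is even, so the halves do not overlap

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)
pop_var = statistics.pvariance(data)       # divides by n, as in part (b)

# Quartiles as medians of the lower and upper halves
q1 = statistics.median(data[:half])
q3 = statistics.median(data[-half:])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(mean, median, modes, pop_var)
print(q1, q3, iqr, (lower_fence, upper_fence), outliers)
```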

---

### Question 3
**Module(s): 3**

A fair six-sided die is rolled twice. Let X be the sum of the two rolls.

a) Find the probability mass function (PMF) of X.
b) Calculate P(X = 7).
c) Calculate P(X ≤ 5).
d) Calculate the expected value and variance of X.

**Solution:**

a) The PMF of X (the sum of two dice) is:

   P(X = 2) = 1/36 (only when both dice show 1)
   P(X = 3) = 2/36 = 1/18 (when the dice show 1,2 or 2,1)
   P(X = 4) = 3/36 = 1/12 (when the dice show 1,3 or 3,1 or 2,2)
   P(X = 5) = 4/36 = 1/9 (when the dice show 1,4 or 4,1 or 2,3 or 3,2)
   P(X = 6) = 5/36 (when the dice show 1,5 or 5,1 or 2,4 or 4,2 or 3,3)
   P(X = 7) = 6/36 = 1/6 (when the dice show 1,6 or 6,1 or 2,5 or 5,2 or 3,4 or 4,3)
   P(X = 8) = 5/36 (when the dice show 2,6 or 6,2 or 3,5 or 5,3 or 4,4)
   P(X = 9) = 4/36 = 1/9 (when the dice show 3,6 or 6,3 or 4,5 or 5,4)
   P(X = 10) = 3/36 = 1/12 (when the dice show 4,6 or 6,4 or 5,5)
   P(X = 11) = 2/36 = 1/18 (when the dice show 5,6 or 6,5)
   P(X = 12) = 1/36 (only when both dice show 6)

b) P(X = 7) = 6/36 = 1/6 ≈ 0.167

c) P(X ≤ 5) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5)
   = 1/36 + 2/36 + 3/36 + 4/36 = 10/36 = 5/18 ≈ 0.278

d) Expected value:
   E[X] = Σ x·P(X = x)
   = 2·(1/36) + 3·(2/36) + 4·(3/36) + 5·(4/36) + 6·(5/36) + 7·(6/36) + 8·(5/36) + 9·(4/36) + 10·(3/36) + 11·(2/36) + 12·(1/36)
   = (2 + 6 + 12 + 20 + 30 + 42 + 40 + 36 + 30 + 22 + 12)/36
   = 252/36 = 7

   Variance:
   Var(X) = E[X²] - (E[X])²
   
   E[X²] = Σ x²·P(X = x)
   = 2²·(1/36) + 3²·(2/36) + 4²·(3/36) + 5²·(4/36) + 6²·(5/36) + 7²·(6/36) + 8²·(5/36) + 9²·(4/36) + 10²·(3/36) + 11²·(2/36) + 12²·(1/36)
   = (4 + 18 + 48 + 100 + 180 + 294 + 320 + 324 + 300 + 242 + 144)/36
   = 1974/36 = 54.83
   
   Var(X) = 54.83 - 7² = 54.83 - 49 = 5.83
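
The PMF and its moments can also be verified by brute-force enumeration of the 36 equally likely outcomes. This sketch uses exact fractions, so the variance appears as 35/6 instead of the rounded 5.83.

```python
from fractions import Fraction
from itertools import product

# Build the PMF of the sum by enumerating all 36 ordered rolls
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)

mean = sum(x * p for x, p in pmf.items())                 # E[X]
var = sum(x**2 * p for x, p in pmf.items()) - mean**2     # E[X^2] - (E[X])^2
p_at_most_5 = sum(p for x, p in pmf.items() if x <= 5)

print(pmf[7], p_at_most_5, mean, var)
```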

---

### Question 4
**Module(s): 4**

The weights of adult male gorillas are normally distributed with a mean of 180 kg and a standard deviation of 21 kg.

a) What is the probability that a randomly selected male gorilla weighs more than 200 kg?
b) What is the probability that a randomly selected male gorilla weighs between 160 kg and 190 kg?
c) What weight separates the heaviest 10% of male gorillas from the rest?

**Solution:**

a) We need to find P(X > 200) where X ~ N(180, 21²)
   
   First, standardize to a Z-score:
   Z = (X - μ) / σ = (200 - 180) / 21 = 20 / 21 ≈ 0.95
   
   We need P(Z > 0.95) = 1 - P(Z < 0.95) = 1 - 0.8289 = 0.1711
   
   The probability that a randomly selected male gorilla weighs more than 200 kg is approximately 0.1711 or 17.11%.

b) We need to find P(160 < X < 190)
   
   Standardize both bounds:
   Z₁ = (160 - 180) / 21 = -20 / 21 ≈ -0.95
   Z₂ = (190 - 180) / 21 = 10 / 21 ≈ 0.48
   
   P(160 < X < 190) = P(-0.95 < Z < 0.48) = P(Z < 0.48) - P(Z < -0.95)
   = 0.6844 - 0.1711 = 0.5133
   
   The probability that a randomly selected male gorilla weighs between 160 kg and 190 kg is approximately 0.5133 or 51.33%.

c) We need to find the value x such that P(X > x) = 0.10, or equivalently, P(X < x) = 0.90
   
   For P(Z < z) = 0.90, we have z = 1.28 (from standard normal table)
   
   Therefore:
   (x - 180) / 21 = 1.28
   x - 180 = 1.28 × 21 = 26.88
   x = 180 + 26.88 = 206.88
   
   The weight that separates the heaviest 10% of male gorillas from the rest is approximately 206.88 kg.
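
The same three answers can be obtained without a Z-table using `statistics.NormalDist`. Because the code does not round the Z-scores to two decimals, its results may differ from the table-based answers in the third decimal place.

```python
from statistics import NormalDist

weights = NormalDist(mu=180, sigma=21)     # adult male gorilla weights (kg)

p_over_200 = 1 - weights.cdf(200)                   # part (a)
p_160_to_190 = weights.cdf(190) - weights.cdf(160)  # part (b)
cutoff_top_10 = weights.inv_cdf(0.90)               # part (c): 90th percentile

print(f"{p_over_200:.4f} {p_160_to_190:.4f} {cutoff_top_10:.2f}")
```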

---

### Question 5
**Module(s): 5**

A polling organization wants to estimate the proportion of voters who support a certain candidate in an upcoming election. They want the margin of error to be no more than 3% with 95% confidence.

a) What is the minimum sample size needed?
b) If a previous poll suggested that about 40% of voters support the candidate, what sample size would be needed?
c) If they end up surveying 1200 voters and find that 45% support the candidate, construct a 95% confidence interval for the true proportion.

**Solution:**

a) When we don't have a prior estimate of the proportion, we use p = 0.5 to get the most conservative (largest) sample size.

   The formula for sample size is:
   n = (z*)² × p(1-p) / (MOE)²
   
   For 95% confidence, z* = 1.96
   MOE = 0.03
   
   n = (1.96)² × 0.5 × 0.5 / (0.03)²
   = 3.8416 × 0.25 / 0.0009
   = 0.9604 / 0.0009
   = 1067.11
   
   Therefore, the minimum sample size needed is 1068 voters.

b) With a prior estimate of p = 0.4:
   
   n = (1.96)² × 0.4 × 0.6 / (0.03)²
   = 3.8416 × 0.24 / 0.0009
   = 0.922 / 0.0009
   = 1024.44
   
   Therefore, the sample size needed is 1025 voters.

c) With a sample of 1200 voters where 45% support the candidate:
   
   p̂ = 0.45
   
   The formula for the confidence interval is:
   p̂ ± z* × √[p̂(1-p̂)/n]
   
   95% CI = 0.45 ± 1.96 × √[0.45 × 0.55 / 1200]
   = 0.45 ± 1.96 × √[0.2475 / 1200]
   = 0.45 ± 1.96 × √0.000206
   = 0.45 ± 1.96 × 0.0144
   = 0.45 ± 0.028
   = (0.422, 0.478)
   
   We are 95% confident that the true proportion of voters who support the candidate is between 42.2% and 47.8%.
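
A small sketch of both formulas, with hypothetical helper names and z* fixed at 1.96 to match the hand calculation. Note that the sample size is always rounded up, never to the nearest integer.

```python
import math

Z_STAR = 1.96  # critical value for 95% confidence

def sample_size(p, moe, z=Z_STAR):
    """Smallest n whose margin of error is at most moe for proportion p."""
    return math.ceil(z**2 * p * (1 - p) / moe**2)

def proportion_ci(p_hat, n, z=Z_STAR):
    """Confidence interval for a proportion: p_hat ± z * SE."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

print(sample_size(0.5, 0.03), sample_size(0.4, 0.03))
low, high = proportion_ci(0.45, 1200)
print(f"({low:.3f}, {high:.3f})")
```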

---

### Question 6
**Module(s): 6**

A company claims that more than 80% of its products meet the highest quality standards. A random sample of 200 products shows that 172 meet these standards.

a) Set up the appropriate hypotheses to test the company's claim.
b) Calculate the test statistic and p-value.
c) At α = 0.05, what is your conclusion?
d) Calculate a 95% confidence interval for the true proportion. Does this interval support your conclusion in part c?

**Solution:**

a) The company claims that more than 80% of products meet the highest standards.
   
   H₀: p ≤ 0.80 (null hypothesis)
   H₁: p > 0.80 (alternative hypothesis)

b) From the sample, p̂ = 172/200 = 0.86
   
   Test statistic:
   z = (p̂ - p₀) / √[p₀(1-p₀)/n]
   = (0.86 - 0.80) / √[0.80 × 0.20 / 200]
   = 0.06 / √0.0008
   = 0.06 / 0.0283
   = 2.12
   
   For a right-tailed test, p-value = P(Z > 2.12) = 1 - P(Z < 2.12) = 1 - 0.983 = 0.017

c) Since p-value = 0.017 < α = 0.05, we reject the null hypothesis.
   
   Conclusion: There is sufficient evidence to support the company's claim that more than 80% of its products meet the highest quality standards.

d) 95% confidence interval:
   p̂ ± z* × √[p̂(1-p̂)/n]
   = 0.86 ± 1.96 × √[0.86 × 0.14 / 200]
   = 0.86 ± 1.96 × √0.0006
   = 0.86 ± 1.96 × 0.0245
   = 0.86 ± 0.048
   = (0.812, 0.908)
   
   Since the entire confidence interval is above 0.80, this supports our conclusion to reject H₀ in part c.
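
The test in part (b) can be sketched with the standard library. `one_proportion_z` is a hypothetical helper written for the right-tailed case only; the standard error uses p₀, as the test statistic requires.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, p0):
    """Right-tailed z-test for H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)     # standard error under H0
    z = (p_hat - p0) / se0
    p_value = 1 - NormalDist().cdf(z)      # upper-tail area
    return z, p_value

z, p_value = one_proportion_z(172, 200, 0.80)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```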

---

### Question 7
**Module(s): 7**

A researcher wants to compare the effectiveness of two different teaching methods. Method A is used with 40 students, and their average test score is 78 with a standard deviation of 8. Method B is used with 35 students, and their average test score is 82 with a standard deviation of 10.

a) Construct a 95% confidence interval for the difference in mean test scores (Method B - Method A).
b) Based on your confidence interval, is there a significant difference between the two teaching methods at the 5% level?
c) Conduct a hypothesis test to determine if Method B results in higher test scores than Method A.

**Solution:**

a) The formula for the confidence interval for the difference in means is:
   (x̄₂ - x̄₁) ± t* × √[(s₁²/n₁) + (s₂²/n₂)]
   
   Where:
   x̄₁ = 78 (Method A)
   x̄₂ = 82 (Method B)
   s₁ = 8
   s₂ = 10
   n₁ = 40
   n₂ = 35
   
   For a conservative approach, we use df = min(n₁-1, n₂-1) = min(39, 34) = 34
   For 95% confidence with df = 34, t* ≈ 2.03
   
   Standard error = √[(8²/40) + (10²/35)] = √[1.6 + 2.86] = √4.46 = 2.11
   
   95% CI = (82 - 78) ± 2.03 × 2.11
   = 4 ± 4.28
   = (-0.28, 8.28)

b) Since the confidence interval includes zero, we cannot conclude that there is a significant difference between the two teaching methods at the 5% level.

c) Hypotheses:
   H₀: μ₂ - μ₁ ≤ 0 (Method B does not result in higher scores than Method A)
   H₁: μ₂ - μ₁ > 0 (Method B results in higher scores than Method A)
   
   Test statistic:
   t = (x̄₂ - x̄₁) / √[(s₁²/n₁) + (s₂²/n₂)]
   = (82 - 78) / 2.11
   = 4 / 2.11
   = 1.90
   
   For a right-tailed test with df = 34, the p-value = P(t > 1.90) ≈ 0.033
   
   Since p-value = 0.033 < α = 0.05, we reject the null hypothesis.
   
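   Conclusion: There is sufficient evidence to conclude that Method B results in higher test scores than Method A.
   
   Note: This does not contradict part (b). The confidence interval corresponds to a two-sided test, whose p-value is about 2 × 0.033 = 0.066 > 0.05, while the one-sided test places all of α in the upper tail.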

---

### Question 8
**Module(s): 8**

A six-sided die is rolled 120 times with the following results:

| Outcome | 1 | 2 | 3 | 4 | 5 | 6 |
|---------|---|---|---|---|---|---|
| Frequency | 15 | 17 | 22 | 25 | 18 | 23 |

Test whether the die is fair using a chi-square goodness of fit test with α = 0.05.

**Solution:**

Step 1: State the hypotheses
H₀: The die is fair (all outcomes have equal probability)
H₁: The die is not fair (at least one outcome has a different probability)

Step 2: Calculate expected frequencies
If the die is fair, each outcome has probability 1/6.
Expected frequency for each outcome = 120 × (1/6) = 20

Step 3: Calculate the chi-square statistic
χ² = Σ[(O - E)²/E]
= (15 - 20)²/20 + (17 - 20)²/20 + (22 - 20)²/20 + (25 - 20)²/20 + (18 - 20)²/20 + (23 - 20)²/20
= 25/20 + 9/20 + 4/20 + 25/20 + 4/20 + 9/20
= 76/20 = 3.8

Step 4: Determine the critical value
Degrees of freedom = k - 1 = 6 - 1 = 5
For α = 0.05 and df = 5, the critical value is 11.07

Step 5: Make a decision
Since χ² = 3.8 < 11.07, we fail to reject H₀.

Conclusion: There is insufficient evidence to conclude that the die is not fair.
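
A short sketch of Steps 2 to 5. The critical value 11.07 is taken from the chi-square table as in Step 4, since the standard library has no chi-square distribution.

```python
observed = [15, 17, 22, 25, 18, 23]
n = sum(observed)

# Under H0 (fair die) every face has the same expected count
expected = [n / len(observed)] * len(observed)

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
CRITICAL_VALUE = 11.07                     # chi-square table, df = 5, alpha = 0.05
reject = chi_sq > CRITICAL_VALUE

print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject}")
```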

---

### Question 9
**Module(s): 9**

A researcher wants to test if a new medication reduces blood pressure. They measure the blood pressure (in mmHg) of 12 patients before and after taking the medication, with the following results:

| Patient | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---------|---|---|---|---|---|---|---|---|---|----|----|----|
| Before | 145 | 152 | 138 | 160 | 155 | 142 | 148 | 135 | 162 | 150 | 143 | 157 |
| After | 140 | 145 | 135 | 153 | 150 | 137 | 141 | 132 | 158 | 145 | 140 | 152 |
| Difference | 5 | 7 | 3 | 7 | 5 | 5 | 7 | 3 | 4 | 5 | 3 | 5 |

a) Is this a paired or independent samples test? Explain.
b) Conduct a hypothesis test to determine if the medication significantly reduces blood pressure at α = 0.01.
c) Calculate a 99% confidence interval for the mean reduction in blood pressure.

**Solution:**

a) This is a paired samples test because we have before and after measurements on the same 12 patients. The data are not independent; each "after" measurement is related to its corresponding "before" measurement.

b) Step 1: State the hypotheses
   H₀: μd ≤ 0 (medication does not reduce blood pressure)
   H₁: μd > 0 (medication reduces blood pressure)
   
   Where μd is the mean difference (before - after)
   
   Step 2: Calculate the sample statistics
   Mean difference: d̄ = (5 + 7 + 3 + 7 + 5 + 5 + 7 + 3 + 4 + 5 + 3 + 5) / 12 = 59 / 12 = 4.92
   
   Standard deviation of differences:
   sd = √[Σ(d - d̄)² / (n-1)]
   = √[((5-4.92)² + (7-4.92)² + ... + (5-4.92)²) / 11]
   = √[(0.08² + 2.08² + (-1.92)² + 2.08² + 0.08² + 0.08² + 2.08² + (-1.92)² + (-0.92)² + 0.08² + (-1.92)² + 0.08²) / 11]
   = √[24.92 / 11] = √2.265 = 1.505
   
   Step 3: Calculate the test statistic
   t = d̄ / (sd / √n) = 4.917 / (1.505 / √12) = 4.917 / 0.4345 ≈ 11.32
   (carrying d̄ = 59/12 ≈ 4.917 to three decimals to limit rounding error)
   
   Step 4: Determine the p-value
   For a right-tailed test with df = 12 - 1 = 11, the p-value = P(t > 11.32) < 0.0001
   
   Step 5: Make a decision
   Since p-value < 0.0001 < α = 0.01, we reject H₀.
   
   Conclusion: There is strong evidence that the medication reduces mean blood pressure.

c) 99% confidence interval:
   d̄ ± t* × (sd / √n)
   
   For 99% confidence with df = 11, t* ≈ 3.11
   
   99% CI = 4.92 ± 3.11 × (1.505 / √12)
   = 4.92 ± 3.11 × 0.4345
   = 4.92 ± 1.35
   = (3.57, 6.27)
   
   We are 99% confident that the true mean reduction in blood pressure due to the medication is between 3.57 and 6.27 mmHg.
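
As a cross-check on parts (b) and (c), the sketch below recomputes each quantity from the raw Before and After rows without intermediate rounding. It uses only the standard library; the variable names are illustrative, and the multiplier t* = 3.11 is taken from part (c) rather than computed, since the standard library has no t-distribution.

```python
import math
import statistics

before = [145, 152, 138, 160, 155, 142, 148, 135, 162, 150, 143, 157]
after = [140, 145, 135, 153, 150, 137, 141, 132, 158, 145, 140, 152]

# Paired design: work with the per-patient differences (before - after).
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

d_bar = statistics.mean(diffs)   # mean difference
s_d = statistics.stdev(diffs)    # sample standard deviation (n - 1 denominator)
se = s_d / math.sqrt(n)          # standard error of the mean difference

# Test statistic for H0: mu_d <= 0 versus H1: mu_d > 0.
t_stat = d_bar / se

# Two-sided 99% multiplier for df = 11, quoted from part (c) (table lookup).
T_STAR_99_DF11 = 3.11
margin = T_STAR_99_DF11 * se
ci = (d_bar - margin, d_bar + margin)

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t = {t_stat:.2f}")
print(f"99% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```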

---

### Question 10
**Module(s): 10**

A researcher wants to compare the effectiveness of three different fertilizers (A, B, and C) on plant growth. They randomly assign 18 plants to the three fertilizer groups (6 plants per group) and measure the height increase (in cm) after one month:

Fertilizer A: 5.2, 4.8, 6.1, 5.5, 4.9, 5.3
Fertilizer B: 6.5, 7.2, 6.8, 7.0, 6.3, 6.9
Fertilizer C: 5.8, 6.2, 5.5, 6.0, 5.7, 5.9

a) Conduct a one-way ANOVA to determine if there are significant differences in mean height increase among the three fertilizers at α = 0.05.
b) If the ANOVA indicates significant differences, which pairs of fertilizers differ significantly? Use the Bonferroni correction with an overall α = 0.05.

**Solution:**

a) Step 1: Calculate the group means and overall mean
   
   Mean for Fertilizer A: x̄₁ = (5.2 + 4.8 + 6.1 + 5.5 + 4.9 + 5.3) / 6 = 31.8 / 6 = 5.3
   Mean for Fertilizer B: x̄₂ = (6.5 + 7.2 + 6.8 + 7.0 + 6.3 + 6.9) / 6 = 40.7 / 6 = 6.78
   Mean for Fertilizer C: x̄₃ = (5.8 + 6.2 + 5.5 + 6.0 + 5.7 + 5.9) / 6 = 35.1 / 6 = 5.85
   
   Overall mean: x̄ = (31.8 + 40.7 + 35.1) / 18 = 107.6 / 18 = 5.98
   
   Step 2: Calculate the sum of squares
   
   Between-group sum of squares (SSB):
   SSB = Σ nᵢ(x̄ᵢ - x̄)²
   = 6(5.3 - 5.98)² + 6(6.78 - 5.98)² + 6(5.85 - 5.98)²
   = 6(-0.68)² + 6(0.80)² + 6(-0.13)²
   = 6(0.46 + 0.64 + 0.017)
   = 6 × 1.117 = 6.7
   
   Within-group sum of squares (SSW):
   SSW = Σ Σ (xᵢⱼ - x̄ᵢ)²
   
   For Fertilizer A:
   (5.2 - 5.3)² + (4.8 - 5.3)² + (6.1 - 5.3)² + (5.5 - 5.3)² + (4.9 - 5.3)² + (5.3 - 5.3)²
   = (-0.1)² + (-0.5)² + (0.8)² + (0.2)² + (-0.4)² + (0)²
   = 0.01 + 0.25 + 0.64 + 0.04 + 0.16 + 0 = 1.1
   
   For Fertilizer B:
   (6.5 - 6.78)² + (7.2 - 6.78)² + (6.8 - 6.78)² + (7.0 - 6.78)² + (6.3 - 6.78)² + (6.9 - 6.78)²
   = (-0.28)² + (0.42)² + (0.02)² + (0.22)² + (-0.48)² + (0.12)²
   = 0.078 + 0.176 + 0.0004 + 0.048 + 0.23 + 0.014 = 0.55
   
   For Fertilizer C:
   (5.8 - 5.85)² + (6.2 - 5.85)² + (5.5 - 5.85)² + (6.0 - 5.85)² + (5.7 - 5.85)² + (5.9 - 5.85)²
   = (-0.05)² + (0.35)² + (-0.35)² + (0.15)² + (-0.15)² + (0.05)²
   = 0.0025 + 0.1225 + 0.1225 + 0.0225 + 0.0225 + 0.0025 = 0.295
   
   SSW = 1.1 + 0.55 + 0.295 = 1.945
   
   Step 3: Calculate the mean squares
   
   Mean Square Between (MSB) = SSB / (k - 1) = 6.7 / (3 - 1) = 6.7 / 2 = 3.35
   Mean Square Within (MSW) = SSW / (n - k) = 1.945 / (18 - 3) = 1.945 / 15 = 0.13
   
   Step 4: Calculate the F-statistic
   
   F = MSB / MSW = 3.35 / 0.13 = 25.77
   
   Step 5: Determine the critical value
   
   For α = 0.05, df₁ = 2, df₂ = 15, the critical value is approximately 3.68
   
   Step 6: Make a decision
   
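   Since F = 25.77 > 3.68, we reject H₀: μ₁ = μ₂ = μ₃ (all three fertilizers produce the same mean height increase).
   
   Conclusion: At α = 0.05, there is significant evidence that at least one fertilizer's mean height increase differs from the others.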
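
The ANOVA arithmetic in part (a) can be reproduced with a short standard-library script. This is a sketch under stated assumptions: the function name `one_way_anova` is illustrative, and the critical value 3.68 is quoted from Step 5 rather than computed. Because the script does not round intermediate values, its F comes out slightly above the hand-rounded 25.77; the decision is unchanged.

```python
import statistics

groups = {
    "A": [5.2, 4.8, 6.1, 5.5, 4.9, 5.3],
    "B": [6.5, 7.2, 6.8, 7.0, 6.3, 6.9],
    "C": [5.8, 6.2, 5.5, 6.0, 5.7, 5.9],
}


def one_way_anova(samples):
    """Return sums of squares, mean squares, F, and degrees of freedom."""
    all_values = [x for s in samples for x in s]
    grand_mean = statistics.fmean(all_values)
    k = len(samples)       # number of groups
    n = len(all_values)    # total number of observations

    # Between-group: squared distance of each group mean from the grand mean.
    ssb = sum(len(s) * (statistics.fmean(s) - grand_mean) ** 2 for s in samples)
    # Within-group: squared distance of each value from its own group mean.
    ssw = sum((x - statistics.fmean(s)) ** 2 for s in samples for x in s)

    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return {"SSB": ssb, "SSW": ssw, "MSB": msb, "MSW": msw,
            "F": msb / msw, "df": (k - 1, n - k)}


result = one_way_anova(list(groups.values()))

# Critical value for alpha = 0.05, df = (2, 15), quoted from Step 5.
F_CRITICAL = 3.68
reject = result["F"] > F_CRITICAL

print(f"SSB = {result['SSB']:.3f}, SSW = {result['SSW']:.3f}")
print(f"F = {result['F']:.2f}, df = {result['df']}, reject H0: {reject}")
```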


## Prompts
### Summarize Module Content
----------------

Goal: Generate a comprehensive Markdown file summarizing key concepts from Module 1 of a statistics course (DSC 215).

Role: You are an expert statistics educator with a Ph.D. and 30 years of experience creating effective learning materials.

Context: The provided text includes content from Module 1 lectures, pre-checks, reviews, examples, and homework.


Instructions:

1. Create a Markdown summary of Module 1.
2. Include relevant statistical equations.
3. Provide examples demonstrating the application of those equations.

----------------
### Summarize Midterm and Identify How Topics Covered Relate to What Was Asked on the Midterm
----------------
Goal: Generate a comprehensive Markdown file summarizing the midterm exam of a statistics course (DSC 215).

Role: You are an expert statistics educator with a Ph.D. and 30 years of experience creating effective learning materials.

Context: The provided text includes content from the midterm exam. Students were told that the midterm exam would cover Module 1, Module 2, Module 3, and Module 4.

Instructions:

Part A

1. Create a Markdown summary of the midterm exam.

2. Include relevant statistical equations.

3. Provide step-by-step instructions on how to solve each question.

Part B

1. Are there any trends between the material covered and the questions asked?

Context: "Material covered" refers to the text used to generate module-1-summary.md, module-2-summary.md, module-3-summary.md, and module-4-summary.md.
----------------
### Generate Potential Final Exam Questions
----------------
Goal: Generate a comprehensive Markdown file containing potential final exam questions for DSC 215.

Role: You are an expert statistics educator with a Ph.D. and 30 years of experience in creating effective learning materials, including detailed exam analyses and practice problems.

Context:
* The provided text includes the midterm exam and module summaries.
* The final exam for DSC 215 is comprehensive, covering Module 1, Module 2, Module 3, Module 4, Module 5, Module 6, Module 7, Module 8, Module 9, and Module 10.
* Students were informed that the final exam will emphasize Module 5, Module 6, Module 7, Module 8, Module 9, and Module 10.
* You have access to the following Markdown files:
    * module-1-summary.md
    * module-2-summary.md
    * module-3-summary.md
    * module-4-summary.md
    * midterm-exam-analysis.md (This file contains an analysis of the midterm exam, including trends and patterns)
    * module-5-summary.md
    * module-6-summary.md
    * module-7-summary.md
    * module-8-summary.md
    * module-9-summary.md
    * module-10-summary.md
* Students will have 3 hours to complete the final exam.

Instructions:

Part A: Generate Potential Final Exam Questions

1.  Create a Markdown file containing 50 potential final exam questions for DSC 215.
    * These questions should be representative of the material covered in all modules, with heavier emphasis on Modules 5-10.
    * Use information from the module summaries and the trends/patterns identified in `midterm-exam-analysis.md` to guide the creation of these questions.
    * Vary the question types (e.g., problem-solving, conceptual explanations, interpretations).
    * Ensure the difficulty level is appropriate for a final exam in an introductory statistics course.
    * Consider the 3-hour exam time limit when designing the questions (i.e., avoid overly complex problems that would take too long).
2.  For each question, include:
    * All relevant statistical equations needed to solve the problem (if applicable).
    * A step-by-step solution guide, clearly demonstrating how to apply the equations or arrive at the answer.
    * The module(s) the question primarily covers.