# Lesson 5: Basic Statistical Analysis & Regression in R
**Author:** Petr Čala  
**Date:** 2025-02-26

# Lesson 5 Notebook

Welcome to **Lesson 5**! Over the previous lessons, you’ve acquired a broad skill set in data cleaning, SQL, data manipulation, text handling, advanced visualization, and reproducible workflows. Here, we’ll turn our focus to **basic statistical analysis**, **hypothesis testing**, and **linear regression** in R.

## Topics
1. Recap & Course Context
2. Exploratory Data Analysis (EDA) Refresher
3. Basic Statistical Tests
4. Linear Regression (Simple & Multiple)
5. Interpreting Results & Reporting
6. Further Resources


---
## 1. Recap & Course Context
In previous lessons, you learned how to **handle** and **clean** data, plus how to **visualize** it effectively. Basic statistics and regression techniques are next steps for many final projects, including thesis or journalistic investigations.

> **Note**: We’ll keep it simple, focusing on how to **run** certain tests/regressions and **interpret** them at a high level. For deeper theoretical foundations, consult a dedicated statistics course or textbook.


---
## 2. Exploratory Data Analysis (EDA) Refresher
Before running tests or regressions, we usually start with **EDA** to understand the data. Let’s load an example dataset: the built-in `mtcars` again, or a custom dataset if you prefer.


```r
library(tidyverse)

# We'll work with mtcars, adding factor columns for clarity
df <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),
    am  = as.factor(am),  # 0 = automatic, 1 = manual
    vs  = as.factor(vs)   # 0 = V-shaped, 1 = straight engine
  )

head(df)
```


Remember to check:
- **Structure**: numeric vs. factor columns.
- **Summary stats**: means, medians, standard deviations.
- **Visual** inspection of distributions.


```r
# Structure & summary
str(df)
summary(df)
```
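
As a small sketch of the summary-statistics check, the snippet below computes the mean, median, and standard deviation of `mpg`, plus the average `mpg` per cylinder count. It uses base R on the raw built-in `mtcars` rather than `df`, purely so that it runs on its own; the names `mpg_stats` and `mpg_by_cyl` are illustrative.

```r
# Overall summary statistics for mpg (built-in mtcars data)
mpg_stats <- c(
  mean   = mean(mtcars$mpg),
  median = median(mtcars$mpg),
  sd     = sd(mtcars$mpg)
)
print(round(mpg_stats, 2))

# Average mpg within each cylinder group (4, 6, 8)
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(round(mpg_by_cyl, 2))
```

Comparing the overall mean with the group means is a quick way to spot variables (here, cylinder count) that are worth including in later tests or models.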


### Quick Plot
Let’s see the relationship between **miles per gallon** (`mpg`) and **weight** (`wt`):

```r
ggplot(df, aes(x = wt, y = mpg)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Scatter Plot of mpg vs. wt",
    x = "Weight (1,000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()
```

> **Observation**: Heavier cars tend to have lower mpg (the fitted line slopes downward). We’ll quantify this with **correlation** and **regression** next.


---
## 3. Basic Statistical Tests
We’ll cover:
1. **Correlation**: Measuring the strength of linear relationships.
2. **t-tests**: Comparing group means.
3. **Chi-square** (briefly): Checking association between categorical variables.

### 3.1 Correlation
The `cor()` function computes **Pearson’s correlation** by default, which measures the strength of the linear relationship between two numeric variables.


```r
# Correlation between mpg and weight
cor(df$mpg, df$wt, use = "complete.obs")
```

This returns the correlation coefficient, a value between -1 and 1. To test whether it differs significantly from zero, use `cor.test()`:


```r
cor.test(df$mpg, df$wt)
```

Output includes:
- **Correlation estimate**.
- **p-value** (if < 0.05, typically we say the correlation is statistically significant).
- Confidence interval.

> **Interpretation**: A strong negative correlation suggests that as weight increases, mpg decreases.
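
To make the coefficient less of a black box, here is a minimal sketch that rebuilds Pearson’s r from its definition (covariance divided by the product of the standard deviations) and then pulls individual values out of the `cor.test()` result. It works on the raw `mtcars` columns so it is self-contained; `r_manual`, `r_builtin`, and `ct` are illustrative names.

```r
x <- mtcars$wt   # weight in 1,000 lbs
y <- mtcars$mpg  # miles per gallon

# Pearson's r is the covariance scaled by both standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
print(c(manual = r_manual, builtin = r_builtin))

# cor.test() returns a list; pull out the pieces you want to report
ct <- cor.test(x, y)
print(ct$estimate)   # correlation coefficient
print(ct$p.value)    # p-value for H0: true correlation is 0
print(ct$conf.int)   # 95% confidence interval by default
```

Extracting values this way is handy when you want to quote them in a report instead of copying them from printed output.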


### 3.2 t-tests
A **t-test** is used to compare the means of two groups (e.g., does **automatic** vs. **manual** transmission yield different average mpg?).


```r
# We'll split by transmission type (am: 0=auto, 1=manual)
t.test(mpg ~ am, data = df)
```

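This is a **two-sample t-test**. By default, R runs Welch’s version, which does not assume that the two groups have equal variances. Key outputs include: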
- **Mean** of each group.
- **p-value** indicating if the difference in means is significant.

> **Journalistic Interpretation**: If the p-value < 0.05, we might say there’s a statistically significant difference in mpg between automatic and manual cars. Always consider context, sample size, and assumptions.
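
The sketch below, offered as one possible way to go beyond the p-value, runs the default Welch test next to the equal-variance (Student) version and reports the size of the gap between the group means. It uses the raw `mtcars` data so it runs standalone; the object names are illustrative.

```r
# Welch two-sample t-test (R's default): does not assume equal variances
tt_welch <- t.test(mpg ~ am, data = mtcars)

# Student's version assumes equal variances; shown only for comparison
tt_student <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# Group means: first element is am = 0 (automatic), second is am = 1 (manual)
group_means <- unname(tt_welch$estimate)
mean_diff   <- group_means[2] - group_means[1]

print(round(group_means, 2))
print(round(mean_diff, 2))   # size of the gap in mpg
print(tt_welch$p.value)
print(tt_student$p.value)
```

Reporting the difference in means alongside the p-value tells readers how large the effect is, not just whether it is statistically detectable.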


### 3.3 Chi-Square (Categorical Association)
If we want to see if there’s an association between two **categorical variables** (e.g., number of cylinders `cyl` and engine shape `vs`), we can use **chi-square**.


```r
# Let's create a contingency table
tbl <- table(df$cyl, df$vs)
tbl

# Chi-square test
chisq.test(tbl)
```

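If the p-value is low, there’s evidence that `cyl` and `vs` are **not** independent (i.e., there is some association between them). With only 32 cars, some expected cell counts are small, so R warns that the chi-squared approximation may be incorrect; Fisher’s exact test (`fisher.test()`) is a common alternative in that situation.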
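
When a table has sparse cells, it helps to look at the expected counts before trusting the chi-square p-value. The following is a hedged sketch on the raw `mtcars` data: it inspects the expected counts and then runs Fisher’s exact test as a cross-check. The names `chi`, `small_cells`, and `fisher_p` are illustrative, and the "below 5" threshold is a common rule of thumb rather than a hard rule.

```r
# Contingency table of cylinder count by engine shape
tbl <- table(cyl = mtcars$cyl, vs = mtcars$vs)

# chisq.test() warns here because some expected counts are small;
# suppressWarnings() is used only so we can inspect those counts quietly
chi <- suppressWarnings(chisq.test(tbl))
print(round(chi$expected, 2))
small_cells <- any(chi$expected < 5)
print(small_cells)

# Fisher's exact test does not rely on the large-sample approximation
fisher_p <- fisher.test(tbl)$p.value
print(chi$p.value)
print(fisher_p)
```

If both tests point the same way, the conclusion about association is less likely to be an artifact of the small sample.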


---
## 4. Linear Regression (Simple & Multiple)
Regression is a fundamental tool for journalistic investigations (e.g., does city budget predict crime rate?), the social sciences, and general data analysis.

### 4.1 Simple Linear Regression
Let’s model `mpg` as a function of `wt` (the single predictor).


```r
# Fit a simple linear regression: mpg ~ wt
model1 <- lm(mpg ~ wt, data = df)
summary(model1)
```

Reading the `summary(model1)` output:
- **Coefficients**: The intercept (`(Intercept)`) and slope (`wt`).
- **Estimate** for `wt` is negative, which is consistent with the negative correlation found earlier.
- **p-value** for `wt` (if < 0.05) indicates a statistically significant predictor.
- **R-squared** is the proportion of variance in `mpg` explained by `wt` (between 0 and 1).
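
As a rough illustration of how to pull these quantities out programmatically, the sketch below refits the same model on the raw `mtcars` data, extracts the coefficients and R-squared, and predicts `mpg` for a hypothetical car weighing 3,000 lbs (`wt = 3`). The names `b`, `r2`, and `pred_3000` are illustrative.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# Named vector of coefficients: "(Intercept)" and "wt"
b <- coef(model1)
print(round(b, 3))

# R-squared from the model summary
r2 <- summary(model1)$r.squared
print(round(r2, 3))

# With a single predictor, R-squared equals the squared Pearson correlation
print(all.equal(r2, cor(mtcars$mpg, mtcars$wt)^2))

# Predicted mpg for a hypothetical car weighing 3,000 lbs (wt = 3)
pred_3000 <- predict(model1, newdata = data.frame(wt = 3))
print(round(pred_3000, 2))
```

The prediction is simply the intercept plus the slope times the weight, which is a useful sanity check on how you read the coefficient table.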


### 4.2 Multiple Linear Regression
What if we also consider the number of cylinders (`cyl`) and horsepower (`hp`)?
We can add them to the model:

```r
model2 <- lm(mpg ~ wt + cyl + hp, data = df)
summary(model2)
```
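Now we interpret each coefficient *controlling for* the others. This is more realistic when multiple factors could influence mpg. Because `cyl` is a factor in `df`, R reports separate coefficients (`cyl6`, `cyl8`), each measured relative to the 4-cylinder baseline.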

> **Exercise**: Try other variables like `am` (transmission type). `mpg ~ wt + hp + am` might reveal how manual vs. automatic influences mpg, after accounting for weight and horsepower.


```r
# Let's do a quick multiple regression example:
model2 <- lm(mpg ~ wt + cyl + hp, data = df)
summary(model2)
```
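
One way to judge whether the extra predictors earn their place is to compare the simple and multiple models directly. This is a minimal sketch on the raw `mtcars` data; `factor(cyl)` stands in for the factor conversion applied to `df` earlier, and `adj_r2` is an illustrative name.

```r
# factor(cyl) mirrors the factor conversion applied to df earlier
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + factor(cyl) + hp, data = mtcars)

# Coefficient names show the dummy variables created for cyl
print(names(coef(model2)))

# Adjusted R-squared penalizes extra predictors
adj_r2 <- c(
  simple   = summary(model1)$adj.r.squared,
  multiple = summary(model2)$adj.r.squared
)
print(round(adj_r2, 3))

# F-test for nested models: do the added terms improve the fit?
print(anova(model1, model2))
```

Adjusted R-squared and the nested-model F-test are complementary: the first summarizes fit with a penalty for complexity, while the second tests whether the added terms jointly matter.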

### 4.3 Model Diagnostics
After fitting a model, it’s important to check the **residuals** and see whether the assumptions (linearity, normality, homoscedasticity) hold. In R, call `plot()` on the fitted model:
```r
plot(model2)
```
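This produces four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. In Jupyter, they may render inline one after another. Look for patterns (such as curvature or a funnel shape in the residuals) or influential outliers that suggest an assumption is violated.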
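
As a numeric complement to visual inspection, the sketch below computes a few residual-based checks on a model fitted to the raw `mtcars` data (with `factor(cyl)` standing in for the factor column in `df`). It is one possible set of checks, not a complete diagnostic workflow; `res`, `sw`, and `cd` are illustrative names.

```r
model2 <- lm(mpg ~ wt + factor(cyl) + hp, data = mtcars)
res <- residuals(model2)

# Residuals of a least-squares fit with an intercept sum to (numerically) zero
print(sum(res))

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
print(sw$p.value)

# Cook's distance flags observations with a large influence on the fit
cd <- cooks.distance(model2)
print(head(sort(cd, decreasing = TRUE), 3))
```

With only 32 observations, formal tests have limited power, so treat these numbers as prompts for a closer look at specific cars rather than as pass/fail verdicts.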


---
## 5. Interpreting Results & Reporting
1. **Statistical Significance**: A p-value < 0.05 is a common (though arbitrary) cutoff. Emphasize real-world context (effect size, sample size) over p-values alone.
2. **Coefficients**: In a regression, each coefficient’s sign (+/-) and magnitude matter. For instance, a coefficient of -5 for `wt` means that each additional unit of weight (1,000 lbs in `mtcars`) is associated with a decrease of 5 in predicted mpg, holding other factors constant.
3. **Causation vs. Correlation**: Journalists must clarify that regression doesn’t inherently prove causation.
4. **Transparency**: Document your steps and assumptions in your final report.
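
To report an effect size with its uncertainty rather than a bare p-value, a confidence interval for the coefficient is often enough. The following is a small sketch using the simple `mpg ~ wt` model on the raw `mtcars` data; `slope` and `ci` are illustrative names.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# Point estimate of the slope: change in predicted mpg per 1,000 lbs
slope <- coef(model1)[["wt"]]

# 95% confidence interval for the slope (confint() default level)
ci <- confint(model1)["wt", ]

print(round(slope, 2))
print(round(ci, 2))
```

A sentence built from these two outputs (the estimated change in mpg per 1,000 lbs, followed by its 95% interval) conveys both the size of the association and how precisely it is estimated.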


---
## 6. Further Resources
- **An Introduction to Statistical Learning** (free e-book): deeper dive into regression, classification, etc.
- **rstatix**: a handy R package for simpler stats syntax (t-tests, ANOVAs, etc.).
- **Advanced**: For logistic regression or more complex modeling, see `glm()` in R.
- **Reading**: Field, Andy. _Discovering Statistics Using R_. A thorough reference.


---
## Summary & Wrap-Up
In **Lesson 5**, you:
1. Recapped EDA and correlation.
2. Learned basic statistical tests (t-tests, chi-square).
3. Explored **linear regression** (simple & multiple) and how to read output.
4. Discussed the importance of **interpretation** and **reporting**.

You now have a **comprehensive** skillset:
- Data cleaning/preprocessing (Lessons 1–3)
- Reshaping & SQL (Lesson 2)
- Advanced visualization & reproducible workflows (Lesson 4)
- Basic statistical analysis & regression (Lesson 5)

This final lesson should help you incorporate **analytical rigor** into your journalistic or academic projects. Always remember to be transparent about your methods and mindful of the assumptions in any statistical technique.

Good luck with your final projects and beyond!

# End of Lesson 5
