---
title: "Lesson 1: Data Import, Cleaning, and Basic EDA in R"
author: "Your Name"
date: "Block Lecture 1"
---

# Lesson 1 Notebook

Welcome to our first 4-hour block! In this lesson, we will cover:

1. **Environment Setup and Package Installation**
2. **Basic Recap of R Commands**
3. **Data Import & Export**
4. **Data Exploration and Summary**
5. **Data Cleaning & Preprocessing** (missing values, renaming columns, data type conversions, etc.)
6. **Basic Exploratory Data Analysis** (simple statistics and plots)

We’ll work step-by-step, and by the end, you should feel comfortable handling a basic data cleaning workflow and generating a few quick insights from your dataset.


---
## 1. Environment Setup

In this section, we’ll make sure your environment is ready by loading the packages we need. If any of them are missing, you can install them with `install.packages("package_name")`.

In [None]:
# Run this cell to load (and if necessary, install) required packages:

# If you need to install:
# install.packages("tidyverse")
# install.packages("skimr")  # Optional, but nice for data summaries

library(tidyverse)
library(skimr) # For a more detailed summary
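
If you would rather check what is missing before installing anything, a small helper can do that. The sketch below is one possible approach, not a required step; `missing_packages()` is a hypothetical helper name invented for this example.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "skimr"))

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install them (needs an internet connection):
  # install.packages(to_install)
}
```

The install call is left commented out so that running the cell never changes your library without you deciding to.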

### Check Your Working Directory

> **Instructor Note**: Make sure all students have their working directory set to a folder where their CSV file is located (or where they plan to save outputs).

In [None]:
# Show your current working directory:
getwd()

# If needed, set your working directory (uncomment & modify the path):
# setwd("path/to/your/folder")
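
Before reading or writing files, it can help to confirm that the path you expect actually exists. This is a minimal sketch using base R only; the `data` subfolder is a made-up example, so adjust it to your own layout.

```r
# Build a path relative to the current working directory.
# "data" is a hypothetical subfolder name used only for illustration.
csv_path <- file.path(getwd(), "data", "messy_data.csv")

# file.exists() returns TRUE or FALSE without raising an error.
if (file.exists(csv_path)) {
  message("Found: ", csv_path)
} else {
  message("Not found: ", csv_path, " (check getwd() or the file name)")
}
```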

---
## 2. (Optional) Create Example CSV
If you want to distribute a prepared CSV, you can skip this. Otherwise, run the following code to simulate a small “messy” dataset in your current directory.

In [None]:
## Create a small random dataset and write it to CSV:
set.seed(123)  # For reproducibility

# Example: We'll simulate data about some hypothetical employees
employee_data <- data.frame(
  ID = 1:15,
  Name = c("Alice","Bob","Carla","David","Elena","Frank","Georgia","Henry","Ivy","John","Kate","Luke","Mona","Nick","Olivia"),
  Age = c(25, 34, 28, NA, 42, 51, 29, 33, 41, 38, NA, 23, 27, 45, 36),
  Department = c("Marketing","Marketing","Sales","HR","Sales","Sales","Marketing","IT","IT","HR","HR","NA","Marketing","IT","NA"),
  Salary = c(50000, 60000, 55000, 45000, 65000, 70000, 52000, 80000, 75000, 48000, 46000, 40000, 53000, 81000, 42000),
  Performance_Score = c("Good","Excellent","Fair","Excellent","Good","Good","Fair","Excellent","Good","Fair","Good","NA","Fair","Excellent","NA")
)

# Write to CSV
write_csv(employee_data, "messy_data.csv")

# Check the file
list.files(pattern = "messy_data.csv")

> **Note**: The dataset has true missing values (`NA`) in `Age`, the text marker `"NA"` in `Department` and `Performance_Score`, and a text-based categorical column (`Performance_Score`).
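
One detail in the code above is easy to miss: `Age` uses true missing values (`NA`), while `Department` and `Performance_Score` use the text `"NA"`. In memory these are different things. The toy vector below is made up for illustration and shows one way to convert the text marker explicitly with `dplyr::na_if()`.

```r
library(dplyr)

dept <- c("Sales", "NA", NA)       # toy vector, not the lesson data

is.na(dept)                        # only the last element is truly missing

dept_clean <- na_if(dept, "NA")    # turn the text "NA" into a real NA
is.na(dept_clean)                  # now the last two elements are missing
```

When importing, `read_csv()` has an `na` argument (by default `c("", "NA")`) that controls which text markers are read as missing.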

---
## 3. Data Import
Now, let’s load the dataset we just created (or the one provided to you) in CSV format.

In [None]:
# Replace "messy_data.csv" with your actual CSV file name if different
df <- read_csv("messy_data.csv")

# Let's see the first few rows
head(df)

# Let's quickly inspect the structure
str(df)

### Quick Exercises
1. **Change the file name** to something else and load it again.
2. **Check** what happens if you use `read.csv("filename.csv")` vs. `read_csv("filename.csv")`. Base R's `read.csv()` returns a plain data frame, while readr's `read_csv()` returns a tibble, and the two differ slightly in how they guess column types and handle column names.
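
To make the second exercise concrete, here is a small self-contained comparison. It writes a tiny made-up table to a temporary file, so it does not touch `messy_data.csv`; the column names are invented for the example.

```r
library(readr)

# Write a tiny made-up table to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,first name", "1,Alice", "2,Bob"), tmp)

base_df  <- read.csv(tmp)    # base R
readr_df <- read_csv(tmp)    # readr

class(base_df)     # a plain data.frame
class(readr_df)    # a tibble (which is also a data.frame)

names(base_df)     # base R rewrites "first name" into a syntactic name
names(readr_df)    # readr keeps the header as written
```

Differences like these are why it is worth picking one reader and using it consistently within a project.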

---
## 4. Basic Recap of R & Exploratory Data Checks

### 4.1 Summaries

In [None]:
# Basic summary
summary(df)

# More detailed summary using skimr (if installed)
skim(df)

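For numeric columns, `summary()` reports the minimum, quartiles, median, mean, and maximum, plus a count of `NA`s; for character columns it shows only the length and type. `skim()` (if installed) goes further, adding per-column missing counts, completion rates, and small inline histograms for numeric columns.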

### 4.2 Viewing the Data

In [None]:
# For a larger view in a separate tab (works best in RStudio):
# View(df)

> **Tip**: In Jupyter, you can just print `df` or head of `df`. In RStudio, `View(df)` opens a spreadsheet-like viewer.

---
## 5. Data Cleaning & Preprocessing

Here’s where we start cleaning. We’ll walk through **renaming columns**, **dealing with missing values**, and **converting data types** if needed.

### 5.1 Renaming Columns

We might have columns with awkward or overly long names. Let’s shorten one using `dplyr::rename()`, which takes `new_name = old_name` pairs:

In [None]:
df <- df %>%
  rename(
    PerfScore = Performance_Score
  )

# Check new names
names(df)

### 5.2 Handling Missing Values

1. **Identify Missing Values** using `is.na()`:

In [None]:
colSums(is.na(df))

2. **Deciding What to Do**:
- We might remove rows with too many missing values.
- Or we might **impute** them (e.g., replace them with the mean or median).
- For this example, let’s remove rows where the department is `NA`:

In [None]:
# Example: remove rows with missing "Department"
df <- df %>%
  filter(!is.na(Department))

# Re-check for missing values
colSums(is.na(df))
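
The first option in the list above, dropping rows with too many missing values, can be sketched as follows. The toy data and the threshold of one missing value per row are both arbitrary choices made for illustration.

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- tibble(
  Name       = c("A", "B", "C"),
  Age        = c(30, NA, NA),
  Department = c("HR", "IT", NA)
)

# Count missing values per row.
toy$n_missing <- rowSums(is.na(toy))

# Keep rows with at most one missing value (arbitrary threshold).
toy_kept <- toy %>%
  filter(n_missing <= 1)

toy_kept
```

Keeping the `n_missing` column around makes it easy to report how many rows were dropped and why.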

3. **Imputing** (Example on Age):
- Let’s replace missing `Age` values with the mean `Age`:

In [None]:
mean_age <- mean(df$Age, na.rm = TRUE)
df$Age <- ifelse(is.na(df$Age), mean_age, df$Age)

# Check again
colSums(is.na(df))

> **Instructor Note**: Discuss the pros/cons of removing vs. imputing data. Also highlight that for journalistic data stories, transparency about how you handle missing data is crucial.
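
As a starting point for that discussion, here is a hedged sketch of an alternative to mean imputation: filling with the median, which is less sensitive to outliers, and keeping a flag column so the imputed rows stay identifiable. The toy values and the `Age_was_missing` column name are invented for this example.

```r
library(dplyr)

# Toy values; 100 is a deliberate outlier that would pull the mean upward.
toy <- tibble(Age = c(20, NA, 30, 100))

toy_imputed <- toy %>%
  mutate(
    Age_was_missing = is.na(Age),                               # record before filling
    Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age)   # fill with the median
  )

toy_imputed
```

The flag column costs almost nothing and lets you state exactly how many values were filled in when you publish results.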

### 5.3 Data Type Conversion

Sometimes columns should be factors (categorical) or numeric. Let’s ensure `Department` and `PerfScore` are factors:

In [None]:
df <- df %>%
  mutate(
    Department = as.factor(Department),
    PerfScore  = as.factor(PerfScore)
  )

# Check again
str(df)
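
By default `as.factor()` orders levels alphabetically, which may not suit a rating scale. If the scores have a natural order, an ordered factor is one option. The ordering Fair < Good < Excellent below is an assumption made for this sketch, and the vector is toy data.

```r
# Toy vector; the level order is an assumption, not something the data states.
scores <- c("Good", "Excellent", "Fair", "Good")

scores_ord <- factor(
  scores,
  levels  = c("Fair", "Good", "Excellent"),
  ordered = TRUE
)

levels(scores_ord)       # levels follow the order given, not the alphabet
scores_ord >= "Good"     # comparisons now respect that order
```

Plots and summaries will also list the categories in this order, which usually reads more naturally.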

---
## 6. Basic Transformations with dplyr

### 6.1 Selecting and Filtering

In [None]:
# Select specific columns
df_selected <- df %>%
  select(Name, Department, Salary)

head(df_selected)

# Filter rows based on a condition
df_sales <- df %>%
  filter(Department == "Sales")

df_sales

### 6.2 Mutating (Creating New Columns)

In [None]:
# Create a new column "Salary_in_Thousands"
df <- df %>%
  mutate(
    Salary_in_Thousands = Salary / 1000
  )

head(df)

### 6.3 Grouping and Summarizing

In [None]:
# Average salary by department
df %>%
  group_by(Department) %>%
  summarise(
    Avg_Salary = mean(Salary, na.rm = TRUE),
    Count = n()
  )

> **Exercise**: Try grouping by both `Department` and `PerfScore` to see how average salary varies across performance scores within each department.

---
## 7. Basic Exploratory Data Analysis (EDA)

### 7.1 Quick Statistical Checks

In [None]:
# Summaries for numeric columns
summary(df$Salary)
summary(df$Age)

### 7.2 Simple Plots with ggplot2

In [None]:
# Histogram of Age
ggplot(df, aes(x = Age)) +
  geom_histogram(bins = 5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Age", x = "Age", y = "Count")

# Bar plot of Department distribution
ggplot(df, aes(x = Department)) +
  geom_bar(fill = "orange", color = "black") +
  labs(title = "Number of Employees per Department", x = "Department", y = "Count")

> **Instructor Note**: Depending on how advanced your class is, you might also introduce basic color/fill by another variable (e.g., fill by `PerfScore`).

### 7.3 Boxplot to Check Salary Distribution

In [None]:
ggplot(df, aes(x = Department, y = Salary)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Salary Distribution by Department", x = "Department", y = "Salary")

> **Exercise**:  
> 1. Create a bar plot showing how `PerfScore` is distributed across `Department` (a boxplot needs a numeric variable, so try `Salary` by `PerfScore` if you want one).  
> 2. Try coloring the bars by `PerfScore`.

---
## 8. Exporting the Cleaned Dataset

Now that we’ve cleaned and explored the data, we can save our new version of the dataset to a CSV file for future use.

In [None]:
write_csv(df, "cleaned_data.csv")

# Confirm it exists
list.files(pattern = "cleaned_data.csv")
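
One caveat when exporting: CSV stores plain text, so R-specific types such as factors are not preserved when the file is read back. If the cleaned data will only be reused from R, base R's `saveRDS()` and `readRDS()` are one alternative. The sketch below uses made-up toy data and temporary files.

```r
library(readr)

# Toy data, made up for illustration.
toy <- data.frame(
  Department = factor(c("HR", "IT", "HR")),
  Salary     = c(45000, 80000, 48000)
)

# RDS keeps R-specific information such as factor levels.
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
toy_rds <- readRDS(rds_path)
is.factor(toy_rds$Department)

# CSV is plain text, so the factor comes back as a character column.
csv_path <- tempfile(fileext = ".csv")
write_csv(toy, csv_path)
toy_csv <- read_csv(csv_path)
is.factor(toy_csv$Department)
```

CSV remains the better choice for sharing with people who use other tools; after re-importing, rerun the type conversions from Section 5.3.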

---
## 9. Summary & Next Steps

- **What We Did**:
  1. Reviewed how to load libraries and check the working directory.
  2. Read in our CSV dataset (`messy_data.csv`).
  3. Explored the data structure, found missing values, and handled them.
  4. Renamed columns, converted data types, and created new columns.
  5. Performed basic EDA: histograms, bar plots, boxplots.
  6. Exported our cleaned dataset.

- **Next**: In the next session, we’ll look at **more advanced reshaping**, **data merging**, and an **introduction to SQL** in R.

> **Instructor Tip**: Encourage students to apply these steps on any dataset relevant to their thesis or final project.

# End of Lesson 1


---
## Additional Notes for Instructors
1. **Data Size**: For demonstration, keep the datasets small so that everything runs quickly in class.
2. **Student Engagement**: Encourage them to modify code, try their own filters, or fix missing data in different ways to understand the trade-offs.
3. **Make It Real**: Journalists might enjoy using real datasets (e.g., small election datasets, public health statistics, or sample tweets). If time allows, demonstrate how to get data from a public portal (like data.gov or a relevant local open-data site).
4. **Troubleshooting**: Common issues include read/write permissions, mismatched working directories, or missing packages. Allocate time for these sorts of setup challenges.

By the end of this notebook, your class should have a foundation in reading data, performing essential cleaning steps, and running quick EDA. In **Lesson 2**, you’ll go further into **data reshaping, merging** with `dplyr` joins, and a practical introduction to **running SQL queries** in R.