---
title: "Lesson 3: Advanced Data Cleaning & Text Handling in R"
author: "Your Name"
date: "Block Lecture 3"
---

# Lesson 3 Notebook

Welcome to **Lesson 3**! In the previous lessons, you learned:
- **Lesson 1**: Reading in datasets, cleaning basics, and simple EDA.
- **Lesson 2**: Reshaping, merging data frames, and a quick intro to SQL in R.

Now, we’ll explore more **advanced data cleaning** techniques (especially for messy real-world data), **text processing** (since many journalism datasets include text), and wrap up with a **mini-project** to reinforce the entire workflow.

## Topics
1. Advanced Data Cleaning & Validation
2. Handling and Cleaning Text Data
3. Combining Data Wrangling and SQL
4. Mini Workflow / Capstone Example
5. Looking Ahead & Thesis Tips


---
## 1. Advanced Data Cleaning & Validation
In Lesson 1, we touched on **missing values**, **renaming**, and **basic type conversion**. In real-life scenarios, data might have **inconsistent formats**, **typos**, **outliers**, or **invalid values**. Below are some approaches and packages that can help.

### 1.1 Revisiting the `df` from Lesson 1 (or a new data set)
We’ll assume you have a dataset (`df`) with potential issues:
1. **String Inconsistencies**: different capitalization (`"Marketing"` vs. `"marketing"`).
2. **Out-of-range values**: e.g., `Age` might be 150.
3. **Whitespace or strange characters** in text fields.


In [None]:
# Load needed packages
library(tidyverse)

# Suppose we have a small example that simulates these issues
df <- data.frame(
  ID = 1:6,
  Department = c("Marketing ", "MARKETING", "Sales", "sales", "Sales ", "IT "),
  Age = c(25, 200, 30, NA, 28, 45),
  Comment = c("  Great Employee ", "N/A", "excellent worker", "??", "n/a", "    ")
)

df

### 1.2 Dealing with Inconsistent Text
Sometimes you want to **trim whitespace**, convert to **lowercase** or **title case**, and replace placeholders (`"N/A"`, `"n/a"`) with actual `NA`.


In [None]:
# stringr is loaded with the tidyverse; the explicit call is harmless
library(stringr)

df_clean <- df %>%
  mutate(
    # Trim whitespace
    Department = str_trim(Department),

    # Convert to title case so "MARKETING" and "sales" match their variants
    # (note: title case also turns "IT" into "It")
    Department = str_to_title(Department),

    # Trim whitespace in Comment first, so padded placeholders are caught too
    Comment = str_trim(Comment),

    # Replace 'N/A', 'n/a', or '??' with an actual NA in 'Comment'
    Comment = na_if(Comment, "N/A"),
    Comment = na_if(Comment, "n/a"),
    Comment = na_if(Comment, "??"),

    # If Comment is just an empty string, treat it as NA
    Comment = na_if(Comment, "")
  )

df_clean

> **Note**: You could also create a small function that standardizes these steps (e.g., repeated calls to `na_if()`).
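
One way such a helper might look is sketched below. The function name `clean_placeholders()` and its default placeholder list are assumptions made for illustration, not part of any package; adjust the list to whatever placeholders your own data uses.

```r
library(stringr)

# Hypothetical helper: trim whitespace, then turn known placeholders
# (and empty strings) into NA. The default list is an assumption.
clean_placeholders <- function(x, placeholders = c("n/a", "??", "")) {
  x <- str_trim(x)
  # Compare in lowercase so "N/A" and "n/a" are both caught
  x[str_to_lower(x) %in% placeholders] <- NA_character_
  x
}

comments <- c("  Great Employee ", "N/A", "excellent worker", "??", "n/a", "    ")
cleaned  <- clean_placeholders(comments)
cleaned
```

Inside a pipeline this would replace the repeated `na_if()` calls with a single line, e.g. `mutate(Comment = clean_placeholders(Comment))`.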

### 1.3 Validating Numeric Ranges
Next, suppose **Age** should never exceed 120. We can flag or fix out-of-range values.


In [None]:
# Replace unrealistic ages with NA (existing NA values stay NA)
df_clean <- df_clean %>%
  mutate(
    Age = ifelse(Age > 120, NA, Age)  # treat out-of-range as NA
  )

df_clean
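
If you would rather **flag** suspicious values than overwrite them, a sketch along these lines keeps the raw number and records the decision in separate columns. The column names `age_out_of_range` and `Age_clean`, and the lower bound of 0, are assumptions chosen for this example.

```r
library(dplyr)

ages <- data.frame(ID = 1:6, Age = c(25, 200, 30, NA, 28, 45))

ages_flagged <- ages %>%
  mutate(
    # TRUE when a value is present but outside the plausible range
    age_out_of_range = !is.na(Age) & (Age < 0 | Age > 120),
    # Keep the raw Age and write the cleaned value to a new column
    Age_clean = if_else(age_out_of_range, NA_real_, Age)
  )

ages_flagged
```

Keeping the original column makes it easy to report how many values were changed, which is useful when you document your cleaning steps.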

### 1.4 Additional Validation Packages
- **`validate`**: a package that lets you define **validation rules**.
- **`assertr`**: for creating **assertions** that your data meets certain conditions.

> **Exercise**: Install and explore `validate` or `assertr` to define rules like `Age >= 0 & Age <= 120` or `Department` in a set of known departments.
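
As a possible starting point for the exercise, here is a minimal sketch using `validate`. The small `staff` data frame, the rule names, and the set of known departments are invented for illustration.

```r
# install.packages("validate") if needed
library(validate)

staff <- data.frame(
  Department = c("Marketing", "Sales", "IT", "Finance"),
  Age        = c(25, 200, 30, NA)
)

# Each rule is an R expression that should be TRUE for every row
rules <- validator(
  age_in_range = Age >= 0 & Age <= 120,
  known_dept   = Department %in% c("Marketing", "Sales", "IT")
)

checked <- confront(staff, rules)
rule_summary <- summary(checked)  # one row per rule: passes, fails, NA counts
rule_summary
```

The summary tells you how many rows pass or fail each rule without changing the data, so you can decide afterwards how to handle the failures.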


---
## 2. Handling and Cleaning Text Data

As journalism students, you might deal with **text-heavy** data: articles, transcripts, tweets, etc. R provides packages like **`stringr`** and **`tidytext`** for text wrangling.

### 2.1 Basic Stringr Operations
We’ve already used `str_trim()` and `str_to_title()`; `str_to_lower()` and `str_to_upper()` work the same way. Other common tasks:
- `str_detect(text, pattern)`: Check whether a pattern is present (returns `TRUE`/`FALSE`).
- `str_replace_all(text, pattern, replacement)`: Replace every match of a regex pattern.
- `str_split(text, pattern)`: Split text into substrings (returns a list).
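
To make these three functions concrete, here is a small sketch on two made-up strings. The patterns are only examples; real hashtags and punctuation can be messier.

```r
library(stringr)

tweets <- c("This is #awesome!", "Breaking news: R is amazing???")

# Does each string contain a hashtag? (\\w+ means one or more word characters)
has_hashtag <- str_detect(tweets, "#\\w+")

# Collapse runs of two or more question marks into a single one
fewer_marks <- str_replace_all(tweets, "\\?{2,}", "?")

# Split on single spaces; the result is a list with one vector per input string
words <- str_split(tweets, " ")

has_hashtag
fewer_marks
words
```

Note that `str_split()` returns a list because each input string can produce a different number of pieces.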


In [None]:
# Simple example: searching & replacing in a column
text_data <- data.frame(
  ID = 1:4,
  Tweet = c(
    "This is #awesome!",
    "Check out https://example.com for details",
    "Breaking news: R is amazing???",
    "Email me at test@example.com."  
  )
)

text_data

#### Example: Removing URLs or Email Addresses
You can either **remove** URLs and email addresses entirely or **mask** them with placeholders. The code here masks them.


In [None]:
text_clean <- text_data %>%
  mutate(
    # Mask URLs: "http" or "https", then "://", then everything up to the next whitespace
    Tweet = str_replace_all(Tweet, "https?://\\S+", "[LINK]"),

    # Mask email addresses
    Tweet = str_replace_all(Tweet, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}", "[EMAIL]")
  )

text_clean

### 2.2 Tidytext Basics
For **tokenization** (splitting text into words), **stopword** removal, or **word frequencies**, the `tidytext` package is invaluable. A quick demonstration:

```r
# install.packages("tidytext") if needed
library(tidytext)

# We can unnest tokens to split each tweet into individual (lowercased) words
text_tokens <- text_clean %>%
  unnest_tokens(word, Tweet)

# Count frequencies
text_tokens %>% count(word, sort = TRUE)
```

> **Exercise**:
> 1. Filter out **stopwords** (e.g., "the", "is", "at"), using `stop_words` from tidytext.
> 2. Compare the top words used in your dataset.
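
If you get stuck on the first part, one possible approach is an `anti_join()` against `stop_words`, sketched here on two invented sentences. Treat it as a starting point: `stop_words` combines several lexicons, so it may drop more words than you expect.

```r
library(dplyr)
library(tidytext)

tweets <- tibble(
  ID    = 1:2,
  Tweet = c("This is the best news at the moment", "R is amazing")
)

word_counts <- tweets %>%
  unnest_tokens(word, Tweet) %>%          # one lowercased word per row
  anti_join(stop_words, by = "word") %>%  # drop rows whose word is a stopword
  count(word, sort = TRUE)

word_counts
```

`anti_join()` keeps only the rows of the left table that have no match in the right table, which is exactly the "remove everything on this list" operation needed here.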


---
## 3. Combining Data Wrangling and SQL
By now, you’ve seen how to **reshape** (Lesson 2), **clean** (Lessons 1 and 3), and run quick **SQL** queries in R. Often, you’ll use **both** approaches:
- Use R’s `dplyr` or string functions for specialized wrangling.
- Use SQL queries to filter or aggregate large data.

### 3.1 Example: Clean + Query
Below is a mini workflow example:


In [None]:
library(sqldf)

# 1. Suppose we have a data frame with messy text or numeric issues
df_example <- data.frame(
  user_id = 1:5,
  status  = c("Active", "active", "inactive", "unknown", "ACTIVE"),
  score   = c(50, 80, 90, NA, 120)
)

# 2. Clean status text, set out-of-range score to NA
df_example <- df_example %>%
  mutate(
    status = str_to_lower(status),
    status = ifelse(status == "unknown", NA, status),
    score  = ifelse(score > 100, NA, score)
  )

# 3. Query using sqldf
sqldf("SELECT user_id, status, score FROM df_example WHERE score >= 60")

That’s a simplified illustration. In practice, you might have multiple data frames with complex relationships, or you might prefer to store them in a **SQLite** database. The key is: **R** + **SQL** can cooperate seamlessly.
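
If you want to try the SQLite route, a minimal sketch with the **DBI** and **RSQLite** packages could look like this. The table name `scores` and the toy data are assumptions; an in-memory database is used so nothing is written to disk.

```r
# install.packages(c("DBI", "RSQLite")) if needed
library(DBI)

scores <- data.frame(
  user_id = 1:5,
  status  = c("active", "active", "inactive", NA, "active"),
  score   = c(50, 80, 90, NA, NA)
)

# ":memory:" creates a temporary database that disappears on disconnect;
# pass a file path (e.g. a hypothetical "lesson3.sqlite") to keep it
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "scores", scores)

high_scores <- dbGetQuery(
  con,
  "SELECT user_id, status, score FROM scores WHERE score >= 60"
)

dbDisconnect(con)
high_scores
```

Rows with a missing `score` are stored as SQL `NULL` and do not satisfy `score >= 60`, so they are left out of the result, just as in the `sqldf` example.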


---
## 4. Mini Workflow / Capstone Example
Now let’s outline a **mini-project** that ties everything together. We’ll do it conceptually here, but encourage you to apply it to **real** data you’re investigating for your thesis or class project.

### 4.1 Scenario
You have a dataset of **news articles** with columns such as:
- `article_id`, `title`, `author`, `content`, `published_date`, `category`

Some article content may be missing or consist only of placeholders. You also have a **lookup** table that maps `category` codes to more descriptive labels (e.g., `POL = Politics`, `ENT = Entertainment`).

### 4.2 Steps
1. **Import** the main article dataset (`articles.csv`) and the category lookup file (`category_lookup.csv`).
2. **Clean** the article text: remove or replace certain patterns (URLs, email addresses), standardize capitalization in `title`, handle missing or placeholder content.
3. **Join** the main dataset with `category_lookup` to get descriptive names.
4. **Explore** text frequencies or word counts (with `tidytext`) to see which terms appear most often in each category.
5. **Optional**: Store the cleaned dataset in a local SQLite database, run a few **SQL** queries for practice.
6. **Visualize** the top words or categories in a bar chart or word cloud.
7. **Export** the final cleaned dataset.

### 4.3 Example Code Snippet (Pseudocode)
```r
# 1. Import (read_csv, the pipe, stringr, and ggplot2 all come from the tidyverse)
library(tidyverse)
articles <- read_csv("articles.csv")
lookup   <- read_csv("category_lookup.csv")

# 2. Clean text
articles <- articles %>%
  mutate(
    title   = str_to_title(title),
    content = str_replace_all(content, "https?://\\S+", "[LINK]"),
    content = str_trim(content),
    content = ifelse(content == "", NA, content)
  )

# 3. Join with category lookup
articles_joined <- left_join(articles, lookup, by = "category")

# 4. Explore text frequencies
library(tidytext)
tokenized <- articles_joined %>%
  unnest_tokens(word, content)

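# 5. Summaries / Queries: number of word tokens per category
# (assumes the lookup file's label column is named descriptive_category)
library(sqldf)
sqldf("SELECT descriptive_category, COUNT(*) AS n_tokens FROM tokenized GROUP BY descriptive_category")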

# 6. Visualization
tokenized %>%
  count(word, sort = TRUE) %>%
  filter(n > 50) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Common Words", x = "Word", y = "Count")

# 7. Export
write_csv(articles_joined, "articles_cleaned.csv")
```
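
Step 4 asks for the most frequent terms *in each category*, which the snippet above does not yet do. Here is a minimal sketch on a few invented rows; the toy data and the `descriptive_category` column name are assumptions standing in for your real files.

```r
library(dplyr)
library(tidytext)

# Toy stand-ins for articles.csv and category_lookup.csv (invented rows)
articles <- tibble(
  article_id = 1:3,
  category   = c("POL", "POL", "ENT"),
  content    = c("Parliament votes on budget",
                 "Budget debate continues",
                 "Film festival opens")
)
lookup <- tibble(
  category             = c("POL", "ENT"),
  descriptive_category = c("Politics", "Entertainment")
)

top_words <- articles %>%
  left_join(lookup, by = "category") %>%
  unnest_tokens(word, content) %>%
  count(descriptive_category, word, sort = TRUE) %>%
  group_by(descriptive_category) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%  # top 2 words per category
  ungroup()

top_words
```

On real articles you would usually remove stopwords before counting and raise the number of words kept per category.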


> **Tips**:
> 1. Show how you cleaned the data in your project or thesis: create an **Appendix** with code.
> 2. Keep track of each step so it’s reproducible.
> 3. Use Git or version control if possible to track changes.


---
## 5. Looking Ahead & Thesis Tips
1. **Text Data**: Journalism students often face unstructured text. Keep practicing `stringr` and `tidytext`.
2. **SQL**: If your data is large, consider a database approach. Use R’s **DBI** or **sqldf** for more complex queries.
3. **Validation**: Tools like `validate` or `assertr` help ensure data meets certain rules. This is crucial for **fact-checking** or ensuring consistent data quality.
4. **Reproducibility**: Try using **R Markdown** or **Quarto** to produce **reproducible reports**. Insert your code and outputs into a single document for your thesis.
5. **Version Control**: If you collaborate, store your project in a Git repository or a shared platform so you can revert changes and track your progress.

Congratulations on making it this far! You now have a robust toolkit for **importing**, **cleaning**, **reshaping**, **merging**, **text processing**, and even **SQL querying** in R. These skills will serve you well in data-driven journalism and beyond.

# End of Lesson 3
