# Welcome to the Diploma Thesis Seminar II R & SQL Block Course

## Course Overview

Welcome to our in-depth course on **data preprocessing**, **cleaning**, **SQL**, and related data-handling workflows using R. We have split the material into **two major parts**, each containing multiple **4-hour blocks** (or lessons). By the end of this course, you’ll have a **comprehensive toolkit** for real-world data work and final thesis projects.

This notebook will give you a quick overview of:

- Course goals
- Topics covered & overall format
- Technical setup and resources
- Getting help
- Next steps

We hope these materials will greatly assist you when dealing with real-life data for your diploma thesis or journalism projects.

## 1. Learning Goals

- Learn how to **set up** and **navigate** R (and Jupyter/RStudio) for data analysis.
- Understand **data importing/exporting**, cleaning techniques, and best practices in **data preprocessing**.
- Explore **basic** to **advanced** Exploratory Data Analysis (EDA) and data transformations.
- Gain a **practical introduction to SQL** in R, including querying and joining datasets.
- Practice more **advanced database** concepts (window functions, indexing, design) and connecting to external databases.
- Learn how to **scrape web data**, handle **APIs**, and parse **JSON**.
- Master **reproducible workflows** (R Markdown/Quarto) and final project best practices.
- Confidently apply these skills to real-world data for your **journalism** or **final thesis**.

## 2. Course Format & Outline

The course is organized into two main parts, each containing a series of **4-hour** blocks (lessons). **Part 1** focuses on foundational R and SQL topics, while **Part 2** explores more advanced database usage, web scraping, APIs, and final project workflows.

### **Part 1** (Foundational Topics)
- **Lesson 1**: Data Import, Cleaning & Basic EDA
  - Installing/loading packages, environment setup.
  - Reading/writing CSV, Excel; data cleaning (missing values, type conversions).
  - Basic exploratory plots (histograms, bar plots, boxplots).

- **Lesson 2**: Data Reshaping, Merging & Introduction to SQL
  - Pivoting data (long <-> wide) with `tidyr`.
  - Combining datasets via `dplyr` joins.
  - Intro to SQL in R (using `sqldf` or `DBI` + `RSQLite`); basic SELECT, WHERE, JOIN.

- **Lesson 3**: Advanced Data Cleaning & Text Handling
  - Handling complex missingness, text cleaning (`stringr`), factor management.
  - Simple text processing or tokenization (`tidytext`).
  - Mini capstone workflow example (import → clean → transform → export).

- **Lesson 4**: Advanced Visualization & Reproducible Reporting
  - Customizing plots with `ggplot2`, facets, themes.
  - Introduction to R Markdown or Quarto for integrated reporting.
  - Building reproducible workflows and final project best practices.
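
To give a feel for the import → clean → transform → export workflow mentioned in Lesson 3, here is a minimal sketch in base R. The file contents, column names (`section`, `word_count`), and missing-value codes are made up for illustration; the lessons themselves may use `tidyverse` functions instead.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
raw_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,section,word_count",
  "1,politics,850",
  "2,Politics ,NA",
  "3,sport,n/a",
  "4,sport,430"
), raw_path)

# Import: treat both "NA" and "n/a" as missing values.
articles <- read.csv(raw_path, na.strings = c("NA", "n/a"),
                     stringsAsFactors = FALSE)

# Clean: trim stray whitespace, normalise case, drop rows without a word count.
articles$section <- tolower(trimws(articles$section))
clean <- articles[!is.na(articles$word_count), ]

# Transform: mean word count per section.
summary_tbl <- aggregate(word_count ~ section, data = clean, FUN = mean)

# Export: write the summary to a new CSV file.
out_path <- tempfile(fileext = ".csv")
write.csv(summary_tbl, out_path, row.names = FALSE)

print(summary_tbl)
```

Using `tempfile()` keeps the sketch from touching your working directory; with real data you would point `read.csv()` and `write.csv()` at your own file paths.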

### **Part 2** (Advanced Databases, APIs & Workflow)
- **Lesson 1**: Databases, SQL & R (Revisited)
  - Deep dive on connecting R to databases (SQLite, MySQL, Postgres).
  - DB design concepts, CRUD operations, advanced queries.

- **Lesson 2**: Advanced SQL Features & dbplyr
  - Window functions, subqueries, indexing & performance.
  - Using `dbplyr` to translate dplyr code to SQL behind the scenes.

- **Lesson 3**: Web Scraping, APIs & JSON Handling
  - Using `httr` to call APIs and parse JSON.
  - Scraping HTML pages with `rvest`; storing scraped data in a DB.
  - Handling large or paginated data sources.

- **Lesson 4**: Reproducible Workflows, Final Project Best Practices
  - Structuring a project for transparency (file/folder organization, version control).
  - Using R Markdown/Quarto for final thesis or journalistic deliverables.
  - Sharing your findings & advanced integration (Shiny, dashboards).
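
As a preview of the SQL topics in Part 2, the sketch below loads a small data frame into an in-memory SQLite database and ranks rows within groups using a window function. The table, its columns, and the values are hypothetical, and the example assumes `DBI` plus a reasonably recent `RSQLite` (one whose bundled SQLite supports window functions); if either package is missing, the database step is skipped.

```r
# Made-up data for illustration only.
articles <- data.frame(
  id = 1:4,
  section = c("politics", "politics", "sport", "sport"),
  word_count = c(850, 620, 430, 510)
)

ranked <- NULL
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # An in-memory SQLite database: nothing is written to disk.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "articles", articles)

  # Window function: rank articles by length within each section.
  ranked <- DBI::dbGetQuery(con, "
    SELECT id, section, word_count,
           RANK() OVER (PARTITION BY section
                        ORDER BY word_count DESC) AS rank_in_section
    FROM articles
    ORDER BY section, rank_in_section
  ")

  # Always close the connection when done.
  DBI::dbDisconnect(con)
  print(ranked)
} else {
  message("DBI and/or RSQLite not installed; skipping the database example.")
}
```

The same query could also be run against MySQL or Postgres by swapping the driver passed to `dbConnect()`, though connection details would differ.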

We suggest you **follow** these lessons in order, but feel free to skip ahead or revisit topics as needed. Each lesson is hands-on, so come prepared to **code along** and **experiment**.

## 3. Requirements & Setup

- **Software**:
  1. **Binder** (if provided) to get coding right away without the need for local installs.
  2. If you want to set everything up locally, install **R** (4.0 or above) together with either **Jupyter** (with an **R** kernel) or **RStudio**.
- **Packages**: The main libraries we’ll rely on are `tidyverse`, `readxl`, `DBI`, `RSQLite`, `sqldf`, `httr`, `rvest`, and `jsonlite`. In Binder, they may be pre-installed.
- **Hardware**: A laptop with at least **4 GB** of memory; more if you plan to work with large datasets.
- **Data**: We will provide example datasets throughout the lessons. You are encouraged to bring **your own** data relevant to your thesis.
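
For a local setup, a quick check along the following lines can confirm the R version and show which of the packages listed above are still missing. This is only a sketch: the variable names are illustrative, and the install line is left commented out so that nothing is installed without your say-so.

```r
# Packages named in the course requirements.
required <- c("tidyverse", "readxl", "DBI", "RSQLite",
              "sqldf", "httr", "rvest", "jsonlite")

# The course asks for R 4.0 or above.
r_ok <- getRversion() >= "4.0.0"
if (!r_ok) {
  message("R ", as.character(getRversion()),
          " is older than 4.0; consider upgrading.")
}

# Which of the required packages are not installed yet?
to_install <- setdiff(required, rownames(installed.packages()))

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```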


## 4. Getting Help

- **During class**: Ask questions freely! We’ll troubleshoot code and environment issues in real time.
- **Online resources**:
  - *R for Data Science* (Hadley Wickham & Garrett Grolemund)
  - RStudio cheat sheets: <https://rstudio.com/resources/cheatsheets/>
  - SQL basics tutorials: w3schools, SQLBolt, etc.
  - For advanced topics: the `dbplyr`, `httr`, and `rvest` vignettes.
- **Office hours** / personal consultation (if provided) for in-depth assistance.


## 5. Course Policies & Expectations

- **Attendance**: We strongly recommend attending each 4-hour session live.
- **Participation**: We’ll do in-class exercises. Bring your laptop, follow the demos, and practice.
- **Collaboration**: You’re encouraged to help each other. For individual submissions, please submit your own work.
- **Respect & Inclusion**: Be mindful of diverse backgrounds and coding experience levels.
- **Data Ethics**: Keep sensitive data anonymized, and respect licensing or terms of use when scraping or using APIs.


## 6. Next Steps

1. **Confirm your environment**:
   - If using Binder, ensure it loads properly.
   - If local, verify that you can open a Jupyter notebook with an **R kernel**, or launch RStudio.
2. **Install/Update** packages locally if needed.
3. **Browse** the example files or bring your own CSV/Excel data if relevant.
4. **Review** the lesson outlines above to see how you might apply them to your final thesis.
5. **Come ready** to learn, experiment, and fix errors as they come up; that’s part of coding!

---

**Created by Petr Čala**

_Last updated: [2025-02-26]_
