# Data Cleaning and Visualization in R: A Beginner's Tutorial

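This Jupyter Notebook demonstrates how to clean a large dataset and create basic visualizations in R. We will use the `nycflights13::flights` dataset as an example; it records every flight that departed the three New York City airports (EWR, JFK, and LGA) in 2013, with 336,776 rows and 19 columns. The tutorial covers: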

- Handling missing values (e.g., using `na.omit()`, `ifelse()`, or `replace_na()` to deal with NAs)
- Removing duplicate rows (using `dplyr::distinct()`)
- Fixing formatting issues (converting column types and renaming columns for clarity)

After cleaning the data, we'll create visualizations with **ggplot2** (scatter plot, histogram, boxplot, and a line chart) to explore the data.


In [None]:
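# Install necessary packages if they're not already installed
required_packages <- c("nycflights13", "dplyr", "tidyr", "ggplot2")
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}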


# Load the necessary libraries
library(nycflights13)
library(dplyr)
library(tidyr)
library(ggplot2)

# Load the flights dataset
data("flights")  # flights data from nycflights13

# Check the size of the data (rows and columns)
dim(flights)

# Preview the first few rows
head(flights)

## Handling Missing Values

Let's identify missing values in the dataset and apply techniques to handle them. In R, missing data is represented by `NA`. We will:
- Count the number of missing values in key columns (departure and arrival delays).
- Remove rows with missing data using `na.omit()`.
- Fill missing values with a placeholder using `replace_na()`.

In [None]:
# Count missing values in departure and arrival delay columns
sum(is.na(flights$dep_delay))  # number of NAs in departure delays
sum(is.na(flights$arr_delay))  # number of NAs in arrival delays

# Remove rows with any missing values
flights_no_na <- na.omit(flights)
dim(flights_no_na)  # dimensions after dropping NAs
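
Before dropping rows, it can help to see which columns actually contain the `NA`s, because `na.omit()` removes a row if *any* column is missing. The following is a minimal sketch on a small made-up data frame (`toy_flights` and its values are assumptions for illustration, not part of `nycflights13`); the same two calls should work unchanged on `flights`.

```r
# Toy data standing in for `flights` (values are made up for illustration)
toy_flights <- data.frame(
  dep_delay = c(5, NA, -3, 12),
  arr_delay = c(10, NA, NA, 8),
  carrier   = c("AA", "UA", "DL", "AA")
)

# is.na() on a data frame gives a logical matrix; colSums() counts TRUEs per column
na_counts <- colSums(is.na(toy_flights))
na_counts

# na.omit() keeps only the rows that have no NA in any column
toy_complete <- na.omit(toy_flights)
nrow(toy_complete)
```

Comparing `na_counts` with the row count after `na.omit()` shows how much data a blanket removal costs.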

In [None]:
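# Replace missing delay values with 0 (for demonstration only: this assumes that
# missing means no delay, but in this dataset a missing dep_delay goes with a
# missing dep_time, i.e. a canceled flight)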
flights_filled <- flights %>%
  replace_na(list(dep_delay = 0, arr_delay = 0))

# Check that missing values in delays are handled
sum(is.na(flights_filled$dep_delay))  # should be 0 after replacement
sum(is.na(flights_filled$arr_delay))  # should also be 0
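
The introduction also mentions `ifelse()` as a way to deal with `NA`s. As a rough sketch, the base R version below fills missing delays the same way as `replace_na()` but first records which values were missing, so filled zeros can be told apart from real ones later. The data frame `toy_delays` and the column names `dep_delay_missing` and `dep_delay_filled` are hypothetical names chosen for this example.

```r
# Toy data standing in for the delay column (values are made up)
toy_delays <- data.frame(dep_delay = c(5, NA, -3, 12))

# Flag the rows whose delay was missing before any filling happens
toy_delays$dep_delay_missing <- is.na(toy_delays$dep_delay)

# ifelse(test, yes, no): use 0 where the delay is NA, otherwise keep the value
toy_delays$dep_delay_filled <- ifelse(
  is.na(toy_delays$dep_delay),
  0,
  toy_delays$dep_delay
)

toy_delays
```

Keeping the flag column is a design choice: it makes the imputation reversible and lets later analysis exclude the filled rows.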


## Removing Duplicates

Duplicate rows can skew analysis. `distinct()` from dplyr removes any duplicate flight entries, so comparing the row count before and after calling it shows whether duplicates exist: if the two numbers match, there are none.

In [None]:
# Compare number of rows before and after using distinct()
nrow(flights)           # original number of rows
nrow(distinct(flights)) # number of rows after removing duplicates
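
When the two counts differ, it is usually worth looking at the duplicated rows before discarding them. A small base R sketch of that check follows, assuming a made-up data frame `toy_flights` with one repeated row; `duplicated()` and `unique()` are used here as base R counterparts to `distinct()`.

```r
# Toy data with one exact duplicate row (values are made up)
toy_flights <- data.frame(
  flight    = c(101, 102, 102, 103),
  origin    = c("JFK", "LGA", "LGA", "EWR"),
  dep_delay = c(5, -2, -2, 30)
)

# duplicated() marks every row that repeats an earlier row
n_duplicates <- sum(duplicated(toy_flights))
n_duplicates

# Inspect the repeated rows before removing them
duplicate_rows <- toy_flights[duplicated(toy_flights), ]
duplicate_rows

# unique() keeps the first occurrence of each row
toy_unique <- unique(toy_flights)
nrow(toy_unique)
```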

## Formatting and Consistency

Now we'll clean up column names and data types for consistency and clarity. We will:
- Remove canceled flights (where `dep_time` is NA).
- Add airline names by joining with the `airlines` dataset.
- Rename columns for clarity (e.g., change `carrier` to `carrier_code` and `name` to `airline_name`).
- Convert columns such as `origin`, `dest`, and `carrier_code` to factors.

In [None]:
# Remove canceled flights (where dep_time is NA) to focus on flights that actually flew
flights_clean <- flights %>%
  filter(!is.na(dep_time)) %>%
  # Add airline names by joining with the airlines dataset on carrier code
  left_join(airlines, by = "carrier") %>%
  # Rename columns for clarity
  rename(airline_name = name, carrier_code = carrier) %>%
  # Convert columns to factors for consistency
  mutate(
    origin = factor(origin),
    dest = factor(dest),
    carrier_code = factor(carrier_code)
  )

# Preview the cleaned data
head(flights_clean)
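
A `left_join()` keeps every flight even when its carrier code has no match, leaving `NA` in the joined name column, so a quick check after the join can catch silent gaps. The sketch below repeats the join, rename, and factor steps on two made-up tables (`toy_flights` and `toy_airlines`, including the unmatched code `"ZZ"`, are assumptions for illustration) and then counts the unmatched rows.

```r
library(dplyr)

# Toy stand-ins for `flights` and `airlines` (values are made up)
toy_flights <- data.frame(
  carrier   = c("AA", "UA", "ZZ"),
  dep_delay = c(5, -2, 30)
)
toy_airlines <- data.frame(
  carrier = c("AA", "UA"),
  name    = c("American Airlines Inc.", "United Air Lines Inc.")
)

toy_joined <- toy_flights %>%
  left_join(toy_airlines, by = "carrier") %>%
  rename(airline_name = name, carrier_code = carrier) %>%
  mutate(carrier_code = factor(carrier_code))

# Rows whose carrier code had no match end up with an NA airline name
n_unmatched <- sum(is.na(toy_joined$airline_name))
n_unmatched

# Confirm the type conversion took effect
is.factor(toy_joined$carrier_code)
```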

## Data Visualization

With the cleaned data (`flights_clean`), we can now create several plots to explore it. We'll build:
- A scatter plot of departure delay vs. arrival delay.
- A histogram of departure delays.
- A boxplot of arrival delays by origin airport.
- A line chart showing the number of flights per month.

### Scatter Plot: Departure Delay vs Arrival Delay

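This scatter plot shows the relationship between departure delays and arrival delays. Each point represents a flight, and we expect a positive trend (flights that depart late tend to arrive late). Some flights in `flights_clean` departed but have no recorded arrival delay; ggplot2 drops those rows from the plot and prints a warning about the removed values.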

In [None]:
# Scatter plot of departure delay vs arrival delay
ggplot(flights_clean, aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure vs. Arrival Delay",
       x = "Departure Delay (minutes)",
       y = "Arrival Delay (minutes)")
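
To put a number on the trend the scatter plot suggests, one option is the Pearson correlation between the two delay columns. This is a minimal sketch on made-up vectors (`toy_dep` and `toy_arr` are hypothetical); with the real data the same call would take `flights_clean$dep_delay` and `flights_clean$arr_delay`.

```r
# Toy delays in minutes, including missing values (made up for illustration)
toy_dep <- c(0, 10, 20, NA, 40)
toy_arr <- c(-5, 8, NA, 30, 35)

# use = "complete.obs" drops any pair in which either value is NA
delay_cor <- cor(toy_dep, toy_arr, use = "complete.obs")
delay_cor
```

A value close to 1 would indicate that late departures and late arrivals move together almost linearly.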

### Histogram: Distribution of Departure Delays

This histogram shows how departure delays are distributed across all flights. Most flights cluster near zero, including early departures with negative delays, while a long right tail holds the few flights that left hours late.

In [None]:
# Histogram of departure delays
ggplot(flights_clean, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Departure Delays",
       x = "Departure Delay (minutes)",
       y = "Number of Flights")
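
A few summary numbers can back up what the histogram shows. The sketch below, on a made-up vector `toy_dep_delay`, computes selected quantiles and the share of flights that left on time or early; the chosen quantile levels are an arbitrary assumption.

```r
# Toy departure delays in minutes (made up for illustration)
toy_dep_delay <- c(-5, -2, 0, 3, 10, 45, 180)

# Quartiles plus the 95th percentile to describe the long right tail
delay_quantiles <- quantile(
  toy_dep_delay,
  probs = c(0.25, 0.5, 0.75, 0.95),
  na.rm = TRUE
)
delay_quantiles

# Share of flights that departed on time or early (delay <= 0)
share_on_time <- mean(toy_dep_delay <= 0, na.rm = TRUE)
share_on_time
```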

### Boxplot: Arrival Delays by Origin Airport

This boxplot compares the distribution of arrival delays for each of the three NYC-area origin airports (EWR, JFK, and LGA). It helps identify which airport has higher typical delays or more variability.


In [None]:
# Boxplot of arrival delays by origin airport
ggplot(flights_clean, aes(x = origin, y = arr_delay)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Arrival Delays by Origin Airport",
       x = "Origin Airport",
       y = "Arrival Delay (minutes)")
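
The quantities a boxplot draws can also be computed directly, which makes the comparison between airports easier to read off. This is a rough sketch with dplyr on a made-up data frame (`toy_arrivals` and the summary column names are hypothetical); replacing it with `flights_clean` should give the per-airport figures for the real data.

```r
library(dplyr)

# Toy arrival delays for two origins (values are made up)
toy_arrivals <- data.frame(
  origin    = c("EWR", "EWR", "EWR", "JFK", "JFK", "JFK"),
  arr_delay = c(0, 10, 20, -5, 5, NA)
)

# Median (the box's center line) and IQR (the box's height) per airport
delay_summary <- toy_arrivals %>%
  group_by(origin) %>%
  summarise(
    median_delay = median(arr_delay, na.rm = TRUE),
    iqr_delay    = IQR(arr_delay, na.rm = TRUE),
    n_missing    = sum(is.na(arr_delay)),
    .groups = "drop"
  )

delay_summary
```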


### Line Chart: Number of Flights per Month

This line chart shows the trend of flight frequency over the months of the year. We group flights by month and count them to see seasonal variations in flight traffic.

In [None]:
# Line chart of number of flights each month
flights_per_month <- flights_clean %>%
  count(month)

ggplot(flights_per_month, aes(x = month, y = n)) +
  geom_line(color = "blue") +
  geom_point() +
  scale_x_continuous(breaks = 1:12) +  # one tick per month instead of fractional breaks
  labs(title = "Flights per Month (2013)",
       x = "Month",
       y = "Number of Flights")
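
Month numbers on the x-axis can be swapped for abbreviated names by passing R's built-in `month.abb` vector as labels to `scale_x_continuous()`. The sketch below uses a made-up table of monthly counts (`toy_counts` and its numbers are assumptions, not the real 2013 totals).

```r
library(ggplot2)

# Toy monthly counts (made up; not the real flight totals)
toy_counts <- data.frame(
  month = 1:12,
  n     = c(100, 90, 110, 105, 112, 108, 120, 118, 104, 109, 98, 102)
)

# breaks = 1:12 puts a tick on every month; labels = month.abb names them
monthly_plot <- ggplot(toy_counts, aes(x = month, y = n)) +
  geom_line(color = "blue") +
  geom_point() +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  labs(title = "Flights per Month (toy data)",
       x = "Month",
       y = "Number of Flights")

monthly_plot
```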

## Conclusion

In this notebook, we cleaned a large dataset and then visualized it to uncover patterns. We:
- Handled missing values by either removing or filling them.
- Checked for duplicate rows (none were found in this dataset).
- Renamed columns and adjusted data types for clarity.
- Created various plots (scatter, histogram, boxplot, and line chart) to explore relationships, distributions, and trends.

These steps form a solid foundation for data cleaning and analysis. Happy analyzing!