Fausford/Descriptive_Statistics

Descriptive analysis of numeric and categorical data

Overview

This repository demonstrates a complete, reproducible workflow for descriptive statistics and exploratory data analysis (EDA) on a synthetic medical dataset. It uses an R Markdown report to load data, assess structure and missingness, summarize variables, and visualize numerical and categorical patterns. A small utility script (descriptive.R) adds helper functions for streamlined numeric summaries and pairwise plots.

Repository Contents

  • Descriptive_Statistics.Rmd

Main analysis report (R Markdown) titled “Descriptive Statistics”.

  • descriptive.R

Helper functions referenced by the R Markdown (e.g., numeric handlers and plotting utilities).

Data Source

Synthetic medical dataset loaded from a public URL.

Key Packages

tidyverse — Data wrangling and plotting

mlbench — Example datasets and utilities

DataExplorer — Automated EDA (missingness, histograms, correlations, boxplots)

skimr — Compact data summaries

psych — Descriptive statistics for numeric variables

knitr — Chunk options and report rendering (via R Markdown)

Data

Dataset: medical_synthetic.csv (downloaded directly in the report)

Contents: Demographics (age, sex, race), vitals, labs (e.g., glucose, creatinine), and derived indicators suitable for basic descriptive analysis.

What the Report Does

  1. Project Setup

Loads libraries and sources descriptive.R.

Sets chunk options for reproducibility and clean output.
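A setup chunk along these lines would cover both steps. This is only a sketch: the specific chunk options and the assumption that descriptive.R sits next to the report are illustrative, not taken from the report itself.

```r
# Packages listed under "Key Packages"; attach only those that are installed.
pkgs <- c("tidyverse", "mlbench", "DataExplorer", "skimr", "psych")
for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) {
    suppressPackageStartupMessages(library(p, character.only = TRUE))
  } else {
    message("Package not installed, skipping: ", p)
  }
}

# Assumed chunk defaults: show code, hide messages and warnings in the output.
chunk_defaults <- list(echo = TRUE, message = FALSE, warning = FALSE)
if (requireNamespace("knitr", quietly = TRUE)) {
  do.call(knitr::opts_chunk$set, chunk_defaults)
}

# Assumed location of the helper script: same folder as the report.
if (file.exists("descriptive.R")) {
  source("descriptive.R")
}
```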

  2. Data Import & Structure

Downloads the medical dataset from a public URL.

Prints structure and a compact overview (types, ranges, examples).
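The import and structure check might look roughly like the sketch below. The actual URL is not given in this README, so `data_url` is a hypothetical placeholder, and a small made-up data frame with the column names mentioned above stands in for the real file.

```r
# Hypothetical placeholder: set this to the public URL used in the report.
data_url <- NA_character_

if (!is.na(data_url)) {
  med <- read.csv(data_url, stringsAsFactors = FALSE)
} else {
  # Toy stand-in with made-up values, for illustration only.
  med <- data.frame(
    age        = c(34, 51, 67, 45, 29, 72),
    sex        = c("F", "M", "F", "M", "F", "M"),
    race       = c("Group A", "Group B", "Group A", "Group C", "Group B", "Group A"),
    glucose    = c(92, 110, NA, 135, 88, NA),
    creatinine = c(0.8, 1.1, 1.3, 0.9, NA, 1.6)
  )
}

str(med)             # column types and first values
print(summary(med))  # ranges for numeric columns, lengths for character columns
print(head(med, 3))  # a few example rows
```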

  3. Missingness & Summary

Missingness map: Visualizes proportion and distribution of NAs.

skim() summary: Variable types, completeness, and distribution summaries.

psych::describe(): Descriptive statistics for numeric columns.
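As a rough illustration of this step, the share of missing values per column can be computed in base R, with `skimr::skim()` and `psych::describe()` called when those packages are available. The data frame below is a made-up stand-in, not the real dataset.

```r
# Toy stand-in with made-up values.
med <- data.frame(
  age        = c(34, 51, 67, 45, 29, 72),
  glucose    = c(92, 110, NA, 135, 88, NA),
  creatinine = c(0.8, 1.1, 1.3, 0.9, NA, 1.6),
  sex        = c("F", "M", "F", "M", "F", "M")
)

# Proportion of NAs per column (the numbers behind a missingness plot).
na_share <- colMeans(is.na(med))
print(round(na_share, 2))

# Package summaries named in the README, run only if installed.
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(med))
}
if (requireNamespace("psych", quietly = TRUE)) {
  numeric_cols <- med[vapply(med, is.numeric, logical(1))]
  print(psych::describe(numeric_cols))
}
```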

  4. Numerical Data Exploration

Histograms for continuous variables.

[Figure: histogram]

Boxplots stratified by sex and by race.

Scatter plot for selected numeric features.

[Figure: scatter plot]

Numeric subset extraction for focused analysis.

Helper utilities

handle_numeric() — Standardized numeric summaries.

plot_numeric() — Pairwise numeric plots for selected variables.
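The implementations of `handle_numeric()` and `plot_numeric()` live in descriptive.R and are not reproduced here. The sketch below is a hypothetical stand-in using base R only: `summarise_numeric()` is an illustrative name, the data are made up, and the plots are written to a temporary PDF so the code runs without a display.

```r
# Toy stand-in with made-up values.
med <- data.frame(
  age        = c(34, 51, 67, 45, 29, 72),
  sex        = c("F", "M", "F", "M", "F", "M"),
  glucose    = c(92, 110, NA, 135, 88, NA),
  creatinine = c(0.8, 1.1, 1.3, 0.9, NA, 1.6)
)

# Numeric subset extraction: keep only numeric columns.
num <- med[vapply(med, is.numeric, logical(1))]

# Hypothetical helper: one row of NA-safe summary statistics per variable.
summarise_numeric <- function(x) {
  c(n       = sum(!is.na(x)),
    missing = sum(is.na(x)),
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    min     = min(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE))
}
num_summary <- t(vapply(num, summarise_numeric, numeric(7)))
print(round(num_summary, 2))

# Histogram, boxplot by sex, and pairwise scatter plots in one PDF.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
hist(med$age, main = "Age", xlab = "age")
boxplot(glucose ~ sex, data = med, main = "Glucose by sex")
pairs(num, main = "Pairwise numeric plots")
invisible(dev.off())
```

Returning a named vector from the helper keeps the combined result a plain matrix with one row per variable, which prints compactly in a report.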

  5. Categorical Data Exploration

Frequency tables for race and sex (ordered factor for race).

Clean display of counts for quick inspection.
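A minimal sketch of the frequency tables, assuming made-up category labels and an arbitrary level order for race (the report's actual levels may differ):

```r
# Toy stand-in with made-up values.
med <- data.frame(
  sex  = c("F", "M", "F", "M", "F", "M"),
  race = c("Group A", "Group B", "Group A", "Group C", "Group B", "Group A")
)

# Assumed level order; an ordered factor fixes the row order of the table.
race_levels <- c("Group A", "Group B", "Group C")
med$race <- factor(med$race, levels = race_levels, ordered = TRUE)
med$sex  <- factor(med$sex)

# Counts per category; useNA = "ifany" adds an NA row only when NAs exist.
race_counts <- table(med$race, useNA = "ifany")
sex_counts  <- table(med$sex, useNA = "ifany")

# Data-frame form for a clean display in the report.
print(as.data.frame(race_counts, responseName = "n"))
print(as.data.frame(sex_counts, responseName = "n"))
```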

  6. Numeric × Categorical Summaries

Grouped means of age and glucose by sex (with NA-safe handling for glucose).

Simple cross-tabulation of sex × race.
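One way to get these summaries in base R is shown below, on made-up data. The report may well use tidyverse verbs instead, so treat this as an equivalent sketch rather than the report's code.

```r
# Toy stand-in with made-up values.
med <- data.frame(
  age     = c(34, 51, 67, 45, 29, 72),
  sex     = c("F", "M", "F", "M", "F", "M"),
  race    = c("Group A", "Group B", "Group A", "Group C", "Group B", "Group A"),
  glucose = c(92, 110, NA, 135, 88, NA)
)

# Grouped means. na.action = na.pass keeps rows with a missing glucose value,
# and na.rm = TRUE then drops the NA inside each mean, so age still uses all rows.
by_sex <- aggregate(cbind(age, glucose) ~ sex, data = med,
                    FUN = mean, na.rm = TRUE, na.action = na.pass)
print(by_sex)

# Cross-tabulation of sex by race.
sex_by_race <- table(sex = med$sex, race = med$race)
print(sex_by_race)
```

Without `na.action = na.pass`, `aggregate()` would drop every row that has a missing glucose value before grouping, which would also change the age means.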

Typical Outputs

Missingness plot (overview of NAs)

Skim summary and psych descriptives (tabular)

Histograms (numeric distributions)

Boxplots by sex and by race (group comparisons)

Correlation heatmap (numeric relationships)

Pairwise numeric plot for selected variables (e.g., age, creatinine, glucose)

Frequency tables (race, sex)

Grouped means (e.g., mean age/glucose by sex)

How to Use

Open the R Markdown file in RStudio (or your preferred editor).

Ensure required packages are installed.

Knit/render the report to HTML to reproduce the tables and figures.

The report references descriptive.R. Keep this file in the expected path (as referenced in the YAML/script) to ensure helper functions are available.
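From an R console, the same steps can be sketched as follows. The rmarkdown package is an assumption here (it is what RStudio's Knit button uses but is not listed under Key Packages), and the report is assumed to be in the working directory.

```r
# Packages named in this README, plus rmarkdown (assumed) for rendering.
needed <- c("tidyverse", "mlbench", "DataExplorer", "skimr", "psych",
            "knitr", "rmarkdown")
to_install <- setdiff(needed, rownames(installed.packages()))
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "),
          " (install with install.packages())")
}

# Render to HTML only when everything needed is in place.
report <- "Descriptive_Statistics.Rmd"
can_render <- length(to_install) == 0 && file.exists(report)
if (can_render) {
  out <- rmarkdown::render(report, output_format = "html_document")
  message("Report written to: ", out)
} else {
  message("Skipping render: check packages and the path to ", report)
}
```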

Goals

Provide a clear template for descriptive statistics on tabular medical data.

Standardize numeric and categorical summaries for quick reporting.

Produce publication-ready figures and tables via automated EDA tools.

Notes

If you relocate files or change folder names, update paths in the report header or at the top of the document.

For larger datasets, consider chunk-wise processing or saving intermediate outputs in a dedicated folder.
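Saving intermediate outputs could be done with `saveRDS()` and `readRDS()`, as in the sketch below. The folder and file names are hypothetical, and a temporary directory is used so that running the example leaves nothing in the project.

```r
# Hypothetical output folder; in a real project this might sit next to the report.
out_dir <- file.path(tempdir(), "intermediate")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
cache_file <- file.path(out_dir, "numeric_means.rds")

if (file.exists(cache_file)) {
  # Reuse the saved result instead of recomputing it.
  numeric_means <- readRDS(cache_file)
} else {
  # Toy stand-in with made-up values for an expensive computation.
  med <- data.frame(
    age     = c(34, 51, 67, 45, 29, 72),
    glucose = c(92, 110, NA, 135, 88, NA)
  )
  numeric_means <- vapply(med, mean, numeric(1), na.rm = TRUE)
  saveRDS(numeric_means, cache_file)
}
print(numeric_means)
```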
