This portfolio project demonstrates real-world data cleaning and exploratory data analysis (EDA) using MySQL.
The dataset represents global company layoffs from 2020β2023, including total layoffs, company stage, country, and industry.
The goal was to prepare raw, messy data for analysis, extract insights, and identify patterns in layoffs across industries and years.
Files:
17.Data Cleaning PROJECT PART 1.sql
17.Data Cleaning PROJECT PART 2.sql
- Created staging tables to protect the raw dataset while performing transformations.
- Removed duplicates using
ROW_NUMBER()
and Common Table Expressions (CTE). - Standardized text fields:
- Trimmed extra spaces in company names.
- Unified variations like βCrypto%β β βCryptoβ.
- Cleaned country names (βUnited States.β β βUnited Statesβ).
- Converted
date
column from text to properDATE
format usingSTR_TO_DATE()
. - Handled missing data:
- Filled missing industries using company matches.
- Removed rows with null values for both
total_laid_off
andpercentage_laid_off
.
- Dropped unnecessary columns after cleaning (e.g., helper columns for row numbers).
- Dataset became fully clean and ready for analysis.
- All text fields standardized, duplicates removed, and date formats corrected.
File:
18.MySQL Exploratory Data Analysis Full Project.sql
- Identify which companies and industries had the highest layoffs.
- Examine layoff trends by year, month, country, and funding stage.
- Find top 5 companies with most layoffs each year.
- Calculate rolling totals of layoffs over time.
GROUP BY
,ORDER BY
for aggregations.- Window functions (
DENSE_RANK()
,SUM() OVER
) for trend ranking. - Common Table Expressions (CTEs) for organizing intermediate results.
- The Tech industry had the highest total layoffs overall.
- Layoffs peaked between 2022β2023, with a strong correlation to economic slowdowns.
- The United States accounted for most layoffs globally.
- Companies in the Late-Stage and Post-IPO phases were most affected.
- MySQL Workbench / SQL Server Management Studio
- Data Cleaning, Standardization, and Validation
- Exploratory Data Analysis (EDA)
- CTEs, Window Functions, Aggregations
- Data Storytelling through SQL
WITH Company_Year AS (
SELECT company, YEAR(`date`) AS years, SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
GROUP BY company, YEAR(`date`)
),
Company_Year_Rank AS (
SELECT *,
DENSE_RANK() OVER (PARTITION BY years ORDER BY total_laid_off DESC) AS Ranking
FROM Company_Year
WHERE years IS NOT NULL
)
SELECT *
FROM Company_Year_Rank
WHERE Ranking <= 5;
---
## π§ What I Learned
How to clean real-world messy data using SQL.
Importance of standardizing fields before analysis.
How to summarize trends by company, country, and time period.