ThisIsJohnnyLau/dirty_data_project

Overview

'80% of time in data science and analysis is spent on data cleaning'

This includes:
• Loading multiple sources of data
• Consolidating data for analysis
• Reshaping and joining datasets
• Dealing with missing values, duplicates and outliers
• Cleaning strings
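
As a rough illustration, a minimal tidyverse sketch of these steps (the file names and the columns `region` and `q1:q4` are hypothetical, not from the project datasets):

```r
library(tidyverse)
library(janitor)

# Loading multiple sources of data
sales_2020 <- read_csv("raw_data/sales_2020.csv")
sales_2021 <- read_csv("raw_data/sales_2021.csv")

# Consolidating data for analysis, standardising column names as we go
sales <- bind_rows(sales_2020, sales_2021) %>%
  clean_names()

# Reshaping from wide to long
sales_long <- sales %>%
  pivot_longer(cols = q1:q4, names_to = "quarter", values_to = "revenue")

# Dealing with missing values, duplicates and outliers
sales_clean <- sales_long %>%
  drop_na(revenue) %>%
  distinct() %>%
  filter(revenue >= 0)  # crude validity rule; real thresholds depend on the data

# Cleaning strings
sales_clean <- sales_clean %>%
  mutate(region = str_to_title(str_trim(region)))
```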

Why is data cleaning important?

• An estimated $3 trillion of US GDP was lost to poor-quality data in 2016 (IBM)
• 1 in 3 business leaders did not trust the data sources used in their decision-making

'Garbage in leads to garbage out'

Getting to know your data:

• Understanding the data through initial cleaning and exploration
• Reducing the risk of incorrect assumptions
• Raising relevant questions
• Discovering issues such as biases in data collection
• Creating opportunities to problem-solve for unique datasets
• Setting up to extract additional insight
• Setting up to emphasise particular questions

Project overview

Tasks

This project centres on cleaning six dirty datasets [Folder]:

• Task 1 - Decathlon Events [Analysis]
• Task 2 - Cake Ingredients [Analysis]
• Task 3 - Seabirds Spottings [Analysis]
• Task 4 - Sweeties Survey [Analysis]
• Task 5 - Right Wing Authoritarianism [Analysis]
• Task 6 - Dogs [Analysis]

Format

Each solution includes:
• Cleaning script
• Commentary, assumptions and process
• Answers to questions

Required libraries

• here
• janitor
• readr
• tidyverse
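
Each cleaning script attaches these at the top; a typical header might look like:

```r
library(here)       # project-relative file paths
library(janitor)    # clean_names() and other cleaning helpers
library(readr)      # reading/writing rectangular data (also attached by tidyverse)
library(tidyverse)  # dplyr, tidyr, stringr, purrr, ...
```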

Folder structure

• raw_data
• data_cleaning_scripts
• clean_data
• documentation_and_analysis
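
A hypothetical script skeleton showing how the folders fit together (the task file names are placeholders; `here()` resolves paths from the project root):

```r
library(tidyverse)
library(janitor)
library(here)

# Read from raw_data/ and standardise column names
decathlon <- read_csv(here("raw_data", "decathlon.csv")) %>%
  clean_names()

# ... task-specific cleaning ...

# Write the cleaned dataset to clean_data/ for documentation_and_analysis
write_csv(decathlon, here("clean_data", "decathlon_clean.csv"))
```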
