## Cleaning

- Third and final step

- we implement the list of issues from the assess step

    - we will fix the quality and tidiness issues identified there

- best done in three steps:

    - define
    
        - in words/pseudocode
    
    - code
    
    - test

## Manual Vs Programmatic Cleaning

- Manual: done in a spreadsheet application or a text editor; it looks like data entry

    - inefficient and error prone

    - should be avoided except for one-off occurrences

- Programmatic

    - we can automate cleaning tasks and minimize repetition
    
- Data wrangling takes up a lot of time anyway, so automating it is the better choice

## The Process

- Step 0: make a copy of the data

    - we always want to be able to see the messy raw data
    
    - `df_clean = df.copy()`
    
- Step 1: Define

    - define each data cleaning task in writing; we convert our assessments into cleaning tasks
    
    - this will later become documentation so others can repeat it
    
    - we use verbs here (we didn't in the assess stage) and describe the method we will use

        - e.g.:
        
            - remove `bb` before every animal name using string slicing
    
- Step 2: Code

    - translate the definition into code, working on the copy

    - e.g. `df_clean['Animal'] = df_clean['Animal'].str[2:]`
    
- Step 3: Test

    - we can write tests to ensure our cleaning does what we want

    - we can use:

        - visual assessment

        - assert statements

        - pandas built-in testing methods
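
The define, code, test loop above can be sketched end to end. This is a minimal illustration, not a fixed recipe: the animal names are made-up sample data, and only the `Animal` column and the `bb` prefix come from the notes.

```python
import pandas as pd

# Hypothetical raw data: every animal name carries a stray "bb" prefix.
df = pd.DataFrame({"Animal": ["bbAndean flamingo", "bbBison", "bbCheetah"]})

# Step 0: work on a copy so the messy raw data stays available.
df_clean = df.copy()

# Step 1 (Define): remove "bb" before every animal name using string slicing.

# Step 2 (Code)
df_clean["Animal"] = df_clean["Animal"].str[2:]

# Step 3 (Test): an assert statement plus a pandas testing helper.
assert not df_clean["Animal"].str.startswith("bb").any()
expected = pd.Series(["Andean flamingo", "Bison", "Cheetah"], name="Animal")
pd.testing.assert_series_equal(df_clean["Animal"], expected)

# The raw frame is untouched, so we can always look back at it.
assert df["Animal"].str.startswith("bb").all()
```

Keeping the Define step as a comment next to the code means the written definition doubles as documentation.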

## Missing Data

- it is usually best to address missing data first, if we can

    - completeness issues (data quality)
    
    - if we don't address this first we may have to redo work from previous steps
    
- We can use imputation (filling in missing values using an appropriate method, such as the column mean or median)
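
As a rough sketch of finding and imputing missing data with pandas: the column names and values below are invented for illustration, and the column mean is just one possible fill method, not necessarily the right one for a given dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing weight.
df = pd.DataFrame({
    "animal": ["bison", "cheetah", "llama"],
    "weight_kg": [900.0, np.nan, 130.0],
})
df_clean = df.copy()

# Find missing data first: count the nulls in each column.
missing_counts = df_clean.isnull().sum()

# Impute: fill the missing weight with the mean of the observed weights.
mean_weight = df_clean["weight_kg"].mean()  # NaN is skipped by default
df_clean["weight_kg"] = df_clean["weight_kg"].fillna(mean_weight)

# Test: nothing is missing any more.
assert df_clean["weight_kg"].notnull().all()
```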

## Cleaning For Tidiness

- After addressing missing data, the next step is usually to fix data tidiness issues

- A huge amount of effort is spent on cleaning data for analysis, but there has not been much research on how to make data cleaning as easy and effective as possible

    - tidy data is easier to manipulate when we want to fix data quality issues

    - this is why it is smart to fix data structure issues (tidiness) first, then worry about data quality
    
- Using regex with pandas `str.extract` is very useful here

- We can also use SQL join-like operations (`merge`) and the `melt` function
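
A small sketch of the three tools mentioned above (`str.extract`, `melt`, and a SQL-style `merge`). The table, column names, and lookup values are all hypothetical, chosen only to show one column holding two variables and column headers that are really values.

```python
import pandas as pd

# Hypothetical untidy table: name and phone packed into one column,
# and one column per treatment instead of one row per observation.
df = pd.DataFrame({
    "contact": ["Ann Lee 555-0101", "Bo Chen 555-0199"],
    "drug_a": [10, 20],
    "drug_b": [30, 40],
})
df_clean = df.copy()

# 1. Split the packed column with a regex; named groups become column names.
parts = df_clean["contact"].str.extract(r"^(?P<name>\D+) (?P<phone>\d{3}-\d{4})$")
df_clean = pd.concat([parts, df_clean.drop(columns="contact")], axis=1)

# 2. Melt the per-treatment columns into one row per (person, treatment).
tidy = df_clean.melt(
    id_vars=["name", "phone"],
    value_vars=["drug_a", "drug_b"],
    var_name="treatment",
    value_name="dose",
)

# 3. SQL-style left join against a small lookup table.
units = pd.DataFrame({"treatment": ["drug_a", "drug_b"], "unit": ["mg", "ml"]})
tidy = tidy.merge(units, on="treatment", how="left")

print(tidy)
```

After these steps each variable has its own column and each row is a single observation, which makes later quality fixes simpler to write.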

### Regular Expressions

- Everything is a character, and we write a pattern to match a specific sequence of characters

    - patterns can match standard ASCII or any Unicode text

- we can match several phrases with one pattern:

    - abcdef, abcd, abc can all be matched with the letters `abc`

    - abc12, bbb124bbb, sss22ff can be matched with `\d\d`
    
    - wildcard: `.` matches any single character (except a newline, by default); to match a literal period we escape it as `\.`

    - we can match a set of characters with `[]`: `[abc]an` matches aan, ban, can

    - we can put `^` right after the opening bracket to exclude those characters

        - `[^b]og` will match hog, dog, but not bog

    - we can choose a range of characters with `[A-C]`; `[A-Za-z0-9_]` covers the alphanumeric characters plus underscore

    - we can use brace notation: `z{2}` for z repeated twice, `z{1,4}` for at least 1 and at most 4 times

        - can combine with brackets: `[A-Z]{1,3}`
       
    - Kleene star, Kleene plus:

        - aacc, aabbbbbc, aaaabccc can be matched with `a+b*c+`, where `*` is 0 or more and `+` is 1 or more

        - these can be combined with bracket notation, as in `[^A-X]+`
       
    - optional:

        - `?` matches 0 or 1 of the preceding character

        - 1 file, 2 files, 24 files can be represented by `[12]\d? files?`
       
    - whitespace:

        - we can handle any kind of whitespace with a pattern like `\d.[ \t\n\r]+abc`
      
    - start and end of line:

        - `^Mission: successful$` uses the start (`^`) and end (`$`) anchors to match the beginning and end of the line

    - `^(file.+)\.pdf$` defines a capture group, so we capture the name (without the extension) of any PDF whose name starts with file

    - `(.{3} (\d{4}))` captures two nested groups; for Jan 1987:

        - 1987 (the year) from the inner group

        - Jan 1987 (month and year) from the outer group

    - or conditional (alternation): `I love (cats|dogs)` matches either phrase
  
- Other metacharacters:

    - `\w` is alphanumeric (plus underscore), `\W` is non-alphanumeric

    - `\d` is digit, `\D` is non-digit

    - `\s` is whitespace, `\S` is non-whitespace

    - `\b` is a word boundary (the start or end of a word)
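
Several of the patterns above can be checked with Python's built-in `re` module. This is only a sketch: the patterns are taken from the notes, while the test strings around them (such as `file_record_transcript.pdf` and the "concatenate" sentence) are made up for illustration.

```python
import re

# Quantifiers: + is one or more, * is zero or more.
pattern = re.compile(r"a+b*c+")
assert all(pattern.fullmatch(s) for s in ["aacc", "aabbbbbc", "aaaabccc"])
assert pattern.fullmatch("aabb") is None  # no trailing c

# Negated set: [^b]og matches "hog" and "dog" but not "bog".
assert re.fullmatch(r"[^b]og", "hog")
assert re.fullmatch(r"[^b]og", "dog")
assert re.fullmatch(r"[^b]og", "bog") is None

# Optional characters: ? makes the preceding token optional.
files = re.compile(r"[12]\d? files?")
assert all(files.fullmatch(s) for s in ["1 file", "2 files", "24 files"])

# Capture group: keep the file name without the .pdf extension.
stem = re.search(r"^(file.+)\.pdf$", "file_record_transcript.pdf").group(1)

# Nested groups are numbered by the position of their opening parenthesis.
m = re.search(r"(.{3} (\d{4}))", "Jan 1987")
month_year, year = m.group(1), m.group(2)

# Alternation: only the cats and dogs sentences match.
loves = re.findall(r"I love (cats|dogs)", "I love cats. I love dogs. I love birds.")

# \b is a word boundary, so the "cat" inside "concatenate" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")
```

`fullmatch` is used where the whole string should fit the pattern, and `search` or `findall` where the pattern may sit inside a longer string.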

## Last, We Can Fix Data Quality Issues

- After structure and missing data are taken care of, this is just addressing simpler things like data types,