# Data Wrangling Intro

- Identify each step of the data wrangling process:

    - gathering
    
    - assessing
    
        - quality and structure
    
    - cleaning
    
        - before analysis, visualization, or ML for predictive analysis
    
    - we do this using a Kaggle dataset of online job postings
    
- Sometimes it's as simple as:

    - download the data, fix a few typos
    
    - sometimes the data isn't clean: it has missing records, duplicates, or inaccurate values
    
    - sometimes the data is fine but its structure makes it difficult to work with
    
    
- If we don't take care of these issues, we risk making mistakes, missing insights, and wasting time


- It is a core skill we will use every day in the data analysis process, because so much of the world's data isn't clean


- It is easy to do with code and will become instinctual

    - even though we focus on Python here, these concepts apply to any language or application (including Excel, SQL, Tableau, a data engineering stack...)
    
- To wrangle means to round up, herd, or take charge of livestock

    - think about this with sheep. A shepherd needs to get their sheep to graze, guide them to the market to get sheared, and put them in a barn to sleep
    
    - before any of this, the shepherd rounds them up into neat, organized groups
    
        - if they aren't, the tasks will take longer, some sheep may get lost, and a wolf could eat some of them
        
    - We can think of ourselves as shepherds of data. We must **organize before we act**
    
        - if we don't wrangle our data there could be consequences
        
        - If we analyze, visualize, or model our data before wrangling it, we can miss out on cool insights, make mistakes, or waste time
        
- conclusion: best practice says to wrangle!

## Gathering data

- Gathering is always the first step: before we gather the data we don't have it, and afterwards we do. It can also be thought of as collecting the data

- Steps may vary

    - could download a file and import it into Jupyter
    
    - could combine files and databases (usually what we do in the workplace)
    
    - could get it from an API or by scraping a website

## Downloading the data

- https://www.kaggle.com/udacity/armenian-online-job-postings

- Why it is interesting:

    - the online job market is a good indicator of overall demand for labor in an economy
    
    - the dataset contains 19,000 job postings from 2004 to 2015 posted on CareerCenter, an Armenian human resource portal
    
    - Postings are text documents that tend to have a similar structure, so text mining can be used to extract features like posting date, job title, company name, job description, salary, and more
    
- How to download:

    - we will download it programmatically because it is best practice
    
        - scalability: we can download from 1000s of websites rather than pointing and clicking, which would take forever
        
        - reproducibility: it allows people to reproduce your results, which is important in the scientific/research community. It adds legitimacy to your analysis, because if people can't reproduce it they won't take it very seriously
        
            - the dataset or website may change, so if you include the date it was downloaded, others can look for archived copies of the dataset or understand why their results are different
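
A minimal sketch of a programmatic download that also records where and when the file was fetched, using only the standard library. The function name `download_with_provenance`, the `.provenance.json` sidecar file, and its field names are hypothetical choices for illustration. Kaggle downloads generally require authentication, so fetching this particular dataset may need a different mechanism than a plain URL request.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlretrieve


def download_with_provenance(url: str, dest: Path) -> dict:
    """Download `url` to `dest` and write a small provenance record next to it."""
    urlretrieve(url, str(dest))  # fetch the file to the destination path
    record = {
        "source_url": url,
        # Recording the download date lets others look for an archived copy later
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        # A checksum makes it easy to tell whether the source has changed since
        "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
    }
    sidecar = dest.with_name(dest.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return record
```

Calling the function with a dataset URL and a destination path leaves both the data file and a small JSON record on disk, so the download step can be rerun and compared later.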

In [9]:
# This is how we would have extracted it from a zip file, but we didn't have to in my environment

# import zipfile
# with zipfile.ZipFile('armenian-online-job-postings.zip', 'r') as myzip:
#     myzip.extractall()

 example-job-posting.jpg  'Introduction Data Wrangling.ipynb'
 features.txt              online-job-postings.csv


## Gather

- csv files are comma-separated text files

- typically we get them from a spreadsheet app or a database export

- they use a comma as the delimiter

- other file formats to parse include html files, .txt files, and a lot more

- Pandas can import many common file formats and databases easily
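
To make the role of the delimiter concrete, here is a small sketch using the standard library's `csv` module. The two-column sample is built from values in this dataset (a title and a year), and the helper name `parse` is a hypothetical choice. With pandas, `pd.read_csv` takes a `sep` argument for the same purpose.

```python
import csv
import io

# One record using values that appear in the job postings dataset
comma_text = "Title,Year\nChief Financial Officer,2004\n"
tab_text = comma_text.replace(",", "\t")  # same data, tab as the delimiter


def parse(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows_comma = parse(comma_text, ",")
rows_tab = parse(tab_text, "\t")
print(rows_comma)
print(rows_comma == rows_tab)
```

Both calls produce the same records: the delimiter only decides where one field ends and the next begins.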

In [10]:
%ls

 example-job-posting.jpg  'Introduction Data Wrangling.ipynb'
 features.txt              online-job-postings.csv


In [11]:
import pandas as pd
df = pd.read_csv("online-job-postings.csv")

In [13]:
df.head(1)

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False


## Assess

- we have successfully downloaded and imported our dataset

- Now we need to determine what is clean and what else to gather if we are missing some data

    - we are not exploring our dataset; we are just making sure our data is in a form that makes it easy to analyze later on
    
- What are we assessing? What is dirty/messy data?

    - our data's quality and tidiness
    
        - low quality data is dirty
        
        - untidy data is messy

- End result: a list of observations like "Nondescriptive column headers"

    - E.g.:
    
        - Data quality issues:
        
            - nondescriptive column headers
            
            - missing values (NaN)
            
            - inconsistent representations
            
        - Tidiness issues (untidy data):
        
            - data that should be split into two tables
            
            - redundant data
            
- Iterative approach:

    - We don't have to fix every issue with our data from the beginning; we should only do what is necessary

Our dataset:
```csv
Name, Height
Jane, 55 inches
Juan,
Amalie, 145 centimetres
Kwasi, -50 inches
```

### Low Data Quality

- Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context

- Commonly referred to as dirty data

- it has issues with its content

- Common data quality issues:
    
    - missing data 
    
        - like Juan's height
    
    - invalid data 
    
        - like a negative value for height (Kwasi)
        
        - like the units (inches, centimeters) being stored in the height column, which makes it text rather than a number, i.e. the wrong data type
        
    - inaccurate data
    
        - Like Jane actually being 58 inches tall, not 55
        
    - inconsistent data
    
        - Like using different units for height (inches and centimeters)
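
The checks above can be sketched programmatically on the example dataset. This is a minimal illustration using the standard library; the regular expression and the issue labels are hypothetical choices. Note that inaccurate data (Jane being 58 inches rather than 55) cannot be detected this way, since it requires an outside source of truth.

```python
import csv
import io
import re

RAW = """Name, Height
Jane, 55 inches
Juan,
Amalie, 145 centimetres
Kwasi, -50 inches
"""

# skipinitialspace drops the blank that follows each comma
rows = list(csv.DictReader(io.StringIO(RAW), skipinitialspace=True))

# A number (optionally negative) followed by a unit word
HEIGHT = re.compile(r"^(-?\d+(?:\.\d+)?)\s*(\w+)$")

issues = []
units = set()
for row in rows:
    name, height = row["Name"], row["Height"].strip()
    if not height:
        issues.append((name, "missing"))
        continue
    match = HEIGHT.match(height)
    if match is None:
        issues.append((name, "unparseable"))
        continue
    value, unit = float(match.group(1)), match.group(2)
    units.add(unit)
    if value <= 0:
        issues.append((name, "invalid: non-positive height"))

# More than one unit in the same column is an inconsistency
if len(units) > 1:
    issues.append(("(all rows)", "inconsistent units: " + ", ".join(sorted(units))))

for issue in issues:
    print(issue)
```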

### Untidy data

- We may have to go through the normalization process to get the data into tidy format. Tidiness is closely connected to normalization (and to database design in general)

- Untidy data (linked to data denormalization) is sometimes useful. We should take a problem-first approach rather than a solution-backward approach.

    - this means that we pick a particular problem to solve, and choose the format that most directly solves it or makes it easier to solve, rather than the one that is theoretically optimal

- commonly referred to as messy data

- it has issues with its structure

- A dataset is messy or tidy depending on how rows, columns, and tables are matched up with observations, variables, and types

    - in tidy data:
    
        - each variable forms a column
        
        - each observation forms a row
        
        - each type of observational unit forms a table
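
As a small illustration of the tidy rules, the sketch below reshapes a wide table, where one variable (the year) is spread across column headers, into one row per observation. The companies and counts are made up for the example and do not come from the job postings dataset.

```python
# Untidy: the "year" variable lives in the column headers
wide = [
    {"company": "Company A", "2004": 3, "2005": 5},
    {"company": "Company B", "2004": 1, "2005": 2},
]

# Tidy: each variable (company, year, postings) is a column,
# and each (company, year) observation is its own row
tidy = [
    {"company": row["company"], "year": int(year), "postings": count}
    for row in wide
    for year, count in row.items()
    if year != "company"
]

for record in tidy:
    print(record)
```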

### How to assess quality and tidiness

- two styles:

    - visual
    
        - just looking at the data with our eyes to assess it
        
        - This is easiest to do in a spreadsheet application like Excel
        
        - we can do it with pandas, but it is usually easier in a spreadsheet application
        
            - But if the dataset is too large, the spreadsheet application may crash
            
        - Pandas is needed for large files, but scrolling through cells in pandas is neither easy nor fun
    
    - programmatic
    
        - Assessing the data programmatically means using a computer program to detect problems in our data
    
        - Pandas is well suited to this. E.g. the .info() method is great here
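
Beyond `.info()`, two other quick programmatic checks are counting missing values per column and counting duplicated rows. The sketch below runs them on a hypothetical three-row sample; the second company name is invented, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical sample: row 2 repeats row 0, and one Title is missing
sample = pd.DataFrame(
    {
        "Title": ["Chief Financial Officer", None, "Chief Financial Officer"],
        "Company": [
            "AMERIA Investment Consulting Company",
            "Hypothetical Company B",
            "AMERIA Investment Consulting Company",
        ],
    }
)

missing_per_column = sample.isna().sum()          # missing values in each column
duplicate_rows = int(sample.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```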
        
 

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   jobpost           19001 non-null  object
 1   date              19001 non-null  object
 2   Title             18973 non-null  object
 3   Company           18994 non-null  object
 4   AnnouncementCode  1208 non-null   object
 5   Term              7676 non-null   object
 6   Eligibility       4930 non-null   object
 7   Audience          640 non-null    object
 8   StartDate         9675 non-null   object
 9   Duration          10798 non-null  object
 10  Location          18969 non-null  object
 11  JobDescription    15109 non-null  object
 12  JobRequirment     16479 non-null  object
 13  RequiredQual      18517 non-null  object
 14  Salary            9622 non-null   object
 15  ApplicationP      18941 non-null  object
 16  OpeningDate       18295 non-null  object
 17  Deadline    

## Clean

- ALWAYS make a copy of the data before cleaning!

- Typically the most code-heavy part of the data wrangling process

- Acting on the assessments to improve quality and tidiness (the list we created in the previous step is the input to this step)

- Changing the data so that it says something different is data fraud, not data cleaning

- Instead, we correct data when it is inaccurate, remove it when it is wrong or irrelevant, replace or fill in missing values, and merge, e.g. when we want to combine datasets that were previously split up

- Data cleaning is inefficient, error-prone, and demoralizing when done by hand in text editors or spreadsheet applications

    - unless it is a one-off issue, like changing a data type
    
- Programmatic data cleaning is better for repetitive, reproducible tasks. It can be broken down into three steps

    - Define
    
        - We turn our assessments into data cleaning tasks. These also serve as an instruction list, so others (or we in the future) can look at our work and reproduce it
    
    - Code
    
        - We translate those definitions into code and run that code
    
    - Test
    
        - We run tests to make sure the cleaning code worked


### Define, Code, Test

- Define what we want to include, make instructions

- This is turning the list from the assess phase into pseudocode for cleaning tasks

In [17]:
# Start date inconsistency

    # Find all the unique ways StartDate is listed
    
    # Choose one way to represent them
    
    # Write code for that

# Nondescriptive column headers

    # Find what all the columns mean
        
    # Create a map from the current columns to renamed columns (underscores and all
        # lowercase with descriptive names)
        
    # Replace the column names
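
A rough sketch of how the start date task could go from definition to code to test. The `StartDate` values and the set of "urgent" spellings below are hypothetical stand-ins; the real column would need to be inspected first to find its actual variants.

```python
from collections import Counter

# Hypothetical StartDate values; the real column holds many more variants
start_dates = ["ASAP", "Immediately", "As soon as possible", "01 March 2004", "asap", None]

# Define: list the unique ways the value is written
variants = Counter(v for v in start_dates if v is not None)
print(variants)

# Code: map every urgent spelling to one chosen representation, "ASAP"
URGENT = {"asap", "immediately", "as soon as possible"}


def standardize(value):
    """Return "ASAP" for any urgent spelling, otherwise the value unchanged."""
    if value is None:
        return None
    return "ASAP" if value.strip().lower() in URGENT else value


cleaned = [standardize(v) for v in start_dates]

# Test: no urgent spelling other than "ASAP" remains
assert all(v is None or v == "ASAP" or v.strip().lower() not in URGENT for v in cleaned)
```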

In [18]:
### Code / Test

# We just code it up and write a unit test below it

# A unit test here is the same thing as data validation in the Think Stats book. We just find values that
    # we know are correct and write a unit test for them
    
# Typically an `assert` statement would work well here
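
For the nondescriptive column headers task, the define, code, and test steps might look like the sketch below. The one-row frame is hypothetical, the three column names are taken from the dataset, and the meanings assigned to them (for example `ApplicationP` as an application procedure) are assumptions.

```python
import pandas as pd

# Hypothetical one-row frame with three of the dataset's column names
df = pd.DataFrame(
    {
        "Title": ["Chief Financial Officer"],
        "JobRequirment": ["(requirements text)"],
        "ApplicationP": ["(application text)"],
    }
)

# Define: replace nondescriptive headers with lowercase, underscore-separated names.
# The meanings chosen here are assumptions, not taken from the dataset's documentation.
column_map = {
    "Title": "title",
    "JobRequirment": "job_requirements",
    "ApplicationP": "application_procedure",
}

# Code: work on a copy so the original frame stays untouched
df_clean = df.copy().rename(columns=column_map)

# Test: the new names are present, none of the old ones remain,
# and the original frame is unchanged
assert list(df_clean.columns) == list(column_map.values())
assert not set(column_map) & set(df_clean.columns)
assert list(df.columns) == list(column_map)
```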

## Reassess and Iterate Post Cleaning

- This comes after gathering, assessing, and cleaning the data

- We always reassess after this process to see if we are happy with the quality and tidiness of the data

    - If we are, we can end the data wrangling process and store our data, or analyze, visualize, or model it
    
    - Sometimes we need to return to previous steps as we make more discoveries

### Store Data (Optional)

- We can store the data in a file or database if we need to use it in the future
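
A minimal sketch of storing cleaned data in a file, using the standard library's `csv` module. The row contents, the lowercase column names, and the output file name are hypothetical. With a pandas DataFrame, `df.to_csv(path, index=False)` plays the same role.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical cleaned rows; the column names are illustrative
cleaned_rows = [
    {"title": "Chief Financial Officer", "year": 2004, "month": 1},
]

# Write the cleaned data to a CSV file in a temporary directory
out_path = Path(tempfile.mkdtemp()) / "online-job-postings-clean.csv"
with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "year", "month"])
    writer.writeheader()
    writer.writerows(cleaned_rows)

# Read it back to confirm the round trip (CSV values come back as strings)
with out_path.open(newline="", encoding="utf-8") as f:
    restored = list(csv.DictReader(f))

print(restored)
```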

## Wrangling Vs Exploratory Data Analysis Vs Extract Transform Load

- EDA: an analysis approach that focuses on identifying general patterns in the data, and identifying outliers and features of that data that might not have been anticipated

    - We explore our data to later augment it to maximize the potential of our analyses, visualizations, and models. 
    
    - We will usually use simple visualizations to summarize the main characteristics of the data
    
    - We can do things like remove outliers and create new and more descriptive features from data (feature engineering)
    
    
- Wrangling: gathering the right pieces of data, assessing data quality and tidiness, then modifying and cleaning the data. It won't make the analysis, viz, or model better; it just makes it possible

- ETL: users are different, data is different, use cases are different

## Afterwards

- We will do analysis and cleaning

- For example, we will make a bar chart of the PMF (probability mass function) of urgent vs not urgent start dates

    - We can see that we went from 75% to 49% if we didn't wrangle the data beforehand
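
To show why wrangling can change such a result, here is a small sketch that computes the urgent vs not urgent PMF on the same values before and after standardizing the urgent spellings. The five `StartDate` values are hypothetical, so the proportions below are illustrative only and are not the figures quoted above.

```python
from collections import Counter

# Hypothetical StartDate values, before and after cleaning
raw = ["ASAP", "Immediately", "As soon as possible", "01 March 2004", "15 April 2004"]
cleaned = ["ASAP", "ASAP", "ASAP", "01 March 2004", "15 April 2004"]


def pmf_urgent(values):
    """Return the PMF over the two categories 'urgent' and 'not urgent'."""
    counts = Counter("urgent" if v == "ASAP" else "not urgent" for v in values)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("urgent", "not urgent")}


pmf_raw = pmf_urgent(raw)          # only the exact string "ASAP" counts as urgent
pmf_cleaned = pmf_urgent(cleaned)  # all urgent spellings were standardized first

print("before wrangling:", pmf_raw)
print("after wrangling: ", pmf_cleaned)
```

Counting only the exact string "ASAP" understates the urgent share in the raw values, which is the kind of distortion that wrangling first is meant to avoid.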