# **Data Wrangling Project 🧽✨📊**

# 1. Gather 👩🏽‍💻

In [1]:
""" TO DO
- download zip file (from Kaggle or TheBrittinator git repository)
- unzip the file using Python
- import and extract CSV file into pandas DataFrame
"""


# import the zipfile and pandas packages
import zipfile
import pandas as pd

In [3]:
# read zip file and extract all documents
with zipfile.ZipFile('archive.zip', 'r') as myzip:
    myzip.extractall()
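
Before extracting, it can help to look at what the archive actually contains. The following is a minimal sketch, not part of the original notebook: it builds a tiny in-memory archive (a hypothetical stand-in for `archive.zip`) so that it runs on its own, then lists the member names with `namelist()`.

```python
import io
import zipfile

# Build a tiny in-memory archive so this sketch runs without 'archive.zip'.
# The file content below is made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as demo_zip:
    demo_zip.writestr('online-job-postings.csv', 'Title,Company\nAnalyst,Acme\n')

# List the archive's members before extracting anything.
with zipfile.ZipFile(buffer, 'r') as myzip:
    member_names = myzip.namelist()
    csv_names = [name for name in member_names if name.endswith('.csv')]

print(csv_names)
```

With the real file, replacing `buffer` with `'archive.zip'` should show the CSV name to pass to `pd.read_csv`.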

In [4]:
# read csv file into a pandas dataFrame

df = pd.read_csv('online-job-postings.csv')

In [5]:
# check your work - using the .head method will only display the first 5 rows.

df.head()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


# 2. Assess 🤔

### **What are we assessing?**

#### Not exploring our data just yet. First, let's assess our data's:
1. Quality
2. Tidiness

# Quality
### Common Data Quality Issues:
- Missing data
- Invalid data
- Inaccurate data
- Inconsistent data

*Identify whether your dataset checks any of these boxes.*
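
Each of the issue types listed above can usually be checked with a one-line pandas call. The sketch below uses a small made-up DataFrame (`toy`, hypothetical data, not the job postings file), and the 2015 cutoff for a valid year is an assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data that shows each issue type on a small scale.
toy = pd.DataFrame({
    'Title': ['Analyst', None, 'Developer', 'Developer'],
    'StartDate': ['ASAP', 'Immediately', 'asap', 'asap'],
    'Year': [2004, 2005, 2040, 2040],
})

missing_per_column = toy.isna().sum()        # missing data: NaN count per column
duplicate_rows = toy.duplicated().sum()      # rows that repeat an earlier row exactly
future_years = (toy['Year'] > 2015).sum()    # invalid data: years past an assumed cutoff
start_variants = toy['StartDate'].nunique()  # inconsistent data: distinct labels in use

print(missing_per_column)
print(duplicate_rows, future_years, start_variants)
```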

# Tidiness
### Untidy data is also referred to as 'messy data.' This kind of data has issues with its *structure:*
Your data is **TIDY** if:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
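
To make the three rules concrete, here is a minimal sketch using a made-up table (hypothetical data, not from the job postings file). The "messy" version stores the variable Year in its column headers; `DataFrame.melt` reshapes it so that each variable forms a column and each observation forms a row.

```python
import pandas as pd

# Hypothetical messy table: the variable "Year" is spread across column headers.
messy = pd.DataFrame({
    'Company': ['Acme', 'Globex'],
    '2004': [3, 5],
    '2005': [4, 1],
})

# Reshape so that Company, Year, and Postings each get their own column.
tidy = messy.melt(id_vars='Company', var_name='Year', value_name='Postings')

print(tidy)
```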

# How to Assess Data
### Two ways to do it:
1. Visually - with your eyes 👀
2. Programmatic Assessment - using Pandas 🐼


In [None]:
""" TO DO
- Assess your dataset programmatically using pandas
- identify data quality issues
- confirm if your dataset is 'tidy' or 'messy.'
"""

# check programmatically! Here are 4 methods we can use to do this
# Display the first five rows of the DataFrame using .head
df.head()

In [None]:
# Display the last five rows of the DataFrame using .tail
df.tail()

In [8]:
# Display a basic summary of the DataFrame using .info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   jobpost           19001 non-null  object
 1   date              19001 non-null  object
 2   Title             18973 non-null  object
 3   Company           18994 non-null  object
 4   AnnouncementCode  1208 non-null   object
 5   Term              7676 non-null   object
 6   Eligibility       4930 non-null   object
 7   Audience          640 non-null    object
 8   StartDate         9675 non-null   object
 9   Duration          10798 non-null  object
 10  Location          18969 non-null  object
 11  JobDescription    15109 non-null  object
 12  JobRequirment     16479 non-null  object
 13  RequiredQual      18517 non-null  object
 14  Salary            9622 non-null   object
 15  ApplicationP      18941 non-null  object
 16  OpeningDate       18295 non-null  object
 17  Deadline    

In [None]:
# Display the entry counts for the Year column using .value_counts
df["Year"].value_counts()
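
By default `.value_counts` skips missing values, which can hide a quality issue. The sketch below, on a made-up Series (hypothetical data), shows two optional parameters that may be useful during assessment: `dropna=False` to count NaN as its own entry, and `normalize=True` to get shares instead of raw counts.

```python
import pandas as pd

# Hypothetical stand-in for a column with a missing value.
s = pd.Series(['ASAP', 'ASAP', None, 'Immediately'])

counts = s.value_counts(dropna=False)   # NaN gets its own row in the result
shares = s.value_counts(normalize=True) # fractions of the non-missing entries

print(counts)
print(shares)
```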

In [None]:
# now check visually by reviewing your dataframe
df

## What Did We Find?

1. Quality 🏆
    - Missing Records: lots of NaN values for some columns
    - Multiple terms that mean the same thing (in the 'StartDate' column we found 'ASAP', 'Immediately', 'As soon as possible')
    - Nondescriptive column headers such as 'RequiredQual' and misspelled ones such as 'JobRequirment'
    
2. Tidiness 🏚️
    - does each variable have its own column? ❌
    - does each observation have its own row? ✅
    - is each type of observational unit a table? ❌
 * The date column and the Year + Month columns represent the same information, which is duplication. A Day column would need to be added so that the entire date is separated into columns.
 * There are two types of observational data: job posting and company data. These can be separated into two tables.
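
Although tidiness is out of scope for the cleaning step, the two findings above could be addressed roughly as follows. This is only a sketch on a made-up DataFrame: the date strings follow the format seen in the `date` column, while the company names and descriptions are hypothetical. With the full dataset some rows may not parse, so `errors='coerce'` might be needed.

```python
import pandas as pd

# Hypothetical toy rows; only the date format mirrors the real 'date' column.
toy = pd.DataFrame({
    'date': ['Jan 5, 2004', 'Jan 7, 2004', 'Jan 10, 2004'],
    'Title': ['Analyst', 'Coordinator', 'Developer'],
    'Company': ['Acme', 'Globex', 'Globex'],
    'AboutC': ['Makes anvils', 'Makes widgets', 'Makes widgets'],
})

# Finding 1: derive Year, Month, and Day from the single date string.
parsed = pd.to_datetime(toy['date'], format='%b %d, %Y')
toy['Year'] = parsed.dt.year
toy['Month'] = parsed.dt.month
toy['Day'] = parsed.dt.day

# Finding 2: split the two observational units into separate tables.
companies = toy[['Company', 'AboutC']].drop_duplicates().reset_index(drop=True)
postings = toy.drop(columns=['AboutC'])

print(postings)
print(companies)
```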
    
#### **For this introductory lesson, we will focus on Data Quality in our cleaning step, NOT Tidiness.**

# 3. Clean 🧽✨

The programmatic Data Cleaning Process:
1. Define - write out your data cleaning plan.
2. Code - translate these definitions into code and execute it.
3. Test - testing our dataset, using code, to make sure our cleaning operations worked.

### Define

    - Select all records in the StartDate column that have "As soon as possible", "Immediately", etc. and replace them with "ASAP"
    
    - Select all nondescriptive and misspelled column headers (ApplicationP, AboutC, RequiredQual, JobRequirment) and replace them with full words (ApplicationProcedure, AboutCompany, RequiredQualifications, JobRequirements)
    

### Code

In [12]:
""" TO DO
- copy your dataFrame for cleaning
- rename your column headers 
    - https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas
- find and replace redundancies in StartDate column with "ASAP"
"""

# copy our DataFrame into a new df - let's keep an original just in case.
df_clean = df.copy()

In [17]:
# rename column headers - see docstring for resources
df_clean = df_clean.rename(columns={'ApplicationP':'ApplicationProcedure',
                                    'AboutC':'AboutCompany','RequiredQual':'RequiredQualifications',
                                    'JobRequirment':'JobRequirements'})

In [21]:
# find and replace: locate our unique values
df_clean.StartDate.value_counts()

ASAP                              4754
Immediately                        773
As soon as possible                543
Upon hiring                        261
Immediate                          259
                                  ... 
Flexible                             1
11 April 2010                        1
ASAP starting 10 February 2006       1
07 April 2010                        1
15 March 2009                        1
Name: StartDate, Length: 1186, dtype: int64

In [23]:
# find and replace: place all ASAP-equivalent values into a list, then replace each of them with 'ASAP'.
# .replace structure: https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html
# you can expand your value_counts and do the digging on your own, or just use this code:

asap_list = ['Immediately', 'As soon as possible', 'Upon hiring',
             'Immediate', 'Immediate employment', 'As soon as possible.', 'Immediate job opportunity',
             '"Immediate employment, after passing the interview."',
             'ASAP preferred', 'Employment contract signature date',
             'Immediate employment opportunity', 'Immidiately', 'ASA',
             'Asap', '"The position is open immediately but has a flexible start date depending on the candidates earliest availability."',
             'Immediately upon agreement', '20 November 2014 or ASAP',
             'immediately', 'Immediatelly',
             '"Immediately upon selection or no later than November 15, 2009."',
             'Immediate job opening', 'Immediate hiring', 'Upon selection',
             'As soon as practical', 'Immadiate', 'As soon as posible',
             'Immediately with 2 months probation period',
             '12 November 2012 or ASAP', 'Immediate employment after passing the interview',
             'Immediately/ upon agreement', '01 September 2014 or ASAP',
             'Immediately or as per agreement', 'as soon as possible',
             'As soon as Possible', 'in the nearest future', 'immediate',
             '01 April 2014 or ASAP', 'Immidiatly', 'Urgent',
             'Immediate or earliest possible', 'Immediate hire',
             'Earliest  possible', 'ASAP with 3 months probation period.',
             'Immediate employment opportunity.', 'Immediate employment.',
             'Immidietly', 'Imminent', 'September 2014 or ASAP', 'Imediately']

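# Series.replace accepts a list, so no explicit loop is needed.
# Assign the result back instead of using inplace=True on a column accessor.
df_clean['StartDate'] = df_clean['StartDate'].replace(asap_list, 'ASAP')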
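
Many of the variants in the list differ only in capitalization, surrounding whitespace, or a trailing period. As an alternative sketch (not the approach used in this notebook), those differences can be normalized before matching, which keeps the lookup set much shorter. The Series and the `asap_words` set below are hypothetical examples.

```python
import pandas as pd

# Hypothetical stand-in for the StartDate column.
start = pd.Series(['ASAP', 'Immediately', ' as soon as possible. ',
                   'Asap', '15 March 2009', None])

# Normalize whitespace, a trailing period, and case before comparing.
normalised = start.str.strip().str.rstrip('.').str.lower()

# A short, assumed set of normalized phrases that all mean "ASAP".
asap_words = {'asap', 'immediately', 'immediate', 'as soon as possible'}
is_asap = normalised.isin(asap_words)

# Keep the original value unless it matched, in which case use 'ASAP'.
cleaned = start.where(~is_asap, 'ASAP')

print(cleaned)
```

This does not catch misspellings such as 'Immidiately', so an explicit list is still needed for those.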

### Test 👩🏽‍🔬

In [None]:
# check our work!
df_clean.info()

In [None]:
# check your work!
df_clean.StartDate.value_counts()
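
Reading `.info()` and `.value_counts()` output by eye works, but the same checks can also be written as assertions that fail loudly if a cleaning step was missed. The sketch below runs on a small made-up frame (`df_clean_demo` and `asap_list_demo` are hypothetical stand-ins for `df_clean` and `asap_list`).

```python
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
df_clean_demo = pd.DataFrame({
    'StartDate': ['ASAP', 'ASAP', '15 March 2009'],
    'ApplicationProcedure': ['Send CV', 'Send CV', 'Apply online'],
    'AboutCompany': [None, 'NGO', None],
    'RequiredQualifications': ['BA', 'MA', 'BA'],
    'JobRequirements': ['Reporting', 'Travel', 'Coding'],
})
asap_list_demo = ['Immediately', 'As soon as possible']
old_headers = ['ApplicationP', 'AboutC', 'RequiredQual', 'JobRequirment']

# Test 1: none of the old column headers should remain.
leftover_headers = [col for col in old_headers if col in df_clean_demo.columns]
assert not leftover_headers, f'headers not renamed: {leftover_headers}'

# Test 2: no StartDate entry should still use a replaced phrase.
leftover_phrases = df_clean_demo['StartDate'].isin(asap_list_demo).sum()
assert leftover_phrases == 0, f'{leftover_phrases} rows still need replacing'

print('all cleaning checks passed')
```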