# Data Wrangling `Lesson01`

##### Student Tags

* Author : AH Uyekita
* Title  :  _Introduction to Data Wrangling_
* Date   : 18/12/2018
* Course : Data Science II - Foundations Nanodegree
    * COD    : ND111
    * **Instructor:** David Venturi
    * **Instructor:** Mat Leonard

***

# Summary

Data wrangling has roughly three steps:

* <a href="#gathering">Gathering</a>;
* <a href="#assessing">Assessing</a>, and;
* <a href="#cleaning">Cleaning</a>.

***

In this first lesson, we are introduced to the Jupyter Notebook.

The first task is to import files from Kaggle into the Jupyter Notebook.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# This is the package I have learnt in the class
import zipfile

Here, I use the `zipfile` package to unzip the file.

## 1. Data Gathering <a id='gathering'></a>

Gathering is the first step of data wrangling; it is also known as collecting or acquiring. The dataset used here, the Armenian Online Job Postings, has about 19,000 job postings from 2004 to 2015.

>Best Practice: Downloading Files Programmatically

The reasons are:

* Scalability: Automation saves time and prevents errors;
* Reproducibility: Key point to any research. Anyone could reproduce your work and check it.
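
As a minimal sketch of what "downloading programmatically" could look like, the helper below fetches a URL with the standard library and saves it to disk. The function name `download_file` and its parameters are hypothetical and not part of the lesson. Kaggle downloads normally require authentication, so this only illustrates the general idea for a publicly reachable file.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest_path):
    """Download `url` and save it at `dest_path` (hypothetical helper)."""
    dest = Path(dest_path)
    # Create the target folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return dest
```

Because the download is a function call rather than a manual click, anyone re-running the notebook repeats exactly the same step.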

In [2]:
# Open the zip archive as a ZipFile object
with zipfile.ZipFile('01-Dataset/armenian-online-job-postings.zip','r') as myzip:
    myzip.extractall('01-Dataset/')  # The argument inside of extractall is the path to save the extracted files.

Now the contents of `armenian-online-job-postings.zip` are extracted into the `01-Dataset` folder.
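
Before extracting, it can help to check what an archive contains. The sketch below uses a small throwaway archive built in a temporary folder (the file name `demo.csv` is made up for illustration) and lists its members with `namelist()` before calling `extractall()`.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / "demo.zip"

    # Build a tiny archive so the example does not depend on any external file.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("demo.csv", "a,b\n1,2\n")

    with zipfile.ZipFile(archive, "r") as zf:
        names = zf.namelist()      # file names stored inside the archive
        zf.extractall(tmp / "out") # extract everything into the 'out' folder

    extracted_ok = (tmp / "out" / "demo.csv").exists()

print(names, extracted_ok)
```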

Let's import this file.

### 1.1. Loading Data

In [3]:
# Loading the file extracted
df_job = pd.read_csv('01-Dataset/online-job-postings.csv')

After loading the data, let's start the Data Assessing.


## 2. Data Assessing <a id='assessing'></a>


Assessing is divided into two main aspects:

* Quality of the dataset
* Tidiness of the dataset

#### Quality

A low-quality dataset is a _dirty_ dataset: the problem is in the content of the data.

Common issues:

* Missing values
* Non-standard units (km, meters, inches, etc. all mixed)
* Inaccurate data, invalid data, inconsistent data, etc.

>One dataset may be high enough quality for one application but not for another.
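
A quick way to spot the first issue in the list, missing values, is to count them per column. The sketch below uses a small made-up DataFrame (the values are not taken from the job postings file) so it can run on its own.

```python
import pandas as pd

# Hypothetical toy data, not taken from the job postings file.
toy = pd.DataFrame({
    "Title": ["Accountant", None, "Driver"],
    "Salary": [None, None, "Competitive"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Share of missing values in each column, as a fraction between 0 and 1.
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```

Whether a given share of missing values is acceptable depends on the application, as the quote above points out.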

#### Tidiness

Untidy (or _messy_) data has problems with the structure of the dataset. In a tidy dataset:

* Each observation forms a row, and;
* Each variable/feature forms a column.

This is Hadley Wickham's definition of tidy data.
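
To make the definition concrete, here is a minimal sketch with a made-up "wide" table in which the variable `year` is hidden in the column headers. `DataFrame.melt` reshapes it so that each row is one observation (one company in one year). The table and column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical wide table: one column per year, so "year" is not a proper variable.
wide = pd.DataFrame({
    "company": ["A", "B"],
    "2004": [3, 1],
    "2005": [5, 2],
})

# Move the year columns into rows: one row per (company, year) observation.
tidy = wide.melt(id_vars="company", var_name="year", value_name="postings")

# The years came from column headers, so they are strings; convert them to integers.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```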

### Assessing the data

There are two ways to assess the data.

* Visual, and;
* Programmatic.

#### Visual Assessment

Visual assessment uses regular tools, such as charts, Excel, tables, etc. In other words, a human inspects the data directly.

#### Programmatic Assessment

Programmatic assessment uses automation to evaluate the dataset. It is scalable and lets you handle very large quantities of data.

Examples of programmatic assessment: analysing the data with `.info()`, `.head()`, `.describe()`, plotting with `.plot()`, etc.

In [4]:
# Shows the first 5 rows.
df_job.head()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


In [5]:
# Shows the last 5 rows.
df_job.tail()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
18996,Technolinguistics NGO\r\n\r\n\r\nTITLE: Senio...,"Dec 28, 2015",Senior Creative UX/ UI Designer,Technolinguistics NGO,,Full-time,,,,Long-term,...,Competitive,"To apply for this position, please send your\r...",29 December 2015,28 January 2016,,As a company Technolinguistics has a mandate t...,,2015,12,False
18997,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Category Development Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,...,,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18998,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Operational Marketing Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,...,,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18999,San Lazzaro LLC\r\n\r\n\r\nTITLE: Head of O...,"Dec 30, 2015",Head of Online Sales Department,San Lazzaro LLC,,,,,,Long-term,...,Highly competitive,Interested candidates can send their CVs to:\r...,30 December 2015,29 January 2016,,San Lazzaro LLC works with several internation...,,2015,12,False
19000,"""Kamurj"" UCO CJSC\r\n\r\n\r\nTITLE: Lawyer in...","Dec 30, 2015",Lawyer in Legal Department,"""Kamurj"" UCO CJSC",,Full-time,,,,Indefinite,...,,All qualified applicants are encouraged to\r\n...,30 December 2015,20 January 2016,,"""Kamurj"" UCO CJSC is providing micro and small...",,2015,12,False


In [6]:
# Shows general information about the variables (type of objects, quantity, etc.)
df_job.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
jobpost             19001 non-null object
date                19001 non-null object
Title               18973 non-null object
Company             18994 non-null object
AnnouncementCode    1208 non-null object
Term                7676 non-null object
Eligibility         4930 non-null object
Audience            640 non-null object
StartDate           9675 non-null object
Duration            10798 non-null object
Location            18969 non-null object
JobDescription      15109 non-null object
JobRequirment       16479 non-null object
RequiredQual        18517 non-null object
Salary              9622 non-null object
ApplicationP        18941 non-null object
OpeningDate         18295 non-null object
Deadline            18936 non-null object
Notes               2211 non-null object
AboutC              12470 non-null object
Attach              1559 non-null object
Year              

In [7]:
# Counts the number of job postings in each year.
df_job['Year'].value_counts()

2012    2149
2015    2009
2013    2009
2014    1983
2008    1785
2011    1697
2007    1538
2010    1511
2009    1191
2005    1138
2006    1116
2004     875
Name: Year, dtype: int64

_Obs.: Bear in mind that in this step we do not use "verbs" to describe errors or problems, because verbs describe actions, and actions belong to the next step (cleaning)._

## 3. Data Cleaning <a id='cleaning'></a>




Improving the quality of a dataset, or cleaning it, does not mean altering the data to say something different, which could amount to **data fraud**.

Cleaning means correcting or removing data:

* Correcting or removing inaccurate, wrong, or irrelevant data.
* Replacing or filling missing (NULL or NA) values.
* Combining/merging datasets.
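
The operations in the list above can be sketched with pandas as follows. Both tables are made up for illustration, and the placeholder text `"Not specified"` is an assumption, not a value used in the lesson.

```python
import pandas as pd

# Hypothetical toy tables, not taken from the job postings file.
postings = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Term": ["Full-time", None, "Part-time"],
})
companies = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "City": ["Yerevan", "Gyumri", "Yerevan"],
})

# Filling: replace missing values in 'Term' with an explicit placeholder.
filled = postings.fillna({"Term": "Not specified"})

# Removing: drop the rows where 'Term' is missing.
dropped = postings.dropna(subset=["Term"])

# Combining: add the city of each company with a left join on 'Company'.
merged = filled.merge(companies, on="Company", how="left")

print(merged)
```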

Improving tidiness means transforming the dataset so that:

* each observation = row
* each variable = column

There are two ways to clean the data: manually and programmatically.

#### Manually

To be avoided.

#### Programmatic

There are three steps:

1. Define
2. Code
3. Test

>**Defining** means defining a data cleaning plan in writing, where we turn our assessments into defined cleaning tasks. This plan will also serve as an instruction list so others (or us in the future) can look at our work and reproduce it.

>**Coding** means translating these definitions to code and executing that code.

>**Testing** means testing our dataset, often using code, to make sure our cleaning operations worked.

The definitions above are quoted from the class notes.
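
The three steps can be sketched on a tiny example. The one-row DataFrame below is made up and only borrows two of the abbreviated headers from the dataset. The "Test" step is written as plain `assert` statements, which is one possible way to check a cleaning operation.

```python
import pandas as pd

# Hypothetical one-row frame using two of the abbreviated headers from the dataset.
df = pd.DataFrame({"ApplicationP": ["Send CV"], "AboutC": ["A company"]})

# Define: replace the abbreviated headers with full words.
new_names = {"ApplicationP": "ApplicationProcedure", "AboutC": "AboutCompany"}

# Code: apply the renaming.
df = df.rename(columns=new_names)

# Test: the old names must be gone and the new names must be present.
for old, new in new_names.items():
    assert old not in df.columns
    assert new in df.columns
```

If an assertion fails, the cleaning step did not do what the definition asked for, and the code needs another look.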

In [8]:
# Copy of the original
df_clean = df_job.copy()

Fixing the column headers.

>- Select all nondescriptive and misspelled column headers (ApplicationP, AboutC, RequiredQual, JobRequirment) and replace them with full words (ApplicationProcedure, AboutCompany, RequiredQualifications, JobRequirements)

In [9]:
df_clean = df_clean.rename(columns={'ApplicationP': 'ApplicationProcedure',
                                    'AboutC': 'AboutCompany',
                                    'RequiredQual': 'RequiredQualifications',
                                    'JobRequirment': 'JobRequirements'})

Converting the records of `StartDate` to a standardized value.

In [10]:
asap_list = ['Immediately', 'As soon as possible', 'Upon hiring',
             'Immediate', 'Immediate employment', 'As soon as possible.', 'Immediate job opportunity',
             '"Immediate employment, after passing the interview."',
             'ASAP preferred', 'Employment contract signature date',
             'Immediate employment opportunity', 'Immidiately', 'ASA',
             'Asap', '"The position is open immediately but has a flexible start date depending on the candidates earliest availability."',
             'Immediately upon agreement', '20 November 2014 or ASAP',
             'immediately', 'Immediatelly',
             '"Immediately upon selection or no later than November 15, 2009."',
             'Immediate job opening', 'Immediate hiring', 'Upon selection',
             'As soon as practical', 'Immadiate', 'As soon as posible',
             'Immediately with 2 months probation period',
             '12 November 2012 or ASAP', 'Immediate employment after passing the interview',
             'Immediately/ upon agreement', '01 September 2014 or ASAP',
             'Immediately or as per agreement', 'as soon as possible',
             'As soon as Possible', 'in the nearest future', 'immediate',
             '01 April 2014 or ASAP', 'Immidiatly', 'Urgent',
             'Immediate or earliest possible', 'Immediate hire',
             'Earliest  possible', 'ASAP with 3 months probation period.',
             'Immediate employment opportunity.', 'Immediate employment.',
             'Immidietly', 'Imminent', 'September 2014 or ASAP', 'Imediately']

Using a `for` loop to iterate over the list and replace the many variants with a single simplified value.

In [11]:
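# Loop over asap_list and replace each variant with 'ASAP'.
# Assigning the result back avoids inplace=True on a column accessed as an attribute,
# which may not update df_clean in recent pandas versions.
for phrase in asap_list:
    df_clean['StartDate'] = df_clean['StartDate'].replace(phrase, 'ASAP')

df_clean.StartDate.value_counts().head()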

ASAP                 6856
01 September 2012      31
March 2006             27
November 2006          22
January 2010           19
Name: StartDate, dtype: int64
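
As an alternative sketch, `Series.replace` also accepts a list of values to replace, so the loop is optional. The example below uses a short made-up Series and a two-item stand-in for `asap_list`, and ends with a "Test" step checking that none of the variants remain.

```python
import pandas as pd

# Hypothetical sample of start dates; 'variants' is a short stand-in for asap_list.
start = pd.Series(["Immediately", "ASAP", "March 2006", "Asap"])
variants = ["Immediately", "Asap"]

# Replace every value found in 'variants' with 'ASAP' in a single call.
cleaned = start.replace(variants, "ASAP")

# Test: none of the old variants should remain in the cleaned column.
assert not cleaned.isin(variants).any()

print(cleaned.value_counts())
```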