# Data Wrangling

## 1. Gather

In [1]:
# Unzip the file
import zipfile

In [3]:
with zipfile.ZipFile('armenian-online-job-postings.zip', 'r') as myzip:
    myzip.extractall()

In [1]:
# Importing pandas
import pandas as pd

# Jupyter Notebook display settings
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_rows', 300)

In [2]:
df = pd.read_csv('online-job-postings.csv')

## 2. Assess

#### Quality

**Common data quality problems include:**

- Missing data, such as the absence of a height value for Juan.
- Invalid data, such as a cell with an impossible value, for example a negative height for Kwasi. Having "inches" and "centimeters" in the height entries is also technically invalid, because the column's datatype becomes a string when they are present. The datatype for height should be integer or float.
- Inaccurate data, such as Jane being recorded as 58 inches tall instead of 55 inches.
- Inconsistent data, such as the use of different units for height (inches and centimeters).
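
As a rough illustration of how three of these problems can be detected, here is a minimal sketch on a toy table. The `heights` frame and its values are made up for this example and are not part of the job postings dataset. Inaccurate data (Jane's case) cannot be caught this way, because it requires an external source of truth.

```python
import pandas as pd

# Hypothetical toy table mirroring the examples above; the values are made up.
heights = pd.DataFrame({
    'name': ['Juan', 'Kwasi', 'Jane', 'Ana'],
    'height': [None, '-70 in', '58 in', '165 cm'],
})

# Missing data: count null entries in the column.
n_missing = heights['height'].isna().sum()

# Inconsistent data: extract the unit suffix and list the distinct units.
units = heights['height'].str.extract(r'([a-z]+)$')[0].dropna().unique().tolist()

# Invalid data: parse the numeric part and flag impossible (negative) values.
values = pd.to_numeric(heights['height'].str.extract(r'(-?\d+)')[0])
invalid_names = heights.loc[values < 0, 'name'].tolist()

print(n_missing, units, invalid_names)
```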

We will cover more tips and tricks for identifying and categorizing data quality problems in the third lesson of the course.

Data quality is a perception or an assessment of how fit the data is to serve its purpose in a given context. Unfortunately, that is a somewhat evasive definition, but it makes an important point: there are no hard and fast rules for data quality. A dataset may be of high enough quality for one application but not for another.


#### Tidiness

Tidy data has three properties:

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

Untidy data is commonly referred to as "messy" data. Messy data has problems with its structure.
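
A small sketch of what fixing a structural problem can look like. The `messy` table is hypothetical: it stores the variable "year" in its column headers, which breaks the "each variable forms a column" rule. `DataFrame.melt` is one way to reshape it.

```python
import pandas as pd

# Hypothetical messy table: the variable "year" is spread across column headers.
messy = pd.DataFrame({
    'company': ['A', 'B'],
    '2014': [10, 3],
    '2015': [12, 5],
})

# melt turns the year columns into a 'year' variable and a 'postings' variable,
# so each row is a single (company, year) observation.
tidy = messy.melt(id_vars='company', var_name='year', value_name='postings')
tidy['year'] = tidy['year'].astype(int)

print(tidy)
```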

Tidy data is a relatively new concept, coined by statistician, professor, and all-round data expert Hadley Wickham. Here is a quote from his excellent paper on the subject:

"It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. Data preparation is not just a first step, but must be repeated many times over the course of analysis as new problems come to light or new data is collected. To get a handle on the problem, this paper focuses on a small, but important, aspect of data cleaning that I call data tidying: structuring datasets to facilitate analysis."



### Quality Issues

- Missing Values (NaN)
- StartDate inconsistencies
- Nondescriptive column header names (AboutC, ApplicationP, RequiredQual) and a misspelled one (JobRequirment)
- A lot of ways to say the same thing on: Term, Location, Salary...
- Wrong datatypes
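
Several of these issues can also be surfaced programmatically. The sketch below uses a small stand-in frame named `sample` (an assumption for illustration: only the column names come from the dataset, and the `Term` spellings are invented). On the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Small stand-in for `df`; rows are illustrative, column names follow the dataset.
sample = pd.DataFrame({
    'date': ['Jan 5, 2004', 'Jan 7, 2004', 'Dec 30, 2015'],
    'StartDate': ['ASAP', 'Immediately', None],
    'Term': ['Full-time', 'Full time', None],
})

# Missing values per column.
missing = sample.isna().sum()

# Wrong datatype: 'date' is stored as text; parsing shows it can be a datetime.
parsed = pd.to_datetime(sample['date'], format='%b %d, %Y')

# Many spellings of one value: value_counts lists every distinct variant.
term_variants = sample['Term'].value_counts()

print(missing, parsed.dtype, term_variants, sep='\n')
```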


### Tidiness Issues

- Redundant columns: Year and Month repeat information already contained in date
- Two observational units: Job posting data and Company Data
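
One possible way to confirm and address both tidiness issues is sketched below. The `sample` frame, the company names, and the `AboutC` texts are placeholders invented for this example. The split into `postings` and `companies` is just one reasonable design, not a step taken in this notebook.

```python
import pandas as pd

# Illustrative rows; column names follow the dataset, values are placeholders.
sample = pd.DataFrame({
    'date': ['Jan 5, 2004', 'Dec 30, 2015', 'Dec 30, 2015'],
    'Year': [2004, 2015, 2015],
    'Month': [1, 12, 12],
    'Title': ['Job 1', 'Job 2', 'Job 3'],
    'Company': ['Company A', 'Company B', 'Company B'],
    'AboutC': ['About A', 'About B', 'About B'],
})

# Redundant columns: Year and Month can be derived from the parsed date.
parsed = pd.to_datetime(sample['date'], format='%b %d, %Y')
year_redundant = (parsed.dt.year == sample['Year']).all()
month_redundant = (parsed.dt.month == sample['Month']).all()

# Two observational units: company attributes move to their own table,
# one row per company, while postings keep 'Company' as the link.
companies = sample[['Company', 'AboutC']].drop_duplicates().reset_index(drop=True)
postings = sample.drop(columns=['AboutC', 'Year', 'Month'])

print(year_redundant, month_redundant)
print(companies)
```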

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
jobpost             19001 non-null object
date                19001 non-null object
Title               18973 non-null object
Company             18994 non-null object
AnnouncementCode    1208 non-null object
Term                7676 non-null object
Eligibility         4930 non-null object
Audience            640 non-null object
StartDate           9675 non-null object
Duration            10798 non-null object
Location            18969 non-null object
JobDescription      15109 non-null object
JobRequirment       16479 non-null object
RequiredQual        18517 non-null object
Salary              9622 non-null object
ApplicationP        18941 non-null object
OpeningDate         18295 non-null object
Deadline            18936 non-null object
Notes               2211 non-null object
AboutC              12470 non-null object
Attach              1559 non-null object
Year              

In [4]:
df.tail()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,Location,JobDescription,JobRequirment,RequiredQual,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
18996,Technolinguistics NGO\r\n\r\n\r\nTITLE: Senio...,"Dec 28, 2015",Senior Creative UX/ UI Designer,Technolinguistics NGO,,Full-time,,,,Long-term,"Yerevan, Armenia",A tech startup of Technolinguistics based in N...,- Work closely with product and business teams...,- At least 5 years of experience in Interface/...,Competitive,"To apply for this position, please send your\r...",29 December 2015,28 January 2016,,As a company Technolinguistics has a mandate t...,,2015,12,False
18997,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Category Development Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,"Yerevan, Armenia",,- Establish and manage Category Management dev...,"- University degree, ideally business related;...",,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18998,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Operational Marketing Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,"Yerevan, Armenia",,"- Develop, establish and maintain marketing st...","- Degree in Business, Marketing or a related f...",,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18999,San Lazzaro LLC\r\n\r\n\r\nTITLE: Head of O...,"Dec 30, 2015",Head of Online Sales Department,San Lazzaro LLC,,,,,,Long-term,"Yerevan, Armenia",San Lazzaro LLC is looking for a well-experien...,- Handle the project activites of the online s...,- At least 1 year of experience in online sale...,Highly competitive,Interested candidates can send their CVs to:\r...,30 December 2015,29 January 2016,,San Lazzaro LLC works with several internation...,,2015,12,False
19000,"""Kamurj"" UCO CJSC\r\n\r\n\r\nTITLE: Lawyer in...","Dec 30, 2015",Lawyer in Legal Department,"""Kamurj"" UCO CJSC",,Full-time,,,,Indefinite,"Yerevan, Armenia","""Kamurj"" UCO CJSC is looking for a Lawyer in L...",- Properly provide internal legal services of ...,- Higher legal education; Master's degree is a...,,All qualified applicants are encouraged to\r\n...,30 December 2015,20 January 2016,,"""Kamurj"" UCO CJSC is providing micro and small...",,2015,12,False


In [5]:
df.describe()

Unnamed: 0,Year,Month
count,19001.0,19001.0
mean,2010.274722,6.493869
std,3.315609,3.405503
min,2004.0,1.0
25%,2008.0,3.0
50%,2011.0,7.0
75%,2013.0,9.0
max,2015.0,12.0


## 3. Clean

In [6]:
df_clean = df.copy()

### Define
- Select all records in the StartDate column that have "as soon as possible", "Immediately", etc. and replace the text in those cells with "ASAP"
- Select all nondescriptive and misspelled column headers (ApplicationP, AboutC, RequiredQual, JobRequirment) and replace them with full words (ApplicationProcedure, AboutCompany, RequiredQualifications and JobRequirement)

In [7]:
df_clean['StartDate'].value_counts()

ASAP                                                                                                                                                         4754
Immediately                                                                                                                                                   773
As soon as possible                                                                                                                                           543
Upon hiring                                                                                                                                                   261
Immediate                                                                                                                                                     259
Immediate employment                                                                                                                                          140
As soon as possible.        

### Code

In [8]:
# Replace every "as soon as possible" variant with 'ASAP'
asap_list = ['Immediately', 'As soon as possible', 'Immediate', 'Immediate employment', 'As soon as possible.',
             'Immediate job opportunity', 'Immediate employment, after passing the interview.', 'Immediate employment opportunity',
             'Upon hiring', 'ASAP preferred', '01 March 2005 or earlier if possible', 'Immediately or as per agreement',
             '01 September 2014 or ASAP', 'Immediate employment opportunity.']

# Replace all variants in one vectorized call and assign the result back;
# calling replace(..., inplace=True) on df_clean.StartDate is a chained
# operation that recent pandas versions warn about.
df_clean['StartDate'] = df_clean['StartDate'].replace(asap_list, 'ASAP')


### Test

In [9]:
df_clean['StartDate'].value_counts()

ASAP                                                                                                                                                         6805
01 September 2012                                                                                                                                              31
March 2006                                                                                                                                                     27
November 2006                                                                                                                                                  22
January 2010                                                                                                                                                   19
01 February 2005                                                                                                                                               17
February 2014               
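
Beyond reading the `value_counts()` output, the test can be made explicit with assertions. This is a minimal sketch on a shortened, illustrative `asap_list` and a stand-in series named `start`. In the notebook the same checks would run against `df_clean['StartDate']` and the full list.

```python
import pandas as pd

# Shortened, illustrative version of asap_list and of the StartDate column.
asap_list = ['Immediately', 'As soon as possible', 'Immediate']
start = pd.Series(['ASAP', 'Immediately', 'As soon as possible', 'March 2006', None],
                  name='StartDate')

cleaned = start.replace(asap_list, 'ASAP')

# None of the old spellings should survive the cleaning step.
leftover = cleaned.isin(asap_list).sum()
assert leftover == 0, f'{leftover} uncleaned ASAP variants remain'

# The replacement must not create or destroy non-null values.
assert cleaned.count() == start.count()

print(cleaned.value_counts())
```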

### Define 
- Rename the nondescriptive columns (AboutC, ApplicationP, RequiredQual) and the misspelled column (JobRequirment)

### Code

In [10]:
df_clean = df_clean.rename(columns={'AboutC': 'AboutCompany',
                                    'ApplicationP': 'ApplicationProcedure',
                                    'RequiredQual': 'RequiredQualifications',
                                    'JobRequirment': 'JobRequirement'})

### Test

In [11]:
df_clean.head()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,Location,JobDescription,JobRequirement,RequiredQualifications,Salary,ApplicationProcedure,OpeningDate,Deadline,Notes,AboutCompany,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,"Yerevan, Armenia",AMERIA Investment Consulting Company is seekin...,- Supervises financial management and administ...,"To perform this job successfully, an\r\nindivi...",,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,"IREX Armenia Main Office; Yerevan, Armenia \r\...",,,- Bachelor's Degree; Master's is preferred;\r\...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,"Yerevan, Armenia",Public outreach and strengthening of a growing...,- Working with the Country Director to provide...,"- Degree in environmentally related field, or ...",,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,"Manila, Philippines",The LEAD (Local Enhancement and Development fo...,- Identify gaps in knowledge and overseeing in...,"- Advanced degree in public health, social sci...",,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,"Yerevan, Armenia",,- Rendering technical assistance to Database M...,- University degree; economical background is ...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True
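
Scanning the header row of `head()` works, but the rename can also be checked with assertions. The sketch below uses an empty stand-in frame named `frame` (an assumption for illustration) that carries only the relevant headers. In the notebook the checks would be applied to `df_clean.columns`.

```python
import pandas as pd

renames = {'AboutC': 'AboutCompany',
           'ApplicationP': 'ApplicationProcedure',
           'RequiredQual': 'RequiredQualifications',
           'JobRequirment': 'JobRequirement'}

# An empty frame with the original headers is enough to test a rename.
frame = pd.DataFrame(columns=['Title', 'AboutC', 'ApplicationP',
                              'RequiredQual', 'JobRequirment'])
renamed = frame.rename(columns=renames)

# No old header may remain, and every new header must be present.
old_left = [c for c in renames if c in renamed.columns]
new_missing = [c for c in renames.values() if c not in renamed.columns]
assert not old_left, f'columns not renamed: {old_left}'
assert not new_missing, f'expected columns missing: {new_missing}'

print(list(renamed.columns))
```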


## Analysis & Visualization

In [13]:
# Share of job postings with a non-null StartDate that need the employee ASAP
asap = df_clean.StartDate.value_counts()['ASAP']
asap

6805

In [14]:
total = df_clean.StartDate.count()
total

9675

In [16]:
asap_percentage = asap/total
print('{:.2%}'.format(asap_percentage))

70.34%
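
The same figure can be obtained in one step with `value_counts(normalize=True)`, which divides by the number of non-null values by default. The sketch below rebuilds a stand-in series named `start` from the two counts reported above (6805 'ASAP' out of 9675 non-null values). The 'other' label is a placeholder for every remaining start date.

```python
import pandas as pd

# Stand-in series built from the counts above; 'other' is a placeholder label.
start = pd.Series(['ASAP'] * 6805 + ['other'] * (9675 - 6805), name='StartDate')

# normalize=True returns relative frequencies over the non-null values.
share = start.value_counts(normalize=True)['ASAP']

print('{:.2%}'.format(share))
```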
