## Gather Data: [Download](https://www.kaggle.com/udacity/armenian-online-job-postings)
The dataset used in this lesson is hosted on this Kaggle Datasets page: Armenian Online Job Postings. Some context on this dataset, from the description section of that page:

The online job market is a good indicator of overall demand for labor in an economy. This dataset consists of 19,000 job postings from 2004 to 2015 posted on CareerCenter, an Armenian human resource portal.

Since postings are text documents and tend to have similar structures, text mining can be used to extract features like posting date, job title, company name, job description, salary, and more. Postings that had no structure or were not job-related were removed. The data was originally scraped from a Yahoo! mailing group.

Two reasons why downloading data with code is better than downloading it manually:
1. **Reproducibility**: the ability of a process to produce the same results from identical inputs.
2. **Scalability**: the ability of a process to handle an increasing scope of work.

`<img src='./img/example-job-posting.jpg' width=400 height=200/>`
```
import pprint
# Print the feature descriptions that ship with the dataset
with open('./datasets/features.txt', 'r') as f:
    pprint.pprint(f.readlines())
```

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import os
import shutil
import glob
print('All Imported')

All Imported


In [2]:
# Move every file in `source` that matches one of the glob patterns into `dst`.
# `source` must end with a path separator, e.g. 'datasets/'.
def move_file_types(source, dst, pattern):
    files = []
    for type_ in pattern:
        files.extend(glob.glob(f'{source}{type_}'))

    for file in files:
        shutil.move(file, dst)

    print(f'{len(files)} file(s) moved!')

In [3]:
os.listdir()

['.idea',
 '.ipynb_checkpoints',
 'ALX-T Connect Weekly Schedule - Data.pdf',
 'cleaning_and_splitting_movie_genres.ipynb',
 'connect_sessions_studyguide.pdf',
 'data-wrangling-template.ipynb',
 'datasets',
 'data_wrangling.ipynb',
 'img',
 'lesson3_the_data_analysis_process.ipynb',
 'merging-data.ipynb',
 'testing.ipynb',
 'wine_visualizations_solutions.ipynb']

In [4]:
source = 'datasets/'
dst = 'img/'
types = ('*.jpg', '*.png')

# move any image files from datasets/ to img/
try:
    move_file_types(source, dst, pattern=types)
except Exception as e:
    print(e)

Destination path 'img/example-job-posting.jpg' already exists


*`zipfile.ZipFile` is also a context manager, so it supports the `with` statement.*

### Context Manager:

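A protocol for handling resources in Python. An object that implements it can be used in a `with` statement, which releases the resource (for example, closes the file) when the block exits, even if an error occurs.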
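A minimal sketch of the protocol, assuming a made-up `Tracked` class (not part of the lesson) that only records when it is entered and exited. The second half shows `zipfile.ZipFile` used the same way on an in-memory archive, so no files on disk are needed.

```python
import io
import zipfile


class Tracked:
    """Hypothetical resource that records when it is opened and closed."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        # Called when the `with` block starts
        self.events.append('open')
        return self

    def __exit__(self, exc_type, exc, tb):
        # Called when the `with` block ends, even after an exception
        self.events.append('close')
        return False  # do not swallow exceptions


resource = Tracked()
with resource as r:
    r.events.append('work')

# zipfile.ZipFile follows the same protocol: the archive is closed on exit
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('hello.txt', 'hi')

with zipfile.ZipFile(buffer, 'r') as zf:
    names = zf.namelist()
```

After the first block, `resource.events` holds the open, work, and close steps in that order, which is the guarantee the `with` statement provides.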

In [5]:
with zipfile.ZipFile("datasets/archive.zip", 'r') as myzip:
    myzip.extractall("./datasets")

In [6]:
os.listdir('datasets')

['.ipynb_checkpoints',
 '17may_69_students.csv',
 '4th_session_chat.txt',
 'archive.zip',
 'attendance_16thMay.csv',
 'attendance_condensed16thMay.csv',
 'bestofrt.tsv',
 'cleaning_and_splitting_movie_genres (1).ipynb',
 'ebert_reviews',
 'example-job-posting.jpg',
 'extract_attendance.ipynb',
 'features.txt',
 'getting_data.ipynb',
 'June_analysis.ipynb',
 'online-job-postings.csv',
 'p1_incomplete.csv',
 'p1_ungraded.csv',
 'p1_unsubs.csv',
 'rt-html.zip',
 'rt_html',
 'session-4356-report-5_15_2022.csv',
 'session-4356-report-5_17_2022.csv',
 'session-4356-report-6_1_2022.csv',
 'session-4356-report-6_7_2022.csv',
 'sessions_data.ipynb']

*Move the image file to the image directory.*

### What is CSV?

CSV stands for comma-separated values, a common file format for representing tabular data.

* Can be created and edited using any text editor.
* Often created by exporting data from a spreadsheet or database.
* Uses commas as separators, also known as delimiters.

Each line of data represents a row. The delimiters (commas) separate the data in each row into columns.
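To make the row and column idea concrete, here is a small sketch that parses CSV text held in a string. The three column names and the two rows are a toy excerpt made up for illustration. Note that a value containing a comma has to be quoted so the comma is not treated as a delimiter.

```python
import io

import pandas as pd

# Toy CSV text: one header line, then one line per row
csv_text = (
    'title,company,location\n'
    'Chief Financial Officer,AMERIA,"Yerevan, Armenia"\n'
    'BCC Specialist,Manoff Group,"Manila, Philippines"\n'
)

# read_csv accepts any file-like object, so StringIO stands in for a file
toy_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_columns = toy_df.shape
first_location = toy_df.loc[0, 'location']
```

Here `n_rows` is 2 and `n_columns` is 3, and `first_location` keeps its embedded comma because the value was quoted.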

There are many other file formats including JSON, txt, and html. We'll cover some of those in the next lesson.

### EDA (Exploratory Data Analysis)

Here is one definition of EDA: an analysis approach that focuses on identifying general patterns in the data, and identifying outliers and features of the data that might not have been anticipated.

### Data wrangling
Data wrangling is about gathering the right pieces of data, assessing your data's quality and structure, then modifying your data to make it clean. The assessments you make and convert to cleaning operations won't make your analysis, visualization, or model better, though. The goal is just to make them possible, i.e., functional.

### EDA
EDA is about exploring your data and later augmenting it to maximize the potential of your analyses, visualizations, and models. When exploring, simple visualizations are often used to summarize your data's main characteristics. From there you can create new and more descriptive features from existing data, also known as feature engineering, or detect and remove outliers so your model's fit is better.

In practice, wrangling and EDA can and often do occur together.
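A minimal sketch of feature engineering and outlier detection on a made-up four-row frame. The `post_length` feature and the 1.5 × IQR rule are illustrative assumptions, not something the lesson prescribes.

```python
import pandas as pd

# Toy stand-in for the jobpost column (made-up text)
toy = pd.DataFrame({
    'jobpost': ['short ad', 'a somewhat longer job ad', 'medium job ad', 'x' * 500]
})

# Feature engineering: derive a new numeric feature from existing data
toy['post_length'] = toy['jobpost'].str.len()

# Outlier detection with the common 1.5 * IQR rule (one choice among many)
q1, q3 = toy['post_length'].quantile([0.25, 0.75])
iqr = q3 - q1
toy['is_outlier'] = (
    (toy['post_length'] < q1 - 1.5 * iqr)
    | (toy['post_length'] > q3 + 1.5 * iqr)
)

# Keep only the rows that were not flagged
toy_no_outliers = toy[~toy['is_outlier']]
```

Only the 500-character posting is flagged here, so `toy_no_outliers` keeps the other three rows.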

In [7]:
jobs_df = pd.read_csv('datasets/online-job-postings.csv')
jobs_df.head(3)

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,Location,JobDescription,JobRequirment,RequiredQual,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,"Yerevan, Armenia",AMERIA Investment Consulting Company is seekin...,- Supervises financial management and administ...,"To perform this job successfully, an\r\nindivi...",,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,"IREX Armenia Main Office; Yerevan, Armenia \r\...",,,- Bachelor's Degree; Master's is preferred;\r\...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,"Yerevan, Armenia",Public outreach and strengthening of a growing...,- Working with the Country Director to provide...,"- Degree in environmentally related field, or ...",,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False


![Job Postings](./img/example-job-posting.jpg)

## Data Quality
Low-quality data is commonly referred to as dirty data. Dirty data has issues with its content, such as:

* missing data
* invalid data
* inaccurate data
* inconsistent data

## Tidiness
Untidy data is commonly referred to as "messy" data. Messy data has issues with its structure.

A dataset is messy or tidy depending on how **rows, columns, and tables** are matched up with **observations, variables, and types**. In tidy data:

* Each variable forms a column.
* Each observation forms a row.
* Each type of observational unit forms a table.
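As an illustration of the three rules, the sketch below reshapes a hypothetical messy table whose year values are spread across column headers. The table and its numbers are invented for the example. `DataFrame.melt` is one way to move such values into rows.

```python
import pandas as pd

# Messy: the variable "year" is stored in the column headers
messy = pd.DataFrame({
    'company': ['A', 'B'],
    '2014': [3, 5],
    '2015': [4, 1],
})

# Tidy: one column per variable, one row per (company, year) observation
tidy = messy.melt(id_vars='company', var_name='year', value_name='postings')
```

The result has four rows and the columns `company`, `year`, and `postings`, so each variable forms a column and each observation forms a row.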

**New Terms**

* **Dirty Data**: Data that has issues with its content, including missing, invalid, inaccurate, or inconsistent data
* **Messy Data**: Data that has issues with its structure (columns, rows, or tables)
* **Programmatic Assessment**: Reviewing data using code
* **Visual Assessment**: Reviewing data by scrolling through it in a spreadsheet or text editing application
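A short sketch of what programmatic assessment can look like, using a made-up three-row frame rather than the real dataset. `isna().sum()` counts missing values per column and `value_counts()` surfaces inconsistent spellings of the same idea.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (made-up rows)
toy = pd.DataFrame({
    'Title': ['Chief Financial Officer', 'Software Developer', None],
    'StartDate': ['ASAP', 'Immediately', np.nan],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()

# Which distinct values appear, and how often? (NaN is excluded by default)
start_date_counts = toy['StartDate'].value_counts()
```

Seeing both "ASAP" and "Immediately" in `start_date_counts` is the kind of inconsistency that becomes a cleaning task.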

In [8]:
jobs_df

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,Location,JobDescription,JobRequirment,RequiredQual,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,"Yerevan, Armenia",AMERIA Investment Consulting Company is seekin...,- Supervises financial management and administ...,"To perform this job successfully, an\r\nindivi...",,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,"IREX Armenia Main Office; Yerevan, Armenia \r\...",,,- Bachelor's Degree; Master's is preferred;\r\...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,"Yerevan, Armenia",Public outreach and strengthening of a growing...,- Working with the Country Director to provide...,"- Degree in environmentally related field, or ...",,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,"Manila, Philippines",The LEAD (Local Enhancement and Development fo...,- Identify gaps in knowledge and overseeing in...,"- Advanced degree in public health, social sci...",,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,"Yerevan, Armenia",,- Rendering technical assistance to Database M...,- University degree; economical background is ...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18996,Technolinguistics NGO\r\n\r\n\r\nTITLE: Senio...,"Dec 28, 2015",Senior Creative UX/ UI Designer,Technolinguistics NGO,,Full-time,,,,Long-term,"Yerevan, Armenia",A tech startup of Technolinguistics based in N...,- Work closely with product and business teams...,- At least 5 years of experience in Interface/...,Competitive,"To apply for this position, please send your\r...",29 December 2015,28 January 2016,,As a company Technolinguistics has a mandate t...,,2015,12,False
18997,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Category Development Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,"Yerevan, Armenia",,- Establish and manage Category Management dev...,"- University degree, ideally business related;...",,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18998,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Operational Marketing Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,"Yerevan, Armenia",,"- Develop, establish and maintain marketing st...","- Degree in Business, Marketing or a related f...",,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18999,San Lazzaro LLC\r\n\r\n\r\nTITLE: Head of O...,"Dec 30, 2015",Head of Online Sales Department,San Lazzaro LLC,,,,,,Long-term,"Yerevan, Armenia",San Lazzaro LLC is looking for a well-experien...,- Handle the project activites of the online s...,- At least 1 year of experience in online sale...,Highly competitive,Interested candidates can send their CVs to:\r...,30 December 2015,29 January 2016,,San Lazzaro LLC works with several internation...,,2015,12,False


In [9]:
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   jobpost           19001 non-null  object
 1   date              19001 non-null  object
 2   Title             18973 non-null  object
 3   Company           18994 non-null  object
 4   AnnouncementCode  1208 non-null   object
 5   Term              7676 non-null   object
 6   Eligibility       4930 non-null   object
 7   Audience          640 non-null    object
 8   StartDate         9675 non-null   object
 9   Duration          10798 non-null  object
 10  Location          18969 non-null  object
 11  JobDescription    15109 non-null  object
 12  JobRequirment     16479 non-null  object
 13  RequiredQual      18517 non-null  object
 14  Salary            9622 non-null   object
 15  ApplicationP      18941 non-null  object
 16  OpeningDate       18295 non-null  object
 17  Deadline    

Assessment notes:

* Missing values **(NaN)**
* StartDate inconsistencies **(ASAP)**
* Non-descriptive column headers
* Possibly convert all column names to lower case

### What is Data Cleaning?
Cleaning means acting on the assessments we made to improve quality and tidiness.

#### The Programmatic Data Cleaning Process
1. Define
2. Code
3. Test

* Defining means defining a data cleaning plan in writing, where we turn our assessments into defined cleaning tasks. This plan will also serve as an instruction list so others (or us in the future) can look at our work and reproduce it.

* Coding means translating these definitions to code and executing that code.

* Testing means testing our dataset, often using code, to make sure our cleaning operations worked.

**Defining What We Want to Clean:**

Defining what we want to clean is creating a cleaning plan or instruction list where we convert the notes we made in the Assess step into specific cleaning tasks.

We identified three issues:

* Missing values or NaNs
* Start date inconsistencies for ASAP
* Non-descriptive column headers

#### Tasks
In the Jupyter Notebook below, in your own words, convert the assessments we made previously into defined cleaning operations, as shown in the above video. The missing values (NaN) and untidy dataset issues will not be cleaned in this lesson, so do not write a definition for those assessments.

* Write a cleaning definition for the StartDate inconsistencies (ASAP)
* Write a cleaning definition for the non-descriptive column headers


**Solution:**

* Select all nondescriptive and misspelled column headers (ApplicationP, AboutC, RequiredQual, JobRequirment) and replace them with full words (ApplicationProcedure, AboutCompany, RequiredQualifications, JobRequirements)
* Select all records in the StartDate column that have "As soon as possible", "Immediately", etc. and replace the text in those cells with "ASAP"

### Clean
#### Types of cleaning:
* Manual (not recommended unless the issues are single occurrences)
* Programmatic

**The programmatic data cleaning process:**

1. Define: convert our assessments into defined cleaning tasks. These definitions also serve as an instruction list so others (or yourself in the future) can look at your work and reproduce it.
2. Code: convert those definitions to code and run that code.
3. Test: test your dataset, visually or with code, to make sure your cleaning operations worked.

Always make copies of the original pieces of data before cleaning!
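A small sketch of why the copy matters, using a one-row toy frame: changes made to the copy leave the original untouched, so the raw data stays available for comparison.

```python
import pandas as pd

original = pd.DataFrame({'StartDate': ['Immediately']})

# .copy() creates an independent DataFrame to clean
working = original.copy()
working.loc[0, 'StartDate'] = 'ASAP'

original_value = original.loc[0, 'StartDate']
cleaned_value = working.loc[0, 'StartDate']
```

`original_value` is still "Immediately" while `cleaned_value` is "ASAP".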

### Reassess and Iterate
After cleaning, always reassess and iterate on any of the data wrangling steps if necessary.

### Store (Optional)
Store data, in a file or database for example, if you need to use it in the future.

In [None]:
df_clean = jobs_df.copy()

In [None]:
df_clean.rename(columns={'ApplicationP':'ApplicationProcedure',
                        'AboutC':'AboutCompany',
                        'RequiredQual':'RequiredQualifications',
                        'JobRequirment':'JobRequirements'},
               inplace=True)

In [None]:
asap_list = ['Immediately', 'As soon as possible', 'Upon hiring',
             'Immediate', 'Immediate employment', 'As soon as possible.', 'Immediate job opportunity',
             '"Immediate employment, after passing the interview."',
             'ASAP preferred', 'Employment contract signature date',
             'Immediate employment opportunity', 'Immidiately', 'ASA',
             'Asap', '"The position is open immediately but has a flexible start date depending on the candidates earliest availability."',
             'Immediately upon agreement', '20 November 2014 or ASAP',
             'immediately', 'Immediatelly',
             '"Immediately upon selection or no later than November 15, 2009."',
             'Immediate job opening', 'Immediate hiring', 'Upon selection',
             'As soon as practical', 'Immadiate', 'As soon as posible',
             'Immediately with 2 months probation period',
             '12 November 2012 or ASAP', 'Immediate employment after passing the interview',
             'Immediately/ upon agreement', '01 September 2014 or ASAP',
             'Immediately or as per agreement', 'as soon as possible',
             'As soon as Possible', 'in the nearest future', 'immediate',
             '01 April 2014 or ASAP', 'Immidiatly', 'Urgent',
             'Immediate or earliest possible', 'Immediate hire',
             'Earliest  possible', 'ASAP with 3 months probation period.',
             'Immediate employment opportunity.', 'Immediate employment.',
             'Immidietly', 'Imminent', 'September 2014 or ASAP', 'Imediately']

In [None]:
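# Replace every ASAP variant in StartDate with the single label 'ASAP'
df_clean['StartDate'] = df_clean['StartDate'].replace(asap_list, 'ASAP')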
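The Code and Test steps can be sketched end to end on a toy frame. The four rows and the two-item `toy_asap` list are made-up stand-ins for the real data and for `asap_list`. The assertions at the end are one possible way to test programmatically.

```python
import pandas as pd

# Toy stand-ins for the real data and for asap_list (made-up values)
toy_clean = pd.DataFrame({
    'StartDate': ['Immediately', 'As soon as possible', 'ASAP', '01 March 2015'],
    'ApplicationP': ['Send CV', 'Send CV', 'Apply online', 'Send CV'],
})
toy_asap = ['Immediately', 'As soon as possible']

# Code: rename the non-descriptive header and normalise the ASAP variants
toy_clean = toy_clean.rename(columns={'ApplicationP': 'ApplicationProcedure'})
toy_clean['StartDate'] = toy_clean['StartDate'].replace(toy_asap, 'ASAP')

# Test: no variant should remain, and the new header should exist
assert not toy_clean['StartDate'].isin(toy_asap).any()
assert 'ApplicationProcedure' in toy_clean.columns
assert 'ApplicationP' not in toy_clean.columns
```

Values that are not in the list, such as an actual date, pass through unchanged.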

[BeautifulSoup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)

### Quiz

With your knowledge of HTML file structure, you're going to use Beautiful Soup to extract our desired Audience Score metric and number of audience ratings, along with the movie title like in the video above (so we have something to merge the datasets on later) for each HTML file, then save them in a pandas DataFrame.

The Jupyter Notebook below contains template code that:

* Creates an empty list, `df_list`, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the most efficient way of building a DataFrame row by row).
* Loops through each movie's Rotten Tomatoes HTML file in the `rt_html` folder.
* Opens each HTML file and passes it into a file handle called `file`.
* Creates a DataFrame called `df` by converting `df_list` using the `pd.DataFrame` constructor.

Your task is to extract the title, audience score, and the number of audience ratings in each HTML file so each trio can be appended as a dictionary to `df_list`.
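The list-of-dictionaries pattern described above can be sketched as follows. The movie titles, the numbers, and the dictionary keys are placeholders chosen for illustration. In the quiz they would come from parsing each HTML file.

```python
import pandas as pd

# Placeholder records standing in for values parsed from each HTML file
parsed_records = [
    ('Example Movie A', 97, 32313030),
    ('Example Movie B', 88, 1200),
]

df_list = []
for title, score, n_ratings in parsed_records:
    # One dictionary per movie; the keys become column names
    df_list.append({
        'title': title,
        'audience_score': score,
        'number_of_audience_ratings': n_ratings,
    })

# Convert the whole list in one step instead of growing a DataFrame row by row
df = pd.DataFrame(
    df_list,
    columns=['title', 'audience_score', 'number_of_audience_ratings'],
)
```

Passing `columns` fixes the column order regardless of the key order inside each dictionary.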

1. 
```
<div class="meter-value">
      <span class="superPageFontColor" style="vertical-align:top">97%</span>
</div>
```

2. 
```
<div class="audience-info hidden-xs superPageFontColor">
    <div>
            <span class="subtle superPageFontColor">Average Rating:</span>
            3.5/5
                </div>
    <div>
        <span class="subtle superPageFontColor">User Ratings:</span>
        32,313,030</div>
</div>
```
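One possible way to pull the two numbers out of the snippets above with Beautiful Soup, offered as a sketch and not as the quiz's official solution. It assumes the `bs4` package is installed and uses Python's built-in `html.parser`. The variable names are made up for the example.

```python
from bs4 import BeautifulSoup

score_html = '''
<div class="meter-value">
      <span class="superPageFontColor" style="vertical-align:top">97%</span>
</div>
'''

ratings_html = '''
<div class="audience-info hidden-xs superPageFontColor">
    <div>
            <span class="subtle superPageFontColor">Average Rating:</span>
            3.5/5
                </div>
    <div>
        <span class="subtle superPageFontColor">User Ratings:</span>
        32,313,030</div>
</div>
'''

# Audience score: text of the <span> inside div.meter-value, minus the % sign
soup = BeautifulSoup(score_html, 'html.parser')
score_text = soup.find('div', class_='meter-value').find('span').get_text()
audience_score = int(score_text.strip().rstrip('%'))

# Number of ratings: the text that follows the label <span> in the second inner div
soup = BeautifulSoup(ratings_html, 'html.parser')
info = soup.find('div', class_='audience-info')
ratings_div = info.find_all('div')[1]
ratings_text = ratings_div.find('span').next_sibling
num_audience_ratings = int(ratings_text.strip().replace(',', ''))
```

Stripping whitespace and removing the thousands separators before calling `int` is what turns the raw text into a number that can be stored in a numeric column.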