# Predicting US Working Visa Applications

## Background

- In order to employ foreign workers, companies in the US must submit an application to the Department of Labor (DOL).
- The DOL must certify to the Department of Homeland Security’s USCIS that there are not sufficient US workers able, willing, qualified and available to accept the job opportunity in the area of intended employment.
- The employment of a foreign worker must not adversely affect the wages and working conditions of similarly employed U.S. workers. We want to uncover insights that can help predict visa decisions based on employee/employer/wage/economic sector?

The dataset is in the form of a CSV file collected and distributed by the US Department of Labor. Data covers 2012-2017 and includes information on employer, position, wage offered, job posting history, employee education, past visa history, economic sector of employment, and final decision. There are 374,362 rows and 154 columns. 
There are many inconsistencies with how the data was collected. Columns collecting information of the employees past are in two subsets labelled both ‘foreign_worker’ and ‘fw’. Similarly with the position information there are two subsets labelled ‘job_info’ and ‘ji’. There will be a lot of cleaning and organising involved in preparing this data before any predictive models can be considered.

Data obtained from Kaggle: https://www.kaggle.com/jboysen/us-perm-visas/data

## **Data Understanding:** Exploring relationships between feature pairs and selecting promising features.

In [1]:
#Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
import itertools
%matplotlib inline
%config IPCompleter.greedy=True

#Display all columns in tables which were being left out before
#https://stackoverflow.com/questions/11707586/python-pandas-how-to-widen-output-display-to-see-more-columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 154 )

## 1. Prepare a data quality report for the CSV file.

In [2]:
# Reading from a csv file, into a data frame
df = pd.read_csv("us_perm_visas2.csv")

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
print("Unique values for:\n- case_status:", pd.unique(df.case_status.ravel()))

Unique values for:
- case_status: ['Certified' 'Denied' 'Certified-Expired' 'Withdrawn']


In [4]:
#In this project we are focusing on permanent working visa i.e. class H-1B so we only select rows with these visa types
df_visaType = df[df.class_of_admission == 'H-1B']

In [5]:
# Check how many rows and columns the data frame has 
df_visaType.shape
# Check how many rows
print("There are", df_visaType.shape[0], "rows")
# Check how many columns
print("There are", df_visaType.shape[1], "columns")

There are 283018 rows
There are 154 columns


In [6]:
# Print the first 5 lines on the dataset
df_visaType.head()

Unnamed: 0,add_these_pw_job_title_9089,agent_city,agent_firm_name,agent_state,application_type,case_no,case_number,case_received_date,case_status,class_of_admission,country_of_citizenship,country_of_citzenship,decision_date,employer_address_1,employer_address_2,employer_city,employer_country,employer_decl_info_title,employer_name,employer_num_employees,employer_phone,employer_phone_ext,employer_postal_code,employer_state,employer_yr_estab,foreign_worker_info_alt_edu_experience,foreign_worker_info_birth_country,foreign_worker_info_city,foreign_worker_info_education,foreign_worker_info_education_other,foreign_worker_info_inst,foreign_worker_info_major,foreign_worker_info_postal_code,foreign_worker_info_rel_occup_exp,foreign_worker_info_req_experience,foreign_worker_info_state,foreign_worker_info_training_comp,foreign_worker_ownership_interest,foreign_worker_yr_rel_edu_completed,fw_info_alt_edu_experience,fw_info_birth_country,fw_info_education_other,fw_info_postal_code,fw_info_rel_occup_exp,fw_info_req_experience,fw_info_training_comp,fw_info_yr_rel_edu_completed,fw_ownership_interest,ji_foreign_worker_live_on_premises,ji_fw_live_on_premises,ji_live_in_dom_svc_contract,ji_live_in_domestic_service,ji_offered_to_sec_j_foreign_worker,ji_offered_to_sec_j_fw,job_info_alt_cmb_ed_oth_yrs,job_info_alt_combo_ed,job_info_alt_combo_ed_exp,job_info_alt_combo_ed_other,job_info_alt_field,job_info_alt_field_name,job_info_alt_occ,job_info_alt_occ_job_title,job_info_alt_occ_num_months,job_info_combo_occupation,job_info_education,job_info_education_other,job_info_experience,job_info_experience_num_months,job_info_foreign_ed,job_info_foreign_lang_req,job_info_job_req_normal,job_info_job_title,job_info_major,job_info_training,job_info_training_field,job_info_training_num_months,job_info_work_city,job_info_work_postal_code,job_info_work_state,naics_2007_us_code,naics_2007_us_title,naics_code,naics_title,naics_us_code,naics_us_code_2007,naics_us_title,naics_us_title_2007,orig_case_no,orig_file_date,preparer_info_emp_completed,preparer_info_title,pw_amount_9089,pw_determ_date,pw_expire_date,pw_job_title_908,pw_job_title_9089,pw_level_9089,pw_soc_code,pw_soc_title,pw_source_name_9089,pw_source_name_other_9089,pw_track_num,pw_unit_of_pay_9089,rec_info_barg_rep_notified,recr_info_barg_rep_notified,recr_info_coll_teach_comp_proc,recr_info_coll_univ_teacher,recr_info_employer_rec_payment,recr_info_first_ad_start,recr_info_job_fair_from,recr_info_job_fair_to,recr_info_on_campus_recr_from,recr_info_on_campus_recr_to,recr_info_pro_org_advert_from,recr_info_pro_org_advert_to,recr_info_prof_org_advert_from,recr_info_prof_org_advert_to,recr_info_professional_occ,recr_info_radio_tv_ad_from,recr_info_radio_tv_ad_to,recr_info_second_ad_start,recr_info_sunday_newspaper,recr_info_swa_job_order_end,recr_info_swa_job_order_start,refile,ri_1st_ad_newspaper_name,ri_2nd_ad_newspaper_name,ri_2nd_ad_newspaper_or_journal,ri_campus_placement_from,ri_campus_placement_to,ri_coll_tch_basic_process,ri_coll_teach_pro_jnl,ri_coll_teach_select_date,ri_employee_referral_prog_from,ri_employee_referral_prog_to,ri_employer_web_post_from,ri_employer_web_post_to,ri_job_search_website_from,ri_job_search_website_to,ri_layoff_in_past_six_months,ri_local_ethnic_paper_from,ri_local_ethnic_paper_to,ri_posted_notice_at_worksite,ri_pvt_employment_firm_from,ri_pvt_employment_firm_to,ri_us_workers_considered,schd_a_sheepherder,us_economic_sector,wage_offer_from_9089,wage_offer_to_9089,wage_offer_unit_of_pay_9089,wage_offered_from_9089,wage_offered_to_9089,wage_offered_unit_of_pay_9089
2,,,,,PERM,A-07333-99643,,,Certified,H-1B,,INDIA,2011-12-01,1054 TECHNOLOGY PARK DRIVE,,GLEN ALLEN,,,"SCHNABEL ENGINEERING, INC.",,,,23059,VA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Lutherville,,MD,541330,Engineering Services,,,,,,,,,,,47923.0,,,,Civil Engineer,Level I,17-2051.00,Civil Engineers,OES,,,yr,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Aerospace,47923,,yr,,,
6,,,,,PERM,A-07354-06926,,,Certified-Expired,H-1B,,MEXICO,2011-10-07,285 PAWLING AVE,,TROY,,,EMMA WILLARD SCHOOL,,,,12180,NY,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Troy,,NY,611110,Elementary and Secondary Schools,,,,,,,,,,,47083.3,,,,"Secondary School Teachers, Except Special and ...",Level II,25-2031.00,"Secondary School Teachers, Except Special and ...",OES,,,yr,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Educational Services,47084,52000.0,yr,,,
8,,,,,PERM,A-08004-10184,,,Certified,H-1B,,CANADA,2012-02-29,2711 CENTERVILLE ROAD,,WILMINGTON,,,ELECTRONIC DATA SYSTEMS CORPORATION,,,,19808,DE,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Fort Worth,,TX,541511,Custom Computer Programming Services,,,,,,,,,,,44824.0,,,,Computer Systems Analysts,Level I,15-1051.00,Computer Systems Analysts,OES,,,yr,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IT,44824,85000.0,yr,,,
10,,,,,PERM,A-08057-27232,,,Withdrawn,H-1B,,INDIA,2012-03-05,4833 RUGBY AVE.,SUITE 500,BETHESDA,,,"AQUAS, INC.",,,,20814,MD,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Bethesda,,MD,541511,Custom Computer Programming Services,,,,,,,,,,,59758.0,,,,Computer Programmer,Level II,15-1021.00,Computer Programmers,OES,,,yr,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IT,60000,,yr,,,
11,,,,,PERM,A-08058-28001,,,Certified,H-1B,,SINGAPORE,2012-01-06,"525 BROADWAY, SUITE 201",,NEW YORK,,,NINE MUSES AND APOLLO INC,,,,10012,NY,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,New York,,NY,711410,"Agents and Managers for Artists, Athletes, Ent...",,,,,,,,,,,46176.0,,,,Public Relations Specialist,Level II,27-3031.00,Public Relations Specialists,OES,,,yr,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Other Economic Sector,50000,,yr,,,


In [7]:
# Print the last 5 lines of the dataset
df_visaType.tail()

Unnamed: 0,add_these_pw_job_title_9089,agent_city,agent_firm_name,agent_state,application_type,case_no,case_number,case_received_date,case_status,class_of_admission,country_of_citizenship,country_of_citzenship,decision_date,employer_address_1,employer_address_2,employer_city,employer_country,employer_decl_info_title,employer_name,employer_num_employees,employer_phone,employer_phone_ext,employer_postal_code,employer_state,employer_yr_estab,foreign_worker_info_alt_edu_experience,foreign_worker_info_birth_country,foreign_worker_info_city,foreign_worker_info_education,foreign_worker_info_education_other,foreign_worker_info_inst,foreign_worker_info_major,foreign_worker_info_postal_code,foreign_worker_info_rel_occup_exp,foreign_worker_info_req_experience,foreign_worker_info_state,foreign_worker_info_training_comp,foreign_worker_ownership_interest,foreign_worker_yr_rel_edu_completed,fw_info_alt_edu_experience,fw_info_birth_country,fw_info_education_other,fw_info_postal_code,fw_info_rel_occup_exp,fw_info_req_experience,fw_info_training_comp,fw_info_yr_rel_edu_completed,fw_ownership_interest,ji_foreign_worker_live_on_premises,ji_fw_live_on_premises,ji_live_in_dom_svc_contract,ji_live_in_domestic_service,ji_offered_to_sec_j_foreign_worker,ji_offered_to_sec_j_fw,job_info_alt_cmb_ed_oth_yrs,job_info_alt_combo_ed,job_info_alt_combo_ed_exp,job_info_alt_combo_ed_other,job_info_alt_field,job_info_alt_field_name,job_info_alt_occ,job_info_alt_occ_job_title,job_info_alt_occ_num_months,job_info_combo_occupation,job_info_education,job_info_education_other,job_info_experience,job_info_experience_num_months,job_info_foreign_ed,job_info_foreign_lang_req,job_info_job_req_normal,job_info_job_title,job_info_major,job_info_training,job_info_training_field,job_info_training_num_months,job_info_work_city,job_info_work_postal_code,job_info_work_state,naics_2007_us_code,naics_2007_us_title,naics_code,naics_title,naics_us_code,naics_us_code_2007,naics_us_title,naics_us_title_2007,orig_case_no,orig_file_date,preparer_info_emp_completed,preparer_info_title,pw_amount_9089,pw_determ_date,pw_expire_date,pw_job_title_908,pw_job_title_9089,pw_level_9089,pw_soc_code,pw_soc_title,pw_source_name_9089,pw_source_name_other_9089,pw_track_num,pw_unit_of_pay_9089,rec_info_barg_rep_notified,recr_info_barg_rep_notified,recr_info_coll_teach_comp_proc,recr_info_coll_univ_teacher,recr_info_employer_rec_payment,recr_info_first_ad_start,recr_info_job_fair_from,recr_info_job_fair_to,recr_info_on_campus_recr_from,recr_info_on_campus_recr_to,recr_info_pro_org_advert_from,recr_info_pro_org_advert_to,recr_info_prof_org_advert_from,recr_info_prof_org_advert_to,recr_info_professional_occ,recr_info_radio_tv_ad_from,recr_info_radio_tv_ad_to,recr_info_second_ad_start,recr_info_sunday_newspaper,recr_info_swa_job_order_end,recr_info_swa_job_order_start,refile,ri_1st_ad_newspaper_name,ri_2nd_ad_newspaper_name,ri_2nd_ad_newspaper_or_journal,ri_campus_placement_from,ri_campus_placement_to,ri_coll_tch_basic_process,ri_coll_teach_pro_jnl,ri_coll_teach_select_date,ri_employee_referral_prog_from,ri_employee_referral_prog_to,ri_employer_web_post_from,ri_employer_web_post_to,ri_job_search_website_from,ri_job_search_website_to,ri_layoff_in_past_six_months,ri_local_ethnic_paper_from,ri_local_ethnic_paper_to,ri_posted_notice_at_worksite,ri_pvt_employment_firm_from,ri_pvt_employment_firm_to,ri_us_workers_considered,schd_a_sheepherder,us_economic_sector,wage_offer_from_9089,wage_offer_to_9089,wage_offer_unit_of_pay_9089,wage_offered_from_9089,wage_offered_to_9089,wage_offered_unit_of_pay_9089
374353,,Edison,LAW OFFICE OF STEVEN MARKAN LLC,NJ,,,A-16292-62659,2016-10-31,Certified,H-1B,INDIA,,2016-12-30,"1551 SOUTH WASHINGTON AVE, SUITE 402A",,PISCATAWAY,UNITED STATES OF AMERICA,Director of HR,"FIRST TEK, INC.",600.0,7327450107,,8854,NJ,2001.0,,,WEYMOUTH,Bachelor's,,VISVESVARAYA TECHNOLOGICAL UNIVERSITY,ENGINEERING,,,,MA,,,,A,INDIA,,2188,Y,Y,A,2003.0,N,,N,,N,,Y,,,N,,N,,,SOFTWARE BUSINESS DEVELOPMENT,60.0,N,Bachelor's,,Y,60.0,Y,N,Y,MANAGEMENT ANALYST,CS or Engg or IS or Business Admin,N,,,Piscataway,8854,NJ,,,,,541511,,Custom Computer Programming Services,,,,N,ATTORNEY AT LAW,136219.0,2016-07-12,2017-06-30,MANAGEMENT ANALYST,,Level IV,13-1111,Management Analysts,OES,,P10016118731047,Year,,A,,N,N,2016-08-14,,,,,,,,,Y,,,2016-08-21,Y,2016-09-03,2016-08-04,N,SUNDAY STAR LEDGER,SUNDAY STAR LEDGER,Y,,,,,,,,2016-08-08,2016-09-10,2016-08-14,2016-09-12,N,2016-08-17,2016-08-17,Y,,,,N,,136219.0,,Year,,,
374354,,Jacksonville,"Constangy, Brooks, Smith & Prophete LLP",FL,,,A-16352-82106,2016-12-22,Withdrawn,H-1B,INDIA,,2016-12-30,60 BROADHOLLOW ROAD,,MELVILLE,UNITED STATES OF AMERICA,HR Generalist,"ANALYSIS & DESIGN APPLICATION CO., LTD.",400.0,6315492300,,11747,NY,1980.0,,,AUSTIN,Master's,,BHARATHIAR UNIVERSITY,MANUFACTURING ENGINEERING,,,,TX,,,,Y,INDIA,,78717,Y,A,A,2002.0,N,,N,,N,,Y,5.0,Bachelor's,Y,,Y,"Computer Science, Mechanical Engineering or cl...","Computer Science, Mechanical Engineering or cl...","Developing production code in C++, Java and cl...",36.0,N,Master's,,N,,Y,N,Y,Senior Software Developer,Computer Information Systems,N,,,Austin,78750,TX,,,,,541330,,Engineering Services,,,,N,Attorney,100693.0,2016-11-01,2017-06-30,Senior Software Developer,,Level III,15-1133,"Software Developers, Systems Software",OES,,P10016195533489,Year,,A,,N,N,2016-10-02,,,,,,,,,Y,,,2016-10-09,Y,2016-11-18,2016-10-19,N,The Austin American Statesman,The Austin American Statesman,Y,,,,,,,,2016-09-23,2016-10-19,2016-09-26,2016-10-19,N,,,Y,2016-09-26,2016-10-26,,N,,125000.0,,Year,,,
374356,,San Francisco,Goeschl Law Corporation,CA,,,A-16328-74286,2016-12-28,Withdrawn,H-1B,CHINA,,2016-12-30,650 CASTRO STREET,SUITE 400,MOUNTAIN VIEW,UNITED STATES OF AMERICA,"Director, Human Resources","PURE STORAGE, INC.",1378.0,6503186593,,94041,CA,2009.0,,,FOSTER CITY,Master's,,PURDUE UNIVERSITY,COMPUTER SCIENCE,,,,CA,,,,A,CHINA,,94404,Y,Y,A,2011.0,N,,N,,N,,Y,,,N,,N,,,Any similar position,36.0,N,Master's,,Y,36.0,Y,N,N,Member of Technical Staff (Software Engineer),Computer Science or a related field,N,,,Mountain View,94041,CA,,,,,541512,,Computer Systems Design Services,,,,N,Attorney,142938.0,2016-12-06,2017-06-30,"Software Developers, Systems Software",,Level III,15-1133,"Software Developers, Systems Software",OES,,P10016229483265,Year,,A,,N,N,2016-10-02,,,,,,,2016-09-28,2016-10-08,Y,,,2016-10-09,Y,2016-10-28,2016-09-27,N,The Mercury News,The Mercury News,Y,,,,,,,,,,2016-09-28,2016-10-08,N,2016-10-07,2016-10-07,Y,,,,N,,142938.0,,Year,,,
374359,,Schaumburg,International Legal and Business Services Grou...,IL,,,A-16354-82345,2016-12-30,Withdrawn,H-1B,INDIA,,2016-12-30,220 W MICHIGAN AVE,,YPSILANTI,UNITED STATES OF AMERICA,Director,AMPHION GLOBAL INC,33.0,6143568160,,48197,MI,2010.0,,,DEARBORN,Master's,,CLEVELAND STATE UNIVERSITY,ELECTRICAL AND ELECTRONICS ENGINEERING,,,,MI,,,,A,INDIA,,48126,A,A,A,2013.0,N,,N,,N,,Y,,,N,,N,,,,,N,Master's,,N,,Y,N,Y,Computer Systems Analyst,"Management Info Systems,Comp Sci,Engg (any),Te...",N,,,Ypsilanti,48197,MI,,,,,541511,,Custom Computer Programming Services,,,,N,Attorney at Law,79082.0,2016-11-02,2017-06-30,Computer Systems Analyst,,Level II,15-1121,Computer Systems Analysts,OES,,P10016193760249,Year,,A,,N,N,2016-07-24,,,,,,,,,Y,,,2016-07-31,Y,2016-08-22,2016-07-18,N,Detroit Free Press,Detroit Free Press,Y,,,,,,,,2016-07-18,2016-08-16,2016-07-19,2016-08-16,N,2016-07-21,2016-07-28,Y,,,,N,,79082.0,79082.0,Year,,,
374361,,Phoenix,"Fragomen, Del Rey, Bernsen & Loewy, LLP",AZ,,,A-16279-59292,2016-12-30,Withdrawn,H-1B,CHINA,,2016-12-30,2200 MISSION COLLEGE BLVD.,,SANTA CLARA,UNITED STATES OF AMERICA,U.S. Immigration Ops Manager,INTEL CORPORATION,54000.0,4087658080,,95052,CA,1968.0,,,FORT COLLINS,Master's,,"UNIVERSITY OF MICHIGAN, ANN ARBOR",ELECTRICAL ENGINEERING,,,,CO,,,,A,CHINA,,80525,A,A,A,2016.0,N,,N,A,N,,Y,,,N,,N,,,,,N,Master's,,N,,Y,N,Y,Component Design Engineer,"Elec. &/or Comp. Eng., or Scie., or related Sc...",N,,,Fort Collins,80528,CO,,,,,3344,,Semiconductor and Other Electronic Component M...,,,,N,Attorney,84926.0,2016-12-13,2017-06-30,"Electronics Engineers, Except Computer",,Level II,17-2072,"Electronics Engineers, Except Computer",OES,,P10016237058122,Year,,A,,N,N,2016-08-21,,,,,,,2016-09-01,2016-09-01,Y,,,2016-08-28,Y,2016-10-03,2016-08-31,N,Fort Collins Coloradoan,Fort Collins Coloradoan,Y,,,,,,,,,,2016-08-29,2016-09-06,N,2016-08-19,2016-09-01,Y,,,,N,,84926.0,121500.0,Year,,,


### 1.1 Convert the features to their appropriate data types (e.g., decide which features are more appropriate as continuous and which ones as categorical). 

In [8]:
# Print the datatypes of each features in the dataset
df_visaType.dtypes

add_these_pw_job_title_9089                object
agent_city                                 object
agent_firm_name                            object
agent_state                                object
application_type                           object
case_no                                    object
case_number                                object
case_received_date                         object
case_status                                object
class_of_admission                         object
country_of_citizenship                     object
country_of_citzenship                      object
decision_date                              object
employer_address_1                         object
employer_address_2                         object
employer_city                              object
employer_country                           object
employer_decl_info_title                   object
employer_name                              object
employer_num_employees                    float64


## Observations of datatypes
 - The features 'fw_info_yr_rel_edu_completed' and 'foreign_worker_yr_rel_edu_completed' contain a year value to show in what year the applicant completed their relevant education. They are currently labelled a float64 however we feel these features would be better suited as a categorical features. 
- The features employer_yr_estab also shows a year in which would be best suited as a categorical features. 
- 'pw_amount_9089', 'wage_offer_from_9089' and 'wage_offer_to_9089' are all labelled as type object but these features show numerical figures of the wages offered to the applicants upon sucess. Therefore we are choosing to re-label these features as type float 64. It is important to note here that values in these features are labelled due to the fact that 4 or more digit numbers contain ',' . i.e. 18,400.00 . We must first remove all instance of the commas before converting to a float. 

In [9]:
#Setting data type as object
df_visaType['fw_info_yr_rel_edu_completed'] = df_visaType['fw_info_yr_rel_edu_completed'].astype('object')
df_visaType['foreign_worker_yr_rel_edu_completed'] = df_visaType['foreign_worker_yr_rel_edu_completed'].astype('object')
df_visaType['employer_yr_estab'] = df_visaType['employer_yr_estab'].astype('object')

#Removing ',' and setting data type as float64
df_visaType['pw_amount_9089'] = df_visaType['pw_amount_9089'].str.replace(',', '')
df_visaType['pw_amount_9089'] = df_visaType['pw_amount_9089'].astype('float64')

#Removing ',' and setting data type as float64
df_visaType['wage_offer_from_9089'] = df_visaType['wage_offer_from_9089'].str.replace(',', '')
df_visaType['wage_offer_from_9089'] = df_visaType['wage_offer_from_9089'].str.replace('#', '0')
df_visaType['wage_offer_from_9089'] = df_visaType['wage_offer_from_9089'].astype('float64')

#Removing ',' and setting data type as float64
df_visaType['wage_offer_to_9089'] = df_visaType['wage_offer_to_9089'].str.replace(',', '')
df_visaType['wage_offer_to_9089'] = df_visaType['wage_offer_to_9089'].str.replace('#', '0')
df_visaType['wage_offer_to_9089'] = df_visaType['wage_offer_to_9089'].astype('float64')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the cavea

In [10]:
# Checking the datatypes were changed successfully
df_visaType.dtypes

add_these_pw_job_title_9089                object
agent_city                                 object
agent_firm_name                            object
agent_state                                object
application_type                           object
case_no                                    object
case_number                                object
case_received_date                         object
case_status                                object
class_of_admission                         object
country_of_citizenship                     object
country_of_citzenship                      object
decision_date                              object
employer_address_1                         object
employer_address_2                         object
employer_city                              object
employer_country                           object
employer_decl_info_title                   object
employer_name                              object
employer_num_employees                    float64


### 1.2 Drop duplicate rows and columns, if any.

* It is important to note at this stage that there were some inconsistencies when collecting the data. There are some duplicate columns i.e. **country_of_citizenship** / **country_of_citzenship** This will be dealt with at a later stage. For now, we will simply remove any duplicate rows.

In [11]:
# Dropping any duplicate rows
df_visaType.drop_duplicates(subset='case_number', keep='first')

Unnamed: 0,add_these_pw_job_title_9089,agent_city,agent_firm_name,agent_state,application_type,case_no,case_number,case_received_date,case_status,class_of_admission,country_of_citizenship,country_of_citzenship,decision_date,employer_address_1,employer_address_2,employer_city,employer_country,employer_decl_info_title,employer_name,employer_num_employees,employer_phone,employer_phone_ext,employer_postal_code,employer_state,employer_yr_estab,foreign_worker_info_alt_edu_experience,foreign_worker_info_birth_country,foreign_worker_info_city,foreign_worker_info_education,foreign_worker_info_education_other,foreign_worker_info_inst,foreign_worker_info_major,foreign_worker_info_postal_code,foreign_worker_info_rel_occup_exp,foreign_worker_info_req_experience,foreign_worker_info_state,foreign_worker_info_training_comp,foreign_worker_ownership_interest,foreign_worker_yr_rel_edu_completed,fw_info_alt_edu_experience,fw_info_birth_country,fw_info_education_other,fw_info_postal_code,fw_info_rel_occup_exp,fw_info_req_experience,fw_info_training_comp,fw_info_yr_rel_edu_completed,fw_ownership_interest,ji_foreign_worker_live_on_premises,ji_fw_live_on_premises,ji_live_in_dom_svc_contract,ji_live_in_domestic_service,ji_offered_to_sec_j_foreign_worker,ji_offered_to_sec_j_fw,job_info_alt_cmb_ed_oth_yrs,job_info_alt_combo_ed,job_info_alt_combo_ed_exp,job_info_alt_combo_ed_other,job_info_alt_field,job_info_alt_field_name,job_info_alt_occ,job_info_alt_occ_job_title,job_info_alt_occ_num_months,job_info_combo_occupation,job_info_education,job_info_education_other,job_info_experience,job_info_experience_num_months,job_info_foreign_ed,job_info_foreign_lang_req,job_info_job_req_normal,job_info_job_title,job_info_major,job_info_training,job_info_training_field,job_info_training_num_months,job_info_work_city,job_info_work_postal_code,job_info_work_state,naics_2007_us_code,naics_2007_us_title,naics_code,naics_title,naics_us_code,naics_us_code_2007,naics_us_title,naics_us_title_2007,orig_case_no,orig_file_date,preparer_info_emp_completed,preparer_info_title,pw_amount_9089,pw_determ_date,pw_expire_date,pw_job_title_908,pw_job_title_9089,pw_level_9089,pw_soc_code,pw_soc_title,pw_source_name_9089,pw_source_name_other_9089,pw_track_num,pw_unit_of_pay_9089,rec_info_barg_rep_notified,recr_info_barg_rep_notified,recr_info_coll_teach_comp_proc,recr_info_coll_univ_teacher,recr_info_employer_rec_payment,recr_info_first_ad_start,recr_info_job_fair_from,recr_info_job_fair_to,recr_info_on_campus_recr_from,recr_info_on_campus_recr_to,recr_info_pro_org_advert_from,recr_info_pro_org_advert_to,recr_info_prof_org_advert_from,recr_info_prof_org_advert_to,recr_info_professional_occ,recr_info_radio_tv_ad_from,recr_info_radio_tv_ad_to,recr_info_second_ad_start,recr_info_sunday_newspaper,recr_info_swa_job_order_end,recr_info_swa_job_order_start,refile,ri_1st_ad_newspaper_name,ri_2nd_ad_newspaper_name,ri_2nd_ad_newspaper_or_journal,ri_campus_placement_from,ri_campus_placement_to,ri_coll_tch_basic_process,ri_coll_teach_pro_jnl,ri_coll_teach_select_date,ri_employee_referral_prog_from,ri_employee_referral_prog_to,ri_employer_web_post_from,ri_employer_web_post_to,ri_job_search_website_from,ri_job_search_website_to,ri_layoff_in_past_six_months,ri_local_ethnic_paper_from,ri_local_ethnic_paper_to,ri_posted_notice_at_worksite,ri_pvt_employment_firm_from,ri_pvt_employment_firm_to,ri_us_workers_considered,schd_a_sheepherder,us_economic_sector,wage_offer_from_9089,wage_offer_to_9089,wage_offer_unit_of_pay_9089,wage_offered_from_9089,wage_offered_to_9089,wage_offered_unit_of_pay_9089
2,,,,,PERM,A-07333-99643,,,Certified,H-1B,,INDIA,2011-12-01,1054 TECHNOLOGY PARK DRIVE,,GLEN ALLEN,,,"SCHNABEL ENGINEERING, INC.",,,,23059,VA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Lutherville,,MD,541330,Engineering Services,,,,,,,,,,,,,,,Civil Engineer,Level I,17-2051.00,Civil Engineers,OES,,,yr,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Aerospace,,,yr,,,
135269,,Milwaukee,Reinhart Boerner Van Deuren s.c.,WISCONSIN,,,A-13316-14231,2013-11-19,Certified,H-1B,INDIA,,2015-05-29,S45 W29290 HWY 59,,WAUKESHA,UNITED STATES OF AMERICA,Vice President of Human Resources,GENERAC POWER SYSTEMS,1935.0,2625444811,,53189,WISCONSIN,1959,A,INDIA,WAUKESHA,Bachelor's,,UNIVERSITY OF KERALA,MECHANICAL ENGINEERING,53189,Y,A,WISCONSIN,A,N,2002,,,,,,,,,,N,,,N,Y,,,,N,,N,,Y,"Mech. Engr, Dsgn Engr, Dev Engr, Prod Engr, En...",84.0,N,Bachelor's,,N,,Y,N,N,Senior Engineer,Mechanical or Industrial Engineering,N,,,Waukesha,53189,WISCONSIN,,,335312,Motor and Generator Manufacturing,,,,,,,N,Attorney,83366.00,2013-09-27,2014-06-30,,Industrial Engineers,Level IV,17-2112,Industrial Engineers,OES,,P10013226788912,Year,A,,,N,N,2013-08-18,,,,,,,,,Y,,,2013-08-25,Y,2013-09-30,2013-08-16,N,Milwaukee Journal Sentinel,Milwaukee Journal Sentinel,Y,2013-08-14,2013-09-30,,,,,,2013-08-14,2013-09-12,2013-08-14,2013-09-12,N,,,Y,,,,N,,90000.00,95000.00,Year,,,
135271,,Littleton,Law Office of Jonathan Chin,COLORADO,,,A-13316-14312,2013-11-27,Denied,H-1B,GERMANY,,2014-10-16,12635 E. MONTVIEW BLVD. #140,,AURORA,UNITED STATES OF AMERICA,Chief Financial Officer,"AVIDITY, LLC",4.0,720-859-6111,,80045,COLORADO,1996,A,GERMANY,DENVER,Doctorate,,CHRISTIAN ALBRECHTS UNIVERSITY OF KIEL,MOLECULAR BIOLOGY,,A,Y,COLORADO,A,N,20,,,,,,,,,,N,,,N,Y,,,,N,,N,,N,,,N,Doctorate,,Y,36.0,Y,N,Y,Protein Production Scientist,Molecular Biology,N,,,Aurora,80045,COLORADO,,,54171,"Research and Development in the Physical, Engi...",,,,,,,N,Attorney,49982.00,2013-08-27,2014-06-30,,Protein Production Scientist,Level I,19-1029,"Biological Scientists, All Other",OES,,P100131960847,Year,A,,,N,N,2013-07-28,,,,,,,,,Y,,,2013-08-04,Y,2013-09-25,2013-08-26,N,Denver Post,Denver Post,Y,2013-08-16,2013-09-15,,,,,,,,2013-08-01,2013-08-31,N,2013-08-01,2013-08-07,Y,,,,N,,65000.00,,Year,,,
135275,,Williamsville,Cumming & Partners,NEW YORK,,,A-13317-14356,2014-07-02,Certified-Expired,H-1B,CANADA,,2014-12-01,317 MADISON AVENUE,SUITE 512,NEW YORK,UNITED STATES OF AMERICA,Managing Partner,TALBERT & TALBERT LLC,3.0,212-665-5202,7000,10017,NEW YORK,2011,A,CANADA,NEW YORK,Bachelor's,,THE UNIVERSITY OF WESTERN ONTARIO,MEDIA STUDIES,10009,Y,N,NEW YORK,A,N,2007,,,,,,,,,,N,,,N,Y,,,,N,,Y,Marketing or Related,Y,Related Consumer Lifestyle Public Relations,24.0,N,Bachelor's,,Y,24.0,Y,N,Y,Public Relations Specialist,Media Studies,N,,,New York,10017,NEW YORK,,,541219,Other Accounting Services,,,,,,,N,Attorney,55682.00,2014-01-15,2014-06-30,,Public Relations Specialists,Level II,27-3031,Public Relations Specialists,OES,,P10013311777515,Year,A,,,N,N,2014-02-16,,,,,,,,,Y,,,2014-02-23,Y,2014-03-17,2014-02-14,N,New York Times,New York Times,Y,2014-02-18,2014-03-03,,,,,,,,2014-02-15,2014-02-21,N,2014-02-19,2014-02-19,Y,,,,N,,55700.00,,Year,,,
135279,,,,,,,A-13316-14072,2013-11-13,Certified-Expired,H-1B,INDIA,,2015-03-30,211 QUALITY CIRCLE,,COLLEGE STATION,UNITED STATES OF AMERICA,Immigration Specialist,COGNIZANT TECHNOLOGY SOLUTIONS US CORPORATION,29000.0,201-290-9573,,77845,TEXAS,1994,A,INDIA,SMYRNA,Bachelor's,,"VEER BAHADUR SINGH PURVANCHAL UNIVERSITY, JAUNPUR",ENGINEERING (COMPUTER),30080,A,Y,GEORGIA,A,N,2005,,,,,,,,,,N,,,N,Y,,,,N,,N,,N,,,N,Bachelor's,,Y,60.0,Y,N,Y,Computer Systems Analyst - V,"Comp Sci, Sci, Eng (any), Math or Bus",N,,,College Station,77845,TEXAS,,,541512,Computer Systems Design Services,,,,,,,Y,,72467.00,2013-07-23,2014-06-30,,Computer Systems Analysts,Level IV,15-1121,Computer Systems Analysts,OES,,P10013156444032,Year,A,,,N,N,2013-07-14,,,,,,,,,Y,,,2013-07-21,Y,2013-08-06,2013-07-02,N,The Eagle,The Eagle,Y,,,,,,,,2013-08-01,2013-08-07,2013-07-19,2013-07-28,N,,,Y,2013-07-23,2013-07-30,,N,,72467.00,,Year,,,
135281,,Enola,"Law Offices of Kendra S. Elliott, Esq.",PENNSYLVANIA,,,A-13317-14353,2013-11-13,Certified,H-1B,BRAZIL,,2015-05-28,550 CLINTON DRIVE,,GALENA PARK,UNITED STATES OF AMERICA,Human Resources Manager,"GREEN EARTH FUELS OF HOUSTON, LLC.",36.0,713-237-2800,,77547,TEXAS,2006,A,BRAZIL,HOUSTON,Bachelor's,,CENTRO UNIVERSITARIO DE CIDADE - UNIVERCIDADE,MARKETING,77049,Y,A,TEXAS,A,N,2005,,,,,,,,,,N,,,N,Y,,,,N,,Y,Marketing or Finance field,Y,"Any occupation involving relevant experience, ...",36.0,N,Bachelor's,,N,,Y,N,N,Business Operations Specialist,Marketing or Finance field,N,,,Galena Park,77547,TEXAS,,,324110,Petroleum Refineries,,,,,,,N,Attorney,53560.00,2013-06-25,2013-09-23,,Market Research Analysts and Marketing Special...,Level II,13-1161,Market Research Analysts and Marketing Special...,OES,,P10013128271591,Year,A,,,N,N,2013-08-25,,,,,,,,,Y,,,2013-09-01,Y,2013-09-19,2013-08-16,N,Houston Chronicle,Houston Chronicle,Y,,,,,,,,2013-08-29,2013-09-12,2013-08-21,2013-08-28,N,2013-08-29,2013-09-11,Y,,,,N,,53560.00,60000.00,Year,,,
135283,,,,,,,A-13316-14140,2013-11-13,Certified-Expired,H-1B,INDIA,,2015-03-30,211 QUALITY CIRCLE,,COLLEGE STATION,UNITED STATES OF AMERICA,Immigration Specialist,COGNIZANT TECHNOLOGY SOLUTIONS US CORPORATION,29000.0,201-290-9573,,77845,TEXAS,1994,A,INDIA,PLAINSBORO,Bachelor's,,NAGPUR UNIVERSITY,ENGINEERING (ELECTRONIC),08536,A,Y,NEW JERSEY,A,N,2002,,,,,,,,,,N,,,N,Y,,,,N,,N,,N,,,N,Bachelor's,,Y,60.0,Y,N,Y,Computer Systems Analyst - V,"Comp Sci, Sci, Eng (any), Math or Bus",N,,,College Station,77845,TEXAS,,,541512,Computer Systems Design Services,,,,,,,Y,,72467.00,2013-07-23,2014-06-30,,Computer Systems Analysts,Level IV,15-1121,Computer Systems Analysts,OES,,P10013156444032,Year,A,,,N,N,2013-07-14,,,,,,,,,Y,,,2013-07-21,Y,2013-08-06,2013-07-02,N,The Eagle,The Eagle,Y,,,,,,,,2013-08-01,2013-08-07,2013-07-19,2013-07-28,N,,,Y,2013-07-23,2013-07-30,,N,,72467.00,,Year,,,
135286,,Edison,"Law Office of Thomas V. Allen, PLLC",NEW JERSEY,,,A-13317-14420,2013-11-14,Certified,H-1B,INDIA,,2015-05-20,33507 9TH AVENUE SOUTH,BLDG. D,FEDERAL WAY,UNITED STATES OF AMERICA,GENERAL MANAGER- OPERATIONS & DELIVERY,APPLEXUS TECHNOLOGIES LLC,36.0,206-249-0903,,98003,WASHINGTON,2005,Y,INDIA,MONTVALE,Bachelor's,,MOHAMED SATHAK ENGINEERING COLLEGE (ANNA UNIVE...,ENGINEERING,07645,Y,A,NEW JERSEY,A,N,2005,,,,,,,,,,N,,,N,Y,,5.0,Bachelor's,Y,,N,,Y,JOB OFFERED OR RELATED OCCUPATION,60.0,N,Master's,,N,,Y,N,Y,SOFTWARE ENGINEER,"ENGINEERING, SCIENCE OR TECHNOLGOY",N,,,FEDERAL WAY,98003,WASHINGTON,,,541512,Computer Systems Design Services,,,,,,,N,ATTORNEY-AT-LAW,99424.00,2013-06-20,2013-09-18,,"SOFTWARE DEVELOPERS, APPLICATIONS",Level III,15-1132,"Software Developers, Applications",OES,,P10013123221771,Year,A,,,N,N,2013-06-23,,,,,,,,,Y,,,2013-06-30,Y,2013-07-25,2013-06-25,N,TACOMA NEWS TRIBUNE,TACOMA NEWS TRIBUNE,Y,,,,,,2013-06-24,2013-07-24,2013-06-24,2013-07-24,2013-06-21,2013-07-22,N,,,Y,,,,N,,99424.00,,Year,,,
135287,,South San Francisco,Litwin & Associates,CALIFORNIA,,,A-13316-14176,2014-06-30,Certified,H-1B,CHINA,,2015-09-11,1371 MCCARTHY BLVD.,,MILPITAS,UNITED STATES OF AMERICA,Chief Financial Officer,"ARRAY NETWORKS, INC.",50.0,4082408700,8745,95035,CALIFORNIA,2000,A,CHINA,SAN JOSE,Master's,,SYRACUSE UNIVERSITY,COMPUTER SCIENCE,95127,A,A,CALIFORNIA,A,N,2011,,,,,,,,,,N,,,N,Y,,,,N,,N,,N,,,N,Master's,,N,,Y,N,N,Software Engineer,"Computer Science, Computer Engg, Electrical En...",N,,,Milpitas,95035,CALIFORNIA,,,334119,Other Computer Peripheral Equipment Manufacturing,,,,,,,N,Attorney,98675.00,2014-06-10,2014-09-08,,"Software Developers, Applications",Level II,15-1132,"Software Developers, Applications",OES,,P10014122178613,Year,A,,,N,N,2014-05-18,,,,,,,,,Y,,,2014-05-25,Y,2014-03-06,2014-01-31,N,San Jose Mercury News,San Jose Mercury News,Y,,,,,,2014-02-11,2014-02-27,2014-02-27,2014-03-13,2014-01-31,2014-02-24,N,,,Y,,,,N,,98675.00,,Year,,,
135289,,Dallas,Berry Appleman & Leiden LLP,TEXAS,,,A-13317-14346,2013-11-18,Certified,H-1B,INDIA,,2015-05-26,1900 N. AKARD ST.,,DALLAS,UNITED STATES OF AMERICA,International Human Resources Representative,"HUNT CONSOLIDATED, INC.",600.0,214-978-8107,,75201,TEXAS,1934,A,INDIA,DALLAS,Master's,,THE UNIVERSITY OF TEXAS AT DALLAS,COMPUTER SCIENCE,75206,Y,A,TEXAS,A,N,2011,,,,,,,,,,N,,,N,Y,,,,N,,Y,"Computer Science, Information Technology or re...",Y,See Item H-14 below,24.0,N,Master's,,N,,Y,N,N,Financial Information Analyst,"Computer Science, Information Technology or re...",N,,,Dallas,75201,TEXAS,,,211111,Crude Petroleum and Natural Gas Extraction,,,,,,,N,Attorney,69600.00,2013-08-21,2014-06-30,,Computer Systems Analysts,Level II,15-1121,Computer Systems Analysts,Other,Towers Watson Data Services Survey 2012,P10013154819935,Year,A,,,N,N,2013-06-09,,,,,,,,,Y,,,2013-06-16,Y,2013-07-08,2013-06-07,N,The Dallas Morning News,The Dallas Morning News,Y,,,,,,,,2013-09-04,2013-09-25,2013-06-09,2013-06-20,N,2013-06-21,2013-06-27,Y,,,,N,,69600.00,,Year,,,


### 1.3 Prepare a table with descriptive statistics for all the continuous features.

In [12]:
# Descriptive stats for categorical features only.
df_visaType.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
employer_num_employees,174329.0,26297.591634,637945.970636,0.0,117.0,1906.0,21000.0,263550614.0
job_info_alt_cmb_ed_oth_yrs,57547.0,4.563626,4.743169,0.0,3.0,5.0,5.0,96.0
job_info_alt_occ_num_months,119629.0,36.472034,22.982293,0.0,24.0,24.0,60.0,240.0
job_info_experience_num_months,98878.0,34.053551,22.763112,0.0,12.0,24.0,60.0,240.0
job_info_training_num_months,3726.0,36.391573,18.620979,0.0,36.0,36.0,36.0,144.0
pw_amount_9089,86756.0,90562.874066,33836.913976,8.0,72467.0,89150.0,106101.0,5067600.0
wage_offer_from_9089,86685.0,99404.989114,39090.83652,0.0,76378.0,95000.0,115000.0,1200000.0
wage_offer_to_9089,25361.0,129217.117623,43811.173657,0.0,104614.0,128000.0,150000.0,1158581.0
wage_offered_from_9089,92391.0,99152.179917,139258.787663,9.75,76947.0,92789.0,111508.0,16290600.0
wage_offered_to_9089,26196.0,124583.074387,93513.550863,0.0,98900.0,120600.0,145000.0,13285000.0


### 1.4 Prepare a table with descriptive statistics for all the categorical features.

In [13]:
df_categorical = df_visaType.select_dtypes(include=['object'])
# Descriptive stats for categorical features only.
df_categorical.describe().T

Unnamed: 0,count,unique,top,freq
add_these_pw_job_title_9089,34078,2939,"Software Developers, Applications",8622
agent_city,155308,1380,New York,16094
agent_firm_name,152318,7151,"Fragomen, Del Rey, Bernsen & Loewy, LLP",23197
agent_state,153067,106,CA,23669
application_type,108682,3,ONLINE,91846
case_no,108682,108448,A-13046-41062,2
case_number,174336,174037,A-15266-21000,2
case_received_date,174334,2002,2014-06-30,515
case_status,283018,4,Certified,139312
class_of_admission,283018,1,H-1B,283018


In [14]:
# It was obeserved in the above descriptive statistics that a number of columns had just two unique values either Y / N.
# It was agreed that these features were boolean so we decided to change the data type of these features to bool. 
for col in df_visaType:
    if df_visaType[col].nunique() == 2:
        df_visaType[col] = df_visaType[col].astype('bool')
        print(col, 'is now boolean')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


foreign_worker_ownership_interest is now boolean
fw_ownership_interest is now boolean
ji_foreign_worker_live_on_premises is now boolean
ji_fw_live_on_premises is now boolean
ji_live_in_domestic_service is now boolean
ji_offered_to_sec_j_foreign_worker is now boolean
ji_offered_to_sec_j_fw is now boolean
job_info_alt_combo_ed_exp is now boolean
job_info_alt_field is now boolean
job_info_combo_occupation is now boolean
job_info_experience is now boolean
job_info_foreign_ed is now boolean
job_info_foreign_lang_req is now boolean
job_info_job_req_normal is now boolean
job_info_training is now boolean
preparer_info_emp_completed is now boolean
recr_info_coll_teach_comp_proc is now boolean
recr_info_coll_univ_teacher is now boolean
recr_info_employer_rec_payment is now boolean
recr_info_professional_occ is now boolean
recr_info_sunday_newspaper is now boolean
refile is now boolean
ri_2nd_ad_newspaper_or_journal is now boolean
ri_coll_tch_basic_process is now boolean
ri_layoff_in_past_six_mon

### 1.5 Drop constant columns, if any.

In [15]:
# Dropping this feature as it has cardinality of 1
for column in df_visaType.columns:
        if len(df_visaType[column].unique()) == 1:
            df_visaType = df_visaType.drop(column,axis=1)
df_visaType.shape

(283018, 128)

## 2. Prepare a data quality plan for the cleaned CSV file.

### 2.1 Mark down all the features where there are potential problems or data quality issues.

### 2.1.1 Missing values

In [16]:
#Showing True or False for the columns that have less than 50% entries
df_visaType.count()<140000

add_these_pw_job_title_9089                True
agent_city                                False
agent_firm_name                           False
agent_state                               False
application_type                           True
case_no                                    True
case_number                               False
case_received_date                        False
case_status                               False
country_of_citizenship                    False
country_of_citzenship                      True
decision_date                             False
employer_address_1                        False
employer_address_2                        False
employer_city                             False
employer_country                          False
employer_decl_info_title                  False
employer_name                             False
employer_num_employees                    False
employer_phone                            False
employer_phone_ext                      

In [17]:
# Show the number of rows that have less than 50% values recorded
sum(df_visaType.count()<140000)

77

### Discussion:
As we can see above, there are **77** features that have less than 50% values recorded. Therefore we have decided to drop these features. **country_of_citzenship** is a feature that falls into this category. 

### 2.1.2 Irregular cardinality

- In the above descriptive statistics there were a number of features that were cause for alarm in regards to their cardinality. Although we expect to see high cardinalities for features such as dates, company names and addresses etc. We did not expect to see a cardinality above 50 for features such as **employer_state, agent_state** and **job_info_work_state**. 

- As all applications are to work in the US, a cardinality of 4 for **employer_country** was flagged as a possible issue in the dataset. 

- Along with the above features there were a number of features that we wanted to investigate as their cardinality was flagged as irregular. These include **employer_yr_estab, fw_info_alt_edu_experience, fw_info_education_other, fw_info_rel_occup_exp, fw_info_req_experience, fw_info_training_comp, fw_info_yr_rel_edu_completed, job_info_education, pw_unit_of_pay_9089, recr_info_barg_req_notified, ri_posted_notice_at_worksite** and **wage_offered_unit_of_pay_9089**

- In addition to this we wanted our target feature to contain only two values, certified or denied. We noted that the cardinality for this feature (**case_status**) is currently 4.

- Although this is only a subset of the features that we flagged for irregular cardinality, many of the features not listed here had less than 50% values recorded for that feature and we plan to drop these. 

In [18]:
# Check for irregular cardinality in categorical features. There could be same values spelled differently
print("Unique values for:\n- case_status:", pd.unique(df_visaType.case_status.ravel()))
print("\n- agent_state:", pd.unique(df_visaType.agent_state.ravel()))
print("\n- employer_state:", pd.unique(df_visaType.employer_state.ravel()))
print("\n- job_info_work_state:", pd.unique(df_visaType.job_info_work_state.ravel()))
print("\n- employer_country:", pd.unique(df_visaType.employer_country.ravel()))
print("\n- employer_yr_estab:", pd.unique(df_visaType.employer_yr_estab.ravel()))
print("\n- fw_info_alt_edu_experience:", pd.unique(df_visaType.fw_info_alt_edu_experience.ravel()))
print("\n- fw_info_education_other:", pd.unique(df_visaType.fw_info_education_other.ravel()))
print("\n- fw_info_rel_occup_exp:", pd.unique(df_visaType.fw_info_rel_occup_exp.ravel()))
print("\n- fw_info_req_experience:", pd.unique(df_visaType.fw_info_req_experience.ravel()))
print("\n- fw_info_training_comp:", pd.unique(df_visaType.fw_info_training_comp.ravel()))
print("\n- fw_info_yr_rel_edu_completed:", pd.unique(df_visaType.fw_info_yr_rel_edu_completed.ravel()))
print("\n- job_info_education:", pd.unique(df_visaType.job_info_education.ravel()))
print("\n- pw_unit_of_pay_9089:", pd.unique(df_visaType.pw_unit_of_pay_9089.ravel()))
print("\n- recr_info_barg_rep_notified:", pd.unique(df_visaType.recr_info_barg_rep_notified.ravel()))
print("\n- ri_posted_notice_at_worksite:", pd.unique(df_visaType.ri_posted_notice_at_worksite.ravel()))
print("\n- wage_offer_unit_of_pay_9089:", pd.unique(df_visaType.wage_offered_unit_of_pay_9089.ravel()))

Unique values for:
- case_status: ['Certified' 'Certified-Expired' 'Withdrawn' 'Denied']

- agent_state: [nan 'WISCONSIN' 'COLORADO' 'NEW YORK' 'PENNSYLVANIA' 'NEW JERSEY'
 'CALIFORNIA' 'TEXAS' 'MISSOURI' 'FLORIDA' 'VIRGINIA' 'NORTH CAROLINA'
 'OHIO' 'MICHIGAN' 'DISTRICT OF COLUMBIA' 'INDIANA' 'GEORGIA'
 'MASSACHUSETTS' 'ILLINOIS' 'OKLAHOMA' 'WASHINGTON' 'MINNESOTA' 'HAWAII'
 'CONNECTICUT' 'MARYLAND' 'UTAH' 'PUERTO RICO' 'ARIZONA' 'NEW HAMPSHIRE'
 'KANSAS' 'NEW MEXICO' 'VERMONT' 'TENNESSEE' 'NEBRASKA' 'OREGON'
 'ARKANSAS' 'LOUISIANA' 'KENTUCKY' 'IOWA' 'SOUTH CAROLINA' 'RHODE ISLAND'
 'GUAM' 'ALABAMA' 'MAINE' 'IDAHO' 'NORTH DAKOTA' 'ALASKA' 'NEVADA'
 'MISSISSIPPI' 'DELAWARE' 'SOUTH DAKOTA' 'MONTANA' 'WEST VIRGINIA' 'TX'
 'CA' 'MA' 'DC' 'IL' 'NC' 'NY' 'VA' 'MD' 'LA' 'NJ' 'GA' 'PA' 'MI' 'MO'
 'WI' 'OK' 'AZ' 'NE' 'MN' 'FL' 'OH' 'UT' 'SC' 'AL' 'WA' 'IN' 'NH' 'CO'
 'OR' 'VT' 'CT' 'TN' 'RI' 'HI' 'MS' 'IA' 'KS' 'KY' 'GU' 'ME' 'PR' 'NM'
 'SD' 'AR' 'MP' 'ND' 'MT' 'WV' 'AK' 'ID' 'NV' 'DE' 'VI']



As suspected in some cases **agent_state, employer_state, job_info_state, fw_info_yr_rel_edu_completed, pw_unit_of_pay_9089 and case_status** are all suffering from inconsistent labelling which is resulting in the irregular cardinality of these features.

### 2.1.3 Outliers

On reviewing the descriptive statistics for the continuous features above, it was noted that all features were subjected to outliers, some more heavily than others. Although huge outliers can be seen for **employer_num_employees, pw_amount_9089, wage_offer_from_9089, wage_offer_to_9089, wage_offered_from_9089** and **wage_offered_to_9089**, we have decided to leave these features as they are as very high or very low wages offered may affect the applicantion outcome. The following features that suffer from minor outliers can be understood as different criteria to be met for the various jobs being applied for (**job_info_alt_cmb_ed_oth_yrs, job_info_alt_occ_num_months, job_info_experience_num_months, job_info_training_num_months**).


### 2.1.4 Inconsistent labelling

### 2.2 Propose solutions to deal with the problems identified. Explain why did you choose one solution over potentially many other.

#### Data Quality Plan
| Feature                 | Data Quality Issue   | Handling Strategy                     |
|-------------------------|----------------------|--------------------------------       |
|add_these_pw_job_title_9089| MissingValue(more than 50%)   | Drop feature               |
|application_type|MissingValue(more than 50%)   | Drop feature               |
|case_no|MissingValue(more than 50%)   | Drop feature               |
|country_of_citzenship|MissingValue(more than 50%)   | Drop feature               |
|employer_phone_ext|MissingValue(more than 50%)   | Drop feature               |
|foreign_worker_info_alt_edu_experience|MissingValue(more than 50%)   | Drop feature               |
|foreign_worker_info_birth_country|MissingValue(more than 50%)   | Drop feature               |
|foreign_worker_info_education_other|MissingValue(more than 50%)   | Drop feature               |
|foreign_worker_info_postal_code|MissingValue(more than 50%)   | Drop feature               |
|foreign_worker_info_rel_occup_exp|MissingValue(more than 50%)   | Drop feature               |
|foreign_worker_info_req_experience|MissingValue(more than 50%)   | Drop feature               |
|foreign_worker_info_training_comp|MissingValue(more than 50%)   | Drop feature               |
|foreign_worker_yr_rel_edu_completed|MissingValue(more than 50%)   | Drop feature               |
|fw_info_alt_edu_experience|MissingValue(more than 50%)   | Drop feature               |
|fw_info_birth_country|MissingValue(more than 50%)   | Drop feature               |
|fw_info_education_other|MissingValue(more than 50%)   | Drop feature               |
|fw_info_postal_code|MissingValue(more than 50%)   | Drop feature               |
|fw_info_rel_occup_exp|MissingValue(more than 50%)   | Drop feature               |
|fw_info_req_experience|MissingValue(more than 50%)   | Drop feature               |
|fw_info_training_comp|MissingValue(more than 50%)   | Drop feature               |
|fw_info_yr_rel_edu_completed|MissingValue(more than 50%)   | Drop feature               |
|ji_live_in_dom_svc_contract|MissingValue(more than 50%)   | Drop feature               |
|job_info_alt_cmb_ed_oth_yrs|MissingValue(more than 50%)   | Drop feature               |
|job_info_alt_combo_ed|MissingValue(more than 50%)   | Drop feature               |
|job_info_alt_combo_ed_other|MissingValue(more than 50%)   | Drop feature               |
|job_info_alt_field_name|MissingValue(more than 50%)   | Drop feature               |
|job_info_alt_occ|MissingValue(more than 50%)   | Drop feature               |
|job_info_alt_occ_job_title|MissingValue(more than 50%)   | Drop feature               |
|job_info_alt_occ_num_months|MissingValue(more than 50%)   | Drop feature               |
|job_info_education_other|MissingValue(more than 50%)   | Drop feature               |
|job_info_experience_num_months|MissingValue(more than 50%)   | Drop feature               |
|job_info_training_field|MissingValue(more than 50%)   | Drop feature               |
|job_info_training_num_months|MissingValue(more than 50%)   | Drop feature               |
|naics_2007_us_code|MissingValue(more than 50%)   | Drop feature               |
|naics_2007_us_title|MissingValue(more than 50%)   | Drop feature               |
|naics_code|MissingValue(more than 50%)   | Drop feature               |
|naics_title|MissingValue(more than 50%)   | Drop feature               |
|naics_us_code|MissingValue(more than 50%)   | Drop feature               |
|naics_us_code_2007|MissingValue(more than 50%)   | Drop feature               |
|naics_us_title|MissingValue(more than 50%)   | Drop feature               |
|naics_us_title_2007|MissingValue(more than 50%)   | Drop feature               |
|orig_case_no|MissingValue(more than 50%)   | Drop feature               |
|orig_file_date|MissingValue(more than 50%)   | Drop feature               |
|pw_amount_9089|MissingValue(more than 50%)   | Drop feature               |
|pw_job_title_908|MissingValue(more than 50%)   | Drop feature               |
|pw_source_name_other_9089|MissingValue(more than 50%)   | Drop feature               |
|rec_info_barg_rep_notified|MissingValue(more than 50%)   | Drop feature               |
|recr_info_barg_rep_notified|MissingValue(more than 50%)   | Drop feature               |
|recr_info_job_fair_from|MissingValue(more than 50%)   | Drop feature               |
|recr_info_job_fair_to|MissingValue(more than 50%)   | Drop feature               |
|recr_info_on_campus_recr_from|MissingValue(more than 50%)   | Drop feature               |
|recr_info_on_campus_recr_to|MissingValue(more than 50%)   | Drop feature               |
|recr_info_pro_org_advert_from|MissingValue(more than 50%)   | Drop feature               |
|recr_info_pro_org_advert_to|MissingValue(more than 50%)   | Drop feature               |
|recr_info_prof_org_advert_from|MissingValue(more than 50%)   | Drop feature               |
|recr_info_prof_org_advert_to|MissingValue(more than 50%)   | Drop feature               |
|recr_info_radio_tv_ad_from|MissingValue(more than 50%)   | Drop feature               |
|recr_info_radio_tv_ad_to|MissingValue(more than 50%)   | Drop feature               |
|ri_campus_placement_from|MissingValue(more than 50%)   | Drop feature               |
|ri_campus_placement_to|MissingValue(more than 50%)   | Drop feature               |
|ri_coll_teach_pro_jnl|MissingValue(more than 50%)   | Drop feature               |
|ri_coll_teach_select_date|MissingValue(more than 50%)   | Drop feature               |
|ri_employee_referral_prog_from|MissingValue(more than 50%)   | Drop feature               |
|ri_employee_referral_prog_to|MissingValue(more than 50%)   | Drop feature               |
|ri_employer_web_post_from|MissingValue(more than 50%)   | Drop feature               |
|ri_employer_web_post_to|MissingValue(more than 50%)   | Drop feature               |
|ri_local_ethnic_paper_from|MissingValue(more than 50%)   | Drop feature               |
|ri_local_ethnic_paper_to|MissingValue(more than 50%)   | Drop feature               |
|ri_pvt_employment_firm_from|MissingValue(more than 50%)   | Drop feature               |
|ri_pvt_employment_firm_to|MissingValue(more than 50%)   | Drop feature               |
|ri_us_workers_considered|MissingValue(more than 50%)   | Drop feature               |
|us_economic_sector|MissingValue(more than 50%)   | Drop feature               |
|wage_offer_from_9089|MissingValue(more than 50%)   | Drop feature               |
|wage_offer_to_9089|MissingValue(more than 50%)   | Drop feature               |
|wage_offered_from_9089|MissingValue(more than 50%)   | Drop feature               |
|wage_offered_to_9089|MissingValue(more than 50%)   | Drop feature               |
|wage_offer_unit_of_pay_9089|MissingValue(more than 50%)   | Drop feature               |
|agent_state|Irregular cardinality | Re-label values|
|employer_state|Irregular cardinality | Re-label values                       |
|job_info_state|Irregular cardinality | Re-label values                       |
|pw_unit_of_pay_9089 |Irregular cardinality | Re-label values                       |
|case_status|Irregular cardinality | Re-label values                       |
|wage_offered_unit_of_pay_9089| Irregular cardinality | Re-label values
|employer_yr_estab|Irregular cardinality | Complete Case Analysis|
|recr_info_barg_rep_notified|Irregular cardinality | Complete Case Analysis|
|ri_posted_notice_at_worksite|Irregular cardinality | Complete Case Analysis|
|fw_info_training_comp|Irregular cardinality | Complete Case Analysis|
|fw_info_req_experience|Irregular cardinality | Complete Case Analysis|
|fw_info_rel_occup_exp|Irregular cardinality | Complete Case Analysis|
|fw_info_alt_edu_experience|Irregular cardinality | Complete Case Analysis|
|employer_num_employees|Outliers              | Do nothing                            |
|pw_amount_9089|Outliers              | Do nothing                            |
|wage_offer_from_9089|Outliers              | Do nothing                            |
|wage_offer_to_9089|Outliers              | Do nothing                            |
|wage_offered_from_9089|Outliers              | Do nothing                            |
|wage_offered_to_9089|Outliers              | Do nothing                            |
|job_info_alt_cmb_ed_oth_yrs|Outliers              | Do nothing                            |
|job_info_alt_occ_num_months|Outliers              | Do nothing                            |
|job_info_experience_num_months|Outliers              | Do nothing                            |
|job_info_training_num_months|Outliers              | Do nothing                            |

### 2.3 Apply your solutions to obtain a new CSV file where the identified data quality issues were addressed. 

In [19]:
# Find columns that have less than half of the values filled in and remove them from the dataset.
for col in df_visaType:
    if df_visaType[col].count() < 100000:
        print(col, df_visaType[col].count())
        df_visaType = df_visaType.drop([col], axis = 1)

add_these_pw_job_title_9089 34078
country_of_citzenship 16288
employer_phone_ext 18612
foreign_worker_info_alt_edu_experience 67579
foreign_worker_info_birth_country 67578
foreign_worker_info_education_other 3071
foreign_worker_info_postal_code 67130
foreign_worker_info_rel_occup_exp 67578
foreign_worker_info_req_experience 67579
foreign_worker_info_training_comp 67579
foreign_worker_yr_rel_edu_completed 67112
ji_live_in_dom_svc_contract 594
job_info_alt_cmb_ed_oth_yrs 57547
job_info_alt_combo_ed 57393
job_info_alt_combo_ed_other 7057
job_info_alt_field_name 77179
job_info_education_other 6317
job_info_experience_num_months 98878
job_info_training_field 3436
job_info_training_num_months 3726
naics_2007_us_code 16282
naics_2007_us_title 15757
naics_code 67540
naics_title 67540
naics_us_code_2007 92345
naics_us_title_2007 90033
orig_case_no 147
orig_file_date 145
pw_amount_9089 86756
pw_source_name_other_9089 12425
rec_info_barg_rep_notified 67535
recr_info_job_fair_from 1658
recr_info_j

In [20]:
# Check the removal of the columns worked
df_visaType.shape[1]

72

In [21]:
# Relabel values
replace_agent_state_values = {'MA':'MASSACHUSETTS', 'DC':'DISTRICT OF COLUMBIA', 'IL':'ILLINOIS', 'NC':'NORTH CAROLINA', 'NY':'NEW YORK', 'VA': 'VIRGINIA', 'MD': 'MARYLAND',  'LA':'LOUISIANA',  'NJ': 'NEW JERSEY', 'GA':'GEORGIA', 'MI':'MICHIGAN', 'MO':'MISSOURI', 'WI':'WISCONSIN', 'OK':'OKLAHOMA','AZ':'ARIZONA','NE': 'NEBRASKA', 'MN':'MINNESOTA', 'FL': 'FLORIDA', 'OH': 'OHIO', 'UT':'UTAH', 'SC':'SOUTH CAROLINA', 'AL': 'ALABAMA', 'WA': 'WASHINGTON',  'IN':'INDIANA', 'NH': 'NEW HAMPSHIRE', 'CO': 'COLORADO', 'OR':'OREGON','VT':'VERMONT','CT':'CONNECTICUT','TN': 'TENNESSEE','RI':'RHODE ISLAND','HI': 'HAWAII','MS':'MISSISSIPPI','IA':'IOWA','KS': 'KANSAS','KY':'KENTUCKY','ME':'MAINE','NM':'NEW MEXICO','SD':'SOUTH DAKOTA','AR':'ARKANSAS','ND':'NORTH DAKOTA','MT':'MONTANA','WV':'WEST VIRGINIA','AK':'ALASKA','ID':'IDAHO','NV':'NEVADA','DE': 'DELAWARE', 'VI':'VIRGIN ISLANDS', 'TX':'TEXAS','CA':'CALIFORNIA','PA':'PENNSYLVANIA','GU':'GUAM','PR':'PUERTO RICO', 'MP':'NORTHERN MARIANA ISLANDS'}
replace_employer_state_values = {'VA':'VIRGINIA', 'NY': 'NEW YORK', 'DE': 'DELAWARE', 'MD': 'MARYLAND', 'NJ': 'NEW JERSEY', 'GA':'GEORGIA', 'TX':'TEXAS', 'KY':'KENTUCKY', 'IL':'ILLINOIS', 'MS':'MISSISSIPPI' ,'MA':'MASSACHUSETTS', 'CA':'CALIFORNIA','NC':'NORTH CAROLINA', 'MO':'MISSOURI', 'WI':'WISCONSIN','CO': 'COLORADO', 'OH': 'OHIO', 'WA': 'WASHINGTON', 'AL': 'ALABAMA', 'FL': 'FLORIDA', 'OK':'OKLAHOMA', 'WY':'WYOMING', 'PA':'PENNSYLVANIA', 'RI':'RHODE ISLAND', 'DC':'DISTRICT OF COLUMBIA', 'NV':'NEVADA', 'CT':'CONNECTICUT', 'MN':'MINNESOTA', 'MI':'MICHIGAN', 'IA':'IOWA', 'NH': 'NEW HAMPSHIRE', 'NE': 'NEBRASKA', 'KS': 'KANSAS', 'TN': 'TENNESSEE' ,'OR':'OREGON', 'AR':'ARKANSAS', 'AZ':'ARIZONA','LA':'LOUISIANA', 'IN':'INDIANA', 'ND':'NORTH DAKOTA', 'SC':'SOUTH CAROLINA', 'UT':'UTAH', 'ID':'IDAHO', 'HI': 'HAWAII', 'VT':'VERMONT', 'ME':'MAINE', 'NM':'NEW MEXICO', 'WV':'WEST VIRGINIA', 'SD':'SOUTH DAKOTA', 'AK':'ALASKA', 'MT':'MONTANA', 'GU':'GUAM', 'BC': 'BRITISH COLUMBIA', 'MP':'NORTHERN MARIANA ISLANDS','PR':'PUERTO RICO', 'VI':'VIRGIN ISLANDS'}
replace_job_info_state_values = {'MD': 'MARYLAND',  'NY':'NEW YORK', 'TX':'TEXAS', 'NJ':'NEW JERSEY', 'GA':'GEORGIA', 'KY':'KENTUCKY', 'IL':'ILLINOIS', 'MS':'MISSISSIPPI', 'MA':'MASSACHUSETTS', 'ID':'IDAHO', 'NC':'NORTH CAROLINA', 'CA':'CALIFORNIA', 'MO':'MISSOURI', 'WI':'WISCONSIN', 'CO': 'COLORADO', 'OH': 'OHIO',  'WA': 'WASHINGTON', 'AZ':'ARIZONA', 'AL': 'ALABAMA', 'FL':'FLORIDA', 'OR':'OREGON', 'OK':'OKLAHOMA', 'WY':'WYOMING', 'PA':'PENNSYLVANIA',  'DC':'DISTRICT OF COLUMBIA', 'VA': 'VIRGINIA', 'NV':'NEVADA', 'LA':'LOUISIANA', 'CT':'CONNECTICUT', 'MN':'MINNESOTA', 'MI':'MICHIGAN', 'IA':'IOWA', 'NH':'NEW HAMPSHIRE', 'NE': 'NEBRASKA', 'KS': 'KANSAS', 'TN': 'TENNESSEE', 'AR':'ARKANSAS', 'DE': 'DELAWARE', 'IN':'INDIANA', 'SC':'SOUTH CAROLINA', 'RI': 'RHODE ISLAND', 'UT': 'UTAH', 'ND':'NORTH DAKOTA', 'HI': 'HAWAII', 'WV':'WEST VIRGINIA', 'NM':'NEW MEXICO', 'SD':'SOUTH DAKOTA', 'AK':'ALASKA', 'MT':'MONTANA', 'VT': 'VERMONT', 'ME':'MAINE','PR':'PUERTO RICO', 'VI':'VIRGIN ISLANDS','GU':'GUAM', 'MP':'NORTHERN MARIANA ISLANDS'}
replace_pw_unit_of_pay_9089_values = {'yr': 'Year', 'hr':'Hour', 'mth': 'Month', 'wk':'Week','bi':'Bi-Weekly'}
replace_wage_offer_unit_of_pay_9089_values = {'yr': 'Year', 'hr':'Hour', 'mth': 'Month', 'wk':'Week','bi':'Bi-Weekly'}
replace_case_status_values = {'Certified-Expired': 'Certified'}
df_visaType = df_visaType.replace({'agent_state': replace_agent_state_values})
df_visaType = df_visaType.replace({'employer_state': replace_employer_state_values})
df_visaType = df_visaType.replace({'job_info_work_state': replace_job_info_state_values})
df_visaType = df_visaType.replace({'pw_unit_of_pay_9089': replace_pw_unit_of_pay_9089_values})
df_visaType = df_visaType.replace({'wage_offer_unit_of_pay_9089': replace_wage_offer_unit_of_pay_9089_values})
df_visaType = df_visaType.replace({'case_status': replace_case_status_values})

In [22]:
# Checking the relabeling worked
print("Unique values for:\n- case_status:", pd.unique(df_visaType.case_status.ravel()))
print("\n- agent_state:", pd.unique(df_visaType.agent_state.ravel()))
print("\n- employer_state:", pd.unique(df_visaType.employer_state.ravel()))
print("\n- job_info_work_state:", pd.unique(df_visaType.job_info_work_state.ravel()))
print("\n- pw_unit_of_pay_9089:", pd.unique(df_visaType.pw_unit_of_pay_9089.ravel()))
print("\n- wage_offer_unit_of_pay_9089:", pd.unique(df_visaType.pw_unit_of_pay_9089.ravel()))

Unique values for:
- case_status: ['Certified' 'Withdrawn' 'Denied']

- agent_state: [nan 'WISCONSIN' 'COLORADO' 'NEW YORK' 'PENNSYLVANIA' 'NEW JERSEY'
 'CALIFORNIA' 'TEXAS' 'MISSOURI' 'FLORIDA' 'VIRGINIA' 'NORTH CAROLINA'
 'OHIO' 'MICHIGAN' 'DISTRICT OF COLUMBIA' 'INDIANA' 'GEORGIA'
 'MASSACHUSETTS' 'ILLINOIS' 'OKLAHOMA' 'WASHINGTON' 'MINNESOTA' 'HAWAII'
 'CONNECTICUT' 'MARYLAND' 'UTAH' 'PUERTO RICO' 'ARIZONA' 'NEW HAMPSHIRE'
 'KANSAS' 'NEW MEXICO' 'VERMONT' 'TENNESSEE' 'NEBRASKA' 'OREGON'
 'ARKANSAS' 'LOUISIANA' 'KENTUCKY' 'IOWA' 'SOUTH CAROLINA' 'RHODE ISLAND'
 'GUAM' 'ALABAMA' 'MAINE' 'IDAHO' 'NORTH DAKOTA' 'ALASKA' 'NEVADA'
 'MISSISSIPPI' 'DELAWARE' 'SOUTH DAKOTA' 'MONTANA' 'WEST VIRGINIA'
 'NORTHERN MARIANA ISLANDS' 'VIRGIN ISLANDS']

- employer_state: ['VIRGINIA' 'NEW YORK' 'DELAWARE' 'MARYLAND' 'NEW JERSEY' 'GEORGIA'
 'TEXAS' 'KENTUCKY' 'ILLINOIS' 'MISSISSIPPI' 'MASSACHUSETTS' 'CALIFORNIA'
 'NORTH CAROLINA' 'MISSOURI' 'WISCONSIN' 'COLORADO' 'OHIO' 'WASHINGTON'
 'ALABAMA' 'FLORI

In [23]:
# Only select case status that are Certified or Denied
df_visaType = df_visaType.loc[(df.case_status =="Certified") | (df_visaType.case_status =="Denied")]

In [24]:
# Analyse the values in fw_info_yr_rel_edu_completed
rowsToAnalyse = []
for row in df_visaType['fw_info_yr_rel_edu_completed'].iteritems():
    if row[1]==0.0 or row[1]==2.0  or row[1]==8.0 or row[1]== 4.0 or row[1]==9.0 or row[1]==6.0 or row[1]==212.0 or row[1]==12.0 or row[1]==200.0 or row[1]==3.0 or row[1]==2102.0 or row[1]==5.0 or row[1]==7.0 or row[1]==16.0 or row[1]==208.0 or row[1]==1900.0 or row[1]==10.0 or row[1]==2207.0 or row[1]==2021.0 or row[1]==11.0 or row[1]==1.0 :
        rowsToAnalyse.append(row[0])
for i in range(0, len(rowsToAnalyse)):
    index = rowsToAnalyse[i]
    print(df_visaType.loc[[index]])

       agent_city agent_firm_name agent_state application_type case_no  \
239270        NaN             NaN         NaN              NaN     NaN   

          case_number case_received_date case_status country_of_citizenship  \
239270  A-15105-67422         2015-04-24      Denied                 CANADA   

       decision_date employer_address_1 employer_address_2 employer_city  \
239270    2015-11-24   1205 N. 2nd Ave.               None    Siler City   

                employer_country employer_decl_info_title employer_name  \
239270  UNITED STATES OF AMERICA                      CEO   WebenergyNC   

        employer_num_employees employer_phone employer_postal_code  \
239270                     5.0     9197999076                27344   

        employer_state employer_yr_estab foreign_worker_info_city  \
239270  NORTH CAROLINA              2014                  HALIFAX   

       foreign_worker_info_education  \
239270                         Other   

                           

       agent_city     agent_firm_name    agent_state application_type case_no  \
291383     Boston  Chin & Curtis, LLP  MASSACHUSETTS              NaN     NaN   

          case_number case_received_date case_status country_of_citizenship  \
291383  A-15312-37069         2015-12-15   Certified                 ISRAEL   

       decision_date employer_address_1 employer_address_2 employer_city  \
291383    2016-04-21       150 BROADWAY               None     CAMBRIDGE   

                employer_country       employer_decl_info_title  \
291383  UNITED STATES OF AMERICA  Global Mobility Specialist II   

                    employer_name  employer_num_employees employer_phone  \
291383  AKAMAI TECHNOLOGIES, INC.                  3400.0   617-444-3000   

       employer_postal_code employer_state employer_yr_estab  \
291383                02142  MASSACHUSETTS              1998   

       foreign_worker_info_city foreign_worker_info_education  \
291383                SAN DIEGO            

           agent_city     agent_firm_name agent_state application_type  \
299662  San Francisco  Reinhorn Law, Inc.  CALIFORNIA              NaN   

       case_no    case_number case_received_date case_status  \
299662     NaN  A-15356-53967         2016-01-08   Certified   

       country_of_citizenship decision_date employer_address_1  \
299662               SLOVENIA    2016-05-12  548 Market #23008   

       employer_address_2  employer_city          employer_country  \
299662               None  San Francisco  UNITED STATES OF AMERICA   

       employer_decl_info_title employer_name  employer_num_employees  \
299662       PEOPLE OPS PARTNER      Coinbase                    77.0   

       employer_phone employer_postal_code employer_state employer_yr_estab  \
299662     8583547043                94109     CALIFORNIA              2012   

       foreign_worker_info_city foreign_worker_info_education  \
299662            SAN FRANCISCO                         Other   

           

       agent_city agent_firm_name agent_state application_type case_no  \
305415        NaN             NaN         NaN              NaN     NaN   

          case_number case_received_date case_status country_of_citizenship  \
305415  A-15110-68351         2015-04-20      Denied                HUNGARY   

       decision_date employer_address_1 employer_address_2 employer_city  \
305415    2016-05-26    1036 1st Street           Suite A3        Humble   

                employer_country employer_decl_info_title  employer_name  \
305415  UNITED STATES OF AMERICA                President  Radarview LLC   

        employer_num_employees employer_phone employer_postal_code  \
305415                    22.0   281-446-7363                77338   

       employer_state employer_yr_estab foreign_worker_info_city  \
305415          TEXAS              2004            FORT MCMURRAY   

       foreign_worker_info_education foreign_worker_info_inst  \
305415                      Master's    UNI

         agent_city                          agent_firm_name agent_state  \
317072  Lake Oswego  McClellan Immigration Law Offices, P.C.      OREGON   

       application_type case_no    case_number case_received_date case_status  \
317072              NaN     NaN  A-16081-87109         2016-03-25   Certified   

       country_of_citizenship decision_date employer_address_1  \
317072                DENMARK    2016-06-27    26 STOKES DRIVE   

       employer_address_2 employer_city          employer_country  \
317072               None   MOUND HOUSE  UNITED STATES OF AMERICA   

       employer_decl_info_title             employer_name  \
317072            President/CEO  VINEBURG MACHINING, INC.   

        employer_num_employees employer_phone employer_postal_code  \
317072                    33.0   775-246-4336                89706   

       employer_state employer_yr_estab foreign_worker_info_city  \
317072         NEVADA              1977            WASHOE VALLEY   

       fore

       agent_city          agent_firm_name agent_state application_type  \
336582   NEW YORK  DEWAN & ASSOCIATES PLLC    NEW YORK              NaN   

       case_no    case_number case_received_date case_status  \
336582     NaN  A-16104-96498         2016-05-05   Certified   

       country_of_citizenship decision_date        employer_address_1  \
336582                  INDIA    2016-08-23  149 AVENUE AT THE COMMON   

       employer_address_2 employer_city          employer_country  \
336582          SUITE 203    SHREWSBURY  UNITED STATES OF AMERICA   

       employer_decl_info_title                     employer_name  \
336582                PRESIDENT  STRATUS TECHNOLOGY SERVICES, LLC   

        employer_num_employees employer_phone employer_postal_code  \
336582                   150.0   732-380-0323                07702   

       employer_state employer_yr_estab foreign_worker_info_city  \
336582     NEW JERSEY              2001                 BENSALEM   

       foreign_wo

       agent_city    agent_firm_name agent_state application_type case_no  \
360662    Houston  Lin & Valdez, LLP       TEXAS              NaN     NaN   

          case_number case_received_date case_status country_of_citizenship  \
360662  A-16214-37789         2016-08-10   Certified                  CHINA   

       decision_date  employer_address_1 employer_address_2 employer_city  \
360662    2016-11-07  2334 Old Mill Road               None    Sugar Land   

                employer_country employer_decl_info_title  \
360662  UNITED STATES OF AMERICA                President   

                    employer_name  employer_num_employees employer_phone  \
360662  OilFind International LLC                     6.0     7139808888   

       employer_postal_code employer_state employer_yr_estab  \
360662                77478          TEXAS              2009   

       foreign_worker_info_city foreign_worker_info_education  \
360662               SUGAR LAND                      Master's

### Discussions:
- We feel that these rows have full entries and we do not want to drop the features. We decided to leave these values as they are. 

In [25]:
# Analyse the values in employer_yr_estab
rowsToAnalyse = []
for row in df_visaType['employer_yr_estab'].iteritems():
    if row[1]==14.0 or row[1]==1646.0  or row[1]==0.0 or row[1]==20.0 or row[1]==35.0 or row[1]==804.0 or row[1]==1111.0 or row[1]==100.0 or row[1]==6.0:
        rowsToAnalyse.append(row[0])
for i in range(0, len(rowsToAnalyse)):
    index = rowsToAnalyse[i]
    print(df_visaType.loc[[index]])

       agent_city      agent_firm_name agent_state application_type case_no  \
294186   New York  Simin H. Syed, P.C.    NEW YORK              NaN     NaN   

          case_number case_received_date case_status country_of_citizenship  \
294186  A-15335-45045         2015-12-21   Certified                  INDIA   

       decision_date  employer_address_1 employer_address_2 employer_city  \
294186    2016-04-28  101 SUNNYSIDE BLVD               None     PLAINVIEW   

                employer_country employer_decl_info_title  \
294186  UNITED STATES OF AMERICA        SR.VICE PRESIDENT   

                 employer_name  employer_num_employees employer_phone  \
294186  CES COMPUTER SOLUTIONS                    17.0   516-576-8000   

       employer_postal_code employer_state employer_yr_estab  \
294186                11803       NEW YORK                14   

       foreign_worker_info_city foreign_worker_info_education  \
294186                 WOODBURY                      Master's  

### Discussion:
- Rows seem to have quite informative information and employer_yr_estab may not hold much importance, therefore we have decided to leave it as is.

In [26]:
#Evaluate the values in ri_posted_notice_at_worksite
A = []
Y = []
N = []
for row in df_visaType['ri_posted_notice_at_worksite'].iteritems():
    if row[1]=='A':
        A.append(row[0])
    elif row[1]=='Y':
        Y.append(row[0])
    else:
        N.append(row[0])
print('A has:', len(A))
print('Y has:', len(Y))
print('N has:', len(N))

A has: 1151
Y has: 94025
N has: 58449


In [27]:
# Evaluate the values in recr_info_barg_rep_notified
A = []
Y = []
N = []
for row in df_visaType['recr_info_barg_rep_notified'].iteritems():
    if row[1]=='A':
        A.append(row[0])
    elif row[1]=='Y':
        Y.append(row[0])
    else:
        N.append(row[0])
print('A has:', len(A))
print('Y has:', len(Y))
print('N has:', len(N))

A has: 60671
Y has: 704
N has: 92250


In [28]:
# Evaluate the values in ri_posted_notice_at_worksite
A = []
Y = []
N = []
for row in df_visaType['ri_posted_notice_at_worksite'].iteritems():
    if row[1]=='A':
        A.append(row[0])
    elif row[1]=='Y':
        Y.append(row[0])
    else:
        N.append(row[0])
print('A has:', len(A))
print('Y has:', len(Y))
print('N has:', len(N))

A has: 1151
Y has: 94025
N has: 58449


In [29]:
# Evaluate the values in fw_info_training_comp
A = []
Y = []
N = []
for row in df_visaType['fw_info_training_comp'].iteritems():
    if row[1]=='A':
        A.append(row[0])
    elif row[1]=='Y':
        Y.append(row[0])
    else:
        N.append(row[0])
print('A has:', len(A))
print('Y has:', len(Y))
print('N has:', len(N))

A has: 60627
Y has: 1284
N has: 91714


In [30]:
# Evaluate the values in fw_info_req_experience
A = []
Y = []
N = []
for row in df_visaType['fw_info_req_experience'].iteritems():
    if row[1]=='A':
        A.append(row[0])
    elif row[1]=='Y':
        Y.append(row[0])
    else:
        N.append(row[0])
print('A has:', len(A))
print('Y has:', len(Y))
print('N has:', len(N))

A has: 25065
Y has: 19649
N has: 108911


In [31]:
# Evaluate the values in fw_info_rel_occup_exp
A = []
Y = []
N = []
for row in df_visaType['fw_info_rel_occup_exp'].iteritems():
    if row[1]=='A':
        A.append(row[0])
    elif row[1]=='Y':
        Y.append(row[0])
    else:
        N.append(row[0])
print('A has:', len(A))
print('Y has:', len(Y))
print('N has:', len(N))

A has: 17909
Y has: 42375
N has: 93341


In [32]:
# Evaluate the values in fw_info_alt_edu_experience
A = []
Y = []
N = []
for row in df_visaType['fw_info_alt_edu_experience'].iteritems():
    if row[1]=='A':
        A.append(row[0])
    elif row[1]=='Y':
        Y.append(row[0])
    else:
        N.append(row[0])
print('A has:', len(A))
print('Y has:', len(Y))
print('N has:', len(N))

A has: 41814
Y has: 11692
N has: 100119


### Discussion:
- There were varying amount in the number of values for Y, N and A in all rows. It was decided that it was best to keep these values as is. We assume that Y indicates Yes, N indicates No and A indicates Applicable.

In [33]:
# Save cleaned dataframe to new CSV file
type_dic = {}
for col in df_visaType.columns:
    type_dic[col] = df_visaType[col].dtype
df_visaType.to_csv("US-Perm-Visa-CleanedData.csv", encoding='utf-8', index=False)

In [34]:
# Check how many rows and columns the data frame has 
df_visaType.shape
#Check how many rows
print("There are", df_visaType.shape[0], "rows")
#Check how many columns
print("There are", df_visaType.shape[1], "columns")

There are 153625 rows
There are 72 columns


In [35]:
df_visaType.dtypes

agent_city                        object
agent_firm_name                   object
agent_state                       object
application_type                  object
case_no                           object
case_number                       object
case_received_date                object
case_status                       object
country_of_citizenship            object
decision_date                     object
employer_address_1                object
employer_address_2                object
employer_city                     object
employer_country                  object
employer_decl_info_title          object
employer_name                     object
employer_num_employees           float64
employer_phone                    object
employer_postal_code              object
employer_state                    object
employer_yr_estab                 object
foreign_worker_info_city          object
foreign_worker_info_education     object
foreign_worker_info_inst          object
foreign_worker_i