# Analysing H1B Acceptance Trends 

The H1B visa is a nonimmigrant visa issued to graduate-level workers that allows them to work in the United States. The employer sponsors the H1B visa for workers with theoretical or technical expertise in specialized fields such as IT, finance, and accounting. An interesting fact about immigrant workers is that about 52 percent of new Silicon Valley companies were founded by such workers between 1995 and 2005. Some famous CEOs, such as Indra Nooyi (PepsiCo), Elon Musk (Tesla), Sundar Pichai (Google), and Satya Nadella (Microsoft), once came to the US on an H1B visa.

**Motivation**: Our team consists of five international graduate students who will be applying for H1B visas in the future. The visa application process seems long, complicated, and uncertain, so we decided to study it and use machine learning algorithms to predict the acceptance rate and trends of H1B visas.

## Data 
The data used in the project has been collected from <a href="https://www.foreignlaborcert.doleta.gov/performancedata.cfm">the Office of Foreign Labor Certification (OFLC).</a> 

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')


In [0]:
!pip install autocorrect
import pandas as pd
import numpy as np
import warnings
import nltk
from textblob import TextBlob
from autocorrect import Speller 
nltk.download('wordnet')

## Exploratory Data Analysis

Before working with the data, we need to understand its characteristics, which we do through exploratory data analysis (EDA). The dataset has 260 columns, but not all of them carry information relevant to our analysis, so we keep only a few, such as case status (accepted/denied), employer, and job title.
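The column selection described here can also happen while the file is being parsed, which avoids loading all 260 columns. The following is a minimal sketch and not the notebook's own approach: `SAMPLE_CSV`, `KEEP`, and `EXTRA_COLUMN` are hypothetical stand-ins for the disclosure file and its columns.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the OFLC disclosure file.
SAMPLE_CSV = io.StringIO(
    "CASE_NUMBER,CASE_STATUS,VISA_CLASS,EXTRA_COLUMN\n"
    "I-200-00000-000001,CERTIFIED,H-1B,x\n"
    "I-200-00000-000002,DENIED,H-1B,y\n"
    "I-203-00000-000003,WITHDRAWN,E-3 Australian,z\n"
)

# Columns to keep (an illustrative subset of the ones used in this notebook).
KEEP = ["CASE_NUMBER", "CASE_STATUS", "VISA_CLASS"]

# usecols skips the other columns at parse time; dtype=str reads every kept
# column as text, which sidesteps mixed-type inference on large files.
subset = pd.read_csv(SAMPLE_CSV, usecols=KEEP, dtype=str)
print(subset.shape)
```

Reading everything as text means numeric columns such as wages still need an explicit conversion afterwards.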

In [0]:
# Read the CSV file into the DataFrame `file`
file=pd.read_csv('/content/gdrive/My Drive/H-1B_Disclosure_Data_FY2019.csv')



In [0]:
file.shape

(1048548, 260)

In [0]:
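# Keep only the columns needed for the analysis; .copy() makes `cleaned` an
# independent DataFrame so later in-place edits do not raise SettingWithCopyWarning.
columns=['CASE_NUMBER','CASE_STATUS','CASE_SUBMITTED','DECISION_DATE','VISA_CLASS','JOB_TITLE','SOC_CODE','SOC_TITLE','EMPLOYER_NAME','WAGE_RATE_OF_PAY_FROM_1','WAGE_UNIT_OF_PAY_1']
cleaned=file[columns].copy()
cleaned.head()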

Unnamed: 0,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,JOB_TITLE,SOC_CODE,SOC_TITLE,EMPLOYER_NAME,WAGE_RATE_OF_PAY_FROM_1,WAGE_UNIT_OF_PAY_1
0,I-200-16092-327771,WITHDRAWN,4/8/2016,4/30/2019,H-1B,ASSOCIATE CREATIVE DIRECTOR,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"R/GA MEDIA GROUP, INC.","$179,000.00",Year
1,I-203-17188-450729,WITHDRAWN,7/14/2017,5/13/2019,E-3 Australian,ACCOUNT SUPERVISOR (MOTHER),11-2011,ADVERTISING AND PROMOTIONS MANAGERS,MOTHER INDUSTRIES LLC,"$110,000.00",Year
2,I-203-17229-572307,WITHDRAWN,8/23/2017,4/30/2019,E-3 Australian,EXECUTIVE CREATIVE DIRECTOR,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"WE ARE UNLIMITED, INC.","$275,000.00",Year
3,I-203-17356-299648,WITHDRAWN,12/22/2017,8/20/2019,E-3 Australian,PROJECT MANAGEMENT LEAD,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"HELLO ELEPHANT, LLC","$140,000.00",Year
4,I-203-18008-577576,WITHDRAWN,1/10/2018,4/15/2019,E-3 Australian,"CREATIVE DIRECTOR, UX",11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"HELLO ELEPHANT, LLC","$180,000.00",Year


In [0]:
cleaned['VISA_CLASS'].value_counts()

H-1B               649083
E-3 Australian      13087
H-1B1 Singapore      1291
H-1B1 Chile          1155
Name: VISA_CLASS, dtype: int64

In [0]:
# VISA_CLASS contains several visa types; we need only H-1B, so we drop every record with another visa type
cleaned.drop(labels=cleaned[cleaned['VISA_CLASS']!='H-1B'].index , inplace=True)



In [0]:
cleaned['CASE_STATUS'].value_counts()

CERTIFIED              592103
CERTIFIED-WITHDRAWN     46946
WITHDRAWN               19674
DENIED                   5893
I-200-19196-496412          1
Name: CASE_STATUS, dtype: int64

In [0]:
# We only need certified (accepted) and denied cases, so we drop WITHDRAWN records from the data frame.
# CERTIFIED-WITHDRAWN cases were certified first and withdrawn later, so they can be considered certified.
cleaned.replace({"CASE_STATUS":"CERTIFIED-WITHDRAWN"},"CERTIFIED",inplace=True)
cleaned.drop(labels=cleaned[cleaned['CASE_STATUS']=='WITHDRAWN'].index , inplace=True)
cleaned.head()



Unnamed: 0,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,JOB_TITLE,SOC_CODE,SOC_TITLE,EMPLOYER_NAME,WAGE_RATE_OF_PAY_FROM_1,WAGE_UNIT_OF_PAY_1
16,I-203-17048-800372,CERTIFIED,2/17/2017,2/26/2019,E-3 Australian,ASSOCIATE EXPERIENCE DIRECTOR,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"HUGE, LLC","$147,000.00",Year
17,I-203-17118-231630,CERTIFIED,5/17/2017,1/7/2019,E-3 Australian,ASSOCIATE CREATIVE DIRECTOR,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"R/GA MEDIA GROUP, INC.","$150,000.00",Year
18,I-200-17250-072640,CERTIFIED,9/7/2017,1/7/2019,H-1B,"EXECUTIVE DIRECTOR, STRATEGY",11-2011,ADVERTISING AND PROMOTIONS MANAGERS,FIGLIULO & PARTNERS LLC,"$230,000.00",Year
19,I-200-18026-717110,CERTIFIED,1/26/2018,7/5/2019,H-1B,PROJECT OPERATIONS MANAGER,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,INVISIONAPP INC.,"$107,000.00",Year
20,I-203-18052-454057,CERTIFIED,3/5/2018,11/15/2018,E-3 Australian,ACCOUNT DIRECTOR,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"GOODBY, SILVERSTEIN AND PARTNERS, INC.","$127,962.00",Year
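The visa-class filter and the status clean-up can also be written with boolean masks on an explicit copy. This is a minimal sketch on hypothetical sample rows, not the notebook's own code. Using `isin` keeps only the two outcomes of interest, so any stray value in `CASE_STATUS` (such as the case-number-like entry in the counts above) is dropped as well.

```python
import pandas as pd

# Hypothetical sample rows using the status labels seen in the counts above.
df = pd.DataFrame({
    "VISA_CLASS": ["H-1B", "H-1B", "E-3 Australian", "H-1B", "H-1B"],
    "CASE_STATUS": ["CERTIFIED", "CERTIFIED-WITHDRAWN", "CERTIFIED",
                    "WITHDRAWN", "DENIED"],
})

# Keep H-1B rows only; .copy() gives an independent frame that is safe to modify.
h1b = df.loc[df["VISA_CLASS"] == "H-1B"].copy()

# Treat CERTIFIED-WITHDRAWN as CERTIFIED, then keep the two outcomes of interest.
h1b["CASE_STATUS"] = h1b["CASE_STATUS"].replace("CERTIFIED-WITHDRAWN", "CERTIFIED")
h1b = h1b[h1b["CASE_STATUS"].isin(["CERTIFIED", "DENIED"])]

# Share of each outcome among the remaining rows.
rates = h1b["CASE_STATUS"].value_counts(normalize=True)
print(rates)
```

`value_counts(normalize=True)` gives the acceptance and denial shares directly, which is a convenient baseline before any model is trained.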


In [0]:
cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1028874 entries, 16 to 1048547
Data columns (total 11 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   CASE_NUMBER              644941 non-null  object
 1   CASE_STATUS              644943 non-null  object
 2   CASE_SUBMITTED           644942 non-null  object
 3   DECISION_DATE            644942 non-null  object
 4   VISA_CLASS               644942 non-null  object
 5   JOB_TITLE                644942 non-null  object
 6   SOC_CODE                 644938 non-null  object
 7   SOC_TITLE                644938 non-null  object
 8   EMPLOYER_NAME            644934 non-null  object
 9   WAGE_RATE_OF_PAY_FROM_1  644938 non-null  object
 10  WAGE_UNIT_OF_PAY_1       644938 non-null  object
dtypes: object(11)
memory usage: 94.2+ MB


In [0]:
# The wage column mixes string and float values, and some records contain the '$' symbol, which we want to remove
cleaned['WAGE_RATE_OF_PAY_FROM_1'].apply(type).value_counts()

<class 'float'>    848828
<class 'str'>      180046
Name: WAGE_RATE_OF_PAY_FROM_1, dtype: int64

In [0]:
def clean_wages(w):
    """Remove the '$' symbol and thousands separators from a wage value.

    The wage column holds both str and float values: strings are stripped of
    these symbols, and any other value is returned unchanged.
    """
    if isinstance(w, str):
        return w.replace('$', '').replace(',', '')
    return w

In [0]:
cleaned['WAGES']=cleaned['WAGE_RATE_OF_PAY_FROM_1'].apply(clean_wages).astype('float')
cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1028874 entries, 16 to 1048547
Data columns (total 12 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   CASE_NUMBER              644941 non-null  object 
 1   CASE_STATUS              644943 non-null  object 
 2   CASE_SUBMITTED           644942 non-null  object 
 3   DECISION_DATE            644942 non-null  object 
 4   VISA_CLASS               644942 non-null  object 
 5   JOB_TITLE                644942 non-null  object 
 6   SOC_CODE                 644938 non-null  object 
 7   SOC_TITLE                644938 non-null  object 
 8   EMPLOYER_NAME            644934 non-null  object 
 9   WAGE_RATE_OF_PAY_FROM_1  644938 non-null  object 
 10  WAGE_UNIT_OF_PAY_1       644938 non-null  object 
 11  WAGES                    644938 non-null  float64
dtypes: float64(1), object(11)
memory usage: 102.0+ MB
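The same wage clean-up can be expressed with vectorized string methods instead of a per-row function. The sketch below assumes a small hypothetical series named `raw`. Unlike `astype('float')`, `pd.to_numeric(..., errors="coerce")` turns unparseable text into NaN rather than raising.

```python
import pandas as pd

# Hypothetical wage values: formatted strings, a plain number, and a missing entry.
raw = pd.Series(["$179,000.00", 95000.0, "2000", None], dtype=object)

# Cast to text, strip '$' and ',', then parse to float.
stripped = raw.astype(str).str.replace(r"[$,]", "", regex=True)
wages = pd.to_numeric(stripped, errors="coerce")
print(wages.tolist())
```

Coercing silently hides malformed entries, so it is worth counting the NaN values before and after the conversion.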




In [0]:
# the wage information that we have available has different unit of pay
cleaned['WAGE_UNIT_OF_PAY_1'].value_counts()

Year         601004
Hour          43268
Month           405
Week            133
Bi-Weekly       128
Name: WAGE_UNIT_OF_PAY_1, dtype: int64

In [0]:
x=cleaned.loc[cleaned['WAGE_UNIT_OF_PAY_1']=="Month"]
x.head(2)

Unnamed: 0,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,JOB_TITLE,SOC_CODE,SOC_TITLE,EMPLOYER_NAME,WAGE_RATE_OF_PAY_FROM_1,WAGE_UNIT_OF_PAY_1,WAGES
110,I-203-19030-554145,DENIED,2/5/2019 22:56,2/8/2019 13:24,E-3 Australian,SENIOR PARTNERSHIPS MANAGER,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,INTREPID US INC,95000,Month,95000.0
818,I-200-18306-399497,DENIED,11/02/2018 11:37:37,11/05/2018 12:07:42,H-1B,ACCOUNTING & MARKETING MANAGER FOR AFRICA,11-2021,MARKETING MANAGERS,SHOP2SHIP LLC,2000,Month,2000.0


In [0]:
# we convert the different units of pay to the type 'Year'
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Month',cleaned['WAGES'] * 12,cleaned['WAGES'])
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Hour',cleaned['WAGES'] * 2080,cleaned['WAGES']) # 2080=8 hours*5 days* 52 weeks
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Bi-Weekly',cleaned['WAGES'] *26,cleaned['WAGES'])
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Week',cleaned['WAGES'] * 52,cleaned['WAGES'])
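The four conversions can be collapsed into a single lookup table. This is a sketch under one changed assumption: a unit that is missing from the table produces NaN here, whereas the `np.where` chain leaves such wages unchanged. `PERIODS_PER_YEAR` and `ANNUAL_WAGES` are hypothetical names.

```python
import pandas as pd

# Multipliers taken from the conversion cell (2080 = 8 hours * 5 days * 52 weeks).
PERIODS_PER_YEAR = {"Year": 1, "Month": 12, "Bi-Weekly": 26, "Week": 52, "Hour": 2080}

# Hypothetical sample rows, one per unit of pay.
df = pd.DataFrame({
    "WAGES": [120000.0, 2000.0, 50.0, 1500.0, 3000.0],
    "WAGE_UNIT_OF_PAY_1": ["Year", "Month", "Hour", "Week", "Bi-Weekly"],
})

# Map each unit to its multiplier and scale the wage in one step.
df["ANNUAL_WAGES"] = df["WAGES"] * df["WAGE_UNIT_OF_PAY_1"].map(PERIODS_PER_YEAR)
print(df["ANNUAL_WAGES"].tolist())
```

Keeping the multipliers in one dictionary makes the assumed working year easy to audit and change.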


In [0]:
# The wage information now lives in WAGES, so we can drop the two original wage columns
cleaned.drop(columns=['WAGE_RATE_OF_PAY_FROM_1','WAGE_UNIT_OF_PAY_1'],inplace=True)



In [0]:
cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1028874 entries, 16 to 1048547
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   CASE_NUMBER     644941 non-null  object 
 1   CASE_STATUS     644943 non-null  object 
 2   CASE_SUBMITTED  644942 non-null  object 
 3   DECISION_DATE   644942 non-null  object 
 4   VISA_CLASS      644942 non-null  object 
 5   JOB_TITLE       644942 non-null  object 
 6   SOC_CODE        644938 non-null  object 
 7   SOC_TITLE       644938 non-null  object 
 8   EMPLOYER_NAME   644934 non-null  object 
 9   WAGES           644938 non-null  float64
dtypes: float64(1), object(9)
memory usage: 86.3+ MB


In [0]:
"""
We should remove record that have null objects, from the above cell we see
that all columns don't have same number of non-null records
which means we have to remove the records that have the null values.
we see that there are about 17 records that have null values
""" 
null_rows = cleaned.isnull().any(axis=1)
print(cleaned[null_rows].shape)
print(cleaned.shape)

(383949, 10)
(1028874, 10)


In [0]:
cleaned.dropna(inplace=True)
print(cleaned.shape)

(644925, 10)
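Before calling `dropna`, it can help to separate rows that are empty in every column from rows that are only partly filled, since the two usually have different causes. The sketch below uses a hypothetical three-row frame, and `ACME LLC` is a made-up employer name.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one partly filled row, one empty row.
df = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "DENIED", np.nan],
    "EMPLOYER_NAME": ["ACME LLC", np.nan, np.nan],
    "WAGES": [100000.0, 90000.0, np.nan],
})

all_null = df.isnull().all(axis=1)   # every column is missing
any_null = df.isnull().any(axis=1)   # at least one column is missing
partial = any_null & ~all_null       # populated rows with gaps

print(int(all_null.sum()), int(partial.sum()))
print(df.isnull().sum())             # missing values per column

complete = df.dropna()               # rows with no missing value at all
```

If the partly filled rows are few, dropping them costs little. A large number of fully empty rows usually points to an issue in how the file was exported or read.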




In [0]:
cleaned['JOB_TITLE'].value_counts()

SOFTWARE ENGINEER                         33574
SOFTWARE DEVELOPER                        33155
SENIOR SYSTEMS ANALYST JC60               12770
SENIOR SOFTWARE ENGINEER                   8285
MANAGER JC50                               8134
                                          ...  
SR BUSINESS ARCHITECT (SOC 15-1199.02)        1
TEST ENGINEER 1615.46631                      1
SENIOR ASSOCIATE II, QUALITY ASSURANCE        1
IT TRAINING & BUSINESS MANAGER                1
FRONT END UI   DEVELOPER                      1
Name: JOB_TITLE, Length: 112793, dtype: int64

In [0]:
# Some job titles contain numbers, either alone or embedded in a word;
# we drop those tokens and also remove commas.
def remove_num(text):
  """Return the token unchanged, or '' if it contains any digit."""
  if any(c.isdigit() for c in text):
    return ''
  return text
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([remove_num(i) for i in txt.lower().split()]))
cleaned['JOB_TITLE']=cleaned['JOB_TITLE'].str.replace(',', '')
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([remove_num(i) for i in txt.lower().split()]))
cleaned['SOC_TITLE']=cleaned['SOC_TITLE'].str.replace(',', '')

cleaned.head()
cleaned['JOB_TITLE'].value_counts()


software engineer              33767
software developer             33244
senior systems analyst         12771
senior software engineer        8300
manager                         8143
                               ...  
unix systems adminitrator          1
quality scientist iii              1
pediatrician/neonatologist         1
clinical programmer/analyst        1
systems engineer (ms iis)          1
Name: JOB_TITLE, Length: 100583, dtype: int64
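The token filter can also be written with vectorized string methods. The following sketch assumes a hypothetical series `titles`. It differs from the cell above in one respect: it collapses the whitespace left behind by removed tokens, so titles do not end with a trailing space.

```python
import pandas as pd

# Hypothetical titles in the style of the raw JOB_TITLE values.
titles = pd.Series([
    "SENIOR SYSTEMS ANALYST JC60",
    "TEST ENGINEER 1615.46631",
    "CREATIVE DIRECTOR, UX",
])

cleaned_titles = (
    titles.str.lower()
    .str.replace(r"\S*\d\S*", "", regex=True)  # drop any token containing a digit
    .str.replace(",", "", regex=False)         # drop commas
    .str.split().str.join(" ")                 # collapse leftover whitespace
)
print(cleaned_titles.tolist())
```

Normalizing whitespace matters for grouping, because two titles that differ only by a trailing space would otherwise be counted separately.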

In [0]:
#code to clean and group the JOB_TITLE COLUMN
# lemmatization and spell check function
nltk.download('words')
lemmatizer = nltk.stem.WordNetLemmatizer()
words = set(nltk.corpus.words.words())
spell = Speller()


def lemmatize_text(text):
  return lemmatizer.lemmatize(text)

def spelling_checker(text):
  return spell(text)
 
print(spelling_checker("computr sciece progam check"))

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
computer science program check


In [0]:
# This cell is slow because the spell checker runs on every token
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([lemmatize_text(i) for i in txt.lower().split()]))
print(' after lemmatization')
print(cleaned['JOB_TITLE'].value_counts() )
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([spelling_checker(i) for i in txt.lower().split()]))
print('after spell correction')
#cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([remove_text(i) for i in txt.lower().split()]))
cleaned['JOB_TITLE'].value_counts() 



 after lemmatization
software engineer                                        41171
software developer                                       33458
senior system analyst                                    13312
manager                                                   9703
senior software engineer                                  8448
                                                         ...  
analyst sec reporting / technical accounting                 1
engineer network                                             1
principal it project management                              1
global business process leader - product configurator        1
sr. front end (fe) engineer                                  1
Name: JOB_TITLE, Length: 97053, dtype: int64
after spell correction




software engineer                                             41214
software developer                                            33511
senior system analyst                                         13312
manager                                                        9703
senior software engineer                                       8460
                                                              ...  
industrials analyst                                               1
engineer principal project                                        1
technical service associate                                       1
strategic finance associate                                       1
sr. market research analyst (analytics & strategy manager)        1
Name: JOB_TITLE, Length: 96179, dtype: int64
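Because the data holds roughly 100,000 distinct titles across about 645,000 rows, one way to shorten the slow step is to normalize each distinct title once and map the result back. The sketch below uses `normalize_title` and `FIXES` as hypothetical stand-ins for the lemmatizer and spell checker. It shows only the caching pattern, not the notebook's actual NLP calls.

```python
import pandas as pd

# Hypothetical corrections standing in for lemmatization and spell checking.
FIXES = {"systems": "system", "enginer": "engineer"}
call_log = []  # records every call so the saving can be observed

def normalize_title(title: str) -> str:
    """Stand-in for the expensive per-title normalization."""
    call_log.append(title)
    return " ".join(FIXES.get(token, token) for token in title.split())

titles = pd.Series([
    "software enginer",
    "senior systems analyst",
    "software enginer",
    "software enginer",
])

# Run the expensive function once per distinct title, then broadcast with map.
lookup = {title: normalize_title(title) for title in titles.unique()}
normalized = titles.map(lookup)
print(normalized.tolist(), len(call_log))
```

The result is identical to applying the function row by row as long as the function is deterministic, which is an assumption worth checking for the real spell checker.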

In [0]:
#clean SOC TITLE
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([lemmatize_text(i) for i in txt.lower().split()]))
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([spelling_checker(i) for i in txt.lower().split()]))
cleaned['SOC_TITLE'].value_counts() 



software developer application                    210295
computer system analyst                            71524
computer occupation all other                      54961
software developer system software                 30909
computer programmer                                16662
                                                   ...  
food watchmakers                                       1
accountant and auditors                                1
regulatory affair specialists                          1
property real estate manager                           1
sawing machine setter operator and tender wood         1
Name: SOC_TITLE, Length: 776, dtype: int64

In [0]:
grouped_wages=cleaned.groupby('JOB_TITLE', as_index=False).agg({'WAGES':'mean'})
op=grouped_wages.sort_values(by=['WAGES'],ascending=False)
# Look up one title; X must be defined before display(X) is called
X=op.loc[op['JOB_TITLE']=='software engineer']
display(op)
display(X)

Unnamed: 0,JOB_TITLE,WAGES
65553,senior application engineer - power management,9.884750e+07
48524,nurse practitioner (licensed),9.778095e+07
80803,specialist web developer,7.406071e+07
12942,business intelligence associate,3.439664e+07
49111,operation professional,2.776000e+07
...,...,...
18627,customer service all task & duty of a nail salon,1.820000e+04
25223,early education classroom assistant,1.768000e+04
92168,track and field coach,1.704000e+04
41568,live streaming service,1.700000e+04


Unnamed: 0,JOB_TITLE,WAGES
78215,software engineer,131035.528326
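The largest means in the table above are on the order of 10^7 to 10^8, which suggests that a few extreme entries dominate titles with very few rows. A possible refinement, sketched on hypothetical data, is to report the median and the row count next to the mean and to ignore titles below a minimum count. `MIN_ROWS` is an assumed threshold chosen only for this example.

```python
import pandas as pd

# Hypothetical rows: one title has a single, implausibly large wage.
df = pd.DataFrame({
    "JOB_TITLE": ["software engineer"] * 3 + ["nurse practitioner"],
    "WAGES": [120000.0, 130000.0, 140000.0, 97_000_000.0],
})

# Mean, median, and row count per title in one pass.
summary = (
    df.groupby("JOB_TITLE")["WAGES"]
    .agg(mean="mean", median="median", n="count")
    .reset_index()
)

MIN_ROWS = 3  # assumed threshold for a title to be considered reliable
reliable = summary[summary["n"] >= MIN_ROWS].sort_values("median", ascending=False)
print(reliable)
```

The median is far less sensitive than the mean to a single mis-entered wage or unit of pay, so ranking by it gives a steadier picture of typical pay per title.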
