# Analysing H1B Acceptance Trends 

The H1B visa is a nonimmigrant visa issued to graduate-level workers that allows them to work in the United States. The employer sponsors the visa for workers with theoretical or technical expertise in specialized fields such as IT, finance and accounting. An interesting fact about immigrant workers is that about 52 percent of new Silicon Valley companies were founded by such workers between 1995 and 2005. Some famous CEOs, such as Indra Nooyi (PepsiCo), Elon Musk (Tesla), Sundar Pichai (Google) and Satya Nadella (Microsoft), once came to the US on an H1B visa.

**Motivation**: Our team consists of five international graduate students who will be applying for H1B visas in the future. The visa application process seems long, complicated and uncertain, so we decided to study it and use machine learning algorithms to predict the acceptance rate and trends of H1B visas.

## Data 
The data used in this project was collected from <a href="https://www.foreignlaborcert.doleta.gov/performancedata.cfm">the Office of Foreign Labor Certification (OFLC)</a>.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
import pandas as pd
import numpy as np
import warnings

## Exploratory Data Analysis

Before working with the data we need to understand its characteristics, which we do through exploratory data analysis (EDA). The dataset has 260 columns, but not all of them carry information that contributes to our analysis. We therefore keep only a few columns, such as case status (accepted/denied), employer and job title.
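Since only a handful of the 260 columns are used, one option is to load just those columns when reading the file. The sketch below illustrates this with `pandas.read_csv(..., usecols=...)` on a tiny in-memory CSV; the case numbers, employer names and `UNUSED_COLUMN` are made up for illustration and are not taken from the OFLC file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the disclosure file.
# All values and UNUSED_COLUMN are hypothetical, for illustration only.
csv_text = (
    "CASE_NUMBER,CASE_STATUS,EMPLOYER_NAME,UNUSED_COLUMN\n"
    "I-200-00000-000001,CERTIFIED,EXAMPLE LLC,x\n"
    "I-200-00000-000002,DENIED,SAMPLE INC,y\n"
)

# Only the listed columns are parsed; everything else is skipped at read time.
wanted = ["CASE_NUMBER", "CASE_STATUS", "EMPLOYER_NAME"]
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(subset.shape)
print(subset.columns.tolist())
```

With the real file, the same `usecols` list would hold the eleven column names selected in the cells that follow.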

In [6]:
# Read the CSV file into the DataFrame `file`
file=pd.read_csv('/content/gdrive/My Drive/H-1B_Disclosure_Data_FY2019.csv')



In [7]:
file.shape

(1048548, 260)

In [10]:
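# Keep only the columns needed for the analysis; .copy() makes `cleaned` an independent DataFrame,
# so the in-place edits in later cells do not operate on a slice of `file`
cleaned=file[['CASE_NUMBER','CASE_STATUS','CASE_SUBMITTED','DECISION_DATE','VISA_CLASS','JOB_TITLE','SOC_CODE','SOC_TITLE','EMPLOYER_NAME','WAGE_RATE_OF_PAY_FROM_1','WAGE_UNIT_OF_PAY_1']].copy()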
cleaned.head()

Unnamed: 0,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,JOB_TITLE,SOC_CODE,SOC_TITLE,EMPLOYER_NAME,WAGE_RATE_OF_PAY_FROM_1,WAGE_UNIT_OF_PAY_1
0,I-200-16092-327771,WITHDRAWN,4/8/2016,4/30/2019,H-1B,ASSOCIATE CREATIVE DIRECTOR,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"R/GA MEDIA GROUP, INC.","$179,000.00",Year
1,I-203-17188-450729,WITHDRAWN,7/14/2017,5/13/2019,E-3 Australian,ACCOUNT SUPERVISOR (MOTHER),11-2011,ADVERTISING AND PROMOTIONS MANAGERS,MOTHER INDUSTRIES LLC,"$110,000.00",Year
2,I-203-17229-572307,WITHDRAWN,8/23/2017,4/30/2019,E-3 Australian,EXECUTIVE CREATIVE DIRECTOR,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"WE ARE UNLIMITED, INC.","$275,000.00",Year
3,I-203-17356-299648,WITHDRAWN,12/22/2017,8/20/2019,E-3 Australian,PROJECT MANAGEMENT LEAD,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"HELLO ELEPHANT, LLC","$140,000.00",Year
4,I-203-18008-577576,WITHDRAWN,1/10/2018,4/15/2019,E-3 Australian,"CREATIVE DIRECTOR, UX",11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"HELLO ELEPHANT, LLC","$180,000.00",Year


In [13]:
# VISA_CLASS contains several categories, but we only need the H-1B visa type, so we keep those records and drop all others
cleaned = cleaned[cleaned['VISA_CLASS'] == 'H-1B'].copy()

In [16]:
cleaned['VISA_CLASS'].value_counts()

H-1B    649083
Name: VISA_CLASS, dtype: int64


In [14]:
cleaned['CASE_STATUS'].value_counts()

CERTIFIED              578640
CERTIFIED-WITHDRAWN     46050
WITHDRAWN               19227
DENIED                   5166
Name: CASE_STATUS, dtype: int64

In [36]:
# We only need accepted and denied cases, so we drop WITHDRAWN records from the data frame.
# CERTIFIED-WITHDRAWN cases were certified first and withdrawn later, so they can be treated as certified.
cleaned.replace({"CASE_STATUS":"CERTIFIED-WITHDRAWN"},"CERTIFIED",inplace=True)
cleaned.drop(labels=cleaned[cleaned['CASE_STATUS']=='WITHDRAWN'].index , inplace=True)
cleaned.head()

Unnamed: 0,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,JOB_TITLE,SOC_CODE,SOC_TITLE,EMPLOYER_NAME,WAGES
18,I-200-17250-072640,CERTIFIED,9/7/2017,1/7/2019,H-1B,"EXECUTIVE DIRECTOR, STRATEGY",11-2011,ADVERTISING AND PROMOTIONS MANAGERS,FIGLIULO & PARTNERS LLC,230000.0
19,I-200-18026-717110,CERTIFIED,1/26/2018,7/5/2019,H-1B,PROJECT OPERATIONS MANAGER,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,INVISIONAPP INC.,107000.0
21,I-200-18039-081565,CERTIFIED,3/5/2018,1/8/2019,H-1B,MANAGER OF LEAGUE AND TOURNAMENT SERVICES,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,OREGON YOUTH SOCCER ASSOCIATION,49087.0
22,I-200-18082-340860,CERTIFIED,3/23/2018,4/22/2019,H-1B,"DIRECTOR, DEMAND",11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"FACTUAL, INC.",172930.0
24,I-200-18162-689783,CERTIFIED,9/26/2018,10/2/2018,H-1B,ADVERSTING AND PROMOTIONS MANAGER,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,FANTUAN GROUP INC,68640.0
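
The status counts reported above already hint at how unbalanced the two remaining classes are. As a rough worked example (a sketch that only reuses the numbers printed by `value_counts()`, not a rerun of the notebook), the merge and drop can be reproduced on the counts themselves:

```python
import pandas as pd

# Counts copied from the CASE_STATUS value_counts() output above.
status_counts = pd.Series({
    "CERTIFIED": 578640,
    "CERTIFIED-WITHDRAWN": 46050,
    "WITHDRAWN": 19227,
    "DENIED": 5166,
})

# Fold CERTIFIED-WITHDRAWN into CERTIFIED and drop WITHDRAWN,
# mirroring the replace/drop steps of the cell above.
merged = (
    status_counts.rename({"CERTIFIED-WITHDRAWN": "CERTIFIED"})
                 .drop("WITHDRAWN")
                 .groupby(level=0)
                 .sum()
)

# Share of each remaining class.
share = merged / merged.sum()

print(merged)
print(share.round(4))
```

The total of the two merged classes equals the 629856 rows reported by `cleaned.info()`, and denied cases make up less than one percent of them, which is worth keeping in mind when a classifier is evaluated on this data.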


In [19]:
cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 629856 entries, 18 to 664616
Data columns (total 11 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   CASE_NUMBER              629855 non-null  object
 1   CASE_STATUS              629856 non-null  object
 2   CASE_SUBMITTED           629856 non-null  object
 3   DECISION_DATE            629856 non-null  object
 4   VISA_CLASS               629856 non-null  object
 5   JOB_TITLE                629856 non-null  object
 6   SOC_CODE                 629852 non-null  object
 7   SOC_TITLE                629852 non-null  object
 8   EMPLOYER_NAME            629848 non-null  object
 9   WAGE_RATE_OF_PAY_FROM_1  629852 non-null  object
 10  WAGE_UNIT_OF_PAY_1       629852 non-null  object
dtypes: object(11)
memory usage: 57.7+ MB


In [20]:
# The wage column mixes string and float values, and some records contain the '$' symbol, which we want to remove
cleaned['WAGE_RATE_OF_PAY_FROM_1'].apply(type).value_counts()

<class 'float'>    460442
<class 'str'>      169414
Name: WAGE_RATE_OF_PAY_FROM_1, dtype: int64

In [0]:
def clean_wages(w):
    """Strip the '$' symbol and thousands separators from a wage entry.

    The wage column holds both str and float values: strings are returned
    with the symbols removed, any other value is returned unchanged.
    """
    if isinstance(w, str):
        return w.replace('$', '').replace(',', '')
    return w

In [22]:
cleaned['WAGES']=cleaned['WAGE_RATE_OF_PAY_FROM_1'].apply(clean_wages).astype('float')
cleaned.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 629856 entries, 18 to 664616
Data columns (total 12 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   CASE_NUMBER              629855 non-null  object 
 1   CASE_STATUS              629856 non-null  object 
 2   CASE_SUBMITTED           629856 non-null  object 
 3   DECISION_DATE            629856 non-null  object 
 4   VISA_CLASS               629856 non-null  object 
 5   JOB_TITLE                629856 non-null  object 
 6   SOC_CODE                 629852 non-null  object 
 7   SOC_TITLE                629852 non-null  object 
 8   EMPLOYER_NAME            629848 non-null  object 
 9   WAGE_RATE_OF_PAY_FROM_1  629852 non-null  object 
 10  WAGE_UNIT_OF_PAY_1       629852 non-null  object 
 11  WAGES                    629852 non-null  float64
dtypes: float64(1), object(11)
memory usage: 62.5+ MB
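
The same cleaning can also be written without a Python-level `apply`. The following is a minimal sketch on made-up sample values, assuming that any entry that cannot be parsed should become `NaN` (which is what `errors="coerce"` does); it is an alternative to `clean_wages`, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical sample mixing strings, floats and a missing value,
# similar in shape to WAGE_RATE_OF_PAY_FROM_1.
wages_raw = pd.Series(["$179,000.00", 2000.0, "65.50", float("nan")])

# Turn everything into text, strip '$' and ',' in one vectorised pass,
# then parse to numbers; unparseable entries become NaN.
wages_text = wages_raw.astype(str).str.replace(r"[$,]", "", regex=True)
wages = pd.to_numeric(wages_text, errors="coerce")

print(wages.tolist())
```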


In [24]:
# The wage information is reported in different units of pay
cleaned['WAGE_UNIT_OF_PAY_1'].value_counts()

Year         587386
Hour          41927
Month           342
Bi-Weekly       105
Week             92
Name: WAGE_UNIT_OF_PAY_1, dtype: int64

In [0]:
x=cleaned.loc[cleaned['WAGE_UNIT_OF_PAY_1']=="Month"]
x.head(2)


Unnamed: 0,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,JOB_TITLE,SOC_CODE,SOC_TITLE,EMPLOYER_NAME,WAGE_RATE_OF_PAY_FROM_1,WAGE_UNIT_OF_PAY_1,WAGES
818,I-200-18306-399497,DENIED,11/02/2018 11:37:37,11/05/2018 12:07:42,H-1B,ACCOUNTING & MARKETING MANAGER FOR AFRICA,11-2021,MARKETING MANAGERS,SHOP2SHIP LLC,2000,Month,2000.0
826,I-200-18309-843479,CERTIFIED,11/05/2018 12:34:19,11/09/2018 22:00:34,H-1B,ACCOUNTING & MARKETING MANAGER FOR AFRICA,11-2021,MARKETING MANAGERS,SHOP2SHIP LLC,2000,Month,2000.0


In [25]:
# we convert the different units of pay to the type 'Year'
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Month',cleaned['WAGES'] * 12,cleaned['WAGES'])
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Hour',cleaned['WAGES'] * 2080,cleaned['WAGES']) # 2080=8 hours*5 days* 52 weeks
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Bi-Weekly',cleaned['WAGES'] *26,cleaned['WAGES'])
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Week',cleaned['WAGES'] * 52,cleaned['WAGES'])
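
The four `np.where` calls above can also be expressed as a single lookup of a per-unit multiplier. This is a sketch on a small made-up frame; `PERIODS_PER_YEAR` and `ANNUAL_WAGES` are names introduced here for illustration, and the multipliers are the ones used in the cell above.

```python
import pandas as pd

# Multipliers taken from the cell above: 2080 = 8 hours * 5 days * 52 weeks.
PERIODS_PER_YEAR = {"Year": 1, "Month": 12, "Bi-Weekly": 26, "Week": 52, "Hour": 2080}

# Hypothetical sample rows, not taken from the dataset.
sample = pd.DataFrame({
    "WAGES": [2000.0, 50.0, 110000.0],
    "WAGE_UNIT_OF_PAY_1": ["Month", "Hour", "Year"],
})

# Look up the multiplier for each row's unit and scale the wage to a yearly figure.
sample["ANNUAL_WAGES"] = sample["WAGES"] * sample["WAGE_UNIT_OF_PAY_1"].map(PERIODS_PER_YEAR)

print(sample)
```

A unit that is missing from the dictionary maps to `NaN`, so unexpected values surface as missing wages instead of silently passing through unscaled.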

In [26]:
# Now that the yearly wage is stored in WAGES, the two original wage columns can be dropped
cleaned.drop(columns=['WAGE_RATE_OF_PAY_FROM_1','WAGE_UNIT_OF_PAY_1'],inplace=True)


In [27]:
cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 629856 entries, 18 to 664616
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   CASE_NUMBER     629855 non-null  object 
 1   CASE_STATUS     629856 non-null  object 
 2   CASE_SUBMITTED  629856 non-null  object 
 3   DECISION_DATE   629856 non-null  object 
 4   VISA_CLASS      629856 non-null  object 
 5   JOB_TITLE       629856 non-null  object 
 6   SOC_CODE        629852 non-null  object 
 7   SOC_TITLE       629852 non-null  object 
 8   EMPLOYER_NAME   629848 non-null  object 
 9   WAGES           629852 non-null  float64
dtypes: float64(1), object(9)
memory usage: 52.9+ MB


In [37]:
"""
Records with null values should be removed. The info() output in the cell
above shows that the columns do not all have the same number of non-null
entries, so some records contain nulls.
Counting them shows that 17 records have at least one null value.
"""
null_rows = cleaned.isnull().any(axis=1)
print(cleaned[null_rows].shape)
print(cleaned.shape)

(17, 10)
(629856, 10)


In [41]:
cleaned.dropna(inplace=True)
print(cleaned.shape)

(629839, 10)
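
Adding up the gaps visible in the `info()` output (1 in CASE_NUMBER, 4 each in SOC_CODE, SOC_TITLE and WAGES, 8 in EMPLOYER_NAME) gives 21 missing values spread over only 17 rows, so some rows lack more than one field. A small sketch on made-up data shows how the per-column and per-row views can be checked side by side before calling `dropna`:

```python
import pandas as pd

# Hypothetical sample with gaps in two different columns.
sample = pd.DataFrame({
    "CASE_NUMBER": ["A", None, "C"],
    "EMPLOYER_NAME": ["X", "Y", None],
    "WAGES": [1.0, 2.0, 3.0],
})

# Missing values counted per column.
missing_per_column = sample.isna().sum()

# Rows that have at least one missing value.
rows_with_missing = int(sample.isna().any(axis=1).sum())

# Keep only complete rows.
remaining = sample.dropna()

print(missing_per_column)
print(rows_with_missing, len(remaining))
```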


In [42]:
cleaned['JOB_TITLE'].value_counts()

SOFTWARE ENGINEER                              33199
SOFTWARE DEVELOPER                             33057
SENIOR SYSTEMS ANALYST JC60                    12759
SENIOR SOFTWARE ENGINEER                        8135
MANAGER JC50                                    8118
                                               ...  
SOFTWARE ENGINEER 1615.31987                       1
SENIOR DATA INTEGRATION PROGRAMMER                 1
MANAGER TECHNOLOGY & INTEGRATION                   1
SENIOR PARTNER ACCOUNT MANAGER (15-1199.08)        1
MIDDLEWARE ADMINISTRATOR/PROGRAMMER ANALYST        1
Name: JOB_TITLE, Length: 108139, dtype: int64

In [43]:
#we see that the job title has integers in the record which we can remove
cleaned['JOB_TITLE']=cleaned['JOB_TITLE'].str.replace('[0-9(){}[].]', '')
cleaned.head()

Unnamed: 0,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,JOB_TITLE,SOC_CODE,SOC_TITLE,EMPLOYER_NAME,WAGES
18,I-200-17250-072640,CERTIFIED,9/7/2017,1/7/2019,H-1B,"EXECUTIVE DIRECTOR, STRATEGY",11-2011,ADVERTISING AND PROMOTIONS MANAGERS,FIGLIULO & PARTNERS LLC,230000.0
19,I-200-18026-717110,CERTIFIED,1/26/2018,7/5/2019,H-1B,PROJECT OPERATIONS MANAGER,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,INVISIONAPP INC.,107000.0
21,I-200-18039-081565,CERTIFIED,3/5/2018,1/8/2019,H-1B,MANAGER OF LEAGUE AND TOURNAMENT SERVICES,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,OREGON YOUTH SOCCER ASSOCIATION,49087.0
22,I-200-18082-340860,CERTIFIED,3/23/2018,4/22/2019,H-1B,"DIRECTOR, DEMAND",11-2011,ADVERTISING AND PROMOTIONS MANAGERS,"FACTUAL, INC.",172930.0
24,I-200-18162-689783,CERTIFIED,9/26/2018,10/2/2018,H-1B,ADVERSTING AND PROMOTIONS MANAGER,11-2011,ADVERTISING AND PROMOTIONS MANAGERS,FANTUAN GROUP INC,68640.0


In [44]:
cleaned['JOB_TITLE'].value_counts()

SOFTWARE ENGINEER                             33199
SOFTWARE DEVELOPER                            33057
SENIOR SYSTEMS ANALYST JC60                   12759
SENIOR SOFTWARE ENGINEER                       8135
MANAGER JC50                                   8118
                                              ...  
EDI (ELECTRONIC DATA INTERCHANGE) ENGINEER        1
VISITING CLINCIAL ASSOCIATE PROFESSOR             1
SR. SOFTWARE DEVELOPER-MAINFRAME                  1
QUANTITATIVE BUSINESS ANALYST 1615.18749          1
COORDINATOR, SPORT SCIENCES & MEDICINE            1
Name: JOB_TITLE, Length: 108130, dtype: int64
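
The second `value_counts()` output still lists titles such as `SENIOR SYSTEMS ANALYST JC60`, and the number of distinct titles only falls from 108139 to 108130, so the replacement removes far less than intended. In Python's `re` syntax the first unescaped `]` closes a character class, so in `'[0-9(){}[].]'` the class ends right after `[` and the trailing `.]` is read as "any character followed by a literal `]`". Recent pandas versions also treat the pattern as a literal string unless `regex=True` is passed. The sketch below, on a few sample titles, shows one possible corrected pattern; collapsing the leftover whitespace is an extra step assumed here, not part of the original cell.

```python
import pandas as pd

titles = pd.Series([
    "SENIOR SYSTEMS ANALYST JC60",
    "SOFTWARE ENGINEER (1615.46146)",
    "SOFTWARE ENGINEER",
])

# Escaping the square brackets keeps all characters inside one character class:
# digits, parentheses, braces, square brackets and the dot.
pattern = r"[0-9(){}\[\].]"

stripped = (
    titles.str.replace(pattern, "", regex=True)
          .str.replace(r"\s+", " ", regex=True)  # collapse repeated whitespace
          .str.strip()                           # drop leading/trailing spaces
)

print(stripped.tolist())
```

With this pattern the job-code variants of a title collapse onto the plain title, which changes the group sizes and the per-title averages computed afterwards.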

In [45]:
grouped_wages=cleaned.groupby('JOB_TITLE', as_index=False).agg({'WAGES':'mean'})
op=grouped_wages.sort_values(by=['WAGES'],ascending=False)
X=op.loc[op['JOB_TITLE']=='SOFTWARE ENGINEER']
display(op)
display(X)

Unnamed: 0,JOB_TITLE,WAGES
86328,SOFTWARE ENGINEER 1615.36850,357136000.0
83630,SOFTWARE ENGINEER (1615.46146),110293000.0
70008,SENIOR APPLICATIONS ENGINEER - POWER MANAGEMENT,98847500.0
50810,NURSE PRACTITIONERS (LICENSED),97780945.6
91813,SPECIALIST WEB DEVELOPER,74060709.0
...,...,...
49912,NAIL TECHNICAN,18720.0
19351,"CUSTOMER SERVICE, ALL TASKS & DUTIES OF A NAIL...",18200.0
104210,TRACK AND FIELD COACH,17040.0
43547,LIVE STREAMING SERVICE,17000.0


Unnamed: 0,JOB_TITLE,WAGES
83066,SOFTWARE ENGINEER,112102.821495
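
The ranking above is topped by titles whose mean wage runs into the tens or hundreds of millions, while SOFTWARE ENGINEER, with more than 33,000 cases, averages about 112,000. A plausible reading, which the notebook does not confirm, is that a few rare titles carry extreme or mis-scaled wage entries. The sketch below, on made-up data, reports the median and the number of cases per title and ignores titles with very few cases; `MIN_CASES` is a hypothetical threshold chosen for the example.

```python
import pandas as pd

# Hypothetical sample: one common title and one rare title with an extreme wage.
sample = pd.DataFrame({
    "JOB_TITLE": ["SOFTWARE ENGINEER"] * 3 + ["RARE TITLE"],
    "WAGES": [100000.0, 110000.0, 120000.0, 357136000.0],
})

# Median is less sensitive to single extreme values than the mean;
# the case count shows how much data backs each figure.
summary = (
    sample.groupby("JOB_TITLE")
          .agg(median_wage=("WAGES", "median"), n_cases=("WAGES", "size"))
          .reset_index()
)

MIN_CASES = 3  # hypothetical cut-off, to be tuned on the real data
robust = summary[summary["n_cases"] >= MIN_CASES].sort_values("median_wage", ascending=False)

print(robust)
```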
