## Analysing H1B Acceptance Trends 

The H1B visa is a nonimmigrant visa issued to graduate-level applicants, allowing them to work in the United States. The employer sponsors the H1B visa for workers with theoretical or technical expertise in specialized fields such as IT, finance, and accounting. An interesting fact about immigrant workers is that about 52 percent of new Silicon Valley companies founded between 1995 and 2005 were started by such workers. Some famous CEOs, such as Indra Nooyi (PepsiCo), Elon Musk (Tesla), Sundar Pichai (Google), and Satya Nadella (Microsoft), once arrived in the US on an H1B visa.\
**Motivation**: Our team consists of five international graduate students who will be applying for H1B visas in the future. The visa application process seems long, complicated, and uncertain, so we decided to study it and use machine learning algorithms to predict H1B acceptance and examine its trends.

### Data 
The data used in this project was collected from <a href="https://www.foreignlaborcert.doleta.gov/performancedata.cfm">the Office of Foreign Labor Certification (OFLC)</a>. The data provides insight into each petition, with information such as the job title, wage, employer, and worksite location. To get the dataset, follow the link above, click on Disclosure Data, and scroll down to the H1B data.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!pip install autocorrect
import pandas as pd
import numpy as np
import nltk,warnings
import clean_wage as cw
%matplotlib inline
import remove
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from statistics import mean
nltk.download('words')
from autocorrect import Speller 
from sklearn import metrics
from sklearn import model_selection
from sklearn.preprocessing import OneHotEncoder #ONE HOT ENCODING
from sklearn.ensemble import RandomForestClassifier #Build model - Random Forest Classifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
nltk.download('wordnet')

### Exploratory Data Analysis

Before working with the data we need to understand its traits, which we do through exploratory data analysis (EDA). The dataset has about 260 columns, but not all of them carry information that contributes to our analysis. We therefore keep only columns such as the case status (certified/denied), employer, and job title.

In [None]:
file=pd.read_csv('/content/gdrive/My Drive/H-1B_Disclosure_Data_FY2019.csv')#Read the csv file and store it in file

In [None]:
# Keep only the columns needed for the analysis; .copy() avoids SettingWithCopyWarning on the in-place edits below
cleaned=file[['CASE_NUMBER','CASE_STATUS','CASE_SUBMITTED','DECISION_DATE','VISA_CLASS','FULL_TIME_POSITION','JOB_TITLE','SOC_CODE','SOC_TITLE',
              'EMPLOYER_NAME','WAGE_RATE_OF_PAY_FROM_1','WAGE_UNIT_OF_PAY_1','NAICS_CODE','WORKSITE_CITY_1','WORKSITE_STATE_1']].copy()
cleaned.shape

In [None]:
print(cleaned['VISA_CLASS'].value_counts()) # similarly we can find the categories in CASE_STATUS and 'FULL_TIME_POSITION'
# Visa class has many categories which are not of use , we require only H1B visa type , hence we drop all records with other visa types
cleaned.drop(labels=cleaned.loc[cleaned['VISA_CLASS']!='H-1B'].index , inplace=True)
#In case status we can drop withdrawn records and we can change certified-withdrawn to certified
cleaned.replace({"CASE_STATUS":"CERTIFIED-WITHDRAWN"},"CERTIFIED",inplace=True)
cleaned.drop(labels=cleaned.loc[cleaned['CASE_STATUS']=='WITHDRAWN'].index , inplace=True)
cleaned.info()

In [None]:
#the column wages has a mix of both string and float value types and some record have the symbol '$' which we want to remove
cleaned['WAGE_RATE_OF_PAY_FROM_1'].apply(type).value_counts()
cleaned['WAGES']=cleaned['WAGE_RATE_OF_PAY_FROM_1'].apply(cw.clean_wages).astype('float')
cleaned['WAGE_UNIT_OF_PAY_1'].value_counts()# the wage information that we have available has different unit of pay
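
The `cw.clean_wages` helper comes from a local module that is not shown in this notebook. As a rough illustration of the kind of cleaning described in the comment above (stripping `$` and converting to a float), here is a minimal sketch. The name `clean_wage_sketch` and its behaviour for unparseable values are assumptions made for this example, not the project's actual implementation.

```python
import math

def clean_wage_sketch(value):
    """Hypothetical stand-in for cw.clean_wages: return the wage as a float, or NaN."""
    if value is None:
        return math.nan
    if isinstance(value, (int, float)):
        return float(value)
    # Strip currency symbols, thousands separators and surrounding whitespace.
    text = str(value).replace("$", "").replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        # Assumption: anything that cannot be parsed is treated as missing.
        return math.nan

print(clean_wage_sketch("$85,000.00"))  # 85000.0
print(clean_wage_sketch(72000))         # 72000.0
```

Returning NaN for unparseable entries fits the later `dropna` step, which would then discard those records.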

In [None]:
# we convert the different units of pay to the type 'Year'
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Month',cleaned['WAGES'] * 12,cleaned['WAGES'])
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Hour',cleaned['WAGES'] * 2080,cleaned['WAGES']) # 2080=8 hours*5 days* 52 weeks
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Bi-Weekly',cleaned['WAGES'] *26,cleaned['WAGES'])
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Week',cleaned['WAGES'] * 52,cleaned['WAGES'])
cleaned.drop(columns=['WAGE_RATE_OF_PAY_FROM_1','WAGE_UNIT_OF_PAY_1'],axis=1,inplace=True)#we can drop the columns as cleaning is complete
cleaned.dropna(inplace=True)# We can check if we have null records using cleaned.info() and drop null records
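
The four `np.where` calls above all apply the same idea: multiply the wage by the number of pay periods in a year. A possible way to express that as a single lookup table is sketched below. The names `ANNUAL_FACTOR` and `annualize` are introduced only for this illustration, and the factors simply mirror the ones used in the cell above.

```python
# Multipliers that convert a wage quoted per unit into a yearly figure.
# They mirror the notebook's factors; 2080 = 8 hours * 5 days * 52 weeks.
ANNUAL_FACTOR = {"Year": 1, "Month": 12, "Bi-Weekly": 26, "Week": 52, "Hour": 2080}

def annualize(wage, unit):
    """Return the yearly wage, or None when the unit is not recognised."""
    factor = ANNUAL_FACTOR.get(unit)
    return None if factor is None else wage * factor

print(annualize(40.0, "Hour"))     # 83200.0
print(annualize(6000.0, "Month"))  # 72000.0
```

With pandas, the same table could be applied in one step before the unit column is dropped, for example by mapping `WAGE_UNIT_OF_PAY_1` through the dictionary with `Series.map` and multiplying the result by `WAGES`.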

In [None]:
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([cw.remove_num(i) for i in txt.lower().split()]))
cleaned['JOB_TITLE']=cleaned['JOB_TITLE'].str.replace(',', '')
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([cw.remove_num(i) for i in txt.lower().split()]))
cleaned['SOC_TITLE']=cleaned['SOC_TITLE'].str.replace(',', '')

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()
words = set(nltk.corpus.words.words())
spell = Speller()
def lemmatize_text(text):
     return lemmatizer.lemmatize(text)
def spelling_checker(text):
     return spell(text)

In [None]:
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([lemmatize_text(i) for i in txt.lower().split()]))
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([spelling_checker(i) for i in txt.lower().split()]))
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([lemmatize_text(i) for i in txt.lower().split()]))
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([spelling_checker(i) for i in txt.lower().split()]))

In [None]:
cleaned = cleaned.groupby("SOC_CODE").filter(lambda x: len(x) > 15)
cleaned = cleaned.groupby("SOC_TITLE").filter(lambda x: len(x) > 15)
cleaned['EMPLOYER_NAME'].value_counts()
cleaned = cleaned.groupby("EMPLOYER_NAME").filter(lambda x: len(x) > 15)
Top_Employer=cleaned['EMPLOYER_NAME'].value_counts()[:10]
cleaned['WAGE_CATEGORY'] = cleaned['WAGES'].apply(remove.wage_feature_eng)

In [None]:
plt.figure(figsize=[8,8])
ax=sns.barplot(y=Top_Employer.index,x=Top_Employer.values,palette=sns.color_palette('viridis',10))
ax.tick_params(labelsize=12)
for i, v in enumerate(Top_Employer.values): 
    ax.text(.5, i, v,fontsize=15,color='white',weight='bold')
plt.title('Top 10 Companies sponsoring H1B Visa in 2019', fontsize=20)
plt.show()

In [None]:
state={"AL":"ALABAMA","AK":"ALASKA","AZ":"ARIZONA","AR":"ARKANSAS","CA":"CALIFORNIA","CO":"COLORADO","DE":"DELAWARE",\
       "FL":"FLORIDA","GA":"GEORGIA","HI":"HAWAII","ID":"IDAHO","IL":"ILLINOIS","IN":"INDIANA","IA":"IOWA","KS":"KANSAS",\
       "KY":"KENTUCKY","LA":"LOUISIANA","ME":"MAINE","MD":"MARYLAND","MA":"MASSACHUSETTS","MI":"MICHIGAN","MN":"MINNESOTA",\
       "MS":"MISSISSIPPI","MO":"MISSOURI","MT":"MONTANA","NE":"NEBRASKA","NV":"NEVADA","NH":"NEW HAMPSHIRE","NJ":"NEW JERSEY",\
       "NM":"NEW MEXICO","NY":"NEW YORK","NC":"NORTH CAROLINA","ND":"NORTH DAKOTA","OH":"OHIO","OK":"OKLAHOMA","OR":"OREGON",\
       "PA":"PENNSYLVANIA","RI":"RHODE ISLAND","SC":"SOUTH CAROLINA","SD":"SOUTH DAKOTA","TN":"TENNESSEE","TX":"TEXAS",\
       "UT":"UTAH","VT":"VERMONT","VA":"VIRGINIA","WA":"WASHINGTON","WV":"WEST VIRGINIA","WI":"WISCONSIN","WY":"WYOMING",\
       "PR":"PUERTO RICO","VI":"U.S. VIRGIN ISLANDS","MP":"NORTHERN MARIANA ISLANDS","GU":"GUAM","MH":"MARSHALL ISLANDS",\
       "PW":"PALAU","DC":"DISTRICT OF COLUMBIA","CT":"CONNECTICUT"}
cleaned = cleaned.replace({"WORKSITE_STATE_1": state}) # replace() returns a new DataFrame, so assign the result back
cleaned = cleaned.groupby("WORKSITE_STATE_1").filter(lambda x: len(x) > 15) #removing less significant records

In [None]:
#CONVERTING CATEGORICAL COLUMNS INTO NUMERIC COLUMNS
cleaned.loc[(cleaned.CASE_STATUS == "CERTIFIED"),"CASE_STATUS"] = 1
cleaned.loc[(cleaned.CASE_STATUS == "DENIED"),"CASE_STATUS"] = 0
cleaned.loc[(cleaned.FULL_TIME_POSITION == "Y"),"FULL_TIME_POSITION"] = 1
cleaned.loc[(cleaned.FULL_TIME_POSITION == "N"),"FULL_TIME_POSITION"] = 0

### Baseline classifier

The baseline classifier is a deliberately simple model: it always predicts a single class derived from the central tendency of the training labels (the mode, or the rounded mean, of the 'certified' and 'denied' labels for H1B visa approvals). It gives us the base accuracy against which we compare our classifier, which should achieve a better accuracy than the baseline.


In [None]:
# This step assigns a binary class label (0 or 1) to each H1B visa decision.
# CASE_STATUS was already encoded above (CERTIFIED -> 1, DENIED -> 0), so we compare against 1.
def create_class_labels(processed_data):
    y = np.where((processed_data['CASE_STATUS']==1),1, 0)
    return y
X = cleaned['CASE_STATUS'].to_numpy()# Placeholder input: the baseline ignores the features
y = create_class_labels(cleaned)# Ground-truth labels for the dataset
counts = cleaned['CASE_STATUS'].value_counts()
print(counts)
print('proportion: ', counts.loc[1]/counts.loc[0], ': 1')# Certified applications per denied application

In [None]:
class BaselineClasifier(): # Baseline classifier that predicts a single class based on the mode (or rounded mean) of the labels.
    def __init__(self):
        self.central_tendency = None
    def fit(self, data, y, central_t='mode'): 
        label, counts = np.unique(y, return_counts=True) # Count labels and find the most frequent one 
        if central_t == 'mode':
            self.central_tendency = label[counts.argmax()] # The most frequent label itself, not its position
        elif central_t == 'mean':
            self.central_tendency = round(np.sum(y)/len(y))
        return self
    def predict(self, data):
        # Return an array with one entry per row of data, each set to the fitted central tendency.
        result = np.full(data.shape[0], self.central_tendency)
        return result

In [None]:
def compute_accuracy(validation, predicted):
    comp = predicted == validation # Element-wise comparison of predictions and true labels
    match_counts = np.count_nonzero(comp) 
    clasifier_accuracy = match_counts/len(validation)
    return clasifier_accuracy  
def compute_AUC(y, prediction):
    auc = roc_auc_score(y, prediction)
    return auc

In [None]:
accuracies = []
kf = sklearn.model_selection.KFold(n_splits=4, random_state=1, shuffle=True)# Testing with K-folds 
for train_idx, test_idx in kf.split(X):
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx] 
    baseline_clasifier = BaselineClasifier()
    classifier = baseline_clasifier.fit(X_train, y_train, 'mean')
    prediction = baseline_clasifier.predict(X_test)
    fold_accuracy = compute_accuracy(y_test, prediction)
    fold_AUC = compute_AUC(y_test, prediction)
    accuracies.append(fold_accuracy)
baseline_clasifier_accuracy = mean(accuracies)
print('Baseline accuracy: ', baseline_clasifier_accuracy) 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)# Testing with regular split
baseline_clasifier = BaselineClasifier()
classifier = baseline_clasifier.fit(X_train, y_train, 'mean')
prediction = baseline_clasifier.predict(X_test)
split_accuracy = compute_accuracy(y_test, prediction)
split_AUC = compute_AUC(y_test, prediction)
print('Baseline accuracy: ', split_accuracy)  

The accuracy of the baseline classifier is 0.99. This result is due to the highly imbalanced data: there are 624682 CERTIFIED applications and 5158 DENIED applications, a ratio of about 121.1 to 1. Accuracy is therefore not a good performance measure here. A better measure for imbalanced data is the area under the ROC curve (AUC). It measures the likelihood that, given two random points (one from the positive class and one from the negative class), the classifier ranks the point from the positive class higher than the one from the negative class.
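
To make the pairwise reading of AUC concrete, the sketch below computes it directly from that definition on a small toy example and compares it with accuracy for an always-"certified" predictor. The function name `pairwise_auc` and the toy labels are assumptions made for illustration; the notebook itself uses `roc_auc_score` from scikit-learn, and this quadratic loop is only practical for small inputs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels with a similar kind of imbalance: 99 certified, 1 denied.
y_true = [1] * 99 + [0]
constant = [1] * 100  # Baseline: always predict "certified"
accuracy = sum(int(p == y) for p, y in zip(constant, y_true)) / len(y_true)
print(accuracy)                         # 0.99
print(pairwise_auc(y_true, constant))   # 0.5
```

The constant predictor scores 0.99 on accuracy but only 0.5 on AUC, because it cannot rank any certified case above a denied one.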

In [None]:
Dataset = cleaned[["CASE_STATUS", "FULL_TIME_POSITION","JOB_TITLE", "SOC_CODE", "SOC_TITLE", "EMPLOYER_NAME", "WORKSITE_STATE_1", "WAGE_CATEGORY"]].copy()# Copy so that new columns can be added without SettingWithCopyWarning
Top_Job_positions = Dataset["JOB_TITLE"].value_counts().head(72)
def job_function(job):#Considered top positions in job roles
    if job in Top_Job_positions:
        return job
    else:
        return "others"
Dataset["JOB_POSITION"] = Dataset["JOB_TITLE"].apply(job_function)
Dataset["JOB_POSITION"].value_counts()

In [None]:
Top_SOC_CODES = Dataset["SOC_CODE"].value_counts().head(70)#Considered top domains
def soc_function(soc):
    if soc in Top_SOC_CODES:
        return soc
    else:
        return "others"
Dataset["TOP_SOC_CODE"] = Dataset["SOC_CODE"].apply(soc_function)
Dataset["TOP_SOC_CODE"].value_counts()

In [None]:
Top_SOC_TITLE = Dataset["SOC_TITLE"].value_counts().head(70)#Considered names of top domains
def soc_function(soc):
    if soc in Top_SOC_TITLE:
        return soc
    else:
        return "others"
Dataset["TOP_SOC_TITLE"] = Dataset["SOC_TITLE"].apply(soc_function)
Dataset["TOP_SOC_TITLE"].value_counts()

In [None]:
Top_EMPLOYER = Dataset["EMPLOYER_NAME"].value_counts().head(70)#Considered top employers
def emp_function(emp):
    if emp in Top_EMPLOYER:
        return emp
    else:
        return "others"
Dataset["TOP_EMPLOYER"] = Dataset["EMPLOYER_NAME"].apply(emp_function)
Dataset["TOP_EMPLOYER"].value_counts()

In [None]:
Dataset.drop(columns=['JOB_TITLE','SOC_CODE','SOC_TITLE','EMPLOYER_NAME'],inplace=True)
Encoding = OneHotEncoder(handle_unknown='ignore')# Returns a sparse matrix by default
# Reuse Dataset's index so that the join below aligns rows correctly after the earlier filtering
Encoding_df = pd.DataFrame(Encoding.fit_transform(Dataset[["WORKSITE_STATE_1","JOB_POSITION","TOP_SOC_CODE","TOP_SOC_TITLE","TOP_EMPLOYER"]]).toarray(), index=Dataset.index)
Dataset = Dataset.join(Encoding_df)
Dataset.drop(columns=['WORKSITE_STATE_1','JOB_POSITION','TOP_SOC_CODE','TOP_SOC_TITLE','TOP_EMPLOYER','WAGE_CATEGORY'],inplace=True)

In [None]:
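X = Dataset.drop(columns=['CASE_STATUS']).to_numpy(dtype=float)# Features: FULL_TIME_POSITION plus the one-hot encoded columns
Y = Dataset['CASE_STATUS'].astype(int)# Labels: 1 = certified, 0 = denied
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=20)#Split the data into training and test sets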
model_clf = RandomForestClassifier(n_jobs=2,random_state=0)
model_clf.fit(X_train,y_train)#train the model
prediction_test = model_clf.predict(X_test)#test the model (predict with our test data)
print("Accuracy = ", metrics.accuracy_score(y_test, prediction_test))#compare with original value, Y_test

In [None]:
#Test a simple input to predict its STATUS.
status_label = np.array(['Denied','Approved'])
new_pred_number = model_clf.predict([
    [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,\
     0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,\
     0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]])
new_pred_label= status_label[ new_pred_number ]
new_pred_label

### Reflection

What is the hardest part of the project that you’ve encountered so far?

1. Setting up the data for visualization and ML analysis; for example, the same job title appears in many variants cluttered with extra words, integers, and punctuation characters.
2. Encoding the dataset for use in the classifier. We first tried the JOB_TITLE attribute but ran into a memory error, so we switched to the SOC_TITLE attribute instead.
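
Both difficulties come down to the number of distinct category values. The sketch below illustrates one way to reduce that number before one-hot encoding: normalize the titles, then keep only the most frequent ones and map the rest to "others", in the spirit of the `job_function` helper above. The function names and the sample titles are hypothetical and only serve this example.

```python
import re
from collections import Counter

def normalize_title(title):
    """Lower-case a job title and drop digits and punctuation."""
    text = re.sub(r"[^a-z\s]", " ", title.lower())
    return " ".join(text.split())

def bucket_top_k(values, k):
    """Keep the k most frequent values and map everything else to 'others'."""
    top = {value for value, _ in Counter(values).most_common(k)}
    return [value if value in top else "others" for value in values]

# Made-up titles showing the kind of clutter described above.
titles = ["Software Engineer 2", "SOFTWARE ENGINEER, 3", "software engineer",
          "Data Analyst", "Sr. Accountant 1"]
cleaned_titles = [normalize_title(t) for t in titles]
print(cleaned_titles)
print(bucket_top_k(cleaned_titles, k=1))
```

Fewer distinct values mean fewer one-hot columns, which is what keeps the dense encoded matrix within memory.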