# Analysing H1B Acceptance Trends 

The H1B visa is a nonimmigrant visa issued to graduate-level workers that allows them to work in the United States. The employer sponsors the H1B visa for workers with theoretical or technical expertise in specialized fields such as IT, finance, and accounting. An interesting fact about immigrant workers is that about 52 percent of new Silicon Valley companies were founded by such workers between 1995 and 2005. Some famous CEOs, such as Indra Nooyi (PepsiCo), Elon Musk (Tesla), Sundar Pichai (Google), and Satya Nadella (Microsoft), once came to the US on an H1B visa.

**Motivation**: Our team consists of five international graduate students who will be applying for H1B visas in the future. The visa application process seems long, complicated, and uncertain, so we decided to study it and use machine learning algorithms to predict the acceptance rate and trends of H1B visas.

## Data 
The data used in the project has been collected from <a href="https://www.foreignlaborcert.doleta.gov/performancedata.cfm">the Office of Foreign Labor Certification (OFLC).</a> 

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:

!pip install autocorrect
import nltk
from textblob import TextBlob
from autocorrect import Speller 
nltk.download('wordnet')
import pandas as pd
import numpy as np
import warnings

## Exploratory Data Analysis

Before working with the data we need to understand its traits, which we do through exploratory data analysis (EDA). The file has about 260 columns, but not all of them carry information that contributes to our analysis. Hence we pick out columns such as the case status (accepted/denied), employer, and job title.
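
The column selection described above can also be done while the file is being read. The following is a minimal sketch, assuming a tiny in-memory CSV with made-up rows and a hypothetical `EXTRA_COLUMN` standing in for the unused columns; `usecols` is a standard `pandas.read_csv` parameter.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the disclosure file (hypothetical rows).
csv_text = io.StringIO(
    "CASE_NUMBER,CASE_STATUS,VISA_CLASS,EXTRA_COLUMN\n"
    "I-200-0001,CERTIFIED,H-1B,x\n"
    "I-200-0002,DENIED,H-1B,y\n"
)

# Only the listed columns are parsed; everything else is skipped.
wanted = ["CASE_NUMBER", "CASE_STATUS", "VISA_CLASS"]
subset = pd.read_csv(csv_text, usecols=wanted)
print(subset.shape)
```

Reading only the needed columns can lower memory use on a wide file, although how much it helps on the real data has not been measured here.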

In [None]:
# Read the CSV file into the DataFrame `file`
file=pd.read_csv('/content/gdrive/My Drive/H-1B_Disclosure_Data_FY2019.csv')

In [None]:
file.shape

In [None]:
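# Take an explicit copy so that the in-place edits below modify `cleaned`
# itself rather than a view of `file` (avoids SettingWithCopyWarning)
cleaned=file[['CASE_NUMBER','CASE_STATUS','CASE_SUBMITTED','DECISION_DATE','VISA_CLASS','FULL_TIME_POSITION','JOB_TITLE','SOC_CODE','SOC_TITLE',\
              'EMPLOYER_NAME','WAGE_RATE_OF_PAY_FROM_1','WAGE_UNIT_OF_PAY_1','NAICS_CODE','WORKSITE_CITY_1','WORKSITE_STATE_1']].copy()
cleaned.head()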

In [None]:
cleaned.shape

In [None]:
cleaned['VISA_CLASS'].value_counts()

In [None]:
# VISA_CLASS has several categories, but we only need the H-1B visa type, so we drop all records with other visa types
cleaned.drop(labels=cleaned.loc[cleaned['VISA_CLASS']!='H-1B'].index , inplace=True)

In [None]:
cleaned['FULL_TIME_POSITION'].value_counts()

In [None]:
cleaned['CASE_STATUS'].value_counts()

In [None]:
# We only need accepted and denied cases, so we drop the withdrawn cases from the data frame.
# CERTIFIED-WITHDRAWN cases were certified first and withdrawn later, so they can be counted as certified.
cleaned.replace({"CASE_STATUS":"CERTIFIED-WITHDRAWN"},"CERTIFIED",inplace=True)
cleaned.drop(labels=cleaned.loc[cleaned['CASE_STATUS']=='WITHDRAWN'].index , inplace=True)
cleaned.head()

In [None]:
#cleaned.info()

In [None]:
# The wage column mixes string and float values, and some records contain the symbol '$', which we want to remove
cleaned['WAGE_RATE_OF_PAY_FROM_1'].apply(type).value_counts()

In [None]:
cleaned['WORKSITE_STATE_1'].apply(type).value_counts()

In [None]:
def clean_wages(w):
    """Remove the '$' symbol and thousands separators from a wage entry.

    The wage column consists of str and float values: string entries are
    stripped of the symbols, any other value is returned unchanged.
    """
    if isinstance(w, str):
        return w.replace('$', '').replace(',', '')
    return w

In [None]:
cleaned['WAGES']=cleaned['WAGE_RATE_OF_PAY_FROM_1'].apply(clean_wages).astype('float')
#cleaned.info()

In [None]:
# the wage information that we have available has different unit of pay
cleaned['WAGE_UNIT_OF_PAY_1'].value_counts()

In [None]:
# we convert the different units of pay to the type 'Year'
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Month',cleaned['WAGES'] * 12,cleaned['WAGES'])
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Hour',cleaned['WAGES'] * 2080,cleaned['WAGES']) # 2080=8 hours*5 days* 52 weeks
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Bi-Weekly',cleaned['WAGES'] *26,cleaned['WAGES'])
cleaned['WAGES'] = np.where(cleaned['WAGE_UNIT_OF_PAY_1'] == 'Week',cleaned['WAGES'] * 52,cleaned['WAGES'])
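
The unit conversion above can also be written as a single lookup. This is a minimal sketch on made-up wages, assuming the same multipliers as the cell above; `PERIODS_PER_YEAR`, `demo`, and `ANNUAL_WAGES` are hypothetical names.

```python
import pandas as pd

# Multipliers that turn each pay unit into a yearly figure
# (2080 = 8 hours * 5 days * 52 weeks).
PERIODS_PER_YEAR = {"Year": 1, "Month": 12, "Bi-Weekly": 26, "Week": 52, "Hour": 2080}

# Made-up wages, one row per pay unit.
demo = pd.DataFrame({
    "WAGES": [60000.0, 5000.0, 2000.0, 1000.0, 30.0],
    "WAGE_UNIT_OF_PAY_1": ["Year", "Month", "Bi-Weekly", "Week", "Hour"],
})

# Look up the multiplier for each row and scale the wage by it.
demo["ANNUAL_WAGES"] = demo["WAGES"] * demo["WAGE_UNIT_OF_PAY_1"].map(PERIODS_PER_YEAR)
print(demo)
```

One difference from the `np.where` chain: a unit missing from the dictionary yields `NaN` instead of leaving the wage unchanged, which makes unexpected units easier to spot.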

In [None]:
# The yearly wage is now stored in WAGES, so we can drop the two original wage columns
cleaned.drop(columns=['WAGE_RATE_OF_PAY_FROM_1','WAGE_UNIT_OF_PAY_1'],inplace=True)


In [None]:
cleaned.info()

In [None]:
"""
We should remove record that have null objects, from the above cell we see
that all columns don't have same number of non-null records
which means we have to remove the records that have the null values.
we see that there are about 17 records that have null values
""" 
null_rows = cleaned.isnull().any(axis=1)
print(cleaned[null_rows].shape)
print(cleaned.shape)

In [None]:
cleaned.dropna(inplace=True)
print(cleaned.shape)

In [None]:
#cleaned['JOB_TITLE'].value_counts()

In [None]:
# The job titles contain numbers and words with embedded digits; drop those tokens
# and remove commas as well
def remove_num(text):
  if not any(c.isdigit() for c in text):
    return text
  return ''
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([remove_num(i) for i in txt.lower().split()]))
cleaned['JOB_TITLE']=cleaned['JOB_TITLE'].str.replace(',', '')
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([remove_num(i) for i in txt.lower().split()]))
cleaned['SOC_TITLE']=cleaned['SOC_TITLE'].str.replace(',', '')
print("Numbers and strings with numbers removed" )
#cleaned.head()
#cleaned['JOB_TITLE'].value_counts()

In [None]:
nltk.download('words')
lemmatizer = nltk.stem.WordNetLemmatizer()
words = set(nltk.corpus.words.words())
spell = Speller()


def lemmatize_text(text):
  return lemmatizer.lemmatize(text)

def spelling_checker(text):
  return spell(text)
 
print(spelling_checker("computr sciece progam check"))

In [None]:
# Lemmatize each word of JOB_TITLE (the spell-checking pass in the next cell is the slow part)
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([lemmatize_text(i) for i in txt.lower().split()]))
print(' JOB_TITLE IS  lemmatized')
#print(cleaned['JOB_TITLE'].value_counts() 

In [None]:
cleaned['JOB_TITLE']=cleaned.JOB_TITLE.apply(lambda txt: " ".join([spelling_checker(i) for i in txt.lower().split()]))
print('JOB_TITLE SPELLING MISTAKES RECTIFIED')

In [None]:
#clean SOC TITLE
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([lemmatize_text(i) for i in txt.lower().split()]))
cleaned['SOC_TITLE']=cleaned.SOC_TITLE.apply(lambda txt: " ".join([spelling_checker(i) for i in txt.lower().split()]))
print('SOC_TITLE SPELLING MISTAKES RECTIFIED')
cleaned = cleaned.groupby("SOC_TITLE").filter(lambda x: len(x) > 15)
print(" removing least significant values from SOC_TITLE")
#cleaned['SOC_TITLE'].value_counts()

In [None]:
cleaned = cleaned.groupby("SOC_CODE").filter(lambda x: len(x) > 15)
print("DROPPING THE LEAST SIGNIFICANT SOC CODES")
#cleaned['SOC_CODE'].value_counts()
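
The `groupby(...).filter(lambda x: len(x) > 15)` pattern used in these cells calls a Python function once per group. A vectorized alternative is sketched below on made-up data; the threshold of 2 and the name `min_count` are assumptions chosen to keep the example small, while the notebook itself uses 15.

```python
import pandas as pd

# Made-up SOC codes: one frequent, one less frequent, one rare.
demo = pd.DataFrame({"SOC_CODE": ["15-1132"] * 4 + ["13-2011"] * 2 + ["17-2141"]})

min_count = 2  # hypothetical threshold; the cells above use 15

# Attach to every row the size of the group it belongs to...
group_sizes = demo["SOC_CODE"].map(demo["SOC_CODE"].value_counts())
# ...and keep only rows whose group has more than `min_count` members.
kept = demo[group_sizes > min_count]
print(kept["SOC_CODE"].value_counts())
```

The result should match what `groupby("SOC_CODE").filter(lambda x: len(x) > min_count)` returns, without the per-group function call.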

In [None]:
#we see that the job title has integers in the record which we can remove
#handeled above
#so commenting this part
#cleaned['JOB_TITLE']=cleaned['JOB_TITLE'].str.replace('[0-9(){}[].]', '')
#cleaned.head()

In [None]:
cleaned['SOC_TITLE'].value_counts()

In [None]:
cleaned['EMPLOYER_NAME'].value_counts()
cleaned = cleaned.groupby("EMPLOYER_NAME").filter(lambda x: len(x) > 15)
print("DROPPING THE LEAST SIGNIFICANT EMPLOYERS")
#cleaned['EMPLOYER_NAME'].value_counts()

In [None]:
Top_Employer=cleaned['EMPLOYER_NAME'].value_counts()[:10]
Top_Employer

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize=[10,10])
ax=sns.barplot(y=Top_Employer.index,x=Top_Employer.values,palette=sns.color_palette('viridis',10))
ax.tick_params(labelsize=12)
for i, v in enumerate(Top_Employer.values): 
    ax.text(.5, i, v,fontsize=15,color='white',weight='bold')
plt.title('Top 10 Companies sponsoring H1B Visa in 2019', fontsize=20)
plt.show()

In [None]:
def wage_feature_eng(wage):
    # Use comparisons instead of `wage in range(...)`: range membership only
    # matches whole numbers, so a wage such as 60000.50 would return None.
    if wage <=50000:
        return "VERY LOW"
    elif wage < 75000:
        return "LOW"
    elif wage < 100000:
        return "AVERAGE"
    elif wage < 150000:
        return "HIGH"
    elif wage >=150000:
        return "VERY HIGH"
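
The same banding can be sketched with `pandas.cut`, which labels a whole column at once. This is an illustration on made-up wages, not the notebook's method. It assumes left-closed bins, so a wage of exactly 50000 falls into "LOW" here, whereas `wage_feature_eng` puts it in "VERY LOW".

```python
import numpy as np
import pandas as pd

# Made-up yearly wages covering every band.
wages = pd.Series([40000.0, 60000.5, 75000.0, 120000.0, 150000.0])

# Bin edges and labels follow the thresholds in wage_feature_eng.
edges = [-np.inf, 50000, 75000, 100000, 150000, np.inf]
labels = ["VERY LOW", "LOW", "AVERAGE", "HIGH", "VERY HIGH"]

# right=False makes each bin include its left edge: [50000, 75000), ...
categories = pd.cut(wages, bins=edges, labels=labels, right=False)
print(categories.tolist())
```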

In [None]:
cleaned['WAGE_CATEGORY'] = cleaned['WAGES'].apply(wage_feature_eng)
cleaned.head()

In [None]:
cleaned.columns

In [None]:
print("BEFORE CLEANING THE WORKSITE_STATE_1 COLUMN")
cleaned["WORKSITE_STATE_1"].value_counts()

In [None]:
# Clean the "WORKSITE_STATE_1" column: some values are abbreviations and some are full names.
# Most values are names, so the abbreviations are converted to names.
STATE_NAMES = {
    "AL": "ALABAMA", "AK": "ALASKA", "AZ": "ARIZONA", "AR": "ARKANSAS",
    "CA": "CALIFORNIA", "CO": "COLORADO", "DE": "DELAWARE", "FL": "FLORIDA",
    "GA": "GEORGIA", "HI": "HAWAII", "ID": "IDAHO", "IL": "ILLINOIS",
    "IN": "INDIANA", "IA": "IOWA", "KS": "KANSAS", "KY": "KENTUCKY",
    "LA": "LOUISIANA", "ME": "MAINE", "MD": "MARYLAND", "MA": "MASSACHUSETTS",
    "MI": "MICHIGAN", "MN": "MINNESOTA", "MS": "MISSISSIPPI", "MO": "MISSOURI",
    "MT": "MONTANA", "NE": "NEBRASKA", "NV": "NEVADA", "NH": "NEW HAMPSHIRE",
    "NJ": "NEW JERSEY", "NM": "NEW MEXICO", "NY": "NEW YORK", "NC": "NORTH CAROLINA",
    "ND": "NORTH DAKOTA", "OH": "OHIO", "OK": "OKLAHOMA", "OR": "OREGON",
    "PA": "PENNSYLVANIA", "RI": "RHODE ISLAND", "SC": "SOUTH CAROLINA", "SD": "SOUTH DAKOTA",
    "TN": "TENNESSEE", "TX": "TEXAS", "UT": "UTAH", "VT": "VERMONT",
    "VA": "VIRGINIA", "WA": "WASHINGTON", "WV": "WEST VIRGINIA", "WI": "WISCONSIN",
    "WY": "WYOMING", "PR": "PUERTO RICO", "VI": "U.S. VIRGIN ISLANDS",
    "MP": "NORTHERN MARIANA ISLANDS", "GU": "GUAM", "MH": "MARSHALL ISLANDS",
    "PW": "PALAU", "DC": "DISTRICT OF COLUMBIA", "CT": "CONNECTICUT",
}

# Replace every value that exactly matches an abbreviation; other values stay unchanged
cleaned["WORKSITE_STATE_1"] = cleaned["WORKSITE_STATE_1"].replace(STATE_NAMES)
print("AFTER CLEANING THE WORKSITE_STATE_1 COLUMN")


In [None]:
# Print the counts explicitly so they are shown in the cell output
print(cleaned["WORKSITE_STATE_1"].value_counts())

In [None]:
cleaned = cleaned.groupby("WORKSITE_STATE_1").filter(lambda x: len(x) > 15)
print("DROPPING LEAST SIGNIFICANT STATES")
#cleaned["WORKSITE_STATE_1"].value_counts()

In [None]:
#CONVERTING CATEGORICAL COLUMNS INTO NUMERIC COLUMNS
print(cleaned["CASE_STATUS"].value_counts())
cleaned.loc[(cleaned.CASE_STATUS == "CERTIFIED"),"CASE_STATUS"] = 1
cleaned.loc[(cleaned.CASE_STATUS == "DENIED"),"CASE_STATUS"] = 0
print(cleaned["CASE_STATUS"].value_counts())

In [None]:
print(cleaned["FULL_TIME_POSITION"].value_counts())
cleaned.loc[(cleaned.FULL_TIME_POSITION == "Y"),"FULL_TIME_POSITION"] = 1
cleaned.loc[(cleaned.FULL_TIME_POSITION == "N"),"FULL_TIME_POSITION"] = 0
print(cleaned["FULL_TIME_POSITION"].value_counts())

### Baseline classifier

The baseline classifier is a deliberately simple model: it always predicts a single class derived from the training labels ('certified' or 'denied'). Here we use the rounded mean of the binary labels, which coincides with the majority class. It gives us a base accuracy against which to compare our classifier, which should achieve a better accuracy than the baseline.
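
For comparison, scikit-learn ships a ready-made majority-class baseline, `DummyClassifier`. The sketch below runs it on made-up labels with an assumed 95:5 class ratio; it is not the hand-written baseline used in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up imbalanced labels: 95 certified (1) and 5 denied (0).
y_demo = np.array([1] * 95 + [0] * 5)
# The baseline ignores the features, so a placeholder column is enough.
X_demo = np.zeros((len(y_demo), 1))

# "most_frequent" always predicts the majority class seen during fit.
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = dummy.predict(X_demo)
acc = accuracy_score(y_demo, pred)
print(acc)
```

The accuracy simply equals the share of the majority class, which is why a high baseline accuracy says little on imbalanced data.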


In [None]:
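# Assign a binary class label (0 or 1) to each H1B application:
# 'CERTIFIED' is mapped to 1 and 'DENIED' to 0.

def create_class_labels(processed_data):
    
    # CASE_STATUS was already converted to 1/0 in an earlier cell, so accept
    # both the original string and the numeric encoding of a certified case.
    y = np.where(processed_data['CASE_STATUS'].isin(['CERTIFIED', 1]), 1, 0)
    
    return y

# The baseline ignores the features, so X only needs to have the right length
X = cleaned['CASE_STATUS'].to_numpy()

# Ground-truth labels for the dataset
y = create_class_labels(cleaned)
counts = cleaned['CASE_STATUS'].value_counts()
print(counts)
# value_counts() sorts by frequency, so position 0 is the majority class
print('proportion: ', counts.iloc[0]/counts.iloc[1], ': 1')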

In [None]:
import sklearn
from statistics import mean

# Baseline classifier that always predicts a single class: the mode of the
# training labels, or their rounded mean.

class BaselineClasifier():
    
    def __init__(self):
        self.central_tendency = None
        
    def fit(self, data, y, central_t='mode'): 
        
        # Count labels and find the most frequent one
        label, counts = np.unique(y, return_counts=True) 
        
        if central_t == 'mode':
            # argmax is a position in `counts`; look up the label at that position
            self.central_tendency = label[counts.argmax()]
        elif central_t == 'mean':
            self.central_tendency = round(np.sum(y)/len(y))
        
        return self
    
    # Return an array with one entry per row of `data`, each set to the fitted value.
    def predict(self, data):
        
        result = np.full(data.shape[0], self.central_tendency)
        
        return result

In [None]:
def compute_accuracy(validation, predicted):
    
    # Compare the function arguments, not the global `prediction` variable
    comp = predicted == validation 
    match_counts = np.count_nonzero(comp) 
    clasifier_accuracy = match_counts/len(validation)
    
    return clasifier_accuracy
    

In [None]:
from sklearn import metrics
from sklearn.metrics import roc_auc_score

def compute_AUC(y, prediction):
    
    auc = roc_auc_score(y, prediction)

    return auc

In [None]:
from sklearn import model_selection

# Testing with K-folds

accuracies = []
aucs = []

kf = sklearn.model_selection.KFold(n_splits=4, random_state=1, shuffle=True) 

for train_idx, test_idx in kf.split(X):
    
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx] 
    baseline_clasifier = BaselineClasifier()
    classifier = baseline_clasifier.fit(X_train, y_train, 'mean')
    prediction = baseline_clasifier.predict(X_test)
    
    accuracies.append(compute_accuracy(y_test, prediction))
    aucs.append(compute_AUC(y_test, prediction))
    
# Average over the folds instead of keeping only the last fold's value
baseline_clasifier_accuracy = mean(accuracies)
fold_AUC = mean(aucs)

print('Baseline accuracy: ', baseline_clasifier_accuracy) 

In [None]:
from sklearn.model_selection import train_test_split

# Testing with regular split

# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
baseline_clasifier = BaselineClasifier()
classifier = baseline_clasifier.fit(X_train, y_train, 'mean')
prediction = baseline_clasifier.predict(X_test)

split_accuracy = compute_accuracy(y_test, prediction)
split_AUC = compute_AUC(y_test, prediction)

print('Baseline accuracy: ', split_accuracy)  

The accuracy of the baseline classifier is 0.99. This result is due to the highly imbalanced data: there are 624682 CERTIFIED applications and 5158 DENIED applications, a ratio of about 121.1 to 1. Accuracy is therefore not a good performance measure here. A better measure for imbalanced data is the area under the ROC curve (AUC). It measures the likelihood that, given two random points (one from the positive class and one from the negative class), the classifier ranks the positive point higher than the negative one.
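
The ranking interpretation of AUC can be checked on a toy example. The sketch below uses made-up labels and scores: a constant prediction scores 0.5 no matter how imbalanced the labels are, and `roc_auc_score` agrees with the fraction of positive/negative pairs that are ranked correctly (ties counted as half).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels: four positives and two negatives.
y_true = np.array([1, 1, 1, 1, 0, 0])

# A constant prediction cannot rank any pair, so its AUC is 0.5.
constant_pred = np.ones_like(y_true)
auc_constant = roc_auc_score(y_true, constant_pred)

# Made-up scores from a hypothetical classifier.
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.4, 0.2])
auc_scores = roc_auc_score(y_true, scores)

# Pairwise check: share of (positive, negative) pairs ranked correctly.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print(auc_constant, auc_scores, pairwise)
```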

In [None]:
print('K-fold: ', fold_AUC)
print('split (80-20): ', split_AUC)

In [None]:
# One-hot encoding of the SOC_CODE column
from sklearn.preprocessing import OneHotEncoder

Dataset = cleaned[["CASE_STATUS", "FULL_TIME_POSITION", "SOC_CODE", "SOC_TITLE", "EMPLOYER_NAME", "WORKSITE_STATE_1", "WAGES"]]
Dataset.head()
# The encoder returns a sparse matrix by default. The `sparse` keyword was renamed
# to `sparse_output` in newer scikit-learn releases, so it is omitted here.
SOC_Encoding = OneHotEncoder(handle_unknown='ignore')
SOC_Encoding_df = pd.DataFrame(SOC_Encoding.fit_transform(cleaned[["SOC_CODE"]]).toarray())

In [None]:
SOC_Encoding_df.head()

In [None]:
# Dependent and independent variables
# CASE_STATUS is an object column holding 1/0, so cast it to int for scikit-learn
Y = Dataset['CASE_STATUS'].astype(int).values
X = SOC_Encoding_df

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=20)

In [None]:
#Build model - Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

#model_clf = RandomForestClassifier(n_estimators=10, random_state=30)
model_clf = RandomForestClassifier(n_jobs=2,random_state=0)

#train the model
model_clf.fit(X_train,y_train)

In [None]:
#test the model (predict with our test data)
prediction_test = model_clf.predict(X_test)
prediction_test

In [None]:
# Compare the predictions with the true labels, y_test
from sklearn import metrics
print("Accuracy = ", metrics.accuracy_score(y_test, prediction_test))

In [None]:
# Test a simple input to predict its status.
# An all-zero row has no SOC code set; its width is taken from X
# instead of hard-coding the number of encoded columns.
status_label = np.array(['Denied','Approved'])
new_pred_number = model_clf.predict(np.zeros((1, X.shape[1])))
new_pred_label= status_label[ new_pred_number ]
new_pred_label

### Reflection

What is the hardest part of the project that you’ve encountered so far?

1. Setting up the data for visualization and ML analysis, e.g. the same job title appears cluttered with different words, integers, and punctuation characters.
2. Encoding the dataset to be used in the classifier. We tried the JOB_TITLE attribute but got a memory error, so we switched to the SOC_TITLE attribute.
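
One possible way around the memory error in point 2 is to keep the one-hot matrix sparse rather than converting it with `.toarray()`, since `RandomForestClassifier` accepts SciPy sparse input. The sketch below uses made-up job titles and labels. Whether this is enough for the full JOB_TITLE column is an assumption that has not been tested on the real data.

```python
import pandas as pd
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Made-up job titles and labels (1 = certified, 0 = denied).
demo = pd.DataFrame({
    "JOB_TITLE": ["software engineer", "data analyst", "software engineer",
                  "accountant", "data analyst", "accountant"],
    "CASE_STATUS": [1, 1, 1, 0, 1, 0],
})

# The encoder output stays a SciPy sparse matrix: only non-zero cells are stored.
encoder = OneHotEncoder(handle_unknown="ignore")
X_sparse = encoder.fit_transform(demo[["JOB_TITLE"]])
print(sparse.issparse(X_sparse), X_sparse.shape)

# The forest is trained directly on the sparse matrix.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_sparse, demo["CASE_STATUS"].to_numpy())

# New rows go through the same fitted encoder before prediction.
new_row = pd.DataFrame({"JOB_TITLE": ["accountant"]})
pred = model.predict(encoder.transform(new_row))
print(pred)
```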