## EasyVisa Project


## Problem Statement

### Context:

Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally as well as abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

### Objective:

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a machine learning based solution that can help shortlist the candidates with higher chances of visa approval. OFLC has hired the firm EasyVisa for data-driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a classification model:

* Facilitate the process of visa approvals.
* Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.

### Data Description

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

* case_id: ID of each visa application
* continent: Continent of the employee
* education_of_employee: Education level of the employee
* has_job_experience: Does the employee have any job experience? Y = Yes; N = No
* requires_job_training: Does the employee require any job training? Y = Yes; N = No
* no_of_employees: Number of employees in the employer's company
* yr_of_estab: Year in which the employer's company was established
* region_of_employment: The foreign worker's intended region of employment in the US.
* prevailing_wage:  Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment.
* unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
* full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
* case_status:  Flag indicating if the Visa was certified or denied

## Importing necessary libraries

In [1]:
# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black

import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Libraries for different ensemble classifiers
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)

from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

# Libraries to get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# To tune different models
from sklearn.model_selection import GridSearchCV

ModuleNotFoundError: No module named 'xgboost'
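The error above indicates that `xgboost` is not installed in the environment that ran this cell. A minimal sketch of how the availability of a package could be checked before importing it, using only the standard library; the helper name `is_installed` is a hypothetical name introduced for illustration:

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# Report which of the packages are available before importing them
for name in ["numpy", "xgboost"]:
    status = "available" if is_installed(name) else "missing (install it, e.g. with pip)"
    print(f"{name}: {status}")
```

If a package is reported as missing, installing it and re-running the import cell should clear the error.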

## Importing Dataset

In [3]:
# read the data
data = pd.read_csv('EasyVisa.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'EasyVisa.csv'

In [None]:
# copy data to another variable to avoid any changes to original data
df = data.copy()

## Overview of the Dataset

In [None]:
# return the first five rows
df.head()

In [None]:
# return the last five rows
df.tail()

In [None]:
# return the number of rows by the number of columns
df.shape

* The dataset has 25480 rows and 12 columns of data

In [None]:
# print a concise summary of the DataFrame
df.info()

Observations:

* case_status is the dependent variable and has type object.
* The independent variables are a mix of numeric (3) and object (8) columns.

In [None]:
# check missing values across each columns
df.isnull().sum()

* There are no missing values in the dataset.

In [None]:
# check for duplicates
duplicates = df.duplicated()
# print the duplicated rows
print(df[duplicates])

* There are no duplicates.

In [None]:
# checking for unique values in ID column
df["case_id"].nunique()

* Since all the values in the **case_id** column are unique, we can drop it.

In [None]:
# drop "case_id" column
columns_to_drop = ["case_id"]
df = df.drop(columns_to_drop, axis=1)

In [None]:
df.head()


## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Leading Questions**:
1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?

2. How does the visa status vary across different continents?

3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?

4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?

5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?

In [None]:
# check statistical summary of the all data
df.describe(include='all').T

Observations:
* The majority of the employees applied from **Asia**.
* The most common education level of the employees is **Bachelor's**.
* Most of the employees do not require training.
* The unit of wage is yearly for most of the employees.
* Most of the employees applied for a full-time position.

In [None]:
# check statistical summary of the numerical data
df.describe().T

* The average prevailing wage is about 75,000 dollars.
* 75% of the companies have fewer than 3,504 employees.


## Data Preprocessing

- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)

## EDA

- It is a good idea to explore the data once again after manipulating it.

#### Fixing the negative values in the number of employees column

In [None]:
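# select the rows with a negative number of employees and count them
negative_values = df[df['no_of_employees'] < 0]
num_negative_values = negative_values.shape[0]
print(num_negative_values)


In [None]:
# shape of the subset with negative values
shape_of_negative_values = negative_values.shape
print(shape_of_negative_values)


In [None]:
# convert the negative values to positive numbers by taking the absolute value
df["no_of_employees"] = df["no_of_employees"].abs()


In [None]:
# let's check again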
df.describe().T
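The fix above can be checked on a small example. The sketch below uses a hypothetical toy frame (not the EasyVisa data) to count the negative entries before and after applying `abs()`:

```python
import pandas as pd

# Hypothetical toy data standing in for the no_of_employees column
toy = pd.DataFrame({"no_of_employees": [120, -25, 3000, -11, 45]})

# Count the rows with a negative employee count before the fix
n_negative_before = int((toy["no_of_employees"] < 0).sum())

# Replace each value with its absolute value
toy["no_of_employees"] = toy["no_of_employees"].abs()

# Count again after the fix
n_negative_after = int((toy["no_of_employees"] < 0).sum())

print("negative before:", n_negative_before)
print("negative after:", n_negative_after)
```

Taking the absolute value assumes the negative signs are data-entry errors; if they encoded something else, a different treatment would be needed.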

#### Let's check the count of each unique category in each of the categorical variables

In [None]:
# Making a list of all categorical variables
cat_col = list(df.select_dtypes("object").columns)

# Printing the count of each unique value in each column
for column in cat_col:
    print(df[column].value_counts())
    print("-" * 50)

### Univariate Analysis

In [None]:
# Let's write a function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# Let's write a function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

#### Observations on number of employees

In [None]:
histogram_boxplot(df, "no_of_employees")

#### Observations on prevailing wage

In [None]:
histogram_boxplot(df, "prevailing_wage")

Observations:
* The distribution is right-skewed.
* The prevailing wage is very small for more than 2,500 employees, which might indicate a data issue.
* The mean and median are close.

In [None]:
# check the observations which have less than 100 prevailing wage
df.loc[df["prevailing_wage"] < 100]

In [None]:
# count of the values in the mentioned column
count_values = df.loc[df["prevailing_wage"] < 100, "unit_of_wage"].value_counts()
print(count_values)


* 176 prevailing wage values are less than 100, and all of them are hourly rates, which makes sense.
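Because the wage column mixes units, raw values are hard to compare directly. The sketch below shows one possible way to put them on a common yearly scale. The conversion factors (a 40-hour week and 52 weeks per year, i.e. 2,080 hours), the toy data and the `annual_wage` column are illustrative assumptions and not part of the original analysis. The unit labels follow the data dictionary above and may need adjusting to match the strings in the actual file.

```python
import pandas as pd

# Hypothetical toy rows, one per wage unit
toy = pd.DataFrame(
    {
        "prevailing_wage": [50.0, 1000.0, 5000.0, 75000.0],
        "unit_of_wage": ["Hourly", "Weekly", "Monthly", "Yearly"],
    }
)

# Assumed number of pay periods in a year for each unit
PERIODS_PER_YEAR = {"Hourly": 2080, "Weekly": 52, "Monthly": 12, "Yearly": 1}

# Multiply each wage by the number of periods that matches its unit
toy["annual_wage"] = toy["prevailing_wage"] * toy["unit_of_wage"].map(PERIODS_PER_YEAR)

print(toy)
```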

#### Observations on continent

In [None]:
labeled_barplot(df, "continent", perc=True)

* 66% of the employees are applying from Asia.

#### Observations on education of employee

In [None]:
labeled_barplot(df, 'education_of_employee', perc=True)

* Most of the employees have a Bachelor's or Master's degree.
* Only 8.6% of them have a Doctorate degree.

#### Observations on job experience

In [None]:
labeled_barplot(df, 'has_job_experience', perc=True)

* More than half (58.1%) of the employees have job experience.

#### Observations on job training

In [None]:
labeled_barplot(df, 'requires_job_training', perc=True)

* 88.4% of the employees do not require job training.

#### Observations on region of employment

In [None]:
labeled_barplot(df, 'region_of_employment', perc=True)

* Although the Northeast attracts the most applicants, the South and West have very close values.

#### Observations on unit of wage

In [None]:
labeled_barplot(df, 'unit_of_wage', perc=True)

* 90% of the wages are expressed in yearly units.

#### Observations on case status

In [None]:
labeled_barplot(df, 'case_status', perc=True)

* About 2/3 of the visa applications are certified.

### Bivariate Analysis

In [None]:
# find the correlation between the variables
cols_list = df.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(10, 5))
sns.heatmap(
    df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.title('Correlation Heatmap of Numeric Variables')
plt.show()


* There is a weak positive correlation between prevailing wage and year of establishment.
* There is a weak negative correlation between number of employees and year of establishment.

**Creating functions that will help us with further analysis.**

In [None]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

In [None]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # place the legend outside the plot area, to the right of the bars
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
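The core of `stacked_barplot` is the row-normalized crosstab: with `normalize="index"`, each row sums to 1, so the values are the share of each target class within a category. A minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: six applications with an education level and an outcome
toy = pd.DataFrame(
    {
        "education": ["Bachelor's", "Bachelor's", "Master's", "Master's", "Master's", "Doctorate"],
        "status": ["Denied", "Certified", "Certified", "Certified", "Denied", "Certified"],
    }
)

# Share of each outcome within every education level (each row sums to 1)
rates = pd.crosstab(toy["education"], toy["status"], normalize="index")

print(rates)
```

Reading the `Certified` column of such a table gives the certification rate per category, which is what the stacked bars display.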

#### Those with higher education may want to travel abroad for a well-paid job. Let's find out if education has any impact on visa certification

In [None]:
stacked_barplot(df, "education_of_employee", "case_status")

* Employees with a doctorate have the highest rate of visa certification.
* As the level of education increases, the rate of visa certification increases.

#### Different regions have different requirements of talent having diverse educational backgrounds. Let's analyze it further

In [None]:
plt.figure(figsize=(10, 5))
sns.heatmap(pd.crosstab(df['education_of_employee'], df['region_of_employment']),
    annot=True,
    fmt="g",
    cmap="viridis"
)

plt.ylabel("Education")
plt.xlabel("Region")
plt.title("Crosstab Heatmap: Education vs Region of Employment")
plt.show()


#### Let's have a look at the percentage of visa certifications across each region

In [None]:
stacked_barplot(df, "region_of_employment", "case_status")

* The West region has the highest rate of visa certification, though only slightly.

#### Let's similarly check the continents and find out how the visa status varies across them.

In [None]:
stacked_barplot(df, "continent", "case_status")

* Europe has the highest rate of visa certification.

In [None]:
plt.figure(figsize=(10, 5))
sns.heatmap(pd.crosstab(df['education_of_employee'], df['continent']),
    annot=True,
    fmt="g",
    cmap="viridis"
)

plt.ylabel("Education")
plt.xlabel("Continent")
plt.title("Crosstab Heatmap: Education vs Continent")
plt.show()


#### Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Let's see if having work experience has any influence over visa certification

In [None]:
stacked_barplot(df, "has_job_experience", "case_status")

* Employees with work experience are more likely to have their visa certified.

#### Do the employees who have prior work experience require any job training?

In [None]:
stacked_barplot(df, "has_job_experience", "requires_job_training")

#### The US government has established a prevailing wage to protect local talent and foreign workers. Let's analyze the data and see if the visa status changes with the prevailing wage

In [None]:
distribution_plot_wrt_target(df, "prevailing_wage", "case_status")

#### Checking if the prevailing wage is similar across all the regions of the US

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='region_of_employment', y='prevailing_wage', palette='Set3')

plt.title('Boxplot: Region of Employment vs Prevailing Wage')
plt.xlabel('Region of Employment')
plt.ylabel('Prevailing Wage')
plt.show()


* Although the Midwest and Island regions are slightly higher, the other three regions have very close median prevailing wages.

#### The prevailing wage has different units (Hourly, Weekly, etc). Let's find out if it has any impact on visa applications getting certified.

In [None]:
stacked_barplot(df, "unit_of_wage", "case_status")  # stacked barplot of unit of wage vs. case status

* Employees who applied for jobs with a yearly wage have a higher rate of visa certification.

## Data Preprocessing

### Outlier Check

- Let's check for outliers in the data.

In [None]:
# use the cleaned DataFrame (df) rather than the raw copy (data)
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(data=df, y=variable, palette='Set2')
    plt.title(f'Boxplot: {variable}')

plt.tight_layout()
plt.show()


* All numerical variables have outliers, but we will not treat them.
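The points drawn beyond the whiskers of a boxplot are, by default, the values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch of that rule, using a hypothetical helper `count_iqr_outliers` and made-up sample values:

```python
import numpy as np


def count_iqr_outliers(values):
    """Count the values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())


# Made-up sample with one clearly extreme value
sample = [10, 12, 11, 13, 12, 11, 10, 200]
n_outliers = count_iqr_outliers(sample)
print("outliers in sample:", n_outliers)
```

Applying such a helper to each numeric column would quantify how many points the boxplots flag, which can inform the decision of whether to treat them.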

### Data Preparation for modeling

- We want to predict which visa will be certified.
- Before we proceed to build a model, we'll have to encode categorical features.
- We'll split the data into train and test to be able to evaluate the model that we build on the train data.

In [None]:
from sklearn.model_selection import train_test_split

# Convert "case_status" to binary (1 for "Certified", 0 for others)
df["case_status"] = df["case_status"].apply(lambda x: 1 if x == "Certified" else 0)

# Separate the features from the encoded target
X = df.drop("case_status", axis=1)
y = df["case_status"]

# Create dummy variables for categorical columns in X
X = pd.get_dummies(X)

# Split the data into train and test sets (70:30 ratio), stratified on the target.
# The encoded target from df is used here (not the raw strings in data) so that the
# labels follow the 0/1 convention expected by the default F1 scorer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)


In [None]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
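The effect of one-hot encoding and of a stratified split can be seen on a small example. The sketch below uses a hypothetical toy frame with a 60/40 class balance; the names `toy`, `X_toy` and `y_toy` are introduced only for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 60% "Certified" and 40% "Denied"
toy = pd.DataFrame(
    {
        "continent": ["Asia", "Europe"] * 50,
        "prevailing_wage": list(range(100)),
        "case_status": ["Certified"] * 60 + ["Denied"] * 40,
    }
)

# Encode the target as 1/0 and one-hot encode the categorical feature
y_toy = (toy["case_status"] == "Certified").astype(int)
X_toy = pd.get_dummies(toy.drop(columns="case_status"))

# A stratified split keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(X_toy.columns.tolist())
print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```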

## Model evaluation criterion

### Model can make wrong predictions as:

1. Model predicts that the visa application will get certified but in reality, the visa application should get denied.
2. Model predicts that the visa application will not get certified but in reality, the visa application should get certified.

### Which case is more important?
* Both cases are important:

* If a visa is certified when it should have been denied, an unsuitable candidate gets the position while US citizens miss the opportunity to work in that position.

* If a visa is denied when it should have been certified, the U.S. loses a suitable human resource that could contribute to the economy.



### How to reduce the losses?

* `F1 Score` can be used as the metric for evaluating the model: the greater the F1 score, the higher the chances of minimizing both False Negatives and False Positives.
* We will use balanced class weights so that the model focuses equally on both classes.
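The F1 score is the harmonic mean of precision and recall, so it is penalized by both false positives and false negatives. A small worked example with made-up counts (the function name `f1_from_counts` is hypothetical):

```python
def f1_from_counts(tp, fp, fn):
    """Compute the F1 score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)  # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 80 true positives, 20 false positives, 10 false negatives
f1 = f1_from_counts(tp=80, fp=20, fn=10)
print(round(f1, 4))
```

The same value follows from the equivalent form 2·TP / (2·TP + FP + FN), which makes it clear that reducing either kind of error raises the score.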

**First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.**
* The model_performance_classification_sklearn function will be used to check the model performance of models.
* The confusion_matrix_sklearn function will be used to plot the confusion matrix.

In [None]:
# define a function to compute different metrics to check performance of a classification model built using sklearn


def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf

In [None]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

## Decision Tree - Model Building and Hyperparameter Tuning

### Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Define the Decision Tree classifier
model = DecisionTreeClassifier(random_state=1)

# Fit the Decision Tree classifier on the training data
model.fit(X_train, y_train)


#### Checking model performance on training set

In [None]:
confusion_matrix_sklearn(model, X_train, y_train)
plt.title('Confusion Matrix - Train Data')
plt.show()


* The decision tree is overfitting the training data.
* Let's try hyperparameter tuning and see if the model performance improves.
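Overfitting of an unrestricted decision tree can be reproduced on synthetic data. The sketch below is not run on the EasyVisa data; it uses `make_classification` with some label noise and compares a fully grown tree with a depth-limited one. All variable names and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 20% of the labels flipped at random
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1, stratify=y_syn
)

# A fully grown tree memorizes the training set, noise included
full_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
train_acc_full = full_tree.score(X_tr, y_tr)
test_acc_full = full_tree.score(X_te, y_te)

# A shallow tree cannot memorize, so its train and test scores are closer
small_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)
train_acc_small = small_tree.score(X_tr, y_tr)
test_acc_small = small_tree.score(X_te, y_te)

print("full tree  train/test:", train_acc_full, test_acc_full)
print("small tree train/test:", train_acc_small, test_acc_small)
```

A large gap between the training and test scores of the full tree is the signature of overfitting that motivates the tuning step.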

In [None]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance.
    Works whether the labels are strings ("Certified"/"Denied") or integers (1/0).

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    def to_binary(labels):
        # map "Certified" or 1 to 1 and everything else to 0
        return np.array([1 if label in ("Certified", 1) else 0 for label in labels])

    # predicting using the independent variables
    pred = to_binary(model.predict(predictors))
    true = to_binary(target)

    acc = accuracy_score(true, pred)  # to compute Accuracy
    recall = recall_score(true, pred)  # to compute Recall
    precision = precision_score(true, pred)  # to compute Precision
    f1 = f1_score(true, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf


In [None]:
decision_tree_perf_train = model_performance_classification_sklearn(model, X_train, y_train)
print(decision_tree_perf_train)


#### Checking model performance on test set

In [None]:
confusion_matrix_sklearn(model, X_test, y_test)
plt.title('Confusion Matrix - Test Data')
plt.show()

In [None]:
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
decision_tree_perf_test

* The F1 score on the test data is not very high.
* Let's try hyperparameter tuning and see if the model performance improves.

### Hyperparameter Tuning - Decision Tree

In [None]:
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight="balanced", random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(10, 20, 5),
    "min_samples_leaf": [3, 5],
    "max_leaf_nodes": [2, 3, 5],
    "min_impurity_decrease": [0.0001, 0.001],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer,n_jobs=-1)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
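A note on the last step above: `GridSearchCV` uses `refit=True` by default, so `best_estimator_` is already refitted on the full training data with the best parameters, and the extra `fit` call is harmless but redundant. A minimal sketch on synthetic data (the grid values and variable names are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=1)

# Small grid scored with F1 using 5-fold cross-validation
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=1),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [3, 5]},
    scoring=make_scorer(f1_score),
    cv=5,
)
grid.fit(X_syn, y_syn)

# best_estimator_ can predict right away because it was refitted by the search
best_tree = grid.best_estimator_
preds = best_tree.predict(X_syn)

print("best parameters:", grid.best_params_)
print("best cross-validated F1:", grid.best_score_)
```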

In [None]:
confusion_matrix_sklearn(dtree_estimator,X_test,y_test)

In [None]:
#Calculate different metrics
dtree_estimator_model_train_perf=model_performance_classification_sklearn(dtree_estimator,X_train,y_train)
print("Training performance:\n",dtree_estimator_model_train_perf)
dtree_estimator_model_test_perf=model_performance_classification_sklearn(dtree_estimator,X_test,y_test)
print("Testing performance:\n",dtree_estimator_model_test_perf)

#Create confusion matrix
confusion_matrix_sklearn(dtree_estimator,X_test,y_test)



* The overfitting has reduced and the test f1-score has increased.
* Let's try some other models.

## Bagging - Model Building and Hyperparameter Tuning

### Bagging Classifier

In [None]:
# Define the Bagging classifier (the default base learner is a decision tree)
bagging_classifier = BaggingClassifier(random_state=1)

# Fit the Bagging classifier on the training data
bagging_classifier.fit(X_train, y_train)


#### Checking model performance on training set

In [None]:
bagging_classifier_model_train_perf=model_performance_classification_sklearn(bagging_classifier,X_train,y_train)
print(bagging_classifier_model_train_perf)

In [None]:
confusion_matrix_sklearn(bagging_classifier,X_train,y_train)

#### Checking model performance on test set

In [None]:
bagging_classifier_model_test_perf=model_performance_classification_sklearn(bagging_classifier,X_test,y_test)
print(bagging_classifier_model_test_perf)

In [None]:
confusion_matrix_sklearn(bagging_classifier,X_test,y_test)

* Bagging classifier is overfitting the training data.
* Let's try hyperparameter tuning and see if the model performance improves.

### Hyperparameter Tuning - Bagging Classifier

In [None]:
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(random_state=1)

# Grid of parameters to choose from
# (note: for max_features, floats are fractions of the features, while the integer 1 means a single feature)
parameters = {
    "max_samples": [0.7, 0.8],
    "max_features": [0.5, 0.7, 1],
    "n_estimators": np.arange(50, 110, 25),
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)


# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
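One detail of the grid above is worth a check: in `BaggingClassifier`, a float `max_features` is a fraction of the features, while an integer is an absolute count, so `1` means a single feature per base estimator and `1.0` means all of them. A minimal sketch on synthetic data that inspects the fitted `estimators_features_` attribute (variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data with six features
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=1)

# Integer 1: each base estimator is trained on exactly one feature
one_feature = BaggingClassifier(n_estimators=5, max_features=1, random_state=1)
one_feature.fit(X_syn, y_syn)

# Float 1.0: each base estimator is trained on all features
all_features = BaggingClassifier(n_estimators=5, max_features=1.0, random_state=1)
all_features.fit(X_syn, y_syn)

n_int = len(one_feature.estimators_features_[0])
n_float = len(all_features.estimators_features_[0])
print("features per estimator with max_features=1:  ", n_int)
print("features per estimator with max_features=1.0:", n_float)
```

If the intent of the grid is to include "all features" as an option, `1.0` would express that; with the integer `1` the search instead evaluates single-feature base estimators.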

#### Checking model performance on training set

In [None]:
bagging_estimator_tuned_model_train_perf=model_performance_classification_sklearn(bagging_estimator_tuned,X_train,y_train)
print(bagging_estimator_tuned_model_train_perf)

In [None]:
# create confusion matrix for train data
confusion_matrix_sklearn(bagging_estimator_tuned,X_train,y_train)

#### Checking model performance on test set

In [None]:
bagging_estimator_tuned_model_test_perf=model_performance_classification_sklearn(bagging_estimator_tuned,X_test,y_test)
print(bagging_estimator_tuned_model_test_perf)

In [None]:
# create confusion matrix for test data on tuned estimator
confusion_matrix_sklearn(bagging_estimator_tuned,X_test,y_test)

* Model performance improved after hyperparameter tuning: the F1 score increased to 81%.

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define the Random Forest classifier
rf_estimator = RandomForestClassifier(random_state=1, class_weight='balanced')

# Fit the Random Forest classifier on the training data
rf_estimator.fit(X_train, y_train)
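With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight. A minimal sketch with a made-up 67/33 class split, roughly in line with the two-thirds certified share noted earlier:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 67 samples of class 1 and 33 samples of class 0
y_toy = np.array([1] * 67 + [0] * 33)

# Weights are returned in the order of the `classes` argument
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

print("weight for class 0:", weights[0])
print("weight for class 1:", weights[1])
```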


#### Checking model performance on training set

In [None]:
# Calculating different metrics
rf_estimator_model_train_perf=model_performance_classification_sklearn(rf_estimator,X_train,y_train)
print("Training performance:\n",rf_estimator_model_train_perf)

In [None]:
# create confusion matrix for train data
confusion_matrix_sklearn(rf_estimator, X_train, y_train)

#### Checking model performance on test set

In [None]:
rf_estimator_model_test_perf=model_performance_classification_sklearn(rf_estimator,X_test,y_test)
print("Testing performance:\n",rf_estimator_model_test_perf)

In [None]:
# create confusion matrix for test data
confusion_matrix_sklearn(rf_estimator, X_test, y_test)

* Random forest is overfitting the training data, as there is a significant difference between the training and test scores for all the metrics.

### Hyperparameter Tuning - Random Forest

In [None]:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(random_state=1, oob_score=True, bootstrap=True)

parameters = {
    "max_depth": list(np.arange(5, 15, 5)),
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [3, 5, 7],
    "n_estimators": np.arange(10, 40, 10),
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)


# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
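Setting `oob_score=True` together with `bootstrap=True`, as above, makes the forest evaluate each sample using only the trees that did not see it during training. The resulting out-of-bag accuracy is available as `oob_score_` after fitting. A minimal sketch on synthetic data (parameter values and names are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=1)

# Each tree is trained on a bootstrap sample; the left-out rows form its out-of-bag set
rf = RandomForestClassifier(
    n_estimators=100, oob_score=True, bootstrap=True, random_state=1
)
rf.fit(X_syn, y_syn)

# Accuracy estimated from out-of-bag predictions, without a separate validation set
oob_accuracy = rf.oob_score_
print("out-of-bag accuracy:", oob_accuracy)
```

The out-of-bag estimate is an accuracy by default, so it complements rather than replaces the F1-based comparison used in this notebook.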

#### Checking model performance on training set

In [None]:
# check performance for train data on tuned estimator
rf_tuned_model_train_perf=model_performance_classification_sklearn(rf_tuned,X_train,y_train)
print("Training performance:\n",rf_tuned_model_train_perf)

In [None]:
# create confusion matrix for train data on tuned estimator
confusion_matrix_sklearn(rf_tuned, X_train, y_train)

#### Checking model performance on test set

In [None]:
# check performance for test data on tuned estimator
rf_tuned_model_test_perf=model_performance_classification_sklearn(rf_tuned,X_test,y_test)
print("Testing performance:\n",rf_tuned_model_test_perf)

In [None]:
# create confusion matrix for test data on tuned estimator
confusion_matrix_sklearn(rf_tuned, X_test, y_test)

* After tuning, there is no significant difference between the training and test results, so the tuned random forest no longer overfits.

## Boosting - Model Building and Hyperparameter Tuning

### AdaBoost Classifier

In [None]:
# define AdaBoost Classifier with random state = 1
ab_classifier = AdaBoostClassifier(random_state=1)
# fit AdaBoost Classifier on the train data
ab_classifier.fit(X_train,y_train)



#### Checking model performance on training set

In [None]:
ab_classifier_model_train_perf=model_performance_classification_sklearn(ab_classifier,X_train,y_train)
print(ab_classifier_model_train_perf)

In [None]:
confusion_matrix_sklearn(ab_classifier,X_train,y_train)

* The F1 score is essentially unchanged, but the model overfits the training data less.

#### Checking model performance on test set

In [None]:
ab_classifier_model_test_perf=model_performance_classification_sklearn(ab_classifier,X_test,y_test)
print(ab_classifier_model_test_perf)


In [None]:
confusion_matrix_sklearn(ab_classifier,X_test,y_test)

* The test results are almost the same as the training results.

### Hyperparameter Tuning - AdaBoost Classifier

In [None]:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    # Try different max_depth values for the base estimator
    # (scikit-learn >= 1.2 names this parameter "estimator"; "base_estimator" was removed in 1.4)
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2,random_state=1),
        DecisionTreeClassifier(max_depth=3,random_state=1),
    ],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": np.arange(0.01,0.1,0.05),
}

# Type of scoring used to compare parameter combinations (F1 score)
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
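
The name of AdaBoost's weak-learner parameter depends on the installed scikit-learn version: releases from 1.2 onwards call it `estimator`, while older releases call it `base_estimator`. One possible way to keep the grid portable is to look the name up at runtime; `weak_learner_key` and `adaboost_grid` are hypothetical names used only in this sketch.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Inspect which constructor parameters this scikit-learn version exposes.
available_params = AdaBoostClassifier().get_params()

# Prefer the newer name when it exists, otherwise fall back to the old one.
weak_learner_key = "estimator" if "estimator" in available_params else "base_estimator"

adaboost_grid = {
    weak_learner_key: [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
print(weak_learner_key)
```

The resulting dictionary can be extended with `n_estimators` and `learning_rate` and passed to `GridSearchCV` in the same way as the grid above.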

#### Checking model performance on training set

In [None]:
abc_tuned_model_train_perf=model_performance_classification_sklearn(abc_tuned,X_train,y_train)
print(abc_tuned_model_train_perf)

In [None]:
confusion_matrix_sklearn(abc_tuned,X_train,y_train)

#### Checking model performance on test set

In [None]:
abc_tuned_model_test_perf=model_performance_classification_sklearn(abc_tuned,X_test,y_test)
print(abc_tuned_model_test_perf)

In [None]:
confusion_matrix_sklearn(abc_tuned,X_test,y_test)

* The metrics are very close on the training and test data.

### Gradient Boosting Classifier

In [None]:
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)

#### Checking model performance on training set

In [None]:
gb_classifier_model_train_perf=model_performance_classification_sklearn(gb_classifier,X_train,y_train)
print("Training performance:\n",gb_classifier_model_train_perf)

In [None]:
confusion_matrix_sklearn(gb_classifier,X_train,y_train)

#### Checking model performance on test set

In [None]:
gb_classifier_model_test_perf=model_performance_classification_sklearn(gb_classifier,X_test,y_test)
print("Testing performance:\n",gb_classifier_model_test_perf)

In [None]:
confusion_matrix_sklearn(gb_classifier,X_test,y_test)

* This is the highest F1 score so far.

### Hyperparameter Tuning - Gradient Boosting Classifier

In [None]:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(
    init=AdaBoostClassifier(random_state=1), random_state=1
)

# Grid of parameters to choose from
parameters = {
    "n_estimators": np.arange(50,110,25),
    "subsample": [0.7,0.9],
    # Floats are fractions of the features; an integer 1 would mean a single feature
    "max_features": [0.7, 0.8, 0.9, 1.0],
    "learning_rate": [0.01,0.1,0.05],
}

# Type of scoring used to compare parameter combinations (F1 score)
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)

#### Checking model performance on training set

In [None]:
gbc_tuned_model_train_perf=model_performance_classification_sklearn(gbc_tuned,X_train,y_train)
print("Training performance:\n",gbc_tuned_model_train_perf)

In [None]:
confusion_matrix_sklearn(gbc_tuned,X_train,y_train)

#### Checking model performance on test set

In [None]:
gbc_tuned_model_test_perf=model_performance_classification_sklearn(gbc_tuned,X_test,y_test)
print("Testing performance:\n",gbc_tuned_model_test_perf)

In [None]:
confusion_matrix_sklearn(gbc_tuned,X_test,y_test)

## Stacking Classifier

In [None]:
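from sklearn.ensemble import StackingClassifier

# Base estimators for the stacking ensemble
estimators = [
    ("AdaBoost", ab_classifier),
    ("Bagging", bagging_estimator_tuned),
    ("Random Forest", rf_tuned),
]

# The tuned gradient boosting model serves as the final (meta) estimator
final_estimator = gbc_tuned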

# Define the Stacking classifier
stacking_classifier = StackingClassifier(
    estimators=estimators,
    final_estimator=final_estimator
)

# Fit the Stacking classifier on the training data
stacking_classifier.fit(X_train, y_train)
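
`StackingClassifier` clones the estimators it is given and refits them, so already fitted models contribute only their hyperparameters. The final estimator is then trained on cross-validated predictions of the base models. The sketch below illustrates this on synthetic data with two small base models and, as a simplification, a `LogisticRegression` final estimator instead of the tuned gradient boosting model used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the EasyVisa dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

stack_demo = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=1)),
        ("forest", RandomForestClassifier(n_estimators=20, random_state=1)),
    ],
    final_estimator=LogisticRegression(),  # swapped in for brevity
    cv=5,  # folds used to build the out-of-fold predictions
)
stack_demo.fit(X_demo, y_demo)

# For a binary target, each base model contributes one probability column.
meta_features = stack_demo.transform(X_demo)
print(meta_features.shape)
```

The columns of `meta_features` are the inputs the final estimator sees at prediction time, one per base model in this binary setting.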


### Checking model performance on training set

In [None]:
stacking_classifier_model_train_perf=model_performance_classification_sklearn(stacking_classifier,X_train,y_train)
print("Training performance:\n",stacking_classifier_model_train_perf)

In [None]:
confusion_matrix_sklearn(stacking_classifier,X_train,y_train)

### Checking model performance on test set

In [None]:
stacking_classifier_model_test_perf=model_performance_classification_sklearn(stacking_classifier,X_test,y_test)
print("Testing performance:\n",stacking_classifier_model_test_perf)

In [None]:
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)

* The stacking classifier performs well on both the training and test data.

## Model Performance Comparison and Final Model Selection

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        dtree_estimator_model_train_perf.T,
        bagging_classifier_model_train_perf.T,
        bagging_estimator_tuned_model_train_perf.T,
        rf_estimator_model_train_perf.T,
        rf_tuned_model_train_perf.T,
        ab_classifier_model_train_perf.T,
        abc_tuned_model_train_perf.T,
        gb_classifier_model_train_perf.T,
        gbc_tuned_model_train_perf.T,
        stacking_classifier_model_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Tuned Decision Tree",
    "Bagging Classifier",
    "Tuned Bagging Classifier",
    "Random Forest",
    "Tuned Random Forest",
    "Adaboost Classifier",
    "Tuned Adaboost Classifier",
    "Gradient Boost Classifier",
    "Tuned Gradient Boost Classifier",
    "Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df

In [None]:
# testing performance comparison (same model order and labels as the training table)

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        dtree_estimator_model_test_perf.T,
        bagging_classifier_model_test_perf.T,
        bagging_estimator_tuned_model_test_perf.T,
        rf_estimator_model_test_perf.T,
        rf_tuned_model_test_perf.T,
        ab_classifier_model_test_perf.T,
        abc_tuned_model_test_perf.T,
        gb_classifier_model_test_perf.T,
        gbc_tuned_model_test_perf.T,
        stacking_classifier_model_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Tuned Decision Tree",
    "Bagging Classifier",
    "Tuned Bagging Classifier",
    "Random Forest",
    "Tuned Random Forest",
    "Adaboost Classifier",
    "Tuned Adaboost Classifier",
    "Gradient Boost Classifier",
    "Tuned Gradient Boost Classifier",
    "Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df

* The Gradient Boost Classifier has the highest F1 score (0.8255) and is used as the final model.

### Important features of the final model

In [None]:
feature_names = X_train.columns
importances = gb_classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
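
Alongside the bar chart, it can be convenient to have the importances as a sorted table. This is a minimal sketch on synthetic data with hypothetical feature names (`feature_0` to `feature_5`), not the EasyVisa columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_arr = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=1
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

gb_demo = GradientBoostingClassifier(random_state=1).fit(X_demo, y_arr)

# Importances are normalized to sum to 1; sort them from most to least important.
importance_table = pd.Series(
    gb_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

top_features = importance_table.head(3)
print(top_features)
```

These importances rank how much each feature contributes to the model's splits; they do not say whether a feature raises or lowers the chance of approval.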

## Business Insights and Recommendations

* Based on our analysis, the following features are the most important in the visa approval process:

    1. **Education of Employee - High School:** Whether an applicant's education level is high school is influential in the prediction.

    2. **Has Job Experience - No:** Having no job experience affects the visa approval status.

    3. **Unit of Wage - Hour:** An hourly unit of wage is a significant factor.

    4. **Education of Employee - Bachelors:** Holding a bachelor's degree also contributes to the prediction.

    5. **Continent - Europe:** Applying from Europe plays a role in the prediction.

    6. **Prevailing Wage:** The prevailing wage is an important factor in determining the visa approval status.

    7. **Has Job Experience - Yes:** Having job experience also contributes to the prediction.
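
Feature importance shows that a feature matters, not in which direction. One way to check the direction is to compare approval rates per category. The sketch below uses a few toy rows; the column names `education_of_employee` and `case_status` and the labels `Certified` and `Denied` are assumptions about the dataset, and the values are illustrative only.

```python
import pandas as pd

# Toy rows for illustration only; column names and values are assumptions.
toy = pd.DataFrame(
    {
        "education_of_employee": [
            "High School", "High School",
            "Bachelor's", "Bachelor's",
            "Master's", "Master's",
        ],
        "case_status": [
            "Denied", "Denied",
            "Certified", "Denied",
            "Certified", "Certified",
        ],
    }
)

# Share of certified cases within each education level.
approval_rate = (
    toy["case_status"].eq("Certified")
    .groupby(toy["education_of_employee"])
    .mean()
    .sort_values(ascending=False)
)
print(approval_rate)
```

Running the same grouping on the real data for each important feature would show whether a category is associated with higher or lower approval rates.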



Based on the information provided and the identified feature importances, here are some business insights and recommendations:

* **Education Level Impact:**

    **Insight:** High school education appears to be a significant factor influencing the visa approval process.

    **Recommendation:** Consider evaluating the educational qualifications of applicants, giving attention to those with high school education.

* **Job Experience Consideration:**

    **Insight:** The absence of job experience is identified as an important driver affecting visa approval.

    **Recommendation:** Assess applicants without job experience carefully, as this factor may impact the approval decision.

* **Wage Structure Matters:**

    **Insight:** The unit of wage being hourly is highlighted as influential.

    **Recommendation:** Pay attention to the wage structure, particularly for positions with hourly wages, as it appears to be a key feature in the decision-making process.

* **Educational Attainment - Bachelor's Degree:**

    **Insight:** Having a bachelor's degree is recognized as a factor affecting visa approval.

    **Recommendation:** Place emphasis on applicants with a bachelor's degree, as this educational level may positively contribute to their chances of visa approval.

* **Geographical Considerations:**

    **Insight:** The continent, specifically Europe, is identified as a relevant feature.

    **Recommendation:** Take into account the continent of origin, especially Europe, when evaluating visa applications. Consider regional variations and factors.

* **Prevailing Wage Significance:**

    **Insight:** Prevailing wage is highlighted as an important consideration.

    **Recommendation:** Pay careful attention to the prevailing wage associated with the positions applied for, ensuring it aligns with the norms in the specific occupation and region.

* **Job Experience - Yes:**

    **Insight:** Applicants with job experience also impact the visa approval process.

    **Recommendation:** Acknowledge the positive influence of job experience. Consider applicants with relevant work experience favorably during the evaluation process.

**General Recommendations:**

  * **Data-Driven Decision-Making:** Leverage the insights gained from machine learning models to inform decision-making in the visa approval process.
  * **Continuous Monitoring:** Regularly update and monitor the model's performance to adapt to changes in application patterns and regulations.
  * **Transparency and Fairness:** Ensure transparency in the decision-making process and maintain fairness in evaluating applications, avoiding bias based on individual features.
  * **Communication with Applicants:** Clearly communicate the criteria for visa approval to applicants, providing guidance on areas that can strengthen their applications.

By incorporating these insights and recommendations, EasyVisa can enhance the efficiency and effectiveness of the visa approval process, providing a more informed and data-driven approach to decision-making.

___