# Easy Visa (Visa Application Analysis and Classification)

## Overview

This project focuses on analyzing and predicting visa application outcomes using machine learning. The Office of Foreign Labor Certification (OFLC) processes thousands of applications for employers seeking to bring foreign workers into the U.S. every year. As the number of applications increases, it becomes increasingly tedious to manually review all cases.

This project aims to:
- Facilitate the process of visa approvals using a machine learning classification model.
- Recommend a suitable profile for applicants based on the significant factors that influence visa approval or denial.

## Objective

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions, a 9% increase from the previous year. Given this increasing number of applications, the goal of this project is to develop a **Machine Learning** solution that helps predict visa certification outcomes and shortlists candidates with a higher likelihood of approval.

## Dataset Description

The dataset contains attributes related to both the employee (foreign worker) and the employer. Below are key columns in the data:

- **case_id**: ID of each visa application.
- **continent**: Continent of the employee.
- **education_of_employee**: Employee's education level.
- **has_job_experience**: Indicates if the employee has previous job experience (Y/N).
- **requires_job_training**: Indicates if job training is required (Y/N).
- **no_of_employees**: Number of employees in the employer's company.
- **yr_of_estab**: Year the employer's company was established.
- **region_of_employment**: U.S. region where the foreign worker is employed.
- **prevailing_wage**: Average wage paid to similarly employed workers in the occupation area.
- **unit_of_wage**: Wage unit (Hourly, Weekly, Monthly, Yearly).
- **full_time_position**: Whether the position is full-time (Y/N).
- **case_status**: Visa certification status (Certified/Denied).

## Exploratory Data Analysis (EDA) Questions

The EDA seeks to answer key questions that will help us understand the drivers of visa certification:

1. **Education and Certification**: Does education level impact visa certification?
2. **Continent and Visa Status**: How does visa certification vary across different continents?
3. **Work Experience**: Does having job experience influence visa approval?
4. **Wage Unit**: Which wage unit (Hourly, Weekly, Monthly, Yearly) is most likely to lead to visa certification?
5. **Prevailing Wage**: How does visa status change with different levels of prevailing wage?
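
Most of the questions above reduce to comparing the share of certified cases across the values of one column. The sketch below shows one way to compute that share with pandas. The helper name `certification_rate` and the six sample rows are made up for illustration and are not part of the project's code or data.

```python
import pandas as pd

# Made-up illustrative rows; the real data lives in EasyVisa.csv.
sample = pd.DataFrame({
    "education_of_employee": [
        "High School", "Master's", "Bachelor's",
        "Doctorate", "High School", "Master's",
    ],
    "case_status": [
        "Denied", "Certified", "Denied",
        "Certified", "Certified", "Certified",
    ],
})


def certification_rate(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the share of Certified cases for each value of `column`."""
    certified = df["case_status"].eq("Certified")
    # The mean of a boolean Series is the fraction of True values per group.
    return certified.groupby(df[column]).mean().sort_values(ascending=False)


rates = certification_rate(sample, "education_of_employee")
print(rates)
```

The same call should work for `continent`, `has_job_experience`, or `unit_of_wage`. For a numeric column such as `prevailing_wage`, the values would first need to be binned, for example with `pd.qcut`.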

## Project Structure

- `data/`: Contains the raw dataset for visa applications.
- `notebooks/`: Jupyter notebooks used for data exploration, analysis, and modeling.
- `scripts/`: Python scripts for data preprocessing, feature engineering, and model training.
- `models/`: Directory to store trained models.
- `README.md`: Project documentation (this file).

## Machine Learning Approach

The classification model is designed to predict whether a visa application will be **certified** or **denied**. The following steps are taken:

1. **Data Preprocessing**: Cleaning the data, handling missing values, encoding categorical variables, and scaling numerical features.
2. **Feature Engineering**: Creating new features based on domain knowledge, such as grouping wage units or calculating the company's age.
3. **Model Selection**: Evaluating different classification models like Logistic Regression, Decision Trees, Random Forests, and Gradient Boosting Machines.
4. **Model Evaluation**: Using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC to assess model performance.
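
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the project's actual implementation. It assumes one-hot encoding for the categorical columns, a reference year of 2016 for the company-age feature, and Logistic Regression as the model. The demo rows are synthetic, and the labeling rule is invented so that the example has something to learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Category levels used only to generate synthetic demo rows (not real data).
LEVELS = {
    "continent": ["Asia", "Europe", "Africa"],
    "education_of_employee": ["High School", "Bachelor's", "Master's", "Doctorate"],
    "has_job_experience": ["Y", "N"],
    "requires_job_training": ["Y", "N"],
    "region_of_employment": ["West", "Northeast", "South"],
    "unit_of_wage": ["Hour", "Year"],
    "full_time_position": ["Y", "N"],
}
CATEGORICAL = list(LEVELS)
NUMERIC = ["no_of_employees", "company_age", "prevailing_wage"]


def add_company_age(df: pd.DataFrame, reference_year: int = 2016) -> pd.DataFrame:
    """Feature engineering: replace yr_of_estab with the company's age.

    reference_year=2016 is an assumption made for this sketch.
    """
    out = df.copy()
    out["company_age"] = reference_year - out["yr_of_estab"]
    return out.drop(columns="yr_of_estab")


def build_pipeline() -> Pipeline:
    """Preprocessing (encoding + scaling) followed by a classifier."""
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("num", StandardScaler(), NUMERIC),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])


# Synthetic demo data.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({col: rng.choice(vals, size=n) for col, vals in LEVELS.items()})
demo["no_of_employees"] = rng.integers(10, 5000, size=n)
demo["yr_of_estab"] = rng.integers(1900, 2017, size=n)
demo["prevailing_wage"] = rng.uniform(20_000, 150_000, size=n)
# Invented rule so the demo has a learnable signal: 1 = Certified, 0 = Denied.
y = (demo["has_job_experience"] == "Y").astype(int)

X = add_company_age(demo)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model = build_pipeline().fit(X_train, y_train)
pred = model.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print(f"accuracy={accuracy:.2f}, f1={f1_score(y_test, pred):.2f}")
```

Keeping the preprocessing inside the pipeline means the encoder and scaler are fitted on the training split only, which avoids leaking test-set statistics into the model. Swapping `LogisticRegression` for a tree-based classifier only requires changing the last pipeline step.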

## Results

- The final model demonstrates significant accuracy in predicting visa certification.
- Features such as **education level**, **work experience**, and **prevailing wage** are strong indicators of visa approval.


In [5]:
# Install XGBoost
!pip install xgboost



In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library to help with statistical analysis
import scipy.stats as stats
from mpl_toolkits.mplot3d import axes3d
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from matplotlib.colors import ListedColormap
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
)
import statsmodels.api as sm
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# Import StandardScaler for feature scaling
from sklearn.preprocessing import StandardScaler
# import RFE
from sklearn.feature_selection import RFE
# To ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# mount the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
path='/content/drive/MyDrive/Python Course'

In [12]:
# Store the palette name for future use
pellete='Set2'
colors = sns.color_palette(pellete)  # Get Set2 color palette for future use
sns.set(style="darkgrid") # Set grid style

## **Data Overview**

- Observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not
- Get information about the number of rows and columns in the dataset
- Find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.
- Check the statistical summary of the dataset to get an overview of the numerical columns of the data
- Check for missing (null) values
- Check for duplicate rows

In [7]:
# Load the data into a pandas DataFrame
ez_df=pd.read_csv(f'{path}/EasyVisa.csv')

In [8]:
# Deep copy the dataframe
ezdf=ez_df.copy(deep=True)

In [9]:
# Detail info about the dataset
ezdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB


### Observations:
The dataset contains 12 columns.  
Below is the summary:

#### Dataset Summary

| Column                  | Data Type | Non-Null Count | Description                                                            |
|-------------------------|-----------|----------------|------------------------------------------------------------------------|
| `case_id`               | object    | 25480          | Unique identifier for each case                                        |
| `continent`             | object    | 25480          | Continent of the employee                                              |
| `education_of_employee` | object    | 25480          | Education level of the employee                                        |
| `has_job_experience`    | object    | 25480          | Indicates if the employee has job experience (Y/N)                     |
| `requires_job_training` | object    | 25480          | Indicates if job training is required (Y/N)                            |
| `no_of_employees`       | int64     | 25480          | Number of employees in the employer's company                          |
| `yr_of_estab`           | int64     | 25480          | Year the employer's company was established                            |
| `region_of_employment`  | object    | 25480          | U.S. region of employment                                              |
| `prevailing_wage`       | float64   | 25480          | Average wage paid to similarly employed workers in the occupation area |
| `unit_of_wage`          | object    | 25480          | Unit of wage (Hour, Week, Month, Year)                                 |
| `full_time_position`    | object    | 25480          | Indicates if the position is full-time (Y/N)                           |
| `case_status`           | object    | 25480          | Status of the case (Certified or Denied)                               |


In [15]:
# Check null values
ezdf.isnull().sum()

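| Column                  | Null count |
|-------------------------|------------|
| `case_id`               | 0          |
| `continent`             | 0          |
| `education_of_employee` | 0          |
| `has_job_experience`    | 0          |
| `requires_job_training` | 0          |
| `no_of_employees`       | 0          |
| `yr_of_estab`           | 0          |
| `region_of_employment`  | 0          |
| `prevailing_wage`       | 0          |
| `unit_of_wage`          | 0          |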


In [14]:
ezdf.shape

(25480, 12)

### Observations:
There are 25,480 rows and 12 columns in the dataset, and no column has null values. Columns 5, 6, and 8 (`no_of_employees`, `yr_of_estab`, and `prevailing_wage`) are numeric.

In [11]:
ezdf.head(5)

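|   | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---------|-----------|-----------------------|--------------------|-----------------------|-----------------|-------------|----------------------|-----------------|--------------|--------------------|-------------|
| 0 | EZYV01  | Asia      | High School           | N                  | N                     | 14513           | 2007        | West                 | 592.2029        | Hour         | Y                  | Denied      |
| 1 | EZYV02  | Asia      | Master's              | Y                  | N                     | 2412            | 2002        | Northeast            | 83425.65        | Year         | Y                  | Certified   |
| 2 | EZYV03  | Asia      | Bachelor's            | N                  | Y                     | 44444           | 2008        | West                 | 122996.86       | Year         | Y                  | Denied      |
| 3 | EZYV04  | Asia      | Bachelor's            | N                  | N                     | 98              | 1897        | West                 | 83434.03        | Year         | Y                  | Denied      |
| 4 | EZYV05  | Africa    | Master's              | Y                  | N                     | 1082            | 2005        | South                | 149907.39       | Year         | Y                  | Certified   |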


### Observations:

#### Categorical Columns
- `case_id` (Object)
- `continent` (Object)
- `education_of_employee` (Object)
- `has_job_experience` (Object)
- `requires_job_training` (Object)
- `region_of_employment` (Object)
- `unit_of_wage` (Object)
- `full_time_position` (Object)
- `case_status` (Object)

#### Numerical Columns
- `no_of_employees` (Int64)
- `yr_of_estab` (Int64)
- `prevailing_wage` (Float64)

### Summary
- **Total Categorical Columns:** 9
- **Total Numerical Columns:** 3
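
The split into categorical and numerical columns can also be derived programmatically rather than listed by hand. The sketch below uses `DataFrame.select_dtypes` on a two-row sample copied from the `head()` output above, with a reduced set of columns for brevity. On the full frame, the same two calls applied to `ezdf` should give 9 and 3 columns.

```python
import pandas as pd

# Two rows taken from the head() output, with a subset of the columns.
sample = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02"],
    "continent": ["Asia", "Asia"],
    "no_of_employees": [14513, 2412],
    "yr_of_estab": [2007, 2002],
    "prevailing_wage": [592.2029, 83425.65],
    "case_status": ["Denied", "Certified"],
})

# Numeric columns are selected directly; everything else is treated as categorical.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(f"Categorical ({len(categorical_cols)}): {categorical_cols}")
print(f"Numerical ({len(numerical_cols)}): {numerical_cols}")
```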

In [16]:
# Check for duplicates in dataset
has_duplicates = ezdf.duplicated().any()

print(f"Does the DataFrame have duplicates? {has_duplicates}")

Does the DataFrame have duplicates? False


In [26]:
# Unique values
# Use every categorical column except case_id, which is a unique identifier used only for joins or reference
catgoricalcol=ezdf.select_dtypes(include=['object']).columns[1:]
for col in catgoricalcol:
  print(f'Unique values in {col}:')
  print(ezdf[col].unique())

Unique values in continent:
['Asia' 'Africa' 'North America' 'Europe' 'South America' 'Oceania']
Unique values in education_of_employee:
['High School' "Master's" "Bachelor's" 'Doctorate']
Unique values in has_job_experience:
['N' 'Y']
Unique values in requires_job_training:
['N' 'Y']
Unique values in region_of_employment:
['West' 'Northeast' 'South' 'Midwest' 'Island']
Unique values in unit_of_wage:
['Hour' 'Year' 'Week' 'Month']
Unique values in full_time_position:
['Y' 'N']
Unique values in case_status:
['Denied' 'Certified']


### Observations:
#### Unique Values for Categorical Columns

- **Continent:**
  - `['Asia', 'Africa', 'North America', 'Europe', 'South America', 'Oceania']`
  
- **Education of Employee:**
  - `['High School', "Master's", "Bachelor's", 'Doctorate']`
  
- **Has Job Experience:**
  - `['N', 'Y']`
  
- **Requires Job Training:**
  - `['N', 'Y']`
  
- **Region of Employment:**
  - `['West', 'Northeast', 'South', 'Midwest', 'Island']`
  
- **Unit of Wage:**
  - `['Hour', 'Year', 'Week', 'Month']`
  
- **Full-Time Position:**
  - `['Y', 'N']`
  
- **Case Status:**
  - `['Denied', 'Certified']`


In [28]:
ezdf.describe(exclude='object').T

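|                 | count   | mean         | std          | min    | 25%      | 50%      | 75%         | max       |
|-----------------|---------|--------------|--------------|--------|----------|----------|-------------|-----------|
| no_of_employees | 25480.0 | 5667.04321   | 22877.928848 | -26.0  | 1022.0   | 2109.0   | 3504.0      | 602069.0  |
| yr_of_estab     | 25480.0 | 1979.409929  | 42.366929    | 1800.0 | 1976.0   | 1997.0   | 2005.0      | 2016.0    |
| prevailing_wage | 25480.0 | 74455.814592 | 52815.942327 | 2.1367 | 34015.48 | 70308.21 | 107735.5125 | 319210.27 |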


### Observations:
- The minimum value of `no_of_employees` is -26. A negative headcount is not valid, so such entries are likely data-entry errors that need to be handled before modeling.
- There is a huge gap between the 75th percentile of `no_of_employees` (3,504) and the maximum (602,069). This is probably due to a few very large organizations.
- The average prevailing wage is about $74,456. The 75th percentile (about 107,736) is far below the maximum (319,210), so there may be outliers.
- The oldest company that has applied for a visa for an employee was established in 1800.
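
The two data-quality concerns above, negative headcounts and possible outliers, can be checked with a couple of small helpers. This is a sketch under stated assumptions: the 1.5 × IQR rule is a common convention chosen here for illustration, the helper names are hypothetical, and the demo series contains only the five summary values (min, quartiles, max) from the `describe()` output rather than the full column.

```python
import pandas as pd


def count_negative(series: pd.Series) -> int:
    """Number of entries below zero (invalid for a headcount)."""
    return int((series < 0).sum())


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Summary values of no_of_employees from describe(), used as a stand-in column.
employees = pd.Series([-26, 1022, 2109, 3504, 602069], name="no_of_employees")

n_negative = count_negative(employees)
outliers = employees[iqr_outlier_mask(employees)]
print(f"negative entries: {n_negative}")
print(f"IQR outliers: {outliers.tolist()}")
```

On the real data the calls would be `count_negative(ezdf["no_of_employees"])` and `iqr_outlier_mask(ezdf["prevailing_wage"])`. How to treat the flagged rows (drop, clip, or take absolute values) is a separate decision that this sketch does not make.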



In [29]:
ezdf.describe(include='object').T

|                       | count | unique | top        | freq  |
|-----------------------|-------|--------|------------|-------|
| case_id               | 25480 | 25480  | EZYV01     | 1     |
| continent             | 25480 | 6      | Asia       | 16861 |
| education_of_employee | 25480 | 4      | Bachelor's | 10234 |
| has_job_experience    | 25480 | 2      | Y          | 14802 |
| requires_job_training | 25480 | 2      | N          | 22525 |
| region_of_employment  | 25480 | 5      | Northeast  | 7195  |
| unit_of_wage          | 25480 | 4      | Year       | 22962 |
| full_time_position    | 25480 | 2      | Y          | 22773 |
| case_status           | 25480 | 2      | Certified  | 17018 |


### Insights from Categorical Columns
1. **Continent:**
   - **Unique Values:** 6 (Asia, Africa, North America, Europe, South America, Oceania)
   - **Most Frequent Value:** Asia (16,861 occurrences)
   - **Insight:** About 66% of applications (16,861 of 25,480) are for employees from Asia, so findings for the other continents rest on far fewer cases.

2. **Education of Employee:**
   - **Unique Values:** 4 (High School, Bachelor's, Master's, Doctorate)
   - **Most Frequent Value:** Bachelor's degree (10,234 occurrences)
   - **Insight:** Bachelor's is the most common education level, at about 40% of cases (10,234 of 25,480). It is a plurality rather than a majority, so the other three levels together account for most applications.

3. **Job Experience:**
   - **Unique Values:** 2 (Yes, No)
   - **Most Frequent Value:** Yes (14,802 occurrences)
   - **Insight:** A substantial number of employees possess job experience, indicating a more skilled workforce that could reduce training time and costs.

4. **Job Training Requirement:**
   - **Unique Values:** 2 (Yes, No)
   - **Most Frequent Value:** No (22,525 occurrences)
   - **Insight:** A majority of cases do not require job training, suggesting that the workforce may already be well-trained or that job roles typically do not demand extensive training.

5. **Region of Employment:**
   - **Unique Values:** 5 (West, Northeast, South, Midwest, Island)
   - **Most Frequent Value:** Northeast (7,195 occurrences)
   - **Insight:** The Northeast region has the highest representation, at about 28% of cases.

6. **Unit of Wage:**
   - **Unique Values:** 4 (Hour, Year, Week, Month)
   - **Most Frequent Value:** Year (22,962 occurrences)
   - **Insight:** Most employees are compensated on an annual basis, which is typical for salaried positions, indicating a potential need to analyze wage structures and salary ranges.

7. **Full-Time Position:**
   - **Unique Values:** 2 (Yes, No)
   - **Most Frequent Value:** Yes (22,773 occurrences)
   - **Insight:** The majority of cases are full-time positions, which could reflect workforce stability and employee retention strategies.

8. **Case Status:**
   - **Unique Values:** 2 (Certified, Denied)
   - **Most Frequent Value:** Certified (17,018 occurrences)
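   - **Insight:** About 67% of cases are certified (17,018 of 25,480), so the target classes are moderately imbalanced. This matters for model evaluation, since a classifier that always predicts Certified would already reach roughly 67% accuracy.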
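
The counts in the insights above are easier to compare as percentages. The sketch below converts the `case_status` counts from the `describe()` output into shares. The Denied count is derived as the total minus the Certified count, which holds because the column has only two values and no nulls.

```python
import pandas as pd

TOTAL_CASES = 25480
CERTIFIED = 17018

# Counts per class; Denied is derived because case_status has only two values.
counts = pd.Series({"Certified": CERTIFIED, "Denied": TOTAL_CASES - CERTIFIED})

# Share of each class as a percentage, rounded to one decimal place.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

On the loaded frame, `ezdf["case_status"].value_counts(normalize=True)` should give the same proportions directly, and the same call works for any of the other categorical columns.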