Project Objective

In FY 2016, the Office of Foreign Labor Certification (OFLC) processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications, a nine percent increase in processed applications over the previous year. Reviewing every case is becoming increasingly tedious as the number of applicants grows each year.


This growth calls for a machine-learning-based solution that can help shortlist the candidates with a higher chance of visa approval. OFLC has hired the firm EasyVisa for data-driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a classification model:


- Facilitate the process of visa approvals.
- Recommend a suitable profile for the applicants for whom the visa should be certified or denied, based on the drivers that significantly influence the case status.

## **DATA CLEANING**


#### **1. Import Libraries and Load Data**

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

print("Libraries have been imported successfully")

Libraries have been imported successfully


In [2]:
def load_data(url):
    """Read a CSV file from a URL or file path into a pandas DataFrame.

    Parameters: url (URL or file path of the CSV file)
    Returns: DataFrame
    """
    df = pd.read_csv(url)
    print("Dataset has been loaded successfully")

    return df

In [3]:
url = "https://raw.githubusercontent.com/ek-chris/Practice_datasets/refs/heads/main/EasyVisa%20(1).csv"
df = load_data(url)
df

Dataset has been loaded successfully


,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.6500,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.8600,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.0300,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.3900,Year,Y,Certified
...,...,...,...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,Asia,Bachelor's,Y,Y,2601,2008,South,77092.5700,Year,Y,Certified
25476,EZYV25477,Asia,High School,Y,N,3274,2006,Northeast,279174.7900,Year,Y,Certified
25477,EZYV25478,Asia,Master's,Y,N,1121,1910,South,146298.8500,Year,N,Certified
25478,EZYV25479,Asia,Master's,Y,Y,1918,1887,West,86154.7700,Year,Y,Certified
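
After loading, it can be worth confirming that the columns the rest of the notebook relies on are actually present. The following is a minimal sketch under stated assumptions: `check_columns` and `EXPECTED_COLUMNS` are hypothetical names (not part of the original notebook), and the two sample rows are copied from the preview above and read from an in-memory string so the example runs without network access.

```python
import io

import pandas as pd

# Column names taken from the preview above
EXPECTED_COLUMNS = [
    "case_id", "continent", "education_of_employee", "has_job_experience",
    "requires_job_training", "no_of_employees", "yr_of_estab",
    "region_of_employment", "prevailing_wage", "unit_of_wage",
    "full_time_position", "case_status",
]


def check_columns(df, expected=EXPECTED_COLUMNS):
    """Hypothetical helper: return the expected columns that are missing from df."""
    return [col for col in expected if col not in df.columns]


# Two rows copied from the preview, used here instead of the remote CSV
sample_csv = io.StringIO(
    "case_id,continent,education_of_employee,has_job_experience,"
    "requires_job_training,no_of_employees,yr_of_estab,region_of_employment,"
    "prevailing_wage,unit_of_wage,full_time_position,case_status\n"
    "EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied\n"
    "EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y,Certified\n"
)
sample_df = pd.read_csv(sample_csv)

missing_columns = check_columns(sample_df)
print(f"Missing columns: {missing_columns}")
```

An empty list means every expected column was found; any name in the list points to a column that would break later steps such as `set_index("case_id")`.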


In [4]:
df_copy = df.copy()
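
The copy matters because later steps change the working DataFrame in place. As a small illustration (a sketch on a made-up two-row frame, with `raw` and `backup` as hypothetical names), an in-place `set_index` on the working frame leaves the copy untouched:

```python
import pandas as pd

# Tiny stand-in for the loaded dataset
raw = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02"],
    "case_status": ["Denied", "Certified"],
})
backup = raw.copy()  # DataFrame.copy() makes a deep copy by default

# In-place change on the working frame only
raw.set_index("case_id", inplace=True)

print(list(raw.columns))     # case_id has moved into the index
print(list(backup.columns))  # the copy still has case_id as a column
```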

In [5]:
def clean_data(df):
    """Check the dataset for missing and duplicate values and handle both.

    Parameters: raw DataFrame
    Returns: cleaned DataFrame
    """
    # Use case_id as the index (returns a new DataFrame; the input is not modified)
    print("\n1. Set Index:")
    df = df.set_index("case_id")
    # Basic information about the dataset (df.info() prints directly and returns None)
    print("\n2. Basic Information of the dataset")
    df.info()
    # Shape of the dataset
    print("\n3. Shape of the dataset")
    print(f"Shape: {df.shape}")
    # Missing values per column
    print("\n4. Checking and handling missing values")
    missing_values = df.isna().sum()
    print(missing_values)
    # Percentage of rows that contain at least one missing value
    percent_missing = df.isna().any(axis=1).mean() * 100
    print(f"Percentage of rows with missing values: {percent_missing}%")
    if percent_missing == 0:
        print("No missing values")
    elif percent_missing <= 5:
        # Few affected rows: drop them
        df = df.dropna()
    else:
        # Many affected rows: impute instead of dropping
        num_col = df.select_dtypes(include=["float64", "int64"]).columns
        cat_col = df.select_dtypes(include="object").columns
        # Median for numerical columns, most frequent value for categorical columns
        df[num_col] = df[num_col].fillna(df[num_col].median())
        df[cat_col] = df[cat_col].fillna(df[cat_col].mode().iloc[0])
    # Rows are duplicates when every column matches; the case_id index is not compared
    print("\n5. Checking and handling duplicate values")
    duplicates = df.duplicated().sum()
    print(f"Duplicates: {duplicates}")
    if duplicates > 0:
        df = df.drop_duplicates()
    else:
        print("No Duplicate Values")

    return df


In [6]:
df = clean_data(df)
df


1. Set Index:

2. Basic Information of the dataset
<class 'pandas.core.frame.DataFrame'>
Index: 25480 entries, EZYV01 to EZYV25480
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   continent              25480 non-null  object 
 1   education_of_employee  25480 non-null  object 
 2   has_job_experience     25480 non-null  object 
 3   requires_job_training  25480 non-null  object 
 4   no_of_employees        25480 non-null  int64  
 5   yr_of_estab            25480 non-null  int64  
 6   region_of_employment   25480 non-null  object 
 7   prevailing_wage        25480 non-null  float64
 8   unit_of_wage           25480 non-null  object 
 9   full_time_position     25480 non-null  object 
 10  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 2.3+ MB

3. Shape of the dataset
Shape: (25480, 11)

4. Checking and handling missing values
contine

case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.6500,Year,Y,Certified
EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.8600,Year,Y,Denied
EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.0300,Year,Y,Denied
EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.3900,Year,Y,Certified
...,...,...,...,...,...,...,...,...,...,...,...
EZYV25476,Asia,Bachelor's,Y,Y,2601,2008,South,77092.5700,Year,Y,Certified
EZYV25477,Asia,High School,Y,N,3274,2006,Northeast,279174.7900,Year,Y,Certified
EZYV25478,Asia,Master's,Y,N,1121,1910,South,146298.8500,Year,N,Certified
EZYV25479,Asia,Master's,Y,Y,1918,1887,West,86154.7700,Year,Y,Certified
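
The info output above reports 25480 non-null values in every column, so the imputation branch of `clean_data` is never exercised on this dataset. To show what that branch does, here is a minimal sketch on a made-up four-row frame (`toy` and its values are invented for illustration): numerical gaps are filled with the column median and categorical gaps with the most frequent value.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column
toy = pd.DataFrame({
    "prevailing_wage": [100.0, np.nan, 300.0, 500.0],
    "continent": ["Asia", "Asia", None, "Africa"],
})

# Split columns into numerical and non-numerical (treated as categorical here)
num_col = toy.select_dtypes(include="number").columns
cat_col = toy.select_dtypes(exclude="number").columns

# Median for numerical columns, most frequent value (mode) for categorical columns
toy[num_col] = toy[num_col].fillna(toy[num_col].median())
toy[cat_col] = toy[cat_col].fillna(toy[cat_col].mode().iloc[0])

print(toy)
```

The median is used rather than the mean because it is less sensitive to extreme values, which is a common choice for skewed columns such as wages.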


In [7]:
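def save_data(df):
    """Save the DataFrame to a CSV file.

    Parameters: DataFrame
    Returns: None
    """
    # index=False leaves the index out of the file, so a case_id index is not saved
    df.to_csv("Easy_Vias_clean_data.csv", index=False)
    print("Dataset successfully saved")

In [8]:
save_data(df)

Dataset successfully saved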
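
Because `case_id` is the index at this point, saving with `index=False` writes the file without the case identifiers. If they are needed later, one option is to write the index and restore it on reading. The sketch below assumes a made-up two-row frame and a hypothetical file name in a temporary directory; it is not the notebook's own save step.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the cleaned dataset, indexed by case_id
cleaned = pd.DataFrame(
    {"continent": ["Asia", "Africa"], "case_status": ["Denied", "Certified"]},
    index=pd.Index(["EZYV01", "EZYV05"], name="case_id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_clean_data.csv")  # hypothetical file name
    # index=True writes case_id as the first column of the file
    cleaned.to_csv(path, index=True)
    # index_col restores case_id as the index when reading the file back
    restored = pd.read_csv(path, index_col="case_id")

print(restored.index.tolist())
```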
