## Problem Statement

### Business Context

Business communities in the United States face high demand for human resources, but identifying and attracting the right talent remains a constant challenge and is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

### Objective

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The growing number of applicants calls for a machine-learning-based solution that can help shortlist candidates with higher chances of visa approval. OFLC has hired the firm EasyVisa for data-driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a classification model:

* Facilitate the process of visa approvals.
* Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.

### Data Description

The data contains attributes of both the employee and the employer. The detailed data dictionary is given below.

* case_id: ID of each visa application
* continent: Continent of the employee
* education_of_employee: Education level of the employee
* has_job_experience: Does the employee have any job experience? Y = Yes; N = No
* requires_job_training: Does the employee require any job training? Y = Yes; N = No
* no_of_employees: Number of employees in the employer's company
* yr_of_estab: Year in which the employer's company was established
* region_of_employment: Foreign worker's intended region of employment in the US
* prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment.
* unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
* full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
* case_status: Flag indicating whether the visa was certified or denied
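
Because `prevailing_wage` is quoted in different units (`unit_of_wage`), wages are only comparable once they share a common basis. The sketch below shows one way to annualize them. The conversion factors (a 40-hour week, 52 paid weeks per year) and the unit labels `Hour`, `Week`, `Month`, `Year` are assumptions made for illustration, not values taken from OFLC or from this notebook, and `annualize_wage` is a hypothetical helper.

```python
# Hypothetical conversion factors: assumes a 40-hour week and 52 paid weeks per year.
PERIODS_PER_YEAR = {"Hour": 2080, "Week": 52, "Month": 12, "Year": 1}

def annualize_wage(prevailing_wage: float, unit_of_wage: str) -> float:
    """Convert a wage quoted per hour/week/month/year into an annual figure."""
    unit = unit_of_wage.strip().title()  # tolerate 'hour', ' HOUR ', etc.
    if unit not in PERIODS_PER_YEAR:
        raise ValueError(f"Unknown unit_of_wage: {unit_of_wage!r}")
    return prevailing_wage * PERIODS_PER_YEAR[unit]

# Illustrative values, not rows from the dataset
print(annualize_wage(25.0, "Hour"))     # 52000.0
print(annualize_wage(1000.0, "Month"))  # 12000.0
```

Raising on an unknown unit, rather than returning the wage unchanged, keeps a mislabeled row from silently mixing hourly and yearly figures.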

## Installing and Importing the necessary libraries

In [82]:
# Check Python version
import sys
print(sys.version)

3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]


In [83]:
# Upgrade build tools first
%pip install -q --upgrade pip setuptools wheel

# Pinned package versions (installed here under Python 3.10.12)
%pip install -q \
  numpy==2.0.2 \
  pandas==2.2.2 \
  matplotlib==3.8.4 \
  seaborn==0.13.2 \
  scikit-learn==1.6.1 \
  sklearn-pandas==2.2.0 \
  xgboost==2.0.3

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


**Note**: *After running the above cell, restart the notebook kernel and run all cells sequentially from the next cell onward.*

In [84]:
# Core data science stack
try:
    import os
    import sys
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
except ImportError as e:
    print(f"[ERROR] Core package missing: {e.name}.")
    print("Solution: Install with pip, e.g.: !pip install pandas numpy matplotlib seaborn")

# Utilities
try:
    from pathlib import Path
    # from google.colab import drive  # uncomment only when running in Colab
    from datetime import datetime
    from collections import Counter
except ImportError as e:
    print(f"[ERROR] Utility package missing: {e.name}.")
    print("Solution: If running outside Colab, remove or replace 'google.colab' imports.")

# Version verification
try:
    import pandas, numpy, matplotlib, seaborn, sklearn, xgboost
    print("\n[OK] Installed package versions:")
    print(f" * Python:     {sys.version.split()[0]}")
    print(f" * pandas:     {pandas.__version__}")
    print(f" * numpy:      {numpy.__version__}")
    print(f" * matplotlib: {matplotlib.__version__}")
    print(f" * seaborn:    {seaborn.__version__}")
    print(f" * scikit-learn: {sklearn.__version__}")
    print(f" * xgboost:    {xgboost.__version__}")
except Exception as e:
    print("[ERROR] Could not verify all package versions:", e)
    print("Double-check that packages are installed with pip/conda.")


[OK] Installed package versions:
 * Python:     3.10.12
 * pandas:     2.2.2
 * numpy:      2.0.2
 * matplotlib: 3.8.4
 * seaborn:    0.13.2
 * scikit-learn: 1.6.1
 * xgboost:    2.0.3


## Import Dataset

In [85]:
# Primary path under WSL (the Windows C: drive is mounted at /mnt/c)
p1 = Path("/mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv")
# Fallback in case the Desktop folder is not synced under OneDrive
p2 = Path("/mnt/c/Users/caste/Desktop/MLE/AI ML/EasyVisa.csv")

data_path = p1 if p1.exists() else (p2 if p2.exists() else None)

print("[INFO] Candidates:")
print("  -", p1)
print("  -", p2)

if data_path is None:
    raise FileNotFoundError(
        "[ERROR] Could not find EasyVisa.csv at either location above.\n"
        "Checks:\n"
        "  • Confirm the file is not cloud-only in OneDrive.\n"
        "  • Verify the path inside WSL\n"
        "  • If stored elsewhere, update the path here."
    )

print(f"[OK] Using: {data_path}")

# Loader with encoding fallback
def read_csv_safely(path: Path, **kwargs):
    try:
        return pd.read_csv(path, **kwargs)
    except UnicodeDecodeError:
        print(f"[WARN] UnicodeDecodeError for {path.name}. Retrying with latin-1 …")
        return pd.read_csv(path, encoding="latin-1", engine="python", **kwargs)
    except pd.errors.ParserError as e:
        print(f"[WARN] ParserError for {path.name}: {e}\nTrying engine='python' with on_bad_lines='skip'.")
        return pd.read_csv(path, engine="python", on_bad_lines="skip", **kwargs)

# Load
df = read_csv_safely(data_path)
print(f"[OK] Loaded {data_path.name}")

[INFO] Candidates:
  - /mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv
  - /mnt/c/Users/caste/Desktop/MLE/AI ML/EasyVisa.csv
[OK] Using: /mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv


[OK] Loaded EasyVisa.csv


## Overview of the Dataset

#### View the first and last 5 rows of the dataset

In [86]:
# Dataset first 5 rows
display(df.head()
        .style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
        )


Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV21644,Europe,Bachelor's,N,Y,3126,1800,South,192806.06,Year,Y,Certified
1,EZYV24364,Asia,Master's,Y,N,1649,1800,Midwest,148321.57,Year,Y,Certified
2,EZYV22098,Asia,Bachelor's,Y,N,2043,1800,Northeast,145985.43,Year,Y,Certified
3,EZYV12103,Asia,High School,N,Y,808,1800,West,127303.96,Year,Y,Denied
4,EZYV13156,Asia,Master's,Y,N,573,1800,Northeast,124457.5,Year,Y,Certified


In [87]:
# Dataset last 5 rows
display(df.tail()
        .style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
        )


Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
25475,EZYV5163,Asia,Bachelor's,Y,N,500,2016,Northeast,12476.4,Year,Y,Certified
25476,EZYV15379,Europe,Doctorate,N,N,41,2016,Northeast,9442.23,Year,Y,Certified
25477,EZYV15437,Europe,Bachelor's,Y,N,1338,2016,West,639.9571,Hour,Y,Denied
25478,EZYV20413,Asia,High School,N,N,1655,2016,Northeast,538.9966,Hour,Y,Denied
25479,EZYV8640,Europe,Master's,N,N,2089,2016,West,399.6297,Hour,Y,Denied


#### Understand the shape of the dataset

In [88]:
# Print the number of rows and columns
print(f"\n[INFO] Dataset shape: {df.shape}")


[INFO] Dataset shape: (25480, 12)


#### Check the data types of the columns for the dataset

In [89]:
# Dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB


## Exploratory Data Analysis (EDA)

Setting the data up beforehand helps avoid pitfalls later in the pipeline. This section covers:

* Rock-solid config & verification (RNG, target integrity, reproducibility)

* Canonical maps for messy strings (target, Y/N flags, categories)

* A schema guard that prints unknown headers, missing canon columns, and collisions

* A single build_canonical_df(df) entrypoint returning a clean, model-ready frame

* Stratified split sanity and class-balance report
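
To make the last bullet concrete, here is a minimal sketch of a stratified split with a class-balance check, written in plain Python so it runs without the notebook's dependencies. The functions `stratified_split` and `class_shares`, the seed, and the toy labels are all hypothetical. In the notebook itself, `sklearn.model_selection.train_test_split` with its `stratify` argument would be the usual choice.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=72):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)

    train_idx, test_idx = [], []
    for lab in sorted(by_class):          # fixed class order keeps the split reproducible
        idx = by_class[lab]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def class_shares(labels, idx):
    """Share of each class among the rows selected by idx."""
    counts = Counter(labels[i] for i in idx)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}

# Toy labels with a 2:1 imbalance (illustrative only)
labels = ["Certified"] * 200 + ["Denied"] * 100
train_idx, test_idx = stratified_split(labels)

print("train:", class_shares(labels, train_idx))
print("test: ", class_shares(labels, test_idx))
```

The sanity check is that both printed dictionaries show the same class shares; a plain random split would only match approximately.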

In [90]:
# A) CONFIG & VERIFICATION (Pro)
import re, difflib
from typing import Dict, Any

# --- Config ---
RNG = 72
np.random.seed(RNG)

TARGET_RAW = "case_status"     # raw target column in EasyVisa
TARGET_LBL = "_case_label"     # canonical string label: 'Certified' / 'Denied'
TARGET_BIN = "y"               # numeric target: 1/0

# Minimal raw columns expected before any feature engineering
EXPECTED_COLS = {
    "case_id","continent","education_of_employee","has_job_experience",
    "requires_job_training","no_of_employees","yr_of_estab",
    "region_of_employment","prevailing_wage","unit_of_wage",
    "full_time_position","case_status"
}

# --- Utilities ---
def verify_rng(seed: int = RNG) -> None:
    """Deterministically verify RNG reproducibility."""
    a = np.random.RandomState(seed).rand(5)
    b = np.random.RandomState(seed).rand(5)
    assert np.allclose(a, b), "RNG reproducibility failed"
    print("[OK] RNG reproducibility")

def verify_target_exists(df: pd.DataFrame) -> None:
    """Ensure the raw target column is present."""
    assert TARGET_RAW in df.columns, f"Missing target column `{TARGET_RAW}`"
    print("[OK] Target column present")

def report_class_balance(y: pd.Series, title: str = "[INFO] class balance") -> Dict[str, Any]:
    """Print and return class balance + imbalance ratio (if binary)."""
    vc = y.value_counts(dropna=False)
    ratio = float(vc.max()/vc.min()) if (len(vc) == 2 and vc.min() > 0) else float("nan")
    print(f"{title} →")
    print(vc.to_string())
    if np.isfinite(ratio):
        print(f"  Imbalance ratio ≈ {ratio:.2f}:1")
    return {"counts": vc.to_dict(), "ratio": ratio}

def verify_expected_columns(df: pd.DataFrame) -> Dict[str, Any]:
    """Check presence of all expected raw columns; print concise status."""
    missing = sorted([c for c in EXPECTED_COLS if c not in df.columns])
    extra   = sorted([c for c in df.columns if c not in EXPECTED_COLS])
    if missing:
        print(f"[WARN] Missing expected columns: {missing}")
    else:
        print("[OK] All expected columns present")
    if extra:
        print(f"[INFO] Extra columns (not required): {extra[:10]}{' …' if len(extra) > 10 else ''}")
    return {"missing": missing, "extra": extra, "passed": len(missing) == 0}

def verify_no_nans(series: pd.Series, name: str) -> None:
    """Hard guard to ensure no NaNs in a required series."""
    n = int(series.isna().sum())
    print(f"[VERIFY] {name}: NaN count = {n}")
    assert n == 0, f"Found NaNs in `{name}`"

# --- One-call dataset verifier ---
def verify_dataset(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False) -> Dict[str, Any]:
    """
    End-to-end verification for the raw EasyVisa frame.
    - RNG reproducibility
    - Target column exists
    - Expected raw columns present
    - (Optional) class balance if a canonical/encoded target is available
    Returns a dict report; raises if strict=True and checks fail.
    """
    print(f"\n=== VERIFICATION START: {name} ===")
    verify_rng()
    verify_target_exists(df)
    schema_rep = verify_expected_columns(df)

    # If the notebook has already created canonical target columns, report them
    report_lbl = {}
    report_bin = {}
    if TARGET_LBL in df.columns:
        report_lbl = report_class_balance(df[TARGET_LBL], "[INFO] target label balance")
    if TARGET_BIN in df.columns:
        report_bin = report_class_balance(df[TARGET_BIN], "[INFO] target numeric balance")

    passed = bool(schema_rep["passed"])
    summary = {
        "dataset": name,
        "rows": int(len(df)),
        "cols": int(df.shape[1]),
        "expected_ok": passed,
        "missing": schema_rep["missing"],
        "extra": schema_rep["extra"],
        "label_balance": report_lbl,
        "numeric_balance": report_bin,
        "passed": passed,
    }

    status = "PASS ✓" if summary["passed"] else "FAIL ✗"
    print(f"[SUMMARY] rows={summary['rows']} cols={summary['cols']} | "
          f"expected_ok={summary['expected_ok']}  ⇒  {status}")
    print(f"=== VERIFICATION END: {name} ===\n")

    if strict and not summary["passed"]:
        raise AssertionError(f"[{name}] Verification failed: {summary}")

    return summary

report = verify_dataset(df, name="EasyVisa Raw", strict=True)


=== VERIFICATION START: EasyVisa Raw ===
[OK] RNG reproducibility
[OK] Target column present
[OK] All expected columns present
[SUMMARY] rows=25480 cols=12 | expected_ok=True  ⇒  PASS ✓
=== VERIFICATION END: EasyVisa Raw ===



In [91]:
# B) CANONICALIZATION 
from typing import Dict, Any, Tuple
import difflib, re

# Helpers
_norm = lambda s: re.sub(r"[^a-z0-9]", "", str(s).lower())

_CANON_MAP: Dict[str,str] = {
    "caseid": "case_id",
    "continent": "continent",
    "educationofemployee": "education_of_employee",
    "hasjobexperience": "has_job_experience",
    "requiresjobtraining": "requires_job_training",
    "noofemployees": "no_of_employees",
    "yrofestab": "yr_of_estab",
    "regionofemployment": "region_of_employment",
    "prevailingwage": "prevailing_wage",
    "unitofwage": "unit_of_wage",
    "fulltimeposition": "full_time_position",
    "casestatus": "case_status",
}

def canon_case_status(s: pd.Series) -> pd.Series:
    s = s.astype("string").str.strip().str.lower()
    m = {
        "certified": "Certified",
        "certified-expired": "Certified",
        "denied": "Denied",
        "rejected": "Denied",
        "withdrawn": "Denied",
    }
    return s.map(m)

def canon_yn(s: pd.Series) -> pd.Series:
    s = s.astype("string").str.strip().str.upper().replace({"YES":"Y","NO":"N"})
    return s.where(s.isin(["Y","N"]), np.nan)

def canon_title(s: pd.Series) -> pd.Series:
    return s.astype("string").str.strip().str.title()

def canon_education(s: pd.Series) -> pd.Series:
    s = s.astype("string").str.strip().str.lower()
    s = s.str.replace("’","'", regex=False).str.replace(r"\s+"," ", regex=True)
    def map_one(x):
        # check pd.isna first: missing values arrive as pd.NA, and `pd.NA == "nan"` does not yield a bool
        if pd.isna(x) or x == "nan": return np.nan
        if "high"      in x: return "High School"
        if "bachelor"  in x: return "Bachelor's"
        if "master"    in x: return "Master's"
        if "doctor" in x or "phd" in x: return "Doctorate"
        return "Other"
    return s.map(map_one)

# Internal helper columns (ignored by schema warnings)
INTERNAL_COLS = {
    "_case_label","_edu","_continent","_region","_uow",
    "_job_exp","_job_train","_full_time"
}

# Column-level canonicalization 
def canonicalize_columns(df: pd.DataFrame, name: str = "frame") -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Rename raw columns using _CANON_MAP, report unknowns/missing/collisions."""
    raw = list(df.columns)
    keys = [_norm(c) for c in raw]

    # rename with collision protection
    ren, used = {}, set()
    for raw_col, k in zip(raw, keys):
        new = _CANON_MAP.get(k, raw_col)
        if new in used and new != raw_col:
            new = raw_col  # avoid overwrite
        ren[raw_col] = new
        used.add(new)
    out = df.rename(columns=ren).copy()

    # unknown headers: skip internal helpers; expected names are the values of _CANON_MAP
    expected = set(_CANON_MAP.values())
    unknown = [(r, k) for r, k in zip(raw, keys)
               if r not in INTERNAL_COLS and not str(r).startswith("_")
               and (k not in _CANON_MAP) and (r not in expected)]

    # missing expected canonical columns
    missing = sorted([c for c in expected if c not in out.columns])

    # collisions: two raw headers that normalize to the same key
    seen, collisions = {}, {}
    for r, k in zip(raw, keys):
        if k in seen and seen[k] != r:
            collisions.setdefault(k, set()).update([seen[k], r])
        else:
            seen[k] = r

    # print audit
    if unknown:
        print(f"[{name}] UNKNOWN headers:", [r for r,_ in unknown])
        print(f"[{name}] Suggested _CANON_MAP patches:")
        for r,k in unknown:
            guess = difflib.get_close_matches(k, list(_CANON_MAP.keys()), n=1, cutoff=0.6)
            # suggest the canonical name of the closest known key, if any
            guess_str = _CANON_MAP[guess[0]] if guess else "<canonical-name>"
            print(f'    _CANON_MAP["{k}"] = "{guess_str}"   # raw="{r}"')
    else:
        print(f"[{name}] No unknown headers. ✓")

    print(f"[{name}] MISSING expected canonical columns:", missing if missing else "None ✓")

    if collisions:
        print(f"[{name}] COLLISIONS:")
        for k, raws in collisions.items():
            print(f"    norm='{k}' ← raw {sorted(list(raws))}")
    else:
        print(f"[{name}] No collisions. ✓")

    report = {
        "dataset": name,
        "unknown": [r for r,_ in unknown],
        "unknown_count": len(unknown),
        "missing": missing,
        "missing_count": len(missing),
        "collisions": {k: sorted(list(v)) for k,v in collisions.items()},
        "collision_count": len(collisions),
        "passed": (len(unknown) == 0 and len(missing) == 0 and len(collisions) == 0)
    }
    status = "PASS ✓" if report["passed"] else "WARN ✗"
    print(f"[{name}] COLUMN SUMMARY → unknown={report['unknown_count']} | "
          f"missing={report['missing_count']} | collisions={report['collision_count']} ⇒ {status}")
    return out, report

# Value-level canonicalization 
def apply_canonical_values(df: pd.DataFrame, name: str = "frame") -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Create canonical helper columns, coerce numerics, and print a validation summary."""
    out = df.copy()

    # target labels and numeric target
    out["_case_label"] = canon_case_status(out["case_status"])
    out["y"] = out["_case_label"].map({"Certified":1, "Denied":0}).astype("Int64")

    # Y/N flags
    out["_job_exp"]   = canon_yn(out["has_job_experience"])
    out["_job_train"] = canon_yn(out["requires_job_training"])
    out["_full_time"] = canon_yn(out["full_time_position"])

    # categoricals
    out["_continent"] = canon_title(out["continent"])
    out["_region"]    = canon_title(out["region_of_employment"])
    out["_uow"]       = canon_title(out["unit_of_wage"])
    out["_edu"]       = canon_education(out["education_of_employee"])

    # numerics: coerce to numbers but keep raw values (no wage normalization here)
    out["no_of_employees"] = pd.to_numeric(out["no_of_employees"], errors="coerce")
    out["yr_of_estab"]     = pd.to_numeric(out["yr_of_estab"], errors="coerce")
    out["prevailing_wage"] = pd.to_numeric(out["prevailing_wage"], errors="coerce")

    # Validation report 
    rep: Dict[str, Any] = {"dataset": name}

    # target check
    lbl_counts = out["_case_label"].value_counts(dropna=False).to_dict()
    y_counts   = out["y"].value_counts(dropna=False).to_dict()
    rep["target_label_counts"]  = lbl_counts
    rep["target_numeric_counts"] = y_counts
    rep["target_na"] = int(out["_case_label"].isna().sum())

    # Y/N validity
    def yn_invalid(col: str) -> int:
        # True where value is NOT in {Y,N} or is NA
        mask_valid = out[col].isin(["Y","N"])
        invalid = (~mask_valid) | mask_valid.isna()
        return int(invalid.sum())

    rep["job_exp_invalid"]   = yn_invalid("_job_exp")
    rep["job_train_invalid"] = yn_invalid("_job_train")
    rep["full_time_invalid"] = yn_invalid("_full_time")

    # education distribution; flag the share of 'Other' if present
    edu_counts = out["_edu"].value_counts(dropna=False)
    rep["edu_counts"] = edu_counts.to_dict()
    rep["edu_other_pct"] = float((edu_counts.get("Other", 0) / max(1, edu_counts.sum())) * 100)

    # Print concise audit
    print(f"[{name}] TARGET label counts:", lbl_counts)
    print(f"[{name}] TARGET numeric counts:", y_counts)
    print(f"[{name}] TARGET NaNs: {rep['target_na']}")
    print(f"[{name}] YN invalid → job_exp={rep['job_exp_invalid']} | job_train={rep['job_train_invalid']} | full_time={rep['full_time_invalid']}")
    print(f"[{name}] Education 'Other' share: {rep['edu_other_pct']:.1f}%")

    # pass criteria: no target NaNs and all Y/N flags valid
    rep["passed"] = (rep["target_na"] == 0 and 
                    rep["job_exp_invalid"] == 0 and 
                    rep["job_train_invalid"] == 0 and 
                    rep["full_time_invalid"] == 0)

    status = "PASS ✓" if rep["passed"] else "WARN ✗"
    print(f"[{name}] VALUE SUMMARY ⇒ {status}")
    return out, rep

# One call convenience
def canonicalize_and_validate(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False) -> Tuple[pd.DataFrame, Dict[str, Any], Dict[str, Any]]:
    """
    Column rename + value canonicalization with full validation.
    Returns (DF, column_report, value_report).
    """
    df1, col_rep = canonicalize_columns(df, name=f"{name}-cols")
    DF, val_rep  = apply_canonical_values(df1, name=f"{name}-vals")

    all_pass = col_rep["passed"] and val_rep["passed"]
    final_status = "PASS ✓" if all_pass else "WARN ✗"
    print(f"[{name}] CANONICALIZATION SUMMARY → columns={col_rep['passed']} & values={val_rep['passed']} ⇒ {final_status}\n")

    if strict and not all_pass:
        raise AssertionError(f"[{name}] Canonicalization failed: col={col_rep}, val={val_rep}")

    return DF, col_rep, val_rep

# Build canonical frame with verification
DF, col_report, val_report = canonicalize_and_validate(df, name="EasyVisa", strict=True)

[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected canonical columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] COLUMN SUMMARY → unknown=0 | missing=0 | collisions=0 ⇒ PASS ✓
[EasyVisa-vals] TARGET label counts: {'Certified': 17018, 'Denied': 8462}
[EasyVisa-vals] TARGET numeric counts: {np.int64(1): 17018, np.int64(0): 8462}
[EasyVisa-vals] TARGET NaNs: 0
[EasyVisa-vals] YN invalid → job_exp=0 | job_train=0 | full_time=0
[EasyVisa-vals] Education 'Other' share: 0.0%
[EasyVisa-vals] VALUE SUMMARY ⇒ PASS ✓
[EasyVisa] CANONICALIZATION SUMMARY → columns=True & values=True ⇒ PASS ✓
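
The column audit passes trivially here because the raw headers are already clean. To illustrate what the `_norm` plus `_CANON_MAP` lookup does with messier input, the sketch below applies the same normalization rule to a few hypothetical headers. None of these headers appear in EasyVisa.csv, and `canon_map` is a two-entry subset of the alias map chosen for the example.

```python
import re

def norm(header) -> str:
    """Lowercase and drop everything except a-z and 0-9 (same rule as `_norm`)."""
    return re.sub(r"[^a-z0-9]", "", str(header).lower())

# Subset of the alias map, for illustration
canon_map = {"caseid": "case_id", "prevailingwage": "prevailing_wage"}

# Hypothetical messy headers
for raw in ["Case ID", "case-id", " Prevailing_Wage ", "employer_name"]:
    key = norm(raw)
    # Unknown keys fall back to the raw header, as in canonicalize_columns
    print(f"{raw!r:22} -> key={key!r:18} -> {canon_map.get(key, raw)!r}")
```

Note that `Case ID` and `case-id` normalize to the same key, which is exactly the situation the collision check is meant to report, while `employer_name` has no entry and would be listed as an unknown header.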



In [92]:
# C) RENAME AND SCHEMA GUARD
from typing import Dict, Any
import difflib

# helper/internal columns created during canonicalization
INTERNAL_COLS = {
    "_case_label","_edu","_continent","_region","_uow",
    "_job_exp","_job_train","_full_time"
}

def rename_canonical(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns using _CANON_MAP while avoiding alias collisions."""
    ren, used = {}, set()
    for c in df.columns:
        k = _norm(c)
        new = _CANON_MAP.get(k, c)
        # avoid overwriting when alias and real both exist
        if new in used and new != c:
            new = c
        ren[c] = new
        used.add(new)
    return df.rename(columns=ren).copy()

def schema_guard(df: pd.DataFrame, name: str = "frame", strict: bool = False) -> Dict[str, Any]:
    """
    Validate schema quality:
      - unknown headers (not in alias map or expected), ignoring internal helper cols
      - missing expected columns (after rename)
      - collisions (two raw headers normalize to same key)

    Returns a dict report; raises AssertionError if strict=True and validation fails.
    """
    raw  = list(df.columns)
    keys = [_norm(c) for c in raw]

    # Unknown headers: skip internals and underscore-prefixed helpers
    unknown = [
        (r, k) for r, k in zip(raw, keys)
        if (r not in INTERNAL_COLS)
        and (not str(r).startswith("_"))
        and (k not in _CANON_MAP)
        and (r not in EXPECTED_COLS)
    ]
    if unknown:
        print(f"[{name}] UNKNOWN headers:", [r for r,_ in unknown])
        print(f"[{name}] Suggested _CANON_MAP patches:")
        for r, k in unknown:
            guess = difflib.get_close_matches(k, list(_CANON_MAP.keys()), n=1, cutoff=0.6)
            # suggest the canonical name of the closest known key, if any
            guess_str = _CANON_MAP[guess[0]] if guess else "<canonical-name>"
            print(f'    _CANON_MAP["{k}"] = "{guess_str}"   # raw="{r}"')
    else:
        print(f"[{name}] No unknown headers. ✓")

    # Missing expected columns after rename
    missing = sorted([c for c in EXPECTED_COLS if c not in df.columns])
    print(f"[{name}] MISSING expected columns:", missing if missing else "None ✓")

    # Collisions: two raw headers → same normalized key
    seen, collisions = {}, {}
    for r, k in zip(raw, keys):
        if k in seen and seen[k] != r:
            collisions.setdefault(k, set()).update([seen[k], r])
        else:
            seen[k] = r
    if collisions:
        print(f"[{name}] COLLISIONS:")
        for k, raws in collisions.items():
            print(f"   norm='{k}' ← raw {sorted(list(raws))}")
    else:
        print(f"[{name}] No collisions. ✓")

    # Summary and machine readable report 
    report = {
        "dataset": name,
        "unknown": [r for r,_ in unknown],
        "unknown_count": len(unknown),
        "missing": missing,
        "missing_count": len(missing),
        "collisions": {k: sorted(list(v)) for k, v in collisions.items()},
        "collision_count": len(collisions),
        "passed": (len(unknown) == 0 and len(missing) == 0 and len(collisions) == 0),
    }
    status = "PASS ✓" if report["passed"] else "FAIL ✗"
    print(f"[{name}] SUMMARY → unknown={report['unknown_count']} | "
          f"missing={report['missing_count']} | collisions={report['collision_count']}  ⇒  {status}")

    if strict and not report["passed"]:
        raise AssertionError(f"[{name}] Schema validation failed: {report}")

    return report

# After loading raw df
df_renamed = rename_canonical(df)

# Print audit and get structured report
rep = schema_guard(df_renamed, name="EasyVisa-cols", strict=True)

[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] SUMMARY → unknown=0 | missing=0 | collisions=0  ⇒  PASS ✓
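
Since no header is unknown in this run, the fuzzy-match branch of `schema_guard` never executes. The sketch below shows, under assumed inputs, how `difflib.get_close_matches` can propose a canonical name for a near-miss header. The helper `suggest_canonical`, the three-entry `canon_map`, and both test headers are hypothetical.

```python
import difflib

# Normalized keys and their canonical names (subset of the alias map, for illustration)
canon_map = {
    "prevailingwage": "prevailing_wage",
    "unitofwage": "unit_of_wage",
    "yrofestab": "yr_of_estab",
}

def suggest_canonical(norm_key: str, cutoff: float = 0.6):
    """Return the canonical name of the closest known key, or None if nothing is close."""
    match = difflib.get_close_matches(norm_key, list(canon_map), n=1, cutoff=cutoff)
    return canon_map[match[0]] if match else None

print(suggest_canonical("prevailingwages"))  # near-miss of a known header
print(suggest_canonical("employername"))     # unrelated header
```

The first call resolves to `prevailing_wage`, while the second returns `None` because no known key reaches the 0.6 similarity cutoff. A suggestion like this is only a hint for a manual patch to the alias map, not something to apply automatically.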


In [93]:
# D) BUILD CANONICAL DF
from typing import Dict, Any, Tuple

def build_canonical_df(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False
                      ) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """
    Rename → canonicalize values → coerce numerics → validate.
    Returns (DF, report). If strict=True, raises on failed validation.
    """
    # 1) Column rename and schema audit
    out = rename_canonical(df).copy()
    col_rep = schema_guard(out, f"{name}-cols", strict=False)

    # 2) Target string and numeric
    out[TARGET_LBL] = canon_case_status(out[TARGET_RAW])
    out[TARGET_BIN] = out[TARGET_LBL].map({"Certified": 1, "Denied": 0}).astype("Int64")

    # 3) Binary flags
    out["_job_exp"]   = canon_yn(out["has_job_experience"])
    out["_job_train"] = canon_yn(out["requires_job_training"])
    out["_full_time"] = canon_yn(out["full_time_position"])

    # 4) Core categoricals
    out["_continent"] = canon_title(out["continent"])
    out["_region"]    = canon_title(out["region_of_employment"])
    out["_uow"]       = canon_title(out["unit_of_wage"])
    out["_edu"]       = canon_education(out["education_of_employee"])

    # 5) Numerics: coerce safely (wage normalization happens later)
    out["no_of_employees"] = pd.to_numeric(out["no_of_employees"], errors="coerce")
    out["yr_of_estab"]     = pd.to_numeric(out["yr_of_estab"], errors="coerce")
    out["prevailing_wage"] = pd.to_numeric(out["prevailing_wage"], errors="coerce")

    # 6) Value level validation
    rep: Dict[str, Any] = {"dataset": name}

    # Target checks
    rep["target_label_counts"]  = out[TARGET_LBL].value_counts(dropna=False).to_dict()
    rep["target_numeric_counts"] = out[TARGET_BIN].value_counts(dropna=False).to_dict()
    rep["target_na"] = int(out[TARGET_LBL].isna().sum())
    rep["y_na"]      = int(out[TARGET_BIN].isna().sum())

    # Y/N validity
    def yn_invalid(col: str) -> int:
        m = out[col].isin(["Y", "N"])
        return int((~m | m.isna()).sum())

    rep["job_exp_invalid"]   = yn_invalid("_job_exp")
    rep["job_train_invalid"] = yn_invalid("_job_train")
    rep["full_time_invalid"] = yn_invalid("_full_time")

    # Numeric NA counts (sanity check only; not part of the pass criteria)
    rep["no_of_employees_na"] = int(out["no_of_employees"].isna().sum())
    rep["yr_of_estab_na"]     = int(out["yr_of_estab"].isna().sum())
    rep["prevailing_wage_na"] = int(out["prevailing_wage"].isna().sum())

    # Pass criteria for values (tune as needed)
    rep["values_passed"] = (
        rep["target_na"] == 0 and rep["y_na"] == 0
        and rep["job_exp_invalid"] == 0
        and rep["job_train_invalid"] == 0
        and rep["full_time_invalid"] == 0
    )

    # 7) Final summary
    overall_pass = col_rep["passed"] and rep["values_passed"]
    rep["columns_passed"] = col_rep["passed"]
    rep["passed"] = overall_pass

    # Print concise summary
    print(f"[{name}-vals] TARGET NaNs: labels={rep['target_na']} | y={rep['y_na']}")
    print(f"[{name}-vals] Y/N invalid → job_exp={rep['job_exp_invalid']} | "
          f"job_train={rep['job_train_invalid']} | full_time={rep['full_time_invalid']}")
    print(f"[{name}-vals] Numeric NaNs → employees={rep['no_of_employees_na']} | "
          f"yr_of_estab={rep['yr_of_estab_na']} | wage={rep['prevailing_wage_na']}")

    status_cols = "PASS ✓" if col_rep["passed"]     else "FAIL ✗"
    status_vals = "PASS ✓" if rep["values_passed"]  else "FAIL ✗"
    status_all  = "PASS ✓" if overall_pass          else "FAIL ✗"
    print(f"[{name}] CANON SUMMARY → columns={status_cols} | values={status_vals}  ⇒  {status_all}\n")

    if strict and not overall_pass:
        raise AssertionError(f"[{name}] Canonicalization failed: columns={col_rep}, values={rep}")

    return out, rep

In [94]:
# E) RUN & REPORT 
# 1) Core verifications (DF is the canonical frame from step B, so its helper columns are listed as extras)
verify_rng()
verify_target_exists(DF)
raw_schema = verify_expected_columns(DF)

# 2) Build canonical dataframe and value audit
DF, canon_report = build_canonical_df(df, name="EasyVisa", strict=False)

# 3) Target class balance labels and numeric
lbl_bal = report_class_balance(DF[TARGET_LBL], "[INFO] target label balance")
num_bal = report_class_balance(DF[TARGET_BIN], "[INFO] target numeric balance")

# 4) Quick canonical feature sanity uniques
key_cats = ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"]
uniques = {c: int(DF[c].dropna().nunique()) for c in key_cats}
print("\n[INFO] canonical categorical uniques:", uniques)

# 5) Compact PASS / FAIL footer
overall_pass = bool(raw_schema["passed"] and canon_report["passed"])
status = "PASS ✓" if overall_pass else "WARN ✗"
print("\n================ SUMMARY ================")
print(f"rows={len(DF):,}  cols={DF.shape[1]}  |  "
      f"raw_expected_ok={raw_schema['passed']}  "
      f"columns_passed={canon_report['columns_passed']}  "
      f"values_passed={canon_report['values_passed']}  ⇒  {status}")
print("========================================\n")

# 6) Machine readable bundle 
verification_bundle = {
    "raw_schema": raw_schema,
    "canonicalization": canon_report,
    "label_balance": lbl_bal,
    "numeric_balance": num_bal,
    "cat_uniques": uniques,
    "passed": overall_pass,
}

assert verification_bundle["passed"], f"Verification failed: {verification_bundle}"

[OK] RNG reproducibility
[OK] Target column present
[OK] All expected columns present
[INFO] Extra columns (not required): ['_case_label', '_continent', '_edu', '_full_time', '_job_exp', '_job_train', '_region', '_uow', 'y']
[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] SUMMARY → unknown=0 | missing=0 | collisions=0  ⇒  PASS ✓
[EasyVisa-vals] TARGET NaNs: labels=0 | y=0
[EasyVisa-vals] Y/N invalid → job_exp=0 | job_train=0 | full_time=0
[EasyVisa-vals] Numeric NaNs → employees=0 | yr_of_estab=0 | wage=0
[EasyVisa] CANON SUMMARY → columns=PASS ✓ | values=PASS ✓  ⇒  PASS ✓

[INFO] target label balance →
_case_label
Certified    17018
Denied        8462
  Imbalance ratio ≈ 2.01:1
[INFO] target numeric balance →
y
1    17018
0     8462
  Imbalance ratio ≈ 2.01:1

[INFO] canonical categorical uniques: {'_edu': 4, '_continent': 6, '_region': 5, '_uow': 4, '_job_exp': 2, '_job_train': 2, '_full_time': 2}

#### Let's check the statistical summary of the data

In [95]:
# 1) Total missing values per column
print(DF.isna().sum().sort_values(ascending=False))

# 2) Split by type (bare expressions in the middle of a cell are not displayed, so print them)
print(DF.select_dtypes('number').isna().sum().sort_values(ascending=False))
print(DF.select_dtypes(exclude='number').isna().sum().sort_values(ascending=False))

# Hard guardrail
assert DF.isna().sum().sum() == 0, "Unexpected NaNs remain after cleaning."

In [96]:
# Numeric summary
num_cols = DF.select_dtypes('number').columns
num_desc = DF[num_cols].describe().T
num_desc['nulls'] = DF[num_cols].isna().sum().values
num_desc['null_pct'] = (DF[num_cols].isna().mean().values * 100).round(2)

# Categorical summary
cat_cols = DF.select_dtypes(exclude='number').columns
cat_desc = DF[cat_cols].describe().T  
cat_desc['nulls'] = DF[cat_cols].isna().sum().values
cat_desc['null_pct'] = (DF[cat_cols].isna().mean().values * 100).round(2)

num_desc, cat_desc 

(                   count          mean           std     min       25%  \
 no_of_employees  25480.0    5667.04321  22877.928848   -26.0    1022.0   
 yr_of_estab      25480.0   1979.409929     42.366929  1800.0    1976.0   
 prevailing_wage  25480.0  74455.814592  52815.942327  2.1367  34015.48   
 y                25480.0      0.667896      0.470977     0.0       0.0   
 
                       50%          75%        max  nulls  null_pct  
 no_of_employees    2109.0       3504.0   602069.0      0       0.0  
 yr_of_estab        1997.0       2005.0     2016.0      0       0.0  
 prevailing_wage  70308.21  107735.5125  319210.27      0       0.0  
 y                     1.0          1.0        1.0      0       0.0  ,
                        count unique         top   freq  nulls  null_pct
 case_id                25480  25480    EZYV8640      1      0       0.0
 continent              25480      6        Asia  16861      0       0.0
 education_of_employee  25480      4  Bachelor's  102

In [97]:
#Print the statistics
DF.describe(include='all').T

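| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| case_id | 25480.0 | 25480.0 | EZYV8640 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| continent | 25480.0 | 6.0 | Asia | 16861.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| education_of_employee | 25480.0 | 4.0 | Bachelor's | 10234.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| has_job_experience | 25480.0 | 2.0 | Y | 14802.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| requires_job_training | 25480.0 | 2.0 | N | 22525.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| no_of_employees | 25480.0 | NaN | NaN | NaN | 5667.04321 | 22877.928848 | -26.0 | 1022.0 | 2109.0 | 3504.0 | 602069.0 |
| yr_of_estab | 25480.0 | NaN | NaN | NaN | 1979.409929 | 42.366929 | 1800.0 | 1976.0 | 1997.0 | 2005.0 | 2016.0 |
| region_of_employment | 25480.0 | 5.0 | Northeast | 7195.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| prevailing_wage | 25480.0 | NaN | NaN | NaN | 74455.814592 | 52815.942327 | 2.1367 | 34015.48 | 70308.21 | 107735.5125 | 319210.27 |
| unit_of_wage | 25480.0 | 4.0 | Year | 22962.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |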


#### Fixing the negative values in the number of employees column
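
The statistical summary reports a minimum of -26 for `no_of_employees`, which cannot be a real headcount. The following is a minimal sketch of one possible fix, assuming the negative entries are sign errors so that the absolute value is the intended count. The toy frame `demo` is hypothetical and only stands in for `DF`.

```python
import pandas as pd

# Hypothetical toy frame standing in for DF; only the column name comes from the dataset.
demo = pd.DataFrame({"no_of_employees": [1022, -26, 2109, -11, 3504]})

# Count the offending rows before changing anything.
n_negative = int((demo["no_of_employees"] < 0).sum())
print(f"negative employee counts: {n_negative}")

# Assumption: negatives are sign errors, so keep the magnitude.
demo["no_of_employees"] = demo["no_of_employees"].abs()

# Guardrail in the same spirit as the NaN assertion used earlier.
assert (demo["no_of_employees"] >= 0).all()
```

If the negatives turn out to be placeholders for unknown values instead, dropping or imputing those rows may be the better choice.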

#### Let's check the count of each unique category in each of the categorical variables
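
A short sketch of how the per-category counts could be listed, assuming the categorical columns are exactly the non-numeric ones. The toy frame `demo` is hypothetical; in the notebook the same loop would run over `DF`.

```python
import pandas as pd

# Hypothetical toy frame standing in for DF.
demo = pd.DataFrame(
    {
        "continent": ["Asia", "Europe", "Asia", "Africa"],
        "has_job_experience": ["Y", "N", "Y", "Y"],
        "no_of_employees": [10, 20, 30, 40],
    }
)

# Treat every non-numeric column as categorical.
cat_cols = demo.select_dtypes(exclude="number").columns

counts = {}
for col in cat_cols:
    # dropna=False keeps missing values visible as their own level.
    counts[col] = demo[col].value_counts(dropna=False)
    print(counts[col])
    print("-" * 50)
```

An identifier column such as `case_id` has one level per row, so it is usually worth excluding from this loop.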

### Univariate Analysis

In [98]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    # Pass `bins` only when it is provided so seaborn's default binning applies otherwise
    hist_kwargs = {"bins": bins} if bins else {}
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, **hist_kwargs
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [99]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

#### Observations on education of employee

#### Observations on region of employment

#### Observations on job experience

#### Observations on case status

### Bivariate Analysis

**Creating functions that will help us with further analysis.**

In [100]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

In [101]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # A single legend call: a second plt.legend() would simply replace the first
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

#### Does higher education increase the chances of visa certification for well-paid jobs abroad?

#### How does visa status vary across different continents?

#### Does having prior work experience influence the chances of visa certification for career opportunities abroad?

#### Is the prevailing wage consistent across all regions of the US?

#### Does visa status vary with changes in the prevailing wage set to protect both local talent and foreign workers?

#### Does the unit of prevailing wage (Hourly, Weekly, etc.) have any impact on the likelihood of visa application certification?

## Data Pre-processing

### Outlier Check
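
One common way to quantify outliers is the 1.5 × IQR rule. The sketch below is an illustration under that assumption, not necessarily the check used in this notebook. The helper name `iqr_outlier_share` and the toy frame `demo` are hypothetical.

```python
import pandas as pd


def iqr_outlier_share(frame, columns):
    """Return the percentage of values outside the 1.5*IQR fences for each column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Boolean frame: True where a value falls outside its column's fences.
    outside = (frame[columns] < lower) | (frame[columns] > upper)
    return (outside.mean() * 100).round(2)


# Hypothetical toy data with one extreme wage.
demo = pd.DataFrame(
    {"prevailing_wage": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 500.0]}
)
share = iqr_outlier_share(demo, ["prevailing_wage"])
print(share)
```

Flagged values are not necessarily errors. Genuine extremes, such as very large employers, are often kept, especially when tree-based models are used.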

### Data Preparation for modeling
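
A minimal sketch of a typical preparation step, assuming one-hot encoding of the categorical predictors followed by a stratified split on the numeric target `y`. The toy frame, the 75/25 split, and the `random_state` are illustrative assumptions, not the notebook's actual settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for DF.
demo = pd.DataFrame(
    {
        "continent": ["Asia", "Europe"] * 10,
        "prevailing_wage": [float(50_000 + 1_000 * i) for i in range(20)],
        "y": [1, 1, 0, 1] * 5,
    }
)

# One-hot encode the categorical predictors; drop_first avoids a redundant column.
X = pd.get_dummies(demo.drop(columns="y"), drop_first=True)
y = demo["y"]

# Stratify so both splits keep roughly the same class ratio as the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
print(X_train.shape, X_test.shape)
```

Stratification matters here because the target is imbalanced at roughly 2:1.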

## Model Building

### Model Evaluation Criterion

- Choose the primary metric to evaluate the model on
- Elaborate on the rationale behind choosing the metric

First, let's create functions that compute the evaluation metrics and plot the confusion matrix, so that we don't have to repeat the same code for each model.
* The `model_performance_classification_sklearn` function will be used to check the performance of each model.
* The `confusion_matrix_sklearn` function will be used to plot the confusion matrix.

In [102]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn


def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf

In [103]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

#### Defining scorer to be used for cross-validation and hyperparameter tuning
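
A scorer wraps a metric so that cross-validation and search utilities can call it. The sketch below assumes F1 as the primary metric, which is an illustrative choice rather than the metric this notebook settles on. The synthetic data and the decision tree are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: F1 is the metric of interest; swap in recall_score or precision_score as needed.
scorer = make_scorer(f1_score)

# Synthetic, mildly imbalanced data standing in for the training split.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.33, 0.67], random_state=1
)

# Stratified folds preserve the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=1), X_demo, y_demo, scoring=scorer, cv=cv
)
print(scores.mean())
```

The same `scorer` object can be passed as `scoring=` to `GridSearchCV` or `RandomizedSearchCV`.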

**We are now done with pre-processing and evaluation criterion, so let's start building the model.**

### Model building with Original data

### Model Building with Oversampled data

### Model Building with Undersampled data
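
Random undersampling can be sketched with pandas alone: every class is sampled down to the size of the smallest one. This is a stand-in for illustration and not necessarily the sampler used in this notebook. The toy frame `demo` is hypothetical.

```python
import pandas as pd

# Hypothetical training frame with an 8:4 class imbalance.
demo = pd.DataFrame({"x": range(12), "y": [1] * 8 + [0] * 4})

# Size of the smallest class.
n_minority = int(demo["y"].value_counts().min())

# Sample every class down to the minority size.
balanced = demo.groupby("y", group_keys=False).sample(n=n_minority, random_state=1)
print(balanced["y"].value_counts())
```

Resampling should be applied to the training split only, so that validation and test data keep the original class distribution.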

## Hyperparameter Tuning

**Best practices for hyperparameter tuning in AdaBoost:**

`n_estimators`:

- Start with a specific number (50 is used in general) and increase in steps: 50, 75, 85, 100

- Use fewer estimators (e.g., 50 to 100) if using complex base learners (like deeper decision trees)

- Use more estimators (e.g., 100 to 150) when learning rate is low (e.g., 0.1 or lower)

- Avoid very high values unless performance keeps improving on validation

`learning_rate`:

- Common values to try: 1.0, 0.5, 0.1, 0.01

- Use 1.0 for faster training, suitable for fewer estimators

- Use 0.1 or 0.01 when using more estimators to improve generalization

- Avoid very small values (< 0.01) unless you plan to use many estimators (e.g., >500) and have sufficient data
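
The AdaBoost ranges above can be turned into a randomized search as follows. This is a sketch on synthetic data. The `"f1"` scoring, `n_iter=5`, and `cv=3` are illustrative assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Candidate values taken from the guidance above.
param_grid = {
    "n_estimators": [50, 75, 85, 100],
    "learning_rate": [1.0, 0.5, 0.1, 0.01],
}

search = RandomizedSearchCV(
    estimator=AdaBoostClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=5,  # sample 5 of the 16 combinations
    scoring="f1",
    cv=3,
    random_state=1,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```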


---

**Best practices for hyperparameter tuning in Random Forest:**


`n_estimators`:

* Start with a specific number (50 is used in general) and increase in steps: 50, 75, 100, 125
* Higher values generally improve performance but increase training time
* Use 100-150 for large datasets or when variance is high


`min_samples_leaf`:

* Try values like: 1, 2, 4, 5, 10
* Higher values reduce model complexity and help prevent overfitting
* Use 1–2 for low-bias models, higher (like 5 or 10) for more regularized models
* Works well in noisy datasets to smooth predictions


`max_features`:

* Try values: `"sqrt"` (default for classification), `"log2"`, `None`, or float values (e.g., `0.3`, `0.5`)
* `"sqrt"` balances between diversity and performance for classification tasks
* Lower values (e.g., `0.3`) increase tree diversity, reducing overfitting
* Higher values (closer to `1.0`) may capture more interactions but risk overfitting


`max_samples` (for bootstrap sampling):

* Try float values between `0.5` and `1.0`, or fixed integers
* Use `0.6–0.9` to introduce randomness and reduce overfitting
* Smaller values increase diversity between trees, improving generalization
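
A sketch of the Random Forest ranges above as a randomized search on synthetic data. The scoring metric, `n_iter`, and `cv` are illustrative assumptions. Note that `max_samples` only takes effect with `bootstrap=True`, which is the scikit-learn default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Candidate values taken from the guidance above.
param_grid = {
    "n_estimators": [50, 75, 100, 125],
    "min_samples_leaf": [1, 2, 4, 5, 10],
    "max_features": ["sqrt", "log2", 0.3, 0.5],
    "max_samples": [0.6, 0.7, 0.8, 0.9],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=5,  # a small sample of the grid keeps the sketch fast
    scoring="f1",
    cv=3,
    random_state=1,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```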

---

**Best practices for hyperparameter tuning in Gradient Boosting:**

`n_estimators`:

* Start with 100 (default) and increase: 100, 200, 300, 500
* Typically, higher values lead to better performance, but they also increase training time
* Use 200–500 for larger datasets or complex problems
* Monitor validation performance to avoid overfitting, as too many estimators can degrade generalization


`learning_rate`:

* Common values to try: 0.1, 0.05, 0.01, 0.005
* Use lower values (e.g., 0.01 or 0.005) if you are using many estimators (e.g., > 200)
* Higher learning rates (e.g., 0.1) can be used with fewer estimators for faster convergence
* Always balance the learning rate with `n_estimators` to prevent overfitting or underfitting


`subsample`:

* Common values: 0.7, 0.8, 0.9, 1.0
* Use a value between `0.7` and `0.9` for improved generalization by introducing randomness
* `1.0` uses the full dataset for each boosting round, potentially leading to overfitting
* Reducing `subsample` can help reduce overfitting, especially in smaller datasets


`max_features`:

* Common values: `"sqrt"`, `"log2"`, or float (e.g., `0.3`, `0.5`)
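* `"sqrt"` works well for classification tasks; note that scikit-learn's `GradientBoostingClassifier` defaults to `None` (all features), so it must be set explicitly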
* Lower values (e.g., `0.3`) help reduce overfitting by limiting the number of features considered at each split
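
The Gradient Boosting ranges above can be searched in the same way. This sketch uses synthetic data. The scoring metric, `n_iter=4`, and `cv=3` are assumptions made to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Candidate values taken from the guidance above.
param_grid = {
    "n_estimators": [100, 200, 300, 500],
    "learning_rate": [0.1, 0.05, 0.01, 0.005],
    "subsample": [0.7, 0.8, 0.9, 1.0],
    "max_features": ["sqrt", "log2", 0.3, 0.5],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=4,  # sample 4 of the 256 combinations
    scoring="f1",
    cv=3,
    random_state=1,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```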

---

**Best practices for hyperparameter tuning in XGBoost:**

`n_estimators`:

* Start with 50 and increase in steps: 50, 75, 100, 125
* Use more estimators (e.g., 150-250) when using lower learning rates
* Monitor validation performance
* High values improve learning but increase training time

`subsample`:

* Common values: 0.5, 0.7, 0.8, 1.0
* Use `0.7–0.9` to introduce randomness and reduce overfitting
* `1.0` uses the full dataset in each boosting round; may overfit on small datasets
* Values < 0.5 are rarely useful unless dataset is very large

`gamma`:

* Try values: 0 (default), 1, 3, 5, 8
* Controls minimum loss reduction needed for a split
* Higher values make the algorithm more conservative (i.e., fewer splits)
* Use values > 0 to regularize and reduce overfitting, especially on noisy data


`colsample_bytree`:

* Try values: 0.3, 0.5, 0.7, 1.0
* Fraction of features sampled per tree
* Lower values (e.g., 0.3 or 0.5) increase randomness and improve generalization
* Use `1.0` when you want all features considered for every tree


`colsample_bylevel`:

* Try values: 0.3, 0.5, 0.7, 1.0
* Fraction of features sampled at each tree level (i.e., per split depth)
* Lower values help in regularization and reducing overfitting
* Often used in combination with `colsample_bytree` for fine control over feature sampling
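
The XGBoost ranges above can be collected in a parameter dictionary. The sketch below only builds the dictionary and counts its combinations, so it runs without `xgboost` installed. The name `xgb_param_grid` is hypothetical.

```python
from math import prod

# Candidate values taken from the guidance above.
xgb_param_grid = {
    "n_estimators": [50, 75, 100, 125],
    "subsample": [0.5, 0.7, 0.8, 1.0],
    "gamma": [0, 1, 3, 5, 8],
    "colsample_bytree": [0.3, 0.5, 0.7, 1.0],
    "colsample_bylevel": [0.3, 0.5, 0.7, 1.0],
}

# Size of the full grid: the product of the number of options per parameter.
n_combinations = prod(len(values) for values in xgb_param_grid.values())
print(n_combinations)
```

With this many combinations an exhaustive grid search is expensive, so the dictionary would typically be passed as `param_distributions` to `RandomizedSearchCV` wrapping an `XGBClassifier`, with a modest `n_iter`.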

---

## Model Performance Summary and Final Model Selection

## Actionable Insights and Recommendations
