### General Description
#### 1. Task Statement
**Company:** FutureForward Careers

**Issue:** FutureForward Careers, a leading tech recruitment agency, uses data science to identify promising candidates who are likely seeking new job opportunities. Their dataset of candidates, however, contains a significant number of missing values, particularly in the `training_hours` column. This incomplete data hinders their ability to build accurate predictive models, making it difficult to efficiently target their recruitment efforts.

**ML/DS Solution:** To address this, we will use the K-Nearest Neighbors (KNN) Imputer, a multivariate imputation technique. Unlike simpler methods such as filling with the mean or median, the KNN Imputer uses the other features it is given to estimate each missing value from similar candidates ('neighbors'), which can yield a more context-aware imputation. The result is a complete dataset suitable for training a classification model.

**Feasibility:** Manually contacting thousands of candidates to fill in missing data is impractical, slow, and costly. Simple statistical imputation might introduce bias and fail to capture the complex relationships between candidate attributes. KNN Imputation offers a sophisticated, automated alternative.

**Task:** FutureForward Careers has tasked you with implementing a data processing pipeline that uses `KNNImputer` to handle missing values in the `training_hours` feature. Afterwards, you will build a simple classification model to demonstrate the effectiveness of this imputation method.

**Data:** The company provides a dataset of job applicants containing various professional attributes and a `target` variable, where `1` indicates the candidate is looking for a job change.

**Definition of Done:** The missing data is imputed and a logistic regression model is trained on the result. The model's accuracy on a held-out test set serves as a rough indicator of how useful the imputation is.
#### 2. Rewards
- Understanding and implementing advanced imputation techniques (KNNImputer).
- Practical experience with data preprocessing in Scikit-learn.
- Building and evaluating a complete machine learning pipeline from data loading to prediction.
- Deeper insight into how missing data impacts model performance.
#### 3. Difficulty Level
Easy
#### 4. Task Type
Data Preprocessing, Imputation, Classification
#### 5. Tools
Pandas, NumPy, Scikit-learn

### Theoretical Background: KNN Imputer
The KNN Imputer is a multivariate imputation method that fills in a missing value using the values of the `k` nearest neighbors of the affected row. It rests on the assumption that rows that are close in their observed features tend to have similar values in the feature that is missing.

**How it works:**
1.  **Distance Calculation:** For each row with a missing value, the algorithm computes a distance (by default a NaN-aware Euclidean distance) to the rows seen during fitting. Only the columns where both rows have values contribute, and the result is scaled up to compensate for the skipped columns.
2.  **Identifying Neighbors:** It identifies the `k` rows (neighbors) that are closest to the row with the missing value.
3.  **Imputation:** The missing value is then filled by aggregating the values from that same column in the `k` nearest neighbors (e.g., taking the mean for numerical data).
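
The three steps above can be traced on a tiny array. The sketch below is an illustration and not part of the original notebook: `X_demo` is made-up data, and the code assumes scikit-learn's `nan_euclidean_distances`, the metric `KNNImputer` uses by default.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Made-up toy data (an assumption for illustration): 4 rows, 3 numeric features.
X_demo = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Step 1: pairwise distances that skip coordinates missing in either row.
distances = nan_euclidean_distances(X_demo)

# Steps 2 and 3: for each missing cell, average that column over the
# 2 nearest rows that have a value there.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_demo)

print(np.round(distances[0], 2))  # distances from row 0 to every row
print(X_filled)
```

Row 0 is closest to rows 1 and 2, so its missing third feature becomes 4.0, the mean of 3.0 and 5.0. Row 2 is closest to rows 1 and 3, so its missing first feature becomes 5.5, the mean of 3.0 and 8.0.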

In [ ]:
import pandas as pd
import numpy as np
import sklearn.model_selection
import sklearn.impute
import sklearn.linear_model
import sklearn.metrics
from typing import Tuple

```json
{
  "issue": "Before any preprocessing, we need to load the dataset and understand the extent of the missing data problem.",
  "action": "Define two functions: one to load the CSV data into a pandas DataFrame, and another to calculate and display the percentage of missing values for each column.",
  "state": "The dataset is loaded into memory, and we have a clear, sorted list of columns and their corresponding percentage of missing data."
}
```

In [ ]:
def load_data(path: str) -> pd.DataFrame:
    """Loads data from a CSV file."""
    return pd.read_csv(path)

def inspect_missing_values(df: pd.DataFrame) -> pd.Series:
    """Calculates the percentage of missing values for each column."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# NOTE: The original notebook had a hardcoded path. Update this to your local environment.
DATA_PATH = '/kaggle/input/data-science-jobs/data_science_job.csv'

df = load_data(DATA_PATH)
missing_percentages = inspect_missing_values(df)

print("Dataset Head:")
print(df.head())
print("\nMissing Value Percentages:")
print(missing_percentages)

```json
{
  "issue": "To test the imputer's effectiveness, we need to prepare the data for a machine learning model. This involves selecting a single feature and a target, then splitting them into training and testing sets.",
  "action": "A function is created to select the specified feature and target columns, split them into training and test sets, and return the features as 2-D arrays compatible with Scikit-learn estimators.",
  "state": "The data is cleanly partitioned into `X_train`, `X_test`, `y_train`, and `y_test` sets, ready for the imputation and modeling stages."
}
```

In [ ]:
def prepare_and_split_data(df: pd.DataFrame, feature_col: str, target_col: str, test_size: float = 0.2, random_state: int = 2) -> Tuple[np.ndarray, np.ndarray, pd.Series, pd.Series]:
    """Selects the feature and target, splits the data, and returns the features as 2-D arrays."""
    X = df[[feature_col]]
    y = df[target_col]
    
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    
    return X_train.values, X_test.values, y_train, y_test

FEATURE = 'training_hours'
TARGET = 'target'

X_train, X_test, y_train, y_test = prepare_and_split_data(df, FEATURE, TARGET)

```json
{
  "issue": "The training and testing feature sets contain missing values which will cause errors in most machine learning algorithms.",
  "action": "Define a function to apply the `KNNImputer`. It initializes the imputer, fits it on the training data to learn the data's structure, and then transforms both the training and test sets to fill the missing values.",
  "state": "The missing values in `X_train` and `X_test` are imputed, resulting in complete, dense numpy arrays ready for modeling."
}
```

In [ ]:
def apply_knn_imputation(X_train: np.ndarray, X_test: np.ndarray, n_neighbors: int = 5) -> Tuple[np.ndarray, np.ndarray]:
    """Applies KNN imputation to train and test sets."""
    imputer = sklearn.impute.KNNImputer(n_neighbors=n_neighbors)
    
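    # Fit on the training split only so no information from the test split leaks in.
    X_train_transformed = imputer.fit_transform(X_train)
    # Reuse the already fitted imputer on the test split.
    X_test_transformed = imputer.transform(X_test)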
    
    return X_train_transformed, X_test_transformed

X_train_imputed, X_test_imputed = apply_knn_imputation(X_train, X_test)

print(f"Missing values in imputed training data: {np.isnan(X_train_imputed).sum()}")
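
One caveat is worth checking before interpreting the results. When the imputer receives a single column, as it does here with `training_hours`, a row whose only feature is missing shares no observed coordinates with any other row, so no distance is defined. The scikit-learn documentation states that the training-set average of the feature is used in that case. The sketch below uses made-up numbers rather than the company dataset, and its second column is hypothetical; it suggests that single-column KNN imputation matches plain mean imputation, while an additional observed column lets neighbors influence the fill value.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up single-column data (an assumption, standing in for training_hours).
one_col = np.array([[10.0], [np.nan], [30.0], [50.0]])

knn_one = KNNImputer(n_neighbors=2).fit_transform(one_col)
mean_one = SimpleImputer(strategy="mean").fit_transform(one_col)

# With one column there are no observed coordinates to measure distance on,
# so the KNN fill value falls back to the column mean.
print(knn_one.ravel())
print(mean_one.ravel())

# Adding a second, fully observed column (hypothetical) gives the imputer
# something to compute distances on.
two_col = np.array([
    [10.0, 1.0],
    [np.nan, 1.1],
    [30.0, 5.0],
    [50.0, 9.0],
])
knn_two = KNNImputer(n_neighbors=1).fit_transform(two_col)
print(knn_two[:, 0])
```

In the two-column case the incomplete row sits closest to the first row on the observed column, so it borrows that row's value instead of the column mean.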

```json
{
  "issue": "Now that the data is clean and complete, we need to evaluate the impact of the imputation by training a predictive model.",
  "action": "Create a function to train a Logistic Regression classifier on the imputed training data. The function then uses the trained model to make predictions on the imputed test data and returns the accuracy score.",
  "state": "A classification model is trained and evaluated, providing a concrete accuracy score for the pipeline. Judging the imputation step itself requires comparing this score with a baseline."
}
```

In [ ]:
def train_and_evaluate_model(X_train: np.ndarray, y_train: pd.Series, X_test: np.ndarray, y_test: pd.Series) -> float:
    """Trains a logistic regression model and returns its accuracy."""
    model = sklearn.linear_model.LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
    return accuracy

accuracy = train_and_evaluate_model(X_train_imputed, y_train, X_test_imputed, y_test)
print(f"Model Accuracy after KNN Imputation: {accuracy:.4f}")
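
An accuracy figure is easier to interpret next to a baseline. The following is a minimal sketch, not the notebook's method: it uses synthetic data generated with NumPy (the two features and the roughly 10% missing rate are assumptions), chains `KNNImputer` and `LogisticRegression` with `make_pipeline` so the imputer is fit on training rows only, and compares the result with a majority-class `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): two numeric features, binary target.
n_rows = 500
X_syn = rng.normal(size=(n_rows, 2))
y_syn = (X_syn[:, 0] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)

# Knock out roughly 10% of the first feature to mimic missing values.
missing_mask = rng.random(n_rows) < 0.10
X_syn[missing_mask, 0] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=2, stratify=y_syn
)

# Pipeline: the imputer is fit on the training split, then reused on the test split.
knn_model = make_pipeline(KNNImputer(n_neighbors=5), LogisticRegression())
knn_model.fit(X_tr, y_tr)
knn_accuracy = accuracy_score(y_te, knn_model.predict(X_te))

# Baseline that always predicts the most frequent class; it ignores the features.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))

print(f"KNN imputation + logistic regression: {knn_accuracy:.4f}")
print(f"Majority-class baseline:              {baseline_accuracy:.4f}")
```

On the real dataset, the same comparison could be run with a mean-imputation pipeline as a second baseline, which would show whether KNN imputation adds anything beyond the simpler method.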