# Iterative Imputation

-   One of the techniques used for multivariate imputation.
-   The algorithm uses a round-robin approach, where each feature is imputed in turn, using the current estimates of the other features.

# How it works

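-   Initially, all missing values in the dataset are filled with a simple per-feature statistic (the mean of each feature by default).
-   For each column (for example, proceeding from left to right), the rows where that column was originally missing are treated as the test set, the rows where it was observed are treated as the training set, and the other columns serve as predictors. A regression model fitted on the training rows predicts the missing values.
-   The predictions replace the previous estimates, so each feature is updated using the latest values of the others.
-   The rounds repeat until the difference between consecutive iterations becomes small enough (convergence) or the maximum number of iterations is reached.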
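
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not scikit-learn's implementation: the helper name `round_robin_impute`, the toy array, and the use of `LinearRegression` as the per-column model are all choices made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def round_robin_impute(X, max_iter=10, tol=1e-3):
    """Hypothetical helper: mean-initialise, then re-predict each column in turn."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)  # remember where the gaps were
    # Step 1: fill every gap with the mean of its column
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(max_iter):
        previous = filled.copy()
        for j in range(X.shape[1]):  # Step 2: visit columns left to right
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to predict, or nothing to train on
            others = np.delete(filled, j, axis=1)  # current estimates of the other columns
            model = LinearRegression().fit(others[~rows], filled[~rows, j])
            filled[rows, j] = model.predict(others[rows])  # Step 3: update the estimates
        # Step 4: stop once a full round barely changes anything
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled


# Toy data (made up): the second column is roughly twice the first
X = np.array([[1.0, 2.1], [2.0, np.nan], [3.0, 6.2], [np.nan, 8.1], [5.0, 9.9]])
print(round_robin_impute(X))
```

The important detail is that the mask of missing positions is recorded before the mean fill; once the gaps are filled, they can no longer be found with `isnan`.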

Key Parameters of IterativeImputer:

- max_iter: The maximum number of iterations for the imputation process.
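- tol: The tolerance of the stopping condition; iteration stops early once the change between rounds falls below it.
- n_nearest_features: The number of other features used to estimate the missing values of each feature. The default, None, uses all features.
- initial_strategy: The strategy for the initial fill: 'mean' (default), 'median', 'most_frequent', or 'constant'.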
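
As a rough illustration of how these parameters are passed, the sketch below builds an imputer on synthetic data. The array, the missing rate, and the specific parameter values are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the next import)
from sklearn.impute import IterativeImputer

# Synthetic data (assumption): 4 features, the last one correlated with the first
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out about 10% of the entries

imputer = IterativeImputer(
    max_iter=20,                # upper bound on the number of rounds
    tol=1e-3,                   # stop early once the change between rounds is small
    n_nearest_features=2,       # use only 2 other features per imputed feature
    initial_strategy="median",  # initial fill before the first round
    random_state=0,
)
X_imputed = imputer.fit_transform(X)

print("missing values left:", int(np.isnan(X_imputed).sum()))
print("rounds actually run:", imputer.n_iter_)
```

After fitting, `n_iter_` reports how many rounds were actually run, which is a quick way to see whether `max_iter` was reached before the tolerance was met.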

# Advantages
-   Accuracy: By considering the relationships between features, multivariate imputation can provide more accurate estimates than univariate methods.
-   Flexibility: The IterativeImputer can be used with various estimators, allowing for customization based on the specific dataset.
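
To illustrate the flexibility point, the sketch below swaps different regressors into `IterativeImputer` through its `estimator` argument and compares them on synthetic data where the true values are known. The data, the 10% missing rate, and the three estimators are assumptions for this example; results on a real dataset may rank differently.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the third feature depends nonlinearly on the first
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 3))
X_true[:, 2] = X_true[:, 0] ** 2 + 0.1 * rng.normal(size=200)
mask = rng.random(X_true.shape) < 0.1  # entries to hide
X = np.where(mask, np.nan, X_true)

estimators = {
    "BayesianRidge (default)": BayesianRidge(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=5, random_state=0),
}

results = {}
for name, estimator in estimators.items():
    imputer = IterativeImputer(estimator=estimator, max_iter=10, random_state=0)
    X_imputed = imputer.fit_transform(X)
    # Error measured only on the entries that were hidden
    results[name] = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
    print(f"{name}: RMSE on hidden entries = {results[name]:.3f}")
```

Hiding known values and measuring the reconstruction error, as done here, is one simple way to choose an estimator for a given dataset.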

# Disadvantages
-   Computationally Intensive: Iterative imputation can be computationally expensive, especially for large datasets with many features.
-   Complexity: The method involves multiple iterations and the choice of estimator, which can add complexity to the preprocessing pipeline.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [56]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

In [10]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'E:\ml_revision\Missing_values\Datasets\data_science_job.csv')


In [11]:
df.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,20.0,,,36.0,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15.0,50-99,Pvt Ltd,47.0,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5.0,,,83.0,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,0.0,,Pvt Ltd,52.0,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,20.0,50-99,Funded Startup,8.0,0.0


In [12]:
df.shape

(19158, 13)

In [13]:
df.isnull().mean()

enrollee_id               0.000000
city                      0.000000
city_development_index    0.025003
gender                    0.235306
relevent_experience       0.000000
enrolled_university       0.020148
education_level           0.024011
major_discipline          0.146832
experience                0.003393
company_size              0.309949
company_type              0.320493
training_hours            0.039983
target                    0.000000
dtype: float64

In [14]:
data = df[['city_development_index','training_hours','experience','target']]

In [None]:
data.sample(5) 

Unnamed: 0,city_development_index,training_hours,experience,target
1386,0.855,55.0,7.0,1.0
14527,0.884,24.0,19.0,0.0
5485,0.884,39.0,20.0,0.0
14249,0.92,9.0,20.0,0.0
11535,0.887,106.0,10.0,0.0


In [25]:
subset = data[data.isnull().any(axis=1)]

In [65]:
required = subset.sample(15)
temp = required
required

Unnamed: 0,city_development_index,training_hours,experience,target
831,0.878,,10.0,0.0
15222,0.836,,20.0,0.0
11588,,43.0,4.0,0.0
14153,0.915,,14.0,0.0
3760,0.92,24.0,,0.0
11093,,56.0,11.0,1.0
2859,0.92,41.0,,1.0
16114,0.855,,9.0,1.0
7562,,8.0,7.0,1.0
3069,0.92,,16.0,0.0


In [66]:
required['city_development_index'] = required['city_development_index'].fillna(required['city_development_index'].mean())
required['training_hours'] = required['training_hours'].fillna(required['training_hours'].mean())
required['experience'] = required['experience'].fillna(required['experience'].mean())


In [67]:
# Positions that were missing before the mean fill (the original `data` still holds the NaNs).
# After the fill, `required` has no NaNs, so the mask must come from the original data.
missing = data.loc[required.index].isnull()

print(required)  # Starting point: missing values replaced by column means
for i in range(5):  # Number of iterations
    print(f"\nIteration {i + 1}:")

    for clm in required.columns:
        missing_mask = missing[clm]
        # Skip columns with nothing to impute (e.g. target) or nothing to train on
        if not missing_mask.any() or missing_mask.all():
            continue

        # Use the other columns, with their current estimates, as features
        x = required.drop(columns=[clm])
        y = required[clm]  # Current column as the target

        # Train a simple model (linear regression) on rows where this column was observed
        model = LinearRegression()
        model.fit(x[~missing_mask], y[~missing_mask])

        # Replace the current estimates with the model's predictions
        required.loc[missing_mask, clm] = model.predict(x[missing_mask])

    # Display the progress after each iteration
    print(required)

       city_development_index  training_hours  experience  target
831                    0.8780       52.571429   10.000000     0.0
15222                  0.8360       52.571429   20.000000     0.0
11588                  0.8968       43.000000    4.000000     0.0
14153                  0.9150       52.571429   14.000000     0.0
3760                   0.9200       24.000000   12.153846     0.0
11093                  0.8968       56.000000   11.000000     1.0
2859                   0.9200       41.000000   12.153846     1.0
16114                  0.8550       52.571429    9.000000     1.0
7562                   0.8968        8.000000    7.000000     1.0
3069                   0.9200       52.571429   16.000000     0.0
7808                   0.8840       52.571429   20.000000     0.0
1934                   0.8968      162.000000    1.000000     1.0
17688                  0.9200       52.571429    9.000000     0.0
15268                  0.9200       52.571429   17.000000     1.0
7461      

### The values may change only slightly between iterations, but this is how the process works.


### Now let's perform the same imputation using scikit-learn


In [69]:
imputer = IterativeImputer(max_iter=10, random_state=42)

# Perform imputation
imputed_data = imputer.fit_transform(data)

# Convert the result back to a DataFrame
imputed_data = pd.DataFrame(imputed_data, columns=data.columns)

# Display the result
print("Original DataFrame:")
print(data)
print("\nImputed DataFrame:")
print(imputed_data)

Original DataFrame:
       city_development_index  training_hours  experience  target
0                       0.920            36.0        20.0     1.0
1                       0.776            47.0        15.0     0.0
2                       0.624            83.0         5.0     0.0
3                       0.789            52.0         0.0     1.0
4                       0.767             8.0        20.0     0.0
...                       ...             ...         ...     ...
19153                   0.878            42.0        14.0     1.0
19154                   0.920            52.0        14.0     1.0
19155                   0.920            44.0        20.0     0.0
19156                   0.802            97.0         0.0     0.0
19157                   0.855           127.0         2.0     0.0

[19158 rows x 4 columns]

Imputed DataFrame:
       city_development_index  training_hours  experience  target
0                       0.920            36.0        20.0     1.0
1         