In [5]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

In [6]:
### Missing Indicators

Missing indicators are auxiliary binary variables added to a dataset to flag missing values in the original variables. Each missing indicator variable corresponds to an original variable and takes the value 1 if the original value is missing, and 0 otherwise.

### Purpose of Missing Indicators:

1. **Capture Missingness Information:** Explicitly record whether a value was missing, which can be useful in understanding the patterns and reasons for missingness.
  
2. **Retain Data Information:** Keep a record that a value was missing even after it has been imputed, so downstream analysis and modeling can still tell observed values from filled-in ones.

3. **Improve Model Accuracy:** Let predictive models use the pattern of missingness as an additional feature, which can help when missingness is related to the target.

### Example:

Suppose you have a dataset with columns `Age` and `Income`, and `Income` has some missing values. You can create an indicator for the missing values in `Income`.

#### Creating Missing Indicators in Python using pandas:

```python
import pandas as pd
import numpy as np

# Sample data with missing values
data = {
    'Age': [25, 30, np.nan, 40, 35],
    'Income': [50000, np.nan, 60000, 70000, 80000]
}

df = pd.DataFrame(data)

# Creating missing indicators for 'Income'
df['Income_missing'] = df['Income'].isnull().astype(int)

print("Original DataFrame:")
print(df[['Age', 'Income']])

print("\nDataFrame with missing indicators:")
print(df[['Age', 'Income', 'Income_missing']])
```

### Output:

```
Original DataFrame:
    Age   Income
0  25.0  50000.0
1  30.0      NaN
2   NaN  60000.0
3  40.0  70000.0
4  35.0  80000.0

DataFrame with missing indicators:
    Age   Income  Income_missing
0  25.0  50000.0               0
1  30.0      NaN               1
2   NaN  60000.0               0
3  40.0  70000.0               0
4  35.0  80000.0               0
```
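
When several columns have gaps, the same idea can be applied to all of them at once. The following is a minimal sketch, assuming the sample data above plus a hypothetical fully observed `City` column, and using `_missing` as an illustrative naming suffix:

```python
import numpy as np
import pandas as pd

# Sample data; 'City' is a hypothetical column with no missing values
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 35],
    'Income': [50000, np.nan, 60000, 70000, 80000],
    'City': ['A', 'B', 'C', 'D', 'E'],
})

# Only columns that actually contain missing values get an indicator
cols_with_missing = [col for col in df.columns if df[col].isnull().any()]

# One 0/1 indicator column per affected feature, named '<column>_missing'
indicators = df[cols_with_missing].isnull().astype(int).add_suffix('_missing')

df_flagged = pd.concat([df, indicators], axis=1)
print(df_flagged)
```

Skipping fully observed columns avoids adding constant features that carry no information.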

### Advantages of Missing Indicators:

- **Improved Model Performance:** Models can leverage the presence of missing values as an additional piece of information, which might correlate with the target variable.
  
- **Transparent Handling of Missing Data:** Makes it clear which values were originally missing, aiding in interpretation and analysis.

- **Flexible Imputation Strategies:** Missing indicators allow for different imputation strategies to be applied to the actual values without losing information about the original missingness.

### Considerations:

- **Increased Dimensionality:** Adding missing indicators increases the number of features in the dataset, which might require more computational resources and can lead to overfitting in some cases.

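- **Missing Data Mechanism:** The utility of missing indicators depends on why values are missing. When data is missing completely at random (MCAR), missingness is unrelated to any variable, so the indicator carries no predictive signal and mainly adds noise. Indicators tend to help most when missingness is informative, that is, related to the target or to the unobserved value itself, as under missing not at random (MNAR). Under missing at random (MAR), they may still help a predictive model if the observed variables that drive missingness are not otherwise captured.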

### Implementation in a Modeling Pipeline:

When using missing indicators in a machine learning pipeline, you can include them as part of the preprocessing steps.

```python
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Sample data with missing values
data = {
    'Age': [25, 30, np.nan, 40, 35],
    'Income': [50000, np.nan, 60000, 70000, 80000],
    'Target': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Separate features and target
X = df[['Age', 'Income']]
y = df['Target']

# Hold out 20% of the rows (one row in this toy dataset) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline with imputation and missing indicator creation
pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean', add_indicator=True)),
    ('model', RandomForestClassifier(random_state=0))  # fixed seed for reproducibility
])

# Fit the model
pipeline.fit(X_train, y_train)

# Predict
predictions = pipeline.predict(X_test)
```

In this example, `SimpleImputer` with `add_indicator=True` appends one binary indicator column for each feature that contained missing values during `fit`, so imputation and missingness flags are produced in a single preprocessing step. The model can then treat missingness as part of the feature set, which may improve its predictive power when missingness is informative.
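
To see what the imputer actually emits, the transformed array can be inspected directly. The sketch below uses a small hypothetical array in which only the second column (income) has a missing value, so only one indicator column is expected:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 (age) is complete, column 1 (income) has one gap
X = np.array([
    [25.0, 50000.0],
    [30.0, np.nan],
    [40.0, 70000.0],
    [35.0, 80000.0],
])

imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_out = imputer.fit_transform(X)

# Two imputed columns plus one indicator column for the feature with gaps
print(X_out.shape)

# Indices of the input features that received an indicator column
print(imputer.indicator_.features_)

# The row that had the gap: imputed income followed by the indicator value
print(X_out[1])
```

Because indicators are only created for features with missing values at fit time, a feature that first shows gaps at prediction time is imputed but does not get its own flag.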


In [7]:
df = pd.read_csv('train.csv')

In [8]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [11]:
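# Preview the frame without identifier-like columns. The result is not assigned,
# so df keeps all columns; the ColumnTransformer below selects the ones it needs.
df.drop(columns=['PassengerId','Name','Ticket','Cabin'])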

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.2500,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.9250,S
3,1,1,female,35.0,1,0,53.1000,S
4,0,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S
887,1,1,female,19.0,0,0,30.0000,S
888,0,3,female,,1,2,23.4500,S
889,1,1,male,26.0,0,0,30.0000,C


In [12]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [13]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)


In [15]:
X_train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
30,31,1,"Uruchurtu, Don. Manuel E",male,40.0,0,0,PC 17601,27.7208,,C
10,11,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
873,874,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0,,S
182,183,3,"Asplund, Master. Clarence Gustaf Hugo",male,9.0,4,2,347077,31.3875,,S
876,877,3,"Gustafsson, Mr. Alfred Ossian",male,20.0,0,0,7534,9.8458,,S


In [21]:
numerical_Features = ['Age','Fare']
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_Features = ['Embarked','Sex']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

In [17]:
SimpleImputer().get_params()

{'add_indicator': False,
 'copy': True,
 'fill_value': None,
 'keep_empty_features': False,
 'missing_values': nan,
 'strategy': 'mean'}

In [25]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_Features),
        ('cat', categorical_transformer, categorical_Features)
    ]
)

In [26]:
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

In [27]:
from sklearn import set_config

set_config(display='diagram')
clf

In [28]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__cat__imputer__strategy': ['most_frequent', 'constant'],
    'classifier__C': [0.1, 1.0, 10, 100]
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
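
The grid above tunes the imputation strategy but never switches on the missing indicator discussed earlier. As a hedged sketch, `add_indicator` can be searched like any other imputer parameter. The example below runs on synthetic data that only mimics the `Age`, `Fare`, `Sex` and `Embarked` columns (the values, the target and the 20% missing rate are assumptions, not Titanic data), so the scores say nothing about the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic columns used by the pipeline
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Age': rng.normal(30, 10, n),
    'Fare': rng.exponential(30, n),
    'Sex': rng.choice(['male', 'female'], n),
    'Embarked': rng.choice(['S', 'C', 'Q'], n),
})
demo.loc[rng.random(n) < 0.2, 'Age'] = np.nan  # roughly 20% of ages missing

# Hypothetical target: mostly determined by Sex, with 10% label noise
target = ((demo['Sex'] == 'female') ^ (rng.random(n) < 0.1)).astype(int)

num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
pre = ColumnTransformer([
    ('num', num_pipe, ['Age', 'Fare']),
    ('cat', cat_pipe, ['Embarked', 'Sex']),
])
model = Pipeline([
    ('preprocessor', pre),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Compare the pipeline with and without missing indicators on numeric columns
grid = {'preprocessor__num__imputer__add_indicator': [False, True]}
search = GridSearchCV(model, grid, cv=3)
search.fit(demo, target)

print(search.best_params_)
print(pd.DataFrame(search.cv_results_)[['params', 'mean_test_score']])
```

In the notebook's own grid this would amount to adding the key `'preprocessor__num__imputer__add_indicator': [False, True]` to `param_grid`.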

In [29]:
grid_search.fit(X_train, y_train)

print("Best params:")
print(grid_search.best_params_)

Best params:
{'classifier__C': 1.0, 'preprocessor__cat__imputer__strategy': 'most_frequent', 'preprocessor__num__imputer__strategy': 'mean'}


In [36]:
print(f"Internal CV score: {grid_search.best_score_:.3f}")


Internal CV score: 0.788


In [37]:
import pandas as pd

cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[['param_classifier__C','param_preprocessor__cat__imputer__strategy','param_preprocessor__num__imputer__strategy','mean_test_score']]

Unnamed: 0,param_classifier__C,param_preprocessor__cat__imputer__strategy,param_preprocessor__num__imputer__strategy,mean_test_score
4,1.0,most_frequent,mean,0.787852
5,1.0,most_frequent,median,0.787852
6,1.0,constant,mean,0.787852
7,1.0,constant,median,0.787852
8,10.0,most_frequent,mean,0.787852
9,10.0,most_frequent,median,0.787852
10,10.0,constant,mean,0.787852
11,10.0,constant,median,0.787852
12,100.0,most_frequent,mean,0.787852
13,100.0,most_frequent,median,0.787852
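
Many candidates in the table above share the same `mean_test_score`. In that situation `GridSearchCV` reports the first candidate with the best rank, so the reported best parameters do not show that one tied setting beats another. A minimal sketch of checking for ties, using `make_classification` as a synthetic stand-in for the Titanic data (an assumption made to keep the example self-contained):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, not the Titanic dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.1, 1.0, 10, 100]},
    cv=5,
)
search.fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)
cols = ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))

# Tied candidates share rank 1; best_index_ points at the first of them
n_tied = int((results['rank_test_score'] == 1).sum())
print(f"Candidates sharing the best rank: {n_tied}")
print(f"Index reported as best: {search.best_index_}")
```

Looking at `std_test_score` alongside the mean also helps judge whether small differences between candidates are within fold-to-fold noise.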
