Here is a guide to building a binary classification pipeline with Python, Pandas, Scikit-learn, NumPy, and other libraries:

1. Descriptive statistics
   - Use Pandas to load the data and get a basic understanding of the data.
   - Use functions such as `df.head()`, `df.info()`, `df.describe()`, and `df.shape` to check the structure, the types, and the distribution of data.

2. Data cleaning and missing values replacement
   - Identify and remove duplicates, if any.
   - Check for missing values and handle them using techniques such as imputation or removal. 
   - Use Pandas' `fillna()` function to fill missing values with a specific value or a statistical measure such as the mean or median.

3. Handling outliers
   - Identify the outliers using techniques such as boxplots or scatterplots.
   - Remove the outliers or apply a transformation such as log or square root to the data.

4. Encoding categorical variables using onehot and target encodings
   - Convert categorical variables to numerical values using one-hot encoding or target encoding. 
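   - Use Scikit-learn's `OneHotEncoder` for one-hot encoding and `TargetEncoder` (available from Scikit-learn 1.3) for target encoding. `LabelEncoder` is meant for encoding target labels, not input features.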

5. Feature selection using correlation 
   - Identify and remove features that are highly correlated with each other. 
   - Use Pandas' `corr()` function to check the correlation between features.
   - Drop highly correlated features using `df.drop()` function.

6. Numerical columns normalisation using min-max scaler
   - Normalise the numerical columns using techniques such as min-max scaling.
   - Use Scikit-learn's `MinMaxScaler` to scale the numerical data between 0 and 1.

7. Modeling using logistic regression, decision tree, random forest and LGBM
   - Split the data into training and testing sets using Scikit-learn's `train_test_split()`.
   - Build the models using Scikit-learn's `LogisticRegression()`, `DecisionTreeClassifier()`, and `RandomForestClassifier()`, plus LightGBM's `LGBMClassifier()`.
   - Fit the models using the `fit()` method and predict using the `predict()` method.

8. Interpretation of results using F1, precision, recall, ROC-AUC and confusion matrix
   - Evaluate the model performance using metrics such as f1-score, precision, recall, ROC-AUC curve, and confusion matrix.
   - Use Scikit-learn's `classification_report()`, `roc_auc_score()`, and `confusion_matrix()` to evaluate the model performance.
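
The encoding, scaling, modeling, and evaluation steps above can be chained into a single Scikit-learn pipeline. The following is only a minimal sketch on made-up data: the column names (`amount`, `age`, `region`) and the toy target are assumptions for illustration and do not come from the competition dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: two numeric columns, one categorical column, binary target.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "amount": rng.normal(100, 20, n),
    "age": rng.integers(18, 70, n),
    "region": rng.choice(["north", "south", "east"], n),
})
toy_y = (toy["amount"] + rng.normal(0, 10, n) > 100).astype(int)

# Step 6 (min-max scaling) for numeric columns, step 4 (one-hot) for categorical ones.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["amount", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

# Step 7: stratified split, fit, predict. Step 8: evaluate with F1.
X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, stratify=toy_y, random_state=0)
clf.fit(X_tr, y_tr)
toy_pred = clf.predict(X_te)
toy_f1 = f1_score(y_te, toy_pred)
print(f"F1 on the held-out toy split: {toy_f1:.2f}")
```

Wrapping the preprocessing in a pipeline means the scaler and encoder are fitted on the training rows only, which is one common way to avoid leaking information from the held-out split.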

## <span style="color:darkgreen"><b> Dataset Description</b></span>
This data corresponds to a set of financial transactions associated with individuals. The data has been standardized, de-trended, and anonymized. You are provided with over two hundred thousand observations and nearly 800 features.  Each observation is independent from the previous. 

For each observation, it was recorded whether a default was triggered. In case of a default, the loss was measured. This quantity lies between 0 and 100. It has been normalised, considering that the notional of each transaction at inception is 100. For example, a loss of 60 means that only 40 is reimbursed. If the loan did not default, the loss was 0. You are asked to predict the losses for each observation in the test set.
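
As a small worked example of this definition, the sketch below takes a few made-up loss values (not taken from the dataset), computes the reimbursed amount on a notional of 100, and derives a binary default flag from whether the loss is positive.

```python
import numpy as np

# Hypothetical normalised losses for five transactions (notional = 100).
loss = np.array([0, 60, 0, 100, 5])

reimbursed = 100 - loss               # amount recovered per transaction
defaulted = (loss > 0).astype(int)    # 1 if a default was triggered, else 0

print(reimbursed.tolist())  # [100, 40, 100, 0, 95]
print(defaulted.tolist())   # [0, 1, 0, 1, 1]
```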

Missing feature values have been kept as is, so that the competing teams can really use the maximum data available, implementing a strategy to fill the gaps if desired. Note that some variables may be categorical (e.g. f776 and f777).

The competition sponsor has worked to remove time-dimensionality from the data. However, the observations are still listed in order from old to new in the training set. In the test set they are in random order.

More info: https://www.kaggle.com/competitions/loan-default-prediction/overview

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn 
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [None]:
train_df = pd.read_csv("train_v2.csv")
test_df = pd.read_csv("test_v2.csv")

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
train_df.info()

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
train_df = reduce_mem_usage(train_df)

In [None]:
test_df = reduce_mem_usage(test_df)

In [None]:
train_df.select_dtypes(include=['object']).head()

In [None]:
test_df.select_dtypes(include=['object']).head()


The object-typed (categorical) columns appear to contain invalid values, so they are dropped.

In [None]:
# Drop categorical columns which are invalid and drop id columns in train and test data
invalid = train_df.select_dtypes(include=['object']).columns
train_df.drop(invalid, axis=1, inplace=True)
test_df.drop(invalid, axis=1, inplace=True)
train_df_id = train_df['id'].copy()
train_df.drop('id', axis=1, inplace=True)
test_df_id = test_df['id'].copy()
test_df.drop('id', axis=1, inplace=True)

## <span style="color:darkgreen"><b> EDA</b></span>

In [None]:
train_df.info(); test_df.info()

## <span style="color:darkgreen"><b> Missing values</b></span>
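
One alternative to filling each frame with its own column means is to learn the means on the training rows and reuse them for the test rows, so both frames are imputed with the same values. The sketch below uses Scikit-learn's `SimpleImputer` on two tiny made-up frames. The frame and column names are hypothetical, and this is not necessarily the approach taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the train and test features.
toy_train = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [np.nan, 40.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)  # learn the column means from the training rows only

toy_train_filled = pd.DataFrame(imputer.transform(toy_train), columns=toy_train.columns)
toy_test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
print(toy_test_filled)
```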

In [None]:
train_df.dtypes.value_counts()

In [None]:
train_miss = train_df.isnull().sum()
train_miss = pd.DataFrame(train_miss[train_miss > 0])
train_miss.columns = ['Number_missing']
train_miss['Percent_missing'] = train_miss['Number_missing'] / len(train_df) * 100
train_miss.sort_values(by='Percent_missing', ascending=False)

In [None]:
train_df.fillna(train_df.mean(), inplace=True)

In [None]:
test_df.fillna(test_df.mean(), inplace=True)

<span style="color:darkgreen"><b> We can keep the most informative features and drop those that are highly correlated with each other, e.g. by setting a threshold of 0.9 on the absolute correlation.</b></span>
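
The idea can be sketched as a small helper that inspects the strict upper triangle of the absolute correlation matrix and flags any column that is too strongly correlated with an earlier one. The function name `highly_correlated_columns` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(df, threshold=0.9, method="spearman"):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 1,          # monotone function of "a": Spearman correlation of 1
    "b": rng.normal(size=100),    # independent noise
})
flagged = highly_correlated_columns(toy)
print(flagged)  # ['a_copy']
```

Only the later column of each correlated pair is flagged, so one representative of every correlated group is kept.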

In [None]:
y_for_corr = train_df['loss']
train_df_for_corr= train_df.drop('loss', axis=1)
correlations = train_df_for_corr.corr(method='spearman').abs()
#correlations = correlations['loss'].sort_values(ascending=False)

In [None]:
correlations_test = test_df.corr(method='spearman').abs()
upper_test = correlations_test.where(np.triu(np.ones(correlations_test.shape), k=1).astype(bool))
threshold = 0.90
to_drop_test = [column for column in upper_test.columns if any(upper_test[column] > threshold)]

print('There are %d columns to remove in test df.' % (len(to_drop_test)))


In [None]:
#taking upper triangular part of correlation matrix 
upper = correlations.where(np.triu(np.ones(correlations.shape), k=1).astype(bool))

In [None]:
# Select columns with correlations above threshold
threshold = 0.90
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

print('There are %d columns to remove.' % (len(to_drop)))

In [None]:
# Drop the same highly correlated columns from both frames so their features stay aligned
train_df = train_df.drop(columns = to_drop)
test_df = test_df.drop(columns = to_drop)

In [None]:
y = train_df['loss']

In [None]:
# finding correlation of features with target values of loss and convert into a dataframe
corr_tar = train_df.corrwith(y).sort_values()
print(corr_tar.head(10))
print(corr_tar.tail(10))
corr_tar_df = corr_tar.to_frame().transpose()
corr_tar_df.isna()

In [None]:
# extracting features having NaN value correlation with loss to remove them 
col_to_drop_1 = corr_tar_df.columns[corr_tar_df.isna().any()].to_list()
print(len(col_to_drop_1))
print(col_to_drop_1)

In [None]:
train_df = train_df.drop(columns = col_to_drop_1)
test_df = test_df.drop(columns = col_to_drop_1)


In [None]:
test_df.shape

## <span style="color:darkgreen"><b> Split data</b></span>
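
With an imbalanced target, passing `stratify` to `train_test_split` keeps the share of positives roughly the same in both parts. A minimal sketch on made-up labels (10% positives, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10% positives.
toy_y = np.array([1] * 100 + [0] * 900)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(toy_X, toy_y, stratify=toy_y, random_state=0)

# Both parts keep a positive rate close to the overall 10%.
print(y_a.mean(), y_b.mean())
```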

In [None]:
y = train_df['loss']
X = train_df.drop('loss', axis=1)
#y = train_df['loss']
y.value_counts()

In [None]:
# Binarize the target: 1 if any loss occurred (default), 0 otherwise
y = (y > 0).astype(int)
y.value_counts()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

In [None]:
[X_train.shape, X_test.shape, y_train.shape, y_test.shape, test_df.shape]

In [None]:
diff_cols = set(X_train.columns) - set(test_df.columns)
diff_cols

## <span style="color:darkgreen"><b>Standardization</b></span>
PCA is affected by scale, so we need to scale the features in the data before applying PCA. 
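
A quick way to see this effect is to run PCA on two independent features with very different scales, before and after standardization. The data below is synthetic and only meant as an illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent hypothetical features on very different scales.
toy = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

raw_ratio = PCA(n_components=2).fit(toy).explained_variance_ratio_
scaled = StandardScaler().fit_transform(toy)
scaled_ratio = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print(raw_ratio)     # the large-scale feature takes almost all the variance
print(scaled_ratio)  # after scaling, the variance is shared roughly equally
```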

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
test_df_scaled = scaler.transform(test_df)


In [None]:
#test_df_scaled = scaler.transform(test_df)

## <span style="color:darkgreen"><b> Applying PCA</b></span>
PCA is a method used to reduce the number of variables in the data by extracting the important ones from a large pool. It reduces the dimension of the data while aiming to retain as much information as possible. 
The method combines highly correlated variables into a smaller set of artificial variables, called "principal components", that account for most of the variance in the data.
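
Instead of fixing the number of components by hand, Scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then keeps the fewest components needed to reach that share of explained variance. The sketch below uses synthetic data with three latent factors. The 0.95 target is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 20 observed features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
toy = latent @ mixing + 0.01 * rng.normal(size=(300, 20))

# A float in (0, 1) asks PCA to keep the fewest components reaching that variance share.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(toy)
print(pca_95.n_components_)
print(pca_95.explained_variance_ratio_.sum())
```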

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X_train_scaled)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')

In [None]:
# Index 199 holds the cumulative variance of the first 200 components
np.cumsum(pca.explained_variance_ratio_)[199]
print(f'{np.cumsum(pca.explained_variance_ratio_)[199]:.2f}' + ' of the variance is explained by 200 components')

In [None]:
final_pca = PCA(n_components=200)
final_pca.fit(X_train_scaled)
X_train_pca = final_pca.transform(X_train_scaled)
X_test_pca = final_pca.transform(X_test_scaled)
test_pca = final_pca.transform(test_df_scaled)


In [None]:
X_train_pca.shape, test_pca.shape

In [None]:
X_train_pca = pd.DataFrame(X_train_pca)
X_test_pca = pd.DataFrame(X_test_pca)
test_pca = pd.DataFrame(test_pca)

##### <span style="color:darkgreen"><b>Use these 200 principal components as the independent variables to fit the model and predict loss.</b></span> 

## <span style="color:darkgreen"><b>Logistic Regression</b></span>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
logreg = LogisticRegression(solver = 'saga', class_weight= 'balanced', max_iter=500, random_state = 1)
logreg.fit(X_train_pca, y_train)
logreg.coef_



In [None]:
logreg.coef_[0, :10]

In [None]:
importance = logreg.coef_[0]
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)
plt.show()

In [None]:
plt.figure(figsize=(25, 10))
sns.barplot(x = 'weight', y = 'feature', data = pd.DataFrame({'feature': X_train_pca.columns, 'weight': logreg.coef_[0]}).sort_values(by = 'weight', ascending = False).iloc[0:50] )

## <span style="color:darkgreen"><b>Validation on test data</b></span>

In [None]:
y_pred = logreg.predict(X_test_pca)
y_pred

<span style="color:darkgreen"><b>Model Evaluation: Confusion Matrix</b></span>

In [None]:
import sklearn.metrics as metrics
c = pd.DataFrame(metrics.confusion_matrix(y_test, y_pred), index = 
                 ["Actual non defaulter", 
                "Actual defaulter"])
c.columns = ["Predicted non defaulter", "Predicted defaulter"]
c['Actual Total'] = c.sum(axis = 1)
c.loc['Predicted Total', :] = c.sum(axis = 0)

In [None]:
c

<span style="color:darkgreen"><b>Accuracy</b></span>

In [None]:
print('The accuracy on the validation data is '+ str(round(metrics.accuracy_score(y_test, y_pred)*100, ndigits=2)) +"%")

<span style="color:darkgreen"><b>Sensitivity</b></span>

In [None]:
print("The sensitivity (true positive rate) is "+ str(round(metrics.recall_score(y_test, y_pred)*100, ndigits = 2)) + "%")

<span style="color:darkgreen"><b>AUC (Area Under the Curve)</b></span>
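
ROC AUC measures how well the model ranks positives above negatives, so it is normally computed from predicted probabilities or scores. Thresholding into hard labels first discards part of the ranking and can change the value. A tiny sketch with made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.6, 0.7, 0.8])   # hypothetical predicted probabilities
labels = (scores >= 0.5).astype(int)      # hard labels at a 0.5 cut-off

auc_scores = roc_auc_score(y_true, scores)   # every positive outranks every negative
auc_labels = roc_auc_score(y_true, labels)   # ties introduced by thresholding lower the AUC
print(auc_scores, auc_labels)
```

Here the scores give a perfect ranking (AUC of 1.0), while the hard labels give 0.75 because one negative is tied with both positives after thresholding.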

In [None]:
ns_fpr, ns_tpr, _ = metrics.roc_curve(y_test, y_pred)
np_zero_fpr, np_zero_tpr, _ = metrics.roc_curve(y_test, np.zeros(len(y_test)))

logreg_probs = logreg.predict_proba(X_test_pca)
# keep probabilities for the positive outcome only
logreg_probs = logreg_probs[:, 1]
logreg_fpr, logreg_tpr, _ = metrics.roc_curve(y_test, logreg_probs)

plt.plot(ns_fpr, ns_tpr, linestyle = '--', label = 'Logistic Regression (hard labels)')
plt.plot(np_zero_fpr, np_zero_tpr, linestyle = '--', label  = 'No prediction')
plt.plot(logreg_fpr, logreg_tpr, marker = ".", label = 'Logistic Regression (probabilities)')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
plt.show()

In [None]:
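# Use the predicted probabilities (logreg_probs), not the hard labels, to compute ROC AUC
print('Area under the ROC curve is ' +str(round(100*metrics.roc_auc_score(y_test, logreg_probs), ndigits = 2)) +'%')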

<span style="color:darkgreen"><b>Classification Report</b></span>
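
The precision, recall, and F1 values in a classification report can all be derived from the four confusion-matrix counts. The sketch below works through this on a handful of made-up labels and checks the results against Scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions for eight observations.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted defaulters that really defaulted
recall = tp / (tp + fn)      # share of actual defaulters that were caught (sensitivity)
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
```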

In [None]:
print(metrics.classification_report(y_test, y_pred))

### <span style="color:darkgreen"><b>Prediction on given test data</b></span>

In [None]:
pred = logreg.predict(test_pca)
sns.countplot(x=pred)

In [None]:
#submission = pd.DataFrame({'id': test_df_id, 'loss': pred})
#submission.to_csv('submission.csv', index= False)
submission = pd.read_csv('sampleSubmission.csv')
submission['loss'] = pred


In [None]:
submission.head()

In [None]:
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(X_train_pca .shape[1])

In [None]:
# LogisticRegression has no feature_importances_ attribute; use coefficient magnitudes instead
feature_importances += np.abs(logreg.coef_[0])

In [None]:
feature_importances = pd.DataFrame({'feature': list(X_train_pca.columns), 
                                    'importance':feature_importances}).sort_values('importance', ascending=False)
feature_importances

## Another way
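
This section compares models by k-fold cross-validation. When the classes are imbalanced, accuracy alone can look good even for a weak model, so it may help to also score with F1 and to use stratified folds. The following is only a sketch on synthetic data from `make_classification`. The 90/10 class split and the model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced data standing in for the loan features.
toy_X, toy_y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(class_weight="balanced", max_iter=1000)

acc = cross_val_score(model, toy_X, toy_y, cv=cv, scoring="accuracy")
f1 = cross_val_score(model, toy_X, toy_y, cv=cv, scoring="f1")
print(f"accuracy: {acc.mean():.3f}, F1: {f1.mean():.3f}")
```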

In [None]:
# Convert y to one-dimensional array (vector)
y_train = np.array(y_train).reshape((-1, ))
y_test = np.array(y_test).reshape((-1, ))

In [None]:
# # # Models to Evaluate

# We will compare five different machine learning classification models:

# 1 - Logistic Regression
# 2 - K-Nearest Neighbors Classification
# 3 - Support Vector Machine
# 4 - Naive Bayes
# 5 - Random Forest Classification

# Function to calculate the mean cross-validated accuracy
def cross_val(X_train, y_train, model):
    # Applying k-Fold Cross Validation
    from sklearn.model_selection import cross_val_score
    accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 5)
    return accuracies.mean()

# Takes in a model, trains the model, and evaluates the model on the test set
def fit_and_evaluate(model):
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions and evaluate
    model_pred = model.predict(X_test)
    model_cross = cross_val(X_train, y_train, model)
    
    # Return the performance metric
    return model_cross

In [None]:
# # Naive Bayes
from sklearn.naive_bayes import GaussianNB
naive = GaussianNB()
naive_cross = fit_and_evaluate(naive)

print('Naive Bayes performance on the training set: Cross Validation Score = %0.4f' % naive_cross)

In [None]:
# # Random Forest Classification
from sklearn.ensemble import RandomForestClassifier
random = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')
random_cross = fit_and_evaluate(random)

print('Random Forest performance on the training set: Cross Validation Score = %0.4f' % random_cross)

In [None]:
# # Gradient Boosting Classification (XGBoost)
from xgboost import XGBClassifier
gb = XGBClassifier()
gb_cross = fit_and_evaluate(gb)

print('Gradient Boosting Classification performance on the training set: Cross Validation Score = %0.4f' % gb_cross)