# The final project of the 'Machine Learning: Fundamentals and Applications' section


## Technical task


The purpose of this work is to study a dataset of anonymized marketing data from a real telecommunications company.

The task is performed as a competition on the Kaggle platform.

The main goal of the final project of the course is to predict which customers of the company may consider changing their service provider (known in marketing as "customer churn").

Thus, the task is to develop an effective predictive model that can process a large number of input features. The work should demonstrate the ability to handle a large and varied data set that includes both numerical and categorical features, while paying attention to class imbalance.


### 1. Data exploration


In [1]:
import pandas as pd

file_path = "./datasets/final_proj_data.csv"
raw_data = pd.read_csv(file_path)

raw_data.head()

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230,y
0,,,,,,812.0,14.0,,,,...,catzS2D,jySVZNlOJy,,xG3x,Aoh3,ZI9m,ib5G6X1eUxUn6,mj86,,0
1,,,,,,2688.0,7.0,,,,...,i06ocsg,LM8l689qOp,,kG3k,WqMG,RAYp,55YFVY9,mj86,,0
2,,,,,,1015.0,14.0,,,,...,P6pu4Vl,LM8l689qOp,,kG3k,Aoh3,ZI9m,R4y5gQQWY8OodqDV,am7c,,0
3,,,,,,168.0,0.0,,,,...,BNrD3Yd,LM8l689qOp,,,FSa2,RAYp,F2FyR07IdsN7I,,,0
4,,,,,,14.0,0.0,,,,...,3B1QowC,LM8l689qOp,,,WqMG,RAYp,F2FyR07IdsN7I,,,0


In [2]:
raw_data.shape

(10000, 231)

The data set consists of 10,000 rows and 231 columns; the last column is the target, which indicates whether the customer switches provider. A large number of gaps in some columns stands out. Some of these columns will have to be deleted entirely, since too many values are missing for the data to be restored.


### 2. Determination of types of features and existing gaps


In [3]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 231 entries, Var1 to y
dtypes: float64(191), int64(2), object(38)
memory usage: 17.6+ MB


The dataset includes 193 columns with numerical features, one of which is the target value, and 38 columns with categorical features. Let's see how much is left after cleaning the data.


In [4]:
missing_values = raw_data.isna().mean().sort_values(ascending=False)
missing_values.head(25)

Var20     1.0000
Var39     1.0000
Var32     1.0000
Var31     1.0000
Var8      1.0000
Var15     1.0000
Var42     1.0000
Var48     1.0000
Var52     1.0000
Var55     1.0000
Var79     1.0000
Var230    1.0000
Var175    1.0000
Var141    1.0000
Var167    1.0000
Var185    1.0000
Var169    1.0000
Var209    1.0000
Var118    0.9957
Var92     0.9957
Var190    0.9957
Var64     0.9954
Var45     0.9922
Var102    0.9918
Var98     0.9896
dtype: float64

As expected, there are many columns in which the share of missing values reaches 100%. Practice shows that columns with a large share of gaps (roughly a third or more) cease to be informative, and it is better to delete them. The cleaning step in section 3.1 applies a more lenient threshold of 50%.
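
To make the effect of the threshold concrete, here is a minimal sketch on a toy frame. The data and the helper `columns_above` are hypothetical and only illustrate how the set of dropped columns changes with the chosen limit.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): "a" is 75% missing, "b" 25%, "c" 0%
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan],
        "b": [1.0, 2.0, np.nan, 4.0],
        "c": [1.0, 2.0, 3.0, 4.0],
    }
)

# Share of missing values per column, computed the same way as in the cell above
missing_share = toy.isna().mean()


def columns_above(share: pd.Series, threshold: float) -> list:
    """Return the columns whose missing share is strictly above the threshold."""
    return share[share > threshold].index.tolist()


dropped_at_50 = columns_above(missing_share, 0.5)
dropped_at_20 = columns_above(missing_share, 0.2)
print(dropped_at_50, dropped_at_20)
```

A stricter threshold removes more columns, so the choice trades information kept against the amount of imputed data.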


In [5]:
categorical_columns = raw_data.select_dtypes(include=["object"]).columns
unique_values = raw_data[categorical_columns].nunique().sort_values()
unique_values

Var191       1
Var215       1
Var213       1
Var224       1
Var211       2
Var208       2
Var218       2
Var201       2
Var205       3
Var194       3
Var225       3
Var196       3
Var223       4
Var229       4
Var203       4
Var210       6
Var227       7
Var221       7
Var207      12
Var219      17
Var195      18
Var206      21
Var226      23
Var228      29
Var193      40
Var212      65
Var204     100
Var197     185
Var192     297
Var216     977
Var199    1850
Var222    2100
Var220    2100
Var198    2100
Var202    3802
Var200    4478
Var214    4478
Var217    5529
dtype: int64

As can be seen, some columns contain only one categorical value, so they carry no information for binary classification. Columns in which around 50% of the values are unique are also problematic: first, they overload the model once the categorical features are encoded; second, they most likely contain identifying information, such as names and phone numbers, which has no bearing on a client's desire to change provider. Such columns must be deleted; section 4.1 does this for every categorical column with more than 100 unique values.
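
The two checks described above can be sketched on a toy frame as follows. The column names and the 0.5 ratio are assumptions made for illustration; the notebook itself uses an absolute limit on the number of unique values.

```python
import pandas as pd

# Toy frame (hypothetical data): a constant column, a low-cardinality column,
# and an identifier-like column in which every value is unique
toy = pd.DataFrame(
    {
        "constant": ["x", "x", "x", "x", "x", "x"],
        "plan": ["a", "b", "a", "b", "a", "c"],
        "customer_id": ["id1", "id2", "id3", "id4", "id5", "id6"],
    }
)

summary = pd.DataFrame(
    {
        "n_unique": toy.nunique(),
        "unique_ratio": toy.nunique() / len(toy),
    }
)

# Columns with a single value, and columns where more than half the rows are unique
single_value = summary.index[summary["n_unique"] == 1].tolist()
mostly_unique = summary.index[summary["unique_ratio"] > 0.5].tolist()
print(summary)
print(single_value, mostly_unique)
```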


### 3. Data preparation for numerical columns


#### 3.1. Removing columns with a large number of NaN


In [6]:
nan_threshold = 0.5
columns_to_drop = missing_values[missing_values > nan_threshold].index
data_reduced = raw_data.drop(columns=columns_to_drop)
data_reduced.shape

(10000, 72)

In [7]:
data_reduced_shape = data_reduced.shape
remaining_missing_values = data_reduced.isna().mean().sort_values(ascending=False)
remaining_missing_values.head(10)

Var200    0.4957
Var214    0.4957
Var94     0.4386
Var72     0.4386
Var126    0.2780
Var24     0.1360
Var109    0.1360
Var149    0.1360
Var119    0.1020
Var206    0.1020
dtype: float64

#### 3.2. Filling in the NaN for numerical features


In [8]:
numerical_columns = data_reduced.select_dtypes(include=["float64", "int64"]).columns
data_reduced[numerical_columns] = data_reduced[numerical_columns].fillna(
    data_reduced[numerical_columns].mean()
)

missing_values_after_imputation = data_reduced[numerical_columns].isna().sum().sum()
missing_values_after_imputation

np.int64(0)

No NaN values remain in the numerical columns.


#### 3.3. Check for one-value numerical columns


In [9]:
single_value_numerical_columns = [
    col for col in numerical_columns if data_reduced[col].nunique() == 1
]

single_value_numerical_columns

[]

There are no single-value numerical columns.


### 4. Data preparation for categorical columns


#### 4.1. Removing columns with a large number of unique values and with only one value


In [10]:
unique_threshold = 100
new_categorical_columns = data_reduced.select_dtypes(include=["object"]).columns

single_value_categorical_columns = [
    col for col in new_categorical_columns if data_reduced[col].nunique() == 1
]
high_cardinality_categorical_columns = [
    col
    for col in new_categorical_columns
    if data_reduced[col].nunique() > unique_threshold
]

columns_to_drop_categorical = (
    high_cardinality_categorical_columns + single_value_categorical_columns
)
data_processed = data_reduced.drop(columns=columns_to_drop_categorical)

data_processed.shape

(10000, 61)

#### 4.2. Filling in the NaN for categorical features


In [11]:
categorical_columns_remaining = data_processed.select_dtypes(include=["object"]).columns
data_processed[categorical_columns_remaining] = data_processed[
    categorical_columns_remaining
].fillna(data_processed[categorical_columns_remaining].mode().iloc[0])

missing_values_categorical_after_imputation = (
    data_processed[categorical_columns_remaining].isnull().sum().sum()
)
missing_values_categorical_after_imputation

np.int64(0)

No NaN values remain in the categorical columns.


#### 4.3. Categorical features encoding


Low-cardinality categorical features (at most 15 unique values) are encoded with a one-hot encoder, and the remaining ones with a target encoder.


In [12]:
from sklearn.preprocessing import OneHotEncoder, TargetEncoder

low_cardinality_cats = [
    col for col in categorical_columns_remaining if data_processed[col].nunique() <= 15
]
moderate_cardinality_cats = [
    col for col in categorical_columns_remaining if data_processed[col].nunique() > 15
]

one_hot_encoder = OneHotEncoder(
    drop="if_binary", sparse_output=False, handle_unknown="ignore"
).set_output(transform="pandas")
one_hot_encoded_data = one_hot_encoder.fit_transform(
    data_processed[low_cardinality_cats]
)

target_encoder = TargetEncoder(random_state=42).set_output(transform="pandas")
target_encoded_data = target_encoder.fit_transform(
    data_processed[moderate_cardinality_cats], data_processed["y"]
)

data_encoded = data_processed.drop(
    columns=low_cardinality_cats + moderate_cardinality_cats
)
data_encoded = pd.concat(
    [data_encoded, one_hot_encoded_data, target_encoded_data], axis=1
)

data_encoded.shape

(10000, 99)
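
To illustrate what target encoding does, the sketch below replaces each category with a smoothed mean of the target. It is a simplified illustration with hypothetical data and a hypothetical `smoothing` value. scikit-learn's `TargetEncoder` chooses the amount of smoothing itself and uses cross-fitting in `fit_transform`, which this sketch does not reproduce.

```python
import pandas as pd

# Toy data (hypothetical): one categorical feature and a binary target
toy = pd.DataFrame(
    {
        "region": ["n", "n", "n", "s", "s", "w"],
        "y": [1, 0, 1, 0, 0, 1],
    }
)

global_mean = toy["y"].mean()
smoothing = 2.0  # hypothetical strength: acts like 2 extra rows at the global mean

# Per-category sum and count of the target
stats = toy.groupby("region")["y"].agg(["sum", "count"])

# Smoothed mean: rare categories are pulled towards the global mean
encoding = (stats["sum"] + smoothing * global_mean) / (stats["count"] + smoothing)

toy["region_encoded"] = toy["region"].map(encoding)
print(toy)
```

Unlike one-hot encoding, this produces a single numeric column per feature regardless of the number of categories, which is why it suits the higher-cardinality columns.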

#### 4.4. Drop outliers


A row is treated as an outlier if more than 20% of its features lie 3 or more standard deviations away from the column mean. On this data no row meets the criterion, so the shape stays (10000, 99).


In [13]:
from scipy.stats import zscore
import numpy as np

outliers = data_encoded.apply(lambda x: np.abs(zscore(x)).ge(3)).mean(1)

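# Use index labels rather than positions, so the drop stays correct
# even if the index is not the default RangeIndex
out_ind = outliers[outliers > 0.2].index

data_encoded.drop(index=out_ind, inplace=True)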
data_encoded.shape

(10000, 99)
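
The outlier rule can be checked on a small constructed example. The sketch below uses hypothetical data and plain NumPy in place of `scipy.stats.zscore`; both use the population standard deviation by default.

```python
import numpy as np

# Toy matrix (hypothetical data): 50 rows, 5 features alternating between +1 and -1
data = np.tile(np.array([[1.0], [-1.0]]), (25, 5))
data[0, [0, 1]] = 10.0  # row 0: extreme in 2 of 5 features (40%)
data[1, 2] = 10.0  # row 1: extreme in 1 of 5 features (20%)

# Column-wise z-scores with the population standard deviation (ddof=0)
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Share of features per row that lie 3 or more standard deviations from the mean
extreme_share = (np.abs(z) >= 3).mean(axis=1)

# Rows with strictly more than 20% extreme features are flagged
outlier_rows = np.where(extreme_share > 0.2)[0]
print(outlier_rows)
```

Row 1 sits exactly at 20% and is not flagged, because the comparison is strict.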

### 5. Class balancing


In [14]:
class_distribution = data_encoded["y"].value_counts(normalize=True)
class_distribution

y
0    0.8695
1    0.1305
Name: proportion, dtype: float64

The number of negative examples (the customer does not change provider) significantly exceeds the number of positive ones: 87% versus 13%. Because several of the algorithms to be tested are sensitive to class imbalance, the classes in the training set need to be balanced.


In [15]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = data_encoded.drop(columns=["y"])
y = data_encoded["y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

y_train.value_counts(normalize=True)

y
0    0.5
1    0.5
Name: proportion, dtype: float64
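
SMOTE balances the classes by adding synthetic minority-class rows. Its core idea can be sketched as interpolation between a minority sample and one of its nearest minority neighbours. The following is a simplified illustration with hypothetical data and a hypothetical helper `synthesize`, not the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [8.0, 8.0]])


def synthesize(points: np.ndarray, index: int, k: int, rng) -> np.ndarray:
    """Create one synthetic sample between points[index] and one of its k nearest neighbours."""
    base = points[index]
    distances = np.linalg.norm(points - base, axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it
    neighbour_ids = np.argsort(distances)[1 : k + 1]
    neighbour = points[rng.choice(neighbour_ids)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base)


synthetic = synthesize(minority, index=0, k=2, rng=rng)
print(synthetic)
```

Because the new point lies on a segment between two real minority samples, resampling only the training split (as done above) keeps synthetic rows out of the test set.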

### 6. Feature normalization and model construction


#### 6.1. Feature standardization


In [16]:
from sklearn.preprocessing import StandardScaler

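scaler = StandardScaler()
# Note: the scaler is fitted on the full X (original class proportions, test rows
# included), not on the resampled X_train. This is why the means and standard
# deviations reported below are not exactly 0 and 1, and it lets test-set
# statistics leak into preprocessing; fitting on X_train alone would avoid that.
X_standardized = scaler.fit_transform(X)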

X_train = pd.DataFrame(
    scaler.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

X_train.describe().round(2)

Unnamed: 0,Var6,Var7,Var13,Var21,Var22,Var24,Var25,Var28,Var35,Var38,...,Var227_nIGjgSB,Var227_vJ_w8kB,Var193,Var195,Var204,Var206,Var212,Var219,Var226,Var228
count,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,...,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0,13912.0
mean,-0.05,-0.18,-0.14,-0.02,-0.02,-0.03,-0.03,0.04,0.01,0.05,...,-0.01,-0.05,0.24,0.06,0.09,0.26,0.32,0.07,0.09,0.27
std,0.85,0.89,0.78,0.89,0.89,0.89,0.91,0.99,0.96,0.99,...,0.77,0.76,0.83,0.8,0.95,0.95,0.87,0.82,0.91,0.85
min,-0.59,-1.15,-0.49,-0.46,-0.45,-0.53,-0.48,-2.32,-0.25,-0.9,...,-0.02,-0.12,-1.68,-5.74,-2.43,-1.48,-1.3,-5.84,-2.06,-1.53
25%,-0.36,-1.15,-0.49,-0.23,-0.23,-0.53,-0.4,-0.42,-0.25,-0.88,...,-0.02,-0.12,0.61,0.17,-0.53,-0.14,-0.18,0.26,-0.59,0.23
50%,-0.2,0.0,-0.41,-0.16,-0.16,-0.21,-0.21,-0.0,-0.25,0.0,...,-0.02,-0.12,0.62,0.17,0.08,0.27,0.85,0.27,0.19,0.72
75%,0.0,0.02,0.0,-0.0,0.0,0.0,0.02,0.37,-0.25,0.65,...,-0.02,-0.12,0.63,0.18,0.57,0.58,0.87,0.28,0.78,0.73
max,33.18,4.71,35.98,39.04,39.01,24.17,34.73,22.73,20.22,5.39,...,40.81,8.16,11.17,38.2,3.83,2.08,1.75,0.31,2.53,10.15
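
The cell above fits the scaler on the full `X`, so its statistics include the test rows. A minimal sketch of the alternative, in which the mean and standard deviation come from the training split only, is shown below with hypothetical arrays and plain NumPy. It is meant as an illustration of the idea, not as the notebook's method.

```python
import numpy as np

# Hypothetical training and test splits with two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[5.0, 50.0]])

# Statistics are learned from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation, as StandardScaler uses

# The same statistics are applied to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

With this setup the training columns have mean 0 and standard deviation 1 by construction, while the test values are expressed relative to the training distribution.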


#### 6.2. Random Forest Classifier


In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score

rf_model = RandomForestClassifier(random_state=42)

rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

rf_balanced_accuracy = balanced_accuracy_score(y_test, rf_predictions)
rf_f1 = f1_score(y_test, rf_predictions, average="weighted")

rf_balanced_accuracy, rf_f1

(np.float64(0.7118714459139991), np.float64(0.8989325409013312))

#### 6.3. Gradient Boosting Classifier


In [18]:
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(random_state=42)

gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)

gb_balanced_accuracy = balanced_accuracy_score(y_test, gb_predictions)
gb_f1 = f1_score(y_test, gb_predictions, average="weighted")

gb_balanced_accuracy, gb_f1

(np.float64(0.8129193022810044), np.float64(0.9199318696534544))

The Gradient Boosting Classifier shows much better results on the current dataset (balanced accuracy 0.813 versus 0.712 for the Random Forest). Let's attempt to improve the score.
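
The two metrics react differently to class imbalance, which explains why the weighted F1 score stays high even when the balanced accuracy is modest. The sketch below computes both by hand on hypothetical labels; the helper functions are written for illustration and are not the scikit-learn implementations.

```python
def per_class_stats(y_true, y_pred):
    """Return {label: (precision, recall, support)} computed from raw labels."""
    stats = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        support = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        stats[label] = (precision, recall, support)
    return stats


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    stats = per_class_stats(y_true, y_pred)
    return sum(recall for _, recall, _ in stats.values()) / len(stats)


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights proportional to class support."""
    stats = per_class_stats(y_true, y_pred)
    score = 0.0
    for precision, recall, support in stats.values():
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / len(y_true)
    return score


# Hypothetical imbalanced labels: 8 negatives, 2 positives, one positive missed
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

print(balanced_accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Missing half of the positives costs a full 25 points of balanced accuracy here, while the weighted F1 drops much less because the majority class dominates the average.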


### 7. Data dimensionality reduction


#### 7.1. Extracting feature importances from the model


In [19]:
feature_importances = pd.DataFrame(
    {"Feature": X_train.columns, "Importance": gb_model.feature_importances_}
).sort_values(by="Importance", ascending=False)

feature_importances.head(14)

Unnamed: 0,Feature,Importance
28,Var126,0.290737
94,Var212,0.289003
71,Var218_cJvF,0.057611
34,Var144,0.057031
14,Var73,0.045182
48,Var205_09_Q,0.035295
50,Var205_sJzTlal,0.032788
49,Var205_VpdQ,0.030088
1,Var7,0.028791
76,Var221_oslk,0.021939


#### 7.2. Elimination of unimportant features


In [20]:
top_features = feature_importances["Feature"].head(54).values

X_train_top = X_train[top_features]
X_test_top = X_test[top_features]

gb_model_top = GradientBoostingClassifier(random_state=42)
gb_model_top.fit(X_train_top, y_train)

gb_predictions_top = gb_model_top.predict(X_test_top)
gb_balanced_accuracy_top = balanced_accuracy_score(y_test, gb_predictions_top)
gb_f1_weighted_top = f1_score(y_test, gb_predictions_top, average="weighted")

gb_balanced_accuracy_top, gb_f1_weighted_top

(np.float64(0.8142599679650303), np.float64(0.9196327975557692))

Keeping only the 54 most important features does not improve the scores (balanced accuracy 0.814 versus 0.813 with all 98 features), but it reduces the data dimensionality.
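
The number of features to keep is fixed by hand above. One possible alternative, shown here only as a sketch with hypothetical importances and a hypothetical `coverage` target, is to keep the smallest set of features whose cumulative importance reaches a chosen share.

```python
import numpy as np

# Hypothetical feature importances (they sum to 1, like feature_importances_)
importances = np.array([0.10, 0.40, 0.01, 0.30, 0.15, 0.04])
coverage = 0.90  # hypothetical target share of total importance

# Sort feature positions from most to least important
order = np.argsort(importances)[::-1]
cumulative = np.cumsum(importances[order])

# Smallest k whose cumulative importance reaches the target
k = int(np.argmax(cumulative >= coverage)) + 1
selected = order[:k]
print(k, selected)
```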


#### 7.3. Dimensionality reduction with PCA


In [21]:
from sklearn.decomposition import PCA

explained_variance_threshold = 0.85

pca = PCA(n_components=explained_variance_threshold, random_state=42)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

num_components = X_train_pca.shape[1]

gb_model_pca = GradientBoostingClassifier(random_state=42)
gb_model_pca.fit(X_train_pca, y_train)

gb_predictions_pca = gb_model_pca.predict(X_test_pca)
gb_balanced_accuracy_pca = balanced_accuracy_score(y_test, gb_predictions_pca)
gb_f1_weighted_pca = f1_score(y_test, gb_predictions_pca, average="weighted")

gb_balanced_accuracy_pca, gb_f1_weighted_pca

(np.float64(0.7680428043597523), np.float64(0.8128348457607095))

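Dimensionality reduction with PCA does not improve the model: both metrics are lower than without it (balanced accuracy 0.768 versus 0.813, weighted F1 0.813 versus 0.920).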
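
Passing a float between 0 and 1 as `n_components` asks PCA to keep just enough components to explain that share of the variance. The sketch below reproduces this selection rule with NumPy on hypothetical data; it illustrates the idea and is not the scikit-learn code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 4 features with very different variances
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 0.5, 0.1])

# PCA works on centered data; squared singular values are proportional
# to the variance explained by each component
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# Keep the smallest number of components whose cumulative share reaches 85%
cumulative = np.cumsum(explained_ratio)
n_components = int(np.argmax(cumulative >= 0.85)) + 1
print(explained_ratio.round(3), n_components)
```

The components are linear mixtures of all original features, which may be one reason why a tree-based model loses accuracy on them.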


### 8. Optimization of hyperparameters


In [22]:
from sklearn.model_selection import GridSearchCV

gb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

gb_model_opt = GradientBoostingClassifier(random_state=42)

grid_search_gb = GridSearchCV(
    estimator=gb_model_opt,
    param_grid=gb_param_grid,
    scoring='balanced_accuracy',
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid_search_gb.fit(X_train, y_train)

best_params_gb = grid_search_gb.best_params_
best_balanced_accuracy_gb = grid_search_gb.best_score_

best_params_gb, best_balanced_accuracy_gb
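
The cost of this search grows multiplicatively with every added parameter. A quick count for the grid defined above:

```python
from math import prod

# Same grid as in the cell above
gb_param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Number of parameter combinations and of cross-validation fits (cv=3)
n_candidates = prod(len(values) for values in gb_param_grid.values())
cv_folds = 3
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 486 1458
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the whole training set.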

Grid search selected the default settings as the best. Let's try to optimize the hyperparameters manually.


In [23]:
gb_model_best = GradientBoostingClassifier(
    random_state=42, max_depth=5, learning_rate=0.01, subsample=0.8
)

gb_model_best.fit(X_train_top, y_train)
gb_predictions_final = gb_model_best.predict(X_test_top)


gb_balanced_accuracy_final = balanced_accuracy_score(y_test, gb_predictions_final)
gb_f1_final = f1_score(y_test, gb_predictions_final, average="weighted")

gb_balanced_accuracy_final, gb_f1_final

(np.float64(0.8501472859506609), np.float64(0.864980426580023))

### 9. ML Pipeline


In [58]:
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.base import BaseEstimator, TransformerMixin


class CustomPreprocessorWithEncoding(BaseEstimator, TransformerMixin):
    def __init__(self, nan_threshold=0.5, unique_threshold=100):
        self.nan_threshold = nan_threshold
        self.unique_threshold = unique_threshold
        self.one_hot_encoder = None
        self.target_encoder = None
        self.columns_to_drop = None
        self.numerical_columns = None
        self.categorical_columns = None
        self.categorical_columns_to_drop = None
        self.remaining_categorical_columns = None

    def fit(self, X: pd.DataFrame, y=None):
        # Columns to drop based on nan values threshold
        self.columns_to_drop = X.columns[X.isna().mean() > self.nan_threshold].tolist()

        # Split numerical and categorical columns
        num_columns = X.select_dtypes(include=["float64", "int64"]).columns
        self.numerical_columns = [
            col for col in num_columns if col not in self.columns_to_drop
        ]
        cat_columns = X.select_dtypes(include=["object"]).columns
        self.categorical_columns = [
            col for col in cat_columns if col not in self.columns_to_drop
        ]

        # Columns to drop based on unique values threshold
        single_value_cats = [
            col for col in self.categorical_columns if X[col].nunique() == 1
        ]
        high_cardinality_cats = [
            col
            for col in self.categorical_columns
            if X[col].nunique() > self.unique_threshold
        ]
        self.categorical_columns_to_drop = single_value_cats + high_cardinality_cats
        self.remaining_categorical_columns = [
            col
            for col in self.categorical_columns
            if col not in self.categorical_columns_to_drop
        ]

        # Identify columns for encoding
        self.low_cardinality_cats = [
            col for col in self.remaining_categorical_columns if X[col].nunique() <= 15
        ]
        self.moderate_cardinality_cats = [
            col
            for col in self.remaining_categorical_columns
            if 15 < X[col].nunique() <= self.unique_threshold
        ]

        # Learn the imputation values on the training data so that transform()
        # reuses them instead of recomputing statistics on the data it receives
        self.numerical_fill_values = X[self.numerical_columns].mean()
        self.categorical_fill_values = (
            X[self.remaining_categorical_columns].mode().iloc[0]
        )
        X_cat = X[self.remaining_categorical_columns].fillna(
            self.categorical_fill_values
        )

        # Initialize encoders and fit them on the imputed categorical columns
        self.one_hot_encoder = OneHotEncoder(
            drop="if_binary", sparse_output=False, handle_unknown="ignore"
        ).set_output(transform="pandas")
        self.one_hot_encoder.fit(X_cat[self.low_cardinality_cats])

        self.target_encoder = TargetEncoder(random_state=42).set_output(
            transform="pandas"
        )
        if y is not None:
            self.target_encoder.fit(X_cat[self.moderate_cardinality_cats], y)

        return self

    def transform(self, X: pd.DataFrame):
        # Drop columns with excessive missing and unique values
        X = X.drop(
            columns=self.columns_to_drop + self.categorical_columns_to_drop,
            errors="ignore",
        )

        # Fill NaN values in numerical features with the training-set mean
        X[self.numerical_columns] = X[self.numerical_columns].fillna(
            self.numerical_fill_values
        )

        # Fill NaN values in categorical features with the training-set mode
        X[self.remaining_categorical_columns] = X[
            self.remaining_categorical_columns
        ].fillna(self.categorical_fill_values)

        # Apply one-hot encoding and target encoding
        one_hot_encoded_data = self.one_hot_encoder.transform(
            X[self.low_cardinality_cats]
        )

        remaining_moderate_cardinality_cats = [
            col for col in self.moderate_cardinality_cats if col in X.columns
        ]
        target_encoded_data = self.target_encoder.transform(
            X[remaining_moderate_cardinality_cats]
        )

        # Drop original categorical columns used in encoding and concatenate encoded columns
        X = X.drop(
            columns=self.low_cardinality_cats + remaining_moderate_cardinality_cats,
            errors="ignore",
        )
        X = pd.concat([X, one_hot_encoded_data, target_encoded_data], axis=1)

        return X


pipeline = ImbPipeline(
    [
        (
            "preprocess",
            CustomPreprocessorWithEncoding(nan_threshold=0.5, unique_threshold=100),
        ),
        ("smote", SMOTE(random_state=42)),
        ("scaler", StandardScaler()),
        (
            "classifier",
            GradientBoostingClassifier(
                random_state=42, max_depth=5, learning_rate=0.01, subsample=0.8
            ),
        ),
    ]
)

### 10. Retraining the model on the full data set


In [59]:
train_file_path = "./datasets/final_proj_data.csv"
train_data = pd.read_csv(train_file_path)
X_train = train_data.drop(columns=["y"])
y_train = train_data["y"]

test_file_path = "./datasets/final_proj_test.csv"
X_test = pd.read_csv(test_file_path)

pipeline.fit(X_train, y_train)
pipeline_predictions = pipeline.predict(X_test)



### 11. CSV with predictions

In [60]:
predictions_df = pd.DataFrame(
    {"index": range(len(pipeline_predictions)), "y": pipeline_predictions}
)

predictions_df.to_csv("./datasets/final_proj_submission.csv", index=False)

### 12. Conclusions

With the described approach, an accuracy of 85% was obtained on this data set on Kaggle. Further experimentation with feature encoding and the use of deep networks may improve the accuracy further.