# <center> Predicting Client Gender </center>

### The task is to identify a client's gender from their historical transaction data. The quality metric is [ROC AUC](https://dyakonov.org/2017/07/28/auc-roc-%D0%BF%D0%BB%D0%BE%D1%89%D0%B0%D0%B4%D1%8C-%D0%BF%D0%BE%D0%B4-%D0%BA%D1%80%D0%B8%D0%B2%D0%BE%D0%B9-%D0%BE%D1%88%D0%B8%D0%B1%D0%BE%D0%BA/), which should be maximized.
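
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below implements that definition directly on made-up labels and scores. The helper name `pairwise_auc` is hypothetical; in practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy data, invented for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the metric does not depend on any particular decision threshold.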

## File Descriptions
- transactions.csv - historical transactions of bank clients
- gender.csv - gender information for some clients (null for test clients)
- tr_mcc_codes.csv - mcc codes of transactions
- tr_types.csv - types of transactions

## Field Descriptions
### transactions.csv
- client_id - client identifier
- tr_datetime - date and time of the transaction (days are numbered from the start of the data)
- mcc_code - mcc code of the transaction
- tr_type - type of transaction
- amount - transaction amount in conditional units; a "+" sign means funds credited to the client, a "-" sign means funds debited
- term_id - terminal identifier
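
The two less obvious fields above are the timestamp and the signed amount. The sketch below shows one way to split them into usable columns. It assumes the timestamp has the layout `<day> HH:MM:SS` (inferred from how the notebook later splits the value on a space and a colon); the sample rows are invented.

```python
import pandas as pd

# Invented sample rows; the "<day> HH:MM:SS" layout is an assumption.
sample = pd.DataFrame({
    "tr_datetime": ["0 10:23:26", "3 07:05:00", "152 23:59:59"],
    "amount": [-1500.0, 2200.5, -30.0],
})

# Split the day counter from the clock time, then take the hour.
parts = sample["tr_datetime"].str.split(" ", n=1, expand=True)
sample["day"] = parts[0].astype("int64")
sample["hour"] = parts[1].str.slice(0, 2).astype("int64")

# A negative amount is a debit, a non-negative amount is a credit.
sample["is_credit"] = (sample["amount"] >= 0).astype("int64")

print(sample[["day", "hour", "is_credit"]])
```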

### gender.csv
- client_id - client identifier
- gender - client gender (empty values for test clients)

### tr_mcc_codes.csv
- mcc_code - mcc code of the transaction
- mcc_description - description of the mcc code

### tr_types.csv
- tr_type - type of transaction
- tr_description - description of the transaction type

## Tasks:
- Develop a binary classification model to determine the client's gender. There are no restrictions on the model - it can be anything from KNN to transformers. The main goal is to achieve an ROC AUC above 77.5% on the hold-out test set.
- Interpret the model results: the importance of the variables included in it, demonstrating on several examples why the corresponding prediction was made. This will help understand which gender corresponds to which target (0/1). Again, there is complete freedom of choice of approaches! Useful keywords: gain, permutation importance, SHAP.
- Convert the results into a report without code (ideally - directly into [html](https://stackoverflow.com/questions/49907455/hide-code-when-exporting-jupyter-notebook-to-html))
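
Of the interpretation keywords listed above, permutation importance is the easiest to try without extra dependencies. The sketch below is not the model from this notebook: it uses synthetic data and a scikit-learn `RandomForestClassifier` as a stand-in, and all variable names are illustrative. It shuffles each feature on held-out data and reports the resulting drop in ROC AUC.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first of three features drives the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_fit, y_fit)

# Shuffle each column on held-out data and measure the drop in ROC AUC.
result = permutation_importance(
    model, X_val, y_val, scoring="roc_auc", n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

Because the importance is measured on held-out rows, a feature that the model merely memorized on the training set will not score highly.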

#### P.S. Don't forget about [PEP8](https://www.python.org/dev/peps/pep-0008/)!

In [None]:
%load_ext autoreload
import warnings
from operator import itemgetter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.ensemble import (
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    cross_val_score,
    cross_validate,
)
from sklearn.preprocessing import MinMaxScaler

# Optional imports, currently disabled:
# from sklearn.ensemble import BaggingClassifier, StackingClassifier, VotingClassifier
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.linear_model import LogisticRegressionCV
# from lightgbm import LGBMClassifier
# from xgboost import XGBClassifier

In [None]:
warnings.filterwarnings(action='ignore')

# Merging the datasets

In [None]:
tr_mcc_codes = pd.read_csv("data/mcc_codes.csv", sep=";", index_col="mcc_code")
tr_types = pd.read_csv("data/trans_types.csv", sep=";", index_col="trans_type")

transactions = pd.read_csv("data/transactions.csv", index_col="client_id")
test_gender = pd.read_csv("data/test.csv", index_col="client_id")
train_gender = pd.read_csv("data/train.csv", index_col="client_id")
gender = pd.concat([train_gender, test_gender]).drop('Unnamed: 0', axis=1)

del test_gender, train_gender

In [None]:
tr_mcc_codes.head(3)

In [None]:
tr_types.head(3)

In [None]:
transactions.head(3)

In [None]:
gender.head(3)

In [None]:
df = transactions
df = df.join(gender)
df = pd.merge(df, tr_types, left_on="trans_type", right_index=True)
df = pd.merge(df, tr_mcc_codes, left_on="mcc_code", right_index=True)
del transactions, tr_types, tr_mcc_codes, gender
df
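
Note that `pd.merge` performs an inner join by default, so transactions whose `trans_type` or `mcc_code` has no match in the lookup tables are silently dropped by the cell above. The sketch below, on invented miniature tables, contrasts the default with a left join and adds the `validate` argument to check that each code appears only once in the lookup table. Whether unmatched rows exist in the real data is not known from this notebook.

```python
import pandas as pd

# Invented miniature tables; column names follow the notebook.
tx = pd.DataFrame(
    {"mcc_code": [5411, 5411, 9999], "amount": [-10.0, -25.0, 40.0]},
    index=pd.Index([1, 1, 2], name="client_id"),
)
mcc = pd.DataFrame(
    {"mcc_description": ["Grocery stores"]},
    index=pd.Index([5411], name="mcc_code"),
)

# Default inner join: the row with the unknown code 9999 disappears.
inner = pd.merge(tx, mcc, left_on="mcc_code", right_index=True)

# Left join keeps every transaction; validate raises if a code is duplicated.
left = pd.merge(
    tx, mcc, left_on="mcc_code", right_index=True,
    how="left", validate="many_to_one",
)

print(len(tx), len(inner), len(left))
```

Comparing the row counts before and after each merge is a quick way to see whether any transactions were lost.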

In [None]:
df = df.loc[df['gender'].notna()]

In [None]:
df.head(3)

In [None]:
df.info()

In [None]:
def missing_features(data, column_set):
    """Print the count and share of missing values for each incomplete column."""
    missing_counts = data[list(column_set)].isna().sum()
    missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

    for feature, count in missing_counts.items():
        percentage = round(count / data.shape[0] * 100, 2)
        print(f'{feature} {count} ({percentage}%)')


missing_features(df, df.columns)

In [None]:
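# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(df['amount'], bins=20, kde=True, stat='density')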

# Correlation Matrix

In [None]:
df.columns

In [None]:
plt.figure(figsize=(20, 10))
sns.heatmap(
    df.select_dtypes(include=[np.number]).corr(),  # Select only numerical columns
    annot=True,
    fmt='.2f',
    # cmap='coolwarm'
)
plt.show()

# Feature Engineering

In [None]:
df

In [None]:
df.columns.tolist()

In [None]:
# 1 for credits (amount >= 0), 0 for debits (amount < 0)
df['earned'] = np.where(df['amount'] < 0, 0, 1)

In [None]:
# Split trans_time into the day number and the hour of the transaction
time_parts = df["trans_time"].str.split(' ', expand=True)
df['day'] = time_parts[0].astype("int64")
df['time'] = time_parts[1].str.split(':', expand=True)[0].astype("int64")

In [None]:
df['total'] = df.groupby(['client_id'])["amount"].sum().fillna(0)
df['total_spent'] = df.loc[df['earned'] == 0].groupby(['client_id'])["amount"].sum().fillna(0)
df['total_earned'] = df.loc[df['earned'] == 1].groupby(['client_id'])["amount"].sum().fillna(0)

In [None]:
df["avg_amount_per_day"] = df.groupby(['client_id','day'])["amount"].transform('mean').fillna(0)
df["var_amount_per_day"] = df.groupby(['client_id','day'])["amount"].transform('std').fillna(0)

In [None]:
df['avg_sum_per_transaction'] = df.groupby(['client_id'])['amount'].mean().fillna(0)
df['avg_sum_spent_per_transaction'] = df.loc[df['earned'] == 0].groupby(['client_id'])['amount'].mean().fillna(0)
df['avg_sum_earned_per_transaction'] = df.loc[df['earned'] == 1].groupby(['client_id'])['amount'].mean().fillna(0)

In [None]:
df['var_sum_per_trans'] = df.groupby(['client_id'])['amount'].std().fillna(0)
df['var_sum_spent_per_trans'] = df.loc[df['earned'] == 0].groupby(['client_id'])['amount'].std().fillna(0)
df['var_sum_earned_per_trans'] = df.loc[df['earned'] == 1].groupby(['client_id'])['amount'].std().fillna(0)

In [None]:
df['transactions_per_day'] = df.groupby([df.index, 'day'])["day"].transform("count").fillna(0)
# Split the daily count into credits and debits using the `earned` flag
# (1 = credit, 0 = debit) rather than the sign of the day number.
df['transactions_per_day_earned'] = df.groupby(['client_id', 'day'])['earned'].transform('sum')
df['transactions_per_day_spent'] = df['transactions_per_day'] - df['transactions_per_day_earned']

In [None]:
df['sum_per_day'] = df.groupby(['client_id', 'day'])['amount'].transform('sum').fillna(0)
df['sum_per_day_spent'] = df.loc[df['earned'] == 0].groupby(['client_id', 'day'])['amount'].sum().fillna(0)
df['sum_per_day_earned'] = df.loc[df['earned'] == 1].groupby(['client_id', 'day'])['amount'].sum().fillna(0)

In [None]:
df['terminal_unique'] = df.groupby(['client_id','term_id'])['term_id'].transform('count').fillna(0)

In [None]:
df["var_time_transaction"] = df.groupby(['client_id'])['time'].transform('std')

In [None]:

df['tr_unique_count'] = df.groupby(['client_id',"trans_type"])["trans_type"].transform('count').fillna(0)
df['tr_unique_sum'] = df.groupby(['client_id',"trans_type"])["trans_type"].transform('sum').fillna(0)
df['tr_unique_std'] = df.groupby(['client_id',"trans_type"])["trans_type"].transform('std').fillna(0)

In [None]:
df['mcc_unique_count'] = df.groupby(['client_id','mcc_code'])['mcc_code'].transform('count').fillna(0)
df['mcc_unique_sum'] = df.groupby(['client_id','mcc_code'])['mcc_code'].transform('sum').fillna(0)
df['mcc_unique_std'] = df.groupby(['client_id','mcc_code'])['mcc_code'].transform('std').fillna(0)

In [None]:
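df["total_amount_spend_to_earn"] = np.divide(df["total_spent"].abs(), df["total_earned"].abs())
# Cap infinite ratios (spending with zero earnings) at 1000; assign the result
# back instead of using inplace=True on a column selection
df["total_amount_spend_to_earn"] = df["total_amount_spend_to_earn"].replace(np.inf, 1000)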
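
The spend-to-earn ratio has two edge cases: a client with debits but no credits yields an infinite ratio, and a client with neither yields `NaN`. The sketch below walks through both on invented totals, using the same cap of 1000 and the same zero fill that this notebook applies (the fill happens later, on `X`).

```python
import numpy as np
import pandas as pd

# Invented per-client totals covering the regular case and both edge cases.
totals = pd.DataFrame({
    "total_spent": [-300.0, -50.0, 0.0],
    "total_earned": [100.0, 0.0, 0.0],
})
spent = totals["total_spent"].abs()
earned = totals["total_earned"].abs()

ratio = spent.div(earned)             # 3.0, inf (50 / 0), NaN (0 / 0)
ratio = ratio.replace(np.inf, 1000)   # cap "spends but never earns"
ratio = ratio.fillna(0)               # no activity in either direction

print(ratio.tolist())
```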

In [None]:
df = pd.get_dummies(data=df, columns=['trans_type', 'mcc_code'])

# Grouping by client_id

In [None]:
df

In [None]:
numeric_cols = df.select_dtypes(include=[np.number, np.bool_]).columns
df_customers = df.groupby(['client_id'])[numeric_cols].max()
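
The cell above collapses the transaction-level frame to one row per client by taking the column-wise maximum. An alternative, sketched here on invented data, is pandas named aggregation, which states the statistic for each output column explicitly. The column names in the result are illustrative and this is not the aggregation used for the results in this notebook.

```python
import pandas as pd

# Invented transactions for two clients.
tx = pd.DataFrame(
    {"amount": [-100.0, 250.0, -50.0, -20.0], "day": [0, 0, 1, 4]},
    index=pd.Index([1, 1, 1, 2], name="client_id"),
)

spent = tx["amount"].where(tx["amount"] < 0)     # NaN where not a debit
earned = tx["amount"].where(tx["amount"] >= 0)   # NaN where not a credit

per_client = (
    tx.assign(spent=spent, earned=earned)
    .groupby("client_id")
    .agg(
        total=("amount", "sum"),
        total_spent=("spent", "sum"),
        total_earned=("earned", "sum"),
        n_transactions=("amount", "size"),
        active_days=("day", "nunique"),
    )
)
print(per_client)
```

Summing a masked column treats the masked rows as absent, so a client without credits gets a total of 0 rather than `NaN`.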

In [None]:
y = df_customers['gender']
X = df_customers.drop(['gender', 'amount', 'earned'], axis = 1)

In [None]:
# Fill aggregates that are undefined for some clients with 0
zero_fill_cols = [
    'total_earned',
    'avg_sum_earned_per_transaction',
    'total_spent',
    'avg_sum_spent_per_transaction',
    'var_sum_earned_per_trans',
    'var_sum_spent_per_trans',
    'var_time_transaction',
    'total_amount_spend_to_earn',
]
X[zero_fill_cols] = X[zero_fill_cols].fillna(0)

In [None]:
missing_features(X, X.columns)

In [None]:
# drop the columns with missing values
X = X.drop(['sum_per_day_spent', 'sum_per_day_earned'], axis = 1)
missing_features(X, X.columns)

# Catboost

In [None]:
cb = CatBoostClassifier(depth=4, iterations=25, l2_leaf_reg=0, learning_rate=0.5)

In [None]:
cb.fit(X, y)

In [None]:
cv_split = KFold(n_splits = 4)

In [None]:
cv_results = cross_validate(cb, X, y, scoring='roc_auc', cv=cv_split, return_train_score=True)

In [None]:
cv_results

In [None]:
cv_results['test_score'].mean()

# Grid Search

In [None]:
# params = {
#     'iterations': [5,10,15,20,25,30],
#     'learning_rate': [0.5, 0.1, 0.05, 0.01],
#     'l2_leaf_reg': [0.5, 0.1, 0.05, 0.01],
#     'depth': [None,1,2,3,4,5],
#     'l2_leaf_reg': [0,0.1,0.01]
# }

# cb_cv = GridSearchCV(cb, param_grid=params, scoring='roc_auc', cv=5)

In [None]:
# cb_cv.fit(X, y)

In [None]:
# cb_cv.best_params_

# Feature importance

In [None]:
feature_importance = sorted(list(zip(X.columns,cb.feature_importances_.tolist())), key=itemgetter(1))
feature_importance

In [None]:
# remove all features with importance <= 0.01
important_features = [feature for feature, importance in feature_importance if importance > 0.01]
X = X[important_features]
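
The selection above uses importances from a model fitted on all rows, so the cross-validation scores computed afterwards may be slightly optimistic. One way to keep the selection inside each fold is a scikit-learn `Pipeline` with `SelectFromModel`. The sketch below uses synthetic data and random forests as stand-ins. Note that scikit-learn importances sum to 1 while CatBoost importances sum to 100 by default, so the threshold of 0.01 here is not equivalent to the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

pipeline = Pipeline([
    # Refit the selector on the training part of every fold.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=0.01,
    )),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

scores = cross_val_score(pipeline, X_demo, y_demo, scoring="roc_auc", cv=4)
print(scores.mean())
```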

In [None]:
params = {
    'iterations': [5, 10, 15, 20, 25, 30],
    'learning_rate': [0.5, 0.1, 0.05, 0.01],
    'l2_leaf_reg': [0, 0.1, 0.01],
    'depth': [None, 1, 2, 3, 4, 5],
}

cb_cv = GridSearchCV(cb, param_grid=params, scoring='roc_auc', cv=5)

In [None]:
cb_cv.fit(X, y)

In [None]:
cb_cv.best_params_

In [None]:
cb = CatBoostClassifier(depth=4, iterations=30, l2_leaf_reg=0.1, learning_rate=0.5)

In [None]:
cv_results = cross_validate(cb, X, y, scoring='roc_auc', cv=cv_split, return_train_score=True)

In [None]:
cv_results

In [None]:
cv_results['test_score'].mean()

# Random Forest

In [None]:
mm_scaler = MinMaxScaler()
scaled_X = mm_scaler.fit_transform(X)
# Keep the client_id index so scaled_X stays aligned with y
scaled_X = pd.DataFrame(scaled_X, columns=X.columns, index=X.index)

In [None]:
rf = RandomForestClassifier()

In [None]:
params = {
    'n_estimators': [5, 10, 15, 20, 25, 30, 100],
    'max_depth': [None, 1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 5, 10],
    'min_samples_leaf': [1, 3, 5, 10],
}

rf_cv = GridSearchCV(rf, param_grid=params, scoring='roc_auc', cv=5)

In [None]:
rf_cv.fit(scaled_X, y)

In [None]:
rf_cv.best_params_

In [None]:
rf = RandomForestClassifier(max_depth=None, min_samples_leaf=5, min_samples_split=10, n_estimators=100)

In [None]:
rf_results = cross_validate(rf, X, y, scoring='roc_auc', cv=cv_split, return_train_score=True)

In [None]:
rf_results

In [None]:
rf_results['test_score'].mean()