## i. BUSINESS JOB DESCRIPTION

*   Our client is an e-commerce company (_All in one Place_), a multibrand outlet. Its profit comes from reselling large quantities of various products at low prices.

*   After collecting one year of data from its client database, the company's marketing team is analyzing whether it would be profitable to split the clients into distinct groups, in order to identify those who account for a larger share of the company's earnings.

*   The group of interest obtained from this clustering analysis will be called 'Insiders', and its members will be targeted as eligible for special loyalty program opportunities.

## ii. THE CHALLENGE

*   I was hired as a Data Science consultant to build a model capable of performing this clustering accurately.

*   With the solution, the marketing team can plan accordingly how to target groups of clients in order to optimize profits.

*   To understand the clients' behaviour, we have a database of sales transactions that specifies the products bought, their description, quantity and unit price, as well as general information about each client (Customer ID, country).
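
As a rough illustration of how such a transaction table could be turned into one row per client, here is a minimal sketch on made-up toy data. The column names (`customer_id`, `invoice_no`, `quantity`, `unit_price`, `country`) and the derived features are assumptions for illustration only; the real schema may differ, and `invoice_no` in particular is hypothetical.

```python
import pandas as pd

# Toy transactions; values and column names are illustrative assumptions.
sales = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'invoice_no':  ['A1', 'A2', 'B1', 'C1', 'C1', 'C2'],
    'quantity':    [2, 1, 10, 1, 3, 5],
    'unit_price':  [5.0, 20.0, 1.5, 100.0, 2.0, 4.0],
    'country':     ['UK', 'UK', 'France', 'UK', 'UK', 'UK'],
})

# Revenue of each transaction line
sales['gross_revenue'] = sales['quantity'] * sales['unit_price']

# One row per client: total spend, distinct invoices, items bought
customers = (
    sales.groupby('customer_id')
         .agg(gross_revenue=('gross_revenue', 'sum'),
              n_invoices=('invoice_no', 'nunique'),
              n_items=('quantity', 'sum'))
         .reset_index()
)
print(customers)
```

A client-level table like this is the kind of input a clustering model would typically work on, since the goal is to group clients rather than individual purchases.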

## iii. BUSINESS QUESTIONS

*   A report is expected as the result of the clustering analysis, answering the following questions:

    1.  Which clients are eligible to take part in the 'Insiders' group?
    
    2.  How many clients will be selected?

    3.  Which features have the greatest impact on the clustering of these clients?

    4.  What percentage of the company's total earnings comes from the 'Insiders' group?
    
    5.  What is the expected profit from the 'Insiders' group over the next months?

    6.  What are the main conditions that make a client eligible for 'Insiders'?

    7.  What are the conditions for a client to be excluded from 'Insiders'?

    8.  What guarantees that the 'Insiders' group brings more profit to the company than the rest of the database?

    9.  Which actions can the marketing team take to increase profits?
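
Questions 2 and 4 reduce to simple aggregations once every client carries a cluster label. The sketch below uses made-up data and assumes hypothetical columns `customer_id`, `gross_revenue` and `cluster`; the labels themselves would come from the clustering model, which is not shown here.

```python
import pandas as pd

# Toy client table with an already assigned cluster label (illustrative only).
customers = pd.DataFrame({
    'customer_id':   [1, 2, 3, 4, 5],
    'gross_revenue': [500.0, 50.0, 300.0, 100.0, 50.0],
    'cluster':       ['insiders', 'other', 'insiders', 'other', 'other'],
})

# Size and revenue of each cluster
summary = customers.groupby('cluster').agg(
    n_customers=('customer_id', 'count'),
    gross_revenue=('gross_revenue', 'sum'),
)

# Share of the client base and of total revenue held by each cluster
summary['pct_customers'] = 100 * summary['n_customers'] / summary['n_customers'].sum()
summary['pct_revenue'] = 100 * summary['gross_revenue'] / summary['gross_revenue'].sum()
print(summary)
```

In this toy example the 'insiders' cluster holds 40% of the clients and 80% of the revenue; the real figures depend entirely on the actual data and model.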


# 1.0 IMPORTS, FUNCTIONS AND DATABASE LOAD

## 1.1 Imports

In [None]:
import pandas       as pd
import numpy        as np
import seaborn      as sns
import scikitplot   as skplt
import xgboost      as xgb
import lightgbm     as lgbm
import inflection
import optuna
import warnings
import os

from matplotlib     import pyplot as plt
from collections    import Counter

from sklearn.preprocessing      import MinMaxScaler, StandardScaler
from sklearn.dummy              import DummyClassifier
from sklearn.ensemble           import RandomForestRegressor, ExtraTreesClassifier
from sklearn.linear_model       import LogisticRegression
from sklearn.neighbors          import KNeighborsClassifier
from sklearn.model_selection    import train_test_split, StratifiedKFold, KFold
from sklearn.metrics            import log_loss

from imblearn.ensemble          import BalancedRandomForestClassifier
from imblearn.pipeline          import Pipeline
from imblearn.combine           import SMOTEENN
from imblearn.under_sampling    import EditedNearestNeighbours

from optuna.integration     import XGBoostPruningCallback
from optuna.visualization   import plot_param_importances
from optuna.pruners         import MedianPruner

from IPython.display      import HTML, Image, display

## 1.2 Helper Functions

In [None]:
warnings.filterwarnings ('ignore')

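def jupyter_settings():
    """Apply notebook-wide plotting and pandas display settings."""
    # %pylab inline was dropped: it star-imports numpy and matplotlib into the
    # namespace, which is discouraged; %matplotlib inline is enough here.
    %matplotlib inline

    plt.style.use('bmh')
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24

    # Widen the notebook container and show all rows/columns of DataFrames
    display(HTML('<style>.container { width:90% !important; }</style>'))
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option('display.expand_frame_repr', False)

    # Note: sns.set() applies seaborn's default theme, which overrides parts of
    # the style and the font size set above; call it first if those should win.
    sns.set()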

In [None]:
jupyter_settings()

## 1.3 Database load

### 1.3.1 General Description

In [None]:
df_raw = pd.read_csv('')  # dataset path is left blank here; set it before running
df_raw.head()

In [None]:
df1 = df_raw.copy()

In [None]:
print(f'Number of Rows: {df1.shape[0]}')
print(f'Number of Columns: {df1.shape[1]}')

### 1.3.2 Data Typification

*** QUALITATIVE DESCRIPTION OF EACH COLUMN ***

In [None]:
df1.info()

In [None]:
# Take the current column names and convert them to snake_case
cols_old = list(df1.columns)

cols_new = [inflection.underscore(col) for col in cols_old]

df1.columns = cols_new

In [None]:
df1.columns

*** CHECK THAT THE DATA TYPE OF EACH COLUMN IS CORRECT ***
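
The type check noted above could start along these lines. This is a minimal sketch on made-up data: the columns `customer_id`, `unit_price` and `country`, and the target types chosen for them, are assumptions rather than the actual schema.

```python
import pandas as pd

# Toy frame with typical type problems (illustrative values only):
# an ID read as float because of a missing value, and a price read as text.
raw = pd.DataFrame({
    'customer_id': [17850.0, 13047.0, None],
    'unit_price':  ['2.55', '3.39', 'n/a'],
    'country':     ['United Kingdom', 'France', 'France'],
})

typed = raw.copy()

# Nullable integer keeps missing IDs as <NA> instead of forcing floats
typed['customer_id'] = typed['customer_id'].astype('Int64')

# Unparseable prices become NaN so they can be inspected later
typed['unit_price'] = pd.to_numeric(typed['unit_price'], errors='coerce')

# Low-cardinality text columns can be stored as categories
typed['country'] = typed['country'].astype('category')

print(typed.dtypes)
```

Using `errors='coerce'` hides bad values behind NaN, so it is worth counting the new missing values right after the conversion.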

### 1.3.3 Missing Data Treatment

In [None]:
df1.isna().sum()

*** CHECK SUITABLE STRATEGIES FOR FILLING IN MISSING DATA ***
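
Two common options for the missing-data step are sketched below on made-up data: dropping rows that lack a key field, or filling a descriptive field with a placeholder. The column names and the choice of which column gets which treatment are assumptions; the right strategy depends on what the real data shows.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (illustrative values only)
df = pd.DataFrame({
    'customer_id': [1.0, np.nan, 3.0, 4.0],
    'description': ['mug', None, 'lamp', 'clock'],
    'quantity':    [2, 5, 1, 3],
})

# Percentage of missing values per column, worst first
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Option 1: drop rows that cannot be attributed to a client
df_drop = df.dropna(subset=['customer_id'])

# Option 2: keep the rows and fill a descriptive field with a placeholder
df_fill = df.copy()
df_fill['description'] = df_fill['description'].fillna('unknown')
```

Dropping is the simpler choice when the missing field is the clustering key itself, at the cost of losing the revenue recorded on those rows.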

## 1.4 Numerical Data Description

In [None]:
# Data Selection
df1_num = df1.select_dtypes(include=['int64', 'float64'])

#Central Tendency
ct1 = pd.DataFrame(df1_num.apply(np.mean)).T
ct2 = pd.DataFrame(df1_num.apply(np.median)).T

#Dispersion
d1 = pd.DataFrame(df1_num.apply(min)).T
d2 = pd.DataFrame(df1_num.apply(max)).T
d3 = pd.DataFrame(df1_num.apply(lambda x: x.max() - x.min())).T
d4 = pd.DataFrame(df1_num.apply(np.std)).T
d5 = pd.DataFrame(df1_num.apply(lambda x: x.skew())).T
d6 = pd.DataFrame(df1_num.apply(lambda x: x.kurtosis())).T

df_descript = pd.concat([d1, d2, d3, ct1, ct2, d4, d5, d6]).T
df_descript.columns = ['min','max', 'range', 'average', 'median', 'std', 'skew', 'kurtosis']
df_descript

*** NOTE OBSERVATIONS ABOUT THE OBSERVED DISTRIBUTIONS AND POSSIBLE INSIGHTS ***

## 1.5 Categorical Data Description

*** NOTE OBSERVATIONS ABOUT THE OBSERVED DISTRIBUTIONS AND POSSIBLE INSIGHTS ***
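
This section has no code yet. A possible starting point, mirroring the numerical summary, is sketched below on made-up data: it counts distinct values and reports the most frequent value of each non-numeric column. The column names are assumptions.

```python
import pandas as pd

# Toy frame (illustrative values only)
df = pd.DataFrame({
    'country':     ['UK', 'UK', 'France', 'UK', 'Germany'],
    'description': ['mug', 'lamp', 'mug', 'clock', 'mug'],
    'quantity':    [2, 1, 6, 1, 3],
})

# Keep only the non-numeric columns
df_cat = df.select_dtypes(exclude=['number'])

# Cardinality, most frequent value and its share of rows, per column
cat_summary = pd.DataFrame({
    'n_unique': df_cat.nunique(),
    'top':      df_cat.mode().iloc[0],
    'top_pct':  df_cat.apply(lambda s: 100 * s.value_counts(normalize=True).iloc[0]),
})
print(cat_summary)
```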

# 2.0 DATA PREPARATION

## 2.1 Hypothesis Creation

In [None]:
df2 = df1.copy()

### 2.1.1 Hypothesis Mindmap

### 2.1.2 Created Hypotheses

## 2.2 EDA and Feature Engineering

### 2.2.1 Univariate Analysis

In [None]:
df2_num = df2.select_dtypes(include=['int64', 'float64'])

df2_num.hist(bins=25);

*** Check the behaviour of the distributions/outliers ***
*** Check the behaviour of the categorical variables ***
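
For the outlier check, one widely used rule of thumb is the 1.5 × IQR fence. The sketch below applies it to a made-up `quantity` series; both the data and the choice of this rule are assumptions, and other criteria may suit the real distributions better.

```python
import pandas as pd

# Toy values with one obviously extreme purchase (illustrative only)
quantity = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 80], name='quantity')

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = quantity.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers
outliers = quantity[(quantity < lower) | (quantity > upper)]
print(outliers)
```

Flagged values are only candidates: in sales data a very large order may be a genuine wholesale purchase rather than an error.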

### 2.2.2 Bivariate Analysis

*** Hypothesis Validation / Feature Engineering / Heatmap ***
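
The heatmap mentioned above is usually drawn from a correlation matrix. The sketch below only computes that matrix, on made-up client-level features whose names (`gross_revenue`, `n_invoices`, `avg_ticket`) are hypothetical.

```python
import pandas as pd

# Toy client-level features (illustrative values only)
customers = pd.DataFrame({
    'gross_revenue': [100.0, 200.0, 300.0, 400.0],
    'n_invoices':    [1, 2, 3, 4],
    'avg_ticket':    [100.0, 90.0, 120.0, 80.0],
})

# Pairwise Pearson correlation between the numeric features
corr = customers.corr(method='pearson')
print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would render it as an annotated heatmap inside the notebook.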

## 2.3 Data Preparation

## 2.4 Feature Selection

# 3.0 MODELING

# 4.0 RESULTS VALIDATION

# 5.0 MODEL IMPLEMENTATION