# 4th_hometask

Homework for the lesson.

The dataset contains records of a bank's telephone calls; the task is to predict whether a client will agree to further cooperation.

Bring together everything we covered in all the lessons. Analyze and preprocess the data.

Write down your observations and hypotheses. Train a logistic regression model.


Dataset description:
1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

related with the last contact of the current campaign:

8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

other attributes:

11 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

12 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

13 - previous: number of contacts performed before this campaign and for this client (numeric)

14 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

social and economic context attributes:

15 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

16 - cons.price.idx: consumer price index - monthly indicator (numeric)

17 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

18 - euribor3m: euribor 3 month rate - daily indicator (numeric)

19 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

20 - y - has the client subscribed a term deposit? (binary: 'yes','no')

### Import Section

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

### Function "Reduce Memory Usage"

In [2]:
def reduce_memory_usage(df):
    
    # Fail early if the input is not a DataFrame
    if not isinstance(df, pd.DataFrame):
        raise TypeError('df must be a pandas DataFrame')
    
    initial_memory_usage = df.memory_usage().sum() / 1024 / 1024
    print(f'Initial memory usage of dataframe:\t{initial_memory_usage:.3} Mb')
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            
            # The floats could be downcast further, to float16, but the data analysis
            # community reports that this dtype is poorly supported by some libraries
            
            if str(col_type)[:5] == 'float':
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                elif c_min > np.finfo(np.float64).min and c_max < np.finfo(np.float64).max:
                    df[col] = df[col].astype(np.float64)
        
        else:
            df[col] = df[col].astype('category')
    
    final_memory_usage = df.memory_usage().sum() / 1024 / 1024
    print(f'Final memory usage of dataframe:\t{final_memory_usage:.3} Mb')
    
    comparison = np.round(100 * (initial_memory_usage - final_memory_usage) / initial_memory_usage, 3)
    print(f'Memory usage has been decreased by:\t{comparison} %')
    
    return df
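
For comparison, pandas has a built-in downcast that picks the smallest numeric dtype able to hold a column's values. The sketch below is an alternative under stated assumptions, not the function above: the helper name `downcast_numeric` and the toy frame are illustrative only.

```python
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns downcast and text columns as category.

    Hypothetical helper for illustration; it relies on pd.to_numeric(downcast=...).
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # downcast='float' never goes below float32
            out[col] = pd.to_numeric(out[col], downcast='float')
        elif not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].astype('category')
    return out


# Toy values only, chosen to resemble a few columns of the dataset
toy = pd.DataFrame({
    'age': [17, 56, 98],
    'pdays': [0, 6, 999],
    'euribor3m': [0.634, 4.857, 5.045],
    'job': ['admin.', 'services', 'admin.'],
})
small = downcast_numeric(toy)
print(small.dtypes)
```

Unlike the hand-written bounds checks, this version returns a copy and leaves the input frame untouched.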

### Path Section

In [3]:
PATH_DATA = r'bank-direct-marketing-campaigns.csv'

## Exploratory Data Analysis

In [4]:
df_data = pd.read_csv(PATH_DATA)

In [5]:
df_data = df_data.drop_duplicates().copy()

In [6]:
df_data.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [7]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39404 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             39404 non-null  int64  
 1   job             39404 non-null  object 
 2   marital         39404 non-null  object 
 3   education       39404 non-null  object 
 4   default         39404 non-null  object 
 5   housing         39404 non-null  object 
 6   loan            39404 non-null  object 
 7   contact         39404 non-null  object 
 8   month           39404 non-null  object 
 9   day_of_week     39404 non-null  object 
 10  campaign        39404 non-null  int64  
 11  pdays           39404 non-null  int64  
 12  previous        39404 non-null  int64  
 13  poutcome        39404 non-null  object 
 14  emp.var.rate    39404 non-null  float64
 15  cons.price.idx  39404 non-null  float64
 16  cons.conf.idx   39404 non-null  float64
 17  euribor3m       39404 non-null 

In [8]:
df_data.describe()

| | age | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|
| count | 39404.0 | 39404.0 | 39404.0 | 39404.0 | 39404.0 | 39404.0 | 39404.0 | 39404.0 | 39404.0 |
| mean | 40.116105 | 2.618744 | 960.847097 | 0.178738 | 0.064067 | 93.577538 | -40.499604 | 3.601243 | 5165.986481 |
| std | 10.460328 | 2.81478 | 190.869184 | 0.503172 | 1.577041 | 0.58382 | 4.644327 | 1.742337 | 72.763866 |
| min | 17.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.334 | 5099.1 |
| 50% | 38.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.798 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |
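
The `pdays` statistics above are dominated by the value 999, which the dataset description defines as "client was not previously contacted" rather than a real day count. One possible way to handle such a sentinel is sketched below on toy values; the column names `contacted_before` and `pdays_clean` are made up for this example and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy values only; 999 is the documented "not previously contacted" marker
toy = pd.DataFrame({'pdays': [999, 3, 999, 6, 999]})

# Binary flag: was the client contacted in a previous campaign?
toy['contacted_before'] = (toy['pdays'] != 999).astype(int)

# Replace the sentinel with NaN so summary statistics ignore it
toy['pdays_clean'] = toy['pdays'].replace(999, np.nan)

print(toy)
print(toy['pdays_clean'].mean())  # mean over the real day counts only
```

`LogisticRegression` does not accept NaN, so the cleaned column would still need to be filled or dropped before modelling; the flag column can be used as is.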


## Data Preprocessing

##### To train the model, we select the most important features:
##### - features that describe the client personally;
##### - features that carry financial/banking information;
##### - features whose values show strong contrast (by this criterion, "month" and "day_of_week", whose values are distributed fairly evenly, can be set aside for now).

In [9]:
features = ['age', 'job', 'marital', 'education', 'default', 'loan', 'campaign', 'pdays', 'previous', 'emp.var.rate',
            'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'poutcome', 'y']
df_data_preprocessed = df_data[features].copy()

In [10]:
df_data_preprocessed['y'] = (df_data_preprocessed['y'] == 'yes').astype(int)

In [11]:
ohe_job = OneHotEncoder()

df_job = ohe_job.fit_transform(df_data_preprocessed[['job']])
df_job = pd.DataFrame(df_job.toarray(), columns=ohe_job.categories_[0])
df_job.index = df_data_preprocessed.index
df_job = df_job.drop(columns=['unknown']).copy()

df_data_preprocessed = df_data_preprocessed.join(df_job)
df_data_preprocessed = df_data_preprocessed.drop(columns=['job']).copy()

In [12]:
ohe_marital = OneHotEncoder()

df_marital = ohe_marital.fit_transform(df_data_preprocessed[['marital']])
df_marital = pd.DataFrame(df_marital.toarray(), columns=ohe_marital.categories_[0])
df_marital.index = df_data_preprocessed.index
df_marital = df_marital.drop(columns=['unknown']).copy()

df_data_preprocessed = df_data_preprocessed.join(df_marital)
df_data_preprocessed = df_data_preprocessed.drop(columns=['marital']).copy()

In [13]:
ohe_education = OneHotEncoder()

df_education = ohe_education.fit_transform(df_data_preprocessed[['education']])
df_education = pd.DataFrame(df_education.toarray(), columns=ohe_education.categories_[0])
df_education.index = df_data_preprocessed.index
df_education = df_education.drop(columns=['unknown']).copy()

df_data_preprocessed = df_data_preprocessed.join(df_education)
df_data_preprocessed = df_data_preprocessed.drop(columns=['education']).copy()

In [14]:
ohe_default = OneHotEncoder()

df_default = ohe_default.fit_transform(df_data_preprocessed[['default']])
df_default = pd.DataFrame(df_default.toarray(), columns=ohe_default.categories_[0])
df_default.index = df_data_preprocessed.index
df_default = df_default.drop(columns=['unknown']).copy()
df_default = df_default.add_prefix('default_').copy()

df_data_preprocessed = df_data_preprocessed.join(df_default)
df_data_preprocessed = df_data_preprocessed.drop(columns=['default']).copy()

In [15]:
ohe_loan = OneHotEncoder()

df_loan = ohe_loan.fit_transform(df_data_preprocessed[['loan']])
df_loan = pd.DataFrame(df_loan.toarray(), columns=ohe_loan.categories_[0])
df_loan.index = df_data_preprocessed.index
df_loan = df_loan.drop(columns=['unknown']).copy()
df_loan = df_loan.add_prefix('loan_').copy()

df_data_preprocessed = df_data_preprocessed.join(df_loan)
df_data_preprocessed = df_data_preprocessed.drop(columns=['loan']).copy()

In [16]:
ohe_poutcome = OneHotEncoder()

df_poutcome = ohe_poutcome.fit_transform(df_data_preprocessed[['poutcome']])
df_poutcome = pd.DataFrame(df_poutcome.toarray(), columns=ohe_poutcome.categories_[0])
df_poutcome.index = df_data_preprocessed.index
df_poutcome = df_poutcome.drop(columns=['nonexistent']).copy()

df_data_preprocessed = df_data_preprocessed.join(df_poutcome)
df_data_preprocessed = df_data_preprocessed.drop(columns=['poutcome']).copy()
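
The six encoding cells above repeat the same steps: encode a column, drop one level, join the result, and drop the source column. A compact variant of that pattern might look like the sketch below. It uses `pd.get_dummies` instead of `OneHotEncoder`, the helper name `one_hot_drop` is hypothetical, and every new column gets the source column name as a prefix, so the resulting names differ from those in the notebook.

```python
import pandas as pd


def one_hot_drop(df, column, drop_value):
    """One-hot encode `column`, dropping the `drop_value` level as the baseline.

    Hypothetical helper; every new column is prefixed with the source column name.
    """
    dummies = pd.get_dummies(df[column], prefix=column, dtype=float)
    dummies = dummies.drop(columns=[f'{column}_{drop_value}'])
    return df.drop(columns=[column]).join(dummies)


# Toy rows only, using category values from the dataset description
toy = pd.DataFrame({
    'age': [56, 37, 40],
    'marital': ['married', 'single', 'unknown'],
    'poutcome': ['nonexistent', 'failure', 'success'],
})

for column, baseline in [('marital', 'unknown'), ('poutcome', 'nonexistent')]:
    toy = one_hot_drop(toy, column, baseline)

print(toy.columns.tolist())
```

Prefixing every column avoids name clashes between features that share a level such as 'unknown' or 'yes'.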

In [17]:
df_data_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39404 entries, 0 to 41187
Data columns (total 37 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  39404 non-null  int64  
 1   campaign             39404 non-null  int64  
 2   pdays                39404 non-null  int64  
 3   previous             39404 non-null  int64  
 4   emp.var.rate         39404 non-null  float64
 5   cons.price.idx       39404 non-null  float64
 6   cons.conf.idx        39404 non-null  float64
 7   euribor3m            39404 non-null  float64
 8   nr.employed          39404 non-null  float64
 9   y                    39404 non-null  int32  
 10  admin.               39404 non-null  float64
 11  blue-collar          39404 non-null  float64
 12  entrepreneur         39404 non-null  float64
 13  housemaid            39404 non-null  float64
 14  management           39404 non-null  float64
 15  retired              39404 non-null 

##### Let us look at the "month" and "day_of_week" features separately and in more detail:

In [18]:
df_month = df_data[['month', 'y']].copy()
series_month = df_month[df_month['y'] == 'yes']['month'].value_counts()
series_month

may    882
aug    651
jul    645
jun    554
apr    532
nov    411
oct    312
mar    268
sep    255
dec     88
Name: month, dtype: int64

In [19]:
df_day_of_week = df_data[['day_of_week', 'y']].copy()
series_day_of_week = df_day_of_week[df_day_of_week['y'] == 'yes']['day_of_week'].value_counts()
series_day_of_week

thu    1037
tue     945
wed     934
mon     842
fri     840
Name: day_of_week, dtype: int64

##### The counts above suggest that "month" and "day_of_week" do affect the target variable. We encode each category as its share of all positive responses:

In [20]:
series_month = np.round(series_month / series_month.sum(), 4)
series_month

may    0.1918
aug    0.1416
jul    0.1403
jun    0.1205
apr    0.1157
nov    0.0894
oct    0.0679
mar    0.0583
sep    0.0555
dec    0.0191
Name: month, dtype: float64

In [21]:
series_day_of_week = np.round(series_day_of_week / series_day_of_week.sum(), 4)
series_day_of_week

thu    0.2255
tue    0.2055
wed    0.2031
mon    0.1831
fri    0.1827
Name: day_of_week, dtype: float64

##### Next, we add the resulting values to the dataset as new features.

In [22]:
df_month = pd.DataFrame(df_data['month'], columns=['month'])
df_month.index = df_data.index
series_month = pd.DataFrame(series_month).copy()
series_month = series_month.add_suffix('_p').copy()
series_month['month'] = series_month.index
df_month = df_month.merge(series_month, on='month', how='left').copy()
df_month = df_month.drop(columns=['month'])
df_month.index = df_data_preprocessed.index

df_data_preprocessed = df_data_preprocessed.join(df_month).copy()
df_month.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39404 entries, 0 to 41187
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   month_p  39404 non-null  float64
dtypes: float64(1)
memory usage: 1.6 MB


In [23]:
df_day_of_week = pd.DataFrame(df_data['day_of_week'], columns=['day_of_week'])
df_day_of_week.index = df_data.index
series_day_of_week = pd.DataFrame(series_day_of_week).copy()
series_day_of_week = series_day_of_week.add_suffix('_p').copy()
series_day_of_week['day_of_week'] = series_day_of_week.index
df_day_of_week = df_day_of_week.merge(series_day_of_week, on='day_of_week', how='left').copy()
df_day_of_week = df_day_of_week.drop(columns=['day_of_week'])
df_day_of_week.index = df_data_preprocessed.index

df_data_preprocessed = df_data_preprocessed.join(df_day_of_week).copy()
df_day_of_week.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39404 entries, 0 to 41187
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   day_of_week_p  39404 non-null  float64
dtypes: float64(1)
memory usage: 1.6 MB
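
The merge-based cells above attach one value per category to every row. A shorter way to express the same lookup, assuming the per-category values are kept as a Series indexed by category label, is `Series.map`. The names `months`, `month_share` and `month_p` below are toy stand-ins.

```python
import pandas as pd

# Toy stand-ins; in the notebook these would be df_data['month'] and series_month
months = pd.Series(['may', 'aug', 'may', 'dec'], index=[0, 1, 3, 7], name='month')
month_share = pd.Series({'may': 0.1918, 'aug': 0.1416, 'dec': 0.0191})

# map looks each label up in month_share and keeps the original index
month_p = months.map(month_share).rename('month_p')
print(month_p)
```

Because `map` preserves the index, there is no need to reassign it after the lookup, which matters here since dropping duplicates left gaps in the index.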


In [24]:
df_data_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39404 entries, 0 to 41187
Data columns (total 39 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  39404 non-null  int64  
 1   campaign             39404 non-null  int64  
 2   pdays                39404 non-null  int64  
 3   previous             39404 non-null  int64  
 4   emp.var.rate         39404 non-null  float64
 5   cons.price.idx       39404 non-null  float64
 6   cons.conf.idx        39404 non-null  float64
 7   euribor3m            39404 non-null  float64
 8   nr.employed          39404 non-null  float64
 9   y                    39404 non-null  int32  
 10  admin.               39404 non-null  float64
 11  blue-collar          39404 non-null  float64
 12  entrepreneur         39404 non-null  float64
 13  housemaid            39404 non-null  float64
 14  management           39404 non-null  float64
 15  retired              39404 non-null 

##### We standardize the banking features "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m" and "nr.employed" to make the model easier to train, since some of their values lie very close to one another.

In [25]:
ss = StandardScaler()

# Columns to standardize (zero mean, unit variance)
scaled_columns = ["emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"]

df_data_preprocessed[scaled_columns] = ss.fit_transform(df_data_preprocessed[scaled_columns])
df_data_preprocessed[scaled_columns].describe()

| | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|
| count | 39404.0 | 39404.0 | 39404.0 | 39404.0 | 39404.0 |
| mean | 1.846502e-16 | -8.286179e-15 | -5.654913e-16 | 4.616256e-16 | 3.185216e-15 |
| std | 1.000013 | 1.000013 | 1.000013 | 1.000013 | 1.000013 |
| min | -2.196589 | -2.357841 | -2.217873 | -1.703047 | -2.78145 |
| 25% | -1.182018 | -0.8607866 | -0.4737875 | -1.301282 | -0.9192383 |
| 50% | 0.6568924 | 0.3776238 | -0.2800002 | 0.7207407 | 0.3437673 |
| 75% | 0.8471245 | 0.7133478 | 0.8827234 | 0.7804314 | 0.8536422 |
| max | 0.8471245 | 2.037402 | 2.928256 | 0.8286431 | 0.8536422 |


In [26]:
df_data_preprocessed = reduce_memory_usage(df_data_preprocessed)
df_data_preprocessed.info()

Initial memory usage of dataframe:	12.9 Mb
Final memory usage of dataframe:	6.64 Mb
Memory usage has been decreased by:	48.422 %
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39404 entries, 0 to 41187
Data columns (total 39 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  39404 non-null  int8   
 1   campaign             39404 non-null  int8   
 2   pdays                39404 non-null  int16  
 3   previous             39404 non-null  int8   
 4   emp.var.rate         39404 non-null  float32
 5   cons.price.idx       39404 non-null  float32
 6   cons.conf.idx        39404 non-null  float32
 7   euribor3m            39404 non-null  float32
 8   nr.employed          39404 non-null  float32
 9   y                    39404 non-null  int8   
 10  admin.               39404 non-null  float32
 11  blue-collar          39404 non-null  float32
 12  entrepreneur         39404 non-null  float32
 13  housema

## Model Building

In [27]:
X = df_data_preprocessed.drop(columns=['y']).copy()
y = df_data_preprocessed['y'].copy()

In [28]:
y.value_counts()

0    34806
1     4598
Name: y, dtype: int64

##### The class imbalance has to be taken into account.
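
With only about 12% positives (4598 of 39404 rows), a hold-out split is usually stratified so that both parts keep the same class ratio. The cells that follow fit and score on the same data; the sketch below shows how a stratified split could be set up, using synthetic arrays `X_toy` and `y_toy` that are assumptions of this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with roughly the notebook's class ratio (12% positives)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 120 + [0] * 880)

# stratify=y_toy keeps the share of positives equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print(y_train.mean(), y_test.mean())
```

Scoring on `X_test` rather than on the training rows gives a less optimistic estimate of precision and recall.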

##### Class balancing done by the model (class_weight='balanced'):

In [29]:
%%time

lr = LogisticRegression(penalty='l2', class_weight='balanced')

lr.fit(X, y)
y_pred = lr.predict(X)
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.76      0.85     34806
           1       0.28      0.69      0.40      4598

    accuracy                           0.76     39404
   macro avg       0.61      0.73      0.62     39404
weighted avg       0.87      0.76      0.79     39404

Wall time: 314 ms


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
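
The warning above offers two remedies: raise `max_iter` or scale the data. In this notebook only five columns were standardized, while columns such as `age`, `campaign` and `pdays` keep their raw ranges. A minimal sketch of applying both remedies in a pipeline is given below; the synthetic `X_toy` and `y_toy` and the value `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; one column gets a large scale, like pdays
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_toy[:, 0] = X_toy[:, 0] * 500 + 900

# Scale every column, then fit with a higher iteration cap than the default 100
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000),
)
model.fit(X_toy, y_toy)

# n_iter_ holds the number of iterations the solver actually used
print(model[-1].n_iter_)
```

Keeping the scaler inside the pipeline also means it is refit on the training part only whenever the model is cross-validated.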


##### Precision for the positive class is too low.

##### Without class balancing:

In [30]:
%%time

# Default class weights: every sample counts equally
lr = LogisticRegression(penalty='l2')

lr.fit(X, y)
y_pred = lr.predict(X)
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.99      0.94     34806
           1       0.69      0.20      0.30      4598

    accuracy                           0.90     39404
   macro avg       0.79      0.59      0.62     39404
weighted avg       0.88      0.90      0.87     39404

Wall time: 327 ms


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


##### Recall for the positive class is too low.

##### Manual class weighting:

In [31]:
print(y.shape[0])
print(y[y == 1].shape[0])
print(y[y == 0].shape[0])
print((y.shape[0] - y[y == 0].shape[0]) / y.shape[0])
print((y.shape[0] - y[y == 1].shape[0]) / y.shape[0])

39404
4598
34806
0.11668866104963962
0.8833113389503604
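
The last two numbers printed above give each class the share of the other class as its weight. scikit-learn's `'balanced'` mode uses `n_samples / (n_classes * count_of_class)` instead. The sketch below rebuilds a label vector from the printed counts and compares the two schemes; `y_toy`, `balanced` and `manual` are names introduced for this example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the class counts printed above
y_toy = np.array([0] * 34806 + [1] * 4598)

# 'balanced' weights: n_samples / (n_classes * count_of_class)
balanced = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_toy)

# Manual weights: each class is weighted by the share of the other class
manual = np.array([4598 / 39404, 34806 / 39404])

print(balanced)
print(balanced[1] / balanced[0], manual[1] / manual[0])
```

Both schemes yield the same ratio between the two class weights and differ only in overall scale, which acts much like a change in regularization strength. This is consistent with the manual weights producing almost the same report as `class_weight='balanced'`.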


In [32]:
%%time

weights = {0: (y.shape[0] - y[y == 0].shape[0]) / y.shape[0], 1: (y.shape[0] - y[y == 1].shape[0]) / y.shape[0]}

lr = LogisticRegression(penalty='l2', class_weight=weights)

lr.fit(X, y)
y_pred = lr.predict(X)
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.78      0.86     34806
           1       0.29      0.68      0.41      4598

    accuracy                           0.77     39404
   macro avg       0.62      0.73      0.63     39404
weighted avg       0.87      0.77      0.81     39404

Wall time: 280 ms


##### Precision for the positive class is again too low.

##### Manual weight search (trying class weights by hand; scores are computed on the training data, so this is not true cross-validation):

In [33]:
%%time

weights = {0: 0.22, 1: 0.78}

lr = LogisticRegression(penalty='l2', class_weight=weights)

lr.fit(X, y)
y_pred = lr.predict(X)
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.93      0.93     34806
           1       0.46      0.46      0.46      4598

    accuracy                           0.87     39404
   macro avg       0.69      0.69      0.69     39404
weighted avg       0.87      0.87      0.87     39404

Wall time: 321 ms


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Precision and recall are perfectly balanced (0.46 each), but the F1 score for the positive class leaves much to be desired.
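
The weights 0.22 and 0.78 were found by trial and error and scored on the training rows. A cross-validated search over the same kind of weights could be sketched as follows. The synthetic data, the candidate grid and the choice of `scoring='f1'` are assumptions of this example, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data as a stand-in for X and y
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)

# Candidate weights for the positive class; class 0 gets the remainder
grid = {'class_weight': [{0: 1 - w, 1: w} for w in (0.5, 0.6, 0.7, 0.8, 0.9)]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=grid,
    scoring='f1',  # F1 of the positive class (label 1)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Here each candidate is scored on folds it was not trained on, so the selected weights are less likely to be tuned to one particular sample.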

In [34]:
#               precision    recall  f1-score   support

#            0       0.93      0.93      0.93     34806
#            1       0.46      0.46      0.46      4598

#     accuracy                           0.87     39404
#    macro avg       0.69      0.69      0.69     39404
# weighted avg       0.87      0.87      0.87     39404