<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Телеком" data-toc-modified-id="Телеком-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Telecom</a></span><ul class="toc-item"><li><span><a href="#План-работы" data-toc-modified-id="План-работы-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Work Plan</a></span></li><li><span><a href="#Предобработка" data-toc-modified-id="Предобработка-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Preprocessing</a></span></li></ul></li></ul></div>

# Telecom

## Work Plan
* Data preprocessing: explore the data, merge the datasets, handle missing values, duplicates, and invalid values, convert columns to the correct types, derive the target feature, and drop unneeded columns.
* Exploratory data analysis: examine the categorical and numerical features and the class balance of the target feature.
* Model building: prepare the features for training, split the dataset into subsets, train models with different hyperparameters, and select the final model.
* Model testing: evaluate the final model on the test set.
* Conclusion: analyze the results.

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as warn
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, recall_score, precision_score, roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from imblearn.over_sampling import BorderlineSMOTE

## Preprocessing

**Load the data**

In [45]:
path_local = '/Users/bogda/anaconda3/projects/praktikum/project_project/final_provider/'
path_yandex = '/datasets/final_provider/'

In [46]:
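def tryexept(path):
    """Read the four CSV files from `path`, preview each one, and return them as a dict."""
    files = ['contract', 'personal', 'internet', 'phone']
    data = {}
    for name in files:
        # Later cells look tables up by file name, e.g. data['contract.csv']
        key = name + '.csv'
        data[key] = pd.read_csv(path + key, index_col='customerID')
        print('\n' + name)
        display(data[key].sample(5))
        data[key].info()
        print('—' * 54)
    return data

In [47]:
try:
    data = tryexept(path_local)
except FileNotFoundError as e:
    # Fall back to the Yandex path when the local copy is missing
    print(e)
    data = tryexept(path_yandex)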


contract


customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
5656-JAMLX,2014-12-01,No,Two year,No,Bank transfer (automatic),19.85,1253.65
3771-PZOBW,2018-06-01,No,Month-to-month,Yes,Credit card (automatic),90.7,1781.35
7321-ZNSLA,2019-01-01,No,Two year,No,Mailed check,40.55,590.35
6778-JFCMK,2018-02-01,No,One year,Yes,Mailed check,50.6,1288.75
4140-MUHUG,2019-08-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,86.85,220.95


<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   BeginDate         7043 non-null   object 
 1   EndDate           7043 non-null   object 
 2   Type              7043 non-null   object 
 3   PaperlessBilling  7043 non-null   object 
 4   PaymentMethod     7043 non-null   object 
 5   MonthlyCharges    7043 non-null   float64
 6   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(6)
memory usage: 440.2+ KB
——————————————————————————————————————————————————————

personal


customerID,gender,SeniorCitizen,Partner,Dependents
2856-NNASM,Male,1,No,No
2907-ILJBN,Female,0,Yes,Yes
1699-HPSBG,Male,0,No,No
1818-ESQMW,Female,0,No,No
2874-YXVVA,Female,0,No,No


<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   gender         7043 non-null   object
 1   SeniorCitizen  7043 non-null   int64 
 2   Partner        7043 non-null   object
 3   Dependents     7043 non-null   object
dtypes: int64(1), object(3)
memory usage: 275.1+ KB
——————————————————————————————————————————————————————

internet


customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
7346-MEDWM,Fiber optic,No,Yes,Yes,No,No,No
7340-KEFQE,DSL,Yes,No,No,Yes,No,No
9102-OXKFY,DSL,No,No,Yes,No,No,No
4910-AQFFX,Fiber optic,No,No,Yes,No,No,No
5795-KTGUD,Fiber optic,Yes,Yes,Yes,No,Yes,Yes


<class 'pandas.core.frame.DataFrame'>
Index: 5517 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   InternetService   5517 non-null   object
 1   OnlineSecurity    5517 non-null   object
 2   OnlineBackup      5517 non-null   object
 3   DeviceProtection  5517 non-null   object
 4   TechSupport       5517 non-null   object
 5   StreamingTV       5517 non-null   object
 6   StreamingMovies   5517 non-null   object
dtypes: object(7)
memory usage: 344.8+ KB
——————————————————————————————————————————————————————

phone


customerID,MultipleLines
1219-NNDDO,Yes
5685-IIXLY,No
5153-RTHKF,Yes
0301-KOBTQ,No
3398-GCPMU,Yes


<class 'pandas.core.frame.DataFrame'>
Index: 6361 entries, 5575-GNVDE to 3186-AJIEK
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MultipleLines  6361 non-null   object
dtypes: object(1)
memory usage: 99.4+ KB
——————————————————————————————————————————————————————


In [50]:
data['contract.csv']

customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.30,1840.75
9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.70,151.65
...,...,...,...,...,...,...,...
6840-RESVB,2018-02-01,No,One year,Yes,Mailed check,84.80,1990.5
2234-XADUH,2014-02-01,No,One year,Yes,Credit card (automatic),103.20,7362.9
4801-JZAZL,2019-03-01,No,Month-to-month,Yes,Electronic check,29.60,346.45
8361-LTMKD,2019-07-01,2019-11-01 00:00:00,Month-to-month,Yes,Mailed check,74.40,306.6


**Brief overview: at first glance there are no missing values. The last two tables (internet with 5517 rows, phone with 6361 rows) contain fewer rows than the first two (7043 rows each), because customers can use the services independently of one another. In the contract table the data types do not match the column contents: 'BeginDate' and 'EndDate' should be converted to dates and 'TotalCharges' to float. In the personal table the 'SeniorCitizen' column is numeric although it is a categorical feature. Column names use inconsistent case, and there is no target feature yet.**
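The conversions listed above could be sketched as follows. This is a minimal illustration on a hand-made frame whose rows mimic the samples shown earlier. The frame `toy`, the customer ID `0000-TOYID`, the blank `TotalCharges` string, and the derived column name `churn` are assumptions made for the example, not part of the original notebook.

```python
import pandas as pd

# Toy rows modelled on the contract/personal samples (values are illustrative).
toy = pd.DataFrame(
    {
        'BeginDate': ['2014-12-01', '2019-08-01', '2020-01-01'],
        'EndDate': ['No', '2019-11-01 00:00:00', 'No'],
        'TotalCharges': ['1253.65', '220.95', ' '],  # blank string: assumed edge case
        'SeniorCitizen': [0, 1, 0],
    },
    index=pd.Index(['5656-JAMLX', '4140-MUHUG', '0000-TOYID'], name='customerID'),
)

# Target (hypothetical name): 1 when EndDate holds a date instead of 'No'.
toy['churn'] = (toy['EndDate'] != 'No').astype(int)

# Dates: mask the 'No' placeholder first, so it becomes NaT after parsing.
toy['BeginDate'] = pd.to_datetime(toy['BeginDate'])
toy['EndDate'] = pd.to_datetime(toy['EndDate'].where(toy['EndDate'] != 'No'))

# Numbers stored as text: unparseable strings become NaN and show up as gaps.
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

# A 0/1 flag is categorical in meaning, so store it as a category.
toy['SeniorCitizen'] = toy['SeniorCitizen'].astype('category')

print(toy.dtypes)
print(toy.isna().sum())
```

Deriving the target before parsing `EndDate` matters here, because after the conversion the 'No' marker is no longer available as a string. Passing `errors='coerce'` to `pd.to_numeric` is one way to reveal values that `info()` reports as non-null but that are not valid numbers.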

**Merge the tables into a single dataset and convert the column names to lowercase**
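As a small illustration of what joining on the shared `customerID` index does, here is a sketch with toy frames (the frames `contract_toy` and `internet_toy`, their IDs, and their values are made up for the example). Customers absent from the right-hand table keep their row and receive NaN in the added columns.

```python
import pandas as pd

contract_toy = pd.DataFrame(
    {'Type': ['Two year', 'Month-to-month', 'One year']},
    index=pd.Index(['A-1', 'B-2', 'C-3'], name='customerID'),
)
internet_toy = pd.DataFrame(
    {'InternetService': ['DSL', 'Fiber optic']},
    index=pd.Index(['A-1', 'C-3'], name='customerID'),
)

# DataFrame.join performs a left join on the index by default,
# so every row of contract_toy is preserved.
merged = contract_toy.join(internet_toy)

# Bring all column names to lowercase.
merged.columns = merged.columns.str.lower()

print(merged)
print(merged['internetservice'].isna().sum())
```

With a left join, missing values in the service columns mean that the customer does not use that service, which is worth keeping in mind when filling the gaps later.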

In [51]:
df = data['contract.csv'].join(data['personal.csv']).join(data['internet.csv']).join(data['phone.csv'])
df.columns = [k.lower() for k in list(df.columns)]
display(df.sample(5))
df.info()

customerID,begindate,enddate,type,paperlessbilling,paymentmethod,monthlycharges,totalcharges,gender,seniorcitizen,partner,dependents,internetservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,multiplelines
4628-WQCQQ,2017-01-01,2019-12-01 00:00:00,One year,Yes,Electronic check,85.15,3030.6,Male,0,No,Yes,Fiber optic,No,No,Yes,No,Yes,No,No
0013-EXCHZ,2019-09-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,83.9,267.4,Female,1,Yes,No,Fiber optic,No,No,No,Yes,Yes,No,No
5996-DAOQL,2020-01-01,No,Month-to-month,Yes,Mailed check,20.45,20.45,Male,0,No,No,,,,,,,,No
4827-USJHP,2018-06-01,No,Month-to-month,Yes,Mailed check,51.8,1023.85,Male,0,No,No,DSL,No,Yes,No,No,No,No,No
6741-QRLUP,2014-11-01,No,Two year,Yes,Credit card (automatic),80.3,4995.35,Female,0,No,No,DSL,Yes,Yes,Yes,Yes,No,Yes,Yes


<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   begindate         7043 non-null   object 
 1   enddate           7043 non-null   object 
 2   type              7043 non-null   object 
 3   paperlessbilling  7043 non-null   object 
 4   paymentmethod     7043 non-null   object 
 5   monthlycharges    7043 non-null   float64
 6   totalcharges      7043 non-null   object 
 7   gender            7043 non-null   object 
 8   seniorcitizen     7043 non-null   int64  
 9   partner           7043 non-null   object 
 10  dependents        7043 non-null   object 
 11  internetservice   5517 non-null   object 
 12  onlinesecurity    5517 non-null   object 
 13  onlinebackup      5517 non-null   object 
 14  deviceprotection  5517 non-null   object 
 15  techsupport       5517 non-null   object 
 16  streamingtv       5517 non-null 