# Идея проекта, выборка и предобработка данных
<hr>

## Введение

```Текущая accuracy: 60-70%```

Проект представляет из себя анализатор дипфейков на основе анализа артефактов в различных зонах лица

В ходе предварительного анализа предметной области было выделено 3 наиболее перспективных для реализации метода анализа дипфейков:

- анализ движений глаз (саккад)
- анализ синзронизации речи и движений губ
- анализ артефактов

Так как анализ движений возможен только в динамике и довольно сложен в реализации, было решено реализовать метод анализа артефактов на статических изображениях (фото)

- В качестве "источника вдохновения" использовался [проект](https://github.com/rakshitsakhuja/Detecting-Deepfakes-with-OpenCV/blob/master/1.%20Processing%20Videos%20and%20Face%20Detection.ipynb), доступный по ссылке: 

https://github.com/rakshitsakhuja/Detecting-Deepfakes-with-OpenCV/blob/master/1.%20Processing%20Videos%20and%20Face%20Detection.ipynb

- В качестве датасета использовались фрагменты датасета [Deep Fake Detection Challenge DFDC](https://www.kaggle.com/competitions/deepfake-detection-challenge):

https://www.kaggle.com/competitions/deepfake-detection-challenge

- Про [распознанвание зон лица](https://pyimagesearch.com/2017/04/10/detect-eyes-nose-lips-jaw-dlib-opencv-python/) можно прочесть по ссылке:

https://pyimagesearch.com/2017/04/10/detect-eyes-nose-lips-jaw-dlib-opencv-python/

- Сжатый [датасет 300 Faces In-the-Wild Challenge (300-W)](https://ibug.doc.ic.ac.uk/resources/300-W/) для распознавания точек на лице и аннотации к нему:

https://ibug.doc.ic.ac.uk/resources/300-W/

## Пайплайн проекта

Проект работает в следующем порядке:

- datacollector.py, faceparts.py:
1. Считывание по 1 кадру из видео-датасета DFDC (5 x 10 GB, 5 x 1700 видео)
2. Распознавание 68 точек на лице с помощью OpenCV, DLib и датасета 300-W
3. Отсев фото, которые не удалось распознать (остается около 3500 фото), запись *размера* лица и бинарного значения яркости *пикселей* в 68 зонах вокруг ключевых точек в их черно-белой версии (размер зон - 8-17% от ширины лица, сейчас подбирается) в dataset (5 х 0.2-0.5 GB)
> псевдокод: ```image.to_greyscale().cut([zone_schape])```

- notebook.ipynb (до метки "Legacy"):
1. Добавление гистограмм распределения яркости к склейке датасетов
2. Добавление метрик std и noise для каждой зоны
3. Очистка от исходных массивов пикселей, запись на диск (30 MB)

- notebook2.ipynb:
1. Отсекание лишних фейков из исходного датасета (остается 1000=500*2 из 3500 фото) для избавления от перекоса (по крайней мере в процессе подбора методов, такой датасет удобнее, иначе недо/переобучение вощникает слишком часто)
2. Сбор std и avg значений для каждого из классов => определение точек, с наибольшим значением ```|avg_fake-avg_real| / (std_fake + std_real)```
> sklearn.feature_selection.SelectKBest работает плохо, т.к. не включает метрики для всего лица, которые нужны для нормализации значений по точкам

3. Выборка самых показательных features + features по всему лицу
4. Обучение моделей

> Так как 5 из 8 моделей дают сопоставимую точность, на данном этапе целесообразно работать надо подбором фичей и размерами сканируемых зон, мб фильтрацией плохих/маленьких фото, тюнинг моделей (тщательный) следует проводть позднее

## Потенциальные темы НИРов

> 1 тема != 1 НИР, темы могут быть объединены или разбиты

- процесс сборки данных: введение про проблему и методы анализа дипфейков, датасет DFDC, распознавание точек, OpenCV, DLib, датасет 300-W
- Предобработка данных: фильтрация, гистограммы и агрегационные метрики
- Подбор features, метрики метрик, метрики по всему лицу
- Выбор и тюнинг моделей, влияние размера и перекоса датасета и т.д.

На более поздних этапах (опционально):
- выстраивание вокруг модели эконсистемы: веб-приложение, комплект клиентов на React-native / чат-бот, Google OAuth, Metamask Auth
- Биллинг для приложений, ценообразование, токены (можно крипту из тестнета (Goerli, Ether testnet) ради забавы прикрутить), мониторинг (Grafana?), Docker-Kuber, балансировка и прочие умные штуки

# Обработка данных
<hr>

> 990 фото, баланс классов 1:1, размер зон - 8..17% от ширины лица (в процессе экспериментов), f1-метрика на разных моделях - 0.60..0.70

## Импорт dataset-а и агрегация данных для классов 'fake' и 'real'

In [1]:
import os
os.getcwd()

'c:\\Users\\sergey.astakhov\\Desktop\\BmstuDeepFake'

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_json("../deep_fake_src/dfdc_dataframes/df_total_0_4_compact_frame_10.json")
# df = pd.read_json("../deep_fake_src/dfdc_dataframes/dataframe_compact_total_27_frame_10_lim100_offset0.json")

In [4]:
# Балансировка классов 1:1
# Инверсные метрики для полиномов

# df = df.drop(columns=['index','face_size_px'])
df = pd.concat([
    df[df.fake==True].sample(
        int(df[df.fake==False].shape[0]*1)), 
    df[df.fake==False]
])

face = df.filter(like='face', axis=1).copy()

for col in face:
    name = str(col) + '_reversed'
    df[name] = face[str(col)].map(lambda x: 1.0 / max(1,x))

df = df.filter(regex='^(.(?!(var)))*$', axis=1).filter(regex='^(.(?!(noise_1)))*$', axis=1)

print(df.shape)
# df = df.reset_index()
df.head()

(990, 281)


Unnamed: 0,index,filename,fake,face_size_px,pt_48_std,pt_49_std,pt_50_std,pt_51_std,pt_52_std,pt_53_std,...,pt_13_hist_noise_3,pt_14_hist_noise_3,pt_15_hist_noise_3,pt_16_hist_noise_3,overall_face_hist_noise_3,face_size_px_reversed,overall_face_std_reversed,overall_face_hist_std_reversed,overall_face_hist_noise_0.5_reversed,overall_face_hist_noise_3_reversed
3102,483,wdukmquzms.mp4,True,6084,11.55422,3.535534,1.089725,4.153312,4.81534,5.53963,...,0.0,0.0,0.0,0.0,2.95858,0.000164,0.022004,0.029886,0.515593,0.338
233,245,lmhkuaobue.mp4,True,40804,23.793989,34.882481,25.929412,15.909067,4.343029,0.970773,...,2.0,0.0,0.0,0.0,0.485247,2.5e-05,0.024501,0.005304,1.0,1.0
3424,806,dkicrqucgy.mp4,True,112896,15.800391,15.260562,13.016295,13.781179,13.891335,16.135245,...,8.984375,4.296875,5.078125,6.640625,0.209042,9e-06,0.020796,0.001459,1.0,1.0
2596,588,plccjliaxn.mp4,True,87616,23.637761,32.387943,24.234672,14.927587,5.254129,19.051615,...,34.693878,10.714286,2.55102,20.918367,0.261368,1.1e-05,0.02299,0.000931,1.0,1.0
1311,515,hsllwtgadk.mp4,True,93636,17.36651,23.398106,21.937207,20.222863,13.482415,14.794257,...,19.897959,25.0,3.571429,4.591837,0.215729,1.1e-05,0.020116,0.003088,1.0,1.0


In [5]:
# агрегация для фейков

df_fakes_compact = df[df.fake==True].filter(regex='^(.(?!(raw)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ake)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ilename)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ist_simple)))*$', axis=1)
                                    
df_fakes_reduced = pd.DataFrame(df_fakes_compact.mean()).T
df_fakes_reduced['fake'] = True
df_fakes_reduced = df_fakes_reduced.set_index('fake')
df_fakes_reduced.filter(like='face', axis=1)

Unnamed: 0_level_0,face_size_px,overall_face_std,overall_face_hist_std,overall_face_hist_noise_0.5,overall_face_hist_noise_3,face_size_px_reversed,overall_face_std_reversed,overall_face_hist_std_reversed,overall_face_hist_noise_0.5_reversed,overall_face_hist_noise_3_reversed
fake,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
True,100986.812121,45.314327,537.200961,0.173002,0.286541,1.5e-05,0.024383,0.003158,0.997426,0.993569


In [6]:
# агрегация для реальных фото

df_real_compact = df[df.fake==False].filter(regex='^(.(?!(raw)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ake)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ilename)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ist_simple)))*$', axis=1)
                                    
df_real_reduced = pd.DataFrame(df_real_compact.mean()).T
df_real_reduced['fake'] = False
df_real_reduced = df_real_reduced.set_index('fake')
df_real_reduced.filter(like='face', axis=1)

Unnamed: 0_level_0,face_size_px,overall_face_std,overall_face_hist_std,overall_face_hist_noise_0.5,overall_face_hist_noise_3,face_size_px_reversed,overall_face_std_reversed,overall_face_hist_std_reversed,overall_face_hist_noise_0.5_reversed,overall_face_hist_noise_3_reversed
fake,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
False,92126.343434,49.182392,534.771606,0.200743,0.308146,1.6e-05,0.022526,0.003057,0.999456,0.998436


In [7]:
# самые ярко-различающиеся по классам features

df_compare = pd.concat([df_real_reduced, df_fakes_reduced]).T
df_compare['diff_rel'] = abs(df_compare[False] - df_compare[True]) / (df_compare[False] + df_compare[True])
df_compare['diff'] = abs(df_compare[False] - df_compare[True]) 

df_compare = df_compare.sort_values(by=['diff_rel'], ascending=False)
df_compare.head(15)

fake,False,True,diff_rel,diff
pt_8_hist_noise_3,12.59469,9.277604,0.151657,3.317086
pt_19_hist_noise_0.5,1.641023,1.231309,0.142642,0.409715
pt_20_hist_noise_0.5,1.829038,1.382728,0.138961,0.44631
pt_57_hist_noise_3,16.263845,12.299708,0.138783,3.964138
pt_57_hist_noise_0.5,0.871193,0.668165,0.131892,0.203029
pt_1_hist_noise_0.5,1.813904,1.393418,0.131102,0.420486
pt_56_hist_noise_3,15.815474,12.253384,0.126905,3.56209
pt_3_hist_noise_3,23.379733,18.118205,0.12679,5.261528
pt_6_hist_noise_3,12.490156,9.680411,0.126733,2.809745
pt_2_hist_noise_3,19.610132,15.23425,0.125584,4.375883


In [8]:
# самые стабильные features

df_fakes_std = pd.DataFrame(df_fakes_compact.std()).T
df_fakes_std['fake'] = True
df_fakes_std = df_fakes_std.set_index('fake')

df_real_std = pd.DataFrame(df_real_compact.std()).T
df_real_std['fake'] = False
df_real_std = df_real_std.set_index('fake')

df_std = pd.concat([df_real_reduced, df_fakes_reduced]).T
df_std['total_std'] = (df_std[False] + df_std[True])

df_std = df_std.sort_values(by=['total_std'], ascending=True)
df_std.head(10)

fake,False,True,total_std
face_size_px_reversed,1.6e-05,1.5e-05,3.1e-05
overall_face_hist_std_reversed,0.003057,0.003158,0.006216
overall_face_std_reversed,0.022526,0.024383,0.046909
overall_face_hist_noise_0.5,0.200743,0.173002,0.373745
overall_face_hist_noise_3,0.308146,0.286541,0.594687
pt_8_hist_noise_0.5,0.507715,0.482656,0.990371
pt_7_hist_noise_0.5,0.540935,0.591095,1.132031
pt_9_hist_noise_0.5,0.687613,0.620837,1.308451
pt_6_hist_noise_0.5,0.688294,0.683331,1.371625
pt_57_hist_noise_0.5,0.871193,0.668165,1.539358


In [9]:
# самые надежные (разилчие_по_классам / разброс) features
# (var не кажется действительно надежной, скорее всего просто зависит от std квадратично или как-то еще)

df_smart = pd.concat([pd.DataFrame(df_compare['diff']),pd.DataFrame(df_std['total_std'])], axis=1)
df_smart['metric'] = df_smart['diff'] / df_smart['total_std']
df_smart = df_smart.sort_values(by=['metric'], ascending=False)
df_smart.head(30)

Unnamed: 0,diff,total_std,metric
pt_8_hist_noise_3,3.317086,21.872294,0.151657
pt_19_hist_noise_0.5,0.409715,2.872332,0.142642
pt_20_hist_noise_0.5,0.44631,3.211766,0.138961
pt_57_hist_noise_3,3.964138,28.563553,0.138783
pt_57_hist_noise_0.5,0.203029,1.539358,0.131892
pt_1_hist_noise_0.5,0.420486,3.207321,0.131102
pt_56_hist_noise_3,3.56209,28.068857,0.126905
pt_3_hist_noise_3,5.261528,41.497937,0.12679
pt_6_hist_noise_3,2.809745,22.170566,0.126733
pt_2_hist_noise_3,4.375883,34.844382,0.125584


## Подбор features [to be modified]

> опирается на интуитивные предположения + метод определения фейков по шуму

In [10]:
features = list(set(
    list(df_smart.filter(like='face_hist', axis=0).index) +  # for relative metrics
    list(df_smart.filter(like='noise_0.5', axis=0).head(7).index) +
    list(df_smart.filter(like='noise_3', axis=0).head(7).index) +
    list(df_smart.filter(like='std', axis=0).head(5).index) +
    list(df_smart.filter(like='reversed', axis=0).index) 
    # list(df_smart.filter(like='var', axis=0).head(3).index) 
))

print(len(features))
features

27


['overall_face_hist_noise_3_reversed',
 'pt_41_std',
 'pt_57_hist_noise_0.5',
 'pt_30_hist_noise_0.5',
 'pt_36_std',
 'pt_57_hist_noise_3',
 'overall_face_hist_noise_0.5_reversed',
 'overall_face_hist_std_reversed',
 'pt_1_std',
 'pt_21_hist_noise_0.5',
 'face_size_px_reversed',
 'overall_face_hist_noise_0.5',
 'overall_face_std_reversed',
 'pt_1_hist_noise_0.5',
 'pt_56_hist_noise_3',
 'pt_6_hist_noise_3',
 'overall_face_hist_noise_3',
 'pt_20_hist_noise_0.5',
 'overall_face_hist_std',
 'pt_8_hist_noise_3',
 'pt_0_hist_noise_0.5',
 'pt_2_hist_noise_3',
 'pt_0_std',
 'pt_39_std',
 'pt_39_hist_noise_3',
 'pt_3_hist_noise_3',
 'pt_19_hist_noise_0.5']

## Разбивка данных на train и test сплиты

In [11]:
# очистка dataset-а от лишних колонок перед тренировкой модели (необязательно?)

df_compact = df.filter(regex='^(.(?!(raw)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ilename)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ist_simple)))*$', axis=1)
df_compact.head(5)

Unnamed: 0,index,fake,face_size_px,pt_48_std,pt_49_std,pt_50_std,pt_51_std,pt_52_std,pt_53_std,pt_54_std,...,pt_13_hist_noise_3,pt_14_hist_noise_3,pt_15_hist_noise_3,pt_16_hist_noise_3,overall_face_hist_noise_3,face_size_px_reversed,overall_face_std_reversed,overall_face_hist_std_reversed,overall_face_hist_noise_0.5_reversed,overall_face_hist_noise_3_reversed
3102,483,True,6084,11.55422,3.535534,1.089725,4.153312,4.81534,5.53963,12.152675,...,0.0,0.0,0.0,0.0,2.95858,0.000164,0.022004,0.029886,0.515593,0.338
233,245,True,40804,23.793989,34.882481,25.929412,15.909067,4.343029,0.970773,0.73885,...,2.0,0.0,0.0,0.0,0.485247,2.5e-05,0.024501,0.005304,1.0,1.0
3424,806,True,112896,15.800391,15.260562,13.016295,13.781179,13.891335,16.135245,13.561806,...,8.984375,4.296875,5.078125,6.640625,0.209042,9e-06,0.020796,0.001459,1.0,1.0
2596,588,True,87616,23.637761,32.387943,24.234672,14.927587,5.254129,19.051615,13.855798,...,34.693878,10.714286,2.55102,20.918367,0.261368,1.1e-05,0.02299,0.000931,1.0,1.0
1311,515,True,93636,17.36651,23.398106,21.937207,20.222863,13.482415,14.794257,21.10267,...,19.897959,25.0,3.571429,4.591837,0.215729,1.1e-05,0.020116,0.003088,1.0,1.0


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [13]:
y = df_compact['fake']
X = pd.DataFrame(df_compact.drop(columns=['fake']))
# selector = SelectKBest(f_classif, k=7)
# X = selector.fit_transform(X, y)
X =  df_compact[features]

# mask = selector.get_support()
# X = X[X.columns[mask]]

# df_compact.filter(like='size', axis=0).head()

# print(df_compact[X.columns[mask]].shape)
# df_compact[X.columns[mask]].head()

pd.DataFrame(X).head()

Unnamed: 0,overall_face_hist_noise_3_reversed,pt_41_std,pt_57_hist_noise_0.5,pt_30_hist_noise_0.5,pt_36_std,pt_57_hist_noise_3,overall_face_hist_noise_0.5_reversed,overall_face_hist_std_reversed,pt_1_std,pt_21_hist_noise_0.5,...,pt_20_hist_noise_0.5,overall_face_hist_std,pt_8_hist_noise_3,pt_0_hist_noise_0.5,pt_2_hist_noise_3,pt_0_std,pt_39_std,pt_39_hist_noise_3,pt_3_hist_noise_3,pt_19_hist_noise_0.5
3102,0.338,14.874475,0.0,0.0,6.708204,0.0,0.515593,0.029886,5.894913,0.0,...,0.0,33.460678,0.0,0.0,0.0,0.829156,4.062019,0.0,0.0,0.0
233,1.0,13.712782,0.0,0.0,18.812762,17.0,1.0,0.005304,8.613309,0.0,...,0.0,188.5469,8.0,0.0,1.0,19.37613,11.808044,20.0,5.0,0.0
3424,1.0,13.553713,0.78125,1.953125,12.642846,3.125,1.0,0.001459,22.676752,0.390625,...,2.34375,685.549118,5.46875,5.859375,17.96875,23.78088,11.700141,13.671875,14.453125,1.5625
2596,1.0,33.512107,0.0,0.0,33.072177,13.265306,1.0,0.000931,19.374299,0.0,...,0.0,1073.759102,3.061224,0.0,20.408163,22.50738,25.746245,37.755102,6.122449,0.0
1311,1.0,25.001097,0.0,0.0,17.557038,8.163265,1.0,0.003088,26.76362,0.0,...,0.0,323.85341,13.265306,0.0,20.918367,33.920124,18.068323,25.510204,16.836735,0.0


In [14]:
# print(df_compact.filter(like='size', axis=1).shape)
# df_compact.filter(like='size', axis=1).head()

In [15]:
# sizes_col = X.filter(like='size', axis=1).copy()
# X = pd.concat([
#     df_compact[X.columns[mask]],
#     sizes_col
# ], axis=0)
# # X.join(df_compact.filter(like='size', axis=1), lsuffix='_caller', rsuffix='_other')
# X.head()

In [16]:
X_train, X_test, y_train, y_test = train_test_split( 
    X, 
    y,
    test_size=0.20, 
    random_state=420)

In [17]:
pd.DataFrame(X_train).head()

Unnamed: 0,overall_face_hist_noise_3_reversed,pt_41_std,pt_57_hist_noise_0.5,pt_30_hist_noise_0.5,pt_36_std,pt_57_hist_noise_3,overall_face_hist_noise_0.5_reversed,overall_face_hist_std_reversed,pt_1_std,pt_21_hist_noise_0.5,...,pt_20_hist_noise_0.5,overall_face_hist_std,pt_8_hist_noise_3,pt_0_hist_noise_0.5,pt_2_hist_noise_3,pt_0_std,pt_39_std,pt_39_hist_noise_3,pt_3_hist_noise_3,pt_19_hist_noise_0.5
1874,1.0,28.313497,1.0,0.5,22.093872,3.0,1.0,0.000908,16.133186,1.75,...,3.25,1101.052972,5.25,3.5,10.25,30.046374,16.051814,13.0,17.75,3.25
1519,1.0,28.820247,1.234568,2.469136,19.53879,4.938272,1.0,0.001978,4.155099,1.234568,...,5.555556,505.479641,14.814815,0.308642,5.864198,7.621677,17.702121,17.901235,10.802469,3.08642
100,1.0,5.834917,0.0,1.851852,5.183362,1.54321,1.0,0.000993,5.661588,0.0,...,0.0,1006.901957,1.234568,0.925926,3.08642,5.376026,3.42361,2.469136,2.469136,0.0
1169,1.0,32.932466,6.640625,1.953125,17.681318,15.625,1.0,0.002351,16.283367,2.34375,...,7.03125,425.435209,15.625,9.765625,18.359375,20.468321,17.15821,19.921875,24.21875,2.734375
2329,1.0,36.606967,2.734375,5.078125,23.38804,13.28125,1.0,0.002955,28.787787,1.171875,...,3.125,338.385216,4.6875,21.09375,29.6875,55.435687,24.7319,26.5625,29.296875,7.03125


In [18]:
y_train.head()

1874    False
1519     True
100      True
1169     True
2329    False
Name: fake, dtype: bool

## Обучение моделей

### Служебные функции

In [19]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier


In [20]:
import joblib

In [21]:
def save_model(model, X_train, y_train, X_test, y_test, p=2, name='my_model.pkl'):
    clf = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=p), 
        model
    )
    clf.fit(X_train, y_train)
    joblib.dump(clf, name)

In [22]:
def train_polynomial_pipeline(model, X_train, y_train, X_test, y_test, p=2):
    clf = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=p), 
        model
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return (
        confusion_matrix(y_test, y_pred), 
        classification_report(
            y_test, 
            y_pred, 
            target_names=['class "real"', 'class "fakes"'], 
            zero_division=np.nan)
    )

In [23]:
def print_polynomial_pipeline(model, X_train, y_train, X_test, y_test, p=2):
    confusion_matrix, classification_report = train_polynomial_pipeline(
        model, 
        X_train, 
        y_train, 
        X_test, 
        y_test,
        p
    )
    print(confusion_matrix)
    print(classification_report)

### SGDClassifier (f1=0.60) [to be modified]

In [24]:
print_polynomial_pipeline(
    SGDClassifier(),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
) 
save_model(
    SGDClassifier(),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2,
    name="sgd.pkl"
) 

[[40 61]
 [23 74]]
               precision    recall  f1-score   support

 class "real"       0.63      0.40      0.49       101
class "fakes"       0.55      0.76      0.64        97

     accuracy                           0.58       198
    macro avg       0.59      0.58      0.56       198
 weighted avg       0.59      0.58      0.56       198



### SVC (f1=0.64) [to be modified]

In [25]:
print_polynomial_pipeline(
#     SVC(gamma='auto'),
    SVC(kernel='rbf', gamma='scale'),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
) 
save_model(
#     SVC(gamma='auto'),
    SVC(kernel='rbf', gamma='scale'),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2,
    name="svc.pkl"
) 

[[38 63]
 [20 77]]
               precision    recall  f1-score   support

 class "real"       0.66      0.38      0.48       101
class "fakes"       0.55      0.79      0.65        97

     accuracy                           0.58       198
    macro avg       0.60      0.59      0.56       198
 weighted avg       0.60      0.58      0.56       198



In [26]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(
    classification_report(
            y_test, 
            y_pred, 
            target_names=['class "real"', 'class "fakes"'], 
            zero_division=np.nan)
)
# print_polynomial_pipeline(
# #     SVC(gamma='auto'),
#     MultinomialNB(),
#     X_train, 
#     y_train, 
#     X_test, 
#     y_test,
#     p=2
# ) 
# save_model(
# #     SVC(gamma='auto'),
#     MultinomialNB(),
#     X_train, 
#     y_train, 
#     X_test, 
#     y_test,
#     p=2,
#     name="mnb.pkl"
# ) 

               precision    recall  f1-score   support

 class "real"       0.54      0.60      0.57       101
class "fakes"       0.53      0.46      0.49        97

     accuracy                           0.54       198
    macro avg       0.53      0.53      0.53       198
 weighted avg       0.53      0.54      0.53       198



### LogisticRegression (f1=0.61) [to be modified]

In [27]:
print_polynomial_pipeline(
    # LogisticRegression(max_iter=15000)
    LogisticRegression(
        max_iter=15000, 
        # penalty=None,
        class_weight='balanced',
        solver='liblinear',
        # tol=1e-6
    ),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
) 
save_model(
    # LogisticRegression(max_iter=15000)
    LogisticRegression(
        max_iter=15000, 
        # penalty=None,
        class_weight='balanced',
        solver='liblinear',
        # tol=1e-6
    ),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2,
    name="logistic_regression.pkl"
) 

[[66 35]
 [34 63]]
               precision    recall  f1-score   support

 class "real"       0.66      0.65      0.66       101
class "fakes"       0.64      0.65      0.65        97

     accuracy                           0.65       198
    macro avg       0.65      0.65      0.65       198
 weighted avg       0.65      0.65      0.65       198



### NN - MLPClassifier (f1=0.66) [to be modified]

In [28]:
print_polynomial_pipeline(
    # MLPClassifier(max_iter=5000), # 0.57
    MLPClassifier(
        solver='lbfgs', 
        hidden_layer_sizes=(34,), # 5, 7, 7, 3 - 0.60
        random_state=1, 
        alpha=0.001, 
        # activation='relu',
        tol=1e-6,
        max_fun=15000,
        max_iter=15000),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=1
)
save_model(
    # MLPClassifier(max_iter=5000), # 0.57
    MLPClassifier(
        solver='lbfgs', 
        hidden_layer_sizes=(34,), # 5, 7, 7, 3 - 0.60
        random_state=1, 
        alpha=0.001, 
        # activation='relu',
        tol=1e-6,
        max_fun=15000,
        max_iter=15000),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=1,
    name="mlp.pkl"
) 

[[66 35]
 [48 49]]
               precision    recall  f1-score   support

 class "real"       0.58      0.65      0.61       101
class "fakes"       0.58      0.51      0.54        97

     accuracy                           0.58       198
    macro avg       0.58      0.58      0.58       198
 weighted avg       0.58      0.58      0.58       198



### RandomForestClassifier (good, f1=0.64) [to be modified] 

In [29]:
print_polynomial_pipeline(
    RandomForestClassifier(max_depth=7, random_state=42),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
) 
save_model(
    RandomForestClassifier(max_depth=7, random_state=42),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2,
    name="random_forest.pkl"
) 

[[57 44]
 [28 69]]
               precision    recall  f1-score   support

 class "real"       0.67      0.56      0.61       101
class "fakes"       0.61      0.71      0.66        97

     accuracy                           0.64       198
    macro avg       0.64      0.64      0.64       198
 weighted avg       0.64      0.64      0.63       198



### DecisionTreeClassifier (f1=0.59) [to be modified]

In [30]:
print_polynomial_pipeline(
    DecisionTreeClassifier(max_depth=5, min_samples_split=10, min_samples_leaf=5),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
)

[[44 57]
 [28 69]]
               precision    recall  f1-score   support

 class "real"       0.61      0.44      0.51       101
class "fakes"       0.55      0.71      0.62        97

     accuracy                           0.57       198
    macro avg       0.58      0.57      0.56       198
 weighted avg       0.58      0.57      0.56       198



### RadiusNeighborsClassifier (f1=0.46) [уязвим к выбросам]

In [31]:
print_polynomial_pipeline(
    RadiusNeighborsClassifier(
        radius=100, 
        weights='distance', 
        p=1, 
        outlier_label='most_frequent'),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
)

[[23 78]
 [16 81]]
               precision    recall  f1-score   support

 class "real"       0.59      0.23      0.33       101
class "fakes"       0.51      0.84      0.63        97

     accuracy                           0.53       198
    macro avg       0.55      0.53      0.48       198
 weighted avg       0.55      0.53      0.48       198



### KNeighborsClassifier (f1=0.69) [to be modified]

In [32]:
print_polynomial_pipeline(
    KNeighborsClassifier(n_neighbors=4, weights='distance'),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=1
)
save_model(
    KNeighborsClassifier(n_neighbors=4, weights='distance'),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=1,
    name="knn.pkl"
) 

[[63 38]
 [33 64]]
               precision    recall  f1-score   support

 class "real"       0.66      0.62      0.64       101
class "fakes"       0.63      0.66      0.64        97

     accuracy                           0.64       198
    macro avg       0.64      0.64      0.64       198
 weighted avg       0.64      0.64      0.64       198



In [53]:
from sklearn.linear_model import Perceptron

In [58]:
mx_iter=10000
print_polynomial_pipeline(
    Perceptron(fit_intercept=True, max_iter=mx_iter, tol=None, shuffle=True, n_jobs=-1),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
)
save_model(
    Perceptron(fit_intercept=True, max_iter=mx_iter, tol=None, shuffle=True, n_jobs=-1),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2,
    name="perceptron.pkl"
) 

[[66 35]
 [41 56]]
               precision    recall  f1-score   support

 class "real"       0.62      0.65      0.63       101
class "fakes"       0.62      0.58      0.60        97

     accuracy                           0.62       198
    macro avg       0.62      0.62      0.62       198
 weighted avg       0.62      0.62      0.62       198



In [98]:
estms = [
    SGDClassifier(),
    SVC(gamma='auto'),
    LogisticRegression(
        max_iter=15000, 
        # penalty=None,
        class_weight='balanced',
        solver='liblinear',
        # tol=1e-6
    ),
    # MLPClassifier(
    #     solver='lbfgs', 
    #     hidden_layer_sizes=(34,), # 5, 7, 7, 3 - 0.60
    #     random_state=1, 
    #     alpha=0.001, 
    #     # activation='relu',
    #     tol=1e-6,
    #     max_fun=15000,
    #     max_iter=15000
    # ),
    # Perceptron(fit_intercept=True, max_iter=10000, tol=None, shuffle=True),
    RandomForestClassifier(max_depth=7, random_state=42),
    DecisionTreeClassifier(max_depth=5, min_samples_split=10, min_samples_leaf=5),
    RadiusNeighborsClassifier(
        radius=100, 
        weights='distance', 
        p=1, 
        outlier_label='most_frequent'
    ),
    KNeighborsClassifier(n_neighbors=4, weights='distance')
]

estms2 = []

for model in estms:
    estms2.append(
        (
            str(model),
            make_pipeline(
                StandardScaler(),
                PolynomialFeatures(degree=2), 
                model
            )
        )
    )

In [99]:
# print(estms[0])

In [100]:
from sklearn.ensemble import StackingClassifier

In [101]:
md = LogisticRegression(max_iter=15000)
# md = KNeighborsClassifier()

clf = StackingClassifier(
    estimators=estms2, 
    final_estimator = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2), 
        md
    ),
    stack_method = 'predict',
    n_jobs=-1,
    verbose=1
)

In [102]:
from datetime import datetime

In [103]:
print(datetime.now().strftime("%H:%M:%S"))

11:04:59


In [104]:
clf.fit(X_train, y_train)

In [105]:
y_pred = clf.predict(X_test)
print(datetime.now().strftime("%H:%M:%S"))

11:05:16


In [106]:
confusion_matrix(y_test, y_pred)

array([[66, 35],
       [32, 65]], dtype=int64)

In [107]:
print(
    classification_report(
        y_test, 
        y_pred, 
        target_names=['class "real"', 'class "fakes"'], 
        zero_division=np.nan
    )
)

               precision    recall  f1-score   support

 class "real"       0.67      0.65      0.66       101
class "fakes"       0.65      0.67      0.66        97

     accuracy                           0.66       198
    macro avg       0.66      0.66      0.66       198
 weighted avg       0.66      0.66      0.66       198

