# Project idea, sampling, and data preprocessing
<hr>


> TODO: links to the methodology for histogram-based fake analysis, face recognition in OpenCV, and the source dataset

## Introduction

The project is a deepfake analyzer based on analyzing artifacts in different regions of the face.

A preliminary analysis of the problem domain identified the three deepfake analysis methods that look most promising to implement:

- analysis of eye movements (saccades)
- analysis of the synchronization between speech and lip movements
- analysis of artifacts

Because motion analysis is only possible on video and is fairly hard to implement, we decided to implement artifact analysis on static images (photos).

- The "source of inspiration" was the project [Detecting-Deepfakes-with-OpenCV](https://github.com/rakshitsakhuja/Detecting-Deepfakes-with-OpenCV/blob/master/1.%20Processing%20Videos%20and%20Face%20Detection.ipynb).
- The dataset consists of fragments of the [Deep Fake Detection Challenge (DFDC)](https://www.kaggle.com/competitions/deepfake-detection-challenge) dataset.
- [Detecting facial regions](https://pyimagesearch.com/2017/04/10/detect-eyes-nose-lips-jaw-dlib-opencv-python/) (eyes, nose, lips, jaw) is described in a PyImageSearch tutorial.
- The compressed [300 Faces In-the-Wild Challenge (300-W) dataset](https://ibug.doc.ic.ac.uk/resources/300-W/), together with its annotations, is used for facial landmark detection.

## Project pipeline

The project runs in the following order:

- datacollector.py, faceparts.py:
1. Read one frame from each video of the DFDC dataset (5 x 10 GB, 5 x 1700 videos)
2. Detect 68 facial landmarks using OpenCV, DLib, and the 300-W dataset
3. Discard photos where detection failed (about 3500 photos remain); write the face *size* and the binary brightness values of the *pixels* in 68 zones around the landmarks, taken from the grayscale version of the photo, to the dataset (5 x 0.2-0.5 GB). The zone size is 8-17% of the face width and is still being tuned
> pseudocode: ```image.to_grayscale().cut([zone_shape])```
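
Step 3 can be illustrated with a minimal sketch. It assumes the frame is already loaded as an RGB NumPy array and that each landmark is an (x, y) pixel pair; the helper names `to_grayscale` and `crop_zone`, the BT.601 luminance weights, and the 12% zone ratio are illustrative assumptions, not code from datacollector.py or faceparts.py.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB array to a 2-D luminance array (ITU-R BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights

def crop_zone(gray, center, face_width, ratio=0.12):
    """Cut a square zone around a landmark; side is ratio * face_width, clipped to the image."""
    half = max(1, int(round(face_width * ratio / 2)))
    x, y = center
    top, bottom = max(0, y - half), min(gray.shape[0], y + half)
    left, right = max(0, x - half), min(gray.shape[1], x + half)
    return gray[top:bottom, left:right]

# Synthetic frame standing in for a real video frame (assumption)
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(200, 200, 3))
gray = to_grayscale(frame)
zone = crop_zone(gray, center=(100, 80), face_width=150, ratio=0.12)
print(zone.shape)
```

Clipping at the image border means zones near the edge of the frame come out smaller than the nominal size, which may matter for any metric that depends on the pixel count.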

- notebook.ipynb:
1. Add brightness-distribution histograms to the concatenated datasets
2. Add the std and noise metrics for each zone
3. Drop the original pixel arrays and write the result to disk (30 MB)
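
The per-zone metrics from steps 1 and 2 might look roughly like the sketch below. It assumes 256-bin brightness histograms over the range 0..255. How the project's `noise` metrics are defined is not stated here, so only the pixel `std` and the histogram are shown; `zone_metrics` is a hypothetical helper.

```python
import numpy as np

def zone_metrics(zone):
    """Return the pixel std, the 256-bin brightness histogram, and the std of that histogram."""
    pixels = np.asarray(zone, dtype=float).ravel()
    hist, _ = np.histogram(pixels, bins=256, range=(0, 256))
    return {
        "std": float(pixels.std()),
        "hist": hist,
        "hist_std": float(hist.std()),
    }

# Tiny made-up zone: four dark pixels and two bright ones
zone = np.array([[10, 10, 12], [200, 201, 10]])
metrics = zone_metrics(zone)
print(metrics["std"], metrics["hist_std"])
```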

- notebook2.ipynb:
1. Drop surplus fakes from the source dataset (about 1000 = 2 x 500 of the 3500 photos remain) to remove the class imbalance. At least while methods are being selected, a balanced dataset is more convenient; otherwise under- and overfitting occur too often
2. Collect std and avg values for each class, then find the points with the largest value of ```|avg_fake-avg_real| / (std_fake + std_real)```
> sklearn.feature_selection.SelectKBest works poorly here because it does not include the whole-face metrics, which are needed to normalize the per-point values

3. Select the most informative features plus the whole-face features
4. Train the models

> Since 5 of the 8 models give comparable accuracy, at this stage it makes sense to work on feature selection and the size of the scanned zones, and perhaps on filtering out bad or small photos; careful model tuning should come later
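
The ranking criterion from step 2 of notebook2.ipynb can be written compactly in pandas. This is a sketch on toy data: the helper name `separation_score` and the column names in `toy` are assumptions made for the example.

```python
import pandas as pd

def separation_score(df, label="fake"):
    """|avg_fake - avg_real| / (std_fake + std_real) for every feature column, best first."""
    feature_cols = [col for col in df.columns if col != label]
    fake = df.loc[df[label], feature_cols]
    real = df.loc[~df[label], feature_cols]
    score = (fake.mean() - real.mean()).abs() / (fake.std() + real.std())
    return score.sort_values(ascending=False)

toy = pd.DataFrame({
    "fake": [True, True, True, False, False, False],
    "pt_a_std": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # classes well separated
    "pt_b_std": [1.0, 5.0, 9.0, 2.0, 5.0, 8.0],   # same mean in both classes
})
scores = separation_score(toy)
print(scores)
```

A feature scores high when the class means are far apart relative to the spread inside each class, which is the intuition behind dividing by the sum of the standard deviations.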

## Potential research paper (NIR) topics

> 1 topic != 1 NIR; topics may be merged or split

- Data collection: an introduction to the deepfake problem and analysis methods, the DFDC dataset, landmark detection, OpenCV, DLib, the 300-W dataset
- Data preprocessing: filtering, histograms, and aggregate metrics
- Feature selection, metrics of metrics, whole-face metrics
- Model selection and tuning, the effect of dataset size and class imbalance, etc.

At later stages (optional):
- Building an ecosystem around the model: a web application, a set of React Native clients / a chat bot, Google OAuth, Metamask Auth
- Billing for the applications, pricing, tokens (testnet crypto (Goerli, Ether testnet) could be bolted on for fun), monitoring (Grafana?), Docker and Kubernetes, load balancing, and other clever things

# Data processing
<hr>

> 990 photos, class balance 1:1, zone size 8..17% of the face width (still being tuned), f1 score across models 0.60..0.70

## Importing the dataset and aggregating data for the 'fake' and 'real' classes

In [202]:
import os
os.getcwd()

'C:\\Users\\sergey.astakhov\\Desktop\\BmstuDeepFake'

In [203]:
import pandas as pd
import numpy as np

In [204]:
df = pd.read_json("../dfdc_dataframes/df_total_0_4_compact_frame_12.json")

In [205]:
# Balance the classes 1:1
# Add inverse (1/x) metrics for the polynomial features

# df = df.drop(columns=['index','face_size_px'])
df = pd.concat([
    df[df.fake==True].sample(
        int(df[df.fake==False].shape[0]*1)), 
    df[df.fake==False]
])

face = df.filter(like='face', axis=1).copy()

for col in face:
    name = str(col) + '_reversed'
    df[name] = face[str(col)].map(lambda x: 1.0 / x)

df = df.filter(regex='^(.(?!(var)))*$', axis=1).filter(regex='^(.(?!(noise_1)))*$', axis=1)

print(df.shape)
# df = df.reset_index()
df.head()

(990, 281)


Unnamed: 0,index,filename,fake,face_size_px,pt_48_std,pt_49_std,pt_50_std,pt_51_std,pt_52_std,pt_53_std,...,pt_13_hist_noise_3,pt_14_hist_noise_3,pt_15_hist_noise_3,pt_16_hist_noise_3,overall_face_hist_noise_3,face_size_px_reversed,overall_face_std_reversed,overall_face_hist_std_reversed,overall_face_hist_noise_0.5_reversed,overall_face_hist_noise_3_reversed
1801,1005,kgsgnwoahd.mp4,True,34596,16.804833,30.888799,25.723056,20.941755,14.477137,18.690843,...,53.0,49.0,30.0,61.0,0.638802,2.9e-05,0.019224,0.00737,2.812683,1.56543
3319,701,iilrffkxoh.mp4,True,186624,10.179104,13.568882,11.596774,15.865968,17.277402,24.243699,...,23.090278,21.875,17.881944,17.708333,0.133959,5e-06,0.012503,0.001279,10.913684,7.46496
1006,210,xeemhzcqdk.mp4,True,68644,14.52872,14.97825,17.230362,14.234978,9.904893,13.478108,...,26.020408,27.040816,30.102041,34.693878,0.365655,1.5e-05,0.01938,0.004713,4.428645,2.734821
3123,504,vvdisddtuy.mp4,True,116964,8.429044,4.504664,3.750726,5.156462,13.085259,17.99277,...,25.25,29.25,33.75,35.75,0.174413,9e-06,0.037992,0.001276,8.354571,5.733529
2019,11,ixihznhwqr.mp4,True,23716,6.008437,9.371666,10.53972,11.493035,9.490075,9.356009,...,26.5625,14.0625,3.125,9.375,0.763198,4.2e-05,0.022079,0.009189,2.136577,1.310276
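
The balancing cell calls `sample` without a seed, so a different subset of fakes is drawn on every run and the scores further down are not exactly reproducible. A hedged sketch of a seeded variant follows; the helper `balance_classes`, the seed value, and the toy frame are assumptions. Unlike the cell above, it downsamples whichever class is larger.

```python
import pandas as pd

def balance_classes(df, label="fake", seed=0):
    """Downsample the larger class so both classes have the same number of rows."""
    n = df[label].value_counts().min()
    parts = [group.sample(n=n, random_state=seed) for _, group in df.groupby(label)]
    return pd.concat(parts)

# Toy frame with 7 fakes and 3 real rows
toy = pd.DataFrame({"fake": [True] * 7 + [False] * 3, "x": range(10)})
balanced = balance_classes(toy)
print(balanced["fake"].value_counts().to_dict())
```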


In [206]:
# aggregation for the fakes

df_fakes_compact = df[df.fake==True].filter(regex='^(.(?!(raw)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ake)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ilename)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ist_simple)))*$', axis=1)
                                    
df_fakes_reduced = pd.DataFrame(df_fakes_compact.mean()).T
df_fakes_reduced['fake'] = True
df_fakes_reduced = df_fakes_reduced.set_index('fake')
df_fakes_reduced.filter(like='face', axis=1)

fake,face_size_px,overall_face_std,overall_face_hist_std,overall_face_hist_noise_0.5,overall_face_hist_noise_3,face_size_px_reversed,overall_face_std_reversed,overall_face_hist_std_reversed,overall_face_hist_noise_0.5_reversed,overall_face_hist_noise_3_reversed
True,99839.692929,45.537059,543.636882,0.173,0.280857,1.5e-05,0.02434,0.003034,10.787977,5.513369
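
The column filters use patterns such as `^(.(?!(ake)))*$` rather than spelling out `fake`. A likely reason is that this form of negative lookahead is only checked after each consumed character, so a name that starts with the forbidden substring slips through. The check below uses the standard `re` module to illustrate this (`DataFrame.filter(regex=...)` keeps a column when `re.search` finds a match) and shows an alternative pattern, `^(?!.*fake).*$`, that tests the whole name. The helper `keep` is hypothetical.

```python
import re

columns = ["fake", "filename", "face_size_px", "pt_48_std"]

def keep(pattern, names):
    """Return the names that match the pattern, mimicking DataFrame.filter(regex=...)."""
    return [name for name in names if re.search(pattern, name)]

# Lookahead is checked only after each consumed character: 'fake' itself survives
kept_naive = keep(r"^(.(?!(fake)))*$", columns)
# Dropping the first letter works around that, as in the cells above
kept_workaround = keep(r"^(.(?!(ake)))*$", columns)
# A lookahead anchored at the start tests the whole name
kept_anchored = keep(r"^(?!.*fake).*$", columns)

print(kept_naive)
print(kept_workaround)
print(kept_anchored)
```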


In [207]:
# aggregation for the real photos

df_real_compact = df[df.fake==False].filter(regex='^(.(?!(raw)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ake)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ilename)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ist_simple)))*$', axis=1)
                                    
df_real_reduced = pd.DataFrame(df_real_compact.mean()).T
df_real_reduced['fake'] = False
df_real_reduced = df_real_reduced.set_index('fake')
df_real_reduced.filter(like='face', axis=1)

fake,face_size_px,overall_face_std,overall_face_hist_std,overall_face_hist_noise_0.5,overall_face_hist_noise_3,face_size_px_reversed,overall_face_std_reversed,overall_face_hist_std_reversed,overall_face_hist_noise_0.5_reversed,overall_face_hist_noise_3_reversed
False,92126.343434,49.182392,534.771606,0.200743,0.308146,1.6e-05,0.022526,0.003057,8.383215,4.79785


In [208]:
# features that differ most between the classes

df_compare = pd.concat([df_real_reduced, df_fakes_reduced]).T
df_compare['diff_rel'] = abs(df_compare[False] - df_compare[True]) / (df_compare[False] + df_compare[True])
df_compare['diff'] = abs(df_compare[False] - df_compare[True]) 

df_compare = df_compare.sort_values(by=['diff_rel'], ascending=False)
df_compare.head(15)

fake,False,True,diff_rel,diff
index,541.650505,381.159596,0.173915,160.490909
pt_21_hist_noise_0.5,2.223233,1.647713,0.148677,0.57552
pt_6_hist_noise_3,12.170309,9.142584,0.142061,3.027724
pt_8_hist_noise_3,12.842086,9.737749,0.137483,3.104336
pt_7_hist_noise_3,13.83453,10.493817,0.137318,3.340713
pt_57_hist_noise_3,16.081387,12.362432,0.130747,3.718956
pt_39_hist_noise_3,23.890869,18.510376,0.126895,5.380493
overall_face_hist_noise_0.5_reversed,8.383215,10.787977,0.125436,2.404762
pt_24_hist_noise_0.5,2.428787,1.887691,0.125356,0.541095
pt_56_hist_noise_3,15.780588,12.333453,0.122613,3.447135


In [209]:
# most stable features

df_fakes_std = pd.DataFrame(df_fakes_compact.std()).T
df_fakes_std['fake'] = True
df_fakes_std = df_fakes_std.set_index('fake')

df_real_std = pd.DataFrame(df_real_compact.std()).T
df_real_std['fake'] = False
df_real_std = df_real_std.set_index('fake')

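# NOTE: as written, this concatenates the per-class means (df_*_reduced), so
# 'total_std' is a sum of means; use df_real_std and df_fakes_std here to rank
# by standard deviation
df_std = pd.concat([df_real_reduced, df_fakes_reduced]).T
df_std['total_std'] = (df_std[False] + df_std[True])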

df_std = df_std.sort_values(by=['total_std'], ascending=True)
df_std.head(10)

fake,False,True,total_std
face_size_px_reversed,1.6e-05,1.5e-05,3e-05
overall_face_hist_std_reversed,0.003057,0.003034,0.006092
overall_face_std_reversed,0.022526,0.02434,0.046866
overall_face_hist_noise_0.5,0.200743,0.173,0.373743
overall_face_hist_noise_3,0.308146,0.280857,0.589003
pt_8_hist_noise_0.5,0.897199,0.85837,1.75557
pt_7_hist_noise_0.5,1.001284,0.865505,1.866789
pt_9_hist_noise_0.5,1.018589,1.051679,2.070268
pt_6_hist_noise_0.5,1.13939,1.080033,2.219424
pt_10_hist_noise_0.5,1.137544,1.253838,2.391381


In [210]:
# most reliable features (class difference / spread)
# (var does not look truly reliable; most likely it just depends on std quadratically or in some similar way)

df_smart = pd.concat([pd.DataFrame(df_compare['diff']),pd.DataFrame(df_std['total_std'])], axis=1)
df_smart['metric'] = df_smart['diff'] / df_smart['total_std']
df_smart = df_smart.sort_values(by=['metric'], ascending=False)
df_smart.head(30)

Unnamed: 0,diff,total_std,metric
index,160.490909,922.810101,0.173915
pt_21_hist_noise_0.5,0.57552,3.870946,0.148677
pt_6_hist_noise_3,3.027724,21.312893,0.142061
pt_8_hist_noise_3,3.104336,22.579835,0.137483
pt_7_hist_noise_3,3.340713,24.328347,0.137318
pt_57_hist_noise_3,3.718956,28.443819,0.130747
pt_39_hist_noise_3,5.380493,42.401246,0.126895
overall_face_hist_noise_0.5_reversed,2.404762,19.171193,0.125436
pt_24_hist_noise_0.5,0.541095,4.316478,0.125356
pt_56_hist_noise_3,3.447135,28.114041,0.122613
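
In the table above the `metric` column coincides with `diff_rel` from the earlier comparison table. That is what one would expect if `total_std` holds the sum of the two class means rather than a sum of standard deviations, because diff / (mean_real + mean_fake) is exactly the definition of `diff_rel`. The check below reproduces this from the printed values only.

```python
rows = {
    # feature: (mean_real, mean_fake, total_std, metric), copied from the tables above
    "index": (541.650505, 381.159596, 922.810101, 0.173915),
    "pt_21_hist_noise_0.5": (2.223233, 1.647713, 3.870946, 0.148677),
    "pt_6_hist_noise_3": (12.170309, 9.142584, 21.312893, 0.142061),
}

checks = {}
for name, (mean_real, mean_fake, total_std, metric) in rows.items():
    # total_std equals the sum of the class means (up to print rounding)
    same_sum = abs((mean_real + mean_fake) - total_std) < 1e-5
    # metric equals |mean_real - mean_fake| / total_std, i.e. diff_rel
    same_ratio = abs(abs(mean_real - mean_fake) / total_std - metric) < 1e-5
    checks[name] = (same_sum, same_ratio)
    print(name, same_sum, same_ratio)
```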


## Feature selection [to be modified]

> relies on intuitive assumptions plus the noise-based method of detecting fakes

In [211]:
features = list(set(
    list(df_smart.filter(like='face_hist', axis=0).index) +  # for relative metrics
    list(df_smart.filter(like='noise_0.5', axis=0).head(7).index) +
    list(df_smart.filter(like='noise_3', axis=0).head(7).index) +
    list(df_smart.filter(like='std', axis=0).head(5).index) +
    list(df_smart.filter(like='reversed', axis=0).index) 
    # list(df_smart.filter(like='var', axis=0).head(3).index) 
))

print(len(features))
features

26


['pt_56_hist_noise_3',
 'pt_1_hist_noise_0.5',
 'overall_face_std_reversed',
 'pt_39_std',
 'pt_39_hist_noise_3',
 'overall_face_hist_std',
 'overall_face_hist_noise_3_reversed',
 'pt_7_hist_noise_3',
 'pt_24_hist_noise_0.5',
 'pt_7_std',
 'pt_36_hist_noise_0.5',
 'pt_1_std',
 'overall_face_hist_noise_0.5_reversed',
 'pt_52_std',
 'pt_30_hist_noise_0.5',
 'pt_0_std',
 'pt_57_hist_noise_3',
 'pt_65_hist_noise_3',
 'face_size_px_reversed',
 'pt_6_hist_noise_3',
 'pt_21_hist_noise_0.5',
 'pt_13_hist_noise_0.5',
 'overall_face_hist_noise_3',
 'overall_face_hist_std_reversed',
 'pt_8_hist_noise_3',
 'overall_face_hist_noise_0.5']
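
Building the list with `list(set(...))` removes duplicates, but the resulting order of string names can change between interpreter sessions, and with it the column order of `X`. If a stable order is wanted, one option is to deduplicate with `dict.fromkeys`, which keeps first-seen order. A minimal sketch; `unique_in_order` is a hypothetical helper and the two input lists are shortened examples.

```python
def unique_in_order(*groups):
    """Concatenate several name lists and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(name for group in groups for name in group))

face = ["overall_face_hist_std", "overall_face_hist_noise_3"]
noise = ["pt_6_hist_noise_3", "overall_face_hist_noise_3"]
stable_features = unique_in_order(face, noise)
print(stable_features)
# → ['overall_face_hist_std', 'overall_face_hist_noise_3', 'pt_6_hist_noise_3']
```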

## Splitting the data into train and test sets

In [212]:
# drop unneeded columns from the dataset before training the models (optional?)

df_compact = df.filter(regex='^(.(?!(raw)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ilename)))*$', axis=1) \
                                    .filter(regex='^(.(?!(ist_simple)))*$', axis=1)
df_compact.head(5)

Unnamed: 0,index,fake,face_size_px,pt_48_std,pt_49_std,pt_50_std,pt_51_std,pt_52_std,pt_53_std,pt_54_std,...,pt_13_hist_noise_3,pt_14_hist_noise_3,pt_15_hist_noise_3,pt_16_hist_noise_3,overall_face_hist_noise_3,face_size_px_reversed,overall_face_std_reversed,overall_face_hist_std_reversed,overall_face_hist_noise_0.5_reversed,overall_face_hist_noise_3_reversed
1801,1005,True,34596,16.804833,30.888799,25.723056,20.941755,14.477137,18.690843,21.148567,...,53.0,49.0,30.0,61.0,0.638802,2.9e-05,0.019224,0.00737,2.812683,1.56543
3319,701,True,186624,10.179104,13.568882,11.596774,15.865968,17.277402,24.243699,16.527363,...,23.090278,21.875,17.881944,17.708333,0.133959,5e-06,0.012503,0.001279,10.913684,7.46496
1006,210,True,68644,14.52872,14.97825,17.230362,14.234978,9.904893,13.478108,13.356713,...,26.020408,27.040816,30.102041,34.693878,0.365655,1.5e-05,0.01938,0.004713,4.428645,2.734821
3123,504,True,116964,8.429044,4.504664,3.750726,5.156462,13.085259,17.99277,12.67073,...,25.25,29.25,33.75,35.75,0.174413,9e-06,0.037992,0.001276,8.354571,5.733529
2019,11,True,23716,6.008437,9.371666,10.53972,11.493035,9.490075,9.356009,8.602765,...,26.5625,14.0625,3.125,9.375,0.763198,4.2e-05,0.022079,0.009189,2.136577,1.310276


In [213]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [214]:
y = df_compact['fake']
X = pd.DataFrame(df_compact.drop(columns=['fake']))
# selector = SelectKBest(f_classif, k=7)
# X = selector.fit_transform(X, y)
X =  df_compact[features]

# mask = selector.get_support()
# X = X[X.columns[mask]]

# df_compact.filter(like='size', axis=0).head()

# print(df_compact[X.columns[mask]].shape)
# df_compact[X.columns[mask]].head()

pd.DataFrame(X).head()

Unnamed: 0,pt_56_hist_noise_3,pt_1_hist_noise_0.5,overall_face_std_reversed,pt_39_std,pt_39_hist_noise_3,overall_face_hist_std,overall_face_hist_noise_3_reversed,pt_7_hist_noise_3,pt_24_hist_noise_0.5,pt_7_std,...,pt_57_hist_noise_3,pt_65_hist_noise_3,face_size_px_reversed,pt_6_hist_noise_3,pt_21_hist_noise_0.5,pt_13_hist_noise_0.5,overall_face_hist_noise_3,overall_face_hist_std_reversed,pt_8_hist_noise_3,overall_face_hist_noise_0.5
1801,16.0,0.0,0.019224,9.341113,12.0,135.693228,1.56543,23.0,0.0,16.730269,...,10.0,47.0,2.9e-05,13.0,0.0,0.0,0.638802,0.00737,13.0,0.355532
3319,6.423611,1.909722,0.012503,8.979228,6.597222,781.587807,7.46496,3.819444,2.256944,9.896725,...,4.166667,6.944444,5e-06,1.736111,0.868056,8.680556,0.133959,0.001279,3.993056,0.091628
1006,19.897959,0.0,0.01938,12.544866,20.408163,212.200607,2.734821,34.693878,0.0,25.086992,...,25.510204,18.877551,1.5e-05,26.020408,0.0,0.0,0.365655,0.004713,11.734694,0.225803
3123,5.25,2.0,0.037992,10.852898,8.0,783.917254,5.733529,3.75,1.25,8.382826,...,3.0,7.75,9e-06,1.25,0.0,7.75,0.174413,0.001276,1.25,0.119695
2019,0.0,0.0,0.022079,11.704324,25.0,108.828688,1.310276,3.125,0.0,2.512275,...,3.125,1.5625,4.2e-05,1.5625,0.0,0.0,0.763198,0.009189,3.125,0.468038


In [215]:
# print(df_compact.filter(like='size', axis=1).shape)
# df_compact.filter(like='size', axis=1).head()

In [216]:
# sizes_col = X.filter(like='size', axis=1).copy()
# X = pd.concat([
#     df_compact[X.columns[mask]],
#     sizes_col
# ], axis=0)
# # X.join(df_compact.filter(like='size', axis=1), lsuffix='_caller', rsuffix='_other')
# X.head()

In [217]:
X_train, X_test, y_train, y_test = train_test_split( 
    X, 
    y,
    test_size=0.20, 
    random_state=420)

In [218]:
pd.DataFrame(X_train).head()

Unnamed: 0,pt_56_hist_noise_3,pt_1_hist_noise_0.5,overall_face_std_reversed,pt_39_std,pt_39_hist_noise_3,overall_face_hist_std,overall_face_hist_noise_3_reversed,pt_7_hist_noise_3,pt_24_hist_noise_0.5,pt_7_std,...,pt_57_hist_noise_3,pt_65_hist_noise_3,face_size_px_reversed,pt_6_hist_noise_3,pt_21_hist_noise_0.5,pt_13_hist_noise_0.5,overall_face_hist_noise_3,overall_face_hist_std_reversed,pt_8_hist_noise_3,overall_face_hist_noise_0.5
1874,4.585799,2.071006,0.017269,16.71247,9.023669,1101.052972,9.913474,3.698225,4.733728,10.858803,...,4.289941,16.863905,5e-06,1.47929,2.810651,6.952663,0.100873,0.000908,3.698225,0.059993
3199,17.346939,0.0,0.012127,13.291437,16.326531,653.306081,2.7735,2.55102,0.0,7.443377,...,15.306122,23.469388,1.5e-05,4.081633,0.0,0.0,0.360555,0.001531,6.122449,0.241872
263,4.545455,0.826446,0.04076,5.947128,3.305785,986.286963,10.175294,1.239669,1.239669,2.435445,...,3.099174,5.991736,7e-06,4.752066,0.413223,0.0,0.098277,0.001014,3.099174,0.051307
1514,6.25,0.0,0.023876,11.542191,20.138889,253.709249,2.575311,0.0,0.0,2.365478,...,5.555556,9.027778,1.9e-05,0.694444,0.0,0.0,0.388303,0.003942,6.944444,0.252675
2329,13.888889,7.716049,0.014984,26.517308,22.222222,338.385216,4.482667,23.45679,1.54321,24.953704,...,14.197531,16.358025,9e-06,16.666667,1.234568,5.246914,0.223081,0.002955,4.320988,0.152439


In [219]:
y_train.head()

1874    False
3199     True
263      True
1514     True
2329    False
Name: fake, dtype: bool
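
The split above is not stratified, so the test set does not have to keep the 1:1 class ratio (the reports below show 101 real and 97 fake photos). If an exact ratio is wanted, `train_test_split` accepts a `stratify` argument. A sketch on toy data, with `X_toy` and `y_toy` as made-up stand-ins for the real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples with a perfect 1:1 class balance
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([True, False] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.20,
    random_state=420,
    stratify=y_toy,  # keep the class ratio in both parts
)
print(int(y_te.sum()), len(y_te))
```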

## Training the models

### Helper functions

In [220]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier

In [221]:
def train_polynomial_pipeline(model, X_train, y_train, X_test, y_test, p=2):
    """Fit scaler -> polynomial features (degree p) -> model; return the test confusion matrix and report."""
    clf = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=p), 
        model
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return (
        confusion_matrix(y_test, y_pred), 
        classification_report(
            y_test, 
            y_pred, 
            target_names=['class "real"', 'class "fakes"'], 
            # np.nan as zero_division requires scikit-learn 1.3 or newer
            zero_division=np.nan)
    )

In [222]:
def print_polynomial_pipeline(model, X_train, y_train, X_test, y_test, p=2):
    """Train the pipeline and print its confusion matrix and classification report."""
    # Local names chosen so they do not shadow the imported sklearn.metrics functions
    matrix, report = train_polynomial_pipeline(
        model, 
        X_train, 
        y_train, 
        X_test, 
        y_test,
        p
    )
    print(matrix)
    print(report)
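
With only 198 test photos, a single split gives noisy scores, which makes small differences between models hard to interpret. One option is to score the same pipeline with k-fold cross-validation. The sketch below uses synthetic data from `make_classification` and a hypothetical helper `cv_f1`, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def cv_f1(model, X, y, p=2, folds=5):
    """Macro-F1 of the scaler + polynomial + model pipeline on each cross-validation fold."""
    clf = make_pipeline(StandardScaler(), PolynomialFeatures(degree=p), model)
    return cross_val_score(clf, X, y, cv=folds, scoring="f1_macro")

# Synthetic stand-in for the feature matrix and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
scores = cv_f1(LogisticRegression(max_iter=1000), X_demo, y_demo, p=1)
print(scores.mean(), scores.std())
```

Because the scaler and the polynomial expansion sit inside the pipeline, they are refitted on the training part of every fold, so no statistics leak from the held-out part.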

### SGDClassifier (f1=0.60) [to be modified]

In [245]:
print_polynomial_pipeline(
    SGDClassifier(),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=1
) 

[[54 47]
 [31 66]]
               precision    recall  f1-score   support

 class "real"       0.64      0.53      0.58       101
class "fakes"       0.58      0.68      0.63        97

     accuracy                           0.61       198
    macro avg       0.61      0.61      0.60       198
 weighted avg       0.61      0.61      0.60       198



### SVC (f1=0.71) [to be modified]

In [249]:
print_polynomial_pipeline(
    SVC(gamma='auto'),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
) 

[[65 36]
 [22 75]]
               precision    recall  f1-score   support

 class "real"       0.75      0.64      0.69       101
class "fakes"       0.68      0.77      0.72        97

     accuracy                           0.71       198
    macro avg       0.71      0.71      0.71       198
 weighted avg       0.71      0.71      0.71       198



### LogisticRegression (f1=0.64) [to be modified]

In [252]:
print_polynomial_pipeline(
    # LogisticRegression(max_iter=15000)
    LogisticRegression(
        max_iter=15000, 
        # penalty=None,
        class_weight='balanced',
        solver='liblinear',
        # tol=1e-6
    ),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=1
) 

[[60 41]
 [31 66]]
               precision    recall  f1-score   support

 class "real"       0.66      0.59      0.62       101
class "fakes"       0.62      0.68      0.65        97

     accuracy                           0.64       198
    macro avg       0.64      0.64      0.64       198
 weighted avg       0.64      0.64      0.64       198



### NN - MLPClassifier (f1=0.64) [to be modified]

In [244]:
print_polynomial_pipeline(
    # MLPClassifier(max_iter=5000), # 0.57
    MLPClassifier(
        solver='lbfgs', 
        hidden_layer_sizes=(34,), # 5, 7, 7, 3 - 0.60
        random_state=1, 
        alpha=0.001, 
        # activation='relu',
        tol=1e-6,
        max_fun=15000,
        max_iter=15000),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
) 

[[66 35]
 [37 60]]
               precision    recall  f1-score   support

 class "real"       0.64      0.65      0.65       101
class "fakes"       0.63      0.62      0.62        97

     accuracy                           0.64       198
    macro avg       0.64      0.64      0.64       198
 weighted avg       0.64      0.64      0.64       198



### RandomForestClassifier (good, f1=0.68) [to be modified]

In [239]:
print_polynomial_pipeline(
    RandomForestClassifier(max_depth=7, random_state=42),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=1
) 

[[64 37]
 [26 71]]
               precision    recall  f1-score   support

 class "real"       0.71      0.63      0.67       101
class "fakes"       0.66      0.73      0.69        97

     accuracy                           0.68       198
    macro avg       0.68      0.68      0.68       198
 weighted avg       0.68      0.68      0.68       198



### DecisionTreeClassifier (f1=0.57) [to be modified]

In [238]:
print_polynomial_pipeline(
    DecisionTreeClassifier(max_depth=5, min_samples_split=10, min_samples_leaf=5),
    X_train, 
    y_train, 
    X_test, 
    y_test
)

[[50 51]
 [34 63]]
               precision    recall  f1-score   support

 class "real"       0.60      0.50      0.54       101
class "fakes"       0.55      0.65      0.60        97

     accuracy                           0.57       198
    macro avg       0.57      0.57      0.57       198
 weighted avg       0.57      0.57      0.57       198



### RadiusNeighborsClassifier (f1=0.48) [vulnerable to outliers]

In [229]:
print_polynomial_pipeline(
    RadiusNeighborsClassifier(
        radius=100, 
        weights='distance', 
        p=1, 
        outlier_label='most_frequent'),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=2
)

[[23 78]
 [17 80]]
               precision    recall  f1-score   support

 class "real"       0.57      0.23      0.33       101
class "fakes"       0.51      0.82      0.63        97

     accuracy                           0.52       198
    macro avg       0.54      0.53      0.48       198
 weighted avg       0.54      0.52      0.47       198



### KNeighborsClassifier (f1=0.65) [to be modified]

In [233]:
print_polynomial_pipeline(
    KNeighborsClassifier(n_neighbors=4, weights='distance'),
    X_train, 
    y_train, 
    X_test, 
    y_test,
    p=1
)

[[72 29]
 [39 58]]
               precision    recall  f1-score   support

 class "real"       0.65      0.71      0.68       101
class "fakes"       0.67      0.60      0.63        97

     accuracy                           0.66       198
    macro avg       0.66      0.66      0.65       198
 weighted avg       0.66      0.66      0.66       198
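
To compare the eight runs side by side, the macro-averaged F1 can be recomputed directly from the confusion matrices printed above. The matrices are copied from the outputs in this section; `macro_f1` is a helper written for this comparison.

```python
def macro_f1(matrix):
    """Macro-averaged F1 from a 2x2 confusion matrix [[TN, FP], [FN, TP]] (rows are true classes)."""
    (tn, fp), (fn, tp) = matrix
    f1_real = 2 * tn / (2 * tn + fp + fn)   # "real" treated as the positive class
    f1_fake = 2 * tp / (2 * tp + fp + fn)   # "fakes" treated as the positive class
    return (f1_real + f1_fake) / 2

matrices = {
    "SGDClassifier": [[54, 47], [31, 66]],
    "SVC": [[65, 36], [22, 75]],
    "LogisticRegression": [[60, 41], [31, 66]],
    "MLPClassifier": [[66, 35], [37, 60]],
    "RandomForestClassifier": [[64, 37], [26, 71]],
    "DecisionTreeClassifier": [[50, 51], [34, 63]],
    "RadiusNeighborsClassifier": [[23, 78], [17, 80]],
    "KNeighborsClassifier": [[72, 29], [39, 58]],
}

# Print the models from best to worst macro-F1
ranking = sorted(matrices, key=lambda name: macro_f1(matrices[name]), reverse=True)
for name in ranking:
    print(f"{name:26s} {macro_f1(matrices[name]):.2f}")
```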

