**Loading the dataset and importing libraries**

In [1]:
!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl.metadata (9.2 kB)
Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [2]:
import opendatasets as od
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

od.download('https://www.kaggle.com/datasets/martaarroyo/palmer-penguins-for-binary-classification/data')

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: daniillosev
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/martaarroyo/palmer-penguins-for-binary-classification
Downloading palmer-penguins-for-binary-classification.zip to ./palmer-penguins-for-binary-classification


100%|██████████| 2.42k/2.42k [00:00<00:00, 2.22MB/s]







In [3]:
# Relative path: od.download() saved the archive contents under the current working directory.
df = pd.read_csv('palmer-penguins-for-binary-classification/penguins_binary_classification.csv')

**Data analysis**

In [None]:
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,2007
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,2007
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,2007
...,...,...,...,...,...,...,...
269,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,2009
270,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,2009
271,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,2009
272,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,2009


This dataset describes penguins of different species and contains the following columns:


*   Penguin species
*   Island of residence
*   Bill length (mm)
*   Bill depth (mm)
*   Flipper length (mm)
*   Body mass (g)
*   Year (presumably the year of observation)



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274 entries, 0 to 273
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            274 non-null    object 
 1   island             274 non-null    object 
 2   bill_length_mm     274 non-null    float64
 3   bill_depth_mm      274 non-null    float64
 4   flipper_length_mm  274 non-null    float64
 5   body_mass_g        274 non-null    float64
 6   year               274 non-null    int64  
dtypes: float64(4), int64(1), object(2)
memory usage: 15.1+ KB


**Conclusion:** two columns (species and island) hold text values and must be converted to numbers.

In [None]:
df.isna().sum()

Unnamed: 0,0
species,0
island,0
bill_length_mm,0
bill_depth_mm,0
flipper_length_mm,0
body_mass_g,0
year,0


**Conclusion:** there are no missing values, so no imputation is needed.

In [None]:
df['species'].value_counts()

Unnamed: 0_level_0,count
species,Unnamed: 1_level_1
Adelie,151
Gentoo,123


In [None]:
df['island'].value_counts()

Unnamed: 0_level_0,count
island,Unnamed: 1_level_1
Biscoe,167
Dream,56
Torgersen,51


**Conclusion:** the dataset covers only 2 penguin species and 3 islands.

**Data preprocessing**

To convert the string feature (island) to numbers we use OneHotEncoder: the column has only a few distinct values, and, unlike LabelEncoder, one-hot encoding does not impose an artificial order that the model could read as one category ranking above another. For the target variable we use LabelEncoder, because the target must stay one-dimensional.
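The difference between the two encoders can be seen on a tiny example. This is a minimal sketch on a hand-made array of the three island names from the dataset; the variable names `islands`, `label_codes` and `one_hot` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy sample of the three island names found in the dataset.
islands = np.array(["Torgersen", "Biscoe", "Dream", "Biscoe"])

# LabelEncoder maps categories to integers in sorted order,
# which implies Biscoe (0) < Dream (1) < Torgersen (2).
label_codes = LabelEncoder().fit_transform(islands)

# OneHotEncoder gives every category its own 0/1 column, so no order is implied.
one_hot = OneHotEncoder().fit_transform(islands.reshape(-1, 1)).toarray()

print(label_codes)
print(one_hot)
```

With integer codes a linear model would treat Torgersen as "twice Dream", which has no meaning for island names; the one-hot columns avoid that.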

In [None]:
# Separate the features from the target column.
x = df.drop(columns=['species'])
y = df['species']
# No random_state is set, so the split (and the metrics below) vary between runs.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
# Reset the indices so the one-hot columns can later be concatenated row by row.
x_train = x_train.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)
# Keep the targets as single-column DataFrames, as the cells below expect.
y_train = y_train.reset_index(drop=True).to_frame()
y_test = y_test.reset_index(drop=True).to_frame()

In [None]:
one_hot_encoder = OneHotEncoder()
one_hot_encoder.fit(x_train['island'].values.reshape(-1, 1))

In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(y_train['species'])


In [None]:
for idx, label in enumerate(label_encoder.classes_):
    print(f"{label} - {idx}")

Adelie - 0
Gentoo - 1


In [None]:
y_train['species'] = label_encoder.transform(y_train['species'])
y_test['species'] = label_encoder.transform(y_test['species'])

In [None]:
x_new_train = one_hot_encoder.transform(x_train['island'].values.reshape(-1, 1)).toarray()
encoded_columns = one_hot_encoder.get_feature_names_out(['island'])
encoded_df = pd.DataFrame(x_new_train, columns=encoded_columns)
x_train = pd.concat([x_train, encoded_df], axis=1)
x_train.drop(columns=['island'], inplace=True)

In [None]:
x_new_test = one_hot_encoder.transform(x_test['island'].values.reshape(-1, 1)).toarray()
encoded_columns = one_hot_encoder.get_feature_names_out(['island'])
encoded_df = pd.DataFrame(x_new_test, columns=encoded_columns)
x_test = pd.concat([x_test, encoded_df], axis=1)
x_test.drop(columns=['island'], inplace=True)
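The fit/transform/concat steps above are repeated for the train and test sets. As a possible alternative (not what this notebook does), the same encoding can be bundled with the model using scikit-learn's ColumnTransformer and Pipeline. The sketch below is self-contained and uses six rows copied from the table preview earlier in the notebook; the names `rows`, `preprocess` and `pipeline` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Six rows copied from the DataFrame preview (rows 0, 1, 3 and 269, 270, 271).
rows = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Torgersen", "Biscoe", "Biscoe", "Biscoe"],
    "bill_length_mm": [39.1, 39.5, 36.7, 47.2, 46.8, 50.4],
    "bill_depth_mm": [18.7, 17.4, 19.3, 13.7, 14.3, 15.7],
    "flipper_length_mm": [181.0, 186.0, 193.0, 214.0, 215.0, 222.0],
    "body_mass_g": [3750.0, 3800.0, 3450.0, 4925.0, 4850.0, 5750.0],
})

# One-hot encode the island column and pass the numeric columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("island", OneHotEncoder(handle_unknown="ignore"), ["island"])],
    remainder="passthrough",
)

# The pipeline fits the encoder and the classifier in one call.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X = rows.drop(columns=["species"])
y = rows["species"]
pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

With this layout the encoder is fitted only on whatever is passed to `fit`, so calling `pipeline.predict` on a test set reuses the training categories without a second hand-written transform cell.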

**Training the model**

In [None]:
model = LogisticRegression()

In [None]:
model.fit(x_train, y_train['species'])

In [None]:
answers_pred = model.predict(x_test)

**Evaluation**

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, answers_pred).ravel()
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)
print("TP:", tp)


TN: 39
FP: 1
FN: 0
TP: 29


**Conclusion:** the model made only 1 error: it assigned a negative-class sample (Adelie, 0) to the positive class (Gentoo, 1), which is a false positive.

In [None]:
accuracy_score(y_test, answers_pred)

0.9855072463768116

**Conclusion:** accuracy is about 98.6% (68 of 69 test samples classified correctly).

In [None]:
recall_score(y_test, answers_pred)

1.0

In [None]:
precision_score(y_test, answers_pred)

0.9666666666666667

In [None]:
f1_score(y_test, answers_pred)

0.9830508474576272
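
The four scores above follow directly from the confusion-matrix counts. As a quick check, this sketch recomputes them by hand from the TN, FP, FN and TP values printed earlier, using the standard definitions of the metrics.

```python
# Confusion-matrix counts printed earlier (positive class: Gentoo = 1).
tn, fp, fn, tp = 39, 1, 0, 29

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The single false positive lowers precision (29 of 30 predicted Gentoo are correct), while recall stays at 1.0 because no Gentoo sample was missed.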