You have learned the basic data-processing skills; now it is time to put them into practice. This time you will work on a classification task.

You are given a dataset from an animal shelter. Your task is to train a model that, from the available features, predicts the labels 'Adoption' and 'Transfer' (the "outcome_type" column) as confidently as possible.

You are free to do whatever you like here. I want to see the following from you:
1. Check for missing values and handle them
2. Check the relationships between features
3. Try to create your own features
4. Remove unnecessary ones
5. Pay attention to the text columns. Think about what useful information can be extracted from them
6. Using a profiler will help.
7. Don't forget that you have PCA (principal component analysis). It may come in handy.
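
For item 2, one possible way to look at the relationship between a categorical feature and the target is a row-normalized cross-tabulation. The snippet below is only a sketch on a tiny made-up frame: the column names animal_type and outcome_type come from the dataset, while the rows and the names toy and outcome_share are illustrative assumptions.

```python
import pandas as pd

# Tiny illustrative frame; the rows are made up, not taken from the dataset.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Dog", "Cat", "Cat", "Dog", "Cat"],
    "outcome_type": ["Adoption", "Transfer", "Transfer",
                     "Transfer", "Adoption", "Adoption"],
})

# Share of each outcome within every animal type (each row sums to 1).
outcome_share = pd.crosstab(toy["animal_type"], toy["outcome_type"],
                            normalize="index")
print(outcome_share)
```

Rows whose shares differ noticeably between animal types suggest that the feature is worth keeping for the classifier.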

Recall everything I covered in the previous sessions. Not all of it will be useful, but in real life nobody will tell you what to use :)

A good classifier for this task is the "Random Forest" (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

Understanding how the "forest" works is not required at this stage, but its predictions will be better than those of a linear classifier. (If you are interested, here is a guide: https://adataanalyst.com/scikit-learn/linear-classification-method/)

Good luck :)

In [99]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import seaborn as sns
from matplotlib import pyplot as plt

data = pd.read_csv("aac_shelter_outcomes.csv")
data.head(20)

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,,Rabies Risk,Euthanasia,Unknown
5,4 months,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,Transfer,Intact Male
6,1 year,A693700,Other,Squirrel Mix,Tan,2013-12-13T00:00:00,2014-12-13T12:20:00,2014-12-13T12:20:00,,Suffering,Euthanasia,Unknown
7,3 years,A692618,Dog,Chihuahua Shorthair Mix,Brown,2011-11-23T00:00:00,2014-12-08T15:55:00,2014-12-08T15:55:00,*Ella,Partner,Transfer,Spayed Female
8,1 month,A685067,Cat,Domestic Shorthair Mix,Blue Tabby/White,2014-06-16T00:00:00,2014-08-14T18:45:00,2014-08-14T18:45:00,Lucy,,Adoption,Intact Female
9,3 months,A678580,Cat,Domestic Shorthair Mix,White/Black,2014-03-26T00:00:00,2014-06-29T17:45:00,2014-06-29T17:45:00,*Frida,Offsite,Adoption,Spayed Female


Let us put forward a few hypotheses. The decision to leave an animal or take it home is probably influenced most by the parameters 

In [100]:
data.describe()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
count,78248,78256,78256,78256,78256,78256,78256,78256,54370,35963,78244,78254
unique,46,70855,5,2128,525,5869,64361,64361,14574,19,9,5
top,1 year,A706536,Dog,Domestic Shorthair Mix,Black/White,2014-05-05T00:00:00,2016-04-18T00:00:00,2016-04-18T00:00:00,Bella,Partner,Adoption,Neutered Male
freq,14355,11,44242,23335,8153,112,39,39,344,19660,33112,27784


In [101]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 12 columns):
age_upon_outcome    78248 non-null object
animal_id           78256 non-null object
animal_type         78256 non-null object
breed               78256 non-null object
color               78256 non-null object
date_of_birth       78256 non-null object
datetime            78256 non-null object
monthyear           78256 non-null object
name                54370 non-null object
outcome_subtype     35963 non-null object
outcome_type        78244 non-null object
sex_upon_outcome    78254 non-null object
dtypes: object(12)
memory usage: 7.2+ MB


In [102]:
data["age_upon_outcome"].unique()

array(['2 weeks', '1 year', '9 years', '5 months', '4 months', '3 years',
       '1 month', '3 months', '2 years', '2 months', '4 years', '8 years',
       '3 weeks', '8 months', '12 years', '7 years', '5 years', '6 years',
       '5 days', '10 months', '4 weeks', '10 years', '2 days', '6 months',
       '14 years', '11 months', '15 years', '7 months', '13 years',
       '11 years', '16 years', '9 months', '3 days', '6 days', '4 days',
       '5 weeks', '1 week', '1 day', '1 weeks', '0 years', '17 years',
       '20 years', '18 years', '19 years', '22 years', '25 years', nan],
      dtype=object)

In [103]:
data["age_upon_outcome"].value_counts()

1 year       14355
2 years      11194
2 months      9213
3 years       5157
3 months      3442
1 month       3344
4 years       2990
5 years       2691
4 months      2425
5 months      1951
6 months      1897
6 years       1810
8 years       1554
7 years       1537
3 weeks       1467
2 weeks       1330
10 months     1204
4 weeks       1194
8 months      1178
10 years      1159
7 months       963
9 years        822
9 months       673
12 years       609
1 weeks        513
11 months      490
11 years       429
1 week         427
13 years       389
14 years       253
3 days         235
2 days         217
15 years       208
1 day          153
6 days         152
4 days         136
5 days         116
16 years       101
0 years         95
5 weeks         61
17 years        58
18 years        26
19 years        13
20 years        12
22 years         4
25 years         1
Name: age_upon_outcome, dtype: int64
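
The values above are strings of the form "<number> <unit>", so a classifier cannot use them directly. A minimal sketch of one way to turn them into an approximate number of days follows. The names age_to_days, DAYS_PER_UNIT and age_days are hypothetical, and the rounded lengths of a month (30 days) and a year (365 days) are assumptions.

```python
import numpy as np
import pandas as pd

# Approximate days per unit; month and year lengths are rounded assumptions.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age):
    """Convert strings such as '2 weeks' or '1 year' to approximate days."""
    if pd.isna(age):
        return np.nan
    number, unit = age.split()
    unit = unit.rstrip("s")  # 'weeks' -> 'week', 'years' -> 'year'
    return int(number) * DAYS_PER_UNIT[unit]

# A few values copied from the unique() output above, plus a missing one.
ages = pd.Series(["2 weeks", "1 year", "5 months", "1 weeks", "0 years", np.nan])
age_days = ages.apply(age_to_days)
print(age_days.tolist())
```

Stripping the trailing "s" also merges the inconsistent spellings '1 week' and '1 weeks' that appear as separate categories in the counts.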

In [104]:
data.isnull().sum()

age_upon_outcome        8
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23886
outcome_subtype     42293
outcome_type           12
sex_upon_outcome        2
dtype: int64

In [105]:
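# Fill missing ages with the most frequent value; assigning the result back
# avoids relying on inplace=True applied to a column selection
data['age_upon_outcome'] = data['age_upon_outcome'].fillna('1 year')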

In [106]:
data.isnull().sum()

age_upon_outcome        0
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23886
outcome_subtype     42293
outcome_type           12
sex_upon_outcome        2
dtype: int64

In [107]:
# Fill missing sex values with the most frequent category
data['sex_upon_outcome'] = data['sex_upon_outcome'].fillna('Neutered Male')

In [108]:
data.isnull().sum()

age_upon_outcome        0
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23886
outcome_subtype     42293
outcome_type           12
sex_upon_outcome        0
dtype: int64

In [109]:
# Fill missing target labels with the most frequent class
data['outcome_type'] = data['outcome_type'].fillna('Adoption')

In [110]:
data.isnull().sum()

age_upon_outcome        0
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23886
outcome_subtype     42293
outcome_type            0
sex_upon_outcome        0
dtype: int64

In [111]:
# Fill the remaining gaps with the most frequent values from describe()
data['outcome_subtype'] = data['outcome_subtype'].fillna('Partner')
data['name'] = data['name'].fillna('Bella')

In [112]:
data.isnull().sum()

age_upon_outcome    0
animal_id           0
animal_type         0
breed               0
color               0
date_of_birth       0
datetime            0
monthyear           0
name                0
outcome_subtype     0
outcome_type        0
sex_upon_outcome    0
dtype: int64
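
The fill values used above match the `top` row of data.describe(). As a hedged alternative, not the approach taken in this notebook, the most frequent value can be computed from the data instead of being typed by hand, rows without a target label can be dropped rather than assigned a class, and a missing name can be kept as its own signal. The sketch below uses a made-up frame; has_name and most_common_sex are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the missing-value pattern of the dataset.
toy = pd.DataFrame({
    "sex_upon_outcome": ["Neutered Male", np.nan, "Spayed Female", "Neutered Male"],
    "name": ["Lucy", np.nan, np.nan, "Bella"],
    "outcome_type": ["Adoption", "Transfer", np.nan, "Adoption"],
})

# Rows without a target label cannot be used for supervised training.
toy = toy.dropna(subset=["outcome_type"]).copy()

# Assumption: whether an animal has a name may itself be informative.
toy["has_name"] = toy["name"].notna().astype(int)

# Fill a categorical feature with its most frequent value from the data.
most_common_sex = toy["sex_upon_outcome"].mode()[0]
toy["sex_upon_outcome"] = toy["sex_upon_outcome"].fillna(most_common_sex)
print(toy)
```

With 23886 missing names in the real data, replacing all of them with 'Bella' would make that name dominate the column, which is why a flag is sketched here instead.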

In [113]:
data.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,Bella,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,Partner,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,Bella,Rabies Risk,Euthanasia,Unknown


The datetime and monthyear features carry little information and can be dropped from the dataset. date_of_birth is redundant because the same information is already given in age_upon_outcome, and animal_id carries no information.
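
The redundancy claim can be checked by computing the age from the two date columns and comparing it with age_upon_outcome. The sketch below uses the first two rows of the data.head(20) output shown earlier; the names toy and age_days are hypothetical.

```python
import pandas as pd

# First two rows of the dataset, copied from the head() output.
toy = pd.DataFrame({
    "age_upon_outcome": ["2 weeks", "1 year"],
    "date_of_birth": ["2014-07-07T00:00:00", "2012-11-06T00:00:00"],
    "datetime": ["2014-07-22T16:04:00", "2013-11-07T11:47:00"],
})

# Parse the ISO-formatted strings and take the difference in whole days.
born = pd.to_datetime(toy["date_of_birth"])
outcome = pd.to_datetime(toy["datetime"])
toy["age_days"] = (outcome - born).dt.days
print(toy[["age_upon_outcome", "age_days"]])
```

For these two rows the computed ages are 15 and 366 days, which agrees with the rounded labels '2 weeks' and '1 year'.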

In [114]:
data.drop(['datetime', 'monthyear', 'date_of_birth','animal_id'], axis='columns', inplace=True)

In [116]:
data.head(10)

Unnamed: 0,age_upon_outcome,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,Cat,Domestic Shorthair Mix,Orange Tabby,Bella,Partner,Transfer,Intact Male
1,1 year,Dog,Beagle Mix,White/Brown,Lucy,Partner,Transfer,Spayed Female
2,1 year,Dog,Pit Bull,Blue/White,*Johnny,Partner,Adoption,Neutered Male
3,9 years,Dog,Miniature Schnauzer Mix,White,Monday,Partner,Transfer,Neutered Male
4,5 months,Other,Bat Mix,Brown,Bella,Rabies Risk,Euthanasia,Unknown
5,4 months,Dog,Leonberger Mix,Brown/White,*Edgar,Partner,Transfer,Intact Male
6,1 year,Other,Squirrel Mix,Tan,Bella,Suffering,Euthanasia,Unknown
7,3 years,Dog,Chihuahua Shorthair Mix,Brown,*Ella,Partner,Transfer,Spayed Female
8,1 month,Cat,Domestic Shorthair Mix,Blue Tabby/White,Lucy,Partner,Adoption,Intact Female
9,3 months,Cat,Domestic Shorthair Mix,White/Black,*Frida,Offsite,Adoption,Spayed Female


In [117]:
data['animal_type'].value_counts()

Dog          44242
Cat          29422
Other         4249
Bird           334
Livestock        9
Name: animal_type, dtype: int64

In [118]:
data['color'].value_counts()

Black/White                  8153
Black                        6602
Brown Tabby                  4445
Brown                        3486
White                        2784
Brown/White                  2444
Tan/White                    2394
Brown Tabby/White            2338
Orange Tabby                 2180
White/Black                  2100
Blue/White                   2081
Tricolor                     1982
Tan                          1963
Black/Tan                    1829
White/Brown                  1577
Black/Brown                  1532
Brown Brindle/White          1353
Tortie                       1340
Calico                       1338
Blue                         1326
Brown/Black                  1323
White/Tan                    1160
Blue Tabby                   1130
Orange Tabby/White           1095
Red                          1029
Red/White                     860
Torbie                        845
Brown Brindle                 715
Tan/Black                     607
Chocolate/Whit

In [163]:
#import re
#test = 'White/Lilac Point'
#rez = re.search(r"\/", test)
#print(rez.group())
import re

def two_color(a):
    # Return 1 if the color string contains a slash (two colors), otherwise 0
    return 0 if re.search(r"/", a) is None else 1

In [164]:
data['two_color'] = data['color'].apply(two_color)

In [165]:
data.head()

Unnamed: 0,age_upon_outcome,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome,two_color
0,2 weeks,Cat,Domestic Shorthair Mix,Orange Tabby,Bella,Partner,Transfer,Intact Male,0
1,1 year,Dog,Beagle Mix,White/Brown,Lucy,Partner,Transfer,Spayed Female,1
2,1 year,Dog,Pit Bull,Blue/White,*Johnny,Partner,Adoption,Neutered Male,1
3,9 years,Dog,Miniature Schnauzer Mix,White,Monday,Partner,Transfer,Neutered Male,0
4,5 months,Other,Bat Mix,Brown,Bella,Rabies Risk,Euthanasia,Unknown,0


In [182]:
import re

def mix_breed(a):
    # Return 1 if the breed string contains "Mix", otherwise 0
    return 0 if re.search(r"Mix", a) is None else 1

In [183]:
data['mix_breed'] = data['breed'].apply(mix_breed)

In [185]:
data.head(15)

Unnamed: 0,age_upon_outcome,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome,two_color,mix_breed
0,2 weeks,Cat,Domestic Shorthair Mix,Orange Tabby,Bella,Partner,Transfer,Intact Male,0,1
1,1 year,Dog,Beagle Mix,White/Brown,Lucy,Partner,Transfer,Spayed Female,1,1
2,1 year,Dog,Pit Bull,Blue/White,*Johnny,Partner,Adoption,Neutered Male,1,0
3,9 years,Dog,Miniature Schnauzer Mix,White,Monday,Partner,Transfer,Neutered Male,0,1
4,5 months,Other,Bat Mix,Brown,Bella,Rabies Risk,Euthanasia,Unknown,0,1
5,4 months,Dog,Leonberger Mix,Brown/White,*Edgar,Partner,Transfer,Intact Male,1,1
6,1 year,Other,Squirrel Mix,Tan,Bella,Suffering,Euthanasia,Unknown,0,1
7,3 years,Dog,Chihuahua Shorthair Mix,Brown,*Ella,Partner,Transfer,Spayed Female,0,1
8,1 month,Cat,Domestic Shorthair Mix,Blue Tabby/White,Lucy,Partner,Adoption,Intact Female,1,1
9,3 months,Cat,Domestic Shorthair Mix,White/Black,*Frida,Offsite,Adoption,Spayed Female,1,1
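
Both flags can also be built without a helper function by using the vectorized pandas string methods. This is a sketch of an equivalent formulation, assuming the columns contain no missing values (true for color and breed here); the frame toy holds the first three rows of the table above.

```python
import pandas as pd

# First three rows of the color and breed columns from the table above.
toy = pd.DataFrame({
    "color": ["Orange Tabby", "White/Brown", "Blue/White"],
    "breed": ["Domestic Shorthair Mix", "Beagle Mix", "Pit Bull"],
})

# regex=False treats the pattern as a literal substring, not a regex.
toy["two_color"] = toy["color"].str.contains("/", regex=False).astype(int)
toy["mix_breed"] = toy["breed"].str.contains("Mix", regex=False).astype(int)
print(toy)
```

The resulting flags match the two_color and mix_breed values shown for rows 0 to 2.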


In [186]:
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [195]:
y = data['outcome_type']
X = data.drop(['outcome_type'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
from sklearn.ensemble import RandomForestClassifier

# Create a forest model with one hundred trees
model = RandomForestClassifier(n_estimators=100, 
                               bootstrap = True,
                               max_features = 'sqrt')
# Train on the training data
model.fit(X_train, y_train)

# Predicted class labels
rf_predictions = model.predict(X_test)
# Predicted probability of the second class listed in model.classes_
rf_probs = model.predict_proba(X_test)[:, 1]

ValueError: could not convert string to float: '4 years'
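
The error appears because RandomForestClassifier expects numeric input, while X still contains string columns such as age_upon_outcome. One possible fix, sketched below on a made-up frame, is to keep only the two classes named in the task and one-hot encode the categorical columns with pd.get_dummies before fitting. The rows, the name toy and the name transfer_idx are illustrative assumptions. outcome_subtype is left out on the assumption that it describes the outcome itself and would leak the target.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared frame; the rows are made up for illustration.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat", "Dog", "Cat", "Dog", "Cat", "Dog", "Cat"] * 5,
    "sex_upon_outcome": ["Neutered Male", "Intact Male",
                         "Spayed Female", "Intact Female"] * 10,
    "two_color": [0, 1, 1, 0, 1, 0, 0, 1] * 5,
    "outcome_type": ["Adoption", "Transfer", "Adoption", "Transfer"] * 10,
})

# Keep only the two classes named in the task.
toy = toy[toy["outcome_type"].isin(["Adoption", "Transfer"])]

y = toy["outcome_type"]
# One-hot encode the string columns so that every feature is numeric.
X = pd.get_dummies(toy.drop(columns=["outcome_type"]),
                   columns=["animal_type", "sex_upon_outcome"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30, stratify=y)

model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               random_state=30)
model.fit(X_train, y_train)

rf_predictions = model.predict(X_test)
# Columns of predict_proba follow model.classes_, so look up 'Transfer'.
transfer_idx = list(model.classes_).index("Transfer")
rf_probs = model.predict_proba(X_test)[:, transfer_idx]
print(accuracy_score(y_test, rf_predictions))
```

On the real data, high-cardinality columns such as breed, color and name would produce very many dummy columns, so they may need to be reduced to a few derived features (like two_color and mix_breed) first.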