You have picked up the basic data-processing skills; now it is time to put them into practice. Your task this time is a classification problem.

You are given a dataset from an animal shelter. Your task is to train a model that predicts the 'Adoption' and 'Transfer' labels (column "outcome_type") as confidently as possible from the available features.

You are free to do whatever you like here. What I want to see from you:
1. Check for missing values and handle them
2. Check the relationships between features
3. Try creating your own features
4. Remove the redundant ones
5. Pay attention to the text columns. Think about what useful information can be extracted from them
6. Use a profiler; it will help.
7. Do not forget that you have PCA (principal component analysis). It may come in handy.

Recall everything I covered in the previous sessions. Not all of it will be useful, but in real life nobody will tell you what to use :)

A good classifier for this task is "Random Forest" (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

You do not need to understand how the "forest" works at this stage, but its predictions are expected to be better than those of a linear classifier (if you are interested, here is a guide: https://adataanalyst.com/scikit-learn/linear-classification-method/).
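
As a rough illustration of the classifier suggested above, the sketch below trains a `RandomForestClassifier` on a tiny made-up table. The toy rows, the `age_days` column and the hyperparameters are assumptions chosen only for illustration; they are not the shelter data and not a tuned setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration (not taken from the shelter dataset).
toy = pd.DataFrame({
    "animal_type": ["Cat", "Dog", "Dog", "Cat", "Dog", "Cat", "Dog", "Cat"] * 5,
    "age_days": [15, 366, 429, 3300, 126, 59, 1129, 80] * 5,
    "outcome_type": ["Transfer", "Transfer", "Adoption", "Transfer",
                     "Transfer", "Adoption", "Adoption", "Adoption"] * 5,
})

# One-hot encode the categorical column; the numeric column is passed through.
X = pd.get_dummies(toy[["animal_type", "age_days"]], columns=["animal_type"])
y = toy["outcome_type"]

# Hold out a quarter of the rows, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

On the real data the same pattern applies once every feature column is numeric or one-hot encoded.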

Good luck :)

In [425]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
from scipy import stats

In [426]:
data = pd.read_csv('data/aac_shelter_outcomes.csv')

In [427]:
data

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,,Rabies Risk,Euthanasia,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...
78251,1 month,A764894,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04T00:00:00,2018-02-01T18:26:00,2018-02-01T18:26:00,,Foster,Adoption,Spayed Female
78252,3 years,A764468,Dog,Mastiff Mix,Blue/White,2014-12-30T00:00:00,2018-02-01T18:06:00,2018-02-01T18:06:00,Max,,Adoption,Neutered Male
78253,,A766098,Other,Bat Mix,Brown,2017-02-01T00:00:00,2018-02-01T18:08:00,2018-02-01T18:08:00,,Rabies Risk,Euthanasia,Unknown
78254,2 months,A765858,Dog,Standard Schnauzer,Red,2017-11-13T00:00:00,2018-02-01T18:32:00,2018-02-01T18:32:00,,,Adoption,Spayed Female


In [428]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   age_upon_outcome  78248 non-null  object
 1   animal_id         78256 non-null  object
 2   animal_type       78256 non-null  object
 3   breed             78256 non-null  object
 4   color             78256 non-null  object
 5   date_of_birth     78256 non-null  object
 6   datetime          78256 non-null  object
 7   monthyear         78256 non-null  object
 8   name              54370 non-null  object
 9   outcome_subtype   35963 non-null  object
 10  outcome_type      78244 non-null  object
 11  sex_upon_outcome  78254 non-null  object
dtypes: object(12)
memory usage: 7.2+ MB


In [429]:
data['outcome_type'].value_counts()

Adoption           33112
Transfer           23499
Return to Owner    14354
Euthanasia          6080
Died                 680
Disposal             307
Rto-Adopt            150
Missing               46
Relocate              16
Name: outcome_type, dtype: int64

# Plan of action

1. Keep only the rows where outcome_type is Adoption or Transfer.

2. Handle the columns with missing values:

age_upon_outcome: rows with missing values can be dropped; there are only a few of them, so this will not affect the model

name: there are too many missing values to drop the rows, but the column itself can be dropped, since the name is assumed to have no effect on the outcome

outcome_subtype: the rows cannot be dropped, there are too many missing values

outcome_type: rows with missing values can be dropped; there are only a few of them, so this will not affect the model

sex_upon_outcome: rows with missing values can be dropped; there are only a few of them, so this will not affect the model

3. Drop the animal_id column; it is just an assigned identifier and carries no information for the model.

4. Convert the age_upon_outcome column to numeric values.

5. Encode the outcome_subtype column with one-hot encoding.

6. The date_of_birth, datetime and monthyear columns are related to age_upon_outcome, and datetime and monthyear duplicate each other because of an anomaly in the data, so they are dropped.
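
Step 4 of the plan can be sketched directly on the text values. This is a minimal sketch, assuming every non-missing value has the form "<number> <unit>" with the unit being day, week, month or year; the 30-day month and 365-day year are approximations, and `age_to_days` is a hypothetical helper. The notebook itself later derives the age from the two date columns instead.

```python
import numpy as np
import pandas as pd

# Approximate day counts per unit; month and year lengths are assumptions.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age_text):
    """Convert strings such as '2 weeks' or '1 year' to an approximate number of days."""
    if pd.isna(age_text):
        return np.nan
    number, unit = age_text.split()
    unit = unit.rstrip("s")  # 'weeks' -> 'week', 'years' -> 'year'
    return int(number) * DAYS_PER_UNIT[unit]

ages = pd.Series(["2 weeks", "1 year", "9 years", "5 months", None])
age_days = ages.map(age_to_days)
print(age_days.tolist())
```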



In [430]:
# keep only the two target classes; .copy() avoids SettingWithCopyWarning on later column assignments
data = data[data['outcome_type'].isin(['Adoption', 'Transfer'])].copy()
data

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
5,4 months,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,Transfer,Intact Male
...,...,...,...,...,...,...,...,...,...,...,...,...
78250,1 month,A764895,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04T00:00:00,2018-02-01T18:40:00,2018-02-01T18:40:00,,Foster,Adoption,Neutered Male
78251,1 month,A764894,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04T00:00:00,2018-02-01T18:26:00,2018-02-01T18:26:00,,Foster,Adoption,Spayed Female
78252,3 years,A764468,Dog,Mastiff Mix,Blue/White,2014-12-30T00:00:00,2018-02-01T18:06:00,2018-02-01T18:06:00,Max,,Adoption,Neutered Male
78254,2 months,A765858,Dog,Standard Schnauzer,Red,2017-11-13T00:00:00,2018-02-01T18:32:00,2018-02-01T18:32:00,,,Adoption,Spayed Female


In [431]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56611 entries, 0 to 78255
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   age_upon_outcome  56609 non-null  object
 1   animal_id         56611 non-null  object
 2   animal_type       56611 non-null  object
 3   breed             56611 non-null  object
 4   color             56611 non-null  object
 5   date_of_birth     56611 non-null  object
 6   datetime          56611 non-null  object
 7   monthyear         56611 non-null  object
 8   name              38660 non-null  object
 9   outcome_subtype   29425 non-null  object
 10  outcome_type      56611 non-null  object
 11  sex_upon_outcome  56611 non-null  object
dtypes: object(12)
memory usage: 5.6+ MB


In [432]:
# drop rows with missing values in the age_upon_outcome column
data = data.dropna(subset=['age_upon_outcome'])

In [433]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56609 entries, 0 to 78255
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   age_upon_outcome  56609 non-null  object
 1   animal_id         56609 non-null  object
 2   animal_type       56609 non-null  object
 3   breed             56609 non-null  object
 4   color             56609 non-null  object
 5   date_of_birth     56609 non-null  object
 6   datetime          56609 non-null  object
 7   monthyear         56609 non-null  object
 8   name              38660 non-null  object
 9   outcome_subtype   29423 non-null  object
 10  outcome_type      56609 non-null  object
 11  sex_upon_outcome  56609 non-null  object
dtypes: object(12)
memory usage: 5.6+ MB


In [434]:
# the gaps in age_upon_outcome are gone together with the dropped rows

In [435]:
# drop the animal_id column
data = data.drop(['animal_id'], axis=1)

In [436]:
data.shape

(56609, 11)

In [437]:
# data

In [438]:
# datetime - date_of_birth = age_upon_outcome, up to rounding
# first convert datetime and date_of_birth, discarding the time of day, since it is not always recorded correctly

In [439]:
data['date_of_birth'].value_counts()

2014-05-05T00:00:00    97
2014-04-21T00:00:00    95
2015-04-28T00:00:00    92
2015-09-01T00:00:00    91
2015-04-20T00:00:00    85
                       ..
2004-12-07T00:00:00     1
2004-07-13T00:00:00     1
2009-07-16T00:00:00     1
2004-09-28T00:00:00     1
2007-12-17T00:00:00     1
Name: date_of_birth, Length: 4883, dtype: int64

In [440]:
data['date_of_birth'] = data['date_of_birth'].str.extract(r'([0-9]{4}-[0-9]{2}-[0-9]{2})')
data['datetime'] = data['datetime'].str.extract(r'([0-9]{4}-[0-9]{2}-[0-9]{2})')
data['monthyear'] = data['monthyear'].str.extract(r'([0-9]{4}-[0-9]{2}-[0-9]{2})')

In [441]:
data

Unnamed: 0,age_upon_outcome,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07,2014-07-22,2014-07-22,,Partner,Transfer,Intact Male
1,1 year,Dog,Beagle Mix,White/Brown,2012-11-06,2013-11-07,2013-11-07,Lucy,Partner,Transfer,Spayed Female
2,1 year,Dog,Pit Bull,Blue/White,2013-03-31,2014-06-03,2014-06-03,*Johnny,,Adoption,Neutered Male
3,9 years,Dog,Miniature Schnauzer Mix,White,2005-06-02,2014-06-15,2014-06-15,Monday,Partner,Transfer,Neutered Male
5,4 months,Dog,Leonberger Mix,Brown/White,2013-06-03,2013-10-07,2013-10-07,*Edgar,Partner,Transfer,Intact Male
...,...,...,...,...,...,...,...,...,...,...,...
78250,1 month,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04,2018-02-01,2018-02-01,,Foster,Adoption,Neutered Male
78251,1 month,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04,2018-02-01,2018-02-01,,Foster,Adoption,Spayed Female
78252,3 years,Dog,Mastiff Mix,Blue/White,2014-12-30,2018-02-01,2018-02-01,Max,,Adoption,Neutered Male
78254,2 months,Dog,Standard Schnauzer,Red,2017-11-13,2018-02-01,2018-02-01,,,Adoption,Spayed Female


In [442]:
# convert to datetime format

In [443]:
data['date_of_birth'] = pd.to_datetime(data['date_of_birth'])
data['datetime'] = pd.to_datetime(data['datetime'])
data['monthyear'] = pd.to_datetime(data['monthyear'])
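
The regex extraction and the conversion above can also be collapsed into one step. A minimal sketch on two sample timestamps, assuming all values are ISO-formatted strings like the ones shown earlier: `pd.to_datetime` parses the full timestamp and `.dt.normalize()` resets the time of day to midnight.

```python
import pandas as pd

# Two sample values in the same format as the datetime column.
stamps = pd.Series(["2014-07-22T16:04:00", "2013-11-07T11:47:00"])

# Parse the full timestamp, then drop the time of day.
dates = pd.to_datetime(stamps).dt.normalize()
print(dates)
```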

In [444]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56609 entries, 0 to 78255
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   age_upon_outcome  56609 non-null  object        
 1   animal_type       56609 non-null  object        
 2   breed             56609 non-null  object        
 3   color             56609 non-null  object        
 4   date_of_birth     56609 non-null  datetime64[ns]
 5   datetime          56609 non-null  datetime64[ns]
 6   monthyear         56609 non-null  datetime64[ns]
 7   name              38660 non-null  object        
 8   outcome_subtype   29423 non-null  object        
 9   outcome_type      56609 non-null  object        
 10  sex_upon_outcome  56609 non-null  object        
dtypes: datetime64[ns](3), object(8)
memory usage: 5.2+ MB


In [445]:
data['datetime'].equals(data['monthyear'])

True
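
The same `Series.equals` check can be run over every pair of columns to spot other exact duplicates. A small sketch on a toy frame; `identical_column_pairs` is a hypothetical helper, not part of pandas.

```python
from itertools import combinations

import pandas as pd

def identical_column_pairs(df):
    """Return pairs of column names whose values are exactly equal."""
    return [(a, b) for a, b in combinations(df.columns, 2) if df[a].equals(df[b])]

toy = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["2014-07-07", "2012-11-06"]),
    "datetime": pd.to_datetime(["2014-07-22", "2013-11-07"]),
    "monthyear": pd.to_datetime(["2014-07-22", "2013-11-07"]),
})
pairs = identical_column_pairs(toy)
print(pairs)
```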

In [446]:
# drop the monthyear column, since it duplicates datetime
data = data.drop(['monthyear'], axis=1)

In [447]:
data

Unnamed: 0,age_upon_outcome,animal_type,breed,color,date_of_birth,datetime,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07,2014-07-22,,Partner,Transfer,Intact Male
1,1 year,Dog,Beagle Mix,White/Brown,2012-11-06,2013-11-07,Lucy,Partner,Transfer,Spayed Female
2,1 year,Dog,Pit Bull,Blue/White,2013-03-31,2014-06-03,*Johnny,,Adoption,Neutered Male
3,9 years,Dog,Miniature Schnauzer Mix,White,2005-06-02,2014-06-15,Monday,Partner,Transfer,Neutered Male
5,4 months,Dog,Leonberger Mix,Brown/White,2013-06-03,2013-10-07,*Edgar,Partner,Transfer,Intact Male
...,...,...,...,...,...,...,...,...,...,...
78250,1 month,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04,2018-02-01,,Foster,Adoption,Neutered Male
78251,1 month,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04,2018-02-01,,Foster,Adoption,Spayed Female
78252,3 years,Dog,Mastiff Mix,Blue/White,2014-12-30,2018-02-01,Max,,Adoption,Neutered Male
78254,2 months,Dog,Standard Schnauzer,Red,2017-11-13,2018-02-01,,,Adoption,Spayed Female


In [448]:
data['age'] = (data['datetime'] - data['date_of_birth'])
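
The subtraction above yields a `timedelta64` column, which most estimators will not accept as a feature. A minimal sketch of turning it into whole days with `.dt.days`, using two rows copied from the table; the `age_days` column name is an assumption.

```python
import pandas as pd

frame = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["2014-07-07", "2012-11-06"]),
    "datetime": pd.to_datetime(["2014-07-22", "2013-11-07"]),
})

# Difference of two datetime columns is a timedelta; .dt.days gives integers.
frame["age_days"] = (frame["datetime"] - frame["date_of_birth"]).dt.days
print(frame["age_days"].tolist())
```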

In [449]:
# data.loc[:, 'age'] = len(pd.date_range(data['date_of_birth'], data['datetime'], freq="D"))

In [450]:
data

Unnamed: 0,age_upon_outcome,animal_type,breed,color,date_of_birth,datetime,name,outcome_subtype,outcome_type,sex_upon_outcome,age
0,2 weeks,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07,2014-07-22,,Partner,Transfer,Intact Male,15 days
1,1 year,Dog,Beagle Mix,White/Brown,2012-11-06,2013-11-07,Lucy,Partner,Transfer,Spayed Female,366 days
2,1 year,Dog,Pit Bull,Blue/White,2013-03-31,2014-06-03,*Johnny,,Adoption,Neutered Male,429 days
3,9 years,Dog,Miniature Schnauzer Mix,White,2005-06-02,2014-06-15,Monday,Partner,Transfer,Neutered Male,3300 days
5,4 months,Dog,Leonberger Mix,Brown/White,2013-06-03,2013-10-07,*Edgar,Partner,Transfer,Intact Male,126 days
...,...,...,...,...,...,...,...,...,...,...,...
78250,1 month,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04,2018-02-01,,Foster,Adoption,Neutered Male,59 days
78251,1 month,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04,2018-02-01,,Foster,Adoption,Spayed Female,59 days
78252,3 years,Dog,Mastiff Mix,Blue/White,2014-12-30,2018-02-01,Max,,Adoption,Neutered Male,1129 days
78254,2 months,Dog,Standard Schnauzer,Red,2017-11-13,2018-02-01,,,Adoption,Spayed Female,80 days


In [451]:
# age_upon_outcome, date_of_birth and datetime are no longer needed; drop them
data = data.drop(['age_upon_outcome', 'date_of_birth', 'datetime'], axis=1)

In [452]:
data

Unnamed: 0,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome,age
0,Cat,Domestic Shorthair Mix,Orange Tabby,,Partner,Transfer,Intact Male,15 days
1,Dog,Beagle Mix,White/Brown,Lucy,Partner,Transfer,Spayed Female,366 days
2,Dog,Pit Bull,Blue/White,*Johnny,,Adoption,Neutered Male,429 days
3,Dog,Miniature Schnauzer Mix,White,Monday,Partner,Transfer,Neutered Male,3300 days
5,Dog,Leonberger Mix,Brown/White,*Edgar,Partner,Transfer,Intact Male,126 days
...,...,...,...,...,...,...,...,...
78250,Dog,Golden Retriever/Labrador Retriever,Brown/White,,Foster,Adoption,Neutered Male,59 days
78251,Dog,Golden Retriever/Labrador Retriever,Brown/White,,Foster,Adoption,Spayed Female,59 days
78252,Dog,Mastiff Mix,Blue/White,Max,,Adoption,Neutered Male,1129 days
78254,Dog,Standard Schnauzer,Red,,,Adoption,Spayed Female,80 days


In [453]:
data['sex_upon_outcome'].value_counts()

Neutered Male    20732
Spayed Female    19949
Intact Female     6873
Intact Male       6294
Unknown           2761
Name: sex_upon_outcome, dtype: int64

In [454]:
data['sex'] = data['sex_upon_outcome'].str.extract(r' (\w+)')

In [455]:
data

Unnamed: 0,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome,age,sex
0,Cat,Domestic Shorthair Mix,Orange Tabby,,Partner,Transfer,Intact Male,15 days,Male
1,Dog,Beagle Mix,White/Brown,Lucy,Partner,Transfer,Spayed Female,366 days,Female
2,Dog,Pit Bull,Blue/White,*Johnny,,Adoption,Neutered Male,429 days,Male
3,Dog,Miniature Schnauzer Mix,White,Monday,Partner,Transfer,Neutered Male,3300 days,Male
5,Dog,Leonberger Mix,Brown/White,*Edgar,Partner,Transfer,Intact Male,126 days,Male
...,...,...,...,...,...,...,...,...,...
78250,Dog,Golden Retriever/Labrador Retriever,Brown/White,,Foster,Adoption,Neutered Male,59 days,Male
78251,Dog,Golden Retriever/Labrador Retriever,Brown/White,,Foster,Adoption,Spayed Female,59 days,Female
78252,Dog,Mastiff Mix,Blue/White,Max,,Adoption,Neutered Male,1129 days,Male
78254,Dog,Standard Schnauzer,Red,,,Adoption,Spayed Female,80 days,Female


In [456]:
data['sex'].value_counts()

Male      27026
Female    26822
Name: sex, dtype: int64

In [457]:
data['sex'].unique()

array(['Male', 'Female', nan], dtype=object)

In [460]:
data['Neuter'] = data['sex_upon_outcome'].str.extract(r'(\w+) ')

In [461]:
data

Unnamed: 0,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome,age,sex,Neuter
0,Cat,Domestic Shorthair Mix,Orange Tabby,,Partner,Transfer,Intact Male,15 days,Male,Intact
1,Dog,Beagle Mix,White/Brown,Lucy,Partner,Transfer,Spayed Female,366 days,Female,Spayed
2,Dog,Pit Bull,Blue/White,*Johnny,,Adoption,Neutered Male,429 days,Male,Neutered
3,Dog,Miniature Schnauzer Mix,White,Monday,Partner,Transfer,Neutered Male,3300 days,Male,Neutered
5,Dog,Leonberger Mix,Brown/White,*Edgar,Partner,Transfer,Intact Male,126 days,Male,Intact
...,...,...,...,...,...,...,...,...,...,...
78250,Dog,Golden Retriever/Labrador Retriever,Brown/White,,Foster,Adoption,Neutered Male,59 days,Male,Neutered
78251,Dog,Golden Retriever/Labrador Retriever,Brown/White,,Foster,Adoption,Spayed Female,59 days,Female,Spayed
78252,Dog,Mastiff Mix,Blue/White,Max,,Adoption,Neutered Male,1129 days,Male,Neutered
78254,Dog,Standard Schnauzer,Red,,,Adoption,Spayed Female,80 days,Female,Spayed


In [463]:
data['Neuter'].value_counts()

Neutered    20732
Spayed      19949
Intact      13167
Name: Neuter, dtype: int64

In [462]:
data['Neuter'].unique()

array(['Intact', 'Spayed', 'Neutered', nan], dtype=object)
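
Both regex extractions turn the 'Unknown' rows into NaN. One possible alternative, sketched below on four sample values, splits the string once and keeps 'Unknown' as its own category. Merging 'Spayed' and 'Neutered' into a single 'Fixed' label is an assumption made here because the two words differ only by the animal's sex, which is already captured separately.

```python
import pandas as pd

raw = pd.Series(["Intact Male", "Spayed Female", "Neutered Male", "Unknown"])

# Split into at most two parts: status (column 0) and sex (column 1).
parts = raw.str.split(" ", n=1, expand=True)

# 'Unknown' has no second part, so its sex is filled in explicitly.
sex = parts[1].fillna("Unknown")

# 'Spayed' and 'Neutered' both mean the animal was sterilised.
status = parts[0].replace({"Spayed": "Fixed", "Neutered": "Fixed"})

print(sex.tolist())
print(status.tolist())
```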

In [464]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56609 entries, 0 to 78255
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype          
---  ------            --------------  -----          
 0   animal_type       56609 non-null  object         
 1   breed             56609 non-null  object         
 2   color             56609 non-null  object         
 3   name              38660 non-null  object         
 4   outcome_subtype   29423 non-null  object         
 5   outcome_type      56609 non-null  object         
 6   sex_upon_outcome  56609 non-null  object         
 7   age               56609 non-null  timedelta64[ns]
 8   sex               53848 non-null  object         
 9   Neuter            53848 non-null  object         
dtypes: object(9), timedelta64[ns](1)
memory usage: 4.8+ MB


In [465]:
# the sex_upon_outcome column can now be dropped
data = data.drop(['sex_upon_outcome'], axis=1)

In [467]:
data

Unnamed: 0,animal_type,breed,color,name,outcome_subtype,outcome_type,age,sex,Neuter
0,Cat,Domestic Shorthair Mix,Orange Tabby,,Partner,Transfer,15 days,Male,Intact
1,Dog,Beagle Mix,White/Brown,Lucy,Partner,Transfer,366 days,Female,Spayed
2,Dog,Pit Bull,Blue/White,*Johnny,,Adoption,429 days,Male,Neutered
3,Dog,Miniature Schnauzer Mix,White,Monday,Partner,Transfer,3300 days,Male,Neutered
5,Dog,Leonberger Mix,Brown/White,*Edgar,Partner,Transfer,126 days,Male,Intact
...,...,...,...,...,...,...,...,...,...
78250,Dog,Golden Retriever/Labrador Retriever,Brown/White,,Foster,Adoption,59 days,Male,Neutered
78251,Dog,Golden Retriever/Labrador Retriever,Brown/White,,Foster,Adoption,59 days,Female,Spayed
78252,Dog,Mastiff Mix,Blue/White,Max,,Adoption,1129 days,Male,Neutered
78254,Dog,Standard Schnauzer,Red,,,Adoption,80 days,Female,Spayed


In [469]:
data['outcome_subtype'].unique()

array(['Partner', nan, 'Offsite', 'Foster', 'SCRP', 'Barn', 'Snr'],
      dtype=object)

In [470]:
data['outcome_subtype'].value_counts()

Partner    19658
Foster      5558
SCRP        3211
Snr          626
Offsite      367
Barn           3
Name: outcome_subtype, dtype: int64
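
The one-hot encoding planned for this column can be sketched with `pd.get_dummies`; `dummy_na=True` gives the missing values their own indicator instead of dropping them. The sample values and the `subtype` prefix are assumptions. Note that outcome_subtype is recorded together with the outcome (in the rows shown above, 'Partner' appears with Transfer and 'Foster' with Adoption), so it is worth checking whether it leaks the target before using it as a feature.

```python
import pandas as pd

subtype = pd.Series(["Partner", None, "Foster", "SCRP", None], name="outcome_subtype")

# One indicator column per category, plus one for missing values.
encoded = pd.get_dummies(subtype, prefix="subtype", dummy_na=True)
print(encoded.columns.tolist())
```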

In [474]:
data['name'].value_counts()

Bella          204
Max            169
Luna           158
Daisy          146
Lucy           135
              ... 
Leu              1
*Bullwinkle      1
Buppie           1
Nunki            1
Doty             1
Name: name, Length: 11989, dtype: int64

In [476]:
data['name'].unique()

array([nan, 'Lucy', '*Johnny', ..., 'Wonder Woman', 'Eisley',
       'Allee Chat'], dtype=object)

In [477]:
data['name'].nunique()

11989
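
With almost 12,000 distinct values the raw names are hard to use, but before the column is discarded two simple indicators could be derived from it. This is only a sketch: whether an animal has a name at all, and whether the name starts with an asterisk (the meaning of the asterisk is not documented here). The feature names are hypothetical.

```python
import pandas as pd

names = pd.Series([None, "Lucy", "*Johnny", "Monday", None], name="name")

features = pd.DataFrame({
    # 1 if a name is recorded, 0 if it is missing
    "has_name": names.notna().astype(int),
    # 1 if the recorded name starts with '*'; missing names count as 0
    "name_starred": names.str.startswith("*", na=False).astype(int),
})
print(features)
```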

In [478]:
# the name column can be dropped
data = data.drop(['name'], axis=1)

In [479]:
data

Unnamed: 0,animal_type,breed,color,outcome_subtype,outcome_type,age,sex,Neuter
0,Cat,Domestic Shorthair Mix,Orange Tabby,Partner,Transfer,15 days,Male,Intact
1,Dog,Beagle Mix,White/Brown,Partner,Transfer,366 days,Female,Spayed
2,Dog,Pit Bull,Blue/White,,Adoption,429 days,Male,Neutered
3,Dog,Miniature Schnauzer Mix,White,Partner,Transfer,3300 days,Male,Neutered
5,Dog,Leonberger Mix,Brown/White,Partner,Transfer,126 days,Male,Intact
...,...,...,...,...,...,...,...,...
78250,Dog,Golden Retriever/Labrador Retriever,Brown/White,Foster,Adoption,59 days,Male,Neutered
78251,Dog,Golden Retriever/Labrador Retriever,Brown/White,Foster,Adoption,59 days,Female,Spayed
78252,Dog,Mastiff Mix,Blue/White,,Adoption,1129 days,Male,Neutered
78254,Dog,Standard Schnauzer,Red,,Adoption,80 days,Female,Spayed


In [5]:
# data.describe(include='object')

In [15]:
# data.hist(figsize=(15,15))
# plt.show()

In [16]:
# data.boxplot(figsize=(15,6), rot=90)

In [17]:
# sns.heatmap(data.corr(),annot=True,cmap='RdBu',linewidths=0.2) # data.corr() --> correlation matrix
# # gcf() - get the current figure
# fig=plt.gcf()
# # set_size_inches() - set the figure size
# fig.set_size_inches(12,12)
# plt.show()

In [18]:
# !pip install pandas_profiling

In [471]:
import pandas_profiling

In [472]:
pandas_profiling.ProfileReport(data)

TypeError: concat() got an unexpected keyword argument 'join_axes'

In [80]:
data.profile_report()

AttributeError: 'DataFrame' object has no attribute 'profile_report'
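
The `join_axes` error is typical of an old pandas_profiling release running against pandas 1.0 or later, where `concat` no longer accepts that argument; upgrading the package (it is now published as ydata-profiling) is the usual remedy. Until that is sorted out, a pandas-only summary covers the basics. In the sketch below `quick_profile` and `full_profile` are hypothetical helpers, and `full_profile` assumes ydata-profiling is installed; it is defined but not called here.

```python
import pandas as pd

def quick_profile(df):
    """Per-column dtype, missing count and number of unique values (pandas only)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })

def full_profile(df, path="report.html"):
    """Write an HTML report if ydata-profiling is installed; return True on success."""
    try:
        from ydata_profiling import ProfileReport
    except ImportError:
        return False
    ProfileReport(df, title="Shelter outcomes").to_file(path)
    return True

sample = pd.DataFrame({"sex": ["Male", None, "Female"], "age_days": [15, 366, 429]})
summary = quick_profile(sample)
print(summary)
```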