# CRISP-DM Framework

CRISP-DM stands for Cross-Industry Standard Process for Data Mining.



# Business / Problem Statement Understanding

NOTE: This is a fictitious problem built on real-world data.

Given are the properties of a pet (either a cat or a dog). Based on these properties, we have to predict:
* the type of the pet (dog or cat).
* the adoption speed of the pet (ranging from 0 to 4).

This is a `Multi Output Multi Class Classification Problem`.



<details>
<summary><b><font size="+1">How is it a Multi Output Multi Class Classification Problem?</font></b></summary>

`Multi Output` - There are 2 targets (**type of the pet** and **adoption speed of the pet**).

`Multi Class` - One of the targets (**adoption speed**) has 5 classes, labelled 0, 1, 2, 3 and 4.

</details>




# Data Understanding

## Data Dictionary

Features:
1. `PetID` - Unique hash ID of pet profile
1. `Name` - Name of pet (Empty if not named)
1. `Age` - Age of pet when listed, in months
1. `Breed1` - Primary breed of pet (Refer to BreedLabels dictionary)
1. `Breed2` - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
1. `Gender` - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
1. `Color1` - Color 1 of pet (Refer to ColorLabels dictionary)
1. `Color2` - Color 2 of pet (Refer to ColorLabels dictionary)
1. `Color3` - Color 3 of pet (Refer to ColorLabels dictionary)
1. `MaturitySize` - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
1. `FurLength` - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
1. `Vaccinated` - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
1. `Dewormed` - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
1. `Sterilized` - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
1. `Health` - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
1. `Quantity` - Number of pets represented in profile
1. `Fee` - Adoption fee (0 = Free)
1. `State` - State location in Malaysia (Refer to StateLabels dictionary)
1. `RescuerID` - Unique hash ID of rescuer
1. `VideoAmt` - Total uploaded videos for this pet
1. `PhotoAmt` - Total uploaded photos for this pet
1. `Description` - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

Targets:
1. `AdoptionSpeed` - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
1. `Type` - Type of animal (1 = Dog, 2 = Cat)


AdoptionSpeed:
The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:

* `0` - Pet was adopted on the same day as it was listed.
* `1` - Pet was adopted between 1 and 7 days (1st week) after being listed.
* `2` - Pet was adopted between 8 and 30 days (1st month) after being listed.
* `3` - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* `4` - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
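
The bucketing rules above can be written as a small function. This is only a sketch: the function name, the use of `None` for "never adopted", and the decision to map anything at or beyond 100 days to class `4` are assumptions made for illustration, not part of the dataset description.

```python
from typing import Optional


def adoption_speed(days_to_adoption: Optional[int]) -> int:
    """Map days between listing and adoption to an AdoptionSpeed class.

    Assumption: None means the pet was never adopted.
    """
    if days_to_adoption is None or days_to_adoption >= 100:
        return 4  # no adoption after 100 days of being listed
    if days_to_adoption < 0:
        raise ValueError("days_to_adoption cannot be negative")
    if days_to_adoption == 0:
        return 0  # adopted on the listing day
    if days_to_adoption <= 7:
        return 1  # 1st week
    if days_to_adoption <= 30:
        return 2  # 1st month
    if days_to_adoption <= 90:
        return 3  # 2nd and 3rd month
    # 91 to 99 days: the data dictionary says no pets fall in this range
    raise ValueError("91-99 days is not covered by the data dictionary")


print([adoption_speed(d) for d in (0, 3, 15, 60, None)])
```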

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("./pet_train.csv")

In [3]:
new_col_order = ['PetID','Name', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'RescuerID', 'VideoAmt', 'Description',  'PhotoAmt', 'Type','AdoptionSpeed']
df = df.reindex(columns=new_col_order)

In [4]:
df.head()

| | PetID | Name | Age | Breed1 | Breed2 | Gender | Color1 | Color2 | Color3 | MaturitySize | ... | Health | Quantity | Fee | State | RescuerID | VideoAmt | Description | PhotoAmt | Type | AdoptionSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 86e1089a3 | Nibble | 3 | 299 | 0 | 1 | 1 | 7 | 0 | 1 | ... | 1 | 1 | 100 | 41326 | 8480853f516546f6cf33aa88cd76c379 | 0 | Nibble is a 3+ month old ball of cuteness. He ... | 1.0 | 2 | 2 |
| 1 | 6296e909a | No Name Yet | 1 | 265 | 0 | 1 | 1 | 2 | 0 | 2 | ... | 1 | 1 | 0 | 41401 | 3082c7125d8fb66f7dd4bff4192c8b14 | 0 | I just found it alone yesterday near my apartm... | 2.0 | 2 | 0 |
| 2 | 3422e4906 | Brisco | 1 | 307 | 0 | 1 | 2 | 7 | 0 | 2 | ... | 1 | 1 | 0 | 41326 | fa90fa5b1ee11c86938398b60abc32cb | 0 | Their pregnant mother was dumped by her irresp... | 7.0 | 1 | 3 |
| 3 | 5842f1ff5 | Miko | 4 | 307 | 0 | 2 | 1 | 2 | 0 | 2 | ... | 1 | 1 | 150 | 41401 | 9238e4f44c71a75282e62f7136c6b240 | 0 | Good guard dog, very alert, active, obedience ... | 8.0 | 1 | 2 |
| 4 | 850a43f90 | Hunter | 1 | 307 | 0 | 1 | 1 | 0 | 0 | 2 | ... | 1 | 1 | 0 | 41326 | 95481e953f8aed9ec3d16fc4509537e8 | 0 | This handsome yet cute boy is up for adoption.... | 3.0 | 1 | 2 |


In [5]:
orig_df = df.copy()  # independent copy; plain assignment would only alias df

In [6]:
df.dtypes

PetID             object
Name              object
Age                int64
Breed1             int64
Breed2             int64
Gender             int64
Color1             int64
Color2             int64
Color3             int64
MaturitySize       int64
FurLength          int64
Vaccinated         int64
Dewormed           int64
Sterilized         int64
Health             int64
Quantity           int64
Fee                int64
State              int64
RescuerID         object
VideoAmt           int64
Description       object
PhotoAmt         float64
Type               int64
AdoptionSpeed      int64
dtype: object

## Unique values

In [7]:
# Helper function: print the unique values of every column
def unique_values(df):
  for col in df.columns:
    print(f"==================================Unique values for column ``` {col} ``` ")
    print(df[col].unique())
    print(f"==============================END==============================\n")

In [8]:
unique_values(df)

['86e1089a3' '6296e909a' '3422e4906' ... 'd981b6395' 'e4da1c9e4'
 'a83d95ead']

['Nibble' 'No Name Yet' 'Brisco' ... 'Monkies' 'Ms Daym' 'Fili']

[  3   1   4  12   0   2  78   6   8  10  36  14  24   5  72  60   9  48
  62  47  19 120  32   7  17  22  16  13  11  37  18  55  20  28  74  53
  25  84  76  30 132  96  46  15  50  56  54  23  92  29  27  49  44 144
  21  31  41  51  65  34 135  39  52  42 108  81  26  38  69 212  33  75
  95  80  63  61 255  89  91  35 117  73 122 123  64  87 112 156  66  67
  77 180  82  86  40  57 168 102  45 147  68  85  88  43 238 100]

[299 265 307 266 264 218 114 285 189 205 292 128 243 213 141 173 207 250
 119 195 109 206  70 103 303  78 254  10  20 305 283 306 288  69 179  31
 247 200 248  26  25   0 129 202  72  24 284 286 152 277  44  75  64  60
 296 185 300  76 139 242 294 276 102 182 289 145 178 233  82  49 239 231
 169 111 232 270 267 268 251  58 155 295 304 147 245 282  21 215 192 154
  71 272 241 262 249 273 108 240  83 293  39  50  93   1 

In [9]:
df.shape

(14993, 24)

In [10]:
df.describe()

| | Age | Breed1 | Breed2 | Gender | Color1 | Color2 | Color3 | MaturitySize | FurLength | Vaccinated | Dewormed | Sterilized | Health | Quantity | Fee | State | VideoAmt | PhotoAmt | Type | AdoptionSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 | 14993.0 |
| mean | 10.452078 | 265.272594 | 74.009738 | 1.776162 | 2.234176 | 3.222837 | 1.882012 | 1.862002 | 1.467485 | 1.731208 | 1.558727 | 1.914227 | 1.036617 | 1.576069 | 21.259988 | 41346.028347 | 0.05676 | 3.889215 | 1.457614 | 2.516441 |
| std | 18.15579 | 60.056818 | 123.011575 | 0.681592 | 1.745225 | 2.742562 | 2.984086 | 0.547959 | 0.59907 | 0.667649 | 0.695817 | 0.566172 | 0.199535 | 1.472477 | 78.414548 | 32.444153 | 0.346185 | 3.48781 | 0.498217 | 1.177265 |
| min | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 41324.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 2.0 | 265.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 41326.0 | 0.0 | 2.0 | 1.0 | 2.0 |
| 50% | 3.0 | 266.0 | 0.0 | 2.0 | 2.0 | 2.0 | 0.0 | 2.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 41326.0 | 0.0 | 3.0 | 1.0 | 2.0 |
| 75% | 12.0 | 307.0 | 179.0 | 2.0 | 3.0 | 6.0 | 5.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 41401.0 | 0.0 | 5.0 | 2.0 | 4.0 |
| max | 255.0 | 307.0 | 307.0 | 3.0 | 7.0 | 7.0 | 7.0 | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 20.0 | 3000.0 | 41415.0 | 8.0 | 30.0 | 2.0 | 4.0 |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14993 entries, 0 to 14992
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PetID          14993 non-null  object 
 1   Name           13728 non-null  object 
 2   Age            14993 non-null  int64  
 3   Breed1         14993 non-null  int64  
 4   Breed2         14993 non-null  int64  
 5   Gender         14993 non-null  int64  
 6   Color1         14993 non-null  int64  
 7   Color2         14993 non-null  int64  
 8   Color3         14993 non-null  int64  
 9   MaturitySize   14993 non-null  int64  
 10  FurLength      14993 non-null  int64  
 11  Vaccinated     14993 non-null  int64  
 12  Dewormed       14993 non-null  int64  
 13  Sterilized     14993 non-null  int64  
 14  Health         14993 non-null  int64  
 15  Quantity       14993 non-null  int64  
 16  Fee            14993 non-null  int64  
 17  State          14993 non-null  int64  
 18  Rescue

In [12]:
df.isnull().sum()

PetID               0
Name             1265
Age                 0
Breed1              0
Breed2              0
Gender              0
Color1              0
Color2              0
Color3              0
MaturitySize        0
FurLength           0
Vaccinated          0
Dewormed            0
Sterilized          0
Health              0
Quantity            0
Fee                 0
State               0
RescuerID           0
VideoAmt            0
Description        13
PhotoAmt            0
Type                0
AdoptionSpeed       0
dtype: int64
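
To put the null counts above in perspective, they can be expressed as a share of all rows. The snippet below is a small sketch that hard-codes the numbers reported by `df.shape` and `df.isnull().sum()`; the variable names are illustrative.

```python
n_rows = 14993                              # rows reported by df.shape
missing = {"Name": 1265, "Description": 13}  # counts from df.isnull().sum()

# Percentage of rows with a missing value, per column
missing_pct = {col: round(100 * n / n_rows, 2) for col, n in missing.items()}
print(missing_pct)
```

With the real DataFrame, `df.isnull().mean() * 100` gives the same percentages for every column at once.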

In [13]:
df['Type'].value_counts(normalize=True)*100

Type
1    54.238645
2    45.761355
Name: proportion, dtype: float64

In [14]:
df['AdoptionSpeed'].value_counts(normalize=True)*100

AdoptionSpeed
4    27.993063
2    26.925899
3    21.736811
1    20.609618
0     2.734609
Name: proportion, dtype: float64

## Inferences:

1. There are null values in the `Name` (1265) and `Description` (13) columns. Based on the problem statement, these columns are not very important and can be ignored.
2. `RescuerID` and `PetID` are only identifiers, so they add little information to the model. These can be removed.
3. `PhotoAmt` should be `int` instead of `float`, as per the data dictionary.
4. The classes are imbalanced, most notably for `AdoptionSpeed`, where class `0` makes up only about 2.7% of the rows (`Type` is split roughly 54% / 46%). Keep this in mind and balance the classes when building the models.
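
One common way to act on the imbalance is to weight each class inversely to its frequency. The sketch below derives such weights from the `AdoptionSpeed` proportions printed above, using the same formula scikit-learn applies for `class_weight="balanced"`, namely `n_samples / (n_classes * count)`. The notebook itself does not apply these weights; this is only an illustration.

```python
# AdoptionSpeed proportions (in percent) from value_counts(normalize=True) * 100
adoption_speed_pct = {
    4: 27.993063,
    2: 26.925899,
    3: 21.736811,
    1: 20.609618,
    0: 2.734609,
}

n_classes = len(adoption_speed_pct)

# n_samples / (n_classes * count) == 100 / (n_classes * percentage)
balanced_weights = {
    cls: 100.0 / (n_classes * pct)
    for cls, pct in sorted(adoption_speed_pct.items())
}

for cls, weight in balanced_weights.items():
    print(cls, round(weight, 3))
```

The rare class `0` ends up with a weight above 7, while the most frequent class `4` gets a weight below 1.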

# Data Preparation

## Separating the columns into numerical and categorical columns

In [15]:
df.columns

Index(['PetID', 'Name', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1',
       'Color2', 'Color3', 'MaturitySize', 'FurLength', 'Vaccinated',
       'Dewormed', 'Sterilized', 'Health', 'Quantity', 'Fee', 'State',
       'RescuerID', 'VideoAmt', 'Description', 'PhotoAmt', 'Type',
       'AdoptionSpeed'],
      dtype='object')

In [16]:
target_cols = ['Type', 'AdoptionSpeed']

In [17]:
selected_cols = [ col for col in df.columns if col not in target_cols]
selected_cols

['PetID',
 'Name',
 'Age',
 'Breed1',
 'Breed2',
 'Gender',
 'Color1',
 'Color2',
 'Color3',
 'MaturitySize',
 'FurLength',
 'Vaccinated',
 'Dewormed',
 'Sterilized',
 'Health',
 'Quantity',
 'Fee',
 'State',
 'RescuerID',
 'VideoAmt',
 'Description',
 'PhotoAmt']

In [18]:
# Discarding the Name and Description columns and the PetID and RescuerID identifiers
selected_cols.remove('Name')
selected_cols.remove('Description')
selected_cols.remove('PetID')
selected_cols.remove('RescuerID')

In [19]:
num_cols = ['Age', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt']
# num_cols = ['Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized', 'Health', 'State',]
cat_cols = [col for col in selected_cols if col not in num_cols]
selected_cols,cat_cols,num_cols

(['Age',
  'Breed1',
  'Breed2',
  'Gender',
  'Color1',
  'Color2',
  'Color3',
  'MaturitySize',
  'FurLength',
  'Vaccinated',
  'Dewormed',
  'Sterilized',
  'Health',
  'Quantity',
  'Fee',
  'State',
  'VideoAmt',
  'PhotoAmt'],
 ['Breed1',
  'Breed2',
  'Gender',
  'Color1',
  'Color2',
  'Color3',
  'MaturitySize',
  'FurLength',
  'Vaccinated',
  'Dewormed',
  'Sterilized',
  'Health',
  'State'],
 ['Age', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt'])

In [20]:
df["PhotoAmt"] = df["PhotoAmt"].astype(int)  # PhotoAmt is a count, so store it as an integer

In [21]:
df[selected_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14993 entries, 0 to 14992
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Age           14993 non-null  int64
 1   Breed1        14993 non-null  int64
 2   Breed2        14993 non-null  int64
 3   Gender        14993 non-null  int64
 4   Color1        14993 non-null  int64
 5   Color2        14993 non-null  int64
 6   Color3        14993 non-null  int64
 7   MaturitySize  14993 non-null  int64
 8   FurLength     14993 non-null  int64
 9   Vaccinated    14993 non-null  int64
 10  Dewormed      14993 non-null  int64
 11  Sterilized    14993 non-null  int64
 12  Health        14993 non-null  int64
 13  Quantity      14993 non-null  int64
 14  Fee           14993 non-null  int64
 15  State         14993 non-null  int64
 16  VideoAmt      14993 non-null  int64
 17  PhotoAmt      14993 non-null  int64
dtypes: int64(18)
memory usage: 2.1 MB


## Splitting the dataset into train and validation sets

In [22]:
# Split the data first, then fit the preprocessing steps on the training split only
X = df[selected_cols]
y_targets = df[target_cols]

In [23]:
X.columns, type(y_targets)

(Index(['Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3',
        'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized',
        'Health', 'Quantity', 'Fee', 'State', 'VideoAmt', 'PhotoAmt'],
       dtype='object'),
 pandas.core.frame.DataFrame)

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_val, y_train, y_val = train_test_split(X, y_targets, test_size=0.25, random_state=42)

In [26]:
type(X_train), type(X_val), type(y_train), type(y_val),

(pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame)

In [27]:
type(y_train['Type']), type(y_val['AdoptionSpeed'])

(pandas.core.series.Series, pandas.core.series.Series)

In [28]:
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((11244, 18), (3749, 18), (11244, 2), (3749, 2))
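
The split sizes above follow from `test_size=0.25` applied to 14993 rows. The sketch below reproduces the arithmetic, assuming the validation share is rounded up and the remainder goes to training, which matches the shapes reported by the notebook.

```python
import math

n_rows = 14993     # df.shape[0] reported earlier in the notebook
test_size = 0.25   # value passed to train_test_split

n_val = math.ceil(test_size * n_rows)  # 3748.25 is rounded up
n_train = n_rows - n_val               # everything else is used for training
print(n_train, n_val)
```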

## One hot encoding

In [29]:
from sklearn.preprocessing import OneHotEncoder

In [30]:
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [31]:
ohe_encoder.fit(X_train[cat_cols]) 

In [32]:
encoded_data = ohe_encoder.transform(X_train[cat_cols])

In [33]:
ohe_encoder.get_feature_names_out()

array(['Breed1_0', 'Breed1_1', 'Breed1_3', 'Breed1_5', 'Breed1_7',
       'Breed1_10', 'Breed1_11', 'Breed1_15', 'Breed1_16', 'Breed1_17',
       'Breed1_18', 'Breed1_19', 'Breed1_20', 'Breed1_21', 'Breed1_23',
       'Breed1_24', 'Breed1_25', 'Breed1_26', 'Breed1_31', 'Breed1_32',
       'Breed1_39', 'Breed1_42', 'Breed1_44', 'Breed1_49', 'Breed1_50',
       'Breed1_56', 'Breed1_58', 'Breed1_60', 'Breed1_61', 'Breed1_64',
       'Breed1_65', 'Breed1_69', 'Breed1_70', 'Breed1_71', 'Breed1_72',
       'Breed1_75', 'Breed1_76', 'Breed1_78', 'Breed1_81', 'Breed1_82',
       'Breed1_83', 'Breed1_85', 'Breed1_88', 'Breed1_93', 'Breed1_97',
       'Breed1_98', 'Breed1_99', 'Breed1_100', 'Breed1_102', 'Breed1_103',
       'Breed1_105', 'Breed1_108', 'Breed1_109', 'Breed1_111',
       'Breed1_114', 'Breed1_117', 'Breed1_119', 'Breed1_122',
       'Breed1_123', 'Breed1_125', 'Breed1_128', 'Breed1_129',
       'Breed1_130', 'Breed1_132', 'Breed1_139', 'Breed1_141',
       'Breed1_145', 'Breed1_1

In [34]:
encoded_df = pd.DataFrame(encoded_data, columns=ohe_encoder.get_feature_names_out())

In [35]:
encoded_df

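| | Breed1_0 | Breed1_1 | Breed1_3 | Breed1_5 | Breed1_7 | Breed1_10 | Breed1_11 | Breed1_15 | Breed1_16 | Breed1_17 | ... | State_41330 | State_41332 | State_41335 | State_41336 | State_41342 | State_41345 | State_41361 | State_41367 | State_41401 | State_41415 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11239 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 11240 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 11241 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 11242 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |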


In [36]:
encoded_df.index = X_train.index

In [37]:
# Joining tables
X_train = pd.concat([X_train, encoded_df], axis=1)

In [38]:
# Dropping old categorical columns
X_train.drop(cat_cols, axis=1, inplace=True)
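
The step `encoded_df.index = X_train.index` matters because `pd.concat(..., axis=1)` aligns rows by index label, and the training split keeps its shuffled original labels while a freshly built DataFrame starts at 0. The toy example below (made-up column names and values) shows what happens with and without the index fix.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train: shuffled index labels, as after train_test_split
X_toy = pd.DataFrame({"Age": [7, 5, 12]}, index=[5646, 10891, 1934])

# Toy stand-in for the encoder output: default RangeIndex 0, 1, 2
encoded_toy = pd.DataFrame(np.eye(3), columns=["c_a", "c_b", "c_c"])

# Without aligning the index, concat takes the union of labels and fills NaN
misaligned = pd.concat([X_toy, encoded_toy], axis=1)

# With the index copied over, rows line up one to one
encoded_toy.index = X_toy.index
aligned = pd.concat([X_toy, encoded_toy], axis=1)

print(misaligned.shape, aligned.shape)
```

Passing `index=X_train.index` directly to the `pd.DataFrame(...)` constructor is an equivalent way to get the same alignment.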

In [39]:
# Checking result
X_train.head()

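| | Age | Quantity | Fee | VideoAmt | PhotoAmt | Breed1_0 | Breed1_1 | Breed1_3 | Breed1_5 | Breed1_7 | ... | State_41330 | State_41332 | State_41335 | State_41336 | State_41342 | State_41345 | State_41361 | State_41367 | State_41401 | State_41415 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5646 | 7 | 2 | 400 | 0 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10891 | 5 | 1 | 0 | 0 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1934 | 12 | 3 | 0 | 0 | 10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 8387 | 4 | 1 | 0 | 0 | 11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2952 | 4 | 1 | 0 | 0 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |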


## Scaling

In [40]:
from sklearn.preprocessing import MinMaxScaler

In [41]:
X_train[num_cols].describe().loc[['min', 'max'], :]

| | Age | Quantity | Fee | VideoAmt | PhotoAmt |
|---|---|---|---|---|---|
| min | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 255.0 | 20.0 | 1000.0 | 8.0 | 30.0 |


In [42]:
scaler = MinMaxScaler()

In [43]:
scaler.fit(X_train[num_cols])

In [44]:
scaled_data = scaler.transform(X_train[num_cols])

In [45]:
scaled_data

array([[0.02745098, 0.05263158, 0.4       , 0.        , 0.1       ],
       [0.01960784, 0.        , 0.        , 0.        , 0.16666667],
       [0.04705882, 0.10526316, 0.        , 0.        , 0.33333333],
       ...,
       [0.04705882, 0.        , 0.        , 0.        , 0.03333333],
       [0.01960784, 0.05263158, 0.        , 0.        , 0.06666667],
       [0.0627451 , 0.        , 0.        , 0.        , 0.16666667]])

In [46]:
type(scaled_data), scaled_data.shape

(numpy.ndarray, (11244, 5))

In [47]:
scaled_df =  pd.DataFrame(data=scaled_data,columns=num_cols)

In [48]:
scaled_df.describe().loc[['min', 'max'], :]

| | Age | Quantity | Fee | VideoAmt | PhotoAmt |
|---|---|---|---|---|---|
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [49]:
scaled_df[num_cols].isnull().sum()

Age         0
Quantity    0
Fee         0
VideoAmt    0
PhotoAmt    0
dtype: int64

In [50]:
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_train[num_cols].describe().loc[['min', 'max'], :]

| | Age | Quantity | Fee | VideoAmt | PhotoAmt |
|---|---|---|---|---|---|
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [51]:
X_train.shape

(11244, 350)

In [52]:
X_train[num_cols].isnull().sum()

Age         0
Quantity    0
Fee         0
VideoAmt    0
PhotoAmt    0
dtype: int64
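
The encode, concat, drop and scale steps above can also be bundled into a single transformer. The sketch below uses scikit-learn's `ColumnTransformer` on a tiny made-up frame; it is an alternative to the manual steps in this notebook, not what the notebook runs, and the toy rows and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up training and validation rows with two numeric and two categorical columns
train_toy = pd.DataFrame({
    "Age": [1, 3, 12, 24],
    "Fee": [0, 100, 0, 400],
    "Gender": [1, 2, 1, 3],
    "State": [41326, 41401, 41326, 41330],
})
val_toy = pd.DataFrame({"Age": [6], "Fee": [1200], "Gender": [2], "State": [41415]})

preprocess = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["Age", "Fee"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender", "State"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the training rows only, then reuse the fitted transformer on validation rows
X_train_toy = preprocess.fit_transform(train_toy)
X_val_toy = preprocess.transform(val_toy)

print(X_train_toy.shape)
print(X_val_toy)
```

In the validation row, the unseen `State` value becomes an all-zero block because of `handle_unknown="ignore"`, and the `Fee` value lands outside the 0 to 1 range because it exceeds the training maximum.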

# Modelling

In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

In [54]:
('Type' in X_train.columns),('AdoptionSpeed' in X_train.columns) 

(False, False)

In [55]:
X1 = X_train
y1 = y_train['Type']

In [56]:
type_model = LogisticRegression()

## Model 1

In [57]:
ovr_type = OneVsRestClassifier(type_model)

In [58]:
X1.isnull().sum()

Age            0
Quantity       0
Fee            0
VideoAmt       0
PhotoAmt       0
              ..
State_41345    0
State_41361    0
State_41367    0
State_41401    0
State_41415    0
Length: 350, dtype: int64

In [59]:
ovr_type.fit(X1, y1)

In [60]:
# # predicting on the same training Data and using it in the next prediction
# y_train_type_pred = ovr_type.predict(X1)
# y_train_type_pred,type(y_train_type_pred)

In [61]:
# from sklearn.metrics import classification_report

In [62]:
# target_names = ['Dog', 'Cat']

In [63]:
# # classification_report
# print(classification_report(y1, y_train_type_pred, target_names=target_names))

## Model 2

In [64]:
X2 =  pd.concat([X1, y_train['Type']], axis=1)
y2 = y_train['AdoptionSpeed']

In [65]:
X2.shape, y2.shape

((11244, 351), (11244,))

In [66]:
adoptionSpeed_model = LogisticRegression(max_iter=1000)

In [67]:
ovr_adoption = OneVsRestClassifier(adoptionSpeed_model)

In [68]:
ovr_adoption.fit(X2,y2)
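
The two models form a chain: model 2 is trained with the true `Type` as an extra feature, and at prediction time it receives the `Type` predicted by model 1. The sketch below shows that pattern on random synthetic data with plain `LogisticRegression`; the data, the variable names and the omission of `OneVsRestClassifier` are simplifications for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic features and targets (stand-ins for the prepared pet data)
X_tr = rng.normal(size=(200, 3))
X_va = rng.normal(size=(50, 3))
type_tr = (X_tr[:, 0] > 0).astype(int) + 1   # 1 or 2, like the Type target
speed_tr = rng.integers(0, 5, size=200)      # 0..4, like AdoptionSpeed

# Stage 1: predict Type from the features
stage1 = LogisticRegression().fit(X_tr, type_tr)

# Stage 2: predict AdoptionSpeed from the features plus the TRUE Type
stage2 = LogisticRegression(max_iter=1000).fit(
    np.column_stack([X_tr, type_tr]), speed_tr
)

# At prediction time the true Type is unknown, so the PREDICTED Type is appended
type_va_pred = stage1.predict(X_va)
speed_va_pred = stage2.predict(np.column_stack([X_va, type_va_pred]))

print(type_va_pred[:5], speed_va_pred[:5])
```

Any error made by the first stage is passed on to the second stage, so the quality of the `Type` model bounds how useful that extra feature can be.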

# Evaluation

## Encoding

In [69]:
val_encoded_data = ohe_encoder.transform(X_val[cat_cols])

In [70]:
val_encoded_df = pd.DataFrame(val_encoded_data, columns=ohe_encoder.get_feature_names_out())

In [71]:
val_encoded_df.index = X_val.index

In [72]:
# Joining tables
X_val = pd.concat([X_val, val_encoded_df], axis=1)

In [73]:
# Dropping old categorical columns
X_val.drop(cat_cols, axis=1, inplace=True)

In [74]:
X_val.shape

(3749, 350)

## Scaling

In [75]:
X_val[num_cols] = scaler.transform(X_val[num_cols])

In [76]:
X_val[num_cols].describe().loc[['min', 'max'], :]

| | Age | Quantity | Fee | VideoAmt | PhotoAmt |
|---|---|---|---|---|---|
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 0.705882 | 1.0 | 3.0 | 1.0 | 1.0 |
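
The scaled validation maximum of `3.0` for `Fee` is expected: the scaler was fitted on the training split, where the largest fee is 1000, while the full dataset contains a fee of 3000. The sketch below redoes the min-max arithmetic by hand. The helper name is illustrative, and the raw age of 180 months is inferred from the scaled value `0.705882` rather than read from the notebook.

```python
def min_max_scale(value, train_min, train_max):
    """Min-max scaling using statistics learned from the training split."""
    return (value - train_min) / (train_max - train_min)


# Fee: training range is 0..1000, but the dataset-wide maximum is 3000
fee_scaled = min_max_scale(3000, 0, 1000)

# Age: training range is 0..255; 180 months is the inferred validation maximum
age_scaled = min_max_scale(180, 0, 255)

print(fee_scaled, round(age_scaled, 6))
```

Values above 1 after scaling are therefore not a bug; they simply mark validation rows that fall outside the range seen during training.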


## Applying the model

In [77]:
y_val_type_predict = ovr_type.predict(X_val)

In [78]:
y_val_type_predict,type(y_val_type_predict)

(array([1, 2, 2, ..., 2, 1, 1]), numpy.ndarray)

In [79]:
y_val_type_true = y_val['Type']

In [80]:
from sklearn.metrics import classification_report

In [81]:
target_names = ['Dog', 'Cat']

In [82]:
# classification_report
print(classification_report(y_val_type_true, y_val_type_predict, target_names=target_names))

              precision    recall  f1-score   support

         Dog       1.00      1.00      1.00      1998
         Cat       1.00      0.99      1.00      1751

    accuracy                           1.00      3749
   macro avg       1.00      1.00      1.00      3749
weighted avg       1.00      1.00      1.00      3749



## Model 2 prediction

In [87]:
y_val_type_predict.shape

(3749,)

In [83]:
y_val_type_predict_series = pd.Series(y_val_type_predict, name='Type')

In [89]:
X_val.index

Index([13408,  6472,  9967,   862,  5967, 12680,  3017,  3087,  6375,  2250,
       ...
        6018,  7634,  2174,  6385,  9913,  2102,  8183, 13916,  2656, 12642],
      dtype='int64', length=3749)

In [91]:
y_val_type_predict_series.index = X_val.index

In [92]:
y_val_type_predict_series.index

Index([13408,  6472,  9967,   862,  5967, 12680,  3017,  3087,  6375,  2250,
       ...
        6018,  7634,  2174,  6385,  9913,  2102,  8183, 13916,  2656, 12642],
      dtype='int64', length=3749)

In [93]:
X_val_adoption = pd.concat([X_val,y_val_type_predict_series ], axis=1)

In [94]:
X_val_adoption.isnull().sum()

Age            0
Quantity       0
Fee            0
VideoAmt       0
PhotoAmt       0
              ..
State_41361    0
State_41367    0
State_41401    0
State_41415    0
Type           0
Length: 351, dtype: int64

In [95]:
X_val.shape, y_val_type_predict_series.shape, X_val_adoption.shape, y_val['AdoptionSpeed'].shape

((3749, 350), (3749,), (3749, 351), (3749,))

In [96]:
y_val_adpotion_predict = ovr_adoption.predict(X_val_adoption)

In [99]:
y_val_adpotion_predict,type(y_val_adpotion_predict), np.unique(y_val_adpotion_predict)

(array([4, 2, 4, ..., 4, 3, 3]), numpy.ndarray, array([0, 1, 2, 3, 4]))

In [100]:
y_val_adpotion_true = y_val['AdoptionSpeed']

In [103]:
# Labels and display names for the AdoptionSpeed classification report
n_labels = [0,1,2,3,4]
target_names = ['same_day', '1st_week', '1st_month', '2nd_&_3rd_month', 'no_adoption_100_days']

In [106]:
# classification_report
print(classification_report(y_val_adpotion_true, y_val_adpotion_predict,
                           labels = n_labels,
                           target_names = target_names))

                      precision    recall  f1-score   support

            same_day       0.20      0.01      0.02       114
            1st_week       0.31      0.26      0.28       773
           1st_month       0.32      0.43      0.37       994
     2nd_&_3rd_month       0.34      0.15      0.21       813
no_adoption_100_days       0.44      0.59      0.50      1055

            accuracy                           0.37      3749
           macro avg       0.32      0.29      0.28      3749
        weighted avg       0.35      0.37      0.34      3749
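
The summary rows of the report can be recomputed from the per-class rows, which also helps put the 0.37 accuracy in context. The sketch below hard-codes the rounded f1-scores and supports printed above, so the results match only up to rounding; the majority-class baseline is an added comparison that the notebook does not compute.

```python
# (f1-score, support) per class, copied from the classification report
report = {
    "same_day": (0.02, 114),
    "1st_week": (0.28, 773),
    "1st_month": (0.37, 994),
    "2nd_&_3rd_month": (0.21, 813),
    "no_adoption_100_days": (0.50, 1055),
}

total = sum(support for _, support in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

# Accuracy of always predicting the most frequent class
majority_baseline = max(support for _, support in report.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2), round(majority_baseline, 2))
```

Always predicting the largest class would already score about 0.28 accuracy on this split, so the model's 0.37 is a modest improvement over that baseline.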



# Resources:

* https://dataknowsall.com/blog/multiclass.html