# CRISP-DM Framework

CRISP-DM stands for Cross-Industry Standard Process for Data Mining.



# Business / Problem Statement Understanding

NOTE: This is a fictitious problem built on real-world data.

We are given the properties of pets (both cats and dogs). Based on these properties, we have to predict:
* the type of the pet (dog or cat).
* the adoption speed (ranging from 0 to 4) of the given pet.

This is a `Multi Output Multi Class Classification Problem`.



<details>
<summary><b><font size="+1">Why is this a Multi Output Multi Class Classification Problem?</font></b></summary>

`Multi Output` - As there are 2 targets ( **type of the pet** and **adoption speed of the pet** )

`Multi Class` - One of the targets ( **adoption speed** ) has 5 classes, labelled 0, 1, 2, 3 and 4

</details>




# Data Understanding

## Data Dictionary

Features:
1. `PetID` - Unique hash ID of pet profile
1. `Name` - Name of pet (Empty if not named)
1. `Age` - Age of pet when listed, in months
1. `Breed1` - Primary breed of pet (Refer to BreedLabels dictionary)
1. `Breed2` - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
1. `Gender` - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
1. `Color1` - Color 1 of pet (Refer to ColorLabels dictionary)
1. `Color2` - Color 2 of pet (Refer to ColorLabels dictionary)
1. `Color3` - Color 3 of pet (Refer to ColorLabels dictionary)
1. `MaturitySize` - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
1. `FurLength` - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
1. `Vaccinated` - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
1. `Dewormed` - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
1. `Sterilized` - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
1. `Health` - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
1. `Quantity` - Number of pets represented in profile
1. `Fee` - Adoption fee (0 = Free)
1. `State` - State location in Malaysia (Refer to StateLabels dictionary)
1. `RescuerID` - Unique hash ID of rescuer
1. `VideoAmt` - Total uploaded videos for this pet
1. `PhotoAmt` - Total uploaded photos for this pet
1. `Description` - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
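
Most of the features above are stored as integer codes. As a minimal sketch of how those codes can be turned back into readable labels, the snippet below maps two of them with `Series.map`. The mapping dictionaries are copied from the data dictionary above; the toy rows and the names `GENDER_LABELS`, `VACCINATED_LABELS`, `toy` and `decoded` are illustrative assumptions, not part of `pet_train.csv`.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary.
GENDER_LABELS = {1: "Male", 2: "Female", 3: "Mixed"}
VACCINATED_LABELS = {1: "Yes", 2: "No", 3: "Not Sure"}

# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({"Gender": [1, 2, 3], "Vaccinated": [1, 3, 2]})

# Replace each integer code with its label; unmapped codes would become NaN.
decoded = toy.assign(
    Gender=toy["Gender"].map(GENDER_LABELS),
    Vaccinated=toy["Vaccinated"].map(VACCINATED_LABELS),
)
print(decoded)
```

The same pattern should apply to `Breed1`, `Color1`, `State` and similar columns once the corresponding label dictionaries are loaded.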

Targets:
1. `AdoptionSpeed` - Categorical speed of adoption. Lower is faster. See the AdoptionSpeed section below for details.
1. `Type` - Type of animal (1 = Dog, 2 = Cat)


AdoptionSpeed:
The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:

* `0` - Pet was adopted on the same day as it was listed.
* `1` - Pet was adopted between 1 and 7 days (1st week) after being listed.
* `2` - Pet was adopted between 8 and 30 days (1st month) after being listed.
* `3` - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* `4` - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
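
To make the bucketing rule above concrete, here is a small sketch that maps a waiting time in days to an `AdoptionSpeed` class. The function name `adoption_speed_from_days` is hypothetical, and two choices are assumptions rather than facts from the dataset: `None` stands for "never adopted", and any wait longer than 90 days is treated as class 4.

```python
def adoption_speed_from_days(days_to_adoption):
    """Map days between listing and adoption to an AdoptionSpeed class (0-4).

    Assumption: None means the pet was not adopted at all.
    """
    if days_to_adoption is None:
        return 4
    if days_to_adoption < 0:
        raise ValueError("days_to_adoption must be non-negative")
    if days_to_adoption == 0:
        return 0  # adopted on the listing day
    if days_to_adoption <= 7:
        return 1  # first week
    if days_to_adoption <= 30:
        return 2  # first month
    if days_to_adoption <= 90:
        return 3  # second and third month
    return 4      # assumption: anything beyond 90 days counts as class 4


for days in (0, 5, 20, 60, None):
    print(days, "->", adoption_speed_from_days(days))
```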

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("./pet_train.csv")

In [3]:
new_col_order = ['PetID','Name', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'RescuerID', 'VideoAmt', 'Description',  'PhotoAmt', 'Type','AdoptionSpeed']
df = df.reindex(columns=new_col_order)

In [4]:
df.head()

| | PetID | Name | Age | Breed1 | Breed2 | Gender | Color1 | Color2 | Color3 | MaturitySize | ... | Health | Quantity | Fee | State | RescuerID | VideoAmt | Description | PhotoAmt | Type | AdoptionSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 86e1089a3 | Nibble | 3 | 299 | 0 | 1 | 1 | 7 | 0 | 1 | ... | 1 | 1 | 100 | 41326 | 8480853f516546f6cf33aa88cd76c379 | 0 | Nibble is a 3+ month old ball of cuteness. He ... | 1.0 | 2 | 2 |
| 1 | 6296e909a | No Name Yet | 1 | 265 | 0 | 1 | 1 | 2 | 0 | 2 | ... | 1 | 1 | 0 | 41401 | 3082c7125d8fb66f7dd4bff4192c8b14 | 0 | I just found it alone yesterday near my apartm... | 2.0 | 2 | 0 |
| 2 | 3422e4906 | Brisco | 1 | 307 | 0 | 1 | 2 | 7 | 0 | 2 | ... | 1 | 1 | 0 | 41326 | fa90fa5b1ee11c86938398b60abc32cb | 0 | Their pregnant mother was dumped by her irresp... | 7.0 | 1 | 3 |
| 3 | 5842f1ff5 | Miko | 4 | 307 | 0 | 2 | 1 | 2 | 0 | 2 | ... | 1 | 1 | 150 | 41401 | 9238e4f44c71a75282e62f7136c6b240 | 0 | Good guard dog, very alert, active, obedience ... | 8.0 | 1 | 2 |
| 4 | 850a43f90 | Hunter | 1 | 307 | 0 | 1 | 1 | 0 | 0 | 2 | ... | 1 | 1 | 0 | 41326 | 95481e953f8aed9ec3d16fc4509537e8 | 0 | This handsome yet cute boy is up for adoption.... | 3.0 | 1 | 2 |


In [5]:
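# Keep an independent copy; plain assignment would only create a second name for the same DataFrame
orig_df = df.copy()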

In [6]:
df.dtypes

PetID             object
Name              object
Age                int64
Breed1             int64
Breed2             int64
Gender             int64
Color1             int64
Color2             int64
Color3             int64
MaturitySize       int64
FurLength          int64
Vaccinated         int64
Dewormed           int64
Sterilized         int64
Health             int64
Quantity           int64
Fee                int64
State              int64
RescuerID         object
VideoAmt           int64
Description       object
PhotoAmt         float64
Type               int64
AdoptionSpeed      int64
dtype: object

In [7]:
# Helper function - print the unique values of every column
def unique_values(df):
    for col in df.columns:
        print(f"==================================Unique values for column ``` {col} ``` ")
        print(df[col].unique())
        print("==============================END==============================\n")

In [8]:
unique_values(df)

['86e1089a3' '6296e909a' '3422e4906' ... 'd981b6395' 'e4da1c9e4'
 'a83d95ead']

['Nibble' 'No Name Yet' 'Brisco' ... 'Monkies' 'Ms Daym' 'Fili']

[  3   1   4  12   0   2  78   6   8  10  36  14  24   5  72  60   9  48
  62  47  19 120  32   7  17  22  16  13  11  37  18  55  20  28  74  53
  25  84  76  30 132  96  46  15  50  56  54  23  92  29  27  49  44 144
  21  31  41  51  65  34 135  39  52  42 108  81  26  38  69 212  33  75
  95  80  63  61 255  89  91  35 117  73 122 123  64  87 112 156  66  67
  77 180  82  86  40  57 168 102  45 147  68  85  88  43 238 100]

[299 265 307 266 264 218 114 285 189 205 292 128 243 213 141 173 207 250
 119 195 109 206  70 103 303  78 254  10  20 305 283 306 288  69 179  31
 247 200 248  26  25   0 129 202  72  24 284 286 152 277  44  75  64  60
 296 185 300  76 139 242 294 276 102 182 289 145 178 233  82  49 239 231
 169 111 232 270 267 268 251  58 155 295 304 147 245 282  21 215 192 154
  71 272 241 262 249 273 108 240  83 293  39  50  93   1 

# Data Preparation

# Modelling

## Try to understand and then apply the concepts as required. The code seems correct but cannot be trusted until it has been applied to the problem.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Separate features and targets
X = df.drop(['Type', 'AdoptionSpeed'], axis=1)
y_type = df['Type']
y_speed = df['AdoptionSpeed']

# Split the features and BOTH targets in one call so the rows stay aligned
(X_train, X_test,
 y_train_type, y_test_type,
 y_train_speed, y_test_speed) = train_test_split(
    X, y_type, y_speed, test_size=0.2, random_state=42)

# Define preprocessing for numerical and categorical data
# (PetID is in neither list, so the ColumnTransformer drops it)
numeric_features = ['Age', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt']
# Note: Name, RescuerID and Description have very many distinct values,
# so one-hot encoding them produces a wide sparse matrix
categorical_features = ['Name', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
                        'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
                        'Sterilized', 'Health', 'State', 'RescuerID', 'Description']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Each pipeline gets its own preprocessor, so fitting one does not refit the other
preprocessor_type = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# The speed model also sees the predicted type; it must be listed here,
# otherwise the ColumnTransformer would silently drop the new column
preprocessor_speed = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features + ['PredictedType'])])

# Define classifiers for type and speed prediction
# (max_iter is raised from the default of 100 to reduce convergence warnings)
classifier_type = LogisticRegression(max_iter=1000)
classifier_speed = LogisticRegression(max_iter=1000)

# Bundle preprocessing and modeling code in a pipeline.
# Type is a single target, so no MultiOutputClassifier wrapper is needed here.
pipeline_type = Pipeline(steps=[('preprocessor', preprocessor_type),
                                ('classifier', classifier_type)])

# Fit the pipeline for type prediction
pipeline_type.fit(X_train, y_train_type)

# Use the predicted type as a feature for speed prediction (train and test)
X_train_with_type = X_train.copy()
X_train_with_type['PredictedType'] = pipeline_type.predict(X_train)
X_test_with_type = X_test.copy()
X_test_with_type['PredictedType'] = pipeline_type.predict(X_test)

# Fit the pipeline for speed prediction using the augmented data
pipeline_speed = Pipeline(steps=[('preprocessor', preprocessor_speed),
                                 ('classifier', classifier_speed)])
pipeline_speed.fit(X_train_with_type, y_train_speed)

# Predictions on the held-out test set
y_pred_type = pipeline_type.predict(X_test)
y_pred_speed = pipeline_speed.predict(X_test_with_type)
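
As an alternative to chaining two models, both targets can be predicted jointly. The sketch below shows how `MultiOutputClassifier` is meant to be used: it expects a two-dimensional target with one column per output and fits one independent classifier per column. The data here is random toy data (assumption, for illustration only), and the names `X_toy`, `Y_toy` and `multi_clf` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Random toy data standing in for the preprocessed pet features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
type_toy = rng.integers(1, 3, size=200)    # 1 = Dog, 2 = Cat
speed_toy = rng.integers(0, 5, size=200)   # AdoptionSpeed classes 0..4

# Two-dimensional target: one column per output.
Y_toy = np.column_stack([type_toy, speed_toy])

# One LogisticRegression is cloned and fitted per target column.
multi_clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
multi_clf.fit(X_toy, Y_toy)

Y_pred = multi_clf.predict(X_toy[:5])
print(Y_pred.shape)  # one row per sample, one column per target
```

With the real data, the same wrapper could sit at the end of a preprocessing pipeline and be fitted on `df[['Type', 'AdoptionSpeed']]`. Unlike the chained approach, the two outputs are then modelled independently of each other.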


# Evaluation
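
A possible starting point for this step is sketched below. It scores predictions with plain accuracy and, because the `AdoptionSpeed` classes are ordered, with a quadratically weighted Cohen's kappa that penalises predictions further from the true class more heavily. The choice of metrics is an assumption, not something fixed by the problem statement, and the helper `evaluate_predictions` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score


def evaluate_predictions(y_true, y_pred):
    """Return accuracy and quadratically weighted kappa for one target."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "quadratic_kappa": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
    }


# Toy labels for illustration only (not model output).
y_true_toy = np.array([0, 1, 2, 3, 4, 2, 1])
y_pred_toy = np.array([0, 1, 2, 3, 4, 3, 1])

scores = evaluate_predictions(y_true_toy, y_pred_toy)
print(scores)
```

For the notebook itself, the same helper could be called once per target, for example with `y_test_speed` and `y_pred_speed`.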