# **Pet Category Prediction**

This project trains a machine learning model to predict the category of a pet from tabular data. Each pet is described by features such as condition, color, length, and height. The models below are trained on these features and predict both `breed_category` and `pet_category` for unseen test rows.

## Import Data and modules

### Importing the required **modules** for this project

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

### Load data into a **pandas data frame**

In [2]:
train_df = pd.read_csv('Hackerearth_challenge/train.csv')

### Structured view of data frame

In [3]:
train_df.head()

| | pet_id | issue_date | listing_date | condition | color_type | length(m) | height(cm) | X1 | X2 | breed_category | pet_category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ANSL_69903 | 2016-07-10 00:00:00 | 2016-09-21 16:25:00 | 2.0 | Brown Tabby | 0.8 | 7.78 | 13 | 9 | 0.0 | 1 |
| 1 | ANSL_66892 | 2013-11-21 00:00:00 | 2018-12-27 17:47:00 | 1.0 | White | 0.72 | 14.19 | 13 | 9 | 0.0 | 2 |
| 2 | ANSL_69750 | 2014-09-28 00:00:00 | 2016-10-19 08:24:00 | NaN | Brown | 0.15 | 40.9 | 15 | 4 | 2.0 | 4 |
| 3 | ANSL_71623 | 2016-12-31 00:00:00 | 2019-01-25 18:30:00 | 1.0 | White | 0.62 | 17.82 | 0 | 1 | 0.0 | 2 |
| 4 | ANSL_57969 | 2017-09-28 00:00:00 | 2017-11-19 09:38:00 | 2.0 | Black | 0.5 | 11.06 | 18 | 4 | 0.0 | 1 |


### Remove less useful features from the data frame

In [4]:
train = train_df.drop(['pet_id', 'issue_date', 'listing_date'], axis = 1)

In [5]:
train_df.describe()

| | condition | length(m) | height(cm) | X1 | X2 | breed_category | pet_category |
|---|---|---|---|---|---|---|---|
| count | 17357.0 | 18834.0 | 18834.0 | 18834.0 | 18834.0 | 18834.0 | 18834.0 |
| mean | 0.88339 | 0.502636 | 27.448832 | 5.369598 | 4.577307 | 0.600563 | 1.709143 |
| std | 0.770434 | 0.288705 | 13.019781 | 6.572366 | 3.517763 | 0.629883 | 0.717919 |
| min | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.25 | 16.1725 | 0.0 | 1.0 | 0.0 | 1.0 |
| 50% | 1.0 | 0.5 | 27.34 | 0.0 | 4.0 | 1.0 | 2.0 |
| 75% | 1.0 | 0.76 | 38.89 | 13.0 | 9.0 | 1.0 | 2.0 |
| max | 2.0 | 1.0 | 50.0 | 19.0 | 9.0 | 2.0 | 4.0 |


## **Data Preprocessing**

### Encoding the text column into labels

In [6]:
le = preprocessing.LabelEncoder()
train['color_type'] = le.fit_transform(train['color_type'])
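
As a rough illustration of what the label encoding step does, the sketch below maps each distinct string to its position in the sorted list of distinct values. It is plain Python rather than scikit-learn, the helper names `fit_label_encoder` and `transform_labels` are made up for this example, and the colors are the five values visible in the `head()` output above.

```python
def fit_label_encoder(values):
    """Return the sorted list of distinct labels (the 'classes')."""
    return sorted(set(values))


def transform_labels(classes, values):
    """Map each label to its index in the sorted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values]


colors = ["Brown Tabby", "White", "Brown", "White", "Black"]
classes = fit_label_encoder(colors)
codes = transform_labels(classes, colors)

print(classes)  # ['Black', 'Brown', 'Brown Tabby', 'White']
print(codes)    # [2, 3, 1, 3, 0]
```

The resulting integers carry an alphabetical order that has no real meaning for colors. Tree-based models are usually tolerant of this, while distance-based models treat the codes as ordinary numbers.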

### Using simple imputer to fill missing values

In [7]:
imp = SimpleImputer(strategy="most_frequent")
train = imp.fit_transform(train)
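
The `most_frequent` strategy replaces every missing entry in a column with that column's most common value. The sketch below shows the idea for a single column in plain Python; the helper names and the sample values are illustrative assumptions, and ties are broken toward the smallest value, which is how the scikit-learn documentation describes the behavior. Note also that `fit_transform` returns a NumPy array, so the column names are lost at this point in the notebook.

```python
from collections import Counter


def most_frequent(values):
    """Most common non-missing value; ties go to the smallest value."""
    counts = Counter(v for v in values if v is not None)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)


def impute_most_frequent(values):
    """Replace every missing entry (None) with the most frequent value."""
    fill = most_frequent(values)
    return [fill if v is None else v for v in values]


# Made-up 'condition' column with one missing entry
condition = [2.0, 1.0, None, 1.0, 2.0, 1.0]
imputed = impute_most_frequent(condition)
print(imputed)  # [2.0, 1.0, 1.0, 1.0, 2.0, 1.0]
```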

In [8]:
train = pd.DataFrame(train)

## **Train Test Split**

In [9]:
X = train[train.columns[0:6]]
y = train[train.columns[6:]]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
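
Conceptually, the split shuffles the row indices with a fixed seed and holds out 20% of them for testing. The sketch below reproduces that idea with the standard library only. The function name `split_indices` is hypothetical, the shuffle will not match the exact rows scikit-learn selects, and rounding the test share up is an assumption about how a fractional `test_size` is handled.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off the test portion."""
    n_test = math.ceil(n_rows * test_size)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# 18834 is the row count reported by describe() above
train_idx, test_idx = split_indices(18834)
print(len(train_idx), len(test_idx))  # 15067 3767
```

Fixing the seed makes the split repeatable, so score differences between models come from the models rather than from a different held-out set.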

In [None]:
# Restore the target column names, which were lost when the imputer returned a NumPy array
y_test = y_test.rename(columns={y_test.columns[0]: 'breed_category', y_test.columns[1]: 'pet_category'})

In [None]:
# Restore the target column names on the training targets as well
y_train = y_train.rename(columns={y_train.columns[0]: 'breed_category', y_train.columns[1]: 'pet_category'})

## **Model Training**

### **Random Forest Classifier**

In [13]:
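# Fit one random forest per target column (breed_category, pet_category)
forest = RandomForestClassifier(n_estimators = 35, max_depth = 15, random_state = 10)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
multi_target_forest.fit(X_train, y_train)
# predict returns one column per target, in the same order as the columns of y_train
y_pred = pd.DataFrame(multi_target_forest.predict(X_test), columns=['breed_category', 'pet_category'])
# Note: scikit-learn documents the signature as f1_score(y_true, y_pred). The arguments are
# passed in the opposite order here, so the 'weighted' average uses predicted class frequencies.
s1 = f1_score(y_pred['pet_category'], y_test['pet_category'], average = 'weighted')
s2 = f1_score(y_pred['breed_category'], y_test['breed_category'], average = 'weighted')
# Final score: mean of the two weighted F1 scores, scaled to 0-100
forest_score = 100 * ((s1+s2) / 2)
print(forest_score)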

85.32153421749537
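
Every score in this notebook is the mean of two weighted F1 scores, one per target. As a sketch of what `average='weighted'` computes, the plain-Python function below calculates F1 per class and weights each class by its share of the true labels. The function name `weighted_f1` and the toy labels are assumptions made for illustration. The example also shows that swapping the two arguments changes the result, because the weights then come from the other list.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's count in y_true."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # undefined F1 is treated as 0
        total += f1 * support[label]
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]

print(round(weighted_f1(y_true, y_pred), 4))  # 0.6222
print(round(weighted_f1(y_pred, y_true), 4))  # 0.7111 (arguments swapped)
```

The per-class F1 values are identical in both calls; only the class weights differ. Scores computed with the true labels in the first position may therefore differ somewhat from the figures printed in this notebook.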


### **Ensemble Classifier**

In [14]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging1 = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
bagging2 = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
bagging1.fit(X_train, y_train['breed_category'])
bagging2.fit(X_train, y_train['pet_category'])

y_pred1 = bagging1.predict(X_test)
y_pred2 = bagging2.predict(X_test)
y_pred = pd.DataFrame()
y_pred['breed_category'] = y_pred1
y_pred['pet_category'] = y_pred2
s1 = f1_score(y_pred['pet_category'], y_test['pet_category'], average = 'weighted')
s2 = f1_score(y_pred['breed_category'], y_test['breed_category'], average = 'weighted')
bagging_score = 100 * ((s1+s2) / 2)
print(bagging_score)

85.34534011937066
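
The bagging setup above trains each member on a random half of the rows and a random half of the features, then combines the members' predictions. The sketch below illustrates that idea in plain Python under simplifying assumptions: each member is a 1-nearest-neighbor rule rather than `KNeighborsClassifier`, predictions are combined by majority vote, and all function names and the toy data are made up for the example.

```python
import random
from collections import Counter


def nearest_neighbor_predict(train_X, train_y, x, features):
    """1-NN prediction using only the selected feature indices."""
    def squared_distance(row):
        return sum((row[f] - x[f]) ** 2 for f in features)
    best = min(range(len(train_X)), key=lambda i: squared_distance(train_X[i]))
    return train_y[best]


def fit_bagging(X, y, n_estimators=10, max_samples=0.5, max_features=0.5, seed=0):
    """Store one (rows, labels, features) subset per ensemble member."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    n_draw = max(1, int(max_samples * n_rows))
    n_feat = max(1, int(max_features * n_cols))
    members = []
    for _ in range(n_estimators):
        rows = [rng.randrange(n_rows) for _ in range(n_draw)]  # rows drawn with replacement
        features = rng.sample(range(n_cols), n_feat)           # features drawn without replacement
        members.append(([X[i] for i in rows], [y[i] for i in rows], features))
    return members


def predict_bagging(members, x):
    """Majority vote over the members' individual predictions."""
    votes = Counter(nearest_neighbor_predict(tx, ty, x, f) for tx, ty, f in members)
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters
X = [[i * 0.1, i * 0.1] for i in range(10)] + [[10 + i * 0.1, 10 + i * 0.1] for i in range(10)]
y = [0] * 10 + [1] * 10
members = fit_bagging(X, y)
print(predict_bagging(members, [0.5, 0.3]), predict_bagging(members, [10.2, 10.4]))
```

Because each member sees different rows and features, their individual errors tend to differ, and combining them usually gives a more stable prediction than any single member.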




### **AdaBoost Classifier**

In [15]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

clf1 = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
    learning_rate=1.5,
    algorithm="SAMME")
clf2 = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
    learning_rate=1.5,
    algorithm="SAMME")
clf1.fit(X_train, y_train['breed_category'])
clf2.fit(X_train, y_train['pet_category'])

y_pred1 = clf1.predict(X_test)
y_pred2 = clf2.predict(X_test)
y_pred = pd.DataFrame()
y_pred['breed_category'] = y_pred1
y_pred['pet_category'] = y_pred2
s1 = f1_score(y_pred['pet_category'], y_test['pet_category'], average = 'weighted')
s2 = f1_score(y_pred['breed_category'], y_test['breed_category'], average = 'weighted')
clf_score = 100 * ((s1+s2) / 2)
print(clf_score)


72.20528611713682
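
With `algorithm="SAMME"`, each weak learner receives a vote weight that depends on its weighted training error, the number of classes, and the learning rate. The sketch below computes that weight from the SAMME formula. The function name is hypothetical, the error values are invented, and raising an exception for an unusable error is a choice made for this sketch rather than a description of how scikit-learn reacts.

```python
import math


def samme_estimator_weight(error, n_classes, learning_rate=1.0):
    """SAMME vote weight: learning_rate * (ln((1 - err) / err) + ln(K - 1))."""
    if not 0.0 < error < 1.0 - 1.0 / n_classes:
        raise ValueError("error must be above 0 and better than random guessing")
    return learning_rate * (math.log((1.0 - error) / error) + math.log(n_classes - 1.0))


# Example: a weak learner with 25% weighted error on a 3-class target,
# using the learning rate of 1.5 set in the cell above
weight = samme_estimator_weight(0.25, n_classes=3, learning_rate=1.5)
print(round(weight, 4))  # 2.6876

# In a multi-class problem, a learner with 50% error still gets a positive weight
print(round(samme_estimator_weight(0.5, n_classes=3), 4))  # 0.6931
```

The `ln(K - 1)` term is what lets a learner that is only better than random guessing among `K` classes still contribute.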




### **Support Vector Machines**

In [16]:
from sklearn.svm import SVC
# Fit one SVM per target column (breed_category, pet_category)
clf = SVC(C = 7.0, gamma='scale')
multi_target_svm = MultiOutputClassifier(clf, n_jobs=-1)
multi_target_svm.fit(X_train, y_train)
# predict returns one column per target, in the same order as the columns of y_train
y_pred = pd.DataFrame(multi_target_svm.predict(X_test), columns=['breed_category', 'pet_category'])
s1 = f1_score(y_pred['pet_category'], y_test['pet_category'], average = 'weighted')
s2 = f1_score(y_pred['breed_category'], y_test['breed_category'], average = 'weighted')
# Final score: mean of the two weighted F1 scores, scaled to 0-100
svm_score = 100 * ((s1+s2) / 2)
print(svm_score)

81.64164194228154
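
`SVC` uses an RBF kernel by default, and `gamma='scale'` sets the kernel width to `1 / (n_features * X.var())`, where the variance is taken over all entries of the feature matrix. The sketch below computes that value and one kernel evaluation in plain Python; the function names and the tiny matrix are assumptions for illustration only.

```python
import math


def gamma_scale(X):
    """gamma = 1 / (n_features * variance of all entries in X)."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(X[0]) * variance)


def rbf_kernel(a, b, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


X = [[0.0, 2.0], [2.0, 4.0]]
gamma = gamma_scale(X)
similarity = rbf_kernel(X[0], X[1], gamma)

print(gamma)                 # 0.25
print(round(similarity, 4))  # 0.1353
```

Because the kernel depends on raw distances, a feature with a wide range, such as `height(cm)` (5 to 50), contributes far more to the distance than `length(m)` (0 to 1). Scaling the features before fitting is a common way to even this out.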




### **Logistic Regression**

In [17]:
from sklearn.linear_model import LogisticRegression
# Fit one logistic regression model per target column (breed_category, pet_category)
clf = LogisticRegression(solver = 'newton-cg', penalty = 'l2', max_iter = 100)
multi_target_logreg = MultiOutputClassifier(clf, n_jobs=-1)
multi_target_logreg.fit(X_train, y_train)
# predict returns one column per target, in the same order as the columns of y_train
y_pred = pd.DataFrame(multi_target_logreg.predict(X_test), columns=['breed_category', 'pet_category'])
s1 = f1_score(y_pred['pet_category'], y_test['pet_category'], average = 'weighted')
s2 = f1_score(y_pred['breed_category'], y_test['breed_category'], average = 'weighted')
# Final score: mean of the two weighted F1 scores, scaled to 0-100
log_score = 100 * ((s1+s2) / 2)
print(log_score)



73.48191555622293




### **Multi Layer Perceptron Classifier**

In [18]:
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10, 5), random_state=1)
nn1 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10, 5), random_state=1)
nn.fit(X_train, y_train['breed_category'])
nn1.fit(X_train, y_train['pet_category'])
y_pred1 = nn.predict(X_test)
y_pred2 = nn1.predict(X_test)
y_pred = pd.DataFrame()
y_pred['breed_category'] = y_pred1
y_pred['pet_category'] = y_pred2
s1 = f1_score(y_pred['pet_category'], y_test['pet_category'], average = 'weighted')
s2 = f1_score(y_pred['breed_category'], y_test['breed_category'], average = 'weighted')
mlp_score = 100 * ((s1+s2) / 2)
print(mlp_score)

74.88004530035013




### **Score Comparison**

In [19]:
# Each score is 100 * the mean of the two weighted F1 scores (breed_category and pet_category)
print("Score for Random Forest Classifier - {}".format(forest_score))
print("Score for Ensemble Classifier - {}".format(bagging_score))
print("Score for AdaBoost Classifier - {}".format(clf_score))
print("Score for Support Vector Machines - {}".format(svm_score))
print("Score for Logistic Regression - {}".format(log_score))
print("Score for Multi Layer Perceptron - {}".format(mlp_score))

Score for Random Forest Classifier - 85.32153421749537
Score for Ensemble Classifier - 85.34534011937066
Score for AdaBoost Classifier - 72.20528611713682
Score for Support Vector Machines - 81.64164194228154
Score for Logistic Regression - 73.48191555622293
Score for Multi Layer Perceptron - 74.88004530035013
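
The comparison can also be done programmatically. The sketch below ranks the six scores printed above and reports the gap between the top two; the dictionary simply restates those printed values, and the variable names are chosen for this example.

```python
# Scores copied from the output above
scores = {
    "Random Forest Classifier": 85.32153421749537,
    "Ensemble Classifier": 85.34534011937066,
    "AdaBoost Classifier": 72.20528611713682,
    "Support Vector Machines": 81.64164194228154,
    "Logistic Regression": 73.48191555622293,
    "Multi Layer Perceptron": 74.88004530035013,
}

# Sort from best to worst and measure the lead of the top model
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_name, best_score = ranking[0]
margin = best_score - ranking[1][1]

for name, score in ranking:
    print("{:<26} {:.2f}".format(name, score))
print("Best: {} (ahead by {:.3f} points)".format(best_name, margin))
```

The lead over the Random Forest Classifier is only about 0.02 points. Since the bagging classifiers above are created without a `random_state`, a rerun could plausibly change the order of the top two.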


### From the above results, the **Ensemble Classifier** gave the *best* score (85.35), narrowly ahead of the **Random Forest Classifier** (85.32), after manual *hyperparameter tuning* of all other classifiers.