The "Animal Condition Classification Dataset" poses a classification problem in animal health assessment. It covers a range of species, from birds to mammals, and supports building models that predict whether an animal's condition is dangerous based on five recorded symptoms. Because the data spans many species, a single classifier can be trained across taxonomic boundaries, which makes the dataset useful for people working in animal welfare and wildlife conservation.

However, its manual collection process introduces potential sources of error, including spelling mistakes and variations in symptom representation. This necessitates meticulous data-cleaning efforts.
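
A first cleaning pass could normalise whitespace and capitalisation before any categories are counted. The sketch below runs on a tiny made-up frame (the rows are illustrative, not taken from the dataset), and `normalise_text` is a hypothetical helper name. Genuine misspellings would still need a hand-built mapping on top of this.

```python
import pandas as pd

# Illustrative rows only; the real file is loaded later in the notebook.
raw = pd.DataFrame({
    "AnimalName": ["Dog", "dog ", " cat", "Cat"],
    "symptoms1": ["Fever", "fever", "FEVER ", "Weight  loss"],
})

def normalise_text(series: pd.Series) -> pd.Series:
    """Trim, collapse repeated whitespace, and lower-case a text column."""
    return (
        series.astype("string")
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.lower()
    )

# Apply the helper to every column of the toy frame.
clean = raw.apply(normalise_text)
print(clean["AnimalName"].value_counts())
```

After this step, variants such as `Dog` and `dog ` collapse into a single category, which keeps the one-hot encoding from creating duplicate columns for the same animal or symptom.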

As you work with the "Animal Condition Classification Dataset," you will face challenges such as class imbalance and the need for feature engineering. Addressing them is crucial for building robust classification models. The dataset is a rich resource for anyone who wants to contribute to animal health assessment, provided it is handled carefully and with enough methodological rigour to give insightful and ethically sound results.


- `AnimalName`: The kind of animal, such as dog or cat.
- `symptoms1` to `symptoms5`: The five recorded symptoms, one per column.
- `Dangerous`: Whether the condition is dangerous or not.

Source: [Kaggle](https://www.kaggle.com/datasets/gracehephzibahm/animal-disease/)

In [20]:
import pandas as pd


In [32]:
df=pd.read_csv("data.csv")
df.head()

Unnamed: 0,AnimalName,symptoms1,symptoms2,symptoms3,symptoms4,symptoms5,Dangerous
0,Dog,Fever,Diarrhea,Vomiting,Weight loss,Dehydration,Yes
1,Dog,Fever,Diarrhea,Coughing,Tiredness,Pains,Yes
2,Dog,Fever,Diarrhea,Coughing,Vomiting,Anorexia,Yes
3,Dog,Fever,Difficulty breathing,Coughing,Lethargy,Sneezing,Yes
4,Dog,Fever,Diarrhea,Coughing,Lethargy,Blue Eye,Yes


In [39]:
# Drop rows where the target 'Dangerous' is missing, then reset the index
df.dropna(subset=['Dangerous'], inplace=True)
df.reset_index(drop=True, inplace=True)

# Equivalent one-liner without in-place mutation:
#df = df.dropna(subset=['Dangerous']).reset_index(drop=True)



# Summarise column dtypes and non-null counts
df.info()

# Count how often each distinct value occurs in every column
for column in df.columns:
    print(f"\nColumn: {column}")
    print(df[column].value_counts())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 869 entries, 0 to 868
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   AnimalName  869 non-null    object
 1   symptoms1   869 non-null    object
 2   symptoms2   869 non-null    object
 3   symptoms3   869 non-null    object
 4   symptoms4   869 non-null    object
 5   symptoms5   869 non-null    object
 6   Dangerous   869 non-null    object
dtypes: object(7)
memory usage: 47.7+ KB

Column: AnimalName
AnimalName
Buffaloes            128
Sheep                109
Pig                   63
Fowl                  62
Elephant              59
Duck                  56
Deer                  38
Donkey                38
Birds                 37
cat                   36
Dog                   34
Monkey                28
Goat                  26
Cattle                21
Hamster               18
Tiger                 17
Lion                  16
Rabbit                11
Horse         

### Prepare features and target

In [40]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Features (X)
X = df[['AnimalName', 'symptoms1', 'symptoms2', 'symptoms3', 'symptoms4', 'symptoms5']]

# Target variable (y)
y = df['Dangerous']

# Label encoding for the target variable
le = LabelEncoder()
y = le.fit_transform(y)

# One-hot encoding for the categorical feature columns
# (scikit-learn >= 1.2 uses `sparse_output`; older versions used `sparse`)
encoder = OneHotEncoder(drop='first', sparse_output=False)
X_encoded = encoder.fit_transform(X)
columns = encoder.get_feature_names_out(X.columns)

# Every original column is categorical, so the encoded frame is the full feature matrix
X = pd.DataFrame(X_encoded, columns=columns, index=X.index)
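
The cell above fits the encoder on the full dataset before splitting. An alternative is to wrap the encoder and the model in a scikit-learn `Pipeline`, so the encoder only sees training rows and categories that first appear at prediction time are ignored rather than raising an error. This is a minimal sketch on a toy frame. The rows, `feature_cols`, and the train/test slicing are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the notebook's df; values are illustrative.
toy = pd.DataFrame({
    "AnimalName": ["Dog", "Dog", "Cat", "Cat", "Sheep", "Sheep"],
    "symptoms1": ["Fever", "Cough", "Fever", "Cough", "Fever", "Cough"],
    "Dangerous": ["Yes", "No", "Yes", "No", "Yes", "No"],
})
feature_cols = ["AnimalName", "symptoms1"]

# handle_unknown="ignore" encodes unseen categories as all-zero indicator columns.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), feature_cols)]
)
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# "Sheep" never appears in the training rows, only in the test rows.
train, test = toy.iloc[:4], toy.iloc[4:]
pipe.fit(train[feature_cols], train["Dangerous"])
preds = pipe.predict(test[feature_cols])
print(preds)
```

Because the pipeline accepts the raw string columns and the raw `Yes`/`No` target, the separate `LabelEncoder` and manual column bookkeeping are not needed in this variant.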





### 1. Logistic Regression

In [42]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a logistic regression model
logreg_model = LogisticRegression()

# Train the model
logreg_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logreg_model.predict(X_test)

# Inverse transform for label encoding to get original 'Dangerous' values
y_test_original = le.inverse_transform(y_test)
y_pred_original = le.inverse_transform(y_pred)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9885057471264368
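
This accuracy needs a baseline to be interpreted. The classification reports further down list a test support of 2 rows for class 0 and 172 rows for class 1, so a model that always predicts class 1 would already score 172/174 ≈ 0.9885. The sketch below reproduces that figure with `DummyClassifier` on synthetic labels that have the same class counts. Fitting and scoring on the same labels is for illustration only; in the notebook the baseline would be fit on `y_train`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels mirroring the reported test support: 2 of class 0, 172 of class 1.
y_true = np.array([0] * 2 + [1] * 172)
X_dummy = np.zeros((len(y_true), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_acc = accuracy_score(y_true, y_base)
baseline_bal_acc = balanced_accuracy_score(y_true, y_base)
print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Baseline balanced accuracy: {baseline_bal_acc:.2f}")
```

Matching the baseline does not prove that a model predicts only the majority class, but it does mean accuracy alone cannot separate the two cases. Balanced accuracy, which averages per-class recall, drops to 0.5 for the constant predictor.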


### 2. Decision Trees


In [43]:
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree model
tree_model = DecisionTreeClassifier()

# Train the model
tree_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = tree_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))


Accuracy: 0.9942528735632183
              precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.99      1.00      1.00       172

    accuracy                           0.99       174
   macro avg       1.00      0.75      0.83       174
weighted avg       0.99      0.99      0.99       174



### 3. Random Forest


In [44]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest model
rf_model = RandomForestClassifier()

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))


Accuracy: 0.9885057471264368
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.99      1.00      0.99       172

    accuracy                           0.99       174
   macro avg       0.49      0.50      0.50       174
weighted avg       0.98      0.99      0.98       174
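
A recall of 0.00 on class 0 together with 1.00 on class 1 means both class-0 test rows were predicted as class 1. A confusion matrix makes this easier to see than the averaged scores. The sketch below uses synthetic labels that reproduce the reported pattern rather than the notebook's actual `y_test` and `y_pred`. It also passes `zero_division=0`, which tells `classification_report` to report the undefined precision as 0.0 instead of warning about it.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic labels reproducing the reported pattern: class 0 is never predicted.
y_true = np.array([0] * 2 + [1] * 172)
y_hat = np.ones_like(y_true)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

# Precision for class 0 is 0/0 here; zero_division=0 reports it as 0.0.
report = classification_report(y_true, y_hat, zero_division=0, output_dict=True)
print(report["0"])
```

In the notebook, the same two calls can be made with `y_test` and `y_pred` after each model is evaluated.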





### 4. Support Vector Machines (SVM)


In [47]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create an SVM model
svm_model = SVC()

# Train the model
svm_model.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set
y_pred = svm_model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))


Accuracy: 0.9885057471264368
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.99      1.00      0.99       172

    accuracy                           0.99       174
   macro avg       0.49      0.50      0.50       174
weighted avg       0.98      0.99      0.98       174





### 5. Naive Bayes


In [48]:
from sklearn.naive_bayes import GaussianNB

# Create a Naive Bayes model
nb_model = GaussianNB()

# Train the model
nb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = nb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))


Accuracy: 0.9770114942528736
              precision    recall  f1-score   support

           0       0.33      1.00      0.50         2
           1       1.00      0.98      0.99       172

    accuracy                           0.98       174
   macro avg       0.67      0.99      0.74       174
weighted avg       0.99      0.98      0.98       174
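
`GaussianNB` models each feature as a continuous, normally distributed value, whereas the one-hot columns used here are 0/1 indicators. `BernoulliNB` is designed for binary features and may be a more natural fit. The sketch below only shows the API on synthetic binary data. `X_bin` and `y_bin` are made-up stand-ins, and nothing here indicates how the variant would score on the real dataset.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic 0/1 indicator features standing in for one-hot encoded columns.
rng = np.random.default_rng(0)
X_bin = rng.integers(0, 2, size=(100, 6))
y_bin = X_bin[:, 0]  # the label simply copies the first indicator

# Fit on the binary features and report training accuracy.
bnb = BernoulliNB().fit(X_bin, y_bin)
bnb_acc = bnb.score(X_bin, y_bin)
print(f"Training accuracy: {bnb_acc:.2f}")
```

In the notebook this would amount to replacing `GaussianNB()` with `BernoulliNB()` and keeping the rest of the cell unchanged.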



### 6. Gradient Boosting (XGBoost)


In [49]:
from xgboost import XGBClassifier

# Create an XGBoost model
xgb_model = XGBClassifier()

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = xgb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))




Accuracy: 0.9770114942528736
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.99      0.99      0.99       172

    accuracy                           0.98       174
   macro avg       0.49      0.49      0.49       174
weighted avg       0.98      0.98      0.98       174





### 7. Neural Networks (using TensorFlow/Keras)


In [50]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Create a neural network model
model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy: {accuracy}')


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 0.9885057210922241
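
All seven models were scored on a single test split that contains only two class-0 rows, so one prediction either way moves the class-0 metrics a lot. Stratified cross-validation with a metric that weights both classes equally gives a steadier comparison. The sketch below runs on synthetic imbalanced data. `X_demo`, `y_demo`, the fold count, and the use of `class_weight="balanced"` are illustrative assumptions, not settings from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the encoded feature matrix.
rng = np.random.default_rng(42)
y_demo = np.array([0] * 20 + [1] * 180)
X_demo = rng.normal(size=(200, 5)) + y_demo[:, None]  # class 1 is shifted by +1

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "tree": DecisionTreeClassifier(class_weight="balanced", random_state=42),
}

# Balanced accuracy averages per-class recall, so the minority class counts equally.
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="balanced_accuracy"
    )
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

With very few minority rows, the number of folds cannot exceed the minority-class count, so `n_splits` has to be chosen with the real class distribution in mind.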
