# Jarrale Butts

## AnimalRescue Data Mining Enhancement

In [1]:
import pandas as pd
import pymongo
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Connect to MongoDB Atlas
# Placeholder: replace with the actual MongoDB Atlas connection string
client = pymongo.MongoClient("mongodb_connection_string")
db = client['AAC']
collection = db['animals']

# Load data from MongoDB into DataFrame
data = list(collection.find())
df = pd.DataFrame(data)  # Convert to DataFrame

# Data preprocessing
# Drop rows with missing values in the 'outcome_type' column
df = df.dropna(subset=['outcome_type'])

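# Convert categorical data to numeric.
# Each column gets its own encoder so its label mapping is not overwritten.
breed_enc = LabelEncoder()
sex_enc = LabelEncoder()
label_enc = LabelEncoder()  # outcome_type encoder; its classes_ label the report below
df['breed_encoded'] = breed_enc.fit_transform(df['breed'])
df['sex_upon_outcome_encoded'] = sex_enc.fit_transform(df['sex_upon_outcome'])
df['outcome_type_encoded'] = label_enc.fit_transform(df['outcome_type'])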

# Convert date from type object to type datetime64 -> then to a numeric feature
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
df['days_since_birth'] = (pd.Timestamp.now() - df['date_of_birth']).dt.days

# Features and target variable
y = df.outcome_type_encoded
features_columns = ['breed_encoded', 'days_since_birth', 'sex_upon_outcome_encoded']
X = df[features_columns]


# Train a Decision Tree Classifier

# Initialize the classifier model
classifier = DecisionTreeClassifier(random_state=0)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Fit classifier with the training data
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the classifier
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=label_enc.classes_))


Confusion Matrix:
[[  0   0   0   0   1   0   0   0   0   0]
 [  0 685   1   0  15   0   0 163   1 185]
 [  0   0   0   0   3   0   0   4   0  17]
 [  0   0   0   1   5   0   0   0   0   1]
 [  0  11   4   5 114   0   0  29   1  45]
 [  0   0   0   0   0   0   0   1   0   0]
 [  0   0   0   0   3   0   0   0   0   0]
 [  0 171   1   1  18   0   0 157   0 102]
 [  0   2   0   0   0   0   0   2   0   0]
 [  0 200   7   0  45   2   0 105   1 391]]

Classification Report:
                 precision    recall  f1-score   support

                      0.00      0.00      0.00         1
       Adoption       0.64      0.65      0.65      1050
           Died       0.00      0.00      0.00        24
       Disposal       0.14      0.14      0.14         7
     Euthanasia       0.56      0.55      0.55       209
        Missing       0.00      0.00      0.00         1
       Relocate       0.00      0.00      0.00         3
Return to Owner       0.34      0.35      0.34       450
      Rto-Ad

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The code connects to MongoDB Atlas with the pymongo library and reads animal outcome data from the ‘animals’ collection of the ‘AAC’ database. All documents in the collection are retrieved and converted into a Pandas DataFrame for analysis. The data is then preprocessed in three steps: rows with missing values in the ‘outcome_type’ column are dropped, the categorical variables ‘breed’, ‘sex_upon_outcome’, and ‘outcome_type’ are encoded as integers with LabelEncoder, and ‘days_since_birth’ is calculated from the ‘date_of_birth’ field. These steps prepare the data for training a Decision Tree Classifier, with the encoded breed, days since birth, and encoded sex upon outcome as features and the encoded outcome type as the target. The data is split 75/25 into training and test sets before the classifier is fitted.
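The preprocessing steps described above can be tried on a small in-memory table, without a database connection. The sketch below is illustrative only: the three records are hypothetical stand-ins for documents from the ‘animals’ collection, the `encoders` dictionary and `reference_date` are names introduced here, and the fixed reference date is an assumption made so the result is reproducible (the notebook itself uses the current time).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy records standing in for documents from the 'animals' collection.
records = [
    {"breed": "Labrador Retriever Mix", "sex_upon_outcome": "Neutered Male",
     "date_of_birth": "2015-03-01", "outcome_type": "Adoption"},
    {"breed": "Domestic Shorthair Mix", "sex_upon_outcome": "Spayed Female",
     "date_of_birth": "2016-07-15", "outcome_type": "Transfer"},
    {"breed": "Labrador Retriever Mix", "sex_upon_outcome": "Intact Male",
     "date_of_birth": "2014-11-30", "outcome_type": None},
]
df = pd.DataFrame(records)

# Step 1: drop rows that have no target label.
df = df.dropna(subset=["outcome_type"]).copy()

# Step 2: encode each categorical column with its own LabelEncoder,
# so every integer code can be mapped back to its original string.
encoders = {}
for col in ["breed", "sex_upon_outcome", "outcome_type"]:
    encoders[col] = LabelEncoder()
    df[f"{col}_encoded"] = encoders[col].fit_transform(df[col])

# Step 3: derive age in days from the date of birth.
# Assumption: a fixed reference date instead of pd.Timestamp.now().
reference_date = pd.Timestamp("2020-01-01")
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])
df["days_since_birth"] = (reference_date - df["date_of_birth"]).dt.days

print(df[["breed_encoded", "sex_upon_outcome_encoded", "days_since_birth"]])
print(list(encoders["outcome_type"].classes_))
```

Keeping one encoder per column means a call such as `encoders["breed"].inverse_transform(...)` can recover the original breed names later, which a single reused encoder cannot do once it has been refitted on another column.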

The model evaluation output consists of a confusion matrix and a classification report, which together describe the classifier’s performance on the 2,500-record test set. The confusion matrix shows that the model fails entirely on several rare classes: for ‘Died’ (24 test records) and ‘Rto-Adopt’ (4 test records) it made no correct predictions. The ‘Adoption’ class has the most true positives (685 of 1,050), but misclassification is still substantial; for example, 163 ‘Adoption’ records were predicted as ‘Return to Owner’ and 171 ‘Return to Owner’ records were predicted as ‘Adoption’. The classification report gives the precision, recall, and F1-score for each class. The overall accuracy is 54% (1,348 of 2,500 predictions correct), which leaves considerable room for improvement, especially on underrepresented classes. The low precision and recall for several classes suggest that the model may benefit from changes to the preprocessing, such as addressing class imbalance, or from alternative modeling techniques.
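The accuracy figure can be checked by hand from the confusion matrix printed in the output. The following sketch copies that matrix into a plain Python list and recomputes accuracy, per-class recall, and per-class precision; the variable names are introduced here for illustration, and index 1 is taken to be ‘Adoption’ based on the row order of the classification report.

```python
# Confusion matrix copied from the printed output
# (rows = true class, columns = predicted class).
cm = [
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 685, 1, 0, 15, 0, 0, 163, 1, 185],
    [0, 0, 0, 0, 3, 0, 0, 4, 0, 17],
    [0, 0, 0, 1, 5, 0, 0, 0, 0, 1],
    [0, 11, 4, 5, 114, 0, 0, 29, 1, 45],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 3, 0, 0, 0, 0, 0],
    [0, 171, 1, 1, 18, 0, 0, 157, 0, 102],
    [0, 2, 0, 0, 0, 0, 0, 2, 0, 0],
    [0, 200, 7, 0, 45, 2, 0, 105, 1, 391],
]
n = len(cm)

# Accuracy = correctly classified records (the diagonal) / all records.
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total

# Recall for class i = diagonal entry / row sum (all records truly in class i).
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]

# Precision for class j = diagonal entry / column sum (all records predicted as j).
# Classes that were never predicted have a column sum of 0; report 0.0 for those.
col_sums = [sum(row[j] for row in cm) for j in range(n)]
precision = [cm[j][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]

print(f"accuracy = {correct}/{total} = {accuracy:.2f}")
print(f"Adoption: precision = {precision[1]:.2f}, recall = {recall[1]:.2f}")
```

The classes with a zero column sum are the ones that were never predicted at all, which is one likely reason the run emitted metric warnings.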