# Zoo Dataset Analysis

## Introduction

In this notebook, we will analyze the Zoo dataset obtained from the UCI Machine Learning Repository. The dataset describes animals through a set of features, and the task is to predict each animal's class from those features. Understanding the relationships between the features and the target variable will help us see which characteristics distinguish different types of animals.

### Dataset Overview

The Zoo dataset contains 101 instances (animals) and 17 attributes: 16 features (15 binary attributes plus `legs`, the number of legs) and the target variable `type`, which gives the animal's class. The animal's name serves as the row identifier. We will perform several preprocessing steps to prepare the data for analysis and modeling.

## 1. Installing Required Libraries

To start, we need to install the `ucimlrepo` library, which allows us to fetch datasets from the UCI repository easily. Let's install it using pip.


In [1]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


## 2. Importing Necessary Libraries
We will import the following libraries for our analysis:

* `pandas`: For data manipulation and analysis.
* `numpy`: For numerical operations.
* `scikit-learn`: For machine learning utilities, including data preprocessing and train/test splitting.
* `ucimlrepo`: For fetching datasets directly from the UCI Machine Learning Repository.



In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.utils import shuffle

## 3. Fetching the Dataset
We will fetch the Zoo dataset using the `fetch_ucirepo` function. The dataset ID for the Zoo dataset is 111. This function retrieves the dataset along with its metadata.


In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
zoo = fetch_ucirepo(id=111)

{'uci_id': 111, 'name': 'Zoo', 'repository_url': 'https://archive.ics.uci.edu/dataset/111/zoo', 'data_url': 'https://archive.ics.uci.edu/static/public/111/data.csv', 'abstract': 'Artificial, 7 classes of animals', 'area': 'Biology', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 101, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': [], 'target_col': ['type'], 'index_col': ['animal_name'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1990, 'last_updated': 'Fri Sep 15 2023', 'dataset_doi': '10.24432/C5R59V', 'creators': ['Richard Forsyth'], 'intro_paper': None, 'additional_info': {'summary': 'A simple database containing 17 Boolean-valued attributes.  The "type" attribute appears to be the class attribute.  Here is a breakdown of which animals are in which type: (I find it unusual that there are 2 instances of "frog" and one of "girl"!)', 'purpose': None, 'funded_by': None, 'inst

## 4. Preparing the Data
After fetching the dataset, we will separate the features (X) and the target variable (y). The features represent the characteristics of the animals, while the target variable indicates the type of animal.

In [None]:
# data (as pandas dataframes)
X = zoo.data.features
y = zoo.data.targets

## 5. Exploring Metadata
Understanding the metadata of the dataset provides us with valuable information about the dataset's context, including the number of instances, attributes, and their types.

In [None]:
# metadata
print(zoo.metadata)



**Key Metadata Insights**

The metadata provides insights such as:

* **Number of Instances:** 101
* **Attributes:** 16 features (categorical and integer types) plus the target variable `type`
* **Repository URL:** https://archive.ics.uci.edu/dataset/111/zoo

## 6. Variable Information
Next, we will inspect the variable information to understand their roles and types. This will help us determine how to preprocess the data.

In [None]:
# variable information
print(zoo.variables)

### Important Variables
The dataset includes the following important variables:

* animal_name: Categorical (row identifier, not part of the feature set)
* hair: Binary (1 = yes, 0 = no)
* feathers: Binary
* eggs: Binary
* milk: Binary
* airborne: Binary
* aquatic: Binary
* predator: Binary
* toothed: Binary
* backbone: Binary
* breathes: Binary
* venomous: Binary
* fins: Binary
* legs: Categorical (number of legs, stored as an integer)
* tail: Binary
* domestic: Binary
* catsize: Binary
* type: Target variable (Categorical)
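
One way to confirm which columns are binary is to check whether every value in a column is 0 or 1. The sketch below uses a small hand-made frame (`toy`, with hypothetical rows rather than the real data) so that it runs offline; with the real dataset, `toy` would be replaced by `X`.

```python
import pandas as pd

# Hypothetical toy rows standing in for zoo.data.features
toy = pd.DataFrame({
    "hair": [1, 0, 1, 0],
    "eggs": [0, 1, 0, 1],
    "legs": [4, 2, 0, 6],
})

# A column is treated as binary when all of its values are 0 or 1
is_binary = toy.isin([0, 1]).all()
binary_cols = is_binary[is_binary].index.tolist()
other_cols = is_binary[~is_binary].index.tolist()

print("binary:", binary_cols)
print("non-binary:", other_cols)
```

Columns reported as non-binary are the candidates for one-hot encoding.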

In [None]:
X

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize
0,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1
1,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1
2,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0
3,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1
4,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,1
97,1,0,1,0,1,0,0,0,0,1,1,0,6,0,0,0
98,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1
99,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0


In [None]:
y

Unnamed: 0,type
0,1
1,1
2,4
3,1
4,1
...,...
96,1
97,6
98,1
99,7


# 7. Preprocessing the Data
Before training any models, we need to preprocess the data to ensure that it's in the right format.

## 7.1. One-Hot Encoding for Categorical Variables
The `legs` feature is categorical, so we will use one-hot encoding to convert it into multiple binary columns, one per distinct leg count. This keeps models from treating the leg count as an ordered numeric quantity.

In [None]:
# one-hot encoding for 'legs'

# create a OneHotEncoder instance
encoder = OneHotEncoder()



In [None]:
# one-hot encoding for legs
X_encoded = pd.DataFrame(encoder.fit_transform(X[['legs']]).toarray(), columns=encoder.get_feature_names_out(['legs']))



In [None]:
# drop the original 'legs' column and append the one-hot encoded columns
X = pd.concat([X.drop(columns=['legs']), X_encoded], axis=1)



## 7.2. Standardization of Features
Many machine learning algorithms, particularly distance-based and gradient-based ones, are sensitive to the scale of their inputs, so we will standardize the features. Standardization rescales each column to have a mean of 0 and a standard deviation of 1.

In [None]:
# create the scaler used to standardize the features
scaler = StandardScaler()



In [None]:
# scale the entire feature set
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
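
Fitting the scaler on all rows means that rows later assigned to the test set contribute to the mean and standard deviation. A common alternative, sketched here on a hypothetical random matrix (`features`, not the Zoo data), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature matrix (10 rows, 2 columns)
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(10, 2))

train, test = train_test_split(features, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from the training rows only
test_scaled = scaler.transform(test)        # reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
```

The test rows are transformed with the training statistics, so their mean and standard deviation will generally not be exactly 0 and 1.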



## 7.3. Label Encoding for the Target Variable
We will use label encoding for the target variable `type`, mapping its class labels to consecutive integers starting at 0. This gives the target a consistent numeric format for the modeling process.

In [None]:
# label encoding for the target variable 'type'
label_encoder = LabelEncoder()
# pass the 'type' column as a 1-D Series; passing the whole DataFrame raises a DataConversionWarning
y_encoded = label_encoder.fit_transform(y['type'])
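
To illustrate what the encoder does, the sketch below fits a `LabelEncoder` on a short hypothetical list of labels (`labels`, written in the style of the `type` column) and maps the encoded values back to the originals.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels in the style of the 'type' column
labels = [1, 4, 1, 7, 2]

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

print(label_encoder.classes_)                    # sorted unique original labels
print(encoded)                                   # position of each label within classes_
print(label_encoder.inverse_transform(encoded))  # recovers the original labels
```

Keeping the fitted encoder around makes it possible to translate model predictions back to the original class labels with `inverse_transform`.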


# 8. Splitting the Dataset
We first shuffle the dataset explicitly, then split it into training and testing sets using an 80/20 split. Note that `train_test_split` also shuffles by default, so the explicit shuffle is optional.

In [None]:
# shuffle the dataset before splitting
X_scaled, y_encoded = shuffle(X_scaled, y_encoded, random_state=42)



In [None]:
# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.2, random_state=42)
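
With a small dataset and several classes, a purely random split can leave a rare class under-represented in the test set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions similar in both parts. The sketch below uses hypothetical imbalanced labels (`labels`) rather than the Zoo target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 12 of class 0, 4 of class 1, 4 of class 2
labels = np.array([0] * 12 + [1] * 4 + [2] * 4)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

# Per-class counts in each part of the split
print(np.bincount(y_tr))
print(np.bincount(y_te))
```

Stratification requires at least two samples in every class, which would need to be checked on the real target before using it.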



# Summary of Dataset Shapes
Let's print the shapes of the training and testing datasets to verify our split.

In [None]:
# shapes of the final preprocessed training and test sets
print("Training Set Shape:", X_train.shape)
print("Test Set Shape:", X_test.shape)


Training Set Shape: (80, 21)
Test Set Shape: (21, 21)
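
These shapes can be checked by hand. For a fractional `test_size`, scikit-learn rounds the test count up and assigns the remaining rows to training. The column count is the 15 binary features plus the one-hot columns for `legs`; the value of six distinct leg counts used below is an assumption inferred from the 21 columns reported above.

```python
import math

n_samples = 101   # rows in the dataset
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # test share is rounded up
n_train = n_samples - n_test               # remaining rows go to training

n_binary = 15       # feature columns other than legs
n_legs_levels = 6   # assumption: inferred from the 21 columns in the output
n_columns = n_binary + n_legs_levels

print(n_train, n_test, n_columns)
```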


# 9. Displaying Sample Data
Finally, we can view a few rows of the training and testing datasets to understand their structure.

In [None]:

# preview the first rows of each set
print(f"Training data:\n\n{X_train.head()}")
print(f"\n\nTest data:\n\n{X_test.head()}")

Training data:

        hair  feathers      eggs      milk  airborne   aquatic  predator  \
23 -0.861034  2.012461  0.843721 -0.826640  1.791182 -0.744208 -1.115547   
40  1.161395 -0.496904  0.843721 -0.826640  1.791182 -0.744208 -1.115547   
7  -0.861034 -0.496904  0.843721 -0.826640 -0.558291  1.343710 -1.115547   
98  1.161395 -0.496904 -1.185227  1.209717 -0.558291 -0.744208  0.896421   
4   1.161395 -0.496904 -1.185227  1.209717 -0.558291 -0.744208  0.896421   

     toothed  backbone  breathes  ...      fins      tail  domestic   catsize  \
23 -1.234909   0.46569  0.512348  ... -0.449868  0.588784 -0.384353  1.138180   
40 -1.234909  -2.14735  0.512348  ... -0.449868 -1.698416 -0.384353 -0.878595   
7   0.809776   0.46569 -1.951800  ...  2.222876  0.588784  2.601775 -0.878595   
98  0.809776   0.46569  0.512348  ... -0.449868  0.588784 -0.384353  1.138180   
4   0.809776   0.46569  0.512348  ... -0.449868  0.588784 -0.384353  1.138180   

      legs_0    legs_2    legs_4  legs_5

# Summary of Zoo Dataset Analysis

## Introduction
The analysis explores the Zoo dataset from the UCI Machine Learning Repository, which contains 101 animals described by 17 attributes. The goal is to understand the relationships between these features and predict animal classifications.

## 1. Installation of Libraries
The `ucimlrepo` library is installed with pip, and `pandas`, `numpy`, and `scikit-learn` are imported to facilitate data handling and analysis.

## 2. Data Fetching
The Zoo dataset is fetched using `fetch_ucirepo`, providing the data and its metadata.

## 3. Data Preparation
The dataset is divided into features (X) and the target variable (y). Key metadata insights reveal the number of instances and attributes, guiding the analysis.

## 4. Variable Information
The report outlines the attributes, highlighting binary and categorical variables that need preprocessing.

## 5. Preprocessing Steps
1. **One-Hot Encoding**: The categorical variable `legs` is converted into binary columns.
2. **Standardization**: Features are standardized for better algorithm performance.
3. **Label Encoding**: The target variable `type` is encoded numerically.
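
The first two steps can also be expressed as a single scikit-learn object. The sketch below is one possible arrangement rather than the approach used in the notebook: a `ColumnTransformer` one-hot encodes `legs` and passes the other columns through, and a `StandardScaler` then standardizes every resulting column. The frame `toy` contains hypothetical rows, not the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows standing in for the Zoo features
toy = pd.DataFrame({
    "hair": [1, 0, 1, 0],
    "eggs": [0, 1, 0, 1],
    "legs": [4, 2, 0, 4],
})

preprocess = Pipeline([
    # one-hot encode legs, pass the binary columns through unchanged
    ("encode", ColumnTransformer(
        [("legs", OneHotEncoder(handle_unknown="ignore"), ["legs"])],
        remainder="passthrough",
        sparse_threshold=0,  # always return a dense array
    )),
    # standardize every resulting column
    ("scale", StandardScaler()),
])

transformed = preprocess.fit_transform(toy)
print(transformed.shape)
```

Bundling the steps this way lets the same fitted transformations be applied to new rows with `preprocess.transform`.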

## 6. Dataset Splitting
The dataset is shuffled and split into training (80%) and testing (20%) sets, with shapes verified.

## 7. Sample Data Display
Sample rows from both training and testing datasets are displayed to review their structure.

## Conclusion
The preprocessing steps prepare the data for further analysis and modeling. Future steps will include training machine learning models, evaluating their performance, and visualizing results.
