# Homework 2
 - Using the PyMFE library to record meta-information about the dataset.


---

# Soybean Dataset Data Card

## Dataset Overview
The Soybean dataset is a widely used benchmark for multi-class classification. It contains 683 instances described by 35 attributes (excluding the target class). Each instance is labelled with one of 19 classes, each representing a type of soybean disease.

### Basic Dataset Features:
- **Number of Instances:** 683
- **Number of Attributes:** 35
- **Number of Classes:** 19

## Meta-Features Extracted
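Meta-features describe characteristics of the dataset itself and can serve as inputs to meta-learning tasks such as algorithm selection. Where a meta-feature yields several values (for example, one per attribute), PyMFE summarises them as a mean (`.mean`) and a standard deviation (`.sd`). The extracted values, shown to three decimal places, are listed below: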

| **Feature**                  | **Value**                      |
|------------------------------|--------------------------------|
| num_instances                | 683                            |
| num_attributes               | 35                             |
| num_classes                  | 19                             |
| attr_conc.mean               | 0.097                          |
| attr_conc.sd                 | 0.155                          |
| best_node.mean               | 0.261                          |
| best_node.sd                 | 0.006                          |
| can_cor.mean                 | 0.945                          |
| can_cor.sd                   | 0.088                          |
| class_conc.mean              | 0.553                          |
| class_conc.sd                | 0.292                          |
| class_ent                    | 3.836                          |
| cor.mean                     | 0.145                          |
| cor.sd                       | 0.151                          |
| eigenvalues.mean             | 0.145                          |
| eigenvalues.sd               | 0.368                          |
| freq_class.mean              | 0.053                          |
| freq_class.sd                | 0.044                          |
| gravity                      | 5.676                          |
| kurtosis.mean                | 12.651                         |
| kurtosis.sd                  | 68.915                         |
| leaves                       | 72                             |
| leaves_branch.mean           | 10.389                         |
| leaves_branch.sd             | 3.899                          |
| leaves_per_class.mean        | 0.053                          |
| leaves_per_class.sd          | 0.066                          |
| max.mean                     | 1.0                            |
| median.mean                  | 0.263                          |
| median.sd                    | 0.442                          |
| min.mean                     | 0.0                            |
| min.sd                       | 0.0                            |
| nodes                        | 71                             |
| nodes_per_attr               | 0.717                          |
| one_nn.mean                  | 0.924                          |
| one_nn.sd                    | 0.053                          |
| random_node.mean             | 0.198                          |
| random_node.sd               | 0.029                          |
| tree_depth.mean              | 9.469                          |
| tree_depth.sd                | 4.036                          |
| var.mean                     | 0.145                          |
| var.sd                       | 0.076                          |
| worst_node.mean              | 0.132                          |
| worst_node.sd                | 0.001                          |
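
A few entries in the table can be sanity-checked by hand from the basic counts. The sketch below is only a plausibility check and rests on assumptions about how the values were computed: that `freq_class` and `leaves_per_class` are per-class proportions summing to 1, that `class_ent` is Shannon entropy in bits, and that the model-based features describe a binary decision tree.

```python
import math

# Values copied from the table above
num_classes = 19
nodes, leaves = 71, 72
class_ent = 3.836

# Proportions over 19 classes sum to 1, so their mean is 1/19.
# This matches freq_class.mean and leaves_per_class.mean.
mean_class_proportion = 1 / num_classes
print(round(mean_class_proportion, 3))  # 0.053

# Entropy in bits is largest for a uniform class distribution: log2(19)
max_class_ent = math.log2(num_classes)
print(round(max_class_ent, 3))  # 4.248

# Ratio of observed to maximum entropy (1.0 would mean perfectly balanced)
print(round(class_ent / max_class_ent, 3))  # 0.903

# A binary tree has exactly one more leaf than it has internal nodes
print(leaves == nodes + 1)  # True
```

Under these assumptions the class entropy is about 90% of its maximum, which suggests that the 19 classes are unevenly sized but that no single class dominates.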

### Note:
Some meta-features, such as `attr_ent.mean`, `attr_ent.sd`, `cat_to_num`, and `mut_inf.mean`, came back as NaN for this dataset and are omitted from the table.
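
One plausible cause of the NaN entries, offered here as an assumption and not a confirmed diagnosis, is missing values. liac-arff loads the ARFF `?` marker as `None`, and ordering `None` against a string raises a `TypeError`. The sketch below reproduces that failure on a made-up column and shows one possible workaround, which treats missing values as their own category. The column contents and the `"missing"` label are illustrative only.

```python
# Hypothetical categorical column with one missing value (None)
column = ["norm", None, "lt-norm", "norm"]

try:
    sorted(column)
    sort_failed = False
except TypeError:
    # None and str cannot be ordered with '<'
    sort_failed = True

# Workaround sketch: turn missing values into an explicit category
cleaned = ["missing" if value is None else value for value in column]
categories = sorted(set(cleaned))

print(sort_failed)
print(categories)
```

With a DataFrame, `df.fillna("missing")` would apply the same idea to every column before the data is passed to PyMFE. Whether doing so changes the extracted values has not been checked here.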

## Code to Extract Features
The following code was used to load the Soybean dataset, extract basic dataset features, and compute meta-features using the PyMFE library:


```python
import arff  # provided by the liac-arff package
import pandas as pd
from pymfe.mfe import MFE

# Load the ARFF dataset using liac-arff
with open('../../dataset_42_soybean.arff', 'r') as f:
    dataset = arff.load(f)

# Convert to a DataFrame; each ARFF attribute is a (name, type) pair
columns = [attr[0] for attr in dataset['attributes']]
df = pd.DataFrame(dataset['data'], columns=columns)

# Extract basic dataset features
num_instances = df.shape[0]
num_attributes = df.shape[1] - 1  # Subtract 1 for the target class column
num_classes = df['class'].nunique()

# Separate features and target
X = df.drop('class', axis=1)
y = df['class']

# Initialize and fit the MFE model
mfe = MFE(groups=["general", "statistical", "info-theory", "model-based", "landmarking"])
mfe.fit(X.values, y.values)

# Extract meta-features; extract() returns the names and the values as two lists
names, values = mfe.extract(suppress_warnings=True)

# Convert the meta-features to a DataFrame of Feature/Value pairs
meta_features_df = pd.DataFrame({'Feature': names, 'Value': values})

# Create a DataFrame for basic features
basic_features = pd.DataFrame({
    'Feature': ['num_instances', 'num_attributes', 'num_classes'],
    'Value': [num_instances, num_attributes, num_classes]
})

# Concatenate basic features with meta-features
all_features_df = pd.concat([basic_features, meta_features_df], ignore_index=True)

# Save the DataFrame to a CSV file
all_features_df.to_csv('soybean_all_features.csv', index=False)

# Optional: Print the path to confirm where the file was saved
print("All features saved to 'soybean_all_features.csv'")
```

Output:

```
TypeError("'<' not supported between instances of 'NoneType' and 'str'").

All features saved to 'soybean_all_features.csv'
```

## File Output
The extracted features are stored in the CSV file `soybean_all_features.csv`. This file contains both the basic dataset characteristics and the extracted meta-features, formatted as `Feature, Value` pairs.
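
Because every row is a `Feature, Value` pair, the file is easy to read back. The sketch below uses an in-memory stand-in for `soybean_all_features.csv`, with three rows copied from the table above and one empty value. It assumes that NaN meta-features were written as empty cells, which is the pandas `to_csv` default.

```python
import csv
import io

# In-memory stand-in for soybean_all_features.csv (illustrative rows only)
csv_text = """Feature,Value
num_instances,683
num_classes,19
class_ent,3.836
attr_ent.mean,
"""

features = {}  # meta-features that have a value
missing = []   # meta-features written as empty cells (NaN)

for row in csv.DictReader(io.StringIO(csv_text)):
    if row["Value"] == "":
        missing.append(row["Feature"])
    else:
        features[row["Feature"]] = float(row["Value"])

print(features["num_instances"])
print(missing)
```

To read the real file, replace `io.StringIO(csv_text)` with a handle from `open('soybean_all_features.csv', newline='')`.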

---

This data card summarises the Soybean dataset, its extracted meta-features, and the code used to produce them.