# Machine Learning

# Task: Supervised Learning - SVM

# Oscar Andre Dorantes Victor

# IRC 9B

# 10/30/2023

In [20]:
#Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

In [21]:
data = pd.read_csv("/content/kaggledino.csv")

#Remove rows with missing 'Diet'
data = data.dropna(subset=['Diet'])

#Label encode categorical columns, keeping each fitted encoder for later decoding
label_encoders = {}
for column in ['What Dinosaurs Eat', 'Country', 'Geological Time Period']:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le



1.   Load the Dataset: The read_csv function from the pandas library loads the dataset from the given path into a DataFrame called data.

2.   Remove Missing Values: The dropna function removes rows where the 'Diet' column is missing. This ensures every remaining row has a target label; missing values in other columns are not handled here.

3.   Label Encoding: The categorical columns 'What Dinosaurs Eat', 'Country', and 'Geological Time Period' are encoded as integers, making them usable by machine learning algorithms. Each fitted encoder is stored in the label_encoders dictionary so the integer codes can be mapped back to the original labels later.
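
The label-encoding step can be sketched on a small toy DataFrame. The rows below are made up for illustration and are not taken from kaggledino.csv; the sketch only shows how a fitted LabelEncoder assigns integer codes and how a stored encoder can map them back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the dinosaur data; the values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["USA", "China", "USA", "Argentina"],
    "Geological Time Period": ["Jurassic", "Cretaceous", "Cretaceous", "Triassic"],
})

label_encoders = {}
for column in ["Country", "Geological Time Period"]:
    le = LabelEncoder()
    toy[column] = le.fit_transform(toy[column])
    label_encoders[column] = le  # keep the fitted encoder for later decoding

# LabelEncoder assigns codes following the sorted order of the class labels.
country_codes = toy["Country"].tolist()
decoded = label_encoders["Country"].inverse_transform(toy["Country"].to_numpy()).tolist()

print(country_codes)  # integer codes for the Country column
print(decoded)        # original labels recovered from the codes
```

Keeping the encoders around is mainly useful when predictions or feature values need to be reported in their original, human-readable form.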








In [22]:
#Define features and target
features = ['Lat', 'Lng', 'What Dinosaurs Eat', 'Country', 'Geological Time Period', 'Max Ma', 'Min Ma']
X = data[features]
y = data['Diet']

#Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



1.   Define Features and Target: The features (input variables) and the target (output variable) for the machine learning model are defined. The columns specified in the features list are used as input features, and the 'Diet' column is the target variable.

2.   Data Splitting: The train_test_split function divides the dataset into training and testing sets: 80% of the rows are used for training and the remaining 20% (test_size=0.2) are held out for testing. Setting random_state=42 makes the split reproducible.

3.   Feature Standardization: The features are standardized to zero mean and unit variance with StandardScaler. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training. This step matters for algorithms like SVM that are sensitive to the scale of input features.
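
The split-then-scale order above can be checked on synthetic data. This is a minimal sketch with made-up features (not the dinosaur columns): it shows that the training set ends up with zero mean and unit variance, while the test set is transformed with the training statistics and so is only approximately centered.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (made-up data).
X = np.column_stack([rng.normal(50, 10, 200), rng.normal(0.5, 0.1, 200)])
y = rng.integers(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_scaled = scaler.transform(X_test)        # same statistics reused on the test rows

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print(X_train.shape, X_test.shape)
print(train_means.round(6), train_stds.round(6))
print(test_means.round(3))  # near 0, but generally not exactly 0
```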



In [24]:
#Initialize and train SVM with RBF kernel
svm_classifier_rbf = SVC(kernel='rbf', C=1)
svm_classifier_rbf.fit(X_train_scaled, y_train)



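1.   SVM Initialization: An SVM classifier with a Radial Basis Function (RBF) kernel is initialized. The regularization parameter C is set to 1, which is also the scikit-learn default; the kernel coefficient gamma is left at its default value, 'scale'.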

2.   Training: The classifier is trained using the standardized training data.
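
To make the RBF kernel concrete, the sketch below computes it by hand on synthetic data and compares the result with scikit-learn's rbf_kernel. The data, the target, and the helper function rbf are assumptions made for illustration; the gamma value mirrors what gamma='scale' corresponds to for dense input.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))              # synthetic, roughly standardized features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # made-up binary target

# gamma='scale' corresponds to 1 / (n_features * X.var()).
gamma = 1.0 / (X.shape[1] * X.var())

def rbf(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two vectors (illustrative helper)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

manual = np.array([[rbf(a, b, gamma) for b in X] for a in X])
reference = rbf_kernel(X, X, gamma=gamma)

clf = SVC(kernel="rbf", C=1).fit(X, y)

print(np.allclose(manual, reference))
print(len(clf.support_), "support vectors out of", len(X))
```

The kernel value is 1 for identical points and decays toward 0 as points move apart, which is why feature scale directly affects what the classifier treats as "similar".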



In [25]:
#Predict on test data
y_pred_rbf = svm_classifier_rbf.predict(X_test_scaled)



1.   Prediction: The trained SVM classifier predicts labels for the standardized test data. The predictions are stored in y_pred_rbf.
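
One alternative worth noting, not used in this notebook, is to wrap the scaler and the classifier in a scikit-learn pipeline so that predict applies the scaling automatically. The sketch below runs on synthetic data from make_classification and checks that the pipeline gives the same predictions as the manual scale-then-predict steps.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 7 features (an assumption, not the dinosaur dataset).
X, y = make_classification(n_samples=200, n_features=7, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline: scaling is fitted on the training data inside fit and reused inside predict.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Manual equivalent of the same two steps.
scaler = StandardScaler().fit(X_train)
manual = SVC(kernel="rbf", C=1).fit(scaler.transform(X_train), y_train)
y_pred_manual = manual.predict(scaler.transform(X_test))

print((y_pred == y_pred_manual).all())
```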




In [28]:
# Function to display results
def display_results(y_true, y_pred):
    report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

    print("Classification Results:\n")
    for label, metrics in report.items():
        if label in ['accuracy', 'macro avg', 'weighted avg']:
            continue
        print(f"Class: {label}")
        print(f"  - Precision (True Positives / (True Positives + False Positives)): {metrics['precision']:.2f}")
        print(f"  - Recall (True Positives / (True Positives + False Negatives)): {metrics['recall']:.2f}")
        print(f"  - F1-Score (Harmonic Mean of Precision and Recall): {metrics['f1-score']:.2f}")
        print(f"  - Support (Number of actual occurrences of the class): {metrics['support']}")
        print("\n")

    print(f"Overall Accuracy (Correct Predictions / Total Predictions): {report['accuracy']:.2f}")

# Display results
display_results(y_test, y_pred_rbf)

Classification Results:

Class: carnivore
  - Precision (True Positives / (True Positives + False Positives)): 1.00
  - Recall (True Positives / (True Positives + False Negatives)): 1.00
  - F1-Score (Harmonic Mean of Precision and Recall): 1.00
  - Support (Number of actual occurrences of the class): 212


Class: carnivore, omnivore
  - Precision (True Positives / (True Positives + False Positives)): 0.57
  - Recall (True Positives / (True Positives + False Negatives)): 0.81
  - F1-Score (Harmonic Mean of Precision and Recall): 0.67
  - Support (Number of actual occurrences of the class): 16


Class: herbivore
  - Precision (True Positives / (True Positives + False Positives)): 1.00
  - Recall (True Positives / (True Positives + False Negatives)): 1.00
  - F1-Score (Harmonic Mean of Precision and Recall): 1.00
  - Support (Number of actual occurrences of the class): 243


Class: herbivore, omnivore
  - Precision (True Positives / (True Positives + False Positives)): 0.00
  - Recall (T



1.   Function Definition: The display_results function takes the true labels (y_true) and the predicted labels (y_pred). It obtains precision, recall, F1-score, and support for each class from classification_report, requested as a dictionary (output_dict=True). Setting zero_division=0 reports a metric as 0 instead of raising a warning when its denominator is zero, for example when a class is never predicted.

2.   Displaying Metrics: The function is then called with the true test labels and the predicted labels. The metrics are printed in a formatted manner for each class, with the aggregate rows (accuracy, macro avg, weighted avg) skipped in the loop, followed by the overall accuracy.



**Classification Explanation:**


*   Accuracy: This is the ratio of the number of correct predictions to the total number of predictions. In my case, it's 97.46%, which means the model correctly predicted the diet of the dinosaur for 97.46% of the test samples.

*   Precision (for a class): It's the ratio of true positive predictions to the sum of true positive and false positive predictions for that class. In simpler words, out of all the samples that the model predicted to be of a certain class, how many were actually of that class?

*   Recall (for a class): It's the ratio of true positive predictions to the sum of true positive and false negative predictions for that class. Essentially, out of all the samples that were actually of a certain class, how many did the model correctly predict?

*   F1-score (for a class): It's the harmonic mean of precision and recall and gives a balance between the two. If either precision or recall is low for a class, the F1-score will also be low.

*   Support: It's the number of actual occurrences of the class in the true labels being evaluated (here, the test set).

*   Macro avg: This is the unweighted mean of a metric across classes. Every class counts equally regardless of its support, so poor performance on a rare class lowers it noticeably.

*   Weighted avg: This is the mean of a metric across classes, weighted by each class's support, so frequent classes dominate the result.
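
The definitions above can be worked through by hand on a tiny example. The labels below are made up (three classes, ten samples) and the helper per_class is hypothetical; the sketch assumes every class is predicted at least once, so no division by zero occurs. The hand-computed values are then compared with classification_report.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up labels for illustration: 4 carnivores, 4 herbivores, 2 omnivores.
y_true = np.array(["carnivore"] * 4 + ["herbivore"] * 4 + ["omnivore"] * 2)
y_pred = np.array(["carnivore", "carnivore", "carnivore", "herbivore",
                   "herbivore", "herbivore", "herbivore", "herbivore",
                   "omnivore", "carnivore"])

def per_class(label):
    """Return (precision, recall, f1, support) for one class (illustrative helper)."""
    tp = np.sum((y_true == label) & (y_pred == label))
    fp = np.sum((y_true != label) & (y_pred == label))
    fn = np.sum((y_true == label) & (y_pred != label))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, tp + fn

labels = ["carnivore", "herbivore", "omnivore"]
rows = {label: per_class(label) for label in labels}
precisions = np.array([rows[label][0] for label in labels])
supports = np.array([rows[label][3] for label in labels])

accuracy = np.mean(y_true == y_pred)
macro_precision = precisions.mean()                             # every class counts equally
weighted_precision = np.average(precisions, weights=supports)   # weighted by support

report = classification_report(y_true, y_pred, output_dict=True)
for label in labels:
    print(label, [round(float(v), 2) for v in rows[label]])
print(round(float(accuracy), 2), round(float(macro_precision), 2), round(float(weighted_precision), 2))
```

In this toy case the macro average of precision is higher than the weighted average because the smallest class happens to have the highest precision.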



