# **Master of Data Science Notebook**

## Analyzing the dataset for my research on **Preserving and reviving the Oshiwambo language through hybrid SVM-Deep Learning Models**, using a Python Jupyter notebook.



### Architecture Flow Chart for Hybrid Model:

1. **Data Input**:
   - Input: Raw text data from multiple columns (e.g., `Oshiwambo`, `Aa-ndonga`, etc.)
   
2. **Data Cleaning**:
   - Tokenization
   - Punctuation Removal
   - Stemming (applied only to the `Oshiwambo` column)
   - Output: Cleaned tokenized text data
   
3. **One-Hot Encoding** / **Term Frequency-Inverse Document Frequency** / **Word embedding**:
   - Convert cleaned text into numerical format (one-hot encoding, TF-IDF, or word embeddings)
   - Output: Encoded feature matrix

4. **Feature Extraction**:
   - **Path 1: LSTM Model**:
     - LSTM is used to extract sequential patterns from the encoded data.
     - Output: LSTM feature vector
   - **Path 2: CNN Model**:
     - CNN is used to capture local features from the encoded text data.
     - Output: CNN feature vector

5. **SVM Classification**:
   - SVM is trained using features from LSTM/CNN to categorize language patterns.
   - Output: Classified language patterns (e.g., dialects)

6. **Model Fusion**:
   - Combine the outputs of the SVM and LSTM/CNN models.
   - Output: Final prediction based on the fusion of both models.

### Labels:
- Data Cleaning → Preprocessing → Feature Extraction → Classification → Model Fusion → Final Output
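
Step 6 does not say how the two outputs are combined. One common option is late fusion: a weighted average of the class-probability vectors produced by each model. The sketch below assumes both models emit probabilities over the same classes; `fuse_probabilities` and the weight `w_svm` are hypothetical names used for illustration, not part of this notebook.

```python
import numpy as np

def fuse_probabilities(p_svm, p_deep, w_svm=0.5):
    """Late fusion: weighted average of two (n_samples, n_classes) probability arrays."""
    p_svm = np.asarray(p_svm, dtype=float)
    p_deep = np.asarray(p_deep, dtype=float)
    if p_svm.shape != p_deep.shape:
        raise ValueError("both models must score the same samples and classes")
    fused = w_svm * p_svm + (1.0 - w_svm) * p_deep
    return fused, fused.argmax(axis=1)

# Toy probabilities for 2 samples and 3 classes (made-up numbers).
p_svm = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
p_deep = [[0.4, 0.5, 0.1], [0.1, 0.2, 0.7]]
fused, labels = fuse_probabilities(p_svm, p_deep, w_svm=0.5)
print(fused)
print(labels)
```

With scikit-learn's `SVC`, class probabilities are only available when the model is created with `probability=True`; the weight itself would normally be tuned on a validation set.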


## **Step 1: Load necessary libraries and your CSV file**

In [25]:
# Step 1: Load necessary libraries and CSV file
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence
import string
import nltk

# Download the NLTK tokenizer data used by word_tokenize
nltk.download('punkt')      # Punkt tokenizer models
nltk.download('punkt_tab')  # Tabular Punkt data, required by word_tokenize in newer NLTK releases

# Load your CSV file
file_path = '/content/sample_data/Thesis_Dataset - Sheet(11).CSV'
df = pd.read_csv(file_path)

# Strip any extra spaces from the column names
df.columns = df.columns.str.strip()

# Display the column names
print("Columns in DataFrame after stripping spaces:")
print(df.columns)

Columns in DataFrame after stripping spaces:
Index(['Oshiwambo', 'Aa-ndonga', 'Aa-kwambi', 'Aa-mbalanhu', 'Aa-kwaluudhi',
       'Aa-kwanyama', 'Aa-ngandjera', 'Aa-mbandja'],
      dtype='object')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [26]:
# List of columns to be processed
columns_to_process = [
    'Oshiwambo', 'Aa-ndonga', 'Aa-kwambi', 'Aa-mbalanhu',
    'Aa-kwaluudhi', 'Aa-kwanyama', 'Aa-ngandjera', 'Aa-mbandja',
]

# Function to tokenize and remove punctuation (no stemming)
def clean_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Keep purely alphabetic tokens only; this drops punctuation, digits,
    # and any token that contains a hyphen or apostrophe
    tokens = [word for word in tokens if word.isalpha()]
    return ' '.join(tokens)

# Apply the cleaning function (tokenize and remove punctuation) to each specified column
for column in columns_to_process:
    if column in df.columns:
        df[column] = df[column].astype(str).apply(clean_text)
    else:
        print(f"Column '{column}' not found in DataFrame.")

# Display the first few rows of the cleaned DataFrame (without stemming)
print(df.head())


  Oshiwambo Aa-ndonga  Aa-kwambi Aa-mbalanhu Aa-kwaluudhi Aa-kwanyama  \
0       Ame     Ngame      Ngaye        Aame         Amee         Ame   
1       Ove     Ngoye      Ngwee         Oye          Oye         Ove   
2                  Ye         Ye          Ye           Ye          Ye   
3      Fyee       Tse         Se          Se           Se         Fye   
4    Amushe        Ne  Ne amushe         Nye        Amuhe         Nye   

  Aa-ngandjera Aa-mbandja  
0        Ngaye        Ame  
1        Ngwee        ove  
2           Ye             
3          Tse             
4           Ne             


## **Step 2: Apply stemming to the "Oshiwambo" column**

In [28]:
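# Apply stemming to the "Oshiwambo" column and replace the original column with stemmed text
# Note: PorterStemmer implements English suffix-stripping rules and lowercases its input,
# so on Oshiwambo words it acts as a rough normalizer (e.g. "Amushe" -> "amush" in the output below)
# Import necessary libraries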
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
import pandas as pd

# Initialize the Porter Stemmer
porter = PorterStemmer()

# Function to stem words in the "Oshiwambo" column
def stem_words(text):
    # Assume the text is already tokenized, just apply stemming
    stemmed = [porter.stem(word) for word in text.split()]  # Apply stemming to tokenized words
    return ' '.join(stemmed)

# Apply stemming to the "Oshiwambo" column and replace the original column with stemmed text
df['Oshiwambo'] = df['Oshiwambo'].astype(str).apply(stem_words)

# Display the first few rows of the DataFrame with the replaced "Oshiwambo" column
print(df.head())


  Oshiwambo Aa-ndonga  Aa-kwambi Aa-mbalanhu Aa-kwaluudhi Aa-kwanyama  \
0       ame     Ngame      Ngaye        Aame         Amee         Ame   
1       ove     Ngoye      Ngwee         Oye          Oye         Ove   
2                  Ye         Ye          Ye           Ye          Ye   
3      fyee       Tse         Se          Se           Se         Fye   
4     amush        Ne  Ne amushe         Nye        Amuhe         Nye   

  Aa-ngandjera Aa-mbandja  
0        Ngaye        Ame  
1        Ngwee        ove  
2           Ye             
3          Tse             
4           Ne             


## **Step 3: One-Hot Encoding**

In [29]:
# Step 3: One-Hot Encoding
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# List of columns to be one-hot encoded
columns_to_encode =  [
    'Oshiwambo', 'Aa-ndonga', 'Aa-kwambi', 'Aa-mbalanhu', 'Aa-kwaluudhi', 'Aa-kwanyama', 'Aa-ngandjera', 'Aa-mbandja']

# Initialize a set to store unique words across all columns
unique_words = set()

# Extract unique words from each column
for column in columns_to_encode:
    df[column] = df[column].astype(str)
    for text in df[column]:
        words_in_text = text_to_word_sequence(text)
        unique_words.update(words_in_text)

# Estimate the vocabulary size
vocab_size = len(unique_words)
print(f"Vocabulary Size: {vocab_size}")

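# Encode each column with Keras `one_hot`. Despite its name, it hashes every word
# to an integer index smaller than n and returns a list of indices per text, so
# distinct words can collide. Using n = 1.3 * vocab_size leaves headroom to reduce collisions.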
encoded_columns = {}
for column in columns_to_encode:
    encoded_column = [one_hot(text, round(vocab_size * 1.3)) for text in df[column]]
    encoded_columns[column] = encoded_column

# Convert the encoded columns back into a DataFrame
encoded_df = pd.DataFrame(encoded_columns)

# Display the first few rows of the one-hot encoded DataFrame
print(encoded_df.head())


Vocabulary Size: 1681
  Oshiwambo Aa-ndonga    Aa-kwambi Aa-mbalanhu Aa-kwaluudhi Aa-kwanyama  \
0    [1808]    [1697]       [2066]       [937]       [1020]      [1808]   
1     [828]    [1185]       [1014]      [2154]       [2154]       [828]   
2        []    [1723]       [1723]      [1723]       [1723]      [1723]   
3     [805]    [1735]       [1785]      [1785]       [1785]      [1880]   
4     [326]     [257]  [257, 2086]      [1650]        [490]      [1650]   

  Aa-ngandjera Aa-mbandja  
0       [2066]     [1808]  
1       [1014]      [828]  
2       [1723]         []  
3       [1735]         []  
4        [257]         []  
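
Because Keras's `one_hot` returns hashed integer indices, the output above is a list of word IDs rather than a matrix of 0/1 vectors. For comparison, the sketch below builds an explicit vocabulary and a literal one-hot matrix with NumPy. The helper names `build_vocab` and `one_hot_rows` are hypothetical, and the whitespace tokenization is a simplifying assumption.

```python
import numpy as np

def build_vocab(texts):
    """Map each distinct lowercase word to a unique integer ID, starting at 1 (0 is kept for padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def one_hot_rows(text, vocab):
    """Return a (n_words, len(vocab) + 1) matrix with a single 1 per row."""
    ids = np.array([vocab[w] for w in text.lower().split()], dtype=int)
    matrix = np.zeros((len(ids), len(vocab) + 1), dtype=int)
    matrix[np.arange(len(ids)), ids] = 1
    return matrix

texts = ["Ame", "Ove", "Ne amushe"]  # sample words taken from the cleaned output earlier
vocab = build_vocab(texts)
print(vocab)
print(one_hot_rows("Ne amushe", vocab))
```

An explicit vocabulary guarantees one ID per word (no hash collisions), at the cost of having to store the mapping.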


**2. Using TF-IDF Encoding (Term Frequency-Inverse Document Frequency)**

Why use this?

TF-IDF gives more weight to words that are frequent within a row but rare across the dataset, and less weight to words that occur in most rows (in English text, words like "the" or "and").

It works well for traditional machine learning models like SVM.
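
To make the weighting concrete, here is a minimal hand-rolled sketch of TF-IDF. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which are the defaults of scikit-learn's `TfidfVectorizer`. The function name `tfidf`, the whitespace tokenization, and the toy rows are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalization."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return idf, rows

docs = ["ame ngame ngaye", "ove ngoye ngwee", "ye ye ye"]  # toy rows, not the real dataset
idf, rows = tfidf(docs)
print(idf["ame"])
print(rows[0])
print(rows[2])
```

Unlike this sketch, `TfidfVectorizer`'s default token pattern keeps only tokens of two or more characters, so single-letter words never enter its vocabulary.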

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine text from all columns
all_text = df[columns_to_encode].astype(str).apply(lambda x: ' '.join(x), axis=1)

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limit feature space to 5000 words
tfidf_matrix = tfidf_vectorizer.fit_transform(all_text)

# Get the vocabulary size
vocab_size = len(tfidf_vectorizer.vocabulary_)
print(f"Vocabulary Size: {vocab_size}")

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display the first few rows
print(tfidf_df.head())


Vocabulary Size: 1680
   aalongwa      aame  aamushanga  aamwahinathana  aamwaina  aamwainafana  \
0       0.0  0.259776         0.0             0.0       0.0           0.0   
1       0.0  0.000000         0.0             0.0       0.0           0.0   
2       0.0  0.000000         0.0             0.0       0.0           0.0   
3       0.0  0.000000         0.0             0.0       0.0           0.0   
4       0.0  0.000000         0.0             0.0       0.0           0.0   

   aamwayinafana  aamwayinathana  aanasikola  aanona  ...  yomepya  yomomeva  \
0            0.0             0.0         0.0     0.0  ...      0.0       0.0   
1            0.0             0.0         0.0     0.0  ...      0.0       0.0   
2            0.0             0.0         0.0     0.0  ...      0.0       0.0   
3            0.0             0.0         0.0     0.0  ...      0.0       0.0   
4            0.0             0.0         0.0     0.0  ...      0.0       0.0   

   yomomeya  yomuzizimba  yongob  

**3. Using Word Embeddings (TensorFlow Embedding Layer)**

Why use this?

Embeddings allow the model to learn contextual meaning rather than treating words as independent categories.

This method is scalable and works well with deep learning architectures like CNNs and LSTMs.
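
Conceptually, an embedding layer is a trainable lookup table: each integer token ID selects one row of a matrix. The NumPy sketch below shows only that lookup, with a fixed random matrix standing in for learned weights. The sizes and the hand-zeroed padding row are assumptions for illustration, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4          # hypothetical sizes for illustration
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
embedding_matrix[0] = 0.0              # row 0 zeroed by hand for the padding ID

# Two padded sequences of token IDs, in the style of pad_sequences(..., padding='post').
token_ids = np.array([[3, 7, 0], [5, 0, 0]])

# The lookup: every ID is replaced by its row of the matrix.
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (2, 3, 4)
```

In Keras the equivalent layer is `Embedding(input_dim, output_dim)`, placed as the first layer so that the CNN or LSTM receives dense vectors instead of raw token IDs.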

In [33]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize the tokenizer and fit it on the text of every column
# (it must be fitted before texts_to_sequences is called below)
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df[columns_to_encode].astype(str).values.flatten())

# Convert text to sequences
encoded_columns = {}
for column in columns_to_encode:
    sequences = tokenizer.texts_to_sequences(df[column].astype(str))
    padded_sequences = pad_sequences(sequences, padding='post')  # Pad to the longest sequence in this column
    encoded_columns[column] = list(padded_sequences)

# Convert encoded columns into DataFrame
encoded_df = pd.DataFrame(encoded_columns)

# Display the first few rows
print(encoded_df.head())

# Vocabulary size used in the embedding layer
embedding_vocab_size = len(tokenizer.word_index) + 1  # Add 1 for padding token
print(f"Embedding Vocabulary Size: {embedding_vocab_size}")


                 Oshiwambo             Aa-ndonga            Aa-kwambi  \
0  [307, 0, 0, 0, 0, 0, 0]  [576, 0, 0, 0, 0, 0]    [577, 0, 0, 0, 0]   
1  [308, 0, 0, 0, 0, 0, 0]  [877, 0, 0, 0, 0, 0]    [578, 0, 0, 0, 0]   
2    [0, 0, 0, 0, 0, 0, 0]  [140, 0, 0, 0, 0, 0]    [140, 0, 0, 0, 0]   
3  [878, 0, 0, 0, 0, 0, 0]  [580, 0, 0, 0, 0, 0]    [309, 0, 0, 0, 0]   
4  [880, 0, 0, 0, 0, 0, 0]  [423, 0, 0, 0, 0, 0]  [423, 881, 0, 0, 0]   

         Aa-mbalanhu    Aa-kwaluudhi        Aa-kwanyama    Aa-ngandjera  \
0  [875, 0, 0, 0, 0]  [876, 0, 0, 0]  [307, 0, 0, 0, 0]  [577, 0, 0, 0]   
1  [579, 0, 0, 0, 0]  [579, 0, 0, 0]  [308, 0, 0, 0, 0]  [578, 0, 0, 0]   
2  [140, 0, 0, 0, 0]  [140, 0, 0, 0]  [140, 0, 0, 0, 0]  [140, 0, 0, 0]   
3  [309, 0, 0, 0, 0]  [309, 0, 0, 0]  [879, 0, 0, 0, 0]  [580, 0, 0, 0]   
4  [581, 0, 0, 0, 0]  [882, 0, 0, 0]  [581, 0, 0, 0, 0]  [423, 0, 0, 0]   

             Aa-mbandja  
0  [307, 0, 0, 0, 0, 0]  
1  [308, 0, 0, 0, 0, 0]  
2    [0, 0, 0, 0, 0, 0]  
3    [

**Convert to NumPy array** and **Split into training and testing sets**

🧠 Summary of Why Each Step Matters:

| Step | Purpose |
| --- | --- |
| 1. Convert to Sequences | Convert text into numbers the model can understand |
| 2. Pad Sequences | Ensure all input/output sequences are of the same length |
| 3. Convert to NumPy Arrays | Required format for TensorFlow/Keras models |
| 4. Split into Train/Test Sets | Hold out unseen data to detect overfitting and evaluate how well the model generalizes |

**Tokenization & Word Embeddings (Using TensorFlow)**

Why use this?

Converts text into numeric format for training CNN and LSTM models.

Produces integer sequences that an embedding layer can map to dense vectors, rather than sparse one-hot vectors.

In [49]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np

# Define the target ('Oshiwambo' column is the target)
target_column = 'Oshiwambo'

# Convert the target text into sequences of word indices
# (one index per word, so each row of y is a token sequence rather than a single class ID)
tokenizer_labels = Tokenizer()
tokenizer_labels.fit_on_texts(df[target_column].astype(str))
y = tokenizer_labels.texts_to_sequences(df[target_column].astype(str))

# Pad sequences in y to ensure uniform length
max_length = max(len(seq) for seq in y)  # Longest target sequence, in tokens
y = pad_sequences(y, maxlen=max_length, padding='post', truncating='post')

# Convert to NumPy array; the shape is (samples, max_length)
y = np.array(y)

# Encode text columns (features)
tokenizer = Tokenizer(num_words=5000)  # Limit vocabulary size
tokenizer.fit_on_texts(df[columns_to_encode].astype(str).values.flatten())

# Convert text to sequences
encoded_columns = []
for column in columns_to_encode:
    sequences = tokenizer.texts_to_sequences(df[column].astype(str))
    padded_sequences = pad_sequences(sequences, padding='post', maxlen=50)  # Ensure same sequence length
    encoded_columns.append(padded_sequences)

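# Stack all features together: 8 columns x 50 positions = 400 features per row
# Note: columns_to_encode includes the target column 'Oshiwambo', so X also contains the target text
X = np.hstack(encoded_columns)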

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Data Shape: {X_train.shape}, Testing Data Shape: {X_test.shape}")

Training Data Shape: (456, 400), Testing Data Shape: (114, 400)
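
The cell above stores `y` as padded token-index sequences, so `y.shape[1]` is the padded sequence length rather than a number of classes. If the intended target is one class per row (for example a dialect name or a distinct word), a minimal sketch of integer label encoding with NumPy could look like this. The helper `encode_labels` and the toy labels are hypothetical.

```python
import numpy as np

def encode_labels(labels):
    """Map each distinct label string to an integer class ID in [0, n_classes)."""
    classes, class_ids = np.unique(np.asarray(labels, dtype=str), return_inverse=True)
    return classes, class_ids

labels = ["ame", "ove", "fyee", "ame", "ove"]  # toy labels, repeated for illustration
classes, y_ids = encode_labels(labels)
num_classes = len(classes)
print(classes, y_ids, num_classes)

# One-hot form, only needed when the loss is categorical_crossentropy.
y_onehot = np.eye(num_classes, dtype=int)[y_ids]
print(y_onehot.shape)
```

Integer IDs like `y_ids` can be passed directly to `SVC.fit` and pair with the `sparse_categorical_crossentropy` loss in Keras, while `y_onehot` matches `categorical_crossentropy`.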


## **Step 6: Define the CNN model**

Why use CNN?

Captures short-range dependencies in text.

Fast and effective for text classification.
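
The "short-range" behaviour comes from the convolution itself: each filter only sees a small window of neighbouring positions, and global max pooling keeps the strongest response anywhere in the sequence. The NumPy sketch below shows one filter with no bias term. The input values and filter weights are made up for illustration.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide `kernel` over the 1-D signal `x` (no padding, stride 1) and take dot products."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

x = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 2.0])   # toy input sequence
kernel = np.array([1.0, 0.0, -1.0])            # hypothetical filter weights

feature_map = np.maximum(conv1d_valid(x, kernel), 0.0)  # ReLU activation
pooled = feature_map.max()                               # global max pooling
print(feature_map)
print(pooled)
```

A `Conv1D` layer with 64 filters repeats this with 64 different kernels, so after pooling each sample is summarized by a 64-value vector.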

In [50]:
# Step 6: Define the CNN model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense, Dropout

# Reshape X (the padded integer sequences) to fit the CNN input shape
# Conv1D expects input of shape (samples, time steps, features);
# here each of the 400 positions is a time step with a single feature (the raw token index)
X = X.reshape((X.shape[0], X.shape[1], 1))  # Reshape to (samples, time steps, features)

# Define the CNN model (named cnn_model so the comparison cell can refer to it)
cnn_model = Sequential()

# Add a 1D convolutional layer with 64 filters and a kernel size of 5
cnn_model.add(Conv1D(filters=64, kernel_size=5, activation='relu', input_shape=(X.shape[1], 1)))

# Add a Global Max Pooling layer
cnn_model.add(GlobalMaxPooling1D())

# Add a Dense layer with 32 units and ReLU activation for feature extraction
cnn_model.add(Dense(32, activation='relu'))

# Optionally, add a Dropout layer for regularization to avoid overfitting
cnn_model.add(Dropout(0.5))

# Add the output layer
# Note: y holds padded token-index sequences, so y.shape[1] is the padded
# target length (7 in this case), not a count of distinct classes
num_classes = y.shape[1]
cnn_model.add(Dense(num_classes, activation='softmax'))

# Compile the model using the Adam optimizer and categorical crossentropy
# Note: categorical_crossentropy expects one-hot targets; feeding raw token indices
# may explain the very large loss values logged below
cnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Display the model summary to check its structure
cnn_model.summary()

# Fit the model
# Note: this trains on all of X, including the rows held out as X_test above
cnn_model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)



Epoch 1/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - accuracy: 0.4237 - loss: 21361.9727 - val_accuracy: 1.0000 - val_loss: 22.7129
Epoch 2/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.7295 - loss: 22101.3105 - val_accuracy: 1.0000 - val_loss: 36.5867
Epoch 3/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.7643 - loss: 31943.5605 - val_accuracy: 1.0000 - val_loss: 55.4911
Epoch 4/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.7820 - loss: 48934.6055 - val_accuracy: 1.0000 - val_loss: 81.5807
Epoch 5/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.7811 - loss: 66719.6250 - val_accuracy: 1.0000 - val_loss: 118.0352
Epoch 6/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.7911 - loss: 81140.4609 - val_accuracy: 1.0000 - val_loss: 169.0833
Ep

<keras.src.callbacks.history.History at 0x7a27fc261790>

**6.2 Train an LSTM Model**

LSTMs are good for learning sequential dependencies in text.

✅ Why use LSTM?

Captures long-range dependencies in text.

Good for language modeling and context learning.
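
The long-range behaviour comes from the cell state, which is carried from step to step and only modified through gates. Below is a minimal NumPy sketch of a single LSTM step, applied to a toy sequence. The weights are random, the sizes are hypothetical, and the gate ordering inside the stacked weight matrices is a choice made for this sketch rather than a statement about how Keras stores its weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate
    f = sigmoid(z[1 * units:2 * units])   # forget gate
    g = np.tanh(z[2 * units:3 * units])   # candidate cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate
    c_t = f * c_prev + i * g              # cell state carries information across steps
    h_t = o * np.tanh(c_t)                # hidden state exposed to the next layer
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 1, 3                   # hypothetical sizes
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for token_id in [3.0, 7.0, 0.0]:          # toy sequence, one feature per step
    h, c = lstm_step(np.array([token_id]), h, c, W, U, b)
print(h)
```

With `return_sequences=False`, an `LSTM(units=64)` layer returns only the final hidden state, which is the fixed-length feature vector the later layers (or an SVM) would consume.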

In [46]:
# LSTM for feature extraction
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Input
from sklearn.preprocessing import OneHotEncoder

# Define the LSTM model (named lstm_model so the comparison cell can refer to it)
lstm_model = Sequential()

# Add the Input layer with shape matching your data
lstm_model.add(Input(shape=(X.shape[1], 1)))  # Input shape: (time steps, features)

# Add an LSTM layer with 64 units; only the final hidden state is returned
lstm_model.add(LSTM(units=64, return_sequences=False))

# Add a Dense layer with 32 units and ReLU activation for feature extraction
lstm_model.add(Dense(32, activation='relu'))

# Add a Dropout layer for regularization to avoid overfitting
lstm_model.add(Dropout(0.5))

# Add the output layer
# Note: y holds padded token-index sequences, so y.shape[1] is the padded
# target length, not a count of distinct classes
num_classes = y.shape[1]
lstm_model.add(Dense(num_classes, activation='softmax'))

# Compile the model using the Adam optimizer and categorical crossentropy
# Note: categorical_crossentropy expects one-hot targets, which y is not
lstm_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Display the model summary to check its structure
lstm_model.summary()

# Fit the model
# Note: this trains on all of X, including the rows held out as X_test
lstm_model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 6s 176ms/step - accuracy: 0.7323 - loss: 470.9318 - val_accuracy: 1.0000 - val_loss: 193.9520
Epoch 2/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 2s 148ms/step - accuracy: 0.8048 - loss: 261.4023 - val_accuracy: 1.0000 - val_loss: 2.0211
Epoch 3/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 2s 149ms/step - accuracy: 0.8189 - loss: 416.6333 - val_accuracy: 1.0000 - val_loss: 2.1389
Epoch 4/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 2s 140ms/step - accuracy: 0.8639 - loss: 508.2468 - val_accuracy: 1.0000 - val_loss: 2.9424
Epoch 5/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 3s 220ms/step - accuracy: 0.8305 - loss: 902.5732 - val_accuracy: 1.0000 - val_loss: 3.7132
Epoch 6/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 4s 140ms/step - accuracy: 0.8556 - loss: 975.3773 - val_accuracy: 1.0000 - val_loss: 4.4895
Epoch 7/10
[1

<keras.src.callbacks.history.History at 0x7a27fc971790>

**Train an SVM Model (Using TF-IDF)**

✅ Why use SVM?

Works well with sparse, high-dimensional TF-IDF features.

Often a strong baseline for text classification problems.

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import numpy as np

# Combine text from all columns
all_text = df[columns_to_encode].astype(str).apply(lambda x: ' '.join(x), axis=1)

# TF-IDF encoding
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(all_text)

# Split data
# Reduce y to one value per row so it can be used as an SVM target
# Note: y holds padded token-index sequences rather than one-hot vectors, so argmax returns
# the position of the largest token index in each row (0 for single-word rows), not a class ID
y_1d = y.argmax(axis=1)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(
    X_tfidf, y_1d, test_size=0.2, random_state=42
)

# Train SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_tfidf, y_train_tfidf)

# Evaluate
svm_accuracy = svm_model.score(X_test_tfidf, y_test_tfidf)
print(f"SVM Accuracy: {svm_accuracy:.4f}")

SVM Accuracy: 0.8509


**Compare Model Performance**

After training all three models, compare them based on Accuracy, Precision, Recall, and F1-score.
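
Accuracy alone can hide weak performance on rare classes, which is why the per-class metrics matter. As a minimal sketch, the function below computes precision, recall, and F1 for one class from raw label lists. The helper name `per_class_metrics` and the toy labels are made up for illustration. Undefined ratios are reported as 0.0, which is also what scikit-learn's `classification_report` shows by default (with a warning).

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, returning 0.0 when a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up): the model predicts class 0 for every sample.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
print(per_class_metrics(y_true, y_pred, 0))
print(per_class_metrics(y_true, y_pred, 1))
```

In this toy case the accuracy is about 0.67 even though class 1 is never predicted, so its precision, recall, and F1 are all zero.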

In [48]:
from sklearn.metrics import classification_report
import numpy as np

# Evaluate CNN model (assumes the trained CNN is available as cnn_model)
cnn_preds = cnn_model.predict(X_test)  # Raw predictions; no CNN report is printed in this cell

# Reshape predictions to be 2D if necessary
if cnn_preds.ndim == 3 and cnn_preds.shape[2] == 1:  # Check for extra dimension
    cnn_preds = cnn_preds.squeeze(axis=2)  # Remove extra dimension

# Evaluate LSTM model (assumes the trained LSTM is available as lstm_model)
lstm_preds = lstm_model.predict(X_test)
# Convert LSTM predictions to one index per row
lstm_preds_classes = np.argmax(lstm_preds, axis=1)
print("LSTM Classification Report:")
# y_test holds padded token-index sequences, so argmax gives the position of the
# largest token index per row; the same reduction is used for the SVM target
print(classification_report(y_test.argmax(axis=1), lstm_preds_classes))

# Evaluate SVM model
svm_preds = svm_model.predict(X_test_tfidf)
print("SVM Classification Report:")
print(classification_report(y_test_tfidf, svm_preds))

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 116ms/step
LSTM Classification Report:
              precision    recall  f1-score   support

           0       0.86      1.00      0.92        98
           1       0.00      0.00      0.00        13
           2       0.00      0.00      0.00         1
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1

    accuracy                           0.86       114
   macro avg       0.17      0.20      0.18       114
weighted avg       0.74      0.86      0.79       114

SVM Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.99      0.92        98
           1       0.00      0.00      0.00        13
           2       0.00      0.00      0.00         1
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
