# **Master of Data Science Notebook**

## Analyzing the dataset for my research on **Preserving and reviving the Oshiwambo language through hybrid SVM-Deep Learning Models**, using a Python Jupyter notebook.

# ***DATA CLEANING***

# TOKENIZATION AND CLEANING WITH NLTK

In [18]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
import nltk


# Load your CSV file
file_path = '/content/sample_data/Thesis_Dataset - Sheet1.CSV'
df = pd.read_csv(file_path)

# Strip any extra spaces from the column names
df.columns = df.columns.str.strip()

# Print column names to check if they match after stripping
print("Columns in DataFrame after stripping spaces:")
print(df.columns)

Columns in DataFrame after stripping spaces:
Index(['Oshiwambo', 'Aa-ndonga', 'Aa-kwambi', 'Aa-mbalanhu', 'Aa-kwaluudhi',
       'Aa-kwanyama', 'Aa-ngandjera', 'Aa-mbandja'],
      dtype='object')


In [None]:


# Download the 'punkt' tokenizer if not already downloaded
nltk.download('punkt')


In [19]:


# List of columns to be processed (after removing spaces from column names)
columns_to_process = [
    'Oshiwambo', 'Aa-ndonga', 'Aa-kwambi',
    'Aa-mbalanhu', 'Aa-kwaluudhi',
    'Aa-kwanyama', 'Aa-ngandjera', 'Aa-mbandja'
]
# Function to tokenize and remove punctuation (no stemming)
def clean_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove punctuation and non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    return ' '.join(tokens)

# Apply the cleaning function (tokenize and remove punctuation) to each specified column
for column in columns_to_process:
    if column in df.columns:
        df[column] = df[column].astype(str).apply(clean_text)
    else:
        print(f"Column '{column}' not found in DataFrame.")

# Display the first few rows of the cleaned DataFrame (without stemming)
print(df.head())


  Oshiwambo Aa-ndonga  Aa-kwambi Aa-mbalanhu Aa-kwaluudhi Aa-kwanyama  \
0       Ame     Ngame      Ngaye        Aame         Amee         Ame   
1       Ove     Ngoye      Ngwee         Oye          Oye         Ove   
2                  Ye         Ye          Ye           Ye          Ye   
3      Fyee       Tse         Se          Se           Se         Fye   
4    Amushe        Ne  Ne amushe         Nye        Amuhe         Nye   

  Aa-ngandjera Aa-mbandja  
0        Ngaye        Ame  
1        Ngwee        ove  
2           Ye             
3          Tse             
4           Ne             
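
The `isalpha()` filter in `clean_text` drops every token that contains a hyphen, digit, or other non-letter, and `astype(str)` turns a missing cell into the literal string `nan`, which then passes the filter. The sketch below illustrates both effects without NLTK. It assumes whitespace splitting as a stand-in for `word_tokenize`; `simple_clean` and `clean_cell` are hypothetical helper names, and the input strings are made up for illustration.

```python
import math
import string


def simple_clean(text):
    """Keep purely alphabetic tokens; whitespace splitting stands in for word_tokenize."""
    tokens = [tok.strip(string.punctuation) for tok in text.split()]
    return " ".join(tok for tok in tokens if tok.isalpha())


def clean_cell(value):
    """Hypothetical wrapper: map missing cells to '' instead of the string 'nan'."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return simple_clean(str(value))


print(simple_clean("Ne amushe!"))        # punctuation stripped, both words kept
print(simple_clean("made-up 2 words"))   # hyphenated and numeric tokens are dropped
print(simple_clean(str(float("nan"))))   # a missing value cast to str survives as 'nan'
print(repr(clean_cell(float("nan"))))    # the wrapper returns an empty string instead
```

If hyphenated word forms matter for the dialect comparison, the filter would need to be relaxed; that choice is left open here.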


In [24]:
# Initialize the Porter Stemmer
porter = PorterStemmer()

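# Function to stem words in the "Oshiwambo" column.
# Note: PorterStemmer implements English suffix rules, so on Oshiwambo text it mainly
# lowercases words and trims endings that happen to look English (e.g. "Amushe" -> "amush").
def stem_words(text):
    # The text is already tokenized and space-joined, so split on whitespace and stem each word
    stemmed = [porter.stem(word) for word in text.split()]
    return ' '.join(stemmed)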

# Apply stemming to the "Oshiwambo" column and replace the original column with stemmed text
df['Oshiwambo'] = df['Oshiwambo'].astype(str).apply(stem_words)

# Display the first few rows of the DataFrame with the replaced "Oshiwambo" column
print(df.head())

# Example: Print the first 100 stemmed words from the 'Oshiwambo' column
osh_stemmed_flat = [word for sublist in df['Oshiwambo'].str.split() for word in sublist]
print(osh_stemmed_flat[:100])

  Oshiwambo Aa-ndonga  Aa-kwambi Aa-mbalanhu Aa-kwaluudhi Aa-kwanyama  \
0       ame     Ngame      Ngaye        Aame         Amee         Ame   
1       ove     Ngoye      Ngwee         Oye          Oye         Ove   
2                  Ye         Ye          Ye           Ye          Ye   
3      fyee       Tse         Se          Se           Se         Fye   
4     amush        Ne  Ne amushe         Nye        Amuhe         Nye   

  Aa-ngandjera Aa-mbandja  
0        Ngaye        Ame  
1        Ngwee        ove  
2           Ye             
3          Tse             
4           Ne             
['ame', 'ove', 'fyee', 'amush', 'voo', 'meme', 'tate', 'omwaina', 'kadona', 'omwaina', 'mati', 'umweinafana', 'onhu', 'adala', 'kumweina', 'wamem', 'or', 'tate', 'woy', 'tatekulu', 'okatekulu', 'kokamati', 'okatekulu', 'kokakadona', 'tate', 'mweno', 'oofelend', 'mee', 'mweno', 'nawa', 'ongula', 'okuhala', 'komatango', 'oku', 'tokelwa', 'uufiku', 'walelepo', 'wauhalapo', 'watokelwapo', 'ngei

## **One-Hot Encoding**

In [32]:
from sklearn.preprocessing import OneHotEncoder

# List of columns to be one-hot encoded
columns_to_encode = ["Oshiwambo", "Aa-ndonga", "Aa-kwambi", "Aa-mbalanhu", "Aa-kwaluudhi", "Aa-kwanyama", "Aa-ngandjera", "Aa-mbandja"]

# Initialize the OneHotEncoder with the updated parameter
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the selected columns into one-hot encoded format
one_hot_encoded = encoder.fit_transform(df[columns_to_encode])

# Print the shape of the one-hot encoded data
print(f"One-hot encoded shape: {one_hot_encoded.shape}")

# Example: Display a few rows of one-hot encoded data
print(one_hot_encoded[:5])


One-hot encoded shape: (570, 3796)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
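
The shape `(570, 3796)` follows from how `OneHotEncoder` works: it treats the whole string in each cell as one category, so the output width is the sum of the distinct values per column, not the number of distinct words. A minimal pure-Python sketch of the same idea on a toy table follows. The first three rows reuse words from the output above; the fourth row is made up to show a repeated category, and `encode_row` is a hypothetical helper name.

```python
# Toy table; the last row is made up to show a repeated category.
rows = [
    {"Oshiwambo": "Ame", "Aa-ndonga": "Ngame", "Aa-kwambi": "Ngaye"},
    {"Oshiwambo": "Ove", "Aa-ndonga": "Ngoye", "Aa-kwambi": "Ngwee"},
    {"Oshiwambo": "Fyee", "Aa-ndonga": "Tse", "Aa-kwambi": "Se"},
    {"Oshiwambo": "Ame", "Aa-ndonga": "Tse", "Aa-kwambi": "Se"},
]
columns = ["Oshiwambo", "Aa-ndonga", "Aa-kwambi"]

# One sorted category list per column, mirroring the encoder's default behaviour.
categories = {col: sorted({row[col] for row in rows}) for col in columns}


def encode_row(row):
    """Concatenate one indicator block per column."""
    vector = []
    for col in columns:
        block = [0.0] * len(categories[col])
        block[categories[col].index(row[col])] = 1.0
        vector.extend(block)
    return vector


encoded = [encode_row(row) for row in rows]
width = sum(len(values) for values in categories.values())
print(f"Encoded shape: ({len(encoded)}, {width})")
```

Each encoded row contains exactly one `1.0` per column, so every row sums to the number of columns.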


# **LSTM or CNN for Feature Extraction**

In [37]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Input

# Define the LSTM model
model = Sequential()

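# Add the Input layer explicitly.
# X is the 3D array (samples, time steps, 1) created in the CNN cell (In [35]), which was run before this one.
model.add(Input(shape=(X.shape[1], 1)))  # Input shape: (time steps, features)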

# Add an LSTM layer
model.add(LSTM(units=64, return_sequences=False))

# Add a Dense layer for feature extraction or classification
model.add(Dense(32, activation='relu'))

# Optionally, add a Dropout layer for regularization
model.add(Dropout(0.5))

# Output layer for binary classification (adjust for multi-class as needed)
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the model summary
model.summary()

# Example: Fit the model (assuming y is the label for classification)
# Replace 'y' with your actual labels
# model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)
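
The output of `model.summary()` is not included above, so the following sketch works out the trainable parameter counts by hand for the LSTM model as defined (one input feature per time step, 64 LSTM units, `Dense(32)`, `Dense(1)`). It uses the standard formulas for an LSTM layer with bias and for a dense layer; `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias vector."""
    return input_dim * units + units


lstm_count = lstm_params(input_dim=1, units=64)
hidden_count = dense_params(input_dim=64, units=32)
output_count = dense_params(input_dim=32, units=1)
total = lstm_count + hidden_count + output_count

print(f"LSTM: {lstm_count}, Dense(32): {hidden_count}, Dense(1): {output_count}")
print(f"Total trainable parameters: {total}")
```

The number of time steps (here the one-hot width) does not change the parameter count, but it does set how many sequential steps the LSTM must process per row.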


In [35]:
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense, Dropout

# Reshape one_hot_encoded data to 3D (samples, time steps, features)
X = one_hot_encoded.reshape((one_hot_encoded.shape[0], one_hot_encoded.shape[1], 1))

# Define the CNN model
model = Sequential()

# Add a 1D convolutional layer
model.add(Conv1D(filters=64, kernel_size=5, activation='relu', input_shape=(X.shape[1], 1)))

# Global Max Pooling layer
model.add(GlobalMaxPooling1D())

# Add a Dense layer for feature extraction or classification
model.add(Dense(32, activation='relu'))

# Optionally, add a Dropout layer for regularization
model.add(Dropout(0.5))

# Output layer for binary classification
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the model summary
model.summary()

# Example: Fit the model (assuming y is the label for classification)
# Replace 'y' with your actual labels
# model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)
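
To make the CNN path more concrete, here is a small NumPy sketch of what one `Conv1D` filter followed by `GlobalMaxPooling1D` computes on a single row. It assumes the Keras defaults used above (stride 1, no padding); the toy row, kernel, and bias values are made up, and `conv_output_length` is a hypothetical helper name.

```python
import numpy as np


def conv_output_length(input_length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return input_length - kernel_size + 1


# Toy one-hot style row with 12 positions (values made up for illustration).
row = np.zeros(12)
row[[2, 9]] = 1.0

kernel = np.array([0.5, -1.0, 2.0, -1.0, 0.5])  # one filter with kernel_size=5
bias = 0.1

# Slide the kernel over the row: one dot product per window position.
windows = np.lib.stride_tricks.sliding_window_view(row, kernel.size)
feature_map = np.maximum(windows @ kernel + bias, 0.0)  # ReLU activation

# Global max pooling keeps only the strongest response of this filter.
pooled = feature_map.max()

print(feature_map.shape, pooled)
print(conv_output_length(3796, 5))  # length per filter for the notebook's input width
```

With 64 filters, the pooled output is a vector of 64 values per row, which is what the following `Dense(32)` layer receives.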


# **SVM for Classification**

In [29]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Baseline: train the SVM directly on the one-hot encoded matrix.
# NOTE: 'labels' (one target value per row) has not been defined anywhere above,
# so this cell raises the NameError shown below until a label vector is supplied.
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(one_hot_encoded, labels, test_size=0.2, random_state=42)

# Initialize the SVM classifier
svm_classifier = SVC(kernel='linear')

# Fit the SVM model to the training data
svm_classifier.fit(X_train, y_train)

# Predict using the SVM classifier
y_pred = svm_classifier.predict(X_test)

# Evaluate the SVM model
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred)}")


NameError: name 'labels' is not defined
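
The `NameError` arises because the dataset has no label column yet. One possible way to obtain labels, assuming the goal is dialect classification as suggested in the architecture notes at the end of this notebook, is to reshape the wide table into (word, dialect) pairs and use the column name as the target. This is only a sketch under that assumption, not the method used in this notebook; the two sample rows reuse words from the cleaned output above.

```python
# Two rows in the notebook's wide layout (subset of columns for brevity).
wide_rows = [
    {"Aa-ndonga": "Ngame", "Aa-kwambi": "Ngaye", "Aa-kwanyama": "Ame"},
    {"Aa-ndonga": "Ngoye", "Aa-kwambi": "Ngwee", "Aa-kwanyama": "Ove"},
]

# Long format: one (word, dialect) pair per non-empty cell.
texts, label_names = [], []
for row in wide_rows:
    for dialect, word in row.items():
        if word:  # skip empty cells left by the cleaning step
            texts.append(word)
            label_names.append(dialect)

# Map dialect names to integer class ids in sorted order.
classes = sorted(set(label_names))
class_to_id = {name: idx for idx, name in enumerate(classes)}
labels = [class_to_id[name] for name in label_names]

print(list(zip(texts, labels)))
```

With more than two dialects this becomes a multi-class problem, so the sigmoid output layer and `binary_crossentropy` loss used in the models above would need to be adjusted accordingly.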

In [39]:
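# Define the LSTM model first and, ideally, train it, for example:
# model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)
from keras.models import Model

# Create the feature extractor. model.layers[-2] is the Dropout layer, which at inference
# passes the Dense(32) activations through unchanged, so the extractor returns 32 features per row.
# The original call used inputs=model.input, which raised the ValueError recorded below;
# model.inputs is assumed here to be available for a built Sequential model (Keras 3).
feature_extractor = Model(inputs=model.inputs, outputs=model.layers[-2].output)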

# Use the LSTM model to predict/extract features from the one-hot encoded data
# Reshape X as before to 3D (samples, time steps, features)
X = one_hot_encoded.reshape((one_hot_encoded.shape[0], one_hot_encoded.shape[1], 1))

# Extract features from the LSTM model
features = feature_extractor.predict(X)

# Now 'features' can be used for SVM input
# Assuming 'labels' contains the true labels for the data
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Initialize the SVM classifier
svm_classifier = SVC(kernel='linear')

# Fit the SVM model to the training data
svm_classifier.fit(X_train, y_train)

# Predict using the SVM classifier
y_pred = svm_classifier.predict(X_test)

# Evaluate the SVM model
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred)}")


ValueError: The layer sequential_5 has never been called and thus has no defined input.

# **Model Fusion: Combining SVM and Deep Learning Models**

In [None]:
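# Option 1: Use SVM on Deep Learning Features
from keras.models import Model

# Get the features from the LSTM or CNN layer (e.g., intermediate layers)
# model.inputs is assumed to be available for a built Sequential model (Keras 3)
intermediate_model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

# The deep models expect the 3D array X (samples, time steps, 1), not the 2D one_hot_encoded matrix
lstm_cnn_features = intermediate_model.predict(X)

# Use these features as input to the SVM classifier ('labels' must be defined first)
svm_classifier.fit(lstm_cnn_features, labels)

# Evaluate SVM performance
# Note: this scores the same rows used for fitting, so it is a training accuracy
y_pred_svm = svm_classifier.predict(lstm_cnn_features)
print(f"Fused Model Accuracy: {accuracy_score(labels, y_pred_svm)}")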
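
Fitting and scoring the SVM on the same rows reports training accuracy only. The sketch below shows the same SVM step with a held-out split. Since no labels or trained feature extractor exist yet in this notebook, it uses synthetic 32-dimensional features and synthetic binary labels as stand-ins, so the printed numbers say nothing about the Oshiwambo data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-ins for the 32-dimensional deep features and their labels.
rng = np.random.default_rng(42)
features = np.vstack([
    rng.normal(-1.0, 1.0, size=(100, 32)),
    rng.normal(1.0, 1.0, size=(100, 32)),
])
labels = np.array([0] * 100 + [1] * 100)

# Hold out 20% of the rows; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

train_acc = accuracy_score(y_train, svm.predict(X_train))
test_acc = accuracy_score(y_test, svm.predict(X_test))
print(f"Train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

For a fully clean evaluation, the deep feature extractor would also need to be trained on the training rows only.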


In [None]:
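# Option 2: Ensemble Method (Combine Probabilities)
# Example: Combining the scores of the SVM and the deep learning model for the final prediction
import numpy as np

# model.predict returns shape (n_samples, 1); flatten it so it aligns with the SVM scores
lstm_cnn_prob = model.predict(X).ravel()

# decision_function returns signed margins, not probabilities; squash them into (0, 1)
# with a logistic function (a simple stand-in for proper probability calibration).
# The SVM must receive the same kind of input it was fitted on.
svm_score = svm_classifier.decision_function(one_hot_encoded)
svm_prob = 1 / (1 + np.exp(-svm_score))

# Combine predictions (example: average the two scores)
final_prob = (lstm_cnn_prob + svm_prob) / 2
final_pred = (final_prob > 0.5).astype(int)

# Evaluate combined model performance
print(f"Ensemble Model Accuracy: {accuracy_score(labels, final_pred)}")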
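
Two details matter when averaging the model outputs: Keras `predict` returns a column of shape `(n, 1)` while `decision_function` returns a flat array of shape `(n,)`, and the SVM values are signed margins rather than probabilities. The NumPy sketch below uses made-up numbers to show the broadcasting pitfall and one simple way around it (flattening plus a logistic squash, which is an assumption here and not a calibrated probability).

```python
import numpy as np

dl_prob = np.array([[0.9], [0.2], [0.6]])   # shape (3, 1), like model.predict output
svm_margin = np.array([1.5, -2.0, 0.1])     # shape (3,), like decision_function output

# Adding the two directly broadcasts to a 3x3 matrix instead of 3 scores.
broadcast_sum = dl_prob + svm_margin
print(broadcast_sum.shape)

# Flatten the network output and map the margins into (0, 1) before averaging.
svm_prob = 1.0 / (1.0 + np.exp(-svm_margin))
final_prob = (dl_prob.ravel() + svm_prob) / 2
final_pred = (final_prob > 0.5).astype(int)
print(final_pred)
```

An alternative is to fit the SVM with `SVC(probability=True)` and average its `predict_proba` output with the network's probabilities.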




### Architecture Flow Chart for Hybrid Model:

1. **Data Input**:
   - Input: Raw text data from multiple columns (e.g., `Oshiwambo`, `Aa-ndonga`, etc.)
   
2. **Data Cleaning**:
   - Tokenization
   - Punctuation Removal
   - Stemming (applied only to the `Oshiwambo` column)
   - Output: Cleaned tokenized text data
   
3. **One-Hot Encoding**:
   - Convert cleaned text into numerical format (One-Hot Encoding)
   - Output: Encoded feature matrix

4. **Feature Extraction**:
   - **Path 1: LSTM Model**:
     - LSTM is used to extract sequential patterns from the encoded data.
     - Output: LSTM feature vector
   - **Path 2: CNN Model**:
     - CNN is used to capture local features from the encoded text data.
     - Output: CNN feature vector

5. **SVM Classification**:
   - SVM is trained using features from LSTM/CNN to categorize language patterns.
   - Output: Classified language patterns (e.g., dialects)

6. **Model Fusion**:
   - Combine the outputs of the SVM and LSTM/CNN models.
   - Output: Final prediction based on the fusion of both models.

### Labels:
- Data Cleaning → Preprocessing → Feature Extraction → Classification → Model Fusion → Final Output
