# CS345 Project

## Team Members
1. Hamad Alyami
2. Benito Encarnacion

## Dataset
Our dataset comes from Kaggle, where it was uploaded by a user called Mexwell. It consists of paragraphs scraped from Wikipedia in 2018, covering 235 languages.

The dataset contains 235,000 samples, balanced across languages, with a train and test split provided.

The downloaded folder from Kaggle contains:
- labels.csv: A file containing the language name, 2-3 letter code, German name, and language family of all the languages present in the dataset.
- README.txt: A file explaining the folder contents.
- urls.txt: A file containing the urls of where the paragraphs were found.
- x_test.txt: The testing data samples, paragraphs in multiple languages.
- x_train.txt: The training data samples, paragraphs in multiple languages.
- y_test.txt: The labels for the testing dataset, using the 2-3 letter codes found in labels.csv.
- y_train.txt: The labels for the training dataset, using the 2-3 letter codes found in labels.csv.


## Project
Our project is to train two ML models on the Latin-alphabet languages present in the dataset and compare their performance.

## Motivation
We decided to do this project because it allows us to explore practical applications of natural language processing and machine learning by working with real-world multilingual data. Language identification is an important task in many systems and applications like search engines, translation tools, and content moderation. Working with such a dataset gives us the opportunity to apply classification techniques in a meaningful way. By focusing on languages that use the Latin alphabet, we avoid complications from different writing systems while still working with a variety of languages.

## Models
The models we decided to work with in this project are:
- Multinomial Naive Bayes (MNB): Uses the word frequencies in each class (languages, in our case) to predict the most likely class for text it has not seen.

- Feed Forward Neural Network (FFNN): An artificial neural network where information moves from input to output without looping back. It uses neurons (connected nodes) to learn patterns and make predictions.
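To make the MNB description concrete, here is a minimal from-scratch sketch of the idea. It is not the scikit-learn implementation used later in this notebook: the function names `train_mnb` and `predict_mnb`, the whitespace tokenization, and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_mnb(texts, labels, alpha=1.0):
    """Collect per-class word counts (hypothetical helper, not sklearn)."""
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)       # class -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab, alpha

def predict_mnb(model, text):
    """Return the class with the highest log posterior for `text`."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Laplace-smoothed log likelihood of the word given the class
            score += math.log((word_counts[label][word] + alpha)
                              / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, made up for this example
toy_texts = ["the cat sat on the mat", "the dog is in the house",
             "el gato está en la casa", "el perro come en la casa"]
toy_labels = ["eng", "eng", "spa", "spa"]
toy_model = train_mnb(toy_texts, toy_labels)
print(predict_mnb(toy_model, "the house"))
print(predict_mnb(toy_model, "la casa"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term `alpha` keeps a word that is absent from one class from zeroing out that class entirely.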

### Data Preprocessing
We begin by reading the data from the files, then:
1. Remove null values
2. Filter to keep only the texts in our target languages, using the 2-3 letter codes
3. Stack the samples from x_train and x_test into X, and the labels from y_train and y_test into y
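The three steps above can be sketched on plain Python lists before looking at the NumPy version used in this notebook. The helper names `clean_filter` and `stack_splits` and the toy samples are assumptions for illustration only.

```python
def clean_filter(texts, labels, keep_codes):
    """Drop missing rows, then keep only rows whose label is in keep_codes."""
    kept_texts, kept_labels = [], []
    for text, label in zip(texts, labels):
        if text is None or label is None or not str(text).strip():
            continue  # step 1: remove null or empty samples
        if label not in keep_codes:
            continue  # step 2: keep only the target languages
        kept_texts.append(text)
        kept_labels.append(label)
    return kept_texts, kept_labels

def stack_splits(train, test, keep_codes):
    """Step 3: clean each split, then concatenate them into X and y."""
    X_tr, y_tr = clean_filter(*train, keep_codes)
    X_te, y_te = clean_filter(*test, keep_codes)
    return X_tr + X_te, y_tr + y_te

# Toy (texts, labels) pairs, made up for this example
toy_train = (["Hello world", None, "Bonjour le monde"], ["eng", "eng", "fra"])
toy_test = (["Привет мир", "Hola mundo"], ["rus", "spa"])
toy_X, toy_y = stack_splits(toy_train, toy_test, {"eng", "fra", "spa"})
print(toy_X, toy_y)
```

Cleaning texts and labels together, row by row, keeps the two lists aligned, which is the same reason the notebook stacks `y` and `X` side by side before filtering.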

In [121]:
#Understanding the data set
import pandas as pd
import numpy as np
            # Italian, French, Spanish, Portuguese, English, German, Dutch, Indonesian, Finnish, Hausa
lang_codes = ['ita', 'fra', 'spa', 'por', 'eng', 'deu', 'nld', 'ind', 'fin', 'hau']
langs = ['Italian', 'French', 'Spanish', 'Portuguese', 'English', 'German', 'Dutch', 'Indonesian', 'Finnish', 'Hausa']

def file_to_np_array(path, label):
    try:
        # The separator is chosen so it never occurs in the text: each line is read as a single column
        df = pd.read_csv(path, sep='<NonExistenceSeparator>', header=None, engine='python')
        print(f"{label}: Read!")
    except Exception as e:
        print(f"Error reading the {label} file: {e}")
        return None
    return df.to_numpy()


def clean_np_data(X, y):
    stacked = np.hstack((y, X)) # Stack y and X side by side
    # print(stacked.shape)
    clean_stacked = stacked[~np.any(pd.isna(stacked), axis=1), :] # Remove empty values
    # print(clean_stacked.shape)
    true_clean = clean_stacked[np.isin(clean_stacked[:,0], lang_codes),:] # Remove all rows that aren't our target languages
    # print(true_clean.shape)
    return true_clean[:,1], true_clean[:,0] # Return cleaned as X and y split again

def clean_filter_and_stack(X_train_file, y_train_file, X_test_file, y_test_file):
    X_train_clean, y_train_clean = clean_np_data(file_to_np_array(X_train_file, X_train_file), 
                                       file_to_np_array(y_train_file, y_train_file))
    X_test_clean, y_test_clean = clean_np_data(file_to_np_array(X_test_file, X_test_file), 
                                       file_to_np_array(y_test_file, y_test_file))
    return np.hstack((X_train_clean, X_test_clean)), np.hstack((y_train_clean, y_test_clean))

X, y = clean_filter_and_stack("Data/x_train.txt", 
                                      "Data/y_train.txt", 
                                      "Data/x_test.txt", 
                                      "Data/y_test.txt")

print(X.shape, y.shape)

Data/x_train.txt: Read!
Data/y_train.txt: Read!
Data/x_test.txt: Read!
Data/y_test.txt: Read!
(10000,) (10000,)


#### Data Discovery
This code computes how the samples are distributed across languages and the average word count per sample for each language.

In [122]:
def avg_words(filtered_X):
    total = 0
    for text in filtered_X:
        words = str(text).split()
        total += len(words)

    return total / len(filtered_X)

def word_count_perlang(X, y):
    avg_word_count = []
    for lang in lang_codes:
        filtered_X = X[y == lang]
        avg_word_count.append(avg_words(filtered_X))
    
    return avg_word_count

def lang_perc(y):
    lang_perc = []
    total = len(y)
    for lang in lang_codes:
        count = (y == lang).sum()
        percent = (count / total) * 100
        lang_perc.append(percent)
    return lang_perc

df = pd.DataFrame({
    'Language': langs,
    'Percent of Dataset (%)': lang_perc(y),
    'Average Word Count': word_count_perlang(X, y)
})

display(df)

| | Language | Percent of Dataset (%) | Average Word Count |
|---|---|---|---|
| 0 | Italian | 10.0 | 68.192 |
| 1 | French | 10.0 | 67.707 |
| 2 | Spanish | 10.0 | 67.295 |
| 3 | Portuguese | 10.0 | 66.184 |
| 4 | English | 10.0 | 70.455 |
| 5 | German | 10.0 | 59.762 |
| 6 | Dutch | 10.0 | 55.657 |
| 7 | Indonesian | 10.0 | 57.147 |
| 8 | Finnish | 10.0 | 48.431 |
| 9 | Hausa | 10.0 | 75.802 |


#### Data Split
Here we use scikit-learn's train_test_split to shuffle the data randomly and split it 70/30 into training and test sets, respectively.

In [123]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(7000,) (7000,)
(3000,) (3000,)
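What the shuffled 70/30 split does can be sketched in plain Python. This is an illustrative stand-in, not scikit-learn's implementation: the name `shuffled_split` is hypothetical, and a seed of 17 here will not reproduce the same permutation as `train_test_split(..., random_state=17)`.

```python
import random

def shuffled_split(samples, labels, test_size=0.3, seed=17):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    # Samples and labels are indexed with the same positions so they stay aligned
    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data, made up for this example
toy_samples = [f"text {i}" for i in range(10)]
toy_labels = ["eng" if i % 2 == 0 else "fra" for i in range(10)]
s_train, s_test, l_train, l_test = shuffled_split(toy_samples, toy_labels)
print(len(s_train), len(s_test))
```

A purely random split like this does not guarantee that every language contributes the same number of test samples. If that matters, `train_test_split` accepts a `stratify` argument; the notebook does not use it.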


Next, we vectorize the text into word-count features for the MNB.

In [124]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(strip_accents='unicode')
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)
print("Done vectorizing")

Done vectorizing
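As a rough picture of what the vectorization step produces, the sketch below builds a word-count matrix by hand. It approximates the default `CountVectorizer` behaviour (lowercasing, tokens of two or more word characters) together with accent stripping; the helper names and the toy sentences are assumptions, and the real class returns a sparse matrix rather than nested lists.

```python
import re
import unicodedata
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def count_tokens(text):
    """Lowercase, strip accents, and count the remaining tokens."""
    return Counter(TOKEN_RE.findall(strip_accents(text.lower())))

def vectorize(texts):
    """Return a sorted vocabulary and one row of counts per text."""
    counts = [count_tokens(t) for t in texts]
    vocab = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix

# Toy sentences, made up for this example
toy_vocab, toy_matrix = vectorize(["Café au lait", "café olé café"])
print(toy_vocab)
print(toy_matrix)
```

Stripping accents merges forms such as "café" and "cafe" into one feature. That shrinks the vocabulary, but it also discards diacritics that can help tell related languages apart, so it is a trade-off worth keeping in mind when reading the results.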


We train our MNB using scikit-learn's MultinomialNB class on X_train, which is 70% of our dataset, and then measure its accuracy on the remaining 30%.

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_vectors, y_train)
print("Done training MNB")

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test_vectors)
accuracy = accuracy_score(y_test, y_pred)
print("Overall accuracy of MNB: " + str(accuracy * 100) + "%")

Done training MNB


Overall accuracy of MNB: 98.1%


In [139]:
def get_lang_accuracies(y_true, y_pred):
    df = pd.DataFrame({'language': y_true, 'pred': y_pred})
    accuracies = []

    for lang in lang_codes:
        lang_group = df[df['language'] == lang]
        if len(lang_group) > 0:
            acc = accuracy_score(lang_group['language'], lang_group['pred'])
        else:
            acc = 0
        accuracies.append(acc)
    
    percent_accuracies = [x * 100 for x in accuracies]
    df = None
    return percent_accuracies

df_MNB_lang = pd.DataFrame({
    'Language': langs,
    'Accuracy of MNB/language': get_lang_accuracies(y_test, y_pred)
})

display(df_MNB_lang)

| | Language | Accuracy of MNB/language |
|---|---|---|
| 0 | Italian | 96.855346 |
| 1 | French | 99.335548 |
| 2 | Spanish | 97.647059 |
| 3 | Portuguese | 93.728223 |
| 4 | English | 100.0 |
| 5 | German | 97.569444 |
| 6 | Dutch | 98.275862 |
| 7 | Indonesian | 98.615917 |
| 8 | Finnish | 99.315068 |
| 9 | Hausa | 99.662162 |
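The per-language accuracy above is computed only over samples whose true label is that language, which makes it the same quantity as per-class recall. A small sketch, with hypothetical helper names and made-up labels, shows the calculation and also counts which wrong label each language was mistaken for:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals, hits = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {label: hits[label] / totals[label] for label in totals}

def confusion_pairs(y_true, y_pred):
    """Count (true, predicted) pairs for the misclassified samples only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

# Toy labels, made up for this example
toy_true = ["por", "por", "spa", "spa", "eng"]
toy_pred = ["por", "spa", "spa", "spa", "eng"]
print(per_class_recall(toy_true, toy_pred))
print(confusion_pairs(toy_true, toy_pred).most_common())
```

Running `confusion_pairs` on `y_test` and `y_pred` would show which languages the lower-scoring classes are being confused with, something the accuracy table alone does not reveal.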


In [140]:
# `model` may be the bare MultinomialNB or, if the cross-validation cell was run
# first, the Pipeline built there; unwrap the pipeline steps in that case
if hasattr(model, 'named_steps'):
    nb = model.named_steps['multinomialnb']
    vec = model.named_steps['countvectorizer']
else:
    nb, vec = model, vectorizer

# Get the class labels (languages)
classes = nb.classes_

# Get the log probabilities of features per class
log_probs = nb.feature_log_prob_  # shape: [n_classes, n_features]

# Get the vocabulary mapping from the matching vectorizer
feature_names = vec.get_feature_names_out()

for idx, lang in enumerate(classes):
    top_word_idx = log_probs[idx].argmax()  # index of the most probable word for this class
    top_word = feature_names[top_word_idx]
    print(f"Most probable word for {lang}: {top_word}")

In [129]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import make_pipeline

model = make_pipeline(CountVectorizer(), MultinomialNB())
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=17)

all_preds = []
all_true = []

X_train_kfold = X_train.copy()
X_test_kfold = X_test.copy()
y_train_kfold = y_train.copy()
y_test_kfold = y_test.copy()

k_folds_accuracies = []
for i, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train_kfold, X_test_kfold = X[train_index], X[test_index]
    y_train_kfold, y_test_kfold = y[train_index], y[test_index]

    model.fit(X_train_kfold.flatten(), y_train_kfold.flatten())
    y_pred_kfold = model.predict(X_test_kfold.flatten())

    fold_accuracy = accuracy_score(y_test_kfold, y_pred_kfold)
    k_folds_accuracies.append(fold_accuracy)

    all_preds.extend(y_pred_kfold)
    all_true.extend(y_test_kfold)

In [130]:
df_kfold_accuracy = pd.DataFrame({
    'Kfold': [1, 2, 3, 4, 5],
    'Accuracy': k_folds_accuracies
})


print("Cross-Validation Results (Accuracy):")

display(df_kfold_accuracy)

# Reuse the fold accuracies computed above instead of re-running cross-validation
print(f'\nOverall Kfolds Mean Accuracy: {np.mean(k_folds_accuracies) * 100:.2f}%')

def get_lang_accuracies(y_true, y_pred):
    df = pd.DataFrame({'language': y_true, 'pred': y_pred})
    accuracies = []

    for lang in lang_codes:
        lang_group = df[df['language'] == lang]
        if len(lang_group) > 0:
            acc = accuracy_score(lang_group['language'], lang_group['pred'])
        else:
            acc = 0
        accuracies.append(acc)

    return accuracies

df_lang_accuracy = pd.DataFrame({
    'Language': langs,
    'Accuracy of MNB/language': get_lang_accuracies(np.array(all_true), np.array(all_preds))
})

display(df_lang_accuracy)

Cross-Validation Results (Accuracy):


| | Kfold | Accuracy |
|---|---|---|
| 0 | 1 | 0.9825 |
| 1 | 2 | 0.983 |
| 2 | 3 | 0.9825 |
| 3 | 4 | 0.986 |
| 4 | 5 | 0.984 |



Overall Kfolds Mean Accuracy: 98.36%


| | Language | Accuracy of MNB/language |
|---|---|---|
| 0 | Italian | 0.985 |
| 1 | French | 0.993 |
| 2 | Spanish | 0.98 |
| 3 | Portuguese | 0.949 |
| 4 | English | 0.998 |
| 5 | German | 0.982 |
| 6 | Dutch | 0.98 |
| 7 | Indonesian | 0.978 |
| 8 | Finnish | 0.996 |
| 9 | Hausa | 0.995 |
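The cross-validation above relies on `KFold` to partition the data. The sketch below shows the underlying idea in plain Python: shuffle once, cut the indices into `n_splits` chunks, and let each chunk serve as the test set exactly once. The name `kfold_indices` is hypothetical, and the permutation will not match scikit-learn's for the same seed.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=17):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

toy_folds = list(kfold_indices(10, n_splits=5))
for fold_no, (train_idx, test_idx) in enumerate(toy_folds, 1):
    print(fold_no, sorted(test_idx))
```

With 10,000 samples and 5 folds, each fold trains on 8,000 samples and tests on 2,000, so pooling the predictions from all folds (as `all_preds` and `all_true` do) gives every sample exactly one out-of-fold prediction.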


In [131]:
# Reorganizing data for Feed Forward Neural Network input

# 1) convert string labels into integer labels
#      and make func to convert back


def str_labels_to_int_labels(labelArr, string_labels):
    rtn = np.empty(labelArr.shape, dtype=int)
    for i, v in enumerate(string_labels):
        rtn[labelArr == v] = i
    return rtn

def int_labels_to_str_labels(labelArr, string_labels):
    rtn = np.empty(labelArr.shape, dtype='object')
    for i, v in enumerate(string_labels):
        rtn[labelArr == i] = v
    return rtn

# print(y_test[0:5])
# y1 = str_labels_to_int_labels(y_test, all_str_labels)
# print(y1[0:5])
# y2 = int_labels_to_str_labels(y1, all_str_labels)
# print(y2[0:5])

In [132]:
# 2) convert data into multi-column matrix of characters
#      and make func to convert back

def str_vec_to_float_matrix(strVec, longest_str_len):
    # Pad strings to all be equal length
    padded_strVec = np.char.ljust(strVec, longest_str_len, fillchar=' ')

    # turn vector of strings into matrix of characters
    stacked_char_matrix = np.vstack([np.array(list(s)) for s in padded_strVec])

    # turn char matrix into int matrix
    char_matrix_to_int_matrix = np.vectorize(ord)
    int_matrix = char_matrix_to_int_matrix(stacked_char_matrix)

    #normalize and scale so each value is a float between 0 and 1
    matrix_max = np.max(int_matrix)
    matrix_min = np.min(int_matrix)
    min_subtracted_matrix = int_matrix - matrix_min
    normalized_matrix = (min_subtracted_matrix / (matrix_max - matrix_min))
    return normalized_matrix

# Don't need to convert matrices back into rows of text because the neural network isn't designed to generate anything, just classify
# def int_matrix_to_str_vec(intMatrix):
#     int_matrix_to_char_matrix = np.vectorize(chr)
#     char_matrix = int_matrix_to_char_matrix(intMatrix)
#     padded_strVec = np.array(["".join(r) for r in char_matrix])
#     return np.char.rstrip(padded_strVec)

# print(X_test[0])
# X_x = str_vec_to_int_matrix(X_test.astype(str))
# print(X_x[0])
# X_y = int_matrix_to_str_vec(X_x)
# print(X_y[0])
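A worked example on two short strings may make the character encoding easier to follow. This is a plain-Python sketch of the same pad, `ord`, and min-max steps; the name `encode_strings` and the inputs are assumptions for illustration.

```python
def encode_strings(strings):
    """Pad to equal length, map characters to code points, scale to [0, 1]."""
    width = max(len(s) for s in strings)
    padded = [s.ljust(width) for s in strings]           # pad with spaces
    codes = [[ord(ch) for ch in s] for s in padded]      # characters -> ints
    lo = min(min(row) for row in codes)
    hi = max(max(row) for row in codes)
    span = (hi - lo) or 1                                # avoid division by zero
    return [[(c - lo) / span for c in row] for row in codes]

# "a" is padded to "a "; code points are a=97, b=98, space=32
toy_encoded = encode_strings(["ab", "a"])
print(toy_encoded)
```

Two things are worth noting about this representation. The scale depends on the minimum and maximum code points in whichever matrix is passed in, and the notebook converts the training and test sets in separate calls, so the two sets may end up scaled slightly differently. Also, each column corresponds to a character position rather than a word, so the same word appearing at different offsets activates different inputs.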

In [133]:
# 3) use them both

def convert_to_FFNN_format(Xr, Xe, yr, ye):
    all_string_labels = ['ita', 'fra', 'spa', 'por', 'eng', 'deu', 'nld', 'ind', 'fin', 'hau']
    max_str_len_1 = np.max(np.char.str_len(Xr.astype(str)))
    max_str_len_2 = np.max(np.char.str_len(Xe.astype(str)))
    max_str_len = max(max_str_len_1, max_str_len_2)
    Xr_rtn = str_vec_to_float_matrix(Xr.astype(str), max_str_len)
    Xe_rtn = str_vec_to_float_matrix(Xe.astype(str), max_str_len)
    return (Xr_rtn, 
            Xe_rtn,
            str_labels_to_int_labels(yr, all_string_labels), 
            str_labels_to_int_labels(ye, all_string_labels))

(X_tr_nn, X_te_nn, y_tr_nn, y_te_nn) = convert_to_FFNN_format(X_train, X_test, y_train, y_test)

print("Done converting data into FFNN format")

Done converting data into FFNN format


In [134]:
# print(X_tr_nn[0])
# print(X_tr_nn[1])
print(X_tr_nn.shape)
print(X_te_nn.shape)

(7000, 5577)
(3000, 5577)


In [135]:
# Applying properly structured data to a basic FFNN

import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.losses import SparseCategoricalCrossentropy
from keras.metrics import SparseCategoricalAccuracy

print("Done importing Tensorflow Stuff")

Done importing Tensorflow Stuff


In [136]:
# Make FFNN

sample_length = X_tr_nn[0].shape[0]

FFNN_model = Sequential([
    Dense(sample_length, activation='relu'),
    Dense(sample_length, activation='relu'),
    Dense(1024, activation='relu'),
    Dense(1024, activation='relu'),
    Dense(256, activation='relu'),
    Dense(256, activation='relu'),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])

FFNN_model.compile(optimizer=Adam(),
                   loss=SparseCategoricalCrossentropy(), 
                   metrics=[SparseCategoricalAccuracy()])

In [137]:

for i in range(1):
    FFNN_model.fit(X_tr_nn, y_tr_nn)
    test_loss, test_acc = FFNN_model.evaluate(X_te_nn, y_te_nn)
    print(f'\nTest accuracy: {test_acc}')

 39/219 ━━━━━━━━━━━━━━━━━━━━ 1:12 403ms/step - loss: 2.3036 - sparse_categorical_accuracy: 0.1074

KeyboardInterrupt: 
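A quick parameter count helps explain why each training step above is slow. The sketch assumes the input width of 5577 reported by `X_tr_nn.shape` and the layer widths from the `Sequential` definition. A dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The step count assumes Keras's default batch size of 32, since `fit` is called without a `batch_size` argument.

```python
import math

def dense_param_counts(layer_sizes):
    """Weights plus biases for each consecutive pair of layer widths."""
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# Input width followed by the width of every Dense layer in the model
layer_sizes = [5577, 5577, 5577, 1024, 1024, 256, 256, 64, 64, 10]
per_layer = dense_param_counts(layer_sizes)
total_params = sum(per_layer)

# 7000 training samples at an assumed batch size of 32
steps_per_epoch = math.ceil(7000 / 32)

for size, count in zip(layer_sizes[1:], per_layer):
    print(f"Dense({size}): {count:,} parameters")
print(f"Total: {total_params:,} parameters, {steps_per_epoch} steps per epoch")
```

The two 5577-unit layers account for the vast majority of the parameters, so narrowing them would be the first thing to try if training time is a concern.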