# CS345 Project

## Team Members
1. Hamad Alyami
2. Benito Encarnacion

## Dataset
Our dataset comes from Kaggle, where it was uploaded by the user Mexwell. It consists of paragraphs scraped from Wikipedia in 2018, covering 235 languages.

The dataset contains 235,000 paragraphs, balanced across the languages, and comes with a predefined train/test split.

The downloaded folder from Kaggle contains:
- labels.csv: A file containing the language name, 2-3 letter code, German name, and language family of all the languages present in the dataset.
- README.txt: A file explaining the folder contents.
- urls.txt: A file containing the urls of where the paragraphs were found.
- x_test.txt: The testing data samples, paragraphs in multiple languages.
- x_train.txt: The training data samples, paragraphs in multiple languages.
- y_test.txt: The labels for the testing dataset, using the 2-3 letter codes found in labels.csv.
- y_train.txt: The labels for the training dataset, using the 2-3 letter codes found in labels.csv.


## Project
Our project is to train two ML models on the Latin-alphabet languages present in the dataset and compare their performance.

## Motivation
We decided to do this project because it allows us to explore practical applications of natural language processing and machine learning by working with real-world multilingual data. Language identification is an important task in many systems and applications like search engines, translation tools, and content moderation. Working with such a dataset gives us the opportunity to apply classification techniques in a meaningful way. By focusing on languages that use the Latin alphabet, we avoid complications from different writing systems while still working with a variety of languages.

## Models
The models we decided to work with in this project are:
- Multinomial Naive Bayes (MNB): Uses the word frequencies observed in each class (languages, in our case) to predict the most likely class for text it has not seen.

- Feedforward Neural Network (FNN): An artificial neural network in which information moves from input to output without looping back. It uses layers of connected nodes (neurons) to learn patterns and make predictions.
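
To make the two descriptions above concrete, here is a minimal sketch on a hand-made toy corpus. The sentences, labels, and all `toy_` names are illustrative assumptions, not part of the dataset. scikit-learn's `MLPClassifier` is used only as a stand-in for a feedforward network; the network used in the project may be built differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

# Hypothetical toy corpus (not taken from the dataset)
toy_texts = [
    "the cat sat on the mat",
    "the dog is in the house",
    "el gato está en la casa",
    "el perro come en la cocina",
]
toy_labels = ["eng", "eng", "spa", "spa"]

# Turn each text into a vector of word counts
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_texts)

# MNB: estimates per-language word frequencies from the counts
toy_mnb = MultinomialNB().fit(toy_counts, toy_labels)

# FNN stand-in: one hidden layer of 8 neurons, information flows forward only
toy_fnn = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                        max_iter=500, random_state=0)
toy_fnn.fit(toy_counts, toy_labels)

# Classify text that neither model has seen
unseen = ["the house is on the hill", "la casa del perro"]
unseen_counts = toy_vectorizer.transform(unseen)
mnb_pred = toy_mnb.predict(unseen_counts)
fnn_pred = toy_fnn.predict(unseen_counts)
print("MNB:", mnb_pred)
print("FNN:", fnn_pred)
```

On this toy corpus the MNB labels the first sentence `eng` and the second `spa`, because every known word in each sentence was seen only in that language during training. Words missing from the training vocabulary, such as "hill" and "del", are simply ignored.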

### Data Preprocessing
We begin by reading the data from the files, then:
1. Remove null values
2. Keep only the texts of our target languages, filtered by their 2-3 letter codes
3. Stack the samples from x_train and x_test into X and the labels from y_train and y_test into y

In [3]:
# Understanding the data set
import pandas as pd
import numpy as np

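def file_to_np_array(path, label):
    """Read a text file with one sample per line into an (n, 1) NumPy array."""
    try:
        # The separator never occurs in the data, so each line stays in a single column
        df = pd.read_csv(path, sep='<NonExistenceSeparator>', header=None, engine='python')
        print(f"{label}: Read!")
    except Exception as e:
        print(f"Error reading the {label} file: {e}")
        return None
    return df.to_numpy()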


def clean_np_data(X, y):
    stacked = np.hstack((y, X)) # Stack y and X side by side
    # print(stacked.shape)
    clean_stacked = stacked[~np.any(pd.isna(stacked), axis=1), :] # Remove empty values
    # print(clean_stacked.shape)
    lang_codes = ['ita', 'fra', 'eng', 'ind', 'spa']
    true_clean = clean_stacked[np.isin(clean_stacked[:,0], lang_codes),:] # Remove all rows that aren't our target languages
    # print(true_clean.shape)
    return true_clean[:,1], true_clean[:,0] # Return cleaned as X and y split again

def clean_filter_and_stack(X_train_file, y_train_file, X_test_file, y_test_file):
    # Clean the train and test files separately, then concatenate them into one X and one y
    X_train_clean, y_train_clean = clean_np_data(file_to_np_array(X_train_file, X_train_file),
                                                 file_to_np_array(y_train_file, y_train_file))
    X_test_clean, y_test_clean = clean_np_data(file_to_np_array(X_test_file, X_test_file),
                                               file_to_np_array(y_test_file, y_test_file))
    return np.hstack((X_train_clean, X_test_clean)), np.hstack((y_train_clean, y_test_clean))

X, y = clean_filter_and_stack("Data/x_train.txt",
                              "Data/y_train.txt",
                              "Data/x_test.txt",
                              "Data/y_test.txt")

print(X.shape, y.shape) 

Data/x_train.txt: Read!
Data/y_train.txt: Read!
Data/x_test.txt: Read!
Data/y_test.txt: Read!
(5000,) (5000,)


#### Data Split
Here we use scikit-learn's train_test_split to shuffle the data randomly and split it 70/30 into training and test sets, with a fixed random_state so the split is reproducible.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
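
As a sanity check on the 70/30 split, the sketch below runs the same call on placeholder data of the same size as the cleaned set. The `demo_` arrays are stand-ins that assume 5 languages with 1,000 samples each; they are not the real paragraphs. The `stratify` variant is an optional alternative that the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 5 languages x 1,000 samples = 5,000 rows
demo_y = np.repeat(['ita', 'fra', 'eng', 'ind', 'spa'], 1000)
demo_X = np.array([f"paragraph {i}" for i in range(demo_y.size)])

# Same call as in the notebook: 30% of the 5,000 samples go to the test set
Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=17)
print(Xa_tr.shape, Xa_te.shape)  # (3500,) (1500,)

# Optional variant (not used in the notebook): stratify keeps each
# language at the same proportion in both the train and test sets
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=17, stratify=demo_y)
langs, test_counts = np.unique(yb_te, return_counts=True)
print(dict(zip(langs, test_counts)))  # 300 test samples per language
```

Without `stratify`, the per-language counts in the test set vary slightly from run to run of the shuffle; with it, each language contributes exactly 30% of its samples.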

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(strip_accents='unicode')
X_tr_vectors = vectorizer.fit_transform(X_train)
X_te_vectors = vectorizer.transform(X_test)
print("Done vectorizing")

Done vectorizing
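
The vectorizing cell above converts each paragraph into a row of word counts. The following sketch shows what `CountVectorizer(strip_accents='unicode')` does on a tiny hand-made example. The sentences and the `toy_` names are illustrative assumptions, not data from the project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences (not taken from the dataset)
toy_train = ["Le café est très bon", "El café está muy bueno"]
toy_test = ["un café excelente"]

toy_vectorizer = CountVectorizer(strip_accents='unicode')

# fit_transform learns the vocabulary from the training text only
toy_train_counts = toy_vectorizer.fit_transform(toy_train)

# transform reuses that vocabulary; unseen words are dropped
toy_test_counts = toy_vectorizer.transform(toy_test)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
# ['bon', 'bueno', 'cafe', 'el', 'est', 'esta', 'le', 'muy', 'tres']
print(toy_train_counts.shape)  # (2, 9)
print(toy_test_counts.sum())   # 1 (only 'cafe' is in the vocabulary)
```

Text is lowercased and accents are removed before counting, so "café" and "cafe" share one column. This keeps the vocabulary smaller, but it also discards accent marks, which may themselves be a useful cue for telling some languages apart.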


In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_tr_vectors, y_train)
print("Done training MNB")

Done training MNB


In [None]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_te_vectors)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)
print(y_pred[0:10])
print(X_test[0:10])

Accuracy:  0.9712258064516129
['nld' 'ind' 'eng' 'cat' 'tsn' 'fin' 'kur' 'orm' 'nob' 'spa' 'rup' 'fra'
 'nob' 'eng' 'cat' 'ind' 'ind' 'tsn' 'rup' 'orm' 'fra' 'aze' 'lug' 'rup'
 'cat' 'ita' 'nob' 'hau']
['Schiedam is gelegen tussen Rotterdam en Vlaardingen, oorspronkelijk aan de Schie en later ook aan de Nieuwe Maas. Per 30 april 2017 had de gemeente 77.833 inwoners (bron: CBS). De stad is vooral bekend om haar jenever, de historische binnenstad met grachten, en de hoogste windmolens ter wereld.'
 'Argentina adalah sebuah negara yang kaya dengan SDA, tingkat melek huruf yang tinggi, sektor pertanian yang maju serta industri yang beragam. Malangnya, sejak akhir 1980-an negara ini telah menimbun hutang luar negeri yang tinggi, inflasi sampai 200% sebulan, dan pengeluaran yang merudum. Dalam mengatasi krisis ekonomi tersebut, pemerintahan telah mengambil langkah-langkah seperti liberalisasi perdagangan, deregulasi, dan swastanisasi. Pada 1991, pemerintahan telah melaksanakan reformasi fina