I need to build a classification model that identifies the language a text is written in. I trained two models for this: logistic regression and naive Bayes.

The data is in [dataset.csv](https://drive.google.com/file/d/1US-ZpvKuXaNMN5TDjEh3_kOz-Z_OoSMM/view?usp=sharing).

In [None]:
%%capture

!rm -f dataset.csv
from google_drive_downloader import GoogleDriveDownloader as gdd
gdd.download_file_from_google_drive(file_id="1US-ZpvKuXaNMN5TDjEh3_kOz-Z_OoSMM",
                                    dest_path="./dataset.csv",
                                    )

In [None]:
!head dataset.csv

Text,language
klement gottwaldi surnukeha palsameeriti ning paigutati mausoleumi surnukeha oli aga liiga hilja ja oskamatult palsameeritud ning hakkas ilmutama lagunemise tundemärke  aastal viidi ta surnukeha mausoleumist ära ja kremeeriti zlíni linn kandis aastatel – nime gottwaldov ukrainas harkivi oblastis kandis zmiivi linn aastatel – nime gotvald,Estonian
sebes joseph pereira thomas  på eng the jesuits and the sino-russian treaty of nerchinsk  the diary of thomas pereira bibliotheca instituti historici s i --   rome libris ,Swedish
ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เริ่มตั้งแต่ถนนสนามไชยถึงแม่น้ำเจ้าพระยาที่ถนนตก กรุงเทพมหานคร เป็นถนนรุ่นแรกที่ใช้เทคนิคการสร้างแบบตะวันตก ปัจจุบันผ่านพื้นที่เขตพระนคร เขตป้อมปราบศัตรูพ่าย เขตสัมพันธวงศ์ เขตบางรัก เขตสาทร และเขตบางคอแหลม,Thai
"விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திரிகை-விசாகப்பட்டின ஆசிரியர் சம்பத்துடன் இணைந்து விரிவுபடுத்தினார்  ஆண்டுகள் தொடர்ந்து செயலராக இருந்து தமிழ்மன்றத்தை நடத்திச் சென்றார்  கோவை செம்மொழி ம

There are two columns in the data:

- `Text` contains the content written in a certain language
- `language` is the name of the language of the corresponding text.

There are 22 labels in the dataset, and 1000 samples for each label.
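
One way to check the label counts stated above is `value_counts()` on the `language` column. The sketch below uses a tiny made-up DataFrame (`toy`) and a hypothetical helper `label_counts` so that it runs on its own; on the real data you would pass the DataFrame read from `dataset.csv` instead.

```python
import pandas as pd

def label_counts(df, label_column="language"):
    """Return the number of samples per label as a pandas Series."""
    return df[label_column].value_counts()

# Toy stand-in for dataset.csv (made-up rows, not taken from the real data)
toy = pd.DataFrame({
    "Text": ["hello world", "good morning", "hola mundo", "buenos dias"],
    "language": ["English", "English", "Spanish", "Spanish"],
})

counts = label_counts(toy)
print(counts)
print("Number of labels:", counts.size)
# The dataset is balanced when every label has the same count
print("Balanced:", counts.nunique() == 1)
```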

In [None]:
import pandas as pd
from google.colab import drive

# Mount Google Drive once, outside the loader, so the function only reads the file
drive.mount('/content/drive')

def load_dataset(file_path):
    """Read the CSV at file_path and return (texts, labels) as lists."""
    texts = []
    labels = []

    try:
        # Use the path passed in instead of a hard-coded one
        dataset = pd.read_csv(file_path)
        texts = dataset['Text'].tolist()
        labels = dataset['language'].tolist()

        print("Dataset loaded successfully.")
        print("Number of texts:", len(texts))
        print("Number of unique labels:", len(set(labels)))

    except Exception as e:
        print("Error loading dataset:", e)

    return texts, labels

texts, labels = load_dataset('/content/drive/MyDrive/dataset.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset loaded successfully.
Number of texts: 22000
Number of unique labels: 22


In [None]:
# Load the dataset
texts, labels = load_dataset('/content/drive/MyDrive/dataset.csv')

# I tried printing both lists in full from inside load_dataset(),
# but that generated too much output and the notebook server temporarily stopped sending output

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [None]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)
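
A plain random split does not guarantee that every language is equally represented in the test set. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the label proportions in both parts. A minimal sketch on made-up toy lists, where `toy_texts` and `toy_labels` are hypothetical names:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 10 samples per label (made up for illustration)
toy_texts = [f"sample {i}" for i in range(20)]
toy_labels = ["English"] * 10 + ["Spanish"] * 10

# stratify=toy_labels keeps the 50/50 label ratio in both train and test parts
tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print("Train:", Counter(tr_labels))
print("Test:", Counter(te_labels))
```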

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Feature extraction using character-level CountVectorizer
vectorizer = CountVectorizer(analyzer='char')
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Define a function that trains a model on already-vectorized features
def train_model(X_train, train_labels, model_type='logistic_regression'):

    # Initialize the model
    if model_type == 'logistic_regression':
        model = LogisticRegression(solver='lbfgs', max_iter=10000, C=0.1)
    elif model_type == 'naive_bayes':
        model = MultinomialNB()
    else:
        raise ValueError("Invalid model type. Supported types: 'logistic_regression', 'naive_bayes'")

    # Fit on the training features; evaluation on the test set happens separately
    model.fit(X_train, train_labels)

    return model


# Train Logistic Regression model
logistic_regression = train_model(X_train, train_labels, model_type='logistic_regression')
# Train Naive Bayes model
naive_bayes = train_model(X_train, train_labels, model_type='naive_bayes')
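
An alternative way to organize the same steps is scikit-learn's `make_pipeline`, which keeps the vectorizer and the classifier together so that raw strings can be passed straight to `fit` and `predict`. This is a minimal sketch on made-up toy sentences; `toy_texts`, `toy_labels`, and `pipeline` are illustrative names, not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (made up for illustration)
toy_texts = [
    "the cat sat on the mat",
    "where is the station",
    "el gato está en la casa",
    "dónde está la estación",
]
toy_labels = ["English", "English", "Spanish", "Spanish"]

# Character counts feed directly into the naive Bayes classifier
pipeline = make_pipeline(CountVectorizer(analyzer="char"), MultinomialNB())
pipeline.fit(toy_texts, toy_labels)

# The pipeline vectorizes the raw string before predicting
prediction = pipeline.predict(["the dog is here"])[0]
print(prediction)
```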


In [None]:
from sklearn.metrics import classification_report

y_test = test_labels
# Define a function to evaluate the model and report classification results
def evaluate_model(model, X_test, y_test):
    # Get predicted labels for test samples
    y_pred = model.predict(X_test)

    # Report classification results
    report = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(report)


# Evaluate the Logistic Regression model
evaluate_model(logistic_regression, X_test, y_test)

# Evaluate the Naive Bayes model
evaluate_model(naive_bayes, X_test, y_test)


Classification Report:
              precision    recall  f1-score   support

      Arabic       1.00      1.00      1.00       202
     Chinese       0.99      0.99      0.99       201
       Dutch       0.94      0.93      0.94       230
     English       0.80      0.90      0.85       194
    Estonian       0.95      0.94      0.94       200
      French       0.95      0.97      0.96       188
       Hindi       1.00      0.99      0.99       208
  Indonesian       0.99      0.97      0.98       213
    Japanese       1.00      0.98      0.99       194
      Korean       1.00      0.99      1.00       190
       Latin       0.91      0.92      0.92       210
     Persian       0.98      0.99      0.99       196
   Portugese       0.95      0.92      0.93       194
      Pushto       0.98      0.95      0.96       196
    Romanian       1.00      0.97      0.98       197
     Russian       0.99      1.00      0.99       213
     Spanish       0.92      0.95      0.94       199
    

In [None]:
def classify(text):
    """
    Predict the language of a text with the trained logistic regression model.

    Args:
        text: Input string to classify
    Return:
        language: Language of the text
    """
    # Transform the input text using the fitted vectorizer
    text_vectorized = vectorizer.transform([text])

    # predict() returns an array with one label per input; take the only one
    language = logistic_regression.predict(text_vectorized)[0]
    return language

In [None]:
classify("விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திரிகை-விசாகப்பட்டின ஆசிரியர் சம்பத்துடன் இணைந்து விரிவுபடுத்தினார்")

'Tamil'
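
Both `LogisticRegression` and `MultinomialNB` also expose `predict_proba`, which can be used to report how confident the model is in its prediction. The sketch below is self-contained and trains on made-up toy sentences; `classify_with_confidence`, `toy_vectorizer`, and `toy_model` are hypothetical names, not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (made up for illustration)
toy_texts = [
    "the cat sat on the mat",
    "where is the station",
    "el gato está en la casa",
    "dónde está la estación",
]
toy_labels = ["English", "English", "Spanish", "Spanish"]

toy_vectorizer = CountVectorizer(analyzer="char")
toy_model = LogisticRegression(max_iter=10000, C=0.1)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

def classify_with_confidence(text):
    """Return (predicted label, probability assigned to that label)."""
    probabilities = toy_model.predict_proba(toy_vectorizer.transform([text]))[0]
    best = probabilities.argmax()
    return toy_model.classes_[best], float(probabilities[best])

label, confidence = classify_with_confidence("the dog is here")
print(label, round(confidence, 3))
```

A low probability for the top label can be a hint that the input is too short or is in a language the model was not trained on.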