## LANGUAGE DETECTION

The most important part of training a language detection model is the data. The more data you have for each language, the more accurate the model is likely to be on real-world text. The dataset I am using comes from Kaggle and, as the counts below show, contains 1,000 text samples for each of 22 languages.

In [3]:
import pandas as pd
df = pd.read_csv("dataset.csv")
df.head()

,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [4]:
df.shape

(22000, 2)

Check for null values:

In [5]:
df.isna().sum()

Text        0
language    0
dtype: int64

In [6]:
df.language.value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64

A BALANCED DATASET
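
The balance can also be checked programmatically rather than by eye. The snippet below is a minimal sketch on a hypothetical toy frame (`toy_df` is not the Kaggle data): if every class has the same count, `value_counts()` contains a single distinct value.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; three languages, three rows each
toy_df = pd.DataFrame({"language": ["English"] * 3 + ["French"] * 3 + ["Dutch"] * 3})

counts = toy_df["language"].value_counts()

# A dataset is perfectly balanced when all class counts are identical
is_balanced = bool(counts.nunique() == 1)
print(is_balanced)
```

The same two lines should work on `df.language` in this notebook.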

In [7]:
len(df.language.unique())

22

LANGUAGE DETECTION MODEL

In [8]:
import numpy as np

In [9]:
x = df.Text
y = df.language

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [36]:
cv = CountVectorizer()
X = cv.fit_transform(x)
X.shape

(22000, 277720)
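
To make the shape above easier to read: each row is one text and each column is one distinct token found in the corpus. The following is a small sketch on two made-up sentences (not taken from the dataset), using the same default `CountVectorizer()` settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical sentences, not from the Kaggle dataset)
toy_texts = [
    "the cat sat on the mat",
    "le chat est sur le tapis",
]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_texts)

# One row per document, one column per distinct token
print(toy_X.shape)

# vocabulary_ maps each token to its column index
the_col = toy_cv.vocabulary_["the"]
le_col = toy_cv.vocabulary_["le"]

# Each cell holds how often that token occurs in that document
print(toy_X[0, the_col], toy_X[1, le_col])
```

With the default settings, text is lowercased and tokens shorter than two characters are dropped, so the real matrix has 277,720 columns because that many distinct tokens occur across the 22,000 texts.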

In [37]:
cv

In [12]:
y.shape

(22000,)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
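
Note that `cv.fit_transform(x)` was called on all 22,000 texts before this split, so the vocabulary also includes tokens from rows that end up in the test set. A common alternative, sketched here on hypothetical toy data, is to split the raw text first and let a scikit-learn `Pipeline` fit the vectorizer on the training portion only. This is an illustration of that option, not what the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df.Text and df.language
texts = [
    "the cat sat on the mat", "the dog chased the ball",
    "she reads a book every night", "we walked to the old bridge",
    "le chat dort sur le tapis", "le chien court dans le jardin",
    "elle lit un livre chaque soir", "nous marchons vers le vieux pont",
]
labels = ["English"] * 4 + ["French"] * 4

# Split the raw text first, so the test rows stay unseen by the vectorizer
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training text only
pipe = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.1, fit_prior=False))
pipe.fit(txt_train, lab_train)

# Every token in the learned vocabulary comes from the training texts
vocab = pipe.named_steps["countvectorizer"].vocabulary_
train_tokens = {tok for text in txt_train for tok in text.split()}
print(set(vocab) <= train_tokens)

# The pipeline accepts raw strings at prediction time
print(pipe.predict(txt_test))
```

A side benefit is that only one object, the pipeline, would need to be saved later instead of a separate model and vectorizer.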

In [14]:
X_train.shape

(14740, 277720)

In [15]:
y_train.shape

(14740,)

In [16]:
X_train

<14740x277720 sparse matrix of type '<class 'numpy.int64'>'
	with 613529 stored elements in Compressed Sparse Row format>

In [17]:
y_train.describe()

count       14740
unique         22
top       Swedish
freq          683
Name: language, dtype: object

Let's try hyperparameter tuning.

Since this is a classification problem, we compare the following classifiers: a random forest, a support vector classifier (SVC), and multinomial naive Bayes.


In [18]:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [19]:

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit



In [20]:
import warnings
warnings.filterwarnings('ignore')

In [21]:

def find_best_model_using_gridsearchcv(X_train, y_train, sample_fraction=0.1, seed=0):
    """Grid-search three classifiers on a random subset of the training data."""
    X_train = X_train.tocsr()  # CSR supports efficient row indexing
    y_train = np.asarray(y_train)  # accept a pandas Series or a NumPy array

    # Randomly select a reproducible subset of the training rows to keep the search fast
    rng = np.random.default_rng(seed)
    n_sampled = int(X_train.shape[0] * sample_fraction)
    random_indices = rng.choice(X_train.shape[0], size=n_sampled, replace=False)
    X_train_sampled = X_train[random_indices, :]
    y_train_sampled = y_train[random_indices]

    algos = {
        'RandomForestClassifier': {
            'model': RandomForestClassifier(),
            'params': {
                'n_estimators': list(range(10, 101, 10)),
                # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
                'max_features': ['sqrt', 'log2']
            }
        },
        'SVC': {
            'model': SVC(),
            'params': {
                'C': [0.1, 1, 10],
                'gamma': [0.1, 1, 10],  # ignored by the linear kernel
                'kernel': ['linear', 'rbf']
            }
        },
        'Multinomial NB': {
            'model': MultinomialNB(),
            'params': {
                'alpha': [0.1, 1, 10],
                'fit_prior': [True, False]
            }
        }
    }

    scores = []
    # Named `splitter` so it does not shadow the CountVectorizer stored in `cv`
    splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=splitter, return_train_score=False)
        gs.fit(X_train_sampled, y_train_sampled)  # Use sampled data for grid search
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    # Collect the best result per model and keep a copy on disk
    result = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
    result.to_csv("Compare.csv", index=False)

    return result



In [51]:
find_best_model_using_gridsearchcv(X_train, y_train)


,model,best_score,best_params
0,RandomForestClassifier,0.875932,"{'max_features': 'sqrt', 'n_estimators': 50}"
1,SVC,0.823729,"{'C': 0.1, 'gamma': 0.1, 'kernel': 'linear'}"
2,Multinomial NB,0.95661,"{'alpha': 0.1, 'fit_prior': False}"
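
One detail about the SVC grid: `gamma` only affects the `rbf` kernel (and other non-linear kernels), so pairing it with `kernel='linear'` refits the same model several times. `GridSearchCV` also accepts a list of dictionaries, which lets each kernel have its own parameters. The sketch below only counts the candidate combinations with scikit-learn's `ParameterGrid`; the names `flat_grid` and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The grid as written in the function: every gamma is tried with both kernels
flat_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}

# Alternative: search gamma only where it has an effect
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 10]},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
print(n_flat, n_split)  # 18 12
```

With five splits, that is 60 fits instead of 90 for the SVC search.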


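Multinomial Naive Bayes achieved the highest cross-validated score on the sampled data (about 0.957), so we use it with the best parameters found: `alpha=0.1` and `fit_prior=False`.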

In [22]:
model = MultinomialNB(alpha=0.1, fit_prior=False)


In [23]:
model.fit(X_train, y_train)

In [24]:
model.score(X_test, y_test)

0.9574380165289256

In [25]:
# Evaluate the model with 10-fold cross-validation on the training set
from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits= 10)
score = cross_val_score(model, X_train, y_train, cv= kf)
score.mean()

0.9589552238805972

In [26]:
y_pred = model.predict(X_test)
y_pred

array(['Japanese', 'Russian', 'Latin', ..., 'Turkish', 'Arabic',
       'English'], dtype='<U10')
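
A single accuracy number hides which languages get confused with each other. A confusion matrix makes that visible. The example below is a sketch on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not results from this model); in the notebook the same calls would take `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true_toy = ["English", "English", "French", "French", "Dutch", "Dutch"]
y_pred_toy = ["English", "Dutch", "French", "French", "Dutch", "Dutch"]
labels_toy = ["Dutch", "English", "French"]

# Rows are true labels, columns are predicted labels, both in labels_toy order
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels_toy)
acc = accuracy_score(y_true_toy, y_pred_toy)

print(cm)
print(round(acc, 3))
```

Here the off-diagonal entry in the English row shows one English text predicted as Dutch.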

In [43]:
user = input("Enter a Text: ")
data = cv.transform([user])  # MultinomialNB accepts sparse input, so .toarray() is not needed
output = model.predict(data)
print("Language:", output[0])

Language: English


In [38]:
X_train.shape

(14740, 277720)

In [39]:
X_test.shape

(7260, 277720)

In [40]:
X.shape

(22000, 277720)

In [42]:
y_train.shape

(14740,)

Save the model and the fitted vectorizer:

In [28]:
import pickle

In [35]:
with open('ld.pickle', 'wb') as f:
    pickle.dump(model, f)

In [34]:
with open('count_vectorizer.pkl', 'wb') as f:
    pickle.dump(cv, f)
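
Both files are needed at prediction time, because new text has to be transformed with the same fitted vocabulary before the model can score it. The round trip below is a minimal sketch: it trains a toy vectorizer and model (hypothetical stand-ins for `cv` and `model`) and writes to a temporary directory so it can run on its own.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the notebook's fitted `cv` and `model`
toy_texts = ["the cat sat on the mat", "le chat dort sur le tapis"]
toy_labels = ["English", "French"]
toy_cv = CountVectorizer()
toy_model = MultinomialNB(alpha=0.1, fit_prior=False)
toy_model.fit(toy_cv.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "ld.pickle")
    vec_path = os.path.join(tmp, "count_vectorizer.pkl")

    # Save both objects, as in the cells above
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(vec_path, "wb") as f:
        pickle.dump(toy_cv, f)

    # Later, for example in another script: load both objects back
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vec_path, "rb") as f:
        loaded_cv = pickle.load(f)

# Transform new text with the loaded vectorizer, then predict
prediction = loaded_model.predict(loaded_cv.transform(["the cat sat"]))
print(prediction[0])
```

Pickle files should only be loaded from a trusted source, and ideally with the same scikit-learn version that wrote them.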