<a href="https://colab.research.google.com/github/WWeiQueen/Deployment_LanguageModel_FastAPI-Docker-Heroku/blob/main/LanguageDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, download the language-detection dataset from Kaggle:
https://www.kaggle.com/datasets/basilb2s/language-detection?resource=download

In [1]:
!unzip archive.zip

Archive:  archive.zip
  inflating: Language Detection.csv  


In [2]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import warnings
warnings.simplefilter("ignore")

In [3]:
# Load the dataset
data = pd.read_csv("Language Detection.csv")
data.head(5)

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English


In [4]:
# Separate the input feature (X) from the target feature (y)
X = data['Text']
y = data['Language']

# Preprocessing

In [5]:
# Encode each language in the target (y) column as an integer class label
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

# These are the unique languages in the target "Language" column
le.classes_

array(['Arabic', 'Danish', 'Dutch', 'English', 'French', 'German',
       'Greek', 'Hindi', 'Italian', 'Kannada', 'Malayalam', 'Portugeese',
       'Russian', 'Spanish', 'Sweedish', 'Tamil', 'Turkish'], dtype=object)
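To make the encoding step concrete, here is a minimal sketch on a few toy labels (the list `labels` and the name `le_demo` are illustrative, not part of the notebook). It shows that `LabelEncoder` sorts the class names and that `inverse_transform` maps the integer codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the "Language" column (illustrative only)
labels = ["English", "French", "Spanish", "French", "English"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)    # classes are sorted alphabetically

print(le_demo.classes_)                  # the learned class names
print(codes)                             # one integer code per label
print(le_demo.inverse_transform(codes))  # codes mapped back to names
```

The same `inverse_transform` call can be applied to the integer predictions produced later in the notebook.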

In [6]:
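# Apply regular expressions to remove unwanted characters
# Note: the cleaned text is stored in data_list; the cells below split X (the raw text), not data_list
data_list = []
for text in X:
  # Replace punctuation, symbols, newlines and digits with spaces
  text = re.sub(r'[!@#$(),\n"%^&*?\:;~0-9]', ' ', text)
  # Replace square brackets (the original pattern r'[[]]' only matched the literal pair "[]")
  text = re.sub(r'[\[\]]', ' ', text)
  text = text.lower()
  data_list.append(text)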
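The cleaning loop can also be written as a small reusable function. The sketch below is one possible form, assuming the same character set as the cell above; the names `clean_text` and `_UNWANTED` are hypothetical and do not appear in the notebook.

```python
import re

# Same characters as in the cell above, with the square brackets folded
# into a single character class (assumed to be the intended behaviour)
_UNWANTED = re.compile(r'[!@#$(),\n"%^&*?:;~0-9\[\]]')

def clean_text(text: str) -> str:
    """Replace unwanted characters with spaces and lowercase the result."""
    return _UNWANTED.sub(" ", text).lower()

sample = "[1] Hello, World! 2023"
cleaned = clean_text(sample)
print(cleaned.split())  # the tokens that survive cleaning
```

Each removed character becomes a single space, so the cleaned string keeps the length of the input.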

# Split train, test data

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
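The split above is random, so the reported scores change from run to run. A minimal sketch of a reproducible variant follows, assuming a fixed `random_state` and stratification by label are acceptable; the toy `texts` and `labels` are illustrative, and neither option is used in the notebook itself.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 10 examples per class (illustrative only)
texts = [f"sentence {i}" for i in range(20)]
labels = ["en"] * 10 + ["fr"] * 10

# random_state fixes the shuffle; stratify keeps class proportions equal
# in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))
print(Counter(y_te))
```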

# Build the model
1. CountVectorizer

   `CountVectorizer` in scikit-learn converts a collection of text documents into a matrix of token counts. Each row represents a document, each column represents a unique token in the vocabulary learned during fitting, and each value is the number of times that token occurs in that document. Training it involves three steps: create the vectorizer, fit it on the training text to build the vocabulary, and transform the text into count matrices.

2. Naive Bayes model

In [8]:
# I. creating bag of words using countvectorizer

from sklearn.feature_extraction.text import CountVectorizer

# Step 1: Create an instance of CountVectorizer
cv = CountVectorizer()

# X_train is a collection of raw text documents
# Step 2: Fit CountVectorizer on the training data
cv.fit(X_train)

# additional Step : Tokenization and vocabulary building
# The vocabulary will be a set of unique words found in the training data
# vocabulary = cv.get_feature_names_out()

# Step 3: Counting word occurrences
# Transform the training data into a matrix of token counts
X_train = cv.transform(X_train).toarray()

# cv.transform returns a sparse matrix; .toarray() above converts it to a dense array of per-document token counts
print("Matrix representation of text data:")
print(X_train)

# Transform the test data using the trained CountVectorizer
X_test = cv.transform(X_test).toarray()

# The transformed matrix represents the frequency of each word in each new document
print("Matrix representation of new data:")
print(X_test)

# You can now use the transformed matrix in your machine learning model

Matrix representation of text data:
[[0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Matrix representation of new data:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [9]:
# II. naive_bayes model
from sklearn.naive_bayes import MultinomialNB
# X_train = [str(doc) for doc in X_train]
# X_test = [str(doc) for doc in X_test]


model = MultinomialNB()
model.fit(X_train, y_train)
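As a rough illustration of what `MultinomialNB` does with count features, the sketch below fits it on a hand-made count matrix with two hypothetical token columns and inspects the class probabilities. The data and the names `X_counts`, `y_codes` and `nb` are assumptions for this example only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents; columns are counts of two hypothetical tokens,
# e.g. ["le", "the"]
X_counts = np.array([[3, 0], [2, 0], [0, 4], [0, 1]])
y_codes = np.array([0, 0, 1, 1])  # toy labels: 0 = French, 1 = English

nb = MultinomialNB()  # default alpha=1.0 (Laplace smoothing)
nb.fit(X_counts, y_codes)

new_doc = np.array([[1, 0]])  # contains the first token once
probs = nb.predict_proba(new_doc)
print(nb.predict(new_doc))
print(probs)
```

With the default smoothing, a token that never appeared in a class still gets a small non-zero probability, so no class is ruled out entirely.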

In [10]:
# Predict
y_pred = model.predict(X_test)
print(y_pred)

[14 13 11 ...  3  5 14]


# Evaluation

In [11]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)

print("Accuracy: ", ac)

Accuracy:  0.9811411992263056


In [None]:
print(cr)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        91
           1       1.00      0.90      0.95        93
           2       0.99      0.96      0.98       103
           3       0.87      1.00      0.93       254
           4       0.97      0.99      0.98       220
           5       1.00      0.99      1.00       110
           6       1.00      1.00      1.00        76
           7       1.00      1.00      1.00        16
           8       0.98      0.99      0.98       143
           9       1.00      0.94      0.97        84
          10       1.00      0.99      1.00       118
          11       0.99      0.98      0.98       156
          12       1.00      0.96      0.98       135
          13       0.98      0.97      0.98       179
          14       0.98      0.99      0.99       124
          15       1.00      0.98      0.99        81
          16       1.00      0.86      0.92        85

    accuracy              
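The report above lists classes by their integer codes. One way to make it easier to read is the `target_names` parameter of `classification_report`, sketched here on toy values (`class_names`, `y_true` and `y_hat` are made up for the example).

```python
from sklearn.metrics import classification_report

class_names = ["English", "French", "Spanish"]  # toy class names
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

# target_names replaces the integer codes with readable labels
report = classification_report(y_true, y_hat, target_names=class_names)
print(report)
```

In this notebook the equivalent call would presumably pass `target_names=le.classes_`.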

# Use Pipeline to combine the vectorizer and the classifier

In [12]:
from sklearn.pipeline import Pipeline

# # Convert X_train to a list of strings
# X_train = [str(doc) for doc in X_train]

# Step 1: Create the pipeline with correct step definitions
pipe = Pipeline([
    ('vectorizer', cv),
    ('multinomialNB', model)
])

# Re-split the raw text: X_train and X_test were overwritten with count arrays above, but the pipeline expects raw strings
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Step 2: Fit the pipeline with training data
pipe.fit(X_train, y_train)


In [13]:
# Use the pipeline to predict X_test; the accuracy should be close to the earlier value, but not identical, because the train/test split was redrawn at random

y_pred2 = pipe.predict(X_test)
ac2 = accuracy_score(y_test, y_pred2)

print("Accuracy: ", ac2)

Accuracy:  0.9709864603481625
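The two accuracy values differ because the data was split twice without a fixed seed. The sketch below, on toy sentences with a fixed `random_state` (an assumption, not what the notebook does), suggests that the manual vectorize-then-classify steps and a `Pipeline` give the same predictions when they see the same split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus: 0 = English, 1 = French (illustrative only)
texts = ["the cat sat", "the dog ran", "a good day", "the red house",
         "le chat dort", "le chien court", "une bonne journee", "la maison rouge"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Manual route: vectorize, then classify
vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(X_tr), y_tr)
manual_pred = clf.predict(vec.transform(X_te))

# Pipeline route: the same two steps behind one fit/predict interface
pipe_demo = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("multinomialNB", MultinomialNB()),
])
pipe_demo.fit(X_tr, y_tr)
pipe_pred = pipe_demo.predict(X_te)

print((manual_pred == pipe_pred).all())
```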


# Save & Download the model

In [14]:
# scikit-learn models can be saved with pickle; TensorFlow and PyTorch use their own formats (SavedModel or HDF5 for TensorFlow, a state_dict for PyTorch)

with open('trained_pipeline-0.1.0.pkl', 'wb') as f:
  pickle.dump(pipe, f)
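A saved pipeline is only useful if it can be loaded again. The following sketch does a pickle round trip on a tiny stand-in pipeline and checks that the restored object predicts the same class; the training sentences, the temporary file name `demo_pipeline.pkl` and the names `demo_pipe` and `restored` are all illustrative.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in for the trained pipeline (0 = English, 1 = Spanish)
demo_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("multinomialNB", MultinomialNB()),
])
demo_pipe.fit(["hello world", "good morning", "hola mundo", "buenos dias"],
              [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(demo_pipe, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

original_pred = demo_pipe.predict(["hola"])
restored_pred = restored.predict(["hola"])
print(original_pred, restored_pred)
```

Pickle files should be loaded only from trusted sources, and ideally with the same scikit-learn version that wrote them.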

In [17]:
# Zip the saved model file before downloading it. Zipping matters most for directory-based formats such as TensorFlow's SavedModel; for this pickle it mainly reduces the file size
!zip -r ./trained_pipeline-0.1.0.pkl.zip ./trained_pipeline-0.1.0.pkl

  adding: trained_pipeline-0.1.0.pkl (deflated 95%)


# Testing

In [16]:
text = 'hola soy una bruja viviendo en mi casa'
# Use a separate name so the encoded label array y is not overwritten
pred = pipe.predict([text])
le.classes_[pred[0]], pred

('Spanish', array([13]))
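The pickled pipeline returns integer codes, so whatever serves it also needs the class names from the label encoder. One possible helper is sketched below; `predict_language`, the toy pipeline and the two-class `class_names` array are hypothetical and only illustrate the lookup.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def predict_language(pipeline, class_names, text):
    """Return the class name for a single text (hypothetical helper)."""
    code = pipeline.predict([text])[0]
    return class_names[code]

# Toy pipeline and class names standing in for `pipe` and `le.classes_`
class_names = np.array(["English", "Spanish"])
toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("multinomialNB", MultinomialNB()),
])
toy_pipe.fit(["hello world", "good morning friend",
              "hola mundo", "buenos dias amigo"], [0, 0, 1, 1])

result = predict_language(toy_pipe, class_names, "hola amigo")
print(result)
```

With the notebook's objects, the call would presumably be `predict_language(pipe, le.classes_, text)`.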