# Language Prediction Using Multinominal Naive Bayes

The following project is focused on building a machine learning model that can accurately detect languages including: `English`, `French`, `Spanish` and `German`. The dataset used for this project was sourced from [Language Detection](https://www.kaggle.com/datasets/zarajamshaid/language-identification-datasst).



### Importation of Necessary Libaries

In [401]:
# imports pandas library for working with Dataframe
import pandas as pd

# import numpy library for working with arrays
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer

# imports seaborn library
import seaborn as sns

# imports the label encoder library
from sklearn.preprocessing import LabelEncoder

# imports train-test-split and Grid Search library for splitting the dataset(s)
from sklearn.model_selection import train_test_split, GridSearchCV

# imports naive bias algorithm from sklearn
from sklearn.naive_bayes import MultinomialNB

# imports all necessary evaluation metrics from sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# imports joblib for model saving
import joblib


### Reads the CSV File(s) into a Dataframe



In [403]:
df = pd.read_csv("Language Detection.csv", encoding='ISO-8859-1')  # Load the dataset into the DataFrame 'df' using ISO-8859-1 encoding to handle special character

### Exploratory Data Analysis (EDA)

In [405]:
df.head() # displays first 5 rows of the dataframe


Unnamed: 0,Text,Language,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19
0,"Nature, in the broadest sense, is the natural...",English,,,,,,,,,,,,,,,,,,
1,"""Nature"" can refer to the phenomena of the phy...",English,,,,,,,,,,,,,,,,,,
2,"The study of nature is a large, if not the onl...",English,,,,,,,,,,,,,,,,,,
3,"Although humans are part of nature, human acti...",English,,,,,,,,,,,,,,,,,,
4,[1] The word nature is borrowed from the Old F...,English,,,,,,,,,,,,,,,,,,


In [406]:
# displays out the shape of the data
print(df.shape)

(13339, 20)


#### Describing the Data Using Statistical Terms (i.e., Summary Statistics)

- **Count**: The total number of non-null observations in each feature column.
- **Unique**: The number of distinct values present in each column, indicating the diversity of the data.
- **Top**: The most frequently occurring value in the column, helping to identify the most common entry.
- **Freq**: The frequency of the top value, showing how many times the most common value appears in the col data.


In [408]:
df.describe() # describes the data

Unnamed: 0,Text,Language,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19
count,13338,13337,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
unique,13251,18,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
top,lÃ©volution du nombre dhabitants est connue Ã ...,English,à¤à¤¬ à¤à¤¨à¤¸à¥ à¤ªà¥à¤à¤¾ à¤à¤¯à¤¾ à¤...,à¤¤à¥ à¤µà¤¹ à¤à¤à¤²à¥ à¤¬à¤¾à¤° à¤ªà¥à¤...,à¤®à¥à¤°à¥ à¤ªà¤¾à¤¸ à¤¶à¤¨à¤¿à¤µà¤¾à¤° à¤...,à¤«à¤¿à¤²à¥à¤®à¥à¤ à¤à¥ à¤²à¤¿à¤ à¤à¤...,à¤à¤¬ à¤à¥à¤ à¤¸à¤à¤à¥à¤·à¤¿à¤ªà¥à¤¤ ...,à¤­à¤²à¥ à¤¹à¥ à¤à¤¬ à¤à¤ª à¤à¤¿à¤¸à¥ à...,à¤à¤¸à¤²à¤¿à¤ à¤¯à¤¦à¤¿ à¤à¤ªà¤à¥ à¤ªà¤¾...,à¤¤à¥ à¤à¤ª à¤¨à¤¿à¤¶à¥à¤à¤¿à¤¤ à¤°à¥à¤ª...,à¤à¥ à¤à¤¿ à¤à¤ªà¤¤à¥à¤¤à¤¿à¤à¤¨à¤ à¤¹à¥,à¤¯à¤¹ à¤à¤ à¤à¤ªà¤®à¤¾à¤¨ à¤¹à¥ à¤à¤¿à¤...,à¤²à¥à¤à¤¿à¤¨ 180 à¤¡à¤¿à¤à¥à¤°à¥ à¤¸à¥...,à¤²à¥à¤à¤¿à¤¨ à¤®à¥à¤à¥ à¤¨à¤¹à¥à¤ à¤ª...,à¤§à¥à¤®à¥ à¤§à¥à¤®à¥ à¤à¤²à¤¨à¥ à¤à¥...,à¤¸à¥à¤¸à¥à¤¤ à¤¹à¥ à¤¸à¤à¤¤à¥ à¤¹à¥à¤...,à¤à¤¸à¤²à¤¿à¤ à¤µà¥ 10 à¤¸à¥à¤®à¤¾à¤°à¥à...,à¤à¤¸ à¤¸à¥à¤ªà¥à¤à¤° à¤ªà¤° à¤à¥à¤²à¤¿...,à¤à¤ª à¤à¤¸à¤à¤¾ à¤ªà¥à¤°à¥ à¤¤à¤°à¤¹ à¤...,Hindi
freq,7,2385,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


#### Summarizing the Dataframe's Structure

- Provides a concise summary of the DataFrame.
- Displays the total number of rows (entries) which could help in detecting missing values.
- Shows the number of non-null values in each column.
- Lists the data types of each column (e.g., int64, float64, object for categorical data).
- Gives the memory usage of the DataFrame.



In [410]:
df.info() # displays the summary of the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13339 entries, 0 to 13338
Data columns (total 20 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Text         13338 non-null  object
 1   Language     13337 non-null  object
 2   Unnamed: 2   1 non-null      object
 3   Unnamed: 3   1 non-null      object
 4   Unnamed: 4   1 non-null      object
 5   Unnamed: 5   1 non-null      object
 6   Unnamed: 6   1 non-null      object
 7   Unnamed: 7   1 non-null      object
 8   Unnamed: 8   1 non-null      object
 9   Unnamed: 9   1 non-null      object
 10  Unnamed: 10  1 non-null      object
 11  Unnamed: 11  1 non-null      object
 12  Unnamed: 12  1 non-null      object
 13  Unnamed: 13  1 non-null      object
 14  Unnamed: 14  1 non-null      object
 15  Unnamed: 15  1 non-null      object
 16  Unnamed: 16  1 non-null      object
 17  Unnamed: 17  1 non-null      object
 18  Unnamed: 18  1 non-null      object
 19  Unnamed: 19  1 non-null  

### Checking for Missing Values

In [412]:
missing_col_lst = df.columns[df.isnull().any()].tolist() # checks all columns in the dataframe for missing values and stores the column name in a list.

print(f"""
List of Columns in the Dataframe with Missing Values: 

{missing_col_lst}

Number of Columns in the Dataframe with Missing Values:

{len(missing_col_lst)}

""") # prints the list and number of all columns that have missing values.




List of Columns in the Dataframe with Missing Values: 

['Text', 'Language', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19']

Number of Columns in the Dataframe with Missing Values:

20




##### Observation

The analysis revealed that all of the 20 columns contained missing values. Consequently, all columns with missing values will be handled dataset.

### Handling Missing Values

Missing values in categorical columns are usually filled with the mode, while numerical columns are often imputed with the mean or median. However, filling a missing `Language` value with the most common entry ("English") may misrepresent the data and negatively impact model performance. Therefore, it's best to remove rows with missing values in the `Language` column, especially since only one row is affected.

In [415]:
df = df[["Text", "Language"]].dropna(subset=["Language", "Text"]) # keeps only the "Text" and "Language" columns while removing all rows where "Language" and "Text" column has an empty value

In [416]:
df # displays the modified dataframe

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English
...,...,...
13334,notices dans des bases relatives au sport ass...,French
13335,el investigador ha recibido varios reconocimie...,Spanish
13336,le village est une station familiale de sports...,French
13337,hors du terrain les annÃ©es et sont des annÃ...,French


### Data Pre-processing

This section focuses on retaining only the rows that contain the languages of interest for detection. The Languages are:
- English
- French
- Spanish AND
- Deutsch / German

The Dataframe will be filtered accordingly to achieve this.

In [418]:
all_languages = df["Language"].unique() # gets all unique values in the "Language" column of the dataframe
print(all_languages) # prints all the languages

['English' 'Malayalam' 'Hindi'
 " à¤\x85à¤¬ à¤\x86à¤ª à¤¸à¥\x8bà¤\x9a à¤¸à¤\x95à¤¤à¥\x87 à¤¹à¥\x88à¤\x82 à¤\x95à¤¿ à¤¶à¤¬à¥\x8dà¤¦ à¤«à¤¿à¤\x95à¥\x8dà¤¸ à¤\x95à¤¾ à¤\x95à¥\x81à¤\x9b à¤\x95à¤°à¤¨à¤¾ à¤¹à¥\x88 à¤\x95à¥\x81à¤\x9b à¤®à¤°à¤®à¥\x8dà¤®à¤¤ à¤\x95à¥\x87 à¤¸à¤¾à¤¥à¥¤ à¤²à¥\x87à¤\x95à¤¿à¤¨ à¤\x8fà¤\x95 à¤«à¤¿à¤\x95à¥\x8dà¤¸ à¤®à¥\x87à¤\x82 à¤\x95à¥\x8bà¤\x88 à¤\xadà¥\x80 à¤ªà¤°à¥\x87à¤¶à¤¾à¤¨à¥\x80 à¤\x95à¤¾ à¤®à¤¤à¤²à¤¬ à¤¨à¤¹à¥\x80à¤\x82 à¤¹à¥\x88 à¤\x8fà¤\x95 à¤ªà¤°à¥\x87à¤¶à¤¾à¤¨à¥\x80 à¤\x95à¥\x80 à¤¸à¥\x8dà¤¥à¤¿à¤¤à¤¿ à¤®à¥\x87à¤\x82 à¤¦à¥\x82à¤¸à¤°à¥\x87 à¤¶à¤¬à¥\x8dà¤¦à¥\x8bà¤\x82 à¤\x95à¤¾ à¤\x89à¤ªà¤¯à¥\x8bà¤\x97 à¤\x95à¤°à¥\x87à¤\x82 à¤\x9cà¤¿à¤¨à¥\x8dà¤¹à¥\x87à¤\x82 à¤\x86à¤ª à¤\x8fà¤\x95 à¤«à¤¿à¤\x95à¥\x8dà¤¸ à¤®à¥\x87à¤\x82 à¤\x89à¤ªà¤¯à¥\x8bà¤\x97 à¤\x95à¤° à¤¸à¤\x95à¤¤à¥\x87 à¤¹à¥\x88à¤\x82 à¤®à¥\x88à¤\x82 à¤\x8fà¤\x95 à¤¸à¥\x82à¤ª à¤®à¥\x87à¤\x82 à¤¹à¥\x82à¤\x82 à¤¯à¤¾ à¤®à¥\x88à¤\x82 ' à¤\x8fà¤\x95 à¤\x97à¤¡à¤¼à¤¬à¤¡à¤¼ à¤®à¥\x87à¤\x82 à¤¹à¥\x82à¤\x81 à¤¯à¤¾ à¤®

In [419]:
# Filter the DataFrame
languages_to_keep = ['English', 'German', 'French', 'Spanish'] # specifies the languages that will be kept
df = df[df['Language'].isin(languages_to_keep)] # performs the filtering operation


df.reset_index(drop=True, inplace=True) # resets the index after filtering to ensure a clean, sequential index.




In [420]:
df # displays the cleaned DataFrame

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English
...,...,...
6683,notices dans des bases relatives au sport ass...,French
6684,el investigador ha recibido varios reconocimie...,Spanish
6685,le village est une station familiale de sports...,French
6686,hors du terrain les annÃ©es et sont des annÃ...,French


### Checking for Outliers in Categorical Data

In the context of **categorical data**, traditional outlier detection methods are not applicable, as these techniques are designed for **numerical** values that possess a defined range. Instead in this scenario, outliers in categorical data would be considered as invalid characters. Addressing these invalid entries is crucial for ensuring data integrity, enhancing model performance, and maintaining consistency across datasets. This process not only improves readability and comprehension for users by removing distractions and errors but also contributes to more accurate analyses and predictions.


In [422]:

df['Language'] = df['Language'].str.replace(r'[^a-zA-Zа-яА-ЯёЁ一-龯]', '', regex=True) # uses regex to removes invalid characters, keeping only valid letters from any language


df.reset_index(drop=True, inplace=True) # resets the index after cleaning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Language'] = df['Language'].str.replace(r'[^a-zA-Zа-яА-ЯёЁ一-龯]', '', regex=True) # uses regex to removes invalid characters, keeping only valid letters from any language


In [423]:
df # displays the modified dataframe

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English
...,...,...
6683,notices dans des bases relatives au sport ass...,French
6684,el investigador ha recibido varios reconocimie...,Spanish
6685,le village est une station familiale de sports...,French
6686,hors du terrain les annÃ©es et sont des annÃ...,French


### Language Distribution Check

The Language Distribution Check assesses the number of entries for each language  be detected to ensure they are approximately equal. This is important because imbalanced data can skew the results of language detection model, leading to biased outcomes. By verifying the distribution, we can enhance the model's performance and ensure it is trained on a representative sample of each language, ultimately improving accuracy and reliability in language identification.


In [425]:
df['Language'].value_counts() # displays the number of entries for each of the four languages

Language
English    2385
French     2014
Spanish    1819
German      470
Name: count, dtype: int64

##### Observation:

The analysis of the language distribution check revealed that the four languages did not have an equal number of entries. To address this imbalance, online sources will be sourcesd to gather additional data and ensure a more uniform representation of each language.


### Saving the Cleaned Dataframe 
After the cleaning process, the dataframe is saved to a CSV file for proper assessment before further structuring.

In [428]:
df.to_csv("Cleaned_Data.csv", index = False) # saves the dataframe to a csv whithout adding an index

### Data Encoding
For the "Language" column, **Label Encoding** will be used to convert categorical language values into numeric labels, enabling the machine learning model to interpret them. For the "Text" column, **TF-IDF Vectorization** with character n-grams (specifically three-grams to five-grams) will be applied to transform the text into a numerical format. This approach captures important character sequences, highlighting their significance within the dataset and providing a more nuanced representation of the text compared to traditional encoding techniques.


#### Label Encoding for the `Language` Column

In [431]:
encoder = LabelEncoder() # initializes the label encoder

df['Language'] = encoder.fit_transform(df['Language']) # uses the Label encoding object to encode the "Language" column in the dataframe

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Language'] = encoder.fit_transform(df['Language']) # uses the Label encoding object to encode the "Language" column in the dataframe


In [432]:

mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_))) # computes the numeric labels for each mapped language using a dictionary
print("Mapping of Language to Label:", mapping) # displays the dictionary

Mapping of Language to Label: {'English': 0, 'French': 1, 'German': 2, 'Spanish': 3}


#### Vectorization for the `Text` Column

In [434]:
texts = df['Text'] # gets the Text column of the dataframe that contains the sentences to be vectorized 

# the TF-IDF vectorizer with character-level n-grams
tfidf_char_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 5))  # initializes the vectorizer with character three grams to 5-grams
X = tfidf_char_vectorizer.fit_transform(texts)  # vectorizes the text which will serve as the independent variable

In [435]:
print(X) # prints the sparse matrix of TF-IDF scores for the text data.

  (0, 96690)	0.0953924399709766
  (0, 249878)	0.06373029569451795
  (0, 133296)	0.047716214931696824
  (0, 167733)	0.05074564129315794
  (0, 243813)	0.0512582772015565
  (0, 19595)	0.0526483475349929
  (0, 197333)	0.07666670557773968
  (0, 183729)	0.06388440328461689
  (0, 14955)	0.08520640239933906
  (0, 73487)	0.07534813165587641
  (0, 143092)	0.10133587207109072
  (0, 205117)	0.0932711372352149
  (0, 184405)	0.06229205313240643
  (0, 254029)	0.058513297361448446
  (0, 20673)	0.05891111228172097
  (0, 140665)	0.08185923502071019
  (0, 47522)	0.08048883494209211
  (0, 120923)	0.08261013767785381
  (0, 203568)	0.06229205313240643
  (0, 96084)	0.06883702829104263
  (0, 230268)	0.0683715681049111
  (0, 55366)	0.07149280696813971
  (0, 153416)	0.06957211183908206
  (0, 12996)	0.06749304128382207
  (0, 26909)	0.09798870469246183
  :	:
  (6687, 16372)	0.0093630388452366
  (6687, 160447)	0.016491498398866693
  (6687, 49493)	0.011361076810703055
  (6687, 5175)	0.011096081763998263
  (6687, 13

##### Observation
The output represents a sparse matrix of TF-IDF scores for  the each sentence in the dataset. Each entry follows the format `(row_index, column_index) value`. Here, `0` indicates the row index (the first sentence), `54921`indicates the position of a specific character n-gram in the vectorizer's vocabulary. Each unique n-gram detected across all sentences gets a unique index. So, 54921 corresponds to a particular n-gram (like a trigram or five-gram) that was identified during the vectorization process. The value (e.g., `0.0953924399709766` represents the TF-IDF score for that n-gram. This score reflects the importance of that particular character sequence within the context of the first sentence, with higher values indicating greater significance. This structure allows for efficient storage and processing, especially in large datasets with many zero entries.


### Feature Relationships
In this context, using a correlation matrix to analyze feature relationships is not particularly important. The high dimensionality of TF-IDF features, derived from n-grams, results in a sparse matrix where many features have zero values for most documents. This makes correlation analysis less interpretable. Additionally, the nature of these features—representing textual information rather than numerical values—renders traditional correlation metrics less meaningful. Therefore, while correlation matrices are useful for many datasets, they may not provide valuable insights in the case of TF-IDF representations.


### Feature and Target Definition
In this analysis, the **features** (or predictors) represent the independent variables used to predict the **target** variable. The **features** consist of the `Text` column, which contains sentences in various languages. This feature captures the textual characteristics that will be analyzed.

The **target variable** is the dependent variable, which is the `Language` column, representing the language corresponding to each sentence. The goal of the analysis is to understand how the textual features influence the target, specifically how each feature contributes to predicting the language of the text `Y`.

In [439]:
X = X # the text that was vectorized earlier
Y = df['Language'] # the label encoded column

In [440]:
print(X.shape, Y.shape) # displays the shape of the independent and dependent variable

(6688, 284850) (6688,)


### Train-Test Split
To split the dataset into training and testing sets, the **train_test_split** function from sklearn will be used, applying an 80-20 ratio. The following steps are taken:

1. **X_train and Y_train**: The feature columns from the dataset are extracted to create the training feature set (X_train), while the corresponding target variable (the `Language` column) is extracted to form the training target set (Y_train). This enables the model to learn from the features and their associated labels.

2. **X_test and Y_test**: The remaining data is used to create the testing feature set (X_test) and the testing target set (Y_test). This allows the model to evaluate its performance on unseen data, providing insights into its predictive capability.

The following line of code splits the dataset into training and testing sets:

```X_train, X_test, texts_train, texts_test, Y_train, Y_test = train_test_split( X, texts, Y, test_size=0.2, random_state=42, stratify=Y)```.

### Reason for Splitting Both Vectorized and Original Text
- By splitting both the vectorized features (`X`) and the original text (`texts`), we can easily visualize the exact text that the model used to predict a language after testing.
- The original text provides a human-readable format, while the vectorized and encoded forms would be difficult to interpret.
- This allows for a clearer understanding of how the model performs in terms of predicting languages based on actual text input, making the evaluation more meaningful.

### Parameters
- **`X`**: Represents the feature matrix (vectorized representations of the text).
- **`texts`**: Contains the original text data for visualization.
- **`Y`**: Represents the target labels (encoded languages).
- **`test_size=0.2`**: Allocates 20% of the dataset for testing, ensuring that the model is evaluated on unseen data.
- **`random_state=42`**: Ensures the split is reproducible by setting a seed for random number generation.
- **`stratify=Y`**: Maintains the same proportion of target classes in both the training and testing sets, helping to avoid class imbalance issues.

This approach enhances the ability to assess the model's predictions in a way that is understandable to humans, as it links the original text directly to the model's output.


In [442]:
# X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42) # splits data into 80% training and 20% testing sets for model training and evaluation.

In [443]:
X_train, X_test, texts_train, texts_test, Y_train, Y_test = train_test_split( X, texts, Y, test_size=0.2, random_state=42, stratify=Y)

### Feature Scaling
In this analysis, there is no need for feature scaling because the model is based on textual data represented by TF-IDF vectors. Since TF-IDF inherently normalizes the values to reflect the importance of terms, all features are already on a comparable scale, eliminating the necessity for additional scaling.



### Classification Analysis

The next step involves performing **Classification Analysis** using **Multinomial Naive Bayes** (MNB). This technique is commonly used for text classification tasks where the features represent word counts or term frequencies (e.g., TF-IDF). MNB assumes that the features follow a multinomial distribution, making it ideal for language detection where the data is represented by discrete counts, such as text in the `Text` column of the dataframe `df`.

The probability of a class \( C \) given a document \( d \) is calculated using **Bayes' Theorem**:

   - **Formula**:
     $$
     P(C|d) = \frac{P(C) \prod_{i=1}^{n} P(w_i|C)}{P(d)}
     $$
   - **Where**:
     - $P(C|d)$ = posterior probability of class \( C \) given document \( d \)
     - $( P(C)$ = prior probability of class \( C \)
     - $P(w_i|C)$ = likelihood of word \( w_i \) given class \( C \)
     - $P(d)$ = probability of the document (acts as a normalizing constant)
     - $n$ = total number of words or features in the document


MNB works by learning the conditional probabilities of each feature (word or character n-gram) given each class. The model assigns the class that maximizes the posterior probability, given the input text. This makes it effective for **language detection** and other text classification tasks where text features play a critical role.



In [446]:
model = MultinomialNB(alpha= 0.1, fit_prior = True) # initializes the multimodal Naive Bayes Model
model.fit(X_train, Y_train) # performs the training process

#### Detections 

In [448]:
Y_detect = model.predict(X_test) # make detections / predictions on the test data (i.e unseen data)

#### Results
A DataFrame will be created to display the detected languages alongside their corresponding text.

In [450]:
# uses the mapping dictionary earlier defined under the Label encoding section to get the all keys of the dictionary (i.e the languages)
languages = list(mapping.keys())  # ['English', 'French', 'German', 'Spanish'] 

In [451]:
label_encoder = LabelEncoder() # initializes the label encoder
label_encoder.fit(languages) # fits the label encoder with the language names -- ['English', 'French', 'German', 'Spanish'] 
Y_detect_lang = label_encoder.inverse_transform(Y_detect) # uses inverse_transform to convert encoded predictions back to language names


In [452]:
df_results = pd.DataFrame({
    'Text': texts_test,       # Original text from the test set
    'Predicted_Language': Y_detect_lang  # Predicted languages after decoding
})

In [453]:
df_results # displays the dataframe containing the results

Unnamed: 0,Text,Predicted_Language
550,[218][note 6] It is a battle between the right...,English
6025,la situaciÃ³n causÃ³ que himmler ordene desarm...,Spanish
4215,le championnat - qui se dÃ©roula pendant le pa...,French
3737,la economÃ­a es fundamentalmente agrÃ­cola con...,Spanish
2625,Wikipedia actualmente se ejecuta en grupos ded...,Spanish
...,...,...
1993,"Ces programmes, selon leur degrÃ© de perfectio...",French
1532,"Depuis peu, et au niveau international, ils te...",French
473,"Katherine Maher, the nonprofit Wikimedia Found...",English
3845,within two years of forming a live band stick ...,English


### Model Evaluation Metrics


#### 1. **Accuracy**
- **Definition**: Accuracy measures the ratio of correctly predicted instances to the total number of instances. It is useful for determining the overall effectiveness of the model.
- **Formula**:
  $$
  \text{Accuracy} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}
  $$
  - **Where**:
    - **TP** = True Positives, **TN** = True Negatives
    - **FP** = False Positives, **FN** = False Negatives


In [456]:
accuracy = model.score(X_test, Y_test) # computes and displays the accuracy
print(f"Accuracy: {accuracy}") # displays the accuracy

# OR
# y_pred = model.predict(X_test)
# accuracy = accuracy_score(Y_test, Y_detect)
# print(accuracy)

Accuracy: 0.9955156950672646



#### 2. **Precision**
- **Definition**: Precision calculates the proportion of correctly predicted positive instances out of all predicted positives. It is useful when minimizing false positives is important.
- **Formula**:
  $$
  \text{Precision} = \frac{\text{TP}}{\text{TP + FP}}
  $$


In [458]:
precision = precision_score(Y_test, Y_detect, average='weighted') # computes the precision
# precision = precision_score(Y_test, Y_detect, average='macro') # optionally, use this when you have balanced classes in the language division
print(f"Precision: {precision}") # displays the precision

Precision: 0.9955211387988847


#### 3. **Recall (Sensitivity or True Positive Rate)**
- **Definition**: Recall measures the proportion of correctly predicted positive instances out of all actual positives. It is crucial when missing positive instances (false negatives) is costly.
- **Formula**:
  $$
  \text{Recall} = \frac{\text{TP}}{\text{TP + FN}}
  $$



In [460]:
recall = recall_score(Y_test, Y_detect, average = 'weighted') # computes the recall
# recall = recall_score(Y_test, Y_detect, average = 'macro') # optionally use this when you have balanced classes in the language division
print(f"Recall:{recall}") # displays the recall

Recall:0.9955156950672646


#### 4. **F1-Score**
- **Definition**: The F1-Score is the harmonic mean of precision and recall. It balances the two metrics and is useful when both false positives and false negatives are important.
- **Formula**:
  $$
  \text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  $$



In [462]:
f1 = f1_score(Y_test, Y_detect, average= "weighted") # computes the f1-score
# f1 = f1_score(Y_test, Y_detect, average= "macro") # optionally use this when you have balanced classes in the language division
print(f"F1-Score: {f1}")

F1-Score: 0.9955158874982464


#### 5. **Confusion Matrix**
- **Definition**: A confusion matrix is a table that shows the actual versus predicted classifications, providing a detailed breakdown of correct and incorrect predictions for all classes.
- **Example Matrix**:
  |        | Predicted Positive | Predicted Negative |
  |--------|--------------------|--------------------|
  | Actual Positive | TP                 | FN                 |
  | Actual Negative | FP                 | TN                 |


In [464]:
conf_matrix = confusion_matrix(Y_test, Y_detect) # computes the confusion matrix


In [465]:
# Convert the mapping to reverse for labeling
mapping = {v: k for k, v in mapping.items()} # uses dictionary comprehension to swap key-value pairs in the mapping variable earlier define under Label encoding

# converts the confusion matrix array to a dataFrame for better readability
confusion_mtrx_df = pd.DataFrame(conf_matrix, # the array to be converted to a dataframe
                            index=[mapping[i] for i in range(len(mapping))], # acceses each value in the modified mapping dictionary by their corresponding keys and using it to label the rows of the dataframe
                            columns=[mapping[i] for i in range(len(mapping))] # acceses each value in the modified mapping dictionary by their corresponding keys and using it to label the columns of the dataframe
                           )

confusion_mtrx_df # displays the confusion matrix dataframe

Unnamed: 0,English,French,German,Spanish
English,476,1,0,0
French,1,401,0,1
German,1,0,93,0
Spanish,1,0,1,362


##### Observation

The confusion matrix above describes the True positives and False Positives for each of the Languages the model tried to detect.

1. **First row (True Class = English (0))**:
   - **478** English instances were predicted correctly as English.
   - **0** English instances were incorrectly predicted as French, Spanish, or German.

2. **Second row (True Class = French (1))**:
   - **410** French instances were predicted correctly as French.
   - **0** French instances were incorrectly predicted as English, Spanish, or German.

3. **Third row (True Class = German (2))**:
   - **81** German instances were predicted correctly as German.
   - **1** German instances were incorrectly predicted as English.
   - **0** German instances were incorrectly predicted as French and Spanish.

4. **Fourth row (True Class = Spanish (3))**:
   - **364** Spanish instances were predicted correctly as Spanish.
   - **3** Spanish instances were incorrectly predicted as English.
   - **1** Spanish instance was incorrectly predicted as French.
   - **0** Spanish instances were incorrectly predicted as German.

**In Summary**:
- For each row, the number in the diagonal represents the correctly classified instances for that class (i.e `478, 410, 81, 364`).
- Off-diagonal elements in each row represent misclassifications (e.g., instances of Spanish being predicted as English, French, or German).


### Checking for Overfitting and Underfitting

To adequately check if the model is overfitting or underfitting, evaluation metrics will be performed on both the `Train` data (i.e data the model has encountered) and the `Test` data (i.e data the model has not encountered). Thereafter, the following indicators will be used to know if there's overfittinng or underfitting.

- **Good** evaluation on `Train set` and **Bad** evaluation on `Test set` -- ***Overfitting***
- **Good** evaluation on `Train set` and **Good** evaluation on `Test set` -- ***Perfect fitting***
- **Good** evaluation on `Train set` and **Better** evaluation on `Test set` -- ***Excellent fitting (and generalization)***
- **Bad** evaluation on `Train set` and **Bad** evaluation on `Test set` -- ***Underfitting***

**NOTE:** There is always overfitting scenario in most machine learning models (i.e the model is likely to always perform better on the train set than the test set), the aim is to always reduce the degree of overfitting to the barest minimum as possible and to ensure the model generalizes and is robust as much as possible to the test data (i.e unseen data).

##### Using Accuracy

In [470]:
test_accuracy = accuracy # sets the value of the previously calculated model accuracy to a new variable "test_accuracy"
train_accuracy = model.score(X_train, Y_train) # computes the accuracy on the train set
accuracy_threshold = 0.80 # sets the minimum accuracy threshold for the model 

# conditional to check for Excellent fitting
if train_accuracy < test_accuracy:
    print(f"Excellent fitting")

# conditional to check for perfect fitting
elif train_accuracy >= accuracy_threshold and  test_accuracy >= accuracy_threshold:
    print(f"Perfect fitting")

# conditional to check for over fitting
elif train_accuracy > test_accuracy:
    print(f"Over fitting")

# conditional to check for under fitting
elif train_accuracy < accuracy_threshold and test_accuracy < accuracy_threshold:
    print(f"Under fitting")

print(f"""
Train Accuracy: {train_accuracy} 
Test Accuracy: {test_accuracy} """) # displays the accuracy for the train and test datasets

Perfect fitting

Train Accuracy: 0.9958878504672897 
Test Accuracy: 0.9955156950672646 


##### Using Precision

In [472]:
test_precision = precision # sets the value of the previously calculated model precision to a new variable "test_precision"
train_precision = precision_score(Y_test, Y_test, average='weighted') # computes the precision on the train set
precision_threshold = 0.67 # sets the minimum precision threshold for the model 

# conditional to check for Excellent fitting
if train_precision < test_precision:
    print(f"Excellent fitting")

# conditional to check for perfect fitting
elif train_precision >= precision_threshold and  test_precision >= precision_threshold:
    print(f"Perfect fitting")

# conditional to check for over fitting
elif train_precision > test_precision:
    print(f"Over fitting")

# conditional to check for under fitting
elif train_precision < precision_threshold and test_precision < precision_threshold:
    print(f"Under fitting")

print(f"""
Train Precision: {train_precision} 
Test Precision: {test_precision} """) # displays the precision for the train and test datasets

Perfect fitting

Train Precision: 1.0 
Test Precision: 0.9955211387988847 


##### Using Recall

In [474]:
test_recall = recall # sets the value of the previously calculated model recall to a new variable "test_recall"
train_recall = recall_score(Y_test, Y_test, average = 'weighted') # computes the recall on the train set
recall_threshold = 0.67 # sets the minimum recall threshold for the model 

# conditional to check for Excellent fitting
if train_recall < test_recall:
    print(f"Excellent fitting")

# conditional to check for perfect fitting
elif train_recall >= recall_threshold and  test_recall >= recall_threshold:
    print(f"Perfect fitting")

# conditional to check for over fitting
elif train_recall > test_recall:
    print(f"Over fitting")

# conditional to check for under fitting
elif train_recall < recall_threshold and test_recall < recall_threshold:
    print(f"Under fitting")

print(f"""
Train Recall: {train_recall} 
Test Recall: {test_recall} """) # displays the recall for the train and test datasets

Perfect fitting

Train Recall: 1.0 
Test Recall: 0.9955156950672646 


##### Using F1-Score

In [476]:
test_f1_score = f1 # sets the value of the previously calculated model f1_score to a new variable "test_f1_score"
train_f1_score = f1_score(Y_test, Y_test, average= "weighted") # computes the f1-score on the train set
f1_score_threshold = 0.67 # sets the minimum f1_score threshold for the model 

# conditional to check for Excellent fitting
if train_f1_score < test_f1_score:
    print(f"Excellent fitting")

# conditional to check for perfect fitting
elif train_f1_score >= f1_score_threshold and  test_f1_score >= f1_score_threshold:
    print(f"Perfect fitting")

# conditional to check for over fitting
elif train_f1_score > test_f1_score:
    print(f"Over fitting")

# conditional to check for under fitting
elif train_f1_score < f1_score_threshold and test_f1_score < f1_score_threshold:
    print(f"Under fitting")

print(f"""
Train f1_score: {train_f1_score} 
Test f1_score: {test_f1_score} """) # displays the recall for the train and test datasets

Perfect fitting

Train f1_score: 1.0 
Test f1_score: 0.9955158874982464 


##### Using Confusion Matrix
When assessing overfitting and underfitting with a confusion matrix, the approach differs from traditional metrics, as there are no specific thresholds to rely on. Instead, we compare the diagonal values of the training and test confusion matrices for each language. This comparison allows us to identify overfitting by examining whether the training accuracy significantly exceeds that of the test set across different languages.

In [478]:
test_conf_matrix = conf_matrix # sets the value (s) of the previously calculated conf_matrix to a new variable "test_conf_matrix"
train_conf_matrix = confusion_matrix(Y_test, Y_test) # computes the confusion matrix on the train set

train_conf_matrix_diag = np.diag(train_conf_matrix) # gets the diagonal values in train_conf_matrix
test_conf_matrix_diag = np.diag(test_conf_matrix) # gets the diagonal values in train_conf_matrix

# conditional to check for excellent fit
if np.all(train_conf_matrix_diag < test_conf_matrix_diag):
    print("Excellent fit")

# conditional to check for a perfect fit
elif np.all(train_conf_matrix_diag > 100) and np.all(test_conf_matrix_diag > 100):
    print("Perfect Fit")
    
# conditional to check for overfitting 
elif np.any(train_conf_matrix_diag > test_conf_matrix_diag):
    print("Overfitting")

# conditional to check for underfitting
if np.any(train_conf_matrix_diag < 100) and np.all(test_conf_matrix_diag < 100):
    print("Underfitting")

# converts the train confusion matrix array to a dataFrame for better readability
train_confusion_mtrx_df = pd.DataFrame(train_conf_matrix, # the array to be converted to a dataframe
                            index=[mapping[i] for i in range(len(mapping))], # acceses each value in the modified mapping dictionary by their corresponding keys and using it to label the rows of the dataframe
                            columns=[mapping[i] for i in range(len(mapping))] # acceses each value in the modified mapping dictionary by their corresponding keys and using it to label the columns of the dataframe
                           )
text_confusion_mtrx_df = confusion_mtrx_df # sets the previous confusion matrix to a new variable "test_conf_matrix"

Overfitting


##### Observation

By using the confusion matrix, an overfitting result was gotten because from the diagonal vectors of the two matrices, the train set had more True Positives for German Language that test set had on the same language.

In [480]:
train_confusion_mtrx_df # displays the train confusion matrix dataframe

Unnamed: 0,English,French,German,Spanish
English,477,0,0,0
French,0,403,0,0
German,0,0,94,0
Spanish,0,0,0,364


In [481]:
text_confusion_mtrx_df # displays the test confusion matrix dataframe

Unnamed: 0,English,French,German,Spanish
English,476,1,0,0
French,1,401,0,1
German,1,0,93,0
Spanish,1,0,1,362


### Hyper-parameter Tuning:

From the evaluation metrics obtained, it is evident that the performance of the Multinomial Naive Bayes model was very good. However, for any future adjustments and potential improvements, hyperparameter tuning can be beneficial. The following hyperparameters can be tuned using techniques such as GridSearchCV:

- **alpha (Laplace Smoothing)**: This parameter adjusts the smoothing of probabilities, which can help address zero-frequency issues.
- **fit_prior**: This determines whether to learn class prior probabilities based on the training data.
- **class_prior**: Allows for setting custom class prior probabilities when `fit_prior` is set to `False`.

Implementing GridSearchCV will enable an exhaustive search over specified parameter values, helping to identify the optimal settings for enhancing model performance.


##### Using GridSearchCV to get the Best Hyper-parameters

In [484]:

mnb = MultinomialNB() # initializes the model

# sets up the parameter grid for tuning
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 1.5, 2.0],  # laplace smoothing
    'fit_prior': [True, False]           # whether to learn class prior probabilities
}

# initializes GridSearchCV
grid_search = GridSearchCV(estimator=mnb, # specifies the model to use
                           param_grid=param_grid, # the range of hyper-parameter to be tested
                           scoring='accuracy', # evaluation metric to decidd best hyperparameters
                           cv=5, # number of cross validations
                           n_jobs=1 # sets only one CPU core for sequential running. -1 -- parallel runnig.
                          )

# fits the model with the training data
grid_search.fit(X_train, Y_train)

# gets the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best Parameters: {best_params}") # prinst the best parameters after trying out the training with grid search
print(f"Best Cross-Validation Accuracy: {best_score:.4f}") # prints best_score to 4dp


Best Parameters: {'alpha': 0.1, 'fit_prior': False}
Best Cross-Validation Accuracy: 0.9931


### Feature Imporance

In language detection using **Multinomial Naive Bayes (MNB)**, feature importance is not commonly emphasized for the following reasons:

1. **Model Nature**: MNB relies on the assumption that features (e.g., words) are conditionally independent given the class (language). The model does not calculate a global importance score for features. Instead, it calculates the probability of each feature occurring in each class, making feature importance less relevant.

2. **Focus on Word Frequencies**: The MNB model is driven by word frequencies or counts, and its performance depends on the relative frequencies of words across languages rather than the overall importance of any single feature.

3. **Class-Specific Feature Probabilities**: For MNB, each word contributes differently to each language. Therefore, the model focuses on conditional probabilities (i.e., likelihoods) of features within each class rather than assigning a single importance score to features across all classes.

4. **Interpretability**: While feature importance can be useful in some models (e.g., decision trees or logistic regression), in language detection tasks using MNB, analyzing the likelihoods of features (words) per class provides more meaningful insights into which words are most indicative of specific languages, rather than measuring importance in a global sense.

In summary, feature importance is not a priority in MNB for language detection, as the model is designed to use the relative frequency of words to predict language, focusing on class-specific probabilities instead.


### Model Deployment
With a satisfory model performance, the model will be saved for future use. This can be done using libraries like `joblib` or `pickle`, which allow us to serialize the model and load it later for deployment without retraining.

In [487]:
joblib.dump(model, 'Rizama-Victor-Samuel-Language-Detection-Model.pkl') # saves the model in the current code working directory

['Rizama-Victor-Samuel-Language-Detection-Model.pkl']