# Language Detection Machine Learning Project

## Importing Libraries

In [1]:
import pandas as pd  # Pandas is used for data manipulation and analysis.
import numpy as np   # Numpy is used for numerical operations, especially with arrays.
from sklearn.feature_extraction.text import CountVectorizer  # CountVectorizer is a tool for text preprocessing.

### Import Necessary Libraries

- **pandas**: This is used for data manipulation, particularly with tabular data like CSVs or Excel files.
- **numpy**: Useful for working with arrays and performing complex numerical operations.
- **CountVectorizer**: A feature extraction technique from the `sklearn` library that converts text data into a numerical format, based on word counts (bag-of-words model).
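
The bag-of-words idea can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the helper names `tokenize` and `bag_of_words` are hypothetical, and the token rule (lowercase, then keep runs of two or more word characters) is chosen to mirror `CountVectorizer`'s default settings.

```python
import re
from collections import Counter

# Assumed token rule: runs of 2+ word characters, applied to lowercased text.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of strings."""
    # The vocabulary is every distinct token, sorted alphabetically.
    vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokenize(doc)).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocabulary, matrix


docs = ["The cat sat", "the cat saw the dog"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each row is one document and each column is one vocabulary word, so the second row records that "the" occurs twice in the second sentence.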


In [2]:
df = pd.read_csv("language.csv")

In [3]:
print(df)

                                                    Text  language
0      klement gottwaldi surnukeha palsameeriti ning ...  Estonian
1      sebes joseph pereira thomas  på eng the jesuit...   Swedish
2      ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...      Thai
3      விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...     Tamil
4      de spons behoort tot het geslacht haliclona en...     Dutch
...                                                  ...       ...
21995  hors du terrain les années  et  sont des année...    French
21996  ใน พศ  หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ...      Thai
21997  con motivo de la celebración del septuagésimoq...   Spanish
21998  年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由...   Chinese
21999   aprilie sonda spațială messenger a nasa și-a ...  Romanian

[22000 rows x 2 columns]


In [4]:
df.head(5)

                                                Text  language
0  klement gottwaldi surnukeha palsameeriti ning ...  Estonian
1  sebes joseph pereira thomas på eng the jesuit...   Swedish
2  ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...      Thai
3  விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...     Tamil
4  de spons behoort tot het geslacht haliclona en...     Dutch


In [5]:
df.tail(5)

                                                    Text  language
21995  hors du terrain les années et sont des année...    French
21996  ใน พศ หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ...      Thai
21997  con motivo de la celebración del septuagésimoq...   Spanish
21998  年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由...   Chinese
21999  aprilie sonda spațială messenger a nasa și-a ...  Romanian


In [6]:
# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split  # This function splits the dataset into training and testing subsets.

### Import `train_test_split`

- **train_test_split**: This function from the `sklearn.model_selection` module is used to split your data into training and testing sets.
  - **Training set**: This is the portion of your data used to train the machine learning model.
  - **Testing set**: This is the data used to evaluate the model's performance on unseen data.
  - Evaluating on held-out data helps detect overfitting, where a model performs well on training data but poorly on new data.


In [7]:
# Import MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB  # This is a Naive Bayes classifier suitable for classification with discrete features.


### Import `MultinomialNB`

- **MultinomialNB**: This Naive Bayes classifier is particularly effective for classification tasks where the data is represented as frequency counts, such as in text data.
  - This algorithm is often used for **text classification** problems like spam detection, sentiment analysis, and topic categorization.
  - It is designed for discrete features such as word counts in documents, which makes it a natural fit for **bag-of-words** representations. Fractional features such as **tf-idf** weights also tend to work in practice.
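
To make the idea concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes on word counts. It is an illustration under stated assumptions, not scikit-learn's code: `train_nb` and `predict_nb` are hypothetical names, tokens are split on whitespace, and the smoothing value `alpha=1.0` (Laplace smoothing) is chosen to match the `MultinomialNB` default.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate a log prior per class and smoothed log word likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in docs_per_class.items():
        total_words = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Additive smoothing keeps unseen (word, class) pairs from getting probability 0.
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / (total_words + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict_nb(model, text):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in text.split() if w in log_likelihood
        )
    return max(scores, key=scores.get)


toy_docs = [
    "the cat sat on the mat",
    "how are you today",
    "le chat est sur le tapis",
    "comment allez vous",
]
toy_labels = ["English", "English", "French", "French"]

nb = train_nb(toy_docs, toy_labels)
print(predict_nb(nb, "how is the cat"))
print(predict_nb(nb, "le chat"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.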


In [8]:
# Get the sum of null values in each column
df.isnull().sum()

Text        0
language    0
dtype: int64

### Checking the Count of Null Values in Each Column

- **`df.isnull().sum()`**: This function checks for null values in the DataFrame `df` and then sums them up column-wise.
  - For each column, the output will show the number of null values.
  - It helps identify which columns have missing data that may need handling (e.g., by filling or dropping missing values).


In [9]:
# Count the occurrences of each unique value in the 'language' column
df['language'].value_counts()

language
Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: count, dtype: int64

### Counting the Occurrences of Each Unique Value in the `language` Column

- **`df['language'].value_counts()`**: This method returns the number of rows for each unique value in the `language` column.
  - Results are sorted by count, highest first.
  - Here each of the 22 languages appears exactly 1000 times, so the dataset is balanced across classes.
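
For intuition, the same tally can be sketched with the standard library. The label list below is made up for illustration and is not taken from `language.csv`.

```python
from collections import Counter

# Hypothetical labels, standing in for the `language` column.
sample_labels = ["Thai", "Dutch", "Thai", "Tamil", "Dutch", "Thai"]

# most_common() sorts by count, highest first, much like value_counts().
counts = Counter(sample_labels).most_common()
print(counts)

# A dataset is balanced when every class has the same count.
is_balanced = len({n for _, n in counts}) == 1
print(is_balanced)
```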


In [10]:
# Check the data types of each column in the DataFrame
df.dtypes

Text        object
language    object
dtype: object

### Checking the Data Types of Each Column in `df`

- **`df.dtypes`**: This attribute provides the data types of each column in the DataFrame `df`.
  - It is crucial for understanding the nature of your data and what operations are applicable.
  - Common data types include:
    - `int64`: Integer values.
    - `float64`: Floating-point values.
    - `object`: Typically for strings or mixed types.
    - `datetime64`: Date and time values.
  - Knowing the data types helps in data cleaning, processing, and analysis.


In [11]:
# Convert the 'Text' and 'language' columns to NumPy arrays
x = np.array(df['Text'])      # 'x' will hold the text data
y = np.array(df['language'])  # 'y' will hold the language data

# Display the shapes of the arrays
#print("Shape of x:", x.shape)
#print("Shape of y:", y.shape)


### Converting DataFrame Columns to NumPy Arrays

- **`x = np.array(df['Text'])`**: This line converts the `Text` column from the DataFrame `df` into a NumPy array called `x`. This array will be used for text data.
  
- **`y = np.array(df['language'])`**: This line converts the `language` column from the DataFrame `df` into a NumPy array called `y`. This array will be used for the corresponding language labels.
  
- Converting to NumPy arrays is useful for machine learning tasks, where data is often expected to be in array format for modeling and analysis.


In [12]:
print(x)

['klement gottwaldi surnukeha palsameeriti ning paigutati mausoleumi surnukeha oli aga liiga hilja ja oskamatult palsameeritud ning hakkas ilmutama lagunemise tundemärke  aastal viidi ta surnukeha mausoleumist ära ja kremeeriti zlíni linn kandis aastatel – nime gottwaldov ukrainas harkivi oblastis kandis zmiivi linn aastatel – nime gotvald'
 'sebes joseph pereira thomas  på eng the jesuits and the sino-russian treaty of nerchinsk  the diary of thomas pereira bibliotheca instituti historici s i --   rome libris '
 'ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เริ่มตั้งแต่ถนนสนามไชยถึงแม่น้ำเจ้าพระยาที่ถนนตก กรุงเทพมหานคร เป็นถนนรุ่นแรกที่ใช้เทคนิคการสร้างแบบตะวันตก ปัจจุบันผ่านพื้นที่เขตพระนคร เขตป้อมปราบศัตรูพ่าย เขตสัมพันธวงศ์ เขตบางรัก เขตสาทร และเขตบางคอแหลม'
 ...
 'con motivo de la celebración del septuagésimoquinto ° aniversario de la fundación del departamento en  guillermo ceballos espinosa presentó a la gobernación de caldas por encargo de su titular dilia estrada de gómez el h

In [13]:
print(y)

['Estonian' 'Swedish' 'Thai' ... 'Spanish' 'Chinese' 'Romanian']


In [14]:
cv = CountVectorizer()  # Create an instance of CountVectorizer

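# Fit and transform the text data
# Note: fitting on every row before the split means the vocabulary also includes test-set words;
# the later cells fit the vectorizer on the training split only.
X = cv.fit_transform(x)  # Transform the text data into a document-term matrix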

In [15]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Splitting the Data into Training and Testing Sets

- **`train_test_split()`**: This function splits the dataset into training and testing subsets.
  - **Parameters**:
    - **`X`**: The document-term matrix representing the features.
    - **`y`**: The array of target labels (languages).
    - **`test_size=0.33`**: This means that 33% of the data will be allocated for testing, while 67% will be used for training.
    - **`random_state=42`**: This ensures that the split is reproducible. The same seed will result in the same split in different runs.
  
- The resulting sets will be:
  - **`X_train`** and **`y_train`**: Training features and labels.
  - **`X_test`** and **`y_test`**: Testing features and labels.

- This step is essential for evaluating the performance of machine learning models, allowing you to train the model on one subset of data and test it on another to assess its generalization capabilities.
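
As a rough sketch of what a seeded split does, the function below shuffles row indices with a fixed seed and cuts off a test portion. The name `simple_train_test_split` is hypothetical, and the shuffle will not reproduce scikit-learn's exact row assignment; only the idea (same seed, same split) carries over. With 22000 rows and `test_size=0.33`, 7260 rows go to the test set and 14740 remain for training, which matches the `X_train` shape reported in this notebook.

```python
import math
import random


def simple_train_test_split(items, labels, test_size=0.33, seed=42):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    n = len(items)
    n_test = math.ceil(test_size * n)  # round the test share up to whole rows
    order = list(range(n))
    random.Random(seed).shuffle(order)
    test_idx, train_idx = order[:n_test], order[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (
        pick(items, train_idx),
        pick(items, test_idx),
        pick(labels, train_idx),
        pick(labels, test_idx),
    )


# Toy data: eight texts with labels 0..7 so alignment is easy to check.
toy_items = [f"t{i}" for i in range(8)]
toy_targets = list(range(8))

tr_x, te_x, tr_y, te_y = simple_train_test_split(toy_items, toy_targets, test_size=0.25)
print(len(tr_x), len(te_x))
```

Features and labels are indexed with the same shuffled order, so each text keeps its label on both sides of the split.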


In [16]:
X_train

<14740x277720 sparse matrix of type '<class 'numpy.int64'>'
	with 613529 stored elements in Compressed Sparse Row format>

In [17]:
print(X_train)

  (0, 197295)	2
  (0, 197708)	1
  (0, 197801)	1
  (0, 198388)	1
  (0, 197467)	1
  (0, 197865)	2
  (0, 197604)	1
  (0, 198428)	1
  (0, 198501)	1
  (0, 198556)	1
  (0, 197332)	1
  (0, 197485)	2
  (0, 198123)	1
  (0, 197892)	1
  (0, 197990)	1
  (0, 198053)	1
  (0, 198417)	1
  (0, 197623)	1
  (1, 197641)	2
  (1, 197314)	1
  (1, 197931)	1
  (1, 197804)	3
  (1, 198397)	1
  (1, 197149)	1
  (1, 197781)	1
  :	:
  (14738, 188817)	1
  (14738, 192004)	1
  (14738, 157171)	1
  (14738, 190346)	1
  (14738, 190725)	1
  (14738, 189685)	1
  (14738, 159269)	2
  (14738, 145431)	1
  (14738, 173292)	1
  (14738, 176062)	1
  (14738, 159959)	1
  (14738, 190198)	1
  (14738, 167124)	1
  (14738, 168158)	1
  (14738, 180260)	2
  (14738, 153262)	1
  (14738, 162150)	1
  (14738, 153355)	1
  (14738, 178104)	1
  (14738, 163770)	1
  (14739, 223002)	1
  (14739, 235170)	1
  (14739, 222446)	1
  (14739, 221922)	1
  (14739, 242446)	1


In [18]:
# Initialize the Multinomial Naive Bayes model
model = MultinomialNB()

# Fit the model to the training data
model.fit(X_train, y_train)

### Initializing and Fitting the Multinomial Naive Bayes Model

- **`model = MultinomialNB()`**: This line creates an instance of the Multinomial Naive Bayes classifier, which suits text classification because it models discrete count features such as word frequencies.

- **`model.fit(X_train, y_train)`**: 
  - This method trains the Multinomial Naive Bayes model on the training dataset.
  - It uses the document-term matrix `X_train` as features and the corresponding labels `y_train` for training the model.

- The model learns the relationships between the features and labels, preparing it to make predictions on new, unseen data.

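- **`model.get_params()`**: This method (not called in the cell above) returns the model's hyperparameters, such as `alpha`. It does not show whether the model has been fitted; the fitted state shows up in learned attributes such as `classes_`.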

- Fitting the model is a crucial step in the machine learning workflow, enabling the subsequent prediction and evaluation stages.
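
The difference between hyperparameters and learned state can be seen on a toy corpus. This is a small sketch assuming scikit-learn is installed; the four sentences and the `toy_` names are made up for illustration and are separate from the notebook's `cv` and `model`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, not taken from language.csv.
toy_texts = [
    "the cat sat on the mat",
    "how are you today",
    "le chat est sur le tapis",
    "comment allez vous",
]
toy_langs = ["English", "English", "French", "French"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_texts)
toy_model = MultinomialNB().fit(toy_X, toy_langs)

# Hyperparameters: available before and after fitting.
print(toy_model.get_params())

# Learned attributes: only exist once fit() has run.
print(toy_model.classes_)              # class labels in sorted order
print(toy_model.class_count_)          # number of training documents per class
print(toy_model.feature_count_.shape)  # (n_classes, n_features)
```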


In [19]:
# Evaluate the model's performance on the test data
accuracy = model.score(X_test, y_test)

# Display the accuracy of the model
print("Model accuracy on the test set:", accuracy)

Model accuracy on the test set: 0.953168044077135


### Evaluating the Model's Performance on the Test Set

- **`model.score(X_test, y_test)`**: 
  - This method assesses the accuracy of the Multinomial Naive Bayes model on the test data.
  - It compares the predicted labels from the model with the actual labels in `y_test`.

- **Parameters**:
  - **`X_test`**: The document-term matrix representing the features of the testing set.
  - **`y_test`**: The true labels for the testing data.

- The method returns the accuracy of the model as a float, representing the proportion of correctly classified instances over the total instances in the testing set.

- Evaluating model performance is essential to determine how well the model is likely to perform in real-world scenarios and to identify any potential issues with overfitting or underfitting.
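
The accuracy figure is simply the fraction of matching labels. A minimal sketch, using a hypothetical helper name `accuracy` and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration: four of the five predictions are right.
true_langs = ["Thai", "Dutch", "Tamil", "French", "Dutch"]
pred_langs = ["Thai", "Dutch", "Tamil", "English", "Dutch"]

toy_accuracy = accuracy(true_langs, pred_langs)
print(toy_accuracy)
```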


In [20]:
# Get user input for language detection
user_input = input("Enter any text for detection of language: ")

# Transform the user input using CountVectorizer
input_vector = cv.transform([user_input]).toarray()

#  Predict the language using the trained model
output = model.predict(input_vector)

# Display the predicted language
print("Predicted Language:", output)

Enter any text for detection of language:  hi how are you


Predicted Language: ['English']


## Splitting the Data

The cell below uses `train_test_split` to divide the dataset into training and testing sets. This time the raw text is split before vectorizing, and 20% of the rows (`test_size=0.2`) are held out for testing:


In [21]:
# Features and labels
X = df['Text']
y = df['language']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Vectorizing the Text Data

The cell below uses `CountVectorizer` to convert the text into a numerical format suitable for machine learning. The vectorizer is fitted on the training text only and then applied unchanged to the test text:


In [22]:
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
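
Why `fit_transform` on the training text and plain `transform` on the test text? The sketch below illustrates the idea in plain Python: the vocabulary is learned from training documents only, and words that appear only in test documents are ignored. The helpers `fit_vocabulary` and `transform_counts` are hypothetical and use simple whitespace tokenization rather than `CountVectorizer`'s rules.

```python
def fit_vocabulary(train_docs):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({tok for doc in train_docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(words)}


def transform_counts(docs, vocab):
    """Count known words per document; words outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.lower().split():
            if tok in vocab:
                row[vocab[tok]] += 1
        rows.append(row)
    return rows


train_docs = ["hello world", "hello there"]
test_docs = ["hello unseen world"]

vocab = fit_vocabulary(train_docs)
train_rows = transform_counts(train_docs, vocab)
test_rows = transform_counts(test_docs, vocab)  # "unseen" has no column, so it is skipped

print(vocab)
print(test_rows)
```

Because the test rows reuse the training vocabulary, both matrices have the same columns, and nothing about the test text influences the features the model is trained on.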


## Training the Model

The cell below initializes and trains the Multinomial Naive Bayes model on the vectorized training data:


In [23]:
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)


## Making Predictions

The cell below uses the trained model to predict labels for the test set:


In [24]:
y_pred = model.predict(X_test_vectorized)

## Importing Evaluation Metrics

The cell below imports the metrics used to evaluate the model's performance:

- **`accuracy_score`**: Indicates the overall correctness of the model's predictions.
- **`precision_score`**: Measures how many of the instances predicted for a class actually belong to it; important when false positives carry a cost.
- **`recall_score`**: Reflects the model's ability to find all instances of a class; crucial when missing positives is costly.
- **`f1_score`**: Balances precision and recall, useful in scenarios with imbalanced datasets.
- **`classification_report`**: Provides a detailed performance summary for each class, helping to identify strengths and weaknesses in the model.


In [25]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [26]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)  # Overall correctness of predictions
precision = precision_score(y_test, y_pred, average='weighted')  # Precision for multi-class classification
recall = recall_score(y_test, y_pred, average='weighted')  # Recall for multi-class classification
f1 = f1_score(y_test, y_pred, average='weighted')  # F1 score for balance between precision and recall

# Print metrics
print(f'Accuracy: {accuracy:.2f}')  # Display accuracy
print(f'Precision: {precision:.2f}')  # Display precision
print(f'Recall: {recall:.2f}')  # Display recall
print(f'F1 Score: {f1:.2f}')  # Display F1 score

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))  # Print detailed report for each class


Accuracy: 0.94
Precision: 0.96
Recall: 0.94
F1 Score: 0.94

Classification Report:
              precision    recall  f1-score   support

      Arabic       1.00      1.00      1.00       202
     Chinese       0.96      0.49      0.65       201
       Dutch       0.98      0.98      0.98       230
     English       0.68      1.00      0.81       194
    Estonian       0.99      0.95      0.97       200
      French       0.94      0.99      0.97       188
       Hindi       1.00      0.99      0.99       208
  Indonesian       1.00      0.98      0.99       213
    Japanese       0.98      0.64      0.78       194
      Korean       0.99      0.99      0.99       190
       Latin       0.98      0.90      0.94       210
     Persian       1.00      0.99      1.00       196
   Portugese       0.99      0.96      0.98       194
      Pushto       1.00      0.96      0.98       196
    Romanian       0.98      0.98      0.98       197
     Russian       0.99      0.99      0.99       21

## Model Evaluation Results

- **Accuracy: 0.94**: 94% of the test-set predictions are correct.

- **Precision: 0.96**: This is the support-weighted average of per-class precision (`average='weighted'`). Averaged over classes, about 96% of the texts assigned to a language actually belong to it.

- **Recall: 0.94**: This is the support-weighted average of per-class recall: on average, the model recovers 94% of the texts of each language. With weighted averaging this equals accuracy.

- **F1 Score: 0.94**: This is the support-weighted average of per-class F1, where each class's F1 is the harmonic mean of its precision and recall.

The averages hide some weak classes in the report above: Chinese (recall 0.49) and Japanese (recall 0.64) are often missed, and English has a precision of only 0.68, which suggests that many of the missed texts are labelled English. Overall, the model detects most languages reliably from text input, but these classes need attention before practical use.
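
To see how the weighted figures are built from per-class scores, here is a small pure-Python sketch. The helper names `per_class_scores` and `weighted_average` are hypothetical, the labels are made up, and a class that is never predicted is given a precision of 0.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every true label."""
    support = Counter(y_true)      # how often each label really occurs
    predicted = Counter(y_pred)    # how often each label was predicted
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    scores = {}
    for label in sorted(support):
        precision = hits[label] / predicted[label] if predicted[label] else 0.0
        recall = hits[label] / support[label]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores


def weighted_average(scores):
    """Average precision, recall and F1 across classes, weighted by support."""
    total = sum(s[3] for s in scores.values())
    return tuple(sum(s[i] * s[3] for s in scores.values()) / total for i in range(3))


# Made-up labels: class "C" is never predicted.
toy_true = ["A", "A", "A", "B", "B", "C"]
toy_pred = ["A", "A", "B", "B", "B", "A"]

toy_scores = per_class_scores(toy_true, toy_pred)
w_precision, w_recall, w_f1 = weighted_average(toy_scores)
print(toy_scores)
print(round(w_precision, 4), round(w_recall, 4), round(w_f1, 4))
```

In this toy case the weighted recall (4/6) is the same as the plain accuracy, which is why the notebook's accuracy and recall both come out at 0.94.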
