# 📓✨ Text Classification Project ✨📓

## 📜 Importing Libraries 📜
- 🔢 **NumPy** for numerical operations.
- 🧮 **Pandas** for data manipulation.
- 🚀 **`resample`** from **sklearn.utils** for resampling the dataset.
- 📝 **TfidfVectorizer** from **sklearn.feature_extraction.text** for transforming text to feature vectors.
- 🌲 **RandomForestClassifier** from **sklearn.ensemble** for classification.
- 📈 **LogisticRegression** from **sklearn.linear_model** for classification.
- 🦸‍♂️ **XGBClassifier** from **xgboost** for classification.
- 🌿 **PorterStemmer** from **nltk.stem** for stemming words.
- 🏷️ **LabelEncoder** from **sklearn.preprocessing** for encoding labels.
- 🧩 **`string`** for string operations.
- 🔍 **`re`** for regular expressions.
- 🎲 **`train_test_split`** from **sklearn.model_selection** for splitting the dataset.
- 🏆 **`classification_report`**, **`accuracy_score`**, and **`precision_score`** from **sklearn.metrics** for evaluating models.
- 💾 **Joblib** for saving and loading models.

## 🗂️ Data Preparation 🗂️
- 📥 **Loading the dataset** using Pandas.
- 🔄 **Handling missing values** and **cleaning the data**.
- 📊 **Exploring the dataset** to understand its structure and content.
- 🔍 **Preprocessing text** data by removing punctuation and stemming each word.

## ✨ Feature Extraction ✨
- 📝 **Transforming text** to feature vectors using **TfidfVectorizer**.

## 📊 Model Training 📊
- 🔀 **Splitting the dataset** into training and testing sets.
- 🌲 **Training a Random Forest Classifier**.
- 📈 **Training a Logistic Regression Classifier**.
- 🦸‍♂️ **Training an XGBoost Classifier**.

## 🏆 Model Evaluation 🏆
- 📊 **Evaluating models** using **classification report**, **accuracy score**, and **precision score**.
- 📉 **Comparing model performance**.

## 💾 Saving and Loading Models 💾
- 💾 **Saving trained models** using **Joblib**.
- 📂 **Loading saved models** for future use.

## 🔍 Conclusion 🔍
- 📜 **Summarizing the findings** and **model performance**.
- 💡 **Future work** and **improvements** for the project.



In [565]:
import numpy as np
import pandas as pd
from sklearn.utils import resample
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from nltk.stem import PorterStemmer
from sklearn.preprocessing import LabelEncoder
import string as s
import re as r
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,accuracy_score,precision_score
import joblib as j


# 📜 Data Loading and Exploration 📜

- 📥 **Loading the dataset** from a CSV file named **'Language Detection.csv'**.
- 📐 **Checking the shape** of the dataset to see the number of rows and columns: **(10337, 2)**.
- 🌍 **Identifying unique languages** in the dataset using the **'Language'** column: 17 in total.
- 📊 **Counting the number of instances** for each language in the dataset:

  - 🇬🇧 **English**: 1385
  - 🇫🇷 **French**: 1014
  - 🇪🇸 **Spanish**: 819
  - 🇵🇹 **Portuguese** (labelled `Portugeese` in the dataset): 739
  - 🇮🇹 **Italian**: 698
  - 🇷🇺 **Russian**: 692
  - 🇸🇪 **Swedish** (labelled `Sweedish` in the dataset): 676
  - 🇮🇳 **Malayalam**: 594
  - 🇳🇱 **Dutch**: 546
  - 🇦🇪 **Arabic**: 536
  - 🇹🇷 **Turkish**: 474
  - 🇩🇪 **German**: 470
  - 🇮🇳 **Tamil**: 469
  - 🇩🇰 **Danish**: 428
  - 🇮🇳 **Kannada**: 369
  - 🇬🇷 **Greek**: 365
  - 🇮🇳 **Hindi**: 63


In [537]:
df=pd.read_csv('Language Detection.csv')

In [538]:
df.head(3)

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English


In [539]:
df.shape

(10337, 2)

In [540]:
df['Language'].unique()

array(['English', 'Malayalam', 'Hindi', 'Tamil', 'Portugeese', 'French',
       'Dutch', 'Spanish', 'Greek', 'Russian', 'Danish', 'Italian',
       'Turkish', 'Sweedish', 'Arabic', 'German', 'Kannada'], dtype=object)

In [541]:
df['Language'].value_counts()

Language
English       1385
French        1014
Spanish        819
Portugeese     739
Italian        698
Russian        692
Sweedish       676
Malayalam      594
Dutch          546
Arabic         536
Turkish        474
German         470
Tamil          469
Danish         428
Kannada        369
Greek          365
Hindi           63
Name: count, dtype: int64

# 📊 Upsampling the Dataset for Language Balancing 📊

- 📈 **Calculating the maximum count** of instances among all languages: **1385**.
- 🔄 **Creating an empty DataFrame** `df_upsampled` to store the upsampled data.

- 🔁 **Iterating through each unique language** in the dataset:
  - 📊 **Subsetting the dataset** for each language.

  - ⚖️ **Checking if the count of instances** for the current language is less than the maximum count (1385):
    - 🔀 **Resampling** the data to match the maximum count using `sklearn.utils.resample`.
    - 📄 **Concatenating** the upsampled data to `df_upsampled`.

  - 🔄 **Concatenating** the original data if the count meets or exceeds 1385.

- 🔄 **Resetting the index** of `df_upsampled` for consistency.

- 📊 **Counting the number of instances** for each language in the upsampled dataset:
  - 📊 **Displaying the count** of each language.

- 🔍 **Checking the shape** of the upsampled dataset: **(23545, 2)**.

- 📄 **Displaying a random sample** of 15 rows from `df_upsampled`.


In [542]:
max_count=df['Language'].value_counts().max()

In [543]:
max_count

1385

In [544]:
df_upsampled = pd.DataFrame()

for language in df['Language'].unique():
    language_df = df[df['Language'] == language]
    count = len(language_df)  # number of rows for this language

    if count < max_count:
        # Minority language: sample with replacement up to the majority count.
        upsampled_df = resample(language_df,
                                replace=True,
                                n_samples=max_count,
                                random_state=42)
        df_upsampled = pd.concat([df_upsampled, upsampled_df])
    else:
        # The largest class is kept unchanged.
        df_upsampled = pd.concat([df_upsampled, language_df])



In [545]:

df_upsampled.reset_index(drop=True, inplace=True)

df_upsampled['Language'].value_counts()

Language
English       1385
Russian       1385
German        1385
Arabic        1385
Sweedish      1385
Turkish       1385
Italian       1385
Danish        1385
Greek         1385
Malayalam     1385
Spanish       1385
Dutch         1385
French        1385
Portugeese    1385
Tamil         1385
Hindi         1385
Kannada       1385
Name: count, dtype: int64

In [546]:
df_upsampled.shape

(23545, 2)
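
The shape follows directly from the balancing rule: 17 languages × 1385 rows each = 23545 rows. A minimal sketch of the same idea as a reusable helper is shown below; `upsample_to_max` and the toy DataFrame are hypothetical names and data introduced for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.utils import resample


def upsample_to_max(frame, label_col, seed=42):
    """Return a copy of `frame` where every class has as many rows as the largest one."""
    target = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col, sort=False):
        if len(group) < target:
            # Sample with replacement until the class reaches the target size.
            group = resample(group, replace=True, n_samples=target, random_state=seed)
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)


# Toy data (made up for illustration): 4 English rows, 1 Hindi row.
toy = pd.DataFrame({
    "Text": ["a", "b", "c", "d", "e"],
    "Language": ["English"] * 4 + ["Hindi"],
})
balanced = upsample_to_max(toy, "Language")
print(balanced["Language"].value_counts().to_dict())
print(balanced.shape)  # rows = number of classes * size of the largest class
```

Because the minority rows are drawn with replacement, the balanced frame contains exact duplicates. If such a frame is split into train and test sets afterwards, copies of the same sentence can land on both sides, so upsampling only the training portion is usually the safer order.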

In [547]:
df_upsampled.sample(15)

Unnamed: 0,Text,Language
4391,அதிகமில்லை.,Tamil
7175,"vos amis, je me sens très lent.",French
22712,ನಾನು ಮುಂದುವರಿಯಲು ಬಯಸುತ್ತೇನೆ ಯಾರಾದರೂ ನಿಮ್ಮನ್ನು ...,Kannada
9515,"Ik zou dit niet goed kopen, je hebt me verkoch...",Dutch
9650,daar ben ik niet zeker van.,Dutch
13587,"о, это горит, и вы, наверное, слышали это в пе...",Russian
10676,[184] Proyectos de ese tipo o similares se or...,Spanish
12293,Ω αγαπητή μια νύχτα όταν κοιμόταν η μητέρα της...,Greek
8629,slaat me.,Dutch
22153,Sie fing bald an zu weinen.,German


In [548]:
s.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [549]:
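pt=PorterStemmer()
def preprocessing(x):
    """Strip ASCII punctuation from each token, then stem it."""
    pun = s.punctuation
    t = []
    m = x.split()
    for i in m:
        # Drop punctuation characters from the token.
        word = ''.join(char for char in i if char not in pun)
        if word:  # skip tokens that consisted of punctuation only
            # PorterStemmer applies English stemming rules and lowercases by default.
            stemmed_word = pt.stem(word)
            t.append(stemmed_word)
    return " ".join(t)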
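
The punctuation step can also be written with a translation table. The sketch below is an alternative to the character loop, not the notebook's own code; `strip_ascii_punctuation` is a hypothetical helper and it leaves out stemming. It also makes one limitation visible: `string.punctuation` contains ASCII characters only, so script-specific marks such as the Arabic comma and question mark are kept.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_ascii_punctuation(text):
    """Remove ASCII punctuation from `text` and collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT_TABLE).split())


danish = strip_ascii_punctuation("Hej, hvordan har du det?")
arabic = strip_ascii_punctuation("مرحباً، كيف حالك؟")

print(danish)  # Hej hvordan har du det
print(arabic)  # the Arabic comma and question mark are still present
```

Whether the leftover marks matter depends on the vectorizer: the default `TfidfVectorizer` tokenizer only keeps word characters, so they are dropped at that stage anyway.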


In [550]:
df_upsampled['Text']=df_upsampled['Text'].apply(preprocessing)

In [551]:
df_upsampled

Unnamed: 0,Text,Language
0,natur in the broadest sens is the natur physic...,English
1,natur can refer to the phenomena of the physic...,English
2,the studi of natur is a larg if not the onli p...,English
3,although human are part of natur human activ i...,English
4,1 the word natur is borrow from the old french...,English
...,...,...
23540,ಪರೀಕ್ಷೆಯಲ್ಲಿ ಒಂದೇ ಪ್ರಶ್ನೆಗೆ ಉತ್ತರಿಸಲು ಅಥವಾ ನಿಷ...,Kannada
23541,ಇದರ ಅಡಿಯಲ್ಲಿ ನನಗೆ ಈ 10 ಪದಗಳಲ್ಲಿ ಯಾವುದು ನಿಮ್ಮ ನ...,Kannada
23542,ನಿಮ್ಮನ್ನು ತೊಂದರೆಗೊಳಿಸಿದ್ದಕ್ಕೆ ಕ್ಷಮಿಸಿ,Kannada
23543,ನಾನು ಅದರ ಬಗ್ಗೆ ನಿಜವಾಗಿಯೂ ವಿಷಾದಿಸುತ್ತೇನೆ,Kannada


# 📊 Model Training and Evaluation 📊

### 🌱 Random Forest Classifier 🌱

- 📚 **Splitting the dataset** (the original `df`, 80% train / 20% test) using `train_test_split`.
- 📊 **Vectorizing** the text data using `TfidfVectorizer`, fitted on the training split only.
- 🌳 **Initializing a Random Forest Classifier** and **fitting** it on the training data.
- 📊 **Predicting** the labels for the test data and evaluating performance:

  - 🎯 **Accuracy Score:** 0.922
  - 🎯 **Precision Score (weighted):** 0.952

- 📄 **Printing the Training Data Scores**:

  - 📈 **Accuracy Score:** 0.998
  - 📈 **Precision Score (weighted):** 0.998

### 📈 Logistic Regression 📈

- 🌱 **Initializing a Logistic Regression model** and **fitting** it on the training data.
- 📊 **Predicting** the labels for the test data and evaluating performance:

  - 🎯 **Accuracy Score:** 0.964
  - 🎯 **Precision Score (weighted):** 0.970

- 📄 **Printing the Training Data Scores**:

  - 📈 **Accuracy Score:** 0.994
  - 📈 **Precision Score (weighted):** 0.994


In [552]:
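# Note: features and labels come from the original `df`, so the models below are
# trained on the raw, unbalanced data rather than on `df_upsampled`.
x=df['Text']
y=df['Language']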

In [553]:
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=42,test_size=0.2)
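
With only 63 Hindi rows, a purely random split can leave the smallest classes unevenly represented in the test set. A minimal sketch of a stratified split follows, assuming the `stratify` argument of `train_test_split` is acceptable here; the notebook itself does not use it, and the toy labels are made up.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (made up): 8 majority rows and 2 minority rows.
texts = [f"sample {i}" for i in range(10)]
labels = ["English"] * 8 + ["Hindi"] * 2

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# Each split keeps the 4:1 class ratio of the full data.
print(Counter(y_tr))
print(Counter(y_te))
```

In the notebook this would amount to passing `stratify=y` to the existing call.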

In [554]:
vector=TfidfVectorizer()
x_train_tidf=vector.fit_transform(x_train)
x_test_tidf=vector.transform(x_test)
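
The difference between `fit_transform` and `transform` is easiest to see on a tiny corpus. The sketch below uses made-up sentences: the vocabulary is learned from the training documents only, and words that appear only in the test document are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["hello world", "hello there world"]
test_docs = ["hello unseen words"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them without refitting

print(sorted(vec.vocabulary_))  # ['hello', 'there', 'world']
print(train_matrix.shape)       # (2, 3)
print(test_matrix.nnz)          # 1, because only 'hello' is in the learned vocabulary
```

Fitting on the training split alone keeps information from the test texts out of the features, which is why the cell above calls `fit_transform` on `x_train` and only `transform` on `x_test`.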

In [555]:
rf=RandomForestClassifier()
rf.fit(x_train_tidf,y_train)

In [556]:
ypred=rf.predict(x_test_tidf)
y_tpre=rf.predict(x_train_tidf)


In [557]:
print("The Test Data Score....")
print()
print("accuracy_score: ",accuracy_score(y_test,ypred))
print("precision_score: ",precision_score(y_test,ypred,average="weighted"))

The Test Data Score....

accuracy_score:  0.9216634429400387
precision_score:  0.9522961994686651


In [558]:
print("The Training Data Score....")
print()
print("accuracy_score: ",accuracy_score(y_train,y_tpre))
print("precision_score: ",precision_score(y_train,y_tpre,average="weighted"))

The Training Data Score....

accuracy_score:  0.9984278631031563
precision_score:  0.9984749255106988


In [559]:
lr=LogisticRegression()
lr.fit(x_train_tidf,y_train)

In [560]:
y_pred=lr.predict(x_test_tidf)
y_tpre=lr.predict(x_train_tidf)

In [561]:
print("The Test Data Score....")
print()
print("accuracy_score: ",accuracy_score(y_test,y_pred))
print("precision_score: ",precision_score(y_test,y_pred,average="weighted"))

The Test Data Score....

accuracy_score:  0.9637330754352031
precision_score:  0.9698425771023904
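
A weighted average can hide weak results on small classes. `classification_report` is imported at the top of the notebook but never called; the sketch below shows what it adds, using made-up labels rather than the notebook's predictions.

```python
from sklearn.metrics import classification_report

# Toy labels (made up for illustration), not the notebook's predictions.
y_true = ["English", "English", "English", "English", "Hindi", "Hindi"]
y_hat = ["English", "English", "English", "English", "English", "Hindi"]

report = classification_report(y_true, y_hat, output_dict=True, zero_division=0)
for label in ("English", "Hindi"):
    scores = report[label]
    print(label, round(scores["precision"], 2), round(scores["recall"], 2))
# English 0.8 1.0
# Hindi 1.0 0.5
```

In the notebook the equivalent call would be `print(classification_report(y_test, y_pred))`, which lists precision, recall, and F1 for each of the 17 languages.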


In [562]:
print("The Training Data Score....")
print()
print("accuracy_score: ",accuracy_score(y_train,y_tpre))
print("precision_score: ",precision_score(y_train,y_tpre,average="weighted"))

The Training Data Score....

accuracy_score:  0.994316120449873
precision_score:  0.9944780158524212
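
The reported accuracies can be turned back into counts as a sanity check. The arithmetic below assumes the split was made on the 10337-row `df`, as the `x=df['Text']` cell indicates; each score then corresponds to a whole number of correct predictions.

```python
import math

n_rows = 10337                    # df.shape[0] reported earlier
n_test = math.ceil(0.2 * n_rows)  # scikit-learn rounds a fractional test size up
n_train = n_rows - n_test

reported = {
    "random forest, test": (0.9216634429400387, n_test),
    "random forest, train": (0.9984278631031563, n_train),
    "logistic regression, test": (0.9637330754352031, n_test),
    "logistic regression, train": (0.994316120449873, n_train),
}

errors = {}
for name, (accuracy, n) in reported.items():
    correct = round(accuracy * n)
    errors[name] = n - correct
    print(f"{name}: {correct}/{n} correct, {n - correct} errors")
```

On the 2068-row test set this gives 162 misclassified rows for the Random Forest against 75 for Logistic Regression. The scores therefore describe the original, unbalanced data, not the 23545-row upsampled frame.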


## 📦 Model Serialization and Prediction 📦

### 📥 Saving Models 📥

- 📄 **Saving** the `TfidfVectorizer` object as **"vectorizer.jbl"** using `joblib.dump`.
- 📄 **Saving** the Logistic Regression model as **"model.pkl"** using `joblib.dump`.

### 🔄 Loading Models 🔄

- 📄 **Loading** the Logistic Regression model from **"model.pkl"** using `joblib.load`.
- 📄 **Loading** the `TfidfVectorizer` object from **"vectorizer.jbl"** using `joblib.load`.

### 🧠 Model Prediction 🧠

- 📝 **Predicting the language** for different texts using the loaded model and vectorizer:

  1. **English Text:**
     ```python
     text = "hello how are you"
     tran_text = vector.transform([text])
     model.predict(tran_text)[0]  # Output: 'English'
     ```

  2. **Arabic Text:**
     ```python
     text = "مرحباً، كيف حالك؟"
     tran_text = vector.transform([text])
     model.predict(tran_text)[0]  # Output: 'Arabic'
     ```

  3. **Danish Text:**
     ```python
     text = "Hej, hvordan har du det?"
     tran_text = vector.transform([text])
     model.predict(tran_text)[0]  # Output: 'Danish'
     ```

  4. **Hindi Text:**
     ```python
     text = "नमस्ते, आप कैसे हैं?"
     tran_text = vector.transform([text])
     model.predict(tran_text)[0]  # Output: 'Hindi'
     ```


In [566]:
j.dump(vector,"vectorizer.jbl")
j.dump(lr,"model.pkl")


['model.pkl']

In [567]:
model=j.load("model.pkl")
vector=j.load("vectorizer.jbl")
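
Saving the vectorizer and the classifier as two files works, but they must always be loaded as a matching pair. One alternative, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable, is to bundle both steps into a single object. The toy sentences and the file name `language_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (made up), standing in for x_train / y_train.
texts = [
    "hello how are you",
    "good morning my friend",
    "hej hvordan har du det",
    "god morgen min ven",
]
labels = ["English", "English", "Danish", "Danish"]

# Vectorizer and classifier are fitted together as one estimator.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "language_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw strings; no separate transform call is needed.
print(restored.predict(["hello how are you"])[0])
```

With a single artifact there is no way to pair a model with a vectorizer that was fitted on different data.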

In [570]:
text="hello how are you"
tran_text=vector.transform([text])
model.predict(tran_text)[0]

'English'

In [571]:
text="مرحباً، كيف حالك؟"
tran_text=vector.transform([text])
model.predict(tran_text)[0]

'Arabic'

In [572]:

text="Hej, hvordan har du det?"
tran_text=vector.transform([text])
model.predict(tran_text)[0]

'Danish'

In [573]:

text="नमस्ते, आप कैसे हैं?"
tran_text=vector.transform([text])
model.predict(tran_text)[0]

'Hindi'

In [574]:
 
text="Γεια σου, πώς είσαι;"
tran_text=vector.transform([text])
model.predict(tran_text)[0]

'Greek'

In [575]:
text='ഹലോ, നിങ്ങൾക്ക് എങ്ങനെയുണ്ട്?'
tran_text=vector.transform([text])
model.predict(tran_text)[0]

'Malayalam'

## 📝 Conclusion 📝

- 🎯 **Overall Performance:** On the held-out test set, Logistic Regression reached 0.964 accuracy and 0.970 weighted precision, ahead of the Random Forest at 0.922 and 0.952. Both models score above 0.99 on the training data, so the Random Forest shows the larger gap between training and test performance.

- 🌟 **Conclusion:** TF-IDF features combined with Logistic Regression identify the language of short text inputs well. The saved model labelled all six sample sentences (English, Arabic, Danish, Hindi, Greek, and Malayalam) correctly.

- 🙏 **Thank you for visiting this notebook!** 🙏
