## NLP Language Detection Project Using Classical Models and NN
### 1. Data Collection


This part builds the final dataset from six source files obtained from Hugging Face and Kaggle. The files are merged, relabeled, deduplicated, and filtered by language to produce the final dataset.

In [1]:
## import libraries
import pandas as pd
import numpy as np


In [2]:
## import datasets from hugging face
train= pd.read_csv("train.csv")
test= pd.read_csv("test.csv")
valid= pd.read_csv("valid.csv")


In [3]:
## Merge all three datasets together and rename the label column
merged_data= pd.concat([train, test, valid])
# Reverse the column order so that 'text' comes before the label column
merged_data= merged_data[merged_data.columns[::-1]]
# Rename the 'labels' column to 'language'
merged_data.rename(columns={'labels': 'language'}, inplace=True)
merged_data

Unnamed: 0,text,language
0,"os chefes de defesa da estónia, letónia, lituâ...",pt
1,размерът на хоризонталната мрежа може да бъде ...,bg
2,很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把...,zh
3,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...,th
4,Он увеличил давление .,ru
...,...,...
9995,非常好，盯了很久了，这次算是我看到的最低价了，刚翻了下，非常清晰，比二十四史那一套的效果看着好很多,zh
9996,"Üçüncü SS'e, Üçüncü Stratejik Destek Bölüğüne ...",tr
9997,Louisa May Alcott và Nathaniel Hawthorne sống ...,vi
9998,"Последното изречение... Предполагаме, разбира ...",bg
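
The displayed index stops near 9998 even though the three files together hold far more rows, which suggests that `pd.concat` kept each file's own index and that index labels repeat. This is harmless for the steps in this notebook, but label-based lookups with `.loc` would return several rows. A minimal sketch of the same merge with a fresh index, using small made-up frames in place of `train.csv`, `test.csv`, and `valid.csv` (the toy rows and the name `merged` are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the three Hugging Face splits (hypothetical rows)
train = pd.DataFrame({"labels": ["pt", "bg"], "text": ["olá mundo", "здравей свят"]})
test = pd.DataFrame({"labels": ["ru"], "text": ["привет мир"]})
valid = pd.DataFrame({"labels": ["tr"], "text": ["merhaba dünya"]})

# ignore_index=True discards the per-file indices and numbers the rows 0..n-1
merged = pd.concat([train, test, valid], ignore_index=True)

# Select the columns by name instead of reversing their order, then rename
merged = merged[["text", "labels"]].rename(columns={"labels": "language"})

print(merged.index.is_unique)
print(merged)
```

Selecting columns by name is a small design choice: it keeps working if a source file ever gains an extra column, whereas reversing the column order would not.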


In [4]:

## Rename the language codes to their full names
merged_data['language'] = merged_data['language'].map({
    'pt': 'Portuguese',
    'bg': 'Bulgarian',
    'en': 'English',
    'vi': 'Vietnamese',
    'fr': 'French',
    'nl': 'Dutch',
    'el': 'Greek',
    'de': 'German',
    'hi': 'Hindi',
    'it': 'Italian',
    'ar': 'Arabic',
    'es': 'Spanish',
    'tr': 'Turkish',
    'sw': 'Swahili',
    'ur': 'Urdu',
    'pl': 'Polish',
    'ru': 'Russian',
    'th': 'Thai',
    'zh': 'Chinese',
    'ja': 'Japanese'
})

## display the DataFrame with the renamed language values
merged_data

Unnamed: 0,text,language
0,"os chefes de defesa da estónia, letónia, lituâ...",Portuguese
1,размерът на хоризонталната мрежа може да бъде ...,Bulgarian
2,很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把...,Chinese
3,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...,Thai
4,Он увеличил давление .,Russian
...,...,...
9995,非常好，盯了很久了，这次算是我看到的最低价了，刚翻了下，非常清晰，比二十四史那一套的效果看着好很多,Chinese
9996,"Üçüncü SS'e, Üçüncü Stratejik Destek Bölüğüne ...",Turkish
9997,Louisa May Alcott và Nathaniel Hawthorne sống ...,Vietnamese
9998,"Последното изречение... Предполагаме, разбира ...",Bulgarian
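
`Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, so a language code missing from the mapping would silently lose its label. The counts printed in the next cell show 4500 rows for each of the 20 languages, so nothing was lost here, but a direct check may be useful when the mapping changes. A minimal sketch, assuming a shortened mapping and a made-up code `"xx"`:

```python
import pandas as pd

# Shortened mapping for illustration; the notebook's mapping has 20 entries
code_to_name = {"pt": "Portuguese", "bg": "Bulgarian"}

# "xx" is a made-up code that is deliberately absent from the mapping
codes = pd.Series(["pt", "bg", "xx"], name="language")

names = codes.map(code_to_name)

# Codes whose mapped value is NaN were not covered by the dictionary
unmapped = sorted(codes[names.isna()].unique())
print(unmapped)
```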


In [5]:
import matplotlib.pyplot as plt
# Count the number of rows for each language
language_counts = merged_data['language'].value_counts()

print(language_counts)

language
Portuguese    4500
Bulgarian     4500
English       4500
Vietnamese    4500
French        4500
Dutch         4500
Greek         4500
German        4500
Hindi         4500
Italian       4500
Arabic        4500
Spanish       4500
Turkish       4500
Swahili       4500
Urdu          4500
Polish        4500
Russian       4500
Thai          4500
Chinese       4500
Japanese      4500
Name: count, dtype: int64


In [6]:
## import dataset obtained from Hugging Face
df=pd.read_csv("Summerization_dataset.csv")
df

Unnamed: 0,article,title,article_length
0,उद्योग वाणिज्य तथा आपूर्ति मन्त्री मातृकाप्रसा...,राजनीति गर्ने कर्मचारीलाई नछाड्ने मन्त्री यादव...,496
1,सुदूरपश्चिमको पहाडी जिल्ला बझाङमा आज बिहानै लग...,दुई पटक भूकम्पको धक्का,437
2,बाबुराम भट्टराई नेतृत्वको नयाँ शक्तिले स्थानीय...,प्रदेश # मा नयाँ शक्तिको साख जोगाउने कुरुड को ...,297
3,गोर्खा राष्ट्रिय मुक्ति मोर्चाले दार्जिलिङमा छ...,दार्जिलिङमा छैटौं अनुसूची माग,394
4,आयोजक चिली ## वर्षपछि पहिलो पटक कोपा अमेरिका फ...,चिली कोपा अमेरिकाको फाइनलमा,267
...,...,...,...
285553,घरबेटी र भाडामा बस्नेका बीच सधैँ विवाद भएको सम...,भाडामा बस्नेको गुनासो घरधनीले सुने,352
285554,नेपालको दुर्गम पहाडी गाउँमा जन्मे - हुर्केको ए...,दुर्गम गाउँदेखि राष्ट्रसंघको उपल्लो ओहदासम्मको...,360
285555,एक नम्बर प्रदेश सरकारले पनि संघीय सरकारप्रति स...,प्रदेशका मन्त्रीको आरोप : केन्द्रले प्रदेशको अ...,258
285556,देशभर सञ्चालनमा रहेका ## वटा सार्वजनिक संस्थान...,"## सार्वजनिक संस्थान नाफामा , बढी राजश्व तिर्न...",262


In [7]:
## Build a Nepali DataFrame from the first 6000 articles and rename the column
# .copy() makes new_df independent of df, so adding a column does not trigger SettingWithCopyWarning
new_df = df.head(6000)[["article"]].copy()
new_df["language"] = "Nepali"
new_df = new_df.rename(columns={"article": "text"})
# Display the new DataFrame
new_df

Unnamed: 0,text,language
0,उद्योग वाणिज्य तथा आपूर्ति मन्त्री मातृकाप्रसा...,Nepali
1,सुदूरपश्चिमको पहाडी जिल्ला बझाङमा आज बिहानै लग...,Nepali
2,बाबुराम भट्टराई नेतृत्वको नयाँ शक्तिले स्थानीय...,Nepali
3,गोर्खा राष्ट्रिय मुक्ति मोर्चाले दार्जिलिङमा छ...,Nepali
4,आयोजक चिली ## वर्षपछि पहिलो पटक कोपा अमेरिका फ...,Nepali
...,...,...
5995,पुलिसलाई हराउँदै संकटा दोस्रो स्थानमा उक्लियोस...,Nepali
5996,नेपालका लागि होन्डा मोटरसाइकल तथा स्कुटरको अधि...,Nepali
5997,जिल्लाको माइजोगमाई गाउँपालिका–# सोयाङका मदन लि...,Nepali
5998,डोटी– जिल्ला सदरमुकाम सिलगढीमा तैनाथ नेपाली से...,Nepali


In [8]:
# Save the DataFrame as a CSV file
new_df.to_csv("summarize.csv", index=False)

In [9]:
## merge the two DataFrames
new_merged= pd.concat([merged_data, new_df])
new_merged.head(50)

Unnamed: 0,text,language
0,"os chefes de defesa da estónia, letónia, lituâ...",Portuguese
1,размерът на хоризонталната мрежа може да бъде ...,Bulgarian
2,很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把...,Chinese
3,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...,Thai
4,Он увеличил давление .,Russian
5,"S Jak sobie życzysz: Widzisz, jak Hitler zabij...",Polish
6,اس کے بارے میں ، سفید شادی کی شرح کے بعد سفید ...,Urdu
7,Zabuni ya ushindani pia imekuwa rahisi kwa sif...,Swahili
8,Devasa 12 yüzyıl abbatiale saint-Pierre-Et-Sai...,Turkish
9,موجودہ اثاثوں میں سے ایک کا اضافہ ہو سکتا ہے ۔,Urdu


In [10]:
# Count the number of rows for each language
language_counts = new_merged['language'].value_counts()

print(language_counts)

language
Nepali        6000
Italian       4500
Japanese      4500
English       4500
Vietnamese    4500
French        4500
Dutch         4500
Greek         4500
German        4500
Hindi         4500
Portuguese    4500
Bulgarian     4500
Spanish       4500
Turkish       4500
Swahili       4500
Urdu          4500
Polish        4500
Russian       4500
Thai          4500
Chinese       4500
Arabic        4500
Name: count, dtype: int64


In [11]:
# Check for duplicates based on the 'text' column
duplicates = new_merged[new_merged.duplicated(subset=['text'], keep=False)]

if duplicates.empty:
    print("No duplicates found.")
else:
    print("Duplicates found:")
    print(duplicates)

Duplicates found:
                                                   text    language
2     很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把...     Chinese
160                         Yaaa güzel olmuş güzel alan     Turkish
171   Comboio de Navios para o Transporte de Armas Q...  Portuguese
182   Совет помогает бездомным людям получать еду и ...     Russian
188   Um gato preto e branco deitado sobre um cobertor.  Portuguese
...                                                 ...         ...
9989  उह हाँ मैं कहने वाला था कि मैं उडके चला जाऊंगा...       Hindi
9997  Louisa May Alcott và Nathaniel Hawthorne sống ...  Vietnamese
9998  Последното изречение... Предполагаме, разбира ...   Bulgarian
2181  मागअनुसार प्राप्त नहुँदा नुवाकोटमा दशैँका लागि...      Nepali
4468  मागअनुसार प्राप्त नहुँदा नुवाकोटमा दशैँका लागि...      Nepali

[5484 rows x 2 columns]
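
The row count above comes from `keep=False`, which flags every member of a duplicate group, including the copy that is later kept. It is therefore larger than the number of rows that `drop_duplicates(keep='first')` removes. A minimal sketch of the difference on a made-up frame:

```python
import pandas as pd

# Toy data: "a" appears three times and "b" twice (hypothetical rows)
df = pd.DataFrame({
    "text": ["a", "b", "a", "c", "a", "b"],
    "language": ["Eng", "Fre", "Eng", "Spa", "Eng", "Fre"],
})

# keep=False marks every row that belongs to a duplicate group
all_members = int(df.duplicated(subset=["text"], keep=False).sum())

# keep="first" marks only the later copies, i.e. the rows that get dropped
to_drop = int(df.duplicated(subset=["text"], keep="first").sum())

deduped = df.drop_duplicates(subset=["text"], keep="first")

print(all_members, to_drop, len(deduped))
```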


In [12]:
## drop duplicates, keeping the first occurrence of each text
new_merged.drop_duplicates(subset=['text'], keep='first', inplace=True)
new_merged

Unnamed: 0,text,language
0,"os chefes de defesa da estónia, letónia, lituâ...",Portuguese
1,размерът на хоризонталната мрежа може да бъде ...,Bulgarian
2,很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把...,Chinese
3,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...,Thai
4,Он увеличил давление .,Russian
...,...,...
5995,पुलिसलाई हराउँदै संकटा दोस्रो स्थानमा उक्लियोस...,Nepali
5996,नेपालका लागि होन्डा मोटरसाइकल तथा स्कुटरको अधि...,Nepali
5997,जिल्लाको माइजोगमाई गाउँपालिका–# सोयाङका मदन लि...,Nepali
5998,डोटी– जिल्ला सदरमुकाम सिलगढीमा तैनाथ नेपाली से...,Nepali


In [13]:
# Count the number of rows for each language
language_counts = new_merged['language'].value_counts()

print(language_counts)

language
Nepali        5999
Japanese      4500
German        4499
English       4498
Spanish       4497
Chinese       4490
French        4489
Bulgarian     4306
Vietnamese    4306
Hindi         4306
Arabic        4306
Turkish       4306
Swahili       4305
Greek         4305
Russian       4305
Thai          4305
Urdu          4295
Portuguese    4267
Italian       4260
Dutch         4246
Polish        4237
Name: count, dtype: int64


In [14]:
## shorten each language name to its first three letters
new_merged['language'] = new_merged['language'].str[:3]
# Count the number of rows for each language
language_counts = new_merged['language'].value_counts()
print(language_counts)


language
Nep    5999
Jap    4500
Ger    4499
Eng    4498
Spa    4497
Chi    4490
Fre    4489
Bul    4306
Vie    4306
Hin    4306
Ara    4306
Tur    4306
Swa    4305
Gre    4305
Rus    4305
Tha    4305
Urd    4295
Por    4267
Ita    4260
Dut    4246
Pol    4237
Name: count, dtype: int64
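
Truncating names to three letters is safe here because the 21 names still give 21 distinct labels with unchanged counts. In general, two names that share their first three letters would be merged into one class. A minimal sketch of a collision check, where the helper `prefix_collisions` and the name list are hypothetical ("Malay" does not occur in this project's data and is included only to force a collision):

```python
from collections import defaultdict


def prefix_collisions(names, width=3):
    """Return {prefix: [names]} for prefixes shared by more than one distinct name."""
    groups = defaultdict(set)
    for name in names:
        groups[name[:width]].add(name)
    return {prefix: sorted(group) for prefix, group in groups.items() if len(group) > 1}


# Illustrative names only; "Malay" is made up to demonstrate a collision
names = ["Portuguese", "Polish", "Persian", "Malayalam", "Malay"]
print(prefix_collisions(names))
```

Running such a check on the full names before slicing with `.str[:3]` would flag any pair of languages that the short labels cannot tell apart.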


In [15]:
## import the dataset obtained from Kaggle
dfx= pd.read_csv('dataset.csv')
dfx

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch
...,...,...
21995,hors du terrain les années et sont des année...,French
21996,ใน พศ หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ...,Thai
21997,con motivo de la celebración del septuagésimoq...,Spanish
21998,年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由...,Chinese


In [16]:
## import the dataset obtained from Kaggle
dfy= pd.read_csv("Language Detection.csv")
dfy

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English
...,...,...
10332,ನಿಮ್ಮ ತಪ್ಪು ಏನು ಬಂದಿದೆಯೆಂದರೆ ಆ ದಿನದಿಂದ ನಿಮಗೆ ಒ...,Kannada
10333,ನಾರ್ಸಿಸಾ ತಾನು ಮೊದಲಿಗೆ ಹೆಣಗಾಡುತ್ತಿದ್ದ ಮಾರ್ಗಗಳನ್...,Kannada
10334,ಹೇಗೆ ' ನಾರ್ಸಿಸಿಸಮ್ ಈಗ ಮರಿಯನ್ ಅವರಿಗೆ ಸಂಭವಿಸಿದ ಎ...,Kannada
10335,ಅವಳು ಈಗ ಹೆಚ್ಚು ಚಿನ್ನದ ಬ್ರೆಡ್ ಬಯಸುವುದಿಲ್ಲ ಎಂದು ...,Kannada


In [17]:
# Rename columns of dfx
dfx.rename(columns={'Text': 'text'}, inplace=True)
# Slice the 'language' column values to only include the first three letters
dfx['language'] = dfx['language'].str[:3]
language_counts = dfx['language'].value_counts()

print(language_counts)

language
Est    1000
Swe    1000
Eng    1000
Rus    1000
Rom    1000
Per    1000
Pus    1000
Spa    1000
Hin    1000
Kor    1000
Chi    1000
Fre    1000
Por    1000
Ind    1000
Urd    1000
Lat    1000
Tur    1000
Jap    1000
Dut    1000
Tam    1000
Tha    1000
Ara    1000
Name: count, dtype: int64


In [18]:
# Rename columns of dfy dataframe
dfy.rename(columns={'Text': 'text', 'Language': 'language'}, inplace=True)

# Slice the 'language' column values to only include the first three letters
dfy['language'] = dfy['language'].str[:3]
language_counts = dfy['language'].value_counts()

print(language_counts)

language
Eng    1385
Fre    1014
Spa     819
Por     739
Ita     698
Rus     692
Swe     676
Mal     594
Dut     546
Ara     536
Tur     474
Ger     470
Tam     469
Dan     428
Kan     369
Gre     365
Hin      63
Name: count, dtype: int64


In [19]:
## merge all the DataFrames created so far
m_data= pd.concat([new_merged, dfx, dfy])
m_data

Unnamed: 0,text,language
0,"os chefes de defesa da estónia, letónia, lituâ...",Por
1,размерът на хоризонталната мрежа може да бъде ...,Bul
2,很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把...,Chi
3,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...,Tha
4,Он увеличил давление .,Rus
...,...,...
10332,ನಿಮ್ಮ ತಪ್ಪು ಏನು ಬಂದಿದೆಯೆಂದರೆ ಆ ದಿನದಿಂದ ನಿಮಗೆ ಒ...,Kan
10333,ನಾರ್ಸಿಸಾ ತಾನು ಮೊದಲಿಗೆ ಹೆಣಗಾಡುತ್ತಿದ್ದ ಮಾರ್ಗಗಳನ್...,Kan
10334,ಹೇಗೆ ' ನಾರ್ಸಿಸಿಸಮ್ ಈಗ ಮರಿಯನ್ ಅವರಿಗೆ ಸಂಭವಿಸಿದ ಎ...,Kan
10335,ಅವಳು ಈಗ ಹೆಚ್ಚು ಚಿನ್ನದ ಬ್ರೆಡ್ ಬಯಸುವುದಿಲ್ಲ ಎಂದು ...,Kan


In [20]:
# Check for duplicates based on the 'text' column
duplicates = m_data[m_data.duplicated(subset=['text'], keep=False)]

if duplicates.empty:
    print("No duplicates found.")
else:
    print("Duplicates found:")
    print(duplicates)

Duplicates found:
                                                    text language
166    bisby fa roskov yr orrell tm nicolson d paglin...      Ind
209    inlandsklimat råder i trakten årsmedeltemperat...      Swe
220    تاثيرات ئي پنسلينونه ورته دي داسيدو په مقابل ک...      Pus
410    bisby fa roskov yr orrell tm nicolson d paglin...      Ind
440    aastakümned  aastad  aastad  aastad  aastad  a...      Est
...                                                  ...      ...
10073                                  ನನ್ನನ್ನು ಕ್ಷಮಿಸು.      Kan
10078                                           ಓ ದೇವರೇ.      Kan
10081                                  ನನ್ನನ್ನು ಕ್ಷಮಿಸು.      Kan
10125                                           ಓ ದೇವರೇ.      Kan
10141                                  ನನ್ನನ್ನು ಕ್ಷಮಿಸು.      Kan

[301 rows x 2 columns]
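
Deduplicating on `text` alone keeps whichever label appears first, so a text that carried different labels in different sources would silently keep only one of them. The notebook does not show whether this occurs in these datasets; the following is a minimal sketch of how it could be checked, using made-up rows:

```python
import pandas as pd

# Toy data: "ok" is labeled with two different languages (hypothetical rows)
df = pd.DataFrame({
    "text": ["ok", "ok", "hola", "hola", "bonjour"],
    "language": ["Eng", "Ger", "Spa", "Spa", "Fre"],
})

# Number of distinct labels attached to each text
labels_per_text = df.groupby("text")["language"].nunique()

# Texts that appear under more than one label
conflicting = labels_per_text[labels_per_text > 1].index.tolist()
print(conflicting)
```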


In [21]:
## drop duplicates
m_data.drop_duplicates(subset=['text'], keep='first', inplace=True)
m_data

Unnamed: 0,text,language
0,"os chefes de defesa da estónia, letónia, lituâ...",Por
1,размерът на хоризонталната мрежа може да бъде ...,Bul
2,很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把...,Chi
3,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...,Tha
4,Он увеличил давление .,Rus
...,...,...
10332,ನಿಮ್ಮ ತಪ್ಪು ಏನು ಬಂದಿದೆಯೆಂದರೆ ಆ ದಿನದಿಂದ ನಿಮಗೆ ಒ...,Kan
10333,ನಾರ್ಸಿಸಾ ತಾನು ಮೊದಲಿಗೆ ಹೆಣಗಾಡುತ್ತಿದ್ದ ಮಾರ್ಗಗಳನ್...,Kan
10334,ಹೇಗೆ ' ನಾರ್ಸಿಸಿಸಮ್ ಈಗ ಮರಿಯನ್ ಅವರಿಗೆ ಸಂಭವಿಸಿದ ಎ...,Kan
10335,ಅವಳು ಈಗ ಹೆಚ್ಚು ಚಿನ್ನದ ಬ್ರೆಡ್ ಬಯಸುವುದಿಲ್ಲ ಎಂದು ...,Kan


In [22]:
## display the number of rows per language
language_counts = m_data['language'].value_counts()

print(language_counts)

language
Eng    6880
Fre    6485
Spa    6308
Por    6000
Nep    5999
Rus    5992
Ara    5836
Dut    5783
Tur    5777
Jap    5500
Chi    5490
Hin    5358
Tha    5305
Urd    5295
Ger    4964
Ita    4954
Gre    4663
Bul    4306
Vie    4306
Swa    4305
Pol    4237
Swe    1664
Tam    1445
Kor    1000
Per    1000
Rom    1000
Est     999
Pus     993
Ind     975
Lat     953
Mal     591
Dan     424
Kan     366
Name: count, dtype: int64


In [23]:
# List of languages to exclude from the DataFrame
exclude_languages = ['Pus', 'Ind', 'Lat', 'Mal', 'Dan', 'Kan', 'Chi', 'Jap', 'Kor', 'Per', 'Rom', 'Est', 'Swe', 'Tam']

In [24]:
# Filter the DataFrame to keep rows with languages not in exclude_languages list
filtered_m_data = m_data[~m_data['language'].isin(exclude_languages)]
filtered_m_data

Unnamed: 0,text,language
0,"os chefes de defesa da estónia, letónia, lituâ...",Por
1,размерът на хоризонталната мрежа може да бъде ...,Bul
3,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...,Tha
4,Он увеличил давление .,Rus
5,"S Jak sobie życzysz: Widzisz, jak Hitler zabij...",Pol
...,...,...
9963,narcisa änderte ihre art und weise sie kämpfte...,Ger
9964,Wie' s Narzissmus jetzt erzählt Marian beiden ...,Ger
9965,"Hat sie, ich denke, sie würde jetzt kein Goldb...",Ger
9966,"Terry, du siehst tatsächlich ein bisschen wie ...",Ger


In [25]:
language_counts = filtered_m_data['language'].value_counts()
print(language_counts)

language
Eng    6880
Fre    6485
Spa    6308
Por    6000
Nep    5999
Rus    5992
Ara    5836
Dut    5783
Tur    5777
Hin    5358
Tha    5305
Urd    5295
Ger    4964
Ita    4954
Gre    4663
Bul    4306
Vie    4306
Swa    4305
Pol    4237
Name: count, dtype: int64


In [26]:
## check the count of languages
len(language_counts)

19

In [27]:
## display the final DataFrame
filtered_m_data

Unnamed: 0,text,language
0,"os chefes de defesa da estónia, letónia, lituâ...",Por
1,размерът на хоризонталната мрежа може да бъде ...,Bul
3,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...,Tha
4,Он увеличил давление .,Rus
5,"S Jak sobie życzysz: Widzisz, jak Hitler zabij...",Pol
...,...,...
9963,narcisa änderte ihre art und weise sie kämpfte...,Ger
9964,Wie' s Narzissmus jetzt erzählt Marian beiden ...,Ger
9965,"Hat sie, ich denke, sie würde jetzt kein Goldb...",Ger
9966,"Terry, du siehst tatsächlich ein bisschen wie ...",Ger


In [28]:
# Check for duplicates based on the 'text' column
duplicates1 = filtered_m_data[filtered_m_data.duplicated(subset=['text'], keep=False)]

if duplicates1.empty:
    print("No duplicates found.")
else:
    print("Duplicates found:")
    print(duplicates1)

No duplicates found.


In [29]:
# Define the file path for the final CSV file
NLP_csv_path = "NLPdata.csv"
# Save the final dataset to a CSV file (index=False omits the repeated index labels)
filtered_m_data.to_csv(NLP_csv_path, index=False)
print("Training data saved to:", NLP_csv_path)



Training data saved to: NLPdata.csv
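
Since the texts span many scripts, it can be worth confirming that the file round-trips cleanly before it is used elsewhere. A minimal sketch, assuming a small made-up frame and a temporary directory in place of the real `NLPdata.csv`:

```python
import os
import tempfile

import pandas as pd

# Toy frame with non-ASCII text, standing in for the final dataset (hypothetical rows)
df = pd.DataFrame({
    "text": ["hello world", "hola mundo", "привет мир"],
    "language": ["Eng", "Spa", "Rus"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "NLPdata.csv")
    # Write and read back with an explicit encoding so the result does not depend on platform defaults
    df.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

print(list(reloaded.columns))
print(reloaded["text"].tolist() == df["text"].tolist())
```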


The DataFrame is saved as NLPdata.csv, which part 2 uses to build the classical models.

In [30]:
## see the classical-model notebook for the rest of the work