<a href="https://colab.research.google.com/github/Tiru-Kaggundi/Adv_ML_project/blob/main/LM_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this milestone, we identify which of five Indian languages (Hindi, Malayalam, Kannada, Tamil, or Bengali) a given text is written in. For this we build a Bag of Words representation for each language. The steps we aim to implement are:

1. Collate the dataset for each language. We will use the Hindi, Kannada, Tamil, Malayalam, and Bengali Wikipedia articles to train our model.
2. Use publicly available tokenizers (or generate a new one if possible, as a stretch goal) to create a Bag of Words.
3. Create a Bag of Words classifier using linear layers.
4. Train the classifier.
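To make the Bag of Words idea concrete, here is a minimal sketch that turns a text into a count vector over a fixed vocabulary. The function name `bag_of_words` and the toy vocabulary are assumptions for illustration only; the notebook itself relies on torchtext for this step.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return a count vector for `text` over a fixed, ordered vocabulary."""
    counts = Counter(text.split())  # whitespace tokenization
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary mixing words from several languages
vocabulary = ["और", "से", "ಹಾಗೂ", "এই"]
vector = bag_of_words("राम और श्याम और मोहन से", vocabulary)
print(vector)  # [2, 1, 0, 0]
```

Word order is discarded: only the number of times each vocabulary word occurs is kept.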

For language identification (LI), a lot of work has been done on languages that use Latin as the predominant script. Some work has also been done on identifying Chinese, Japanese, and Korean. Jauhiainen et al. have surveyed the literature on language identification in detail. Kerwin (2006) used character frequencies as feature vectors. In a feature vector, each feature f has its own integer value. The raw frequency and the relative frequency of each feature are calculated for each language. For our project, we use a very similar approach.
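As a rough illustration of raw versus relative frequency, the sketch below counts characters in a short string. The helper name `char_frequencies` and the choice to ignore whitespace are assumptions made here, not details taken from the cited work.

```python
from collections import Counter

def char_frequencies(text):
    """Return (raw, relative) character frequencies, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    raw = Counter(chars)                    # raw frequency: occurrence counts
    total = sum(raw.values())
    relative = {c: n / total for c, n in raw.items()}  # share of all characters
    return raw, relative

raw, relative = char_frequencies("abca ab")
print(raw["a"], relative["a"])  # 3 0.5
```

The relative frequencies sum to 1, which makes texts of different lengths comparable.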

There are several intuitive techniques that are used to classify languages:
- Position of words: Kumar et al. (2015) used the position of the current word in word-level LI.
- Dictionary of unique words: unique word dictionaries include only those words of a language that do not belong to the other languages targeted by the language identifier.
- Discriminating words: Kolkus (2009) used the most relevant words for each language.
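The unique-word dictionary technique can be sketched in a few lines. This is an illustrative toy under stated assumptions (whitespace tokenization, a hypothetical `corpora` mapping from language code to sentences), not the method of any cited paper.

```python
def unique_word_dictionaries(corpora):
    """Map each language to the words that occur in no other language."""
    vocab = {lang: {w for s in sentences for w in s.split()}
             for lang, sentences in corpora.items()}
    unique = {}
    for lang, words in vocab.items():
        # Union of the vocabularies of all other languages
        others = set().union(*(v for other, v in vocab.items() if other != lang))
        unique[lang] = words - others
    return unique

# Hypothetical two-language corpus; "IST" appears in both and is dropped
corpora = {"hi": ["यह IST है"], "kn": ["ಇದು IST"]}
unique = unique_word_dictionaries(corpora)
print(unique["kn"])  # {'ಇದು'}
```

Shared tokens such as Latin-script abbreviations fall out of every dictionary, so only language-specific words remain.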

What we do: 
1. Collect free datasets from various places (Wikipedia articles [from Kaggle], AI4Bharat, indltk, etc.; a reference/citation is placed wherever a dataset is used). This proved to be a major task, as described in the report.
2. Try out various tokenizers and compare their efficacy on the prediction task.
3. Prepare and train a model.
4. Predict and evaluate the performance, mainly to see how a simple tokenizer performs against customized tokenizers trained by earlier teams (Google + IIT Madras, indltk, etc.).



In [1]:
!pip install indic-nlp-library
!pip install torch
!pip install torchtext
!pip install spacy
!pip install torchdata

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting indic-nlp-library
  Downloading indic_nlp_library-0.81-py3-none-any.whl (40 kB)
Collecting sphinx-argparse
  Downloading sphinx_argparse-0.3.1-py2.py3-none-any.whl (12 kB)
Collecting morfessor
  Downloading Morfessor-2.0.6-py3-none-any.whl (35 kB)
Collecting sphinx-rtd-theme
  Downloading sphinx_rtd_theme-1.0.0-py2.py3-none-any.whl (2.8 MB)
Installing collected packages: sphinx-rtd-theme, sphinx-argparse, morfessor, indic-nlp-library
Successfully installed indic-nlp-library-0.81 morfessor-2.0.6 sphinx-argparse-0.3.1 sphinx-rtd-theme-1.0.0

In [2]:
import sys
import os
from google.colab import drive 
drive.mount('/content/gdrive', force_remount=True)
%cd /content/gdrive/MyDrive/AML_Project/
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/AML_Project"

Mounted at /content/gdrive
/content/gdrive/MyDrive/AML_Project


In [3]:
# This command is to unmount gdrive - need it sometimes
#!fusermount -u gdrive

In [4]:

import torch
import torchtext
import spacy
from torchtext import data, datasets
from torchtext.vocab import Vectors
from torch.nn import init
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
from torchtext.datasets import AG_NEWS
from torch.utils.data import Dataset, DataLoader


Citation for datasets: @inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}


@article{kunchukuttan2020indicnlpcorpus,
    title={AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages},
    author={Anoop Kunchukuttan and Divyanshu Kakwani and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    journal={arXiv preprint},
}




Trials for data download, cleaning and some experiments in tokenizers

Reference: https://nbviewer.org/url/anoopkunchukuttan.github.io/indic_nlp_library/doc/indic_nlp_examples.ipynb

In [5]:
from indicnlp.tokenize import sentence_tokenize

indic_string="""तो क्या विश्व कप 2019 में मैच का बॉस टॉस है? यानी मैच में हार-जीत में \
टॉस की भूमिका अहम है? आप ऐसा सोच सकते हैं। विश्वकप के अपने-अपने पहले मैच में बुरी तरह हारने वाली एशिया की दो टीमों \
पाकिस्तान और श्रीलंका के कप्तान ने हालांकि अपने हार के पीछे टॉस की दलील तो नहीं दी, लेकिन यह जरूर कहा था कि वह एक अहम टॉस हार गए थे।"""
sentences=sentence_tokenize.sentence_split(indic_string, lang='hi')
for t in sentences:
    print(t)

तो क्या विश्व कप 2019 में मैच का बॉस टॉस है?
यानी मैच में हार-जीत में टॉस की भूमिका अहम है?
आप ऐसा सोच सकते हैं।
विश्वकप के अपने-अपने पहले मैच में बुरी तरह हारने वाली एशिया की दो टीमों पाकिस्तान और श्रीलंका के कप्तान ने हालांकि अपने हार के पीछे टॉस की दलील तो नहीं दी, लेकिन यह जरूर कहा था कि वह एक अहम टॉस हार गए थे।


In [6]:
from indicnlp.tokenize import indic_tokenize  

indic_string='सुनो, कुछ आवाज़ आ रही है। फोन?'

print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string): 
    print(t)

Input String: सुनो, कुछ आवाज़ आ रही है। फोन?
Tokens: 
सुनो
,
कुछ
आवाज़
आ
रही
है
।
फोन
?
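For comparison, a tokenizer with similar behaviour on this sentence can be sketched with a regular expression. This is a hypothetical stand-in and not the implementation used by `indic_tokenize.trivial_tokenize`; the punctuation set in `PUNCT` is an assumption and is far from complete.

```python
import re

# Hypothetical punctuation set: the danda (\u0964), double danda (\u0965),
# and a few ASCII marks.
PUNCT = r"\u0964\u0965,.?!;:"
# A token is either a run of non-space, non-punctuation characters
# or a single punctuation mark.
TOKEN_RE = re.compile(rf"[^\s{PUNCT}]+|[{PUNCT}]")

def simple_tokenize(text):
    """Split text into words and standalone punctuation marks."""
    return TOKEN_RE.findall(text)

tokens = simple_tokenize("सुनो, कुछ आवाज़ आ रही है। फोन?")
for t in tokens:
    print(t)
```

Combining marks such as vowel signs are neither whitespace nor listed punctuation, so they stay attached to their words.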


In [7]:
!kaggle datasets list -s 'hindi-wikipedia'

ref                                                        title                                               size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------------------  -------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
disisbig/hindi-wikipedia-articles-172k                     Hindi Wikipedia Articles - 172k                    208MB  2019-12-24 05:01:30            395         19  0.6875           
zarajamshaid/language-identification-datasst               Language Identification dataset                      6MB  2018-12-19 18:12:23           2125         36  1.0              
disisbig/hindi-wikipedia-articles-55k                      Hindi Wikipedia Articles - 55k                      63MB  2019-12-24 04:31:32            202          8  0.625            
shivavashishtha/shark-tank-india-dataset                   Shark Tank India Dataset       

In [8]:
#!kaggle datasets download -d 'disisbig/hindi-wikipedia-articles-55k'

The Hindi dataset now sits in the folder as a zip file at '/content/drive/MyDrive/Adv_ML_Project/hindi-wikipedia-articles-55k.zip'

In [9]:
# Don't use this code: it is very slow.
# from zipfile import ZipFile

# with ZipFile('hindi-wikipedia-articles-55k.zip', 'r') as zipObj:
#    # Extract all the contents of the zip file into the current directory
#    zipObj.extractall()
# The code above is extremely slow. It takes 20 hours to extract.

In [10]:
#!unzip 'hindi-wikipedia-articles-55k.zip' # This is relatively fast

In [11]:
!kaggle datasets list -s 'bengali-wikipedia'

ref                                            title                                   size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------  -------------------------------------  -----  -------------------  -------------  ---------  ---------------  
disisbig/bengali-wikipedia-articles            Bengali Wikipedia Articles             105MB  2019-12-25 04:35:21            138          6  0.5              
nrkapri/rabindranath-tagore-online-variorum    Rabindranath Tagore Online Variorum    173MB  2020-06-07 20:04:16             50          9  1.0              
nbroad/muril-large-pt                          MuRIL Large pt                           2GB  2021-10-16 14:35:24             61         12  0.5              
aagalib/complete-works-of-rabindranath-tagore  Complete Works of Rabindranath Tagore   43MB  2022-01-04 16:05:23             28          8  0.9411765        
zzy990106/murilbasecased                       muril

In [12]:
#!kaggle datasets download -d 'disisbig/bengali-wikipedia-articles'

In [13]:
#!unzip 'bengali-wikipedia-articles.zip' # This is relatively fast

In [14]:
!kaggle datasets list -s 'kannada-wikipedia'

ref                                            title                           size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------  -----------------------------  -----  -------------------  -------------  ---------  ---------------  
disisbig/kannada-wikipedia-articles            Kannada Wikipedia Articles      81MB  2019-12-25 04:38:53            168          1  0.5              
nbroad/muril-large-pt                          MuRIL Large pt                   2GB  2021-10-16 14:35:24             61         12  0.5              
zzy990106/murilbasecased                       muril-base-cased                 2GB  2021-08-21 16:36:07             26          8  0.6875           
sizlingdhairya1/iiit-spoken-language-datasets  IIIT Spoken Language Datasets  920MB  2022-05-02 08:37:28              2          0  0.3125           
nbroad/muril-large-tf                          MuRIL Large tf                   2GB  2021-10-16 14:2

In [15]:
!kaggle datasets list -s 'malayalam-wikipedia'

ref                                            title                           size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------  -----------------------------  -----  -------------------  -------------  ---------  ---------------  
disisbig/malayalam-wikipedia-articles          Malayalam Wikipedia Articles    19MB  2019-12-25 04:39:06            121          6  0.5              
nbroad/muril-large-pt                          MuRIL Large pt                   2GB  2021-10-16 14:35:24             61         12  0.5              
zzy990106/murilbasecased                       muril-base-cased                 2GB  2021-08-21 16:36:07             26          8  0.6875           
sizlingdhairya1/iiit-spoken-language-datasets  IIIT Spoken Language Datasets  920MB  2022-05-02 08:37:28              2          0  0.3125           
nbroad/muril-large-tf                          MuRIL Large tf                   2GB  2021-10-16 14:2

In [16]:
!kaggle datasets list -s 'tamil-wikipedia'

ref                                                title                                       size  lastUpdated          downloadCount  voteCount  usabilityRating  
-------------------------------------------------  -----------------------------------------  -----  -------------------  -------------  ---------  ---------------  
disisbig/tamil-wikipedia-articles                  Tamil Wikipedia Articles                    95MB  2019-12-25 04:30:22            172         12  0.64705884       
jrobischon/wikipedia-movie-plots                   Wikipedia Movie Plots                       30MB  2018-10-15 19:59:54          14602        432  0.88235295       
sudalairajkumar/tamil-nlp                          Tamil NLP                                    3MB  2019-03-11 06:29:11           1287        101  1.0              
zarajamshaid/language-identification-datasst       Language Identification dataset              6MB  2018-12-19 18:12:23           2125         36  1.0              
prav

The IndicNLP news articles dataset is a collection of news articles in various Indian languages.
Ref: https://github.com/AI4Bharat/indicnlp_corpus#indicnlp-news-article-classification-dataset

In [17]:
train_kn = pd.read_csv('/content/gdrive/MyDrive/AML_Project/kn/kn-train.csv')
test_kn = pd.read_csv('/content/gdrive/MyDrive/AML_Project/kn/kn-test.csv')
valid_kn = pd.read_csv('/content/gdrive/MyDrive/AML_Project/kn/kn-valid.csv')

In [18]:
# Can we use spaCy to tokenize Indic languages? Yes
from spacy.lang.kn import Kannada
nlp_kn = Kannada()
print(nlp_kn.lang, [token.is_stop for token in nlp_kn("ಇಂಡೀಸ್‌ ಹಲವು")])
doc = nlp_kn("ವಿರುದ್ಧ ಮೂರು ಪಂದ್ಯಗಳ ಟಿ–20 ಸರಣಿಯಲ್ಲಿ ಆಡುವ ಮೊದಲ ಎರಡು ಪಂದ್ಯಗಳಿಗೆ ವೆಸ್ಟ್‌ ")
for token in doc:
  print(token)

kn [False, True]
ವಿರುದ್ಧ
ಮೂರು
ಪಂದ್ಯಗಳ
ಟಿ–20
ಸರಣಿಯಲ್ಲಿ
ಆಡುವ
ಮೊದಲ
ಎರಡು
ಪಂದ್ಯಗಳಿಗೆ
ವೆಸ್ಟ್‌


In [19]:
#Let's try tokenizer designed by Google and IIT Madras. It works! 
from indicnlp.tokenize import indic_tokenize 
kannada_string = "ವಿರುದ್ಧ ಮೂರು ಪಂದ್ಯಗಳ ಟಿ–20 ಸರಣಿಯಲ್ಲಿ ಆಡುವ ಮೊದಲ ಎರಡು ಪಂದ್ಯಗಳಿಗೆ ವೆಸ್ಟ್‌ "
for t in indic_tokenize.trivial_tokenize(kannada_string): 
    print(t)

ವಿರುದ್ಧ
ಮೂರು
ಪಂದ್ಯಗಳ
ಟಿ–20
ಸರಣಿಯಲ್ಲಿ
ಆಡುವ
ಮೊದಲ
ಎರಡು
ಪಂದ್ಯಗಳಿಗೆ
ವೆಸ್ಟ್‌


In [20]:
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import vocab

In [21]:
# How about just splitting on spaces based on torchtext basic tokenizer
tokenizer_basic = get_tokenizer(None)
kannada_string = "ವಿರುದ್ಧ ಮೂರು ಪಂದ್ಯಗಳ ಟಿ–20 ಸರಣಿಯಲ್ಲಿ ಆಡುವ ಮೊದಲ ಎರಡು ಪಂದ್ಯಗಳಿಗೆ ವೆಸ್ಟ್‌"
for t in tokenizer_basic(kannada_string): 
    print(t)

ವಿರುದ್ಧ
ಮೂರು
ಪಂದ್ಯಗಳ
ಟಿ–20
ಸರಣಿಯಲ್ಲಿ
ಆಡುವ
ಮೊದಲ
ಎರಡು
ಪಂದ್ಯಗಳಿಗೆ
ವೆಸ್ಟ್‌


MILESTONE 1: LOAD THE DATASETS OF VARIOUS LANGUAGES

In [22]:
# Load Hindi train and test
train_hi = pd.read_csv('/content/gdrive/MyDrive/AML_Project/hi/hi-train.csv', names = ["label", "text"], header=None)
train_hi['lang'] = "hi"
test_hi = pd.read_csv('/content/gdrive/MyDrive/AML_Project/hi/hi-test.csv', names = ["label", "text"], header=None)
test_hi['lang'] = "hi"
train_hi

Unnamed: 0,label,text,lang
0,india,मेट्रो की इस लाइन के चलने से दक्षिणी दिल्ली से...,hi
1,pakistan,नेटिजन यानि इंटरनेट पर सक्रिय नागरिक अब ट्विटर...,hi
2,news,इसमें एक फ़्लाइट एटेनडेंट की मदद की गुहार है औ...,hi
3,india,"प्रतीक खुलेपन का, आज़ाद ख्याली का और भीड़ से अ...",hi
4,india,ख़ासकर पिछले 10 साल तक प्रधानमंत्री रहे मनमोहन...,hi
...,...,...,...
3462,india,जैसे ही उन्हें पता चलता है कि कोई व्यक्ति परेश...,hi
3463,india,जैसे ही सदन की कार्यवाही शुरू हुई तमिलनाडु की ...,hi
3464,news,चीन ने पिछले हफ़्ते अप्रत्यक्ष रूप से भारत को ...,hi
3465,entertainment,मुक्ता आर्ट्स की 'कांची' कहानी है एक ख़ूबसूरत ...,hi


In [23]:
# Load Kannada train and test
train_kn = pd.read_csv('/content/gdrive/MyDrive/AML_Project/kn/kn-train.csv', names = ["label", "text"],header=None)
train_kn['lang'] = "kn"
test_kn = pd.read_csv('/content/gdrive/MyDrive/AML_Project/kn/kn-test.csv', names = ["label", "text"],header=None)
test_kn['lang'] = "kn"
train_kn

Unnamed: 0,label,text,lang
0,sports,Samsung Galaxy M30s: ಲಭ್ಯವಾಗಲಿದೆ ಶಕ್ತಿಶಾಲಿ ಸ್ಮ...,kn
1,sports,ಯುವೆಂಟಸ್ ತಂಡ ಸೇರಿದ ಕ್ರಿಸ್ಟಿಯಾನೊ ರೊನಾಲ್ಡೊ\nಯುವೆ...,kn
2,sports,"ಬ್ಯಾಡ್ಮಿಂಟನ್‌: ಎರಡನೇ ಸುತ್ತಿಗೆ ಸಿಂಧು, ಸಮೀರ್‌\nಬ...",kn
3,entertainment,ತಮ್ಮದು ಒಪ್ಪಿತ ʼಸಂಬಂಧʼ ಎಂದು ಹೇಳಿದ ನಟ\n 12-05-2...,kn
4,sports,ಖ್ಯಾತ ಕ್ರಿಕೆಟಿಗನ ಪತ್ನಿ ನಿಧನ\n 31-12-2018 7:57...,kn
...,...,...,...
23995,sports,ಏಕದಿನ ಕ್ರಿಕೆಟ್‌: ಬ್ಯಾಟಿಂಗ್‌ ಆರಂಭಿಸಿದ ವಿಂಡೀಸ್‌ಗ...,kn
23996,sports,"Chennai, First Published 23, Aug 2019, 10:26 A...",kn
23997,sports,ಬೆಳ್ಳಿ ಪದಕ ಗೆದ್ದರೂ ಟೀಂಇಂಡಿಯಾ ಹಾಕಿ ಕೋಚ್ ನಿರಾಸೆಗ...,kn
23998,lifestyle,ಸೆನ್ಸಿಟಿವ್ ಹಲ್ಲುಗಳ ಸಮಸ್ಯೆ ಇದೆಯೇ? ಈ ಮನೆಮದ್ದುಗಳನ...,kn


In [24]:
# Load Malayalam train and test
train_ml = pd.read_csv('/content/gdrive/MyDrive/AML_Project/ml/ml-train.csv',names = ["label", "text"], header=None)
train_ml['lang'] = "ml"
test_ml = pd.read_csv('/content/gdrive/MyDrive/AML_Project/ml/ml-test.csv',names = ["label", "text"], header=None)
test_ml['lang'] = "ml"
train_ml

Unnamed: 0,label,text,lang
0,sports,മത്സര പ്രതിഫലമായി സ്വന്തമാക്കിയത് പതിനേഴ് ദശലക...,ml
1,sports,ഇന്ത്യൻ പ്രീമിയർ ലീഗിൽ കൊൽക്കത്ത നൈറ്റ് റൈഡേഴ്...,ml
2,entertainment,സിനിമാ മേഖലയിൽ ഇപ്പോൾ ബയോപിക്കുകളുടെ കാലമാണ് ....,ml
3,business,കോഴിയിറച്ചിക്ക് കിലോക്ക് 87രൂപ നിശ്ചയിച്ച സംസ്...,ml
4,sports,ട്വന്റി - 20 ലോകകപ്പിൽ സൂപ്പർ ടെന്നിലെ ഇന്ത്യാ...,ml
...,...,...,...
4795,business,സ്റ്റേറ്റ് ബാങ്കിൻറെ കറൻസി അഡ്മിനിസ്ട്രേറ്റീവ്...,ml
4796,entertainment,മോഹൻലാൽ നായകനായ ഒടിയൻ എന്ന ചിത്രത്തിന്റ കഥ അമേ...,ml
4797,technology,ജനുവരിയിൽ വാവ്വേ ഇന്ത്യയിൽ അവതരിപ്പിച്ച ഹോണർ 6...,ml
4798,entertainment,തൻറെ ഏറ്റവും പുതിയ ചിത്രം സീറോയുടെ തിരക്കിലാണ്...,ml


In [25]:
# Load Bengali train and test
train_bn = pd.read_csv('/content/gdrive/MyDrive/AML_Project/bn/bn-train.csv',names = ["label", "text"], header=None)
train_bn['lang'] = "bn"
test_bn = pd.read_csv('/content/gdrive/MyDrive/AML_Project/bn/bn-test.csv',names = ["label", "text"], header=None)
test_bn['lang'] = "bn"
train_bn

Unnamed: 0,label,text,lang
0,entertainment,শরদিন্দু বন্দ্যোপাধ্যাযের গল্প ‘মগ্ন মৈনাক’ নি...,bn
1,sports,# লন্ডনঃ সচিন পুত্র অর্জুন তেন্ডুলকর এখন ক্রিক...,bn
2,sports,জাতীয় দলের নির্বাচনে কি রাজনীতির ছোঁয়া থাকে ...,bn
3,sports,আইপিএলের ফাইনালে ফের একবার ‘এল ক্ল্যাসিকো’ । র...,bn
4,entertainment,# মুম্বইঃ অবেশেষে মুক্তি পেল এ বছরের সবচেয়ে ব...,bn
...,...,...,...
11195,sports,জাতীয় দলের নির্বাচনে কি রাজনীতির ছোঁয়া থাকে ...,bn
11196,entertainment,এবার বলিউডে জুটি বাঁধলেন রণবীর সিং ও আলিয়া ভা...,bn
11197,entertainment,"# মুম্বইঃ না , আর কিছু বাদ রাখলেন না অভিনেতা শ...",bn
11198,sports,জাতীয় দলের নির্বাচনে কি রাজনীতির ছোঁয়া থাকে ...,bn


In [26]:
# Load Tamil train and test
train_ta = pd.read_csv('/content/gdrive/MyDrive/AML_Project/ta/ta-train.csv',names = ["label", "text"], header=None)
train_ta['lang'] = "ta"
test_ta = pd.read_csv('/content/gdrive/MyDrive/AML_Project/ta/ta-test.csv',names = ["label", "text"], header=None)
test_ta['lang'] = "ta"
print(train_ta.shape)
print(test_ta.shape)
train_ta

(9360, 3)
(1170, 3)


Unnamed: 0,label,text,lang
0,entertainment,கட் அவுட்டிற்கு பாலாபிஷேகம் செய்து வந்த காலமெல...,ta
1,entertainment,தமிழ் சினிமாவில் குணச்சித்திர நடிகராக அறிமுகமா...,ta
2,entertainment,மறைந்த முன்னாள் முதலமைச்சர் ஜெயலலிதாவின் மறைவி...,ta
3,politics,அமைச்சர் விஜய பாஸ்கரை பதவி நீக்கம் செய்ய வேண்ட...,ta
4,politics,ஆர்கே . நகர் இடைத்தேர்தல் களம் சூடுபிடித்துள்ள...,ta
...,...,...,...
9355,entertainment,எம் . ஜி . ஆர் காலத்திலிருந்து விமல் காலம் வரை...,ta
9356,politics,கப்பல்கள் மோதிய விவகாரத்தில் கடந்த 8 நாட்களாக ...,ta
9357,entertainment,பிக்பாஸ் நிகழ்ச்சியில் தற்போது தலைவியாக இருக்க...,ta
9358,entertainment,பரதன் இயக்கத்தில் விஜய் நடித்துள்ள அவருடைய 60வ...,ta


In [27]:
# Language classification task starts
# Merge the train and test datasets, keeping only the text and lang columns
train_all = pd.concat([train_hi, train_bn, train_kn, train_ml, train_ta], ignore_index=True)
test_all = pd.concat([test_hi, test_bn, test_kn, test_ml, test_ta], ignore_index=True)
train_lang = train_all[["lang", "text"]]
test_lang = test_all[["lang", "text"]]

In [28]:
train_lang.shape

(52827, 2)

In [29]:
test_lang.shape

(7036, 2)

In [30]:
train_lang

Unnamed: 0,lang,text
0,hi,मेट्रो की इस लाइन के चलने से दक्षिणी दिल्ली से...
1,hi,नेटिजन यानि इंटरनेट पर सक्रिय नागरिक अब ट्विटर...
2,hi,इसमें एक फ़्लाइट एटेनडेंट की मदद की गुहार है औ...
3,hi,"प्रतीक खुलेपन का, आज़ाद ख्याली का और भीड़ से अ..."
4,hi,ख़ासकर पिछले 10 साल तक प्रधानमंत्री रहे मनमोहन...
...,...,...
52822,ta,எம் . ஜி . ஆர் காலத்திலிருந்து விமல் காலம் வரை...
52823,ta,கப்பல்கள் மோதிய விவகாரத்தில் கடந்த 8 நாட்களாக ...
52824,ta,பிக்பாஸ் நிகழ்ச்சியில் தற்போது தலைவியாக இருக்க...
52825,ta,பரதன் இயக்கத்தில் விஜய் நடித்துள்ள அவருடைய 60வ...


In [31]:
test_lang

Unnamed: 0,lang,text
0,hi,बुधवार को राज्य सभा में विपक्ष के सवालों के जव...
1,hi,लखनऊ स्थित पत्रकार समीरात्मज मिश्र को बुलंदशहर...
2,hi,लगभग 1300 हेक्टेयर ज़मीन का अधिग्रहण किया जा च...
3,hi,हालांकि उनके अंगरक्षकों को बमों को जाम करने वा...
4,hi,आयोग का कहना है कि इस तरह के परीक्षण से महिलाओ...
...,...,...
7031,ta,இந்திய வீரர் தினேஷ் கார்த்திக் டக் அவுட்டில் ப...
7032,ta,இந்திய கிரிக்கெட் வீரர் முகமது ஷமி சென்ற கார் ...
7033,ta,ஷாருக் கான் நடிப்பில் வெளியாகி மிகப்பெரிய வெற்...
7034,ta,"கிராமத்தையே பார்க்காத ஸ்டாலின் , தற்போது ஊர் ஊ..."


In [32]:
#Ref https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer(None)  # Whitespace split; swap in any of the tokenizers above to experiment

def yield_tokens(df):
    # Yield one token list per row of the "text" column
    for text in df["text"]:
        yield tokenizer(text)

# Without min_freq, the vocab grows to around a million words and
# everything gets quite slow.
vocab = build_vocab_from_iterator(yield_tokens(train_lang), min_freq=10, specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

In [33]:
len(vocab)

93763

In [34]:
vocab.get_itos()[10:22]

['और', 'से', 'को', 'ಹಾಗೂ', 'এই', 'IST', 'ಎಂದು', 'कि', "'", 'का', 'ने', 'ಒಂದು']
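The effect of `min_freq` and of the `<unk>` default index can be illustrated without torchtext. The sketch below is an assumption-laden toy (the helper names `build_vocab` and `encode` are made up here, and the ordering rule may differ from torchtext's), meant only to show the idea.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, unk="<unk>"):
    """Return (itos, stoi) for tokens seen at least `min_freq` times; `unk` gets index 0."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, then alphabetically, for a deterministic order
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = [unk] + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def encode(tokens, stoi, unk="<unk>"):
    """Map tokens to indices; anything outside the vocabulary maps to `unk`."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

docs = [["a", "b", "a"], ["a", "c", "b"]]
itos, stoi = build_vocab(docs, min_freq=2)
print(itos)                           # ['<unk>', 'a', 'b']
print(encode(["a", "c", "z"], stoi))  # [1, 0, 0]
```

Here "c" occurs only once, so it is dropped and later encoded exactly like the never-seen "z".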

In [35]:
# Ref: generated text and label processing pipelines
language_dict = {'hi':0, 'kn':1, 'ta':2, 'ml':3, 'bn':4 } #language dictionary to convert lang to int
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(language_dict[x])
# print(text_pipeline('here is tamil கப்பல்கள் மோதிய விவகாரத்தில் கடந்த'))
# print(label_pipeline('kn'))

In [36]:
# create custom dataset class
#https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00
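class CustomTextDataset(Dataset):
    def __init__(self, df):
        # Column 0 holds the language label, column 1 the text
        self.labels = df.iloc[:, 0]
        self.text = df.iloc[:, 1]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # IndexError (rather than StopIteration) is the signal that ends
        # iteration over a sequence-like object
        if idx >= len(self):
            raise IndexError(idx)
        # Positional access, so it works whatever the DataFrame index is
        label = self.labels.iloc[idx]
        text = self.text.iloc[idx]
        return label, text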
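A map-style dataset needs only `__len__` and `__getitem__`. The sketch below uses plain lists instead of a DataFrame (an assumption made for brevity; `ToyDataset` is a hypothetical name) to show that Python's fallback iteration ends when `__getitem__` raises `IndexError`.

```python
class ToyDataset:
    """Map-style dataset over parallel lists of labels and texts."""

    def __init__(self, labels, texts):
        self.labels = list(labels)
        self.texts = list(texts)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Lists raise IndexError past the end, which also ends a plain `for` loop
        return self.labels[idx], self.texts[idx]

ds = ToyDataset(["hi", "kn"], ["नमस्ते", "ನಮಸ್ಕಾರ"])
pairs = list(ds)  # iterates via __getitem__(0), __getitem__(1), ...
print(pairs)
```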

In [37]:
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    # Flatten all texts of the batch into one 1-D tensor;
    # offsets record where each text starts in that tensor.
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Drop the last length, then take the cumulative sum to get start positions
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = CustomTextDataset(train_lang)
dataloader = DataLoader(train_iter, batch_size=32, shuffle=False, collate_fn=collate_batch)
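The offsets logic in `collate_batch` is easier to follow on plain lists. The sketch below mirrors the same steps (start from `[0]`, append each length, drop the last entry, take the cumulative sum); the function name `flatten_with_offsets` and the toy batch are assumptions for illustration.

```python
from itertools import accumulate

def flatten_with_offsets(sequences):
    """Concatenate token-id lists and return the start offset of each one."""
    flat = [tok for seq in sequences for tok in seq]
    lengths = [len(seq) for seq in sequences]
    # Same recipe as collate_batch: [0] + lengths, cumulative sum, drop the last entry
    offsets = list(accumulate([0] + lengths))[:-1]
    return flat, offsets

batch = [[5, 9, 2], [7], [4, 4]]  # hypothetical token ids for three texts
flat, offsets = flatten_with_offsets(batch)
print(flat)     # [5, 9, 2, 7, 4, 4]
print(offsets)  # [0, 3, 4]
```

Text `i` occupies `flat[offsets[i]:offsets[i + 1]]`, and the last text runs to the end of `flat`.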

In [38]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
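`nn.EmbeddingBag` in its default `mean` mode averages the embedding rows of each text. The pure-Python sketch below reproduces that pooling on nested lists, assuming every bag is non-empty; the function name and the toy weights are made up for illustration.

```python
def embedding_bag_mean(weights, flat_ids, offsets):
    """Average the embedding rows of each bag; bag i spans offsets[i] to offsets[i+1]."""
    bounds = list(offsets) + [len(flat_ids)]
    dim = len(weights[0])
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [weights[i] for i in flat_ids[start:end]]  # look up each token id
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

weights = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]  # hypothetical 3-word, 2-dim table
pooled = embedding_bag_mean(weights, flat_ids=[1, 2, 1], offsets=[0, 2])
print(pooled)  # [[2.0, 3.0], [1.0, 2.0]]
```

Each text therefore becomes one fixed-size vector regardless of its length, which the linear layer then maps to one score per class.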

In [44]:
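# Count the distinct language labels. Iterate over the dataset directly:
# with enumerate(), the first tuple element would be the row index, not the label.
num_class = len(set(label for (label, text) in train_iter))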
vocab_size = len(vocab)
emsize = 128
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

In [45]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
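The accuracy bookkeeping in `train` and `evaluate` boils down to an argmax over each row of scores followed by a comparison with the labels. A minimal sketch on plain lists, with hypothetical scores and labels:

```python
def batch_accuracy(logits, labels):
    """Return (number of correct predictions, batch size)."""
    # Index of the largest score in each row, like predicted_label.argmax(1)
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct, len(labels)

logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]  # hypothetical scores
labels = [1, 0, 1]
correct, count = batch_accuracy(logits, labels)
print(correct / count)
```

Summing `correct` and `count` over all batches before dividing gives the dataset-level accuracy, which is what the loops above accumulate in `total_acc` and `total_count`.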

In [46]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# Hyperparameters
EPOCHS = 6  # number of training epochs
LR = 2  # learning rate
BATCH_SIZE = 64  # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
test_iter = CustomTextDataset(test_lang)
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/  785 batches | accuracy    0.861
-----------------------------------------------------------
| end of epoch   1 | time:  9.22s | valid accuracy    0.983 
-----------------------------------------------------------
| epoch   2 |   500/  785 batches | accuracy    0.989
-----------------------------------------------------------
| end of epoch   2 | time:  7.74s | valid accuracy    0.994 
-----------------------------------------------------------
| epoch   3 |   500/  785 batches | accuracy    0.997
-----------------------------------------------------------
| end of epoch   3 | time:  7.81s | valid accuracy    0.998 
-----------------------------------------------------------
| epoch   4 |   500/  785 batches | accuracy    0.999
-----------------------------------------------------------
| end of epoch   4 | time:  7.76s | valid accuracy    1.000 
-----------------------------------------------------------
| epoch   5 |   500/  785 batches | accuracy    0.999
------
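The learning-rate rule in the loop above is: keep the best validation accuracy seen so far, and decay the rate by `gamma` whenever an epoch falls below it. The simulation below sketches that rule in plain Python; the accuracy sequence is hypothetical (chosen so that one epoch dips) and `simulate_lr` is a made-up helper name.

```python
def simulate_lr(val_accuracies, lr=2.0, gamma=0.1):
    """Return the learning rate in effect after each epoch."""
    best = None
    history = []
    for acc in val_accuracies:
        if best is not None and best > acc:
            lr *= gamma  # accuracy dropped below the best so far: decay
        else:
            best = acc   # new best (or a tie): remember it
        history.append(lr)
    return history

# Hypothetical validation accuracies; the third epoch is worse than the second
history = simulate_lr([0.983, 0.994, 0.990, 0.998])
print(history)
```

If validation accuracy never drops below its running best, the rate stays at its initial value for the whole run.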

In [47]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    1.000


In [49]:
lang_labels = {0: 'hindi', 1: 'kannada', 2: 'tamil', 3:'malayalam', 4:'bengali' }

test_sentence = "अरे आपका आज का दिन कैसा चल रहा है। आशा है आपके और आपके परिवार वालों के साथ सब कुछ अच्छा है।"

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item()

model = model.to('cpu')

print("This is %s language" %lang_labels[predict(test_sentence, text_pipeline)])

This is hindi language
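`predict` returns only the index of the highest score. If a rough confidence value is also wanted, the raw scores can be passed through a softmax. The sketch below assumes hypothetical scores rather than real model output and repeats the label mapping so that it runs on its own.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

lang_labels = {0: 'hindi', 1: 'kannada', 2: 'tamil', 3: 'malayalam', 4: 'bengali'}
logits = [4.0, 0.5, -1.0, 0.0, 1.0]  # hypothetical scores, one per language
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print("This is %s language (p=%.3f)" % (lang_labels[best], probs[best]))
```

The softmax does not change which class wins, so the predicted language is the same as with `argmax` alone.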



