All Required files for the Journal Paper "Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP tasks in Telugu Language"

Cha14ran/DREAM-T

Dream-T!

Data

  • Each row (sentence) contains the sentence text and its "sentiment", "emotion", "hatespeech", and "sarcasm" labels.
  • The labels in each row of the CSV file appear in this order: sentiment, emotion, hatespeech, sarcasm.
  • Labels for Sentiment: "Pos", "Neg", "Neutral"
  • Labels for Emotion: "Happy", "Sad", "Angry", "Fear"
  • Labels for Hatespeech: "Yes", "No"
  • Labels for Sarcasm: "Yes", "No"
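
The row layout above can be checked with a short script. This is a minimal sketch, not code from the repository: it assumes the sentence text is the first field, that the four labels follow in the listed order, and that there is no header row. The sample row and the names `LABELS`, `TASKS`, and `parse_row` are hypothetical.

```python
import csv
import io

# Label sets as listed above.
LABELS = {
    "sentiment": {"Pos", "Neg", "Neutral"},
    "emotion": {"Happy", "Sad", "Angry", "Fear"},
    "hatespeech": {"Yes", "No"},
    "sarcasm": {"Yes", "No"},
}
# Assumed column order after the sentence text.
TASKS = ["sentiment", "emotion", "hatespeech", "sarcasm"]


def parse_row(row):
    """Split one CSV row into (sentence, {task: label}) and validate the labels."""
    if len(row) != 1 + len(TASKS):
        raise ValueError(f"expected {1 + len(TASKS)} fields, got {len(row)}")
    sentence, labels = row[0], dict(zip(TASKS, row[1:]))
    for task, label in labels.items():
        if label not in LABELS[task]:
            raise ValueError(f"unknown {task} label: {label!r}")
    return sentence, labels


# Hypothetical sample row (placeholder text, not taken from the data set).
sample = io.StringIO("example sentence,Pos,Happy,No,No\n")
rows = [parse_row(r) for r in csv.reader(sample)]
print(rows[0])
```

To read the real file, replace the `io.StringIO` object with an open file handle; if the file has a header row, skip it before parsing.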
Word Embeddings
User Interface for Annotation

How to run

  • Download the entire userinterface_annotation folder.
  • Go to /Website_with_user_login.
  • Run app.py with the command "python3 app.py".
Box-Plots
  • The "Boxplots.ipynb" notebook contains separate cells for Sentiment Analysis, Emotion Identification, Hate-Speech Detection, and Sarcasm Detection.
P-values
  • The "p-values.ipynb" notebook contains separate cells for Sentiment Analysis, Emotion Identification, Hate-Speech Detection, and Sarcasm Detection.
  • The assumptions of the ANOVA test were also checked before running it.
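
For readers unfamiliar with the test, the statistic behind a one-way ANOVA can be sketched in plain Python. This is an illustration only, not the notebook's code: the function name `one_way_anova_f` and the three groups of numbers are made up, and the README does not say which library the notebook uses.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up scores for three hypothetical models.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

The p-value is the upper tail of the F distribution with these degrees of freedom. With SciPy installed, `scipy.stats.f_oneway(*groups)` returns the statistic and p-value directly, and `scipy.stats.shapiro` and `scipy.stats.levene` are common choices for checking normality and equal variances.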

Lexicon Based

BOW

TF-IDF

Word2Vec

Code Snippet for Word2Vec Model

import gensim
w2vmodel = gensim.models.KeyedVectors.load_word2vec_format('./te_w2v.vec', binary=False)
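
Once word vectors are loaded, one common way to get a fixed-size sentence feature is to average the vectors of the words in the sentence. The sketch below shows that idea under stated assumptions: the README does not say how sentence features were built from these embeddings, and `sentence_vector` and the toy vectors are hypothetical. It is written against any mapping that supports `in` and `[]`, which includes gensim's `KeyedVectors`.

```python
def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`; zeros if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) walks the vectors dimension by dimension.
    return [sum(column) / len(found) for column in zip(*found)]


# Hypothetical two-dimensional vectors standing in for the loaded model.
toy_vectors = {"a": [1.0, 3.0], "b": [3.0, 5.0]}
print(sentence_vector(["a", "b", "missing"], toy_vectors, dim=2))  # [2.0, 4.0]
```

With the model loaded above, the call would look like `sentence_vector(tokens, w2vmodel, w2vmodel.vector_size)`. Out-of-vocabulary tokens are skipped here; other policies are equally defensible.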

GloVe

Code Snippet for GloVe Model

import gensim
glove_model = gensim.models.KeyedVectors.load_word2vec_format('./te_glove_w2v.txt', binary=False)

FastText

Code Snippet for FastText Model

import gensim
fastText_model = gensim.models.KeyedVectors.load_word2vec_format('./te_fasttext.vec', binary=False)

Meta-Embeddings

Code Snippet for Meta-Embeddings Model

import gensim
MetaEmbeddings_model = gensim.models.KeyedVectors.load_word2vec_format('./te_metaEmbeddings.txt', binary=False)

Skip-Thought

Code Snippet for Skip-Thought Model

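# Assumes the TensorFlow skip_thoughts package is on the Python path
from skip_thoughts import configuration
from skip_thoughts import encoder_manager

VOCAB_FILE = "./data/exp_vocab/vocab.txt"
EMBEDDING_MATRIX_FILE = "./data/exp_vocab/embeddings.npy"
CHECKPOINT_PATH = "./data/model/model.ckpt-129597"
encoder = encoder_manager.EncoderManager()
encoder.load_model(configuration.model_config(),
                vocabulary_file=VOCAB_FILE,
                embedding_matrix_file=EMBEDDING_MATRIX_FILE,
                checkpoint_path=CHECKPOINT_PATH)
# `data` is the list of sentences to encode (not defined in this snippet)
encodings = encoder.encode(data)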

ELMo

Code Snippet for ELMo Features

import numpy as np
from allennlp.modules.elmo import Elmo, batch_to_ids
from allennlp.commands.elmo import ElmoEmbedder
from wxconv import WXC
from polyglot_tokenizer import Tokenizer

options_file = "options.json"
weight_file = "elmo_weights.hdf5"

elmo = ElmoEmbedder(options_file, weight_file)
con = WXC(order='utf2wx', lang='tel')  # converts UTF-8 Telugu text to WX notation
tk = Tokenizer(lang='te', split_sen=False)

sentence = ''  # Telugu sentence to embed
wx_sentence = con.convert(sentence)

# embed_sentence returns one array per ELMo layer; take the top layer ([2]) and average over tokens
elmo_features = np.mean(elmo.embed_sentence(tk.tokenize(wx_sentence))[2], axis=0)

BERT

Code Snippet for BERT Features

from bertviz import head_view
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel, BertConfig, BertForSequenceClassification, BertForNextSentencePrediction

def show_head_view(model, tokenizer, sentence_a, sentence_b=None):
    # Encode one sentence or a sentence pair as PyTorch tensors
    inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
    input_ids = inputs['input_ids']
    if sentence_b:
        token_type_ids = inputs['token_type_ids']
        # With output_attentions=True, the last model output holds the attention weights
        attention = model(input_ids, token_type_ids=token_type_ids)[-1]
        sentence_b_start = token_type_ids[0].tolist().index(1)
    else:
        attention = model(input_ids)[-1]
        sentence_b_start = None
    input_id_list = input_ids[0].tolist() # Batch index 0
    tokens = tokenizer.convert_ids_to_tokens(input_id_list)
    return attention

config = BertConfig.from_pretrained("subbareddyiiit/music_cog", output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained("subbareddyiiit/music_cog")
model = AutoModel.from_pretrained("./pytorch_model_task.bin", config=config)

# Telugu sentences in WX notation
sentence_a = "pilli cApa mIxa kUrcuMxi"
sentence_b = "pilli raggu mIxa padukuMxi"
sen_vec = show_head_view(model, tokenizer, sentence_a, sentence_b)
