Code-switching is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication.
We developed a tool that automatically annotates each word in a sentence as either English or Spanish. We trained our model on data provided by the First Workshop on Computational Approaches to Code Switching, which can be obtained from the EMNLP 2014 website. Due to copyright restrictions, the organisers could not distribute the text of the tweets themselves, so we also developed a tool that parses the offset data format and downloads the actual text. This is described in detail in Data and Preprocessing.
The project started as a student term project in a machine learning for computational linguists class at the University of Tübingen.
- Download the Spanish-English training data (11,400 tweets) from http://emnlp2014.org/workshops/CodeSwitch/call.html. These tab-separated files contain the annotations (offset file) with the following columns: TweetID, UserID, Start, Stop, and Language.
- Download the tweet-fetching tool from http://emnlp2014.org/workshops/CodeSwitch/scripts/twitter.zip.
Usage:
python collect_tweets.py -i <inputFile.tsv> -o <outputFile.tsv>
Using en_es_training_offsets.tsv as the input file, we get a "string file" with three columns: TweetID, UserID, and Tweet.
Now we have one file with the tweets (the string file from the previous step) and another with the language annotation for each word (the offset file). Both contain the TweetID.
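A minimal sketch of joining the two files on TweetID, assuming tab-separated columns as described above and exclusive Stop offsets (the function and variable names here are our own illustration, not part of the workshop scripts):

```python
import csv
import io

def join_tweets(string_tsv, offset_tsv):
    """Map TweetID -> (tweet text, list of (start, stop, language))."""
    tweets = {}
    # string file columns: TweetID, UserID, Tweet
    # QUOTE_NONE because tweets may contain quote characters
    for tweet_id, _user_id, text in csv.reader(
            io.StringIO(string_tsv), delimiter="\t", quoting=csv.QUOTE_NONE):
        tweets[tweet_id] = (text, [])
    # offset file columns: TweetID, UserID, Start, Stop, Language
    for tweet_id, _user_id, start, stop, lang in csv.reader(
            io.StringIO(offset_tsv), delimiter="\t", quoting=csv.QUOTE_NONE):
        if tweet_id in tweets:  # tweets that could not be fetched are skipped
            tweets[tweet_id][1].append((int(start), int(stop), lang))
    return tweets
```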
After downloading the offset file and the actual tweets, use our script read_tweets.py and set the variables at the top of the script:

tweetStringFile -> string file with the tweets (tsv)
tweetOffsetFile -> offset file with the language annotations (tsv)
isTrainingsFile -> True/False
outFile -> output file

If needed, columns are enumerated. The fixout_offeset.py script will delete many tweets because they could not be found, were empty, or were modified (the string is shorter or longer than defined by the offset file) in the time between annotation and fetching.
The remaining tweets are processed, resulting in an output tsv file with the following columns: TweetString, TweetArray, and Languages:

TweetString: the tweet as a plain string
TweetArray: every word separately; the "Start" and "Stop" markers in the offset file tell how the tweet is split
Languages: the language of every word according to the offset file
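For illustration, slicing a tweet string by those Start/Stop markers (assuming Stop is an exclusive end offset; adjust if the data uses inclusive offsets):

```python
def split_by_offsets(tweet_string, offsets):
    """offsets: list of (start, stop, language) tuples from the offset file."""
    tweet_array = [tweet_string[start:stop] for start, stop, _ in offsets]
    languages = [lang for _, _, lang in offsets]
    return tweet_array, languages

words, langs = split_by_offsets("hola world", [(0, 4, "es"), (5, 10, "en")])
# words == ["hola", "world"], langs == ["es", "en"]
```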
Using the output file, we can create feature tables.
The script unigrams_word_level.py collects all unigrams present in the string file. These are typed manually into the Python scripts to avoid encoding errors.
The scripts create_unigram_table.py, create_bigram_table.py (create_bigram_table_repetitions.py), and create_multigram_table.py (create_multigram_repetitions_or_embeedings_table.py) are mostly the same. The resulting files can be up to 1 GB in size (we do not compress the sparse vectors)! They look up which unigrams, bigrams (as combinations of these unigrams), or both occur within the words of the training dataset.
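As an illustration of what one row of such a table looks like, here is a binary character-unigram vector for a single word (the inventory below is a made-up toy example, not the one produced by unigrams_word_level.py):

```python
def unigram_vector(word, unigrams):
    """Binary feature vector: 1 if the unigram occurs in the word, else 0."""
    return [1 if u in word else 0 for u in unigrams]

# hypothetical unigram inventory
unigrams = ["a", "e", "ñ", "q", "u"]
unigram_vector("agua", unigrams)  # -> [1, 0, 0, 0, 1]
```

With a large inventory, most entries are 0 for any given word, which is why the uncompressed tables grow so big.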
These tables can be used to train a machine learning algorithm. Several algorithms are provided in the evaluation folder, implemented with sklearn and/or Orange.
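A minimal sklearn sketch of this training step, with hypothetical toy feature vectors standing in for the real tables:

```python
from sklearn.linear_model import LogisticRegression

# toy feature vectors (one per word) and gold language labels
X = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 0],
     [0, 1, 1]]
y = ["en", "es", "en", "es"]

clf = LogisticRegression().fit(X, y)
clf.predict([[1, 0, 1]])  # predicted language for a new word's feature vector
```

Any sklearn classifier with the same fit/predict interface can be swapped in here.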
For testing, extract_features_from_string.py can be used. It generates a test file in which a given string is tokenized and a feature vector is created for each token.
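A rough sketch of that test-time step, with a naive whitespace tokenizer and a toy unigram inventory (the real script's tokenization may differ):

```python
def extract_features(sentence, unigrams):
    """Return (token, feature vector) pairs for every token in the string.

    The same unigram inventory as at training time must be reused so
    that feature vectors line up column by column.
    """
    tokens = sentence.split()
    return [(tok, [1 if u in tok else 0 for u in unigrams]) for tok in tokens]

extract_features("hola world", ["a", "h", "w"])
# -> [("hola", [1, 1, 0]), ("world", [0, 0, 1])]
```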
The scripts are tested with Python 3.