Loose Collection of Project Ideas

Suitable for term papers, B.Sc. theses, IMs (individual research modules), or M.Sc. theses, depending on the topic

Topics are grouped into the following areas, which are not mutually exclusive: Discourse, Social Media, Computational Sociolinguistics, and Other Topics.

Ask me for more information! tatjana.scheffler@uni-potsdam.de

Discourse

Dialog act tagging for social media:

Dialog act (DA) tagging has mostly been done on linear dialogs, but social media conversations branch. How can we address this? How do different social media platforms affect the results and methods?
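One possible starting point (a sketch, not a recommended solution) is to split a branching thread into its root-to-leaf reply chains, so that each chain can be handed to a tagger built for linear dialogs. The input format, a list of `(post_id, parent_id)` pairs, and the function name `root_to_leaf_paths` are assumptions made for illustration.

```python
from collections import defaultdict

def root_to_leaf_paths(posts):
    """Return every root-to-leaf reply chain of a thread as a list of post ids.

    `posts` is a list of (post_id, parent_id) pairs (hypothetical format);
    parent_id is None for a post that starts a conversation.
    """
    children = defaultdict(list)
    roots = []
    for post_id, parent_id in posts:
        if parent_id is None:
            roots.append(post_id)
        else:
            children[parent_id].append(post_id)

    paths = []
    stack = [[root] for root in reversed(roots)]
    while stack:
        path = stack.pop()
        replies = children.get(path[-1], [])
        if not replies:
            paths.append(path)  # reached a leaf: the chain is complete
        # Push in reverse so that earlier replies are expanded first.
        for reply in reversed(replies):
            stack.append(path + [reply])
    return paths

# Toy thread (made up): post 1 has replies 2 and 3; post 2 has replies 4 and 5.
thread = [(1, None), (2, 1), (3, 1), (4, 2), (5, 2)]
paths = root_to_leaf_paths(thread)
print(paths)  # [[1, 2, 4], [1, 2, 5], [1, 3]]
```

Note that posts near the root appear in several chains, so they would be tagged more than once. How to reconcile conflicting tags for the same post is part of the open question.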


Differentiate conversation types (e.g., on social media)


Identifying information questions (vs. other types of questions) in social media discourses

... and whether the question has been answered in subsequent follow-ups.


Improving discourse parsing through parallel corpora

  1. Disambiguation of explicit discourse connectives trained on unannotated data
  2. Use explicified implicit discourse relations in bi-text as training examples for the implicit relations in the source language
  3. Inducing new annotated discourse connective lexicons through MT alignment methods from a multi-lingual corpus

Coreference resolution for pronouns referring to pictures embedded in tweets


Social Media

Experimental production/comprehension study on capitalization in social media posts

  • Which kind of stress contour does capitalization correspond to?
  • Does it mark meaning distinctions?
  • How does capitalization vary across languages (by POS, etc.)?

Classification of offensive language in social media posts

Based on an existing linguistic analysis and some training data, build a system that classifies whether a post is offensive or not (e.g., on Twitter). In particular, I would like to include syntactic patterns used for addressing/attacking other people, and the conversation context and metadata.


Synchronicity of email vs. tweets: do users use more speech-like items in near-synchronous tweets than in asynchronous emails?


Intensifying Intensifiers:

If you don't have a topic yet, how about "intensifiers" such as asterisks, capitalization, or letter lengthening, which can themselves be applied to intensifiers like "so":

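(1) Mir ist Natur echt egal, aber ich bin *so* kurz davor, eine Highland-Tour zu buchen. ('I really don't care about nature, but I am *so* close to booking a Highland tour.')

(2) An Abenden wie diesen wünsche ich mir SO SEHR, ... ('On evenings like these I wish SO MUCH ...')

(3) Ich hab soooo den Überblick über die ganzen Neuerungen verloren, weil ich’s so lange nicht mehr gespielt hab ('I have soooo lost track of all the new features because I haven't played it in so long')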

Does this simply correspond to stress? Are the same words marked that would otherwise be stressed anyway? What is the meaning of such an "intensifier of intensifiers"? Can this be explained in terms of focus? What does "so" mean in each case (in some cases it is not an intensifier)? One could possibly even have participants read the examples aloud.

There are also semantic differences! "Ich finde das SO gut" ('I think that is SO good') vs. "Aber SO geht der mir nicht aus dem Haus" ('But he is not leaving the house like THAT'), and definitely not *"soooo geht der mir nicht aus dem Haus".


Emoji vs. emoticons

There's evidence that emoji are replacing emoticons (P&E, 2017?) (as well as other kinds of non-standard lexical items). However, there are still some emoticons being used in social media interactions. This work should look at longitudinal Twitter (or other social media) data to find out (1) whether and to what extent emoji are replacing emoticons in German discourse, (2) to what extent emoticons continue to be used and under what circumstances, and (3) whether there are any semantic or syntactic differences between emoticons and emoji in current usage.

This could also be combined with a study of emoji across social media.
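As a rough sketch of the first step of such a longitudinal study, the snippet below counts emoticons and emoji per year and computes the emoji share. Everything in it is an assumption for illustration: the input format of `(year, text)` pairs, the tiny emoticon pattern, the handful of emoji code point ranges, and the toy posts. A real study would need much larger inventories of both.

```python
import re
from collections import Counter, defaultdict

# Assumption: a small illustrative emoticon pattern such as :) ;-) :D =P
EMOTICON = re.compile(r"(?<!\w)[:;=]-?[)(DPp](?!\w)")
# Assumption: a few common emoji code point ranges, not the full Unicode emoji set.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def counts_by_year(posts):
    """Count emoticon and emoji occurrences per year from (year, text) pairs."""
    counts = defaultdict(Counter)
    for year, text in posts:
        counts[year]["emoticon"] += len(EMOTICON.findall(text))
        counts[year]["emoji"] += len(EMOJI.findall(text))
    return counts

def emoji_share(counter):
    """Share of emoji among all emoji and emoticons, or None if there are none."""
    total = counter["emoji"] + counter["emoticon"]
    return counter["emoji"] / total if total else None

# Made-up toy posts, not real data.
posts = [
    (2012, "Guten Morgen :)"),
    (2012, "Na toll ;-)"),
    (2012, "Sch\u00f6n \U0001F600"),
    (2018, "Guten Morgen \U0001F600\U0001F600"),
    (2018, "Danke :D"),
]
counts = counts_by_year(posts)
for year in sorted(counts):
    print(year, dict(counts[year]), emoji_share(counts[year]))
```

Counting occurrences only addresses question (1). Questions (2) and (3) need the surrounding context of each item, e.g. its position in the post.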


Computational Sociolinguistics

Lexical change in German Twitter over time

Kim et al. (2014) have shown that language models can be used to automatically identify words whose usage has changed over time. Can we reproduce this work on a much smaller time scale of 5-10 years of German Twitter posts? Which words are introduced or discontinued? Is any change in meaning discernible? Can different types of change (semantic/syntactic, broadening/narrowing, ...) be distinguished from each other (the authors give some pointers at the end)?
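Before training any language models, a simple frequency baseline can already answer the "introduced or discontinued" question. The sketch below is not the method of Kim et al. (2014). It only compares relative word frequencies between two periods. The function names, the whitespace tokenization, and the toy posts are assumptions for illustration.

```python
from collections import Counter

def relative_frequencies(posts):
    """Map each lower-cased whitespace token to its share of all tokens."""
    counts = Counter(token for post in posts for token in post.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def compare_periods(early_posts, late_posts):
    """Return words only in the late period, words only in the early period,
    and the change in relative frequency for words present in both."""
    early = relative_frequencies(early_posts)
    late = relative_frequencies(late_posts)
    introduced = sorted(w for w in late if w not in early)
    discontinued = sorted(w for w in early if w not in late)
    shift = {w: late[w] - early[w] for w in early if w in late}
    return introduced, discontinued, shift

# Made-up toy posts, not real Twitter data.
early_posts = ["das ist knorke", "das ist gut"]
late_posts = ["das ist lit", "das ist gut"]
introduced, discontinued, shift = compare_periods(early_posts, late_posts)
print(introduced, discontinued)
```

A baseline like this finds changes in frequency only. Detecting changes in meaning requires comparing the contexts of a word across periods, which is what the language-model approach is for.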

Kim et al. (2014). Temporal Analysis of Language through Neural Language Models. https://www.aclweb.org/anthology/W14-2517


Diglossia: Präteritum (preterite) vs. Perfekt (perfect) in German

There are some well-known cases in which the written and the spoken language are almost completely distinct. One example is negation in French (the written language uses double negation, whereas in speech the 'ne' is almost always dropped). Another is the past tense forms in German: the written language mainly uses the Präteritum, especially in narratives, whereas the spoken language uses the Perfekt.

Spontaneous texts from social media contain both variants, but in different proportions. The project has the following goals: (1) quantify the past tense forms of German in different media using corpus data; (2) build a regression model to determine the factors that condition the choice of past tense form; (3) assess to what extent the diglossia is stable.
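To make the regression goal concrete, here is a minimal sketch of a logistic regression fitted by batch gradient descent in plain Python. The outcome coding (1 = Präteritum, 0 = Perfekt), the single binary predictor (1 = a more written-like medium, 0 = a more speech-like medium), and all data points are invented for illustration and are not results. A real analysis would use an established statistics package and many more predictors (verb lemma, person, register, ...).

```python
import math

def predict(weights, x):
    """Predicted probability of Praeteritum; weights[0] is the intercept."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, steps=5000, lr=0.5):
    """Fit logistic regression by batch gradient descent on the mean log loss.

    rows: list of feature lists (without intercept); labels: 0/1 outcomes.
    """
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict(weights, x) - y
            gradient[0] += error
            for j, value in enumerate(x):
                gradient[j + 1] += error * value
        weights = [w - lr * g / len(rows) for w, g in zip(weights, gradient)]
    return weights

# Made-up toy data: 10 past tense verbs per medium.
# Written-like medium (feature 1): 8 Praeteritum, 2 Perfekt.
# Speech-like medium (feature 0): 2 Praeteritum, 8 Perfekt.
rows = [[1]] * 10 + [[0]] * 10
labels = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

weights = fit_logistic(rows, labels)
print("intercept and coefficient:", weights)
print("P(Praeteritum | written-like):", predict(weights, [1]))
print("P(Praeteritum | speech-like):", predict(weights, [0]))
```

A positive coefficient for the medium predictor would mean that the Präteritum is more likely in the written-like medium. With only one binary predictor the model simply recovers the two observed proportions, so the regression becomes informative only once several factors are included.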

Hennig (2000). Tempus und Temporalität.

Zug (2011). The Use of the Preterite and the Present Perfect in English and German: A Contrastive Analysis (MSc thesis). https://www.duo.uio.no/bitstream/handle/10852/25297/ENGx4190xMasterxthesisxJenniferxZug.pdf?sequence=1

Löbner, on the meaning of the perfect tense.


A quantification of non-canonical word orders in spontaneous writing (and speech?) in German


Other Topics

Normalization of aphasic speech

  • speech recognition: train LM on agrammatic aphasia to improve ASR
  • transform agrammatic string into a more standard form
  • application?