This repo is the home page of the Pitt English Language Institute Corpus (PELIC). It contains information about the ELI Data Mining Group at the University of Pittsburgh.
Latest commit 6171c33 Dec 5, 2019

The University of Pittsburgh English Language Institute Corpus (PELIC)

This repo introduces the University of Pittsburgh English Language Institute Corpus.

The current research group includes:

Associate Members

The Pittsburgh English Language Institute Corpus (PELIC)

The corpus is based on data collected from students at the English Language Institute at the University of Pittsburgh from 2005-2015 as part of a National Science Foundation project housed at Pitt and CMU.

  • The data include written and spoken samples from writing classes, grammar classes, reading classes, and speaking classes.

  • Figure 1 shows a snapshot of the first languages (L1s) of students in the ELI. The L1s represented by the largest numbers of students are Arabic, Chinese, Japanese, Korean, Spanish, and Turkish. Proficiency levels in the dataset range from Level 2 (approximately Common European Framework (CEFR) A1 'Breakthrough') to Level 5 (approximately CEFR B2-C1 'Vantage-Effective'). There are few Level 2 students because the Institute did not regularly offer that level during the period of data collection. Since students contributed data in all skill areas, researchers will ultimately be able to analyze data from many students across many skill areas.

  • Figure 2 shows the number of texts from writing assignments for each language group at each level. Note that many students contributed at several levels (e.g., 3, 4, and 5), making it possible to track the same students' development over several semesters.

  • The data from the writing classes total 4.2 million words. A more detailed description of these data appears in our initial analysis of lexical sophistication in Naismith et al. (2018).

  • The spoken data from speaking classes will be available in both .wav format (analyzable in Praat) and .mp3 format and will include the students' transcriptions of their own speech. A publication based on a small subset of these data is Vercellotti (2017).

  • Many assignments have several versions, with revisions based on teacher feedback, so that uptake of teacher comments and its influence on language development can be investigated.

  • Plans for publication of the whole dataset: we will post the dataset online as processing is completed. Check back here for updates.

Recent updates

  • Our group is developing a set of text processing tools for use in Python. These include a tool for D (also known as vocD) and a tool for Advanced Guiraud, as well as utilities for (re)tokenizing, lemmatizing, and extracting n-grams. We are working on adding concordances and concgrams.

  • We ran D (a measure of lexical diversity) on our texts from the writing classes. The results appear to confirm Juffs's (2019) finding, based on a subset of the data, that our Level 4 and Level 5 students have similar D scores; the large gains in D that students make occur from low-intermediate to intermediate level. See Figure 3. Thanks to Daniel Zheng for outstanding work on this part of the analysis.
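The two lexical measures mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the group's actual tools: Advanced Guiraud is computed as advanced types divided by the square root of total tokens (with "advanced" meaning absent from a caller-supplied basic vocabulary list, commonly the most frequent ~2,000 word families), and D is estimated by fitting the vocD model curve TTR = (D/N)·(√(1 + 2N/D) − 1) to mean type-token ratios of random samples via a simple grid search. All function names and parameter defaults here are hypothetical.

```python
import math
import random

def advanced_guiraud(tokens, basic_vocab):
    """Advanced Guiraud = advanced types / sqrt(total tokens).
    'Advanced' types are those not found in basic_vocab (a set of
    common words; which list to use is the analyst's choice)."""
    if not tokens:
        return 0.0
    advanced = {t for t in set(tokens) if t.lower() not in basic_vocab}
    return len(advanced) / math.sqrt(len(tokens))

def ttr_curve(d, n):
    """vocD model curve: TTR = (D/N) * (sqrt(1 + 2N/D) - 1)."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)

def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Estimate D by averaging the type-token ratio of random samples
    at each sample size, then grid-searching the D that best fits the
    model curve (least squared error)."""
    rng = random.Random(seed)
    empirical = []
    for n in sizes:
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        empirical.append((n, sum(ttrs) / trials))
    best_d, best_err = None, float("inf")
    for tenth in range(10, 2001):  # candidate D from 1.0 to 200.0 in steps of 0.1
        d = tenth / 10
        err = sum((ttr - ttr_curve(d, n)) ** 2 for n, ttr in empirical)
        if err < best_err:
            best_d, best_err = d, err
    return best_d
```

Higher values of either measure indicate greater lexical diversity (D) or sophistication (Advanced Guiraud); a text drawing on many different, less frequent words scores higher than a repetitive one of the same length.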

Selected publications and presentations based on the dataset

Papers based on the Reading Tutor REAP in the English Language Institute

Spoken data have already been posted online by Vercellotti and are available for download and analysis in CLAN.

A list of published and unpublished MA theses and PhD dissertations written with support from the English Language Institute.
