🌿 Corpus Linguistics (Spring 2024)

Course Overview

This graduate-level course, designed for in-service secondary-school English teachers, introduces corpus linguistics together with a practical introduction to Natural Language Processing (NLP) using basic Python. Participants apply these computational techniques to large electronic text collections (corpora) to study patterns of language use. The course also examines language variation and how it can inform teaching practice, including the development of teaching materials and activities. It balances theoretical instruction with practical application, focusing on the analysis and interpretation of English corpus data.

Course board & links

| 💾 Syllabus | 👭 Padlet: inclass activity | 📗Python basics manual | 🌳 Class log |

Weekly Schedule

| Week | Date | Key topic(s) | Description | Code page | Assignments |
|---|---|---|---|---|---|
| W01 | Mar6 | Introduction | Course overview, syllabus; What is corpus linguistics? | | survey |
| W02 | Mar13 | Python basics #1 | Online Corpora: COCA, BNC, Types of corpora; NLTK | CL01, CL02, 🔸nltk, 📗 | |
| W03 | Mar20 | Python basics #2 | Data types, NLTK (section 1) | 🔸nltk, NLTK01 | |
| W04 | Mar27 | Project #1 | NLTK, 🔸Word cloud, 🔸Word Frequency list | 🔸nltk | |
| W05 | Apr3 | Lexical analysis | Type vs. token, lemmatization | Code, 🔸nltk | Assign01 (Apr17) |
| W06 | (Apr10) | Keywords | Text analysis, Words in context, concordance, collocations | NgramCode | |
| W07 | Apr17 | Lexical diversity | Type-Token-Ratio (TTR) and other lexical diversity measures | Reading [1][2][3], wordlist-stopwords, code | Assign1 Presentation (15mins) |
| W08 | Apr24 | Lexical diversity measures | Midterm discussion | LD-practice, N-gramCode | |
| W09 | (May1) | Midterm (take-home) | | | |
| W10 | May8 | Readability, Topic-modeling | Readability measures, NLP preprocessing, topic-modeling | Intro, Readability, App sampletext, RE, ArticleUse | |
| W11 | (May15) | Sentiment Analysis | Data collection, Individual project submission | | |
| W12 | May22 | Clustering Analysis | Data collection, (Clustering Analysis), Sentiment Analysis | Code | |
| W13 | May29 | Project #2 | Idea brainstorming, individual project discussions, samples | TEDdata | |
| W14 | June5 | Project #3 | Individual project discussions, samples | | |
| W15 | June12 | Final project | Presentations | | |

📙 How to handle frequency data

1. Data Types and Variables: ➡️details

  • Understanding different types of data (quantitative vs. qualitative) and variables (discrete vs. continuous).

2. Frequency Data Basics:

  • Definition of frequency data.
  • Differentiating between absolute frequency, relative frequency, and cumulative frequency.
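As a minimal sketch of the three kinds of frequency, the snippet below counts words in a short made-up sentence (the sample text and variable names are illustrative assumptions, not course data). Absolute frequency is the raw count, relative frequency is the count divided by the total number of tokens, and cumulative frequency is the running share as we move down the ranked list.

```python
from collections import Counter

# Hypothetical sample text; any tokenized corpus would work the same way.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

# Absolute frequency: raw count of each word type.
absolute = Counter(tokens)
total = sum(absolute.values())

# Relative and cumulative frequency, from most to least frequent word.
rows = []
running = 0
for word, count in absolute.most_common():
    running += count
    rows.append((word, count, count / total, running / total))

print(f"{'word':<5} {'abs':>3} {'rel':>6} {'cum':>6}")
for word, count, rel, cum in rows:
    print(f"{word:<5} {count:>3} {rel:>6.3f} {cum:>6.3f}")
```

Relative frequencies make corpora of different sizes comparable, which is why corpus tools often report counts per thousand or per million words.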

3. Data Collection and Organization:

  • Techniques for collecting frequency data.
  • Organizing data into tables and charts.

4. Descriptive Statistics:

  • Measures of central tendency (mean, median, mode) specifically for frequency data.
  • Measures of variability (range, variance, standard deviation).
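These measures can be computed directly from a frequency table. The sketch below assumes a small hypothetical table of word lengths (the numbers are invented for illustration) and uses only Python's standard `statistics` module.

```python
import statistics
from collections import Counter

# Hypothetical frequency table: word length -> number of tokens of that length.
length_freq = Counter({1: 2, 2: 5, 3: 8, 4: 4, 5: 1})

# Expand the table back into individual observations (20 word lengths).
observations = list(length_freq.elements())

# Central tendency
mean = statistics.mean(observations)
median = statistics.median(observations)
mode = statistics.mode(observations)

# Variability (population formulas, treating the table as the full data set)
value_range = max(observations) - min(observations)
variance = statistics.pvariance(observations)
std_dev = statistics.pstdev(observations)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} variance={variance:.3f} sd={std_dev:.3f}")
```

If the data are a sample rather than the whole population, `statistics.variance` and `statistics.stdev` would be the usual choice instead.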

5. Graphical Representation of Data:

  • Histograms, bar charts, and pie charts for frequency data.
  • Understanding and interpreting these graphical representations.

6. Probability Fundamentals:

  • Basic probability concepts and rules.
  • Probability distributions relevant to frequency data (e.g., binomial, Poisson).
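To illustrate how these two distributions relate to word counts, the sketch below assumes a hypothetical word that occurs with probability 0.002 per token and asks how likely it is to appear exactly k times in a 1,000-token sample. The function names and the numbers are assumptions chosen for illustration; the formulas are the standard binomial and Poisson probability mass functions.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical setting: per-token probability 0.002, sample of 1,000 tokens.
n, p = 1000, 0.002
lam = n * p  # expected number of occurrences

for k in range(6):
    print(k, f"{binomial_pmf(k, n, p):.4f}", f"{poisson_pmf(k, lam):.4f}")
```

With a large sample and a rare word, the two columns should be nearly identical, which is why the Poisson distribution is often used as a convenient approximation for rare-word counts. Real words tend to cluster within texts, so both models are simplifications.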

7. Sampling and Sampling Distributions:

  • Concepts of population vs. sample.
  • Understanding sampling distributions and the central limit theorem.
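The central limit theorem can be seen in a small simulation. The sketch below builds a hypothetical, strongly skewed "population" of sentence lengths (generated with random numbers, not real corpus data), draws many samples, and compares the spread of the sample means with the spread of the population.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

# Hypothetical population: 10,000 skewed "sentence lengths" (mean around 15).
population = [int(random.expovariate(1 / 15)) + 1 for _ in range(10_000)]

def sample_means(pop, sample_size, n_samples):
    """Draw n_samples random samples and return the mean of each."""
    return [
        statistics.mean(random.sample(pop, sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(population, sample_size=50, n_samples=500)

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

print(f"population mean      : {pop_mean:.2f}")
print(f"mean of sample means : {statistics.mean(means):.2f}")
print(f"population SD        : {pop_sd:.2f}")
print(f"SD of sample means   : {statistics.stdev(means):.2f}")
```

The sample means should cluster tightly around the population mean, with a spread close to the population SD divided by the square root of the sample size, even though the population itself is far from normal.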

8. Hypothesis Testing and Inferential Statistics:

  • Concepts of null and alternative hypotheses.
  • Tests of significance (e.g., Chi-square test) for frequency data.
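A common use of the chi-square test in corpus work is to compare how often a word occurs in two corpora. The sketch below computes the Pearson chi-square statistic for a 2x2 table by hand, without a continuity correction. The counts are hypothetical, and `chi_square_2x2` is an illustrative helper rather than a library function.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected count for each cell = row total * column total / grand total.
    expected = [
        row1 * col1 / n, row1 * col2 / n,
        row2 * col1 / n, row2 * col2 / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: the word occurs 120 times in corpus A and 80 times
# in corpus B; each corpus has 100,000 tokens.
word_a, other_a = 120, 100_000 - 120
word_b, other_b = 80, 100_000 - 80

chi2 = chi_square_2x2(word_a, other_a, word_b, other_b)
CRITICAL_05_DF1 = 3.841  # critical value for df = 1 at the 0.05 level

print(f"chi-square = {chi2:.3f}")
print("significant at 0.05" if chi2 > CRITICAL_05_DF1 else "not significant at 0.05")
```

The null hypothesis here is that the word is equally frequent in both corpora; a statistic above the critical value suggests rejecting it. In practice, a library routine such as SciPy's `chi2_contingency` would also report a p-value.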

9. Correlation and Regression Analysis:

  • Understanding the relationship between variables.
  • Linear regression analysis pertinent to frequency data.
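As a rough sketch of simple linear regression, the snippet below fits a least-squares line and computes Pearson's r using only basic arithmetic. The data (text length in tokens against number of distinct word types) are invented for illustration, and `linear_fit` is a hypothetical helper name.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept, plus Pearson's r, for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Hypothetical data: text length (tokens) and number of distinct types.
tokens = [100, 200, 300, 400, 500]
types = [62, 105, 140, 171, 196]

slope, intercept, r = linear_fit(tokens, types)
print(f"types = {slope:.3f} * tokens + {intercept:.1f}   (r = {r:.3f})")
```

A high r only indicates a linear association. The growth of types with text length is typically curved rather than straight, so a straight line is at best a local approximation.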

10. Data Interpretation and Reporting:

  • Analyzing and interpreting statistical results.
  • Effective communication of findings.

11. Ethical Considerations in Data Analysis:

  • Ethical issues in data collection, analysis, and reporting.
  • Data privacy and confidentiality.

12. Advanced Topics (Optional):

  • Multivariate analysis.
  • Time-series analysis and its relevance to frequency data.

Footnotes

  1. Reference reading for lexical diversity of KSAT reading passages link

  2. 'Back to basics: how measures of lexical diversity can help discriminate between CEFR levels' link

  3. 'MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment' link