This is the material for the Advanced Text Analysis with R course, part of the City University of Hong Kong Summer School in Social Science Research. I will use this page to publish lecture slides, handouts, data sets, etc. As the title indicates, the course will be taught almost entirely in R. If you don't use R yet, please make sure that you install R and RStudio on your laptop.
This repository hosts the slides (HTML and source code). The source code for all handouts is published on my learningR page:
**June 2nd (morning):**
In this introductory session you will learn how to use R to organize and transform your data: calculating columns, subsetting, transforming and merging data, and computing aggregate statistics. If time permits, we will also cover basic modelling and/or programming in R as desired.
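As a rough illustration of the operations listed above, here is a minimal sketch in base R. The data frames `income` and `regions` and all their values are made up for this example and are not part of the course data.

```r
# Hypothetical toy data: all names and values are invented for illustration
income <- data.frame(
  country = c("NL", "NL", "HK", "HK"),
  year    = c(2013, 2014, 2013, 2014),
  gdp     = c(100, 110, 200, 220),
  pop     = c(10, 10, 20, 20)
)
regions <- data.frame(
  country = c("NL", "HK"),
  region  = c("Europe", "Asia")
)

# Calculating a column: derive a new variable from existing ones
income$gdp_per_capita <- income$gdp / income$pop

# Subsetting: keep only the rows for 2014
recent <- income[income$year == 2014, ]

# Merging: add the region of each country by matching on 'country'
merged <- merge(income, regions, by = "country")

# Aggregate statistics: mean GDP per region
means <- aggregate(gdp ~ region, data = merged, FUN = mean)
print(means)
```

The same steps can be written with other packages; base R is used here only to keep the sketch free of dependencies.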
In this session we will look briefly at visualizing data in R. The main focus of the session is on using APIs from R. We will be looking at the Twitter, Facebook, and NY Times APIs, and also see how to access arbitrary web resources from R.
This is the first session that directly deals with text analysis. The goal of this session is to learn how to use AmCAT as a document management tool, upload data, and perform queries from R.
In this session the focus is on the Document Term Matrix: word clouds, comparison of different corpora, and topic models.
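To make the idea of a Document Term Matrix concrete, the following sketch builds one by hand in base R: one row per document, one column per term, and cells holding term counts. The three toy documents are invented, and the course itself may use dedicated packages for this step rather than the manual approach shown here.

```r
# Hypothetical toy corpus, invented for illustration
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log",
          d3 = "cats and dogs")

# Tokenize: lower-case the text and split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")
names(tokens) <- names(docs)

# The vocabulary is the sorted set of unique terms across all documents
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# Transpose so that rows are documents and columns are terms
dtm <- t(counts)
colnames(dtm) <- vocab
print(dtm)

# Overall term frequencies, e.g. as input for a word cloud
term_freq <- sort(colSums(dtm), decreasing = TRUE)
print(head(term_freq, 3))
```

Comparing corpora then amounts to comparing column sums between subsets of rows of such a matrix.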
In this session we will do sentiment analysis using both a dictionary approach and machine learning. These techniques can also be applied to other forms of automatic content analysis, such as topic classification or frame analysis.
- Slides
- Handout: Sentiment Resources
- Handout: Lexical Sentiment Analysis
- Handout: Text Classification
- Data: sentiment lexicon
- Data: Amazon reviews
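The dictionary approach to sentiment analysis can be sketched in a few lines of base R: look up each word of a text in a lexicon of positive and negative terms and sum the scores. The tiny `lexicon`, the example `reviews`, and the helper `score_review` below are all hypothetical and much simpler than the resources listed above.

```r
# Hypothetical mini lexicon: +1 for positive words, -1 for negative words
lexicon <- c(good = 1, great = 1, love = 1,
             bad = -1, terrible = -1, hate = -1)

# Hypothetical example texts, invented for illustration
reviews <- c(r1 = "I love this great book",
             r2 = "Terrible plot and bad writing",
             r3 = "It arrived on Tuesday")

# Score one text: tokenize, look up each token, and sum the matches.
# Tokens that are not in the lexicon yield NA and are ignored.
score_review <- function(text, lexicon) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(lexicon[tokens], na.rm = TRUE)
}

scores <- vapply(reviews, score_review, numeric(1), lexicon = lexicon)

# Turn the numeric score into a coarse label
labels <- ifelse(scores > 0, "positive",
                 ifelse(scores < 0, "negative", "neutral"))
print(data.frame(score = scores, label = labels))
```

A simple sum ignores negation and intensity ("not good", "very bad"), which is one reason to compare the dictionary result with a trained classifier.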
In the last session we will look at semantic network analysis with word-window approaches and more advanced visualization techniques using ggplot2, igraph, and Gephi.
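A word-window approach treats two words as connected when they occur within a fixed number of positions of each other. The sketch below counts such co-occurrences in base R and turns them into a weighted edge list. The token sequence and the window size of 2 are arbitrary choices for this example, not the settings used in the course.

```r
# Hypothetical token sequence, invented for illustration
tokens <- c("tax", "cuts", "help", "growth",
            "tax", "cuts", "hurt", "schools")
window <- 2  # assumed window size: link words at most 2 positions apart

vocab <- sort(unique(tokens))
cooc <- matrix(0, nrow = length(vocab), ncol = length(vocab),
               dimnames = list(vocab, vocab))

n <- length(tokens)
for (i in seq_len(n - 1)) {
  # Look ahead from position i to the end of the window (or of the text)
  for (j in (i + 1):min(i + window, n)) {
    a <- tokens[i]
    b <- tokens[j]
    if (a != b) {
      # Undirected network: update both cells to keep the matrix symmetric
      cooc[a, b] <- cooc[a, b] + 1
      cooc[b, a] <- cooc[b, a] + 1
    }
  }
}

# Convert the upper triangle to an edge list (one row per connected pair)
idx <- which(cooc > 0 & upper.tri(cooc), arr.ind = TRUE)
edges <- data.frame(from   = vocab[idx[, 1]],
                    to     = vocab[idx[, 2]],
                    weight = cooc[idx])
print(edges)
```

An edge list in this shape is a common input format for network tools; in igraph, for example, it could be passed to `graph_from_data_frame()` for plotting or further analysis.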