A text analysis project on the TED Talk dataset for tag extraction, summarization, and related talk recommendation.
This project was carried out under SMU IS450 Text Mining and Natural Language Processing in AY2019-2020 Semester 2. The team consists of Wende B., Chengzi Z., May M., Suyee K., Xiaowei L., and myself.
The project analyses TED Talk transcripts to automate tag extraction, transcript summarization, and related talk recommendation for a new transcript and title.
The TED Talk dataset is available on Kaggle and contains information and transcripts of talks uploaded to the official TED.com until September 21st, 2017. In total, it covers 2550 talks, of which 2464 have transcripts.
You may wish to read our Medium article to learn more about our system.
Step | Main technology |
---|---|
Transcript Preprocessing | spaCy |
Topic Modelling | Gensim, scikit-learn, LDA Mallet |
TF-IDF Metric Computation | scikit-learn |
Tag Generation | WordNet, NetworkX |
Summarization | TextRank |
Related Talks Recommendation | Scikit-learn |
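
As a rough illustration of the last step in the table, the sketch below ranks existing talks against a new transcript using TF-IDF vectors and cosine similarity from scikit-learn. It is an assumption about how such a recommendation could be done, not the project's actual code: the `recommend_related` function, the `talks` dictionary, and the toy transcripts are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for talk transcripts (hypothetical data, not from the TED dataset).
talks = {
    "Ocean life": "ocean fish coral reef ocean water marine life",
    "Deep sea": "deep sea marine fish water ocean creatures",
    "Space travel": "rocket space mars orbit astronaut launch",
}


def recommend_related(new_text, corpus, top_n=2):
    """Return the top_n (title, score) pairs most similar to new_text.

    Hypothetical helper: similarity is the cosine between TF-IDF vectors.
    """
    titles = list(corpus)
    documents = [corpus[title] for title in titles]

    # Learn the vocabulary and IDF weights from the existing talks only.
    vectorizer = TfidfVectorizer(stop_words="english")
    corpus_matrix = vectorizer.fit_transform(documents)

    # Project the new transcript into the same TF-IDF space.
    new_vector = vectorizer.transform([new_text])

    # One similarity score per existing talk.
    scores = cosine_similarity(new_vector, corpus_matrix)[0]
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


related = recommend_related("marine fish living in ocean water", talks)
for title, score in related:
    print(f"{title}: {score:.3f}")
```

Fitting the vectorizer on the existing talks and only transforming the new transcript keeps the new text from influencing the IDF weights, which is one reasonable design choice among several.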
Besides implementing the system backend, we also created a simple GUI for easy use of our system.
The instructions are as follows:
- Navigate to the project directory after downloading and unzipping/cloning
- Run `python SummarySystem` in Command Prompt
- Fill in the `Title` and `Input your Text` fields
- Click the `Generate` button