Introductory computational linguistics workshop given at GirlsCodeMonth 2018. Topics covered:
- Basic data preprocessing in Python
- TF-IDF word vectorization
- Support vector machine algorithms
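As a rough illustration of the TF-IDF topic, the sketch below computes weights by hand with only the standard library. It is not taken from the workshop scripts: the names `tokenize`, `tfidf`, and `messages` are hypothetical, the example messages are made up, and the formula used (term frequency times log(N / document frequency)) is just one common variant. Libraries usually add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical preprocessing step: lowercase and split on whitespace.
    # Real preprocessing would usually also strip punctuation.
    return text.lower().split()


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up example messages, not rows from the Kaggle dataset.
messages = [
    "win a free prize now",
    "are you free for lunch",
    "call now to win",
]
vectors = tfidf(messages)

# "prize" appears in only one message, so it is weighted higher than "free",
# which appears in two.
print(vectors[0]["prize"] > vectors[0]["free"])
```

Each message becomes a sparse vector of term weights, and vectors of this kind are what a classifier such as an SVM takes as input.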
This workshop walks through a basic SVM classifier that detects whether a piece of text is spam. We use the Kaggle SMS Spam Collection Dataset.
File descriptions:
- bad_evaluate.py: trains and evaluates a classifier on a random train/test split of the entire dataset from Kaggle.
- bad_runner.py: trains a classifier on a random split of the entire dataset and lets the user test it with their own text.
- good_evaluate.py: trains and evaluates a classifier on a balanced spam/ham dataset with a random train/test split.
- good_runner.py: trains a classifier on a random split of the balanced dataset and lets the user test it with their own text.
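The difference between the "bad" and "good" files is whether the classes are balanced before splitting. The sketch below shows one way that could be done, by downsampling the larger class. It is an assumption about the approach, not the code in good_evaluate.py, and the names `balance`, `train_test_split`, and `rows` are hypothetical. The toy data (90 ham, 10 spam) is made up to show why balancing can matter: on such data, a classifier that always answers "ham" is already 90% accurate.

```python
import random


def balance(rows, seed=0):
    """Downsample the larger class so spam and ham have equal counts."""
    rng = random.Random(seed)
    spam = [row for row in rows if row[0] == "spam"]
    ham = [row for row in rows if row[0] == "ham"]
    n = min(len(spam), len(ham))
    balanced = rng.sample(spam, n) + rng.sample(ham, n)
    rng.shuffle(balanced)
    return balanced


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into train and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Made-up (label, text) pairs standing in for the real dataset.
rows = [("ham", f"ham message {i}") for i in range(90)]
rows += [("spam", f"spam message {i}") for i in range(10)]

balanced = balance(rows)
train, test = train_test_split(balanced)

spam_count = sum(1 for label, _ in balanced if label == "spam")
print(len(balanced), spam_count, len(train), len(test))
```

Downsampling discards most of the ham messages, which is a trade-off: the classifier sees less data but can no longer score well by ignoring spam entirely.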
To run any of these files from [your-computer]/linghacks-girlscodemonth-workshop:
python3 [the-file]