Machine learning text categorisation for the BBC dataset
- Go to https://colab.research.google.com/github/TeaWithLucas/ML-text-categorisation/blob/main/ml_text_cat.ipynb
- Upload the file datasets_coursework1.zip by going to the Files section on the left, then clicking the upload button, as circled in the following screenshot
- (Optional) Change any of the settings as you desire under the Settings heading (details below)
- Run all the code using Runtime > Run all (Ctrl+F9), or run each cell sequentially by hand
- View the Execution section to see the execution process and outputs, and the Evaluation subheading for the final results
- (Optional) Adjust any of the Settings as you desire
- (Optional) Run the Settings and Execution section and review the results
The Settings section contains the following variables, which control how the program works:
zipped_data
- the name of the zip file containing the datasets
data_path_folder
- directory of the category folders
test_size
- test set split in percent (the training split is the remaining amount)
dev_size
- development set split in percent (the training split is the remaining amount)
token_max
- maximum number of most frequent tokens allowed in a vocabulary
tagged_max
- maximum number of most frequent tagged words allowed in a vocabulary
ngram_max
- maximum number of most frequent n-grams allowed in a vocabulary
feature_sets_enabled
- list of feature sets to be used in the model; they are explained and listed in the feature sets section below
word_tags
- part-of-speech tags allowed in the word_tagged feature set
scoring_model
- the feature selection scoring model
model_choices
- a set of tuples, each containing a model class and the chosen parameters for that class
list_num_features
- a list of possible numbers of top-scoring features allowed
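As a sketch, a Settings cell might look like the following. The variable names come from the list above; every value shown is an illustrative assumption, not the notebook's default:

```python
# Illustrative Settings values; names match the variables above,
# values are assumptions, not the notebook's defaults.
zipped_data = "datasets_coursework1.zip"  # zip file containing the datasets
data_path_folder = "bbc"                  # directory of the category folders
test_size = 20                            # percent of data held out for testing
dev_size = 10                             # percent held out for development
token_max = 2000                          # cap on most frequent tokens
tagged_max = 1000                         # cap on most frequent tagged words
ngram_max = 500                           # cap on most frequent n-grams
feature_sets_enabled = ["gen_basic", "bigram", "embedding_glove"]
word_tags = ["JJ", "VB"]                  # POS tags kept by word_tagged
list_num_features = [100, 500, 1000]      # candidate top-feature counts
```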
The following feature sets use normalised frequency counts:
gen_basic
- uses NLTK-tokenised words to generate features based on those words
bigram
- uses sklearn's default word tokeniser to generate features based on pairs of words
trigram
- uses sklearn's default word tokeniser to generate features based on triplets of words
word_tagged
- uses NLTK part-of-speech tagged words, filtered by the list in the word_tags setting, to generate features based on those tagged words
verb
- uses NLTK part-of-speech tagged words, selecting only verb-tagged words, to generate features based on those tagged words
adj
- uses NLTK part-of-speech tagged words, selecting only adjective-tagged words, to generate features based on those tagged words
adjverb
- uses NLTK part-of-speech tagged words, selecting only verb- and adjective-tagged words, to generate features based on those tagged words
The following feature sets use pretrained embedding models:
embedding_glove
- uses the pretrained model glove-twitter-25 to generate features
embedding_fasttext
- uses the pretrained model fasttext-wiki-news-subwords-300 to generate features
embedding_word2vec
- uses the pretrained model word2vec-google-news-300 to generate features
Any embedding model listed at https://github.com/RaRe-Technologies/gensim-data#models can be used by giving its name in the format embedding_[model name]
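A common way to turn a pretrained embedding model into document features is to average the word vectors of a document's tokens. A minimal sketch follows; the gensim.downloader call is commented out because the models are large downloads, and `doc_vector` is a hypothetical helper, not code from the notebook:

```python
import numpy as np
# import gensim.downloader as api       # loader for the gensim-data models
# kv = api.load("glove-twitter-25")     # any name from the gensim-data list

def doc_vector(tokens, kv, size):
    """Average word vectors of in-vocabulary tokens; zero vector if none."""
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(size)

# Toy stand-in for a loaded KeyedVectors model, for demonstration only.
toy_kv = {"good": np.array([1.0, 0.0]), "match": np.array([0.0, 1.0])}
print(doc_vector(["good", "match", "unknown"], toy_kv, 2))  # -> [0.5 0.5]
```

With a real model, the call would be `doc_vector(tokens, kv, kv.vector_size)`; out-of-vocabulary words are simply skipped.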