FunnelAI Machine Learning Experiments

Code & notebooks for ML Experiments

Data

labeled_dataset.csv contains a small labeled dataset of forum posts. The relevant columns are body and label_*, where * is a training objective such as intent. The condition, industry and credit objectives each have only two label values, so each can be treated as a binary or multi-label prediction problem. The intent objective has 4 possible labels (classes), so it can be treated as a multi-label or multi-class problem.
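
As a starting point for inspecting the label_* columns, here is a minimal sketch that counts label values per objective using only the standard library. The sample rows and their label values (buy, sell, used, new) are made up for illustration and do not come from labeled_dataset.csv. The helper names read_rows and label_histograms are hypothetical, and skipping empty cells is an assumption about how missing labels are stored.

```python
import csv
import io
from collections import Counter


def read_rows(handle):
    """Read CSV rows as dicts; `handle` is any open text file."""
    return list(csv.DictReader(handle))


def label_histograms(rows, prefix="label_"):
    """Count label values per objective, skipping empty cells (assumed to mean 'unlabeled')."""
    histograms = {}
    for row in rows:
        for column, value in row.items():
            if column and column.startswith(prefix) and value:
                objective = column[len(prefix):]
                histograms.setdefault(objective, Counter())[value] += 1
    return histograms


# Made-up sample standing in for labeled_dataset.csv.
SAMPLE = """body,label_intent,label_condition
Looking to buy a used truck,buy,used
Selling my old sedan,sell,used
Any thoughts on the new model?,,new
"""

rows = read_rows(io.StringIO(SAMPLE))
histograms = label_histograms(rows)
for objective, counts in histograms.items():
    print(objective, dict(counts))
```

To run this on the real file, something like `with open("labeled_dataset.csv", newline="", encoding="utf-8") as f: rows = read_rows(f)` should work, assuming the file is UTF-8 encoded.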

Things to explore

Data visualization
Text representation strategies
- Cleaning (removing unwanted characters)
- Tokenization
  - Per-word tokenization
  - Character n-grams
- Vectorization
  - Token histograms
  - Normalized token histogram
  - TF-IDF
  - Random embeddings
  - Pretrained embeddings (word2vec)
- Dimensionality reduction
Models
- Decision tree
- Random forest
- Gradient boosted trees
- SVM
- Logistic regression
- 1-layer neural net
- 2-layer neural net
- 3-layer neural net
- LSTM
- BERT
Effect of training objective
- 2-class problems: multi-label vs binary optimization
- Problems with more than 2 classes: multi-label vs multi-class optimization
- Multi-objective vs single-objective optimization of deep models
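
To make the text representation items above concrete, here is a rough sketch of per-word tokenization, character n-grams, and a plain TF-IDF weighting, written with the standard library only. The function names, the token regular expression, and the two example sentences are assumptions chosen for illustration. The TF-IDF variant is the unsmoothed textbook form, tf * log(N / df); library implementations often add smoothing and normalization, so their numbers will differ.

```python
import math
import re
from collections import Counter


def word_tokens(text):
    """Lowercase the text and split it into runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and return overlapping character n-grams."""
    cleaned = re.sub(r"\s+", " ", text.lower()).strip()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def tfidf(docs_tokens):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for tokens in docs_tokens for tok in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = sum(counts.values())
        vectors.append({
            tok: (count / total) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return vectors


# Made-up example documents.
docs = ["The truck is for sale", "Is the truck new?"]
word_docs = [word_tokens(d) for d in docs]
char_docs = [char_ngrams(d, n=3) for d in docs]

word_vocab = {tok for doc in word_docs for tok in doc}
char_vocab = {tok for doc in char_docs for tok in doc}
print("word vocab size:", len(word_vocab))
print("char 3-gram vocab size:", len(char_vocab))

vectors = tfidf(word_docs)
print(vectors[0])
```

Tokens that occur in every document get a weight of zero under this weighting, which is one reason to compare it against raw and normalized token histograms.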

Task description

Load the dataset into a Jupyter notebook or an interactive session in your editor (e.g. VS Code, Atom).
Explore the data. Build histograms of class labels. Which training objectives have the most data? Are all objectives usable?
Experiment with different text representation techniques. Explore the result of tokenization by building histograms, etc. Explore which kinds of tokenization lead to which vocabulary sizes. Try to find reasonable configurations that keep the vocabulary size from growing too large.
Implement functionality to convert tokenized text to vector representations. Explore how to normalize, reduce dimensionality, and perform other transformations in this vector representation. Can you also visualize the documents in this vector space with, e.g., t-SNE or similar approaches?
Train and tune individual models. Selection of models is up to you. For deep learning models, I recommend using PyTorch. The most natural fit for this dataset is to treat the outputs as multi-label classification, but you can also experiment with multi-class and binary classification for comparison.
Tune each model to the extent you feel is necessary.
Write up your experiences and observations in whatever form you find most useful: Google Docs, Jupyter notebooks, or markdown.
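
For the model training step, the difference between the multi-class and multi-label objectives can be sketched in a few lines. This is a standard-library illustration, not code from this repository; the function names and the all-zero logits are assumptions for the example. In PyTorch, the corresponding losses are `torch.nn.CrossEntropyLoss` (softmax over classes, exactly one correct class) and `torch.nn.BCEWithLogitsLoss` (an independent sigmoid per class).

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Multi-class objective: exactly one class is correct."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]


def sigmoid_binary_cross_entropy(logits, targets):
    """Multi-label objective: each class is an independent yes/no decision.

    Returns the mean over classes of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z),
    computed in a numerically stable form.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# An untrained model for a 4-class objective such as intent: all logits are zero.
logits = [0.0, 0.0, 0.0, 0.0]

multi_class_loss = softmax_cross_entropy(logits, target_index=0)
multi_label_loss = sigmoid_binary_cross_entropy(logits, targets=[1, 0, 0, 0])

print("multi-class loss:", multi_class_loss)   # log(4): uniform guess over 4 classes
print("multi-label loss:", multi_label_loss)   # log(2): coin flip for each class
```

The two losses are on different scales, so when comparing the objectives it is safer to compare accuracy or F1 on held-out data than raw loss values.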
