# Objective

Build a model that automatically predicts the tags of a StackExchange question from the text of the question.
![Stack Exchange logo](https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/se/se-logo.svg?v=d29f0785ebb7)

__Dataset Specs__: Over 85,000 questions

[Download Link](https://www.kaggle.com/stackoverflow/statsquestions#Questions.csv)

__License__

All Stack Exchange user contributions are licensed under [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) with [attribution required](http://blog.stackoverflow.com/2009/06/attribution-required/).

<br>

***

In [None]:
# optional step (if you are working on colab)
# mount Google drive
from google.colab import drive
drive.mount('/content/drive')

# Steps to Follow



1. Load Data and Import Libraries
2. Text Cleaning
3. Merge Tags with Questions
4. Dataset Preparation
5. Text Representation
6. Model Building
    1. Define Model Architecture
    2. Train the Model
7. Model Predictions
8. Model Evaluation



# Load Data and Import Libraries

In [1]:
# for string matching
import re

# for reading data
import pandas as pd

# for handling html data
from bs4 import BeautifulSoup

# for visualization
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth', 200)


In [None]:
# extract data from the ZIP file
# !unzip 'statsquestions.zip'

In [15]:
# load the StackExchange questions dataset
questions_df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/Datasets/Questions.csv', encoding='latin-1')

# load the tags dataset
tags_df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/Datasets/Tags.csv')

### Data Dictionary

1. Id: Question ID
2. OwnerUserId: User ID
3. CreationDate: Date of posting question
4. Score: Count of Upvotes received by the question
5. Title: Title of the question
6. Body: Text body of the question

In [16]:
#print first 5 rows
questions_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learning?,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain English,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values in statistical tests?,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests...."
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not mean causation,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ..."


# Text Cleaning

Let's define a function to clean the text data.

In [17]:
def cleaner(text):

  # strip HTML tags (the parser is named explicitly to avoid a bs4 warning)
  text = BeautifulSoup(text, "html.parser").get_text()

  # replace every non-alphabetic character (digits, punctuation) with a space
  text = re.sub("[^a-zA-Z]", " ", text)

  # convert text to lower case
  text = text.lower()

  # split on whitespace and re-join to collapse repeated spaces
  tokens = text.split()

  return " ".join(tokens)

In [18]:
# call preprocessing function
questions_df['cleaned_text'] = questions_df['Body'].apply(cleaner)

In [19]:
questions_df['Body'][1]

"<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?</li>\n<li>if let's say I have census data\ndating back to 4 - 5 census periods,\nhow far can i forecast it into the\nfuture?</li>\n<li>if some of the census zone change\nlightly in boundaries, how can i\naccount for that change?</li>\n<li>What are the methods to validate\ncensus forecasts? for example, if i\nhave data for existing 5 census\nperiods, should I model the first 3\nand test it on the latter two? or is\nthere another way?</li>\n<li>what's the state of practice in\nforecasting census data, and what are\nsome of the state of the art methods?</li>\n</ul>\n"

In [20]:
questions_df['cleaned_text'][1]

'what are some of the ways to forecast demographic census with some validation and calibration techniques some of the concerns census blocks vary in sizes as rural areas are a lot larger than condensed urban areas is there a need to account for the area size difference if let s say i have census data dating back to census periods how far can i forecast it into the future if some of the census zone change lightly in boundaries how can i account for that change what are the methods to validate census forecasts for example if i have data for existing census periods should i model the first and test it on the latter two or is there another way what s the state of practice in forecasting census data and what are some of the state of the art methods'
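To make the effect of the regular-expression step concrete, here is a minimal sketch that applies only the non-HTML part of `cleaner` to a short string. The helper name `clean_plain` is made up for this illustration, and it assumes the HTML tags have already been stripped.

```python
import re

def clean_plain(text):
    # hypothetical helper: the regex / lower-case / whitespace steps of cleaner(),
    # without the BeautifulSoup call
    text = re.sub("[^a-zA-Z]", " ", text)   # digits and punctuation become spaces
    text = text.lower()
    return " ".join(text.split())           # collapse repeated whitespace

sample = "Let's say I have data dating back to 4 - 5 census periods"
print(clean_plain(sample))
```

Note that digits are dropped along with punctuation, which is why "4 - 5 census periods" shows up as "census periods" in the cleaned text above.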

# Merge Tags with Questions

Let's now explore the tags data.

In [21]:
tags_df.head()

|   | Id | Tag |
|---|----|-----|
| 0 | 1 | bayesian |
| 1 | 1 | prior |
| 2 | 1 | elicitation |
| 3 | 2 | distributions |
| 4 | 2 | normality |


In [22]:
# count of unique tags
len(tags_df['Tag'].unique())

1315

In [23]:
tags_df['Tag'].value_counts()

r                       13236
regression              10959
machine-learning         6089
time-series              5559
probability              4217
                        ...  
fmincon                     1
doc2vec                     1
sympy                       1
adversarial-boosting        1
corpus-linguistics          1
Name: Tag, Length: 1315, dtype: int64

In [24]:
# remove "-" from the tags
tags_df['Tag']= tags_df['Tag'].apply(lambda x:re.sub("-"," ",x))


In [27]:
# group the tags by question Id (one array of tags per question)
tags_df = tags_df.groupby('Id').apply(lambda x:x['Tag'].values).reset_index(name='tags')
tags_df.head()

|   | Id | tags |
|---|----|------|
| 0 | 1 | [bayesian, prior, elicitation] |
| 1 | 2 | [distributions, normality] |
| 2 | 3 | [software, open source] |
| 3 | 4 | [distributions, statistical significance] |
| 4 | 6 | [machine learning] |


In [28]:
# merge tags and questions
df = pd.merge(questions_df,tags_df,how='inner',on='Id')

In [29]:
df = df[['Id','Body','cleaned_text','tags']]
df.head()

Unnamed: 0,Id,Body,cleaned_text,tags
0,6,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach...",last year i read a blog post from brendan o connor entitled statistics vs machine learning fight that discussed some of the differences between the two fields andrew gelman responded favorably to ...,[machine learning]
1,21,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...,what are some of the ways to forecast demographic census with some validation and calibration techniques some of the concerns census blocks vary in sizes as rural areas are a lot larger than conde...,"[forecasting, population, census]"
2,22,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n,how would you describe in plain english the characteristics that distinguish bayesian from frequentist reasoning,"[bayesian, frequentist]"
3,31,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests....",after taking a statistics course and then trying to help fellow students i noticed one subject that inspires much head desk banging is interpreting the results of statistical hypothesis tests it s...,"[hypothesis testing, t test, p value, interpretation, intuition]"
4,36,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ...",there is an old saying correlation does not mean causation when i teach i tend to use the following standard examples to illustrate this point number of storks and birth rate in denmark number of ...,"[correlation, teaching]"


In [30]:
df.shape

(85085, 4)

After the merge there are over 85,000 unique questions and over 1,300 unique tags.

# Dataset Preparation

In [31]:
# check frequency of occurrence of each tag
freq= {}
for i in df['tags']:
  for j in i:
    if j in freq.keys():
      freq[j] = freq[j] + 1
    else:
      freq[j] = 1

Let's find out the most frequent tags.

In [32]:
# sort the dictionary in descending order
freq = dict(sorted(freq.items(), key=lambda x:x[1],reverse=True))

In [33]:
freq.items()

dict_items([('r', 13236), ('regression', 10959), ('machine learning', 6089), ('time series', 5559), ('probability', 4217), ('hypothesis testing', 3869), ('self study', 3732), ('distributions', 3501), ('logistic', 3316), ('classification', 2881), ('correlation', 2871), ('statistical significance', 2666), ('bayesian', 2656), ('anova', 2505), ('normal distribution', 2181), ('multiple regression', 2054), ('mixed model', 1998), ('clustering', 1952), ('neural networks', 1897), ('mathematical statistics', 1888), ('confidence interval', 1776), ('categorical data', 1703), ('generalized linear model', 1614), ('variance', 1576), ('data visualization', 1549), ('estimation', 1533), ('forecasting', 1422), ('t test', 1418), ('pca', 1395), ('sampling', 1363), ('cross validation', 1344), ('repeated measures', 1335), ('spss', 1296), ('svm', 1283), ('chi squared', 1261), ('maximum likelihood', 1209), ('predictive models', 1189), ('multivariate analysis', 1116), ('survival', 1081), ('references', 1076), (

In [34]:
# Top 10 most frequent tags
common_tags = list(freq.keys())[:10]
common_tags

['r',
 'regression',
 'machine learning',
 'time series',
 'probability',
 'hypothesis testing',
 'self study',
 'distributions',
 'logistic',
 'classification']
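The manual counting and sorting above can also be sketched with `collections.Counter` from the standard library. The tag lists below are invented toy data, not rows from the dataset, and the variable names are ours.

```python
from collections import Counter

# toy stand-in for df['tags']; these tag lists are invented for illustration
toy_tags = [["r", "regression"], ["r", "time series"], ["regression"], ["r"]]

# count every tag across all questions
toy_freq = Counter(tag for tags in toy_tags for tag in tags)

# most_common(n) returns the n most frequent (tag, count) pairs, highest first
top2 = [tag for tag, _ in toy_freq.most_common(2)]
print(top2)
```

On the real data, `most_common(10)` would play the role of sorting the dictionary and slicing the first 10 keys.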

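We will keep only the questions that carry at least two of these 10 tags. The loop below checks `len(temp) > 1`, so questions with a single top-10 tag are dropped along with questions that have none.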

In [35]:
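x=[]
y=[]

for i in range(len(df['tags'])):

  # keep only the tags of this question that are among the 10 most frequent
  temp=[]
  for j in df['tags'][i]:
    if j in common_tags:
      temp.append(j)

  # retain the question only if it has at least two of the top-10 tags
  if(len(temp)>1):
    x.append(df['cleaned_text'][i])
    y.append(temp)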

In [36]:
# number of questions left
len(x)

11106
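To see how the filter behaves, here is a small sketch on invented questions. The shortlist `common`, the sample texts, and the `min_tags` parameter are assumptions made for the example; `min_tags = 2` reproduces the `len(temp) > 1` check.

```python
common = {"r", "regression", "time series"}   # assumed shortlist for the example

# (cleaned text, tags) pairs, invented for illustration
questions = [
    ("how to fit arima in r", ["r", "time series"]),
    ("what is a p value", ["hypothesis testing"]),
    ("lm output in r", ["r"]),
]

min_tags = 2   # same effect as the `len(temp) > 1` check
x_toy, y_toy = [], []
for text, tags in questions:
    kept = [t for t in tags if t in common]   # drop tags outside the shortlist
    if len(kept) >= min_tags:
        x_toy.append(text)
        y_toy.append(kept)

print(x_toy, y_toy)
```

Setting `min_tags = 1` in this sketch would also keep the single-tag question, which is one way to trade a larger dataset against fewer labels per example.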

In [37]:
y[:10]

[['r', 'time series'],
 ['regression', 'distributions'],
 ['distributions', 'probability', 'hypothesis testing'],
 ['hypothesis testing', 'self study'],
 ['r', 'regression', 'time series'],
 ['r', 'time series', 'self study'],
 ['probability', 'hypothesis testing'],
 ['r', 'regression'],
 ['r', 'regression'],
 ['regression', 'logistic']]

# One Hot Encoding

In [38]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

y = mlb.fit_transform(y)
y.shape

(11106, 10)

In [39]:
y[0,:]

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

In [40]:
mlb.classes_

array(['classification', 'distributions', 'hypothesis testing',
       'logistic', 'machine learning', 'probability', 'r', 'regression',
       'self study', 'time series'], dtype=object)
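What `MultiLabelBinarizer` produces can be sketched in plain Python: the classes are sorted alphabetically, and each question becomes a vector with a 1 at the position of every tag it has. The helper `multi_hot` is a hypothetical name used only here.

```python
# class order copied from mlb.classes_ above
classes = ['classification', 'distributions', 'hypothesis testing',
           'logistic', 'machine learning', 'probability', 'r', 'regression',
           'self study', 'time series']

def multi_hot(tags, classes):
    # hypothetical helper: 1 where the class is among the question's tags, else 0
    tag_set = set(tags)
    return [1 if c in tag_set else 0 for c in classes]

# first label list from y[:10] above
row = multi_hot(['r', 'time series'], classes)
print(row)
```

The result has ones at the positions of `r` and `time series`, matching `y[0,:]` shown earlier. Because a row can contain several ones, this is strictly a multi-hot rather than a one-hot encoding.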

We can now split the dataset into a training set and a validation set.

In [41]:
from sklearn.model_selection import train_test_split
x_tr,x_val,y_tr,y_val=train_test_split(x, y, test_size=0.2, random_state=0,shuffle=True)

# Text Representation

In [43]:
from keras.preprocessing.text import Tokenizer
#from keras.preprocessing.sequence import pad_sequences

from keras.utils import pad_sequences

#prepare a tokenizer
x_tokenizer = Tokenizer()

x_tokenizer.fit_on_texts(x_tr)

In [44]:
x_tokenizer.word_index

{'the': 1,
 'i': 2,
 'to': 3,
 'a': 4,
 'of': 5,
 'is': 6,
 'and': 7,
 'in': 8,
 'l': 9,
 'x': 10,
 'for': 11,
 'that': 12,
 'data': 13,
 'this': 14,
 't': 15,
 'have': 16,
 'y': 17,
 'with': 18,
 'model': 19,
 'it': 20,
 'are': 21,
 'be': 22,
 'my': 23,
 'as': 24,
 'on': 25,
 'e': 26,
 'p': 27,
 'if': 28,
 'can': 29,
 'n': 30,
 'but': 31,
 'not': 32,
 'm': 33,
 'or': 34,
 'r': 35,
 'how': 36,
 'regression': 37,
 'c': 38,
 'am': 39,
 's': 40,
 'from': 41,
 'test': 42,
 'what': 43,
 'would': 44,
 'b': 45,
 'so': 46,
 'time': 47,
 'there': 48,
 'using': 49,
 'which': 50,
 'an': 51,
 'do': 52,
 'one': 53,
 'each': 54,
 'value': 55,
 'use': 56,
 'by': 57,
 'some': 58,
 'variables': 59,
 'like': 60,
 'variable': 61,
 'we': 62,
 'at': 63,
 'na': 64,
 'any': 65,
 'f': 66,
 'distribution': 67,
 'two': 68,
 'values': 69,
 'set': 70,
 'you': 71,
 'all': 72,
 'function': 73,
 'fit': 74,
 'd': 75,
 'beta': 76,
 'question': 77,
 'then': 78,
 'mean': 79,
 'me': 80,
 'know': 81,
 'where': 82,
 'when'

In [45]:
len(x_tokenizer.word_index)

25312

There are around 25,000 tokens in the training dataset. Let's see how many tokens appear at least 3 times in the dataset.

In [46]:
thresh = 3

cnt=0
for key,value in x_tokenizer.word_counts.items():
  if value>=thresh:
    cnt=cnt+1

print(cnt)

12574


Over 12,000 tokens have appeared three times or more in the training set.
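The counting step can be sketched without Keras on a toy corpus. The three sentences below are invented, and `word_counts` here is a plain `Counter` standing in for `x_tokenizer.word_counts`.

```python
from collections import Counter

# toy corpus standing in for x_tr (already cleaned text)
texts = ["the model fit the data", "the data set", "fit a model"]

# count how often each token occurs across the corpus
word_counts = Counter(word for text in texts for word in text.split())

thresh = 2
kept = sorted(w for w, c in word_counts.items() if c >= thresh)
print(len(kept), kept)
```

Tokens below the threshold ("set" and "a" in this toy corpus) are the ones that would be treated as out-of-vocabulary once the tokenizer is rebuilt with a limited vocabulary.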


In [47]:
# prepare the tokenizer again
x_tokenizer = Tokenizer(num_words=cnt,oov_token='unk')

#prepare vocabulary
x_tokenizer.fit_on_texts(x_tr)

Now that every token is mapped to an integer, let's convert the text sequences to integer sequences. After that we will pad (or truncate) each integer sequence to a fixed length of 100.

In [49]:
# maximum sequence length allowed
max_len = 100

#convert text sequences into integer sequences
x_tr_seq = x_tokenizer.texts_to_sequences(x_tr)
x_val_seq = x_tokenizer.texts_to_sequences(x_val)

# padding is needed because the model expects every input to have the same length

# pad with zeros at the end of each sequence (padding='post')
x_tr_seq = pad_sequences(x_tr_seq,  padding='post', maxlen=max_len)
x_val_seq = pad_sequences(x_val_seq, padding='post', maxlen=max_len)

Since index 0 is reserved for padding, we add one to the vocabulary size that is passed to the embedding layer.

In [50]:
#no. of unique words
x_voc_size = x_tokenizer.num_words + 1
x_voc_size

12575

In [51]:
x_tr_seq[0]

array([1953, 5711,  416, 2023,    1,  226, 1747, 3740,  609,   43,  181,
       1953,  372,   19,  100,  416,    9, 1747, 3839,  238,   27,   27,
         27,   27,   27,   70,    6, 6919,    8, 1163,   70,    6,   43,
         43, 1802, 1802, 1802,   36,   36,   36,   36, 4308, 5410,    4,
        124,  592,  107,   22,    2, 1747, 4065,   27,   10, 1309,   10,
       6414,   10,  190,   10,  416,   10,   27,   10, 1309,   10, 6414,
         10,  190,   10,  416,   10,  456,  139,   15,    7,    2, 4610,
        164,   27,   10, 1309,   10, 6414,   10,  190,   10,  416,   10,
         27,   76,   27, 1309,   76,   27, 6414,   76,   27,  190,   76,
         27], dtype=int32)
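The first training sequence above contains no zeros, which suggests that the question was longer than 100 tokens and was truncated rather than padded. A minimal pure-Python sketch of that behaviour follows, assuming `padding='post'` together with the default `truncating='pre'` of `pad_sequences`; the function name is hypothetical and this is not the Keras implementation.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    # hypothetical re-implementation for illustration only:
    # mimics padding='post' with the default truncating='pre'
    if len(seq) >= maxlen:
        return seq[-maxlen:]                      # keep the last maxlen tokens
    return seq + [value] * (maxlen - len(seq))    # append zeros at the end

print(pad_post_truncate_pre([5, 8, 2], 5))
print(pad_post_truncate_pre([1, 2, 3, 4, 5, 6, 7], 5))
```

Under these assumptions, long questions lose their opening tokens. Passing `truncating='post'` to `pad_sequences` would keep the beginning of the question instead.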

# Model Building

In [52]:
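# numpy is used later for the threshold search and for reshaping arrays
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense
from keras.callbacks import ModelCheckpoint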

### Define Model Architecture

In [54]:
#sequential model
model = Sequential()

#embedding layer - maps each token index to a 50-dimensional trainable vector
# mask_zero=True - the padding index 0 is masked, so downstream layers skip padded timesteps
model.add(Embedding(x_voc_size, 50, input_shape=(max_len,), mask_zero=True))

#rnn layer
model.add(SimpleRNN(128,activation='relu'))

#dense layer
model.add(Dense(128,activation='relu'))

#output layer
# sigmoid gives an independent probability for each of the 10 tags (multi-label setup)
model.add(Dense(10,activation='sigmoid'))

Inspect the output shape and number of parameters of each layer:

In [55]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 50)           628750    
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 128)               22912     
                                                                 
 dense_2 (Dense)             (None, 128)               16512     
                                                                 
 dense_3 (Dense)             (None, 10)                1290      
                                                                 
Total params: 669,464
Trainable params: 669,464
Non-trainable params: 0
_________________________________________________________________
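The parameter counts in the summary can be reproduced by hand. The sketch below uses the layer sizes from the model definition; the variable names are ours.

```python
vocab_size, embed_dim = 12575, 50
rnn_units, dense_units, n_tags = 128, 128, 10

embedding = vocab_size * embed_dim                      # one 50-d vector per index
simple_rnn = rnn_units * (embed_dim + rnn_units + 1)    # input weights + recurrent weights + bias
dense = rnn_units * dense_units + dense_units           # weights + bias
output = dense_units * n_tags + n_tags                  # weights + bias

total = embedding + simple_rnn + dense + output
print(embedding, simple_rnn, dense, output, total)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and the embedding dimension are the main levers on model size.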


Define the optimizer and loss:

In [56]:
#define optimizer and loss
model.compile(optimizer='adam',loss='binary_crossentropy')
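For reference, binary cross-entropy treats each of the 10 outputs as an independent yes/no decision. Here is a minimal pure-Python sketch of the loss for a single example; the function name and the clipping constant `eps` are assumptions for illustration, and the Keras implementation differs in its details.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # hypothetical helper: mean over labels of -[y*log(p) + (1-y)*log(1-p)]
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

uncertain = binary_crossentropy([1, 0], [0.5, 0.5])
confident = binary_crossentropy([1, 0], [0.9, 0.1])
print(uncertain, confident)
```

A prediction of 0.5 for every label gives a loss of log 2, and the loss shrinks as the predicted probabilities move toward the true labels.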

Define a callback: `ModelCheckpoint`. It saves the model whenever the monitored validation loss improves, so the best weights seen during training are kept on disk.

In [57]:
# checkpoint to save best model during training
mc = ModelCheckpoint("weights.best.hdf5", monitor='val_loss', verbose=1, save_best_only=True, mode='min')

### Train the Model

Let's train the model for 10 epochs with a batch size of 128:

In [58]:
#train the model
model.fit(x_tr_seq, y_tr, batch_size=128, epochs=10, verbose=1, validation_data=(x_val_seq, y_val), callbacks=[mc])

Epoch 1/10
Epoch 1: val_loss improved from inf to 0.48226, saving model to weights.best.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 0.48226 to 0.47334, saving model to weights.best.hdf5
Epoch 3/10
Epoch 3: val_loss improved from 0.47334 to 0.45529, saving model to weights.best.hdf5
Epoch 4/10
Epoch 4: val_loss improved from 0.45529 to 0.42496, saving model to weights.best.hdf5
Epoch 5/10
Epoch 5: val_loss did not improve from 0.42496
Epoch 6/10
Epoch 6: val_loss improved from 0.42496 to 0.42449, saving model to weights.best.hdf5
Epoch 7/10
Epoch 7: val_loss did not improve from 0.42449
Epoch 8/10
Epoch 8: val_loss improved from 0.42449 to 0.41864, saving model to weights.best.hdf5
Epoch 9/10
Epoch 9: val_loss did not improve from 0.41864
Epoch 10/10
Epoch 10: val_loss did not improve from 0.41864


<keras.callbacks.History at 0x78ef3d3f6980>

# Model Predictions

Load the best model weights; the model is then ready to make predictions.

In [59]:
# load the best weights saved by the checkpoint
model.load_weights("weights.best.hdf5")

#predict probabilities
pred_prob = model.predict(x_val_seq)



In [60]:
pred_prob[0]

array([0.03214562, 0.02385592, 0.0871686 , 0.41431734, 0.14316425,
       0.00634711, 0.6952231 , 0.51670754, 0.03453634, 0.24150449],
      dtype=float32)

The predictions are in terms of probabilities for each of the 10 tags. Hence we need to have a threshold value to convert these probabilities to 0 or 1.

Let's specify a set of candidate threshold values. We will select the threshold value that performs the best for the validation set.

In [61]:
#define candidate threshold values
threshold  = np.arange(0,0.5,0.01)
threshold

array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
       0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
       0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
       0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
       0.44, 0.45, 0.46, 0.47, 0.48, 0.49])

Let's define a function that takes a threshold value and uses it to convert probabilities into 1 or 0.

In [62]:
# convert probabilities into classes or tags based on a threshold value
def classify(pred_prob,thresh):
  y_pred_seq = []

  for i in pred_prob:
    temp=[]
    for j in i:
      if j>=thresh:
        temp.append(1)
      else:
        temp.append(0)
    y_pred_seq.append(temp)

  return y_pred_seq
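The nested loops in `classify` can also be written as a single NumPy comparison. This is a sketch of an equivalent alternative, not a replacement the notebook relies on; `classify_vectorized` is a hypothetical name and 0.5 is an arbitrary threshold chosen for the example.

```python
import numpy as np

def classify_vectorized(pred_prob, thresh):
    # hypothetical alternative to classify(): compare the whole array at once
    return (np.asarray(pred_prob) >= thresh).astype(int)

# pred_prob[0] from the output above, rounded to three decimals
probs = np.array([[0.032, 0.024, 0.087, 0.414, 0.143,
                   0.006, 0.695, 0.517, 0.035, 0.242]])
labels = classify_vectorized(probs, 0.5)
print(labels)
```

The result is a NumPy array instead of a list of lists; calling `.tolist()` on it gives the same structure that `classify` returns.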

In [63]:
from sklearn import metrics
score=[]

#flatten to a 1d array
y_true = np.array(y_val).ravel()

for thresh in threshold:

    #classes for each threshold
    y_pred_seq = classify(pred_prob,thresh)

    #convert to 1d array
    y_pred = np.array(y_pred_seq).ravel()

    score.append(metrics.f1_score(y_true,y_pred))

In [64]:
# find the optimal threshold
opt = threshold[score.index(max(score))]
opt

0.34
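The threshold search can be sketched end to end on toy data without scikit-learn. The labels and probabilities below are invented, and `f1_binary` is a hypothetical helper that computes F1 for the positive class. Because the real labels are flattened first, the score pools all (question, tag) pairs together.

```python
def f1_binary(y_true, y_pred):
    # F1 for the positive class: 2*TP / (2*TP + FP + FN)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# toy labels and probabilities, invented for illustration
y_true_toy = [1, 0, 1, 0]
probs_toy = [0.9, 0.4, 0.35, 0.1]

# same candidate grid as above: 0.00, 0.01, ..., 0.49
thresholds = [i / 100 for i in range(50)]
scores = [f1_binary(y_true_toy, [int(p >= t) for p in probs_toy])
          for t in thresholds]

best = thresholds[scores.index(max(scores))]   # first threshold with the top score
print(best, max(scores))
```

When several thresholds tie for the best score, `index(max(...))` returns the lowest one, which slightly favours recall over precision.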

# Model Evaluation

In [65]:
#predictions for optimal threshold
y_pred_seq = classify(pred_prob,opt)
y_pred = np.array(y_pred_seq).ravel()

In [66]:
print(metrics.classification_report(y_true,y_pred))

              precision    recall  f1-score   support

           0       0.88      0.85      0.86     17520
           1       0.50      0.58      0.54      4700

    accuracy                           0.79     22220
   macro avg       0.69      0.71      0.70     22220
weighted avg       0.80      0.79      0.79     22220
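As a quick sanity check, the F1 column of the report is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values printed above, so it only matches to two decimals; `f1_from` is a name introduced here.

```python
def f1_from(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

f1_tag_present = f1_from(0.50, 0.58)   # row "1" of the report
f1_tag_absent = f1_from(0.88, 0.85)    # row "0" of the report
print(round(f1_tag_present, 2), round(f1_tag_absent, 2))
```

Row "1" is the more informative one here: 17,520 of the 22,220 flattened labels are zeros, so the overall accuracy of 0.79 is dominated by correctly predicting that a tag is absent.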



In [67]:
y_pred = mlb.inverse_transform(np.array(y_pred_seq))
y_true = mlb.inverse_transform(np.array(y_val))

df = pd.DataFrame({'comment':x_val,'actual':y_true,'predictions':y_pred})

In [68]:
df.sample(10)

Unnamed: 0,comment,actual,predictions
2032,is this separate type of distribution ex binomial bernoulli multinomial or any distribution can be represented by this way can someone elaborate with simple example,"(distributions, probability)","(probability,)"
2112,how can for instance the gamma distribution diverge near zero for an appropriate set of scale and shape parameters say shape and scale and still have its area equal to one as i understand it the a...,"(distributions, probability)","(probability,)"
1382,i have a time series object calc visit ts i want to apply the best fit time series model based on the mape value for each model the issue i face is that the mape value holt winter multiplicative m...,"(r, time series)","(logistic, r, regression)"
955,i have daily sales data for a department store for the past days i have indicators on the major holidays and the days leading up to the major holidays the number of days before the holidays that a...,"(r, time series)","(r, regression)"
1362,i am interested in a classification with imbalanced classes and i was rather happy with the caret package so far but i have some questions relative to sampling when the sampling is done during cro...,"(machine learning, r)","(classification, machine learning)"
1033,let us say we have some demographic time series data which tells us how many hours people spend in front of a computer screen each day grouped by age and gender set seed dates seq as date by day l...,"(r, time series)","(logistic, r, regression)"
1800,i am having issues answering part two i think it is about clt correct me if i m wrong but how do you compute the distribution details from this please help and thank you for all your contributions...,"(probability, self study)","(distributions, probability, self study)"
1182,i m struggling conceptually with something if i had two simulation outputs that had dates and counts of outcomes of those dates could i then create a probability distribution of time distances bet...,"(distributions, r, time series)","(logistic, r)"
952,i am interested in systems with an observable binary outcome e g success failure the individual cases are usually modelled e g using logistic regression as having an estimated probability of succe...,"(distributions, probability)","(distributions, probability, self study)"
208,i am airliner i want to protect my business from price volatility of jet fuel cost jet fuel is not traded in futures market but crude oil is traded in futures market i have daily spot prices of je...,"(regression, time series)","(logistic, r, regression)"


### Inference

In [None]:
def predict_tag(comment):

  #preprocess the raw text with the same cleaner used for training
  text = [cleaner(comment)]

  #convert to integer sequences
  seq = x_tokenizer.texts_to_sequences(text)

  #pad the sequence
  pad_seq = pad_sequences(seq,  padding='post', maxlen=max_len)

  #predict probabilities and apply the optimal threshold
  pred_prob = model.predict(pad_seq)
  classes = classify(pred_prob,opt)[0]

  #map the binary vector back to tag names
  classes = np.array([classes])
  classes = mlb.inverse_transform(classes)
  return classes

In [None]:
comment = "For example, in the case of logistic regression, the learning function is a Sigmoid function that tries to separate the 2 classes"

print("Comment:",comment)
print("Predicted Tags:",predict_tag(comment))

Comment: For example, in the case of logistic regression, the learning function is a Sigmoid function that tries to separate the 2 classes
Predicted Tags: [('classification', 'logistic', 'machine learning', 'regression')]
