# Organize ML projects with Scikit-Learn

Machine learning is powerful, but people often overestimate it, as if applying it to a project would solve every problem. In reality, it is not that simple: to be effective, the work needs to be well organized. In this notebook, we will walk through the practical aspects of an ML project. To see the big picture, let's start with the checklist below. It should work reasonably well for most ML projects, but make sure to adapt it to your needs:

1. **Define the scope of work and objective**
    * How will your solution be used?
    * How should performance be measured? Are there any constraints?
    * How would the problem be solved manually?
    * List the assumptions you are making, and verify them where possible.
    
    
2. **Get the data**
    * Document where you can get the data
    * Store the data in a workspace you can easily access
    * Convert the data to a format you can easily manipulate
    * Check the overview (size, type, sample, description, statistics)
    * Clean the data
    
    
3. **EDA & Data transformation**
    * Study each attribute and its characteristics (missing values, type of distribution, usefulness)
    * Visualize the data
    * Study the correlations between attributes
    * Apply feature selection, feature engineering, and feature scaling
    * Write functions for all data transformations
    
    
4. **Train models**
    * Automate as much as possible
    * Train promising models quickly using standard parameters, then measure and compare their performance
    * Analyze the errors the models make
    * Shortlist the top three to five most promising models, preferring models that make different types of errors.


5. **Fine-tuning**
    * Treat data transformation choices as hyperparameters, especially when you are not sure about them (e.g., whether to replace missing values with zeros or with the median value)
    * Unless there are very few hyperparameter values to explore, prefer random search over grid search.
    * Try ensemble methods
    * Test your final model on the test set to estimate the generalization error. Don't tweak your model after that, or you will start overfitting the test set.
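
The fine-tuning advice above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not part of the article example below: the dataset, the choice of `LogisticRegression`, and the parameter ranges are all assumptions made for the sketch. It treats the imputation strategy as a hyperparameter and samples configurations with `RandomizedSearchCV` instead of enumerating a grid.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 10% of the values knocked out (illustration only).
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The imputation strategy is searched alongside the model's own hyperparameter.
param_distributions = {
    "impute__strategy": ["mean", "median", "constant"],  # "constant" fills with 0
    "clf__C": loguniform(1e-3, 1e2),
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=10, cv=3, random_state=0
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
# Evaluate on the held-out test set once, at the very end.
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

Because the imputer lives inside the pipeline, it is refitted on each cross-validation fold, so no statistics from the validation folds leak into training.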

## Example: Article categorization

### Objectives

Build a model that predicts the category of an article from its text.

### Get Data

In [43]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")

In [44]:
bbc = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/bbc-text.csv')

In [45]:
bbc.sample(5)

| | category | text |
|---|---|---|
| 497 | business | putin backs state grab for yukos russia s pres... |
| 789 | sport | parker misses england clash tom shanklin will ... |
| 391 | business | industrial revival hope for japan japanese ind... |
| 2064 | entertainment | george michael to perform for bbc george micha... |
| 1949 | business | huge rush for jet airways shares indian airlin... |


In [46]:
bbc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [47]:
# Summary statistics: row count, number of unique values, most frequent value and its frequency
bbc.describe()

| | category | text |
|---|---|---|
| count | 2225 | 2225 |
| unique | 5 | 2126 |
| top | sport | britons fed up with net service a survey condu... |
| freq | 511 | 2 |
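
The `describe()` output above reports 2225 texts but only 2126 unique ones, so 99 rows repeat an earlier article. Duplicates like these can end up on both sides of a train/test split and inflate the score. Here is a minimal sketch of finding and dropping them with pandas, using a small made-up frame in place of `bbc` (the rows are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for `bbc` (an assumption; the real data is loaded above).
df = pd.DataFrame({
    "category": ["tech", "sport", "tech", "business"],
    "text": [
        "net service survey",
        "england clash",
        "net service survey",
        "jet airways shares",
    ],
})

# duplicated() flags every repeat after the first occurrence of a text.
n_duplicates = df.duplicated(subset="text").sum()
deduplicated = df.drop_duplicates(subset="text").reset_index(drop=True)

print("duplicated rows:", n_duplicates)
print(deduplicated)
```

Whether to drop duplicates is a judgment call that depends on the project; the point is to check for them before splitting the data.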


In [48]:
bbc['category'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

In [49]:
bbc.columns

Index(['category', 'text'], dtype='object')

In [50]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Hold out 20% of the articles for evaluation (numpy and pandas are already imported above).
x_train, x_test, y_train, y_test = train_test_split(bbc["text"], bbc["category"], test_size=0.2)
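
The split above is random, so the numbers change from run to run and the class proportions can drift slightly between the two parts. One possible variation, sketched here under assumptions, passes `stratify` and `random_state` to `train_test_split`. The class counts are copied from the `value_counts()` output; the texts are placeholders, and the seed `42` is arbitrary.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Class counts copied from value_counts(); the texts are placeholders.
counts = {"sport": 511, "business": 510, "politics": 417,
          "tech": 401, "entertainment": 386}
labels = [name for name, n in counts.items() for _ in range(n)]
texts = [f"article {i}" for i in range(len(labels))]

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Each class keeps roughly the same share in both splits.
train_counts, test_counts = Counter(y_train), Counter(y_test)
for name in counts:
    train_share = train_counts[name] / len(y_train)
    test_share = test_counts[name] / len(y_test)
    print(f"{name:14s} train={train_share:.3f} test={test_share:.3f}")
```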

In [51]:
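# Text: limit the vocabulary to the most frequent words; unseen words map to "<OOV>".
text_tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
text_tokenizer.fit_on_texts(np.array(x_train))
text_word_index = text_tokenizer.word_index
text_sequences = text_tokenizer.texts_to_sequences(np.array(x_train))
# Pad (or cut) every article to exactly 150 token ids.
padded_text_sequences = pad_sequences(text_sequences, padding='post', maxlen=150)

# Labels: each category name becomes an integer id starting at 1.
category_tokenizer = Tokenizer()
category_tokenizer.fit_on_texts(np.array(y_train))
category_word_index = category_tokenizer.word_index
category_sequences = np.array(category_tokenizer.texts_to_sequences(np.array(y_train)))
# reshape returns a new array, so assign it back to flatten (n, 1) to (n,).
category_sequences = category_sequences.reshape(-1)
category_sequences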

array([1, 2, 3, ..., 3, 3, 1])
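
To make the two preprocessing steps above concrete, here is a plain-Python sketch of what a word index and post-padding do. It is a simplified imitation written for this explanation, not the Keras implementation: `build_word_index`, `texts_to_sequences`, and `pad_post` are hypothetical helpers, and words are split on whitespace only.

```python
from collections import Counter

OOV = "<OOV>"

def build_word_index(texts, oov_token=OOV):
    """Map words to integer ids: 1 is the OOV token, then words by frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {oov_token: 1}
    for word, _ in counts.most_common():
        index[word] = len(index) + 1
    return index

def texts_to_sequences(texts, word_index, num_words=None, oov_token=OOV):
    """Replace each word by its id; unknown or too-rare words get the OOV id."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = [word_index.get(word, oov_id) for word in text.lower().split()]
        if num_words is not None:
            ids = [i if i < num_words else oov_id for i in ids]
        sequences.append(ids)
    return sequences

def pad_post(sequences, maxlen):
    """Pad with zeros at the end; overlong sequences keep their last maxlen ids."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

train_texts = ["the match was a great match", "the market fell"]
word_index = build_word_index(train_texts)
sequences = texts_to_sequences(train_texts + ["the unseen word"], word_index)

print(word_index)
print(pad_post(sequences, maxlen=4))
# [[4, 5, 6, 3], [2, 7, 8, 0], [2, 1, 1, 0]]
```

In Keras, `pad_sequences` truncates from the front by default (`truncating='pre'`), which is what `seq[-maxlen:]` mimics. Passing `truncating='post'` would keep the first 150 tokens of each article instead.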

In [52]:
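# Reuse the tokenizers fitted on the training data; do not refit them on held-out data.
validation_text_sequences = text_tokenizer.texts_to_sequences(np.array(x_test))
validation_text_padded = pad_sequences(validation_text_sequences, padding='post', maxlen=150)

validation_category_sequences = np.array(category_tokenizer.texts_to_sequences(np.array(y_test)))
# Assign the flattened result so the labels have shape (n,) rather than (n, 1).
validation_category_sequences = validation_category_sequences.reshape(-1)
validation_category_sequences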

array([2, 1, 3, 3, 4, 5, 3, 1, 2, 4, 1, 1, 1, 3, 5, 2, 2, 3, 2, 1, 2, 1,
       3, 2, 2, 4, 4, 3, 1, 2, 4, 5, 3, 1, 3, 2, 1, 2, 2, 2, 2, 2, 2, 1,
       4, 4, 4, 5, 2, 2, 1, 2, 5, 2, 5, 4, 4, 2, 5, 2, 3, 4, 1, 2, 3, 1,
       5, 2, 3, 2, 5, 2, 3, 3, 5, 1, 1, 5, 3, 4, 5, 1, 1, 2, 2, 3, 4, 4,
       4, 2, 2, 3, 5, 1, 3, 5, 3, 2, 2, 1, 5, 1, 3, 1, 5, 2, 3, 2, 1, 2,
       4, 4, 5, 3, 1, 5, 4, 1, 4, 2, 1, 5, 3, 2, 1, 1, 4, 2, 5, 1, 5, 5,
       5, 5, 2, 3, 2, 3, 5, 2, 1, 1, 1, 1, 5, 2, 3, 2, 2, 4, 5, 5, 4, 1,
       2, 4, 5, 4, 3, 3, 2, 1, 3, 3, 5, 5, 5, 5, 4, 1, 3, 1, 2, 3, 2, 1,
       1, 2, 4, 2, 4, 2, 5, 3, 5, 2, 3, 3, 3, 1, 1, 5, 3, 5, 2, 4, 5, 4,
       3, 4, 3, 3, 5, 4, 2, 3, 2, 3, 4, 5, 1, 1, 2, 3, 3, 2, 2, 5, 4, 5,
       2, 3, 5, 5, 2, 4, 3, 3, 2, 3, 2, 4, 1, 1, 2, 5, 1, 3, 4, 4, 2, 3,
       5, 2, 2, 3, 4, 2, 2, 2, 5, 3, 3, 1, 3, 2, 1, 4, 3, 5, 4, 1, 5, 3,
       3, 1, 3, 4, 1, 5, 2, 5, 4, 3, 5, 1, 4, 3, 4, 4, 4, 2, 1, 3, 4, 1,
       1, 4, 3, 1, 5, 2, 2, 3, 1, 2, 2, 1, 2, 2, 3,

In [53]:
model = tf.keras.Sequential([
    # 10000 matches the tokenizer's num_words; 150 matches the padded length.
    tf.keras.layers.Embedding(10000, 200, input_length=150),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    # 6 units for 5 categories: label ids start at 1, so index 0 is never used.
    tf.keras.layers.Dense(6, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 150, 200)          2000000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 200)               0         
_________________________________________________________________
dense (Dense)                (None, 24)                4824      
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 150       
Total params: 2,004,974
Trainable params: 2,004,974
Non-trainable params: 0
_________________________________________________________________
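
The parameter counts in the summary above can be checked by hand. The short calculation below assumes only the layer sizes given in the model definition: an embedding layer stores one vector per token id, and a dense layer stores one weight per input-output pair plus one bias per output.

```python
vocab_size, embedding_dim = 10000, 200   # Embedding(10000, 200)
hidden_units, output_units = 24, 6       # Dense(24), Dense(6)

embedding_params = vocab_size * embedding_dim                # one vector per token id
dense_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * output_units + output_units   # weights + biases

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
# 2000000 4824 150 2004974
```

Almost all of the parameters sit in the embedding layer; `GlobalAveragePooling1D` only averages over the 150 positions and has nothing to learn.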


In [54]:
# Note: the held-out split is also used as validation data during training.
history = model.fit(padded_text_sequences, category_sequences, epochs=20,
                    validation_data=(validation_text_padded, validation_category_sequences),
                    verbose=2)

Epoch 1/20
56/56 - 2s - loss: 1.6857 - accuracy: 0.5129 - val_loss: 1.5255 - val_accuracy: 0.4944
Epoch 2/20
56/56 - 1s - loss: 1.2929 - accuracy: 0.6129 - val_loss: 1.0652 - val_accuracy: 0.7258
Epoch 3/20
56/56 - 1s - loss: 0.8060 - accuracy: 0.8674 - val_loss: 0.6698 - val_accuracy: 0.9124
Epoch 4/20
56/56 - 1s - loss: 0.4447 - accuracy: 0.9545 - val_loss: 0.4318 - val_accuracy: 0.9393
Epoch 5/20
56/56 - 1s - loss: 0.2412 - accuracy: 0.9798 - val_loss: 0.3121 - val_accuracy: 0.9461
Epoch 6/20
56/56 - 1s - loss: 0.1377 - accuracy: 0.9893 - val_loss: 0.2493 - val_accuracy: 0.9393
Epoch 7/20
56/56 - 1s - loss: 0.0836 - accuracy: 0.9961 - val_loss: 0.2174 - val_accuracy: 0.9438
Epoch 8/20
56/56 - 1s - loss: 0.0531 - accuracy: 0.9989 - val_loss: 0.1956 - val_accuracy: 0.9461
Epoch 9/20
56/56 - 1s - loss: 0.0360 - accuracy: 1.0000 - val_loss: 0.1811 - val_accuracy: 0.9483
Epoch 10/20
56/56 - 2s - loss: 0.0256 - accuracy: 1.0000 - val_loss: 0.1736 - val_accuracy: 0.9461
Epoch 11/20
56/56 -

In [67]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Pick the most probable label id for each article.
predictions = model.predict(validation_text_padded)
predictions = np.argmax(predictions, axis=1)
print('accuracy:', accuracy_score(validation_category_sequences, predictions))
print('confusion matrix:\n', confusion_matrix(validation_category_sequences, predictions))
print('classification report:\n', classification_report(validation_category_sequences, predictions))

accuracy: 0.9483146067415731
confusion matrix:
 [[ 99   0   0   0   0]
 [  1 102   2   1   2]
 [  1   3  82   2   2]
 [  0   0   2  71   3]
 [  0   1   1   2  68]]
classification report:
               precision    recall  f1-score   support

           1       0.98      1.00      0.99        99
           2       0.96      0.94      0.95       108
           3       0.94      0.91      0.93        90
           4       0.93      0.93      0.93        76
           5       0.91      0.94      0.93        72

    accuracy                           0.95       445
   macro avg       0.95      0.95      0.95       445
weighted avg       0.95      0.95      0.95       445
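
The numbers in the classification report follow directly from the confusion matrix. As a sanity check, the sketch below recomputes accuracy, precision, and recall in plain Python from the matrix printed above. Rows are true labels and columns are predicted labels, which is the convention `confusion_matrix` uses.

```python
# Confusion matrix copied from the output above: rows are true labels 1-5,
# columns are predicted labels 1-5.
cm = [
    [99, 0, 0, 0, 0],
    [1, 102, 2, 1, 2],
    [1, 3, 82, 2, 2],
    [0, 0, 2, 71, 3],
    [0, 1, 1, 2, 68],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

column_sums = [sum(col) for col in zip(*cm)]  # how often each label was predicted
row_sums = [sum(row) for row in cm]           # how often each label truly occurs

precision = [cm[i][i] / column_sums[i] for i in range(len(cm))]
recall = [cm[i][i] / row_sums[i] for i in range(len(cm))]

print(f"accuracy: {accuracy:.4f}")
for label, (p, r) in enumerate(zip(precision, recall), start=1):
    f1 = 2 * p * r / (p + r)
    print(f"label {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The row sums (99, 108, 90, 76, 72) are the `support` column of the report, and the diagonal holds the 422 correctly classified articles out of 445.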

