# Organize ML projects with Scikit-Learn

Machine learning is powerful, but people often overestimate it, as if applying it to a project would solve every problem. In reality, it is not that simple: to be effective, you need to organize the work carefully. In this notebook, we will walk through the practical aspects of an ML project. To see the big picture, let's start with the checklist below. It should work reasonably well for most ML projects, but make sure to adapt it to your needs:

1. **Define the scope of work and objective**
    * How will your solution be used?
    * How should performance be measured? Are there any constraints?
    * How would the problem be solved manually?
    * List the assumptions you are making, and verify them where possible.
    
    
2. **Get the data**
    * Document where you can get the data
    * Store the data in a workspace you can easily access
    * Convert the data to a format you can easily manipulate
    * Check the overview (size, type, sample, description, statistics)
    * Clean the data
    
    
3. **EDA & Data transformation**
    * Study each attribute and its characteristics (missing values, type of distribution, usefulness)
    * Visualize the data
    * Study the correlations between attributes
    * Select, engineer, and scale features
    * Write functions for all data transformations
    
    
4. **Train models**
    * Automate as much as possible
    * Train promising models quickly using standard parameters. Measure and compare their performance
    * Analyze the errors the models make
    * Shortlist the top three to five most promising models, preferring models that make different types of errors.


5. **Fine-tuning**
    * Treat data transformation choices as hyperparameters, especially when you are not sure about them (e.g., whether to replace missing values with zeros or with the median value).
    * Unless there are very few hyperparameter values to explore, prefer random search over grid search.
    * Try ensemble methods.
    * Test your final model on the test set to estimate the generalization error. Don't tweak your model afterwards, or you will start overfitting the test set.
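
The first two fine-tuning points can be sketched together. The example below is only an illustration under assumptions: the tiny corpus, its labels, and the parameter ranges are made up, and `LogisticRegression` stands in for whichever model you shortlisted. It puts the text transformation and the classifier in one `Pipeline`, so that `RandomizedSearchCV` samples preprocessing choices (the `vect__` entries) together with model hyperparameters (the `clf__` entry).

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus and labels, only to keep the sketch self-contained.
texts = [
    "shares rise as profits beat forecasts",
    "bank cuts interest rates again",
    "market falls on weak earnings",
    "oil prices lift company profits",
    "striker scores twice in cup final",
    "coach praises team after win",
    "injury rules captain out of match",
    "club signs new goalkeeper",
]
labels = ["business"] * 4 + ["sport"] * 4

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Transformation choices and model hyperparameters share one search space.
# The ranges are illustrative assumptions, not recommendations.
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__use_idf": [True, False],
    "vect__stop_words": [None, "english"],
    "clf__C": loguniform(1e-2, 1e2),
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5, cv=2, random_state=0
)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On a real dataset you would raise `n_iter` and `cv`. The search uses cross-validation on the training data only, so the test set stays untouched until the final evaluation.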

## Example: Articles categorization

### Objectives

Build a model that predicts the category of an article from its text.

### Get Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")

In [2]:
bbc = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/bbc-text.csv')

In [None]:
bbc.sample(5)

|      | category | text |
|------|----------|------|
| 1246 | business | sbc plans post-takeover job cuts us phone comp... |
| 755 | business | us to rule on yukos refuge call yukos has said... |
| 2026 | politics | job cuts  false economy   - tuc plans to shed ... |
| 179 | business | arsenal  may seek full share listing  arsenal ... |
| 633 | entertainment | gallery unveils interactive tree a christmas t... |


In [None]:
bbc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [None]:
# Your code here

In [3]:
bbc['text'][0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

In [4]:
# Keep the first article as a one-document corpus for the vectorizer demos
bbc_text = [
    'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high-'
]

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
bag = count.fit_transform(bbc_text)

In [6]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
count.get_feature_names_out().tolist()

['about',
 'according',
 'allow',
 'allows',
 'also',
 'an',
 'and',
 'annual',
 'are',
 'at',
 'be',
 'been',
 'being',
 'boxes',
 'broadband',
 'built',
 'cable',
 'ces',
 'companies',
 'consumer',
 'content',
 'definition',
 'delivered',
 'devices',
 'different',
 'digital',
 'discuss',
 'dvr',
 'electronics',
 'essentially',
 'expert',
 'favourite',
 'five',
 'for',
 'forward',
 'front',
 'future',
 'gathered',
 'hands',
 'has',
 'high',
 'home',
 'how',
 'impact',
 'in',
 'into',
 'is',
 'las',
 'leading',
 'like',
 'living',
 'more',
 'most',
 'moving',
 'much',
 'networks',
 'new',
 'of',
 'one',
 'other',
 'our',
 'panel',
 'pastimes',
 'pause',
 'people',
 'personal',
 'personalised',
 'plasma',
 'play',
 'portable',
 'programmes',
 'providers',
 'pvr',
 'radically',
 'record',
 'recorders',
 'room',
 'rooms',
 'satellite',
 'service',
 'set',
 'show',
 'sky',
 'store',
 'system',
 'systems',
 'talked',
 'technologies',
 'technology',
 'telecoms',
 'that',
 'the',
 'theatre',


In [7]:
bag.toarray()

array([[ 1,  1,  1,  1,  1,  1,  8,  1,  1,  1,  2,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  1,  1,  4,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  4,  2,  1,  1,  1,  1,  1,
         2,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  2,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1, 10,  1,  2,  2,  1,
         1,  1,  6,  1,  1,  4,  1,  1,  2,  1,  1,  2,  2,  1,  1,  1,
         1,  1,  3,  1,  2,  1]])
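
To make the link between the feature names and the count array more concrete, here is a minimal standard-library sketch of what the default `CountVectorizer` does to one document. It assumes the default settings (lowercasing and the token pattern `\b\w\w+\b`, which keeps only tokens of two or more word characters), and the sample sentence and the helper name `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count tokens roughly the way CountVectorizer's defaults do:
    lowercase the text, then keep runs of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

# Made-up sentence for illustration
sample = "the way people watch tv will be radically different; the tv is a tv"
counts = bag_of_words(sample)

# CountVectorizer orders its columns alphabetically by token
for word in sorted(counts):
    print(word, counts[word])
```

Single-character tokens such as `a` are dropped by the token pattern, which is why they never show up among the feature names.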

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
# Feed the tf-idf Vectorizer with bbc_text using fit_transform()
tfidf_vec = tfidf.fit_transform(bbc_text)

np.set_printoptions(precision=2)
# To print array in one line
np.set_printoptions(linewidth=np.inf)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(tfidf.get_feature_names_out().tolist())
print(tfidf_vec.toarray())

['about', 'according', 'allow', 'allows', 'also', 'an', 'and', 'annual', 'are', 'at', 'be', 'been', 'being', 'boxes', 'broadband', 'built', 'cable', 'ces', 'companies', 'consumer', 'content', 'definition', 'delivered', 'devices', 'different', 'digital', 'discuss', 'dvr', 'electronics', 'essentially', 'expert', 'favourite', 'five', 'for', 'forward', 'front', 'future', 'gathered', 'hands', 'has', 'high', 'home', 'how', 'impact', 'in', 'into', 'is', 'las', 'leading', 'like', 'living', 'more', 'most', 'moving', 'much', 'networks', 'new', 'of', 'one', 'other', 'our', 'panel', 'pastimes', 'pause', 'people', 'personal', 'personalised', 'plasma', 'play', 'portable', 'programmes', 'providers', 'pvr', 'radically', 'record', 'recorders', 'room', 'rooms', 'satellite', 'service', 'set', 'show', 'sky', 'store', 'system', 'systems', 'talked', 'technologies', 'technology', 'telecoms', 'that', 'the', 'theatre', 'these', 'they', 'through', 'time', 'tivo', 'to', 'top', 'trend', 'tv', 'tvs', 'uk', 'us', '
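
With a single document, every term has the same inverse document frequency, so the tf-idf vector is just the L2-normalized count vector. The effect of idf only shows up with several documents. The sketch below recomputes the weights by hand for a made-up three-document corpus, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), for which the weight of a term is its count times `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. Tokenization is simplified to `str.split()`, which is an assumption made for brevity.

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = [
    "the tv show was good",
    "the match was good",
    "the tv market fell",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[0])
# Print terms from the highest weight to the lowest
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.2f}")
```

Terms that occur in every document, such as `the`, get the lowest weight, while terms unique to one document, such as `show`, get the highest.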

In [9]:
from collections import Counter

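vocab = Counter()
# Use a separate loop variable so the bbc_text list defined earlier is not overwritten.
# Splitting on a single space keeps empty strings and punctuation as tokens.
for article in bbc.text:
    for word in article.split(' '):
        vocab[word] += 1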

vocab.most_common(20)

[('', 65553),
 ('the', 52567),
 ('to', 24955),
 ('of', 19947),
 ('and', 18561),
 ('a', 18251),
 ('in', 17570),
 ('s', 9007),
 ('for', 8884),
 ('is', 8515),
 ('that', 8135),
 ('it', 7584),
 ('on', 7460),
 ('was', 6016),
 ('he', 5933),
 ('be', 5765),
 ('with', 5313),
 ('said', 5072),
 ('as', 4976),
 ('has', 4952)]

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

vocab_reduced = Counter()
# Keep only the words of vocab that are not in stop_words,
# and store them with their counts in vocab_reduced
for w, c in vocab.items():
    if w not in stop_words:
        vocab_reduced[w] = c

vocab_reduced.most_common(20)

[('', 65553),
 ('said', 5072),
 ('-', 3195),
 ('mr', 2992),
 ('would', 2574),
 ('also', 2154),
 ('people', 1970),
 ('new', 1957),
 ('us', 1786),
 ('one', 1705),
 ('could', 1509),
 ('said.', 1499),
 ('year', 1396),
 ('last', 1380),
 ('first', 1277),
 ('.', 1171),
 ('two', 1161),
 ('government', 1085),
 ('world', 1076),
 ('uk', 993)]

In [12]:
import re 

def preprocessor(text):
    """Return a cleaned version of text."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace each run of non-word characters with a space, convert to lower case,
    # and append the emoticons, removing the nose character for standardization
    text = re.sub(r'\W+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')

    return text

# Create some random texts for testing the function preprocessor()
print(preprocessor('I like it :), |||<><>'))

i like it  :)


In [13]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Split a text into list of words
def tokenizer(text):
    return text.split()

# Split a text into a list of words and stem each word with the Porter stemmer
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

# Testing
print(tokenizer('Hi there, I am loving this, like with a lot of love'))
print(tokenizer_porter('Hi there, I am loving this, like with a lot of love'))

['Hi', 'there,', 'I', 'am', 'loving', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']
['Hi', 'there,', 'I', 'am', 'love', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']


### Train Model

In [14]:
from sklearn.model_selection import train_test_split

X = bbc['text']
y = bbc['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop_words,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])
clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7fec645efd08>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_porter at 0x7fec645b3268>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
         

In [16]:
from sklearn.metrics import accuracy_score

predictions = clf.predict(X_test)
print('accuracy:',accuracy_score(y_test,predictions))


accuracy: 0.9865168539325843
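
A single accuracy number does not show which categories get confused with each other, which is what the "analyze the errors" step of the checklist asks for. The sketch below is one way to look at this, using `confusion_matrix` and `classification_report` from scikit-learn. The two label lists are hypothetical stand-ins for `y_test` and `predictions`, made up so that the example runs on its own.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels standing in for y_test and clf.predict(X_test)
y_true = ["business", "business", "politics", "sport",
          "sport", "tech", "politics", "tech"]
y_pred = ["business", "politics", "politics", "sport",
          "sport", "tech", "business", "tech"]

labels = sorted(set(y_true))

# Rows are true categories, columns are predicted categories
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(labels)
print(cm)

# Per-category precision, recall, and F1 score
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

In the notebook itself, you would pass `y_test` and `predictions` instead of the made-up lists. Off-diagonal cells of the matrix point at the pairs of categories the model mixes up.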
