### Online/Incremental Machine Learning Tools
+ Offline (batch) ML: we have a fixed batch of data and fit the model to all of it at once before making predictions.
+ Online ML: used when we have streaming data and want to process one sample at a time.
    - real-time data arrives one observation at a time
    - we update our estimates as each new data point arrives rather than waiting until "the end" (which may never occur)
+ Incremental learning is a method of machine learning in which incoming data is continuously used to extend the existing model's knowledge, i.e. to further train the model.
+ It is a dynamic form of supervised and unsupervised learning that applies when training data becomes available gradually over time or is too large to fit in system memory.
+ The aim
    - for the model to adapt to new data without forgetting its existing knowledge.
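
To make "updating our estimates as each data point arrives" concrete, here is a minimal sketch of a running mean. The class name `RunningMean` and the sample values are made up for illustration; the point is only that each update uses the new observation and the current estimate, never the full history.

```python
class RunningMean:
    """Mean of a stream, updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Move the estimate toward x by a step of 1/n; no past data is stored.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self


stream = [4.0, 8.0, 6.0, 2.0]  # stand-in for data arriving over time
running = RunningMean()
estimates = []
for x in stream:
    running.update(x)
    estimates.append(running.mean)

print(estimates)  # [4.0, 6.0, 6.0, 5.0]
```

The estimate is usable after every observation, which is what separates this from the batch approach of waiting for all the data.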



#### Tools for Incremental or Online ML
+ River (the merger of the two libraries below)
    - Creme
    - Scikit-Multiflow
+ MOA
+ SAMOA
+ streamDM (Spark Streaming)

#### Usefulness
+ For online ML
+ For ML on streaming data

#### Challenges
+ Difficult to manage
    - the model is highly adaptive, so it keeps changing as new data arrives
+ More research grade than production grade



#### Installation
+ pip install river
+ pip install creme
+ pip install scikit-multiflow

### Incremental/Online Machine Learning with River

In [39]:
import pandas as pd

In [40]:
# Load ML Pkgs
import river

In [41]:
# List the submodules and attributes River exposes
dir(river)

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'anomaly',
 'base',
 'cluster',
 'compat',
 'compose',
 'datasets',
 'drift',
 'dummy',
 'ensemble',
 'evaluate',
 'expert',
 'facto',
 'feature_extraction',
 'feature_selection',
 'imblearn',
 'linear_model',
 'meta',
 'metrics',
 'multiclass',
 'multioutput',
 'naive_bayes',
 'neighbors',
 'neural_net',
 'optim',
 'preprocessing',
 'proba',
 'reco',
 'stats',
 'stream',
 'synth',
 'time_series',
 'tree',
 'utils']

In [42]:
# Load Estimators
from river.linear_model import LogisticRegression
from river.naive_bayes import MultinomialNB
from river.feature_extraction import BagOfWords,TFIDF


In [43]:
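def get_all_attributes(package):
    """Tabulate the attribute names of every top-level submodule of a package."""
    # Dunder attributes of the package itself are not submodules, so skip them.
    skip = ["__all__", "__builtins__", "__cached__", "__doc__", "__file__", "__loader__", "__name__", "__package__", "__path__", "__pdoc__", "__spec__", "__version__"]
    subpackages = []
    submodules = []
    for i in dir(package):
        if i not in skip:
            subpackages.append(i)
            # getattr avoids eval and works for any package, not only river
            submodules.append(dir(getattr(package, i)))
    # One row per submodule; transpose so that each submodule becomes a column
    df = pd.DataFrame(submodules).T
    df.columns = subpackages
    # Keep only the rows where every submodule still has an entry
    res_df = df.dropna()
    return res_df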
           
    

In [44]:
river_df = get_all_attributes(river)

In [45]:
river_df

Unnamed: 0,anomaly,base,cluster,compat,compose,datasets,drift,dummy,ensemble,evaluate,...,optim,preprocessing,proba,reco,stats,stream,synth,time_series,tree,utils
0,HalfSpaceTrees,AnomalyDetector,CluStream,River2SKLClassifier,Discard,AirlinePassengers,ADWIN,NoChangeClassifier,ADWINBaggingClassifier,Track,...,AMSGrad,AdaptiveStandardScaler,Gaussian,Baseline,AbsMax,Cache,Agrawal,Detrender,ExtremelyFastDecisionTreeClassifier,Histogram
1,__all__,Base,DBSTREAM,River2SKLClusterer,FuncTransformer,Bananas,DDM,PriorClassifier,AdaBoostClassifier,__all__,...,AdaBound,Binarizer,Multinomial,BiasedMF,AutoCorr,__all__,AnomalySine,GroupDetrender,HoeffdingAdaptiveTreeClassifier,SDFT
2,__builtins__,Classifier,DenStream,River2SKLRegressor,Grouper,Bikes,EDDM,StatisticRegressor,AdaptiveRandomForestClassifier,__builtins__,...,AdaDelta,FeatureHasher,__all__,FunkMF,BayesianMean,__builtins__,ConceptDriftStream,SNARIMAX,HoeffdingAdaptiveTreeRegressor,Skyline
3,__cached__,Clusterer,KMeans,River2SKLTransformer,Pipeline,ChickWeights,HDDM_A,__all__,AdaptiveRandomForestRegressor,__cached__,...,AdaGrad,LDA,__builtins__,RandomNormal,Bivariate,__cached__,Friedman,__all__,HoeffdingTreeClassifier,SortedWindow
4,__doc__,DriftDetector,STREAMKMeans,SKL2RiverClassifier,Renamer,CreditCard,HDDM_W,__builtins__,BaggingClassifier,__doc__,...,AdaMax,MaxAbsScaler,__cached__,__all__,Count,__doc__,FriedmanDrift,__builtins__,HoeffdingTreeRegressor,VectorDict
5,__file__,EnsembleMixin,__all__,SKL2RiverRegressor,Select,Elec2,KSWIN,__cached__,BaggingRegressor,__file__,...,Adam,MinMaxScaler,__doc__,__builtins__,Cov,__file__,Hyperplane,__cached__,LabelCombinationHoeffdingTreeClassifier,Window
6,__loader__,Estimator,__builtins__,__all__,SelectType,HTTP,PageHinkley,__doc__,LeveragingBaggingClassifier,__loader__,...,Averager,Normalizer,__file__,__cached__,EWMean,__loader__,LED,__doc__,__all__,__all__
7,__name__,MiniBatchClassifier,__cached__,__annotations__,TransformerUnion,Higgs,__all__,__file__,SRPClassifier,__name__,...,FTRLProximal,OneHotEncoder,__loader__,__doc__,EWVar,__name__,LEDDrift,__file__,__builtins__,__builtins__
8,__package__,MiniBatchRegressor,__doc__,__builtins__,__all__,ImageSegments,__builtins__,__loader__,__all__,__package__,...,Momentum,PreviousImputer,__name__,__file__,Entropy,__package__,Logical,__loader__,__cached__,__cached__
9,__path__,MultiOutputMixin,__file__,__cached__,__builtins__,Insects,__cached__,__name__,__builtins__,__path__,...,Nadam,RobustScaler,__package__,__loader__,IQR,__path__,Mixed,__name__,__doc__,__doc__
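
The table above still mixes dunder names such as `__all__` with the real estimators. As a variation, here is a minimal sketch that keeps only public names and returns a plain dictionary. The helper name `public_names` is hypothetical, and the standard-library `json` package stands in for `river` so the snippet runs without extra installs.

```python
import json
import types


def public_names(package):
    """Return {submodule: [public names]} for a package's loaded submodules."""
    result = {}
    for name in dir(package):
        attr = getattr(package, name)
        # Skip private/dunder attributes and anything that is not a module.
        if name.startswith("_") or not isinstance(attr, types.ModuleType):
            continue
        result[name] = [n for n in dir(attr) if not n.startswith("_")]
    return result


names = public_names(json)  # passing river instead should work the same way
print(sorted(names))
print("JSONDecoder" in names["decoder"])
```

A dictionary avoids the `dropna()` step, which truncates every column to the length of the shortest submodule.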


In [None]:
#### Data Requirements
+ a list of tuples
+ a dictionary
+ a CSV file or DataFrame
    - streamed as records: a list of tuples or a dictionary per sample
    - stream.iter_csv
    - stream.iter_pandas
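
Whatever the source, the model ends up consuming one `(features, label)` pair at a time. The following is a minimal stdlib sketch of that idea, not River's own implementation: `iter_records` is a hypothetical helper, and the in-memory CSV text is made up from two rows of the data used below.

```python
import csv
import io


def iter_records(rows, target):
    """Yield (features, label) pairs from dict rows, one at a time."""
    for row in rows:
        features = {k: v for k, v in row.items() if k != target}
        yield features, row[target]


# In-memory stand-in for a CSV file on disk.
csv_text = (
    "text,label\n"
    "my unit test failed,software\n"
    "i need a new power supply,hardware\n"
)
reader = csv.DictReader(io.StringIO(csv_text))

records = list(iter_records(reader, target="label"))
for x, y in records:
    print(x, y)
```

Because `iter_records` is a generator, a large file is never loaded into memory in one go; `list(...)` is used here only to show the result.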

In [46]:
### Data: Predict whether a text is hardware or software related
data = [("my unit test failed","software"),
("tried the program, but it was buggy","software"),
("i need a new power supply","hardware"),
("the drive has a 2TB capacity","hardware"),
("unit-tests","software"),
("program","software"),
("power supply","hardware"),
("drive","hardware"),
("it needs more memory","hardware"),
("check the API","software"),
("design the API","software"),
("they need more CPU","hardware"),
("code","software"),
("i found some bugs in the code","software"),
("i swapped the memory","hardware"),
("i tested the code","software")]

test_data = [('he writes code daily','software'), 
             ('the disk is faulty','hardware'), 
             ("refactor the code","software"),
             ('no empty space on the drive','hardware')]

### Text classification
+ vectorize the text
    - CountVectorizer/BagOfWords
    - TFIDF
+ build the model on the go
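
To show what "vectorize the text" means for a single sample, here is a rough sketch of a bag-of-words transform in plain Python. The `\w+` tokenizer and the function name `bag_of_words` are assumptions for illustration and may not match the tokenization River's `BagOfWords` performs.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")  # assumed tokenizer: runs of word characters


def bag_of_words(text, lowercase=True):
    """Map a raw string to a {token: count} dictionary."""
    if lowercase:
        text = text.lower()
    return dict(Counter(TOKEN.findall(text)))


features = bag_of_words("Check the API, then design the API")
print(features)
# {'check': 1, 'the': 2, 'api': 2, 'then': 1, 'design': 1}
```

The output is a sparse dictionary rather than a fixed-width vector, so new words can appear at any point in the stream without refitting a vocabulary.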

In [47]:
#  Make a Pipeline
from river.compose import Pipeline

In [48]:
pipe_nb = Pipeline(('vectorizer',BagOfWords(lowercase=True)),('nb',MultinomialNB()))

In [49]:
### Visualize the Pipeline
pipe_nb

In [50]:
# Get steps
pipe_nb.steps

OrderedDict([('vectorizer',
              BagOfWords (
                on=None
                strip_accents=True
                lowercase=True
                preprocessor=None
                tokenizer=<built-in method findall of re.Pattern object at 0x7fa35529de00>
                ngram_range=(1, 1)
              )),
             ('nb',
              MultinomialNB (
                alpha=1.
              ))])

In [53]:
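# Fit on our data
# Learn one sample at a time
# learn_one (River) / fit_one (Creme)
# predict_one
for text,label in data:
    # learn_one updates the pipeline in place, so its return value is not
    # needed (newer River releases return None from learn_one)
    pipe_nb.learn_one(text,label)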
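
For intuition about what a call like `learn_one` does for naive Bayes, here is a simplified pure-Python sketch. `TinyMultinomialNB` is a hypothetical class written for this note, not River's implementation; it ignores tokens it has never seen, which is one of several possible choices.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial naive Bayes whose counts are updated one sample at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # additive smoothing
        self.class_counts = Counter()             # samples seen per class
        self.token_counts = defaultdict(Counter)  # token totals per class
        self.vocabulary = set()

    def learn_one(self, x, y):
        # x is a {token: count} mapping; only counters change, nothing is refit.
        self.class_counts[y] += 1
        for token, count in x.items():
            self.token_counts[y][token] += count
            self.vocabulary.add(token)

    def predict_proba_one(self, x):
        if not self.class_counts:
            return {}
        total = sum(self.class_counts.values())
        scores = {}
        for label, seen in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocabulary)
            score = math.log(seen / total)  # log prior
            for token, count in x.items():
                if token in self.vocabulary:  # assumption: skip unseen tokens
                    score += count * math.log((counts[token] + self.alpha) / denom)
            scores[label] = score
        # Turn log scores into probabilities that sum to 1.
        top = max(scores.values())
        exp = {k: math.exp(v - top) for k, v in scores.items()}
        norm = sum(exp.values())
        return {k: v / norm for k, v in exp.items()}

    def predict_one(self, x):
        proba = self.predict_proba_one(x)
        return max(proba, key=proba.get) if proba else None


def tokens(text):
    return Counter(text.lower().split())


model = TinyMultinomialNB()
samples = [("unit test failed", "software"), ("new power supply", "hardware"),
           ("check the code", "software"), ("swapped the drive", "hardware")]
for text, label in samples:
    model.learn_one(tokens(text), label)

proba = model.predict_proba_one(tokens("the code failed"))
prediction = model.predict_one(tokens("the code failed"))
print(prediction, proba)
```

Each `learn_one` call costs time proportional to the length of one sample, which is why the model can keep up with a stream.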

In [54]:
pipe_nb

In [55]:
# Make Prediction
pipe_nb.predict_one("I built an API")

'software'

In [59]:
# Class probabilities for the same text
pipe_nb.predict_proba_one("I built an API")

{'software': 0.732646964375691, 'hardware': 0.2673530356243093}

In [57]:
# Another example
pipe_nb.predict_one("the hard drive  in the computer is damaged")

'software'

In [58]:
# Class probabilities for the second example
pipe_nb.predict_proba_one("the hard drive  in the computer is damaged")

{'software': 0.5794679370463756, 'hardware': 0.4205320629536237}

In [None]:
### Evaluation & Classification Report
+ Accuracy
+ Precision, Recall, and F1 on the predictions

In [60]:
test_data

[('he writes code daily', 'software'),
 ('the disk is faulty', 'hardware'),
 ('refactor the code', 'software'),
 ('no empty space on the drive', 'hardware')]

In [64]:
y_pred = []
for x,y in test_data:
    print(x)
    res = pipe_nb.predict_one(x)
    y_pred.append(res)

he writes code daily
the disk is faulty
refactor the code
no empty space on the drive


In [65]:
# Classification
from river.metrics import ClassificationReport

In [66]:
report = ClassificationReport()

In [67]:
# Collect the true labels (y_test) and the predictions (y_pred)
y_pred = []
y_test = []
for x,y in test_data:
    print(x)
    res = pipe_nb.predict_one(x)
    y_pred.append(res)
    y_test.append(y)
    

he writes code daily
the disk is faulty
refactor the code
no empty space on the drive


In [69]:
print(y_test)
print(y_pred)

['software', 'hardware', 'software', 'hardware']
['software', 'software', 'software', 'hardware']


In [70]:
for yt,yp in zip(y_test,y_pred):
    # update changes the report in place, so no reassignment is needed
    report.update(yt,yp)

In [71]:
report

           Precision   Recall   F1      Support  
                                                 
hardware       1.000    0.500   0.667         2  
software       0.667    1.000   0.800         2  
                                                 
   Macro       0.833    0.750   0.733            
   Micro       0.750    0.750   0.750            
Weighted       0.833    0.750   0.733            

                 75.0% accuracy                  
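
The per-class figures in the report can be checked by hand from the `y_test` and `y_pred` lists printed earlier. This is a small sketch of that arithmetic; `per_class_scores` is a hypothetical helper, not part of River.

```python
y_test = ["software", "hardware", "software", "hardware"]
y_pred = ["software", "software", "software", "hardware"]


def per_class_scores(y_true, y_hat, label):
    """Precision, recall, and F1 for one class, from raw label lists."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


scores = {label: per_class_scores(y_test, y_pred, label)
          for label in sorted(set(y_test))}
accuracy = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)

for label, (p, r, f1) in scores.items():
    print(f"{label:<10}{p:.3f}  {r:.3f}  {f1:.3f}")
print(f"accuracy  {accuracy:.2f}")
```

The single mistake ("the disk is faulty" predicted as software) costs hardware recall and software precision, which is why those are the two values below 1.000.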

In [73]:
# Update the model on the test data & check accuracy
metric = river.metrics.Accuracy()
for text,label in test_data:
    # Predict first, before the model has seen this sample
    y_pred_before = pipe_nb.predict_one(text)
    metric.update(label,y_pred_before)
    # Only then learn from the sample
    pipe_nb.learn_one(text,label)
    

In [74]:
metric

Accuracy: 75.00%
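
The loop above follows a predict-then-learn pattern, so every sample is scored before the model has trained on it. Here is a generic sketch of that pattern with a toy model. `MajorityClass` and `progressive_accuracy` are hypothetical names, and the four-sample stream is made up.

```python
from collections import Counter


class MajorityClass:
    """Toy online model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


def progressive_accuracy(model, stream):
    """Score each sample before learning from it; return the accuracy."""
    correct = total = 0
    for x, y in stream:
        correct += model.predict_one(x) == y  # test first ...
        model.learn_one(x, y)                 # ... then train
        total += 1
    return correct / total if total else 0.0


stream = [("a", "software"), ("b", "software"),
          ("c", "hardware"), ("d", "software")]
accuracy = progressive_accuracy(MajorityClass(), stream)
print(accuracy)  # 0.5
```

Any object with `predict_one` and `learn_one` methods fits this helper, so the same loop shape applies to the pipeline used in this notebook.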

In [76]:
# Update the model & check accuracy on the training data
# The model has already learned these samples, so 100% here does not measure generalization
metric2 = river.metrics.Accuracy()
for text,label in data:
    y_pred_before = pipe_nb.predict_one(text)
    metric2.update(label,y_pred_before)
    pipe_nb.learn_one(text,label)
    

In [77]:
metric2

Accuracy: 100.00%

In [78]:
#### Thanks For Watching
#### Jesus Saves @JCharisTech
#### Jesse E.Agbe(JCharisTech) 