# Classification Example - Apple Price
This is a simple classification example that uses the _SeqRep_ package with Apple stock price data.

You can 
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MIR-MU/seqrep/blob/main/examples/ClassificationExample-ApplePrice.ipynb)
or
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/MIR-MU/seqrep/main?labpath=examples%2FClassificationExample-ApplePrice.ipynb).


## Install _SeqRep_ package

In [None]:
!pip install seqrep

Collecting git+https://github.com/MIR-MU/seqrep
  Cloning https://github.com/MIR-MU/seqrep to /tmp/pip-req-build-un3rqnik
  Running command git clone -q https://github.com/MIR-MU/seqrep /tmp/pip-req-build-un3rqnik


## Import Needed Parts

In [None]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

from seqrep.feature_engineering import PreviousValuesExtractor, TimeFeaturesExtractor
from seqrep.labeling import NextColorLabeler
from seqrep.splitting import TrainTestSplitter
from seqrep.scaling import UniversalScaler
from seqrep.evaluation import ClassificationEvaluator
from seqrep.pipeline_evaluation import PipelineEvaluator

# Data Source
!pip install yfinance
import yfinance as yf



## Load Data
In this example, we use daily price data for *Apple shares* from *Yahoo Finance*.

In [None]:
data = yf.download(tickers="AAPL", period="10000d", interval="1d")
# column names have to be lowercase
data.columns = data.columns.str.lower()
data

[*********************100%***********************]  1 of 1 completed


Date,open,high,low,close,adj close,volume
1982-04-08,0.078125,0.078683,0.078125,0.078125,0.061146,23990400
1982-04-12,0.078125,0.078683,0.077567,0.077567,0.060709,44307200
1982-04-13,0.071987,0.071987,0.071429,0.071429,0.055905,85299200
1982-04-14,0.071987,0.072545,0.071987,0.071987,0.056342,113590400
1982-04-15,0.073103,0.073661,0.073103,0.073103,0.057215,164281600
...,...,...,...,...,...,...
2021-11-29,159.369995,161.190002,158.789993,160.240005,160.240005,88748200
2021-11-30,159.990005,165.520004,159.919998,165.300003,165.300003,174048100
2021-12-01,167.479996,170.300003,164.529999,164.770004,164.770004,152052500
2021-12-02,158.740005,164.199997,157.800003,163.759995,163.759995,136739200
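
Depending on the installed yfinance version, `yf.download` may return columns as a `MultiIndex` (price field plus ticker) instead of the flat index shown above, in which case `data.columns.str.lower()` fails. The helper below is a minimal sketch that handles both layouts. The name `lowercase_columns` is hypothetical, and it assumes the price field sits in the first column level. The toy frames stand in for downloaded data.

```python
import pandas as pd


def lowercase_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with flat, lowercase column names."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # Assumption: level 0 holds the price field ("Open", "Close", ...)
        # and the remaining level holds the ticker symbol.
        out.columns = out.columns.get_level_values(0)
    out.columns = out.columns.str.lower()
    return out


# Toy frames standing in for the two possible column layouts.
flat = pd.DataFrame({"Open": [1.0], "Adj Close": [0.9]})
multi = pd.DataFrame(
    [[1.0, 0.9]],
    columns=pd.MultiIndex.from_tuples([("Open", "AAPL"), ("Adj Close", "AAPL")]),
)

print(list(lowercase_columns(flat).columns))
print(list(lowercase_columns(multi).columns))
```

With real data, the call would be `data = lowercase_columns(data)` in place of the direct column assignment.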


## Run Pipeline with Evaluation
This is the simplest way to use the framework. The pipeline transformations are applied, and then the selected model is trained and evaluated on the split data.

In [None]:
# This DataFrame collects the results of various runs for comparison.

# Uncomment following line if you want to clear the DataFrame with the results.
# del results_for_comparison

try:
    results_for_comparison
except NameError:
    print("Create new empty DataFrame.")
    results_for_comparison = pd.DataFrame()
else:
    print("DataFrame already exists!")

Create new empty DataFrame.


In [None]:
%%capture --no-stdout --no-display
# 1. step
pipe = Pipeline(
    [
        ("fext_prev", PreviousValuesExtractor()),
        ("fext_time", TimeFeaturesExtractor()),
        ("scale_u", UniversalScaler(scaler=MinMaxScaler())),
    ]
)

# 2. step
pipe_eval = PipelineEvaluator(
    labeler=NextColorLabeler(),
    splitter=TrainTestSplitter(),
    pipeline=pipe,
    model=SVC(),
    evaluator=ClassificationEvaluator(),
)
# 3. step
result = pipe_eval.run(data=data)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the same row.
results_for_comparison = pd.concat(
    [
        results_for_comparison,
        pd.Series(result, name="Run without feature reduction").to_frame().T,
    ]
)

14:45:41.549 Labeling data
14:45:41.553 Splitting data
14:45:41.558 Fitting pipeline
14:45:41.622 Applying pipeline transformations
14:45:41.659 	Original shape:		(7500, 19); 
		shape after removing NaNs: (7499, 19).
14:45:41.665 	Original shape:		(2500, 19); 
		shape after removing NaNs: (2499, 19).
14:45:41.666 Fitting model
14:45:45.461 Predicting
14:45:46.788 Evaluating predictions
[[1157   43]
 [1252   47]] 
 48.17927170868347 % accuracy
 52.22222222222223 % precision of 1 classes
 3.6181678214010775 % recall of 1 classes

              precision    recall  f1-score   support

           0       0.48      0.96      0.64      1200
           1       0.52      0.04      0.07      1299

    accuracy                           0.48      2499
   macro avg       0.50      0.50      0.35      2499
weighted avg       0.50      0.48      0.34      2499
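
The stages logged above (labeling, splitting, fitting and applying transformations, fitting the model, predicting, evaluating) can be approximated in plain pandas and scikit-learn. The sketch below is not SeqRep's implementation. It assumes the label is 1 when the next candle closes above its open, that the split is chronological with 75 % of rows for training, and it uses synthetic random-walk prices instead of the downloaded data. Unlike the log above, it builds features before splitting, so only one row is lost at each end.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic random-walk prices stand in for the downloaded AAPL data.
rng = np.random.default_rng(0)
n = 400
close = 100 + np.cumsum(rng.normal(0, 1, n))
toy = pd.DataFrame(
    {"open": close + rng.normal(0, 0.5, n), "close": close},
    index=pd.date_range("2020-01-01", periods=n, freq="D"),
)

# Labeling (assumed rule): 1 if the NEXT candle closes above its open.
label = (toy["close"] > toy["open"]).astype(int).shift(-1)

# Feature extraction: one lagged value and one calendar feature.
features = toy.copy()
features["close_prev"] = toy["close"].shift(1)
features["dayofweek"] = toy.index.dayofweek

# Drop rows made incomplete by the shifts (first and last row).
frame = features.assign(label=label).dropna()
X, y = frame.drop(columns="label"), frame["label"].astype(int)

# Chronological split: earlier rows train, later rows test.
split = int(len(X) * 0.75)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Fit the scaler on training data only, then fit and evaluate the model.
scaler = MinMaxScaler().fit(X_train)
model = SVC().fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(f"{100 * accuracy_score(y_test, y_pred):.2f} % accuracy")
```

Fitting the scaler on the training rows only keeps information from the test period out of the transformation.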



## Run Pipeline with Evaluation and Feature Reduction
In this example, we use _feature selection_ to reduce the number of features. Half of the features remain (because of `number=0.5`).
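
To illustrate what keeping half of the features by a univariate criterion can look like, here is a minimal sketch using scikit-learn's `SelectKBest` with the ANOVA F-score. Whether `UnivariateFeatureSelector` works this way internally is an assumption. The synthetic dataset and the variable `fraction` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only some of which carry signal.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

fraction = 0.5  # plays the role of `number=0.5` (a fraction of the features)
k = int(X.shape[1] * fraction)

# Score each feature independently against the label and keep the top k.
selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
X_reduced = selector.transform(X)

print(X.shape, "->", X_reduced.shape)
print("kept columns:", np.flatnonzero(selector.get_support()))
```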

For the evaluation, we use the _UniversalEvaluator_ in this case.

In [None]:
%%capture --no-stdout --no-display
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score

from seqrep.feature_engineering import TAExtractor
from seqrep.feature_reduction import UnivariateFeatureSelector
from seqrep.evaluation import UniversalEvaluator

# 1. step
pipe = Pipeline(
    [
        ("fext_prev", PreviousValuesExtractor()),
        ("fext_time", TimeFeaturesExtractor()),
        ("fext_ta", TAExtractor(all_features=True)),
        ("scale_u", UniversalScaler(scaler=MinMaxScaler())),
    ]
)
evaluator = UniversalEvaluator(
    metrics=[accuracy_score, roc_auc_score, precision_score, recall_score]
)
# 2. step
pipe_eval = PipelineEvaluator(
    labeler=NextColorLabeler(),
    splitter=TrainTestSplitter(),
    pipeline=pipe,
    feature_reductor=UnivariateFeatureSelector(number=0.5),
    model=SVC(),
    evaluator=evaluator,
)
# 3. step
result = pipe_eval.run(data=data)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the same row.
results_for_comparison = pd.concat(
    [
        results_for_comparison,
        pd.Series(result, name="Run with feature reduction").to_frame().T,
    ]
)

14:45:46.840 Labeling data
14:45:46.843 Splitting data
14:45:46.849 Fitting pipeline
14:45:50.036 Applying pipeline transformations
14:45:51.243 	Original shape:		(7500, 102); 
		shape after removing NaNs: (7429, 100).
14:45:51.251 	Original shape:		(2500, 102); 
		shape after removing NaNs: (2429, 100).
14:45:51.251 Applying feature reduction
14:45:51.276 Fitting model
14:45:55.195 Predicting
14:45:56.893 Evaluating predictions
accuracy_score:
	0.48291477974475094
roc_auc_score:
	0.5
precision_score:
	0.0
recall_score:
	0.0
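
A ROC AUC of exactly 0.5 together with zero precision and zero recall is the pattern these metrics produce when every prediction is class 0. The sketch below reproduces that pattern on a tiny hand-made example by applying a list of metric functions to the same predictions. It is only an illustration of the idea, not the `UniversalEvaluator` implementation, and it assumes hard class labels are passed to every metric.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.zeros_like(y_true)  # a model that always predicts class 0

# Apply each metric function to the same labels and predictions.
metrics = [accuracy_score, roc_auc_score, recall_score]
results = {metric.__name__: metric(y_true, y_pred) for metric in metrics}

# With no predicted positives, precision is undefined; zero_division=0
# reports it as 0.0 without emitting a warning.
results["precision_score"] = precision_score(y_true, y_pred, zero_division=0)

for name, value in results.items():
    print(f"{name}:\n\t{value}")
```

Here accuracy equals the share of class 0 in `y_true`, which is why a degenerate classifier can still score near 50 % on roughly balanced labels.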


In [None]:
results_for_comparison

,accuracy,confusion matrix,precision,recall,accuracy_score,precision_score,recall_score,roc_auc_score
Run without feature reduction,48.179272,"[[1157, 43], [1252, 47]]",52.222222,3.618168,,,,
Run with feature reduction,,,,,0.482915,0.0,0.0,0.5
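
The two runs fill different columns because the two evaluators name their metrics differently, and the first run reports percentages while the second reports fractions. One possible way to put them on a common scale is sketched below. The values are copied from the table above, and the frame is rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

# Values copied from the comparison table (confusion matrix omitted).
results = pd.DataFrame(
    {
        "accuracy": [48.179272, None],
        "precision": [52.222222, None],
        "recall": [3.618168, None],
        "accuracy_score": [None, 0.482915],
        "precision_score": [None, 0.0],
        "recall_score": [None, 0.0],
    },
    index=["Run without feature reduction", "Run with feature reduction"],
)

# Prefer the fractional "*_score" column; where it is missing, fall back
# to the percentage column divided by 100.
comparable = pd.DataFrame(
    {
        name: results[f"{name}_score"].fillna(results[name] / 100)
        for name in ["accuracy", "precision", "recall"]
    }
)

print(comparable.round(4))
```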
