![workflow graph](Figures/SolutionNo_4_length_8.png "Workflow Graph")

In [1]:
from pathlib import Path
import sys

import pandas as pd

sys.path.append('/Users/stevep/Documents/code/APE_thesis/ape-thesis')
from wrapper_functions import *    

## Workflow Input Objects

### Table 1
- id: `imbd_train`
- source: `/Users/stevep/Documents/code/APE_thesis/ape-thesis/usecases/imbd/imbd_train_fixed.csv`
- DataClass: `MixedDataFrame`
- StatisticalRelevance: `NoRelevance`

In [2]:
imbd_train = load_table_csv('/Users/stevep/Documents/code/APE_thesis/ape-thesis/usecases/imbd/imbd_train_fixed.csv').head(100)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  50000 non-null  int64 
 1   review      50000 non-null  object
 2   sentiment   50000 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


None

|   | Unnamed: 0 | review | sentiment |
|---|---|---|---|
| 0 | 0 | One of the other reviews has mentioned that af... | positive |
| 1 | 1 | A wonderful little production. The filling tec... | positive |
| 2 | 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | 3 | Basically there's a family where a little boy ... | negative |
| 4 | 4 | Letter Matter's "Love in the Time of Money" is... | positive |
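
The `Unnamed: 0` column in the output above is the name pandas gives to a CSV column whose header is empty, which typically happens when a dataframe was saved together with its index. The sketch below reproduces this with a small in-memory CSV (the data is made up) and shows that `index_col=0` reads that column back as the index instead. Whether `load_table_csv` exposes such an option is not shown in this notebook.

```python
import io

import pandas as pd

# Made-up CSV with an empty first header, as produced by DataFrame.to_csv() with its index.
csv_text = ",review,sentiment\n0,Great film,positive\n1,Dull plot,negative\n"

# Default read: the unnamed column becomes a regular column called "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index avoids the extra column.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(with_index.columns))
```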


### Step 8: `column_split`
#### Notes
Splits a dataframe into X and y based on a column name
#### inputs:
- 1
	- DataClass: `MixedDataFrame`
	- StatisticalRelevance: `NoRelevance`
	- APE_label: `['imbd_train']`
	- src: `(0, 2)`
- 2
	- DataClass: `StrColumn`
	- StatisticalRelevance: `DependentVariable`
	- APE_label: `['sentiment']`
	- src: `(0, 0)`
#### outputs:
- 1
	- DataClass: `MixedDataFrame`
	- StatisticalRelevance: `IndependentVariable`
- 2
	- DataClass: `StrSeries`
	- StatisticalRelevance: `DependentVariable`

In [3]:
mixedDataFrame_8_1, strSeries_8_2 = column_split(df=imbd_train, column='sentiment')
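
The implementation of `column_split` lives in `wrapper_functions` and is not shown here. A minimal sketch of the behaviour described in the notes, assuming plain pandas and a hypothetical name `column_split_sketch`, could look like this:

```python
import pandas as pd


def column_split_sketch(df: pd.DataFrame, column: str):
    """Return (X, y): X is df without `column`, y is that column as a Series."""
    X = df.drop(columns=[column])  # returns a copy; df itself is unchanged
    y = df[column]
    return X, y


# Made-up two-row example in the shape of the review data.
toy = pd.DataFrame(
    {"review": ["Great film", "Dull plot"], "sentiment": ["positive", "negative"]}
)
X, y = column_split_sketch(toy, "sentiment")
```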

### Step 9: `train_test_split`
#### Notes
Splits a dataframe into X_train, y_train, X_test, y_test
    > returns strings instead of series if y is a string
    
#### inputs:
- 1
	- DataClass: `MixedDataFrame`
	- StatisticalRelevance: `IndependentVariable`
	- src: `(8, 0)`
- 2
	- DataClass: `StrSeries`
	- StatisticalRelevance: `DependentVariable`
	- src: `(8, 1)`
#### outputs:
- 1
	- DataClass: `MixedDataFrame`
	- StatisticalRelevance: `IndependentVariable`
- 2
	- DataClass: `StrSeries`
	- StatisticalRelevance: `DependentVariable`
- 3
	- DataClass: `MixedDataFrame`
	- StatisticalRelevance: `IndependentVariable`
- 4
	- DataClass: `StrSeries`
	- StatisticalRelevance: `DependentVariable`

In [4]:
mixedDataFrame_9_1, strSeries_9_2, mixedDataFrame_9_3, strSeries_9_4 = train_test_split(df=mixedDataFrame_8_1, y=strSeries_8_2)
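
Note that the wrapper returns the splits in the order X_train, y_train, X_test, y_test, which differs from scikit-learn's own `train_test_split` (X_train, X_test, y_train, y_test). The sketch below illustrates that return order with pandas only. The function name, the 25% hold-out, and the fixed seed are assumptions; the hold-out size is merely consistent with the 75 training rows this 100-row sample produces in step 13.

```python
import pandas as pd


def train_test_split_sketch(df, y, test_size=0.25, random_state=0):
    """Return X_train, y_train, X_test, y_test (the order used in this notebook)."""
    X_test = df.sample(frac=test_size, random_state=random_state)
    X_train = df.drop(index=X_test.index)
    # Select labels by index so that every X stays aligned with its y.
    return X_train, y.loc[X_train.index], X_test, y.loc[X_test.index]


# Made-up data: 100 rows, like the .head(100) sample used above.
features = pd.DataFrame({"review": [f"review {i}" for i in range(100)]})
labels = pd.Series(["positive" if i % 2 else "negative" for i in range(100)])

X_train, y_train, X_test, y_test = train_test_split_sketch(features, labels)
```

Downstream steps have to pair each feature set with the labels from the same split (output 1 with output 2, output 3 with output 4).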

### Step 10: `embed_text_word2vec`
#### Notes
Trains a word2vec model on a dataframe or series and returns the embeddings and the model.
    Alternatively, pass a pretrained model as the word2vec argument.
    
#### inputs:
- 1
	- DataClass: `MixedDataFrame`
	- StatisticalRelevance: `IndependentVariable`
	- src: `(9, 0)`
- 2
	- DataClass: `StrColumn`
	- StatisticalRelevance: `IndependentVariable`
	- APE_label: `['review']`
	- src: `(0, 1)`
#### outputs:
- 1
	- DataClass: `EmbeddingMatrix`
	- StatisticalRelevance: `IndependentVariable`
- 2
	- DataClass: `Word2Vec`
	- StatisticalRelevance: `IndependentVariable`

In [5]:
embeddingMatrix_10_1, word2Vec_10_2 = embed_text_word2vec(data=mixedDataFrame_9_1, column='review')
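
How `embed_text_word2vec` turns per-word vectors into one row per review is not shown in this notebook. A common choice is to average the vectors of the words in each document, and the sketch below assumes that. The word vectors are a made-up toy dictionary standing in for a trained model, and `embed_documents` is a hypothetical name.

```python
import numpy as np


def embed_documents(docs, vectors, dim):
    """Mean-pool word vectors into one row per document (zeros if no known word)."""
    rows = []
    for doc in docs:
        tokens = doc.lower().split()
        known = [vectors[t] for t in tokens if t in vectors]
        rows.append(np.mean(known, axis=0) if known else np.zeros(dim))
    return np.vstack(rows)


# Toy 2-dimensional "model"; a real word2vec model would supply these vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
    "movie": np.array([1.0, 1.0]),
}

matrix = embed_documents(
    ["Good movie", "bad bad movie", "unknown words"], toy_vectors, dim=2
)
```

The result has one row per input document, so its row count must match the labels it is later paired with.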

### Step 11: `init_sklearn_estimator`
#### Notes
Initializes a sklearn estimator.

    The passed string must be one of the following:

    - 'KernelRidgeRegressor'
    - 'PerceptronClassifier'
    - 'LogisticRegressionClassifier'
    - 'LinearRegressor'
    - 'ElasticNetRegressor'
    - 'RidgeRegressor'
    - 'DecisionTreeClassifier'
    - 'DecisionTreeRegressor'
    - 'LinearSVClassifier'
    - 'LinearSVRregressor'
    - 'RandomForestClassifier'
    - 'AdaBoostClassifier'
    - 'VotingClassifier'
    - 'RandomForestRegressor'
    - 'AdaBoostRegressor'
    - 'VotingRegressor'
    - 'DummyClassifier'
    - 'DummyRegressor'
    - 'KMeansClustor'
    - 'DBScanClustor'
    - 'KNeighborsClassifier'
    - 'KNeighborsRegressor'
    - 'GridSearchCV'
    - 'HalvingGridSearchCV'
    - 'SimpleImputer'
    - 'IterativeImputer'
    - 'KNNImputer'
    - 'CatKNNImputer' #! NO
    - 'PCA'
    - 'TruncatedSVD'
    
#### inputs:

#### outputs:
- 1
	- DataClass: `DecisionTreeClassifier`
	- StatisticalRelevance: `NoRelevance`

In [6]:
decisionTreeClassifier_11_1 = init_sklearn_estimator(estimator="DecisionTreeClassifier")
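
A string-to-estimator lookup like the one described in the notes is often built as a registry. The sketch below is one hypothetical way to do it, not the code in `wrapper_functions`: it covers only four of the listed names, and it imports the scikit-learn module lazily so that an unknown name fails with a clear error before anything is imported.

```python
import importlib

# Hypothetical subset of the mapping: accepted string -> (module, class name).
ESTIMATOR_REGISTRY = {
    "DecisionTreeClassifier": ("sklearn.tree", "DecisionTreeClassifier"),
    "LogisticRegressionClassifier": ("sklearn.linear_model", "LogisticRegression"),
    "RandomForestClassifier": ("sklearn.ensemble", "RandomForestClassifier"),
    "PCA": ("sklearn.decomposition", "PCA"),
}


def init_estimator_sketch(name, **params):
    """Look up `name` and return a new, unfitted estimator instance."""
    try:
        module_name, class_name = ESTIMATOR_REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown estimator {name!r}; expected one of {sorted(ESTIMATOR_REGISTRY)}"
        ) from None
    # Import only the module that is actually needed.
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**params)
```

With scikit-learn installed, `init_estimator_sketch("DecisionTreeClassifier", max_depth=3)` would return an unfitted tree.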

### Step 12: `embed_text_word2vec`
#### Notes
Trains a word2vec model on a dataframe or series and returns the embeddings and the model.
    Alternatively, pass a pretrained model as the word2vec argument.
    
#### inputs:
- 1
	- DataClass: `MixedDataFrame`
	- StatisticalRelevance: `IndependentVariable`
	- src: `(9, 0)`
- 2
	- DataClass: `StrColumn`
	- StatisticalRelevance: `IndependentVariable`
	- APE_label: `['review']`
	- src: `(0, 1)`
- 3
	- DataClass: `Word2Vec`
	- StatisticalRelevance: `IndependentVariable`
	- src: `(10, 1)`
#### outputs:
- 1
	- DataClass: `EmbeddingMatrix`
	- StatisticalRelevance: `IndependentVariable`

In [7]:
embeddingMatrix_12_1 = embed_text_word2vec(data=mixedDataFrame_9_1, column='review', word2vec=word2Vec_10_2)

### Step 13: `fit_estimator`
#### Notes
Fits an estimator
    > Operation is in-place even though it returns the estimator!
    
#### inputs:
- 1
	- DataClass: `DecisionTreeClassifier`
	- StatisticalRelevance: `NoRelevance`
	- src: `(11, 0)`
- 2
	- DataClass: `EmbeddingMatrix`
	- StatisticalRelevance: `IndependentVariable`
	- src: `(12, 0)`
- 3
	- DataClass: `StrSeries`
	- StatisticalRelevance: `DependentVariable`
	- src: `(9, 1)`
#### outputs:
- 1
	- DataClass: `DecisionTreeClassifier`
	- StatisticalRelevance: `NoRelevance`

In [8]:
# Fit on the training labels (step 9, output 2) so that y matches the 75 rows of X.
# Passing the full label series strSeries_8_2 raised:
# ValueError: Number of labels=100 does not match number of samples=75
decisionTreeClassifier_13_1 = fit_estimator(estimator=decisionTreeClassifier_11_1, X=embeddingMatrix_12_1, y=strSeries_9_2)
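
The in-place behaviour noted for this step matches the scikit-learn convention that `fit` mutates the estimator and returns that same object. The sketch below shows this with a tiny stand-in estimator (`MajorityClassifier`, invented for illustration) and a hypothetical `fit_estimator_sketch` that also checks that X and y have the same number of rows before fitting, which is the condition behind the "Number of labels ... does not match number of samples" error.

```python
from collections import Counter


class MajorityClassifier:
    """Stand-in estimator that follows the fit/predict convention."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # fit mutates the object and returns it

    def predict(self, X):
        return [self.majority_ for _ in X]


def fit_estimator_sketch(estimator, X, y):
    """Fit `estimator` in place after checking that X and y are aligned."""
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} rows but y has {len(y)} labels")
    return estimator.fit(X, y)


clf = MajorityClassifier()
X_train = [[0.1], [0.2], [0.3]]
y_train = ["positive", "negative", "positive"]
fitted = fit_estimator_sketch(clf, X_train, y_train)
```

Because `fitted` and `clf` are the same object, the variable from the initialization step is also fitted after this call.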

### Step 14: `predict`
#### Notes
Predicts using a FITTED estimator
#### inputs:
- 1
	- DataClass: `DecisionTreeClassifier`
	- StatisticalRelevance: `NoRelevance`
	- src: `(13, 0)`
- 2
	- DataClass: `EmbeddingMatrix`
	- StatisticalRelevance: `IndependentVariable`
	- src: `(10, 0)`
#### outputs:
- 1
	- DataClass: `MixedSeries`
	- StatisticalRelevance: `Prediction`

In [None]:
mixedSeries_14_1 = predict(estimator=decisionTreeClassifier_13_1, X=embeddingMatrix_10_1)

### Step 15: `classification_report`
#### Notes
Displays a classification report
#### inputs:
- 1
	- DataClass: `MixedSeries`
	- StatisticalRelevance: `Prediction`
	- src: `(14, 0)`
- 2
	- DataClass: `StrSeries`
	- StatisticalRelevance: `DependentVariable`
	- src: `(9, 1)`
#### outputs:
- 1
	- DataClass: `ClassificationReport`
	- StatisticalRelevance: `NoRelevance`

In [None]:
classificationReport_15_1 = classification_report(y_true=strSeries_9_2, y_pred=mixedSeries_14_1)
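
The order of the true labels and the predictions matters for a classification report: exchanging them swaps precision and recall for every class. The sketch below demonstrates this in plain Python with made-up labels. `precision_recall` is a hypothetical helper and not part of `wrapper_functions`.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from paired label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the class was predicted
    actual = sum(t == label for t in y_true)  # how often the class really occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Made-up labels: three true positives in the data, only one of them predicted.
y_true = ["positive", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]

correct = precision_recall(y_true, y_pred, "positive")
swapped = precision_recall(y_pred, y_true, "positive")  # arguments exchanged
```

Here `correct` has precision 1.0 and recall 1/3, while `swapped` reports the two values the other way round.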