# Title: Building your 1st Auto ML Application

## Section 1: Intro

This is a “how to” manual for how to create an auto ML application, using an example as a potential template. Auto ML (Automated Machine Learning) is a type of application such that you plug in data and machine learning algorithms are run automatically, outputting a model. Auto ML applications can have a number of possible uses - the fundamental ones are: 
Data cleaning, choosing the right machine learning algorithm(s), and validation. 

Though the applications for the auto ml application I created are numerous, the motivation for it is to determine model bias and/or bias in data. When you decide to create an auto ML application, consider what your motivation is. 

Note: I decided not to include data cleaning as part of this application because there are plenty of applications that already do that. In fact, one possibility is to hook in APIs from other Auto ML applications in order to use other peoples’ work instead of starting from scratch. 

## Section 2: Strategy 

a) Start with a  cheatsheet. I used an image that showed which algorithm to use depending on attributes of your dataset. 

b) Gather some basic information about the dataset that you can easily figure out. This includes:

c) You don’t need to figure everything out yourself though. Feel free to ask your user about the dataset to be analyzed. You don’t need to figure out everything on your own. Your user can tell you all types of relevant information, including:

d) Create an auditing log. Include the attributes discovered about the dataset, the type of problem being faced (depending on attributes of your data, including:
Is the data labeled?
If labeled, is the target variable categorical?), which algorithm is used whenever one is used, and what the output of the model is. 

e) Now is time for training. Choose which algorithm should be run, tune the algorithm, run the algorithm, and output the results.

f) The next step after training of any machine learning application is model validation. This is where I decided to focus my efforts. By doing a deeper model evaluation, figuring out which segments of data the model does poorly, you can find issues. Issues could be with the algorithm, with underlying issues related to the dataset, or some other form of bias. 

g) Prepare for the various types of issues that may have come up. See if you can solve them automatically. For example, if you there is a segment of data that is undersampled (or there are not enough samples), you can oversample the data or create synthetic data from the data that the dataset does not have a sufficient number of samples of). 

h) Iterage sections e-g


i) An important step is to make sure you avoid model drift, the degredation of model performance over time. One type of model drift is due to data drift. In order to avoid data drift, you can create data population reports that compare your data with some baseline dataset to look for changes. Another way to avoid model drift is simply by retraining your models regularly.

## 1

In [16]:
import pandas as pd


## Utility Functions

In [17]:
#utility functions
def ask_if_true(question): #, positive_answers = ["y", "yes", "true"]):
    var_asking_about = str.lower(input(question+'\n'))
    if var_asking_about in ["y", "yes", "true"]:
        return True
    else:
        return False

def create_random_seed():
    current_time = int(time.time())
    time.sleep(random.randrange(1,5)*0.02) 
    return current_time

## questions

Is the observation count more than 50?
Is the observation count more than 100,000?
Is the observation count more than 10,000?

Are you predicting a category, predicting a quantity, or are you exploring?

Is it necessary to have only few features that are important?

Is any of the data text data?

Is the data labeled (is there a target variable with data)?

In [18]:
#get answers to questions
def get_observation_count_comparisons(X):
    count_of_observations = len(X)
    is_count_of_observations_more_than_50 = (count_of_observations>50)
    is_count_of_observations_less_than_100k = (count_of_observations<100000)
    is_count_of_observations_less_than_10k = (count_of_observations<10000)
    return (is_count_of_observations_more_than_50, 
            is_count_of_observations_less_than_100k, 
            is_count_of_observations_less_than_10k)

def derive_from_type_of_target_variable(y_type):
    is_target_categorical = y_type == "categorical"
    is_target_quantity = y_type == "quantity"
    is_exploration = y_type == "exploring"
    return (is_target_categorical, 
            is_target_quantity, 
            is_exploration)

def get_is_data_labeled(Y):
    is_data_labeled = len(Y) > 0
    return is_data_labeled

#Collect more data if necessary
def get_is_number_categories_known():
    if is_target_categorical == True and is_data_labeled == False :
        question = "Do you know how many categories there are? 'yes' or 'no'\n"
        is_number_categories_known = ask_if_true(question)
        

algorithms...

classification:
'do_LinearSVC', 
'do_NaiveBayes', 
'do_KNeighborsClassifier', 
'do_SVC', 
'do_SGDClassifier', 
'do_KernelApproximation'

#dimensionality reduction:
'do_Randomized_PCA'

#Clustering
'do_KMeans', 
'do_gmm', 
'do_SpectralClustering', 
'MiniBatchKMeans', 
'do_MeanShift', 
'do_VBGMM'

#Regression
'do_SGDRegressor', 
'do_Lasso', 
'do_ElasticNet', 
'do_Ridge', 
'do_linear_SVR'

## Get Data

In [19]:
#generating/collecting data
import numpy as np
import pandas as pd
import random    

def get_data(data_generated=True, filename = ''):
    X = []
    Y = []
    if not data_generated:
        #df = pd.read_csv(filename)
        df = pd.read_csv('../data/application_data.csv')
        
        (is_count_of_observations_more_than_50, 
         is_count_of_observations_less_than_100k, 
         is_count_of_observations_less_than_10k) = get_observation_count_comparisons(df)
        y_type = get_target_variable_type()
        is_few_features_important = get_is_few_features_important()
        is_any_data_text = get_is_any_data_text()

        (is_target_categorical, 
         is_target_quantity, 
         is_exploration) = derive_from_type_of_target_variable(y_type)
        is_data_labeled = get_is_data_labeled(Y)  
        is_number_categories_known = get_is_number_categories_known()
    
    else:
        random = False
        if random == True:
            is_data_labeled = False
            is_number_categories_known = False
            np.random.seed(create_random_seed())
            number_observations = np.random.choice([0,1, 20, 200, 2000, 10001, 100001])
            for _ in range(0,number_observations):
                X.append([random.randrange(1,100),
                          random.randrange(1,100),
                          random.randrange(1,100)])
                Y.append(np.random.choice(["cat","dog"]))

            np.random.seed(create_random_seed())
    #         is_count_of_observations_more_than_50 = np.random.choice([True, False])
    #         is_count_of_observations_less_than_100k = np.random.choice([True, False])
    #         is_count_of_observations_less_than_10k = np.random.choice([True, False])
            (is_count_of_observations_more_than_50, 
             is_count_of_observations_less_than_100k, 
             is_count_of_observations_less_than_10k) = get_observation_count_comparisons(X)
            np.random.seed(create_random_seed())
            y_type = np.random.choice(["categorical", "quantity", "exploring"])
            np.random.seed(create_random_seed())
            is_few_features_important = np.random.choice([True, False])
            np.random.seed(create_random_seed())
            is_any_data_text = np.random.choice([True, False])
            (is_target_categorical, 
             is_target_quantity, 
             is_exploration) = derive_from_type_of_target_variable(y_type)
            np.random.seed(create_random_seed())
            is_data_labeled = np.random.choice([True, False])
            np.random.seed(create_random_seed())
            is_number_categories_known = np.random.choice([True, False])
        else: #generate data for specific things
            makeDataForAnAlgorithm = DataFromUser()
            (is_data_labeled,
                is_number_categories_known,
                is_count_of_observations_more_than_50,
                is_count_of_observations_less_than_100k,
                is_count_of_observations_less_than_10k,
                is_few_features_important,
                is_any_data_text,
                is_target_categorical, 
                is_target_quantity, 
                is_exploration) = makeDataForAnAlgorithm.all_user_responses()
#debugging
#         print(is_count_of_observations_more_than_50)
#         print(is_count_of_observations_less_than_100k )
#         print(is_count_of_observations_less_than_10k )
#         print(y_type )
#         print(is_few_features_important )
#         print(is_any_data_text )
#         print(is_target_categorical )
#         print(is_target_quantity)
#         print(is_exploration )
#         print(is_data_labeled )
#         print(is_number_categories_known )
#         print(number_observations )
            

        return (is_count_of_observations_more_than_50,
        is_count_of_observations_less_than_100k ,
        is_count_of_observations_less_than_10k ,
        y_type ,
        is_few_features_important ,
        is_any_data_text ,
        is_target_categorical ,
        is_target_quantity,
        is_exploration ,
        is_data_labeled ,
        is_number_categories_known ,
        number_observations,
        X,
        Y)

def get_target_variable_type():
    #ask user for type of Y:
    y_type_raw = str.lower(input("Choose 1: Is Y 'categorical', a 'quantity', or are you just 'exploring'?\n"))
    y_type = str.lower(''.join([character for character in y_type_raw if \
                      character.isalpha()]))
    if "exploring" in y_type:
        y_type = "exploring"
    if y_type not in ("categorical", "quantity", "exploring"):
        raise Exception("answer must be 'categorical','quantity', or 'exploring'")
    return y_type

def get_is_few_features_important():
    #ask user if only few features are important 
    is_few_features_important = ask_if_true("Should few features be important?:\n")
    return is_few_features_important

def get_is_any_data_text():
    #ask user if any of the data is text 
    ##TODO in the future: check why Naive Beyes is good for text data
    ##Then use that information to check for text data yourself if possible
    is_any_data_text = ask_if_true("Is any of the data text?:\n")
    return is_any_data_text


## ML Functions

In [20]:
#Classification: training for Categorical (target) with Labels
import sklearn
from sklearn.svm import LinearSVC
#from sklearn.pipeline import make_pipeline

def do_LinearSVC(X_train, Y_train, X_test, Y_test):
    model = LinearSVC(random_state=0)
    model.fit(X_train,Y_train)
    zipped = zip(model.predict(X_test), Y_test)
    print(*zipped)
    return model

##########################################################################
from sklearn.naive_bayes import GaussianNB

def do_NaiveBayes(X_train, Y_train, X_test, Y_test):
    model = GaussianNB()
    model.fit(X,Y)
    zipped = zip(model.predict(X_test), Y_test)
    print(*zipped)
    return model

##########################################################################
from sklearn.neighbors import KNeighborsClassifier

def do_KNeighborsClassifier(X_train, Y_train, X_test, Y_test):
    model = KNeighborsClassifier(n_neighbors=(len(X_train)/10+1))
    model.fit(X,Y)
    zipped = zip(model.predict(X_test), Y_test)
    print(*zipped)
    return model

##########################################################################
from sklearn import svm

def do_SVC(X_train, Y_train, X_test, Y_test):
    model = svm.SVC()
    zipped = zip(model.predict(X_test), Y_test)
    print(*zipped)
    return model    

##########################################################################
from sklearn.linear_model import SGDClassifier

def do_SGDClassifier(X_train, Y_train, X_test, Y_test, max_iterations = 1000):
    model = SGDClassifier(max_iter=max_iterations, tol=1e-3)
    zipped = zip(model.predict(X_test), Y_test)
    print(*zipped)
    return model   

##########################################################################
from sklearn.kernel_approximation import RBFSampler

def do_KernelApproximation(X_train, Y_train, X_test, Y_test):
    rbf_feature = RBFSampler(gamma=1, random_state=1)
    X_features_train = rbf_feature.fit_transform(X_train)
    X_features_test = rbf_feature.fit_transform(X_test)
    model = do_SGDClassifier(X_features, Y_train, X_features_test, Y_test, max_iterations = 5)
    
##########################################################################

    
    

In [28]:
#deminsionality reduction

def do_Randomized_PCA():
    pass

In [23]:
#Clustering: training for Categorical (target) without Labels
from sklearn.cluster import KMeans

def do_KMeans(X_train, Y_train, X_test, Y_test):
    model = KMeans(n_clusters=2, random_state=0).fit(X_train)
#     zipped = zip(model.predict(X_test), Y_test)
#     print(*zipped)
    return model

##########################################################################
from sklearn.mixture import GaussianMixture

def do_gmm(X_train, Y_train, X_test, Y_test):
    model = GaussianMixture(n_components=2, random_state=0).fit(X)
#     zipped = zip(model.predict(X_test), Y_test)
#     print(*zipped)
    return model

##########################################################################
from sklearn.cluster import SpectralClustering

def do_SpectralClustering(X_train, Y_train, X_test, Y_test):
    model = SpectralClustering(n_clusters=2,
            assign_labels="discretize",
            random_state=0).fit(X)
#     zipped = zip(model.predict(X_test), Y_test)
#     print(*zipped)
    return model

##########################################################################
from sklearn.cluster import MiniBatchKMeans
#I don't actually understand what is going on for Mini Batch KMeans

#     kmeans = MiniBatchKMeans(n_clusters=2,
#                              random_state=0,
#                              batch_size=6)
#     kmeans = kmeans.partial_fit(X[0:6,:])
#     kmeans = kmeans.partial_fit(X[6:12,:])
#     minibatch kmeans, MeanShift, VBGMM
    
##########################################################################
from sklearn.cluster import MeanShift

def do_MeanShift(X_train, y_train, X_test, Y_test):
    model = MeanShift()
    model.fit(X)
#     cluster_centers = model.cluster_centers_
#     labels = model.labels_   
    return model

##########################################################################
from sklearn.mixture import GaussianMixture

def do_GaussianMixture(X_train, y_train, X_test, Y_test):
    model = GaussianMixture()
    model.fit(X)
    return model
    

In [25]:
#Regression: Training for non-categorical quantity (target)
from sklearn.linear_model import SGDRegressor

def do_SGDRegressor(X_train, y_train, X_test, Y_test):   
    model = SGDRegressor()
    model.fit(X_train, y_train)
    return model

##########################################################################
from sklearn.linear_model import Lasso

def do_Lasso(X_train, y_train, X_test, Y_test):
    model = Lasso() 
    model.fit(X_train, y_train)
    return model

##########################################################################
from sklearn.linear_model import ElasticNet

def do_ElasticNet(X_train, y_train, X_test, Y_test): 
    model = ElasticNet()
    model.fit(X_train, y_train)
#     print(model.coef_)
    return model

##########################################################################
from sklearn.linear_model import Ridge

def do_Ridge(X_train, y_train, X_test, Y_test):
    model = Ridge()
    model.fit(X_train, y_train)
    return model

##########################################################################
from sklearn.svm import SVR

def do_linear_SVR(X_train, y_train, X_test, Y_test):
    model = SVR(kernel=linear)
    model.fit(X_train, y_train)
    return model



## Testing

In [29]:
classification_algorithms = [do_LinearSVC, do_NaiveBayes, do_KNeighborsClassifier, do_SVC, do_SGDClassifier, do_KernelApproximation]
dimensionality_reduction_algorithms = [do_Randomized_PCA]
#replaced VBGMM with GaussianMixture
clustering_algorithms = [do_KMeans, do_gmm, do_SpectralClustering, MiniBatchKMeans, do_MeanShift, do_GaussianMixture]
regression_algorithms = [do_SGDRegressor, do_Lasso, do_ElasticNet, do_Ridge, do_linear_SVR]
all_algorithms = classification_algorithms + dimensionality_reduction_algorithms + clustering_algorithms + regression_algorithms

data_labeled = ['do_LinearSVC', 
'do_NaiveBayes', 
'do_KNeighborsClassifier', 
'do_SVC', 
'do_SGDClassifier', 
'do_KernelApproximation']

number_categories_known = ['do_KMeans', 
'do_gmm', 
'do_SpectralClustering', 
'MiniBatchKMeans']

more_than_100k = ['do_SGDClassifier', 
'do_KernelApproximation', 'do_SGDRegressor']

more_than_10k = ['MiniBatchKMeans', 
'do_MeanShift', 
'do_VBGMM','do_KernelApproximation'] + more_than_100k

predicting_category = ['do_LinearSVC', 
'do_NaiveBayes', 
'do_KNeighborsClassifier', 
'do_SVC', 
'do_SGDClassifier', 
'do_KernelApproximation', 
'do_KMeans', 
'do_gmm', 
'do_SpectralClustering', 
'MiniBatchKMeans', 
'do_MeanShift', 
'do_GaussianMixture'] #replaced VBGMM with GaussianMixture

predicting_quantity = ['do_SGDRegressor', 
'do_Lasso', 
'do_ElasticNet', 
'do_Ridge', 
'do_linear_SVR']

exploring = ['do_Randomized_PCA']

labeled_data = ['do_LinearSVC', 
'do_NaiveBayes', 
'do_KNeighborsClassifier', 
'do_SVC', 
'do_SGDClassifier', 
'do_KernelApproximation', 'do_SGDRegressor', 
'do_Lasso', 
'do_ElasticNet', 
'do_Ridge', 
'do_linear_SVR']

number_categories_known = ['do_KMeans', 
'do_gmm', 
'do_SpectralClustering', 
'MiniBatchKMeans']

is_there_text_data = ['do_NaiveBayes']

few_features_important = ['do_Lasso', 
'do_ElasticNet']

class DataFromUser:
    def __init__(self):
        pass
    
    def set_responses_for_algorithm(algorithm_func = do_LinearSVC):
        algorithm = algorithm_func.__name__
        default = False
        if default == True:
            self.is_data_labeled = False
            self.is_number_categories_known = False
#             self.number_observations = 0
            self.is_count_of_observations_more_than_50 = True 
            self.is_count_of_observations_less_than_100k = False
            self.is_count_of_observations_less_than_10k = False
            self.y_type = "categorical" #np.random.choice(["categorical", "quantity", "exploring"])
            self.is_few_features_important = False
            self.is_any_data_text = False
            (self.is_target_categorical, 
             self.is_target_quantity, 
             self.is_exploration) = derive_from_type_of_target_variable(y_type)
            
        else:
            self.is_data_labeled = algorithm in data_labeled
            self.is_number_categories_known = algorithm in number_categories_known
#             self.number_observations = 0
            self.is_count_of_observations_more_than_50 = True 
            self.is_count_of_observations_less_than_100k = algorithm in more_than_100k
            self.is_count_of_observations_less_than_10k = algorithm in more_than_10k
#             self.y_type = "categorical" #np.random.choice(["categorical", "quantity", "exploring"])
            self.is_few_features_important = algorithm in few_features_important
            self.is_any_data_text = algorithm in is_there_text_data
            self.is_target_categorical = algorithm in predicting_category 
            self.is_target_quantity = algorithm in predicting_quantity
            self.is_exploration = algorithm in exploring
          
            
    def all_user_responses(self):
        return (self.is_data_labeled,
                self.is_number_categories_known,
#                 self.number_observations,
                self.is_count_of_observations_more_than_50,
                self.is_count_of_observations_less_than_100k,
                self.is_count_of_observations_less_than_10k,
#                 self.y_type, #np.random.choice(["categorical", "quantity", "exploring"])
                self.is_few_features_important,
                self.is_any_data_text,
                self.is_target_categorical, 
                self.is_target_quantity, 
                self.is_exploration)

## Choosing ML Algorithm

In [30]:


def condition_categorical_with_labels(is_count_of_observations_less_than_100k,
                                      is_any_data_text, 
                                      X_train = [], Y_train = [], 
                                      X_test = [], Y_test = [],
                                      model = ''):
    print("This is a Classification Problem ")
    print("because you are predicting a category and have labeled data")
    if is_count_of_observations_less_than_100k:
        print("Start with LinearSVC ")
        #print("because this is a classification problem with more than 100,000 observations")
        print("because there are less than 100,000 observations")
        model = do_LinearSVC(X_train, Y_train, X_test, Y_test)
        has_LinearSVC_worked = ask_if_true("Has LinearSVC worked?")
        if (not has_LinearSVC_worked) and is_any_data_text:
            print("Use Naive Bayes ")
            #print("because it is a classification problem with less than 100,000 observations, Linear SVC didn't work, and there is text data")
            print("because Linear SVC didn't work and there is text data")
            model = do_NaiveBayes(X_train, Y_train, X_test, Y_test)
        elif (not has_LinearSVC_worked) and (not is_any_data_text):
            print("Use K Neighbors Classifier ")
#                    print("because it is a classification problem with less than 100,000 observations, Linear SVC didn't work, and there is no text data")
            print("because Linear SVC didn't work and there is not any text data")
            model = do_KNeighborsClassifier(X_train, Y_train, X_test, Y_test)
            has_KNeighbors_Classifier_worked = ask_if_true("Has K Neighbors Classifier worked?")
            if not has_KNeighbors_Classifier_worked:
                print("Use SVC or Ensemble Classifiers ")
                #print("because it is a classification problem with less than 100,000 observations, Linear SVC didn't work, there is text data, and KNeighborsClassifier didn't work")            
                print("because KNeighborsClassifier didn't work")
    elif not is_count_of_observations_less_than_100k:
        print("Start with SGD Classifier ")
        print("because there are at least 100,000 observations")
        model = do_SGDClassifier(X_train, Y_train, X_test, Y_test)
        has_SGD_Classifier_worked = ask_if_true("Has SGD Classifier worked?")
        if not has_SGD_Classifier_worked:
            print("Use Kernel Approximation ")
            #print("because it is a classification problem with more than 100,000 observations, Linear SVC didn't work, and there is text data")
            print("because SGD Classifier didn't work")
    return model



In [31]:
def condition_categorical_without_labels(is_number_categories_known,
                                         is_count_of_observations_less_than_10k,
                                         X_train = [], Y_train = [], 
                                         X_test = [], Y_test = [], 
                                         model = ''):
    print("This is a Clustering problem ")
    print("because you are predicting a category and do not have labeled data")
    if is_number_categories_known and is_count_of_observations_less_than_10k:
        print("Start with K Means Cluster")
        #print("because this is a clustering problem and you have less than 10,000 samples")
        print("because the number of categories is known and there are less than 10,000 observations")
        has_KMeans_worked = ask_if_true("Has KMeans worked?")
        if not has_KMeans_worked:
            print("Use GMM or Spectral Clustering ")
            print("because K Means didn't work")
    elif is_number_categories_known and not is_count_of_observations_less_than_10k:
            print("Use MiniBatch KMeans ")
            print("because the number of categories is known and there are at least 10,000 observations")
    elif not is_number_categories_known and not is_count_of_observations_less_than_10k:
        print("You need more observations or potentially assume (or otherwisse figure out) the number of categories")
    elif not is_number_categories_known and is_count_of_observations_less_than_10k:
        print("Use MeanShift or VBGMM ")
        print("because the number of categories is not known and there are less than 10,000 observations")                
    return model
        


In [32]:
def condition_predicting_quantity(is_count_of_observations_less_than_100k,
                                  is_few_features_important, 
                                  X_train = [], Y_train = [], 
                                  X_test = [], Y_test = [], 
                                  model = ''):
    print("This is a Regression problem ")
    print("because you are predicting a quantity")
    if not is_count_of_observations_less_than_100k:
        print("Use SGC Regressor ")
        print("because there are at least 100,000 observations")
    elif is_count_of_observations_less_than_100k and is_few_features_important:
        print("Use Lasso Regression or ElasticNet Regression ")
        print("because there are less than 100,000 observations and only a few features should be important")
    elif is_count_of_observations_less_than_100k and not is_few_features_important:
        print("Start with Ridge Regression and/or SVR(kernel='linear')")
        print("because there are less than 100,000 observations and it is not necessary for only a few features to be important")
        have_Ridge_and_SVRLinear_worked = ask_if_true("Have either Ridge Regression or SVR worked?")
        if not have_Ridge_and_SVRLinear_worked:
            print("Use SVR(kernel='rbf') or EnsembleRegressors ")
            print("because neither Ridge Regression nor SVR(kernel='linear') have worked")
    return model



In [33]:
def condition_exploration(is_count_of_observations_less_than_10k, 
                          X_train = [], Y_train = [], 
                          X_test = [], Y_test = [], 
                          model = ''):
    print("This is a Dimensionality Reduction problem ")
    print("because you are exploring rather than predicting a category or quantity")
    print("Start with Randomized PCA")
    has_randomized_PCA_worked = ask_if_true("Has randomized PCA worked?")
    if not has_randomized_PCA_worked and is_count_of_observations_less_than_10k:
        print("Use kernel approximation ")
        print("because Randomized PCA did not work and there are at least 10,000 observations")
    elif not has_randomized_PCA_worked and not is_count_of_observations_less_than_10k:
        print("Use Isomap or Spectral Embedding ")
        print("because Randomized PCA did not work and there are at least 10,000 observations")
        has_Isomap_or_SpectralEmbedding_worked = ask_if_true("Has either Isomap or Spectral Embedding worked?")
        if not has_Isomap_or_SpectralEmbedding_worked:
            print("Use LLE ")
            print("because neither Isomap nor Spectral Embedding have worked")
    return model
            

## main

In [34]:
# import time
from sklearn.model_selection import train_test_split

# is_count_of_observations_more_than_50 = False
# is_count_of_observations_less_than_100k = False
# is_count_of_observations_less_than_10k = False
# y_type = ''
# is_few_features_important = False
# is_any_data_text = False
# is_target_categorical = False
# is_target_quantity = False
# is_exploration = False
# is_data_labeled = False
# is_number_categories_known = False
# number_observations = 0
# X = []
# Y = []
def main(is_count_of_observations_more_than_50,
    is_count_of_observations_less_than_100k ,
    is_count_of_observations_less_than_10k ,
    y_type ,
    is_few_features_important ,
    is_any_data_text ,
    is_target_categorical ,
    is_target_quantity,
    is_exploration ,
    is_data_labeled ,
    is_number_categories_known ,
    number_observations,
    X,
    Y):
    # initialize variables

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0)
    
    #The Algorithm
    if not is_count_of_observations_more_than_50:
        print("Get more data")
        model = ''
    elif is_target_categorical and is_data_labeled:
        model = condition_categorical_with_labels(is_count_of_observations_less_than_100k,
                                          is_any_data_text, X_train, X_test, Y_train, Y_test)
    elif is_target_categorical and not is_data_labeled:
        model = condition_categorical_without_labels(is_number_categories_known,
                                             is_count_of_observations_less_than_10k,
                                             X_train, X_test, Y_train, Y_test)
    elif is_target_quantity:
        model = condition_predicting_quantity(is_count_of_observations_less_than_100k,
                                      is_few_features_important, X_train, X_test, Y_train, Y_test)
    elif is_exploration:
        model = condition_exploration(is_count_of_observations_less_than_10k, X_train, X_test, Y_train, Y_test)   
    return (model, X_train, Y_train, X_test, Y_test)


## Testing Main

In [35]:
df = pd.read_csv('../data/application_data.csv',
           usecols=['SK_ID_CURR',
                    'TARGET',
                    'FLAG_OWN_CAR',
                    'FLAG_OWN_REALTY',
                    'AMT_INCOME_TOTAL',
                    'AMT_CREDIT', 
                    'DAYS_EMPLOYED', 
                    'EXT_SOURCE_2', 

                    'AMT_GOODS_PRICE',
                    'DAYS_EMPLOYED',
                    'FLAG_EMP_PHONE',
                    'FLAG_WORK_PHONE',
                    'FLAG_CONT_MOBILE',
                    'FLAG_PHONE',
                    'FLAG_EMAIL',
                    'YEARS_BUILD_AVG',
                    'COMMONAREA_AVG',
                    'ELEVATORS_AVG',
                    
                    'CODE_GENDER',
                    'DAYS_BIRTH', 'HOUR_APPR_PROCESS_START', 
                    'WEEKDAY_APPR_PROCESS_START']) 

X = df[['SK_ID_CURR',
        'FLAG_OWN_CAR',
        'FLAG_OWN_REALTY',
        'AMT_INCOME_TOTAL',
        'AMT_CREDIT', 
        'DAYS_EMPLOYED', 
        'EXT_SOURCE_2', 

        'AMT_GOODS_PRICE',
        'DAYS_EMPLOYED',
        'FLAG_EMP_PHONE',
        'FLAG_WORK_PHONE',
        'FLAG_CONT_MOBILE',
        'FLAG_PHONE',
        'FLAG_EMAIL',
        'YEARS_BUILD_AVG',
        'COMMONAREA_AVG',
        'ELEVATORS_AVG',

        'CODE_GENDER',
        'DAYS_BIRTH', 'HOUR_APPR_PROCESS_START', 
        'WEEKDAY_APPR_PROCESS_START']]

Y = df['TARGET']

AttributeError: 'DataFromUser' object has no attribute 'is_data_labeled'

## commented out code

In [None]:
    #get data
    (is_count_of_observations_more_than_50,
    is_count_of_observations_less_than_100k ,
    is_count_of_observations_less_than_10k ,
    y_type ,
    is_few_features_important ,
    is_any_data_text ,
    is_target_categorical ,
    is_target_quantity,
    is_exploration ,
    is_data_labeled ,
    is_number_categories_known ,
    number_observations,
    X,
    Y) = get_data(data_generated=True)
    
(model, X_train, Y_train, X_test, Y_test) = main(is_count_of_observations_more_than_50,
    is_count_of_observations_less_than_100k ,
    is_count_of_observations_less_than_10k ,
    y_type ,
    is_few_features_important ,
    is_any_data_text ,
    is_target_categorical ,
    is_target_quantity,
    is_exploration ,
    is_data_labeled ,
    is_number_categories_known ,
    number_observations,
    X,
    Y)

In [None]:
#     #debugging
#     print(is_count_of_observations_more_than_50)
#     print(is_count_of_observations_less_than_100k )
#     print(is_count_of_observations_less_than_10k )
#     print(y_type )
#     print(is_few_features_important )
#     print(is_any_data_text )
#     print(is_target_categorical )
#     print(is_target_quantity)
#     print(is_exploration )
#     print(is_data_labeled )
#     print(is_number_categories_known )
#     print(number_observations )

In [None]:
# model = ''
# if __name__ == "__main__":
#     model = main()

#     for _ in range(10):
#         main()

In [None]:
# # is_count_of_observations_more_than_50
# # is_count_of_observations_less_than_100k
# # is_count_of_observations_less_than_10k
# # y_type
# # is_few_features_important
# # is_any_data_text
# # is_target_categorical
# # is_target_quantity
# # is_exploration
# # is_data_labeled  
# # is_number_categories_known
# if not is_count_of_observations_more_than_50:
#     print("Get more data")
# elif is_target_categorical:
#     if is_data_labeled:
#         print("This is a Classification Problem ")
#         print("because you are predicting a category and have labeled data")
#         if is_count_of_observations_less_than_100k:
#             print("Use LinearSVC ")
#             #print("because this is a classification problem with more than 100,000 observations")
#             print("because there are less than 100,000 observations")
#             has_LinearSVC_worked = ask_if_true("Has LinearSVC worked?")
#             if (not has_LinearSVC_worked) and is_any_data_text:
#                 print("Use Niave Bayes ")
#                 #print("because it is a classification problem with less than 100,000 observations, Linear SVC didn't work, and there is text data")
#                 print("because Linear SVC didn't work and there is text data")
#             elif (not has_LinearSVC_worked) and (not is_any_data_text):
#                 print("Use K Neighbors Classifier ")
# #                    print("because it is a classification problem with less than 100,000 observations, Linear SVC didn't work, and there is no text data")
#                 print("because Linear SVC didn't work and there is not any text data")
#                 has_KNeighbors_Classifier_worked = ask_if_true("Has K Neighbors Classifier worked?")
#                 if not has_KNeighbors_Classifier_worked:
#                     print("Use SVC or Ensemble Classifiers ")
#                     #print("because it is a classification problem with less than 100,000 observations, Linear SVC didn't work, there is text data, and KNeighborsClassifier didn't work")            
#                     print("because KNeighborsClassifier didn't work")
#         elif not is_count_of_observations_less_than_100k:
#             print("Use SGD Classifier ")
#             print("because there are at least 100,000 observations")
#             has_SGD_Classifier_worked = ask_if_true("Has SGD Classifier worked?")
#             if not has_SGD_Classifier_worked:
#                 print("Use Kernel Approximation ")
#                 #print("because it is a classification problem with more than 100,000 observations, Linear SVC didn't work, and there is text data")
#                 print("because SGD Classifier didn't work")
#     elif not is_data_labeled:
#         print("This is a Clustering problem ")
#         print("because you are predicting a category and do not have labeled data")
#         if is_number_categories_known and is_count_of_observations_less_than_10k:
#             print("Use K Means ")
#             #print("because this is a clustering problem and you have less than 10,000 samples")
#             print("because the number of categories is known and there are less than 10,000 observations")
#             has_KMeans_worked = ask_if_true("Has KMeans worked?")
#             if not has_KMeans_worked:
#                 print("Use GMM or Spectral Clustering ")
#                 print("because K Means didn't work")
#         elif is_number_categories_known and not is_count_of_observations_less_than_10k:
#                 print("Use MiniBatch KMeans ")
#                 print("because the number of categories is known and there are at least 10,000 observations")
#         elif not is_number_categories_known and not is_count_of_observations_less_than_10k:
#             print("You need more observations or potentially assume (or otherwisse figure out) the number of categories")
#         elif not is_number_categories_known and is_count_of_observations_less_than_10k:
#             print("Use MeanShift or VBGMM ")
#             print("because the number of categories is not known and there are less than 10,000 observations")
# elif is_target_quantity:
#     print("This is a Regression problem ")
#     print("because you are predicting a quantity")
#     if not is_count_of_observations_less_than_100k:
#         print("Use SGC Regressor ")
#         print("because there are at least 100,000 observations")
#     elif is_count_of_observations_less_than_100k and is_few_features_important:
#         print("Use Lasso Regression or ElasticNet Regression ")
#         print("because there are less than 100,000 observations and only a few features should be important")
#     elif is_count_of_observations_less_than_100k and not is_few_features_important:
#         print("Use Ridge Regression or SVR(kernel='linear')")
#         print("because there are less than 100,000 observations and it is not necessary for only a few features to be important")
#         have_Ridge_and_SVRLinear_worked = ask_if_true("Have either Ridge Regression or SVR worked?")
#         if not have_Ridge_and_SVRLinear_worked:
#             print("Use SVR(kernel='rbf') or EnsembleRegressors ")
#             print("because neither Ridge Regression nor SVR(kernel='linear') have worked")
# elif is_exploration:
#     print("This is a Dimensionality Reduction problem")
#     print("Start with Randomized PCA")
#     has_randomized_PCA_worked = ask_if_true("Has randomized PCA worked?")
#     if not has_randomized_PCA_worked and is_count_of_observations_less_than_10k:
#         print("Use kernel approximation ")
#         print("because Randomized PCA did not work and there are at least 10,000 observations")
#     elif not has_randomized_PCA_worked and not is_count_of_observations_less_than_10k:
#         print("Use Isomap or Spectral Embedding ")
#         print("because Randomized PCA did not work and there are at least 10,000 observations")
#         has_Isomap_or_SpectralEmbedding_worked = ask_if_true("Has either Isomap or Spectral Embedding worked?")
#         if not has_Isomap_or_SpectralEmbedding_worked:
#             print("Use LLE ")
#             print("because neither Isomap nor Spectral Embedding have worked")
        
        
    
            
            
            
            
                
                
        
    