In [1]:
import os

In [2]:
os.chdir('..')

<img src="flow_1.png">

In [3]:
from flows.flows import Flows

This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.


[1m[35mWelcome to the Data Science Package. First create an object as follows:[m[m
[1m[35mFor example, use the code below to import the flow 0:[m[m
[32m[40mflow = Flows(0)[m
[1m[35mYou can define the `categorical_threshold` which is the maximum number of categories that a categorical feature should have before considering it as continuous numeric feature. The default value is 50[m[m
[1m[35mFor example, use the code below to import the flow 0 with defining the categorical_threshold as 50[m[m
[32m[40mflow = Flows(flow_id=0, categorical_threshold=50)[m


In [4]:
flow = Flows(1)

[1m[35mPlease use the following function to read the data[m[m
[32m[40mdataframe_dict = flow.load_data(path: str, files_list: list)[m
[1m[35mFor example: [m[32m[40mpath = './data'[m[m
[1m[35mFor example: [m[32m[40mfiles_list = ['train.csv','test.csv'][m[m
[1m[35mThe output is a dictionary that contains dataframes e.g.  [m[m
[34mdataframe_dict = {'train': train_dataframe,'test': test_dataframe}[m
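To make the shape of that dictionary concrete, here is a minimal standard-library sketch that reads a list of CSV files into a dictionary keyed by file name without the extension. The helper `load_csv_files` and the two throwaway files are hypothetical; the real `flow.load_data` returns dataframes and may differ in every other detail.

```python
import csv
import os
import tempfile


def load_csv_files(path, files_list):
    """Read each CSV file into a list of row dicts, keyed by the file name stem."""
    tables = {}
    for file_name in files_list:
        key = os.path.splitext(file_name)[0]  # 'train.csv' -> 'train'
        with open(os.path.join(path, file_name), newline="") as handle:
            tables[key] = list(csv.DictReader(handle))
    return tables


# Build two tiny throwaway files (made-up values) so the sketch runs anywhere.
demo_dir = tempfile.mkdtemp()
demo_files = {"train.csv": "Id,SalePrice\n1,100000\n", "test.csv": "Id\n2\n"}
for name, content in demo_files.items():
    with open(os.path.join(demo_dir, name), "w", newline="") as handle:
        handle.write(content)

tables = load_csv_files(demo_dir, ["train.csv", "test.csv"])
print(sorted(tables))
```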


In [5]:
path = './data/flow_1'
files_list = ['train.csv','test.csv']

In [6]:
dataframe_dict, columns_set = flow.load_data(path, files_list)

A summary of the data sets


column type,train,test
categorical_integer,18,18
categorical_string,43,43
continuous,20,19
date,0,0
json,0,0
other,0,0
total amount,81,80


NOTE: numeric categorical columns that contain more than 50 classes are considered numeric continuous features.
NOTE: You can modify the threshold value if you want to consider more or fewer numeric categorical features as numeric continuous features.
The possible ids are {'Id'}
The possible targets are ['SalePrice']
The type of the problem that should be solved {'SalePrice': 'regression'}
If you have categorical features with string labels, encode the categorical features by applying the following function:
dataframe_dict, columns_set = flow.encode_categorical_feature(dataframe_dict: dict)


In [7]:
dataframe_dict, columns_set = flow.encode_categorical_feature(dataframe_dict)

The reference dataframe is: train
******************************
A summary of the data sets


column type,train,test
categorical_integer,61,61
categorical_string,0,0
continuous,20,19
date,0,0
json,0,0
other,0,0
total amount,81,80


[1m[35mYou have categorical features. Apply one-hot encoding to the categorical features by applying the following function:
[m[32m[40mdataframe_dict, columns_set = flow.one_hot_encoding (dataframe_dict: dict, ignore_columns: list, class_number_range=[3, 50][m[m
[1m[35mSince one-hot encoding can produce a lot of features, class_number_range will limit the encoding process only for features which have between 3 and 49 unique values.[m[m
[1m[35mIf you are solving a classification problem, you should exclude the target from the one-hot encoding process by defining the ingore_columns
[m[32m[40mingore_columns = [<your target/label>]
[m[35mYou can add more columns to the ingore_columns list to ignore[m[m


In [8]:
ignore_columns = ['Id', 'SalePrice']  # load_data reported the ID column as 'Id'

In [9]:
dataframe_dict, columns_set = flow.one_hot_encoding(dataframe_dict,
                                                    "train",
                                                    ignore_columns,
                                                    class_number_range=[3, 50])

A summary of the data sets


column type,train,test
categorical_integer,436,436
categorical_string,0,0
continuous,20,19
date,0,0
json,0,0
other,0,0
total amount,456,455


[1m[35mIf you have numeric features, it is a good idea to normalize numeric features. Use the following function for feature normalization :
[m[32m[40mdataframe_dict, columns_set = flow.scale_data (dataframe_dict: dict, ignore_columns: list)[m[m
[1m[35mFor example: [m[32m[40mignore_columns = ['id', 'target'][m[m


In [10]:
dataframe_dict, columns_set = flow.scale_data(dataframe_dict, ignore_columns)

A summary of the data sets


column type,train,test
categorical_integer,436,436
categorical_string,0,0
continuous,20,19
date,0,0
json,0,0
other,0,0
total amount,456,455


[1m[35mYour features are ready to train the model: [m[m
[1m[35mIf you want to explore the data you can run one of the following functions: [m[m
[1m[35m1 . [m[32m[40mflow.exploring_data(dataframe_dict: dict, key_i: str)[m[m
[1m[35mFor example: [m[32m[40mflow.exploring_data(dataframe_dict, 'train')[m[m
[1m[35m2 . [m[32m[40mflow.comparing_statistics(dataframe_dict: dict)[m[m
[1m[35mFor example: [m[32m[40mflow.comparing_statistics(dataframe_dict)[m[m




[1m[35mYou can start training the model by applying the following function: [m[m
[32m[40mmodel_index_list, save_models_dir, y_test = flow.training(parameters)[m
parameters = { 
 "data": {
 "train": {"features": train_dataframe, "t

In [11]:
ignore_columns = ["Id", "SalePrice"]  # load_data reported the ID column as 'Id'
# Use only the integer-encoded categorical features, minus the ID and target columns.
feature_columns = [x for x in columns_set["train"]["categorical_integer"] if x not in ignore_columns]
train_dataframe = dataframe_dict["train"][feature_columns]
test_dataframe = dataframe_dict["test"][feature_columns]
train_target = dataframe_dict["train"]["SalePrice"]

In [12]:
parameters = {
    "data": {
        "train": {"features": train_dataframe, "target": train_target.to_numpy()},
    },
    "split": {
        "method": "split",  # alternative: "kfold" together with "fold_nr", e.g. 5
        "split_ratios": 0.2,  # alternative: a tuple such as (0.3, 0.2)
    },
    "model": {"type": "Ridge linear regression",
              "hyperparameters": {"alpha": "optimize"},  # let the flow search for alpha
              },
    "metrics": ["r2_score", "mean_squared_error"],
    "predict": {
        "test": {"features": test_dataframe}
    }
}
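How a hold-out split combines with an optimized `alpha` can be sketched on synthetic data. The package's own search procedure is not shown in this notebook, so everything below is an assumption for illustration: the helper names, the candidate alphas, the single-feature ridge model without intercept, and selection by validation mean squared error.

```python
import random


def holdout_split(n_rows, validation_ratio=0.2, seed=0):
    """Shuffle row indices and hold out a fraction for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_validation = int(round(n_rows * validation_ratio))
    return indices[n_validation:], indices[:n_validation]


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge weight for y ~ w * x (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic data: y is roughly 3 * x plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]

train_idx, validation_idx = holdout_split(len(xs), validation_ratio=0.2)
x_train = [xs[i] for i in train_idx]
y_train = [ys[i] for i in train_idx]
x_validation = [xs[i] for i in validation_idx]
y_validation = [ys[i] for i in validation_idx]

# Score every candidate alpha on the held-out rows and keep the best one.
scores = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    weight = fit_ridge_1d(x_train, y_train, alpha)
    predictions = [weight * x for x in x_validation]
    scores[alpha] = mean_squared_error(y_validation, predictions)

best_alpha = min(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```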

In [13]:
model_index_list, save_models_dir, y_test = flow.training(parameters)

the optimized alpha value 1.0


metric,model 0
mean_squared_error (train.train),451149000.0
mean_squared_error (train.validation_0),1099440000.0
r2_score (train.train),92.44
r2_score (train.validation_0),85.67


metric,model 0
mean_squared_error (train),580807200.0
r2_score (train),90.79
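For reference, the two reported metrics can be computed by hand as sketched below. The example targets and predictions are made up. Reading the `r2_score` rows as percentages (R² multiplied by 100) is an assumption based on the magnitude of the values. Taking the square root of the validation mean squared error from the table gives an error in the units of SalePrice, roughly 33,000.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions, only to exercise the two functions.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

mse = mean_squared_error(y_true, y_pred)
r2_percent = round(100 * r2_score(y_true, y_pred), 2)

# Validation MSE copied from the table above, converted to an RMSE.
validation_rmse = math.sqrt(1099440000.0)
print(mse, r2_percent, round(validation_rmse))
```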


This is the end of the flow


In [14]:
parameters_lightgbm = {
    "data": {
        "train": {"features": train_dataframe, "target": train_target.to_numpy()},
    },
    "split": {
        "method": "kfold",  # alternative: "split" together with "split_ratios", e.g. 0.2 or (0.3, 0.2)
        "fold_nr": 5,
    },
    "model": {"type": "lightgbm",
              "hyperparameters": dict(objective='regression', metric='root_mean_squared_error', num_leaves=5,
                                      boost_from_average=True,
                                      learning_rate=0.05, bagging_fraction=0.99, feature_fraction=0.99, max_depth=-1,
                                      num_rounds=10000, min_data_in_leaf=10, boosting='dart')
              },
    "metrics": ["r2_score", "mean_squared_error"],
    "predict": {
        "test": {"features": test_dataframe}
    }
}

In [15]:
model_index_list, save_models_dir, y_test = flow.training(parameters_lightgbm)

shuffle is not provided: 'shuffle'
random_state is not provided: 'random_state'
fold_nr. 1
{'r2_score (train.train)': 98.6, 'mean_squared_error (train.train)': 85407084.72653972, 'r2_score (train.validation)': 89.69, 'mean_squared_error (train.validation)': 735456111.1847914}
fold_nr. 2
{'r2_score (train.train)': 98.82, 'mean_squared_error (train.train)': 75297508.23059897, 'r2_score (train.validation)': 85.56, 'mean_squared_error (train.validation)': 876638566.3950267}
fold_nr. 3
{'r2_score (train.train)': 98.81, 'mean_squared_error (train.train)': 76437480.58935387, 'r2_score (train.validation)': 81.49, 'mean_squared_error (train.validation)': 1082012028.1759164}
fold_nr. 4
{'r2_score (train.train)': 98.7, 'mean_squared_error (train.train)': 80459577.24151137, 'r2_score (train.validation)': 81.24, 'mean_squared_error (train.validation)': 1280155802.525103}
fold_nr. 5
{'r2_score (train.train)': 98.78, 'mean_squared_error (train.train)': 79321514.48982576, 'r2_score (train.validation)'

metric,fold_1,fold_2,fold_3,fold_4,fold_5,mean
mean_squared_error (train.train),85407080.0,75297510.0,76437480.0,80459580.0,79321510.0,79384630.0
mean_squared_error (train.validation),735456100.0,876638600.0,1082012000.0,1280156000.0,673392400.0,929531000.0
r2_score (train.train),98.6,98.82,98.81,98.7,98.78,98.742
r2_score (train.validation),89.69,85.56,81.49,81.24,88.02,85.2


metric,model 1,model 2,model 3,model 4,model 5,mean
mean_squared_error (train),215416900.0,235565700.0,277552400.0,320398800.0,198135700.0,249413900.0
r2_score (train),96.58,96.26,95.6,94.92,96.86,96.044


This is the end of the flow
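The k-fold procedure behind the per-fold rows can be sketched without any library. The generator below is a hypothetical helper. It uses contiguous, unshuffled folds, which is an assumption suggested by the "shuffle is not provided" message rather than something the notebook confirms. The last lines recompute the mean validation R² from the per-fold values in the table.

```python
def kfold_indices(n_rows, fold_nr=5):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // fold_nr + (1 if i < n_rows % fold_nr else 0)
                  for i in range(fold_nr)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


folds = list(kfold_indices(12, fold_nr=5))
for train, validation in folds:
    print(len(train), validation)

# Validation R^2 per fold, copied from the table above.
validation_r2 = [89.69, 85.56, 81.49, 81.24, 88.02]
mean_validation_r2 = sum(validation_r2) / len(validation_r2)
print(round(mean_validation_r2, 3))
```

Each row is used for validation exactly once, so the mean over folds is a less split-dependent estimate than a single hold-out score.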


In [16]:
parameters_xgboost = {
    "data": {
        "train": {"features": train_dataframe, "target": train_target.to_numpy()},
    },
    "split": {
        "method": "kfold",  # alternative: "split" together with "split_ratios", e.g. 0.3 or (0.3, 0.2)
        "fold_nr": 5,
    },
    "model": {"type": "xgboost",
              "hyperparameters": {'max_depth': 5, 'eta': 1, 'eval_metric': "rmse"}
              },
    "metrics": ["r2_score", "mean_squared_error"],
    "predict": {
        "test": {"features": test_dataframe}
    }
}

In [17]:
model_index_list, save_models_dir, y_test = flow.training(parameters_xgboost)

shuffle is not provided: 'shuffle'
random_state is not provided: 'random_state'
The objective is not defined. The default value is reg:squarederror. Error: 'objective'
fold_nr. 1
The num_round is not defined. The default value is num_round = 10. Error: 'num_round'
[0]	train-rmse:40632.8	test-rmse:51489.7
[1]	train-rmse:31553	test-rmse:46632.6
[2]	train-rmse:28335.7	test-rmse:44061.5
[3]	train-rmse:25816.4	test-rmse:45387.2
[4]	train-rmse:23930.6	test-rmse:45457.9
[5]	train-rmse:22649.7	test-rmse:44379.5
[6]	train-rmse:21681.1	test-rmse:44091.1
[7]	train-rmse:20877.6	test-rmse:44202.4
[8]	train-rmse:19857.5	test-rmse:44549.4
[9]	train-rmse:18202.9	test-rmse:44142.4
{'r2_score (train.train)': 94.56, 'mean_squared_error (train.train)': 331344493.8129808, 'r2_score (train.validation)': 72.68, 'mean_squared_error (train.validation)': 1948553713.7916}
fold_nr. 2
The num_round is not defined. The default value is num_round = 10. Error: 'num_round'
[0]	train-rmse:43017.9	test-rmse:43932.9
[1]	

metric,fold_1,fold_2,fold_3,fold_4,fold_5,mean
mean_squared_error (train.train),331344500.0,301758700.0,285372900.0,335591200.0,364377000.0,323688900.0
mean_squared_error (train.validation),1948554000.0,1841913000.0,1789568000.0,1660958000.0,1451202000.0,1738439000.0
r2_score (train.train),94.56,95.26,95.55,94.57,94.38,94.864
r2_score (train.validation),72.68,69.65,69.38,75.66,74.18,72.31


metric,model 1,model 2,model 3,model 4,model 5,mean
mean_squared_error (train),654786300.0,609789600.0,586212000.0,600664500.0,581742000.0,606638900.0
r2_score (train),89.62,90.33,90.71,90.48,90.78,90.384


This is the end of the flow
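The three runs can be compared by placing their train and validation R² values side by side. The numbers below are copied from the tables in this notebook. The Ridge figures come from a single hold-out split while the other two are 5-fold means, so the comparison is only indicative. The dictionary layout and variable names are illustrative choices.

```python
# R^2 values (in percent) copied from the result tables above.
results = {
    "Ridge linear regression": {"train": 92.44, "validation": 85.67},
    "lightgbm": {"train": 98.742, "validation": 85.2},
    "xgboost": {"train": 94.864, "validation": 72.31},
}

# A large train-validation gap is a sign of overfitting.
gaps = {name: round(scores["train"] - scores["validation"], 3)
        for name, scores in results.items()}
best_model = max(results, key=lambda name: results[name]["validation"])

for name, gap in gaps.items():
    print(f"{name}: validation {results[name]['validation']}, gap {gap}")
print("highest validation R^2:", best_model)
```

On these numbers the XGBoost run shows the widest gap. That run used `eta` of 1 and the default of 10 boosting rounds, as its log reports.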
