# **First. Big Picture -🏔**

Many computer-based algorithms and models already exist for financial market trading, all attempting to predict returns. <br>
**Yet,** with new techniques and approaches, **data science could improve quantitative researchers' ability to forecast an investment's return.**

> Ubiquant is committed to creating long-term stable returns for investors.

In this competition, you’ll build **a model that forecasts an investment's return rate**. <br> 
Train and test your algorithm on historical prices. Top entries will solve this real-world data science problem with as much accuracy as possible.

# **Second. Problem definition -✏**

"This dataset contains features derived from real historic data from thousands of investments." <br>
**Your challenge is to predict the value of an obfuscated metric relevant for making trading decisions.**

- row_id - A unique identifier for the row.
- time_id - The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.
- investment_id - The ID code for an investment. Not all investments have data in all time IDs.
- **target - The target.**
- [f_0:f_299] - Anonymized features generated from market data.

The **performance metric** is the mean of the Pearson correlation coefficient, computed for each time ID.
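
As a rough illustration of this metric, the sketch below computes the Pearson correlation inside each time ID and averages the results. The function name `mean_pearson_by_time`, the `pred` column, and the toy values are assumptions made for this example; they are not part of the competition data or its official scorer.

```python
import numpy as np
import pandas as pd

def mean_pearson_by_time(df, pred_col="pred", target_col="target", time_col="time_id"):
    """Average the per-time_id Pearson correlation between predictions and targets."""
    scores = []
    for _, group in df.groupby(time_col):
        # Pearson's r is undefined for fewer than two rows or a constant column, so skip those.
        if len(group) < 2 or group[pred_col].std() == 0 or group[target_col].std() == 0:
            continue
        scores.append(np.corrcoef(group[pred_col], group[target_col])[0, 1])
    return float(np.mean(scores))

# Toy data (made up): time_id 0 is perfectly correlated, time_id 1 perfectly anti-correlated.
toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 1],
    "target":  [0.1, 0.2, 0.3, 0.3, 0.2, 0.1],
    "pred":    [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})
score = mean_pearson_by_time(toy)
print(score)
```

Because the correlation is taken within each time ID, a model can score well overall only if it ranks investments sensibly at every time step, not just on average across the whole dataset.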


# **Third. Data & Import**

In [None]:
import numpy as np
import pandas as pd
import gc
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow import keras
from scipy import stats
from pathlib import Path
import seaborn as sns

**Reading as Parquet Low Memory (Fast & Low Mem Use)**
https://www.kaggle.com/robikscube/fast-data-loading-and-low-mem-with-parquet-files

In [None]:
%%time
n_features = 300
features = [f'f_{i}' for i in range(n_features)]
train = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')

In [None]:
start_mem = train.memory_usage().sum() / 1024**2

def decreasing_train(train):
    for col in train.columns:
        col_type = train[col].dtype

        if col_type != object:
            c_min = train[col].min()
            c_max = train[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    train[col] = train[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    train[col] = train[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    train[col] = train[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    train[col] = train[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    train[col] = train[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    train[col] = train[col].astype(np.float32)
                else:
                    train[col] = train[col].astype(np.float64)
        else:
            train[col] = train[col].astype('category')
    return train

train = decreasing_train(train)
end_mem = train.memory_usage().sum() / 1024**2
print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
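
One caveat with the helper above: casting to `float16` is lossy, since that type keeps only about three significant decimal digits. A more conservative sketch, assuming that precision matters more than the last bit of memory savings, lets `pd.to_numeric` choose the dtype; its `downcast="float"` option does not go below `float32`. The helper name `downcast_numeric` and the toy frame are illustrative only.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy whose numeric columns use the smallest dtype pandas considers safe."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # downcast="float" stops at float32, which avoids float16 rounding.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame (made up) standing in for the real training data.
toy = pd.DataFrame({
    "investment_id": np.array([1, 2, 3000], dtype=np.int64),
    "f_0": np.array([0.5, -1.25, 2.0], dtype=np.float64),
})
small = downcast_numeric(toy)
print(small.dtypes)
```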

# **Fourth. Take a look and Split test data -🙄**

In [None]:
display(train.info())
display(train.head())

In [None]:
for i in ['investment_id', 'time_id']:
    print(f'------------------{i} / value counts------------------')
    display(train[i].value_counts())

In [None]:
train.head()

In [None]:
train[['investment_id', 'time_id']].hist(bins=50, figsize=(10,5))
plt.show()

**time_id values around 380-410** look unusual in the histogram, and the number of rows per time_id tends to grow as time_id increases.

# Split Test data <br>
We split off a test set using stratified sampling on time_id <br> to prevent **sampling bias**, so each time_id keeps roughly the same share of rows in both sets.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(train, train['time_id']):
    # split() yields positional indices, so select rows with iloc
    train_set = train.iloc[train_index]
    test_set = train.iloc[test_index]

In [None]:
test_x = test_set.drop(['target', 'row_id'], axis=1).copy()
test_target = test_set['target'].copy()

In [None]:
display(train_set['time_id'].value_counts() / len(train_set))
display(test_set['time_id'].value_counts() / len(test_set))
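
The two tables above can be hard to compare by eye. As a minimal sketch, the helper below reduces the comparison to one number: the largest absolute difference in a category's share between two frames. The name `max_share_gap` and the toy frames are assumptions for illustration.

```python
import pandas as pd

def max_share_gap(train_part, test_part, col="time_id"):
    """Largest absolute difference in the share of any `col` value between two frames."""
    train_share = train_part[col].value_counts(normalize=True)
    test_share = test_part[col].value_counts(normalize=True)
    # Align on the union of values; a value missing from one side counts as a share of 0.
    gap = train_share.subtract(test_share, fill_value=0).abs()
    return float(gap.max())

# Toy frames (made up): the first pair is perfectly stratified, the second is not.
toy_train = pd.DataFrame({"time_id": [0, 0, 0, 0, 1, 1, 1, 1]})
balanced_test = pd.DataFrame({"time_id": [0, 1]})
skewed_test = pd.DataFrame({"time_id": [0, 0, 0, 1]})

print(max_share_gap(toy_train, balanced_test))
print(max_share_gap(toy_train, skewed_test))
```

A value close to zero for `max_share_gap(train_set, test_set)` would suggest the stratified split preserved the time_id distribution.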

In [None]:
del train
del test_set

# **Fifth. EDA & Visualization -📊**

In [None]:
ubiquant = train_set.copy()

1) Check time_id

In [None]:
time_count = ubiquant['time_id'].groupby(ubiquant['investment_id']).count()
time_count.plot(kind='hist', bins=25, grid=True, title='time_count')
plt.show()

time_mean = ubiquant['time_id'].groupby(ubiquant['investment_id']).mean()
time_mean.plot(kind='hist', bins=25, grid=True, title='time_mean')
plt.show()

time_std = ubiquant['time_id'].groupby(ubiquant['investment_id']).std()
time_std.plot(kind='hist', bins=25, grid=True, title='time_std')
plt.show()

del time_count
del time_mean
del time_std

2) Scatter plot

In [None]:
from pandas.plotting import scatter_matrix

attri = ['investment_id', 'time_id', 'f_0', 'f_1']
scatter_matrix(ubiquant[attri], figsize = (12,8))

3) Check Outlier

In [None]:
investment_count = ubiquant.groupby(['investment_id'])['target'].count()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_count.plot.hist(bins=60, color = 'blue', alpha = 0.4)
plt.title("Count of investment by target")
plt.show()

investment_mean = ubiquant.groupby(['investment_id'])['target'].mean()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_mean.plot.hist(bins=60, color = 'blue', alpha = 0.4)
plt.title("Mean of investment by target")
plt.show()

ax = sns.jointplot(x=investment_count, y=investment_mean, kind='reg',
                  height=8, color = 'blue')
ax.ax_joint.set_xlabel('observations')
ax.ax_joint.set_ylabel('mean target')
plt.show()

# **Sixth. Feature Engineering -🛠**

1) Make label

In [None]:
train_x = train_set.drop(['target', 'row_id'], axis=1).copy()
train_target = train_set['target'].copy()
display(train_x.head())
train_target.head()

2) Remove outlier

In [None]:
# Step 2.
# Keep only investments whose mean target has an absolute value below 0.15
keep_id = investment_mean.reset_index(name='mean')
keep_id = keep_id[abs(keep_id['mean']) < 0.15]
keep_id = keep_id['investment_id'].tolist()

# Drop rows that belong to the outlier investments
remove_df = train_set[train_set['investment_id'].isin(keep_id)].copy()
remove_df
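
To make the filter above easier to follow, here is the same idea on a tiny made-up frame: compute the mean target per investment, then keep only the investments whose mean lies inside the threshold. The toy values and the names `mean_by_id`, `keep_ids`, and `filtered` are illustrative; only the 0.15 threshold comes from the cell above.

```python
import pandas as pd

# Toy data (made up): investment 2 has a mean target far from zero.
toy = pd.DataFrame({
    "investment_id": [1, 1, 2, 2, 3, 3],
    "target": [0.10, -0.10, 0.50, 0.30, -0.05, 0.15],
})

threshold = 0.15
mean_by_id = toy.groupby("investment_id")["target"].mean()
# Index of investments whose mean target is within the threshold.
keep_ids = mean_by_id[mean_by_id.abs() < threshold].index
filtered = toy[toy["investment_id"].isin(keep_ids)]
print(filtered)
```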

In [None]:
# Step 3.
investment_count = remove_df.groupby(['investment_id'])['target'].count()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_count.plot.hist(bins=60, color = 'blue', alpha = 0.4)
plt.title("Count of investment by target")
plt.show()

investment_mean = remove_df.groupby(['investment_id'])['target'].mean()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_mean.plot.hist(bins=60, color = 'blue', alpha = 0.4)
plt.title("Mean of investment by target")
plt.show()

ax = sns.jointplot(x=investment_count, y=investment_mean, kind='reg',
                  height=8, color = 'blue')
ax.ax_joint.set_xlabel('observations')
ax.ax_joint.set_ylabel('mean target')
plt.show()

3) Scaling & Simple pipeline <br>
The features f_0 to f_299 appear to be on a similar scale, so we skip scaling.
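
One way to check the claim about scale is to summarize the per-feature mean and standard deviation and look at their ranges. The sketch below does this on random stand-in data; the helper name `scale_summary` and the toy frame are assumptions, and on the real data the `columns` argument would be the `features` list.

```python
import numpy as np
import pandas as pd

def scale_summary(df, columns):
    """Range of per-column means and standard deviations over the given columns."""
    table = df[columns].agg(["mean", "std"]).T  # one row per feature
    return {
        "mean_min": float(table["mean"].min()),
        "mean_max": float(table["mean"].max()),
        "std_min": float(table["std"].min()),
        "std_max": float(table["std"].max()),
    }

# Stand-in data (made up): three standard-normal columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.standard_normal((1000, 3)), columns=["f_0", "f_1", "f_2"])
summary = scale_summary(toy, ["f_0", "f_1", "f_2"])
print(summary)
```

If the means and standard deviations fall in a narrow band, skipping a scaler is reasonable. Tree-based models such as LightGBM are also largely insensitive to monotonic rescaling of the inputs.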

In [None]:
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler

# f_num_pipeline = Pipeline([
#     ('std_scaler', StandardScaler())
# ])

# ubi_f_pipe = f_num_pipeline.fit_transform(train_set[features])

In [None]:
del train_set

# **Seventh. Modeling & Training -🗡**

In [None]:
import lightgbm
import xgboost

train_ds = lightgbm.Dataset(train_x, label = train_target) 
val_ds = lightgbm.Dataset(test_x, label = test_target) 
params = {'learning_rate': 0.01, 
          'max_depth': 5, 
          'objective': 'regression', 
          'metric': 'mse', 
          'is_training_metric': True, 
          'num_leaves': 144}
model = lightgbm.train(params, train_ds, num_boost_round=85, valid_sets=[val_ds])

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

prediction = model.predict(test_x)
mse = mean_squared_error(test_target, prediction)
print(f'model mse is {mse}')

KFold cross-validation

In [None]:
%%time
from sklearn.model_selection import KFold
params = {'learning_rate': 0.01, 
          'max_depth': 5, 
          'objective': 'regression', 
          'metric': 'mse', 
          'is_training_metric': True, 
          'num_leaves': 144}
kfold = KFold(n_splits=5)
models = []
print('start')

for train_indices, valid_indices in kfold.split(train_x):
    print('start')
    # Use fold-specific names so train_x and train_target are not overwritten or deleted
    fold_x, val_x = train_x.iloc[train_indices], train_x.iloc[valid_indices]
    fold_y, val_y = train_target.iloc[train_indices], train_target.iloc[valid_indices]
    train_ds = lightgbm.Dataset(fold_x, label=fold_y)
    val_ds = lightgbm.Dataset(val_x, label=val_y)
    print('middle')
    model = lightgbm.train(params, train_ds, num_boost_round=100, valid_sets=[val_ds])
    models.append(model)
    print('finished')
    # Pearson correlation on the validation fold, in line with the competition metric
    pearson_score = stats.pearsonr(model.predict(val_x).ravel(), val_y.values)[0]
    print('Pearson:', pearson_score)
    # Free the fold-level objects before the next iteration
    del fold_x, val_x, fold_y, val_y, train_ds, val_ds
    gc.collect()
    break  # only the first fold is trained; remove this line to train all five

# **Eighth. Tuning -🎹**

In [None]:
# from sklearn.model_selection import GridSearchCV
# from lightgbm import LGBMRegressor
# LGB = LGBMRegressor()

# lgb_param_grid = {
#     'learning_rate': [1,0.1,0.01,0.001],
#     'n_estimators': [50, 100, 200, 500, 1000,5000], 
#     'max_depth': [15,20,25],
#     'num_leaves': [50, 100, 200],
#     'min_split_gain': [0.3, 0.4],
# }
# gsLGB = GridSearchCV(LGB,param_grid = lgb_param_grid, cv=5, scoring="neg_mean_squared_error", n_jobs= 4, verbose = 1)
# gsLGB.fit(train_x, train_target)
# LGB_best = gsLGB.best_estimator_

# print('Best hyperparameters: ', gsLGB.best_params_)
# print('Best CV score (negative MSE): {:.4f}'.format(gsLGB.best_score_))

# **Submission -⛷**

In [None]:
def inference(models, ds):
    y_preds = []
    for model in models:
        y_pred = model.predict(ds)
        y_preds.append(y_pred)
    return np.mean(y_preds, axis=0)
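
The averaging step in `inference` can be tried without any trained booster. The sketch below uses a hypothetical `ConstantModel` class that only mimics the `predict` call, and `average_predictions` is an illustrative restatement of the same logic; neither name belongs to LightGBM or to the competition API.

```python
import numpy as np

class ConstantModel:
    """Hypothetical stand-in exposing the same predict() call as a trained model."""
    def __init__(self, value):
        self.value = value

    def predict(self, data):
        # One prediction per input row, all equal to the stored constant.
        return np.full(len(data), self.value, dtype=float)

def average_predictions(models, data):
    """Element-wise mean of the predictions from several models."""
    return np.mean([m.predict(data) for m in models], axis=0)

rows = np.zeros((4, 2))  # four dummy rows with two dummy features
blended = average_predictions([ConstantModel(0.2), ConstantModel(0.4)], rows)
print(blended)
```

Averaging along `axis=0` keeps one value per row, so the result has the same length as the input regardless of how many fold models are in the list.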

In [None]:
import ubiquant
env = ubiquant.make_env()
iter_test = env.iter_test() 
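for (test_df, sample_prediction_df) in iter_test:
    # row_id has the form '<time_id>_<investment_id>'; recover time_id from it
    time_df = test_df.row_id.str.split('_').str[0].astype(int)
    test_df.drop(['row_id'], axis=1, inplace=True)
    test_df['time_id'] = time_df
    # Reorder columns to the order used in training so predictions do not depend on column position
    test_df = test_df[models[0].feature_name()]
    sample_prediction_df['target'] = inference(models, test_df)
    env.predict(sample_prediction_df) 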