# Overview
- The goal of the competition is to predict a target value (an obfuscated metric of an investment's return rate) from:
    - 300 features (f_0 to f_299) anonymized by the hosts; probably not only the names but also the values themselves have been processed.
    - Different investments; again, we only know their IDs.
    - Observations of these investments taken at different times:
        - Time IDs are in order, but the time between them can vary.
        - A time ID doesn't necessarily include all investments.
    - From the competition page we have this:
> In this competition, you’ll build a model that forecasts an investment's return rate. Train and test your algorithm on historical prices.
        - Not sure if this is just a general use of the term "historical prices" or a hint about the features...
- In the test set we expect to see roughly one million instances (info shared by the hosts).
    - We also know the test observations will be taken after the training observation period.
    - This is a code competition, so submissions are made from inside a Kaggle notebook.
        - These submissions are evaluated through the time-series API; the hosts kindly reminded us about the memory constraints of this approach.
        - The evaluation metric is the mean of the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) computed for each time ID.
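
To make the metric concrete, here is a minimal sketch of how it could be computed locally. The `pred` column and the toy values are made up for illustration, and this is my reading of the metric description, not the hosts' scoring code.

```python
import pandas as pd

def mean_pearson_by_time(frame, pred_col="pred", target_col="target", time_col="time_id"):
    """Average the per-time_id Pearson correlation between predictions and targets."""
    per_time = frame.groupby(time_col)[[pred_col, target_col]].apply(
        lambda g: g[pred_col].corr(g[target_col])
    )
    return per_time.mean()

# Toy data (hypothetical): two time IDs with three investments each.
toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 1],
    "target":  [0.1, 0.2, 0.3, 0.3, 0.2, 0.1],
    "pred":    [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})

# Predictions rank perfectly in time 0 and are exactly reversed in time 1,
# so the two per-time correlations cancel out.
score = mean_pearson_by_time(toy)
```

Because each time ID gets equal weight, a time ID with few investments counts as much as a crowded one.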
        
## Motivation
The main motivation behind this notebook is to gain insight into the data and find underlying patterns in the anonymized variables that could help with building predictive models. Since we have a pretty big dataset, I believe jumping right into modelling with complex models would be a waste of resources and time. Doing EDA first could give us an advantage in the modelling part.

# Exploratory Data Analysis

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datatable as dt
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

from sklearn.cluster import KMeans

import plotly.express as px

from scipy import stats
from scipy.stats import norm, skew, kurtosis

import gc
plt.style.use('ggplot')

cust_color = [
    '#EDC7B7',
    '#EEE2DC',
    '#BAB2B5',
    '#123C69',
    '#AC3B61'
]

plt.rcParams['figure.figsize'] = (18,14)
plt.rcParams['figure.dpi'] = 300
plt.rcParams["axes.grid"] = True
plt.rcParams["grid.color"] = cust_color[0]
plt.rcParams["grid.alpha"] = 0.5
plt.rcParams["grid.linestyle"] = '--'
plt.rcParams["font.family"] = "monospace"

plt.rcParams['axes.edgecolor'] = 'black'
plt.rcParams['figure.frameon'] = False
plt.rcParams['axes.spines.left'] = True
plt.rcParams['axes.spines.bottom'] = True
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.linewidth'] = 1.0

import warnings
warnings.filterwarnings("ignore")



In [None]:
df = dt.fread('../input/ubiquant-market-prediction/train.csv').to_pandas()

## First Look

We have train, test and submission CSV files; let's start by looking at the first rows of the train data.

In [None]:
df.head()

Number of instances:

In [None]:
print(f'Train df number of instances: {df.shape[0]}')

Missing Values:

In [None]:
print(f'Train df missing value count: {df.isna().sum().sum()}')

Investments:

In [None]:
print(f'Train df number of unique investments: {df.investment_id.nunique()}')

Time IDs:

In [None]:
print(f'Train df number of unique time_id\'s: {df.time_id.nunique()}')

Just one more thing, since we are working with the full data...

In [None]:
time_count=df.groupby("investment_id")['time_id'].count()
fig, ax = plt.subplots(figsize=(12,9))
sns.histplot(time_count, color=cust_color[-1], kde=True)
plt.title('Number of time_id\'s per Investment Distribution')
plt.show()

It seems most of the investment IDs have 800+ time_id records, while quite a number of them have far fewer, which gives a left-skewed distribution...

# Random Sampling

Since we have a high number of instances (3141410), let's take a random sample that represents the actual population. This lets us explore the data faster...


In [None]:
sampled_df = df.sample(frac=0.05, random_state=42)

In [None]:
# from statsmodels.stats.weightstats import ztest
# diff = np.mean(df.target) - np.mean(sampled_df.target)
# t, p = ztest(df.target, x2=sampled_df.target, value=diff)
# (np.nanmean(sampled_df.target) - np.nanmean(df.target)) / df.target.std()
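
One way to check that a 5% sample still looks like the population is to compare a standardized mean difference and a two-sample Kolmogorov-Smirnov statistic. The sketch below uses synthetic data as a stand-in for `df.target`; since the sample is drawn from the population itself, the p-value is only a rough guide.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Synthetic stand-in for the full target column (assumption, not competition data).
population = pd.Series(rng.standard_normal(200_000), name="target")
sample = population.sample(frac=0.05, random_state=42)

# Difference of means expressed in population standard deviations.
std_mean_diff = (sample.mean() - population.mean()) / population.std()

# KS statistic: largest gap between the two empirical CDFs.
ks_stat, p_value = ks_2samp(sample, population)

print(f"standardized mean diff: {std_mean_diff:.4f}")
print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.3f}")
```

Small values for both numbers suggest the sample is a reasonable proxy for quick plots.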

In [None]:
del df
gc.collect()

Converting the features to "float16" to save some memory.

In [None]:
features = [f'f_{i}' for i in range(300)]

for f in features:
    sampled_df[f] = sampled_df[f].astype('float16')
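
To get a feel for the trade-off, this small sketch measures the memory saved and the rounding error introduced by a float16 cast. It runs on a made-up frame, assuming the original columns are float64.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a handful of feature columns (assumption).
cols = [f"f_{i}" for i in range(5)]
demo = pd.DataFrame(rng.standard_normal((1_000, 5)), columns=cols)

bytes_before = demo.memory_usage(index=False).sum()
demo16 = demo.astype("float16")
bytes_after = demo16.memory_usage(index=False).sum()

# Largest absolute rounding error introduced by the cast.
max_abs_error = (demo - demo16.astype("float64")).abs().to_numpy().max()

print(f"{bytes_before} bytes -> {bytes_after} bytes, max error {max_abs_error:.5f}")
```

The memory drops to a quarter, at the cost of roughly three significant decimal digits, which is usually fine for plots but worth keeping in mind for sums over many rows.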

# Target Distribution

In [None]:
def plot_dist3(df, feature, title):
    
    # Creating a customized chart. and giving in figsize and everything.
    
    fig = plt.figure(constrained_layout=True)
    
    # creating a grid of 3 cols and 2 rows.
    
    grid = gridspec.GridSpec(ncols=3, nrows=2, figure=fig)

    # Customizing the histogram grid.
    
    ax1 = fig.add_subplot(grid[0, :2])
    
    # Set the title.
    
    ax1.set_title('Histogram')
    
    # plot the histogram.
    
    sns.distplot(df.loc[:, feature],
                 hist=True,
                 kde=True,
                 fit=norm,
                  hist_kws={
                 'rwidth': 0.85,
                 'edgecolor': 'black',
                 'linewidth':.5,
                 'alpha': 0.8},
                 ax=ax1,
                 color=cust_color[-1])
    
    ax1.axvline(df.loc[:, feature].mean(), color='Green', linestyle='dashed', linewidth=3)

    min_ylim, max_ylim = plt.ylim()
    ax1.text(df.loc[:, feature].mean()*2, max_ylim*0.95, 'Mean: {:.2f}'.format(df.loc[:, feature].mean()), color='Green', fontsize='12',
             bbox=dict(boxstyle='round',facecolor='red', alpha=0.5))
    ax1.legend(labels=['Actual','Normal'])
    ax1.xaxis.set_major_locator(MaxNLocator(nbins=12))
    
    ax2 = fig.add_subplot(grid[1, :2])
    
    # Set the title.
    
    ax2.set_title('Probability Plot')
    
    # Plotting the QQ_Plot.
    stats.probplot(df.loc[:, feature],
                   plot=ax2)
    ax2.get_lines()[0].set_markerfacecolor('#e74c3c')
    ax2.get_lines()[0].set_markersize(12.0)
    ax2.xaxis.set_major_locator(MaxNLocator(nbins=16))

    # Customizing the Box Plot:
    
    ax3 = fig.add_subplot(grid[:, 2])
    # Set title.
    
    ax3.set_title('Box Plot')
    
    # Plotting the box plot.
    
    sns.boxplot(y=feature, data=df, ax=ax3, color=cust_color[-1])
    ax3.yaxis.set_major_locator(MaxNLocator(nbins=24))
    #ax3.set_ylim(0,clip_value)

    plt.suptitle(f'{title}', fontsize=24, fontname = 'monospace', weight='bold')

In [None]:
plot_dist3(sampled_df, 'target', 'Target Distribution')

The target has a decent distribution centered around 0 with a peak in the middle. We can notice long tails indicating some outliers, which the other plots confirm. The mean is almost 0, which strongly suggests the target was standardized.
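
The "long tails" impression can be put into numbers with excess kurtosis and the share of points beyond three standard deviations. This is only a sketch on a synthetic heavy-tailed series (Student-t with 5 degrees of freedom), chosen as a stand-in for the target.

```python
import numpy as np
from scipy.stats import kurtosis, norm

rng = np.random.default_rng(1)

# Synthetic heavy-tailed stand-in for the target (assumption).
values = rng.standard_t(df=5, size=100_000)
z = (values - values.mean()) / values.std()

excess_kurt = kurtosis(z)            # 0 for a normal distribution
tail_share = np.mean(np.abs(z) > 3)  # observed share beyond 3 sigma
normal_share = 2 * norm.sf(3)        # share a normal distribution would give

print(f"excess kurtosis: {excess_kurt:.2f}")
print(f"beyond 3 sigma: {tail_share:.4%} (normal: {normal_share:.4%})")
```

Running the same three lines on `sampled_df['target']` would show how far the real target is from normal tails.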

# Some 'Odd' Feature Distributions

These are the top features (top 20 to be exact) whose distributions don't perfectly fit the "normal" standard. We might find something useful for our models by looking at them.

In [None]:
features_std = sampled_df.iloc[:,4:].apply(lambda x: np.std(x)).sort_values(
    ascending=False)
f_std = sampled_df[features_std.iloc[:20].index.tolist()]

# Absolute skewness per feature; non-finite results are dropped explicitly
# instead of relying on the deprecated 'mode.use_inf_as_na' option.
features_skew = sampled_df.iloc[:,4:].apply(lambda x: np.abs(skew(x))).replace(
    [np.inf, -np.inf], np.nan).dropna().sort_values(ascending=False)
skewed = sampled_df[features_skew.iloc[:20].index.tolist()]

# Same idea for absolute excess kurtosis.
features_kurt = sampled_df.iloc[:,4:].apply(lambda x: np.abs(kurtosis(x))).replace(
    [np.inf, -np.inf], np.nan).dropna().sort_values(ascending=False)
kurt_f = sampled_df[features_kurt.iloc[:20].index.tolist()]

In [None]:
def feat_dist(df, cols, rows=3, columns=3, title=None, figsize=(30, 25)):
    
    '''Plot the KDE of each column in cols together with a fitted normal curve'''
    
    fig, axes = plt.subplots(rows, columns, figsize=figsize, constrained_layout=True)
    axes = axes.flatten()

    for i, j in zip(cols, axes):
        sns.distplot(
                    df[i],
                    ax=j,
                    fit=norm,
                    hist=False,
                    color=cust_color[-1],
                    kde_kws={'linewidth':3}
        )   
        
        (mu, sigma) = norm.fit(df[i])
        # Raw string so that \mu and \sigma are not treated as escape sequences.
        j.set_title(r'Dist of {0} Norm Fit: $\mu=${1:.2g}, $\sigma=${2:.2f}'.format(i, mu, sigma), weight='bold')
        j.legend(labels=[f'{i}', 'Normal Dist'])
    fig.suptitle(f'{title}', fontsize=24, weight='bold')

In [None]:
feat_dist(sampled_df, f_std.columns.tolist(), rows=2, columns=4, title='Distribution of High Std Features', figsize=(30, 8))

We can see only one or two high-std features (f_170 and f_124); the rest are just about the same, which suggests the hosts preprocessed the data...

In [None]:
# Creating distplots of the features with high skewness

feat_dist(sampled_df, skewed.columns.tolist(), rows=5, columns=4, title='Distribution of Skewed Features')

We can see many features with asymmetric tails above.

In [None]:
# Creating distplots of the features with high kurtosis

feat_dist(sampled_df, kurt_f.columns.tolist(), rows=5, columns=4, title='Distribution of High Kurtosis Features')

Above we can see some features with sharp, tall central peaks and long tails.

Looks like our features are standardized too, just like the target...

# Feature Target Correlation

In [None]:
# Correlate only the f_* features (the `features` list defined earlier) with the target;
# slicing the full-frame result with .iloc[:-1] would drop f_299 instead of the target.
correlations = sampled_df[features].corrwith(sampled_df['target']).to_frame()
correlations['Abs Corr'] = correlations[0].abs()
sorted_correlations = correlations.sort_values('Abs Corr', ascending=False)['Abs Corr']
fig, ax = plt.subplots(figsize=(6,8))
sns.heatmap(sorted_correlations[sorted_correlations>=.04].to_frame(), cmap='inferno', annot=True, vmin=-1, vmax=1, ax=ax)
plt.title('Feature Correlations With Target')
plt.show()

Almost no linear correlation between the features and the target... Of course that doesn't mean they're useless; this view leaves out lots of aspects of the data.
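
As a reminder of why a near-zero Pearson correlation doesn't rule out a useful feature, here is a small synthetic sketch (not competition data) where the target depends strongly on a feature, just not linearly.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)

# Synthetic example (assumption): y depends on x through x**2 plus a little noise.
x = rng.standard_normal(50_000)
y = x ** 2 + 0.1 * rng.standard_normal(50_000)

linear_r = pearsonr(x, y)[0]       # close to 0: no linear relationship
abs_r = pearsonr(np.abs(x), y)[0]  # large once the feature is transformed

print(f"corr(x, y)   = {linear_r:.3f}")
print(f"corr(|x|, y) = {abs_r:.3f}")
```

Tree-based models or simple transformations can pick up this kind of relationship even when the heatmap looks empty.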

# Correlation Between Features

In [None]:
corr = sampled_df.iloc[:, 4:].corr()
sns.clustermap(corr, metric="correlation", cmap="inferno", figsize=(20, 20))
plt.suptitle('Correlations Between Features', fontsize=24, weight='bold')
plt.show()


We can see some strong correlations between features, and looking at the dendrograms we can clearly see them falling into clusters...

# Most Correlated Feature Pairs

In [None]:
corr = corr.abs()

corrs = corr.unstack()
pair = corrs.sort_values(ascending=False)
pair = pair.reset_index(name='correlation').rename(columns={'level_0': 'feature_a', 'level_1': 'feature_b', 0: 'correlation'})
pair = pair[pair['feature_a'] != pair['feature_b']].iloc[::2,:]
pair = pair[:10]
pair

This just confirms our suspicion that there are some strongly correlated features, and we can see which pairs are the most correlated. Let's take a closer look at the most correlated pair, f_262 and f_228:
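
Taking every second row of the sorted pairs works as long as the two mirrored entries of a pair sit next to each other. A slightly more defensive sketch keeps only the upper triangle of the correlation matrix, so each pair appears exactly once. The helper name `top_correlated_pairs` and the demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr, n=10):
    """Return the n feature pairs with the largest absolute correlation."""
    abs_corr = corr.abs()
    # Strict upper triangle: drops the diagonal and the mirrored duplicates.
    mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
    pairs = abs_corr.where(mask).stack().dropna()
    pairs.index.names = ["feature_a", "feature_b"]
    return pairs.sort_values(ascending=False).head(n).reset_index(name="correlation")

# Synthetic demo (assumption): f_b is a noisy copy of f_a, f_c is independent.
rng = np.random.default_rng(3)
base = rng.standard_normal(2_000)
demo = pd.DataFrame({
    "f_a": base,
    "f_b": base + 0.1 * rng.standard_normal(2_000),
    "f_c": rng.standard_normal(2_000),
})
top_pairs = top_correlated_pairs(demo.corr(), n=2)
print(top_pairs)
```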

In [None]:
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally.
sns.jointplot(x=sampled_df[pair['feature_a'].iloc[0]], y=sampled_df[pair['feature_b'].iloc[0]], kind="reg", color=cust_color[0], height=8,
              joint_kws={'scatter_kws':dict(alpha=0.5, edgecolor="r", linewidth=0.5)})
plt.show()

Yeah, we can see a strong positive correlation between these two. Let's take a look at the general picture with hexbins; since there are too many points for a scatter plot, this method can give us a better picture:

In [None]:
def hex_plot(df, rows=3, columns=3, title=None):
    
    '''Hexbin plots of the most correlated feature pairs (read from the global `pair` frame)'''
    
    fig, axes = plt.subplots(rows, columns, figsize=(30, 25), constrained_layout=True)
    axes = axes.flatten()

    for i,j in enumerate(axes):
        # Use the frame passed in rather than the global sampled_df.
        j.hexbin(df[pair['feature_a'].iloc[i]], df[pair['feature_b'].iloc[i]],  gridsize=100, cmap='inferno', bins='log')
        j.set_xlabel(pair['feature_a'].iloc[i])
        j.set_ylabel(pair['feature_b'].iloc[i])

    fig.suptitle(f'{title}', fontsize=24, weight='bold')

In [None]:
hex_plot(sampled_df, rows=5, columns=2, title='Highly Correlated Features')

We can clearly see strong linear correlations between some features, either negative or positive. We should take a closer look at these variables to prevent multicollinearity while modelling...
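
One common way to deal with such pairs is to drop one feature from each highly correlated pair. The sketch below is just one possible approach; the helper `drop_correlated`, the 0.95 threshold and the demo frame are assumptions for illustration, not something tuned for this competition.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    abs_corr = frame.corr().abs()
    upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))
    # A column is dropped if it is too correlated with any column before it.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Synthetic demo (assumption): f_b nearly duplicates f_a.
rng = np.random.default_rng(11)
base = rng.standard_normal(2_000)
demo = pd.DataFrame({
    "f_a": base,
    "f_b": base + 0.1 * rng.standard_normal(2_000),
    "f_c": rng.standard_normal(2_000),
})
reduced, dropped = drop_correlated(demo)
print(dropped)
```

Whether dropping actually helps depends on the model; tree ensembles tend to tolerate correlated inputs better than linear models.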

# Dimension Reduction and Clusters

Since the data is anonymized and lacks categorical variables, we might want to look at some reduced-dimension plots and use some unsupervised techniques to see if we can find patterns.

In [None]:
features = sampled_df.iloc[:, 4:].columns.tolist()


pipe = Pipeline([('scaler', StandardScaler()),('pca', PCA())])
pipe.fit(sampled_df[features])
pca_samples = pipe.transform(sampled_df[features])

# explaining variance ratio:

fig, ax = plt.subplots(figsize=(14, 5))
plt.plot(range(sampled_df[features].shape[1]), pipe.named_steps['pca'].explained_variance_ratio_.cumsum(), linestyle='--', drawstyle='steps-mid', color=cust_color[-1],
         label='Cumulative Explained Variance', linewidth = 1.5)
sns.barplot(x=np.arange(1,sampled_df[features].shape[1]+1), y=pipe.named_steps['pca'].explained_variance_ratio_, alpha=0.85, color=cust_color[0],
            label='Individual Explained Variance', edgecolor='black', saturation = 2, linewidth = 0.5)

plt.ylabel('Explained Variance Ratio', fontsize = 14, fontname = 'monospace', weight='semibold')
plt.xlabel('Number of Principal Components', fontsize = 14, fontname = 'monospace', weight='semibold')
ax.set_title('Explained Variance', fontsize = 20, fontname = 'monospace', weight='bold')
plt.xticks(fontsize=8, rotation=90)
plt.legend(fontsize = 13)
plt.axis([0,99,0,1])
plt.show()

We do have many features, but it seems we cannot reduce dimensions without losing some signal. Even to explain 80% of the variance we might need about 100 principal components. Next, let's look at the component loadings for the first three principal components:
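
Instead of reading the number of components off the chart, it can be computed from the cumulative explained variance. A minimal sketch on synthetic data (20 features driven by 3 latent factors, an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (assumption).
latent = rng.standard_normal((2_000, 3))
X = latent @ rng.standard_normal((3, 20)) + 0.5 * rng.standard_normal((2_000, 20))

pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA())]).fit(X)
cumulative = pipe.named_steps["pca"].explained_variance_ratio_.cumsum()

# Index of the first component at which the cumulative ratio reaches 0.80, plus one.
n_components_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(f"components needed for 80% variance: {n_components_80}")
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, which selects this count directly.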

In [None]:
loadings = pd.DataFrame(pipe.named_steps['pca'].components_[0:3, :], columns=features)
maxPC = 1.01 * np.max(np.max(np.abs(loadings.loc[0:5, :])))

fig, axes = plt.subplots(3, 1, figsize=(12, 9))
for i, ax in enumerate(axes):
    pc_loadings = loadings.loc[i, :]    
    colors = [cust_color[0] if l > 0 else cust_color[-1] for l in pc_loadings]
    sns.barplot(x=pc_loadings.index, y=pc_loadings, ax=ax, palette=colors)
    ax.axhline(color='#888888')
    ax.set_ylabel(f'PC{i+1}')
    ax.set_ylim(-maxPC, maxPC)    
    ax.xaxis.set_tick_params(labelsize=3, rotation=90)
    
plt.suptitle("Component Loadings")
plt.tight_layout()

The loadings of the first component mostly share the same sign, which is typical when most of the variables share a common factor. Since this is investment data, it could reflect something like a bull run or a booming economy, but without the actual variable names we can't be sure. We also don't see a dominant loading in the first three principal components (these three only explain around 30% of the variance anyway), so we can't draw precise interpretations from the plots above. Still, they are useful for getting more insight into the features and how they behave together.

Let's try our luck with clustering; maybe we can fit some instances into specific clusters, which would let us break down the problem and inspect different groups individually. Let's see how many clusters we would need...

In [None]:
kmeans_per_k = [Pipeline([('scaler', StandardScaler()),('km', KMeans(n_clusters=k, random_state=42, max_iter=100, n_init=5, tol=1e-4))]).fit(sampled_df[features])
                for k in range(1, 8)]
inertias = [model.named_steps['km'].inertia_ for model in kmeans_per_k]

plt.figure(figsize=(6, 3))
sns.lineplot(x=list(range(1, 8)), y=inertias, color=cust_color[-2], linewidth = 1.5)
plt.xlabel("k", fontsize=15)
plt.ylabel("Inertia", fontsize=15)

plt.title('Inertias and n_clusters', fontname = 'monospace', weight='bold')
plt.show()

Hmm, doesn't look good... Anyway, the sharpest elbow is at k=2, but let's try k=4, which also has a somewhat decent bend.
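
When the elbow is this ambiguous, the silhouette score is another way to compare values of k. The sketch below uses three well-separated synthetic blobs (an assumption, nothing like the real features) just to show the mechanics.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three clearly separated groups (assumption).
X, _ = make_blobs(n_samples=600, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=5).fit_predict(X)
    # Mean silhouette: near 1 for compact, well-separated clusters.
    silhouettes[k] = silhouette_score(X, labels)

best_k = max(silhouettes, key=silhouettes.get)
print(silhouettes, best_k)
```

On a frame as large as the sample here, the `sample_size` argument of `silhouette_score` keeps the pairwise distance computation manageable.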

In [None]:
kmeans = Pipeline([('scaler', StandardScaler()),('km', KMeans(n_clusters=4, random_state=42, max_iter=100, tol=1e-4))]).fit(sampled_df[features])
# The pipeline is already fitted on the line above, so predict instead of refitting.
clusters = kmeans.predict(sampled_df[features])
clusters = [str(number) for number in clusters]

In [None]:
pipe = Pipeline([('scaler', StandardScaler()),('pca', PCA(n_components=2))])
pipe.fit(sampled_df[features])
pca_samples = pipe.transform(sampled_df[features])
sns.scatterplot(x=pca_samples[:,0], y=pca_samples[:,1], hue=clusters, palette=cust_color[1:5])
plt.title("Clusters on Reduced Dimension")
plt.show()

Well, we have clusters but they don't mean much yet... The clusters look pretty close to each other. Let's look at them another way, through the cluster centers in the 300-dimensional feature space...

In [None]:
centers = pd.DataFrame(kmeans.named_steps['km'].cluster_centers_, columns=features)
fig, axes = plt.subplots(4, 1, figsize=(12, 12))
for i, ax in enumerate(axes):
    center = centers.loc[i, :]
    maxPC = 1.01 * np.max(np.max(np.abs(center)))
    colors = [cust_color[0] if l > 0 else cust_color[-1] for l in center]
    ax.axhline(color='#888888')
    sns.barplot(x=center.index, y=center, ax=ax, palette=colors)
    ax.set_ylabel(f'Cluster {i}')
    ax.set_ylim(-maxPC, maxPC)
    ax.xaxis.set_tick_params(labelsize=3, rotation=90)
    
plt.suptitle("Centroid Coordinates")
plt.tight_layout()

We can see a pattern when we inspect the plot above. Most of the features diverge from the others (bars going down) in cluster 0, while the differences are a little harder to detect in the other clusters, especially since the feature names are not given. We could get more insight if we had feature names and domain knowledge. Either way, it reveals the nature of some clusters. Let's look at one more thing:

In [None]:
pipe = Pipeline([('scaler', StandardScaler()),('pca', PCA(n_components=4))])
pipe.fit(sampled_df[features])
pca_samples = pipe.transform(sampled_df[features])

total_var = pipe.named_steps['pca'].explained_variance_ratio_.sum() * 100

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pipe.named_steps['pca'].explained_variance_ratio_ * 100)
}
labels['color'] = 'Cluster'

fig = px.scatter_matrix(
    pca_samples,
    color=clusters,
    #symbol=clusters,
    dimensions=range(4),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}% by Clusters',
    opacity=0.5,
    color_discrete_sequence=cust_color[1:5]
)
fig.update_traces(diagonal_visible=False)
fig.show()

The picture doesn't change much when we plot the first four components against each other either. Oh well...

Next we should decide which method to use to get more insights with classical EDA techniques.

# Time

So far we have looked at many of the variables independently of time; let's see how things look once we include time.

Since we know the number of observations per time ID is not uniformly distributed, we can check how it affects our target.

In [None]:
sampled_df.sort_values(by='time_id', inplace=True)
sampled_df['target_cumsum']=sampled_df.groupby(['investment_id'])['target'].transform('cumsum')

In [None]:
fig, ax = plt.subplots(3,1, figsize=(12,12))

# x and y are passed as keywords; recent seaborn versions no longer accept them positionally.
obs_count = sampled_df.groupby('time_id')['investment_id'].nunique()
sns.lineplot(x=obs_count.index, y=obs_count, color=cust_color[-1], ax=ax[0])
ax[0].set_ylabel('Observation Count')
ax[0].set_title('Number of Observations by Time')

mean_target = sampled_df.groupby('time_id')['target'].mean()
sns.regplot(x=mean_target.index, y=mean_target, color=cust_color[0],
           scatter_kws=dict(alpha=0.5, edgecolor="r", linewidth=0.5), line_kws=dict(color=cust_color[-1]), ax=ax[1], order=2, ci=None)
ax[1].set_ylabel('Mean Target')
ax[1].set_title('Target Values by Time')

mean_cumsum = sampled_df.groupby('time_id')['target_cumsum'].mean()
sns.regplot(x=mean_cumsum.index, y=mean_cumsum, color=cust_color[0],
           scatter_kws=dict(alpha=0.5, edgecolor="r", linewidth=0.5), line_kws=dict(color=cust_color[-1]), ax=ax[2], order=2, ci=None)
ax[2].set_ylabel('Mean Cumulative Target')
ax[2].set_title('Cumulative Target Values by Time')
plt.tight_layout()

It seems the number of observations in the training data increases over time, with some odd sharp drops around the 300-500 time ID range. When we check the mean target by time_id, we can see that this appears to produce some outliers...

When we check the cumulative target, though, we observe a negative trend.

In [None]:
sampled_df['time_target_mean']=sampled_df.groupby(['time_id'])['target'].transform('mean')
plot_dist3(sampled_df, 'time_target_mean', 'Mean Target by Time')

When we include the time aspect in the target distribution, it gets fairly long tails indicating some outliers, usually on the positive side, while a big chunk of the data is on the negative side. We should inspect this further to see which investments affect this statistic the most (we aggregated with the mean, which isn't robust to outliers).
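
A quick way to see how much single observations drive the per-time mean is to put the median next to it. The toy frame below is synthetic, with one extreme value injected on purpose.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Synthetic stand-in (assumption): 50 time IDs with 40 observations each.
toy = pd.DataFrame({
    "time_id": np.repeat(np.arange(50), 40),
    "target": rng.standard_normal(2_000),
})
# Inject one extreme observation into time_id 0.
toy.loc[0, "target"] = 100.0

# Mean and median of the target for every time ID.
per_time = toy.groupby("time_id")["target"].agg(["mean", "median"])
print(per_time.head())
```

The single outlier pulls the mean of time_id 0 far from zero while the median barely moves, so a large gap between the two columns points to the time IDs worth inspecting.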

## Work in Progress...
Best of luck in the competition :)