## Exploratory data analysis of the US Stock Financial indicators data set

#### Names of contributors: Anene Ifeanyi, Chizitere Igwe

#### Date: 2020-12-24

# Table of contents

1. [Summary](#Summary)
2. [Method](#Method)
3. [Data](#Data)
4. [Analysis and Results](#PartitionDataintotrainandtestsplits)
5. [Exploratory Data Visualisations](#ExploratoryDataVisualisations)
6. [References](#References)

### Summary <a name="Summary"></a>




## Method <a name="Method"></a>

Given the financial stock indicators, should a hypothetical investor buy the stock or not? 

### Data <a name="Data"></a>

In [None]:
# Import required exploratory data analysis packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# train test split and cross validation
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, RandomizedSearchCV

# Preprocessing
from sklearn.preprocessing import (
    OneHotEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.impute import SimpleImputer

# Feature selection
from sklearn.feature_selection import RFE, RFECV


# classifiers / models
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from lightgbm.sklearn import LGBMClassifier
from xgboost import XGBClassifier


# Others
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.metrics import f1_score, mean_squared_error, make_scorer, recall_score, r2_score

In [None]:
# For this exploratory data analysis, we will only be working with the 2014 dataset. The techniques applied here can be applied to the other datasets.

df_2014 = pd.read_csv('../data/raw/2014_Financial_Data.csv') 

In [None]:
df_2014 = df_2014.rename(columns={'Unnamed: 0': 'Ticker'})

## Analysis and Results <a name="PartitionDataintotrainandtestsplits"></a>



In [None]:
train_df, test_df = train_test_split(df_2014, train_size = 0.75, random_state = 123)

In [None]:
# Create X and Y train

X_train, y_train = (train_df.drop(columns = ["Class"]), train_df["Class"])

X_test, y_test = (test_df.drop(columns = ["Class"]), test_df["Class"])

In [None]:
# Preprocessing and transformations

drop_features = ["Ticker"]

categorical_features = ["Sector"]

numerical_features = X_train.select_dtypes(include=np.number).columns.tolist()

#len(numerical_features) == 222

The `Ticker` column is dropped because it identifies each stock rather than describing its financials, so it is unlikely to contribute to prediction.
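One quick way to check whether a column behaves like an identifier is to test whether its values are unique per row. The sketch below uses a small made-up frame (the tickers and sectors are hypothetical, not taken from the 2014 data set); the same check could be run on `df_2014["Ticker"]`.

```python
import pandas as pd

# Hypothetical toy data: the values are invented for illustration only.
toy_df = pd.DataFrame(
    {
        "Ticker": ["AAA", "BBB", "CCC", "DDD"],
        "Sector": ["Energy", "Energy", "Technology", "Technology"],
    }
)

# An identifier-like column has one distinct value per row.
ticker_is_unique = toy_df["Ticker"].is_unique
sector_is_unique = toy_df["Sector"].is_unique

print("Ticker unique per row:", ticker_is_unique)
print("Sector unique per row:", sector_is_unique)
```

A column that is unique per row would, after one-hot encoding, produce one feature per training example, which a model can memorise but cannot generalise from.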

In [None]:
# Numeric features: scale, then fill remaining missing values with the column mean
numeric_transformer = make_pipeline(StandardScaler(), SimpleImputer())

# Categorical features: fill missing values with a "missing" label, then one-hot encode
categorical_transformer = make_pipeline(SimpleImputer(strategy = "constant", fill_value = "missing"),
                                        OneHotEncoder(handle_unknown = "ignore"))


preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numerical_features),
    (categorical_transformer, categorical_features)
)
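The preprocessing steps above can be sanity-checked on a small made-up frame before they are applied to the real data. The sketch below mirrors the notebook's transformers but uses hypothetical column values and a single hypothetical numeric column, `Revenue`; it only confirms the shape of the transformed output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names echo the notebook, values are invented.
toy_df = pd.DataFrame(
    {
        "Ticker": ["AAA", "BBB", "CCC", "DDD"],
        "Revenue": [1.0, 2.0, np.nan, 4.0],
        "Sector": ["Energy", np.nan, "Energy", "Technology"],
    }
)

toy_preprocessor = make_column_transformer(
    ("drop", ["Ticker"]),
    # Same step order as the notebook: scale first, then impute the mean.
    (make_pipeline(StandardScaler(), SimpleImputer()), ["Revenue"]),
    (
        make_pipeline(
            SimpleImputer(strategy="constant", fill_value="missing"),
            OneHotEncoder(handle_unknown="ignore"),
        ),
        ["Sector"],
    ),
)

toy_transformed = toy_preprocessor.fit_transform(toy_df)

# One scaled numeric column plus one column per Sector level
# ("Energy", "Technology" and the imputed "missing" label).
print(toy_transformed.shape)
```

`Ticker` contributes no columns, and the missing `Sector` value becomes its own category rather than being discarded.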

### Baseline model

In [None]:
# Helper function 

def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns the mean and standard deviation of cross validation scores
    """
    
    scores = cross_validate(model, X_train, y_train, **kwargs)
    
    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    
    out_col = []
    
    # Use positional access (.iloc) because the Series is indexed by metric name
    for i in range(len(mean_scores)):
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))
    
    return pd.Series(data = out_col, index = mean_scores.index)

In [None]:
results = {}

scoring_metric = ["accuracy", "recall", "precision", "f1"]

dummy_model = make_pipeline(preprocessor, DummyClassifier(strategy = "stratified"));

results["dummy"] = mean_std_cross_val_scores(dummy_model, X_train, y_train, return_train_score = True, scoring = scoring_metric);

In [None]:
pd.DataFrame(results)

Table 1: Results of the baseline model. 

Table 1 shows the cross-validation results of the baseline model. Its scores are low, as expected for a classifier that ignores the features, but they serve as a reference point for the models built later.
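To illustrate why a stratified dummy classifier makes a useful floor, the sketch below runs one on synthetic data (random features and a made-up 30% positive rate, not the stock data). Because predictions are drawn at random according to the class proportions, the expected accuracy is roughly p² + (1 − p)², where p is the positive rate.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic data (an assumption for illustration): features carry no signal.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.3).astype(int)

dummy = DummyClassifier(strategy="stratified", random_state=0)
cv_scores = cross_validate(dummy, X_toy, y_toy, cv=5, scoring=["accuracy", "f1"])

mean_accuracy = cv_scores["test_accuracy"].mean()
mean_f1 = cv_scores["test_f1"].mean()

# Accuracy expected from guessing each class with its observed frequency.
p = y_toy.mean()
expected_accuracy = p**2 + (1 - p) ** 2

print(f"mean accuracy: {mean_accuracy:.3f} (expected about {expected_accuracy:.3f})")
print(f"mean f1: {mean_f1:.3f}")
```

Any model that uses the features should beat these numbers; if it does not, the features or the pipeline deserve a closer look.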

### Linear models

In [None]:
# logistic regression

log_reg_model = make_pipeline(preprocessor, LogisticRegression(max_iter = 1000, class_weight = "balanced"))

results["Logistic Regression"] = mean_std_cross_val_scores(log_reg_model, X_train, y_train, return_train_score = True, 
                                                           scoring = scoring_metric)

In [None]:
pd.DataFrame(results)

Table 2: Results of baseline model and Logistic regression model.

The logistic regression model shows promise: its scores are better than those of the dummy classifier, although they are still fairly low. Tuning the regularisation hyperparameter (`C`) may give better results.
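In scikit-learn, `C` is the inverse of the regularisation strength, so smaller values penalise large coefficients more heavily. The sketch below shows this on synthetic data from `make_classification` (an assumption for illustration, not the stock data) by comparing the size of the fitted coefficient vector for three values of `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification problem, used only to illustrate the effect of C.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

coef_norms = {}
for C in [0.001, 0.1, 10.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_toy, y_toy)
    coefs = model.named_steps["logisticregression"].coef_
    # L2 norm of the coefficients: shrinks as regularisation gets stronger.
    coef_norms[C] = float(np.linalg.norm(coefs))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

Whether a smaller or larger `C` scores better depends on the data, which is why it is searched over rather than fixed.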

### Logistic Regression hyperparameter optimisation

In [None]:
log_reg_param_grid = {"logisticregression__C": 10.0 ** np.arange(-3, 3, 1)}
mult_metric_eval_scorer = {"accuracy" : "accuracy", "recall" : "recall", "precision" : "precision", "f1" : "f1"}

log_reg_random_search = RandomizedSearchCV(log_reg_model, param_distributions = log_reg_param_grid,
                                           n_jobs = -1, verbose = 1, scoring = mult_metric_eval_scorer, refit = "f1",
                                           return_train_score = True);

log_reg_random_search.fit(X_train, y_train);

In [None]:
best_log_reg_model = log_reg_random_search.best_estimator_;

results["Logistic Regression (tuned)"] = mean_std_cross_val_scores(best_log_reg_model, X_train, y_train, return_train_score = True,
                                                      scoring = scoring_metric);


In [None]:
pd.DataFrame(results)

Table 3: Results of the baseline, logistic regression, and regularisation-optimised logistic regression models.

After optimising the regularisation hyperparameter, much better scores are obtained for each evaluation metric. I still think this can be improved with automatic feature selection; however, I intend to explore other classifiers before attempting it.
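As a rough idea of what automatic feature selection could look like, the sketch below applies recursive feature elimination with cross-validation (`RFECV`, already imported in this notebook) to synthetic data. The data, the choice of `f1` as the selection metric, and the pipeline layout are all assumptions for illustration, not results from the stock data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 15 features, of which only 4 are informative.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=15, n_informative=4, n_redundant=0, random_state=0
)

# RFECV repeatedly drops the weakest feature and keeps the subset
# with the best cross-validated score.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, scoring="f1")

pipe = make_pipeline(StandardScaler(), selector, LogisticRegression(max_iter=1000))
pipe.fit(X_toy, y_toy)

n_selected = pipe.named_steps["rfecv"].n_features_
support_mask = pipe.named_steps["rfecv"].support_

print("features kept:", n_selected, "of", X_toy.shape[1])
```

With over 200 numeric indicators in the real data, a larger `step` would likely be needed to keep the run time manageable.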

In [None]:
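# Try out an RBF SVM and tree-based ensembles

# Assumption: `ratio` was not defined in the original notebook. It is taken here to be
# the number of negative (Class == 0) over positive (Class == 1) training examples,
# the usual choice for `scale_pos_weight`.
ratio = (y_train == 0).sum() / (y_train == 1).sum()

models = {
    "RBF SVM": SVC(),
    "random forest": RandomForestClassifier(class_weight="balanced", random_state=2),
    "xgboost": XGBClassifier(scale_pos_weight=ratio, random_state=2),
    "lgbm": LGBMClassifier(scale_pos_weight=ratio, random_state=2),
}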

for name, model in models.items():
    pipe = make_pipeline(preprocessor, model)
    results[name] = mean_std_cross_val_scores(
        pipe, X_train, y_train, return_train_score=True, scoring=scoring_metric
    )

pd.DataFrame(results).T

## References <a name="References"></a>

de Jonge, E., 2020. CRAN - Package Docopt. [online] Cran.r-project.org. Available at: https://cran.r-project.org/web/packages/docopt/index.html [Accessed 29 November 2020].

Oliphant, T.E., 2006. A guide to NumPy, Trelgol Publishing USA.

McKinney, W. & others, 2010. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference. pp. 51–56.

Waskom, M. et al., 2017. mwaskom/seaborn: v0.8.1 (September 2017), Zenodo. Available at: https://doi.org/10.5281/zenodo.883859.

Van Rossum, G. & Drake, F.L., 2009. Python 3 Reference Manual, Scotts Valley, CA: CreateSpace.

Hunter, J.D., 2007. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(3), pp.90–95.

Pedregosa, F. et al., 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

Pérez, F. & Granger, B.E., 2007. IPython: a system for interactive scientific computing. Computing in Science & Engineering, 9(3).

Kluyver, T. et al., 2016. Jupyter Notebooks – a publishing format for reproducible computational workflows. In F. Loizides & B. Schmidt, eds. Positioning and Power in Academic Publishing: Players, Agents and Agendas. pp. 87–90.

Anon, 2020. Anaconda Software Distribution, Anaconda Inc. Available at: https://docs.anaconda.com/.