**Environment setup**

We started by importing the required libraries and checking their versions.


In [1]:
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas as pd
print('pandas: %s' % pd.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)

scipy: 1.4.1
numpy: 1.21.5
matplotlib: 3.2.2
pandas: 1.3.5
statsmodels: 0.10.2
sklearn: 1.0.2


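In the next step we loaded the data from a CSV file. The code prints the time-series data as a table. The `date` column is parsed as datetimes, while the frame keeps its default integer index (the dates become the index only after the later pivot).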

In [6]:
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('/content/drive/MyDrive/Colab Notebooks/UMF/convictions_returns.csv', parse_dates=['date'])

print(series)

# series['Y'] = series.Close/series.Open*100-100
# series['C'] = series.conviction/series.conviction[-1]*100-100
# X = series[(series.symbol == "SHW")]
# X = X[['Y','C']]
# print(X.describe())
# X.plot()
# pyplot.show()
# print(X)

       Unnamed: 0       date symbol                  sector company_id  \
0               0 2004-02-11     SU         Energy Minerals   GN63J3-R   
1               1 2004-02-11    GGG  Producer Manufacturing   H5490W-R   
2               3 2004-02-11    CWT               Utilities   GSWXLY-R   
3               4 2004-02-11    BLL      Process Industries   VFT0VQ-R   
4               5 2004-02-11    APA         Energy Minerals   DMX4QY-R   
...           ...        ...    ...                     ...        ...   
30679       37275 2022-01-26  EMRAF               Utilities   SRLHZS-R   
30680       37276 2022-01-26    IEX  Producer Manufacturing   KFJFWS-R   
30681       37277 2022-01-26    EXR                 Finance   XD67LR-R   
30682       37278 2022-01-26  LIFZF     Non-Energy Minerals   Q404Y1-R   
30683       37279 2022-01-26   VALU     Technology Services   V3RWFQ-R   

       conviction        Open        High         Low       Close   Adj Close  \
0        0.953727   13.300000 
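
As a side note, `parse_dates` only converts the column to datetimes; it does not change the index. A minimal sketch of how the date could be made the index at load time, using a small in-memory CSV with made-up values instead of the real file (the sample rows and the `index_col` choice are assumptions for illustration, not part of the original notebook):

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for convictions_returns.csv.
csv_text = """date,symbol,conviction,Adj Close
2004-02-11,SU,0.95,13.3
2004-02-25,SU,0.94,13.9
2004-02-11,GGG,0.52,8.1
"""

# parse_dates converts the column to datetimes; index_col makes it the index.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], index_col='date')

print(type(sample.index).__name__)
print(sample)
```

With the date as the index, time-based operations such as resampling can be applied to the frame directly.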

We compute the rate of return (RoR) for each company (grouped by symbol) between consecutive Adj Close readings.

In [7]:
series.sort_values(['symbol','date'], inplace = True, ascending=[True, False])

# Why Adj Close? https://www.codingfinance.com/post/2018-04-03-calc-returns-py/

series['RoR'] = (series.groupby('symbol')['Adj Close'].apply(pd.Series.pct_change) + 1)

print(series)

       Unnamed: 0       date symbol             sector company_id  conviction  \
26111       32629 2019-10-23      A  Health Technology   FWHC5K-R    0.512520   
26026       32544 2019-10-09      A  Health Technology   FWHC5K-R    0.514559   
25877       32390 2019-09-11      A  Health Technology   FWHC5K-R    0.514137   
25800       32311 2019-08-28      A  Health Technology   FWHC5K-R    0.504983   
22927       29143 2018-02-21      A  Health Technology   FWHC5K-R    0.799274   
...           ...        ...    ...                ...        ...         ...   
22721       28924 2018-01-10    ZTS  Health Technology   TW6KKV-R    0.726495   
22671       28872 2017-12-27    ZTS  Health Technology   TW6KKV-R    0.621940   
22594       28791 2017-12-13    ZTS  Health Technology   TW6KKV-R    0.621152   
22519       28712 2017-11-29    ZTS  Health Technology   TW6KKV-R    0.621809   
22438       28628 2017-11-15    ZTS  Health Technology   TW6KKV-R    0.620609   

            Open       High
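
The return calculation can be illustrated on a toy frame. The sketch below assumes that a forward-looking rate of return (later price divided by earlier price) is what is wanted, so it sorts dates in ascending order within each symbol. The cell above sorts them in descending order, which makes `pct_change` relate each reading to the chronologically later one instead. The prices here are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'symbol': ['A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2020-01-01', '2020-01-15', '2020-01-29',
                            '2020-01-01', '2020-01-15']),
    'Adj Close': [100.0, 110.0, 99.0, 50.0, 55.0],
})

# Sort chronologically within each symbol so each row follows its predecessor.
toy = toy.sort_values(['symbol', 'date'])

# Gross return: current price divided by the previous price of the same symbol.
toy['RoR'] = toy.groupby('symbol')['Adj Close'].pct_change() + 1

print(toy)
```

The first reading of every symbol has no predecessor, so its RoR is NaN; this is why the later cells call `dropna()`.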

In [8]:
import matplotlib.pyplot as plt

df = series[['date','symbol','sector','conviction','Adj Close','RoR']]

df_ror = df.pivot_table(index='date', columns = 'symbol', values = 'RoR', aggfunc='first')

df_ror = df_ror.resample('M').last()

print(df_ror)

symbol       A      AAIC  AAP  AAPL  AAT      AAWW      ABBV  ABC       ABG  \
date                                                                          
2004-02-29 NaN  1.002983  NaN   NaN  NaN       NaN       NaN  NaN       NaN   
2004-03-31 NaN  1.644799  NaN   NaN  NaN       NaN       NaN  NaN       NaN   
2004-04-30 NaN       NaN  NaN   NaN  NaN       NaN       NaN  NaN       NaN   
2004-05-31 NaN       NaN  NaN   NaN  NaN       NaN       NaN  NaN       NaN   
2004-06-30 NaN       NaN  NaN   NaN  NaN       NaN       NaN  NaN       NaN   
...         ..       ...  ...   ...  ...       ...       ...  ...       ...   
2021-09-30 NaN       NaN  NaN   NaN  NaN  0.864971       NaN  NaN  0.947210   
2021-10-31 NaN       NaN  NaN   NaN  NaN  0.915556       NaN  NaN  1.181504   
2021-11-30 NaN       NaN  NaN   NaN  NaN       NaN  1.012423  NaN  1.034013   
2021-12-31 NaN       NaN  NaN   NaN  NaN  0.994764  0.962101  NaN  1.024142   
2022-01-31 NaN       NaN  NaN   NaN  NaN  1.113775  

We checked the relationship between daily returns and convictions: it is negative but negligible. We nevertheless continue in order to illustrate the problem posed by the task.

In [None]:
#@title
# plt.xlabel("Returns")
# plt.ylabel("Convictions' changes")
# plt.title("Scatter plot of daily returns and convictions' changes")
# plt.scatter(X['Y'], X['C'])
# plt.show()
# X.corr()
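
The relationship check can be sketched with `DataFrame.corr`, which computes pairwise Pearson correlation by default. The data here is synthetic (random numbers with a weak link built in), and the column names `Y` and `C` simply mirror the commented-out cell; the sketch does not reproduce the notebook's result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: Y = returns, C = conviction changes, weakly related.
Y = rng.normal(size=500)
C = -0.05 * Y + rng.normal(size=500)
X = pd.DataFrame({'Y': Y, 'C': C})

corr = X.corr()            # Pearson correlation matrix
r = corr.loc['Y', 'C']
print(f"Pearson r = {r:.3f}, r^2 = {r**2:.4f}")
```

An r close to zero means that r^2, the share of variance a simple linear fit can explain, is smaller still.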

The model's poor fit is also reflected in its summary statistics.

In [None]:
#@title
import statsmodels.formula.api as smf
### Create an instance of the class OLS
#slr_sm_model = smf.ols('C ~ Y', data=X)

### Fit the model (statsmodels calculates beta_0 and beta_1 here)
#slr_sm_model_ko = slr_sm_model.fit()

### Summarize the model

#print(slr_sm_model_ko.summary()) 

#param_slr = slr_sm_model_ko.params

The simple linear regression nevertheless looks as follows.

In [None]:
#@title
#plt.xlabel("Returns")
#plt.ylabel("Convictions' changes")
#plt.title("Simple linear regression model")
#plt.scatter(X['Y'],X['C'])
#plt.plot(X['Y'], param_slr.Intercept+param_slr.Y * X['Y'],
#         label='Y={:.4f}+{:.4f}X'.format(param_slr.Intercept, param_slr.Y), 
#         color='red')
#plt.legend()
#plt.show()

**Second approach**

A model built with the *sklearn* library.

In [None]:
#@title
#X = X.values
#X.conviction = X.conviction.astype('float32')
#X = series
#split_point = int(0.9*len(X))
#dataset, validation = X[0:split_point], X[split_point:]

#print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
#print(dataset)

We split the data set into training and test sets in order to check the model's fit and predictive potential.

In [16]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
df = df.dropna()
x,y=df.conviction, df.RoR
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
print(y_train)

28182    1.016844
18122    1.046552
28623    0.957194
12080    1.012809
2989     1.029834
           ...   
16003    0.932162
15497    1.032224
29143    1.009674
12773    0.965123
30325    0.934983
Name: RoR, Length: 17590, dtype: float64
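
Because the split above does not fix a seed, each run draws a different 60/40 partition, and the metrics reported later will vary from run to run. A minimal sketch of a repeatable split on synthetic data; the `random_state` value is an arbitrary assumption, not something the notebook uses:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x = rng.uniform(size=100)          # synthetic feature
y = 1.0 + rng.normal(size=100)     # synthetic target

# Fixing random_state makes the 60/40 split repeatable across runs.
split_a = train_test_split(x, y, test_size=0.4, random_state=0)
split_b = train_test_split(x, y, test_size=0.4, random_state=0)

X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))
print(np.array_equal(split_a[0], split_b[0]))
```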


In [19]:
X_train = pd.DataFrame(X_train)
X_train

       conviction
28182    0.682934
18122    0.821828
28623    0.611504
12080    0.693584
2989     0.957020
...           ...
16003    0.841857
15497    0.948389
29143    0.726029
12773    0.767875


In [18]:
# https://towardsdatascience.com/how-to-build-a-regression-model-in-python-9a10685c7f09

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

model = linear_model.LinearRegression()
X_train = pd.DataFrame(X_train)
y_train = pd.DataFrame(y_train)

model.fit(X_train, y_train)

Y_pred_train = model.predict(X_train)
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_train, Y_pred_train))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(y_train, Y_pred_train))

X_test = pd.DataFrame(X_test)
Y_pred_test = model.predict(X_test)
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_test, Y_pred_test))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(y_test, Y_pred_test))

Coefficients: [[-321.85584226]]
Intercept: [200.61679517]
Mean squared error (MSE): 23329969.12
Coefficient of determination (R^2): 0.00
Coefficients: [[-321.85584226]]
Intercept: [200.61679517]
Mean squared error (MSE): 1386437067968255.75
Coefficient of determination (R^2): -0.00


In [21]:
from sklearn.model_selection import cross_val_score
x = pd.DataFrame(x)
y = pd.DataFrame(y)
scores = cross_val_score(model, x, y, cv=3)
print(scores)

[-2.90239570e+12 -3.55338298e+12 -1.02311149e-04]
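
One thing worth noting about the cross-validation above: with an integer `cv`, `cross_val_score` uses consecutive, unshuffled folds for a regressor, so on a frame sorted by symbol each fold covers a different group of companies. A sketch of the shuffled alternative, using synthetic data and an arbitrary `random_state` (both are assumptions, not the notebook's setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic feature and target with a clear linear link.
x = rng.uniform(0.0, 1.0, size=(300, 1))
y = 1.0 + 0.5 * x[:, 0] + rng.normal(scale=0.1, size=300)

# Shuffle rows before splitting so each fold is a random sample.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), x, y, cv=cv)  # default scoring: R^2

print(scores)
```

Shuffling does not rescue a model with no signal, but it keeps a single fold full of extreme values from dominating the comparison between folds.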


To build the model, we created a linear regression object and trained it on the prepared training set. As output we reported the regression coefficients, the variance score (R^2), and the cross-validation results. Unfortunately, this simple linear model is not suitable for estimating companies' rates of return: R^2 is about 0.00 on both the training and test sets, the mean squared error is enormous, and two of the three cross-validation scores are strongly negative.
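
To make the two metrics concrete, the sketch below computes MSE and R^2 by hand with NumPy on made-up numbers. It also shows why an R^2 near zero is a poor result: a model that always predicts the mean of the evaluated target scores exactly 0, and anything worse goes negative.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared residuals.
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares).
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.02, 0.97, 1.05, 0.93, 1.01])   # made-up gross returns
baseline = np.full_like(y_true, y_true.mean())      # always predict the mean
bad = np.array([1.10, 1.10, 0.90, 1.10, 0.90])      # arbitrary poor guesses

print(mse(y_true, baseline), r2(y_true, baseline))
print(mse(y_true, bad), r2(y_true, bad))
```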