***
# Overview
***

In this notebook, we look at how much percentage return BBRI shares generated, using 2021 to 2022 price data taken from Yahoo Finance.

***
# Libraries
***

Import common libraries for data analysis. 

In [1]:
# Data Manipulation
import numpy as np
import pandas as pd

# Plotting graphs
import matplotlib.pyplot as plt

***
# Data Understanding
***

The dataset covers JKSE and BBRI stocks for 1 year (monthly), from March 2021 to March 2022, taken from Yahoo Finance. Note that the CSV loaded below holds 23 daily rows starting on 2022-03-02.
* [BBRI historical stock data, April 2021 - April 2022](https://finance.yahoo.com/quote/UNVR/history?period1=1617333580&period2=1648869580&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true)



# **Data Load**

In [2]:
# Load the BBRI price data (March - April 2022)
bbri = pd.read_csv("../input/dataset-bbri-march-april-2022/BBRI.JK.March-April.csv")

In [3]:
bbri.shape

(23, 7)

# Understanding the table
01. **Date:** The trading date.
02. **Open:** The price at which the stock started trading at the opening of the given date.
03. **High:** The highest price at which the stock traded during the date.
04. **Low:** The lowest price at which the stock traded during the date.
05. **Close:** The price of the stock when the stock exchange closed for the given date.
06. **Adj Close:** The closing price adjusted for corporate actions such as dividend payments, stock splits, or new share issuance.
07. **Volume:** The number of shares that changed hands on the given date.

In [4]:
bbri.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2022-03-02,4680.0,4690.0,4550.0,4560.0,4386.129883,186635200
1,2022-03-04,4720.0,4730.0,4610.0,4670.0,4491.935547,292022600
2,2022-03-07,4580.0,4630.0,4520.0,4520.0,4347.654785,224515800
3,2022-03-08,4500.0,4560.0,4430.0,4430.0,4261.086426,214870200
4,2022-03-09,4530.0,4590.0,4500.0,4570.0,4395.748535,191967100
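The overview names percentage return as the quantity of interest. As a minimal sketch, assuming the `Close` column is the price to measure returns on, the first five closing prices shown above can be turned into daily and total percentage returns like this (the variable names are illustrative, not part of the original notebook):

```python
import pandas as pd

# First five closing prices from the bbri.head() output above
close = pd.Series(
    [4560.0, 4670.0, 4520.0, 4430.0, 4570.0],
    index=pd.to_datetime(
        ["2022-03-02", "2022-03-04", "2022-03-07", "2022-03-08", "2022-03-09"]
    ),
    name="Close",
)

# Day-over-day percentage return: (today - previous) / previous * 100
daily_return_pct = close.pct_change() * 100

# Return over the whole window: last close relative to first close
total_return_pct = (close.iloc[-1] / close.iloc[0] - 1) * 100

print(daily_return_pct.round(2))
print(round(total_return_pct, 2))
```

The first daily return is NaN because there is no previous day to compare against.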


In [5]:
#display data bbri
display(bbri)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2022-03-02,4680.0,4690.0,4550.0,4560.0,4386.129883,186635200
1,2022-03-04,4720.0,4730.0,4610.0,4670.0,4491.935547,292022600
2,2022-03-07,4580.0,4630.0,4520.0,4520.0,4347.654785,224515800
3,2022-03-08,4500.0,4560.0,4430.0,4430.0,4261.086426,214870200
4,2022-03-09,4530.0,4590.0,4500.0,4570.0,4395.748535,191967100
5,2022-03-10,4640.0,4640.0,4510.0,4570.0,4395.748535,225575800
6,2022-03-11,4400.0,4440.0,4370.0,4400.0,4400.0,275965300
7,2022-03-14,4440.0,4540.0,4410.0,4520.0,4520.0,212052100
8,2022-03-15,4550.0,4640.0,4530.0,4610.0,4610.0,357755700
9,2022-03-16,4650.0,4660.0,4610.0,4650.0,4650.0,226817000


In [6]:
#get data info
bbri.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       23 non-null     object 
 1   Open       23 non-null     float64
 2   High       23 non-null     float64
 3   Low        23 non-null     float64
 4   Close      23 non-null     float64
 5   Adj Close  23 non-null     float64
 6   Volume     23 non-null     int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 1.4+ KB
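The `info()` output shows that `Date` is stored as `object` (plain strings). The notebook later drops this column, but if date-based operations were wanted, a conversion could look like the following sketch. The small `sample` frame is a stand-in for `bbri`, built from values shown above:

```python
import pandas as pd

# Toy frame mimicking how Date is loaded: ISO-formatted strings
sample = pd.DataFrame(
    {
        "Date": ["2022-03-02", "2022-03-04", "2022-03-07"],
        "Close": [4560.0, 4670.0, 4520.0],
    }
)

# Convert the strings to datetime64 and use them as a sorted index
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")
sample = sample.set_index("Date").sort_index()

print(sample.index.dtype)
```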


In [7]:
bbri.describe()

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
count,23.0,23.0,23.0,23.0,23.0,23.0
mean,4641.304348,4673.043478,4576.956522,4621.73913,4576.447987,191685300.0
std,98.640963,80.477822,87.408902,93.500227,145.576933,67227990.0
min,4400.0,4440.0,4370.0,4400.0,4261.086426,83364400.0
25%,4590.0,4635.0,4525.0,4570.0,4445.967774,145105000.0
50%,4680.0,4700.0,4610.0,4650.0,4640.0,191967100.0
75%,4700.0,4730.0,4635.0,4700.0,4700.0,225045800.0
max,4760.0,4760.0,4690.0,4730.0,4730.0,357755700.0


***
# **Process the Data**
***

Non-numeric columns need to be transformed into numeric variables or dropped. Here the only non-numeric column is `Date`, which is dropped below.

# Define Predictor/Independent Variables

We will use the historical data to define some predictor variables, starting with the average closing price over the past 10 trading days.

1. Create a new column in the dataframe called "Close_15_Rolling" and assign it the 10-day rolling average of the closing price (the name says 15, but the code uses a window of 10)
2. Create a new column in the dataframe called "Open_1_Change" and assign it the change in the opening price from the previous day
3. Drop rows that contain NaN values
4. Create an X dataframe as the independent variables

In [8]:
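#10-day rolling mean of the BBRI closing price (column name kept as Close_15_Rolling)
bbri['Close_15_Rolling'] = bbri['Close'].rolling(window = 10).mean()

#Change in the opening price from the previous row
bbri['Open_1_Change'] = bbri['Open'] - bbri['Open'].shift(1)

#drop rows with NaN (the first 9 rows have no full 10-day window) and the non-numeric Date column
bbri = bbri.dropna().drop(columns=['Date'])

#Store a copy of the dataframe as the new variable "X".
X = bbri.copy()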
display(X)

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,Close_15_Rolling,Open_1_Change
9,4650.0,4660.0,4610.0,4650.0,4650.0,226817000,4550.0,100.0
10,4690.0,4700.0,4610.0,4650.0,4650.0,224068000,4559.0,40.0
11,4700.0,4700.0,4580.0,4580.0,4580.0,257863400,4550.0,10.0
12,4600.0,4630.0,4560.0,4590.0,4590.0,118791900,4557.0,-100.0
13,4690.0,4710.0,4640.0,4640.0,4640.0,177758500,4578.0,90.0
14,4680.0,4690.0,4620.0,4650.0,4650.0,146545100,4586.0,-10.0
15,4640.0,4730.0,4620.0,4730.0,4730.0,206271600,4602.0,-40.0
16,4760.0,4760.0,4650.0,4710.0,4710.0,137104800,4633.0,120.0
17,4730.0,4740.0,4670.0,4730.0,4730.0,106801800,4654.0,-30.0
18,4750.0,4750.0,4670.0,4690.0,4690.0,93155100,4662.0,20.0
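As a quick check of the two features, the sketch below recomputes them from the first ten `Open` and `Close` values displayed earlier (rows 0 to 9). It suggests why the first surviving row is row 9: a 10-day window has no value until ten prices are available.

```python
import pandas as pd

# Rows 0-9 of Open and Close, copied from the displayed bbri table
open_ = pd.Series([4680.0, 4720.0, 4580.0, 4500.0, 4530.0,
                   4640.0, 4400.0, 4440.0, 4550.0, 4650.0])
close = pd.Series([4560.0, 4670.0, 4520.0, 4430.0, 4570.0,
                   4570.0, 4400.0, 4520.0, 4610.0, 4650.0])

rolling_mean = close.rolling(window=10).mean()  # NaN until 10 values exist
open_change = open_.diff()                      # same as open_ - open_.shift(1)

print(rolling_mean.isna().sum())                # rows without a full window
print(rolling_mean.iloc[9], open_change.iloc[9])
```

The values for row 9 agree with the first row of `X` shown above (4550.0 and 100.0).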


# Define Target/Dependent Variable

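If tomorrow's closing price is at least as high as today's closing price (the code uses `>= 0`), we label the day as buy (1); otherwise we label it as sell (-1). The last row has no next-day close, so its comparison is false and it is labeled -1.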

In [9]:
#Calculate the historical values of the variable to be predicted.
y = np.where(bbri['Close'].shift(-1) - bbri['Close'] >= 0, 1, -1)
display(y)

array([ 1, -1,  1,  1,  1,  1, -1,  1, -1,  1, -1,  1,  1, -1])
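The labeling rule can be sketched on the closing prices of rows 9 to 18 of `X`. Because this toy series stops at row 18, its final label is artificial here; it illustrates how the last row of any series ends up as -1. The second variant, which keeps only rows with a known next-day close, is an assumption about a possible alternative and not what the notebook does.

```python
import numpy as np
import pandas as pd

# Closing prices for rows 9-18 of X, as displayed earlier
close = pd.Series([4650.0, 4650.0, 4580.0, 4590.0, 4640.0,
                   4650.0, 4730.0, 4710.0, 4730.0, 4690.0])

next_day_diff = close.shift(-1) - close   # NaN for the final row

# Same rule as the notebook: NaN >= 0 is False, so the final row becomes -1
y_with_last = np.where(next_day_diff >= 0, 1, -1)

# Alternative (assumption): label only rows whose next-day close is known
known = next_day_diff.notna()
y_known = np.where(next_day_diff[known] >= 0, 1, -1)

print(y_with_last)
print(y_known)
```

If the alternative were used, the matching row of `X` would also have to be dropped so that features and labels stay aligned.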

# Split the dataset into training and validation

The **training** set will be used to build the machine learning models.

The **validation** set should be used to see how well the model performs on unseen data.

We will use 80% of our data to train and the remaining 20% to validate. `train_test_split` with `test_size=0.2` shuffles the rows and divides them in an 80-20 ratio. With the 14 rows here, that gives 11 training rows and 3 validation rows. `X_train` and `y_train` are the training set; `X_valid` and `y_valid` are the validation set.

In [10]:
from sklearn.model_selection import train_test_split
# 80% goes into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
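Since the rows are ordered in time, a random split lets the model train on days that come after some validation days. One possible alternative, shown here as a sketch and not as the notebook's method, is to pass `shuffle=False` so that the last 20% of rows form the validation set. `X_demo` is placeholder data with the same number of rows as `X`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features: 14 rows, where the row order stands in for time order
X_demo = np.arange(14).reshape(-1, 1)
# Labels as printed by the notebook
y_demo = np.array([1, -1, 1, 1, 1, 1, -1, 1, -1, 1, -1, 1, 1, -1])

# shuffle=False keeps the original order: earliest rows train, latest rows validate
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print(len(X_tr), len(X_va))
print(X_va.ravel())
```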

In [11]:
print(X_train.head())

      Open    High     Low   Close  Adj Close     Volume  Close_15_Rolling  \
14  4680.0  4690.0  4620.0  4650.0     4650.0  146545100            4586.0   
9   4650.0  4660.0  4610.0  4650.0     4650.0  226817000            4550.0   
10  4690.0  4700.0  4610.0  4650.0     4650.0  224068000            4559.0   
22  4690.0  4730.0  4630.0  4730.0     4730.0  145105000            4698.0   
17  4730.0  4740.0  4670.0  4730.0     4730.0  106801800            4654.0   

    Open_1_Change  
14          -10.0  
9           100.0  
10           40.0  
22            0.0  
17          -30.0  


***
# **Instantiate The Logistic Regression in Python**
***


We will instantiate the logistic regression in Python using the `LogisticRegression` class and fit the model on the training dataset using its `fit` method.

In [12]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [13]:
model.fit(X_train, y_train)

LogisticRegression()

# Evaluate Model

In [14]:
model.score(X_train, y_train)

0.5454545454545454

In [15]:
model.score(X_valid, y_valid)

1.0

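- The score on the validation set (1.0) is higher than on the training set (about 0.55), not lower, so this is not the usual overfitting pattern. With only 11 training rows and 3 validation rows, both scores are too noisy to show that the model generalizes.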
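A useful reference point for these scores is the accuracy of always predicting the most frequent label. The sketch below computes that baseline from the label array printed earlier; it is an added sanity check, not part of the original analysis.

```python
import numpy as np

# Target labels as printed by the notebook
y = np.array([1, -1, 1, 1, 1, 1, -1, 1, -1, 1, -1, 1, 1, -1])

labels, counts = np.unique(y, return_counts=True)
majority_label = labels[np.argmax(counts)]

# Accuracy obtained by always predicting the most frequent label
baseline_accuracy = counts.max() / counts.sum()

print(majority_label, round(baseline_accuracy, 3))
```

A model is only informative if it scores clearly above this baseline on data it has not seen.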

An advantage of logistic regression (e.g. compared with a neural network) is that it is easily interpretable. It can be written as a math formula:

In [16]:
model.intercept_ # the fitted intercept

array([1.00211259e-17])

In [17]:
model.coef_  # the fitted coefficients

array([[4.59693681e-14, 4.39635206e-14, 4.42512174e-14, 4.04262261e-14,
        4.04262261e-14, 9.69755438e-10, 4.13981531e-14, 5.35919464e-15]])

Which means that the formula is: 
$$ \boldsymbol P(y = 1) = \frac{1}{1+e^{-logit}} $$  
  
where the logit is: 
  
$$ logit = \boldsymbol{\beta_{0} + \beta_{1}\cdot x_{1} + ... + \beta_{n}\cdot x_{n}}$$ 
  
where $\beta_{0}$ is the model intercept and the other beta parameters are the model coefficients from above, each multiplied by the related feature:  
  
$$ logit = \boldsymbol{1.00 \times 10^{-17} + 4.60 \times 10^{-14} * Open + ... + 5.36 \times 10^{-15} * Open\_1\_Change}$$ 
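To make the formula concrete, the sketch below plugs the fitted intercept and coefficients printed above into the logit for row 9 of `X`. The feature order is assumed to follow the column order of `X` (Open, High, Low, Close, Adj Close, Volume, Close_15_Rolling, Open_1_Change).

```python
import numpy as np

# Fitted parameters copied from model.intercept_ and model.coef_ above
intercept = 1.00211259e-17
coef = np.array([4.59693681e-14, 4.39635206e-14, 4.42512174e-14, 4.04262261e-14,
                 4.04262261e-14, 9.69755438e-10, 4.13981531e-14, 5.35919464e-15])

# Row 9 of X, in the assumed column order
x_row9 = np.array([4650.0, 4660.0, 4610.0, 4650.0, 4650.0,
                   226817000.0, 4550.0, 100.0])

logit = intercept + coef @ x_row9
p_buy = 1.0 / (1.0 + np.exp(-logit))   # probability of class 1

volume_term = coef[5] * x_row9[5]      # contribution of Volume alone
print(logit, p_buy, volume_term)
```

Almost the entire logit comes from the `Volume` term, because that feature is several orders of magnitude larger than the prices. Since every coefficient is positive and even the smallest volume in the data (83,364,400) gives a logit of roughly 0.08, the fitted model appears to predict class 1 for every row.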

# Iterate on the model
The model could be improved, for example by transforming the existing features or creating new ones (e.g. rolling averages over other window lengths).
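One transformation that may help is putting the features on a common scale before fitting, because `Volume` is far larger in magnitude than the price columns. The sketch below is an illustration under that assumption, not a tested improvement. It uses rows 9 to 18 of `X` with a subset of columns and the first ten labels printed earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rows 9-18 of X: Open, Close, Volume, Close_15_Rolling, Open_1_Change
X_demo = np.array([
    [4650.0, 4650.0, 226817000.0, 4550.0, 100.0],
    [4690.0, 4650.0, 224068000.0, 4559.0, 40.0],
    [4700.0, 4580.0, 257863400.0, 4550.0, 10.0],
    [4600.0, 4590.0, 118791900.0, 4557.0, -100.0],
    [4690.0, 4640.0, 177758500.0, 4578.0, 90.0],
    [4680.0, 4650.0, 146545100.0, 4586.0, -10.0],
    [4640.0, 4730.0, 206271600.0, 4602.0, -40.0],
    [4760.0, 4710.0, 137104800.0, 4633.0, 120.0],
    [4730.0, 4730.0, 106801800.0, 4654.0, -30.0],
    [4750.0, 4690.0, 93155100.0, 4662.0, 20.0],
])
# First ten labels from the y array printed earlier
y_demo = np.array([1, -1, 1, 1, 1, 1, -1, 1, -1, 1])

# Standardize each feature to zero mean and unit variance, then fit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

scaled_coef = scaled_model.named_steps["logisticregression"].coef_
preds = scaled_model.predict(X_demo)
print(scaled_coef)
```

With so few rows, any score from this pipeline would still be unreliable; the point is only that the coefficients become comparable across features.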

The correlation matrix may give us an understanding of which variables are important.

In [18]:
bbri.corr()

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,Close_15_Rolling,Open_1_Change
Open,1.0,0.809125,0.778888,0.422102,0.422102,-0.387533,0.541264,0.412286
High,0.809125,1.0,0.815629,0.708864,0.708864,-0.343716,0.76118,0.168609
Low,0.778888,0.815629,1.0,0.679003,0.679003,-0.595058,0.724059,0.181588
Close,0.422102,0.708864,0.679003,1.0,1.0,-0.446582,0.777303,0.037096
Adj Close,0.422102,0.708864,0.679003,1.0,1.0,-0.446582,0.777303,0.037096
Volume,-0.387533,-0.343716,-0.595058,-0.446582,-0.446582,1.0,-0.658307,0.307401
Close_15_Rolling,0.541264,0.76118,0.724059,0.777303,0.777303,-0.658307,1.0,-0.185597
Open_1_Change,0.412286,0.168609,0.181588,0.037096,0.037096,0.307401,-0.185597,1.0
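In the matrix above, `Close` and `Adj Close` have a correlation of 1.0, so one of them adds no information. The sketch below shows one way to list such redundant pairs. The `sample` frame uses rows 9 to 13 of `X`, and the 0.999 threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Rows 9-13 of X for three of the columns
sample = pd.DataFrame({
    "Close": [4650.0, 4650.0, 4580.0, 4590.0, 4640.0],
    "Adj Close": [4650.0, 4650.0, 4580.0, 4590.0, 4640.0],
    "Volume": [226817000, 224068000, 257863400, 118791900, 177758500],
})

corr = sample.corr()
threshold = 0.999  # illustrative cut-off for "effectively identical"

# Collect each pair of distinct columns whose absolute correlation exceeds the threshold
cols = list(corr.columns)
redundant = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(redundant)
```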

