**Stock Name: AAPL (Apple)**

*Basic Stock Market Influencing Factors:*

* Closing Price
* Opening Price

*Libraries*

* Pandas
* Scikit-learn
* Matplotlib
* Numpy
* Pandas_datareader.data
* Seaborn
* OS

In [None]:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/apple-aapl-historical-stock-data/HistoricalQuotes.csv', index_col=0)
df

Loading the Latest Data

In [None]:
df_latest = pd.read_csv('/kaggle/input/apple-stock-data-updated-till-22jun2021/AAPL.csv', index_col=0)
df_latest

In [None]:
# Importing Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import pandas_datareader.data as web
import datetime
import seaborn as sb
import os


**Data Cleansing**

* Renaming the columns
* Removing the '$' symbol
* Changing the price columns' datatype to float so they can be used numerically
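The three cleansing steps above can be tried on a tiny made-up frame first. This is a minimal sketch: the `raw` DataFrame and its values are invented for illustration and only mimic the column layout assumed for the real CSV.

```python
import pandas as pd

# Made-up rows that mimic the raw layout: padded column names, '$' prefixes
raw = pd.DataFrame({
    " Close/Last": ["$10.50", " $11.00"],
    " Open": ["$10.00", "$10.75"],
})

# 1. Strip spaces from the column names and rename 'Close/Last'
raw.columns = raw.columns.str.replace(" ", "", regex=False)
raw = raw.rename(columns={"Close/Last": "Close"})

# 2. and 3. Remove the literal '$' and convert to float.
# regex=False matters: as a regular expression, '$' means "end of string".
for col in ["Close", "Open"]:
    raw[col] = (
        raw[col].str.replace("$", "", regex=False).str.strip().astype(float)
    )

print(raw.dtypes)
```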

In [None]:
df.columns = df.columns.str.replace(' ', '')
df.rename(columns = {'Close/Last':'Close'}, inplace=True)

In [None]:
cols = ['Close', 'Open', 'High', 'Low']
for col in cols:
    # regex=False treats '$' as a literal character, not the end-of-string anchor
    df[col] = df[col].str.replace('$', '', regex=False).astype(float)

In [None]:
df.reset_index(inplace=True)
df.head()

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
df.dtypes

In [None]:
df.sort_values(by='Date', inplace=True)

In [None]:
df.reset_index(drop=True, inplace=True)
df.head()

In [None]:
# Build the date strings from the 'Date' column; after reset_index the index
# only holds row numbers, so it cannot be used as a source of dates.
df['dates'] = df['Date'].dt.strftime('%Y-%m-%d')

In [None]:
df.head()

**Shifting the next day close price to the previous day**

Using the `shift` function, the next day's Close is moved one row earlier, so each row carries the value to be predicted.

In [None]:
df['Next_Day_Close'] = df['Close'].shift(-1)
df.head()
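To see what `shift(-1)` does, here is a minimal sketch on made-up prices (the `toy` frame is an assumption for illustration, not AAPL data). It also shows that the last row ends up without a target.

```python
import pandas as pd

# Toy closing prices (made-up values)
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.5, 12.0]})

# shift(-1) moves every value up one row, so row i holds the close of row i + 1
toy["Next_Day_Close"] = toy["Close"].shift(-1)

# The final row has no following day, so its target is NaN;
# it has to be dropped before fitting a model.
toy_clean = toy.dropna(subset=["Next_Day_Close"])

print(toy)
print(len(toy_clean))
```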

**Knowing the Data**

In [None]:
df.describe()

**Checking Correlation**

Correlation measures the association between two features, i.e. how strongly Y tends to vary as X varies. The correlation method used here is the Pearson correlation.

*Using Pearson Correlation coefficient:*

> corr=df.corr(method='pearson')

The Pearson correlation coefficient is the most widely used measure of linear correlation; its value ranges from -1 to 1. A positive coefficient means the two features tend to increase together, a negative coefficient means one tends to decrease as the other increases, and a value near 0 indicates little linear relationship.
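As a quick check of the definition, the coefficient can be computed by hand and compared with pandas. The two series below are made-up illustration data.

```python
import numpy as np
import pandas as pd

# Made-up data: y is roughly 2 * x
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from pandas
r_pandas = x.corr(y, method="pearson")

print(r_manual, r_pandas)
```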

In [None]:
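# Restrict to numeric columns; date and string columns cannot be correlated
corr = df.select_dtypes('number').corr(method='pearson')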
corr

In [None]:
sb.heatmap(corr,xticklabels=corr.columns, yticklabels=corr.columns,cmap='RdBu_r', annot=True, linewidth=0.5)

**EDA (Exploratory Data Analysis)**

Visualize the dependent variable against the independent variables

In [None]:
# .copy() makes appl_df independent of df, so adding columns later does not warn
appl_df = df[['Date','High','Open','Low','Close','Next_Day_Close']].copy()
print(appl_df.head(20))
# Set the style before creating the figure so that it takes effect
plt.style.use('fivethirtyeight')
plt.figure(figsize=(16,8))
plt.title('Apple Stocks Closing Price History 2012-2020')
plt.plot(appl_df['Date'],appl_df['Close'])
plt.xlabel('Date',fontsize=18)
plt.ylabel('Close Price US($)',fontsize=18)
plt.show()

**Bar plot of Open Price VS Close Price (Year 2012)**

Let's take a look at the bar plot of the first 50 rows, which are from the year 2012

In [None]:
# Plot Open vs Close (Year 2012)
appl_df[['Open','Close']].head(50).plot(kind='bar',figsize=(16,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

**Bar plot of Open Price VS Close Price (Year 2020)**

Let's take a look at the bar plot of the last 50 rows, which are from the year 2020

In [None]:
# Plot Open vs Close (Year 2020)
appl_df[['Open','Close']].tail(50).plot(kind='bar',figsize=(16,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

**Bar plot of High Price of a day VS Close Price (Year 2012)**

Let's take a look at the bar plot of the first 50 rows, which are from the year 2012

In [None]:
# Plot High vs Close (Year 2012)
appl_df[['High','Close']].head(50).plot(kind='bar',figsize=(16,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

**Bar plot of High Price of a day VS Close Price (Year 2020)**

Let's take a look at the bar plot of the last 50 rows, which are from the year 2020

In [None]:
# Plot High vs Close (Year 2020)
appl_df[['High','Close']].tail(50).plot(kind='bar',figsize=(16,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

**Bar plot of Low Price of a day VS Close Price (Year 2012)**

Let's take a look at the bar plot of the first 50 rows, which are from the year 2012

In [None]:
# Plot Low vs Close (Year 2012)
appl_df[['Low','Close']].head(50).plot(kind='bar',figsize=(16,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

**Bar plot of Low Price of a day VS Close Price (Year 2020)**

Let's take a look at the bar plot of the last 50 rows, which are from the year 2020

In [None]:
# Plot Low vs Close (Year 2020)
appl_df[['Low','Close']].tail(50).plot(kind='bar',figsize=(16,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

**Linear Regression Model**

*Linear Model Cross-Validation:*

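Cross-validation is a technique for evaluating a model on data it was not trained on, which can be the test data or another held-out set, depending on availability or feasibility. The data is split into k folds; the model is trained on k-1 folds and scored on the remaining fold, and this is repeated until every fold has served once as the held-out set.

Number of splits: 20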
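A minimal sketch of 20-fold cross-validation with scikit-learn follows. The data here is synthetic (`X_demo`, `y_demo` and the coefficients are assumptions for illustration), so the printed score says nothing about the stock model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 20 shuffled folds; each fold is held out once while the model
# is refit on the remaining 19 folds
kfold = KFold(n_splits=20, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold, scoring="r2")

# One R2 score per fold, then their mean
print(len(scores), scores.mean())
```

For a regressor the per-fold score is R2, not a classification accuracy, so the mean is best reported under that name.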

In [None]:
# Model Training and Testing

# Date format is DateTime 
appl_df['Year'] = df['Date'].dt.year
appl_df['Month'] = df['Date'].dt.month
appl_df['Day'] = df['Date'].dt.day

In [None]:
appl_df.head()

In [None]:
final_appl = appl_df[['Day', 'Month', 'Year', 'High', 'Open', 'Low', 'Next_Day_Close']]

In [None]:
final_appl.head(10)

# Model 1 - TRY 1

In [None]:
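# The last row has no next-day close after shift(-1), so drop it
final_appl = final_appl.dropna(subset=['Next_Day_Close'])
X = final_appl.drop(columns='Next_Day_Close')
# Select the target by name; position 5 is the 'Low' column, not the target
Y = final_appl['Next_Day_Close']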

Splitting the dataset into train and test sets using `train_test_split`

In [None]:
# Splitting the dataset into train and test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test= train_test_split(X,Y,test_size=.25)
print(x_train.shape) #output: (1583, 6)
print(x_test.shape)  #output: (528, 6)  
print(y_train.shape) #output: (1583,)
print(y_test.shape)  #output: (528,)

## Model 1 - TRY 1 : Linear Regression Model

In [None]:
# Linear Regression Model Training and Testing
lr_model = LinearRegression()
lr_model.fit(x_train,y_train)
y_pred = lr_model.predict(x_test)

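# Linear Model Cross-Validation
# cross_val_score refits a fresh copy of the model on each fold, so the mean
# below is an R2 score (not a classification accuracy) for models trained on
# folds of the test split, not for lr_model as fitted above.
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=20, random_state=0, shuffle=True)
results_kfold = cross_val_score(lr_model, x_test, y_test, cv=kfold, scoring='r2')
print("Mean cross-validated R2: ", results_kfold.mean())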

**Plot Actual vs Predicted Value of Linear Regression Model : Try1**

In [None]:
# Plot Actual vs Predicted Value
plot_df = pd.DataFrame({'Actual':y_test,'Pred':y_pred})
plot_df.head(10).plot(kind='bar',figsize=(16,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

**Calculating the R2 Score for Try 1**

* R-squared measures the proportion of the variance in the dependent variable explained by the independent variables in the model.
* A value of 1 means the model explains all the variability and 0 means it explains none. On held-out data, as computed by `r2_score`, the value can also be negative when the model predicts worse than always predicting the mean.
* Higher R-squared values suggest a better fit, but a high value does not by itself mean the model is a good predictor in an absolute sense.
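The definition can be checked by hand against `r2_score`. This is a small sketch on made-up numbers; `r2_manual` is a hypothetical helper written only for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and two sets of predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.8, 5.1, 7.2, 8.9])   # close to the targets
y_bad = np.array([9.0, 7.0, 5.0, 3.0])    # reversed order

def r2_manual(y, y_hat):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r2_manual(y_true, y_good), r2_score(y_true, y_good))
print(r2_manual(y_true, y_bad), r2_score(y_true, y_bad))
```

The second pair of numbers is negative, which shows that the score is not bounded below by 0 when predictions are worse than the mean of the targets.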

In [None]:
from sklearn.metrics import mean_squared_error , r2_score
import math

In [None]:
print('Linear R2: ', r2_score(y_test, y_pred))

# Model 1 : TRY 2 - Using Year for train and test data split

Rows dated before 2018 form the training set, and rows from 2018 onward form the test set
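The idea of a year-based split can be sketched on a tiny made-up frame. The `toy` data and the `cutoff_year` name are assumptions for illustration only.

```python
import pandas as pd

# Four made-up rows straddling the 2017/2018 boundary
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-28", "2017-12-29", "2018-01-02", "2018-01-03"]),
    "Close": [1.0, 2.0, 3.0, 4.0],
})

cutoff_year = 2018
is_train = toy["Date"].dt.year < cutoff_year

# Everything before the cutoff trains the model; the rest is held out
train_toy = toy[is_train]
test_toy = toy[~is_train]

print(len(train_toy), len(test_toy))
```

Unlike a random split, this keeps every test row later in time than every training row, so the model is never scored on days that precede its training data.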

In [None]:
final_appl1 = final_appl[final_appl.Year < 2018]
final_appl1.head()

In [None]:
final_appl1.tail()

In [None]:
final_appl2 = final_appl[final_appl.Year >= 2018]
final_appl2.head()

In [None]:
final_appl2.tail()

In [None]:
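# Drop rows without a next-day close, then select the target by name
# (position 5 is the 'Low' column, not 'Next_Day_Close')
train_df = final_appl1.dropna(subset=['Next_Day_Close'])
test_df = final_appl2.dropna(subset=['Next_Day_Close'])
x_train = train_df.drop(columns='Next_Day_Close')
x_test = test_df.drop(columns='Next_Day_Close')
y_train = train_df['Next_Day_Close']
y_test = test_df['Next_Day_Close']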

In [None]:
#X = final_appl.iloc[:,final_appl.columns != 'Next_Day_Close']
#Y = final_appl.iloc[:, 5]

In [None]:
# Shapes of the year-based train and test sets
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

## Model 1 - TRY 2 : Linear Regression Model

In [None]:
## Model 1: Linear Regression Model

# Linear Regression Model Training and Testing
lr_model = LinearRegression()
lr_model.fit(x_train,y_train)
y_pred = lr_model.predict(x_test)

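# Linear Model Cross-Validation
# cross_val_score refits a fresh copy of the model on each fold, so the mean
# below is an R2 score (not a classification accuracy) for models trained on
# folds of the test split, not for lr_model as fitted above.
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=20, random_state=0, shuffle=True)
results_kfold = cross_val_score(lr_model, x_test, y_test, cv=kfold, scoring='r2')
print("Mean cross-validated R2: ", results_kfold.mean())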

**Plot Actual vs Predicted Value of Linear Regression Model - Try2**

In [None]:
# Plot Actual vs Predicted Value
plot_df = pd.DataFrame({'Actual':y_test,'Pred':y_pred})
plot_df.head(10).plot(kind='bar',figsize=(16,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

In [None]:
#from sklearn.metrics import mean_squared_error , r2_score
#import math

In [None]:
print('Linear R2: ', r2_score(y_test, y_pred))

# Model 1 : TRY 3 - Using Moving Average, Relative Strength Index and Correlation

In [None]:
final_appl3 = appl_df
final_appl3.head()

In [None]:
# Technical Indicators
#import talib as ta

In [None]:
'''final_appl3['S_10'] = final_appl3['Close'].rolling(window=10).mean()
final_appl3['Corr'] = final_appl3['Close'].rolling(window=10).corr(df['S_10'])
final_appl3['RSI'] = ta.RSI(np.array(df['Close']), timeperiod =10)
df['Open-Close'] = df['Open'] - df['Close'].shift(1)
df['Open-Open'] = df['Open'] - df['Open'].shift(1)
df = df.dropna()
X = df.iloc[:,:9]'''
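The disabled cell above relies on `talib`. As a rough stand-in, a moving average and a simple RSI can be written with pandas alone. This is only a sketch: `sma` and `rsi_simple` are hypothetical helpers, and this RSI uses plain rolling means of gains and losses, so its values will differ from `ta.RSI`, which uses Wilder's smoothing.

```python
import numpy as np
import pandas as pd

def sma(close, window=10):
    """Simple moving average of the closing price."""
    return close.rolling(window=window).mean()

def rsi_simple(close, window=10):
    """RSI from plain rolling means of gains and losses (not Wilder's smoothing)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window=window).mean()
    loss = delta.clip(upper=0).abs().rolling(window=window).mean()
    rs = gain / loss  # becomes inf when there were no losses in the window
    return 100 - 100 / (1 + rs)

# Made-up price paths for illustration
rising = pd.Series(np.linspace(100.0, 120.0, 30))
falling = pd.Series(np.linspace(120.0, 100.0, 30))

print(sma(rising).iloc[-1])
print(rsi_simple(rising).iloc[-1], rsi_simple(falling).iloc[-1])
```

The first `window` values of both indicators are NaN because the rolling window is not yet full, so those rows need to be dropped before the indicators are used as model features.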

## Model 2: Logistic Regression Model

In [None]:
final_appl = appl_df
final_appl.head()

In [None]:
df1 = final_appl.copy()  # work on a copy so appl_df is not modified by the new columns
df1.head()

In [None]:
df1['returns'] = np.log(df1.Close.pct_change() + 1)
df1['direction'] = [1 if i > 0 else -1 for i in df1.returns]
df1.head()

In [None]:
def lagit (df1, lags):
    names = []
    for i in range(1, lags + 1):
        df1['Lag_' + str(i)] = df1['returns'].shift(i)
        df1['Lag_' + str(i) + '_dir'] = [1 if j > 0 else -1 for j in df1['Lag_' + str(i)]]
        names.append('Lag_' + str(i) + '_dir')
    return names

In [None]:
dirnames = lagit(df1, 5)
dirnames

In [None]:
#df1.dropna(inplace=True)

In [None]:
df1.head(25)

In [None]:
df1[dirnames]

In [None]:
# Restrict to numeric columns; the 'Date' column cannot be correlated
corr = df1.select_dtypes('number').corr(method='pearson')
corr

In [None]:
sb.heatmap(corr,xticklabels=corr.columns, yticklabels=corr.columns,cmap='RdBu_r', annot=True, linewidth=2.5)

In [None]:
## Model 2: Logistic Regression Model

# Logistic Regression Model Training and Testing

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression()

In [None]:
model.fit(df1[dirnames], df1['direction'])

In [None]:
df1['prediction_logit'] = model.predict(df1[dirnames])
df1['prediction_logit']

In [None]:
df1['strat_logit'] = df1['prediction_logit'] * df1['returns']
np.exp(df1[['returns','strat_logit']].sum())

In [None]:
np.exp(df1[['returns','strat_logit']].sum()).plot()

In [None]:
np.exp(df1[['returns','strat_logit']].cumsum()).plot()

In [None]:
train, test = train_test_split(df1, shuffle = False, test_size=0.25, random_state = 0)
train = train.copy()
test = test.copy()
model = LogisticRegression()
model.fit(train[dirnames], train['direction'])

In [None]:
test['prediction_logit'] = model.predict(test[dirnames])
test['strat_logit'] = test['prediction_logit'] * test['returns']
np.exp(test[['returns','strat_logit']].sum())

In [None]:
# Cumulative returns over the held-out test period only
np.exp(test[['returns','strat_logit']].cumsum()).plot()

In [None]:
metrics.confusion_matrix(test['direction'], test['prediction_logit'])

In [None]:
print(metrics.classification_report(test['direction'], test['prediction_logit']))