# Predicting Intel Stock Price using Linear Regression

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt
%matplotlib inline

## View data

Get the latest Intel Stock Price from https://finance.yahoo.com/quote/intc/history/

In [None]:
df = pd.read_csv("https://archive.org/download/ml-fundamentals-data/machine-learning-fundamentals-data/INTC.csv")

Read the csv file and display the first 5 rows of your dataframe.

In [None]:
df.head()

Show some of the information about your data, e.g. how many rows, what data types

In [None]:
df.info()

Plot the graph of Stock Prices against Date by using dataframe.plot() from pandas. Volume is excluded because it has a different scale.

In [None]:
df.plot(x="Date",y=["Close","Open","High","Low","Adj Close"])

## Prepare data

The stock prices and Volume are on very different scales, so to show both in one figure we will use **matplotlib** and give each its own y-axis.

In [None]:
# Define Date as the X-axis and convert dataframe to a numpy array
print(type(df))
x = df["Date"].values
print(type(x))

In [None]:
# Print out the values and check the size
# print(x)
print(len(x))

In [None]:
# define the first y-axis which is all stock prices by dropping/removing the Date(x-axis) and Volume(data with different scale)
df2 = df.drop(columns=["Date","Volume"])
df2.head()
y1 = df2.values
# print(y1)
print(len(y1))

In [None]:
y2 = df["Volume"].values
# print(y2)
print(len(y2))

In [None]:
fig = plt.figure(figsize=(12, 6))
ax1 = fig.add_subplot(111)
ax1.plot(x,y1)
plt.title('Stock features against Date',fontsize=20)
ax1.set_xlabel('Date')
ax1.set_ylabel('Stock prices')
ax2 = ax1.twinx()
ax2.set_ylabel('Volume')
ax2.plot(x,y2,'c')

When creating the Linear Regression Model, "Date" can be considered as a contributing factor or feature in training the model. 

At first glance, "Date" is just a string, but pandas can convert it directly into a timestamp.

From the timestamp data-type, we can extract more meaningful features to be used to train our model, e.g. month, quarter, week.
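As a minimal sketch of this idea, the snippet below converts three made-up date strings (hypothetical values, not rows from INTC.csv) into timestamps and reads a few calendar features through the `.dt` accessor. The names `sample_dates` and `sample_features` are illustrative only.

```python
import pandas as pd

# Hypothetical sample dates, not taken from INTC.csv
sample_dates = pd.Series(
    pd.to_datetime(["2021-01-04", "2021-04-15", "2021-12-31"], format="%Y-%m-%d")
)

# Each .dt attribute turns the timestamp into a plain number
sample_features = pd.DataFrame({
    "year": sample_dates.dt.year,
    "month": sample_dates.dt.month,
    "quarter": sample_dates.dt.quarter,
    "dayofweek": sample_dates.dt.dayofweek,  # Monday = 0
    "week": sample_dates.dt.isocalendar().week.astype(int),  # ISO week number
})
print(sample_features)
```

Every resulting column is an integer, which is the kind of input scikit-learn can work with.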

In [None]:
df["Date"] = pd.to_datetime(df.Date,format='%Y-%m-%d')

In [None]:
df.info()

Scikit-learn will not accept strings or timestamps as input, so we need to convert "Date" into simpler numeric values that scikit-learn can accept.

In [None]:
newdate = df["Date"]

df4 = pd.DataFrame({"year": newdate.dt.year,
              "month": newdate.dt.month,
              "day": newdate.dt.day,
              "hour": newdate.dt.hour,
              "dayofyear": newdate.dt.dayofyear,
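              "week": newdate.dt.isocalendar().week.astype(int),  # .dt.week was removed in pandas 2.0
              "weekofyear": newdate.dt.isocalendar().week.astype(int),  # same value; kept so the column count is unchanged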
              "dayofweek": newdate.dt.dayofweek,
              "weekday": newdate.dt.weekday,
              "quarter": newdate.dt.quarter,
             })


In [None]:
df3 = df.drop(columns=["Date"])
df5 = pd.concat([df4,df3],axis=1)
df5.head()

Our data is now ready for model training.

In [None]:
df5.info()

Split the data by row position instead of using train_test_split, which shuffles rows by default. We are treating the stock data as a time series, so the model should train on earlier rows and be tested on later ones.

In [None]:
train = df5[:200]
test = df5[200:]
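The position-based split above can be sketched on a small stand-in frame. The frame `demo_frame`, its columns, and the 80% cutoff are assumptions made for illustration. The second half uses scikit-learn's `TimeSeriesSplit`, which is one possible way to get several chronological splits; it is not something this notebook relies on.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for df5: 10 rows already sorted by date
demo_frame = pd.DataFrame({"day": np.arange(10),
                           "Close": np.linspace(50.0, 59.0, 10)})

# Chronological hold-out: the first 80% of rows train, the rest test
cutoff = int(len(demo_frame) * 0.8)
demo_train = demo_frame.iloc[:cutoff]
demo_test = demo_frame.iloc[cutoff:]

# Every training row comes before every test row
assert demo_train["day"].max() < demo_test["day"].min()

# TimeSeriesSplit yields several such splits with a growing training window
splitter = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in splitter.split(demo_frame):
    print(len(train_idx), "training rows, test rows:", test_idx.tolist())
```

In each split the test indices come after the training indices, so the model is never evaluated on rows that precede its training data.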

In [None]:
train.info()

In [None]:
test.info()

In [None]:
X_train = train.drop("Close",axis=1)
y_train = train["Close"]
X_test = test.drop("Close",axis=1)
y_test = test["Close"]

## Train model

In [None]:
model = LinearRegression()
model.fit(X_train,y_train)

There are 15 coefficients because we have 15 features: the 16 columns of df5 minus the target "Close".

In [None]:
print(model.coef_)
print(len(model.coef_))
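A bare array of coefficients is hard to read, so it can help to label each value with its feature name. The sketch below does this on toy data (the columns `a` and `b` and the relation `y = 2*a - 3*b + 1` are made up for illustration). The same pattern should apply to the notebook's `model` and `X_train`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where y = 2*a - 3*b + 1 exactly
X_demo = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0],
                       "b": [1.0, 0.0, 2.0, 1.0, 3.0]})
y_demo = 2 * X_demo["a"] - 3 * X_demo["b"] + 1

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with the column it belongs to
coef_table = pd.Series(demo_model.coef_, index=X_demo.columns, name="coefficient")
print(coef_table)
print("intercept:", demo_model.intercept_)
```

With the notebook's objects this would read `pd.Series(model.coef_, index=X_train.columns)`.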

## Evaluate

In [None]:
predictions = model.predict(X_test)

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

The error metrics are low, which suggests that the model generalizes well to the test rows.
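To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy and checks the results against scikit-learn. The arrays `y_true_demo` and `y_pred_demo` are made-up values, not output from the model above.

```python
import numpy as np
from sklearn import metrics

# Hypothetical closing prices and predictions
y_true_demo = np.array([50.0, 52.0, 51.0, 53.0])
y_pred_demo = np.array([49.0, 52.5, 53.0, 53.0])

errors = y_true_demo - y_pred_demo

mae_demo = np.mean(np.abs(errors))   # average absolute error
mse_demo = np.mean(errors ** 2)      # average squared error
rmse_demo = np.sqrt(mse_demo)        # back in the units of the target

# The hand-computed values agree with scikit-learn
assert np.isclose(mae_demo, metrics.mean_absolute_error(y_true_demo, y_pred_demo))
assert np.isclose(mse_demo, metrics.mean_squared_error(y_true_demo, y_pred_demo))

print("MAE:", mae_demo, "MSE:", mse_demo, "RMSE:", rmse_demo)
```

MAE and RMSE are expressed in the same units as "Close", so whether a value counts as low depends on the price level of the stock.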

In [None]:
test.insert(16,"Predictions",predictions)

In [None]:
test[["Close","Predictions"]].tail()

In [None]:
fig = plt.figure(figsize=(14, 6))
plt.title("Stock Closing Price against Date",fontsize=20)
plt.xlabel("Date")
plt.ylabel("Stock Closing Price")
plt.plot(train["Close"])
plt.plot(test[["Close","Predictions"]])


In [None]:
model.score(X_test, y_test)

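The model appears to do a very good job of predicting the stock price. This could be because we are using a small dataset with a low number of features. Note also that the features include the same day's Open, High, Low, and Adj Close, which are closely related to Close.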
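For a regressor, `score` returns the coefficient of determination, R². The sketch below recomputes it by hand on synthetic data (the arrays and the names `X_r2`, `y_r2`, and `r2_model` are assumptions for illustration) and compares it with `score` and `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical noisy linear data, split chronologically 15 / 5
rng = np.random.default_rng(0)
X_r2 = np.arange(20, dtype=float).reshape(-1, 1)
y_r2 = 3.0 * X_r2.ravel() + rng.normal(scale=2.0, size=20)

r2_model = LinearRegression().fit(X_r2[:15], y_r2[:15])
y_held = y_r2[15:]
pred_held = r2_model.predict(X_r2[15:])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_held - pred_held) ** 2)
ss_tot = np.sum((y_held - y_held.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_model.score(X_r2[15:], y_held))
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean of the test targets, and the value can be negative when the model does worse than that.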

Let us try to build another linear regression model with the same dataset, but this time, let us use the "date" as the only feature.

## Train model using date only

In [None]:
df5.info()

In [None]:
train = df5[:200]
test = df5[200:]

In [None]:
X_train = train.drop(["Close","Open","High","Low","Adj Close","Volume"],axis=1)
y_train = train["Close"]
X_test = test.drop(["Close","Open","High","Low","Adj Close","Volume"],axis=1)
y_test = test["Close"]

In [None]:
model = LinearRegression()
model.fit(X_train,y_train)

In [None]:
print(model.coef_)
print(len(model.coef_))

### Understanding coefficients
Each coefficient is the weight the model gives to one feature: the change in the predicted target when that feature increases by one unit while the other features stay fixed. In this example, there are ten coefficients, one for each feature in the data. A value can be negative or positive.

The sign (positive or negative) of a coefficient indicates the direction of the relationship between the independent and dependent variable.
- Independent variable: a feature (one column)
- Dependent variable: the target

Unlike correlation coefficients, regression coefficients are not limited to the range -1 to 1. Their size depends on the units of each feature.

#### Positive
As the value of the independent variable increases, the predicted value of the dependent variable increases.

#### Negative
As the value of the independent variable increases, the predicted value of the dependent variable decreases.

#### Close to zero
A value close to 0 indicates that a one-unit change in the independent variable changes the prediction very little. Because the size depends on the feature's units, this alone does not show that the feature is unimportant.
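The size of a coefficient depends on the units of its feature, which a small sketch can show. The data below is made up (a price that rises by exactly 0.5 per day), and the names `per_day` and `per_week` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: the price rises by exactly 0.5 per day
days = np.arange(10, dtype=float).reshape(-1, 1)
price = 0.5 * days.ravel() + 20.0

# Same relationship, with the feature measured in days and then in weeks
per_day = LinearRegression().fit(days, price)
per_week = LinearRegression().fit(days / 7.0, price)

print("slope per day: ", per_day.coef_[0])
print("slope per week:", per_week.coef_[0])
```

Both models make identical predictions, yet the second coefficient is seven times the first. Comparing raw coefficients across features is therefore only meaningful when the features share a scale.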

## Evaluate

In [None]:
predictions = model.predict(X_test)
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Compared with the previous model, the metrics are higher, which indicates that this model's predictions have larger errors.

The previous model works better.

In [None]:
test.insert(11,"Predictions",predictions)
test[["Close","Predictions"]].tail()

In [None]:
fig = plt.figure(figsize=(14, 6))
plt.title("Stock Closing Price against Date",fontsize=20)
plt.xlabel("Date")
plt.ylabel("Stock Closing Price")
plt.plot(train["Close"])
plt.plot(test[["Close","Predictions"]])

In [None]:
model.score(X_test, y_test)

As you can see, the prediction is very bad when you predict the stock price solely based on the date.