<a href="https://colab.research.google.com/github/AnanyaGupta24/PortfolioOptimization/blob/main/Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code sets up an authenticated connection to the Google Drive API so that files stored in the user's Google Drive account can be accessed and managed programmatically.

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


This code downloads a file from Google Drive into the Google Colab environment: it extracts the file ID from the share link and saves the file's content locally as StockPrices.csv, ready for data analysis or machine learning.

In [None]:
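# This part links the dataset in Drive to the Google Colab notebook
link = 'link here'
# The file ID is the second-to-last path segment of the share link
file_id = link.split('/')[-2]  # named file_id to avoid shadowing the built-in id()
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('StockPrices.csv')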
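
To illustrate the ID extraction, here is a minimal sketch with a made-up share link. The URL and the placeholder ID are assumptions for illustration only, and the approach relies on the link having the form `.../d/<id>/view...`.

```python
# Hypothetical share link; "FILE_ID_PLACEHOLDER" stands in for a real file ID.
example_link = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"

# Splitting on "/" leaves the ID as the second-to-last segment.
segments = example_link.split("/")
example_id = segments[-2]
print(example_id)
```

If the link has a different shape (for example, no trailing `/view` segment), the `[-2]` index would pick the wrong segment, so it is worth printing the result once before downloading.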


The code reads StockPrices.csv into the df DataFrame, converts the Date column to datetime values, and selects only the Date, Index, and Close columns into a new DataFrame called df_close.

Finally, the code calls the info() method on df_close to print a summary of its data, including the data type of each column and the number of non-null values.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv('StockPrices.csv')
df['Date']= pd.to_datetime(df['Date'])
df_close = df[['Date', 'Index', 'Close']]
df_close.info()


This code reshapes df_close with pivot_table() so that each row is a date and each column is an index, with the closing prices as values.
The dropna(axis=1) call then removes any column with at least one NaN value (that is, any index without a complete price history), and head() displays the first few rows of the result.

In [None]:
#Closing Prices Dataframe
df_close = df_close.pivot_table(index = 'Date', columns = 'Index', values='Close').dropna(axis=1)
df_close.head()
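
The pivot-then-drop step can be seen on a tiny made-up table. This is only a sketch: the tickers "AAA", "BBB", "CCC" and their prices are invented for illustration and are not from StockPrices.csv.

```python
import pandas as pd

# Long format: one row per (date, index) pair. "CCC" has no price on the first date.
long_df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02", "2020-01-02"]
    ),
    "Index": ["AAA", "AAA", "BBB", "BBB", "CCC"],
    "Close": [10.0, 11.0, 20.0, 21.0, 30.0],
})

# Wide format: rows are dates, columns are indices, values are closing prices.
wide = long_df.pivot_table(index="Date", columns="Index", values="Close")

# axis=1 drops whole columns that contain any NaN, so "CCC" is removed.
complete = wide.dropna(axis=1)
print(complete)
```

Dropping columns rather than rows keeps every date but discards any index with an incomplete history, which is the trade-off the cell above makes.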


# **PCA**
Principal Component Analysis (PCA) is a popular unsupervised learning technique for reducing the dimensionality of data. It identifies the main axes of variance within a data set: new orthogonal dimensions (principal components), ordered so that each successive component explains as much of the remaining variance as possible.
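
As a small illustration of these properties, the sketch below runs PCA on synthetic two-feature data in which the second feature is almost a multiple of the first. The data and the name toy_pca are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# The second feature is nearly 2 * x, so the data is almost one-dimensional.
data = np.column_stack([x, 2.0 * x + 0.1 * rng.normal(size=200)])

toy_pca = PCA(n_components=2)
scores = toy_pca.fit_transform(data)

ratios = toy_pca.explained_variance_ratio_   # share of variance per component
components = toy_pca.components_             # one row per principal axis
print(ratios)
print(components @ components.T)             # close to the identity: axes are orthonormal
```

Almost all of the variance lands on the first component, which is why the second could be dropped with little loss of information.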


This code reshapes the original DataFrame df: it drops the Unnamed: 0 and Close columns, then pivots and stacks the remaining features so that each row is a date and each column is a (stock, feature) pair. Columns that contain missing values are then dropped. The resulting DataFrame raw_df is the wide feature table that is compressed with PCA later on.

In [None]:
#Dataset we are compressing, column level 0 = Stock, column level 1 = feature
raw_df = df.drop(columns = ['Unnamed: 0','Close']).set_index(['Date' , 'Index']).unstack(level = 1).stack(level = 0).unstack()
raw_df = raw_df.dropna(axis = 1)
raw_df.head()
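
The chain of unstack and stack calls is easier to follow on a toy table. The sketch below assumes two invented tickers and two feature columns ("Open" and "High"); the column names in the real data set may differ.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]),
    "Index": ["AAA", "BBB", "AAA", "BBB"],
    "Open": [1.0, 5.0, 2.0, 6.0],
    "High": [1.5, 5.5, 2.5, 6.5],
})

wide = (
    toy.set_index(["Date", "Index"])
    .unstack(level=1)   # columns become (feature, stock)
    .stack(level=0)     # rows become (date, feature); columns are stocks
    .unstack()          # columns become (stock, feature); rows are dates
)
print(wide)
```

The net effect is to swap the two column levels so that the stock is level 0 and the feature is level 1, matching the comment in the cell above.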

The code converts the DataFrame raw_df into a NumPy array and then displays the shape of the resulting array. This is helpful for understanding the size and structure of the data. Note that raw_df is a plain array from this point on, so later code indexes it by position rather than by label.

In [None]:
raw_df = raw_df.to_numpy()
raw_df.shape

This code uses the scikit-learn (sklearn) library for data preprocessing and dimensionality reduction.
It scales the data using MinMaxScaler and then applies principal component analysis to reduce the data to 380 principal components. The resulting transformed data is stored in PCA_df.

MinMaxScaler scales each feature (column) to a specified range, by default between 0 and 1. The fit_transform() method learns each column's minimum and maximum, scales the data, and returns the scaled version, which is stored in the raw_df_scaled variable.
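
The scaling rule can be checked by hand on a small made-up array: each value becomes (x - column minimum) / (column maximum - column minimum).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values: two features on very different scales.
toy = np.array([[1.0, 100.0],
                [3.0, 300.0],
                [5.0, 200.0]])

scaled = MinMaxScaler().fit_transform(toy)

# The same result from the formula, applied column by column.
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
print(scaled)
```

After scaling, both columns span 0 to 1, so neither dominates the variance that PCA looks at simply because of its units.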

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

#Scaling the data
raw_df_scaled = MinMaxScaler().fit_transform(raw_df)

#Performing PCA ~ Reducing Dimensionality
pca = PCA(n_components=380)  # lowercase name so the PCA class itself is not overwritten
PCA_df = pca.fit_transform(raw_df_scaled)

The code creates a line plot of the cumulative share of the total variance explained as principal components are added. This is useful for determining how many principal components are needed to capture a desired amount of variance in the data.

In [None]:
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Num Components')
plt.ylabel('Cumulative Explained Variance');
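
The number of components needed for a given share of variance can also be read off numerically rather than from the plot. The sketch below uses synthetic data and an assumed threshold of 95%; neither comes from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 20 features with decreasing spread per feature.
toy_data = rng.normal(size=(100, 20)) * np.linspace(5.0, 0.5, 20)

toy_pca = PCA().fit(toy_data)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

threshold = 0.95  # assumed target share of variance
# First position where the cumulative share reaches the threshold, as a 1-based count.
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed, cumulative[n_needed - 1])
```

scikit-learn also accepts a float between 0 and 1 for n_components (for example PCA(n_components=0.95)), which selects the component count for that variance share directly.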

This code generates a list of labels for the principal component analysis (PCA) results.

The first two lines store the index values (dates) and the column labels (stocks) of the df_close DataFrame in the variables dates and stocks.

The third line initializes an empty list called PC_labs.

The loop then iterates once per column of PCA_df, the NumPy array of transformed data produced by the PCA step above, and appends one label per column.

After the loop completes, PC_labs contains one label per principal component: "PC1" corresponds to the first principal component, "PC2" to the second, and so on.

In [None]:

dates = df_close.index
stocks = df_close.columns
PC_labs = []
for i in range(PCA_df.shape[1]):
  lab = "PC" + str(i+1)
  PC_labs.append(lab)

# **Linear Regression Prediction Functions**

Linear regression models a linear relationship between a dependent variable (target) and one or more independent variables (predictors), and uses the fitted relationship to predict future values of the target.
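
As a minimal sketch of the idea, the example below fits a line to four invented points that lie exactly on y = 2x + 1 and then predicts the next value. The data and the name toy_lr are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # one predictor
y = np.array([1.0, 3.0, 5.0, 7.0])           # target: y = 2x + 1

toy_lr = LinearRegression().fit(X, y)
next_value = toy_lr.predict(np.array([[4.0]]))
print(toy_lr.coef_, toy_lr.intercept_, next_value)
```

The fitted slope and intercept recover the line, and the prediction for x = 4 extends it by one step.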

The predict_prices function takes the following arguments:

1. raw_df: A NumPy array containing the features used for the prediction, with one row per time step.
2. close: A DataFrame containing the closing prices of the stocks being predicted.
3. time: An integer specifying the current time step.
4. lookback: The number of time steps to look back when creating the input features.
5. forward: The number of time steps to look forward when predicting the stock prices.
6. stock_num: An integer specifying the column index of the stock being predicted.

The function first creates two PCA objects, pca1 and pca2, each with 10 principal components.

Then, it creates the training data and testing data. The training features are the rows of raw_df from (time - forward - lookback) up to, but not including, (time - forward); they are scaled with MinMaxScaler and reduced with pca1. The corresponding training targets are taken from close for rows (time - forward + 1) up to, but not including, (time + 1), so each target lies lookback + 1 steps after its feature row. Because the features span lookback rows and the targets span forward rows, the fit requires lookback to equal forward.

The testing features are the rows from (time - lookback) up to, but not including, (time); they are scaled with a separately fitted MinMaxScaler and reduced with pca2, which is fitted on the test window itself. The corresponding testing targets are taken from close for rows (time + 1) up to, but not including, (time + forward + 1).

Next, the function creates a LinearRegression object, LR, and fits it to the training data using LR.fit(X_train, y_train).

Finally, the function predicts the stock prices for the testing data using LR.predict(X_test) and returns the predicted values together with the actual target values y_test.
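
The index arithmetic is easier to check with concrete numbers. The sketch below reproduces the four slices used by predict_prices for time = 60, lookback = 30, forward = 30; window_bounds is a hypothetical helper written for this illustration, not part of the notebook.

```python
def window_bounds(time, lookback, forward):
    """Return the half-open (start, stop) row ranges used for one prediction step."""
    return {
        "X_train": (time - forward - lookback, time - forward),
        "y_train": (time - forward + 1, time + 1),
        "X_test": (time - lookback, time),
        "y_test": (time + 1, time + forward + 1),
    }


bounds = window_bounds(time=60, lookback=30, forward=30)
for name, (start, stop) in bounds.items():
    print(f"{name}: rows {start} to {stop - 1} ({stop - start} rows)")
```

With these values the training features cover rows 0 to 29 and the training targets rows 31 to 60, while the test features cover rows 30 to 59 and the test targets rows 61 to 90.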

In [None]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

#Using the full features dataset, the closing prices; we are able to fit a line over a specified time period
def predict_prices(raw_df, close, time, lookback, forward, stock_num):

  #PCA
  pca1 = PCA(n_components = 10)
  pca2 = PCA(n_components = 10)


  #Training features: rows [time-forward-lookback, time-forward); targets: rows [time-forward+1, time+1)
  X_train = raw_df[time-forward-lookback:time-forward,:]
  X_train = MinMaxScaler().fit_transform(X_train)
  X_train = pca1.fit_transform(X_train)
  y_train = close.iloc[time-forward+1:time+1,stock_num]

  #Testing features: rows [time-lookback, time); targets: rows [time+1, time+forward+1)
  X_test = raw_df[time-lookback:time,:]
  X_test = MinMaxScaler().fit_transform(X_test)
  X_test = pca2.fit_transform(X_test)
  y_test = close.iloc[time+1 : time+forward+1, stock_num]

  LR = LinearRegression()
  LR.fit(X_train, y_train)
  predicted = LR.predict(X_test)
  # print(mean_squared_error(y_test,predicted))

  return predicted, y_test
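
In the function above, the test window gets its own scaler and its own PCA fit, so the test components are not guaranteed to line up with the components the regression was trained on. One possible variation, shown here only as a sketch and not as the notebook's method, fits the scaler and PCA on the training window and reuses them on the test window through a scikit-learn pipeline. The function name predict_prices_shared_fit, the synthetic data, and the single-stock 1-D close array are all assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def predict_prices_shared_fit(features, close, time, lookback, forward, n_components=10):
    """Hypothetical variant: scaler and PCA are fitted on the training window only."""
    if lookback != forward:
        raise ValueError("this windowing scheme needs lookback == forward")

    model = make_pipeline(
        MinMaxScaler(), PCA(n_components=n_components), LinearRegression()
    )

    X_train = features[time - forward - lookback : time - forward]
    y_train = close[time - forward + 1 : time + 1]
    X_test = features[time - lookback : time]
    y_test = close[time + 1 : time + forward + 1]

    model.fit(X_train, y_train)            # fits scaler, PCA, and regression on training rows
    return model.predict(X_test), y_test   # test rows are only transformed, never refitted


# Synthetic stand-ins: 200 time steps, 25 features, one price series.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 25))
close = np.cumsum(rng.normal(size=200)) + 100.0

predicted, actual = predict_prices_shared_fit(features, close, 60, 30, 30)
print(predicted.shape, actual.shape)
```

Whether this changes the results on the real data is not something the sketch can show; it only illustrates how the same transformation can be applied to both windows.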

The construct_prediction_tab function calls the predict_prices function repeatedly to generate the predicted and actual values for every stock and time period.

The function takes two arguments:

1. full_features_df: The full set of features for all stocks.
2. closing_prices_df: The closing prices of all stocks.

The function initializes two empty lists, predictions and actuals, which will store the predicted and actual values for each stock.

The function then iterates over each stock in closing_prices_df using a for loop. For each stock, it initializes two empty lists, stock_predictions and stock_actuals.

The inner for loop then iterates over a range of time steps, starting from 60 (the lookback period of 30 plus the forward period of 30, so that the first training window begins at row 0) and incrementing by 30 (to generate predictions for non-overlapping time periods). For each time step, the function calls the predict_prices function to generate the predicted and actual values for that stock and time period.

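The predicted and actual values are appended to stock_predictions and stock_actuals, respectively. Once all time steps have been processed, the per-window arrays are concatenated into one array per stock and appended to predictions and actuals.

Finally, the function returns predictions and actuals, which together form the full table of predicted and actual prices.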




In [None]:
#This function builds the full tables of predicted and actual prices
def construct_prediction_tab(full_features_df, closing_prices_df):
  predictions = []
  actuals = []

  for stock in range(closing_prices_df.shape[1]):
    stock_predictions = []
    stock_actuals = []

    #Roll forward in non-overlapping 30-day windows (lookback = forward = 30)
    for time in range(60, closing_prices_df.shape[0], 30):
      pred, act = predict_prices(full_features_df, closing_prices_df, time, 30, 30, stock)
      stock_predictions.append(pred)
      stock_actuals.append(act)

    #Join the per-window arrays into one array per stock
    stock_predictions = np.concatenate(stock_predictions)
    stock_actuals = np.concatenate(stock_actuals)

    predictions.append(stock_predictions)
    actuals.append(stock_actuals)

  return predictions, actuals

pred and act are therefore two lists, each holding one NumPy array per stock: pred contains the predicted prices and act contains the actual prices across all time periods.

In [None]:
pred, act = construct_prediction_tab(raw_df, df_close)

The data argument is set to act, a list with one array per stock, so each row of the initial DataFrame is a stock and each column is a time period. The index argument is set to stocks. The columns argument is set to dates[61:], a slice of the dates variable starting from index 61: the first 60 days are used up by the initial training and lookback windows, and the first predicted day is time + 1 = 61. final_preds is built the same way but without column labels, because the predictions run past the last available date.

The transpose() method is called on both DataFrames to swap the rows and columns, so that the rows correspond to the time periods and the columns correspond to the stocks, with each cell holding that stock's closing price for that period.

In [None]:
# Need to get rid of 60 days for initial prediction window
final_actuals = pd.DataFrame(data = act, index=stocks, columns = dates[61:]).transpose()
final_preds = pd.DataFrame(data = pred, index = stocks).transpose()


This code is used to trim the final_preds DataFrame to match the time range of the final_actuals DataFrame, and also to set the index of the final_preds DataFrame to the same range of dates as the final_actuals DataFrame.

In [None]:

final_preds = final_preds.iloc[:len(final_actuals), :]  # 4966 rows in the original run
final_preds.index = dates[61:]
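
The reason the predictions are longer than the actuals can be seen by counting rows per window. The sketch below assumes a data set of 100 rows purely for illustration; the windowing (start at 60, step 30, lookback = forward = 30) follows construct_prediction_tab.

```python
n_rows = 100  # assumed number of trading days, chosen for illustration
lookback = forward = 30

n_predicted = 0
n_actual = 0
for time in range(lookback + forward, n_rows, forward):
    # The model always predicts a full window of `forward` values...
    n_predicted += forward
    # ...but actual prices only exist up to the last row of the data.
    n_actual += max(0, min(time + forward + 1, n_rows) - (time + 1))

print(n_predicted, n_actual)
```

Here the last window predicts 30 values but only 9 actual prices remain, so the predictions have to be trimmed to the n_rows - 61 rows that the actuals cover.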

In [None]:
final_actuals.head()

In [None]:

final_preds.head()

This code exports the final_actuals and final_preds DataFrames to CSV files and saves them to Google Drive using the Google Colab interface.

In [None]:
from google.colab import drive
drive.mount('drive')

final_actuals.to_csv('LR_Actual_Prices.csv')
!cp LR_Actual_Prices.csv "drive/My Drive/Machine Learning Project/ML Section Exports"

final_preds.to_csv('LR_Predicted_Prices.csv')
!cp LR_Predicted_Prices.csv "drive/My Drive/Machine Learning Project/ML Section Exports"


Each call to predict_prices below passes the start time of the prediction window (time), the number of days to look back for features (lookback), the number of days to predict into the future (forward), and the number of the stock to predict (stock_num).

Each call returns two arrays: p1, p2, and p3 contain the predicted prices for their prediction windows, and t1, t2, and t3 contain the actual prices for the same windows.

In [None]:
# Three different Prediction Windows
p1 , t1 = predict_prices(raw_df, df_close, 60, 30, 30, 5)
p2 , t2 = predict_prices(raw_df, df_close, 90, 30, 30, 5)
p3 , t3 = predict_prices(raw_df, df_close, 120, 30, 30, 5)

In [None]:

predictions = np.concatenate([p1,p2,p3])
actuals = np.concatenate((t1, t2, t3))  # include t3 so actuals cover the same three windows

In [None]:
#This is a plot of the first 90 days of predictions for the stock at column index 5
plt.plot(predictions, label = 'predicted')
plt.plot(actuals, label = 'Actual')
plt.legend()
plt.show()

In [None]:
stock_predictions = []
stock_actuals = []

for i in range(60,df_close.shape[0], 30):
  pred, act = predict_prices(raw_df, df_close, i, 30, 30, 5)
  stock_predictions.append(pred)
  stock_actuals.append(act)


In [None]:
stock_predictions = np.concatenate(stock_predictions)
stock_actuals = np.concatenate(stock_actuals)

This code creates a scatter plot of the predicted stock prices (stock_predictions) on the x-axis and the actual stock prices (stock_actuals) on the y-axis.

The predictions are sliced to the length of the actuals because the predictions were made for a longer period than the actual prices cover, and we only want to compare predicted and actual prices where both exist.

The resulting scatter plot can be used to visually compare the predicted and actual prices, and to assess how well the predictions match the actual values. If the points on the scatter plot fall close to a diagonal line, it indicates a good fit between the predicted and actual values. If the points are scattered or show a systematic deviation from the diagonal line, it indicates a poor fit.

The term "Q-Q plot" stands for "quantile-quantile plot", a graphical comparison of two probability distributions based on their quantiles. Strictly speaking, the plot here pairs each prediction with the actual price for the same day rather than comparing sorted quantiles, so it is a predicted-versus-actual scatter plot.



In [None]:

# Predicted vs actual scatter plot (loosely referred to as a Q-Q plot)
plt.scatter(x = stock_predictions[:len(stock_actuals)], y = stock_actuals)
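
For contrast, a quantile-quantile comparison, as distinct from a paired scatter of same-day values, can be sketched by evaluating both samples at the same probability levels. The arrays below are synthetic stand-ins, not the notebook's predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for predicted and actual prices (different lengths are fine).
toy_predicted = rng.normal(loc=100.0, scale=10.0, size=500)
toy_actual = rng.normal(loc=100.0, scale=12.0, size=400)

# Evaluate both samples at the same probability levels.
probs = np.linspace(0.01, 0.99, 99)
q_predicted = np.quantile(toy_predicted, probs)
q_actual = np.quantile(toy_actual, probs)

# Plotting q_predicted against q_actual would give the Q-Q plot;
# points near the diagonal mean the two distributions are similar.
print(q_predicted[:3], q_actual[:3])
```

A Q-Q plot says nothing about whether a given day's prediction was close to that day's price; it only compares the overall distributions.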

This code plots the predicted closing prices and the actual closing prices over the full period for a single stock (the one at column index 5).

In [None]:

#Full Prediction vs Actuals for the same stock
plt.figure(figsize=(20,10))
plt.plot(stock_predictions, label = 'Predicted')
plt.plot(stock_actuals, label = 'Actual')
plt.legend()

The output of this code is a single floating-point value: the mean absolute error (MAE) between the actual and predicted values, averaged over all dates and all stocks.

In [None]:

from sklearn.metrics import mean_squared_error, mean_absolute_error

mean_absolute_error(final_actuals, final_preds)
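
To make the averaging concrete, the sketch below computes the MAE for two invented stocks over three days, both overall and per stock. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Rows are days, columns are stocks (invented values).
toy_actual = np.array([[10.0, 100.0], [12.0, 110.0], [14.0, 120.0]])
toy_pred = np.array([[11.0, 90.0], [12.0, 115.0], [13.0, 120.0]])

# Default: one number, the per-stock errors averaged with equal weight.
overall = mean_absolute_error(toy_actual, toy_pred)

# multioutput="raw_values" keeps one MAE per column (stock).
per_stock = mean_absolute_error(toy_actual, toy_pred, multioutput="raw_values")

# The same overall figure computed by hand.
manual = np.mean(np.abs(toy_actual - toy_pred))
print(overall, per_stock)
```

Because prices differ in scale between stocks, the per-stock values can be more informative than the single average, which is dominated by the higher-priced stock here.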