# Project: Stock Price Prediction with Principal Component Analysis (PCA)

### Concepts Highlighted:

1. Data Acquisition and Preprocessing:
- Loading the price data from a CSV file into a pandas DataFrame.

2. Data Cleaning:
- Checking for null values.
- Converting the date column into numerical format (Unix timestamps).
- Transforming data types (converting the "object" column 'symbol' to integer codes).

3. Feature Engineering:
- Converting the DataFrame into vectors and then into a matrix for model compatibility.
- Applying dimensionality reduction techniques (e.g., PCA) to potentially improve model performance and reduce complexity.

4. Machine Learning Analysis:
- The data is split into training and testing sets to train and evaluate a model.
- The model's performance is evaluated using Mean Squared Error (MSE) on the unseen testing data.

### Skills Demonstrated:

1. Vectors and Matrices: Working with data represented as matrices.
2. Dimensionality Reduction with PCA: Applying PCA to reduce features and identify key factors.
3. Data Analysis and Interpretation: Analyzing the principal components to understand relationships between stocks.

In [240]:
# Importing necessary libraries

import pandas as pd
import numpy as np

In [241]:
# Reading data from csv file ('prices.csv') into pandas dataframe ('df')

df = pd.read_csv("prices.csv")
df.head()

Unnamed: 0,date,symbol,open,close,low,high,volume
0,05-01-2016 00:00,WLTW,123.43,125.839996,122.309998,126.25,2163600
1,06-01-2016 00:00,WLTW,125.239998,119.980003,119.940002,125.540001,2386400
2,07-01-2016 00:00,WLTW,116.379997,114.949997,114.93,119.739998,2489500
3,08-01-2016 00:00,WLTW,115.480003,116.620003,113.5,117.440002,2006300
4,11-01-2016 00:00,WLTW,117.010002,114.970001,114.089996,117.330002,1408600


In [242]:
# 'df.shape' attribute returns a tuple representing dimensions (rows & columns) of df 

df.shape

(99, 7)

We are going to perform vector & matrix operations on our dataframe, so it's necessary to:
1. Deal with null values
2. Convert every column to a numerical data type

In [243]:
# 'df.isnull()' gives the dataframe back to us with each cell containing a boolean value indicating whether the corresponding cell in original
# dataframe was null ('True') or not ('False')
# 'df.isnull().sum()' returns the number of missing values in each column of the dataFrame 'df'.

df.isnull().sum()

date      0
symbol    0
open      0
close     0
low       0
high      0
volume    0
dtype: int64

Since there are no null values in our dataframe, we don't have to deal with them.
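For context, here is a minimal sketch of what "dealing with" null values could look like if the check had found any. The frame `df_demo` is made up for illustration and is not taken from 'prices.csv'. Dropping rows and filling with the median are only two of several possible strategies.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value, for illustration only
df_demo = pd.DataFrame(
    {"open": [123.43, np.nan, 116.38], "volume": [2163600, 2386400, 2489500]}
)

# Count of missing values per column
print(df_demo.isnull().sum())

# Option 1: drop every row that contains at least one null value
dropped = df_demo.dropna()

# Option 2: replace nulls with the median of their column
filled = df_demo.fillna(df_demo.median())
print(filled)
```

Which option is appropriate depends on how much data is missing and on whether the gaps are random.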

In [244]:
# 'df.dtypes' returns a series containing the data types of each column in the dataFrame 'df'.

df.dtypes

date       object
symbol     object
open      float64
close     float64
low       float64
high      float64
volume      int64
dtype: object

Our dataframe contains 3 data types: object, float64, int64. We have to convert the 'object' columns into a numerical data type, so that we can
perform vector & matrix operations on our dataframe.

In [245]:
# 'pd.factorize()' is used to encode the values in 'symbol' column as numerical factors. It returns 2 outputs: encoded values & mapping
# (which isn't needed & hence assigned to '_')

df['symbol'], _ = pd.factorize(df['symbol'])
df.head()

Unnamed: 0,date,symbol,open,close,low,high,volume
0,05-01-2016 00:00,0,123.43,125.839996,122.309998,126.25,2163600
1,06-01-2016 00:00,0,125.239998,119.980003,119.940002,125.540001,2386400
2,07-01-2016 00:00,0,116.379997,114.949997,114.93,119.739998,2489500
3,08-01-2016 00:00,0,115.480003,116.620003,113.5,117.440002,2006300
4,11-01-2016 00:00,0,117.010002,114.970001,114.089996,117.330002,1408600
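As a small illustration of what `pd.factorize` returns, the sketch below uses a made-up list of ticker symbols (not read from 'prices.csv'). It keeps the second return value, which the cell above discards, so that the integer codes can be mapped back to the original labels.

```python
import pandas as pd

# Hypothetical symbols, for illustration only
symbols = pd.Series(["WLTW", "AAPL", "WLTW", "MSFT", "AAPL"])

# codes: one integer label per row
# uniques: the distinct values, in order of first appearance
codes, uniques = pd.factorize(symbols)

print(codes)
print(uniques)

# Mapping the integer codes back to the original labels
recovered = uniques[codes]
print(list(recovered))
```

The codes are arbitrary identifiers: a larger code does not mean a "larger" symbol, which is worth remembering when such a column is fed into PCA or a regression.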


In [246]:
# 'pd.to_datetime' converts the 'date' column into datetime format. 'format' specifies the format of the dates in the 'date' column.
# A lambda function is applied to each datetime object in the 'date' column; 'timestamp()' converts each datetime object to a Unix timestamp.
# "astype('int64')" converts the Unix timestamps (float) to 64-bit integers ('int32' would overflow for dates after January 2038).

df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y %H:%M')
df['unix_timestamp'] = df['date'].apply(lambda x: x.timestamp()).astype('int64')
df.head()

Unnamed: 0,date,symbol,open,close,low,high,volume,unix_timestamp
0,2016-01-05,0,123.43,125.839996,122.309998,126.25,2163600,1451952000
1,2016-01-06,0,125.239998,119.980003,119.940002,125.540001,2386400,1452038400
2,2016-01-07,0,116.379997,114.949997,114.93,119.739998,2489500,1452124800
3,2016-01-08,0,115.480003,116.620003,113.5,117.440002,2006300,1452211200
4,2016-01-11,0,117.010002,114.970001,114.089996,117.330002,1408600,1452470400
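The row-wise `apply` above can also be written as a single vectorized operation. The sketch below uses two date strings copied from the table above. It subtracts the Unix epoch and floor-divides by one second, which is the approach the pandas documentation describes for converting timestamps to epoch seconds. The variable names are illustrative.

```python
import pandas as pd

# Two date strings in the same format as the 'date' column
dates = pd.Series(["05-01-2016 00:00", "06-01-2016 00:00"])
parsed = pd.to_datetime(dates, format="%d-%m-%Y %H:%M")

# Seconds since 1970-01-01, computed for the whole column at once
unix_seconds = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

print(unix_seconds.tolist())  # [1451952000, 1452038400]
```

The result is an integer column, so no separate `astype` step is needed.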


We dealt with data types. Now, we can drop the 'date' column because we have 'unix_timestamp' column with us.

In [247]:
df1 = df.drop(columns=['date'])
df1.head()

Unnamed: 0,symbol,open,close,low,high,volume,unix_timestamp
0,0,123.43,125.839996,122.309998,126.25,2163600,1451952000
1,0,125.239998,119.980003,119.940002,125.540001,2386400,1452038400
2,0,116.379997,114.949997,114.93,119.739998,2489500,1452124800
3,0,115.480003,116.620003,113.5,117.440002,2006300,1452211200
4,0,117.010002,114.970001,114.089996,117.330002,1408600,1452470400


In [248]:
# Setting 'unix_timestamp' as the index because it allows easy time-based analysis.
# 'select_dtypes(include="number")' extracts only the numerical columns from the dataframe. We store the names of all numerical columns
# in 'numerical_cols'.

df1.set_index('unix_timestamp', inplace=True)
numerical_cols = df1.select_dtypes(include='number').columns
# numerical_cols = ['symbol', 'open', 'close', 'low', 'high', 'volume']

# Initialize an empty 'vectors' list.
# The 'for' loop iterates through every row of df1, with 'index' as the row label.
# A NumPy array of the values in the numerical columns of that row is stored in 'vector'.
# The NumPy array for each row is appended to the 'vectors' list.

vectors = []
for index, row in df1.iterrows():
    vector = np.array(row[numerical_cols])
    vectors.append(vector)

# A dataframe is constructed from the 'vectors' list, with 'unix_timestamp' as the index & the numerical columns as columns

df_vectors = pd.DataFrame(vectors, index=df1.index, columns=numerical_cols)
df_vectors.head()

unix_timestamp,symbol,open,close,low,high,volume
1451952000,0.0,123.43,125.839996,122.309998,126.25,2163600.0
1452038400,0.0,125.239998,119.980003,119.940002,125.540001,2386400.0
1452124800,0.0,116.379997,114.949997,114.93,119.739998,2489500.0
1452211200,0.0,115.480003,116.620003,113.5,117.440002,2006300.0
1452470400,0.0,117.010002,114.970001,114.089996,117.330002,1408600.0
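The row-by-row loop above makes the "one vector per row" idea explicit. The same matrix can also be obtained directly from the dataframe. The sketch below uses a small stand-in frame (`df_small`, with values copied from the first two rows shown above) to check that both routes give the same result.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; values copied from the first rows shown above
df_small = pd.DataFrame(
    {
        "open": [123.43, 125.239998],
        "close": [125.839996, 119.980003],
        "volume": [2163600, 2386400],
    },
    index=pd.Index([1451952000, 1452038400], name="unix_timestamp"),
)

# Row-by-row construction: one NumPy vector per row, stacked into a matrix
vectors = [np.array(row) for _, row in df_small.iterrows()]
matrix_loop = np.vstack(vectors)

# Direct conversion of the whole frame
matrix_direct = df_small.to_numpy()

print(np.array_equal(matrix_loop, matrix_direct))
```

In both cases the integer 'volume' column is upcast to float, which matches the `2163600.0` values in the output above.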


In [249]:
# Representing the vectors in matrix form. We limit the printed precision to 2 decimal places for a clean look.

matrix = df_vectors.values
np.set_printoptions(precision=2, suppress=True)
matrix[:5]

array([[      0.  ,     123.43,     125.84,     122.31,     126.25,
        2163600.  ],
       [      0.  ,     125.24,     119.98,     119.94,     125.54,
        2386400.  ],
       [      0.  ,     116.38,     114.95,     114.93,     119.74,
        2489500.  ],
       [      0.  ,     115.48,     116.62,     113.5 ,     117.44,
        2006300.  ],
       [      0.  ,     117.01,     114.97,     114.09,     117.33,
        1408600.  ]])

In [250]:
from sklearn.decomposition import PCA

# 'n_components=2' means the data is projected onto 2 principal components: new axes, each a linear combination of the original columns,
# chosen to capture the maximum variance in the original data.
# Then we fit the PCA model to our input 'matrix' & transform it into principal component space.

pca = PCA(n_components=2)
pca_result = pca.fit_transform(matrix)

df_pca_result = pd.DataFrame(pca_result, columns=['PC1', 'PC2'])
df_pca_result.head()

Unnamed: 0,PC1,PC2
0,1230443.0,-13.540134
1,1453243.0,-9.900835
2,1556343.0,2.428226
3,1073143.0,3.841186
4,475443.4,3.618077
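In the output above, the PC1 values are on the same scale as 'volume', which suggests that this one large-valued column dominates the first component when PCA is run on unscaled data. The sketch below illustrates the effect on synthetic data (not the real price table) by comparing `explained_variance_ratio_` with and without standardization. The `StandardScaler` step is an addition that is not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: two price-like columns and one volume-like column
price_a = rng.normal(120.0, 5.0, size=200)
price_b = price_a + rng.normal(0.0, 1.0, size=200)
volume = rng.normal(2_000_000.0, 500_000.0, size=200)
X = np.column_stack([price_a, price_b, volume])

# PCA on raw values: the large-scale column dominates the first component
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

On the raw data nearly all variance is assigned to the first component, while after scaling the price-like columns contribute as well.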


In [251]:
from sklearn.model_selection import train_test_split as TTS

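# Splitting our data (matrix) & our target variable ('high') into train & test sets, keeping the random state = 1, so we get the same train &
# test sets every time, test size = 0.3 or 30%.
# Note: 'matrix' still contains the 'high' column, so the target is also present among the input features.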

X_train, X_test, y_train, y_test = TTS(matrix, df['high'], random_state=1, test_size=0.3)

# Fitting the PCA model to 'X_train' & transforming into lower dimensional space to reduce complexity 

pca_components = pca.fit_transform(X_train)

In [252]:
from sklearn.linear_model import LinearRegression as LR

model = LR() # Creates a linear regression model object that you can train and use for predictions.
model.fit(pca_components, y_train) # Trains the model on the training data
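One possible way to arrange the split, PCA, and regression steps is a scikit-learn pipeline, which fits the PCA on the training rows only and lets the target column be kept out of the feature matrix. The sketch below uses synthetic data with illustrative column names, not the real 'prices.csv'. The `StandardScaler` step and the removal of 'high' from the features are assumptions that differ from the notebook's cells.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 99

# Synthetic stand-in for the price table
open_ = rng.normal(120.0, 5.0, size=n)
low = open_ - rng.uniform(0.5, 2.0, size=n)
close = open_ + rng.normal(0.0, 1.0, size=n)
high = np.maximum(open_, close) + rng.uniform(0.5, 2.0, size=n)
volume = rng.normal(2_000_000.0, 500_000.0, size=n)
data = pd.DataFrame(
    {"open": open_, "close": close, "low": low, "high": high, "volume": volume}
)

# Features exclude the target column
X = data.drop(columns=["high"])
y = data["high"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.3
)

# Scaler and PCA are fitted on the training rows only when the pipeline is fitted
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pipeline.fit(X_train, y_train)

# The same fitted transformations are applied to the test rows before predicting
predictions = pipeline.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)
```

Because the data here is synthetic, the printed error says nothing about the real dataset. The point is only the ordering of the steps.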

In [253]:
# Applying the PCA model to 'X_test'

pca_transformed_test = pca.transform(X_test)

# Predicting target variable ('high') from test data (pca_transformed_test)

predictions = model.predict(pca_transformed_test)
predictions

array([123.95, 127.38, 113.66, 125.05, 125.14, 114.99, 113.48, 126.64,
       125.83, 124.5 , 117.58, 113.43, 121.31, 124.85, 116.7 , 112.26,
       119.77, 120.88, 126.31, 111.81, 118.94, 117.96, 119.79, 119.94,
       113.35, 115.89, 125.  , 110.06, 118.06, 127.55])

In [254]:
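# Predicting 'high' for a single point given directly in principal component space (PC1, PC2).
# The values are taken from row 2 of 'df_pca_result'. Since 'pca' was refitted on 'X_train' in the cell above, coordinates from the
# earlier fit on the full 'matrix' are only approximately comparable.

new_data = np.array([1.556343e+06, 2.428226]).reshape(1, 2)
model.predict(new_data)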

array([119.68])

In [258]:
from sklearn.metrics import mean_squared_error
 
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.16383786799055503


In [259]:
lowest_high = df['high'].min()
highest_high = df['high'].max()
lowest_high, highest_high

(109.260002, 128.059998)

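Relatively low MSE: MSE is in squared units, so its square root (RMSE ≈ 0.40) is the fairer comparison against the target range of approximately 18.8 (128.06 - 109.26). An RMSE of about 2% of that range suggests the model's predictions are, on average, fairly close to the actual values. Note, however, that 'matrix' includes the 'high' column itself, so the target is part of the input to PCA, and this score likely overstates how well the model would predict 'high' from the other columns alone.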
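The comparison between the error and the target range can be made concrete with a short calculation. The sketch below takes the MSE and the minimum and maximum of 'high' reported above, converts the MSE to a root mean squared error (RMSE), and expresses it as a share of the range. The variable names `rmse`, `target_range`, and `relative_error` are illustrative.

```python
import math

mse = 0.16383786799055503              # MSE reported above
lowest_high = 109.260002               # minimum of 'high' reported above
highest_high = 128.059998              # maximum of 'high' reported above

rmse = math.sqrt(mse)                  # error in the same units as 'high'
target_range = highest_high - lowest_high
relative_error = rmse / target_range   # RMSE as a fraction of the range

print(f"RMSE: {rmse:.4f}")
print(f"Target range: {target_range:.2f}")
print(f"RMSE as a share of the target range: {relative_error:.2%}")
```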