### Use PCA Dimensionality Reduction to Rank the Importance of Each Feature

#### What we need to do, step by step
- Before doing this, we need to clean the data so that we can apply PCA.
1. Apply PCA to your dataset to reduce the dimensionality and identify the most important features.
2. Calculate the cumulative explained variance ratio for each principal component.
3. Identify the features that contribute to 90% of the information in the dataset.
4. Create a new model that only uses these important features.
5. Compare the performance of the new model with your previous model (Activity 2).

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [2]:
df = pd.read_csv('C:\\Users\\daivi\\Desktop\\CBD 2214 Big Data Fundamentals\\In Class Assignment\\Group 5 - In class Activity 3\\week5 inclass.csv')

In [3]:
df.head()

Unnamed: 0,symbol,exchange,date,adjusted close,option symbol,expiration,strike,call/put,style,ask,...,iv,volume,open interest,stock price for iv,*,delta,vega,gamma,theta,rho
0,SPY,NYSEArca,07/30/2021,438.51,SPY 210730C00215000,07/30/2021,215.0,C,A,224.56,...,-1.0,0,0,438.97,,0.0,0.0,0.0,0.0,0.0
1,SPY,NYSEArca,07/30/2021,438.51,SPY 210730P00215000,07/30/2021,215.0,P,A,0.01,...,-1.0,0,1401,438.97,,0.0,0.0,0.0,0.0,0.0
2,SPY,NYSEArca,07/30/2021,438.51,SPY 210730C00220000,07/30/2021,220.0,C,A,219.56,...,-1.0,0,1,438.97,,0.0,0.0,0.0,0.0,0.0
3,SPY,NYSEArca,07/30/2021,438.51,SPY 210730P00220000,07/30/2021,220.0,P,A,0.01,...,-1.0,50,328,438.97,,0.0,0.0,0.0,0.0,0.0
4,SPY,NYSEArca,07/30/2021,438.51,SPY 210730C00225000,07/30/2021,225.0,C,A,214.56,...,-1.0,0,0,438.97,,0.0,0.0,0.0,0.0,0.0


### Processing and Cleaning Data

In [4]:
# Check for missing values
df.isnull().sum()

symbol                     0
exchange                   0
date                       0
adjusted close             0
option symbol              0
expiration                 0
strike                     0
call/put                   0
style                      0
ask                        0
bid                        0
mean price                 0
iv                         0
volume                     0
open interest              0
stock price for iv         0
*                     192007
delta                      0
vega                       0
gamma                      0
theta                      0
rho                        0
dtype: int64

In [5]:
# Drop the '*' column: it is almost entirely empty (192,007 missing values) and carries no useful information
df.drop('*', axis=1, inplace=True)

# Checking for available columns
df.columns

Index(['symbol', 'exchange', 'date', 'adjusted close', 'option symbol',
       'expiration', 'strike', 'call/put', 'style', 'ask', 'bid', 'mean price',
       'iv', 'volume', 'open interest', 'stock price for iv', 'delta', 'vega',
       'gamma', 'theta', 'rho'],
      dtype='object')

In [6]:
# Check for duplicates
print(df.duplicated().sum())

0


#### There are no duplicate rows, so nothing needs to be dropped.

In [7]:
# Checking the data types
print(df.dtypes)

symbol                 object
exchange               object
date                   object
adjusted close        float64
option symbol          object
expiration             object
strike                float64
call/put               object
style                  object
ask                   float64
bid                   float64
mean price            float64
iv                    float64
volume                  int64
open interest           int64
stock price for iv    float64
delta                 float64
vega                  float64
gamma                 float64
theta                 float64
rho                   float64
dtype: object


#### The date and expiration columns are stored as object (string) values, so we need to convert them to datetime.

In [8]:
# Convert date column to datetime format
df['date'] = pd.to_datetime(df['date'])

# We also need to convert expiration column to datetime
df['expiration'] = pd.to_datetime(df['expiration'])
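
The sample rows shown earlier use dates such as `07/30/2021`, so passing an explicit format is one way to make the parsing unambiguous. The following is a minimal sketch on a small hypothetical frame (the `sample` data and the `days_to_expiration` column are invented for illustration), assuming every value follows the month/day/year layout:

```python
import pandas as pd

# Hypothetical sample mirroring the date layout seen in df.head()
sample = pd.DataFrame({
    "date": ["07/30/2021", "07/01/2021"],
    "expiration": ["07/30/2021", "08/20/2021"],
})

# An explicit format avoids day/month ambiguity and raises an error
# if a value does not match the expected layout.
for col in ["date", "expiration"]:
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y")

# Once both columns are datetimes, a numeric feature can be derived from them
sample["days_to_expiration"] = (sample["expiration"] - sample["date"]).dt.days
print(sample.dtypes)
```

A derived numeric column like this could be included in PCA, whereas the raw datetime columns are skipped by `select_dtypes(include=[np.number])`.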

In [9]:
# Checking the data types again to see the changed data type of date and expiration
print(df.dtypes)

symbol                        object
exchange                      object
date                  datetime64[ns]
adjusted close               float64
option symbol                 object
expiration            datetime64[ns]
strike                       float64
call/put                      object
style                         object
ask                          float64
bid                          float64
mean price                   float64
iv                           float64
volume                         int64
open interest                  int64
stock price for iv           float64
delta                        float64
vega                         float64
gamma                        float64
theta                        float64
rho                          float64
dtype: object


### Principal Component Analysis (PCA)

#### Principal Component Analysis (PCA) is a popular unsupervised learning technique for reducing the dimensionality of large data sets. It aims to make the data easier to interpret while minimizing information loss.

In [10]:
# Select only the numerical columns. PCA operates on numerical data, so the string and datetime
# columns (symbol, exchange, date, option symbol, expiration, call/put, style) are excluded.
# Note: this selection still includes 'bid', which is later used as the model target.
numerical_features = df.select_dtypes(include=[np.number]).columns

In [11]:
# Create a PCA object that keeps the fewest components needed to explain at least 90% of the variance
pca = PCA(n_components=0.9)

In [12]:
# Apply PCA to the numerical features of df and project the data onto the retained
# components (dimensionality reduction).
# fit_transform() is used instead of fit() because fit() only learns the components,
# whereas fit_transform() also returns the transformed data.
X_pca = pca.fit_transform(df[numerical_features])

# Reducing the number of inputs can simplify the model and may reduce overfitting.
# The retained components also help to reveal the underlying structure of the data.
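
PCA is sensitive to the scale of its inputs, because the components follow the directions of largest variance. The sketch below uses synthetic data (not the options dataset) to illustrate how standardizing with `StandardScaler` changes the explained variance ratios. The column scales are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: one large-scale column and two small-scale columns
X = np.column_stack([
    rng.normal(0, 1000, 500),  # large scale, e.g. a count-like column
    rng.normal(0, 1, 500),
    rng.normal(0, 1, 500),
])

# Without scaling, the large-scale column accounts for almost all of the variance
raw_ratio = PCA().fit(X).explained_variance_ratio_

# After standardization each column has unit variance, so no column dominates by scale alone
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print("unscaled:", raw_ratio.round(4))
print("scaled:  ", scaled_ratio.round(4))
```

Whether standardization is appropriate depends on the data, but it is a common preprocessing step when the columns use very different units.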

In [13]:
# Here, "variance" refers to how much of the information in the dataset is retained by the selected principal components.
# The explained variance ratio is the share of the total variance captured by each principal component.
# Note that these values describe the components, not the original features, despite the variable name.
feature_importance = pca.explained_variance_ratio_

# Print the explained variance ratio for each principal component
print("Explained Variance Ratio:")
print(feature_importance)

Explained Variance Ratio:
[0.6238646  0.37582501]


In [14]:
# PCA has reduced the 14 numerical features to two new features, the principal components.
# pca.components_ holds one row per component and one column per original feature (the loadings),
# which shows how each original variable contributes to each principal component.
principal_components = pca.components_

print("Principal Components:")
print(principal_components)

Principal Components:
[[ 3.11790501e-06 -6.63293932e-04 -1.42322508e-03 -1.39838102e-03
  -1.41074922e-03 -1.55729019e-06  4.02783705e-01  9.15291689e-01
   3.12873435e-06 -1.90285128e-06 -5.36857606e-06  2.42352696e-07
  -1.16968352e-06  5.50570903e-06]
 [-9.65425893e-06  1.00863603e-03  2.94481613e-04  2.88615078e-04
   2.91527957e-04 -5.62788258e-06  9.15294807e-01 -4.02782998e-01
  -9.69498211e-06 -4.92142587e-07 -2.47357523e-06  4.29057852e-07
  -1.31176164e-06 -2.91066752e-06]]


In [15]:
# Print the cumulative explained variance ratio
print("Cumulative Explained Variance Ratio:")
print(np.cumsum(feature_importance))

Cumulative Explained Variance Ratio:
[0.6238646  0.99968961]
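
The number of components needed to reach a given threshold can be read directly from the cumulative ratios. A minimal sketch, using the two ratios printed above and assuming a 90% threshold (the variable names are illustrative):

```python
import numpy as np

# Ratios copied from the explained variance output above
explained_variance_ratio = np.array([0.6238646, 0.37582501])
cumulative = np.cumsum(explained_variance_ratio)

# Position of the first component at which the cumulative ratio reaches the threshold
threshold = 0.9
n_components_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components_needed)
```

This matches what `PCA(n_components=0.9)` retained: the first component alone explains about 62% of the variance, so a second component is required to pass 90%.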


In [16]:
# Attempt to identify the features that contribute to 90% of the information.
# NOTE: feature_importance holds one explained variance ratio per principal component, not one per original feature.
# np.argsort(...)[::-1] orders the component indices from the largest ratio to the smallest.
# [:int(0.9 * len(feature_importance))] keeps the first 90% of those indices (here int(0.9 * 2) = 1 index).
# Indexing numerical_features with a component index therefore selects a column by position only.
important_features = np.argsort(feature_importance)[::-1][:int(0.9 * len(feature_importance))]
print("Important Features:")
print(numerical_features[important_features])

Important Features:
Index(['adjusted close'], dtype='object')


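#### Caution: the index selected above identifies the first principal component, not an original feature, so 'adjusted close' appears only because it is the first numerical column. The loadings printed in cell 14 indicate that PC1 is dominated by 'open interest' (0.915) and 'volume' (0.403).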
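
Explained variance ratios describe principal components rather than original columns, so one way to relate the result back to the features is to inspect the loadings in `pca.components_`. The sketch below rebuilds the loadings from the values printed earlier. The weighting scheme (squared loadings weighted by explained variance ratio) is an assumption chosen for illustration, not a built-in scikit-learn measure, and the column order is assumed to match `df.dtypes`.

```python
import numpy as np
import pandas as pd

# Numeric columns in the order shown by df.dtypes
numeric_columns = ["adjusted close", "strike", "ask", "bid", "mean price", "iv",
                   "volume", "open interest", "stock price for iv",
                   "delta", "vega", "gamma", "theta", "rho"]

# Rows copied from the printed pca.components_ output
components = np.array([
    [3.11790501e-06, -6.63293932e-04, -1.42322508e-03, -1.39838102e-03,
     -1.41074922e-03, -1.55729019e-06, 4.02783705e-01, 9.15291689e-01,
     3.12873435e-06, -1.90285128e-06, -5.36857606e-06, 2.42352696e-07,
     -1.16968352e-06, 5.50570903e-06],
    [-9.65425893e-06, 1.00863603e-03, 2.94481613e-04, 2.88615078e-04,
     2.91527957e-04, -5.62788258e-06, 9.15294807e-01, -4.02782998e-01,
     -9.69498211e-06, -4.92142587e-07, -2.47357523e-06, 4.29057852e-07,
     -1.31176164e-06, -2.91066752e-06],
])
explained_variance_ratio = np.array([0.6238646, 0.37582501])

# One row per original feature, one column per principal component
loadings = pd.DataFrame(components.T, index=numeric_columns, columns=["PC1", "PC2"])

# Illustrative score: squared loadings weighted by each component's explained variance ratio
score = (loadings ** 2).mul(explained_variance_ratio, axis=1).sum(axis=1)
print(score.sort_values(ascending=False).head(3))
```

Under this scoring, the two retained components are driven almost entirely by `open interest` and `volume`, the columns with the largest raw variance.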

In [17]:
# Create a new DataFrame that combines the identifier columns with the PCA-transformed data
pca_df = pd.DataFrame(X_pca, columns=[f"PC{i+1}" for i in range(X_pca.shape[1])])
pca_df = pd.concat([df[['symbol', 'date', 'option symbol']], pca_df], axis=1)

In [18]:
pca_df.head()

Unnamed: 0,symbol,date,option symbol,PC1,PC2
0,SPY,2021-07-30,SPY 210730C00215000,-1684.131537,297.544215
1,SPY,2021-07-30,SPY 210730P00215000,-400.860024,-266.950643
2,SPY,2021-07-30,SPY 210730C00220000,-1683.1984,297.142102
3,SPY,2021-07-30,SPY 210730P00220000,-1362.832137,211.005298
4,SPY,2021-07-30,SPY 210730C00225000,-1684.095847,297.545555


In [19]:
pca_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221046 entries, 0 to 221045
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   symbol         221046 non-null  object        
 1   date           221046 non-null  datetime64[ns]
 2   option symbol  221046 non-null  object        
 3   PC1            221046 non-null  float64       
 4   PC2            221046 non-null  float64       
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 8.4+ MB


In [20]:
pca_df.describe()

Unnamed: 0,date,PC1,PC2
count,221046,221046.0,221046.0
mean,2021-07-16 12:04:43.379930624,4.634578e-14,-6.214547e-14
min,2021-07-01 00:00:00,-1684.819,-67168.04
25%,2021-07-09 00:00:00,-1681.889,37.63409
50%,2021-07-16 00:00:00,-1583.735,280.7696
75%,2021-07-23 00:00:00,-783.6607,297.5625
max,2021-07-30 00:00:00,231139.0,385480.2
std,,6245.466,4847.441


In [21]:
# Select the column(s) identified above.
# Note: important_data is not used by the model below, which is trained on X_pca.
important_data = df[numerical_features[important_features]]

In [22]:
# Split the principal components (features) and 'bid' (target) into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, df['bid'], test_size=0.2, random_state=42)

# Define hyperparameters
n_estimators = 50  # Increased from 10 to 50
max_depth = 5  # Set max_depth to 5
min_samples_split = 5  # Increased from 2 to 5
min_samples_leaf = 2  # Increased from 1 to 2

new_model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  min_samples_split=min_samples_split,
                                  min_samples_leaf=min_samples_leaf,
                                  random_state=42)

In [23]:
new_model.fit(X_train, y_train)
new_model_accuracy = new_model.score(X_test, y_test)  # R-squared on the testing set

# Evaluate the model on the training set
y_pred = new_model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
r2 = r2_score(y_train, y_pred)

# Print evaluation metrics
print(f"Mean Squared Error of training set: {mse}")
print(f"R-squared of training set: {r2}")

# Evaluate the model on the testing set
y_pred = new_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  
r2 = r2_score(y_test, y_pred)  

# Print evaluation metrics
print(f"Mean Squared Error of testing set: {mse}")
print(f"R-squared of testing set: {r2}")

Mean Squared Error of training set: 1402.1624495101853
R-squared of training set: 0.6112504389358295
Mean Squared Error of testing set: 1363.1412080753707
R-squared of testing set: 0.6103496004987162


* Mean Squared Error (MSE):

    With PCA: MSE is much higher for both the training and testing sets (1402.16 and 1363.14, respectively).
    Without PCA (Activity 2): MSE is much lower for both the training and testing sets (0.00226 and 0.01254, respectively).
    Lower MSE values indicate better performance, so the model without PCA has a significant advantage in terms of MSE.


* R-squared (R²):

    With PCA: R² values are moderate for both the training and testing sets (0.61125 and 0.61035, respectively).
    Without PCA (Activity 2): R² values are extremely high for both the training and testing sets (0.99999 and 0.99999, respectively).
    Higher R² values indicate better performance, so the model without PCA has a significant advantage in terms of R².


* Comparison:

    Overall, the model without PCA performs significantly better than the model with PCA: its MSE values are much lower and its R² values are much higher. This suggests that it fits the data better and generalizes well to new data.

    In contrast, the model with PCA has higher MSE values and lower R² values, indicating that the two principal components are less effective at predicting the target variable. One likely reason is that PCA was applied to unscaled features, so the retained components mostly reflect the columns with the largest variance (volume and open interest, judging by the loadings printed earlier).

    Therefore, based on these metrics, the model without PCA is the better model.
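
For a like-for-like comparison, one option is to keep the target out of the PCA inputs and to fit the scaler and PCA on the training split only. The sketch below shows this with a scikit-learn pipeline on synthetic data from `make_regression`. The data and its parameters are assumptions for illustration and do not reproduce the results above.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the options data)
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The target is kept out of X, and the scaler and PCA are fitted on the training split only
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.9),
    RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42),
)
pipeline.fit(X_train, y_train)

print("Components kept:", pipeline.named_steps["pca"].n_components_)
print("Test R-squared:", r2_score(y_test, pipeline.predict(X_test)))
```

Wrapping the steps in a pipeline means the same scaling and projection learned from the training data are applied to the test data, which avoids leaking test information into the preprocessing.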