# Modeling using regression

# Spectral Data Analysis with PCA and PLS

This repository contains Jupyter Notebook files for performing spectral data analysis using Principal Component Analysis (PCA) and Partial Least Squares (PLS) techniques. The analysis aims to identify important wavelengths in the raw spectra and build a predictive model for further insights.

## Overview

In this analysis, we utilize Jupyter Notebook as an interactive computational environment to explore, preprocess, and model spectral data. The main objectives and steps involved are:

1. **Data Loading**: Load the raw spectral data into the notebook.
2. **Data Preprocessing**: Preprocess the raw spectra if needed (e.g., normalization, baseline correction, smoothing).
3. **Principal Component Analysis (PCA)**:
   - Apply PCA to the preprocessed spectral data.
   - Visualize the explained variance ratio to determine the number of principal components to retain.
   - Plot the scores and loadings plots to explore the data structure and identify important wavelengths.
4. **Partial Least Squares (PLS)**:
   - Split the data into training and testing sets.
   - Apply PLS regression to build a predictive model using the training set.
   - Evaluate the model performance on the testing set (e.g., using metrics like R-squared, RMSE).
   - Visualize the predicted versus actual values to assess model performance.
5. **Conclusion and Interpretation**: Summarize the findings from PCA and PLS analyses, interpret the results, and discuss potential implications.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-Housing-issue/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-Housing-issue'

---

# Step 1: Load Data

In [4]:
import numpy as np
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")

print(df.shape)
df.head(3)

(1460, 24)


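| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 |  | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 |  | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 |  | 2001 | 2002 | 223500 |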


---

# Step 2: Data Preprocessing

## ML pipeline for Data Cleaning

In [5]:
import pandas as pd
import numpy as np

def snv_dataframe(df):
    """
    Perform Standard Normal Variate (SNV) transformation on spectra in a pandas DataFrame.
    
    Parameters:
    df (pandas.DataFrame): DataFrame containing spectral data. Each row represents a sample,
                            and each column represents a wavelength.
    
    Returns:
    pandas.DataFrame: DataFrame containing SNV-transformed spectra.
    """
    # Calculate mean and standard deviation of each spectrum
    mean_spectrum = df.mean(axis=1)
    std_spectrum = df.std(axis=1)
    
    # Perform SNV transformation
    snv_df = (df.sub(mean_spectrum, axis=0)).div(std_spectrum, axis=0)
    
    return snv_df

# Example usage:
# Assuming 'raw_data.csv' contains your raw spectral data in a CSV file with each row representing a sample,
# and each column representing a wavelength.

# Load raw spectral data into a pandas DataFrame
raw_df = pd.read_csv('raw_data.csv')

# Perform SNV transformation on the DataFrame
snv_df = snv_dataframe(raw_df)

# Save SNV-transformed data to a CSV file
snv_df.to_csv('snv_data.csv', index=False)
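
To see what the SNV transformation does, here is a small worked check on made-up numbers. The two toy spectra and the column names `w1` to `w4` are hypothetical; the arithmetic is the same row-wise centring and scaling used in `snv_dataframe` above. The second spectrum is the first multiplied by 10, so after SNV the two rows should coincide.

```python
import numpy as np
import pandas as pd

# Two hypothetical spectra sampled at four wavelengths (made-up values).
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [10.0, 20.0, 30.0, 40.0]],
    columns=["w1", "w2", "w3", "w4"],
)

# Same arithmetic as snv_dataframe: centre and scale each row (spectrum).
row_mean = toy.mean(axis=1)
row_std = toy.std(axis=1)  # pandas default is the sample std (ddof=1)
toy_snv = toy.sub(row_mean, axis=0).div(row_std, axis=0)

print(toy_snv.round(3))
```

Each transformed row has mean 0 and (sample) standard deviation 1, which is why SNV removes multiplicative scaling differences between spectra.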



## PCA optimization

In [8]:
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

def plot_pca_variance(pca):
    """
    Plot the explained variance ratio of PCA components.
    
    Parameters:
    pca (sklearn.decomposition.PCA): PCA object fitted to data.
    """
    plt.figure(figsize=(8, 5))
    plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o', linestyle='-')
    plt.title('Explained Variance Ratio')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    plt.grid(True)
    plt.show()

def plot_cumulative_variance(pca):
    """
    Plot the cumulative explained variance ratio of PCA components.
    
    Parameters:
    pca (sklearn.decomposition.PCA): PCA object fitted to data.
    """
    cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
    plt.figure(figsize=(8, 5))
    plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, marker='o', linestyle='-')
    plt.title('Cumulative Explained Variance Ratio')
    plt.xlabel('Number of Principal Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    plt.grid(True)
    plt.show()

# Example usage:
# Assuming 'snv_data.csv' contains your SNV-transformed spectral data in a CSV file.

# Load SNV-transformed spectral data into a pandas DataFrame
snv_df = pd.read_csv('snv_data.csv')

# Perform PCA on the DataFrame
pca = PCA()
pca.fit(snv_df)

# Plot explained variance ratio
plot_pca_variance(pca)

# Plot cumulative explained variance ratio
plot_cumulative_variance(pca)

# Continue with further analysis as needed
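
The overview also mentions scores and loadings for identifying important wavelengths. A minimal sketch of that step is given below, assuming synthetic data in place of the SNV-transformed spectra: the wavelength labels, the injected variance at one wavelength, and the name `demo_df` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for SNV-transformed spectra (assumption): 60 samples,
# 20 wavelengths, with extra variance injected at one wavelength.
rng = np.random.default_rng(1)
wavelengths = [f"{w}nm" for w in range(400, 600, 10)]
X = rng.normal(scale=0.1, size=(60, 20))
X[:, 7] += rng.normal(scale=2.0, size=60)  # dominant source of variation
demo_df = pd.DataFrame(X, columns=wavelengths)

pca = PCA(n_components=3)
scores = pca.fit_transform(demo_df)  # sample coordinates, shape (60, 3)

# Loadings: one row per wavelength, one column per principal component.
loadings = pd.DataFrame(
    pca.components_.T,
    index=demo_df.columns,
    columns=["PC1", "PC2", "PC3"],
)

# Wavelengths with the largest absolute PC1 loading contribute most to PC1.
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(3)
print(top_pc1)
```

On real spectra, plotting each loading column against wavelength usually gives a clearer picture than a ranked list, since neighbouring wavelengths tend to be strongly correlated.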


## PLS

### PLS Regression

Four model types show a very strong correlation (highest max score): ExtraTreesRegressor, LinearRegression, GradientBoostingRegressor, and RandomForestRegressor.

### Do an extensive search on the most suitable algorithm to find the best hyperparameter configuration.

Define the models and parameter grids for the extensive search.

These definitions were obtained from the Code Institute curriculum.

In [11]:

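from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression

models_search = {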
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

params_search = {
    "ExtraTreesRegressor":{'model__n_estimators': [100,50,150],
                           'model__max_depth': [None, 3, 15],
                           'model__min_samples_split': [2, 50],
                           'model__min_samples_leaf': [1,50],
    },

    "LinearRegression":{},

    "RandomForestRegressor":{'model__n_estimators': [100,50, 140],
                             'model__max_depth': [None,4, 15],
                             'model__min_samples_split': [2,50],
                             'model__min_samples_leaf': [1,50],
                             'model__max_leaf_nodes': [None,50],
    },
    
    "GradientBoostingRegressor":{'model__n_estimators': [100,50,140],
                                 'model__learning_rate':[0.1, 0.01, 0.001],
                                 'model__max_depth': [3,15, None],
                                 'model__min_samples_split': [2,50],
                                 'model__min_samples_leaf': [1,50],
                                 'model__max_leaf_nodes': [None,50],
    }
}
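
The `model__` prefix on the parameter names implies that each estimator is a pipeline step named `model`. A minimal sketch of how such dictionaries might be fed to a grid search is shown below. It assumes scikit-learn's `Pipeline` and `GridSearchCV`, uses reduced hypothetical dictionaries (`demo_models`, `demo_params`) and synthetic data so that it runs quickly, and is not the curriculum's own search helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced, hypothetical versions of models_search / params_search for speed.
demo_models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}
demo_params = {
    "LinearRegression": {},
    "RandomForestRegressor": {"model__n_estimators": [10, 20],
                              "model__max_depth": [None, 4]},
}

# Synthetic regression data (assumption), standing in for the real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=80)

results = {}
for name, model in demo_models.items():
    # The step is named "model" so that "model__<param>" keys resolve.
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    grid = GridSearchCV(pipe, demo_params[name], cv=3, scoring="r2")
    grid.fit(X, y)
    results[name] = (grid.best_score_, grid.best_params_)

for name, (score, best) in results.items():
    print(f"{name}: mean CV R2 = {score:.3f}, best params = {best}")
```

An empty grid, as for LinearRegression, simply cross-validates the estimator once with its default settings.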

### PLS-DA

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import accuracy_score

# Example usage:
# Assuming 'data.csv' contains your spectral data and 'targets.csv' contains corresponding class labels for classification.

# Load spectral data and targets into pandas DataFrames
data_df = pd.read_csv('data.csv')
targets_df = pd.read_csv('targets.csv')

# Assumption: the class labels are in the first column of targets.csv
y = targets_df.iloc[:, 0]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data_df, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PLS Discriminant Analysis (PLS-DA)
# scikit-learn has no PLSDA class, so PLS-DA is approximated here by fitting
# PLSRegression to one-hot-encoded class labels (one column per class).
classes = np.unique(y_train)
Y_train_onehot = (y_train.to_numpy()[:, None] == classes).astype(float)
pls_da = PLSRegression(n_components=2)
pls_da.fit(X_train_scaled, Y_train_onehot)

# Predict classes on the testing set: pick the class with the largest predicted response
y_pred_classes = classes[np.argmax(pls_da.predict(X_test_scaled), axis=1)]

# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred_classes)
print("Accuracy Score (PLS-DA):", accuracy)


## Conclusion

Results