## End member mixing analysis (EMMA) to determine streamflow source contributions

### Finally running an EMMA using linear regression

#### Here we started with solute data from Hungerford Brook late winter/early spring flow events captured with ISCOs. Data include:
- ICP-OES (Al, Ca, Cu, Fe, K, Mg, Mn, Na, P, Zn, Si)
- IC and total elemental analyser data (Cl, SO4, NO3, PO4, TOC, DIN)
- Stable isotopes (dD, d18O)

Data are from the BREE OneDrive directory (Watershed Data>1_Projects>EMMA>Working file for MATLAB 2023)

- For HB 2022 timeseries, 17 parameters total
- 5 were found to be relatively conservative: Cl, Ca, Na, Si, and Mg
- See "bivariates" notebook for those plots
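
One way to screen for conservative behaviour numerically, alongside the bivariate plots, is to compute the squared Pearson correlation for every solute pair: tracers that mix conservatively tend to plot as straight lines against each other. The sketch below uses synthetic data; the values, the `pairwise_r2` helper, and the `NO3_mg_L` column name are assumptions for illustration, not the HB 2022 data or the method used in the "bivariates" notebook.

```python
import numpy as np
import pandas as pd

def pairwise_r2(df: pd.DataFrame) -> pd.DataFrame:
    """Squared Pearson correlation between every pair of solute columns."""
    return df.corr(method="pearson") ** 2

# Synthetic stand-in for stream samples (hypothetical values)
rng = np.random.default_rng(0)
mix = rng.uniform(0, 1, 50)  # fraction of one water source in each sample
demo = pd.DataFrame({
    "Cl_mg_L": 2 + 8 * mix + rng.normal(0, 0.2, 50),   # tracks mixing
    "Na_mg_L": 1 + 5 * mix + rng.normal(0, 0.2, 50),   # tracks mixing
    "NO3_mg_L": rng.uniform(0, 3, 50),                 # unrelated to mixing
})

r2 = pairwise_r2(demo)
print(r2.round(2))
```

Pairs with a high r² (Cl and Na here) are candidates for conservative tracers; the cutoff to apply is a judgment call and should be checked against the plots.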

The PCA step that feeds this analysis selects the relevant parameters (conservative solutes), standardizes the data (as Inamdar emphasizes), and then applies PCA. A plot of cumulative explained variance against the number of principal components shows how many components are needed to retain a chosen share of the variance. The EMMA code in this section then loads the saved PCA results and takes the first two principal components as the mixing space.
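
The standardize-then-PCA step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code that produced `pca_result_streamwater.csv`: the three hypothetical sources and their concentrations are assumptions, and only the tracer column names are taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

conservative_tracers = ['Ca_mg_L', 'Na_mg_L', 'Si_mg_L', 'Cl_mg_L', 'Mg_mg_L']

# Synthetic stand-in: 69 samples mixed from three hypothetical sources
rng = np.random.default_rng(1)
sources = np.array([[30.0, 5.0, 2.0, 10.0, 8.0],
                    [10.0, 12.0, 6.0, 3.0, 3.0],
                    [3.0, 2.0, 1.0, 1.0, 1.0]])
fractions = rng.dirichlet([1, 1, 1], size=69)
stream = pd.DataFrame(fractions @ sources + rng.normal(0, 0.05, (69, 5)),
                      columns=conservative_tracers)

# Standardize each tracer to zero mean and unit variance, then fit PCA
scaled = StandardScaler().fit_transform(stream)
pca = PCA().fit(scaled)

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k} component(s): {share:.3f}")
```

Plotting `cumulative` against the component number gives the explained-variance curve; in this synthetic three-source case two components carry nearly all of the variance.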

In [8]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Load the PCA results
pca_data = pd.read_csv("/home/millieginty/Documents/git-repos/EMMA/analysis/pca_result_streamwater.csv")  # Use the file with PCA results

# Load the endmembers and streamflow data
endmembers_data = pd.read_csv("/home/millieginty/Documents/git-repos/EMMA/data/end_members_2022_HB_mean_for_emma.csv")
streamflow_data = pd.read_csv("/home/millieginty/Documents/git-repos/EMMA/data/Data_for_EMMA_2022_HB.csv")

# Select the columns corresponding to the concentrations of conservative tracers
conservative_tracers = ['Ca_mg_L', 'Na_mg_L', 'Si_mg_L', 'Cl_mg_L', 'Mg_mg_L']
streamflow_data = streamflow_data[conservative_tracers]

# Combine PCA results with streamflow data
combined_data = pd.concat([pca_data, streamflow_data], axis=1)

# Select the first two principal components as the mixing space
mixing_space = combined_data[['Principal Component 1', 'Principal Component 2']]

# Create an empty dataframe to store endmember contributions
endmembers_contributions = pd.DataFrame(index=mixing_space.index)

# Iterate over each unique endmember type
for endmember_type in endmembers_data['Type'].unique():
    # Select data for the current endmember type
    endmember_data_type = endmembers_data[endmembers_data['Type'] == endmember_type]

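    # Shape the end-member tracer values as (n_rows, n_tracers).
    # This gives one row per end-member record, not one row per stream sample.
    endmember_data_reshaped = endmember_data_type[conservative_tracers].values.reshape(-1, len(conservative_tracers))

    # Fit the linear regression model
    # NOTE: this call raises the ValueError shown below, because mixing_space
    # has one row per stream sample while the end-member array has a single row.
    model = LinearRegression()
    model.fit(mixing_space, endmember_data_reshaped)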

    # Store the contributions
    contributions = model.coef_.T @ mixing_space.T
    endmembers_contributions[endmember_type] = contributions.T

# Normalize contributions to sum to 1 for each sample
contributions_normalized = endmembers_contributions.div(endmembers_contributions.sum(axis=1), axis=0)

# Add the normalized contributions to the streamflow data
streamflow_with_contributions = pd.concat([streamflow_data, contributions_normalized], axis=1)

# Save or further analyze the results
streamflow_with_contributions.to_csv("/home/millieginty/Documents/git-repos/EMMA/analysis/streamflow_with_contributions.csv", index=False)

ValueError: Found input variables with inconsistent numbers of samples: [69, 1]
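
The error comes from `model.fit(mixing_space, endmember_data_reshaped)`: scikit-learn requires `X` and `y` to have the same number of rows, but the mixing space has 69 rows (stream samples) and the end-member array has 1 (the end-member mean). A common way to set up EMMA instead is to project the end members into the same PCA space as the stream samples and then solve, for each sample, a small linear system in which the end-member fractions reproduce the sample's scores and sum to 1. The sketch below shows that formulation on synthetic data. The end-member names and concentrations are hypothetical, and the PCA is refit here because the component loadings are needed to project the end members; it is not a drop-in replacement for the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

conservative_tracers = ['Ca_mg_L', 'Na_mg_L', 'Si_mg_L', 'Cl_mg_L', 'Mg_mg_L']

# Hypothetical end-member means, one row per Type (not the HB 2022 values)
end_members = pd.DataFrame(
    [[30.0, 5.0, 2.0, 10.0, 8.0],
     [10.0, 12.0, 6.0, 3.0, 3.0],
     [3.0, 2.0, 1.0, 1.0, 1.0]],
    index=['Groundwater', 'Soil water', 'Snowmelt'],
    columns=conservative_tracers,
)

# Synthetic stream samples built as known mixtures of the end members
rng = np.random.default_rng(42)
true_fractions = rng.dirichlet([1, 1, 1], size=69)
stream = pd.DataFrame(true_fractions @ end_members.values,
                      columns=conservative_tracers)

# Standardize with the stream statistics and fit a two-component PCA on the stream data
scaler = StandardScaler().fit(stream)
pca = PCA(n_components=2).fit(scaler.transform(stream))

# Project both stream samples and end members into the same mixing space
stream_scores = pca.transform(scaler.transform(stream))      # shape (69, 2)
em_scores = pca.transform(scaler.transform(end_members))     # shape (3, 2)

# Solve A @ f = b for every sample at once:
# two rows match the PCA scores, the last row forces the fractions to sum to 1
A = np.vstack([em_scores.T, np.ones(len(end_members))])      # shape (3, 3)
B = np.vstack([stream_scores.T, np.ones(len(stream))])       # shape (3, 69)
solved, *_ = np.linalg.lstsq(A, B, rcond=None)               # shape (3, 69)

contributions = pd.DataFrame(solved.T, columns=end_members.index,
                             index=stream.index)
print(contributions.head())
```

With real data the solved fractions can fall below 0 or above 1 when a sample plots outside the triangle formed by the end members in the mixing space, which is itself a useful check on the choice of end members.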