# Multiple Linear Regression Analysis

This script fits a multiple linear regression (ordinary least squares) to examine how environmental factors relate to SOC (Soil Organic Carbon) levels.

## 📂 Input

- **CSV File**: A dataset containing SOC measurements and environmental variables.
- Required column: `SOC`
- Grouping variables (`Year`, `Species`, `Tre`, `Sub`) must be present in the file; they are dropped before the regression is fitted.

## 🧪 Method

- Linear regression is performed using the `statsmodels` library.
- The model includes an intercept term.
- Predictors are all columns that remain after dropping `SOC` and the grouping identifiers; these columns must be numeric.

## 📤 Output

- **Console Output**: Full regression summary printed in the terminal.
- **CSV File**: Regression summary saved to `regression_results.csv`.


In [None]:
import pandas as pd
import statsmodels.api as sm

# Load the CSV file
file_path = r'..SoC.csv'  # Replace with the actual path
df = pd.read_csv(file_path)

# Strip whitespace from column names to ensure accurate matching
df.columns = df.columns.str.strip()

# Ensure 'SOC' column exists
if 'SOC' not in df.columns:
    raise KeyError("SOC column not found in the data.")

# Separate the target variable (SOC) and predictor variables (excluding grouping variables)
X = df.drop(columns=['SOC', 'Year', 'Species', 'Tre', 'Sub'])  # Drop non-predictive grouping columns
X = sm.add_constant(X)  # Add an intercept term
Y = df['SOC']

# Build the multiple linear regression model
model = sm.OLS(Y, X).fit()

# Output the regression summary
results_summary = model.summary()
print(results_summary)

# Save regression results to CSV
output_csv_path = r'..regression_results.csv'
with open(output_csv_path, 'w') as f:
    f.write(results_summary.as_csv())

print(f"Regression results saved to {output_csv_path}")


# Standardized Linear Regression of SOC and Environmental Factors

This script performs a standardized multiple linear regression analysis to quantify how various environmental factors contribute to **SOC (Soil Organic Carbon)**.

## 📂 Input

- **CSV File**: A dataset containing SOC measurements and corresponding environmental parameters.
- Required column: `SOC`
- Grouping columns (`Year`, `Species`, `Tre`, `Sub`) are excluded from modeling.

## ⚙️ Methodology

1. **Preprocessing**:
   - Drop the grouping columns.
   - Standardize the environmental variables to zero mean and unit variance.
   - Add an intercept term.

2. **Modeling**:
   - Fit the regression with `statsmodels.api.OLS`.
   - Extract coefficients and p-values, and take the absolute value of each coefficient as its standardized effect size.

3. **Visualization**:
   - A horizontal bar plot displays the coefficient of each factor.
   - Red bars indicate statistically significant variables (p < 0.05); gray bars are non-significant.
   - P-values are annotated next to each bar.

## 📤 Outputs

- 📄 `regression_results01.csv`: Regression coefficients, p-values, and standardized effect sizes.
- 📊 `regression_plot.png`: Visualization of variable importance and statistical significance.
- 📋 Console printout of regression summary (R², coefficients, t-stats, etc.)

## 📦 Requirements

```bash
pip install pandas statsmodels scikit-learn matplotlib
```


In [None]:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load CSV file
file_path = r'..SoC.csv'  # Replace with the actual file path
df = pd.read_csv(file_path)

# Strip extra spaces from column names to ensure correct matching
df.columns = df.columns.str.strip()

# Ensure the target column 'SOC' exists
if 'SOC' not in df.columns:
    raise KeyError("SOC column not found in the data.")

# Define grouping columns to exclude
group_cols = ['Year', 'Species', 'Tre', 'Sub']

# Separate predictor variables (environmental factors) and response variable (SOC)
X = df.drop(columns=['SOC'] + group_cols)
Y = df['SOC']

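# Standardize predictor variables (zero mean, unit variance).
# Wrapping the result in a DataFrame keeps the column names in the regression summary.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
X_scaled = sm.add_constant(X_scaled)  # Add intercept term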

# Fit multiple linear regression model
model = sm.OLS(Y, X_scaled).fit()

# Display regression summary
print(model.summary())

# Extract coefficients and p-values
coefficients = model.params
p_values = model.pvalues

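# Effect size: absolute value of each coefficient. Predictors are standardized,
# so magnitudes are comparable across predictors (SOC itself is not rescaled).
standardized_coefficients = abs(coefficients)

# Combine regression results into a DataFrame.
# list(...) discards the statsmodels labels so rows get a plain 0..n-1 index.
results_df = pd.DataFrame({
    'Variable': ['Intercept'] + list(X.columns),
    'Coefficient': list(coefficients),
    'P_value': list(p_values),
    'Standardized_Coefficient': list(standardized_coefficients)
})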

# Save regression results to CSV
regression_csv_path = r'..regression_results01.csv'
results_df.to_csv(regression_csv_path, index=False)
print(f"Regression results saved to {regression_csv_path}")

# Visualize contributions using a horizontal bar plot
plt.figure(figsize=(10, 6))
colors = ['red' if p < 0.05 else 'gray' for p in results_df['P_value']]
plt.barh(results_df['Variable'], results_df['Coefficient'], color=colors)
plt.xlabel('Coefficient')
plt.title(f'Factor Contributions to SOC (R² = {round(model.rsquared, 2)})')

# Annotate p-values at the end of each bar (right of positive bars, left of negative ones)
for index, (coef, p) in enumerate(zip(results_df['Coefficient'], results_df['P_value'])):
    plt.text(coef, index, f" p = {p:.3f} ", va='center', ha='left' if coef >= 0 else 'right')

plt.tight_layout()
plt.savefig(r'..regression_plot.png')
plt.show()


# Multiple Linear Regression with Interaction Terms for SOC Analysis

This script performs a multiple linear regression analysis to quantify the contribution of environmental factors, and of pairwise interactions among the grouping variables, to **Soil Organic Carbon (SOC)**.

## 📂 Data Requirements

- Input format: CSV
- Must contain the `SOC` column as the response variable.
- Grouping variables expected: `Year`, `Species`, `Tre`, `Sub`
- Additional columns are treated as numeric environmental predictors.

## ⚙️ Methodology

### 1. Preprocessing
- Removes leading/trailing spaces from column names.
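- Drops the grouping columns as standalone predictors.
- Constructs pairwise interaction terms among the grouping variables by joining their values (e.g. `Year_x_Species`).
- Applies one-hot encoding to the resulting categorical columns, dropping the first level of each.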
- Standardizes all predictor variables.
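
The interaction and encoding steps above can be sketched on a toy table. The rows below are invented for illustration, and only three of the four grouping columns are used to keep the output small. Note that joining values this way turns each observed combination into its own category, which is a modeling choice and not the only way to define an interaction.

```python
from itertools import combinations

import pandas as pd

# Toy grouping data (values are made up for illustration)
df = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021],
    "Species": ["A", "B", "A", "B"],
    "Tre": ["T1", "T1", "T2", "T2"],
})
group_cols = ["Year", "Species", "Tre"]

# One label column per pair of grouping variables, e.g. "2020_A" for Year x Species
X = pd.DataFrame(index=df.index)
for col1, col2 in combinations(group_cols, 2):
    X[f"{col1}_x_{col2}"] = df[col1].astype(str) + "_" + df[col2].astype(str)

# One-hot encode; drop_first removes one reference level per interaction column
dummies = pd.get_dummies(X, drop_first=True, dtype=float)
print(list(dummies.columns))
```

Each interaction column with k distinct labels contributes k - 1 dummy columns, so the number of predictors can grow quickly relative to the number of rows.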

### 2. Modeling
- Fits a multiple linear regression model using `statsmodels.api.OLS`.
- Extracts coefficients, p-values, and standardized coefficients.

### 3. Output
- Saves regression results to `regression_results02.csv`.
- Visualizes the influence of each factor and interaction using a horizontal bar plot:
  - 🔴 Red bars: significant variables (p < 0.05)
  - ⚪ Gray bars: non-significant
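
The coloring and annotation rule can be expressed as a small helper. This is a sketch only: `bar_style` is a hypothetical function, not part of the script, and the coefficients and p-values in the example call are invented.

```python
def bar_style(coefs, p_values, alpha=0.05):
    """Return (color, horizontal alignment, label) for each bar.

    Significant bars are red and carry a p-value label; the alignment places
    the label beyond the end of the bar for both positive and negative values.
    """
    styles = []
    for coef, p in zip(coefs, p_values):
        color = "red" if p < alpha else "gray"
        ha = "left" if coef >= 0 else "right"
        label = f"p = {p:.3f}" if p < alpha else ""
        styles.append((color, ha, label))
    return styles


# Example with made-up coefficients and p-values
styles = bar_style([0.8, -0.4, 0.1], [0.001, 0.03, 0.6])
for style in styles:
    print(style)
```

With matplotlib, each tuple could be passed on as `plt.text(coef, i, label, ha=ha, va="center")`, so labels on negative bars sit to the left of the bar end.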

## 📈 Visualization

- Barplot: Shows the coefficients of each variable.
- Annotated p-values for statistically significant terms.
- Output image: `regression_plot02.png`

## 📤 Output Files

- `regression_results02.csv`: All variables with their coefficients, p-values, and absolute coefficients.
- `regression_plot02.png`: Visual representation of model results.

## 🧩 Dependencies

```bash
pip install pandas statsmodels scikit-learn matplotlib
```


In [None]:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from itertools import combinations

# Load the CSV data
file_path = r'..SoC.csv'  # Replace with actual file path
df = pd.read_csv(file_path)

# Strip column names to remove leading/trailing spaces
df.columns = df.columns.str.strip()

# Ensure 'SOC' column is present
if 'SOC' not in df.columns:
    raise KeyError("SOC column not found in the data.")

# Define grouping columns to exclude and use for interaction terms
group_cols = ['Year', 'Species', 'Tre', 'Sub']

# Separate predictors and target variable
X = df.drop(columns=['SOC'] + group_cols)
Y = df['SOC']

# Create interaction terms between group factors (e.g., Year x Species)
for col1, col2 in combinations(group_cols, 2):
    X[f'{col1}_x_{col2}'] = df[col1].astype(str) + '_' + df[col2].astype(str)

# One-hot encode the categorical interaction terms (float dummies so scaling sees numeric input)
X = pd.get_dummies(X, drop_first=True, dtype=float)

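# Standardize predictor variables (zero mean, unit variance).
# Wrapping the result in a DataFrame keeps the column names in the regression summary.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
X_scaled = sm.add_constant(X_scaled)  # Add intercept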

# Fit multiple linear regression model
model = sm.OLS(Y, X_scaled).fit()

# Print regression summary
print(model.summary())

# Extract coefficients and p-values
coefficients = model.params
p_values = model.pvalues

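# Effect size: absolute value of each coefficient. Predictors are standardized,
# so magnitudes are comparable across predictors (SOC itself is not rescaled).
standardized_coefficients = abs(coefficients)

# Create a DataFrame with regression results.
# list(...) discards the statsmodels labels so rows get a plain 0..n-1 index.
results_df = pd.DataFrame({
    'Variable': ['Intercept'] + list(X.columns),
    'Coefficient': list(coefficients),
    'P_value': list(p_values),
    'Standardized_Coefficient': list(standardized_coefficients)
})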

# Save regression results to CSV
regression_csv_path = r'..regression_results02.csv'
results_df.to_csv(regression_csv_path, index=False)
print(f"Regression results saved to {regression_csv_path}")

# Plot variable contributions using horizontal bar chart
plt.figure(figsize=(35, 20))  # Large figure for readability
colors = ['red' if p < 0.05 else 'gray' for p in results_df['P_value']]

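# Truncate long labels for readability
truncated_labels = [label[:30] + '...' if len(label) > 30 else label for label in results_df['Variable']]

# Plot against numeric positions so labels that collide after truncation keep separate bars
y_pos = list(range(len(results_df)))
plt.barh(y_pos, results_df['Coefficient'], color=colors)
plt.yticks(y_pos, truncated_labels)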
plt.xlabel('Coefficient', fontsize=16)
plt.ylabel('Variables', fontsize=16)
plt.title(f'Factor Contributions to SOC (R² = {round(model.rsquared, 2)})')

# Annotate only significant variables, placing the p-value at the end of each bar
for index, (coef, p) in enumerate(zip(results_df['Coefficient'], results_df['P_value'])):
    if p < 0.05:
        plt.text(coef, index, f" p = {p:.3f} ", fontsize=9, va='center',
                 ha='left' if coef >= 0 else 'right')

# Add horizontal lines to improve readability
for i in range(len(results_df)):
    plt.axhline(y=i, color='gray', linestyle='--', linewidth=0.3)

# Rotate y-tick labels to avoid overlap
plt.yticks(rotation=30, fontsize=10)
plt.xticks(fontsize=20)

plt.tight_layout()
plt.savefig(r'..regression_plot02.png')
plt.show()
