<left>
    <img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Logo_SYGNET.png" width="90" alt="SYGNET logo">
</left>

<center>
    <img src="https://upload.wikimedia.org/wikipedia/commons/2/2d/Tensorflow_logo.svg" width="200" alt="TensorFlow logo">
</center>




# A.I. Dataset Analysis 
## Part III - Feature Engineering

Postprocessing the simulation results from ZSOIL for Neural Network training.

## Objectives

This analysis deepens our understanding of the dataset and of the underlying physical processes. Besides helping to build a better NN model, it gives insight into the soil subsidence phenomenon in our simulation. These insights can guide feature selection, inform model architecture decisions, and improve the interpretability of the NN's results.

*   Data Science with Python
*   Statistics

<h3>Table of Contents</h3>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li><a href="#III. Feature Engineering"><b>III. Feature Engineering</b></a></li>
        <li><a href="#- Correlation and Causality">- Correlation and Causality</a></li>
        <li><a href="#- Regression Analysis">- Regression Analysis</a></li>
    </ul>
</div>

<hr>

In [14]:
!pip install statsmodels

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy
import statsmodels.api as sm
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.stats.stattools import omni_normtest, jarque_bera
from scipy.stats import skew, kurtosis

print(scipy.__version__)

1.7.3


First we read the postprocessed file:

In [22]:
combined_data = pd.read_csv('corrected_results.csv')
print(combined_data.head())

           TIME        SF  PUSHOVER LABEL  PUSHOVER LAMBDA  PUSHOVER U-CTRL  \
0           0.0       0.0             0.0              0.0              0.0   
1           0.0       0.0             0.0              0.0              0.0   
2           0.0       0.0             0.0              0.0              0.0   
3           0.0       0.0             0.0              0.0              0.0   
4           0.0       0.0             0.0              0.0              0.0   

   ARC LENGTH STEP  ARC LENGTH U-NORM  ARC LENGTH LOAD FACTOR        NR  \
0              0.0                0.0                     0.0       0.0   
1              0.0                0.0                     0.0       1.0   
2              0.0                0.0                     0.0       2.0   
3              0.0                0.0                     0.0       3.0   
4              0.0                0.0                     0.0       4.0   

          X  ...    Saturation  Fluid velocity-X  Fluid velocity-Y  \
0   

In [23]:
# Function to clean column names
def clean_column_names(df):
    df.columns = df.columns.str.strip()  # Strip leading and trailing spaces
    df.columns = df.columns.str.replace(r'\s+', ' ', regex=True)  # Replace multiple spaces with a single space
    return df

# Apply the cleaning function to your dataframes
combined_data = clean_column_names(combined_data)

In [18]:
# Identify non-numeric columns
non_numeric_cols = combined_data.select_dtypes(exclude=['float64', 'int64']).columns
print("Non-numeric columns:", non_numeric_cols)


Non-numeric columns: Index([], dtype='object')


In [20]:
print(combined_data.describe())

               TIME       SF  PUSHOVER LABEL  PUSHOVER LAMBDA  \
count  15973.000000  15973.0         15973.0          15973.0   
mean       1.999624      0.0             0.0              0.0   
std        1.414391      0.0             0.0              0.0   
min        0.000000      0.0             0.0              0.0   
25%        1.000000      0.0             0.0              0.0   
50%        2.000000      0.0             0.0              0.0   
75%        3.000000      0.0             0.0              0.0   
max        4.000000      0.0             0.0              0.0   

       PUSHOVER U-CTRL  ARC LENGTH STEP  ARC LENGTH U-NORM  \
count          15973.0          15973.0            15973.0   
mean               0.0              0.0                0.0   
std                0.0              0.0                0.0   
min                0.0              0.0                0.0   
25%                0.0              0.0                0.0   
50%                0.0              0.0   

We repeat the PCA here so that its results are available for feature engineering.

In [24]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
# Adjust columns as necessary
principal_components = pca.fit_transform(combined_data.drop(columns=['X', 'Y', 'TIME']))
explained_variance = pca.explained_variance_ratio_
print(f"Explained variance by component: {explained_variance}")
explained_variance_df = pd.DataFrame(explained_variance)
explained_variance_df

Explained variance by component: [0.62823293 0.25411086 0.05832556]


|   | 0 |
|---|---|
| 0 | 0.628233 |
| 1 | 0.254111 |
| 2 | 0.058326 |
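
To read these ratios as a budget, it can help to accumulate them. The short sketch below only reuses the three values printed above; it suggests that the three retained components together account for roughly 94% of the variance.

```python
import numpy as np

# Explained variance ratios as printed by the PCA cell above
explained_variance = np.array([0.62823293, 0.25411086, 0.05832556])

# Running total: how much variance the first k components capture together
cumulative = np.cumsum(explained_variance)

for i, (ratio, total) in enumerate(zip(explained_variance, cumulative), start=1):
    print(f"PC{i}: {ratio:.1%} of variance, {total:.1%} cumulative")
```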


### 3. Feature Engineering

Identify key features and create new ones that might capture important aspects of the soil subsidence phenomenon.

<b>Stress-Strain Relationships:</b> Create features that represent stress-strain relationships, as these are crucial in geomechanical modeling.
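
One possible stress-strain feature is a secant stiffness, the ratio of effective stress to strain at each point. The sketch below is illustrative only: it assumes the `Eff.Stress-YY` and `Strain-YY` columns used later in this notebook, while the helper name `add_secant_modulus`, the output column name, and the `eps` threshold are hypothetical choices, not part of the original workflow.

```python
import pandas as pd

def add_secant_modulus(df, stress_col="Eff.Stress-YY", strain_col="Strain-YY", eps=1e-12):
    """Return a copy of df with a secant modulus column (stress / strain).

    Rows whose strain magnitude is not above eps get NaN instead of a division by zero.
    """
    out = df.copy()
    # Mask near-zero strains so the ratio is undefined (NaN) rather than infinite
    strain = out[strain_col].where(out[strain_col].abs() > eps)
    out["Secant-Modulus-YY"] = out[stress_col] / strain
    return out

# Toy values for illustration only (not ZSOIL output)
demo = pd.DataFrame({
    "Eff.Stress-YY": [-100.0, -200.0, -50.0],
    "Strain-YY": [-0.001, -0.004, 0.0],
})
demo = add_secant_modulus(demo)
print(demo)
```

Masking the near-zero strains keeps infinite values out of later scaling and PCA steps, at the cost of introducing NaNs that then need explicit handling.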

<b>Derived Variables:</b> Compute quantities derived from existing columns, for instance the magnitude of the resultant displacement vector, which combines the displacement components in all directions.

In [9]:
# Resultant displacement magnitude from the X and Y displacement components
combined_data['Displacement-Magnitude'] = np.sqrt(combined_data['Displacement-X']**2 + combined_data['Displacement-Y']**2)
combined_data[['Displacement-X', 'Displacement-Y', 'Displacement-Magnitude']].head()

|   | Displacement-X | Displacement-Y | Displacement-Magnitude |
|---|---|---|---|
| 0 | 0.0 | 1.229084 | 1.229084 |
| 1 | 0.0 | 1.229084 | 1.229084 |
| 2 | 0.0 | 1.229084 | 1.229084 |
| 3 | 0.0 | 1.229084 | 1.229084 |
| 4 | 0.0 | 1.229084 | 1.229084 |


<b>Cumulative Variables</b>: Track changes over time or between stages, such as cumulative displacement or pore pressure.

In [10]:
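# Note: grouping by TIME accumulates Displacement-Y across the rows of each time step,
# i.e. a running total within a step, not a history over time
combined_data['Cumulative-Displacement-Y'] = combined_data.groupby('TIME')['Displacement-Y'].cumsum()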
combined_data[['TIME', 'Displacement-Y', 'Cumulative-Displacement-Y']].head()

|   | TIME | Displacement-Y | Cumulative-Displacement-Y |
|---|---|---|---|
| 0 | 0.0 | 1.229084 | 1.229084 |
| 1 | 0.0 | 1.229084 | 2.458169 |
| 2 | 0.0 | 1.229084 | 3.687253 |
| 3 | 0.0 | 1.229084 | 4.916337 |
| 4 | 0.0 | 1.229084 | 6.145422 |
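
The cell above groups by `TIME`, so the running total is taken across rows within one time step. If the goal is a history per location, one alternative is to group by a node identifier and accumulate along time. The sketch below assumes that `NR` identifies a node and that `Displacement-Y` holds per-step increments; both are assumptions about the export, and the data is a toy table.

```python
import pandas as pd

# Toy data for illustration only: two nodes (NR) observed at three time steps
demo = pd.DataFrame({
    "TIME": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    "NR": [0, 1, 0, 1, 0, 1],
    "Displacement-Y": [0.1, 0.2, 0.3, 0.1, 0.2, 0.4],
})

# Sort so each node's rows are in time order, then accumulate separately per node
demo = demo.sort_values(["NR", "TIME"])
demo["Cumulative-Displacement-Y-Per-Node"] = demo.groupby("NR")["Displacement-Y"].cumsum()
print(demo)
```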


In [27]:
# Create derived and cumulative features
combined_data['Displacement-Magnitude'] = np.sqrt(combined_data['Displacement-X']**2 + combined_data['Displacement-Y']**2)

# Running total over the whole table in row order (not grouped by node or time step)
combined_data['Cumulative-Displacement-Y'] = combined_data['Displacement-Y'].cumsum()

# Interaction term that might capture non-linear relationships
combined_data['StressYY_DisplacementY_Interaction'] = combined_data['Eff.Stress-YY'] * combined_data['Displacement-Y']

# Temporal features: differences between consecutive rows.
# These are true rates only if consecutive rows are successive time steps of the same node.
combined_data['DisplacementY_Rate'] = combined_data['Displacement-Y'].diff()
combined_data['StressYY_Rate'] = combined_data['Eff.Stress-YY'].diff()

# Spatial features (example, if a depth column is available)
#combined_data['Normalized_Depth'] = combined_data['Depth'] / combined_data['Depth'].max()

# Normalization and transformation
negative_values = combined_data['Eff.Stress-YY'] < 0
if negative_values.any():
    print("Warning: Negative values detected in 'Eff.Stress-YY'. Consider transforming or handling them before applying log1p.")
    print(combined_data['Eff.Stress-YY'][negative_values])

# log1p handles log(0) safely, but returns -inf at -1 and NaN (with a RuntimeWarning) below -1
combined_data['Log_StressYY'] = np.log1p(combined_data['Eff.Stress-YY'])

# Standardize features to have mean 0 and variance 1 (this overwrites the original columns)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
combined_data[['Displacement-Y', 'Eff.Stress-YY']] = scaler.fit_transform(combined_data[['Displacement-Y', 'Eff.Stress-YY']])

# Add PCA components as features (note that 'Strain-YY' is not standardized here)
pca = PCA(n_components=3)
pca_features = pca.fit_transform(combined_data[['Displacement-Y', 'Strain-YY', 'Eff.Stress-YY']])
combined_data['PCA_1'], combined_data['PCA_2'], combined_data['PCA_3'] = pca_features.T

# Handle missing data: flag the missing values first, then impute with the mean.
# Flagging after fillna would always yield 0.
missing_displacement_y = combined_data['Displacement-Y'].isnull().astype(int)
combined_data['Displacement-Y'] = combined_data['Displacement-Y'].fillna(combined_data['Displacement-Y'].mean())
combined_data['Missing_DisplacementY'] = missing_displacement_y

# Save the new dataset
combined_data.to_csv('combined_data_with_engineered_features.csv', index=False)

AI_Dataset = combined_data.copy()

AI_Dataset

29      -0.008248
30      -0.237430
31      -0.403793
32      -0.526445
33      -0.655765
           ...   
15967   -0.655765
15968   -0.526445
15969   -0.403793
15970   -0.237430
15971   -0.008248
Name: Eff.Stress-YY, Length: 9330, dtype: float64




|   | TIME | SF | PUSHOVER LABEL | PUSHOVER LAMBDA | PUSHOVER U-CTRL | ARC LENGTH STEP | ARC LENGTH U-NORM | ARC LENGTH LOAD FACTOR | NR | X | ... | Displacement-Magnitude | Cumulative-Displacement-Y | StressYY_DisplacementY_Interaction | DisplacementY_Rate | StressYY_Rate | Log_StressYY | PCA_1 | PCA_2 | PCA_3 | Missing_DisplacementY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | ... | 1.229084 | 1.229084e+00 | 2.311284 | NaN | NaN | 1.057961 | 0.460615 | 2.198803 | -0.000005 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 50.00000 | ... | 1.229084 | 2.458169e+00 | 0.528123 | 0.0 | -1.450805 | 0.357456 | -0.565259 | 1.172929 | 0.000011 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 48.07692 | ... | 1.229084 | 3.687253e+00 | 0.515069 | 0.0 | -0.010621 | 0.350000 | -0.572769 | 1.165419 | 0.000011 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 46.15385 | ... | 1.229084 | 4.916337e+00 | 0.488324 | 0.0 | -0.021760 | 0.334547 | -0.588156 | 1.150032 | 0.000011 | 0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 44.23077 | ... | 1.229084 | 6.145422e+00 | 0.743112 | 0.0 | 0.207299 | 0.472878 | -0.441573 | 1.296615 | 0.000010 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15968 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1636.0 | 0.00000 | ... | 1.229084 | -4.916337e+00 | -0.647046 | 0.0 | 0.129320 | -0.747488 | -1.241347 | 0.496841 | -0.000003 | 0 |
| 15969 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1637.0 | 0.00000 | ... | 1.229084 | -3.687253e+00 | -0.496296 | 0.0 | 0.122652 | -0.517167 | -1.154619 | 0.583569 | -0.000010 | 0 |
| 15970 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1638.0 | 0.00000 | ... | 1.229084 | -2.458169e+00 | -0.291822 | 0.0 | 0.166363 | -0.271062 | -1.036983 | 0.701205 | 0.000017 | 0 |
| 15971 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1639.0 | 0.00000 | ... | 1.229084 | -1.229084e+00 | -0.010137 | 0.0 | 0.229183 | -0.008282 | -0.874926 | 0.863262 | 0.000006 | 0 |
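
Since `Eff.Stress-YY` contains negative values, `np.log1p` is undefined for every entry below -1. One commonly used alternative is a sign-preserving log transform, sketched below. The helper name `signed_log1p` is hypothetical and the values are toy numbers; whether this transform suits the NN training is a modeling decision this sketch does not settle.

```python
import numpy as np
import pandas as pd

def signed_log1p(values):
    """Sign-preserving log transform: sign(x) * log(1 + |x|).

    Defined for all real inputs, odd-symmetric, and equal to 0 at x = 0.
    """
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.log1p(np.abs(values))

# Toy stress values for illustration only, mixing large negative and small entries
stress = pd.Series([-228.9, -0.5, 0.0, 1.9])
transformed = pd.Series(signed_log1p(stress), index=stress.index)
print(transformed)
```

The transform compresses large magnitudes like a logarithm while keeping the sign, so compressive and tensile states remain distinguishable.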


### 4. Temporal Analysis

The dataset has a temporal component (the `TIME` column), so time series analysis techniques can be explored:

<b>Trend Analysis:</b> Detect long-term trends in displacement or other variables.
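
A minimal way to quantify a trend is a least-squares line through one node's displacement history. The sketch below uses `np.polyfit` on toy values over five time steps (the same range as `TIME` in this dataset); the displacement numbers are invented for illustration.

```python
import numpy as np

# Toy displacement history of a single node over five time steps (illustrative values)
time = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
displacement_y = np.array([0.00, -0.11, -0.19, -0.32, -0.40])

# Fit displacement = slope * time + intercept by least squares
slope, intercept = np.polyfit(time, displacement_y, deg=1)
print(f"Estimated trend: {slope:.3f} per time unit (intercept {intercept:.3f})")

# Residuals show what the linear trend does not explain
residuals = displacement_y - (slope * time + intercept)
print("Residuals:", np.round(residuals, 3))
```

A negative slope here would indicate progressive settlement; with so few time steps, the estimate should be read as a rough indicator only.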

<b>Fourier Analysis:</b> Identify periodic patterns that might relate to external factors or cyclic loading conditions.
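
As a rough sketch of this idea, the example below builds a synthetic signal (a 0.5 Hz oscillation plus a slow drift), removes the linear trend, and locates the dominant frequency with NumPy's real FFT. The signal is synthetic; a five-step series like the one in this dataset would be too short for a meaningful spectrum.

```python
import numpy as np

# Synthetic signal for illustration: 0.5 Hz oscillation plus a slow linear drift
dt = 0.1                          # sampling interval
t = np.arange(200) * dt           # 200 samples, 20 time units in total
signal = 0.5 * np.sin(2 * np.pi * 0.5 * t) + 0.01 * t

# Remove the linear trend so it does not dominate the low-frequency bins
trend = np.polyval(np.polyfit(t, signal, deg=1), t)
detrended = signal - trend

# One-sided amplitude spectrum and the matching frequency axis
spectrum = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(len(t), d=dt)

# Skip the zero-frequency bin when searching for the dominant periodic component
dominant = freqs[np.argmax(spectrum[1:]) + 1]
print(f"Dominant frequency: {dominant:.2f} cycles per time unit")
```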

### 5. Domain-Specific Analysis

In geomechanics, certain analyses are particularly useful:

<b>Mohr-Coulomb Failure Criterion:</b> Evaluate whether certain regions or elements are approaching failure conditions based on stress states.
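
A minimal sketch of such a check is shown below, using the in-plane form of the Mohr-Coulomb yield function, f = (σ1 − σ3)/2 − (σ1 + σ3)/2 · sin φ − c · cos φ, with compression taken as positive. The function name, its argument names, and the cohesion and friction angle values are all assumptions for illustration. If the exported stresses are tension-positive (the negative `Eff.Stress-YY` values may suggest this), their signs would need to be flipped first. The out-of-plane stress is ignored here.

```python
import numpy as np

def mohr_coulomb_f(sigma_xx, sigma_yy, tau_xy, cohesion, friction_angle_deg):
    """In-plane Mohr-Coulomb yield function, compression positive.

    f < 0: stress state inside the failure envelope; f >= 0: at failure.
    """
    sigma_xx = np.asarray(sigma_xx, dtype=float)
    sigma_yy = np.asarray(sigma_yy, dtype=float)
    tau_xy = np.asarray(tau_xy, dtype=float)
    phi = np.radians(friction_angle_deg)

    center = (sigma_xx + sigma_yy) / 2.0                  # (sigma_1 + sigma_3) / 2
    radius = np.hypot((sigma_xx - sigma_yy) / 2.0, tau_xy)  # (sigma_1 - sigma_3) / 2
    return radius - center * np.sin(phi) - cohesion * np.cos(phi)

# Toy stress states (illustrative values): one on the envelope, one isotropic
f = mohr_coulomb_f(sigma_xx=[100.0, 200.0], sigma_yy=[300.0, 200.0],
                   tau_xy=[0.0, 0.0], cohesion=0.0, friction_angle_deg=30.0)
print(f)
```

Values of `f` close to zero flag points that are near the failure envelope, which could itself serve as an engineered feature.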

<b>Critical State Line:</b> Plot stress points to see how close they are to critical state conditions, which can indicate potential subsidence or failure.
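
One way to measure closeness to the critical state line is to compare the stress ratio q/p' with the slope M of the line in the p'-q plane. The sketch below assumes principal effective stresses are available (compression positive) and uses the triaxial-compression relation M = 6 sin φ / (3 − sin φ). The function names and the friction angle are hypothetical, and the stress values are toy numbers.

```python
import numpy as np

def p_q_invariants(s1, s2, s3):
    """Mean effective stress p' and deviatoric stress q from principal stresses."""
    s1, s2, s3 = (np.asarray(s, dtype=float) for s in (s1, s2, s3))
    p = (s1 + s2 + s3) / 3.0
    q = np.sqrt(0.5 * ((s1 - s2) ** 2 + (s2 - s3) ** 2 + (s3 - s1) ** 2))
    return p, q

def csl_slope(friction_angle_deg):
    """Slope M of the critical state line for triaxial compression."""
    sin_phi = np.sin(np.radians(friction_angle_deg))
    return 6.0 * sin_phi / (3.0 - sin_phi)

# Toy principal stresses (illustrative values): one sheared state, one isotropic state
p, q = p_q_invariants([300.0, 200.0], [100.0, 200.0], [100.0, 200.0])
M = csl_slope(30.0)

# Ratio of 1 means the point lies on the critical state line, 0 means isotropic stress
proximity = (q / p) / M
print(proximity)
```

Plotting `q` against `p` together with the line `q = M * p` would give the visual check described above.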