<left>
    <img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Logo_SYGNET.png" width="90" alt="SYGNET logo">
</left>

<center>
    <img src="https://upload.wikimedia.org/wikipedia/commons/2/2d/Tensorflow_logo.svg" width="200" alt="TensorFlow logo">
</center>




# A.I. Dataset Analysis 
## Part III - Feature Engineering

Postprocessing the simulation results from ZSOIL for Neural Network training.

## Objectives

This analysis aims to build a deeper understanding of the dataset and the underlying physical processes. Beyond supporting a better NN model, it should provide insight into the soil subsidence phenomenon in our simulation. These insights can guide feature selection, inform model architecture decisions, and improve the interpretability of the NN's results.

*   Data Science with Python
*   Statistics

<h3>Table of Contents</h3>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li><a href="#III. Feature Engineering"><b>III. Feature Engineering</b></a></li>
        <li><a href="#- Correlation and Causality">- Correlation and Causality</a></li>
        <li><a href="#- Regression Analysis">- Regression Analysis</a></li>
    </ul>
</div>

<hr>

In [25]:
!pip install statsmodels

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy
import statsmodels.api as sm
from mpl_toolkits.mplot3d import Axes3D

print(scipy.__version__)


1.14.0


First we read the postprocessed file:

In [26]:
combined_data = pd.read_csv('corrected_results.csv')
print(combined_data.head())

           TIME        SF  PUSHOVER LABEL  PUSHOVER LAMBDA  PUSHOVER U-CTRL  \
0           0.0       0.0             0.0              0.0              0.0   
1           0.0       0.0             0.0              0.0              0.0   
2           0.0       0.0             0.0              0.0              0.0   
3           0.0       0.0             0.0              0.0              0.0   
4           0.0       0.0             0.0              0.0              0.0   

   ARC LENGTH STEP  ARC LENGTH U-NORM  ARC LENGTH LOAD FACTOR        NR  \
0              0.0                0.0                     0.0       0.0   
1              0.0                0.0                     0.0       1.0   
2              0.0                0.0                     0.0       2.0   
3              0.0                0.0                     0.0       3.0   
4              0.0                0.0                     0.0       4.0   

          X  ...    Saturation  Fluid velocity-X  Fluid velocity-Y  \
0   

In [29]:
# Function to clean column names
def clean_column_names(df):
    df.columns = df.columns.str.strip()  # Strip leading and trailing spaces
    df.columns = df.columns.str.replace(r'\s+', ' ', regex=True)  # Replace multiple spaces with a single space
    return df

# Apply the cleaning function to your dataframes
combined_data = clean_column_names(combined_data)

In [30]:
# Identify non-numeric columns
non_numeric_cols = combined_data.select_dtypes(exclude=['float64', 'int64']).columns
print("Non-numeric columns:", non_numeric_cols)


Non-numeric columns: Index([], dtype='object')


In [15]:
# Apply one-hot encoding to 'Y_category'
combined_data = pd.get_dummies(combined_data, columns=['Y_category'])

print("One-hot encoded Y_category columns:", combined_data.columns)

One-hot encoded Y_category columns: Index(['TIME', 'SF', 'PUSHOVER LABEL', 'PUSHOVER LAMBDA', 'PUSHOVER U-CTRL',
       'ARC LENGTH STEP', 'ARC LENGTH U-NORM', 'ARC LENGTH LOAD FACTOR', 'NR',
       'X', 'Y', 'Displacement-X', 'Displacement-Y', 'Rotation-Z',
       'Pore pressure-', 'Total head-', 'Residual Force-X', 'Residual Force-Y',
       'Residual Heat flux-Z', 'Solid-Velocity-X', 'Solid-Velocity-Y',
       'Solid-Acceleration-X', 'Solid-Acceleration-Y', 'Unnamed: 23', 'TIME.1',
       'SF.1', 'PUSHOVER LABEL.1', 'PUSHOVER LAMBDA.1', 'PUSHOVER U-CTRL.1',
       'ARC LENGTH STEP.1', 'ARC LENGTH U-NORM.1', 'ARC LENGTH LOAD FACTOR.1',
       'ELEM.', 'GP', 'Z', 'Eff.Stress-XX', 'Eff.Stress-YY', 'Eff.Stress-XY',
       'Eff.Stress-ZZ', 'Eff.Stress-XZ', 'Eff.Stress-YZ', 'Eff.Stress-11',
       'Eff.Stress-22', 'Eff.Stress-33', 'Eff.Stress-I1', 'Eff.Stress-J2^1/2',
       'Eff.Stress-p', 'Eff.Stress-q', 'Tot.Stress-XX', 'Tot.Stress-YY',
       'Tot.Stress-XY', 'Tot.Stress-ZZ', 'Tot.Str

In [27]:
print(combined_data.describe())

               TIME        SF  PUSHOVER LABEL  PUSHOVER LAMBDA  \
count  15973.000000   15973.0         15973.0          15973.0   
mean       1.999624       0.0             0.0              0.0   
std        1.414391       0.0             0.0              0.0   
min        0.000000       0.0             0.0              0.0   
25%        1.000000       0.0             0.0              0.0   
50%        2.000000       0.0             0.0              0.0   
75%        3.000000       0.0             0.0              0.0   
max        4.000000       0.0             0.0              0.0   

       PUSHOVER U-CTRL  ARC LENGTH STEP  ARC LENGTH U-NORM  \
count          15973.0          15973.0            15973.0   
mean               0.0              0.0                0.0   
std                0.0              0.0                0.0   
min                0.0              0.0                0.0   
25%                0.0              0.0                0.0   
50%                0.0           

### 3. Feature Engineering

Identify key features and create new ones that might capture important aspects of the soil subsidence phenomenon.

<b>Stress-Strain Relationships:</b> Create features that represent stress-strain relationships, as these are crucial in geomechanical modeling.
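As a rough illustration of a stress-strain feature, the sketch below computes a secant modulus (stress divided by strain) on toy data. The column name `Strain-YY` and the derived name `Secant-Modulus-YY` are hypothetical: the strain columns actually exported to `corrected_results.csv` are not shown above, so the names would need to be adapted.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only. 'Strain-YY' is a hypothetical column name;
# replace it with the strain column actually present in the dataset.
toy = pd.DataFrame({
    'Eff.Stress-YY': [0.0, -50.0, -120.0, -200.0],
    'Strain-YY': [0.0, -0.001, -0.003, -0.008],
})

# Secant modulus = stress / strain.
# Zero strains are mapped to NaN so the division does not produce inf.
strain = toy['Strain-YY'].replace(0.0, np.nan)
toy['Secant-Modulus-YY'] = toy['Eff.Stress-YY'] / strain

print(toy)
```

A decreasing secant modulus with increasing strain is one simple indicator of softening that a network could otherwise only learn indirectly.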

<b>Derived Variables:</b> Compute quantities that combine existing columns, for instance the magnitude of the resultant displacement vector from its X and Y components.

In [40]:
# Resultant displacement magnitude from the X and Y displacement components
combined_data['Displacement-Magnitude'] = (combined_data['Displacement-X']**2 + combined_data['Displacement-Y']**2)**0.5
combined_data[['Displacement-X', 'Displacement-Y', 'Displacement-Magnitude']].head()

Unnamed: 0,Displacement-X,Displacement-Y,Displacement-Magnitude
0,0.0,1.229084,1.229084
1,0.0,1.229084,1.229084
2,0.0,1.229084,1.229084
3,0.0,1.229084,1.229084
4,0.0,1.229084,1.229084


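<b>Cumulative Variables:</b> Track changes over time or between stages, such as cumulative displacement or pore pressure. Note that the cell below groups by `TIME`, so the running sum accumulates across the rows within each time step rather than across time steps for a given node.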

In [56]:
combined_data['Cumulative-Displacement-Y'] = combined_data.groupby('TIME')['Displacement-Y'].cumsum()
combined_data[['TIME', 'Displacement-Y', 'Cumulative-Displacement-Y']].head()

Unnamed: 0,TIME,Displacement-Y,Cumulative-Displacement-Y
0,0.0,1.229084,1.229084
1,0.0,1.229084,2.458169
2,0.0,1.229084,3.687253
3,0.0,1.229084,4.916337
4,0.0,1.229084,6.145422
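If the goal is a per-node history across time steps, one possible alternative is to group by the node identifier instead. The sketch below uses toy data and assumes that `NR` identifies a node and that each node has one row per `TIME` value; neither assumption has been verified against the ZSOIL export. The column `Delta-Displacement-Y` is a name introduced here for illustration.

```python
import pandas as pd

# Toy data: two nodes (NR 0 and 1) observed at three time steps.
toy = pd.DataFrame({
    'TIME': [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    'NR': [0, 1, 0, 1, 0, 1],
    'Displacement-Y': [1.0, 2.0, 0.5, 0.25, 0.5, 0.25],
})

# Order each node's records chronologically before accumulating.
toy = toy.sort_values(['NR', 'TIME'])

# Running sum of Displacement-Y along time, computed separately for each node.
toy['Cumulative-Displacement-Y'] = toy.groupby('NR')['Displacement-Y'].cumsum()

# Step-to-step change per node; the first step of each node has no predecessor.
toy['Delta-Displacement-Y'] = toy.groupby('NR')['Displacement-Y'].diff().fillna(0.0)

print(toy)
```

If the exported displacements are already totals at each time step, the step-to-step difference is probably the more meaningful of the two features.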


### 4. Temporal Analysis

The dataset has a temporal component (the `TIME` column, which ranges from 0 to 4 in the summary statistics above), so time series analysis techniques can be explored:

<b>Trend Analysis:</b> Detect long-term trends in displacement or other variables.
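A minimal sketch of a trend estimate, assuming a linear trend is a reasonable first approximation: average the displacement over all rows of each time step, then fit a straight line with `np.polyfit`. The toy values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two rows per time step, loosely mimicking TIME = 0..4.
toy = pd.DataFrame({
    'TIME': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'Displacement-Y': [0.0, 0.2, 1.0, 1.2, 2.0, 2.2, 3.0, 3.2, 4.0, 4.2],
})

# Mean displacement per time step.
mean_by_time = toy.groupby('TIME')['Displacement-Y'].mean()

# Least-squares line through the per-step means; polyfit returns the
# coefficients from the highest degree down, i.e. [slope, intercept].
slope, intercept = np.polyfit(
    mean_by_time.index.to_numpy(dtype=float),
    mean_by_time.to_numpy(),
    deg=1,
)

print(f"slope = {slope:.3f} per time unit, intercept = {intercept:.3f}")
```

The slope can be stored as a feature, or the fitted line subtracted to study the residual behaviour.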

<b>Fourier Analysis:</b> Identify periodic patterns that might relate to external factors or cyclic loading conditions.
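The idea can be sketched with NumPy's real FFT on a synthetic, uniformly sampled signal. The sampling interval `dt` and the 0.5 Hz test frequency are assumptions chosen for illustration; with only a handful of time steps, as in this dataset, the frequency resolution would be very coarse.

```python
import numpy as np

dt = 0.1                      # hypothetical sampling interval
t = np.arange(100) * dt       # 100 uniformly spaced samples
signal = np.sin(2 * np.pi * 0.5 * t)  # synthetic signal with a 0.5 Hz component

# Remove the mean so the zero-frequency bin does not dominate the spectrum.
spectrum = np.fft.rfft(signal - signal.mean())
freqs = np.fft.rfftfreq(len(signal), d=dt)

# Frequency with the largest spectral amplitude.
dominant = freqs[np.argmax(np.abs(spectrum))]
print(f"dominant frequency: {dominant:.2f}")
```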

### 5. Domain-Specific Analysis

In geomechanics, certain analyses are particularly useful:

<b>Mohr-Coulomb Failure Criterion:</b> Evaluate whether certain regions or elements are approaching failure conditions based on stress states.
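A hedged sketch of such a check, written in terms of the major and minor principal effective stresses with compression taken as positive: f = (σ1 − σ3) − (σ1 + σ3) sin φ − 2c cos φ, where f < 0 means the stress state lies inside the failure envelope. The function name `mohr_coulomb_f` and the cohesion and friction angle values are hypothetical. The mapping onto the dataset's `Eff.Stress-11` and `Eff.Stress-33` columns, and the sign convention of the export, would need to be checked before use.

```python
import numpy as np

def mohr_coulomb_f(sigma1, sigma3, cohesion, phi_deg):
    """Mohr-Coulomb yield function, compression-positive convention.

    sigma1 and sigma3 are the major and minor principal effective stresses
    (sigma1 >= sigma3). Negative values lie inside the failure envelope;
    values at or above zero indicate failure.
    """
    phi = np.radians(phi_deg)
    return ((sigma1 - sigma3)
            - (sigma1 + sigma3) * np.sin(phi)
            - 2.0 * cohesion * np.cos(phi))

# Toy stress states in kPa; c = 10 kPa and phi = 30 degrees are illustrative only.
sigma1 = np.array([100.0, 300.0, 400.0])
sigma3 = np.array([100.0, 100.0, 100.0])
f = mohr_coulomb_f(sigma1, sigma3, cohesion=10.0, phi_deg=30.0)

print(f)        # more negative = further from failure
print(f >= 0)   # True where the envelope is reached or exceeded
```

The value of f, or a normalised version of it, can serve as a "distance to failure" feature for each element.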

<b>Critical State Line:</b> Plot stress points to see how close they are to critical state conditions, which can indicate potential subsidence or failure.
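One way to quantify closeness to the critical state line is the stress ratio η = q / p′ compared against the slope M, which for triaxial compression is M = 6 sin φ / (3 − sin φ). The sketch below uses toy values under the column names `Eff.Stress-p` and `Eff.Stress-q` that appear in the dataset; the critical-state friction angle and the derived column names are hypothetical, and p′ is assumed positive in compression.

```python
import numpy as np
import pandas as pd

PHI_CS_DEG = 30.0  # hypothetical critical-state friction angle

# Slope of the critical state line in q-p' space (triaxial compression).
sin_phi = np.sin(np.radians(PHI_CS_DEG))
M = 6.0 * sin_phi / (3.0 - sin_phi)

# Toy stress points (kPa), with p' taken positive in compression.
toy = pd.DataFrame({
    'Eff.Stress-p': [100.0, 200.0, 300.0],
    'Eff.Stress-q': [60.0, 240.0, 180.0],
})

# Stress ratio eta = q / p' and its proximity to the CSL (1.0 = on the line).
toy['Stress-Ratio'] = toy['Eff.Stress-q'] / toy['Eff.Stress-p']
toy['CSL-Proximity'] = toy['Stress-Ratio'] / M

print(toy)
```

Plotting `Eff.Stress-q` against `Eff.Stress-p` together with the line q = M p′ gives the visual counterpart of the same check.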