# Ayurvedic Medicine Sensor Analysis Pipeline

This notebook demonstrates the complete machine learning pipeline for analyzing NIR sensor data from Ayurvedic medicines. We'll cover:

1. Data loading and validation
2. Exploratory data analysis
3. Feature engineering
4. Model development and training
5. Evaluation and visualization
6. Production deployment

## Setup Requirements
- Python 3.8+
- Required packages: pandas, numpy, scikit-learn, tensorflow, plotly, fastapi, joblib, and nbformat>=4.2.0 (plotly needs it to render figures inside the notebook)
- Git repository for version control

In [8]:
# Import required libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR, SVC
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
import joblib
import json
import os

# Import custom modules
import sys
sys.path.append('..')
from src.data.preprocessing import DataPreprocessor
from src.features.engineering import FeatureEngineer
from src.models.ensemble import EnsembleModel
from src.visualization.plots import Visualizer

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Data Loading and Validation

Let's create a sample dataset and demonstrate the data loading and validation pipeline. Our dataset should have the following structure:
- Six NIR wavelength sensor channels (R, S, T, U, V, W)
- Temperature readings
- Dilution percentages (100%, 75%, 50%, 25%, 10%)
- Medicine names (four in this sample: Ashwagandha, Turmeric, Tulsi, Neem)
- Effectiveness scores
- Reading IDs
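
A validation step for the structure listed above could look like the following minimal sketch. The helper name `validate_readings` and the specific checks are illustrative assumptions; they are not taken from `DataPreprocessor.load_data`, whose implementation is not shown here.

```python
import pandas as pd

SENSOR_COLUMNS = ['R', 'S', 'T', 'U', 'V', 'W']
REQUIRED_COLUMNS = SENSOR_COLUMNS + [
    'Temperature', 'Dilution_Percent', 'Medicine_Name',
    'Effectiveness_Score', 'Reading_ID',
]
ALLOWED_DILUTIONS = {100, 75, 50, 25, 10}


def validate_readings(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems (empty if the frame looks valid)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # The remaining checks need these columns, so stop early
        return [f"missing columns: {missing}"]
    if df[REQUIRED_COLUMNS].isna().any().any():
        problems.append("missing values present")
    if (df[SENSOR_COLUMNS] < 0).any().any():
        problems.append("negative sensor readings")
    if not set(df['Dilution_Percent'].unique()) <= ALLOWED_DILUTIONS:
        problems.append("unexpected dilution level")
    if df['Reading_ID'].duplicated().any():
        problems.append("duplicate Reading_ID values")
    return problems


# Usage on a two-row frame where the second row has an unlisted dilution level
sample = pd.DataFrame({
    'R': [1.0, 2.0], 'S': [1.0, 2.0], 'T': [1.0, 2.0],
    'U': [1.0, 2.0], 'V': [1.0, 2.0], 'W': [1.0, 2.0],
    'Temperature': [25.0, 24.5],
    'Dilution_Percent': [100, 60],
    'Medicine_Name': ['Neem', 'Tulsi'],
    'Effectiveness_Score': [1.0, 0.6],
    'Reading_ID': [1, 2],
})
print(validate_readings(sample))  # ['unexpected dilution level']
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a file at once.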

In [9]:
# Generate sample data
np.random.seed(42)

# Define parameters
n_samples_per_dilution = 25
dilution_levels = [100, 75, 50, 25, 10]
medicines = ['Ashwagandha', 'Turmeric', 'Tulsi', 'Neem']
n_total_samples = n_samples_per_dilution * len(dilution_levels) * len(medicines)

# Generate synthetic data
data = []
reading_id = 1

for medicine in medicines:
    for dilution in dilution_levels:
        for _ in range(n_samples_per_dilution):
            # Base sensor readings: mean scales with Dilution_Percent (100% -> 5.0, 10% -> 0.5), std 2
            base_readings = np.random.normal(dilution/20, 2, 6)
            
            # Add medicine-specific patterns
            if medicine == 'Ashwagandha':
                base_readings[0] *= 1.2  # Stronger R response
            elif medicine == 'Turmeric':
                base_readings[2] *= 1.3  # Stronger T response
            elif medicine == 'Tulsi':
                base_readings[4] *= 1.4  # Stronger V response
            else:  # Neem
                base_readings[5] *= 1.5  # Stronger W response
            
            # Temperature variation
            temperature = np.random.normal(25, 2)
            
            # Effectiveness score (correlated with dilution and some noise)
            effectiveness = (dilution/100) * np.random.normal(1, 0.1)
            
            # Ensure non-negative values
            base_readings = np.maximum(base_readings, 0)
            
            data.append({
                'R': base_readings[0],
                'S': base_readings[1],
                'T': base_readings[2],
                'U': base_readings[3],
                'V': base_readings[4],
                'W': base_readings[5],
                'Temperature': temperature,
                'Dilution_Percent': dilution,
                'Medicine_Name': medicine,
                'Effectiveness_Score': effectiveness,
                'Reading_ID': reading_id
            })
            reading_id += 1

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV
os.makedirs('../data', exist_ok=True)
df.to_csv('../data/sensor_readings.csv', index=False)
print("Sample data generated and saved to '../data/sensor_readings.csv'")
df.head()

Sample data generated and saved to '../data/sensor_readings.csv'


| | R | S | T | U | V | W | Temperature | Dilution_Percent | Medicine_Name | Effectiveness_Score | Reading_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.192114 | 4.723471 | 6.295377 | 8.04606 | 4.531693 | 4.531726 | 28.158426 | 100 | Ashwagandha | 1.076743 | 1 |
| 1 | 4.873261 | 6.08512 | 4.073165 | 4.06854 | 5.483925 | 1.17344 | 21.550164 | 100 | Ashwagandha | 0.943771 | 2 |
| 2 | 3.569205 | 5.628495 | 3.183952 | 2.175393 | 7.931298 | 4.548447 | 25.135056 | 100 | Ashwagandha | 0.857525 | 3 |
| 3 | 4.693481 | 5.221845 | 2.698013 | 5.751396 | 3.798723 | 4.416613 | 23.796587 | 100 | Ashwagandha | 1.185228 | 4 |
| 4 | 5.967607 | 2.884578 | 6.64509 | 2.558313 | 5.417727 | 1.08066 | 22.343628 | 100 | Ashwagandha | 1.019686 | 5 |


# Data Preprocessing and Feature Engineering

Now let's preprocess the data and create engineered features using our custom modules. We'll:
1. Initialize the data preprocessor
2. Load and validate the data
3. Apply temperature compensation
4. Normalize features
5. Engineer additional features using wavelength ratios and spectral derivatives
6. Apply PCA for dimensionality reduction
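
To make steps 3 and 5 concrete, here is a minimal NumPy sketch of a linear temperature correction, pairwise wavelength ratios, and first-difference spectral derivatives. The function names, the reference temperature, and the drift coefficient are all assumptions for illustration; the custom `DataPreprocessor` and `FeatureEngineer` classes may implement these steps differently.

```python
import numpy as np
from itertools import combinations


def compensate_temperature(readings, temperature, ref_temp=25.0, coeff=0.01):
    """Rescale readings to a reference temperature with a linear drift model.

    Assumption: every channel drifts by `coeff` (fraction per degree C) around
    `ref_temp`; both values are placeholders, not calibrated constants.
    """
    drift = 1.0 + coeff * (np.asarray(temperature) - ref_temp)
    return np.asarray(readings) / drift[:, None]


def wavelength_ratios(readings, eps=1e-8):
    """Ratio of every channel pair (i < j); 6 channels give 15 ratios."""
    readings = np.asarray(readings)
    pairs = combinations(range(readings.shape[1]), 2)
    return np.column_stack([readings[:, i] / (readings[:, j] + eps) for i, j in pairs])


def spectral_derivative(readings):
    """First difference between adjacent channels; 6 channels give 5 values."""
    return np.diff(np.asarray(readings), axis=1)


readings = np.array([[2.0, 4.0, 4.0, 8.0, 8.0, 16.0]])
temperature = np.array([25.0])  # equal to ref_temp, so no correction is applied

compensated = compensate_temperature(readings, temperature)
print(wavelength_ratios(compensated).shape)  # (1, 15)
print(spectral_derivative(compensated))      # [[2. 0. 4. 0. 8.]]
```

Ratios cancel overall intensity changes that affect all channels equally, while the differences between adjacent channels capture the local slope of the spectrum.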

In [10]:
# Initialize processors
preprocessor = DataPreprocessor()
feature_engineer = FeatureEngineer()

# Load and preprocess data
df = preprocessor.load_data('../data/sensor_readings.csv')
df_comp = preprocessor.temperature_compensation(df)
df_norm = preprocessor.normalize_features(df_comp)

# Prepare feature matrix
X = np.hstack([
    df_norm[preprocessor.sensor_columns].values,
    df_norm[[preprocessor.temp_column]].values
])

# Engineer features
features = feature_engineer.engineer_features(
    df_norm[preprocessor.sensor_columns].values,
    df_norm[preprocessor.temp_column].values
)

print("Original features shape:", X.shape)
print("Engineered features shape:", features['combined'].shape)
print("PCA features shape:", features['pca'].shape)

Original features shape: (500, 7)
Engineered features shape: (500, 39)
PCA features shape: (500, 3)
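
The reduction from 39 engineered columns to 3 PCA columns can be illustrated with a small sketch of PCA computed through an SVD. The function `pca_project` is hypothetical and random data stands in for the engineered features; `FeatureEngineer` may well use a different implementation, such as scikit-learn's `PCA`.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project rows of X onto the top principal components.

    Returns the projected data and the fraction of variance each kept
    component explains.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 39))  # placeholder with the engineered-feature shape
projected, explained = pca_project(X_demo, n_components=3)
print(projected.shape)  # (500, 3)
```

Checking the explained-variance fractions is a quick way to judge whether three components retain enough of the signal.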


# Data Visualization

Let's create visualizations to understand our data better:
1. Sensor reading distributions
2. Temperature vs sensor response
3. Dilution level effects
4. Medicine type patterns
5. Feature correlations
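
For item 5, a correlation matrix can be computed directly with pandas. The sketch below uses a small synthetic frame as a stand-in (two sensor channels only, with noise levels chosen for illustration) rather than the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: readings scale with dilution, as in the sample data
rng = np.random.default_rng(0)
dilution = np.repeat([100, 75, 50, 25, 10], 20)
demo = pd.DataFrame({
    'R': dilution / 20 + rng.normal(0, 0.5, dilution.size),
    'S': dilution / 20 + rng.normal(0, 0.5, dilution.size),
    'Temperature': rng.normal(25, 2, dilution.size),
    'Dilution_Percent': dilution,
})

# Pearson correlation between every pair of numeric columns
corr = demo.corr()
print(corr.round(2))
```

The matrix can then be drawn as a heatmap, for example with `px.imshow(corr.values, x=corr.columns, y=corr.index, zmin=-1, zmax=1)`.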

In [None]:
# Visualization of sensor readings
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create subplots for all sensor readings
fig = make_subplots(rows=2, cols=3, 
                    subplot_titles=('R Channel', 'S Channel', 'T Channel',
                                  'U Channel', 'V Channel', 'W Channel'))

channels = ['R', 'S', 'T', 'U', 'V', 'W']
row_col = [(1,1), (1,2), (1,3), (2,1), (2,2), (2,3)]

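# One colour per medicine so traces are comparable across subplots
colors = dict(zip(medicines, px.colors.qualitative.Plotly))

for channel, (row, col) in zip(channels, row_col):
    for medicine in medicines:
        med_data = df[df['Medicine_Name'] == medicine]
        fig.add_trace(
            go.Scatter(x=med_data['Dilution_Percent'], 
                      y=med_data[channel],
                      name=medicine,
                      legendgroup=medicine,  # one legend entry toggles all six channels
                      mode='markers',
                      marker=dict(size=8, color=colors[medicine]),
                      showlegend=(row == 1 and col == 1)),  # avoid duplicate legend entries
            row=row, col=col
        )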

fig.update_layout(height=800, width=1200, title_text="Sensor Readings vs Dilution by Medicine Type")
fig.update_xaxes(title_text="Dilution (%)")
fig.update_yaxes(title_text="Sensor Reading")
fig.show()

# Create effectiveness heatmap
df_pivot = df.pivot_table(values='Effectiveness_Score', 
                         index='Medicine_Name', 
                         columns='Dilution_Percent',
                         aggfunc='mean')

fig_heatmap = go.Figure(data=go.Heatmap(
    z=df_pivot.values,
    x=df_pivot.columns,
    y=df_pivot.index,
    colorscale='RdBu',
    colorbar=dict(title='Effectiveness Score')
))

fig_heatmap.update_layout(
    title='Effectiveness Score by Medicine and Dilution',
    xaxis_title='Dilution (%)',
    yaxis_title='Medicine',
    width=1000,
    height=500
)
fig_heatmap.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
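
The next cell reads cross-validated predictions from a `cv_scores` dictionary whose construction is not shown in the cells above. As a hedged sketch, a dictionary with the same keys could be built with scikit-learn's `cross_val_predict`. The synthetic data, the model settings, and the name `cv_scores_demo` are assumptions, not the notebook's actual training code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix and two of the targets
n = 200
dilution_true = rng.choice([100, 75, 50, 25, 10], size=n).astype(float)
medicine_true = rng.choice(['Ashwagandha', 'Turmeric', 'Tulsi', 'Neem'], size=n)
X_demo = rng.normal(dilution_true[:, None] / 20, 0.5, size=(n, 7))

# Each prediction comes from a model that never saw that row during training
cv_scores_demo = {
    'dilution': {
        'predictions': cross_val_predict(
            RandomForestRegressor(n_estimators=50, random_state=42),
            X_demo, dilution_true, cv=5),
        'true_values': dilution_true,
    },
    'medicine': {
        'predictions': cross_val_predict(
            RandomForestClassifier(n_estimators=50, random_state=42),
            X_demo, medicine_true, cv=5),
        'true_values': medicine_true,
    },
}
print(cv_scores_demo['dilution']['predictions'].shape)  # (200,)
```

Out-of-fold predictions give a fairer picture in the scatter plots and confusion matrix than predictions made on the training rows themselves.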

In [None]:
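# Visualize model predictions and performance.
# Assumes `cv_scores` (cross-validated predictions and true values per task)
# was produced by the model training step.
import plotly.express as px
import plotly.graph_objects as go
from sklearn.metrics import confusion_matrix, r2_score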

# Dilution Model Performance
y_pred_dilution = cv_scores['dilution']['predictions']
y_true_dilution = cv_scores['dilution']['true_values']

fig_dilution = px.scatter(
    x=y_true_dilution, 
    y=y_pred_dilution,
    labels={'x': 'True Dilution (%)', 'y': 'Predicted Dilution (%)'},
    title=f'Dilution Predictions (R² = {r2_score(y_true_dilution, y_pred_dilution):.3f})'
)
fig_dilution.add_trace(
    go.Scatter(x=[0, 100], y=[0, 100], mode='lines', 
               name='Perfect Prediction', line=dict(dash='dash'))
)
fig_dilution.show()

# Medicine Classification Performance
y_pred_medicine = cv_scores['medicine']['predictions']
y_true_medicine = cv_scores['medicine']['true_values']

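# confusion_matrix orders classes by sorted label value, which differs from the
# order of `medicines`; pass the label order explicitly so the axes line up.
# (Assumes the stored labels are medicine-name strings; if they are
# integer-encoded, use the encoder's class order instead.)
conf_matrix = confusion_matrix(y_true_medicine, y_pred_medicine, labels=medicines)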
fig_conf = px.imshow(conf_matrix,
                     labels=dict(x="Predicted Medicine", y="True Medicine"),
                     x=medicines,
                     y=medicines,
                     title="Medicine Classification Confusion Matrix",
                     color_continuous_scale='Blues')
fig_conf.show()

# Effectiveness Model Performance
y_pred_effectiveness = cv_scores['effectiveness']['predictions']
y_true_effectiveness = cv_scores['effectiveness']['true_values']

fig_effectiveness = px.scatter(
    x=y_true_effectiveness, 
    y=y_pred_effectiveness,
    labels={'x': 'True Effectiveness', 'y': 'Predicted Effectiveness'},
    title=f'Effectiveness Predictions (R² = {r2_score(y_true_effectiveness, y_pred_effectiveness):.3f})'
)
fig_effectiveness.add_trace(
    go.Scatter(x=[0, 1], y=[0, 1], mode='lines', 
               name='Perfect Prediction', line=dict(dash='dash'))
)
fig_effectiveness.show()


Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.




TypeError: ForestRegressor.predict() got an unexpected keyword argument 'return_std'
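
scikit-learn's `RandomForestRegressor.predict` accepts only the feature matrix; `return_std` is an option of other estimators such as `GaussianProcessRegressor`. A common workaround, sketched here on synthetic data, is to use the spread of the individual trees' predictions as a rough uncertainty estimate. The helper name `predict_with_std` is hypothetical, and this is only one possible way for an ensemble wrapper to derive a confidence value.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def predict_with_std(forest, X):
    """Return the forest's mean prediction and the std across its trees."""
    # Each fitted tree in forest.estimators_ predicts independently
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)


rng = np.random.default_rng(42)
X_train = rng.uniform(0, 5, size=(200, 7))
y_train = X_train[:, 0] * 20 + rng.normal(0, 2, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
mean, std = predict_with_std(forest, X_train[:5])
print(mean.shape, std.shape)  # (5,) (5,)
```

The mean across trees matches `forest.predict`, so the helper can replace the failing call without changing the point predictions. The tree-to-tree standard deviation is a heuristic, not a calibrated confidence interval.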

# Real-time Prediction Example

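Finally, let's demonstrate a real-time prediction for a single reading. The cell below calls the model directly rather than going through the API endpoint:
1. Create a sample sensor reading
2. Preprocess it and engineer features exactly as for the training data
3. Predict dilution, medicine, and effectiveness, and print each with its confidence value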

In [7]:
# Create a sample reading
sample_reading = {
    'R': 3.5,
    'S': 2.8,
    'T': 4.2,
    'U': 3.0,
    'V': 3.8,
    'W': 2.5,
    'Temperature': 25.0
}

# Convert to numpy array
sample_X = np.array([[
    sample_reading['R'],
    sample_reading['S'],
    sample_reading['T'],
    sample_reading['U'],
    sample_reading['V'],
    sample_reading['W'],
    sample_reading['Temperature']
]])

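# Preprocess the sample with the same steps as the training data.
# Normalization must reuse the statistics fitted on the training set, not refit on this single row.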
sample_X_norm = preprocessor.normalize_features(
    preprocessor.temperature_compensation(pd.DataFrame([sample_reading]))
).values

# Engineer features
sample_features = feature_engineer.engineer_features(
    sample_X_norm[:, :-1],
    sample_X_norm[:, -1],
    fit=False
)

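# Generate predictions
# NOTE: the input width must match what the model was fitted on. As the error
# below shows, `model` was fitted on the 7 normalized inputs (X), not on the 39
# engineered features, so either retrain on features['combined'] or pass
# sample_X_norm here instead.
sample_predictions = model.predict(sample_features['combined'])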

# Display results
print("Prediction Results:")
print(f"Dilution: {sample_predictions['dilution']['predictions'][0]:.1f}% (confidence: {sample_predictions['dilution']['confidence'][0]:.2f})")
print(f"Medicine: {sample_predictions['medicine']['predictions'][0]} (confidence: {sample_predictions['medicine']['confidence'][0]:.2f})")
print(f"Effectiveness: {sample_predictions['effectiveness']['predictions'][0]:.2f} (confidence: {sample_predictions['effectiveness']['confidence'][0]:.2f})")

ValueError: X has 39 features, but RandomForestRegressor is expecting 7 features as input.
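
This error means the model was fitted on a 7-column matrix but is being asked to predict from the 39 engineered columns. Fitted scikit-learn estimators record the training width in `n_features_in_`, so the mismatch can be caught early with a clearer message. The sketch below uses synthetic data, and the helper name `check_feature_count` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def check_feature_count(estimator, X):
    """Raise a descriptive error if X's width differs from the training data's."""
    expected = estimator.n_features_in_  # set by scikit-learn during fit
    if X.shape[1] != expected:
        raise ValueError(
            f"model was trained on {expected} features but received {X.shape[1]}; "
            "train and predict on the same feature set"
        )


rng = np.random.default_rng(42)
X_raw = rng.normal(size=(50, 7))          # stand-in for the 7 normalized inputs
X_engineered = rng.normal(size=(50, 39))  # stand-in for the 39 engineered features
y = rng.normal(size=50)

# Training on the engineered matrix makes an engineered sample a valid input
forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_engineered, y)
check_feature_count(forest, X_engineered[:1])  # passes silently

try:
    check_feature_count(forest, X_raw[:1])
except ValueError as err:
    print(err)
```

Whichever feature set is chosen, the same matrix (raw or engineered) has to be used both when fitting and when predicting.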