# Ayurvedic Medicine Sensor Analysis Pipeline

This notebook demonstrates the complete machine learning pipeline for analyzing NIR sensor data from Ayurvedic medicines. We'll cover:

1. Data loading and validation
2. Exploratory data analysis
3. Feature engineering
4. Model development and training
5. Evaluation and visualization
6. Real-time prediction example

## Setup Requirements
- Python 3.8+
- Required packages: pandas, numpy, scikit-learn, tensorflow, plotly, fastapi
- Git repository for version control

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR, SVC
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
import joblib
import json
import os

# Import custom modules
import sys
sys.path.append('..')
from src.data.preprocessing import DataPreprocessor
from src.features.engineering import FeatureEngineer
from src.models.ensemble import EnsembleModel
from src.visualization.plots import Visualizer

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Data Loading and Validation

Let's generate a synthetic sample dataset and walk through the data loading and validation pipeline. The dataset has the following structure:
- 6 NIR wavelength sensors (R, S, T, U, V, W)
- Temperature readings
- Dilution percentages (100%, 75%, 50%, 25%, 10%)
- Medicine names (4 in this example: Ashwagandha, Turmeric, Tulsi, Neem)
- Effectiveness scores
- Reading IDs
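
The structure listed above can be checked before any modeling starts. The following is a minimal sketch of such a check, not the project's own validation logic: the function name `validate_readings`, the set of allowed dilution levels, and the specific rules are assumptions made for illustration.

```python
import pandas as pd

# Column names follow the structure listed above; the allowed dilution levels
# and the function name are assumptions made for this sketch.
SENSOR_COLUMNS = ["R", "S", "T", "U", "V", "W"]
REQUIRED_COLUMNS = SENSOR_COLUMNS + [
    "Temperature", "Dilution_Percent", "Medicine_Name",
    "Effectiveness_Score", "Reading_ID",
]
ALLOWED_DILUTIONS = {100, 75, 50, 25, 10}


def validate_readings(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # Without the full schema the remaining checks cannot run reliably.
        return [f"missing columns: {missing}"]
    if df[REQUIRED_COLUMNS].isna().any().any():
        problems.append("null values present")
    if (df[SENSOR_COLUMNS] < 0).any().any():
        problems.append("negative sensor readings")
    if not set(df["Dilution_Percent"].unique()) <= ALLOWED_DILUTIONS:
        problems.append("unexpected dilution level")
    if df["Reading_ID"].duplicated().any():
        problems.append("duplicate Reading_ID values")
    return problems


good = pd.DataFrame([{
    "R": 3.5, "S": 2.8, "T": 4.2, "U": 3.0, "V": 3.8, "W": 2.5,
    "Temperature": 25.0, "Dilution_Percent": 50,
    "Medicine_Name": "Tulsi", "Effectiveness_Score": 0.5, "Reading_ID": 1,
}])
bad = good.assign(R=-1.0, Dilution_Percent=60)

print(validate_readings(good))
print(validate_readings(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to report everything wrong with a file in a single pass.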

In [None]:
# Generate sample data
np.random.seed(42)

# Define parameters
n_samples_per_dilution = 25
dilution_levels = [100, 75, 50, 25, 10]
medicines = ['Ashwagandha', 'Turmeric', 'Tulsi', 'Neem']
n_total_samples = n_samples_per_dilution * len(dilution_levels) * len(medicines)

# Generate synthetic data
data = []
reading_id = 1

for medicine in medicines:
    for dilution in dilution_levels:
        for _ in range(n_samples_per_dilution):
            # Base sensor readings (mean scales with Dilution_Percent, i.e. with concentration)
            base_readings = np.random.normal(dilution/20, 2, 6)
            
            # Add medicine-specific patterns
            if medicine == 'Ashwagandha':
                base_readings[0] *= 1.2  # Stronger R response
            elif medicine == 'Turmeric':
                base_readings[2] *= 1.3  # Stronger T response
            elif medicine == 'Tulsi':
                base_readings[4] *= 1.4  # Stronger V response
            else:  # Neem
                base_readings[5] *= 1.5  # Stronger W response
            
            # Temperature variation
            temperature = np.random.normal(25, 2)
            
            # Effectiveness score (proportional to dilution, with ~10% multiplicative noise)
            effectiveness = (dilution/100) * np.random.normal(1, 0.1)
            
            # Ensure non-negative values
            base_readings = np.maximum(base_readings, 0)
            
            data.append({
                'R': base_readings[0],
                'S': base_readings[1],
                'T': base_readings[2],
                'U': base_readings[3],
                'V': base_readings[4],
                'W': base_readings[5],
                'Temperature': temperature,
                'Dilution_Percent': dilution,
                'Medicine_Name': medicine,
                'Effectiveness_Score': effectiveness,
                'Reading_ID': reading_id
            })
            reading_id += 1

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV
os.makedirs('../data', exist_ok=True)
df.to_csv('../data/sensor_readings.csv', index=False)
print("Sample data generated and saved to '../data/sensor_readings.csv'")
df.head()

# Data Preprocessing and Feature Engineering

Now let's preprocess the data and create engineered features using our custom modules. We'll:
1. Initialize the data preprocessor
2. Load and validate the data
3. Apply temperature compensation
4. Normalize features
5. Engineer additional features using wavelength ratios and spectral derivatives
6. Apply PCA for dimensionality reduction
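
To make steps 3, 5, and 6 concrete, here is a rough sketch of what temperature compensation, wavelength ratios, spectral derivatives, and PCA could look like with NumPy and scikit-learn. The helper names, the linear 1%-per-degree correction, and the reference temperature are all assumptions for illustration; `DataPreprocessor` and `FeatureEngineer` may use different formulas. The derivative step also assumes the channels R through W are ordered by wavelength.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical helpers for illustration; the project's own modules may use
# different formulas and parameters.

def compensate_temperature(sensors, temperature, ref_temp=25.0, coeff=0.01):
    """Linear correction: divide out an assumed 1% response change per degree C."""
    factor = 1.0 + coeff * (temperature - ref_temp)
    return sensors / factor[:, None]


def wavelength_ratios(sensors, eps=1e-8):
    """Ratio of every sensor pair (i < j); eps guards against division by zero."""
    n = sensors.shape[1]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return np.column_stack([sensors[:, i] / (sensors[:, j] + eps) for i, j in pairs])


def spectral_derivative(sensors):
    """First difference between adjacent channels (assumes wavelength ordering)."""
    return np.diff(sensors, axis=1)


rng = np.random.default_rng(42)
sensors = rng.uniform(1.0, 6.0, size=(50, 6))      # 50 readings x 6 channels
temperature = rng.normal(25.0, 2.0, size=50)

compensated = compensate_temperature(sensors, temperature)
combined = np.hstack([
    compensated,                         # 6 compensated channels
    wavelength_ratios(compensated),      # 15 pairwise ratios
    spectral_derivative(compensated),    # 5 adjacent differences
])
pca_features = PCA(n_components=3).fit_transform(combined)

print(combined.shape)       # (50, 26)
print(pca_features.shape)   # (50, 3)
```

In practice the combined features would usually be standardized before PCA so that large-valued ratios do not dominate the components.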

In [None]:
# Initialize processors
preprocessor = DataPreprocessor()
feature_engineer = FeatureEngineer()

# Load and preprocess data
df = preprocessor.load_data('../data/sensor_readings.csv')
df_comp = preprocessor.temperature_compensation(df)
df_norm = preprocessor.normalize_features(df_comp)

# Prepare feature matrix
X = np.hstack([
    df_norm[preprocessor.sensor_columns].values,
    df_norm[[preprocessor.temp_column]].values
])

# Engineer features
features = feature_engineer.engineer_features(
    df_norm[preprocessor.sensor_columns].values,
    df_norm[preprocessor.temp_column].values
)

print("Original features shape:", X.shape)
print("Engineered features shape:", features['combined'].shape)
print("PCA features shape:", features['pca'].shape)

# Data Visualization

Let's create visualizations to understand the data better:
1. Sensor readings across samples
2. Dilution curves
3. Feature correlation heatmap
4. PCA components by medicine type and dilution
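
Dilution-level effects are easiest to read once the readings are averaged per medicine and dilution level. The sketch below shows that aggregation on a tiny hand-written table; it illustrates the data behind a dilution curve and is not a description of what `Visualizer.plot_dilution_curves` does internally.

```python
import pandas as pd

# Tiny hand-written table for illustration; column names match the dataset above.
readings = pd.DataFrame({
    "Medicine_Name": ["Neem", "Neem", "Neem", "Neem", "Tulsi", "Tulsi"],
    "Dilution_Percent": [100, 100, 50, 50, 100, 50],
    "R": [5.0, 6.0, 2.0, 3.0, 4.0, 2.0],
    "W": [8.0, 7.0, 4.0, 3.0, 5.0, 3.0],
})

# Mean response per medicine and dilution level: one row per point on a dilution curve.
curves = (
    readings
    .groupby(["Medicine_Name", "Dilution_Percent"], as_index=False)[["R", "W"]]
    .mean()
)

# Long format (one row per medicine, dilution, and sensor) is what most
# plotting libraries expect for grouped line charts.
long_curves = curves.melt(
    id_vars=["Medicine_Name", "Dilution_Percent"],
    var_name="Sensor",
    value_name="Mean_Response",
)
print(long_curves)
```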

In [None]:
# Initialize visualizer
visualizer = Visualizer()

# Plot sensor readings over time
fig_readings = visualizer.plot_sensor_readings(df)
fig_readings.show()

# Plot dilution curves
fig_dilution = visualizer.plot_dilution_curves(df)
fig_dilution.show()

# Create correlation heatmap
corr_columns = preprocessor.sensor_columns + [
    preprocessor.temp_column, 'Dilution_Percent', 'Effectiveness_Score'
]
corr_matrix = df[corr_columns].corr()
fig_corr = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.columns,
    colorscale='RdBu',
    zmin=-1,  # fix the color range so zero correlation sits at the scale's midpoint
    zmax=1
))
fig_corr.update_layout(
    title='Feature Correlations',
    width=800,
    height=800
)
fig_corr.show()

# Plot PCA components
pca_df = pd.DataFrame(
    features['pca'],
    columns=[f'PC{i+1}' for i in range(features['pca'].shape[1])]
)
# Use .values so labels are assigned by position, not by index alignment
pca_df['Medicine'] = df['Medicine_Name'].values
pca_df['Dilution'] = df['Dilution_Percent'].values

fig_pca = px.scatter_3d(
    pca_df,
    x='PC1',
    y='PC2',
    z='PC3',
    color='Medicine',
    size='Dilution',
    title='PCA Components by Medicine Type and Dilution'
)
fig_pca.show()

# Model Training and Evaluation

With the features in place, we split the data into training and test sets, train the ensemble model on the three targets (dilution, medicine, and effectiveness), and evaluate its predictions on the test set.

In [None]:
# Prepare training data
train_data, test_data = preprocessor.prepare_data(df)

# Initialize and train model
model = EnsembleModel()
cv_scores = model.train(
    train_data['X'],
    train_data['y_dilution'],
    train_data['y_medicine'],
    train_data['y_effectiveness']
)

# Generate predictions for test data
test_predictions = model.predict(test_data['X'])

# Create true values dictionary for visualization
true_values = {
    'dilution': test_data['y_dilution'],
    'medicine': test_data['y_medicine'],
    'effectiveness': test_data['y_effectiveness']
}

# Plot model performance
fig_performance = visualizer.plot_model_performance(cv_scores)
fig_performance.show()

# Plot prediction confidence
fig_confidence = visualizer.plot_prediction_confidence(
    test_predictions,
    true_values
)
fig_confidence.show()

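# Calculate and display metrics
for target in ['dilution', 'medicine', 'effectiveness']:
    preds = np.asarray(test_predictions[target]['predictions'])
    truth = np.asarray(true_values[target])
    print(f"\n{target.capitalize()} Predictions:")
    if target == 'medicine':
        # Classification target: averaging class labels is meaningless, so report accuracy
        # (assumes predictions and true values use the same label encoding)
        print(f"Accuracy: {np.mean(preds == truth):.2f}")
    else:
        print(f"Mean absolute error: {np.mean(np.abs(preds - truth)):.2f}")
    print(f"Mean confidence: {np.mean(test_predictions[target]['confidence']):.2f}")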

# Save the trained model
os.makedirs('../models', exist_ok=True)
model.save_models('../models')
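
The prediction dictionaries used above pair each prediction with a confidence score. As a rough illustration of how such scores can be derived, the sketch below uses two scikit-learn random forests on synthetic stand-in data. It is not the implementation of `EnsembleModel`: the function `predict_with_confidence` and both confidence heuristics (top class probability for the classifier, tree disagreement for the regressor) are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in data: 200 readings x 7 features (6 sensors + temperature).
X = rng.normal(size=(200, 7))
y_dilution = 50 + 20 * X[:, 0] + rng.normal(scale=2.0, size=200)
y_medicine = np.where(X[:, 1] > 0, "Neem", "Tulsi")

regressor = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y_dilution)
classifier = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y_medicine)


def predict_with_confidence(X_new):
    """Return predictions and a 0-1 confidence score per target (illustrative heuristics)."""
    # Classifier: confidence is the highest class probability.
    proba = classifier.predict_proba(X_new)
    medicine = classifier.classes_[proba.argmax(axis=1)]

    # Regressor: confidence shrinks as the individual trees disagree.
    per_tree = np.stack([tree.predict(X_new) for tree in regressor.estimators_])
    spread = per_tree.std(axis=0)
    return {
        "dilution": {
            "predictions": per_tree.mean(axis=0),
            "confidence": 1.0 / (1.0 + spread),
        },
        "medicine": {
            "predictions": medicine,
            "confidence": proba.max(axis=1),
        },
    }


result = predict_with_confidence(X[:5])
print(result["medicine"]["predictions"])
print(result["dilution"]["confidence"].round(2))
```

The regressor confidence here is only a relative indicator, since its scale depends on the units of the target.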

# Real-time Prediction Example

Finally, let's run a single sensor reading through the trained pipeline, following the same steps a real-time prediction API would perform:
1. Create a sample sensor reading
2. Preprocess it and engineer features
3. Generate predictions and print them with their confidence scores

In [None]:
# Create a sample reading
sample_reading = {
    'R': 3.5,
    'S': 2.8,
    'T': 4.2,
    'U': 3.0,
    'V': 3.8,
    'W': 2.5,
    'Temperature': 25.0
}

# Preprocess the sample with the same steps used for training.
# Note: this assumes normalize_features reuses the scaling fitted on the
# training data; refitting on a single row would zero out every feature.
sample_df = pd.DataFrame([sample_reading])
sample_X_norm = preprocessor.normalize_features(
    preprocessor.temperature_compensation(sample_df)
).values

# Engineer features
sample_features = feature_engineer.engineer_features(
    sample_X_norm[:, :-1],
    sample_X_norm[:, -1],
    fit=False
)

# Generate predictions (assumes the model was trained on this same
# 'combined' feature layout)
sample_predictions = model.predict(sample_features['combined'])

# Display results
print("Prediction Results:")
print(f"Dilution: {sample_predictions['dilution']['predictions'][0]:.1f}% (confidence: {sample_predictions['dilution']['confidence'][0]:.2f})")
print(f"Medicine: {sample_predictions['medicine']['predictions'][0]} (confidence: {sample_predictions['medicine']['confidence'][0]:.2f})")
print(f"Effectiveness: {sample_predictions['effectiveness']['predictions'][0]:.2f} (confidence: {sample_predictions['effectiveness']['confidence'][0]:.2f})")
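
One detail worth checking when scoring a single reading is that the scaling parameters come from the training data, not from the sample itself. The sketch below uses scikit-learn's `StandardScaler` as a stand-in for the normalization step (an assumption, since the internals of `DataPreprocessor` are not shown here) to illustrate the difference.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Stand-in training matrix: 100 readings x 7 columns (R, S, T, U, V, W, Temperature).
X_train = rng.normal(loc=[3, 3, 3, 3, 3, 3, 25], scale=1.0, size=(100, 7))
scaler = StandardScaler().fit(X_train)          # fit once, on training data

sample = np.array([[3.5, 2.8, 4.2, 3.0, 3.8, 2.5, 25.0]])

# Reuse the training statistics when scoring new readings.
sample_scaled = scaler.transform(sample)

# Pitfall: refitting on one row centers it on itself, so every feature becomes 0.
refit_scaled = StandardScaler().fit_transform(sample)

print(sample_scaled.round(2))
print(refit_scaled)
```

Persisting the fitted scaler alongside the model (for example with `joblib`) keeps training and inference preprocessing consistent.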