# Sprint 2 Notebook

This notebook is organized into three sections:
1. **Function Testing**
2. **Backend Integration**
3. **ML Research**

---

## 1. Function Testing

This section validates the data ingestion and correlation functions across all 4 datasets.

### Overview

This code analyzes four sensor datasets by examining raw vs. cleaned sensor data and the correlation between two sensors over time. I'll provide a structured analysis of the results for each dataset.

**Analysis Methodology**

The analysis utilizes:

- Time series data loading with datetime parsing
- Data cleaning for missing values
- Rolling correlation calculation with a 20-period window
- Visualization of raw vs. cleaned sensor values
- Time-series correlation visualization
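
To make the rolling-correlation step concrete, here is a minimal sketch on synthetic data. The series below are made-up stand-ins for `field1` and `field2`, not values from the four datasets; only the 20-period window matches the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two sensor columns (assumption: not real data)
rng = np.random.default_rng(0)
n = 60
field1 = pd.Series(rng.normal(25.0, 0.2, size=n))
field2 = 2.0 * field1 + pd.Series(rng.normal(0.0, 0.1, size=n))

# Pearson correlation computed over a sliding window of 20 observations
window = 20
rolling_corr = field1.rolling(window).corr(field2)

# The first (window - 1) entries are NaN because the window is not yet full
n_missing = int(rolling_corr.isna().sum())
print(f"Undefined leading values: {n_missing}")
print(rolling_corr.dropna().describe())
```

Because the first 19 values are undefined, the correlation plots start slightly later than the raw-data plots for the same file.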

In [20]:
import os

import matplotlib.pyplot as plt
import numpy as np  # used by clean_data() when adding noise to binary fields
import pandas as pd

# Create plots directory if it doesn't exist
os.makedirs("plots", exist_ok=True)

# Utility functions
def get_dataset(path):
    try:
        df = pd.read_csv(path)
        df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')
        df = df.dropna(subset=['created_at'])
        return df
    except FileNotFoundError:
        print(f"Error: The file '{path}' was not found.")
        return None
    except Exception as e:
        print(f"Error reading {path}: {e}")
        return None

def clean_data(df, handle_binary=False, handle_negative=False):
    """
    Clean dataset with options for handling binary and negative values

    Parameters:
    - df: DataFrame to clean
    - handle_binary: If True, adds small random noise to binary fields to improve correlation calculation
    - handle_negative: If True, transforms negative values
    """
    if df is None:
        return None

    # Create a copy to avoid modifying original
    cleaned_df = df.copy()

    # Handle missing values
    cleaned_df = cleaned_df.dropna(subset=["field1", "field2"])

    # Handle binary values by adding small noise if needed
    if handle_binary:
        # Check if data appears binary (mostly 0s and 1s)
        unique_values_f1 = cleaned_df['field1'].unique()
        unique_values_f2 = cleaned_df['field2'].unique()

        is_binary_f1 = set(unique_values_f1).issubset({0, 1}) or (len(unique_values_f1) <= 2)
        is_binary_f2 = set(unique_values_f2).issubset({0, 1}) or (len(unique_values_f2) <= 2)

        if is_binary_f1:
            print("Field1 appears to be binary, adding small noise...")
            cleaned_df['field1'] = cleaned_df['field1'] + np.random.normal(0, 0.01, size=len(cleaned_df))

        if is_binary_f2:
            print("Field2 appears to be binary, adding small noise...")
            cleaned_df['field2'] = cleaned_df['field2'] + np.random.normal(0, 0.01, size=len(cleaned_df))

    # Handle negative values if needed
    if handle_negative:
        if (cleaned_df['field1'] < 0).any():
            print("Negative values found in field1, applying transformation...")
            # Option 1: Clip negative values to 0
            cleaned_df['field1'] = cleaned_df['field1'].clip(lower=0)
            # Option 2: Shift all values to make minimum 0
            # min_val = cleaned_df['field1'].min()
            # if min_val < 0:
            #     cleaned_df['field1'] = cleaned_df['field1'] - min_val

        if (cleaned_df['field2'] < 0).any():
            print("Negative values found in field2, applying transformation...")
            cleaned_df['field2'] = cleaned_df['field2'].clip(lower=0)

    return cleaned_df

# File paths
datasets = [
    "1321079.csv",
    "1350261.csv",
    "3036461.csv",
    "518150.csv"
]

# Import numpy for adding noise
import numpy as np

# Test all datasets
for file in datasets:
    print(f"\n--- Testing {file} ---")

    # Check if file exists
    if not os.path.isfile(file):
        print(f"Error: The file '{file}' was not found.")
        continue

    raw_df = get_dataset(file)
    if raw_df is None:
        continue

    # Display dataset information
    print(f"Dataset shape: {raw_df.shape}")
    print("\nFirst few rows:")
    print(raw_df.head())
    print("\nColumn information:")
    print(raw_df.dtypes)

    # Check for binary or negative values
    binary_check = ((raw_df['field1'].isin([0, 1])).all() or
                   (raw_df['field2'].isin([0, 1])).all())
    negative_check = ((raw_df['field1'] < 0).any() or
                     (raw_df['field2'] < 0).any())

    print(f"\nBinary values detected: {binary_check}")
    print(f"Negative values detected: {negative_check}")

    # Apply appropriate cleaning based on detected issues
    cleaned_df = clean_data(raw_df,
                           handle_binary=binary_check,
                           handle_negative=negative_check)

    if cleaned_df is None or cleaned_df.empty:
        print("Cleaning resulted in an empty dataset, skipping further analysis.")
        continue

    # Calculate rolling correlation with a window of 20
    cleaned_df["c(field1,field2)"] = cleaned_df["field1"].rolling(20).corr(cleaned_df["field2"])

    # Print statistics
    print("\nCleaned data statistics:")
    print(cleaned_df[["field1", "field2"]].describe())

    # Plot raw vs cleaned for field1
    plt.figure(figsize=(12, 6))
    plt.plot(raw_df["created_at"], raw_df["field1"], label="Raw field1", alpha=0.7)
    plt.plot(cleaned_df["created_at"], cleaned_df["field1"], label="Cleaned field1", alpha=0.7)
    plt.legend()
    plt.title(f"Raw vs Cleaned Data for field1 ({file})")
    plt.xlabel("Date")
    plt.ylabel("Sensor Value")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(f"plots/{file.replace('.csv', '')}_field1_comparison.png")
    plt.close()

    # Plot rolling correlation
    plt.figure(figsize=(12, 6))
    plt.plot(cleaned_df["created_at"], cleaned_df["c(field1,field2)"], label="Rolling Correlation")
    plt.axhline(y=0, color='r', linestyle='--', alpha=0.3)  # Reference line at 0
    plt.legend()
    plt.title(f"Rolling Correlation between field1 and field2 ({file})")
    plt.xlabel("Date")
    plt.ylabel("Correlation Value")
    plt.ylim(-1.1, 1.1)  # Correlation ranges from -1 to 1
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(f"plots/{file.replace('.csv', '')}_correlation.png")
    plt.close()

    print(f"Analysis completed for {file}. Plots saved to 'plots' directory.")


--- Testing 1321079.csv ---
Dataset shape: (100, 6)

First few rows:
                 created_at  entry_id  field1  field2  field3  field4
0 2021-04-13 17:46:57+00:00      4556    25.1    44.3    25.1    44.7
1 2021-04-13 17:56:57+00:00      4557    25.0    44.1    25.1    44.6
2 2021-04-13 18:06:57+00:00      4558    25.1    45.1    25.0    45.8
3 2021-04-13 18:16:58+00:00      4559    25.1    44.9    24.9    45.5
4 2021-04-13 18:26:58+00:00      4560    25.0    44.2    24.9    45.0

Column information:
created_at    datetime64[ns, UTC]
entry_id                    int64
field1                    float64
field2                    float64
field3                    float64
field4                    float64
dtype: object

Binary values detected: False
Negative values detected: False

Cleaned data statistics:
           field1      field2
count  100.000000  100.000000
mean    25.186000   44.550000
std      0.244545    2.308242
min     24.800000   38.500000
25%     25.000000   43.175000
50

### Analysis of Results

In this analysis, we focus on the impact of data cleaning on sensor readings and the relationships between different data fields through rolling correlation graphs. The insights are derived from various datasets and visualizations.

#### 1. Rolling Correlation Analysis

The rolling correlation graphs, such as graph 2 and graph 6, illustrate how correlation values change over time. Graph 2 is a line graph of the rolling correlation between "field1" and "field2" from the dataset "1321079.csv". The correlation values fluctuate significantly throughout the observed period, with peaks and troughs indicating that the strength of the relationship between the two fields varies. Sharp spikes and drops around specific timestamps point to significant changes in that relationship during those periods.

Similarly, graph 6 presents the rolling correlation between "field1" and "field2" from the dataset "3036461.csv". It shows a constant correlation value of 1.00 throughout the observed period, suggesting a perfect positive correlation between the two fields. This unusual consistency warrants further investigation into the nature of the relationship and the factors that produce it.
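
One possible explanation worth checking for a constant 1.00 is that one field is an exact increasing linear function of the other (for example a duplicated or unit-converted reading). The sketch below uses synthetic data to show that this produces exactly that pattern; whether it applies to "3036461.csv" is an assumption that would need to be verified against the file itself.

```python
import numpy as np
import pandas as pd

# Synthetic example (assumption): b depends on a through an exact linear rule
rng = np.random.default_rng(1)
a = pd.Series(rng.normal(50.0, 5.0, size=40))
b = 3.0 * a + 7.0

# The rolling correlation is 1.0 in every complete window
corr = a.rolling(20).corr(b).dropna()
is_constant_one = bool(np.allclose(corr, 1.0))

# Diagnostic: does a straight-line fit reproduce b exactly?
slope, intercept = np.polyfit(a, b, 1)
exact_linear = bool(np.allclose(b, slope * a + intercept))

print(f"Rolling correlation constant at 1.0: {is_constant_one}")
print(f"field2 is an exact linear function of field1: {exact_linear}")
```

Running the same two checks on the real columns would show whether the perfect correlation reflects redundant data rather than a physical relationship.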

#### 2. Importance of Data Cleaning

Data cleaning plays a crucial role in ensuring the reliability of sensor readings. Graph 1 and graph 6 highlight the differences between raw and cleaned data. Graph 1 compares raw and cleaned sensor data over a specified time period: the cleaned data line appears smoother and more stable than the raw data line, which fluctuates more erratically. This suggests that the cleaning process removed outliers or noise present in the raw data, leading to more reliable and interpretable results.

Graph 6 further emphasizes the importance of data cleaning. It shows the fluctuations in sensor values and gives insight into the data's integrity and processing. The cleaned data line follows a more stable trend, indicating that data cleaning is essential for reliable sensor readings.

#### 3. Visual Comparisons of Raw and Cleaned Data

The visual comparisons in graph 3, graph 8, and graph 9 further illustrate the impact of data cleaning on sensor readings. In graph 3, the cleaned data line rises more steadily and peaks before a sharp decline, which could indicate a significant event or anomaly that was captured in the raw data but smoothed out in the cleaned version. This highlights the importance of data processing in understanding sensor outputs.

The graph compares raw and cleaned sensor data and shows the difference in variability. The raw data fluctuates slightly, while the cleaned data appears more stable, indicating that the cleaning process reduced noise or outliers and gives a clearer picture of the sensor's performance over time.
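
The visual impressions above can be backed by numbers. The following is a minimal sketch of a helper that reports how many rows cleaning removed and how the spread of a column changed. The name `cleaning_summary` and the small example frame are hypothetical; the cleaning steps mirror the missing-value drop and negative clipping used in `clean_data`.

```python
import numpy as np
import pandas as pd

def cleaning_summary(raw_df, cleaned_df, column):
    """Hypothetical helper: summarize how cleaning changed one column."""
    return {
        "rows_raw": len(raw_df),
        "rows_cleaned": len(cleaned_df),
        "rows_dropped": len(raw_df) - len(cleaned_df),
        "std_raw": float(raw_df[column].std()),
        "std_cleaned": float(cleaned_df[column].std()),
    }

# Made-up readings with two missing values and one negative value
raw = pd.DataFrame({
    "field1": [25.1, np.nan, 25.0, -3.0, 25.2],
    "field2": [44.3, 44.1, np.nan, 45.0, 44.9],
})

# Same steps as clean_data: drop rows with missing fields, clip negatives to 0
cleaned = raw.dropna(subset=["field1", "field2"]).copy()
cleaned["field1"] = cleaned["field1"].clip(lower=0)

summary = cleaning_summary(raw, cleaned, "field1")
print(summary)
```

If `rows_dropped` is 0 and the standard deviations match, the raw and cleaned lines in a plot are identical and any apparent smoothing is a rendering effect.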



## 2. Backend Integration

This section will handle saving outputs, API calls, and integration with the backend pipeline.

In [None]:
# Example placeholder for backend integration
# Save cleaned dataset or correlation results

output_path = "cleaned_output.csv"
cleaned_df.to_csv(output_path, index=False)
print(f"Saved cleaned dataset to {output_path}")
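
As a possible next step for this section, the sketch below writes one cleaned CSV and one JSON summary per dataset, which a backend job could later pick up. The function name `save_outputs`, the file naming scheme, and the summary fields are all assumptions for illustration; no backend API is defined yet in this notebook.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def save_outputs(df, dataset_name, out_dir):
    """Hypothetical helper: write a cleaned CSV plus a small JSON summary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    csv_path = out / f"{dataset_name}_cleaned.csv"
    json_path = out / f"{dataset_name}_summary.json"

    df.to_csv(csv_path, index=False)

    # Series.mean() skips NaN; store None if no correlation value exists
    mean_corr = df["c(field1,field2)"].mean()
    summary = {
        "dataset": dataset_name,
        "rows": int(len(df)),
        "mean_rolling_correlation": None if pd.isna(mean_corr) else float(mean_corr),
    }
    json_path.write_text(json.dumps(summary, indent=2))
    return csv_path, json_path

# Made-up frame with the same column names the testing loop produces
demo = pd.DataFrame({
    "field1": [25.1, 25.0, 25.1],
    "field2": [44.3, 44.1, 45.1],
    "c(field1,field2)": [float("nan"), 0.5, 0.7],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = save_outputs(demo, "demo", tmp)
    saved = json.loads(json_path.read_text())
    reloaded = pd.read_csv(csv_path)

print(saved)
```

Calling such a helper inside the testing loop would keep one output per file, whereas the placeholder above only saves the `cleaned_df` left over from the last dataset processed.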


## 3. ML Research

This section will cover:
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Prototyping (e.g., regression/classification)
- Evaluation Metrics
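
The planned steps could look roughly like the sketch below: build lag and rolling-mean features from `field1`, fit a least-squares baseline that predicts `field2`, and score it with RMSE on a chronological hold-out. The data is synthetic, and the feature names, 80/20 split, and choice of a linear baseline are assumptions rather than decisions already made for this project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a cleaned dataset (assumption: not real sensor data)
rng = np.random.default_rng(42)
n = 200
field1 = pd.Series(25 + np.cumsum(rng.normal(0, 0.05, n)))
field2 = 1.5 * field1 + rng.normal(0, 0.1, n)
df = pd.DataFrame({"field1": field1, "field2": field2})

# Feature engineering: previous reading and 5-period rolling mean
df["field1_lag1"] = df["field1"].shift(1)
df["field1_roll5"] = df["field1"].rolling(5).mean()
df = df.dropna().reset_index(drop=True)

# Chronological split so the model is evaluated on later readings only
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

def design(frame):
    """Build a design matrix with an intercept column."""
    X = frame[["field1", "field1_lag1", "field1_roll5"]].to_numpy()
    return np.column_stack([np.ones(len(X)), X])

# Model prototyping: ordinary least squares baseline
coef, *_ = np.linalg.lstsq(design(train), train["field2"].to_numpy(), rcond=None)

# Evaluation metric: root mean squared error on the hold-out set
pred = design(test) @ coef
rmse = float(np.sqrt(np.mean((test["field2"].to_numpy() - pred) ** 2)))
print(f"Hold-out RMSE: {rmse:.3f}")
```

A baseline like this gives a reference score that any later regression or classification model should beat.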


In [None]:
# Example placeholder for ML research
# Future steps: implement EDA, train models, evaluate performance

import seaborn as sns

# Quick visualization (example)
# The datasets expose the sensor readings as 'field1' and 'field2'
sns.pairplot(cleaned_df[['field1', 'field2']].dropna())
plt.show()
