## Building a Regression Model for Predicting Sensor Readings

In this section, we build a regression model that predicts one sensor reading, temperature (`temp`), from the other readings reported by the IoT devices. We load the telemetry data into Spark, assemble the feature columns into a single vector, fit a linear regression, and evaluate it on a held-out test set. The same workflow could be extended to other target readings or to classifying device states as "normal" or "anomalous". Here it serves to demonstrate how predictive analytics can support the monitoring and maintenance of IoT devices over time.

In [1]:
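import findspark
findspark.init()  # locate the local Spark installation before importing pyspark

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = (
    SparkSession.builder
    .appName("IoT Telemetry Data")
    .getOrCreate()
)

# Load the telemetry CSV; inferSchema lets Spark detect column types instead of reading everything as strings
df = spark.read.csv("iot_telemetry_data.csv", header=True, inferSchema=True)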

In [2]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Select feature columns and target column
feature_columns = ['humidity', 'co', 'smoke', 'light', 'motion', 'lpg']
target_column = 'temp'

# Assemble features into a single vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df_prepared = assembler.transform(df).select("features", target_column)

# Split the data into training and test sets
train_data, test_data = df_prepared.randomSplit([0.8, 0.2], seed=42)
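
A note on the split: `randomSplit` assigns rows to the splits by random sampling, so the weights `[0.8, 0.2]` give an approximate 80/20 split, not an exact one. The plain-Python sketch below illustrates that idea on a toy list. It is not Spark's implementation, and `bernoulli_split` is a hypothetical helper introduced only for this example.

```python
import random

def bernoulli_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent random draw.

    Illustration only: this mimics the idea of a weighted random split,
    not Spark's actual sampling code.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))  # stand-in for DataFrame rows
train_rows, test_rows = bernoulli_split(rows)
print(len(train_rows), len(test_rows))
```

Because each row is drawn independently, the two sizes land close to 800 and 200 without necessarily hitting them exactly. Counting rows in `train_data` and `test_data` is a quick way to see the actual proportions in the Spark split.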


In [3]:
# Initialize and train the Linear Regression model
lr = LinearRegression(labelCol=target_column, featuresCol="features")
lr_model = lr.fit(train_data)

# Model summary
print("Coefficients: ", lr_model.coefficients)
print("Intercept: ", lr_model.intercept)

Coefficients:  [-0.11461735749535866,-5642.605482291604,8604.974519034318,4.6731200460259,0.4731788247748365,-19522.889818307012]
Intercept:  29.791743264227286
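
To make the printed coefficients concrete, the sketch below recomputes a linear-regression prediction by hand as the intercept plus the sum of each coefficient times its feature value. The coefficients and intercept are copied from the output above, in the order of `feature_columns`. The readings in `example_row` are hypothetical values chosen for illustration, not a row from the dataset, and `predict_temp` is a helper introduced only for this example.

```python
feature_columns = ["humidity", "co", "smoke", "light", "motion", "lpg"]

# Values copied from the model output above (same order as feature_columns)
coefficients = [
    -0.11461735749535866,
    -5642.605482291604,
    8604.974519034318,
    4.6731200460259,
    0.4731788247748365,
    -19522.889818307012,
]
intercept = 29.791743264227286

def predict_temp(row):
    """Linear prediction: intercept + sum(coefficient * feature value)."""
    return intercept + sum(
        coef * row[name] for name, coef in zip(feature_columns, coefficients)
    )

# Hypothetical sensor readings (assumption, not taken from the dataset)
example_row = {
    "humidity": 50.0, "co": 0.005, "smoke": 0.02,
    "light": 0.0, "motion": 0.0, "lpg": 0.0075,
}

# Per-feature contribution shows how coefficient size and feature scale interact
for name, coef in zip(feature_columns, coefficients):
    print(f"{name:>8}: {coef * example_row[name]:+.3f}")
print("prediction:", predict_temp(example_row))
```

The very large coefficients on `co`, `smoke`, and `lpg` do not by themselves mean those features dominate the prediction. A coefficient's size depends on the scale of its feature, so a feature with small numeric values needs a large coefficient to have any effect.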


In [4]:
# Make predictions on the test data
predictions = lr_model.transform(test_data)

# Evaluate the model using Root Mean Square Error (RMSE)
evaluator = RegressionEvaluator(labelCol=target_column, predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Square Error (RMSE): {rmse}")

# Show predictions vs actuals
predictions.select("prediction", target_column).show(15)


Root Mean Square Error (RMSE): 1.253896108353969
+------------------+------------------+
|        prediction|              temp|
+------------------+------------------+
| 31.92687960048898|26.799999237060547|
|31.820708238643398|23.600000381469727|
| 31.81651900516062|23.600000381469727|
|30.346290084333667|19.299999237060547|
|30.144903590771058|24.399999618530273|
|25.386563129014583|              19.5|
|29.980188553405846|25.399999618530273|
|29.851896566782564|26.399999618530273|
| 22.78477172746023|              22.8|
|22.783718401990065|              22.7|
| 22.78306621065139|              22.8|
| 22.78281727811288|              22.9|
| 22.78197978544234|              22.7|
|22.781304531980716|              22.9|
|22.781041530158923|              22.8|
+------------------+------------------+
only showing top 15 rows
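
As a check on what the RMSE figure means, the sketch below recomputes the metric by hand for the 15 rows displayed above. RMSE is the square root of the mean squared difference between prediction and actual value. The `rmse` helper is introduced only for this example, and the pairs are copied from the table.

```python
import math

# (prediction, actual temp) pairs copied from the 15 displayed rows
shown_rows = [
    (31.92687960048898, 26.799999237060547),
    (31.820708238643398, 23.600000381469727),
    (31.81651900516062, 23.600000381469727),
    (30.346290084333667, 19.299999237060547),
    (30.144903590771058, 24.399999618530273),
    (25.386563129014583, 19.5),
    (29.980188553405846, 25.399999618530273),
    (29.851896566782564, 26.399999618530273),
    (22.78477172746023, 22.8),
    (22.783718401990065, 22.7),
    (22.78306621065139, 22.8),
    (22.78281727811288, 22.9),
    (22.78197978544234, 22.7),
    (22.781304531980716, 22.9),
    (22.781041530158923, 22.8),
]

def rmse(pairs):
    """Square root of the mean squared difference between prediction and actual."""
    squared_errors = [(pred - actual) ** 2 for pred, actual in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

rmse_shown = rmse(shown_rows)
print(f"RMSE over the 15 displayed rows: {rmse_shown:.3f}")
```

On these 15 rows the value comes out above 5, well above the 1.25 reported for the full test set: the first eight rows are off by several degrees, while the remaining seven are nearly exact. `show(15)` returns only the first rows of the DataFrame, so the table should not be read as a representative sample of the model's errors.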



## Conclusion

This IoT telemetry project highlights the power of data analysis and machine learning in monitoring and optimizing device performance. Starting with data exploration, we gained a deep understanding of the dataset's structure and underlying patterns through descriptive statistics, visualizations, and correlation analysis. This laid the groundwork for identifying significant relationships and trends among the metrics.

In the time-series analysis, we examined key metrics over time, uncovering trends, peaks, and anomalies that could indicate device irregularities or external factors influencing performance. This step provided actionable insights into the temporal behavior of sensors.

Next, we conducted anomaly detection, leveraging Z-scores to pinpoint outliers across the dataset. These outliers can signal unusual conditions, enabling proactive responses to potential issues and optimizing device uptime.

Finally, we implemented a predictive model: a linear regression that estimates temperature from the other sensor readings and reached an RMSE of about 1.25 on the held-out test set. Models of this kind can support real-time monitoring and maintenance, helping to reduce operational risks and costs.

The project underscores the importance of data-driven solutions in IoT ecosystems. As the dataset grows over time, scalable tools like PySpark, already used here for the regression model, help keep processing tractable. Additionally, the techniques used here (EDA, anomaly detection, and predictive modeling) serve as a foundation for real-world IoT analytics, helping businesses achieve smarter device management and improved system reliability.

