---
title: "Uncovering Patterns and Anomalies in Manufacturing Data"
subtitle: "INFO 523 - Final Project"
author: 
  - name: "Cesar Castro"
    affiliations:
      - name: "College of Information Science, University of Arizona"
description: "Project description"
format:
  html:
    code-tools: true
    code-overflow: wrap
    code-line-numbers: true
    embed-resources: true
editor: visual
code-annotations: hover
execute:
  warning: false
jupyter: python3
---

## **Project Description:**

-   Uncovering Patterns and Anomalies in Manufacturing Data

## **🎯Goals:**

Modern factories generate vast amounts of data. Manufacturing equipment continuously monitors parameters such as temperature, vibration, motor speed, and energy consumption through sensors and other instrumentation. Variations in these parameters can indicate shifts in performance that may lead to defects or catastrophic equipment failures. Detecting these shifts has become increasingly important for reducing downtime and boosting productivity.

Techniques such as machine learning, anomaly detection, and image analysis are now used to forecast when equipment might require maintenance, calibration, or material changes. This project will use publicly available synthetic data from Kaggle to compare several classification and regression models, with the objective of predicting these critical events. If time allows, we will also explore anomaly detection techniques on time series data to predict potential failures as early as possible.

Specific Objectives:

1)  The first objective is to build a classification model for predictive maintenance, comparing several candidate models. The model will analyze sensor data (air temperature, process temperature, rotational speed, and torque) from a predictive maintenance dataset to predict the specific Failure Type.

2)  The second objective is to develop a regression model for anomaly detection and compare it with time series methods (e.g., ARIMA, LSTM). This model will use key features such as sensor data and performance metrics to identify unusual patterns.
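
When comparing classifiers for the first objective, the choice of metric matters. The summary statistics for the predictive maintenance dataset report a `Target` mean of 0.0339 over 10000 rows, which implies 339 failure rows. The sketch below is only an illustration, not the project's evaluation code: it scores a hypothetical baseline that always predicts "no failure", and the helper names `accuracy` and `recall_per_class` are assumptions introduced here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """Return {label: recall} for every label present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Class counts implied by the dataset summary: Target mean 0.0339 over 10000 rows.
n_rows = 10_000
n_fail = 339
y_true = [1] * n_fail + [0] * (n_rows - n_fail)

# Hypothetical baseline "model" that always predicts 0 (no failure).
y_baseline = [0] * n_rows

acc = accuracy(y_true, y_baseline)
recalls = recall_per_class(y_true, y_baseline)
balanced_acc = sum(recalls.values()) / len(recalls)

print(f"accuracy={acc:.4f}")
print(f"recall per class={recalls}")
print(f"balanced accuracy={balanced_acc:.2f}")
```

The baseline reaches 96.61% accuracy while detecting no failures at all, so per-class recall or balanced accuracy is likely a more informative basis for comparing candidate models than accuracy alone.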

## **📊Proposed Datasets:**

1.  Source: Kaggle - [Machine Predictive Maintenance Classification](https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification) (a synthetic dataset that reflects real predictive maintenance data encountered in industry)

Data Example:

In [1]:
#| label: load-dataset1
#| echo: false
import pandas as pd
df1 = pd.read_csv("data/predictive_maintenance.csv")
df1.head(5)

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Target,Failure Type
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,No Failure
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,No Failure
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,No Failure
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,No Failure
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,No Failure


-   Sample rows from the dataset, which has 3 categorical features and 6 numerical features to be used. This dataset will be used for the classification models.

In [2]:
#| label: load-dataset1a
#| echo: false
df1.describe()

Unnamed: 0,UDI,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Target
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,300.00493,310.00556,1538.7761,39.98691,107.951,0.0339
std,2886.89568,2.000259,1.483734,179.284096,9.968934,63.654147,0.180981
min,1.0,295.3,305.7,1168.0,3.8,0.0,0.0
25%,2500.75,298.3,308.8,1423.0,33.2,53.0,0.0
50%,5000.5,300.1,310.1,1503.0,40.1,108.0,0.0
75%,7500.25,301.5,311.1,1612.0,46.8,162.0,0.0
max,10000.0,304.5,313.8,2886.0,76.6,253.0,1.0


-   There are 10000 rows in the predictive maintenance dataset. The maximum values for Rotational speed and Tool wear might indicate outliers or some degree of skewness in the data. This will be handled during the data preparation part of the project.
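
As a quick first pass on the outlier question, the quartiles in the summary table can be checked against Tukey's 1.5 × IQR fences. The proposal does not name an outlier rule, so this rule is an assumption chosen for illustration, and the helper `tukey_fences` is a hypothetical name. The values are copied from the `df1.describe()` output.

```python
# Quartiles and extremes copied from the df1.describe() output.
summary = {
    "Rotational speed [rpm]": {"min": 1168.0, "q1": 1423.0, "q3": 1612.0, "max": 2886.0},
    "Torque [Nm]": {"min": 3.8, "q1": 33.2, "q3": 46.8, "max": 76.6},
    "Tool wear [min]": {"min": 0.0, "q1": 53.0, "q3": 162.0, "max": 253.0},
}


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


flags = {}
for name, s in summary.items():
    low, high = tukey_fences(s["q1"], s["q3"])
    # A column is flagged when its min or max falls outside the fences.
    flags[name] = {"below": s["min"] < low, "above": s["max"] > high}
    print(f"{name}: fences=({low:.1f}, {high:.1f}), flags={flags[name]}")
```

Under this rule, the Rotational speed maximum (2886 rpm) lies above its upper fence (1895.5 rpm) and Torque has values beyond both fences, while the Tool wear maximum (253 min) stays inside its upper fence (325.5 min). A skewness check on the full data would be a natural follow-up for Tool wear.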

In [3]:
#| label: load-dataset1b
#| echo: false
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Target                   10000 non-null  int64  
 9   Failure Type             10000 non-null  object 
dtypes: float64(3), int64(4), object(3)
memory usage: 781.4+ KB


-   There are no missing values in this dataset.

2.  Source: Kaggle - [Intelligent Manufacturing Dataset](https://www.kaggle.com/datasets/ziya07/intelligent-manufacturing-dataset/data) (The Intelligent Manufacturing Dataset for Predictive Optimization is a dataset designed for research in smart manufacturing, AI-driven process optimization, and predictive maintenance)

Data Example:

In [4]:
#| label: load-dataset2
#| echo: false
import pandas as pd
df2 = pd.read_csv("data/manufacturing_6G_dataset.csv")
df2.head(5)

Unnamed: 0,Timestamp,Machine_ID,Operation_Mode,Temperature_C,Vibration_Hz,Power_Consumption_kW,Network_Latency_ms,Packet_Loss_%,Quality_Control_Defect_Rate_%,Production_Speed_units_per_hr,Predictive_Maintenance_Score,Error_Rate_%,Efficiency_Status
0,2024-01-01 00:00:00,39,Idle,74.13759,3.500595,8.612162,10.650542,0.207764,7.751261,477.657391,0.34465,14.96547,Low
1,2024-01-01 00:01:00,29,Active,84.264558,3.355928,2.268559,29.11181,2.228464,4.989172,398.174747,0.769848,7.67827,Low
2,2024-01-01 00:02:00,15,Active,44.280102,2.079766,6.144105,18.357292,1.639416,0.456816,108.074959,0.987086,8.198391,Low
3,2024-01-01 00:03:00,43,Active,40.568502,0.298238,4.067825,29.153629,1.161021,4.582974,329.57941,0.98339,2.740847,Medium
4,2024-01-01 00:04:00,8,Idle,75.063817,0.34581,6.225737,34.029191,4.79652,2.287716,159.113525,0.573117,12.100686,Low


-   Sample rows from the dataset, which includes timestamps as well as numerical and categorical variables. This dataset will be used to explore regression models, time series analysis, and anomaly detection.
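
One simple baseline for anomaly detection on a sensor series is a trailing-window z-score. This is not one of the methods named in the proposal (ARIMA, LSTM); it is offered only as a minimal reference point to compare those against. The function name, window size, and threshold are assumptions, and the readings are made-up values rather than rows from the dataset.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=30, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    history = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(values):
        # Only score a point once a full trailing window is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(x)
    return anomalies


# Made-up temperature-like readings (illustration only): a small repeating
# pattern around 61 with one injected spike at index 150.
readings = [60 + (i % 5) * 0.5 for i in range(200)]
readings[150] += 15

anomalies = rolling_zscore_anomalies(readings)
print(anomalies)
```

Because the window only looks backward, a point is scored using data available at that moment, which matches the goal of flagging potential failures as early as possible.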

In [5]:
#| label: load-dataset2a
#| echo: false
df2.describe()

Unnamed: 0,Machine_ID,Temperature_C,Vibration_Hz,Power_Consumption_kW,Network_Latency_ms,Packet_Loss_%,Quality_Control_Defect_Rate_%,Production_Speed_units_per_hr,Predictive_Maintenance_Score,Error_Rate_%
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,25.49933,60.041458,2.549959,5.745929,25.55562,2.493418,5.008806,275.916324,0.499385,7.5041
std,14.389439,17.323238,1.414127,2.451271,14.120758,1.443273,2.883666,130.096892,0.288814,4.335896
min,1.0,30.000138,0.100011,1.500183,1.000025,2.6e-05,0.000449,50.000375,3e-06,0.000112
25%,13.0,45.031596,1.323214,3.627318,13.355118,1.245026,2.521591,162.873618,0.248166,3.750148
50%,25.0,60.033597,2.549441,5.75546,25.536079,2.487667,5.003569,276.648922,0.499209,7.504145
75%,38.0,74.967217,3.776459,7.860267,37.796372,3.741252,7.506127,388.812761,0.74881,11.273189
max,50.0,89.998979,4.999974,9.999889,49.999917,4.999975,9.9999,499.996768,0.999978,14.999869


-   There are 100000 rows in this dataset.

In [6]:
#| label: load-dataset2b
#| echo: false
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Timestamp                      100000 non-null  object 
 1   Machine_ID                     100000 non-null  int64  
 2   Operation_Mode                 100000 non-null  object 
 3   Temperature_C                  100000 non-null  float64
 4   Vibration_Hz                   100000 non-null  float64
 5   Power_Consumption_kW           100000 non-null  float64
 6   Network_Latency_ms             100000 non-null  float64
 7   Packet_Loss_%                  100000 non-null  float64
 8   Quality_Control_Defect_Rate_%  100000 non-null  float64
 9   Production_Speed_units_per_hr  100000 non-null  float64
 10  Predictive_Maintenance_Score   100000 non-null  float64
 11  Error_Rate_%                   100000 non-null  float64
 12  Efficiency_Status              

-   There is no missing data in any of the features.
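
The `Timestamp` column is stored as `object` (text), so it needs to be parsed before any time series analysis. The sketch below uses only the five sample rows shown earlier to check the spacing between records and which machines they belong to. The format string is inferred from those sample values and is an assumption about the rest of the file.

```python
from datetime import datetime

# (Timestamp, Machine_ID) pairs copied from the five sample rows of df2.
rows = [
    ("2024-01-01 00:00:00", 39),
    ("2024-01-01 00:01:00", 29),
    ("2024-01-01 00:02:00", 15),
    ("2024-01-01 00:03:00", 43),
    ("2024-01-01 00:04:00", 8),
]

# Parse the text timestamps into datetime objects.
parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts, _ in rows]
machine_ids = [machine for _, machine in rows]

# Seconds elapsed between consecutive records.
deltas = [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]

print("seconds between rows:", deltas)
print("distinct machines in sample:", len(set(machine_ids)))
```

In this sample the records are one minute apart, but each belongs to a different `Machine_ID`. If that pattern holds across the file, a per-machine series would be much sparser than the overall one-minute cadence suggests, which is worth confirming on the full data before fitting time series models.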

## 🗓️**Project Schedule**

-   Define the problem statement and goals. Due Date: 8/6/2025
-   Plan how to incorporate peer review feedback into the project plan. Due Date: 8/7/2025
-   Data cleaning (handling missing values and outliers, defining imputation methods). Due Date: 8/13/2025
-   Define the key response for each dataset, depending on the model: for example, defect pass/fail for a classification model or defect rate for a regression model. Due Date: 8/13/2025
-   Analyze features (use PCA or other methods to understand which features contribute most to the variability, etc.). Due Date: 8/13/2025
-   Classification Model Creation and Validation. Due Date: 8/15/2025
-   Regression Model Creation and Validation. Due Date: 8/18/2025
-   Incorporate time series analysis, compare models, and recommend the best one. Due Date: 8/18/2025
-   Prepare final report and presentation. Due Date: 8/20/2025
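
For the feature analysis step, the share of variance captured by each principal component can be computed directly from a singular value decomposition. The sketch below is a minimal illustration assuming NumPy is available. The function name is hypothetical and the three "sensor" columns are randomly generated stand-ins, not project data.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    # Standardize so features measured in different units are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values of the standardized data are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values**2 / (len(Z) - 1)
    return variances / variances.sum()


# Randomly generated stand-in data: two strongly correlated columns
# plus one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base + 0.1 * rng.normal(size=500),  # hypothetical sensor A
    base + 0.1 * rng.normal(size=500),  # hypothetical sensor B, tracks A
    rng.normal(size=500),               # hypothetical sensor C, independent
])

ratios = explained_variance_ratio(X)
print(ratios.round(3))
```

Because two of the three columns carry nearly the same signal, most of the variance concentrates in the first component. The same check on the real sensor columns would show whether any of them are largely redundant.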

## 📁**Project Organization**

| \| **FINAL-PROJECT-CASTRO**
| \| -- 📁**DATA**: # Raw data files obtained from the Kaggle source in CSV format
| \| \_\_\_\_\_\_\|-- 📁**processed**: # Cleaned and processed datasets
| \| \_\_\_\_\_\_\|-- 📁**results**: # Model evaluation and other results
| \| -- 📁**IMAGES**: # Any images to be used by the Quarto site
| \| -- 📁**presentation_files**: # Quarto presentation files
| \| -- 📁**extra**: # Additional documents or files used in the project
| \| -- 📁**quarto**: # Quarto files
| \| -- 📁**src**: # Source code used for the project
| \| -- 📁**.github**: # GitHub configuration files
| \| -- 📄**requirements.txt**: # Python dependencies
| \| -- 📄**\_quarto.yml** # Quarto metadata and configuration
| \| -- 📄**.gitignore** # List of files and directories to be ignored by Git
| \| -- 📄**about.qmd** # Quarto about page with general information about the project
| \| -- 📄**presentation.qmd** # Quarto final project presentation
| \| -- 📄**proposal.qmd** # Project problem statement and proposal
| \| -- 📄**README.md** # Main README file for Git
|