<p style="font-family: 'Roboto', sans-serif; font-weight: bold; letter-spacing: normal; color: #005792; font-size: 16px; text-align: left; padding: 5px; border-bottom: 2px solid #007ACC; background-color: #F0F8FF;">
  Package Imports
</p>

In [1]:
# General imports
import re
import numpy as np
import pandas as pd
from warnings import filterwarnings
from tqdm.notebook import tqdm
from gc import collect
from IPython.display import clear_output, display, HTML
from termcolor import colored

# Configure pandas display options
pd.options.display.max_rows = 50
pd.set_option('display.float_format', '{:,.5f}'.format)

# Data processing and statistical analysis
from scipy.stats import mode, iqr, anderson, shapiro, normaltest
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Visualization libraries and settings
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap
%matplotlib inline

# Suppress all warnings to keep notebook output readable
filterwarnings('ignore')


In [2]:
# Model imports

from sklearn import datasets
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.feature_selection import mutual_info_regression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_validate
# All preprocessing imports in one place (PowerTransformer was previously imported twice)
from sklearn.preprocessing import (
    FunctionTransformer, PolynomialFeatures, PowerTransformer,
    StandardScaler, MinMaxScaler, QuantileTransformer, RobustScaler,
)

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LogisticRegression

<div style="color: #333333; 
           display: block; 
           border-radius: 12px;
           background-color: #FAFBFD; 
           font-size: 14px; 
           font-family: 'Roboto', sans-serif; 
           letter-spacing: 0.5px;
           padding: 20px; 
           border: 1px solid #CCCCCC; 
           ">
    
The data was generated by a deep learning model trained on the Machine Failure Predictions dataset. The relationships between the variables resemble those in the original dataset but differ in notable ways.

We work with two datasets: the **Training dataset**, which contains both the predictor variables and the target variable, and the **Testing dataset**, which contains only the predictor variables.

The target variable is **Machine Failure**, a binary indicator of whether a machine has failed. The predictor variables are:

- **`Product ID`**: An identifier for the product or machine, which may reveal whether certain machines are more prone to failure.
- **`Type`**: The category of machine, which could influence its likelihood of failure.
- **`Air temperature [K]`** and **`Process temperature [K]`**: Temperatures in kelvin, relevant because overheating is a common cause of failure.
- **`Rotational speed [rpm]`**: The operating speed in revolutions per minute, where higher speeds may increase the risk of failure.
- **`Torque [Nm]`**: The torque in newton-metres, where excessive torque can indicate wear and lead to failure.
- **`Tool wear [min]`**: The time in minutes that the tool has been in use, where longer use may signal imminent failure.

The dataset also records specific failure modes, each marked by a binary indicator:
- **`TWF`** (Tool Wear Failure),
- **`HDF`** (Heat Dissipation Failure),
- **`PWF`** (Power Failure),
- **`OSF`** (Overstrain Failure),
- **`RNF`** (Random Failure).

If any of these failure indicators is set, the machine is considered to have failed.
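
The rule above can be sketched as a row-wise "any" over the five indicator columns. This is a minimal illustration on hypothetical toy rows, not the competition data; the `any_failure` column name is an assumption made for this example. Because the data is synthetic, the rule may not hold for every row, so it is worth comparing the derived flag with the actual target before relying on it.

```python
import pandas as pd

# Column names follow the dataset description; the rows are made up.
failure_modes = ["TWF", "HDF", "PWF", "OSF", "RNF"]
toy = pd.DataFrame(
    {
        "TWF": [0, 1, 0],
        "HDF": [0, 0, 0],
        "PWF": [0, 0, 1],
        "OSF": [0, 0, 1],
        "RNF": [0, 0, 0],
    }
)

# A row counts as failed when at least one failure-mode flag is set.
# "any_failure" is a hypothetical helper column, not part of the dataset.
toy["any_failure"] = toy[failure_modes].any(axis=1).astype(int)
print(toy)
```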

**Evaluating Our Model's Performance**

We evaluate the model primarily by the area under the ROC curve (AUC), computed from the predicted failure probabilities and the actual outcomes.
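
As a small illustration of the metric, the sketch below computes the AUC with scikit-learn's `roc_auc_score` on four made-up labels and scores (the values are assumptions chosen for the example). The AUC can be read as the fraction of (failed, not failed) pairs in which the failed machine receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth (1 = failure) and predicted failure probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Three of the four (positive, negative) pairs are ranked correctly:
# only the pair (0.35, 0.4) is in the wrong order.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

Only the ordering of the scores matters here, so the metric rewards ranking failures above non-failures rather than producing calibrated probabilities.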

**Analysis Phases**

**Phase 1: Data Examination**

The first stage is a thorough review of the dataset: identifying missing values, examining feature distributions, spotting outliers, and running normality tests to assess data quality and reliability.
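
A minimal sketch of these checks on a single hypothetical column is shown below. The data is randomly generated with one planted missing value and one planted outlier; the 1.5 × IQR fence and the choice of the Shapiro-Wilk test are assumptions made for illustration, not necessarily the rules applied later in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import iqr, shapiro

# Synthetic stand-in for one numeric feature (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Torque [Nm]": rng.normal(40.0, 10.0, size=200)})
toy.loc[0, "Torque [Nm]"] = 150.0   # planted outlier
toy.loc[1, "Torque [Nm]"] = np.nan  # planted missing value

# 1. Missing values per column.
missing_counts = toy.isna().sum()

# 2. Outliers by the 1.5 * IQR rule (an assumed convention).
col = toy["Torque [Nm]"].dropna()
q1, q3 = col.quantile([0.25, 0.75])
spread = iqr(col)
outliers = col[(col < q1 - 1.5 * spread) | (col > q3 + 1.5 * spread)]

# 3. Shapiro-Wilk normality test; a small p-value suggests non-normality.
stat, p_value = shapiro(col)

print(missing_counts, len(outliers), p_value, sep="\n")
```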

**Phase 2: Data Correction**

This stage addresses the anomalies detected in Phase 1, improving the dataset's overall quality and laying a solid foundation for predictive modeling.

**Phase 3: Predictive Modeling**

In the final stage, we develop and deploy a **binary classification model** that predicts machine failures, and we measure its success by the area under the ROC curve.
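
One way this phase might look is sketched below: a scaled logistic regression scored by cross-validated ROC AUC. Everything here is an assumption for illustration. The data comes from `make_classification` rather than the competition files, the model and its settings are placeholders, and `StratifiedKFold` is swapped in for plain `KFold` so that every fold contains some positive cases when failures are a minority class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for the real data.
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=1.5,
    weights=[0.9, 0.1],
    random_state=0,
)

# Scaling lives inside the pipeline, so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly equal across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring="roc_auc")
mean_auc = scores["test_score"].mean()
print(mean_auc)
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training, which would otherwise inflate the reported AUC.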

</div>