# Predictive Maintenance 

This assignment covers predictive maintenance: predicting when a machine will need maintenance before it breaks down. The problem arises wherever machines require regular maintenance, for example in manufacturing, fleet operations, and train maintenance.

This assignment uses the [Predictive Maintenance Dataset](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset). The dataset consists of 10 000 data points stored as rows, with 14 features in columns (the `ai4i2020.csv` file loaded below contains 9 of them). The 'Machine failure' label indicates whether the machine failed at that data point.

# Learning Objectives
- Perform model tuning based on hyperparameters.
- Select the best model after attempting multiple models.
- Perform recursive feature elimination, producing a statistically significant improvement over a model without feature selection.

In [116]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split


ai4i2020 = pd.read_csv('ai4i2020.csv')
print(ai4i2020.info())
ai4i2020.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  object 
 4   Process temperature [K]  10000 non-null  object 
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 703.2+ KB
None


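| | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 |
| 5 | 6 | M14865 | M | 298.1 | 308.6 | 1425 | 41.9 | 11 | 0 |
| 6 | 7 | L47186 | L | 298.1 | 308.6 | 1558 | 42.4 | 14 | 0 |
| 7 | 8 | L47187 | L | 298.1 | 308.6 | 1527 | 40.2 | 16 | 0 |
| 8 | 9 | M14868 | M | 298.3 | 308.7 | 1667 | 28.6 | 18 | 0 |
| 9 | 10 | M14869 | M | 298.5 | 309.0 | 1741 | 28.0 | 21 | 0 |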


Question 1.1:  Write a command that will calculate the number of unique values for each feature in the training data.

In [117]:
# Count the distinct values in each column
ai4i2020.nunique()

UDI                        10000
Product ID                 10000
Type                           3
Air temperature [K]           93
Process temperature [K]       82
Rotational speed [rpm]       941
Torque [Nm]                  577
Tool wear [min]              246
Machine failure                2
dtype: int64

Question 1.2: Determine whether the data contains any missing values, which are encoded as '?', and replace them with np.nan.

In [118]:
ai4i2020.replace('?', np.nan, inplace=True)
ai4i2020.isnull().sum()

UDI                          0
Product ID                   0
Type                         0
Air temperature [K]        140
Process temperature [K]    183
Rotational speed [rpm]       0
Torque [Nm]                  0
Tool wear [min]              0
Machine failure              0
dtype: int64
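
As a possible shortcut, `pd.read_csv` can map the `'?'` placeholder to `NaN` while parsing, so the affected columns load as floats directly. The sketch below illustrates this on a small in-memory CSV; the rows are made up for the example and are not taken from `ai4i2020.csv`.

```python
import io

import pandas as pd

# Tiny made-up sample that mimics the '?' placeholder; not the real ai4i2020.csv.
sample_csv = io.StringIO(
    "Air temperature [K],Process temperature [K],Torque [Nm]\n"
    "298.1,308.6,42.8\n"
    "?,308.7,46.3\n"
    "298.2,?,49.4\n"
)

# na_values tells the parser which extra strings to treat as missing.
sample = pd.read_csv(sample_csv, na_values="?")

missing_per_column = sample.isnull().sum()
print(missing_per_column)
print(sample.dtypes)
```

With this approach the separate `replace` and `pd.to_numeric` steps would likely become unnecessary.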

Question 1.3: Replace all missing values with the mean. Change column types to numeric.

In [119]:

temperature_cols = ['Air temperature [K]', 'Process temperature [K]']

for col in temperature_cols:
    # Convert the column from strings to numbers
    ai4i2020[col] = pd.to_numeric(ai4i2020[col])
    # Fill missing values with the column mean; assigning the result back
    # avoids in-place updates on a column selection
    ai4i2020[col] = ai4i2020[col].fillna(ai4i2020[col].mean())

ai4i2020.isnull().sum()  # every column should now show 0




UDI                        0
Product ID                 0
Type                       0
Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
Machine failure            0
dtype: int64
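
Computing the mean over the full dataset lets rows that later end up in the test set influence the imputed values. One common alternative is to split first and fit the imputer on the training rows only. The sketch below shows that pattern with scikit-learn's `SimpleImputer` on synthetic data; the `demo` frame and its values are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one temperature column with a few missing readings.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Air temperature [K]": rng.normal(300.0, 2.0, size=200)})
demo.iloc[::25, 0] = np.nan

demo_train, demo_test = train_test_split(demo, test_size=0.3, random_state=42)

# Fit on the training rows only, then reuse the learned mean for the test rows.
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(demo_train)
test_filled = imputer.transform(demo_test)

print("mean learned from training rows:", imputer.statistics_[0])
```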

Question 1.4: Drop UDI and 'Product ID' from the data

In [120]:
ai4i2020.drop(['UDI', 'Product ID'],axis=1, inplace=True)

Question 2.1: Split the data into training and testing taking into consideration 'Machine failure' as the target

In [121]:
from sklearn.model_selection import train_test_split
X = ai4i2020.drop(['Machine failure'],axis=1)
y = ai4i2020['Machine failure']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
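
Because failures are rare, a purely random split can leave the two partitions with slightly different failure rates. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both. The sketch below demonstrates it on a synthetic target (`y_demo` and `X_demo` are made up). The cell above does not use `stratify`, and enabling it there would change the numbers reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 3% positives, similar in spirit to a failure label.
y_demo = np.array([1] * 30 + [0] * 970)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class ratio in both partitions.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```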

Question 2.2: Apply [One-Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to the data. Make sure to fit the encoder on the training data and transform both the training and test data.

In [122]:
from sklearn.preprocessing import OneHotEncoder

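# Fit the encoder on the training data only, then apply it to both splits.
# Note: fitting on all of X_train also one-hot encodes the numeric columns.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)
X_train_enc = enc.transform(X_train)
X_test_enc = enc.transform(X_test)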
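
The cell above fits the encoder on every column, so each distinct numeric reading becomes its own category, and with `handle_unknown='ignore'` any value not seen during fitting is encoded as all zeros. A possible alternative is to encode only the categorical `Type` column and pass the numeric columns through unchanged. The sketch below shows this with `ColumnTransformer` on a few made-up rows; it is an illustration under those assumptions, and using it in the notebook would change the results reported below.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that reuse some of the feature column names.
train_demo = pd.DataFrame({
    "Type": ["M", "L", "L", "H"],
    "Air temperature [K]": [298.1, 298.2, 298.1, 298.3],
    "Torque [Nm]": [42.8, 46.3, 49.4, 39.5],
})
test_demo = pd.DataFrame({
    "Type": ["L", "M"],
    "Air temperature [K]": [299.0, 297.5],
    "Torque [Nm]": [40.0, 41.9],
})

# Encode only 'Type'; remainder='passthrough' keeps numeric columns as numbers.
encoder = ColumnTransformer(
    transformers=[("type", OneHotEncoder(handle_unknown="ignore"), ["Type"])],
    remainder="passthrough",
)

train_encoded = encoder.fit_transform(train_demo)
test_encoded = encoder.transform(test_demo)

print(train_encoded.shape)  # 3 one-hot columns + 2 numeric columns
print(test_encoded.shape)
```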

Question 2.3: Apply [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to the training data since there is class imbalance.

In [123]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train_enc, y_train)
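
SMOTE balances the classes by creating synthetic minority samples that lie on the line segment between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that interpolation idea on made-up 2-D points. It is not imbalanced-learn's implementation, and the helper `synthesize` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Made-up minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])

# Find each point's nearest minority neighbours (column 0 is the point itself).
k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)


def synthesize(n_new):
    """Create n_new synthetic points by interpolating towards a random neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = rng.choice(neighbor_idx[i, 1:])  # skip column 0, the point itself
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(new_points)


synthetic = synthesize(10)
print(synthetic.shape)
```

Every synthetic point is a convex combination of two real minority points, so it stays inside the region the minority class already occupies.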

Question 3.1: Train five machine learning models ([LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier)) on the training data, and evaluate their performance on the test dataset. Use default hyperparameter values.

In [124]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


In [125]:
models = {
    'Logistic Regression': LogisticRegression(),
    'Support Vector Machine': SVC(),
    'K-NN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'XGBoost': XGBClassifier(),
}

# Fit each model on the resampled training data and evaluate it on the test set
for name, model in models.items():
    model.fit(X_res, y_res)

    y_pred = model.predict(X_test_enc)
    print(name)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

Logistic Regression
              precision    recall  f1-score   support

           0       0.98      0.96      0.97      2907
           1       0.21      0.31      0.25        93

    accuracy                           0.94      3000
   macro avg       0.59      0.64      0.61      3000
weighted avg       0.95      0.94      0.95      3000

[[2795  112]
 [  64   29]]
Support Vector Machine
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      2907
           1       0.33      0.03      0.06        93

    accuracy                           0.97      3000
   macro avg       0.65      0.52      0.52      3000
weighted avg       0.95      0.97      0.96      3000

[[2901    6]
 [  90    3]]
K-NN
              precision    recall  f1-score   support

           0       0.98      0.52      0.68      2907
           1       0.04      0.69      0.08        93

    accuracy                           0.53      3000
   macro avg       0.51    



XGBoost
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      2907
           1       0.29      0.02      0.04        93

    accuracy                           0.97      3000
   macro avg       0.63      0.51      0.51      3000
weighted avg       0.95      0.97      0.95      3000

[[2902    5]
 [  91    2]]
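
Each classification report can be reproduced from its confusion matrix. As a worked example, the sketch below recomputes the class 1 metrics for logistic regression from the matrix printed above, `[[2795, 112], [64, 29]]`, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above for logistic regression:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[2795, 112],
               [64, 29]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted failures that were real
recall = tp / (tp + fn)     # share of real failures that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The values round to 0.21, 0.31, 0.25, and 0.94, matching the report. The high accuracy mostly reflects the large number of true negatives, which is why the class 1 precision and recall are the more informative figures on this imbalanced target.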


Question 3.2: Perform recursive feature elimination (keeping 3 features) on the dataset using a logistic regression classifier with max_iter=1000. Is there any difference in the results? Explain.

In [126]:
from sklearn.feature_selection import RFE

log_rgr = LogisticRegression(random_state=5, max_iter=1000)  # max_iter as specified in the question

rfe = RFE(log_rgr, n_features_to_select=3)
rfe.fit(X_res, y_res)

y_pred = rfe.predict(X_test_enc)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98      2907
           1       0.62      0.05      0.10        93

    accuracy                           0.97      3000
   macro avg       0.80      0.53      0.54      3000
weighted avg       0.96      0.97      0.96      3000

[[2904    3]
 [  88    5]]
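
To help explain a change in results, it can be useful to check which features the selector kept. A fitted `RFE` object exposes `support_` (a boolean mask of the kept features) and `ranking_` (1 for kept features, larger values for features eliminated earlier). The sketch below shows both on synthetic data from `make_classification`, which stands in for the encoded training matrix and is an assumption of this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the encoded training matrix.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
# and grows for features eliminated earlier.
kept = [i for i, keep in enumerate(selector.support_) if keep]
print("kept feature indices:", kept)
print("ranking:", selector.ranking_)
```

In the notebook, the same attributes on `rfe` would refer to columns of the one-hot encoded matrix, so the indices would need to be mapped back to encoder output names before interpreting them.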
