# Sensor Component Fault Detection

## Problem Statement

**Data:** Sensor Data

**Problem Description**
- The Air Pressure System (APS) is a critical component of heavy Scania trucks. It generates pressurized air that is used in functions such as braking and gear changing. This project predicts whether a truck failure is caused by the APS or not.
- The sensor dataset has a categorical target with two classes, positive and negative
- The positive class corresponds to failures caused by components of the APS
- The negative class corresponds to failures not related to the APS



| | True class: Positive | True class: Negative |
| ----------- | ----------- | ----------- |
| <b>Predicted class: Positive</b> | - | cost_1 |
| <b>Predicted class: Negative</b> | cost_2 | - |


cost_1 (FP) = 10 and cost_2 (FN) = 500



- cost_1 is the cost of a False Positive and cost_2 is the cost of a False Negative
- False Positive: a mechanic performs an unnecessary check
- False Negative: a fault is missed, which may cause a breakdown
- Total cost = cost_1 * (number of False Positives) + cost_2 * (number of False Negatives)
- The goal is to reduce unnecessary repair costs, so the number of false predictions has to be kept low.
- Most importantly, we have to reduce the False Negatives, since a FN costs 50 times as much as a FP (500 vs. 10)
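
The total cost formula above can be sketched as a small helper. This is a minimal illustration, not code from the notebook: the function name `total_cost`, the `positive` parameter, and the toy labels are assumptions; only the two cost values come from the problem statement.

```python
# Costs taken from the problem statement.
COST_FP = 10   # cost_1: unnecessary check by a mechanic
COST_FN = 500  # cost_2: missed fault that may cause a breakdown


def total_cost(y_true, y_pred, positive="pos"):
    """Return cost_1 * FP + cost_2 * FN for two equal-length label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    false_positives = sum(
        1 for t, p in zip(y_true, y_pred) if t != positive and p == positive
    )
    false_negatives = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p != positive
    )
    return COST_FP * false_positives + COST_FN * false_negatives


# Toy labels for illustration only (not from the dataset).
y_true = ["pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg"]
print(total_cost(y_true, y_pred))  # 1 FP and 1 FN -> 10 + 500 = 510
```

A single missed fault outweighs fifty unnecessary checks, which is why a model evaluated with this cost should favor recall on the positive class.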

## Challenges and objectives
- Missing values occur in many columns and need to be handled
- Misclassification leads to unwanted repair costs, so the total cost defined above should be kept as low as possible

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

## Read the data from the csv file

In [3]:
# read the CSV file and load it into a DataFrame
df = pd.read_csv('../data/sensor.csv')

In [4]:
df.head()

,class,aa_000,ab_000,ac_000,ad_000,ae_000,af_000,ag_000,ag_001,ag_002,...,ee_002,ee_003,ee_004,ee_005,ee_006,ee_007,ee_008,ee_009,ef_000,eg_000
0,pos,912384,,,,,,0.0,0.0,0.0,...,7958290.0,3800292.0,8566444.0,8613822.0,5996898.0,2777986.0,2539192.0,49268.0,,
1,pos,281324,2.0,3762.0,2346.0,0.0,0.0,4808.0,215720.0,967572.0,...,624606.0,269976.0,638838.0,1358354.0,819918.0,262804.0,2824.0,0.0,0.0,0.0
2,pos,1086734,,,,0.0,0.0,0.0,0.0,0.0,...,10418606.0,5879290.0,12787218.0,11547080.0,5935580.0,1598690.0,932418.0,22944.0,0.0,0.0
3,pos,176346,0.0,6982.0,5922.0,0.0,0.0,0.0,0.0,65836.0,...,2428526.0,720862.0,1189150.0,1710512.0,1018248.0,626354.0,32208.0,0.0,0.0,0.0
4,pos,959094,2.0,288.0,266.0,0.0,0.0,0.0,0.0,0.0,...,790862.0,256590.0,566132.0,1192120.0,86112.0,16760.0,3794.0,0.0,0.0,0.0


In [5]:
df.columns

Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000',
       'ag_000', 'ag_001', 'ag_002',
       ...
       'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008',
       'ee_009', 'ef_000', 'eg_000'],
      dtype='object', length=171)

In [6]:
df['class'].dtype

dtype('O')

In [7]:
df['aa_000'].dtype

dtype('int64')

In [8]:
# check the shape of the dataframe, which gives number of rows and columns
df.shape

(25001, 171)

In [9]:
df['class'].value_counts()

neg    23276
pos     1725
Name: class, dtype: int64
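
The counts above show a strong class imbalance: the positive class is about 6.9% of the rows. A short sketch of how the shares could be computed is given below; it rebuilds the counts by hand so it runs on its own, and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"neg": 23276, "pos": 1725}, name="class")

# Share of each class; on the real data this matches
# df['class'].value_counts(normalize=True).
share = counts / counts.sum()
imbalance_ratio = counts["neg"] / counts["pos"]

print(share.round(3))
print(f"neg:pos ratio is about {imbalance_ratio:.1f}:1")
```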

In [16]:
# split the columns into numerical and categorical features based on dtype
numerical_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

In [18]:
print("We have {} numerical features: {}".format(len(numerical_features),numerical_features))
print("We have {} categorical features: {}".format(len(categorical_features),categorical_features))

We have 170 numerical features: ['aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000', 'ag_000', 'ag_001', 'ag_002', 'ag_003', 'ag_004', 'ag_005', 'ag_006', 'ag_007', 'ag_008', 'ag_009', 'ah_000', 'ai_000', 'aj_000', 'ak_000', 'al_000', 'am_0', 'an_000', 'ao_000', 'ap_000', 'aq_000', 'ar_000', 'as_000', 'at_000', 'au_000', 'av_000', 'ax_000', 'ay_000', 'ay_001', 'ay_002', 'ay_003', 'ay_004', 'ay_005', 'ay_006', 'ay_007', 'ay_008', 'ay_009', 'az_000', 'az_001', 'az_002', 'az_003', 'az_004', 'az_005', 'az_006', 'az_007', 'az_008', 'az_009', 'ba_000', 'ba_001', 'ba_002', 'ba_003', 'ba_004', 'ba_005', 'ba_006', 'ba_007', 'ba_008', 'ba_009', 'bb_000', 'bc_000', 'bd_000', 'be_000', 'bf_000', 'bg_000', 'bh_000', 'bi_000', 'bj_000', 'bk_000', 'bl_000', 'bm_000', 'bn_000', 'bo_000', 'bp_000', 'bq_000', 'br_000', 'bs_000', 'bt_000', 'bu_000', 'bv_000', 'bx_000', 'by_000', 'bz_000', 'ca_000', 'cb_000', 'cc_000', 'cd_000', 'ce_000', 'cf_000', 'cg_000', 'ch_000', 'ci_000', 'cj_000', 'ck_000',

**Dataset description**
- We have 171 columns in total: 170 independent features and 1 dependent (target) column
- 25001 data points in total
- The target column "class" has two classes, positive and negative
- The dataset contains 170 numerical features and 1 categorical column (the target)
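
Given this layout, the features and the target can be separated as sketched below. The toy frame and the 0/1 encoding of `neg`/`pos` are assumptions made for illustration; the notebook itself does not show this step.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with the same layout as the dataset (values are made up).
toy = pd.DataFrame(
    {
        "class": ["pos", "neg", "neg"],
        "aa_000": [912384, 281324, 176346],
        "ab_000": [np.nan, 2.0, 0.0],
    }
)

TARGET = "class"
X = toy.drop(columns=[TARGET])             # independent features
y = toy[TARGET].map({"neg": 0, "pos": 1})  # target column encoded as 0/1 (assumed encoding)

print(X.shape, y.tolist())
```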

## Checking the missing values

In [25]:
missing = df.isna().sum()
print(missing)

class         0
aa_000        0
ab_000    19262
ac_000     1896
ad_000     6840
          ...  
ee_007      261
ee_008      261
ee_009      261
ef_000     1528
eg_000     1528
Length: 171, dtype: int64
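
Raw counts are easier to compare as percentages: for example, `ab_000` is missing in 19262 of 25001 rows, roughly 77%. A minimal sketch on a made-up frame is shown below; on the real data, `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame(
    {
        "class": ["pos", "neg", "neg", "neg"],
        "aa_000": [1, 2, 3, 4],
        "ab_000": [np.nan, np.nan, np.nan, 2.0],
        "ac_000": [1.0, np.nan, 3.0, 4.0],
    }
)

# Percentage of missing values per column, largest first.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Only the columns that actually contain missing values.
columns_with_missing = missing_pct[missing_pct > 0].index.tolist()
print(columns_with_missing)
```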


In [21]:
df['ab_000']

0        NaN
1        2.0
2        NaN
3        0.0
4        2.0
        ... 
24996    0.0
24997    NaN
24998    NaN
24999    NaN
25000    NaN
Name: ab_000, Length: 25001, dtype: float64
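
One possible way to handle columns like this is to drop features that are mostly empty and fill the remaining gaps with the column median. This is only a sketch of that idea, not the approach used in the notebook: the cut-off `MAX_MISSING_SHARE` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy numeric features (made-up values); the real input would be df without "class".
features = pd.DataFrame(
    {
        "aa_000": [1.0, 2.0, 3.0, 4.0],
        "ab_000": [np.nan, np.nan, np.nan, 2.0],  # 75% missing
        "ac_000": [1.0, np.nan, 3.0, 5.0],        # 25% missing
    }
)

MAX_MISSING_SHARE = 0.7  # hypothetical cut-off, not taken from the source

# Keep only columns whose share of missing values is at or below the cut-off.
keep = features.columns[features.isna().mean() <= MAX_MISSING_SHARE]
reduced = features[keep]

# Fill the remaining gaps with each column's median.
imputed = reduced.fillna(reduced.median())
print(imputed)
```

With a train/test split, the medians would normally be computed on the training portion only and then reused for the test portion.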