In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")
sns.set_context("notebook")

# Source and Data Description
The data to be analyzed is published by the UC Irvine Machine Learning Repository.
The data was originally published in 2016 by Tony Lindgren and Jonas Biteus of Scania CV AB.
The downloadable zip file can be found on [UC Irvine's website](https://archive.ics.uci.edu/dataset/421/aps+failure+at+scania+trucks) and consists of three files:
- aps_failure_description.txt
- aps_failure_test_set.csv
- aps_failure_training_set.csv


The target label has two classes:
- positive class (failure related to APS)
- negative class (failure not related to APS)

## Cost
A prediction model's performance is quantified by a total cost. A false positive (failure predicted although the truck is working) costs Cost_1 = 10, and a false negative (a truck with an APS failure that is not recognized) costs Cost_2 = 500.
Total_cost = Cost_1 * No_False_Positives + Cost_2 * No_False_Negatives
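
The cost metric can be written as a small helper. This is a minimal sketch, assuming the labels are the strings "neg" and "pos" as in the CSV files; the function name `total_cost` and its parameters are illustrative and not part of the dataset description.

```python
import numpy as np

COST_FP = 10   # Cost_1: false positive
COST_FN = 500  # Cost_2: false negative


def total_cost(y_true, y_pred, pos_label="pos"):
    """Return Cost_1 * (#false positives) + Cost_2 * (#false negatives)."""
    true_pos = np.asarray(y_true) == pos_label
    pred_pos = np.asarray(y_pred) == pos_label
    n_fp = int(np.sum(~true_pos & pred_pos))  # predicted pos, actually neg
    n_fn = int(np.sum(true_pos & ~pred_pos))  # predicted neg, actually pos
    return COST_FP * n_fp + COST_FN * n_fn


# Toy example with one false positive and one false negative.
y_true = ["neg", "neg", "pos", "pos"]
y_pred = ["neg", "pos", "pos", "neg"]
print(total_cost(y_true, y_pred))
```

Because a false negative is 50 times as expensive as a false positive, a model that misses positives is penalized far more heavily than one that raises extra alarms.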

## Dataset Overview
The training set consists of 60000 examples in total, with 59000 negative and 1000 positive examples.
The test set consists of 16000 examples.

## Dataset Attributes
There is a total of 171 attributes with anonymized names.
The data contain single numerical values as well as 7 histograms (e.g. temperature bins).
Missing values are marked as "na".
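
If "na" is the only missing-value marker, pandas can translate it to NaN while reading. The sketch below uses a tiny in-memory stand-in that mimics the file layout instead of the real CSV files, and it assumes that no other non-numeric strings occur in the feature columns.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV files (same layout, only two feature columns).
sample_csv = io.StringIO(
    "class,aa_000,ab_000\n"
    "neg,76698,na\n"
    "pos,153204,0\n"
)

# na_values tells read_csv to treat the string "na" as a missing value.
sample_df = pd.read_csv(sample_csv, na_values="na")
print(sample_df.dtypes)
print(sample_df.isna().sum())
```

Columns that contain a missing value are read as float64, because NaN is a floating-point value; columns without missing values keep an integer dtype.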

# Goals and Intentions
Based on the attributes and class labels of the training data, a prediction model is developed that assigns each example in the test set to one of the two classes. The model is then evaluated by calculating its total cost on the test set; the goal is to minimize this cost.
In more technical terms, the model should decide for a faulty truck whether the failure is caused by the APS or by another component. Knowing the APS's role in a failure can speed up the repair process and help prevent truck breakdowns.

In [3]:
train_df = pd.read_csv("../data/raw/aps_failure_training_set.csv", header=14)
test_df = pd.read_csv("../data/raw/aps_failure_test_set.csv", header=14)

## First Look at the data
In this section the dataframes are analyzed with respect to their datatypes and missing values.
Open questions are:
- What datatypes does the dataframe consist of?
- How are missing values displayed?
- How can the DataFrame be converted so that it is suitable for a model?

In [4]:
train_df.head(10), test_df.head(10)

(  class  aa_000 ab_000      ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002  \
 0   neg   76698     na  2130706438    280      0      0      0      0      0   
 1   neg   33058     na           0     na      0      0      0      0      0   
 2   neg   41040     na         228    100      0      0      0      0      0   
 3   neg      12      0          70     66      0     10      0      0      0   
 4   neg   60874     na        1368    458      0      0      0      0      0   
 5   neg   38312     na  2130706432    218      0      0      0      0      0   
 6   neg      14      0           6     na      0      0      0      0      0   
 7   neg  102960     na  2130706432    116      0      0      0      0      0   
 8   neg   78696     na           0     na      0      0      0      0      0   
 9   pos  153204      0         182     na      0      0      0      0      0   
 
    ...   ee_002  ee_003   ee_004   ee_005   ee_006  ee_007  ee_008 ee_009  \
 0  ...  1240520  493384   72

In [5]:
train_df.info(), test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 171 entries, class to eg_000
dtypes: int64(1), object(170)
memory usage: 78.3+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16000 entries, 0 to 15999
Columns: 171 entries, class to eg_000
dtypes: int64(1), object(170)
memory usage: 20.9+ MB


(None, None)

In [7]:
train_df["class"].value_counts(), train_df["class"].value_counts(normalize=True)

(class
 neg    59000
 pos     1000
 Name: count, dtype: int64,
 class
 neg    0.983333
 pos    0.016667
 Name: proportion, dtype: float64)

In [8]:
test_df["class"].value_counts(), test_df["class"].value_counts(normalize=True)

(class
 neg    15625
 pos      375
 Name: count, dtype: int64,
 class
 neg    0.976562
 pos    0.023438
 Name: proportion, dtype: float64)

In [9]:
print("In train_df.iloc[0,2]: " + train_df.iloc[0,2])
print("Type of na: ")
print(type(train_df.iloc[0,2]))
train_unique_types = set(type(v) for v in train_df.values.flatten())
test_unique_types = set(type(v) for v in test_df.values.flatten())
print("In the training set the following types can occur: ")
print(train_unique_types)
print("In the test set the following types can occur: ")
print(test_unique_types)

In train_df.iloc[0,2]: na
Type of na: 
<class 'str'>
In the training set the following types can occur: 
{<class 'int'>, <class 'str'>}
In the test set the following types can occur: 
{<class 'int'>, <class 'str'>}


## Data insights
Except for one int64 column, all columns are of type object.
Since the DataFrames contain only strings and integers, the object columns consist of those two types.
One of the strings is "na". To see whether the data contain any strings other than this missing-value marker, the next step is to convert the columns to numeric types and inspect the resulting number of NaNs.
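
A complementary check is to list the distinct strings that fail numeric parsing. The following is a minimal sketch on a toy frame; `toy_df` and its values are made up for illustration and stand in for `train_df`.

```python
import pandas as pd

# Toy frame standing in for train_df (values are illustrative only).
toy_df = pd.DataFrame({
    "class": ["neg", "pos", "neg"],
    "aa_000": [12, 14, 60],
    "ab_000": ["na", "0", "2"],
    "ac_000": ["70", "na", "6"],
})

features = toy_df.drop(columns="class")
# Flatten all feature cells into one Series of strings.
as_text = features.astype(str).stack()
# Keep only the cells that cannot be parsed as a number.
non_numeric = as_text[pd.to_numeric(as_text, errors="coerce").isna()]
print(sorted(non_numeric.unique()))
```

If the printed list contains only "na", the missing-value marker is the only non-numeric string in the feature columns.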

In [10]:
numeric_train_df = train_df.copy()
numeric_test_df = test_df.copy()

for col in numeric_train_df.columns:
    if col != "class":
        numeric_train_df[col] = pd.to_numeric(numeric_train_df[col], errors="coerce")
        numeric_test_df[col] = pd.to_numeric(numeric_test_df[col], errors="coerce")

introduced_nan_fraction_train = numeric_train_df.isna().mean().sort_values(ascending=False)
introduced_nan_fraction_test = numeric_test_df.isna().mean().sort_values(ascending=False)

In [11]:
introduced_nan_fraction_train.head(10), introduced_nan_fraction_test.head(10)

(br_000    0.821067
 bq_000    0.812033
 bp_000    0.795667
 bo_000    0.772217
 ab_000    0.772150
 cr_000    0.772150
 bn_000    0.733483
 bm_000    0.659150
 bl_000    0.454617
 bk_000    0.383900
 dtype: float64,
 br_000    0.820562
 bq_000    0.811312
 bp_000    0.795063
 bo_000    0.773500
 ab_000    0.772687
 cr_000    0.772687
 bn_000    0.732062
 bm_000    0.659125
 bl_000    0.451625
 bk_000    0.380875
 dtype: float64)

Now the fraction of NaN after conversion from the original DFs to numeric DFs is known. This needs to be compared to the original distribution of "na" occurrences in the DFs.

In [12]:
string_na_fraction_train = (train_df == "na").mean().sort_values(ascending=False)
string_na_fraction_test = (test_df == "na").mean().sort_values(ascending=False)

string_na_fraction_train, string_na_fraction_test 

(br_000    0.821067
 bq_000    0.812033
 bp_000    0.795667
 bo_000    0.772217
 ab_000    0.772150
             ...   
 cj_000    0.005633
 ci_000    0.005633
 bt_000    0.002783
 aa_000    0.000000
 class     0.000000
 Length: 171, dtype: float64,
 br_000    0.820562
 bq_000    0.811312
 bp_000    0.795063
 bo_000    0.773500
 ab_000    0.772687
             ...   
 cj_000    0.005375
 ci_000    0.005375
 bt_000    0.001750
 aa_000    0.000000
 class     0.000000
 Length: 171, dtype: float64)

At first glance, the distribution of NaN after conversion appears to match the distribution of "na" before conversion. To confirm this, the per-column difference between the two fractions is calculated. If it is 0 (or close to 0 due to rounding errors), "na" is the only non-numeric string in the feature columns.

In [13]:
missing_original_train = (train_df.drop(columns="class") == "na").mean()
missing_original_test = (test_df.drop(columns="class") == "na").mean()
missing_numeric_train = numeric_train_df.drop(columns="class").isna().mean()
missing_numeric_test = numeric_test_df.drop(columns="class").isna().mean()

missing_diff_train = (missing_original_train - missing_numeric_train).abs()
missing_diff_test = (missing_original_test - missing_numeric_test).abs()

missing_diff_train_sorted = missing_diff_train.sort_values(ascending=False)
missing_diff_test_sorted = missing_diff_test.sort_values(ascending=False)

missing_diff_train_sorted.head(10), missing_diff_test_sorted.head(10)

(aa_000    0.0
 ab_000    0.0
 ac_000    0.0
 ad_000    0.0
 ae_000    0.0
 af_000    0.0
 ag_000    0.0
 ag_001    0.0
 ag_002    0.0
 ag_003    0.0
 dtype: float64,
 aa_000    0.0
 ab_000    0.0
 ac_000    0.0
 ad_000    0.0
 ae_000    0.0
 af_000    0.0
 ag_000    0.0
 ag_001    0.0
 ag_002    0.0
 ag_003    0.0
 dtype: float64)

There is no difference between the fraction of "na" before conversion and the fraction of NaN after conversion, in either DataFrame. This means the conversion introduces no additional missing values, so "na" is the only non-numeric string in the feature columns.
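
The same conclusion can be checked cell by cell instead of per column. This is a sketch on a toy frame; `toy_raw` and `toy_numeric` are hypothetical stand-ins for the feature columns of `train_df` and `numeric_train_df`.

```python
import pandas as pd

# Toy stand-ins for the raw and the converted feature columns.
toy_raw = pd.DataFrame({"ab_000": ["na", "0", "2"], "ac_000": ["70", "na", "6"]})
toy_numeric = toy_raw.apply(pd.to_numeric, errors="coerce")

na_mask = toy_raw == "na"       # True where the raw cell is the string "na"
nan_mask = toy_numeric.isna()   # True where the converted cell is NaN

# The masks agree everywhere only if "na" is the sole source of NaN.
masks_match = bool((na_mask == nan_mask).all().all())
print(masks_match)
```

Comparing masks rules out the unlikely case in which equal fractions hide NaNs at different positions.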

In [15]:
numeric_train_df.dtypes.value_counts(), numeric_test_df.dtypes.value_counts()

(float64    169
 object       1
 int64        1
 Name: count, dtype: int64,
 float64    169
 object       1
 int64        1
 Name: count, dtype: int64)

In [16]:
print(numeric_train_df.select_dtypes(include="object").columns.tolist())
print(numeric_test_df.select_dtypes(include="object").columns.tolist())
print(numeric_train_df.select_dtypes(include="int64").columns.tolist())
print(numeric_test_df.select_dtypes(include="int64").columns.tolist())
print(numeric_train_df["aa_000"])
print(numeric_test_df["aa_000"])
string_na_fraction_train = (train_df == "na").mean().sort_values(ascending=True)
string_na_fraction_test = (test_df == "na").mean().sort_values(ascending=True)


['class']
['class']
['aa_000']
['aa_000']
0         76698
1         33058
2         41040
3            12
4         60874
          ...  
59995    153002
59996      2286
59997       112
59998     80292
59999     40222
Name: aa_000, Length: 60000, dtype: int64
0           60
1           82
2        66002
3        59816
4         1814
         ...  
15995    81852
15996       18
15997    79636
15998      110
15999        8
Name: aa_000, Length: 16000, dtype: int64


The conversion to numeric DFs results in 169 columns of type float64, one column of type int64, and one object column.
The object column is the "class" column with the entries "neg" or "pos"; it is the target variable.
The int64 column is "aa_000". It does not contain a single "na", so pandas already read it as int64 and it contains no NaN after conversion.
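
Most models expect a numeric target. One possible encoding, shown here as a sketch on a toy Series rather than as the approach used later in this project, maps "neg" to 0 and "pos" to 1; `toy_labels` is a made-up stand-in for the "class" column.

```python
import pandas as pd

# Toy stand-in for the "class" column.
toy_labels = pd.Series(["neg", "neg", "pos"], name="class")

# Map the string labels to integers; unknown labels would become NaN.
y = toy_labels.map({"neg": 0, "pos": 1})
if y.isna().any():
    raise ValueError("Unexpected label in the class column")
y = y.astype("int64")
print(y.tolist())
```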


# Final Thoughts

The key findings in the data understanding section are:
- Attribute names are anonymized in both the train and test set
- "na" represents missing data
- Numeric conversion turns "na" to NaN
- Numeric conversion results in 169 float64 columns (original dtype:object)
- One column ("aa_000") is int64 (No "na" before conversion and all ints)
- One column ("class") represents target values as strings (dtype:object)

=> Next steps in 02_missing_data_strategy.ipynb