# Caterpillar Tube Pricing

This notebook documents my work on the <a href="https://www.kaggle.com/c/caterpillar-tube-pricing/overview">Kaggle</a> competition 'Caterpillar Tube Pricing'. Caterpillar makes construction equipment such as bulldozers and forklifts. Each of these machines relies on a complex set of tubes. The goal of this challenge is to predict the price a supplier will quote for a given tube assembly.

In [1]:
import numpy as np
import pandas as pd
import os

## Reading in the Data

In [2]:
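#Read data
root_filepath = './Data/'
# os.listdir() returns names in arbitrary order; sort so the positional indexing below is stable
data_filepath = sorted(os.listdir(root_filepath))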
data_filepath

['bill_of_materials.csv',
 'comp_adaptor.csv',
 'comp_boss.csv',
 'comp_elbow.csv',
 'comp_float.csv',
 'comp_hfl.csv',
 'comp_nut.csv',
 'comp_other.csv',
 'comp_sleeve.csv',
 'comp_straight.csv',
 'comp_tee.csv',
 'comp_threaded.csv',
 'components.csv',
 'specs.csv',
 'test_set.csv',
 'train_set.csv',
 'tube.csv',
 'tube_end_form.csv',
 'type_component.csv',
 'type_connection.csv',
 'type_end_form.csv']

In [3]:
bill_of_materials = pd.read_csv(root_filepath + data_filepath[0])
comp_adaptor = pd.read_csv(root_filepath + data_filepath[1])
comp_boss = pd.read_csv(root_filepath + data_filepath[2])
comp_elbow = pd.read_csv(root_filepath + data_filepath[3])
comp_float = pd.read_csv(root_filepath + data_filepath[4])
comp_hfl = pd.read_csv(root_filepath + data_filepath[5])
comp_nut = pd.read_csv(root_filepath + data_filepath[6])
comp_other = pd.read_csv(root_filepath + data_filepath[7])
comp_sleeve = pd.read_csv(root_filepath + data_filepath[8])
comp_straight = pd.read_csv(root_filepath + data_filepath[9])
comp_tee = pd.read_csv(root_filepath + data_filepath[10])
comp_threaded = pd.read_csv(root_filepath + data_filepath[11])
components = pd.read_csv(root_filepath + data_filepath[12])
specs = pd.read_csv(root_filepath + data_filepath[13])
test_set = pd.read_csv(root_filepath + data_filepath[14])
train_set = pd.read_csv(root_filepath + data_filepath[15])
tube = pd.read_csv(root_filepath + data_filepath[16])
tube_end_form = pd.read_csv(root_filepath + data_filepath[17])
type_component = pd.read_csv(root_filepath + data_filepath[18])
type_connection = pd.read_csv(root_filepath + data_filepath[19])
type_end_form = pd.read_csv(root_filepath + data_filepath[20])
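
Indexing `data_filepath` by position only works while the folder holds exactly these files in exactly this order. As a minimal sketch (the helper name `load_csv_tables` is hypothetical and not part of the original notebook), the tables could instead be loaded into a dictionary keyed by file name:

```python
from pathlib import Path

import pandas as pd


def load_csv_tables(root):
    """Read every .csv file directly under `root` into a dict keyed by file stem."""
    tables = {}
    # sorted() makes the iteration order deterministic
    for path in sorted(Path(root).glob('*.csv')):
        tables[path.stem] = pd.read_csv(path)
    return tables


# Hypothetical usage with the folder from this notebook:
# tables = load_csv_tables('./Data/')
# bill_of_materials = tables['bill_of_materials']
```

With this approach an extra or missing file changes the dictionary keys rather than silently shifting every table to the wrong variable.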

## Data Exploration and Filling in Missing Values:

### Bill of Materials

In [4]:
bill_of_materials.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21198 entries, 0 to 21197
Data columns (total 17 columns):
tube_assembly_id    21198 non-null object
component_id_1      19149 non-null object
quantity_1          19149 non-null float64
component_id_2      14786 non-null object
quantity_2          14786 non-null float64
component_id_3      4791 non-null object
quantity_3          4798 non-null float64
component_id_4      607 non-null object
quantity_4          608 non-null float64
component_id_5      92 non-null object
quantity_5          92 non-null float64
component_id_6      26 non-null object
quantity_6          26 non-null float64
component_id_7      7 non-null object
quantity_7          7 non-null float64
component_id_8      1 non-null object
quantity_8          1 non-null float64
dtypes: float64(8), object(9)
memory usage: 2.7+ MB


In [5]:
bill_of_materials.head()

Unnamed: 0,tube_assembly_id,component_id_1,quantity_1,component_id_2,quantity_2,component_id_3,quantity_3,component_id_4,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8
0,TA-00001,C-1622,2.0,C-1629,2.0,,,,,,,,,,,,
1,TA-00002,C-1312,2.0,,,,,,,,,,,,,,
2,TA-00003,C-1312,2.0,,,,,,,,,,,,,,
3,TA-00004,C-1312,2.0,,,,,,,,,,,,,,
4,TA-00005,C-1624,1.0,C-1631,1.0,C-1641,1.0,,,,,,,,,,


The bill of materials dataset lists the components required to build each tube assembly. An assembly can use up to 8 different components, while many need only one or two, and some rows list none at all (component_id_1 is populated for 19,149 of the 21,198 rows). Unused component slots are filled with 'NaN'.
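
To make the slot structure concrete, here is a small sketch on a toy frame (the IDs and quantities are made up, and only two of the eight slots are shown) that counts how many component slots each assembly actually uses:

```python
import numpy as np
import pandas as pd

# Toy stand-in for bill_of_materials; values are invented for illustration
toy_bom = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-2', 'TA-3'],
    'component_id_1': ['C-1', 'C-2', np.nan],
    'quantity_1': [2.0, 1.0, np.nan],
    'component_id_2': ['C-3', np.nan, np.nan],
    'quantity_2': [1.0, np.nan, np.nan],
})

id_cols = [c for c in toy_bom.columns if c.startswith('component_id_')]
# Number of populated component slots per assembly
toy_bom['n_components'] = toy_bom[id_cols].notnull().sum(axis=1)
```

The same two lines should work on the real table, since its slot columns follow the `component_id_<n>` naming pattern.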

In [6]:
#Let's replace the NaN value with 0 for quantity columns and 'None' for 'component_id' columns
bill_of_materials_imputed = bill_of_materials.copy()

# The table has 8 component slots: component_id_1..8 and quantity_1..8
for i in range(1, 9):
    bill_of_materials_imputed[f'component_id_{i}'] = bill_of_materials_imputed[f'component_id_{i}'].fillna('None')
    bill_of_materials_imputed[f'quantity_{i}'] = bill_of_materials_imputed[f'quantity_{i}'].fillna(0)

bill_of_materials_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21198 entries, 0 to 21197
Data columns (total 17 columns):
tube_assembly_id    21198 non-null object
component_id_1      21198 non-null object
quantity_1          21198 non-null float64
component_id_2      21198 non-null object
quantity_2          21198 non-null float64
component_id_3      21198 non-null object
quantity_3          21198 non-null float64
component_id_4      21198 non-null object
quantity_4          21198 non-null float64
component_id_5      21198 non-null object
quantity_5          21198 non-null float64
component_id_6      21198 non-null object
quantity_6          21198 non-null float64
component_id_7      21198 non-null object
quantity_7          21198 non-null float64
component_id_8      21198 non-null object
quantity_8          21198 non-null float64
dtypes: float64(8), object(9)
memory usage: 2.7+ MB


### Comp Adaptor

In [7]:
comp_adaptor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 20 columns):
component_id            25 non-null object
component_type_id       25 non-null object
adaptor_angle           1 non-null float64
overall_length          24 non-null float64
end_form_id_1           25 non-null object
connection_type_id_1    24 non-null object
length_1                1 non-null float64
thread_size_1           17 non-null float64
thread_pitch_1          17 non-null float64
nominal_size_1          8 non-null float64
end_form_id_2           25 non-null object
connection_type_id_2    24 non-null object
length_2                1 non-null float64
thread_size_2           23 non-null float64
thread_pitch_2          23 non-null float64
nominal_size_2          2 non-null float64
hex_size                17 non-null float64
unique_feature          25 non-null object
orientation             25 non-null object
weight                  23 non-null float64
dtypes: float64(12), object(8)

In [8]:
comp_adaptor

Unnamed: 0,component_id,component_type_id,adaptor_angle,overall_length,end_form_id_1,connection_type_id_1,length_1,thread_size_1,thread_pitch_1,nominal_size_1,end_form_id_2,connection_type_id_2,length_2,thread_size_2,thread_pitch_2,nominal_size_2,hex_size,unique_feature,orientation,weight
0,C-0005,CP-028,,58.4,A-001,B-001,,1.312,12.0,,A-001,B-004,,1.0,11.5,,34.93,No,No,0.206
1,C-0006,CP-028,,34.8,A-001,B-001,,0.437,20.0,,A-001,B-005,,0.75,16.0,,22.2,No,No,0.083
2,C-1435,CP-028,,20.3,A-007,B-004,,,,15.88,A-001,B-007,,0.875,18.0,,22.22,No,No,0.023
3,C-1546,CP-028,,26.4,A-007,B-004,,0.125,27.0,,A-001,B-004,,0.125,27.0,,15.88,No,No,0.026
4,C-1583,CP-028,,44.5,A-001,B-005,,1.312,12.0,,A-007,B-005,,1.062,12.0,,38.1,No,No,0.256
5,C-1634,CP-028,,34.5,A-001,B-005,,0.75,16.0,,A-001,B-002,,0.687,16.0,,22.23,No,No,0.06
6,C-1975,CP-028,,13.2,A-007,B-007,,,,3.18,A-001,B-007,,0.312,28.0,,,No,No,0.005
7,C-0428,CP-028,,26.99,A-001,B-004,,0.25,18.0,,A-007,,,,,9.52,17.46,No,No,0.032
8,C-0443,CP-028,,22.35,A-007,B-007,,,,19.05,9999,9999,,1.062,16.0,,26.97,No,No,
9,C-0823,CP-028,,16.8,A-007,B-007,,,,9.52,A-001,9999,,0.625,18.0,9.52,15.75,No,No,0.014


component_type_id, end_form_id_1, connection_type_id_1, end_form_id_2, and connection_type_id_2 are all descriptor codes. I don't expect them to carry useful information for this table, so we will **drop** them.

adaptor_angle, length_1, length_2, and nominal_size_2 contain only 1 or 2 non-null entries. **Drop**

C-0443 and C-1695 have some pretty strange data (NaN weight?).

C-1868 has NaN for overall_length, but length_1 and length_2 have numerical values. Let's add the two together and use the sum as overall_length.
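
Rather than typing the two lengths in by hand, the sum could be taken from the columns themselves. This is a sketch on a toy frame (the component IDs are placeholders; 65.5 and 28 are the length_1 and length_2 values used for C-1868 in the next cell), and it would have to run before length_1 and length_2 are dropped:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'component_id': ['X-1', 'X-2'],
    'overall_length': [58.4, np.nan],
    'length_1': [np.nan, 65.5],
    'length_2': [np.nan, 28.0],
})

# Where overall_length is missing, fall back to length_1 + length_2;
# rows that already have an overall_length are left untouched
toy['overall_length'] = toy['overall_length'].fillna(toy['length_1'] + toy['length_2'])
```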

In [9]:
comp_adaptor_imputed = comp_adaptor.copy()

#Dropping Columns with little info or useless info
comp_adaptor_imputed = comp_adaptor_imputed.drop(['component_type_id', 'end_form_id_1', 'connection_type_id_1', 'end_form_id_2',
                          'connection_type_id_2', 'adaptor_angle', 'nominal_size_2', 'length_1', 'length_2'], axis=1)

#length_1 + length_2 = overall_length
comp_adaptor_imputed.loc[comp_adaptor_imputed['component_id'] == 'C-1868', 'overall_length'] = 65.5 + 28
comp_adaptor_imputed.head()

Unnamed: 0,component_id,overall_length,thread_size_1,thread_pitch_1,nominal_size_1,thread_size_2,thread_pitch_2,hex_size,unique_feature,orientation,weight
0,C-0005,58.4,1.312,12.0,,1.0,11.5,34.93,No,No,0.206
1,C-0006,34.8,0.437,20.0,,0.75,16.0,22.2,No,No,0.083
2,C-1435,20.3,,,15.88,0.875,18.0,22.22,No,No,0.023
3,C-1546,26.4,0.125,27.0,,0.125,27.0,15.88,No,No,0.026
4,C-1583,44.5,1.312,12.0,,1.062,12.0,38.1,No,No,0.256


At this point, I'm not sure how we should fill in the missing values. I will try to simply impute the numerical features with their mean and then add additional columns to indicate if the data has been imputed or not.
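
Since the same flag-then-impute pattern comes up for several tables, it could be wrapped in a helper. The sketch below is one way to do that, not the code this notebook runs: `flag_and_mean_impute` is a hypothetical name, it only flags the listed numeric columns, and it fills with the pandas column mean instead of `SimpleImputer(strategy='mean')`, which should give the same values.

```python
import numpy as np
import pandas as pd


def flag_and_mean_impute(df, num_features):
    """Return a copy of df with <col>_was_missing flags and mean-filled numeric columns."""
    out = df.copy()
    for col in num_features:
        if out[col].isnull().any():
            # 1 where the original value was missing, 0 otherwise
            out[col + '_was_missing'] = out[col].isnull().astype('int64')
            out[col] = out[col].fillna(out[col].mean())
    return out


# Toy data; the values are invented for illustration
toy = pd.DataFrame({'component_id': ['X-1', 'X-2', 'X-3'],
                    'weight': [0.2, np.nan, 0.4],
                    'hex_size': [10.0, 20.0, 30.0]})
toy_imputed = flag_and_mean_impute(toy, ['weight', 'hex_size'])
```

scikit-learn's `SimpleImputer` also accepts `add_indicator=True`, which appends missing-value indicator columns to the transformed array, so that is another option if staying inside scikit-learn is preferred.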

In [10]:
from sklearn.impute import SimpleImputer

# make new columns indicating what will be imputed (1 = missing, 0 = present)
cols_with_missing = [col for col in comp_adaptor_imputed.columns
                     if comp_adaptor_imputed[col].isnull().any()]
for col in cols_with_missing:
    comp_adaptor_imputed[col + '_was_missing'] = comp_adaptor_imputed[col].isnull().astype('int64')
    
# Imputation
num_features = ['thread_size_1','thread_pitch_1','nominal_size_1','thread_size_2',
                'thread_pitch_2','hex_size','weight']

mean_imputer = SimpleImputer(strategy = 'mean')

comp_adaptor_imputed[num_features] = mean_imputer.fit_transform(comp_adaptor_imputed[num_features])

### Comp Boss

In [11]:
comp_boss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 15 columns):
component_id          147 non-null object
component_type_id     147 non-null object
type                  124 non-null object
connection_type_id    147 non-null object
outside_shape         124 non-null object
base_type             124 non-null object
height_over_tube      147 non-null float64
bolt_pattern_long     23 non-null float64
bolt_pattern_wide     17 non-null float64
groove                147 non-null object
base_diameter         57 non-null float64
shoulder_diameter     30 non-null float64
unique_feature        147 non-null object
orientation           147 non-null object
weight                145 non-null float64
dtypes: float64(6), object(9)
memory usage: 17.3+ KB


In [12]:
comp_boss.head()

Unnamed: 0,component_id,component_type_id,type,connection_type_id,outside_shape,base_type,height_over_tube,bolt_pattern_long,bolt_pattern_wide,groove,base_diameter,shoulder_diameter,unique_feature,orientation,weight
0,C-0008,CP-018,Boss,B-005,Round,Flat Bottom,17.0,,,No,22.0,,Yes,Yes,0.032
1,C-0009,CP-018,Boss,B-004,Round,Flat Bottom,13.0,,,No,25.0,,No,Yes,0.033
2,C-0020,CP-018,Boss,B-005,Round,Saddle,28.4,,,No,35.0,,Yes,Yes,0.07
3,C-0054,CP-018,Boss,B-005,Round,Saddle,27.1,,,No,,,Yes,Yes,0.18
4,C-0071,CP-018,Boss,B-005,Round,Shoulder,20.0,,,No,30.0,23.0,Yes,Yes,0.08


This dataset is really similar to comp_adaptor, so we will use similar techniques:
- Drop component_type_id and connection_type_id.
- bolt_pattern_long and bolt_pattern_wide contain lots of missing data, probably because those components do not have a bolt pattern. Let's just replace NaN with 0.
- Same thing with base_diameter and shoulder_diameter (maybe the part is square and therefore has no diameter?): replace NaN with 0.
- Impute weight with the mean, and add a new column flagging which values were imputed.
- outside_shape and base_type probably have missing values because the shape is 'complex'. Replace NaN with 'complex'.
- type has missing values. Replace NaN with 'Special'.
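
One pandas detail matters for the first bullet: `DataFrame.drop` returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned back. A tiny sketch on a toy frame (placeholder values):

```python
import pandas as pd

toy = pd.DataFrame({'component_id': ['X-1'],
                    'component_type_id': ['T-1'],
                    'weight': [0.1]})

# The result is discarded here, so toy still has all three columns
toy.drop(['component_type_id'], axis=1)
still_there = 'component_type_id' in toy.columns

# Reassigning keeps the result of the drop
toy = toy.drop(['component_type_id'], axis=1)
```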

In [13]:
comp_boss_imputed = comp_boss.copy()

#Drop Columns (drop() returns a new DataFrame, so the result must be assigned back)
comp_boss_imputed = comp_boss_imputed.drop(['component_type_id','connection_type_id'],axis=1)

#Fill in missing values with 0
comp_boss_imputed['bolt_pattern_long'] = comp_boss_imputed['bolt_pattern_long'].fillna(0)
comp_boss_imputed['bolt_pattern_wide'] = comp_boss_imputed['bolt_pattern_wide'].fillna(0)
comp_boss_imputed['base_diameter'] = comp_boss_imputed['base_diameter'].fillna(0)
comp_boss_imputed['shoulder_diameter'] = comp_boss_imputed['shoulder_diameter'].fillna(0)

#Replace NaN with 'complex' (shape columns) and 'Special' (type)
comp_boss_imputed['outside_shape'] = comp_boss_imputed['outside_shape'].fillna('complex')
comp_boss_imputed['base_type'] = comp_boss_imputed['base_type'].fillna('complex')
comp_boss_imputed['type'] = comp_boss_imputed['type'].fillna('Special')

# make new columns indicating what will be imputed (1 = missing, 0 = present)
cols_with_missing = [col for col in comp_boss_imputed.columns
                     if comp_boss_imputed[col].isnull().any()]
for col in cols_with_missing:
    comp_boss_imputed[col + '_was_missing'] = comp_boss_imputed[col].isnull().astype('int64')

#Imputation ('type' was already filled with 'Special' above, so only weight is left)
mean_imputer = SimpleImputer(strategy = 'mean')

comp_boss_imputed['weight'] = mean_imputer.fit_transform(comp_boss_imputed['weight'].values.reshape(-1,1)).ravel()

In [14]:
comp_boss_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 14 columns):
component_id          147 non-null object
type                  147 non-null object
outside_shape         147 non-null object
base_type             147 non-null object
height_over_tube      147 non-null float64
bolt_pattern_long     147 non-null float64
bolt_pattern_wide     147 non-null float64
groove                147 non-null object
base_diameter         147 non-null float64
shoulder_diameter     147 non-null float64
unique_feature        147 non-null object
orientation           147 non-null object
weight                147 non-null float64
weight_was_missing    147 non-null int64
dtypes: float64(6), int64(1), object(7)
memory usage: 16.2+ KB


### Comp Elbow

In [15]:
comp_elbow.info()
comp_elbow.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 16 columns):
component_id          178 non-null object
component_type_id     178 non-null object
bolt_pattern_long     171 non-null float64
bolt_pattern_wide     138 non-null float64
extension_length      170 non-null float64
overall_length        175 non-null float64
thickness             171 non-null float64
drop_length           171 non-null float64
elbow_angle           130 non-null float64
mj_class_code         41 non-null object
mj_plug_class_code    40 non-null object
plug_diameter         7 non-null float64
groove                178 non-null object
unique_feature        178 non-null object
orientation           178 non-null object
weight                176 non-null float64
dtypes: float64(9), object(7)
memory usage: 22.3+ KB


Unnamed: 0,component_id,component_type_id,bolt_pattern_long,bolt_pattern_wide,extension_length,overall_length,thickness,drop_length,elbow_angle,mj_class_code,mj_plug_class_code,plug_diameter,groove,unique_feature,orientation,weight
0,C-0013,CP-008,152.4,92.08,105.0,185.0,113.0,75.0,90.0,,,,Yes,No,Yes,8.89
1,C-0016,CP-009,57.2,27.8,42.0,69.0,44.0,24.0,90.0,,,,No,No,Yes,1.172
2,C-0017,CP-009,57.2,27.8,42.0,69.0,47.0,26.0,90.0,,,,Yes,No,Yes,1.245
3,C-0018,CP-009,66.6,31.8,50.0,80.0,57.0,31.5,90.0,,,,Yes,No,Yes,1.863
4,C-0021,CP-010,75.0,,31.5,70.0,25.0,12.5,90.0,,,,No,Yes,Yes,0.903


- Drop component_type_id, mj_class_code, mj_plug_class_code, plug_diameter
- Impute the rest the same way as the previous two datasets

In [16]:
comp_elbow_imputed = comp_elbow.copy()

#Drop Columns
comp_elbow_imputed = comp_elbow_imputed.drop(['component_type_id','mj_class_code','mj_plug_class_code','plug_diameter'],
                                            axis=1)

#Features that are NaN but should really be 0 (at least I think so?)
comp_elbow_imputed['bolt_pattern_long'] = comp_elbow_imputed['bolt_pattern_long'].fillna(0)
comp_elbow_imputed['bolt_pattern_wide'] = comp_elbow_imputed['bolt_pattern_wide'].fillna(0)
comp_elbow_imputed['extension_length'] = comp_elbow_imputed['extension_length'].fillna(0)
comp_elbow_imputed['elbow_angle'] = comp_elbow_imputed['elbow_angle'].fillna(0)

# make new columns indicating what will be imputed (1 = missing, 0 = present)
cols_with_missing = [col for col in comp_elbow_imputed.columns
                     if comp_elbow_imputed[col].isnull().any()]
for col in cols_with_missing:
    comp_elbow_imputed[col + '_was_missing'] = comp_elbow_imputed[col].isnull().astype('int64')

#Imputation (overall_length is flagged above, so it is imputed here as well)
features_to_impute = ['weight','thickness','drop_length','overall_length']
mean_imputer = SimpleImputer(strategy='mean')
comp_elbow_imputed[features_to_impute] = mean_imputer.fit_transform(comp_elbow_imputed[features_to_impute])

In [17]:
comp_elbow_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 16 columns):
component_id                  178 non-null object
bolt_pattern_long             178 non-null float64
bolt_pattern_wide             178 non-null float64
extension_length              178 non-null float64
overall_length                178 non-null float64
thickness                     178 non-null float64
drop_length                   178 non-null float64
elbow_angle                   178 non-null float64
groove                        178 non-null object
unique_feature                178 non-null object
orientation                   178 non-null object
weight                        178 non-null float64
overall_length_was_missing    178 non-null int64
thickness_was_missing         178 non-null int64
drop_length_was_missing       178 non-null int64
weight_was_missing            178 non-null int64
dtypes: float64(8), int64(4), object(4)
memory usage: 22.3+ KB


### Comp Float

In [18]:
comp_float.info()
comp_float.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 7 columns):
component_id         16 non-null object
component_type_id    16 non-null object
bolt_pattern_long    16 non-null float64
bolt_pattern_wide    16 non-null float64
thickness            16 non-null float64
orientation          16 non-null object
weight               16 non-null float64
dtypes: float64(4), object(3)
memory usage: 976.0+ bytes


Unnamed: 0,component_id,component_type_id,bolt_pattern_long,bolt_pattern_wide,thickness,orientation,weight
0,C-0027,CP-021,148.0,96.0,18.0,Yes,2.23
1,C-0454,CP-022,58.72,30.18,28.0,No,0.59
2,C-0455,CP-022,58.72,30.18,28.0,No,0.525
3,C-0494,CP-022,52.4,26.2,15.85,No,0.23
4,C-0496,CP-022,58.8,30.2,14.2,No,0.284


This is a small dataframe with no missing values, so we will just drop component_type_id.

In [19]:
comp_float_imputed = comp_float.copy()

#Drop Columns
comp_float_imputed = comp_float_imputed.drop('component_type_id',axis=1)

In [20]:
comp_float_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 6 columns):
component_id         16 non-null object
bolt_pattern_long    16 non-null float64
bolt_pattern_wide    16 non-null float64
thickness            16 non-null float64
orientation          16 non-null object
weight               16 non-null float64
dtypes: float64(4), object(2)
memory usage: 848.0+ bytes


### Comp HFL

In [21]:
comp_hfl.info()
comp_hfl.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 9 columns):
component_id           6 non-null object
component_type_id      6 non-null object
hose_diameter          6 non-null float64
corresponding_shell    6 non-null object
coupling_class         6 non-null object
material               6 non-null object
plating                6 non-null object
orientation            6 non-null object
weight                 6 non-null float64
dtypes: float64(2), object(7)
memory usage: 512.0+ bytes


Unnamed: 0,component_id,component_type_id,hose_diameter,corresponding_shell,coupling_class,material,plating,orientation,weight
0,C-0872,CP-023,4.8,C-0855,SP-0098,SP-0016,Yes,No,0.01
1,C-0873,CP-023,4.8,C-0856,SP-0098,SP-0016,Yes,No,0.01
2,C-0874,CP-023,4.8,C-0857,SP-0098,SP-0038,Yes,No,0.001
3,C-1039,CP-023,15.9,C-1040,SP-0097,SP-0095,No,No,0.052
4,C-1041,CP-023,15.9,C-1042,SP-0099,SP-0095,No,No,0.065


In [22]:
comp_hfl_imputed = comp_hfl.copy()

#Drop Columns
comp_hfl_imputed = comp_hfl_imputed.drop('component_type_id',axis=1)
comp_hfl_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
component_id           6 non-null object
hose_diameter          6 non-null float64
corresponding_shell    6 non-null object
coupling_class         6 non-null object
material               6 non-null object
plating                6 non-null object
orientation            6 non-null object
weight                 6 non-null float64
dtypes: float64(2), object(6)
memory usage: 464.0+ bytes


### Comp Nut

In [23]:
comp_nut.info()
comp_nut.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 11 columns):
component_id         65 non-null object
component_type_id    65 non-null object
hex_nut_size         42 non-null float64
seat_angle           15 non-null float64
length               65 non-null float64
thread_size          65 non-null object
thread_pitch         65 non-null float64
diameter             23 non-null float64
blind_hole           23 non-null object
orientation          65 non-null object
weight               64 non-null float64
dtypes: float64(6), object(5)
memory usage: 5.7+ KB


Unnamed: 0,component_id,component_type_id,hex_nut_size,seat_angle,length,thread_size,thread_pitch,diameter,blind_hole,orientation,weight
0,C-1621,CP-025,20.64,,17.0,0.687,16.0,,,No,0.015
1,C-1624,CP-025,34.92,,26.5,1.187,12.0,,,No,0.035
2,C-1623,CP-025,28.58,,23.5,1.0,14.0,,,No,0.044
3,C-1622,CP-025,23.81,,20.0,0.812,16.0,,,No,0.036
4,C-1625,CP-025,41.28,,27.5,1.437,12.0,,,No,0.129


Drop seat_angle, diameter, and blind_hole because most of their values are missing.
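
The "mostly missing" judgement can also be made explicit. The sketch below uses a toy frame with made-up values and a hypothetical cut-off of 0.5; neither the threshold nor the variable names come from the original analysis:

```python
import numpy as np
import pandas as pd

# Toy stand-in for comp_nut; values are invented for illustration
toy = pd.DataFrame({
    'length': [17.0, 26.5, 23.5, 20.0],
    'seat_angle': [np.nan, np.nan, np.nan, 37.0],
    'weight': [0.015, np.nan, 0.044, 0.036],
})

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

threshold = 0.5  # hypothetical cut-off, chosen only for this example
sparse_cols = missing_fraction[missing_fraction > threshold].index.tolist()
toy_reduced = toy.drop(sparse_cols, axis=1)
```

Columns below the cut-off, such as weight here, stay in the frame and can still be imputed afterwards.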

In [24]:
comp_nut_imputed = comp_nut.copy()

#Drop Columns
comp_nut_imputed = comp_nut_imputed.drop(['seat_angle','diameter','blind_hole'],axis=1)

# make new columns indicating what will be imputed (1 = missing, 0 = present)
cols_with_missing = [col for col in comp_nut_imputed.columns
                     if comp_nut_imputed[col].isnull().any()]
for col in cols_with_missing:
    comp_nut_imputed[col + '_was_missing'] = comp_nut_imputed[col].isnull().astype('int64')

#Imputation
features_to_impute = ['hex_nut_size','weight']
mean_imputer = SimpleImputer(strategy='mean')
comp_nut_imputed[features_to_impute] = mean_imputer.fit_transform(comp_nut_imputed[features_to_impute])

In [25]:
comp_nut_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 10 columns):
component_id                65 non-null object
component_type_id           65 non-null object
hex_nut_size                65 non-null float64
length                      65 non-null float64
thread_size                 65 non-null object
thread_pitch                65 non-null float64
orientation                 65 non-null object
weight                      65 non-null float64
hex_nut_size_was_missing    65 non-null int64
weight_was_missing          65 non-null int64
dtypes: float64(4), int64(2), object(4)
memory usage: 5.2+ KB


### Comp Other

In [26]:
comp_other.info()
comp_other.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 3 columns):
component_id    1001 non-null object
part_name       1001 non-null object
weight          945 non-null float64
dtypes: float64(1), object(2)
memory usage: 23.5+ KB


Unnamed: 0,component_id,part_name,weight
0,C-1385,NUT-FLARED,0.014
1,C-1386,SLEEVE-FLARED,0.005
2,C-1369,COLLAR,0.003
3,C-0422,WASHER-FUEL LIN,0.003
4,C-1817,FITTING-NUT,0.014


In [35]:
comp_other_imputed = comp_other.copy()

#Drop Columns
comp_other_imputed = comp_other_imputed.drop('part_name',axis=1)

# make new columns indicating what will be imputed (1 = missing, 0 = present)
cols_with_missing = [col for col in comp_other_imputed.columns
                     if comp_other_imputed[col].isnull().any()]

for col in cols_with_missing:
    comp_other_imputed[col + '_was_missing'] = comp_other_imputed[col].isnull().astype('int64')

#Imputation
features_to_impute = 'weight'
mean_imputer = SimpleImputer(strategy='mean')
comp_other_imputed[features_to_impute] = mean_imputer.fit_transform(comp_other_imputed[features_to_impute].values.reshape(-1,1))

In [36]:
comp_other_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 3 columns):
component_id          1001 non-null object
weight                1001 non-null float64
weight_was_missing    1001 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 23.5+ KB


### Comp Sleeve

In [37]:
comp_sleeve.info()
comp_sleeve.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 10 columns):
component_id           50 non-null object
component_type_id      50 non-null object
connection_type_id     50 non-null object
length                 50 non-null float64
intended_nut_thread    50 non-null float64
intended_nut_pitch     50 non-null int64
unique_feature         50 non-null object
plating                50 non-null object
orientation            50 non-null object
weight                 50 non-null float64
dtypes: float64(3), int64(1), object(6)
memory usage: 4.0+ KB


Unnamed: 0,component_id,component_type_id,connection_type_id,length,intended_nut_thread,intended_nut_pitch,unique_feature,plating,orientation,weight
0,C-0001,CP-024,B-001,17.3,1.062,12,No,No,No,0.013
1,C-0002,CP-024,B-001,11.2,0.5,20,No,No,No,0.005
2,C-0003,CP-024,B-001,19.3,1.187,12,No,No,No,0.014
3,C-0048,CP-024,B-002,9.5,0.562,18,No,No,No,0.006
4,C-0049,CP-024,B-002,9.5,0.812,16,No,No,No,0.012


In [39]:
comp_sleeve_imputed = comp_sleeve.copy()
comp_sleeve_imputed = comp_sleeve_imputed.drop(['component_type_id','connection_type_id'],axis=1)
comp_sleeve_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
component_id           50 non-null object
length                 50 non-null float64
intended_nut_thread    50 non-null float64
intended_nut_pitch     50 non-null int64
unique_feature         50 non-null object
plating                50 non-null object
orientation            50 non-null object
weight                 50 non-null float64
dtypes: float64(3), int64(1), object(4)
memory usage: 3.2+ KB


### Comp Straight

In [40]:
comp_straight.info()
comp_straight.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 361 entries, 0 to 360
Data columns (total 12 columns):
component_id         361 non-null object
component_type_id    361 non-null object
bolt_pattern_long    291 non-null float64
bolt_pattern_wide    204 non-null float64
head_diameter        70 non-null float64
overall_length       41 non-null float64
thickness            361 non-null float64
mj_class_code        120 non-null object
groove               361 non-null object
unique_feature       361 non-null object
orientation          361 non-null object
weight               354 non-null float64
dtypes: float64(6), object(6)
memory usage: 33.9+ KB


Unnamed: 0,component_id,component_type_id,bolt_pattern_long,bolt_pattern_wide,head_diameter,overall_length,thickness,mj_class_code,groove,unique_feature,orientation,weight
0,C-0012,CP-001,66.68,31.75,,40.0,20.0,,No,No,Yes,0.788
1,C-0014,CP-001,47.6,22.2,,38.0,15.0,,Yes,No,Yes,0.339
2,C-0015,CP-001,66.7,31.8,,40.0,20.0,,Yes,No,Yes,0.788
3,C-0019,CP-002,77.8,42.9,,,36.5,MJ-003,No,No,Yes,1.533
4,C-0029,CP-001,47.63,22.23,,,16.0,,Yes,No,Yes,0.286


In [54]:
comp_straight_imputed = comp_straight.copy()

#Drop Columns
comp_straight_imputed = comp_straight_imputed.drop(['head_diameter','overall_length','mj_class_code'],axis=1)

#Features that are NaN but should really be 0 (at least I think so?)
comp_straight_imputed['bolt_pattern_long'] = comp_straight_imputed['bolt_pattern_long'].fillna(0)
comp_straight_imputed['bolt_pattern_wide'] = comp_straight_imputed['bolt_pattern_wide'].fillna(0)

# make new columns indicating what will be imputed (1 = missing, 0 = present)
cols_with_missing = [col for col in comp_straight_imputed.columns
                     if comp_straight_imputed[col].isnull().any()]

for col in cols_with_missing:
    comp_straight_imputed[col + '_was_missing'] = comp_straight_imputed[col].isnull().astype('int64')

#Imputation
features_to_impute = 'weight'
mean_imputer = SimpleImputer(strategy='mean')
comp_straight_imputed[features_to_impute] = mean_imputer.fit_transform(comp_straight_imputed[features_to_impute].values.reshape(-1,1))

In [55]:
comp_straight_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 361 entries, 0 to 360
Data columns (total 10 columns):
component_id          361 non-null object
component_type_id     361 non-null object
bolt_pattern_long     361 non-null float64
bolt_pattern_wide     361 non-null float64
thickness             361 non-null float64
groove                361 non-null object
unique_feature        361 non-null object
orientation           361 non-null object
weight                361 non-null float64
weight_was_missing    361 non-null int64
dtypes: float64(4), int64(1), object(5)
memory usage: 28.3+ KB
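
The flag-then-impute pattern used for `comp_straight` can be seen on a toy frame. This is only a sketch: the frame `toy` and its values are made up, and it swaps `SimpleImputer` for a plain pandas `fillna` with the column mean, which should give the same result for a single numeric column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for comp_straight; ids and weights are invented for illustration.
toy = pd.DataFrame({
    "component_id": ["C-1", "C-2", "C-3", "C-4"],
    "weight": [1.0, np.nan, 3.0, np.nan],
})

# Record which rows were missing BEFORE the values are overwritten,
# otherwise the information is lost after imputation.
toy["weight_was_missing"] = toy["weight"].isnull().astype("int64")

# Mean imputation: replace each NaN with the mean of the observed values.
toy["weight"] = toy["weight"].fillna(toy["weight"].mean())

print(toy)
```

The indicator column lets a model tell apart a measured weight from an imputed one, which is why it is created before the fill.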


### Comp Tee

In [56]:
comp_tee.info()
comp_tee.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 14 columns):
component_id          4 non-null object
component_type_id     4 non-null object
bolt_pattern_long     4 non-null float64
bolt_pattern_wide     4 non-null float64
extension_length      4 non-null float64
overall_length        4 non-null float64
thickness             4 non-null int64
drop_length           4 non-null float64
mj_class_code         4 non-null object
mj_plug_class_code    4 non-null object
groove                4 non-null object
unique_feature        4 non-null object
orientation           4 non-null object
weight                4 non-null float64
dtypes: float64(6), int64(1), object(7)
memory usage: 528.0+ bytes


Unnamed: 0,component_id,component_type_id,bolt_pattern_long,bolt_pattern_wide,extension_length,overall_length,thickness,drop_length,mj_class_code,mj_plug_class_code,groove,unique_feature,orientation,weight
0,C-0271,OTHER,58.7,30.2,57.1,93.0,57,28.5,MJ-003,Threaded,No,No,Yes,1.526
1,C-1809,OTHER,58.72,30.18,57.09,108.0,57,28.5,MJ-003,MJ-005,No,No,Yes,2.184
2,C-1830,OTHER,52.4,26.2,43.5,78.5,51,25.5,MJ-003,Threaded,No,Yes,Yes,1.135
3,C-1865,OTHER,58.7,30.2,57.1,107.0,57,28.5,MJ-003,MJ-005,No,No,Yes,1.953


In [58]:
comp_tee_imputed = comp_tee.copy()

#Drop Columns
comp_tee_imputed = comp_tee_imputed.drop(['component_type_id','mj_class_code','mj_plug_class_code'],axis=1)
comp_tee_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
component_id         4 non-null object
bolt_pattern_long    4 non-null float64
bolt_pattern_wide    4 non-null float64
extension_length     4 non-null float64
overall_length       4 non-null float64
thickness            4 non-null int64
drop_length          4 non-null float64
groove               4 non-null object
unique_feature       4 non-null object
orientation          4 non-null object
weight               4 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 432.0+ bytes


### Components

In [60]:
components.info()
components.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2048 entries, 0 to 2047
Data columns (total 3 columns):
component_id         2048 non-null object
name                 2047 non-null object
component_type_id    2048 non-null object
dtypes: object(3)
memory usage: 48.1+ KB


Unnamed: 0,component_id,name,component_type_id
0,9999,OTHER,OTHER
1,C-0001,SLEEVE,CP-024
2,C-0002,SLEEVE,CP-024
3,C-0003,SLEEVE-FLARED,CP-024
4,C-0004,NUT,CP-026


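This table only maps each `component_id` to a `name` and a `component_type_id`, and the component tables shown above already carry `component_type_id`, so we will not be using it.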
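
One way to test whether `components` adds anything is to join it to a component detail table on `component_id` and compare the `component_type_id` columns. The sketch below uses toy frames: the names `components_toy` and `comp_detail_toy`, the `name` values, and the `_lookup` suffix are hypothetical, and only the ids and type ids echo rows printed earlier.

```python
import pandas as pd

# Toy stand-in for components.csv; the "name" values are placeholders.
components_toy = pd.DataFrame({
    "component_id": ["C-0012", "C-0014", "C-0019"],
    "name": ["NAME-A", "NAME-A", "NAME-B"],
    "component_type_id": ["CP-001", "CP-001", "CP-002"],
})

# Toy stand-in for one of the detailed component tables.
comp_detail_toy = pd.DataFrame({
    "component_id": ["C-0012", "C-0014", "C-0019"],
    "component_type_id": ["CP-001", "CP-001", "CP-002"],
    "weight": [0.788, 0.339, 1.533],
})

# Left join keeps every detail row; the lookup copy of the overlapping
# column gets the "_lookup" suffix so the two can be compared.
merged = comp_detail_toy.merge(
    components_toy, on="component_id", how="left", suffixes=("", "_lookup")
)

# True when the lookup table repeats what the detail table already says.
type_ids_agree = (
    merged["component_type_id"] == merged["component_type_id_lookup"]
).all()
print(type_ids_agree)
```

If the type ids agree on the real data, the only new column is `name`, which supports leaving the table out.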

### Specs

In [62]:
specs.info()
specs.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21198 entries, 0 to 21197
Data columns (total 11 columns):
tube_assembly_id    21198 non-null object
spec1               7129 non-null object
spec2               6844 non-null object
spec3               5840 non-null object
spec4               4154 non-null object
spec5               2921 non-null object
spec6               2071 non-null object
spec7               535 non-null object
spec8               106 non-null object
spec9               20 non-null object
spec10              1 non-null object
dtypes: object(11)
memory usage: 1.8+ MB


Unnamed: 0,tube_assembly_id,spec1,spec2,spec3,spec4,spec5,spec6,spec7,spec8,spec9,spec10
0,TA-00001,,,,,,,,,,
1,TA-00002,,,,,,,,,,
2,TA-00003,,,,,,,,,,
3,TA-00004,,,,,,,,,,
4,TA-00005,,,,,,,,,,


This dataset is too sparse to be useful: only 7129 of the 21198 assemblies have even a spec1 value, and spec10 is filled for a single row. We will not use it.
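
The sparsity can be quantified per column and per assembly. A minimal sketch on a toy frame: `specs_toy`, its ids and its spec codes are made up, and `fill_rate` and `num_specs` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for specs.csv; ids and spec codes are invented.
specs_toy = pd.DataFrame({
    "tube_assembly_id": ["TA-1", "TA-2", "TA-3", "TA-4"],
    "spec1": [np.nan, "SP-A", "SP-B", np.nan],
    "spec2": [np.nan, "SP-C", np.nan, np.nan],
})

spec_cols = [c for c in specs_toy.columns if c.startswith("spec")]

# Fraction of rows with a value, per spec column.
fill_rate = specs_toy[spec_cols].notnull().mean()

# Number of specs listed for each assembly.
num_specs = specs_toy[spec_cols].notnull().sum(axis=1)

print(fill_rate)
print(num_specs.tolist())
```

On the real `specs` frame the same two lines would show how quickly the fill rate falls off from spec1 to spec10.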

### Tube

In [63]:
tube.info()
tube.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21198 entries, 0 to 21197
Data columns (total 16 columns):
tube_assembly_id    21198 non-null object
material_id         20919 non-null object
diameter            21198 non-null float64
wall                21198 non-null float64
length              21198 non-null float64
num_bends           21198 non-null int64
bend_radius         21198 non-null float64
end_a_1x            21198 non-null object
end_a_2x            21198 non-null object
end_x_1x            21198 non-null object
end_x_2x            21198 non-null object
end_a               21198 non-null object
end_x               21198 non-null object
num_boss            21198 non-null int64
num_bracket         21198 non-null int64
other               21198 non-null int64
dtypes: float64(4), int64(4), object(8)
memory usage: 2.6+ MB


Unnamed: 0,tube_assembly_id,material_id,diameter,wall,length,num_bends,bend_radius,end_a_1x,end_a_2x,end_x_1x,end_x_2x,end_a,end_x,num_boss,num_bracket,other
0,TA-00001,SP-0035,12.7,1.65,164.0,5,38.1,N,N,N,N,EF-003,EF-003,0,0,0
1,TA-00002,SP-0019,6.35,0.71,137.0,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0
2,TA-00003,SP-0019,6.35,0.71,127.0,7,19.05,N,N,N,N,EF-008,EF-008,0,0,0
3,TA-00004,SP-0019,6.35,0.71,137.0,9,19.05,N,N,N,N,EF-008,EF-008,0,0,0
4,TA-00005,SP-0029,19.05,1.24,109.0,4,50.8,N,N,N,N,EF-003,EF-003,0,0,0


In [66]:
tube_imputed = tube.copy()

#Drop Columns
tube_imputed = tube_imputed.drop(['material_id','end_a_1x','end_a_2x','end_x_1x','end_x_2x','end_a','end_x','other'],
                                axis=1)
tube_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21198 entries, 0 to 21197
Data columns (total 8 columns):
tube_assembly_id    21198 non-null object
diameter            21198 non-null float64
wall                21198 non-null float64
length              21198 non-null float64
num_bends           21198 non-null int64
bend_radius         21198 non-null float64
num_boss            21198 non-null int64
num_bracket         21198 non-null int64
dtypes: float64(4), int64(3), object(1)
memory usage: 1.3+ MB


### Tube End Form

In [67]:
tube_end_form.info()
tube_end_form.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 2 columns):
end_form_id    27 non-null object
forming        27 non-null object
dtypes: object(2)
memory usage: 512.0+ bytes


Unnamed: 0,end_form_id,forming
0,EF-001,Yes
1,EF-002,No
2,EF-003,No
3,EF-004,No
4,EF-005,Yes


### Type Component

In [68]:
type_component.info()
type_component.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 2 columns):
component_type_id    29 non-null object
name                 29 non-null object
dtypes: object(2)
memory usage: 544.0+ bytes


Unnamed: 0,component_type_id,name
0,CP-001,4-bolt Tig Straight
1,CP-002,4-bolt MJ Straight
2,CP-003,4-bolt Braze/Weld Straight
3,CP-004,2-bolt Braze/Weld Straight
4,CP-005,2-bolt MJ Straight


### Type Connection

In [69]:
type_connection.info()
type_connection.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 2 columns):
connection_type_id    14 non-null object
name                  14 non-null object
dtypes: object(2)
memory usage: 304.0+ bytes


Unnamed: 0,connection_type_id,name
0,B-001,37 deg Flare-SAE J514
1,B-002,ORFS-SAE J1453
2,B-003,Hi-Duty
3,B-004,NPTF-SAE J476/J514
4,B-005,SAE STOR-SAE J1926


### Type End Form

In [70]:
type_end_form.info()
type_end_form.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
end_form_id    8 non-null object
name           8 non-null object
dtypes: object(2)
memory usage: 208.0+ bytes


Unnamed: 0,end_form_id,name
0,A-001,Male (Stud)
1,A-002,Male (Swivel)
2,A-003,Braze-Weld Boss
3,A-004,Braze-Weld Socket
4,A-005,Swivel Nut
