# FAMD Data Split

In this notebook, FAMD (Factor Analysis of Mixed Data) is applied to the prepared dataset, and the dataset is split into training, validation and test data.

However, the Python libraries used here provide no method for performing FAMD.

For this reason, FAMD is carried out by applying classical PCA after a manual preprocessing step.

See https://towardsdatascience.com/famd-how-to-generalize-pca-to-categorical-and-numerical-data-2ddbeb2b9210 for more detail.
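
As a rough orientation, the manual preparation described above can be sketched as follows. This is a minimal illustration on a made-up toy frame (the column names `size` and `color` are hypothetical), not the pipeline used in this notebook: numeric columns are z-scored, indicator columns are divided by the square root of their modality frequency and then centered, and an ordinary PCA (here via NumPy's SVD) is run on the result.

```python
import numpy as np
import pandas as pd

# Toy mixed-type data; the column names are hypothetical.
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "color": ["red", "red", "blue", "blue", "blue", "green"],
})

# Numeric part: z-score each column.
numeric = toy[["size"]]
numeric_z = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Categorical part: one-hot encode, weight by 1/sqrt(p), then center.
dummies = pd.get_dummies(toy[["color"]], dtype=float)
proportions = dummies.mean()               # share of rows per modality
weighted = dummies / np.sqrt(proportions)
weighted_centered = weighted - weighted.mean()

# Classical PCA on the combined matrix via singular value decomposition.
prepared = pd.concat([numeric_z, weighted_centered], axis=1).to_numpy()
_, singular_values, _ = np.linalg.svd(prepared, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
```

`explained_ratio` plays the same role as `explained_variance_ratio_` on a fitted scikit-learn `PCA` object.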

Because several models will be trained in several ways, the train-test split is performed before training, so that all models are trained on the same datasets.

In [1]:
#Libraries are being imported
import math
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import __version__ as sklearnVersion

In [2]:
#Library versions are being printed
print('numpy Version: ' + np.__version__)
print('pandas Version: ' + pd.__version__)
print('sklearn Version: ' + sklearnVersion)

numpy Version: 1.23.5
pandas Version: 1.5.2
sklearn Version: 1.2.0


In [3]:
#A pandas dataframe named as dataFrame is being created by reading the pkl file
dataFrame = pd.read_pickle("../Data/DataAnalysis/FabricWaste.pkl")
dataFrame.head()

Unnamed: 0,ProductTypeCategory,ProductType,Maturity,Gender,FabricType,ColorType,CustomerDefinedCategory,IsManualProcess,Red,Green,...,DefectRate,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect
0,Top,T-Shirt,Adult,Female,Single Jersey,SC,Greenish,False,42,47,...,0.064198,4,0,1,1,0,0,6,6,25
1,Top,T-Shirt,Adult,Female,Single Jersey,SC,Orangeish,False,255,229,...,0.060386,0,2,1,0,0,0,3,3,23
2,Top,T-Shirt,Adult,Female,Single Jersey,SC,Bluish,False,173,216,...,0.077121,1,5,2,1,1,1,11,11,30
3,Top,T-Shirt,Adult,Female,Single Jersey,SC,Bluish,False,0,95,...,0.062802,0,4,2,2,0,0,8,8,25
4,Top,T-Shirt,Adult,Female,Single Jersey,SC,Pinkish,False,72,50,...,0.045894,0,3,0,0,0,0,3,3,18


In [4]:
#Information of dataFrame is being printed
dataFrame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 0 to 199
Data columns (total 42 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ProductTypeCategory      200 non-null    object 
 1   ProductType              200 non-null    object 
 2   Maturity                 200 non-null    object 
 3   Gender                   200 non-null    object 
 4   FabricType               200 non-null    object 
 5   ColorType                200 non-null    object 
 6   CustomerDefinedCategory  200 non-null    object 
 7   IsManualProcess          200 non-null    bool   
 8   Red                      200 non-null    int64  
 9   Green                    200 non-null    int64  
 10  Blue                     200 non-null    int64  
 11  Pus                      200 non-null    int64  
 12  Fine                     200 non-null    int64  
 13  G/M2                     200 non-null    int64  
 14  Cotton                   2

In [5]:
#Statistical information of dataFrame is being printed
dataFrame.describe()

Unnamed: 0,Red,Green,Blue,Pus,Fine,G/M2,Cotton,Nylon,Fiber,Linen,...,DefectRate,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,...,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,106.71,100.38,113.625,32.33,25.21,171.5,0.91595,0.0015,0.0128,0.045,...,0.04434,4.605,6.44,0.485,2.15,0.235,0.39,14.045,14.305,21.88
std,97.684272,92.426451,86.773397,0.744224,3.933607,45.514921,0.228904,0.008551,0.018706,0.207824,...,0.023203,17.502016,8.555253,1.782036,4.618893,1.271953,1.359094,27.749109,27.981005,33.984856
min,-1.0,-1.0,-1.0,32.0,14.0,125.0,0.0,0.0,0.0,0.0,...,0.005587,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,3.0,32.0,22.0,125.0,0.96,0.0,0.0,0.0,...,0.028531,0.0,2.0,0.0,0.0,0.0,0.0,4.0,4.0,9.0
50%,94.0,90.0,112.0,32.0,28.0,175.0,1.0,0.0,0.0,0.0,...,0.038528,0.0,4.0,0.0,1.0,0.0,0.0,7.0,7.0,14.0
75%,218.0,186.25,200.25,32.0,28.0,190.0,1.0,0.0,0.04,0.0,...,0.058106,2.0,8.0,0.0,2.0,0.0,0.0,12.0,12.25,20.0
max,255.0,255.0,255.0,34.0,28.0,320.0,1.0,0.05,0.04,1.0,...,0.118812,195.0,86.0,19.0,43.0,15.0,10.0,302.0,303.0,353.0


In [6]:
#OrderQuantity is being backed up
dataFrame['BackUp'] = dataFrame['OrderQuantity']

In [7]:
#IsManualProcess boolean feature is being converted to object data type so it can be used as a categorical feature
dataFrame['IsManualProcess'] = dataFrame['IsManualProcess'].astype(np.object_)
dataFrame

Unnamed: 0,ProductTypeCategory,ProductType,Maturity,Gender,FabricType,ColorType,CustomerDefinedCategory,IsManualProcess,Red,Green,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,Top,T-Shirt,Adult,Female,Single Jersey,SC,Greenish,False,42,47,...,4,0,1,1,0,0,6,6,25,375
1,Top,T-Shirt,Adult,Female,Single Jersey,SC,Orangeish,False,255,229,...,0,2,1,0,0,0,3,3,23,375
2,Top,T-Shirt,Adult,Female,Single Jersey,SC,Bluish,False,173,216,...,1,5,2,1,1,1,11,11,30,375
3,Top,T-Shirt,Adult,Female,Single Jersey,SC,Bluish,False,0,95,...,0,4,2,2,0,0,8,8,25,375
4,Top,T-Shirt,Adult,Female,Single Jersey,SC,Pinkish,False,72,50,...,0,3,0,0,0,0,3,3,18,375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,Top,T-Shirt,Adult,Male,Single Jersey,S,Stripe,False,-1,-1,...,80,29,8,7,4,0,127,128,199,6000
196,Top,T-Shirt,Adult,Female,Ribana,SC,White,False,255,255,...,11,8,1,1,0,0,20,21,20,610
197,Top,Sweatshirt,Adult,Male,Diagonal,SC,Black,False,5,2,...,36,24,0,4,0,4,67,68,87,3780
198,Top,Sweatshirt,Adult,Male,Diagonal,SC,Black,False,5,2,...,2,6,0,0,0,0,8,8,20,1545


In [8]:
#one hot encoded DataFrame of categorical features is being created with get_dummies method
dummyFrame = pd.get_dummies(dataFrame.loc[:, : 'IsManualProcess'])
dummyFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,CustomerDefinedCategory_MixedColor,CustomerDefinedCategory_Orangeish,CustomerDefinedCategory_Pinkish,CustomerDefinedCategory_Purplish,CustomerDefinedCategory_Reddish,CustomerDefinedCategory_Stripe,CustomerDefinedCategory_White,CustomerDefinedCategory_Yellowish,IsManualProcess_False,IsManualProcess_True
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
196,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
197,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
198,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
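
The reason for casting `IsManualProcess` to `object` before calling `get_dummies` can be seen on a small example. The frame below is hypothetical; the sketch assumes the default behaviour of `pd.get_dummies`, which encodes object, string and category columns and leaves other dtypes as they are. Passing `dtype=int` is an optional addition that requests 0/1 integers, since the default indicator dtype depends on the pandas version.

```python
import pandas as pd

# Hypothetical toy frame with a single boolean flag.
toy = pd.DataFrame({"IsManual": [True, False, True]})

# A bool column is not treated as categorical, so it passes through unchanged.
untouched = pd.get_dummies(toy)

# Casting to object first yields one indicator column per value;
# dtype=int asks for 0/1 integers regardless of the version's default dtype.
encoded = pd.get_dummies(toy.astype(object), dtype=int)

print(list(untouched.columns))
print(list(encoded.columns))
```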


In [9]:
#Since categorical features are kept in dummyFrame as encoded, they are being removed from dataFrame
dataFrame.drop(dataFrame.loc[:, : 'IsManualProcess'].columns, axis = 1, inplace = True)
dataFrame

Unnamed: 0,Red,Green,Blue,Pus,Fine,G/M2,Cotton,Nylon,Fiber,Linen,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,42,47,35,32,28,125,0.96,0.0,0.04,0.0,...,4,0,1,1,0,0,6,6,25,375
1,255,229,180,32,28,125,0.96,0.0,0.04,0.0,...,0,2,1,0,0,0,3,3,23,375
2,173,216,230,32,28,125,0.96,0.0,0.04,0.0,...,1,5,2,1,1,1,11,11,30,375
3,0,95,106,32,28,125,0.96,0.0,0.04,0.0,...,0,4,2,2,0,0,8,8,25,375
4,72,50,72,32,28,125,0.96,0.0,0.04,0.0,...,0,3,0,0,0,0,3,3,18,375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,-1,-1,-1,32,28,175,1.00,0.0,0.00,0.0,...,80,29,8,7,4,0,127,128,199,6000
196,255,255,255,34,18,210,1.00,0.0,0.00,0.0,...,11,8,1,1,0,0,20,21,20,610
197,5,2,3,32,20,320,1.00,0.0,0.00,0.0,...,36,24,0,4,0,4,67,68,87,3780
198,5,2,3,32,20,320,1.00,0.0,0.00,0.0,...,2,6,0,0,0,0,8,8,20,1545


In [10]:
#dummyFrame is being prepended to dataFrame so that the encoded columns come first
dataFrame = pd.concat([dummyFrame, dataFrame], axis = 1)
dataFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,0,0,1,0,0,0,0,0,0,0,...,4,0,1,1,0,0,6,6,25,375
1,0,0,1,0,0,0,0,0,0,0,...,0,2,1,0,0,0,3,3,23,375
2,0,0,1,0,0,0,0,0,0,0,...,1,5,2,1,1,1,11,11,30,375
3,0,0,1,0,0,0,0,0,0,0,...,0,4,2,2,0,0,8,8,25,375
4,0,0,1,0,0,0,0,0,0,0,...,0,3,0,0,0,0,3,3,18,375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0,0,1,0,0,0,0,0,0,0,...,80,29,8,7,4,0,127,128,199,6000
196,0,0,1,0,0,0,0,0,0,0,...,11,8,1,1,0,0,20,21,20,610
197,0,0,1,0,0,0,0,0,0,1,...,36,24,0,4,0,4,67,68,87,3780
198,0,0,1,0,0,0,0,0,0,1,...,2,6,0,0,0,0,8,8,20,1545


In [11]:
#75% of the data will be used for training and 25% of the data will be used for testing
#The index to split the dataFrame is being calculated
splitIndex = int(dataFrame.shape[0] * 0.75)
splitIndex

150

In [12]:
#dataFrame is being shuffled; one pass is sufficient
#No random_state is set, so the row order differs on every run
dataFrame = dataFrame.sample(frac = 1)
dataFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
39,0,0,1,0,0,0,0,0,0,0,...,6,1,0,0,0,0,7,7,15,380
161,1,0,0,0,1,0,0,0,0,0,...,0,19,1,5,0,1,26,26,33,295
62,0,0,1,0,0,0,0,0,0,0,...,0,5,0,0,0,0,5,5,21,310
24,0,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,6,370
179,0,0,1,0,0,0,0,0,0,1,...,17,12,1,0,0,0,30,30,44,1645
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,0,1,0,0,0,0,0,0,1,0,...,0,2,0,0,0,0,2,2,13,400
92,0,1,0,0,0,1,0,0,0,0,...,0,3,0,0,0,0,3,3,11,400
106,0,1,0,0,0,1,0,0,0,0,...,0,2,0,0,0,0,2,2,11,210
154,0,0,1,0,0,0,0,0,0,0,...,3,1,0,0,0,0,4,4,11,200


In [13]:
#trainingFrame containing 75% of data is being created
trainingFrame = dataFrame.iloc[:splitIndex].reset_index(drop = True)
trainingFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,0,0,1,0,0,0,0,0,0,0,...,6,1,0,0,0,0,7,7,15,380
1,1,0,0,0,1,0,0,0,0,0,...,0,19,1,5,0,1,26,26,33,295
2,0,0,1,0,0,0,0,0,0,0,...,0,5,0,0,0,0,5,5,21,310
3,0,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,6,370
4,0,0,1,0,0,0,0,0,0,1,...,17,12,1,0,0,0,30,30,44,1645
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,0,0,1,0,0,0,0,0,0,0,...,15,1,0,1,0,0,16,17,19,300
146,0,0,1,0,0,0,0,0,0,0,...,4,0,0,0,0,0,4,4,7,250
147,0,0,1,0,0,0,0,0,0,0,...,0,4,2,2,0,0,8,8,25,375
148,0,1,0,0,0,1,0,0,0,0,...,0,13,3,5,0,3,23,24,30,550


In [14]:
#testFrame containing 25% of data is being created
testFrame = dataFrame.iloc[splitIndex:].reset_index(drop = True)
testFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,0,0,1,0,0,0,0,0,0,0,...,0,2,0,1,1,4,8,8,12,370
1,0,0,1,0,0,0,0,0,0,0,...,6,2,0,0,0,0,8,8,8,160
2,0,0,1,0,0,0,0,0,0,0,...,0,7,0,0,0,0,7,7,12,205
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,3,4,4,6,185
4,0,1,0,0,0,1,0,0,0,0,...,0,2,0,0,0,0,2,2,11,185
5,0,0,1,0,0,0,0,0,0,1,...,18,11,0,0,0,0,28,29,54,1600
6,0,0,1,0,0,0,0,0,0,0,...,0,18,1,5,0,1,25,25,32,300
7,0,0,1,0,0,0,0,0,0,0,...,2,19,1,2,0,0,24,24,32,420
8,1,0,0,0,0,0,1,0,0,0,...,5,1,0,3,0,0,9,9,17,200
9,0,0,1,0,0,0,0,0,0,0,...,0,3,0,0,0,5,8,8,10,205
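
Because every model is meant to see the same training and test rows, it can help to make the shuffle repeatable. The following is a minimal sketch on a hypothetical toy frame; the seed value 42 is arbitrary and is not taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the prepared data.
toy = pd.DataFrame({"feature": range(8), "target": range(100, 108)})

SEED = 42  # arbitrary value, chosen only for illustration

split_index = int(toy.shape[0] * 0.75)

# With a fixed random_state the same permutation is produced on every run.
shuffled = toy.sample(frac=1, random_state=SEED)

toy_training = shuffled.iloc[:split_index].reset_index(drop=True)
toy_test = shuffled.iloc[split_index:].reset_index(drop=True)
```

Saving the two frames to disk after the split would achieve the same goal without fixing a seed.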


In [15]:
#dummySum is being calculated from trainingFrame only, to prevent information leakage

#Since dataFrame has been split into training and test frames, some sums may be 0
#A sum of 0 means that the modality never occurs in trainingFrame and occurs only in testFrame
#It is unlikely that a modality is missing from trainingFrame entirely, but it can happen
#The probability of such a modality would be 0,
#and dividing by the square root of that probability would cause a division-by-zero error

#For these reasons, sums with a value of 0 are replaced with 1
#The max(dummySum[key], 1) call does exactly this
dummySum = trainingFrame.loc[:, : 'IsManualProcess_True'].sum()
for key in dummySum.keys():
    dummySum[key] = max(dummySum[key], 1)
dummySum

ProductTypeCategory_Full               23
ProductTypeCategory_Leg                31
ProductTypeCategory_Top                96
ProductType_Coat                        1
ProductType_Dress                      14
ProductType_Pant                       30
ProductType_Pyjamas                     9
ProductType_Shirt                       1
ProductType_Skirt                       1
ProductType_Sweatshirt                  9
ProductType_T-Shirt                    85
Maturity_Adult                        109
Maturity_Baby                           7
Maturity_Child                         34
Gender_Female                         104
Gender_Male                            16
Gender_Unisex                          30
FabricType_Diagonal                     6
FabricType_Interlock                    6
FabricType_Ribana                      13
FabricType_Single Jersey              125
ColorType_AOP                          26
ColorType_M                             7
ColorType_S                       
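
The zero-count guard above can also be written without an explicit loop. This is a small sketch on hypothetical indicator columns, using `Series.clip(lower=1)` as an equivalent of applying `max(value, 1)` to every entry.

```python
import pandas as pd

# Hypothetical indicator columns; "Type_C" never occurs in this toy training split.
toy_training = pd.DataFrame({
    "Type_A": [1, 0, 1, 1],
    "Type_B": [0, 1, 0, 0],
    "Type_C": [0, 0, 0, 0],
})

# Column sums, with zero counts raised to 1 so that later divisions stay finite.
toy_sum = toy_training.sum().clip(lower=1)
toy_probabilities = toy_sum / len(toy_training)
```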

In [16]:
#probability = the number of rows in which the feature is observed / number of rows
#observed value = 1
#not observed value = 0
#dummySum = the number of rows in which the feature is observed for each feature

#dummyProbabilities are being calculated
dummyProbabilities = dummySum / trainingFrame.shape[0]
dummyProbabilities

ProductTypeCategory_Full              0.153333
ProductTypeCategory_Leg               0.206667
ProductTypeCategory_Top               0.640000
ProductType_Coat                      0.006667
ProductType_Dress                     0.093333
ProductType_Pant                      0.200000
ProductType_Pyjamas                   0.060000
ProductType_Shirt                     0.006667
ProductType_Skirt                     0.006667
ProductType_Sweatshirt                0.060000
ProductType_T-Shirt                   0.566667
Maturity_Adult                        0.726667
Maturity_Baby                         0.046667
Maturity_Child                        0.226667
Gender_Female                         0.693333
Gender_Male                           0.106667
Gender_Unisex                         0.200000
FabricType_Diagonal                   0.040000
FabricType_Interlock                  0.040000
FabricType_Ribana                     0.086667
FabricType_Single Jersey              0.833333
ColorType_AOP

In [17]:
#Encoded Value = value / Sqrt(dummyProbability)

#Categorical features are being encoded for trainingFrame and testFrame
for key in dummyProbabilities.keys():
    trainingFrame[key] = trainingFrame[key] / math.sqrt(dummyProbabilities[key])
    testFrame[key] = testFrame[key] / math.sqrt(dummyProbabilities[key])
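
The effect of this weighting can be checked on a small example. Dividing a 0/1 indicator with frequency p by sqrt(p) gives it a population variance of 1 - p, so rare modalities receive large values (a modality seen once in 150 rows becomes sqrt(150), about 12.25) while the indicators of one variable with K levels contribute K - 1 in total. The sketch below uses hypothetical column names and a vectorized `DataFrame.div` in place of the per-column loop.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator columns of a single three-level variable.
toy = pd.DataFrame({
    "Level_A": [1, 1, 1, 0, 0, 0, 0, 0],
    "Level_B": [0, 0, 0, 1, 1, 1, 1, 0],
    "Level_C": [0, 0, 0, 0, 0, 0, 0, 1],
}, dtype=float)

probabilities = toy.mean()                           # modality frequencies p
weighted = toy.div(np.sqrt(probabilities), axis=1)   # column-wise value / sqrt(p)

# Population variance of each weighted column equals 1 - p.
variances = weighted.var(ddof=0)
```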

In [18]:
trainingFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,0.00000,0.000000,1.25,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,6,1,0,0,0,0,7,7,15,380
1,2.55377,0.000000,0.00,0.0,3.273268,0.000000,0.0,0.0,0.0,0.000000,...,0,19,1,5,0,1,26,26,33,295
2,0.00000,0.000000,1.25,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0,5,0,0,0,0,5,5,21,310
3,0.00000,0.000000,1.25,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0,1,0,0,0,0,1,1,6,370
4,0.00000,0.000000,1.25,0.0,0.000000,0.000000,0.0,0.0,0.0,4.082483,...,17,12,1,0,0,0,30,30,44,1645
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,0.00000,0.000000,1.25,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,15,1,0,1,0,0,16,17,19,300
146,0.00000,0.000000,1.25,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,4,0,0,0,0,0,4,4,7,250
147,0.00000,0.000000,1.25,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0,4,2,2,0,0,8,8,25,375
148,0.00000,2.199707,0.00,0.0,0.000000,2.236068,0.0,0.0,0.0,0.000000,...,0,13,3,5,0,3,23,24,30,550


In [19]:
#Information of trainingFrame is being printed
trainingFrame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 76 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   ProductTypeCategory_Full            150 non-null    float64
 1   ProductTypeCategory_Leg             150 non-null    float64
 2   ProductTypeCategory_Top             150 non-null    float64
 3   ProductType_Coat                    150 non-null    float64
 4   ProductType_Dress                   150 non-null    float64
 5   ProductType_Pant                    150 non-null    float64
 6   ProductType_Pyjamas                 150 non-null    float64
 7   ProductType_Shirt                   150 non-null    float64
 8   ProductType_Skirt                   150 non-null    float64
 9   ProductType_Sweatshirt              150 non-null    float64
 10  ProductType_T-Shirt                 150 non-null    float64
 11  Maturity_Adult                      150 non-n

In [20]:
#Statistical information of trainingFrame is being printed
trainingFrame.describe()

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
count,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,...,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0
mean,0.391578,0.454606,0.8,0.08165,0.305505,0.447214,0.244949,0.08165,0.08165,0.244949,...,4.526667,6.286667,0.52,2.433333,0.246667,0.42,14.126667,14.433333,22.086667,520.5
std,0.923227,0.893677,0.60201,1.0,0.95538,0.897424,0.972784,1.0,1.0,0.972784,...,17.622323,8.882784,1.90626,5.194753,1.375415,1.457542,28.797207,29.039297,34.075112,729.892138
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,150.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,0.0,4.0,4.0,9.0,220.0
50%,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,1.0,0.0,0.0,7.0,8.0,14.0,300.0
75%,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,8.0,0.0,3.0,0.0,0.0,12.0,12.75,20.0,415.0
max,2.55377,2.199707,1.25,12.247449,3.273268,2.236068,4.082483,12.247449,12.247449,4.082483,...,195.0,86.0,19.0,43.0,15.0,10.0,302.0,303.0,353.0,4500.0


In [21]:
testFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,2,0,1,1,4,8,8,12,370
1,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6,2,0,0,0,0,8,8,8,160
2,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,7,0,0,0,0,7,7,12,205
3,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,3,4,4,6,185
4,0.0,2.199707,0.0,0.0,0.0,2.236068,0.0,0.0,0.0,0.0,...,0,2,0,0,0,0,2,2,11,185
5,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,4.082483,...,18,11,0,0,0,0,28,29,54,1600
6,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,18,1,5,0,1,25,25,32,300
7,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2,19,1,2,0,0,24,24,32,420
8,2.55377,0.0,0.0,0.0,0.0,0.0,4.082483,0.0,0.0,0.0,...,5,1,0,3,0,0,9,9,17,200
9,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,3,0,0,0,5,8,8,10,205


In [22]:
#Information of testFrame is being printed
testFrame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 76 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   ProductTypeCategory_Full            50 non-null     float64
 1   ProductTypeCategory_Leg             50 non-null     float64
 2   ProductTypeCategory_Top             50 non-null     float64
 3   ProductType_Coat                    50 non-null     float64
 4   ProductType_Dress                   50 non-null     float64
 5   ProductType_Pant                    50 non-null     float64
 6   ProductType_Pyjamas                 50 non-null     float64
 7   ProductType_Shirt                   50 non-null     float64
 8   ProductType_Skirt                   50 non-null     float64
 9   ProductType_Sweatshirt              50 non-null     float64
 10  ProductType_T-Shirt                 50 non-null     float64
 11  Maturity_Adult                      50 non-null

In [23]:
#Statistical information of testFrame is being printed
testFrame.describe()

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,...,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.204302,0.351953,0.95,0.0,0.065465,0.268328,0.244949,0.0,0.489898,0.244949,...,4.84,6.9,0.38,1.3,0.2,0.3,13.8,13.92,21.26,552.7
std,0.699854,0.814613,0.539274,0.0,0.46291,0.734013,0.979379,0.0,2.424366,0.979379,...,17.310325,7.551862,1.353604,1.897904,0.903508,1.015191,24.605458,24.803588,34.04907,1037.01751
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,150.0
25%,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,0.0,4.0,4.0,8.0,200.0
50%,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,1.0,0.0,0.0,7.0,7.0,11.0,250.0
75%,0.0,0.0,1.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,9.75,0.0,2.0,0.0,0.0,12.0,12.0,18.0,400.0
max,2.55377,2.199707,1.25,0.0,3.273268,2.236068,4.082483,0.0,12.247449,4.082483,...,93.0,32.0,8.0,9.0,5.0,5.0,127.0,128.0,199.0,6000.0


In [24]:
#A StandardScaler object is being created
scaler = StandardScaler()
scaler

In [25]:
#The StandardScaler object is being fitted on the input features of trainingFrame only, to prevent information leakage,
#and these features are replaced in trainingFrame with their standardized values
trainingFrame.loc[:, : 'OrderQuantity'] = scaler.fit_transform(trainingFrame.loc[:, : 'OrderQuantity'])
trainingFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,-0.425561,-0.510396,0.750000,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,6,1,0,0,0,0,7,7,15,380
1,2.349838,-0.510396,-1.333333,-0.081923,3.116775,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,19,1,5,0,1,26,26,33,295
2,-0.425561,-0.510396,0.750000,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,5,0,0,0,0,5,5,21,310
3,-0.425561,-0.510396,0.750000,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,1,0,0,0,0,1,1,6,370
4,-0.425561,-0.510396,0.750000,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,3.958114,...,17,12,1,0,0,0,30,30,44,1645
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,-0.425561,-0.510396,0.750000,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,15,1,0,1,0,0,16,17,19,300
146,-0.425561,-0.510396,0.750000,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,4,0,0,0,0,0,4,4,7,250
147,-0.425561,-0.510396,0.750000,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,4,2,2,0,0,8,8,25,375
148,-0.425561,1.959263,-1.333333,-0.081923,-0.320844,2.0,-0.252646,-0.081923,-0.081923,-0.252646,...,0,13,3,5,0,3,23,24,30,550


In [26]:
#input features of testFrame are updated with their standardized form via trained StandardScaler object
testFrame.loc[:, : 'OrderQuantity'] = scaler.transform(testFrame.loc[:, : 'OrderQuantity'])
testFrame

Unnamed: 0,ProductTypeCategory_Full,ProductTypeCategory_Leg,ProductTypeCategory_Top,ProductType_Coat,ProductType_Dress,ProductType_Pant,ProductType_Pyjamas,ProductType_Shirt,ProductType_Skirt,ProductType_Sweatshirt,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,-0.425561,-0.510396,0.75,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,2,0,1,1,4,8,8,12,370
1,-0.425561,-0.510396,0.75,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,6,2,0,0,0,0,8,8,8,160
2,-0.425561,-0.510396,0.75,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,7,0,0,0,0,7,7,12,205
3,-0.425561,-0.510396,0.75,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,0,0,1,0,3,4,4,6,185
4,-0.425561,1.959263,-1.333333,-0.081923,-0.320844,2.0,-0.252646,-0.081923,-0.081923,-0.252646,...,0,2,0,0,0,0,2,2,11,185
5,-0.425561,-0.510396,0.75,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,3.958114,...,18,11,0,0,0,0,28,29,54,1600
6,-0.425561,-0.510396,0.75,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,18,1,5,0,1,25,25,32,300
7,-0.425561,-0.510396,0.75,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,2,19,1,2,0,0,24,24,32,420
8,2.349838,-0.510396,-1.333333,-0.081923,-0.320844,-0.5,3.958114,-0.081923,-0.081923,-0.252646,...,5,1,0,3,0,0,9,9,17,200
9,-0.425561,-0.510396,0.75,-0.081923,-0.320844,-0.5,-0.252646,-0.081923,-0.081923,-0.252646,...,0,3,0,0,0,5,8,8,10,205
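
One detail of the two scaling cells is worth checking. `StandardScaler` rescales every column it receives to unit variance, and the slice `: 'OrderQuantity'` appears to include the weighted indicator columns: the displayed values 0.75 and -1.333333 for `ProductTypeCategory_Top` match a z-scored 0/1 column with p = 0.64. Z-scoring cancels any constant positive factor, so the 1/sqrt(p) weighting no longer influences those columns afterwards, whereas FAMD as usually described only centers the weighted indicators. The sketch below illustrates the effect on a made-up indicator; it is an illustration, not a change to the notebook's pipeline.

```python
import numpy as np

def zscore(column):
    """Standardize a 1-D array to zero mean and unit population variance."""
    return (column - column.mean()) / column.std()

# Hypothetical 0/1 indicator observed in 3 of 10 rows.
indicator = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
p = indicator.mean()

weighted = indicator / np.sqrt(p)            # FAMD-style weighting
centered_only = weighted - weighted.mean()   # keeps the variance at 1 - p

# z-scoring removes the constant factor, so both results coincide.
same_after_zscore = bool(np.allclose(zscore(indicator), zscore(weighted)))
```

If the weighting is meant to be preserved, one option would be to standardize only the numeric columns and merely center the indicator columns.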


In [27]:
#Number of principal components to be kept
COMPONENT_SIZE = 23

In [28]:
#A PCA object is being defined and fitted with the input features of trainingFrame
pca = PCA(n_components = COMPONENT_SIZE)
pca.fit(trainingFrame.loc[:, : 'OrderQuantity'])

In [29]:
#Explained Variance Ratios of fitted PCA object are being printed
pca.explained_variance_ratio_

array([0.12882027, 0.11364208, 0.09490344, 0.07602556, 0.05985275,
       0.05004766, 0.04182381, 0.03827518, 0.03061191, 0.02980675,
       0.02678669, 0.0237685 , 0.02253765, 0.02093966, 0.02030132,
       0.01913443, 0.01838316, 0.01786295, 0.01701664, 0.01631406,
       0.01562519, 0.01450892, 0.01419543])

In [30]:
#Total Explained Variance Ratio of fitted PCA object is being printed
#This value indicates how much of the variance of the original inputs is retained by the PCA components
#About 91% is retained, which is considered sufficient here
sum(pca.explained_variance_ratio_)

0.9111840385060492
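
The notebook does not state how the value 23 was chosen. One common approach, shown here as an assumption rather than as the method actually used, is to take the smallest number of components whose cumulative explained variance ratio reaches a target. With the ratios printed above and an assumed target of 0.90, this gives 23.

```python
import numpy as np

# Ratios copied from the printed explained_variance_ratio_ output.
ratios = np.array([
    0.12882027, 0.11364208, 0.09490344, 0.07602556, 0.05985275,
    0.05004766, 0.04182381, 0.03827518, 0.03061191, 0.02980675,
    0.02678669, 0.0237685, 0.02253765, 0.02093966, 0.02030132,
    0.01913443, 0.01838316, 0.01786295, 0.01701664, 0.01631406,
    0.01562519, 0.01450892, 0.01419543,
])
cumulative = np.cumsum(ratios)

TARGET = 0.90  # assumed threshold, not stated in the notebook

# Position of the first component at which the cumulative ratio reaches TARGET.
components_needed = int(np.searchsorted(cumulative, TARGET) + 1)
```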

In [31]:
#shape of PCA components is being printed
pca.components_.shape

(23, 58)

In [32]:
#PCA operation is being applied to inputs of trainingFrame and defined as transformedTrainingData
transformedTrainingData = pca.transform(trainingFrame.loc[:, : 'OrderQuantity'])
transformedTrainingData.shape

(150, 23)

In [33]:
#Column names PC0 to PC22 are being generated for the principal components
pcaColumn = ["PC{:d}".format(x) for x in range(COMPONENT_SIZE)]

In [34]:
#pcaTrainingDf is being defined based on transformedTrainingData
pcaTrainingDf = pd.DataFrame(transformedTrainingData, columns = pcaColumn)
pcaTrainingDf

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20,PC21,PC22
0,-2.159842,0.903073,-0.903965,-0.494504,-1.292695,0.619725,-0.410849,1.097258,0.081156,0.803563,...,-0.513805,0.177665,-0.327848,1.545860,0.580538,-0.473694,-0.763560,0.065923,-0.152164,0.793894
1,0.162054,-1.141753,-0.219059,1.554529,1.432512,-0.847058,-0.229250,0.049316,3.855226,1.932691,...,-1.155993,-1.062772,0.779901,-0.305557,0.214684,0.897200,0.601268,-0.135818,1.215337,0.184540
2,-0.249417,-2.523880,0.002460,-2.579382,0.628535,1.531744,0.705912,0.202348,-1.614852,-0.546322,...,-0.742192,0.303754,-0.280114,0.130146,-0.293114,0.136718,0.080427,-0.311891,-1.149663,0.184600
3,-0.993506,-0.761118,0.053950,-1.878065,-1.164023,-1.238222,-0.260116,-0.716803,-1.282542,1.161412,...,0.181571,1.611356,-2.157199,-1.697666,-1.173184,0.793594,-0.006698,-0.347658,-0.685451,0.816353
4,4.394100,4.620156,6.910180,0.250664,2.236256,-1.335311,1.639115,1.140519,-2.365281,4.222803,...,-0.276106,-2.084125,1.539726,1.619939,0.171439,-1.996700,-0.289633,0.748167,-0.240204,-1.520716
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,-0.985567,3.733304,-3.353928,0.286889,3.159942,-0.564599,-2.566718,0.468873,-0.836925,-0.633818,...,-1.189565,-0.173369,0.802174,-2.189139,0.606137,-0.284085,-1.568052,0.038381,-1.665003,-0.443869
146,-0.106127,2.282926,-2.429306,-0.651671,3.116188,-1.764173,-1.643339,-1.005481,0.105907,-1.034012,...,-0.330897,0.235224,-0.920209,0.716484,-0.504857,0.284751,-1.164840,0.170446,-0.546540,0.182301
147,-1.707846,0.246973,-0.244842,-1.982393,-1.471280,0.214617,-1.076689,-0.492724,-0.296742,0.468674,...,0.957649,1.350028,-0.413178,-0.198447,-0.973711,-0.036716,0.050350,-0.745817,0.140490,-2.029734
148,-1.807489,-1.112117,0.789260,0.397510,-1.555010,-2.839794,-1.167443,0.011566,0.636225,-1.302368,...,0.799706,-1.838345,-0.036872,-0.033983,0.552478,-0.660420,0.136245,-0.121477,0.344597,0.282596


In [35]:
#The PCA components in pcaTrainingDf are being concatenated with the output columns of trainingFrame
#The BackUp column, which holds the unscaled OrderQuantity, is included because it will be needed
pcaTrainingDf = pd.concat([pcaTrainingDf, trainingFrame.loc[:, 'PrintErrorRate' : ]], axis = 1)
pcaTrainingDf

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,-2.159842,0.903073,-0.903965,-0.494504,-1.292695,0.619725,-0.410849,1.097258,0.081156,0.803563,...,6,1,0,0,0,0,7,7,15,380
1,0.162054,-1.141753,-0.219059,1.554529,1.432512,-0.847058,-0.229250,0.049316,3.855226,1.932691,...,0,19,1,5,0,1,26,26,33,295
2,-0.249417,-2.523880,0.002460,-2.579382,0.628535,1.531744,0.705912,0.202348,-1.614852,-0.546322,...,0,5,0,0,0,0,5,5,21,310
3,-0.993506,-0.761118,0.053950,-1.878065,-1.164023,-1.238222,-0.260116,-0.716803,-1.282542,1.161412,...,0,1,0,0,0,0,1,1,6,370
4,4.394100,4.620156,6.910180,0.250664,2.236256,-1.335311,1.639115,1.140519,-2.365281,4.222803,...,17,12,1,0,0,0,30,30,44,1645
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,-0.985567,3.733304,-3.353928,0.286889,3.159942,-0.564599,-2.566718,0.468873,-0.836925,-0.633818,...,15,1,0,1,0,0,16,17,19,300
146,-0.106127,2.282926,-2.429306,-0.651671,3.116188,-1.764173,-1.643339,-1.005481,0.105907,-1.034012,...,4,0,0,0,0,0,4,4,7,250
147,-1.707846,0.246973,-0.244842,-1.982393,-1.471280,0.214617,-1.076689,-0.492724,-0.296742,0.468674,...,0,4,2,2,0,0,8,8,25,375
148,-1.807489,-1.112117,0.789260,0.397510,-1.555010,-2.839794,-1.167443,0.011566,0.636225,-1.302368,...,0,13,3,5,0,3,23,24,30,550


In [36]:
#The BackUp column of pcaTrainingDf is being renamed back to OrderQuantity
pcaTrainingDf.rename(columns={'BackUp': 'OrderQuantity'}, inplace=True)
pcaTrainingDf

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,OrderQuantity
0,-2.159842,0.903073,-0.903965,-0.494504,-1.292695,0.619725,-0.410849,1.097258,0.081156,0.803563,...,6,1,0,0,0,0,7,7,15,380
1,0.162054,-1.141753,-0.219059,1.554529,1.432512,-0.847058,-0.229250,0.049316,3.855226,1.932691,...,0,19,1,5,0,1,26,26,33,295
2,-0.249417,-2.523880,0.002460,-2.579382,0.628535,1.531744,0.705912,0.202348,-1.614852,-0.546322,...,0,5,0,0,0,0,5,5,21,310
3,-0.993506,-0.761118,0.053950,-1.878065,-1.164023,-1.238222,-0.260116,-0.716803,-1.282542,1.161412,...,0,1,0,0,0,0,1,1,6,370
4,4.394100,4.620156,6.910180,0.250664,2.236256,-1.335311,1.639115,1.140519,-2.365281,4.222803,...,17,12,1,0,0,0,30,30,44,1645
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,-0.985567,3.733304,-3.353928,0.286889,3.159942,-0.564599,-2.566718,0.468873,-0.836925,-0.633818,...,15,1,0,1,0,0,16,17,19,300
146,-0.106127,2.282926,-2.429306,-0.651671,3.116188,-1.764173,-1.643339,-1.005481,0.105907,-1.034012,...,4,0,0,0,0,0,4,4,7,250
147,-1.707846,0.246973,-0.244842,-1.982393,-1.471280,0.214617,-1.076689,-0.492724,-0.296742,0.468674,...,0,4,2,2,0,0,8,8,25,375
148,-1.807489,-1.112117,0.789260,0.397510,-1.555010,-2.839794,-1.167443,0.011566,0.636225,-1.302368,...,0,13,3,5,0,3,23,24,30,550
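
The concatenation above only lines up row by row when both frames carry the same index, because `pd.concat` with `axis = 1` aligns on index labels, not on position. The `RangeIndex` reported by `info()` suggests that is the case here. The following is a minimal sketch with made-up data, not the notebook's frames; the names `pcs` and `targets` are hypothetical. It shows what a mismatched index would do and how `reset_index(drop = True)` restores positional alignment.

```python
import pandas as pd

# Hypothetical stand-ins for the PCA scores and the output columns
pcs = pd.DataFrame({"PC0": [0.1, 0.2, 0.3]})  # RangeIndex 0..2
targets = pd.DataFrame({"BackUp": [380, 295, 310]}, index=[5, 6, 7])  # index left over from a split

# concat(axis=1) aligns on index labels, so mismatched labels yield NaN-padded rows
misaligned = pd.concat([pcs, targets], axis=1)

# Resetting the index first gives a positional, row-by-row join
aligned = pd.concat([pcs, targets.reset_index(drop=True)], axis=1)
aligned = aligned.rename(columns={"BackUp": "OrderQuantity"})
```

A quick `aligned.isna().sum()` after such a concatenation is a cheap way to confirm that no rows were padded.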


In [37]:
#Information of pcaTrainingDf is being printed
pcaTrainingDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 41 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PC0                      150 non-null    float64
 1   PC1                      150 non-null    float64
 2   PC2                      150 non-null    float64
 3   PC3                      150 non-null    float64
 4   PC4                      150 non-null    float64
 5   PC5                      150 non-null    float64
 6   PC6                      150 non-null    float64
 7   PC7                      150 non-null    float64
 8   PC8                      150 non-null    float64
 9   PC9                      150 non-null    float64
 10  PC10                     150 non-null    float64
 11  PC11                     150 non-null    float64
 12  PC12                     150 non-null    float64
 13  PC13                     150 non-null    float64
 14  PC14                     1

In [38]:
#Statistical information of pcaTrainingDf is being printed
pcaTrainingDf.describe()

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,OrderQuantity
count,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,...,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0
mean,2.3684760000000003e-17,-1.125026e-16,-1.065814e-16,-2.3684760000000003e-17,-2.3684760000000003e-17,1.539509e-16,4.7369520000000006e-17,-1.894781e-16,4.1448330000000005e-17,-1.465494e-16,...,4.526667,6.286667,0.52,2.433333,0.246667,0.42,14.126667,14.433333,22.086667,520.5
std,2.742575,2.575942,2.354006,2.106912,1.869427,1.709458,1.56271,1.494945,1.336939,1.31924,...,17.622323,8.882784,1.90626,5.194753,1.375415,1.457542,28.797207,29.039297,34.075112,729.892138
min,-2.593435,-5.356644,-5.332553,-3.363834,-3.62267,-4.180609,-3.077161,-3.586945,-3.5688,-4.299588,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,150.0
25%,-1.663962,-1.592389,-1.189392,-1.525014,-1.378872,-1.195458,-0.6770874,-0.8965826,-0.8102841,-0.8504962,...,0.0,2.0,0.0,0.0,0.0,0.0,4.0,4.0,9.0,220.0
50%,-0.4565722,0.3784231,-0.2626697,-0.55823,-0.6168971,0.1307688,-0.2511817,0.03859758,-0.1363654,0.001792644,...,0.0,4.0,0.0,1.0,0.0,0.0,7.0,8.0,14.0,300.0
75%,0.3168803,1.385057,0.7821502,1.150648,1.266148,1.173204,0.4458845,0.930625,0.5667936,0.6908789,...,2.0,8.0,0.0,3.0,0.0,0.0,12.0,12.75,20.0,415.0
max,13.04181,5.395295,10.1425,6.680256,5.805823,4.858948,5.37769,5.203551,3.855226,5.227225,...,195.0,86.0,19.0,43.0,15.0,10.0,302.0,303.0,353.0,4500.0
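
In the statistics above, the PC columns have means on the order of 1e-17 and standard deviations that shrink from PC0 onward. Both follow from how PCA works: the scores of the data the PCA was fitted on are centered, and components are ordered by explained variance. The sketch below reproduces this on synthetic data; `X_train` and its shape are assumptions made for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic, correlated features standing in for the prepared training inputs
X_train = rng.normal(size=(150, 8)) @ rng.normal(size=(8, 8))

pca = PCA()
scores = pca.fit_transform(StandardScaler().fit_transform(X_train))

means = scores.mean(axis=0)        # close to zero for every component
stds = scores.std(axis=0, ddof=1)  # ddof=1 matches pandas describe()
```

The squared values in `stds` equal `pca.explained_variance_`, which is why the `std` row of `describe()` can be read as a per-component variance summary.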


In [39]:
#The already fitted PCA is being applied to the input columns of testFrame; the result is stored as transformedTestData
transformedTestData = pca.transform(testFrame.loc[:, : 'OrderQuantity'])
transformedTestData.shape

(50, 23)
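
Calling `pca.transform` rather than `fit_transform` on the test inputs reuses the components learned earlier, so no information from the test rows leaks into the projection. A minimal sketch of that pattern on synthetic data follows; the 30 input features are an assumption, while the 23 components match the shape printed above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(size=(150, 30))  # hypothetical training inputs
X_test = rng.normal(size=(50, 30))    # hypothetical test inputs

# Scaler and PCA are both fitted on the training rows only
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=23).fit(scaler.transform(X_train))

# The test rows are only transformed with the fitted objects
test_scores = pca.transform(scaler.transform(X_test))
```

Because the centering comes from the training data, the test scores are generally not zero-mean, which is consistent with the non-zero PC means that `describe()` reports for the test frame.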

In [40]:
#pcaTestDf is being defined based on transformedTestData
pcaTestDf = pd.DataFrame(transformedTestData, columns = pcaColumn)
pcaTestDf

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20,PC21,PC22
0,-1.601754,-0.408275,-0.788667,-1.142983,-1.512157,-0.353603,-0.33859,0.146634,-0.64345,1.333976,...,-0.267007,1.306723,-1.069865,1.508882,0.110248,-0.348881,-1.194183,0.11997,-0.831484,0.63927
1,0.025378,0.418853,0.034701,4.251172,-0.396508,3.287433,1.986794,1.873284,0.665698,-0.663097,...,-0.046469,2.429168,-1.509779,1.859931,-0.190561,-0.015878,-1.71228,2.174085,-1.892978,-0.513338
2,-0.697826,1.466843,-0.72139,3.081282,-0.023898,3.202064,1.800255,3.991179,1.138871,-0.348235,...,-0.147161,-0.267603,-1.648785,-0.556079,-0.394167,0.895826,0.119457,-2.731102,0.945629,1.052242
3,0.690207,-2.80267,0.039824,-0.452984,0.733203,1.993621,1.060869,-0.54104,-1.755796,0.481612,...,-1.357123,0.251518,-0.169896,0.162516,-0.354254,0.866202,0.579212,-0.985516,-0.595206,0.51867
4,1.113541,-4.343674,1.501242,2.608504,0.282633,0.814241,0.213915,-1.158998,-1.689059,-1.494052,...,-0.885941,-0.10912,0.517992,0.033714,-0.213369,0.472124,0.788194,-0.367293,-0.181466,-0.417549
5,5.128034,1.427013,2.126658,-1.648347,-1.303841,1.406333,-0.127719,-0.215946,-0.064337,-1.196106,...,-0.871077,-0.925161,-1.082553,0.929994,-0.450954,-0.018374,-0.06784,0.332924,-0.506723,0.02628
6,-2.445671,1.134385,-1.713168,-0.535495,-0.986413,0.559286,-1.348857,0.64394,-0.729132,1.023333,...,-0.97611,-0.954147,0.76602,-2.50338,0.487296,-0.425266,-0.67899,0.034118,-2.051919,-0.095165
7,-1.285835,1.133684,-1.393384,-1.788763,0.078967,-1.62822,4.693191,-3.519059,2.202286,-0.619369,...,0.019744,-0.399645,-0.346717,0.472439,-0.092017,0.759658,-0.800371,-1.200049,-1.35768,-0.383883
8,-0.854012,-1.193327,-0.58885,2.234394,2.346093,1.707645,-0.681292,-0.612474,0.350221,0.975957,...,1.691087,-1.183548,-0.853176,-0.388255,-0.051755,-3.172014,-0.054986,-0.06315,3.210872,0.961833
9,-0.702521,1.271145,-0.267257,0.872589,-0.380151,1.107034,-0.327619,-0.768621,-0.017278,0.856384,...,0.317927,0.723933,0.331034,-0.159715,-0.620505,0.714398,0.81061,-1.5117,1.25437,-1.644212


In [41]:
#The PCA scores in pcaTestDf are being concatenated with the output columns of testFrame
#(from 'PrintErrorRate' onward), including the BackUp copy of OrderQuantity, which will be needed later
pcaTestDf = pd.concat([pcaTestDf, testFrame.loc[:, 'PrintErrorRate' : ]], axis = 1)
pcaTestDf

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,BackUp
0,-1.601754,-0.408275,-0.788667,-1.142983,-1.512157,-0.353603,-0.33859,0.146634,-0.64345,1.333976,...,0,2,0,1,1,4,8,8,12,370
1,0.025378,0.418853,0.034701,4.251172,-0.396508,3.287433,1.986794,1.873284,0.665698,-0.663097,...,6,2,0,0,0,0,8,8,8,160
2,-0.697826,1.466843,-0.72139,3.081282,-0.023898,3.202064,1.800255,3.991179,1.138871,-0.348235,...,0,7,0,0,0,0,7,7,12,205
3,0.690207,-2.80267,0.039824,-0.452984,0.733203,1.993621,1.060869,-0.54104,-1.755796,0.481612,...,0,0,0,1,0,3,4,4,6,185
4,1.113541,-4.343674,1.501242,2.608504,0.282633,0.814241,0.213915,-1.158998,-1.689059,-1.494052,...,0,2,0,0,0,0,2,2,11,185
5,5.128034,1.427013,2.126658,-1.648347,-1.303841,1.406333,-0.127719,-0.215946,-0.064337,-1.196106,...,18,11,0,0,0,0,28,29,54,1600
6,-2.445671,1.134385,-1.713168,-0.535495,-0.986413,0.559286,-1.348857,0.64394,-0.729132,1.023333,...,0,18,1,5,0,1,25,25,32,300
7,-1.285835,1.133684,-1.393384,-1.788763,0.078967,-1.62822,4.693191,-3.519059,2.202286,-0.619369,...,2,19,1,2,0,0,24,24,32,420
8,-0.854012,-1.193327,-0.58885,2.234394,2.346093,1.707645,-0.681292,-0.612474,0.350221,0.975957,...,5,1,0,3,0,0,9,9,17,200
9,-0.702521,1.271145,-0.267257,0.872589,-0.380151,1.107034,-0.327619,-0.768621,-0.017278,0.856384,...,0,3,0,0,0,5,8,8,10,205


In [42]:
#The BackUp column of pcaTestDf is being renamed back to OrderQuantity
pcaTestDf.rename(columns={'BackUp': 'OrderQuantity'}, inplace=True)
pcaTestDf

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,OrderQuantity
0,-1.601754,-0.408275,-0.788667,-1.142983,-1.512157,-0.353603,-0.33859,0.146634,-0.64345,1.333976,...,0,2,0,1,1,4,8,8,12,370
1,0.025378,0.418853,0.034701,4.251172,-0.396508,3.287433,1.986794,1.873284,0.665698,-0.663097,...,6,2,0,0,0,0,8,8,8,160
2,-0.697826,1.466843,-0.72139,3.081282,-0.023898,3.202064,1.800255,3.991179,1.138871,-0.348235,...,0,7,0,0,0,0,7,7,12,205
3,0.690207,-2.80267,0.039824,-0.452984,0.733203,1.993621,1.060869,-0.54104,-1.755796,0.481612,...,0,0,0,1,0,3,4,4,6,185
4,1.113541,-4.343674,1.501242,2.608504,0.282633,0.814241,0.213915,-1.158998,-1.689059,-1.494052,...,0,2,0,0,0,0,2,2,11,185
5,5.128034,1.427013,2.126658,-1.648347,-1.303841,1.406333,-0.127719,-0.215946,-0.064337,-1.196106,...,18,11,0,0,0,0,28,29,54,1600
6,-2.445671,1.134385,-1.713168,-0.535495,-0.986413,0.559286,-1.348857,0.64394,-0.729132,1.023333,...,0,18,1,5,0,1,25,25,32,300
7,-1.285835,1.133684,-1.393384,-1.788763,0.078967,-1.62822,4.693191,-3.519059,2.202286,-0.619369,...,2,19,1,2,0,0,24,24,32,420
8,-0.854012,-1.193327,-0.58885,2.234394,2.346093,1.707645,-0.681292,-0.612474,0.350221,0.975957,...,5,1,0,3,0,0,9,9,17,200
9,-0.702521,1.271145,-0.267257,0.872589,-0.380151,1.107034,-0.327619,-0.768621,-0.017278,0.856384,...,0,3,0,0,0,5,8,8,10,205


In [43]:
#Information of pcaTestDf is being printed
pcaTestDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 41 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PC0                      50 non-null     float64
 1   PC1                      50 non-null     float64
 2   PC2                      50 non-null     float64
 3   PC3                      50 non-null     float64
 4   PC4                      50 non-null     float64
 5   PC5                      50 non-null     float64
 6   PC6                      50 non-null     float64
 7   PC7                      50 non-null     float64
 8   PC8                      50 non-null     float64
 9   PC9                      50 non-null     float64
 10  PC10                     50 non-null     float64
 11  PC11                     50 non-null     float64
 12  PC12                     50 non-null     float64
 13  PC13                     50 non-null     float64
 14  PC14                     50 

In [44]:
#Statistical information of pcaTestDf is being printed
pcaTestDf.describe()

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,OrderQuantity
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,...,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,-0.082492,0.647803,-0.266457,0.487028,0.167399,0.408931,0.748793,-0.144923,-0.095701,-0.31659,...,4.84,6.9,0.38,1.3,0.2,0.3,13.8,13.92,21.26,552.7
std,2.397389,2.356412,2.367241,2.28044,1.790887,1.882112,2.166673,1.73592,1.41181,1.430264,...,17.310325,7.551862,1.353604,1.897904,0.903508,1.015191,24.605458,24.803588,34.04907,1037.01751
min,-2.587016,-4.813578,-5.463644,-4.955933,-2.546559,-3.651559,-3.551327,-3.519059,-2.901741,-5.528565,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,150.0
25%,-1.188657,-0.885767,-1.674115,-1.121361,-1.116876,-0.845763,-0.58779,-1.367309,-1.101758,-0.828686,...,0.0,2.0,0.0,0.0,0.0,0.0,4.0,4.0,8.0,200.0
50%,-0.410109,1.134035,-0.563466,-0.029686,-0.300849,0.254536,0.285147,-0.051231,-0.040808,-0.398636,...,0.0,4.0,0.0,1.0,0.0,0.0,7.0,7.0,11.0,250.0
75%,0.086821,1.975529,0.379221,2.410881,0.77321,1.501683,1.76606,0.909625,0.634829,0.788178,...,1.0,9.75,0.0,2.0,0.0,0.0,12.0,12.0,18.0,400.0
max,12.352104,3.940754,7.682068,4.841975,5.556841,4.879057,5.463404,3.991179,3.024043,1.997094,...,93.0,32.0,8.0,9.0,5.0,5.0,127.0,128.0,199.0,6000.0


In [45]:
#Half of the test data will be used for validation and the other half for testing
#The row index at which pcaTestDf will be split is being calculated
splitIndex = int(pcaTestDf.shape[0] * 0.5)
splitIndex

25

In [46]:
#pcaValidationDf is being created from the first 50% of the test data
pcaValidationDf = pcaTestDf.iloc[:splitIndex].reset_index(drop = True)
pcaValidationDf

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,OrderQuantity
0,-1.601754,-0.408275,-0.788667,-1.142983,-1.512157,-0.353603,-0.33859,0.146634,-0.64345,1.333976,...,0,2,0,1,1,4,8,8,12,370
1,0.025378,0.418853,0.034701,4.251172,-0.396508,3.287433,1.986794,1.873284,0.665698,-0.663097,...,6,2,0,0,0,0,8,8,8,160
2,-0.697826,1.466843,-0.72139,3.081282,-0.023898,3.202064,1.800255,3.991179,1.138871,-0.348235,...,0,7,0,0,0,0,7,7,12,205
3,0.690207,-2.80267,0.039824,-0.452984,0.733203,1.993621,1.060869,-0.54104,-1.755796,0.481612,...,0,0,0,1,0,3,4,4,6,185
4,1.113541,-4.343674,1.501242,2.608504,0.282633,0.814241,0.213915,-1.158998,-1.689059,-1.494052,...,0,2,0,0,0,0,2,2,11,185
5,5.128034,1.427013,2.126658,-1.648347,-1.303841,1.406333,-0.127719,-0.215946,-0.064337,-1.196106,...,18,11,0,0,0,0,28,29,54,1600
6,-2.445671,1.134385,-1.713168,-0.535495,-0.986413,0.559286,-1.348857,0.64394,-0.729132,1.023333,...,0,18,1,5,0,1,25,25,32,300
7,-1.285835,1.133684,-1.393384,-1.788763,0.078967,-1.62822,4.693191,-3.519059,2.202286,-0.619369,...,2,19,1,2,0,0,24,24,32,420
8,-0.854012,-1.193327,-0.58885,2.234394,2.346093,1.707645,-0.681292,-0.612474,0.350221,0.975957,...,5,1,0,3,0,0,9,9,17,200
9,-0.702521,1.271145,-0.267257,0.872589,-0.380151,1.107034,-0.327619,-0.768621,-0.017278,0.856384,...,0,3,0,0,0,5,8,8,10,205


In [47]:
#Information of pcaValidationDf is being printed
pcaValidationDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 41 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PC0                      25 non-null     float64
 1   PC1                      25 non-null     float64
 2   PC2                      25 non-null     float64
 3   PC3                      25 non-null     float64
 4   PC4                      25 non-null     float64
 5   PC5                      25 non-null     float64
 6   PC6                      25 non-null     float64
 7   PC7                      25 non-null     float64
 8   PC8                      25 non-null     float64
 9   PC9                      25 non-null     float64
 10  PC10                     25 non-null     float64
 11  PC11                     25 non-null     float64
 12  PC12                     25 non-null     float64
 13  PC13                     25 non-null     float64
 14  PC14                     25 

In [48]:
#Statistical information of pcaValidationDf is being printed
pcaValidationDf.describe()

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,OrderQuantity
count,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,...,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0
mean,-0.178009,0.841029,-0.330317,0.69491,0.52658,0.612656,1.003164,0.237304,-0.251614,-0.075676,...,2.16,6.92,0.12,1.16,0.04,0.52,10.84,10.92,17.0,423.8
std,1.875163,2.38877,2.098771,2.001808,1.694058,2.025949,2.104197,1.842426,1.217884,1.119167,...,5.161395,7.471724,0.331662,2.192411,0.2,1.357694,11.055466,11.105254,17.255434,496.198633
min,-2.587016,-4.473841,-3.172407,-1.951797,-1.512157,-2.48101,-2.011895,-3.519059,-2.678456,-1.494052,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,160.0
25%,-1.285835,0.21739,-1.713168,-0.615858,-0.748153,-0.835252,-0.335407,-0.647871,-1.130517,-0.819009,...,0.0,2.0,0.0,0.0,0.0,0.0,4.0,4.0,8.0,200.0
50%,-0.600887,1.337435,-0.663736,-0.069064,0.073777,0.653674,0.574539,0.146634,-0.017278,-0.366633,...,0.0,4.0,0.0,0.0,0.0,0.0,8.0,8.0,11.0,250.0
75%,0.087226,1.916686,0.039824,2.234394,1.788728,1.707645,1.986794,1.206313,0.653157,0.955768,...,1.0,10.0,0.0,1.0,0.0,0.0,13.0,13.0,17.0,300.0
max,5.128034,3.940754,7.384269,4.841975,5.556841,4.879057,5.052129,3.991179,2.202286,1.997094,...,19.0,32.0,1.0,9.0,1.0,5.0,51.0,51.0,80.0,1995.0


In [49]:
#pcaTestDf is being reassigned to the remaining 50% of the test data
pcaTestDf = pcaTestDf.iloc[splitIndex:].reset_index(drop = True)
pcaTestDf

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,OrderQuantity
0,-2.01391,1.995143,-1.92626,-1.974847,-0.470229,-0.920815,5.463404,-2.759364,2.244603,-0.119326,...,2,15,5,2,0,2,25,26,32,420
1,0.081436,0.312799,-0.204645,4.329446,-0.406149,3.267323,2.08309,1.907919,0.579843,-0.524113,...,0,3,0,1,0,0,4,4,5,150
2,-0.275271,3.378519,6.18066,-3.070387,-1.256094,2.69595,-1.440188,-1.783384,1.923091,-3.025943,...,93,26,2,3,0,0,123,124,143,4395
3,0.943398,-4.813578,0.824529,0.27505,0.786546,0.370814,0.870501,0.563257,-2.901741,-3.165466,...,0,3,1,0,0,0,4,4,5,200
4,-0.379232,3.409618,-3.162239,2.686746,3.859326,0.532266,-1.527445,-1.047937,-1.601517,-0.551928,...,0,0,0,0,5,0,5,5,7,170
5,-2.101637,2.223838,-1.556957,-2.129927,-0.512525,-0.8243,5.273366,-2.782388,2.460166,-0.430639,...,0,11,0,3,0,0,14,14,18,450
6,-0.282043,3.811182,-2.230863,0.004519,2.087936,-1.753659,3.935391,0.919443,-1.236278,-0.553032,...,0,10,0,2,0,0,11,12,18,250
7,-0.64304,0.346454,-0.315427,2.190501,-1.220767,2.136473,0.156331,-2.538222,-0.720663,1.731993,...,1,4,0,2,0,0,7,7,9,210
8,12.352104,-1.044931,-5.463644,-0.329819,-2.043098,-0.491714,-0.91066,0.539819,3.024043,0.518843,...,3,1,0,0,0,0,4,4,10,190
9,-2.104799,0.395125,-1.107901,-1.055992,-1.160363,-0.341301,-0.616943,-0.333464,0.484331,0.650698,...,0,1,0,0,0,0,1,1,4,220
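
The validation and test frames are produced by a positional 50/50 split of the same frame. One possible way to package that step is sketched below; the helper name `split_half` and the demo frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def split_half(frame: pd.DataFrame, fraction: float = 0.5):
    """Split a frame by row position into two parts with fresh indexes."""
    split_index = int(frame.shape[0] * fraction)
    first = frame.iloc[:split_index].reset_index(drop=True)
    second = frame.iloc[split_index:].reset_index(drop=True)
    return first, second

# Hypothetical 50-row frame standing in for the transformed test data
demo = pd.DataFrame({"PC0": range(50), "Defect": range(100, 150)})
validation, test = split_half(demo)
```

Returning both halves from a single call avoids reassigning the source frame in place, so the step gives the same result when it is run again. A positional split also inherits whatever row order the frame already has, so it is only as random as the earlier train-test split.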


In [50]:
#Information of pcaTestDf is being printed
pcaTestDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 41 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PC0                      25 non-null     float64
 1   PC1                      25 non-null     float64
 2   PC2                      25 non-null     float64
 3   PC3                      25 non-null     float64
 4   PC4                      25 non-null     float64
 5   PC5                      25 non-null     float64
 6   PC6                      25 non-null     float64
 7   PC7                      25 non-null     float64
 8   PC8                      25 non-null     float64
 9   PC9                      25 non-null     float64
 10  PC10                     25 non-null     float64
 11  PC11                     25 non-null     float64
 12  PC12                     25 non-null     float64
 13  PC13                     25 non-null     float64
 14  PC14                     25 

In [51]:
#Statistical information of pcaTestDf is being printed
pcaTestDf.describe()

Unnamed: 0,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,...,PrintError,SewingError,FabricStain,FabricError,EmbroideryError,MeasureError,SecondQuality,CalculatedSecondQuality,Defect,OrderQuantity
count,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,...,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0
mean,0.013024,0.454578,-0.202597,0.279146,-0.191783,0.205207,0.494423,-0.527151,0.060211,-0.557505,...,7.52,6.88,0.64,1.44,0.36,0.08,16.76,16.92,25.52,681.6
std,2.863422,2.356419,2.651003,2.553474,1.846505,1.743946,2.241005,1.566362,1.59235,1.674245,...,23.878372,7.785242,1.868154,1.583246,1.254326,0.4,33.09995,33.376539,45.071351,1383.759764
min,-2.551436,-4.813578,-5.463644,-4.955933,-2.546559,-3.651559,-3.551327,-2.782388,-2.901741,-5.528565,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.0,150.0
25%,-0.758804,-1.143406,-1.556957,-1.558052,-1.478593,-0.849267,-0.887723,-1.783384,-0.906136,-0.831912,...,0.0,2.0,0.0,0.0,0.0,0.0,4.0,4.0,8.0,200.0
50%,-0.379232,0.395125,-0.315427,0.27505,-0.512525,0.087782,0.200355,-0.501114,-0.104364,-0.430639,...,0.0,4.0,0.0,1.0,0.0,0.0,6.0,6.0,11.0,250.0
75%,0.085605,1.995143,0.415892,2.46971,0.351512,1.472206,0.870501,0.563257,0.579843,0.518843,...,1.0,9.0,0.0,2.0,0.0,0.0,11.0,12.0,18.0,420.0
max,12.352104,3.858221,7.682068,4.371193,5.226237,3.279513,5.463404,2.71176,3.024043,1.731993,...,93.0,29.0,8.0,7.0,5.0,2.0,127.0,128.0,199.0,6000.0


In [52]:
#DataFrames are being saved as pkl files
pcaTrainingDf.to_pickle("../Data/DataSplit/Training.pkl")
pcaValidationDf.to_pickle("../Data/DataSplit/Validation.pkl")
pcaTestDf.to_pickle("../Data/DataSplit/Test.pkl")
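
Since every later model is trained from these three files, it can be worth confirming that they reload unchanged. The sketch below does a round trip in a temporary directory with small made-up frames; the frame contents and the temporary location are assumptions, and only the file names mirror the ones used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical small frames standing in for the three splits
frames = {
    "Training": pd.DataFrame({"PC0": [0.1, -0.2, 0.3], "OrderQuantity": [380, 295, 310]}),
    "Validation": pd.DataFrame({"PC0": [1.5], "OrderQuantity": [370]}),
    "Test": pd.DataFrame({"PC0": [-2.0], "OrderQuantity": [420]}),
}

reloaded = {}
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "DataSplit"
    # to_pickle does not create missing folders, so the directory is made first
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, frame in frames.items():
        path = out_dir / f"{name}.pkl"
        frame.to_pickle(path)
        reloaded[name] = pd.read_pickle(path)

round_trip_ok = all(reloaded[name].equals(frames[name]) for name in frames)
```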