*Problem Statement:*     Predicting Target Achieved in Garment Manufacturing

*Background:*     The dataset contains variables that describe garment manufacturing operations, including department, date, quarter, and production metrics. The goal is to predict whether a production target, recorded in the "Target Achieved" column, is met. The target is binary: 'Yes' means the target was achieved and 'No' means it was not.


*Objective:*     Using the features above, build a predictive model that estimates whether the target will be met. The model should be able to make predictions for new records from this dataset or from comparable datasets.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.metrics import accuracy_score
import joblib

# A single encoder instance, refit for each text column further down
le = LabelEncoder()

In [3]:
# Load the dataset
df=pd.read_csv("C://Users//91743//Downloads//garments_worker_productivity.csv")

In [4]:
# Show the first 5 rows of the dataset
df.head()

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity,Target Achieved
0,1/1/2015,Quarter1,sweing,Thursday,8,0.8,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725,Yes
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.8865,Yes
2,1/1/2015,Quarter1,sweing,Thursday,11,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057,Yes
3,1/1/2015,Quarter1,sweing,Thursday,12,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057,Yes
4,1/1/2015,Quarter1,sweing,Thursday,6,0.8,25.9,1170.0,1920,50,0.0,0,0,56.0,0.800382,Yes


In [5]:
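# Label-encode the text columns as integers.
# Note: `le` is refit on every call, so after this cell it only remembers
# the classes of the last column ('Target Achieved': No -> 0, Yes -> 1).
df['department'] = le.fit_transform(df['department'])
df['quarter'] = le.fit_transform(df['quarter'])
df['day'] = le.fit_transform(df['day'])
df['Target Achieved'] = le.fit_transform(df['Target Achieved'])
df.head()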

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity,Target Achieved
0,1/1/2015,0,2,3,8,0.8,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725,1
1,1/1/2015,0,1,3,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.8865,1
2,1/1/2015,0,2,3,11,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057,1
3,1/1/2015,0,2,3,12,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057,1
4,1/1/2015,0,2,3,6,0.8,25.9,1170.0,1920,50,0.0,0,0,56.0,0.800382,1
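
Because one `LabelEncoder` is refit for every column, the mappings of the earlier columns are lost once the cell finishes. The following is a minimal sketch of one alternative, not the notebook's method: keep a separate encoder per column so that integer codes can be decoded later. The `toy` frame and the `encoders` dictionary are hypothetical names used only for illustration. The whitespace-stripping step is also an assumption: the encoded department values above include 1 and 2, so the column has at least three distinct labels, which could be caused by a label variant with stray whitespace, but the notebook does not confirm this.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical); note the trailing space in "finishing "
toy = pd.DataFrame({
    "department": ["sweing", "finishing ", "finishing", "sweing"],
    "Target Achieved": ["Yes", "Yes", "No", "Yes"],
})

# Assumption: stray whitespace should not create a separate category
toy["department"] = toy["department"].str.strip()

# One fitted encoder per column, kept for later decoding
encoders = {}
for col in ["department", "Target Achieved"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Classes are sorted alphabetically, so 'No' -> 0 and 'Yes' -> 1
target_classes = list(encoders["Target Achieved"].classes_)
department_classes = list(encoders["department"].classes_)

# Decode integer predictions back to the original labels
decoded = list(encoders["Target Achieved"].inverse_transform([1, 0]))
print(target_classes, department_classes, decoded)
```

Keeping the fitted encoders also makes it possible to encode new records consistently with `encoders[col].transform(...)` before prediction.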


In [7]:
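# Drop 'wip' and 'date'.
# errors='ignore' keeps the cell re-runnable: without it, running the cell
# a second time raises KeyError: "['wip', 'date'] not found in axis".
df.drop(columns=['wip', 'date'], errors='ignore', inplace=True)
df.head()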

In [8]:
# Shuffle the rows (no random_state is set, so the order differs on each run)
df = df.sample(frac=1)
print(df)
print(df.shape)

      quarter  department  day  team  targeted_productivity    smv  over_time  \
547         0           1    2     2                   0.80   3.94       1200   
167         1           1    1     8                   0.80   2.90       1440   
230         1           2    4     3                   0.80  42.27       9900   
714         1           2    5     1                   0.80  22.52          0   
151         1           1    1     9                   0.80   3.94       1440   
...       ...         ...  ...   ...                    ...    ...        ...   
1148        1           0    0    10                   0.70   2.90          0   
840         2           2    5     2                   0.60  30.33       6840   
215         1           1    0    10                   0.80   3.94       1440   
450         3           2    0     5                   0.75  20.40      10320   
683         1           2    0     4                   0.70  30.10       3300   

      incentive  idle_time 

In [9]:
# 3.3 Pop out the target to separate predictors from the target

y = df.pop('Target Achieved')
y[:3]      # Pandas Series

# 3.4 Create an alias for the predictors dataset
X = df     # X is another name for df (not a copy)
X.shape    # (1197, 13)

(1197, 13)

In [10]:
X_train,X_test, y_train, y_test = train_test_split(
                                                    X,                   # Data features
                                                    y,                   # Target column
                                                    test_size = 0.2      # split-ratio
                                                    )

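# 4.1 Print the shapes of the four subsets using f-strings.
#     A bare f-string is only displayed when it is the last expression
#     in a cell, so each one is wrapped in print().
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape : {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape : {y_test.shape}")

X_train shape: (957, 13)
X_test shape : (240, 13)
y_train shape: (957,)
y_test shape : (240,)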
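
The split above is unseeded and unstratified, so the test rows and the class balance change on every run. As a sketch only, on synthetic data with hypothetical names (`X_demo`, `y_demo`), the same `train_test_split` call can take `random_state` for reproducibility and `stratify` to preserve the class ratio in both subsets. Adding these arguments to the notebook would change the exact figures reported later.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (hypothetical data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "smv": rng.uniform(2, 50, 100),
    "team": rng.integers(1, 13, 100),
})
y_demo = pd.Series([1] * 70 + [0] * 30)   # 70% 'Yes', 30% 'No'

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,       # same split ratio as the notebook
    random_state=42,     # fixed seed: the same rows are selected on every run
    stratify=y_demo,     # keep the 70/30 class ratio in both subsets
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```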

In [284]:
# 4.3 Which columns are categorical but disguised as integers?
#     Count the unique values in every column. We assume that a column
#     with fewer than 5 unique values is categorical; otherwise it is numeric.
#     Only the last expression in a cell is displayed, so the counts on the
#     next line are not shown unless they are wrapped in print().

print()
X_train.nunique()        # Number of unique values in each column (not displayed)

# 4.4 Columns with fewer than 5 unique values are treated as categorical
print("\n------\n")
print()
X_train.nunique() < 5    # True marks a categorical column




------




quarter                  False
department                True
day                      False
team                     False
targeted_productivity    False
smv                      False
over_time                False
incentive                False
idle_time                False
idle_men                 False
no_of_style_change        True
no_of_workers            False
actual_productivity      False
dtype: bool
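
The cardinality rule above can be wrapped in a small helper. This is a sketch under stated assumptions: `split_by_cardinality` and the `toy` frame are hypothetical names, and the threshold of 5 simply mirrors the rule used in this notebook. It also shows why a column such as `quarter` is classed as numeric in the output above: a column with exactly 5 distinct values does not satisfy "fewer than 5".

```python
import pandas as pd

def split_by_cardinality(frame, threshold=5):
    """Return (cat_cols, num_cols) using the 'fewer than threshold unique values' rule."""
    is_cat = frame.nunique() < threshold
    cat_cols = is_cat[is_cat].index.tolist()
    num_cols = is_cat[~is_cat].index.tolist()
    return cat_cols, num_cols

# Toy data (hypothetical): 3, 5 and 6 distinct values respectively
toy = pd.DataFrame({
    "department": [0, 1, 2, 1, 0, 2],
    "quarter": [0, 1, 2, 3, 4, 0],
    "smv": [26.16, 3.94, 11.41, 25.9, 2.9, 42.27],
})

cat_demo, num_demo = split_by_cardinality(toy)
print(cat_demo)
print(num_demo)
```

A cardinality threshold is only a heuristic; label-encoded columns such as `quarter` or `day` may still be categorical in meaning even when they exceed it.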

In [11]:
# 4.5 Extract the lists cat_cols and num_cols

# 4.6 First flag which columns are categorical (True) and which are numeric (False)
dg = X_train.nunique() < 5

# 4.7 Then take the column names from the Series index
cat_cols = dg[dg].index.tolist()
num_cols = dg[~dg].index.tolist()

In [12]:
# 4.8 Here are the columns.
#     Only the last expression (num_cols) is displayed;
#     cat_cols is shown in the next cell.
cat_cols
print()
num_cols




['quarter',
 'day',
 'team',
 'targeted_productivity',
 'smv',
 'over_time',
 'incentive',
 'idle_time',
 'idle_men',
 'no_of_workers',
 'actual_productivity']

In [13]:
cat_cols

['department', 'no_of_style_change']

In [14]:
# 4.9 We will create two subsets of num_cols
#      One set we will impute using 'mean' 
#       and the other using 'median'

num_cols_mean   = ['targeted_productivity','smv','over_time','incentive','no_of_workers','actual_productivity']
num_cols_median = ['quarter','day','team','idle_time','idle_men']

In [15]:
# 4.10 Further sub-divide cat_cols
#      We will create two sets of cat_cols
#      One set we will fill with 'most_frequent'
#       and the other using a constant value

cat_cols_mf       = ['department']       # 'most_frequent' fill
cat_cols_const    = ['no_of_style_change']     # 'constant' fill

# Our four data subsets

Numeric with mean imputation: num_cols_mean

Numeric with median imputation: num_cols_median

Categorical with mode imputation: cat_cols_mf

Categorical with constant imputation: cat_cols_const
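
The four subsets can each be given their own imputation strategy inside a single `ColumnTransformer`. The sketch below is an illustration on a tiny hypothetical frame (`toy`), with one column per subset and a placeholder `fill_value` of -1 for the constant strategy; the notebook does not state which constant it intends.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (hypothetical) with one missing value per column
toy = pd.DataFrame({
    "smv": [2.0, np.nan, 4.0],                 # mean of 2 and 4 is 3
    "idle_men": [0.0, 0.0, np.nan],            # median of 0 and 0 is 0
    "department": [1.0, 1.0, np.nan],          # most frequent value is 1
    "no_of_style_change": [0.0, np.nan, 2.0],  # filled with the constant -1
})

imputer = ColumnTransformer(transformers=[
    ("num_mean", SimpleImputer(strategy="mean"), ["smv"]),
    ("num_median", SimpleImputer(strategy="median"), ["idle_men"]),
    ("cat_mf", SimpleImputer(strategy="most_frequent"), ["department"]),
    # fill_value=-1 is an assumed placeholder, not a value from the notebook
    ("cat_const", SimpleImputer(strategy="constant", fill_value=-1),
     ["no_of_style_change"]),
])

# Output columns follow the order of the transformers listed above
filled = imputer.fit_transform(toy)
print(filled)
```

In the notebook itself, the single-column lists would be replaced by `num_cols_mean`, `num_cols_median`, `cat_cols_mf`, and `cat_cols_const`.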

In [16]:
# 4.11 So we have four datasets for imputing.
#      Have a look:
X_train[num_cols_mean]              # Num dataset, impute by 'mean'   strategy
print()
X_train[num_cols_median]            # Num dataset, impute by 'median' strategy
print()
X_train[cat_cols_mf]                # Cat dataset, impute by 'most_frequent' strategy
print()
X_train[cat_cols_const]             # Cat dataset, impute by 'constant' strategy






Unnamed: 0,no_of_style_change
1194,0
1101,1
1165,1
439,0
480,0
...,...
95,0
775,1
539,0
778,0


In [17]:
# 4.12   Make a copy of X_train
#       and X_test for two separate
#       ways of data processing
#       without using pipes and with pipes

X_train_c = X_train.copy()
X_test_c  = X_test.copy()

In [18]:
# Create transformers for numeric and categorical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [20]:
#Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ])

In [21]:
# Create a pipeline with preprocessing and RandomForestClassifier
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())])
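
A pipeline like `model` only takes effect once it is fitted and used for prediction. The following self-contained sketch shows that flow on synthetic data; the names `X_demo`, `y_demo`, and `model_demo`, the column choices, and the labelling rule are all hypothetical and stand in for the notebook's `X_train`, `y_train`, and `model`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (hypothetical) with a few missing numeric values
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "smv": rng.uniform(2, 50, n),
    "over_time": rng.integers(0, 10000, n).astype(float),
    "department": rng.integers(0, 3, n),
})
X_demo.loc[X_demo.index % 25 == 0, "smv"] = np.nan
y_demo = (X_demo["over_time"] > 5000).astype(int)   # made-up labelling rule

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor_demo = ColumnTransformer(transformers=[
    ("num", numeric, ["smv", "over_time"]),
    ("cat", categorical, ["department"]),
])
model_demo = Pipeline(steps=[
    ("preprocessor", preprocessor_demo),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Fitting the pipeline fits the preprocessing and the classifier together
model_demo.fit(X_demo.iloc[:160], y_demo.iloc[:160])
preds = model_demo.predict(X_demo.iloc[160:])
print(preds[:10])
```

Fitting the whole pipeline means the imputation and scaling statistics are learned from the training rows only and are reapplied automatically at prediction time.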

In [22]:
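# Train a RandomForestClassifier directly on the label-encoded features.
# Note: this cell does not use the `model` pipeline defined above, so no
# imputation, scaling, or one-hot encoding is applied here.
clf1 = RandomForestClassifier(n_estimators=100)

# Fit the classifier on the training set
clf1.fit(X_train, y_train)

# Predict on the held-out test set
y_pred1 = clf1.predict(X_test)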

In [23]:
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred1))

Accuracy: 0.9625


In [24]:

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predicted labels (0 = No, 1 = Yes)
confusion_matrix(y_test, y_pred1)

array([[ 63,   8],
       [  1, 168]], dtype=int64)

In [25]:
print(classification_report(y_test, y_pred1))

              precision    recall  f1-score   support

           0       0.98      0.89      0.93        71
           1       0.95      0.99      0.97       169

    accuracy                           0.96       240
   macro avg       0.97      0.94      0.95       240
weighted avg       0.96      0.96      0.96       240
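
The figures in the classification report follow directly from the confusion matrix shown above. As a worked check, the sketch below recomputes them from that matrix, using scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the notebook output (rows: true, columns: predicted)
cm = np.array([[63, 8],
               [1, 168]])

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # (168 + 63) / 240
precision_1 = tp / (tp + fp)        # 168 / 176
recall_1 = tp / (tp + fn)           # 168 / 169
precision_0 = tn / (tn + fn)        # 63 / 64
recall_0 = tn / (tn + fp)           # 63 / 71

print(f"accuracy    : {accuracy:.4f}")
print(f"class 1 (Yes): precision {precision_1:.2f}, recall {recall_1:.2f}")
print(f"class 0 (No) : precision {precision_0:.2f}, recall {recall_0:.2f}")
```

The lower recall for class 0 reflects the 8 'No' rows that were predicted as 'Yes'.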



In [26]:
# Save the model for later use
joblib.dump(clf1, 'random_forest_model.joblib')


['random_forest_model.joblib']

In [27]:
# Load the saved model
loaded_model = joblib.load('random_forest_model.joblib')


In [28]:
# Use a sample new data point for classification
new_df = pd.DataFrame({
    'quarter': [3],
    'department': [2],
    'day': [3],
    'team': [1],
    'targeted_productivity': [0.8],
    'smv': [20.0],
    'over_time': [10],
    'incentive': [500],
    'idle_time': [0],
    'idle_men': [0],
    'no_of_style_change': [1],
    'no_of_workers': [30],
    'actual_productivity':[0.9]
})

In [30]:
new_df

Unnamed: 0,quarter,department,day,team,targeted_productivity,smv,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,3,2,3,1,0.8,20.0,10,500,0,0,1,30,0.9


In [29]:
# Make a prediction using the loaded model
prediction = loaded_model.predict(new_df)
print(f"Predicted 'Target Achieved' label for the new data point: {prediction}")

Predicted 'Target Achieved' label for the new data point: [1]


# Observations:

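1) The model reaches an accuracy of 0.9625 on the 240 held-out test rows, which indicates how well it generalizes to unseen data from the same dataset. Of the 9 misclassified rows, 8 are 'No' cases predicted as 'Yes' and 1 is a 'Yes' case predicted as 'No'. Because neither the split nor the forest is seeded, these figures vary slightly between runs.
2) The prediction for the new data point shows how to use the saved model on new records: the input must contain the same 13 columns, label-encoded in the same way as the training data.
3) The features include both targeted_productivity and actual_productivity. If "Target Achieved" was derived by comparing these two columns, the model effectively sees the answer, and the reported accuracy would overstate how well it could forecast a target before production finishes.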

To adapt the code to your own data:

1) Adjust column names and data types to match your dataset.
2) Fine-tune the hyperparameters of the RandomForestClassifier.
3) Consider evaluation metrics beyond accuracy for a more comprehensive assessment of model performance.
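
Points 2 and 3 can be combined in a cross-validated grid search that is scored on a metric other than accuracy. This is a minimal sketch on synthetic data from `make_classification`; the parameter grid, the choice of F1 as the metric, and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (hypothetical stand-in for X_train, y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid; a real search would usually cover more values
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",   # F1 of the positive class instead of accuracy
    cv=3,           # 3-fold cross-validation
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_f1 = search.best_score_
print(best_params, round(best_f1, 3))
```

After fitting, `search.best_estimator_` holds a model refitted on all the data passed to `fit`, which can then be evaluated once on a held-out test set.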