**Machine Learning Lab - CSE 432**

# 07 Regression

**Regression** in machine learning is a statistical method for predicting a continuous outcome from one or more predictor variables. It is a core part of supervised learning, in which the algorithm is trained on input features together with output labels to learn the relationship between them. For instance, given a car's weight and horsepower, a regression model can estimate its fuel efficiency in miles per gallon. Linear regression is the most common form because it is simple and easy to interpret. Other models include polynomial regression, decision trees, random forests and support vector regression, each suited to different kinds of data; logistic regression, despite its name, is mostly used for classification. A regression model is usually evaluated with an error metric such as mean squared error (MSE) on held-out data, and the balance between bias and variance determines how well it generalizes to new data.
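
As a minimal sketch of the fuel-efficiency example, the snippet below fits a linear regression to synthetic data. The data-generating formula, the value ranges and all variable names are assumptions made for illustration; they do not come from a real car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (an assumption for illustration): mpg falls linearly
# with weight and horsepower, plus a little Gaussian noise.
rng = np.random.default_rng(0)
weight = rng.uniform(1500, 4500, size=200)    # pounds
horsepower = rng.uniform(60, 250, size=200)   # hp
mpg = 50 - 0.006 * weight - 0.05 * horsepower + rng.normal(0, 1.0, size=200)

X_demo = np.column_stack([weight, horsepower])
demo_model = LinearRegression().fit(X_demo, mpg)
mpg_pred = demo_model.predict(X_demo)

# The learned coefficients should land close to -0.006 and -0.05.
print("coefficients:", demo_model.coef_)
print("intercept:", demo_model.intercept_)

# Scores computed on the training data itself, so they are optimistic.
demo_mse = mean_squared_error(mpg, mpg_pred)
demo_r2 = r2_score(mpg, mpg_pred)
print("MSE:", demo_mse)
print("R^2:", demo_r2)
```

The rest of this lab follows the same fit-then-predict pattern, but scores each model on a separate test split.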

**7.1 Importing modules**

In [1]:
import pandas as pd
import numpy as np

**7.2 Importing and Preprocessing Data Set**

The Productivity Prediction of Garment Employees dataset (https://archive.ics.uci.edu/dataset/597/productivity+prediction+of+garment+employees) will be used for this task. This dataset includes important attributes of the garment manufacturing process and the productivity of the employees. The data was collected manually and validated by industry experts.

Attribute information:
    01	date			:	Date in MM-DD-YYYY
    
    02	day			:	Day of the Week
    
    03	quarter			:	A portion of the month. A month was divided into four quarters
    
    04	department		:	Associated department with the instance
    
    05	team_no			:	Associated team number with the instance
    
    06	no_of_workers		:	Number of workers in each team
    
    07	no_of_style_change	:	Number of changes in the style of a particular product
    
    08	targeted_productivity	:	Targeted productivity set by the Authority for each team for each day.
    
    09	smv			:	Standard Minute Value, it is the allocated time for a task
    
    10	wip			:	Work in progress. Includes the number of unfinished items for products
    
    11	over_time		:	Represents the amount of overtime by each team in minutes
    
    12	incentive		:	Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
    
    13	idle_time		:	The amount of time when the production was interrupted due to several reasons
    
    14	idle_men		:	The number of workers who were idle due to production interruption
    
    15	actual_productivity	:	The actual % of productivity that was delivered by the workers. It ranges from 0-1.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Importing the data
df = pd.read_csv('/content/drive/MyDrive/CSE 432 ML/Regression/garments_worker_productivity.csv')
df

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,Quarter1,sweing,Thursday,8,0.80,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.886500
2,1/1/2015,Quarter1,sweing,Thursday,11,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
3,1/1/2015,Quarter1,sweing,Thursday,12,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
4,1/1/2015,Quarter1,sweing,Thursday,6,0.80,25.90,1170.0,1920,50,0.0,0,0,56.0,0.800382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,3/11/2015,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0,0.628333
1193,3/11/2015,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0,0.625625
1194,3/11/2015,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0,0.625625
1195,3/11/2015,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0,0.505889


In [5]:
df['day'].unique()

array(['Thursday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday'],
      dtype=object)

**7.3 Train Test Split**

In [6]:
y = df['actual_productivity']
# Feature Selection
X = df[['quarter', 'department', 'day', 'team', 'targeted_productivity', 'smv', 'wip', 'over_time', 'incentive', 'idle_time', 'idle_men', 'no_of_style_change', 'no_of_workers']]
X

Unnamed: 0,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers
0,Quarter1,sweing,Thursday,8,0.80,26.16,1108.0,7080,98,0.0,0,0,59.0
1,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0
2,Quarter1,sweing,Thursday,11,0.80,11.41,968.0,3660,50,0.0,0,0,30.5
3,Quarter1,sweing,Thursday,12,0.80,11.41,968.0,3660,50,0.0,0,0,30.5
4,Quarter1,sweing,Thursday,6,0.80,25.90,1170.0,1920,50,0.0,0,0,56.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0
1193,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0
1194,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0
1195,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0


In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [9]:
X_train.isnull().sum()

Unnamed: 0,0
quarter,0
department,0
day,0
team,0
targeted_productivity,0
smv,0
wip,368
over_time,0
incentive,0
idle_time,0


**Task 01**

Use `SimpleImputer` to fill the missing values of the 'wip' column with the column mean. Call `fit_transform` on the training data and `transform` on the testing data, so that the mean is learned from the training rows only.

In [10]:
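# Task 01

from sklearn.impute import SimpleImputer

# Replace missing 'wip' values with the column mean
imputer = SimpleImputer(strategy='mean')

# Learn the mean from the training rows only, then reuse it on the test rows;
# ravel() flattens the (n, 1) result so it can be assigned to a single column
X_train['wip'] = imputer.fit_transform(X_train[['wip']]).ravel()
X_test['wip'] = imputer.transform(X_test[['wip']]).ravel()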

In [11]:
X_train.isnull().sum()

Unnamed: 0,0
quarter,0
department,0
day,0
team,0
targeted_productivity,0
smv,0
wip,0
over_time,0
incentive,0
idle_time,0
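
To see why the imputer is fitted on the training data only, the sketch below repeats the step on two tiny made-up frames. The names `toy_train`, `toy_test` and `toy_imputer` are hypothetical and the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_test (values are made up).
toy_train = pd.DataFrame({"wip": [100.0, np.nan, 300.0]})
toy_test = pd.DataFrame({"wip": [np.nan, 500.0]})

toy_imputer = SimpleImputer(strategy="mean")

# fit_transform learns the mean (200.0) from the training rows only.
toy_train["wip"] = toy_imputer.fit_transform(toy_train[["wip"]]).ravel()

# transform reuses that training mean; the test value 500.0 has no influence.
toy_test["wip"] = toy_imputer.transform(toy_test[["wip"]]).ravel()

print(toy_imputer.statistics_)   # the mean learned from the training rows
print(toy_test["wip"].tolist())  # the missing test value is filled with it
```

Because the fill value never depends on the test rows, no information leaks from the test set into preprocessing.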


**Task 02**

Use `LabelEncoder` to encode the columns quarter, department and day.

In [12]:
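# Task 02
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Label-encode only the categorical columns: fit on train, reuse on test
categorical_columns = ['quarter', 'day', 'department']

for column in categorical_columns:
    # fit_transform refits the encoder for the current column, so the test
    # column must be transformed before the loop moves on to the next one
    X_train[column] = le.fit_transform(X_train[column])
    # transform raises an error if the test set has a label unseen in training
    X_test[column] = le.transform(X_test[column])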

In [13]:
X_train

Unnamed: 0,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers
213,1,1,0,11,0.80,4.15,1217.287846,1440,0,0.0,0,0,8.0
120,0,2,5,10,0.75,28.08,1144.000000,10530,69,0.0,0,0,58.5
560,0,1,2,6,0.60,2.90,1217.287846,1200,0,0.0,0,0,10.0
968,3,2,1,7,0.80,30.10,644.000000,840,38,0.0,0,1,59.0
620,0,2,3,10,0.80,22.52,1039.000000,6720,113,0.0,0,0,56.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1044,0,0,4,5,0.70,4.60,1217.287846,3360,0,0.0,0,0,8.0
1095,0,0,1,6,0.50,2.90,1217.287846,960,0,0.0,0,0,8.0
1130,1,0,0,5,0.60,3.94,1217.287846,0,2880,0.0,0,0,12.0
860,2,2,3,7,0.75,30.10,444.000000,0,0,5.0,20,1,59.0
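
The encoded `department` column in the output above contains three distinct codes (0, 1 and 2), although the raw preview shows only the labels `sweing` and `finishing`. One possible cause, assumed here and not verified against the file, is stray whitespace in some of the raw labels. A quick check with `df['department'].unique()` would confirm or rule this out. The sketch below shows the effect on a hypothetical column and how `str.strip()` would merge such duplicates before encoding.

```python
import pandas as pd

# Hypothetical raw labels; the trailing space is an assumption for illustration.
toy = pd.DataFrame({"department": ["sweing", "finishing", "finishing ", "sweing"]})

n_before = toy["department"].nunique()
print(toy["department"].unique())   # 'finishing' and 'finishing ' count as different labels

# Strip leading/trailing whitespace so equivalent labels collapse into one.
toy["department"] = toy["department"].str.strip()

n_after = toy["department"].nunique()
print(toy["department"].unique())
```

If the real column behaves the same way, the cleaning step would belong before the train/test split so that both splits see the same label set.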


**Task 03**

Use `StandardScaler` or `MinMaxScaler` to standardize or normalize the relevant numerical columns.

In [14]:
# Task 03
from sklearn.preprocessing import StandardScaler

# StandardScaler rescales each column to zero mean and unit variance
scaler = StandardScaler()

# Standardize only the numerical columns: fit on train, reuse on test
numerical_columns = ['targeted_productivity', 'smv', 'wip', 'over_time', 'incentive', 'idle_time', 'idle_men', 'no_of_style_change', 'no_of_workers']

X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])

In [15]:
X_train

Unnamed: 0,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers
213,1,1,0,11,0.708176,-0.966138,0.000000,-0.903289,-0.223037,-0.046311,-0.113150,-0.360341,-1.158772
120,0,2,5,10,0.206531,1.198038,-0.047768,1.811877,0.144836,-0.046311,-0.113150,-0.360341,1.102310
560,0,1,2,6,-1.298402,-1.079185,0.000000,-0.974976,-0.223037,-0.046311,-0.113150,-0.360341,-1.069224
968,3,2,1,7,0.708176,1.380722,-0.373657,-1.082507,-0.020441,-0.046311,-0.113150,1.977684,1.124697
620,0,2,3,10,0.708176,0.695204,-0.116204,0.673837,0.379421,-0.046311,-0.113150,-0.360341,0.990375
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1044,0,0,4,5,-0.295113,-0.925441,0.000000,-0.329788,-0.223037,-0.046311,-0.113150,-0.360341,-1.158772
1095,0,0,1,6,-2.301691,-1.079185,0.000000,-1.046664,-0.223037,-0.046311,-0.113150,-0.360341,-1.158772
1130,1,0,0,5,-1.298402,-0.985130,0.000000,-1.333414,15.131647,-0.046311,-0.113150,-0.360341,-0.979676
860,2,2,3,7,0.206531,1.380722,-0.504013,-1.333414,-0.223037,0.913157,6.307622,1.977684,1.124697
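
Tasks 01 to 03 each follow the same rule: fit on the training data, then transform the test data. As an alternative sketch (not the method used in this lab), scikit-learn's `Pipeline` and `ColumnTransformer` can bundle the three steps so that the rule is enforced automatically. The toy frame, its values and the names `toy_X`, `toy_y`, `preprocess` and `pipe` are assumptions for illustration, and `OrdinalEncoder` is swapped in for `LabelEncoder` because it is designed for feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Small made-up frame with the same kinds of columns as the lab data.
toy_X = pd.DataFrame({
    "department": ["sweing", "finishing", "sweing", "finishing", "sweing", "finishing"],
    "smv": [26.0, 4.0, 11.0, 3.0, 25.0, 4.0],
    "wip": [1100.0, np.nan, 950.0, np.nan, 1200.0, np.nan],
})
toy_y = pd.Series([0.94, 0.89, 0.80, 0.63, 0.80, 0.63])

preprocess = ColumnTransformer([
    # Categorical columns: integer codes; unseen test labels map to -1.
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["department"]),
    # Numerical columns: mean imputation followed by standardization.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), ["smv", "wip"]),
])

pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# fit() learns every preprocessing statistic from the first four rows only;
# predict() applies the same fitted transforms to the remaining two rows.
pipe.fit(toy_X.iloc[:4], toy_y.iloc[:4])
toy_pred = pipe.predict(toy_X.iloc[4:])
print(toy_pred)
```

With this design the raw `X_train` and `X_test` would be passed straight to `fit` and `predict`, and the in-place column assignments from the tasks above would no longer be needed.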


**7.5 Regression**

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

**Task 04**

Create a dictionary called `regression_models` that maps a descriptive name (a string) to an instance of each imported regression model.

In [17]:
# Task 04

regression_models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regression': DecisionTreeRegressor(random_state=42),
    'Random Forest Regression': RandomForestRegressor(random_state=42),
    'Support Vector Regression': SVR()
}

**Task 05**

Train each model on the training set and measure its mean squared error (MSE) on the test set.

In [18]:
# Task 05

from sklearn.metrics import mean_squared_error
# Usage: mse = mean_squared_error(y_test, y_pred)

for model in regression_models.values():
    model.fit(X_train, y_train)               # train on the preprocessed training set
    y_pred = model.predict(X_test)            # predict on the held-out test set
    mse = mean_squared_error(y_test, y_pred)  # lower is better
    print(f'{model}: {mse}')

LinearRegression(): 0.023538703739745436
DecisionTreeRegressor(random_state=42): 0.02219963334558109
RandomForestRegressor(random_state=42): 0.015897491534874294
SVR(): 0.02078522890813577
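
MSE is expressed in squared units of the target, which makes the numbers above hard to read against a productivity score that ranges from 0 to 1. Taking the square root gives the root mean squared error (RMSE) in the target's own units; for example, the random forest's MSE of about 0.0159 corresponds to an RMSE of roughly 0.126. The sketch below wraps MSE, RMSE and R² in a small helper. The function name `report_metrics` and the demo arrays are hypothetical and not part of the lab.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def report_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),      # same units as the target
        "r2": r2_score(y_true, y_pred),   # 1.0 is a perfect fit
    }

# Made-up values, only to show the call; in the lab this would be
# report_metrics(y_test, model.predict(X_test)) for each fitted model.
y_true_demo = np.array([0.80, 0.75, 0.60, 0.90])
y_pred_demo = np.array([0.78, 0.70, 0.65, 0.85])

metrics_demo = report_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```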
