**Machine Learning Lab - CSE 432**

# 07 Regression

**Regression** in machine learning is a statistical method for predicting a continuous outcome from one or more predictor variables. It is a core part of supervised learning, where the algorithm is trained on input features paired with known output values so that it can learn the relationship between them. For instance, given a car's weight and horsepower, a regression model can estimate its fuel efficiency in miles per gallon. Linear regression is the most common form because it is simple and often effective. Other variants include polynomial regression, which captures curved relationships; logistic regression, despite its name, is normally used for classification rather than for continuous targets. A regression model is evaluated with error metrics such as mean squared error (MSE) and R-squared, and its ability to generalize to new data is usually discussed in terms of the bias-variance trade-off.
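The two metrics reported later in this lab, mean squared error and R-squared, can be worked out by hand. The sketch below uses made-up target and prediction values (not taken from the garment dataset) purely to show the arithmetic.

```python
# Toy values, invented for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

n = len(y_true)
mean_y = sum(y_true) / n

# MSE: the average of the squared residuals.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
mse = ss_res / n

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse:.4f}, R^2 = {r2:.4f}")
```

A lower MSE is better. R-squared equals 1 for perfect predictions, 0 for a model that always predicts the mean, and becomes negative when the model does worse than predicting the mean.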

**7.1 Importing modules**

In [1]:
import pandas as pd
import numpy as np

**7.2 Importing and Preprocessing Data Set**

The Productivity Prediction of Garment Employees dataset (https://archive.ics.uci.edu/dataset/597/productivity+prediction+of+garment+employees) will be used for this task. It contains important attributes of the garment manufacturing process together with the productivity of the employees. The data was collected manually and validated by industry experts.

Attribute information:
    01  date                   : Date in MM-DD-YYYY
    02  day                    : Day of the week
    03  quarter                : A portion of the month. A month was divided into four quarters
    04  department             : Department associated with the instance
    05  team_no                : Team number associated with the instance
    06  no_of_workers          : Number of workers in each team
    07  no_of_style_change     : Number of changes in the style of a particular product
    08  targeted_productivity  : Targeted productivity set by the authority for each team for each day
    09  smv                    : Standard Minute Value, the time allocated for a task
    10  wip                    : Work in progress. Includes the number of unfinished items for products
    11  over_time              : Amount of overtime by each team, in minutes
    12  incentive              : Amount of financial incentive (in BDT) that enables or motivates a particular course of action
    13  idle_time              : Amount of time for which production was interrupted, for various reasons
    14  idle_men               : Number of workers who were idle due to production interruption
    15  actual_productivity    : Actual productivity delivered by the workers, as a fraction ranging from 0 to 1

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Importing the data
df = pd.read_csv('/content/drive/MyDrive/ML Lab/Week 7/garments_worker_productivity.csv')
df

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,Quarter1,sweing,Thursday,8,0.80,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.886500
2,1/1/2015,Quarter1,sweing,Thursday,11,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
3,1/1/2015,Quarter1,sweing,Thursday,12,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
4,1/1/2015,Quarter1,sweing,Thursday,6,0.80,25.90,1170.0,1920,50,0.0,0,0,56.0,0.800382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,3/11/2015,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0,0.628333
1193,3/11/2015,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0,0.625625
1194,3/11/2015,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0,0.625625
1195,3/11/2015,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0,0.505889
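The preview shows empty `wip` entries in the finishing rows. A quick way to see which columns contain missing values is to count them per column. The sketch below rebuilds three rows of the preview as a small stand-in frame (the name `preview` is hypothetical); on the real data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Stand-in for the first three rows of the preview above (three columns only).
preview = pd.DataFrame({
    "department": ["sweing", "finishing", "sweing"],
    "wip": [1108.0, np.nan, 968.0],
    "over_time": [7080, 960, 3660],
})

# isna() marks missing cells; sum() counts them column by column.
missing_per_column = preview.isna().sum()
print(missing_per_column)
```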


In [24]:
df['day'].unique()
print(df)

           date  quarter  department  day  team  targeted_productivity    smv  \
0      1/1/2015        0           2    3     8                   0.80  26.16   
1      1/1/2015        0           1    3     1                   0.75   3.94   
2      1/1/2015        0           2    3    11                   0.80  11.41   
3      1/1/2015        0           2    3    12                   0.80  11.41   
4      1/1/2015        0           2    3     6                   0.80  25.90   
...         ...      ...         ...  ...   ...                    ...    ...   
1192  3/11/2015        1           0    5    10                   0.75   2.90   
1193  3/11/2015        1           0    5     8                   0.70   3.90   
1194  3/11/2015        1           0    5     7                   0.65   3.90   
1195  3/11/2015        1           0    5     9                   0.75   2.90   
1196  3/11/2015        1           0    5     6                   0.70   2.90   

              wip  over_tim

**Task 01**

Use simple imputer to fill the missing values of the 'wip' column with the column mean.

In [14]:
from sklearn.impute import SimpleImputer

# Replace missing 'wip' values with the mean of the observed values
imputer = SimpleImputer(strategy='mean')

# fit_transform expects 2-D input and returns an (n, 1) array;
# ravel() flattens it so it can be assigned back as a single column
df['wip'] = imputer.fit_transform(df[['wip']]).ravel()
df['wip']

Unnamed: 0,wip
0,1108.000000
1,1190.465991
2,968.000000
3,968.000000
4,1170.000000
...,...
1192,1190.465991
1193,1190.465991
1194,1190.465991
1195,1190.465991
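Every previously missing row now holds the same value, which is the column mean learned by the imputer. A minimal sketch on a toy column (values invented for illustration) makes the mechanism visible through the fitted `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with two missing entries; values are illustrative only.
wip_toy = np.array([[1000.0], [np.nan], [1200.0], [np.nan], [1400.0]])

imputer = SimpleImputer(strategy="mean")
wip_filled = imputer.fit_transform(wip_toy)

# The learned fill value is the mean of the observed entries:
# (1000 + 1200 + 1400) / 3 = 1200
print(imputer.statistics_)
print(wip_filled.ravel())
```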


**Task 02**

Use Label Encoder to encode the columns quarter, department, day

In [23]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# LabelEncoder assigns integer codes in sorted order of the unique values.
# The same object is refit for each column, so afterwards it only
# retains the mapping for 'day'.
df['quarter'] = le.fit_transform(df['quarter'])
df['department'] = le.fit_transform(df['department'])
df['day'] = le.fit_transform(df['day'])
print(df)


           date  quarter  department  day  team  targeted_productivity    smv  \
0      1/1/2015        0           2    3     8                   0.80  26.16   
1      1/1/2015        0           1    3     1                   0.75   3.94   
2      1/1/2015        0           2    3    11                   0.80  11.41   
3      1/1/2015        0           2    3    12                   0.80  11.41   
4      1/1/2015        0           2    3     6                   0.80  25.90   
...         ...      ...         ...  ...   ...                    ...    ...   
1192  3/11/2015        1           0    5    10                   0.75   2.90   
1193  3/11/2015        1           0    5     8                   0.70   3.90   
1194  3/11/2015        1           0    5     7                   0.65   3.90   
1195  3/11/2015        1           0    5     9                   0.75   2.90   
1196  3/11/2015        1           0    5     6                   0.70   2.90   

              wip  over_tim
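The encoded output shows department codes 0, 1, and 2, although only two department names are visible in the raw preview. This may indicate spelling or whitespace variants in the column; that is an assumption, not something the output confirms. The sketch below uses a toy frame with a deliberate trailing space to show one way to normalize such values, and keeps one encoder per column (in a hypothetical `encoders` dictionary) so that each mapping can be inspected or inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the trailing space in "finishing " is deliberate.
toy = pd.DataFrame({
    "department": ["sweing", "finishing ", "finishing", "sweing"],
    "day": ["Thursday", "Thursday", "Wednesday", "Saturday"],
})

# Strip whitespace so spelling variants collapse into one category.
toy["department"] = toy["department"].str.strip()

# One encoder per column keeps every mapping available afterwards.
encoders = {}
for col in ["department", "day"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
print(list(encoders["department"].classes_))
print(list(encoders["day"].inverse_transform([0, 1, 2])))
```

Label encoding also imposes an arbitrary numeric order on nominal categories such as weekdays. Tree-based models are largely insensitive to this, but linear models and SVR may be affected, so one-hot encoding is a common alternative.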

**Task 03**

Use Standard Scaler or Min-Max Scaler to scale or normalize relevant columns

In [89]:
from sklearn.preprocessing import MinMaxScaler

# Select the numeric feature columns to rescale to the range [0, 1].
# 'actual_productivity' is the target variable, so it is left unscaled.
columns_to_normalize = ['wip', 'over_time', 'incentive', 'idle_time', 'idle_men', 'no_of_workers', 'no_of_style_change', 'targeted_productivity', 'smv']


# Create a MinMaxScaler object
scaler = MinMaxScaler()


# Fit and transform the selected columns
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])

df

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,0,2,0.6,8,1.000000,0.450252,0.047631,0.273148,0.027222,0.0,0.0,0.0,0.655172,0.940725
1,1/1/2015,0,1,0.6,1,0.931507,0.020132,0.051199,0.037037,0.000000,0.0,0.0,0.0,0.068966,0.886500
2,1/1/2015,0,2,0.6,11,1.000000,0.164731,0.041575,0.141204,0.013889,0.0,0.0,0.0,0.327586,0.800570
3,1/1/2015,0,2,0.6,12,1.000000,0.164731,0.041575,0.141204,0.013889,0.0,0.0,0.0,0.327586,0.800570
4,1/1/2015,0,2,0.6,6,1.000000,0.445219,0.050314,0.074074,0.013889,0.0,0.0,0.0,0.620690,0.800382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,3/11/2015,1,0,1.0,10,0.931507,0.000000,0.051199,0.037037,0.000000,0.0,0.0,0.0,0.068966,0.628333
1193,3/11/2015,1,0,1.0,8,0.863014,0.019357,0.051199,0.037037,0.000000,0.0,0.0,0.0,0.068966,0.625625
1194,3/11/2015,1,0,1.0,7,0.794521,0.019357,0.051199,0.037037,0.000000,0.0,0.0,0.0,0.068966,0.625625
1195,3/11/2015,1,0,1.0,9,0.931507,0.000000,0.051199,0.069444,0.000000,0.0,0.0,0.0,0.149425,0.505889


**7.3 Creating Feature and Target Set**

In [102]:
y = df['actual_productivity']
# Feature Selection
X = df[['quarter', 'department', 'day', 'team', 'targeted_productivity', 'smv', 'wip', 'over_time', 'incentive', 'idle_time', 'idle_men', 'no_of_style_change', 'no_of_workers']]
X

Unnamed: 0,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers
0,0,2,0.6,8,1.000000,0.450252,0.047631,0.273148,0.027222,0.0,0.0,0.0,0.655172
1,0,1,0.6,1,0.931507,0.020132,0.051199,0.037037,0.000000,0.0,0.0,0.0,0.068966
2,0,2,0.6,11,1.000000,0.164731,0.041575,0.141204,0.013889,0.0,0.0,0.0,0.327586
3,0,2,0.6,12,1.000000,0.164731,0.041575,0.141204,0.013889,0.0,0.0,0.0,0.327586
4,0,2,0.6,6,1.000000,0.445219,0.050314,0.074074,0.013889,0.0,0.0,0.0,0.620690
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,1,0,1.0,10,0.931507,0.000000,0.051199,0.037037,0.000000,0.0,0.0,0.0,0.068966
1193,1,0,1.0,8,0.863014,0.019357,0.051199,0.037037,0.000000,0.0,0.0,0.0,0.068966
1194,1,0,1.0,7,0.794521,0.019357,0.051199,0.037037,0.000000,0.0,0.0,0.0,0.068966
1195,1,0,1.0,9,0.931507,0.000000,0.051199,0.069444,0.000000,0.0,0.0,0.0,0.149425


**7.4 Train Test Split**

In [104]:
from sklearn.model_selection import train_test_split

In [105]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=16)
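With `test_size=0.3`, 30% of the rows are held out for testing, and a fixed `random_state` makes the shuffle reproducible. The sketch below checks both properties on a toy array; the names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_toy = np.arange(10)

# Split twice with the same seed to check reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=16)
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=16)

print(X_tr.shape, X_te.shape)        # 7 training rows and 3 test rows
print(np.array_equal(X_te, X_te2))   # same seed, same split
```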

**7.5 Regression**

In [106]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

**Task 04**
Create a dictionary called `regression_models` that maps a descriptive string to each imported regression model.

In [101]:
# Map a display name to each regression model
regression_models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regression': DecisionTreeRegressor(random_state=42),  # random_state for reproducibility
    'Random Forest Regression': RandomForestRegressor(n_estimators=100, random_state=42),  # 100 trees (the scikit-learn default)
    'Support Vector Regression': SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1)  # manually chosen hyperparameters
}

# Feature engineering example: degree-2 polynomial features
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)  # experiment with degree
X_train_poly = poly.fit_transform(X_train)  # fit on the training set only
X_test_poly = poly.transform(X_test)


# Train and evaluate the models
from sklearn.metrics import mean_squared_error, r2_score

for name, model in regression_models.items():
    print(f"Training {name}...")

    if name == 'Linear Regression':  # fit on the original features
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    else:  # fit on the polynomial features
        model.fit(X_train_poly, y_train)
        y_pred = model.predict(X_test_poly)

    # Evaluate on the held-out test set
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"{name} Results:")
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}")
    print("-" * 20)

Training Linear Regression...
Linear Regression Results:
Mean Squared Error: 0.02019274616476381
R-squared: 0.21172766001539978
--------------------
Training Decision Tree Regression...
Decision Tree Regression Results:
Mean Squared Error: 0.024438969219936445
R-squared: 0.04596614563365131
--------------------
Training Random Forest Regression...
Random Forest Regression Results:
Mean Squared Error: 0.013098250232159112
R-squared: 0.48867834637445584
--------------------
Training Support Vector Regression...
Support Vector Regression Results:
Mean Squared Error: 0.02802046246973159
R-squared: -0.09384604442802136
--------------------
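The polynomial expansion used above adds squares and pairwise products of the input features. The sketch below shows the expansion for a two-feature toy row and counts the output columns for 13 inputs, the number of columns selected for `X`. The variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly_demo = PolynomialFeatures(degree=2, include_bias=False)

# Two features a=2, b=3 expand to [a, b, a^2, a*b, b^2].
expanded = poly_demo.fit_transform(np.array([[2.0, 3.0]]))
print(expanded)

# For 13 inputs: 13 linear terms + 13*14/2 = 91 degree-2 terms.
n_out = (
    PolynomialFeatures(degree=2, include_bias=False)
    .fit(np.zeros((1, 13)))
    .n_output_features_
)
print(n_out)
```

With roughly eight times as many columns as the original feature set, the expanded representation gives flexible models more room to overfit, so it is worth comparing against a run on the original features.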


**Task 05**

Train each model and measure its mean squared error (MSE) on the test set.

In [107]:
# Refit each model on the original (non-polynomial) features and report its test MSE

from sklearn.metrics import mean_squared_error

for name, model in regression_models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f'{name} MSE: {mse}')

Linear Regression MSE: 0.02019274616476381
Decision Tree Regression MSE: 0.027445556118823294
Random Forest Regression MSE: 0.013092930581662185
Support Vector Regression MSE: 0.02145548515218036
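MSE is expressed in squared units of the target, which makes it hard to read directly. Taking the square root gives the root mean squared error (RMSE), which is on the same 0 to 1 scale as `actual_productivity`. The sketch below converts the MSE values printed above and picks the model with the lowest error; the dictionary names are hypothetical.

```python
import math

# MSE values copied from the output above.
mse_by_model = {
    "Linear Regression": 0.02019274616476381,
    "Decision Tree Regression": 0.027445556118823294,
    "Random Forest Regression": 0.013092930581662185,
    "Support Vector Regression": 0.02145548515218036,
}

# RMSE is the square root of MSE, in the units of the target.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}
best = min(mse_by_model, key=mse_by_model.get)

for name, rmse in sorted(rmse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {rmse:.4f}")
print("Lowest MSE:", best)
```

On this split the random forest has the lowest error, with an RMSE of roughly 0.11. Because the ranking comes from a single train/test split, it could change with a different `random_state`.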
