# Homework 2

## Message from your Prof
> **Remember for the best way to learn this content, and maximize your learning experience, you must implement these models yourself and only use imports for checking your work**

You can only import classification_report from sklearn evaluation metrics (you do not need to implement classification report), but you have to implement the train_test_split, and the classifiers yourselves. You will need to use numpy or pandas as inputs for your models. You should only use the imports listed below to check your work.

Students that do not practice their own implementations **will be cooked** in their skill assessments. I do not want to hear students complaining they did poorly because the homeworks did not reflect the skill assessments. YOU HAVE BEEN WARNED!!!

<br>
In this assignment, we will be building a Naïve Bayes classifier and a SVM model for the productivity satisfaction of [the given dataset](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees), the productivity of garment employees.

## Background
The Garment Industry is one of the key examples of the industrial globalization of this modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies. So, it is highly desirable among the decision makers in the garments industry to track, analyse and predict the productivity performance of the working teams in their factories.

## Dataset Attribute Information

1. **date**: Date in MM-DD-YYYY
2. **day**: Day of the Week
3. **quarter** : A portion of the month. A month was divided into four quarters
4. **department** : Associated department with the instance
5. **team_no** : Associated team number with the instance
6. **no_of_workers** : Number of workers in each team
7. **no_of_style_change** : Number of changes in the style of a particular product
8. **targeted_productivity** : Targeted productivity set by the Authority for each team for each day.
9. **smv** : Standard Minute Value, it is the allocated time for a task
10. **wip** : Work in progress. Includes the number of unfinished items for products
11. **over_time** : Represents the amount of overtime by each team in minutes
12. **incentive** : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
13. **idle_time** : The amount of time when the production was interrupted due to several reasons
14. **idle_men** : The number of workers who were idle due to production interruption
15. **actual_productivity** : The actual % of productivity that was delivered by the workers. It ranges from 0-1.

### Libraries that can be used: numpy, scipy, pandas, scikit-learn, cvxpy, imbalanced-learn
Any libraries used in the discussion materials are also allowed.

#### Other Notes

 - Don't worry about not being able to achieve high accuracy, it is neither the goal nor the grading standard of this assignment. <br >
 - If not specified, you are not required to do hyperparameter tuning, but feel free to do so if you'd like.

#### Trouble Shooting
In case you have trouble installing and using imbalanced-learn(imblearn) <br >
Run the below code cell, then go to the selection bar at top: Kernel > Restart. <br >
Then try `import imblearn` to see if things work.

In [36]:
import platform
display(platform.system())
import os
file_download_link = 'https://www.dropbox.com/scl/fi/j1dxtqjerbdbl81e05nvp/hw4data.zip?rlkey=8mkxz4j8lngziok6on782a8u4&dl=0'
if os.name == 'nt':
    print('Please download your dataset here:', file_download_link)
else:
    # We need to first download the data here:
    !wget -O data.zip "$file_download_link" -o /dev/null
    !unzip data.zip > /dev/null
y

'Darwin'

replace garments_worker_productivity.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

OSError: [Errno 5] Input/output error

In [1]:
!sed 's/,/\t/g' garments_worker_productivity.csv > garments_worker_productivity.tsv

In [2]:
# If your data is on google drive then uncomment the code below to access
# your google drive.
#from google.colab import drive
#drive.mount('/content/drive')

In [3]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install imbalanced-learn delayed


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m


# Exercises

## Exercise 1 - General Data Preprocessing (20 points)

Our dataset needs cleaning before building any models. Some of the cleaning tasks are common in general, but depends on what kind of models we are building, sometimes we have to do additional processing. These additional tasks will be mentioned in each of the remaining two exercises later.

Note that **we will be using this processed data from exercise 1 in each of the remaining two exercises**.

For convenience, here are the attributes that we would treat them as **categorical attributes**: `day`, `quarter`, `department`, and `team`.

Realize that `quarter` is not referring to certain months of the year, but certain days of the month. "Quarter1" represents days 1-7, "Quarter2" days 8-14, "Quarter3" days 15-21, "Quarter4" days 22-28, and "Quarter5" days 29-31

 - Drop the column `date`.
 - For each of the categorical attributes, **print out** all the unique elements.
 - For each of the categorical attributes, remap the duplicated items, if you find there are typos or spaces among the duplicated items.
     - For example, "a" and "a " should be the same, so we need to update "a " to be "a".
     - Another example, "apple" and "appel" should be the same, so you should update "appel" to be "apple".
     

 - Create another column named `satisfied` that records the productivity performance. The behavior defined as follows. **This is the dependent variable we'd like to classify in this assignment.**
     - Return True or 1 if `actual_productivity` is equal to or greater than `targeted_productivity`. Otherwise, return False or 0, which means the team fails to meet the expected performance.
 - Drop the columns `actual_productivity` and `targeted_productivity`.


 - Find and **print out** which columns/attributes that have empty vaules, e.g., NA, NaN, null, None.
 - You must use `df.describe()` or `df.info()` to display the data after preprocessing. **No credit** will be given if this step is omitted.
 - Fill the empty values with 0.


In [2]:
import pandas as pd
# If put the data(.csv) under the same folder, you could use
df = pd.read_csv('garments_worker_productivity.csv')

In [3]:
df

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,Quarter1,sweing,Thursday,8,0.80,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.886500
2,1/1/2015,Quarter1,sweing,Thursday,11,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
3,1/1/2015,Quarter1,sweing,Thursday,12,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
4,1/1/2015,Quarter1,sweing,Thursday,6,0.80,25.90,1170.0,1920,50,0.0,0,0,56.0,0.800382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,3/11/2015,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0,0.628333
1193,3/11/2015,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0,0.625625
1194,3/11/2015,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0,0.625625
1195,3/11/2015,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0,0.505889


In [4]:
columns=list(df.columns[1:])
df=df[columns]
df

Unnamed: 0,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,Quarter1,sweing,Thursday,8,0.80,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.886500
2,Quarter1,sweing,Thursday,11,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
3,Quarter1,sweing,Thursday,12,0.80,11.41,968.0,3660,50,0.0,0,0,30.5,0.800570
4,Quarter1,sweing,Thursday,6,0.80,25.90,1170.0,1920,50,0.0,0,0,56.0,0.800382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,Quarter2,finishing,Wednesday,10,0.75,2.90,,960,0,0.0,0,0,8.0,0.628333
1193,Quarter2,finishing,Wednesday,8,0.70,3.90,,960,0,0.0,0,0,8.0,0.625625
1194,Quarter2,finishing,Wednesday,7,0.65,3.90,,960,0,0.0,0,0,8.0,0.625625
1195,Quarter2,finishing,Wednesday,9,0.75,2.90,,1800,0,0.0,0,0,15.0,0.505889


In [5]:
for i in ['day','quarter','department','team']:
    print(df[i].unique())

['Thursday' 'Saturday' 'Sunday' 'Monday' 'Tuesday' 'Wednesday']
['Quarter1' 'Quarter2' 'Quarter3' 'Quarter4' 'Quarter5']
['sweing' 'finishing ' 'finishing']
[ 8  1 11 12  6  7  2  3  9 10  5  4]


In [6]:
df['day']=df['day'].transform(lambda x: x.strip())
df['day']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['day']=df['day'].transform(lambda x: x.strip())


0        Thursday
1        Thursday
2        Thursday
3        Thursday
4        Thursday
          ...    
1192    Wednesday
1193    Wednesday
1194    Wednesday
1195    Wednesday
1196    Wednesday
Name: day, Length: 1197, dtype: object

In [7]:
df['department']=df['department'].transform(lambda x: 'sewing' if x=='sweing' else x.strip())
df['department']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['department']=df['department'].transform(lambda x: 'sewing' if x=='sweing' else x.strip())


0          sewing
1       finishing
2          sewing
3          sewing
4          sewing
          ...    
1192    finishing
1193    finishing
1194    finishing
1195    finishing
1196    finishing
Name: department, Length: 1197, dtype: object

In [8]:
df['satisfied']= df['actual_productivity']>=df['targeted_productivity']
df['satisfied']

0        True
1        True
2        True
3        True
4        True
        ...  
1192    False
1193    False
1194    False
1195    False
1196    False
Name: satisfied, Length: 1197, dtype: bool

In [9]:
df.drop(["actual_productivity", "targeted_productivity"], axis=1)

Unnamed: 0,quarter,department,day,team,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,satisfied
0,Quarter1,sewing,Thursday,8,26.16,1108.0,7080,98,0.0,0,0,59.0,True
1,Quarter1,finishing,Thursday,1,3.94,,960,0,0.0,0,0,8.0,True
2,Quarter1,sewing,Thursday,11,11.41,968.0,3660,50,0.0,0,0,30.5,True
3,Quarter1,sewing,Thursday,12,11.41,968.0,3660,50,0.0,0,0,30.5,True
4,Quarter1,sewing,Thursday,6,25.90,1170.0,1920,50,0.0,0,0,56.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,Quarter2,finishing,Wednesday,10,2.90,,960,0,0.0,0,0,8.0,False
1193,Quarter2,finishing,Wednesday,8,3.90,,960,0,0.0,0,0,8.0,False
1194,Quarter2,finishing,Wednesday,7,3.90,,960,0,0.0,0,0,8.0,False
1195,Quarter2,finishing,Wednesday,9,2.90,,1800,0,0.0,0,0,15.0,False


In [10]:
df['wip'].isna()

0       False
1        True
2       False
3       False
4       False
        ...  
1192     True
1193     True
1194     True
1195     True
1196     True
Name: wip, Length: 1197, dtype: bool

In [11]:
print('wip')

wip


In [12]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   quarter                1197 non-null   object 
 1   department             1197 non-null   object 
 2   day                    1197 non-null   object 
 3   team                   1197 non-null   int64  
 4   targeted_productivity  1197 non-null   float64
 5   smv                    1197 non-null   float64
 6   wip                    691 non-null    float64
 7   over_time              1197 non-null   int64  
 8   incentive              1197 non-null   int64  
 9   idle_time              1197 non-null   float64
 10  idle_men               1197 non-null   int64  
 11  no_of_style_change     1197 non-null   int64  
 12  no_of_workers          1197 non-null   float64
 13  actual_productivity    1197 non-null   float64
 14  satisfied              1197 non-null   bool   
dtypes: b

In [13]:
df['wip']=df['wip'].fillna(0)
df['wip']

0       1108.0
1          0.0
2        968.0
3        968.0
4       1170.0
         ...  
1192       0.0
1193       0.0
1194       0.0
1195       0.0
1196       0.0
Name: wip, Length: 1197, dtype: float64

## Exercise 2 - Naïve Bayes Classifier (40 points in total)

### Exercise 2.1 - Additional Data Preprocessing (10 points)

To build a Naïve Bayes Classifier, we need to further encode our categorical variables.

 - For each of the **categorical attributes**, encode the set of categories to be **0 ~ (n_classes - 1)**.
     - For example, \["paris", "paris", "tokyo", "amsterdam"\] should be encoded as \[1, 1, 2, 0\].
     - Note that the order does not really matter, i.e., \[0, 0, 1, 2\] also works. But you have to start with 0 in your encodings.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.
 - You **must** show the first five row of your encoded dataset, as well as the shape of your train test split. **No credit** will be given if this step is omitted.

In [14]:
df['day'].unique()

array(['Thursday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday'],
      dtype=object)

In [15]:
# Remember to continue the task with your processed data from Exercise 1
df['day']=df['day'].transform(lambda x: 0 if x == 'Thursday' else 1 if x == 'Saturday' else 2 if x == 'Sunday' else 3 if x == 'Monday' else 4 if x == 'Tuesday' else 5)
df['day']

0       0
1       0
2       0
3       0
4       0
       ..
1192    5
1193    5
1194    5
1195    5
1196    5
Name: day, Length: 1197, dtype: int64

In [16]:
df['quarter'].unique()

array(['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'Quarter5'],
      dtype=object)

In [17]:
df['quarter']=df['quarter'].transform(lambda x: 0 if x=='Quarter1' else 1 if x=='Quarter2' else 2 if x=='Quarter3' else 3 if x=='Quarter4' else 4)
df['quarter']

0       0
1       0
2       0
3       0
4       0
       ..
1192    1
1193    1
1194    1
1195    1
1196    1
Name: quarter, Length: 1197, dtype: int64

In [18]:
df['department'].unique()

array(['sewing', 'finishing'], dtype=object)

In [19]:
df['department']=df['department'].transform(lambda x: 0 if x=='sewing' else 1)
df['department'] 

0       0
1       1
2       0
3       0
4       0
       ..
1192    1
1193    1
1194    1
1195    1
1196    1
Name: department, Length: 1197, dtype: int64

In [20]:
poss=list(df['team'].unique())
df['team']=df['team'].transform(lambda x: poss.index(x))
df['team']

0       0
1       1
2       2
3       3
4       4
       ..
1192    9
1193    0
1194    5
1195    8
1196    4
Name: team, Length: 1197, dtype: int64

In [21]:


train_df = df.iloc[:int(len(df) * 0.8)]
test_df  = df.iloc[int(len(df) * 0.8):] 

print(train_df.shape)
print(test_df.shape)


(957, 15)
(240, 15)


In [22]:
train_df.head()

Unnamed: 0,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity,satisfied
0,0,0,0,0,0.8,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725,True
1,0,1,0,1,0.75,3.94,0.0,960,0,0.0,0,0,8.0,0.8865,True
2,0,0,0,2,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057,True
3,0,0,0,3,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057,True
4,0,0,0,4,0.8,25.9,1170.0,1920,50,0.0,0,0,56.0,0.800382,True


In [23]:
test_df.head()

Unnamed: 0,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity,satisfied
957,3,0,0,11,0.8,30.1,437.0,7080,32,0.0,0,2,59.0,0.495618,False
958,3,0,0,10,0.35,27.48,413.0,6840,38,0.0,0,1,57.0,0.449965,True
959,3,1,0,9,0.7,2.9,0.0,3360,0,0.0,0,0,8.0,0.410833,False
960,3,1,0,8,0.75,2.9,0.0,960,0,0.0,0,0,8.0,0.407813,False
961,3,0,0,1,0.35,26.66,1164.0,6600,23,0.0,0,2,55.0,0.378895,True


### Exercise 2.2 - Naïve Bayes Classifier for Categorical Attributes (15 points)

Use the categorical attributes **only**, please build a Categorical Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.

In [24]:
# Remember to do this task with your processed data from Exercise 2.1
X=train_df[['quarter','department','day','team','satisfied']]
options= {(i, j, k, l):0 for i in range(5) for j in range(2) for k in range(6) for l in range(12)}
options2=options.copy()
P_s1=0

In [25]:
P_s1=X[X['satisfied']==True].shape[0]/(X['satisfied'].shape[0])
P_s1

0.7314524555903866

In [26]:
X

Unnamed: 0,quarter,department,day,team,satisfied
0,0,0,0,0,True
1,0,1,0,1,True
2,0,0,0,2,True
3,0,0,0,3,True
4,0,0,0,4,True
...,...,...,...,...,...
952,3,0,0,0,True
953,3,0,0,9,True
954,3,0,0,6,False
955,3,1,0,3,False


In [27]:
options

{(0, 0, 0, 0): 0,
 (0, 0, 0, 1): 0,
 (0, 0, 0, 2): 0,
 (0, 0, 0, 3): 0,
 (0, 0, 0, 4): 0,
 (0, 0, 0, 5): 0,
 (0, 0, 0, 6): 0,
 (0, 0, 0, 7): 0,
 (0, 0, 0, 8): 0,
 (0, 0, 0, 9): 0,
 (0, 0, 0, 10): 0,
 (0, 0, 0, 11): 0,
 (0, 0, 1, 0): 0,
 (0, 0, 1, 1): 0,
 (0, 0, 1, 2): 0,
 (0, 0, 1, 3): 0,
 (0, 0, 1, 4): 0,
 (0, 0, 1, 5): 0,
 (0, 0, 1, 6): 0,
 (0, 0, 1, 7): 0,
 (0, 0, 1, 8): 0,
 (0, 0, 1, 9): 0,
 (0, 0, 1, 10): 0,
 (0, 0, 1, 11): 0,
 (0, 0, 2, 0): 0,
 (0, 0, 2, 1): 0,
 (0, 0, 2, 2): 0,
 (0, 0, 2, 3): 0,
 (0, 0, 2, 4): 0,
 (0, 0, 2, 5): 0,
 (0, 0, 2, 6): 0,
 (0, 0, 2, 7): 0,
 (0, 0, 2, 8): 0,
 (0, 0, 2, 9): 0,
 (0, 0, 2, 10): 0,
 (0, 0, 2, 11): 0,
 (0, 0, 3, 0): 0,
 (0, 0, 3, 1): 0,
 (0, 0, 3, 2): 0,
 (0, 0, 3, 3): 0,
 (0, 0, 3, 4): 0,
 (0, 0, 3, 5): 0,
 (0, 0, 3, 6): 0,
 (0, 0, 3, 7): 0,
 (0, 0, 3, 8): 0,
 (0, 0, 3, 9): 0,
 (0, 0, 3, 10): 0,
 (0, 0, 3, 11): 0,
 (0, 0, 4, 0): 0,
 (0, 0, 4, 1): 0,
 (0, 0, 4, 2): 0,
 (0, 0, 4, 3): 0,
 (0, 0, 4, 4): 0,
 (0, 0, 4, 5): 0,
 (0, 0, 4, 6): 0,
 (

In [28]:
for i in options.keys():
    P_x1_s1=X[(X['quarter']==i[0])&(X['satisfied']==False)].shape[0]/X[(X['satisfied']==False)].shape[0] if X[(X['satisfied']==False)].shape[0]>0 else 0
    P_x2_s1=X[(X['department']==i[1])&(X['satisfied']==False)].shape[0]/X[(X['satisfied']==False)].shape[0] if X[(X['satisfied']==False)].shape[0]>0 else 0
    P_x3_s1=X[(X['day']==i[2])&(X['satisfied']==False)].shape[0]/X[(X['satisfied']==False)].shape[0] if X[(X['satisfied']==False)].shape[0]>0 else 0
    P_x4_s1=X[(X['team']==i[3])&(X['satisfied']==False)].shape[0]/X[(X['satisfied']==False)].shape[0] if X[(X['satisfied']==False)].shape[0]>0 else 0
    joint = X[(X['quarter']==i[0])&(X['department']==i[1])&(X['day']==i[2])&(X['team']==i[3])].shape[0]/X.shape[0] if X.shape[0]>0 else 0
    options[i]=(P_x1_s1*P_x2_s1*P_x3_s1*P_x4_s1)/joint if joint >0 else 0 
options

{(0, 0, 0, 0): 1.4304751726383114,
 (0, 0, 0, 1): 0.39338067247553565,
 (0, 0, 0, 2): 0.7509994656351136,
 (0, 0, 0, 3): 0.4649044311074512,
 (0, 0, 0, 4): 0.8582851035829868,
 (0, 0, 0, 5): 1.1801420174266068,
 (0, 0, 0, 6): 0.7152375863191557,
 (0, 0, 0, 7): 0.3218569138436201,
 (0, 0, 0, 8): 1.001332620846818,
 (0, 0, 0, 9): 0.7152375863191557,
 (0, 0, 0, 10): 0.7867613449510713,
 (0, 0, 0, 11): 0.5721900690553247,
 (0, 0, 1, 0): 0.8030737811302802,
 (0, 0, 1, 1): 0.22084528981082704,
 (0, 0, 1, 2): 0.42161373509339706,
 (0, 0, 1, 3): 0.260998978867341,
 (0, 0, 1, 4): 0.48184426867816815,
 (0, 0, 1, 5): 0.6625358694324811,
 (0, 0, 1, 6): 0.4015368905651401,
 (0, 0, 1, 7): 0.18069160075431304,
 (0, 0, 1, 8): 0.5621516467911961,
 (0, 0, 1, 9): 0.4015368905651401,
 (0, 0, 1, 10): 0.4416905796216541,
 (0, 0, 1, 11): 0.32122951245211206,
 (0, 0, 2, 0): 1.3049948943367051,
 (0, 0, 2, 1): 0.35887359594259394,
 (0, 0, 2, 2): 0.6851223195267703,
 (0, 0, 2, 3): 0.4241233406594292,
 (0, 0, 2, 

In [29]:
options2

{(0, 0, 0, 0): 0,
 (0, 0, 0, 1): 0,
 (0, 0, 0, 2): 0,
 (0, 0, 0, 3): 0,
 (0, 0, 0, 4): 0,
 (0, 0, 0, 5): 0,
 (0, 0, 0, 6): 0,
 (0, 0, 0, 7): 0,
 (0, 0, 0, 8): 0,
 (0, 0, 0, 9): 0,
 (0, 0, 0, 10): 0,
 (0, 0, 0, 11): 0,
 (0, 0, 1, 0): 0,
 (0, 0, 1, 1): 0,
 (0, 0, 1, 2): 0,
 (0, 0, 1, 3): 0,
 (0, 0, 1, 4): 0,
 (0, 0, 1, 5): 0,
 (0, 0, 1, 6): 0,
 (0, 0, 1, 7): 0,
 (0, 0, 1, 8): 0,
 (0, 0, 1, 9): 0,
 (0, 0, 1, 10): 0,
 (0, 0, 1, 11): 0,
 (0, 0, 2, 0): 0,
 (0, 0, 2, 1): 0,
 (0, 0, 2, 2): 0,
 (0, 0, 2, 3): 0,
 (0, 0, 2, 4): 0,
 (0, 0, 2, 5): 0,
 (0, 0, 2, 6): 0,
 (0, 0, 2, 7): 0,
 (0, 0, 2, 8): 0,
 (0, 0, 2, 9): 0,
 (0, 0, 2, 10): 0,
 (0, 0, 2, 11): 0,
 (0, 0, 3, 0): 0,
 (0, 0, 3, 1): 0,
 (0, 0, 3, 2): 0,
 (0, 0, 3, 3): 0,
 (0, 0, 3, 4): 0,
 (0, 0, 3, 5): 0,
 (0, 0, 3, 6): 0,
 (0, 0, 3, 7): 0,
 (0, 0, 3, 8): 0,
 (0, 0, 3, 9): 0,
 (0, 0, 3, 10): 0,
 (0, 0, 3, 11): 0,
 (0, 0, 4, 0): 0,
 (0, 0, 4, 1): 0,
 (0, 0, 4, 2): 0,
 (0, 0, 4, 3): 0,
 (0, 0, 4, 4): 0,
 (0, 0, 4, 5): 0,
 (0, 0, 4, 6): 0,
 (

In [30]:
for i in options2.keys():
    P_x1_s1=X[(X['quarter']==i[0])&(X['satisfied']==True)].shape[0]/X[(X['satisfied']==True)].shape[0] if X[(X['satisfied']==True)].shape[0]>0 else 0
    P_x2_s1=X[(X['department']==i[1])&(X['satisfied']==True)].shape[0]/X[(X['satisfied']==True)].shape[0] if X[(X['satisfied']==True)].shape[0]>0 else 0
    P_x3_s1=X[(X['day']==i[2])&(X['satisfied']==True)].shape[0]/X[(X['satisfied']==True)].shape[0] if X[(X['satisfied']==True)].shape[0]>0 else 0
    P_x4_s1=X[(X['team']==i[3])&(X['satisfied']==True)].shape[0]/X[(X['satisfied']==True)].shape[0] if X[(X['satisfied']==True)].shape[0]>0 else 0
    joint = X[(X['quarter']==i[0])&(X['department']==i[1])&(X['day']==i[2])&(X['team']==i[3])].shape[0]/X.shape[0] if X.shape[0]>0 else 0 
    options2[i]=(P_x1_s1*P_x2_s1*P_x3_s1*P_x4_s1)/joint if joint>0 else 0 
options2

{(0, 0, 0, 0): 0.9330889185506036,
 (0, 0, 0, 1): 1.380194025356101,
 (0, 0, 0, 2): 0.9719676234902119,
 (0, 0, 0, 3): 1.2635579105372758,
 (0, 0, 0, 4): 1.0302856808996248,
 (0, 0, 0, 5): 0.9136495660807994,
 (0, 0, 0, 6): 1.3218759679466883,
 (0, 0, 0, 7): 1.2635579105372758,
 (0, 0, 0, 8): 1.0691643858392332,
 (0, 0, 0, 9): 1.1663611481882545,
 (0, 0, 0, 10): 0.9914069759600163,
 (0, 0, 0, 11): 1.3024366154768843,
 (0, 0, 1, 0): 0.9250450485630984,
 (0, 0, 1, 1): 1.3682958009995831,
 (0, 0, 1, 2): 0.9635885922532275,
 (0, 0, 1, 3): 1.2526651699291957,
 (0, 0, 1, 4): 1.0214039077884212,
 (0, 0, 1, 5): 0.9057732767180339,
 (0, 0, 1, 6): 1.3104804854643892,
 (0, 0, 1, 7): 1.2526651699291957,
 (0, 0, 1, 8): 1.05994745147855,
 (0, 0, 1, 9): 1.1563063107038731,
 (0, 0, 1, 10): 0.9828603640982921,
 (0, 0, 1, 11): 1.2912087136193249,
 (0, 0, 2, 0): 0.8767818286380674,
 (0, 0, 2, 1): 1.2969064548604745,
 (0, 0, 2, 2): 0.9133144048313201,
 (0, 0, 2, 3): 1.1873087262807163,
 (0, 0, 2, 4): 0.96

In [31]:
X=test_df[['quarter','department','day','team']]
row=X.iloc[0]
row.iloc[1]
y_test=test_df['satisfied']
type(y_test)

pandas.core.series.Series

In [32]:
ypred=[]
for i in range(len(X)):
    row=X.iloc[i]
    tup=(int(row.iloc[0]),int(row.iloc[1]),int(row.iloc[2]),int(row.iloc[3]))
    if options2[tup]>=0.5:
        ypred.append(1)
    else:
        ypred.append(0)

In [33]:
ypred = pd.Series(ypred)
type(ypred)

pandas.core.series.Series

In [34]:
from sklearn.metrics import classification_report

In [35]:
print(classification_report(y_test, ypred))

              precision    recall  f1-score   support

       False       0.71      0.15      0.25        65
        True       0.76      0.98      0.85       175

    accuracy                           0.75       240
   macro avg       0.74      0.57      0.55       240
weighted avg       0.75      0.75      0.69       240



### Exercise 2.3 - Naïve Bayes Classifier for Numerical Attributes (15 points)

Use the numerical attributes **only**, please build a Gaussian Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.

**Remember to scale your data. The scaling method is up to you.**

In [36]:
list(train_df.columns)

['quarter',
 'department',
 'day',
 'team',
 'targeted_productivity',
 'smv',
 'wip',
 'over_time',
 'incentive',
 'idle_time',
 'idle_men',
 'no_of_style_change',
 'no_of_workers',
 'actual_productivity',
 'satisfied']

In [45]:
# Remember to do this task with your processed data from Exercise 2.1
X= train_df[list(train_df.columns)[4:]]
X=X.drop('actual_productivity',axis=1)
X= X.drop('targeted_productivity',axis=1)
X

Unnamed: 0,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,satisfied
0,26.16,1108.0,7080,98,0.0,0,0,59.0,True
1,3.94,0.0,960,0,0.0,0,0,8.0,True
2,11.41,968.0,3660,50,0.0,0,0,30.5,True
3,11.41,968.0,3660,50,0.0,0,0,30.5,True
4,25.90,1170.0,1920,50,0.0,0,0,56.0,True
...,...,...,...,...,...,...,...,...,...
952,29.40,916.0,6960,56,0.0,0,2,58.0,True
953,21.82,1591.0,3240,0,0.0,0,1,52.0,True
954,30.33,398.0,6960,0,0.0,0,1,58.0,False
955,4.60,0.0,3780,0,0.0,0,0,9.0,False


In [None]:

# b=X['smv'].unique()
# c=X['wip'].unique()
# d=X['over_time'].unique()
# e=X['incentive'].unique()
# f=X['idle_time'].unique()
# g=X['idle_men'].unique()
# h= X['no_of_style_change'].unique()
# i = X['no_of_workers'].unique()
# options = {(x1, x2, x3, x4, x5, x6, x7, x8, x9): 0
#            for x1 in a
#            for x2 in b
#            for x3 in c
#            for x4 in d
#            for x5 in e
#            for x6 in f
#            for x7 in g
#            for x8 in h
#            for x9 in i}
# options


In [39]:
P_s1

0.7314524555903866

In [46]:
import numpy as np

In [49]:
temp=X[X['satisfied']==True]
temp

Unnamed: 0,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,satisfied
0,26.16,1108.0,7080,98,0.0,0,0,59.0,True
1,3.94,0.0,960,0,0.0,0,0,8.0,True
2,11.41,968.0,3660,50,0.0,0,0,30.5,True
3,11.41,968.0,3660,50,0.0,0,0,30.5,True
4,25.90,1170.0,1920,50,0.0,0,0,56.0,True
...,...,...,...,...,...,...,...,...,...
949,18.79,912.0,3960,45,0.0,0,0,33.0,True
950,29.40,1244.0,6840,45,0.0,0,2,57.0,True
951,18.79,1020.0,5640,45,0.0,0,1,52.0,True
952,29.40,916.0,6960,56,0.0,0,2,58.0,True


In [52]:

var_svm= np.sum((temp['smv']-temp['smv'].mean())**2)/temp['smv'].shape[0]
P_x1_s1=(1 / np.sqrt(2 * np.pi * var_svm)) * np.exp(-((temp['smv'] - temp['smv'].mean())**2) / (2 * var_svm))
var_wip= np.sum((temp['wip']-temp['wip'].mean())**2)/temp['wip'].shape[0]
P_x2_s1=(1 / np.sqrt(2 * np.pi * var_wip)) * np.exp(-((temp['wip'] - temp['wip'].mean())**2) / (2 * var_wip))
var_over_time=np.sum((temp['over_time']-temp['over_time'].mean())**2)/temp['over_time'].shape[0]
P_x3_s1=(1 / np.sqrt(2 * np.pi * var_over_time)) * np.exp(-((temp['over_time'] - temp['over_time'].mean())**2) / (2 * var_over_time))
var_incentive=np.sum((temp['incentive']-temp['incentive'].mean())**2)/temp['incentive'].shape[0]
P_x4_s1= (1 / np.sqrt(2 * np.pi * var_incentive)) * np.exp(-((temp['incentive'] - temp['incentive'].mean())**2) / (2 * var_incentive))
var_idle_time=np.sum((temp['idle_time']-temp['idle_time'].mean())**2)/temp['idle_time'].shape[0]
P_x5_s1= (1 / np.sqrt(2 * np.pi * var_idle_time)) * np.exp(-((temp['idle_time'] - temp['idle_time'].mean())**2) / (2 * var_idle_time))
var_idle_men=np.sum((temp['idle_men']-temp['idle_men'].mean())**2)/temp['idle_men'].shape[0]
P_x6_s1 = (1 / np.sqrt(2 * np.pi * var_idle_men)) * np.exp(-((temp['idle_men'] - temp['idle_men'].mean())**2) / (2 * var_idle_men))
var_no_of_style_change= np.sum((temp['no_of_style_change']-temp['no_of_style_change'].mean())**2)/temp['no_of_style_change'].shape[0]
P_x7_s1=(1 / np.sqrt(2 * np.pi * var_no_of_style_change)) * np.exp(-((temp['no_of_style_change'] - temp['no_of_style_change'].mean())**2) / (2 * var_no_of_style_change))
var_no_of_workers=np.sum((temp['no_of_workers']-temp['no_of_workers'].mean())**2)/temp['no_of_workers'].shape[0]
P_x8_s1= (1 / np.sqrt(2 * np.pi * var_no_of_workers)) * np.exp(-((temp['no_of_workers'] - temp['no_of_workers'].mean())**2) / (2 * var_no_of_workers))

In [53]:
temp2=X[X['satisfied']==False]
temp2

Unnamed: 0,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,satisfied
11,19.31,578.0,6480,45,0.0,0,0,54.0,False
12,11.41,668.0,3660,50,0.0,0,0,30.5,False
14,2.90,0.0,960,0,0.0,0,0,8.0,False
15,3.94,0.0,2160,0,0.0,0,0,18.0,False
16,2.90,0.0,960,0,0.0,0,0,8.0,False
...,...,...,...,...,...,...,...,...,...
941,2.90,0.0,960,0,0.0,0,0,8.0,False
942,2.90,0.0,960,0,0.0,0,0,8.0,False
954,30.33,398.0,6960,0,0.0,0,1,58.0,False
955,4.60,0.0,3780,0,0.0,0,0,9.0,False


In [55]:

var_svm= np.sum((temp2['smv']-temp2['smv'].mean())**2)/temp2['smv'].shape[0]
P_x1_s0=(1 / np.sqrt(2 * np.pi * var_svm)) * np.exp(-((temp2['smv'] - temp2['smv'].mean())**2) / (2 * var_svm))
var_wip= np.sum((temp2['wip']-temp2['wip'].mean())**2)/temp2['wip'].shape[0]
P_x2_s0=(1 / np.sqrt(2 * np.pi * var_wip)) * np.exp(-((temp2['wip'] - temp2['wip'].mean())**2) / (2 * var_wip))
var_over_time=np.sum((temp2['over_time']-temp2['over_time'].mean())**2)/temp2['over_time'].shape[0]
P_x3_s0=(1 / np.sqrt(2 * np.pi * var_over_time)) * np.exp(-((temp2['over_time'] - temp2['over_time'].mean())**2) / (2 * var_over_time))
var_incentive=np.sum((temp2['incentive']-temp2['incentive'].mean())**2)/temp2['incentive'].shape[0]
P_x4_s0= (1 / np.sqrt(2 * np.pi * var_incentive)) * np.exp(-((temp2['incentive'] - temp2['incentive'].mean())**2) / (2 * var_incentive))
var_idle_time=np.sum((temp2['idle_time']-temp2['idle_time'].mean())**2)/temp2['idle_time'].shape[0]
P_x5_s0= (1 / np.sqrt(2 * np.pi * var_idle_time)) * np.exp(-((temp2['idle_time'] - temp2['idle_time'].mean())**2) / (2 * var_idle_time))
var_idle_men=np.sum((temp2['idle_men']-temp2['idle_men'].mean())**2)/temp2['idle_men'].shape[0]
P_x6_s0 = (1 / np.sqrt(2 * np.pi * var_idle_men)) * np.exp(-((temp2['idle_men'] - temp2['idle_men'].mean())**2) / (2 * var_idle_men))
var_no_of_style_change= np.sum((temp2['no_of_style_change']-temp2['no_of_style_change'].mean())**2)/temp2['no_of_style_change'].shape[0]
P_x7_s0=(1 / np.sqrt(2 * np.pi * var_no_of_style_change)) * np.exp(-((temp2['no_of_style_change'] - temp2['no_of_style_change'].mean())**2) / (2 * var_no_of_style_change))
var_no_of_workers=np.sum((temp2['no_of_workers']-temp2['no_of_workers'].mean())**2)/temp2['no_of_workers'].shape[0]
P_x8_s0= (1 / np.sqrt(2 * np.pi * var_no_of_workers)) * np.exp(-((temp2['no_of_workers'] - temp2['no_of_workers'].mean())**2) / (2 * var_no_of_workers))

In [57]:
P_s0=X[X['satisfied']==True].shape[0]/(X['satisfied'].shape[0])

In [60]:
((P_s1*P_x1_s1*P_x2_s1*P_x3_s1*P_x4_s1*P_x5_s1*P_x6_s1*P_x7_s1*P_x8_s1)/((P_s1*P_x1_s1*P_x2_s1*P_x3_s1*P_x4_s1*P_x5_s1*P_x6_s1*P_x7_s1*P_x8_s1)+(P_s0*P_x1_s0*P_x2_s0*P_x3_s0*P_x4_s0*P_x5_s0*P_x6_s0*P_x7_s0*P_x8_s0))).unique()


array([0.5])

## Exercies 3 - SVM Classifier (35 points in total)

### Exercise 3.1 - Additional Data Preprocessing (10 points)

To build a SVM Classifier, we need a different encoding for our categorical variables.

 - For each of the **categorical attributes**, encode them with **one-hot encoding**.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.
 - You **must** show the first five row of your encoded dataset, as well as the shape of your train test split. **No credit** will be given if this step is omitted.

In [40]:
# Remember to continue the task with your processed data from Exercise 1

### Exercise 3.2 - SVM with Different Kernels (15 points)

Using all the attributes we have, please build a SVM that predicts the column `satisfied`. <br >
Specifically, please
 - Build one SVM with **linear kernel**.
 - Build another SVM but with **rbf kernel**.
 - Report the **testing results** of **both models** using `classification report`.

The kernel is the only setting requirement. <br >
Other hyperparameter tuning is not required. But make sure they are the same in these two SVMs if you'd like to tune the model. In other words, the only difference between the two SVMs should be the kernel setting.

**Remember to scale your data. The scaling method is up to you.**

In [41]:
# Remember to do this task with your processed data from Exercise 3.1

### Exercise 3.3 - SVM with Over-sampling (10 points)
 - For the column `satisfied` in our **training set**, please **print out** the frequency of each class.
 - Oversample the **training data**.
 - For the column `satisfied` in the oversampled data, **print out** the frequency of each class  again.
 - Re-build the 2 SVMs with the same setting you have in Exercise 3.2, but **use oversampled training data** instead.
     - Do not forget to scale the data first. As always, the scaling method is up to you.
 - Report the **testing result** with `classification_report`.

You can use ANY methods listed on [here](https://imbalanced-learn.org/stable/references/over_sampling.html#) such as RandomOverSampler or SMOTE. <br >
You are definitely welcomed to build your own oversampler. <br >

Note that you do not have to over-sample your testing data.

In [42]:
# Remember to do this task with your processed data from Exercise 3.1

## 4) Collaborative Statement (5 points)
#### You must fill this out even if you worked alone to get credit.

It is mandatory to include a Statement of Collaboration in each submission, that follows the guidelines below.
Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.
All students are required to follow the academic honesty guidelines posted on the course website. For
programming assignments in particular, I encourage students to organize (perhaps using Piazza) to discuss the
task descriptions, requirements, possible bugs in the support code, and the relevant technical content before they
start working on it. However, you should not discuss the specific solutions, and as a guiding principle, you are
not allowed to take anything written or drawn away from these discussions (no photographs of the blackboard,
written notes, referring to Piazza, etc.). Especially after you have started working on the assignment, try to restrict
the discussion to Piazza as much as possible, so that there is no doubt as to the extent of your collaboration.

Even if you did not use any outside resources or collaborate with anyone, please state that explicitly in the space below.