## PHME 2022 Data Challenge

This is the skeleton of the Jupyter Notebook that you have to fill in.
The Notebook must define three functions, one for each classification task of the challenge.
They are `classification_1`, `classification_2` and `classification_3`, and they must solve tasks 1, 2 and 3 of the challenge, respectively.

**Automatic Scoring and Leader Board:** 
For each participant, we will consider the file located at `data-challenge-phme/solution.ipynb` as the proposed solution. We will execute the Notebook and then invoke the functions with the test data.
We will evaluate the output of the functions and compute the performance scores from it.

**Note:** if the execution of the notebook leads to an error or an exception, the functions will not be defined and we will not evaluate the performance of your solution.

**Note:** the notebook must have a reasonable execution time. We will not evaluate notebooks requiring more than **10 minutes** to be executed.
If you want to train complex models that need more time, do so in a separate notebook and load only the pre-trained models in `solution.ipynb`.
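
One way to follow this split is to serialize the trained model in the training notebook and only deserialize it in `solution.ipynb`. The sketch below uses Python's standard `pickle` module; the file name `model_task1.pkl`, the helper names `save_model` and `load_model`, and the dictionary standing in for a trained model are illustrative assumptions, not part of the challenge material.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a trained model to disk (run this in the training notebook)."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a previously saved model (run this in solution.ipynb)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in for a real trained model: any picklable Python object works.
trained_model = {"volume_threshold": 80.0}

# A temporary directory keeps this demo self-contained; in practice the file
# would live next to solution.ipynb (hypothetical name: model_task1.pkl).
model_path = Path(tempfile.mkdtemp()) / "model_task1.pkl"
save_model(trained_model, model_path)

restored_model = load_model(model_path)
```

Loading a pickled model typically takes far less time than training it, which helps keep the notebook within the time limit. Only unpickle files that you created yourself.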

In [1]:
# The input is the SPI data in the form of a Pandas DataFrame, exactly as it is read with pd.read_csv()
# The output must be the list of predicted defects. Each defect is a tuple (Panel, Figure, Component)

def classification_1 (spi):
    
    defects = [
        ("26319086100520102844", "4", "R20"),
        ("25319072500520102844", "1", "U5"),
        ("25319072500520102844", "2", "U1"),
    ]
    return defects
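
The hard-coded list above is only a placeholder. Assuming the SPI DataFrame exposes `PanelID`, `FigureID` and `ComponentID` columns (the scoring code uses these names for the AOI data; their presence in the SPI data is an assumption here), a boolean mask of suspicious rows could be turned into the required list of tuples roughly as follows. The helper name `rows_to_defects`, the `Volume` column and the threshold rule are all hypothetical.

```python
import pandas as pd


def rows_to_defects(spi, mask):
    """Return unique (Panel, Figure, Component) string tuples for flagged rows."""
    key_columns = ["PanelID", "FigureID", "ComponentID"]  # assumed column names
    flagged = spi.loc[mask, key_columns].drop_duplicates()
    # The scoring code compares string tuples, so cast every field to str.
    return [tuple(str(v) for v in row) for row in flagged.itertuples(index=False)]


# Toy data standing in for the real SPI file.
toy_spi = pd.DataFrame({
    "PanelID": [101, 101, 102],
    "FigureID": [1, 1, 2],
    "ComponentID": ["R20", "R20", "U5"],
    "Volume": [40.0, 45.0, 99.0],  # hypothetical measurement column
})

# Hypothetical rule: flag pads whose volume is below 50.
defects = rows_to_defects(toy_spi, toy_spi["Volume"] < 50.0)
```

Dropping duplicates before building the list keeps the output compact; the scoring code converts the result to a set in any case.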

In [25]:
# The first input is the SPI data in the form of a Pandas DataFrame, exactly as it is read with pd.read_csv()
# The second input is the AOI data. OperatorLabel and RepairLabel are not included, as you must predict OperatorLabel
# The output must be the classification result. Each entry is a tuple (Panel, Figure, Component, PredictedOperatorLabel)

def classification_2 (spi, aoi):
    
    predicted = [
        ("26319044800520102844", "2", "C31", "Good"),
        ("25319072500520102844", "1", "U2", "Good"),
        ("26319063400520102844", "6", "L2", "Bad"),
    ]
    return predicted

In [13]:
# The first input is the SPI data in the form of a Pandas DataFrame, exactly as it is read with pd.read_csv()
# The second input is the AOI data. RepairLabel is not included, as you must predict it
# The output must be the classification result. Each entry is a tuple (Panel, Figure, Component, PredictedRepairLabel)

def classification_3 (spi, aoi):
    
    predicted = [
        ("26319063400520102844", "6", "L2", "FalseScrap"),
        ("25319072500520102844", "1", "U3", "NotPossibleToRepair"),
        ("25319072500520102844", "2", "U1", "NotPossibleToRepair"),
    ]
    return predicted
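
Because a malformed return value can silently lower the score, it may help to sanity-check the output of each function before submitting. The checker below is a hypothetical helper, not part of the official scoring code; the label sets are taken from the placeholder outputs above and from the scoring code (`Good`/`Bad` for task 2, `FalseScrap`/`NotPossibleToRepair` for task 3).

```python
def check_predictions(rows, arity, allowed_labels=None):
    """Raise ValueError if any row has the wrong shape or an unexpected label."""
    for row in rows:
        if not isinstance(row, tuple) or len(row) != arity:
            raise ValueError(f"expected a tuple of length {arity}, got {row!r}")
        if allowed_labels is not None and row[-1] not in allowed_labels:
            raise ValueError(f"unexpected label {row[-1]!r} in {row!r}")
    return True


# Rows copied from the placeholder functions above.
task1_rows = [("26319086100520102844", "4", "R20")]
task2_rows = [("26319044800520102844", "2", "C31", "Good")]
task3_rows = [("26319063400520102844", "6", "L2", "FalseScrap")]

ok_1 = check_predictions(task1_rows, arity=3)
ok_2 = check_predictions(task2_rows, arity=4, allowed_labels={"Good", "Bad"})
ok_3 = check_predictions(task3_rows, arity=4,
                         allowed_labels={"FalseScrap", "NotPossibleToRepair"})
```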

## Test the code

The following code lets you test whether your notebook handles the data correctly.

We will use very similar code to run your Notebook and build the leaderboard.

In [35]:
from sklearn.metrics import classification_report
import pandas as pd
import statistics
import math
import glob

dfs = []
for f in glob.glob("data/SPI_*.csv.zip"):
    dfs.append(pd.read_csv(f))
SPI = pd.concat(dfs)

dfs = []
for f in glob.glob("data/AOI_*.csv.zip"):
    dfs.append(pd.read_csv(f))
AOI = pd.concat(dfs)


In [36]:
results_1 = set(classification_1(SPI))
results_2 = classification_2(SPI, AOI[["PanelID","FigureID","MachineID","ComponentID","PinNumber","AOILabel"]])
results_3 = classification_3(SPI, AOI[["PanelID","FigureID","MachineID","ComponentID","PinNumber","AOILabel","OperatorLabel"]])

# Performance Task 1
groundtruth_1  = {tuple( [str(f) for f in e] ) for e in AOI[["PanelID","FigureID","ComponentID"]].values}
precision_1    = len(results_1&groundtruth_1)/len(results_1) if len(results_1) > 0 else 0
recall_1       = len(results_1&groundtruth_1)/len(groundtruth_1) if len(groundtruth_1) > 0 else 0
f1_1           = 2*precision_1*recall_1/(precision_1+recall_1) if precision_1+recall_1 > 0 else 0

# Performance Task 2
results_dict_2 = { (str(p), str(f), str(c)):l for p, f, c, l in results_2}
validationdata_2 = []
for t in AOI.drop_duplicates(subset=["PanelID","FigureID","ComponentID"], keep="first").itertuples():
    predicted = results_dict_2.get(( str(t.PanelID), str(t.FigureID), str(t.ComponentID)), "-" )
    validationdata_2.append((t.PanelID, t.FigureID, t.ComponentID, t.OperatorLabel, predicted))
validationdata_2 = pd.DataFrame(validationdata_2, columns = ["PanelID","FigureID","ComponentID", "Real", "Predicted"]) 
f1_2 = classification_report(validationdata_2["Real"], validationdata_2["Predicted"],output_dict=True)["Bad"]["f1-score"]

# Performance Task 3
results_dict_3 = { (str(p), str(f), str(c)):l for p, f, c, l in results_3}
validationdata_3 = []
for t in AOI[AOI["RepairLabel"].isin({"FalseScrap","NotPossibleToRepair"})]\
        .drop_duplicates(subset=["PanelID","FigureID","ComponentID"], keep="first").itertuples():
    predicted = results_dict_3.get(( str(t.PanelID), str(t.FigureID), str(t.ComponentID)), "-" )
    validationdata_3.append((t.PanelID, t.FigureID, t.ComponentID, t.RepairLabel, predicted))
validationdata_3 = pd.DataFrame(validationdata_3, columns = ["PanelID","FigureID","ComponentID", "Real", "Predicted"]) 
cr = classification_report(validationdata_3["Real"], validationdata_3["Predicted"],output_dict=True)
f1_3 = (cr["FalseScrap"]["f1-score"] + cr["NotPossibleToRepair"]["f1-score"])/2

print("F1 Score Task 1:", f1_1)
print("F1 Score Task 2:", f1_2)
print("F1 Score Task 3:", f1_3)
print("Final Score:", statistics.mean([f1_1, f1_2, f1_3]))

F1 Score Task 1: 7.268234182505361e-05
F1 Score Task 2: 0.004842615012106537
F1 Score Task 3: 0.008130081300813009
Final Score: 0.004348459551581533
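
The Task 1 score is a set-based F1: predictions and ground truth are compared as sets of `(Panel, Figure, Component)` tuples. The following self-contained sketch applies the same formulas to made-up tuples so that the arithmetic can be checked by hand; the helper name `set_f1` and the example tuples are illustrative only.

```python
def set_f1(predicted, truth):
    """F1 score between two sets, using the same formulas as the Task 1 scoring."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0
    recall = hits / len(truth) if truth else 0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0


predicted = {("P1", "1", "R20"), ("P1", "2", "U5")}
truth = {("P1", "1", "R20"), ("P2", "1", "C3"), ("P2", "2", "L2"), ("P3", "1", "U1")}

# One hit: precision = 1/2, recall = 1/4, F1 = 2 * 0.5 * 0.25 / 0.75 = 1/3
score = set_f1(predicted, truth)
```

The final score printed above is the arithmetic mean of the three per-task F1 values, so each task carries equal weight.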
