# PRODIGY RECIPES

#### Idea: This notebook automates the generation of the Prodigy recipe commands, including the dataset names and file paths.

#### Author: Vaishnavi Kandala

#### Date: 7th Jan 2022

#### Note: Make sure each recipe component is set correctly before running the recipe, and that the right PORT is used.

In [72]:
import pandas as pd
import re

In [73]:
path = "/home/azureuser/vaishnavi/trends_&_innovation_classes.tsv"

In [74]:
df_categories = pd.read_csv(path,sep="\t")

FileNotFoundError: [Errno 2] No such file or directory: '/home/azureuser/vaishnavi/trends_&_innovation_classes.tsv'

In [75]:
df_categories

| Unnamed: 0 | Innovation/Trend | Specific Innovation/Trend |
|---|---|---|
| 0 | Innovation | 3D printed clothes |
| 1 | Innovation | 3D printing |
| 2 | Innovation | Autonomous transport |
| 3 | Innovation | Biking |
| 4 | Innovation | Capsule wardrobe |
| ... | ... | ... |
| 994 | | |
| 995 | | |
| 996 | | |
| 997 | | |


In [19]:
def category_name_processing(category_name):
    category_name = category_name.lower()
    category_name = re.sub(" ","_",category_name)
    return category_name

category_name_processing("3D printed clothes")
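
The helper above only lowercases the name and replaces spaces. Category names containing characters such as "&" or "/" would pass straight into dataset names and file paths. A minimal sketch of a stricter variant follows, assuming it is acceptable to collapse every run of non-alphanumeric characters into a single underscore. The name `category_name_slug` is hypothetical and not part of the original notebook.

```python
import re

def category_name_slug(category_name):
    """Hypothetical stricter variant of category_name_processing."""
    # Lowercase and trim surrounding whitespace.
    name = category_name.strip().lower()
    # Collapse each run of characters other than a-z and 0-9 into one underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    # Drop underscores left over at either end.
    return name.strip("_")

print(category_name_slug("3D printed clothes"))   # 3d_printed_clothes
print(category_name_slug("Trends & Innovation"))  # trends_innovation
```

For names made only of letters, digits, and single spaces, this returns the same result as `category_name_processing`.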

## General CONFIG

In [66]:
category_name = "Artificial Intelligence"
category_name_processed = category_name_processing(category_name)
log_path = "/home/azureuser/christine/logs/"
patterns_path = "/home/azureuser/datadrive/code/christine/patterns/"
raw_data_path = "/home/azureuser/christine/data/"
model_path = "/home/azureuser/christine/model/"

PORTS ASSIGNED

PORT = 8084 path = 
PORT = 8086 path. 


### RECIPE 1

In [None]:
### TASK 1 - RECIPE 1


# Recipe 1 - Get suggestions for keywords relevant to a particular category. We use a pre-trained model for these suggestions because our categories are generic. The suggestions from this model look good, but this can be explored further.

# PRODIGY_PORT=8084 PRODIGY_CONFIG_OVERRIDES='{"validate": false}' nohup prodigy sense2vec.teach AI_Seed_Words model/s2v_old --seeds "artificial intelligence, machine learning, natural language processing, supervised learning, deep learning" > AIseedsOut.out 2>&1 &

# Description:

# PRODIGY_PORT - please use the specific port assigned to your task.
# PRODIGY_CONFIG_OVERRIDES - overrides generic settings in the configuration. Here "validate" is set to false so that no input is dropped because of its format.
# sense2vec.teach - the Prodigy recipe.
# AI_Seed_Words - name of the dataset.
# model/s2v_old - the selected sense2vec model (vectors) used to generate the suggestions.
# seeds - seed words for the category.
# output location - AIseedsOut.out is the name of the output (log) file.

In [61]:
port_value_task1 = str(8084)
task1_name = category_name_processed + "_keywords"
seed_words = "'artificial intelligence, machine learning, natural language processing, supervised learning, deep learning'"
task1_log = log_path + task1_name + ".out"

### RECIPE Generation

"PRODIGY_PORT=" + port_value_task1  + \
" nohup prodigy sense2vec.teach " + task1_name + \
" model/s2v_old --seeds " + seed_words + " > " + task1_log + " 2>&1 &"


"PRODIGY_PORT=8084 nohup prodigy sense2vec.teach artificial_intelligence_keywords model/s2v_old --seeds 'artificial intelligence, machine learning, natural language processing, supervised learning, deep learning' > /home/azureuser/christine/logs/artificial_intelligence_keywords.out 2>&1 &"

#### GENERATE PATTERNS

In [60]:
### CONVERT SEED WORDS TO PATTERNS
task1_patterns = patterns_path + task1_name + ".jsonl"

"prodigy terms.to-patterns " +  task1_name + " " + task1_patterns + " --label " + category_name_processed

'prodigy terms.to-patterns artificial_intelligence_keywords /home/azureuser/datadrive/code/christine/patterns/artificial_intelligence_keywords.jsonl --label artificial_intelligence'
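
The recipe 1 and pattern commands above are assembled by plain string concatenation, with the quotes around the seed words embedded by hand. A minimal sketch of the same assembly using `shlex.quote` from the standard library follows, so that spaces or special characters in names and paths are quoted for the shell. The helper names `build_sense2vec_teach_command` and `build_terms_to_patterns_command` are hypothetical. The command layout is copied from the strings in this notebook, not checked against the Prodigy documentation.

```python
import shlex

def build_sense2vec_teach_command(port, dataset, vectors_path, seeds, log_file):
    """Hypothetical helper: assemble the recipe 1 shell command as a string."""
    # shlex.quote adds single quotes only where the shell needs them.
    seeds_arg = shlex.quote(", ".join(seeds))
    return (
        f"PRODIGY_PORT={int(port)} nohup prodigy sense2vec.teach "
        f"{shlex.quote(dataset)} {shlex.quote(vectors_path)} "
        f"--seeds {seeds_arg} > {shlex.quote(log_file)} 2>&1 &"
    )

def build_terms_to_patterns_command(dataset, patterns_file, label):
    """Hypothetical helper: assemble the terms.to-patterns command as a string."""
    return (
        f"prodigy terms.to-patterns {shlex.quote(dataset)} "
        f"{shlex.quote(patterns_file)} --label {shlex.quote(label)}"
    )

teach_command = build_sense2vec_teach_command(
    port=8084,
    dataset="artificial_intelligence_keywords",
    vectors_path="model/s2v_old",
    seeds=["artificial intelligence", "machine learning"],
    log_file="/home/azureuser/christine/logs/artificial_intelligence_keywords.out",
)
patterns_command = build_terms_to_patterns_command(
    dataset="artificial_intelligence_keywords",
    patterns_file="/home/azureuser/datadrive/code/christine/patterns/artificial_intelligence_keywords.jsonl",
    label="artificial_intelligence",
)
print(teach_command)
print(patterns_command)
```

The functions only return strings. As in the rest of the notebook, running the commands is left to the user.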

### RECIPE 2

In [None]:
# Recipe 2 - Train an initial model by selecting the sentences that contain the keywords generated in the previous recipe. The more exhaustive the keywords, the better the model that can be trained.

#PRODIGY_PORT=8084 nohup prodigy textcat.teach AI_Words en_core_web_sm prodigyData/AI_TrainingData.jsonl --label "Artificial Intelligence" --patterns prodigyData/AI_Seed_Word_Patterns.jsonl > AIwordsOut.out 2>&1 &

# Description:

# PRODIGY_PORT - use the specific port for the task.
# textcat.teach - prodigy recipe to train a base model
# AI_Words - the name of the dataset
# en_core_web_sm - spaCy pipeline used as the base model.
# AI_TrainingData.jsonl - training data in JSONL format (use the GitHub function to generate a Prodigy-formatted dataset).
# label - label for this call.
# patterns - patterns built from the recipe 1 keywords, used to select category-specific sentences.
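
The JSONL input mentioned above is one JSON object per line, each with a "text" key. The GitHub function referred to in the description is not shown in this notebook, so what follows is only a minimal stand-in sketch. The helper name `write_prodigy_jsonl`, the example sentences, and the output file name are all made up for illustration.

```python
import json
import os
import tempfile

def write_prodigy_jsonl(texts, output_path):
    """Hypothetical helper: write one {"text": ...} JSON object per line."""
    written = 0
    with open(output_path, "w", encoding="utf-8") as handle:
        for text in texts:
            text = text.strip()
            if not text:
                continue  # skip empty or whitespace-only sentences
            handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            written += 1
    return written

# Illustrative sentences only; real input would come from the raw data files.
sentences = [
    "Deep learning models need large datasets.",
    "  ",
    "Biking is popular in cities.",
]
demo_path = os.path.join(tempfile.gettempdir(), "example_raw_data.jsonl")
count = write_prodigy_jsonl(sentences, demo_path)
print(count, "records written to", demo_path)
```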

In [65]:
port_value_task2 = str(8089)
task2_name = category_name_processed + "_sentences"
task2_data_path = raw_data_path + category_name_processed + "_raw_data.jsonl"
task2_log = log_path + task2_name + ".out"

"PRODIGY_PORT=" + port_value_task2  + \
" nohup prodigy textcat.teach " +  task2_name + \
" en_core_web_sm "+ task2_data_path + \
" --label " + category_name_processed + \
" --patterns " + task1_patterns + \
" > " + task2_log + " 2>&1 &"

'PRODIGY_PORT=8089 nohup prodigy textcat.teach artificial_intelligence_sentences en_core_web_sm /home/azureuser/christine/data/artificial_intelligence_raw_data.jsonl --label artificial_intelligence --patterns /home/azureuser/datadrive/code/christine/patterns/artificial_intelligence_keywords.jsonl > /home/azureuser/christine/logs/artificial_intelligence_sentences.out 2>&1 &'

### RECIPE 3

In [None]:
# Recipe 3 - Retrain the model for the category using the annotations gathered in recipe 2.

#PRODIGY_PORT=8084 prodigy train --textcat-multilabel AI_Words --label-stats --verbose --eval-split 0.2 model/AI-model

# Description:

# PRODIGY_PORT - use the specific port for the task.
# --textcat-multilabel - retrains the model with the annotation data.
# AI_Words - dataset holding the annotations collected in recipe 2.
# --eval-split 0.2 - 80% train, 20% test.
# model/AI-model - output directory for the trained model.
# This recipe reports the performance (Precision / Recall / F-Score) for the class.
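
As a reminder of how the reported scores relate to each other, here is a minimal sketch that computes precision, recall, and F-score for a single class from raw counts. It is not how Prodigy computes its report internally. The function name `precision_recall_f1` is hypothetical and the counts are made up for illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 for one class from raw counts."""
    predicted = true_positives + false_positives   # everything labeled positive
    actual = true_positives + false_negatives      # everything truly positive
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Illustrative counts on a held-out 20% split (not real results).
precision, recall, f1 = precision_recall_f1(
    true_positives=40, false_positives=10, false_negatives=20
)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")  # P=0.800 R=0.667 F=0.727
```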

In [67]:
port_value_task2 = str(8089)
task3_model_path = model_path + category_name_processed + "_model"

"PRODIGY_PORT=" + port_value_task2  + \
" nohup prodigy train --textcat-multilabel " + \
task2_name + " --label-stats --verbose --eval-split 0.2 " + \
task3_model_path


'PRODIGY_PORT=8089 nohup prodigy train --textcat-multilabel artificial_intelligence_sentences --label-stats --verbose --eval-split 0.2 /home/azureuser/christine/model/artificial_intelligence_model'