# ***Classify Jutsus***

In this notebook we build a classifier for the jutsus of Naruto Shippuden: we train a model to predict the jutsu type from the jutsu description. The notebook covers the following steps:

1. Load Dataset
2. Clean Text
3. Encode Labels
4. Split Dataset
5. Train Model
6. Evaluate Model


### ***Load Dataset***

In this section we load the dataset from a JSON Lines file produced by scraping the [Naruto Shippuden Wiki](https://naruto.fandom.com/wiki/Naruto_Shippuden_Wiki).

**How are we going to do it?**
* We will use the pandas library to read the json file.
* We will use the sklearn library to split the dataset into training and testing sets.
* We will use the transformers library to tokenize the text.

In [121]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split  # Used to split the dataset into training and testing sets.
from transformers import AutoTokenizer
from datasets import Dataset

In [122]:
data_path = "../data/jutsus.jsonl"
df = pd.read_json(data_path, lines=True)
df.head()

Unnamed: 0,jutsu_name,jutsu_type,jutsu_description
0,10 Hit Combo,Taijutsu,Lars punches the opponent before striking them...
1,Acid Permeation,Ninjutsu,Utakata blows acidic bubbles from his pipe tha...
2,Accelerated Armed Revolving Heaven,"Kekkei Genkai, Hiden, Ninjutsu, Fūinjutsu, Tai...",Tenten unseals several weapons from her scroll...
3,Adamantine Prison Wall,"Ninjutsu, Clone Techniques, Bukijutsu","After using Transformation: Adamantine Staff, ..."
4,Adamantine Sealing Chains,"Hiden, Ninjutsu, Fūinjutsu, Barrier Ninjutsu",This is a sealing technique that is characteri...
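
The `lines=True` argument tells pandas that the file is in JSON Lines format, with one JSON object per line, rather than a single JSON array. As a minimal sketch of what that means, the standard-library snippet below parses the same format by hand. The two records are made-up placeholders that mimic the column names shown above, not rows from the real file.

```python
import io
import json

# Hypothetical two-line sample in JSON Lines format (one object per line).
sample_jsonl = io.StringIO(
    '{"jutsu_name": "Example Punch", "jutsu_type": "Taijutsu", "jutsu_description": "A placeholder strike."}\n'
    '{"jutsu_name": "Example Bubble", "jutsu_type": "Ninjutsu", "jutsu_description": "A placeholder bubble."}\n'
)

# Each non-empty line is parsed independently; this is what lines=True implies.
records = [json.loads(line) for line in sample_jsonl if line.strip()]

for record in records:
    print(record["jutsu_name"], "->", record["jutsu_type"])
```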


**The next function simplifies the jutsu type. It reduces each entry of the *jutsu_type* column to one of three classes: Genjutsu, Taijutsu, or Ninjutsu. The checks run in that order, so a jutsu with several types gets the first one that matches.**

In [123]:
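def simplify_jutsu(jutsu):
    # Checks run in priority order: Genjutsu, then Taijutsu, then Ninjutsu.
    if "Genjutsu" in jutsu:
        return "Genjutsu"
    if "Taijutsu" in jutsu:
        return "Taijutsu"
    if "Ninjutsu" in jutsu:
        return "Ninjutsu"
    # No match: return None so the row is removed by dropna() later.
    return None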

In [124]:
df["jutsu_type_simplified"] = df['jutsu_type'].apply(simplify_jutsu)
df.head()

Unnamed: 0,jutsu_name,jutsu_type,jutsu_description,jutsu_type_simplified
0,10 Hit Combo,Taijutsu,Lars punches the opponent before striking them...,Taijutsu
1,Acid Permeation,Ninjutsu,Utakata blows acidic bubbles from his pipe tha...,Ninjutsu
2,Accelerated Armed Revolving Heaven,"Kekkei Genkai, Hiden, Ninjutsu, Fūinjutsu, Tai...",Tenten unseals several weapons from her scroll...,Taijutsu
3,Adamantine Prison Wall,"Ninjutsu, Clone Techniques, Bukijutsu","After using Transformation: Adamantine Staff, ...",Ninjutsu
4,Adamantine Sealing Chains,"Hiden, Ninjutsu, Fūinjutsu, Barrier Ninjutsu",This is a sealing technique that is characteri...,Ninjutsu
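
Row 2 above contains Ninjutsu in its type list but is labeled Taijutsu, because `simplify_jutsu` returns the first match in the order Genjutsu, Taijutsu, Ninjutsu. The sketch below repeats the function so it is self-contained and runs it on a few illustrative type strings; these inputs are assumptions chosen to show each branch, not rows taken from the dataset.

```python
def simplify_jutsu(jutsu):
    # Same logic as the notebook's function, repeated so the sketch is self-contained.
    if "Genjutsu" in jutsu:
        return "Genjutsu"
    if "Taijutsu" in jutsu:
        return "Taijutsu"
    if "Ninjutsu" in jutsu:
        return "Ninjutsu"
    return None

# Illustrative type strings (assumed inputs, not rows taken from the dataset).
examples = [
    "Ninjutsu, Genjutsu",        # Genjutsu is checked first
    "Ninjutsu, Taijutsu",        # Taijutsu is checked before Ninjutsu
    "Hiden, Barrier Ninjutsu",   # substring match on "Ninjutsu"
    "Fūinjutsu",                 # matches none of the three checks
]

results = [simplify_jutsu(t) for t in examples]
print(results)
```

Entries that map to `None` become missing values in the new column and are removed by the later `dropna()` call.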


Now we count how many jutsus fall into each simplified type: Genjutsu, Ninjutsu, and Taijutsu.

In [125]:
df['jutsu_type_simplified'].value_counts()

jutsu_type_simplified
Ninjutsu    2036
Taijutsu     631
Genjutsu     101
Name: count, dtype: int64
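
The counts above are heavily imbalanced: Ninjutsu makes up roughly 74% of the labeled rows and Genjutsu under 4%. As a hedged sketch, the snippet below computes each class's share and the inverse-frequency weights `n_samples / (n_classes * count)`. Using such weights in a loss function is one common way to handle imbalance; it is an assumption here, since this part of the notebook does not say how, or whether, imbalance is handled.

```python
# Class counts copied from the value_counts() output above.
counts = {"Ninjutsu": 2036, "Taijutsu": 631, "Genjutsu": 101}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the labeled data.
shares = {label: count / n_samples for label, count in counts.items()}

# Inverse-frequency ("balanced") weights: rarer classes get larger weights.
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.2f}")
```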

***In this section we create a new column called *text* that contains the jutsu name and the jutsu description, and a new column called *jutsus* that contains the simplified jutsu type.***

In [126]:
df['text'] = df['jutsu_name'] + ". " + df['jutsu_description']
df['jutsus'] = df['jutsu_type_simplified']
df = df[['text', 'jutsus']]
df = df.dropna()  # Drop rows with a missing text or label, e.g. jutsus that matched none of the three types.

In [127]:
df.head()

Unnamed: 0,text,jutsus
0,10 Hit Combo. Lars punches the opponent before...,Taijutsu
1,Acid Permeation. Utakata blows acidic bubbles ...,Ninjutsu
2,Accelerated Armed Revolving Heaven. Tenten uns...,Taijutsu
3,Adamantine Prison Wall. After using Transforma...,Ninjutsu
4,Adamantine Sealing Chains. This is a sealing t...,Ninjutsu


Once the dataset is ready, we clean the text. Cleaning is an important step that helps the model work better. In this case, we remove all HTML tags from the text.

The class Cleaner produces the clean text using three methods:

1. put_line_breaks: Puts a line break after each paragraph, so paragraph boundaries survive once the tags are removed and the model can understand the text better.

2. remove_html_tags: Removes all HTML tags from the text.

3. clean: Applies the two previous methods in order, strips leading and trailing whitespace, and returns the clean text.

In [128]:
# Remove any HTML tags from the text:

from bs4 import BeautifulSoup

class Cleaner():
    def __init__(self):
        pass

    def put_line_breaks(self, text):
        # Add a newline after each closing paragraph tag so paragraphs stay separated.
        return text.replace("</p>", "</p>\n")

    def remove_html_tags(self, text):
        # BeautifulSoup parses the markup and .text keeps only the visible text.
        clean_text = BeautifulSoup(text, "html.parser").text
        return clean_text

    def clean(self, text):
        text = self.put_line_breaks(text)
        text = self.remove_html_tags(text)
        text = text.strip()
        return text
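
To make the effect of `clean` concrete, here is a minimal sketch of the same two steps on a made-up HTML snippet. It uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so it runs without extra dependencies; that swap, the helper names, and the sample string are assumptions for illustration, and the result may differ from BeautifulSoup on malformed HTML.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Hypothetical helper: collects only the text found between tags.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_html(text):
    # Step 1: add a newline after each closing paragraph tag.
    text = text.replace("</p>", "</p>\n")
    # Step 2: drop the tags and keep the text.
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    # Step 3: strip leading and trailing whitespace.
    return "".join(parser.parts).strip()

sample = "<p>Fire Release.</p><p>A stream of <b>fire</b>.</p>"
cleaned = clean_html(sample)
print(repr(cleaned))
```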

Now we create a new column called *text_cleaned* that contains the clean text.

In [129]:
text_column_name = 'text'
label_column_name = 'jutsus'

In [130]:
#Clean Text

cleaner = Cleaner()
df['text_cleaned'] = df[text_column_name].apply(cleaner.clean)

In [131]:
df.head(2)

Unnamed: 0,text,jutsus,text_cleaned
0,10 Hit Combo. Lars punches the opponent before...,Taijutsu,10 Hit Combo. Lars punches the opponent before...
1,Acid Permeation. Utakata blows acidic bubbles ...,Ninjutsu,Acid Permeation. Utakata blows acidic bubbles ...


In this section we encode the labels using the LabelEncoder from sklearn, which maps each label name to an integer index. From the fitted encoder we build a dictionary that maps each index back to its label name.

In [None]:
#Encode labels
le = preprocessing.LabelEncoder() #This is used to convert categorical labels into numerical labels.
le.fit(df[label_column_name].to_list())

In [None]:
label_dict = {index: label_name for index, label_name in enumerate(le.classes_.tolist())} # This creates a dictionary that maps the index to the label name.
label_dict

{0: 'Genjutsu', 1: 'Ninjutsu', 2: 'Taijutsu'}
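
LabelEncoder orders its classes by sorting the unique label values, which is why Genjutsu gets index 0 and Taijutsu index 2. The plain-Python sketch below reproduces that mapping and also builds the inverse dictionary; the names `id2label` and `label2id` and the short label list are chosen for illustration.

```python
# A few labels in arbitrary order, standing in for the 'jutsus' column.
labels = ["Taijutsu", "Ninjutsu", "Taijutsu", "Genjutsu", "Ninjutsu"]

# Sorting the unique values mirrors the ordering LabelEncoder uses for classes_.
classes = sorted(set(labels))

# Index -> name, and the inverse name -> index.
id2label = dict(enumerate(classes))
label2id = {name: index for index, name in id2label.items()}

# Encoding a label is then a dictionary lookup.
encoded = [label2id[name] for name in labels]
print(id2label)
print(encoded)
```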

In [None]:
df['label'] = le.transform(df[label_column_name].to_list()) # This line transforms the labels in the DataFrame into numerical values using the fitted LabelEncoder. 
df.head()

Unnamed: 0,text,jutsus,text_cleaned,label
0,10 Hit Combo. Lars punches the opponent before...,Taijutsu,10 Hit Combo. Lars punches the opponent before...,2
1,Acid Permeation. Utakata blows acidic bubbles ...,Ninjutsu,Acid Permeation. Utakata blows acidic bubbles ...,1
2,Accelerated Armed Revolving Heaven. Tenten uns...,Taijutsu,Accelerated Armed Revolving Heaven. Tenten uns...,2
3,Adamantine Prison Wall. After using Transforma...,Ninjutsu,Adamantine Prison Wall. After using Transforma...,1
4,Adamantine Sealing Chains. This is a sealing t...,Ninjutsu,Adamantine Sealing Chains. This is a sealing t...,1


Now we split the dataset into training and test sets with train_test_split from sklearn. It returns two DataFrames, one for training and one for testing, and stratify=df['label'] keeps the class proportions the same in both. This step is needed so the classification model can be evaluated on data it has not seen during training.

In [135]:
test_size = 0.2
df_train,df_test = train_test_split(df,
                                    test_size=test_size,
                                    stratify=df['label'])

In [136]:
df_train['jutsus'].value_counts()

jutsus
Ninjutsu    1628
Taijutsu     505
Genjutsu      81
Name: count, dtype: int64
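
Because the split is stratified, each class should hold nearly the same share of the training set as of the full dataset. The quick check below uses only the counts printed in this notebook; the test-set counts are derived by subtraction, not taken from an output cell.

```python
# Counts copied from the two value_counts() outputs in this notebook.
full_counts = {"Ninjutsu": 2036, "Taijutsu": 631, "Genjutsu": 101}
train_counts = {"Ninjutsu": 1628, "Taijutsu": 505, "Genjutsu": 81}

# Test-set counts follow by subtraction.
test_counts = {label: full_counts[label] - train_counts[label] for label in full_counts}

full_total = sum(full_counts.values())
train_total = sum(train_counts.values())

# Compare each class's share in the full data with its share in the training set.
share_gaps = {}
for label in full_counts:
    full_share = full_counts[label] / full_total
    train_share = train_counts[label] / train_total
    share_gaps[label] = abs(full_share - train_share)
    print(f"{label}: full={full_share:.4f}, train={train_share:.4f}, test rows={test_counts[label]}")
```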

***DistilBERT is a pre-trained model available from Hugging Face, and we use it as the base for our classifier. We use the AutoTokenizer from transformers to tokenize the text and AutoModelForSequenceClassification from transformers to train the model.***

In [137]:
model_name = "distilbert/distilbert-base-uncased"

In [None]:
# Load the tokenizer for the specified pre-trained model. The tokenizer converts
# text into a format that the model can understand, such as token IDs.
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
def preprocess_function(tokenizer, examples):
    # Takes a tokenizer and a batch of examples and tokenizes the 'text_cleaned' field.
    # truncation=True truncates any text longer than the model's maximum input length.
    # The returned tokenized examples can be used as model input for training or evaluation.
    return tokenizer(examples['text_cleaned'], truncation=True)
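
With `batched=True`, `Dataset.map` passes the function a dictionary of lists, so `examples['text_cleaned']` is a list of strings rather than a single string. The toy sketch below imitates the shape of that call with a whitespace tokenizer and a tiny length limit. `toy_tokenizer`, `MAX_LENGTH`, and the vocabulary are invented for illustration; the real DistilBERT tokenizer uses WordPiece subwords and adds special tokens.

```python
MAX_LENGTH = 4  # Tiny limit for illustration; real models allow far more tokens.

def toy_tokenizer(texts, truncation=True):
    # Builds a throwaway vocabulary on the fly; real tokenizers use a fixed one.
    vocab = {}
    input_ids = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]
        if truncation:
            ids = ids[:MAX_LENGTH]  # Drop everything past the limit.
        input_ids.append(ids)
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def preprocess_function(tokenizer, examples):
    return tokenizer(examples["text_cleaned"], truncation=True)

# A batch arrives as a dict of lists, one list per column.
batch = {"text_cleaned": ["a short text", "a much longer text that gets cut off"]}
encoded = preprocess_function(toy_tokenizer, batch)
print(encoded["input_ids"])
```

The second text has eight words but only its first four token IDs survive, which is the effect `truncation=True` has on descriptions longer than the model's limit.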

***Now we convert the pandas DataFrames to Hugging Face datasets and tokenize the text.***

In [140]:
# Convert the pandas DataFrames to Hugging Face datasets
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

#Tokenize Dataset
tokenized_train = train_dataset.map(lambda examples: preprocess_function(tokenizer,examples),
                                    batched=True)

tokenized_test = test_dataset.map(lambda examples: preprocess_function(tokenizer,examples),
                                    batched=True)

Map: 100%|██████████| 2214/2214 [00:01<00:00, 1250.16 examples/s]
Map: 100%|██████████| 554/554 [00:00<00:00, 1554.12 examples/s]
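
Note that `preprocess_function` truncates but does not pad, so the tokenized examples still have different lengths. They must be padded to a common length when batched; in transformers this is commonly done per batch with `DataCollatorWithPadding`, though this part of the notebook does not show which approach is used. The following is a minimal plain-Python sketch of per-batch padding, with `pad_batch`, `pad_id=0`, and the token IDs as illustrative assumptions.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence to the length of the longest one in this batch.
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token ID sequences of different lengths.
padded = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a single global length keeps batches of short descriptions small.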


If you want to see this process in a Python file, just redirect 