Project 5 | Automatically categorize questions

Part 4.2 | Supervised Analysis - BERT

# Project overview
We aim to help the community of Stack Overflow, the well-known question-and-answer site for software development, by building a tag-prediction API.<br/>
The analysis relies on NLP (natural language processing); we test several methods and keep only the most effective and relevant one.

# Package imports, functions and initial setup

In [1]:
# Standard library
import ast
import os
import re
import string
import warnings
from datetime import datetime

# Third-party libraries
import numpy as np
import pandas as pd
import torch
import transformers
from bs4 import BeautifulSoup
from sklearn.metrics import f1_score, accuracy_score, hamming_loss, jaccard_score
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, AutoTokenizer, BertModel, BertConfig, AutoModelForSequenceClassification

warnings.filterwarnings('ignore')

pd.set_option("display.max_columns", None)

In [2]:
df = pd.read_csv('./df_cleaned.csv', sep=';')

In [3]:
df.head()

Unnamed: 0,Title,Body,Tags,lemmatized_corpus,lemmatized_tags,stemmed_corpus,stemmed_tags
0,Python kernel dies for second run of PyQt5 GUI,<ul>\n<li>Using Spyder in Python 3.5.2 |Anacon...,<python><ipython><anaconda><pyqt5><spyder>,"['python', 'kernel', 'dy', 'second', 'pyqt', '...","['python', 'ipython', 'anaconda', 'pyqt', 'spy...","['python', 'kernel', 'dy', 'second', 'pyqt', '...","['python', 'ipython', 'anaconda', 'pyqt', 'spy..."
1,How can I use optional chaining with arrays an...,<p>I'm trying to use optional chaining with an...,<javascript><arrays><typescript><function><opt...,"['optional', 'chaining', 'array', 'function', ...","['javascript', 'array', 'typescript', 'functio...","['option', 'chain', 'array', 'function', 'tri'...","['javascript', 'array', 'typescript', 'functio..."
2,When to apply(pd.to_numeric) and when to astyp...,<p>I have a pandas DataFrame object named <cod...,<python><pandas><numpy><dataframe><types>,"['apply', 'numeric', 'astype', 'float', 'pytho...","['python', 'panda', 'numpy', 'dataframe', 'type']","['appli', 'numer', 'astyp', 'float', 'python',...","['python', 'panda', 'numpi', 'datafram', 'type']"
3,Get Webpack not to bundle files,<p>So right now I'm working with a prototype w...,<javascript><node.js><reactjs><typescript><web...,"['webpack', 'bundle', 'filesso', 'right', 'wor...","['javascript', 'node', 'reactjs', 'typescript'...","['webpack', 'bundl', 'filesso', 'right', 'work...","['javascript', 'node', 'reactj', 'typescript',..."
4,SwiftUI tappable subtext,<p>Is there any way in SwiftUI to open browser...,<ios><swift><xcode><text><swiftui>,"['swiftui', 'tappable', 'subtextis', 'swiftui'...","['swift', 'xcode', 'text', 'swiftui']","['swiftui', 'tappabl', 'subtexti', 'swiftui', ...","['swift', 'xcode', 'text', 'swiftui']"


In [4]:
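# Title and body are concatenated without a separator, so the last word of the title
# ends up fused with the first word of the body once the HTML is stripped (e.g. 'filesSo')
df['Text_complet'] = df['Title'] + df['Body']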
df['Text_complet'].head(1)

0    Python kernel dies for second run of PyQt5 GUI...
Name: Text_complet, dtype: object

In [5]:
print(df['Text_complet'].iloc[0])

Python kernel dies for second run of PyQt5 GUI<ul>
<li>Using Spyder in Python 3.5.2 |Anaconda 4.2.0 (64-bit) Windows package. qt: 5.6.0</li>
<li>For first run, GUI window opens as expected</li>
<li>For 2nd run, nothing opens, and receiving <strong>Kernel died, restarting</strong> log message.</li>
</ul>

<p><strong>gui1.py:</strong></p>

<blockquote>
<pre><code>import sys from PyQt5.QtWidgets import QApplication, QWidget

app = QApplication(sys.argv)

w = QWidget()

w.resize(250,150) w.show()

#sys.exit(app.exec_()) 
app.exec_()
</code></pre>
</blockquote>

<p><strong>IPhython log:</strong></p>

<pre><code>runfile('F:/work/ws_python/TestProj1/gui1/gui1.py', wdir='F:/work/ws_python/TestProj1/gui1')

runfile('F:/work/ws_python/TestProj1/gui1/gui1.py', wdir='F:/work/ws_python/TestProj1/gui1')

Kernel died, restarting

Kernel died, restarting

Kernel died, restarting
</code></pre>

<p>Why kernel dies for 2nd run and how to solve it?</p>

<blockquote>
  <p>(Doing the same even using #sys.ex

In [6]:
def remove_html(text):
    """
        Remove the html in sample text
    """
    html = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    return re.sub(html, "", text)
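To make the behaviour of this regular expression concrete, here is a minimal sketch on made-up strings (the sample inputs are assumptions, not rows from the dataset). It reuses the same pattern and shows two side effects worth keeping in mind: adjacent elements are fused once their tags disappear, and a literal `<` ... `>` pair in ordinary prose is treated as a tag.

```python
import re

# Same pattern as remove_html above: HTML tags or character entities
HTML_RE = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")

def remove_html(text):
    """Delete every match of HTML_RE from the text."""
    return HTML_RE.sub("", text)

# Hypothetical inputs, chosen only to illustrate the pattern
entity_example = remove_html("<p>a &amp; b</p>")             # tags and entity removed
fused_example = remove_html("<li>one</li><li>two</li>")      # neighbouring items get fused
comparison_example = remove_html("if a < b and c > d")       # '< ... >' in prose is eaten

print(repr(entity_example))
print(repr(fused_example))
print(repr(comparison_example))
```

Replacing matches with a space instead of an empty string would avoid the fused words, at the cost of some extra whitespace.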

In [7]:
%%time

def text_cleaning(text):
    """
    Lowercase the text, replace digits, underscores and non-word characters with spaces,
    then drop tokens of three characters or fewer.

    Args:
        text (str): Raw text to clean

    Returns:
        str: Cleaned text (the input is returned unchanged if it is not a string)
    """

    pattern = re.compile(r'[^\w]|[\d_]')

    try:
        res = re.sub(pattern, " ", text).lower()
    except TypeError:
        return text

    res = res.split(" ")
    res = list(filter(lambda x: len(x) > 3, res))
    res = " ".join(res)
    return res

def remove_non_ascii(text):
    """
        Remove non-ASCII characters
    """
    return re.sub(r'[^\x00-\x7f]',r'', text) # or ''.join([x for x in text if x in string.printable])

def remove_special_characters(text):
    """
        Remove special characters, including symbols, emojis, and other graphic characters
    """
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_punct(text):
    """
        Remove the punctuation
    """
#     return re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', "", text)
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_URL(text):
    """
        Remove URLs from a sample string
    """
    return re.sub(r"https?://\S+|www\.\S+", "", text)

CPU times: total: 0 ns
Wall time: 0 ns


In [8]:
# Remove HTML first, then URLs: the URL pattern relies on '://' and '.',
# so it must run before the punctuation is stripped
df['Text_cleaned'] = df['Text_complet'].apply(remove_html)
df['Text_cleaned'] = df['Text_cleaned'].apply(remove_URL)
#df['Text_cleaned'] = [text_cleaning(text) for text in df['Text_cleaned']]
df['Text_cleaned'] = df['Text_cleaned'].apply(remove_non_ascii)
df['Text_cleaned'] = df['Text_cleaned'].apply(remove_special_characters)
df['Text_cleaned'] = df['Text_cleaned'].apply(remove_punct)
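The cleaning helpers are order-sensitive. A minimal sketch, assuming a made-up sentence and two local helpers (`strip_urls`, `strip_punct`) that mirror `remove_URL` and `remove_punct`: once punctuation is gone, the URL pattern can no longer find `://`, so the link survives as a single glued token.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Mirror of remove_URL: delete http(s) and www links."""
    return URL_RE.sub("", text)

def strip_punct(text):
    """Mirror of remove_punct: delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Hypothetical sample, not taken from the dataset
sample = "See https://example.com/docs for details."

urls_first = strip_punct(strip_urls(sample))   # link removed, then punctuation
punct_first = strip_urls(strip_punct(sample))  # punctuation removed, link left behind

print(repr(urls_first))
print(repr(punct_first))
```

Running URL removal before punctuation removal is therefore the safer ordering.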

In [9]:
df = df.rename(columns = {'Text_cleaned' : 'text'})

In [10]:
df.columns

Index(['Title', 'Body', 'Tags', 'lemmatized_corpus', 'lemmatized_tags',
       'stemmed_corpus', 'stemmed_tags', 'Text_complet', 'text'],
      dtype='object')

In [11]:
df['text'].iloc[0]

'Python kernel dies for second run of PyQt5 GUI\nUsing Spyder in Python 352 Anaconda 420 64bit Windows package qt 560\nFor first run GUI window opens as expected\nFor 2nd run nothing opens and receiving Kernel died restarting log message\n\n\ngui1py\n\n\nimport sys from PyQt5QtWidgets import QApplication QWidget\n\napp  QApplicationsysargv\n\nw  QWidget\n\nwresize250150 wshow\n\nsysexitappexec \nappexec\n\n\n\nIPhython log\n\nrunfileFworkwspythonTestProj1gui1gui1py wdirFworkwspythonTestProj1gui1\n\nrunfileFworkwspythonTestProj1gui1gui1py wdirFworkwspythonTestProj1gui1\n\nKernel died restarting\n\nKernel died restarting\n\nKernel died restarting\n\n\nWhy kernel dies for 2nd run and how to solve it\n\n\n  Doing the same even using sysexitappexec as last line\n\n'

In [12]:
df['text']

0        Python kernel dies for second run of PyQt5 GUI...
1        How can I use optional chaining with arrays an...
2        When to applypdtonumeric and when to astypenpf...
3        Get Webpack not to bundle filesSo right now Im...
4        SwiftUI tappable subtextIs there any way in Sw...
                               ...                        
29990    ImportError cannot import name urldecode from ...
29991    Why did I got an error ModuleNotFoundError No ...
29992    ModuleNotFoundError No module named distutils ...
29993    NotImplementedError Loading a dataset cached i...
29994    WillPopScope is deprecated in FlutterWillPopSc...
Name: text, Length: 29995, dtype: object

In [13]:
print(df['text'].iloc[0])

Python kernel dies for second run of PyQt5 GUI
Using Spyder in Python 352 Anaconda 420 64bit Windows package qt 560
For first run GUI window opens as expected
For 2nd run nothing opens and receiving Kernel died restarting log message


gui1py


import sys from PyQt5QtWidgets import QApplication QWidget

app  QApplicationsysargv

w  QWidget

wresize250150 wshow

sysexitappexec 
appexec



IPhython log

runfileFworkwspythonTestProj1gui1gui1py wdirFworkwspythonTestProj1gui1

runfileFworkwspythonTestProj1gui1gui1py wdirFworkwspythonTestProj1gui1

Kernel died restarting

Kernel died restarting

Kernel died restarting


Why kernel dies for 2nd run and how to solve it


  Doing the same even using sysexitappexec as last line




In [14]:
# Lists are stored as strings when the DataFrame is saved to CSV, so parse them back into Python lists
df['Tags'] = df['stemmed_tags'].map(ast.literal_eval)

In [15]:
df

Unnamed: 0,Title,Body,Tags,lemmatized_corpus,lemmatized_tags,stemmed_corpus,stemmed_tags,Text_complet,text
0,Python kernel dies for second run of PyQt5 GUI,<ul>\n<li>Using Spyder in Python 3.5.2 |Anacon...,"[python, ipython, anaconda, pyqt, spyder]","['python', 'kernel', 'dy', 'second', 'pyqt', '...","['python', 'ipython', 'anaconda', 'pyqt', 'spy...","['python', 'kernel', 'dy', 'second', 'pyqt', '...","['python', 'ipython', 'anaconda', 'pyqt', 'spy...",Python kernel dies for second run of PyQt5 GUI...,Python kernel dies for second run of PyQt5 GUI...
1,How can I use optional chaining with arrays an...,<p>I'm trying to use optional chaining with an...,"[javascript, array, typescript, function, opti...","['optional', 'chaining', 'array', 'function', ...","['javascript', 'array', 'typescript', 'functio...","['option', 'chain', 'array', 'function', 'tri'...","['javascript', 'array', 'typescript', 'functio...",How can I use optional chaining with arrays an...,How can I use optional chaining with arrays an...
2,When to apply(pd.to_numeric) and when to astyp...,<p>I have a pandas DataFrame object named <cod...,"[python, panda, numpi, datafram, type]","['apply', 'numeric', 'astype', 'float', 'pytho...","['python', 'panda', 'numpy', 'dataframe', 'type']","['appli', 'numer', 'astyp', 'float', 'python',...","['python', 'panda', 'numpi', 'datafram', 'type']",When to apply(pd.to_numeric) and when to astyp...,When to applypdtonumeric and when to astypenpf...
3,Get Webpack not to bundle files,<p>So right now I'm working with a prototype w...,"[javascript, node, reactj, typescript, webpack]","['webpack', 'bundle', 'filesso', 'right', 'wor...","['javascript', 'node', 'reactjs', 'typescript'...","['webpack', 'bundl', 'filesso', 'right', 'work...","['javascript', 'node', 'reactj', 'typescript',...",Get Webpack not to bundle files<p>So right now...,Get Webpack not to bundle filesSo right now Im...
4,SwiftUI tappable subtext,<p>Is there any way in SwiftUI to open browser...,"[swift, xcode, text, swiftui]","['swiftui', 'tappable', 'subtextis', 'swiftui'...","['swift', 'xcode', 'text', 'swiftui']","['swiftui', 'tappabl', 'subtexti', 'swiftui', ...","['swift', 'xcode', 'text', 'swiftui']",SwiftUI tappable subtext<p>Is there any way in...,SwiftUI tappable subtextIs there any way in Sw...
...,...,...,...,...,...,...,...,...,...
29990,ImportError: cannot import name 'url_decode' f...,<p>I am building a webapp using Flask. I impor...,"[python, flask, importerror, flask, login, wer...","['importerror', 'import', 'name', 'decode', 'w...","['python', 'flask', 'importerror', 'flask', 'l...","['importerror', 'import', 'name', 'decod', 'we...","['python', 'flask', 'importerror', 'flask', 'l...",ImportError: cannot import name 'url_decode' f...,ImportError cannot import name urldecode from ...
29991,Why did I got an error ModuleNotFoundError: No...,<p>I've installed <code>scikit-fuzzy</code> bu...,"[python, setuptool, distutil, skfuzzi, python]","['error', 'modulenotfounderror', 'module', 'na...","['python', 'setuptools', 'distutils', 'skfuzzy...","['error', 'modulenotfounderror', 'modul', 'nam...","['python', 'setuptool', 'distutil', 'skfuzzi',...",Why did I got an error ModuleNotFoundError: No...,Why did I got an error ModuleNotFoundError No ...
29992,ModuleNotFoundError: No module named 'distutil...,<p>When I try to import <code>customtkinter</c...,"[python, modulenotfounderror, customtkint, pyt...","['modulenotfounderror', 'module', 'named', 'di...","['python', 'modulenotfounderror', 'customtkint...","['modulenotfounderror', 'modul', 'name', 'dist...","['python', 'modulenotfounderror', 'customtkint...",ModuleNotFoundError: No module named 'distutil...,ModuleNotFoundError No module named distutils ...
29993,NotImplementedError: Loading a dataset cached ...,<p>I try to load a dataset using the <code>dat...,"[python, python, openai, huggingfac, dataset, ...","['notimplementederror', 'loading', 'dataset', ...","['python', 'python', 'openai', 'huggingface', ...","['notimplementederror', 'load', 'dataset', 'ca...","['python', 'python', 'openai', 'huggingfac', '...",NotImplementedError: Loading a dataset cached ...,NotImplementedError Loading a dataset cached i...


In [16]:
from collections import Counter

all_tags = [tag for sublist in df['Tags'] for tag in sublist]

tag_counts = Counter(all_tags)

In [17]:
all_tags

['python',
 'ipython',
 'anaconda',
 'pyqt',
 'spyder',
 'javascript',
 'array',
 'typescript',
 'function',
 'option',
 'chain',
 'python',
 'panda',
 'numpi',
 'datafram',
 'type',
 'javascript',
 'node',
 'reactj',
 'typescript',
 'webpack',
 'swift',
 'xcode',
 'text',
 'swiftui',
 'docker',
 'classic',
 'docker',
 'window',
 'docker',
 'desktop',
 'javascript',
 'android',
 'reactj',
 'react',
 'nativ',
 'server',
 'server',
 'select',
 'case',
 'javascript',
 'node',
 'mongodb',
 'mongoos',
 'mongoos',
 'schema',
 'twitter',
 'bootstrap',
 'reactj',
 'leaflet',
 'react',
 'bootstrap',
 'react',
 'leaflet',
 'android',
 'android',
 'layout',
 'data',
 'bind',
 'android',
 'databind',
 'core',
 'nuget',
 'nuget',
 'packag',
 'core',
 'linux',
 'gstreamer',
 'core',
 'razor',
 'blazor',
 'javascript',
 'reactj',
 'visual',
 'studio',
 'code',
 'creat',
 'react',
 'eslint',
 'javascript',
 'angular',
 'typescript',
 'jasmin',
 'angular',
 'test',
 'angular',
 'rout',
 'angular',
 'ro

In [18]:
# top 50
tag_counts.most_common(50)

[('python', 7927),
 ('android', 7149),
 ('java', 5038),
 ('javascript', 4809),
 ('spring', 4119),
 ('node', 2024),
 ('core', 1969),
 ('angular', 1901),
 ('googl', 1854),
 ('swift', 1793),
 ('studio', 1724),
 ('html', 1632),
 ('reactj', 1565),
 ('amazon', 1510),
 ('test', 1444),
 ('xcode', 1345),
 ('laravel', 1320),
 ('react', 1249),
 ('json', 1142),
 ('visual', 1140),
 ('django', 1139),
 ('typescript', 1131),
 ('docker', 1112),
 ('apach', 1085),
 ('boot', 1084),
 ('panda', 1073),
 ('servic', 1064),
 ('learn', 1036),
 ('gradl', 1019),
 ('linux', 985),
 ('window', 975),
 ('framework', 953),
 ('jqueri', 926),
 ('data', 893),
 ('firebas', 809),
 ('http', 790),
 ('azur', 788),
 ('spark', 774),
 ('server', 769),
 ('selenium', 759),
 ('flutter', 756),
 ('rubi', 734),
 ('unit', 730),
 ('rest', 710),
 ('bootstrap', 707),
 ('angularj', 694),
 ('array', 639),
 ('databas', 621),
 ('cloud', 613),
 ('datafram', 606)]

In [19]:
top_50_tags = [tag for tag, count in tag_counts.most_common(50)]

In [20]:
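top_50_set = set(top_50_tags)  # set membership tests are faster than list lookups

def filter_tags(tags):
    """Keep only the tags that belong to the 50 most frequent ones."""
    return [tag for tag in tags if tag in top_50_set]

# Keep only the top 50 tags
df['tags50'] = df['Tags'].apply(filter_tags)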

In [21]:
df['tags50']

0                                      [python]
1               [javascript, array, typescript]
2                     [python, panda, datafram]
3        [javascript, node, reactj, typescript]
4                                [swift, xcode]
                          ...                  
29990                                  [python]
29991                          [python, python]
29992                          [python, python]
29993                          [python, python]
29994                                 [flutter]
Name: tags50, Length: 29995, dtype: object
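The counting and filtering steps above can be sketched on a toy tag list (the five rows below are invented for illustration). The sketch also counts how many rows are left with no tag at all after filtering, which is worth checking on the real data because such rows become all-zero targets.

```python
from collections import Counter

# Hypothetical tag lists standing in for df['Tags']
toy_tags = [
    ["python", "panda"],
    ["python", "flask"],
    ["java"],
    ["python", "java"],
    ["rust"],
]

# Count every tag occurrence, then keep the k most frequent tags (k=2 here, 50 in the notebook)
counts = Counter(tag for tags in toy_tags for tag in tags)
top_k = {tag for tag, _ in counts.most_common(2)}

# Filter each row down to the retained tags
filtered = [[tag for tag in tags if tag in top_k] for tags in toy_tags]

# Rows that lost all of their tags
n_empty = sum(1 for tags in filtered if not tags)

print(filtered)
print(n_empty)
```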

In [22]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
#device = 'mps'

In [23]:
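import os
# Presumably a workaround for SSL certificate errors when fetching models: an empty
# CURL_CA_BUNDLE is meant to bypass certificate checks, so use it only on a trusted network
os.environ['CURL_CA_BUNDLE'] = ''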

In [24]:
from transformers import AutoTokenizer
#tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

tokenizer = AutoTokenizer.from_pretrained(r'C:\Users\A475388\Notebooks\IML P5\Flask\bert_tokenizer')

In [25]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
encoded_tags = mlb.fit_transform(df['tags50'])

In [26]:
encoded_tags.shape

(29995, 50)

In [27]:
mlb.classes_.shape[0]

50
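For intuition about the encoded matrix, here is a hand-rolled sketch of multi-hot encoding in plain Python. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper `multi_hot` and the toy samples are hypothetical. It reproduces two properties of the encoding used above: the columns follow the alphabetical order of the classes, and a tag repeated within a row (such as `[python, python]`) still yields a single 1.

```python
def multi_hot(samples):
    """Return (sorted classes, one 0/1 row per sample)."""
    classes = sorted({tag for tags in samples for tag in tags})
    index = {tag: i for i, tag in enumerate(classes)}
    rows = []
    for tags in samples:
        row = [0] * len(classes)
        for tag in tags:
            row[index[tag]] = 1  # repeated tags simply set the same cell again
        rows.append(row)
    return classes, rows

# Hypothetical rows, including a duplicated tag and an empty list
classes, rows = multi_hot([["python", "panda"], ["python", "python"], []])

print(classes)
print(rows)
```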

In [28]:
df_array = pd.DataFrame(encoded_tags, columns=[f'tag_{i}' for i in range(50)])

In [29]:
df_array

Unnamed: 0,tag_0,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8,tag_9,tag_10,tag_11,tag_12,tag_13,tag_14,tag_15,tag_16,tag_17,tag_18,tag_19,tag_20,tag_21,tag_22,tag_23,tag_24,tag_25,tag_26,tag_27,tag_28,tag_29,tag_30,tag_31,tag_32,tag_33,tag_34,tag_35,tag_36,tag_37,tag_38,tag_39,tag_40,tag_41,tag_42,tag_43,tag_44,tag_45,tag_46,tag_47,tag_48,tag_49
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29990,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
29991,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
29992,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
29993,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [30]:
# Append the encoded target columns to df
df = pd.concat([df, df_array], axis=1)

In [31]:
df.shape

(29995, 60)

In [32]:
# target_cols are the columns produced by the MultiLabelBinarizer encoding of the tag column
target_cols = [f'tag_{i}' for i in range(50)]
print(target_cols)

['tag_0', 'tag_1', 'tag_2', 'tag_3', 'tag_4', 'tag_5', 'tag_6', 'tag_7', 'tag_8', 'tag_9', 'tag_10', 'tag_11', 'tag_12', 'tag_13', 'tag_14', 'tag_15', 'tag_16', 'tag_17', 'tag_18', 'tag_19', 'tag_20', 'tag_21', 'tag_22', 'tag_23', 'tag_24', 'tag_25', 'tag_26', 'tag_27', 'tag_28', 'tag_29', 'tag_30', 'tag_31', 'tag_32', 'tag_33', 'tag_34', 'tag_35', 'tag_36', 'tag_37', 'tag_38', 'tag_39', 'tag_40', 'tag_41', 'tag_42', 'tag_43', 'tag_44', 'tag_45', 'tag_46', 'tag_47', 'tag_48', 'tag_49']


In [33]:
from sklearn.model_selection import train_test_split

# Work on a 1,000-row subsample; the seed makes the draw reproducible
df = df.sample(1000, random_state=42)

df_train, temp = train_test_split(df, test_size=0.2, random_state=42)

df_val, df_test = train_test_split(temp, test_size=0.5, random_state=42)

In [34]:
from datasets import Dataset
# Create dataset
train_dataset = Dataset.from_pandas(df_train)
val_dataset = Dataset.from_pandas(df_val)

# Encode text
train_encodings = tokenizer(train_dataset['text'], truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_dataset['text'], truncation=True, padding=True, max_length=512)

In [35]:
class BertProcessedDataset(torch.utils.data.Dataset):
    """Wrap tokenizer encodings and multi-hot labels as a PyTorch dataset."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Tensors stay on the CPU here; the Trainer moves each batch to its own device
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

    def __len__(self):
        return len(self.labels)

# Wrap the encodings and label arrays as PyTorch datasets
train_dataset = BertProcessedDataset(train_encodings, df_train[target_cols].values)
valid_dataset = BertProcessedDataset(val_encodings, df_val[target_cols].values)

In [36]:
model = AutoModelForSequenceClassification.from_pretrained(
    'C:/Users/A475388/Notebooks/IML P5/Bert/bert_model',
    num_labels=mlb.classes_.shape[0],
    problem_type="multi_label_classification"
)
model.to(device);

In [37]:
def compute_metrics(eval_pred):
    """Compute multi-label metrics from the logits and labels returned by the Trainer."""
    logits, labels = eval_pred
    final_metrics = {}

    # Apply a sigmoid to the logits to get one probability per tag,
    # then predict a tag whenever its probability reaches 0.5
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(logits))
    predictions = (probs >= 0.5).numpy().astype(int)

    # Subset accuracy (a sample counts only if all 50 labels match), Jaccard score and Hamming loss
    final_metrics["accuracy"] = accuracy_score(labels, predictions)
    final_metrics["jacquard"] = jaccard_score(labels, predictions, average='weighted')
    final_metrics["hamming_loss"] = hamming_loss(labels, predictions)

    # F1 scores under three averaging schemes
    final_metrics["f1_micro"] = f1_score(labels, predictions, average="micro")
    final_metrics["f1_macro"] = f1_score(labels, predictions, average="macro")
    final_metrics["f1_weighted"] = f1_score(labels, predictions, average="weighted")

    return final_metrics
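The metrics above can look contradictory at first (a low accuracy next to a very small Hamming loss), because on multi-label targets `accuracy_score` is an exact-match score over the whole label vector. A minimal pure-Python sketch on invented logits and labels, recomputing the quantities by hand rather than calling scikit-learn:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits and labels for 2 samples and 3 tags
logits = [[2.0, -1.0, 0.5], [-3.0, 0.2, -0.7]]
labels = [[1, 0, 0], [0, 1, 0]]

# Threshold the per-tag probabilities at 0.5
preds = [[int(sigmoid(z) >= 0.5) for z in row] for row in logits]

# Subset accuracy: a sample is correct only if every tag matches
subset_accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hamming loss: fraction of individual (sample, tag) cells that are wrong
pairs = [(p, y) for p_row, y_row in zip(preds, labels) for p, y in zip(p_row, y_row)]
hamming = sum(p != y for p, y in pairs) / len(pairs)

# Micro-averaged F1: pool true/false positives and false negatives over all cells
tp = sum(p == 1 and y == 1 for p, y in pairs)
fp = sum(p == 1 and y == 0 for p, y in pairs)
fn = sum(p == 0 and y == 1 for p, y in pairs)
f1_micro = 2 * tp / (2 * tp + fp + fn)

print(preds, subset_accuracy, round(hamming, 4), f1_micro)
```

A single wrong cell out of six halves the subset accuracy while the Hamming loss stays small, which is the same pattern as in the training table.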

In [38]:
now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
os.makedirs(f"./{now}-bert-model", exist_ok=True)

In [39]:
# Defining some key variables that will be used later on in the training
TRAIN_BATCH_SIZE = 20
VALID_BATCH_SIZE = 20
EPOCHS = 10
LEARNING_RATE = 1e-4
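As a quick sanity check on these settings, the arithmetic below derives the split sizes and the number of optimisation steps. It is a sketch that assumes the 1,000-row subsample and the 80/10/10 split defined earlier, a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

n_rows = 1000                 # size of the subsample
n_train = int(n_rows * 0.8)   # 80% for training
n_temp = n_rows - n_train     # remaining 20%, split in half
n_val = n_temp // 2
n_test = n_temp - n_val

train_batch_size = 20
epochs = 10

# One optimisation step per batch; the last batch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * epochs

print(n_train, n_val, n_test, steps_per_epoch, total_steps)
```

The total agrees with the `global_step=400` reported in the training output.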

In [40]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir = f"./{now}-bert-model",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    greater_is_better=True,
    eval_accumulation_steps=50,
)

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=0.02)],
)

In [42]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Jacquard,Hamming Loss,F1 Micro,F1 Macro,F1 Weighted,Runtime,Samples Per Second,Steps Per Second
1,No log,0.118831,0.15,0.214599,0.0312,0.355372,0.150784,0.256093,29.4045,3.401,0.17
2,No log,0.098002,0.27,0.344962,0.0272,0.51773,0.213031,0.410848,29.3168,3.411,0.171
3,No log,0.088057,0.32,0.418556,0.0254,0.594249,0.370931,0.513778,29.15,3.431,0.172
4,No log,0.084067,0.32,0.449135,0.0244,0.611465,0.414504,0.544183,29.2239,3.422,0.171
5,No log,0.080186,0.34,0.471771,0.0248,0.619632,0.43394,0.577161,29.17,3.428,0.171
6,No log,0.079485,0.35,0.530444,0.022,0.678363,0.502055,0.641372,29.2895,3.414,0.171
7,No log,0.078259,0.34,0.506107,0.0228,0.656627,0.477729,0.613075,28.9994,3.448,0.172
8,No log,0.076004,0.35,0.520158,0.0222,0.670623,0.499422,0.630672,29.3272,3.41,0.17
9,No log,0.07602,0.34,0.515673,0.022,0.666667,0.488612,0.621905,53.3103,1.876,0.094
10,No log,0.07624,0.33,0.519307,0.0218,0.670695,0.48839,0.62594,53.5091,1.869,0.093


TrainOutput(global_step=400, training_loss=0.056093688011169436, metrics={'train_runtime': 6903.3365, 'train_samples_per_second': 1.159, 'train_steps_per_second': 0.058, 'total_flos': 2105795592192000.0, 'train_loss': 0.056093688011169436, 'epoch': 10.0})

In [43]:
test_dataset = Dataset.from_pandas(df_test)
test_encodings = tokenizer(test_dataset['text'], truncation=True, padding=True, max_length=512)
test_dataset = BertProcessedDataset(test_encodings, df_test[target_cols].values)

In [44]:
pred_otps = trainer.predict(test_dataset)

In [46]:
pred_otps.metrics

{'test_loss': 0.07682725787162781,
 'test_accuracy': 0.37,
 'test_jacquard': 0.5700499527705409,
 'test_hamming_loss': 0.0202,
 'test_f1_micro': 0.7202216066481995,
 'test_f1_macro': 0.5336607145627109,
 'test_f1_weighted': 0.6690979125830968,
 'test_runtime': 52.6755,
 'test_samples_per_second': 1.898,
 'test_steps_per_second': 0.095}

In [47]:
pred_otps

PredictionOutput(predictions=array([[-7.1903014 , -2.2436266 , -7.596447  , ..., -6.167569  ,
        -6.6615944 , -6.492551  ],
       [-4.142312  , -6.896008  , -2.6690335 , ..., -5.524728  ,
        -4.9368086 , -6.385022  ],
       [-4.2421613 , -5.280876  , -2.5546083 , ..., -0.29960316,
        -4.1389446 , -5.150961  ],
       ...,
       [-2.8811226 , -6.7742715 , -1.3411466 , ..., -4.749782  ,
        -5.2096605 , -6.404084  ],
       [-6.3707414 , -3.725837  , -5.0258317 , ..., -4.9742823 ,
        -3.5377686 , -6.214494  ],
       [-2.053347  , -6.5691986 , -2.431307  , ..., -4.0618124 ,
        -5.635381  , -7.2811832 ]], dtype=float32), label_ids=array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), metrics={'test_loss': 0.07682725787162781, 'test_accuracy': 0.37, 'test_jac

Add id2label to the model config so that tag names, rather than indices, can be retrieved automatically

In [48]:
id2label = {i: class_name for i, class_name in enumerate(mlb.classes_)}

In [49]:
id2label

{0: 'amazon',
 1: 'android',
 2: 'angular',
 3: 'angularj',
 4: 'apach',
 5: 'array',
 6: 'azur',
 7: 'boot',
 8: 'bootstrap',
 9: 'cloud',
 10: 'core',
 11: 'data',
 12: 'databas',
 13: 'datafram',
 14: 'django',
 15: 'docker',
 16: 'firebas',
 17: 'flutter',
 18: 'framework',
 19: 'googl',
 20: 'gradl',
 21: 'html',
 22: 'http',
 23: 'java',
 24: 'javascript',
 25: 'jqueri',
 26: 'json',
 27: 'laravel',
 28: 'learn',
 29: 'linux',
 30: 'node',
 31: 'panda',
 32: 'python',
 33: 'react',
 34: 'reactj',
 35: 'rest',
 36: 'rubi',
 37: 'selenium',
 38: 'server',
 39: 'servic',
 40: 'spark',
 41: 'spring',
 42: 'studio',
 43: 'swift',
 44: 'test',
 45: 'typescript',
 46: 'unit',
 47: 'visual',
 48: 'window',
 49: 'xcode'}

In [50]:
trainer.model.config.id2label = id2label
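A small sketch of the label mappings, using an illustrative three-class subset rather than the full list. It builds `id2label` the same way as above, derives the reverse mapping `label2id` (the configuration object has a field for it as well, though setting it is an optional addition to this notebook), and shows that a JSON round trip turns integer keys into strings, so anyone reading a saved mapping by hand needs to convert them back.

```python
import json

# Illustrative subset of mlb.classes_
classes = ["amazon", "android", "angular"]

id2label = dict(enumerate(classes))
label2id = {label: idx for idx, label in id2label.items()}

# JSON object keys are always strings
restored = json.loads(json.dumps(id2label))
restored_int_keys = {int(key): value for key, value in restored.items()}

print(label2id)
print(list(restored))
```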

In [51]:
trainer.save_model('./bert_model')

# Prediction

In [52]:
model_path = 'C:/Users/A475388/Notebooks/IML P5/Bert/bert_model'

# Automatic device selection
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
    
print(device)

cpu


In [53]:
import os
os.environ['CURL_CA_BUNDLE'] = ''

In [54]:
tokenizer = AutoTokenizer.from_pretrained(r'C:\Users\A475388\Notebooks\IML P5\Flask\bert_tokenizer')
model = AutoModelForSequenceClassification.from_pretrained(
    model_path
)
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [55]:
from transformers import TextClassificationPipeline
pipe = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    return_all_scores=False,  # <-- returns the probabilities of all tags if True.
                              # Otherwise, returns only the most probable one.
                              # This parameter is ignored if top_k is given when calling the pipe (see below)
    device=device,
    task="multi_label_classification",
    function_to_apply='sigmoid'
)

In [56]:
example_text = 'i read in tcp ip network administration by o that typing the route n command should bring up a routing table when i typed it into the terminal on a mac it returned the following usage route dnqtv command modifiers args what is the correct command to use to see the routing table in my terminal'

In [57]:
# What the pipe output looks like
pipe(example_text)

[{'label': 'macos', 'score': 0.7781149744987488}]

In [58]:
%%time
# What the pipe output looks like when five answers are requested
pipe('how do i install python on arch linux ? i cant understand the docs', top_k=5)

CPU times: total: 266 ms
Wall time: 38.3 ms


[{'label': 'python', 'score': 0.8857898712158203},
 {'label': 'linux', 'score': 0.5953887104988098},
 {'label': 'python-3.x', 'score': 0.12746413052082062},
 {'label': 'windows', 'score': 0.05369570851325989},
 {'label': 'macos', 'score': 0.016154713928699493}]

## Wrapper that recommends the most relevant results

In [59]:
def pred_fn(text, pipeline, thresh=0.5, max_answers=10):
    """Return the labels scoring above `thresh` among the top `max_answers` predictions."""
    pipe_output = pipeline(text, top_k=max_answers)
    recommended_tags = [
        dict_output['label'] for dict_output in pipe_output if dict_output['score'] > thresh
    ]

    return recommended_tags

In [60]:
pred_fn('how do i install python on arch linux ? i cant understand the docs', pipe)

['python', 'linux']
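The effect of the threshold can be explored without loading the model. The sketch below copies `pred_fn` and feeds it a hypothetical `stub_pipeline` that returns fixed, rounded scores in the same list-of-dicts shape as the pipe output shown earlier; the stub and its values are assumptions made for illustration only.

```python
def pred_fn(text, pipeline, thresh=0.5, max_answers=10):
    """Return the labels scoring above `thresh` among the top `max_answers` predictions."""
    pipe_output = pipeline(text, top_k=max_answers)
    return [d['label'] for d in pipe_output if d['score'] > thresh]

# Hypothetical scores, already sorted from most to least probable
STUB_SCORES = [
    {'label': 'python', 'score': 0.886},
    {'label': 'linux', 'score': 0.595},
    {'label': 'python-3.x', 'score': 0.127},
]

def stub_pipeline(text, top_k=None):
    """Stand-in for the real pipeline: ignores the text and returns the top_k fixed scores."""
    return STUB_SCORES[:top_k]

default_tags = pred_fn('any question', stub_pipeline)
lenient_tags = pred_fn('any question', stub_pipeline, thresh=0.1)
strict_tags = pred_fn('any question', stub_pipeline, thresh=0.9)

print(default_tags, lenient_tags, strict_tags)
```

A high threshold can return an empty list, so a caller such as an API endpoint may want a fallback, for example returning the single best-scoring tag.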

In [61]:
pred_fn('"The big SQL RegEx": How do I RegEx split a SQL query?', pipe)

['sql', 'sql-server']

In [62]:
"... ".join(['sql', 'sql-server'])

'sql... sql-server'