# 3 - Exploring Pipelines
To feed the data to a HuggingFace model, it needs to be put back into a dataset dictionary, and the new column I created for the target topics needs to be named `'label'`. So I am renaming the columns as follows:

- `'label'` -> `'og_label'`
- `'simple_topic'` -> `'label'`



In [1]:
import pandas as pd
ds_train = pd.read_pickle('pickles/ds_train.pkl')
ds_test = pd.read_pickle('pickles/ds_test.pkl')

In [2]:
ds_train = ds_train.rename(columns={'label':'og_label', 'simple_topic':'label'})
ds_test = ds_test.rename(columns={'label':'og_label', 'simple_topic':'label'})
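
As a quick sanity check on the rename, here is a minimal sketch on a toy frame. The values are made up and only the column names mirror the notebook. It illustrates that `rename` applies both mappings in one pass, so the old `'label'` column is not overwritten by the new one.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "label": [7, 12],        # stand-in for the original labels
    "simple_topic": [1, 3],  # stand-in for the simplified topic ids
})

# Both renames are applied at once, so the two columns do not collide.
renamed = toy.rename(columns={"label": "og_label", "simple_topic": "label"})

print(list(renamed.columns))       # ['og_label', 'label']
print(renamed["label"].tolist())   # [1, 3]
```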

Additionally, I am only going to run the `no_stopword` version of the processed text. I expect training to take a long time, and even the poorly performing model from before did better without stopwords, so that is the column I am including in the dataset dictionary as the input text.

In [3]:
from datasets import Dataset, DatasetDict
new_train = Dataset.from_pandas(ds_train[['no_stopword', 'label_text','label']])
new_test = Dataset.from_pandas(ds_test[['no_stopword', 'label_text','label']])

new_ds = DatasetDict({
    'train': new_train,
    'test': new_test})
new_ds

DatasetDict({
    train: Dataset({
        features: ['no_stopword', 'label_text', 'label'],
        num_rows: 11314
    })
    test: Dataset({
        features: ['no_stopword', 'label_text', 'label'],
        num_rows: 7532
    })
})

In [15]:
new_ds.save_to_disk('pickles/new_ds_folder')


### HuggingFace with pipeline
My first pass at this was made using the HuggingFace `pipeline` wrapper.

In [4]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, pipeline

In [5]:
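model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
# num_labels=7 matches the seven target classes (0-6); the classification head starts untrained
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=7)

# device=0 runs the pipeline on the first GPU
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)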

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
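
Because no label names are supplied with the model, the pipeline reports generic names of the form `LABEL_n`. A minimal sketch of building the two lookup tables is below. The topic names are hypothetical placeholders (the real simplified topics are not listed in this section), and wiring the dicts into the model is a suggestion rather than something this notebook does.

```python
# Hypothetical placeholder names; replace with the real simplified topics.
topic_names = [f"topic_{i}" for i in range(7)]

# Map class id -> name and name -> class id.
id2label = dict(enumerate(topic_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[3])          # topic_3
print(label2id["topic_6"])  # 6
```

These could then be passed alongside `num_labels=7`, for example `from_pretrained(model_name, num_labels=7, id2label=id2label, label2id=label2id)`, so that the pipeline output shows topic names instead of `LABEL_n`.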


However, I ran into a problem: running it over the entire dataset resulted in only two labels being selected (usually 0 and 6). When I run it on a subset of the data, I sometimes get real predictions out, and I am unsure why. Note that the classifier head is newly initialized (see the warning above), so at this point the predictions come from a head that has not been trained.
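
One way to see how the predictions are distributed is to tally the predicted class ids. The sketch below is a rough illustration, not part of the original run: `label_counts` is a hypothetical helper, the sample list is made up, and it assumes the default `LABEL_<n>` naming in the pipeline output.

```python
from collections import Counter

# Made-up sample shaped like text-classification pipeline output;
# the labels and scores here are illustrative only.
sample_preds = [
    {"label": "LABEL_6", "score": 0.76},
    {"label": "LABEL_1", "score": 0.90},
    {"label": "LABEL_6", "score": 0.55},
    {"label": "LABEL_4", "score": 0.85},
]

def label_counts(preds):
    """Count predictions per integer class id parsed from 'LABEL_<n>'."""
    ids = (int(p["label"].rsplit("_", 1)[1]) for p in preds)
    return Counter(ids)

counts = label_counts(sample_preds)
print(sorted(counts.items()))  # [(1, 1), (4, 1), (6, 2)]
```

Calling `label_counts(preds)` on the real pipeline output would show at a glance whether only classes 0 and 6 are being predicted.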

In [16]:
data = new_ds['test']['no_stopword'][0:1000]
preds = pipe(data,truncation=True, max_length=512)

In [18]:
preds

[{'label': 'LABEL_3', 'score': 0.3728351294994354},
 {'label': 'LABEL_1', 'score': 0.8986736536026001},
 {'label': 'LABEL_6', 'score': 0.2718755304813385},
 {'label': 'LABEL_6', 'score': 0.7600948810577393},
 {'label': 'LABEL_6', 'score': 0.37501585483551025},
 {'label': 'LABEL_4', 'score': 0.8525403141975403},
 {'label': 'LABEL_6', 'score': 0.5535823106765747},
 {'label': 'LABEL_1', 'score': 0.9083364605903625},
 {'label': 'LABEL_1', 'score': 0.8566413521766663},
 {'label': 'LABEL_1', 'score': 0.7014189958572388},
 {'label': 'LABEL_1', 'score': 0.8354809284210205},
 {'label': 'LABEL_1', 'score': 0.8760351538658142},
 {'label': 'LABEL_6', 'score': 0.7214304804801941},
 {'label': 'LABEL_6', 'score': 0.39800897240638733},
 {'label': 'LABEL_5', 'score': 0.5855554938316345},
 {'label': 'LABEL_1', 'score': 0.8593443632125854},
 {'label': 'LABEL_1', 'score': 0.8323886394500732},
 {'label': 'LABEL_1', 'score': 0.8388638496398926},
 {'label': 'LABEL_1', 'score': 0.710543155670166},
 {'label': 