In this notebook, we prepare a dataset that is later used to demonstrate text classification by fine-tuning a BERT-based model.

At the end, the file "youtube_data_prepared.csv" (plus a 10% sample, "youtube_data_sample_prepared.csv") is saved to the "BERT-Google-Colab" folder on your Google Drive.

Mount your Google Drive by running the following cell.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




We read the entire dataset, stored as a CSV file in the *BERT-Google-Colab* folder on Drive, into a pandas dataframe.

In [None]:
import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False (deprecated in pandas 1.3, removed in 2.0)
df1 = pd.read_csv('./drive/MyDrive/BERT-Google-Colab/youtube.csv', engine='python', encoding='utf8', on_bad_lines='skip')
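
As a small illustration of how malformed rows are skipped while reading, the sketch below parses a made-up CSV string (not the YouTube data). It assumes pandas 1.3 or later, where `on_bad_lines='skip'` takes over the role of the older `error_bad_lines=False`.

```python
import io

import pandas as pd

# Made-up CSV text: the third data row has too many fields.
raw = (
    "link,category\n"
    "abc,travel\n"
    "xyz,food\n"
    "broken,row,with,too,many,fields\n"
    "qrs,history\n"
)

# Rows whose field count exceeds the header are dropped instead of raising.
toy_df = pd.read_csv(io.StringIO(raw), engine="python", on_bad_lines="skip")
print(toy_df)
```

Because skipped rows disappear silently, it can be worth comparing the resulting row count with the number of lines in the file.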





In [None]:
df1.head()

Unnamed: 0,link,title,description,category
0,JLZlCZ0,Ep 1| Travelling through North East India | Of...,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nT...,travel
1,i9E_Blai8vk,Welcome to Bali | Travel Vlog | Priscilla Lee,Priscilla Lee\n45.6K subscribers\nSUBSCRIBE\n*...,travel
2,r284c-q8oY,My Solo Trip to ALASKA | Cruising From Vancouv...,Allison Anderson\n588K subscribers\nSUBSCRIBE\...,travel
3,Qmi-Xwq-ME,Traveling to the Happiest Country in the World!!,Yes Theory\n6.65M subscribers\nSUBSCRIBE\n*BLA...,travel
4,_lcOX55Ef70,Solo in Paro Bhutan | Tiger's Nest visit | Bhu...,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nH...,travel


For our implementation, we use only the *category* column (the label) and the *description* column, which holds the text describing the video.

In [None]:
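# .copy() gives an independent dataframe, so later edits do not trigger SettingWithCopyWarning
df2 = df1[['category', 'description']].copy()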

In [None]:
df2.columns = ['category', "description"]

In [None]:
df2.head()

Unnamed: 0,category,description
0,travel,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nT...
1,travel,Priscilla Lee\n45.6K subscribers\nSUBSCRIBE\n*...
2,travel,Allison Anderson\n588K subscribers\nSUBSCRIBE\...
3,travel,Yes Theory\n6.65M subscribers\nSUBSCRIBE\n*BLA...
4,travel,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nH...


In [None]:
df2.shape

(3599, 2)

In [None]:
df2 = df2.dropna()  # reassign instead of inplace=True to avoid SettingWithCopyWarning


In [None]:
df2.shape

(3599, 2)
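
The unchanged shape suggests that no row had a missing value. One way to check this explicitly is to count missing entries per column before dropping them. The sketch below uses a made-up toy dataframe; the names `toy`, `subset`, and `missing_per_column` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df1; the values are made up for illustration.
toy = pd.DataFrame({
    "link": ["a1", "b2", "c3"],
    "category": ["travel", None, "food"],
    "description": ["a trip", "a song", None],
})

# .copy() makes the subset an independent frame rather than a slice of toy.
subset = toy[["category", "description"]].copy()

# Count missing values in each column.
missing_per_column = subset.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
subset = subset.dropna()
print(subset.shape)
```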

The *category* column has 4 distinct values: *travel*, *art_music*, *food*, and *history*. The classes are moderately imbalanced, with *travel* roughly twice as frequent as *history*.

No consolidation of categories is needed. The filter below drops rows labeled *Other*; this dataset has none, so the counts are unchanged.

In [None]:
df2['category'].value_counts()

travel       1156
art_music     947
food          903
history       593
Name: category, dtype: int64
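
To make the class balance easier to read, the counts can be turned into proportions. The sketch below rebuilds the counts shown above as a standalone Series, so it does not need the dataset; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"travel": 1156, "art_music": 947, "food": 903, "history": 593},
    name="category",
)

# Share of each class in the whole dataset, rounded for display.
shares = (counts / counts.sum()).round(3)
print(shares)
```

On the actual dataframe, `df2['category'].value_counts(normalize=True)` should give the same proportions directly.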

In [None]:
# .copy() so that the column added later is written to an independent dataframe
df2 = df2[df2['category'] != 'Other'].copy()

In [None]:
pd.DataFrame(df2['category'].value_counts())

Unnamed: 0,category
travel,1156
art_music,947
food,903
history,593


The model needs the labels as numeric values. Here we create a new column, *Category_Label*, that encodes the *category* column as integers.

The text in the *description* column also needs a numeric representation, but since this depends on the model architecture, it is done in the subsequent notebook.

In [None]:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
df2['Category_Label'] = enc.fit_transform(df2['category'])

In [None]:
df2.head()

Unnamed: 0,category,description,Category_Label
0,travel,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nT...,3
1,travel,Priscilla Lee\n45.6K subscribers\nSUBSCRIBE\n*...,3
2,travel,Allison Anderson\n588K subscribers\nSUBSCRIBE\...,3
3,travel,Yes Theory\n6.65M subscribers\nSUBSCRIBE\n*BLA...,3
4,travel,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nH...,3
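
`LabelEncoder` assigns integers according to the sorted order of the class names, which is why *travel* maps to 3 here. The pure-Python sketch below mimics that behavior on a short made-up list, to show how the mapping comes about and how to invert it; it is not the library's implementation.

```python
# Made-up list of labels, using the four category names of this dataset.
categories = ["travel", "art_music", "food", "history", "travel"]

# Sort the distinct names and number them, as LabelEncoder does.
classes = sorted(set(categories))
label_of = {name: idx for idx, name in enumerate(classes)}
name_of = {idx: name for name, idx in label_of.items()}

labels = [label_of[c] for c in categories]
print(label_of)
print(labels)
```

In the notebook itself, `enc.classes_` holds the same ordering, and `enc.inverse_transform` turns predicted integers back into category names.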


In [None]:
df2.iloc[4]['description']

'Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nHere’s presenting the first part of the Bhutan Series Episode in Paro. I went straight to Paro as first part of my road trip in the country. The drive from Phuntsholing took about 4 hours. \n \nThe entire budget of my Bhutan trip was close to  INR 25k. You can carry cash everywhere in Indian currency in Bhutan as it is accepted. Some things about Paro below:\n\n1. The place where I stayed at in Paro is called Ama’s Village Lodge. You can book the place here - \nSHOW MORE'

We can further preprocess the data to reduce the vocabulary size of the text. Here we apply light preprocessing: replace punctuation with spaces, remove runs of uppercase *X* (masked-information patterns such as *XXX…*), collapse repeated spaces, convert everything to lowercase, and strip leading and trailing whitespace. Newlines are kept.

In [None]:
import string

table = str.maketrans(string.punctuation, ' '*len(string.punctuation))
df2['description'] = df2['description'].str.translate(table)
# regex=True is needed: since pandas 2.0, str.replace treats the pattern literally by default
df2['description'] = df2['description'].str.replace('X+', '', regex=True)
df2['description'] = df2['description'].str.replace(' +', ' ', regex=True)
df2['description'] = df2['description'].str.lower()
df2['description'] = df2['description'].str.strip()


In [None]:
df2.iloc[4]['description']

'tanya khanijow\n671k subscribers\nsubscribe\nhere’s presenting the first part of the bhutan series episode in paro i went straight to paro as first part of my road trip in the country the drive from phuntsholing took about 4 hours \n \nthe entire budget of my bhutan trip was close to inr 25k you can carry cash everywhere in indian currency in bhutan as it is accepted some things about paro below \n\n1 the place where i stayed at in paro is called ama’s village lodge you can book the place here \nshow more'
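
The cleaning steps above can also be collected into a single function, which is convenient for applying the same preprocessing to new text at prediction time. This is a minimal sketch using only the standard library; `clean_text` and the sample string are hypothetical and not part of the original notebook.

```python
import re
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Apply the same light preprocessing as the pandas cell above."""
    text = text.translate(_PUNCT_TO_SPACE)  # punctuation -> spaces
    text = re.sub(r"X+", "", text)          # drop runs of uppercase X
    text = re.sub(r" +", " ", text)         # collapse repeated spaces
    return text.lower().strip()             # lowercase, trim the ends


sample = "Hello, World!  Visit XXXX now...\nSUBSCRIBE"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that the uppercase-X removal runs before lowercasing, so it also strips a capital X inside ordinary words, and that typographic apostrophes such as ’ are not in `string.punctuation` and therefore remain.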

Some descriptions contain very few words. We keep only texts with at least 5 words, on the assumption that shorter ones carry little useful information.

In [None]:
lengths = df2['description'].str.split().str.len()  # number of words per description
print(lengths.max())
print(lengths.min())

939
3


In [None]:
df2 = df2[[l >= 5 for l in lengths]]

In [None]:
df2.shape

(3595, 3)
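
The same minimum-length filter can be written with a boolean mask instead of a Python list. The sketch below applies it to a made-up toy dataframe; `MIN_WORDS` and the other names are illustrative.

```python
import pandas as pd

MIN_WORDS = 5  # threshold used in this notebook

toy = pd.DataFrame({
    "description": [
        "one two three",
        "a b c d e",
        "",
        "this text has six words total",
    ]
})

# Whitespace-split each text and count the resulting tokens.
word_counts = toy["description"].str.split().str.len()

# Keep only the rows that reach the threshold.
filtered = toy[word_counts >= MIN_WORDS].reset_index(drop=True)
print(word_counts.tolist())
print(filtered)
```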

In [None]:
pd.DataFrame(df2['category'].value_counts())

Unnamed: 0,category
travel,1154
art_music,947
food,901
history,593


We then save the preprocessed dataset, along with a 10% random sample of it.

In [None]:
df2.to_csv('./drive/MyDrive/BERT-Google-Colab/youtube_data_prepared.csv', index=False)

In [None]:
df2.sample(n=int(len(df2)*0.1), random_state=111).to_csv('./drive/MyDrive/BERT-Google-Colab/youtube_data_sample_prepared.csv', index=False)
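
The sample above is drawn uniformly at random, so its class proportions can drift slightly from those of the full dataset. One possible alternative, not used in the original notebook, is to sample 10% within each category. The sketch below shows this on made-up data and assumes pandas 1.1 or later, where `groupby(...).sample(...)` is available.

```python
import pandas as pd

# Made-up data: 40 "travel" rows and 20 "food" rows.
toy = pd.DataFrame({
    "category": ["travel"] * 40 + ["food"] * 20,
    "description": [f"text {i}" for i in range(60)],
})

# Draw 10% from each category separately, with a fixed seed for repeatability.
stratified = toy.groupby("category").sample(frac=0.1, random_state=111)

sample_counts = stratified["category"].value_counts().to_dict()
print(sample_counts)
```

Whether this matters depends on how the sample is used; for a quick smoke test of the training code, the plain random sample is likely sufficient.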