# PyTorch for Natural Language Processing

## Bot Detection using BERTModel

---

**<u>_Objective:_</u>** In this short project, we fine-tune a pretrained BERT model to classify tweets as written by a bot or by a human.

This tutorial is inspired by the following walkthrough:

https://saturncloud.io/blog/pytorch-for-natural-language-processing-building-a-fake-news-classification-model/

In [1]:
# import dependencies and libraries
import pandas as pd
import numpy as np
import glob
import re
import math
import seaborn as sns
import warnings
import matplotlib.pyplot as plt

# set_theme already applies the 'whitegrid' style, so a separate set_style call is not needed
sns.set_theme(style = 'whitegrid',
              rc    = {'figure.dpi'    : 400,
                       'figure.figsize': (20, 12)},
              font_scale = 0.60)

# plt.rcParams is the same object as matplotlib.rcParams
plt.rcParams.update({'figure.autolayout': True})

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
warnings.filterwarnings('ignore', category = UserWarning, module = 'openpyxl')

## Set up Environment for Google Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [15]:
import os

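# Get the current directory, the project root (its parent), and the data directory
cur_dir = os.getcwd()
# Joining with "" keeps a trailing path separator; this avoids slicing by a hard-coded length
root_dir = os.path.join(os.path.dirname(cur_dir), "")
data_dir = os.path.join(root_dir, "1_Data", "")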

print(f"Current directory: {cur_dir}\nRoot directory : {root_dir}\nData directory : {data_dir}")

Current directory: I:\My Drive\Data Science and Analytics Portfolio\3 Tutorials\2_Bots_Detection\2_Notebooks
Root directory : I:\My Drive\Data Science and Analytics Portfolio\3 Tutorials\2_Bots_Detection\
Data directory : I:\My Drive\Data Science and Analytics Portfolio\3 Tutorials\2_Bots_Detection\1_Data\


## Read Datasets

In [18]:
%%time
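# TFP_tweets.csv holds the human tweets; TWT_tweets.csv holds the bot tweets
df_human = pd.read_csv(os.path.join(data_dir, "cresci-2015", "TFP_tweets.csv"), encoding = 'latin-1')
df_bot = pd.read_csv(os.path.join(data_dir, "cresci-2015", "TWT_tweets.csv"), encoding = 'latin-1')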

Wall time: 2.31 s


In [22]:
print(f"Length of human dataframe : {len(df_human)}\nLength of bots dataframe : {len(df_bot)}")

Length of human dataframe : 563693
Length of bots dataframe : 114192


Before using the dataset for training, we apply some basic data cleaning. Typically, this involves:
- Removing special characters
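
As a minimal sketch of the cleaning step listed above, the helper below strips URLs and punctuation from a tweet while keeping `@` mentions and `#` hashtags. The function name `clean_tweet` and the choice of which characters to keep are assumptions made for illustration; they are not part of the original notebook, and a different rule set may suit the classifier better.

```python
import re

# Hypothetical helper (not from the original notebook): illustrative cleaning rules only.
URL_PATTERN = re.compile(r"https?://\S+")        # links such as http://t.co/...
SPECIAL_PATTERN = re.compile(r"[^\w\s@#]")       # anything except word chars, whitespace, @ and #
WHITESPACE_PATTERN = re.compile(r"\s+")          # runs of spaces, tabs, newlines


def clean_tweet(text: str) -> str:
    """Remove URLs and special characters, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = SPECIAL_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


sample = "R.I.P. @Windows Live Mesh! http://t.co/mQWd472z"
print(clean_tweet(sample))
```

Because `\w` is Unicode-aware for Python strings, accented letters in the Italian tweets are preserved. Assuming the `text` column contains only strings, the helper could be applied to a dataframe with something like `df['text'].astype(str).apply(clean_tweet)`.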

In [20]:
# Keep only the tweet id and text columns
df_human_select = df_human[['id', 'text']].copy()

df_human_select

| | id | text |
|---|---|---|
| 0 | 282123910303080448 | @TheFakeProject cerca followers reali!!! #ImNo... |
| 1 | 276204184393641984 | RT @laperniconi: Cosa ci metteremo quest'anno ... |
| 2 | 276203000333217792 | RT @wontcallyouback: #faiunadomandaalpapa ha m... |
| 3 | 248805120802959362 | RT @nausea_17: I tifosi del Napoli si picchian... |
| 4 | 244442004661096448 | @ioeilmiopc buonanotte a te che mi apri un mon... |
| ... | ... | ... |
| 563688 | 301650553560236032 | Grandi cambiamenti in @WindowsItalia: R.I.P. @... |
| 563689 | 301649859755253760 | R.I.P. @Windows Live Mesh! http://t.co/mQWd472z |
| 563690 | 301336196020305920 | RT @MSFTnews: New @Xbox numbers: 76M @Xbox con... |
| 563691 | 300873489357885440 | RT @Microsoft: All of our meetings at work loo... |


## Parameters Declarations

## Load BERT Model

## Train BERT Model

## Model Prediction