# Tutorial on Text Classification (NLP) using ULMFiT and fastai Library in Python
Source: https://www.analyticsvidhya.com/blog/2018/11/tutorial-text-classification-ulmfit-fastai-library/

#### 1. Import the libraries

In [3]:
import fastai
from fastai import *
from fastai.text import * 
import pandas as pd
import numpy as np
from functools import partial
import io
import os

#### 2. Load a classification dataset bundled with sklearn

In [2]:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Build a DataFrame and check its shape. The `label` column holds each document's class label, and `text` holds the article body.

In [6]:
df = pd.DataFrame({'label':dataset.target, 'text':dataset.data})

In [7]:
df.shape

(11314, 2)

In [10]:
print(df)

       label                                               text
0         17  Well i'm not sure about the story nad it did s...
1          0  \n\n\n\n\n\n\nYeah, do you expect people to re...
2         17  Although I realize that principle is not one o...
3         11  Notwithstanding all the legitimate fuss about ...
4         10  Well, I will have to change the scoring on my ...
5         15   \n \nI read somewhere, I think in Morton Smit...
6          4  \nOk.  I have a record that shows a IIsi with ...
7         17  \n\n\nSounds like wishful guessing.\n\n\n\n\n'...
8         13   Nobody is saying that you shouldn't be allowe...
9         12  \n  I was wondering if anyone can shed any lig...
10         1  Archive-name: graphics/resources-list/part1\nL...
11         6  I have a Roberto Clemente 1969 Topps baseball ...
12        13  \n\n"Diet Evangelist".  Good term.  Fits Atkin...
13        15  Hi Damon,  No matter what system or explanatio...
14         4  The title says it all.  I 

To turn this into a binary classification problem, keep only two labels: 1 (`comp.graphics`) and 10 (`rec.sport.hockey`).

In [11]:
df = df[df['label'].isin([1,10])]
df = df.reset_index(drop = True)

In [12]:
df['label'].value_counts()

10    600
1     584
Name: label, dtype: int64
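
The `reset_index(drop = True)` call matters because the de-tokenization loop in step 3 looks rows up by position with `tokenized_doc[i]`. A minimal sketch on a toy frame (the rows are made up for illustration) shows how the filter leaves gaps in the index and how the reset closes them:

```python
import pandas as pd

# Toy frame with made-up rows; labels mimic the 20 Newsgroups integer ids.
toy = pd.DataFrame({"label": [17, 1, 10, 4, 10],
                    "text": ["a", "b", "c", "d", "e"]})

# Keep only the two classes of interest.
subset = toy[toy["label"].isin([1, 10])]
print(list(subset.index))   # the original row labels survive the filter

# Renumber the rows from 0 so that lookups by position work again.
subset = subset.reset_index(drop=True)
print(list(subset.index))
```

Without the reset, a lookup such as `subset["text"][0]` would raise a `KeyError` here, because no row is labeled 0 after filtering.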

#### 3. Preprocessing data

a. Clean the text by keeping only letters

In [13]:
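# regex=True is needed on recent pandas, where str.replace matches literally by default
df['text'] = df['text'].str.replace("[^a-zA-Z]", " ", regex=True)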
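
To see what the pattern used above does, here is a minimal sketch using the standard `re` module on a made-up sentence (the sample string is an assumption, not taken from the dataset). Every character that is not an ASCII letter, including digits, punctuation, and newlines, is replaced by a space:

```python
import re

# Hypothetical sample; not from the 20 Newsgroups data.
sample = "Re: GIF->JPEG in 3 steps!\nIt's easy."

# Same character class as in the cell above: anything outside a-z / A-Z becomes a space.
letters_only = re.sub("[^a-zA-Z]", " ", sample)

print(repr(letters_only))
# The extra spaces disappear later, when the text is split on whitespace.
print(letters_only.split())
```

Note that apostrophes are removed too, so a contraction such as `It's` ends up as two tokens, `It` and `s`.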

b. Download the stop words

In [14]:
import nltk
from nltk.corpus import stopwords 
nltk.download('stopwords')  # one-time download of the stop-word corpus
stop_words = stopwords.words('english')

c. Tokenize, remove stop words, then de-tokenize

In [15]:
# tokenization 
tokenized_doc = df['text'].apply(lambda x: x.split())

# remove stop-words 
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization 
detokenized_doc = [] 
for i in range(len(df)): 
    t = ' '.join(tokenized_doc[i]) 
    detokenized_doc.append(t) 

df['text'] = detokenized_doc
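
Note that the filter above compares tokens case-sensitively, while the NLTK English stop-word list is lowercase, so capitalized words such as `I` or `The` are kept (they are visible in the training sample printed in step 4). A minimal sketch with a tiny made-up stop-word set, standing in for the NLTK list, shows the difference. The `.lower()` variant is an optional alternative, not what this tutorial does:

```python
# Hypothetical stand-in for the NLTK list, which contains lowercase words only.
stop_words = {"i", "the", "is", "in", "a", "of", "to"}

text = "The puck is in the net I think"

# tokenization
tokens = text.split()

# case-sensitive filter, as in the cell above
kept = [t for t in tokens if t not in stop_words]

# case-insensitive variant (an assumption, shown for comparison only)
kept_lower = [t for t in tokens if t.lower() not in stop_words]

# de-tokenization
print(" ".join(kept))
print(" ".join(kept_lower))
```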

#### 4. Split dataset

In [16]:
from sklearn.model_selection import train_test_split

# split data into training and validation set
df_trn, df_val = train_test_split(df, stratify = df['label'], test_size = 0.4, random_state = 12)

Check the shapes of the two sets.

In [17]:
df_trn.shape, df_val.shape

((710, 2), (474, 2))
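
The two shapes can be checked by hand. The sketch below assumes that the validation share is rounded up to a whole number of rows, which is consistent with the output above; the variable names are illustrative only:

```python
import math

n_samples = 600 + 584   # rows left after keeping labels 10 and 1
test_size = 0.4         # fraction passed to train_test_split

n_val = math.ceil(test_size * n_samples)  # validation rows, rounded up
n_trn = n_samples - n_val                 # the remaining rows go to training

print(n_trn, n_val)  # 710 474
```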

In [24]:
print(df_trn)

      label                                               text
1018     10  It looks like Edmonton Oilers decided take Eur...
762      10  This kills Speaking die hard I I read died har...
768       1  The idea clip one polygon using another polygo...
152      10  I Edmonton usually least OFTEN case treated ac...
426      10  You know absolutely right I think round player...
489      10  Did boyfriend comment fact Clement looks like ...
533      10  Ten years ago number Europeans NHL roughly qua...
731       1  I went back looked review They claim significa...
890       1  I need information Display PostScript strokead...
588      10  I disagree one I think Vancouver go Bure goes ...
1074     10  Wales Conference Adams Division Semifinal I ho...
448      10  Last night Sharks broadcast Commissioner Bettm...
1060     10    Do realize many smiles crossing faces wrote gld
824       1  I got spec obviously since I quoted last posti...
1112      1  HELP MY FRIEND AND I HAVE A CLASS PROJECT 

#### 5. Prepare the data for the language model and the classifier

In [29]:
# preview of my system
from fastai.utils import show_install
show_install()



```text
=== Software === 
python version : 3.6.4
fastai version : 1.0.34
torch version  : 1.0.0
torch cuda ver 
torch cuda is  : **Not available** 

=== Hardware === 
No GPUs available 

=== Environment === 
platform       : Windows-10-10.0.17134-SP0
conda env      : Unknown
python         : C:\Users\RandomScientist\Anaconda3\python.exe
sys.path       : 
C:\Users\RandomScientist\Anaconda3\python36.zip
C:\Users\RandomScientist\Anaconda3\DLLs
C:\Users\RandomScientist\Anaconda3\lib
C:\Users\RandomScientist\Anaconda3
C:\Users\RandomScientist\Anaconda3\lib\site-packages
C:\Users\RandomScientist\Anaconda3\lib\site-packages\win32
C:\Users\RandomScientist\Anaconda3\lib\site-packages\win32\lib
C:\Users\RandomScientist\Anaconda3\lib\site-packages\Pythonwin
C:\Users\RandomScientist\Anaconda3\lib\site-packages\IPython\extensions
C:\Users\RandomScientist\.ipython
no supported gpus found on this system
```

Please make sure to include opening/closing ``` when you paste into forums/github to make t

In [31]:
# Language model data
data_lm = TextLMDataBunch.from_df(path="", train_df = df_trn, valid_df = df_val)

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

In [20]:

# Classifier model data
data_clas = TextClasDataBunch.from_df(path = "", train_df = df_trn, valid_df = df_val, vocab=data_lm.train_ds.vocab, bs=32)

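NameError: name 'data_lm' is not defined

Because the language model cell failed, `data_lm` was never created, so this cell fails as well. The cells in step 6 were not run on this machine and are listed without output.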

#### 6. Fine-tuning

In [None]:
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.7)

In [None]:
# train the learner object with learning rate = 1e-2
learn.fit_one_cycle(1, 1e-2)

Save the encoder so the classifier can reuse it.

In [None]:
learn.save_encoder('ft_enc')

Use `data_clas` to build the classifier and load the saved encoder for fine-tuning.

In [None]:
learn = text_classifier_learner(data_clas, drop_mult=0.7)
learn.load_encoder('ft_enc')

Fit the model.

In [None]:
learn.fit_one_cycle(1, 1e-2)

Get predictions on the validation set and compare them with the true labels.

In [None]:
# get predictions
preds, targets = learn.get_preds()

predictions = np.argmax(preds, axis = 1)
pd.crosstab(predictions, targets)
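
Since the cells above were not executed, here is a plain-Python sketch of what the last two lines compute, using made-up probabilities and targets (all values are assumptions for illustration). Taking the argmax of each row picks the most likely class, and counting (prediction, target) pairs gives the cells of the cross-tabulation:

```python
from collections import Counter

# Made-up class probabilities for four documents: [P(class 0), P(class 1)].
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
targets = [0, 1, 1, 1]  # made-up true classes

# argmax over each row: index of the largest probability.
predictions = [max(range(len(row)), key=row.__getitem__) for row in probs]

# Count (prediction, target) pairs, the same cells a crosstab would show.
confusion = Counter(zip(predictions, targets))

# Share of documents whose predicted class matches the target.
accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)

print(predictions)
print(dict(confusion))
print(accuracy)
```

Cells where the prediction and the target agree are correct classifications; the rest are errors.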