### Assignment goal: get comfortable reading data with PyTorch Dataset and DataLoader

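This assignment uses the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset to practice custom data loading with PyTorch's Dataset and DataLoader.
The downloaded data is split into train and test. Because the goal here is data loading, we practice on the train split only.
(Please download the data from the IMDB page first.)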

### Load packages

In [None]:
# Import torch and other required modules
import glob
import torch
import re
import nltk
import numpy as np
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import load_svmlight_file
from nltk.corpus import stopwords

nltk.download('stopwords')  # download the stopword list
nltk.download('punkt')  # download the tokenizer data required by word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from google.colab import drive
import pandas as pd

drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Explore and preprocess the data
The train data is split into pos (positive) and neg (negative) reviews; this polarity is the label.

In [None]:
# Load the vocabulary: every word that appears in the reviews
df_vocab = pd.read_csv('./gdrive/My Drive/nlp2_colab/data/aclImdb/imdb.vocab', names=['voc'], encoding="latin")
vocab = df_vocab.voc.tolist()

# Remove stopwords with the nltk list; they carry little information and can hurt training
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
print(f"vocab length before removing stopwords: {len(vocab)}")
vocab = [word for word in vocab if word not in stop_words]
print(f"vocab length after removing stopwords: {len(vocab)}")

# Map each word to an index
vocab_list = list(set(vocab))
vocab_list = [x for x in vocab_list if str(x) != 'nan']
vocab_dic = {}
for idx, word in enumerate(vocab_list):
  vocab_dic[word] = idx
vocab_dic

vocab length before removing stopwords: 89527
vocab length after removing stopwords: 89356


{'encouragement': 0,
 'bruiser': 1,
 'malcomx': 2,
 'lovecraft': 3,
 'aristocrats': 4,
 'natch': 5,
 'expressionist': 6,
 'fowl': 7,
 'tirÃ©': 8,
 'blue': 9,
 'schoolkids': 10,
 'onions': 11,
 'quotations': 12,
 'monosyllables': 13,
 'whyyyy': 14,
 'warmth': 15,
 'tillier': 16,
 'curley': 17,
 'outshine': 18,
 'ealing': 19,
 'quella': 20,
 'haggard-looking': 21,
 'nation': 22,
 'maidservant': 23,
 'hurries': 24,
 'successive': 25,
 'danish': 26,
 'purchases': 27,
 "x-files''final": 28,
 'frith': 29,
 'summons': 30,
 'resister': 31,
 'enunciated': 32,
 'men-in-black': 33,
 'whoppie': 34,
 'sleepover': 35,
 'ates': 36,
 'unmasked': 37,
 'bitterly': 38,
 'remarked': 39,
 'loins': 40,
 'admirably': 41,
 'christys': 42,
 'jordana': 43,
 'pos': 44,
 'guility': 45,
 'elba': 46,
 'hayak': 47,
 'straight-the': 48,
 'guillot': 49,
 'marie-paul': 50,
 'dormal': 51,
 'menstruating': 52,
 'defiantly': 53,
 'miike-version': 54,
 'zentropa': 55,
 'excluding': 56,
 'delventhal': 57,
 'farreley': 58,
 

In [None]:
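# Pack the data into (x, y) pairs, where x is the review's file path and y is 1 (positive) or 0 (negative)
# x is a file path so that you practice not loading all the data at once. If memory allows (the files are not large),
# loading everything up front cuts I/O time during training and speeds it up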
path = './gdrive/My Drive/nlp2_colab/data/aclImdb/train/'
review_pos = glob.glob(path + "pos/*.txt")
review_neg = glob.glob(path + "neg/*.txt")
y = [1] * len(review_pos) +  [0] * len(review_neg)
review_ttl = review_pos + review_neg
review_pairs = list(zip(review_ttl, y))

print(review_pairs[:2])
print(f"Total reviews: {len(review_pairs)}")

[('./gdrive/My Drive/nlp2_colab/data/aclImdb/train/pos/11414_9.txt', 1), ('./gdrive/My Drive/nlp2_colab/data/aclImdb/train/pos/11609_10.txt', 1)]
Total reviews: 25000
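
The comment in the cell above mentions the alternative of reading every review into memory up front. A minimal sketch of that idea follows, assuming the same pos/neg folder layout. The helper names `build_pairs` and `preload` are hypothetical and not part of the assignment, and a temporary directory with two tiny files stands in for the real aclImdb/train folder so the sketch runs anywhere.

```python
import glob
import os
import tempfile

def build_pairs(root):
    """Return (path, label) pairs: 1 for files under pos/, 0 for files under neg/."""
    pos = sorted(glob.glob(os.path.join(root, "pos", "*.txt")))
    neg = sorted(glob.glob(os.path.join(root, "neg", "*.txt")))
    return list(zip(pos + neg, [1] * len(pos) + [0] * len(neg)))

def preload(pairs):
    """Read every file once and keep (text, label) tuples in memory."""
    loaded = []
    for path, label in pairs:
        with open(path, "r", encoding="utf-8") as f:
            loaded.append((f.read(), label))
    return loaded

# Tiny stand-in for aclImdb/train (hypothetical file names and contents).
with tempfile.TemporaryDirectory() as root:
    for sub, text in [("pos", "great movie"), ("neg", "dull plot")]:
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, "0_1.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    pairs = build_pairs(root)
    in_memory = preload(pairs)

print(in_memory)  # [('great movie', 1), ('dull plot', 0)]
```

A Dataset built on the preloaded list would skip the file read in `__getitem__`, trading memory for less I/O per batch.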


### Build the Dataset and DataLoader
We need two helper functions: one that reads and cleans a review (load_review) and one that builds the bag-of-words (BoW) vector (generate_bow).
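
To make the bag-of-words idea concrete, here is a small sketch on a made-up three-word vocabulary. `toy_vocab` and `toy_bow` are hypothetical names used only for illustration; the real vocabulary comes from imdb.vocab. The tokens are a plain list, so a repeated word is counted each time it appears, and words outside the vocabulary are ignored.

```python
import numpy as np

# Hypothetical three-word vocabulary mapping word -> vector position.
toy_vocab = {"good": 0, "bad": 1, "plot": 2}

def toy_bow(tokens, vocab):
    """Count how often each vocabulary word appears in tokens."""
    vec = np.zeros(len(vocab))
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:  # index 0 is valid, so test against None
            vec[idx] += 1
    return vec

bow = toy_bow(["good", "plot", "good", "unknown"], toy_vocab)
print(bow)  # [2. 0. 1.]
```

The vector length always equals the vocabulary size, which is why each IMDB review becomes a mostly-zero vector with tens of thousands of entries.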

In [None]:
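def load_review(review_path):

  with open(review_path, 'r', encoding='utf-8') as f:
    review = f.read()

  # Strip non-alphabetic characters, lowercase to match the vocabulary, then tokenize
  review = re.sub('[^a-zA-Z]', ' ', review).lower()
  review = nltk.word_tokenize(review)
  # Drop stopwords but keep repeated words so generate_bow can count them
  stop_words = set(stopwords.words('english'))
  review = [word for word in review if word not in stop_words]

  return review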

In [None]:
def generate_bow(review, vocab_dic):
    bag_vector = np.zeros(len(vocab_dic))
    for word in review:
        idx = vocab_dic.get(word)
        # Compare against None: index 0 is a valid position but is falsy
        if idx is not None:
            bag_vector[idx] += 1

    return bag_vector

In [None]:
class dataset(Dataset):
    '''Custom dataset that loads reviews and labels.
    Parameters
    ----------
    data_dirs: list
        list of (review file path, label) pairs
    vocab: dict
        mapping from each vocabulary word to its index
    '''
    def __init__(self, data_dirs, vocab):
        self.data_dirs = data_dirs
        self.vocab = vocab

    def __len__(self):
        return len(self.data_dirs)

    def __getitem__(self, idx):
        # Read and vectorize lazily, one review per access
        review_path, label = self.data_dirs[idx]
        review = load_review(review_path)
        review = generate_bow(review, self.vocab)

        return review, label
        

In [None]:
# Build the custom dataset
custom_dst = dataset(review_pairs, vocab_dic)
custom_dst[10]

(array([0., 0., 0., ..., 0., 0., 0.]), 1)

In [None]:
# Build the dataloader
custom_dataloader = DataLoader(dataset=custom_dst, batch_size=4, shuffle=True)

In [None]:
next(iter(custom_dataloader))

[tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]], dtype=torch.float64),
 tensor([1, 1, 1, 0])]
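
The batch above has dtype torch.float64 because the DataLoader's default collation keeps the dtype of the NumPy arrays returned by the dataset. The sketch below reproduces this on a hypothetical `ToyDataset` (not part of the assignment) and shows one way to cast to float32 before passing features to a model, on the assumption that the model uses PyTorch's default float32 parameters.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical dataset: 8 samples, each a float64 vector of length 5."""
    def __init__(self):
        self.x = np.arange(40, dtype=np.float64).reshape(8, 5)
        self.y = [i % 2 for i in range(8)]

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# shuffle=False keeps the batch deterministic for this illustration.
loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
features, labels = next(iter(loader))

print(features.shape, features.dtype)  # torch.Size([4, 5]) torch.float64
print(labels)                          # tensor([0, 1, 0, 1])

# Layers such as torch.nn.Linear hold float32 weights by default, so cast first.
features = features.float()
print(features.dtype)                  # torch.float32
```

An alternative is to create the vector as float32 inside the dataset, which avoids the cast on every batch.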