### 作業目的: 熟練Pytorch Dataset與DataLoader進行資料讀取

本此作業主要會使用[IMDB](http://ai.stanford.edu/~amaas/data/sentiment/)資料集利用Pytorch的Dataset與DataLoader進行
客製化資料讀取。
下載後的資料有分成train與test，因為這份作業目的在讀取資料，所以我們取用train部分來進行練習。
(請同學先行至IMDB下載資料)

### 載入套件

In [1]:
# Import torch and other required modules
import glob
import torch
import re
import nltk
import numpy as np
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import load_svmlight_file
from nltk.corpus import stopwords

nltk.download('stopwords') #下載stopwords
nltk.download('punkt') #下載word_tokenize需要的corpus

C:\ProgramData\Anaconda3\lib\site-packages\numpy\.libs\libopenblas.NOIJJG62EMASZI6NYURL6JBKM4EVBGM7.gfortran-win_amd64.dll
C:\ProgramData\Anaconda3\lib\site-packages\numpy\.libs\libopenblas.TXA6YQSD3GCQQC22GEQ54J2UDCXDXHWN.gfortran-win_amd64.dll
  stacklevel=1)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\angus.tu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\angus.tu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 探索資料與資料前處理
在train資料中，有分成pos(positive)與neg(negative)，分別為正評價與負評價，此評價即為label。

In [7]:
# 讀取字典，這份字典為review內所有出現的字詞 (imdb.vocab)
###<your code>###
with open('aclImdb\imdb.vocab', 'r',encoding='utf-8') as fp:
    vocab = fp.read().split('\n')
print(vocab[:3])

# 以nltk stopwords移除贅字，過多的贅字無法提供有用的訊息，也可能影響模型的訓練
print(f"vocab length before removing stopwords: {len(vocab)}")

###<your code>###
vocab=[word for word in vocab if not word in set(stopwords.words('english'))]
print(f"vocab length after removing stopwords: {len(vocab)}")

# 將字典轉換成dictionary
### <your code>###
vocab_dic = {}
index = 0
for i, word in enumerate(vocab):
    if word not in vocab_dic:
        vocab_dic[word] = index
        index += 1

['the', 'and', 'a']
vocab length before removing stopwords: 89527
vocab length after removing stopwords: 89356


In [8]:
print(vocab[:3])

['movie', 'film', 'one']


In [10]:
print(vocab_dic)



In [23]:
# 將資料打包成(x, y)配對，其中x為review的檔案路徑，y為正評(1)或負評(0)
# 這裡將x以檔案路徑代表的原因是讓同學練習不一次將資料全讀取進來，若電腦記憶體夠大(所有資料檔案沒有很大)
# 可以將資料全一次讀取，可以減少在訓練時I/O時間，增加訓練速度

###<your code>###
# >>> review_pairs = [('./aclImdb/train/pos/4715_9.txt', 1), ('./aclImdb/train/pos/12390_8.txt', 1)]
review_pairs = []

#'aclImdb\train\pos\'
#'aclImdb\train\neg\'
for tag in ['pos', 'neg']:
    path = r'.\aclImdb\train\{}\*.txt'.format(tag)
    if tag == 'pos':
        label = 1
    else:
        label = 0
    for f_name in glob.glob(path):
        #print(f_name, label)
        review_pairs.append((f_name, label))

print(review_pairs[:2])
print(f"Total reviews: {len(review_pairs)}")

[('.\\aclImdb\\train\\pos\\0_9.txt', 1), ('.\\aclImdb\\train\\pos\\10000_8.txt', 1)]
Total reviews: 25000


### 建立Dataset與DataLoader讀取資料
這裡我們會需要兩個helper functions，其中一個是讀取資料與清洗資料的函式(load_review)，另外一個是生成詞向量BoW的函式
(generate_bow)

In [35]:
def load_review(review_path):
    
    ###<your code>###
    with open(review_path, 'r', encoding='utf-8') as fp:
        review = fp.read()
        
    #移除non-alphabet符號、贅字與tokenize
    ###<your code>###
    review=re.sub('[^a-zA-Z]',' ',review)  # <>. 移除non-alphabet符號
    review=review.lower()                  # <>. 把全部變成小寫
    review = nltk.word_tokenize(review)    # <>. 斷詞tokenize
    review=[word for word in review if not word in set(stopwords.words('english'))] # <>. 移除贅字
    
    return review

In [38]:
review = load_review(review_pairs[0][0])
print(review)

['bromwell', 'high', 'cartoon', 'comedy', 'ran', 'time', 'programs', 'school', 'life', 'teachers', 'years', 'teaching', 'profession', 'lead', 'believe', 'bromwell', 'high', 'satire', 'much', 'closer', 'reality', 'teachers', 'scramble', 'survive', 'financially', 'insightful', 'students', 'see', 'right', 'pathetic', 'teachers', 'pomp', 'pettiness', 'whole', 'situation', 'remind', 'schools', 'knew', 'students', 'saw', 'episode', 'student', 'repeatedly', 'tried', 'burn', 'school', 'immediately', 'recalled', 'high', 'classic', 'line', 'inspector', 'sack', 'one', 'teachers', 'student', 'welcome', 'bromwell', 'high', 'expect', 'many', 'adults', 'age', 'think', 'bromwell', 'high', 'far', 'fetched', 'pity']


In [39]:
def generate_bow(review, vocab_dic):
    bag_vector = np.zeros(len(vocab_dic))
    for word in review:
        if vocab_dic.get(word):
            bag_vector[vocab_dic.get(word)] += 1
            
    return bag_vector

In [40]:
class dataset(Dataset):
    '''custom dataset to load reviews and labels
    Parameters
    ----------
    data_pairs: list
        directory of all review-label pairs
    vocab: list
        list of vocabularies
    '''
    def __init__(self, data_dirs, vocab):
        ###<your code>###
        self.data_dirs = data_dirs
        self.vocab = vocab
        
    def __len__(self):
        ###<your code>###
        return len(self.data_dirs)

    def __getitem__(self, idx):
        ###<your code>###
        file_path, label = self.data_dirs[idx]
        reviews = load_review(file_path)
        reviews = generate_bow(reviews, self.vocab)
        return (reviews, label)
        

In [41]:
# 建立客製化dataset
###<your code>###
custom_dst = dataset(review_pairs, vocab_dic)
custom_dst[10]

(array([0., 2., 2., ..., 0., 0., 0.]), 1)

In [42]:
# 建立dataloader
###<your code>###
custom_dataloader = DataLoader(custom_dst, batch_size=5, shuffle=True)
next(iter(custom_dataloader))

[tensor([[0., 7., 2.,  ..., 0., 0., 0.],
         [0., 1., 0.,  ..., 0., 0., 0.],
         [0., 5., 1.,  ..., 0., 0., 0.],
         [0., 0., 1.,  ..., 0., 0., 0.],
         [0., 0., 3.,  ..., 0., 0., 0.]], dtype=torch.float64),
 tensor([1, 1, 1, 0, 1])]

In [43]:
next(iter(custom_dataloader))

[tensor([[0., 0., 3.,  ..., 0., 0., 0.],
         [0., 0., 1.,  ..., 0., 0., 0.],
         [0., 6., 1.,  ..., 0., 0., 0.],
         [0., 2., 0.,  ..., 0., 0., 0.],
         [0., 2., 2.,  ..., 0., 0., 0.]], dtype=torch.float64),
 tensor([1, 1, 1, 1, 1])]

In [44]:
next(iter(custom_dataloader))

[tensor([[ 0.,  4.,  0.,  ...,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  ...,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  ...,  0.,  0.,  0.],
         [ 0.,  2.,  1.,  ...,  0.,  0.,  0.],
         [ 0., 10.,  0.,  ...,  0.,  0.,  0.]], dtype=torch.float64),
 tensor([1, 1, 1, 0, 0])]