# Environment Setup and Notes
* Mount Google Drive and change to the working directory.
* This notebook is based on this [Kaggle notebook](https://www.kaggle.com/code/swarnabha/pytorch-text-classification-torchtext-lstm).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
cd /content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification

/content/drive/Othercomputers/내 컴퓨터/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification


In [3]:
# Pin the torchtext version to avoid errors caused by version incompatibilities
!pip install torchtext==0.10.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Sentence Classification

* cleaning and basic pre-processing of text
* building a vocabulary, and creating iterators using TorchText
* building a sequence model (an LSTM) using PyTorch to predict labels

In [4]:
# List the files under the project directory
import os
for dirname, _, filenames in os.walk('/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification'):
  for filename in filenames:
    print(os.path.join(dirname, filename))

/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/01_pytorch_text_classification_torchtext_lstm.ipynb
/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/archive/glove.6B.100d.txt
/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/archive/glove.6B.200d.txt
/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/archive/glove.6B.50d.txt
/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/nlp-getting-started/train.csv
/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/nlp-getting-started/test.csv
/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/nlp-getting-started/sample_submission.csv


In [5]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import time

import torch
from torchtext import data
import torch.nn as nn

In [6]:
# Load the data
train = pd.read_csv('/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/nlp-getting-started/train.csv')
test = pd.read_csv('/content/drive/MyDrive/kimjuyeon/NLP 스터디/nlp-code-study/01_text_classification/nlp-getting-started/test.csv')

# Data Pre-Processing


In [7]:
train.shape

(7613, 5)

In [8]:
train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


* The `target` column holds the label for each `text`.
* label == 1: the tweet is about a disaster
* label == 0: the tweet is not about a disaster

-> We only need the `text` and `target` columns, so the remaining columns are dropped.


In [9]:
# Drop the 'id', 'keyword', and 'location' columns
train.drop(columns=['id', 'keyword', 'location'], inplace=True)

In [10]:
train.head()

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |


* Clean and normalise the text so that the classification algorithm is not confused by irrelevant information.

In [11]:
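# Clean the raw tweet text; regex=True is passed explicitly because
# newer pandas versions treat the pattern as a literal string by default
def normalise_text(text):
  text = text.str.lower() # lowercase
  text = text.str.replace(r"\#", "", regex=True) # strip the hashtag symbol
  text = text.str.replace(r"http\S+", "URL", regex=True) # replace URLs with a placeholder token
  text = text.str.replace(r"[^A-Za-z0-9()!?\'\`\"]", " ", regex=True) # replace other characters with spaces
  text = text.str.replace(r"\s{2,}", " ", regex=True) # collapse repeated whitespace
  return text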

In [12]:
train["text"] = normalise_text(train["text"])



In [13]:
train['text'].head()

0    our deeds are the reason of this earthquake ma...
1                forest fire near la ronge sask canada
2    all residents asked to 'shelter in place' are ...
3    13 000 people receive wildfires evacuation ord...
4    just got sent this photo from ruby alaska as s...
Name: text, dtype: object
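
To see what each cleaning step does, the same substitutions can be applied to a single string with the standard `re` module. This is a minimal sketch: `normalise_string` is a hypothetical helper and the sample tweet and its URL are made up for illustration.

```python
import re

def normalise_string(text):
    """Apply the same cleaning steps as normalise_text to one string."""
    text = text.lower()                                   # lowercase
    text = re.sub(r"#", "", text)                         # strip the hashtag symbol
    text = re.sub(r"http\S+", "URL", text)                # replace URLs with a placeholder
    text = re.sub(r"[^A-Za-z0-9()!?\'\`\"]", " ", text)   # other characters become spaces
    text = re.sub(r"\s{2,}", " ", text)                   # collapse repeated whitespace
    return text

# Made-up sample tweet, not taken from the dataset.
sample = "Forest FIRE near #LaRonge, see http://example.com/x"
cleaned = normalise_string(sample)
print(cleaned)  # forest fire near laronge see URL
```

Because lowercasing happens first, the `URL` placeholder stays uppercase and is easy to tell apart from ordinary words.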

* Split the data into training and validation sets.

In [14]:
train_df, valid_df = train_test_split(train, random_state=0)

In [15]:
train_df.head()

| | text | target |
|---|---|---|
| 5244 | refugio oil spill may have been costlier bigge... | 1 |
| 4860 | julian knight scvsupremecourt dismisses mass m... | 1 |
| 6538 | electricity cant stop scofield nigga survived ... | 0 |
| 5175 | meek mill begging nicki minaj to let him oblit... | 0 |
| 5820 | china s stock market crash this summer has spa... | 0 |


In [16]:
valid_df.head()

| | text | target |
|---|---|---|
| 311 | katiekatcubs you already know how this shit g... | 0 |
| 4970 | lemairelee danharmon people near meltdown com... | 0 |
| 527 | 1 6 tix calgary flames vs col avalanche presea... | 0 |
| 6362 | if you ever think you running out of choices i... | 0 |
| 800 | if you dotish to blight your car go right ahea... | 0 |
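
When `train_test_split` is called without `test_size`, scikit-learn holds out 25% of the rows. The sketch below checks the resulting sizes on a placeholder frame with the same number of rows as `train.csv` (7613). The `stratify` call is an optional variation that the notebook does not use, shown only as an assumption about what might be useful for label balance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with 7613 rows; the contents are not the real tweets.
toy = pd.DataFrame({"text": ["tweet"] * 7613, "target": np.arange(7613) % 2})

# Default split: 75% training rows, 25% validation rows.
train_part, valid_part = train_test_split(toy, random_state=0)
print(len(train_part), len(valid_part))  # 5709 1904

# Optional: stratify keeps the label ratio nearly identical in both parts.
train_s, valid_s = train_test_split(toy, random_state=0, stratify=toy["target"])
print(train_s["target"].mean(), valid_s["target"].mean())
```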


* Fix the random seed so that the results can be reproduced.

In [17]:
SEED = 42
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
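
The cell above seeds only PyTorch. If other parts of a pipeline draw random numbers from Python's `random` module or from NumPy, those generators need seeding as well. A minimal sketch, where `set_seed` is a hypothetical helper name and not part of the original notebook:

```python
import random

import numpy as np
import torch

def set_seed(seed):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Ask cuDNN for deterministic kernels and disable its autotuner.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Re-seeding with the same value reproduces the same random tensor.
set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```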

* We need to create a `Field` object to process the text data.
* A `Field` object defines how text is converted into tensors.
* We set two parameters:
  * `tokenize='spacy'`: spaCy (an open-source Python library for natural language processing) is used to tokenize the text.
  * `include_lengths=True`: the `Field` also returns the length of each text, which is needed when padding.
  * Later, these objects are used to build a vocabulary, which helps create a numerical representation for every token.
* `LabelField` is a shallow wrapper around `Field` that is useful for data labels.
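
To make the role of the vocabulary, padding, and text lengths concrete, the following sketch reproduces the idea by hand in plain Python and PyTorch. It is not the torchtext implementation: whitespace splitting stands in for the spaCy tokenizer, and the names `itos` and `stoi` and the toy sentences are assumptions chosen for illustration.

```python
from collections import Counter

import torch
from torch.nn.utils.rnn import pad_sequence

# Toy corpus; str.split() stands in for the spaCy tokenizer.
texts = ["forest fire near la ronge", "just got sent this photo", "fire"]
tokenized = [t.split() for t in texts]

# Vocabulary: index 0 is <unk>, index 1 is <pad>, then tokens by frequency.
counts = Counter(tok for sent in tokenized for tok in sent)
itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common()]
stoi = {tok: i for i, tok in enumerate(itos)}

# Numericalize each sentence and record its true length before padding.
ids = [torch.tensor([stoi.get(tok, 0) for tok in sent]) for sent in tokenized]
lengths = torch.tensor([len(seq) for seq in ids])

# Pad to the longest sentence; the result has shape (max_len, batch_size).
padded = pad_sequence(ids, padding_value=stoi["<pad>"])

print(padded.shape)  # torch.Size([5, 3])
print(lengths)       # tensor([5, 5, 1])
```

The lengths let a recurrent model ignore the padded positions, which is why the text field is asked to return them alongside the padded tensor.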



In [20]:
from torchtext.legacy import data  # in torchtext 0.10.0, Field and LabelField live under torchtext.legacy
TEXT = data.Field(tokenize='spacy', include_lengths=True)
LABEL = data.LabelField(dtype=torch.float)

* For the data, the `DataFrameDataset` class


