# Pre-processing

In [1]:
import re
import pandas as pd

file_path = '../data/Airline-Sentiment-2-w-AA.csv'

In [2]:
all_data = pd.read_csv(file_path, encoding='iso-8859-2')
all_data

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,airline_sentiment,airline_sentiment:confidence,negativereason,negativereason:confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,681448150,False,finalized,3,2/25/15 5:24,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2/24/15 11:35,5.703060e+17,,Eastern Time (US & Canada)
1,681448153,False,finalized,3,2/25/15 1:53,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2/24/15 11:15,5.703010e+17,,Pacific Time (US & Canada)
2,681448156,False,finalized,3,2/25/15 10:01,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2/24/15 11:15,5.703010e+17,Lets Play,Central Time (US & Canada)
3,681448158,False,finalized,3,2/25/15 3:05,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2/24/15 11:15,5.703010e+17,,Pacific Time (US & Canada)
4,681448159,False,finalized,3,2/25/15 5:50,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2/24/15 11:14,5.703010e+17,,Pacific Time (US & Canada)
5,681448162,False,finalized,3,2/25/15 9:10,negative,1.0000,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2/24/15 11:14,5.703010e+17,,Pacific Time (US & Canada)
6,681448165,False,finalized,3,2/25/15 8:11,positive,0.6745,,0.0000,Virgin America,,cjmcginnis,,0,"@VirginAmerica yes, nearly every time I fly VX...",,2/24/15 11:13,5.703010e+17,San Francisco CA,Pacific Time (US & Canada)
7,681448167,False,finalized,3,2/25/15 2:11,neutral,0.6340,,,Virgin America,,pilot,,0,@VirginAmerica Really missed a prime opportuni...,,2/24/15 11:12,5.703000e+17,Los Angeles,Pacific Time (US & Canada)
8,681448169,False,finalized,3,2/25/15 9:01,positive,0.6559,,,Virgin America,,dhepburn,,0,"@virginamerica Well, I didn'tŰ_but NOW I DO! :-D",,2/24/15 11:11,5.703000e+17,San Diego,Pacific Time (US & Canada)
9,681448171,False,finalized,3,2/25/15 4:15,positive,1.0000,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2/24/15 10:53,5.702950e+17,Los Angeles,Eastern Time (US & Canada)


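Note that opening `Airline-Sentiment-2-w-AA.csv` locally in `vscode` shows 14874 lines in total, whereas the table returned by `pd.read_csv` has only 14640 rows. A closer look at the file suggests that some of its lines do not form separate data rows, and `pd.read_csv` appears to account for them automatically. The resulting 14640 rows also match the paper's description of the dataset: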
> Our data is available online. It has **14640 valid tweets** from 2/17/2015 to 2/24/2015 related to reviews of major U.S. airlines, containing sentiment label, negative reason label, tweets content and other meta information like location, user ID etc. The data fraction is roughly 15% positive, 65% negative, and 20% neutral.
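
One plausible cause of the gap between the editor's line count and the row count is a quoted text field that contains line breaks. This has not been verified against the actual file, so treat it as an assumption. The sketch below uses the standard-library `csv` module on a made-up two-record sample (the `raw` string is invented for illustration) to show one record spanning two physical lines:

```python
import csv
import io

# Invented sample: two records, the second tweet has a line break inside quotes.
raw = 'id,text\n1,"on time, great crew"\n2,"delayed again\nno update at the gate"\n'

# Physical lines, as a text editor would count them (header included).
physical_lines = raw.count('\n')

# Parsed records: the quoted line break stays inside a single field.
records = list(csv.DictReader(io.StringIO(raw)))

print(physical_lines, len(records))
```

`pd.read_csv` treats quoted line breaks the same way, so a file can have more physical lines than parsed rows without any data being dropped.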

## Tweet-level

Pre-processing of the emoticons contained in tweets.<br>
Because GloVe includes embedding vectors for some tweet emoticons, every positive emoticon is replaced with `:)` and every negative one with `:(`.

In [3]:
def handle_emojis(tweet):
    # Smile -- :), : ), :-), (:
    tweet = re.sub(r'(:\s?\)|:-\)|\(:)', ' :) ', tweet)
    # Laugh -- :D, :-D, xD, XD
    tweet = re.sub(r'(\s:D|:-D|\sxD|\sXD)', ' :) ', tweet)
    # Wink -- ;-), ;)
    tweet = re.sub(r'(;-?\))', ' :) ', tweet)
    # Sad -- :(, : (, :-(
    tweet = re.sub(r'(:\(|\s:\s\(|:-\()', ' :( ', tweet)
    # Cry -- :'(
    tweet = re.sub(r'(:\'\()', ' :( ', tweet)
    return tweet
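
As a quick check of the mapping, the sketch below arranges the same patterns as `(pattern, replacement)` pairs and applies them to a few made-up inputs. The names `EMOJI_RULES` and `normalize_emojis` and the sample strings are illustrative only and are not part of the notebook.

```python
import re

# Same regular expressions as handle_emojis, listed in the same order.
EMOJI_RULES = [
    (r'(:\s?\)|:-\)|\(:)', ' :) '),       # smile
    (r'(\s:D|:-D|\sxD|\sXD)', ' :) '),    # laugh
    (r'(;-?\))', ' :) '),                 # wink
    (r'(:\(|\s:\s\(|:-\()', ' :( '),      # sad
    (r'(:\'\()', ' :( '),                 # cry
]

def normalize_emojis(tweet):
    """Apply each rule in turn, as handle_emojis does."""
    for pattern, replacement in EMOJI_RULES:
        tweet = re.sub(pattern, replacement, tweet)
    return tweet

samples = ['NOW I DO! :-D', 'see you soon ;)', "bag lost again :'("]
for sample in samples:
    # split/join only tidies the padding spaces for display
    print(' '.join(normalize_emojis(sample).split()))
```

The replacements are padded with spaces so the emoticon becomes its own token. The extra spaces are collapsed later in the tweet-level step.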

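Tweet-level pre-processing consists of the following steps, in order:
* Handle emoticons: positive ones are all replaced with `:)`, negative ones with `:(`.
* Convert characters to lower case.
* Replace every URL with `urlToken`.
* Replace every `@XXX` with `userMentionToken`.
* Replace every `#XXX` with `XXX`, i.e. remove the `#`.
* Remove `RT`.
* Replace runs of 5 or more consecutive dots, such as `......`, with `.....` (GloVe includes an embedding vector for `.....`).
* Replace garbled characters and underscores with a single space.
* Strip spaces, `"`, and `'` from the beginning and end of the tweet.
* Replace consecutive spaces with a single space.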

In [4]:
def preprocess_tweet(tweet):
    # GloVe has vectors for some emoticons,
    # so map every emoticon to either :) or :(
    tweet = handle_emojis(tweet)
    # Convert to lower case
    tweet = tweet.lower()
    # Replace URLs with urlToken
    tweet = re.sub(r'((www\.[\S]+)|(https?://[\S]+))', ' urlToken ', tweet)
    # Replace @handle with userMentionToken
    tweet = re.sub(r'@[\S]+', 'userMentionToken', tweet)
    # Replace #hashtag with hashtag
    tweet = re.sub(r'#(\S+)', r' \1 ', tweet)
    # Remove RT (retweet)
    tweet = re.sub(r'\brt\b', '', tweet)
    # GloVe has vectors for runs of 1 to 5 dots,
    # so collapse runs of 5 or more dots to exactly 5.
    # The replacement is a plain string: '\.' there would be kept literally.
    tweet = re.sub(r'\.{5,}', '.....', tweet)
    
    messyCodeRegex = r'[^a-zA-Z0-9\~\`\!\@\#\$\%\^\&\*\(\)\-\—\+\=\{\}\[\]\:\;\"\'\<\>\,\.\?\/\ ]+'
    # Replace messy code and _ with a single space
    tweet = re.sub(messyCodeRegex, ' ', tweet)
    
    # Strip space, " and ' from tweet
    tweet = tweet.strip(' "\'')
    # Replace multiple spaces with a single space
    tweet = re.sub(r'\s+', ' ', tweet)
    
    return tweet
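
To see the effect of the token-related steps on one input, here is a condensed sketch that leaves out the emoticon and garbled-character steps. The helper name `preprocess_subset` and the sample tweet are made up for illustration, and the dot replacement is written as five literal dots.

```python
import re

def preprocess_subset(tweet):
    """Hypothetical helper: a subset of the tweet-level steps, in the same order."""
    tweet = tweet.lower()
    tweet = re.sub(r'((www\.[\S]+)|(https?://[\S]+))', ' urlToken ', tweet)  # URLs
    tweet = re.sub(r'@[\S]+', 'userMentionToken', tweet)                     # mentions
    tweet = re.sub(r'#(\S+)', r' \1 ', tweet)                                # hashtags
    tweet = re.sub(r'\brt\b', '', tweet)                                     # retweet marker
    tweet = re.sub(r'\.{5,}', '.....', tweet)                                # long dot runs
    tweet = tweet.strip(' "\'')                                              # outer spaces/quotes
    tweet = re.sub(r'\s+', ' ', tweet)                                       # repeated spaces
    return tweet

example = 'RT @VirginAmerica Thanks for the #upgrade........ see http://t.co/abc'
print(preprocess_subset(example))
# userMentionToken thanks for the upgrade..... see urlToken
```

Lower-casing runs before the placeholder substitutions, which is why `urlToken` and `userMentionToken` keep their mixed case in the output.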

## Word-level

GloVe includes word vectors for a large number of non-standard words, so no word-level pre-processing is done here.<br>
**Tokenization will use `nltk.tokenize.WordPunctTokenizer`.**
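
For a rough idea of how this tokenizer splits a pre-processed tweet, the sketch below uses only the standard library. It assumes the pattern `\w+|[^\w\s]+`, which is the regular expression the NLTK documentation gives for `WordPunctTokenizer`. The function name `word_punct_tokenize` is illustrative, and this is a stand-in rather than the NLTK class itself.

```python
import re

def word_punct_tokenize(text):
    """Split into runs of word characters and runs of punctuation (assumed pattern)."""
    return re.findall(r'\w+|[^\w\s]+', text)

tokens = word_punct_tokenize("userMentionToken it's great :) .....")
print(tokens)
# ['userMentionToken', 'it', "'", 's', 'great', ':)', '.....']
```

Under this pattern `:)` and `.....` each stay in one piece, so they can be looked up as single GloVe tokens, while contractions such as `it's` are split at the apostrophe.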


## Generate processed data file

Keep only the id, sentiment, and text of each record.

In [5]:
# .copy() makes an independent DataFrame, so assigning to a column
# in the next cell does not trigger SettingWithCopyWarning
all_data = all_data[['_unit_id', 'airline_sentiment', 'text']].copy()
all_data.columns = ['id', 'sentiment', 'text']

In [6]:
all_data['text'] = all_data['text'].apply(preprocess_tweet)


In [7]:
test_data_size = int(0.2 * len(all_data))
train_data = all_data[test_data_size:]
test_data = all_data[:test_data_size]
print(len(train_data))
print(len(test_data))

11712
2928
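
The split above takes the first 20% of rows, in file order, as the test set. If the file is grouped by airline, as the preview at the top suggests (its first rows are all Virgin America), a shuffled split may be more representative. The sketch below is one possible alternative, not what the notebook does. The function name `shuffled_split` and the seed value are assumptions.

```python
import random

def shuffled_split(n_rows, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) over range(n_rows) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # reproducible for a fixed seed
    test_size = int(test_fraction * n_rows)  # same size rule as the cell above
    return indices[test_size:], indices[:test_size]

train_idx, test_idx = shuffled_split(14640)
print(len(train_idx), len(test_idx))
```

With the DataFrame in hand, `all_data.iloc[train_idx]` and `all_data.iloc[test_idx]` would select the two subsets. The sizes stay at 11712 and 2928.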


In [8]:
all_data.to_csv('../data/processed_all.csv', index=False)
train_data.to_csv('../data/processed_train.csv', index=False)
test_data.to_csv('../data/processed_test.csv', index=False)