# Create Dataset

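The Twitter Japanese Reputation Analysis Dataset (Twitter日本語評判分析データセット) does not include the tweet text, so we fetch the text for each tweet ID through the Twitter API.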

In [1]:
import sys
import yaml
import json
import itertools
import pandas as pd
from time import sleep
from requests_oauthlib import OAuth1Session

In [2]:
# Load the credentials required to use the Twitter API
def read_yaml():
    with open('../input/credentials.yml', 'r') as yml:
        config = yaml.load(yml, Loader=yaml.FullLoader)
    return config

credentials = read_yaml()
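
The constructor in the next cell reads four keys from this mapping: `consumer_key`, `consumer_key_secret`, `access_token`, and `access_token_secret`. The following is a minimal sketch of an up-front check, assuming the YAML file is a flat mapping with exactly those keys. The helper name `missing_credentials` and the placeholder values are hypothetical and not part of the original notebook.

```python
# Keys that TweetsExtractor.__init__ looks up in the config mapping.
REQUIRED_KEYS = (
    'consumer_key',
    'consumer_key_secret',
    'access_token',
    'access_token_secret',
)

def missing_credentials(config):
    """Return the required keys that are absent or empty (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]

# Placeholder values for illustration only; real values come from credentials.yml.
example_config = {
    'consumer_key': 'xxxx',
    'consumer_key_secret': 'xxxx',
    'access_token': 'xxxx',
}
print(missing_credentials(example_config))
```

Running such a check before the extraction starts would surface a misnamed key immediately instead of as a `KeyError` inside the constructor.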

In [3]:
# reference: https://github.com/tatHi/tweet_extructor/blob/master/tweetExtructor.py
class TweetsExtractor(object):
    def __init__(self, config, csv_path):
        self.session = OAuth1Session(
            config['consumer_key'], 
            config['consumer_key_secret'], 
            config['access_token'], 
            config['access_token_secret']
        )
        self.csv_path = csv_path
        self.url = 'https://api.twitter.com/1.1/statuses/lookup.json'
    
    def get_tweets(self, tweet_ids):
        tweet_ids = ','.join(list(map(str,tweet_ids)))
        response = self.session.get(self.url, params={'id': tweet_ids})
        if response.status_code != 200:
            print('Twitter API Error: %d' % response.status_code)
            sys.exit(1)
        response_json = json.loads(response.text)
        # Map tweet ID -> tweet text for every tweet the API returned
        return {rt['id']: rt['text'] for rt in response_json}
    
    def extract_tweets(self):
        annotations = []
        with open(self.csv_path) as f:
            for line in f:
                # Skip rows with missing values, which cannot be parsed as integers
                try:
                    annotations.append(list(map(int, line.strip().split(','))))
                except ValueError:
                    pass
        dataset = []
        for i, batch in enumerate(itertools.zip_longest(*[iter(annotations)]*100)):
            batch = [b for b in batch if b is not None]
            tweets = self.get_tweets([line[2] for line in batch])

            for line in batch:
                # Keep only tweets the API returned; IDs that are no longer available are skipped
                if line[2] in tweets:
                    data = {'id':line[0],
                            'topic':line[1],
                            'status':line[2],
                            'label':line[3:],
                            'text':tweets[line[2]]
                           }
                    dataset.append(data)
                else:
                    pass
            sleep(1)
        # Use a context manager so the file is flushed and closed before it is read back
        with open('../input/data.json', 'w') as f:
            json.dump(dataset, f)
    
    def export_to_csv(self):
        self.extract_tweets()
        
        data_df = pd.read_json('../input/data.json')
        annotations = pd.read_csv(
            self.csv_path,  # same annotation file that extract_tweets() reads
            names=['id', 'topic', 'status', 'pos&neg', 'pos', 'neg', 'neu', 'non']
        )
        
        # Join the fetched tweets with the per-label annotation columns
        merge_df = pd.merge(data_df, annotations, on=['id', 'topic', 'status'])
        
        merge_df.to_csv('../input/data.csv', index=False)
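
`extract_tweets` groups the annotation rows into batches of 100 with `itertools.zip_longest(*[iter(annotations)]*100)`. The same iterator is repeated 100 times, so each output tuple consumes 100 consecutive rows, and the last tuple is padded with `None`, which the list comprehension then drops. Below is a small self-contained sketch of that idiom with a batch size of 3. The helper name `batches` is hypothetical, and the sketch assumes, as the notebook does, that `None` never occurs as a real item.

```python
import itertools

def batches(items, size):
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    # Repeating one iterator `size` times makes zip_longest pull `size`
    # consecutive items per tuple; the final tuple is padded with None.
    shared = iter(items)
    for group in itertools.zip_longest(*[shared] * size):
        yield [item for item in group if item is not None]

print(list(batches(range(1, 8), 3)))  # [[1, 2, 3], [4, 5, 6], [7]]
```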

In [4]:
%%time
tweets_extractor = TweetsExtractor(credentials, '../input/tweets_open.csv')
tweets_extractor.export_to_csv()

CPU times: user 1min 28s, sys: 2.48 s, total: 1min 30s
Wall time: 2h 16min 46s
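
Most of the wall time is spent waiting rather than computing: CPU time is about 1.5 minutes, while the loop sleeps for one second after every batch of 100 IDs. A rough lower bound from the sleeps alone can be sketched as follows, assuming the annotation file holds about 534,962 usable rows (the release-time figure quoted at the end of this notebook, not a measured row count).

```python
import math

ROWS = 534_962      # assumed number of parsable annotation rows
BATCH_SIZE = 100    # IDs per statuses/lookup request, as in extract_tweets
SLEEP_SECONDS = 1   # sleep(1) after every batch

n_requests = math.ceil(ROWS / BATCH_SIZE)
sleep_total = n_requests * SLEEP_SECONDS
minutes, seconds = divmod(sleep_total, 60)
print(n_requests, f'{minutes} min {seconds} s')  # 5350 89 min 10 s
```

Under that assumption the sleeps account for roughly 1.5 hours of the 2 h 16 min wall time. The remainder is presumably network latency of the API requests.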


In [5]:
pd.read_csv('../input/data.csv', engine='python')

Unnamed: 0,id,topic,status,label,text,pos&neg,pos,neg,neu,non
0,10025,10000.0,5.224077e+17,"[0, 0, 1, 1, 0]",エクスペリアのGPS南北が逆になるのはデフォだったのか。,0.0,0.0,1.0,1.0,0.0
1,10026,10000.0,5.224078e+17,"[0, 0, 1, 0, 0]",xperiaでスクフェス糞\n反応遅いんだよ糞が,0.0,0.0,1.0,0.0,0.0
2,10027,10000.0,5.224080e+17,"[0, 0, 1, 1, 0]",夏春都が持ってたエクスペリアも今使うには辛い,0.0,0.0,1.0,1.0,0.0
3,10032,10000.0,5.224091e+17,"[0, 0, 0, 1, 0]",少し時間空いちゃいましたが、Xperia Z3のカメラ機能について、ちょっとだけですけどまと...,0.0,0.0,0.0,1.0,0.0
4,10033,10000.0,5.224091e+17,"[0, 0, 0, 0, 1]",日向「研磨おたおめー。これプレゼント!!」\n孤爪「こ、これは」\n日向「ビビった?」\n孤...,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
299245,2723562,10021.0,7.029092e+17,"[0, 0, 0, 0, 1]",今さっきカプセルホテルでパスコードとかしてないiPhone6を落としたんだ。\n色々詰んだわ...,0.0,0.0,0.0,0.0,1.0
299246,2723564,10021.0,7.029065e+17,"[0, 0, 0, 1, 0]",KORG Gadget 、iPhone 6s Plusでじゅうぶん動く。KORG Gadge...,0.0,0.0,0.0,1.0,0.0
299247,2723932,10021.0,7.035586e+17,"[0, 0, 0, 1, 0]",あ～ケータイが飛んでる～　あれ？ラッキーの顔がiPhone6だ～まあ私のケータイAndroi...,0.0,0.0,0.0,1.0,0.0
299248,2723937,10021.0,7.035579e+17,"[0, 0, 0, 1, 1]",お風呂上がってぼーっと冷蔵庫の前で\n刑事ドラマの過激なシーンに見とれて\nカバーの付いてな...,0.0,0.0,0.0,1.0,1.0


When the Twitter Japanese Reputation Analysis Dataset was released, 534,962 tweets could reportedly be retrieved, but as of October 15, 2021, only 299,250 tweets were still available.
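
Put as a ratio, a little over half of the originally annotated tweets could still be fetched. The arithmetic below uses only the two counts stated above.

```python
released = 534_962   # tweets reported at dataset release
remaining = 299_250  # tweets still retrievable on 2021-10-15

lost = released - remaining
retention = remaining / released
print(lost, f'{retention:.1%}')  # 235712 55.9%
```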