# Dataset Analysis

## Dataset Size

This exercise uses the Rotten Tomatoes movie review dataset. Its size is as follows:

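    Training set: 8544 sentences and their parse trees, with a label for every sentence and phrase; 156060 rows in total
    Test set: 3309 sentences and their parse trees, unlabeled; 66292 rows in total
    Labels: five classes in total: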
        0 - negative
        1 - somewhat negative
        2 - neutral
        3 - somewhat positive
        4 - positive

## Preprocessing Methods

1. BOW (bag of words)

2. TF-IDF

3. N-gram + BOW
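To make the three options concrete, here is a minimal pure-Python sketch on a made-up toy corpus. The helper names (`ngrams`, `bow`, `tfidf`) are hypothetical, and the TF-IDF variant shown (term count divided by document length, times `log(N / df)`) is only one common definition; library implementations often apply smoothing, so their numbers will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bow(tokens, vocab):
    """Count vector of `tokens` over the ordered vocabulary `vocab`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def tfidf(docs):
    """TF-IDF with tf = count / document length and idf = log(N / df)."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[t] / len(d) * idf[t] for t in vocab])
    return vocab, vectors

# Toy corpus (made-up sentences, not taken from the dataset)
docs = [s.split(" ") for s in ["a good movie", "a bad movie", "good good fun"]]

# 1. BOW: one count per vocabulary term
vocab = sorted({t for d in docs for t in d})
bow_vectors = [bow(d, vocab) for d in docs]

# 2. TF-IDF: down-weights terms that occur in many documents
tfidf_vocab, tfidf_vectors = tfidf(docs)

# 3. N-gram + BOW: append each document's bigrams, then count as before
docs_12 = [d + ngrams(d, 2) for d in docs]
vocab_12 = sorted({t for d in docs_12 for t in d})
bow_12 = [bow(d, vocab_12) for d in docs_12]
```

Adding bigrams lets the representation keep some word order (for example `"good movie"` versus `"bad movie"`), at the cost of a larger vocabulary.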

## Training Approach

1. Train on complete sentences only

2. Train on complete sentences together with their parse trees
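The two options differ only in which rows are used for training. The sketch below illustrates this on a made-up toy frame that borrows the column names of `train.tsv`; it assumes that the first row listed for each `SentenceId` is the complete sentence, which should be checked against the real data.

```python
import pandas as pd

# Toy frame with the same columns as train.tsv (rows are made up for illustration)
toy = pd.DataFrame({
    "PhraseId":   [1, 2, 3, 4, 5],
    "SentenceId": [1, 1, 1, 2, 2],
    "Phrase":     ["a good movie", "a good", "movie", "bad plot", "bad"],
    "Sentiment":  [3, 3, 2, 1, 1],
})

# Approach 1: complete sentences only.
# Assumption: the first row of each SentenceId holds the full sentence.
sentences_only = toy.drop_duplicates(subset="SentenceId", keep="first")

# Approach 2: complete sentences plus every phrase of their parse trees, i.e. all rows.
sentences_and_phrases = toy

X1, y1 = sentences_only["Phrase"].tolist(), sentences_only["Sentiment"].tolist()
X2, y2 = sentences_and_phrases["Phrase"].tolist(), sentences_and_phrases["Sentiment"].tolist()
```

The second approach gives many more labeled examples, including short phrases and single words, while the first keeps only one example per sentence.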

## Dataset Processing

### 1. Reading the Files

In [2]:
import pandas as pd

In [24]:
train=pd.read_csv('train.tsv',sep='\t')
train

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2
...,...,...,...,...
156055,156056,8544,Hearst 's,2
156056,156057,8544,forced avuncular chortles,1
156057,156058,8544,avuncular chortles,3
156058,156059,8544,avuncular,2


In [33]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [25]:
test=pd.read_csv('test.tsv',sep='\t')
test

Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine
...,...,...,...
66287,222348,11855,"A long-winded , predictable scenario ."
66288,222349,11855,"A long-winded , predictable scenario"
66289,222350,11855,"A long-winded ,"
66290,222351,11855,A long-winded


### Reading the Complete Sentences

In [56]:
# get all the full sentences from the training dataset
ss=[]
for i in range(1,8545):
    sentence=train.loc[train.SentenceId==i,['Phrase']]
    # some SentenceIds have no rows at all; skip them
    if sentence.empty:
        continue
    # the first phrase listed for a SentenceId is the complete sentence
    ss+=[sentence.iloc[0,0]]
ss[0]

'A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .'

In [54]:
sentence=train.loc[train.SentenceId==i,['Phrase']].iloc[:]
sentence.iloc[:].values.tolist()==[]

True

### Building the Vocabulary

In [57]:
len(ss)

8529

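Note that the number of extracted sentences (8529) does not match the number of sentence IDs (8544): some IDs have no rows in the training set. The dataset is not exactly friendly, haha.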
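One way to see where the gap comes from is to list the IDs that never occur. The sketch below does this on a made-up toy frame standing in for `train`; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for `train`: SentenceId 3 never appears (made-up rows)
toy = pd.DataFrame({
    "SentenceId": [1, 1, 2, 4, 4, 4],
    "Phrase": ["a b", "a", "c", "d e f", "d e", "f"],
})

# IDs that actually occur, as plain Python ints
present = set(toy["SentenceId"].unique().tolist())

# IDs we would expect if numbering were contiguous from 1 to the maximum
expected = set(range(1, int(toy["SentenceId"].max()) + 1))

# IDs with no rows at all
missing = sorted(expected - present)

# Number of distinct sentences actually present
n_sentences = toy["SentenceId"].nunique()
```

Run against the real `train` frame, `n_sentences` should match `len(ss)` and `missing` should list the skipped IDs.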

In [58]:
vacabulary={}
idx=0
# for each sentence in the sentences list
for s in ss:
    for word in s.split(' '):
        # assign the next free index to each token the first time it appears
        if word not in vacabulary:
            vacabulary[word]=idx
            idx+=1
vacabulary

{'A': 0,
 'series': 1,
 'of': 2,
 'escapades': 3,
 'demonstrating': 4,
 'the': 5,
 'adage': 6,
 'that': 7,
 'what': 8,
 'is': 9,
 'good': 10,
 'for': 11,
 'goose': 12,
 'also': 13,
 'gander': 14,
 ',': 15,
 'some': 16,
 'which': 17,
 'occasionally': 18,
 'amuses': 19,
 'but': 20,
 'none': 21,
 'amounts': 22,
 'to': 23,
 'much': 24,
 'a': 25,
 'story': 26,
 '.': 27,
 'This': 28,
 'quiet': 29,
 'introspective': 30,
 'and': 31,
 'entertaining': 32,
 'independent': 33,
 'worth': 34,
 'seeking': 35,
 'Even': 36,
 'fans': 37,
 'Ismail': 38,
 'Merchant': 39,
 "'s": 40,
 'work': 41,
 'I': 42,
 'suspect': 43,
 'would': 44,
 'have': 45,
 'hard': 46,
 'time': 47,
 'sitting': 48,
 'through': 49,
 'this': 50,
 'one': 51,
 'positively': 52,
 'thrilling': 53,
 'combination': 54,
 'ethnography': 55,
 'all': 56,
 'intrigue': 57,
 'betrayal': 58,
 'deceit': 59,
 'murder': 60,
 'Shakespearean': 61,
 'tragedy': 62,
 'or': 63,
 'juicy': 64,
 'soap': 65,
 'opera': 66,
 'Aggressive': 67,
 'self-glorification

In [59]:
# check how many words there are in the dictionary
len(vacabulary)

18133

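The 8529 sentences yield only 18133 distinct words, which suggests that the vocabulary people use in everyday language overlaps to a large extent.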
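The count above is case-sensitive and includes punctuation: the dictionary output lists `'A'` and `'a'` as separate entries, as well as `','` and `'.'`. The sketch below, on made-up sentences, shows how lowercasing would change the vocabulary size. `build_vocabulary` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def build_vocabulary(sentences, lowercase=False):
    """Map each distinct token to an index in order of first appearance."""
    vocabulary = {}
    for s in sentences:
        for word in s.split(" "):
            if lowercase:
                word = word.lower()
            # setdefault assigns an index only the first time a token is seen
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary

# Made-up sentences for illustration
sentences = ["A good story .", "a good movie .", "Good fun ."]

cased = build_vocabulary(sentences)                    # 'A', 'a', 'good', 'Good' are all distinct
uncased = build_vocabulary(sentences, lowercase=True)  # case variants are merged

# Token frequencies show how much a few tokens are reused across sentences
freq = Counter(w.lower() for s in sentences for w in s.split(" "))
```

Whether to lowercase is a design choice: it shrinks the feature space, but capitalization can occasionally carry signal, for example in proper nouns.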

### Building the Feature Representation Model