# Sentiment Analysis
#### IST664/CIS668 Natural Language Processing
#### Sile Hu, Xiaobin Ning, Henglong So, Ziyun Wang

In this notebook, we first compare the accuracy of several feature sets using cross-validation, then use the feature set with the highest accuracy to perform the sentiment analysis.

### Part 1. Accuracy Comparison

First, we imported the packages we need.

In [1]:
import random
import nltk
import nltk.sentiment
from nltk.corpus import sentence_polarity
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# tokenizer and POS tagger models used later in the notebook
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/henglong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/henglong/nltk_data...
[nltk_data] Error downloading 'averaged_perceptron_tagger' from
[nltk_data]     <https://raw.githubusercontent.com/nltk/nltk_data/gh-p
[nltk_data]     ages/packages/taggers/averaged_perceptron_tagger.zip>:
[nltk_data]     [Errno 54] Connection reset by peer


False

In [2]:
from nltk.corpus import sentence_polarity
import random
import nltk
nltk.download('sentence_polarity')
nltk.download('vader_lexicon')

[nltk_data] Downloading package sentence_polarity to
[nltk_data]     /Users/henglong/nltk_data...
[nltk_data]   Package sentence_polarity is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/henglong/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

Then we used the movie review sentences from the NLTK corpus to generate features.

In [3]:
sentences = sentence_polarity.sents()

In [4]:
for sent in sentences[:4]:
    print(sent)

['simplistic', ',', 'silly', 'and', 'tedious', '.']
["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find', 'it', 'funny', '.']
['exploitative', 'and', 'largely', 'devoid', 'of', 'the', 'depth', 'or', 'sophistication', 'that', 'would', 'make', 'watching', 'such', 'a', 'graphic', 'treatment', 'of', 'the', 'crimes', 'bearable', '.']
['[garbus]', 'discards', 'the', 'potential', 'for', 'pathological', 'study', ',', 'exhuming', 'instead', ',', 'the', 'skewed', 'melodrama', 'of', 'the', 'circumstantial', 'situation', '.']


The movie review sentences are not labeled individually, but they can be retrieved by
category. We create a list of documents in which each document (sentence) is paired with its
label.

In [5]:
pos_sents = sentence_polarity.sents(categories='pos')
neg_sents = sentence_polarity.sents(categories='neg')
documents = [(sent, cat) for cat in sentence_polarity.categories()
    for sent in sentence_polarity.sents(categories=cat)]

Since the documents are in order by label, we mix them up for later separation into
training and test sets.

In [6]:
random.shuffle(documents)

We need to define the set of words that will be used for features. This is essentially all the words in the entire document collection, except that we will limit it to the 2000 most frequent words.

In [7]:
all_words_list = [word for (sent,cat) in documents for word in sent]
all_words = nltk.FreqDist(all_words_list)
word_items = all_words.most_common(2000)
word_features = [word for (word, freq) in word_items]
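
The vocabulary step can be illustrated on a toy corpus. The sketch below uses `collections.Counter` (which `nltk.FreqDist` subclasses) so that it runs without NLTK; `toy_documents`, `toy_counts`, and `top_n` are made-up names for illustration and are not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus in the same (tokens, label) shape as `documents`.
toy_documents = [
    (["a", "dull", "film", "."], "neg"),
    (["a", "wonderful", "film", "."], "pos"),
    (["dull", "and", "flat", "."], "neg"),
]

# Flatten all tokens, count them, and keep the top_n most frequent as features.
toy_words = [word for (sent, cat) in toy_documents for word in sent]
toy_counts = Counter(toy_words)
top_n = 3
toy_word_features = [word for (word, freq) in toy_counts.most_common(top_n)]
print(toy_word_features)
```

Note that punctuation such as "." is counted like any other token, so it can end up in the feature list unless it is filtered out first.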

Next, we built our cross-validation function.

In [8]:
def cross_validation(num_folds, featuresets):
    subset_size = len(featuresets)//num_folds
    accuracy_list = []
    # iterate over the folds
    for i in range(num_folds):
        test_this_round = featuresets[i*subset_size:][:subset_size]
        train_this_round = featuresets[:i*subset_size]+featuresets[(i+1)*subset_size:]
        # train using train_this_round
        classifier = nltk.NaiveBayesClassifier.train(train_this_round)
        # evaluate against test_this_round and save accuracy
        accuracy_this_round = nltk.classify.accuracy(classifier, test_this_round)
        print(i, accuracy_this_round)
        accuracy_list.append(accuracy_this_round)
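        # running total divided by num_folds; because this print sits inside the
        # loop, only the line printed after the last fold is the true mean accuracy
        print('mean accuracy', sum(accuracy_list) / num_folds)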
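
A possible restructuring of the function above, offered as a sketch rather than the notebook's own code: `kfold_splits` and `mean_accuracy` are hypothetical names, and the scoring callback stands in for training and evaluating a classifier so that the example runs without NLTK. The mean is computed once, after all folds have been scored.

```python
def kfold_splits(items, num_folds):
    """Yield (train, test) pairs using the same slicing as cross_validation."""
    subset_size = len(items) // num_folds
    for i in range(num_folds):
        start = i * subset_size
        end = start + subset_size
        yield items[:start] + items[end:], items[start:end]


def mean_accuracy(items, num_folds, train_and_score):
    """Score every fold with train_and_score(train, test), then average once."""
    scores = [train_and_score(train, test)
              for train, test in kfold_splits(items, num_folds)]
    return sum(scores) / len(scores), scores


# Toy run: the callback is a placeholder that returns the test fraction.
toy_items = list(range(10))
demo_mean, demo_scores = mean_accuracy(
    toy_items, 5, lambda train, test: len(test) / len(toy_items))
print(demo_scores, demo_mean)
```

With NLTK, the callback could be `lambda train, test: nltk.classify.accuracy(nltk.NaiveBayesClassifier.train(train), test)`.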

After this preliminary work, we generated BOW features, negation features, POS tag features, and bigram features, and compared their accuracy.

#### 1. BOW features

Now we can define the features for each document, using just the words, sometimes
called the BOW or unigram features. The feature label will be ‘V_keyword’ for each
keyword (aka word) in the word_features set, and the value of the feature will be
Boolean, according to whether the word is contained in that document.

In [9]:
def document_features(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)
    return features
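
As a small worked example of the Boolean encoding described above, the sketch below applies an equivalent of `document_features` to one toy sentence and a three-word vocabulary. The sentence and vocabulary are invented for illustration.

```python
def document_features(document, word_features):
    # One Boolean feature per vocabulary word: is it present in this document?
    document_words = set(document)
    return {'V_{}'.format(word): (word in document_words)
            for word in word_features}


toy_features = document_features(['a', 'dull', 'film', '.'],
                                 ['dull', 'wonderful', 'film'])
print(toy_features)
# {'V_dull': True, 'V_wonderful': False, 'V_film': True}
```

Words outside the vocabulary ("a" and "." here) contribute no feature at all, which is why the choice of the 2000 most frequent words matters.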

In [10]:
featuresets = [(document_features(d,word_features), c) for (d,c) in documents]

Now split into training and test and run the classifier.

In [11]:
train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(30)

0.759
Most Informative Features
            V_engrossing = True              pos : neg    =     19.8 : 1.0
              V_mediocre = True              neg : pos    =     16.3 : 1.0
               V_generic = True              neg : pos    =     16.3 : 1.0
                  V_dull = True              neg : pos    =     15.1 : 1.0
                  V_flat = True              neg : pos    =     13.3 : 1.0
                V_boring = True              neg : pos    =     13.2 : 1.0
             V_inventive = True              pos : neg    =     13.1 : 1.0
               V_routine = True              neg : pos    =     12.9 : 1.0
                    V_90 = True              neg : pos    =     12.3 : 1.0
             V_wonderful = True              pos : neg    =     12.3 : 1.0
             V_realistic = True              pos : neg    =     11.7 : 1.0
            V_refreshing = True              pos : neg    =     11.7 : 1.0
              V_provides = True              pos : neg    =     11.5

In [12]:
cross_validation(10, featuresets)

0 0.7570356472795498
mean accuracy 0.07570356472795498
1 0.7448405253283302
mean accuracy 0.150187617260788
2 0.7523452157598499
mean accuracy 0.22542213883677298
3 0.725140712945591
mean accuracy 0.2979362101313321
4 0.7692307692307693
mean accuracy 0.374859287054409
5 0.773921200750469
mean accuracy 0.4522514071294559
6 0.7317073170731707
mean accuracy 0.5254221388367729
7 0.7345215759849906
mean accuracy 0.598874296435272
8 0.7345215759849906
mean accuracy 0.672326454033771
9 0.7439024390243902
mean accuracy 0.7467166979362101


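We can see that the mean accuracy of the BOW features is about 0.75. Only the final "mean accuracy" line is the mean over all 10 folds; the earlier lines are partial sums divided by 10.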

#### 2. Adding Negation Features

Negation of opinions is an important part of opinion classification. Here we try a 
simple strategy. We look for negation words "not", "never" and "no" and negation
that appears in contractions of the form "doesn’t".

One strategy with negation words is to negate the word following the negation word,
while other strategies negate all words up to the next punctuation or use syntax to find
the scope of the negation.

Some of these words take the form of a verb followed by n’t. In the Movie Review
Corpus itself, the tokenization splits such words into three tokens, e.g. “couldn”,
“’”, and “t” (a separate NOT_features definition handles that case). In this
sentence_polarity corpus, however, the tokenization keeps these negated forms as one word
ending in “n’t”.

In [13]:
for sent in list(sentences)[:50]:
    for word in sent:
        if (word.endswith("n't")):
            print(sent)

['there', 'is', 'a', 'difference', 'between', 'movies', 'with', 'the', 'courage', 'to', 'go', 'over', 'the', 'top', 'and', 'movies', 'that', "don't", 'care', 'about', 'being', 'stupid']
['a', 'farce', 'of', 'a', 'parody', 'of', 'a', 'comedy', 'of', 'a', 'premise', ',', 'it', "isn't", 'a', 'comparison', 'to', 'reality', 'so', 'much', 'as', 'it', 'is', 'a', 'commentary', 'about', 'our', 'knowledge', 'of', 'films', '.']
['i', "didn't", 'laugh', '.', 'i', "didn't", 'smile', '.', 'i', 'survived', '.']
['i', "didn't", 'laugh', '.', 'i', "didn't", 'smile', '.', 'i', 'survived', '.']
['most', 'of', 'the', 'problems', 'with', 'the', 'film', "don't", 'derive', 'from', 'the', 'screenplay', ',', 'but', 'rather', 'the', 'mediocre', 'performances', 'by', 'most', 'of', 'the', 'actors', 'involved']
['the', 'lack', 'of', 'naturalness', 'makes', 'everything', 'seem', 'self-consciously', 'poetic', 'and', 'forced', '.', '.', '.', "it's", 'a', 'pity', 'that', "[nelson's]", 'achievement', "doesn't", 'match'

In [14]:
negationwords = ['no', 'not', 'never', 'none', 'nowhere', 'nothing', 'noone', 'rather',
'hardly', 'scarcely', 'rarely', 'seldom', 'neither', 'nor']

Start the feature set with all 2000 word features and 2000 Not word features set to
false. If a negation occurs, add the following word as a Not word feature (if it’s in the
top 2000 feature words), and otherwise add it as a regular feature word.

In [15]:
def NOT_features(document, word_features, negationwords):
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = False
        features['contains(NOT{})'.format(word)] = False
    # go through document words in order
    for i in range(0, len(document)):
        word = document[i]
        if ((i + 1) < len(document)) and ((word in negationwords) or (word.endswith("n't"))):
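            # use the next word as the negated word; reassigning i does not advance
            # the for loop, so that word is also visited again as a plain word
            i += 1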
            features['contains(NOT{})'.format(document[i])] = (document[i] in word_features)
        else:
            features['contains({})'.format(word)] = (word in word_features)
    return features
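
One variant of the negation strategy, sketched here under the assumption that a negated word should count only as a NOT feature and not also as a plain word. It uses a `while` loop so the index can actually skip ahead, and sets for faster membership tests. `not_features_skip` is a hypothetical name; this is not the function used to produce the results in this notebook.

```python
def not_features_skip(document, word_features, negationwords):
    vocab = set(word_features)        # set lookups avoid scanning a 2000-item list
    negations = set(negationwords)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = False
        features['contains(NOT{})'.format(word)] = False
    i = 0
    while i < len(document):
        word = document[i]
        is_negation = word in negations or word.endswith("n't")
        if is_negation and i + 1 < len(document):
            following = document[i + 1]
            if following in vocab:
                features['contains(NOT{})'.format(following)] = True
            i += 2                    # skip the negated word entirely
        else:
            if word in vocab:
                features['contains({})'.format(word)] = True
            i += 1
    return features


toy_negation_features = not_features_skip(
    ['i', "didn't", 'like', 'it'], ['like', 'it', 'i'], ['not', 'never', 'no'])
print(toy_negation_features)
```

In this toy sentence, "like" follows "didn't", so only `contains(NOTlike)` is set and `contains(like)` stays False.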

Create feature sets as before, using the NOT_features extraction function, train the
classifier and test the accuracy.

In [16]:
NOT_featuresets = [(NOT_features(d, word_features, negationwords), c) for (d, c) in documents]
NOT_featuresets[0][0]['contains(NOTlike)']
NOT_featuresets[0][0]['contains(always)']

False

Now split into training and test and run the classifier.

In [17]:
train_set, test_set = NOT_featuresets[200:], NOT_featuresets[:200]
classifier1 = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier1, test_set))
classifier1.show_most_informative_features(30)

0.765
Most Informative Features
    contains(engrossing) = True              pos : neg    =     21.7 : 1.0
       contains(generic) = True              neg : pos    =     17.0 : 1.0
      contains(mediocre) = True              neg : pos    =     17.0 : 1.0
     contains(inventive) = True              pos : neg    =     15.7 : 1.0
          contains(flat) = True              neg : pos    =     15.0 : 1.0
       contains(routine) = True              neg : pos    =     15.0 : 1.0
        contains(boring) = True              neg : pos    =     14.7 : 1.0
    contains(refreshing) = True              pos : neg    =     14.3 : 1.0
            contains(90) = True              neg : pos    =     13.0 : 1.0
          contains(warm) = True              pos : neg    =     12.6 : 1.0
     contains(wonderful) = True              pos : neg    =     12.6 : 1.0
     contains(NOTenough) = True              neg : pos    =     12.3 : 1.0
  contains(refreshingly) = True              pos : neg    =     12.3

In [18]:
cross_validation(10, NOT_featuresets)

0 0.7673545966228893
mean accuracy 0.07673545966228892
1 0.7729831144465291
mean accuracy 0.15403377110694186
2 0.7945590994371482
mean accuracy 0.23348968105065668
3 0.7532833020637899
mean accuracy 0.3088180112570357
4 0.7786116322701688
mean accuracy 0.3866791744840526
5 0.798311444652908
mean accuracy 0.46651031894934336
6 0.7851782363977486
mean accuracy 0.5450281425891182
7 0.7692307692307693
mean accuracy 0.6219512195121951
8 0.7701688555347092
mean accuracy 0.698968105065666
9 0.7917448405253283
mean accuracy 0.7781425891181988


After the cross validation, we can see that the accuracy of the negation features is about 0.78.

#### 3. POS Tag Features

There are some classification tasks where part-of-speech tag features can have an effect. This is more likely for shorter units of classification, such as sentence-level classification or short social media texts such as tweets.

The most common way to use POS tagging information is to include counts of various types of word tags.

Here is the definition of our new feature function, adding POS tag counts to the word features.

In [19]:
def POS_features(document,word_features):
    document_words = set(document)
    tagged_words = nltk.pos_tag(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    numNoun = 0
    numVerb = 0
    numAdj = 0
    numAdverb = 0
    for (word, tag) in tagged_words:
        if tag.startswith('N'): numNoun += 1
        if tag.startswith('V'): numVerb += 1
        if tag.startswith('J'): numAdj += 1
        if tag.startswith('R'): numAdverb += 1
    features['nouns'] = numNoun
    features['verbs'] = numVerb
    features['adjectives'] = numAdj
    features['adverbs'] = numAdverb
    return features
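
The tag-counting part of the function can be tried in isolation. The sketch below takes already-tagged `(word, tag)` pairs, so it runs without the tagger model; the tags in `toy_tagged` are hand-written Penn Treebank style tags chosen for illustration, not actual `nltk.pos_tag` output, and `count_pos_tags` is a hypothetical helper name.

```python
def count_pos_tags(tagged_words):
    # Map the first letter of a Penn Treebank tag to a coarse category.
    prefix_to_name = {'N': 'nouns', 'V': 'verbs',
                      'J': 'adjectives', 'R': 'adverbs'}
    counts = {name: 0 for name in prefix_to_name.values()}
    for _word, tag in tagged_words:
        name = prefix_to_name.get(tag[:1])
        if name is not None:
            counts[name] += 1
    return counts


# Hand-tagged toy sentence (assumed tags, for illustration only).
toy_tagged = [('a', 'DT'), ('really', 'RB'), ('dull', 'JJ'),
              ('film', 'NN'), ('bores', 'VBZ'), ('.', '.')]
toy_pos_counts = count_pos_tags(toy_tagged)
print(toy_pos_counts)
```

Unlike the Boolean word features, these four features are integer counts, so the Naive Bayes classifier treats each distinct count as a separate feature value.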

In [20]:
POS_featuresets = [(POS_features(d, word_features), c) for (d, c) in documents]

In [21]:
train_set, test_set = POS_featuresets[1000:], POS_featuresets[:1000]
classifier2 = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier2, test_set)

0.761

In [22]:
cross_validation(10, POS_featuresets)

0 0.7560975609756098
mean accuracy 0.07560975609756097
1 0.7514071294559099
mean accuracy 0.15075046904315198
2 0.7439024390243902
mean accuracy 0.225140712945591
3 0.726078799249531
mean accuracy 0.2977485928705441
4 0.7720450281425891
mean accuracy 0.374953095684803
5 0.7833020637898687
mean accuracy 0.45328330206378986
6 0.7345215759849906
mean accuracy 0.5267354596622889
7 0.7326454033771107
mean accuracy 0.6
8 0.7288930581613509
mean accuracy 0.672889305816135
9 0.7298311444652908
mean accuracy 0.7458724202626642


After the cross validation, we can see that the accuracy of the POS tag features is about 0.75.

#### 4. Bigram Features

When generating bigrams from documents, using highly frequent bigrams requires filtering out special characters, which are very frequent in the bigrams, and also filtering by frequency.  The bigram PMI measure likewise requires some filtering to get frequent and meaningful bigrams.

But there is another bigram association measure that is more often used to filter bigrams for classification features.  This is the chi-squared measure, which is another measure of information gain, but which does its own frequency filtering.  Another frequently used alternative is to just use frequency, which is the bigram measure raw_freq.

We’ll start by importing the collocations package and creating a short cut variable name for the bigram association measures.

In [23]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()

We create a bigram collocation finder using the original movie review words, since the bigram finder must have the words in order.  Note that our all_words_list has exactly this list.

In [24]:
finder = BigramCollocationFinder.from_words(all_words_list)

We use the chi-squared measure to get bigrams that are informative features.  Note that we don’t need to get the scores of the bigrams, so we use the nbest function which just returns the highest scoring bigrams, using the number specified.
(Or try bigram_measures.raw_freq.)

In [25]:
bigram_features = finder.nbest(bigram_measures.chi_sq, 500)

The nbest function returns a list of significant bigrams in this corpus, and we can look at some of them.

In [26]:
print(bigram_features[:50])

[("''independent", "film''"), ("'60s-homage", 'pokepie'), ("'[the", 'cockettes]'), ("'ace", "ventura'"), ("'alternate", "reality'"), ("'aunque", 'recurre'), ("'black", "culture'"), ("'blue", "crush'"), ("'chan", "moment'"), ("'chick", "flicks'"), ("'date", "movie'"), ("'ethnic", 'cleansing'), ("'face", "value'"), ("'fully", "experienced'"), ("'hannibal'", 'lauren'), ("'jason", "x'"), ("'juvenile", "delinquent'"), ("'laugh", "therapy'"), ("'masterpiece", "theatre'"), ("'nicholas", "nickleby'"), ("'old", "neighborhood'"), ("'opening", "up'"), ("'rare", "birds'"), ("'sacre", 'bleu'), ("'science", "fiction'"), ("'shindler's", "list'"), ("'snow", "dogs'"), ("'some", "body'"), ("'special", "effects'"), ("'terrible", "filmmaking'"), ("'time", "waster'"), ("'true", "story'"), ("'unfaithful'", 'cheats'), ("'very", "sneaky'"), ("'we're", '-doing-it-for'), ("'who's", "who'"), ('-after', 'spangle'), ('-as-it-', 'thinks-it-is'), ('-as-nasty', '-as-it-'), ('-doing-it-for', "-the-cash'"), ('10-course

Now we create a feature extraction function that has all the word features as before, but also has bigram features.

In [27]:
def bigram_document_features(document, word_features, bigram_features):
    document_words = set(document)
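    # in NLTK 3, nltk.bigrams returns a one-pass generator, so the repeated
    # membership tests below consume it instead of checking every bigram
    document_bigrams = nltk.bigrams(document)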
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)
    for bigram in bigram_features:
        features['B_{}_{}'.format(bigram[0], bigram[1])] = (bigram in document_bigrams)    
    return features
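
A point worth checking in the function above is how membership tests behave on a lazy sequence of bigrams. The sketch below uses a stand-in `bigrams` helper built on `zip` (an assumption made so the example runs without NLTK) to show that an iterator can only be searched once, while a set can be searched repeatedly.

```python
def bigrams(tokens):
    """Return adjacent token pairs lazily, as a one-pass iterator."""
    return zip(tokens, tokens[1:])


tokens = ['not', 'very', 'good', 'film']

lazy = bigrams(tokens)
first = ('very', 'good') in lazy    # found, but the search consumed two pairs
second = ('not', 'very') in lazy    # that pair is already gone from the iterator

as_set = set(bigrams(tokens))       # materialize once, test as often as needed
print(first, second)
print(('very', 'good') in as_set, ('not', 'very') in as_set)
```

Under this assumption, wrapping the bigrams in `set(...)` before the feature loop would let every one of the 500 bigram features be checked against the full document.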

In [28]:
bigram_featuresets = [(bigram_document_features(d, word_features, bigram_features), c) for (d,c) in documents]
#There should be 2500 features:  2000 word features and 500 bigram features

In [29]:
train_set, test_set = bigram_featuresets[1000:], bigram_featuresets[:1000]
classifier3 = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier3, test_set))

0.759


In [31]:
cross_validation(5, bigram_featuresets)

0 0.74812382739212
mean accuracy 0.14962476547842402
1 0.7429643527204502
mean accuracy 0.29821763602251405
2 0.7696998123827392
mean accuracy 0.4521575984990619
3 0.7274859287054409
mean accuracy 0.5976547842401502
4 0.7429643527204502
mean accuracy 0.7462476547842403


After the cross validation, we can see that the accuracy of the bigram features is about 0.75.

To sum up, adding negation features gives the highest cross-validation accuracy (about 0.78, versus about 0.75 for the other feature sets). Thus, we use negation features in our sentiment analysis.

### Part 2. Sentiment Analysis

First, we imported the packages we need.

In [32]:
import numpy as np                                 
import pandas as pd
import csv
from nltk.corpus import stopwords                   
from nltk.stem import PorterStemmer 

Then we pre-process our data.
For data preprocessing, we need to tokenize our text. In our dataset, the review texts are stored in rows of a dataframe, so we can simply extract them from the dataframe and store them in a list.

In [33]:
df = pd.read_csv('data.csv')
df.rename(columns={'Unnamed: 0':'Index'},inplace=True)
df.head()
df.tail()

Unnamed: 0,Index,Product Name,Brand Name,Price,Reviews
0,0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,I feel so LUCKY to have found this used (phone...
1,1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,"nice phone, nice up grade from my pantach revu..."
2,2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,Very pleased
3,3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,It works good but it goes slow sometimes but i...
4,4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,Great phone to replace my lost phone. The only...


In [34]:
ReviewList = df['Reviews'].tolist()
ReviewList[0:6]

["I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!",
 'nice phone, nice up grade from my pantach revue. Very clean set up and easy set up. never had an android phone but they are fantastic to say the least. perfect size for surfing and social media. great phone samsung',
 'Very pleased',
 'It works good but it goes slow sometimes but its a very good phone I love it',
 'Great phone to replace my lost phone. The only thing is the volume up button does not work, but I can still go into settings to adjust. Other than that, it does the job until I am eligible to upgrade my phone again.Thaanks!',
 'I already had a phone with problems... I know it stated it was used, but d

In [35]:
len(ReviewList)

63038

After data preprocessing, we used the classifier with the highest accuracy (classifier1, trained on the negation features) to label each review.

In [36]:
label = []
for i in ReviewList:
    texttokens = nltk.word_tokenize(i)
    textlower = [w.lower() for w in texttokens]
    inputfeatureset = NOT_features(textlower, word_features, negationwords)
    a = classifier1.classify(inputfeatureset)
    label.append(a)

In [37]:
def merge(list1, list2): 
    merged_list = [(list1[i], list2[i]) for i in range(0, len(list1))]
    return merged_list

In [38]:
sentimentList = merge(ReviewList, label)

In [39]:
sentiment_neg = []
sentiment_pos = []
Sentiment = []
for i in sentimentList:
    if i[1] == 'neg':
        sentiment_neg.append(i[0])
        Sentiment.append(int(0))
    else:
        sentiment_pos.append(i[0])
        Sentiment.append(int(1))
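
The pairing and splitting steps above can also be written with `zip` and list comprehensions. This is only a compact equivalent shown on two made-up reviews; `toy_labels` stands in for classifier output and is not real model output.

```python
toy_reviews = ['Very pleased', 'Stopped working after a week']
toy_labels = ['pos', 'neg']  # hypothetical classifier output

# zip pairs each review with its label, like merge(list1, list2).
paired = list(zip(toy_reviews, toy_labels))

# 0 for negative, 1 for anything else, matching the if/else above.
sentiment_flags = [0 if label == 'neg' else 1 for _text, label in paired]
neg_reviews = [text for text, label in paired if label == 'neg']
pos_reviews = [text for text, label in paired if label != 'neg']
print(sentiment_flags, len(neg_reviews), len(pos_reviews))
```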

In [40]:
len(sentiment_neg)

44544

In [41]:
len(sentiment_pos)

18494

In [42]:
len(Sentiment)

63038

In [43]:
df['Sentiment'] = Sentiment

In [44]:
df.head()

Unnamed: 0,Index,Product Name,Brand Name,Price,Reviews,Sentiment
0,0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,I feel so LUCKY to have found this used (phone...,0
1,1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,"nice phone, nice up grade from my pantach revu...",1
2,2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,Very pleased,1
3,3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,It works good but it goes slow sometimes but i...,0
4,4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,Great phone to replace my lost phone. The only...,1


In [73]:
BrandList = df['Brand Name']
BrandList[0:5]

0    Samsung
1    Samsung
2    Samsung
3    Samsung
4    Samsung
Name: Brand Name, dtype: object

In [74]:
Pricelist = df['Price']
Pricelist.tail()

63033    79.95
63034    79.95
63035    79.95
63036    79.95
63037    79.95
Name: Price, dtype: float64

In [64]:
ProductList = df['Product Name'].unique()
ProductList[0:6]

array(['"CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D700*FRONT CAMERA*ANDROID*SLIDER*QWERTY KEYBOARD*TOUCH SCREEN',
       'Cricket Samsung Galaxy Discover R740 Phone',
       'Galaxy s III mini SM-G730V Verizon Cell Phone BLUE',
       'Galaxy S5 G900A Factory Unlocked Android Smartphone 16GB White',
       'Galaxy S6 Active - Camo Blue (Unlocked)',
       'GreatCall Samsung Jitterbug Touch3 Senior Smartphone with 1-Touch Medical Alert and Large Display'],
      dtype=object)

We define the "product score" as the percentage of positive reviews. Since we coded a positive review as 1 and a negative review as 0, the fraction of positive reviews equals the mean of the Sentiment values.

In [46]:
ProductScore = []
for i in range(0,len(ProductList)):
    ProductScore.append(np.mean(df.loc[df['Product Name']==str(ProductList[i])]['Sentiment'].tolist()))
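
To make the "mean equals fraction of positives" point concrete, here is a minimal sketch in plain Python on made-up rows. The product names and labels are invented; `labels_by_product` and `product_score` are hypothetical names.

```python
from collections import defaultdict

# (product name, sentiment) pairs; 1 = positive review, 0 = negative review.
toy_rows = [('Phone A', 1), ('Phone A', 0), ('Phone A', 1), ('Phone B', 0)]

# Group the 0/1 labels by product.
labels_by_product = defaultdict(list)
for product, sentiment in toy_rows:
    labels_by_product[product].append(sentiment)

# The mean of 0/1 labels is the share of positive reviews.
product_score = {product: sum(labels) / len(labels)
                 for product, labels in labels_by_product.items()}
print(product_score)
```

Phone A has two positive reviews out of three, so its score is 2/3. On the real dataframe, `df.groupby('Product Name')['Sentiment'].mean()` should compute the same per-product quantity in one pass.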
                                       


In [47]:
ProductScore

[0.2702702702702703,
 0.2222222222222222,
 0.0,
 0.30125,
 0.0,
 0.24161073825503357,
 0.0,
 0.3333333333333333,
 0.28846153846153844,
 0.15217391304347827,
 0.3157894736842105,
 0.0,
 0.21212121212121213,
 0.2222222222222222,
 0.09090909090909091,
 0.375,
 0.43333333333333335,
 0.15217391304347827,
 0.45454545454545453,
 0.4375,
 0.0,
 0.15,
 0.06666666666666667,
 0.24561403508771928,
 0.14285714285714285,
 0.16666666666666666,
 0.0,
 0.0,
 0.0,
 0.3333333333333333,
 0.16666666666666666,
 0.1724137931034483,
 0.30434782608695654,
 0.36363636363636365,
 0.24324324324324326,
 0.0,
 0.5,
 0.2,
 0.5,
 0.23529411764705882,
 0.21621621621621623,
 0.2857142857142857,
 0.21212121212121213,
 0.5,
 0.0,
 0.14772727272727273,
 0.0,
 0.39344262295081966,
 0.2598870056497175,
 0.26666666666666666,
 0.19642857142857142,
 0.3684210526315789,
 0.8,
 0.35135135135135137,
 0.28,
 0.28,
 0.0,
 0.2857142857142857,
 0.3333333333333333,
 0.18181818181818182,
 0.0,
 0.0,
 0.18181818181818182,
 0.34645669291

We created a new data frame called 'ScoreDf'. 
We stored the product name and their percentage of positive reviews in this new dataframe.

In [68]:
ScoreDf = pd.DataFrame(columns=['Product Name','Brand Name','Price','Positive Percentage'])

In [69]:
ScoreDf['Product Name']=ProductList

In [70]:
ScoreDf['Price']=Pricelist

In [75]:
ScoreDf['Brand Name']=BrandList

In [76]:
ScoreDf['Positive Percentage']=ProductScore
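
One thing to keep in mind with the assignments above: pandas aligns a Series to a dataframe by row index, so assigning the per-review `Pricelist` and `BrandList` to a per-product frame gives row k the price and brand of review k, not of product k. A possible alternative, sketched on a made-up frame (`toy_df` and its values are invented for illustration), is to aggregate everything per product in one `groupby`:

```python
import pandas as pd

# Toy per-review data: two products with different brands and prices.
toy_df = pd.DataFrame({
    'Product Name': ['Phone A', 'Phone A', 'Phone B', 'Phone B', 'Phone B'],
    'Brand Name': ['BrandX', 'BrandX', 'BrandY', 'BrandY', 'BrandY'],
    'Price': [100.0, 100.0, 50.0, 50.0, 50.0],
    'Sentiment': [1, 0, 1, 1, 0],
})

# One row per product: first brand and price seen, mean of the 0/1 labels.
toy_scores = (
    toy_df.groupby('Product Name', sort=False)
          .agg({'Brand Name': 'first', 'Price': 'first', 'Sentiment': 'mean'})
          .rename(columns={'Sentiment': 'Positive Percentage'})
          .reset_index()
)
print(toy_scores)
```

Taking the first price per product is itself an assumption; if a product is listed at several prices, a mean or median may be more appropriate.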

In [77]:
ScoreDf.head()

Unnamed: 0,Product Name,Brand Name,Price,Positive Percentage
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,0.27027
1,Cricket Samsung Galaxy Discover R740 Phone,Samsung,199.99,0.222222
2,Galaxy s III mini SM-G730V Verizon Cell Phone ...,Samsung,199.99,0.0
3,Galaxy S5 G900A Factory Unlocked Android Smart...,Samsung,199.99,0.30125
4,Galaxy S6 Active - Camo Blue (Unlocked),Samsung,199.99,0.0


In [79]:
ScoreDf.tail()

Unnamed: 0,Product Name,Brand Name,Price,Positive Percentage
569,Verizon Samsung Alias 2 U750 No Contract Dual-...,Samsung,222.0,0.271277
570,Verizon Samsung Convoy U640 No Contract Milita...,Samsung,222.0,0.22093
571,Verizon Wireless Cell Phone Samsung Gusto U360...,Samsung,222.0,0.217391
572,VERIZON WIRELESS CELL PHONE SAMSUNG U460 INTEN...,Samsung,222.0,0.196429
573,Samsung Convoy U640 Phone for Verizon Wireless...,Samsung,222.0,0.285714


In [78]:
ScoreDf.to_csv('Percent.csv')

In [84]:
ScoreDf1 = pd.DataFrame(columns=['Positive Percentage'])
ScoreDf1['Positive Percentage']=ProductScore
ScoreDf1.to_csv('score.csv')

In [85]:
priceDf1 = pd.DataFrame(columns=['Price'])
priceDf1['Price']=Pricelist
priceDf1.to_csv('price.csv')

In [86]:
brandDf1 = pd.DataFrame(columns=['Brand Name'])
brandDf1['Brand Name']=BrandList
brandDf1.to_csv('brand.csv')

In [87]:
productDf1 = pd.DataFrame(columns=['Product Name'])
productDf1['Product Name']=ProductList
productDf1.to_csv('product.csv')