# Problem 1(a).  Reading Amazon Reviews.

In this problem, we will analyze Amazon reviews to determine what characteristics make them most helpful.

Download the file of Amazon gourmet food reviews from the [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/web-FineFoods.html).   (Your computer may already have a utility installed that can unzip the archive as a text file; if not, [7-zip](http://www.7-zip.org/) is a useful utility for Windows. You can also use an online utility by doing a web search for: ``open .gz files online``.)

Create a pandas DataFrame object with the following entries for each review:

* Product ID
* Number of people who voted this review helpful
* Total number of people who rated this review
* Rating of the product
* Text of the review

For the second and third of these, the information will be given in the file as ```1/5```, which would correspond to 1 vote for helpful out of 5 people who rated the review.
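As a minimal sketch of how one such value might be split into its two counts, the helper below is purely illustrative; the name `split_helpfulness` is hypothetical and is not part of the assignment or of pandas.

```python
def split_helpfulness(value):
    """Split a helpfulness string such as '1/5' into (helpful, total) integers."""
    # strip() removes surrounding spaces and the trailing newline, if any
    helpful, total = value.strip().split('/')
    return int(helpful), int(total)

print(split_helpfulness('1/5'))
```

Here the first element is the number of helpful votes and the second is the total number of votes.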


In [1]:
import pandas as pd
import numpy as np

In [2]:
fileLoc = '/Users/ahendel1/Downloads/finefoods.txt'

cols = ['ProductID', 'NumHelp', 'NumReviews', 'Rating', 'Text']
dict_list = []

with open(fileLoc, encoding="ISO-8859-1") as in_file:

    # iterate over the file line by line; the with-block closes it on exit,
    # so no explicit in_file.close() is needed
    for line in in_file:

        if line.startswith('product/productId:'):
            prod_id = line.split('productId:', 1)[1].strip()

        elif line.startswith('review/helpfulness:'):
            # the value looks like "1/5": helpful votes / total votes
            review = line.split('helpfulness:', 1)[1].split('/')
            helpful = review[0].strip()
            total   = review[1].strip()

        elif line.startswith('review/score:'):
            rating = float(line.split('score:', 1)[1].strip())

        elif line.startswith('review/text:'):
            # the text field comes after the other fields we keep,
            # so the row is complete once we reach it
            text = line.split('/text:', 1)[1].replace('\n', '')
            dict_list.append(dict(zip(cols, [prod_id, helpful, total, rating, text])))
amaz_df = pd.DataFrame(dict_list)
amaz_df

Unnamed: 0,NumHelp,NumReviews,ProductID,Rating,Text
0,1,1,B001E4KFG0,5.0,I have bought several of the Vitality canned ...
1,0,0,B00813GRG4,1.0,Product arrived labeled as Jumbo Salted Peanu...
2,1,1,B000LQOCH0,4.0,This is a confection that has been around a f...
3,3,3,B000UA0QIQ,2.0,If you are looking for the secret ingredient ...
4,0,0,B006K2ZZ7K,5.0,Great taffy at a great price. There was a wi...
5,0,0,B006K2ZZ7K,4.0,I got a wild hair for taffy and ordered this ...
6,0,0,B006K2ZZ7K,5.0,This saltwater taffy had great flavors and wa...
7,0,0,B006K2ZZ7K,5.0,This taffy is so good. It is very soft and c...
8,1,1,B000E7L2R4,5.0,Right now I'm mostly just sprouting this so m...
9,0,0,B00171APVA,5.0,This is a very healthy dog food. Good for the...


# Problem 1(b).  Analyzing review text.

Add columns to your DataFrame for the length of a review, the number of exclamation points in a review, and the fraction of people who rated a review helpful. Calculate the fraction from the two vote-count columns you made in 1(a); a ratio of 1 helpful rating out of 5 total ratings should be entered as 0.2, not the string ```1/5```. If nobody voted on whether a review was helpful, the corresponding entry in your fraction-helpful column should be ```NaN```.

In [3]:
# assign number of characters in the review
amaz_df['review_num_chars']=amaz_df.Text.apply(len)

In [4]:
# assign number of exclamation points in the review
amaz_df['exclams']=amaz_df.Text.apply(lambda x: x.count('!'))

In [5]:
# assign the fraction of voters who found the review helpful (0/0 becomes NaN)
amaz_df['frac_helpful']=pd.to_numeric(amaz_df.NumHelp)/pd.to_numeric(amaz_df.NumReviews)
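The division above relies on pandas returning `NaN` for 0/0 rather than raising an error. A small sketch on made-up data (the `toy` frame is an assumption for illustration, not taken from the review file) shows the behavior:

```python
import numpy as np
import pandas as pd

# toy vote counts stored as strings, as they are after parsing the file
toy = pd.DataFrame({'NumHelp': ['1', '0', '3'],
                    'NumReviews': ['5', '0', '3']})

# element-wise division; a review with no votes (0/0) yields NaN
toy['frac_helpful'] = pd.to_numeric(toy.NumHelp) / pd.to_numeric(toy.NumReviews)
print(toy)
```

The middle row, with zero votes, should show `NaN` in `frac_helpful`, while the other rows hold ordinary fractions.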

# Problem 1(c).  Summary statistics.

How many reviews are in the data set?  What is the average length of a review (in characters)?  What is the average rating?  What is the greatest number of exclamation marks used in a single review?  Use the pandas package to answer these questions, then summarize your results in a markdown cell.

In [6]:
# number of reviews
amaz_df.shape[0]

568454

In [7]:
# average length of the reviews
amaz_df.review_num_chars.mean()

437.22208305333413

In [8]:
# average rating
amaz_df.Rating.mean()

4.183198640523243

In [9]:
# greatest number of exclamation points in a review
amaz_df.exclams.max()

84

### 1c Summary
* Reviews in the dataset = 568,454 
* Average Length of Review = 437 characters
* Average Rating = 4.2
* Greatest Number of Exclamation Points in a Single Review = 84
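The per-column figures above could also be gathered in a single call with `DataFrame.agg`. This is a sketch on a small made-up frame (the values are illustrative only), using the same column names as the notebook:

```python
import pandas as pd

# toy stand-in for amaz_df with the three columns being summarized
toy = pd.DataFrame({
    'Rating': [5.0, 1.0, 4.0],
    'review_num_chars': [10, 20, 30],
    'exclams': [0, 2, 1],
})

# map each column to the statistic wanted for it; the result is a Series
stats = toy.agg({'review_num_chars': 'mean', 'Rating': 'mean', 'exclams': 'max'})
print(len(toy))   # number of reviews
print(stats)
```

Keeping the statistics in one Series makes it easier to format them together for a summary cell.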

# Problem 1(d).  Export.

Save your DataFrame as a .csv file suitable for future analysis in R.  Your .csv file should not include the review text column, as the presence of commas and quotation marks will make reading the file difficult.  You should also convert entries from ```NaN``` to the empty string before saving.

In [10]:
# fill nan with empty string
amaz_df.fillna('', inplace=True)
# drop the raw text column before export
amaz_df.drop('Text', axis=1, inplace=True)

In [11]:
amaz_df.head(1)

Unnamed: 0,NumHelp,NumReviews,ProductID,Rating,review_num_chars,exclams,frac_helpful
0,1,1,B001E4KFG0,5.0,264,0,1


In [12]:
amaz_df.to_csv('finefoods_cleaned.csv', index=False)
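As a quick check of what ends up in the file, the sketch below writes a small made-up frame to an in-memory buffer instead of disk. It also illustrates that `to_csv` writes missing values as empty fields by default (`na_rep=''`), so a `NaN` left in place would be exported the same way as an empty string.

```python
import io

import numpy as np
import pandas as pd

# toy frame with one missing fraction (no votes cast)
toy = pd.DataFrame({'NumHelp': [1, 0],
                    'NumReviews': [5, 0],
                    'frac_helpful': [0.2, np.nan]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # na_rep defaults to the empty string
csv_text = buffer.getvalue()
print(csv_text)
```

The last line of the output should end with a trailing comma, which is the empty field that R's `read.csv` reads as `NA` for a numeric column.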