## Testing hypothesis 2
---

**_Hypothesis_**: Reviews with more positive sentiment words receive higher helpfulness ratings.

- **Metric**: Mean helpfulness rating as a function of the number of positive and negative words in a review.

- **Model**: Multinomial Naive Bayes.

- **Description**:

  - Use a Multinomial Naive Bayes classifier (MNBC) to predict the sentiment of a review.
  - Extract the most useful words from the classifier.
  - Compute the mean helpfulness rating for the most useful words.
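
The three steps above can be sketched on a toy corpus. This is a minimal illustration, not the notebook's actual pipeline: the reviews, labels and helpfulness values are made up, the names `log_prob`, `log_odds` and `mean_helpfulness` are hypothetical, and the Naive Bayes word probabilities are computed by hand with the standard library so the arithmetic stays visible. A library implementation of Multinomial Naive Bayes would replace the manual counting. "Most useful" is interpreted here as the largest absolute log-odds between the two classes, which is an assumption.

```python
import math
from collections import Counter

# Toy (text, class, helpfulness) triples; hypothetical data for illustration only.
reviews = [
    ("great book loved it", 1, 2.0),
    ("wonderful story great characters", 1, 3.0),
    ("terrible plot boring book", 0, 1.0),
    ("boring and awful writing", 0, 0.5),
]

# Bag of words: word counts accumulated per class.
counts = {0: Counter(), 1: Counter()}
for text, label, _ in reviews:
    counts[label].update(text.split())

vocab = sorted(set(counts[0]) | set(counts[1]))
alpha = 1.0  # Laplace smoothing, as used by Multinomial Naive Bayes


def log_prob(word, label):
    """Smoothed log P(word | class)."""
    total = sum(counts[label].values())
    return math.log((counts[label][word] + alpha) / (total + alpha * len(vocab)))


# Log-odds: positive values mark words typical of positive reviews.
log_odds = {w: log_prob(w, 1) - log_prob(w, 0) for w in vocab}
ranked = sorted(vocab, key=log_odds.get)
top_negative, top_positive = ranked[0], ranked[-1]


def mean_helpfulness(word):
    """Mean helpfulness of the reviews that contain `word`."""
    scores = [h for text, _, h in reviews if word in text.split()]
    return sum(scores) / len(scores)


print(top_positive, mean_helpfulness(top_positive))
print(top_negative, mean_helpfulness(top_negative))
```

On real data the same ranking would be taken over the full vocabulary, and the mean helpfulness would be compared between the top positive and top negative words.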

**Missing Values**:

  - `review/score`: remove the entire sample
  - `review/text`: remove the entire sample
  - `review/helpfulness`: remove the entire sample

**Data Transformation**:

  - `review/score`: Assign class 1 to scores 4 and 5, and class 0 to scores 1 and 2. Reviews with score 3 are considered neutral and are dropped.
  - `review/text`: Build the bag-of-words (BoW) representation of the text, fit an MNBC, count the number of positive and negative words, and plot the results.
  - `review/helpfulness`: $helpfulness = \frac{x}{y} \sqrt{y}$
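
The helpfulness transformation can be made concrete with a short sketch. It assumes that $x$ is the number of helpful votes and $y$ the total number of votes, which the text above does not state explicitly. The function name `helpfulness_score` and the choice to return 0 when there are no votes are also assumptions made for illustration.

```python
import math


def helpfulness_score(x: int, y: int) -> float:
    """Return (x / y) * sqrt(y).

    Assumes x = helpful votes and y = total votes. Returning 0.0 when
    y == 0 is an assumed convention to avoid division by zero.
    """
    if y == 0:
        return 0.0
    return (x / y) * math.sqrt(y)


# The sqrt(y) factor rewards reviews that were judged by many voters:
print(helpfulness_score(1, 1))   # same ratio as the next line, fewer votes
print(helpfulness_score(9, 9))
print(helpfulness_score(4, 16))
```

With the same ratio $x/y$, a review with more votes gets a higher score, so a single favourable vote does not outweigh a consistently well-rated review.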

---

In [1]:
# Connect to MongoDB

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
database = client['spark_db']
books = database['books_hypothesis_2']

In [29]:
import pandas as pd
import numpy as np

# Remove the samples in which any of the fields listed above is missing.
# Also remove the samples with a score of 3, since it indicates a neutral review.
pipeline_remove = {'$match':
                            {
                            'review/text':{'$exists':True}, 
                            'review/score':{'$exists':True, '$ne':3}, 
                            'review/helpfulness':{'$exists':True}
                            }
                }

# Create a new field called `class`: 1 if the score is 4 or 5, 0 otherwise.
pipeline_class = {'$project':{
                        '_id':0,
                        'review/text':1,
                        'class':{
                            '$cond':{
                                'if':{'$in':['$review/score', [4,5]]},
                                'then':1,
                                'else':0
                            }
                        }
                    }
                }

books_removed = books.aggregate([pipeline_remove, pipeline_class])

df_data = pd.DataFrame(list(books_removed))
array_data = np.array(df_data)

# Check the number of samples retained
print('Number of samples retained: ', array_data.shape[0])


Number of samples retained:  183540

