## Testing hypothesis 3
---

**_Hypothesis_**: Reviews with higher book ratings have higher helpfulness ratings.

   - **Metric**: Correlation between book ratings and helpfulness ratings.

   - **Model**: Linear Regression

   - **Description**:

     - Use the book rating as the predictor variable and the helpfulness rating as the target variable.
     - Train a linear regression model to predict helpfulness ratings based on book ratings.
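As a rough illustration of the setup above, the following sketch fits a one-variable least-squares line and computes the Pearson correlation. The arrays are synthetic placeholder values, not the project data, and the use of NumPy's `polyfit` and `corrcoef` is an assumption about tooling; the notebook itself may rely on a different library.

```python
import numpy as np

# Synthetic placeholder data (NOT taken from the dataset): book rating
# as the predictor and a helpfulness score as the target.
book_rating = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
helpfulness = np.array([0.4, 1.1, 1.4, 2.1, 2.5])

# Fit helpfulness = slope * book_rating + intercept by ordinary least squares.
slope, intercept = np.polyfit(book_rating, helpfulness, deg=1)

# Pearson correlation coefficient, the metric named for this hypothesis.
correlation = np.corrcoef(book_rating, helpfulness)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={correlation:.3f}")
```

A positive slope together with a correlation close to 1 would be consistent with the hypothesis; a slope near zero would not.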

**Missing Values**:

  - `review/score`: remove the entire sample
  - `review/helpfulness`: remove the entire sample

**Data Transformation**:
  - `review/score`: groupBy book title and calculate the average score.
  - `review/helpfulness`: $\text{helpfulness} = \frac{x}{y} \sqrt{y}$
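
To make the helpfulness transformation concrete, here is a small sketch. It assumes that $x$ is the number of helpful votes and $y$ the total number of votes, which the text above does not state explicitly. The function name `weighted_helpfulness` and the zero-vote fallback are hypothetical choices made for illustration only.

```python
import math


def weighted_helpfulness(x: int, y: int) -> float:
    """Return (x / y) * sqrt(y); assumes x helpful votes out of y total."""
    if y == 0:
        # Assumption: a review with no votes gets a score of 0.0.
        return 0.0
    return (x / y) * math.sqrt(y)


# Same helpful ratio (50%), but very different vote counts.
few_votes = weighted_helpfulness(1, 2)
many_votes = weighted_helpfulness(50, 100)
print(few_votes, many_votes)
```

Because of the square-root factor, two reviews with the same ratio of helpful votes do not score equally: the one with more total votes receives the higher value.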

---

In [1]:
# Connect to MongoDB

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
database = client['spark_db']
books = database['books_hypothesis_3']

In [8]:
import pandas as pd
import numpy as np
import math

# Remove the samples which have no score or helpfulness data
pipeline_remove = {
    '$match': {
        'review/score': {'$exists': True},
        'review/helpfulness': {'$exists': True},
    }
}

# Retain only the required fields
pipeline_project = {
    '$project': {
        'review/score': 1,
        'review/helpfulness': 1,
        '_id': 0,
    }
}

books_data = books.aggregate([pipeline_remove,pipeline_project])
df_data = pd.DataFrame(list(books_data))
# Check the shape of the data
print(f"The shape of the data is {df_data.shape}")

df_data['review/helpfulness']
# df_data.plot(kind='scatter', x='review/score', y='review/helpfulness')
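# Apply the documented transformation helpfulness = (x / y) * sqrt(y),
# where each entry is the pair (x, y).
# Assumption: entries with y == 0 are mapped to 0.0 to avoid division by zero.
df_data['review/helpfulness'] = df_data['review/helpfulness'].apply(
    lambda x: (x[0] / x[1]) * math.sqrt(x[1]) if x[1] else 0.0
)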


The shape of the data is (200687, 2)
