# Coursework 1 - This needs a proper title

## I. Background

- Why this study is of interest and relevant

## II. Study Purpose

### A. Motivation and Objectives

### B. Research Questions

In [1]:
# import libraries
import pandas as pd
import requests
import datetime

## II. Research Methodology

- Describe that the study will be primary a NLP project and this is why the NLP preparation to follow.

### A. Data Source

- Describe why Hellopeter 
- A good explanation as to why this data source has been selected.
- Justify that this is a major source of reviews/complaints which the telecoms actively monitor and respond to. Note that Telkom issued a notice (https://www.hellopeter.com/telkom) that they are phasing it out and urge customers to contact them via twitter opr facebook.
- Bellow implement the class to load data from Hellopeter via api


The logic for retrieving review data from Hellopeter.com is encapsulated in the class below. This allows this class to be moved to a separate .py file to enable re-use and to ensure that activities specifically related to the inner workings of the API have a layer of abstraction should the API change at a future date.

In [2]:
class Hellopeter():
    """
    This class is used to retrieve Hellopeter reviews via the `https://api.hellopeter.com/consumer/business/` API.

    Parameters
    ----------
    business : str
        The business name to retrieve reviews for.
    api_url : str
        The base URL used to invoke the Hellopeter API.
    """
    def __init__(self, business:str, api_url:str='https://api.hellopeter.com/consumer/business/') -> None:
        self.business = business
        self.api_url = api_url
        self.url_template = self.api_url + self.business + '/reviews?page='

        # initialize the session to use for requests to the API
        self.request_session = requests.Session()

    def request_page(self, page_number:int) -> dict:
        """
        Request a specific review page for the business.

        Parameters
        ----------
        page_number : int
            The page number to retrieve the reviews from.   

        Returns
        -------
            response_json : dict
        """
        # set the full url for the request
        url = self.url_template + str(page_number)
       
        # set the request headers
        headers = {
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
            'accept': 'application/json'
        }

        # request the review page
        respose = self.request_session.get(url=url)

        # implement basic error handling
        if respose.status_code == 202:
            return respose.json()
        else:
            raise Exception('An unexpected response code were received: %s' % respose.status_code)

    def process_request_page(self, page_json:dict) -> pd.DataFrame:
        """
        Process the raw JSON data of a review page and convert it to a Pandas DataFrame.

        Parameters
        ----------
        page_json : dict
            The raw JSON (represented as a Python dictionary) that was retrieved from the API.

        Returns
        -------
        page_data : pandas.DataFrame
            The page data converted to a DataFrame.            
        """
        # create the dataframe
        page_data = pd.DataFrame(page_json['data'])

        # add the business name to the dataframe
        page_data['business'] = self.business

        # basic data type conversions
        page_data.created_at = pd.to_datetime(page_data.created_at)
        page_data.replied = page_data.replied.astype('bool')        

        # return the processed page data
        return page_data

    def retrieve_reviews(self, stop_at:datetime) -> pd.DataFrame:
        """
        Retrieve reviews for the business up to, and including the `stop_at` date.

        Parameters
        ----------
        stop_at : datetime
            The date of the last review to retrieve.

        Returns
        -------
        review_data : pandas.DataFrame
            A DataFrame containing the reviews retrieved.            
        """
        page_data = pd.DataFrame()
        current_page = 1
        stop_retrieval = False

        while not stop_retrieval:
            # retrieve the reviews for the current page
            current_reviews = self.process_request_page(self.request_page(current_page))

            # add the current page to the output dataframe
            page_data = pd.concat([page_data, current_reviews])

            # increment the page counter
            current_page += 1

            # determine if data retrieval should be stopped
            #print(current_reviews.created_at.min(), stop_at)
            stop_retrieval = current_reviews.created_at.min() < stop_at

            # print a progress indicator
            if current_page % 100 == 0:
                print(current_page, current_reviews.created_at.min())

        # perform the final filter for the stop date
        page_data = page_data.query('created_at >= @stop_at')

        # return the result dataframe
        return page_data

In [10]:
def retrieve_bussiness_reviews(business:str, stop_at:datetime, output_path='data/raw/') -> pd.DataFrame:
    """
    Retrieve reviews for a business and store the output in Parquet format.

    Parameters
    ----------
    business : str
        The business name to retrieve reviews for.        
    stop_at : datetime
        The date of the last review to retrieve.    

    Returns
    -------
    review_data : pandas.DataFrame
        A DataFrame containing the reviews retrieved.     
    """
    # retrieve the reviews
    peter = Hellopeter(business)
    review_data = peter.retrieve_reviews(stop_at)

    # save the dataset
    review_data.to_parquet('data/raw/%s.gzip' % business.replace('-', '_'), 
        compression='gzip', index=False)

    # return the retrieved data for futher processing
    return review_data

### B. Ethics of Use

 ### C. Data Selection

 - In this section also describe provider selection (telkom, vodacom, cell c, mtn) and why these major ones only.
 - in a separate cell use class and save all data to csv
 - in a separate cell discuss the features that will be selected, then in the next cell drop columns not of interest.

> ❗Please do not execute the cells below unless new data is required, retrieving the data is a long running operation. The reviews retrieved on `2021/12/21` can be found in Parquet format on [GitHub](https://github.com/JohnnyFoulds/dsm020-2021-oct/tree/master/coursework_01/data/raw).

In [5]:
# retrieve the vodacom dataset
vodacom_reviews = retrieve_bussiness_reviews(business='vodacom', stop_at=datetime.datetime(2021, 1, 1))

100 2021-11-23 23:32:28
200 2021-10-25 13:00:10
300 2021-09-26 07:29:06
400 2021-08-20 14:14:37
500 2021-07-14 19:24:18
600 2021-06-10 23:32:14
700 2021-05-12 13:09:30
800 2021-04-14 11:59:03
900 2021-03-18 09:51:51
1000 2021-02-24 12:43:47
1100 2021-02-03 08:25:32
1200 2021-01-11 08:52:58


In [8]:
# retrieve the mtn dataset
mtn_reviews = retrieve_bussiness_reviews(business='mtn', stop_at=datetime.datetime(2021, 1, 1))

100 2021-11-04 07:36:03
200 2021-09-07 19:31:11
300 2021-07-19 18:42:15
400 2021-05-22 11:36:27
500 2021-03-31 18:42:26
600 2021-02-17 20:04:42
700 2021-01-12 16:14:18


In [9]:
# retrieve the telkom dataset
telkom_reviews = retrieve_bussiness_reviews(business='telkom', stop_at=datetime.datetime(2021, 1, 1))

100 2021-10-23 20:34:44
200 2021-08-24 09:35:28
300 2021-06-26 04:53:36
400 2021-05-04 23:19:26
500 2021-03-15 20:57:52
600 2021-02-04 16:39:07


In [11]:
# retrieve the cell-c dataset
cell_c_reviews = retrieve_bussiness_reviews(business='cell-c', stop_at=datetime.datetime(2021, 1, 1))

100 2021-10-12 14:28:40
200 2021-08-12 07:52:06
300 2021-06-03 18:50:40
400 2021-03-20 14:21:45
500 2021-01-21 12:52:35


### D. Data Preparation

> ℹ️ Data are loaded from the GitHub repository instead of from the locally saved files. This is under the assumption that this notebook will be submitted as a coursework piece with a size limitation on submission size.

In [15]:
# load the raw datasets retrieved fom hellopeter
vodacom_reviews = pd.read_parquet('https://github.com/JohnnyFoulds/dsm020-2021-oct/raw/master/coursework_01/data/raw/vodacom.gzip')
mtn_reviews = pd.read_parquet('https://github.com/JohnnyFoulds/dsm020-2021-oct/raw/master/coursework_01/data/raw/mtn.gzip')
telkom_reviews = pd.read_parquet('https://github.com/JohnnyFoulds/dsm020-2021-oct/raw/master/coursework_01/data/raw/telkom.gzip')
cell_c_reviews = pd.read_parquet('https://github.com/JohnnyFoulds/dsm020-2021-oct/raw/master/coursework_01/data/raw/cell_c.gzip')

Sections (a paragraph followed by code) for:

- remove empty and illegal values - have clear reasoning described for changes
- somehow validate the data, DataFrame.Info maybe as a start, need to think about this
- prepare text for NLP analysis

### E. Limitations

- Describe limitations, only hellopeter, but there are many other sources of social media like facebook and twitter. 
- Predominantly complaints
- Specific to telecoms in south africa, similar techniques will be applicable to other industries.
- why the dataset is appropriate

## III. Exploratory Data Analysis

1. Use Hellopeter to get reviews from 3 cellular providers.
2. Clean-up - for example remove empty reviews
3. Data Prep
4. Analysis
	- Word cloud
	- bar-chart of stars clustered by providers
	- rating counts per month - is the ratings seasonal
	- boxplot of word count for reviews - also interesting correlation between word count and review stars
	- maybe a grid per telecom showing reviews per month vs the number of replies