# Coursework 1 - This needs a proper title

## I. Background

- Why this study is of interest and relevant

## II. Study Purpose

### A. Motivation and Objectives

### B. Research Questions

In [66]:
# import libraries
import pandas as pd
import requests
import datetime

## II. Research Methodology

- Describe that the study will be primary a NLP project and this is why the NLP preparation to follow.

### A. Data Source

- Describe why Hellopeter 
- A good explanation as to why this data source has been selected.
- Justify that this is a major source of reviews/complaints which the telecoms actively monitor and respond to. Note that Telkom issued a notice (https://www.hellopeter.com/telkom) that they are phasing it out and urge customers to contact them via twitter opr facebook.
- Bellow implement the class to load data from Hellopeter via api


The logic for retrieving review data from Hellopeter.com is encapsulated in the class below. This allows this class to be moved to a separate .py file to enable re-use and to ensure that activities specifically related to the inner workings of the API have a layer of abstraction should the API change at a future date.

In [67]:
class Hellopeter():
    """
    This class is used to retrieve Hellopeter reviews via the `https://api.hellopeter.com/consumer/business/` API.

    Parameters
    ----------
    business : str
        The business name to retrieve reviews for.
    api_url : str
        The base URL used to invoke the Hellopeter API.
    """
    def __init__(self, business:str, api_url:str='https://api.hellopeter.com/consumer/business/') -> None:
        self.business = business
        self.api_url = api_url
        self.url_template = self.api_url + self.business + '/reviews?page='

        # initialize the session to use for requests to the API
        self.request_session = requests.Session()

    def request_page(self, page_number:int) -> dict:
        """
        Request a specific review page for the business.

        Parameters
        ----------
        page_number : int
            The page number to retrieve the reviews from.   

        Returns
        -------
            response_json : dict
        """
        # set the full url for the request
        url = self.url_template + str(page_number)
       
        # set the request headers
        headers = {
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
            'accept': 'application/json'
        }

        # request the review page
        respose = self.request_session.get(url=url)

        # implement basic error handling
        if respose.status_code == 202:
            return respose.json()
        else:
            raise Exception('An unexpected response code were received: %s' % respose.status_code)

    def process_request_page(self, page_json:dict) -> pd.DataFrame:
        """
        Process the raw JSON data of a review page and convert it to a Pandas DataFrame.

        Parameters
        ----------
        page_json : dict
            The raw JSON (represented as a Python dictionary) that was retrieved from the API.

        Returns
        -------
        page_data : pandas.DataFrame
            The page data converted to a DataFrame.            
        """
        # create the dataframe
        page_data = pd.DataFrame(page_json['data'])

        # add the business name to the dataframe
        page_data['business'] = self.business

        # basic data type conversions
        page_data.created_at = pd.to_datetime(page_data.created_at)
        page_data.replied = page_data.replied.astype('bool')        

        # return the processed page data
        return page_data

# test the class
peter = Hellopeter(business='vodacom')
page = peter.request_page(1)
page_data = peter.process_request_page(page)

print(page['data'][0].keys())
page_data.head(3)

dict_keys(['id', 'user_id', 'created_at', 'authorDisplayName', 'author', 'authorAvatar', 'author_id', 'review_title', 'review_rating', 'review_content', 'business_name', 'business_slug', 'permalink', 'replied', 'messages', 'business_logo', 'industry_logo', 'industry_name', 'industry_slug', 'status_id', 'nps_rating', 'source', 'is_reported', 'business_reporting', 'author_created_date', 'author_total_reviews_count', 'attachments'])


Unnamed: 0,id,user_id,created_at,authorDisplayName,author,authorAvatar,author_id,review_title,review_rating,review_content,...,industry_slug,status_id,nps_rating,source,is_reported,business_reporting,author_created_date,author_total_reviews_count,attachments,business
0,3749926,f6378be0-31db-11ec-a476-2b1a8cae7ce9,2021-12-21 11:38:43,Lydia G,Lydia G,,f6378be0-31db-11ec-a476-2b1a8cae7ce9,Excellent service:Vodacom,5,Excellent service from vodacom. I'm thankful f...,...,telecommunications,1,,WEBSITE,False,,2021-10-20,1,[],vodacom
1,3749824,3ab199f0-f70e-11e9-8d1f-bd8083fe770b,2021-12-21 11:04:04,Gugu M,Gugu M,,3ab199f0-f70e-11e9-8d1f-bd8083fe770b,Vodacom + Commit Technologies/Elite Mobile = F...,1,"Once upon a time (many many years ago), I had ...",...,telecommunications,1,,WEBSITE,False,,2019-10-25,2,[],vodacom
2,3749788,149e8f1b-31fa-11e8-83f4-f23c91bb6188,2021-12-21 10:49:13,zandile,zandile,,149e8f1b-31fa-11e8-83f4-f23c91bb6188,I have vodacom,1,I swear vodacom used by the devil in my life. ...,...,telecommunications,1,,WEBSITE,False,,2013-08-15,9,[],vodacom


In [68]:
test = page_data.copy()
#test.created_at = pd.to_datetime(test.created_at)
#test.replied = test.replied.astype('bool')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   id                          11 non-null     int64         
 1   user_id                     11 non-null     object        
 2   created_at                  11 non-null     datetime64[ns]
 3   authorDisplayName           11 non-null     object        
 4   author                      11 non-null     object        
 5   authorAvatar                11 non-null     object        
 6   author_id                   11 non-null     object        
 7   review_title                11 non-null     object        
 8   review_rating               11 non-null     int64         
 9   review_content              11 non-null     object        
 10  business_name               11 non-null     object        
 11  business_slug               11 non-null     object        
 

In [69]:
page_data.created_at.min()

Timestamp('2021-12-21 08:48:22')

### B. Ethics of Use

 ### C. Data Selection

 - In this section also describe provider selection (telkom, vodacom, cell c, mtn) and why these major ones only.
 - in a separate cell use class and save all data to csv
 - in a separate cell discuss the features that will be selected, then in the next cell drop columns not of interest.

### D. Data Preparation

Sections (a paragraph followed by code) for:

- remove empty and illegal values - have clear reasoning described for changes
- somehow validate the data, DataFrame.Info maybe as a start, need to think about this
- prepare text for NLP analysis

### E. Limitations

- Describe limitations, only hellopeter, but there are many other sources of social media like facebook and twitter. 
- Predominantly complaints
- Specific to telecoms in south africa, similar techniques will be applicable to other industries.
- why the dataset is appropriate

## III. Exploratory Data Analysis

1. Use Hellopeter to get reviews from 3 cellular providers.
2. Clean-up - for example remove empty reviews
3. Data Prep
4. Analysis
	- Word cloud
	- bar-chart of stars clustered by providers
	- rating counts per month - is the ratings seasonal
	- boxplot of word count for reviews - also interesting correlation between word count and review stars
	- maybe a grid per telecom showing reviews per month vs the number of replies