# Initial LIWC API (Receptiviti API) call to extract features
This notebook is run once, at the start of the project, to call the Receptiviti API and extract the vocabulary features, i.e. the linguistic analysis of the input posts.
The Receptiviti API accepts text for analysis and returns a large set of scored semantic features. The score for each feature reveals social and psychological insights and also captures syntactic characteristics of the text. LIWC is a valuable tool in NLP that brings together the research outcomes of a series of studies in psychology, linguistics and sociology. <p> 

More information about how the LIWC is set up under the hood and tips on how to call the Receptiviti API can be found on the [official webpage](http://liwc.wpengine.com/). 

## Contents
<br/>

* [Import modules](#Import-modules)
* [Declaration of functions](#Declaration-of-functions)
* [Call the LIWC API - in batch calls](#Call-the-LIWC-API---in-batch-calls) 



## Import modules

In [4]:
import nltk
import pandas as pd 
import numpy as np
from nltk.stem import WordNetLemmatizer
import re
import os
import json
import requests
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
import wordcloud
from scipy.stats import pearsonr
import string
import spacy
# nltk.download('stopwords')
# nltk.download('punkt')
# sp = spacy.load('en_core_web_sm')
# stopwords = nltk.corpus.stopwords.words('english')
# %config Completer.use_jedi = False

In [3]:
# LIWC credentials. Sign up for the free version of the API or go for the proprietary one.
# Following registration, define the received key and secret here once, to be used when calling the API.
# Access the API dashboard to monitor your usage and activity here https://dashboard.receptiviti.com/
API_key = ""  # enter details here
API_secret = ""  # enter details here
API_URL = ""  # enter details here



## Declaration of functions

The functions _get_payload_and_url()_ and _call_receptiviti_api()_ should be called only when the API is run on new data. Use these two functions with caution, as every call uses up the API allowance.<p>
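
Because every call counts against the allowance, one option is to store each response on disk and reuse it on later runs. The sketch below is an assumption and not part of the original notebook: `cached_call` and the cache file name are hypothetical, and the wrapped function can be any callable that returns a JSON-serialisable response, such as _call_receptiviti_api()_.

```python
import json
import os

def cached_call(cache_path, call_fn, *args):
    """Return the response stored at cache_path, or call call_fn once and store its result."""
    # reuse an earlier response instead of spending API allowance again
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    response = call_fn(*args)
    # only cache non-empty responses, so a failed call can be retried later
    if response:
        with open(cache_path, "w", encoding="utf-8") as fh:
            json.dump(response, fh)
    return response
```

A call might then look like `cached_call("batch_1.json", call_receptiviti_api, posts, API_URL, API_key, API_secret)`, where `batch_1.json` is an illustrative file name.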

The LIWC API has been created by [Receptiviti](https://www.receptiviti.com/liwc). Over the years, as research advanced, the dictionary has been updated with more features; some of these are named with the prefix LIWC and others are named Receptiviti. 

In [28]:
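#check type of content and build the request payload
def get_payload_and_url(plot, API_URL):
    if len(plot)<1:
        print("ERROR: 'text' should not be empty")
        # return an empty payload together with the URL so the caller can still unpack two values
        return {}, API_URL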
    if isinstance(plot, str):
        return ({
                    "content": plot
                }, API_URL)
    if isinstance(plot, list):
        return ([{
                    "content": content
        } for content in plot], API_URL + '/bulk')

#call the LIWC vocabulary API to analyse the data
def call_receptiviti_api(plot, API_URL, API_key, API_secret):
    payload, url = get_payload_and_url(plot, API_URL)
    results = []
    if len(payload)>0:
        response = requests.post(url, data=json.dumps(payload), auth=(API_key, API_secret), headers = {'Content-Type': 'application/json'})
        if response.status_code==200:
            results = response.json()
    return results


# filter, tokenize and lemmatize each post
def lemmatize_text(posts_corpus):
    lemmatizer = WordNetLemmatizer()
    text=[]
    for post, label in zip(posts_corpus.post, posts_corpus.label):
        tokens = [word.lower() for word in nltk.word_tokenize(post)]
        word_tokens = []
        for token in tokens:
            # include only words
            if re.search('^[a-zA-Z]+$', token):
                word_tokens.append(lemmatizer.lemmatize(token))
        text.append(" ".join(word_tokens))
    return text

#return all LIWC metrics
def LIWC_metrics_all(response):
    return response['results'][0]['dictionary_measures']

#return all Receptiviti metrics
def Rec_metrics_all(response):
    return response['results'][0]['receptiviti_measures']


#return the n top LIWC metrics
def LIWC_metrics_n(response, n=20):
    """
    Select n number of top metrics, sorted highest to lowest  
    """
    #sort the LIWC metrics from highest to lowest score
    LIWC_metrics = LIWC_metrics_all(response)
    LIWC_metrics_sorted = dict(sorted(LIWC_metrics.items(), key=lambda item: item[1], reverse=True))
    return dict(list(LIWC_metrics_sorted.items())[:n])

#return the n top Receptiviti metrics
def Rec_metrics_n(response, n=20):
    """
    Select n number of top metrics, sorted highest to lowest  
    """
    #sort the Receptiviti metrics from highest to lowest score
    Rec_metrics = Rec_metrics_all(response)
    Rec_metrics_sorted = dict(sorted(Rec_metrics.items(), key=lambda item: item[1], reverse=True))
    return dict(list(Rec_metrics_sorted.items())[:n])

#function to create the LIWC_vocabulary analysis dataframe
def LIWC_analysis_df(features, posts_sample, LIWC_analysis):
    Analysis_scores=np.empty( (len(posts_sample),len(features)) ) 
    for i, entry in enumerate(LIWC_analysis['results']):
        Analysis_scores[i] = list(entry['dictionary_measures'].values()) + list(entry['receptiviti_measures'].values())
    scores = pd.DataFrame(data=Analysis_scores, columns=features) 
    return( pd.concat( [posts_sample, scores], axis=1 ) )



In [6]:
# declare the paths to raw data
#Download and extract the dreaddit-dataset in the data/raw directory
data_path = os.path.join("..", "data", "raw", "dreaddit-train.csv")
features_path = os.path.join("..", "data", "interim", "features.csv")
processed_vocab = os.path.join("..", "data", "interim", "LIWC-analysis.csv")
#load the raw data
df = pd.read_csv(data_path)
posts= df[['text','label']]
posts.columns = ['post', 'label']


In [44]:
#create a lemmatized dataframe 
text = lemmatize_text(posts)
lemmatized_data = pd.DataFrame([text]).T
lemmatized_data.rename(columns={0:'post'},inplace=True)
lemmatized_data['label']=posts.label
lemmatized_data.head()

In [40]:
# explaining the measures of the API
# https://dashboard.receptiviti.com/docs/frameworks-and-measures/#per-measures
# results are more reliable when the word count in each post is >350


## Call the LIWC API - in batch calls 

#### CAUTION!

Extract all features from the LIWC API and save them in a CSV file for reference. **Used only once at the initial stage.** <p>
Break down the posts and run the API in batches, multiple times. **The reason is that the LIWC API cannot accept a list of more than 1000 posts.** <p>
Each batch is sent with `LIWC_analysis = call_receptiviti_api(posts_sample.post.tolist(), API_URL, API_key, API_secret)`.
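
The batching described above can be sketched as a small generator that yields consecutive slices of the post list. This is an illustrative assumption and not code from the original notebook: `iter_batches` is a hypothetical helper, and the default `batch_size` of 1000 is taken from the limit stated above (the notebook's own sample uses 999 rows, which is equally valid).

```python
def iter_batches(items, batch_size=1000):
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# dummy posts standing in for lemmatized_data.post.tolist()
posts_list = [f"post {i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(posts_list)]
```

Each yielded batch can then be passed to _call_receptiviti_api()_ in turn, and the result saved before the next batch is requested.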

In [None]:
#  example for sampling the raw data
posts_sample=lemmatized_data.iloc[0:999,:]
posts_sample=posts_sample.reset_index(drop=True)
posts_sample.head()

LIWC_analysis = call_receptiviti_api(posts_sample.post.tolist(), API_URL, API_key, API_secret )


This section saves all feature names to a features CSV, whose filepath we define in our working directory.

In [None]:
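# single_result is not defined earlier in this notebook; it is assumed here to be the
# response of the batch call above, whose first entry supplies the feature names
single_result = LIWC_analysis

# len(LIWC_metrics_all(single_result).keys()) + len(Rec_metrics_all(single_result).keys())
# = 163 keys in total (116 LIWC measures, 47 Rec measures)


#save features to CSV file - used only once
Rec_features = pd.DataFrame(Rec_metrics_all(single_result).keys())
Rec_features = Rec_features.apply(lambda k: "Rec_"+k)
LIWC_features = pd.DataFrame(LIWC_metrics_all(single_result).keys())
LIWC_features = LIWC_features.apply(lambda k: "LIWC_"+k)
# LIWC names come first, matching the column order used in LIWC_analysis_df()
all_features = pd.concat([LIWC_features, Rec_features])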

# set the features path as the path to your working data storage files - data/interim 
all_features.to_csv(features_path, index=False,header=False)


This section saves the LIWC analysis returned by the API, batch by batch, to the processed_vocab filepath. This filepath needs to be set in the working directory under data/interim.<p>
The recommended way is to save each batch to its own file and then manually combine the batch-processed data into a common CSV file. An easy way to do this is to name the file path each time as _"file_1"_, _"file_2"_ and so on, until all the data has been passed to the API. 
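
As an alternative to combining the files by hand, the per-batch CSV files could be concatenated with pandas. The sketch below is an assumption and not part of the original notebook: `combine_batches` is a hypothetical helper, and it presumes that every batch file has the same header row and that the output file name does not match the glob pattern.

```python
import glob
import os
import pandas as pd

def combine_batches(pattern, output_path):
    """Concatenate every CSV matching pattern into one file and return the combined frame."""
    # sorted() orders names as strings, so "file_10" sorts before "file_2"
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no batch files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index gives the combined data a fresh 0..n-1 index
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(output_path, index=False)
    return combined
```

For example, `combine_batches(os.path.join("..", "data", "interim", "file_*.csv"), processed_vocab)` would gather the batch files named as suggested above, provided they were saved with a `.csv` extension.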

In [None]:
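# load the feature names saved earlier; they become the score column names
features = pd.read_csv(features_path, header=None)[0].tolist()
# create dataframe to hold the analysis data
LIWC_vocabulary_analysis = LIWC_analysis_df(features, posts_sample, LIWC_analysis)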
LIWC_vocabulary_analysis.head(2)

#save the batch of the LIWC analysis to CSV file
LIWC_vocabulary_analysis.to_csv(processed_vocab, index=False,header=True)

