# Prepare AppFollow data for Upload to Luminoso Projects
## It also filters reviews by language and saves JSON files to your folder as a backup

## The Luminoso API documentation is [here](https://analytics.luminoso.com/api/v4/#Prediction)

Given the following JSON format of AppFollow data:

{"reviews":
[
{"rating":2,"date":"2016-06-01","created":"2016-08-19 20:18:23","content":"SOME TEXT",
"store":"as","rating_prev":0,"user_id":"USERZ","id":5555,"ext_id":"5555",
"is_answer":0,"app_version":"2.2","app_id":46045,"was_changed":0,"locale":"de","author":"ABC","review_id":"555","title":"SOME TITLE","updated":"2017-06-03 19:49:16"},

{"rating":1,"date":"2016-06-01","created":"2016-08-19 20:18:23","content":"SOME TEXT 2",
"store":"as","rating_prev":0,"user_id":"USERX","id":4444,"ext_id":"4444",
"app_version":"2.2","is_answer":0,"app_id":46045,"was_changed":0,"locale":"be","author":"XYZ","review_id":"111","title":"SOME TITLE 2","updated":"2017-06-03 19:49:16"}
]
}
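
As a minimal sketch of how this structure is navigated in Python, the snippet below parses an abbreviated copy of the sample above and reads the two fields that matter most later on (`content` and `locale`). The `sample` string keeps only a few fields for brevity and is illustrative, not real data.

```python
import json

# Abbreviated copy of the sample structure above (illustrative, not real data).
sample = '''
{"reviews": [
  {"rating": 2, "date": "2016-06-01", "content": "SOME TEXT",
   "locale": "de", "title": "SOME TITLE"},
  {"rating": 1, "date": "2016-06-01", "content": "SOME TEXT 2",
   "locale": "be", "title": "SOME TITLE 2"}
]}
'''

reviews = json.loads(sample)["reviews"]   # list of review dicts
texts = [r["content"] for r in reviews]   # the review text
locales = [r["locale"] for r in reviews]  # store locale, not necessarily the language
print(texts, locales)
```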

### Load libraries and some utility functions

In [1]:
import json
import os
import csv

In [2]:
#load and save to JSON type files
import json

def load_json_data(filename):
    with open(filename) as json_data:
        return json.load(json_data)

def save_as_json_file(data, filename):
    # append the .json extension only when it is missing
    file = filename if filename.endswith('.json') else filename + '.json'
    with open(file, 'w') as f:
        json.dump(data, f)
    print("JSON data saved to file: ", file)

In [3]:
# create a stats dict from a list of dict data, and a given dict entry:
def dict_stats(entry, dictdata):
    stats_list = {}
    for n in dictdata:
        k = n[entry]
        if k in stats_list:
            stats_list[k] = stats_list[k] + 1
        else:
            stats_list[k] = 1
    print("Stats count for each type of " + entry + ":\n", stats_list)
    return stats_list
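
The same count can be sketched with `collections.Counter` from the standard library. The helper name `dict_stats_counter` and the `example` list are hypothetical and only illustrate the idea; the function above works equally well.

```python
from collections import Counter

def dict_stats_counter(entry, dictdata):
    """Count how often each value of `entry` occurs in a list of dicts."""
    return dict(Counter(d[entry] for d in dictdata))

# Toy input (illustrative only).
example = [{"locale": "de"}, {"locale": "be"}, {"locale": "de"}]
counts = dict_stats_counter("locale", example)
print(counts)
```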

In [4]:
import cld2

def remove_foreign_languages(docs, lang_code, threshold=0):
    good_documents = []
    bad_documents = []
    for doc in docs:
        try:
            isReliable, textBytesFound, details = cld2.detect(doc['text'])
            # bad if reliably detected as another language, or if the share of
            # the top detected language falls below the threshold
            isbad = (isReliable and details[0][1] != lang_code) or details[0][2] < threshold
        except ValueError:
            # text that cannot be analysed is counted as removed
            isbad = True
        if isbad:
            bad_documents.append(doc)
        else:
            good_documents.append(doc)
    print('{} documents not identified as "{}" removed from project.'.format(len(bad_documents), lang_code))
    return good_documents, bad_documents
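
To make the decision rule in `remove_foreign_languages` easier to follow, here is a small sketch that applies it to hand-written detection results instead of calling `cld2`. The helper `is_foreign` and the three sample tuples are hypothetical; they only mimic the `(isReliable, textBytesFound, details)` shape that the function above unpacks, with `details[0][1]` as the language code and `details[0][2]` as the share of the top language.

```python
def is_foreign(detection, lang_code, threshold=0):
    """Apply the filtering rule to a single detection result.

    A document is rejected when the detector is reliable and reports a
    different language, or when the top language's share is below threshold.
    """
    is_reliable, _bytes_found, details = detection
    top = details[0]
    wrong_language = is_reliable and top[1] != lang_code
    low_share = top[2] < threshold
    return wrong_language or low_share

# Hand-written detection results (hypothetical values, not real cld2 output).
french = (True, 20, (("FRENCH", "fr", 95, 1000.0),))
english = (True, 12, (("ENGLISH", "en", 98, 1200.0),))
unsure = (False, 3, (("Unknown", "un", 0, 0.0),))

for name, det in [("french", french), ("english", english), ("unsure", unsure)]:
    print(name, is_foreign(det, "fr"), is_foreign(det, "fr", threshold=0.6))
```

Note that with the default `threshold=0` an unreliable detection is kept, while any positive threshold rejects a detection whose share is 0. Which scale the detector uses for that share (a fraction or a percentage) is worth checking before choosing a threshold.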

## Load the Input JSON data from AppFollow

In [5]:
INPUT_FILENAME = "AppFollowReviews.json"
#currentPath = os.getcwd()

In [6]:
raw_data = load_json_data(INPUT_FILENAME)

In [7]:
data = raw_data['reviews']
len(data)

200

In [8]:
data[0]

{'app_id': 46045,
 'app_version': '2.2',
 'author': 'MrWasGehtSieDasAn?',
 'content': 'Anmeldung scheitert schon beim Versuch die Telefonnummer einzugeben',
 'created': '2016-08-19 20:18:23',
 'date': '2016-06-01',
 'ext_id': '958825016',
 'id': 31373523,
 'is_answer': 0,
 'locale': 'de',
 'rating': 1,
 'rating_prev': 0,
 'review_id': '1387505834',
 'store': 'as',
 'title': 'Anmeldung...',
 'updated': '2017-06-03 19:49:16',
 'user_id': '462945259',
 'was_changed': 0}

In [9]:
data[1]

{'app_id': 46045,
 'app_version': '2.2',
 'author': 'Tim1901',
 'content': 'Deutsche Telefonnummern gehen nicht 🖕🏾🖕🏾',
 'created': '2016-08-19 20:18:23',
 'date': '2016-06-01',
 'ext_id': '958825016',
 'id': 31373524,
 'is_answer': 0,
 'locale': 'de',
 'rating': 1,
 'rating_prev': 0,
 'review_id': '1387501310',
 'store': 'as',
 'title': '???',
 'updated': '2017-06-03 19:49:16',
 'user_id': '238232117',
 'was_changed': 0}

In [9]:
locale_counts = dict_stats('locale', data)

Stats count for each type of locale:
 {'de': 124, 'be': 12, 'da': 16, 'ca': 48}


## Keep only the necessary fields and sort by language
#### The only *compulsory* field is the text, which here is 'content'. All other fields are metadata.
#### We also need to separate the different languages (locales). Here we show how to keep German reviews

In [10]:
# we split the data by locale:
loc_data = {}
for loc in locale_counts:
    loc_data[loc] = [d for d in data if d['locale'] == loc] 
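
The split above scans the full list once per locale. A single-pass alternative can be sketched with `collections.defaultdict`; the helper name `group_by` and the toy reviews are hypothetical.

```python
from collections import defaultdict

def group_by(entry, dictdata):
    """Group a list of dicts by the value stored under `entry`, in one pass."""
    groups = defaultdict(list)
    for d in dictdata:
        groups[d[entry]].append(d)
    return dict(groups)

# Toy reviews (illustrative only).
example = [
    {"locale": "de", "content": "a"},
    {"locale": "be", "content": "b"},
    {"locale": "de", "content": "c"},
]
grouped = group_by("locale", example)
print({loc: len(docs) for loc, docs in grouped.items()})
```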

In [11]:
loc_data.keys()

dict_keys(['de', 'be', 'da', 'ca'])

In [12]:
# The German data contains some English reviews, but too few to be a concern for Luminoso.
for x in loc_data['de']:
    print(x['content'])

Anmeldung scheitert schon beim Versuch die Telefonnummer einzugeben
Deutsche Telefonnummern gehen nicht 🖕🏾🖕🏾
Can't verify my number? app needs a update asap!
Nicht möglich eine Nummer einzugeben
my phone number don't fit
Unable to enter my telephone number
Ich habe versucht mich anzumelden und kann meine Handynummer nicht angeben, da man nur amerikanische Nummern eingeben kann. Bitte beheben.
They say European cities are available but they aren't. What the sense of this
Warum sollte ich mir die Schuhe in London reservieren. Wie soll ich bitte schnell nach London kommen .. Like .. Ich wohne in Aachen .. Köln würde eventuell gehen für nen Yeezy boost 350.. Und das wuerde schon unglaublich viel kosten.. Außerdem wurde Adidas doch in Deutschland gegründet, warum also kann ich nicht einfach in meiner Stadt (Aachen) oder wenigstens Köln .. Was wieder krasse Kosten wären .. reservieren
Endlich ist die App in Deutschland. Aber schade ist, dass man sich in der Stadt befinden muss in der der Rel

In [13]:
# The Belgian data is a more even mix of English, French and Dutch.
# We can thus create a project for each language using a pre-defined language mapping, by uploading everything to one project, copying it twice, then
# removing the other languages using another script we have.
for x in loc_data['be']:
    print(x['content'])

Wanted it in 36 2/3. Got it in 37 1/3. Is it possible to change ? :-(
Meilleure application ever je suis trop happy top cool
Thanks!!!!!!
Great app, great thinking. Weird that i missed out on one release though. I tapped the microsecond the button was available and missed it. But thats the game i guess
Geweldig;)
👍🏼👍🏼👍🏼
Wicked!
Great App!!
Thanks to adidas and the Confirmed app now we can have a chance to get Yeezy's without staying for hours in the cold! And I most say this app already worked out very fine for me.
App excellente et fiable
Il faut habiter dans une zone precise pour pouvoir participer 
C'est domage si on pouvait participer partout puis si je gagne je le déplace
:-)


In [14]:
# for locale = Canada, the language is mainly English
for x in loc_data['ca']:
    print(x['content'])

Nothing. Just thank you.
FINALLY THANK YOU ADIDAS
Follow my ig ; moxeb 😜😜😜😜
The entering of my phone number won't work. The last digits should be only 4, but are 5. Can anyone help??
Now make this available to cities like Calgary and Edmonton, or change our outlets to originals locations at very least. Canada deserves more love!
Finally released in Canada. Thank you adidas!
Not of any use if you're not bringing any shoes in our area. Make the shoes available for shipping and we'll be talking.
Always errors when asking for number
Ok i have to say thank you and finally making this but the registration always has errors, i put everything is and press send sms there are errors and the last 4 digits requires 5.
Error when asking for phone number. Please fix the app.
Stuck on the loading screen every time I open the app - not even useable whatsoever.
What a horrible app. Registration won't work because they want a 11 digit phone number instead of 10. Customer service doesn't even respond.
On

In [15]:
# for locale = Denmark, the language is mainly Danish, but this is not supported in Luminoso.
for x in loc_data['da']:
    print(x['content'])

Den accepterer ikke mit telefonnummer
Copped 😍
Fungere fint når man er på WiFi.  Er overskuelig og er en hjælp nå man skal se hvad der kommer af nyt.
Jeg er smuk fordi jeg fik dem
Den har swag
Tillader ikke rootede telefoner
Easy peacy
Helt fin
YEEZY!!!
God app
Virker godt
Super god app
Kæft der er meget glad
Fedt! Coppede zebras i dag!!!
"Vælg det rigtige billede ..."  Så skal man vælge et billede ud af 9 der passer på beskrivelsen - men trods min Universitetsuddannelse og glæde for læsning, er jeg prisgivet når folk endten gætter med et hurtigt tryk, eller læser endnu hurtigere ...  En Fair for alle løsning ville være at foretrække (dette gælder også jeres online releases)
Den gider ikke at tilmelde mig...


## Create a locale-to-languages mapping
#### The list will be empty if the language is not supported by Luminoso, but may also have several values if the data needs to be separated into several projects (as for Belgium)

In [16]:
# to be expanded if more locales are included in original input file
locale_languages_mapping = {'de': ['de'], 'be': ['fr', 'nl', 'en'], 'da': [], 'ca': ['en']}

In [17]:
for loc in locale_counts:
    print(loc, locale_languages_mapping[loc])

de ['de']
be ['fr', 'nl', 'en']
da []
ca ['en']
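
Indexing the mapping directly raises a `KeyError` for any locale that has not been added yet. One possible safeguard, sketched here with a hypothetical helper `languages_for`, is to fall back to an empty list so that unmapped locales are skipped in the same way as unsupported ones.

```python
# Same mapping as above, repeated so the sketch is self-contained.
locale_languages_mapping = {'de': ['de'], 'be': ['fr', 'nl', 'en'], 'da': [], 'ca': ['en']}

def languages_for(locale, mapping):
    """Return the languages for a locale, or an empty list if it is unmapped."""
    return mapping.get(locale, [])

print(languages_for('be', locale_languages_mapping))
print(languages_for('fr', locale_languages_mapping))  # 'fr' is not in the mapping
```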


In [18]:
def format_for_upload(data):
    # 'text' is the compulsory field; everything else is metadata
    return [{'text': r['content'],
             'title': r['title'],
             'app_id': r['app_id'],
             'app_version': r['app_version'],
             'date': r['date'],
             'rating': r['rating'],
             'store': r['store']}
            for r in data]
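
If some reviews lack one of the metadata fields, the direct lookups above raise a `KeyError`. A more tolerant variant might look like the sketch below, which keeps `content` compulsory and copies only the metadata fields that are present. The names `METADATA_FIELDS` and `format_review` are hypothetical.

```python
# Same metadata fields as in format_for_upload above.
METADATA_FIELDS = ('title', 'app_id', 'app_version', 'date', 'rating', 'store')

def format_review(review, fields=METADATA_FIELDS):
    """Build an upload document; 'content' is required, metadata is optional."""
    doc = {'text': review['content']}
    for field in fields:
        if field in review:
            doc[field] = review[field]
    return doc

# Toy review with a missing 'store' field (illustrative only).
review = {'content': 'SOME TEXT', 'title': 'SOME TITLE', 'rating': 2, 'locale': 'de'}
doc = format_review(review)
print(doc)
```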

In [19]:
PROJECTS_MAIN_TITLE = 'AppFollow TEST'
LANG_THRESHOLD = 0.6

In [20]:
project_list = []

# loc_docs avoids shadowing the global `data` list loaded earlier
for (loc, loc_docs) in loc_data.items():
    langs = locale_languages_mapping[loc]
    print(langs)
    if not langs:
        print('The dataset for locale:', loc, 'will not be uploaded as no language is defined')
    else:
        for lingua in langs:
            proj = {}
            unused_docs = []
            proj['language'] = lingua
            proj['location'] = loc
            proj['name'] = PROJECTS_MAIN_TITLE + ' -location:' + loc + ' -language:' + lingua
            if len(langs) > 1:
                # several languages share this locale: keep only the current one
                documents, unused_docs = remove_foreign_languages(format_for_upload(loc_docs), lingua, LANG_THRESHOLD)
            else:
                documents = format_for_upload(loc_docs)
            proj['docs'] = documents
            proj['unused_docs'] = unused_docs
            print(proj['name'], '   with ' + str(len(proj['docs'])), 'docs')
            project_list.append(proj)

['de']
AppFollow TEST -location:de -language:de    with 124 docs
['fr', 'nl', 'en']
11 documents not identified as "fr" removed from project.
AppFollow TEST -location:be -language:fr    with 1 docs
12 documents not identified as "nl" removed from project.
AppFollow TEST -location:be -language:nl    with 0 docs
7 documents not identified as "en" removed from project.
AppFollow TEST -location:be -language:en    with 5 docs
[]
The dataset for locale: da will not be uploaded as no language is defined
['en']
AppFollow TEST -location:ca -language:en    with 48 docs


In [21]:
len(project_list)

5

In [22]:
for x in project_list:
    print(x['name'], 'docs total:', len(x['docs']), 'unused-docs total:', len(x['unused_docs']))

AppFollow TEST -location:de -language:de docs total: 124 unused-docs total: 0
AppFollow TEST -location:be -language:fr docs total: 1 unused-docs total: 11
AppFollow TEST -location:be -language:nl docs total: 0 unused-docs total: 12
AppFollow TEST -location:be -language:en docs total: 5 unused-docs total: 7
AppFollow TEST -location:ca -language:en docs total: 48 unused-docs total: 0
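
In the totals above, `docs` plus `unused_docs` adds up to the locale count for every project (for example 1 + 11 = 12 for Belgian French). A small check along these lines can flag documents that end up in neither list. The helper `check_project_totals` and the toy projects are hypothetical.

```python
def check_project_totals(projects, locale_counts):
    """Return names of projects whose docs + unused_docs differ from the locale total."""
    return [p['name'] for p in projects
            if len(p['docs']) + len(p['unused_docs']) != locale_counts[p['location']]]

# Toy projects (illustrative only): the second one has lost a document.
toy_counts = {'be': 3}
toy_projects = [
    {'name': 'ok', 'location': 'be', 'docs': [{}, {}], 'unused_docs': [{}]},
    {'name': 'lossy', 'location': 'be', 'docs': [{}], 'unused_docs': [{}]},
]
mismatched = check_project_totals(toy_projects, toy_counts)
print(mismatched)
```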


## Save data to JSON files
### In case needed later -- it will be easy to check data or re-upload

In [23]:
for p in project_list:
    save_as_json_file(p, p['name'])

JSON data saved to file:  AppFollow TEST -location:de -language:de.json
JSON data saved to file:  AppFollow TEST -location:be -language:fr.json
JSON data saved to file:  AppFollow TEST -location:be -language:nl.json
JSON data saved to file:  AppFollow TEST -location:be -language:en.json
JSON data saved to file:  AppFollow TEST -location:ca -language:en.json
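
The project names are used as file names here, and they contain spaces and colons. Colons are not allowed in Windows file names, so a portable backup may need a cleaned-up name. A minimal sketch, with a hypothetical helper `safe_filename`:

```python
import re

def safe_filename(name):
    """Replace runs of characters other than letters, digits, '.', '_' and '-' with '_'."""
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', name)
    return cleaned.strip('_')

print(safe_filename('AppFollow TEST -location:be -language:fr'))
```

The original project name can still be kept inside the saved JSON (it is stored under `name`), so nothing is lost by cleaning the file name.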


## Uploading data to new project

In [24]:
from luminoso_api import LuminosoClient

In [25]:
account_id = 'https://eu-analytics.luminoso.com/api/v4/projects/u46p858s/'  #Adidas Training
uname = 'bodier@luminoso.com'

In [26]:
#connect
connection = LuminosoClient.connect(account_id, username = uname)

Password for bodier@luminoso.com: ········


In [35]:
for proj in project_list:
    if len(proj['docs']) != 0:    
        # Create a new project
        new_project = connection.post(name = proj['name'])
        new_project_id = new_project['project_id']

        new_project_path = connection.change_path(new_project_id)
        new_project_path.upload('docs', proj['docs'])

        print('Uploading of docs complete for project:', proj['name'])

        job_id = new_project_path.post('docs/recalculate', language = proj['language'])
        end_job = new_project_path.wait_for(job_id)
        if end_job['success']:
            print('Project created successfully')
        else:
            print('Warning: recalculation failed')

print('All done')

Uploading of docs complete for project: AppFollow TEST -location:de -language:de
Project created successfully
Uploading of docs complete for project: AppFollow TEST -location:be -language:fr
Project created successfully
Uploading of docs complete for project: AppFollow TEST -location:be -language:en
Project created successfully
Uploading of docs complete for project: AppFollow TEST -location:ca -language:en
Project created successfully
All done
