## Score Drivers Over Time

Given a project that includes a date field, and topics for which you want to find score drivers, this notebook finds the score drivers for each topic contained in the project.

This script reads all the documents from the project you specify. For each week in a 52-week period, it creates a new project, keeping a 16-week window of documents as it rolls toward the final output.

The input is an account_id/project_id and a project name which is used when writing the output file.

The documents in the project need to have predictors. The output is a set of score drivers for each topic/predictor combination for the date specified. For best results, the project needs at least 52 weeks of data leading up to the final 16-week window.

v4 added the use of a branch function to create the new project instead of creating a new project from scratch. This saves doc count.

In [1]:
from luminoso_api import LuminosoClient
import datetime, time, json, os, csv
import numpy, pack64

### Inputs
Given a Luminoso URL such as "https://analytics.luminoso.com/app/#/projects/p87t862f/prkhfs6b", the account ID and project ID are the last two segments.
Change the values in the next cell to match the project you would like to run score drivers against.

account_id = p87t862f

project_id = prkhfs6b
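
As a convenience, the two IDs can be pulled out of the URL programmatically. The sketch below assumes the URL always has the `#/projects/<account_id>/<project_id>` shape shown above; `ids_from_project_url` is a hypothetical helper name, not part of `luminoso_api`.

```python
from urllib.parse import urlparse

def ids_from_project_url(url):
    """Return (account_id, project_id) from a project URL shaped like the one above."""
    # Everything after '#' is the URL fragment, e.g. '/projects/p87t862f/prkhfs6b'
    parts = [p for p in urlparse(url).fragment.split('/') if p]
    if len(parts) < 3 or parts[0] != 'projects':
        raise ValueError('Unrecognized project URL: {}'.format(url))
    return parts[1], parts[2]

example_account_id, example_project_id = ids_from_project_url(
    'https://analytics.luminoso.com/app/#/projects/p87t862f/prkhfs6b')
print(example_account_id, example_project_id)  # p87t862f prkhfs6b
```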

In [5]:
account_id = 'c86f546w' # account id that holds the project
project_id = 'prdcj9gb' # project with all the data
project_name = 'Name_Here' # results file will include this name

In [3]:
# Get Master Data 
docs = []
client = LuminosoClient.connect('/projects/{}/{}'.format(account_id,project_id)) 
while True:
    new_docs = client.get('docs', limit=25000, offset=len(docs), doc_fields=['text',
                                                                            'date',
                                                                            'predict',
                                                                            'source',
                                                                            'title',
                                                                            'subsets',
                                                                            '_id'])
    if new_docs:
        docs.extend(new_docs)
    else:
        break
topics = client.get('topics')
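
The paging loop in the cell above follows a general pattern: request a page at the current offset and stop when an empty page comes back. A minimal sketch of that pattern as a reusable helper follows; `fetch_all` and `fetch_page` are hypothetical names, and `fetch_page` stands in for a call such as `client.get('docs', limit=..., offset=..., doc_fields=[...])`.

```python
def fetch_all(fetch_page, page_size=25000):
    """Collect every item by requesting pages until an empty page comes back."""
    items = []
    while True:
        # The offset is simply the number of items collected so far
        page = fetch_page(limit=page_size, offset=len(items))
        if not page:
            return items
        items.extend(page)

# Stand-in data source so the sketch runs without a server connection
sample = list(range(7))
fake_fetch = lambda limit, offset: sample[offset:offset + limit]
print(fetch_all(fake_fetch, page_size=3))  # [0, 1, 2, 3, 4, 5, 6]
```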

In [4]:
def wait_for_recalculation(client):
    print('Waiting for recalculation')
    counter = 0
    while True:
        time.sleep(15)
        counter += 15
        print('Been waiting {} sec.'.format(counter))
        if not client.get()['running_jobs']:
            break
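
The function above polls forever if a job never finishes. One possible variation, sketched here under the assumption that the status call returns a dict with a `running_jobs` key as in the cell above, adds a timeout. The names `wait_for_jobs`, `get_status`, `poll_seconds`, and `timeout_seconds` are illustrative, not part of `luminoso_api`.

```python
import time

def wait_for_jobs(get_status, poll_seconds=15, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until no jobs are running; raise if the timeout passes.

    get_status is any callable returning a dict with a 'running_jobs' key,
    for example `client.get` from the cell above.
    """
    waited = 0
    while waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        if not get_status()['running_jobs']:
            return waited  # total seconds spent waiting
    raise TimeoutError('Jobs still running after {} sec.'.format(waited))
```

With a real client this could be called as `wait_for_jobs(client_branch.get)`. The `sleep` parameter exists only so the delay can be replaced when experimenting.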

### Topics
If you would like to use topics from a different project, you can use this section to copy those topics into this project. To do so, change topic_project_id to the source project ID (by default it is the same as the current project_id). Then uncomment the last three lines, which post the topics from the project of your choice into the project we are running score drivers against.

In [6]:
# Pull in the topic to be measured over time from a previous project
topic_project_id = project_id
#topic_project_id = 'pr5wd46r'
client1 = LuminosoClient.connect('/projects/{}/{}'.format(account_id,topic_project_id)) #change this to be the project with the saved topics to be measured over time
topics = client1.get('topics')

#client2 = LuminosoClient.connect('/projects/{}/{}'.format(account_id,project_id))  #change this account ID to be your account
#for topic in topics:
#    client2.post('topics',text=topic['text'],name=topic['name'])

In [7]:
#check the topics to confirm 
topics

[{'vector': 'WAg9ASC8oa6a4FndCREBNr_H12fk5CEGPoOXiGdO22_FJk0D-Eka3y67rEDk5DaIB-hAV-JqVA_P-7rCCtQcZ8ra4HGEyw_Jc36SEK9CgP-U_2Ki50J9orKaTFdID67BC66OtEAMB0YFj8GjW9LhHuDERYCoT3hdER_CXlA427whDhUEze6-f0VGF6t8SM-n1DfmIDAAWh6TQK01Atk4m_-C7_hDAfN_19EU3IhbDNWCGXBq17cw-088SW-z156C7dn90qBJL9jTENE-l7-kvASe-h6BDlAwdCqj8wXAnt_a9DBFFKk-dR_b0Fzi5ot5un9anB6HBo4C0-HwMCEv8nt-If-Sg9-19KXJYw1Rt23WBSfG9h_Ct9w1D534F1CASBZQ-5X7jr_8iDEt-t7GZW5Px7AgBBc_HQAaY9_7KcpCW4JZOIVG362En2Cy9BcqCkS',
  'color': '#808080',
  '_id': '934f2323-9667-4017-b9a5-709b0891f94f',
  'name': 'reception',
  'text': 'reception'},
 {'vector': 'WBmPEa3648BfuDHYDFiEm49_I7w08p4KeJUh2D5R7nFHWB5-UH1e5tJJUw5sFAMUDnbAF0DyVCX1DUtE5lDSJPQT7cT7SrCPe4KBJLD_ngBXY0Mo4cG_RwEmF_4K89eAqK-o46vmH5k4kdFBmCrKAA7COJF7C6JHFC-JBbAok5RS87-BqdALzE453zfBXKGD5-HQBmH5kL05mLZECSz2W_EB0GMC5fIEPSFhQLxW9mR-W79HIFoEATh_zB7HB95f7K3A2r-xz6qUC4a7_iAhvFka_qpB9z-OoHk2Aas_th-_pDpaFeuFlR-wWDyiAIr-ir8tq-xF62g65K9MQ_Z257j7u1DoO74nCgIGkdAWS8vvD-uDIfCWHLWbEe9-1qBm8_xX77OAy-Ap-CIwDa

## Date window
Set the end_date here for the final week you would like to calculate.

Given that final week 'end_date', the starting date is set 'weeks_to_process' weeks in the past and processing starts there. Because each week processes 'rolling_weeks' worth of data leading up to and including that week, the first week processed falls 'rolling_weeks' weeks after that starting date.

For instance, if you choose May 25, 2018 as the end date, set 'weeks_to_process' to 52, and set 'rolling_weeks' to 16, the starting date will be May 26, 2017, but the first week that gets processed will be May 26, 2017 plus 16 weeks, which is September 15, 2017. There will be 52 - 16 = 36 weekly result sets. Each result set is calculated from the 'rolling_weeks' (in this case 16) weeks of data prior to the week being calculated.

If you have more sample data each week, you can reduce 'rolling_weeks'; for sparse data sets, increase 'rolling_weeks' so there is enough language to process.
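
The window arithmetic can be checked with a short sketch. It mirrors the loop used later in this notebook but works on calendar dates instead of Unix timestamps, so it is not affected by daylight-saving shifts. `rolling_windows` is a hypothetical helper written for illustration only.

```python
from datetime import date, timedelta

def rolling_windows(end_date, weeks_to_process, rolling_weeks):
    """Return (window_start, window_end) date pairs, one per weekly result set."""
    first_start = end_date - timedelta(weeks=weeks_to_process)
    windows = []
    for week in range(weeks_to_process - rolling_weeks):
        # Each window slides forward by one week and spans rolling_weeks weeks
        window_start = first_start + timedelta(weeks=week)
        window_end = window_start + timedelta(weeks=rolling_weeks)
        windows.append((window_start, window_end))
    return windows

windows = rolling_windows(date(2018, 5, 25), weeks_to_process=52, rolling_weeks=16)
print(len(windows))  # 36
print(windows[0])    # (datetime.date(2017, 5, 26), datetime.date(2017, 9, 15))
print(windows[-1])   # (datetime.date(2018, 1, 26), datetime.date(2018, 5, 18))
```

Note that with this arithmetic the last window ends one week before the chosen end date (May 18 rather than May 25, 2018), which matches how the processing loop in this notebook advances its dates.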



In [8]:
end_date='2018-06-27'
date_format='%Y-%m-%d'
end_time_final = int(time.mktime(time.strptime(end_date,date_format)))
weeks_to_process = 17 # will start 16 (rolling_weeks) weeks into this number so there is enough data for each week
rolling_weeks = 16 

In [9]:
# Sort by date, split by week, run ScoreDrivers
docs = sorted(docs, key=lambda k: k['date'])
idx = 0

end_index = weeks_to_process - rolling_weeks
week_ids = {}
predictors = []
deep_drivers = []

# set the initial end_time which will be 16 weeks back
#end_time = start_time
start_time = end_time_final - (60*60*24*7*weeks_to_process)
end_time = start_time + (60*60*24*7*rolling_weeks)

fieldnames=['doc_count','term','text','vector','regressor_dot','driver_score','similar_terms','related_terms',
           'week','predictor', 'date']
print("opening output file")
with open('ScoreDriversOverTime{}_results.csv'.format(project_name), 'a', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    while idx<end_index:
        print("start date: {} ({})".format(datetime.datetime.fromtimestamp(start_time).strftime('%Y-%m-%d'),str(start_time)))
        print("end date: {} ({})".format(datetime.datetime.fromtimestamp(end_time).strftime('%Y-%m-%d'),str(end_time)))

        doc_ids = [d['_id'] for d in docs if d['date'] >= start_time and d['date'] < end_time]
        subsets = list({s for d in docs for s in d['subsets']})
        
        print("starting branch size="+str(len(doc_ids)))
        branch_results = client.post('project/branch/',ids=doc_ids)
        print("finished branch")
        client_branch = LuminosoClient.connect(branch_results['path'])
        
        #client_branch.post('docs/recalculate')
        wait_for_recalculation(client_branch)
        client_branch.post('prediction/train')
        wait_for_recalculation(client_branch)

        trained_regressors = client_branch.upload('prediction',[{'text':'this is a test'}])[0]
        predictors = list(trained_regressors.keys())
        predictors = list(set(predictors))

        print('Dumping predictor results into file')
        #print('  Predictors: {}'.format(str(predictors)))
        for predictor in predictors:
            drivers = client_branch.put('prediction/drivers', predictor_name=predictor)
            for driver in drivers:
                # ADDED RELATED TERMS
                driver['predictor'] = predictor
                driver['week'] = idx
                driver['date'] = end_time
                doc_count = client_branch.get('terms/doc_counts', terms=driver['terms'], use_json=True)
                count_sum = 0
                for doc_dict in doc_count:
                    count_sum += (doc_dict['num_exact_matches'] + doc_dict['num_related_matches'])
                driver['doc_count'] = count_sum
                if idx == 52:
                    print(driver)

            writer.writerows([{k:v for k,v in d.items() if k in fieldnames} for d in drivers])
            #print('Dumped results to file. Predictor: {}'.format(predictor))
        
        # delete the project
        
        end_time = end_time + 60*60*24*7
        start_time = end_time - (60*60*24*7*rolling_weeks)

        
        idx += 1
print("DONE")

opening output file
start date: 2018-02-27 (1519801200)
end date: 2018-06-20 (1519801200)
starting branch size=62828
finished branch
Waiting for recalculation
Been waiting 15 sec.
Been waiting 30 sec.
Been waiting 45 sec.
Been waiting 60 sec.
Been waiting 75 sec.
Been waiting 90 sec.
Been waiting 105 sec.
Been waiting 120 sec.
Been waiting 135 sec.
Been waiting 150 sec.
Waiting for recalculation
Been waiting 15 sec.
Dumping predictor results into file
DONE
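
The `writerows` call in the loop above uses a dictionary comprehension to drop keys that are not in `fieldnames`. As an alternative sketch, `csv.DictWriter` can ignore extra keys itself through its `extrasaction='ignore'` argument. The row below is made-up illustrative data, not output from a real project.

```python
import csv
import io

fieldnames = ['term', 'driver_score', 'week']
# 'unused_key' is not in fieldnames, so it is silently skipped
rows = [{'term': 'reception', 'driver_score': 0.5, 'week': 0, 'unused_key': 'dropped'}]

buffer = io.StringIO()  # stands in for the results file
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```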


In [10]:
# delete the project
delete_result = client_branch.delete()
delete_result

'Deleted.'