# How to download Google Correlate Data using argotools.dbscrape

This is a brief example of how to download data from Google Correlate (GC) using dbscrape.


In order to successfully download data from GC you'll need:

1.- A Gmail account, since you'll need your log-in credentials to download the data from GC (I usually use a separate account I created just for this purpose).

2.- chromedriver, the executable selenium uses to run an automated Chrome browser.

NOTE: Please make sure you have downloaded a webdriver to use with selenium, specifically Chrome's: everything I've worked on uses chromedriver. To learn more about selenium, please visit its official documentation.
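Before starting a session it can save time to confirm that the chromedriver path really points at an executable file. Here is a minimal sketch using only the standard library; check_webdriver is a hypothetical helper of mine, not part of dbscrape or selenium.

```python
import os


def check_webdriver(path):
    """Return the absolute path to the driver if it exists and is executable.

    Hypothetical helper for illustration only; not part of dbscrape.
    """
    full_path = os.path.abspath(path)
    if not os.path.isfile(full_path):
        raise FileNotFoundError("No file found at {0}".format(full_path))
    # On Windows os.X_OK is effectively always true, so this mainly helps on Linux/macOS.
    if not os.access(full_path, os.X_OK):
        raise PermissionError("{0} is not executable".format(full_path))
    return full_path
```

You could then pass the returned value as webdriver_path, for example wb_path = check_webdriver('absolute_path_to/chromedriver').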



In [1]:
from argotools.dbscrape import GC
import pandas as pd
import time


wb_path = 'absolute_path_to/chromedriver'
path_to_csv = 'absolute_path_to/MX.csv'


We first have to log in. This is how you do it:


In [2]:
#state_names = list(df)
session = GC(webdriver_path=wb_path, download_folder=None)
session.login(user='myemail@gmail.com', password='mypass')


Thank you for using DBscrape for google correlate! please use our               
 read me to see a simple tutorial on how to use it. GCpy opens a              
 web browser using both Selenium and chrome webdriver libraries. 


Succesfully initialized  web browser.
Logged onto Gmail account.


When initializing the session object, you'll see two different inputs: webdriver_path and download_folder.
webdriver_path is the string containing the path to the webdriver (it is required!). download_folder is an optional input where you can give the path of the folder your downloads are automatically directed to.

download_folder is recommended if you're downloading more than one file: files coming from GC's website usually arrive with the same name, and the class already contains code to rename them and avoid confusion.
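If you leave download_folder unset, you can still tidy up by hand afterwards. Here is a minimal sketch, assuming the file names this tutorial describes: a download arrives as "correlate-<timeseries_name>.csv" and the renamed form is "<country>-<timeseries_name>.csv". collect_download is a hypothetical helper of mine and does not reproduce dbscrape's own renaming code.

```python
import shutil
from pathlib import Path


def collect_download(downloads_dir, target_dir, timeseries_name, country):
    """Move correlate-<timeseries_name>.csv to target_dir/<country>-<timeseries_name>.csv.

    Hypothetical helper; the naming pattern is taken from this tutorial's description.
    """
    source = Path(downloads_dir) / "correlate-{0}.csv".format(timeseries_name)
    if not source.is_file():
        raise FileNotFoundError("No download found at {0}".format(source))
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "{0}-{1}.csv".format(country, timeseries_name)
    # Refuse to overwrite so two downloads never silently replace each other.
    if target.exists():
        raise FileExistsError("{0} already exists".format(target))
    shutil.move(str(source), str(target))
    return target
```

Refusing to overwrite is a deliberate choice here, since the whole point is to keep files from different downloads apart.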

The class prints a message telling you whether or not you were able to log in.

After successfully logging in, you can download data in two ways: either query one search term or upload your own data. In this example, I show both.

### Correlating from a CSV
To get data by inputting your own values in a csv, you can use the "correlate_from_csv" method.

You'll need the following: a csv file containing the values you want to correlate, and the country where you want to correlate the data (it must be available in Google Correlate!). Here I use some influenza data from Flunet for Mexico in csv form (see the file for the format of the data) and pass the file path to the correlate_from_csv method.

timeseries_name is an identifier. If you didn't set download_folder, I suggest you use a meaningful identifier that lets you tell the file apart from the others. When the download finishes, you'll see the file is named "correlate-<timeseries_name>.csv". If you chose to set download_folder, the file will be renamed "<country>-<timeseries_name>.csv".

NOTE: Google Correlate is really picky when reading data from a CSV, so please make sure the csv you use follows the correct format.

NOTE2: To input your CSV data correctly, you MUST give an absolute path to the file. Using '.' or '..' does not work because Google Correlate does not recognize your working directory.
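Both notes can be checked locally before you upload anything. The sketch below assumes the format is the one shown in the example data of this tutorial (no header, a YYYY-MM-DD date in the first column and a number in the second). Google Correlate's actual requirements may be stricter, and check_timeseries_csv is a hypothetical helper of mine, not part of dbscrape.

```python
import csv
import os
from datetime import datetime


def check_timeseries_csv(path):
    """Return the row count if every row looks like 'YYYY-MM-DD,<number>'.

    Hypothetical helper; the expected layout is an assumption based on the
    example file printed in this tutorial.
    """
    if not os.path.isabs(path):
        raise ValueError("Use an absolute path, e.g. os.path.abspath(path)")
    n_rows = 0
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle), start=1):
            if len(row) != 2:
                raise ValueError(
                    "line {0}: expected 2 columns, found {1}".format(line_no, len(row)))
            date_text, value_text = row
            try:
                datetime.strptime(date_text.strip(), "%Y-%m-%d")
                float(value_text)
            except ValueError as err:
                raise ValueError("line {0}: {1}".format(line_no, err)) from None
            n_rows += 1
    return n_rows
```

If you only have a relative path, os.path.abspath('MX.csv') turns it into an absolute one based on your current working directory.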


In [3]:
example_data = pd.read_csv(path_to_csv, header=None)
print(example_data)
session.correlate_from_csv(path_to_csv=path_to_csv, timeseries_name='GC', country='MX', verbose=False)


              0     1
0    2010-01-04  1225
1    2010-01-11  1168
2    2010-01-18  1368
3    2010-01-25   970
4    2010-02-01  1139
5    2010-02-08  1211
6    2010-02-15  1274
7    2010-02-22  1184
8    2010-03-01  1363
9    2010-03-08  1017
10   2010-03-15   781
11   2010-03-22   573
12   2010-03-29   443
13   2010-04-05   371
14   2010-04-12   427
15   2010-04-19   316
16   2010-04-26   252
17   2010-05-03   192
18   2010-05-10   164
19   2010-05-17   198
20   2010-05-24   201
21   2010-05-31   201
22   2010-06-07   142
23   2010-06-14   142
24   2010-06-21   138
25   2010-06-28   107
26   2010-07-05   105
27   2010-07-12    87
28   2010-07-19   104
29   2010-07-26   110
..          ...   ...
179  2013-06-10   158
180  2013-06-17   170
181  2013-06-24   163
182  2013-07-01   215
183  2013-07-08   184
184  2013-07-15   215
185  2013-07-22   193
186  2013-07-29   188
187  2013-08-05   217
188  2013-08-12   204
189  2013-08-19   196
190  2013-08-26   224
191  2013-09-02   290
192  2013-

    
### Correlating a search term

To correlate a search term, call the "correlate_term" method. Just pass your word to the method and select the country where you want to look for similar search-term activity.

The method also lets you specify the time period over which you'd like to search for correlations with your term (sdate / edate). Here, I look for the word influenza and its activity between January 2004 and December 2013.

Google Correlate should output a csv file with the influenza search-term activity and the top correlated search terms (the correlated terms are not restricted to the time period).
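If you want several terms, a small loop around correlate_term keeps things tidy. This is only a sketch: the keyword arguments are copied from the call used in this tutorial, while correlate_terms and the pause between requests are my own assumptions, not something dbscrape or Google Correlate documents as required.

```python
import time


def correlate_terms(session, terms, country, sdate, edate, pause_seconds=5):
    """Call session.correlate_term once per term and return the terms requested.

    Hypothetical helper; `session` is expected to be a logged-in GC object.
    """
    requested = []
    for position, term in enumerate(terms):
        # Wait between requests (an assumption, to avoid hammering the site).
        if position > 0:
            time.sleep(pause_seconds)
        session.correlate_term(search_term=term, country=country,
                               verbose=False, sdate=sdate, edate=edate)
        requested.append(term)
    return requested
```

With the session from above, this could be called as correlate_terms(session, ['influenza', 'gripe'], 'MX', '2004-01-04', '2013-12-29'), where 'gripe' is just an illustrative second term.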


In [4]:
sdate = '2004-01-04'
edate = '2013-12-29'

session.correlate_term(search_term='influenza', country='MX', verbose=False, sdate=sdate, edate=edate)


session.close_browser()

                           If you're downloading more than one term it might become lost
Successfully downloaded data for influenza in Mexico


All the files should have been downloaded to your default downloads folder. Let's take a look at them! I only show the first rows and columns to avoid clutter.

In [11]:
csv_correlated_file = pd.read_csv('path_to/correlate-GC.csv', skiprows=10, index_col=0)
term_correlated_file = pd.read_csv('path_to/correlate-influenza.csv', skiprows=10, index_col=0)

print('This is the file you correlated with CSV data: \n {0}'.format(csv_correlated_file.iloc[0:10,0:3]))
print('\n This is the file you correlated using a search term \n {0}'.format(term_correlated_file.iloc[0:10,0:3]))

This is the file you correlated with CSV data: 
             GC  influenza sintomas  sintomas de la influenza
Date                                                        
2004-01-04 NaN              -0.216                     -0.27
2004-01-11 NaN              -0.216                     -0.27
2004-01-18 NaN              -0.216                     -0.27
2004-01-25 NaN              -0.216                     -0.27
2004-02-01 NaN              -0.216                     -0.27
2004-02-08 NaN              -0.216                     -0.27
2004-02-15 NaN              -0.216                     -0.27
2004-02-22 NaN              -0.216                     -0.27
2004-02-29 NaN              -0.216                     -0.27
2004-03-07 NaN              -0.216                     -0.27

 This is the file you correlated using a search term 
             influenza  virus influenza  virus de influenza
Date                                                      
2004-01-04     -0.097           -0.098       

And that's it! Please send me any feedback or ideas for things you'd like to see automated, or tell me if you'd like to contribute to expanding dbscrape. Send an e-mail to leon@clemente.tech