# Google Trends Scraper

Uses the `pytrends` Python package to scrape Google Trends data.

In [1]:
#install python package
!pip install pytrends

Collecting pytrends
  Downloading pytrends-4.7.3-py3-none-any.whl (14 kB)
Collecting lxml
  Downloading lxml-4.5.2-cp37-cp37m-manylinux1_x86_64.whl (5.5 MB)
Installing collected packages: lxml, pytrends
Successfully installed lxml-4.5.2 pytrends-4.7.3
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.


In [2]:
#imports
import pandas as pd
from pytrends.request import TrendReq

In [21]:
#create pytrends object
pytrends = TrendReq(hl='en-US', tz=360)

In [13]:
#create list of keywords to return google trends data for
keywords = ['depression', 'anxiety', 'government', 'politics', 'democracy']

The code below:

- sets a date range to loop through
- requests a one-week timeframe from pytrends for the search terms
- appends the result to a CSV file
- advances the start date by 5 days and repeats

Google Trends only provides hourly data when the requested time range is 7 days or less. Google reports search volume relative to the time range searched, so you can't directly compare trend data from two different time ranges. However, you can control for this by overlapping time periods. Each loop returns 7 days of data but advances the start date by only 5 days, so consecutive requests overlap by two days. This worked fairly well, and where it didn't, I calculated the mean for any duplicate time periods.
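
The overlap arithmetic and the averaging of duplicate timestamps can be sketched as follows. This is a minimal illustration, not the exact code used for the project: the names `make_windows`, `WINDOW`, `STEP`, `combined`, and `deduped` are hypothetical, and the sample values are made up rather than real Trends data.

```python
import datetime

import pandas as pd

WINDOW = datetime.timedelta(days=7)  # span of each request
STEP = datetime.timedelta(days=5)    # how far the start date moves each loop


def make_windows(start, end):
    """Return (window_start, window_end) pairs while window_start <= end."""
    windows = []
    current = start
    while current <= end:
        windows.append((current, current + WINDOW))
        current += STEP
    return windows


windows = make_windows(datetime.date(2015, 1, 1), datetime.date(2015, 1, 11))

# Overlap between consecutive windows = end of the first minus start of the second.
overlap = windows[0][1] - windows[1][0]

# Made-up hourly readings: the first timestamp appears twice, as it would
# when two overlapping windows both cover it.
combined = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-01-06 00:00", "2015-01-06 00:00", "2015-01-06 01:00"]
    ),
    "depression": [20, 24, 18],
})

# Collapse duplicate timestamps to the mean of their readings.
deduped = combined.groupby("date", as_index=False).mean()
```

Averaging treats the two windows' scales as roughly equal in the overlap, which is an assumption that only holds when the duplicated readings are already close.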

Because of Google's rate limiting, I had to "take breaks".
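
One way to automate those breaks is a small retry helper that sleeps between attempts. This is a hedged sketch, not part of the original notebook: `fetch_with_pauses` and its parameters are hypothetical, and it assumes the wrapped call raises an exception when a request fails.

```python
import time


def fetch_with_pauses(fetch, max_attempts=3, pause_seconds=60):
    """Call fetch(); if it raises, sleep and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # broad on purpose for this sketch
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            print(f"attempt {attempt} failed ({error}); pausing {pause_seconds}s")
            time.sleep(pause_seconds)
```

The request for one window could then be passed in as a zero-argument callable, for example a `lambda` wrapping the `pytrends` call. If the library prints an error instead of raising it, this helper will not see the failure, so checking the returned frame for missing hours would still be needed.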

In [22]:
import datetime

#set start date
start_date = datetime.date(2015, 1, 1)
#set end date
end_date = datetime.date(2018, 1, 1)

#sets time delta for incrementing while loop
delta = datetime.timedelta(days=5)

#sets time delta for defining range passed to Google Trends
delta_2 = datetime.timedelta(days=7)

#sets while loop and assigns range to pytrends
while start_date <= end_date:
    sd = start_date
    ed = sd + delta_2
    df = pytrends.get_historical_interest(keywords=keywords,
                                          year_start=sd.year,
                                          month_start=sd.month,
                                          day_start=sd.day,
                                          hour_start=0,
                                          year_end=ed.year,
                                          month_end=ed.month,
                                          day_end=ed.day,
                                          hour_end=0,
                                          geo='US')
    #appends week of data to file (a header row is written on every append)
    df.to_csv('gtrend_2018_2020.csv', mode='a')
    print(f'file writen for {sd} through {ed}')

    #increments five days
    start_date += delta

file writen for 2015-01-01 through 2015-01-08
file writen for 2015-01-06 through 2015-01-13
HTTPSConnectionPool(host='trends.google.com', port=443): Read timed out. (read timeout=5)
file writen for 2015-01-11 through 2015-01-18
file writen for 2015-01-16 through 2015-01-23
file writen for 2015-01-21 through 2015-01-28
file writen for 2015-01-26 through 2015-02-02
file writen for 2015-01-31 through 2015-02-07
file writen for 2015-02-05 through 2015-02-12
file writen for 2015-02-10 through 2015-02-17
HTTPSConnectionPool(host='trends.google.com', port=443): Read timed out. (read timeout=5)
file writen for 2015-02-15 through 2015-02-22
file writen for 2015-02-20 through 2015-02-27
HTTPSConnectionPool(host='trends.google.com', port=443): Read timed out. (read timeout=5)
file writen for 2015-02-25 through 2015-03-04
HTTPSConnectionPool(host='trends.google.com', port=443): Read timed out. (read timeout=5)
file writen for 2015-03-02 through 2015-03-09
file writen for 2015-03-07 through 2015-03

In [15]:
df = pd.read_csv('/floyd/home/Capstone/cap_notebooks/notebooks/Scappers/g_trend_test_2.csv')

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1367 entries, 0 to 1366
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        1367 non-null   object
 1   depression  1367 non-null   object
 2   anxiety     1367 non-null   object
 3   government  1367 non-null   object
 4   politics    1367 non-null   object
 5   democracy   1367 non-null   object
 6   isPartial   1367 non-null   object
dtypes: object(7)
memory usage: 74.9+ KB


In [23]:
df = pd.read_csv('/floyd/home/Capstone/cap_notebooks/notebooks/Scappers/gtrend_2018_2020.csv')

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70840 entries, 0 to 70839
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        70838 non-null  object
 1   depression  70838 non-null  object
 2   anxiety     70838 non-null  object
 3   government  70838 non-null  object
 4   politics    70838 non-null  object
 5   democracy   70838 non-null  object
 6   isPartial   70838 non-null  object
dtypes: object(7)
memory usage: 3.8+ MB


In [39]:
df = df.drop(df[df['date'] == 'date'].index)

In [41]:
df = df.sort_values('date')
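
Because every append wrote its own header row, all columns load as `object`, as the `df.info()` output above shows. A minimal sketch of restoring proper dtypes after dropping those rows, using a small in-memory stand-in for the file (the `raw` and `sample` names and the sample rows are hypothetical):

```python
import io

import pandas as pd

# Tiny stand-in for the appended CSV: the header row repeats mid-file.
raw = io.StringIO(
    "date,depression,anxiety,isPartial\n"
    "2015-01-01 01:00:00,18,18,False\n"
    "date,depression,anxiety,isPartial\n"
    "2015-01-01 00:00:00,17,18,False\n"
)
sample = pd.read_csv(raw)

# Drop the repeated header rows, as in the cell above.
sample = sample[sample["date"] != "date"].copy()

# Convert the string columns back to datetime and numeric dtypes.
sample["date"] = pd.to_datetime(sample["date"])
for column in ["depression", "anxiety"]:
    sample[column] = pd.to_numeric(sample[column])

# Sort chronologically now that dates are real datetimes.
sample = sample.sort_values("date").reset_index(drop=True)
```

With numeric columns in place, operations such as averaging duplicate timestamps work on numbers rather than strings.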

In [47]:
df.head(30)

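|       | date                | depression | anxiety | government | politics | democracy | isPartial |
|-------|---------------------|------------|---------|------------|----------|-----------|-----------|
| 33754 | 2015-01-01 00:00:00 | 17         | 18      | 20         | 14       | 2         | False     |
| 33755 | 2015-01-01 01:00:00 | 18         | 18      | 22         | 13       | 2         | False     |
| 33756 | 2015-01-01 02:00:00 | 18         | 19      | 20         | 11       | 1         | False     |
| 33757 | 2015-01-01 03:00:00 | 18         | 18      | 19         | 10       | 1         | False     |
| 33758 | 2015-01-01 04:00:00 | 17         | 18      | 55         | 10       | 1         | False     |
| 33759 | 2015-01-01 05:00:00 | 18         | 17      | 18         | 9        | 1         | False     |
| 33760 | 2015-01-01 06:00:00 | 25         | 20      | 19         | 10       | 2         | False     |
| 33761 | 2015-01-01 07:00:00 | 23         | 21      | 18         | 11       | 2         | False     |
| 33762 | 2015-01-01 08:00:00 | 24         | 27      | 22         | 11       | 1         | False     |
| 33763 | 2015-01-01 09:00:00 | 24         | 25      | 27         | 13       | 1         | False     |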


In [48]:
df.duplicated()

33754    False
33755    False
33756    False
33757    False
33758    False
         ...  
33750    False
33751    False
33752    False
18116    False
51874     True
Length: 70419, dtype: bool