# Part 1: Correlation of Future Orientation Index and Gross Domestic Product
## Tasks

In this exercise, we try to reproduce the findings of the article “Quantifying the Advantage of Looking Forward” http://www.nature.com/articles/srep00350.

According to the study, a country's GDP per capita is positively correlated with how much its population searches Google for the next year, relative to how much it searches for the previous year.

This ratio is called the Future Orientation Index (FOI). For example, for the year 2017 the FOI is calculated as: FOI = number of searches for the term “2018” / number of searches for the term “2016”.
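
As a quick illustration of this ratio, here is a minimal sketch. The function name and the search volumes are made up for the example and are not real Google data.

```python
def future_orientation_index(searches_next_year: float,
                             searches_previous_year: float) -> float:
    """Ratio of the search volume for next year to that for the previous year."""
    if searches_previous_year == 0:
        raise ValueError("search volume for the previous year must be non-zero")
    return searches_next_year / searches_previous_year


# Hypothetical volumes observed in 2017: searches for "2018" and for "2016"
foi_2017 = future_orientation_index(searches_next_year=60, searches_previous_year=40)
print(foi_2017)  # 1.5
```

A value above 1 means the next year is searched more often than the previous one.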

You will do the following tasks:
1. Acquire World Bank Data
2. Calculate the Future Orientation Index in Google Trends
3. Test the correlation between GDP and FOI

### Install requirements

The following cell installs all dependencies needed for this task.
* [`wbgapi`](https://github.com/tgherzog/wbgapi) is a Python package which provides modern, pythonic access to the World Bank's data API. [Here](https://github.com/tgherzog/wbgapi) is the documentation of `wbgapi`.
* [`pandas`](https://pandas.pydata.org/docs/index.html) is a Python package for creating and working with tabular data. [Here](https://pandas.pydata.org/docs/reference/index.html) is the documentation of `pandas`.
* [`matplotlib`](https://matplotlib.org/) is a Python package for creating plots. [Here](https://matplotlib.org/stable/api/index.html) is the documentation of `matplotlib`.
* [`scipy`](https://scipy.org/) is a Python package with different algorithms for scientific computing. [Here](https://docs.scipy.org/doc/scipy/reference/index.html#scipy-api) is the documentation of `scipy`.

In [None]:
! pip install wbgapi
! pip install pandas
! pip install matplotlib
! pip install scipy

### Import requirements
The cell below imports all necessary dependencies. Make sure they are installed (see the cell above).

In [1]:
import wbgapi as wb
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# 1 World Bank Data
## 1.1 Download WDI data

From the WDI we need three indicators:
* Gross Domestic Product (GDP) per capita corrected by the Purchasing Power Parity (PPP in current or 2005 international $, `"NY.GDP.PCAP.PP.KD"`)
* The number of Internet users (per 100 people, `"IT.NET.USER.ZS"`)
* The total population (described as "Population, Total", `"SP.POP.TOTL"`)

In the following code chunk, download all data (including extras) for all countries in year 2014 and save it as a pandas data frame. See [here](https://github.com/tgherzog/wbgapi#accessing-data) how to use the `data` subpackage of `wbgapi`.

Hint: To remove aggregates (economic regions defined by the World Bank) and include only countries, use `skipAggs=True`.

In [7]:
import wbgapi as wb

# List of indicators
indicators = ['NY.GDP.PCAP.PP.KD', 'IT.NET.USER.ZS', 'SP.POP.TOTL']

try:
    # Fetch data for all countries in 2014
    data = wb.data.DataFrame(indicators, time=2014,
                             skipAggs=True, columns='series')

    # Reset index for easier readability
    data.reset_index(inplace=True)

    print(data)
except wb.APIError as e:
    print(f"API error: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

    economy  IT.NET.USER.ZS  NY.GDP.PCAP.PP.KD  SP.POP.TOTL
0       ABW         83.7800       37433.084090     103594.0
1       AFG          7.0000        3024.982120   32716210.0
2       AGO         21.3623       10262.847015   27128337.0
3       ALB         54.3000       12909.240795    2889104.0
4       AND         86.1000       61700.237094      71621.0
..      ...             ...                ...          ...
212     XKX             NaN        9150.835788    1812771.0
213     YEM         22.5500                NaN   27753304.0
214     ZAF         49.0000       14869.047901   54729551.0
215     ZMB          6.5000        3621.466084   15737793.0
216     ZWE         16.3647        3588.862715   13855753.0

[217 rows x 4 columns]


Now drop every row that contains a `NaN`. For this you can use the `pandas` [`dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method.

In [19]:
import wbgapi as wb

# List of indicators
indicators = ['NY.GDP.PCAP.PP.KD', 'IT.NET.USER.ZS', 'SP.POP.TOTL']

try:
    # Fetch data for all countries in 2014
    data = wb.data.DataFrame(indicators, time=2014,
                             skipAggs=True, columns='series')

    # Reset index for easier readability
    data.reset_index(inplace=True)

    # Drop rows with any NaN values
    data.dropna(inplace=True)

    # Rename columns for clarity
    data.rename(columns={
        'SP.POP.TOTL': 'Population, Total',
        'IT.NET.USER.ZS': 'Internet users',
        'NY.GDP.PCAP.PP.KD': 'Gross Domestic Product (GDP) per capita, PPP'
    }, inplace=True)

    print(data)
except wb.APIError as e:
    print(f"API error: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

    economy  Internet users  Gross Domestic Product (GDP) per capita, PPP  \
0       ABW         83.7800                                  37433.084090   
1       AFG          7.0000                                   3024.982120   
2       AGO         21.3623                                  10262.847015   
3       ALB         54.3000                                  12909.240795   
4       AND         86.1000                                  61700.237094   
..      ...             ...                                           ...   
210     VUT         18.8000                                   3202.417111   
211     WSM         21.2000                                   6196.958924   
214     ZAF         49.0000                                  14869.047901   
215     ZMB          6.5000                                   3621.466084   
216     ZWE         16.3647                                   3588.862715   

     Population, Total  
0             103594.0  
1           32716210.0  


Next, keep only the rows with at least 5 million Internet users. Keep in mind that the Internet users indicator is given per 100 people, so you need to take the population into account.

For example, in the dataset Austria has 80.995825 Internet users per 100 people and a population of 8546356. This means Austria has 6922191.55 Internet users in total. The calculation is as follows:
$$
internet\_users = population \cdot \frac{internet\_user\_per\_100}{100}
$$


In [None]:
import wbgapi as wb
import pandas as pd

# List of indicators
indicators = ['NY.GDP.PCAP.PP.KD', 'IT.NET.USER.ZS', 'SP.POP.TOTL']

try:
    # Fetch data for all countries in 2014
    data = wb.data.DataFrame(indicators, time=2014,
                             skipAggs=True, columns='series')

    # Reset index for easier readability
    data.reset_index(inplace=True)

    # Drop rows with any NaN values
    data.dropna(inplace=True)

    # Rename columns for clarity
    data.rename(columns={
        'SP.POP.TOTL': 'Population, Total',
        'IT.NET.USER.ZS': 'Internet users per 100',
        'NY.GDP.PCAP.PP.KD': 'Gross Domestic Product (GDP) per capita, PPP'
    }, inplace=True)

    # Calculate the total internet users
    data['Internet users, Total'] = data['Population, Total'] * (data['Internet users per 100'] / 100)

    # Filter to keep only rows where the total internet users are at least 5 million
    data = data[data['Internet users, Total'] >= 5_000_000]

    print(data)
except wb.APIError as e:
    print(f"API error: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

    economy  Internet users per 100  \
2       AGO                 21.3623   
5       ARE                 90.4000   
6       ARG                 64.7000   
10      AUS                 84.0000   
11      AUT                 80.9958   
..      ...                     ...   
201     UKR                 46.2360   
203     USA                 73.0000   
204     UZB                 35.5000   
209     VNM                 41.0000   
214     ZAF                 49.0000   

     Gross Domestic Product (GDP) per capita, PPP  Population, Total  \
2                                    10262.847015         27128337.0   
5                                    63309.310961          8835951.0   
6                                    28442.248189         42669500.0   
10                                   54610.393806         23475686.0   
11                                   62378.971893          8546356.0   
..                                            ...                ...   
201                        

# 2 The Future Orientation Index in Google Trends
## 2.1 Download data from Google Trends

You can download the data from Google Trends following these steps:

1) Log out of your Google account or set its language to English

2) Go to trends.google.com and search for 2013 

3) Add 2015 as a search term

4) Select a custom time range: the full year 2014

5) Set the region to “Worldwide”. You can also use this link, which opens the Google Trends page with all of the settings above applied: https://trends.google.com/trends/explore?date=2014-01-01%202014-12-31&q=2013,2015

6) Go to the map at “Compared breakdown by region” and tick “include low search volume regions”

7) On the top right menu click the download button to get a geoMap.csv file

If you have problems getting the file from the web interface, we also included it in the GitHub repository.

Load the .csv file into a pandas data frame. Note that the first three lines of the file contain only metadata (the third line is the original header). You can skip these lines by using `skiprows=3` in `pd.read_csv`. Set the headers to `"Country", "G2013", "G2015"` using the keyword argument `names` of `pd.read_csv`.

Now remove again all rows containing `NaN`.

All percentage values are stored as strings containing the `%` symbol. You can remove it with the `pandas` [`str.replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html) method and convert the values to integers with the `pandas` [`astype`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) method. Do this for the columns `G2013` and `G2015`.
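
The loading and cleaning steps might look roughly like the sketch below. It reads from an in-memory string instead of the real `geoMap.csv`. The file content, including the exact header text and the country values, is a made-up stand-in for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for geoMap.csv: an info line, a blank line, then the
# original header, followed by one row per country (values are invented).
raw_csv = io.StringIO(
    "Category: All categories\n"
    "\n"
    "Country,2013: (1/1/14 - 12/31/14),2015: (1/1/14 - 12/31/14)\n"
    "Austria,45%,55%\n"
    "Germany,40%,60%\n"
    "Saint Helena,,\n"
)

# Skip the three leading lines and supply our own column names
trends = pd.read_csv(raw_csv, skiprows=3, names=["Country", "G2013", "G2015"])

# Remove countries without search data
trends = trends.dropna()

# Strip the percent sign and convert the remaining digits to integers
for column in ["G2013", "G2015"]:
    trends[column] = trends[column].str.replace("%", "", regex=False).astype(int)

print(trends)
```

With the real file you would pass its path to `pd.read_csv` instead of `raw_csv`.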


In [None]:
# Your Code goes here!


## 2.2 Calculate the Future Orientation Index

In the following code chunk, add a new column to the Google Trends data frame containing the Future Orientation Index, which is the ratio of the 2014 search volume for “2015” to the 2014 search volume for “2013” in each country.
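
A minimal sketch of this step, assuming the cleaned data frame has the integer columns `G2013` and `G2015` (the rows below are invented):

```python
import pandas as pd

# Hypothetical cleaned Google Trends data
trends = pd.DataFrame({
    "Country": ["Austria", "Germany"],
    "G2013": [45, 40],
    "G2015": [55, 60],
})

# FOI = searches for the next year / searches for the previous year
trends["FOI"] = trends["G2015"] / trends["G2013"]
print(trends)
```

Note that a country with `G2013` equal to 0 would produce an infinite FOI, so it may be worth checking for such rows first.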

In [None]:
# Your Code goes here!


## 2.3 Merge with World Bank data

Merge the WDI and Google Trends data frames, using the name of the country. For this you can use the `pandas` [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) method.
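
The merge could be sketched as follows. The World Bank frame printed earlier only carries three-letter `economy` codes, so this example assumes a `Country` name column has already been added to it. How you obtain that column is up to you. All values are invented.

```python
import pandas as pd

# Hypothetical WDI rows that already carry a country name column
wdi = pd.DataFrame({
    "economy": ["AUT", "DEU", "ZAF"],
    "Country": ["Austria", "Germany", "South Africa"],
    "GDP": [62000.0, 60000.0, 15000.0],
})

# Hypothetical Google Trends rows with the FOI already computed
trends = pd.DataFrame({
    "Country": ["Austria", "Germany", "Kenya"],
    "FOI": [1.22, 1.50, 0.90],
})

# Keep only countries present in both data frames
merged = wdi.merge(trends, on="Country", how="inner")

# Optional check: which countries failed to match (e.g. due to spelling)?
check = wdi.merge(trends, on="Country", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(merged)
print(unmatched[["Country", "_merge"]])
```

Looking at the unmatched rows helps to spot countries that are spelled differently in the two sources.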

In [None]:
# Your Code goes here!


# 3 Testing the correlation between GDP and FOI
## 3.1 Visualize FOI vs GDP

Now that you have the FOI and the GDP per capita (PPP) for each country, you can make a scatter plot of FOI vs GDP.

For this you can use the [`scatter`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) method of `matplotlib`.
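
A small plotting sketch, assuming a merged data frame with the hypothetical columns `FOI` and `GDP` (the values are invented). It uses `Axes.scatter`, the object-oriented counterpart of `plt.scatter`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical merged data
merged = pd.DataFrame({
    "FOI": [0.8, 0.9, 1.0, 1.1, 1.2],
    "GDP": [5000, 12000, 20000, 31000, 45000],
})

fig, ax = plt.subplots()
ax.scatter(merged["FOI"], merged["GDP"])
ax.set_xlabel("Future Orientation Index")
ax.set_ylabel("GDP per capita, PPP")
ax.set_title("FOI vs GDP")
```

In a notebook the figure is displayed when the cell finishes. In a plain script you would add `plt.show()`.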

In [None]:
# Your Code goes here!


## 3.2 Measure Pearson’s correlation

In the following chunk, calculate Pearson’s correlation coefficient between GDP and FOI.

For this you can use the [`pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) method of `scipy`.
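
A minimal sketch of the call, using invented values in place of the merged data:

```python
from scipy import stats

# Hypothetical FOI and GDP values for five countries
foi = [0.8, 0.9, 1.0, 1.1, 1.2]
gdp = [5000, 12000, 20000, 31000, 45000]

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(foi, gdp)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```

The coefficient lies between -1 and 1. The p-value refers to the null hypothesis of no linear correlation.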

In [None]:
# Your Code goes here!


## 3.3 Measure correlation after shuffling

What happens if we permute the data (e.g. shuffle the FOIs) and repeat the above analysis? Do you find any difference between the two plots and two Pearson’s correlation coefficients?

For the shuffling you can use the `pandas` [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method with `frac` set to 1.

This test shows whether the observed correlation could have arisen by chance.
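
A sketch of a single shuffle, with invented data and hypothetical column names. One detail to be aware of: `sample` keeps the original index labels, and pandas aligns on the index when a Series is assigned back to a data frame, which would silently undo the shuffle. Resetting the index avoids this.

```python
import pandas as pd
from scipy import stats

# Hypothetical merged data
merged = pd.DataFrame({
    "FOI": [0.8, 0.9, 1.0, 1.1, 1.2, 1.3],
    "GDP": [5000, 12000, 20000, 31000, 45000, 52000],
})

# Shuffle the FOI values; random_state only makes the example repeatable
shuffled_foi = merged["FOI"].sample(frac=1, random_state=0).reset_index(drop=True)

r_original, _ = stats.pearsonr(merged["FOI"], merged["GDP"])
r_shuffled, _ = stats.pearsonr(shuffled_foi, merged["GDP"])
print(f"original r = {r_original:.3f}, shuffled r = {r_shuffled:.3f}")
```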

In [None]:
# Your Code goes here!


Repeat the calculation with 1000 permutations and plot the histogram of the resulting values. Add a line with the value of the correlation without permutation. Is it far or close to the permuted values?
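
The repeated permutation could be sketched like this. The data is synthetic and generated with a built-in positive relation, purely as an assumption for illustration. It uses NumPy's `permutation` instead of `sample`, which is an equivalent way of shuffling.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption): 40 "countries" with a positive relation
foi = rng.normal(1.0, 0.2, size=40)
gdp = 20_000 + 30_000 * (foi - 1.0) + rng.normal(0, 5_000, size=40)

r_observed, _ = stats.pearsonr(foi, gdp)

# Shuffle the FOI values 1000 times and record the correlation each time
r_permuted = []
for _ in range(1000):
    r, _p = stats.pearsonr(rng.permutation(foi), gdp)
    r_permuted.append(r)
r_permuted = np.array(r_permuted)

fig, ax = plt.subplots()
ax.hist(r_permuted, bins=30)
ax.axvline(r_observed, color="red", linestyle="--", label="observed correlation")
ax.set_xlabel("Pearson's r after shuffling")
ax.set_ylabel("Number of permutations")
ax.legend()

# Share of permutations at least as extreme as the observed value
p_empirical = np.mean(np.abs(r_permuted) >= abs(r_observed))
print(f"observed r = {r_observed:.3f}, empirical p = {p_empirical:.3f}")
```

If the observed value lies far outside the histogram, the correlation is unlikely to be a product of chance alone.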

In [None]:
# Your Code goes here!


# To learn more
### Check robustness
* What result do you get if you use other years? What if you choose one of the earliest years in Google Trends?
* How do the results change if you use a different threshold instead of 5 million Internet users?
    
### Test other hypotheses
* Does future orientation generate wealth? Or does wealth enable people to look more to the future?
* Is the FOI really measuring orientation to the future? Could it be something else?

# Part 2: Using Google Trends data to model Flu Trends

## Tasks

Use the [pytrends module](https://pypi.org/project/pytrends/) to get weekly Google Trends data concerning the Flu/Influenza virus from the beginning of 2014 until the end of 2018.

***Hint:*** *the pytrends module currently has a bug. If you get a `TooManyRequestsError` despite following the documentation, try following the advice outlined [here](https://github.com/GeneralMills/pytrends/issues/573#issuecomment-1501897119) or [here](https://github.com/GeneralMills/pytrends/issues/561#issuecomment-1462899426) (both solve the issue).*

### Install requirements

The following cell installs all dependencies needed for this task.
* [`pytrends`](https://pypi.org/project/pytrends/)
* `requests`
* `statsmodels`

In [None]:
! pip install -U pytrends
! pip install requests
! pip install statsmodels

# 1 Google Trends data

## 1.1 Get weekly Google Trends data concerning the Flu/Influenza virus

- Create an instance of the `TrendReq` class
- Find the appropriate query term (i.e., influenza). The TrendReq class includes a method `suggestions`, which should help you in this task (the query term can look like e.g. `/m/03x_m3v`).
- Specify the correct geographical region, the timeframe (i.e. from the beginning of 2014 until the end of 2018), and the keyword list. Use the `build_payload` method to store this information for future requests.
- Use the `interest_over_time` method to get the data.


In [None]:
# Your Code goes here!


# 2 US National data

## Get data regarding the occurrence of influenza-like illnesses in the US

In the `Excercise 1` folder you will find a file named `ILINet.csv`, which contains data regarding the occurrence of influenza-like illnesses in the US. You can also find the data and the corresponding [documentation](https://gis.cdc.gov/grasp/fluview/FluViewPhase2QuickReferenceGuide.pdf) on the CDC's [FluView interactive dashboard](https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html).
<br>
- Read the csv file, and store it as a [pandas](https://pypi.org/project/pandas/) dataframe. You might need to use the `skiprows` argument of the `read_csv` method to be able to load the data correctly.
- Select the columns named `YEAR`, `WEEK`, and `% WEIGHTED ILI` which will be needed for our analysis. Additionally, drop the rows which store observations from before 2014, or later than 2018.
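
The two steps above might be sketched as follows. The example reads an in-memory stand-in rather than the real `ILINet.csv`. The description line, the extra columns, all values, and the choice of `skiprows=1` are assumptions, so check the actual file before reusing them.

```python
import io

import pandas as pd

# Hypothetical stand-in for ILINet.csv: one description line, then the header
raw_csv = io.StringIO(
    "Hypothetical description line that precedes the header\n"
    "REGION TYPE,REGION,YEAR,WEEK,% WEIGHTED ILI,ILITOTAL\n"
    "National,X,2013,52,4.1,100\n"
    "National,X,2014,1,4.3,120\n"
    "National,X,2018,52,3.2,90\n"
    "National,X,2019,1,3.5,95\n"
)

# Skip the description line so that the second line becomes the header
ili = pd.read_csv(raw_csv, skiprows=1)

# Keep only the columns needed for the analysis
ili = ili[["YEAR", "WEEK", "% WEIGHTED ILI"]]

# Keep observations from 2014 to 2018 (both bounds included)
ili = ili[ili["YEAR"].between(2014, 2018)].reset_index(drop=True)
print(ili)
```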

In [None]:
# Your Code goes here!


# 3 Testing the correlation between flu interest and US National data

## 3.1 Visualize flu interest vs US National data

Now that you have the US National data regarding the occurrence of influenza-like illnesses in the US, you can make a scatter plot of `flu interest` vs `% WEIGHTED ILI`.

For this you can use the [`scatter`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) method of `matplotlib`.

In [None]:
# Your Code goes here!


## 3.2 Measure Pearson’s correlation

In the following chunk, calculate Pearson’s correlation coefficient between `flu interest` and `% WEIGHTED ILI`.

For this you can use the [`pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) method of `scipy`.

In [None]:
# Your Code goes here!


## 3.3 Measure correlation after shuffling

What happens if we permute the data (e.g. shuffle the `flu interest`s) and repeat the above analysis? Do you find any difference between the two plots and two Pearson’s correlation coefficients?

For the shuffling you can use the `pandas` [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method with `frac` set to 1.

This test shows whether the observed correlation could have arisen by chance.

In [None]:
# Your Code goes here!


Repeat the calculation with 1000 permutations and plot the histogram of the resulting values. Add a line with the value of the correlation without permutation. Is it far or close to the permuted values?

In [None]:
# Your Code goes here!


### To learn more

#### Prediction
* Download the Google Trends data for 2019, and use your models to predict the values of `% WEIGHTED ILI`.
* Do the models make good predictions? Which model performs better?