### This notebook receives the tags from the scraped articles and sends requests to Google Trends through the unofficial pytrends API.

#### Input -> csv files from the websites; arrays of tags are extracted

#### Output -> time series of relative popularity of the tags in January 2022 and 2023 in a csv file

In [110]:
# Install the required packages
!pip install -r requirements.txt

Collecting inflection
  Downloading inflection-0.5.1-py2.py3-none-any.whl (9.5 kB)
Installing collected packages: inflection
Successfully installed inflection-0.5.1


You should consider upgrading via the 'C:\Python39\python.exe -m pip install --upgrade pip' command.


### Consider changing the GET request to a POST request in the pytrends library
Google has recently made this harder for us by changing its terms of service. In practice, the API now returns 429 (Too Many Requests) responses far more often when the website considers the user a scraper.

We changed a GET request to a POST request based on advice from StackOverflow. Credits: https://stackoverflow.com/questions/75744524/pytends-api-throwing-429-error-even-if-the-request-was-made-very-first-time?fbclid=IwAR2j07FLpeXFLQcj1fFiPQyU19xrY0lFr5RuRjYLXT9p8LyQkUGzRxBOcrU

In [111]:
# Import libraries
import os
import time
from pytrends.request import TrendReq
import pandas as pd
import random
import statistics
import csv
import ast
import inflection
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
from sqlite3 import connect
conn = connect(':memory:')

### First, we must determine an appropriate pre-defined element
The pytrends library is itself just a scraper, and Google Trends limits each comparison to 5 elements. However, the scores are relative to the other elements in the same request. With some (relatively) smart math, we can get results for all the tags. The quality of the data relies on selecting an appropriate pre-defined element that is sent in every request, so that we can normalize the results and get a relative score for more than 5 elements.
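
To illustrate the idea, here is a minimal sketch with made-up numbers (not real Google Trends data). The helper name `rescale_to_reference`, the tag names, and the scores are assumptions for illustration only. It rescales every batch so that the shared pre-defined element gets the same score as in the first batch, which makes tags from different batches comparable.

```python
def rescale_to_reference(batches, anchor):
    """Rescale each batch so the shared anchor matches its score in the first batch."""
    reference = batches[0][anchor]
    merged = {}
    for batch in batches:
        if batch[anchor] == 0:
            # No signal for the anchor in this batch: leave the scores untouched
            factor = 1.0
        else:
            factor = reference / batch[anchor]
        merged.update({tag: score * factor for tag, score in batch.items()})
    return merged

# Made-up average scores; 'anchor' plays the role of the pre-defined element
batch_1 = {'tag_a': 10.0, 'tag_b': 40.0, 'anchor': 20.0}
batch_2 = {'tag_c': 25.0, 'tag_d': 5.0, 'anchor': 50.0}

scores = rescale_to_reference([batch_1, batch_2], 'anchor')
print(scores)
```

In the second batch the anchor scored 50 instead of 20, so every score in that batch is multiplied by 20/50 before the two batches are merged.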
 

The method will combine two approaches:

Approach 1 - We send 5 batches, each containing the same <b>*arbitrary*</b> pre-defined element, and take the median tag of each response.
The purpose of this approach is to keep the relative scores consistent while we choose our pre-defined element. If we applied only this approach,
the arbitrary keyword might be too unpopular or too popular, and that would skew the results.

Approach 2 - We send 5 batches that contain independent elements (no shared keyword).
We do this to reduce the bias in case our arbitrary pre-defined element is heavily skewed.
We then combine the 10 median tags (5 from each approach) with their values and choose the tag closest to the median of this set.
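
The selection step can be sketched in isolation. This is a minimal example with made-up averages, assuming the candidates are already available as a dictionary; the helper name `closest_to_median` is hypothetical and only mirrors the idea described above.

```python
import statistics

def closest_to_median(scores):
    """Return the tag whose score is closest to the median of all scores."""
    median_value = statistics.median(scores.values())
    return min(scores, key=lambda tag: abs(scores[tag] - median_value))

# Made-up averages for a handful of candidate tags
candidates = {'a': 0.5, 'b': 3.0, 'c': 7.5, 'd': 12.0, 'e': 55.0}
predefined = closest_to_median(candidates)
print(predefined)
```

Picking the tag nearest the median, rather than the mean, keeps one very popular tag (such as `'e'` here) from dragging the choice upwards.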

In [68]:
# Build TrendReq request
req = TrendReq(timeout=None, retries=10, backoff_factor=0.5)

In [117]:
## Auxiliary functions
# We reshape the data by taking the average of the column for each tag
def get_average(api_results):
    results_average = []
    for analyzed_batch in api_results:
        # The date is the index, so every column holds the scores of one tag
        # As we have time series data, we get the average for the entire month for each column
        results_average.append(analyzed_batch.mean().to_dict())
    #Returns an array of dictionaries containing tag and its average
    return results_average

def get_median_dict(averaged_api_results):
    output = {}
    for batch_averages in averaged_api_results:
        median_value = statistics.median(batch_averages.values())
        # Pick the tag whose average is closest to the median of its batch
        closest_key = min(batch_averages, key=lambda key: abs(batch_averages[key] - median_value))
        output[closest_key] = batch_averages[closest_key]
    #Returns a dictionary containing the median tag of each batch and its average
    return output

def normalize(averaged_api_results):
    # The first batch is the reference, its scores stay unchanged
    normalized_dict = dict(averaged_api_results[0])
    reference_value = list(averaged_api_results[0].values())[-1]
    for i in range(1, len(averaged_api_results)):
        # We get the pre-defined element from each dictionary (the last element)
        value_to_normalize = list(averaged_api_results[i].values())[-1]
        print('Value used to normalize:', value_to_normalize)
        # It is possible that we divide by 0 here
        # In that case, it is clear that there was not enough collected data for our pre-defined element
        # This means that we leave the batch unchanged (a factor of 1 has no effect)
        normalization_factor = 1 if value_to_normalize == 0 else reference_value / value_to_normalize
        print('Before:', averaged_api_results[i])
        averaged_api_results[i] = {key: value*normalization_factor for key, value in averaged_api_results[i].items()}
        print('After:', averaged_api_results[i])
        normalized_dict.update(averaged_api_results[i])
    return normalized_dict

def perform_normalization(df, i):
    # Normalize one batch of columns, starting at column index i
    # Assumed batch layout (7 columns): date, 4 tags, the pre-defined element, isPartial
    # Under that assumption the pre-defined element is the column at position i+5
    batch = df.iloc[:, i:i+7].copy()

    # First, cast values in columns 1-5 to numeric
    scores = batch.iloc[:, 1:6].apply(pd.to_numeric, errors='coerce')

    # We get the pre-defined element of the batch
    # It is possible that we divide by 0 here
    # In that case, it is clear that there was not enough collected data for our pre-defined element
    # This means that we can just divide each tag value by 1 (having no effect)
    values_to_normalize = scores.iloc[:, -1].replace(0, 1)
    print('Value used to normalize:', values_to_normalize)

    # Now we will normalize each row in the batch for columns with indexes 1-5
    print('Before:', scores)
    batch[scores.columns] = scores.div(values_to_normalize, axis=0)
    print('After:', batch[scores.columns])

    # Return the normalized batch so that the caller can append it to its list
    return batch
    
                      
def chat_normalize_time_series(csv_file_path):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(csv_file_path, sep=';')
    
    # Get the unique tags from the DataFrame
    unique_tags = df.columns[0:].tolist()
    
    # Specify the batch size
    batch_size = 7
    
    # Initialize the index variable
    i = 0
    
    # List to store normalized dataframes
    normalized_dfs = []
    
    # Iterate over data in batches of 7 columns
    while i < len(unique_tags):
        batch = unique_tags[i : i + batch_size]
        
        # If it is the last batch, write the CSV file as output
        if i + batch_size >= len(unique_tags):
            output_csv_path = f"normalized_output_{i}.csv"
            # Write the processed batch to a new CSV file
            df[batch].to_csv(output_csv_path, sep=';', index=False)
            print(f"CSV file saved: {output_csv_path}")
            
            # Exit the loop as it's the last batch
            break
        
        # Else, we will transform the original dataframe by normalizing each row in the batch
        # (not implemented yet, so nothing is appended to normalized_dfs)
        # Update the index for the next iteration
        i += batch_size
    
    # Concatenate all the normalized dataframes into a single dataframe
    final_normalized_df = pd.concat(normalized_dfs, ignore_index=True)
    
    # Save the final normalized dataframe to a CSV file
    final_normalized_df.to_csv("final_normalized_output.csv", sep=';', index=False)
    print("Final normalized CSV file saved.")
                    

def build_csv(normalized_dict):

    # Specify the file path
    file_path = r"C:\School\Semester_1\Data_Wrangling\Data_in_the_wild_exam\data\raw\pytrends\testing\skynews\test.csv"

    # Write the dictionary to a CSV file
    with open(file_path, 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        
        # Write header
        csv_writer.writerow(['tag_name', 'relative_popularity'])

        # Write data
        for tag_name, relative_popularity in normalized_dict.items():
            csv_writer.writerow([tag_name, relative_popularity])

In [70]:
def find_predefined_element(unique_elements, timeframe):
    final_dict = {}
    arbitrary_response = []
    independent_response = []
    arbitrary_median = []
    independent_median = []

    #Shuffle the array to send random elements
    random.shuffle(unique_elements)
    ## Just take the first 30 elements for testing
    unique_elements = unique_elements[0:30]

    for i in range (1, 20, 4):
        batch = unique_elements[i:i+4]
        #Add the pre-defined element
        batch.append(unique_elements[0])
        req.build_payload(batch, geo='GB', timeframe=timeframe)
        ##Drop last column from the response
        #Last column is an unnecessary boolean
        response = req.interest_over_time()
        time.sleep(5)
        arbitrary_response.append(response.iloc[:, :-1])
    
    averaged_response = get_average(arbitrary_response)
    arbitrary_median = get_median_dict(averaged_response)

    #Shuffle the array to send random elements
    random.shuffle(unique_elements)

    for i in range (0, 25, 5):
        batch = unique_elements[i:i+5]
        req.build_payload(batch, geo='GB', timeframe=timeframe)
        ##Drop last column from the response
        #Last column is an unnecessary boolean
        time.sleep(5)
        response = req.interest_over_time()
        independent_response.append(response.iloc[:, :-1])
    
    averaged_response = get_average(independent_response)
    independent_median = get_median_dict(averaged_response)

    # Find the keys that appear in both dictionaries (the intersection)
    common_keys = arbitrary_median.keys() & independent_median.keys()

    # Display the 10 medians
    # df = pd.DataFrame(list(arbitrary_median.items()), columns=['tag_name', 'relative_popularity_arbitrary_median'])
    # display(df.reset_index().plot(x='tag_name', y='relative_popularity_arbitrary_median', figsize=(120,10), kind='bar'))
    # df = pd.DataFrame(list(independent_median.items()), columns=['tag_name', 'relative_popularity_independent_median'])
    # display(df.reset_index().plot(x='tag_name', y='relative_popularity_independent_median', figsize=(120,10), kind='bar'))

    # It is possible that the keys in the two dictionaries will overlap
    # In that case, calculate the average of the medians for the key
    for k in common_keys:
        average_of_medians = (arbitrary_median[k] + independent_median[k])/2
        final_dict.update({k: average_of_medians})
        del arbitrary_median[k], independent_median[k]
    
    final_dict.update(arbitrary_median)
    final_dict.update(independent_median)
    
    median_value = statistics.median(final_dict.values())
    predefined_element = min(final_dict, key=lambda key: abs(final_dict[key] - median_value))

    print(f'It seems that {predefined_element} is the most appropriate pre-defined element')
    return predefined_element

### Now that we have the code for finding the pre-defined element, we can build the request

In [71]:
def build_request(unique_elements, timeframe):
    # Batch size is limited to 5 by Google Trends; the pre-defined element is appended last, leaving 4 slots for tags
    batch_size = 4
    output = []

    ## Just take the first 30 elements for testing
    unique_elements = unique_elements[0:30]
    predefined_element = find_predefined_element(unique_elements, timeframe)
    unique_elements.remove(predefined_element)
    # Loop through the array in batches
    for i in range(0, len(unique_elements), batch_size):
        #Extract the batch of 4 elements
        batch = unique_elements[i:i+batch_size]
        # Add the pre-defined element 
        batch.append(predefined_element)

        req.build_payload(batch, geo='GB', timeframe=timeframe)
        ##Drop last column from the response
        #Last column is an unnecessary boolean
        response = req.interest_over_time()
        output.append(response.iloc[:, :-1])
        
        #Wait a little so I don't overwhelm the API with requests
        time.sleep(5)
    #Returns an array of data frames, each with 5 tag columns and their score in time series
    return output
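
The batching logic can be checked offline, without calling Google Trends. The following is a minimal sketch with a hypothetical helper `make_batches` and made-up tag names; it only reproduces the slicing done in `build_request` and does not send any request.

```python
def make_batches(tags, anchor, batch_size=4):
    """Split tags into batches of batch_size and append the anchor to each batch."""
    remaining = [tag for tag in tags if tag != anchor]
    return [
        remaining[i:i + batch_size] + [anchor]
        for i in range(0, len(remaining), batch_size)
    ]

# Made-up tags; 'tag_3' plays the role of the pre-defined element
tags = [f'tag_{n}' for n in range(10)]
batches = make_batches(tags, anchor='tag_3')
for batch in batches:
    print(batch)
```

Every batch ends with the pre-defined element, which is what `normalize` relies on when it reads the last value of each dictionary.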

### Execute the requests

In [72]:
#Read the csv file

file_path = r"C:\School\Semester_1\Data_Wrangling\Data_in_the_wild_exam\data\raw\skynews\articles\1700032053_articles.csv"
if os.path.exists(file_path):
    df = pd.read_csv(file_path, sep=';')
    #When csv is loaded, the Tags array is recognized as a string, this casts it to an array
    df['Tags'] = df['Tags'].apply(ast.literal_eval)
else:
    print('The path is invalid')

#Get unique tags
unique_tags = []
for tags in (df['Tags']):
    for element in tags:
        if element not in unique_tags:
            unique_tags.append(element)

results = build_request(unique_tags, '2022-01-01 2022-01-31')
results = get_average(results)
results = normalize(results)
build_csv(results)

## Display the tags and their popularity on a bar chart
#df = pd.DataFrame(list(results.items()), columns=['tag_name', 'relative_popularity'])
#display(df.reset_index().plot(x='tag_name', y='relative_popularity', figsize=(120,10), kind='bar'))

It seems that russia is the most appropriate pre-defined element
Value used to normalize: 18.419354838709676
Before: {'jeffrey epstein': 1.7096774193548387, 'california': 7.645161290322581, 'wales': 55.03225806451613, 'data and forensics': 0.0, 'russia': 18.419354838709676}
After: {'jeffrey epstein': 0.24252867069657083, 'california': 1.0845149991525904, 'wales': 7.806677588836789, 'data and forensics': 0.0, 'russia': 2.6129032258064515}
Value used to normalize: 36.935483870967744
Before: {'climate change': 5.580645161290323, 'cop26': 1.1290322580645162, 'snow': 52.16129032258065, 'migrant crossings': 0.0, 'russia': 36.935483870967744}
After: {'climate change': 0.3947879983096211, 'cop26': 0.07987040428229328, 'snow': 3.6900126778419495, 'migrant crossings': 0.0, 'russia': 2.6129032258064515}
Value used to normalize: 36.67741935483871
Before: {'omicron': 49.54838709677419, 'arthur labinjo-hughes': 0.3548387096774194, 'vladimir putin': 0.8387096774193549, 'australia bushfires': 0.032258

In [118]:
csv_file_path = r"C:\School\Semester_1\Data_Wrangling\Data_in_the_wild_exam\data\intermediate\concated_results_2022\all_tags_2022.csv"
chat_normalize_time_series(csv_file_path)

CSV file saved: normalized_output_7700.csv


ValueError: No objects to concatenate
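
The `ValueError` above is raised by `pd.concat` because `normalized_dfs` is still empty when it is called: the loop in `chat_normalize_time_series` never appends anything to it. Below is a minimal sketch of one way to fill the list and guard against the empty case. It assumes that each batch of columns is laid out as date, tags, the pre-defined element, and `isPartial` (an assumption about the file, not something confirmed here), and it uses a hypothetical helper `normalize_batches` with made-up toy data and a batch size of 4 to stay short.

```python
import pandas as pd

def normalize_batches(df, batch_size=7):
    """Divide the scores of each column batch by the batch's pre-defined element."""
    normalized = []
    for start in range(0, df.shape[1], batch_size):
        batch = df.iloc[:, start:start + batch_size]
        # Skip the first column (date) and the last column (isPartial)
        scores = batch.iloc[:, 1:batch_size - 1].apply(pd.to_numeric, errors='coerce')
        # Assumed: the pre-defined element is the last score column; 0 becomes 1
        anchor = scores.iloc[:, -1].replace(0, 1)
        normalized.append(scores.div(anchor, axis=0))
    if not normalized:
        # pd.concat raises "No objects to concatenate" on an empty list
        return pd.DataFrame()
    # Batches hold different tags, so they are joined column-wise
    return pd.concat(normalized, axis=1)

# Toy data with two batches of 4 columns each
toy = pd.DataFrame({
    'date': ['2022-01-01', '2022-01-02'],
    'tag_a': [10, 0], 'anchor': [20, 0], 'isPartial': [False, False],
    'date.1': ['2022-01-01', '2022-01-02'],
    'tag_b': [5, 8], 'anchor.1': [10, 4], 'isPartial.1': [False, False],
})
result = normalize_batches(toy, batch_size=4)
print(result)
```

Joining along `axis=1` keeps one row per date; concatenating along the default axis would stack batches with different column names on top of each other.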

In [35]:
csv_file_path = r"C:\School\Semester_1\Data_Wrangling\Data_in_the_wild_exam\data\intermediate\concated_results_2022\all_tags_2022.csv"
df = pd.read_csv(csv_file_path, sep=';')
unique_tags = df.columns[0:].tolist()

In [36]:
# Using a set to identify duplicates
print('Len of array:', len(unique_tags))
unique_set = set()
duplicates_set = set(x for x in unique_tags if x in unique_set or unique_set.add(x))

# Converting the set of duplicates back to a list
duplicates_list = list(duplicates_set)

print("Duplicates in the list:", duplicates_list)

Len of array: 7707
Duplicates in the list: []
