# APIs: Exploring Guardian API Data
Guardian news data provides a range of different types of variable that we can use to get an overall picture of our dataset, and perhaps even find some interesting patterns along the way.

Below we look at a range of different options for examining Guardian data. Whilst the text of the stories is obviously valuable data, we'll need more advanced text analysis methods for that. The methods below instead use the metadata (dates, categories, tags and word counts) to give us a good overall picture of our data and help us find general trends.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
articles = pd.read_json('AI_articles.json')
articles.info()

## Prepping the Data
First we get the data ready for analysis.

### Transforming the date column
First we need to ensure that our data is clean and that the `webPublicationDate` is properly formatted as a `datetime`.

In [None]:
articles.head()


In [None]:
articles['webPublicationDate'] = pd.to_datetime(articles['webPublicationDate'])
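
As a rough illustration of what the conversion does, here is a minimal sketch on made-up data. The two sample strings are hypothetical and assume the dates arrive as ISO 8601 text ending in `Z`; check `articles.head()` to confirm the format in your own file.

```python
import pandas as pd

# Hypothetical sample values, assumed to mirror the format of the real column
sample = pd.DataFrame({
    'webPublicationDate': ['2023-01-05T10:30:00Z', '2023-02-11T08:00:00Z']
})

# Convert the text column into real datetime values
sample['webPublicationDate'] = pd.to_datetime(sample['webPublicationDate'])

# The trailing 'Z' marks UTC, so the resulting column is timezone-aware
print(sample['webPublicationDate'].dtype)

# Once converted, the .dt accessor exposes date parts such as the year
print(sample['webPublicationDate'].dt.year.tolist())
```

If the column was still text, the `.dt` accessor would raise an error, which is a quick way to check the conversion worked.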

### Unpacking the Fields column
The content of the `fields` column is determined when we collect our API data, by what we passed to `show-fields` in our query parameters. However what is returned is a dictionary of information. Ideally we want to expand these dictionaries out and create additional columns for each field (`byline`, `body` and `wordcount`).

We'll mainly be using `wordcount` but the process will unpack all fields.

In [None]:
articles.loc[0, 'fields']

`pd.json_normalize` does this kind of job for us, but it creates an entirely new dataframe from the results.

In [None]:
articles_field_data = pd.json_normalize(articles['fields'])
articles_field_data.head()

Having produced our dataframe of field data, we just need to merge the `articles` dataframe and the new one together, matching up the indexes.
When merging dataframes, left refers to the dataframe the method is called on, and right to the dataframe passed in as the argument.

`left.merge(right)`


In [None]:
articles = articles.merge(articles_field_data, left_index=True, right_index=True)
articles.info()
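
To see the unpack-and-merge steps in isolation, here is a small sketch on a two-row stand-in. The names `toy_articles`, `toy_fields` and `toy_merged`, and all of the values, are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for the articles dataframe
toy_articles = pd.DataFrame({
    'id': ['a1', 'a2'],
    'fields': [
        {'byline': 'Writer One', 'wordcount': '850'},
        {'byline': 'Writer Two', 'wordcount': '1200'},
    ],
})

# One new column per dictionary key, one row per original row
toy_fields = pd.json_normalize(toy_articles['fields'])

# Match rows on their index values rather than on a shared column
toy_merged = toy_articles.merge(toy_fields, left_index=True, right_index=True)
print(toy_merged[['id', 'byline', 'wordcount']])
```

Matching on the index only lines up the right rows if both dataframes are still in the same order and share the same index values, which is why this step is best done before any filtering.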


### Converting wordcount to numeric
Wordcount has been stored as a string. We can rectify that by using `pd.to_numeric`.

In [None]:
articles['wordcount'] = pd.to_numeric(articles['wordcount'])
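
If some word counts turn out to be missing or malformed, the conversion above will raise an error. The sketch below, on made-up values, shows one possible way to handle that with the `errors='coerce'` option; whether your own data needs it is an assumption to check.

```python
import pandas as pd

# Hypothetical word counts stored as text, with one missing and one malformed value
raw_counts = pd.Series(['850', '1200', None, 'n/a'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
clean_counts = pd.to_numeric(raw_counts, errors='coerce')
print(clean_counts)

# Count how many values could not be converted
print(clean_counts.isna().sum())
```

Checking the number of `NaN` values afterwards tells you how many rows would be silently excluded from later sums and averages.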

## Data Counts over time
A key question for a dataset about the news is when that news took place. We may also be interested in trends over time. Depending on your query, it may be interesting to see whether publication rates for your topic of interest changed.

A simple `.describe` on the date column will tell us a little about the spread of the dates.

In [None]:
# pandas 2.0 removed the datetime_is_numeric argument; on pandas 1.x pass datetime_is_numeric=True
articles['webPublicationDate'].describe()

If we want to see trends, we can group our rows by publication period such as by `D`ay, `M`onth or `Y`ear. To do this we make a special time grouping object, and then group our data using it. We count the number of articles in each group and then plot them as a line plot.

In [None]:
time_grouper = pd.Grouper(key='webPublicationDate', freq='M')
count_over_time = articles[['webPublicationDate','id']].groupby(time_grouper).count().reset_index()
count_over_time.head()
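
The grouping step can be easier to follow on a tiny example. This sketch uses three made-up articles and swaps in the month-start alias `'MS'` instead of `'M'`, purely as an assumption to keep it running unchanged across pandas versions (newer releases may warn that `'M'` is deprecated in favour of `'ME'`).

```python
import pandas as pd

# Hypothetical articles: two in January 2023, none in February, one in March
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'webPublicationDate': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-03-02']),
})

# 'MS' labels each monthly bin by the first day of its month
month_grouper = pd.Grouper(key='webPublicationDate', freq='MS')

# Count the ids that fall into each monthly bin
toy_counts = toy.groupby(month_grouper)['id'].count().reset_index()
print(toy_counts)
```

Note that February still appears, with a count of 0, so gaps in coverage show up as dips in the line plot rather than being skipped.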

Time series are a little trickier to plot and Seaborn doesn't have a built-in convenience method for it. However, we can use `sns.lineplot` to create one ourselves, and then adjust the figure size and label positioning manually.

In [None]:
plt.figure(figsize=(20,5))
plot = sns.lineplot(data=count_over_time, x='webPublicationDate', y='id')

plot.tick_params(axis='x', labelrotation=45)
plot.set(title='AI story Frequency over time', xlabel='Month', ylabel='Freq')
sns.despine()
plt.show()

## Optional: Filtering by a Date Range
Using our time series plot we might decide to filter our data so we only work with a specific date range.

In [None]:
date_filter = articles['webPublicationDate'] >= 'January 2022'
articles = articles[date_filter]
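
The filter above only sets a start date. If you also want an end date, two conditions can be combined with `&`. The following is a minimal sketch on made-up, timezone-naive dates; the names `start`, `end` and `toy_2022` are hypothetical.

```python
import pandas as pd

# Hypothetical articles spread across three years
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'webPublicationDate': pd.to_datetime(
        ['2021-11-30', '2022-01-01', '2022-06-15', '2023-02-01']),
})

# Keep everything from the start date up to, but not including, the end date
start = pd.Timestamp('2022-01-01')
end = pd.Timestamp('2023-01-01')

# Each comparison needs its own brackets when combined with &
in_range = (toy['webPublicationDate'] >= start) & (toy['webPublicationDate'] < end)
toy_2022 = toy[in_range]
print(toy_2022)
```

If your real column is timezone-aware, the boundaries would need a matching timezone, for example `pd.Timestamp('2022-01-01', tz='UTC')`.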

In [None]:
articles['webPublicationDate'].describe()

In [None]:
time_grouper = pd.Grouper(key='webPublicationDate', freq='M')
count_over_time = articles[['webPublicationDate','id']].groupby(time_grouper).count().reset_index()

plt.figure(figsize=(20,5))
plot = sns.lineplot(data=count_over_time, x='webPublicationDate', y='id')
plot.tick_params(axis='x', labelrotation=45)
plot.set(title='AI story Frequency over time January 2022+', xlabel='Month', ylabel='Freq')
sns.despine()
plt.show()

## Appropriate Pillars?
The Guardian has a number of major sections they refer to as Pillars. We can examine the distribution of our articles across these major categories.

In [None]:
pillar_counts = articles['pillarName'].value_counts()
pillar_counts


In [None]:
sns.catplot(data=articles, y='pillarName', kind='count', order=pillar_counts.index)

## Optional: Filtering by Pillar
Depending on your search query and the type of question you have, it may be worth filtering out material in unsuitable pillars, or focusing on just one.

In [None]:
chosen_pillars = ['News', 'Opinion']
pillar_filter = articles['pillarName'].isin(chosen_pillars)
articles = articles[pillar_filter]
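
As a small illustration of how the `.isin` filter behaves, the sketch below applies it to four made-up rows and also shows how `~` flips the filter to keep everything else. The dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stories, one per pillar
toy = pd.DataFrame({
    'webTitle': ['Story 1', 'Story 2', 'Story 3', 'Story 4'],
    'pillarName': ['News', 'Sport', 'Opinion', 'Arts'],
})

keep = ['News', 'Opinion']

# True for rows whose pillar is in the list, False otherwise
is_kept = toy['pillarName'].isin(keep)

kept = toy[is_kept]        # rows in the chosen pillars
dropped = toy[~is_kept]    # ~ inverts the filter, giving every other row
print(kept)
print(dropped)
```

Looking at the dropped rows before overwriting `articles` can be a useful check that nothing relevant is being thrown away.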

After filtering we can re-run our counts to check the filtering was applied, and produce a new visualisation if we need it.

In [None]:
new_pillar_counts = articles['pillarName'].value_counts()
new_pillar_counts

In [None]:
sns.catplot(data=articles, y='pillarName', kind='count', order=new_pillar_counts.index)

## Sections
Sections are the next form of categorisation. Sections give us a better sense of the overall topic of the stories.

In [None]:
section_counts = articles['sectionName'].value_counts()
section_counts

In [None]:

sns.catplot(data=articles, y='sectionName', kind='count', aspect=1.5, order=section_counts.index)


Depending on how many sections are involved we may decide to keep only those above a certain threshold of presence in our dataset. This could be a top 10 or 20, or you could base it on some sort of summary metric of the counts such as categories above the mean or median count.

In [None]:
section_counts.describe()


In [None]:
above_avg_sections = section_counts[section_counts > section_counts.median()].index
above_avg_sections

We'll just go with a top 10.

In [None]:
top_sections = section_counts.index[:10]
top_sections

In [None]:
articles = articles[articles['sectionName'].isin(top_sections)]
articles.info()
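
To compare the two thresholding approaches side by side, here is a sketch on made-up section labels. The counts are invented, and a top 2 stands in for the top 10 simply because the toy data is so small.

```python
import pandas as pd

# Hypothetical section labels: 5 Technology, 3 Business, 2 Film, 1 Sport
sections = pd.Series(
    ['Technology'] * 5 + ['Business'] * 3 + ['Film'] * 2 + ['Sport'],
    name='sectionName')

# value_counts sorts from most to least common
counts = sections.value_counts()

# Option 1: a fixed number of sections from the top of the sorted counts
top_two = counts.index[:2]

# Option 2: every section whose count is above the median count
above_median = counts[counts > counts.median()].index

print(list(top_two))
print(list(above_median))
```

Here both options happen to select the same sections. On real data they will often differ, because a fixed top N ignores how steeply the counts fall away.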

### Examining Section Content
We may find interesting sections in our dataset and wonder why they're there. We can iterate through the titles and URLs of the section we're interested in to get a better sense of why they've been included.

Below is a simple section filter but you could make it more complicated, such as limiting to after a time period, or in combination with a pillar classification for example.

N.B. Below we use `.head` to limit the number of results for demonstration purposes, but during analysis there is no reason you cannot remove it to view all the results.

In [None]:

SECTION_OF_INTEREST = 'Australia news' # Just change this to switch sections


selected_data = articles[articles['sectionName'] == SECTION_OF_INTEREST].head(5)
for index, row in selected_data.iterrows():
    print(row['webTitle'])
    print(row['webUrl'])
    print('****')

## Tags
Tags are the last categorisation and they give us even more nuance in exactly what each story is about. However they are a little trickier to deal with because each story can have more than one tag associated with it. This presents us more of a challenge but also an opportunity for analysis too.

If we look at the first item in our `tags` column, we can see that the value is actually quite a complex object: a `list` in which each item is a `dictionary`. A lot of information is provided but for our purposes we just want the tag string, which is held under the `webTitle` key.

In [None]:
articles['tags'].iloc[0]

As each story could have multiple tags we're going to create a version of the `articles` dataframe where each row represents a single tag, and other story information like title, wordcount etc are duplicated. Pandas will keep track of which rows all refer to the same story using the index.

In [None]:
tag_per_line = articles.explode('tags')
tag_per_line.head(10)
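
Here is a minimal sketch of what `.explode` does, using two made-up stories. The tag dictionaries are cut down to a single hypothetical key for readability.

```python
import pandas as pd

# Hypothetical stories: the first has three tags, the second has two
toy = pd.DataFrame({
    'webTitle': ['Story A', 'Story B'],
    'tags': [
        [{'webTitle': 'ChatGPT'}, {'webTitle': 'Technology'}, {'webTitle': 'Google'}],
        [{'webTitle': 'ChatGPT'}, {'webTitle': 'Education'}],
    ],
})

# Each list item gets its own row; the other columns are repeated
toy_exploded = toy.explode('tags')
print(toy_exploded)

# The original index values are repeated, once per tag
print(toy_exploded.index.tolist())  # [0, 0, 0, 1, 1]
```

The repeated index values are what later allow rows belonging to the same story to be grouped back together.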

If we check the length of the original `articles` dataframe against the new `tag_per_line` we can see that we have many more rows, one row per tag used in a story. We can also see the index values for our new dataframe are duplicated. This is because what was a single row, row `0` for example, is now three rows because the story had three tags. What was row `1` is now four rows, because the story at row `1` had four tags. These index values help us keep track of what rows 'go together' to make a single story, which we'll need later.

In [None]:
len(tag_per_line)

In [None]:
len(articles)

Now our `tags` column is similar in structure to our `fields` column, which we unpacked earlier using `pd.json_normalize`. We can do the same again to generate a separate dataframe of `tag_data`. We also use a new method, `.set_index()`, to replace the default index that gets generated when `pd.json_normalize()` makes the new dataframe with the index from `tag_per_line`, which is keeping track of which rows go with which story.

In [None]:
tag_data = pd.json_normalize(tag_per_line['tags'])
tag_data = tag_data.set_index(tag_per_line.index)
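
The following sketch walks through the same unpacking on two made-up stories, to show why carrying the index across matters. The tag ids and titles are hypothetical.

```python
import pandas as pd

# Hypothetical stories with cut-down tag dictionaries
toy = pd.DataFrame({
    'webTitle': ['Story A', 'Story B'],
    'tags': [
        [{'id': 't/chatgpt', 'webTitle': 'ChatGPT'}, {'id': 't/google', 'webTitle': 'Google'}],
        [{'id': 't/chatgpt', 'webTitle': 'ChatGPT'}],
    ],
})
toy_exploded = toy.explode('tags')

# One row per tag, one column per dictionary key
toy_tags = pd.json_normalize(toy_exploded['tags'])

# Reuse the exploded index so each tag row still points at its story
toy_tags = toy_tags.set_index(toy_exploded.index)

# With matching indexes, story-level columns can be copied straight across
toy_tags['article_title'] = toy_exploded['webTitle']
print(toy_tags)
```

Without the `.set_index()` step the tag rows could be numbered 0, 1, 2 and the link back to each story would be lost.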


As we have index values keeping track of which row goes with which story, we could use those values to refer back to our original `articles` dataframe when we need additional information. However, to simplify later tasks like printing out titles and URLs, let's just copy the columns from `tag_per_line` into this new `tag_data` dataframe. The index lookup approach would be more space efficient but this isn't an issue with data this size.

In [None]:
tag_data['wordcount'] = tag_per_line['wordcount']
tag_data['article_title'] = tag_per_line['webTitle']
tag_data['article_url'] = tag_per_line['webUrl']
tag_data.info()


We have quite a lot of columns in this dataframe, and probably only need a few. To keep things simpler, let's overwrite `tag_data` with a version that only includes the columns we need.

In [None]:

tag_data = tag_data[['webTitle','article_title','article_url','wordcount']]
tag_data.head()

We can now check the count frequency of the different tags to get an overall picture like we did with sections.

In [None]:
tag_counts = tag_data['webTitle'].value_counts().head(20)
top_tags = tag_counts.index
tag_counts

In [None]:
sns.catplot(data=tag_data, y='webTitle', kind='count', aspect=1.5, order=top_tags)


## Titles by Tag
Like before with sections, we can examine what stories are associated with each tag. The column names will be different but the mechanics are the same.
N.B. Below we use `.head` to limit the number of results for demonstration purposes, but during analysis there is no reason you cannot remove it to view all the results.

In [None]:

TAG_OF_INTEREST = 'Elon Musk' # Just change this to switch tags


selected_data = tag_data[tag_data['webTitle'] == TAG_OF_INTEREST].head()

for index, row in selected_data.iterrows():
    print(row['article_title'])
    print(row['article_url'])
    print('****')

We will come back to this data later when examining word counts and tag correlation.

## Word Counts

### By Section
We defined `top_sections` earlier when we checked which sections had the highest number of stories. Here we'll use `.groupby` to get per-section word counts, and in the next subsection we do the same per tag. Word count is a good proxy for how much space was dedicated to a particular topic.

Total word count tells us the overall space dedicated to each section or tag, whilst the average tells us how much space was given per story.

In [None]:
# Select the top sections first, then sort, so the sort order is what we see
section_wordcounts = articles.groupby('sectionName').agg(
    avg_wordcount=('wordcount','mean'),
    total_wordcount=('wordcount','sum')
).loc[top_sections].sort_values('total_wordcount', ascending=False)

section_wordcounts
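
The `.agg` call uses what pandas calls named aggregation: each keyword names an output column and pairs an input column with a summary function. A minimal sketch on made-up word counts:

```python
import pandas as pd

# Hypothetical stories: two in Technology, three in Business
toy = pd.DataFrame({
    'sectionName': ['Technology', 'Technology', 'Business', 'Business', 'Business'],
    'wordcount': [1000, 500, 400, 300, 200],
})

# new_column_name=(source_column, summary_function)
toy_summary = toy.groupby('sectionName').agg(
    avg_wordcount=('wordcount', 'mean'),
    total_wordcount=('wordcount', 'sum'),
).sort_values('total_wordcount', ascending=False)

print(toy_summary)
```

In this toy data Business has more stories but Technology has the larger total, which is the kind of difference that story counts alone would hide.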

We can use box plots to see the distribution of these word counts. We already filtered the `articles` data so it only includes stories in the top sections, but we repeat the filter here to make clear that it is needed before visualisation to reduce visual clutter.

In [None]:
to_plot = articles[articles['sectionName'].isin(top_sections)]
sns.catplot(data=to_plot, y='sectionName', x='wordcount', kind='box', aspect=2)

### By Tag
For a similar summary, but by tag, we do the same, but we use the `tag_data` dataframe, and the `webTitle` column.

In [None]:
# Select the top tags first, then sort, so the sort order is what we see
tag_data.groupby('webTitle').agg(
    avg_wordcount=('wordcount','mean'),
    total_wordcount=('wordcount','sum')
).loc[top_tags].sort_values('total_wordcount', ascending=False)

In [None]:
to_plot = tag_data[tag_data['webTitle'].isin(top_tags)]
sns.catplot(data=to_plot, y='webTitle', x='wordcount', kind='box', aspect=2)

## Tag Correlation
One analysis technique available to us is to examine the correlation of tags. Which tags tend to co-occur in single stories? Could this give us a sense of the themes, or of where different topics intersect?

Here we'll create a matrix of tag counts. In the first stage we use `pd.get_dummies` to reshape our column of tag names so that each possible tag is given its own column, and a value of 1 is entered if that tag is present in the row, otherwise 0. (Depending on your pandas version and the `dtype` used, these may display as `True`/`False`, which behave as 1 and 0 when summed.)

(This may be a little confusing now but we're heading somewhere!)

In [None]:
tag_matrix = pd.get_dummies(tag_data['webTitle'], dtype=int)  # dtype=int gives 1/0 rather than True/False
tag_matrix
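
On a tiny made-up example the reshaping looks like this. The three tag rows and their index values are hypothetical, and `dtype=int` is passed so the output shows 1 and 0.

```python
import pandas as pd

# Hypothetical tag rows: story 0 has two tags, story 1 has one
toy_tags = pd.Series(['ChatGPT', 'Google', 'ChatGPT'], index=[0, 0, 1], name='webTitle')

# One column per distinct tag, with a 1 marking the tag on that row
toy_dummies = pd.get_dummies(toy_tags, dtype=int)
print(toy_dummies)

# Every row has exactly one 1, because each row represents a single tag
print(toy_dummies.sum(axis=1).tolist())  # [1, 1, 1]
```

The index is carried through unchanged, so the rows for story 0 can still be identified and combined later.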


Next we take our `tag_matrix`, use our list of `top_tags` to ensure only columns representing our selected top tags remain. We do this to aid visualisation later.

In [None]:

tag_matrix = tag_matrix[top_tags].copy()
tag_matrix.info()

At the moment our `tag_matrix` has one row per tag per story, meaning that in every row *at most one* of those columns will contain a 1 to show that the tag is associated with that story. (Rows for tags outside our top tags now contain only zeros.) In order to understand whether certain tags correlate, that is whether they go together, we need to simplify so that one row represents a story, and each column shows either a 0 or a 1 depending on whether the tag is present in that story.

As a story can only use each tag once, we can take all the rows for a single story and add the values in each column together. The result is one row in which 1 indicates the tag is present and 0 indicates it is absent, because a tag that never appears only contributes zeros to the sum. Using `groupby` we can grab each set of rows representing a single story and `.sum()` the values in each column, which gives us back one row per story in this form.

Let's demo this with a simplified example...

In [None]:
toy_matrix = pd.read_csv('toy_matrix.csv')
toy_matrix

The `story` column just represents the id or title of the story, and then we have a column for each of three different tags. You'll see that each row only has one `1` in it because it is one row per tag. What we want is one row per story, so just two rows, one for story A, one for story B that puts the values spread across multiple rows into just one row.

You could probably do this in your head because really all we're saying is, for each story's subset of rows, if there is a `1` anywhere in the column, then the value is `1`, otherwise it's `0`. As we know a single story can only use a tag once, we can simplify this slightly complicated logic as just "grab all the rows for a story and for each column, add the values together".

In [None]:
toy_matrix.groupby('story').sum()
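
If you don't have `toy_matrix.csv` to hand, the same idea can be reproduced with an inline dataframe. The values below are made up and are not necessarily the contents of the real file.

```python
import pandas as pd

# Hypothetical one-row-per-tag matrix: story A has two tags, story B has three
toy_rows = pd.DataFrame({
    'story': ['A', 'A', 'B', 'B', 'B'],
    'tag_1': [1, 0, 1, 0, 0],
    'tag_2': [0, 1, 0, 1, 0],
    'tag_3': [0, 0, 0, 0, 1],
})

# Collapse to one row per story by adding up each tag column
per_story = toy_rows.groupby('story').sum()
print(per_story)
```

Story A ends up with a 1 under `tag_1` and `tag_2` only, while story B has a 1 under all three.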

We can do the same with our actual `tag_matrix`. As we want to group on the index of the dataframe rather than a column we don't have a column name to pass `.groupby()` as usual. However we can tell it to group by "level 0". Pandas refers to indexes as levels and on a regular dataframe with just a single index, there is only one level, level 0.

In [None]:
tag_matrix = tag_matrix.groupby(level=0).sum()
tag_matrix
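
Grouping on the index works the same way as grouping on a column. The sketch below uses made-up rows and also shows `.max()` as a possible alternative to `.sum()`; this is a suggestion rather than part of the method above, and it is only needed if you suspect a tag could be repeated within one story.

```python
import pandas as pd

# Hypothetical one-row-per-tag matrix: story 0 has two tags, story 1 has one
rows = pd.DataFrame(
    {'ChatGPT': [1, 0, 1], 'Google': [0, 1, 0]},
    index=[0, 0, 1],
)

# level=0 means "group rows that share the same index value"
by_sum = rows.groupby(level=0).sum()

# max gives the same 1/0 result, and would stay at 1 even if a tag repeated
by_max = rows.groupby(level=0).max()

print(by_sum)
print(by_sum.equals(by_max))
```

When every tag appears at most once per story, as assumed in the text, the two results are identical.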


Finally we can get our correlation scores using `.corr`. This reshapes the data into a square, where both the rows and the columns represent tags, and the values represent the correlation between the two tags.

- 0 represents no correlation.
- 1 represents the highest positive correlation, i.e. tag `a` and tag `b` always appear together: every story has either both tags or neither.
- A negative value indicates negative correlation, i.e. the presence of tag `a` means that the presence of tag `b` is less likely.

The 'diagonal' of the matrix will always equal 1 as the presence of tag `a` will always be correlated with the presence of tag `a`.

In [None]:
correlations = tag_matrix.corr()
correlations
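
To build some intuition for these scores, here is a sketch with four made-up stories and four hypothetical tags, chosen so that the three cases described above each appear once.

```python
import pandas as pd

# Hypothetical one-row-per-story matrix
stories = pd.DataFrame({
    'tag_a': [1, 1, 0, 0],
    'tag_b': [1, 1, 0, 0],   # always appears together with tag_a
    'tag_c': [0, 0, 1, 1],   # never appears together with tag_a
    'tag_d': [1, 0, 1, 0],   # appears with tag_a half of the time
})

toy_corr = stories.corr()

# Read off how tag_a relates to every other tag
print(toy_corr['tag_a'].round(2))
```

Here `tag_b` scores 1 against `tag_a`, `tag_c` scores -1 and `tag_d` scores 0. Real tag data will rarely be this clean, so expect values somewhere in between.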

We can check the correlations for a specific tag by accessing its column...

In [None]:
correlations['ChatGPT'].sort_values(ascending=False)

### Tag Heatmap
We can also visualise these correlations using a heatmap. Using the `coolwarm` color scheme means colours run from deep blue to deep red. We set the `center` of the scale to 0 so that above zero, positive correlation, is a shade of red whilst below zero, negative correlation, is a shade of blue.

In [None]:
plt.figure(figsize=(8,8))

sns.heatmap(correlations, cmap='coolwarm', center=0)

### Advanced: Identifying multi-tag titles
What if you wanted to understand why two tags correlate, perhaps a pair you did not expect? You will need to identify which stories have both tags using our `tag_matrix`, and then use the index values to look up the correct rows in the `articles` dataframe. We can then iterate over them and view the title and URL like before.

In [None]:
TAG_1 = 'ChatGPT'
TAG_2 = 'Consciousness'
tag_filter = (tag_matrix[TAG_1] == 1) & (tag_matrix[TAG_2] == 1)

selected_story_index = tag_matrix[tag_filter].index
selected_story_index

In [None]:
selected_data = articles.loc[selected_story_index].head()

for index, row in selected_data.iterrows():
    print(row['webTitle'])
    print(row['webUrl'])
    print('****')
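
The lookup relies on `tag_matrix` and `articles` sharing the same index values. A minimal sketch of that idea on made-up data, where `toy_articles` and `toy_tag_matrix` are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical stories and their one-row-per-story tag matrix, sharing an index
toy_articles = pd.DataFrame(
    {'webTitle': ['Story A', 'Story B', 'Story C']},
    index=[0, 1, 2],
)
toy_tag_matrix = pd.DataFrame(
    {'ChatGPT': [1, 1, 0], 'Consciousness': [1, 0, 1]},
    index=[0, 1, 2],
)

# True only where a story carries both tags
both = (toy_tag_matrix['ChatGPT'] == 1) & (toy_tag_matrix['Consciousness'] == 1)

# The surviving index values identify the matching stories
shared_index = toy_tag_matrix[both].index

# Use those index values to pull the rows from the articles table
shared_titles = toy_articles.loc[shared_index, 'webTitle'].tolist()
print(shared_titles)  # ['Story A']
```

Only Story A carries both tags, so it is the only title returned.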

## Summary
There are many other ways in which this kind of data can be explored, depending on the kind of question you have. However, the above techniques give us a good overview of the data, including the time period covered, the top topics and the type of content that has been collected (news, sport, opinion etc.), and allow us to get a sense of some correlations between topics.

## Exercises
Explore your own dataset from the Guardian API. Use the techniques above to get a better sense of what you've collected.