<img src="../ldaca_notebooks/pics/DO_Logo.PNG" alt="logo" style="width:150px;"/>

<div style="background-color: #3398FF; padding: 10px; border-radius: 5px;">

<div align="center">
    <h1><strong>&#x1F575; Australian Twittersphere Aggregate Data Exploration Notebook</strong></h1>
</div> 

This is the Exploration notebook - Start here!

This notebook shows you how to get data about the QUT Digital Observatory's Australian Twittersphere (AuTS) collection. For more information on the AuTS, see https://www.digitalobservatory.net.au/resources/australian-twittersphere/

The purpose of this notebook is to explore the data contained in the AuTS to see if the data you are interested in (e.g. topics, words, etc) are held in the collection.

This notebook can access:

+ N-grams for the entire collection: 1-grams, 3-grams, and 3-grams with emojis as single words
+ Domains/URLs for the entire collection
+ Hashtags for the entire collection

&#x1F6D1; Note: you cannot access the AuTS directly from this notebook. Contact the Digital Observatory if you want to get any Twitter data from the collection.

### Working with Outputs

Most functions in this notebook return **dataframes**. You can display the results by using the `print()` function, or use `.head(n)` and `.tail(n)` to view the top or bottom `n` rows, e.g.:

```python
print(top_ngrams)
top_ngrams.head(10)
top_ngrams.tail(5)
```

### Saving DataFrames

You can save any dataframe to a file using pandas' built-in methods:

```python
top_ngrams.to_csv('output.csv', index=False)    # Save as CSV
top_ngrams.to_excel('output.xlsx', index=False) # Save as Excel
```

### Saving Plots

Plots are typically returned as matplotlib or plotly figure objects. To save a plot to a file, use the appropriate method, for example:

```python
# For matplotlib figures
fig.savefig('output.png')

# For plotly figures
fig.write_image('output.png')
```

You may need to install additional packages for saving plotly images (e.g., `kaleido`).

In [1]:
# This cell initializes the main exploration functionality
# The Exploration class contains methods for:
# - Loading AuTS data (1-grams, 3-grams, hashtags, domains)
# - Analyzing frequencies over time periods
# - Searching for specific keywords, hashtags, and domains
# - Visualizing trends and comparisons


from exploration_notebook import Exploration

explore = Exploration()

# Word Frequencies (1-grams)

### What are 1-grams?

1-grams (unigrams) are single words extracted from tweets in the Australian Twittersphere dataset. Analyzing 1-grams helps you understand the most common words used, track trends, and identify topics of interest over time.

#### What can you use 1-grams for?

- Identifying popular words and topics in Australian Twitter conversations
- Comparing word usage across different time periods
- Tracking the rise or fall of specific keywords
- Analyzing tweet types (retweets, quotes, replies, originals) for each word

#### Data Structure

The 1-grams dataset is a CSV file with the following columns:

| Column                   | Description                                                                 |
|--------------------------|-----------------------------------------------------------------------------|
| `ngram`                  | The unigram (single word)                                                   |
| `date`                   | The month, binned as the 1st day of each month (e.g., `2021-01-01`)        |
| `total_frequency`        | Total occurrences of the word in all tweets for that month                  |
| `retweet_frequency`      | Occurrences in retweets                                                     |
| `quote_tweet_frequency`  | Occurrences in quote tweets                                                 |
| `reply_tweet_frequency`  | Occurrences in replies                                                      |
| `orginal_tweet_frequency`| Occurrences in original tweets                                              |
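
The `date` column above is described as a month binned to its first day. As a minimal sketch of what that binning looks like in pandas, the example below uses made-up timestamps; the `created_at` column and the sample rows are assumptions for illustration only, since the published CSV already contains the binned dates.

```python
import pandas as pd

# Hypothetical raw timestamps; the AuTS CSV already stores the binned value.
raw = pd.DataFrame({
    "created_at": pd.to_datetime(["2021-01-03", "2021-01-28", "2021-02-14"]),
    "ngram": ["summer", "summer", "summer"],
})

# Bin each timestamp to the first day of its month.
raw["date"] = raw["created_at"].dt.to_period("M").dt.to_timestamp()

# Count occurrences per (ngram, month), mirroring the `total_frequency` column.
monthly = (
    raw.groupby(["ngram", "date"])
       .size()
       .reset_index(name="total_frequency")
)
print(monthly)
```

Both January timestamps collapse onto `2021-01-01`, which is why a date range in the functions below effectively selects whole months.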

#### Working with 1-gram Functions

The notebook provides several functions for exploring 1-grams:

- `top_grams_in_date_range(data, start_date, end_date, top_n)`:
    - `data`: The loaded 1-grams dataframe.
    - `start_date`, `end_date`: Strings in `YYYY-MM-DD` format. These define the date range to analyze. Changing these will focus the analysis on different months.
    - `top_n`: Integer, the number of top words to return. Increasing this gives a broader view, decreasing focuses on the most frequent.

- `keyword_search_in_date_range(data, keyword, start_date, end_date)`:
    - `data`: The loaded 1-grams dataframe.
    - `keyword`: The word to search for. Changing this lets you analyze different terms.
    - `start_date`, `end_date`: Date range for the search. Adjusting these changes the period of analysis.

- `keyword_search_with_ratios_in_date_range(data, keyword, start_date, end_date)`:
    - `data`: The loaded 1-grams dataframe.
    - `keyword`: The word to analyze.
    - `start_date`, `end_date`: Date range for ratio analysis.
    - This function returns both frequency and the ratio of the keyword to total tweet volume, helping you see relative importance over time.

These functions return **dataframes** for further analysis and visualization.

Example usage:

```python
top_ngrams = explore.top_grams_in_date_range(one_grams, '2021-01-01', '2021-02-01', 30)
keyword_date = explore.keyword_search_in_date_range(one_grams, 'word', '2021-01-01', '2021-02-01')
keyword_ratio = explore.keyword_search_with_ratios_in_date_range(one_grams, 'word', '2021-01-01', '2021-02-01')
```

Adjusting the function arguments allows you to customize your analysis for different words, time periods, and result sizes. For example, narrowing the date range can help you focus on specific events, while changing `top_n` lets you see more or fewer frequent words.
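
To make the effect of `start_date`, `end_date`, and `top_n` concrete, here is a minimal pandas sketch of a top-N query over the documented columns. It is not the `Exploration` implementation: the function name `top_grams_sketch`, the sample rows, and the inclusive end date are all assumptions for illustration.

```python
import pandas as pd

# Tiny synthetic frame using the documented column names.
one_grams_demo = pd.DataFrame({
    "ngram": ["covid", "vaccine", "covid", "footy", "vaccine"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01", "2021-03-01"]
    ),
    "total_frequency": [500, 200, 300, 150, 900],
})

def top_grams_sketch(data, start_date, end_date, top_n):
    """Sum total_frequency per n-gram inside an inclusive date range (assumed)."""
    start, end = pd.Timestamp(start_date), pd.Timestamp(end_date)
    in_range = data[data["date"].between(start, end)]
    totals = in_range.groupby("ngram")["total_frequency"].sum()
    return totals.nlargest(top_n).reset_index()

top_demo = top_grams_sketch(one_grams_demo, "2021-01-01", "2021-02-01", 2)
print(top_demo)
```

In this sketch the March row for `vaccine` falls outside the range, so it does not contribute to the ranking; widening the range would change which words come out on top.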

The analysis uses the 1-grams dataset, which contains individual word frequencies from tweets posted between 2021 and 2023.

In [2]:
#load dataset - adjust the path to your dataset

one_grams = explore.load_data('/home/fleetr/ldaca_notebooks/data/1grams_20210101-20230712.csv')

In [3]:
#top n words over time period
words_date = explore.top_grams_in_date_range(one_grams, '2021-01-01', '2021-02-01', 30)

In [None]:
#keyword search for word occurrence by date

keyword_date = explore.keyword_search_in_date_range(one_grams, 'word', '2021-01-01', '2021-02-01')

In [None]:
#keyword with ratios

keyword_ratio = explore.keyword_search_with_ratios_in_date_range(one_grams, 'word', '2021-01-01', '2021-02-01')

# Trigram Frequencies (3-grams)

### What are 3-grams?

3-grams (trigrams) are sequences of three words extracted from tweets in the Australian Twittersphere dataset. Analyzing 3-grams helps you uncover common phrases, trending topics, and contextual patterns in conversations over time.
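
As a rough illustration of how trigrams arise from text, the sketch below slides a three-word window over two made-up tweets. Lower-casing and whitespace splitting are assumptions; this notebook does not describe how the AuTS data was actually tokenized.

```python
from collections import Counter

def trigrams(tokens):
    """Return every run of three consecutive tokens as a space-joined string."""
    return [" ".join(t) for t in zip(tokens, tokens[1:], tokens[2:])]

# Made-up example tweets.
tweets = [
    "happy new year everyone",
    "wishing you a happy new year",
]

counts = Counter()
for text in tweets:
    # Assumption: lower-case and split on whitespace.
    counts.update(trigrams(text.lower().split()))

print(counts.most_common(1))
```

A tweet of `n` words yields `n - 2` trigrams, so short tweets contribute few or no phrases.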

#### What can you use 3-grams for?

- Identifying popular phrases and expressions in Australian Twitter discussions
- Comparing phrase usage across different time periods
- Tracking the emergence or decline of specific topics or slogans
- Analyzing tweet types (retweets, quotes, replies, originals) for each phrase

#### Data Structure

The 3-grams dataset is a CSV file with the following columns (same as 1-grams, but for trigrams):

| Column                   | Description                                                                 |
|--------------------------|-----------------------------------------------------------------------------|
| `ngram`                  | The trigram (three-word phrase)                                             |
| `date`                   | The month, binned as the 1st day of each month (e.g., `2021-01-01`)        |
| `total_frequency`        | Total occurrences of the phrase in all tweets for that month                |
| `retweet_frequency`      | Occurrences in retweets                                                     |
| `quote_tweet_frequency`  | Occurrences in quote tweets                                                 |
| `reply_tweet_frequency`  | Occurrences in replies                                                      |
| `orginal_tweet_frequency`| Occurrences in original tweets                                              |
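
The per-type columns make it possible to see how a phrase circulates, for example mostly through retweets or mostly through original tweets. The sketch below computes each type's share of `total_frequency` for one synthetic row. The numbers are invented, and the source does not state whether the four type columns always sum to the total.

```python
import pandas as pd

# One synthetic row using the documented column names
# (including the spelling `orginal_tweet_frequency` given in the table).
row = pd.DataFrame({
    "ngram": ["happy new year"],
    "date": pd.to_datetime(["2021-01-01"]),
    "total_frequency": [1000],
    "retweet_frequency": [600],
    "quote_tweet_frequency": [50],
    "reply_tweet_frequency": [150],
    "orginal_tweet_frequency": [200],
})

type_cols = [
    "retweet_frequency",
    "quote_tweet_frequency",
    "reply_tweet_frequency",
    "orginal_tweet_frequency",
]

# Divide each type column by the row's total to get a share between 0 and 1.
shares = row[type_cols].div(row["total_frequency"], axis=0)
print(shares.round(2))
```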

#### Working with 3-gram Functions

The notebook provides several functions for exploring 3-grams:

- `top_grams_in_date_range(data, start_date, end_date, top_n)`:
    - `data`: The loaded 3-grams dataframe.
    - `start_date`, `end_date`: Strings in `YYYY-MM-DD` format. These define the date range to analyze. Changing these will focus the analysis on different months.
    - `top_n`: Integer, the number of top phrases to return. Increasing this gives a broader view, decreasing focuses on the most frequent.

- `keyword_search_in_date_range(data, keyword, start_date, end_date)`:
    - `data`: The loaded 3-grams dataframe.
    - `keyword`: The phrase or word to search for. Changing this lets you analyze different terms or phrases.
    - `start_date`, `end_date`: Date range for the search. Adjusting these changes the period of analysis.

- `keyword_search_with_ratios_in_date_range(data, keyword, start_date, end_date)`:
    - `data`: The loaded 3-grams dataframe.
    - `keyword`: The phrase to analyze.
    - `start_date`, `end_date`: Date range for ratio analysis.
    - This function returns both frequency and the ratio of the phrase to total tweet volume, helping you see relative importance over time.

These functions return **dataframes** for further analysis and visualization.

Example usage:

```python
tri_grams_date = explore.top_grams_in_date_range(three_grams, '2021-01-01', '2021-02-01', 30)
tri_keyword_date = explore.keyword_search_in_date_range(three_grams, 'year', '2021-01-01', '2021-02-01')
tri_keyword_ratio = explore.keyword_search_with_ratios_in_date_range(three_grams, 'happy new year', '2021-01-01', '2021-01-05')
```

Adjusting the function arguments allows you to customize your analysis for different phrases, time periods, and result sizes. For example, narrowing the date range can help you focus on specific events, while changing `top_n` lets you see more or fewer frequent phrases.
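
The ratio returned by `keyword_search_with_ratios_in_date_range` can be pictured with a small pandas sketch. The matching rule (substring match) and the denominator (the sum of all n-gram frequencies recorded for each date) are assumptions made here for illustration; the actual function may define "total tweet volume" differently.

```python
import pandas as pd

# Synthetic trigram rows with the documented column names.
tri_demo = pd.DataFrame({
    "ngram": ["happy new year", "new year eve", "the prime minister",
              "happy new year", "the prime minister"],
    "date": pd.to_datetime(["2021-01-01"] * 3 + ["2021-02-01"] * 2),
    "total_frequency": [300, 100, 600, 50, 450],
})

def keyword_ratio_sketch(data, keyword, start_date, end_date):
    start, end = pd.Timestamp(start_date), pd.Timestamp(end_date)
    in_range = data[data["date"].between(start, end)]
    # Denominator (assumption): all n-gram occurrences recorded for each date.
    volume = in_range.groupby("date")["total_frequency"].sum()
    # Numerator (assumption): rows whose n-gram contains the keyword as a substring.
    hits = in_range[in_range["ngram"].str.contains(keyword, regex=False)]
    freq = hits.groupby("date")["total_frequency"].sum()
    out = pd.DataFrame({"frequency": freq, "volume": volume})
    out = out.fillna({"frequency": 0})
    out["ratio"] = out["frequency"] / out["volume"]
    return out.rename_axis("date").reset_index()

ratio_demo = keyword_ratio_sketch(tri_demo, "new year", "2021-01-01", "2021-02-01")
print(ratio_demo)
```

A ratio is useful because raw frequency rises and falls with overall activity, whereas the ratio shows whether a phrase is taking up a larger or smaller share of the conversation.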

The analysis uses the 3-grams dataset, which contains frequencies of word triplets from tweets posted between 2021 and 2023, filtered to include only trigrams that occur at least 100 times (on a per-day basis).

In [3]:
#load dataset - adjust the path to your dataset

three_grams = explore.load_data('/home/fleetr/ldaca_notebooks/data/3grams_min100_20210101-20230712.csv')

In [None]:
tri_grams_date = explore.top_grams_in_date_range(three_grams, '2021-01-01', '2021-02-01', 30)

In [None]:
tri_keyword_date = explore.keyword_search_in_date_range(three_grams, 'year', '2021-01-01', '2021-02-01')

In [None]:
tri_keyword_ratio = explore.keyword_search_with_ratios_in_date_range(three_grams, 'happy new year', '2021-01-01', '2021-01-05') #example
tri_keyword_ratio

# Trigram (Emoji) Frequencies (3-grams)

### What are 3-grams with emojis?

3-grams (trigrams) with emojis are sequences of three tokens, extracted from tweets in the Australian Twittersphere dataset, in which each emoji counts as an individual word. Analyzing them helps you uncover common phrases, trending topics, and contextual patterns in conversations over time.
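
To show what "emojis as individual words" means in practice, the sketch below separates each emoji into its own token before building trigrams. The emoji pattern is a rough assumption that covers only the main pictographic Unicode blocks, and it is not the tokenizer used to build the AuTS dataset.

```python
import re

# Rough emoji matcher (assumption): main pictographic blocks only.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def tokenize_with_emoji(text):
    """Split on whitespace, treating every matched emoji as its own token."""
    spaced = EMOJI.sub(lambda m: f" {m.group(0)} ", text)
    return spaced.lower().split()

# "\U0001F60A" is the smiling face emoji, "\U0001F389" is the party popper.
tokens = tokenize_with_emoji("Happy new year\U0001F60A\U0001F389 everyone")
emoji_trigrams = [" ".join(t) for t in zip(tokens, tokens[1:], tokens[2:])]
print(emoji_trigrams)
```

Because the two emojis become separate tokens, the example produces four trigrams instead of two, including ones that mix words and emojis.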

#### What can you use 3-grams for?

- Identifying popular phrases and expressions in Australian Twitter discussions
- Comparing phrase usage across different time periods
- Tracking the emergence or decline of specific topics or slogans
- Analyzing tweet types (retweets, quotes, replies, originals) for each phrase

#### Data Structure

The emoji 3-grams dataset is a CSV file with the following columns (the same columns as the 1-grams and 3-grams datasets, but with emojis counted as single words):

| Column                   | Description                                                                 |
|--------------------------|-----------------------------------------------------------------------------|
| `ngram`                  | The trigram (three-word phrase), with emojis as single words                |
| `date`                   | The month, binned as the 1st day of each month (e.g., `2021-01-01`)        |
| `total_frequency`        | Total occurrences of the phrase in all tweets for that month                |
| `retweet_frequency`      | Occurrences in retweets                                                     |
| `quote_tweet_frequency`  | Occurrences in quote tweets                                                 |
| `reply_tweet_frequency`  | Occurrences in replies                                                      |
| `orginal_tweet_frequency`| Occurrences in original tweets                                              |

#### Working with 3-gram Functions

The notebook provides several functions for exploring 3-grams with emojis:

- `top_grams_in_date_range(data, start_date, end_date, top_n)`: (the same function used for 1-grams and 3-grams)
    - `data`: The loaded 3-grams dataframe.
    - `start_date`, `end_date`: Strings in `YYYY-MM-DD` format. These define the date range to analyze. Changing these will focus the analysis on different months.
    - `top_n`: Integer, the number of top phrases to return. Increasing this gives a broader view, decreasing focuses on the most frequent.

- `emoji_search_in_date_range(data, keyword, start_date, end_date)`: (used here in place of `keyword_search_in_date_range`)
    - `data`: The loaded emoji 3-grams dataframe.
    - `keyword`: The emoji, word, or phrase to search for. Changing this lets you analyze different terms or phrases.
    - `start_date`, `end_date`: Date range for the search. Adjusting these changes the period of analysis.

- `emoji_search_with_ratios_in_date_range(data, keyword, start_date, end_date)`: (used here in place of `keyword_search_with_ratios_in_date_range`)
    - `data`: The loaded emoji 3-grams dataframe.
    - `keyword`: The emoji, word, or phrase to analyze.
    - `start_date`, `end_date`: Date range for ratio analysis.
    - This function returns both frequency and the ratio of the phrase to total tweet volume, helping you see relative importance over time.

These functions return **dataframes** for further analysis and visualization.

Example usage:

```python
top_emoji_grams = explore.top_grams_in_date_range(emoji_grams, '2021-01-01', '2021-02-01', 30)
emoji_date = explore.emoji_search_in_date_range(emoji_grams, 'year', '2021-01-01', '2021-02-01')
emoji_ratio = explore.emoji_search_with_ratios_in_date_range(emoji_grams, 'happy new year', '2021-01-01', '2021-01-05')
```

Adjusting the function arguments allows you to customize your analysis for different phrases, time periods, and result sizes. For example, narrowing the date range can help you focus on specific events, while changing `top_n` lets you see more or fewer frequent phrases.

The analysis uses the emoji 3-grams dataset, which contains frequencies of word triplets (with emojis as single words) from tweets posted between 2021 and 2023, filtered to include only trigrams that occur at least 100 times (on a per-day basis).

In [None]:
emoji_grams = explore.load_data('/home/fleetr/ldaca_notebooks/data/merged_ngrams_emoji_as_individual_words_min_100.csv')

In [None]:
top_emoji_grams = explore.top_grams_in_date_range(emoji_grams, '2021-01-01', '2021-02-01', 30)

In [None]:
emoji_date = explore.emoji_search_in_date_range(emoji_grams, '😊', '2021-01-01', '2021-02-01')

In [None]:
emoji_ratio = explore.emoji_search_with_ratios_in_date_range(emoji_grams, '😊', '2021-01-01', '2021-02-01')

## Word Frequency Comparison Plot

This section demonstrates the use of `plot_keyword_frequencies_comparison` to visualize and compare the daily frequencies of multiple keywords over a specified time period.

### What does this plot show?

The comparison plot allows you to see how the usage of different words or phrases changes over time in the Australian Twittersphere dataset. You can use either the 1-grams (single words) or 3-grams (three-word phrases) datasets as input, depending on whether you want to compare individual words or common phrases.

#### Why compare word or phrase frequencies?

- **Trend Analysis:** Comparing keywords helps you identify which topics are rising or falling in popularity, and spot correlations or divergences between them.
- **Event Detection:** Sudden spikes in certain words or phrases may indicate major events, news, or viral discussions.
- **Contextual Insights:** Using 3-grams lets you focus on specific expressions or slogans, providing richer context than single words.
- **Customizable Focus:** By changing the list of keywords or phrases, you can tailor the analysis to your research interests (e.g., health, politics, social movements).

#### How do arguments affect the plot?

- **Dataset Choice (1-grams vs 3-grams):**
    - Using 1-grams focuses on individual word usage.
    - Using 3-grams highlights phrase-level trends, which can reveal more specific topics or memes.
- **Keywords/Phrases List:**
    - The words or phrases you choose determine what is compared. Pick terms relevant to your research question.
- **Date Range:**
    - Adjusting the start and end dates lets you zoom in on particular periods (e.g., during an election, pandemic, or other events).

#### Example Use Cases

- Compare health-related terms (e.g., 'covid', 'vaccine', 'health') to see how public conversation shifts.
- Track the popularity of political slogans or hashtags using 3-grams.
- Analyze the impact of major news stories by comparing relevant keywords before, during, and after the event.

#### Example usage:

```python
keywords_to_compare = ['covid', 'vaccine', 'health']
explore.plot_keyword_frequencies_comparison(one_grams, keywords_to_compare, '2021-03-01', '2021-03-31')

# For 3-grams:
# keywords_for_trigrams = ['new year', 'covid case', 'prime minister']
# explore.plot_keyword_frequencies_comparison(three_grams, keywords_for_trigrams, '2022-01-01', '2022-01-07')
```

By adjusting the dataset, keywords, and date range, you can explore a wide variety of trends and patterns in Twitter conversations.
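
A comparison plot needs one series per keyword on a shared date axis. As a sketch of that data preparation step only, the example below reshapes synthetic rows into a wide table with pandas. It does not reproduce `plot_keyword_frequencies_comparison`, and the sample frequencies are invented.

```python
import pandas as pd

# Synthetic rows with the documented column names.
freq_demo = pd.DataFrame({
    "ngram": ["covid", "vaccine", "covid", "vaccine", "health"],
    "date": pd.to_datetime(
        ["2021-03-01", "2021-03-01", "2021-04-01", "2021-04-01", "2021-04-01"]
    ),
    "total_frequency": [900, 400, 700, 650, 120],
})

keywords = ["covid", "vaccine", "health"]
subset = freq_demo[freq_demo["ngram"].isin(keywords)]

# One row per date, one column per keyword; missing combinations become 0.
wide = subset.pivot_table(
    index="date",
    columns="ngram",
    values="total_frequency",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
# wide.plot() would draw one line per keyword if matplotlib is installed.
```

Filling missing combinations with 0 keeps every keyword on the same axis, so a term that is absent in one period shows as a dip rather than a gap.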

In [None]:
# Define a list of keywords and a date range for comparison
keywords_to_compare = ['covid', 'vaccine', 'health'] # Example keywords
start_date_compare = '2021-03-01'
end_date_compare = '2021-03-31'

# Plot for 1-grams
explore.plot_keyword_frequencies_comparison(one_grams, keywords_to_compare, start_date_compare, end_date_compare)

# Example for 3-grams (adjust keywords as needed for meaningful comparison with trigrams)
# keywords_for_trigrams = ['new year', 'covid case', 'prime minister'] 
# explore.plot_keyword_frequencies_comparison(three_grams, keywords_for_trigrams, '2022-01-01', '2022-01-07')

## Top n n-grams Daily Frequency Trend

This section demonstrates the use of `plot_top_n_grams_trend` to identify the top N n-grams within a period and then visualize their daily frequency trends.

### What does this plot show?

The daily frequency trend plot helps you discover which words or phrases are most prominent in a given time window, and how their usage changes day by day. You can use either the 1-grams (single words) or 3-grams (three-word phrases) datasets, depending on whether you want to focus on individual words or common expressions.

#### Why visualize top n-gram trends?

- **Spotlight on Key Topics:** Reveals the most talked-about words or phrases for a specific period, highlighting what captured public attention.
- **Temporal Patterns:** Shows how the frequency of top n-grams rises or falls, helping you detect cycles, bursts, or sustained interest.
- **Event Correlation:** Peaks in certain n-grams may correspond to real-world events, news, or viral moments.
- **Phrase vs. Word Analysis:** Using 3-grams can uncover trending slogans or catchphrases, while 1-grams show general word popularity.

#### How do arguments affect the plot?

- **Dataset Choice (1-grams vs 3-grams):**
    - 1-grams highlight individual word trends.
    - 3-grams focus on phrase-level trends, which can be more specific and context-rich.
- **Date Range:**
    - The start and end dates define the analysis window. Adjusting these lets you zoom in on particular events or periods.
- **Top N Value:**
    - Controls how many n-grams are shown. A higher value gives a broader view, while a lower value focuses on the most dominant terms.

#### Example Use Cases

- Identify the top words or phrases during a major news event and see how their usage evolves.
- Compare daily trends for the most popular slogans or hashtags.
- Explore shifts in conversation topics over time.

#### Example usage:

```python
start_date_trend = '2021-01-01'
end_date_trend = '2021-01-31'
top_n_count = 10
explore.plot_top_n_grams_trend(one_grams, start_date_trend, end_date_trend, top_n=top_n_count)

# For 3-grams:
# explore.plot_top_n_grams_trend(three_grams, start_date_trend, end_date_trend, top_n=top_n_count)
```

By adjusting the dataset, date range, and top N value, you can tailor the analysis to your research needs and uncover meaningful trends in Twitter conversations.
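
The two steps described in this section, ranking n-grams within a window and then following the winners over time, can be sketched in pandas as follows. This is an illustration under assumed data, not the code behind `plot_top_n_grams_trend`.

```python
import pandas as pd

# Synthetic rows with the documented column names.
trend_demo = pd.DataFrame({
    "ngram": ["covid", "lockdown", "cricket", "covid", "lockdown", "cricket"],
    "date": pd.to_datetime(["2021-01-01"] * 3 + ["2021-02-01"] * 3),
    "total_frequency": [500, 300, 100, 200, 350, 50],
})

start, end = pd.Timestamp("2021-01-01"), pd.Timestamp("2021-02-01")
window = trend_demo[trend_demo["date"].between(start, end)]

# Step 1: rank n-grams by their summed frequency over the whole window.
top_terms = window.groupby("ngram")["total_frequency"].sum().nlargest(2).index

# Step 2: keep only those n-grams and lay them out as one column per term.
trend = (
    window[window["ngram"].isin(top_terms)]
    .pivot(index="date", columns="ngram", values="total_frequency")
)
print(trend)
```

Note that the ranking uses the window total, so a term can make the top N overall while being outranked on individual dates, as `covid` is by `lockdown` in the second period here.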

In [None]:
# Define a date range and the number of top n-grams for the trend plot
start_date_trend = '2021-01-01'
end_date_trend = '2021-01-31'
top_n_count = 10 # Show trends for the top 10 n-grams

# Plot for 1-grams
explore.plot_top_n_grams_trend(one_grams, start_date_trend, end_date_trend, top_n=top_n_count)

# Example for 3-grams
# explore.plot_top_n_grams_trend(three_grams, start_date_trend, end_date_trend, top_n=top_n_count)

---
## Hashtag Analysis

This section applies the exploration functions to a dataset of hashtags.
Assumed CSV columns: `date`, `hashtag`, `total_frequency`.

### What does hashtag trend analysis show?

Hashtag analysis helps you understand which hashtags are most popular in the Australian Twittersphere and how their usage changes over time. By visualizing hashtag frequencies, you can spot trending topics, monitor campaign effectiveness, and detect bursts of activity around events.

#### Why visualize hashtag trends?

- **Trend Detection:** See which hashtags are gaining or losing popularity.
- **Event Monitoring:** Identify spikes in hashtag usage that may correspond to news, campaigns, or viral moments.
- **Community Insights:** Discover which hashtags unite conversations or signal shifts in public interest.
- **Comparative Analysis:** Compare multiple hashtags to see how they compete or overlap in usage.

#### How do arguments affect the analysis?

- **Hashtag List:**
    - The hashtags you choose to analyze determine the focus of your comparison. Select those relevant to your topic or event.
- **Date Range:**
    - Adjusting the start and end dates lets you zoom in on specific periods, such as before, during, or after an event.
- **Top N Value:**
    - Controls how many hashtags are shown in trend plots. A higher value gives a broader view, while a lower value focuses on the most dominant hashtags.

#### Example Use Cases

- Track the rise and fall of political or health-related hashtags.
- Compare hashtag usage during major events or campaigns.
- Identify which hashtags are most associated with bursts of activity.

#### Example usage:

```python
hashtags_to_compare = ['auspol', 'covid19aus', 'nswpol']
explore.plot_hashtag_frequencies_comparison(hashtags_data, hashtags_to_compare, '2021-01-01', '2021-01-10')

explore.plot_top_hashtags_trend(hashtags_data, '2021-01-01', '2021-01-10', top_n=3)
```

By adjusting the hashtag list, date range, and top N value, you can tailor the analysis to your research needs and uncover meaningful patterns in hashtag usage.
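
As a sketch of what a total-frequency lookup over the assumed columns (`date`, `hashtag`, `total_frequency`) might involve, the example below sums one hashtag's counts across a date range. The function name `hashtag_total_sketch`, the query normalization, and the sample rows are assumptions, not the behaviour of `search_hashtag_total_frequency_in_range`.

```python
import pandas as pd

# Synthetic daily rows with the assumed column names.
hashtags_demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-02", "2021-01-03"]
    ),
    "hashtag": ["auspol", "auspol", "covid19aus", "auspol"],
    "total_frequency": [1200, 1500, 300, 900],
})

def hashtag_total_sketch(data, hashtag, start_date, end_date):
    # Normalise the query: drop a leading '#' and ignore case (assumption).
    wanted = hashtag.lstrip("#").lower()
    start, end = pd.Timestamp(start_date), pd.Timestamp(end_date)
    mask = (data["hashtag"].str.lower() == wanted) & data["date"].between(start, end)
    return int(data.loc[mask, "total_frequency"].sum())

total = hashtag_total_sketch(hashtags_demo, "#AusPol", "2021-01-01", "2021-01-02")
print(total)
```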

In [2]:
hashtags_data = explore.load_data('/home/fleetr/ldaca_notebooks/data/hashtag_freq_daily_by_type_20210101_to_20240101.csv')

In [3]:
top_hashtags = explore.top_hashtags_in_date_range(hashtags_data, '2021-01-01', '2021-01-05', top_n=5)

In [4]:

hashtag_occurrences = explore.search_hashtag_in_date_range(hashtags_data, 'auspol', '2021-01-01', '2021-01-03')

In [5]:
hashtag_total_freq = explore.search_hashtag_total_frequency_in_range(hashtags_data, 'covid19aus', '2021-01-01', '2021-01-05')

In [6]:

hashtags_to_compare = ['auspol', 'covid19aus', 'nswpol']
explore.plot_hashtag_frequencies_comparison(hashtags_data, hashtags_to_compare, '2021-01-01', '2021-01-10')

In [7]:
explore.plot_top_hashtags_trend(hashtags_data, '2021-01-01', '2021-01-10', top_n=3)

---
## Domain Analysis

This section applies the exploration functions to a dataset of domains (URLs).
Assumed CSV columns: `date`, `domain`, `total_frequency`.

### What does domain trend analysis show?

Domain analysis helps you understand which web domains are most frequently shared in tweets and how their popularity changes over time. By visualizing domain frequencies, you can identify trusted sources, track the spread of news, and monitor the impact of media outlets or platforms.

#### Why visualize domain trends?

- **Source Popularity:** See which domains are most often linked in tweets, revealing influential news sites, social platforms, or campaign pages.
- **Event Tracking:** Spikes in domain sharing may correspond to breaking news, viral content, or coordinated campaigns.
- **Comparative Insights:** Compare multiple domains to see how their prominence shifts over time or in response to events.
- **Information Flow:** Understand how information spreads through different sources in the Twittersphere.

#### How do arguments affect the analysis?

- **Domain List:**
    - The domains you choose to analyze determine the focus of your comparison. Select those relevant to your research or event.
- **Date Range:**
    - Adjusting the start and end dates lets you zoom in on specific periods, such as during a news cycle or campaign.
- **Top N Value:**
    - Controls how many domains are shown in trend plots. A higher value gives a broader view, while a lower value focuses on the most dominant domains.

#### Example Use Cases

- Track the rise and fall of news or social media domains during major events.
- Compare the sharing frequency of competing news outlets.
- Identify which domains are most associated with bursts of activity or viral content.

#### Example usage:

```python
domains_to_compare = ['twitter.com', 'abc.net.au', 'youtube.com']
explore.plot_domain_frequencies_comparison(domains_data, domains_to_compare, '2021-02-01', '2021-02-10')

explore.plot_top_domains_trend(domains_data, '2021-02-01', '2021-02-10', top_n=3)
```

By adjusting the domain list, date range, and top N value, you can tailor the analysis to your research needs and uncover meaningful patterns in domain sharing.
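
For readers unsure what counts as a "domain", the sketch below reduces full URLs to their host name using the Python standard library and tallies the result. The URLs are made up, and stripping `www.` is an assumption; this notebook does not describe how the `domain` column in the AuTS data was derived.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    """Return the host part of a URL, lower-cased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Made-up example URLs.
urls = [
    "https://www.abc.net.au/news/example-story",
    "https://abc.net.au/news/another-story",
    "https://youtube.com/watch?v=example",
]

domain_counts = Counter(domain_of(u) for u in urls)
print(domain_counts.most_common())
```

Grouping by host means different articles from the same outlet count toward one domain, which is what makes comparisons between sources possible.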

In [8]:
domains_data = explore.load_data('/home/fleetr/ldaca_notebooks/data/domain_freq_daily_by_type_20210101_to_20240101.csv')

In [9]:
top_domains = explore.top_domains_in_date_range(domains_data, '2021-02-01', '2021-02-05', top_n=5)

In [10]:
domain_occurrences = explore.search_domain_in_date_range(domains_data, 'abc.net.au', '2021-02-01', '2021-02-03')

In [11]:
domain_total_freq = explore.search_domain_total_frequency_in_range(domains_data, 'youtube.com', '2021-02-01', '2021-02-05')

In [12]:
domains_to_compare = ['twitter.com', 'abc.net.au', 'youtube.com']
explore.plot_domain_frequencies_comparison(domains_data, domains_to_compare, '2021-02-01', '2021-02-10')

In [13]:
explore.plot_top_domains_trend(domains_data, '2021-02-01', '2021-02-10', top_n=3)