# Analyze Twitter Hashtag Sentiment
Sentiment analysis can be used to determine whether a given piece of text is positive, neutral, or negative in valence. In this template, you will use the Twitter API to access recent tweets using hashtags that you define. You will then compare the sentiment across these different hashtags.

To be able to use this template, the following criteria must be satisfied:
- You will need an active Twitter account.
- You will need a bearer token for accessing the Twitter API. 

To get a bearer token, you will need to navigate to this [page](https://developer.twitter.com/en/portal/petition/essential/basic-info) and sign up for Essential access. This will take you through a short verification process. When you are finished, you should be able to create a new app and generate a bearer token which will be used to access the API.

_Warning: This template will extract real Twitter data. As a result, some content may contain offensive language._

## 1. Getting Set Up
In order to access the Twitter API, you will need to use an integration to set an environment variable. To add a new integration in your Workspace, click on the Integrations icon in the far left toolbar of the Workspace editing interface. Next, click "Add Integration" and "Environment Variables". You will need to specify the name (BEARER_TOKEN) and the value (the token you were provided). You can call this "Twitter Integration". You can read more about integrations [here](https://workspace-docs.datacamp.com/integrations/environment-variables). Click "Create" and follow the remaining steps, and you should be ready to go!

The code then performs the following:
1. Installs and imports the packages you will use to retrieve Twitter data and visualize it. 
2. Sets your bearer token for accessing the Twitter API. This does not require further input if you have configured your BEARER_TOKEN environment variable correctly.
3. Sets the hashtags you want to compare. By default, this template retrieves tweets based on three data science topics. You are free to supply any hashtags you wish to use (a topic preceded by a `#` symbol).
4. Initializes a tweepy [`Client`](https://docs.tweepy.org/en/stable/client.html). This enables you to make requests to the Twitter API. It will retrieve the last ten tweets for your first hashtag as a test.

In [1]:
%%capture
# Install necessary packages
!pip install tweepy

In [None]:
# Import packages
import os
import tweepy
import pandas as pd
import re
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import plotly.express as px

# Set bearer_token for essential access
bearer_token = os.environ["BEARER_TOKEN"]

# Define 2-3 hashtags here
hashtags = ["#tableau", "#python", "#powerbi"]

# Initialize the Tweepy client
client = tweepy.Client(bearer_token=bearer_token)

# Confirm the client is initialized by printing the 10 most recent tweets using your hashtag
# (`.data` is None when the search returns no tweets, so fall back to an empty list)
for tweet in client.search_recent_tweets(hashtags[0]).data or []:
    print(tweet.text)


The code above should return the text of the past 10 tweets using the hashtag you supplied. If you have not set up your integration correctly (or are using the wrong bearer token), you may encounter an error such as:
> Unauthorized: 401 Unauthorized

If you do encounter such an error, make sure to review the instructions and try again.
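
Before re-running the cell, it can help to confirm that the environment variable is actually visible to the notebook. The following is a minimal sketch rather than part of the template; the helper name `check_bearer_token` is hypothetical, and it only checks that a value exists, not that Twitter accepts it.

```python
import os


def check_bearer_token(env=os.environ, name="BEARER_TOKEN"):
    """Return True if the named environment variable is set and non-empty."""
    token = env.get(name, "")
    return bool(token.strip())


if check_bearer_token():
    print("BEARER_TOKEN is set.")
else:
    print("BEARER_TOKEN is missing or empty; review your integration settings.")
```

If the check passes and you still see a 401 error, the token value itself is the more likely culprit.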

## 2. Create a DataFrame of Tweets
Next, you can use the client to retrieve a specified number of tweets related to a topic. The code below defines and runs a custom function that uses [`Paginator()`](https://docs.tweepy.org/en/stable/pagination.html?highlight=pagination) to return recent tweets (within the past seven days) that use a specific hashtag. There are two parameters you can customize:
- The `num_results` you want to return per hashtag. The number must be a multiple of 100, and cannot exceed 2000.
- The language (`lang`) of the tweets you want to query. This is set to English by default, but you can use other [languages](https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages) if you prefer!

The code then uses this function to iterate through the list of hashtags you defined and return a DataFrame containing all three result sets.

_Note: Depending on the number of results you return, this code can take some time to execute._
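
To make the `num_results` rule concrete: tweets are requested in batches of 100, so the number of requests is `num_results` divided by 100. The sketch below mirrors that arithmetic and the query string without calling the API; `plan_batches` and `build_query` are hypothetical helper names used only for illustration.

```python
def plan_batches(num_results, max_results=100, cap=2000):
    """Return how many pages of `max_results` tweets cover `num_results`."""
    if num_results <= 0 or num_results > cap:
        raise ValueError(f"`num_results` must be between {max_results} and {cap}.")
    if num_results % max_results != 0:
        raise ValueError(f"`num_results` must be a multiple of {max_results}.")
    return num_results // max_results


def build_query(hashtag, lang="en"):
    """Build a search query for one hashtag in one language, excluding retweets."""
    return f"{hashtag} lang:{lang} -is:retweet"


print(plan_batches(1000))      # 10
print(build_query("#python"))  # #python lang:en -is:retweet
```

With three hashtags and 1000 results each, this works out to 30 requests in total, which is why larger values take longer to run.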

In [None]:
# Define a function to query tweets
def get_tweets(hashtag, num_results=1000, lang="en"):
    # Initialize an empty DataFrame to collect the tweets
    tweets_df = pd.DataFrame()
    
    # Validate num_results and calculate the number of batches to request
    if num_results > 2000:
        raise ValueError("`num_results` must be less than or equal to 2000.")
    elif num_results % 100 != 0:
        raise ValueError("`num_results` must be a multiple of 100.")
    max_results = 100
    limit = num_results // max_results  # Integer division keeps `limit` a whole number

    # Iterate through batches of tweets
    for tweet_batch in tweepy.Paginator(
        client.search_recent_tweets,
        query=f"{hashtag} lang:{lang} -is:retweet",
        max_results=max_results,
        limit=limit,
    ):
        # Retrieve data from batch and add it to DataFrame
        data = tweet_batch.data
        batch_data = pd.DataFrame(data)
        batch_data["hashtag"] = hashtag
        tweets_df = pd.concat([tweets_df, batch_data])

    # Return DataFrame with a fresh index (the old per-batch index is dropped)
    return tweets_df.reset_index(drop=True)


# Initialize a DataFrame to store the tweets
sentiment_df = pd.DataFrame()

# Iterate through the hashtags and add the data
for tag in hashtags:
    temp_df = get_tweets(tag, num_results=1000, lang="en")  # Specify the language here
    sentiment_df = pd.concat([sentiment_df, temp_df], ignore_index=True)

# Preview the combined DataFrame
sentiment_df

## 3. Process the Text
The next step is to perform some light cleaning on the tweets and then perform a sentiment analysis. Two custom functions are defined to fulfill these tasks:

- The first function uses [`re.sub()`](https://docs.python.org/3/library/re.html) to define a regular expression pattern and remove unwanted user mentions and links. 
- The second function uses the NLTK [`SentimentIntensityAnalyzer()`](https://www.nltk.org/api/nltk.sentiment.vader.html#module-nltk.sentiment.vader) to generate a compound sentiment score for each tweet. This score is an aggregate of negative, neutral, and positive scores and ranges from -1 (very negative) to 1 (very positive).
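
As a rough illustration of the cleaning step, the sketch below applies a regular expression to a made-up tweet. The pattern and the sample text are assumptions chosen for this example; the pattern used in the template's own code may differ.

```python
import re

# Assumed pattern: user mentions, http(s) links, and newline characters
PATTERN = r"@\w+|https?://\S+|\n"

sample = "Loving the new release @some_user!\nDetails: https://example.com/post #python"
cleaned = re.sub(PATTERN, " ", sample)

# Collapse the repeated spaces left behind by the substitutions
cleaned = " ".join(cleaned.split())
print(cleaned)  # Loving the new release ! Details: #python
```

Note that the hashtag itself is kept, since only mentions, links, and newlines are targeted.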

**Sentiment Analysis Citation**
>Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

In [None]:
# Define function to strip away user mentions, links, and newlines
def clean_tweet(tweet):
    # Raw string avoids invalid-escape warnings; \S+ stops at the end of the link
    pattern = r"@\w+|https?://\S+|\n"
    return re.sub(pattern, " ", tweet)

# Initialize the analyzer once so it is not rebuilt for every tweet
sid = SentimentIntensityAnalyzer()

# Define function to calculate the compound sentiment score
def calculate_sentiment(text):
    scores = sid.polarity_scores(text)
    return scores['compound']

# Clean the tweet and store in a new column
sentiment_df['processed_text'] = sentiment_df['text'].apply(clean_tweet)

# Generate sentiment scores
sentiment_df['sentiment_score'] = sentiment_df['processed_text'].apply(calculate_sentiment)

# Preview the cleaned and analyzed tweets
sentiment_df

## 4. Visualize the Sentiment Per Hashtag
### 4a. Bar Chart
A bar chart is a helpful way to visualize the mean sentiment score per hashtag. The following code calculates the mean sentiment per hashtag and plots the data in a Plotly [bar chart](https://plotly.com/python/bar-charts/).

You can interact with the plot by hovering over it to learn the precise mean for each hashtag.

In [None]:
# Aggregate the DataFrame and return the mean sentiment per hashtag
hashtag_means = (
    sentiment_df.groupby("hashtag")[["sentiment_score"]]
    .mean()
    .sort_values(by="sentiment_score")
)

# Create the bar chart
fig = px.bar(
    hashtag_means,
    x="sentiment_score",
    y=hashtag_means.index,
    labels={"sentiment_score": "Average Sentiment Score", "hashtag": "Hashtag"},
)

# Update the layout and show the figure
fig.update_layout(
    template="plotly_white",
    title_text="Average Sentiment Score of Twitter Hashtags",
    title_x=0.5,
    width=800,  # Adjust the width of the plot
    height=400,  # Adjust the height of the plot
)
fig.show()

### 4b. Strip Chart
A strip chart shows the individual sentiment scores for each hashtag. The following code creates a Plotly [strip chart](https://plotly.com/python/strip-charts/) with one strip per hashtag.

This creates an interactive visualization that allows you to hover over each point and view the sentiment score and first fifty characters of the tweets (tweets longer than 50 characters are shortened for visibility purposes).

Be sure to examine the points on the upper and lower ends for each hashtag. Do the tweets correspond with the score assigned to them?
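
To see how a compound score maps to a sentiment category, here is a small standard-library sketch of the binning, assuming the bin edges -0.6, -0.2, 0.2, and 0.6 with each upper edge included in its bin. The function name `categorize` is hypothetical and is not used by the template.

```python
from bisect import bisect_left

EDGES = [-0.6, -0.2, 0.2, 0.6]
LABELS = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]


def categorize(score):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if not -1 <= score <= 1:
        raise ValueError("score must be between -1 and 1")
    # bisect_left places a score equal to an edge in the lower bin
    return LABELS[bisect_left(EDGES, score)]


for score in (-0.8, -0.2, 0.0, 0.45, 0.9):
    print(score, categorize(score))
```

A score of exactly -0.2, for example, lands in "Negative" rather than "Neutral" under this convention.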

In [None]:
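# Process the Tweet text and sentiment scores for easier visualization
# (only tweets longer than 50 characters are truncated and given an ellipsis)
text = sentiment_df["processed_text"]
sentiment_df["vis_text"] = text.where(text.str.len() <= 50, text.str[:50] + "...")
sentiment_df["sentiment_category"] = pd.cut(
    sentiment_df["sentiment_score"],
    bins=[-1, -0.6, -0.2, 0.2, 0.6, 1],
    labels=["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"],
    include_lowest=True,  # Keep a score of exactly -1 in the lowest bin
)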

# Create color mapping for sentiment categories
color_map = {
    "Very Negative": "#7d0404",
    "Negative": "#b86500",
    "Neutral": "#b7a300",
    "Positive": "#8db700",
    "Very Positive": "#00b54b",
}

# Create the strip chart
fig = px.strip(
    sentiment_df,
    x="hashtag",
    y="sentiment_score",
    color="sentiment_category",
    labels={  # Assign new labels to the plot
        "sentiment_score": "Sentiment Score",
        "hashtag": "Hashtag",
        "vis_text": "Tweet",
        "sentiment_category": "Sentiment",
    },
    category_orders={
        "sentiment_category": [
            "Very Positive",
            "Positive",
            "Neutral",
            "Negative",
            "Very Negative",
        ]
    },
    hover_data=["vis_text"],
    stripmode="overlay",
    color_discrete_map=color_map,
)

# Update the layout and show the figure
fig.update_layout(
    template="plotly_white",
    title_text="Sentiment Score Distributions Per Hashtag",
    title_x=0.5,
    width=800,  # Adjust the width of the plot
    height=600,  # Adjust the height of the plot
)
fig.show()

## 5. Next Steps
This template serves as an introduction to sentiment analysis, but there are many different paths you can take from here. You may want to pursue other forms of social media analysis, predict sentiment using natural language processing and machine learning, or learn about different ways to visualize data. As a next step, we recommend these DataCamp courses!
- If you are interested in social media analysis, check out [Analyzing Social Media Data in Python](https://app.datacamp.com/learn/courses/analyzing-social-media-data-in-python). There you can learn more techniques to process and analyze Twitter data. 
- If you are interested in sentiment analysis, we encourage you to look into [Sentiment Analysis in Python](https://app.datacamp.com/learn/courses/sentiment-analysis-in-python). The course mentioned above, Analyzing Social Media Data in Python, also contains content on sentiment analysis.
- Finally, if you want to learn more about creating beautiful and interactive plots with Plotly, we have a [course](https://app.datacamp.com/learn/courses/introduction-to-data-visualization-with-plotly-in-python) to teach you more ways to create interactive visualizations.