# Collecting data through the Twitter API

First, we authenticate with the Twitter API using Tweepy. Authentication requires a consumer key and secret plus an access token and secret, and it must succeed before we can collect any data.

In [None]:
from tweepy import OAuthHandler
from tweepy import API

# Consumer key authentication
auth = OAuthHandler(consumer_key, consumer_secret)

# Access key authentication
auth.set_access_token(access_token, access_token_secret)

# Set up the API with the authentication handler
api = API(auth)

After authenticating, we can collect tweets that match specific keywords using Tweepy's Stream class. We list the keywords to track, attach a listener object to handle incoming tweets (here SListener, a custom listener class defined separately), and start the stream.

In [None]:
from tweepy import Stream

# Set up words to track
keywords_to_track = ["#rstats", "#python"]

# Instantiate the SListener object 
listen = SListener(api)

# Instantiate the Stream object
stream = Stream(auth, listen)

# Begin collecting data
stream.filter(track = keywords_to_track)

With these steps, we have successfully set up authentication and initiated the collection of Twitter data based on specific keywords. This allows us to gather data relevant to our interests or projects.

Each collected tweet arrives as a JSON string. Below, we parse one into a Python object and print its text and unique ID, followed by user-related information: the user's handle, follower count, location, and description. For a retweet (`rt`, parsed the same way), we then print the retweet's text, the text of the original tweet, the handle of the user who retweeted, and the handle of the user who posted the original tweet.

In [None]:
# Load JSON
import json

# Convert from JSON to Python object
tweet = json.loads(tweet_json)

# Print tweet text
print(tweet['text'])

# Print tweet id
print(tweet['id'])

# Print user handle
print(tweet['user']['screen_name'])

# Print user follower count
print(tweet['user']['followers_count'])

# Print user location
print(tweet['user']['location'])

# Print user description
print(tweet['user']['description'])

# Print the text of the retweet (`rt` is a retweet already parsed into a dict)
print(rt['text'])

# Print the text of the original tweet that was retweeted
print(rt['retweeted_status']['text'])

# Print the handle of the user who retweeted
print(rt['user']['screen_name'])

# Print the handle of the user who posted the original tweet
print(rt['retweeted_status']['user']['screen_name'])
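
Only retweets carry a `retweeted_status` field, so indexing it directly raises a `KeyError` on an ordinary tweet. The following is a minimal sketch of a safer access pattern. The two JSON strings and the helper name `original_author` are hypothetical and exist only for illustration; real tweets contain many more fields.

```python
import json

# Hypothetical, heavily trimmed tweet payloads
plain_json = '{"id": 1, "text": "Learning #python", "user": {"screen_name": "alice"}}'
retweet_json = (
    '{"id": 2, "text": "RT @alice: Learning #python", '
    '"user": {"screen_name": "bob"}, '
    '"retweeted_status": {"text": "Learning #python", '
    '"user": {"screen_name": "alice"}}}'
)

def original_author(tweet):
    """Return the handle of the retweeted user, or None for a non-retweet."""
    retweeted = tweet.get('retweeted_status')
    if retweeted is None:
        return None
    return retweeted['user']['screen_name']

plain = json.loads(plain_json)
rt_example = json.loads(retweet_json)

print(original_author(plain))       # None
print(original_author(rt_example))  # alice
```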

# Processing Twitter Text

**Tweet Items and Tweet Flattening**

Several fields in the Twitter JSON contain text. A typical tweet has the tweet text, the user's description, and the user's location. Two cases add complexity: extended tweets, which store the full text of messages longer than 140 characters in a separate field, and quoted tweets, which hold both the quoted tweet's text and the commentary added by the quoting user.

In [None]:
# Print the tweet text
print(quoted_tweet['text'])

# Print the quoted tweet text
print(quoted_tweet['quoted_status']['text'])

# Print the quoted tweet's extended (140+) text
print(quoted_tweet['quoted_status']['extended_tweet']['full_text'])

# Print the quoted user location
print(quoted_tweet['quoted_status']['user']['location'])

**A Tweet Flattening Function**

In Twitter analysis, we often deal with hundreds or thousands of tweets. To streamline the process, we can create a function called flatten_tweets() to flatten JSON data containing tweets. We'll use this function frequently, adjusting it as needed for different data types.

In [None]:
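def flatten_tweets(tweets_json):
    """ Flattens out tweet dictionaries so relevant JSON
        is in a top-level dictionary. Nested keys are joined
        with '-', e.g. 'retweeted_status-text'."""
    tweets_list = []
    
    # Iterate through each tweet
    for tweet in tweets_json:
        tweet_obj = json.loads(tweet)
    
        # Store the user screen name in 'user-screen_name'
        tweet_obj['user-screen_name'] = tweet_obj['user']['screen_name']

        # Store the user-defined location in 'user-location' (used for mapping later)
        tweet_obj['user-location'] = tweet_obj['user'].get('location')
    
        # Check if this is a 140+ character tweet
        if 'extended_tweet' in tweet_obj:
            # Store the extended tweet text in 'extended_tweet-full_text'
            tweet_obj['extended_tweet-full_text'] = tweet_obj['extended_tweet']['full_text']
    
        if 'retweeted_status' in tweet_obj:
            # Store the retweet user screen name in 'retweeted_status-user-screen_name'
            tweet_obj['retweeted_status-user-screen_name'] = tweet_obj['retweeted_status']['user']['screen_name']

            # Store the retweet text in 'retweeted_status-text'
            tweet_obj['retweeted_status-text'] = tweet_obj['retweeted_status']['text']

            # Store the retweet's extended text, if present
            if 'extended_tweet' in tweet_obj['retweeted_status']:
                tweet_obj['retweeted_status-extended_tweet-full_text'] = tweet_obj['retweeted_status']['extended_tweet']['full_text']

        if 'quoted_status' in tweet_obj:
            # Store the quoted tweet text in 'quoted_status-text'
            tweet_obj['quoted_status-text'] = tweet_obj['quoted_status']['text']

            # Store the quoted tweet's extended text, if present
            if 'extended_tweet' in tweet_obj['quoted_status']:
                tweet_obj['quoted_status-extended_tweet-full_text'] = tweet_obj['quoted_status']['extended_tweet']['full_text']
            
        tweets_list.append(tweet_obj)
    return tweets_list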
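
Because the function gets adjusted for different fields, one way to avoid repeating similar `if` blocks is to drive the flattening from a list of key paths. The sketch below is an alternative formulation, not the function used in the rest of this notebook. The names `FIELD_PATHS`, `get_nested`, and `flatten_tweet` are hypothetical, and the sample tweet is made up. It keeps the hyphen-joined column naming used above.

```python
import json

# Nested key paths to copy to the top level (illustrative selection)
FIELD_PATHS = [
    ('user', 'screen_name'),
    ('user', 'location'),
    ('extended_tweet', 'full_text'),
    ('retweeted_status', 'text'),
    ('retweeted_status', 'user', 'screen_name'),
    ('quoted_status', 'text'),
]

def get_nested(obj, path):
    """Follow `path` through nested dicts; return None if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

def flatten_tweet(tweet_obj):
    """Return a copy of the tweet with each path that was found stored as 'a-b-c'."""
    flat = dict(tweet_obj)
    for path in FIELD_PATHS:
        value = get_nested(tweet_obj, path)
        if value is not None:
            flat['-'.join(path)] = value
    return flat

# Hypothetical, trimmed retweet for demonstration
raw = (
    '{"text": "RT @a: hi", '
    '"user": {"screen_name": "b", "location": null}, '
    '"retweeted_status": {"text": "hi", "user": {"screen_name": "a"}}}'
)
flat = flatten_tweet(json.loads(raw))

print(flat['user-screen_name'])                   # b
print(flat['retweeted_status-user-screen_name'])  # a
print('quoted_status-text' in flat)               # False
```

Fields that are missing or null are simply not added, so a DataFrame built from such dictionaries shows them as missing values.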

**Loading Tweets into a DataFrame**

Now, let's import this processed data into a pandas DataFrame for scalable tweet analysis. We'll be working with a dataset containing tweets that include either the '#rstats' or '#python' hashtag, stored as a list of tweet JSON objects in data_science_json.

In [None]:
# Import pandas
import pandas as pd

# Flatten the tweets and store in `tweets`
tweets = flatten_tweets(data_science_json)

# Create a DataFrame from `tweets`
ds_tweets = pd.DataFrame(tweets)

# Print out the first 5 tweets from this dataset
print(ds_tweets['text'].values[0:5])

# Counting Words

**Finding Keywords**

Counting known keywords is a fundamental step in text data analysis, especially when dealing with Twitter datasets. In this dataset, we aim to count the occurrences of specific hashtags within a collection of tweets related to data science. To achieve this, we'll leverage string methods available in the pandas Series object.

In [None]:
# Import numpy for the sums below
import numpy as np

# Flatten the tweets and store them
flat_tweets = flatten_tweets(data_science_json)

# Convert to DataFrame
ds_tweets = pd.DataFrame(flat_tweets)

# Find mentions of #python in 'text'
python = ds_tweets['text'].str.contains('#python', case=False)

# Print the proportion of tweets mentioning #python
print("Proportion of #python tweets:", np.sum(python) / ds_tweets.shape[0])

**Looking for Text in All the Wrong Places**

It's important to remember that relevant text may not always reside in the main text field of a tweet. It can also be found in the extended_tweet, retweeted_status, or quoted_status. Therefore, we need to check all of these fields to ensure we account for all relevant text. To streamline this process, we'll create a function.

The function check_word_in_tweet() checks whether a word is present in the text fields of a flattened Twitter dataset: the main text and the extended text (for tweets longer than 140 characters) of the tweet itself, of the retweeted tweet, and of the quoted tweet. It returns a Boolean pandas Series with one value per tweet.

In [None]:
def check_word_in_tweet(word, data):
    """Checks if a word is in a Twitter dataset's text. 
    Checks text and extended tweet (140+ character tweets) for tweets,
    retweets and quoted tweets.
    Returns a logical pandas Series.
    """
    contains_column = data['text'].str.contains(word, case=False)
    contains_column |= data['extended_tweet-full_text'].str.contains(word, case=False)
    contains_column |= data['quoted_status-text'].str.contains(word, case=False)
    contains_column |= data['quoted_status-extended_tweet-full_text'].str.contains(word, case=False)
    contains_column |= data['retweeted_status-text'].str.contains(word, case=False)
    contains_column |= data['retweeted_status-extended_tweet-full_text'].str.contains(word, case=False)
    
    return contains_column
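
The function above assumes that every flattened column exists in the DataFrame. If a collection happens to contain no quoted or extended tweets, those columns are absent and the lookup raises a `KeyError`. A more defensive variant might look like the sketch below. The name `check_word_in_tweet_safe`, the `TEXT_COLUMNS` list, and the demo data are assumptions for illustration. It also passes `regex=False` so the word is matched literally, and `na=False` so missing text counts as "no match".

```python
import pandas as pd

# Flattened text columns to search, following the naming used in this notebook
TEXT_COLUMNS = [
    'text',
    'extended_tweet-full_text',
    'quoted_status-text',
    'quoted_status-extended_tweet-full_text',
    'retweeted_status-text',
    'retweeted_status-extended_tweet-full_text',
]

def check_word_in_tweet_safe(word, data):
    """Boolean Series: True where `word` appears in any available text column."""
    contains = pd.Series(False, index=data.index)
    for col in TEXT_COLUMNS:
        if col in data.columns:
            contains |= data[col].str.contains(word, case=False, regex=False, na=False)
    return contains

# Hypothetical demo data: only two of the six columns are present
demo = pd.DataFrame({
    'text': ['I love #Python', 'nothing here', 'truncated text'],
    'extended_tweet-full_text': [None, None, 'the full text mentions #python'],
})
result = check_word_in_tweet_safe('#python', demo)
print(result.tolist())  # [True, False, True]
```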

**Comparing #python to #rstats**

With our versatile function to check for word occurrences in various tweet fields, we can now apply it to multiple words and make comparisons. Returning to our example with the data science hashtag dataset, we want to measure how frequently #rstats appears compared to #python.

In [None]:
# Find mentions of #python in all text fields
python = check_word_in_tweet("#python", ds_tweets)

# Find mentions of #rstats in all text fields
rstats = check_word_in_tweet("#rstats", ds_tweets)

# Print the proportion of tweets mentioning #python
print("Proportion of #python tweets:", np.sum(python) / ds_tweets.shape[0])

# Print the proportion of tweets mentioning #rstats
print("Proportion of #rstats tweets:", np.sum(rstats) / ds_tweets.shape[0])

# Time Series Analysis

**Creating a Time Series Data Frame**

Time series data lets us track how the prevalence of specific words or phrases in tweets changes over time. The first step is to put the DataFrame into a form that pandas time series methods accept: convert the 'created_at' column to a datetime type and set it as the index.

In [None]:
# Print 'created_at' to see the original format of datetime in Twitter data
print(ds_tweets['created_at'].head())

# Convert the 'created_at' column to pandas datetimes (datetime64)
ds_tweets['created_at'] = pd.to_datetime(ds_tweets['created_at'])

# Print 'created_at' to see the new format
print(ds_tweets['created_at'].head())

# Set the index of ds_tweets to 'created_at'
ds_tweets = ds_tweets.set_index('created_at')
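
Twitter's JSON stores `created_at` as text such as `Wed Jan 31 02:00:05 +0000 2018`. `pd.to_datetime` can usually infer this layout, but passing an explicit `format` makes the parsing unambiguous. The sketch below uses made-up timestamps; the format string is an assumption based on that layout.

```python
import pandas as pd

# Hypothetical timestamps in the `created_at` layout
raw = pd.Series(['Wed Jan 31 02:00:05 +0000 2018',
                 'Wed Jan 31 03:15:00 +0000 2018'])

# %z reads the '+0000' offset, so the result is timezone-aware
parsed = pd.to_datetime(raw, format='%a %b %d %H:%M:%S %z %Y')

print(parsed.dt.hour.tolist())  # [2, 3]
print(parsed.dt.year.tolist())  # [2018, 2018]
```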

**Generating Mean Frequency**

To analyze and visualize word prevalence over time, we need to create a metric that can be graphed. Our check_word_in_tweet() function returns a boolean Series, where True is equivalent to 1. We can utilize this to produce columns for each keyword of interest and understand their prevalence over time.

In [None]:
# Create a 'python' column
ds_tweets['python'] = check_word_in_tweet('#python', ds_tweets)

# Create an 'rstats' column
ds_tweets['rstats'] = check_word_in_tweet('#rstats', ds_tweets)

**Plotting Mean Frequency**

Finally, we'll create a daily average of hashtag mentions and plot them over time. We'll calculate the proportions from the two boolean Series on a daily basis and then visualize them.

In [None]:
# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Average of 'python' column by day
mean_python = ds_tweets['python'].resample('D').mean()

# Average of 'rstats' column by day
mean_rstats = ds_tweets['rstats'].resample('D').mean()

# Plot mean 'python' by day (green) and mean 'rstats' by day (blue)
plt.plot(mean_python.index.day, mean_python, color='green')
plt.plot(mean_rstats.index.day, mean_rstats, color='blue')

# Add labels and show
plt.xlabel('Day')
plt.ylabel('Frequency')
plt.title('Language Mentions Over Time')
plt.legend(('#python', '#rstats'))
plt.show()

![image](meanfreq.png)


This time series analysis helps us understand how the popularity of these keywords evolves on Twitter, providing insights into trends and user interests.
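
To see why the daily mean of a Boolean column can be read as a proportion, consider this small sketch with made-up data: four hypothetical tweets over two days, flagged `True` when they mention a keyword.

```python
import pandas as pd

# Hypothetical tweet timestamps: two tweets on each of two days
idx = pd.to_datetime(['2018-01-30 10:00', '2018-01-30 18:00',
                      '2018-01-31 09:00', '2018-01-31 21:00'])

# True where the (imaginary) tweet mentions the keyword
mentions = pd.Series([True, False, True, True], index=idx)

# True counts as 1 and False as 0, so the mean is the share of matching tweets
daily_share = mentions.resample('D').mean()
print(daily_share.tolist())  # [0.5, 1.0]
```

On the first day one of two tweets matches (0.5); on the second day both do (1.0).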

# Sentiment Analysis

**Loading VADER**

Sentiment analysis gives a direct, interpretable first reading of what text data expresses. It has limitations, but it is a good starting point when working with text. Several out-of-the-box tools are available in Python; here we use VADER through NLTK's SentimentIntensityAnalyzer.

In [None]:
# Load SentimentIntensityAnalyzer (needs the VADER lexicon: nltk.download('vader_lexicon'))
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Instantiate a new SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

# Generate sentiment scores
sentiment_scores = ds_tweets['text'].apply(sid.polarity_scores)

**Calculating Sentiment Scores**

To get a rough measure of sentiment towards a specific hashtag, we can average the sentiment of the tweets that mention it. Because a single tweet can mix several elements, such as commentary, quoted text, and links, it is worth reading example tweets alongside the scores produced by automated text methods.

In [None]:
# Isolate the compound score (polarity_scores returns a dict per tweet)
sentiment = sentiment_scores.apply(lambda x: x['compound'])

# Print out the text of a positive tweet
print(ds_tweets[sentiment > .6]['text'].values[0])

# Print out the text of a negative tweet
print(ds_tweets[sentiment < -.6]['text'].values[0])

# Generate average sentiment scores for #python
sentiment_py = sentiment[check_word_in_tweet('#python', ds_tweets)].resample('D').mean()

# Generate average sentiment scores for #rstats
sentiment_r = sentiment[check_word_in_tweet('#rstats', ds_tweets)].resample('D').mean()

**Plotting Sentiment Scores**

Finally, let's visualize the sentiment of each hashtag over time. This process is quite similar to plotting tweet prevalence.

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt

# Plot average #python sentiment per day
plt.plot(sentiment_py.index.day, sentiment_py, color='green')

# Plot average #rstats sentiment per day
plt.plot(sentiment_r.index.day, sentiment_r, color='blue')

plt.xlabel('Day')
plt.ylabel('Sentiment')
plt.title('Sentiment of Data Science Languages')
plt.legend(('#python', '#rstats'))
plt.show()

![image](sentscore.png)


This sentiment analysis provides valuable information about the emotional tone associated with these hashtags on Twitter, helping us gauge public sentiment towards specific topics.

# Twitter Networks

**Creating a Retweet Network**

Twitter data is inherently networked. Retweet networks are among the most important Twitter networks. They are usually represented as directed graphs in which the retweeting user is the source and the retweeted user is the target. Starting from a flattened DataFrame of retweets about the 2018 State of the Union speech (sotu_retweets), we can import the data into networkx and construct a retweet network.

In [None]:
# Import networkx
import networkx as nx

# Create a retweet network from the edgelist
G_rt = nx.from_pandas_edgelist(
    sotu_retweets,
    source='user-screen_name',
    target='retweeted_status-user-screen_name',
    create_using=nx.DiGraph())

# Print the number of nodes
print('Nodes in RT network:', len(G_rt.nodes()))

# Print the number of edges
print('Edges in RT network:', len(G_rt.edges()))

**Creating a Reply Network**

Reply networks exhibit a distinct structure compared to retweet networks. While retweet networks often indicate agreement, reply networks signal discussion, deliberation, and disagreement. The essential network properties, such as directionality and source-target relationships, remain consistent.

In [None]:
# Import networkx
import networkx as nx

# Create a reply network from the edgelist
G_reply = nx.from_pandas_edgelist(
    sotu_replies,
    source='user-screen_name',
    target='in_reply_to_screen_name',
    create_using=nx.DiGraph())

# Print the number of nodes
print('Nodes in reply network:', len(G_reply.nodes()))

# Print the number of edges
print('Edges in reply network:', len(G_reply.edges()))

**Visualizing the Retweet Network**

Visualizing retweet networks is a crucial step in exploratory data analysis. It allows us to inspect the network's structure visually, identify users with significant influence, and discern various spheres of conversation.

In [None]:
# Create random layout positions
pos = nx.random_layout(G_rt)

# Create a size list
sizes = [x[1] for x in G_rt.degree()]

# Draw the network
nx.draw_networkx(G_rt, pos,
    with_labels=False,
    node_size=sizes,
    width=0.1, alpha=0.7,
    arrowsize=2, linewidths=0)

# Turn the axis off and show
plt.axis('off')
plt.show()

![image](retweetnet.png)


**In-Degree Centrality**

Centrality measures the importance of a node in a network. Degree centrality, especially in retweet networks, is a straightforward and intuitively understandable measure. However, in directed networks like Twitter, we need to differentiate between in-degree and out-degree centrality. In-degree centrality in retweet networks highlights users who receive many retweets.

In [None]:
# Generate in-degree centrality for retweets
rt_centrality = nx.in_degree_centrality(G_rt)

# Generate in-degree centrality for replies
reply_centrality = nx.in_degree_centrality(G_reply)

# Store centralities in DataFrames
column_names = ['screen_name', 'degree_centrality']
rt = pd.DataFrame(list(rt_centrality.items()), columns=column_names)
reply = pd.DataFrame(list(reply_centrality.items()), columns=column_names)

# Print the first five results in descending order of centrality
print(rt.sort_values('degree_centrality', ascending=False).head())

# Print the first five results in descending order of centrality
print(reply.sort_values('degree_centrality', ascending=False).head())
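
As a quick illustration of what the number means: networkx defines in-degree centrality as a node's in-degree divided by n - 1, the largest in-degree it could have in a graph of n nodes. The sketch below recomputes that by hand in plain Python on a made-up edge list; the user names are hypothetical.

```python
from collections import Counter

# Hypothetical (source, target) retweet edges: source retweeted target
edges = [('bob', 'alice'), ('carol', 'alice'), ('dave', 'alice'), ('alice', 'erin')]

# All users that appear on either end of an edge
nodes = {user for edge in edges for user in edge}

# In-degree: how many distinct users retweeted each target
in_degree = Counter(target for _, target in edges)

# Normalise by the maximum possible in-degree, n - 1
centrality = {node: in_degree[node] / (len(nodes) - 1) for node in nodes}

print(centrality['alice'])  # 0.75
print(centrality['erin'])   # 0.25
print(centrality['bob'])    # 0.0
```

With five users, alice is retweeted by three of the other four, giving 3 / 4 = 0.75.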

**Betweenness Centrality**

Betweenness centrality in retweet and reply networks identifies users who bridge different Twitter communities. These communities may be linked by topic or ideology.

In [None]:
# Generate betweenness centrality for retweets
rt_centrality = nx.betweenness_centrality(G_rt)

# Generate betweenness centrality for replies
reply_centrality = nx.betweenness_centrality(G_reply)

# Store centralities in DataFrames
column_names = ['screen_name', 'betweenness_centrality']
rt = pd.DataFrame(list(rt_centrality.items()), columns=column_names)
reply = pd.DataFrame(list(reply_centrality.items()), columns=column_names)

# Print the first five results in descending order of centrality
print(rt.sort_values('betweenness_centrality', ascending=False).head())

# Print the first five results in descending order of centrality
print(reply.sort_values('betweenness_centrality', ascending=False).head())

**Ratios: "The Ratio"**

While not a strict measure of network importance, "The Ratio" is a Twitter-specific measure often used to assess unpopularity: the number of replies divided by the number of retweets. It is usually quoted for a single tweet; here we compute it per user from the in-degrees of the reply and retweet networks.

In [None]:
# Calculate in-degrees and store in DataFrames
column_names = ['screen_name', 'degree']
degree_rt = pd.DataFrame(list(G_rt.in_degree()), columns=column_names)
degree_reply = pd.DataFrame(list(G_reply.in_degree()), columns=column_names)

# Merge the two DataFrames on screen name
ratio = degree_rt.merge(degree_reply, on='screen_name', suffixes=('_rt', '_reply'))

# Calculate the ratio
ratio['ratio'] = ratio['degree_reply'] / ratio['degree_rt']

# Exclude any users with a retweet in-degree below 5
ratio = ratio[ratio['degree_rt'] >= 5]

# Print the first five with the highest ratio
print(ratio.sort_values('ratio', ascending=False).head())

Additional Result:

	screen_name     degree_rt  degree_reply  ratio
	SpeakerRyan             8            15  1.875
	NBCNews                20            18  0.900
	benshapiro              5             4  0.800
	SenateGOP               5             3  0.600
	CBSThisMorning          6             3  0.500

**Additional Result Insight:**

- Among the users analyzed, **SpeakerRyan** stands out with the highest ratio, **1.875**. It is the only account in the top five with more replies (15) than retweets (8) during the event.

- **NBCNews** (0.900) and **benshapiro** (0.800) come close to parity, indicating substantial discussion alongside the retweets of their tweets.

- **SenateGOP** (0.600) and **CBSThisMorning** (0.500) drew noticeably fewer replies than retweets.
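
The ratios in the result table can be checked by hand. The sketch below recomputes them in plain Python from the in-degrees listed in the table; the variable names are chosen here for illustration.

```python
# (screen_name, degree_rt, degree_reply) copied from the result table
rows = [
    ('SpeakerRyan', 8, 15),
    ('NBCNews', 20, 18),
    ('benshapiro', 5, 4),
    ('SenateGOP', 5, 3),
    ('CBSThisMorning', 6, 3),
]

# The Ratio: replies divided by retweets
ratios = {name: replies / retweets for name, retweets, replies in rows}

# Accounts with more replies than retweets
more_replies = [name for name, value in ratios.items() if value > 1]

print(ratios['SpeakerRyan'])  # 1.875
print(more_replies)           # ['SpeakerRyan']
```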

Overall, this Twitter network analysis provides valuable insights into user engagement, conversation dynamics, and influential figures during the 2018 State of the Union speech event on Twitter. These insights can be further explored and leveraged for various research or decision-making purposes.

# Maps and Twitter data

**Accessing User-Defined Location:**

To access user-defined locations, we read the 'location' field nested under 'user' in the Twitter JSON, which becomes the 'user-location' column after flattening. Users type this text into their profiles themselves, so it can be imprecise, but it is more readily available than coordinate-level data.

In [None]:
# Extract the user-defined location from a single example tweet
# (`tweet_json` is assumed here to be a tweet already parsed into a dict)
print(tweet_json['user']['location'])

# Flatten and load the SOTU tweets into a dataframe
tweets_sotu = pd.DataFrame(flatten_tweets(tweets_sotu_json))

# Print out the top five user-defined locations
print(tweets_sotu['user-location'].value_counts().head())

**Accessing Bounding Box:**

Tweets with coordinate-level geographical information often come with bounding boxes. These are sets of four longitudinal/latitudinal coordinates that represent specific geographical areas. We can extract bounding box data from the 'place' field in the Twitter JSON.

In [None]:
def getBoundingBox(place):
    """ Returns the bounding box coordinates."""
    return place['bounding_box']['coordinates']

# Apply the function to get bounding box coordinates
bounding_boxes = tweets_sotu['place'].apply(getBoundingBox)

# Print out the first bounding box coordinates
print(bounding_boxes.values[0])

**Calculating the Centroid:**

To simplify the handling of bounding boxes, we can calculate the centroid, which represents the center of the bounding box. For a rectangular box, the centroid pairs the midpoint of the two distinct longitudes with the midpoint of the two distinct latitudes.

In [None]:
import numpy as np

def calculateCentroid(place):
    """ Calculates the centroid from a bounding box."""
    # Obtain the coordinates from the bounding box.
    coordinates = place['bounding_box']['coordinates'][0]
        
    longs = np.unique( [x[0] for x in coordinates] )
    lats  = np.unique( [x[1] for x in coordinates] )

    if len(longs) == 1 and len(lats) == 1:
        # Return a single coordinate
        return (longs[0], lats[0])
    elif len(longs) == 2 and len(lats) == 2:
        # If we have two longs and lats, we have a box.
        central_long = np.sum(longs) / 2
        central_lat  = np.sum(lats) / 2
    else:
        raise ValueError("Non-rectangular polygon not supported: %s" % 
            ",".join(map(str, coordinates)))

    return (central_long, central_lat)

# Calculate the centroids of the place field
centroids = tweets_sotu['place'].apply(calculateCentroid)
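
The `place` field is empty for tweets that carry no place information, so applying the function to an unfiltered dataset would fail on those rows. The following is a minimal sketch of the midpoint calculation with a guard for missing places. The names `box_centroid` and `centroid_or_none` and the sample box are hypothetical, and the code assumes an axis-aligned rectangular box.

```python
def box_centroid(coordinates):
    """Midpoint of an axis-aligned box given as [[lon, lat], ...] corners."""
    longs = [point[0] for point in coordinates]
    lats = [point[1] for point in coordinates]
    return ((min(longs) + max(longs)) / 2, (min(lats) + max(lats)) / 2)

def centroid_or_none(place):
    """Return the centroid of a place's bounding box, or None if there is no place."""
    if not place:
        return None
    return box_centroid(place['bounding_box']['coordinates'][0])

# Hypothetical place with corners in [longitude, latitude] order
place = {'bounding_box': {'coordinates': [
    [[-78.0, 38.0], [-78.0, 40.0], [-76.0, 40.0], [-76.0, 38.0]]
]}}

print(centroid_or_none(place))  # (-77.0, 39.0)
print(centroid_or_none(None))   # None
```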

**Creating a Basemap Map:**

The Basemap library allows us to create maps in Python. We can set up a Basemap object and define a bounding box for the map. The map can be customized with various features such as continents, coastlines, countries, and states.

In [None]:
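# Import Basemap
from mpl_toolkits.basemap import Basemap

# Set up the US bounding box: [min longitude, min latitude, max longitude, max latitude]
us_boundingbox = [-125, 22, -64, 50]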

# Set up the Basemap object
m = Basemap(llcrnrlon=us_boundingbox[0],
            llcrnrlat=us_boundingbox[1],
            urcrnrlon=us_boundingbox[2],
            urcrnrlat=us_boundingbox[3],
            projection='merc')

# Customize the map with features
m.fillcontinents(color='white')
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')

# Show the map
plt.show()

**Plotting Centroid Coordinates:**

Once we have calculated the centroids, we can plot them on the Basemap map by isolating the longitudes and latitudes and using the .scatter() method.

In [None]:
# Calculate the centroids for the dataset and isolate coordinates
centroids = tweets_sotu['place'].apply(calculateCentroid)
lon = [x[0] for x in centroids]
lat = [x[1] for x in centroids]

# Customize the map with features
m.fillcontinents(color='white', zorder=0)
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')

# Plot the centroids
m.scatter(lon, lat, latlon=True, alpha=0.7)

# Show the map
plt.show()

**Coloring by Sentiment:**

To differentiate places based on sentiment, we can reuse the sentiment analysis approach from the Sentiment Analysis section. We extract the compound sentiment score and use it to color the centroids on the map.

In [None]:
# Generate sentiment scores
sentiment_scores = tweets_sotu['text'].apply(sid.polarity_scores)

# Isolate the compound sentiment score
sentiment_scores = [x['compound'] for x in sentiment_scores]

# Customize the map with features
m.fillcontinents(color='white', zorder=0)
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')

# Color centroids based on sentiment scores
m.scatter(lon, lat, latlon=True, c=sentiment_scores, cmap='coolwarm', alpha=0.7)

# Show the map
plt.show()

![image](image.png)


# Summary of Insights

### Loading and Accessing Tweets:

- The initial part of the portfolio demonstrates loading and accessing Twitter data using Python.
- Twitter data is commonly stored in JSON format, and Python's JSON library is used to convert it into a Python object.
- Key information like tweet text, user details, and retweet data can be easily accessed from the JSON structure.

### Counting Words:

- This section focuses on counting specific keywords, like hashtags, in a collection of tweets about data science.
- String methods on the pandas Series object are used to count keyword occurrences efficiently.
- It provides a simple yet powerful way to analyze text data in a Twitter dataset.

### Time Series Analysis:

- Time series data frames are created to analyze the variation of specific keywords or phrases over time.
- Timestamps in tweets are converted into datetime objects for time-based analysis.
- Mean frequency of keywords is calculated and plotted over time, enabling the tracking of keyword prevalence.

### Sentiment Analysis:

- Sentiment analysis using the VADER library provides insights into the overall sentiment of tweets.
- Sentiment scores for each tweet are generated using polarity scores, and average sentiment scores are calculated over time.
- This helps in understanding how sentiments evolve regarding specific hashtags.

### Twitter Networks:

- This section focuses on retweet and reply networks.
- These networks are represented as directed graphs, where users retweet or reply to others.
- Metrics like in-degree centrality and betweenness centrality are computed to identify influential users and bridge connectors between communities.

### Ratios and Unpopularity:

- A unique Twitter measure, "The Ratio," is calculated by dividing the number of replies by the number of retweets.
- The portfolio identifies accounts with high ratios, which typically indicate unpopularity or controversy.
- This analysis helps in understanding public sentiment regarding specific tweets.

### Putting Twitter Data on the Map:

- Geospatial analysis is conducted by extracting user-defined locations and bounding boxes.
- Centroids of bounding boxes are computed to reduce each place to a single point that can be plotted.
- Maps are created using Basemap, and centroids are plotted on the map to visualize tweet distribution.
- Coloring centroids by sentiment scores provides insights into how the State of the Union speech was received in different areas.

# Conclusion

This portfolio demonstrates a comprehensive analysis of Twitter data, covering keyword counting, time series analysis, sentiment analysis, network analysis, geospatial analysis, and visualization. It allows for valuable insights into user behavior, sentiments, and geographical patterns in Twitter data, which can be applied to various research and business contexts.