# Twitter Data Interactive Visualization Using Plotly

### Describing the relationships between tweet length, number of hashtags, and audience engagement.

### Audience engagement is defined here as the number of likes. The justification is provided in the README.

In [None]:
%pip install pandas numpy plotly chart_studio cufflinks

In [176]:
import urllib.request
import json
import pandas as pd
import numpy
import re
import plotly.express as px
import plotly.figure_factory as ff
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot
import cufflinks

cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)

### Read Twitter Data

In [148]:
# The JSON file is not in a layout that pandas' read_json can parse directly,
# so json.loads is used here to read the data from the URL.

with urllib.request.urlopen('https://raw.githubusercontent.com/Remesh/tweet-data/master/data.json') as url:
    data = json.loads(url.read())
    df = pd.DataFrame(data.items())
    
df


Unnamed: 0,0,1
0,users,"[{'id': 1, 'username': 'rmoss'}, {'id': 2, 'us..."
1,hashtags,"[{'id': 1, 'name': 'reduced'}, {'id': 2, 'name..."
2,tweets,"[{'id': 1, 'user': 454, 'text': 'Call better e..."
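
Since each top-level key in the output above holds a list of records, the three tables could also be built straight from the parsed dictionary, without the intermediate two-column frame. The following is a minimal sketch under that assumption; the inline `raw` payload and the `frames` name are illustrative and do not come from the original notebook.

```python
import json

import pandas as pd

# Hypothetical miniature payload that mirrors the three top-level keys shown above.
raw = """
{
  "users": [{"id": 1, "username": "rmoss"}],
  "hashtags": [{"id": 1, "name": "reduced"}],
  "tweets": [{"id": 1, "user": 1, "text": "Sense already old.",
              "hashtags": [1], "retweet_id": null, "likes": [1]}]
}
"""
data = json.loads(raw)

# Every value is a list of dicts, so each one converts directly into a DataFrame.
frames = {name: pd.DataFrame(records) for name, records in data.items()}

print(frames["tweets"].columns.tolist())
```

With this layout, `frames["tweets"]` would play the same role as `pd.DataFrame(df[1][2])` in the cells that follow.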


In [149]:
# To display all the data to better understand it
pd.set_option('display.max_colwidth', None)

### Create Users Dataframe

In [150]:
users_df = pd.DataFrame(df[1][0])
users_df.head()

Unnamed: 0,id,username
0,1,rmoss
1,2,albertmitchell
2,3,blakefarrell
3,4,lisahernandez
4,5,vanessalewis


### Create Hashtags Dataframe

In [151]:
hashtags_df = pd.DataFrame(df[1][1])
hashtags_df.head()

Unnamed: 0,id,name
0,1,reduced
1,2,focused
2,3,right-sized
3,4,programmable
4,5,phased


### Create Tweets Dataframe

In [152]:
tweets_df = pd.DataFrame(df[1][2])
tweets_df.head()

Unnamed: 0,id,user,text,hashtags,retweet_id,likes
0,1,454,Call better environment my wall church red. Industry onto game partner letter PM model.\nBuy social election. Team will television science figure help.\nSuch make federal station air low at. One hair sea media test policy year as.,[],,"[119, 365, 61, 309, 297, 352, 453, 428, 468, 22, 312, 178, 436, 34, 251, 268, 141, 94, 179, 37, 265, 27, 455, 389, 426, 169, 75, 387, 70]"
1,2,398,Record treat rock scene pull.,[],,[]
2,3,335,Charge receive stay behavior rock. Although also true yourself evidence. Mind away discuss cover.\nSuddenly more poor reality the.,"[62, 109, 2]",,"[47, 500, 221, 465]"
3,4,140,Sense already old.,[],,"[265, 136, 179, 127, 72, 355, 49, 476, 94, 191, 385, 106, 353, 375, 303, 172, 407, 208, 108, 366, 474, 284, 229, 338, 316, 233, 36, 39, 161, 499, 494, 135, 67, 410, 93]"
4,5,427,Energy country student attack investment. Already mind support white gas. Like young trade his right.,[4],,"[293, 492, 190, 30, 132, 62, 383, 181, 233, 203, 488, 73, 396, 452, 257, 65, 262, 199, 498, 71, 398, 160, 333, 66, 301, 151, 100, 142, 74, 63, 367, 67, 36, 345, 210, 392, 464, 260, 416, 86, 242, 438, 10, 269, 3, 391, 317, 473]"


## To meet the application requirements, I use only the Tweets DataFrame from here on, since it contains all the data needed to visualize these relationships

### Remove special characters from the tweet text

In [153]:
def clean_tweet(tweet):
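    # Strip @mentions, any character that is not alphanumeric, space, or tab, and URLs; then collapse whitespace
    return ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', ' ', tweet).split())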
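
To make the effect of the pattern concrete, here is a small standalone sketch of the same cleaning step applied to a made-up tweet. The sample string is hypothetical; the function body mirrors the one above.

```python
import re


def clean_tweet(tweet):
    # Three alternatives: @mentions, any character outside [0-9A-Za-z space tab], and URLs.
    pattern = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)'
    # Replace every match with a space, then collapse runs of whitespace.
    return ' '.join(re.sub(pattern, ' ', tweet).split())


# Hypothetical sample text, not taken from the dataset.
sample = "Thanks @rmoss! Read more: https://example.com/a #data"
print(clean_tweet(sample))  # Thanks Read more data
```

Note that the `#` symbol is removed but the hashtag word itself is kept, and that non-ASCII letters are stripped as well (for example, `café` becomes `caf`).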

In [154]:
tweets_df['clean_tweet'] = tweets_df['text'].apply(clean_tweet)
tweets_df.head()

Unnamed: 0,id,user,text,hashtags,retweet_id,likes,clean_tweet
0,1,454,Call better environment my wall church red. Industry onto game partner letter PM model.\nBuy social election. Team will television science figure help.\nSuch make federal station air low at. One hair sea media test policy year as.,[],,"[119, 365, 61, 309, 297, 352, 453, 428, 468, 22, 312, 178, 436, 34, 251, 268, 141, 94, 179, 37, 265, 27, 455, 389, 426, 169, 75, 387, 70]",Call better environment my wall church red Industry onto game partner letter PM model Buy social election Team will television science figure help Such make federal station air low at One hair sea media test policy year as
1,2,398,Record treat rock scene pull.,[],,[],Record treat rock scene pull
2,3,335,Charge receive stay behavior rock. Although also true yourself evidence. Mind away discuss cover.\nSuddenly more poor reality the.,"[62, 109, 2]",,"[47, 500, 221, 465]",Charge receive stay behavior rock Although also true yourself evidence Mind away discuss cover Suddenly more poor reality the
3,4,140,Sense already old.,[],,"[265, 136, 179, 127, 72, 355, 49, 476, 94, 191, 385, 106, 353, 375, 303, 172, 407, 208, 108, 366, 474, 284, 229, 338, 316, 233, 36, 39, 161, 499, 494, 135, 67, 410, 93]",Sense already old
4,5,427,Energy country student attack investment. Already mind support white gas. Like young trade his right.,[4],,"[293, 492, 190, 30, 132, 62, 383, 181, 233, 203, 488, 73, 396, 452, 257, 65, 262, 199, 498, 71, 398, 160, 333, 66, 301, 151, 100, 142, 74, 63, 367, 67, 36, 345, 210, 392, 464, 260, 416, 86, 242, 438, 10, 269, 3, 391, 317, 473]",Energy country student attack investment Already mind support white gas Like young trade his right


### Create a Tweet Length feature and visualize the distribution

In [155]:
tweets_df['tweet_length'] = tweets_df['clean_tweet'].apply(lambda x: len(x.split()))
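
"Tweet length" in this notebook means the number of words, not the number of characters. The sketch below contrasts the two measures on two cleaned tweets from the table above, using pandas string methods as one possible alternative to `apply`; the variable names are illustrative.

```python
import pandas as pd

# Two cleaned tweets taken from the table above.
clean = pd.Series(["Sense already old", "Record treat rock scene pull"])

# Word count: split on whitespace, then take the length of each resulting list.
word_length = clean.str.split().str.len()

# Character count, shown only for comparison.
char_length = clean.str.len()

print(word_length.tolist())  # [3, 5]
print(char_length.tolist())  # [17, 28]
```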

### From the distribution below, we can see that:
- Most tweets are short
- The most common lengths are 11 and 12 words
- The distribution is right-skewed: counts are high for lengths of more than 3 and fewer than 18 words, and low for tweets longer than 18 words

In [158]:
tweets_df['tweet_length'].value_counts().iplot(kind='bar', xTitle='Tweet Words Length',
                                    yTitle='Count', title='Overall Tweet Words Length Distribution')
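
`value_counts()` orders its result by frequency rather than by value, and skew can also be checked numerically instead of only by eye. The following is a minimal sketch on hypothetical word counts (not the real dataset) that sorts the counts by length and computes the sample skewness with `Series.skew()`.

```python
import pandas as pd

# Hypothetical word counts with a long right tail.
lengths = pd.Series([3, 4, 4, 5, 5, 5, 6, 12, 30])

# Sort by the length value so the table reads left to right like a histogram.
counts = lengths.value_counts().sort_index()

# A positive skewness indicates a right-skewed distribution.
skewness = lengths.skew()

print(counts)
print(skewness > 0)  # True
```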

### Create a Number of Hashtags feature and visualize the distribution

In [159]:
tweets_df['hashtags_count'] = tweets_df['hashtags'].apply(len)

### The number of hashtags is fairly evenly distributed, with 3 hashtags being the most common count

In [161]:
tweets_df['hashtags_count'].value_counts().iplot(kind='bar', xTitle='Number of Hashtags',
                                    yTitle='Count', title='Overall Number of Hashtags Distribution')

### Create a Number of likes feature and visualize the distribution

In [162]:
tweets_df['likes_count'] = tweets_df['likes'].apply(len)

### The distribution of the number of likes is right-skewed in this dataset: most tweets have fewer than 50 likes, and few have 100 or more. Interestingly, no tweets have between 50 and 100 likes, but that is how the data was provided

In [164]:
tweets_df['likes_count'].value_counts().iplot(kind='bar', xTitle='Number of Likes',
                                    yTitle='Count', title='Overall Number of Likes Distribution')
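
A gap such as the one between 50 and 100 likes can be confirmed by binning the counts. This is a sketch on hypothetical like counts, with bin edges chosen only for illustration; `pd.cut` keeps empty bins, so a missing range shows up as a zero.

```python
import pandas as pd

# Hypothetical like counts; the real values come from tweets_df['likes_count'].
likes = pd.Series([0, 4, 29, 35, 48, 120, 150])

# Left-closed bins: [0, 50), [50, 100), [100, 200). The edges are an assumption.
binned = pd.cut(likes, bins=[0, 50, 100, 200], right=False)

# Sorting by the interval index keeps the bins in ascending order, empty ones included.
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```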

### Distributions of the two right-skewed features: tweet length and number of likes

In [167]:
tweets_df[['likes_count', 'tweet_length']].iplot(kind='hist', xTitle='Value', yTitle='Count',
    title='Distributions of Tweet Length and Number of Likes')

## Scatter Plots to describe:
- The relationship of number of likes to the tweet length
- The relationship of number of likes to the number of hashtags

### The number of likes is higher for shorter tweets, so the more popular tweets in this dataset tend to be shorter

In [171]:
fig = px.scatter(tweets_df, x="tweet_length", y="likes_count",
                 labels= {
                     "tweet_length": "Tweet Words Length",
                     "likes_count": "Number of Likes"
                     
                 },
                 title="Relationship of number of likes to tweet length"
                )
fig.show()

### Tweets with 2 or 3 hashtags have the highest numbers of likes, so in this dataset 2 or 3 hashtags appear to be the most popular

In [172]:
fig = px.scatter(tweets_df, x="hashtags_count", y="likes_count",
                 labels= {
                     "hashtags_count": "Number of Hashtags",
                     "likes_count": "Number of Likes"
                     
                 },
                 title="Relationship of number of hashtags to number of likes"
                )
fig.show()
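
A scatter plot highlights the highest points in each column, which can be misleading when some hashtag counts simply have more tweets. One way to cross-check the reading is a per-group summary. The sketch below uses hypothetical numbers and assumes the same column names as `tweets_df`.

```python
import pandas as pd

# Hypothetical data with the same column names as tweets_df.
df = pd.DataFrame({
    "hashtags_count": [0, 0, 1, 2, 2, 3, 3, 4],
    "likes_count": [5, 10, 8, 40, 120, 35, 110, 12],
})

# Group size, median, and maximum likes for each hashtag count.
summary = df.groupby("hashtags_count")["likes_count"].agg(["count", "median", "max"])
print(summary)
```

Comparing the median with the maximum shows whether a hashtag count is popular in general or only has a few outliers.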

## Correlation Matrix for the three features:
- Tweet length is negatively correlated with the number of likes, which is consistent with the previous scatter plot: the longer the tweet, the fewer likes it tends to get
- The number of hashtags is somewhat positively correlated with the number of likes, which is consistent with the previous plot, where tweets with 2 or 3 hashtags were the most popular
- There is no correlation between tweet length and the number of hashtags

In [173]:
corrs = tweets_df[['likes_count', 'hashtags_count','tweet_length']].corr()
figure = ff.create_annotated_heatmap(
    z=corrs.values,
    x=list(corrs.columns),
    y=list(corrs.index),
    annotation_text=corrs.round(2).values,
    showscale=True)

figure
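
One caveat when reading a correlation matrix: `DataFrame.corr()` computes the Pearson coefficient by default, which measures linear association only. A pattern that peaks in the middle, such as "2 or 3 hashtags get the most likes", can produce a coefficient near zero. The sketch below illustrates this with hypothetical, deliberately symmetric data; it is not a statement about the values in this dataset.

```python
import pandas as pd

# Hypothetical data in which likes peak at 2 and 3 hashtags and fall off symmetrically.
df = pd.DataFrame({
    "hashtags_count": [0, 1, 2, 3, 4, 5],
    "likes_count": [10, 20, 50, 50, 20, 10],
})

# Pearson correlation (the pandas default).
pearson = df["hashtags_count"].corr(df["likes_count"])

# Rank-based correlation: Pearson applied to ranks, which is equivalent to Spearman.
ranks = df.rank()
rank_corr = ranks["hashtags_count"].corr(ranks["likes_count"])

print(round(pearson, 6), round(rank_corr, 6))
```

Both coefficients come out at essentially zero here even though the relationship is strong, which is why the scatter plots are worth reading alongside the matrix.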

### Creating a categorical feature for the next chart 

In [178]:
tweets_df['str_hashtags_count'] = tweets_df['hashtags_count'].astype(str)

## Overall Relationship of the 3 features:
- In the scatter plot, most of the points with a high number of likes are red and blue. This suggests that the most popular hashtag counts are 2 and 3.

In [191]:
tweets_df.iplot(
    x='tweet_length',
    y='likes_count',
    categories='str_hashtags_count',
    layout=dict(
        xaxis= dict(title='Tweet Words Length'),
        yaxis= dict(title='Number of Likes'),
        legend_title_text='Number of Hashtags',
        title='Length of Tweet vs Number of Likes by Hashtags Count'
        )
    )

## Conclusion:

### From this dataset, I conclude that shorter tweets (excluding those of only 1 or 2 words) with 2 or 3 hashtags get more audience engagement (number of likes). If I were to act on this data, I would tweet shorter sentences with 2 or 3 hashtags.

### More analysis could be done on this dataset, such as sentiment analysis or a word cloud of the text data to see which words were used most often. For this notebook, however, I focused on demonstrating the relationships between these 3 features.
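
As a rough starting point for the word-frequency idea, the counts behind a word cloud could be computed with the standard library alone. This is only a sketch on three cleaned tweets from the table above; a real analysis would likely also remove stop words.

```python
from collections import Counter

# Three cleaned tweets (the third one truncated) from the table above.
texts = [
    "Sense already old",
    "Record treat rock scene pull",
    "Charge receive stay behavior rock",
]

# Lowercase every word and count occurrences across all tweets.
word_counts = Counter(word.lower() for text in texts for word in text.split())

print(word_counts.most_common(1))  # [('rock', 2)]
```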