# Twitter Analysis of EP3 East and West Conferences

First, I'll import the libraries I need.

In [None]:
%matplotlib inline
import pymongo
from datetime import date
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


## EP3 East Analysis

### Data Wrangling
Before I can analyze anything, I need to get the data out of my database and organize it into a form that can easily be analyzed. 

First I'll make a connection to my database. Everything is stored in a local MongoDB instance on my laptop.

In [None]:
client_con = pymongo.MongoClient()

In [None]:
client_con.list_database_names()  # replaces database_names(), which PyMongo 4 removed

In [None]:
local_db = client_con['test']

In [None]:
local_db.list_collection_names()  # replaces collection_names(), which PyMongo 4 removed

'ep3' is the collection with the ep3east data.  
I'll point a variable called data_set at that collection.

In [None]:
data_set = local_db['ep3']

Next I'll take a quick look at how many tweets are in there

In [None]:
data_set.count_documents({})  # replaces count(), which PyMongo 4 removed

In case you are wondering, this is what one tweet looks like. 
There's a ton of data in each tweet, which is one of the reasons I'm interested in this

In [None]:
print(data_set.find_one())

Now I need to change the format of the data to something easier to look at.
First, I'll make the database collection into a list of dictionaries, which is easier to work with in Python

In [None]:
tweet_data = list(data_set.find())

Next I'll make that list into a dataframe, which looks more like a spreadsheet.  
I'll wrap this in a function so I can reuse it for the West data set.

In [None]:
def process_results(results):
    """Build a dataframe with one row per tweet from a list of tweet dictionaries."""
    id_list = [tweet['id'] for tweet in results]
    df = pd.DataFrame(id_list, columns=["id"])

    # Pull out only the fields used in this analysis
    df["user"] = [tweet['user']['screen_name'] for tweet in results]
    df["text"] = [tweet['text'] for tweet in results]
    df["retweet_count"] = [tweet['retweet_count'] for tweet in results]
    df["favorite_count"] = [tweet['favorite_count'] for tweet in results]
    df["created_at"] = [tweet['created_at'] for tweet in results]

    return df
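
To make the reshaping concrete, here's a minimal sketch of the same field extraction on a single hand-made tweet. It uses plain dictionaries so it runs without the database or pandas. The values in `sample_tweet` and the `flatten_tweet` helper are made up for illustration; only the key names come from `process_results` above.

```python
# Hypothetical, trimmed-down tweet holding only the keys process_results reads.
sample_tweet = {
    'id': 1,
    'user': {'screen_name': 'example_user'},
    'text': 'Hello from the conference',
    'retweet_count': 2,
    'favorite_count': 5,
    'created_at': 'Fri Dec 02 14:03:11 +0000 2016',
}


def flatten_tweet(tweet):
    """Return one flat row holding the six fields used in this analysis."""
    return {
        'id': tweet['id'],
        # screen_name is nested one level down, inside the 'user' dictionary
        'user': tweet['user']['screen_name'],
        'text': tweet['text'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
        'created_at': tweet['created_at'],
    }


row = flatten_tweet(sample_tweet)
print(row['user'])
```

Each row built this way corresponds to one line of the dataframe.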

Now I'll use that function to process my data into a dataframe and store it in a variable called df_east.

In [None]:
df_east = process_results(tweet_data)

Let's take a look at it.  
First I'll check the length; it should match the number of records above.  
Then I'll look at the top 5 rows and the bottom 5 rows to make sure it all looks right.

In [None]:
len(df_east)

In [None]:
df_east.head()

In [None]:
df_east.tail()

It all looks good.  
Now I need to convert the 'created_at' column from text to real timestamps, which will make it easier for me to split up the data sets by day.

In [None]:
df_east['created_at'] = pd.to_datetime(df_east['created_at'])

In [None]:
df_east.head()
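
For a feel of what that conversion does, here's a small standard-library sketch that parses one timestamp by hand. The sample string is made up, and the layout (weekday, month, day, time, UTC offset, year) is my assumption about how `created_at` is stored. Parsing `%z` this way also assumes Python 3.

```python
from datetime import datetime

# Assumed created_at layout: weekday, month, day, time, UTC offset, year.
raw = 'Fri Dec 02 14:03:11 +0000 2016'

# strptime needs the layout spelled out; pd.to_datetime infers it instead.
parsed = datetime.strptime(raw, '%a %b %d %H:%M:%S %z %Y')
print(parsed.isoformat())
```

Note that the `+0000` offset means the time is in UTC, so a tweet sent late in the evening local time can land on the next calendar day.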

Now I'll make a new column that adds retweets plus favorites. That will give some idea of the importance of a particular tweet. 

In [None]:
df_east['important_tweets'] = df_east['retweet_count'] + df_east['favorite_count']
df_east.head()

Next, we need to restrict the data set to original tweets, not retweets. 

First, I'll add a column that flags whether each tweet is a retweet, based on whether its text starts with 'RT'.

In [None]:
df_east['retweeted'] = df_east['text'].str.startswith('RT')

In [None]:
df_east.head()
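
One caveat with a bare 'RT' prefix is that it also matches any tweet whose first word just happens to begin with those letters. Here's a sketch of a slightly stricter check on made-up tweets. The `is_retweet` helper is hypothetical, and it assumes retweets in the data start with `RT @`, so it may make no difference for this data set.

```python
def is_retweet(text):
    """Treat a tweet as a retweet only if it starts with 'RT @'."""
    return text.startswith('RT @')


# Made-up examples, not tweets from the conference data
examples = [
    'RT @someone: Great keynote this morning',  # conventional retweet prefix
    'RTFM is still good advice',                # starts with RT, not a retweet
    'Looking forward to day 2',                 # original tweet
]
flags = [is_retweet(t) for t in examples]
print(flags)
```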

### Total Tweets for EP3 East
Now I'll include only the original tweets in the data set.  
Then I'll look at how many we have.

In [None]:
df_east = df_east[~df_east['retweeted']]  # keep rows where 'retweeted' is False
len(df_east)

So, the variable df_east is our data set for the full ep3east conference.  
Before we run our analyses, let's make a data set for each day and see how many original tweets are in each one. 

### Total Tweets for EP3 East Day 1
Day 1 - 12/2/16

In [None]:
east1 = df_east[df_east['created_at'] < '2016-12-03']
len(east1)

### Total Tweets for EP3 East Day 2
Day 2- 12/3/16

In [None]:
# >= keeps tweets sent at exactly midnight, which > would drop
df1 = df_east[df_east['created_at'] >= '2016-12-03']
east2 = df1[df1['created_at'] < '2016-12-04']
len(east2)

### Total Tweets for EP3 East Day 3
Day 3- 12/4/16

In [None]:
east3 = df_east[df_east['created_at'] >= '2016-12-04']  # >= keeps tweets sent at exactly midnight
len(east3)
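
The day split comes down to putting each timestamp into a half-open interval that runs from one midnight up to, but not including, the next. Here's a small standard-library sketch of that idea. The `conference_day` helper is hypothetical, and it assumes every timestamp is a naive datetime in the same time zone as the day boundaries.

```python
from datetime import datetime


def conference_day(timestamp, first_day):
    """Return 1, 2, 3, ... for the conference day a timestamp falls on.

    Each day is the half-open interval [midnight, next midnight), so a
    tweet sent at exactly midnight belongs to the day that is starting.
    Returns None for timestamps before the first day.
    """
    offset = (timestamp - first_day).days
    return offset + 1 if offset >= 0 else None


first_day = datetime(2016, 12, 2)  # EP3 East day 1, from the heading above
print(conference_day(datetime(2016, 12, 3, 0, 0, 0), first_day))
```

With a column like this in the dataframe, each day's data set would be a single equality filter instead of a pair of date comparisons.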

### EP3 East Total Retweets and Total Favorites
Next, let's look at some numbers from the full conference data set.   
We have total tweets (664). Let's also look at total retweets and total favorites

In [None]:
east_retweets = df_east['retweet_count'].sum()
print(east_retweets)
east_favorites = df_east['favorite_count'].sum()
print(east_favorites)
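
Since the same three totals get computed again for every day and for the other conference, they could be bundled into one small helper. Here's a sketch on plain dictionaries with made-up counts. The `summarize` function is hypothetical and not part of the analysis above.

```python
def summarize(tweets):
    """Return tweet, retweet and favorite totals for a list of tweet dicts."""
    return {
        'tweets': len(tweets),
        'retweets': sum(t['retweet_count'] for t in tweets),
        'favorites': sum(t['favorite_count'] for t in tweets),
    }


# Made-up counts, only to show the shape of the result
sample = [
    {'retweet_count': 2, 'favorite_count': 5},
    {'retweet_count': 0, 'favorite_count': 1},
]
totals = summarize(sample)
print(totals)
```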

### Top 10 Tweets from EP3 East
Now let's look at the 10 most important tweets.  
First we'll sort the data by "important_tweets" with the highest number on top.  
Then we'll show the top 10.  
If you want to see the original tweet, just go to http://twitter.com/statuses/(put the tweet ID at the end)

In [None]:
sorted_east = df_east.sort_values(['important_tweets'], ascending = False)
sorted_east.head(10)
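
Here's a rough sketch of the same ranking on plain dictionaries, plus a helper that builds the link format mentioned above from a tweet ID. The IDs and counts are made up, and `top_tweets` and `tweet_url` are hypothetical names.

```python
def tweet_url(tweet_id):
    """Build the link format mentioned above from a tweet ID."""
    return 'http://twitter.com/statuses/{}'.format(tweet_id)


def top_tweets(tweets, n):
    """Return the n tweets with the highest retweets + favorites."""
    return sorted(
        tweets,
        key=lambda t: t['retweet_count'] + t['favorite_count'],
        reverse=True,
    )[:n]


# Made-up IDs and counts
sample = [
    {'id': 101, 'retweet_count': 1, 'favorite_count': 2},
    {'id': 102, 'retweet_count': 4, 'favorite_count': 6},
    {'id': 103, 'retweet_count': 0, 'favorite_count': 0},
]
for tweet in top_tweets(sample, 2):
    print(tweet_url(tweet['id']))
```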

### Word Cloud
The last step for the full conference analysis is the word cloud.

In [None]:
text = " ".join(df_east["text"].values)
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'ep3east'])
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("EP3 East- Full Conference")
plt.savefig('ep3east.png', dpi=500)
plt.show()
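
The word filter in that cell gets repeated for every cloud, so it could live in one function that takes the conference tag as an argument. Here's a sketch with a hypothetical `clean_tweet_text` helper. It's slightly stricter than the cell above: it splits on any whitespace and also drops the tag when it's written as a hashtag or in a different case, so the resulting clouds could differ a little.

```python
def clean_tweet_text(text, conference_tag):
    """Drop links, @mentions, the RT marker and the conference tag."""
    tag = conference_tag.lower()
    kept = []
    for word in text.split():
        if 'http' in word or word.startswith('@') or word == 'RT':
            continue
        # Ignore a leading '#' and letter case so '#EP3East' is dropped too
        if word.lstrip('#').lower() == tag:
            continue
        kept.append(word)
    return ' '.join(kept)


# Made-up tweet text
sample = 'RT @someone: Loving #ep3east so far http://example.com/x'
print(clean_tweet_text(sample, 'ep3east'))
```

The cleaned string can then be passed to `wc.generate()` just like `no_url_no_tag`.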

Now let's do the same for days 1, 2, and 3 individually.  
For these, we'll just look at the top 5 tweets of each day.

### EP3 East Day 1 Retweets and Favorites
Day 1

In [None]:
east1_retweets = east1['retweet_count'].sum()
print(east1_retweets)
east1_favorites = east1['favorite_count'].sum()
print(east1_favorites)

### EP3 East Day 1 Top 5 Tweets

In [None]:
east1sorted_df = east1.sort_values(['important_tweets'], ascending = False)
east1sorted_df.head(5)

### EP3 East Day 1 Word Cloud

In [None]:
text = " ".join(east1["text"].values)
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'ep3east'])
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("EP3 East- Day 1")
plt.savefig('ep3east_day1.png', dpi=500)
plt.show()

### EP3 East Day 2 Retweets and Favorites
Day 2

In [None]:
east2_retweets = east2['retweet_count'].sum()
print(east2_retweets)
east2_favorites = east2['favorite_count'].sum()
print(east2_favorites)

### EP3 East Day 2 Top 5 Tweets

In [None]:
east2sorted_df = east2.sort_values(['important_tweets'], ascending = False)
east2sorted_df.head(5)

### EP3 East Day 2 Word Cloud

In [None]:
text = " ".join(east2["text"].values)
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'ep3east'])
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("EP3 East- Day 2")
plt.savefig('ep3east_day2.png', dpi=500)
plt.show()

### EP3 East Day 3 Retweets and Favorites
Day 3

In [None]:
east3_retweets = east3['retweet_count'].sum()
print(east3_retweets)
east3_favorites = east3['favorite_count'].sum()
print(east3_favorites)

### EP3 East Day 3 Top 5 Tweets

In [None]:
east3sorted_df = east3.sort_values(['important_tweets'], ascending = False)
east3sorted_df.head(5)

### EP3 East Day 3 Word Cloud

In [None]:
text = " ".join(east3["text"].values)
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'ep3east'])
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("EP3 East- Day 3")
plt.savefig('ep3east_day3.png', dpi=500)
plt.show()

## EP3 West Analysis

All that looks good. Now let's do the same for the ep3west conference.  
Like before, I'll grab the data from my database and put it into a dataframe.  
Next I'll change the format of the 'created_at' column, add the retweets + favorites column, and then take out the retweets.  
One additional step I'll have to take is getting rid of duplicates. I believe I collected day 1 of the conference twice, so we need to make sure we aren't counting any tweet more than once. 

In [None]:
dataset_west = local_db['ep3west']

In [None]:
dataset_west.count_documents({})  # replaces count(), which PyMongo 4 removed

In [None]:
tweets_west = list(dataset_west.find())

In [None]:
df_west = process_results(tweets_west)

In [None]:
len(df_west)

In [None]:
df_west = df_west.drop_duplicates('id')

In [None]:
len(df_west)
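
To show what dropping duplicates by 'id' amounts to, here's a plain-Python sketch that keeps the first tweet seen for each ID, which is also the default behavior of `drop_duplicates`. The sample rows are made up and `drop_duplicate_ids` is a hypothetical helper.

```python
def drop_duplicate_ids(tweets):
    """Keep the first tweet seen for each id, preserving the original order."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet['id'] not in seen:
            seen.add(tweet['id'])
            unique.append(tweet)
    return unique


# Made-up rows where the tweet with id 1 was collected twice
sample = [
    {'id': 1, 'text': 'a'},
    {'id': 2, 'text': 'b'},
    {'id': 1, 'text': 'a'},
]
deduped = drop_duplicate_ids(sample)
print(len(deduped))
```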

In [None]:
df_west['created_at'] = pd.to_datetime(df_west['created_at'])

In [None]:
df_west['important_tweets'] = df_west['retweet_count'] + df_west['favorite_count']

In [None]:
df_west['retweeted'] = df_west['text'].str.startswith('RT')

### Total Tweets for EP3 West

In [None]:
df_west = df_west[~df_west['retweeted']]  # keep rows where 'retweeted' is False
len(df_west)

Wow, very nearly the same number of original tweets for EP3 East and EP3 West!  
Let's take a look at the top and bottom of the dataframe

In [None]:
df_west.head()

In [None]:
df_west.tail()

It all looks good. Now let's make the data set for the 3 days of the West conference

### EP3 West Day 1 Total Tweets
Day 1- 12/9/16

In [None]:
west1 = df_west[df_west['created_at'] < '2016-12-10']
len(west1)

### EP3 West Day 2 Total Tweets
Day 2- 12/10/16

In [None]:
# >= keeps tweets sent at exactly midnight, which > would drop
west_df1 = df_west[df_west['created_at'] >= '2016-12-10']
west2 = west_df1[west_df1['created_at'] < '2016-12-11']
len(west2)

### EP3 West Day 3 Total Tweets
Day 3- 12/11/16

In [None]:
west3 = df_west[df_west['created_at'] >= '2016-12-11']  # >= keeps tweets sent at exactly midnight
len(west3)

Now that we have our data sets for all of our days, let's calculate the total number of retweets, the total number of favorites, and the most important tweets.  
We'll start with the full conference data. 

### EP3 West Full Conference Retweets and Favorites

In [None]:
west_retweets = df_west['retweet_count'].sum()
print(west_retweets)
west_favorites = df_west['favorite_count'].sum()
print(west_favorites)

### EP3 West Top 10 Tweets

In [None]:
sorted_west = df_west.sort_values(['important_tweets'], ascending = False)
sorted_west.head(10)

### EP3 West Full Conference Word Cloud

In [None]:
text = " ".join(df_west["text"].values)
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'ep3west'])
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("EP3 West- Full Conference")
plt.savefig('ep3west.png', dpi=500)
plt.show()

On to the analysis by day for the West.  
Again we'll start with total retweets and favorites, then do the top 5 tweets and look at those for each day.

### EP3 West Day 1 Retweets and Favorites
Day 1- 12/9/16

In [None]:
west1_retweets = west1['retweet_count'].sum()
print(west1_retweets)
west1_favorites = west1['favorite_count'].sum()
print(west1_favorites)

### EP3 West Day 1 Top 5 Tweets

In [None]:
sorted_west1 = west1.sort_values(['important_tweets'], ascending = False)
sorted_west1.head(5)

### EP3 West Day 1 Word Cloud

In [None]:
text = " ".join(west1["text"].values)
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'ep3west'])
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("EP3 West- Day 1")
plt.savefig('ep3west_day1.png', dpi=500)
plt.show()

### EP3 West Day 2 Retweets and Favorites
Day 2- 12/10/16

In [None]:
west2_retweets = west2['retweet_count'].sum()
print(west2_retweets)
west2_favorites = west2['favorite_count'].sum()
print(west2_favorites)

### EP3 West Day 2 Top 5 Tweets

In [None]:
sorted_west2 = west2.sort_values(['important_tweets'], ascending = False)
sorted_west2.head(5)

### EP3 West Day 2 Word Cloud

In [None]:
text = " ".join(west2["text"].values)
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'ep3west'])
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("EP3 West- Day 2")
plt.savefig('ep3west_day2.png', dpi=500)
plt.show()

### EP3 West Day 3 Retweets and Favorites
Day 3- 12/11/16

In [None]:
west3_retweets = west3['retweet_count'].sum()
print(west3_retweets)
west3_favorites = west3['favorite_count'].sum()
print(west3_favorites)

### EP3 West Day 3 Top 5 Tweets

In [None]:
sorted_west3 = west3.sort_values(['important_tweets'], ascending = False)
sorted_west3.head(5)

### EP3 West Day 3 Word Cloud

In [None]:
text = " ".join(west3["text"].values)
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'ep3west'])
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("EP3 West- Day 3")
plt.savefig('ep3west_day3.png', dpi=500)
plt.show()

Great, now we've got all of our data, but it's tough to look at. Let's make some bar graphs and compare West to East in each category.

## Charts to Make Comparisons Easier

First let's look at total tweets for each conference

In [None]:
east = len(df_east)
west = len(df_west)
labels = ["East", "West"]
data = [east, west]

xlocations = np.array(range(len(data)))+0.5
width = 0.5
plt.bar(xlocations, data, width=width, align='edge')  # edge-align so the ticks below sit at bar centers
plt.xticks(xlocations+width/2, labels)
plt.xlim(0, xlocations[-1]+width*2)
plt.title("Total Tweets by Conference")
plt.ylabel("Tweets")
plt.savefig("total tweets by conference.png")
plt.show()

Now let's look at retweets by conference

In [None]:
east = east_retweets
west = west_retweets
labels = ["East", "West"]
data = [east, west]

xlocations = np.array(range(len(data)))+0.5
width = 0.5
plt.bar(xlocations, data, width=width, align='edge')  # edge-align so the ticks below sit at bar centers
plt.xticks(xlocations+width/2, labels)
plt.xlim(0, xlocations[-1]+width*2)
plt.title("Total Retweets by Conference")
plt.ylabel("Retweets")
plt.savefig("Total Retweets by Conference.png")

plt.show()

Next we'll look at favorites by conference

In [None]:
east = east_favorites
west = west_favorites
labels = ["East", "West"]
data = [east, west]

xlocations = np.array(range(len(data)))+0.5
width = 0.5
plt.bar(xlocations, data, width=width, align='edge')  # edge-align so the ticks below sit at bar centers
plt.xticks(xlocations+width/2, labels)
plt.xlim(0, xlocations[-1]+width*2)
plt.title("Total Favorites by Conference")
plt.ylabel("Favorites")
plt.savefig("Total Favorites by Conference.png")

plt.show()

Now let's look at the total tweets, retweets, and favorites for each day within each conference.  
We'll start with total tweets.

In [None]:
n_groups = 3
east_tweets = (len(east1), len(east2), len(east3))
west_tweets = (len(west1), len(west2), len(west3))

fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.35

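# align='edge' keeps each tick (index + bar_width) between the East and West bars
rects1 = plt.bar(index, east_tweets, bar_width, color='b', label='East', align='edge')

rects2 = plt.bar(index + bar_width, west_tweets, bar_width, color='g', label='West', align='edge')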

plt.xlabel('Conference Day')
plt.ylabel('Tweets')
plt.xticks(index + bar_width, ('Day 1', 'Day 2', 'Day 3'))
plt.title("Tweets by Day for Each Conference")
plt.legend()
plt.savefig('tweets by day.png')
plt.show()
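
The `index + bar_width` arithmetic in that cell is what places the West bars next to the East bars and puts each day label between them. Here's a small sketch of that layout math on its own. The `grouped_bar_layout` helper is hypothetical, and it assumes the bars are drawn from their left edge. Depending on the Matplotlib version, bars may be centered on their x value by default, which would shift them relative to the ticks.

```python
def grouped_bar_layout(n_groups, n_series, bar_width):
    """Return (left_edges, tick_centers) for edge-aligned grouped bars.

    left_edges[s][g] is the left edge of series s in group g, and
    tick_centers[g] is the middle of group g, where its label belongs.
    """
    left_edges = [
        [g + s * bar_width for g in range(n_groups)]
        for s in range(n_series)
    ]
    tick_centers = [g + n_series * bar_width / 2.0 for g in range(n_groups)]
    return left_edges, tick_centers


# Three days, two conferences, same bar width as the chart above
edges, ticks = grouped_bar_layout(3, 2, 0.35)
print(ticks)
```

With two series, the tick centers work out to `index + bar_width`, which matches the `plt.xticks` call above.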

Next we'll look at retweets by day

In [None]:
n_groups = 3
east_tweets = (east1_retweets, east2_retweets, east3_retweets)
west_tweets = (west1_retweets, west2_retweets, west3_retweets)

fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.35

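# align='edge' keeps each tick (index + bar_width) between the East and West bars
rects1 = plt.bar(index, east_tweets, bar_width, color='b', label='East', align='edge')

rects2 = plt.bar(index + bar_width, west_tweets, bar_width, color='g', label='West', align='edge')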

plt.xlabel('Conference Day')
plt.ylabel('Retweets')
plt.xticks(index + bar_width, ('Day 1', 'Day 2', 'Day 3'))
plt.title("Retweets by Day for Each Conference")
plt.legend()
plt.savefig('retweets by day.png')
plt.show()

Finally, we'll look at favorites by day for each conference

In [None]:
n_groups = 3
east_tweets = (east1_favorites, east2_favorites, east3_favorites)
west_tweets = (west1_favorites, west2_favorites, west3_favorites)

fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.35

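# align='edge' keeps each tick (index + bar_width) between the East and West bars
rects1 = plt.bar(index, east_tweets, bar_width, color='b', label='East', align='edge')

rects2 = plt.bar(index + bar_width, west_tweets, bar_width, color='g', label='West', align='edge')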

plt.xlabel('Conference Day')
plt.ylabel('Favorites')
plt.xticks(index + bar_width, ('Day 1', 'Day 2', 'Day 3'))
plt.title("Favorites by Day for Each Conference")
plt.legend()
plt.savefig('favorites by day.png')
plt.show()

That's it!  
Hopefully you found this interesting.  
If you did, let me know at [@CodyWeisbach](http://twitter.com/codyweisbach) on Twitter.  
If people are interested I'll do a similar analysis for each session of the conference. 