# Twitter Analysis of the APTA Next 2017 Conference

First I'll import the libraries that are needed

In [None]:
%matplotlib inline
import pymongo
from datetime import date
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


## EP3 East Analysis

### Data Wrangling
Before I can analyze anything, I need to get the data out of my database and organize it into a form that can easily be analyzed. 

First I'll make a connection to my database. Everything is stored in a database called MongoDB on my laptop. 

In [None]:
client_con = pymongo.MongoClient()

In [None]:
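# list_database_names() replaces database_names(), which newer PyMongo releases no longer provide
client_con.list_database_names()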

In [None]:
local_db = client_con['test']

In [None]:
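# list_collection_names() replaces collection_names(), which newer PyMongo releases no longer provide
local_db.list_collection_names()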

'Next2017' is the collection with the conference data.  
I'll store a reference to it in a variable called data_set.

In [None]:
data_set = local_db['Next2017']

Next I'll take a quick look at how many tweets are in there

In [None]:
# count_documents({}) counts every record; it replaces the older count()
data_set.count_documents({})

In case you are wondering, this is what one tweet looks like. 
There's a ton of data in each tweet, which is one of the reasons I'm interested in this

In [None]:
print(data_set.find_one())

Now I need to change the format of the data to something easier to look at.
First, I'll make the database collection into a list of dictionaries, which is easier to work with in Python

In [None]:
tweet_data = list(data_set.find())

Next I'll make that list into a dataframe, which looks more like a spreadsheet.  
I'm going to make it a function that I can reuse for the west data set.

In [None]:
def process_results(results):
    id_list = [tweet['id'] for tweet in results]
    data_set = pd.DataFrame(id_list, columns = ["id"])
    
    data_set['user'] = [tweet['user']['screen_name'] for tweet in results]
    data_set["text"] = [tweet['text'] for tweet in results]
    data_set["retweet_count"] = [tweet['retweet_count'] for tweet in results]
    data_set["favorite_count"] = [tweet['favorite_count'] for tweet in results]
    data_set["created_at"] = [tweet['created_at'] for tweet in results]
    
    return data_set
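
To make the input a bit more concrete, here is a minimal sketch of the same flattening step in plain Python. The two tweets are made up and heavily trimmed, and `flatten_tweet` and `FIELDS` are names I'm introducing just for illustration. It assumes each record has the same keys that `process_results` reads.

```python
# Hypothetical, heavily trimmed tweets; real records carry many more fields.
sample_tweets = [
    {
        "id": 1,
        "user": {"screen_name": "example_user"},
        "text": "Great session at #APTANEXT",
        "retweet_count": 3,
        "favorite_count": 5,
        "created_at": "Thu Jun 22 15:04:05 +0000 2017",
    },
    {
        "id": 2,
        "user": {"screen_name": "another_user"},
        "text": "RT @example_user: Great session at #APTANEXT",
        "retweet_count": 3,
        "favorite_count": 0,
        "created_at": "Thu Jun 22 15:10:00 +0000 2017",
    },
]

# Top-level keys copied as-is; the screen name is nested one level down.
FIELDS = ["id", "text", "retweet_count", "favorite_count", "created_at"]


def flatten_tweet(tweet):
    """Return one flat row (a dict) with only the fields used in the analysis."""
    row = {field: tweet[field] for field in FIELDS}
    row["user"] = tweet["user"]["screen_name"]
    return row


rows = [flatten_tweet(tweet) for tweet in sample_tweets]
print(rows[0]["user"], rows[0]["retweet_count"])
```

Each row is one flat dictionary per tweet, which is the same shape the dataframe columns hold.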

Now I'll use that function to process my data into the dataframe and store it in a variable called df_east.

In [None]:
df_east = process_results(tweet_data)

Let's take a look at it.  
First I'll check the length. Hopefully it's the same as my number of records above.  
Then I'll look at the top 5 rows and the bottom 5 rows to make sure it all looks right.

In [None]:
len(df_east)

In [None]:
df_east.head()

In [None]:
df_east.tail()

It all looks good.  
Now I need to change the 'created_at' column from text to a real date and time format, which will make it easier for me to split up the data sets.

In [None]:
df_east['created_at'] = pd.to_datetime(df_east['created_at'])
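
For a single value, the conversion looks roughly like this. This is a sketch using only the standard library, assuming Python 3 and assuming the timestamps follow the usual Twitter layout such as "Thu Jun 22 15:04:05 +0000 2017". The timestamp and the names `TWITTER_TIME_FORMAT` and `parse_created_at` are made up for illustration.

```python
from datetime import datetime

# Assumed layout of the created_at strings, e.g. "Thu Jun 22 15:04:05 +0000 2017".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse one created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


stamp = parse_created_at("Thu Jun 22 15:04:05 +0000 2017")  # made-up timestamp
print(stamp.isoformat())
print(stamp.date())
```

Once the values are real datetimes instead of strings, comparing or filtering by day becomes a simple comparison.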

In [None]:
df_east.head()

Now I'll make a new column that adds retweets plus favorites. That will give some idea of the importance of a particular tweet. 

In [None]:
df_east['important_tweets'] = df_east['retweet_count'] + df_east['favorite_count']
df_east.head()

Next, we need to adjust the data set to only look at original tweets, not retweets. 

First, I'll make a column that flags whether or not each tweet is a retweet, based on whether its text starts with 'RT'.

In [None]:
df_east['retweeted'] = df_east['text'].str.startswith('RT')
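
Checking for a leading 'RT' is a quick heuristic, and it can also catch an original tweet that happens to start with those letters. Here is a sketch of a slightly stricter check on the raw tweet dictionaries. It assumes the records follow the Twitter REST API v1.1 layout, where a retweet carries a `retweeted_status` object. The helper name `is_retweet` and the example tweets are made up.

```python
def is_retweet(tweet):
    """Return True if the tweet looks like a retweet.

    Assumes the v1.1 tweet layout, where retweets carry a 'retweeted_status'
    object. Falls back to the conventional 'RT @' text prefix otherwise.
    """
    if "retweeted_status" in tweet:
        return True
    return tweet.get("text", "").startswith("RT @")


# Made-up examples
examples = [
    {"text": "RT @example_user: See you at #APTANEXT"},
    {"text": "RTs are welcome! Join us at #APTANEXT"},
    {"text": "Live from the conference", "retweeted_status": {"id": 1}},
]

flags = [is_retweet(tweet) for tweet in examples]
print(flags)
```

The second example starts with 'RT' but is an original tweet, so the plain prefix check would have dropped it.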

In [None]:
df_east.head()

### Total Tweets
Now I'll include only the original tweets in the data set.  
Then I'll look at how many we have.

In [None]:
# Keep only the rows that are not flagged as retweets
df_east = df_east[~df_east['retweeted']]
len(df_east)

### Total Retweets and Total Favorites
Next, let's look at some numbers from the full conference data set.   
We have total tweets (664). Let's also look at total retweets and total favorites

In [None]:
east_retweets = df_east['retweet_count'].sum()
print(east_retweets)
east_favorites = df_east['favorite_count'].sum()
print(east_favorites)

## Top 10 Tweets from 
Now let's look at the 10 most important tweets.  
First we'll sort the data by "important_tweets" so the highest number is on top.  
Then we'll show the top 10.  
If you want to see the original tweet, just go to http://twitter.com/statuses/ and put the tweet ID at the end after that last "/".

In [None]:
sorted_east = df_east.sort_values(['important_tweets'], ascending = False)
sorted_east.head(10)
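
Here is the same ranking idea as a small plain-Python sketch, including the step of turning a tweet ID into a link. The sample tweets and the helper names (`engagement`, `top_tweets`, `tweet_url`) are made up for illustration. The URL pattern is the one described above.

```python
def engagement(tweet):
    """Retweets plus favorites, the same score as the 'important_tweets' column."""
    return tweet["retweet_count"] + tweet["favorite_count"]


def top_tweets(tweets, n=10):
    """Return the n tweets with the highest engagement, highest first."""
    return sorted(tweets, key=engagement, reverse=True)[:n]


def tweet_url(tweet_id):
    """Build a link to a tweet by appending its ID to the statuses URL."""
    return "http://twitter.com/statuses/" + str(tweet_id)


# Made-up tweets with only the fields the ranking needs
sample = [
    {"id": 101, "retweet_count": 2, "favorite_count": 1},
    {"id": 102, "retweet_count": 10, "favorite_count": 4},
    {"id": 103, "retweet_count": 0, "favorite_count": 7},
]

ranked = top_tweets(sample, n=2)
for tweet in ranked:
    print(tweet_url(tweet["id"]), engagement(tweet))
```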

### Word Cloud
The last step for the full conference analysis is the word cloud.

In [None]:
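# Join every tweet into one long string
text = " ".join(df_east["text"].values)
# Drop links, @mentions, the 'RT' marker and the bare word 'APTANEXT'
no_url_no_tag = " ".join([word for word in text.split(' ')
                        if 'http' not in word
                        and not word.startswith('@')
                        and word != 'RT'
                        and word != 'APTANEXT'])
# font_path points at a macOS system font; change it on other systems
wc = WordCloud(background_color="white", font_path="/Library/Fonts/Verdana.ttf", stopwords=STOPWORDS, width=500, height=300)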
wc.generate(no_url_no_tag)
plt.imshow(wc)
plt.axis("off")
plt.title("APTA Next 2017")
plt.savefig('Next2017.png', dpi=500)
plt.show()
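
A word cloud sizes each word by how often it shows up, so a quick word count is a handy sanity check on the cleaning step. This is only a sketch: the sample text is made up, `clean_words` is a hypothetical helper, and its drop list is my own assumption. Unlike the filter above, it ignores case and also drops the hashtag form '#aptanext'.

```python
from collections import Counter


def clean_words(text, drop=("rt", "aptanext", "#aptanext")):
    """Split text on whitespace and drop links, @mentions and unwanted words."""
    words = []
    for word in text.split():
        lowered = word.lower()
        if "http" in lowered or lowered.startswith("@") or lowered in drop:
            continue
        words.append(word)
    return words


# Made-up text standing in for the joined tweets
sample_text = (
    "RT @example_user: Great talk on mobility #APTANEXT "
    "http://t.co/abc Great crowd"
)

counts = Counter(word.lower() for word in clean_words(sample_text))
print(counts.most_common(3))
```

If a word you expected to filter out still tops the count, it will also dominate the cloud.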

That's it!  
Hopefully you found this interesting.  
If you did, let me know at [@CodyWeisbach](http://twitter.com/codyweisbach) on Twitter and I'll keep it up!