## Using Twitter to Collect Data for College Admissions


### Project Scope and Impact

The intention of this project is to develop a mechanism to analyze Twitter data related to the freshman admissions process at Philadelphia area colleges and universities. Sentiment analysis on data culled from Twitter gives institutions the opportunity to better understand their place in an increasingly digital landscape. Twitter reactions related to admissions and financial aid for an institution can provide another metric to gauge population engagement, reputation, effectiveness of communications/marketing, and many other areas of an institution.

Social media is regularly monitored by the communications departments at colleges and universities. In admissions specifically, staff will manually check various social media feeds for any reactions - positive or negative - around the time that admissions decisions are being released. Those preliminary sentiments can provide an informal indication of how an admissions cycle might progress. Additionally, this monitoring can identify any potential issues that were not discovered during testing.

A sentiment analysis of admissions Twitter data would add value to the following:
<ul>
    <li>Admissions and Financial Aid</li>
    <li>Institutional Planning</li>
    <li>Marketing and University Communications</li>
    </ul>


### Project Development and Challenges

**Preliminary Data Collection**<br>
Utilizing Twitter’s Advanced Search functionality, an initial collection of college acceptance tweets from Drexel and West Chester University was found. Each institution notifies its admitted students of a specific hashtag (#drexelaccepted, #wcuaccepted), which was searched during December 2017, one of two major decision release seasons for the incoming 2018 class. These search parameters confirmed that data with positive sentiment should be readily available, as shown in the examples below. <br>
<a href='https://twitter.com/lanaaquino/status/941865009256296450' target='_blank'> Example of #drexelaccepted tweet</a><br>
<a href='https://twitter.com/musicalbliss123/status/937074000488030211' target='_blank'> Example of #wcuaccepted tweet</a><br>
<p>
**Challenges**<br>
<ul>
    <li>Access to data: At no cost, the Twitter API only allows access to the past week of data. Decision release and notification take place over several months, and historic access is cost-prohibitive.</li>
    <li>Identifying search terms: Neutral or negative sentiment is hard to capture because there are no readily available hashtags or terms to search.</li>
    </ul>


### Project Data Dictionary

In order to access Twitter's API, it is necessary to create an account, an application, and generate the necessary developer keys/tokens. Once configured, the API will allow access to any publicly available tweets on the site. A publicly available Tweet is any tweet that can be found via basic search on Twitter where there are no privacy restrictions on the visibility of content.

**Data Dictionary**

_Data Collected from Twitter_ <br>

| Variable | Variable ID | Description | Data Type |
| --- | --- | --- | --- |
| Twitter ID | twitter_id | The unique numeric ID that Twitter assigns to the tweet (the `id` field of each result) - cannot be changed. | int |
| User ID | user_id | The user ID or Twitter handle that the user creates - can be changed. | string |
| Name | name | The name shown in the profile of the Twitter user | string |
| User description | user_desc | The profile description created by the Twitter user | string |
| Number of followers | number_of_followers | The number of followers the Twitter user has | int |
| Full text | full_text | Full text of tweet | string |
| Hashtag | hash_tag | Metadata tag added to the tweet by the Twitter user | string |
| Retweet count | retweet_count | The number of times the tweet has been retweeted | int |
| Create date | create_date | The date and time the tweet was created | datetime |
| Geographic location | geo_loc | A coalesce statement that captures first, the geographic location of the Twitter user when the tweet was created, or second, the location shown in the profile of the Twitter user. This field is null if neither exist. | string |
| Location | location | The location shown in the profile of the Twitter user | string |
<p />
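The `geo_loc` and `create_date` rows describe derived values, so a short sketch of how they could be computed from one tweet dictionary may help. The helper names `coalesce_location` and `parse_created_at` are hypothetical and are not part of the project code; the sketch assumes the `geo`, `user.location`, and `created_at` fields that appear in the search results used elsewhere in this notebook.

```python
from datetime import datetime

def coalesce_location(tweet):
    """Return the tweet's geo point if present, else the profile location, else None."""
    geo = tweet.get("geo")
    if geo:
        return str(geo)  # stored as a string, matching the data dictionary
    profile_location = tweet.get("user", {}).get("location")
    return profile_location or None  # empty strings become None

def parse_created_at(created_at):
    """Parse a timestamp such as 'Wed Feb 09 22:37:10 +0000 2011' into a datetime."""
    return datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

# Hand-written sample record (not real collected data)
sample = {
    "geo": None,
    "user": {"location": "Oklahoma City, OK"},
    "created_at": "Wed Feb 09 22:37:10 +0000 2011",
}
print(coalesce_location(sample))
print(parse_created_at(sample["created_at"]).isoformat())
```

Parsing the timestamp up front would make it easier to filter tweets by decision release window later on.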
_Function Arguments or Outputs_ <br>

| Variable | Variable ID | Description | Data Type |
| ---- | ---- | --- | --- |
| College name | college_name | The search term entered by the end user of the application, which is likely the name of a college or other related term. | string |
| TextBlob sentiment | blob_sent | The result of the tweet's sentiment analysis using TextBlob | string |
| NLTK sentiment | nltk_sent | The result of the tweet's sentiment analysis using the Natural Language Toolkit | string |
| Minor noise | minor_noise | The output of the minor_noise function that calculates the number of times a minor noise term is found in a tweet. Designed to assist in gauging the relevance of a tweet. Minor noise could include information related to the university (ex. sporting events) that is less relevant to the program objective. | int |
| Major noise | major_noise | The output of the major_noise function that calculates the number of times a major noise term is found in a tweet. Designed to assist in gauging the relevance of a tweet. Major noise could include anything from false flags in search terms (Drexel Hill) or generated content that is in physical proximity to the institution. | int |

### Program Features

The program:
<ul>
    <li>Uses the Twitter API with a set of pre-determined search terms to capture relevant tweets and related data using Twython</li>
    <li>Passes the full text to a function to strip the tweet of special characters and links</li>
    <li>Passes the full text to two separate sentiment analyses to categorize the tweet</li>
    <li>Passes the full text to two functions that calculate major and minor noise to identify the importance of the tweet relative to this project</li>
    <li>Creates an entry of all collected and calculated data points for any new tweet in a SQLite database</li>
    <li>Incorporates the original data plus these created fields and passes it to a function that packs it in JSON</li>
    </ul>
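The "any new tweet" step in the list above can be enforced by the database itself. The following is a minimal sketch, not the project's actual loader: it assumes a simplified two-column table named `tweets` (hypothetical) with the tweet ID as primary key, and uses SQLite's `INSERT OR IGNORE` so that a tweet collected twice is stored once.

```python
import sqlite3

def insert_if_new(conn, twitter_id, full_text):
    """Insert a tweet row only if its ID is not already stored; return True if inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO tweets (twitter_id, full_text) VALUES (?, ?)",
        (str(twitter_id), full_text),
    )
    conn.commit()
    return cur.rowcount == 1

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets (twitter_id TEXT PRIMARY KEY, full_text TEXT)"
)
first = insert_if_new(conn, 1070095700237549569, "first copy")
second = insert_if_new(conn, 1070095700237549569, "duplicate copy")
row_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(first, second, row_count)
```

Because overlapping searches (for example "Drexel" and "#drexelaccepted") can return the same tweet, a key on the tweet ID keeps repeated runs from inflating the counts.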

### Data Provided in Project Submission

This project submission includes the following examples of data extracted by the program:
<ul>
    <li>JSON outputs of the raw Twitter data (as collected) from five separate searches (#drexelaccepted, #GoingNova, #wcuaccepted, Drexel (two searches))</li>
    <li>JSON output from the create_json function which overwrites each time it is run (found in data directory)</li>
    <li>The SQLite database, admissions_twitter.db, which contains the database dump from the create_table and tweet_load functions (requires SQLite or similar tool to access)</li>
    </ul>

**Keys and tokens for developer account**

In [2]:
########### HIDDEN

<b>Using the package Twython</b> 

In [3]:
from twython import Twython, TwythonError
import requests
from pprint import pprint 

twitter = Twython(CONSUMER_KEY,CONSUMER_SECRET)

In [4]:
#HIDDEN

**Initial function to collect tweets defined in parent function**

In [20]:
def by_college(name,number_tweets):
    results = t.search(q=name, count=number_tweets,tweet_mode='extended')
    all_tweets = results['statuses']
    return all_tweets

**Function to save all acquired data into a JSON file, including function parameters and execution date**

In [19]:
def raw_json_load(all_tweets,college_name,number_of_tweets):
    import json
    import datetime
    # timestamp of the load without colons, e.g. '2018-12-04 221530', so it is safe in a file name
    date = datetime.datetime.now().strftime('%Y-%m-%d %H%M%S')
    # the file name records the search term, the requested tweet count, and the load time
    fname = "raw_"+college_name+"_"+str(number_of_tweets)+"_"+date+".json"
    # the with block closes the file even if json.dump fails
    with open(fname, "w") as f:
        json.dump(all_tweets, f)

<b>Utility function to strip the full text of special characters, links, etc</b>

In [18]:
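def clean_tweet(tweet): 
    import re
    # replace @mentions, non-alphanumeric characters, and links with spaces, then collapse whitespace
    # (raw string so the regex escapes are not treated as string escapes)
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())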
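To make the effect of the cleaning step concrete, here is a small usage sketch. It repeats the same regular expression so that it runs on its own; the two input strings are invented for illustration and are not collected tweets.

```python
import re

def clean_tweet(tweet):
    # same pattern as the notebook's utility function
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
    return ' '.join(re.sub(pattern, " ", tweet).split())

print(clean_tweet("RT @DrexelAdmit: I got in! #drexelaccepted https://t.co/abc123"))
# -> RT I got in drexelaccepted

print(clean_tweet("@Drexel_XD hello"))
# -> XD hello
```

The second call shows a limitation worth knowing: the mention pattern does not include underscores, so part of a handle such as `@Drexel_XD` survives cleaning.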

**Function to complete TextBlob Sentiment Analysis**

In [17]:
#Function to get sentiment using TextBlob
def get_tweet_sentiment(tweet):
    from textblob import TextBlob  
    # create TextBlob object of passed tweet text 
    analysis = TextBlob(clean_tweet(tweet)) 
    # set sentiment 
    if analysis.sentiment.polarity > 0: 
        return 'positive'
    elif analysis.sentiment.polarity == 0: 
        return 'neutral'
    else: 
        return 'negative'

**Function to complete Natural Language Toolkit Sentiment Analysis**

In [16]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.downloader.download('vader_lexicon')

#Function to get sentiment using NLTK

def get_tweet_sentiment_nltk(tweet):
    # score the cleaned text once with VADER; the 'compound' score ranges from -1 to 1
    analysis = clean_tweet(tweet)
    sid = SentimentIntensityAnalyzer()
    compound = sid.polarity_scores(analysis)['compound']
    if compound >= 0.5:
        return 'positive'
    elif compound <= -0.5:
        return 'negative'
    else:
        return 'neutral'

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Jrl362\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
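The NLTK function labels a tweet by comparing VADER's compound score against cutoffs of 0.5 and -0.5. As a sketch of how that cutoff could be made adjustable, the hypothetical helper below (`label_from_compound`, not part of the project code) separates the labeling rule from the scoring so that different thresholds can be compared without re-scoring the text.

```python
def label_from_compound(compound, threshold=0.5):
    """Map a compound score in [-1, 1] to a label using a symmetric threshold."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Invented scores, used only to show how the threshold changes the label
for score in (0.72, 0.2, -0.6):
    print(score, label_from_compound(score), label_from_compound(score, threshold=0.05))
```

A lower threshold moves mildly worded tweets out of the neutral group, which may matter when comparing these labels with the TextBlob results.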


**Function to calculate minor noise such as related sporting events or other school spirit activities**

In [15]:
def minor_noise(text): 
    m_lvl = 0
    m_noise = ('Basketball','Baseball','Soccer','Lacross','Eagles','Knights','Dragons','Play','Lead','Basket','Points','defeats','victory','game','starters','starter','offense','defense','hoops')
    # count how many noise terms appear anywhere in the tweet (case-insensitive substring match)
    for m in m_noise:
        if m.lower() in text.lower():
            m_lvl = m_lvl + 1
    return m_lvl


**Function to calculate major noise such as unrelated content that references the search term/institution**

In [14]:
def major_noise(text): 
    maj_lvl = 0
    maj_noise = ('Wendy’s','McDonald’s','Taco','Pizza','Wahoo’s','Wawa','dunkin','Drexel Hill')
    # count how many noise terms appear anywhere in the tweet (case-insensitive substring match)
    for m in maj_noise:
        if m.lower() in text.lower():
            maj_lvl = maj_lvl + 1
    return maj_lvl
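Both noise functions use substring matching, so a term such as "Play" also matches "display" and "Lead" matches "leadership". If that turns out to over-count, one possible alternative is whole-word matching. The sketch below uses a hypothetical helper, `count_noise_terms`, that is not part of the project code.

```python
import re

def count_noise_terms(text, terms):
    """Count how many terms occur in text as whole words or phrases, ignoring case."""
    count = 0
    for term in terms:
        # \b anchors the term at word boundaries; re.escape keeps phrases literal
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            count += 1
    return count

terms = ('Play', 'Lead', 'Dragons', 'Drexel Hill')
print(count_noise_terms("Great display of leadership", terms))
print(count_noise_terms("Dragons lead at the half in Drexel Hill", terms))
```

The trade-off is that whole-word matching no longer catches plurals or longer forms ("games", "lacrosse" from "Lacross"), so the term list would need to spell those out.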

**Function to create SQLite database and table if it does not exist, then update table with data.**<br>
_Challenge_: Parsing tweets with special characters created the need for the clean_tweet() function.

In [25]:
## function to save data to SQLite
import sqlite3

def create_table(u,userdesc1,numfollow,f_text,hshtag,ret_cnt,create_time,c_name,twitter_id,geoloc,location,name,text_blob_sent,nltk_sent,min_noise,maj_noise):  # creates table and inserts one row
    ### SQLite raised errors on some raw values, so these fields are converted to strings first
    twitter_id = str(twitter_id)
    u=str(u)
    userdesc1 = str(userdesc1)
    f_text = str(f_text)
    hshtag = str(hshtag)
    numfollow= str(numfollow)
    ret_cnt = str(ret_cnt)
    create_time = str(create_time)
    c_name = str(c_name)
    geoloc = str(geoloc)
    location = str(location)
    name = str(name)

    ## cleaning: tweet_load passes str(bytes), e.g. "b'text'"; remove only that wrapper
    ## (splitting on the letter 'b' would cut the value at the first 'b' it contains)
    u = u.strip()
    if u.startswith(("b'", 'b"')):
        u = u[2:-1]
    u = u.strip()
    if f_text.startswith(("b'", 'b"')):
        f_text = f_text[2:-1]

    ## creating the database and table if they do not exist, then loading the data
    conn = sqlite3.connect('admissions_twitter1.db', timeout = 30)
    c = conn.cursor()
    c.execute('CREATE TABLE IF NOT EXISTS admissions_twitter1(twitter_id TEXT, user_id TEXT, user_desc TEXT, number_of_followers TEXT,full_text BLOB, hash_tag TEXT, retweet_count TEXT, create_date TEXT,college_name TEXT,geo_loc TEXT,location TEXT,name Text, blob_sent TEXT, nltk_sent Text, minor_noise Text, major_noise Text)')
    c.execute('INSERT INTO admissions_twitter1(twitter_id,user_id,user_desc,number_of_followers,full_text,hash_tag,retweet_count,create_date,college_name,geo_loc,location,name,blob_sent,nltk_sent,minor_noise,major_noise)VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',(twitter_id,u,userdesc1,numfollow,f_text,hshtag,ret_cnt,create_time,c_name,geoloc,location,name,text_blob_sent,nltk_sent,min_noise,maj_noise))                    
    conn.commit()
    conn.close()  # release the database file after each insert


<b>Function to load data into a dictionary,  in preparation to pack into a JSON file</b>

In [12]:
## function to create a dictionary to pack into a JSON file
import json
tweetjson = {}
def create_json(u,userdesc1,numfollow,f_text,hshtag,ret_cnt,create_time,c_name,twitter_id,geoloc,location,name,text_blob_sent,nltk_sent,min_noise,maj_noise):

 
    tweet = tweetjson.get(twitter_id,dict())
   # tweet['twitterID'] = twitter_id
    u = str(u)
    tweet['user_id'] = u
    tweet['user_desc'] = userdesc1  
    tweet['number_of_followers'] = numfollow
    tweet['full_text'] = f_text
    tweet['hash_tag'] = hshtag
    tweet['retweet_count'] = ret_cnt
    tweet['create_date'] = create_time
    tweet['college_name'] = c_name
    tweet['geo_loc'] = geoloc
    tweet['location'] = location
    tweet['name'] = name
    tweet['blob_sent'] = text_blob_sent
    tweet['nltk_sent'] = nltk_sent
    tweet['minor_noise'] = min_noise
    tweet['major_noise'] = maj_noise
    tweetjson[twitter_id] = tweet

In [66]:
import json
# the with block closes the file once the dictionary has been written
with open("data/tweetjson.json", "w") as f:
    json.dump(tweetjson, f)

<b>Output of dictionary that’s being packed into JSON file</b>

In [67]:
tweetjson

{1070095700237549569: {'user_id': "b'TotalTrafficOKC'",
  'user_desc': '',
  'number_of_followers': 2225,
  'full_text': "b'Accident cleared on Drexel Ave south of SW 44th St #OKCtraffic https://t.co/5PW2TfQkaY'",
  'hash_tag': [{'text': 'OKCtraffic', 'indices': [51, 62]}],
  'retweet_count': 0,
  'create_date': 'Wed Feb 09 22:37:10 +0000 2011',
  'college_name': 'Drexel',
  'geo_loc': {'type': 'Point', 'coordinates': [35.4207, -97.569]},
  'location': 'Oklahoma City, OK',
  'name': 'TTN Oklahoma City',
  'blob_sent': 'neutral',
  'nltk_sent': 'neutral',
  'minor_noise': 0,
  'major_noise': 0},
 1070090739361738752: {'user_id': "b'Drexel_XD'",
  'user_desc': "God forgives, Drexel doesn't.",
  'number_of_followers': 92,
  'full_text': "b'RT @Sporf: \\xf0\\x9f\\x94\\xb5 @ManCity Wingers This Season:\\n\\n\\xf0\\x9f\\x8f\\xb4\\xf3\\xa0\\x81\\xa7\\xf3\\xa0\\x81\\xa2\\xf3\\xa0\\x81\\xa5\\xf3\\xa0\\x81\\xae\\xf3\\xa0\\x81\\xa7\\xf3\\xa0\\x81\\xbf @Sterling7:\\n\\n\\xe2\\x9a\\xbd\\xef\\xb8\\x

**Parent function to collect all data, run above functions, and store all data and calculated fields**

In [22]:
## function to load tweets from Twitter
def tweet_load(college_name,number_of_tweets):
    # pass the search term to the by_college collection function
    college_name = str(college_name)
    all_tweets = by_college(college_name,number_of_tweets) # from web scraper function
    raw_json_load(all_tweets,college_name,number_of_tweets)  ## Passing to function to load raw data
    for tweet in all_tweets: #### data
        #pprint(tweet)
        inner = []
        user = tweet["user"]["screen_name"].encode('utf-8')
        userdesc = tweet["user"]["description"]
        numfollowers = tweet["user"]["followers_count"] # Number of followers, identify influencers
        twitter_id = tweet["id"]
        geoloc = tweet["geo"]
        ftext = tweet["full_text"].encode('utf-8')  # full text tweet 280 chars
        location = tweet['user']['location']
        name = tweet['user']['name']
        hashtag = tweet['entities']['hashtags'] # hashtag
        retweet_cnt = tweet['retweet_count']  
        created_at = tweet['created_at']
        
        ### get sentiment and noise
        ftext = str(ftext)
        text_blob_sent = get_tweet_sentiment(ftext) # send to  get_tweet_sentiment function that uses text blob
        nltk_sent = get_tweet_sentiment_nltk(ftext) # send to get_tweet_sentiment_nltk function that uses nltk
        min_noise = minor_noise(ftext)#  send to minor noise function
        maj_noise = major_noise(ftext) # send to major noise function
        
        #add data to table, will create database and table if they do not exist 
        create_table(user,userdesc,numfollowers,ftext,hashtag,retweet_cnt,created_at,college_name,twitter_id,geoloc,location,name,text_blob_sent,nltk_sent,min_noise,maj_noise) 
        create_json(user,userdesc,numfollowers,ftext,hashtag,retweet_cnt,created_at,college_name,twitter_id,geoloc,location,name,text_blob_sent,nltk_sent,min_noise,maj_noise)

<b>This runs the program</b>


In [26]:
##Runs program using pre-determined search term
#additional search terms to build into a list and iterate through: 
#wcuaccepted #wcu #goingnova, #drexel, #futuredragons, #newdragons, Drexel, Villanova, West Chester

tweet_load("Drexel",100)

### Next Steps

In looking at continued improvement and refinement of this project, we propose the following:
<ul>
    <li>Run the program during critical decision timeframes - December and March at a minimum - to collect relevant data</li>
    <li>Investigate the Twitter Streaming API following presentation Q&A - see below</li>
    <li>Create lists to hold major function parameters, such as Twitter search terms and major/minor noise terms, to be loaded into the respective functions. This protects the integrity of the core code while making the function parameters more robust</li>
    <li>Use the search term list to iterate through terms instead of running each search manually</li>
    <li>Identify how to apply noise parameters, sentiment analysis, etc. after a critical mass of data is collected</li>
    </ul>
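The step of iterating through a search term list could look roughly like the sketch below. The function `run_searches` and the `SEARCH_TERMS` list are hypothetical; the loader is passed in as an argument so that the notebook's `tweet_load` could be used in practice, while a stand-in function is used here to keep the example runnable without Twitter credentials.

```python
def run_searches(search_terms, number_of_tweets, loader):
    """Call loader(term, number_of_tweets) for each term; collect failures instead of stopping."""
    failed = {}
    for term in search_terms:
        try:
            loader(term, number_of_tweets)
        except Exception as exc:  # keep going so one bad term does not end the run
            failed[term] = str(exc)
    return failed

# Terms taken from the comments in the run cell above
SEARCH_TERMS = ["#wcuaccepted", "#goingnova", "#drexel", "Drexel", "Villanova", "West Chester"]

# Stand-in loader for demonstration; in the notebook this would be tweet_load
collected = []
def fake_loader(term, count):
    collected.append((term, count))

failures = run_searches(SEARCH_TERMS, 100, fake_loader)
print(len(collected), failures)
```

Keeping the terms in one list also means they can be reviewed or extended without touching the collection code.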

### Twitter Streaming API

Following the in-class discussion about Twitter's streaming API, we see its main benefit as allowing us to collect relevant data in real time. What we need to look at more closely is the organization of the data elements, so that we can collect and store them as we did with the REST API above. Their nested dictionary structure has some similarities to the soccer tournament data. Most notably, the full_text value of the tweet is not contained in all results. These updates are the first priority to improve data collection going into the next segment of the course. Proper configuration will include a restarter script and some kind of schedule to dump the data for sentiment analysis and storage. Below are the streaming code and a sample result that demonstrate the availability of data.
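Because `full_text` is not present in every streaming result, the storage code would need a fallback. The sketch below is one way to handle it, assuming the streaming payload layout in which longer tweets carry an `extended_tweet` object containing `full_text`, and shorter ones carry only `text`. The helper name `extract_full_text` is hypothetical, and the sample dictionaries are invented.

```python
def extract_full_text(status):
    """Return the longest available text for a tweet dictionary, or None if there is none."""
    extended = status.get("extended_tweet")
    if extended and extended.get("full_text"):
        return extended["full_text"]      # streaming result for a long tweet
    if status.get("full_text"):
        return status["full_text"]        # REST search with tweet_mode='extended'
    return status.get("text")             # short tweet, or a non-tweet message (None)

# Invented sample payloads
long_stream = {"text": "Accepted to my top choice...", "extended_tweet": {"full_text": "Accepted to my top choice today!"}}
short_stream = {"text": "I got in!"}
rest_result = {"full_text": "So excited #drexelaccepted"}

for payload in (long_stream, short_stream, rest_result):
    print(extract_full_text(payload))
```

Using one helper for both sources would let the sentiment and noise functions receive the same kind of input regardless of which API collected the tweet.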

### Twython Streaming Class to print live Twitter data

In [None]:
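from twython import TwythonStreamer
from pprint import pprint

retrytime = 1  # initial number of seconds to wait before a retry

class MyStreamer(TwythonStreamer):
    error_count = 0  # errors seen so far by this streamer

    def on_success(self, data):
        # only tweets carry a 'text' key; other stream messages are skipped
        if 'text' in data:
            pprint(data)
          #  print(data['user_desc'])
    def on_error(self, status_code, data, headers=None):
        pprint(status_code)
        # back off exponentially: wait 2, 4, 8, ... seconds between retries
        # (kept on the instance; assigning to module-level names here would raise an error)
        self.error_count += 1
        self.retry_in = 2 ** self.error_count

        # Want to stop trying to get data because of the error?
        # Uncomment the next line!
        # self.disconnect() 

stream = MyStreamer(CONSUMER_KEY, CONSUMER_SECRET,
                    ACCESS_TOKEN, ACCESS_TOKEN_SECRET, retry_in=retrytime)
stream.statuses.filter(track='Trump')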

### Sample output from Twitter streaming API searching for 'Drexel'