# NLP

***NLP***, or ***Natural Language Processing***, is a subdomain of both linguistics and computer science. As a discipline, it is primarily concerned with using computation to understand, analyze, and generate natural language data.

Today we will look at an example of *NLP* by conducting a ***sentiment analysis***. Sentiment analysis is the process of taking some textual data and assigning it a polarity score (typically positive, neutral, or negative) based upon the lexical features of the text.

The example we will use today relies on *vaderSentiment*, a package purpose-built for sentiment analysis of social media posts. It captures not only the polarity of a post but also its *valence* (i.e. how intense that sentiment is, taking into account words that may modify polarity). Vader also interprets the sentiment polarity of emojis, which makes it particularly valuable in a social media context.

Concepts:
- [NLP](https://en.wikipedia.org/wiki/Natural_language_processing)
- [Sentiment Analysis](https://en.wikipedia.org/wiki/Sentiment_analysis#Subjectivity/objectivity_identification)
- [vaderSentiment](https://github.com/cjhutto/vaderSentiment#features-and-updates)

In [None]:
import pandas as pd
import tweepy as tp
import nltk
import vaderSentiment.vaderSentiment as vd

# Tweepy

In this example we will be using tweepy, a library designed to act as a wrapper around the Twitter API. It provides functionality to quickly interact with Twitter and extract data such as tweets and user info.

Before you can begin, you'll need to set up an app and get your connection keys/tokens for the Twitter API. The first link below points to the docs, which explain how to get your keys.

[Twitter API Docs](https://developer.twitter.com/en/docs)

[Tweepy Docs](http://docs.tweepy.org/en/latest/)

In [None]:
# Our keys and tokens. Remember that it goes against best practices to store a
# token or password directly in your scripts / notebooks, so replace these
# placeholders at run time rather than saving real values here.
con_key = "<consumer_key>"
con_secret_key = "<consumer_secret_key>"
access_token = "<access_token>"
access_secret_token = "<access_secret_token>"
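
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below is only an illustration: the four variable names are hypothetical (neither tweepy nor Twitter defines them), and the helper `load_twitter_credentials` is not part of any library.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, failing loudly if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

With the variables exported in your shell, `creds = load_twitter_credentials()` returns a dict whose values can be assigned to `con_key`, `con_secret_key`, `access_token` and `access_secret_token`.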

In [None]:
# Authenticate to twitter and generate an API object we can interact with
auth = tp.OAuthHandler(con_key, con_secret_key)

# Set up authentication tokens for our API access
auth.set_access_token(access_token, access_secret_token)

# Construct the API instance
api = tp.API(auth)

# User
user = '@elonmusk'

In [None]:
#get some details about our twitter user
#(passing screen_name by keyword works across tweepy versions)
item = api.get_user(screen_name=user)
print("name: " + item.name)
print("screen_name: " + item.screen_name)
print("description: " + item.description)
print("statuses_count: " + str(item.statuses_count))
print("friends_count: " + str(item.friends_count))
print("followers_count: " + str(item.followers_count))

In [None]:
#initialize a list to hold all the tweepy Tweets
alltweets = []	
	
#make initial request for most recent tweets (200 is the maximum allowed count)
new_tweets = api.user_timeline(screen_name=user,count=200)
	
#save most recent tweets
alltweets.extend(new_tweets)
	
#save the id of the oldest tweet less one
oldest = alltweets[-1].id - 1
	
#keep grabbing tweets until there are no tweets left to grab
while len(new_tweets) > 0:
	print(f"getting tweets before {oldest}")
		
	#all subsequent requests use the max_id param to prevent duplicates
	new_tweets = api.user_timeline(screen_name=user,count=200,max_id=oldest)
		
	#save most recent tweets
	alltweets.extend(new_tweets)
		
	#update the id of the oldest tweet less one
	oldest = alltweets[-1].id - 1
		
	print(f"...{len(alltweets)} tweets downloaded so far")
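
The `max_id` paging pattern above can be tried without touching the network. The sketch below is an illustration under stated assumptions: `fake_user_timeline` is a hypothetical stand-in for `api.user_timeline` that serves made-up ids, and `fetch_all` is a helper written for this example. It also stops cleanly when the very first page is empty, a case in which indexing `alltweets[-1]` would raise an `IndexError`.

```python
def fetch_all(fetch_page, page_size=200):
    """Collect every item by walking backwards through ids with max_id."""
    collected = []
    max_id = None  # None means "start from the newest item"
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:  # an empty page means nothing older is left
            break
        collected.extend(page)
        # max_id is inclusive, so subtract one to skip the oldest item already seen
        max_id = page[-1]["id"] - 1
    return collected


# Hypothetical stand-in for api.user_timeline: ids 450..1, newest first.
FAKE_TIMELINE = [{"id": i} for i in range(450, 0, -1)]


def fake_user_timeline(count, max_id=None):
    eligible = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return eligible[:count]


tweets = fetch_all(fake_user_timeline)
print(len(tweets))  # 450
```

Subtracting one from the oldest id is what prevents the last tweet of one page from reappearing as the first tweet of the next.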

# Building the Dataframe

Now that we have pulled our data down from Twitter, we need to reshape it so that it is easier to process. Each of the *status* objects in the **alltweets** list exposes attributes such as its text, creation time, and retweet count. We can read these attributes with list comprehensions and build a dataframe from them. An example is below:

In [None]:
# Here we build a data frame by using list comprehensions on the tweet status objects
elon_df = pd.DataFrame(data={'Tweet':[tweet.text for tweet in alltweets],
                             'Timestamp':[tweet.created_at for tweet in alltweets],
                             'favorites_count':[tweet.favorite_count for tweet in alltweets],
                             'retweet_count':[tweet.retweet_count for tweet in alltweets],
                             'tweet_source':[tweet.source for tweet in alltweets],
                             'in reply to':[tweet.in_reply_to_screen_name for tweet in alltweets],
                             'is_retweet?':[tweet.text.startswith('RT') for tweet in alltweets],
                             'Tweet_Length':[len(tweet.text) for tweet in alltweets]})
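
An alternative layout is to build one dict per tweet rather than one list per column, which keeps all the fields of a row together. The sketch below assumes nothing about tweepy beyond the attribute names already used above: the `statuses` list holds hypothetical stand-in objects with made-up values, and `status_to_row` is a helper written for this example.

```python
from datetime import datetime
from types import SimpleNamespace

# Hypothetical stand-ins for tweepy status objects, with made-up values.
statuses = [
    SimpleNamespace(text="RT @someone: hello", created_at=datetime(2020, 1, 5),
                    favorite_count=0, retweet_count=12),
    SimpleNamespace(text="Launch went well!", created_at=datetime(2020, 2, 9),
                    favorite_count=40, retweet_count=7),
]


def status_to_row(status):
    """Flatten one status object into a plain dict (one dataframe row)."""
    return {
        "Tweet": status.text,
        "Timestamp": status.created_at,
        "favorites_count": status.favorite_count,
        "retweet_count": status.retweet_count,
        "is_retweet?": status.text.startswith("RT"),
        "Tweet_Length": len(status.text),
    }


rows = [status_to_row(s) for s in statuses]
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, with the dict keys becoming the column names.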

# Using Vader to generate polarity scores

Detailed information on vader can be found here:
[vaderSentiment](https://github.com/cjhutto/vaderSentiment#features-and-updates)

TL;DR:

Vader allows us to rate strings of text based upon an inbuilt lexicon. The lexicon's values are modified by:

- typical negations (e.g., "not good")
- use of contractions as negations (e.g., "wasn't very good")
- conventional use of punctuation to signal increased sentiment intensity (e.g., "Good!!!")
- conventional use of word-shape to signal emphasis (e.g., using ALL CAPS for words/phrases)
- using degree modifiers to alter sentiment intensity (e.g., intensity boosters such as "very" and intensity dampeners such as "kind of")
- understanding many sentiment-laden slang words (e.g., 'sux')
- understanding many sentiment-laden slang words as modifiers such as 'uber' or 'friggin' or 'kinda'
- understanding many sentiment-laden emoticons such as :) and :D
- translating utf-8 encoded emojis such as 💘 and 💋 and 😁
- understanding sentiment-laden initialisms and acronyms (for example: 'lol')
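
To make the idea of a lexicon whose values are modified by context concrete, here is a deliberately tiny toy scorer. It is not Vader's algorithm: the lexicon, the modifier tables, and every number in them are invented for this sketch, and it only mimics a few of the rules listed above (negation, degree modifiers, ALL CAPS, and exclamation marks).

```python
# Toy lexicon and modifier tables, invented for this sketch (not Vader's).
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "sux": -1.5}
NEGATIONS = {"not", "wasn't", "isn't"}
BOOSTERS = {"very": 0.5, "uber": 0.5, "kinda": -0.5}


def toy_score(text):
    """Sum word valences, adjusting for negation, boosters, caps and '!'."""
    tokens = text.split()
    total = 0.0
    for i, raw in enumerate(tokens):
        bare = raw.strip("!.,?")
        word = bare.lower()
        if word not in LEXICON:
            continue
        valence = LEXICON[word]
        sign = 1.0 if valence > 0 else -1.0
        # ALL CAPS emphasis pushes the valence further from zero
        if bare.isupper():
            valence += 0.5 * sign
        negated = False
        # look back up to two tokens for degree modifiers and negations
        for prev in tokens[max(0, i - 2):i]:
            prev = prev.lower()
            if prev in BOOSTERS:
                valence += BOOSTERS[prev] * sign
            if prev in NEGATIONS:
                negated = True
        if negated:
            valence *= -0.5  # flip and dampen
        total += valence
    # each '!' (up to three) nudges a non-zero total further from zero
    bangs = min(text.count("!"), 3)
    if total > 0:
        total += 0.3 * bangs
    elif total < 0:
        total -= 0.3 * bangs
    return total


for sample in ["good", "Good!!!", "very good", "kinda good", "not good", "GOOD"]:
    print(sample, toy_score(sample))
```

Even in this toy form, "Good!!!" and "very good" score higher than "good", while "not good" comes out negative, which is the kind of adjustment the list above describes.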


### Getting Started

We build an analyzer instance and then we can apply that to string values to pull out some polarity scores.

In [None]:
#example of vader sentiment analyzer at work
analyzer = vd.SentimentIntensityAnalyzer()
analyzer.polarity_scores(elon_df['Tweet'][0])

In [None]:
#generate scores for all tweets in the dataframe (each tweet is scored only once)
scores = [analyzer.polarity_scores(tweet) for tweet in elon_df['Tweet']]
elon_df['positive_score'] = [score['pos'] for score in scores]
elon_df['neutral_score']  = [score['neu'] for score in scores]
elon_df['negative_score'] = [score['neg'] for score in scores]
elon_df['compound_score'] = [score['compound'] for score in scores]
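
The compound score is a single normalized value between -1 and 1, so it is convenient for turning scores into the positive / neutral / negative labels mentioned at the start. The sketch below uses the ±0.05 cut-offs suggested in the vaderSentiment README; the function name `label_from_compound` is an assumption made for this example, and the threshold is a parameter because the right cut-off depends on your data.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse polarity label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for value in (0.62, 0.0, -0.4):
    print(value, label_from_compound(value))
```

Applied to the dataframe, this could look like `elon_df['compound_score'].apply(label_from_compound)`, stored in a new column of your choosing.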

In [None]:
#save our processed data for later
elon_df.to_csv('elon_tweet.csv')

In [None]:
# do some time processing (Timestamp already holds datetime values)
elon_df['Month'] = elon_df['Timestamp'].dt.month_name()
elon_df['Year'] = elon_df['Timestamp'].dt.year
# month-year key used for grouping, e.g. '01-2020'
elon_df['date'] = elon_df['Timestamp'].dt.strftime('%m-%Y')

In [None]:
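# get the mean scores for each month
average_month =  elon_df.groupby('date').agg({'compound_score':'mean',
                                              'positive_score':'mean',
                                              'neutral_score':'mean',
                                              'negative_score':'mean'})
# 'MM-YYYY' keys sort by month before year, so rotate the rows back into
# chronological order (the split point 4 is specific to this dataset)
average_month = pd.concat([average_month.iloc[4:],average_month.iloc[0:4]])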
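
The row rotation with `iloc` is presumably needed because `groupby` sorts its keys, and month-first strings such as `'11-2019'` do not sort chronologically. The sketch below shows the effect with a few made-up timestamps and the standard library only; a year-first key such as `'%Y-%m'` is one possible alternative, not something the notebook above uses.

```python
from datetime import datetime

# Made-up timestamps spanning a year boundary.
stamps = [
    datetime(2019, 11, 3),
    datetime(2019, 12, 14),
    datetime(2020, 1, 20),
    datetime(2020, 2, 2),
]

# String keys sort character by character, so the leading field decides the order.
month_first = sorted({t.strftime("%m-%Y") for t in stamps})
year_first = sorted({t.strftime("%Y-%m") for t in stamps})

print(month_first)  # ['01-2020', '02-2020', '11-2019', '12-2019']
print(year_first)   # ['2019-11', '2019-12', '2020-01', '2020-02']
```

With a year-first key the groups would already come out in chronological order, so no dataset-specific split point would be needed.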

In [None]:
#get the max scores for each month
max_month =  elon_df.groupby('date').agg({'compound_score':'max',
                                          'positive_score':'max',
                                          'neutral_score':'max',
                                          'negative_score':'max'})
# same rotation as for the means: restore chronological order
# (the split point 4 is specific to this dataset)
max_month = pd.concat([max_month.iloc[4:],max_month.iloc[0:4]])

In [None]:
display(elon_df.query('compound_score>=.5'))

In [None]:
average_month.plot(figsize=(12,8))

In [None]:
max_month.plot(figsize=(16,8))