# Project: Sentiment Classification
- Make a model to determine whether a tweet positive or negative

### Step 1: Import the libraries

In [1]:
# project description:

# remove punctuatiions by string.punctuation
# helper function: 직접 정의한 함수

# step 10 : what words are actually classifying most
#  if sth is positive or negative?
#  e.g. "love", "hate" are strong words.

# Please try it out by yourself. 


In [51]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag
from random import shuffle
from nltk import NaiveBayesClassifier
from nltk import classify
from nltk.sentiment import SentimentIntensityAnalyzer

### Step 2: Download the sample tweets
- Execute the following cell

In [3]:
nltk.download('twitter_samples')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-dat

True

### Step 3: The tweets
- Get the positive and negative tweets.
    - HINT: You access the positive tweets by: **nltk.corpus.twitter_samples.strings('positive_tweets.json')**
    - HINT: Similarly for the negative tweets.
- Notice: There is also tweets with no sentiment - we will ignore them in this project
- Check a few tweets

In [4]:
positive_tweets = nltk.corpus.twitter_samples.strings("positive_tweets.json")
negative_tweets = nltk.corpus.twitter_samples.strings("negative_tweets.json")

In [5]:
positive_tweets[0]

'#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)'

### Step 4: Tokenize the tweets
- You get the tokenized tweets as follows:
    - **nltk.corpus.twitter_samples.tokenized('positive_tweets.json')**
    - Simlarly for **negative_tweets**
- Why tokenize?
    - To make processing easier
- Check a few tweets (tokenized)

In [6]:
tokenized_positive = nltk.corpus.twitter_samples.tokenized("positive_tweets.json")
tokenized_negative = nltk.corpus.twitter_samples.tokenized("negative_tweets.json")

In [7]:
tokenized_positive[:2]

[['#FollowFriday',
  '@France_Inte',
  '@PKuchly57',
  '@Milipol_Paris',
  'for',
  'being',
  'top',
  'engaged',
  'members',
  'in',
  'my',
  'community',
  'this',
  'week',
  ':)'],
 ['@Lamb2ja',
  'Hey',
  'James',
  '!',
  'How',
  'odd',
  ':/',
  'Please',
  'call',
  'our',
  'Contact',
  'Centre',
  'on',
  '02392441234',
  'and',
  'we',
  'will',
  'be',
  'able',
  'to',
  'assist',
  'you',
  ':)',
  'Many',
  'thanks',
  '!']]

### Step 5: Remove noise from data
- The following tokens do not add value in our analysis
    - Twitter usernames (starting with @)
    - Hyperlinks (starting with http:// or https://)
    - Punctuation and special characters
        - HINT: if word in **string.punctuation**
    - Numeric values only
        - HINT: use **.isnumeric()**
    - If word is a stopword ([wiki](https://en.wikipedia.org/wiki/Stop_word))
        - HINT: Check if lower case word is in **stopwords.words('english')**
- To simplify createa a helper function **is_clean** to check for the above
- Create another helper function **clean_tokens**
    - The function takes **tokens** (a list of tokens) as input
    - Then returns a list of tokens, where **is_clean** has been used to filter
    - Also, let's lowercase it all
        - HINT: Use **lower()**
- Finally, use list comprehension on the lists of positive and negative tweets where **clean_tokens** is applied on each element (tokens).

In [25]:
def is_clean(word:str):
    if word in string.punctuation:
        return False
    if word.isnumeric():
        return False
    if word in stopwords.words("english"):
        return False 
    
    # Additional
    if word.startswith('@'):
        return False
    if word.startswith("http://") or word.startswith("https://"):
        return False
    
    
    return True

In [9]:
def clean_tokens(tokens:list):
    return [word.lower() for word in tokens if is_clean(word)]

In [12]:
cleaned_positive = [clean_tokens(tokens) for tokens in tokenized_positive]
cleaned_negative = [clean_tokens(tokens) for tokens in tokenized_negative]

In [13]:
cleaned_positive[0]

['#followfriday',
 '@france_inte',
 '@pkuchly57',
 '@milipol_paris',
 'top',
 'engaged',
 'members',
 'community',
 'week',
 ':)']

In [14]:
cleaned_negative[0]

['hopeless', 'tmr', ':(']

### Step 6: Normalize the data
- The process of converting a word to its canonical form.
- Without normalization, “ran”, “runs”, and “running” would be treated as different words.
- Create a lemmatizer of **WordNetLemmatizer()**
    - HINT: use **lemmatizer = WordNetLemmatizer()**
- Create a helper function to lemmatize
    - HINT: Create a helper function **lemmatize(word, tag)**
        - Convert tag to **n** or **v** if tag starts with **NN** or **VB**, else **a**
        - Return **lemmatizer.lemmatize(word, tag)**
- Create a helper function **lemmatize_tokens(tokens: list)**
    - Return a list, where each element of **word, tag in pos_tag(...)** of **lemmatize(word, tag)**.
- Use list comprehension to normalize the positive and negative tweets
    - HINT: apply **lemmatize_tokens(...)** on all elements

In [26]:
lemmatizer = WordNetLemmatizer()

def lemmatize(word, tag):
    if tag.startswith("NN"):
        pos = 'n'
    if tag.startswith("VB"):
        pos = 'v'
    else:
        pos = 'a'

    return lemmatizer.lemmatize(word, pos)

In [27]:
def lemmatize_tokens(tokens:list):
    return [lemmatize(word, tag) for word, tag in pos_tag(tokens)]

KeyError: 'JJ'

In [29]:
normalized_positive = [lemmatize_tokens(tokens) for tokens in cleaned_positive]
normalized_negative = [lemmatize_tokens(tokens) for tokens in cleaned_negative]

In [32]:
normalized_positive[0]

['#followfriday',
 '@france_inte',
 '@pkuchly57',
 '@milipol_paris',
 'top',
 'engage',
 'members',
 'community',
 'week',
 ':)']

In [33]:
normalized_negative[0]

['hopeless', 'tmr', ':(']

### Step 7: Prepare data for Model
- Example of normalized tweet: **['hopeless', 'tmr', ':(']**
    - Should become **({'hopeless': True, 'tmr': True, ':(': True}, 'Negative')**
- Hence, the list of tweets (positive and negative) should be converted
- HINT: use a dict comprehension inside a list comprehension

In [35]:
positive_ds = [({token: True for token in tokens}, "Positive")for tokens in normalized_positive]
negative_ds = [({token: True for token in tokens}, "Negative")for tokens in normalized_negative]

In [36]:
positive_ds[0]

({'#followfriday': True,
  '@france_inte': True,
  '@pkuchly57': True,
  '@milipol_paris': True,
  'top': True,
  'engage': True,
  'members': True,
  'community': True,
  'week': True,
  ':)': True},
 'Positive')

In [37]:
negative_ds[0]

({'hopeless': True, 'tmr': True, ':(': True}, 'Negative')

### Step 8: Prepare training and test dataset
- Make the dataset of the combined positive and negative datasets
- Shuffle the dataset
    - Use **shuffle**
- Let the training dataset be the first 7000 entries
- Let the test dataset be the remaining entries

In [39]:
dataset = positive_ds + negative_ds
shuffle(dataset)

In [40]:
train_ds = dataset[:7000]
test_ds = dataset[7000:]

### Step 9: Train and test Model
- Train the model:
    - HINT: **classifier = NaiveBayesClassifier.train(train_data)**
- Test the accuracy
    - HINT: **classify.accuracy(classifier, test_data)**

In [46]:
classifier = NaiveBayesClassifier.train(train_ds)

In [47]:
classify.accuracy(classifier, test_ds)

0.9946666666666667

### Step 10: Show the most informative features
- HINT: Get the 10 most informative features: **classifier.show_most_informative_features(10)**

In [48]:
classifier.show_most_informative_features(10)

Most Informative Features
                      :) = True           Positi : Negati =   1671.3 : 1.0
                     sad = True           Negati : Positi =     23.0 : 1.0
                  arrive = True           Positi : Negati =     21.4 : 1.0
                 welcome = True           Positi : Negati =     17.8 : 1.0
               followers = True           Positi : Negati =     16.3 : 1.0
                    glad = True           Positi : Negati =     15.7 : 1.0
               community = True           Positi : Negati =     15.0 : 1.0
                     too = True           Negati : Positi =     13.8 : 1.0
                  excite = True           Positi : Negati =     13.0 : 1.0
                  you're = True           Positi : Negati =     13.0 : 1.0


### Step 11: Test the model
- Try your model as follows:
    - Define a tweet: **tweet = 'this is fun and awesome'**
    - Prepare data for model: **tweet_dict = {token: True for token in lemmatize_tokens(clean_tokens(tweet.split()))}**
    - Classify data: **classifier.classify(tweet_dict)**

In [49]:
tweet = "this is fun and awesome"
tweet_dict = {token: True for token in lemmatize_tokens(clean_tokens(tweet.split()))}

In [50]:
classifier.classify(tweet_dict)

'Positive'

### Bonus: The pre-trained Sentiment Intensity Analyzer
-  VADER (Valence Aware Dictionary and sEntiment Reasoner) ([Vader](https://www.nltk.org/howto/sentiment.html))

In [52]:
nltk.download("vader_lexicon")

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...


True

In [53]:
sia = SentimentIntensityAnalyzer()

In [54]:
sia.polarity_scores("this is fun and awesome")

{'neg': 0.0, 'neu': 0.288, 'pos': 0.712, 'compound': 0.8126}