# Project: Sentiment Classification
- Build a model that determines whether a tweet is positive or negative

### Step 1: Import the libraries

In [None]:
import nltk
import string
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import classify
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier
from random import shuffle

### Step 2: Download the sample tweets
- Execute the following cell

In [None]:
nltk.download('twitter_samples')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

### Step 3: The tweets
- Get the positive and negative tweets.
    - HINT: You access the positive tweets by: **nltk.corpus.twitter_samples.strings('positive_tweets.json')**
    - HINT: Similarly for the negative tweets.
- Note: There are also tweets with no sentiment; we will ignore them in this project
- Check a few tweets

### Step 4: Tokenize the tweets
- You get the tokenized tweets as follows:
    - **nltk.corpus.twitter_samples.tokenized('positive_tweets.json')**
    - Similarly for **negative_tweets.json**
- Why tokenize?
    - Each tweet becomes a list of tokens (words, emoticons, punctuation) that can be filtered and counted individually
- Check a few tweets (tokenized)

### Step 5: Remove noise from data
- The following tokens do not add value to our analysis and should be removed
    - Twitter usernames (starting with @)
    - Hyperlinks (starting with http:// or https://)
    - Punctuation and special characters
        - HINT: if word in **string.punctuation**
    - Purely numeric tokens
        - HINT: use **.isnumeric()**
    - If word is a stopword ([wiki](https://en.wikipedia.org/wiki/Stop_word))
        - HINT: Check if lower case word is in **stopwords.words('english')**
- To simplify, create a helper function **is_clean** that checks a single token against the rules above
- Create another helper function **clean_tokens**
    - The function takes **tokens** (a list of tokens) as input
    - Then returns a list of tokens, where **is_clean** has been used to filter
    - Also lowercase every token that is kept
        - HINT: Use **lower()**
- Finally, use a list comprehension over the lists of positive and negative tweets, applying **clean_tokens** to each element (a list of tokens).
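
One possible shape for the two helpers is sketched below. The small `STOPWORDS` set is a stand-in (an assumption, so the sketch runs without downloaded NLTK data); in the notebook it would be replaced by **stopwords.words('english')**. The sample tokens are made up for illustration.

```python
import string

# Stand-in stopword list (assumption) so the sketch runs without NLTK data.
# In the notebook, use set(stopwords.words('english')) instead.
STOPWORDS = {'i', 'me', 'my', 'the', 'a', 'an', 'is', 'to', 'and', 'this'}


def is_clean(word: str) -> bool:
    """Return True if the token should be kept."""
    if word.startswith('@'):                       # Twitter username
        return False
    if word.startswith(('http://', 'https://')):   # hyperlink
        return False
    if word in string.punctuation:                 # punctuation / special character
        return False
    if word.isnumeric():                           # purely numeric token
        return False
    if word.lower() in STOPWORDS:                  # stopword
        return False
    return True


def clean_tokens(tokens: list) -> list:
    """Keep only the clean tokens and lowercase them."""
    return [word.lower() for word in tokens if is_clean(word)]


# Made-up tokenized tweet for illustration.
sample = ['@user', 'This', 'is', 'FUN', '!', '42', 'https://t.co/abc', ':)']
cleaned = clean_tokens(sample)
print(cleaned)
```

Note that **word in string.punctuation** is a substring test on a string, so an emoticon such as `:)` is kept because it does not appear as a contiguous run in **string.punctuation**.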

### Step 6: Normalize the data
- Normalization is the process of converting a word to its canonical form.
- Without normalization, “ran”, “runs”, and “running” would be treated as different words.
- Create a lemmatizer of **WordNetLemmatizer()**
    - HINT: use **lemmatizer = WordNetLemmatizer()**
- Create a helper function to lemmatize
    - HINT: Create a helper function **lemmatize(word, tag)**
        - Convert tag to **n** if it starts with **NN**, to **v** if it starts with **VB**, and to **a** otherwise
        - Return **lemmatizer.lemmatize(word, tag)**
- Create a helper function **lemmatize_tokens(tokens: list)**
    - Return a list containing **lemmatize(word, tag)** for each **word, tag** pair in **pos_tag(tokens)**.
- Use list comprehension to normalize the positive and negative tweets
    - HINT: apply **lemmatize_tokens(...)** on all elements
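
The tag conversion described above can be sketched on its own. The helper name `wordnet_pos` and the hand-written tagged pairs are illustrative assumptions, not part of the exercise; the pairs mimic the `(word, tag)` tuples that **pos_tag** returns, so the sketch runs without any downloaded NLTK data.

```python
def wordnet_pos(tag: str) -> str:
    """Map a part-of-speech tag to the code passed to the lemmatizer."""
    if tag.startswith('NN'):   # nouns: NN, NNS, NNP, ...
        return 'n'
    if tag.startswith('VB'):   # verbs: VB, VBD, VBG, ...
        return 'v'
    return 'a'                 # everything else is treated as an adjective


# Hand-written (word, tag) pairs in the shape returned by pos_tag.
tagged = [('dogs', 'NNS'), ('running', 'VBG'), ('happy', 'JJ'), ('quickly', 'RB')]
converted = [(word, wordnet_pos(tag)) for word, tag in tagged]
print(converted)
```

Inside **lemmatize(word, tag)**, the converted code would be passed as the second argument, for example **lemmatizer.lemmatize(word, wordnet_pos(tag))**.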

### Step 7: Prepare data for Model
- Example of normalized tweet: **['hopeless', 'tmr', ':(']**
    - Should become **({'hopeless': True, 'tmr': True, ':(': True}, 'Negative')**
- Hence, every tweet in the positive and negative lists should be converted to a pair of (feature dict, label), with the label **'Positive'** or **'Negative'**
- HINT: use a dict comprehension inside a list comprehension
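
A minimal sketch of this conversion follows. The helper name `to_featureset` and the one-tweet sample lists are assumptions for illustration; in the notebook the inputs would be the normalized positive and negative tweets from Step 6.

```python
def to_featureset(tokens: list, label: str) -> tuple:
    """Turn a list of tokens into a (feature dict, label) pair."""
    return ({token: True for token in tokens}, label)


# Made-up normalized tweets for illustration.
positive = [['fun', 'awesome', ':)']]
negative = [['hopeless', 'tmr', ':(']]

# Dict comprehension (inside the helper) applied within a list comprehension.
positive_data = [to_featureset(tokens, 'Positive') for tokens in positive]
negative_data = [to_featureset(tokens, 'Negative') for tokens in negative]
print(negative_data[0])
```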

### Step 8: Prepare training and test dataset
- Make the dataset of the combined positive and negative datasets
- Shuffle the dataset
    - Use **shuffle**
- Let the training dataset be the first 7000 entries
- Let the test dataset be the remaining entries
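
The combine, shuffle, and split steps can be sketched as below. The function name `split_dataset`, the optional `seed` parameter (added only so the shuffle is reproducible), and the ten-entry toy dataset are assumptions; with the real data, `train_size` would be 7000.

```python
from random import Random


def split_dataset(positive_data, negative_data, train_size, seed=None):
    """Combine both datasets, shuffle them, and split into train and test."""
    dataset = positive_data + negative_data   # new list; the inputs are left unchanged
    Random(seed).shuffle(dataset)             # in-place shuffle of the combined list
    return dataset[:train_size], dataset[train_size:]


# Toy stand-ins for the prepared positive and negative datasets.
toy_positive = [({'pos%d' % i: True}, 'Positive') for i in range(5)]
toy_negative = [({'neg%d' % i: True}, 'Negative') for i in range(5)]

train_data, test_data = split_dataset(toy_positive, toy_negative, train_size=7, seed=0)
print(len(train_data), len(test_data))
```

Shuffling before the split matters because the combined list starts with all positive tweets followed by all negative ones; without it, the test set would contain only one label.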

### Step 9: Train and test Model
- Train the model:
    - HINT: **classifier = NaiveBayesClassifier.train(train_data)**
- Test the accuracy
    - HINT: **classify.accuracy(classifier, test_data)**
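
To see the two calls work end to end, here is a small sketch on hand-written toy data. It assumes only that NLTK is installed (no corpora are needed), and the toy feature sets are made up; the accuracy on such a tiny sample says nothing about the accuracy you will get on the real tweets.

```python
from nltk import NaiveBayesClassifier, classify

# Hand-written toy data in the (feature dict, label) format from Step 7.
toy_train = [
    ({'fun': True, 'awesome': True}, 'Positive'),
    ({'love': True, 'awesome': True}, 'Positive'),
    ({'sad': True, 'hopeless': True}, 'Negative'),
    ({'hate': True, 'sad': True}, 'Negative'),
]
toy_test = [
    ({'sad': True}, 'Negative'),
    ({'fun': True}, 'Positive'),
]

# Train on the labeled feature sets, then score against held-out pairs.
classifier = NaiveBayesClassifier.train(toy_train)
accuracy = classify.accuracy(classifier, toy_test)
print(accuracy)
```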

### Step 10: Show the most informative features
- HINT: Get the 10 most informative features: **classifier.show_most_informative_features(10)**

### Step 11: Test the model
- Try your model as follows:
    - Define a tweet: **tweet = 'this is fun and awesome'**
    - Prepare data for model: **tweet_dict = {token: True for token in lemmatize_tokens(clean_tokens(tweet.split()))}**
    - Classify data: **classifier.classify(tweet_dict)**

### Bonus: The pre-trained Sentiment Intensity Analyzer
- VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analyzer shipped with NLTK ([VADER](https://www.nltk.org/howto/sentiment.html))

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

In [None]:
sia = SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores('this is fun and awesome')