# Twitter Sentiment Analysis

**Natural Language Processing (NLP)** refers to AI method of communicating with an intelligent systems using a natural language such as English.

The ability to categorize opinions expressed in the text of tweets—and especially to determine whether the writer's attitude is positive, negative, or neutral—is highly valuable.

In this project, we will use a process called 'sentiment analysis' to categorize the tweets on Twitter. Sentiment analysis involves natural language processing because it deals with human-written text. 

We will be using a three-point scale, 'Positive', 'Negative' and 'Neutral' to categorize the tweets.

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import re
import string
# ML
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# NLP
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\praju\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Load Dataset

- The dataset contains 1.6 million rows.
- The tweets are encoded into 3 categories:
    - 0 = 'Negative'
    - 2 = 'Neutral'
    - 4 = 'Positive'

In [2]:
# Load dataset function
def load_dataset(filename, cols):
    dataset = pd.read_csv("training.csv", encoding='latin-1')
    dataset.columns = cols
    return dataset

- The dataset has 6 columns 'target_id', 't_id', 'created_at', 'query', 'user', 'text'.
- The only columns we are interested in are 'text' and 'target'.

In [3]:
# Remove unwanted function
def remove_unwanted_cols(dataset, cols):
    for col in cols:
        del dataset[col]
    return dataset

### Pre-processing Tweets

##### Letter Casing
Converting all letters to either upper case or lower case.
##### Tokenizing
Turning the tweets into tokens. Tokens are words separated by spaces in a text.
##### Noise Removal
Eliminating unwanted characters, such as HTML tags, punctuation marks, special characters, white spaces etc.
##### Stopword Removal
Some words do not contribute much to the machine learning model, so it's good to remove them. A list of stopwords can be defined by the nltk library, or it can be business-specific.
##### Normalization
Normalization generally refers to a series of related tasks meant to put all text on the same level. Converting text to lower case, removing special characters, and removing stopwords will remove basic inconsistencies. Normalization improves text matching.
##### Stemming
Eliminating affixes (circumfixes, suffixes, prefixes, infixes) from a word in order to obtain a word stem. Porter Stemmer is the most widely used technique because it is very fast. Generally, stemming chops off end of the word, and mostly it works fine. 
##### Lemmatization
The goal is same as with stemming, but stemming a word sometimes loses the actual meaning of the word. Lemmatization usually refers to doing things properly using vocabulary and morphological analysis of words. It returns the base or dictionary form of a word, also known as the lemma .

(**Note**: Stemming is faster than lemmatization. Stemming and lemmatization are normalization techniques, and it is recommended to use only one approach to normalize.)

In [4]:
# Pre-processing function
def preprocess_tweet_text(tweet):
    # Letter Casing
    tweet.lower()
    # Remove urls
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
    # Remove user @ references and '#' from tweet
    tweet = re.sub(r'\@\w+|\#','', tweet)
    # Remove punctuations
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    tweet_tokens = word_tokenize(tweet)
    filtered_words = [w for w in tweet_tokens if not w in stop_words]
    # Stemming
    ps = PorterStemmer()
    stemmed_words = [ps.stem(w) for w in filtered_words]
    # Lemmatizing
    #lemmatizer = WordNetLemmatizer()
    #lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in stemmed_words]
    
    return " ".join(filtered_words)

##### Vectorizing Data
Vectorizing is the process to convert tokens to numbers. It is an important step because the machine learning algorithm works with numbers and not text.

We'll implement vectorization using **tf-idf** technique. Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining.

**TF**: Term Frequency, which measures how frequently a term occurs in a document.

<div align="center">TF(t) = (Number of times term x appears in a document) / (Total number of terms in the document)</div>


**IDF**: Inverse Document Frequency, which measures how important a term is.

<div align="center">IDF(t) = log_e(Total number of documents / Number of documents with term t in it)</div>

In [5]:
# Vectorizer function
def get_feature_vector(train_fit):
    vector = TfidfVectorizer(sublinear_tf=True)
    vector.fit(train_fit)
    return vector

- The target column is comprised of the integer values 0, 2, and 4. To convert the integer results to be easily understood by users, we implement a small script.

In [6]:
def int_to_string(sentiment):
    if sentiment == 0:
        return "Negative"
    elif sentiment == 2:
        return "Neutral"
    else:
        return "Positive"

### Model Building

All the functions will be brought together and we will see Naive Bayes and Logistic Regression algorithms for predictions.

In [7]:
# Load dataset
dataset = load_dataset("training.csv", ['target', 't_id', 'created_at', 'query', 'user', 'text'])

# Remove unwanted columns from dataset
n_dataset = remove_unwanted_cols(dataset, ['t_id', 'created_at', 'query', 'user'])

#Preprocess data
dataset.text = dataset['text'].apply(preprocess_tweet_text)

# Split dataset into Train, Test

# Same tf vector will be used for Testing sentiments on unseen trending data
tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())
X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
y = np.array(dataset.iloc[:, 0]).ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

# Training Naive Bayes model
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

# Training Logistics Regression model
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
print(accuracy_score(y_test, y_predict_lr))

0.768521875
0.787703125


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


### Inference

We observe that the Naive Bayes model has given us an accuracy of **76%** and Logistic Regression has given us an accuracy of about **78%**.