### SENTIMENT ANALYSIS WITH NAIVE BAYES
This code cell imports the libraries and functions needed to process and analyze a sample Twitter dataset for sentiment analysis:

- `twitter_samples` from NLTK: Provides a sample Twitter dataset.
- `clean_text`, `process_tweet`, `build_freqs` from `helper_function`: Custom helper functions to clean and process tweets and build frequency distributions.
- `numpy` as `np`: A fundamental package for scientific computing with Python.


In [5]:
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
from helper_funtion import clean_text,process_tweet,build_freqs
import numpy as np    

In [6]:
# Select the set of positive tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')  # Load positive tweets from 'positive_tweets.json' file

# Select the set of negative tweets
all_negative_tweets = twitter_samples.strings('negative_tweets.json')  # Load negative tweets from 'negative_tweets.json' file


In [7]:
print(len(all_positive_tweets))
print(len(all_negative_tweets))

5000
5000


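This code splits the positive and negative tweets into training and testing sets: the first 4000 tweets of each class are used for training and the remaining 1000 for testing. Positive tweets are labeled 1 and negative tweets are labeled 0.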

In [8]:
# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# avoid assumptions about the length of all_positive_tweets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

In [9]:
train_x[5]

'@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM'

## CLEANING THE TWEET

This code cleans the text of the training and test datasets by applying the `clean_text` function to every tweet in place.

In [10]:
# Clean the text in the training set
for i in range(len(train_x)):
    train_x[i] = clean_text(train_x[i])  # Apply the clean_text function to each tweet in the training set

# Clean the text in the test set
for i in range(len(test_x)):
    test_x[i] = clean_text(test_x[i])  # Apply the clean_text function to each tweet in the test set


In [11]:
train_x[0]

'followfriday franceint pkuchli milipolpari top engag member commun week'

The `process_tweet` function tokenizes each cleaned tweet into a list of words.

In [12]:
pro_train_x=process_tweet(train_x)
pro_test_x=process_tweet(test_x)  # tokenized test tweets (these are inputs, not labels)

In [13]:
pro_train_x[0]

['followfriday',
 'franceint',
 'pkuchli',
 'milipolpari',
 'top',
 'engag',
 'member',
 'commun',
 'week']
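
The implementation of `process_tweet` is not shown in this notebook. Judging only from the input and output above, a whitespace split would produce the same result. The sketch below uses a hypothetical stand-in named `tokenize_tweets`; it is an assumption for illustration, not the helper's actual code.

```python
def tokenize_tweets(cleaned_tweets):
    """Hypothetical stand-in for process_tweet: split each cleaned tweet on whitespace."""
    return [tweet.split() for tweet in cleaned_tweets]


# The cleaned tweet shown earlier in the notebook
sample = ['followfriday franceint pkuchli milipolpari top engag member commun week']
tokens = tokenize_tweets(sample)
print(tokens[0])
```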

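Building the frequency dictionary from the training set: `build_freqs` maps each `(word, label)` pair to the number of times it occurs.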

In [14]:
freqs=build_freqs(pro_train_x, train_y)
first_key = next(iter(freqs))
print(first_key, freqs[first_key])
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

('followfriday', 1.0) 23
type(freqs) = <class 'dict'>
len(freqs) = 16527
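
The source of `build_freqs` is not included here. Based on the printed key `('followfriday', 1.0)`, the dictionary appears to be keyed by `(word, label)` tuples with float labels. A minimal sketch of such a counter, under that assumption, could look like this (`build_freqs_sketch` and the toy data are hypothetical):

```python
from collections import defaultdict


def build_freqs_sketch(tokenized_tweets, labels):
    """Count how often each (word, label) pair occurs. Hypothetical stand-in for build_freqs."""
    freqs = defaultdict(int)
    for tokens, label in zip(tokenized_tweets, labels):
        for word in tokens:
            # float(label) mirrors the 1.0 / 0.0 labels produced by np.ones / np.zeros
            freqs[(word, float(label))] += 1
    return dict(freqs)


toy_tweets = [['happi', 'day'], ['sad', 'day'], ['happi', 'happi']]
toy_labels = [1.0, 0.0, 1.0]
toy_freqs = build_freqs_sketch(toy_tweets, toy_labels)
print(toy_freqs)
```

The same word can appear under both labels (`'day'` in the toy data), which is why the number of keys in `freqs` can exceed the number of unique words.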


In [15]:
def lookup(freqs, word, label):
    """
    Given a dictionary of frequencies, a word, and a label (1 for positive, 0 for negative),
    return the count of the (word, label) pair.
    """
    return freqs.get((word, label), 0)
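
A short usage sketch for `lookup` on toy data (the dictionary below is made up for illustration). It also shows why calling `lookup` with the integer labels `1` and `0` works even though the keys printed above use float labels: in Python, `1 == 1.0` and both hash to the same value, so the tuples match.

```python
def lookup(freqs, word, label):
    """Return the count of the (word, label) pair, or 0 if the pair was never seen."""
    return freqs.get((word, label), 0)


# Toy frequency dictionary with float labels, as in the notebook output
toy_freqs = {('happi', 1.0): 3, ('day', 1.0): 1, ('sad', 0.0): 1}

print(lookup(toy_freqs, 'happi', 1))   # integer 1 matches the key stored with 1.0
print(lookup(toy_freqs, 'happi', 0))   # pair not present, falls back to 0
print(lookup(toy_freqs, 'unseen', 1))  # unknown word, falls back to 0
```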

### TRAINING NAIVE BAYES

In [16]:
def train_naive_bayes(freqs,train_x,train_y):
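    '''
    Compute the log prior and the log likelihood of every word in the vocabulary.
    freqs maps (word, label) pairs to counts; train_y holds the labels (1 positive, 0 negative).
    train_x is accepted but not used by this function.
    Returns (logprior, loglikelihood).
    '''
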

    loglikelihood = {}
    logprior = 0
    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # calculate N_pos, N_neg, V_pos, V_neg
    N_pos = N_neg = V_pos = V_neg = 0
    for pair in freqs.keys():
        if pair[1]>0:
            # increment the count of unique positive words by 1
            V_pos+=1
            # increment the number of positive words by the count for this (word,label) pair
            N_pos += freqs[pair]
        else:
            # increment the count of unique negative words by 1
            V_neg += 1

            # increment the number of negative words by the count for this (word,label) pair
            N_neg += freqs[pair]

    D = len(train_y)

    D_pos = (len(list(filter(lambda x: x > 0, train_y))))
    D_neg = (len(list(filter(lambda x: x <= 0, train_y))))

    # Calculate logprior: log(D_pos / D_neg)
    logprior = np.log(D_pos) - np.log(D_neg)
    for word in vocab:

        #   get the positive and negative frequency of the word
        freq_pos = lookup(freqs,word,1)
        freq_neg = lookup(freqs,word,0)
        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos/p_w_neg)


    return logprior, loglikelihood

In [17]:
logprior, loglikelihood = train_naive_bayes(freqs, pro_train_x, train_y)
print(logprior)
print(len(loglikelihood))

0.0
14308
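
To make the smoothing step concrete, here is a small worked example on a made-up frequency dictionary (the toy counts are assumptions for illustration only). It applies the same add-one (Laplace) formula used in `train_naive_bayes`, and shows why the log prior printed above is `0.0`: the training set has 4000 tweets in each class.

```python
import math

# Toy counts: (word, label) -> frequency, with 1 = positive and 0 = negative
toy_freqs = {('happi', 1): 3, ('day', 1): 1, ('sad', 0): 2, ('day', 0): 1}

vocab = {word for word, _ in toy_freqs}
V = len(vocab)                                                        # 3 unique words
N_pos = sum(c for (_, label), c in toy_freqs.items() if label == 1)   # 4 positive tokens
N_neg = sum(c for (_, label), c in toy_freqs.items() if label == 0)   # 3 negative tokens


def log_likelihood(word):
    """log(P(word | pos) / P(word | neg)) with add-one smoothing."""
    freq_pos = toy_freqs.get((word, 1), 0)
    freq_neg = toy_freqs.get((word, 0), 0)
    p_w_pos = (freq_pos + 1) / (N_pos + V)
    p_w_neg = (freq_neg + 1) / (N_neg + V)
    return math.log(p_w_pos / p_w_neg)


scores = {word: log_likelihood(word) for word in sorted(vocab)}
for word, score in scores.items():
    print(f'{word}: {score:+.3f}')

# Balanced classes give a log prior of exactly zero
logprior = math.log(4000) - math.log(4000)
print(logprior)
```

Words seen mostly in positive tweets get a positive score, words seen mostly in negative tweets get a negative score, and the `+ 1` keeps the ratio finite for words that never occur in one of the classes.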


In [18]:

def preprocess_tweet(tweet):
    """
    Preprocess the tweet by tokenizing and normalizing.
    This function can be adjusted based on the preprocessing steps used during training.
    """
    # Clean the tweet with clean_text, then split the cleaned string on whitespace
    cleaned = clean_text(tweet)
    words = cleaned.split()
    return words

### PREDICTING WITH THE MODEL

In [19]:
def predict_by_model(tweet,logprior,loglikelihood):
    words = preprocess_tweet(tweet)
    # initialize the score to zero (p is a sum of log ratios, not a probability)
    p = 0

    # add the logprior
    p += logprior

    for word in words:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]


    return p


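Testing Naive Bayes: a tweet is classified as positive when its score is greater than 0, and accuracy is the fraction of test tweets classified correctly.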

In [20]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    # predicted label (1 or 0) for each tweet, in the same order as test_x
    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if predict_by_model(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # error is the average of the absolute values of the differences between y_hats and test_y
    error = np.mean(np.absolute(y_hats-test_y))

    # Accuracy is 1 minus the error
    accuracy = 1-error


    return accuracy


In [21]:

print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikelihood)))

Naive Bayes accuracy = 0.7700
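
Because both predictions and labels are 0 or 1, "one minus the mean absolute difference" is the same quantity as "the fraction of matching labels". The sketch below checks this on made-up predictions, using plain Python lists rather than NumPy arrays to keep it dependency-free.

```python
# Toy predictions and true labels (made up for illustration)
y_hats = [1, 0, 1, 1, 0]
test_y = [1.0, 0.0, 0.0, 1.0, 1.0]

# Same idea as in test_naive_bayes: 1 minus the mean absolute difference
error = sum(abs(y_hat - y) for y_hat, y in zip(y_hats, test_y)) / len(test_y)
accuracy_via_error = 1 - error

# Direct definition: fraction of predictions that equal the true label
accuracy_direct = sum(y_hat == y for y_hat, y in zip(y_hats, test_y)) / len(test_y)

print(accuracy_via_error, accuracy_direct)
```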


In [22]:
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    p = predict_by_model(tweet, logprior, loglikelihood)
    print(f'{tweet} -> {p:.2f}')

I am happy -> 2.08
I am bad -> -1.24
this movie should have been great. -> 1.97
great -> 2.08
great great -> 4.16
great great great -> 6.23
great great great great -> 8.31
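
The scores above grow linearly with each repetition of "great" because the model simply adds the word's log likelihood once per occurrence. Under the Naive Bayes assumptions the score is a log odds value, so it can be mapped to a probability of the positive class with the logistic function. The sketch below is an illustration and not part of the notebook: `score_to_probability` is a hypothetical helper, and `2.08` is the rounded value printed above, so the products differ slightly from the notebook's output.

```python
import math


def score_to_probability(score):
    """Map a log-odds score to P(positive) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))


great_loglikelihood = 2.08  # rounded score of 'great' taken from the output above

for n in range(1, 5):
    score = n * great_loglikelihood  # the log prior is 0.0, so only the word terms remain
    print(f"{'great ' * n}-> score {score:.2f}, P(positive) ~ {score_to_probability(score):.4f}")
```

A score of 0 corresponds to a probability of 0.5, which is why the classifier uses 0 as its decision threshold.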


Example 1:

In [23]:
my_tweet = 'you are bad :('
predict_by_model(my_tweet, logprior, loglikelihood)

-1.2387204058495098

Example 2:

In [24]:
tweet='i deeply sad'
p = predict_by_model(tweet, logprior, loglikelihood)
print('The expected output is', p)

The expected output is -3.0202233749103757


### GETTING THE RATIO OF POSITIVE TO NEGATIVE COUNTS

In [25]:
def get_ratio(freqs, word):
    '''
    Input:
        freqs: dictionary mapping (word, label) pairs to counts
        word: the word to look up

    Output: a dictionary with keys 'positive', 'negative', and 'ratio'.
        The ratio is add-one smoothed: (positive + 1) / (negative + 1).
        Example: {'positive': 10, 'negative': 20, 'ratio': (10 + 1) / (20 + 1)}
    '''
    pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}
    # use lookup() to find positive counts for the word (denoted by the integer 1)
    pos_neg_ratio['positive'] = lookup(freqs,word,1)

    # use lookup() to find negative counts for the word (denoted by integer 0)
    pos_neg_ratio['negative'] = lookup(freqs,word,0)

    # calculate the ratio of positive to negative counts for the word
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive'] + 1)/(pos_neg_ratio['negative'] + 1)
    return pos_neg_ratio

In [26]:

get_ratio(freqs, 'never')

{'positive': 31, 'negative': 42, 'ratio': 0.7441860465116279}
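
The ratio reported for 'never' can be checked by hand from the two counts: (31 + 1) / (42 + 1) = 32 / 43. The short sketch below repeats that arithmetic and shows why one is added to both counts (`smoothed_ratio` is a hypothetical helper name used only here).

```python
def smoothed_ratio(positive, negative):
    """Add-one smoothed ratio of positive to negative counts."""
    return (positive + 1) / (negative + 1)


# Counts for 'never' taken from the output above
ratio_never = smoothed_ratio(31, 42)
print(ratio_never)

# Without the +1, a word that never appears in negative tweets would divide by zero
ratio_no_negatives = smoothed_ratio(5, 0)
print(ratio_no_negatives)
```

A ratio below 1 means the word is more frequent in negative tweets, and a ratio above 1 means it is more frequent in positive tweets.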

In [27]:
my_tweet = 'I am happy because I am progressing :)'

p = predict_by_model(my_tweet, logprior, loglikelihood)
print(p)

1.3276870378252548


In [28]:
my_tweet="hate story"
p = predict_by_model(my_tweet, logprior, loglikelihood)
print(p)

-1.7229682981668655
