# Logistic Regression

Logistic Regression is a statistical method used for binary classification tasks, predicting the probability that an input belongs to one of two classes. Despite its name, it's a classification algorithm.

  1. Model Representation: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$
  2. Logistic Function (Sigmoid): $p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$
  3. Training: Estimate coefficients $\beta_0, \beta_1, \dots, \beta_n$ using methods like maximum likelihood estimation on labeled data.
  4. Prediction: Calculate the probability of class membership using the trained model. If the probability is at least 0.5, classify as the positive class; otherwise, classify as the negative class.
  5. Decision Boundary: The set of inputs where the predicted probability of the positive class equals 0.5, which is where the linear combination of the features and coefficients equals 0. It is a line for two features and a hyperplane in general.
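
The relationship between the log-odds, the sigmoid, and the 0.5 threshold in the steps above can be checked with a minimal sketch. The coefficients `beta` and the input `x` are made-up values for illustration, not values learned in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (beta_0 is the bias) and one input with a leading 1
beta = [-1.0, 2.0, 0.5]
x = [1.0, 0.8, -0.4]

z = sum(b * xi for b, xi in zip(beta, x))  # linear score, about 0.4
p = sigmoid(z)                             # probability of the positive class
label = 1 if p >= 0.5 else 0               # same as checking z >= 0

print(z, p, logit(p), label)
```

Applying `logit` to `p` recovers the linear score `z`, which is why the decision boundary at `p = 0.5` coincides with `z = 0`.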


### 1. Import Functions and Data

In [35]:

import nltk # Python library for NLP
from nltk.corpus import twitter_samples # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt # library for visualization
import random # pseudo-random number generator

import re # library for regular expression operations
import string # for string operations

from nltk.corpus import stopwords # module for stop words that come with NLTK
from nltk.stem import PorterStemmer # module for stemming
from nltk.tokenize import TweetTokenizer # module for tokenizing strings

import csv
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

nltk.download('twitter_samples')
nltk.download('stopwords')


[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [36]:

# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# Save the tweets to a file
with open('positive_tweets.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(all_positive_tweets))

with open('negative_tweets.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(all_negative_tweets))


In [37]:
# Randomly select three positive tweets
random_positive_tweets = random.sample(all_positive_tweets, 3)

# Randomly select three negative tweets
random_negative_tweets = random.sample(all_negative_tweets, 3)

print("Randomly selected positive tweets:")
for tweet in random_positive_tweets:
    print(tweet)

print("\nRandomly selected negative tweets:")
for tweet in random_negative_tweets:
    print(tweet)

Randomly selected positive tweets:
@MonicaBhambhani @Equinox my pleasure doll! Thank you for your wonderful energy in class! :)
@DomSequitur tired. But fine :) you??
@stuck_for_ideas Thanks for the shout out guys :)

Randomly selected negative tweets:
@sophiabxsh no Idk if I wanna watch the episode now :(
@junhuiass IT'S BEEN YEARS SINCE I HAVE BEEN IN A ZOO AND IT'S ONLY ON FIELDTRIPS SO NO TIME TO TOUCH :(((
their reactions :(((((


In [38]:

print(len(all_positive_tweets),all_positive_tweets[0])
print(len(all_negative_tweets),all_negative_tweets[0])


5000 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
5000 hopeless for tmr :(


### 2. Preprocessing

In [39]:

def process_tweet(tweet):

  """Process tweet function.
  Input:
  tweet: a string containing a tweet
  Output:
  tweets_clean: a list of words containing the processed tweet
  """
  stemmer = PorterStemmer( )
  stopwords_english = stopwords.words('english')
  # remove stock market tickers like $GE
  tweet = re.sub(r'\$\w*', '', tweet)
  # remove old style retweet text "RT"
  tweet = re.sub(r'^RT[\s]+', '', tweet)
  # remove hyperlinks
  tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)

  # only removing the hash # sign from the word
  tweet = re.sub(r'#', '', tweet)
  # tokenize tweets
  tokenizer = TweetTokenizer(preserve_case=False,
  strip_handles=True, reduce_len=True)

  tweet_tokens = tokenizer.tokenize(tweet)

  tweets_clean = []
  for word in tweet_tokens:
    if (word not in stopwords_english and # remove stopwords
      word not in string.punctuation): # remove punctuation

      # tweets_clean.append(word)
      stem_word = stemmer.stem(word) # stemming word
      tweets_clean.append(stem_word)

  return tweets_clean

In [40]:

# Initializing lists to store processed positive and negative tweets
pro_pos_tw = []
pro_neg_tw = []

# Processing each tweet in the list of positive tweets
for tweet in all_positive_tweets:
    # Applying the process_tweet function to preprocess the tweet
    pro_pos_tw.append(process_tweet(tweet))

# Processing each tweet in the list of negative tweets
for tweet in all_negative_tweets:
    # Applying the process_tweet function to preprocess the tweet
    pro_neg_tw.append(process_tweet(tweet))

# Printing the number of processed positive tweets and an example of the first processed positive tweet
print("Number of processed positive tweets:", len(pro_pos_tw))
print("Example of a processed positive tweet:", pro_pos_tw[0])

# Printing the number of processed negative tweets and an example of the first processed negative tweet
print("Number of processed negative tweets:", len(pro_neg_tw))
print("Example of a processed negative tweet:", pro_neg_tw[0])


Number of processed positive tweets: 5000
Example of a processed positive tweet: ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
Number of processed negative tweets: 5000
Example of a processed negative tweet: ['hopeless', 'tmr', ':(']


In [41]:

# Shuffle positive and negative tweets
random.shuffle(pro_pos_tw)
random.shuffle(pro_neg_tw)

# Select 4000 random positive and negative tweets for training
train_pos_tw = pro_pos_tw[:4000]
train_neg_tw = pro_neg_tw[:4000]

# Select 1000 random positive and negative tweets for testing
test_pos_tw = pro_pos_tw[4000:]
test_neg_tw = pro_neg_tw[4000:]

# Combine training and testing tweets
train_tweets = train_pos_tw + train_neg_tw
test_tweets = test_pos_tw + test_neg_tw

# Create labels
train_labels = [1] * len(train_pos_tw) + [0] * len(train_neg_tw)
test_labels = [1] * len(test_pos_tw) + [0] * len(test_neg_tw)

# Checking sizes of training and testing sets
print("Training set size:", len(train_tweets))
print("Testing set size:", len(test_tweets))

# Checking distribution of labels in training and testing sets
from collections import Counter
print("Training set label distribution:", Counter(train_labels))
print("Testing set label distribution:", Counter(test_labels))


Training set size: 8000
Testing set size: 2000
Training set label distribution: Counter({1: 4000, 0: 4000})
Testing set label distribution: Counter({1: 1000, 0: 1000})


In [42]:
len(train_tweets),len(train_labels)

(8000, 8000)

In [43]:

print(train_tweets[10])
print(train_labels[10])


['follow']
1


In [44]:
count=0
for l,t in zip(train_labels,train_tweets):
  print(l,t)

  count += 1
  if(count>10):
    break

1 ['hi', 'emma', ':-)', 'ask', 'bellybutton', 'inni', 'outi']
1 ["he'", 'twitch', "he'", 'got', 'twitch', ':-)']
1 [':)', 'beauti']
1 ['males', ':d']
1 ['beauti', ':)', 'got', 'blackfli', 'courgett', 'flower', 'year', '..', 'idea', 'hope', 'wont', 'affect', 'fruit']
1 ['snapchat', 'jennyjean', '22', 'snapchat', 'kikmeboy', 'model', 'french', 'kikchat', 'sabadodeganarseguidor', 'sexysasunday', ':)']
1 ['sept', '4th', 'rudramadevi', 'anushka', 'gunashekar', 'sir', ':)']
1 ['name', 'coupl', 'yet', 'tomhiddleston', 'elizabetholsen', 'yaytheylookgreat', ':)']
1 ["i'm", 'glow', 'morn', 'yayyy', ':)', 'happi', 'friday', 'xx']
1 ['oh', 'happi', 'hear', ':)', 'love', 'day', 'cl']
1 ['follow']


### 3. Sigmoid and Cost Functions

In [45]:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function(X, y, theta):
    m = len(y)
    h = sigmoid(np.dot(X, theta))
    cost = (-1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    return cost


### 4. Gradient Descent

In [46]:

def gradient(X, y, theta):
    m = len(y)
    h = sigmoid(np.dot(X, theta))
    grad = (1 / m) * np.dot(X.T, (h - y))
    return grad

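def gradient_descent(X, y, theta, alpha=0.01, iterations=1000):
    # Work on a float copy so the caller's initial theta is not modified in place
    theta = np.array(theta, dtype=float)
    cost_history = []

    for _ in range(iterations):
        # Step against the gradient, then record the cost at the updated theta
        theta -= alpha * gradient(X, y, theta)
        cost = cost_function(X, y, theta)
        cost_history.append(cost)

    return theta, cost_history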
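
For reference, `cost_function`, `gradient`, and `gradient_descent` above correspond to the standard cross-entropy cost, its gradient, and the batch update rule. Written in the code's notation, with $m$ examples, feature matrix $X$, labels $y$, and learning rate $\alpha$:

```latex
h = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - h^{(i)}\right) \right]

\nabla_\theta J(\theta) = \frac{1}{m} X^{\top} (h - y)

\theta \leftarrow \theta - \alpha \, \nabla_\theta J(\theta)
```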


### 5. Extract Features

In [47]:

def build_freqs(tweets, ys):
  """Build frequencies.
  Input:
  tweets: a list of tweets
  ys: a list of m sentiment labels, one per tweet (either 0 or 1)
  Output:
  freqs: a dictionary mapping each (word, sentiment) pair to its
  frequency
  """
  # ys is expected to be a plain list (or other iterable) of scalar labels
  # aligned with tweets. A NumPy array of shape (m, 1) would first need
  # np.squeeze(ys).tolist(), so that each label is a hashable scalar.

  # Start with an empty dictionary and populate it by looping over all tweets and over all processed words in each tweet.
  freqs = { }

  for y, tweet in zip(ys, tweets):
    for word in tweet:
      pair = (word, y)
      if pair in freqs:
        freqs[pair] += 1
      else:
        freqs[pair] = 1

  return freqs


In [48]:
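# Note: the frequency table is built from the training AND test tweets, so the
# test features are computed with knowledge of the test labels. Building it
# from train_tweets and train_labels only would avoid this leakage.
freq = build_freqs(train_tweets+test_tweets,train_labels+test_labels)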

In [49]:
print(len(freq),type(freq))

count = 0
for key, value in freq.items():
    if count < 10:
        print(key, ':', value)
        count += 1
    else:
        break

13065 <class 'dict'>
('hi', 1) : 173
('emma', 1) : 2
(':-)', 1) : 692
('ask', 1) : 37
('bellybutton', 1) : 5
('inni', 1) : 4
('outi', 1) : 4
("he'", 1) : 11
('twitch', 1) : 5
('got', 1) : 69


In [50]:

def extract_features(tweet_words, frequency_table):

    """
    Build the feature vector for one tweet from a frequency table.

    Parameters:
        tweet_words (list): List of processed words in the tweet.
        frequency_table (dict): Dictionary mapping (word, sentiment) pairs to their frequencies,
            where sentiment is 1 for positive and 0 for negative.

    Returns:
        tuple: (bias, positive_count, negative_count), where bias is always 1.
    """

    # Initialize counts for positive and negative words
    positive_count = 0
    negative_count = 0

    # Iterate over words in the tweet (a repeated word is counted each time it occurs)
    for word in tweet_words:
        # Look up the (word, sentiment) pairs directly instead of scanning the whole table
        positive_count += frequency_table.get((word, 1), 0)
        negative_count += frequency_table.get((word, 0), 0)

    return 1,positive_count,negative_count


In [51]:

def process_and_save_features(tweets, labels, frequency_table, output_filename):
    """
    Calculate sentiment counts for each tweet and save the results to a CSV file.

    Parameters:
    - tweets (list): A list of tweets.
    - labels (list): A list of corresponding labels for the tweets.
    - frequency_table (dict): A dictionary containing word-sentiment score pairs and their frequencies.
    - output_filename (str): The name of the output CSV file.

    Returns:
    None
    """
    sentiment_counts = []

    # Iterate over each tweet and label
    for tweet, label in zip(tweets, labels):

      # Calculate sentiment counts for the tweet
      bias, positive_count, negative_count = extract_features(tweet, frequency_table)
      # Append the results to the list
      sentiment_counts.append([bias, positive_count, negative_count, label])

    # Write sentiment counts to CSV file
    with open(output_filename, 'w', newline='', encoding='utf-8') as csvfile:
        csv_writer = csv.writer(csvfile)
        # Write header
        csv_writer.writerow(['Bias', 'Positive_Count', 'Negative_Count', 'Label'])
        # Write data
        csv_writer.writerows(sentiment_counts)


In [52]:
len(train_tweets),len(train_labels)

(8000, 8000)

In [53]:

process_and_save_features(train_tweets,train_labels,freq,"train.csv")
process_and_save_features(test_tweets,test_labels,freq,"test.csv")


### 6. Train

In [54]:

# Load train and test data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Shuffle train and test data
#train = shuffle(train)
#test = shuffle(test)

# Display the shape of train and test data
print("Shape of train data:", train.shape)
print("Shape of test data:", test.shape)


Shape of train data: (8000, 4)
Shape of test data: (2000, 4)


In [55]:

# Count of records with label 0 and 1 in training data
train_label_counts = train['Label'].value_counts()
print("Training data label counts:")
print(train_label_counts)

# Count of records with label 0 and 1 in test data
test_label_counts = test['Label'].value_counts()
print("\nTest data label counts:")
print(test_label_counts)


Training data label counts:
Label
1    4000
0    4000
Name: count, dtype: int64

Test data label counts:
Label
1    1000
0    1000
Name: count, dtype: int64


In [56]:
train

      Bias  Positive_Count  Negative_Count  Label
0        1             917              65      1
1        1             793             150      1
2        1            3618              13      1
3        1             630               0      1
4        1            4043             396      1
...    ...             ...             ...    ...
7995     1             264            4962      0
7996     1             595            5215      0
7997     1             173            4757      0
7998     1             724            5169      0


In [57]:

# Prepare data
X_train = train.drop(columns=['Label']).values
y_train = train['Label'].values.reshape(-1, 1)

X_test = test.drop(columns=['Label']).values
y_test = test['Label'].values.reshape(-1, 1)


In [58]:

# Initialize parameters
theta_initial = np.zeros((X_train.shape[1], 1))

# Train the model
theta, cost_history = gradient_descent(X_train, y_train, theta_initial)

# Evaluate the model
final_train_cost = cost_history[-1]
print("\n Final training cost:", final_train_cost)

# Predict on test data
predicted_probabilities = sigmoid(np.dot(X_test, theta))
predicted_labels = (predicted_probabilities >= 0.5).astype(int)

# Calculate accuracy
accuracy = np.mean(predicted_labels == y_test)
print("\n Accuracy on test set:", accuracy)





 Final training cost: nan

 Accuracy on test set: 0.994
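
The `nan` training cost is consistent with numerical saturation. The count features run into the thousands and are not scaled, so `np.dot(X, theta)` becomes very large in magnitude, `sigmoid` rounds to exactly 0 or 1, and `cost_function` ends up multiplying `np.log(0)` by zero, which yields `nan`. The accuracy is unaffected because it depends only on the sign of the score. One possible remedy is sketched below with a hypothetical helper `stable_cost` (not part of the original notebook), which computes the same cross-entropy directly from the scores. The feature rows and `theta` values are copied from outputs elsewhere in this notebook.

```python
import numpy as np

def stable_cost(X, y, theta):
    """Mean cross-entropy computed from the scores z = X @ theta.

    Uses the identity -[y*log(h) + (1-y)*log(1-h)] = log(1 + e^z) - y*z
    with h = sigmoid(z), so log(0) is never evaluated.
    """
    z = X @ theta
    # np.logaddexp(0, z) evaluates log(e^0 + e^z) without overflow
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

# Two feature rows (bias, positive count, negative count) and their labels
X = np.array([[1.0, 3580.0, 16.0], [1.0, 47.0, 4729.0]])
y = np.array([[1.0], [0.0]])
theta = np.array([[0.02213753], [13.00565291], [-12.37915871]])

print(stable_cost(X, y, theta))  # a finite value, where the naive formula gives nan
```

Scaling the count features before training would be a complementary option, since it keeps the scores in a range where the sigmoid does not saturate.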


### 7. Test

In [59]:

def predict(X, theta):
    probabilities = sigmoid(np.dot(X, theta))
    return probabilities


In [60]:
print(theta)

[[  0.02213753]
 [ 13.00565291]
 [-12.37915871]]


In [61]:
# Predict probabilities for the test data
probabilities = predict(X_test, theta)



### 8. Evaluate

In [62]:
# Convert probabilities to binary predictions
predictions = (probabilities >= 0.5).astype(int)

# Evaluate accuracy
accuracy = np.mean(predictions == y_test)
print("Accuracy on test data:", accuracy, "\n")


Accuracy on test data: 0.994 



In [63]:
# Calculate TP, FP, FN, TN
TP = np.sum((predictions == 1) & (y_test == 1))
FP = np.sum((predictions == 1) & (y_test == 0))
FN = np.sum((predictions == 0) & (y_test == 1))
TN = np.sum((predictions == 0) & (y_test == 0))

# Calculate precision, recall, and F-measure
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)

print("Precision:", precision)
print("Recall:", recall)
print("F-measure:", f_measure)

Precision: 0.9930139720558883
Recall: 0.995
F-measure: 0.994005994005994
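
The reported metrics can be cross-checked against the confusion matrix. The counts below are inferred from the outputs in this notebook rather than printed by it: a recall of 0.995 on 1000 positive test tweets implies TP = 995 and FN = 5, and the reported precision then implies FP = 7 and TN = 993.

```python
# Confusion-matrix counts inferred from the reported recall and precision
TP, FN, FP, TN = 995, 5, 7, 993

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(precision, recall, f_measure, accuracy)
```

These counts give 12 errors in total (5 false negatives and 7 false positives), which agrees with the accuracy of 0.994 on 2000 test tweets.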


### 9. Error Analysis

**Extracting Features and Simplifying Classification in Sentiment Analysis**

The extracted features show that a tweet is almost always classified as positive when its positive count exceeds its negative count, and as negative otherwise. The learned weights reflect this: the coefficients for the positive and negative counts are nearly equal in magnitude and opposite in sign (about 13.0 and -12.4), and the bias is close to zero. A direct comparison of the two counts therefore reproduces most of the model's predictions without computing the sigmoid. The two can still disagree when the counts are close, because the weights are not exactly equal. Separately, thesauruses and language mapping could tie words more reliably to the positive or negative class and make neutral words easier to remove.

1. **Eliminating Sigmoid Computation**: The prediction depends only on the sign of the linear score, so thresholding the probability at 0.5 is equivalent to checking whether that score is non-negative. With these features, that check is close to comparing the positive count with the negative count using a relational operator.

2. **Enhancing Word Relevance**: Thesauruses and language mapping techniques can group related words and sharpen each word's association with the positive or negative class. Neutral words, which contribute similar counts to both classes, can then be filtered out, which may improve both accuracy and efficiency.
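
How closely a plain comparison of the two counts tracks the trained model can be checked on a few feature rows. The sketch below uses the `theta` values printed in section 7 and `(Positive_Count, Negative_Count)` pairs copied from the misclassified-tweets table in this section. The function names `model_label` and `rule_label` are hypothetical helpers introduced for illustration.

```python
# theta as printed in section 7: bias, positive-count weight, negative-count weight
THETA = (0.02213753, 13.00565291, -12.37915871)

def model_label(pos, neg, theta=THETA):
    """Label from the trained model: sigmoid(score) >= 0.5 exactly when score >= 0."""
    score = theta[0] + theta[1] * pos + theta[2] * neg
    return int(score >= 0)

def rule_label(pos, neg):
    """Shortcut rule: positive if the positive count is strictly larger."""
    return int(pos > neg)

# (Positive_Count, Negative_Count) rows from the misclassified-tweets table
rows = [(102, 115), (264, 395), (131, 98), (903, 922)]
for pos, neg in rows:
    print(pos, neg, model_label(pos, neg), rule_label(pos, neg))
```

On the row (903, 922) the model predicts 1 while the count comparison predicts 0, because the learned positive weight is slightly larger in magnitude than the negative one. The shortcut is therefore an approximation of this model rather than an exact replacement.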


In [64]:

# Filter misclassified tweets
misclassified_tweets = test[predictions.flatten() != y_test.flatten()]

# Display misclassified tweets
print("Misclassified tweets:")
print(misclassified_tweets)


Misclassified tweets:
      Bias  Positive_Count  Negative_Count  Label
21       1             102             115      1
143      1             264             395      1
221      1             589             726      1
568      1             264             395      1
726      1             264             395      1
1003     1             131              98      0
1272     1             267             251      0
1316     1             903             922      0
1554     1             208             119      0
1827     1              66              54      0
1939     1             416             180      0
1971     1             499             423      0


In [65]:

# Get indices of misclassified tweets
misclassified_indices = misclassified_tweets.index.tolist()

for i in range(len(misclassified_indices)):
  print(test_tweets[misclassified_indices[i]])


['brief', 'introduct', '2', 'earliest', 'histori', 'indian', 'subcontin', 'even', 'bfr', 'maurya']
["i'm", 'play', 'brain', 'dot', 'braindot']
['omg', "can't", 'tell', 'say', ':p', "can't", 'wait', 'know', '❤', '️']
["i'm", 'play', 'brain', 'dot', 'braindot']
["i'm", 'play', 'brain', 'dot', 'braindot']
['beast', 'next', 'week']
['like', 'video']
['midland', 'ye', 'thank', 'depress', 'weather', 'forecast', 'word', 'rain', 'mention', 'sever', 'time', ':-(']
['corbyn', 'must', 'understand', "labour'", 'new', 'member', 'chang', "party'", 'fortun']
['laomma', 'design', 'kebaya', 'wed', 'dress', 'bandung', 'indonesia', 'line', 'laomma', '7df89150', 'whatsapp', '62', '08962464174', '7', 'instagram', 'laomma_coutur']
['shake', 'head', 'repeatedli', 'nu-uh', 'jace', 'love', 'mostest']
['twitter', 'help', 'center', '39', 'follow', 'peopl']


### 10. Test on New Tweets

In [66]:

# New tweets to be added
tweets = [
    "i am sad.",
    "feeling :(.",
    "i am happy.",
    ":) moment."
]

process_tweets = []

# Process all tweets
for tweet in tweets:
    process_tweets.append(process_tweet(tweet))

for tw in process_tweets:
    print(tw)

sentiment_counts = []

# Extract features for all processed tweets
for tweet in process_tweets:
    bias, positive_count, negative_count = extract_features(tweet, freq)
    sentiment_counts.append([bias, positive_count, negative_count])

print("\n Featured Extracted : ", sentiment_counts)

# Convert the list of lists to a NumPy array
X_new = np.array(sentiment_counts)

# Pass the array to the predict function
probabilities_new = predict(X_new, theta)

print("\n Prob : ", probabilities_new)

# Convert probabilities to binary predictions
predictions_new = (probabilities_new >= 0.5).astype(int)

# Display the predicted labels
print("\n Predicted labels for new data :", predictions_new.flatten())


['sad']
['feel', ':(']
['happi']
[':)', 'moment']

 Featured Extracted :  [[1, 5, 123], [1, 47, 4729], [1, 211, 25], [1, 3580, 16]]

 Prob :  [[0.]
 [0.]
 [1.]
 [1.]]

 Predicted labels for new data : [0 0 1 1]




#### Role of Special Symbols in Sentiment Analysis

In sentiment analysis, special symbols and punctuation marks play a vital role in determining the sentiment of a text. Here are some key points to consider:

1. **Emoticons and Emoji:** Emoticons such as ":)", ":(", and emojis like 😊, 😢 directly convey emotions and significantly influence sentiment classification. For example, ":)" typically indicates happiness or positivity, while ":(" indicates sadness or negativity.

2. **Punctuation Marks:** Punctuation marks such as exclamation marks (!), question marks (?), and ellipses (...) provide contextual cues for sentiment analysis. Multiple exclamation marks might indicate excitement, while a question mark might suggest uncertainty.

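3. **Capitalization:** Uppercase letters can convey emphasis or heightened emotion. The preprocessing in this notebook lowercases every token (`preserve_case=False`), so that signal is discarded here.

4. **Repeating Characters:** Repeated characters, like "soooo" or "loooove," emphasize the intensity of an emotion. The tokenizer's `reduce_len=True` option shortens long runs of a repeated character, so different spellings of the same elongated word map to fewer distinct tokens.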

5. **Sarcasm and Irony:** Special symbols and punctuation marks are often used to convey sarcasm or irony, challenging sentiment analysis due to the disparity between literal meaning and intended sentiment.

6. **Negation:** Words like "not" or phrases like "not good" can reverse sentiment. Understanding negation context is crucial for accurate sentiment analysis.

7. **Hashtags and Mentions:** In social media text, hashtags (#) and mentions (@) provide context about the topics or entities discussed. Here, `process_tweet` removes handles (`strip_handles=True`) and strips the `#` sign while keeping the hashtag word as a regular token.

In summary, special symbols and punctuation marks carry contextual information that can strongly affect sentiment classification. The feature values in section 10, where a tweet containing ":)" has a positive count of 3580 and one containing ":(" has a negative count of 4729, suggest that emoticons dominate the counts in this dataset.
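
One detail of `process_tweet` explains why emoticons survive preprocessing: the check `word not in string.punctuation` is a substring test against a single string of ASCII punctuation characters, so single marks are dropped while multi-character emoticons are kept. A minimal sketch, using a hypothetical token list:

```python
import string

# Hypothetical tokens, as a tokenizer might produce them
tokens = ["happi", "!", ":)", ":(", "?", ":-)", ".."]

# Same membership test as in process_tweet: a token is dropped only if it
# occurs as a substring of string.punctuation
kept = [t for t in tokens if t not in string.punctuation]

print(kept)  # ['happi', ':)', ':(', ':-)', '..']
```

The same behavior accounts for tokens such as `'..'` appearing in the processed tweets shown earlier.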
