

<div style="text-align: center;">
    <h2>Tweet Sentiment Binary Classification Using <br>Logistic Regression</h2>
    <img src="../src/Twitter-Logo.png" width="250" height="250">
</div>

<div style="text-align: center;">
    <h2>What is Sentiment Analysis?</h2>
    <img src="../src/happy.png" height="400" width="280">
    <img src="../src/sad.png" height="400" width="280">
</div>


<div style="text-align: center;">
    <h2>Logistic Regression</h2>
    <img src="../src/Logistic-Regression.png" width="500" height="260">
</div>
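
As a brief illustration of the idea behind the model: logistic regression computes a weighted sum of the input features and passes it through the sigmoid function to obtain a probability between 0 and 1. The sketch below is plain Python. Its weights, bias, and word counts are hypothetical values chosen only for illustration; they are not taken from the model fitted later in this notebook.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for the words ["love", "great", "sad"]
weights = [1.5, 1.0, -2.0]
bias = 0.0

p_happy = predict_proba([1, 1, 0], weights, bias)  # word counts for "love great"
p_sad = predict_proba([0, 0, 1], weights, bias)    # word counts for "sad"

# A probability of at least 0.5 is read as the positive class
label_happy = 1 if p_happy >= 0.5 else 0
label_sad = 1 if p_sad >= 0.5 else 0
```

Training amounts to choosing the weights and bias so that these probabilities match the labels as closely as possible.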



<div style="text-align: center;">
    <h2>NLTK & Scikit-Learn</h2>
    <img src="../src/NLTK-Logo.png" width="100" height="100">
    <img src="../src/Scikit-Learn-Logo.png" width="180" height="100">
</div>

- Natural Language ToolKit
  - twitter_samples
  - stopwords
  - word_tokenize 
  
- Scikit-Learn
  - CountVectorizer
  - train_test_split
  - LogisticRegression
  - accuracy_score
  - classification_report





#### CountVectorizer

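Create a Bag of Words (BoW) representation with the `CountVectorizer` class from the Scikit-Learn library. Each tweet becomes a row of word counts, with one column per vocabulary term.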

- Corpus
  
    ```Python
    tweets = ["I love coding", "Coding is great", "I love learning new things"]
    ```

- Vocabulary (with default settings, one-character tokens such as "I" are dropped and the terms are indexed alphabetically)
    
    ```Python
    {'coding': 0, 'great': 1, 'is': 2, 'learning': 3, 'love': 4, 'new': 5, 'things': 6}
    ```
- Output Matrix

    ```Python
    [[1, 0, 0, 0, 1, 0, 0],   # "I love coding"
     [1, 1, 1, 0, 0, 0, 0],   # "Coding is great"
     [0, 0, 0, 1, 1, 1, 1]]   # "I love learning new things"
    ```

In [34]:
# Imports from nltk Library
import nltk
from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import word_tokenize

In [35]:
# Imports from Scikit-Learn Library
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [36]:
# Download NLTK datasets
nltk.download('twitter_samples')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package twitter_samples to C:\Users\Hossein
[nltk_data]     Tahami\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Hossein
[nltk_data]     Tahami\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Hossein
[nltk_data]     Tahami\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Hossein
[nltk_data]     Tahami\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [22]:
# Loading twitter dataset from nltk twitter_samples
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

In [23]:
# Combining Positive & Negative Tweets
all_tweets = positive_tweets + negative_tweets

# Create Labels
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)

In [64]:
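# Preprocessing function for English tweets
# Build the stop-word set once; set membership checks are faster than list lookups
STOP_WORDS = set(stopwords.words('english'))

def preprocess_tweet(tweet):
    """Tokenize a tweet, lowercase it, and drop non-alphabetic tokens and stop words."""
    tokens = word_tokenize(tweet)
    # print(tokens)  # uncomment to inspect the raw tokens
    tokens = [word.lower() for word in tokens if word.isalpha() and word.lower() not in STOP_WORDS]
    # print(tokens)  # uncomment to inspect the filtered tokens
    return ' '.join(tokens)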

In [62]:
tweet = all_tweets[6]
print(tweet)
preprocess_tweet(tweet)

We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI
['We', 'do', "n't", 'like', 'to', 'keep', 'our', 'lovely', 'customers', 'waiting', 'for', 'long', '!', 'We', 'hope', 'you', 'enjoy', '!', 'Happy', 'Friday', '!', '-', 'LWWF', ':', ')', 'https', ':', '//t.co/smyYriipxI']
['like', 'keep', 'lovely', 'customers', 'waiting', 'long', 'hope', 'enjoy', 'happy', 'friday', 'lwwf', 'https']


'like keep lovely customers waiting long hope enjoy happy friday lwwf https'
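
The result above still contains the token `https`, a leftover from the link at the end of the tweet. One possible way to avoid this is to remove links and @handles before tokenizing. The helper below is a hypothetical addition, not part of the original notebook, and its regular expressions are simple approximations rather than a complete URL or handle grammar.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')   # http:// or https:// up to the next space
MENTION_PATTERN = re.compile(r'@\w+')       # @handle

def strip_urls_and_mentions(tweet):
    """Remove links and @handles, then collapse the leftover whitespace."""
    tweet = URL_PATTERN.sub(' ', tweet)
    tweet = MENTION_PATTERN.sub(' ', tweet)
    return ' '.join(tweet.split())

raw = "Happy Friday! - LWWF :) https://t.co/smyYriipxI"
cleaned = strip_urls_and_mentions(raw)
print(cleaned)
```

It could be applied as `preprocess_tweet(strip_urls_and_mentions(tweet))`. Doing so changes the features the model sees, so the reported accuracy would need to be measured again.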

In [65]:
# Preprocess all tweets
preprocessed_tweets = [preprocess_tweet(tweet) for tweet in all_tweets]

In [71]:
# Extracting Features by using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(preprocessed_tweets)

In [27]:
# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
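
A plain random split does not guarantee that the training and test sets keep the same ratio of positive to negative tweets. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below uses toy data invented for illustration; it is a variation on the split above, not the call that produced the results in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy data: ten samples, five per class
toy_X = [[i] for i in range(10)]
toy_y = [1] * 5 + [0] * 5

# stratify=toy_y keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(sorted(y_te))
```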

In [28]:
# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

In [29]:
# Predictions
y_pred = model.predict(X_test)

In [72]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.744
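
Accuracy alone does not show which class the errors fall into. A confusion matrix breaks the predictions down by true and predicted label. The sketch below uses toy labels invented for illustration, not the notebook's `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration (1 = positive, 0 = negative)
toy_true = [1, 0, 1, 0]
toy_pred = [1, 0, 0, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])

# Flatten to true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
```

For the model above, the equivalent call would be `confusion_matrix(y_test, y_pred)`.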


In [31]:
# Save the trained model to disk (the fitted vectorizer is not saved here)
import pickle

with open('../models/model.pkl','wb') as f:
    pickle.dump(model,f)

In [32]:
# Predict the sentiment of a new tweet (relies on the fitted `vectorizer` defined above)
def predict_sentiment(tweet, model=model):
    preprocessed = preprocess_tweet(tweet)
    vectorized = vectorizer.transform([preprocessed])
    prediction = model.predict(vectorized)[0]
    return "Positive" if prediction == 1 else "Negative"

In [73]:
# Load the saved model from disk
with open('../models/model.pkl', 'rb') as f:
    SavedModel = pickle.load(f)
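
The pickle file above contains only the classifier, while `predict_sentiment` also needs the fitted `vectorizer`, which exists only in the current session. One possible approach is to store both objects in a single file. The sketch below uses a toy model, a temporary directory, and a hypothetical file name `sentiment_bundle.pkl`; none of these come from the original notebook.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the notebook's fitted vectorizer and model
toy_texts = ["love great", "happy love", "sad bad", "hate bad"]
toy_labels = [1, 1, 0, 0]
toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Save both objects in one file (hypothetical file name)
bundle_path = os.path.join(tempfile.mkdtemp(), 'sentiment_bundle.pkl')
with open(bundle_path, 'wb') as f:
    pickle.dump({'vectorizer': toy_vectorizer, 'model': toy_model}, f)

# Load them back and predict without relying on in-memory globals
with open(bundle_path, 'rb') as f:
    bundle = pickle.load(f)

features = bundle['vectorizer'].transform(["love happy"])
prediction = int(bundle['model'].predict(features)[0])
print("Positive" if prediction == 1 else "Negative")
```

As with any pickle file, load it only from a trusted source and with a scikit-learn version compatible with the one used to save it.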

In [81]:
# Testing new example
example_tweet = "Government did a great Job on locating the bomb."
print(f"Tweet: '{example_tweet}'\n Sentiment: {predict_sentiment(tweet=example_tweet, model=SavedModel)}")

Tweet: 'Government did a great Job on locating the bomb.'
 Sentiment: Positive
