## Sentiment Analysis Project
### 1. Introduction
Sentiment analysis is a natural language processing (NLP) task that involves determining the sentiment expressed in a piece of text, such as whether it is positive, negative, or neutral. This application is widely used in industries to analyze customer feedback, reviews, social media posts, and much more. For example, businesses use sentiment analysis to gauge customer satisfaction or identify negative feedback for timely action.

The purpose of this project is to walk through the essential stages of implementing sentiment analysis using Python.

In [38]:


# Import Necessary Libraries
# These libraries are required for data handling, preprocessing, feature extraction, and modeling.
import pandas as pd
import nltk
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize



In [26]:
# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ekale\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ekale\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ekale\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

The downloaded NLTK resources support the text preprocessing steps used in this project. The `stopwords` corpus lists common words such as 'the' and 'and', which are removed to reduce noise. `wordnet` is a lexical database used by the `WordNetLemmatizer` to reduce words to their dictionary form (e.g., "directions" → "direction"); without a part-of-speech tag the lemmatizer treats every word as a noun, so a verb form such as "running" is reduced to "run" only when `pos='v'` is passed. `punkt_tab` holds the data for the Punkt sentence tokenizer that `word_tokenize` relies on, which makes token-level steps such as stopword removal possible. Together, these resources turn raw text into a consistent form for feature extraction and modeling.

In [52]:

# 2. Load Dataset
data = pd.read_csv("data/sentiment_dataset.csv")  # Replace with actual file path
print(data.tail(50))


            label                                               text
1599949  positive  OMG how good is ben and jerrys cookie dough ic...
1599950  positive  oooo haha just waking up and ready to eat a de...
1599951  positive  #Traveltuesday @GuyNGirlTravels Because their ...
1599952  positive  any ideaZ on what to get dad for father's day ...
1599953  positive  God works mysteriously!i learn that if u think...
1599954  positive  @_CrC_ mornin.. I'm enjoying a beautiful morni...
1599955  positive  Woke up feeling rested and refreshed today! It...
1599956  positive  @naijagal You just HAD to throw that in. Tell ...
1599957  positive  @siovene lol I don't blame you it's not the sa...
1599958  positive  @ashinynewcoin yeah, that'd be the one  sorry ...
1599959  positive     @pokapolas love the donut and the toadstool.  
1599960  positive  @crgrs359 Skip the aquarium and check out thes...
1599961  positive           @GroleauNET Yeah I'm being an ass today 
1599962  positive                 

### Text Preprocessing




In [28]:
def preprocess_text(text):
    # Initialize lemmatizer and stop words
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    
    # Lowercase the text
    text = text.lower()
    
    # Replace non-alphanumeric characters, but retain numbers
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)  # Keep alphabets, numbers, and spaces only
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords
    cleaned_tokens = [word for word in tokens if word not in stop_words]

    # lemmatize tokens
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in cleaned_tokens]
    
    # Return the processed text as a single string
    return ' '.join(lemmatized_tokens)


In [54]:
preprocess_text("The directions for taking out a link are a bit confusing, and $50 I have")

'direction taking link bit confusing 50'

The `preprocess_text` function cleans raw text before feature extraction. It first creates a `WordNetLemmatizer` and loads NLTK's set of English stopwords, then applies five steps:

1. Lowercase the text so that "Apple" and "apple" are treated as the same word.
2. Replace every character that is not a letter, digit, or whitespace with a space, which strips punctuation and symbols while keeping numbers.
3. Tokenize the text into individual words with NLTK's `word_tokenize`.
4. Drop stopwords, the very common words that carry little sentiment information.
5. Lemmatize each remaining token with the `WordNetLemmatizer`. Because no part-of-speech tag is passed, tokens are treated as nouns: "directions" becomes "direction", while verb forms such as "taking" and "confusing" are left unchanged, as the example output above shows.

Finally, the tokens are joined back into a single string, ready for vectorization.
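The cleaning steps before lemmatization can be traced stage by stage with a minimal, standard-library-only sketch. It is an illustration under stated assumptions, not the notebook's function: `TOY_STOPWORDS` is a hypothetical stand-in for NLTK's English stopword list, `str.split` stands in for `word_tokenize`, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stopword list (assumption).
TOY_STOPWORDS = {"the", "for", "out", "a", "are", "and", "i", "have"}

def clean_stages(text):
    """Return the intermediate result of each cleaning stage."""
    lowered = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    stripped = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Whitespace split stands in for word_tokenize (assumption).
    tokens = stripped.split()
    kept = [t for t in tokens if t not in TOY_STOPWORDS]
    return lowered, stripped, tokens, kept

sentence = "The directions for taking out a link are a bit confusing, and $50 I have"
lowered, stripped, tokens, kept = clean_stages(sentence)
print(kept)  # ['directions', 'taking', 'link', 'bit', 'confusing', '50']
```

Only noun lemmatization separates this result from the notebook's output: it would turn "directions" into "direction" and leave the other tokens as they are.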

In [29]:

# Apply preprocessing to the dataset
data = data.dropna()
data['label'] = data['label'].map({'negative': 0, 'positive': 1})
data['text'] = data['text'].apply(preprocess_text)


The code above drops rows with null values from the data, then maps the labels `negative` and `positive` to 0 and 1 respectively.

The next line applies the `preprocess_text` function defined above to every row, so the `text` column ends up holding cleaned text.

In [56]:
data.head(5)

      label                                               text
0  negative  is upset that he can't update his Facebook by ...
1  negative  @Kenichan I dived many times for the ball. Man...
2  negative     my whole body feels itchy and like its on fire
3  negative  @nationwideclass no, it's not behaving at all....
4  negative                       @Kwesidei not the whole crew


## Splitting Data Into Train and Test

In [32]:

# Split Dataset
# Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.2, random_state=42, shuffle=True
)


Here, we split the data into training and test sets for model training and evaluation: 80% of the rows go to training and 20% to testing. Setting `random_state=42` makes the shuffle reproducible across runs.
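To make the 80/20 split concrete, here is a rough standard-library sketch of what a seeded, shuffled split does. The helper `shuffled_split` is hypothetical and does not reproduce scikit-learn's internal shuffling, so its row assignment will differ from `train_test_split` even with the same seed.

```python
import random

def shuffled_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train, test = shuffled_split(rows)
print(len(train), len(test))  # 8 2
```

Because the seed is fixed, calling the helper again returns the same partition, which is the property `random_state` provides in the notebook.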

## Pipelines

In [43]:

# Build Pipeline
# Create a pipeline to streamline feature extraction and model training.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('model', LogisticRegression())
])


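We create a pipeline to streamline feature extraction and model training. `TfidfVectorizer` converts each text into a vector of TF-IDF weights: a token's weight grows with its frequency in that text and shrinks the more documents in the corpus contain it. Each vector is then scaled to unit length, which acts as a form of standardization. `LogisticRegression` is trained on these vectors.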
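As a worked illustration of the weighting, the sketch below computes TF-IDF by hand on a toy corpus. It follows the smoothed formula scikit-learn documents for the default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization. The function `toy_tfidf` and the three example documents are assumptions made for illustration, and whitespace splitting is simpler than the vectorizer's real tokenizer.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Smoothed TF-IDF with L2 normalization for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Scale each document vector to unit length (guard against empty docs).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["great movie", "bad movie", "boring movie"]
rows = toy_tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In this toy corpus "movie" appears in every document, so it receives a lower weight than the distinguishing words "great", "bad", and "boring".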

## Hyperparameter Tuning

In [42]:
# Define the parameter grid
param_grid = {
    'tfidf__max_features': [1000, 3000, 5000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],  # Unigrams or Unigrams + Bigrams
    'model__C': [0.1, 1, 10],  # Inverse regularization strength (smaller C = stronger regularization)
    'model__solver': ['liblinear', 'lbfgs'],  # Solver for Logistic Regression
}

In [46]:
from sklearn.model_selection import GridSearchCV
# Perform GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 36 candidates, totalling 180 fits
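The counts in this log line follow directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A small standard-library sketch that reproduces the arithmetic (the variable names `candidates`, `n_candidates`, and `n_fits` are illustrative):

```python
from itertools import product

param_grid = {
    'tfidf__max_features': [1000, 3000, 5000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'model__C': [0.1, 1, 10],
    'model__solver': ['liblinear', 'lbfgs'],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)     # 3 * 2 * 3 * 2
n_fits = n_candidates * cv_folds   # one fit per candidate per fold

print(n_candidates, n_fits)  # 36 180
```

Adding one more value to any parameter multiplies the work, so the grid is worth keeping small on a dataset of this size. With the default `refit=True`, `GridSearchCV` also refits the best candidate once on the full training set after the search; that extra fit is not part of the 180.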


In [47]:

# Output the best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

Best Parameters: {'model__C': 10, 'model__solver': 'liblinear', 'tfidf__max_features': 5000, 'tfidf__ngram_range': (1, 2)}
Best Cross-Validation Accuracy: 0.7709302900108985


In [48]:
# Evaluate the model on a test set

y_pred = grid_search.best_estimator_.predict(X_test)
print("Test Set Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Test Set Accuracy: 0.771528125
              precision    recall  f1-score   support

           0       0.78      0.75      0.77    159494
           1       0.76      0.80      0.78    160506

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000
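The precision, recall, and F1 columns in the report are derived from counts of true positives, false positives, and false negatives. A minimal sketch on made-up labels shows the arithmetic; the helper `binary_scores` and the toy `y_true_toy` and `y_pred_toy` lists are illustrative and not taken from the dataset.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 1, 1, 1, 0]
precision, recall, f1 = binary_scores(y_true_toy, y_pred_toy)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.60 0.75 0.67
```

Read against the report above, a recall of 0.80 for class 1 means about 80% of the truly positive texts were predicted positive, while a precision of 0.76 means about 76% of the texts predicted positive were actually positive.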



In [49]:

# Predict Sentiment for Custom Reviews
# Function to predict sentiment for a given review.
def predict_sentiment(review):
    review_preprocessed = preprocess_text(review)
    prediction = grid_search.best_estimator_.predict([review_preprocessed])
    return "Positive" if prediction[0] == 1 else "Negative"


In [67]:

# Example:
review = "Our president is overtaxing us, which is becoming so burdensome to many kenyans who are not employed"

print("Prediction:", predict_sentiment(review))


Prediction: Negative
