# Twitter Sentiment Analysis using Natural Language Processing (NLP)


## Introduction

In this project, we delve into the fascinating world of sentiment analysis using Twitter data. Sentiment analysis, a subfield of Natural Language Processing (NLP), involves analyzing text data to determine the sentiment expressed, such as positive, negative, or neutral. By examining tweets from Twitter, we can extract valuable insights into public opinion on various topics, ranging from politics to products and brands.

This project leverages machine learning techniques to classify tweets based on sentiment, helping us understand how people feel about specific topics in real time. Using a dataset of tweets, we apply preprocessing steps such as tokenization, stopword removal, and stemming, followed by training machine learning models to predict the sentiment of unseen tweets.

The aim of this project is to showcase the application of NLP and machine learning to real-world social media data, providing a powerful tool for sentiment analysis in various industries, including marketing, customer service, and public relations.


## Objective

The primary objective of this project is to:

1. **Understand and process social media data**: Extract and preprocess data from Twitter, focusing on cleaning and preparing the text for analysis.
2. **Apply Natural Language Processing (NLP) techniques**: Utilize various NLP methods such as tokenization, stopword removal, and stemming to prepare the text for sentiment classification.
3. **Train and evaluate machine learning models**: Implement machine learning algorithms to classify tweet sentiments as positive or negative, the two labels present in the training data.
4. **Deploy the sentiment analysis model**: Provide a practical demonstration of how sentiment analysis can be applied to real-time social media data.
5. **Gain insights into public opinion**: Explore how sentiment analysis can help analyze large amounts of social media data to uncover trends and sentiments about specific topics.


## Tools and Libraries Used

- **Python**: The primary programming language used for data analysis and machine learning tasks.
- **NLTK (Natural Language Toolkit)**: A library for processing and analyzing human language data.
- **Scikit-learn**: A machine learning library for training, testing, and evaluating various models.
- **Pandas**: Used for data manipulation and processing, particularly for loading and working with the dataset.
- **Matplotlib**: A library used for creating visualizations and plots to understand the distribution of sentiment.
- **Seaborn**: A data visualization library used to enhance the visual appeal and clarity of plots.


## Data Description

The dataset used in this project is the **Sentiment140** dataset, which contains 1.6 million tweets labeled as negative or positive. The training file used here contains no neutral tweets. The dataset includes the following columns:

- **target**: Sentiment label (0 = negative, 4 = positive)
- **id**: Unique identifier for the tweet
- **date**: Date and time of the tweet
- **flag**: Unused field
- **user**: Username of the Twitter account
- **text**: The text content of the tweet


## Steps Involved

1. **Data Preprocessing**:
   - Load the dataset and examine the structure.
   - Clean the text data by removing unnecessary characters, stopwords, and non-alphanumeric symbols.
   - Tokenize the text and apply stemming to reduce words to their root form.

2. **Exploratory Data Analysis (EDA)**:
   - Analyze the distribution of sentiments in the dataset.
   - Visualize the frequency of positive and negative sentiments.

3. **Feature Extraction**:
   - Convert the cleaned text into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency).

4. **Model Training**:
   - Split the data into training and test sets.
   - Train a machine learning model. This notebook uses Logistic Regression; Naive Bayes and Random Forest are possible alternatives.

5. **Model Evaluation**:
   - Evaluate the performance of the trained model using accuracy score and other metrics.
   - Tune the model parameters for better performance.

6. **Prediction and Visualization**:
   - Make predictions on unseen data (new tweets).
   - Visualize the predicted sentiments and analyze trends.


#### Installing the Kaggle Library and Configuring the API

In [2]:
# installing Kaggle Library
! pip install kaggle



In [9]:
import os

# Specify the directory containing `kaggle.json`
os.environ['KAGGLE_CONFIG_DIR'] = r"C:\Users\Chang\Downloads"

# Test the Kaggle API
!kaggle datasets list


ref                                                              title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
stealthtechnologies/predict-student-performance-dataset          Predict Student Performance                          12KB  2024-12-26 12:57:04           1322         36  1.0              
bhadramohit/customer-shopping-latest-trends-dataset              Customer Shopping (Latest Trends) Dataset            76KB  2024-11-23 15:26:12          21803        406  1.0              
ankushpanday1/heart-attack-in-youth-of-india                     Heart attack in youth of India                      298KB  2025-01-02 15:20:31            831         26  1.0              
oktayrdeki/heart-disease                               

In [3]:

# Download the Sentiment140 dataset
!kaggle datasets download -d kazanova/sentiment140



Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)


In [11]:
import zipfile

# Unzip the downloaded file
with zipfile.ZipFile("sentiment140.zip", 'r') as zip_ref:
    zip_ref.extractall("sentiment140_dataset")


#### Importing the Dependencies

In [17]:
# Import libraries for data manipulation and analysis
import pandas as pd  # Used for data manipulation and analysis, especially for working with DataFrames
import numpy as np  # Provides support for numerical computations, arrays, and mathematical operations

# Import libraries for text preprocessing
import re  # Regular expressions, used for cleaning and searching text patterns
from nltk.corpus import stopwords  # Provides a list of common stopwords to remove from text (e.g., "and", "the")
from nltk.stem.porter import PorterStemmer  # Used for stemming, which reduces words to their root forms (e.g., "running" -> "run")

# Import libraries for feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency)

# Import libraries for model training and evaluation
from sklearn.model_selection import train_test_split  # Splits the dataset into training and testing sets
from sklearn.linear_model import LogisticRegression  # Logistic Regression algorithm for classification tasks
from sklearn.metrics import accuracy_score  # Calculates the accuracy of the model's predictions


In [18]:
# Import the necessary NLTK module for stopwords
import nltk  # Natural Language Toolkit, a library for processing and analyzing human language data

# Download the stopwords data from NLTK (this step is needed only once)
nltk.download('stopwords')  # Downloads the list of stopwords in various languages from NLTK

# Printing the stopwords in English
print(stopwords.words('english'))  # Displays the list of common stopwords in English (e.g., "the", "and", "is")


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Chang\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 1. **Data Preprocessing**:

In [19]:
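# Load the data from the CSV file into a pandas DataFrame
# Note: the file has no header row, so without header=None pandas treats the
# first tweet as column names and the shape below shows 1,599,999 rows.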
twitter_data = pd.read_csv(r"C:\Users\Chang\Downloads\sentiment140_dataset\training.1600000.processed.noemoticon.csv", encoding='latin-1')

# Checking the number of rows and columns
print(twitter_data.shape)


(1599999, 6)


In [20]:
# printing the first 5 rows of the dataframe
twitter_data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [21]:
# Define the column names
column_names = ["target", "id", "date", "flag", "user", "text"]

# Load the data again with the new column names
twitter_data = pd.read_csv(r"C:\Users\Chang\Downloads\sentiment140_dataset\training.1600000.processed.noemoticon.csv", 
                           encoding='latin-1', names=column_names, header=None)

# Display the first few rows of the dataset with new column names
twitter_data.head()


Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [22]:
# Checking if there are any missing values in the dataset
twitter_data.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

In [23]:
# Checking the distribution of the 'target' column
# The 'target' column contains the sentiment labels: 0 for negative and 4 for positive sentiment.
# By using 'value_counts()', we can check how many negative (0) and positive (4) sentiments are present in the dataset.
# This gives us an idea of the class distribution, which helps us understand if the dataset is balanced or not.

print(twitter_data['target'].value_counts())





0    800000
4    800000
Name: target, dtype: int64


#### Explanation:
`twitter_data['target'].value_counts()` counts how many times each label (0 and 4) occurs in the `target` column. Both labels occur equally often, so the dataset is balanced between negative and positive sentiment.

In [24]:
# Convert the target value 4 to 1 so the labels follow the usual binary convention
# In machine learning tasks, it's common to represent binary sentiment as:
# 0 for negative and 1 for positive.
# The problem was already binary (only 0 and 4 occur); this step only relabels the positive class.

twitter_data.replace({'target': {4: 1}}, inplace=True)

# Checking the distribution of the target after conversion
# We now expect to see '0' for negative sentiment and '1' for positive sentiment in the 'target' column.

target_distribution = twitter_data['target'].value_counts()

# Output the distribution
print(target_distribution)


0    800000
1    800000
Name: target, dtype: int64


### Importance of Checking the Target Column Distribution

The **target column** represents the sentiment of the tweets, where:
- **0** indicates a **negative sentiment**
- **1** indicates a **positive sentiment**

By running `value_counts()`, we can quickly observe the distribution of sentiment labels in the dataset. This is important for several reasons:
1. **Class Imbalance**: If the dataset contains a significantly higher number of one class (e.g., positive or negative), it might affect the performance of machine learning models. In this case, the dataset is balanced with an equal number of positive (1) and negative (0) sentiments, which is ideal for training a model without bias.
2. **Model Evaluation**: Knowing the distribution helps us better evaluate the performance of our sentiment analysis model, as class imbalance can lead to misleading accuracy scores. In this case, since we have a balanced dataset, our model will be less likely to be biased towards predicting one sentiment over the other.
3. **Data Insights**: This step helps us get a quick overview of the dataset and ensures there are no unexpected anomalies, such as missing data or imbalanced labels.

In this particular dataset, we can see:
- There are **800,000** negative tweets (label 0).
- There are **800,000** positive tweets (label 1).

This is an evenly distributed dataset, making it suitable for training a model without the risk of bias toward one sentiment.
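The balance check can also be expressed as class shares and a majority-class baseline, which gives a reference point for later accuracy figures. The following is a minimal sketch in plain Python. `class_shares` is a hypothetical helper name, and the short label list is a toy stand-in for the `target` column.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy stand-in for the 'target' column: equal numbers of 0s and 1s.
labels = [0] * 8 + [1] * 8
shares = class_shares(labels)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(shares.values())

print(shares)             # {0: 0.5, 1: 0.5}
print(majority_baseline)  # 0.5
```

On a balanced two-class dataset this baseline is 0.5, so any accuracy well above 0.5 reflects something the model has learned rather than class frequency alone.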


## Stemming

Stemming is a common step in Natural Language Processing (NLP) that reduces words to a root form. This process ensures that variations of a word, such as "running," "runs," and "run," are treated as the same token. The resulting stems are not always dictionary words; the output below contains stems such as "updat" and "bodi". Stemming helps with:

- **Text Normalization**: Ensures consistency by reducing words to their base form.
- **Reducing Vocabulary Size**: Groups related word forms, which shrinks the feature space and speeds up computation.
- **Improving Model Accuracy**: Can help the model focus on meaning rather than grammatical variations.
- **Relevance**: In sentiment analysis, stemming helps simplify tweets that often contain informal and varied language.

In [25]:
# Initialize the Porter Stemmer
port_stem = PorterStemmer()

In [26]:
# Build the stopword set once. Calling stopwords.words('english') for every
# word rebuilds the list on each call and scans it linearly, which is slow
# on 1.6 million tweets; a set gives constant-time membership tests.
stop_words = set(stopwords.words('english'))

# Define the stemming function
def stemming(content):
    """
    Preprocess and stem a given text.

    Steps:
    1. Replace non-alphabetic characters with spaces using regex.
    2. Convert the text to lowercase for consistency.
    3. Split the text into individual words (tokens).
    4. Remove stopwords (common words like 'the', 'and', etc.).
    5. Stem each remaining word to its root form.
    6. Join the stemmed words back into a single string.

    Parameters:
    content (str): A string of text to preprocess and stem.

    Returns:
    str: The preprocessed and stemmed text.
    """
    # Step 1: Replace non-alphabetic characters with spaces
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)

    # Step 2: Convert text to lowercase
    stemmed_content = stemmed_content.lower()

    # Step 3: Split the text into individual words (tokens)
    stemmed_content = stemmed_content.split()

    # Steps 4 and 5: Remove stopwords and stem the remaining words
    stemmed_content = [
        port_stem.stem(word) for word in stemmed_content if word not in stop_words
    ]

    # Step 6: Join the stemmed words back into a single string
    stemmed_content = ' '.join(stemmed_content)

    return stemmed_content


In [27]:
# Apply the stemming function to the 'text' column
# In the original run this took about 50 minutes; the runtime depends on
# hardware and on how the stopword lookup is implemented
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming)


In [29]:
'''
The new column stemmed_content contains the processed text from the text 
column, where all words are reduced to their root forms through stemming, 
helping standardize the data for analysis.
'''
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


In [30]:
# For the sentiment analysis task, we are focusing on two main columns:
# 1. 'stemmed_content': Contains the processed text data with stemming applied (reducing words to their root form).

print(twitter_data['stemmed_content'])

0          switchfoot http twitpic com zl awww bummer sho...
1          upset updat facebook text might cri result sch...
2          kenichan dive mani time ball manag save rest g...
3                            whole bodi feel itchi like fire
4                              nationwideclass behav mad see
                                 ...                        
1599995                           woke school best feel ever
1599996    thewdb com cool hear old walt interview http b...
1599997                         readi mojo makeov ask detail
1599998    happi th birthday boo alll time tupac amaru sh...
1599999    happi charitytuesday thenspcc sparkschar speak...
Name: stemmed_content, Length: 1600000, dtype: object


In [31]:
# 2. 'target': Contains the sentiment labels, where 0 indicates negative sentiment and 1 indicates positive sentiment.
print(twitter_data['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


In [32]:
# Separating the data (features) and the label (target)
X = twitter_data['stemmed_content'].values  # X contains the preprocessed text (features)
Y = twitter_data['target'].values           # Y contains the sentiment labels (0 or 1)

In [33]:
# The output displays the first 3 and last 3 tweets from the dataset, which contains over a million tweets.
# This is a small sample from the `X` (features) array, showing how the text data looks after preprocessing (stemming).
# Due to the large size of the dataset, we only display a snippet here for clarity.

print(X)

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']


In [34]:
# The output displays a small sample of the target labels from the dataset (Y).
# Each value represents the sentiment of the corresponding tweet: 0 for negative sentiment and 1 for positive sentiment.
# Due to the large size of the dataset, we are only displaying a snippet of the labels for clarity.

print(Y)

[0 0 0 ... 1 1 1]


### Splitting the Data into Training and Test Sets
In this step, we will split the dataset into two parts:

- **Training Data**: This subset will be used to train our machine learning model. The model will learn patterns from this data to make predictions.
- **Test Data**: This subset will be used to evaluate the performance of the trained model. By testing the model on data it hasn't seen before, we can assess how well it generalizes to new, unseen data.

This separation helps us ensure that the model is not just memorizing the data (overfitting) but is able to make accurate predictions on new data.

In [35]:
# Splitting the dataset into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X,               # Feature data (stemmed tweets)
    Y,               # Target labels (sentiment labels)
    test_size=0.2,   # 20% of the data will be used for testing
    stratify=Y,      # Keeps the same class proportions in both the train and test sets
    random_state=2   # Seed for reproducibility of the split
)

# Now we have:
# X_train, Y_train: Data for training the model
# X_test, Y_test: Data for evaluating the model


In [36]:
# Printing the shapes of the original, training, and test datasets
print(X.shape, X_train.shape, X_test.shape)

(1600000,) (1280000,) (320000,)


### Explanation of Output:

- **`X.shape: (1600000,)`**: The original dataset contains 1,600,000 tweets.
- **`X_train.shape: (1280000,)`**: 1,280,000 tweets (80% of the data) are allocated to the training set.
- **`X_test.shape: (320000,)`**: 320,000 tweets (20% of the data) are allocated to the test set.

The dataset has been correctly split into 80% for training and 20% for testing.
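To make the effect of `stratify=Y` concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper that shuffles each class separately and takes the same fraction of each for the test set.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_size=0.2, seed=2):
    """Split so each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)

    # Group sample indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(items[i], labels[i]) for i in train_idx]
    test = [(items[i], labels[i]) for i in test_idx]
    return train, test

# Toy data: 10 negative and 10 positive "tweets".
items = [f"tweet {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

train, test = stratified_split(items, labels)
print(len(train), len(test))             # 16 4
print(sum(label for _, label in test))   # 2 positives out of 4 test samples
```

Because each class contributes 20% of its samples to the test set, the 50/50 balance of the full data carries over to both subsets.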


### Converting Textual Data to Numerical Data

In this step, we convert the textual data into numerical data that can be processed by machine learning models. Since the model cannot understand text directly, we need to transform the text (tweets) into a numerical format.

We'll use **TF-IDF (Term Frequency - Inverse Document Frequency)**, which is a technique to evaluate the importance of a word in a document relative to the entire dataset. This helps in capturing the significance of each word in a tweet.

---


In [37]:
# Initializing the TfidfVectorizer, which will convert text to numerical data
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the training data and transform it into numerical data
# The fit_transform method both learns the vocabulary and transforms the text to TF-IDF features
X_train = vectorizer.fit_transform(X_train)

# Transform the test data using the already learned vocabulary
# The transform method only applies the transformation without altering the vocabulary
X_test = vectorizer.transform(X_test)


In [38]:
print(X_train)

  (0, 443066)	0.4484755317023172
  (0, 235045)	0.41996827700291095
  (0, 109306)	0.3753708587402299
  (0, 185193)	0.5277679060576009
  (0, 354543)	0.3588091611460021
  (0, 436713)	0.27259876264838384
  (1, 160636)	1.0
  (2, 288470)	0.16786949597862733
  (2, 132311)	0.2028971570399794
  (2, 150715)	0.18803850583207948
  (2, 178061)	0.1619010109445149
  (2, 409143)	0.15169282335109835
  (2, 266729)	0.24123230668976975
  (2, 443430)	0.3348599670252845
  (2, 77929)	0.31284080750346344
  (2, 433560)	0.3296595898028565
  (2, 406399)	0.32105459490875526
  (2, 129411)	0.29074192727957143
  (2, 407301)	0.18709338684973031
  (2, 124484)	0.1892155960801415
  (2, 109306)	0.4591176413728317
  (3, 172421)	0.37464146922154384
  (3, 411528)	0.27089772444087873
  (3, 388626)	0.3940776331458846
  (3, 56476)	0.5200465453608686
  :	:
  (1279996, 390130)	0.22064742191076112
  (1279996, 434014)	0.2718945052332447
  (1279996, 318303)	0.21254698865277746
  (1279996, 237899)	0.2236567560099234
  (1279996, 2910

In [39]:
print(X_test)

  (0, 420984)	0.17915624523539803
  (0, 409143)	0.31430470598079707
  (0, 398906)	0.3491043873264267
  (0, 388348)	0.21985076072061738
  (0, 279082)	0.1782518010910344
  (0, 271016)	0.4535662391658828
  (0, 171378)	0.2805816206356073
  (0, 138164)	0.23688292264071403
  (0, 132364)	0.25525488955578596
  (0, 106069)	0.3655545001090455
  (0, 67828)	0.26800375270827315
  (0, 31168)	0.16247724180521766
  (0, 15110)	0.1719352837797837
  (1, 366203)	0.24595562404108307
  (1, 348135)	0.4739279595416274
  (1, 256777)	0.28751585696559306
  (1, 217562)	0.40288153995289894
  (1, 145393)	0.575262969264869
  (1, 15110)	0.211037449588008
  (1, 6463)	0.30733520460524466
  (2, 400621)	0.4317732461913093
  (2, 256834)	0.2564939661498776
  (2, 183312)	0.5892069252021465
  (2, 89448)	0.36340369428387626
  (2, 34401)	0.37916255084357414
  :	:
  (319994, 123278)	0.4530341382559843
  (319995, 444934)	0.3211092817599261
  (319995, 420984)	0.22631428606830145
  (319995, 416257)	0.23816465111736276
  (319995, 3

##### Explanation of the Output:

The output represents the **numerical transformation** of the textual data using `TfidfVectorizer`. 

- Each printed entry has the form `(row, column)  value`.
- The row index identifies a specific tweet, and the column index identifies a unique word (or token) in the vocabulary learned from the training data.
- The values are the **TF-IDF scores**, which indicate the importance of a word in a specific tweet relative to its frequency across all tweets.

This transformation converts textual data into a sparse matrix format, where most values are zero, optimizing memory usage. The numerical data is now ready for use in training machine learning models.
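The scores can be reproduced by hand on a tiny corpus. The sketch below assumes the default `TfidfVectorizer` weighting: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It tokenizes with a plain `split()`, so it ignores the vectorizer's own tokenization rules, and `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

docs = ["whole bodi feel itchi like fire", "feel good", "sad"]
rows = tfidf_rows(docs)

for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

A tweet reduced to a single token gets a weight of about 1.0 after normalization, which matches entries such as `(1, 160636)	1.0` in the training output. A term that appears in several tweets (`feel` here) receives a lower weight than a rarer term in the same tweet (`good`).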


# Training the Machine Learning Model

In this section, we will train a machine learning model to classify tweets into their respective sentiment categories (e.g., positive or negative). 

The training process involves:

1. **Choosing a Model**: Selecting an appropriate machine learning algorithm for sentiment analysis.
2. **Feeding the Data**: Using the numerical data (`X_train` and `Y_train`) to train the model to identify patterns and relationships between features (tweets) and their corresponding labels (sentiments).
3. **Evaluation**: After training, the model will be tested on unseen data (`X_test`) to evaluate its accuracy and ability to generalize.

The goal is to develop a robust model that can accurately predict the sentiment of new, unseen tweets.


### Logistic Regression

Logistic Regression is a simple and effective algorithm for binary classification tasks, such as sentiment analysis. It predicts probabilities using a sigmoid function and classifies data into categories based on a threshold (e.g., positive or negative sentiment). Its simplicity makes it a great starting point for building classification models.
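The sigmoid-and-threshold step can be sketched in a few lines of plain Python. The weights and bias below are hypothetical values chosen for illustration, not coefficients learned by the notebook's model, and `predict_label` is an illustrative helper rather than part of scikit-learn.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    """Return (label, probability) for a sparse {term: value} feature dict."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability

# Hypothetical coefficients: positive pushes toward label 1, negative toward 0.
weights = {"happi": 2.0, "sad": -2.5}
bias = 0.0

label_pos, p_pos = predict_label(weights, bias, {"happi": 0.7, "day": 0.7})
label_neg, p_neg = predict_label(weights, bias, {"sad": 1.0})

print(label_pos, round(p_pos, 3))  # 1 0.802
print(label_neg, round(p_neg, 3))  # 0 0.076
```

Terms without a weight contribute nothing to the score, so only words the model has learned something about move the probability away from 0.5.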


In [42]:
# Initializing the Logistic Regression model with a maximum of 1000 iterations
# max_iter specifies the maximum number of iterations the solver will take to converge
model = LogisticRegression(max_iter=1000)

In [43]:
# Training the Logistic Regression model using the training data
# X_train contains the feature vectors for the training set
# Y_train contains the corresponding labels for the training set
model.fit(X_train, Y_train)


### Model Evaluation

Model evaluation is the process of assessing how well a trained machine learning model performs on unseen data. It helps determine whether the model is making accurate predictions and identifies potential overfitting or underfitting issues. Common evaluation metrics include accuracy, precision, recall, and F1-score.

In this step, we'll test our model on the test dataset and calculate relevant metrics to measure its performance.


#### Accuracy Score

The accuracy score is a metric used to evaluate the performance of a classification model. It calculates the proportion of correct predictions (both true positives and true negatives) made by the model out of all predictions. 
A higher accuracy indicates that the model is performing well, though it may not always be the best metric if the dataset is imbalanced.
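As a minimal illustration of the definition, accuracy can be computed by hand. The sketch also shows the caveat about imbalanced data: a classifier that always predicts the majority class scores high accuracy without learning anything. The label lists are toy values, and `accuracy` is a hypothetical helper that mirrors the idea behind `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)

# Balanced toy example: 3 of 5 predictions are correct.
print(accuracy([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # 0.6

# Imbalanced toy example: always predicting 0 still scores 0.9.
imbalanced_truth = [0] * 9 + [1]
always_negative = [0] * 10
print(accuracy(imbalanced_truth, always_negative))  # 0.9
```

Because Sentiment140's training labels are split evenly, this pitfall does not apply here, and accuracy is a reasonable first metric.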


In [44]:
# Predicting labels for the training data
X_train_prediction = model.predict(X_train)

# Calculating accuracy score for the training data
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

# Printing the training data accuracy
print(f"Training Data Accuracy: {training_data_accuracy}")


Training Data Accuracy: 0.81023125


#### Explanation of Output:

The **Training Data Accuracy** is calculated as `0.81023125`, which means the model correctly predicted the labels for approximately **81.02%** of the training data. This is the proportion of correct predictions made by the model on the training set.

A higher accuracy score generally indicates a better-performing model. However, it's important to evaluate the model on the test data as well to ensure that it generalizes well to unseen data and isn't overfitting to the training set.


In [45]:
# Predicting labels for the test data
X_test_prediction = model.predict(X_test)

# Calculating accuracy score for the test data
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

# Printing the test data accuracy
print(f"Test Data Accuracy: {test_data_accuracy}")


Test Data Accuracy: 0.778


#### Test Data Accuracy

The test data accuracy is `0.778`, which means the model correctly predicted the labels for 77.8% of the test data.

#### Comparison with Training Data Accuracy

- **Training Data Accuracy**: `0.810` (81%)
- **Test Data Accuracy**: `0.778` (77.8%)

The model performs similarly on the training and test data, with a gap of about 3.2 percentage points (81.0% vs. 77.8%). A gap this small suggests at most mild overfitting, so the model generalizes reasonably well to unseen data.

#### Conclusion

- **Good Model Fit**: The model shows a good balance between performance on training data and test data, which indicates that the model is likely not overfitting or underfitting.
- **Overfitting or Underfitting**: If the training accuracy was much higher than the test accuracy, it could indicate overfitting. Conversely, if both accuracies were low, the model could be underfitting. In this case, the results suggest a well-fitted model.

This is a positive outcome, but further optimization and testing may still improve performance.


### Saving the Trained Model

Once we have trained our machine learning model, it is important to save it so that we can use it later for predictions without having to retrain the model every time. This can be achieved using the `pickle` module in Python.

Pickle allows us to serialize the model into a file, which we can later load back into memory when needed. This is useful for deploying the model or for future use in applications.


In [46]:
import pickle

In [47]:
# Specify the filename where the model will be saved
filename = 'trained_model.sav'

# Save the trained model using pickle
# The 'wb' mode is used to write the model as a binary file
with open(filename, 'wb') as model_file:
    pickle.dump(model, model_file)

# Print a confirmation message that the model has been saved successfully
print(f"Model saved as {filename}")


Model saved as trained_model.sav


#### Why Save the Trained Model

In machine learning, once a model is trained and evaluated, we can save it for future use, rather than retraining it each time. Saving the model allows us to:

- **Avoid retraining**: Training machine learning models can be time-consuming, especially with large datasets. By saving the trained model, we can quickly load it whenever needed for new predictions.
- **Reuse the model**: The saved model can be deployed in different environments or used across multiple projects.
- **Share the model**: If the model is effective, it can be shared with other team members, collaborators, or stakeholders for their use.

In this case, we have saved the trained model as `trained_model.sav`. We can now load this model in the future and make predictions without needing to retrain it.
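Note that `trained_model.sav` holds only the classifier. To score raw text later, the fitted `TfidfVectorizer` is needed as well, because `transform` relies on the vocabulary learned from the training data. One option is to pickle both objects together. The sketch below shows the round trip using only the standard library. The plain dictionaries are stand-ins for the fitted objects, and the file name `sentiment_bundle.sav` is hypothetical.

```python
import os
import pickle
import tempfile

# Plain dictionaries stand in for the fitted vectorizer and model; in the
# notebook these would be the `vectorizer` and `model` objects themselves.
bundle = {
    "vectorizer": {"vocabulary": {"happi": 0, "sad": 1}},
    "model": {"weights": [2.0, -2.5], "bias": 0.0},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_bundle.sav")  # hypothetical file name

    # Serialize both objects into a single file.
    with open(path, "wb") as bundle_file:
        pickle.dump(bundle, bundle_file)

    # Load them back; the 'with' block closes the file handle afterwards.
    with open(path, "rb") as bundle_file:
        restored = pickle.load(bundle_file)

print(restored == bundle)          # True
print(sorted(restored.keys()))     # ['model', 'vectorizer']
```

Unpickling can execute arbitrary code, so only load pickle files that come from a trusted source.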


#### Using the Saved Model for New Predictions

Once we have saved the trained model, we can load it again to make predictions on new, unseen data. Here’s how we can use the saved model for making predictions without retraining it.

In [48]:
# Load the saved model from the file written in the previous step
# The 'rb' mode reads the file as binary; the 'with' block closes it afterwards
with open(filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

# 'loaded_model' can now be used to make predictions on new data

In [50]:
# Selecting a new data point (tweet) from the test set
X_new = X_test[200]

# Printing the true label for the selected test data point
print("True label:", Y_test[200])

# Making a prediction on the selected data point with the loaded model
prediction = loaded_model.predict(X_new)

# Printing the predicted label
print("Predicted label:", prediction)

# Checking if the prediction is 0 (negative) or 1 (positive)
if prediction[0] == 0:
    print('Negative Tweet')
else:
    print('Positive Tweet')


True label: 1
Predicted label: [1]
Positive Tweet


Explanation:

We selected a tweet from the test set (`X_test[200]`) whose true sentiment label is 1, which indicates a positive tweet.
The model also predicted 1, so the prediction is correct and the tweet is classified as a Positive Tweet.

In [51]:
# Selecting a new data point (tweet) from the test set
X_new = X_test[3]

# Printing the true label for the selected test data point
print("True label:", Y_test[3])

# Making a prediction on the selected data point with the loaded model
prediction = loaded_model.predict(X_new)

# Printing the predicted label
print("Predicted label:", prediction)

# Checking if the prediction is 0 (negative) or 1 (positive)
if prediction[0] == 0:
    print('Negative Tweet')
else:
    print('Positive Tweet')


True label: 0
Predicted label: [0]
Negative Tweet


Explanation:

We selected another tweet from the test set (`X_test[3]`) whose true sentiment label is 0, which indicates a negative tweet.
The model also predicted 0, so the prediction is correct and the tweet is classified as a Negative Tweet.
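Both checks above reuse rows of `X_test` that were already converted to TF-IDF features. To classify raw tweet text, the fitted vectorizer is needed as well, because the classifier only understands the feature space it was trained on. One possible approach, sketched here under assumptions, is to bundle the vectorizer and classifier with scikit-learn's `make_pipeline` and pickle them together. The toy texts and labels are made up for illustration, and the cleaning and stemming steps used earlier in this project are omitted; real input would need the same preprocessing as the training data.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data for illustration; 0 = negative, 1 = positive.
texts = [
    "i love this phone",
    "what a great day",
    "so happy with the service",
    "i hate waiting",
    "this is terrible",
    "worst day ever",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier as one object.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

# Serialize and restore the whole pipeline (in memory here, for brevity).
restored = pickle.loads(pickle.dumps(pipeline))

# Raw strings go straight in; the restored vectorizer transforms them first.
new_tweets = ["i love this great service", "terrible day"]
original_pred = pipeline.predict(new_tweets)
restored_pred = restored.predict(new_tweets)

for tweet, label in zip(new_tweets, restored_pred):
    print(tweet, "->", "Positive Tweet" if label == 1 else "Negative Tweet")
```

The restored pipeline returns the same predictions as the original, which is the property that matters when the model is loaded in another session.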

### Confusion Matrix & Classification Report

To evaluate the model's performance more comprehensively, we’ll use the Confusion Matrix and Classification Report. This will help us understand how well the model is classifying each sentiment label.

In [52]:
from sklearn.metrics import confusion_matrix, classification_report

# Predicting labels for the test data
X_test_prediction = model.predict(X_test)

# Confusion Matrix
conf_matrix = confusion_matrix(Y_test, X_test_prediction)
print("Confusion Matrix:")
print(conf_matrix)

# Classification Report
class_report = classification_report(Y_test, X_test_prediction)
print("Classification Report:")
print(class_report)


Confusion Matrix:
[[121436  38564]
 [ 32476 127524]]
Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.76      0.77    160000
           1       0.77      0.80      0.78    160000

    accuracy                           0.78    320000
   macro avg       0.78      0.78      0.78    320000
weighted avg       0.78      0.78      0.78    320000




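In the confusion matrix, rows correspond to the true labels and columns to the predicted labels, so the four cells read as follows:

- **True Negatives (121,436)**: The number of negative tweets that were correctly predicted as negative.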
- **False Positives (38,564)**: The number of negative tweets that were incorrectly predicted as positive.
- **False Negatives (32,476)**: The number of positive tweets that were incorrectly predicted as negative.
- **True Positives (127,524)**: The number of positive tweets that were correctly predicted as positive.
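The four counts can be unpacked and sanity-checked with plain Python. This sketch copies the matrix from the output above into a nested list; with the NumPy array that `confusion_matrix` returns, `tn, fp, fn, tp = conf_matrix.ravel()` gives the same four values for a binary problem.

```python
# Confusion matrix copied from the notebook output (rows: true, columns: predicted).
conf_matrix = [[121436, 38564],
               [32476, 127524]]

(tn, fp), (fn, tp) = conf_matrix

# Each row sums to the number of test tweets that truly belong to that class.
negatives = tn + fp
positives = fn + tp
total = negatives + positives

# Accuracy is the share of tweets on the diagonal (correct predictions).
accuracy = (tn + tp) / total

print("Negative tweets:", negatives)
print("Positive tweets:", positives)
print("Accuracy:", round(accuracy, 3))
```

The row sums match the `support` column of the classification report (160,000 per class), and the accuracy rounds to the reported 0.78.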

### Classification Report

The **Classification Report** provides more detailed metrics:

- **Accuracy**: The model correctly predicted the sentiment of 78% of the test tweets.

- **Precision**: Of the tweets the model assigned to a sentiment, the share that truly have that sentiment.
  - Precision for negative sentiment (`0`) is 79%, and for positive sentiment (`1`), it's 77%.

- **Recall**: Of the tweets that truly have a sentiment, the share the model identified.
  - Recall for negative sentiment (`0`) is 76%, and for positive sentiment (`1`), it's 80%.

- **F1-Score**: The harmonic mean of precision and recall, giving a single balanced measure of the model's performance.
  - The F1-score for negative sentiment (`0`) is 77%, and for positive sentiment (`1`), it's 78%.

- **Macro Average**: The unweighted mean of precision, recall, and F1-score across both classes, giving equal weight to each class.

- **Weighted Average**: The mean of precision, recall, and F1-score weighted by each class's support. Because both classes have 160,000 test tweets, it equals the macro average here (78%).
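The metrics listed above can be reproduced from the four confusion-matrix counts. This is a minimal sketch in plain Python rather than the notebook's own code; the helper name `precision_recall_f1` is introduced here for illustration.

```python
# Counts taken from the confusion matrix in the notebook output.
tn, fp, fn, tp = 121436, 38564, 32476, 127524


def precision_recall_f1(correct, wrongly_assigned, missed):
    """Return (precision, recall, F1) for one class.

    correct:          tweets of this class predicted as this class
    wrongly_assigned: tweets of the other class predicted as this class
    missed:           tweets of this class predicted as the other class
    """
    precision = correct / (correct + wrongly_assigned)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Negative class (0): correct = TN, wrongly assigned = FN, missed = FP.
neg = precision_recall_f1(tn, fn, fp)
# Positive class (1): correct = TP, wrongly assigned = FP, missed = FN.
pos = precision_recall_f1(tp, fp, fn)

support_neg, support_pos = tn + fp, fn + tp

# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro_f1 = (neg[2] + pos[2]) / 2
weighted_f1 = (neg[2] * support_neg + pos[2] * support_pos) / (support_neg + support_pos)

for name, (p, r, f) in (("negative (0)", neg), ("positive (1)", pos)):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1={macro_f1:.2f} weighted F1={weighted_f1:.2f}")
```

Rounded to two decimals, the values agree with the classification report, and the macro and weighted F1 coincide because the two classes have equal support.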

### Conclusion

The model achieved **78% accuracy** on a balanced test set, where always predicting a single class would score 50%. Precision, recall, and F1-score are close for both classes, which indicates that the model has no significant bias towards either positive or negative sentiment.

However, there is room for improvement, particularly in recall for negative sentiment (`0`), which at 76% is the lowest of the per-class metrics. Further tuning and exploration of more advanced models could improve these results.


# Final Report: Twitter Sentiment Analysis Using NLP

## Summary

In this project, we explored the use of Natural Language Processing (NLP) and machine learning techniques to perform sentiment analysis on a large dataset of tweets. The primary objective was to classify tweets into positive and negative sentiments, helping us understand public opinion on various topics. We leveraged machine learning models, such as Logistic Regression, to predict the sentiment of unseen tweets after preprocessing the text data.

### Key Steps in the Process:
1. **Data Preprocessing**: The dataset was cleaned by removing unnecessary characters, stopwords, and applying stemming to ensure the text data was in a usable format for machine learning.
2. **Exploratory Data Analysis (EDA)**: We analyzed the distribution of sentiment labels (positive, negative) and visualized the sentiment distribution.
3. **Feature Extraction**: The cleaned text was converted into numerical features using the TF-IDF method, allowing us to use the data in a machine learning model.
4. **Model Training**: Various machine learning models were trained, including Logistic Regression. We split the data into training and test sets, evaluating performance on both.
5. **Model Evaluation**: The trained model's performance was evaluated using accuracy score, precision, recall, F1-score, and the confusion matrix.
6. **Prediction and Visualization**: After evaluation, the model was saved, reloaded, and used to predict the sentiment of individual tweets from the test set.

## Results

- **Accuracy**: The model achieved an overall accuracy of **78%** on the test data.
- **Confusion Matrix**: The model correctly predicted the sentiment for the majority of tweets but made some misclassifications (38,564 false positives and 32,476 false negatives).
  
  The confusion matrix for the model's predictions (rows: true labels, columns: predicted labels) was: `[[121436, 38564], [32476, 127524]]`
  
- **Classification Report**: The report displayed metrics such as precision, recall, and F1-score for both positive and negative sentiments. The similar F1-scores (77% for the negative class, 78% for the positive class) indicate that the model performed well without significant bias towards either class.

## Conclusion

This sentiment analysis model shows how machine learning and NLP can be applied to social media data. With an accuracy of 78%, it classifies most tweets correctly as positive or negative. However, there is still room for improvement, particularly in recall for the negative sentiment class. Future steps could include:

- **Model Tuning**: Tuning the model parameters and exploring more advanced algorithms like Random Forest, Naive Bayes, or deep learning models could further improve performance.
- **Handling Class Imbalance**: The test set used here is balanced (160,000 tweets per class), but live Twitter data may not be. Investigating methods for imbalanced data could help maintain recall for both positive and negative classes in that setting.
- **Real-Time Deployment**: The model could be integrated with a real-time Twitter feed to analyze the sentiment of incoming tweets.
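
The model-tuning step listed above could start with a small grid search over the regularization strength `C` of Logistic Regression. The sketch below is an illustration under assumptions, not a result from this project: it uses scikit-learn's `GridSearchCV` on synthetic data from `make_classification` as a stand-in for the TF-IDF features, and the grid values are arbitrary starting points.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF features and sentiment labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Candidate values for the inverse regularization strength (arbitrary choices).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training split, scored by macro-averaged F1.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["C"])
print("Cross-validated macro F1:", round(search.best_score_, 3))
# score() uses the same scoring as the search, so this is macro F1, not accuracy.
print("Held-out macro F1:", round(search.score(X_test, y_test), 3))
```

On the real dataset, the same search would be fitted on the training split only, leaving the test split for the final comparison against the 78% baseline.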

In conclusion, sentiment analysis has a wide range of applications in industries like marketing, public relations, and customer service. This project not only showcases the importance of data preprocessing and machine learning but also provides a solid foundation for building systems that can analyze large amounts of social media data to uncover public opinion on a variety of topics.

--- 

With that, we have completed the sentiment analysis model, successfully classifying tweets and gaining insights from social media data.

Thank you!!
