### Part 1: Text Collection and Loading
Tweets Dataset: https://www.kaggle.com/datasets/kazanova/sentiment140

In [1]:
import pandas as pd

# Load CSV file
dataset_path = "Dataset.csv"
df = pd.read_csv(dataset_path)

# Display the first 15 tweets
print(df["Tweet"].head(15))


0     @switchfoot http://twitpic.com/2y1zl - Awww, t...
1     is upset that he can't update his Facebook by ...
2     @Kenichan I dived many times for the ball. Man...
3       my whole body feels itchy and like its on fire 
4     @nationwideclass no, it's not behaving at all....
5                         @Kwesidei not the whole crew 
6                                           Need a hug 
7     @LOLTrish hey  long time no see! Yes.. Rains a...
8                  @Tatiana_K nope they didn't have it 
9                             @twittera que me muera ? 
10          spring break in plain city... it's snowing 
11                           I just re-pierced my ears 
12    @caregiving I couldn't bear to watch it.  And ...
13    @octolinz16 It it counts, idk why I did either...
14    @smarrison i would've been the first, but i di...
Name: Tweet, dtype: object


### Text Preprocessing

In [2]:
import nltk
# word_tokenize and stopwords below need the 'punkt' and 'stopwords' resources
nltk.download('punkt')
nltk.download('stopwords')

#### Tokenization
Tokenization is the process of splitting text into individual units called tokens. These tokens can be words, phrases, or sentences. In natural language processing (NLP), it typically involves dividing the text into words and punctuation marks.

In [4]:
from nltk.tokenize import word_tokenize

def tokenize(text):
    """
    Tokenizes the input text into words and punctuation marks.

    Args:
    text (str): The text to be tokenized.

    Returns:
    list: A list of word and punctuation tokens in the text.
    """
    # word_tokenize splits the text into sentences internally before
    # splitting each sentence into tokens, so a separate sent_tokenize
    # call is not needed
    return word_tokenize(text)

Tokenization helps in breaking down the text into manageable pieces, making it easier to analyze and process further. It transforms a large body of text into smaller, more understandable units, which can be counted, tagged, or classified. Proper tokenization is crucial as it sets the stage for the following preprocessing steps. Poor tokenization can lead to inaccurate analysis and reduced performance of NLP models.
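
To make the effect of tokenization choices concrete for tweets, here is a minimal sketch of a regex-based tokenizer that keeps mentions, hashtags, and URLs as single tokens. It is an illustration only, not the `tokenize` function above: the pattern and the name `simple_tweet_tokenize` are assumptions made for this example, the sample tweet is invented, and NLTK's own tokenizers handle many more cases.

```python
import re

# Hypothetical pattern for illustration; alternatives are tried left to right.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+        # URLs
    | [@#]\w+           # @mentions and #hashtags
    | \w+(?:'\w+)*      # words, including contractions
    | [^\w\s]           # any other single non-space character (punctuation)
    """,
    re.VERBOSE,
)

def simple_tweet_tokenize(text):
    """Split a tweet into tokens, keeping URLs, mentions and hashtags whole."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tweet_tokenize("@Kenichan I can't wait! http://example.com #fun")
print(tokens)
# ['@Kenichan', 'I', "can't", 'wait', '!', 'http://example.com', '#fun']
```

Whether mentions and URLs should survive as tokens at all depends on the analysis; for sentiment work they are often dropped in a later step.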

#### Stop Word Removal
Stop word removal involves eliminating common words that are often considered irrelevant for text analysis. Examples of stop words include "and", "the", "is", and "in".

In [5]:
from nltk.corpus import stopwords

# Build the set of English stop words once rather than on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(words):
    """
    Removes stop words from a list of words.

    Args:
    words (list): List of words from which stop words should be removed.

    Returns:
    list: List of words with stop words removed.
    """
    # Compare in lowercase so that capitalized stop words ("The", "Is") are removed too
    return [word for word in words if word.lower() not in STOP_WORDS]


Removing stop words reduces the amount of data to be processed, which can improve the efficiency of text analysis. This step helps in focusing on the more meaningful words that contribute to the content's overall meaning. However, it's essential to choose stop words carefully, as some common words might still carry significant contextual information depending on the analysis goals.
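
As a sketch of choosing stop words carefully, the example below removes stop words while keeping negations, which often matter for sentiment. The small `BASE_STOP_WORDS` set is a hand-written stand-in for illustration (not NLTK's list), and `NEGATIONS` and `remove_stopwords_keep_negations` are hypothetical names. With NLTK you could start from `set(stopwords.words('english'))` instead, which includes words such as "not" and "no".

```python
# Hand-written stand-in for a real stop word list (assumption for this sketch)
BASE_STOP_WORDS = {"i", "am", "is", "the", "a", "with", "this", "not", "no", "and"}

# Words we choose to keep because they flip the meaning of a sentence
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords_keep_negations(words, stop_words=BASE_STOP_WORDS, keep=NEGATIONS):
    """Remove stop words case-insensitively, but never drop words listed in `keep`."""
    effective = set(stop_words) - set(keep)
    return [w for w in words if w.lower() not in effective]

words = ["I", "am", "not", "happy", "with", "this", "phone"]
print(remove_stopwords_keep_negations(words))           # ['not', 'happy', 'phone']
print(remove_stopwords_keep_negations(words, keep=()))  # ['happy', 'phone']
```

With plain removal the two outputs would read as the same positive statement, which is exactly the kind of contextual loss to watch for in sentiment tasks.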

#### Stemming
Stemming is the process of reducing words to a base or root form by stripping suffixes with heuristic rules. For example, the Porter stemmer reduces "running" and "runs" to "run". Irregular forms such as "ran" are left unchanged, because no suffix rule connects them to the root.

In [6]:
from nltk.stem import PorterStemmer

def stem(words):
    """
    Applies stemming to a list of words, reducing them to their root forms.

    Args:
    words (list): A list of words to be stemmed.

    Returns:
    list: A list of stemmed words.
    """
    # Initialize the Porter stemmer
    stemmer = PorterStemmer()
    # Apply stemming to each word
    stemmed_words = [stemmer.stem(word) for word in words]
    return stemmed_words

Stemming helps in normalizing the text by converting different forms of a word to a common base form. This reduces the vocabulary size, which is beneficial for text analysis and retrieval. However, stemming can sometimes lead to over-reduction, where words with different meanings are reduced to the same stem, potentially causing confusion and loss of semantic meaning.
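
A short sketch of what the Porter stemmer returns for a few words, using NLTK's `PorterStemmer` as in the `stem` function above. The word list is chosen for illustration: it covers regular inflections, an irregular form, and two unrelated words that end up with the same stem.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words picked for illustration only
examples = ["running", "runs", "ran", "runner", "universe", "university"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in examples}

for word, stemmed in stems.items():
    print(f"{word:>10} -> {stemmed}")

# Words with different meanings that share a stem (over-stemming)
collision = stems["universe"] == stems["university"]
print("universe/university share a stem:", collision)
```

Regular inflections such as "running" and "runs" reduce to "run", while the irregular "ran" is left as is. "universe" and "university" collapse to the same stem, which is not itself a dictionary word.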

#### Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form, called lemmas, by considering the context in which they are used. Unlike stemming, lemmatization uses a vocabulary and morphological analysis to accurately convert words to their base forms.

In [29]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

# Download the required NLTK resources
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(nltk_pos_tag):
    if nltk_pos_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_pos_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_pos_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmatize(words):
    """
    Applies lemmatization to a list of words, reducing them to their base forms.

    Args:
    words (list): A list of words to be lemmatized.

    Returns:
    list: A list of lemmatized words.
    """
    # Initialize the WordNet lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Get POS tags for the words
    pos_tags = pos_tag(words)
    
    # Apply lemmatization to each word with its POS tag
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]
    
    return lemmatized_words


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\PMLS\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PMLS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatization generally reduces words more accurately than stemming because it takes the word's part of speech into account and returns dictionary forms rather than truncated stems. Its accuracy depends on the POS tags it receives: in the code above, any tag outside the adjective, verb, noun, and adverb groups falls back to noun. The trade-off is typically speed, since tagging and dictionary lookup cost more than suffix stripping.
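
The tag conversion in `get_wordnet_pos` can also be written as a lookup table. This sketch assumes Penn Treebank style tags, as produced by NLTK's default English `pos_tag`, and uses WordNet's single-letter POS codes directly (`'a'`, `'v'`, `'n'`, `'r'`, the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`), so it runs without downloading any corpus. The names `TAG_PREFIX_TO_WORDNET` and `to_wordnet_pos` are hypothetical.

```python
# First letter of a Penn Treebank tag -> WordNet POS code
TAG_PREFIX_TO_WORDNET = {
    "J": "a",  # adjectives (JJ, JJR, JJS)
    "V": "v",  # verbs (VB, VBD, VBG, ...)
    "N": "n",  # nouns (NN, NNS, NNP, ...)
    "R": "r",  # adverbs (RB, RBR, RBS)
}

def to_wordnet_pos(treebank_tag, default="n"):
    """Return the WordNet POS code for a Treebank tag, falling back to noun."""
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)

# Tagged tokens written by hand for illustration (not real tagger output)
tagged = [("cats", "NNS"), ("were", "VBD"), ("running", "VBG"),
          ("quickly", "RB"), ("the", "DT")]
converted = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(converted)
# [('cats', 'n'), ('were', 'v'), ('running', 'v'), ('quickly', 'r'), ('the', 'n')]
```

The resulting code can be passed as the `pos` argument of `WordNetLemmatizer.lemmatize` in place of the `wordnet` constants.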

#### Pipeline

In [31]:
def preprocess_tweet(tweet):
    """
    Applies a series of preprocessing steps to a tweet.
    
    Input:
    - tweet (str): The tweet text to be preprocessed.
    
    Output:
    - preprocessed (str): Preprocessed text.
    """
    # Tokenization
    words = tokenize(tweet)
    
    # Stop Word Removal
    words_no_stopwords = remove_stopwords(words)
    
    # Normalization: stemming and lemmatization are alternatives, and this
    # pipeline keeps the lemmatized tokens, so stem() is not called here
    lemmatized_words = lemmatize(words_no_stopwords)
    
    # Join the tokens back into a single string
    return ' '.join(lemmatized_words)

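# Apply preprocessing to the first 100 tweets only; assignment aligns on the
# index, so rows beyond the first 100 are left as NaN in 'Preprocessed'
df['Preprocessed'] = df['Tweet'][:100].apply(preprocess_tweet)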
# print(df['Tweet'][:100].apply(preprocess_tweet))
# print(lemmatize(['behaving']))

# print(df.head())
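
The same idea can be expressed as an ordered list of steps, which makes it easy to add, remove, or reorder stages. This is a minimal sketch rather than the notebook's pipeline: `run_pipeline` is a hypothetical helper, and the two toy steps stand in for `remove_stopwords` and `lemmatize` so that the example runs without NLTK data.

```python
def run_pipeline(text, tokenizer, steps):
    """Tokenize `text`, pass the token list through each step in order, and re-join."""
    tokens = tokenizer(text)
    for step in steps:
        tokens = step(tokens)
    return " ".join(tokens)

# Toy stand-ins for the real steps (assumptions for this sketch)
def drop_short_words(tokens):
    """Keep only tokens longer than two characters."""
    return [t for t in tokens if len(t) > 2]

def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

result = run_pipeline("Need a hug in Plain City", str.split, [drop_short_words, lowercase])
print(result)  # need hug plain city
```

Each step takes and returns a list of tokens, so functions with that shape, such as `remove_stopwords` or `lemmatize`, could be passed in the `steps` list directly.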
