# Natural Language Processing

<img align='center' width='600' src="https://images.gr-assets.com/misc/1535611813-1535611813_goodreads_misc.gif">

In [1]:
import nltk
import spacy
import pandas as pd
import numpy as np
import re
import random

## **Work with Kaggle**

- First, go to your Kaggle profile and create an API token; this downloads `kaggle.json` to your PC.
- Drag and drop the JSON file into the Files tab in Colab.
- Run the commands below on **Google Colab**:
```
    !mkdir /root/.kaggle
    !mv kaggle.json /root/.kaggle/kaggle.json
    !chmod 600 /root/.kaggle/kaggle.json
    !kaggle datasets list
```

<br>

- Or, if you are on your **local system**, save `kaggle.json` to your desktop and run the commands below in a Jupyter notebook:

```
    !mkdir /Users/mhd/.kaggle/
    !cp ~/Desktop/kaggle.json /Users/mhd/.kaggle/
    !chmod 600 /Users/mhd/.kaggle/kaggle.json
```
<br>

- To download a dataset to Colab, open the desired competition or dataset page, click the `three dots` menu in the upper-right corner, then choose `Copy API command`.

- For more information, [click here](https://www.kaggle.com/discussions/general/74235).
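
The shell commands above can also be expressed in Python, which avoids hard-coding a user-specific home directory. This is a minimal sketch, not part of the Kaggle tooling: the helper name `install_kaggle_token` and its `kaggle_dir` parameter are hypothetical, and the default `~/.kaggle` location is taken from the paths used above.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, kaggle_dir=None):
    """Copy kaggle.json into the Kaggle config directory with owner-only permissions."""
    token_path = Path(token_path).expanduser()
    # Assumption: the config directory is ~/.kaggle unless another one is given.
    kaggle_dir = Path(kaggle_dir).expanduser() if kaggle_dir else Path.home() / ".kaggle"

    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(token_path, target)            # like `cp`
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like `chmod 600`
    return target
```

A call such as `install_kaggle_token("~/Desktop/kaggle.json")` would then mirror the local-system commands.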

## **Spacy**

In [None]:
# !rm -r /root/.kaggle
!mkdir /Users/mhd/.kaggle/
!cp ~/Desktop/kaggle.json /Users/mhd/.kaggle/
!chmod 600 /Users/mhd/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d venky73/spam-mails-dataset
!unzip spam-mails-dataset.zip

### **Pandas library to read and edit csv files**

Our dataset consists of a text column and a label column containing "ham" or "spam". Since machine learning libraries expect numeric input, labels stored as strings need to be converted to numbers.

The text column also includes raw text, which needs to be handled.

In [2]:
spam_df = pd.read_csv('spam_ham_dataset.csv', index_col=0)
spam_df.head()

Unnamed: 0,label,text,label_num
605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
4685,spam,"Subject: photoshop , windows , office . cheap ...",1
2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


### **Preprocessing**

#### **Handle labels column**

In [3]:
# If the labels are only available as strings, the code below creates a new integer column from them.
# This is for demonstration purposes only.
df_new = spam_df[['text', 'label']].copy()  # copy so the column assignment below does not write to a view of spam_df
df_new.head(3)

Unnamed: 0,text,label
605,Subject: enron methanol ; meter # : 988291\r\n...,ham
2349,"Subject: hpl nom for january 9 , 2001\r\n( see...",ham
3624,"Subject: neon retreat\r\nho ho ho , we ' re ar...",ham


In [4]:
df_new['label_num'] = (df_new['label'] == 'spam').astype(int)
df_new.head()

Unnamed: 0,text,label,label_num
605,Subject: enron methanol ; meter # : 988291\r\n...,ham,0
2349,"Subject: hpl nom for january 9 , 2001\r\n( see...",ham,0
3624,"Subject: neon retreat\r\nho ho ho , we ' re ar...",ham,0
4685,"Subject: photoshop , windows , office . cheap ...",spam,1
2030,Subject: re : indian springs\r\nthis deal is t...,ham,0


#### **Handle text column**

The text column has a lot of stop words and punctuation. First, we remove them. After that, it's time to reduce the words to their roots using lemmatization.


In [5]:
spam_df = pd.read_csv('spam_ham_dataset.csv')
print('Number of rows: ', len(spam_df))
spam_df.head(3)

Number of rows:  5171


Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0


In [6]:
# Load spaCy's small English pipeline, used below to clean the input text
nlp = spacy.load("en_core_web_sm")

def remove_stops(text):
    """
    Remove stopwords, punctuation, and digits from the input text.

    Args:
    text (str): The input text to process.

    Returns:
    str: The processed text with stopwords, punctuation, and digits removed.
    """

    # Normalize whitespace in the text (raw string avoids an invalid-escape warning for \s)
    text = re.sub(r'\s+', ' ', text)

    # Process the text with the spaCy pipeline loaded above
    doc = nlp(text)

    # Extract lemmatized tokens that are not stopwords, punctuation, or digits
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and not token.is_digit]

    # Join the tokens into a single string and return
    return ' '.join(tokens)


# use small fraction of data to check the function
df = spam_df.loc[:2, ('text', 'label_num')]
df['text_lemma_without_stops'] = df['text'].apply(remove_stops)
df

Unnamed: 0,text,label_num,text_lemma_without_stops
0,Subject: enron methanol ; meter # : 988291\r\n...,0,subject enron methanol meter follow note give ...
1,"Subject: hpl nom for january 9 , 2001\r\n( see...",0,subject hpl nom january attached file hplnol x...
2,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0,subject neon retreat ho ho ho wonderful time y...


In [7]:
# spam_df['text_lemma_without_stops'] = spam_df['text'].apply(remove_stops)
# spam_df.to_csv('spam_ham_dataset_lemma.csv')
# spam_df.head(5)

In [10]:
spam_df = pd.read_csv("spam_ham_dataset_lemma.csv", index_col=0)
spam_df.head(5)

Unnamed: 0.1,Unnamed: 0,label,text,label_num,text_lemma_without_stops
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0,subject enron methanol meter follow note give ...
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0,subject hpl nom january attached file hplnol x...
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0,subject neon retreat ho ho ho wonderful time y...
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1,subject photoshop windows office cheap main tr...
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0,subject indian spring deal book teco pvr reven...


#### **Bag of Words (BoW)**

The Bag-of-Words (BoW) model is a simple yet powerful technique used for text representation in Natural Language Processing (NLP) tasks. In this model, a document (or a piece of text) is represented as a bag (multiset) of words, disregarding grammar and word order but keeping multiplicity. The basic idea is to create a vocabulary of unique words from the entire corpus, and then represent each document as a vector where each dimension corresponds to a word in the vocabulary, and the value represents the frequency of that word in the document.

**CountVectorizer** is a tool for converting a collection of text documents into a matrix of token counts. It essentially converts text data into numerical features that can be used for machine learning algorithms.

- **Tokenization**: It breaks down each document into individual words or tokens. It can also handle n-grams, which are sequences of n tokens.
- **Vocabulary Building**: It constructs a vocabulary of all unique tokens across the entire corpus of documents. Each unique token becomes a feature.
- **Counting**: It counts the occurrences of each token in each document. This count becomes the value of the corresponding feature in the matrix.
- **Sparse Matrix**: The output is typically a sparse matrix where each row represents a document, each column represents a token in the vocabulary, and each cell represents the count of the corresponding token in the document. Since most documents will only contain a small subset of all possible tokens, the matrix is sparse, meaning that most of its entries are zeros.
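
The four steps above can be sketched in plain Python to make the idea concrete. This is only an illustration under simplifying assumptions, not scikit-learn's implementation: the helper names `tokenize` and `bag_of_words` are hypothetical, the token pattern only roughly imitates the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters), and each row is stored as a dictionary of non-zero cells rather than a real sparse matrix.

```python
import re
from collections import Counter


def tokenize(text):
    # Tokenization: lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, rows); each row maps a column index to a count."""
    tokenized = [tokenize(doc) for doc in documents]

    # Vocabulary building: one column per unique token, in sorted order.
    unique_tokens = sorted({tok for doc in tokenized for tok in doc})
    vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

    # Counting: keep only non-zero cells, which is the idea behind a sparse matrix.
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append({vocabulary[tok]: n for tok, n in counts.items()})
    return vocabulary, rows


docs = ["free money free prizes", "meeting at noon", "free meeting"]  # toy corpus
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
print(rows)
```

Each document only touches a few of the vocabulary columns, which is why the full matrix is mostly zeros once the corpus grows.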

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

X = spam_df['text']
X_lemma = spam_df['text_lemma_without_stops']
y = spam_df['label_num']

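# The vectorizer is fitted on the whole corpus here, so test-set vocabulary shapes the features;
# the Pipeline-based train_model() further down fits it on the training split only.
transformer = CountVectorizer()
X_cv = transformer.fit_transform(X)
# Dense copy for inspection only; it is not used below and is memory-hungry for a vocabulary this large
X_np = X_cv.toarray()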

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_cv, y, test_size=0.8, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1034, 50447), (4137, 50447), (1034,), (4137,))

In [13]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)

pred = model.predict(X_test)

In [14]:
from sklearn.metrics import accuracy_score, classification_report

acc = accuracy_score(y_test, pred) * 100
print(f'Model accuracy is {acc:.2f}')

Model accuracy is 96.66


In [15]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98      2963
           1       0.99      0.90      0.94      1174

    accuracy                           0.97      4137
   macro avg       0.97      0.95      0.96      4137
weighted avg       0.97      0.97      0.97      4137



In [16]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

def train_model(model, feature_extraction, X, y):
    """
    Train a text classification model using a pipeline consisting of a feature extraction method and a model for classification.

    Args:
    model (estimator): The scikit-learn classifier instance to use for classification.
    feature_extraction (transformer): The feature extraction instance to use, typically CountVectorizer() or TfidfVectorizer().
    X (array-like): Input data containing text documents.
    y (array-like): Target labels for the input data.

    Returns:
    str: Classification report showing precision, recall, F1-score, and support.
    """

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)

    
    # Define a pipeline consisting of a feature extraction method and model classifier
    clf = Pipeline([
        ('transformer', feature_extraction),
        ('Model', model)
    ])
    
    # Fit the pipeline on the training data
    clf.fit(X_train, y_train)
    
    # Make predictions on the test data
    pred = clf.predict(X_test)
    
    # Generate a classification report
    report = classification_report(y_test, pred)

    # Print the name of the model used
    print(f'''Your model is: "{str(model).split('(')[0]}"''', end='\n\n')
    
    return report

In [17]:
# Do the same thing with fewer lines of code
print(train_model(MultinomialNB(), CountVectorizer(), X, y))

Your model is: "MultinomialNB"

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      2963
           1       0.97      0.93      0.95      1174

    accuracy                           0.97      4137
   macro avg       0.97      0.96      0.96      4137
weighted avg       0.97      0.97      0.97      4137



In [18]:
# Do the same thing with lemmatized text
print(train_model(MultinomialNB(), CountVectorizer(), X_lemma, y))

Your model is: "MultinomialNB"

              precision    recall  f1-score   support

           0       0.97      0.98      0.98      2963
           1       0.95      0.94      0.94      1174

    accuracy                           0.97      4137
   macro avg       0.96      0.96      0.96      4137
weighted avg       0.97      0.97      0.97      4137



In [19]:
# Do the same thing with lemmatized text with different n_gram range(1 and 2)
print(train_model(MultinomialNB(), CountVectorizer(ngram_range=(1, 2)), X_lemma, y))

Your model is: "MultinomialNB"

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      2963
           1       0.97      0.92      0.94      1174

    accuracy                           0.97      4137
   macro avg       0.97      0.95      0.96      4137
weighted avg       0.97      0.97      0.97      4137



## **NLTK**

The `movie_reviews` corpus in NLTK is a dataset commonly used for sentiment analysis and text classification tasks. It consists of reviews of movies categorized into *positive* and *negative* sentiments.

The `movie_reviews` corpus is organized in the following way:

1. **File Structure**:
    - The corpus consists of two directories: pos and neg.
    - The `pos` directory contains text files with positive movie reviews.
    - The `neg` directory contains text files with negative movie reviews.
2. **File Content**:
    - Each text file corresponds to a single movie review.
    - The content of each file is the text of the movie review.
3. **Categorization:**
    - The reviews are categorized based on sentiment.
    - Positive reviews are stored in the pos directory.
    - Negative reviews are stored in the neg directory.
4. **Balanced Dataset**:
    - The dataset is balanced, meaning that it contains an equal number of positive and negative reviews.
    - Each directory (pos and neg) contains an equal number of text files.
5. **Usage:**
    - Researchers and developers often use this corpus for training and testing machine learning models for sentiment analysis or text classification tasks.
    - It provides a standardized dataset for evaluating the performance of different algorithms and techniques in sentiment analysis.

In [20]:
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords

print("Number of words in movie review corpus: ", len(mr.words()))
print("Categories: ", mr.categories())

Number of words in movie review corpus:  1583820
Categories:  ['neg', 'pos']


In [21]:
# Load the stop-word list and remove 'not' from it so negations are kept
stops = stopwords.words('english')
stops.remove('not')

# Pair each review's file ID with its category (the review text is read via mr.words(file_id))
docs = [(file, category) for category in mr.categories() for file in mr.fileids(category)]

In [22]:
import re
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

all_words = []
lemma_words_without_stops = []

# remove punctuation and stop words
tokenizer = RegexpTokenizer(r'\w+') # accept only word characters
for w in mr.words():
    w = w.lower()
    all_words.append(w)
    # keep the lemma only if the token is not a stop word and contains word characters
    if w not in stops and tokenizer.tokenize(w):
        lemma_words_without_stops.append(lemmatizer.lemmatize(w))


len(all_words), len(lemma_words_without_stops)

(1583820, 714149)

In [23]:
lemma_words_without_stops = nltk.FreqDist(lemma_words_without_stops)
all_words = nltk.FreqDist(all_words)

In [24]:
# Check how often a given token occurs in each distribution
print("number of ',' repetitions in all_words: ",all_words[','])
print("number of ',' repetitions in words_without_stops: ", lemma_words_without_stops[','])

number of ',' repetitions in all_words:  77717
number of ',' repetitions in words_without_stops:  0


In [25]:
# use the first 4000 words of the frequency distribution as the feature vector
feature_vector = list(lemma_words_without_stops)[:4000]

### **Create feature extraction function**

In [26]:
def feature_extract(words, feature_vector):
    """
    Extract features from a list of words based on a feature vector.

    Args:
    words (list): List of words to extract features from.
    feature_vector (list): Feature vector containing the features to be extracted.

    Returns:
    dict: A dictionary indicating whether each feature in the feature vector is present in the words.
    """

    # Convert words to a set for faster membership checking
    words = set(words)
    
    # Initialize an empty dictionary to store features
    feature = {}

    # Check if each feature in the feature vector is present in the words
    for x in feature_vector:
        feature[x] = x in words

    return feature
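
The function expects an iterable of word tokens. Because it calls `set(words)`, a plain string is split into single characters instead of words, so it matters whether a tokenized review or a raw string (for example a file ID) is passed in. The snippet below is a small sketch of that difference; the toy feature vector, the sample review, and the file ID string are all hypothetical.

```python
def feature_extract(words, feature_vector):
    # Same logic as the function above, repeated so the snippet runs on its own.
    words = set(words)
    return {x: x in words for x in feature_vector}


feature_vector = ["film", "plot", "bad", "7"]  # toy feature vector
review_words = ["the", "plot", "was", "bad"]   # hypothetical tokenized review
file_id = "neg/cv007_0000.txt"                 # hypothetical file ID string

# A list of tokens is matched word by word.
from_words = feature_extract(review_words, feature_vector)

# A string is iterated character by character, so only one-character features can match.
from_string = feature_extract(file_id, feature_vector)

print(from_words)
print(from_string)
```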

In [27]:
# Check if the function works 
rev = feature_extract(mr.words(docs[0][0]), feature_vector)
for k, v in rev.items():
    if v is True:
        print(k)

film
movie
one
not
character
like
get
make
even
good
would
also
well
life
two
see
way
go
plot
really
little
know
people
bad
director
new
look
find
audience
back
give
big
world
still
want
seems
every
part
going
point
actually
although
ever
since
line
problem
away
watch
might
start
bit
making
american
kind
always
seem
trying
sense
half
need
idea
pretty
sure
mind
given
horror
attempt
head
music
got
ending
10
completely
2
different
simply
mean
dead
lost
entire
someone
main
review
final
playing
despite
video
production
running
entertaining
feeling
throughout
deal
genre
others
five
flick
member
coming
break
girlfriend
guess
obviously
taken
secret
3
happen
oh
giving
studio
seemed
street
chase
apparently
cool
party
ago
teen
us
4
strange
came
mess
took
decent
overall
drive
showing
biggest
fantasy
8
concept
beauty
whatever
somewhere
hot
witch
generation
okay
write
7
accident
edge
normal
sad
decided
generally
rarely
as
weird
20
stick
blair
bottom
confusing
sitting
continues
slasher
nightmare
plai

In [28]:
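# apply feature extraction function on documents
# NOTE: each entry of `docs` is (file_id, category), so `rev` is a path string that starts with
# 'neg/' or 'pos/' and feature_extract() only sees its individual characters, not the review's words.
# To build features from the review text, pass mr.words(rev) instead of rev.
features = [(feature_extract(rev, feature_vector), category) for (rev, category) in docs]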

In [29]:
# Split dataset as train and test sets
from sklearn import model_selection

# Using Sklearn
train_set, test_set = model_selection.train_test_split(features, test_size=0.2)

# Custom
# split = int(len(features) * 0.8)
# train_set = features[:split]
# test_set = features[split:]

In [30]:
# Use nltk Naive bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

accuracy = nltk.classify.accuracy(classifier, test_set) * 100
print(f"NaiveBayesClassifier accuracy : {accuracy}")

NaiveBayesClassifier accuracy : 100.0


In [31]:
classifier.show_most_informative_features(15)

Most Informative Features
                       1 = False             neg : pos    =      1.2 : 1.0
                       2 = False             pos : neg    =      1.1 : 1.0
                       7 = False             pos : neg    =      1.1 : 1.0
                       7 = True              neg : pos    =      1.1 : 1.0
                       2 = True              neg : pos    =      1.1 : 1.0
                       1 = True              pos : neg    =      1.1 : 1.0
                       5 = False             neg : pos    =      1.1 : 1.0
                       5 = True              pos : neg    =      1.0 : 1.0
                       4 = False             neg : pos    =      1.0 : 1.0
                       4 = True              pos : neg    =      1.0 : 1.0
                       0 = False             pos : neg    =      1.0 : 1.0
                       0 = True              neg : pos    =      1.0 : 1.0
                       3 = False             pos : neg    =      1.0 : 1.0

### **Create train sklearn models function**

In [32]:
from nltk.classify.scikitlearn import SklearnClassifier

def train_nltk(my_model, train_set, test_set):
    """
    Train an NLTK classifier using a provided model and evaluate its accuracy on a test set.

    Args:
    my_model: A scikit-learn estimator to be wrapped and trained.
    train_set: Training set for the classifier.
    test_set: Test set for evaluating the classifier.

    Returns:
    float: Accuracy of the trained model on the test set.
    """

    # Wrap the scikit-learn estimator in NLTK's SklearnClassifier so it accepts NLTK-style feature sets
    model = SklearnClassifier(my_model)

    # Train the model using the provided training set
    model.train(train_set)

    # Extract and print the name of the model (extracted from its string representation)
    print(f"Your model is:  {str(my_model).split('(')[0]}")

    # Calculate accuracy of the trained model on the test set
    accuracy = nltk.classify.accuracy(model, test_set) * 100
    
    return accuracy


In [33]:
from sklearn.svm import SVC

accuracy = train_nltk(SVC(kernel='linear'), train_set, test_set)
print(f"Accuracy: {accuracy}")

Your model is:  SVC
Accuracy: 100.0


In [34]:
from sklearn.linear_model import LogisticRegression, SGDClassifier

accuracy = train_nltk(LogisticRegression(), train_set, test_set)
print(f"Accuracy: {accuracy}")

Your model is:  LogisticRegression
Accuracy: 100.0


In [35]:
accuracy = train_nltk(SGDClassifier(), train_set, test_set)
print(f"Accuracy: {accuracy}")

Your model is:  SGDClassifier
Accuracy: 100.0


In [36]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

accuracy = train_nltk(MultinomialNB(), train_set, test_set)
print(f"Accuracy: {accuracy}")

Your model is:  MultinomialNB
Accuracy: 100.0


In [37]:
accuracy = train_nltk(BernoulliNB(), train_set, test_set)
print(f"Accuracy: {accuracy}")

Your model is:  BernoulliNB
Accuracy: 100.0


### **Create preprocess function**

which does all the work for us

In [50]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews as mr
from sklearn import model_selection
from tqdm.auto import tqdm

def preprocess_nltk(docs, n=2):
    """
    Preprocesses text data using NLTK tools, including tokenization, lemmatization,
    removal of stop words and punctuation, and generation of n-grams. Prepares a feature set
    for classification tasks using NLTK movie reviews dataset.

    Args:
    docs (list): List of (document, category) tuples; this notebook passes movie_reviews file IDs as the documents.
    n (int, optional): Size of n-grams to generate (default is 2).

    Returns:
    tuple: A tuple containing the training and testing sets for classification tasks.
    """

    # Initialize WordNetLemmatizer for lemmatization
    lemmatizer = WordNetLemmatizer()

    # Initialize an empty list to store words after preprocessing
    lemma_words_without_stops = []

    # Tokenize words and preprocess each word
    tokenizer = RegexpTokenizer(r'\w+')  # Accept only words
    stops = set(stopwords.words('english'))

    # Tokenize words from the movie_reviews corpus, remove punctuations and stop words
    words = [tokenizer.tokenize(w)[0] for w in mr.words() if tokenizer.tokenize(w)]
    for w in tqdm(words):
        # Lemmatize words and remove stop words
        lemma_words_without_stops.append(lemmatizer.lemmatize(w.lower())) if (w not in stops) else None

    # Generate n-grams from the preprocessed words
    if n > 1:
        feature_ngram = []
        for gram in range(2, n + 1):
            print(f"Add ngram={gram} to the data")
            # Join each n-gram tuple into one space-separated string; the list is rebuilt
            # for every n so smaller n-grams are not counted again
            ngrams_list = [' '.join(w) for w in nltk.ngrams(lemma_words_without_stops, gram)]

            # Compute the frequency distribution of these n-grams and keep the first 1000
            ngrams_freq = nltk.FreqDist(ngrams_list)
            feature_ngram += list(ngrams_freq)[:1000]

    # Compute the frequency distribution of lemmatized words
    lemma_words_without_stops = nltk.FreqDist(lemma_words_without_stops)

    # Select the top 3000 most common lemmatized words and n-grams as the feature vector
    feature_vector = list(lemma_words_without_stops)[:3000]
    feature_vector += feature_ngram if n > 1 else []

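    # Prepare the feature set for classification
    # NOTE: if `docs` holds file IDs, `rev` is a path string and feature_extract() only sees its
    # characters; pass mr.words(rev) instead to build features from the review text.
    features = [(feature_extract(rev, feature_vector), category) for (rev, category) in docs]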

    # Split the feature set into training and testing sets
    train_set, test_set = model_selection.train_test_split(features, test_size=0.2, random_state=1)

    return train_set, test_set
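
One detail worth illustrating: the n-gram features are space-joined strings such as `'not good'`, while `feature_extract` tests membership against a set of single words. An n-gram feature can therefore only be present if the document side is expanded into the same n-gram strings first. The following is a minimal sketch of that expansion; `ngram_strings` and `document_terms` are hypothetical helper names, and the tokens are a toy example rather than corpus data.

```python
def ngram_strings(tokens, n):
    """Return the space-joined n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_terms(tokens, max_n):
    """Unigrams plus all n-grams up to max_n, as a set for fast lookup."""
    terms = set(tokens)
    for n in range(2, max_n + 1):
        terms.update(ngram_strings(tokens, n))
    return terms


tokens = ["not", "good", "movie"]  # toy, already-preprocessed review
terms = document_terms(tokens, max_n=2)

feature_vector = ["good", "not good", "good plot"]  # toy mix of unigram and bigram features
features = {f: f in terms for f in feature_vector}
print(features)
```

Passing `document_terms(...)` instead of the raw word list to a membership check like the one in `feature_extract` is one way to let bigram and trigram features contribute.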

In [52]:
docs = [(file, category) for category in mr.categories() for file in mr.fileids(category)]
train, test = preprocess_nltk(docs, n=1)

  0%|          | 0/1336782 [00:00<?, ?it/s]

In [53]:
# Use nltk Naive bayes classifier ngrams equal 1
classifier = nltk.NaiveBayesClassifier.train(train)

accuracy = nltk.classify.accuracy(classifier, test) * 100
print(f"NaiveBayesClassifier accuracy : {accuracy}")

NaiveBayesClassifier accuracy : 100.0


In [54]:
classifier.show_most_informative_features(15)

Most Informative Features
                       1 = False             neg : pos    =      1.1 : 1.0
                       2 = False             pos : neg    =      1.1 : 1.0
                       7 = False             pos : neg    =      1.1 : 1.0
                       7 = True              neg : pos    =      1.1 : 1.0
                       2 = True              neg : pos    =      1.1 : 1.0
                       5 = False             neg : pos    =      1.0 : 1.0
                       1 = True              pos : neg    =      1.0 : 1.0
                       5 = True              pos : neg    =      1.0 : 1.0
                       4 = False             pos : neg    =      1.0 : 1.0
                       4 = True              neg : pos    =      1.0 : 1.0
                       0 = False             neg : pos    =      1.0 : 1.0
                       0 = True              pos : neg    =      1.0 : 1.0
                       6 = False             pos : neg    =      1.0 : 1.0

In [55]:
# ngrams equal 2
train, test = preprocess_nltk(docs, n=2)

classifier = nltk.NaiveBayesClassifier.train(train)

accuracy = nltk.classify.accuracy(classifier, test) * 100
print(f"NaiveBayesClassifier accuracy : {accuracy}")

  0%|          | 0/1336782 [00:00<?, ?it/s]

Add ngram=2 to the data
NaiveBayesClassifier accuracy : 100.0


In [56]:
# ngrams equal 3
train, test = preprocess_nltk(docs, n=3)

classifier = nltk.NaiveBayesClassifier.train(train)

accuracy = nltk.classify.accuracy(classifier, test) * 100
print(f"NaiveBayesClassifier accuracy : {accuracy}")

  0%|          | 0/1336782 [00:00<?, ?it/s]

Add ngram=2 to the data
Add ngram=3 to the data
NaiveBayesClassifier accuracy : 100.0


In [57]:
# Use sklearn model
accuracy = train_nltk(MultinomialNB(), train, test)
print(f"Accuracy: {accuracy}")

Your model is:  MultinomialNB
Accuracy: 100.0
