### Step 2: Load Data

In [14]:
import pandas as pd
# Load the spam.csv file
file_path = "spam.csv"  # Replace with the correct path to your file if necessary
data = pd.read_csv(file_path, encoding='WINDOWS-1252')
data = data.rename(columns={data.columns[0]: 'Label', data.columns[1]: 'Message'})
data = data[['Label', 'Message']]
# Display the first few rows of the dataset
data.head()

|   | Label | Message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Step 3: Compute Basic Statistics

In [15]:
# Compute Basic Statistics
from collections import Counter

total_messages = len(data)
spam_count = data['Label'].value_counts().get('spam', 0)
total_word_count = data['Message'].str.split().str.len().sum()
average_word_count = data['Message'].str.split().str.len().mean()

all_words = Counter(' '.join(data['Message']).split())
most_common_words = all_words.most_common(5)
num_rare_words = sum(1 for count in all_words.values() if count == 1)

print(f"Total number of messages: {total_messages}")
print(f"Number of spam messages: {spam_count}")
print(f"Total word count: {total_word_count}")
print(f"Average number of words per message: {average_word_count:.2f}")
print("Five most frequent words:")
for word, freq in most_common_words:
    print(f"{word}: {freq}")
print(f"Number of rare words: {num_rare_words}")

Total number of messages: 5572
Number of spam messages: 747
Total word count: 86335
Average number of words per message: 15.49
Five most frequent words:
to: 2134
you: 1622
I: 1466
a: 1327
the: 1197
Number of rare words: 9268
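
The counts above are case-sensitive and split only on whitespace, which is why `I` gets its own entry and punctuation stays attached to words. The following minimal sketch shows how lowercasing changes the tallies. The `messages` list is a hand-made stand-in for `data['Message']`, not real dataset rows, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical stand-in for data['Message']; not real dataset rows
messages = ["I think I can", "i can try", "Can you try?"]

# Case-sensitive whitespace tokens, as in the cell above
raw_counts = Counter(" ".join(messages).split())

# Same tokens after lowercasing
lower_counts = Counter(" ".join(messages).lower().split())

# Words that occur exactly once ("rare" in the sense used above)
rare_raw = sum(1 for c in raw_counts.values() if c == 1)
rare_lower = sum(1 for c in lower_counts.values() if c == 1)

print(raw_counts.most_common(3))
print(lower_counts.most_common(3))
print(rare_raw, rare_lower)
```

Note that `try` and `try?` remain separate entries in both tallies, because whitespace splitting never detaches punctuation.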


### Step 4: Tokenization


In [16]:
import pandas as pd
from nltk.tokenize import word_tokenize
import spacy

# Load the dataset
file_path = 'spam.csv'  # Replace with the actual file path
data = pd.read_csv(file_path, encoding='ISO-8859-1')
data.columns = ['Label', 'Message', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
data = data[['Label', 'Message']]

# Load SpaCy model
spacy_model = spacy.load('en_core_web_sm')

# NLTK Tokenization
data['NLTK_Tokens'] = data['Message'].apply(lambda text: word_tokenize(text))

# SpaCy Tokenization
data['SpaCy_Tokens'] = data['Message'].apply(lambda text: [token.text for token in spacy_model(text)])

# Compare and find differences between NLTK and SpaCy tokenization results
data['Token_Differences'] = data.apply(
    lambda row: list(set(row['NLTK_Tokens']).difference(row['SpaCy_Tokens'])), axis=1
)

# Remove rows where 'Message' is empty
data = data[data['Message'].str.strip() != ""]

# Print a few examples
print("Example of Original Messages:")
print(data['Message'].head(5))
print("\nExample of NLTK Tokenization:")
print(data['NLTK_Tokens'].head(5))
print("\nExample of SpaCy Tokenization:")
print(data['SpaCy_Tokens'].head(5))
print("\nList of Tokens in NLTK Results but not in SpaCy Results:")
print(data['Token_Differences'].head(5))


Example of Original Messages:
0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Message, dtype: object

Example of NLTK Tokenization:
0    [Go, until, jurong, point, ,, crazy, .., Avail...
1             [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, ..., U, c, alrea...
4    [Nah, I, do, n't, think, he, goes, to, usf, ,,...
Name: NLTK_Tokens, dtype: object

Example of SpaCy Tokenization:
0    [Go, until, jurong, point, ,, crazy, .., Avail...
1             [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, ..., U, c, alrea...
4    [Nah, I, do, n't, think, he, goes, to, usf, ,,...
Name: SpaCy_Tokens, 

# Observed Output Differences: NLTK vs. SpaCy Tokenization

## Differences in Tokens:
- **NLTK-Specific Tokens**: `['(', '&', 'rate', ')', 'C', 'question', 'std', 'T']`
  - These tokens are present in NLTK's output but not in SpaCy's.
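- **SpaCy-Specific Tokens**: Not listed, because `Token_Differences` is a one-way set difference (NLTK minus SpaCy). SpaCy does not drop characters such as `&`; the NLTK-only list instead suggests that SpaCy kept strings like `T&C` together as larger tokens.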

---

## Detailed Analysis of Differences

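### 1. Handling of Symbols and Abbreviations:
- **NLTK**: Splits embedded symbols into separate tokens, so `T&C` becomes `T`, `&`, `C`, and unspaced parentheses are detached from the neighbouring words.
- **SpaCy**: Appears to keep such strings together when no whitespace separates them, which would explain why `T`, `&`, `C`, `(`, and `)` show up only on the NLTK side. SpaCy's tokenizer is non-destructive, so no characters are discarded.

### 2. Word Contractions:
- For the sample rows, both libraries split contractions the same way, e.g., `don't` → `['do', "n't"]`.
- `Melle` (row 7) is a name, not a contraction, so it does not illustrate a contraction difference.

### 3. Numeric Tokens:
- Both tokenize plain numbers similarly. They may differ where digits are attached to symbols or letters (e.g., `16+`); the truncated output does not show the exact splits, so inspect those rows directly.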

---

## Key Observations:
1. **NLTK**:
   - `word_tokenize` applies Treebank-style rules and splits punctuation and symbols aggressively.
   - Suitable when symbols should be available as separate tokens.

2. **SpaCy**:
   - Uses rule-based prefix, suffix, infix, and exception handling, and keeps every token aligned with the original text.
   - Convenient when tokenization feeds later SpaCy pipeline components (tagging, lemmatization, parsing).
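
A set difference discards token order and repeated tokens, so it cannot show which tokens correspond to each other. One alternative is to align the two token sequences with the standard library's `difflib.SequenceMatcher`. The token lists below are hand-written for illustration and are not copied from the notebook output; `token_diff` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

# Hand-written token lists for illustration only; they are not
# copied from the notebook output
nltk_tokens = ["T", "&", "C", "'s", "apply"]
spacy_tokens = ["T&C", "'s", "apply"]

def token_diff(a, b):
    """Return (tag, a_slice, b_slice) for every non-matching region."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

diffs = token_diff(nltk_tokens, spacy_tokens)
for tag, left, right in diffs:
    print(tag, left, "<->", right)
```

Applied per row, this kind of alignment reports differences in both directions, whereas `Token_Differences` only lists tokens that appear on the NLTK side.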

---


### Step 5: Lemmatization


In [17]:
# Perform lemmatization using NLTK and SpaCy libraries
import spacy
from nltk.stem import WordNetLemmatizer

# Load SpaCy model
spacy_model = spacy.load('en_core_web_sm')

# Initialize NLTK's lemmatizer
nltk_lemmatizer = WordNetLemmatizer()

# NLTK Lemmatization
data['NLTK_Lemmas'] = data['Message'].apply(
    lambda text: [nltk_lemmatizer.lemmatize(word) for word in word_tokenize(text)]
)

# SpaCy Lemmatization
data['SpaCy_Lemmas'] = data['Message'].apply(
    lambda text: [token.lemma_ for token in spacy_model(text)]
)

# Compare and find differences between NLTK and SpaCy results
data['Lemma_Differences'] = data.apply(
    lambda row: list(set(row['NLTK_Lemmas']).difference(row['SpaCy_Lemmas'])), axis=1
)

# Remove rows where 'Message' is empty
data = data[data['Message'].str.strip() != ""]

# Print a few examples
print("Example of Original Messages:")
print(data['Message'].head(5))
print("\nExample of NLTK Lemmatization:")
print(data['NLTK_Lemmas'].head(5))
print("\nExample of SpaCy Lemmatization:")
print(data['SpaCy_Lemmas'].head(5))
print("\nList of Lemmas in NLTK Results but not in SpaCy Results:")
print(data['Lemma_Differences'].head(5))


Example of Original Messages:
0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Message, dtype: object

Example of NLTK Lemmatization:
0    [Go, until, jurong, point, ,, crazy, .., Avail...
1             [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, ..., U, c, alrea...
4    [Nah, I, do, n't, think, he, go, to, usf, ,, h...
Name: NLTK_Lemmas, dtype: object

Example of SpaCy Lemmatization:
0    [go, until, jurong, point, ,, crazy, .., avail...
1               [ok, lar, ..., joke, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, F...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [Nah, I, do, not, think, he, go, to, usf, ,, h...
Name: SpaCy_Lemmas

## Observed Output Differences: NLTK vs. SpaCy Lemmatization

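### NLTK Lemmatization:
- Retains the original case.
- `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is passed, so forms it can read as plural nouns are reduced (`goes` → `go`) while verb forms such as `Joking` are left unchanged.
- Leaves contraction pieces such as `n't` as they are.

### SpaCy Lemmatization:
- Lowercases most lemmas (`Go` → `go`, `Ok` → `ok`), but not all of them (`Nah` and `I` keep their capitals in the output above).
- Reduces verbs and plural nouns to their base forms (`Joking` → `joke`, `goes` → `go`).
- Maps contraction pieces to full words (e.g., `n't` → `not`).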

---

### Example Results:

#### Original Message:
"I've been searching for the right words to thank you."

#### NLTK Lemmas:
`['I', "'ve", 'been', 'searching', 'for', 'the', 'right', 'word', 'to', 'thank', 'you']`

#### SpaCy Lemmas:
`['I', 'have', 'be', 'search', 'for', 'the', 'right', 'word', 'to', 'thank', 'you']`

---

### Key Observations:
1. **Case Sensitivity**:
   - NLTK retains case; SpaCy lowercases most lemmas but keeps `I` and tokens it treats as proper nouns capitalized.
2. **Verb Handling**:
   - NLTK leaves verb forms untouched under its default noun setting; SpaCy reduces them (e.g., `"searching"` → `"search"`, `"been"` → `"be"`).
3. **Plural Forms**:
   - Both singularize regular plural nouns (e.g., `"words"` → `"word"`).
4. **Contractions**:
   - NLTK leaves the pieces as is (`"'ve"`); SpaCy maps them to full words (e.g., `"'ve"` → `"have"`).

---

### Summary:
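- **Use NLTK**: When noun-level normalization is enough, or when you can supply part-of-speech tags to the lemmatizer yourself.
- **Use SpaCy**: For normalized lemmas that take the tagged part of speech into account, as part of a larger NLP pipeline.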
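
NLTK's `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to `'n'` (noun), which is why verb forms pass through unchanged unless a tag is supplied. The effect can be sketched without any NLP library. The lookup table below is a hypothetical, hand-written stand-in for a real lexicon such as WordNet; `TOY_LEXICON` and `toy_lemmatize` are illustrative names only.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
# A real lemmatizer consults a full dictionary instead.
TOY_LEXICON = {
    ("searching", "v"): "search",
    ("been", "v"): "be",
    ("words", "n"): "word",
    ("lives", "n"): "life",
    ("lives", "v"): "live",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)

tokens = ["been", "searching", "words", "lives"]
as_nouns = [toy_lemmatize(t) for t in tokens]
as_verbs = [toy_lemmatize(t, pos="v") for t in tokens]
print(as_nouns)
print(as_verbs)
```

The same token can therefore have different lemmas depending on the tag, which is the main reason a tagger-backed pipeline and an untagged noun default disagree.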


### Step 6: Stemming

In [18]:
# Perform stemming using NLTK
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Perform stemming twice on the NLTK Lemmatization results
data['NLTK_Stems_Once'] = data['NLTK_Lemmas'].apply(lambda lemmas: [stemmer.stem(lemma) for lemma in lemmas])
data['NLTK_Stems_Twice'] = data['NLTK_Stems_Once'].apply(lambda stems: [stemmer.stem(stem) for stem in stems])

# Write differences between the results of single stemming and double stemming
data['Stemming_Differences'] = data.apply(
    lambda row: list(set(row['NLTK_Stems_Once']).difference(row['NLTK_Stems_Twice'])), axis=1
)

# Remove rows where 'Stemming_Differences' is empty
data = data[data['Stemming_Differences'].map(len) > 0]

# Print examples with the original sentence for comparison
print("Example of Original Messages:")
print(data['Message'].head(6))
print("\nExample of NLTK Stemming (once):")
print(data['NLTK_Stems_Once'].head(6))
print("\nExample of NLTK Stemming (twice):")
print(data['NLTK_Stems_Twice'].head(6))
print("\nDifferences between single and double stemming:")
print(data['Stemming_Differences'].head(6))

Example of Original Messages:
13    I've been searching for the right words to tha...
20            Is that seriously how you spell his name?
34    Thanks for your subscription to Ringtone UK yo...
42    07732584351 - Rodger Burns - MSG = We tried to...
46        Didn't you get hep b immunisation in nigeria.
65    As a valued customer, I am pleased to advise y...
Name: Message, dtype: object

Example of NLTK Stemming (once):
13    [i, 've, been, search, for, the, right, word, ...
20    [is, that, serious, how, you, spell, hi, name, ?]
34    [thank, for, your, subscript, to, rington, uk,...
42    [07732584351, -, rodger, burn, -, msg, =, we, ...
46    [did, n't, you, get, hep, b, immunis, in, nige...
65    [as, a, valu, custom, ,, i, am, pleas, to, adv...
Name: NLTK_Stems_Once, dtype: object

Example of NLTK Stemming (twice):
13    [i, 've, been, search, for, the, right, word, ...
20     [is, that, seriou, how, you, spell, hi, name, ?]
34    [thank, for, your, subscript, to, rington, uk

## Observed Output Differences

**Single Stemming**:
- Words are reduced to their root forms once.
- Example: `"houses"` → `"hous"`

**Double Stemming**:
- The already stemmed words undergo another round of stemming, potentially reducing them further.
- Example: `"hous"` → `"hou"`

---

### Detailed Analysis of Differences

**1. Word Examples**:

Porter's first step strips a single trailing `s`, so a first-pass stem that still ends in `s` is shortened again on the second pass:

- `"promises"` → `"promis"` → `"promi"`
- `"please"` → `"pleas"` → `"plea"`
- `"causes"` → `"caus"` → `"cau"`
- `"houses"` → `"hous"` → `"hou"`
- `"seriously"` → `"serious"` → `"seriou"` (row 20 in the output above)
- `"apologis"` also ends in a single `s` and is reduced again. Note that it is the stem of `"apologise"`, not of `"apologies"`.

---

**2. Key Observations**:

1. **Stems without a removable suffix stay the same** on the second pass (e.g., `"search"`, `"word"`, `"thank"`).

2. **Stems that still look inflected**, typically those ending in a single `s`, are reduced again. The Porter stemmer is therefore not idempotent.

---

### When Does Double Stemming Help or Hurt?

**1. Helps**:
- Occasionally a second pass merges variants that one pass keeps apart. For example, one pass maps `"seriously"` to `"serious"` but `"serious"` to `"seriou"`; after two passes both end up as `"seriou"`.

**2. Hurts**:
- More often it over-stems: `"hous"` → `"hou"` and `"pleas"` → `"plea"` lose interpretability and raise the risk of unrelated words colliding (`"plea"` is itself an English word).
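
Whether a second pass changes a stem depends largely on Porter's step 1a, which handles plural endings. The sketch below implements only that step, following the published rules (`sses` → `ss`, `ies` → `i`, `ss` → `ss`, `s` → nothing). It is not NLTK's `PorterStemmer`, which runs further steps and has its own extensions, and the list of first-pass stems is chosen by hand for illustration.

```python
def porter_step_1a(word):
    """Plural-stripping rules of Porter's step 1a (first match wins)."""
    if word.endswith("sses"):
        return word[:-2]      # sses -> ss
    if word.endswith("ies"):
        return word[:-2]      # ies -> i
    if word.endswith("ss"):
        return word           # ss -> ss
    if word.endswith("s"):
        return word[:-1]      # s -> (nothing)
    return word

# Hand-picked first-pass stems
first_pass_stems = ["promis", "pleas", "hous", "bless", "search"]
changed = {
    stem: porter_step_1a(stem)
    for stem in first_pass_stems
    if porter_step_1a(stem) != stem
}
print(changed)
```

Stems ending in a single `s` are shortened, while `bless` (double `s`) and `search` are left alone, which matches the pattern in the `Stemming_Differences` column.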


In [19]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import spacy
from collections import Counter

# Load dataset
file_path = 'spam.csv'  # Replace with the actual file path
data = pd.read_csv(file_path, encoding='ISO-8859-1', usecols=[0, 1], names=['Label', 'Message'], skiprows=1)

# Initialize NLP tools
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
spacy_nlp = spacy.load("en_core_web_sm")

# Process dataset
results = []
for _, row in data.iterrows():
    message = row['Message']

    # Tokenization & Lemmatization
    nltk_tokens = word_tokenize(message)
    nltk_lemmas = [lemmatizer.lemmatize(token) for token in nltk_tokens]
    spacy_lemmas = [token.lemma_ for token in spacy_nlp(message)]

    # Stemming (single and double)
    stems_once = [stemmer.stem(lemma) for lemma in nltk_lemmas]
    stems_twice = [stemmer.stem(stem) for stem in stems_once]

    results.append({'Message': message, 'NLTK Lemmas': nltk_lemmas, 'SpaCy Lemmas': spacy_lemmas,
                    'Stems Once': stems_once, 'Stems Twice': stems_twice})

# Convert results to DataFrame
processed_data = pd.DataFrame(results)

# Analysis for candidate message
stemmed_tokens = Counter(token for stems in processed_data['Stems Twice'] for token in stems)
lemmatized_tokens = Counter(lemma for lemmas in processed_data['NLTK Lemmas'] for lemma in lemmas)

candidate_message = next(
    (row for _, row in processed_data.iterrows()
     if len(stemmed_tokens - Counter(row['Stems Twice'])) < len(stemmed_tokens) and
        len(lemmatized_tokens - Counter(row['NLTK Lemmas'])) == len(lemmatized_tokens)),
    None
)

# Output
if candidate_message is not None:
    print("Candidate message found:", candidate_message['Message'])
else:
    print("No such message exists that satisfies both conditions.")

No such message exists that satisfies both conditions.


### Task 8: Identifying a Unique Spam Message



### Explanation:
1. **Stems Are Derived From Lemmas**:
   - In the cell above every stem is computed from an NLTK lemma (`stemmer.stem` applied twice), so each distinct stem in the corpus is backed by at least one distinct lemma.

2. **Simultaneous Impact**:
   - Removing a message can only eliminate a distinct stem if every lemma that maps to that stem disappears with it. At least one distinct lemma is then lost as well.

3. **No Message Can Satisfy the Condition**:
   - The check looks for a message whose removal lowers the number of distinct stems while the number of distinct lemmas stays the same. By the argument above this cannot happen in this pipeline, which matches the printed result. Note that the search covers all messages, not only spam.

---
The task shows how tightly stemming and lemmatization are coupled when one is applied on top of the other: the stem vocabulary cannot shrink unless the lemma vocabulary shrinks too.
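
Because each stem in the cell above is computed from a lemma, a distinct stem can only vanish from the corpus when every lemma that produces it vanishes as well. The sketch below checks this on a toy corpus by removing each message in turn. Everything in it is an assumption made for illustration: `toy_stem` is a made-up stand-in for the double Porter pass, and the lemma lists are invented.

```python
def toy_stem(lemma):
    """Made-up stemmer: drops a trailing 'ly' and nothing else."""
    return lemma[:-2] if lemma.endswith("ly") and len(lemma) > 4 else lemma

# Invented lemma lists, one per message
corpus = [
    ["serious", "offer"],
    ["seriously", "free", "offer"],
    ["free", "ticket"],
]

def vocab_loss(corpus, index):
    """Distinct lemmas and distinct stems lost when message `index` is removed."""
    all_lemmas = {l for msg in corpus for l in msg}
    rest = {l for i, msg in enumerate(corpus) if i != index for l in msg}
    lost_lemmas = all_lemmas - rest
    lost_stems = {toy_stem(l) for l in all_lemmas} - {toy_stem(l) for l in rest}
    return len(lost_lemmas), len(lost_stems)

losses = [vocab_loss(corpus, i) for i in range(len(corpus))]
print(losses)  # [(1, 0), (1, 0), (1, 1)]

# Messages that lose a stem without losing a lemma
task8_candidates = [i for i, (lemmas, stems) in enumerate(losses)
                    if stems > 0 and lemmas == 0]
print(task8_candidates)  # []
```

In every row the number of lost stems is at most the number of lost lemmas, so no message loses a stem without also losing a lemma.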



In [20]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import spacy
from collections import Counter

# Load the dataset
file_path = 'spam.csv'  # Replace with the actual file path
data = pd.read_csv(file_path, encoding='ISO-8859-1')
data.columns = ['Label', 'Message', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
data = data[['Label', 'Message']]

# Initialize NLP tools
nltk_stemmer = PorterStemmer()
nltk_lemmatizer = WordNetLemmatizer()
spacy_nlp = spacy.load("en_core_web_sm")

# Process the dataset
results = []
for idx, row in data.iterrows():
    message = row['Message']
    label = row['Label']

    # Tokenization and Lemmatization (NLTK and SpaCy)
    nltk_tokens = word_tokenize(message)
    nltk_lemmas = [nltk_lemmatizer.lemmatize(token) for token in nltk_tokens]
    spacy_doc = spacy_nlp(message)
    spacy_lemmas = [token.lemma_ for token in spacy_doc]

    # Stemming (NLTK)
    nltk_stems_once = [nltk_stemmer.stem(lemma) for lemma in nltk_lemmas]
    nltk_stems_twice = [nltk_stemmer.stem(stem) for stem in nltk_stems_once]

    results.append({
        'Index': idx,
        'Label': label,
        'Message': message,
        'NLTK Tokens': nltk_tokens,
        'NLTK Lemmas': nltk_lemmas,
        'SpaCy Lemmas': spacy_lemmas,
        'NLTK Stems Once': nltk_stems_once,
        'NLTK Stems Twice': nltk_stems_twice
    })

# Convert results to DataFrame
processed_data = pd.DataFrame(results)

# Task 9 Analysis
# Reduce the total number of lemmatized tokens
lemmatized_tokens = Counter([lemma for row in processed_data['NLTK Lemmas'] for lemma in row])
unique_lemmatized_counts = processed_data['NLTK Lemmas'].apply(lambda x: len(set(x)))

# Maintain the exact same number of stemmed tokens
stemmed_tokens = Counter([token for row in processed_data['NLTK Stems Twice'] for token in row])
unique_stemmed_counts = processed_data['NLTK Stems Twice'].apply(lambda x: len(set(x)))

# Find the candidate message
candidate_message = None
for idx, row in processed_data.iterrows():
    current_lemmas = Counter(row['NLTK Lemmas'])
    current_stems = Counter(row['NLTK Stems Twice'])

    # Check if removing the message affects lemmatized tokens but not stemmed tokens
    if sum((lemmatized_tokens - current_lemmas).values()) < sum(lemmatized_tokens.values()) and \
       sum((stemmed_tokens - current_stems).values()) == sum(stemmed_tokens.values()):
        candidate_message = row
        break

# Output the candidate message or explanation
if candidate_message is not None:  # a matching row is a Series, which has no unambiguous truth value
    print("Candidate message found:")
    print(candidate_message)
else:
    print("No such message exists that satisfies both conditions.")



No such message exists that satisfies both conditions.


### Task 9: Identifying a Unique Spam Message



### Explanation:
1. **What the Code Measures**:
   - The loop compares total token counts (`sum(... .values())`), not the number of distinct lemmas or stems.

2. **Simultaneous Impact**:
   - Every lemma in a message produces exactly one stem, so removing any non-empty message lowers both totals by the same amount.

3. **Why No Message Is Found**:
   - The second condition requires the stem total to stay unchanged, which only an empty message could satisfy. The result therefore follows from how the check is written.
   - Counting distinct types instead (with `len`, as in the previous cell) could give a different answer: a message may hold the only occurrence of a lemma whose stem also arises from another lemma elsewhere (e.g., `serious` and `seriously` both end up as `seriou` after two stemming passes).

---

The task highlights the interconnectedness of lemmatization and stemming. With total counts the condition cannot be met; with distinct types it can, because several lemmas may share one stem.
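
The check in the cell above compares total token counts. The sketch below runs that check next to a variant that compares distinct types, on an invented corpus. `toy_stem` is a made-up stand-in for the stemmer and the lemma lists are not taken from the dataset, so this only illustrates how the two criteria can disagree.

```python
from collections import Counter

def toy_stem(lemma):
    """Made-up stand-in for the stemmer: drops a trailing 'ly'."""
    return lemma[:-2] if lemma.endswith("ly") and len(lemma) > 4 else lemma

# Invented lemma lists, one per message
lemma_lists = [["serious", "offer"], ["seriously", "offer"], ["offer", "ticket"]]
stem_lists = [[toy_stem(l) for l in msg] for msg in lemma_lists]

all_lemmas = Counter(l for msg in lemma_lists for l in msg)
all_stems = Counter(s for msg in stem_lists for s in msg)

by_totals, by_types = [], []
for i, (lemmas, stems) in enumerate(zip(lemma_lists, stem_lists)):
    rest_lemmas = all_lemmas - Counter(lemmas)
    rest_stems = all_stems - Counter(stems)
    # Check as written in the cell: total token counts
    if (sum(rest_lemmas.values()) < sum(all_lemmas.values())
            and sum(rest_stems.values()) == sum(all_stems.values())):
        by_totals.append(i)
    # Variant: number of distinct lemmas and stems
    if len(rest_lemmas) < len(all_lemmas) and len(rest_stems) == len(all_stems):
        by_types.append(i)

print(by_totals, by_types)  # [] [0, 1]
```

Messages 0 and 1 each hold the only occurrence of one lemma, yet their shared stem survives in the other message, so only the type-based variant reports them.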


In [21]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Load the dataset
file_path = 'spam.csv'  # Replace with the actual file path
data = pd.read_csv(file_path, encoding='ISO-8859-1')
data.columns = ['Label', 'Message', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
data = data[['Label', 'Message']]

# Initialize NLP tools
nltk_stemmer = PorterStemmer()
nltk_lemmatizer = WordNetLemmatizer()

# Select the first 5 messages as samples
sample_messages = data['Message'].head(5).tolist()

# Process each sample message and compare transformations
results = []
for message in sample_messages:
    # Original message
    original = message

    # Tokenization
    tokens = word_tokenize(message)

    # Lemmatization
    lemmas = [nltk_lemmatizer.lemmatize(token) for token in tokens]

    # Stemming
    stems = [nltk_stemmer.stem(token) for token in tokens]

    # Store results for comparison
    results.append({
        'Original': original,
        'Tokens': tokens,
        'Lemmas': lemmas,
        'Stems': stems
    })

# Convert results to a DataFrame for easier comparison
comparison_df = pd.DataFrame(results)

# Display comparisons before and after the transformations
print("Comparison of NLTK Tokenization, Lemmatization, and Stemming:")
print(comparison_df)

Comparison of NLTK Tokenization, Lemmatization, and Stemming:
                                            Original  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                              Tokens  \
0  [Go, until, jurong, point, ,, crazy, .., Avail...   
1           [Ok, lar, ..., Joking, wif, u, oni, ...]   
2  [Free, entry, in, 2, a, wkly, comp, to, win, F...   
3  [U, dun, say, so, early, hor, ..., U, c, alrea...   
4  [Nah, I, do, n't, think, he, goes, to, usf, ,,...   

                                              Lemmas  \
0  [Go, until, jurong, point, ,, crazy, .., Avail...   
1           [Ok, lar, ..., Joking, wif, u, oni, ...]   
2  [Free, entry, in, 2, a, wkly, comp, to, win, F...   
3  [U, dun, say, so, early, hor, ..., U,

## Task 10: Comparison of NLTK Tokenization, Lemmatization, and Stemming

## Objective
To analyze the effects of tokenization, lemmatization, and stemming on the messages in the dataset using the **NLTK library**. The aim is to observe how these preprocessing steps transform the original text.

---

## Sample Comparison
Below is an example of how the text changes during each preprocessing step:

### **1. Original Text**
- "Nah I don't think he goes to usf, he lives around here"

### **2. Tokenization(NLTK)**
- **Definition**: Splits the text into individual words and punctuation.
- **Output**:
  - `['Nah', 'I', 'do', "n't", 'think', 'he', 'goes', 'to', 'usf', ',', 'he', 'lives', 'around', 'here']`

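### **3. Lemmatization(NLTK)**
- **Definition**: Reduces words to their base or dictionary form. `WordNetLemmatizer` does not look at the sentence; it treats each token as a noun unless a `pos` argument is given.
- **Output**:
  - `['Nah', 'I', 'do', "n't", 'think', 'he', 'go', 'to', 'usf', ',', 'he', 'life', 'around', 'here']`
- **Key Observations**:
  - `goes → go`, `lives → life` (with the default noun setting, `lives` is read as the plural of `life`)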

### **4. Stemming(NLTK)**
- **Definition**: Reduces words to their root form without considering grammatical context.
- **Output**:
  - `['nah', 'i', 'do', "n't", 'think', 'he', 'goe', 'to', 'usf', ',', 'he', 'live', 'around', 'here']`
- **Key Observations**:
  - More aggressive reduction, e.g., `goes → goe`, `lives → live`

---

## Observations
1. **Tokenization**:
   - Provides a structured representation of the text by splitting it into individual tokens.
   - Punctuation and words are treated as separate tokens.

2. **Lemmatization**:
   - Returns dictionary forms, but with NLTK the result is only reliable when the correct part of speech is supplied; the default treats every token as a noun.
   - Suitable for tasks requiring semantic understanding.

3. **Stemming**:
   - Reduces words to their roots but can be overly aggressive, potentially altering the meaning.
   - Faster than lemmatization but less accurate.

---

## Conclusion
- **Lemmatization** is better for preserving linguistic meaning in tasks like text classification and sentiment analysis.
- **Stemming** is faster and useful for simple tasks like keyword extraction.
- The choice of technique depends on the application and the need for grammatical accuracy.




In [22]:
import pandas as pd
import spacy

# Load the dataset
file_path = 'spam.csv'  # Replace with the actual file path
data = pd.read_csv(file_path, encoding='ISO-8859-1')
data.columns = ['Label', 'Message', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
data = data[['Label', 'Message']]

# Initialize SpaCy NLP tool
spacy_nlp = spacy.load("en_core_web_sm")

# Select the first 10 messages as samples
sample_messages = data['Message'].head(10).tolist()

# Process each sample message
results = []
for message in sample_messages:
    # Original message
    original = message

    # Tokenization and Lemmatization
    spacy_doc = spacy_nlp(message)
    tokens = [token.text for token in spacy_doc]
    lemmas = [token.lemma_ for token in spacy_doc]

    # Store results
    results.append({
        'Original': original,
        'Tokens': tokens,
        'Lemmas': lemmas
    })

# Convert results to DataFrame
comparison_df = pd.DataFrame(results)
print(comparison_df)

                                            Original  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   
5  FreeMsg Hey there darling it's been 3 week's n...   
6  Even my brother is not like to speak with me. ...   
7  As per your request 'Melle Melle (Oru Minnamin...   
8  WINNER!! As a valued network customer you have...   
9  Had your mobile 11 months or more? U R entitle...   

                                              Tokens  \
0  [Go, until, jurong, point, ,, crazy, .., Avail...   
1           [Ok, lar, ..., Joking, wif, u, oni, ...]   
2  [Free, entry, in, 2, a, wkly, comp, to, win, F...   
3  [U, dun, say, so, early, hor, ..., U, c, alrea...   
4  [Nah, I, do, n't, think, he, goes, to, usf, ,,...   
5  [FreeMsg, Hey, there, darling, it, 's, been,

## Task 11: Comparison of SpaCy Tokenization and Lemmatization

## Objective
To understand how **SpaCy** processes text using tokenization and lemmatization. This task demonstrates the transformations applied to raw text and their impact on the tokens and lemmas.

---

## Key Transformations

### **1. Original Text**
The unprocessed text from the dataset, retaining punctuation, capitalization, and grammatical structures.
- **Example 1**: `"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."`
- **Example 2**: `"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."`

---

### **2. Tokenization (SpaCy)**
- **Definition**: Splits text into individual tokens, including words, punctuation, and numbers.
- **Example Output**:
  - **Original**: `"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."`
  - **Tokens**: `['Free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'FA', 'Cup', 'final', 'tkts', '21st', 'May', '2005', '.']`
- **Key Features**:
  - Each word or symbol is treated as a separate token.
  - Numbers (`21st`, `2005`) and punctuation (`.`) are preserved as individual tokens.
  - Tokenization ensures all components of the text are analyzable.

---

### **3. Lemmatization (SpaCy)**
- **Definition**: Converts tokens to their base or dictionary form, using the part-of-speech tags assigned by the pipeline.
- **Example Output**:
  - **Tokens**: `['Free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'FA', 'Cup', 'final', 'tkts', '21st', 'May', '2005', '.']`
  - **Lemmas** (as far as the truncated notebook output shows): `['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', ...]`
- **Key Features**:
  - Abbreviations and informal spellings are not expanded: `wkly` and `comp` come through unchanged.
  - The sentence-initial `Free` is lowercased to `free`.
  - Proper nouns (`FA`, `Cup`, `May`) remain unchanged.
  - Numbers and punctuation are preserved.

---

## Observations

1. **Tokenization**:
   - Accurately splits text into meaningful units, retaining symbols and numbers.
   - Provides a strong foundation for subsequent preprocessing tasks.

2. **Lemmatization**:
   - Reduces tokens to their base forms, which keeps downstream counts consistent.
   - Normalizes contraction pieces (e.g., `n't → not`) but does not expand texting abbreviations such as `wkly`.
   - Maintains context by preserving proper nouns and numerical data.

---

## Conclusion
SpaCy’s preprocessing pipeline is efficient for natural language processing tasks:
- **Tokenization** breaks raw text into manageable components for analysis.
- **Lemmatization** simplifies tokens to their dictionary forms, aiding in semantic understanding.

These features make SpaCy a powerful tool for preparing text data for downstream NLP applications.


In [23]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Define the URL
url = "https://www.history.com/news/president-jimmy-carter-dies-at-100-years-old"

# Step 2: Send a GET request to the URL
wiki = requests.get(url)

# Check if the request was successful
if wiki.status_code == 200:
    # Step 3: Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(wiki.content, 'html.parser')

    # Split the text into sentences by "."
    cleaned_text = soup.get_text()
    sentences_url = [sentence.strip() for sentence in cleaned_text.split('.') if sentence.strip()]

    # Step 4: Extract specific content (e.g., headings, paragraphs, links)
    # Extract all headings (h1, h2, h3, etc.)
    headings = soup.find_all(['h1', 'h2', 'h3'])
    print("Headings:")
    for heading in headings:
        print(heading.text.strip())

    # Create a DataFrame
    df_clean_text = pd.DataFrame(sentences_url, columns=["message"])

    # Extract all paragraph text
    paragraphs = soup.find_all('p')
    print("\nParagraphs:")
    for para in paragraphs[:5]:  # Display only the first 5 paragraphs
        print(para.text.strip())

    # Extract all hyperlinks
    links = soup.find_all('a', href=True)
    print("\nHyperlinks:")
    for link in links[:10]:  # Display only the first 10 links
        print(link['href'])

    # Display the DataFrame
    print(df_clean_text)
else:
    print(f"Failed to retrieve the page. Status code: {wiki.status_code}")
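
Splitting on every `.` also breaks at decimal points and ignores `?` and `!` boundaries. A slightly more careful sketch using only the standard library is shown below. The sample text and the name `split_sentences` are illustrative, and the rule (split after `.`, `!`, or `?` when whitespace and an uppercase letter follow) is a heuristic, not a full sentence segmenter.

```python
import re

# Split after ., ! or ? only when whitespace and an uppercase letter follow
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Heuristic splitter; abbreviations such as 'Dr. Smith' still break it."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]

# Made-up sample text
sample = "The page loaded in 1.5 seconds. It has 3 headings! Is that all?"

naive = [s.strip() for s in sample.split(".") if s.strip()]
better = split_sentences(sample)
print(naive)
print(better)
```

The naive split cuts `1.5` in half and leaves the last two sentences glued together, while the heuristic keeps the decimal intact and separates all three sentences.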

# Task 15

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import spacy

# Load the SpaCy model
spacy_model = spacy.load('en_core_web_sm')

# Step 1: Web scraping from the provided URL
url = "https://www.history.com/news/president-jimmy-carter-dies-at-100-years-old"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract headings and paragraphs
    headings = [heading.text.strip() for heading in soup.find_all(['h1', 'h2', 'h3'])]
    paragraphs = [para.text.strip() for para in soup.find_all('p') if para.text.strip()]
    # Combine scraped content into a DataFrame
    data = pd.DataFrame({'Message': headings + paragraphs})
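else:
    # Stop here: the remaining steps need the scraped data.
    # Raising SystemExit avoids calling exit(), which is meant for interactive shells.
    raise SystemExit(f"Failed to retrieve the page. Status code: {response.status_code}")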

# Step 2: Perform tokenization using NLTK and SpaCy
data['NLTK_Tokens'] = data['Message'].apply(lambda text: word_tokenize(text))
data['SpaCy_Tokens'] = data['Message'].apply(lambda text: [token.text for token in spacy_model(text)])
data['Token_Differences'] = data.apply(
    lambda row: list(set(row['NLTK_Tokens']).difference(row['SpaCy_Tokens'])), axis=1
)

# Step 3: Perform lemmatization using NLTK and SpaCy
nltk_lemmatizer = WordNetLemmatizer()
data['NLTK_Lemmatized'] = data['NLTK_Tokens'].apply(
    lambda tokens: [nltk_lemmatizer.lemmatize(token) for token in tokens]
)
data['SpaCy_Lemmatized'] = data['Message'].apply(
    lambda text: [token.lemma_ for token in spacy_model(text)]
)
data['Lemmatization_Differences'] = data.apply(
    lambda row: list(set(row['NLTK_Lemmatized']).difference(row['SpaCy_Lemmatized'])), axis=1
)

# Step 4: Perform stemming using NLTK
stemmer = PorterStemmer()
data['Stemmed_NLTK_Lemmatized'] = data['NLTK_Lemmatized'].apply(
    lambda tokens: [stemmer.stem(token) for token in tokens]
)
data['Stemmed_SpaCy_Lemmatized'] = data['SpaCy_Lemmatized'].apply(
    lambda tokens: [stemmer.stem(token) for token in tokens]
)
data['Stemming_Differences'] = data.apply(
    lambda row: list(set(row['Stemmed_NLTK_Lemmatized']).difference(row['Stemmed_SpaCy_Lemmatized'])), axis=1
)

# Step 5: Identify messages where NLTK yields fewer tokens than SpaCy
# Note: stemming and lemmatization each map one token to one output, so list
# lengths never change and the two filters below always select the same rows.
reduced_stemmed_tokens = [
    index for index, row in data.iterrows()
    if len(row['Stemmed_NLTK_Lemmatized']) < len(row['Stemmed_SpaCy_Lemmatized'])
]

# Same length comparison, taken on the lemmatized lists
reduced_lemmatized_tokens = [
    index for index, row in data.iterrows()
    if len(row['NLTK_Lemmatized']) < len(row['SpaCy_Lemmatized'])
]

# Step 6: Print results for all tasks
print("Task 5: Tokenization Differences")
print(data['Token_Differences'].head(5))

print("\nTask 6: Lemmatization Differences")
print(data['Lemmatization_Differences'].head(5))

print("\nTask 7: Stemming Differences")
print(data['Stemming_Differences'].head(5))

print("\nTask 8: Messages reducing stemmed tokens while maintaining lemmatized tokens")
print(reduced_stemmed_tokens)

print("\nTask 9: Messages reducing lemmatized tokens while maintaining stemmed tokens")
print(reduced_lemmatized_tokens)

print("\nTask 10: Compare NLTK Results (Tokenization, Lemmatization, Stemming)")
print(data[['Message', 'NLTK_Tokens', 'NLTK_Lemmatized', 'Stemmed_NLTK_Lemmatized']].head(5))

print("\nTask 11: Compare SpaCy Results (Tokenization, Lemmatization)")
print(data[['Message', 'SpaCy_Tokens', 'SpaCy_Lemmatized']].head(5))




Task 5: Tokenization Differences
0    []
1    []
2    []
3    []
4    []
Name: Token_Differences, dtype: object

Task 6: Lemmatization Differences
0    [Former, Dies]
1           [Early]
2         [a, Term]
3      [Presidency]
4                []
Name: Lemmatization_Differences, dtype: object

Task 7: Stemming Differences
0     []
1     []
2    [a]
3     []
4     []
Name: Stemming_Differences, dtype: object

Task 8: Messages reducing stemmed tokens while maintaining lemmatized tokens
[5, 12, 13, 15, 17, 18, 19, 21, 24, 26, 32]

Task 9: Messages reducing lemmatized tokens while maintaining stemmed tokens
[5, 12, 13, 15, 17, 18, 19, 21, 24, 26, 32]

Task 10: Compare NLTK Results (Tokenization, Lemmatization, Stemming)
                                         Message  \
0  Former President Jimmy Carter Dies at Age 100   
1  Family Farm, Early Naval and Political Career   
2                       Term as Georgia Governor   
3                            Carter's Presidency   
4             

# Explanation of Output

## Task 5: Tokenization Differences

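The `Token_Differences` column lists, for the first 5 rows, the tokens that appear in NLTK's output but not in SpaCy's.

Empty lists (`[]`) mean that every NLTK token also appears in SpaCy's output. The difference is one-directional, so on its own it would not reveal tokens that only SpaCy produces; for these headings, the token lists shown in the Task 10 and Task 11 tables do match.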
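Since `set(a).difference(b)` only reports what is missing from `b`, a two-sided comparison can be a useful cross-check. The sketch below is a minimal illustration: `compare_tokens` is a hypothetical helper, and the token lists are made up for the example rather than taken from the notebook output.

```python
def compare_tokens(nltk_tokens, spacy_tokens):
    """Return the tokens unique to each side, keeping first-seen order."""
    nltk_set, spacy_set = set(nltk_tokens), set(spacy_tokens)
    only_nltk = [t for t in dict.fromkeys(nltk_tokens) if t not in spacy_set]
    only_spacy = [t for t in dict.fromkeys(spacy_tokens) if t not in nltk_set]
    return {"only_nltk": only_nltk, "only_spacy": only_spacy}


# Hypothetical token lists (illustration only, not notebook output).
nltk_tokens = ["do", "n't", "stop"]
spacy_tokens = ["do", "n't", "stop", "!"]

# The one-directional difference used in the notebook hides the extra "!".
one_way = list(set(nltk_tokens).difference(spacy_tokens))
two_way = compare_tokens(nltk_tokens, spacy_tokens)

print(one_way)  # []
print(two_way)  # {'only_nltk': [], 'only_spacy': ['!']}
```

An empty one-way result therefore shows that one list is contained in the other, not that the two are equal.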

---

## Task 6: Lemmatization Differences

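The `Lemmatization_Differences` column lists, for the first 5 rows, the lemmas that appear in NLTK's output but not in SpaCy's.

- **Row 0:** SpaCy lemmatized `Former` and `Dies` to `former` and `die`, but NLTK kept them unchanged.
- **Row 1:** SpaCy lemmatized `Early` to `early`, while NLTK left it unchanged.
- **Row 2:** NLTK reduced `as` to `a` (without a part-of-speech tag it treats every token as a noun), while SpaCy kept `as` and lowercased `Term` to `term`.
- **Row 3:** SpaCy lowercased `Presidency` to `presidency`, while NLTK left it unchanged.
- **Row 4:** There are no differences: every NLTK lemma also appears in SpaCy's output.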
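Several of these differences are purely a matter of capitalization. To separate those from genuine lemma disagreements, the comparison can be repeated case-insensitively. The sketch below assumes a hypothetical helper `lemma_diff` and uses the row 2 and row 3 lemma lists as they appear in the Task 10 and Task 11 tables.

```python
def lemma_diff(nltk_lemmas, spacy_lemmas, ignore_case=False):
    """Lemmas from the NLTK list that have no match in the SpaCy list."""
    key = str.lower if ignore_case else (lambda s: s)
    spacy_keys = {key(lemma) for lemma in spacy_lemmas}
    return [lemma for lemma in nltk_lemmas if key(lemma) not in spacy_keys]


# Rows 2 and 3, copied from the Task 10 (NLTK) and Task 11 (SpaCy) tables.
nltk_row2 = ["Term", "a", "Georgia", "Governor"]
spacy_row2 = ["term", "as", "Georgia", "Governor"]
nltk_row3 = ["Carter", "'s", "Presidency"]
spacy_row3 = ["Carter", "'s", "presidency"]

print(lemma_diff(nltk_row2, spacy_row2))                    # ['Term', 'a']
print(lemma_diff(nltk_row2, spacy_row2, ignore_case=True))  # ['a']
print(lemma_diff(nltk_row3, spacy_row3))                    # ['Presidency']
print(lemma_diff(nltk_row3, spacy_row3, ignore_case=True))  # []
```

Once case is ignored, only `a` remains, which is the one difference here that changes the word itself.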

---

## Task 7: Stemming Differences

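The `Stemming_Differences` column lists, for the first 5 rows, the stems derived from NLTK's lemmas that do not appear among the stems derived from SpaCy's lemmas. The same Porter stemmer is applied to both sides.


- **Row 2:** The stem `a` appears only on the NLTK side because NLTK had already lemmatized `as` to `a`, whereas SpaCy kept `as`.
- **Other Rows:** Empty lists (`[]`) indicate no NLTK-only stems. The stemmer lowercases its output, so the capitalization differences seen in Task 6 (for example `Term` vs. `term`) disappear after stemming.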

---

## Task 8: Messages Reducing Stemmed Tokens While Maintaining Lemmatized Tokens


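These are the row indices where the stemmed NLTK list is shorter than the stemmed SpaCy list, that is, where `len(row['Stemmed_NLTK_Lemmatized']) < len(row['Stemmed_SpaCy_Lemmatized'])`.

Stemming replaces each token with exactly one stem, so it never changes the length of a list. A row is therefore selected only when NLTK split the message into fewer tokens than SpaCy did. The check does not measure how much stemming merges word variations (e.g., `running` → `run`), nor what happens to the dataset when a message is removed, despite what the task title suggests.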

---

## Task 9: Messages Reducing Lemmatized Tokens While Maintaining Stemmed Tokens


These are the row indices where the NLTK lemma list is shorter than the SpaCy lemma list, that is, where `len(row['NLTK_Lemmatized']) < len(row['SpaCy_Lemmatized'])`.

Both libraries emit one lemma per token, so this condition is equivalent to the one in Task 8, which is why the two printed index lists are identical. Differences in what the two techniques do to a word (e.g., a lemmatizer can map `went` → `go`, which suffix stripping cannot) have no effect on this check.
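The identical index lists printed for Tasks 8 and 9 follow from how the filters are written: both compare list lengths, and a per-token transformation cannot change a list's length. The sketch below illustrates this with a toy stand-in for a stemmer (`toy_stem` is hypothetical and far cruder than Porter) and made-up rows. It also shows one possible alternative, counting distinct stems, which does react to stemming.

```python
def toy_stem(token):
    """Toy stand-in for a stemmer: lowercase and drop a trailing 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


# Hypothetical per-message lemma lists (illustration only).
rows = [
    {"nltk": ["Iran", "Hostage", "Crisis"], "spacy": ["Iran", "Hostage", "Crisis"]},
    {"nltk": ["a/b", "Dies"], "spacy": ["a", "/", "b", "die"]},
    {"nltk": ["Runs", "run"], "spacy": ["Runs", "run"]},
]

# Filter on lemma-list lengths (the Task 9 condition).
lemma_filter = [i for i, r in enumerate(rows) if len(r["nltk"]) < len(r["spacy"])]

# Filter on stemmed-list lengths (the Task 8 condition): same result,
# because mapping each token to one stem preserves the length.
stem_filter = [
    i for i, r in enumerate(rows)
    if len([toy_stem(t) for t in r["nltk"]]) < len([toy_stem(t) for t in r["spacy"]])
]

# Alternative: rows where stemming merges distinct tokens into fewer stems.
unique_shrinks = [
    i for i, r in enumerate(rows)
    if len({toy_stem(t) for t in r["nltk"]}) < len(set(r["nltk"]))
]

print(lemma_filter, stem_filter, unique_shrinks)  # [1] [1] [2]
```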

---

## Task 10: Compare NLTK Results (Tokenization, Lemmatization, Stemming)

| **Message**                                   | **NLTK_Tokens**                                  | **NLTK_Lemmatized**                               | **Stemmed_NLTK_Lemmatized**                       |
|-----------------------------------------------|-------------------------------------------------|-------------------------------------------------|-------------------------------------------------|
| Former President Jimmy Carter Dies at Age 100 | [Former, President, Jimmy, Carter, Dies, ...]   | [Former, President, Jimmy, Carter, Dies, ...]   | [former, presid, jimmi, carter, die, ...]        |
| Family Farm, Early Naval and Political Career | [Family, Farm, ,, Early, Naval, and, ...]      | [Family, Farm, ,, Early, Naval, and, ...]      | [famili, farm, ,, earli, naval, and, ...]        |
| Term as Georgia Governor                      | [Term, as, Georgia, Governor]                  | [Term, a, Georgia, Governor]                   | [term, a, georgia, governor]                     |
| Carter's Presidency                           | [Carter, 's, Presidency]                       | [Carter, 's, Presidency]                       | [carter, 's, presid]                             |
| Iran Hostage Crisis                           | [Iran, Hostage, Crisis]                        | [Iran, Hostage, Crisis]                        | [iran, hostag, crisi]                            |

This table shows how NLTK processes the text in terms of:
- **Tokenization:** Breaking text into words or tokens.
- **Lemmatization:** Reducing tokens to their base forms.
- **Stemming:** Further reducing tokens to their stems.

---

## Task 11: Compare SpaCy Results (Tokenization, Lemmatization)

| **Message**                                   | **SpaCy_Tokens**                                | **SpaCy_Lemmatized**                             |
|-----------------------------------------------|------------------------------------------------|------------------------------------------------|
| Former President Jimmy Carter Dies at Age 100 | [Former, President, Jimmy, Carter, Dies, ...] | [former, President, Jimmy, Carter, die, ...]  |
| Family Farm, Early Naval and Political Career | [Family, Farm, ,, Early, Naval, and, ...]     | [Family, Farm, ,, early, Naval, and, ...]     |
| Term as Georgia Governor                      | [Term, as, Georgia, Governor]                 | [term, as, Georgia, Governor]                 |
| Carter's Presidency                           | [Carter, 's, Presidency]                      | [Carter, 's, presidency]                      |
| Iran Hostage Crisis                           | [Iran, Hostage, Crisis]                       | [Iran, Hostage, Crisis]                       |

This table shows how SpaCy processes the text in terms of:
- **Tokenization:** Breaking text into tokens.
- **Lemmatization:** Reducing tokens to their base forms.

For example:
- **Row 0:** `Dies` → `die` (lemmatized by SpaCy).
- **Row 4:** Each lemma equals its token, so the two lists are identical.

---

## Summary

1. **Tokenization:**
   - No significant differences between NLTK and SpaCy for the provided examples.

2. **Lemmatization:**
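   - SpaCy changes more tokens, e.g., `Dies` → `die` and `Presidency` → `presidency`. NLTK's `WordNetLemmatizer` is called without part-of-speech tags, so it treats every token as a noun and left these forms unchanged.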

3. **Stemming:**
   - The Porter stemmer (from NLTK, applied to both libraries' lemmas) cuts words back further than either lemmatizer, lowercasing them and removing suffixes (e.g., `Presidency` → `presid`).

4. **Messages Affecting Tokens:**
   - Tasks 8 and 9 identify specific rows where token counts are affected by stemming or lemmatization.

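The cell above does not write any file. To keep the processed data for further analysis, save it explicitly, for example with `data.to_csv('web_scraped_results.csv', index=False)`.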
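One caveat when saving: CSV cells are plain text, so list-valued columns such as `NLTK_Tokens` come back as strings unless they are serialized deliberately. The sketch below shows one option, JSON-encoding each list, using only the standard library; an in-memory buffer stands in for the output file, and the single row is copied from the Task 10 table. The same idea should carry over to `DataFrame.to_csv` by encoding the list columns first.

```python
import csv
import io
import json

# One processed row, copied from the Task 10 table.
rows = [
    {
        "Message": "Term as Georgia Governor",
        "NLTK_Tokens": ["Term", "as", "Georgia", "Governor"],
    },
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=["Message", "NLTK_Tokens"])
writer.writeheader()
for row in rows:
    # Encode the token list as JSON so it can be decoded unambiguously later.
    encoded = json.dumps(row["NLTK_Tokens"], ensure_ascii=False)
    writer.writerow({**row, "NLTK_Tokens": encoded})

# Read the CSV back and decode the JSON column into real lists again.
buffer.seek(0)
restored = [
    {**row, "NLTK_Tokens": json.loads(row["NLTK_Tokens"])}
    for row in csv.DictReader(buffer)
]

print(restored == rows)  # True
```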








In [26]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import spacy

# Load the SpaCy model
spacy_model = spacy.load('en_core_web_sm')

# Step 1: Load content from chat.txt
file_path = 'chat.txt'  # Replace with your chat.txt file path
try:
    with open(file_path, 'r', encoding='utf-8') as file:
        messages = file.readlines()
except FileNotFoundError:
    # exit() is a site-module helper meant for interactive shells; raise SystemExit instead.
    raise SystemExit(f"Error: The file {file_path} was not found.")

# Create a DataFrame from chat.txt content
data = pd.DataFrame({'Message': [msg.strip() for msg in messages if msg.strip()]})  # Remove empty lines

# Step 2: Perform tokenization using NLTK and SpaCy
data['NLTK_Tokens'] = data['Message'].apply(lambda text: word_tokenize(text))
data['SpaCy_Tokens'] = data['Message'].apply(lambda text: [token.text for token in spacy_model(text)])
data['Token_Differences'] = data.apply(
    lambda row: list(set(row['NLTK_Tokens']).difference(row['SpaCy_Tokens'])), axis=1
)

# Step 3: Perform lemmatization using NLTK and SpaCy
nltk_lemmatizer = WordNetLemmatizer()
data['NLTK_Lemmatized'] = data['NLTK_Tokens'].apply(
    lambda tokens: [nltk_lemmatizer.lemmatize(token) for token in tokens]
)
data['SpaCy_Lemmatized'] = data['Message'].apply(
    lambda text: [token.lemma_ for token in spacy_model(text)]
)
data['Lemmatization_Differences'] = data.apply(
    lambda row: list(set(row['NLTK_Lemmatized']).difference(row['SpaCy_Lemmatized'])), axis=1
)

# Step 4: Perform stemming using NLTK
stemmer = PorterStemmer()
data['Stemmed_NLTK_Lemmatized'] = data['NLTK_Lemmatized'].apply(
    lambda tokens: [stemmer.stem(token) for token in tokens]
)
data['Stemmed_SpaCy_Lemmatized'] = data['SpaCy_Lemmatized'].apply(
    lambda tokens: [stemmer.stem(token) for token in tokens]
)
data['Stemming_Differences'] = data.apply(
    lambda row: list(set(row['Stemmed_NLTK_Lemmatized']).difference(row['Stemmed_SpaCy_Lemmatized'])), axis=1
)

# Step 5: Identify messages where NLTK yields fewer tokens than SpaCy
# Note: stemming and lemmatization each map one token to one output, so list
# lengths never change and the two filters below always select the same rows.
reduced_stemmed_tokens = [
    index for index, row in data.iterrows()
    if len(row['Stemmed_NLTK_Lemmatized']) < len(row['Stemmed_SpaCy_Lemmatized'])
]

# Same length comparison, taken on the lemmatized lists
reduced_lemmatized_tokens = [
    index for index, row in data.iterrows()
    if len(row['NLTK_Lemmatized']) < len(row['SpaCy_Lemmatized'])
]

# Step 6: Print results for all tasks
print("Task 5: Tokenization Differences")
print(data['Token_Differences'].head(5))

print("\nTask 6: Lemmatization Differences")
print(data['Lemmatization_Differences'].head(5))

print("\nTask 7: Stemming Differences")
print(data['Stemming_Differences'].head(5))

print("\nTask 8: Messages reducing stemmed tokens while maintaining lemmatized tokens")
print(reduced_stemmed_tokens)

print("\nTask 9: Messages reducing lemmatized tokens while maintaining stemmed tokens")
print(reduced_lemmatized_tokens)

print("\nTask 10: Compare NLTK Results (Tokenization, Lemmatization, Stemming)")
print(data[['Message', 'NLTK_Tokens', 'NLTK_Lemmatized', 'Stemmed_NLTK_Lemmatized']].head(5))

print("\nTask 11: Compare SpaCy Results (Tokenization, Lemmatization)")
print(data[['Message', 'SpaCy_Tokens', 'SpaCy_Lemmatized']].head(5))



Task 5: Tokenization Differences
0                                            [מוביל/ת]
1    [//www.linkedin.com/jobs/view/3628574195, :, h...
2                                   [01/09/2023, ‎, []
3                                                   []
4                                                   []
Name: Token_Differences, dtype: object

Task 6: Lemmatization Differences
0                                     [מוביל/ת, Check]
1    [http, //www.linkedin.com/jobs/view/3628574195...
2                          [01/09/2023, omitted, ‎, []
3                                                   []
4                                                   []
Name: Lemmatization_Differences, dtype: object

Task 7: Stemming Differences
0                                            [מוביל/ת]
1    [http, //www.linkedin.com/jobs/view/3628574195...
2                                   [01/09/2023, ‎, []
3                                                   []
4                                             

# Explanation of Output

## Task 5: Tokenization Differences

The `Token_Differences` column shows the differences in tokenization results between NLTK and SpaCy for the first 5 rows.

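- **Row 0:** The token `מוביל/ת` is present in NLTK's output but not in SpaCy's, possibly because SpaCy splits it at the slash or handles the Hebrew characters differently.
- **Row 1:** NLTK splits the URL into `https`, `:`, and `//www.linkedin.com/jobs/view/3628574195`, while SpaCy keeps the whole URL as a single token (see the Task 11 table).
- **Row 2:** The list is not empty. NLTK emits the date `01/09/2023`, the bracket `[`, and an invisible formatting character as separate tokens, whereas SpaCy attaches the leading invisible character and bracket to the date.
- **Rows 3-4:** Empty lists (`[]`) indicate that every NLTK token also appears in SpaCy's output.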
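Row 2 of the output above contains a token that renders as nothing at all. A likely cause, assumed here rather than confirmed from the file, is a left-to-right mark (U+200E) left in the chat export. If these marks are not wanted as tokens, they can be stripped before tokenization. The helper name `strip_format_chars` is hypothetical.

```python
import unicodedata


def strip_format_chars(text):
    """Drop Unicode format characters (category 'Cf'), such as U+200E."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Row 2 of the chat, with the invisible marks written as explicit escapes
# (assumption: the invisible characters in the export are U+200E).
raw = "\u200e[01/09/2023, 14:28:49] Barak: \u200eimage omitted"
clean = strip_format_chars(raw)

print(repr(clean))  # '[01/09/2023, 14:28:49] Barak: image omitted'
```

Category `Cf` also covers the zero-width joiner used inside some emoji sequences, so a narrower filter (removing only U+200E and U+200F) may be preferable if emoji matter for the analysis.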

---

## Task 6: Lemmatization Differences

The `Lemmatization_Differences` column shows differences in lemmatized results between NLTK and SpaCy for the first 5 rows.

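- **Row 0:** `מוביל/ת` is carried over from the tokenization difference. `Check` appears because NLTK leaves it unchanged, while SpaCy's lemma is lowercased (see the Task 11 table).
- **Row 1:** NLTK's URL pieces (with `https` lemmatized to `http`) have no counterpart in SpaCy's single URL token.
- **Row 2:** The tokenization differences from Task 5 reappear, together with `omitted`, which NLTK leaves unchanged and SpaCy evidently maps to a different lemma.
- **Rows 3-4:** Empty lists indicate that every NLTK lemma also appears in SpaCy's output for these rows.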

---

## Task 7: Stemming Differences

The `Stemming_Differences` column shows differences in stemming results between NLTK lemmatized tokens and SpaCy lemmatized tokens for the first 5 rows.

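- **Row 0:** The token `מוביל/ת` passes through the stemmer unchanged, so the tokenization difference persists.
- **Row 1:** The URL pieces produced by NLTK's tokenizer remain unmatched. The same Porter stemmer is applied to both sides, so these differences originate in tokenization, not in stemming.
- **Row 2:** The list is not empty: `01/09/2023`, `[`, and the invisible character remain. `omitted` no longer appears, which means both sides reduce it to the same stem.
- **Rows 3-4:** Empty lists indicate no NLTK-only stems for these rows.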

---

## Task 8: Messages Reducing Stemmed Tokens While Maintaining Lemmatized Tokens

These are the row indices where the stemmed NLTK list is shorter than the stemmed SpaCy list. Stemming keeps exactly one stem per token, so a row is selected only when NLTK produced fewer tokens than SpaCy for that message. The check does not measure the effect of removing a message.

### Examples:
- A token such as `מוביל/ת`, which NLTK keeps whole and SpaCy apparently does not, gives NLTK fewer tokens for that message.

### Observations:
- Rows such as `0`, `13`, and `20` are selected because of tokenization differences, not because stemming removed tokens.

---

## Task 9: Messages Reducing Lemmatized Tokens While Maintaining Stemmed Tokens

These are the row indices where the NLTK lemma list is shorter than the SpaCy lemma list. Both libraries emit one lemma per token, so this condition selects exactly the same rows as Task 8.

### Examples:
- Any message that NLTK splits into fewer tokens than SpaCy qualifies. How a word is reduced (e.g., `running` → `run`, which the Porter stemmer does and the noun-only NLTK lemmatizer does not) has no effect on the check.

### Observations:
- Rows such as `0`, `13`, and `20` appear here for the same reason as in Task 8: the two tokenizers disagree on the number of tokens.

---

## Task 10: Compare NLTK Results (Tokenization, Lemmatization, Stemming)

| **Message**                                   | **NLTK_Tokens**                                  | **NLTK_Lemmatized**                               | **Stemmed_NLTK_Lemmatized**                       |
|-----------------------------------------------|-------------------------------------------------|-------------------------------------------------|-------------------------------------------------|
| [01/09/2023, 13:48:44] Barak: Check out this j... | [[, 01/09/2023, ,, 13:48:44, ], Barak, :, Chec... | [[, 01/09/2023, ,, 13:48:44, ], Barak, :, Chec... | [[, 01/09/2023, ,, 13:48:44, ], barak, :, chec... |
| https://www.linkedin.com/jobs/view/3628574195 | [https, :, //www.linkedin.com/jobs/view/362857... | [http, :, //www.linkedin.com/jobs/view/362857... | [http, :, //www.linkedin.com/jobs/view/362857... |
| ‎[01/09/2023, 14:28:49] Barak: ‎image omitted   | [‎, [, 01/09/2023, ,, 14:28:49, ], Barak, :, ‎... | [‎, [, 01/09/2023, ,, 14:28:49, ], Barak, :, ‎... | [‎, [, 01/09/2023, ,, 14:28:49, ], barak, :, ‎... |
| [01/09/2023, 14:33:59] עמרי ששון : אוווו         | [[, 01/09/2023, ,, 14:33:59, ], עמרי, ששון, :,... | [[, 01/09/2023, ,, 14:33:59, ], עמרי, ששון, :,... | [[, 01/09/2023, ,, 14:33:59, ], עמרי, ששון, :,... |
| [01/09/2023, 14:34:05] עמרי ששון : זה רשם אותך... | [[, 01/09/2023, ,, 14:34:05, ], עמרי, ששון, :,... | [[, 01/09/2023, ,, 14:34:05, ], עמרי, ששון, :,... | [[, 01/09/2023, ,, 14:34:05, ], עמרי, ששון, :,... |

---

## Task 11: Compare SpaCy Results (Tokenization, Lemmatization)

| **Message**                                   | **SpaCy_Tokens**                                | **SpaCy_Lemmatized**                             |
|-----------------------------------------------|------------------------------------------------|------------------------------------------------|
| [01/09/2023, 13:48:44] Barak: Check out this j... | [[, 01/09/2023, ,, 13:48:44, ], Barak, :, Chec... | [[, 01/09/2023, ,, 13:48:44, ], Barak, :, chec... |
| https://www.linkedin.com/jobs/view/3628574195 | [https://www.linkedin.com/jobs/view/3628574195] | [https://www.linkedin.com/jobs/view/3628574195] |
| ‎[01/09/2023, 14:28:49] Barak: ‎image omitted   | [‎[01/09/2023, ,, 14:28:49, ], Barak, :, ‎imag... | [‎[01/09/2023, ,, 14:28:49, ], Barak, :, ‎imag... |
| [01/09/2023, 14:33:59] עמרי ששון : אוווו         | [[, 01/09/2023, ,, 14:33:59, ], עמרי, ששון, :,... | [[, 01/09/2023, ,, 14:33:59, ], עמרי, ששון, :,... |
| [01/09/2023, 14:34:05] עמרי ששון : זה רשם אותך... | [[, 01/09/2023, ,, 14:34:05, ], עמרי, ששון, :,... | [[, 01/09/2023, ,, 14:34:05, ], עמרי, ששון, :,... |

---

## Summary

1. **Tokenization:** Differences mainly involve handling of special characters, URLs, and separators.
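2. **Lemmatization:** SpaCy changes more tokens (for example, it lowercases `Check`), while NLTK, called without part-of-speech tags, leaves most tokens unchanged. Many of the listed differences are inherited from tokenization.
3. **Stemming:** SpaCy has no stemmer; NLTK's `PorterStemmer` is applied to both libraries' lemmas, so the stemming differences reflect upstream tokenization and lemmatization differences.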
4. **Messages Affecting Tokens:** Tasks 8 and 9 identify rows where stemming and lemmatization affect token counts differently.
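Much of the tokenizer disagreement in this file comes from the timestamp and sender prefix rather than from the message text. One option, sketched below under the assumption that every message line follows the `[date, time] sender: text` layout seen in the tables above, is to split each line into fields first and tokenize only the text. `LINE_RE` and `parse_line` are hypothetical names, and the pattern is fitted to the handful of lines shown, not to a documented export format.

```python
import re

# Optional leading U+200E, then "[dd/mm/yyyy, hh:mm:ss] sender : text".
LINE_RE = re.compile(
    r"^\u200e?\[(?P<date>\d{2}/\d{2}/\d{4}), (?P<time>\d{2}:\d{2}:\d{2})\]\s"
    r"(?P<sender>[^:]+?)\s*:\s(?P<text>.*)$"
)


def parse_line(line):
    """Return the line's fields as a dict, or None if it has no header."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


# Row 3 of the chat, copied from the Task 10 table.
parsed = parse_line("[01/09/2023, 14:33:59] עמרי ששון : אוווו")
print(parsed["sender"], "|", parsed["text"])

# Lines without a header (such as the bare URL in row 1) return None and
# could be treated as a continuation of the previous message.
continuation = parse_line("https://www.linkedin.com/jobs/view/3628574195")
print(continuation)  # None
```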

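The cell above does not write any file. To keep the processed data, save it explicitly, for example with `data.to_csv('chat_processed_results.csv', index=False)`.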
