[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/M-Talha-Farooqi/NLP-CourseWork/blob/main/Code-Implementation/Step-1-Data-Pre-Processing-Data-Clean-Up/Step_1_Text_Pre_Processing.ipynb)

# 📘 NLP: Text Preprocessing

## 🔍 What is Text Preprocessing?

Text preprocessing is the process of converting raw, unstructured text into a clean and structured format suitable for analysis and machine learning. It involves various techniques to remove noise, standardize language, and extract useful components from text data.

---

## ❓ Why is Text Preprocessing Important?

- Real-world text contains noise: typos, punctuation, casing, extra spaces, and irrelevant words.
- Reduces vocabulary size, which often improves model accuracy.
- Makes data uniform and ready for feature extraction or modeling.
- Removes inconsistencies that may confuse machine learning algorithms.

---

## 🧰 Libraries Commonly Used for Text Preprocessing

| Library | Purpose |
|--------|---------|
| **NLTK** (Natural Language Toolkit) | Offers tools for tokenization, stemming, lemmatization, and stopword removal |
| **spaCy** | Fast and production-ready NLP library with built-in preprocessing capabilities |
| **re** | Regular expressions, useful for cleaning text (e.g., removing special characters) |
| **string** | Built-in Python module, useful for handling punctuation |
| **TextBlob** | Simple API for text processing, includes lemmatization and sentiment analysis |
| **gensim** | Useful for topic modeling, also provides some preprocessing utilities |

---

## 🔄 Major Text Preprocessing Steps

### 1. ✂️ Tokenization

**Definition**: Tokenization is the process of breaking text into smaller units called tokens. These can be words, subwords, or sentences.

- **Sentence Tokenization**: Splits a paragraph into individual sentences.
  - Tools: NLTK (`sent_tokenize`), spaCy (`doc.sents`)
- **Word Tokenization**: Splits a sentence into words.
  - Tools: NLTK (`word_tokenize`), spaCy (iterate over the `Doc`, e.g. `[token.text for token in doc]`)

**Example**:  
Text: "Text preprocessing is essential."  
Word Tokens: ['Text', 'preprocessing', 'is', 'essential', '.']
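
As a rough illustration of what a tokenizer does, the sketch below splits text with regular expressions from Python's standard library. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the patterns are far simpler than the ones NLTK or spaCy use (they would mishandle abbreviations such as "Dr."), so treat this as a sketch rather than a replacement.

```python
import re

def simple_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace (naive: breaks on "Dr. Smith")
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is a run of word characters (hyphenated words stay whole)
    # or a single punctuation character
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

print(simple_sent_tokenize("Text preprocessing is essential. It comes first."))
print(simple_word_tokenize("Text preprocessing is essential."))
# ['Text', 'preprocessing', 'is', 'essential', '.']
```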

---

### 2. 🔡 Lowercasing

**Definition**: Converts all characters in the text to lowercase.

**Why**: Reduces redundancy by treating "Apple" and "apple" as the same word.

**Example**:  
['Apple', 'apple', 'APPLE'] → ['apple', 'apple', 'apple']
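
A minimal sketch of the effect on vocabulary size, using only built-in string methods. The token list is made up for illustration. Note that lowercasing can also erase useful distinctions (for example "US" versus "us"), so whether to apply it depends on the task.

```python
tokens = ["Apple", "apple", "APPLE", "Straße"]

# lower() is the usual choice for English text
lowered = [t.lower() for t in tokens]

# casefold() is a more aggressive variant intended for caseless matching
folded = [t.casefold() for t in tokens]

print(len(set(tokens)), "distinct tokens before lowercasing")   # 4
print(len(set(lowered)), "distinct tokens after lower()")       # 2
print(folded)  # ['apple', 'apple', 'apple', 'strasse']
```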

---

### 3. ❌ Removing Punctuation

**Definition**: Removes symbols like commas, periods, exclamation marks, etc.

**Why**: For many NLP tasks, punctuation adds little meaningful information.

**Tools**:  
- `string.punctuation` from Python’s built-in `string` module  
- `re` module using regular expressions
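
Both tools can be sketched on a raw string as follows. This is a minimal illustration, not the only way to do it. Be aware that either approach also deletes hyphens and apostrophes, so "don't" becomes "dont".

```python
import re
import string

text = "Hello, world!"

# Option 1: str.translate with a table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)
no_punct_translate = text.translate(table)

# Option 2: a regular expression that removes anything that is
# neither a word character nor whitespace
no_punct_regex = re.sub(r"[^\w\s]", "", text)

print(no_punct_translate)  # Hello world
print(no_punct_regex)      # Hello world
```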

**Example**:  
"Hello, world!" → "Hello world"

---

### 4. 🛑 Stopword Removal

**Definition**: Stopwords are common words that do not carry significant meaning and are often removed to reduce noise.

**Examples of Stopwords**: is, am, are, the, in, on, and, or

**Tools**:  
- NLTK (`stopwords.words('english')`)  
- spaCy (`token.is_stop`)  
- gensim (`remove_stopwords`)

**Example**:  
Text: "This is a good movie"  
After stopword removal: "good movie"
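
The example above can be reproduced with a small hand-written stopword set. `STOP_WORDS` here is an illustrative assumption; the English list shipped with NLTK is much longer.

```python
# Tiny illustrative stopword list (an assumption for this sketch)
STOP_WORDS = {"this", "is", "a", "an", "the", "am", "are", "in", "on", "and", "or"}

def remove_stopwords(text):
    # Compare in lowercase so "This" matches "this"; kept words retain their casing
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stopwords("This is a good movie"))  # good movie
```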

---

### 5. 🌱 Stemming

**Definition**: Stemming reduces words to their root form by removing suffixes. It may not return valid dictionary words.

**Available Stemmers** in NLTK:
- **PorterStemmer**: Most commonly used, moderate stemming
- **LancasterStemmer**: More aggressive
- **SnowballStemmer**: An improved version of Porter, supports multiple languages

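**Example**:  
"running", "runs" → "run" (irregular forms such as "ran" are left unchanged)

**Note**: Stems are not always valid words (e.g., "universities" → "univers"). Stemming can also over-stem, collapsing unrelated words into one stem (e.g., "university" and "universe" → "univers").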
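
To make the idea of suffix stripping concrete, here is a deliberately naive stemmer written with plain Python. `naive_stem` and its suffix list are hypothetical and are not the Porter algorithm, so its stems will differ from NLTK's. It only shows the mechanism: rules are applied to the surface form, without a dictionary.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

print([naive_stem(w) for w in ["running", "runs", "ran"]])  # ['run', 'run', 'ran']
print(naive_stem("universities"))                           # universiti
```

Because no suffix rule matches "ran", it comes back unchanged, and "universiti" shows that a stem need not be a dictionary word.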

---

### 6. 🍋 Lemmatization

**Definition**: Lemmatization also reduces words to their base form (lemma), but uses a dictionary and considers the part of speech.

**Difference from Stemming**:  
- Lemmatization is slower but more accurate.
- Lemmatization returns actual dictionary words.
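
The role of the dictionary and the part of speech can be illustrated with a toy lookup table. `LEMMAS` and `toy_lemmatize` are hypothetical names, and the handful of entries is an assumption made for this sketch; real lemmatizers rely on large lexical resources such as WordNet.

```python
# Tiny hand-written lemma dictionary keyed by (word, part of speech)
LEMMAS = {
    ("running", "verb"): "run",
    ("ran", "verb"): "run",
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}

def toy_lemmatize(word, pos):
    # Fall back to the lowercased word itself when the pair is unknown
    return LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("meeting", "noun"))  # meeting
print(toy_lemmatize("meeting", "verb"))  # meet
```

The same surface form ("meeting") maps to different lemmas depending on the part of speech, which is something a suffix-stripping stemmer cannot express.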

**Available Lemmatizers**:
- **WordNetLemmatizer** (in NLTK)
- **spaCy built-in lemmatizer**
- **TextBlob lemmatizer**

**Example**:  
"running" (verb) → "run"  
"better" (adjective) → "good"

---

## 📋 Summary of Preprocessing Techniques

| Step | Description | Tools/Libraries |
|------|-------------|-----------------|
| Tokenization | Split text into words/sentences | NLTK, spaCy, TextBlob |
| Lowercasing | Convert all tokens to lowercase | Python built-in |
| Remove Punctuation | Delete special symbols | string, re |
| Stopword Removal | Remove common uninformative words | NLTK, spaCy, gensim |
| Stemming | Reduce word to root form (not always valid word) | Porter, Lancaster, Snowball |
| Lemmatization | Reduce word to base form using dictionary | WordNet, spaCy, TextBlob |
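
The first four rows of the table can be chained into one function. The sketch below uses only the standard library; `preprocess`, the regular expression, and the small `STOP_WORDS` set are assumptions made for illustration, and stemming or lemmatization would be an additional final step.

```python
import re
import string

STOP_WORDS = {"a", "an", "the", "is", "are", "this", "in", "on", "and", "or", "of", "to"}  # illustrative subset
PUNCTUATION = set(string.punctuation)

def preprocess(text):
    # 1. Tokenize into words and single punctuation marks
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
    # 2. Lowercase
    tokens = [t.lower() for t in tokens]
    # 3. Drop tokens that consist only of punctuation
    tokens = [t for t in tokens if not all(c in PUNCTUATION for c in t)]
    # 4. Drop stopwords
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("This is a good movie, and the plot is great!"))
# ['good', 'movie', 'plot', 'great']
```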

---

## ✅ Final Notes

- The choice of techniques depends on the NLP task.
- For simple or rule-based tasks (e.g., keyword matching), stemming is often enough.
- When word meaning matters or the output must consist of valid words, lemmatization is usually preferred.
- Proper preprocessing leads to better accuracy, reduced noise, and lower dimensionality in models.



# Install the Required Libraries

In [1]:
!pip install nltk



# Import the Required Libraries

In [2]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.3-py3-none-any.whl.metadata (1.6 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.3-py3-none-any.whl (345 kB)
Downloading pyahocorasick-2.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)


In [3]:
import pandas as pd
import nltk

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

from nltk.stem import WordNetLemmatizer, PorterStemmer


import string

import re


import contractions
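
The notebook installs and imports `contractions`, but none of the cells shown here call it. Contraction expansion normally runs on the raw text before tokenization (the library exposes `contractions.fix(text)` for this). The sketch below imitates the idea with the standard library only; `CONTRACTION_MAP` and `expand_contractions` are hypothetical names, and the mapping is a tiny assumption-laden subset (for instance, "it's" can also mean "it has").

```python
import re

CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "they're": "they are",
}

# One alternation pattern, longest keys first, matched case-insensitively on word boundaries
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(CONTRACTION_MAP, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Look up the lowercase form; the expansion is always emitted in lowercase
    return _pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions("They're sure it won't work, but don't stop."))
# they are sure it will not work, but do not stop.
```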

# Download necessary NLTK data if not already available

In [4]:

nltk.download('stopwords')

# Printing the stopwords in English
stop_words = set(stopwords.words('english'))
print("Stop words:", stop_words)


Stop words: {'because', 'but', 'themselves', "they're", 'up', 'here', 'other', 'with', 'below', 'only', 'those', 'same', "she'll", "you're", 'these', 'at', 'o', 'into', 'that', 'have', 'few', 't', "they've", 'mustn', 'theirs', 'having', 'under', 've', "you'd", 'not', 'through', "shan't", 'should', 'do', 'ain', "it's", "she's", 'their', 'isn', 'than', 'of', 'after', 'ours', 'didn', 'which', 'against', 'don', 'further', 'mightn', 'any', "haven't", "we've", 'when', 'you', "needn't", 're', 'its', 'most', "they'd", 'hers', "i'll", "doesn't", 'herself', 'out', 'be', 'each', 'an', 'a', 'how', 'can', 'more', 'll', 'for', "isn't", 'now', 'shouldn', 'down', "mightn't", 'own', 'were', 'aren', 'been', 'just', 'then', "wasn't", 'all', "she'd", 'why', 'from', 'too', 'hadn', 'needn', 'couldn', 'she', 'once', 'if', "you've", 'had', "that'll", 'nor', "aren't", "didn't", 'what', 'before', 'was', "i'd", 'wouldn', 'between', 'by', 'himself', 'yours', 'myself', 'doesn', 'over', 'in', 'to', 'we', 'no', 'its

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Create a Dummy Dataset

In [6]:
# Create the DataFrame with terrorism-related data
data = {
    'ID': [1, 2, 3, 4, 5],
    'Text': [
        "A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.",
        "The terrorist group claimed responsibility for the attack on the military checkpoint.",
        "Security forces arrested suspects linked to a plot targeting public transportation.",
        "The use of improvised explosive devices has increased in recent insurgent activities.",
        "Counter-terrorism units conducted a raid on a suspected militant hideout in the region."
    ]
}


df = pd.DataFrame(data)
# Adjusting pandas options to display full content in the dataframe
pd.set_option('display.max_colwidth', None)

df

Unnamed: 0,ID,Text
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.
2,3,Security forces arrested suspects linked to a plot targeting public transportation.
3,4,The use of improvised explosive devices has increased in recent insurgent activities.
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.


In [7]:
df

Unnamed: 0,ID,Text
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.
2,3,Security forces arrested suspects linked to a plot targeting public transportation.
3,4,The use of improvised explosive devices has increased in recent insurgent activities.
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.


In [8]:
# Save to CSV
df.to_csv("counter_terrorism_dataset.csv", index=False)

In [9]:
# Read the CSV back
df = pd.read_csv("counter_terrorism_dataset.csv")

df

Unnamed: 0,ID,Text
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.
2,3,Security forces arrested suspects linked to a plot targeting public transportation.
3,4,The use of improvised explosive devices has increased in recent insurgent activities.
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.


In [10]:
df['Text']

Unnamed: 0,Text
0,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.
1,The terrorist group claimed responsibility for the attack on the military checkpoint.
2,Security forces arrested suspects linked to a plot targeting public transportation.
3,The use of improvised explosive devices has increased in recent insurgent activities.
4,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.


In [11]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Apply sentence tokenization to each row
df['sentences'] = df['Text'].apply(sent_tokenize)
df

Unnamed: 0,ID,Text,sentences
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]"
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.]
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.]
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.]
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.]


In [12]:
df['tokens'] = df['Text'].apply(word_tokenize)

df

Unnamed: 0,ID,Text,sentences,tokens
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]","[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]"
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.],"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]"
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.],"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]"
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.],"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]"
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.],"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]"


In [13]:
# Show example
for idx, row in df.iterrows():
    print(f"\n🔹 Title: {row['Text']}")
    print("📌 Sentences:", row['sentences'])
    print("📌 Word Tokens:", row['tokens'])


🔹 Title: A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.
📌 Sentences: ['A bombing occurred in downtown Baghdad.', 'Resulting in several casualties and injuries.']
📌 Word Tokens: ['A', 'bombing', 'occurred', 'in', 'downtown', 'Baghdad', '.', 'Resulting', 'in', 'several', 'casualties', 'and', 'injuries', '.']

🔹 Title: The terrorist group claimed responsibility for the attack on the military checkpoint.
📌 Sentences: ['The terrorist group claimed responsibility for the attack on the military checkpoint.']
📌 Word Tokens: ['The', 'terrorist', 'group', 'claimed', 'responsibility', 'for', 'the', 'attack', 'on', 'the', 'military', 'checkpoint', '.']

🔹 Title: Security forces arrested suspects linked to a plot targeting public transportation.
📌 Sentences: ['Security forces arrested suspects linked to a plot targeting public transportation.']
📌 Word Tokens: ['Security', 'forces', 'arrested', 'suspects', 'linked', 'to', 'a', 'plot'

In [14]:
df

Unnamed: 0,ID,Text,sentences,tokens
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]","[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]"
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.],"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]"
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.],"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]"
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.],"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]"
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.],"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]"


In [15]:
df['tokens']

Unnamed: 0,tokens
0,"[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]"
1,"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]"
2,"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]"
3,"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]"
4,"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]"


# Function to remove punctuation

In [16]:
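import string

# Build a set of punctuation characters once for fast, exact lookups
PUNCTUATION = set(string.punctuation)

# Define the function to remove punctuation
def remove_punctuation(tokens):
    # Drop tokens made up only of punctuation (".", "...", "``");
    # words containing a hyphen, such as "Counter-terrorism", are kept
    return [word for word in tokens if not all(ch in PUNCTUATION for ch in word)]

# Apply the function to the 'tokens' column
df['cleantokens'] = df['tokens'].apply(remove_punctuation)
df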

Unnamed: 0,ID,Text,sentences,tokens,cleantokens
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]","[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]","[A, bombing, occurred, in, downtown, Baghdad, Resulting, in, several, casualties, and, injuries]"
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.],"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]","[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint]"
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.],"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]","[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation]"
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.],"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]","[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities]"
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.],"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]","[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region]"
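
A detail worth knowing when filtering tokens against `string.punctuation`: it is a plain string, so `word in string.punctuation` is a substring test. Single characters behave as expected, but multi-character punctuation tokens (NLTK's tokenizer can emit "..." or "``") are not substrings and would slip through. The token list below is made up for illustration.

```python
import string

tokens = ["Baghdad", ".", "...", "``", "Counter-terrorism"]

# Substring test: asks whether the whole token occurs inside the punctuation string
substring_filter = [w for w in tokens if w not in string.punctuation]

# Character-level test: drop a token only if every character is punctuation
punct = set(string.punctuation)
charwise_filter = [w for w in tokens if not all(ch in punct for ch in w)]

print(substring_filter)  # ['Baghdad', '...', '``', 'Counter-terrorism']
print(charwise_filter)   # ['Baghdad', 'Counter-terrorism']
```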


# Removal of Stopwords

In [17]:
import pandas as pd
from nltk.corpus import stopwords

# Make sure you have the stopwords downloaded
# import nltk
# nltk.download('stopwords')

# Get the set of English stopwords
stop_words = set(stopwords.words('english'))

# Define a function to remove stopwords
def remove_stopwords(tokens):
    # Compare in lowercase so capitalized stopwords such as "The" are caught
    return [word for word in tokens if word.lower() not in stop_words]

# Apply the function to the 'cleantokens' column (punctuation already removed)
df['cleantokens'] = df['cleantokens'].apply(remove_stopwords)
df

Unnamed: 0,ID,Text,sentences,tokens,cleantokens
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]","[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualties, injuries]"
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.],"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]"
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.],"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]","[Security, forces, arrested, suspects, linked, plot, targeting, public, transportation]"
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.],"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]","[use, improvised, explosive, devices, increased, recent, insurgent, activities]"
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.],"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]","[Counter-terrorism, units, conducted, raid, suspected, militant, hideout, region]"
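
The NLTK list printed earlier contains negations such as 'not', 'no' and 'nor'. For tasks like sentiment analysis, removing them can reverse the meaning of a sentence, so a few words are often subtracted from the stopword set first. The sketch uses a small stand-in set so that it runs without NLTK data; `BASE_STOP_WORDS` and `KEEP` are hypothetical names.

```python
BASE_STOP_WORDS = {"this", "is", "a", "the", "not", "no", "nor", "was"}  # stand-in subset
KEEP = {"not", "no", "nor"}

# Set difference leaves every stopword except the negations
custom_stop_words = BASE_STOP_WORDS - KEEP

def remove_stopwords(tokens, stop_words):
    return [w for w in tokens if w.lower() not in stop_words]

tokens = ["This", "movie", "was", "not", "good"]
print(remove_stopwords(tokens, BASE_STOP_WORDS))    # ['movie', 'good']
print(remove_stopwords(tokens, custom_stop_words))  # ['movie', 'not', 'good']
```

With the real list, the same subtraction would be applied to `set(stopwords.words('english'))`.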


In [18]:
import pandas as pd
from nltk.stem import WordNetLemmatizer

# Make sure you have the WordNet data downloaded
# import nltk
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

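# Define the lemmatization function
def lemmatize_text(tokens):
    # lemmatize() treats every word as a noun unless a part of speech is passed,
    # so verb forms such as "occurred" are returned unchanged
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply it to the 'cleantokens' column
df['Lemmatokens'] = df['cleantokens'].apply(lemmatize_text)
df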

Unnamed: 0,ID,Text,sentences,tokens,cleantokens,Lemmatokens
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]","[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualties, injuries]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualty, injury]"
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.],"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]"
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.],"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]","[Security, forces, arrested, suspects, linked, plot, targeting, public, transportation]","[Security, force, arrested, suspect, linked, plot, targeting, public, transportation]"
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.],"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]","[use, improvised, explosive, devices, increased, recent, insurgent, activities]","[use, improvised, explosive, device, increased, recent, insurgent, activity]"
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.],"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]","[Counter-terrorism, units, conducted, raid, suspected, militant, hideout, region]","[Counter-terrorism, unit, conducted, raid, suspected, militant, hideout, region]"


In [19]:
import pandas as pd
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Define the stemming function
def stem_text(tokens):
    return [stemmer.stem(word) for word in tokens]

# Apply it to the 'cleantokens' column
df['Stemtokens'] = df['cleantokens'].apply(stem_text)
df

Unnamed: 0,ID,Text,sentences,tokens,cleantokens,Lemmatokens,Stemtokens
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]","[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualties, injuries]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualty, injury]","[bomb, occur, downtown, baghdad, result, sever, casualti, injuri]"
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.],"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]","[terrorist, group, claim, respons, attack, militari, checkpoint]"
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.],"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]","[Security, forces, arrested, suspects, linked, plot, targeting, public, transportation]","[Security, force, arrested, suspect, linked, plot, targeting, public, transportation]","[secur, forc, arrest, suspect, link, plot, target, public, transport]"
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.],"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]","[use, improvised, explosive, devices, increased, recent, insurgent, activities]","[use, improvised, explosive, device, increased, recent, insurgent, activity]","[use, improvis, explos, devic, increas, recent, insurg, activ]"
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.],"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]","[Counter-terrorism, units, conducted, raid, suspected, militant, hideout, region]","[Counter-terrorism, unit, conducted, raid, suspected, militant, hideout, region]","[counter-terror, unit, conduct, raid, suspect, milit, hideout, region]"


In [20]:
# Define the function to convert tokens to lowercase
def lowercase_tokens(tokens):
    return [word.lower() for word in tokens]

# Apply the function to the 'cleantokens' column
df['lowercasetokens'] = df['cleantokens'].apply(lowercase_tokens)
df

Unnamed: 0,ID,Text,sentences,tokens,cleantokens,Lemmatokens,Stemtokens,lowercasetokens
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]","[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualties, injuries]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualty, injury]","[bomb, occur, downtown, baghdad, result, sever, casualti, injuri]","[bombing, occurred, downtown, baghdad, resulting, several, casualties, injuries]"
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.],"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]","[terrorist, group, claim, respons, attack, militari, checkpoint]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]"
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.],"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]","[Security, forces, arrested, suspects, linked, plot, targeting, public, transportation]","[Security, force, arrested, suspect, linked, plot, targeting, public, transportation]","[secur, forc, arrest, suspect, link, plot, target, public, transport]","[security, forces, arrested, suspects, linked, plot, targeting, public, transportation]"
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.],"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]","[use, improvised, explosive, devices, increased, recent, insurgent, activities]","[use, improvised, explosive, device, increased, recent, insurgent, activity]","[use, improvis, explos, devic, increas, recent, insurg, activ]","[use, improvised, explosive, devices, increased, recent, insurgent, activities]"
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.],"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]","[Counter-terrorism, units, conducted, raid, suspected, militant, hideout, region]","[Counter-terrorism, unit, conducted, raid, suspected, militant, hideout, region]","[counter-terror, unit, conduct, raid, suspect, milit, hideout, region]","[counter-terrorism, units, conducted, raid, suspected, militant, hideout, region]"


In [21]:
# Define the function to join tokens into a single string
def join_tokens(tokens):
    return ' '.join(tokens)

# Join the lowercased tokens into one string per row.
# Note: despite its name, 'clean_Lemma_text' is built from 'lowercasetokens', not 'Lemmatokens'
df['clean_Lemma_text'] = df['lowercasetokens'].apply(join_tokens)
df

Unnamed: 0,ID,Text,sentences,tokens,cleantokens,Lemmatokens,Stemtokens,lowercasetokens,clean_Lemma_text
0,1,A bombing occurred in downtown Baghdad. Resulting in several casualties and injuries.,"[A bombing occurred in downtown Baghdad., Resulting in several casualties and injuries.]","[A, bombing, occurred, in, downtown, Baghdad, ., Resulting, in, several, casualties, and, injuries, .]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualties, injuries]","[bombing, occurred, downtown, Baghdad, Resulting, several, casualty, injury]","[bomb, occur, downtown, baghdad, result, sever, casualti, injuri]","[bombing, occurred, downtown, baghdad, resulting, several, casualties, injuries]",bombing occurred downtown baghdad resulting several casualties injuries
1,2,The terrorist group claimed responsibility for the attack on the military checkpoint.,[The terrorist group claimed responsibility for the attack on the military checkpoint.],"[The, terrorist, group, claimed, responsibility, for, the, attack, on, the, military, checkpoint, .]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]","[terrorist, group, claim, respons, attack, militari, checkpoint]","[terrorist, group, claimed, responsibility, attack, military, checkpoint]",terrorist group claimed responsibility attack military checkpoint
2,3,Security forces arrested suspects linked to a plot targeting public transportation.,[Security forces arrested suspects linked to a plot targeting public transportation.],"[Security, forces, arrested, suspects, linked, to, a, plot, targeting, public, transportation, .]","[Security, forces, arrested, suspects, linked, plot, targeting, public, transportation]","[Security, force, arrested, suspect, linked, plot, targeting, public, transportation]","[secur, forc, arrest, suspect, link, plot, target, public, transport]","[security, forces, arrested, suspects, linked, plot, targeting, public, transportation]",security forces arrested suspects linked plot targeting public transportation
3,4,The use of improvised explosive devices has increased in recent insurgent activities.,[The use of improvised explosive devices has increased in recent insurgent activities.],"[The, use, of, improvised, explosive, devices, has, increased, in, recent, insurgent, activities, .]","[use, improvised, explosive, devices, increased, recent, insurgent, activities]","[use, improvised, explosive, device, increased, recent, insurgent, activity]","[use, improvis, explos, devic, increas, recent, insurg, activ]","[use, improvised, explosive, devices, increased, recent, insurgent, activities]",use improvised explosive devices increased recent insurgent activities
4,5,Counter-terrorism units conducted a raid on a suspected militant hideout in the region.,[Counter-terrorism units conducted a raid on a suspected militant hideout in the region.],"[Counter-terrorism, units, conducted, a, raid, on, a, suspected, militant, hideout, in, the, region, .]","[Counter-terrorism, units, conducted, raid, suspected, militant, hideout, region]","[Counter-terrorism, unit, conducted, raid, suspected, militant, hideout, region]","[counter-terror, unit, conduct, raid, suspect, milit, hideout, region]","[counter-terrorism, units, conducted, raid, suspected, militant, hideout, region]",counter-terrorism units conducted raid suspected militant hideout region


In [22]:
# Save to CSV
df.to_csv("cleaned_counter_terrorism_dataset.csv", index=False)
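
One caveat when saving this DataFrame: CSV has no list type, so list-valued columns such as `tokens` and `cleantokens` are written as their string representation and come back as plain strings. The sketch below reproduces the effect with the standard `csv` module (no pandas needed) and parses the string back with `ast.literal_eval`; the sample row is made up for illustration.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID", "cleantokens"])
writer.writerow([1, ["bombing", "occurred", "downtown"]])

# Read it back: the list has become a string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["cleantokens"]
print(type(raw).__name__, raw)        # str ['bombing', 'occurred', 'downtown']

# ast.literal_eval safely parses the string back into a list
tokens = ast.literal_eval(raw)
print(type(tokens).__name__, tokens)  # list ['bombing', 'occurred', 'downtown']
```

After reloading the file with `pd.read_csv`, the same conversion can be applied per column, for example `df['cleantokens'].apply(ast.literal_eval)`.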