
# DATA LOADING

In [None]:
import pandas as pd
df = pd.read_csv('arxiv_data.csv', engine='python', nrows=1000)
print("DataFrame loaded successfully with the first 1000 rows.")
print(df.head())

DataFrame loaded successfully with the first 1000 rows.
                                              titles  \
0  Survey on Semantic Stereo Matching / Semantic ...   
1  FUTURE-AI: Guiding Principles and Consensus Re...   
2  Enforcing Mutual Consistency of Hard Regions f...   
3  Parameter Decoupling Strategy for Semi-supervi...   
4  Background-Foreground Segmentation for Interio...   

                                           summaries  \
0  Stereo matching is one of the widely used tech...   
1  The recent advancements in artificial intellig...   
2  In this paper, we proposed a novel mutual cons...   
3  Consistency training has proven to be an advan...   
4  To ensure safety in automated driving, the cor...   

                         terms  
0           ['cs.CV', 'cs.LG']  
1  ['cs.CV', 'cs.AI', 'cs.LG']  
2           ['cs.CV', 'cs.AI']  
3                    ['cs.CV']  
4           ['cs.CV', 'cs.LG']  
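
The `terms` column prints like a list, but a value read from a CSV file is normally a plain string such as `"['cs.CV', 'cs.LG']"`. The following is a minimal sketch, assuming that is the case here, of turning such strings into real Python lists with the standard library. The sample values are copied from the output above; `raw_terms` and `parsed_terms` are illustrative names, not part of the notebook.

```python
import ast

# Sample values copied from the 'terms' column shown in the output above.
raw_terms = ["['cs.CV', 'cs.LG']", "['cs.CV', 'cs.AI', 'cs.LG']", "['cs.CV']"]

# ast.literal_eval parses a Python literal (here, a list of strings)
# without executing arbitrary code, unlike eval().
parsed_terms = [ast.literal_eval(value) for value in raw_terms]

print(parsed_terms[0])        # ['cs.CV', 'cs.LG']
print(type(parsed_terms[0]))  # <class 'list'>
```

On the DataFrame itself the same idea would read `df['terms'].apply(ast.literal_eval)`, again assuming every cell holds a valid list literal.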


**DATA CLEANING**

In [None]:
import re

def preprocess_text(text):
    # 1. Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    text = re.sub(r'www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

    # 2. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # 3. Remove social media mentions
    text = re.sub(r'@\w+', '', text)

    # 4. Remove hashtags
    text = re.sub(r'#\w+', '', text)

    # 5. Convert to lowercase
    text = text.lower()

    # 6. Remove emojis
    emoji_pattern = re.compile(
        "[" # Start character class
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # very broad range: also covers CJK and other non-emoji characters
        "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # 7. Remove any remaining special characters (keeping only alphanumeric and spaces)
    text = re.sub(r'[^a-z0-9\s]', '', text)

    # 8. Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

df['processed_summaries'] = df['summaries'].apply(preprocess_text)

print("Preprocessing complete. Displaying the first few processed summaries:")
for i, summary in enumerate(df['processed_summaries'].head()):
    print(f"Original: {df['summaries'].iloc[i]}")
    print(f"Processed: {summary}\n")


Preprocessing complete. Displaying the first few processed summaries:
Original: Stereo matching is one of the widely used techniques for inferring depth from
stereo images owing to its robustness and speed. It has become one of the major
topics of research since it finds its applications in autonomous driving,
robotic navigation, 3D reconstruction, and many other fields. Finding pixel
correspondences in non-textured, occluded and reflective areas is the major
challenge in stereo matching. Recent developments have shown that semantic cues
from image segmentation can be used to improve the results of stereo matching.
Many deep neural network architectures have been proposed to leverage the
advantages of semantic segmentation in stereo matching. This paper aims to give
a comparison among the state of art networks both in terms of accuracy and in
terms of speed which are of higher importance in real-time applications.
Processed: stereo matching is one of the widely used techniques for infe
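
Step 7 of the cleaning function deletes hyphens outright, so `non-textured` and `real-time` become `nontextured` and `realtime` in the processed summaries. If keeping the word boundary matters, one option is to turn hyphens into spaces before stripping the remaining characters. The sketch below illustrates that alternative under this assumption; it is not part of the original notebook, and `strip_special_keep_boundaries` is a hypothetical helper name.

```python
import re

def strip_special(text):
    # Step 7 as written in the notebook: keep only a-z, 0-9 and whitespace.
    return re.sub(r'[^a-z0-9\s]', '', text)

def strip_special_keep_boundaries(text):
    # Hypothetical variant: replace hyphens and slashes with a space first,
    # so compound words split into two tokens instead of fusing into one.
    text = re.sub(r'[-/]', ' ', text)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "non-textured areas in real-time applications"
print(strip_special(sample))
# nontextured areas in realtime applications
print(strip_special_keep_boundaries(sample))
# non textured areas in real time applications
```

Which behaviour is better depends on the downstream task: fused tokens keep the compound together as a single vocabulary entry, while split tokens match the individual words elsewhere in the corpus.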

**Word Tokenization (NLTK)**

In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer models only if they are not already present.
# Newer NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        nltk.download(resource)

# Apply word tokenization
df['tokenized_summaries'] = df['processed_summaries'].apply(word_tokenize)

print("Word tokenization complete. Displaying the first few tokenized summaries:")
for i, tokens in enumerate(df['tokenized_summaries'].head()):
    print(f"Original Processed Summary: {df['processed_summaries'].iloc[i]}")
    print(f"Tokenized Summary: {tokens}\n")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Word tokenization complete. Displaying the first few tokenized summaries:
Original Processed Summary: stereo matching is one of the widely used techniques for inferring depth from stereo images owing to its robustness and speed it has become one of the major topics of research since it finds its applications in autonomous driving robotic navigation 3d reconstruction and many other fields finding pixel correspondences in nontextured occluded and reflective areas is the major challenge in stereo matching recent developments have shown that semantic cues from image segmentation can be used to improve the results of stereo matching many deep neural network architectures have been proposed to leverage the advantages of semantic segmentation in stereo matching this paper aims to give a comparison among the state of art networks both in terms of accuracy and in terms of speed which are of higher importance in realtime applications
Tokenized Summary: ['stereo', 'matching', 'is', 'one', 'of', '

**Stopword Removal (NLTK)**

In [None]:
from nltk.corpus import stopwords

# Download the 'stopwords' corpus if not already downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

# Apply stopword removal
df['filtered_summaries'] = df['tokenized_summaries'].apply(remove_stopwords)

print("Stopword removal complete. Displaying the first few filtered summaries:")
for i, filtered_tokens in enumerate(df['filtered_summaries'].head()):
    print(f"Original Tokenized Summary: {df['tokenized_summaries'].iloc[i]}")
    print(f"Filtered Summary: {filtered_tokens}\n")

Stopword removal complete. Displaying the first few filtered summaries:
Original Tokenized Summary: ['stereo', 'matching', 'is', 'one', 'of', 'the', 'widely', 'used', 'techniques', 'for', 'inferring', 'depth', 'from', 'stereo', 'images', 'owing', 'to', 'its', 'robustness', 'and', 'speed', 'it', 'has', 'become', 'one', 'of', 'the', 'major', 'topics', 'of', 'research', 'since', 'it', 'finds', 'its', 'applications', 'in', 'autonomous', 'driving', 'robotic', 'navigation', '3d', 'reconstruction', 'and', 'many', 'other', 'fields', 'finding', 'pixel', 'correspondences', 'in', 'nontextured', 'occluded', 'and', 'reflective', 'areas', 'is', 'the', 'major', 'challenge', 'in', 'stereo', 'matching', 'recent', 'developments', 'have', 'shown', 'that', 'semantic', 'cues', 'from', 'image', 'segmentation', 'can', 'be', 'used', 'to', 'improve', 'the', 'results', 'of', 'stereo', 'matching', 'many', 'deep', 'neural', 'network', 'architectures', 'have', 'been', 'proposed', 'to', 'leverage', 'the', 'advantag

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
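
The filter above is a plain set-membership test, so what it removes depends entirely on the contents of `stop_words`. NLTK's English list includes negations such as `not`, which can matter for some tasks. The sketch below uses a deliberately tiny, hypothetical stopword set (not NLTK's list) to show the effect of removing or keeping a negation; `remove_stopwords_demo` and `sample_stop_words` are illustrative names.

```python
# Hypothetical miniature stopword set, for illustration only;
# the notebook uses NLTK's full English list instead.
sample_stop_words = {"is", "the", "as", "not"}

def remove_stopwords_demo(tokens, stop_words):
    # Same list comprehension as remove_stopwords, with the set passed in explicitly.
    return [word for word in tokens if word not in stop_words]

tokens = ["stereo", "matching", "is", "not", "the", "same", "as", "depth", "estimation"]

filtered = remove_stopwords_demo(tokens, sample_stop_words)
print(filtered)
# ['stereo', 'matching', 'same', 'depth', 'estimation']

# Set difference keeps the negation while still removing the other stopwords.
kept_negation = remove_stopwords_demo(tokens, sample_stop_words - {"not"})
print(kept_negation)
# ['stereo', 'matching', 'not', 'same', 'depth', 'estimation']
```

The same set arithmetic applies to the real list, for example `stop_words - {'not'}` to keep a word or `stop_words | {'paper'}` to drop an extra one.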


**Lemmatization (NLTK)**

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download the 'wordnet' corpus if not already downloaded
try:
    wordnet.ensure_loaded()
except LookupError:
    nltk.download('wordnet')

# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Function to perform lemmatization
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply lemmatization
df['lemmatized_summaries'] = df['filtered_summaries'].apply(lemmatize_tokens)

print("Lemmatization complete. Displaying the first few lemmatized summaries:")
for i, lemmatized_tokens in enumerate(df['lemmatized_summaries'].head()):
    print(f"Original Filtered Summary: {df['filtered_summaries'].iloc[i]}")
    print(f"Lemmatized Summary: {lemmatized_tokens}\n")

[nltk_data] Downloading package wordnet to /root/nltk_data...


Lemmatization complete. Displaying the first few lemmatized summaries:
Original Filtered Summary: ['stereo', 'matching', 'one', 'widely', 'used', 'techniques', 'inferring', 'depth', 'stereo', 'images', 'owing', 'robustness', 'speed', 'become', 'one', 'major', 'topics', 'research', 'since', 'finds', 'applications', 'autonomous', 'driving', 'robotic', 'navigation', '3d', 'reconstruction', 'many', 'fields', 'finding', 'pixel', 'correspondences', 'nontextured', 'occluded', 'reflective', 'areas', 'major', 'challenge', 'stereo', 'matching', 'recent', 'developments', 'shown', 'semantic', 'cues', 'image', 'segmentation', 'used', 'improve', 'results', 'stereo', 'matching', 'many', 'deep', 'neural', 'network', 'architectures', 'proposed', 'leverage', 'advantages', 'semantic', 'segmentation', 'stereo', 'matching', 'paper', 'aims', 'give', 'comparison', 'among', 'state', 'art', 'networks', 'terms', 'accuracy', 'terms', 'speed', 'higher', 'importance', 'realtime', 'applications']
Lemmatized Summary

**Rejoining Words**

In [None]:
df['clean_summaries'] = df['lemmatized_summaries'].apply(' '.join)

print("Rejoining words complete. Displaying the first few clean summaries:")
for i, clean_summary in enumerate(df['clean_summaries'].head()):
    print(f"Original Lemmatized Summary: {df['lemmatized_summaries'].iloc[i]}")
    print(f"Clean Summary: {clean_summary}\n")

Rejoining words complete. Displaying the first few clean summaries:
Original Lemmatized Summary: ['stereo', 'matching', 'one', 'widely', 'used', 'technique', 'inferring', 'depth', 'stereo', 'image', 'owing', 'robustness', 'speed', 'become', 'one', 'major', 'topic', 'research', 'since', 'find', 'application', 'autonomous', 'driving', 'robotic', 'navigation', '3d', 'reconstruction', 'many', 'field', 'finding', 'pixel', 'correspondence', 'nontextured', 'occluded', 'reflective', 'area', 'major', 'challenge', 'stereo', 'matching', 'recent', 'development', 'shown', 'semantic', 'cue', 'image', 'segmentation', 'used', 'improve', 'result', 'stereo', 'matching', 'many', 'deep', 'neural', 'network', 'architecture', 'proposed', 'leverage', 'advantage', 'semantic', 'segmentation', 'stereo', 'matching', 'paper', 'aim', 'give', 'comparison', 'among', 'state', 'art', 'network', 'term', 'accuracy', 'term', 'speed', 'higher', 'importance', 'realtime', 'application']
Clean Summary: stereo matching one wi

**Unified NLTK Preprocessing Pipeline Function**

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the required NLTK data once, up front. Checking inside the
# function would repeat the lookup (and the download log) for every row.
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet', 'omw-1.4'):
    nltk.download(resource)

# Build reusable objects once instead of on every call.
EMOJI_PATTERN = re.compile(
    "[" # Start character class
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+", flags=re.UNICODE)
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def nltk_preprocessing_pipeline(text):
    # 1. Initial text cleaning (same steps as the preprocess_text function)
    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    text = re.sub(r'www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove social media mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove emojis
    text = EMOJI_PATTERN.sub('', text)
    # Remove any remaining special characters (keeping only alphanumeric and spaces)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # 2. Word tokenization
    tokens = word_tokenize(text)

    # 3. Stopword removal
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]

    # 4. Lemmatization (default part of speech: noun)
    lemmatized_tokens = [LEMMATIZER.lemmatize(word) for word in filtered_tokens]

    # 5. Rejoin words
    return ' '.join(lemmatized_tokens)

# Apply the unified pipeline to the original 'summaries' column
df['clean_summaries_pipeline'] = df['summaries'].apply(nltk_preprocessing_pipeline)

print("Unified NLTK Preprocessing Pipeline complete. Displaying the first few results:")
for i in range(5):
    print(f"Original Summary: {df['summaries'].iloc[i]}")
    print(f"Cleaned (Step-by-step): {df['clean_summaries'].iloc[i]}")
    print(f"Cleaned (Pipeline): {df['clean_summaries_pipeline'].iloc[i]}\n")

# Verify consistency
consistency_check = (df['clean_summaries'] == df['clean_summaries_pipeline']).all()
print(f"Consistency with step-by-step approach verified: {consistency_check}")

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...

Unified NLTK Preprocessing Pipeline complete. Displaying the first few results:
Original Summary: Stereo matching is one of the widely used techniques for inferring depth from
stereo images owing to its robustness and speed. It has become one of the major
topics of research since it finds its applications in autonomous driving,
robotic navigation, 3D reconstruction, and many other fields. Finding pixel
correspondences in non-textured, occluded and reflective areas is the major
challenge in stereo matching. Recent developments have shown that semantic cues
from image segmentation can be used to improve the results of stereo matching.
Many deep neural network architectures have been proposed to leverage the
advantages of semantic segmentation in stereo matching. This paper aims to give
a comparison among the state of art networks both in terms of accuracy and in
terms of speed which are of higher importance in real-time applications.
Cleaned (Step-by-step): stereo matching one widely use
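
The consistency check reduces the comparison to a single boolean, which does not say where the two approaches disagree when the result is `False`. A minimal sketch of listing the differing positions is given here, using plain Python lists as hypothetical stand-ins for the two DataFrame columns; `find_mismatches` and the sample strings are illustrative, not taken from the notebook's data.

```python
def find_mismatches(left, right):
    """Return (index, left_value, right_value) for every position that differs."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(left, right)) if a != b]

# Hypothetical stand-ins for df['clean_summaries'] and df['clean_summaries_pipeline'].
step_by_step = ["stereo matching one", "recent advancement", "paper proposed"]
pipeline = ["stereo matching one", "recent advancements", "paper proposed"]

mismatches = find_mismatches(step_by_step, pipeline)
print(f"All rows consistent: {not mismatches}")
for index, a, b in mismatches:
    # repr() makes whitespace differences visible.
    print(index, repr(a), repr(b))
```

With pandas, the equivalent selection would be `df[df['clean_summaries'] != df['clean_summaries_pipeline']]`, which returns only the rows where the two columns differ.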
