# *Step 02: Text Preprocessing*
### 1. Import Required Libraries for Text Preprocessing
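- In this section, we import the libraries used to handle text data: `pandas` for loading the dataset, `re` and NLTK for cleaning and tokenizing, scikit-learn for TF-IDF, and gensim for Word2Vec.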
- These tools help transform raw text into structured numerical representations for model training.

### 2. Loading and Handling Missing Data
- If either question in a pair (`question1` or `question2`) is missing, we drop that row.
- This ensures that only complete question pairs are used for preprocessing.
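
As a minimal sketch of this step, the toy frame below shows how `dropna` with `subset` removes any row where either question is missing. The rows are invented for illustration; only the column names `question1` and `question2` come from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration; the real data comes from train.csv.
toy = pd.DataFrame({
    "question1": ["How do I invest?", None, "What is TF-IDF?"],
    "question2": ["How can I invest?", "Is salt soluble?", None],
})

# Keep only rows where both question columns are present.
complete = toy.dropna(subset=["question1", "question2"])

print(len(toy), "rows before,", len(complete), "rows after")
```

Only the first row survives here, because each of the other two rows is missing one of the questions.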

### 3. Text Preprocessing Pipeline
We define a function `preprocess_text()` to clean and prepare text data for modeling.

1. **Lowercasing**: Converts all text to lowercase for consistency.
2. **Removing Special Characters**: Replaces punctuation and symbols with spaces; letters and digits are kept.
3. **Tokenization**: Splits text into individual words (tokens).
4. **Removing Stopwords**: Eliminates common words (e.g., "the", "is", "and") to reduce noise.
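5. **Lemmatization**: Converts words to their dictionary form (e.g., "cars" → "car"). `WordNetLemmatizer` treats every word as a noun by default, so verb forms such as "using" or "increased" are left unchanged.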

This transformation ensures that the text is standardized and ready for further processing.
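
The first four steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny hand-picked stand-in for NLTK's English list, a whitespace split stands in for `word_tokenize`, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# the notebook uses NLTK's full English stopword list instead.
TOY_STOPWORDS = {"the", "is", "and", "what", "a", "of", "to", "in", "by"}

def toy_preprocess(text):
    """Lowercase, strip non-alphanumeric characters, split on whitespace, drop stopwords."""
    text = text.lower()                      # 1. lowercasing
    text = re.sub(r"[^a-z0-9]", " ", text)   # 2. replace special characters with spaces
    tokens = text.split()                    # 3. whitespace split (stand-in for word_tokenize)
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]  # 4. stopword removal

tokens = toy_preprocess("What is the step-by-step guide to invest in the share market?")
print(tokens)
```

Replacing special characters with spaces, rather than deleting them, is what splits "step-by-step" into separate tokens instead of fusing it into one word.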

### 4. Training Word2Vec for Word Embeddings
We train a `Word2Vec` model using the cleaned text data:

- `vector_size=100`: Each word is represented as a **100-dimensional vector**.
- `window=5`: The maximum distance between a target word and the context words used to predict it.
- `min_count=1`: Keeps every word that appears at least once in the corpus.
- `workers=4`: Uses four worker threads for faster training.

Word embeddings help capture **semantic relationships** between words, improving the model's ability to understand meaning.

### 5. Feature Extraction using TF-IDF
To convert text into numerical features, we apply **TF-IDF (Term Frequency-Inverse Document Frequency)** to the raw text of each question pair, joined into a single string:

- **TF** measures how often a word appears in a document.
- **IDF** down-weights words that appear in many documents.

This technique highlights **distinctive keywords** while down-weighting common words that don’t add much information.
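
The effect of IDF can be seen on a toy corpus. This is a sketch with invented documents, using scikit-learn's default settings; `toy_docs` and the other `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration.
toy_docs = [
    "how do i invest in the share market",
    "what is the story of the kohinoor diamond",
    "how can i increase the speed of my internet connection",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Look up the learned IDF weight of a word through the fitted vocabulary.
idf_the = toy_vectorizer.idf_[toy_vectorizer.vocabulary_["the"]]
idf_diamond = toy_vectorizer.idf_[toy_vectorizer.vocabulary_["diamond"]]

print("matrix shape (documents, features):", toy_matrix.shape)
print("idf('the') =", idf_the, " idf('diamond') =", idf_diamond)
```

"the" appears in every document and so receives the lowest possible IDF weight, while "diamond" appears in only one and is weighted more heavily.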

### 6. Display Results
Finally, we print a **sample of the cleaned text** and check the **shape of the TF-IDF matrix**.

- The cleaned text shows how preprocessing transformed raw questions.
- The TF-IDF matrix shape is `(number of question pairs, number of features)`, so the second value gives the size of the extracted vocabulary.

This serves as a quick check that the preprocessing steps were applied correctly.

In [1]:
## Import Libraries
import pandas as pd
import re
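# NLTK needs its 'stopwords', 'wordnet', and 'punkt' data ('punkt_tab' on newer releases);
# fetch them once with nltk.download(...) before running this cell.
import nltk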
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

## Load Dataset
file_path = "train.csv"  # Ensure this file is in the same directory
df = pd.read_csv(file_path)

## Handle Missing Values
df = df.dropna(subset=['question1', 'question2'])

## Text Preprocessing Pipeline

# Initialize NLP Tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, strip non-alphanumeric characters, tokenize, remove stopwords, and lemmatize.

    Returns a list of tokens, or an empty list for non-string input.
    """
    if isinstance(text, str):
        text = text.lower()  # Lowercasing
        text = re.sub(r'[^a-zA-Z0-9]', ' ', text)  # Replace non-alphanumeric characters with spaces (digits are kept)
        words = word_tokenize(text)  # Tokenization
        words = [word for word in words if word not in stop_words]  # Remove stopwords
        words = [lemmatizer.lemmatize(word) for word in words]  # Lemmatization
        return words
    else:
        return []

# Apply Preprocessing function to both question columns
df['question1_clean'] = df['question1'].apply(preprocess_text)
df['question2_clean'] = df['question2'].apply(preprocess_text)

## Train Word2Vec Model for Word Embeddings

# Combine sentences from both questions for training
sentences = df['question1_clean'].tolist() + df['question2_clean'].tolist()

# Train Word2Vec model
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Save trained model
word2vec_model.save("word2vec.model")

## Feature Extraction using TF-IDF

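# Apply TF-IDF vectorization to the raw text of each question pair
# (TfidfVectorizer lowercases and tokenizes internally; the *_clean columns are not used here)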
tfidf_vectorizer = TfidfVectorizer()
df['combined_text'] = df['question1'].fillna('') + ' ' + df['question2'].fillna('')
X_tfidf = tfidf_vectorizer.fit_transform(df['combined_text'])

## Display Results

# Print sample processed text, shape of TF-IDF matrix & Completion message
print("Sample Processed Text:")
print(df[['question1_clean', 'question2_clean']].head())
print("\nTF-IDF Matrix Shape:", X_tfidf.shape)
print("Text Preprocessing, TF-IDF, and Word Embeddings completed successfully!")


Sample Processed Text:
                                     question1_clean  \
0  [step, step, guide, invest, share, market, india]   
1              [story, kohinoor, koh, noor, diamond]   
2  [increase, speed, internet, connection, using,...   
3                          [mentally, lonely, solve]   
4  [one, dissolve, water, quikly, sugar, salt, me...   

                                     question2_clean  
0         [step, step, guide, invest, share, market]  
1  [would, happen, indian, government, stole, koh...  
2         [internet, speed, increased, hacking, dns]  
3  [find, remainder, math, 23, 24, math, divided,...  
4                [fish, would, survive, salt, water]  

TF-IDF Matrix Shape: (404287, 86152)
Text Preprocessing, TF-IDF, and Word Embeddings completed successfully!


# Summary of Text Preprocessing

### Key Preprocessing Steps:
1. **Loading & Cleaning Data**: Dropped missing values.
2. **Text Cleaning**:
   - Lowercased text.
   - Removed special characters (digits are kept).
   - Tokenized, removed stopwords, and lemmatized words.
3. **Word Embeddings (Word2Vec)**: Trained a model to capture word relationships.
4. **Feature Extraction (TF-IDF)**: Converted the raw question pairs into a sparse numerical matrix for model training.

### Final Results:
- Text data is now structured for machine learning.
- The **TF-IDF matrix shape** `(404287, 86152)` corresponds to 404,287 question pairs and 86,152 extracted features.
- **Word2Vec embeddings** are saved to `word2vec.model` for further use.
