
# **Lab Sheet: Text Mining Preprocessing with Python**

**Objectives**

1.   Implement basic text preprocessing techniques.
2.   Clean and prepare text data for further analysis.
3.   Use Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) models to vectorize text data.
4.   Understand and apply common text mining techniques.
5.   Use libraries such as nltk and re to perform common preprocessing tasks.




**Prerequisites**

*   Basic knowledge of Python programming.
*   Understanding of text processing concepts.
* Python installed on your machine.
* Internet connection to download necessary libraries.



**Instructions**

**1. Set Up Your Environment**
1. Install Python: Make sure you have Python 3.x installed. You can download it from python.org.

2. Create and Activate a Virtual Environment (optional but recommended):

* Create a virtual environment:

In [None]:
python -m venv myenv

* Activate the virtual environment
  * On Windows

In [None]:
myenv\Scripts\activate

  * On macOS/Linux

In [None]:
source myenv/bin/activate

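3. Install Required Libraries:
* Install 'nltk', along with 'scikit-learn' (used for the vectorizers in steps 5 and 6) and 'spacy' (used in the French-language exercise):

In [None]:
pip install nltk scikit-learn spacy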
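
After installing, it can help to confirm that the packages are importable from the active environment. The following is a minimal sketch, not part of the original lab sheet; the module list (`nltk`, `sklearn`, `spacy`) is an assumption based on the imports used later in this sheet, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Assumed module names, taken from the imports used later in this lab sheet
REQUIRED = ["nltk", "sklearn", "spacy"]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are importable.")
```

If anything is reported as not installed, re-run the pip command inside the activated virtual environment.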



**2. Download NLTK Resources**

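You need to download specific resources from 'nltk' to use its stopword lists and tokenizers. Run the following code to download them. On newer NLTK releases, word_tokenize may raise a LookupError asking for the 'punkt_tab' resource; if that happens, also run nltk.download('punkt_tab').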

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

**3. Implement the Preprocessing Code**

Create a Python script or Jupyter Notebook with the following code to perform text preprocessing:

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Initialize the stemmer and stopwords
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords and perform stemming
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]

    return tokens

# Sample text
text = "Hello world! This is an example of text preprocessing in Python. Let's clean this text."

# Preprocess the sample text
preprocessed_tokens = preprocess_text(text)
print(preprocessed_tokens)

['hello', 'world', 'exampl', 'text', 'preprocess', 'python', 'let', 'clean', 'text']


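**4. Run the Code**

Execute the script or notebook to see the output of the preprocessing steps. The preprocessed_tokens variable should contain the lowercased, punctuation-free, stopword-filtered, and stemmed tokens from the sample text. Truncated stems such as 'exampl' are expected: the Porter stemmer strips suffixes by rule and does not map words to dictionary forms.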
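
To see what each stage contributes, the pipeline can be broken into its intermediate results. The sketch below uses only the standard library so it runs without NLTK. It makes three simplifying assumptions: `STOP_WORDS` is a tiny hypothetical stopword set (NLTK's English list is much longer), whitespace splitting stands in for `word_tokenize`, and stemming is omitted. `preprocess_steps` is a hypothetical helper name.

```python
import re

# Hypothetical mini stopword set for illustration only
STOP_WORDS = {"this", "is", "an", "of", "in"}

def preprocess_steps(text):
    """Return the intermediate result of each preprocessing stage."""
    lowered = text.lower()                       # 1. lowercase
    no_punct = re.sub(r"[^\w\s]", "", lowered)   # 2. drop punctuation
    tokens = no_punct.split()                    # 3. whitespace split (stand-in for word_tokenize)
    kept = [t for t in tokens if t not in STOP_WORDS]  # 4. remove stopwords
    return {"lowered": lowered, "no_punct": no_punct, "tokens": tokens, "kept": kept}

sample = "Hello world! This is an example of text preprocessing in Python. Let's clean this text."
steps = preprocess_steps(sample)
for name, value in steps.items():
    print(f"{name}: {value}")
```

Note that removing punctuation before tokenizing turns "let's" into "lets", which is why the stemmed output above contains 'let'.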

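**5. Implement Bag of Words (BoW)**

Add the following code to build a Bag of Words model. Note that CountVectorizer works on the raw documents here: it lowercases, tokenizes, and removes English stopwords itself (via stop_words='english'), so the output of preprocess_text is not used and no stemming is applied.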

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "Hello world! This is an example of text preprocessing.",
    "Text mining is an important aspect of data analysis.",
    "Let's clean and prepare text data for analysis."
]

# Initialize CountVectorizer
vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Get feature names and converted document matrix
feature_names = vectorizer.get_feature_names_out()
document_matrix = X.toarray()

print("Feature Names:", feature_names)
print("Document Matrix (BoW):")
print(document_matrix)

Feature Names: ['analysis' 'aspect' 'clean' 'data' 'example' 'hello' 'important' 'let'
 'mining' 'prepare' 'preprocessing' 'text' 'world']
Document Matrix (BoW):
[[0 0 0 0 1 1 0 0 0 0 1 1 1]
 [1 1 0 1 0 0 1 0 1 0 0 1 0]
 [1 0 1 1 0 0 0 1 0 1 0 1 0]]
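
The matrix above can be reproduced by hand to see what the vectorizer counts. This is a minimal standard-library sketch, not scikit-learn's implementation. It assumes CountVectorizer's default token pattern (words of two or more characters) and uses a small hypothetical `STOP_WORDS` set that happens to cover the stopwords in these three documents; scikit-learn's built-in English list is much longer.

```python
import re
from collections import Counter

documents = [
    "Hello world! This is an example of text preprocessing.",
    "Text mining is an important aspect of data analysis.",
    "Let's clean and prepare text data for analysis.",
]

# Hypothetical mini stopword set; enough for these documents only
STOP_WORDS = {"this", "is", "an", "of", "and", "for"}

def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters, drop stopwords
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]

tokenized = [tokenize(doc) for doc in documents]

# Vocabulary: every distinct token, sorted alphabetically (one column per term)
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one count per vocabulary term
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each column corresponds to one vocabulary term and each row to one document, so 'text' (second-to-last column) has a 1 in every row.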


**6. Implement TF-IDF**

Add the following code to convert the text into a TF-IDF representation. It reuses the documents list defined in step 5:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the documents
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Get feature names and converted document matrix
feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
document_matrix_tfidf = X_tfidf.toarray()

print("Feature Names (TF-IDF):", feature_names_tfidf)
print("Document Matrix (TF-IDF):")
print(document_matrix_tfidf)

Feature Names (TF-IDF): ['analysis' 'aspect' 'clean' 'data' 'example' 'hello' 'important' 'let'
 'mining' 'prepare' 'preprocessing' 'text' 'world']
Document Matrix (TF-IDF):
[[0.         0.         0.         0.         0.47952794 0.47952794
  0.         0.         0.         0.         0.47952794 0.28321692
  0.47952794]
 [0.35829137 0.4711101  0.         0.35829137 0.         0.
  0.4711101  0.         0.4711101  0.         0.         0.27824521
  0.        ]
 [0.35829137 0.         0.4711101  0.35829137 0.         0.
  0.         0.4711101  0.         0.4711101  0.         0.27824521
  0.        ]]
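
The weights above can be checked by hand. The sketch below starts from the Bag of Words counts shown in step 5 and assumes TfidfVectorizer's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). `tfidf_row` is a hypothetical helper name.

```python
import math

# Count matrix copied from the Bag of Words output in step 5
counts = [
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 4) for v in row])
```

Because 'text' occurs in all three documents, its IDF is the minimum value of 1, which is why it receives the lowest non-zero weight in every row.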


***Exercises***

* Implement Lemmatization: Replace the stemming process with lemmatization using WordNetLemmatizer.
* Handle Different Languages: Modify the code to preprocess text in a different language using the appropriate stopwords and tokenizers.
* Explore Other Vectorizers: Try using HashingVectorizer and compare it with CountVectorizer and TfidfVectorizer.

In [12]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag
# Ensure necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')


# Map a Penn Treebank POS tag (as returned by pos_tag) to the WordNet POS
# constant the lemmatizer expects; any other tag falls back to noun
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    # Tokenize and POS tagging
    tokens = word_tokenize(text.lower())
    pos_tags = pos_tag(tokens)

    # Print tokens and POS tags for debugging
    print("Tokens:", tokens)
    print("POS Tags:", pos_tags)

    # Lemmatize based on POS tag
    lemmatized_tokens = [
        lemmatizer.lemmatize(token, pos_tagger(pos)) for token, pos in pos_tags if token.isalpha()
    ]

    # Print lemmatized tokens for debugging
    print("Lemmatized Tokens:", lemmatized_tokens)

    return ' '.join(lemmatized_tokens)

# Example usage
text = "The running cats are running swiftly."
print("Lemmatized Text:", lemmatize_text(text))


Tokens: ['the', 'running', 'cats', 'are', 'running', 'swiftly', '.']
POS Tags: [('the', 'DT'), ('running', 'NN'), ('cats', 'NNS'), ('are', 'VBP'), ('running', 'VBG'), ('swiftly', 'RB'), ('.', '.')]
Lemmatized Tokens: ['the', 'running', 'cat', 'be', 'run', 'swiftly']
Lemmatized Text: the running cat be run swiftly


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
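
The output above shows why the POS tag matters: the first "running" was tagged NN and lemmatized as a noun, so it stayed "running", while the second was tagged VBG and lemmatized as a verb, giving "run". The following standard-library sketch replays only the tag-mapping step on the tags printed above. The single-letter codes are assumed to mirror the values of NLTK's wordnet.ADJ, wordnet.VERB, wordnet.NOUN, and wordnet.ADV constants, and `to_wordnet_pos` is a hypothetical name.

```python
# First letter of a Penn Treebank tag -> WordNet POS code
# (assumed to match wordnet.ADJ='a', VERB='v', NOUN='n', ADV='r')
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(penn_tag, default="n"):
    """Return the WordNet POS code for a Penn tag, falling back to noun."""
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:1], default)

# POS tags copied from the cell output above
pos_tags = [("the", "DT"), ("running", "NN"), ("cats", "NNS"),
            ("are", "VBP"), ("running", "VBG"), ("swiftly", "RB"), (".", ".")]

# Keep alphabetic tokens only, as lemmatize_text does
mapped = [(token, to_wordnet_pos(tag)) for token, tag in pos_tags if token.isalpha()]
for token, wn_pos in mapped:
    print(f"{token:10s} -> {wn_pos}")
```

A dictionary lookup on the first tag letter is equivalent to the if/elif chain in pos_tagger and makes the noun fallback explicit.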


In [15]:
# Download the small French spaCy model (notebook shell command)
!python -m spacy download fr_core_news_sm

import nltk
import spacy
from nltk.corpus import stopwords as sw

# Ensure necessary NLTK resources are downloaded
nltk.download('stopwords')

# Load the French spaCy pipeline (tokenizer and lemmatizer)
nlp = spacy.load('fr_core_news_sm')

def preprocess_french_text(text):
    stop_words = set(sw.words('french'))
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc if token.text.lower() not in stop_words and token.is_alpha]
    return ' '.join(lemmatized_tokens)

# Example usage
french_text = "Les chats courent rapidement."
print(preprocess_french_text(french_text))  # Output: "chat courir rapidement"


Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


chat courir rapidement
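
For the third exercise, it helps to know how HashingVectorizer differs from the other two vectorizers: it keeps no vocabulary and instead hashes each token directly to a column index. The sketch below illustrates that idea with the standard library only. It is not scikit-learn's implementation: `zlib.crc32` is used here as a stand-in hash function, so the numbers will not match HashingVectorizer's output, and `hash_vectorize` is a hypothetical name.

```python
import math
import re
import zlib

def hash_vectorize(text, n_features=10):
    """Map a document to a fixed-length, L2-normalized vector without a vocabulary."""
    vector = [0.0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        h = zlib.crc32(token.encode("utf-8"))     # deterministic unsigned 32-bit hash
        index = h % n_features                    # column (bucket) for this token
        sign = -1.0 if h & 0x80000000 else 1.0    # top bit chooses the sign
        vector[index] += sign                     # colliding tokens share a column
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector

for doc in ["The cat is on the mat.", "The dog is on the rug."]:
    print([round(v, 3) for v in hash_vectorize(doc)])
```

With only 10 columns, different tokens will often land in the same column, and signed contributions can partly cancel. That is the trade-off for fixed memory use and for not needing a fitting step, and it also means the columns cannot be mapped back to feature names.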


In [16]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "The cat is on the mat.",
    "The dog is on the rug.",
    "Cats and dogs are pets."
]

# CountVectorizer
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(documents)
print("CountVectorizer:\n", count_vectorizer.get_feature_names_out())
print(count_matrix.toarray())

# TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print("\nTfidfVectorizer:\n", tfidf_vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())

# HashingVectorizer
hashing_vectorizer = HashingVectorizer(n_features=10)  # n_features fixes the number of output columns; a small value like 10 makes hash collisions likely
hashing_matrix = hashing_vectorizer.fit_transform(documents)
print("\nHashingVectorizer:\n", hashing_matrix.toarray())

# Comparing similarity between first two documents
similarity_count = cosine_similarity(count_matrix[0:1], count_matrix[1:2])
similarity_tfidf = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
similarity_hashing = cosine_similarity(hashing_matrix[0:1], hashing_matrix[1:2])

print("\nCosine Similarity CountVectorizer:", similarity_count)
print("Cosine Similarity TfidfVectorizer:", similarity_tfidf)
print("Cosine Similarity HashingVectorizer:", similarity_hashing)


CountVectorizer:
 ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'is' 'mat' 'on' 'pets' 'rug' 'the']
[[0 0 1 0 0 0 1 1 1 0 0 2]
 [0 0 0 0 1 0 1 0 1 0 1 2]
 [1 1 0 1 0 1 0 0 0 1 0 0]]

TfidfVectorizer:
 ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'is' 'mat' 'on' 'pets' 'rug' 'the']
[[0.         0.         0.42755362 0.         0.         0.
  0.32516555 0.42755362 0.32516555 0.         0.         0.6503311 ]
 [0.         0.         0.         0.         0.42755362 0.
  0.32516555 0.         0.32516555 0.         0.42755362 0.6503311 ]
 [0.4472136  0.4472136  0.         0.4472136  0.         0.4472136
  0.         0.         0.         0.4472136  0.         0.        ]]

HashingVectorizer:
 [[ 0.         -0.40824829  0.          0.          0.          0.
   0.          0.40824829 -0.81649658  0.        ]
 [ 0.         -0.35355339  0.         -0.35355339  0.          0.
   0.          0.35355339 -0.70710678  0.35355339]
 [-0.4472136  -0.4472136   0.          0.4472136   0.         -0.4472136
   0.447