# 1. Introduction
* **Text as a Challenge**: Unlike numerical data, raw text is unstructured and messy. This makes it hard for computers to directly analyze and uncover insights.
* **Vectorization to the Rescue**: Vectorization techniques transform words, sentences, and even entire documents into numerical representations. This allows us to use mathematical and computational tools for powerful text analysis.
* **Your Mission**: This assignment will take you on a journey through text processing and vectorization. You'll decode clues, uncover hidden connections, and collaborate with others to reach the ultimate treasure!
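
To make the idea of vectorization concrete, here is a minimal bag-of-words sketch in plain Python. The two sample sentences and the helper name `bag_of_words` are illustrative assumptions and are not part of the assignment data.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents.

    Tokenization here is a plain lowercase whitespace split, which is
    much cruder than the preprocessing used later in this notebook.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One slot per vocabulary term; Counter returns 0 for absent terms
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

# Hypothetical sample sentences
docs = ["the key opens the lock", "the cipher hides the key"]
vocab, vectors = bag_of_words(docs)
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a list of counts aligned to the shared vocabulary, so two documents can be compared position by position.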

# 2. Setting Up
* Install the necessary libraries
* Import the libraries
* Load the Dataset

## Make sure you have these libraries installed
(`pip install [library_name]` if needed):
* nltk
* pandas
* scikit-learn (imported as `sklearn`)
* gensim
* spacy
* (Optional for advanced exploration): transformers
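
If you are unsure which of these libraries are already available, a quick check along the following lines may help. This is only a sketch: the helper name `check_libraries` is made up for illustration, and the names listed are import names rather than pip package names.

```python
import importlib.util

# Import names (not pip names): scikit-learn is imported as "sklearn".
REQUIRED = ["nltk", "pandas", "sklearn", "gensim", "spacy"]
OPTIONAL = ["transformers"]

def check_libraries(names):
    """Map each import name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, found in check_libraries(REQUIRED + OPTIONAL).items():
    print(f"{name:<14}{'installed' if found else 'missing'}")
```

Anything reported as missing can then be installed with pip before running the import cell.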



In [None]:
# Install libraries if needed (uncomment as necessary)
# !pip install nltk pandas scikit-learn gensim spacy transformers

# Import libraries
import pandas as pd
import numpy as np
import nltk
import re  # For regular expressions
import gensim
import spacy

from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

# Optional advanced exploration with Transformers
from transformers import pipeline


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 3. Your Quest Begins – The Initial Clue
* Decipher the Message: Your first clue is the key! Analyze it closely. What words or themes stand out?
* * Hint 1: Think about which topic category within the 20 Newsgroups dataset connects to your initial clue.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Load the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train')

# Define the text to classify
text = "I shield secrets with mathematical might, transforming messages into hidden light"

# Create a pipeline that combines TF-IDF vectorization and Multinomial Naive Bayes classifier
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model on the training data
pipe.fit(newsgroups_train.data, newsgroups_train.target)

# Predict the category of the text
predicted_category = pipe.predict([text])[0]

# Get the target names (category names)
target_names = newsgroups_train.target_names

# Print the predicted category
print("Predicted Category:", target_names[predicted_category])


Predicted Category: sci.crypt


# 4. Keyword Quest
Finding the Guiding Stars: Time to extract keywords that illuminate your path. Let's start with TF-IDF:


In [None]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def clean_text(text):
    """Normalize, remove possessives, lemmatize, and remove stopwords."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    text = re.sub(r"'s\b", "", text)
    text = re.sub(r"[^\w\s]", '', text)
    words = word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words if w not in stop_words]
    return lemmatized_words

def extract_keywords(text_data):
    """Extract keywords using TF-IDF from preprocessed text."""
    stop_words = stopwords.words('english')
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    text_data = [' '.join(clean_text(doc)) for doc in text_data]
    vectors = vectorizer.fit_transform(text_data)
    feature_names = vectorizer.get_feature_names_out()
    # Convert the sparse TF-IDF matrix to a dense table: one row per document
    return pd.DataFrame(vectors.toarray(), columns=feature_names)

# Clues
my_text = [
    "From ancient ciphers to digital keys, I weave patterns of security, where only the intended eyes can see.",
    "Like a dance with logic, steps in a line, my pattern dictates the final design.",
    "Logic's blueprint, code's design, my instructions make machines align."
]

keyword_df = extract_keywords(my_text)
print(keyword_df)

      align   ancient  blueprint    cipher      code     dance    design  \
0  0.000000  0.323112   0.000000  0.323112  0.000000  0.000000  0.000000   
1  0.000000  0.000000   0.000000  0.000000  0.000000  0.359554  0.273450   
2  0.373801  0.000000   0.373801  0.000000  0.373801  0.000000  0.284285   

    dictate   digital       eye  ...      like      line     logic   machine  \
0  0.000000  0.323112  0.323112  ...  0.000000  0.000000  0.000000  0.000000   
1  0.359554  0.000000  0.000000  ...  0.359554  0.359554  0.273450  0.000000   
2  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.284285  0.373801   

       make   pattern  security       see      step     weave  
0  0.000000  0.245735  0.323112  0.323112  0.000000  0.323112  
1  0.000000  0.273450  0.000000  0.000000  0.359554  0.000000  
2  0.373801  0.000000  0.000000  0.000000  0.000000  0.000000  

[3 rows x 24 columns]
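
The weights in the table above can be reproduced by hand, which helps show what TF-IDF actually computes. The sketch below assumes scikit-learn's default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, then L2 normalization per document). The token lists are typed in by hand as the assumed output of `clean_text` for the three clues.

```python
import math
from collections import Counter

# Assumed output of clean_text() for the three clues (typed by hand).
docs = [
    "ancient cipher digital key weave pattern security intend eye see".split(),
    "like dance logic step line pattern dictate final design".split(),
    "logic blueprint code design instruction make machine align".split(),
]

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalization per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weighted = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

weights = tfidf(docs)
print(f"cipher  (doc 0): {weights[0]['cipher']:.6f}")
print(f"pattern (doc 0): {weights[0]['pattern']:.6f}")
```

Under these assumptions the two printed values should come out close to the `cipher` and `pattern` entries in row 0 of the table. `pattern` scores lower because it appears in two of the three clues, so its IDF is smaller.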


## Hint 2:
Look for keywords that might link to other texts, reveal new concepts, or hint at hidden patterns within the data.
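
One simple way to spot linking keywords is to list the terms that occur in more than one text. The following is a rough sketch in plain Python; the token lists are a shortened, hypothetical stand-in for preprocessed clues, and `shared_keywords` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical cleaned tokens per clue (shortened for illustration).
clue_tokens = {
    "clue_0": ["ancient", "cipher", "digital", "key", "weave", "pattern", "security"],
    "clue_1": ["like", "dance", "logic", "step", "line", "pattern", "design"],
    "clue_2": ["logic", "blueprint", "code", "design", "instruction", "machine"],
}

def shared_keywords(tokens_by_doc):
    """Return {keyword: sorted document ids} for keywords found in 2+ documents."""
    docs_by_term = defaultdict(set)
    for doc_id, tokens in tokens_by_doc.items():
        for term in tokens:
            docs_by_term[term].add(doc_id)
    return {t: sorted(d) for t, d in sorted(docs_by_term.items()) if len(d) > 1}

for term, docs in shared_keywords(clue_tokens).items():
    print(f"{term}: {', '.join(docs)}")
```

Terms that bridge two clues are natural candidates to follow up on when you search other texts.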

# 5. Semantic Safari
* Exploring the World of Meaning: Word embeddings like Word2Vec or GloVe help us understand how words relate to each other.

## Hint 3:
Calculate similarities between your keywords and texts in other categories. Could there be unexpected connections?
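
Similarity between a keyword and a text is commonly measured with cosine similarity, using the average of the text's word vectors as the text vector. Here is a minimal sketch in plain Python. The 3-dimensional embeddings are invented for illustration only; real Word2Vec or GloVe vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "cipher": [0.9, 0.1, 0.0],
    "key": [0.8, 0.2, 0.1],
    "dance": [0.0, 0.1, 0.9],
}

# Represent a two-word "text" by the mean of its word vectors
text_vector = mean_vector([toy_embeddings["cipher"], toy_embeddings["key"]])
for word, vector in toy_embeddings.items():
    print(f"{word}: {cosine_similarity(vector, text_vector):.4f}")
```

Values near 1 mean the vectors point in nearly the same direction, and values near 0 mean they are unrelated. In this toy setup `dance` should score far lower than `cipher` or `key`.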


In [None]:
from gensim.models import Word2Vec
sentences = [clean_text(text) for text in my_text]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)


In [None]:
# Displaying vocabulary
print("Vocabulary in Model:", model.wv.index_to_key)

# Calculate and display similarities
for text in my_text:
    preprocessed_text = clean_text(text)
    text_vector = np.mean([model.wv[word] for word in preprocessed_text if word in model.wv], axis=0)

    # Dictionary to store keyword and its similarity
    similarity_dict = {}
    for keyword in keyword_df.columns.tolist():
        if keyword in model.wv:
            keyword_vector = model.wv[keyword]
            similarity = np.dot(keyword_vector, text_vector) / (np.linalg.norm(keyword_vector) * np.linalg.norm(text_vector))
            similarity_dict[keyword] = similarity
        else:
            similarity_dict[keyword] = float('-inf')  # Treat words not in vocab as lowest similarity

    # Sorting the dictionary by similarity in descending order
    sorted_similarities = sorted(similarity_dict.items(), key=lambda item: item[1], reverse=True)

    # Printing the sorted similarities
    print(f"\nComparing to text: {' '.join(preprocessed_text)}")
    for keyword, sim_value in sorted_similarities:
        if sim_value == float('-inf'):
            print(f"Keyword '{keyword}' not in vocabulary.")
        else:
            print(f"Similarity with '{keyword}': {sim_value:.4f}")

Vocabulary in Model: ['design', 'pattern', 'logic', 'align', 'like', 'cipher', 'digital', 'key', 'weave', 'security', 'intend', 'eye', 'see', 'dance', 'machine', 'step', 'line', 'dictate', 'final', 'blueprint', 'code', 'instruction', 'make', 'ancient']

Comparing to text: ancient cipher digital key weave pattern security intend eye see
Similarity with 'intend': 0.4606
Similarity with 'ancient': 0.3766
Similarity with 'eye': 0.3578
Similarity with 'cipher': 0.3250
Similarity with 'see': 0.3088
Similarity with 'digital': 0.2754
Similarity with 'key': 0.2660
Similarity with 'weave': 0.2631
Similarity with 'pattern': 0.2382
Similarity with 'security': 0.2108
Similarity with 'final': 0.1184
Similarity with 'make': 0.1177
Similarity with 'design': 0.1173
Similarity with 'blueprint': 0.0964
Similarity with 'logic': 0.0755
Similarity with 'machine': 0.0659
Similarity with 'align': 0.0626
Similarity with 'dictate': 0.0334
Similarity with 'step': 0.0272
Similarity with 'line': -0.0061

# Advanced Exploration: Transformers (Optional)
While Word2Vec and GloVe offer valuable insights, Transformer-based models can provide even more nuanced semantic understanding. These models go beyond individual word meanings and capture context-dependent relationships between words.

* **Exploring with Transformers**: Consider using pre-trained Transformers for tasks like question answering or text summarization.
* * Imagine you have a question related to the content you've analyzed. You could use a question-answering pipeline to find the answer within relevant texts.
* * Text summarization pipelines could be helpful for generating concise summaries of lengthy documents you encounter during your exploration.
* Explore the Transformers documentation (https://huggingface.co/docs/transformers/en/index) to discover more pipelines and fine-tune your exploration.

**Benefits and Considerations**:
* Transformers can potentially uncover deeper semantic relationships compared to traditional word embeddings.
* They often require more computational resources.
* The cell below runs a question-answering pipeline and includes a commented-out text-classification example. You can experiment with other Transformers functionality based on your interests.

In [None]:
!pip install transformers



In [None]:
# Explore Transformers (optional challenge)
# Requires the transformers library (install it with the cell above)
from transformers import pipeline

# Question answering with a pre-trained model
answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Text classification with a pre-trained model (uncomment to try)
#classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

#result = classifier("Like a dance with logic, steps in a line, my pattern dictates the final design.")
#print(result)

your_text_data = "Logic's blueprint, code's design, my instructions make machines align."

# Ask a question about the content of your text.
# Replace your_text_data with the text you want to explore.
question = "Is the text referring to what kind of design?"
answer = answerer({'question': question, 'context': your_text_data})
print(question)
print(answer)

# Experiment with other Transformers pipelines like text classification,
# summarization, or sentiment analysis based on your exploration goals.


Is the text referring to what kind of design?
{'score': 0.41227254271507263, 'start': 19, 'end': 32, 'answer': "code's design"}


# 6. Pattern Pursuit
* **Cracking the Code**: Examine closely for unusual patterns within the texts – letter sequences, numbers, or anything resembling a code. Regular expressions will be your powerful ally.

## Hint 4:
Need help learning regular expressions? Check out this resource: https://docs.python.org/3/library/re.html
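
As a warm-up, the sketch below uses `re.finditer` to report where each match occurs as well as what matched. The code format (two letters, two digits, two letters, two digits) and the sample sentence are hypothetical, chosen only to show how word boundaries and character classes behave.

```python
import re

# Hypothetical code format: two uppercase letters, two digits,
# two uppercase letters, two digits (e.g. "AB12XY94").
CODE_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z]{2}\d{2}\b")

def find_codes(text):
    """Return (match, start, end) for every code-like token in text."""
    return [(m.group(0), m.start(), m.end()) for m in CODE_PATTERN.finditer(text)]

sample = "There is a hidden code: AB12XY94, but ab12xy94 and AB12XY945 do not match."
for code, start, end in find_codes(sample):
    print(f"{code!r} at characters {start}-{end}")
```

The lowercase variant fails the `[A-Z]` classes, and the longer token fails the trailing `\b` because a digit follows the pattern, so only the first candidate is reported.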

In [None]:
# Regular expression example (kept inside a string so it does not run;
# paste it into its own cell to try it)
r'''
import re

text = "There is a hidden code: AB12XY94"
pattern = r"\b[A-Z]{2}\d{2}[A-Z]{2}\d{2}\b"  # Pattern for a simple code-like structure
matches = re.findall(pattern, text)
if matches:
    print("Found potential codes:", matches)
'''
def find_patterns(text_data):
    """Print any license plates, emails, and phone numbers found in text_data."""
    # Simplified pattern for a US-style license plate (e.g., "XYZ1234")
    license_plate_pattern = r"\b[A-Z]{3}\d{4}\b"
    # Standard email regex pattern
    email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
    # Pattern for typical US phone numbers (e.g., "555-333-8888")
    phone_pattern = r"\b\d{3}-\d{3}-\d{4}\b"

    license_plates = re.findall(license_plate_pattern, text_data)
    emails = re.findall(email_pattern, text_data)
    phone_numbers = re.findall(phone_pattern, text_data)

    # Reporting findings
    if license_plates:
        print("Found potential license plates:", license_plates)
    if emails:
        print("Found potential emails:", emails)
    if phone_numbers:
        print("Found phone numbers:", phone_numbers)
    if not (license_plates or emails or phone_numbers):
        print("No potential patterns found.")

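# Example usage (students would apply this to their texts).
# A separate variable name avoids overwriting the my_text list of clues defined earlier.
sample_text = "The license plate XYZ1234 is expired, you can renew it by contacting texasdps@gov.us or call 555-333-8888."
find_patterns(sample_text)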

Found potential license plates: ['XYZ1234']
Found potential emails: ['texasdps@gov.us']
Found phone numbers: ['555-333-8888']


## Clue 1a = sci.crypt
## Clue 2a = security
## Clue 1b = logic
## Clue 2b = code's design

# 7. Collaboration and Convergence
* **Teamwork Makes the Dream Work**: How will your team share your findings and combine your insights? Discuss effective communication strategies.
* **The Final Puzzle**: Once all the clues are gathered, collaborate to solve the ultimate puzzle and locate the treasure!

# 8. Reflection and Report
* Document Your Journey: Your final report is crucial! It should include:
* * The methods and techniques you used at each stage.
* * A detailed explanation of what the provided code snippets do and why.
* * Insights on your collaboration process.
* Lessons Learned: Think about:
* * Which text processing techniques were most helpful and why?
* * How did vectorization empower you to find hidden connections?
* * What was the most surprising part of this adventure?
* * How could you use these skills for other problems in the real world?