
**Setting Up the Environment**

In [None]:
# Import libraries
import requests
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

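# Download the text from Project Gutenberg
url = "http://www.gutenberg.org/files/11/11-0.txt"
response = requests.get(url)
# Stop early if the download failed instead of processing an error page
response.raise_for_status()
text = response.text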

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


**Preprocess the Text**\
Perform preprocessing steps: convert to lowercase, remove non-alphabetic characters, remove stop words, and lemmatize.

In [None]:
# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Preprocess function
def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize words
    words = word_tokenize(text)
    # Remove stopwords and lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return words

# Preprocess the entire text
processed_text = preprocess(text)
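
One side effect of the regex step is worth seeing in isolation: deleting non-alphabetic characters outright fuses hyphenated words and possessives into single tokens. The sketch below is a standalone illustration using only the standard library; `strip_non_letters` and the sample string are hypothetical and not part of the notebook. It compares deleting those characters with replacing them by a space.

```python
import re

def strip_non_letters(text, replacement=""):
    """Lowercase the text and replace every non-letter, non-space character."""
    return re.sub(r"[^a-z\s]", replacement, text.lower())

# Hypothetical sample sentence, chosen to contain a hyphen and an apostrophe
sample = "Down the Rabbit-Hole: Alice's adventures"

# Deleting the characters (what preprocess does) joins the neighbouring letters
joined = strip_non_letters(sample).split()

# Replacing them with a space keeps the word parts separate
spaced = strip_non_letters(sample, " ").split()

print(joined)
print(spaced)
```

Which variant is preferable depends on the analysis: the spaced version avoids fused tokens such as "rabbithole", but it leaves fragments like a lone "s" that may need a minimum-length filter.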

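**Find the Top 10 Most Important Words in Each Chapter**\
1- Split the text into chapters: use "CHAPTER" as the separator, since the text marks each new chapter with it. The table of contents also contains "CHAPTER", so the first 12 segments in the output below are contents entries rather than chapter bodies.\
2- Calculate TF-IDF for each chapter and identify the top 10 most important words. Because the vectorizer is fitted on one chapter at a time, the IDF term is constant and the ranking reduces to term frequency.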
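
The splitting step can be sketched on its own. The version below is a minimal illustration, not the notebook's method: it splits on a "CHAPTER" marker followed by a Roman numeral and then drops very short segments, on the assumption that table-of-contents entries are only a few words long. The function `split_chapters`, the `min_words` threshold, and the toy text are all hypothetical.

```python
import re

def split_chapters(text, min_words=50):
    """Split on 'CHAPTER <roman numeral>.' and drop very short segments."""
    # Everything before the first marker is front matter, so skip it
    parts = re.split(r"CHAPTER\s+[IVXLC]+\.", text)[1:]
    # Assumption: contents entries are far shorter than real chapter bodies
    return [p.strip() for p in parts if len(p.split()) >= min_words]

# Toy text with a two-line contents list followed by two short "chapters"
toy = (
    "Contents\n CHAPTER I. Down the Rabbit-Hole\n CHAPTER II. The Pool of Tears\n\n"
    "CHAPTER I.\nDown the Rabbit-Hole\n" + "alice was tired " * 5
    + "\nCHAPTER II.\nThe Pool of Tears\n" + "curiouser and curiouser " * 5
)

chapters = split_chapters(toy, min_words=10)
print(len(chapters))
```

The threshold would need tuning against the real text; any value between the longest contents entry and the shortest chapter should work.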

In [None]:
# Import required libraries
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the text into chapters
chapters = text.split("CHAPTER")[1:]  # Skip any intro part

# Initialize the TF-IDF vectorizer, using 'english' stop words
tfidf_vectorizer = TfidfVectorizer(max_features=10, stop_words='english')

# Function to find top TF-IDF words in each chapter
def get_top_words_per_chapter(chapter_text):
    # Fit and transform the text for each chapter
    tfidf_matrix = tfidf_vectorizer.fit_transform([chapter_text])
    # Get feature names (words) and their scores
    feature_names = tfidf_vectorizer.get_feature_names_out()
    scores = tfidf_matrix.toarray().flatten()
    # Sort words by score, highest first (max_features=10 already caps the vocabulary at 10 words)
    top_words = [word for word, score in sorted(zip(feature_names, scores), key=lambda x: x[1], reverse=True)]
    return top_words

# Process each chapter and print the top 10 words
for i, chapter in enumerate(chapters):
    # Preprocess the chapter text
    processed_chapter = " ".join(preprocess(chapter))
    # Get top words
    top_words = get_top_words_per_chapter(processed_chapter)
    print(f"Top 10 words for Chapter {i + 1}: {top_words}")

Top 10 words for Chapter 1: ['rabbithole']
Top 10 words for Chapter 2: ['ii', 'pool', 'tear']
Top 10 words for Chapter 3: ['caucusrace', 'iii', 'long', 'tale']
Top 10 words for Chapter 4: ['iv', 'little', 'rabbit', 'sends']
Top 10 words for Chapter 5: ['advice', 'caterpillar']
Top 10 words for Chapter 6: ['pepper', 'pig', 'vi']
Top 10 words for Chapter 7: ['mad', 'teaparty', 'vii']
Top 10 words for Chapter 8: ['croquetground', 'queen', 'viii']
Top 10 words for Chapter 9: ['ix', 'mock', 'story', 'turtle']
Top 10 words for Chapter 10: ['lobster', 'quadrille']
Top 10 words for Chapter 11: ['stole', 'tart', 'xi']
Top 10 words for Chapter 12: ['alices', 'evidence', 'xii']
Top 10 words for Chapter 13: ['alice', 'little', 'like', 'think', 'way', 'door', 'said', 'thought', 'time', 'went']
Top 10 words for Chapter 14: ['alice', 'little', 'mouse', 'im', 'said', 'dear', 'foot', 'thing', 'like', 'went']
Top 10 words for Chapter 15: ['said', 'alice', 'mouse', 'dodo', 'know', 'soon', 'bird', 'dry', 
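
For comparison, here is a small hand-rolled sketch of TF-IDF computed over all chapters at once, so that words common to every chapter are down-weighted. It uses only the standard library and toy token lists; `top_tfidf_words` and the toy data are hypothetical. The IDF formula is the smoothed form that `TfidfVectorizer` uses by default, and the L2 normalization step is omitted because it does not change the ranking within a document.

```python
import math
from collections import Counter

def top_tfidf_words(docs, k=10):
    """Rank the words of each tokenized document by TF-IDF over the whole corpus."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each word
    df = Counter(word for doc in docs for word in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: count * idf[w] for w, count in tf.items()}
        # Highest score first; ties broken alphabetically for a stable order
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:k])
    return results

# Toy "chapters": "alice" is the most frequent word in each one
toy_chapters = [
    "alice rabbit alice rabbit alice hole".split(),
    "alice queen alice queen alice croquet".split(),
    "alice turtle alice turtle alice soup".split(),
]

top_words = top_tfidf_words(toy_chapters, k=2)
print(top_words)
```

In this toy corpus "alice" has the highest raw count in every document, yet a chapter-specific word ranks above it once IDF is computed across the corpus. That is the effect a per-chapter fit cannot produce.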

**Find the Top 10 Most Used Verbs in Sentences with "Alice"**\
1- Find sentences that mention "Alice".\
2- Extract verbs from those sentences.\
3- Count the most common verbs and print the top 10.

In [None]:
# Download parts of speech tagger
nltk.download('averaged_perceptron_tagger')

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Function to extract verbs from sentences mentioning Alice
def get_verbs_with_alice(sentences):
    verbs = []
    for sentence in sentences:
        if 'alice' in sentence.lower():
            # Tokenize and tag parts of speech
            words = word_tokenize(sentence)
            pos_tags = nltk.pos_tag(words)
            # Extract verbs, lemmatize them, and add to list
            verbs += [lemmatizer.lemmatize(word.lower()) for word, tag in pos_tags if tag.startswith('VB')]
    return verbs

# Get verbs from sentences with "Alice" and find the top 10
alice_verbs = get_verbs_with_alice(sentences)
top_alice_verbs = Counter(alice_verbs).most_common(10)
print("Top 10 verbs associated with Alice:", top_alice_verbs)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Top 10 verbs associated with Alice: [('said', 254), ('wa', 174), ('’', 152), ('had', 94), ('“', 91), ('be', 81), ('s', 50), ('thought', 50), ('”', 48), ('have', 43)]


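**Sentence Extraction:** We first split the text into sentences and keep each one that contains "Alice" (case-insensitive).\
**Verb Extraction:** For each of those sentences, we tokenize the words, tag each word’s part of speech, and keep only tokens whose tag starts with VB.\
**Counting and Displaying Verbs:** We use Counter to find the 10 most common verbs in those sentences.\
**Reading the Output:** The lemmatizer is called without a part-of-speech argument, so it treats every word as a noun: "was" becomes 'wa', while "said" and "had" are left unreduced. The quotation marks and apostrophes in the list are punctuation tokens that the tagger labeled as verbs.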
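
The counting step could filter out the punctuation tokens before tallying. The sketch below is one possible approach, not the notebook's code: `count_verbs` and the hand-written `tagged` list are hypothetical stand-ins for the output of `nltk.pos_tag`, and the lemmatizer is passed in as a plain function so the example runs without NLTK data.

```python
from collections import Counter

def count_verbs(tagged_tokens, lemmatize=lambda w: w, top_n=10):
    """Count verb tokens, skipping anything that is not purely alphabetic."""
    verbs = [
        lemmatize(word.lower())
        for word, tag in tagged_tokens
        # Keep verb tags only, and drop quotes or apostrophes mis-tagged as verbs
        if tag.startswith("VB") and word.isalpha()
    ]
    return Counter(verbs).most_common(top_n)

# Hand-written (word, tag) pairs standing in for real tagger output
tagged = [
    ("“", "VB"), ("said", "VBD"), ("Alice", "NNP"),
    ("was", "VBD"), ("’", "VBP"), ("said", "VBD"),
]

top_verbs = count_verbs(tagged)
print(top_verbs)
```

In the notebook, passing `lemmatize=lambda w: lemmatizer.lemmatize(w, pos="v")` would lemmatize each token as a verb rather than as a noun, which should avoid forms like 'wa'.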