
# **Objective:**

This lab is designed to help students understand the steps involved in cleaning text data, performing linguistic analysis, and extracting features using TF-IDF and N-gram models. By the end of the lab, students will have hands-on experience in preparing text data for machine learning tasks, such as text classification or sentiment analysis.

# **Libraries Installation**

In [90]:
!pip install nltk
!pip install scikit-learn
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# **Task 1: Tokenization✅**

In [91]:
text_data = [
    "The movie was fantastic and I loved every part of it about Egypt",
    "I hated the film, it was the worst I have ever seen",
    "The storyline was boring but the acting was brilliant",
    "An amazing movie with a great plot and incredible performances",
    "Egypt movie, I regret wasting my time on it",
    "The actors did a great job but the story lacked depth",
    "One of the best films I have seen in a long time, highly recommend it",
    "This film was just okay, not too bad but not great either",
    "Absolutely loved the movie, fantastic plot and wonderful cast",
    "The movie was disappointing, it did not live up to the hype"
]

# Tokenization Code
tokenized_data = [word_tokenize(sentence) for sentence in text_data]
tokenized_data

[['The',
  'movie',
  'was',
  'fantastic',
  'and',
  'I',
  'loved',
  'every',
  'part',
  'of',
  'it',
  'about',
  'Egypt'],
 ['I',
  'hated',
  'the',
  'film',
  ',',
  'it',
  'was',
  'the',
  'worst',
  'I',
  'have',
  'ever',
  'seen'],
 ['The',
  'storyline',
  'was',
  'boring',
  'but',
  'the',
  'acting',
  'was',
  'brilliant'],
 ['An',
  'amazing',
  'movie',
  'with',
  'a',
  'great',
  'plot',
  'and',
  'incredible',
  'performances'],
 ['Egypt', 'movie', ',', 'I', 'regret', 'wasting', 'my', 'time', 'on', 'it'],
 ['The',
  'actors',
  'did',
  'a',
  'great',
  'job',
  'but',
  'the',
  'story',
  'lacked',
  'depth'],
 ['One',
  'of',
  'the',
  'best',
  'films',
  'I',
  'have',
  'seen',
  'in',
  'a',
  'long',
  'time',
  ',',
  'highly',
  'recommend',
  'it'],
 ['This',
  'film',
  'was',
  'just',
  'okay',
  ',',
  'not',
  'too',
  'bad',
  'but',
  'not',
  'great',
  'either'],
 ['Absolutely',
  'loved',
  'the',
  'movie',
  ',',
  'fantastic',
  

# **Task 2: Stopword Removal✅**




In [92]:
stop_words = set(stopwords.words('english'))

# Remove Stopwords
cleaned_data = []
for sentence in tokenized_data:
    filtered_sentence = [word for word in sentence if word.lower() not in stop_words]
    cleaned_data.append(filtered_sentence)
cleaned_data

[['movie', 'fantastic', 'loved', 'every', 'part', 'Egypt'],
 ['hated', 'film', ',', 'worst', 'ever', 'seen'],
 ['storyline', 'boring', 'acting', 'brilliant'],
 ['amazing', 'movie', 'great', 'plot', 'incredible', 'performances'],
 ['Egypt', 'movie', ',', 'regret', 'wasting', 'time'],
 ['actors', 'great', 'job', 'story', 'lacked', 'depth'],
 ['One', 'best', 'films', 'seen', 'long', 'time', ',', 'highly', 'recommend'],
 ['film', 'okay', ',', 'bad', 'great', 'either'],
 ['Absolutely',
  'loved',
  'movie',
  ',',
  'fantastic',
  'plot',
  'wonderful',
  'cast'],
 ['movie', 'disappointing', ',', 'live', 'hype']]
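The output above still contains the `','` tokens, because punctuation is not in the stopword list. A minimal sketch of filtering both in one pass is shown below. It assumes that dropping every non-alphabetic token is acceptable for this dataset; `SAMPLE_STOP_WORDS` and `clean_tokens` are hypothetical names, and the stop set is a tiny illustrative subset rather than NLTK's full English list.

```python
# Illustrative subset only; NLTK's English stopword list is much longer.
SAMPLE_STOP_WORDS = {"i", "the", "it", "was", "have"}


def clean_tokens(tokens, stop_words):
    """Keep alphabetic tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]


# Tokens of the second review, copied from the Task 1 output
tokens = ['I', 'hated', 'the', 'film', ',', 'it', 'was', 'the',
          'worst', 'I', 'have', 'ever', 'seen']

cleaned = clean_tokens(tokens, SAMPLE_STOP_WORDS)
print(cleaned)
```

In the notebook itself, the same function could be called with the `stop_words` set built from `stopwords.words('english')`.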

# **Task 3: Stemming or Lemmatization✅**
  

In [93]:
lemmatizer = WordNetLemmatizer()

lemmatized_data = [[lemmatizer.lemmatize(word) for word in sentence] for sentence in cleaned_data]
lemmatized_data

[['movie', 'fantastic', 'loved', 'every', 'part', 'Egypt'],
 ['hated', 'film', ',', 'worst', 'ever', 'seen'],
 ['storyline', 'boring', 'acting', 'brilliant'],
 ['amazing', 'movie', 'great', 'plot', 'incredible', 'performance'],
 ['Egypt', 'movie', ',', 'regret', 'wasting', 'time'],
 ['actor', 'great', 'job', 'story', 'lacked', 'depth'],
 ['One', 'best', 'film', 'seen', 'long', 'time', ',', 'highly', 'recommend'],
 ['film', 'okay', ',', 'bad', 'great', 'either'],
 ['Absolutely',
  'loved',
  'movie',
  ',',
  'fantastic',
  'plot',
  'wonderful',
  'cast'],
 ['movie', 'disappointing', ',', 'live', 'hype']]
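Verbs such as "loved" and "hated" are unchanged in the output above because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed. One common workaround is to translate the Penn Treebank tags from a POS tagger into WordNet's single-letter codes. The sketch below shows only that mapping; `penn_to_wordnet` is a hypothetical helper name, and falling back to noun for all other tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


# A few tagged tokens copied from the POS tagging output of this lab
tagged = [("I", "PRP"), ("hated", "VBD"), ("the", "DT"), ("film", "NN"),
          ("worst", "JJS"), ("ever", "RB")]

pos_hints = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_hints)
```

With NLTK available, each pair could then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, which would typically reduce inflected verbs to their base form.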

# **Task 4: Part-of-Speech (POS) Tagging✅**


In [94]:
pos_tagged_data = [nltk.pos_tag(sentence) for sentence in tokenized_data]
pos_tagged_data

[[('The', 'DT'),
  ('movie', 'NN'),
  ('was', 'VBD'),
  ('fantastic', 'JJ'),
  ('and', 'CC'),
  ('I', 'PRP'),
  ('loved', 'VBD'),
  ('every', 'DT'),
  ('part', 'NN'),
  ('of', 'IN'),
  ('it', 'PRP'),
  ('about', 'IN'),
  ('Egypt', 'NNP')],
 [('I', 'PRP'),
  ('hated', 'VBD'),
  ('the', 'DT'),
  ('film', 'NN'),
  (',', ','),
  ('it', 'PRP'),
  ('was', 'VBD'),
  ('the', 'DT'),
  ('worst', 'JJS'),
  ('I', 'PRP'),
  ('have', 'VBP'),
  ('ever', 'RB'),
  ('seen', 'VBN')],
 [('The', 'DT'),
  ('storyline', 'NN'),
  ('was', 'VBD'),
  ('boring', 'VBG'),
  ('but', 'CC'),
  ('the', 'DT'),
  ('acting', 'NN'),
  ('was', 'VBD'),
  ('brilliant', 'JJ')],
 [('An', 'DT'),
  ('amazing', 'JJ'),
  ('movie', 'NN'),
  ('with', 'IN'),
  ('a', 'DT'),
  ('great', 'JJ'),
  ('plot', 'NN'),
  ('and', 'CC'),
  ('incredible', 'JJ'),
  ('performances', 'NNS')],
 [('Egypt', 'NNP'),
  ('movie', 'NN'),
  (',', ','),
  ('I', 'PRP'),
  ('regret', 'VBP'),
  ('wasting', 'VBG'),
  ('my', 'PRP$'),
  ('time', 'NN'),
  ('on', 'IN

# **Task 5: Named Entity Recognition (NER)✅**


In [95]:
ner_data = [nltk.ne_chunk(pos_tagged) for pos_tagged in pos_tagged_data]
ner_data

[Tree('S', [('The', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('fantastic', 'JJ'), ('and', 'CC'), ('I', 'PRP'), ('loved', 'VBD'), ('every', 'DT'), ('part', 'NN'), ('of', 'IN'), ('it', 'PRP'), ('about', 'IN'), Tree('GPE', [('Egypt', 'NNP')])]),
 Tree('S', [('I', 'PRP'), ('hated', 'VBD'), ('the', 'DT'), ('film', 'NN'), (',', ','), ('it', 'PRP'), ('was', 'VBD'), ('the', 'DT'), ('worst', 'JJS'), ('I', 'PRP'), ('have', 'VBP'), ('ever', 'RB'), ('seen', 'VBN')]),
 Tree('S', [('The', 'DT'), ('storyline', 'NN'), ('was', 'VBD'), ('boring', 'VBG'), ('but', 'CC'), ('the', 'DT'), ('acting', 'NN'), ('was', 'VBD'), ('brilliant', 'JJ')]),
 Tree('S', [('An', 'DT'), ('amazing', 'JJ'), ('movie', 'NN'), ('with', 'IN'), ('a', 'DT'), ('great', 'JJ'), ('plot', 'NN'), ('and', 'CC'), ('incredible', 'JJ'), ('performances', 'NNS')]),
 Tree('S', [Tree('GPE', [('Egypt', 'NNP')]), ('movie', 'NN'), (',', ','), ('I', 'PRP'), ('regret', 'VBP'), ('wasting', 'VBG'), ('my', 'PRP$'), ('time', 'NN'), ('on', 'IN'), ('it', 'PR

# **Task 6: TF-IDF✅**


In [96]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)
tfidf_matrix.toarray()

array([[0.35946021, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.26734116, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30557402, 0.        , 0.        , 0.35946021,
        0.30557402, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.21345497, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30557402, 0.21345497, 0.        , 0.        ,
        0.30557402, 0.        , 0.        , 0.        , 0.35946021,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.17522211, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.21345497, 0.        ,
        0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
   

# **Task 7: N-gram Extraction✅**





In [97]:
# Task 7: N-gram Extraction

from sklearn.feature_extraction.text import CountVectorizer

# Count bigrams only (n=2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Learn the bigram vocabulary and build the review-by-bigram count matrix
bigram_matrix = bigram_vectorizer.fit_transform(text_data)

# Dense array of counts (rows = reviews, columns = bigrams)
bigram_array = bigram_matrix.toarray()

# The feature names are the bigrams themselves
bigram_features = bigram_vectorizer.get_feature_names_out()
print(bigram_features)

# Placeholder for next-word prediction with an N-gram model.
# It ignores the model and the context and always returns a fixed string.
def predict_next_word(model, context, n=2):
    return "example_next_word"

# Context for the prediction (these words do not occur in text_data)
context = ["machine", "learning"]

# Example call; bigram_vectorizer stands in for a trained N-gram model
predicted_word = predict_next_word(bigram_vectorizer, context, n=2)
print(f"Predicted next word: {predicted_word}")

['about egypt' 'absolutely loved' 'acting was' 'actors did'
 'amazing movie' 'an amazing' 'and incredible' 'and loved' 'and wonderful'
 'bad but' 'best films' 'boring but' 'but not' 'but the' 'did great'
 'did not' 'disappointing it' 'egypt movie' 'ever seen' 'every part'
 'fantastic and' 'fantastic plot' 'film it' 'film was' 'films have'
 'great either' 'great job' 'great plot' 'hated the' 'have ever'
 'have seen' 'highly recommend' 'in long' 'incredible performances'
 'it about' 'it did' 'it was' 'job but' 'just okay' 'lacked depth'
 'live up' 'long time' 'loved every' 'loved the' 'movie fantastic'
 'movie regret' 'movie was' 'movie with' 'my time' 'not great' 'not live'
 'not too' 'of it' 'of the' 'okay not' 'on it' 'one of' 'part of'
 'plot and' 'recommend it' 'regret wasting' 'seen in' 'story lacked'
 'storyline was' 'the acting' 'the actors' 'the best' 'the film'
 'the hype' 'the movie' 'the story' 'the storyline' 'the worst'
 'this film' 'time highly' 'time on' 'to the' 'too bad
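The `predict_next_word` function in the cell above is only a placeholder. A minimal sketch of a count-based version follows, assuming the prediction is simply the word that most often follows the last context word. `build_bigram_counts` and `predict_from_counts` are hypothetical helper names, the regular expression mimics the default tokenization of `CountVectorizer` (lowercase, tokens of two or more word characters), and only three of the reviews are used to keep the example short.

```python
import re
from collections import Counter, defaultdict

reviews = [
    "The movie was fantastic and I loved every part of it about Egypt",
    "Absolutely loved the movie, fantastic plot and wonderful cast",
    "The movie was disappointing, it did not live up to the hype",
]


def build_bigram_counts(docs):
    """Return {first_word: Counter of the words that follow it}."""
    counts = defaultdict(Counter)
    for doc in docs:
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        for first, second in zip(tokens, tokens[1:]):
            counts[first][second] += 1
    return counts


def predict_from_counts(counts, context):
    """Return the most frequent follower of the last context word, or None."""
    followers = counts.get(context[-1].lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


bigram_counts = build_bigram_counts(reviews)
print(predict_from_counts(bigram_counts, ["loved", "the"]))
print(predict_from_counts(bigram_counts, ["machine", "learning"]))
```

The second call returns `None` because "learning" never appears in the reviews, which is why a context taken from the corpus is more informative for this dataset.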

# **Task 8: Exploratory Questions on Tasks 4 & 5✅**


## **1-What types of entities (e.g., people, places, organizations) are most commonly identified in your text data?**

In the given dataset, the only named entity identified is "Egypt", which is classified as a GPE (Geopolitical Entity). No other people, places, or organizations were recognized, as the text mostly revolves around movie reviews and doesn't contain significant references to named entities beyond locations like Egypt.

## **2-How do these entities contribute to the overall meaning of the document?**


"Egypt" as a location suggests that the reviews might involve a movie related to Egypt. The entity contributes a specific context to the movie being reviewed, potentially shaping the audience's expectations or understanding of the movie's setting, which could be linked to themes around Egypt or Egyptian culture.


## **3-After performing POS tagging on your text, which parts of speech (e.g., nouns, verbs, adjectives) appear most frequently in the movie reviews?**

Nouns such as "movie", "film", "plot", "time", "story", and "actors" appear frequently, as they name the subject matter of a movie review.

Adjectives like "fantastic", "brilliant", and "worst" are also frequent, which is typical for reviews, since adjectives convey the writer's sentiment and evaluation of the film. Note that the tagger labels "boring" as VBG (a verb form) rather than as an adjective in "The storyline was boring".

Verbs such as "was", "loved", "hated", "recommend", and "seen" are used to express opinions and describe actions related to the films.
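Frequency claims like these can be checked by tallying the tags directly. The sketch below counts tags for the first tagged review, copied from the Task 4 output. Grouping tags by their first two letters (so that NN, NNS, and NNP count together) is an assumption made for readability, and `coarse_tag_counts` is a hypothetical helper name.

```python
from collections import Counter

# First tagged review, copied from the Task 4 output
tagged_sentence = [
    ("The", "DT"), ("movie", "NN"), ("was", "VBD"), ("fantastic", "JJ"),
    ("and", "CC"), ("I", "PRP"), ("loved", "VBD"), ("every", "DT"),
    ("part", "NN"), ("of", "IN"), ("it", "PRP"), ("about", "IN"),
    ("Egypt", "NNP"),
]


def coarse_tag_counts(tagged_tokens):
    """Count tags by their first two letters, e.g. NN/NNS/NNP -> 'NN'."""
    return Counter(tag[:2] for _, tag in tagged_tokens)


counts = coarse_tag_counts(tagged_sentence)
print(counts["NN"], counts["VB"], counts["JJ"])
```

In the notebook, the same function could be applied to all reviews at once with `coarse_tag_counts(pair for sent in pos_tagged_data for pair in sent)`.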

# **Task 9: Questions on Tasks 6 (TF-IDF) and 7 (N-gram)✅**

## **1-What is the main purpose of applying TF-IDF in text processing?**


The main purpose of TF-IDF (Term Frequency-Inverse Document Frequency) is to convert text into numerical features that reflect how important each word is within a specific document relative to the entire corpus. It emphasizes words that are more relevant to a particular document while reducing the weight of frequently occurring words that are less informative (e.g., common stopwords).


## **2-How does the inverse document frequency (IDF) part of TF-IDF affect the importance of a word?**


The IDF part of TF-IDF reduces the weight of words that occur in many documents (common words) and increases the weight of words that are rare across the entire corpus but frequent in individual documents. This helps to highlight words that are more specific and meaningful to a particular document, giving them higher importance in the feature set.
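The IDF effect can be checked by hand. The sketch below is a pure-Python approximation of what `TfidfVectorizer` does with its default settings, assuming the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, and a tokenizer that lowercases the text and keeps tokens of two or more word characters. The helper names `tokenize` and `tfidf_row` are illustrative and not part of scikit-learn.

```python
import math
import re
from collections import Counter

corpus = [
    "The movie was fantastic and I loved every part of it about Egypt",
    "I hated the film, it was the worst I have ever seen",
    "The storyline was boring but the acting was brilliant",
    "An amazing movie with a great plot and incredible performances",
    "Egypt movie, I regret wasting my time on it",
    "The actors did a great job but the story lacked depth",
    "One of the best films I have seen in a long time, highly recommend it",
    "This film was just okay, not too bad but not great either",
    "Absolutely loved the movie, fantastic plot and wonderful cast",
    "The movie was disappointing, it did not live up to the hype",
]


def tokenize(text):
    """Lowercase and keep tokens with at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_row(doc_index, docs):
    """Return {term: L2-normalized TF-IDF weight} for one document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


row0 = tfidf_row(0, corpus)
for term in ("about", "movie", "the"):
    print(term, round(row0[term], 4))
```

For the first review this gives roughly 0.359 for "about" (found in one review), 0.213 for "movie" (found in five), and 0.175 for "the" (found in seven), which agrees with values in the first row of the Task 6 output.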


## **3-What are the possible issues that can arise if you do not apply stopword removal before calculating TF-IDF? (see your experiment)**


Without removing stopwords, common words like "the", "is", and "and" remain in the vocabulary. IDF lowers their weight, but a high term frequency within a document can still give them values comparable to those of informative words, even though they carry little meaning. This adds noisy feature dimensions, can skew the results toward irrelevant words, and may reduce the performance of machine learning models that rely on these features.

## **4-What is the difference between unigrams, bigrams, and trigrams?**


- **Unigrams:** Single words (e.g., "movie").
- **Bigrams:** Sequences of two consecutive words (e.g., "great movie").
- **Trigrams:** Sequences of three consecutive words (e.g., "fantastic movie plot").

Each type captures a different level of context:
- **Unigrams** capture individual word importance.
- **Bigrams** and **trigrams** capture relationships between words, which is useful for understanding context and dependencies in tasks like sentiment analysis or text classification.
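The difference can be made concrete with a small sliding-window function. This is a minimal sketch in plain Python rather than the `CountVectorizer` approach used in Task 7; `ngrams` is a hypothetical helper name, and the token list is the eighth review, lowercased and with punctuation removed.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["this", "film", "was", "just", "okay", "not",
          "too", "bad", "but", "not", "great", "either"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(len(unigrams), len(bigrams), len(trigrams))
print(("not", "great") in bigrams)
```

The bigram ("not", "great") keeps the negation attached to the adjective, whereas the unigrams "not" and "great" on their own lose that link, which is one reason longer n-grams can help in sentiment analysis.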






