Here’s an expanded list of key components in Natural Language Processing (NLP) with brief explanations:

### **1. Text Preprocessing**
   - **Tokenization**: Divides text into smaller units such as words or sentences. This is a foundational step in text analysis.
   - **Removing Stopwords**: Eliminates common words (like "and," "the," "is") that carry little meaning on their own and mostly add noise to the analysis.
   - **Stemming/Lemmatization**: Reduces words to a base or root form. Stemming strips suffixes with heuristic rules, so the result may not be an actual word, while lemmatization maps each word to its dictionary form. For example, "running" reduces to "run", and both "cleaning" and "cleaned" reduce to "clean".
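
As a rough illustration of the three steps above, here is a minimal pure-Python sketch. It does not use NLTK: the regular expression, the tiny `STOPWORDS` set, and the `crude_stem` suffix rules are all illustrative assumptions, far simpler than a real tokenizer or the Porter stemmer.

```python
import re

# Toy stopword list (an assumption for illustration; real lists are much longer).
STOPWORDS = {"and", "the", "is", "a", "of"}

def tokenize(text):
    """Split text into lowercase word tokens using a simple regex."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the toy stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Strip a few common suffixes; the result need not be a real word."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The cat is cleaning and the dogs cleaned.")
content = remove_stopwords(tokens)
stems = [crude_stem(t) for t in content]
print(stems)
```

Note that `crude_stem("running")` gives `"runn"`, which is the kind of non-word root that stemming can produce and lemmatization avoids.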

### **2. Text Representation**
   - **Bag of Words (BoW)**: Represents text as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency.
   - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weighs terms based on their frequency in a document relative to their occurrence across multiple documents, highlighting words that are more significant in specific contexts.
   - **Word Embedding**: Represents words as dense vectors in continuous space, capturing semantic meanings and relationships. Examples include Word2Vec, GloVe, and FastText.
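
To make Bag of Words and TF-IDF concrete, the following sketch computes both by hand on three made-up documents. It assumes the simplest textbook definitions (term frequency as count divided by document length, inverse document frequency as `log(N / df)`); libraries typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Three toy documents, already tokenized (made up for illustration).
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

# Bag of Words: per-document term counts, word order ignored.
bow = [Counter(doc) for doc in docs]

def tf_idf(term, doc_index):
    """TF-IDF with tf = count / doc length and idf = log(N / df).

    The term must occur in at least one document, otherwise df is zero.
    """
    tf = bow[doc_index][term] / len(docs[doc_index])
    df = sum(1 for counts in bow if term in counts)
    return tf * math.log(len(docs) / df)

print(bow[0])
print(tf_idf("plot", 0), tf_idf("movie", 0))
```

In the first document, "plot" and "movie" each occur once, but "plot" gets the higher weight because it appears in only one of the three documents.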

### **3. Core NLP Tasks**
   - **Sentiment Analysis**: Determines the sentiment behind a piece of text, categorizing it as positive, negative, or neutral.
   - **Named Entity Recognition (NER)**: Identifies and classifies entities in text into predefined categories such as names of people, organizations, locations, dates, etc.
   - **Machine Translation**: Automatically translates text from one language to another using techniques such as statistical models or neural networks.
   - **Text Classification**: Assigns categories or labels to text based on its content. Examples include spam detection, topic categorization, etc.
   - **Language Generation**: Involves generating coherent and contextually relevant text, such as in text completion, summarization, or chatbot responses.
   - **Speech Recognition and Synthesis**: Converts spoken language into text (recognition) and vice versa (synthesis), enabling human-computer interaction through voice.
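
As a small illustration of the first task in this list, here is a lexicon-based sentiment sketch. The `LEXICON` dictionary is a hand-made assumption with only a few words; practical systems learn from labeled data or use much larger resources.

```python
# Tiny hand-made lexicon (an assumption for illustration, not a real resource).
LEXICON = {"great": 1, "good": 1, "loved": 1, "bad": -1, "terrible": -1, "boring": -1}

def sentiment(tokens):
    """Label a token list by summing word polarities from the toy lexicon."""
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("a great film with good acting".split()))
print(sentiment("terrible and boring".split()))
print(sentiment("a film about trains".split()))
```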

### **4. Advanced NLP Techniques**
   - **Contextualized Word Embeddings**: Models like BERT and GPT represent words in context, allowing for more nuanced understanding of language.
   - **Sequence-to-Sequence Models**: Used in tasks like machine translation and summarization, where input and output are sequences of text.
   - **Transformers**: A deep learning architecture that has revolutionized NLP by allowing for better handling of long-range dependencies and parallel processing.
   - **Attention Mechanisms**: Let a model focus on the most relevant parts of the input when making a prediction. Attention is the core building block of Transformers.
   - **Reinforcement Learning in NLP**: Used in tasks like dialogue systems and text generation, where the model learns through trial and error.
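
The attention idea can be sketched in a few lines of plain Python. This is scaled dot-product attention for a single query, written with lists instead of tensors; the vectors are made-up toy values, and real models add learned projections, multiple heads, and batching.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([1.0, 0.0], keys, values)
print(weights, output)
```

The query points in the same direction as the first key, so the first and third keys receive more weight than the second, and the output leans toward their values.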

This structured breakdown provides a comprehensive overview of the key components in NLP, highlighting both foundational techniques and advanced methodologies.




Here's a step-by-step explanation of the code and parameters used in your script:

### 1. **Installing NLTK**


! pip install nltk

- This command installs the Natural Language Toolkit (NLTK) library, which is essential for text processing and analysis in Python.

### 2. **Importing Libraries**

In [15]:
import nltk
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer


- `nltk`: The core library for natural language processing.
- `re`: Provides regular expression matching operations.
- `pandas as pd`: Used for data manipulation and analysis.
- `stopwords`: Provides a list of common stopwords in various languages.
- `word_tokenize`: Tokenizes text into words.
- `sent_tokenize`: Tokenizes text into sentences.
- `PorterStemmer`: Implements stemming, which reduces words to their root forms.
- `WordNetLemmatizer`: Implements lemmatization, which reduces words to their base or dictionary form.

### 3. **Downloading NLTK Data**

In [16]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /home/puskar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/puskar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/puskar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True



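- **`punkt`**: Pre-trained data for the Punkt sentence tokenizer. `sent_tokenize` uses it directly, and `word_tokenize` uses it to split text into sentences before splitting each sentence into words.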
  - *Example*: Converts "Hello world. How are you?" into ["Hello world.", "How are you?"] and then into ["Hello", "world", ".", "How", "are", "you", "?"].

- **`stopwords`**: Provides common stopwords to filter out from text.
  - *Example*: Removes words like "the", "is", and "and" from "The quick brown fox is jumping over the lazy dog".

- **`wordnet`**: A lexical database for English that helps in lemmatization.
  - *Example*: Converts "running" to its base form "run" when the word is lemmatized as a verb (`pos='v'`).

### 4. **Loading the Dataset**


In [17]:

from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/puskar/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True



- `movie_reviews`: A corpus from NLTK that contains movie reviews categorized into positive and negative sentiments.
- `nltk.download('movie_reviews')`: Downloads the movie reviews dataset if not already available.



In [18]:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]




- `documents`: Built by a list comprehension that iterates over each category (positive and negative) and each file within that category. It loads the words from each file and pairs them with the category label.

The result is a list of tuples drawn from the `movie_reviews` corpus in NLTK. Here's a breakdown of how it works:

- **`movie_reviews.categories()`**: Retrieves the two categories in the movie reviews dataset, which are `pos` (positive) and `neg` (negative).

- **`movie_reviews.fileids(category)`**: Gets a list of file IDs for each category. Each file ID corresponds to a text file containing a review.

- **`movie_reviews.words(fileid)`**: Reads the words from the specified file ID.

- **`[(list(movie_reviews.words(fileid)), category) for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)]`**: Constructs a list of tuples where each tuple contains:
  - A list of words from a review (obtained using `movie_reviews.words(fileid)`).
  - The category label (`pos` or `neg`) of that review.

**Example:**

If the `movie_reviews` corpus contains two categories with reviews, this line of code might produce a list like:

```python
[
    (['this', 'movie', 'was', 'great', 'positive', 'review'], 'pos'),
    (['terrible', 'film', 'negative', 'review'], 'neg')
]
```

Here, each tuple consists of:
- A list of words from the review.
- The sentiment label of the review (`'pos'` or `'neg'`).
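
The order of the two `for` clauses matters: the category loop is the outer one, so all files of one category are listed before the next category starts. The sketch below shows this with a small dictionary standing in for the corpus; `toy_corpus` and its file IDs are hypothetical and only mirror the shape of the real data.

```python
# Hypothetical stand-in for the corpus: category -> {file id -> words}.
toy_corpus = {
    "neg": {"neg/a.txt": ["terrible", "film"]},
    "pos": {"pos/b.txt": ["great", "movie"], "pos/c.txt": ["loved", "it"]},
}

documents = [
    (list(words), category)
    for category in toy_corpus                         # outer loop: categories
    for fileid, words in toy_corpus[category].items()  # inner loop: files
]

for words, label in documents:
    print(label, words)
```

Because the list is grouped by category, it is usually worth shuffling it before splitting into training and test sets.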

In [19]:
df = pd.DataFrame(documents, columns = ['review', 'sentiment'])

df.head()

| | review | sentiment |
|---|---|---|
| 0 | [plot, :, two, teen, couples, go, to, a, churc... | neg |
| 1 | [the, happy, bastard, ', s, quick, movie, revi... | neg |
| 2 | [it, is, movies, like, these, that, make, a, j... | neg |
| 3 | [", quest, for, camelot, ", is, warner, bros, ... | neg |
| 4 | [synopsis, :, a, mentally, unstable, man, unde... | neg |


- Converts the list of documents into a pandas DataFrame with columns `review` and `sentiment`.

### 5. **Text Preprocessing**

#### Lowercasing


In [20]:
df['review'] = df['review'].apply(lambda x: [word.lower() for word in x])

- Converts all words in the `review` column to lowercase to maintain consistency and avoid case-sensitive mismatches.

#### Tokenization

In [22]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer from this resource
df['review'] = df['review'].apply(lambda x: word_tokenize(' '.join(x)))

- Recent NLTK releases look for the `punkt_tab` resource (`tokenizers/punkt_tab/english/`) rather than `punkt`. With only `punkt` downloaded, `word_tokenize` raises `LookupError: Resource punkt_tab not found`, and the error message suggests running `nltk.download('punkt_tab')`, which is why that call is included here.


- Joins the words into a single string and then tokenizes it into individual words using `word_tokenize`.


In [None]:

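df['sentences'] = df['review'].apply(lambda x: sent_tokenize(' '.join(x)))

- Joins the words into a single string and splits it into sentences using `sent_tokenize`. The result is stored in a separate `sentences` column so that `review` keeps its word tokens, which the following steps filter word by word.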
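
Sentence tokens and word tokens are different granularities, and word-level filters only make sense on word tokens. The sketch below shows the difference, using a naive regular-expression splitter as a stand-in for `sent_tokenize` (an assumption made to keep the example dependency-free; Punkt handles abbreviations and other cases this regex does not).

```python
import re

def naive_sentences(text):
    """Split after ., ! or ? followed by whitespace (stand-in for sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "the plot is thin . the acting is great !"
sentences = naive_sentences(text)
words = text.split()

print(sentences)
# A word-level filter such as str.isalpha() rejects whole sentences,
# because they contain spaces and punctuation.
sentence_survivors = [s for s in sentences if s.isalpha()]
word_survivors = [w for w in words if w.isalpha()]
print(sentence_survivors)
print(word_survivors)
```

Applied to a list of sentences, the `isalpha()` filter leaves nothing behind, while on word tokens it removes only the punctuation.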

#### Removing Stopwords

In [None]:
stop_words = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: [word for word in x if word not in stop_words])


- `stop_words`: Set of English stopwords.
- Filters out the stopwords from the reviews.

#### Removing Punctuation and Non-Alphanumeric Characters



In [None]:
df['review'] = df['review'].apply(lambda x: [re.sub(r'\W+', '', word) for word in x if word.isalpha()])


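- `word.isalpha()`: Keeps only tokens made up entirely of letters, which drops punctuation tokens and anything containing digits or symbols.
- `re.sub(r'\W+', '', word)`: Strips non-word characters (anything other than letters, digits, and underscore). Because the `isalpha()` filter already guarantees purely alphabetic tokens, this substitution leaves them unchanged here.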
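
The interaction between the two operations is easier to see on a handful of tokens. The token list below is made up for illustration, and the second variant is just one possible alternative, not part of the original pipeline.

```python
import re

tokens = ["plot", ":", "two", "it's", "90s", "well-made", "!"]

# Same order as the cell above: keep alphabetic tokens, then apply re.sub.
filtered = [re.sub(r"\W+", "", t) for t in tokens if t.isalpha()]

# Alternative: strip non-word characters first, then drop empty strings.
stripped = [s for s in (re.sub(r"\W+", "", t) for t in tokens) if s]

print(filtered)
print(stripped)
```

The first variant discards tokens such as `it's` and `well-made` entirely, while the second keeps them in stripped form (and also keeps `90s`, since digits count as word characters). Which behavior is preferable depends on the task.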

#### Stemming



In [None]:
stemmer = PorterStemmer()
df['review'] = df['review'].apply(lambda x: [stemmer.stem(word) for word in x])


- `PorterStemmer()`: Initializes the Porter Stemming algorithm.
- Applies stemming to reduce words to their root forms.

#### Lemmatization



In [None]:
lemmatizer = WordNetLemmatizer()
df['review'] = df['review'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])


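- `WordNetLemmatizer()`: Creates a lemmatizer backed by the WordNet database.
- Applies lemmatization to reduce words to their base or dictionary form. `lemmatize` treats each word as a noun unless a `pos` argument is given. Because the tokens were already stemmed in the previous step, many are no longer dictionary words and pass through unchanged, so in practice one usually picks either stemming or lemmatization rather than both.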

### 6. **Joining Tokens Back into Sentences**



In [None]:
df['cleaned_review'] = df['review'].apply(lambda x: ' '.join(x))



- Joins the tokens back into a single string for each review.

### 7. **Saving the Cleaned Data**


In [None]:
df.to_csv('cleaned_movie_reviews.csv', index=False)



- Saves the DataFrame with the cleaned reviews to a CSV file named `cleaned_movie_reviews.csv`.
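
One thing to watch for: the `review` column still holds Python lists, and CSV has no list type, so each list ends up in the file as its string representation. The sketch below mimics that with the standard `csv` module rather than pandas (an assumption made to keep the example dependency-free) and shows one way to recover the list with `ast.literal_eval`.

```python
import ast
import csv
import io

row = {"review": ["great", "movie"], "cleaned_review": "great movie"}

# Write one row; the list is stored as its str() form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["review", "cleaned_review"])
writer.writeheader()
writer.writerow({"review": str(row["review"]),
                 "cleaned_review": row["cleaned_review"]})

# Read it back: every field is now a plain string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
restored = ast.literal_eval(loaded["review"])  # parse the string back into a list
print(type(loaded["review"]).__name__, restored)
```

If only the cleaned text is needed downstream, reading the `cleaned_review` column avoids this conversion altogether.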

This sequence of steps preprocesses the movie reviews dataset by lowercasing, tokenizing, removing stopwords and punctuation, stemming and lemmatizing the text, and finally joining the tokens back into strings and saving the result to CSV.