
# NLP

- Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human language. It involves enabling machines to understand, interpret, and generate human language in a way that is both meaningful and useful.

## Key Components of NLP:
- **Text Preprocessing:** Cleaning and preparing text data for analysis. This includes tasks like tokenization (splitting text into words or sentences), removing stopwords, and stemming or lemmatization (reducing words to their base form).

- **Text Representation:** Converting text into a numerical format that a machine can process. Common techniques include:

  - **Bag of Words (BoW):** Represents each document as a vector of word counts or frequencies, ignoring word order.
  - **TF-IDF:** Weights each term by its frequency within a document, discounted by how many documents in the corpus contain it.
  - **Word Embeddings:** Dense vector representations of words that capture semantic meaning (e.g., Word2Vec, GloVe).

- **Sentiment Analysis:** Determining the sentiment or emotion expressed in a piece of text (e.g., positive, negative, neutral).

- **Named Entity Recognition (NER):** Identifying and classifying entities in text (e.g., names of people, places, organizations).

- **Machine Translation:** Translating text from one language to another, for example with models based on the Transformer architecture.

- **Text Classification:** Assigning predefined categories or labels to text (e.g., spam detection in emails).

- **Language Generation:** Producing human-like text, such as essays, summaries, or creative content.

- **Speech Recognition and Synthesis:** Converting spoken language into text (speech recognition) and generating spoken language from text (speech synthesis).
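
To make the Bag of Words and TF-IDF representations listed above more concrete, here is a minimal pure-Python sketch. The toy corpus and the names `docs`, `bow`, `doc_freq`, and `tf_idf` are made up for illustration, and the formula is the plain textbook variant; libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

# Bag of Words: one Counter of raw term counts per document.
bow = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tf_idf(term, doc_counts):
    """Term frequency times log inverse document frequency (textbook variant)."""
    tf = doc_counts[term] / sum(doc_counts.values())
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

print(bow[0])
# "plot" appears in one document, "movie" in two, so "plot" scores higher.
print(round(tf_idf("plot", bow[0]), 3))
print(round(tf_idf("movie", bow[0]), 3))
```

In the first document, "plot" and "movie" each occur once, but "plot" gets the larger weight because it is rarer across the corpus.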

### Applications of NLP:
- **Chatbots and Virtual Assistants:** Automating customer service and providing conversational interfaces.
- **Search Engines:** Enhancing search capabilities by understanding user queries better.
- **Content Recommendation:** Suggesting relevant content based on user interests.
- **Healthcare:** Analyzing clinical notes, automating diagnostics, and managing patient data.

NLP is a rapidly evolving field with applications across various industries, leveraging the power of AI to understand and generate human language.

# Text preprocessing
- We'll use a public dataset of movie reviews: NLTK's `movie_reviews` corpus, which contains 2,000 reviews labeled `pos` or `neg` and can be downloaded through the nltk library.

### Step 1: Import Libraries
First, import the necessary libraries (install `nltk` and `pandas` beforehand if they are not already available).

In [1]:
import nltk
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

### Step 2: Load the Dataset
For this example, we'll use NLTK's `movie_reviews` corpus. If you have your own dataset in a CSV file, you can load it with `pd.read_csv` instead.

In [2]:
# Download the dataset from nltk
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

# Load the dataset into a DataFrame
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Convert to DataFrame
df = pd.DataFrame(documents, columns=['review', 'sentiment'])

# Preview the dataset
print(df.head())


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


                                              review sentiment
0  [plot, :, two, teen, couples, go, to, a, churc...       neg
1  [the, happy, bastard, ', s, quick, movie, revi...       neg
2  [it, is, movies, like, these, that, make, a, j...       neg
3  [", quest, for, camelot, ", is, warner, bros, ...       neg
4  [synopsis, :, a, mentally, unstable, man, unde...       neg


### Step 3: Text Preprocessing
### 1. Lowercasing
- Convert all the text to lowercase to ensure uniformity.

In [3]:
# Convert to lowercase
df['review'] = df['review'].apply(lambda x: [word.lower() for word in x])


### 2. Tokenization
- Tokenize the text into individual words or sentences.

In [4]:
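# movie_reviews.words() already returns tokens, so this step is shown for
# illustration: join the tokens back into text and re-tokenize with word_tokenize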
df['review'] = df['review'].apply(lambda x: word_tokenize(' '.join(x)))

# Example for sentence tokenization (if you have large text):
# df['review'] = df['review'].apply(lambda x: sent_tokenize(' '.join(x)))
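
To illustrate the difference between word-level and sentence-level tokenization, here is a small sketch using only regular expressions. It is a simplified stand-in, not what `word_tokenize` or `sent_tokenize` do internally; NLTK's tokenizers handle abbreviations, contractions, and other cases that these patterns do not. The sample text is made up.

```python
import re

text = "The plot was thin. The acting, however, was superb!"

# Word-level: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# Sentence-level: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words)
print(sentences)
```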


### 3. Removing Stopwords
- Remove common stopwords that do not contribute significant meaning.

In [5]:
# Define stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
df['review'] = df['review'].apply(lambda x: [word for word in x if word not in stop_words])
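
One thing to keep in mind for sentiment tasks: NLTK's English stopword list includes negations such as "not" and "no", so removing every stopword can flip the apparent meaning of a review. The sketch below uses a short, hypothetical stopword set (`demo_stop_words`) rather than the real list, and the choice of which negations to keep is an assumption for illustration.

```python
# Hypothetical, abbreviated stopword set for illustration only;
# the notebook itself uses stopwords.words('english') from NLTK.
demo_stop_words = {"the", "was", "is", "a", "not", "no", "this"}

# Words kept because they can flip sentiment (an assumption for this sketch).
negations = {"not", "no"}
sentiment_stop_words = demo_stop_words - negations

tokens = ["this", "movie", "was", "not", "good"]

# Removing every stopword drops the negation as well.
plain = [t for t in tokens if t not in demo_stop_words]

# Using the reduced set preserves "not".
keep_negation = [t for t in tokens if t not in sentiment_stop_words]

print(plain)
print(keep_negation)
```

Whether to keep negations depends on the downstream task; for topic classification they may matter much less than for sentiment analysis.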


### 4. Removing Punctuation and Non-Alphabetic Characters
- Remove punctuation and non-alphabetic characters.

In [6]:
# Keep only purely alphabetic tokens; this drops punctuation, numbers,
# and mixed tokens, so no further regex substitution is needed
df['review'] = df['review'].apply(lambda x: [word for word in x if word.isalpha()])
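
The filter in this step keeps a token only when `str.isalpha()` is true, which means mixed tokens are dropped entirely rather than cleaned. A short sketch with made-up tokens shows what survives:

```python
tokens = ["great", "3d", "effects", "!", "sci-fi", "10", "n't", "café"]

# str.isalpha() is True only if every character is a letter,
# so digits, hyphens, and apostrophes disqualify the whole token.
alpha_only = [t for t in tokens if t.isalpha()]

print(alpha_only)
```

If tokens such as "sci-fi" or "3d" matter for your task, a less strict rule (for example, stripping punctuation instead of discarding the token) may be a better fit.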


### 5. Stemming
- Reduce words to a stem by stripping suffixes. The result is not always a dictionary word (for example, "movies" becomes "movi").

In [7]:
# Initialize stemmer
stemmer = PorterStemmer()

# Apply stemming
df['review'] = df['review'].apply(lambda x: [stemmer.stem(word) for word in x])
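
To see what the Porter stemmer does to individual words, here is a small sketch on a hand-picked word list. It uses a separate instance named `demo_stemmer` so it does not interfere with the notebook's variables; the words are chosen for illustration.

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

words = ["movies", "happy", "party", "couples", "running"]

# Each word is reduced by rule-based suffix stripping.
stems = [demo_stemmer.stem(w) for w in words]

print(stems)
```

Several of the stems, such as "movi" and "happi", are not real words, which is why stemmed output can be hard to read even though it groups related word forms together.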


### 6. Lemmatization
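- Alternatively, you can use lemmatization, which maps each word to its dictionary form (lemma) using WordNet. Called without a part-of-speech tag, as in the cell below, `WordNetLemmatizer` treats every word as a noun. Because this cell runs after stemming, it receives stems rather than full words and changes very little; in practice, choose either stemming or lemmatization rather than applying both.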

In [9]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization
df['review'] = df['review'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])


### Step 4: Joining Tokens Back into Sentences
- After processing, you may want to join the tokens back into a single string.

In [10]:
# Join tokens back into a string
df['cleaned_review'] = df['review'].apply(lambda x: ' '.join(x))

# Preview the cleaned text
print(df['cleaned_review'].head())


0    plot two teen coupl go church parti drink driv...
1    happi bastard quick movi review damn bug got h...
2    movi like make jade movi viewer thank invent t...
3    quest camelot warner bro first featur length f...
4    synopsi mental unstabl man undergo psychothera...
Name: cleaned_review, dtype: object


### Step 5: Save or Continue with Further NLP Tasks
- Now that you’ve preprocessed the text data, you can save it to a file for future use or proceed with further NLP tasks such as sentiment analysis, classification, etc.

In [11]:
# Save to CSV
df.to_csv('cleaned_movie_reviews.csv', index=False)
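
Note that the `review` column holds Python lists, and CSV stores them as their string representation. The sketch below shows the round trip on a tiny made-up DataFrame (`demo`), writing to an in-memory buffer instead of a file, and uses `ast.literal_eval` as one possible way to turn the strings back into lists.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for the notebook's DataFrame (hypothetical data).
demo = pd.DataFrame({"review": [["plot", "two", "teen"]], "sentiment": ["neg"]})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)

# The list column comes back as a plain string such as "['plot', 'two', 'teen']".
as_text = loaded.loc[0, "review"]
print(type(as_text), as_text)

# ast.literal_eval safely parses that string back into a list.
loaded["review"] = loaded["review"].apply(ast.literal_eval)
print(loaded.loc[0, "review"])
```

If only the cleaned text is needed later, saving just the `cleaned_review` and `sentiment` columns avoids this conversion altogether.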


## Summary:
- **Lowercasing:** Standardizes text by converting everything to lowercase.
- **Tokenization:** Splits text into words or sentences.
- **Stopwords Removal:** Eliminates common, insignificant words.
- **Punctuation Removal:** Cleans text by removing punctuation and non-alphabetic characters.
- **Stemming/Lemmatization:** Reduces words to their root or base form to handle variations.