# Introduction to NLP (Natural Language Processing)

![image.png](attachment:image.png)

In [None]:
I am in Hyderabad. It is pretty hot right now.

In [None]:
23:45

In [None]:
$1 is ₹84 

In [None]:
am in Hyderabad It is pretty hot right now

1. 1.  1    1.   1.  1   1. 1. 1.  1.  1 

In [None]:
Hyderabad pretty hot right now

> Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that deals with the interaction between computers and humans using natural language. Its primary goal is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant.

### NLP encompasses a wide range of tasks and techniques, including:

1. **Text Understanding**: This involves tasks such as parsing, semantic analysis, and entity recognition, where the computer processes the structure and meaning of text to extract useful information.

2. **Text Generation**: NLP can be used to generate human-like text, such as in chatbots, language translation systems, and content creation tools.

3. **Sentiment Analysis**: NLP techniques can determine the sentiment expressed in a piece of text, whether it's positive, negative, or neutral. This is valuable for understanding public opinion, customer feedback, and social media sentiment.

4. **Language Translation**: NLP enables the translation of text from one language to another, allowing people to communicate across language barriers more easily.

5. **Question Answering**: NLP systems can understand and respond to questions posed in natural language, drawing upon vast amounts of information to provide accurate answers.

6. **Named Entity Recognition (NER)**: Identifying and classifying named entities in text, such as names of people, organizations, locations, and other specific entities, is a crucial task in many NLP applications.

7. **Text Summarization**: NLP techniques can condense large bodies of text into shorter, more concise summaries, which is useful for tasks like document summarization and news aggregation.

8. **Language Modeling**: NLP models can learn the structure and patterns of human language, which is fundamental to many NLP tasks and applications.

9. **Information Extraction**: Extracting structured information from unstructured text, such as from documents or web pages, is another important NLP task used in various applications, including data mining and knowledge graph construction.

NLP techniques rely heavily on machine learning and deep learning algorithms, which enable computers to analyze and process natural language data at scale. With advancements in AI and increasing amounts of available data, NLP continues to evolve rapidly, driving innovations in areas such as virtual assistants, language translation, sentiment analysis, and more.

### Everyday examples for each NLP task:

1. **Text Understanding**: When you use autocomplete in search engines, it's analyzing the structure and meaning of your query to suggest relevant search terms in real-time.

2. **Text Generation**: Smart-compose and smart-reply features in email and messaging apps generate human-like text suggestions in real time as you type.

3. **Sentiment Analysis**: Social media platforms like Facebook or Instagram analyze comments and posts in real-time to determine if they are positive, negative, or neutral, helping to filter content or provide analytics to users.

4. **Language Translation**: Apps like Google Translate instantly translate text messages or emails from one language to another as you type or paste them.

5. **Question Answering**: Virtual assistants like Siri or Google Assistant provide real-time answers to questions asked verbally or typed, drawing upon vast amounts of information available online.

6. **Named Entity Recognition (NER)**: Email clients like Gmail automatically detect and highlight names, locations, dates, and other entities in real-time as you compose or read emails.

7. **Text Summarization**: News apps like Flipboard or Google News provide real-time summaries of articles, helping users quickly understand the main points of each story.

8. **Language Modeling**: Predictive text features on smartphones suggest the next word in a sentence as you type, based on the patterns of human language.

9. **Information Extraction**: E-commerce websites like Amazon extract product information in real-time from product descriptions to display key details like price, availability, and specifications.

### Key NLP Tasks

- **Tokenization:** Dividing text into meaningful units (words, punctuation).

*For example*, the sentence "The quick brown fox jumps over the lazy dog" would be tokenized into individual words like ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].

- **Normalization:** Converting text to a standard form (e.g., lowercasing, removing accents).

*For instance*, "Cat" and "cat" would be normalized to "cat" to ensure they are treated as the same feature.

- **Lemmatization/Stemming:** Reducing words to their base form for analysis.

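*For example*, a stemmer strips suffixes, so "running" and "runs" both reduce to "run". A lemmatizer uses a vocabulary and part-of-speech information, so it can also handle irregular forms: "ran" might be lemmatized to "run", and "better" and "best" to "good".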

- **Named Entity Recognition (NER):** Recognizing named entities (people, organizations, locations) in text.

*Example Text:* "Google was founded by Larry Page and Sergey Brin."
   - Entities: [("Google", "Organization"), ("Larry Page", "Person"), ("Sergey Brin", "Person")]
   - Explanation: NER identifies "Google" as an organization and both founders' names as persons. This is critical for extracting information from text, such as in knowledge graphs or information retrieval systems.
   
*Example Text:* "The Eiffel Tower is located in Paris."
   - Entities: [("Eiffel Tower", "Location"), ("Paris", "Location")]
   - Explanation: Here, NER recognizes both "Eiffel Tower" and "Paris" as locations, which can be useful for tasks like geo-tagging or travel-related queries.

*Example Text:* "He mentioned the meeting on January 25th, 2020, at 2 PM costing about \$300."
   - Entities: [("January 25th, 2020", "Date"), ("2 PM", "Time"), ("\$300", "Monetary Value")]
   - Explanation: This example highlights the extraction of temporal expressions ("January 25th, 2020", "2 PM") and a monetary value ("\$300"), which can be essential for organizing events or financial analysis.

- **Sentiment Analysis:** Determining the emotional tone of text.

*For example*, words like "happy", "joyful", and "excited" might indicate positive sentiment, while words like "sad", "angry", and "frustrated" might indicate negative sentiment.
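
The tokenization and normalization steps described above can be sketched with the standard library alone. This is a minimal illustration, not how any particular NLP library does it: the function names `tokenize` and `normalize` and the regular expression are assumptions chosen for this example.

```python
import re
import unicodedata


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(token):
    """Lowercase a token and strip accents."""
    # NFKD decomposition splits an accented character into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", token)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.lower()


sentence = "The quick brown fox jumps over the lazy dog"
tokens = tokenize(sentence)
print(tokens)

# "Cat" and "cat" collapse to the same feature after normalization.
print([normalize(t) for t in ["Cat", "cat", "Café"]])
```

Real tokenizers handle many more cases (contractions, URLs, hashtags, emoji), so treat this only as a picture of the idea.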

### Challenges and Ambiguity

- **Ambiguity:** Words and phrases can have multiple meanings. For example, "bank" can refer to a financial institution or to the side of a river.
- **Context-Dependence:** The meaning of words and sentences depends heavily on context.
- **Informality:** Real-world language is often informal and doesn't always follow strict grammatical rules.
- **Data Scarcity:** Some languages or domains may lack abundant training data.

In [None]:
What a great service by AirIndia! My destination was Mumbai and my luggage bags reached Bangalore. Great!

In [None]:
AirIndia

Thank you for your kind words! Please fly with us again. 
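
The exchange above shows why context matters: the complaint is sarcastic, yet a system that only counts sentiment words would read it as praise. A minimal sketch of such a word-counting approach follows. The tiny lexicon and the function name `lexicon_sentiment` are made up for illustration and are not taken from any real sentiment resource.

```python
import re

# Tiny hand-made lexicon (an assumption for illustration only).
POSITIVE = {"great", "happy", "joyful", "excited"}
NEGATIVE = {"sad", "angry", "frustrated"}


def lexicon_sentiment(text):
    """Label text by counting positive and negative words; ignores context entirely."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweet = (
    "What a great service by AirIndia! My destination was Mumbai "
    "and my luggage bags reached Bangalore. Great!"
)
# The word "great" appears twice, so the sarcastic complaint is scored as positive.
print(lexicon_sentiment(tweet))
```

The auto-reply thanking the customer for "kind words" is exactly the failure this sketch reproduces.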

### NLP tasks using the bag-of-words (BOW) model

In [11]:
import opendatasets as od

In [12]:
od.download('https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification')

Downloading cyberbullying-classification.zip to ./cyberbullying-classification


100%|██████████████████████████████████████| 2.82M/2.82M [00:02<00:00, 1.42MB/s]





In [31]:
import pandas as pd

df = pd.read_csv('cyberbullying-classification/cyberbullying_tweets.csv')

In [32]:
df['cyberbullying_type'].value_counts()

religion               7998
age                    7992
gender                 7973
ethnicity              7961
not_cyberbullying      7945
other_cyberbullying    7823
Name: cyberbullying_type, dtype: int64

In [28]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression  # used by the classifier cell below
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [34]:
df

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying
...,...,...
47687,"Black ppl aren't expected to do anything, depe...",ethnicity
47688,Turner did not withhold his disappointment. Tu...,ethnicity
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity


In [35]:
X 

<47692x60271 sparse matrix of type '<class 'numpy.int64'>'
	with 977531 stored elements in Compressed Sparse Row format>

In [33]:
# Feature Extraction using Bag-of-Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['tweet_text'])
y = df['cyberbullying_type']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression classifier
classifier = LogisticRegression(max_iter=1000)  # Increase max_iter if needed
classifier.fit(X_train, y_train)

# Predictions
y_pred = classifier.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.821784254114687

Classification Report:
                      precision    recall  f1-score   support

                age       0.97      0.98      0.98      1603
          ethnicity       0.99      0.97      0.98      1603
             gender       0.89      0.84      0.87      1531
  not_cyberbullying       0.56      0.58      0.57      1624
other_cyberbullying       0.59      0.63      0.61      1612
           religion       0.96      0.94      0.95      1566

           accuracy                           0.82      9539
          macro avg       0.83      0.82      0.83      9539
       weighted avg       0.83      0.82      0.82      9539
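
One design point worth noting: in the cell above, the vectorizer is fitted on all tweets before the train/test split, so the vocabulary also includes words that occur only in the test set. A common way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that the vocabulary is learned from the training data only. The sketch below uses a made-up toy corpus (not the cyberbullying dataset) and `MultinomialNB`, which is imported earlier in the notebook; it illustrates the pattern and is not a replacement for the experiment above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data for illustration only.
train_texts = ["I love this", "great happy day", "I hate this", "awful sad day"]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline fits the vectorizer and the classifier on the training texts only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# New texts are vectorized with the vocabulary learned during fit.
predictions = model.predict(["love great", "hate awful"])
print(predictions)
```

With the real dataset, the same pattern would mean splitting `df['tweet_text']` first and then calling `model.fit` on the raw training texts.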



Let's use a simplified example to see what happens when we initialize `CountVectorizer` and call `fit_transform()`:

Suppose we have the following three tweet texts:

1. "I love machine learning."
2. "Machine learning is fascinating."
3. "I love coding."

Now, let's see how we convert these texts into a bag-of-words representation using CountVectorizer.

1. **Initializing CountVectorizer**:
   - We first initialize a CountVectorizer object:

```python
vectorizer = CountVectorizer()
```

2. **Converting Text into Bag-of-Words Representation**:
   - Then, we use the `fit_transform()` method to convert the tweet texts into a matrix of token counts:

```python
X = vectorizer.fit_transform(["I love machine learning.", "Machine learning is fascinating.", "I love coding."])
```

   - After calling `fit_transform()`, CountVectorizer learns the vocabulary and converts the tweet texts into a matrix of token counts. Let's see what happens:
   
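   - **Fit**: CountVectorizer lowercases the text, tokenizes it, and builds a vocabulary of the unique words, sorted alphabetically. By default it keeps only tokens of two or more characters, so the single-letter word "I" is dropped. In our case, the vocabulary will be: `['coding', 'fascinating', 'is', 'learning', 'love', 'machine']`.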
   
   - **Transform**: It converts each tweet text into one row of a count matrix. Each row corresponds to a tweet, and each column corresponds to a word in the vocabulary. The value in each cell is the number of times that word occurs in that tweet. For our example:
   
     - The first tweet "I love machine learning." becomes `[0, 0, 0, 1, 1, 1]` in the matrix because it contains 1 count each of 'learning', 'love', and 'machine'.
     - The second tweet "Machine learning is fascinating." becomes `[0, 1, 1, 1, 0, 1]`.
     - The third tweet "I love coding." becomes `[1, 0, 0, 0, 1, 0]`.
   
   - Finally, the matrix `X` holds the bag-of-words representation of the three tweets (shown here in dense form; `fit_transform()` returns a sparse matrix):

```
[[0 0 0 1 1 1]
 [0 1 1 1 0 1]
 [1 0 0 0 1 0]]
```

Note that word order is discarded: only the counts remain, which is why this representation is called a "bag" of words.
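
The fit and transform steps walked through above can be reproduced in a few lines of plain Python. This is a sketch that mimics `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters); it is not the library's actual implementation, and the helper name `tokenize` is an assumption for this example.

```python
import re

docs = [
    "I love machine learning.",
    "Machine learning is fascinating.",
    "I love coding.",
]


def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters,
    # which is why the single-letter word "I" is dropped.
    return re.findall(r"\b\w\w+\b", text.lower())


# "Fit": collect the unique tokens and sort them to fix the column order.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# "Transform": count how often each vocabulary word occurs in each document.
matrix = [[tokenize(doc).count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

With scikit-learn itself, recent versions let you inspect the same information via `vectorizer.get_feature_names_out()` and `X.toarray()`.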

In addition to the bag-of-words (BOW) model, several other models are commonly used in natural language processing (NLP). Some of the prominent ones include:

1. **Word Embedding Models**:
   - **Word2Vec**: This model represents words as dense vectors in a continuous vector space. It captures semantic meanings of words by training on large text corpora.
   - **GloVe (Global Vectors for Word Representation)**: Similar to Word2Vec, GloVe also learns word vectors, but it does so by analyzing global co-occurrence statistics of words in a corpus.
   - **FastText**: Developed by Facebook AI Research, FastText extends Word2Vec by representing words as bags of character n-grams, enabling it to handle out-of-vocabulary words better.

2. **Sequence Models**:
   - **Recurrent Neural Networks (RNNs)**: RNNs are designed to work with sequences of data. They process inputs step-by-step while maintaining a hidden state, making them suitable for tasks like language modeling, machine translation, and sentiment analysis.
   - **Long Short-Term Memory Networks (LSTMs)**: A type of RNN, LSTMs are designed to address the vanishing gradient problem. They have memory cells that can remember information over long sequences, making them effective for tasks requiring understanding of long-range dependencies.
   - **Gated Recurrent Units (GRUs)**: Similar to LSTMs, GRUs are another type of RNN designed to address the vanishing gradient problem, but with a simpler architecture.

3. **Transformer Models**:
   - **BERT (Bidirectional Encoder Representations from Transformers)**: BERT, developed by Google, is a transformer-based model pre-trained on large text corpora. It has achieved state-of-the-art performance on various NLP tasks by capturing bidirectional contextual information.
   - **GPT (Generative Pre-trained Transformer)**: GPT is a series of transformer-based models developed by OpenAI. These models are designed for autoregressive generation tasks and have been used for tasks like text generation, summarization, and question answering.
   - **XLNet**: XLNet is another transformer-based model that overcomes some limitations of BERT by leveraging permutation-based pre-training, allowing it to capture bidirectional context without the constraints of left-to-right or right-to-left pre-training objectives.

4. **Statistical Models**:
   - **Hidden Markov Models (HMMs)**: HMMs are probabilistic models commonly used for tasks like part-of-speech tagging and named entity recognition.
   - **Conditional Random Fields (CRFs)**: CRFs are a type of discriminative probabilistic model used for sequence labeling tasks, such as named entity recognition and chunking.

5. **Ensemble Models**:
   - **Voting Classifiers**: Ensemble methods combine multiple models to improve performance. In NLP, voting classifiers can combine predictions from multiple classifiers, such as SVMs, decision trees, or neural networks.
   - **Stacking**: Stacking involves training multiple models and using another model to learn how to best combine their predictions.
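
To make the word-embedding idea from item 1 more concrete: once words are dense vectors, semantic relatedness is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real Word2Vec, GloVe, or FastText vectors are learned from large corpora and typically have a hundred or more dimensions.

```python
import math

# Made-up toy vectors (an assumption for illustration, not learned embeddings).
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])

# With these toy vectors, "king" is closer to "queen" than to "apple".
print(round(king_queen, 3), round(king_apple, 3))
```

A bag-of-words vector cannot express this kind of similarity, because every word occupies its own unrelated column.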

These are just a few examples of the diverse range of models used in natural language processing. The choice of model depends on the specific task, available data, computational resources, and desired performance.