<h1 align = center> Advanced NLP Techniques (NER) </h1>

#### What is NER in NLP?

NER stands for Named Entity Recognition. It is the process of identifying named entities in text and classifying them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.

#### Applications of NER in NLP

- Customer Support: NER helps in identifying and categorizing customer queries, making it easier for agents to respond effectively.
- Medical Diagnosis: NER can be used to extract patient names, diseases, symptoms, and medications from medical reports.
- Legal Documents: NER can be used to extract names of individuals, organizations, and places from legal documents.
- Search Engines: NER helps in indexing and searching relevant documents by identifying and categorizing named entities.
- Chatbots: NER can be used to extract relevant information from user queries and generate appropriate responses.
- Information Retrieval: NER can be used to retrieve relevant documents or articles based on named entities.

#### Techniques for NER

- Rule-based Systems: Simple rule-based systems can be used to identify named entities based on predefined patterns and rules.
- Machine Learning Algorithms: Various machine learning algorithms, such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), can be used to train and improve NER models.
- Deep Learning Techniques: Deep learning models, such as Recurrent Neural Networks (RNNs) and Transformers, can be used to capture contextual information and improve NER accuracy.
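
To make the rule-based approach concrete, here is a minimal sketch that uses only Python's `re` module. The three patterns, the label names (`MONEY`, `PERCENT`, `YEAR`), and the function name `rule_based_ner` are illustrative assumptions, not a complete or recommended rule set.

```python
import re

# Hypothetical hand-written rules: each label maps to one regular expression.
RULES = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
}

def rule_based_ner(text):
    """Return (start, end, label) spans for every rule match, sorted by position."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

sample = "In 2021 the price rose 5% to $12.50."
for start, end, label in rule_based_ner(sample):
    print(sample[start:end], label)
```

Rules like these are easy to read and debug, but every new entity type or spelling variant needs another pattern, which is why the statistical and neural approaches listed above are usually preferred for open-ended text.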

#### NER Evaluation Metrics

- Precision: The fraction of predicted named entities that are correct.
- Recall: The fraction of the actual (gold-standard) named entities that the model finds.
- F1 Score: The harmonic mean of precision and recall, which balances the two.
- Accuracy: The fraction of tokens that receive the correct label. Because most tokens are not part of any entity, accuracy can look high even for a weak model, so precision, recall, and F1 are usually preferred for NER.
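
As a small worked example of these metrics, the sketch below computes entity-level precision, recall, and F1 from two sets of `(start, end, label)` spans, counting a prediction as correct only when both the span and the label match exactly. The function name `entity_prf` and the example spans are made up for illustration.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 with exact span-and-label matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up annotations for one sentence: (start, end, label) character spans.
gold = [(0, 8, "GPE"), (48, 79, "PERSON"), (83, 87, "DATE")]
predicted = [(0, 8, "GPE"), (69, 79, "PERSON"), (83, 87, "DATE"), (14, 20, "NORP")]

precision, recall, f1 = entity_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here two of the four predictions are correct (precision 0.50) and two of the three gold entities are found (recall 0.67). The partially matched `PERSON` span counts as an error on both sides under exact matching.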

#### NER Challenges and Solutions

- Limited Training Data: NER models require large amounts of labeled training data to achieve high accuracy. Overfitting can occur if the model is trained on too few examples or if it memorizes the training data instead of learning the underlying patterns. To overcome this challenge, techniques like data augmentation, transfer learning, and ensemble methods can be used.
- Contextual Information: NER models often struggle to capture contextual information, such as the relationship between different named entities. To address this challenge, techniques like bidirectional LSTM models, attention mechanisms, and contextual embeddings can be used.
- Ambiguity: Named entities can be ambiguous, and different interpretations of the same entity can occur. To handle this challenge, techniques like named entity disambiguation, knowledge bases, and contextual information can be used.
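- Noisy Data: NER models may encounter noisy data, such as incorrect annotations, spelling errors, or incomplete information. Techniques like text normalization, annotation review, and filtering of low-quality examples can reduce its impact.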



<h2 align = center> Importing Necessary Libraries </h2>


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import spacy
from spacy import displacy
from spacy.tokens import DocBin
from spacy.util import filter_spans



from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report





<h1 align = center> Building a Custom Named Entity Recognition Model Using spaCy </h1>

<h2 align = center> Downloading spaCy Built-in Model </h2>

In [7]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


### Loading Model

In [9]:
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x7c0904d95b70>

### Passing Text to the Model

In [10]:
text = "Pakistan is a muslim country that is founded by Quid-e-Azam Muhammad Ali Jinnah in 1947"
text = nlp(text)

### Printing Entities in our text

In [12]:
text.ents

(Pakistan, muslim, Ali Jinnah, 1947)

### Showing Entities With Text

In [14]:
displacy.render(text , style = "ent" , jupyter = True)

### Interpretation

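In the text above, the entity labels mean:

- GPE: Geopolitical entity (countries, cities, states)
- NORP: Nationalities or religious or political groups
- PERSON: The name of a person
- DATE: Absolute or relative dates or periods

Note that the model picked up only "Ali Jinnah" rather than the full name "Quid-e-Azam Muhammad Ali Jinnah".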


<h2 align = center> Making Custom Model Using Spacy </h2>

I am training a NER model that extracts the name, quantity, and unit of each ingredient.

### Importing Dataset

In [19]:
data = pd.read_csv('train.csv')
data.head()

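| Unnamed: 0 | source | ingredient_id | token_id | token | label |
|---|---|---|---|---|---|
| 0 | ar | 0 | 0 | 4 | QUANTITY |
| 1 | ar | 0 | 1 | cloves | UNIT |
| 2 | ar | 0 | 2 | garlic | NAME |
| 3 | ar | 1 | 0 | 2 | QUANTITY |
| 4 | ar | 1 | 1 | tablespoons | UNIT |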


### Converting CSV to spaCy Format

In [24]:
grouped = data.groupby(['source' , 'ingredient_id'])

In [32]:
training_data = []

for name, group in grouped:
    tokens = [str(token) for token in group['token']]
    labels = group['label'].tolist()

    # Rebuild the ingredient phrase with single spaces between tokens
    sentence = ' '.join(tokens)

    # Compute character offsets (start inclusive, end exclusive) for each token
    entities = []
    start = 0
    for token, label in zip(tokens, labels):
        end = start + len(token)
        if label != 'O':
            entities.append((start, end, label))
        start = end + 1  # skip the space added by the join

    training_data.append((sentence, {'entities': entities}))

training_data[0]

('4 cloves garlic',
 {'entities': [(0, 1, 'QUANTITY'), (2, 8, 'UNIT'), (9, 15, 'NAME')]})

### Interpretation

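The sentence '4 cloves garlic' contains 3 entities. The first spans characters 0-1 and is the quantity, the second spans 2-8 and is the unit, and the last spans 9-15 and is the ingredient name. The start offset is inclusive and the end offset is exclusive, so (0, 1) covers only the character '4'.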
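
A quick way to check offsets like these is to slice the sentence with them and confirm that each slice returns the original token. The sketch below wraps the same offset logic in a helper (the name `to_spacy_example` is an assumption for illustration) and runs it on the first example.

```python
def to_spacy_example(tokens, labels):
    """Join tokens with single spaces and compute (start, end, label) character spans."""
    entities = []
    start = 0
    for token, label in zip(tokens, labels):
        end = start + len(str(token))  # end offset is exclusive
        if label != "O":
            entities.append((start, end, label))
        start = end + 1  # skip the joining space
    return " ".join(str(t) for t in tokens), {"entities": entities}

sentence, annotation = to_spacy_example(
    ["4", "cloves", "garlic"], ["QUANTITY", "UNIT", "NAME"]
)
# Each slice should reproduce the token the span was built from.
for start, end, label in annotation["entities"]:
    print(f"{sentence[start:end]!r} -> {label}")
```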

<h2 align = center> NER Model Training </h2>

### Creating a spaCy DocBin File

In [33]:

nlp = spacy.blank("en")
doc_bin = DocBin()

In [37]:

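for text, annotation in training_data:
    doc = nlp.make_doc(text)
    entities = []
    for start, end, label in annotation['entities']:
        # char_span returns None when the offsets do not align with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            entities.append(span)
    # filter_spans removes overlapping spans, keeping the longest
    doc.ents = filter_spans(entities)
    doc_bin.add(doc)

# Serialize all annotated docs to disk for `spacy train`
doc_bin.to_disk("train.spacy")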
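
The call to `filter_spans` matters because `doc.ents` cannot hold overlapping entities. The sketch below illustrates the idea on plain `(start, end, label)` tuples: prefer the longest span, break ties by the earliest start, and drop anything that overlaps a span already kept. It is an illustration of that behavior under these assumptions, not spaCy's own code, and `keep_longest_spans` is a made-up name.

```python
def keep_longest_spans(spans):
    """Drop overlapping (start, end, label) spans, preferring the longest, then the earliest."""
    by_priority = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, used = [], set()
    for start, end, label in by_priority:
        positions = set(range(start, end))
        if not positions & used:
            kept.append((start, end, label))
            used |= positions
    return sorted(kept)

# Made-up input with one long span that overlaps two shorter ones.
spans = [(0, 1, "QUANTITY"), (2, 8, "UNIT"), (2, 15, "NAME"), (9, 15, "NAME")]
print(keep_longest_spans(spans))
```

In the ingredient data each token carries exactly one label, so no spans overlap and the filter leaves them unchanged. It acts as a safeguard for noisier annotations.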

### Setting Up Config File

In [38]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


### Training Model Using DocBin File

In [39]:
!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy

ℹ Saving to output directory: .
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     84.20   18.15   15.46   21.98    0.18
  0     200        209.45   4782.43   96.31   95.64   96.99    0.96
  1     400        220.98   2121.00   97.49   97.72   97.26    0.97
  2     600        258.34   1947.15   97.92   97.59   98.26    0.98
  3     800        313.60   1894.40   98.27   97.39   99.17    0.98
  4    1000        367.46   1827.87   98.77   98.70   98.85    0.99
  5    1200        469.11   1818.99   99.17   99.01   99.33    0.99
  7    1400        601.53   1735.42   99.35   99.23   99.47    0.99
  9    1600        690.54   1745.22   99.47   99.16   99.78    0.99
 12    1800        817.11   1579.86   99.62   99.55

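#### The F-score has already reached about 99.6%, so we interrupt the training process here

Note that `--paths.dev` points to the same `train.spacy` file that is used for training, so this score shows how well the model fits the training data rather than how it performs on unseen text.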
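
To evaluate on unseen examples, part of `training_data` can be held out as a development set before the DocBin files are written. The sketch below is a minimal standard-library version. The helper name `split_examples`, the 20% dev fraction, and the seed are assumptions and not part of the original notebook.

```python
import random

def split_examples(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Stand-in for the `training_data` list built earlier.
examples = [(f"sentence {i}", {"entities": []}) for i in range(10)]
train_examples, dev_examples = split_examples(examples)
print(len(train_examples), len(dev_examples))
```

Each list can then be written to its own DocBin file (for example `train.spacy` and `dev.spacy`) and passed to `--paths.train` and `--paths.dev` respectively.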

<h2 align = center> Loading And Testing Model </h2>

#### Loading Best Trained Model

In [40]:
ner = spacy.load("model-best")

#### Giving Text To Model

In [41]:
doc = ner('aute the veggies. Dice an onion and red bell pepper and add that to a sauté pan with a little olive oil over medium heat. Stir the veggies for about 5 minutes, or until the onions become translucent. Then add the garlic and spices and stir for another minute, until the mixture is nice and fragrant.Pour in a 28-ounce can of whole peeled tomatoes and use your spatula to break up the tomatoes into smaller pieces. Once this entire mixture is lightly simmering, you can crack your eggs on top. Use your spatula to make little holes for the eggs, then crack an egg into each hole. I use six eggs, though depending on the size of your pan you may use more or less. Reduce the heat to low, and cook for another 5 to 8 minutes or until the eggs are done to your liking.Before serving, season the eggs with salt and a generous amount of freshly chopped parsley and cilantro. Enjoy!')

#### Giving Color To Each Category

In [46]:
colors = {'NAME':"#f67de3",
          'STATE' : "#765ABE",
          'QUANTITY' : "#AA23B3",
          'UNIT' : '#98734A'}

options = {'colors':colors}

#### Rendering Result

In [47]:
displacy.render(doc , style = "ent" , options=options, jupyter = True)

<h2 align = center> Sentiment Analysis </h2>

#### What is Sentiment Analysis?

Sentiment analysis is the process of determining the sentiment or emotional tone of a piece of text, such as a tweet, review, or product description. It involves identifying the presence of positive, negative, or neutral emotions within the text, as well as quantifying the intensity of these emotions.

Sentiment analysis can be used to analyze customer feedback, track market trends, and improve product or service quality. It can also support tasks such as flagging biased or manipulative language and monitoring the spread of fake news or misinformation.

Common algorithms for sentiment analysis include Naive Bayes, Support Vector Machines (SVMs), Recurrent Neural Networks (RNNs), and Transformers. Popular libraries and frameworks include NLTK, TextBlob, and spaCy.

#### Steps for Sentiment Analysis

1. Data Collection: Gather a large dataset of reviews or tweets related to the topic of interest, each labeled with its sentiment (positive, negative, or neutral).

2. Preprocessing: Clean and preprocess the text data by removing stop words, removing punctuation, converting all text to lowercase, and tokenizing the text into individual words or tokens.

3. Feature Extraction: Extract relevant features from the text data, such as word frequencies, n-grams, or sentiment-specific features (e.g., presence of positive or negative words).

4. Model Selection: Choose a sentiment analysis algorithm based on the characteristics of your dataset and your accuracy and performance requirements, for example Naive Bayes, SVMs, RNNs, or Transformers.

5. Training: Train the chosen model using the preprocessed text data and labeled sentiments. This step involves adjusting the model's parameters to minimize the error between the predicted sentiments and the actual sentiments.

6. Evaluation: Evaluate the trained model using a separate dataset to measure its accuracy, precision, recall, and F1 score. This evaluation helps determine the performance of the sentiment analysis system.
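
The preprocessing and feature extraction steps (2 and 3) can be sketched with the standard library alone. The stop-word list below is a tiny illustrative assumption, and `preprocess` and `bag_of_words` are hypothetical helper names. The implementation later in this notebook relies on `CountVectorizer` instead.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of"}

def preprocess(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [token for token in text.split() if token not in STOP_WORDS]

def bag_of_words(text):
    """Map each remaining token to its frequency in the text."""
    return Counter(preprocess(text))

review = "This phone is great, and the battery is really great!"
print(preprocess(review))
print(bag_of_words(review))
```

A bag of words keeps only word counts, so word order is lost. That keeps the features simple but also means phrases such as "not good" are not represented as a unit.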


<h2 align = center> Implementation </h2>

<h3 align = center> Importing Dataset </h3>

In [2]:
train_df = pd.read_csv('data.csv', encoding='ISO-8859-1')

#### Selecting Only 2 Columns to keep the analysis simple

In [3]:

train_df = train_df[['text', 'sentiment']]



#### Dropping Any Null Value



In [4]:
train_df = train_df.dropna()

#### Encoding the sentiment categories to numeric values

In [5]:
label_mapping = {'positive': 2, 'neutral': 1, 'negative': 0}
train_df['sentiment'] = train_df['sentiment'].map(label_mapping)

### Model Training

#### Splitting Data Into Training and Testing Data

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df['text'], 
    train_df['sentiment'], 
    test_size=0.2, 
    random_state=42
)

####  Convert the text data to feature vectors using CountVectorizer


In [8]:
vectorizer = CountVectorizer(stop_words='english')
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

#### Training Model

In [9]:
model = MultinomialNB()
model.fit(X_train_vectors, y_train)
y_pred = model.predict(X_test_vectors)

#### Evaluate the model's performance


In [10]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print('Classification Report:')
print(classification_report(y_test, y_pred, target_names=['negative', 'neutral', 'positive']))

Accuracy: 0.64
Classification Report:
              precision    recall  f1-score   support

    negative       0.67      0.56      0.61      1572
     neutral       0.60      0.66      0.63      2236
    positive       0.69      0.69      0.69      1688

    accuracy                           0.64      5496
   macro avg       0.65      0.64      0.64      5496
weighted avg       0.65      0.64      0.64      5496



### Interpretation

The model reaches an accuracy of 0.64 on the 5,496 test examples. According to the classification report, it classifies positive texts best (F1 0.69), followed by neutral (0.63) and negative (0.61). Negative texts have the lowest recall (0.56), so a large share of them are assigned to another class. Possible reasons include the limited training data and the bag-of-words representation, which ignores word order and therefore negation.
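
As a sanity check on how the report's columns relate, the sketch below recomputes the per-class F1 scores and the macro and weighted averages from the precision, recall, and support values printed above. Because the inputs are already rounded to two decimals, the results agree with the report only up to rounding. The helper name `harmonic_f1` is an assumption.

```python
# Per-class (precision, recall, support) copied from the classification report.
report = {
    "negative": (0.67, 0.56, 1572),
    "neutral": (0.60, 0.66, 2236),
    "positive": (0.69, 0.69, 1688),
}

def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = {label: harmonic_f1(p, r) for label, (p, r, _) in report.items()}
total = sum(support for _, _, support in report.values())

# Macro average treats classes equally; weighted average weights by support.
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[label] * support for label, (_, _, support) in report.items()) / total

for label, value in f1.items():
    print(f"{label}: {value:.2f}")
print(f"macro avg: {macro_f1:.2f}, weighted avg: {weighted_f1:.2f}")
```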

### Prediction Function

In [19]:
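def predict_sentiment(text):
    """Return 'negative', 'neutral', or 'positive' for a single input string."""
    # Reuse the fitted vectorizer so the features match those seen in training
    vectorized_text = vectorizer.transform([text])
    prediction = model.predict(vectorized_text)
    sentiment_label = {0: 'negative', 1: 'neutral', 2: 'positive'}
    return sentiment_label[prediction[0]]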



### Testing

In [20]:
user_input = "he acts like a good student"
predicted_sentiment = predict_sentiment(user_input)
print(f'The sentiment of the input text is: {predicted_sentiment}')

The sentiment of the input text is: positive
