<h1>Training Data (NLP)</h1>

<span>Learning Objectives</span>
<li>Define training data and its importance</li>
<li>Understand the process of collecting and preparing data for NLP models</li>
<li>Explore different training data sets</li>
<li>Identify challenges related to training data in NLP</li>
<li>Apply best practices for training data</li>
<hr>
<span>Preparing an NLP Model</span>
<li>Requires high-quality training data</li>
    <li>Requires ensuring that this quality is not compromised</li>
<hr>
<span>Training Data</span>
<li>Training data is data used to teach a new application, model, or system to recognize patterns, depending on a project's requirements.</li>
<li>In AI and machine learning, training data is slightly different because it is labeled or annotated.</li>
<li>The model uses it to find relationships, develop understanding, make decisions, and evaluate its confidence when making a prediction.</li>
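As a minimal sketch of what "labeled or annotated" can look like in practice, the snippet below pairs each text with a sentiment label. The example texts and the label names are assumptions chosen for illustration, not a fixed standard.

```python
from collections import Counter

# Hypothetical labeled examples: each item pairs raw text with an annotation.
training_data = [
    ("The food was great", "positive"),
    ("The service was slow", "negative"),
    ("I enjoyed the ambiance", "positive"),
    ("The waiter was not attentive", "negative"),
]

# Separate the inputs (what the model sees) from the labels (what it should learn).
texts = [text for text, label in training_data]
labels = [label for text, label in training_data]

# Checking the label distribution is a common first sanity check.
label_counts = Counter(labels)
print(label_counts)
```

A balanced label distribution like this one is convenient for a demo. Real data sets are often skewed toward one label, which is worth checking before training.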
<hr>
<span>Types of annotations in NLP</span>
<li>Utterances</li>
<span>An utterance is the smallest unit of speech or text that a speaker produces as a complete thought, typically separated by pauses in spoken language or by punctuation in written language.</span><br>
<span>For example:</span><br>
<span><b>Can I have pizza?</b></span>
<br><span><b>How is the weather in Dasma?</b></span><br>
<span>Each of these sentences represents a single utterance.</span>
<hr>
<span>Spoken language analysis</span>
<li>Understanding customer reviews</li>
<li>To properly break down, or parse, the input data into utterances</li>
<br><span>For example:</span><br>
<span>The food was great | but | the service was slow. | I enjoyed the ambiance, | but | the waiter was not attentive.</span><br>
<span>The review covers multiple ideas, so it contains multiple utterances.</span>
<br><span>Each of the following is an individual utterance because it expresses a distinct thought or sentiment:</span>
<li>"The food was great"</li>
<li>"The service was slow"</li>
<li>"I enjoyed the ambiance"</li>
<li>"The waiter was not attentive"</li>
<br><span>Parsing utterances</span>
<li>It helps ensure that each thought is analyzed separately, which gives more accurate insights from the text.</li>
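The restaurant review above is marked up with "|" as utterance boundaries. A small sketch of reading that annotation back into a list of utterances could look like this. The `CONNECTORS` set is an assumption for illustration.

```python
# Review annotated with "|" as utterance boundary markers (copied from the example above).
annotated = ("The food was great | but | the service was slow. | "
             "I enjoyed the ambiance, | but | the waiter was not attentive.")

# Conjunctions that only connect utterances and carry no content of their own.
# This set is an assumption for illustration.
CONNECTORS = {"but", "and", "or"}

utterances = []
for segment in annotated.split("|"):
    # Trim surrounding whitespace and trailing punctuation from each segment.
    cleaned = segment.strip().rstrip(".,!?")
    if cleaned and cleaned.lower() not in CONNECTORS:
        utterances.append(cleaned)

print(utterances)
```

Each recovered utterance can then be analyzed on its own, which is the point of parsing the review in the first place.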
<hr>
<span><b>Steps in Utterance Parsing for NLP Tasks:</b></span>
<ul>
<li>Identify pauses or breaks</li>
    <ul>
        <li>In spoken language: pauses</li>
        <li>In written language: punctuation (. , ?) or conjunctions (but, and, or)</li>
    </ul>
<li>Splitting text into utterances</li>
    <ul>
        <li>Use punctuation or logical breaks to split a sentence</li>
        <li>Each sentence can be treated as a separate utterance if it expresses a complete idea</li>
    </ul>
<li>Preprocessing for Machine Learning</li>
    <ul>
        <li>Tokenization splits text into words</li>
        <li>Stemming and removing stop words</li>
        <li>Analyze the utterances separately for tasks like <b>Sentiment Analysis</b>, intent detection or topic modeling.</li>
    </ul>
</ul>
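The steps above can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the stop-word list is a tiny hand-picked set, `crude_stem` is a simplified stand-in for a real stemmer such as Porter, and the sample text is made up.

```python
import re

# Tiny illustrative stop-word list; real projects typically use a library list.
STOP_WORDS = {"the", "is", "it", "to", "i", "a", "an", "and", "but", "too"}

def split_sentences(text):
    """Treat each sentence as one utterance, splitting after . ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def tokenize(utterance):
    """Lowercase the utterance and extract alphabetic word tokens."""
    return re.findall(r"[a-z']+", utterance.lower())

def crude_stem(token):
    """Strip a few common suffixes; a simplified stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(utterance):
    """Tokenize, drop stop words, then stem what remains."""
    tokens = [t for t in tokenize(utterance) if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

text = "The screen is amazing. The speakers stopped working!"
for utterance in split_sentences(text):
    print(utterance, "->", preprocess(utterance))
```

The stems are not always dictionary words ("amazing" becomes "amaz"). That is typical of stemming, which only needs related word forms to map to the same token.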
<span>Review:</span><br>
<span>"The laptop is fast and efficient, but the battery life is terrible. I like the design but it is too heavy to carry around."</span>
<ul>
    <li>Parsed Utterances</li>
    <ul>
        <li>The laptop is fast and efficient,</li>
        <li>but the battery life is terrible.</li>
        <li>I like the design</li>
        <li>It is too heavy to carry around</li>
    </ul>
    <li>Splitting Text</li>
    <ul>
        <li>The laptop is fast and efficient,</li>
        <li>but the battery life is terrible.</li>
        <li>I like the design</li>
        <li>It is too heavy to carry around</li>
    </ul>
     <li>Analyze individually</li>
    <ul><li>Sentiment</li>
    <ul>
        <li>Classify each utterance as:</li>
        <ul>
            <li>positive</li>
            <li>negative</li>
            <li>neutral</li>
        </ul>
    </ul>
        <li>Intent</li>
    </ul>
</ul>
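To make the "analyze individually" step concrete, here is a toy rule-based labeler that assigns positive, negative, or neutral to each parsed utterance. The word lists are hypothetical and far too small for real use. A trained model would normally replace them.

```python
# Hypothetical mini-lexicon; the word lists are assumptions for illustration only.
POSITIVE = {"fast", "efficient", "like", "great", "good"}
NEGATIVE = {"terrible", "heavy", "slow", "bad"}

def label_sentiment(utterance):
    """Return 'positive', 'negative', or 'neutral' from simple word counts."""
    words = utterance.lower().replace(",", " ").replace(".", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

parsed_utterances = [
    "The laptop is fast and efficient,",
    "but the battery life is terrible.",
    "I like the design",
    "It is too heavy to carry around",
]

for utterance in parsed_utterances:
    print(f"{label_sentiment(utterance):8} | {utterance}")
```

Because each utterance is scored separately, the mixed review yields two positive and two negative labels instead of a single blurred score.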

In [138]:
## split text into utterances based on punctuation marks or conjunctions (NLTK)
import nltk
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

## ensure NLTK resources are downloaded
## (newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize)
nltk.download('punkt')

review = "The laptop is fast and efficient, but the battery life is terrible. I like the design but it is too heavy to carry around."

## Step 1: Split text using punctuation and conjunctions
## use a regular expression to split on punctuation and some common conjunctions
utterances = re.split(r'[.,;!?]|\b(?:but|and)\b', review)

## strip whitespace and drop empty fragments
utterances = [utterance.strip() for utterance in utterances if utterance.strip()]

## display utterances
utterances

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\steph\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\steph\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['The laptop is fast',
 'efficient',
 'the battery life is terrible',
 'I like the design',
 'it is too heavy to carry around']

In [74]:
## Step 2: Display each utterance
for i, utterance in enumerate(utterances, 1):
    print(f'Utterance {i}: {utterance}')

Utterance 1: The laptop is fast
Utterance 2: efficient
Utterance 3: the battery life is terrible
Utterance 4: I like the design
Utterance 5: it is too heavy to carry around
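Splitting on every "and" separates "fast" from "efficient", which leaves the fragment "efficient" as its own utterance. One possible variant, assuming only sentence-ending punctuation and the contrastive "but" should mark boundaries, is sketched below. The pattern is an illustrative choice, not the only reasonable one.

```python
import re

review = ("The laptop is fast and efficient, but the battery life is terrible. "
          "I like the design but it is too heavy to carry around.")

# Split only on sentence-ending punctuation and on "but" (with an optional
# preceding comma), so "fast and efficient" stays in one utterance.
parts = re.split(r"[.!?;]|,?\s*\bbut\b", review)
utterances = [part.strip() for part in parts if part.strip()]

for number, utterance in enumerate(utterances, 1):
    print(f"Utterance {number}: {utterance}")
```

This yields four utterances that line up with the parsed utterances listed for this review earlier in the notes.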


In [128]:
## Step 3: Preprocess
## function for preprocessing a single utterance
def preprocess_utterance(utterance):
    ## tokenize the utterance
    tokens = word_tokenize(utterance.lower())

    ## remove stopwords and keep alphanumeric tokens
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    return tokens

## apply preprocessing to each utterance 
processed_utterance = [preprocess_utterance(utterance) for utterance in utterances]

## display the processed utterances
processed_utterance 

[['laptop', 'fast'],
 ['efficient'],
 ['battery', 'life', 'terrible'],
 ['like', 'design'],
 ['heavy', 'carry', 'around']]

In [152]:
## Sentiment Analysis with a logistic regression model
## TF-IDF (Term Frequency - Inverse Document Frequency) vectorization is used to
## classify the utterances as positive or negative

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## sample data
data = {
    "utterances" : ["The laptop is fast",  "and efficient", "but the battery life is terrible.", 
    "I like the design", "but it is too heavy to carry around."],
    ## 1: Positive, 0: Negative
    "sentiment" : [1, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

## TF_IDF
vectorizer = TfidfVectorizer(stop_words = 'english')
x = vectorizer.fit_transform(df['utterances'])
y = df['sentiment']

## split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

## train logistic regression model
model = LogisticRegression()
model.fit(x_train, y_train)

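## predict on test data
## note: with only five utterances, test_size = 0.2 leaves a single test example,
## so the accuracy below is a smoke test rather than a meaningful estimate
y_pred = model.predict(x_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")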

Accuracy: 1.00
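To show what TF-IDF weighting computes, here is a hand-rolled sketch using only the standard library. It follows the smoothed formula that, to my understanding, scikit-learn documents as the default for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). The three-document toy corpus is an assumption for illustration.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized (assumed for brevity).
documents = [
    ["laptop", "fast"],
    ["battery", "life", "terrible"],
    ["laptop", "heavy"],
]

n_docs = len(documents)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in documents for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term counts times IDF, then L2-normalized to unit length."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc in documents:
    print({term: round(w, 3) for term, w in tfidf(doc).items()})
```

Terms that appear in fewer documents, such as "fast", receive a higher weight than terms shared across documents, such as "laptop". Those weights are the signal the classifier learns from.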
