## NLP (Natural Language Processing)
When we talk about data, we mean not only numbers but also text and human speech. But how do we approach the latter? NLP sits at the intersection of linguistics and AI. We will not need to understand linguistics in depth (morphology, semantics, pragmatics, etc.), because those who came before us built tools that let us work with human language without that hassle.

This section covers how to use these tools when our analysis involves text data (human language).

We'll be using NLTK (Natural Language Toolkit), a very popular Natural Language Processing library.


#### Objectives
1. Preprocessing and exploring text data.
2. Preparing our data for machine learning (vectorizing).
3. Fitting machine learning models on the vectorized text data.

When working with text data, most of the work goes into preparing the data. Unlike numeric data, where the preparation steps are fairly standard, text data requires you to adapt the steps to the nature of the data and the objectives you are aiming for.


In [None]:
# tokenization

sentence_1 = "The cat stole the stew on my Grandma's stove."

sentence_3 = "The dog stole the stew on my gRandma's jiko."

sentence_4 = "The rat stole the stew on the grandma's gas."

# extra sentences for the stemming/lemmatization notes below
sentence_5 = "I like eating a sandwich."
sentence_6 = "I ate burgers in the morning."
sentence_7 = "I like eating sandwiches."

""" 
1. stemming - chopping off the ends of words, e.g. from 'sandwiches' we remove the 'es' and are left with 'sandwich'
2. lemmatization - a more careful alternative to stemming; it considers both the part of speech (POS) and the morphology of the word,
                    e.g. 'ate' is changed to 'eat', 'better' is changed to 'good'
"""

context_1 = 'The baseball bat flew over the fence, startling the cave bat.'

sentence_2 = "Unlike the normal data where the steps are pretty straightforward/rigid, in text data you have to adjust based on the nature of the data and the objectives you are aiming for."

""" 
Bag-of-words counts for the three stew sentences

              the | cat | stole | stew | on | my | grandmas | stove | jiko | gas | dog | rat
sentence_1     2     1      1      1     1    1       1         1       0     0     0     0
sentence_3     2     0      1      1     1    1       1         0       1     0     1     0
sentence_4     3     0      1      1     1    0       1         0       0     1     0     1
"""



# lowercase every sentence and split it on whitespace to get tokens
standardized_sentence_1 = [part.lower() for part in sentence_1.split()]
standardized_sentence_2 = [part.lower() for part in sentence_2.split()]
standardized_sentence_3 = [part.lower() for part in sentence_3.split()]
standardized_sentence_4 = [part.lower() for part in sentence_4.split()]


# each list comprehension above is equivalent to an explicit loop like this:
# standardized_sentence_1 = []

# for part in sentence_1.split():
#     standardized_sentence_1.append(part.lower())

standardized_sentence_1



['the', 'cat', 'stole', 'the', 'stew', 'on', 'my', "grandma's", 'stove.']
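
Splitting on whitespace leaves punctuation attached to the last word (`'stove.'` above). As a rough sketch of an alternative, a regular expression can pull out only the runs of letters and apostrophes. The pattern below is an assumption chosen for these example sentences, not a general-purpose tokenizer.

```python
import re

sentence = "The cat stole the stew on my Grandma's stove."

# Plain whitespace splitting keeps punctuation glued to the words.
whitespace_tokens = sentence.lower().split()

# A simple regex tokenizer keeps only runs of letters and apostrophes,
# so the trailing full stop is dropped.
regex_tokens = re.findall(r"[a-z']+", sentence.lower())

print(whitespace_tokens)
print(regex_tokens)
```

The two lists differ only in the last token: `'stove.'` versus `'stove'`.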

In [37]:
# strip apostrophes and full stops from every token
# equivalent explicit loop for the first sentence:
# no_punc_standardized_sentence_1 = []

# for part in standardized_sentence_1:
#     no_punc_standardized_sentence_1.append(part.replace("'", "").replace(".", ""))

no_punc_standardized_sentence_1 = [part.replace("'", "").replace(".", "") for part in standardized_sentence_1]
# sentence_2 also contains "/" and ","; the slash becomes a space so that
# 'straightforward/rigid' ends up as two words once the tokens are joined
no_punc_standardized_sentence_2 = [part.replace("'", "").replace(".", "").replace("/", " ").replace(",", "") for part in standardized_sentence_2]
no_punc_standardized_sentence_3 = [part.replace("'", "").replace(".", "") for part in standardized_sentence_3]
no_punc_standardized_sentence_4 = [part.replace("'", "").replace(".", "") for part in standardized_sentence_4]

no_punc_standardized_sentence_1

['the', 'cat', 'stole', 'the', 'stew', 'on', 'my', 'grandmas', 'stove']
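
Chaining `.replace()` calls works for a couple of characters but gets long quickly. One possible alternative, sketched here with the standard library only, is a translation table that deletes every ASCII punctuation character in one pass. Note that this deletes characters such as `/` rather than turning them into spaces, so a word pair like `straightforward/rigid` would need to be handled before this step.

```python
import string

tokens = ['the', 'cat', 'stole', 'the', 'stew', 'on', 'my', "grandma's", 'stove.']

# Translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans("", "", string.punctuation)

clean_tokens = [token.translate(remove_punctuation) for token in tokens]

# Drop tokens that consisted of punctuation only and are now empty.
clean_tokens = [token for token in clean_tokens if token]

print(clean_tokens)
```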

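#### Vectorization
Vectorization transforms the tokens into numerical vectors that machine learning models can work with. With a count-based approach, we build a vocabulary from all the sentences and then count how often each vocabulary word appears in each sentence.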
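
To make the idea concrete before using a library, here is a minimal by-hand sketch of count vectors built with `collections.Counter`. The three cleaned sentences are typed in directly, and the variable names (`documents`, `vocabulary`, `vectors`) are chosen for this illustration only.

```python
from collections import Counter

documents = [
    "the cat stole the stew on my grandmas stove",
    "the dog stole the stew on my grandmas jiko",
    "the rat stole the stew on the grandmas gas",
]

tokenized = [doc.split() for doc in documents]

# The vocabulary is every distinct token, sorted alphabetically.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One vector per sentence: the count of each vocabulary word in that sentence.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each sentence becomes a list of 12 numbers, one per vocabulary word, and words that do not occur in a sentence get a count of 0.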

In [33]:
from sklearn.feature_extraction.text import CountVectorizer

#initialize the vectorizer
counter_vectorizer = CountVectorizer()

In [41]:
no_punc_standardized_sentence_1

['the', 'cat', 'stole', 'the', 'stew', 'on', 'my', 'grandmas', 'stove']

In [44]:
"|".join(no_punc_standardized_sentence_1)


'the|cat|stole|the|stew|on|my|grandmas|stove'

In [38]:
combined_sentences = [
    " ".join(no_punc_standardized_sentence_1),
    " ".join(no_punc_standardized_sentence_2),
    " ".join(no_punc_standardized_sentence_3),
    " ".join(no_punc_standardized_sentence_4)
]
combined_sentences

['the cat stole the stew on my grandmas stove',
 'unlike the normal data where the steps are pretty straightforward rigid in text data you have to adjust based on the nature of the data and the objectives you are aiming for',
 'the dog stole the stew on my grandmas jiko',
 'the rat stole the stew on the grandmas gas']

In [46]:
# fit the vectorizer on the combined sentences and build the count matrix
count_matrix = counter_vectorizer.fit_transform(combined_sentences)

# vocabulary_ maps each token to its column index in the count matrix
counter_vectorizer.vocabulary_

{'the': 29,
 'cat': 5,
 'stole': 25,
 'stew': 24,
 'on': 19,
 'my': 14,
 'grandmas': 10,
 'stove': 26,
 'unlike': 31,
 'normal': 16,
 'data': 6,
 'where': 32,
 'steps': 23,
 'are': 3,
 'pretty': 20,
 'straightforward': 27,
 'rigid': 22,
 'in': 12,
 'text': 28,
 'you': 33,
 'have': 11,
 'to': 30,
 'adjust': 0,
 'based': 4,
 'nature': 15,
 'of': 18,
 'and': 2,
 'objectives': 17,
 'aiming': 1,
 'for': 8,
 'dog': 7,
 'jiko': 13,
 'rat': 21,
 'gas': 9}

In [48]:
# sort the vocabulary by column index
sorted_tokens = sorted(counter_vectorizer.vocabulary_.items(), key=lambda item: item[1])
sorted_tokens

[('adjust', 0),
 ('aiming', 1),
 ('and', 2),
 ('are', 3),
 ('based', 4),
 ('cat', 5),
 ('data', 6),
 ('dog', 7),
 ('for', 8),
 ('gas', 9),
 ('grandmas', 10),
 ('have', 11),
 ('in', 12),
 ('jiko', 13),
 ('my', 14),
 ('nature', 15),
 ('normal', 16),
 ('objectives', 17),
 ('of', 18),
 ('on', 19),
 ('pretty', 20),
 ('rat', 21),
 ('rigid', 22),
 ('steps', 23),
 ('stew', 24),
 ('stole', 25),
 ('stove', 26),
 ('straightforward', 27),
 ('text', 28),
 ('the', 29),
 ('to', 30),
 ('unlike', 31),
 ('where', 32),
 ('you', 33)]

1. `stemming` - chopping off the ends of words, e.g. from 'sandwiches' we remove the 'es' and are left with 'sandwich'.
2. `lemmatization` - a more careful alternative to stemming that considers both the part of speech (POS) and the morphology of the word, e.g. 'ate' is changed to 'eat' and 'better' is changed to 'good'.
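
As a hedged sketch of the first definition, NLTK's `PorterStemmer` can be applied word by word. This assumes `nltk` is installed; the Porter stemmer needs no extra data download, whereas `WordNetLemmatizer` additionally requires the WordNet corpus (for example via `nltk.download("wordnet")`), so it is left out of this example. The word list is chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "stealing", "sandwiches", "ate"]

# Map every word to its stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Because a stemmer only trims suffixes, it cannot turn an irregular form such as 'ate' into 'eat'; that is the kind of case lemmatization is meant to handle.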

### Regular Expressions

In [None]:
transaction_1 = "SDM3ZJQ44F Confirmed.You have received Ksh50.00 from CAROLINE  ABUGA 0703992559 on 22/4/24 at 8:27 PM  New M-PESA balance is Ksh6,218.04. Use a unique M-PESA PIN to keep your money safe - don't use your date of birth as your PIN."

transaction_2 = 'SDJ9MU5DIL Confirmed. On 19/4/24 at 8:57 AM Take Ksh1,100.00 cash from Christopher Rwara Your M-PESA float balance is Ksh97,007.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU'

transaction_3 = 'SGO0PU3TBW Confirmed. on 24/7/24 at 7:37 PM Give Ksh1,000.00 to  ANTONE OKOTH MURING  New M-PESA float balance is  Ksh5,080.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU'

transaction_4 = 'TEN8WJWHKI Confirmed. On 23/5/25 at 6:25 PM Take Ksh324.00 cash from Nancy Otieno Your M-PESA float balance is Ksh71,373.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU'

transaction_5 = 'TDL9NHA8FV Confirmed. on 21/4/25 at 5:39 PM Give Ksh1,200.00 to  Paul kuria  New M-PESA float balance is  Ksh56,276.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU'

transaction_6 = 'TDL2MXL1W8 Confirmed. On 21/4/25 at 3:37 PM Take Ksh814.00 cash from PERIS NGIGI Your M-PESA float balance is Ksh48,272.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU'
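
Messages like these follow a fairly fixed layout, so regular expressions can pull the structured fields out of them. The sketch below uses Python's built-in `re` module on shortened copies of the first two messages. The patterns and the `parse_transaction` helper are assumptions based on the six examples above, not an official M-PESA format.

```python
import re

# Shortened copies of transaction_1 and transaction_2.
messages = [
    "SDM3ZJQ44F Confirmed.You have received Ksh50.00 from CAROLINE  ABUGA 0703992559 on 22/4/24 at 8:27 PM  New M-PESA balance is Ksh6,218.04.",
    "SDJ9MU5DIL Confirmed. On 19/4/24 at 8:57 AM Take Ksh1,100.00 cash from Christopher Rwara Your M-PESA float balance is Ksh97,007.00.",
]

CODE_RE = re.compile(r"^[A-Z0-9]{10}")          # 10-character code at the start
AMOUNT_RE = re.compile(r"Ksh([\d,]+\.\d{2})")   # e.g. Ksh1,100.00
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2}")  # e.g. 19/4/24
TIME_RE = re.compile(r"\d{1,2}:\d{2} [AP]M")    # e.g. 8:57 AM


def parse_transaction(message):
    """Extract code, amount, balance, date and time (hypothetical helper)."""
    # The first Ksh value is the transaction amount, the second the balance.
    amounts = [float(a.replace(",", "")) for a in AMOUNT_RE.findall(message)]
    return {
        "code": CODE_RE.search(message).group(),
        "amount": amounts[0],
        "balance": amounts[1],
        "date": DATE_RE.search(message).group(),
        "time": TIME_RE.search(message).group(),
    }


parsed = [parse_transaction(message) for message in messages]
for row in parsed:
    print(row)
```

The helper assumes every message contains all five fields; a message with a different layout would need extra checks before calling `.group()`.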




### POS Tagging
| Tag  | Meaning                                   | Example           |
|------|-------------------------------------------|-------------------|
| NN   | Noun (singular)                           | cat, time, torch  |
| NNS  | Noun (plural)                             | dogs, books       |
| VB   | Verb (base form)                          | run, eat          |
| VBD  | Verb (past tense)                         | ran, ate, had     |
| VBG  | Verb (gerund or present participle)       | running, eating   |
| JJ   | Adjective                                 | blue, fast        |
| RB   | Adverb                                    | quickly, silently |
| IN   | Preposition or subordinating conjunction  | in, on, because   |
| DT   | Determiner                                | a, the, an        |
| PRP  | Personal pronoun                          | I, you, they      |
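
One common use of these tags is to tell a lemmatizer which part of speech a word has. NLTK's `WordNetLemmatizer` accepts a `pos` argument that uses WordNet's single-letter codes rather than the tags in the table, so a small mapping is needed. The sketch below is a minimal version of that mapping; `penn_to_wordnet` is a hypothetical helper, and the tagged pairs are written by hand in the `(word, tag)` shape that `nltk.pos_tag` returns, not produced by a tagger.

```python
# WordNet POS codes as plain strings; in NLTK the same values are available
# as wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag):
    """Map a tag from the table above to a WordNet POS code (hypothetical helper)."""
    # Fall back to noun, which is also WordNetLemmatizer's default.
    return WORDNET_POS.get(tag[:1], "n")


# Hand-tagged pairs for illustration only.
tagged = [("the", "DT"), ("cats", "NNS"), ("quickly", "RB"), ("ate", "VBD")]

converted = [(word, tag, penn_to_wordnet(tag)) for word, tag in tagged]
for word, tag, wn_pos in converted:
    print(word, tag, wn_pos)
```

With the verb code `'v'`, a lemmatizer has the information it needs to map 'ate' to 'eat' instead of treating it as a noun.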


