<h3>Real-World Application of AI - Natural Language Processing (NLP)</h3>

<h4>Installing required packages</h4>


In [1]:
%pip install nltk
%pip install spacy
%pip install scikit-learn


<h4>Basic Text Processing Techniques</h4>

<h5>Tokenization</h5>

In [1]:
import spacy
from spacy.cli import download

#Try loading the model; if it is not installed, download it and load it again
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Model not found. Downloading ....")
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")
    print('\n'*4)

#Example Text
text = "Natural Language Processing with Python is fun. Let's tokenize this sentence!"

print("Code Run from here :---")
print('--'*30)

#Process the text
doc = nlp(text)

#Sentence Tokenization
sentences = [sent.text for sent in doc.sents]
print("Sentence : ", sentences)

#Word Tokenization
words = [token.text for token in doc]
print("Words : ", words)

Code Run from here :---
------------------------------------------------------------
Sentence :  ['Natural Language Processing with Python is fun.', "Let's tokenize this sentence!"]
Words :  ['Natural', 'Language', 'Processing', 'with', 'Python', 'is', 'fun', '.', 'Let', "'s", 'tokenize', 'this', 'sentence', '!']


In [2]:
#From scratch: naive splitting with str.split
text = "Natural Language Processing with Python is fun. Let's tokenize this sentence!"

#str.split already returns a list, so no extra brackets are needed
sentences = text.split('.')
words = text.split(' ')

print(sentences)
print(words)

['Natural Language Processing with Python is fun', " Let's tokenize this sentence!"]
['Natural', 'Language', 'Processing', 'with', 'Python', 'is', 'fun.', "Let's", 'tokenize', 'this', 'sentence!']
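
The naive split above loses the sentence-ending period and leaves punctuation glued to words ('fun.', 'sentence!'). A minimal sketch of a regex-based alternative follows, using only the standard library. The two patterns are illustrative assumptions, not spaCy's actual rules: unlike spaCy, this sketch keeps "Let's" as a single token.

```python
import re

text = "Natural Language Processing with Python is fun. Let's tokenize this sentence!"

# Split on whitespace that follows sentence-ending punctuation,
# so the punctuation stays attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", text)

# A token is either a word (optionally with internal apostrophes)
# or a single punctuation character.
words = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(sentences)
print(words)
```

This handles the example text, but abbreviations such as "Dr." or "e.g." would still be split incorrectly, which is one reason to prefer a trained tokenizer for real text.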


<h5>Stemming</h5>

In [5]:
from nltk.stem import PorterStemmer

#Initialize the stemmer 
ps = PorterStemmer()

#Example Words 
words = ["running", "ran", "runner", "easily", "fairly", "eating", "driving", "riding"]

#Stem the words 
stems = [ps.stem(word) for word in words]
print("Stem : ", stems)

Stem :  ['run', 'ran', 'runner', 'easili', 'fairli', 'eat', 'drive', 'ride']
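
To show what a stemmer does under the hood, here is a toy suffix stripper written from scratch. It is a simplified sketch with a hypothetical rule set (`naive_stem` and its suffix list are made up for illustration) and is not the Porter algorithm, which applies several ordered rule phases.

```python
def naive_stem(word: str) -> str:
    """Strip one common English suffix. A toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ly", "ed", "s"):
        # Require at least three remaining characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


sample_words = ["running", "ran", "runner", "easily", "fairly", "eating", "driving", "riding"]
naive_stems = [naive_stem(word) for word in sample_words]
print(naive_stems)
```

Compared with the PorterStemmer output above, this sketch produces 'driv' and 'rid' where the stemmer returns 'drive' and 'ride'. Neither approach maps 'ran' to 'run', because stemming only trims suffixes and does not know irregular forms.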


<h5>Lemmatization</h5>

In [9]:
import spacy

#Load the English Model
nlp = spacy.load("en_core_web_sm")

#Example Text
text = "Natural Language Processing with Python is fun. Let's tokenize this sentence!. That's so cool!"

#Process The Text 
doc = nlp(text)

#Lemmatize words
lemmas = [token.lemma_ for token in doc]
print("Lemmas : ", lemmas)

Lemmas :  ['Natural', 'Language', 'Processing', 'with', 'Python', 'be', 'fun', '.', 'let', 'us', 'tokenize', 'this', 'sentence', '!', '.', 'that', 'be', 'so', 'cool', '!']


<h4>Text Classification (Topic Prediction)</h4>

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score, classification_report

#Loading various categories
categories = [  'rec.autos', 
                'sci.electronics', 
                'comp.graphics', 
                'rec.sport.hockey',
                'talk.politics.guns',
                'talk.politics.mideast',
                'comp.os.ms-windows.misc',
                'comp.sys.ibm.pc.hardware',
                'misc.forsale',
                'sci.med'
            ]

newsgroups = fetch_20newsgroups(subset = 'train', categories = categories)
x, y = newsgroups.data, newsgroups.target
target_names = newsgroups.target_names #Get Category names

#Text vectorization (bag-of-words counts, English stop words removed)
vectorizer = CountVectorizer(stop_words='english')
x_vector = vectorizer.fit_transform(x)

#Train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_vector, y, test_size = 0.2, random_state = 42)

#Train a Logistic Regression Model 
model = LogisticRegression(max_iter = 1000) #Set max iteration to avoid convergence warnings
model.fit(x_train, y_train)

#Test the model 
y_predict = model.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
report = classification_report(y_test, y_predict, target_names= target_names)
print(f"Accuracy : {accuracy:.2f}")
print("Classification Report :", report)


Accuracy : 0.91
Classification Report :                           precision    recall  f1-score   support

           comp.graphics       0.85      0.93      0.89       108
 comp.os.ms-windows.misc       0.85      0.90      0.88       117
comp.sys.ibm.pc.hardware       0.83      0.81      0.82       118
            misc.forsale       0.87      0.89      0.88       110
               rec.autos       0.94      0.88      0.91       133
        rec.sport.hockey       0.97      0.98      0.97       114
         sci.electronics       0.81      0.84      0.83       116
                 sci.med       0.97      0.91      0.94       128
      talk.politics.guns       1.00      0.95      0.98       107
   talk.politics.mideast       1.00      1.00      1.00       117

                accuracy                           0.91      1168
               macro avg       0.91      0.91      0.91      1168
            weighted avg       0.91      0.91      0.91      1168
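
The vectorization step in the cell above turns each document into a row of word counts. A minimal from-scratch sketch of that idea follows, with two made-up example documents. It only approximates `CountVectorizer`, which also lowercases text, applies its own token pattern, drops stop words when asked, and returns a sparse matrix.

```python
from collections import Counter

# Two made-up documents, used only for illustration.
docs = ["the engine needs new oil", "the goalie saved the puck"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the feature vector the classifier sees. Word order is discarded, which is why this representation is called a bag of words.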



<h4>Testing on New Text</h4>

In [8]:
#Example predictions
new_sentences = [input("Enter a sentence to classify : ")]
new_x_vector = vectorizer.transform(new_sentences)
predictions = model.predict(new_x_vector)

#Print prediction with class names
for sentence, prediction in zip(new_sentences, predictions):
    print('\n')
    print(f'Sentence : {sentence}')
    print(f'Predicted Class is : {target_names[prediction]}')

#Note: the model always picks one of the trained categories,
#even when the input text does not belong to any of them



Sentence : Trump will be the president of america
Predicted Class is : rec.sport.hockey
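
The prediction above shows the limitation: a sentence about politics in general is forced into one of the ten trained categories. One possible mitigation is to reject low-confidence predictions. The sketch below is an assumption, not part of the original notebook: `label_or_unknown` is a hypothetical helper, the 0.5 threshold is arbitrary, and the probabilities are made-up numbers. In the notebook they could come from `model.predict_proba(new_x_vector)[0]`, with `target_names` as the label list.

```python
def label_or_unknown(probabilities, names, threshold=0.5):
    """Return the most probable label, or "unknown" if its probability is below threshold."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return "unknown"
    return names[best]


# Shortened, illustrative label list and made-up probabilities.
names = ["rec.autos", "rec.sport.hockey", "sci.med"]
confident = label_or_unknown([0.10, 0.85, 0.05], names)
uncertain = label_or_unknown([0.30, 0.38, 0.32], names)

print(confident)
print(uncertain)
```

This is only a heuristic: a classifier can still be confidently wrong on text unlike anything it was trained on, so the threshold would need tuning on held-out examples.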
