Named Entity Recognition: identifying entities in a text corpus and classifying them into predefined categories such as people, organizations, and locations.

In [None]:
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm") # you may use different models: en_core_web_md, en_core_web_lg, en_core_web_trf etc.

# Example text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    print(ent.text, ent.label_)



Apple ORG
U.K. GPE
$1 billion MONEY


Stemming:
Chopping affixes off the ends of words using fixed rules, so the result is not always a dictionary word: drives -> drive


Lemmatization:
Reducing words to their base or dictionary form, known as the lemma: running -> run

In [None]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK data
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize Stemmer and Lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example words
words = ["running", "drives", "does", "stood", "did", "studied"]

# Stemming
for w in words:
    stemmed = stemmer.stem(w)
    print("Stemmed Word:", stemmed)

# Lemmatization (pos='v' tells the lemmatizer to treat each word as a verb)
for w in words:
    lemmatized = lemmatizer.lemmatize(w, pos='v')
    print("Lemmatized Word:", lemmatized)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Stemmed Word: run
Stemmed Word: drive
Stemmed Word: doe
Stemmed Word: stood
Stemmed Word: did
Stemmed Word: studi
Lemmatized Word: run
Lemmatized Word: drive
Lemmatized Word: do
Lemmatized Word: stand
Lemmatized Word: do
Lemmatized Word: study
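
The stems "doe" and "studi" above are not dictionary words because a stemmer only applies suffix rules and has no vocabulary to check against. The sketch below is a deliberately tiny suffix stripper, not the Porter algorithm; the function name `naive_stem` and its suffix list are assumptions made up for illustration.

```python
def naive_stem(word):
    """Strip at most one common suffix (toy example, NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["running", "drives", "does", "stood", "did", "studied"]
stems = [naive_stem(w) for w in words]
print(stems)
# ['runn', 'driv', 'doe', 'stood', 'did', 'studi']
```

Irregular forms such as "stood" and "did" pass through unchanged, which is why the lemmatizer, with its dictionary lookup, is the one that maps them to "stand" and "do".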


Bag of Words (Feature Extraction)

It represents text by counting how often each word occurs, disregarding word order and sentence structure.

It converts text into numerical vectors, which most machine learning algorithms require as input.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
text = ["I love writing code in Python", "Python is a versatile programming language"]

# Create the transform
vectorizer = CountVectorizer()

# Tokenize and build vocab
vectorizer.fit(text)

# Summarize
print(vectorizer.vocabulary_)

# Encode the document
vector = vectorizer.transform(text)

# Summarize encoded vector
print(vector.toarray())


{'love': 4, 'writing': 8, 'code': 0, 'in': 1, 'python': 6, 'is': 2, 'versatile': 7, 'programming': 5, 'language': 3}
[[1 1 0 0 1 0 1 0 1]
 [0 0 1 1 0 1 1 1 0]]
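
To show where these numbers come from, here is a minimal pure-Python sketch of the same counting. It assumes `CountVectorizer`'s default behaviour (lowercasing, and a token pattern that keeps only tokens of two or more word characters, which is why "I" and "a" are missing from the vocabulary). The helper names `tokenize` and `encode` are made up for this sketch.

```python
import re
from collections import Counter

docs = ["I love writing code in Python", "Python is a versatile programming language"]


def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocab)}


def encode(text):
    # One count per vocabulary word; word order is discarded
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]


vectors = [encode(doc) for doc in docs]
print(index)
print(vectors)
```

Each row has one slot per vocabulary word, so two sentences with the same words in a different order produce identical vectors.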


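TF-IDF (Term Frequency-Inverse Document Frequency): weights each word by how often it appears in a document, scaled down by how many documents in the corpus contain it. Words that appear everywhere get low weights; words specific to a document get high weights. It measures importance, not meaning.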


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus
corpus = [
    "The sky is blue.",
    "The sun is bright."
]

# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
tfidf_matrix = vectorizer.fit_transform(corpus)

# Retrieve the words found in the corpus
words = vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to an array and display it
tfidf_array = tfidf_matrix.toarray()

# Displaying the results
tfidf_results = {words[i]: tfidf_array[:, i] for i in range(len(words))}

print(tfidf_results)

{'blue': array([0.57615236, 0.        ]), 'bright': array([0.        , 0.57615236]), 'is': array([0.40993715, 0.40993715]), 'sky': array([0.57615236, 0.        ]), 'sun': array([0.        , 0.57615236]), 'the': array([0.40993715, 0.40993715])}
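
The values above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The tokens are written out manually and the helper name `tfidf` is made up for this sketch.

```python
import math

# Tokens of the two documents, lowercased with punctuation removed
docs = [["the", "sky", "is", "blue"], ["the", "sun", "is", "bright"]]
n_docs = len(docs)

vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of documents containing each word
df = {tok: sum(tok in doc for doc in docs) for tok in vocab}

# Smoothed inverse document frequency
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}


def tfidf(doc):
    # Raw term count times IDF, then scale the vector to unit length
    raw = [doc.count(tok) * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


rows = [tfidf(doc) for doc in docs]
for tok, column in zip(vocab, zip(*rows)):
    print(tok, [round(x, 8) for x in column])
```

"the" and "is" occur in both documents, so their IDF is 1 and they end up at about 0.41, while the words unique to one document get the higher weight of about 0.576.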


Word Embedding: using a neural network to learn dense vector representations of words that are more informative than word counts.

In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Sample sentences (replace with your own corpus)
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "I love natural language processing",
    "Machine learning is fun and interesting"
]

# Tokenizing the sentences into words
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train the Word2Vec model with 100 dimensions
model = Word2Vec(tokenized_sentences, min_count=1, vector_size=100)

# Find words similar to a given word, e.g., 'love'
similar_words = model.wv.most_similar('love', topn=5)

print(similar_words)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[('i', 0.15923377871513367), ('quick', 0.1528114527463913), ('processing', 0.14256368577480316), ('natural', 0.1326463222503662), ('fun', 0.11935518682003021)]
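
The scores returned by `most_similar` are cosine similarities between word vectors. With only three training sentences they are all close to zero and essentially noise. The sketch below shows the calculation on hypothetical 3-dimensional vectors; the `toy` values are invented for illustration and do not come from the trained model.

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, chosen by hand for this example
toy = {
    "love": [1.0, 2.0, 0.0],
    "like": [2.0, 4.0, 0.0],  # same direction as "love"
    "dog": [0.0, 0.0, 3.0],   # orthogonal to "love"
}

sim_like = cosine_similarity(toy["love"], toy["like"])  # close to 1: same direction
sim_dog = cosine_similarity(toy["love"], toy["dog"])    # 0: nothing in common
print(sim_like, sim_dog)
```

Cosine similarity ignores vector length and compares direction only, so a trained model needs a much larger corpus before related words start pointing the same way.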


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Sample corpus and labels (replace with your actual data)
corpus = [
    "The latest smartphone features a state-of-the-art camera.",
    "Innovations in AI are transforming the tech industry.",
    "The new tablet is lightweight and has a long battery life.",
    "Tech companies are investing heavily in quantum computing.",
    "Autonomous vehicles could revolutionize daily commutes.",
    "The election campaign is heating up with debates on policy.",
    "Legislation for environmental protection was recently passed.",
    "The new tax law has sparked controversy among businesses.",
    "Diplomatic talks will address the rising trade tensions.",
    "The government pledged to increase funding for education.",
    "The football team clinched victory with a last-minute goal.",
    "A record-breaking performance in the 100 meters sprint.",
    "The basketball playoffs are attracting large audiences.",
    "A new swimming champion has emerged at the international meet.",
    "The baseball game was postponed due to bad weather."
]

# Labels: 0 = Tech, 1 = Politics, 2 = Sport
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

# Split data into training and testing sets (no random_state is set, so the split changes between runs)
X_train, X_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2)

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed

# Transform training and testing data into TF-IDF vectors
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train the SVM classifier
clf = SVC(kernel='linear')  # Choose kernel (linear here for simplicity)
clf.fit(X_train_tfidf, y_train)

# Predict on unseen data (replace with your new sentence)
new_sentence = "Salah did nt score last night"
new_sentence_tfidf = vectorizer.transform([new_sentence])

# Get prediction for unseen data
prediction = clf.predict(new_sentence_tfidf)

# Print the predicted label
print(f"Predicted label for '{new_sentence}': {prediction[0]}")



Predicted label for 'Salah did nt score last night': 2


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Dataset with 20 texts and corresponding labels (1 for positive, 0 for negative)
texts = [
    "I love this phone, its super fast and there's so much new stuff to learn!",
    "The camera quality is amazing, I've taken some great shots.",
    "I hate this phone, it's always crashing.",
    "Battery life is terrible, it barely lasts a day.",
    "Great phone, but I miss having a headphone jack.",
    "The new update is awesome, my phone works much better now.",
    "The screen broke after a minor fall, very fragile.",
    "The voice assistant is surprisingly accurate and helpful!",
    "I don't like the new interface, it's too complicated.",
    "Fingerprint sensor is not responsive, I prefer the old scanner.",
    "This laptop has excellent performance for gaming.",
    "The sound quality on these headphones is not great.",
    "I love how easy it is to take photos with this camera.",
    "The battery on this laptop doesn't last long enough.",
    "This gaming console is the best I've used so far.",
    "The touchscreen is unresponsive at times.",
    "This new app has really helped me organize my schedule.",
    "The wifi connectivity on this device is poor.",
    "I'm really impressed with the build quality of this tablet.",
    "The picture quality on this TV is breathtaking."
]
labels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

# Create a pipeline that combines TF-IDF with an SVM classifier
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))

# Train the model
pipeline.fit(X_train, y_train)

# Predict the sentiment of an unseen sentence
unseen_sentence = "This smartwatch has excellent battery life"
predicted_label = pipeline.predict([unseen_sentence])[0]

# Print the predicted label
print(f"The predicted label for the unseen sentence is: {'Positive' if predicted_label == 1 else 'Negative'}")


The predicted label for the unseen sentence is: Positive
