In [1]:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords

# Part 1: Topic detection

---

## RegExps 

A regular expression (RegExp, or regex) is a sequence of characters that defines a search pattern.
It can be used to check whether a string contains that pattern and to extract the parts that match.

In [2]:
# import the necessary library
import re

# Create a text in which to search for patterns
text = "Its raining here in Spain. The air feels cold. hope we can sail today"

# Find every occurrence of "ai" in the text
x = re.findall("ai", text)

# Check whether the string starts with "Its" and ends with "today"
z = re.search("^Its.*today$", text)

# Report whether the start/end pattern matched
if z:
    print("YES! We have a match!")
else:
    print("No match")

# Print the list of "ai" matches
print(x)

YES! We have a match!
['ai', 'ai', 'ai', 'ai']
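
As a further sketch on the same sentence, the snippet below returns the whole words that contain "ai" rather than the bare substring, and shows where the first occurrence sits. The pattern `\b\w*ai\w*\b` and the variable names `words_with_ai` and `first` are illustrative assumptions, not part of the original notebook.

```python
import re

text = "Its raining here in Spain. The air feels cold. hope we can sail today"

# \b marks a word boundary; \w* matches any run of word characters,
# so the pattern captures each whole word that contains "ai"
words_with_ai = re.findall(r"\b\w*ai\w*\b", text)
print(words_with_ai)

# re.search returns a match object for the first hit (or None if there is none);
# .span() gives its start and end positions in the string
first = re.search(r"ai", text)
print(first.span())
```

`re.findall` returns every non-overlapping match as a list, while `re.search` stops at the first one, which is why the two calls suit different questions ("which words?" versus "is it there, and where?").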


## BoW

Bag of Words (BoW) is a method for extracting features from text documents. It builds a vocabulary of all the unique words that occur across the documents in the training set, then represents each document as a vector that counts how many times each vocabulary word appears in it. Word order is discarded; only the counts are kept.

These count vectors can be used as features for training machine learning algorithms.

In [3]:
# Import the necessary library
from sklearn.feature_extraction.text import CountVectorizer
import nltk 
import numpy as np 

documents = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked at the moon.",
    "The fox and the dog are good friends."
]

# Create a CountVectorizer instance.
# CountVectorizer converts a collection of text documents into a matrix of token counts.
vectorizer = CountVectorizer()

# Fit the vectorizer to the documents and transform them into a bag-of-words matrix
bow_matrix = vectorizer.fit_transform(documents)

# Get the feature names (words) that label the columns of the matrix
feature_names = vectorizer.get_feature_names_out()


# Print the bag-of-words matrix and feature names
print("Bag-of-Words Matrix:")
print(bow_matrix.toarray())
print("\nFeature Names:")
print(feature_names)

Bag-of-Words Matrix:
[[0 0 0 0 1 1 1 0 0 1 1 0 1 1 2]
 [0 0 1 1 0 1 0 0 0 0 0 1 0 0 2]
 [1 1 0 0 0 1 1 1 1 0 0 0 0 0 2]]

Feature Names:
['and' 'are' 'at' 'barked' 'brown' 'dog' 'fox' 'friends' 'good' 'jumped'
 'lazy' 'moon' 'over' 'quick' 'the']
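
To show what the matrix above contains, here is a minimal sketch that rebuilds the same counts by hand using only the standard library. It assumes `CountVectorizer`'s default settings (lowercasing, and tokens made of two or more word characters); the helper `tokenize` and the names `vocabulary` and `bow_rows` are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

documents = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked at the moon.",
    "The fox and the dog are good friends.",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring CountVectorizer's default behaviour
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every unique token across all documents, sorted alphabetically
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

# One row per document, one column per vocabulary word
bow_rows = []
for doc in documents:
    counts = Counter(tokenize(doc))
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

Each column corresponds to one vocabulary word, so the last column (`the`) holds 2 in every row because each sentence contains "the" twice.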


## Tokenization   

In Natural Language Processing (NLP) and machine learning, tokenization is the process of splitting a sequence of text into smaller parts, known as tokens.

Tokens can be as small as single characters or as long as whole words. Tokenization matters because it breaks human language down into small units that are easier for machines to analyze.

In [None]:
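# Download NLTK resources (you only need to do this once)
# Note: newer NLTK releases may look for 'punkt_tab' instead; download that if word_tokenize raises a LookupError
nltk.download('punkt')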

In [5]:
import nltk
from nltk.tokenize import word_tokenize

# Input text
text = "Tokenization is important for NLP. It helps machines understand documents and text by breaking it down"

# Tokenize the text into words
tokens = word_tokenize(text)

# Print the tokens
print(tokens)


['Tokenization', 'is', 'important', 'for', 'NLP', '.', 'It', 'helps', 'machines', 'understand', 'documents', 'and', 'text', 'by', 'breaking', 'it', 'down']
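
Tokens can be produced at different granularities. The sketch below contrasts character-level and word-level tokens using only the standard library; the regular expression is a simplified stand-in chosen for illustration and is not how `word_tokenize` works internally.

```python
import re

text = "Tokenization is important for NLP."

# Character-level tokens: every character, including spaces and punctuation
char_tokens = list(text)

# Word-level tokens: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

print(char_tokens[:5])
print(word_tokens)
```

Character tokens give a tiny vocabulary but long sequences, while word tokens give shorter sequences at the cost of a much larger vocabulary.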


## Lemmatization   

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
It works by reducing each word to its base or dictionary form, known as the lemma.

For example, the lemma of the words "running," "ran," and "runs" is "run." Lemmatization is often used in natural language processing (NLP) to standardize words before analysis.

In [None]:
# Download NLTK resources (you only need to do this once)
nltk.download('wordnet')

In [18]:
# Import necessary library
from nltk.stem import WordNetLemmatizer

# Create a WordNetLemmatizer instance, used below to lemmatize individual words
lemmatizer = WordNetLemmatizer()

# Without a pos argument, words are treated as nouns.
# "rocks" is reduced to its base form "rock".
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# pos="a" tells the lemmatizer the word is an adjective.
# The lemma of the comparative adjective "better" is "good".
print("better :", lemmatizer.lemmatize("better", pos="a"))

# pos="v" marks the word as a verb.
# "running" is the present participle of "run", so it is reduced to "run".
print("running :", lemmatizer.lemmatize("running", pos="v"))

print("largest :", lemmatizer.lemmatize("largest", pos="a"))

print("trying :", lemmatizer.lemmatize("trying", pos="v"))

rocks : rock
corpora : corpus
better : good
running : run
largest : large
trying : try
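
The point of lemmatization is that inflected forms are counted as one item. The sketch below illustrates this with a tiny hand-written lookup table; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names for this example only, and a real lemmatizer such as `WordNetLemmatizer` relies on a full dictionary instead.

```python
from collections import Counter

# Hypothetical hand-written lookup table, for illustration only;
# a real lemmatizer uses a full dictionary rather than a few entries
LEMMA_TABLE = {"running": "run", "ran": "run", "runs": "run", "rocks": "rock"}

def toy_lemmatize(word):
    # Fall back to the lowercased word when it is not in the table
    return LEMMA_TABLE.get(word.lower(), word.lower())

words = ["running", "ran", "runs", "rocks", "rock"]

# Count words after mapping each one to its lemma
lemma_counts = Counter(toy_lemmatize(w) for w in words)
print(lemma_counts)
```

Five distinct surface forms collapse into two lemmas, which is the grouping effect that makes later counting or vectorization less sparse.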


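## TF-IDF

TF-IDF (term frequency-inverse document frequency) weights each word by how often it appears in a document, scaled down by how many documents in the collection contain it. Words that are frequent in one document but rare elsewhere get high scores, while words that appear everywhere get low scores.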
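
To make the TF-IDF weighting concrete, here is a minimal sketch that computes it by hand on the three example sentences used earlier. It assumes the textbook formula, tf × log(N / df), where tf is a word's share of the document's tokens, N is the number of documents, and df is the number of documents containing the word. scikit-learn's `TfidfVectorizer` uses a smoothed variant with normalization by default, so its numbers will differ. The helper names `tokenize` and `tf_idf` are hypothetical.

```python
import math
import re
from collections import Counter

documents = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked at the moon.",
    "The fox and the dog are good friends.",
]

def tokenize(doc):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each word appears
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    # Term frequency (share of the document) times inverse document frequency
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

scores = [tf_idf(tokens) for tokens in tokenized]

# Show the second document's words, highest score first
for word, score in sorted(scores[1].items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {score:.3f}")
```

In this sketch "the" and "dog" score 0 in every document because they occur in all three, while words unique to one sentence, such as "moon", receive the highest weights.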

## Named entity recognition

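Named entity recognition (NER) identifies spans of text that refer to real-world entities, such as people, organizations, locations, and dates, and assigns each one a category label.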


# Part 2: Sentiment analysis

---