In [1]:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords

# Part 1: Topic detection

---

## RegExps 

RegExp: Stands for Regular Expression, it is way to analyze a sequence of characters and find patterns. 
RegEx can be used to check if a string contains the specified search pattern

In [2]:
# import the necessary library
import re

#create a text where we will analyse and find patterns
text = "Its raining here in Spain. The air feels cold. hope we can sail today"

#Search in the text if "ai" can be found
x = re.findall("ai", text)


#Search the string to see if it starts with "Its" and ends with "today":
z = re.search("^Its.*today$", text) 

# In case we have a match print
if z:
  print("YES! We have a match!")
else:
  print("No match")

#print the list where the information was stored
print(x)

YES! We have a match!
['ai', 'ai', 'ai', 'ai']


## Bow  

Bag of Words is a method to extract features from text documents.This model is used to preprocess the text by converting it into a bag of words, which keeps a count of the total occurrences of most frequently used words.

These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.

In [3]:
# Import the necessary library
from sklearn.feature_extraction.text import CountVectorizer
import nltk 
import numpy as np 

documents = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked at the moon.",
    "The fox and the dog are good friends."
]

# Create a CountVectorizer instance
#The CountVectorizer from scikit-learn is used to convert a collection of text documents into a matrix of token counts
vectorizer = CountVectorizer()

# Fit the vectorizer to the documents and transform them into a bag-of-words representation
#The fit_transform method fits the model to the documents and transforms the documents into a bag-of-words matrix.
bow_matrix = vectorizer.fit_transform(documents)


# Get the feature names (words) in the bag-of-words representation
#The get_feature_names_out method returns the feature names (words) in the bag-of-words representation.
feature_names = vectorizer.get_feature_names_out()


# Print the bag-of-words matrix and feature names
print("Bag-of-Words Matrix:")
print(bow_matrix.toarray())
print("\nFeature Names:")
print(feature_names)

Bag-of-Words Matrix:
[[0 0 0 0 1 1 1 0 0 1 1 0 1 1 2]
 [0 0 1 1 0 1 0 0 0 0 0 1 0 0 2]
 [1 1 0 0 0 1 1 1 1 0 0 0 0 0 2]]

Feature Names:
['and' 'are' 'at' 'barked' 'brown' 'dog' 'fox' 'friends' 'good' 'jumped'
 'lazy' 'moon' 'over' 'quick' 'the']


## Tokenization   

Tokenization: in Natural Language Processing (NLP) and machine learning, refers to the process of converting a sequence of text into smaller parts, known as tokens. 

These tokens can be as small as characters or as long as words. The primary reason this process matters is that it helps machines understand human language by breaking it down into bite-sized pieces, which are easier to analyze.

In [None]:
# Download NLTK resources (you only need to do this once)
nltk.download('punkt')

In [5]:
import nltk
from nltk.tokenize import word_tokenize

# Input text
text = "Tokenization is important for NLP. It helps machines understand documents and text by breaking it down"

# Tokenize the text into words
tokens = word_tokenize(text)

# Print the tokens
print(tokens)


['Tokenization', 'is', 'important', 'for', 'NLP', '.', 'It', 'helps', 'machines', 'understand', 'documents', 'and', 'text', 'by', 'breaking', 'it', 'down']


## Lemmatization   

Lemmatization: It is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
It works by linking words with similar meanings to one word.  reducing words to their base or root form, known as the lemma. 

For example, the lemma of the words "running," "ran," and "runs" is "run." Lemmatization is often used in natural language processing (NLP) to standardize words and reduce them to their base form for better analysis.

In [None]:
# Download NLTK resources (you only need to do this once)
nltk.download('wordnet')

In [18]:
# Import necessary library
from nltk.stem import WordNetLemmatizer

#Here, an instance of the WordNetLemmatizer is created. This object (lemmatizer) will be used to perform lemmatization on words.
lemmatizer = WordNetLemmatizer()
 
#This line prints the result of lemmatizing the word "rocks" using the lemmatizer. 
# The output will be "rock" because "rocks" is lemmatized to its base form.

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
 
#This line prints the result of lemmatizing the word "better" with the additional information 
# that it is an adjective (denoted by the pos="a" argument). 
# The output will be "good" because the lemma of the comparative adjective "better" is "good."
# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos ="a"))


#The output will be "run" because "running" is the present participle form of the verb, and lemmatization reduces it to the base form
print("running :", lemmatizer.lemmatize("running", pos="v"))


print("largest :", lemmatizer.lemmatize("largest", pos="a"))

print("trying :", lemmatizer.lemmatize("trying", pos="v"))

rocks : rock
corpora : corpus
better : good
running : run
largest : large
trying : try


## TF- IDF   

**Term Frequency - Inverse Document Frequency (TF-IDF)** is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus). Words within a text document are transformed into importance numbers by a text vectorization process. There are many different text vectorization scoring schemes, with TF-IDF being one of the most common.

In [None]:
# Only need to download once
nltk.download('stopwords')

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords


# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Instructing the vectorizer to ignore common English words that typically don't contain much information.
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))

# Create a TfidfVectorizer instance
vectorizer = TfidfVectorizer()

# Fit and transform the documents -- This is a sparse matrix representation of the documents after TF-IDF transformation.
tfidf_matrix = vectorizer.fit_transform(documents)


# Get feature names (words) and TF-IDF values
feature_names = vectorizer.get_feature_names_out()
tfidf_values = tfidf_matrix.toarray()

#Convert the matrix to a dense format and put it in a DataFrame
import pandas as pd
tfidf_df = pd.DataFrame(tfidf_matrix.todense(), columns=feature_names)

tfidf_df


Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
1,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
2,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
3,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085


## Named entity recognition (NER)

NER identifies, categorizes and extracts the most important pieces of information from unstructured text without requiring time-consuming human analysis. It's particularly useful for quickly extracting key information from large amounts of data because it automates the extraction process.


Its used in many fields in **artificial intelligence (AI)**, including *machine learning (ML)*, *deep learning* and *neural networks*. NER is a key component of NLP systems, such as chatbots, sentiment analysis tools and search engines. It's used in healthcare, finance, human resources (HR), customer support, higher education and social media analysis.

In [15]:
import spacy

# Load the spaCy model for English NER
nlp = spacy.load("en_core_web_sm")

# Define the text to analyze
text = "Apple is a multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. It is considered one of Big Five technology companies in the U.S. information technology industry, along with Amazon, Google, Microsoft, and Meta."

# Analyze the text using spaCy's NER
doc = nlp(text)

# Extract named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
Cupertino GPE
California GPE
Five CARDINAL
U.S. GPE
Amazon ORG
Google ORG
Microsoft ORG
Meta ORG


# Part 2: Sentiment analysis

---

## **N-grams**

### **What are n-grams?**

N-grams are contiguous sequences of n items from a given sequence of text or speech. The value of n determines the length of the n-gram. For instance, bigrams are sequences of two adjacent items, while trigrams are sequences of three adjacent items. N-grams are widely used in NLP tasks such as language modeling, machine translation, and speech recognition.

In [2]:
import nltk
from nltk.util import ngrams

# Sample text corpus
corpus = "The quick brown fox jumps over the lazy dog."

# Generate bigrams from the corpus
bigrams = ngrams(corpus.split(), 2)

# Print the generated bigrams
for bigram in bigrams:
    print(bigram)

('The', 'quick')
('quick', 'brown')
('brown', 'fox')
('fox', 'jumps')
('jumps', 'over')
('over', 'the')
('the', 'lazy')
('lazy', 'dog.')


___

### **What might you use n-grams for?**

N-grams are a powerful tool for analyzing and understanding the structure of language. They can be used to identify patterns in word usage, predict next words in a sequence, and develop language models.

In [16]:
import nltk
from nltk.util import ngrams

# Sample text corpus
corpus = "The quick brown fox jumps over the lazy dog."

# Define the n-gram order
n = 2

# Generate n-grams from the corpus
n_grams = ngrams(corpus.split(), n)

# Print the n-grams
for gram in n_grams:
    print(gram)

('The', 'quick')
('quick', 'brown')
('brown', 'fox')
('fox', 'jumps')
('jumps', 'over')
('over', 'the')
('the', 'lazy')
('lazy', 'dog.')


___

### **Why would there be a risk for overfitting the data with n-grams**

Here are some reasons why with n-grams it can be a risk to overfitting ->

1. **High dimensionality**: N-grams, especially higher-order n-grams, can result in a large number of features, making the model more complex and prone to overfitting.

2. **Data sparsity**: As the order of n-grams increases, the frequency of each n-gram decreases, leading to sparse data. This sparsity can make it difficult for the model to generalize well to unseen data.

3. **Noise sensitivity**: N-grams can capture noise and idiosyncrasies in the training data, leading to overfitting. This is because n-grams treat all sequences of words equally, regardless of their meaning or relevance.



**Here is an example of how you can prevent overfitting while using n-grams**

In [12]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Tokenize the corpus
corpus = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(corpus)

# Generate bigrams from the tokens
bigrams = list(nltk.ngrams(tokens, 2))

# Split the data into train and test sets for tokens and bigrams
X_train_tokens, X_test_tokens, y_train_tokens, y_test_tokens = train_test_split(tokens[:-1], tokens[1:], test_size=0.2)
X_train_bigrams, X_test_bigrams, _, _ = train_test_split(bigrams, tokens[1:], test_size=0.2)

# Convert bigrams to strings
X_train_strings = [" ".join(bigram) for bigram in X_train_bigrams]
X_test_strings = [" ".join(bigram) for bigram in X_test_bigrams]

# Vectorize the data
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train_strings)
X_test_vec = vectorizer.transform(X_test_strings)

# Train the Multinomial Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vec, y_train_tokens)

# Evaluate the classifier
y_pred = classifier.predict(X_test_vec)
accuracy = accuracy_score(y_test_tokens, y_pred)
print("Accuracy:", accuracy)
print("Test Tokens:", y_test_tokens)
print("Predicted Tokens:", y_pred)
intersection = set(y_test_tokens) & set(y_pred)
print("Intersection:", intersection)




Accuracy: 0.0
Test Tokens: ['quick', 'jumps']
Predicted Tokens: ['brown' 'dog']
Intersection: set()


The above code uses *data splitting, L2 regularization, feature selection and early stopping* to be able to use n-grams without it overfitting.

Other tecniques you can use are:

- Data augmentation

- Feature hashing

- Model complexity control

___

### **Earlier, we basically wanted to get rid off punctuations, but here, why we might want to use them here?**


Punctuation marks can serve valuable purposes in n-gram analysis, despite the initial inclination to remove them. While removing punctuation can streamline the process and reduce the number of n-grams, **it can also overlook important contextual information.** Punctuation marks play a significant role in human language, and their consideration can enhance the effectiveness of n-gram-based applications.

Some examples:

1. **Sentence Structure and Grammar**: Punctuation marks provide cues about sentence structure and grammar, allowing for more accurate identification of meaningful n-grams. For instance, a full stop indicates the end of a sentence, enabling the recognition of n-grams that span entire sentences.

2. **Discourse Coherence**: Punctuation marks contribute to discourse coherence by signaling relationships between clauses and phrases. This information is crucial for understanding the context and meaning of n-grams that extend across multiple clauses.

3. **Sentiment Analysis**: Punctuation marks can convey emotional tones and sentiments, which can be particularly useful in sentiment analysis tasks. For example, exclamation points and question marks often indicate surprise or inquiry, while ellipses suggest pauses or unspoken thoughts.

4. **Preserving Author's Intent**: Punctuation marks can help preserve the author's intended meaning and tone, especially in creative writing or informal communication. Removing punctuation can alter the interpretation and impact of the text.

## Thank you!

Group 2:

Edem Quashigah

Catarina Kaucher

Sara-Sofia Paananen