# Vectorized computation and data structures

In the context of Natural Language Processing (NLP) and data science, efficient computation and data handling are crucial for working with large datasets and complex algorithms. Vectorized computation, and the libraries that provide it such as NumPy and pandas, plays a pivotal role in this efficiency. Let's delve into the concepts, their significance, and how they relate to broader statistical concepts in data science, mathematics, and social sciences.

### Vectorized Computation

**Context and Importance:** Vectorized computation refers to performing operations on entire arrays or matrices in a single step, rather than looping over individual elements in Python. This approach leverages optimized, precompiled low-level routines, which avoids paying the overhead of the Python interpreter for every element. In NLP, where operations on large text corpora and word embeddings are common, vectorized computation can substantially reduce execution time.

**NumPy:** NumPy (Numerical Python) is a foundational package for scientific computing in Python. It provides an N-dimensional array object which is a key data structure for vectorized operations. NumPy arrays support a wide range of mathematical operations that can be performed over the whole array efficiently and succinctly. This is particularly beneficial in NLP for tasks like vector operations in semantic analysis, where each word or document can be represented as a high-dimensional vector.
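To make the contrast with explicit loops concrete, here is a minimal sketch that computes the length of each row of a small matrix twice: once with nested Python loops and once with a single NumPy call. The `embeddings` values are made up for illustration and do not come from any real model.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for three words (made-up numbers).
embeddings = np.array([
    [3.0, 4.0, 0.0],
    [1.0, 2.0, 2.0],
    [0.0, 0.0, 5.0],
])

# Loop version: accumulate each row's squared values one element at a time.
loop_norms = []
for row in embeddings:
    total = 0.0
    for value in row:
        total += value * value
    loop_norms.append(total ** 0.5)

# Vectorized version: one call operates on the whole array.
vector_norms = np.linalg.norm(embeddings, axis=1)

# Broadcasting divides every row by its own norm in a single expression.
unit_vectors = embeddings / vector_norms[:, np.newaxis]

print(loop_norms)
print(vector_norms)
print(unit_vectors)
```

Both versions produce the same norms; the vectorized one is shorter and hands the inner loop to NumPy's compiled code.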

**Tensors:** A tensor generalizes the concept of vectors and matrices to higher dimensions. In deep learning frameworks such as PyTorch and TensorFlow, tensors are the basic building blocks, used to represent data or parameters (like weights and biases in neural networks). While a NumPy array is conceptually similar to a tensor, tensors in deep learning frameworks are designed with additional capabilities, such as the ability to run on GPUs for faster computation and to automatically compute gradients (a feature essential for training neural networks).
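As a rough illustration of what "higher dimensions" means, the sketch below uses plain NumPy arrays as a stand-in for framework tensors. It only shows the number of dimensions and the shape; GPU execution and automatic gradients are not available in NumPy and are not shown. The shapes and their interpretation (sentences, words, embedding dimensions) are assumptions chosen for the example.

```python
import numpy as np

scalar = np.array(5.0)               # 0 dimensions: a single number
vector = np.array([1.0, 2.0, 3.0])   # 1 dimension: e.g. one word embedding
matrix = np.zeros((4, 3))            # 2 dimensions: e.g. 4 words x 3 features
batch = np.zeros((2, 4, 3))          # 3 dimensions: e.g. 2 sentences x 4 words x 3 features

for name, array in [("scalar", scalar), ("vector", vector), ("matrix", matrix), ("batch", batch)]:
    print(name, array.ndim, array.shape)
```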

### Data Structures: pandas

**pandas Overview:** pandas is a Python library providing high-level data structures and a broad set of tools for data analysis. Its central data structure is the DataFrame, a table with labeled axes (rows and columns). For NLP, DataFrames make it straightforward to clean, transform, and aggregate textual data.

**Significance in NLP:** In NLP projects, one often starts with structured data from various sources like CSV files, databases, or JSON. pandas makes it easy to load, preprocess, and explore this textual data. For example, you can easily apply transformations to text data (like tokenization, stemming, or lemmatization) across large datasets and analyze the distribution of words or phrases.
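As a small sketch of what "analyzing the distribution of words" can look like, the example below lowercases and splits a text column with the `.str` accessor and then counts words across all rows. The `posts` DataFrame and its contents are made up for illustration.

```python
import pandas as pd

# Made-up posts standing in for a real text column.
posts = pd.DataFrame({
    "text": ["My skin is dry", "Dry skin again", "This lotion helps my skin"]
})

# Lowercase and split every row at once with the vectorized .str accessor.
posts["tokens"] = posts["text"].str.lower().str.split()

# One row per token, then count how often each word appears overall.
word_counts = posts["tokens"].explode().value_counts()
print(word_counts.head())
```

Splitting on whitespace is only a first approximation of tokenization, but it is often enough for a quick look at a dataset.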

## (Super) Simple example

In [None]:
import numpy as np
import pandas as pd

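# Example with NumPy: Calculate the word length of each word in an array
words = np.array(["natural", "language", "processing"])
# Note: np.vectorize is a convenience wrapper (essentially a Python loop),
# so it gives array-style syntax rather than a real speedup
word_lengths = np.vectorize(len)(words)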
print(word_lengths)

# Example with pandas: Create a DataFrame and process textual data
data = {'text': ["Natural Language Processing", "Data Science", "Machine Learning"]}
df = pd.DataFrame(data)
df['word_count'] = df['text'].apply(lambda x: len(x.split()))
print(df)

## Exploring pandas

In [None]:
import pandas as pd

# If you are running this notebook in Google Colab, uncomment these lines of code and run
# from pathlib import Path
# from google.colab import drive
# drive.mount('/content/gdrive/', force_remount=True)
# file = Path('gdrive/MyDrive/DATA_340_3_NLP/Datasets/eczema.csv')
# eczema_df = pd.read_csv(file, nrows=10000)

# load 10,000 rows of the dataset
eczema_df = pd.read_csv('../datasets/eczema.csv', nrows=10000)

### Quick view of data

In [None]:
eczema_df.head()

In [None]:
eczema_df.tail()

In [None]:
eczema_df.info()

In [None]:
eczema_df.shape

### Data Munging in pandas

In [None]:
# how many people talk about lotions?
eczema_df['lotion'] = eczema_df['text'].str.contains("lotion")
eczema_df[eczema_df['lotion'] == True]

In [None]:
eczema_df['lotion'] = eczema_df['text'].str.contains("lotion", case=False)
eczema_df[eczema_df['lotion'] == True]

In [None]:
eczema_df['lotion'] = eczema_df['text'].str.contains("lot[io]+n", regex=True, case=False)
eczema_df[eczema_df['lotion'] == True]

In [None]:
eczema_df['age'] = eczema_df['text'].str.contains(r"\d+", regex=True, case=False)
eczema_df[eczema_df['age'] == True]

### Quick tutorial on Regular Expressions

In the context of Natural Language Processing (NLP), understanding and leveraging Regular Expressions (regex) is a fundamental skill. Regex is a powerful tool used for searching, manipulating, and analyzing text data by defining specific patterns. It's particularly useful in text preprocessing, data cleaning, and information extraction tasks.

**Regular Expressions** are sequences of characters that form a search pattern. They can be used to check if a string contains the specified search pattern, to replace the search pattern with a specified text, or to split a string on the specified search pattern.

- **Literals**: These are the simplest form of regex, where the search pattern matches the exact character sequence. For example, the regex `data` will match "data" in the string "data science".

- **Metacharacters**: These are characters with a special meaning. Some of the most common metacharacters include:
  - `.` (dot): Matches any single character except newline. For example, `a.b` matches "acb" or "a2b".
  - `^` (caret): Matches the start of a string. For example, `^data` matches "data" in "data science" but not in "my data science project".
  - `$` (dollar): Matches the end of a string. For example, `end$` matches "end" at the end of "This is the end".
  - `*` (asterisk): Matches 0 or more occurrences of the preceding element. For example, `a*b` matches "b", "ab", "aab", etc.
  - `+` (plus): Matches 1 or more occurrences of the preceding element. For example, `a+b` matches "ab", "aab", but not "b".
  - `?` (question mark): Matches 0 or 1 occurrence of the preceding element. For example, `a?b` matches "b" or "ab".

- **Brackets**: Used for specifying a set of characters to match.
  - `[abc]`: Matches any one of the characters a, b, or c.
  - `[^abc]`: Matches any character except a, b, or c.
  
- **Quantifiers**: Specify how many instances of a character, group, or character class must be present in the input for a match to be found.
  - `{n}`: Exactly n occurrences. For example, `a{3}` matches exactly three 'a' characters.
  - `{n,}`: At least n occurrences. For example, `a{2,}` matches two or more 'a' characters.
  - `{n,m}`: Between n and m occurrences, inclusive. For example, `a{2,4}` matches between two and four 'a' characters.

- **Parentheses** are used for grouping characters or patterns. For example, `(abc)+` matches one or more repetitions of "abc".
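The building blocks listed above can be tried out directly with Python's built-in `re` module. The sketch below is only an illustration; the test strings are made up, and `re.fullmatch` is used where the whole string should match the pattern.

```python
import re

# Dot: any single character between "a" and "b".
dot_matches = re.findall(r"a.b", "acb a2b ab")

# Anchors: "^" ties the match to the start, "$" to the end.
starts_with_data = bool(re.search(r"^data", "data science"))
starts_with_data_later = bool(re.search(r"^data", "my data science project"))
ends_with_end = bool(re.search(r"end$", "This is the end"))

# Quantifiers apply to the preceding element.
star_allows_zero = re.fullmatch(r"a*b", "b") is not None
plus_needs_one = re.fullmatch(r"a+b", "b") is not None
two_to_four = re.findall(r"a{2,4}", "a aa aaaaa")

# Character sets and negated sets.
in_set = re.findall(r"[abc]", "cabbage")
not_in_set = re.findall(r"[^abc]", "cabbage")

# Grouping: repeat the whole sequence "abc".
repeated_group = re.fullmatch(r"(abc)+", "abcabc") is not None

print(dot_matches, two_to_four, in_set, not_in_set)
print(starts_with_data, starts_with_data_later, ends_with_end)
print(star_allows_zero, plus_needs_one, repeated_group)
```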

Think of regex like a secret code that helps you find specific patterns in a giant wall of text. It's like playing "Where's Waldo?" but for words or letters. You have special symbols that tell your search tool exactly what to look for, whether it's anything that looks like "Waldo", or finding Waldo only if he's at the beginning or end of the page.

Regular Expressions are a versatile and powerful tool in text processing and NLP. By mastering regex, you can efficiently search, replace, and manipulate text data, which is crucial for preparing and analyzing textual data in Data Science. Start small, practice with real text data, and gradually, you'll find regex to be an indispensable part of your NLP toolkit.

In [None]:
import re

# Finding if a pattern exists in a string
if re.search(r'^Data', 'Data Science'):
    print("Pattern found!")

# Replacing a pattern in a string
replaced_text = re.sub(r'Science', 'Analysis', 'Data Science')
print(replaced_text)  # Outputs: Data Analysis

# Splitting a string based on a pattern
split_text = re.split(r'\s+', 'Data Science is cool')
print(split_text)  # Outputs: ['Data', 'Science', 'is', 'cool']

# Extracting all matches of a pattern
matches = re.findall(r'\bData\b', 'Data Science and Data Analysis')
print(matches)  # Outputs: ['Data', 'Data']

### Let's dig into some more pandas features

We can use additional libraries with our pandas dataframe to perform more complex operations. For example, let's explore how different NLP libraries can be used with pandas to perform text preprocessing and analysis.

#### Tokenization

In NLP, tokenization serves as a preprocessing step where text is segmented into tokens that are useful for subsequent tasks like parsing, syntax analysis, or feature extraction. The process can vary in complexity depending on the language and the specific application. For instance, tokenizing a sentence in English might simply involve splitting it by spaces and punctuation, but languages without clear word boundaries (e.g., Chinese) require more sophisticated methods. The choice of what constitutes a token is crucial and depends on the task at hand; for example, in some contexts, individual words are tokens, while in others, meaningful phrases or even individual characters might be considered tokens.
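To see why "splitting by spaces and punctuation" is less trivial than it sounds, here is a minimal sketch comparing a plain whitespace split with a small regular-expression tokenizer. The sentence and the pattern are illustrative assumptions, not the rules used by any particular library.

```python
import re

sentence = "My skin's dry, isn't it? Try lotion!"

# Naive approach: split on whitespace; punctuation stays attached to words.
whitespace_tokens = sentence.split()

# Slightly better: keep runs of word characters (allowing one internal
# apostrophe) as tokens, and emit each punctuation mark as its own token.
pattern = r"\w+(?:'\w+)?|[^\w\s]"
regex_tokens = re.findall(pattern, sentence)

print(whitespace_tokens)
print(regex_tokens)
```

The two approaches already disagree on the number of tokens for one short sentence, which is why the choice of tokenizer matters for any downstream counts.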

##### NLTK

In [None]:
!pip install nltk --quiet

In [None]:
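# let's tokenize our text column
import nltk
from nltk.tokenize import word_tokenize
# word_tokenize needs the Punkt tokenizer data; newer NLTK releases look for 'punkt_tab'
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)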

In [None]:
eczema_df['nltk_tokens'] = eczema_df['text'].apply(word_tokenize)
eczema_df['nltk_tokens'].head()

In [None]:
# plot the distribution of the number of tokens per document using pandas
import matplotlib.pyplot as plt

eczema_df['nltk_tokens'].apply(len).hist(bins=30)
plt.title('Tokens per Document (NLTK)')
plt.show()

In [None]:
# look at the document with the most tokens
max_token = eczema_df['nltk_tokens'].apply(len).idxmax()
print(eczema_df.loc[max_token, 'nltk_tokens'], eczema_df.loc[max_token, 'text'])


##### SpaCy

In [None]:
!pip install spacy --quiet
!python -m spacy download en_core_web_sm --quiet

In [None]:
# tokenize with spacy
import spacy

nlp = spacy.load('en_core_web_sm')

In [None]:
eczema_df['spacy_tokens'] = eczema_df['text'].apply(lambda x: [token.text for token in nlp(x)])

In [None]:
# plot the distribution of the number of tokens per document using pandas
eczema_df['spacy_tokens'].apply(len).hist(bins=30)
plt.title('Tokens per Document (spaCy)')
plt.show()

In [None]:
# example of differences between spacy and nltk tokenization
eczema_df[eczema_df['nltk_tokens'].apply(len) != eczema_df['spacy_tokens'].apply(len)]

##### Our own tokenizer

We can also create our own tokenizer using regular expressions. This can be useful when we have specific requirements for tokenization that are not met by existing libraries.
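Before applying a pattern to a whole column, it can help to try it on a single string. The sketch below shows one detail worth knowing: `re.split` discards the delimiters unless the pattern is wrapped in a capturing group. The example string and the simplified pattern are assumptions made for illustration.

```python
import re

text = 'She said, "Try this--it works!"'

# Without a capturing group the delimiters are discarded.
no_group = re.split(r'[,.?!"]|--|\s', text)

# With a capturing group, re.split keeps each delimiter in the result.
with_group = re.split(r'([,.?!"]|--|\s)', text)

# Drop empty strings and whitespace-only pieces.
tokens = [piece for piece in with_group if piece.strip()]

print(no_group)
print(tokens)
```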

In [None]:
eczema_df['our_tokens'] = eczema_df['text'].apply(lambda x: re.split(r'([,.?_!"()\']|--|\s)', x))
eczema_df['our_tokens'].head()

In [None]:
# remove empty tokens
eczema_df['our_tokens'] = eczema_df['our_tokens'].apply(lambda x: [token for token in x if token.strip() != ''])
eczema_df['our_tokens'].head()

#### TF-IDF

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a statistical measure used to evaluate the importance of a word in a document, which is part of a corpus (a collection of documents). This technique is commonly used in information retrieval and text mining.

##### Background and Assumptions:
- **Term Frequency (TF):** This measures how frequently a term occurs in a document. In the TF calculation, all terms are considered equally important. However, in reality, certain terms, like "is" or "the," may appear many times but have little importance. Thus, the need for weighting the frequency of each term.
- **Inverse Document Frequency (IDF):** This measures how rare, and therefore how informative, a term is across the corpus. In TF-IDF, the weight of a term grows with the number of times it appears in a document but is offset by how many documents in the corpus contain it. Terms that are common across multiple documents (like common English words) will have a lower IDF.

##### Formula and Computation:
1. **TF(t)** = (Number of times term t appears in a document) / (Total number of terms in the document)
2. **IDF(t)** = log_e(Total number of documents / Number of documents with term t in it)
3. **TF-IDF(t)** = TF(t) * IDF(t)
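The three formulas above can be worked through by hand on a tiny corpus. The sketch below uses only the standard library; the three "documents" are made up and already tokenized, and the function names are chosen for this example.

```python
import math
from collections import Counter

# A made-up corpus of three already-tokenized documents.
corpus = {
    "doc1": ["dry", "skin", "lotion", "skin"],
    "doc2": ["dry", "skin", "cream"],
    "doc3": ["dry", "weather"],
}
n_documents = len(corpus)

# Document frequency: the number of documents that contain each term.
document_frequency = Counter(
    term for tokens in corpus.values() for term in set(tokens)
)

def tf(term, tokens):
    # Occurrences of the term divided by the document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Natural log of (total documents / documents containing the term).
    return math.log(n_documents / document_frequency[term])

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

for term in ["dry", "skin", "lotion"]:
    print(term, round(tf_idf(term, corpus["doc1"]), 4))
```

Note that "dry" appears in every document, so its IDF is log(1) = 0 and its TF-IDF score vanishes, while the rarer "lotion" outscores "skin" even though "skin" occurs more often in `doc1`.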

##### Applications:
- **Information Retrieval:** Helps in scoring and ranking a document's relevance given a user query.
- **Text Mining:** Useful for dimensionality reduction, feature selection, and identifying key terms in documents.
- **Data Science Context:** It's a way to convert textual data into a vector space model, where each dimension corresponds to a specific term. When dealing with NLP in data science, TF-IDF can be instrumental in pre-processing data for machine learning models.
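To illustrate the vector space idea and the retrieval use case, here is a minimal sketch that ranks documents against a query by cosine similarity. The vocabulary, the weights in `document_vectors`, and the query are hypothetical numbers chosen for the example, not values computed from a real corpus.

```python
import numpy as np

vocabulary = ["cream", "dry", "lotion", "skin"]

# Hypothetical TF-IDF weights: one row per document, one column per term.
document_vectors = np.array([
    [0.0, 0.1, 0.8, 0.3],   # document 0: mostly about lotion
    [0.7, 0.1, 0.0, 0.3],   # document 1: mostly about cream
    [0.0, 0.9, 0.0, 0.0],   # document 2: mostly about dryness
])

# A query mentioning "lotion" and "skin", expressed in the same vector space.
query_vector = np.array([0.0, 0.0, 1.0, 1.0])

# Cosine similarity between the query and every document in one expression.
similarities = (document_vectors @ query_vector) / (
    np.linalg.norm(document_vectors, axis=1) * np.linalg.norm(query_vector)
)

# Document indices ordered from most to least similar.
ranking = np.argsort(-similarities)
print(similarities)
print(ranking)
```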

##### Create sample data

In [None]:
# Let's work with a subset
random_seed = 42
sample = eczema_df.sample(n=100, random_state=random_seed)
sample = sample[['docID', 'text']]
sample

##### Understanding total words

In [None]:
# Let's get an understanding of the length of our conversations
sample['word_total'] = sample['text'].apply(lambda x: len(x.split(' ')))
sample['word_total'].describe()

In [None]:
# Our sample distribution conversation length
(sample
 .drop(columns=['docID', 'text'])
 .plot(kind='hist', bins=50)
)

In [None]:
sample.query('word_total > 300')

In [None]:
import numpy as np
f"The 95th percentile of the conversation length is: {np.percentile(sample['word_total'].tolist(), 95)} words"

#### Standardize our corpus

Let's analyze the sample above using TF-IDF by doing the following.

* Tokenize the conversation
* We can try lemmatizing the text
* Count the frequency of words in each conversation

In [None]:
# Load SpaCy to tokenize the text
import spacy
import string

from tqdm import tqdm
tqdm.pandas()

punctuation = list(string.punctuation)
NLP = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [None]:
# Let's get a list of the lemmas to count the frequency of words but remove the punctuation
sample['tokens'] = sample['text'].progress_apply(
    lambda text: [token.lemma_.lower() for token in NLP(text) if token.lemma_.lower() not in punctuation]
)

In [None]:
sample

In [None]:
# Unwind the data on the tokens
sample_tokens = (sample
                  .explode('tokens')
                  .drop(columns=['text', 'word_total'])
                )

In [None]:
sample_tokens

##### Term Frequency

- **Term Frequency (TF):** This measures how frequently a term occurs in a document. In the TF calculation, all terms are considered equally important. However, in reality, certain terms, like "is" or "the," may appear many times but have little importance. Thus, the need for weighting the frequency of each term.

1. **TF(t)** = (Number of times term t appears in a document) / (Total number of terms in the document)
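The formula divides each count by the document length. As a minimal sketch of that normalized form in pandas, assuming a made-up long-format table with one row per token occurrence (the `tokens` table and its column names are chosen for this example):

```python
import pandas as pd

# Made-up long-format token table: one row per (document, token) occurrence.
tokens = pd.DataFrame({
    "docID": [1, 1, 1, 1, 2, 2],
    "term": ["skin", "dry", "skin", "lotion", "dry", "skin"],
})

# Raw count of each term within each document.
counts = tokens.groupby(["docID", "term"]).size().reset_index(name="count")

# Total number of terms per document, aligned back onto every row.
counts["doc_length"] = counts.groupby("docID")["count"].transform("sum")

# Relative term frequency as defined by the formula.
counts["tf"] = counts["count"] / counts["doc_length"]
print(counts)
```

With this definition the term frequencies within each document sum to 1, which makes long and short documents comparable.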

In [None]:
# create a word frequency dataframe
# (raw count of each token per document, not divided by document length)
term_frequency = (sample_tokens
                  .groupby(by=['docID', 'tokens'])
                  .size()
                  .reset_index(name='term_frequency')
                  .rename(columns={'tokens': 'term'})
                 )

In [None]:
term_frequency

##### What are stopwords

In [None]:
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
         'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
         'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
         'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
         'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
         'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

In [None]:
# remove stop words
term_frequency = term_frequency.drop(term_frequency[term_frequency['term'].isin(stop_words)].index)
term_frequency

##### Document Frequency

In [None]:
# Document Frequency
document_frequency = (term_frequency
                      .groupby(['docID', 'term'])
                      .size()
                      .unstack()
                      .sum()
                      .reset_index()
                      .rename(columns={0: 'document_frequency'})
                     )

In [None]:
document_frequency

In [None]:
# merge the document frequencies into the term dataframe (joined on the shared 'term' column)
term_frequency = term_frequency.merge(document_frequency, on='term')

In [None]:
documents_in_corpus = term_frequency['docID'].nunique()
documents_in_corpus

##### Inverse Document Frequency

- **Inverse Document Frequency (IDF):** This measures how rare, and therefore how informative, a term is across the corpus. In TF-IDF, the weight of a term grows with the number of times it appears in a document but is offset by how many documents in the corpus contain it. Terms that are _common across multiple documents (like common English words) will have a lower IDF_.

2. **IDF(t)** = log_e(Total number of documents / Number of documents with term t in it)
3. **TF-IDF(t)** = TF(t) * IDF(t)
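The next cell computes a smoothed variant, `log((1 + N) / (1 + df)) + 1`, rather than the plain formula above. The sketch below compares the two for a few document frequencies; the corpus size of 100 is a hypothetical value chosen for the example.

```python
import math

n_documents = 100  # hypothetical corpus size

def plain_idf(document_frequency):
    # IDF as written in the formula above.
    return math.log(n_documents / document_frequency)

def smoothed_idf(document_frequency):
    # Adds 1 to numerator and denominator, then adds 1 to the result.
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

for df in (1, 10, 100):
    print(df, round(plain_idf(df), 3), round(smoothed_idf(df), 3))
```

With the plain formula a term that occurs in every document gets an IDF of 0 and drops out entirely; the smoothed form keeps it at 1 and also avoids division by zero for unseen terms.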

In [None]:
# inverse document frequency
# (smoothed variant: the +1 terms differ from the plain log(N / df) formula above)
term_frequency['idf'] = np.log((1 + documents_in_corpus) / (1 + term_frequency['document_frequency'])) + 1

In [None]:
term_frequency

In [None]:
term_frequency['tfidf'] = term_frequency['term_frequency'] * term_frequency['idf']
term_frequency.sort_values(by=['term_frequency'], ascending=False)

##### Normalize our data

In [None]:
from sklearn import preprocessing
term_frequency['tfidf_norm'] = preprocessing.normalize(term_frequency[['tfidf']], axis=0, norm='l2')

In [None]:
term_frequency

In [None]:
top_n_terms = term_frequency.sort_values(by=['docID', 'tfidf'], ascending=[True, False]).groupby(['docID']).head(5)

In [None]:
top_n_terms.head(30)

In [None]:
docIds = top_n_terms['docID'].tolist()

In [None]:
sample[sample['docID'] == 41]['text'].values