# **<span style="color: orange;"> Representation Techniques </span>**
We first load the dataframes saved in the first notebook file and then proceed with the representation techniques. In this chapter, we explore text representation techniques that are essential for natural language processing (NLP) tasks. These techniques transform text data into numerical representations that machine learning models can use. The key techniques we will cover are Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings.


In [None]:
# Load the datasets saved by the first notebook
import pandas as pd

# List of dataset file paths
dataset_files = ['df_autocorrected.xlsx', 'df_lowercasing.xlsx', 'df_original.xlsx', 'df_without_named_entities.xlsx', 'df_without_greetings_and_closings.xlsx']

# Function to load and process each dataset
def load_and_process_datasets(files):
    for file in files:
        # Load the dataset. The files are Excel workbooks, so read_excel is
        # used instead of read_csv (reading .xlsx requires the openpyxl package).
        df = pd.read_excel(file)
        print(f"Loaded {file} with shape {df.shape}")

        # Process the dataset (example: print first few rows)
        print(df.head())

        # Add your processing steps here
        # For example: data cleaning, feature extraction, etc.

        # Save the processed dataset under a new name so the input file
        # is not overwritten
        processed_file = file.replace('.xlsx', '_processed.xlsx')
        df.to_excel(processed_file, index=False)
        print(f"Processed dataset saved as {processed_file}")

# Load and process datasets
load_and_process_datasets(dataset_files)

### **<span style="color: orange;">Bag of Words (BoW)</span>**
The Bag of Words model is one of the simplest text representation techniques. While BoW is easy to implement and understand, it has some limitations, such as ignoring word order and context. It involves the following steps:
1. Tokenizing the text into individual words.
2. Creating a vocabulary of all unique words in the corpus.
3. Representing each document as a vector of word counts, where each element of the vector corresponds to the count of a specific word in the document.
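
The three steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline used in this notebook: the helper names `tokenize` and `bag_of_words`, the regex-based tokenizer, and the two toy sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: lowercase and keep runs of letters/digits as tokens
    # (a simplifying assumption; real tokenizers handle more cases)
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(documents):
    tokenized = [tokenize(doc) for doc in documents]
    # Step 2: the vocabulary is the sorted set of all unique tokens
    vocabulary = sorted({token for doc in tokenized for token in doc})
    # Step 3: one count vector per document, aligned with the vocabulary
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical toy corpus
docs = ["The cat sat on the mat", "The dog chased the cat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Because each vector only stores counts, "the dog chased the cat" and "the cat chased the dog" would receive identical vectors, which is the word-order limitation mentioned above.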

### **<span style="color: orange;">Term Frequency-Inverse Document Frequency (TF-IDF)</span>**
TF-IDF is an extension of the Bag of Words model that aims to address some of its limitations. It assigns a weight to each word based on its frequency in a document (term frequency) and its rarity across the entire corpus (inverse document frequency). TF-IDF helps to highlight important words in a document while down-weighting common words that appear in many documents.
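
As a rough sketch of the weighting, the example below uses one common textbook variant: term frequency as the count divided by document length, and inverse document frequency as `log(N / df)`. Libraries often apply smoothing or normalization on top of this, so the exact numbers will differ from theirs. The function name `tf_idf` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = relative frequency in this document,
            # idf = log(N / df), which is 0 for terms found in every document
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized toy corpus
toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
weights = tf_idf(toy_docs)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus "the" appears in every document, so its weight drops to zero, while a word that occurs in only one document, such as "sat", receives the highest weight in its document.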

### **<span style="color: orange;">Word Embeddings</span>**
Word embeddings are dense vector representations of words that capture semantic relationships between them. Unlike BoW and TF-IDF, word embeddings consider the context in which words appear. Popular word embedding techniques include:
- **Word2Vec**: Uses neural networks to learn word vectors based on their context in a large corpus.
- **GloVe (Global Vectors for Word Representation)**: Combines the advantages of global matrix factorization over word co-occurrence counts and local context-window learning.
- **FastText**: An extension of Word2Vec that considers subword information, making it more robust to rare and misspelled words.
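
To make the subword idea behind FastText concrete, the sketch below lists the character n-grams of a word after wrapping it in the boundary markers `<` and `>`. It only illustrates how the n-grams are formed and does not train any vectors. The function name `char_ngrams` and the default range of 3 to 6 characters are assumptions for this example.

```python
def char_ngrams(word, min_n=3, max_n=6):
    # Wrap the word in boundary markers so prefixes and suffixes
    # produce n-grams that differ from word-internal ones
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Trigrams only, to keep the output short
trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# A misspelled variant still shares most of its trigrams with the original
shared = set(trigrams) & set(char_ngrams("wher", min_n=3, max_n=3))
print(shared)
```

Since a word's vector is built from the vectors of its n-grams, a rare or misspelled word that shares n-grams with known words can still be given a meaningful representation.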