# Chapter 3: Stage 1 - Data Preparation

## Steps Involved in Data Preparation

###  Data Collection

The first step in data preparation is to collect data from various sources, such as CSV files, web pages,
 SQL databases, and S3 storage. Python provides several libraries for gathering this data efficiently
 and accurately.

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:700px">
    <img src="image/data_collect_library.png" alt="" />
</div>
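
As a minimal sketch of the collection step, the snippet below reads records from a CSV source and a SQL source and merges them into one list. It uses only the standard library; the CSV text, the `qa` table, and the `prompt`/`response` column names are hypothetical stand-ins for real sources.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be read from a file.
csv_text = "prompt,response\nWhat is 2+2?,4\nCapital of France?,Paris\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical in-memory SQL database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute("CREATE TABLE qa (prompt TEXT, response TEXT)")
conn.execute("INSERT INTO qa VALUES (?, ?)", ("Largest planet?", "Jupiter"))
sql_rows = [dict(row) for row in conn.execute("SELECT prompt, response FROM qa")]
conn.close()

# Merge both sources into one list of {"prompt": ..., "response": ...} records.
records = csv_rows + sql_rows
```

Bringing every source into the same record shape early makes the later cleaning and formatting steps source-agnostic.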

###  Data Preprocessing and Formatting

Data preprocessing and formatting are crucial for ensuring high-quality data for fine-tuning. This step
 involves tasks such as cleaning the data, handling missing values, and formatting the data to match the
 specific requirements of the task. 

<div style="background-color:white; padding:10px; display:flex; justify-content:center;height:300px">
    <img src="image/data_preprocess_library.png" alt="" />
</div>
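
The cleaning tasks described above can be sketched as follows. This is an illustrative example, not a complete pipeline: the record layout and the `preprocess` helper are assumptions, and JSON Lines is used as one common target format for fine-tuning data.

```python
import json
import re

raw_records = [
    {"prompt": "  What is   fine-tuning? ", "response": "Adapting a pretrained model.\n"},
    {"prompt": "Define overfitting", "response": None},   # missing value
    {"prompt": "", "response": "Orphan answer"},          # empty prompt
    {"prompt": "What is fine-tuning?", "response": "Adapting a pretrained model."},  # duplicate
]

def normalise(text):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

def preprocess(records):
    """Drop incomplete records, normalise whitespace, and remove duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        prompt, response = rec.get("prompt"), rec.get("response")
        if not prompt or not response:
            continue  # skip records with a missing or empty field
        pair = (normalise(prompt), normalise(response))
        if not all(pair) or pair in seen:
            continue  # skip whitespace-only fields and exact duplicates
        seen.add(pair)
        cleaned.append({"prompt": pair[0], "response": pair[1]})
    return cleaned

cleaned = preprocess(raw_records)
# One JSON object per line, a layout many fine-tuning tools accept.
jsonl = "\n".join(json.dumps(rec) for rec in cleaned)
```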

###  Handling Data Imbalance

**Over-sampling and Under-sampling**: 
+ Techniques like SMOTE (Synthetic Minority Over-sampling Technique) generate synthetic examples to achieve balance.
+ Python Library: imbalanced-learn
+ Description: imbalanced-learn provides various methods to deal with imbalanced datasets, including oversampling techniques like SMOTE.
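
To illustrate the idea behind SMOTE, here is a simplified pure-Python sketch that creates synthetic minority points by interpolating between a sample and one of its nearest minority neighbours. It is not the imbalanced-learn implementation (which is available as `imblearn.over_sampling.SMOTE`); the function name `smote_like` and the toy data are assumptions made for illustration.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features per sample.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_like(minority, n_new=6)
```

Because each new point lies on a line segment between two real minority samples, the synthetic data stays inside the region the minority class already occupies.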

**Adjusting Loss Function**: Modify the loss function to give more weight to the minority class,
 setting class weights inversely proportional to the class frequencies.
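
A small sketch of inverse-frequency class weights, using the formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The helper names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def weighted_cross_entropy(true_class_probs, labels, weights):
    """Mean negative log-likelihood with each sample scaled by its class weight."""
    losses = [-weights[y] * math.log(p) for p, y in zip(true_class_probs, labels)]
    return sum(losses) / len(losses)

labels = [0] * 90 + [1] * 10          # 90% majority class, 10% minority class
weights = inverse_frequency_weights(labels)
```

With this 90:10 split the minority class receives a weight nine times larger than the majority class, so an error on a minority sample contributes correspondingly more to the loss.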

**Focal Loss**: A variant of cross-entropy loss that adds a modulating factor to down-weight easy examples and
 focus training on hard, misclassified examples.
+ **Python Library**: focal-loss
+ **Description**: The focal-loss package provides implementations of various focal loss functions, including BinaryFocalLoss and SparseCategoricalFocalLoss.
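
For reference, the binary focal loss for a single example can be written as $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, where $p_t$ is the predicted probability of the true class. The sketch below implements that formula directly in plain Python; it is independent of the package named above, and the function name and default values are illustrative assumptions.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=None):
    """Focal loss for one example.

    p: predicted probability of the positive class, y: true label (0 or 1).
    gamma: focusing parameter; gamma=0 recovers plain cross-entropy.
    alpha: optional weight for the positive class (1 - alpha for the negative).
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = 1.0 if alpha is None else (alpha if y == 1 else 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(0.9, 1)   # confident, correct prediction
hard = binary_focal_loss(0.1, 1)   # confident, wrong prediction
```

With `gamma=2`, the easy example's loss is scaled by `(1 - 0.9) ** 2`, roughly 0.01, while the hard example keeps most of its cross-entropy loss.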

**Cost-sensitive Learning**: Incorporating the cost of misclassifications directly into the learning
 algorithm, assigning a higher cost to misclassifying minority class samples.

**Ensemble Methods**: Using techniques like bagging and boosting to combine multiple models
 and handle class imbalance.
+ Python Library: sklearn.ensemble
+ Description: scikit-learn provides robust implementations of various ensemble methods, including
 bagging and boosting.

**Stratified Sampling**: Ensuring that each mini-batch during training contains an equal or proportional representation of each class.
+ Python Library: sklearn.model_selection.StratifiedShuffleSplit
+ Description: scikit-learn offers tools for stratified sampling, ensuring balanced representation across classes.
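
The core idea of stratification can be sketched without any library: group sample indices by class, then draw the same fraction from every group. The example below applies this to a train/validation split rather than to mini-batches; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with class proportions preserved in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)  # same fraction per class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

labels = ["pos"] * 20 + ["neg"] * 80
train_idx, val_idx = stratified_split(labels)
```

The same grouping step could be reused to assemble mini-batches, by drawing a fixed number of indices per class for each batch.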

**Data Cleaning**: Removing noisy and mislabelled data, which can disproportionately affect the minority class.
+ Python Library: pandas.DataFrame.sample
+ Description: pandas provides methods for sampling data from DataFrames, useful for data cleaning and preprocessing.

**Using Appropriate Metrics**: Metrics like Precision-Recall AUC, F1-score, and Cohen’s Kappa are more informative than accuracy when dealing with imbalanced datasets.
+ Python Library: sklearn.metrics
+ Description: scikit-learn offers a comprehensive set of tools for evaluating the performance of classification models, particularly with imbalanced datasets.
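
To see why accuracy can mislead on imbalanced data, the sketch below computes accuracy, precision, recall, F1-score, and Cohen's Kappa from a binary confusion matrix in plain Python. Precision-Recall AUC is omitted because it needs predicted scores rather than hard labels. The function name and toy predictions are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute common classification metrics for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Agreement expected by chance, from the marginal label frequencies.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}

# 3 positives out of 10; the model finds only one of them.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Here accuracy is 0.8 even though two of the three positive samples are missed; the F1-score of 0.5 and the Kappa of about 0.41 reflect that weakness.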

### Splitting Dataset

Splitting the dataset for fine-tuning involves dividing it into training and validation sets, typically using an 80:20 ratio. Different techniques include:

+ Random Sampling: Selecting a subset of data randomly to create a representative sample.
+ Stratified Sampling: Dividing the dataset into subgroups and sampling from each to maintain class balance.
+ K-Fold Cross Validation: Splitting the dataset into K folds and performing training and validation K times.
+ Leave-One-Out Cross Validation: Using a single data point as the validation set and the rest for training, repeated for each data point.
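
As an illustration of the cross-validation variants, the sketch below generates K-fold index splits in plain Python; setting K equal to the number of samples yields Leave-One-Out Cross Validation. The generator name `k_fold_indices` is an assumption, not a library function.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(10, k=5))      # 5-fold: 8 train / 2 validation
loo_splits = list(k_fold_indices(4, k=4))   # leave-one-out on 4 samples
```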

## Existing and Potential Research Methodologies

### Data Annotation

Data annotation involves labelling or tagging textual data with specific attributes relevant to the model’s training objectives. This process is crucial for supervised learning tasks and greatly influences the performance of the fine-tuned model. Recent research highlights various approaches to data annotation:


+ Human Annotation
+ Semi-automatic Annotation
+ Automatic Annotation

### Data Augmentation

Data Augmentation (DA) techniques expand training datasets artificially to address data scarcity and improve model performance. Advanced techniques often used in NLP include:

+ Word Embeddings
+ Back Translation
+ Adversarial Attacks
+ NLP-AUG
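
A toy sketch of embedding-style augmentation: each word may be swapped for a nearby word. A real system would look up nearest neighbours in a word-embedding space; here a small hand-written `NEIGHBOURS` table stands in for that lookup, and both the table and the `augment` function are hypothetical.

```python
import random

# Hypothetical stand-in for nearest neighbours in an embedding space.
NEIGHBOURS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def augment(sentence, neighbours=NEIGHBOURS, p=0.5, seed=0):
    """Replace each word that has known neighbours with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        candidates = neighbours.get(word.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)

original = "the quick dog is happy"
augmented = augment(original, p=1.0)  # p=1.0 replaces every eligible word
```

Varying `seed` produces different paraphrases of the same sentence, which is how one source example can yield several training examples.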

### Synthetic Data Generation using LLMs

Large Language Models (LLMs) can generate synthetic data through innovative techniques such as:
+ Prompt Engineering: Crafting specific prompts to guide LLMs like GPT-3 in generating relevant and high-quality synthetic data.
+ Multi-Step Generation: Employing iterative generation processes where LLMs generate initial data that is refined through subsequent steps. This method can produce high-quality synthetic data for various tasks, including summarising and bias detection.
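
The two techniques can be combined in a short generate-then-refine loop. The sketch below is deliberately model-agnostic: `llm` is any callable that maps a prompt string to a completion string, and `fake_llm` is a hypothetical stub used so the example runs without calling a real model or API. The prompt wording is also an assumption.

```python
from typing import Callable, List

def build_prompt(task: str, topic: str, n_examples: int) -> str:
    """Prompt engineering step: state the task, topic, and output format."""
    return (
        f"Generate {n_examples} {task} examples about {topic}. "
        "Write one example per line."
    )

def multi_step_generate(llm: Callable[[str], str], task: str, topic: str,
                        n_examples: int = 3) -> List[str]:
    """Step 1: draft examples. Step 2: ask the model to refine its own draft."""
    draft = llm(build_prompt(task, topic, n_examples))
    refine_prompt = (
        "Review the examples below. Remove duplicates and low-quality lines, "
        "then return the rest, one per line.\n\n" + draft
    )
    refined = llm(refine_prompt)
    return [line.strip() for line in refined.splitlines() if line.strip()]

def fake_llm(prompt: str) -> str:
    """Hypothetical stub standing in for a real model call."""
    if prompt.startswith("Review"):
        return "Q: What is overfitting?\nQ: What is a learning rate?"
    return ("Q: What is overfitting?\nQ: What is overfitting?\n"
            "Q: What is a learning rate?")

examples = multi_step_generate(fake_llm, task="quiz question", topic="machine learning")
```

Passing the model in as a callable keeps the pipeline testable and lets the same steps run against different LLM providers.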

## Challenges in Data Preparation for Fine-Tuning LLMs

##  Available LLM Fine-Tuning Datasets

## Best Practices