# Data Preprocessing Techniques for Natural Language Processing

## 1. Normalization

Normalization (min-max scaling) rescales each feature to the range $[0, 1]$ so that all features are on a similar scale: $X_{new} = (X - X_{min})/(X_{max} - X_{min})$.
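As a minimal sketch of the formula above, the following pure-Python function rescales one feature. The function name `min_max_normalize` and the `doc_lengths` values are hypothetical, and mapping a constant feature to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: there is no spread to rescale.
        # Returning 0.0 everywhere is a choice made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


doc_lengths = [10, 20, 30, 50]  # hypothetical feature: tokens per document
scaled = min_max_normalize(doc_lengths)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Because the result depends only on the minimum and maximum, a single extreme outlier can squeeze all other values into a narrow band.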

## 2. Standardization

Standardization ($X_{new} = (X - mean)/Std$) rescales the data to have mean 0 and standard deviation 1. It does not change the shape of the distribution, so the result is a standard normal distribution only if the original data was normally distributed.

It is also called Z-score normalization. Each value is standardized by subtracting the feature's mean and dividing by the feature's standard deviation.
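The Z-score computation can be sketched with the standard library alone. The name `standardize` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption of this example.

```python
import statistics


def standardize(values):
    """Return Z-scores: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed here)
    if std == 0:
        # Constant feature: every value equals the mean.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


z = standardize([2, 4, 4, 4, 5, 5, 7, 9])  # mean is 5, standard deviation is 2
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The standardized values have mean 0 and standard deviation 1, but their relative spacing is unchanged.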

## 3. Handle missing or corrupted data in a dataset

Two easy ways to handle missing or corrupted values are:

- Dropping the rows or columns that contain them.
- Replacing them with a different value, such as the column median or a placeholder.

pandas methods such as `isnull()`, `dropna()`, and `fillna()` help accomplish this.
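The two strategies can be sketched with pandas as follows. The column names and values are made up for illustration, and filling with the median or an `"unknown"` placeholder is only one possible choice.

```python
import pandas as pd

# Hypothetical toy dataset with one missing value per column.
df = pd.DataFrame({
    "text_length": [120.0, float("nan"), 95.0, 300.0],
    "label": ["pos", "neg", None, "pos"],
})

# Count missing values in each column.
missing_per_column = df.isnull().sum()

# Strategy 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Strategy 2: replace missing values, using a per-column fill value.
filled = df.fillna({
    "text_length": df["text_length"].median(),  # median ignores NaN by default
    "label": "unknown",
})

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping is simplest but discards data (here, half of the rows), while filling keeps every row at the cost of introducing values that were never observed.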

## 4. TF-IDF

TF-IDF formula: $tfidf(t,d) = tf(t,d) * idf(t)$

TF-IDF stands for **Term Frequency-Inverse Document Frequency**. It is a **numerical statistic** that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

- Term frequency, tf(t,d), is the relative frequency of term t within document d.
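- The inverse document frequency, idf(t), is the logarithmically scaled inverse fraction of the documents that contain term t: $idf(t) = \log(N / n_t)$, where $N$ is the number of documents in the corpus and $n_t$ is the number of documents that contain t.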
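The definitions above can be sketched in a few lines of Python. This uses the plain, unsmoothed idf with a natural logarithm; libraries often apply smoothed variants, so their numbers may differ. The function names and the toy corpus are hypothetical, and returning 0.0 for a term that appears in no document is an assumption of this sketch.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Relative frequency of `term` within one tokenized document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing `term`)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # unseen term: a choice made for this sketch
    return math.log(len(corpus) / n_containing)


def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


corpus = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat chased the dog".split(),
]

for term in ["the", "cat", "mat"]:
    print(term, round(tfidf(term, corpus[0], corpus), 4))
```

In this toy corpus, "the" occurs in every document, so its idf is log(1) = 0 and its TF-IDF weight is 0, while "mat" occurs in only one document and receives the highest weight of the three terms.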

## 5. Deal with imbalanced data

1. **Oversampling or undersampling**. Instead of sampling uniformly from the training dataset, we can sample the rare classes more often (oversampling) or the frequent classes less often (undersampling) so the model sees a more balanced dataset.
2. **Data augmentation**. We can add data in the less frequent categories by modifying existing data in a controlled way. For example, in a medical image dataset where images showing an illness are rare, we could flip those images, or add noise to copies of them in such a way that the illness remains visible.
3. **Using appropriate metrics**. We can use metrics that are less sensitive to class imbalance, such as the F1 score.
4. **Using appropriate loss functions**. We can use loss functions that are less sensitive to class imbalance, such as the weighted cross-entropy loss.
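Two of the ideas above, random oversampling and class weights for a weighted loss, can be sketched in pure Python. The function names and the toy `ham`/`spam` labels are hypothetical. The weighting rule `n_samples / (n_classes * class_count)` is one common inverse-frequency heuristic, not the only option.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw extra copies (with replacement) for classes below the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_samples.extend(xs + extra)
        out_labels.extend([y] * target)
    return out_samples, out_labels


def inverse_frequency_weights(labels):
    """Per-class weights for a weighted loss: rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


texts = [f"message {i}" for i in range(10)]  # placeholder documents
labels = ["ham"] * 8 + ["spam"] * 2

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))           # Counter({'ham': 8, 'spam': 8})
print(inverse_frequency_weights(labels))  # {'ham': 0.625, 'spam': 2.5}
```

Oversampling should be applied to the training split only. Duplicating samples before splitting would leak copies of the same example into the test data.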

## 6. Cross-Validation

Cross-validation is a method of evaluating a model on several different train/test splits of the same data instead of on a single split. The data is split into k subsets (folds), and the model is trained on k-1 of those subsets.

The remaining subset is held out for testing. This is repeated so that each subset serves as the test set exactly once. This is k-fold cross-validation.

Finally, the scores from all k folds are averaged to produce the final score.
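The procedure can be sketched without any library. The helper names `k_fold_indices` and `cross_validate`, the toy labels, and the majority-class "model" are all hypothetical stand-ins. In practice, `fit_and_score` would train a real model on the training indices and score it on the test indices.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by `fit_and_score` over all k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # toy binary labels


def majority_baseline(train_idx, test_idx):
    """Placeholder model: predict the most frequent training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


mean_accuracy = cross_validate(len(labels), k=5, fit_and_score=majority_baseline)
print(mean_accuracy)
```

Each sample lands in the test set of exactly one fold, so every data point contributes to the final averaged score.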