# Data Preprocessing Techniques for Natural Language Processing

-----
## Deal with imbalanced data

1. Resampling techniques for data preprocessing
   1. Oversampling: Increase the number of samples in the minority class by randomly duplicating existing samples or generating synthetic samples using techniques like SMOTE (Synthetic Minority Over-sampling Technique).
   2. Undersampling: Reduce the number of samples in the majority class by randomly removing instances.
2. Algorithmic Techniques
   1. Class Weights: Assign a higher weight to the minority class so that its errors contribute more to the loss and the model pays more attention to it during training.
   2. Ensemble Methods: Techniques like bagging and boosting can be effective in handling class imbalance. Boosting algorithms such as AdaBoost reweight misclassified samples at each round, which often shifts attention toward the minority class. Random Forest is usually combined with class weights or balanced bootstrap sampling rather than relied on to handle imbalance by itself.
3. Use appropriate loss functions. We can use loss functions that are less sensitive to class imbalance, such as the weighted cross-entropy loss.
4. Use appropriate metrics. We can use metrics that are less sensitive to class imbalance, such as the F1 score, AUC-ROC, and AUC-PR.
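
As a rough illustration of random oversampling and class weights from the list above, here is a minimal pure-Python sketch. The function names and the toy data are made up for this example, and the weight formula `n_samples / (n_classes * class_count)` is one common heuristic, not the only option.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, cnt in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - cnt):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data (hypothetical): 8 negative and 2 positive examples.
texts = [f"doc{i}" for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2

weights = balanced_class_weights(labels)                          # minority class gets the larger weight
balanced_texts, balanced_labels = random_oversample(texts, labels)
```

Oversampling should normally be applied to the training split only, so that duplicated samples do not leak into the evaluation data.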

-----
## Handle missing or corrupted data in a dataset

Two simple ways to handle this situation:

- Drop the rows or columns that contain missing or corrupted values.
- Replace the missing or corrupted values with a different value, such as the column mean, the median, or a placeholder.

In pandas, methods like `isnull()`, `dropna()`, and `fillna()` help in accomplishing this task.
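
A small sketch of both approaches, assuming pandas is installed. The column names and values are invented for illustration.

```python
import pandas as pd

# Toy frame with missing values (made-up data).
df = pd.DataFrame({
    "text_length": [120.0, float("nan"), 87.0, 45.0],
    "label": ["pos", "neg", None, "neg"],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: replace missing values, here with the column mean and a placeholder label.
filled = df.fillna({"text_length": df["text_length"].mean(), "label": "unknown"})
```

Dropping is simplest but discards data. Filling keeps every row at the cost of introducing values that were never observed.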

-----
## Normalization

Normalization (min-max scaling) rescales each feature to the range [0, 1] so that all features are on a similar scale.

$$X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
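
The formula can be sketched in plain Python as follows. The function name and the handling of a constant feature (mapped to 0.0 to avoid dividing by zero) are assumptions made for this example.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


doc_lengths = [10, 20, 30, 50]               # hypothetical feature values
normalized = min_max_normalize(doc_lengths)  # → [0.0, 0.25, 0.5, 1.0]
```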

-----
## Standardization

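Standardization rescales the data to have mean 0 and standard deviation 1. It does not change the shape of the distribution, so the result follows a standard normal distribution only if the original data was normally distributed.

It is also called Z-score normalization. Each value is transformed by subtracting the mean $\mu$ of the feature and dividing by its standard deviation $\sigma$.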

$$X_{new} = \frac{X - \mu}{\sigma}$$
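
A minimal sketch of the same computation using only the standard library. It assumes the population standard deviation is wanted. The function name and sample values are illustrative.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]    # mean 5, population standard deviation 2
standardized = standardize(scores)
```

After the transform, the values have mean 0 and standard deviation 1, but their relative spacing is unchanged.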

-----
## TF-IDF

TF-IDF formula: $\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)$

TF-IDF stands for **Term Frequency-Inverse Document Frequency**. It is a **numerical statistic** that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

- Term frequency, tf(t,d): **relative frequency of term t within document d**.
- Inverse document frequency, idf(t): **logarithmically scaled inverse fraction of the documents that contain term t**. It is low for terms that appear in most documents and high for rare terms.
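
To make the two definitions concrete, here is a small sketch that assumes the plain variant $\mathrm{idf}(t) = \log(N / \mathrm{df}(t))$, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$. Libraries often use smoothed variants, so their numbers may differ. The function name and the toy corpus are invented for this example.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of tokenized documents. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    result = []
    for doc in corpus:
        counts = Counter(doc)
        result.append({
            # tf = relative frequency in the document, idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return result


corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
weights = tf_idf(corpus)
```

In this toy corpus, "the" occurs in every document, so its weight is 0, while "sat" occurs in only one document and gets a higher weight than "cat", which occurs in two.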

-----
## Cross-Validation

Cross-validation is a method of estimating how well a model generalizes by repeatedly training and evaluating it on different splits of the data. In k-fold cross-validation, the data is split into k subsets (folds), and the model is trained on k-1 of them.

The remaining fold is held out for testing. This is repeated k times so that each fold serves as the test set exactly once.

Finally, the scores from all k folds are averaged to produce the final score.
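
The procedure can be sketched in plain Python as below. This is an illustrative version that uses contiguous folds without shuffling. In practice the data is usually shuffled or stratified first. All function names, the majority-class "model", and the toy data are assumptions made for the example.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    fold_scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        fold_scores.append(
            score(model, [samples[i] for i in test], [labels[i] for i in test])
        )
    return sum(fold_scores) / k


# Hypothetical stand-in model: always predict the most frequent training label.
def fit_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def accuracy(model, test_x, test_y):
    return sum(1 for y in test_y if y == model) / len(test_y)


samples = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
mean_score = cross_validate(samples, labels, k=5, fit=fit_majority, score=accuracy)
```

Because every sample is used for testing exactly once, the averaged score depends less on one lucky or unlucky split than a single train/test split does.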