### Handle text features with many categories (50+) in Python

In [None]:
import pandas as pd

### 1. **Label Encoding**
Best suited to ordinal features, where the categories have a logical order.
Label encoding converts each unique category into an integer.
For non-ordinal features it can mislead a model,
because the integers imply an order and distances between categories that do not exist.

In [None]:
df = pd.read_csv('')

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# Integers are assigned in sorted order of the category values, not in any domain-specific order
df['encoded_feature'] = encoder.fit_transform(df['text_feature'])
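
When the categories do have a logical order, that order usually has to be stated explicitly, since sorting the labels alphabetically rarely matches it. The sketch below uses scikit-learn's `OrdinalEncoder` with a hand-written category list. The `sizes` frame and its `small`/`medium`/`large` values are hypothetical and only serve as an illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; the category names are illustrative only.
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# List the categories in their logical order so the integers respect it.
ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes['size_encoded'] = ordinal.fit_transform(sizes[['size']]).ravel()
```

Here `small` becomes 0, `medium` becomes 1 and `large` becomes 2, regardless of the order in which the values appear in the data.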

### 2. **One-Hot Encoding**
Best suited to non-ordinal categorical features.
One-hot encoding creates one binary column per unique category, so a feature with 50+ categories adds 50+ columns.
The resulting high-dimensional data costs memory and can hurt some models.

In [None]:
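from sklearn.preprocessing import OneHotEncoder
# sparse_output requires scikit-learn >= 1.2 (older releases call it `sparse`)
# handle_unknown='ignore' encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_feature = encoder.fit_transform(df[['text_feature']])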
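
One common way to keep the column count manageable is to group rare categories into a single bucket before encoding. The following is a minimal sketch with pandas only. The toy values, the `min_count` threshold and the `'other'` label are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data standing in for a high-cardinality column (values are made up).
values = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'], name='text_feature')

# Keep categories seen at least `min_count` times; lump the rest together.
min_count = 2  # assumed threshold, tune per dataset
counts = values.value_counts()
frequent = counts[counts >= min_count].index
grouped = values.where(values.isin(frequent), 'other')

# One binary column per remaining category.
one_hot = pd.get_dummies(grouped, prefix='text_feature')
```

With this data the four original categories collapse into three columns, because `c` and `d` each occur only once and share the `other` bucket.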

### 3. **Frequency/Count Encoding**
Replace each category with the number of times it occurs in the dataset.
This keeps the feature one-dimensional, although categories that occur equally often receive the same value.

In [None]:
df['freq_encoded_feature'] = df['text_feature'].map(df['text_feature'].value_counts())
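
When the data is split into training and test sets, the frequencies are normally learned from the training split and then applied to both. The sketch below assumes two small hypothetical frames, `train` and `test`, and uses `0.0` as the fallback for unseen categories, which is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical splits; 'd' never appears in the training data.
train = pd.DataFrame({'text_feature': ['a', 'a', 'b', 'c']})
test = pd.DataFrame({'text_feature': ['a', 'd']})

# Relative frequencies learned from the training split only.
freq = train['text_feature'].value_counts(normalize=True)

train['freq_encoded_feature'] = train['text_feature'].map(freq)
# Unseen categories map to NaN; 0.0 is an assumed fallback.
test['freq_encoded_feature'] = test['text_feature'].map(freq).fillna(0.0)
```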

### 4. **Target/Mean Encoding**
For supervised learning, you can replace each category with the mean of the target variable for that category.
Because it brings target information into the feature,
this method is powerful but prone to data leakage and overfitting, especially for rare categories.

In [None]:
# Mean of the target for each category
mean_encoded_feature = df.groupby('text_feature')['target'].mean()
# Replace every category with its target mean
df['mean_encoded_feature'] = df['text_feature'].map(mean_encoded_feature)
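
A common way to reduce the leakage risk is out-of-fold encoding: each row is encoded with category means computed from the other folds, so its own target never contributes to its encoding. The sketch below is one possible implementation, not a definitive one. The toy `data` frame, the number of folds and the fallback to the fold-level target mean are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data; values are made up for illustration.
data = pd.DataFrame({
    'text_feature': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'target':       [1,   0,   1,   0,   0,   1,   1,   1],
})
data['mean_encoded_feature'] = np.nan

kfold = KFold(n_splits=4, shuffle=True, random_state=0)  # assumed fold count
for fit_idx, enc_idx in kfold.split(data):
    fit_rows = data.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded.
    fold_means = fit_rows.groupby('text_feature')['target'].mean()
    # Fallback for categories missing from the fitting rows.
    fold_prior = fit_rows['target'].mean()
    encoded = data.iloc[enc_idx]['text_feature'].map(fold_means).fillna(fold_prior)
    data.loc[data.index[enc_idx], 'mean_encoded_feature'] = encoded.values
```

scikit-learn 1.3 and later also provide `sklearn.preprocessing.TargetEncoder`, which applies a similar cross-fitting scheme inside `fit_transform`.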

### 5. **Embedding Encoding**
For high-cardinality categorical features, you can learn an embedding, a technique common in neural networks.
Each category is mapped to a dense vector, which captures relationships between categories in a compact form.

In [None]:
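import tensorflow as tf
num_categories = df['text_feature'].nunique()  # vocabulary size: one row per category
embedding_dim = 8  # assumed size of each embedding vector; tune for your data
model = tf.keras.Sequential([ tf.keras.layers.Embedding(input_dim=num_categories, output_dim=embedding_dim) ])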
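
An embedding layer expects integer indices rather than raw strings, and its forward pass is a row lookup in a weight matrix. The NumPy sketch below illustrates only that lookup step. The table is randomly initialised and never trained, and the toy values and `embedding_dim` are assumptions. In a real model the table would be the trainable weights of the embedding layer.

```python
import numpy as np
import pandas as pd

# Toy categorical values (made up for illustration).
values = pd.Series(['red', 'blue', 'red', 'green'])

# Map each category to an integer index in order of first appearance.
codes, vocabulary = pd.factorize(values)

embedding_dim = 3  # assumed vector size
rng = np.random.default_rng(0)
# Stand-in for the trainable weight matrix: one row per category.
embedding_table = rng.normal(size=(len(vocabulary), embedding_dim))

# Selecting one row per index is the embedding lookup.
embedded = embedding_table[codes]
```

Rows that share a category, such as the two `red` entries, receive identical vectors.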

### 6. **Hashing Encoding**
A dimensionality-reduction technique that maps categories into a fixed number of hash buckets.
It works well when you need a small, fixed output size, but distinct categories can collide in the same bucket, so uniqueness is not guaranteed.

In [None]:
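from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=10, input_type='string')
# Each sample must be an iterable of strings, so wrap every value in a one-element list
samples = [[value] for value in df['text_feature'].astype(str)]
hashed_feature = hasher.transform(samples)  # sparse matrix of shape (n_rows, 10)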
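
To see what the hasher produces, the small sketch below runs it on a handful of made-up values and converts the sparse result to a dense array. The `colors` series and the choice of 8 buckets are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy categorical values (made up for illustration).
colors = pd.Series(['red', 'blue', 'red', 'green'])

hasher = FeatureHasher(n_features=8, input_type='string')
# Each sample is a one-element list holding the category string.
hashed = hasher.transform([[value] for value in colors]).toarray()
```

Each row has a single non-zero entry (+1 or -1 under the default `alternate_sign=True`), and identical categories always land in the same bucket. Different categories may share a bucket, and this becomes more likely as `n_features` gets smaller.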