# Dataset Preparation for Translation Tasks

This script prepares datasets for English-to-Tigrinya and Tigrinya-to-English translation tasks. It includes data splitting, sampling, and saving the processed datasets for training, validation, and testing.

---

## Key Steps

### 1. **Load Full Datasets**
- Loaded two full datasets:
  - English-to-Tigrinya (`en_to_ti_full_datasets.csv`)
  - Tigrinya-to-English (`ti_to_en_full_datasets.csv`)
- Previewed the structure of the datasets to verify the content.

### 2. **Split Datasets**
- Split each dataset into:
  - **Training Set** (80%)
  - **Validation Set** (10%)
  - **Test Set** (10%)
- Used scikit-learn's `train_test_split` function with a fixed `random_state` (42) for reproducibility.

### 3. **Save Dataset Splits**
- Saved the training, validation, and test splits for both translation tasks as separate CSV files:
  - **English-to-Tigrinya**:
    - `en_to_ti_train.csv`
    - `en_to_ti_val.csv`
    - `en_to_ti_test.csv`
  - **Tigrinya-to-English**:
    - `ti_to_en_train.csv`
    - `ti_to_en_val.csv`
    - `ti_to_en_test.csv`

### 4. **Print Summary**
- Displayed the number of rows in each dataset split for both tasks.

### 5. **Sample 10% from Training Sets**
- Took a 10% sample from the training sets for both tasks to create smaller subsets for faster experimentation:
  - **English-to-Tigrinya**: `en_to_ti_sampled_train.csv`
  - **Tigrinya-to-English**: `ti_to_en_sampled_train.csv`
- Saved the sampled subsets as separate CSV files.

---

## Output Summary
- **English-to-Tigrinya**:
  - Train: 80% of rows (286,500)
  - Validation: 10% of rows (35,813)
  - Test: 10% of rows (35,813)
  - Sampled Train: 10% of the training rows (28,650)

- **Tigrinya-to-English**:
  - Train: 80% of rows (286,500)
  - Validation: 10% of rows (35,813)
  - Test: 10% of rows (35,813)
  - Sampled Train: 10% of the training rows (28,650)

This preparation yields fixed, reproducible training, validation, and test sets for both translation directions, plus a smaller training subset for quick experiments.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Libraries


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Load the Datasets

Load both full datasets and preview their structure.


In [None]:
# Load datasets
en_to_ti_df = pd.read_csv("/Capstone/Dataset_csv/en_to_ti_full_datasets.csv")
ti_to_en_df = pd.read_csv("/Capstone/Dataset_csv/ti_to_en_full_datasets.csv")

# Check the structure
print(en_to_ti_df.head())
print(ti_to_en_df.head())

   Unnamed: 0                                             Source  \
0           0  'In the beginning God created the heaven and t...   
1           1  'And the earth was without form, and void; and...   
2           2  'And God said, Let there be light: and there w...   
3           3  'And God saw the light, that it was good: and ...   
4           4  'And God called the light Day, and the darknes...   

                                              Target  
0                       'ኣምላኽ ብመጀመርታ ሰማይን ምድርን ፈጠረ።'  
1  'ምድሪ ድማ በረኻን ጥራያን ነበረት፡ ጸልማት ከኣ ኣብ ልዕሊ መዓሙቕ ነበ...  
2               'ኣምላኽ ከኣ፥ ብርሃን ይኹን፡ በለ። ብርሃን ድማ ዀነ።'  
3  'ኣምላኽ ድማ እቲ ብርሃን ጽቡቕ ከም ዝዀነ ረኣየ። ኣምላኽ ከኣ ነቲ ብር...  
4  'ኣምላኽ ነቲ ብርሃን መዓልቲ ኣውጽኣሉ። ነቲ ጸልማት ከኣ ለይቲ ኣውጽኣሉ...  
   Unnamed: 0                                             Target  \
0           0  'In the beginning God created the heaven and t...   
1           1  'And the earth was without form, and void; and...   
2           2  'And God said, Let there be light: and the
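
The `Unnamed: 0` column in the preview is what pandas produces when a CSV has a header with an empty first field, which typically happens when a DataFrame was written with its index included. Whether that is how these files were created is an assumption. The sketch below reproduces the effect on a small in-memory CSV (placeholder sentences, not the real data) and shows two ways to avoid carrying the extra column forward.

```python
import io

import pandas as pd

# Hypothetical CSV text that mimics a file saved with its index included.
csv_text = (
    ",Source,Target\n"
    "0,source sentence 0,target sentence 0\n"
    "1,source sentence 1,target sentence 1\n"
)

# Reading it as-is yields an extra "Unnamed: 0" column.
raw_df = pd.read_csv(io.StringIO(csv_text))
print(list(raw_df.columns))

# Option 1: treat the first column as the index while reading.
clean_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column after reading.
dropped_df = raw_df.drop(columns=["Unnamed: 0"])

print(list(clean_df.columns))
print(list(dropped_df.columns))
```

Either option leaves only the `Source` and `Target` columns, so the saved splits would not include a redundant index column.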

## Define a Function for Splitting
Create a reusable function that splits a dataset into training (80%), validation (10%), and test (10%) sets. A 10% sample of each training set is taken in a later step.

In [None]:
def split_dataset(df, test_size=0.1, val_size=0.1, random_state=42):
    # Split into training+validation and testing
    train_val, test = train_test_split(df, test_size=test_size, random_state=random_state)
    # Further split training+validation into training and validation
    train, val = train_test_split(train_val, test_size=val_size / (1 - test_size), random_state=random_state)
    return train, val, test
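
The second split uses `val_size / (1 - test_size)` because the validation fraction is stated relative to the full dataset, while the second call only sees the 90% that remains after the test rows are removed: 0.1 / 0.9 ≈ 0.111 of the remainder equals 10% of the whole. The sketch below repeats the function so it runs on its own and applies it to a hypothetical toy DataFrame (`toy_df`, with placeholder sentences) to check the resulting proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df, test_size=0.1, val_size=0.1, random_state=42):
    # First hold out the test set.
    train_val, test = train_test_split(df, test_size=test_size, random_state=random_state)
    # val_size is a fraction of the full data, so rescale it to the remainder.
    train, val = train_test_split(
        train_val, test_size=val_size / (1 - test_size), random_state=random_state
    )
    return train, val, test


# Hypothetical toy data: 1,000 placeholder sentence pairs.
toy_df = pd.DataFrame({
    "Source": [f"source sentence {i}" for i in range(1000)],
    "Target": [f"target sentence {i}" for i in range(1000)],
})

train, val, test = split_dataset(toy_df)
total = len(toy_df)
print(f"train={len(train) / total:.3f}, val={len(val) / total:.3f}, test={len(test) / total:.3f}")
```

The sizes can differ from an exact 80/10/10 split by a row because the requested fractions are rounded to whole row counts.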

## Split Each Dataset
Use the function to split both datasets.

In [None]:
# Split English-to-Tigrinya dataset
en_to_ti_train, en_to_ti_val, en_to_ti_test = split_dataset(en_to_ti_df)

# Split Tigrinya-to-English dataset
ti_to_en_train, ti_to_en_val, ti_to_en_test = split_dataset(ti_to_en_df)

# Save splits if needed
en_to_ti_train.to_csv("/Capstone/Dataset_csv/en_to_ti_train.csv", index=False)
en_to_ti_val.to_csv("/Capstone/Dataset_csv/en_to_ti_val.csv", index=False)
en_to_ti_test.to_csv("/Capstone/Dataset_csv/en_to_ti_test.csv", index=False)

ti_to_en_train.to_csv("/Capstone/Dataset_csv/ti_to_en_train.csv", index=False)
ti_to_en_val.to_csv("/Capstone/Dataset_csv/ti_to_en_val.csv", index=False)
ti_to_en_test.to_csv("/Capstone/Dataset_csv/ti_to_en_test.csv", index=False)

# Print summary
print(f"English-to-Tigrinya: Train={len(en_to_ti_train)}, Val={len(en_to_ti_val)}, Test={len(en_to_ti_test)}")
print(f"Tigrinya-to-English: Train={len(ti_to_en_train)}, Val={len(ti_to_en_val)}, Test={len(ti_to_en_test)}")


English-to-Tigrinya: Train=286500, Val=35813, Test=35813
Tigrinya-to-English: Train=286500, Val=35813, Test=35813
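
Beyond row counts, it can be useful to confirm that the three splits do not share rows and that together they cover the full dataset. The following is a minimal sketch of such a check, assuming the splits still carry the original DataFrame index (true before they are saved with `index=False`). The helper `check_splits` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def check_splits(full, train, val, test):
    """Return True if the splits are pairwise disjoint and together cover `full`."""
    idx_train, idx_val, idx_test = set(train.index), set(val.index), set(test.index)
    disjoint = not (idx_train & idx_val or idx_train & idx_test or idx_val & idx_test)
    covers = (idx_train | idx_val | idx_test) == set(full.index)
    return disjoint and covers


# Hypothetical toy data standing in for one of the full datasets.
toy_df = pd.DataFrame({
    "Source": [f"source sentence {i}" for i in range(500)],
    "Target": [f"target sentence {i}" for i in range(500)],
})

# Same two-stage 80/10/10 split as in the notebook.
train_val, test = train_test_split(toy_df, test_size=0.1, random_state=42)
train, val = train_test_split(train_val, test_size=0.1 / 0.9, random_state=42)

splits_ok = check_splits(toy_df, train, val, test)
print(f"Splits disjoint and complete: {splits_ok}")
```

This check compares index labels only. It does not detect duplicate sentence pairs that appear under different index values in the source data.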


## English-to-Tigrinya: Take a 10% Sample from the Training Set

In [None]:
# Load training set
en_to_ti_train = pd.read_csv("/Capstone/Dataset_csv/en_to_ti_train.csv")

# Take 10% sample
en_to_ti_sampled_train = en_to_ti_train.sample(frac=0.1, random_state=42)

# Save the sampled training set
en_to_ti_sampled_train.to_csv("/Capstone/Dataset_csv/en_to_ti_sampled_train.csv", index=False)

print(f"English-to-Tigrinya Sampled Train: {len(en_to_ti_sampled_train)} rows")


English-to-Tigrinya Sampled Train: 28650 rows


## Tigrinya-to-English: Take a 10% Sample from the Training Set

In [None]:
# Load training set
ti_to_en_train = pd.read_csv("/Capstone/Dataset_csv/ti_to_en_train.csv")

# Take 10% sample
ti_to_en_sampled_train = ti_to_en_train.sample(frac=0.1, random_state=42)

# Save the sampled training set
ti_to_en_sampled_train.to_csv("/Capstone/Dataset_csv/ti_to_en_sampled_train.csv", index=False)

print(f"Tigrinya-to-English Sampled Train: {len(ti_to_en_sampled_train)} rows")


Tigrinya-to-English Sampled Train: 28650 rows
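
The two sampling cells differ only in the file prefix, so the same steps could be expressed once and applied to both directions. The sketch below shows one way to do that. The helper `sample_and_save`, the temporary output directory, and the toy training data are all hypothetical stand-ins for the notebook's own files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def sample_and_save(train_df, out_path, frac=0.1, random_state=42):
    """Draw a reproducible random sample of `train_df` and write it to `out_path`."""
    sampled = train_df.sample(frac=frac, random_state=random_state)
    sampled.to_csv(out_path, index=False)
    return sampled


with tempfile.TemporaryDirectory() as tmp_dir:
    for direction in ("en_to_ti", "ti_to_en"):
        # Hypothetical toy training set: 200 placeholder sentence pairs.
        toy_train = pd.DataFrame({
            "Source": [f"{direction} source {i}" for i in range(200)],
            "Target": [f"{direction} target {i}" for i in range(200)],
        })
        out_path = Path(tmp_dir) / f"{direction}_sampled_train.csv"
        sampled = sample_and_save(toy_train, out_path)
        # Reload to confirm the file on disk matches the sample size.
        reloaded = pd.read_csv(out_path)
        print(f"{direction} sampled train: {len(sampled)} rows, reloaded: {len(reloaded)} rows")
```

Fixing `random_state` means repeated runs draw the same rows, which keeps experiments on the sampled subset comparable.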
