**Initialization**
* I use these three lines at the top of each of my Notebooks. The first two reload imported modules automatically before code runs, which prevents stale-code problems when I rework the same Project or Problem. The third line renders Matplotlib visualizations inline within the Notebook.

In [1]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2
%matplotlib inline 

**Importing the Libraries and Dependencies**
* I have imported all the Libraries and Dependencies required for this Project in a single cell.

In [2]:
#@ Importing the Libraries and Dependencies:
import pandas as pd
import numpy as np
import collections
import re

from argparse import Namespace
from IPython.display import display


**Getting the Data**
* I used Google Colab for this Project, so the process of downloading and reading the Data may differ on other platforms. I used **The Surname Dataset**, a collection of 10,000 surnames from 18 different Nationalities gathered from various name sources on the Internet. The Dataset has two notable properties. First, it is fairly Imbalanced. Second, there is a valid and intuitive relationship between Nationality of origin and Surname Orthography.

In [3]:
#@ Getting the Dataset:
args = Namespace(
    raw_dataset = "/content/drive/My Drive/Colab Notebooks/Surname/surnames.csv",
    train_proportion = 0.7,
    val_proportion = 0.15,
    test_proportion = 0.15,
    output_munged = "/content/drive/My Drive/Colab Notebooks/Surname/surnames_with_splits.csv",
    seed = 42
)

#@ Reading the Raw Dataset:
surnames = pd.read_csv(args.raw_dataset, header=0)
display(surnames.head(10))                                                                          # Inspecting the DataFrame.
print("\nUnique Classes:")
display(set(surnames["nationality"]))                                                               # Inspecting the Unique classes in the Dataset.

| | surname | nationality |
|---|---|---|
| 0 | Woodford | English |
| 1 | Coté | French |
| 2 | Kore | English |
| 3 | Koury | Arabic |
| 4 | Lebzak | Russian |
| 5 | Obinata | Japanese |
| 6 | Rahal | Arabic |
| 7 | Zhuan | Chinese |
| 8 | Acconci | Italian |
| 9 | Mifsud | Arabic |



Unique Classes:


{'Arabic',
 'Chinese',
 'Czech',
 'Dutch',
 'English',
 'French',
 'German',
 'Greek',
 'Irish',
 'Italian',
 'Japanese',
 'Korean',
 'Polish',
 'Portuguese',
 'Russian',
 'Scottish',
 'Spanish',
 'Vietnamese'}
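
* The Imbalance mentioned above can be measured by counting how often each Nationality occurs. The following is only a minimal sketch: the `nationalities` list is a hypothetical toy stand-in for the `nationality` column, and the counts in it are made up for illustration, not taken from the Dataset.

```python
import collections

# Hypothetical toy labels standing in for the "nationality" column.
nationalities = ["English"] * 6 + ["Russian"] * 3 + ["Korean"] * 1

# Counting the occurrences of each class.
counts = collections.Counter(nationalities)
total = sum(counts.values())

# Share of each class, ordered from the most to the least frequent.
shares = {label: n / total for label, n in counts.most_common()}

# Ratio between the largest and the smallest class; 1.0 would mean a balanced Dataset.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<10} {share:.0%}")
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

On the real DataFrame, `surnames["nationality"].value_counts(normalize=True)` should give the same kind of per-class shares directly.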

**Processing the Dataset** 

In [4]:
#@ Splitting the Dataset on the basis of Nationality:
by_nationality = collections.defaultdict(list)                      # Maps each Nationality to a list of its rows.
for _, row in surnames.iterrows():
  by_nationality[row.nationality].append(row.to_dict())             # Storing each row as a Dictionary.

#@ Creating the Split Data:
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_nationality.items()):
  np.random.shuffle(item_list)                                      # Shuffling the Data randomly.
  n = len(item_list)                                                # Number of Items in this Nationality.
  n_train = int(args.train_proportion * n)                          # Number of Training Items.
  n_val = int(args.val_proportion * n)                              # Number of Validation Items.
  n_test = int(args.test_proportion * n)                            # Number of Testing Items.
  # Note: int() truncates, so a few leftover Items per Nationality may receive no Split.
  #@ Giving the Data point a Split Attribute:
  for item in item_list[:n_train]:
    item["split"] = "train"                                         # Training Dataset.
  for item in item_list[n_train:n_train+n_val]:
    item["split"] = "val"                                           # Validation Dataset.
  for item in item_list[n_train+n_val:n_train+n_val+n_test]:
    item["split"] = "test"                                          # Testing Dataset.
  #@ Adding to the Final List:
  final_list.extend(item_list)

#@ Final Split of the Data and Creating the Final DataFrame:
final_surnames = pd.DataFrame(final_list)

#@ Inspecting the Final DataFrame:
display(final_surnames.split.value_counts())                         # Inspecting the Training, Validation and the Testing Data.
print(" ")
display(final_surnames.head())                                       # Inspecting the Final DataFrame.

train    7680
val      1640
test     1640
Name: split, dtype: int64

 


| | surname | nationality | split |
|---|---|---|---|
| 0 | Guirguis | Arabic | train |
| 1 | Shamon | Arabic | train |
| 2 | Nader | Arabic | train |
| 3 | Kassis | Arabic | train |
| 4 | Bahar | Arabic | train |
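
* Because each of the three counts is truncated with `int()`, they can add up to less than the number of Items in a Nationality, and the leftover Items then carry no `split` value. The sketch below is one possible variant, not the method used in the cell above: it labels everything after the Training and Validation slices as `test`, so no Item is left out. The function name `assign_splits`, its parameters, and the toy records are hypothetical, and it uses the standard `random` module instead of NumPy, so the shuffle order differs from the one above.

```python
import collections
import random

def assign_splits(items, train_prop=0.7, val_prop=0.15, seed=42):
    """Per-class split: shuffle each class, then label each Item by position."""
    by_class = collections.defaultdict(list)
    for item in items:
        by_class[item["nationality"]].append(dict(item))  # Copy to keep the input unchanged.

    rng = random.Random(seed)
    labeled = []
    for _, group in sorted(by_class.items()):
        rng.shuffle(group)
        n_train = int(train_prop * len(group))
        n_val = int(val_prop * len(group))
        for position, item in enumerate(group):
            if position < n_train:
                item["split"] = "train"
            elif position < n_train + n_val:
                item["split"] = "val"
            else:
                item["split"] = "test"  # The remainder goes to test, so nothing is unlabeled.
        labeled.extend(group)
    return labeled

# Hypothetical toy records: 7 Items of class "A" and 13 Items of class "B".
toy = [{"surname": f"a{i}", "nationality": "A"} for i in range(7)]
toy += [{"surname": f"b{i}", "nationality": "B"} for i in range(13)]

labeled = assign_splits(toy)
split_counts = collections.Counter(item["split"] for item in labeled)
print(split_counts)
```

With the truncating approach, class "A" would get 4 + 1 + 1 = 6 labels for 7 Items, so one Item would stay unlabeled. In this variant every Item receives a Split, at the cost of a Testing set slightly larger than 15%.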


In [5]:
#@ Preparing the Final Data:
final_surnames.to_csv(args.output_munged, index=False)