# Generating a Synthetic Dataset for Text Classification Problems

<img align="left" width="130" src="https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Cookbook/master/Extra/cover-small-padded.png"/>

This notebook contains the code for one of the recipes in the book [Machine Learning with Amazon SageMaker Cookbook: 80 proven recipes for data scientists and developers to perform ML experiments and deployments](https://www.amazon.com/Machine-Learning-Amazon-SageMaker-Cookbook/dp/1800567030).

### How to do it...

In [None]:
!pip install faker

In [None]:
from faker import Faker
faker = Faker()

In [None]:
positive_custom_list = [
    'this is good', 
    'i like it', 
    'very delicious', 
    'i would recommend this to my friends',
    'food in the restaurant',
    'spaghetti chicken soup',
    'dinner time',
    'tastes good',
    'donut',
    'very good',
    'impressive']

In [None]:
def generate_positive_sentences():
    return faker.sentence(
        ext_word_list=positive_custom_list
    )
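
To make the behavior of `generate_positive_sentences()` more concrete, here is a rough standard-library sketch of the kind of output it produces: a handful of phrases drawn at random from the custom list, joined with spaces, capitalized, and ended with a period. This only approximates the idea and is not Faker's implementation; `sketch_sentence`, `nb_phrases`, the shortened phrase list, and the seed are illustrative assumptions that do not appear in the recipe.

```python
import random


def sketch_sentence(phrases, nb_phrases=6, rng=None):
    """Approximate a Faker-style sentence built from a custom phrase list."""
    rng = rng or random.Random()
    # Draw phrases with replacement, so the same phrase can appear twice.
    chosen = [rng.choice(phrases) for _ in range(nb_phrases)]
    text = " ".join(chosen)
    # Capitalize the first character and close the sentence with a period.
    return text[0].upper() + text[1:] + "."


# Shortened, hypothetical version of positive_custom_list.
sample_phrases = ['this is good', 'i like it', 'very delicious', 'tastes good']

rng = random.Random(42)  # fixed seed so repeated runs give the same sentence
sentence = sketch_sentence(sample_phrases, rng=rng)
print(sentence)
```

The trailing period is why the later cells call `item.replace(".", "")` before storing each sentence.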

In [None]:
negative_custom_list = [
    'this is bad', 
    'i hate it', 
    'there are better restaurants out there', 
    'i will not recommend this to my friends',
    'food in the restaurant',
    'spaghetti chicken soup',
    'dinner time',
    'tastes bad',
    'donut',
    'very bad',
    'not impressive']

In [None]:
def generate_negative_sentences():
    return faker.sentence(
        ext_word_list=negative_custom_list
    )

In [None]:
positive_sentences = []

for _ in range(1000):
    item = generate_positive_sentences()
    # Drop the trailing period that faker.sentence() appends
    item = item.replace(".", "")
    positive_sentences.append(item)

In [None]:
positive_sentences

In [None]:
negative_sentences = []

for _ in range(1000):
    item = generate_negative_sentences()
    # Drop the trailing period that faker.sentence() appends
    item = item.replace(".", "")
    negative_sentences.append(item)

In [None]:
negative_sentences

In [None]:
import pandas as pd 

positive_df = pd.DataFrame(
    positive_sentences, 
    columns=['text']
)

positive_df.insert(
    0, 
    "label", 
    "__label__positive"
)

In [None]:
positive_df

In [None]:
negative_df = pd.DataFrame(
    negative_sentences, 
    columns=['text']
)
negative_df.insert(
    0, 
    "label", 
    "__label__negative"
)

In [None]:
negative_df

In [None]:
all_df = pd.concat(
    [positive_df, negative_df], 
    ignore_index=True
)

In [None]:
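from sklearn.model_selection import train_test_split

# Hold out 20% of all rows as the test set
train_val_df, test_df = train_test_split(
    all_df, 
    test_size=0.2
)
# Split the remaining rows 75/25 into training and validation sets
train_df, val_df = train_test_split(
    train_val_df, 
    test_size=0.25
)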
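
The two chained splits above give a 60/20/20 division of the data. The following short sketch works through that arithmetic with plain integers, assuming the 2,000 rows (1,000 positive and 1,000 negative) generated earlier; the variable names are illustrative and not part of the recipe.

```python
total_rows = 2000  # 1000 positive + 1000 negative sentences

test_rows = total_rows * 20 // 100        # test_size=0.2 of all rows
train_val_rows = total_rows - test_rows   # rows left after the first split
val_rows = train_val_rows * 25 // 100     # test_size=0.25 of the remainder
train_rows = train_val_rows - val_rows

print(train_rows, val_rows, test_rows)  # 1200 400 400
```

Because 25% of the remaining 80% is 20% of the whole, the validation and test sets end up the same size.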

In [None]:
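!mkdir -p tmp

# sep=" " puts a single space between the label and the text;
# quotechar=" " keeps pandas from wrapping the text in double quotes
train_df.to_csv(
    "tmp/synthetic.train.txt", 
    header=False, 
    index=False, 
    sep=" ", 
    quotechar=" "
)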
val_df.to_csv(
    "tmp/synthetic.validation.txt", 
    header=False, 
    index=False, 
    sep=" ", 
    quotechar=" "
) 
test_df.to_csv(
    "tmp/synthetic.test.txt", 
    header=False, 
    index=False, 
    sep=" ", 
    quotechar=" "
)
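
Each line written above starts with a `__label__` prefixed label followed by the text, which is the line layout used by fastText-style text classifiers. As an alternative sketch, the same layout can be produced with plain file I/O instead of CSV quoting options. The helper `write_labeled_lines`, the sample rows, and the output file name are hypothetical; the recipe itself uses `to_csv` as shown above.

```python
import os
import tempfile


def write_labeled_lines(rows, path):
    """Write (label, text) pairs as '<label> <text>' lines, one per row."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in rows:
            f.write(f"{label} {text}\n")


# Hypothetical sample rows in the same shape as the recipe's data.
sample_rows = [
    ("__label__positive", "Very good tastes good"),
    ("__label__negative", "Very bad i hate it"),
]

# Write to a temporary directory so the sketch does not touch tmp/.
out_path = os.path.join(tempfile.mkdtemp(), "synthetic.sample.txt")
write_labeled_lines(sample_rows, out_path)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

With the DataFrames from this notebook, `train_df.itertuples(index=False)` yields `(label, text)` pairs that could be passed as `rows`.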

In [None]:
!head tmp/synthetic.train.txt

In [None]:
# Replace with the name of an S3 bucket that you own
s3_bucket = "sagemaker-cookbook-bucket"
prefix = "chapter08"
!aws s3 cp tmp/synthetic.train.txt s3://{s3_bucket}/{prefix}/input/synthetic.train.txt
!aws s3 cp tmp/synthetic.validation.txt s3://{s3_bucket}/{prefix}/input/synthetic.validation.txt

In [None]:
%store test_df
%store s3_bucket
%store prefix