# Data Ingestion

Dataset Used: [Amazon reviews dataset](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews)

This notebook reads the fastText-formatted training file (`train.ft.txt`) from the Amazon reviews dataset, splits each line into a label and the review text, draws a random sample of 500,000 rows, and saves the result as a single CSV file.

In [1]:
import pandas as pd
import numpy as np
import warnings
import os
warnings.filterwarnings("ignore")

### Locating Data Folder

In [11]:
cur_dir = os.getcwd()
artifacts_folder = os.path.join(os.path.abspath(os.path.join(cur_dir, os.pardir)),'artifacts')
data_folder = os.path.join(artifacts_folder, 'data')
print(f'Data Folder Path: {data_folder}')

Data Folder Path: E:\customer-sentiment\artifacts\data
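The same path can be built with `pathlib`, which avoids the nested `os.path.join` calls. This is a minimal sketch, assuming (as the output above suggests) that the notebook runs from a directory that sits next to `artifacts`; the helper name `find_data_folder` is hypothetical.

```python
from pathlib import Path


def find_data_folder(start: Path) -> Path:
    """Return <parent of start>/artifacts/data, mirroring the cell above."""
    # absolute() plays the role of os.path.abspath; it does not resolve symlinks.
    return start.absolute().parent / "artifacts" / "data"


data_folder = find_data_folder(Path.cwd())
print(f"Data Folder Path: {data_folder}")
```

A `Path` can be passed directly to `open()` and to `DataFrame.to_csv`, so the rest of the notebook would work unchanged with this value.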


### Opening Files

In [3]:
with open(os.path.join(data_folder, 'train.ft.txt'), 'r', encoding='utf-8') as f:
    lines = f.readlines()

print(f'Total Samples: {len(lines)}')

Total Samples: 3600000


In [4]:
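# Build the dataset: each line starts with a 10-character label
# ('__label__1' or '__label__2'), followed by a space and the review text.
data = {
    'target': [],
    'text': []
}
print('Creating Data Set')
for count, line in enumerate(lines, start=1):
    data['target'].append(line[:10])        # label prefix
    data['text'].append(line[10:].strip())  # review title and body
    if count % 500_000 == 0:
        print(f'Converted  {count} lines')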

Creating Data Set
Converted  500000 lines
Converted  1000000 lines
Converted  1500000 lines
Converted  2000000 lines
Converted  2500000 lines
Converted  3000000 lines
Converted  3500000 lines
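The slicing in the cell above relies on every label being exactly 10 characters long (`__label__1` or `__label__2`). A slightly more defensive variant splits on the first space and checks the prefix. The sketch below is illustrative only: `parse_line` is a hypothetical helper, and the sample lines are made up rather than taken from the dataset.

```python
from typing import Tuple

LABEL_PREFIX = "__label__"


def parse_line(line: str) -> Tuple[str, str]:
    """Split a fastText-style line into (label, text)."""
    # partition splits on the first space only, so spaces in the review survive.
    label, _, text = line.partition(" ")
    if not label.startswith(LABEL_PREFIX):
        raise ValueError(f"Unexpected label: {label!r}")
    return label, text.strip()


# Made-up stand-ins for lines of train.ft.txt.
sample_lines = [
    "__label__2 Great product: works as described\n",
    "__label__1 Disappointed: stopped working after a week\n",
]

data = {"target": [], "text": []}
for line in sample_lines:
    label, text = parse_line(line)
    data["target"].append(label)
    data["text"].append(text)

print(data["target"])
```

Because `parse_line` works on one line at a time, it could also be applied while iterating over the open file, without first loading every line via `readlines()`.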


### Printing DataFrames

In [5]:
df = pd.DataFrame(data)
df.head()

|   | target | text |
|---|--------|------|
| 0 | __label__2 | Stuning even for the non-gamer: This sound tra... |
| 1 | __label__2 | The best soundtrack ever to anything.: I'm rea... |
| 2 | __label__2 | Amazing!: This soundtrack is my favorite music... |
| 3 | __label__2 | Excellent Soundtrack: I truly like this soundt... |
| 4 | __label__2 | Remember, Pull Your Jaw Off The Floor After He... |


### Checking Null Values

In [6]:
df.isna().sum()

target    0
text      0
dtype: int64

There are no null values in the dataset.

### Checking Duplicates

In [7]:
df.duplicated().sum()

0

There are no duplicate rows in the dataset.
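Called without arguments, `df.duplicated()` flags a row only when every column matches an earlier row, so the same review text appearing under two different labels would not be counted. The sketch below shows the difference on a made-up toy frame; whether a text-only check is worth running on the full dataset is a judgment call.

```python
import pandas as pd

# Toy data: the first two rows share the text but not the label.
toy = pd.DataFrame({
    "target": ["__label__2", "__label__1", "__label__2"],
    "text": ["Same review text", "Same review text", "Another review"],
})

full_row_dupes = toy.duplicated().sum()                # compares target and text together
text_only_dupes = toy.duplicated(subset="text").sum()  # compares text alone

print(full_row_dupes, text_only_dupes)
```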

### Mapping Target Variable

In [8]:
df['target'].unique()

array(['__label__2', '__label__1'], dtype=object)

In [9]:
# __label__2 = 1 (positive)
# __label__1 = 0 (negative)

target_map = {
    '__label__2' : 1,
    '__label__1' : 0,
}

df['target'] = df['target'].map(target_map).astype(np.int64)
df.head()

|   | target | text |
|---|--------|------|
| 0 | 1 | Stuning even for the non-gamer: This sound tra... |
| 1 | 1 | The best soundtrack ever to anything.: I'm rea... |
| 2 | 1 | Amazing!: This soundtrack is my favorite music... |
| 3 | 1 | Excellent Soundtrack: I truly like this soundt... |
| 4 | 1 | Remember, Pull Your Jaw Off The Floor After He... |
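`Series.map` returns `NaN` for any value that is missing from the dictionary, and the following `astype(np.int64)` would then fail with a conversion error. A small guard makes that failure easier to read. This is a sketch on toy data, not a required step.

```python
import numpy as np
import pandas as pd

target_map = {"__label__2": 1, "__label__1": 0}

# Toy stand-in for the 'target' column.
toy = pd.DataFrame({"target": ["__label__2", "__label__1", "__label__2"]})

# Fail early with a readable message if a label is not covered by the map.
unknown = set(toy["target"].unique()) - set(target_map)
if unknown:
    raise ValueError(f"Unmapped labels: {sorted(unknown)}")

toy["target"] = toy["target"].map(target_map).astype(np.int64)

# Class balance after mapping.
print(toy["target"].value_counts().to_dict())
```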


### Sampling 500,000 Random Rows

In [12]:
df = df.sample(500_000).reset_index(drop=True)

In [13]:
df.shape

(500000, 2)

In [15]:
df.head()

|   | target | text |
|---|--------|------|
| 0 | 0 | Not That Great: Though I love Tito Puente, I w... |
| 1 | 1 | Loved it right up until the last 10 pages: Thi... |
| 2 | 1 | Great Guide for those who want to know: Great ... |
| 3 | 0 | The Worst Deal yet: This was the worst compute... |
| 4 | 0 | fun for kids but...: My kid wanted to buy mini... |
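Because `df.sample(500_000)` is called without a seed, each run of the notebook draws a different subset. If a repeatable sample is wanted, `random_state` can be passed. The sketch below uses a toy frame and an arbitrary seed of 42; the per-class variant at the end is an optional extra, not something the notebook does.

```python
import pandas as pd

# Toy frame with alternating labels.
toy = pd.DataFrame({
    "target": [0, 1] * 50,
    "text": [f"review {i}" for i in range(100)],
})

# Same seed, same rows: the sample is repeatable across runs.
first = toy.sample(10, random_state=42).reset_index(drop=True)
second = toy.sample(10, random_state=42).reset_index(drop=True)
print(first.equals(second))

# Optional: draw the same number of rows from each class.
balanced = (
    toy.groupby("target")
    .sample(n=5, random_state=42)
    .reset_index(drop=True)
)
print(balanced["target"].value_counts().to_dict())
```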


### Saving the dataset

In [14]:
df.to_csv(os.path.join(data_folder, 'data.csv'), index=False)
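A quick way to confirm the file was written as intended is to read it back and compare it with the frame in memory. The sketch below does this on a toy frame in a temporary directory, so it runs without the real dataset; the review strings are made up, and one contains quotes and a comma to exercise CSV escaping.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "target": [1, 0],
    "text": ["Great: would buy again", 'Bad: "broken", returned it'],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    toy.to_csv(csv_path, index=False)  # index=False keeps the row index out of the file
    restored = pd.read_csv(csv_path)

# Shape and contents should survive the round trip.
print(restored.shape)
print(restored["text"].tolist() == toy["text"].tolist())
```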