In [1]:
import numpy as np
import pandas as pd

In [2]:
amzn_df = pd.read_csv(r'rough_data\amazon_dataset\amzn_all_items_simplified.csv')
amzn_df

Unnamed: 0,item,rating
0,0000000078,5.0
1,0000000116,4.0
2,0000000116,1.0
3,0000000868,4.0
4,0000013714,4.0
...,...,...
82677126,BT00IU6O8K,3.0
82677127,BT00IU6O8K,5.0
82677128,dp-g310/do,5.0
82677129,SMLRBIMX03,5.0


In [5]:
item_review_counts = amzn_df['item'].value_counts()

In [15]:
item_review_counts[:534900]

B0054JZC6E    25368
B00FAPF5U0    24024
B009UX2YAC    23956
0439023483    21398
030758836X    19867
              ...  
B009LRR7VI       25
B004RPHRDM       25
0897933990       25
B009B98EHC       25
B0009HBAS0       24
Name: item, Length: 534900, dtype: int64

Out of the 9.8M unique products in the dataset, 534,899 have 25+ reviews.

In [36]:
item_review_counts[:256228]

B0054JZC6E    25368
B00FAPF5U0    24024
B009UX2YAC    23956
0439023483    21398
030758836X    19867
              ...  
B000OMYCXK       50
0071361103       50
1781163243       50
B000A1HF7U       50
B008AY8546       49
Name: item, Length: 256228, dtype: int64

256,227 unique items have 50+ reviews.

In [49]:
item_review_counts[:112988]

B0054JZC6E    25368
B00FAPF5U0    24024
B009UX2YAC    23956
0439023483    21398
030758836X    19867
              ...  
6305094934      100
6304308434      100
B0051OKO42      100
B000GB3ADC      100
B00CX9K2FE       99
Name: item, Length: 112988, dtype: int64

112,987 unique items have 100+ reviews.
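
The three counts above were read off by slicing the sorted `value_counts()` output and checking where the threshold falls. A minimal sketch of getting the same kind of count directly with a boolean mask is shown below. The toy `items` series and the helper name `count_items_with_at_least` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for amzn_df['item'] (assumption: the real column has ~82M rows).
items = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 3 + ["d"] * 1)
item_review_counts = items.value_counts()

def count_items_with_at_least(counts, min_reviews):
    """Return how many items have at least `min_reviews` reviews."""
    return int((counts >= min_reviews).sum())

# Count items at several cutoffs without hand-tuning a slice position.
for threshold in (1, 3, 5):
    print(threshold, count_items_with_at_least(item_review_counts, threshold))

# The item IDs at or above a cutoff can be selected the same way.
items_with_3_plus = list(item_review_counts[item_review_counts >= 3].index)
```

On the real data, the same mask with cutoffs of 25, 50, and 100 should reproduce the counts quoted above, and the selected index can replace a hard-coded slice length.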

### Discretionary Choice:
Prior research indicates that only 5%-15% of reviews for an online service like Amazon are negative. To limit the variance in rating distributions that comes purely from small sample sizes, I am going to use a cutoff of 50 or more reviews in my initial sampling of the data. The goal here is to create a small subset of the data, run my analysis on the smaller, more manageable dataset, and then, once I have a proof of concept, generalize it to the large dataset. I will choose 5,000 items from the 256,227 items that have 50+ reviews. This will be my sample data, and it will be saved to a .csv file.
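
To make the sample-size concern concrete, here is a rough sketch of how noisy an item's observed share of negative reviews would be at different review counts. It assumes each review is an independent draw with a fixed negative rate `p` (a simple binomial model) and takes `p = 0.10` as the midpoint of the 5%-15% range quoted above. Both are illustrative assumptions, not results from this dataset.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of an observed proportion under a binomial model."""
    return math.sqrt(p * (1 - p) / n)

# Assumed negative-review rate: midpoint of the 5%-15% range.
assumed_negative_rate = 0.10

# Compare the candidate review-count cutoffs considered in this notebook.
for n_reviews in (25, 50, 100):
    se = proportion_standard_error(assumed_negative_rate, n_reviews)
    print(n_reviews, round(se, 3))
```

Under these assumptions the standard error is roughly 0.06 at 25 reviews, 0.042 at 50, and 0.03 at 100, so a 50-review cutoff already removes a good share of the noise that comes from small item samples.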

Update: This is not manageable for me. I need to search through the whole dataset to make sure I am getting all the reviews for each item, and with this method I am checking each of the 82M review rows against 5,000 sample item numbers. I let the program run for about 10 minutes before realizing how bad this was. I will cut my sample size down to 500 items, which cuts the number of comparisons my computer needs to make to create this sample set by 90%. Run time: 8:09 start, 8:27 finish.
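
A possible alternative to checking every row against the sample list in Python is the vectorized `Series.isin`, which is typically much faster on large frames. The sketch below uses a small made-up frame (`toy_df`, `toy_sample_items`) as a stand-in for the real data. It has not been timed on the 82M-row dataset, so any speedup there is an expectation rather than a measurement.

```python
import numpy as np
import pandas as pd

# Toy stand-in for amzn_df (assumption: same 'item' and 'rating' columns).
toy_df = pd.DataFrame({
    "item": ["a", "b", "a", "c", "d", "b"],
    "rating": [5.0, 4.0, 1.0, 3.0, 5.0, 2.0],
})

# Toy stand-in for the array of sampled item IDs.
toy_sample_items = np.array(["a", "d"])

# Build the boolean mask in one vectorized call, then filter the rows.
mask = toy_df["item"].isin(toy_sample_items)
toy_sample_df = toy_df[mask]
print(toy_sample_df)
```

The result keeps every review row whose item is in the sample, which is the same selection the row-by-row check produces.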

In [57]:
large_reviews_item_list = list(item_review_counts[:256227].index)

In [65]:
np.random.seed(1)
sample_items = np.random.choice(large_reviews_item_list, 500, replace=False)

In [66]:
print('size of sample:',len(sample_items))
sample_items

size of sample: 500


array(['B004DEQJSG', 'B0095VONZS', 'B008LFTCAK', '1451635524',
       'B0000YRQB2', 'B005DD8KFG', 'B001GOZGIK', '0916708233',
       'B00HSVEDYI', 'B003CGQOZ4', '1416549994', 'B0000532E0',
       'B000IMLSQK', 'B0012R4XO4', '0374202028', 'B004Z01PVO',
       '0446529117', 'B0002GZM00', 'B0088DOCTQ', 'B003YVTF30',
       '0345425324', 'B005AKD836', '0143118765', 'B00ED2TGZ6',
       'B00HEZ9UXW', '1599185156', '0446530220', 'B002SNIK4U',
       'B00505DQ1K', 'B00004S1DU', 'B00091PMEO', 'B00AP7VGNI',
       'B002YQQQJW', '0811714802', 'B005BYZ2YE', '6304196660',
       'B00AE07932', '0763632643', 'B003Q6AG9K', 'B004LRO7DO',
       'B009Y95U5S', 'B002YPZ85G', 'B00CORT57Q', 'B000AP04GK',
       '0373835639', '0374199639', 'B002EE583E', 'B005EZMCGQ',
       'B00FK1H0EI', 'B0010HA6A6', 'B000MMTNLS', 'B000MRCT5U',
       'B000085EEI', 'B005UGIR36', 'B00FKSNX42', '145552297X',
       'B000WBQOZW', 'B005MKZMVK', 'B000UPD8AO', '0307272761',
       '0968601405', 'B005GUPUTA', 'B00000AFH2', 'B001H

In [67]:
sample_subset = amzn_df['item'].apply(lambda x: x in sample_items)
sample_df = amzn_df[sample_subset]

In [68]:
sample_df

Unnamed: 0,item,rating
344751,0060721545,5.0
344752,0060721545,2.0
344753,0060721545,2.0
344754,0060721545,1.0
344755,0060721545,3.0
...,...,...
82151670,B00J5364J4,4.0
82151671,B00J5364J4,5.0
82151672,B00J5364J4,1.0
82151673,B00J5364J4,5.0


In [71]:
sample_df['item'].value_counts()

043935806X    4683
B006Z48TZS    3680
B002AQHLEU    3128
B008LFTCAK    2884
B008K6G8CK    2375
              ... 
B0074PFIKQ      50
B0054MAVXA      50
0615748260      50
0345491327      50
B0050435NU      50
Name: item, Length: 500, dtype: int64

In [72]:
sample_df['rating'].mean()

4.179592896795767

This sample_df is the dataset I will use for my initial analysis of the process. In fact, as a subset of 500 randomly drawn products with 50+ reviews each, I think this would be enough for a project as is, but I do hope to be able to generalize my results to the larger dataset, and then compare other datasets to this one.

Now that I have this smaller dataset, I will start analysis on the data.

In [70]:
sample_df.to_csv(r'rough_data\amazon_dataset\amzn_sample.csv', index=False)