# About the project

Sentiment analysis plays a significant role in marketing. Automating it makes it feasible to analyze millions of product reviews. In this project, I use the Amazon video game review data from the Manning liveProject.

The primary goal of this project is to learn NLP with deep learning and to go through the complete lifecycle of developing an application.

# Download Dataset

In [1]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz

--2021-05-26 11:43:55--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 154050105 (147M) [application/octet-stream]
Saving to: ‘Video_Games_5.json.gz’


2021-05-26 11:44:11 (9.51 MB/s) - ‘Video_Games_5.json.gz’ saved [154050105/154050105]



In [3]:
!mv Video_Games_5.json.gz data/

The downloaded file is moved into the data folder. Video_Games_5.json.gz is a gzipped file, and we can extract it with the gunzip command on Linux. I found this link helpful: https://tecadmin.net/extract-gz-file-in-linux-command/

In [4]:
!gunzip data/Video_Games_5.json.gz
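
If the gunzip command is not available (for example on Windows), the same step can be sketched in Python with the standard-library gzip and shutil modules. The helper name gunzip_file is made up for illustration, and the path mirrors the one used above. Unlike gunzip, which by default removes the archive after extracting it, this sketch leaves the .gz file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dest=None):
    """Decompress a .gz file (hypothetical helper, not part of the original notebook)."""
    src = Path(src)
    # Default destination: the same path without the trailing ".gz" suffix.
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        # Copy in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(f_in, f_out)
    return dest


if __name__ == "__main__":
    # Assumes the archive is still at data/Video_Games_5.json.gz.
    archive = Path("data/Video_Games_5.json.gz")
    if archive.exists():
        print(gunzip_file(archive))
```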

# Data analysis

Now that we have downloaded the dataset, we can analyze it with pandas. The dataset is a JSON file, and read_json can be used to load it into a pandas DataFrame.

In [6]:
import pandas as pd

The data is not a single standard JSON document; it uses NDJSON [http://ndjson.org/], that is, newline-delimited JSON with one object per line. pandas comes with an API to parse this format too.

In [7]:
df = pd.read_json('data/Video_Games_5.json',lines=True)

> lines=True makes read_json parse the file line by line, one JSON object per line.
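
To see what lines=True does in isolation, here is a minimal sketch that parses two made-up NDJSON records from an in-memory string. The records and the variable names are illustrative only; the real file has many more fields per line.

```python
import io

import pandas as pd

# Two made-up records in newline-delimited JSON: one JSON object per line.
ndjson_text = (
    '{"overall": 5, "reviewText": "great game"}\n'
    '{"overall": 2, "reviewText": "too complicated"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object,
# so every line becomes one row of the resulting DataFrame.
sample_df = pd.read_json(io.StringIO(ndjson_text), lines=True)
print(sample_df)
```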

In [8]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,5,True,"10 17, 2015",A1HP7NVNPFMA4N,700026657,Ambrosia075,"This game is a bit hard to get the hang of, bu...",but when you do it's great.,1445040000,,,
1,4,False,"07 27, 2015",A1JGAP0185YJI6,700026657,travis,I played it a while but it was alright. The st...,"But in spite of that it was fun, I liked it",1437955200,,,
2,3,True,"02 23, 2015",A1YJWEXHQBWK2B,700026657,Vincent G. Mezera,ok game.,Three Stars,1424649600,,,
3,2,True,"02 20, 2015",A2204E1TH211HT,700026657,Grandma KR,"found the game a bit too complicated, not what...",Two Stars,1424390400,,,
4,5,True,"12 25, 2014",A2RF5B5H74JLPE,700026657,jon,"great game, I love it and have played it since...",love this game,1419465600,,,


In [9]:
len(df)

497577

## Rating and Review Columns

We will be using supervised learning, so we need both an input and a label: the review text (reviewText) is the input, and the rating given by the reviewer (overall) is the label. These are the two most essential columns. I picked them quickly because the choice is obvious here, but that may not be the case in general, so take it with a pinch of salt before moving on to the next step.

The next steps in the process are to:
- Check the length of the dataset
- Find the balance of the samples
- Create a small dataset for training
- Set aside a large corpus for deep learning

## Find the Balance of the Sample

Based on the ratings, we need to provide the model with an equal amount of data from each class for training. For example, if we provide more positive than negative reviews, the model will be biased towards the majority class. To generalize well, it's crucial to balance the data. To start with, let's check how the data is distributed.

In [10]:
df['overall'].value_counts()

5    299759
4     93654
3     49146
1     30883
2     24135
Name: overall, dtype: int64

The distribution above clearly shows that the data is skewed towards 5-star (positive) reviews, so using the complete data would bias the model towards positive predictions. Let us therefore create a couple of training subsets:
  - Small corpus - useful for general training and quick experiments
  - Large corpus - the full dataset has 497577 reviews, so it's better to reduce the number to 100K
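
To put numbers on the skew, the counts can be turned into percentages. The sketch below rebuilds the counts by hand from the output above so that it runs on its own; on the full DataFrame, df['overall'].value_counts(normalize=True) should give the same shares as fractions. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
rating_counts = pd.Series(
    {5: 299759, 4: 93654, 3: 49146, 1: 30883, 2: 24135}, name="overall"
)

# Share of each rating in percent, rounded to one decimal place.
rating_share = (rating_counts / rating_counts.sum() * 100).round(1)
print(rating_share)
```

Roughly 60% of all reviews are 5-star, which is why a plain random sample would carry the same imbalance into training.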

# Small Corpus

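For the small corpus, we can start with about 1% of the data (4,500 reviews). To make a more balanced dataset out of the large one, I pick a fixed number of reviews from each rating category: 1,500 each for ratings 1 and 5, and 500 each for ratings 2, 3 and 4.
This lets the subset represent every rating far more evenly than the raw data does.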

In [11]:
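# Number of reviews to sample for each rating (4,500 in total)
subsetrecord = {1:1500,2:500,3:500,4:500,5:1500}

def create_small_corpus(partition, df):
    """Sample partition[rating] rows for every rating and concatenate them."""
    values = []
    for rating, n in partition.items():
        # A fixed random_state keeps the sample reproducible
        values.append(df[df.overall == rating].sample(n=n, random_state=42))
    return pd.concat(values)

subset = create_small_corpus(subsetrecord, df)
# Keep only the label and text columns and give them simpler names
subset = subset[['overall', 'reviewText']].rename(
    columns={"overall": "ratings", "reviewText": "reviews"})
subset.to_csv('data/small_corpus.csv')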

Another important use of the small corpus is faster training, which lets you experiment with multiple models before choosing one.
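
As a quick sanity check, the per-rating sampling idea can be tried on a toy frame. This is a sketch only: the toy data, the sample_per_rating name and the scaled-down quotas are made up for illustration and mirror the logic of create_small_corpus above.

```python
import pandas as pd


def sample_per_rating(frame, quotas, seed=42):
    """Draw quotas[rating] rows for each rating and stack the pieces."""
    parts = [
        frame[frame["overall"] == rating].sample(n=n, random_state=seed)
        for rating, n in quotas.items()
    ]
    return pd.concat(parts)


# Toy data: 10 reviews for each rating from 1 to 5 (hypothetical values).
toy = pd.DataFrame(
    {
        "overall": [r for r in range(1, 6) for _ in range(10)],
        "reviewText": [f"review {i}" for i in range(50)],
    }
)

# Same 3:1:1:1:3 ratio as the real quotas, scaled down.
toy_quotas = {1: 3, 2: 1, 3: 1, 4: 1, 5: 3}
toy_subset = sample_per_rating(toy, toy_quotas)

# The per-rating counts of the subset should match the quotas exactly.
print(toy_subset["overall"].value_counts().sort_index())
```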

# Large Corpus

Take a random sample of 100,000 reviews. This gives a bigger, more representative corpus for deep learning models.

In [12]:
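# Random sample of 100,000 reviews; fixed seed for reproducibility
bigger_corpus = df.sample(n=100000, random_state=42)
# Keep only the label and text columns and give them simpler names
bigger_corpus = bigger_corpus[['overall', 'reviewText']].rename(
    columns={"overall": "ratings", "reviewText": "reviews"})
bigger_corpus.to_csv('data/big_corpus.csv')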
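
One detail to keep in mind when loading these files later: to_csv writes the DataFrame index as an unnamed first column by default. The sketch below, which uses a made-up three-row frame and a temporary file instead of the real corpus, shows how index_col=0 restores that column as the index when reading the CSV back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for a sampled corpus; the index keeps the original row labels.
corpus = pd.DataFrame(
    {"ratings": [5, 1, 3], "reviews": ["love it", "broken", "ok game."]},
    index=[101, 7, 42],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.csv"
    # to_csv writes the index as an unnamed first column by default.
    corpus.to_csv(path)
    # index_col=0 turns that first column back into the index on load.
    restored = pd.read_csv(path, index_col=0)

print(restored)
```

Without index_col=0, the same column would typically come back as an extra column named "Unnamed: 0".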