duplicates checked

In [20]:
import pandas as pd
from textblob import TextBlob

## Data Extraction:
**Extracting JSON Data**: 
   - The `.json.gz` file was decompressed with external software (`7Zip`). Python's `gzip` module could do the same, but the external tool was chosen to save time and reduce the risk of mistakes.
   - After decompression, the file was read line by line and each line was parsed into a Python object, as shown below.
   - The parsed records were then loaded into a `pandas` DataFrame for further analysis and manipulation.
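For reference, the decompression step could also be done in Python. The following is a minimal sketch, assuming the compressed file holds one Python dict literal per line (the extraction cell parses each line as a Python expression, which suggests this layout). The helper name `read_literal_lines` and the `.json.gz` path in the usage comment are illustrative, not part of the original notebook.

```python
import ast
import gzip


def read_literal_lines(path):
    """Parse a gzip-compressed text file holding one Python dict literal per line."""
    records = []
    # "rt" opens the compressed stream in text mode, so lines arrive as str
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                # literal_eval only accepts literals, unlike eval
                records.append(ast.literal_eval(line))
    return records


# Hypothetical usage (path is an assumption):
# reviews = read_literal_lines('../data/raw/australian_user_reviews.json.gz')
```

The returned list can be passed to `pd.DataFrame` in the same way as the list built in the extraction cell.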

Extraction

In [21]:
import ast

file_path = '../data/raw/australian_user_reviews.json'

# Read the file and convert each line to a Python object.
# ast.literal_eval parses literals only, so it is safer than eval on file contents.
with open(file_path, 'r', encoding='utf-8') as file:
    reviews = [ast.literal_eval(line.strip()) for line in file]

# Create the DataFrame
df_reviews = pd.DataFrame(reviews)

df_reviews.tail()

# create a copy of the dataframe to be worked on 
df_reviews_clean = df_reviews.copy()

df_reviews_clean.head()

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


Explode

In [22]:
df_exploded = df_reviews_clean.explode('reviews')
reviews_df = df_exploded['reviews'].apply(pd.Series)

df_reviews_clean = pd.concat([df_exploded.drop('reviews', axis=1), reviews_df], axis=1)
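
The explode step can be illustrated on a small toy frame. This is a sketch with invented data (the `toy` frame and its values are assumptions, not rows from the dataset). It shows how a user with an empty review list still yields one row after `explode`, with missing values in the expanded columns. In this situation `apply(pd.Series)` may also add a stray column named `0`.

```python
import pandas as pd

# Toy data: user "a" has two reviews, user "b" has none
toy = pd.DataFrame({
    "user_id": ["a", "b"],
    "reviews": [
        [{"item_id": "10", "recommend": True},
         {"item_id": "20", "recommend": False}],
        [],
    ],
})

# One row per review; an empty list becomes a single row holding NaN
exploded = toy.explode("reviews")

# Turn each review dict into columns
expanded = exploded["reviews"].apply(pd.Series)

# Attach the expanded columns to the remaining user columns
flat = pd.concat([exploded.drop("reviews", axis=1), expanded], axis=1)

print(flat.columns.tolist())
print(flat)
```

Rows like the one for user "b" are the kind that a later `isna()` check would report as missing.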

## Data Transformation

1. Drop irrelevant columns, including the stray `0` column produced by the explode step.
2. Apply a Natural Language Processing model.
3. Find rows with missing values after the explode.
4. Rename the `item_id` column to `game_id` for future joins.

Drop columns

In [23]:
df_reviews_clean = df_reviews_clean.drop(columns=[0, 'user_url', 'funny', 'last_edited', 'helpful', 'posted' ])

**Natural Language Processing model:**
1. Installed and imported the TextBlob library.
2. Defined a function that computes the sentiment and applied it to each review.
3. Dropped the `review` column once the sentiment was stored in the new column.

**Sentiment Values:**
- Neutral: Return 1 (also returned for missing reviews)
- Negative: Return 0
- Positive: Return 2 
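
For readability in later analysis, the numeric codes can be decoded back into names. A minimal sketch follows; `SENTIMENT_LABELS` and `decode_sentiment` are hypothetical names introduced here and do not appear in the notebook.

```python
# Mapping taken from the sentiment values listed above
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def decode_sentiment(code):
    """Return the label for a sentiment code, raising ValueError on unknown codes."""
    try:
        return SENTIMENT_LABELS[int(code)]
    except KeyError:
        raise ValueError(f"unknown sentiment code: {code!r}") from None


for code in (0, 1, 2):
    print(code, decode_sentiment(code))
```

On a DataFrame, the same dictionary could be passed to `Series.map`, for example `df_reviews_clean['sentiment_analysis'].map(SENTIMENT_LABELS)`.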

In [24]:
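def sentiment_analysis(review):
    """Return 0 (negative), 1 (neutral) or 2 (positive) for a review text."""
    # Missing reviews are treated as neutral
    if pd.isna(review):
        return 1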

    if not isinstance(review, str):
        review = str(review)
    polarity = TextBlob(review).sentiment.polarity

    # Assign a sentiment value based on the polarity
    if polarity < 0:
        return 0  # Negative
    elif polarity == 0:
        return 1  # Neutral
    else:
        return 2  # Positive


In [25]:
df_reviews_clean['sentiment_analysis'] = df_reviews_clean['review'].apply(sentiment_analysis)

df_reviews_clean = df_reviews_clean.drop(columns=['review'])

In [26]:
df_reviews_clean.head()

Unnamed: 0,user_id,item_id,recommend,sentiment_analysis
0,76561197970982479,1250,True,2
0,76561197970982479,22200,True,2
0,76561197970982479,43110,True,2
1,js41637,251610,True,2
1,js41637,227300,True,0


In [27]:
missing_values = df_reviews_clean.isna().sum()
print(missing_values)

user_id                0
item_id               28
recommend             28
sentiment_analysis     0
dtype: int64


Missing Values

In [28]:
rows_with_missing_values = df_reviews_clean.isnull().any(axis=1)
df_reviews_clean[rows_with_missing_values].head()

Unnamed: 0,user_id,item_id,recommend,sentiment_analysis
62,gdxsd,,,1
83,76561198094224872,,,1
1047,76561198021575394,,,1
3954,cmuir37,,,1
5394,Jaysteeny,,,1


In [29]:
df_reviews_clean.dropna(inplace=True)

Rename Column

In [30]:

new_name = {
    'item_id': 'game_id',
}

df_reviews_clean.rename(columns=new_name, inplace=True)

print(df_reviews_clean.columns)

Index(['user_id', 'game_id', 'recommend', 'sentiment_analysis'], dtype='object')


In [31]:
df_reviews_clean['game_id'] = df_reviews_clean['game_id'].astype('Int64')

In [32]:
df_reviews_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59305 entries, 0 to 25798
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   user_id             59305 non-null  object
 1   game_id             59305 non-null  Int64 
 2   recommend           59305 non-null  object
 3   sentiment_analysis  59305 non-null  int64 
dtypes: Int64(1), int64(1), object(2)
memory usage: 2.3+ MB


In [33]:
df_reviews_clean.tail()

Unnamed: 0,user_id,game_id,recommend,sentiment_analysis
25797,76561198312638244,70,True,2
25797,76561198312638244,362890,True,2
25798,LydiaMorley,273110,True,2
25798,LydiaMorley,730,True,2
25798,LydiaMorley,440,True,2


## Loading/Saving the Data
1. Saved DataFrame: `df_reviews_clean`
2. Format: `.parquet`
3. File path: `'../data/processed/'`
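
Writing the file fails if the target directory does not exist. One way to guard against that is sketched below, assuming it is acceptable to create the directory on the fly. The helper `processed_path` is a hypothetical name, not part of the notebook.

```python
from pathlib import Path


def processed_path(base_dir, file_name):
    """Return base_dir/file_name, creating base_dir first if it is missing."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return base / file_name


# Hypothetical usage with the notebook's DataFrame:
# df_reviews_clean.to_parquet(processed_path('../data/processed', 'df_reviews_clean.parquet'))
```

Using `pathlib` also avoids relying on a trailing slash in the directory string when joining it with the file name.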

In [34]:
save_path = '../data/processed/' 

In [36]:
df_reviews_clean.to_parquet(save_path + 'df_reviews_clean.parquet')