# Data cleaning

## Imports

In [1]:
import json
import pandas as pd
import re
from nltk.tokenize import TweetTokenizer
import string
import nltk.corpus
# nltk.download('stopwords')

## Function definition

In [2]:
def non_ascii(s):
    return "".join(i for i in s if ord(i)<128)

def clean_links(text):
    txt = re.compile(r'http(s)?://\w+(\.\w+){1,}(/\w+)*')  # raw string keeps the regex escapes intact
    return txt.sub(r'', str(text))

def clean_html(text):
    html = re.compile('<.*?>|&[A-Za-z0-9_]+')
    return html.sub(r'', str(text))

def remove_twitter_entities(text): # Hashtags, usernames, urls and any character that is not a letter, digit, space or tab
    return ' '.join(re.sub(r"(#[A-Za-z0-9]+)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",str(text)).split())

def clean_web(text):
    text = clean_links(text)
    text = clean_html(text)
    return text

def punct(text):
    token = nltk.tokenize.RegexpTokenizer(r'\w+')
    text = token.tokenize(text)
    text = " ".join(text)
    return text

def clean(text):
    text = non_ascii(text)
    text = str.lower(text)
    text = clean_links(text)
    text = clean_html(text)
    text = punct(text)
    return text
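
As a quick sanity check of the helpers above, the following sketch chains the same three regular expressions in the order used later in the notebook (links, HTML, Twitter entities, lowercase). The sample strings and the name `clean_pipeline` are made up for illustration; they are not part of the collected data.

```python
import re

# Same patterns as the helpers above, written as raw strings.
LINK_RE = re.compile(r'http(s)?://\w+(\.\w+){1,}(/\w+)*')
HTML_RE = re.compile(r'<.*?>|&[A-Za-z0-9_]+')
ENTITY_RE = re.compile(r"(#[A-Za-z0-9]+)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")

def clean_pipeline(text):
    """Hypothetical wrapper: links -> HTML -> Twitter entities -> lowercase."""
    text = LINK_RE.sub('', str(text))
    text = HTML_RE.sub('', text)
    # Every match becomes a space; split/join then collapses repeated spaces.
    text = ' '.join(ENTITY_RE.sub(' ', text).split())
    return text.lower()

sample = "Finally caved &amp; got a <b>Switch</b>! @friend #gaming https://t.co/abc123"
print(clean_pipeline(sample))
# finally caved got a switch

# Non-ASCII letters are replaced by a space, which splits the word they belong to.
print(clean_pipeline("Jogo perfeito! gráficos"))
# jogo perfeito gr ficos
```

The second call shows a side effect worth keeping in mind: accented words in non-English text are broken into fragments rather than removed as a whole.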

## Tweets

We want to clean the text of each collected tweet, that is, remove special characters of any kind and keep only the words that are relevant to the analysis.

### Data import

First of all, we need to import the collected tweets.

In [3]:
# Define the names of the companies and of their products
entities = ['Nintendo', 'Playstation', 'Xbox', 'Engage', 'Forspoken', 'HFRush']

In [4]:
entities

['Nintendo', 'Playstation', 'Xbox', 'Engage', 'Forspoken', 'HFRush']

In [5]:
# Import tweets
tweets = {}
for entity in entities:
    path = f'Videojocs/tweets/{entity}.json'
    with open(path) as json_file:
        data = json.load(json_file)
        tweets[entity] = pd.DataFrame(data['data'])
        tweets[entity] = tweets[entity].loc[:, ['text']] # Keep only the text column, since it is the one we are interested in
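
The loader above relies on each JSON file having a top-level `data` list whose items carry a `text` field. The sketch below illustrates that layout with a made-up two-tweet payload and a hypothetical helper, `extract_texts`, that fails loudly when the layout is different. It uses only the standard library, so it stands apart from the pandas code in the cell.

```python
import json

# Hypothetical payload mimicking the layout the loader expects:
# a top-level "data" list whose items carry a "text" field.
raw = '{"data": [{"id": "1", "text": "Good guy Nintendo."}, {"id": "2", "text": "Nintendo Direct when?"}]}'

def extract_texts(payload):
    """Return the 'text' field of every tweet, raising if 'data' is missing."""
    if 'data' not in payload:
        raise KeyError("expected a top-level 'data' key")
    return [tweet['text'] for tweet in payload['data']]

texts = extract_texts(json.loads(raw))
print(texts)
# ['Good guy Nintendo.', 'Nintendo Direct when?']
```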

### Data cleaning

Our first task is to strip all unwanted characters and information from the text: URLs, non-ASCII characters, punctuation marks and even HTML components. We do this with the **clean_web** and **remove_twitter_entities** functions defined at the beginning of the notebook, and then convert the result to lowercase.

In [6]:
# Data cleaning
for entity in entities:
    tweets[entity]['cleanText'] = tweets[entity]['text'].apply(clean_web)
    tweets[entity]['cleanText'] = tweets[entity]['cleanText'].apply(remove_twitter_entities)
    tweets[entity]['cleanText'] = tweets[entity]['cleanText'].apply(str.lower)

### Tokenization

Once the text has been cleaned, we perform the process known as **text tokenization**. This consists of splitting the text into tokens so that the different topics discussed in it can be identified.

In [7]:
for entity in entities:
    tweets[entity]['tokens'] = tweets[entity]['cleanText'].apply(TweetTokenizer().tokenize)

### Stop words

After splitting the text into tokens, the next step is to remove the so-called **stop words**: words that carry little meaning of their own and mostly serve to link content words. In the same pass we also drop tokens that consist of a lone punctuation mark, a lone digit or a single character.

In [8]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords_vocabulary = stopwords.words('english')
for entity in entities:
    tweets[entity]['final'] = tweets[entity]['tokens'].apply(lambda x: [i for i in x if i.lower() not in stopwords_vocabulary])
    tweets[entity]['final'] = tweets[entity]['final'].apply(lambda x: [i for i in x if i.lower() not in list(string.punctuation)])
    tweets[entity]['final'] = tweets[entity]['final'].apply(lambda x: [i for i in x if i.lower() not in list(string.digits)])
    tweets[entity]['final'] = tweets[entity]['final'].apply(lambda x: [i for i in x if len(i) > 1]) # Remove single characters
    tweets[entity]['finalText'] = [' '.join(map(str, token)) for token in tweets[entity]['final']]
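
The four filters above can be read as a single rule per token. The sketch below applies that rule to one cleaned tweet from the data. To stay self-contained it splits on whitespace instead of using `TweetTokenizer`, and it uses a tiny hand-written stop-word set; both are assumptions made for illustration, not what the cell above does.

```python
import string

# Hypothetical, tiny stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {'a', 'i', 'its', 'will', 'more', 'myself', 'where', 'it'}
PUNCTUATION = set(string.punctuation)
DIGITS = set(string.digits)

def filter_tokens(tokens):
    """Drop stop words, lone punctuation marks, lone digits and 1-character tokens."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if lowered in STOP_WORDS or lowered in PUNCTUATION or lowered in DIGITS:
            continue
        if len(token) <= 1:
            continue
        kept.append(token)
    return kept

# Whitespace split used here as a stand-in tokenizer.
tokens = 'nintendo will pay its workers 10 more 11'.split()
print(filter_tokens(tokens))
# ['nintendo', 'pay', 'workers', '10', '11']
```

Because `string.punctuation` and `string.digits` only contain single characters, the length check alone would already remove those tokens. Storing the stop words in a `set` rather than a list also makes each membership test cheaper, which may matter on tens of thousands of tweets.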

### Stemming and lemmatization

Lastly, we want to capture the underlying meaning of every word, which we can approximate by keeping only its base form. This can be done in a simple way called **stemming** (reducing inflected, or sometimes derived, words to their stem, base or root form) or in a more elaborate way called **lemmatization** (grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma or dictionary form).

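The difference between stemming and lemmatization is that lemmatization maps each word to a meaningful dictionary form and can take its part of speech into account, whereas stemming only applies suffix-stripping rules, which often yields non-words and can blur distinct meanings. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so in the cells below verb forms such as "seeing" are left unchanged.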

In [9]:
# STEMMING
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
for entity in entities:
    tweets[entity]['stemmed'] = tweets[entity]['final'].apply(lambda x: [stemmer.stem(y) for y in x])
    tweets[entity]['stemmedText'] = [' '.join(map(str, token)) for token in tweets[entity]['stemmed']]

In [10]:
# LEMMATIZATION
#import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
wnl = WordNetLemmatizer()
for entity in entities:
    tweets[entity]['lemmatized'] = tweets[entity]['final'].apply(lambda x: [wnl.lemmatize(y) for y in x])
    tweets[entity]['lemmatizedText'] = [' '.join(map(str, token)) for token in tweets[entity]['lemmatized']]

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\barri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\barri\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [13]:
#print(tweets['Nintendo']['tokens'])
tweets['Nintendo']

Unnamed: 0,text,cleanText,finalText,stemmedText,lemmatizedText
0,Good guy Nintendo. Gotta give credit where it'...,good guy nintendo gotta give credit where it s...,good guy nintendo gotta give credit due everyo...,good guy nintendo gotta give credit due everyo...,good guy nintendo gotta give credit due everyo...
1,Finally caved &amp; got myself a Nintendo Swit...,finally caved got myself a nintendo switch i m...,finally caved got nintendo switch healing inne...,final cave got nintendo switch heal inner chil...,finally caved got nintendo switch healing inne...
2,PXN V3 Pro Gaming Racing Wheel Volante PC Stee...,pxn v3 pro gaming racing wheel volante pc stee...,pxn v3 pro gaming racing wheel volante pc stee...,pxn v3 pro game race wheel volant pc steer whe...,pxn v3 pro gaming racing wheel volante pc stee...
3,Nintendo Will Pay Its Workers 10% More https:/...,nintendo will pay its workers 10 more 11,nintendo pay workers 10 11,nintendo pay worker 10 11,nintendo pay worker 10 11
4,Nintendo seeing Tomodachi Life trending on twi...,nintendo seeing tomodachi life trending on twi...,nintendo seeing tomodachi life trending twitter,nintendo see tomodachi life trend twitter,nintendo seeing tomodachi life trending twitter
...,...,...,...,...,...
36831,#NintendoNetwork #NintendoNetworkDown\n https:...,nintendo network down 1 31 2023 why nintendo n...,nintendo network 31 2023 nintendo network work...,nintendo network 31 2023 nintendo network work...,nintendo network 31 2023 nintendo network work...
36832,Mega Man X2 Super Nintendo SNES Video Game Lot...,mega man x2 super nintendo snes video game lot...,mega man x2 super nintendo snes video game lot...,mega man x2 super nintendo snes video game lot...,mega man x2 super nintendo snes video game lot...
36833,February means we are definitely on official N...,february means we are definitely on official n...,february means definitely official nintendo di...,februari mean definit offici nintendo direct w...,february mean definitely official nintendo dir...
36834,Get pre order bonus for Kirby returns to dream...,get pre order bonus for kirby returns to dream...,get pre order bonus kirby returns dream land d...,get pre order bonus kirbi return dream land de...,get pre order bonus kirby return dream land de...


In [14]:
# Errors due to stemming
print('Text clean and tokenized: ',tweets['Nintendo'].finalText[36833])
print('Stemmed text: ',tweets['Nintendo'].stemmedText[36833])
print('Lemmatized text: ',tweets['Nintendo'].lemmatizedText[36833])

Text clean and tokenized:  february means definitely official nintendo direct watch 2019 february 13th 2020 march 26th 2021 february 17th 2022 february 9th nothing ever 100 guaranteed odds good
Stemmed text:  februari mean definit offici nintendo direct watch 2019 februari 13th 2020 march 26th 2021 februari 17th 2022 februari 9th noth ever 100 guarante odd good
Lemmatized text:  february mean definitely official nintendo direct watch 2019 february 13th 2020 march 26th 2021 february 17th 2022 february 9th nothing ever 100 guaranteed odds good
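
To make the comparison above easier to read, the following sketch lists exactly which tokens each method changed. The three strings are the first seven tokens of the row printed above; the helper name `changed_tokens` is hypothetical, and the approach assumes the texts stay aligned token by token, which holds here because both methods map one token to one token.

```python
def changed_tokens(before, after):
    """Map each token that differs between two aligned, space-separated strings."""
    pairs = zip(before.split(), after.split())
    return {old: new for old, new in pairs if old != new}

# Excerpts copied from the output above (first seven tokens of row 36833).
final_text = 'february means definitely official nintendo direct watch'
stemmed_text = 'februari mean definit offici nintendo direct watch'
lemmatized_text = 'february mean definitely official nintendo direct watch'

print(changed_tokens(final_text, stemmed_text))
# {'february': 'februari', 'means': 'mean', 'definitely': 'definit', 'official': 'offici'}
print(changed_tokens(final_text, lemmatized_text))
# {'means': 'mean'}
```

On this excerpt the stemmer rewrites four tokens, three of them into non-words, while the lemmatizer only changes "means".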


### Data storage

Once the data is clean, we store it for later use. We drop the columns that hold lists of tokens rather than a single string, since they can be recovered by splitting the corresponding text columns.

In [12]:
for entity in entities:
    tweets[entity] = tweets[entity].drop(['tokens', 'final', 'stemmed', 'lemmatized'], axis=1)

In [15]:
for entity in entities:
    path = f'Videojocs/cleanTweets/{entity}.csv'
    tweets[entity].to_csv(path, index=False)
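
Dropping the token-list columns loses nothing as long as no token contains whitespace, because the list can be rebuilt with `str.split`. The sketch below checks this round trip through an in-memory CSV using only the standard library; the single-row table is made up for illustration, with tokens taken from one of the tweets above.

```python
import csv
import io

tokens = ['nintendo', 'pay', 'workers', '10', '11']
final_text = ' '.join(tokens)

# Write a one-column CSV to memory instead of to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['finalText'])
writer.writerow([final_text])

# Read it back and rebuild the token list by splitting on whitespace.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
recovered = rows[0]['finalText'].split()
print(recovered == tokens)
# True
```

One caveat when reloading with pandas: a row whose text ended up empty is normally read back as a missing value rather than an empty string, so a `fillna('')` before splitting may be needed.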

## Reviews

We also want to clean the reviews obtained from the Metacritic website. The steps are the same as for the tweets, plus an additional language filter.

### Data import

First of all, we need to import the collected data.

In [16]:
# Define games and platforms
games = [
    {'title': 'fire-emblem-engage', 'platform': 'switch', 'name': 'feSwitch'},
    {'title': 'hi-fi-rush', 'platform': 'xbox-series-x', 'name': 'hfrushXbox'},
    {'title': 'forspoken', 'platform': 'playstation-5', 'name': 'forspokenPS5'},
    {'title': 'hi-fi-rush', 'platform': 'pc', 'name': 'hfrushPc'},
    {'title': 'forspoken', 'platform': 'pc', 'name': 'forspokenPc'}
]

In [17]:
# Import reviews
reviews = {}
reviewTypes = ['user', 'scored', 'unscored']
for tipo in reviewTypes:
    reviews[tipo] = {}
for game in games:
    reviews['user'][game["name"]] = pd.read_csv(f'data/userReviews/{game["title"]}_{game["platform"]}.csv')
    reviews['scored'][game["name"]] = pd.read_csv(f'data/criticReviews/scored/{game["title"]}_{game["platform"]}.csv')
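    # NOTE: this path points at the 'scored' folder, so 'unscored' loads the same files as 'scored'; check whether a separate folder was intended
    reviews['unscored'][game["name"]] = pd.read_csv(f'data/criticReviews/scored/{game["title"]}_{game["platform"]}.csv')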

In [18]:
reviews['user']['hfrushXbox']['text'].apply(clean_web)

0       Amazing Game, one of the best surprises ever! ...
1       It's highly recommended and one of the best vi...
2       Jogo perfeito! Sua jogabilidade, gráficos e hi...
3       The good:\rAn interesting mix of Sunset Overdr...
4       Başında oturdun mu saatlerce kalkamıyorsun. Yı...
                              ...                        
1419    Love it. Great gameplay. Incredibly immersive....
1420    Worst game, the worst game in my entre life pl...
1421    This game lives up to the hype. The world conc...
1422                      我歌唱火焰，在我的眼睛周围，他们永远不会害怕，就像敌人奔向太阳
1423    Amazing game! I loved the art style, all plent...
Name: text, Length: 1424, dtype: object

### Data cleaning

Our first task is to strip all unwanted characters and information from the text: URLs, non-ASCII characters, punctuation marks and even HTML components. We do this with the **clean_web** and **remove_twitter_entities** functions defined at the beginning of the notebook, and then convert the result to lowercase.

In [19]:
# Data cleaning
for game in games:
    for tipo in reviewTypes:
        reviews[tipo][game["name"]]['cleanText'] = reviews[tipo][game["name"]]['text'].apply(clean_web)
        reviews[tipo][game["name"]]['cleanText'] = reviews[tipo][game["name"]]['cleanText'].apply(remove_twitter_entities)
        reviews[tipo][game["name"]]['cleanText'] = reviews[tipo][game["name"]]['cleanText'].apply(str.lower)

### Language filtering

As we said in the previous notebook, we must filter out all the **user reviews** that are not written in English, since they would only add noise to our tests.

In [20]:
from langdetect import detect

for game in games:
    reviewsToDrop = []
    for index, review in reviews['user'][game["name"]].iterrows():
        if (re.search('[a-zA-Z]', review['cleanText'])):
            if (detect(review['cleanText']) != "en"):
                #print(detect(review['cleanText']))
                reviewsToDrop.append(index)
            #print(index)
        else: # There is no text to analyze
            reviewsToDrop.append(index)
    reviews['user'][game["name"]] = reviews['user'][game["name"]].drop(reviewsToDrop)
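
The filtering rule in the loop above can also be expressed as a boolean mask, which avoids collecting indices by hand. The sketch below is one possible shape for it, not the notebook's implementation: the helper name `english_mask` is hypothetical, and the language detector is passed in as a parameter so that the example runs without `langdetect` installed. `fake_detect` is a stand-in written only for this demonstration.

```python
import re

HAS_LETTER = re.compile('[a-zA-Z]')

def english_mask(texts, detect):
    """Return one boolean per text: True if the text should be kept as English."""
    mask = []
    for text in texts:
        if not HAS_LETTER.search(text):
            mask.append(False)          # there is no text to analyze
            continue
        try:
            mask.append(detect(text) == 'en')
        except Exception:               # the detector could not decide
            mask.append(False)
    return mask

# Hypothetical stand-in for langdetect.detect, used only to make the sketch runnable.
def fake_detect(text):
    return 'pt' if 'jogo' in text else 'en'

texts = ['amazing game one of the best surprises ever', 'jogo perfeito', '', '2023']
print(english_mask(texts, fake_detect))
# [True, False, False, False]
```

With the real detector, the mask could be used to index the DataFrame directly. The `langdetect` documentation also notes that detection is non-deterministic unless `DetectorFactory.seed` is set before the first call, so short reviews may be classified differently between runs.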

### Tokenization

Once the text has been cleaned and filtered, we perform the process known as **text tokenization**. This consists of splitting the text into tokens so that the different topics discussed in it can be identified.

In [21]:
for game in games:
    for tipo in reviewTypes:
        reviews[tipo][game["name"]]['tokens'] = reviews[tipo][game["name"]]['cleanText'].apply(TweetTokenizer().tokenize)

### Stop words

After splitting the text into tokens, the next step is to remove the so-called **stop words**: words that carry little meaning of their own and mostly serve to link content words. In the same pass we also drop tokens that consist of a lone punctuation mark, a lone digit or a single character.

In [22]:
stopwords_vocabulary = stopwords.words('english')
for game in games:
    for tipo in reviewTypes:
        reviews[tipo][game["name"]]['final'] = reviews[tipo][game["name"]]['tokens'].apply(lambda x: [i for i in x if i.lower() not in stopwords_vocabulary])
        reviews[tipo][game["name"]]['final'] = reviews[tipo][game["name"]]['final'].apply(lambda x: [i for i in x if i.lower() not in list(string.punctuation)])
        reviews[tipo][game["name"]]['final'] = reviews[tipo][game["name"]]['final'].apply(lambda x: [i for i in x if i.lower() not in list(string.digits)])
        reviews[tipo][game["name"]]['final'] = reviews[tipo][game["name"]]['final'].apply(lambda x: [i for i in x if len(i) > 1]) # Remove single characters
        reviews[tipo][game["name"]]['finalText'] = [' '.join(map(str, token)) for token in reviews[tipo][game["name"]]['final']]

### Stemming and lemmatization

Lastly, we want to capture the underlying meaning of every word, which we can approximate by keeping only its base form. This can be done in a simple way called **stemming** (reducing inflected, or sometimes derived, words to their stem, base or root form) or in a more elaborate way called **lemmatization** (grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma or dictionary form).

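The difference between stemming and lemmatization is that lemmatization maps each word to a meaningful dictionary form and can take its part of speech into account, whereas stemming only applies suffix-stripping rules, which often yields non-words and can blur distinct meanings. Since `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, some results are unexpected: in the output below, "us" is lemmatized to "u".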

In [23]:
# STEMMING
stemmer = SnowballStemmer('english')
for game in games:
    for tipo in reviewTypes:
        reviews[tipo][game["name"]]['stemmed'] = reviews[tipo][game["name"]]['final'].apply(lambda x: [stemmer.stem(y) for y in x])
        reviews[tipo][game["name"]]['stemmedText'] = [' '.join(map(str, token)) for token in reviews[tipo][game["name"]]['stemmed']]

In [24]:
# LEMMATIZATION
wnl = WordNetLemmatizer()
for game in games:
    for tipo in reviewTypes:
        reviews[tipo][game["name"]]['lemmatized'] = reviews[tipo][game["name"]]['final'].apply(lambda x: [wnl.lemmatize(y) for y in x])
        reviews[tipo][game["name"]]['lemmatizedText'] = [' '.join(map(str, token)) for token in reviews[tipo][game["name"]]['lemmatized']]

In [30]:
reviews['unscored']['hfrushXbox']

Unnamed: 0,source,link,date,grade,scoreType,text,cleanText,finalText,stemmedText,lemmatizedText
0,Hardcore Gamer,https://hardcoregamer.com/reviews/review-hi-fi...,"Feb 1, 2023",100,Positive,\n What likely ...,what likely started out as an xbox and or beth...,likely started xbox bethesda executive thinkin...,like start xbox bethesda execut think hey neat...,likely started xbox bethesda executive thinkin...
1,TheXboxHub,https://www.thexboxhub.com/hi-fi-rush-review/,"Jan 30, 2023",100,Positive,\n It combines ...,it combines showstopping unique gameplay with ...,combines showstopping unique gameplay best cel...,combin showstop uniqu gameplay best cel shade ...,combine showstopping unique gameplay best cel ...
2,IGN Japan,https://jp.ign.com/hi-fi-rush/65432/review/hi-...,"Jan 30, 2023",100,Positive,\n Hi-Fi Rush i...,hi fi rush is a rock themed masterpiece rhythm...,hi fi rush rock themed masterpiece rhythm acti...,hi fi rush rock theme masterpiec rhythm action...,hi fi rush rock themed masterpiece rhythm acti...
3,XboxAddict,https://www.xboxaddict.com/Staff-Review/14804/...,"Feb 21, 2023",96,Positive,\n I had a smil...,i had a smile on my face from beginning to end...,smile face beginning end playing hi fi rush co...,smile face begin end play hi fi rush color gra...,smile face beginning end playing hi fi rush co...
4,Generación Xbox,https://generacionxbox.com/analisis-de-hi-fi-r...,"Jan 30, 2023",96,Positive,\n There is not...,there is nothing as addictive fun and well bui...,nothing addictive fun well built hi fi rush xb...,noth addict fun well built hi fi rush xbox cat...,nothing addictive fun well built hi fi rush xb...
5,Gaming Nexus,https://www.gamingnexus.com/Article/10785/Hi-F...,"Feb 24, 2023",95,Positive,\n Hi-Fi Rush i...,hi fi rush is a light hearted game about comra...,hi fi rush light hearted game comradery taking...,hi fi rush light heart game comraderi take cor...,hi fi rush light hearted game comradery taking...
6,CGMagazine,https://www.cgmagonline.com/review/game/hi-fi-...,"Feb 7, 2023",95,Positive,\n Hi-Fi RUSH i...,hi fi rush is a highly enjoyable action advent...,hi fi rush highly enjoyable action adventure f...,hi fi rush high enjoy action adventur fuse sol...,hi fi rush highly enjoyable action adventure f...
7,God is a Geek,https://www.godisageek.com/reviews/hi-fi-rush-...,"Jan 30, 2023",95,Positive,\n Hi-Fi Rush i...,hi fi rush is a thrilling rhythm game that ooz...,hi fi rush thrilling rhythm game oozes style s...,hi fi rush thrill rhythm game ooz style substa...,hi fi rush thrilling rhythm game ooze style su...
8,MGG,https://www.millenium.org/test/399588.html,"Jan 29, 2023",95,Positive,\n It's been a ...,it s been a long time since a game has conquer...,long time since game conquered us like hi fi r...,long time sinc game conquer us like hi fi rush...,long time since game conquered u like hi fi ru...
9,Forbes,https://www.forbes.com/sites/paultassi/2023/01...,"Jan 28, 2023",95,Positive,\n Hi-Fi Rush i...,hi fi rush is going to stay with me i absolute...,hi fi rush going stay absolutely adore game fa...,hi fi rush go stay absolut ador game far far w...,hi fi rush going stay absolutely adore game fa...


### Data storage

Once we have cleaned the data, we will store it in order to use it in the future.

In [26]:
for game in games:
    for tipo in reviewTypes:
        reviews[tipo][game["name"]] = reviews[tipo][game["name"]].drop(['tokens', 'final', 'stemmed', 'lemmatized'], axis=1)

In [28]:
for game in games:
    for tipo in reviewTypes:
        path = f'Videojocs/cleanReviews/{tipo}_{game["name"]}.csv'
        reviews[tipo][game["name"]].to_csv(path, index=False)

## Next notebook

We have stripped the noise and unnecessary information from all the data we want to analyze. However, we still don't know which main topics these texts discuss. To find out, we will perform [topic modelling](./05TopicModelling.ipynb) in the following notebook.