---
# Data Cleaning
---

In this notebook we clean the data file [File](../data/games_comments.csv).

**Key cleaning steps:**
  - Removing rows where comments are missing.
  - Creating a new column with the detected language of each comment.

The results will be stored in the data file [Cleaned File](../data/games_comments_clean.csv).

---

### Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np

import langid

### Read data file

In [2]:
df = pd.read_csv('../data/games_comments.csv')
df

Unnamed: 0,username,rating,comment,gamename,mechanics,max_players,minplaytime,maxplaytime,age,ratings_avg,count_wanting,count_wishing,description,categories
0,causticforever,,Played prototype- will be an enjoyable way to ...,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
1,Corwin007,,UPCOMING\n\nArk Nova lite?,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
2,IronTarkles,,New game from ark nova designer,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
3,MarkyX,,I'm very interested in this one. I like the co...,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
4,mikamikomi,1.0,3 artist yet still use stock photos? oh yeah,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4199,chicagometh,4.8,"4...'Not so good, but could play again' by BGG...",Civolution,"Area Movement,Dice Rolling,Events,Hand Managem...",4,90,180,14,8.21282,580,4089,"Hello, student beings! The cosmic faculty of t...","Civilization,Dice,Economic,Exploration,Science..."
4200,Chris Coyote,,Birthday 2025,Civolution,"Area Movement,Dice Rolling,Events,Hand Managem...",4,90,180,14,8.21282,580,4089,"Hello, student beings! The cosmic faculty of t...","Civilization,Dice,Economic,Exploration,Science..."
4201,Chris_P85,9.0,Played a half Game at Spiel 24,Civolution,"Area Movement,Dice Rolling,Events,Hand Managem...",4,90,180,14,8.21282,580,4089,"Hello, student beings! The cosmic faculty of t...","Civilization,Dice,Economic,Exploration,Science..."
4202,Chutch1035,5.0,I love Feld games and I love Civ games. When I...,Civolution,"Area Movement,Dice Rolling,Events,Hand Managem...",4,90,180,14,8.21282,580,4089,"Hello, student beings! The cosmic faculty of t...","Civilization,Dice,Economic,Exploration,Science..."


### Looking for missing values

In [3]:
df.isna().sum()

username            0
rating           1357
comment             6
gamename            0
mechanics           0
max_players         0
minplaytime         0
maxplaytime         0
age                 0
ratings_avg         0
count_wanting       0
count_wishing       0
description         0
categories          0
dtype: int64
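
Before dropping anything, it can help to look at the affected rows and at the share of missing values per column. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "username": ["a", "b", "c", "d"],
    "rating": [7.0, None, None, 9.0],
    "comment": ["Great game", None, "Too long", "Fun"],
})

# Rows whose comment is missing
missing_comments = toy[toy["comment"].isna()]

# Fraction of missing values per column (0.0 means nothing is missing)
missing_share = toy.isna().mean()

print(missing_comments)
print(missing_share)
```

On the real data the same two expressions can be applied to `df` directly.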

### Removing rows without comments

In [4]:
df = df.dropna(subset=['comment'])

In [5]:
df.isna().sum()

username            0
rating           1355
comment             0
gamename            0
mechanics           0
max_players         0
minplaytime         0
maxplaytime         0
age                 0
ratings_avg         0
count_wanting       0
count_wishing       0
description         0
categories          0
dtype: int64
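
The missing `rating` count falls from 1357 to 1355 because two of the six removed rows also had no rating; rows that lack only a rating are kept. A small sketch of that behaviour, using a made-up frame (`toy` and `cleaned` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame: one row lacks both rating and comment, one lacks only the rating
toy = pd.DataFrame({
    "rating": [7.0, None, None, 9.0],
    "comment": ["Great game", None, "Too long", "Fun"],
})

# Drop only rows whose comment is missing; .copy() makes the result an independent frame
cleaned = toy.dropna(subset=["comment"]).copy()

rows_removed = len(toy) - len(cleaned)
remaining_missing_ratings = int(cleaned["rating"].isna().sum())

print(rows_removed, remaining_missing_ratings)
```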

### Creating a language column

In [6]:
# Detect the language of each comment; blank or non-string values are labelled 'unknown'.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df = df.assign(
    lang=df['comment'].apply(
        lambda x: langid.classify(x)[0] if isinstance(x, str) and x.strip() else 'unknown'
    )
)

df[df['lang'] == 'en']


Unnamed: 0,username,rating,comment,gamename,mechanics,max_players,minplaytime,maxplaytime,age,ratings_avg,count_wanting,count_wishing,description,categories,lang
0,causticforever,,Played prototype- will be an enjoyable way to ...,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
1,Corwin007,,UPCOMING\n\nArk Nova lite?,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
2,IronTarkles,,New game from ark nova designer,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
3,MarkyX,,I'm very interested in this one. I like the co...,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
4,mikamikomi,1.0,3 artist yet still use stock photos? oh yeah,Sanctuary,"Action Queue,Hand Management,Hexagon Grid,Open...",5,40,100,12,7.00000,55,569,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4199,chicagometh,4.8,"4...'Not so good, but could play again' by BGG...",Civolution,"Area Movement,Dice Rolling,Events,Hand Managem...",4,90,180,14,8.21282,580,4089,"Hello, student beings! The cosmic faculty of t...","Civilization,Dice,Economic,Exploration,Science...",en
4200,Chris Coyote,,Birthday 2025,Civolution,"Area Movement,Dice Rolling,Events,Hand Managem...",4,90,180,14,8.21282,580,4089,"Hello, student beings! The cosmic faculty of t...","Civilization,Dice,Economic,Exploration,Science...",en
4201,Chris_P85,9.0,Played a half Game at Spiel 24,Civolution,"Area Movement,Dice Rolling,Events,Hand Managem...",4,90,180,14,8.21282,580,4089,"Hello, student beings! The cosmic faculty of t...","Civilization,Dice,Economic,Exploration,Science...",en
4202,Chutch1035,5.0,I love Feld games and I love Civ games. When I...,Civolution,"Area Movement,Dice Rolling,Events,Hand Managem...",4,90,180,14,8.21282,580,4089,"Hello, student beings! The cosmic faculty of t...","Civilization,Dice,Economic,Exploration,Science...",en
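
The detection step can also be written as a small helper that tolerates missing or blank comments. This is a sketch, not the notebook's own code: `detect_language` and `fake_classify` are hypothetical names, and the classifier is passed in as an argument so the example runs without `langid` installed. In the notebook, `langid.classify` (which returns a `(language, score)` pair) could be passed instead.

```python
import pandas as pd


def detect_language(text, classify):
    """Return a language code for `text`, or 'unknown' for missing or blank input.

    `classify` is any callable that returns a (language, score) pair.
    """
    if not isinstance(text, str) or not text.strip():
        return "unknown"
    return classify(text)[0]


# Hypothetical stand-in classifier, used only to keep the sketch self-contained
def fake_classify(text):
    return ("de", -1.0) if "Spiel" in text else ("en", -1.0)


toy = pd.DataFrame({"comment": ["Great game", "   ", None, "Tolles Spiel"]})

# assign() returns a new frame instead of modifying `toy` in place
toy = toy.assign(lang=toy["comment"].apply(lambda t: detect_language(t, fake_classify)))

print(toy)
```

The `isinstance` check matters because a missing value is a float `NaN`, which has no `.strip()` method.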


In [7]:
df['lang'].unique()

array(['en', 'de', 'es', 'nl', 'fr', 'da', 'pl', 'gl', 'hu', 'it', 'oc',
       'zh', 'no', 'rw', 'ko', 'ja', 'ru', 'ro', 'et', 'sv', 'fi', 'sk',
       'sl', 'mt', 'ca', 'lt', 'pt', 'uk', 'is', 'eo', 'am', 'cs', 'lv',
       'bs', 'mg', 'th', 'id', 'be', 'sr', 'ga', 'mn', 'tl', 'tr', 'eu',
       'nb', 'mk', 'la', 'br', 'ms'], dtype=object)
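
The list contains codes such as `rw`, `mg` and `la`, which may be misclassifications of very short comments rather than genuine languages. Counting comments per language is one way to spot such cases. A minimal sketch on made-up data; the threshold `min_count` is an arbitrary assumption:

```python
import pandas as pd

# Hypothetical toy column standing in for df['lang']
toy = pd.DataFrame({"lang": ["en", "en", "en", "de", "de", "rw"]})

# Number of comments per detected language, most frequent first
lang_counts = toy["lang"].value_counts()

# Languages seen fewer than `min_count` times are candidates for manual review
min_count = 2
rare_langs = lang_counts[lang_counts < min_count].index.tolist()

print(lang_counts)
print(rare_langs)
```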

---
### Write cleaned data file
---

In [8]:
df.to_csv('../data/games_comments_clean.csv', index=False)
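
The `index=False` argument keeps the row index out of the file. A short sketch of the difference when the file is read back, using an in-memory buffer and made-up data instead of the real path:

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({"comment": ["Great game", "Tolles Spiel"], "lang": ["en", "de"]})

# With index=False, only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# With the default index=True, the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(without_index.columns.tolist())
print(with_index.columns.tolist())
```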