---
# Data Cleaning
---

---

In [9]:
import pandas as pd
import numpy as np

import langid

### Read data file

In [2]:
df = pd.read_csv('../data/games_comments.csv')
df

Unnamed: 0,username,rating,comment,max_players,minplaytime,maxplaytime,age,ratings_avg,count_wanting,count_wishing
0,causticforever,,Played prototype- will be an enjoyable way to ...,5,40,100,12,3.5,44,460
1,Corwin007,,UPCOMING\n\nArk Nova lite?,5,40,100,12,3.5,44,460
2,IronTarkles,,New game from ark nova designer,5,40,100,12,3.5,44,460
3,MarkyX,,I'm very interested in this one. I like the co...,5,40,100,12,3.5,44,460
4,mikamikomi,1.0,3 artist yet still use stock photos? oh yeah,5,40,100,12,3.5,44,460
...,...,...,...,...,...,...,...,...,...,...
4165,Aenelruun,8.5,LM,4,40,80,10,8.00569,1422,21393
4166,Aeremia,6.0,Cute little game with amazing artwork and nice...,4,40,80,10,8.00569,1422,21393
4167,Aevey,9.0,"Only had a few playthroughs so far, but very e...",4,40,80,10,8.00569,1422,21393
4168,afafard,7.0,Fairly simple and quick engine builder.\nHas a...,4,40,80,10,8.00569,1422,21393


### Looking for missing values

In [3]:
df.isna().sum()

username            0
rating           1338
comment             7
max_players         0
minplaytime         0
maxplaytime         0
age                 0
ratings_avg         0
count_wanting       0
count_wishing       0
dtype: int64
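
The raw counts above are easier to judge next to the share of rows they represent. A minimal sketch, assuming a small made-up frame (`toy`) in place of the real data; the column names mirror the dataset but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for games_comments.csv (values are invented).
toy = pd.DataFrame({
    "username": ["a", "b", "c", "d"],
    "rating": [np.nan, 7.0, np.nan, 9.0],
    "comment": ["fun", None, "ok", "great"],
})

# Count and percentage of missing values per column, side by side.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(missing)
```

`isna().mean()` works because booleans average to the fraction of `True` values.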

### Removing rows without comments

In [4]:
df = df.dropna(subset=['comment'])

In [5]:
df.isna().sum()

username            0
rating           1336
comment             0
max_players         0
minplaytime         0
maxplaytime         0
age                 0
ratings_avg         0
count_wanting       0
count_wishing       0
dtype: int64
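
Note that `dropna` only removes true missing values; empty or whitespace-only strings survive it. A minimal sketch of how such rows could also be filtered out, assuming a made-up frame (`toy`) and that dropping blank comments is actually desired (the notebook itself keeps them and labels them `'unknown'` later):

```python
import pandas as pd

# Hypothetical toy frame: one real comment, one NaN, two blank strings.
toy = pd.DataFrame({"comment": ["Great game", None, "   ", ""]})

# Step 1: drop rows where the comment is missing (same call as in the notebook).
no_nan = toy.dropna(subset=["comment"])

# Step 2 (optional, an assumption): also drop comments that are empty after stripping.
is_blank = no_nan["comment"].str.strip().eq("")
non_blank = no_nan[~is_blank].copy()

print(len(toy), len(no_nan), len(non_blank))
```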

### Creating a language column

In [11]:
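# Apply language detection: langid.classify() returns a (language_code, score) tuple,
# so [0] keeps only the code. Empty or whitespace-only comments are labelled 'unknown'.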
df['lang'] = df['comment'].apply(lambda x: langid.classify(x)[0] if x and len(x.strip()) > 0 else 'unknown')

# Preview the rows detected as English (not assigned, so df still holds every language)
df[df['lang'] == 'en']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['lang'] = df['comment'].apply(lambda x: langid.classify(x)[0] if x and len(x.strip()) > 0 else 'unknown')
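
The warning above appears because pandas cannot tell whether `df` is an independent frame or derived from an earlier one. A common way to avoid it is to take an explicit `.copy()` when filtering, before adding columns. A minimal sketch on a made-up frame (`toy`, `clean`, and `n_chars` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame with one missing comment.
toy = pd.DataFrame({
    "comment": ["Nice", None, "Bien"],
    "rating": [8.0, 6.0, None],
})

# .copy() makes the filtered result an independent DataFrame,
# so adding a column afterwards is unambiguous.
clean = toy.dropna(subset=["comment"]).copy()
clean["n_chars"] = clean["comment"].str.len()

print(clean)
```

Applied to this notebook, that would mean writing `df = df.dropna(subset=['comment']).copy()` in the cell that removes rows without comments.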


Unnamed: 0,username,rating,comment,max_players,minplaytime,maxplaytime,age,ratings_avg,count_wanting,count_wishing,lang
0,causticforever,,Played prototype- will be an enjoyable way to ...,5,40,100,12,3.5,44,460,en
1,Corwin007,,UPCOMING\n\nArk Nova lite?,5,40,100,12,3.5,44,460,en
2,IronTarkles,,New game from ark nova designer,5,40,100,12,3.5,44,460,en
3,MarkyX,,I'm very interested in this one. I like the co...,5,40,100,12,3.5,44,460,en
4,mikamikomi,1.0,3 artist yet still use stock photos? oh yeah,5,40,100,12,3.5,44,460,en
...,...,...,...,...,...,...,...,...,...,...,...
4165,Aenelruun,8.5,LM,4,40,80,10,8.00569,1422,21393,en
4166,Aeremia,6.0,Cute little game with amazing artwork and nice...,4,40,80,10,8.00569,1422,21393,en
4167,Aevey,9.0,"Only had a few playthroughs so far, but very e...",4,40,80,10,8.00569,1422,21393,en
4168,afafard,7.0,Fairly simple and quick engine builder.\nHas a...,4,40,80,10,8.00569,1422,21393,en
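
Since the filter above is only displayed, it can help to check how the detected languages are distributed and, if only English comments are wanted downstream, to keep the filtered frame explicitly. A minimal sketch, assuming a made-up frame with a pre-filled `lang` column so that it runs without `langid`:

```python
import pandas as pd

# Hypothetical toy frame; the 'lang' values are invented, not langid output.
toy = pd.DataFrame({
    "comment": ["Great game", "Fun with friends", "Tolles Spiel", "   "],
    "lang": ["en", "en", "de", "unknown"],
})

# How many comments fall into each detected language?
lang_counts = toy["lang"].value_counts()
print(lang_counts)

# Keep only English comments as an independent frame (an optional step).
english_only = toy[toy["lang"] == "en"].copy()
print(len(english_only))
```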


---
### Write cleaned data file
---

In [ ]:
df.to_csv('../data/games_comments_clean.csv', index=False)
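
A quick round trip can confirm that the file was written as intended. A minimal sketch, assuming a made-up frame and a temporary directory instead of the real `../data/` path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data.
toy = pd.DataFrame({
    "username": ["a", "b"],
    "comment": ["fun", "ok"],
    "lang": ["en", "en"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")
    # index=False keeps the row index out of the file, so reading it back
    # does not produce an extra 'Unnamed: 0' column.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```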