---
# Data Cleaning
---

In this notebook we clean the data file [games_comments.csv](../data/games_comments.csv).

**Key cleaning steps:**
  - Removing rows where comments are missing.
  - Creating a new column with the detected language of each comment.

The results will be stored in the data file [games_comments_clean.csv](../data/games_comments_clean.csv).

---

### Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np

import langid

### Reading the data file

In [2]:
df = pd.read_csv('../data/games_comments.csv')
df

Unnamed: 0,username,rating,comment,gamename,max_players,minplaytime,maxplaytime,age,ratings_avg,count_wanting,count_wishing,description,categories
0,causticforever,,Played prototype- will be an enjoyable way to ...,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
1,Corwin007,,UPCOMING\n\nArk Nova lite?,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
2,IronTarkles,,New game from ark nova designer,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
3,MarkyX,,I'm very interested in this one. I like the co...,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
4,mikamikomi,1.0,3 artist yet still use stock photos? oh yeah,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4117,Aenelruun,8.5,LM,Everdell,4,40,80,10,8.00569,1422,21393,"Within the charming valley of Everdell, beneat...","Animals,Card Game,City Building,Fantasy"
4118,Aeremia,6.0,Cute little game with amazing artwork and nice...,Everdell,4,40,80,10,8.00569,1422,21393,"Within the charming valley of Everdell, beneat...","Animals,Card Game,City Building,Fantasy"
4119,Aevey,9.0,"Only had a few playthroughs so far, but very e...",Everdell,4,40,80,10,8.00569,1422,21393,"Within the charming valley of Everdell, beneat...","Animals,Card Game,City Building,Fantasy"
4120,afafard,7.0,Fairly simple and quick engine builder.\nHas a...,Everdell,4,40,80,10,8.00569,1422,21393,"Within the charming valley of Everdell, beneat...","Animals,Card Game,City Building,Fantasy"


### Looking for missing values

In [3]:
df.isna().sum()

username            0
rating           1338
comment             7
gamename            0
max_players         0
minplaytime         0
maxplaytime         0
age                 0
ratings_avg         0
count_wanting       0
count_wishing       0
description         0
categories          0
dtype: int64
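
The counts above can also be read as shares of the table, which makes it easier to judge how much data a cleaning step would remove. A minimal sketch on a small invented frame (the values below are illustrative and not taken from `games_comments.csv`):

```python
import pandas as pd

# Toy data standing in for the real file (values are invented for illustration).
toy = pd.DataFrame({
    'username': ['a', 'b', 'c', 'd'],
    'rating': [8.0, None, None, 6.5],
    'comment': ['Great game', None, 'Too long', 'Fun'],
})

# Count and share of missing values per column, worst columns first.
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'share': toy.isna().mean().round(2),
}).sort_values('n_missing', ascending=False)

print(missing)
```

On the real data, the same two calls would be applied to `df` instead of `toy`.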

### Removing rows without comments

In [4]:
df = df.dropna(subset=['comment'])

In [5]:
df.isna().sum()

username            0
rating           1336
comment             0
gamename            0
max_players         0
minplaytime         0
maxplaytime         0
age                 0
ratings_avg         0
count_wanting       0
count_wishing       0
description         0
categories          0
dtype: int64
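
`dropna` only removes rows whose comment is NaN; a comment made up solely of spaces would survive it, which is why the language-detection cell checks `x.strip()` before classifying. If such rows should be dropped as well, one possible approach, sketched here on invented data and not part of the original cleaning, is:

```python
import pandas as pd

# Invented rows: one real NaN and one whitespace-only comment.
toy = pd.DataFrame({
    'username': ['a', 'b', 'c', 'd'],
    'comment': ['Great game', None, '   ', 'Fun\nfiller'],
})

# Flag whitespace-only comments, then drop them together with real NaNs.
# .copy() makes the result an independent DataFrame, so adding columns
# later does not trigger SettingWithCopyWarning.
blank = toy['comment'].str.strip().eq('')
clean = toy[~blank].dropna(subset=['comment']).copy()

print(len(toy), '->', len(clean))
```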

### Creating a language column

In [6]:
# Detect each comment's language ('unknown' for blank text). assign() returns a
# new DataFrame, which avoids the SettingWithCopyWarning raised by df['lang'] = ...
df = df.assign(lang=df['comment'].apply(
    lambda x: langid.classify(x)[0] if x and len(x.strip()) > 0 else 'unknown'))

df[df['lang'] == 'en']


Unnamed: 0,username,rating,comment,gamename,max_players,minplaytime,maxplaytime,age,ratings_avg,count_wanting,count_wishing,description,categories,lang
0,causticforever,,Played prototype- will be an enjoyable way to ...,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
1,Corwin007,,UPCOMING\n\nArk Nova lite?,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
2,IronTarkles,,New game from ark nova designer,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
3,MarkyX,,I'm very interested in this one. I like the co...,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
4,mikamikomi,1.0,3 artist yet still use stock photos? oh yeah,Sanctuary,5,40,100,12,3.50000,44,460,"In Sanctuary, you will plan and design a moder...","Animals,Environmental,Territory Building",en
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4117,Aenelruun,8.5,LM,Everdell,4,40,80,10,8.00569,1422,21393,"Within the charming valley of Everdell, beneat...","Animals,Card Game,City Building,Fantasy",en
4118,Aeremia,6.0,Cute little game with amazing artwork and nice...,Everdell,4,40,80,10,8.00569,1422,21393,"Within the charming valley of Everdell, beneat...","Animals,Card Game,City Building,Fantasy",en
4119,Aevey,9.0,"Only had a few playthroughs so far, but very e...",Everdell,4,40,80,10,8.00569,1422,21393,"Within the charming valley of Everdell, beneat...","Animals,Card Game,City Building,Fantasy",en
4120,afafard,7.0,Fairly simple and quick engine builder.\nHas a...,Everdell,4,40,80,10,8.00569,1422,21393,"Within the charming valley of Everdell, beneat...","Animals,Card Game,City Building,Fantasy",en
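
Very short comments give the classifier little to work with; in the output above, the two-letter comment `LM` is labelled `en`. One possible refinement, offered as an assumption and not as part of the original cleaning, is to label comments below a minimum length as `unknown`. The helper name `detect_language`, the `min_chars` threshold, and the stand-in `fake_classify` are all hypothetical; in the notebook the classifier passed in would be `langid.classify`, which returns a `(language, score)` tuple.

```python
import pandas as pd

def detect_language(text, classify, min_chars=3):
    """Return a language code, or 'unknown' for missing, blank or very short text."""
    if not isinstance(text, str) or len(text.strip()) < min_chars:
        return 'unknown'
    return classify(text)[0]  # classify returns (language, score)

def fake_classify(text):
    """Stand-in for langid.classify so the sketch runs without langid installed."""
    return ('de', -10.0) if 'Spiel' in text else ('en', -10.0)

# Invented comments, including a very short one, a blank one and a missing one.
toy = pd.Series(['Cute little game', 'LM', '  ', 'Tolles Spiel', None])

# Extra keyword arguments to Series.apply are forwarded to the function.
langs = toy.apply(detect_language, classify=fake_classify)
print(langs.tolist())
```

With the real classifier, the call would read `df['comment'].apply(detect_language, classify=langid.classify)`. The threshold of three characters is arbitrary and would need checking against the actual comments.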


In [7]:
df['lang'].unique()

array(['en', 'de', 'es', 'nl', 'fr', 'da', 'pl', 'gl', 'hu', 'ko', 'it',
       'oc', 'zh', 'no', 'et', 'sv', 'ja', 'ru', 'ro', 'fi', 'sk', 'sl',
       'pt', 'mt', 'ca', 'lt', 'rw', 'uk', 'is', 'eo', 'am', 'cs', 'th',
       'sr', 'id', 'be', 'ga', 'mg', 'tl', 'mn', 'mk', 'eu', 'ms', 'nb',
       'tr', 'la', 'br'], dtype=object)
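
`unique()` lists the 47 detected codes but not how often each occurs. A frequency table makes it easier to judge whether the rarer codes are genuine or likely misclassifications. A minimal sketch on invented labels (on the real data, `lang` would be `df['lang']`):

```python
import pandas as pd

# Invented labels standing in for df['lang'].
lang = pd.Series(['en', 'en', 'en', 'de', 'en', 'la', 'de', 'en'])

counts = lang.value_counts()                 # comments per language, most frequent first
share = lang.value_counts(normalize=True)    # same, as a fraction of all comments

# Languages seen only once are candidates for manual inspection.
rare = counts[counts == 1].index.tolist()

print(counts)
print(rare)
```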

---
### Writing the cleaned data file
---

In [8]:
df.to_csv('../data/games_comments_clean.csv', index=False)
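
A quick read-back can confirm that the written file contains what is expected. The sketch below uses an in-memory buffer and invented rows instead of `../data/games_comments_clean.csv`, so it runs without the data files:

```python
import io
import pandas as pd

# Invented rows standing in for the cleaned DataFrame.
clean = pd.DataFrame({
    'username': ['a', 'b'],
    'comment': ['Great game', 'Fun\nbut long'],
    'lang': ['en', 'en'],
})

buffer = io.StringIO()
clean.to_csv(buffer, index=False)  # index=False: do not write the row index as a column
buffer.seek(0)

# Read the data back and check that nothing was lost on the way.
reloaded = pd.read_csv(buffer)
assert reloaded['comment'].notna().all()
assert list(reloaded.columns) == list(clean.columns)
print(reloaded.shape)
```

For the real file, the buffer would be replaced by the path passed to `to_csv` above.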