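Introduction: This dataset lists video games released between 1980 and 2023, with information such as release date, user review ratings, and critic review ratings.
Variables:
- Title: title of the game
- Release Date: release date of the game's first version
- Team: development team of the game
- Rating: average rating
- Times Listed: number of users who have listed the game
- Number of Reviews: number of reviews submitted by users
- Genres: all genres the game belongs to
- Summary: summary/overview provided by the team
- Reviews: user reviews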

# 1. Import the dataset

In [1]:
import pandas as pd 

In [2]:
original_data = pd.read_csv("./games.csv", index_col = 0)

In [3]:
original_data.head()

Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"['Adventure', 'RPG']","Elden Ring is a fantasy, action and open world...","[""The first playthrough of elden ring is one o...",17K,3.8K,4.6K,4.8K
1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2.9K,2.9K,"['Adventure', 'Brawler', 'Indie', 'RPG']",A rogue-lite hack and slash dungeon crawler in...,['convinced this is a roguelike for people who...,21K,3.2K,6.3K,3.6K
2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4.3K,4.3K,"['Adventure', 'RPG']",The Legend of Zelda: Breath of the Wild is the...,['This game is the game (that is not CS:GO) th...,30K,2.5K,5K,2.6K
3,Undertale,"Sep 15, 2015","['tobyfox', '8-4']",4.2,3.5K,3.5K,"['Adventure', 'Indie', 'RPG', 'Turn Based Stra...","A small child falls into the Underground, wher...",['soundtrack is tied for #1 with nier automata...,28K,679,4.9K,1.8K
4,Hollow Knight,"Feb 24, 2017",['Team Cherry'],4.4,3K,3K,"['Adventure', 'Indie', 'Platform']",A 2D metroidvania with an emphasis on close co...,"[""this games worldbuilding is incredible, with...",21K,2.4K,8.3K,2.3K


# 2. Assess the data

## 2.1 Assess the data structure

In [4]:
original_data.sample(5)

Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
695,Need for Speed: Heat,"Nov 08, 2019","['Electronic Arts', 'Ghost Games']",3.1,180,180,"['Adventure', 'Racing', 'Sport']",Hustle by day and risk it all at night in Need...,"[""the story was really short but i think for a...",1.6K,100,455,94
290,Goat Simulator,"Apr 01, 2014",['Coffee Stain Studios'],2.4,337,337,"['Adventure', 'Indie', 'Simulator']",Goat Simulator is a third-person perspective g...,"['XD', 'não tenho o menor interesse em jogar',...",8.4K,22,492,125
1005,Pokémon HeartGold,"Sep 12, 2009","['The Pokémon Company', 'Game Freak']",4.2,892,892,"['Adventure', 'RPG', 'Turn Based Strategy']",Pokémon HeartGold Version and Pokémon SoulSilv...,"['o meu joguinho do coração', 'The best pokémo...",9.5K,128,749,331
1088,Star Wars Battlefront II,"Nov 17, 2017","['EA Digital Illusions CE', 'Electronic Arts']",3.1,534,534,"['Adventure', 'Shooter']",Embark on an endless Star Wars action experien...,['i forgot i played this game until i saw a fr...,7.8K,128,826,140
1064,Roblox,"Sep 01, 2006",['Roblox Corporation'],2.8,187,187,"['Adventure', 'Platform', 'Simulator']",Roblox is a massively multiplayer online platf...,"['veletlerin favori oyunu', 'tem suas perolas ...",5.1K,213,102,14


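Each row of the dataset is one observation and each column is one variable. However, every cell in the `Team`, `Genres`, and `Reviews` columns holds a list with multiple values, so these **columns need to be split** in later processing.

## 2.2 Assess data cleanliness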

In [5]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1512 entries, 0 to 1511
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Title              1512 non-null   object 
 1   Release Date       1512 non-null   object 
 2   Team               1511 non-null   object 
 3   Rating             1499 non-null   float64
 4   Times Listed       1512 non-null   object 
 5   Number of Reviews  1512 non-null   object 
 6   Genres             1512 non-null   object 
 7   Summary            1511 non-null   object 
 8   Reviews            1512 non-null   object 
 9   Plays              1512 non-null   object 
 10  Playing            1512 non-null   object 
 11  Backlogs           1512 non-null   object 
 12  Wishlist           1512 non-null   object 
dtypes: float64(1), object(12)
memory usage: 165.4+ KB


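`Team`, `Rating`, and `Summary` contain missing values; `Release Date` should be a date variable; `Times Listed`, `Number of Reviews`, `Plays`, `Playing`, `Backlogs`, and `Wishlist` should be integer variables.

### 2.2.1 Assess missing data

First, count the missing values in each column.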
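As a minimal sketch of the date conversion mentioned above, `pd.to_datetime` can parse strings in the `"Feb 25, 2022"` style seen in the preview. The sample values and the choice of `errors="coerce"` (which turns unparseable entries such as `"releases on TBD"` into `NaT`) are assumptions for illustration, not steps taken in this notebook.

```python
import pandas as pd

# Hypothetical sample in the same style as the Release Date column
dates = pd.Series(["Feb 25, 2022", "Dec 10, 2019", "releases on TBD"])

# Parse "Mon DD, YYYY"; values that do not match the format become NaT
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")
print(parsed)
```

Coercing instead of raising keeps the placeholder rows visible as missing dates, so they can be reviewed together with the other missing values.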

In [6]:
original_data.isnull().sum()

Title                 0
Release Date          0
Team                  1
Rating               13
Times Listed          0
Number of Reviews     0
Genres                0
Summary               1
Reviews               0
Plays                 0
Playing               0
Backlogs              0
Wishlist              0
dtype: int64

#### Missing values in the Team column

In [7]:
original_data[original_data["Team"].isnull()]

Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
1245,NEET Girl Date Night,"Oct 21, 2022",,2.7,21,21,['Visual Novel'],Your friend sets you up on a date with his NEE...,"['this sucked. ""Omg she is literally me"" is no...",106,1,44,42


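If the team could be found, it could be filled in manually. No matching team turned up in a search, so consider dropping this observation.

#### Missing values in the Rating column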

In [8]:
original_data[original_data["Rating"].isnull()]

Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
587,Final Fantasy XVI,"Jun 22, 2023","['Square Enix', 'Square Enix Creative Business...",,422,422,['RPG'],Final Fantasy XVI is an upcoming action role-p...,[],37,10,732,2.4K
649,Death Stranding 2,releases on TBD,['Kojima Productions'],,105,105,"['Adventure', 'Shooter']",,[],3,0,209,644
713,Final Fantasy VII Rebirth,"Dec 31, 2023",['Square Enix'],,192,192,[],This next standalone chapter in the FINAL FANT...,[],20,3,354,1.1K
719,Lies of P,"Aug 01, 2023","['NEOWIZ', 'Round8 Studio']",,175,175,['RPG'],"Inspired by the familiar story of Pinocchio, L...",[],5,0,260,939
726,Judas,"Mar 31, 2025",['Ghost Story Games'],,90,90,"['Adventure', 'Shooter']",A disintegrating starship. A desperate escape ...,[],1,0,92,437
746,Like a Dragon Gaiden: The Man Who Erased His Name,"Dec 31, 2023","['Ryū Ga Gotoku Studios', 'Sega']",,118,118,"['Adventure', 'Brawler', 'RPG']",This game covers Kiryu's story between Yakuza ...,[],2,1,145,588
972,The Legend of Zelda: Tears of the Kingdom,"May 12, 2023","['Nintendo', 'Nintendo EPD Production Group No...",,581,581,"['Adventure', 'RPG']",The Legend of Zelda: Tears of the Kingdom is t...,[],72,6,1.6K,5.4K
1130,Star Wars Jedi: Survivor,"Apr 28, 2023","['Respawn Entertainment', 'Electronic Arts']",,250,250,['Adventure'],The story of Cal Kestis continues in Star Wars...,[],13,2,367,1.4K
1160,We Love Katamari Reroll + Royal Reverie,"Jun 02, 2023","['Bandai Namco Entertainment', 'MONKEYCRAFT Co...",,51,51,"['Adventure', 'Puzzle']",We Love Katamari Reroll + Royal Reverie is a r...,[],3,0,74,291
1202,Earthblade,"Dec 31, 2024",['Extremely OK Games'],,83,83,"['Adventure', 'Indie', 'RPG']","You are Névoa, an enigmatic child of Fate retu...",[],0,1,103,529


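Because the source of the dataset is unknown, the missing ratings are hard to recover, so consider dropping the rows with a missing rating.

### 2.2.2 Assess duplicated data

This step mainly looks for observations whose title, Summary, and Reviews are repeated, because these three fields are specific enough that they are unlikely to match exactly by chance.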

In [9]:
original_data.duplicated().sum()

np.int64(382)

In [10]:
original_data[original_data.duplicated(subset=["Title"])]

Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
132,Doom,"Dec 10, 1993","['Activision', 'id Software']",4.0,1.4K,1.4K,['Shooter'],"In the future, humans have left Earth and sett...","['Recomendadisimo', 'classic', 'doom my belove...",11K,203,1.5K,514
159,Dead Space,"Oct 14, 2008","['EA Redwood Shores', 'Electronic Arts']",4.0,1.2K,1.2K,['Shooter'],Dead Space is a 2008 science fiction survival ...,['Impressionante demais pra Ã©poca e realmente...,9.6K,302,2.7K,1.1K
161,Shadow of the Colossus,"Feb 06, 2018","['Sony Interactive Entertainment', 'Bluepoint ...",4.1,1.1K,1.1K,"['Adventure', 'Platform', 'Puzzle']",Tales speak of an ancient land where creatures...,['(Played before 2023)\n \...,7.8K,242,3.2K,1.6K
163,God of War,"Mar 22, 2005","['SCE Santa Monica Studio', 'Sony Computer Ent...",3.6,981,981,"['Adventure', 'Brawler', 'Strategy']","Similar to franchises like Devil May Cry, Ryga...","['As a god of war game, It s incredible\n ...",11K,158,1.5K,896
326,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"['Adventure', 'RPG']","Elden Ring is a fantasy, action and open world...","[""The first playthrough of elden ring is one o...",17K,3.8K,4.6K,4.8K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1271,Fatal Frame II: Crimson Butterfly,"Nov 27, 2003","['Tecmo Co., Ltd.', 'Ubisoft Entertainment']",4.2,398,398,['Adventure'],Crimson Butterfly is the second installment in...,['Pretty cool albeit a bit similar to the firs...,1K,38,690,513
1282,Super Mario Sunshine,"Sep 18, 2020","['Nintendo EAD', 'Nintendo']",3.7,19,19,"['Adventure', 'Platform']",A port of Super Mario Sunshine included in Sup...,['What an amazing remaster of an already amazi...,340,6,83,14
1332,Doom,"Nov 10, 2017","['Bethesda Softworks', 'id Software']",3.9,80,80,['Shooter'],"Doom, the brutally fun and challenging modern-...",['can we get a Doom anime. Now I would watch t...,2.3K,37,393,150
1492,Sonic the Hedgehog 2,"Oct 16, 1992","['Aspect Co. Ltd', 'Sega']",2.8,157,157,"['Adventure', 'Arcade', 'Platform']",This is a completely different game than its 1...,['Played as part of Sonic Gems Collection on t...,1.4K,6,173,45


Because there are many duplicates, pick one title at random and check it in detail.

In [11]:
original_data[original_data["Title"] == "Doom"]


Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
45,Doom,"May 12, 2016","['id Software', 'Bethesda Softworks']",4.0,1.8K,1.8K,['Shooter'],"Developed by id software, the studio that pion...",['Recomendado (aunque se hace muy repetitivo)'...,18K,541,4K,1.4K
132,Doom,"Dec 10, 1993","['Activision', 'id Software']",4.0,1.4K,1.4K,['Shooter'],"In the future, humans have left Earth and sett...","['Recomendadisimo', 'classic', 'doom my belove...",11K,203,1.5K,514
371,Doom,"May 12, 2016","['id Software', 'Bethesda Softworks']",4.0,1.8K,1.8K,['Shooter'],"Developed by id software, the studio that pion...",['Recomendado (aunque se hace muy repetitivo)'...,18K,541,4K,1.4K
421,Doom,"Dec 10, 1993","['Activision', 'id Software']",4.0,1.4K,1.4K,['Shooter'],"In the future, humans have left Earth and sett...","['Recomendadisimo', 'classic', 'doom my belove...",11K,203,1.5K,514
821,Doom,"May 12, 2016","['id Software', 'Bethesda Softworks']",4.0,1.8K,1.8K,['Shooter'],"Developed by id software, the studio that pion...",['Recomendado (aunque se hace muy repetitivo)'...,18K,541,4K,1.4K
887,Doom,"Dec 10, 1993","['Activision', 'id Software']",4.0,1.4K,1.4K,['Shooter'],"In the future, humans have left Earth and sett...","['Recomendadisimo', 'classic', 'doom my belove...",11K,203,1.5K,514
1332,Doom,"Nov 10, 2017","['Bethesda Softworks', 'id Software']",3.9,80,80,['Shooter'],"Doom, the brutally fun and challenging modern-...",['can we get a Doom anime. Now I would watch t...,2.3K,37,393,150


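As shown, some games with the same title were released at different times and may be different versions of one game or unrelated games that share a name. Other same-titled observations have identical Summary and Reviews values, which indicates duplicated data that should be dropped.

### 2.2.3 Assess inconsistent data

Inconsistent data is most likely to appear in the `Team` and `Genres` columns. This can only be confirmed after the structural problems are fixed and the columns are split.

### 2.2.4 Assess invalid or erroneous data

Because some integer variables are stored as object, invalid data can only be screened after the variable types are corrected.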
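The distinction between same-titled games and true duplicates can be sketched with `duplicated(subset=...)`. The toy rows below are made up to mirror the Doom case, and the choice of key columns is an assumption rather than the rule used later in this notebook.

```python
import pandas as pd

# Toy data: two identical Doom (2016) rows, one distinct Doom (1993) row
games = pd.DataFrame({
    "Title": ["Doom", "Doom", "Doom", "Hades"],
    "Release Date": ["May 12, 2016", "Dec 10, 1993", "May 12, 2016", "Dec 10, 2019"],
    "Summary": ["2016 reboot", "1993 original", "2016 reboot", "rogue-lite"],
})

# Rows count as duplicates only when all key columns match
key = ["Title", "Release Date", "Summary"]
repeat_mask = games.duplicated(subset=key, keep=False)  # flag every copy
deduplicated = games.drop_duplicates(subset=key, keep="first")
print(games[repeat_mask])
print(deduplicated)
```

With `keep=False` every copy is flagged for inspection, while `drop_duplicates(keep="first")` retains one row per key, so the 1993 release survives alongside one copy of the 2016 release.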

In [12]:
original_data.describe()

Unnamed: 0,Rating
count,1499.0
mean,3.719346
std,0.532608
min,0.7
25%,3.4
50%,3.8
75%,4.1
max,4.8
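Once the types are corrected, a range check is one simple way to screen for invalid values. The sketch below assumes ratings are on a 0 to 5 scale, which is consistent with the minimum and maximum above but is not stated by the dataset; the sample values are made up.

```python
import pandas as pd

# Hypothetical ratings, including a missing value and an out-of-range value
ratings = pd.Series([4.5, 0.7, None, 5.3])

# Assumed valid range: 0 to 5 (missing values are not flagged here)
out_of_range = ratings[(ratings < 0) | (ratings > 5)]
print(out_of_range)
```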


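# 3. Clean the data

**To do**
1. Drop rows with missing Team or Rating values
2. Split the Team, Genres, and Reviews columns
3. Convert the types of Release Date, Times Listed, Number of Reviews, Plays, Playing, Backlogs, and Wishlist
4. Drop rows where Title, Release Date, Summary, and Reviews are all duplicated
5. Further assess inconsistent and invalid data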

In [13]:
cleaned_data = original_data.copy()

## 3.1 Drop missing values

In [14]:
cleaned_data.dropna(subset=["Team", "Rating"], axis=0, inplace=True)
cleaned_data.isnull().sum()

Title                0
Release Date         0
Team                 0
Rating               0
Times Listed         0
Number of Reviews    0
Genres               0
Summary              0
Reviews              0
Plays                0
Playing              0
Backlogs             0
Wishlist             0
dtype: int64

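## 3.2 Split the Team, Genres, and Reviews columns

### 3.2.1 Split the Team column

First, inspect the `Team` column before determining the maximum number of teams.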

In [15]:
cleaned_data["Team"].info() 

<class 'pandas.core.series.Series'>
Index: 1498 entries, 0 to 1511
Series name: Team
Non-Null Count  Dtype 
--------------  ----- 
1498 non-null   object
dtypes: object(1)
memory usage: 23.4+ KB


This shows that the values in Team are all strings, so they need to be split and cleaned first.
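One way to find the maximum number of teams is to parse each string back into a Python list and measure its length. This is a minimal sketch on made-up values in the same format. It assumes every cell is a well-formed list literal, which `ast.literal_eval` requires, and it is not the approach taken in the cells that follow.

```python
import ast
import pandas as pd

# Hypothetical Team values stored as strings, as in the CSV
teams = pd.Series([
    "['Bandai Namco Entertainment', 'FromSoftware']",
    "['Supergiant Games']",
    "['Tecmo Co., Ltd.', 'Ubisoft Entertainment']",
])

# Turn each string into a real list, then count its items
parsed = teams.apply(ast.literal_eval)
max_teams = int(parsed.str.len().max())
print(max_teams)
```

Parsing also keeps a name such as `Tecmo Co., Ltd.` as a single item, whereas counting commas would treat it as two teams.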

In [16]:
# Strip the leading and trailing square brackets from each string
cleaned_data["Team"] = cleaned_data["Team"].str.slice(1, -1)
cleaned_data["Team"]

0            'Bandai Namco Entertainment', 'FromSoftware'
1                                      'Supergiant Games'
2       'Nintendo', 'Nintendo EPD Production Group No. 3'
3                                        'tobyfox', '8-4'
4                                           'Team Cherry'
                              ...                        
1507                                     'Telltale Games'
1508                               'Sumo Digital', 'Sega'
1509                                             'Capcom'
1510                                     'Larian Studios'
1511                              'WB Games', 'TT Fusion'
Name: Team, Length: 1498, dtype: object

In [17]:
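# Split on commas; note that a team name containing a comma (e.g. 'Tecmo Co., Ltd.') is also split into separate columns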
cleaned_data[['Team1', "Team2", "Team3"]] = cleaned_data["Team"].str.split(",", expand=True)
cleaned_data.drop(["Team"], axis=1, inplace=True)
cleaned_data.sample(10)

Unnamed: 0,Title,Release Date,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist,Team1,Team2,Team3
897,Hogwarts Legacy,"Feb 10, 2023",3.5,474,474,"['Adventure', 'RPG']","Hogwarts Legacy is an immersive, open-world ac...",['Enjoyed this one a lot as a big Harry Potter...,1.4K,766,686,1.6K,'Portkey Games','Avalanche Software',
348,Metal Gear Rising: Revengeance,"Feb 19, 2013",4.1,2.1K,2.1K,"['Adventure', 'Brawler', 'Shooter', 'Strategy']",Developed by Kojima Productions and PlatinumGa...,"['This game is so jank but so entertaining', '...",14K,492,4.2K,2K,'Konami','PlatinumGames',
813,Persona 4 Golden,"Jun 15, 2012",4.2,1.9K,1.9K,"['Adventure', 'RPG', 'Simulator', 'Visual Novel']",An enhanced rerelease of Shin Megami Tensei: P...,"['mfw bitches and whores', 'I love this game, ...",13K,1.5K,5.2K,2.2K,'NIS America','Atlus',
1429,Metal Gear 2: Solid Snake,"Jul 20, 1990",3.5,306,306,['Adventure'],"Solid Snake, now retired from Fox-Hound, retur...","[""pretty much an 8-bit MGS1. Extremely ambitio...",1.2K,19,501,183,'Konami',,
1055,Black Mesa,"Mar 06, 2020",4.1,459,459,"['Adventure', 'Indie', 'Platform', 'Shooter']",Black Mesa is a re-envisioning of Valve Softwa...,['An admirable fan project but there are quest...,3.7K,200,1.5K,600,'Crowbar Collective',,
631,Sonic Colors,"Nov 11, 2010",3.4,609,609,"['Adventure', 'Platform']","Sonic Colors, titled Sonic Colours in European...",['Recomendado (aunque peca de usar mucho el 2D...,5.1K,44,690,342,'SEGA of America','Sonic Team',
198,Dragon Quest XI S: Echoes of an Elusive Age - ...,"Sep 27, 2019",4.2,880,880,"['Adventure', 'RPG']",Ready for a grand adventure filled with memora...,"[""It is the most alrightest modern Dragon Ques...",4.7K,748,3.1K,1.2K,'Square Enix',,
361,Sekiro: Shadows Die Twice,"Mar 22, 2019",4.4,2.3K,2.3K,"['Adventure', 'Brawler']",Enter a dark and brutal new gameplay experienc...,"[""Im waiting for this game to grab me but it h...",14K,919,4.8K,3.4K,'FromSoftware','Activision',
205,Banjo-Kazooie,"Jun 29, 1998",4.0,1.2K,1.2K,"['Adventure', 'Platform']","In this 3D platformer, the heroic but naive be...","['Guh huh', 'My king.', 'Chato.', 'one of the ...",7.4K,205,2.1K,765,'Microsoft Game Studios','Rare',
763,Mario Party 2,"Dec 17, 1999",3.7,348,348,['Card & Board Game'],Mario and the gang are back for another round ...,['Mario Party 2 is a perfect sequel. It improv...,3.9K,11,199,132,'Hudson Soft','Nintendo',


In [18]:
cleaned_data["Team1"] = cleaned_data["Team1"].astype("str").str.replace("'", "")
cleaned_data["Team2"] = cleaned_data["Team2"].astype("str").str.replace("'", "")
cleaned_data["Team3"] = cleaned_data["Team3"].astype("str").str.replace("'", "")

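### 3.2.2 Split the Genres column

Because Genres behaves like a categorical variable, consider encoding it as dummy variables to simplify later analysis.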

In [19]:
cleaned_data["Genres"]

0                                    ['Adventure', 'RPG']
1                ['Adventure', 'Brawler', 'Indie', 'RPG']
2                                    ['Adventure', 'RPG']
3       ['Adventure', 'Indie', 'RPG', 'Turn Based Stra...
4                      ['Adventure', 'Indie', 'Platform']
                              ...                        
1507                     ['Adventure', 'Point-and-Click']
1508                                 ['Arcade', 'Racing']
1509                                   ['Brawler', 'RPG']
1510    ['Adventure', 'RPG', 'Strategy', 'Tactical', '...
1511                              ['Adventure', 'Puzzle']
Name: Genres, Length: 1498, dtype: object

In [20]:
# Remove the punctuation from the original cells (regex=False treats "[" and "]" as literal characters)
cleaned_data["Genres"] = cleaned_data["Genres"].astype("str")
cleaned_data["Genres"] = cleaned_data["Genres"].str.replace("[", "", regex=False)
cleaned_data["Genres"] = cleaned_data["Genres"].str.replace("]", "", regex=False)
cleaned_data["Genres"] = cleaned_data["Genres"].str.replace("'", "", regex=False)

cleaned_data["Genres"]

0                                          Adventure, RPG
1                          Adventure, Brawler, Indie, RPG
2                                          Adventure, RPG
3              Adventure, Indie, RPG, Turn Based Strategy
4                              Adventure, Indie, Platform
                              ...                        
1507                           Adventure, Point-and-Click
1508                                       Arcade, Racing
1509                                         Brawler, RPG
1510    Adventure, RPG, Strategy, Tactical, Turn Based...
1511                                    Adventure, Puzzle
Name: Genres, Length: 1498, dtype: object

In [21]:
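# Build dummy variables with Series.str.get_dummies()
# Note: sep="," leaves a leading space on every genre after the first (" RPG" vs "RPG"),
# so the same genre can appear as two columns; sep=", " would avoid this
genres_dummies = cleaned_data["Genres"].str.get_dummies(sep=",") # split each cell on "," and store the dummy variables in a new DataFrame
genres_dummies = genres_dummies.astype("str") # note: this stores the 0/1 indicators as strings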
cleaned_data = pd.concat([cleaned_data, genres_dummies], axis=1)

In [22]:
cleaned_data.head()

Unnamed: 0,Title,Release Date,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,...,Point-and-Click,Puzzle,RPG,Racing,Real Time Strategy,Shooter,Simulator,Sport,Strategy,Visual Novel
0,Elden Ring,"Feb 25, 2022",4.5,3.9K,3.9K,"Adventure, RPG","Elden Ring is a fantasy, action and open world...","[""The first playthrough of elden ring is one o...",17K,3.8K,...,0,0,0,0,0,0,0,0,0,0
1,Hades,"Dec 10, 2019",4.3,2.9K,2.9K,"Adventure, Brawler, Indie, RPG",A rogue-lite hack and slash dungeon crawler in...,['convinced this is a roguelike for people who...,21K,3.2K,...,0,0,0,0,0,0,0,0,0,0
2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017",4.4,4.3K,4.3K,"Adventure, RPG",The Legend of Zelda: Breath of the Wild is the...,['This game is the game (that is not CS:GO) th...,30K,2.5K,...,0,0,0,0,0,0,0,0,0,0
3,Undertale,"Sep 15, 2015",4.2,3.5K,3.5K,"Adventure, Indie, RPG, Turn Based Strategy","A small child falls into the Underground, wher...",['soundtrack is tied for #1 with nier automata...,28K,679,...,0,0,0,0,0,0,0,0,0,0
4,Hollow Knight,"Feb 24, 2017",4.4,3K,3K,"Adventure, Indie, Platform",A 2D metroidvania with an emphasis on close co...,"[""this games worldbuilding is incredible, with...",21K,2.4K,...,0,0,0,0,0,0,0,0,0,0
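The effect of the separator passed to `Series.str.get_dummies` can be seen on a small example. The genre strings below are made up in the same `"A, B"` format as the cleaned column. The sketch only illustrates the difference between the two separators and is not a rerun of the cell above.

```python
import pandas as pd

# Hypothetical genre strings in the cleaned "A, B" format
genres = pd.Series(["Adventure, RPG", "RPG", "Adventure, Indie"])

# Splitting on "," keeps the space, so " RPG" and "RPG" become separate columns
with_comma = genres.str.get_dummies(sep=",")

# Splitting on ", " yields one column per genre
with_comma_space = genres.str.get_dummies(sep=", ")

print(list(with_comma.columns))
print(list(with_comma_space.columns))
```

This is why a genre can show up twice in a wide table: once for rows where it is the first item in the list and once, with a leading space, for rows where it appears later.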


In [23]:
# Drop the original Genres column
cleaned_data = cleaned_data.drop(["Genres"], axis=1)

In [24]:
pd.set_option("display.max_columns", 150) # maximum number of columns to display
pd.set_option("display.max_colwidth", 50) # maximum number of characters shown per cell

### 3.2.3 Split the Reviews column

In [25]:
cleaned_data["Reviews"][0]

'["The first playthrough of elden ring is one of the best eperiences gaming can offer you but after youve explored everything in the open world and you\'ve experienced all of the surprises you lose motivation to go exploring on repeat playthroughs which takes a lot away from the replayability which is a very important thing for from games imo.", \'a replay solidified my love for elden ring. so easily my favorite game of all time. actually beating malenia this time was also an amazing feeling. i just love being in this world man its the greatest of all time\', \'The game is absolutely beautiful, with so much to do. The replayability is crazy. And it never gets old with it too.\', \'Took everything great about the Soulsborne games and make it 100% better.\', \'I play with my overlevelled friend every time and we still fail sometimes (he’s on NG6), insanely difficult game lol\\n                     \\n                     gorgeous graphics, animations, everything about this game is so bea

In [26]:
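# Repeat the steps above: convert to string, then split
# Note: "Review1" is listed twice below, so the second split column overwrites the first and no Review2 column is created
cleaned_data["Reviews"] = cleaned_data["Reviews"].astype("str")
cleaned_data[["Review1", "Review1", "Review3", "Review4", "Review5", "Review6"]] = cleaned_data["Reviews"].str.split(", \'", expand=True)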

In [27]:
cleaned_data.drop(['Reviews'], axis=1, inplace=True)
cleaned_data.tail()

Unnamed: 0,Title,Release Date,Rating,Times Listed,Number of Reviews,Summary,Plays,Playing,Backlogs,Wishlist,Team1,Team2,Team3,Arcade,Brawler,Card & Board Game,Fighting,Indie,Music,Pinball,Platform,Point-and-Click,Puzzle,Quiz/Trivia,RPG,Racing,Real Time Strategy,Shooter,Simulator,Sport,Strategy,Tactical,Turn Based Strategy,Visual Novel,Adventure,Arcade.1,Brawler.1,Card & Board Game.1,Fighting.1,Indie.1,MOBA,Music.1,Platform.1,Point-and-Click.1,Puzzle.1,RPG.1,Racing.1,Real Time Strategy.1,Shooter.1,Simulator.1,Sport.1,Strategy.1,Visual Novel.1,Review1,Review3,Review4,Review5,Review6
1507,Back to the Future: The Game,"Dec 22, 2010",3.2,94,94,Back to the Future: The Game is one of Telltal...,763,5,223,67,Telltale Games,,,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,i need to give this another try',When I was little I was obsessed with Back to ...,Für mich der inoffizielle vierte Teil der Reih...,,
1508,Team Sonic Racing,"May 21, 2019",2.9,264,264,Team Sonic Racing combines the best elements o...,1.5K,49,413,107,Sumo Digital,Sega,,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"not my cup of tea', ""Compared to the previous ...",One of the funnest PS plus games ever.',"it looks pretty ig', ""Feels great to play but ...",,
1509,Dragon's Dogma,"May 22, 2012",3.7,210,210,"Set in a huge open world, Dragon’s Dogma: Dark...",1.1K,45,487,206,Capcom,,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"A grandes rasgos, es como un MMO pero para un ...",peak kino\n \n ...,ok knorke',"Muito pika puta merda, gameplay incrivel, expl...",
1510,Baldur's Gate 3,"Oct 06, 2020",4.1,165,165,"An ancient evil has returned to Baldur's Gate,...",269,79,388,602,Larian Studios,,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"friends are required to enjoy this game.', ""Ga...",,,,
1511,The LEGO Movie Videogame,"Feb 04, 2014",2.8,184,184,Join Emmet and an unlikely group of resistance...,1.7K,11,239,73,WB Games,TT Fusion,,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Pretty Average Lego Game But It Was My Childhood',pog lego game',Pretty decent lego game! Loved the attention t...,Class',TT Games got crunch time for this.']
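Splitting on `", '"` misses reviews wrapped in double quotes and leaves stray quote characters in the cells, as some rows above show. A hedged alternative is to parse each cell as a list literal and expand it into columns. The sample strings and the `ReviewN` naming below are illustrative, and the sketch assumes every cell is a complete, well-formed list, which `ast.literal_eval` needs.

```python
import ast
import pandas as pd

# Hypothetical Reviews values: mixed quote styles, an embedded comma, an empty list
reviews = pd.Series([
    "[\"it's great\", 'classic', 'good, but short']",
    "[]",
])

# Parse each string into a list, then spread the items across columns
parsed = reviews.apply(ast.literal_eval)
review_columns = pd.DataFrame(parsed.tolist(), index=reviews.index)
review_columns.columns = [f"Review{i + 1}" for i in range(review_columns.shape[1])]
print(review_columns)
```

Rows with fewer reviews are padded with missing values, and commas or apostrophes inside a review no longer affect the split.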


## 3.3 Convert data types

Convert every value that contains K into an integer.

In [28]:
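# Conversion helper
def K_to_thousands(numbers):
    """Convert a count string such as "3.9K" or "679" to an integer."""
    if "K" in numbers:
        # round() guards against floating-point error before the integer cast
        return round(float(numbers.replace("K", "")) * 1000)
    else:
        return int(numbers)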

In [29]:
columns_list = ["Times Listed", "Number of Reviews", "Plays", "Playing", "Backlogs", "Wishlist"]
for c in columns_list:
    # Apply the helper to every value, then store the column as integers
    cleaned_data[c] = cleaned_data[c].apply(K_to_thousands).astype("int")
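The same conversion can also be written with vectorized string methods instead of a row-by-row `apply`. This is a sketch under the assumption that every value is either a plain number or a number followed by `K`. The function name `parse_k_counts` and the sample values are hypothetical.

```python
import pandas as pd

def parse_k_counts(series: pd.Series) -> pd.Series:
    """Convert strings such as "3.9K" or "679" to nullable integers."""
    text = series.astype(str).str.strip()
    has_k = text.str.endswith("K")
    # Strip the suffix and parse; anything unparseable becomes NaN
    numbers = pd.to_numeric(text.str.rstrip("K"), errors="coerce")
    # Multiply by 1000 only where the original value carried a "K"
    scaled = numbers.where(~has_k, numbers * 1000)
    return scaled.round().astype("Int64")

demo = pd.Series(["3.9K", "679", "17K", "0"])
result = parse_k_counts(demo)
print(result)
```

The nullable `Int64` dtype is a design choice here: it lets unparseable entries stay missing instead of raising during the integer cast.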


In [31]:
cleaned_data.describe()

Unnamed: 0,Rating,Times Listed,Number of Reviews,Plays,Playing,Backlogs,Wishlist
count,1498.0,1498.0,1498.0,1498.0,1498.0,1498.0,1498.0
mean,3.720027,775.156876,775.156876,6311.834446,269.855808,1463.375167,778.184913
std,0.532133,688.335864,688.335864,5891.431835,427.668688,1342.977036,793.828103
min,0.7,8.0,8.0,1.0,0.0,5.0,2.0
25%,3.4,294.25,294.25,1900.0,44.0,470.25,212.0
50%,3.8,555.0,555.0,4300.0,115.0,1000.0,496.0
75%,4.1,1000.0,1000.0,9100.0,302.0,2100.0,1100.0
max,4.8,4300.0,4300.0,33000.0,3800.0,8300.0,4800.0


In [32]:
cleaned_data.head()

Unnamed: 0,Title,Release Date,Rating,Times Listed,Number of Reviews,Summary,Plays,Playing,Backlogs,Wishlist,Team1,Team2,Team3,Arcade,Brawler,Card & Board Game,Fighting,Indie,Music,Pinball,Platform,Point-and-Click,Puzzle,Quiz/Trivia,RPG,Racing,Real Time Strategy,Shooter,Simulator,Sport,Strategy,Tactical,Turn Based Strategy,Visual Novel,Adventure,Arcade.1,Brawler.1,Card & Board Game.1,Fighting.1,Indie.1,MOBA,Music.1,Platform.1,Point-and-Click.1,Puzzle.1,RPG.1,Racing.1,Real Time Strategy.1,Shooter.1,Simulator.1,Sport.1,Strategy.1,Visual Novel.1,Review1,Review3,Review4,Review5,Review6
0,Elden Ring,"Feb 25, 2022",4.5,3900,3900,"Elden Ring is a fantasy, action and open world...",17000,3800,4600,4800,Bandai Namco Entertainment,FromSoftware,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,a replay solidified my love for elden ring. so...,"The game is absolutely beautiful, with so much...",Took everything great about the Soulsborne gam...,I play with my overlevelled friend every time ...,
1,Hades,"Dec 10, 2019",4.3,2900,2900,A rogue-lite hack and slash dungeon crawler in...,21000,3200,6300,3600,Supergiant Games,,,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"incredible art, a banger soundtrack a surprisi...","Não sou muito de jogo indie, admito que joguei...","One of my favorites in the rogue-likes/lites, ...",,
2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017",4.4,4300,4300,The Legend of Zelda: Breath of the Wild is the...,30000,2500,5000,2600,Nintendo,Nintendo EPD Production Group No. 3,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,em 105 horas de jogo não houve um segundo que ...,Sencillamente el mejor juego que he tenido el ...,em meio a tanto jogo de mundo aberto ruim sain...,,
3,Undertale,"Sep 15, 2015",4.2,3500,3500,"A small child falls into the Underground, wher...",28000,679,4900,1800,tobyfox,8-4,,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Just play this game, Dont look at any of these...",Não há palavras que deem para descrever a expe...,CLASSSSSSSSSSSSSSSICCCCCCCCCCCCCCCCCCC',whooaa ohh ohhhh ohoohhohh ohhwooaah story of ...,A nice unique take on the RPG indie game forma...
4,Hollow Knight,"Feb 24, 2017",4.4,3000,3000,A 2D metroidvania with an emphasis on close co...,21000,2400,8300,2300,Team Cherry,,,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Faz uns 2/3 anos que eu zerei esse jogo, mesmo...",i like how you can bounce on spikes with your ...,A rivetting action/adventure game with a stunn...,I\'d give this game a 4 for each individual as...,


## 3.4 Drop duplicated data

In [37]:
cleaned_data[cleaned_data.duplicated()] # show every fully duplicated row

Unnamed: 0,Title,Release Date,Rating,Times Listed,Number of Reviews,Summary,Plays,Playing,Backlogs,Wishlist,Team1,Team2,Team3,Arcade,Brawler,Card & Board Game,Fighting,Indie,Music,Pinball,Platform,Point-and-Click,Puzzle,Quiz/Trivia,RPG,Racing,Real Time Strategy,Shooter,Simulator,Sport,Strategy,Tactical,Turn Based Strategy,Visual Novel,Adventure,Arcade.1,Brawler.1,Card & Board Game.1,Fighting.1,Indie.1,MOBA,Music.1,Platform.1,Point-and-Click.1,Puzzle.1,RPG.1,Racing.1,Real Time Strategy.1,Shooter.1,Simulator.1,Sport.1,Strategy.1,Visual Novel.1,Review1,Review3,Review4,Review5,Review6
326,Elden Ring,"Feb 25, 2022",4.5,3900,3900,"Elden Ring is a fantasy, action and open world...",17000,3800,4600,4800,Bandai Namco Entertainment,FromSoftware,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,a replay solidified my love for elden ring. so...,"The game is absolutely beautiful, with so much...",Took everything great about the Soulsborne gam...,I play with my overlevelled friend every time ...,
327,Hades,"Dec 10, 2019",4.3,2900,2900,A rogue-lite hack and slash dungeon crawler in...,21000,3200,6300,3600,Supergiant Games,,,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"incredible art, a banger soundtrack a surprisi...","Não sou muito de jogo indie, admito que joguei...","One of my favorites in the rogue-likes/lites, ...",,
328,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017",4.4,4300,4300,The Legend of Zelda: Breath of the Wild is the...,30000,2500,5000,2600,Nintendo,Nintendo EPD Production Group No. 3,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,em 105 horas de jogo não houve um segundo que ...,Sencillamente el mejor juego que he tenido el ...,em meio a tanto jogo de mundo aberto ruim sain...,,
329,Undertale,"Sep 15, 2015",4.2,3500,3500,"A small child falls into the Underground, wher...",28000,679,4900,1800,tobyfox,8-4,,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Just play this game, Dont look at any of these...",Não há palavras que deem para descrever a expe...,CLASSSSSSSSSSSSSSSICCCCCCCCCCCCCCCCCCC',whooaa ohh ohhhh ohoohhohh ohhwooaah story of ...,A nice unique take on the RPG indie game forma...
330,Hollow Knight,"Feb 24, 2017",4.4,3000,3000,A 2D metroidvania with an emphasis on close co...,21000,2400,8300,2300,Team Cherry,,,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Faz uns 2/3 anos que eu zerei esse jogo, mesmo...",i like how you can bounce on spikes with your ...,A rivetting action/adventure game with a stunn...,I\'d give this game a 4 for each individual as...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1268,Bloodstained: Curse of the Moon,"May 23, 2018",3.6,341,341,“Bloodstained: Curse of the Moon” is packed wi...,2300,41,800,397,Inti Creates,,,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Okay', ""There's nothing wrong with this game, ...","Two Clones, One Series:\n ...",Doesn’t outstay it’s welcome and a fun throwba...,Zangetsu Miriam Gebel Alfred\n ...,
1269,Final Fantasy XIII-2,"Dec 15, 2011",3.3,482,482,FINAL FANTASY XIII-2 is created with the aim o...,2300,58,1400,449,Square Enix,,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,So like XIII I also platinumed XIII-2.',a marked improvement over 13-1 with alot more ...,Gets an extra half-star for nostalgic reasons....,,
1270,Agar.io,"Apr 28, 2015",2.2,81,81,Agar.io is a Massively-multiplayer top-down st...,4400,8,40,12,Miniclip.com,Matheus Valadares,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,guzel oyundu',"doge', ""Playing this game in school and whenev...",Well at least it used to be fun on school pcs....,"school computers, teachers thought this shit w...",
1271,Fatal Frame II: Crimson Butterfly,"Nov 27, 2003",4.2,398,398,Crimson Butterfly is the second installment in...,1000,38,690,513,Tecmo Co.,Ltd.,Ubisoft Entertainment,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Ainda na minha maratona de clássicos do PS2, e...",A bit too easy for a survival horror game. The...,literally such a good game wtf',"Wow, where to start...'",Oh now I get it why people put this game in th...


In [41]:
# Drop all duplicated rows
cleaned_data.drop_duplicates(inplace=True)

In [42]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1116 entries, 0 to 1511
Data columns (total 58 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Title                 1116 non-null   object 
 1   Release Date          1116 non-null   object 
 2   Rating                1116 non-null   float64
 3   Times Listed          1116 non-null   int64  
 4   Number of Reviews     1116 non-null   int64  
 5   Summary               1116 non-null   object 
 6   Plays                 1116 non-null   int64  
 7   Playing               1116 non-null   int64  
 8   Backlogs              1116 non-null   int64  
 9   Wishlist              1116 non-null   int64  
 10  Team1                 1116 non-null   object 
 11  Team2                 1116 non-null   object 
 12  Team3                 1116 non-null   object 
 13   Arcade               1116 non-null   object 
 14   Brawler              1116 non-null   object 
 15   Card & Board Game    1116