# Coursework 2

Student name: Kym Hannah

Student ID: s2247460



## Part A: Significance Testing



### Dataset: Gender Representation in Video Games
This dataset contains information on several video games released between 2012 and 2022.

The dataset metadata states that the games were selected based on the following criteria:

>GAME SELECTION CRITERIA
> 1. The games must have a storyline. To analyze the paper of each character is essential that
games had a plot where characters had a role assigned (even if this role might change based
on the player's choices). This excludes games like:
> - Puzzle games: Tetris, Candy Crush, Minesweepers, etc.
> - Racing games: Gran Turismo, Formula 1, Mario Kart, etc.
> - Social Simulators: Animal Crossing, The Sims, etc.
> - MMORPGs, where the storyline lived by the player, might considerably differ from
other players, such as World of Warcraft.
> - Shooters with no story mode: Fornite, Valorant
> - Other popular games with no storyline: Minecraft, Roblox, League of Legends…
> 2. For games that offer a story and multiplayer modes (like some Call of Duty, GTA V…), just the
story mode is taken into consideration for this analysis.
> 3. The games were selected for being top-selling or best-rated games of the year.
> 4. At least 5 games were selected for each year.

#### Source: Kaggle
- Link: [Kaggle: Gender Representation in Video Games](https://www.kaggle.com/datasets/br33sa/gender-representation-in-video-games/data?select=characters.grivg.csv)
- License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

The dataset was downloaded from Kaggle, where it is publicly available. It is released under a Creative Commons license (CC BY-NC 4.0), which permits non-commercial use.

Kaggle gives the dataset a rating of 9.41 out of 10.0, which indicates that the dataset is of high quality by Kaggle's own measure.

![Kaggle rating](Images/Kaggle_Rating.png)

The provided metadata states that:

>" This data has been compiled through research of several information sources including, but not
limited to:
> - Websites and media with a strong focus on video games like Metacritic, Destructoid, IGN, or
GameSpot.
> - Wikipedia
> - The websites of the games developers, games publishers, and the website of the game itself."

Because the data was compiled from multiple sources that are known for their focus on video games, the dataset should be reasonably reliable.

#### Dataset composition
- Data structure: Tabular - 3 CSV files (related using common keys)
- CSV files:
    - characters.grivg.csv
    - games.grivg.csv
    - sexualization.grivg.csv


##### Games dataset

In [1]:
# Notebook imports
import pandas as pd

In [2]:
# Load the games dataset
games = pd.read_csv('Data/Gender_In_Video_Games/games.grivg.csv')
games.head()

Unnamed: 0,Game_Id,Title,Release,Series,Genre,Sub-genre,Developer,Publisher,Country,Platform,...,Director,Total_team,female_team,Team_percentage,Metacritic,Destructoid,IGN,GameSpot,Avg_Reviews,Unnamed: 27
0,GTAV,Grand Theft Auto V,Nov-13,GTA,Action-adventure,Action-adventure,Rockstar North,Rockstar Games,GBR,Multi,...,M,7,0,0%,9.7,9.0,10.0,9.0,9.4,
1,PSS,Pokémon Sword/Shield,Nov-19,Pokémon,RPG,RPG,Game Freak,Nintendo,JPN,Nintendo Switch,...,M,9,1,11%,8.0,7.0,9.3,9.0,8.3,
2,CODMW,Call of Duty: Modern Warfare,Oct-19,Call of Duty,Action,FPS,Infinity Ward,Activision,USA,Multi,...,M,11,0,0%,8.0,8.0,8.0,7.0,7.8,
3,RDR2,Red Dead Redemption 2,Dec-18,Red Dead,Action-adventure,Action-adventure,Rockstar Studios,Rockstar Games,USA,Multi,...,M,7,0,0%,9.7,9.5,10.0,9.0,9.6,
4,SMO,Super Mario Odyssey,Oct-17,Super Mario,Action-adventure,Action-adventure,Nintendo EDP,Nintendo,JPN,Nintendo Switch,...,M,11,1,9%,9.7,9.5,10.0,10.0,9.8,


The games dataset has been loaded successfully. Its shape and column names are checked next.

In [3]:
# Check the shape of the dataset
print(f"Number of rows: {games.shape[0]}")
print(f"Number of columns: {games.shape[1]}")

Number of rows: 64
Number of columns: 28


In [4]:
# Check the columns in the dataset
games.columns

Index(['Game_Id', 'Title', 'Release', 'Series', 'Genre', 'Sub-genre',
       'Developer', 'Publisher', 'Country', 'Platform', 'PEGI',
       'Customizable_main', 'Protagonist', 'Protagonist_Non_Male',
       'Relevant_males', 'Relevant_no_males', 'Percentage_non_male',
       'Criteria', 'Director', 'Total_team', 'female_team', 'Team_percentage',
       'Metacritic ', 'Destructoid', 'IGN', 'GameSpot', 'Avg_Reviews',
       'Unnamed: 27'],
      dtype='object')

Trailing whitespace was detected in one of the column names ('Metacritic '). It will be stripped from all column names to avoid issues when referencing columns later.

In [5]:
# Remove whitespace from column names
games.columns = games.columns.str.strip()
games.columns

Index(['Game_Id', 'Title', 'Release', 'Series', 'Genre', 'Sub-genre',
       'Developer', 'Publisher', 'Country', 'Platform', 'PEGI',
       'Customizable_main', 'Protagonist', 'Protagonist_Non_Male',
       'Relevant_males', 'Relevant_no_males', 'Percentage_non_male',
       'Criteria', 'Director', 'Total_team', 'female_team', 'Team_percentage',
       'Metacritic', 'Destructoid', 'IGN', 'GameSpot', 'Avg_Reviews',
       'Unnamed: 27'],
      dtype='object')
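Stray whitespace in column names is easy to miss in the default output. The following is a minimal sketch, on a hypothetical toy frame rather than the real dataset, of one way to list the affected names before stripping them:

```python
import pandas as pd

# Hypothetical toy frame; 'Metacritic ' carries a trailing space.
toy = pd.DataFrame({"Metacritic ": [9.7], "IGN": [10.0]})

# repr() puts quotes around each name, making padding visible.
padded = [repr(c) for c in toy.columns if c != c.strip()]
print(padded)

# Strip leading/trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

Listing the padded names first records which columns were changed, which may help if the raw CSV is revisited later.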

Next, the dataset will be checked for missing values.

In [6]:
# Check for missing values
games.isnull().sum()

Game_Id                  0
Title                    0
Release                  0
Series                  37
Genre                    0
Sub-genre                0
Developer                0
Publisher                0
Country                  0
Platform                 0
PEGI                     0
Customizable_main        0
Protagonist              0
Protagonist_Non_Male     0
Relevant_males           0
Relevant_no_males        0
Percentage_non_male      0
Criteria                 0
Director                 0
Total_team               0
female_team              0
Team_percentage          0
Metacritic               0
Destructoid              7
IGN                      2
GameSpot                 2
Avg_Reviews              0
Unnamed: 27             64
dtype: int64

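Missing values were found in the 'Series', 'Destructoid', 'IGN', 'GameSpot' and 'Unnamed: 27' columns. The 'Series' column is handled first: a null value there presumably means that the game does not belong to a series, so the nulls will be replaced with the string 'None'.

Replacing the nulls, rather than dropping the rows, keeps all 64 games in the dataset while leaving the column free of missing values.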

In [7]:
# Filter the rows with null values in the 'Series' column
games_with_null_series = games[games['Series'].isnull()]

# Print the filtered rows
games_with_null_series

Unnamed: 0,Game_Id,Title,Release,Series,Genre,Sub-genre,Developer,Publisher,Country,Platform,...,Director,Total_team,female_team,Team_percentage,Metacritic,Destructoid,IGN,GameSpot,Avg_Reviews,Unnamed: 27
13,INS,Inside,Jun-16,,Action,Puzzle-Platform,Paydead,Playdead,DNK,Multi,...,M,2,0,0%,9.1,9.5,10.0,8.0,9.2,
14,UT,Undertale,Sep-15,,RPG,RPG,Toby Fox,Toby Fox,USA,Multi,...,M,2,1,50%,9.2,10.0,10.0,9.0,9.6,
15,BB,Bloodborne,Mar-15,,RPG,Action RPG,FromSoftware,Sony Computer Entertainment,JPN,PS4,...,M,5,0,0%,9.2,9.0,9.1,9.0,9.1,
16,SK,Shovel Knight: Shovel of Hope,Jun-14,,Action,Platform,Yacht Club Games,Yacht Club Games,USA,Multi,...,M,9,1,11%,8.8,9.0,9.0,8.0,8.7,
17,PP,"Papers, Please",Aug-13,,Simulation,Puzzle Simulation,3909 LLC,3909 LLC,USA,Multi,...,M,1,0,0%,8.9,,8.7,8.0,8.5,
18,TLOU,The Last of Us,Jun-13,,Action-adventure,Action-adventure,Naughty Dog,Sony Computer Entertainment,USA,PS3,...,M,7,0,0%,9.5,10.0,10.0,8.0,9.4,
20,HLM,Hotline Miami,Oct-12,,Action,Top-down Shooter,Dennaton Games,Devolver Digital,SWE,Multi,...,M,2,0,0%,8.6,9.0,8.8,8.5,8.7,
21,JRN,Journey,Mar-12,,Adventure,Adventure,Thatgamecompany,Sony Computer Entertainment,USA,PS3,...,M,7,1,14%,9.2,9.0,9.0,9.5,9.2,
24,TN,Tunic,Sep-22,,Action-adventure,Action-adventure,Andrew Shouldice,Finji,CAN,Multi,...,M,3,0,0%,8.5,9.0,9.0,9.0,8.9,
25,ER,Elden Ring,Feb-22,,RPG,Action RPG,FromSoftware,Bandai Namco Entertainment,JPN,Multi,...,M,6,0,0%,9.5,10.0,10.0,10.0,9.9,


In [8]:
# Replace the null values in the 'Series' column with 'None'
games.fillna({'Series': 'None'}, inplace=True)
games.isnull().sum()

Game_Id                  0
Title                    0
Release                  0
Series                   0
Genre                    0
Sub-genre                0
Developer                0
Publisher                0
Country                  0
Platform                 0
PEGI                     0
Customizable_main        0
Protagonist              0
Protagonist_Non_Male     0
Relevant_males           0
Relevant_no_males        0
Percentage_non_male      0
Criteria                 0
Director                 0
Total_team               0
female_team              0
Team_percentage          0
Metacritic               0
Destructoid              7
IGN                      2
GameSpot                 2
Avg_Reviews              0
Unnamed: 27             64
dtype: int64

Next, the 'Unnamed: 27' column will be dropped, as all 64 of its values are null.

In [9]:
# Drop the 'Unnamed' column
games.drop(columns=['Unnamed: 27'], inplace=True)
games.head()

Unnamed: 0,Game_Id,Title,Release,Series,Genre,Sub-genre,Developer,Publisher,Country,Platform,...,Criteria,Director,Total_team,female_team,Team_percentage,Metacritic,Destructoid,IGN,GameSpot,Avg_Reviews
0,GTAV,Grand Theft Auto V,Nov-13,GTA,Action-adventure,Action-adventure,Rockstar North,Rockstar Games,GBR,Multi,...,MS,M,7,0,0%,9.7,9.0,10.0,9.0,9.4
1,PSS,Pokémon Sword/Shield,Nov-19,Pokémon,RPG,RPG,Game Freak,Nintendo,JPN,Nintendo Switch,...,MS,M,9,1,11%,8.0,7.0,9.3,9.0,8.3
2,CODMW,Call of Duty: Modern Warfare,Oct-19,Call of Duty,Action,FPS,Infinity Ward,Activision,USA,Multi,...,MS,M,11,0,0%,8.0,8.0,8.0,7.0,7.8
3,RDR2,Red Dead Redemption 2,Dec-18,Red Dead,Action-adventure,Action-adventure,Rockstar Studios,Rockstar Games,USA,Multi,...,SR,M,7,0,0%,9.7,9.5,10.0,9.0,9.6
4,SMO,Super Mario Odyssey,Oct-17,Super Mario,Action-adventure,Action-adventure,Nintendo EDP,Nintendo,JPN,Nintendo Switch,...,SR,M,11,1,9%,9.7,9.5,10.0,10.0,9.8


In [10]:
games.isnull().sum()

Game_Id                 0
Title                   0
Release                 0
Series                  0
Genre                   0
Sub-genre               0
Developer               0
Publisher               0
Country                 0
Platform                0
PEGI                    0
Customizable_main       0
Protagonist             0
Protagonist_Non_Male    0
Relevant_males          0
Relevant_no_males       0
Percentage_non_male     0
Criteria                0
Director                0
Total_team              0
female_team             0
Team_percentage         0
Metacritic              0
Destructoid             7
IGN                     2
GameSpot                2
Avg_Reviews             0
dtype: int64

Next, the review columns will be checked for missing values. Because the average of the reviews has already been calculated ('Avg_Reviews'), the missing values in the individual review columns will be left as null.
Had the average not already been calculated, the missing values could have been replaced with the average of the game's other review scores.

In [11]:
# Check null values for gaming reviews (Destructoid, IGN, GameSpot, Metacritic)
games[['Metacritic', 'Destructoid', 'IGN', 'GameSpot']].isnull().sum()

Metacritic     0
Destructoid    7
IGN            2
GameSpot       2
dtype: int64

In [12]:
# Print the rows with null reviews values
games_with_null_reviews = games[games[['Metacritic', 'Destructoid', 'IGN', 'GameSpot']].isnull().any(axis=1)]
games_with_null_reviews

Unnamed: 0,Game_Id,Title,Release,Series,Genre,Sub-genre,Developer,Publisher,Country,Platform,...,Criteria,Director,Total_team,female_team,Team_percentage,Metacritic,Destructoid,IGN,GameSpot,Avg_Reviews
8,CODBO2,Call of Duty: Black Ops 2,Nov-12,Call of Duty,Action,FPS,Treyarch,Activision,USA,Multi,...,MS,M,10,0,0%,8.0,,9.3,8.0,8.4
17,PP,"Papers, Please",Aug-13,,Simulation,Puzzle Simulation,3909 LLC,3909 LLC,USA,Multi,...,TR,M,1,0,0%,8.9,,8.7,8.0,8.5
31,IWATE,I Was a Teenage Exocolonist,Aug-22,,RPG,Strategy RPG,Northway Gates,Finji,CAN,Multi,...,TR,F,6,4,67%,8.9,,,,8.9
36,ITT,It Takes Two,Mar-21,,Action-adventure,Action-adventure,Hazelight Studios,EA,SWE,Multi,...,TR,M,5,1,20%,8.9,,9.0,9.0,9.0
42,TINGWD,There is No Game: Wrong Dimension,Aug-20,,Interactive Story,Interactive Story,Draw Me A Pixel,Draw Me A Pixel,FRA,Computer,...,TR,M,6,2,33%,8.8,,,9.6,9.2
50,DC,Dead Cells,Aug-18,,Action-adventure,Metroidvania,Motion Twin,Motion Twin,FRA,Multi,...,TR,M,2,0,0%,8.8,,9.5,9.0,9.1
58,DS2,Dark Souls 2,Mar-14,Dark Souls,RPG,Action RPG,FromSoftware,Bandai Namco Games,JPN,Multi,...,TR,M,7,0,0%,9.1,,9.0,9.0,9.0
61,FLR,Florence,Feb-18,,Interactive Story,Interactive Story,Mountains,Annapurna Interactive,USA,Mobile,...,TR,M,2,0,0%,8.7,8.0,9.6,,8.8
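The decision to leave the individual review scores as null rests on 'Avg_Reviews' already accounting for them. A minimal sketch of how this could be checked is given below, using three rows copied from the output above. It assumes, without confirmation from the dataset's author, that the average is the mean of the available scores rounded to one decimal place.

```python
import numpy as np
import pandas as pd

review_cols = ["Metacritic", "Destructoid", "IGN", "GameSpot"]

# Three rows copied from the output above (missing scores as NaN).
toy = pd.DataFrame(
    {
        "Game_Id": ["CODBO2", "PP", "IWATE"],
        "Metacritic": [8.0, 8.9, 8.9],
        "Destructoid": [np.nan, np.nan, np.nan],
        "IGN": [9.3, 8.7, np.nan],
        "GameSpot": [8.0, 8.0, np.nan],
        "Avg_Reviews": [8.4, 8.5, 8.9],
    }
)

# mean(axis=1) skips NaN by default, so only available scores are averaged.
row_mean = toy[review_cols].mean(axis=1).round(1)

# Compare the recomputed mean with the stored average.
matches = np.isclose(row_mean, toy["Avg_Reviews"])
print(toy.assign(Recomputed=row_mean, Matches=matches))
```

On these three rows the recomputed values agree with the stored averages, which is consistent with the assumption but does not prove it for the full dataset.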


Next, the dataset will be checked for duplicates.

In [13]:
# Find duplicates in the dataset
duplicates = games[games.duplicated()]
duplicates

Unnamed: 0,Game_Id,Title,Release,Series,Genre,Sub-genre,Developer,Publisher,Country,Platform,...,Criteria,Director,Total_team,female_team,Team_percentage,Metacritic,Destructoid,IGN,GameSpot,Avg_Reviews


No duplicates were found in the dataset. The dataset is now clean and ready to be combined with the other datasets.
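Note that `duplicated()` with no arguments only flags rows that are identical in every column. Because 'Game_Id' is later used as a merge key, it may also be worth checking that the key itself is unique. The sketch below illustrates the difference on a hypothetical toy frame (the repeated 'GTAV' row is invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame: the key 'GTAV' appears twice with different titles.
toy = pd.DataFrame(
    {
        "Game_Id": ["GTAV", "PSS", "GTAV"],
        "Title": ["Grand Theft Auto V", "Pokémon Sword/Shield", "GTA V"],
    }
)

# Full-row check: finds nothing, because no two rows match in every column.
full_row_dupes = int(toy.duplicated().sum())

# Key-only check: keep=False returns every row that shares a key.
key_dupes = toy[toy.duplicated(subset="Game_Id", keep=False)]

print(full_row_dupes)
print(key_dupes)
print(toy["Game_Id"].is_unique)
```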

##### Characters dataset

The characters dataset contains information on the characters in the video games.

First, the dataset will be loaded and reviewed.

In [14]:
# Load the characters dataset
characters = pd.read_csv('data/Gender_In_Video_Games/characters.grivg.csv')
characters.head()

Unnamed: 0,Name,Gender,Game,Age,Age_range,Playable,Sexualization,Id,Species,Side,Relevance,Romantic_Interest
0,Farah,Female,CODMW,27,Adult,1,0,CODMW_Farah,Human,P,PA,No
1,Protagonist,Custom,PSS,Teenager,Teenager,1,0,PSS_Protagonist,Human,P,PA,No
2,Magnolia,Female,PSS,Elderly,Elderly,0,0,PSS_Magnolia,Human,P,SC,No
3,Sonia,Female,PSS,26,Adult,0,0,PSS_Sonia,Human,P,SC,No
4,Marnie,Female,PSS,Teenager,Teenager,0,0,PSS_Marnie,Human,B,MC,No


The characters dataset has been loaded successfully. Its shape and column names are checked next.

In [15]:
# Check the shape of the dataset
print(f"Number of rows: {characters.shape[0]}")
print(f"Number of columns: {characters.shape[1]}")

Number of rows: 637
Number of columns: 12


In [16]:
# Check the columns in the dataset
characters.columns

Index(['Name', 'Gender', 'Game', 'Age', 'Age_range', 'Playable',
       'Sexualization', 'Id', 'Species', 'Side', 'Relevance',
       'Romantic_Interest'],
      dtype='object')

The dataset has 12 columns containing data related to the characters in the video games. The columns will be reviewed to check for any missing values or duplicates.

In [17]:
# Check for missing values
characters.isnull().sum()

Name                 0
Gender               0
Game                 0
Age                  0
Age_range            0
Playable             0
Sexualization        0
Id                   0
Species              0
Side                 0
Relevance            0
Romantic_Interest    0
dtype: int64

In [18]:
# Find duplicates in the dataset
duplicates = characters[characters.duplicated()]
duplicates

Unnamed: 0,Name,Gender,Game,Age,Age_range,Playable,Sexualization,Id,Species,Side,Relevance,Romantic_Interest


No missing values or duplicates were found in the dataset. The dataset is clean and ready to be combined with the other datasets.

##### Sexualization dataset

The sexualization dataset contains information on the sexualization of the characters in the video games. First, the dataset will be loaded and reviewed.

In [19]:
# Load the sexualization dataset
sexualization = pd.read_csv('data/Gender_In_Video_Games/sexualization.grivg.csv')
sexualization.head()

Unnamed: 0,Id,Sexualized_clothing,Trophy,Damsel in Distress,Sexualized Cutscenes,Total
0,CODMW_Farah,0,0,0,0,0
1,PSS_Protagonist,0,0,0,0,0
2,PSS_Magnolia,0,0,0,0,0
3,PSS_Sonia,0,0,0,0,0
4,PSS_Marnie,0,0,0,0,0


The sexualization dataset has been loaded successfully. Its shape and column names are checked next.

In [20]:
# Check the shape of the dataset
print(f"Number of rows: {sexualization.shape[0]}")
print(f"Number of columns: {sexualization.shape[1]}")

Number of rows: 637
Number of columns: 6


In [21]:
# Check the columns in the dataset
sexualization.columns

Index(['Id', 'Sexualized_clothing', 'Trophy', 'Damsel in Distress',
       'Sexualized Cutscenes', 'Total'],
      dtype='object')

Next, the dataset will be checked for missing values and duplicates.

In [22]:
# Check for missing values
sexualization.isnull().sum()

Id                      0
Sexualized_clothing     0
Trophy                  0
Damsel in Distress      0
Sexualized Cutscenes    0
Total                   0
dtype: int64

In [23]:
# Find duplicates in the dataset
duplicates = sexualization[sexualization.duplicated()]
duplicates

Unnamed: 0,Id,Sexualized_clothing,Trophy,Damsel in Distress,Sexualized Cutscenes,Total


No missing values or duplicates were found in the dataset. The dataset is clean and ready to be combined with the other datasets.

##### Combining the datasets
The datasets will be combined using the common keys.

First, the characters and games datasets will be combined on the 'Game' and 'Game_Id' columns.

In [24]:
# Step 1: Merge Characters and Games datasets
data = pd.merge(characters, games, left_on='Game', right_on='Game_Id', how='inner')
data.head()

Unnamed: 0,Name,Gender,Game,Age,Age_range,Playable,Sexualization,Id,Species,Side,...,Criteria,Director,Total_team,female_team,Team_percentage,Metacritic,Destructoid,IGN,GameSpot,Avg_Reviews
0,Farah,Female,CODMW,27,Adult,1,0,CODMW_Farah,Human,P,...,MS,M,11,0,0%,8.0,8.0,8.0,7.0,7.8
1,Protagonist,Custom,PSS,Teenager,Teenager,1,0,PSS_Protagonist,Human,P,...,MS,M,9,1,11%,8.0,7.0,9.3,9.0,8.3
2,Magnolia,Female,PSS,Elderly,Elderly,0,0,PSS_Magnolia,Human,P,...,MS,M,9,1,11%,8.0,7.0,9.3,9.0,8.3
3,Sonia,Female,PSS,26,Adult,0,0,PSS_Sonia,Human,P,...,MS,M,9,1,11%,8.0,7.0,9.3,9.0,8.3
4,Marnie,Female,PSS,Teenager,Teenager,0,0,PSS_Marnie,Human,B,...,MS,M,9,1,11%,8.0,7.0,9.3,9.0,8.3
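An inner merge silently drops any character whose 'Game' value has no match in 'Game_Id'. As a hedged sketch of how this could be checked before committing to the inner join, the example below uses the `validate` and `indicator` parameters of `pd.merge` on hypothetical toy frames (the 'Ghost' character and the 'XXX' game code are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frames; 'XXX' has no matching game on purpose.
characters_toy = pd.DataFrame(
    {"Name": ["Farah", "Sonia", "Ghost"], "Game": ["CODMW", "PSS", "XXX"]}
)
games_toy = pd.DataFrame(
    {"Game_Id": ["CODMW", "PSS"], "Title": ["Call of Duty: Modern Warfare", "Pokémon Sword/Shield"]}
)

# validate raises MergeError if a Game_Id appears more than once in games_toy;
# indicator adds a '_merge' column showing where each row came from.
check = pd.merge(
    characters_toy,
    games_toy,
    left_on="Game",
    right_on="Game_Id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows marked 'left_only' would be lost in an inner merge.
unmatched = check[check["_merge"] == "left_only"]
print(unmatched[["Name", "Game"]])
```

If `unmatched` is empty on the real data, the inner merge loses no characters, which the unchanged row count of 637 also suggests.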


Next, the combined dataset and the sexualization dataset will be merged on the 'Id' column.

In [25]:
# Step 2: Merge the combined dataset with the Sexualization dataset
data = pd.merge(data, sexualization, on='Id', how='inner')
data.head()

Unnamed: 0,Name,Gender,Game,Age,Age_range,Playable,Sexualization,Id,Species,Side,...,Metacritic,Destructoid,IGN,GameSpot,Avg_Reviews,Sexualized_clothing,Trophy,Damsel in Distress,Sexualized Cutscenes,Total
0,Farah,Female,CODMW,27,Adult,1,0,CODMW_Farah,Human,P,...,8.0,8.0,8.0,7.0,7.8,0,0,0,0,0
1,Protagonist,Custom,PSS,Teenager,Teenager,1,0,PSS_Protagonist,Human,P,...,8.0,7.0,9.3,9.0,8.3,0,0,0,0,0
2,Magnolia,Female,PSS,Elderly,Elderly,0,0,PSS_Magnolia,Human,P,...,8.0,7.0,9.3,9.0,8.3,0,0,0,0,0
3,Sonia,Female,PSS,26,Adult,0,0,PSS_Sonia,Human,P,...,8.0,7.0,9.3,9.0,8.3,0,0,0,0,0
4,Marnie,Female,PSS,Teenager,Teenager,0,0,PSS_Marnie,Human,B,...,8.0,7.0,9.3,9.0,8.3,0,0,0,0,0


Next, the shape of the combined dataset will be checked, and the columns will be reviewed.

In [26]:
# Check the shape of the combined dataset
print(f"Number of rows: {data.shape[0]}")
print(f"Number of columns: {data.shape[1]}")

Number of rows: 637
Number of columns: 44


In [27]:
# Check the columns in the combined dataset
data.columns

Index(['Name', 'Gender', 'Game', 'Age', 'Age_range', 'Playable',
       'Sexualization', 'Id', 'Species', 'Side', 'Relevance',
       'Romantic_Interest', 'Game_Id', 'Title', 'Release', 'Series', 'Genre',
       'Sub-genre', 'Developer', 'Publisher', 'Country', 'Platform', 'PEGI',
       'Customizable_main', 'Protagonist', 'Protagonist_Non_Male',
       'Relevant_males', 'Relevant_no_males', 'Percentage_non_male',
       'Criteria', 'Director', 'Total_team', 'female_team', 'Team_percentage',
       'Metacritic', 'Destructoid', 'IGN', 'GameSpot', 'Avg_Reviews',
       'Sexualized_clothing', 'Trophy', 'Damsel in Distress',
       'Sexualized Cutscenes', 'Total'],
      dtype='object')

The combined dataset contains 44 columns. The columns will be reviewed to determine whether any can be dropped: where two columns contain the same information, one of them will be dropped.

The 'Sexualization' column is the same as the 'Total' column. The 'Sexualization' column will be dropped and the 'Total' column will be renamed to 'Sexualization_total_score'.

In [28]:
# Check the 'Sexualization' column is the same as the 'Total' column
data['Sexualization'].equals(data['Total'])

True

In [29]:
# Drop the 'Sexualization' column
data.drop(columns=['Sexualization'], inplace=True)
data.columns

Index(['Name', 'Gender', 'Game', 'Age', 'Age_range', 'Playable', 'Id',
       'Species', 'Side', 'Relevance', 'Romantic_Interest', 'Game_Id', 'Title',
       'Release', 'Series', 'Genre', 'Sub-genre', 'Developer', 'Publisher',
       'Country', 'Platform', 'PEGI', 'Customizable_main', 'Protagonist',
       'Protagonist_Non_Male', 'Relevant_males', 'Relevant_no_males',
       'Percentage_non_male', 'Criteria', 'Director', 'Total_team',
       'female_team', 'Team_percentage', 'Metacritic', 'Destructoid', 'IGN',
       'GameSpot', 'Avg_Reviews', 'Sexualized_clothing', 'Trophy',
       'Damsel in Distress', 'Sexualized Cutscenes', 'Total'],
      dtype='object')

In [30]:
# Rename 'Total' column to 'Sexualization_total_score'
data.rename(columns={'Total': 'Sexualization_total_score'}, inplace=True)
data.columns

Index(['Name', 'Gender', 'Game', 'Age', 'Age_range', 'Playable', 'Id',
       'Species', 'Side', 'Relevance', 'Romantic_Interest', 'Game_Id', 'Title',
       'Release', 'Series', 'Genre', 'Sub-genre', 'Developer', 'Publisher',
       'Country', 'Platform', 'PEGI', 'Customizable_main', 'Protagonist',
       'Protagonist_Non_Male', 'Relevant_males', 'Relevant_no_males',
       'Percentage_non_male', 'Criteria', 'Director', 'Total_team',
       'female_team', 'Team_percentage', 'Metacritic', 'Destructoid', 'IGN',
       'GameSpot', 'Avg_Reviews', 'Sexualized_clothing', 'Trophy',
       'Damsel in Distress', 'Sexualized Cutscenes',
       'Sexualization_total_score'],
      dtype='object')

In [31]:
# Check 'Game' and 'Game_Id' columns are the same
data['Game'].equals(data['Game_Id'])

True

In [32]:
# Drop the 'Game' column
data.drop(columns=['Game'], inplace=True)
data.columns

Index(['Name', 'Gender', 'Age', 'Age_range', 'Playable', 'Id', 'Species',
       'Side', 'Relevance', 'Romantic_Interest', 'Game_Id', 'Title', 'Release',
       'Series', 'Genre', 'Sub-genre', 'Developer', 'Publisher', 'Country',
       'Platform', 'PEGI', 'Customizable_main', 'Protagonist',
       'Protagonist_Non_Male', 'Relevant_males', 'Relevant_no_males',
       'Percentage_non_male', 'Criteria', 'Director', 'Total_team',
       'female_team', 'Team_percentage', 'Metacritic', 'Destructoid', 'IGN',
       'GameSpot', 'Avg_Reviews', 'Sexualized_clothing', 'Trophy',
       'Damsel in Distress', 'Sexualized Cutscenes',
       'Sexualization_total_score'],
      dtype='object')

Next, the columns will be renamed to make them more descriptive and easier to understand.

In [33]:
# Rename columns
data.rename(columns=
            {
                'Name': 'Character_name',
                'Gender': 'Character_gender',
                'Age': 'Character_age',
                'Age_range': 'Character_age_range',
                'Playable': 'Character_playable',
                'Id': 'Character_id',
                'Species': 'Character_species',
                'Side': 'Character_side',
                'Relevance': 'Character_relevance',
                'Romantic_Interest': 'Character_romantic_interest',
                'Game_Id': 'Game_id',
                'Title': 'Game_title',
                'Release': 'Game_release',
                'Series': 'Game_series',
                'Genre': 'Game_genre',
                'Sub-genre': 'Game_sub_genre',
                'Developer': 'Game_developer',
                'Publisher': 'Game_publisher',
                'Country': 'Country_of_game_developer',
                'Platform': 'Platform',
                'PEGI': 'PEGI_rating',
                'Customizable_main': 'Customizable_main_character',
                'Protagonist': 'Protagonist_characters',
                'Protagonist_Non_Male': 'Protagonist_non_male_characters',
                'Relevant_males': 'Relevant_male_characters',
                'Relevant_no_males': 'Relevant_no_male_characters',
                'Percentage_non_male': 'Percentage_non_male_characters',
                'Criteria': 'Criteria',
                'Director': 'Director',
                'Total_team': 'Total_team',
                'female_team': 'Female_team',
                'Team_percentage': 'Women_in_team_percentage',
                'Metacritic': 'Metacritic_review',
                'Destructoid': 'Destructoid_review',
                'IGN': 'IGN_review',
                'GameSpot': 'GameSpot_review',
                'Avg_Reviews': 'Average_reviews',
                'Sexualized_clothing': 'Sexualized_clothing_score',
                'Trophy': 'Character_trophy_score',
                'Damsel in Distress': 'Damsel_in_Distress_score',
                'Sexualized Cutscenes': 'Sexualized_Cutscenes_score',
                'Sexualization_total_score': 'Sexualization_total_score'
            },
            inplace=True)
data.columns

Index(['Character_name', 'Character_gender', 'Character_age',
       'Character_age_range', 'Character_playable', 'Character_id',
       'Character_species', 'Character_side', 'Character_relevance',
       'Character_romantic_interest', 'Game_id', 'Game_title', 'Game_release',
       'Game_series', 'Game_genre', 'Game_sub_genre', 'Game_developer',
       'Game_publisher', 'Country_of_game_developer', 'Platform',
       'PEGI_rating', 'Customizable_main_character', 'Protagonist_characters',
       'Protagonist_non_male_characters', 'Relevant_male_characters',
       'Relevant_no_male_characters', 'Percentage_non_male_characters',
       'Criteria', 'Director', 'Total_team', 'Female_team',
       'Women_in_team_percentage', 'Metacritic_review', 'Destructoid_review',
       'IGN_review', 'GameSpot_review', 'Average_reviews',
       'Sexualized_clothing_score', 'Character_trophy_score',
       'Damsel_in_Distress_score', 'Sexualized_Cutscenes_score',
       'Sexualization_total_score'],
      dtype='object')

Next, the dataset data types will be reviewed.

Most of the columns are expected to be of type 'object' as they contain string values, however some columns may need to be converted to a different data type.



In [34]:
# Check the data types of the columns
data.dtypes

Character_name                      object
Character_gender                    object
Character_age                       object
Character_age_range                 object
Character_playable                   int64
Character_id                        object
Character_species                   object
Character_side                      object
Character_relevance                 object
Character_romantic_interest         object
Game_id                             object
Game_title                          object
Game_release                        object
Game_series                         object
Game_genre                          object
Game_sub_genre                      object
Game_developer                      object
Game_publisher                      object
Country_of_game_developer           object
Platform                            object
PEGI_rating                          int64
Customizable_main_character         object
Protagonist_characters               int64
Protagonist

In [35]:
# Convert and check data types of the columns are as expected
expected_data_types = {
                'Character_name': 'object',
                'Character_gender': 'category',
                'Character_age': 'object',
                'Character_age_range': 'category',
                'Character_playable': 'bool',
                'Character_id': 'object',
                'Character_species': 'category',
                'Character_side': 'category',
                'Character_relevance': 'category',
                'Character_romantic_interest': 'category',
                'Game_id': 'object',
                'Game_title': 'object',
                'Game_release': 'datetime64[ns]',
                'Game_series': 'category',
                'Game_genre': 'category',
                'Game_sub_genre': 'category',
                'Game_developer': 'category',
                'Game_publisher': 'category',
                'Country_of_game_developer': 'category',
                'Platform': 'category',
                'PEGI_rating': 'category',
                'Customizable_main_character': 'category',
                'Protagonist_characters': 'int64',
                'Protagonist_non_male_characters': 'int64',
                'Relevant_male_characters': 'int64',
                'Relevant_no_male_characters': 'int64',
                'Percentage_non_male_characters': 'float64',
                'Criteria': 'category',
                'Director': 'category',
                'Total_team': 'int64',
                'Female_team': 'int64',
                'Women_in_team_percentage': 'float64',
                'Metacritic_review': 'float64',
                'Destructoid_review': 'float64',
                'IGN_review': 'float64',
                'GameSpot_review': 'float64',
                'Average_reviews': 'float64',
                'Sexualized_clothing_score': 'bool',
                'Character_trophy_score': 'bool',
                'Damsel_in_Distress_score': 'bool',
                'Sexualized_Cutscenes_score': 'bool',
                'Sexualization_total_score': 'int64'
            }

for column, expected_type in expected_data_types.items():
    try:
        data[column] = data[column].astype(expected_type)
    except Exception as e:
        print(f"An error occurred while converting column '{column}' to type '{expected_type}': {e}\n")


An error occurred while converting column 'Game_release' to type 'datetime64[ns]': Out of bounds nanosecond timestamp: Oct-19, at position 0

An error occurred while converting column 'Percentage_non_male_characters' to type 'float64': could not convert string to float: '17%'

An error occurred while converting column 'Women_in_team_percentage' to type 'float64': could not convert string to float: '0%'



The output above shows that all conversions succeeded except for 'Game_release', 'Percentage_non_male_characters' and 'Women_in_team_percentage'.

- 'Game_release' is expected to be of type 'datetime64[ns]' but is currently of type 'object'. The conversion raised an error, most likely because values such as 'Oct-19' (abbreviated month and two-digit year) cannot be parsed without an explicit format.
- 'Percentage_non_male_characters' is expected to be of type 'float64' but is currently of type 'object'. The '%' character must be removed before the conversion.
- 'Women_in_team_percentage' is expected to be of type 'float64' but is currently of type 'object'. The '%' character must be removed before the conversion.
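The list of failed columns above was read off the printed error messages. As an alternative sketch, the failures could be collected in a dictionary so that they can be inspected programmatically. The toy frame and its column names ('Playable', 'Pct') are hypothetical and are used only for illustration:

```python
import pandas as pd

# Hypothetical toy frame: 'Playable' converts cleanly, 'Pct' does not.
toy = pd.DataFrame({"Playable": [1, 0], "Pct": ["17%", "0%"]})
expected = {"Playable": "bool", "Pct": "float64"}

failed = {}
for column, dtype in expected.items():
    try:
        toy[column] = toy[column].astype(dtype)
    except (ValueError, TypeError) as exc:
        # Record the message instead of printing it immediately.
        failed[column] = str(exc)

# Columns listed here still need manual cleaning before conversion.
print(list(failed))
print(toy.dtypes)
```

Catching `ValueError` and `TypeError` instead of a bare `Exception` is a design choice in this sketch: it keeps unrelated problems, such as a misspelled column name raising `KeyError`, from being hidden.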

First, the '%' character will be removed from the 'Percentage_non_male_characters' and 'Women_in_team_percentage' columns using the 'str.rstrip' string method, and the result will be converted to 'float'.

In [36]:
# Remove the '%' from the percentage columns
data['Percentage_non_male_characters'] = data['Percentage_non_male_characters'].str.rstrip('%').astype(float)
data['Women_in_team_percentage'] = data['Women_in_team_percentage'].str.rstrip('%').astype(float)
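After this step the values remain on a 0 to 100 scale rather than a 0 to 1 scale. The sketch below, on a hypothetical toy series, shows the same conversion together with a simple range check and an optional rescaling to fractions (the rescaling is an assumption about what later analysis might need, not something the notebook does):

```python
import pandas as pd

# Hypothetical toy series in the same 'NN%' format as the real columns.
pct = pd.Series(["17%", "0%", "67%"], name="pct_toy")

# Strip the trailing '%' and convert to float (still on a 0-100 scale).
as_float = pct.str.rstrip("%").astype(float)

# Sanity check: every percentage should lie between 0 and 100.
in_range = bool(as_float.between(0, 100).all())

# Optional: rescale to fractions if a 0-1 scale is preferred.
as_fraction = as_float / 100

print(as_float.tolist(), in_range)
```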


Next, the 'Game_release' column will be converted to datetime with a specified format.

In [37]:
# Convert the 'Game_release' column to datetime with a specified format
data['Game_release'] = pd.to_datetime(data['Game_release'], format='%b-%y', errors='coerce')

# Verify the data type conversion
print(data['Game_release'].dtype)

datetime64[ns]


This appears to have been successful, but the values will be checked to ensure that the conversion has been done correctly.

In [38]:
# Check the converted values of the 'Game_release' column
game_releases = data['Game_release']
game_releases.head()

0   2019-10-01
1   2019-11-01
2   2019-11-01
3   2019-11-01
4   2019-11-01
Name: Game_release, dtype: datetime64[ns]
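Because `errors='coerce'` replaces any value that does not match the format with `NaT` instead of raising an error, inspecting only the first five rows cannot rule out silent failures. A minimal sketch of a fuller check is shown below on a hypothetical toy series, where the third value is deliberately in the wrong format:

```python
import pandas as pd

# Hypothetical toy series; '2019-10' does not match the '%b-%y' format.
raw = pd.Series(["Oct-19", "Nov-13", "2019-10"])

# Same call as in the notebook: unparseable values become NaT.
# Note: '%b' matches month abbreviations in the current locale.
parsed = pd.to_datetime(raw, format="%b-%y", errors="coerce")

# Any original value whose parsed result is NaT failed to convert.
failed = raw[parsed.isna()]

print(parsed)
print(failed.tolist())
```

On the real data, the equivalent check would be to count `data['Game_release'].isna().sum()` after the conversion; a count of zero would indicate that every release date was parsed.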

In [40]:
data.dtypes

Character_name                             object
Character_gender                         category
Character_age                              object
Character_age_range                      category
Character_playable                           bool
Character_id                               object
Character_species                        category
Character_side                           category
Character_relevance                      category
Character_romantic_interest              category
Game_id                                    object
Game_title                                 object
Game_release                       datetime64[ns]
Game_series                              category
Game_genre                               category
Game_sub_genre                           category
Game_developer                           category
Game_publisher                           category
Country_of_game_developer                category
Platform                                 category


The 'Game_release' column has been converted successfully. The dataset is now clean and ready for analysis.

In [39]:
data.head()

Unnamed: 0,Character_name,Character_gender,Character_age,Character_age_range,Character_playable,Character_id,Character_species,Character_side,Character_relevance,Character_romantic_interest,...,Metacritic_review,Destructoid_review,IGN_review,GameSpot_review,Average_reviews,Sexualized_clothing_score,Character_trophy_score,Damsel_in_Distress_score,Sexualized_Cutscenes_score,Sexualization_total_score
0,Farah,Female,27,Adult,True,CODMW_Farah,Human,P,PA,No,...,8.0,8.0,8.0,7.0,7.8,False,False,False,False,0
1,Protagonist,Custom,Teenager,Teenager,True,PSS_Protagonist,Human,P,PA,No,...,8.0,7.0,9.3,9.0,8.3,False,False,False,False,0
2,Magnolia,Female,Elderly,Elderly,False,PSS_Magnolia,Human,P,SC,No,...,8.0,7.0,9.3,9.0,8.3,False,False,False,False,0
3,Sonia,Female,26,Adult,False,PSS_Sonia,Human,P,SC,No,...,8.0,7.0,9.3,9.0,8.3,False,False,False,False,0
4,Marnie,Female,Teenager,Teenager,False,PSS_Marnie,Human,B,MC,No,...,8.0,7.0,9.3,9.0,8.3,False,False,False,False,0


#### Question : ...
...


### Significance Testing
...


### Interpretation
...

---

## Part B: Machine Learning models