# Feature Engineering

This notebook applies several feature engineering steps to maximise the model's performance.

#### Feature Engineering Process
- Deriving new variables from existing ones
- Encoding categorical features
- Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming features (dividing data into categories, mathematical transformations, logarithmic transformations)
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data

Loading Dependencies

In [1]:
import pandas as pd

Storing the data as a local variable (pandas DataFrame)

In [2]:

data_frame = pd.read_csv("wrangled_data2.csv")

#### Deriving new variables from existing ones

##### Encoding Categorical Data

Encoding converts textual data into a numerical format for input into the algorithm, since most machine learning algorithms operate on numerical values rather than text.

For Genre, the best encoding method is one-hot encoding, which generates one new boolean column per genre in the original data. Each column shows whether a row belongs to that genre.

In [3]:
data_frame = pd.get_dummies(data_frame, columns=['Genre'])
data_frame.head()

| Unnamed: 0 | Rank | Name | Platform | Year | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | ... | Genre_Fighting | Genre_Misc | Genre_Platform | Genre_Puzzle | Genre_Racing | Genre_Role-Playing | Genre_Shooter | Genre_Simulation | Genre_Sports | Genre_Strategy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3565 | The Incredible Hulk | GBA | 2003.0 | Universal Interactive | 0.8 | 0.75 | 0.0 | 0.2 | 0.466667 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 3566 | Ms. Pac-Man Maze Madness | GBA | 2004.0 | Zoo Digital Publishing | 0.8 | 0.75 | 0.0 | 0.2 | 0.466667 | ... | False | False | False | True | False | False | False | False | False | False |
| 2 | 3568 | The Lord of the Rings: The Return of the King | GBA | 2003.0 | Electronic Arts | 0.8 | 0.75 | 0.0 | 0.2 | 0.466667 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 3571 | The Sims: Bustin' Out | XB | 2003.0 | Electronic Arts | 0.8 | 0.7 | 0.0 | 0.4 | 0.466667 | ... | False | False | False | False | False | False | False | True | False | False |
| 4 | 3576 | Yu-Gi-Oh! World Championship Tournament 2004 | GBA | 2004.0 | Konami Digital Entertainment | 0.8 | 0.75 | 0.0 | 0.2 | 0.466667 | ... | False | True | False | False | False | False | False | False | False | False |



#### Combining features/feature interactions

Interactions between features often carry additional information. This section is reserved for feature interaction engineering, where a single feature represents the interaction of several others.
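
As an illustration only, one simple way to build an interaction feature is to multiply two existing columns or take a ratio between them. The sketch below uses a small made-up frame that borrows column names from this data set; the `NA_EU_interaction` and `JP_share` columns are hypothetical and are not created anywhere in this notebook.

```python
import pandas as pd

# Toy rows with made-up values, using column names from this data set
toy = pd.DataFrame({
    "NA_Sales": [0.8, 0.4, 0.0],
    "EU_Sales": [0.75, 0.1, 0.2],
    "JP_Sales": [0.0, 0.1, 0.2],
})

# Product interaction: large only when a title sells well in both regions
toy["NA_EU_interaction"] = toy["NA_Sales"] * toy["EU_Sales"]

# Ratio interaction: share of these three regions' sales that came from Japan
regional_total = toy[["NA_Sales", "EU_Sales", "JP_Sales"]].sum(axis=1)
# 0/0 yields NaN for titles with no sales at all; treat that as a zero share
toy["JP_share"] = (toy["JP_Sales"] / regional_total).fillna(0.0)

print(toy)
```

Whether a product, a ratio, or some other combination is useful depends on the model; tree-based models can often find such interactions on their own, while linear models usually cannot.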


#### Transforming Features

This section is for transformations and filters applied to individual features, particularly when the work targets a specific subset of the data and the filter would affect the results.
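
The kinds of transformation listed at the top of this notebook (logarithmic transformations, dividing data into categories, filtering to a subset) could look roughly like the sketch below. The values, the bin edges, the year cut-off and the new column names `Global_Sales_log` and `sales_band` are all assumptions chosen for illustration, not settings used by this project.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values, using column names from this data set
toy = pd.DataFrame({
    "Year": [2003, 2004, 2003, 2010, 2006],
    "Global_Sales": [0.0, 0.05, 0.47, 2.5, 30.0],
})

# Logarithmic transformation: log1p(x) = log(1 + x) compresses the long
# right tail and is still defined when sales are zero
toy["Global_Sales_log"] = np.log1p(toy["Global_Sales"])

# Dividing data into categories: bin edges here are arbitrary examples
toy["sales_band"] = pd.cut(
    toy["Global_Sales"],
    bins=[-np.inf, 0.1, 1.0, np.inf],
    labels=["low", "mid", "high"],
)

# Filtering to a subset with a boolean mask (hypothetical cut-off year)
recent = toy[toy["Year"] >= 2004]

print(toy)
print(recent)
```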


#### Creating Domain-Specific Features

This is where knowledge of the data set's field allows new features, tailored to that field, to be created from existing ones.

Gaming is not equally popular in every region. This model predicts global (general) performance, so weighting each region by raw copies sold would naturally disadvantage regions with smaller markets.

This motivates a "popularity" column: for each region, the number of copies sold is divided by the size of the target market in that region, showing how well the title performed relative to its market. Averaging these regional ratios gives a single numerical value, expressed as a percentage, for how well the product performed.

The target market size (number of people who play games) for each region is taken from https://explodingtopics.com/blog/number-of-gamers, which reports current (2025) figures. These are scaled down to the number of gamers in 2016 to give the base numbers, and can additionally be divided by 1.2 to match the scaling of the data.

Global number of gamers: 2.17 billion in 2016 and 3.32 billion in 2025.

2025 stats

- Japan 55.5 million
- Europe 715 million
- North America 285 million
- Other 2188.5 million

2016 stats (scaled from 2025)

- Japan 36.276 million
- Europe 467.334 million
- North America 186.280 million
- Other 1430.435 million

2016 stats (divided by 1.2; might not be applied)

- Japan 30.23 million
- Europe 389.445 million
- North America 155.233 million
- Other 1192.029 million
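
The two derived lists above follow from simple proportional scaling, which can be reproduced with the short sketch below. The variable names are illustrative; the figures are the ones quoted in this section.

```python
# Regional gamer counts in 2025, in millions (figures quoted above)
gamers_2025 = {
    "Japan": 55.5,
    "Europe": 715.0,
    "North America": 285.0,
    "Other": 2188.5,
}

# Global gamer counts in billions, used only as a ratio
GLOBAL_2016 = 2.17
GLOBAL_2025 = 3.32
ratio = GLOBAL_2016 / GLOBAL_2025  # roughly 0.654

# Scale each 2025 regional figure back to 2016
gamers_2016 = {region: count * ratio for region, count in gamers_2025.items()}

# Optional further division by 1.2
gamers_2016_scaled = {region: count / 1.2 for region, count in gamers_2016.items()}

for region in gamers_2025:
    print(f"{region}: {gamers_2016[region]:.3f} / {gamers_2016_scaled[region]:.3f}")
```

This assumes every region grew at the same rate as the global total between 2016 and 2025, which is a simplification.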

In [4]:
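# Popularity: per-region sales over the 2016 market size (millions of gamers),
# averaged across the four regions and expressed as a percentage
data_frame['popularity'] = 100 * (
    (data_frame['NA_Sales'] / 0.5) / 186.28
    + (data_frame['EU_Sales'] / 0.2) / 467.334
    + (data_frame['JP_Sales'] / 0.1) / 36.276
    + (data_frame['Other_Sales'] / 0.05) / 1430.435
) / 4
print(data_frame['popularity'])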

0       0.485245
1       0.485245
2       0.485245
3       0.541780
4       0.485245
          ...   
9087    0.005368
9088    0.005368
9089    0.000000
9090    0.013374
9091    0.005368
Name: popularity, Length: 9092, dtype: float64
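
As a sanity check, the value for the first row can be reproduced by hand from its regional sales (NA 0.8, EU 0.75, JP 0.0, Other 0.2). The sketch below wraps the same arithmetic in a helper; the function name `popularity_pct` is hypothetical, and the per-region divisors (0.5, 0.2, 0.1, 0.05) are copied from the cell above as-is.

```python
# Market sizes in 2016 (millions of gamers), as listed earlier
MARKET_2016 = {"NA": 186.28, "EU": 467.334, "JP": 36.276, "Other": 1430.435}

# Per-region divisors copied unchanged from the popularity formula
SALES_DIVISOR = {"NA": 0.5, "EU": 0.2, "JP": 0.1, "Other": 0.05}


def popularity_pct(sales):
    """Average of (sales / divisor) / market size over regions, as a percentage."""
    ratios = [
        (sales[region] / SALES_DIVISOR[region]) / MARKET_2016[region]
        for region in MARKET_2016
    ]
    return 100 * sum(ratios) / len(ratios)


# Regional sales of the first row shown in the table earlier
first_row = {"NA": 0.8, "EU": 0.75, "JP": 0.0, "Other": 0.2}
print(round(popularity_pct(first_row), 6))
```

The result agrees with the first entry of the printed series (0.485245) to the displayed precision.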


Saving Data

In [6]:
data_frame.to_csv('../3.Model_Training/model_ready_data2.csv', index=False)