### __Python Data Wrangling: Pandas Feature Engineering__

Feature Engineering is a crucial step in data analysis, machine learning, and data science. It involves:

- Creating new features from existing data.

- Transforming variables to better represent the underlying problem.

- Encoding categorical variables, scaling numeric data, or deriving time-based features.



##### _Create new columns based on values from others_

In [6]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')
print(df.head())
print()
print(df.columns)

                       name platform  year_of_release         genre publisher  \
0                Wii Sports      Wii           2006.0        Sports  Nintendo   
1         Super Mario Bros.      NES           1985.0      Platform  Nintendo   
2            Mario Kart Wii      Wii           2008.0        Racing  Nintendo   
3         Wii Sports Resort      Wii           2009.0        Sports  Nintendo   
4  Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing  Nintendo   

  developer  na_sales  eu_sales  jp_sales  critic_score user_score  
0  Nintendo     41.36     28.96      3.77          76.0          8  
1       NaN     29.08      3.58      6.81           NaN        NaN  
2  Nintendo     15.68     12.76      3.79          82.0        8.3  
3  Nintendo     15.61     10.93      3.28          80.0          8  
4       NaN     11.27      8.89     10.22           NaN        NaN  

Index(['name', 'platform', 'year_of_release', 'genre', 'publisher',
       'developer', 'na_sales'

Notice that the DataFrame above includes sales from three regions: NA (North America), EU (Europe), and JP (Japan). There is no column with the combined sales, so we'll create one called 'total_sales' by adding the three regional columns together.

In [8]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')

df['total_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales']
print(df['total_sales'].head())
print()
print(df.columns)

0    74.09
1    39.47
2    32.23
3    29.82
4    30.38
Name: total_sales, dtype: float64

Index(['name', 'platform', 'year_of_release', 'genre', 'publisher',
       'developer', 'na_sales', 'eu_sales', 'jp_sales', 'critic_score',
       'user_score', 'total_sales'],
      dtype='object')


This works because arithmetic operators on pandas columns are vectorized: they are applied to entire columns at once, element by element, rather than requiring an explicit loop over each value. This makes the code both more efficient and more concise.

With this single line, we created a new column called 'total_sales' in the DataFrame. Each row of this column holds the sum of the sales in the three regions for that row.
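One detail worth knowing about row-wise sums is how missing values behave. The sketch below uses a small made-up table (the `sales` DataFrame and its third row are assumptions for illustration, not data from vg_sales.csv) to compare the `+` operator with `sum(axis=1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table standing in for the three regional sales columns
sales = pd.DataFrame({
    'na_sales': [41.36, 29.08, 1.00],
    'eu_sales': [28.96, 3.58, np.nan],
    'jp_sales': [3.77, 6.81, 0.50],
})

# Operator form: a NaN in any region makes the whole row total NaN
total_plus = sales['na_sales'] + sales['eu_sales'] + sales['jp_sales']

# Method form: sum(axis=1) adds across columns and skips NaN by default
total_sum = sales[['na_sales', 'eu_sales', 'jp_sales']].sum(axis=1)

print(total_plus)
print(total_sum)
```

Which behavior is preferable depends on whether a missing regional figure should make the total unknown or simply be ignored.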

We can leverage this method to create columns from useful formulas. For example, if we want to calculate the portion of total sales that comes from the EU, we can do so like this:

In [9]:
df['eu_sales_share'] = df['eu_sales'] / df['total_sales']
print(df['eu_sales_share'].head())

0    0.390876
1    0.090702
2    0.395904
3    0.366533
4    0.292627
Name: eu_sales_share, dtype: float64
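A ratio column like this can misbehave when the denominator is zero: pandas returns NaN for 0/0 and inf for a nonzero value divided by zero. The following is a minimal sketch of one way to guard against that, using a made-up `games` table (an assumption for illustration) in which one row has zero total sales.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one has no sales in any region
games = pd.DataFrame({
    'eu_sales': [28.96, 0.0, 2.0],
    'total_sales': [74.09, 0.0, 8.0],
})

# Replace zero totals with NaN so the share is reported as missing
safe_total = games['total_sales'].replace(0, np.nan)
games['eu_sales_share'] = games['eu_sales'] / safe_total

print(games)
```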


##### _Generate Boolean columns_

Imagine we want a column to indicate whether something is true. We can create it using the comparison operators ==, <, >=, etc. For example, let's create a column that checks whether the publisher is Nintendo:

In [12]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')

# create the is_nintendo column and populate it
df['is_nintendo'] = df['publisher'] == 'Nintendo'
print(df['is_nintendo'].head())
print()
print(df.columns)

0    True
1    True
2    True
3    True
4    True
Name: is_nintendo, dtype: bool

Index(['name', 'platform', 'year_of_release', 'genre', 'publisher',
       'developer', 'na_sales', 'eu_sales', 'jp_sales', 'critic_score',
       'user_score', 'is_nintendo'],
      dtype='object')
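Boolean columns are handy because True counts as 1 and False as 0 in arithmetic. As a small sketch on made-up rows (the `games` table and the 20-million threshold are assumptions for illustration), the example below counts matches, computes their share, and combines two conditions with `&`.

```python
import pandas as pd

# Hypothetical rows, including one with a missing publisher
games = pd.DataFrame({
    'publisher': ['Nintendo', 'Sega', 'Nintendo', None],
    'na_sales': [41.36, 2.0, 15.68, 0.5],
})

games['is_nintendo'] = games['publisher'] == 'Nintendo'

# sum() counts the True values; mean() gives the fraction of True values
nintendo_count = int(games['is_nintendo'].sum())
nintendo_share = games['is_nintendo'].mean()

# Combine conditions with & (and) or | (or); each condition needs parentheses
big_nintendo = (games['publisher'] == 'Nintendo') & (games['na_sales'] > 20)

print(nintendo_count, nintendo_share)
print(big_nintendo)
```

Note that a missing publisher compares as False rather than raising an error.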


Remember that we can also do this with the convenient isin() method, which checks whether each value is in a given list:

In [15]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')
print(df)
print()
# make sure you are comparing lowercase values
print(df['platform'].str.lower().isin(['gb', 'wii']).head())

                                name platform  year_of_release         genre  \
0                         Wii Sports      Wii           2006.0        Sports   
1                  Super Mario Bros.      NES           1985.0      Platform   
2                     Mario Kart Wii      Wii           2008.0        Racing   
3                  Wii Sports Resort      Wii           2009.0        Sports   
4           Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing   
...                              ...      ...              ...           ...   
16712  Samurai Warriors: Sanada Maru      PS3           2016.0        Action   
16713               LMA Manager 2007     X360           2006.0        Sports   
16714        Haitaka no Psychedelica      PSV           2016.0     Adventure   
16715               Spirits & Spells      GBA           2003.0      Platform   
16716            Winning Post 8 2016      PSV           2016.0    Simulation   

          publisher developer  na_sales
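The result of isin() is a Boolean Series, so it can be stored as a new column just like a comparison. Here is a minimal sketch on a made-up `platform` column (the values and the column name `is_gb_or_wii` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical platform values with mixed case and one missing entry
games = pd.DataFrame({'platform': ['Wii', 'NES', 'GB', 'wii', None]})

# Lowercase first so that 'Wii' and 'wii' both match
games['is_gb_or_wii'] = games['platform'].str.lower().isin(['gb', 'wii'])

print(games)
```

Missing values simply do not match, so they end up as False.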

##### _Categorical columns_

Raw string data is rarely useful for analysis as-is; string columns usually require some processing first.

If a string column represents a fixed set of categories, it's much better to treat those values directly as categories.

Converting such a column to the categorical data type instead of leaving it as strings can save memory and speed up analysis, especially for large datasets. A categorical column stores a single number (the category ID) for each entry, and the full text of each category only once. Categories can also make certain analyses easier, such as grouping data by category or filtering on several categories at once.

Look at the unique values in the 'platform' column.

In [16]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')

print(df['platform'].unique())

['Wii' 'NES' 'GB' 'DS' 'X360' 'PS3' 'PS2' 'SNES' 'GBA' 'PS4' '3DS' 'N64'
 'PS' 'XB' 'PC' '2600' 'PSP' 'XOne' 'WiiU' 'GC' 'GEN' 'DC' 'PSV' 'SAT'
 'SCD' 'WS' 'NG' 'TG16' '3DO' 'GG' 'PCFX']


We can convert 'platform' from a string column to a categorical column using the astype() method.

In [17]:
df['platform'] = df['platform'].astype('category')
print(df['platform'].head())

0    Wii
1    NES
2    Wii
3    Wii
4     GB
Name: platform, dtype: category
Categories (31, object): ['2600', '3DO', '3DS', 'DC', ..., 'WiiU', 'X360', 'XB', 'XOne']


Notice that there are only 31 categories, even though the DataFrame has 16,717 rows. When the column is stored as strings, we need to retain the full text of every entry. __When stored as a category, each entry holds only one number: the category ID__ (the 31 labels themselves are stored once).
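To make the memory argument concrete, the sketch below builds a made-up Series of repeated platform names (an assumption for illustration, not the real column) and compares its footprint before and after conversion. It also shows the integer IDs through the `.cat.codes` accessor.

```python
import pandas as pd

# Hypothetical column: 10,000 entries drawn from only three distinct labels
platforms = pd.Series(['Wii', 'NES', 'Wii', 'Wii', 'GB'] * 2000)
as_category = platforms.astype('category')

# deep=True also counts the memory used by the strings themselves
bytes_as_string = platforms.memory_usage(deep=True)
bytes_as_category = as_category.memory_usage(deep=True)
print(bytes_as_string, bytes_as_category)

# The labels are stored once; each entry holds only an integer code
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.head().tolist())
```

The exact byte counts depend on the pandas version and string storage, but the categorical version should be noticeably smaller when the number of distinct labels is low.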

##### _Example_

In [27]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')

df['average_sales'] = df[['jp_sales', 'na_sales', 'eu_sales']].mean(axis=1)

df = df.sort_values(by='average_sales', ascending=False)

print(df.head())

                       name platform  year_of_release         genre publisher  \
0                Wii Sports      Wii           2006.0        Sports  Nintendo   
1         Super Mario Bros.      NES           1985.0      Platform  Nintendo   
2            Mario Kart Wii      Wii           2008.0        Racing  Nintendo   
4  Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing  Nintendo   
3         Wii Sports Resort      Wii           2009.0        Sports  Nintendo   

  developer  na_sales  eu_sales  jp_sales  critic_score user_score  \
0  Nintendo     41.36     28.96      3.77          76.0          8   
1       NaN     29.08      3.58      6.81           NaN        NaN   
2  Nintendo     15.68     12.76      3.79          82.0        8.3   
4       NaN     11.27      8.89     10.22           NaN        NaN   
3  Nintendo     15.61     10.93      3.28          80.0          8   

   average_sales  
0      24.696667  
1      13.156667  
2      10.743333  
4      10.126667

##### __Creating categorical columns__

In the previous lesson, you learned how to create new numeric columns from calculations performed on other numeric columns in the data. In this lesson, you'll learn how to create new categorical columns that summarize numeric data in other columns. This technique can often simplify your analysis and make it easier for others to understand your results.

Let's take another look at the 'year_of_release' column in the video game dataset. In particular, we want to know the range of years our data covers.

In [28]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')

print(df['year_of_release'].min(), df['year_of_release'].max())

1980.0 2020.0


Let's take a closer look at these values by counting how many games we have per year.

In [31]:
df_year_of_release = df['year_of_release'].value_counts()

print(df_year_of_release)

year_of_release
2008.0    1427
2009.0    1426
2010.0    1255
2007.0    1197
2011.0    1136
2006.0    1006
2005.0     939
2002.0     829
2003.0     775
2004.0     762
2012.0     653
2015.0     606
2014.0     581
2013.0     544
2016.0     502
2001.0     482
1998.0     379
2000.0     350
1999.0     338
1997.0     289
1996.0     263
1995.0     219
1994.0     121
1993.0      60
1981.0      46
1992.0      43
1991.0      41
1982.0      36
1986.0      21
1989.0      17
1983.0      17
1987.0      16
1990.0      16
1988.0      15
1985.0      14
1984.0      14
1980.0       9
2017.0       3
2020.0       1
Name: count, dtype: int64


The counts are correct, but the result is ordered by count (largest first), not by year. The value_counts() method returns a Series in which the years are the index and the counts are the values. Therefore, we'll sort the result by its index to display it in chronological order.

##### __.sort_index()__

The sort_index() method sorts the DataFrame by the index.

_dataframe.sort_index(axis, level, ascending, inplace, kind, na_position, sort_remaining, ignore_index, key)_

- _axis_ (0, 1, 'index', 'columns'): Optional, default 0. Specifies the axis to sort by.

- _level_ (string, number, or list of strings/numbers): Optional, default None. Specifies the index level to sort on.

- _ascending_ (True, False): Optional, default True. Specifies whether to sort ascending (0 -> 9) or descending (9 -> 0).

- _inplace_ (True, False): Optional, default False. Specifies whether to perform the operation on the original DataFrame. If False, which is the default, this method returns a new DataFrame.

- _kind_ ('quicksort', 'mergesort', 'heapsort'): Optional, default 'quicksort'. Specifies the sorting algorithm.

- _na_position_ ('first', 'last'): Optional, default 'last'. Specifies how to handle NULL values: 'first' puts them first, 'last' puts them last.

- _sort_remaining_ (True, False): Optional, default True. Specifies whether to sort by the other levels as well.

- _ignore_index_ (True, False): Optional, default False. If True, the original index labels are ignored and replaced by 0, 1, 2, etc.

- _key_ (function): Optional. Specifies a function to be applied to the index values before sorting.

In [32]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')
df_year_of_release = df['year_of_release'].value_counts().sort_index()

print(df_year_of_release)

year_of_release
1980.0       9
1981.0      46
1982.0      36
1983.0      17
1984.0      14
1985.0      14
1986.0      21
1987.0      16
1988.0      15
1989.0      17
1990.0      16
1991.0      41
1992.0      43
1993.0      60
1994.0     121
1995.0     219
1996.0     263
1997.0     289
1998.0     379
1999.0     338
2000.0     350
2001.0     482
2002.0     829
2003.0     775
2004.0     762
2005.0     939
2006.0    1006
2007.0    1197
2008.0    1427
2009.0    1426
2010.0    1255
2011.0    1136
2012.0     653
2013.0     544
2014.0     581
2015.0     606
2016.0     502
2017.0       3
2020.0       1
Name: count, dtype: int64
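The difference between the two orderings is easier to see on a tiny example. The sketch below uses a made-up `years` Series (an assumption for illustration) and also shows two optional arguments: `ascending=False` for newest-first order and `dropna=False` to include missing years in the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical release years, including one missing value
years = pd.Series([2008.0, 1985.0, 2008.0, 2006.0, np.nan, 2008.0, 2006.0])

by_count = years.value_counts()                # ordered by count, largest first
by_year = years.value_counts().sort_index()    # ordered by year, oldest first
newest_first = years.value_counts().sort_index(ascending=False)

# By default NaN is left out; dropna=False adds a row for it
with_missing = years.value_counts(dropna=False)

print(by_count)
print(by_year)
print(newest_first)
print(with_missing)
```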


###### _Categorization_

We need to categorize the game by grouping the data into new categories we created. In this case, we're going to group the games into four categories based on the era.

- Those released before 2000 will be in the 'retro' category.

- Those released between 2000 and 2009 will be in the 'modern' category.

- Those released in 2010 or later will be in the 'recent' category.

- Those without a release year will be in the 'unknown' category.

We want to place each game in one of these four categories and store the result in a new column.

A flexible way to do this is to write our own custom function tailored to our needs (pandas also offers binning helpers such as pd.cut(), but a custom function makes every rule explicit, including how missing years are handled). The function should accept the release year as input and return the era category for that year as a result.

This is what our custom era_group() function will look like.

In [33]:
def era_group(year):
    """
    Return the era group of a game according to its release year, using these rules:
    - 'retro' for year < 2000
    - 'modern' for 2000 <= year < 2010
    - 'recent' for year >= 2010
    - 'unknown' for missing year values (NaN)
    """

    if year < 2000:
        return 'retro'
    elif year < 2010:
        return 'modern'
    elif year >= 2010:
        return 'recent'
    else:
        return 'unknown'

##### _.apply()_

The apply() method allows you to apply a function along one of the axes of the DataFrame. With the default, axis=0, the function is applied to each column; with axis=1, it is applied to each row.

_dataframe.apply(func, axis, raw, result_type, args, kwds)_

- _func_: Required. A function to apply to the DataFrame.

- _axis_ (0, 1, 'index', 'columns'): Optional, default 0. Which axis to apply the function to.

- _raw_ (True, False): Optional, default False. Set to True if the row/column should be passed as an ndarray object.

- _result_type_ ('expand', 'reduce', 'broadcast', None): Optional, default None. Specifies how the result will be returned.

- _args_ (a tuple): Optional. Positional arguments to send into the function.

- _kwds_ (keyword arguments): Optional. Keyword arguments to send into the function.

In this case, the apply() method must be called on the 'year_of_release' column, because 'year_of_release' contains the data the function will use as input. When called on a single column (a Series), apply() passes each value to the function one at a time. The era_group() function then becomes the argument we pass to the apply() method.

In [34]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')

df['era_group'] = df['year_of_release'].apply(era_group)
print(df.head())

                       name platform  year_of_release         genre publisher  \
0                Wii Sports      Wii           2006.0        Sports  Nintendo   
1         Super Mario Bros.      NES           1985.0      Platform  Nintendo   
2            Mario Kart Wii      Wii           2008.0        Racing  Nintendo   
3         Wii Sports Resort      Wii           2009.0        Sports  Nintendo   
4  Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing  Nintendo   

  developer  na_sales  eu_sales  jp_sales  critic_score user_score era_group  
0  Nintendo     41.36     28.96      3.77          76.0          8    modern  
1       NaN     29.08      3.58      6.81           NaN        NaN     retro  
2  Nintendo     15.68     12.76      3.79          82.0        8.3    modern  
3  Nintendo     15.61     10.93      3.28          80.0          8    modern  
4       NaN     11.27      8.89     10.22           NaN        NaN     retro  


In [35]:
print(df['era_group'].value_counts())

era_group
modern     9193
recent     5281
retro      1974
unknown     269
Name: count, dtype: int64
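As an alternative to a custom function, the same binning can be sketched with pd.cut(), which assigns values to labeled intervals without a Python-level loop. This is not the approach used in this lesson; the `years` Series below is made up for illustration, and the bin edges mirror the rules of era_group().

```python
import numpy as np
import pandas as pd

# Hypothetical release years covering every bin edge, plus one missing value
years = pd.Series([1985.0, 1999.0, 2000.0, 2009.0, 2010.0, 2016.0, np.nan])

era = pd.cut(
    years,
    bins=[-np.inf, 2000, 2010, np.inf],
    labels=['retro', 'modern', 'recent'],
    right=False,  # intervals are [left, right), so 2000 falls in 'modern'
)

# pd.cut leaves NaN for missing years; register 'unknown' as a category, then fill
era = era.cat.add_categories('unknown').fillna('unknown')

print(era)
```

The result is already a categorical column, which ties in with the memory benefits described earlier.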


##### _Exercise 01_

Start by writing a function called score_group() that categorizes games based on review scores. It categorizes scores based on these characteristics:

- 'low' for scores below 60.

- 'medium' for scores between 60 and 79.

- 'high' for scores of 80 or above.

- 'no score' for scores with no values.

The score_group() function must take a numeric input called score. The output must be a string designating the score category.

Make sure your function produces the correct output when passed the values 10, 65, 99, and np.nan. Use a separate print() statement for each function call.

In [42]:
import pandas as pd
import numpy as np

df = pd.read_csv('DataSets/vg_sales.csv')

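def score_group(scr):
    """Return the score category: 'low', 'medium', 'high', or 'no score' for NaN."""
    if scr < 60:
        return 'low'
    elif scr < 80:
        return 'medium'
    elif scr >= 80:
        return 'high'
    else:
        # NaN fails every comparison above, so missing scores end up here
        return 'no score'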

print(score_group(10))
print(score_group(65))
print(score_group(99))
print(score_group(np.nan))

low
medium
high
no score


Add a 'score_categorized' column to the df table by applying the score_group() function to the 'critic_score' column using the apply() method. Print the first 5 rows to ensure the new column was created correctly.

The code below reuses the score_group() function from the previous exercise (yours may look slightly different, but it should work the same way).

In [43]:
df['score_categorized'] = df['critic_score'].apply(score_group)
print(df.head())

                       name platform  year_of_release         genre publisher  \
0                Wii Sports      Wii           2006.0        Sports  Nintendo   
1         Super Mario Bros.      NES           1985.0      Platform  Nintendo   
2            Mario Kart Wii      Wii           2008.0        Racing  Nintendo   
3         Wii Sports Resort      Wii           2009.0        Sports  Nintendo   
4  Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing  Nintendo   

  developer  na_sales  eu_sales  jp_sales  critic_score user_score  \
0  Nintendo     41.36     28.96      3.77          76.0          8   
1       NaN     29.08      3.58      6.81           NaN        NaN   
2  Nintendo     15.68     12.76      3.79          82.0        8.3   
3  Nintendo     15.61     10.93      3.28          80.0          8   
4       NaN     11.27      8.89     10.22           NaN        NaN   

  score_categorized  
0            medium  
1          no score  
2              high  
3   

Now calculate the total North American sales for each critic score category.

To do this, you need to:

- Group by the new 'score_categorized' column using the groupby() method.

- Calculate the sum of the 'na_sales' column of the grouped DataFrame using sum().

- Display the result.

In [45]:
df_grouped = df.groupby('score_categorized')
df_sum = df_grouped['na_sales'].sum()
print(df_sum)

score_categorized
high        1488.37
low          292.16
medium      1091.67
no score    1528.64
Name: na_sales, dtype: float64
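The grouping and aggregation steps can also be chained, and agg() can compute several statistics at once. The following is a minimal sketch on made-up rows (the `games` table is an assumption for illustration, not the real dataset).

```python
import pandas as pd

# Hypothetical rows with an already categorized score column
games = pd.DataFrame({
    'score_categorized': ['high', 'low', 'high', 'no score', 'medium'],
    'na_sales': [10.0, 1.0, 5.0, 2.0, 3.0],
})

# One chained expression: group, select the column, aggregate
summary = games.groupby('score_categorized')['na_sales'].agg(['sum', 'mean', 'count'])

print(summary)
```

Seeing the count next to the sum helps judge whether a large total comes from many games or from a few big sellers.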


##### _Exercise 02_

Write a function called avg_score_group() that has one parameter named row. The row parameter must be a Pandas Series object. The function must calculate the average rating for each game, then return a string that places each game in one of these categories:

- 'low' for averages below 60.
- 'medium' for averages from 60 up to (but not including) 80.
- 'high' for averages of 80 or above.

To calculate the average score, avg_score_group() must read the values in the row labeled 'critic_score' and 'user_score'. The formula is avg_score = (critic_score + user_score * 10) / 2; multiplying user_score by 10 puts it on the same scale as critic_score.

The tests are already written; low, medium, and high should be printed, in that order.

In [None]:
import pandas as pd
df = pd.read_csv('DataSets/vg_sales.csv')

df.dropna(inplace=True)

def avg_score_group(rw):
    
    avg_score = (rw['critic_score'] + rw['user_score']*10) / 2
    
    if avg_score < 60:
        return 'low'
    elif avg_score < 80:
        return 'medium'
    else:
        return 'high'

# test section below; please do not change it

col_names = ['critic_score', 'user_score']
test_low  = pd.Series([10, 1.0], index=col_names)
test_med  = pd.Series([65, 6.5], index=col_names)
test_high = pd.Series([99, 9.9], index=col_names)

rows = [test_low, test_med, test_high]

for row in rows:
    print(avg_score_group(row))

low
medium
high
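A function that takes a whole row is meant to be used with apply() and axis=1, which passes each row to the function as a Series. The sketch below shows this on a made-up `scores` table. It assumes, based on how 'user_score' prints in the outputs above, that the column may be stored as text; the 'tbd' value is a hypothetical non-numeric placeholder, so the column is converted with pd.to_numeric() first.

```python
import pandas as pd

def avg_score_group(row):
    """Categorize a row by the average of critic_score and user_score * 10."""
    avg_score = (row['critic_score'] + row['user_score'] * 10) / 2
    if avg_score < 60:
        return 'low'
    elif avg_score < 80:
        return 'medium'
    else:
        return 'high'

# Hypothetical rows; 'tbd' is a made-up non-numeric placeholder
scores = pd.DataFrame({
    'critic_score': [10.0, 65.0, 99.0, 70.0],
    'user_score': ['1.0', '6.5', '9.9', 'tbd'],
})

# Convert text to numbers; values that cannot be parsed become NaN
scores['user_score'] = pd.to_numeric(scores['user_score'], errors='coerce')

# Drop incomplete rows, since a NaN average would fall through to 'high'
scores = scores.dropna(subset=['critic_score', 'user_score']).copy()

# axis=1 passes each row to the function as a Series
scores['avg_score_group'] = scores.apply(avg_score_group, axis=1)

print(scores)
```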
