# Feature Engineering

Practice creating new features from the GDP and population data. 

You'll create a new feature gdppercapita, which is GDP divided by population. You'll then write code to create new features like GDP squared and GDP cubed. 

Start by running the code below. It reads in the World Bank data, filters the data for the year 2016, and cleans the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

# read in the GDP and population data sets and do basic wrangling
gdp = pd.read_csv('../data/gdp_data.csv', skiprows=4)
gdp.drop(['Unnamed: 62', 'Country Code', 'Indicator Name', 'Indicator Code'], inplace=True, axis=1)
population = pd.read_csv('../data/population_data.csv', skiprows=4)
population.drop(['Unnamed: 62', 'Country Code', 'Indicator Name', 'Indicator Code'], inplace=True, axis=1)


# Reshape the data sets so that they are in long format
gdp_melt = gdp.melt(id_vars=['Country Name'], 
                    var_name='year', 
                    value_name='gdp')

# Forward fill missing gdp values within each country, then back fill whatever is still missing
gdp_melt['gdp'] = gdp_melt.sort_values('year').groupby('Country Name')['gdp'].ffill().bfill()

population_melt = population.melt(id_vars=['Country Name'], 
                                  var_name='year', 
                                  value_name='population')

# Forward fill missing population values within each country, then back fill whatever is still missing
population_melt['population'] = population_melt.sort_values('year').groupby('Country Name')['population'].ffill().bfill()

# merge the population and gdp data together into one data frame
df_country = gdp_melt.merge(population_melt, on=['Country Name', 'year'])

# filter data for the year 2016
df_2016 = df_country[df_country['year'] == '2016']

# filter out values that are not countries
non_countries = ['World',
 'High income',
 'OECD members',
 'Post-demographic dividend',
 'IDA & IBRD total',
 'Low & middle income',
 'Middle income',
 'IBRD only',
 'East Asia & Pacific',
 'Europe & Central Asia',
 'North America',
 'Upper middle income',
 'Late-demographic dividend',
 'European Union',
 'East Asia & Pacific (excluding high income)',
 'East Asia & Pacific (IDA & IBRD countries)',
 'Euro area',
 'Early-demographic dividend',
 'Lower middle income',
 'Latin America & Caribbean',
 'Latin America & the Caribbean (IDA & IBRD countries)',
 'Latin America & Caribbean (excluding high income)',
 'Europe & Central Asia (IDA & IBRD countries)',
 'Middle East & North Africa',
 'Europe & Central Asia (excluding high income)',
 'South Asia (IDA & IBRD)',
 'South Asia',
 'Arab World',
 'IDA total',
 'Sub-Saharan Africa',
 'Sub-Saharan Africa (IDA & IBRD countries)',
 'Sub-Saharan Africa (excluding high income)',
 'Middle East & North Africa (excluding high income)',
 'Middle East & North Africa (IDA & IBRD countries)',
 'Central Europe and the Baltics',
 'Pre-demographic dividend',
 'IDA only',
 'Least developed countries: UN classification',
 'IDA blend',
 'Fragile and conflict affected situations',
 'Heavily indebted poor countries (HIPC)',
 'Low income',
 'Small states',
 'Other small states',
 'Not classified',
 'Caribbean small states',
 'Pacific island small states']

# remove non countries from the data; copy so that new columns can be added safely later
df_2016 = df_2016[~df_2016['Country Name'].isin(non_countries)].copy()
df_2016.reset_index(inplace=True, drop=True)
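
To make the reshaping and filling steps easier to picture, here is a minimal sketch on a made-up table. The countries 'A' and 'B' and their values are hypothetical and only mimic the wide layout of the World Bank files. The sketch fills in both directions within each country, which is a slight variation on the cell above, where the back fill runs over the whole column.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format table: one row per country, one column per year
wide = pd.DataFrame({
    'Country Name': ['A', 'B'],
    '2014': [1.0, np.nan],
    '2015': [np.nan, 5.0],
    '2016': [np.nan, 6.0],
})

# melt() turns the year columns into rows: one row per (country, year) pair
long_df = wide.melt(id_vars=['Country Name'], var_name='year', value_name='gdp')

# Sort so that each country's years are in order before filling
long_df = long_df.sort_values(['Country Name', 'year'])

# Forward fill, then back fill, separately for each country (assumed variation)
long_df['gdp'] = (long_df.groupby('Country Name')['gdp']
                         .transform(lambda s: s.ffill().bfill()))

print(long_df)
```

Country A only has a 2014 value, so the forward fill carries it into 2015 and 2016. Country B is missing its first year, so the back fill copies the 2015 value into 2014.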

# Exercise 1

Create a new feature called gdppercapita in a new column. This feature should be the gdp value divided by the population.

In [4]:
# TODO: create a new feature called gdppercapita, 
#      which is the gdp value divided by the population value for each country

df_2016['gdppercapita'] = df_2016['gdp'] / df_2016['population']
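
As a quick sanity check, the same division can be tried on a tiny data frame. This is only a sketch: the `check` data frame is a hypothetical name, and its values are the rounded gdp and population figures for Aruba and Afghanistan from the expected-results table later in this notebook.

```python
import pandas as pd

# Rounded values copied from the expected-results table in this notebook
check = pd.DataFrame({
    'Country Name': ['Aruba', 'Afghanistan'],
    'gdp': [2.584464e+09, 1.946902e+10],
    'population': [104822.0, 34656032.0],
})

# Element-wise division: each row's gdp is divided by that row's population
check['gdppercapita'] = check['gdp'] / check['population']

print(check)
```

The results should land close to the gdppercapita values in that table (roughly 24,656 and 562). Small differences are expected because the inputs here are rounded.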

# Exercise 2 (Challenge)

This next exercise is more challenging and assumes you know how to use the pandas apply() method as well as lambda functions. 

Write code that creates powers of a feature. For example, if you take the 'gdp' column and an integer like 3, you want to append a new column with the square of gdp (gdp^2) and another column with the cube of gdp (gdp^3).

Follow the TODOs below. These functions build on each other in the following way:

create_multiples(b, k) has two inputs. The first input, b, is a floating point number. The second input, k, is an integer. Despite the function's name, the output is a list of powers of b, from b^2 up to b^k. For example, create_multiples(3, 4) would return the list $[3^2, 3^3, 3^4]$, or in other words $[9, 27, 81]$.

Then the column_name_generator(colname, k) function outputs a list of column names. For example, column_name_generator('gdp', 4) would output a list of strings `['gdp2', 'gdp3', 'gdp4']`.

And finally, concatenate_features(df, column, num_columns) uses the two previous functions to create the new columns and then append these new columns to the original data frame.

In [9]:
# TODO: Fill out the create_multiples function.
# The create_multiples function has two inputs: a floating point number b and an integer k.
# The output is a list of powers of the input b, starting from the square of b and ending at b^k.

def create_multiples(b, k):
    
    # TODO: use a for loop to make a list of powers of b: i.e. b^2, b^3, b^4, etc. up to b^k
    # You do not need to include b^0, which would be 1. You also do not need b^1 because that feature
    # is already in the data frame.

    new_features = []
    for i in range(2, k+1):
        new_features.append(b ** i)
    return new_features

# TODO: Fill out the column_name_generator function.
# The function has two inputs: a string representing a column name and an integer k. 
# The 'k' variable is the same as the create_multiples function.
# The output should be a list of column names.
# For example, if the inputs are ('gdp', 4) then the output is a list of strings ['gdp2', 'gdp3', 'gdp4']
def column_name_generator(colname, k):
    
    col_names = []
    for i in range(2, k+1):
        col_names.append(f"{colname}{i}")
    return col_names

# TODO: Fill out the concatenate_features function.
# The function has three inputs. A dataframe, a column name represented by a string, and an integer representing
# the maximum power to create when engineering features.

# If the input is (df_2016, 'gdp', 3), then the output will be the df_2016 dataframe with two new columns.
# One new column will be 'gdp2', i.e. gdp^2, and the other column will be 'gdp3', i.e. gdp^3.

# HINT: There may be more than one way to do this.
# The TODOs in this section point you towards one way that works
def concatenate_features(df, column, num_columns):
    
    # TODO: Use the pandas apply() method to create the new features. Inside the apply method, you
    # can use a lambda function with the create_multiples function
    # HINT: df[column].apply(lambda ....)
    new_features = df[column].apply(lambda x: create_multiples(x, num_columns))
    
    # TODO: Create a dataframe with a separate column for each of the new features
    # Use the column_name_generator() function to create the column names
    
    # HINT: In the pd.DataFrame() method, you can specify column names by passing a list to the columns option
    # HINT: Using new_features.tolist() might be helpful
    # Reuse df's index so the new columns line up with the original rows when concatenated
    new_features_df = pd.DataFrame(new_features.tolist(),
                                   index=df.index,
                                   columns=column_name_generator(column, num_columns))
    
    # TODO: concatenate the original data frame in df with the new_features_df dataframe
    # return this concatenated dataframe
    
    return pd.concat([df, new_features_df], axis=1)
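
As the hint says, there is more than one way to do this. The sketch below is one possible alternative that skips apply() and raises the whole column to each power directly. The function name `concatenate_features_vectorized` and the `toy` data frame are hypothetical and are not part of the exercise.

```python
import pandas as pd

def concatenate_features_vectorized(df, column, num_columns):
    """Hypothetical alternative: add column^2 ... column^num_columns without apply()."""
    result = df.copy()  # work on a copy so the caller's data frame is left untouched
    for power in range(2, num_columns + 1):
        # Series ** int raises every element to that power in one vectorized step
        result[f"{column}{power}"] = result[column] ** power
    return result

# Small made-up data frame to try the function on
toy = pd.DataFrame({'gdp': [2.0, 3.0]})
print(concatenate_features_vectorized(toy, 'gdp', 4))
```

Because the new columns are assigned onto a copy of the input, they share its index automatically, so no separate concatenation step is needed.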

# Solution

Run the code cell below. If your code is correct, you should get a dataframe with 8 columns. Here are the first two rows of what your results should look like. 

| Country Name | year | gdp          | population | gdppercapita | gdp2         | gdp3         | gdp4         |
|--------------|------|--------------|------------|--------------|--------------|--------------|--------------|
| Aruba        | 2016 | 2.584464e+09 | 104822.0   | 24655.737223 | 6.679453e+18 | 1.726280e+28 | 4.461509e+37 |
| Afghanistan  | 2016 | 1.946902e+10 | 34656032.0 | 561.778746   | 3.790428e+20 | 7.379593e+30 | 1.436735e+41 |




There is a solution in the 16_featureengineering_exercise folder if you go to File->Open.

In [10]:
concatenate_features(df_2016, 'gdp', 4)

|     | Country Name | year | gdp          | population | gdppercapita | gdp2         | gdp3         | gdp4         |
|-----|--------------|------|--------------|------------|--------------|--------------|--------------|--------------|
| 0   | Aruba        | 2016 | 2.584464e+09 | 104822.0   | 24655.737223 | 6.679453e+18 | 1.726280e+28 | 4.461509e+37 |
| 1   | Afghanistan  | 2016 | 1.946902e+10 | 34656032.0 | 561.778746   | 3.790428e+20 | 7.379593e+30 | 1.436735e+41 |
| 2   | Angola       | 2016 | 9.533720e+10 | 28813463.0 | 3308.772828  | 9.089182e+21 | 8.665372e+32 | 8.261324e+43 |
| 3   | Albania      | 2016 | 1.188368e+10 | 2876101.0  | 4131.872341  | 1.412219e+20 | 1.678236e+30 | 1.994363e+40 |
| 4   | Andorra      | 2016 | 2.877312e+09 | 77281.0    | 37231.815671 | 8.278924e+18 | 2.382105e+28 | 6.854058e+37 |
| ... | ...          | ...  | ...          | ...        | ...          | ...          | ...          | ...          |
| 212 | Kosovo       | 2016 | 6.715487e+09 | 1816200.0  | 3697.548026  | 4.509776e+19 | 3.028534e+29 | 2.033808e+39 |
| 213 | Yemen, Rep.  | 2016 | 1.821333e+10 | 27584213.0 | 660.280885   | 3.317253e+20 | 6.041823e+30 | 1.100417e+41 |
| 214 | South Africa | 2016 | 2.957627e+11 | 56015473.0 | 5280.017633  | 8.747557e+22 | 2.587201e+34 | 7.651975e+45 |
| 215 | Zambia       | 2016 | 2.095475e+10 | 16591390.0 | 1262.989682  | 4.391017e+20 | 9.201269e+30 | 1.928103e+41 |
