# Feature Engineering for Machine Learning in Python
### Creating Features
Feature engineering is the process of creating new input or target features from existing ones. The objective is to produce features that represent the underlying problem to the model better than the raw columns do, which can often improve the model's accuracy.

Good feature engineering can be the difference between a poor model and a fantastic one! More often than not, you can squeeze more out of your models through careful feature engineering than through any amount of algorithm tuning.
#### The dataset used for this analysis can be found in the repo

In [1]:
import numpy as np
import pandas as pd

#### Importing the Dataset

In [2]:
so_survey_df = pd.read_csv('Combined_DS_v10.csv')

In [3]:
so_survey_df.head()

| | SurveyDate | FormalEducation | ConvertedSalary | Hobby | Country | StackOverflowJobsRecommend | VersionControl | Age | Years Experience | Gender | RawSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2/28/18 20:20 | Bachelor's degree (BA. BS. B.Eng.. etc.) | NaN | Yes | South Africa | NaN | Git | 21 | 13 | Male | NaN |
| 1 | 6/28/18 13:26 | Bachelor's degree (BA. BS. B.Eng.. etc.) | 70841.0 | Yes | Sweeden | 7.0 | Git;Subversion | 38 | 9 | Male | 70841.00 |
| 2 | 6/6/18 3:37 | Bachelor's degree (BA. BS. B.Eng.. etc.) | NaN | No | Sweeden | 8.0 | Git | 45 | 11 | NaN | NaN |
| 3 | 5/9/18 1:06 | Some college/university study without earning ... | 21426.0 | Yes | Sweeden | NaN | Zip file back-ups | 46 | 12 | Male | 21426.00 |
| 4 | 4/12/18 22:41 | Bachelor's degree (BA. BS. B.Eng.. etc.) | 41671.0 | Yes | UK | 8.0 | Git | 39 | 7 | Male | £41,671.00 |


#### Showing the data types for the different columns in the dataset

In [4]:
so_survey_df.dtypes

SurveyDate                     object
FormalEducation                object
ConvertedSalary               float64
Hobby                          object
Country                        object
StackOverflowJobsRecommend    float64
VersionControl                 object
Age                             int64
Years Experience                int64
Gender                         object
RawSalary                      object
dtype: object

#### Creating a subset of only the numeric columns

In [5]:
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])

#### Printing the column names contained in so_numeric_df

In [6]:
print(so_numeric_df.columns)

Index(['ConvertedSalary', 'StackOverflowJobsRecommend'], dtype='object')
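
The output above omits `Age` and `Years Experience` even though both are `int64`. A likely cause, stated here as an assumption, is that the `'int'` alias resolved to a 32-bit integer type on the platform where the notebook was run. The sketch below uses a hypothetical toy frame (`toy_df` is not part of the dataset) to show `include='number'`, which selects every numeric column regardless of bit width.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring some of the dtypes printed above.
toy_df = pd.DataFrame({
    'ConvertedSalary': [np.nan, 70841.0, 21426.0],
    'Age': np.array([21, 38, 46], dtype='int64'),
    'Country': ['South Africa', 'Sweeden', 'UK'],
})

# 'number' matches all integer and float columns, whatever their bit width.
toy_numeric_df = toy_df.select_dtypes(include='number')
numeric_columns = list(toy_numeric_df.columns)
print(numeric_columns)
```

Applied to `so_survey_df`, the same call would be expected to keep the two integer columns as well as the two float columns.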


#### Converting the Country column to a one-hot encoded DataFrame

In [7]:
one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')

#### Printing the column names

In [8]:
print(one_hot_encoded.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
       'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',
       'OH_UK', 'OH_USA', 'OH_Ukraine'],
      dtype='object')


#### Creating dummy variables for the Country column (drop_first=True omits the first category)

In [9]:
dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')

#### Printing the column names

In [10]:
print(dummy.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
       'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',
       'DM_USA', 'DM_Ukraine'],
      dtype='object')
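
The two listings differ by one column: with `drop_first=True` the first category (`France`) has no column of its own and is represented by zeros in all the others. A minimal sketch on a hypothetical three-country frame (`toy_df` is illustrative, not the survey data) makes the difference easier to see:

```python
import pandas as pd

# Hypothetical toy data with three categories.
toy_df = pd.DataFrame({'Country': ['France', 'India', 'UK', 'India']})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy_df, columns=['Country'], prefix='OH')

# Dummy encoding: the first category (alphabetically) is dropped and
# becomes the implicit baseline, encoded as all zeros.
dummy = pd.get_dummies(toy_df, columns=['Country'], drop_first=True, prefix='DM')

one_hot_cols = list(one_hot.columns)
dummy_cols = list(dummy.columns)
print(one_hot_cols)
print(dummy_cols)
print(dummy.iloc[0])  # the 'France' row has no indicator set
```

Dummy encoding avoids the redundancy of one-hot columns that always sum to one, which matters mainly for linear models; the trade-off is that the baseline category is no longer explicit.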


#### Creating a series out of the Country column

In [11]:
countries = so_survey_df['Country']

#### Getting the counts of each category

In [12]:
country_counts = countries.value_counts()

#### Printing the count values for each category

In [13]:
print(country_counts)

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Ukraine           9
Ireland           5
Name: Country, dtype: int64


#### Creating a mask for only the categories that occur fewer than 10 times

In [14]:
mask = countries.isin(country_counts[country_counts < 10].index)

#### Printing the top 5 rows in the mask series

In [15]:
print(mask.head())

0    False
1    False
2    False
3    False
4    False
Name: Country, dtype: bool


#### Labeling the infrequent categories as 'Other'

In [16]:
# Series.mask returns a new Series, so the original DataFrame column is not modified
countries = countries.mask(mask, 'Other')


In [17]:
print(countries.value_counts())

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Other            14
Name: Country, dtype: int64
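
The count, mask, and relabel steps above can be folded into one helper. This is a sketch under assumptions: `group_rare_categories`, its parameters, and `toy_countries` are hypothetical names introduced here, and it uses `Series.where`, which returns a new Series instead of assigning into a slice of the DataFrame.

```python
import pandas as pd

def group_rare_categories(series, min_count=10, other_label='Other'):
    """Return a copy of `series` with rare categories relabelled."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Series.where keeps values where the condition is True
    # and substitutes `other_label` everywhere else.
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy data: one frequent category and two rare ones.
toy_countries = pd.Series(['USA'] * 12 + ['Ukraine'] * 2 + ['Ireland'])
grouped = group_rare_categories(toy_countries)
grouped_counts = grouped.value_counts().to_dict()
print(grouped_counts)
```

Because the helper returns a new Series, the input is left unchanged and the result can be assigned back to a column explicitly if that is wanted.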


### Numeric Variables
#### Binarizing columns

#### Creating the Paid_Job column filled with zeros

In [18]:
so_survey_df['Paid_Job'] = 0

#### Setting Paid_Job to 1 where ConvertedSalary is greater than 0

In [19]:
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

#### Printing the first five rows of the columns

In [20]:
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())

   Paid_Job  ConvertedSalary
0         0              NaN
1         1          70841.0
2         0              NaN
3         1          21426.0
4         1          41671.0
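
The same binary column can be built in a single step from the boolean comparison. A minimal sketch on hypothetical toy data (`toy_df` is not the survey frame); note that a comparison against `NaN` evaluates to `False`, so missing salaries end up as 0, matching the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data including a missing and a zero salary.
toy_df = pd.DataFrame({'ConvertedSalary': [np.nan, 70841.0, 0.0, 21426.0]})

# The comparison yields booleans; astype(int) maps True/False to 1/0.
toy_df['Paid_Job'] = (toy_df['ConvertedSalary'] > 0).astype(int)

paid_job_values = toy_df['Paid_Job'].tolist()
print(paid_job_values)
```

Whether a missing salary should really be treated as "no paid job" is a modelling decision; this sketch only reproduces the behavior shown above.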


#### Binning the continuous variable ConvertedSalary into 5 bins

In [21]:
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 5)

#### Printing the first 5 rows of the equal_binned column

In [22]:
print(so_survey_df[['equal_binned', 'ConvertedSalary']].head())

          equal_binned  ConvertedSalary
0                  NaN              NaN
1  (-2000.0, 400000.0]          70841.0
2                  NaN              NaN
3  (-2000.0, 400000.0]          21426.0
4  (-2000.0, 400000.0]          41671.0
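
All three non-missing rows above fall into the same bin because `pd.cut` with an integer splits the full value range into equal-width intervals, and a few very large salaries stretch that range. The sketch below uses hypothetical skewed values to inspect the computed edges with `retbins=True` and, as an alternative not used in the original notebook, compares them with the quantile-based `pd.qcut`.

```python
import pandas as pd

# Hypothetical skewed salaries: three modest values and one very large one.
toy_salaries = pd.Series([20000.0, 40000.0, 70000.0, 2000000.0])

# Equal-width bins; retbins=True also returns the computed bin edges.
equal_width, width_edges = pd.cut(toy_salaries, 5, retbins=True)

# Equal-frequency bins: edges are placed at quantiles of the data.
equal_freq = pd.qcut(toy_salaries, 4)

print(width_edges)
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With equal-width bins most observations crowd into the lowest interval, whereas quantile bins spread them evenly; which behavior is preferable depends on what the downstream model needs.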


#### Specifying the boundaries of the bins

In [23]:
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

#### Binning labels

In [24]:
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

#### Binning the continuous variable ConvertedSalary using these boundaries

In [26]:
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
                                         bins=bins, labels=labels)

#### Printing the first 5 rows of the boundary_binned column

In [27]:
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())

  boundary_binned  ConvertedSalary
0             NaN              NaN
1          Medium          70841.0
2             NaN              NaN
3             Low          21426.0
4             Low          41671.0
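
With explicit boundaries it helps to know which side of each interval is closed. By default `pd.cut` uses right-closed intervals, so a salary of exactly 50000 is labelled `Low`, not `Medium`. A short sketch with hypothetical values placed on and around the edges (`toy_salaries` is illustrative only):

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Hypothetical values chosen to sit on and just past the bin edges.
toy_salaries = pd.Series([np.nan, 10000.0, 10000.01, 50000.0, 150000.0, 150001.0])

# Intervals are closed on the right by default: (10000, 50000] contains 50000.
binned = pd.cut(toy_salaries, bins=bins, labels=labels)
print(pd.DataFrame({'salary': toy_salaries, 'bin': binned}))
```

Passing `right=False` to `pd.cut` closes the intervals on the left instead, which would move each edge value up into the next bin. Missing salaries stay missing in either case.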
