## Feature engineering
Feature engineering is "the act of taking raw data and extracting features for machine learning."

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./DATASETS/Combined_DS_v10.csv')

In [3]:
print(df.head())

      SurveyDate                                    FormalEducation  \
0  2/28/18 20:20           Bachelor's degree (BA. BS. B.Eng.. etc.)   
1  6/28/18 13:26           Bachelor's degree (BA. BS. B.Eng.. etc.)   
2    6/6/18 3:37           Bachelor's degree (BA. BS. B.Eng.. etc.)   
3    5/9/18 1:06  Some college/university study without earning ...   
4  4/12/18 22:41           Bachelor's degree (BA. BS. B.Eng.. etc.)   

   ConvertedSalary Hobby       Country  StackOverflowJobsRecommend  \
0              NaN   Yes  South Africa                         NaN   
1          70841.0   Yes       Sweeden                         7.0   
2              NaN    No       Sweeden                         8.0   
3          21426.0   Yes       Sweeden                         NaN   
4          41671.0   Yes            UK                         8.0   

      VersionControl  Age  Years Experience Gender   RawSalary  
0                Git   21                13   Male         NaN  
1     Git;Subversion  

In [4]:
print(df.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby', 'Country',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary'],
      dtype='object')


In [5]:
print(df.dtypes)

SurveyDate                     object
FormalEducation                object
ConvertedSalary               float64
Hobby                          object
Country                        object
StackOverflowJobsRecommend    float64
VersionControl                 object
Age                             int64
Years Experience                int64
Gender                         object
RawSalary                      object
dtype: object


#### Selecting specific data types

[More info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html)

In [6]:
only_ints = df.select_dtypes(include=['int64'])
print(only_ints.columns)

Index(['Age', 'Years Experience'], dtype='object')
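
`select_dtypes` also accepts broader selectors and an `exclude` argument. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the survey data) to split columns into numeric and non-numeric groups:

```python
import pandas as pd

# Hypothetical toy frame: one integer, one float and one text column.
toy = pd.DataFrame({
    "Age": [21, 38, 45],
    "ConvertedSalary": [70841.0, 21426.0, 41671.0],
    "Country": ["Spain", "UK", "USA"],
})

# "number" matches every numeric dtype (int64, float64, ...).
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()

# exclude= keeps everything that is NOT of the listed dtypes.
non_numeric_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Splitting columns this way is a convenient first step before applying numeric-only or categorical-only transformations.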


#### Categorical features

Assigning arbitrary numbers to categories would imply an ordering that does not exist, so each category is encoded as its own binary column instead.

<img src='./IMAGES/encoding-categorical-features.PNG'>

- <u>One-hot encoding</u>: converts $n$ categories into $n$ features.
    - Each feature maps directly to one category, so the features are easy to explain.
- <u>Dummy encoding</u>: converts $n$ categories into $n-1$ features.
    - Omits the first category;
    - The first category is represented by zeros in all the other dummy variables;
    - Carries the necessary information without duplication.

In [7]:
# One-hot encoding:
pd.get_dummies(df, columns=['Country'], prefix='C').iloc[:,10:].head()

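|   | C_France | C_India | C_Ireland | C_Russia | C_South Africa | C_Spain | C_Sweeden | C_UK | C_USA | C_Ukraine |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |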


In [8]:
# Dummy encoding
pd.get_dummies(df, columns=['Country'], prefix='C', drop_first=True).iloc[:,10:].head()

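|   | C_India | C_Ireland | C_Russia | C_South Africa | C_Spain | C_Sweeden | C_UK | C_USA | C_Ukraine |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |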
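
To make the $n$ versus $n-1$ difference concrete, here is a minimal sketch on a made-up three-category column (`toy` is hypothetical). It checks the column counts and confirms that rows of the dropped first category are all zeros under dummy encoding:

```python
import pandas as pd

# Hypothetical toy column with three categories.
toy = pd.DataFrame({"Country": ["France", "India", "Spain", "India"]})

one_hot = pd.get_dummies(toy, columns=["Country"], prefix="C")
dummy = pd.get_dummies(toy, columns=["Country"], prefix="C", drop_first=True)

print(one_hot.columns.tolist())  # one column per category
print(dummy.columns.tolist())    # first category (France) is dropped

# Rows of the dropped category have no indicator set in the dummy encoding.
france_rows = dummy[toy["Country"] == "France"]
print(france_rows.sum(axis=1).tolist())
```

Recent pandas versions return boolean indicator columns by default; passing `dtype=int` to `get_dummies` gives 0/1 integers like the output shown above.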


#### Limiting your columns

In [9]:
counts = df['Country'].value_counts()
print(counts)

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
UK               95
India            95
Ukraine           9
Ireland           5
Name: Country, dtype: int64


In [10]:
mask = df['Country'].isin(counts[counts < 10].index)
# df['Country'][mask] = 'Other'  # chained indexing: pandas raises a warning and the assignment may not stick
df.loc[mask,'Country'] = 'Other'

df['Country'].value_counts()

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
UK               95
India            95
Other            14
Name: Country, dtype: int64
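
The same masking idea can be wrapped in a small helper so the threshold is explicit and the original column is left untouched. This is only a sketch: `group_rare`, `min_count` and `other_label` are names made up for illustration, and the sample data is invented.

```python
import pandas as pd

def group_rare(series, min_count, other_label="Other"):
    """Return a copy of `series` where categories seen fewer than
    `min_count` times are replaced by `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps values where the condition is True and replaces the rest.
    return series.where(~series.isin(rare), other_label)

# Hypothetical sample: two frequent countries and two rare ones.
countries = pd.Series(["USA"] * 5 + ["Spain"] * 4 + ["Ukraine", "Ireland"])
grouped = group_rare(countries, min_count=3)
print(grouped.value_counts())
```

Returning a new Series rather than assigning in place avoids the chained-assignment warning entirely and keeps the raw column available for comparison.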

#### Binarizing numeric variables

~~~
df['Binary_Violation'] = 0
df.loc[df['Number_of_Violations'] > 0, 'Binary_Violation'] = 1
~~~
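
The survey frame loaded above has no `Number_of_Violations` column, so the snippet above is illustrative. A self-contained sketch on made-up values (`toy` is hypothetical), using a boolean comparison cast to integers as an equivalent one-line form:

```python
import pandas as pd

# Hypothetical violation counts.
toy = pd.DataFrame({"Number_of_Violations": [0, 3, 0, 1]})

# The comparison yields True/False; astype(int) turns it into 1/0.
toy["Binary_Violation"] = (toy["Number_of_Violations"] > 0).astype(int)

print(toy["Binary_Violation"].tolist())  # [0, 1, 0, 1]
```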

#### Binning numeric variables

~~~
import numpy as np

# right=True by default, so each bin includes its right edge: (-inf, 0], (0, 2], (2, inf]
df['Binned_Group'] = pd.cut(df['Number_of_Violations'], bins=[-np.inf, 0, 2, np.inf], labels=[1, 2, 3])
~~~
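
To see how the bin edges behave, the following sketch applies the same bins to made-up values (`toy` is hypothetical) and compares the default right-inclusive bins with `right=False`:

```python
import numpy as np
import pandas as pd

# Hypothetical violation counts, including values that sit on bin edges.
toy = pd.DataFrame({"Number_of_Violations": [0, 1, 2, 3, 10]})
bins = [-np.inf, 0, 2, np.inf]

# Default: right edges included, i.e. (-inf, 0], (0, 2], (2, inf].
right_closed = pd.cut(toy["Number_of_Violations"], bins=bins, labels=[1, 2, 3])

# right=False: left edges included, i.e. [-inf, 0), [0, 2), [2, inf).
left_closed = pd.cut(toy["Number_of_Violations"], bins=bins, labels=[1, 2, 3], right=False)

print(right_closed.tolist())  # [1, 2, 2, 3, 3]
print(left_closed.tolist())   # [2, 2, 3, 3, 3]
```

Values that fall exactly on an edge (0 and 2 here) are the ones that move between groups, so it is worth checking which convention a given analysis expects.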
