# Feature engineering in pandas

## Loading/Exploring the data

Load the iris.csv file from this repo into a pandas dataframe. Take a minute to familiarize yourself with the data.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('./iris.csv')

How many different species are in this dataset?

In [4]:
df['species'].nunique()

3

What are their names?

In [5]:
['virginica', 'setosa', 'versicolor']
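Rather than typing the names by hand, they can be read straight off the column. This is a minimal sketch on a toy frame (the `toy` data below is made up for illustration, not the contents of `iris.csv`); `unique()` returns the values in order of first appearance.

```python
import pandas as pd

# Toy stand-in for the iris data (hypothetical values, not the real file)
toy = pd.DataFrame(
    {'species': ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica']}
)

# Distinct values of the column, in order of first appearance
species_names = toy['species'].unique().tolist()
print(species_names)
```

On the real dataframe the same call would be `df['species'].unique()`.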

How many samples are there per species?

<details><summary>Hint</summary>Use the [value_counts](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method</details>

There are 50 samples per species.
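The hint above points at `value_counts`; here is roughly what that looks like. The `toy` frame is an invented stand-in with two rows per species, so the counts differ from the real data.

```python
import pandas as pd

# Toy stand-in: two samples per species (the real iris data has 50 each)
toy = pd.DataFrame(
    {'species': ['setosa'] * 2 + ['versicolor'] * 2 + ['virginica'] * 2}
)

# value_counts returns a Series indexed by the distinct values,
# with the number of rows for each one
counts = toy['species'].value_counts()
print(counts)
```

On the notebook's dataframe, `df['species'].value_counts()` gives the per-species counts in one call.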

## Broadcasting

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [6]:
df['sepal_ratio'] = df['sepal width (cm)'] / df['sepal length (cm)']

Create a similar column called `'petal_ratio'`: petal width / petal length

In [7]:
df['petal_ratio'] = df['petal width (cm)'] / df['petal length (cm)']

Since we're in 'Murica, create 4 columns that correspond to **sepal length (cm)**, **sepal width (cm)**, **petal length (cm)**, and **petal width (cm)**, but in inches.

In [8]:
df['sl (in)'] = df['sepal length (cm)'] / 2.54
df['sw (in)'] = df['sepal width (cm)'] / 2.54
df['pl (in)'] = df['petal length (cm)'] / 2.54
df['pw (in)'] = df['petal width (cm)'] / 2.54
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio | petal_ratio | sl (in) | sw (in) | pl (in) | pw (in) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.686275 | 0.142857 | 2.007874 | 1.377953 | 0.551181 | 0.07874 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.612245 | 0.142857 | 1.929134 | 1.181102 | 0.551181 | 0.07874 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.680851 | 0.153846 | 1.850394 | 1.259843 | 0.511811 | 0.07874 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.673913 | 0.133333 | 1.811024 | 1.220472 | 0.590551 | 0.07874 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.72 | 0.142857 | 1.968504 | 1.417323 | 0.551181 | 0.07874 |
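Broadcasting works on whole dataframes too, so the four conversions can be done in one shot. A rough sketch, assuming every column to convert ends in `(cm)`; the `toy` frame, the `CM_PER_INCH` constant, and the `(in)` naming are choices made for this example rather than what the solution above uses.

```python
import pandas as pd

CM_PER_INCH = 2.54

# Toy stand-in with two centimetre columns (hypothetical values)
toy = pd.DataFrame({
    'sepal length (cm)': [5.08, 2.54],
    'petal width (cm)': [0.254, 1.27],
})

# Pick every column measured in centimetres
cm_cols = [c for c in toy.columns if c.endswith('(cm)')]

# Dividing a dataframe by a scalar broadcasts over every cell
inches = (toy[cm_cols] / CM_PER_INCH).rename(
    columns=lambda c: c.replace('(cm)', '(in)')
)

# Put the new columns next to the originals
toy = pd.concat([toy, inches], axis=1)
print(toy)
```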


## Mapping

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>
Create a dictionary using the species as keys and the numbers 0-2 for values
</details>

<details><summary>Hint 2</summary>
Use the dictionary in hint 1 with the map method to create the new column
</details>

In [9]:
sp_codes = {'setosa':0, 'versicolor':1, 'virginica':2}
df['encoded_species'] = df['species'].map(sp_codes)
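One thing worth knowing about `map` with a dictionary: any value that is not a key in the dictionary comes back as `NaN` instead of raising an error. The sketch below shows this on a small made-up Series with a deliberate typo, plus a quick way to spot the rows that failed to map.

```python
import pandas as pd

sp_codes = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# 'Versicolor' (capital V) is a deliberate typo, so it has no matching key
species = pd.Series(['setosa', 'virginica', 'Versicolor'])

# Values without a matching key become NaN
encoded = species.map(sp_codes)

# Rows that did not map to any code
unmapped = species[encoded.isna()]
print(encoded)
print(unmapped)
```

Checking `encoded.isna().sum()` after a mapping step is a cheap sanity check before moving on.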

## Apply

Let's change up the dataset to something way cooler than flowers: March Madness!

Load `ncaa-seeds.csv` into pandas. This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In `team_seed`, the 01 is the team's seed and the N is its division (North). This row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `apply` method, create the following new columns:
- team_division
- opponent_division

In [10]:
bball = pd.read_csv('./ncaa-seeds.csv')
bball.head()

| | team_seed | opponent_seed |
|---|---|---|
| 0 | 01N | 16N |
| 1 | 02N | 15N |
| 2 | 03N | 14N |
| 3 | 04N | 13N |
| 4 | 05N | 12N |


Now that you have the divisions, change the team_seed and opponent_seed columns to just be the numbers.

In [22]:
# Separate division
bball['team division'] = bball['team_seed'].apply(lambda division: division[-1])
bball['opponent division'] = bball['opponent_seed'].apply(lambda division: division[-1])
bball.head()

| | team_seed | opponent_seed | team division | opponent division |
|---|---|---|---|---|
| 0 | 01N | 16N | N | N |
| 1 | 02N | 15N | N | N |
| 2 | 03N | 14N | N | N |
| 3 | 04N | 13N | N | N |
| 4 | 05N | 12N | N | N |


In [23]:
# Separate seed value
bball['team seed'] = bball['team_seed'].apply(lambda seed: seed[:-1])
bball['opponent seed'] = bball['opponent_seed'].apply(lambda seed: seed[:-1])
bball.head()

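| | team_seed | opponent_seed | team division | opponent division | team seed | opponent seed |
|---|---|---|---|---|---|---|
| 0 | 01N | 16N | N | N | 1 | 16 |
| 1 | 02N | 15N | N | N | 2 | 15 |
| 2 | 03N | 14N | N | N | 3 | 14 |
| 3 | 04N | 13N | N | N | 4 | 13 |
| 4 | 05N | 12N | N | N | 5 | 12 |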
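The exercise asks for `apply`, but for simple string slicing the `.str` accessor does the same job without a lambda. A small sketch on a made-up two-row frame; the column names `team_division` and `team_seed_num` are picked for this example only.

```python
import pandas as pd

# Toy stand-in for the first rows of ncaa-seeds.csv
toy = pd.DataFrame({
    'team_seed': ['01N', '02N'],
    'opponent_seed': ['16N', '15N'],
})

# .str[-1] takes the last character of every string (the division letter)
toy['team_division'] = toy['team_seed'].str[-1]

# .str[:-1] drops the last character; astype(int) turns '01' into 1
toy['team_seed_num'] = toy['team_seed'].str[:-1].astype(int)
print(toy)
```

Both approaches give the same result here; `.str` tends to read more cleanly when the operation is a plain string method or slice.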


In [25]:
# Drop superfluous columns
bball.drop(columns=['team_seed', 'opponent_seed'], inplace=True)

In [27]:
# Change datatype of numeric columns to numeric type
bball[['team seed', 'opponent seed']] = bball[['team seed', 'opponent seed']].apply(pd.to_numeric)

In [31]:
# Verify our work
display(bball.head())
bball.dtypes

| | team division | opponent division | team seed | opponent seed |
|---|---|---|---|---|
| 0 | N | N | 1 | 16 |
| 1 | N | N | 2 | 15 |
| 2 | N | N | 3 | 14 |
| 3 | N | N | 4 | 13 |
| 4 | N | N | 5 | 12 |


team division        object
opponent division    object
team seed             int64
opponent seed         int64
dtype: object
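`pd.to_numeric` raises an error by default when it hits a value it cannot parse. If the data might contain junk, passing `errors='coerce'` turns those values into `NaN` instead. A minimal sketch on an invented Series (the `'n/a'` entry is there purely to show the behaviour):

```python
import pandas as pd

# Seed strings as they look after slicing off the division letter,
# plus one unparseable value added for illustration
raw = pd.Series(['01', '16', 'n/a'])

# Clean values convert to integers
strict = pd.to_numeric(raw.iloc[:2])

# errors='coerce' replaces unparseable values with NaN,
# which forces the result to a float dtype
lenient = pd.to_numeric(raw, errors='coerce')
print(strict.dtype, lenient.dtype)
```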

Create a new column called `seed_delta`, which is the difference between the team's seed and their opponent's.

For example, the `seed_delta` in the first row will be the result of 1 - 16, which is -15.

<details><summary>Did you get an error?</summary>
team_seed and opponent_seed need to be numerical columns in order for you to perform mathematical operations on them.
</details>

In [34]:
bball['seed delta'] = bball['team seed'] - bball['opponent seed']
bball.head()

| | team division | opponent division | team seed | opponent seed | seed delta |
|---|---|---|---|---|---|
| 0 | N | N | 1 | 16 | -15 |
| 1 | N | N | 2 | 15 | -13 |
| 2 | N | N | 3 | 14 | -11 |
| 3 | N | N | 4 | 13 | -9 |
| 4 | N | N | 5 | 12 | -7 |


## Dummies

Using pandas' `get_dummies` function, create a new dataframe with 4 columns from team_division.

NOTE: Be sure to use 'team_division' as your prefix.

In [36]:
# Doing this step and the next as one because why overcomplicate things amirite?
bball = pd.get_dummies(bball, columns=['team division'], drop_first=True)

In machine learning, it's common to drop one of the columns and have that be the baseline. Drop 'team_division_E', and append the remaining three columns to your original ncaa dataframe.

Repeat the previous two steps for opponent_division.

In [37]:
bball = pd.get_dummies(bball, columns=['opponent division'], drop_first=True)

In [38]:
bball

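| | team seed | opponent seed | seed delta | team division_N | team division_S | team division_W | opponent division_N | opponent division_S | opponent division_W |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 16 | -15 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2 | 15 | -13 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 14 | -11 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | 13 | -9 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 5 | 12 | -7 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 6 | 11 | -5 | 1 | 0 | 0 | 1 | 0 | 0 |
| 6 | 7 | 10 | -3 | 1 | 0 | 0 | 1 | 0 | 0 |
| 7 | 8 | 9 | -1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 8 | 1 | 16 | -15 | 0 | 1 | 0 | 0 | 1 | 0 |
| 9 | 2 | 15 | -13 | 0 | 1 | 0 | 0 | 1 | 0 |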
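For comparison, here is a sketch of the longer route the exercise text describes: build the dummies with an explicit `prefix`, drop the baseline column by name, then attach the rest to the original frame. The `toy` frame is made up, and `dtype=int` is passed so the output is 0/1 regardless of pandas version (newer versions default to booleans).

```python
import pandas as pd

# Toy stand-in with one team per division
toy = pd.DataFrame({'team division': ['E', 'N', 'S', 'W']})

# One indicator column per division, named team_division_<letter>
dummies = pd.get_dummies(toy['team division'], prefix='team_division', dtype=int)

# Drop the baseline: a row of all zeros now means division E
dummies = dummies.drop(columns='team_division_E')

# Append the remaining indicator columns to the original frame
result = pd.concat([toy, dummies], axis=1)
print(result)
```

Because the divisions sort alphabetically with E first, `drop_first=True` happens to drop the same column here; dropping by name just makes the baseline explicit.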
