# Feature engineering in pandas

## Loading/Exploring the data

Load the iris.csv file from this repo into a pandas dataframe. Take a minute to familiarize yourself with the data.

In [437]:
# Load Pandas into Python
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

In [438]:
path = './iris.csv'
Ir = pd.read_csv(path)

In [439]:
Ir.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'species'],
      dtype='object')

How many different species are in this dataset?

In [440]:
Ir.species.nunique()

3

What are their names?

In [441]:
for i in Ir.species.unique():
    print(i)


setosa
versicolor
virginica


How many samples are there per species?

<details><summary>Hint</summary>Use the [value_counts](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method</details>

In [442]:
Ir.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
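
As a side note on `value_counts`, the short sketch below shows how the same method can report proportions instead of raw counts via `normalize=True`. The `labels` Series is a hypothetical stand-in, not the iris data.

```python
import pandas as pd

# Hypothetical toy labels, used only to illustrate the method.
labels = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica'])

# Raw counts per distinct value, sorted from most to least frequent.
counts = labels.value_counts()

# Fractions of the total instead of counts.
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

Proportions are a quick way to check whether classes are balanced, as they are in the iris data (50 samples each).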

## Broadcasting

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [443]:
Ir['sepal_ratio'] = Ir['sepal width (cm)'] / Ir['sepal length (cm)']

Create a similar column called `'petal_ratio'`: petal width / petal length

In [444]:
Ir['petal_ratio'] = Ir['petal width (cm)'] / Ir['petal length (cm)']

In [445]:
Ir.head()

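|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio | petal_ratio |
|---|-------------------|------------------|-------------------|------------------|---------|-------------|-------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.686275 | 0.142857 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.612245 | 0.142857 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.680851 | 0.153846 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.673913 | 0.133333 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.72 | 0.142857 |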


Create 4 new columns that hold the same measurements as **sepal length (cm)**, **sepal width (cm)**, **petal length (cm)**, and **petal width (cm)**, but converted to inches.

In [446]:
# Approximate conversion factor (1 in = 2.54 cm, so 1 cm is about 0.39 in)
CM_TO_IN = 0.39
for col in ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']:
    # Each inch column is derived from its own centimeter column
    Ir[col.replace('(cm)', '(in)')] = Ir[col] * CM_TO_IN
Ir.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio | petal_ratio | sepal length (in) | sepal width (in) | petal length (in) | petal width (in) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.686275 | 0.142857 | 1.989 | 1.365 | 0.546 | 0.078 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.612245 | 0.142857 | 1.911 | 1.17 | 0.546 | 0.078 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.680851 | 0.153846 | 1.833 | 1.248 | 0.507 | 0.078 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.673913 | 0.133333 | 1.794 | 1.209 | 0.585 | 0.078 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.72 | 0.142857 | 1.95 | 1.404 | 0.546 | 0.078 |
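
To make the idea of broadcasting concrete, here is a minimal sketch on a hypothetical two-row frame (`toy`, not the iris data). It assumes the exact factor of 2.54 cm per inch rather than the rounded 0.39 used in the cell above, so the numbers differ slightly.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few measurements.
toy = pd.DataFrame({
    'sepal length (cm)': [5.0, 6.0],
    'sepal width (cm)': [3.0, 3.0],
})

CM_PER_INCH = 2.54  # exact by definition

# Dividing a DataFrame by a scalar broadcasts the scalar over every cell.
toy_in = toy / CM_PER_INCH

# Rename the converted columns so the unit in each label stays accurate.
toy_in.columns = [name.replace('(cm)', '(in)') for name in toy_in.columns]

# Arithmetic between two columns is applied row by row, aligned on the index.
toy_ratio = toy['sepal width (cm)'] / toy['sepal length (cm)']

print(toy_in)
print(toy_ratio)
```

Converting the whole frame at once avoids repeating the column names by hand, which is where copy-and-paste slips tend to creep in.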


## Mapping

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>
Create a dictionary using the species as keys and the numbers 0-2 for values
</details>

<details><summary>Hint 2</summary>
Use the dictionary in hint 1 with the map method to create the new column
</details>

In [447]:
Ir['encoded_species'] = Ir.species.map({'setosa':0, 'versicolor':1, 'virginica':2})
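
One detail of `map` worth knowing: values that are missing from the dictionary become `NaN` rather than raising an error. The sketch below illustrates this on a hypothetical Series that includes an unexpected label (`'unknown'`).

```python
import pandas as pd

# Hypothetical labels, including one that is not in the mapping.
species = pd.Series(['setosa', 'virginica', 'versicolor', 'unknown'])
codes = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# Each value is looked up in the dictionary; missing keys map to NaN.
encoded = species.map(codes)

# Counting NaN values is a cheap check that the mapping covered every label.
n_missing = encoded.isna().sum()

print(encoded)
print(n_missing)
```

Because `NaN` is a float, the result here is a float column; with complete coverage, as in the iris data, the codes stay integers.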


## Apply

Let's change up the dataset to something way cooler than flowers: March Madness!

Load `ncaa-seeds.csv` into pandas. This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In `team_seed`, `01` is the team's seed and `N` is its division (North). This row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `apply` method, create the following new columns:
- team_division
- opponent_division

In [448]:
Marpath = './ncaa-seeds.csv'
Madness = pd.read_csv(Marpath)

In [449]:
def team_division(row):
    # The division is the last character of the seed string, e.g. '01N' -> 'N'
    return row.team_seed[-1:]

def opponent_division(row):
    return row.opponent_seed[-1:]

# axis=1 passes each row to the function
Madness['team_division'] = Madness.apply(team_division, axis=1)
Madness['opponent_division'] = Madness.apply(opponent_division, axis=1)
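
For comparison, the sketch below shows three ways to pull the division letter out of a seed string: row-wise `DataFrame.apply`, column-wise `Series.apply`, and the vectorized `.str` accessor. The `seeds` frame is a hypothetical two-row stand-in for `ncaa-seeds.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the first-round matchups.
seeds = pd.DataFrame({
    'team_seed': ['01N', '08S'],
    'opponent_seed': ['16N', '09S'],
})

# Row-wise apply: the function receives a whole row at a time.
by_row = seeds.apply(lambda row: row.team_seed[-1], axis=1)

# Series.apply: the function receives one cell value at a time.
by_value = seeds['team_seed'].apply(lambda seed: seed[-1])

# Vectorized string indexing: no Python-level function needed.
by_str = seeds['team_seed'].str[-1]

print(by_row.tolist(), by_value.tolist(), by_str.tolist())
```

All three give the same result here. Row-wise `apply` is the most general, since it can combine several columns, while the `.str` accessor is usually the shortest option for single-column string work.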


In [450]:
# Scratch check: slicing with [:-1] drops the last character of a string
test = "sab"
newTest = test[:-1]
print(newTest)

sa


Now that you have the divisions, change the team_seed and opponent_seed columns to just be the numbers.

In [451]:
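# Strip the division letter, leaving only the seed digits (still stored as strings).
# Assigning the result back avoids relying on inplace=True on a selected column.
Madness['team_seed'] = Madness['team_seed'].replace('[a-zA-Z]', '', regex=True)
Madness['opponent_seed'] = Madness['opponent_seed'].replace('[a-zA-Z]', '', regex=True)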
Madness.head()

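|   | team_seed | opponent_seed | team_division | opponent_division |
|---|-----------|---------------|---------------|-------------------|
| 0 | 1 | 16 | N | N |
| 1 | 2 | 15 | N | N |
| 2 | 3 | 14 | N | N |
| 3 | 4 | 13 | N | N |
| 4 | 5 | 12 | N | N |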


Create a new column called `seed_delta`, which is the difference between the team's seed and its opponent's seed.

For example, the `seed_delta` in the first row will be the result of 1 - 16: -15

<details><summary>Did you get an error?</summary>
team_seed and opponent_seed need to be numerical columns in order for you to perform mathematical operations on them.
</details>

In [452]:
Madness['seed_delta'] = Madness.apply(lambda row: int(row.team_seed) - int(row.opponent_seed), axis=1)
Madness.team_division.value_counts()

N    8
E    8
W    8
S    8
Name: team_division, dtype: int64
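
As an alternative to calling `int()` inside `apply`, the seed columns can be converted to integers once and then subtracted directly. This is a minimal sketch on a hypothetical two-row frame (`games`), assuming every seed is digits followed by a division letter.

```python
import pandas as pd

# Hypothetical stand-in for the matchup data.
games = pd.DataFrame({
    'team_seed': ['01N', '02N'],
    'opponent_seed': ['16N', '15N'],
})

for col in ['team_seed', 'opponent_seed']:
    # Remove every non-digit character, then cast the remaining text to int.
    games[col] = games[col].str.replace(r'\D', '', regex=True).astype(int)

# With numeric columns, the subtraction is vectorized over all rows.
games['seed_delta'] = games['team_seed'] - games['opponent_seed']

print(games)
```

Converting up front also means later steps, such as sorting by seed, treat the values as numbers rather than text.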

## Dummies

Using the pandas `get_dummies` function, create a new dataframe with 4 columns from `team_division`.

NOTE: Be sure to use 'team_division' as your prefix.

In [453]:
dummies = pd.get_dummies(Madness.team_division, prefix="team_division")
dummies

|   | team_division_E | team_division_N | team_division_S | team_division_W |
|---|-----------------|-----------------|-----------------|-----------------|
| 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 |
| 7 | 0 | 1 | 0 | 0 |
| 8 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 1 | 0 |


In machine learning, it's common to drop one of the columns and use it as the baseline. Drop `'team_division_E'`, and append the remaining three columns to your original ncaa dataframe.

In [454]:
dummies = dummies.drop('team_division_E', axis=1)


In [456]:
Madness = pd.concat([Madness, dummies], axis=1)
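
The create-then-drop steps above can also be folded into one call with the `drop_first=True` option of `get_dummies`, which drops the first category in sorted order (here `E`). The sketch below uses a hypothetical five-row frame (`games`) and passes `dtype=int` so the indicators print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the division column.
games = pd.DataFrame({'team_division': ['N', 'E', 'S', 'W', 'N']})

# drop_first=True omits the first category (E), making it the baseline.
dummies = pd.get_dummies(
    games['team_division'],
    prefix='team_division',
    drop_first=True,
    dtype=int,
)

# Append the indicator columns next to the original data.
games = pd.concat([games, dummies], axis=1)

print(games)
```

A row in the `E` division is then represented by zeros in all three indicator columns.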