## Loading/Exploring the data

Load the iris.csv file into a pandas dataframe. Take a minute to familiarize yourself with the data.

## Import Pandas

Import the `pandas` library as `pd`.

In [172]:
import pandas as pd

Read the `iris.csv` dataset into an object named `iris`.

In [173]:
iris = pd.read_csv('iris.csv')
iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
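A few standard pandas calls help with a first look at a new dataset. The sketch below is only an illustration: `csv_text` is a hypothetical three-row stand-in for the real file (rows copied from the output above) so that it runs without `iris.csv`. It also shows that the `Unnamed: 0` column is most likely a saved row index, which `index_col=0` loads as the index instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: three rows copied from the output above.
csv_text = """,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
145,6.7,3.0,5.2,2.3,virginica
"""

# Without index_col, the unnamed first column is loaded as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
# With index_col=0, that column becomes the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns[0])        # name pandas gave the unnamed column
print(sample.shape)          # (rows, columns)
print(sample.dtypes)         # data type of each column
summary = sample.describe()  # summary statistics for the numeric columns only
print(summary)
```

On the full dataset, `iris.shape`, `iris.dtypes`, and `iris.describe()` give the same kind of overview.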


How many different species are in this dataset?

In [174]:
iris['species'].nunique()

3

What are their names?

In [175]:
iris['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

In [176]:
iris['species'].value_counts()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
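As a small aside, `.value_counts()` can also report proportions, and a `groupby` gives the same counts. The sketch below uses a hypothetical four-row `species` Series rather than the real column, so the numbers differ from the output above.

```python
import pandas as pd

# Hypothetical toy column; the real data has 50 rows per species.
species = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica'], name='species')

counts = species.value_counts()                # absolute counts, most frequent first
shares = species.value_counts(normalize=True)  # fraction of rows per value
by_group = species.groupby(species).size()     # same counts via groupby, sorted by label

print(counts)
print(shares)
print(by_group)
```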

## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [177]:
iris['sepal_ratio'] = iris['sepal width (cm)'] / iris['sepal length (cm)']

Create a similar column called `'petal_ratio'`: petal width / petal length

In [178]:
iris['petal_ratio'] = iris['petal width (cm)'] / iris['petal length (cm)']

Create four columns holding `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)` converted to inches.

In [179]:
cm_to_inch = 0.393701

iris['sepal_length_in'] = iris['sepal length (cm)'] * cm_to_inch
iris['sepal_width_in'] = iris['sepal width (cm)'] * cm_to_inch
iris['petal_length_in'] = iris['petal length (cm)'] * cm_to_inch
iris['petal_width_in'] = iris['petal width (cm)'] * cm_to_inch

In [180]:
iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species,sepal_ratio,petal_ratio,sepal_length_in,sepal_width_in,petal_length_in,petal_width_in
0,5.1,3.5,1.4,0.2,setosa,0.686275,0.142857,2.007875,1.377954,0.551181,0.078740
1,4.9,3.0,1.4,0.2,setosa,0.612245,0.142857,1.929135,1.181103,0.551181,0.078740
2,4.7,3.2,1.3,0.2,setosa,0.680851,0.153846,1.850395,1.259843,0.511811,0.078740
3,4.6,3.1,1.5,0.2,setosa,0.673913,0.133333,1.811025,1.220473,0.590552,0.078740
4,5.0,3.6,1.4,0.2,setosa,0.720000,0.142857,1.968505,1.417324,0.551181,0.078740
...,...,...,...,...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,0.447761,0.442308,2.637797,1.181103,2.047245,0.905512
146,6.3,2.5,5.0,1.9,virginica,0.396825,0.380000,2.480316,0.984253,1.968505,0.748032
147,6.5,3.0,5.2,2.0,virginica,0.461538,0.384615,2.559057,1.181103,2.047245,0.787402
148,6.2,3.4,5.4,2.3,virginica,0.548387,0.425926,2.440946,1.338583,2.125985,0.905512
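The four inch conversions follow the same pattern, so they could also be written as a loop. This is a sketch on a hypothetical two-row frame `df` with the same measurement column names; the renaming rule (`' (cm)'` dropped, spaces replaced by underscores, `_in` appended) is an assumption chosen to match the column names used above.

```python
import pandas as pd

CM_TO_INCH = 0.393701  # same conversion factor as in the cell above

# Hypothetical two-row frame with the same measurement column names.
df = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9],
    'sepal width (cm)': [3.5, 3.0],
    'petal length (cm)': [1.4, 1.4],
    'petal width (cm)': [0.2, 0.2],
})

# 'sepal length (cm)' -> 'sepal_length_in', and so on for each measurement.
# list(...) takes a snapshot of the columns, since the loop adds new ones.
for col in list(df.columns):
    new_name = col.replace(' (cm)', '').replace(' ', '_') + '_in'
    df[new_name] = df[col] * CM_TO_INCH

print(df.filter(like='_in'))  # show only the new inch columns
```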


## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>Create a dictionary using the species as keys and the numbers 0-2 as values.</details>

<details><summary>Hint 2</summary>Use the dictionary from Hint 1 with the <code>.apply()</code> method to create the new column.</details>


In [181]:
species_encoding = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
iris['encoded_species'] = iris['species'].apply(lambda x: species_encoding[x])
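The exercise asks for `.apply()`, but a dictionary lookup can also be written with `.map()`. The sketch below contrasts the two on a hypothetical toy Series that deliberately includes a label (`'unknown'`) missing from the dictionary, which is where their behavior differs.

```python
import pandas as pd

species_encoding = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# Hypothetical toy column that includes a label missing from the dictionary.
species = pd.Series(['setosa', 'virginica', 'unknown'])

# .map() looks each value up in the dictionary; missing keys become NaN.
encoded = species.map(species_encoding)
print(encoded)

# .apply() with direct indexing raises KeyError on the same input.
try:
    species.apply(lambda x: species_encoding[x])
except KeyError as err:
    print('KeyError for label:', err)
```

On the iris data every label is in the dictionary, so both approaches give the same column.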


## March Madness

Let's switch from flowers to a different dataset: March Madness!

Read in the dataset `ncaa-seeds.csv` to an object named `seeds`.

This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In [182]:
seeds = pd.read_csv('ncaa-seeds.csv')
seeds.head(1)

Unnamed: 0,"team_seed,opponent_seed"
0,"01N,16N"


In `team_seed`, the `01` is the team's seed and the `N` is its division (North). This row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `.apply()` method, create the following new columns:
- `team_division`
- `opponent_division`

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 01N       | 16N           | N             | N                 |


In [183]:
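# The CSV loaded as a single combined column, so split it into two columns first.
seeds[['team_seed', 'opponent_seed']] = seeds['team_seed,opponent_seed'].str.split(',', expand=True)

# The division is the last character of each seed code.
seeds['team_division'] = seeds['team_seed'].apply(lambda x: x[-1])
seeds['opponent_division'] = seeds['opponent_seed'].apply(lambda x: x[-1])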

seeds.head()

Unnamed: 0,"team_seed,opponent_seed",team_seed,opponent_seed,team_division,opponent_division
0,"01N,16N",01N,16N,N,N
1,"02N,15N",02N,15N,N,N
2,"03N,14N",03N,14N,N,N
3,"04N,13N",04N,13N,N,N
4,"05N,12N",05N,12N,N,N
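For comparison, the last character can also be taken with the vectorized `.str` accessor instead of `.apply()`. A minimal sketch on a hypothetical two-row frame `games` in the same seed-plus-letter format:

```python
import pandas as pd

# Hypothetical toy frame in the same 'seed + division letter' format.
games = pd.DataFrame({'team_seed': ['01N', '08S'], 'opponent_seed': ['16N', '09S']})

# Row by row with .apply(), as in the cell above.
via_apply = games['team_seed'].apply(lambda x: x[-1])
# Vectorized equivalent with the .str accessor.
via_str = games['team_seed'].str[-1]

print(via_str.tolist())
print(via_apply.equals(via_str))
```

Both give the same result here; `.apply()` is what the exercise asks for, while `.str` is usually the shorter spelling for simple string slicing.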


Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 1         | 16            | N             | N                 |

In [184]:
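# Drop the trailing division letter, leaving only the digits.
seeds['team_seed'] = seeds['team_seed'].str[:-1]
seeds['opponent_seed'] = seeds['opponent_seed'].str[:-1]

# Convert both columns from strings to integers ('01' becomes 1).
seeds['team_seed'] = seeds['team_seed'].astype(int)
seeds['opponent_seed'] = seeds['opponent_seed'].astype(int)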

seeds.head()

Unnamed: 0,"team_seed,opponent_seed",team_seed,opponent_seed,team_division,opponent_division
0,"01N,16N",1,16,N,N
1,"02N,15N",2,15,N,N
2,"03N,14N",3,14,N,N
3,"04N,13N",4,13,N,N
4,"05N,12N",5,12,N,N
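The seed number and the division letter could also be pulled apart in a single step with a regular expression. This sketch assumes every code is one or more digits followed by exactly one uppercase letter, and uses a hypothetical Series `seed_codes` rather than the real column.

```python
import pandas as pd

# Hypothetical seed codes; assumed format is digits followed by one uppercase letter.
seed_codes = pd.Series(['01N', '16N', '08S'])

# One named capture group for the digits, one for the trailing letter.
parts = seed_codes.str.extract(r'^(?P<seed>\d+)(?P<division>[A-Z])$')

# The extracted digits are still strings, so convert them to integers.
parts['seed'] = parts['seed'].astype(int)
print(parts)
```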


Create a new column called `seed_delta`, which is the team's seed minus the opponent's seed.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|-----------|---------------|---------------|-------------------|------------|
| 1         | 16            | N             | N                 | -15        |

<br>
<details><summary>Did you get an error?</summary>
<code>team_seed</code> and <code>opponent_seed</code> must be numeric columns before you can perform arithmetic on them.
</details>

In [185]:
# Make sure opponent_seed is numeric before subtracting.
seeds['opponent_seed'] = seeds['opponent_seed'].astype(int)

seeds['seed_delta'] = seeds['team_seed'] - seeds['opponent_seed']
seeds.head()

Unnamed: 0,"team_seed,opponent_seed",team_seed,opponent_seed,team_division,opponent_division,seed_delta
0,"01N,16N",1,16,N,N,-15
1,"02N,15N",2,15,N,N,-13
2,"03N,14N",3,14,N,N,-11
3,"04N,13N",4,13,N,N,-9
4,"05N,12N",5,12,N,N,-7
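To illustrate the error mentioned in the hint, the sketch below subtracts two hypothetical Series that still hold strings, then repeats the subtraction after converting them with `pd.to_numeric`. The Series names `team` and `opponent` are made up for this example.

```python
import pandas as pd

# Hypothetical seed columns that still hold strings.
team = pd.Series(['1', '2'])
opponent = pd.Series(['16', '15'])

# Subtraction is not defined for strings, so pandas raises a TypeError.
try:
    team - opponent
except TypeError as err:
    print('TypeError:', err)

# After conversion to numbers the subtraction works element by element.
delta = pd.to_numeric(team) - pd.to_numeric(opponent)
print(delta.tolist())
```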
