## Loading/Exploring the data

Load the iris.csv file into a pandas dataframe. Take a minute to familiarize yourself with the data.

## Import Pandas

Import the `pandas` library as `pd`.

In [1]:
import pandas as pd

Read the `iris.csv` dataset into an object named `iris`.

In [2]:
iris = pd.read_csv('iris.csv')
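A first look at a freshly loaded dataframe usually involves a few inspection calls. The sketch below uses a tiny inline stand-in for `iris.csv` so that it runs without the file; the three rows and the names `sample_csv` and `sample` are made up for illustration, and the column names are assumed to match the outputs shown later in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.csv: same column names, only three rows.
sample_csv = io.StringIO(
    "sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)
sample = pd.read_csv(sample_csv)

print(sample.head())      # first rows of the dataframe
print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # data type of each column
print(sample.describe())  # summary statistics for the numeric columns
```

The same four calls should work unchanged on the real `iris` dataframe.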

How many different species are in this dataset?

The Iris dataset contains three different species of iris flowers.
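The count can also be checked in code rather than taken on faith. A minimal sketch, using a toy series in place of `iris['species']` (the `species` variable and its values here are illustrative only):

```python
import pandas as pd

# Toy stand-in for iris['species'].
species = pd.Series(["setosa", "setosa", "versicolor", "virginica", "virginica"])

# nunique() counts the distinct values in the column.
n_species = species.nunique()
print(n_species)
```

On the real dataframe the equivalent call would be `iris['species'].nunique()`.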

What are their names?

In [None]:
print(iris['species'].unique())  # setosa, versicolor, virginica

How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

In [4]:
species_counts = iris['species'].value_counts()
print(species_counts)

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64


## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [8]:
iris['sepal_ratio'] = iris['sepal width (cm)'] / iris['sepal length (cm)']
print(iris.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  sepal_ratio  
0  setosa     0.686275  
1  setosa     0.612245  
2  setosa     0.680851  
3  setosa     0.673913  
4  setosa     0.720000  


Create a similar column called `'petal_ratio'`: petal width / petal length

In [11]:
iris['petal_ratio'] = iris['petal width (cm)'] / iris['petal length (cm)']

# Display the first few rows to check the new column
print(iris.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  sepal_ratio  petal_ratio  
0  setosa     0.686275     0.142857  
1  setosa     0.612245     0.142857  
2  setosa     0.680851     0.153846  
3  setosa     0.673913     0.133333  
4  setosa     0.720000     0.142857  
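As a sanity check on the two ratio columns, the first rows can be recomputed by hand: 3.5 / 5.1 ≈ 0.686275 and 0.2 / 1.4 ≈ 0.142857. The sketch below does this on a two-row copy of the measurements shown above, using `assign()` as one possible alternative to in-place column assignment; the `df` and `with_ratios` names are illustrative.

```python
import pandas as pd

# First two rows of the measurements shown above.
df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "sepal width (cm)": [3.5, 3.0],
    "petal length (cm)": [1.4, 1.4],
    "petal width (cm)": [0.2, 0.2],
})

# assign() returns a new dataframe with the extra columns; df itself is unchanged.
with_ratios = df.assign(
    sepal_ratio=df["sepal width (cm)"] / df["sepal length (cm)"],
    petal_ratio=df["petal width (cm)"] / df["petal length (cm)"],
)
print(with_ratios[["sepal_ratio", "petal_ratio"]].round(6))
```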


Create four columns that hold the same measurements as `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, converted to inches (1 in = 2.54 cm).

In [12]:
iris['sepal length (in)'] = iris['sepal length (cm)'] / 2.54
iris['sepal width (in)'] = iris['sepal width (cm)'] / 2.54
iris['petal length (in)'] = iris['petal length (cm)'] / 2.54
iris['petal width (in)'] = iris['petal width (cm)'] / 2.54
print(iris.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  sepal_ratio  petal_ratio  sepal length (in)  sepal width (in)  \
0  setosa     0.686275     0.142857           2.007874          1.377953   
1  setosa     0.612245     0.142857           1.929134          1.181102   
2  setosa     0.680851     0.153846           1.850394          1.259843   
3  setosa     0.673913     0.133333           1.811024          1.220472   
4  setosa     0.720000     0.142857           1.968504          1.417323   

   petal length (in)  petal width (in)  
0           0.551181       
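The four conversions above repeat the same pattern, so they could also be written as a loop over the centimetre columns. This is only a sketch on a one-row dataframe; the `CM_PER_INCH` constant and the `cm_columns` list are names introduced here for illustration.

```python
import pandas as pd

CM_PER_INCH = 2.54

# One-row stand-in for the iris measurements.
df = pd.DataFrame({
    "sepal length (cm)": [5.1],
    "sepal width (cm)": [3.5],
    "petal length (cm)": [1.4],
    "petal width (cm)": [0.2],
})

# Collect the "(cm)" columns first, then derive an "(in)" column from each one.
cm_columns = [col for col in df.columns if col.endswith("(cm)")]
for col in cm_columns:
    df[col.replace("(cm)", "(in)")] = df[col] / CM_PER_INCH

print(df.filter(like="(in)").round(6))
```

Building the column list before the loop keeps the newly added columns from being picked up while iterating.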

## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>Create a dictionary using the species as keys and the numbers 0-2 as values.</details>

<details><summary>Hint 2</summary>Use the dictionary from hint 1 with the <code>.apply()</code> method to create the new column.</details>


In [13]:
species_dict = {
    'setosa': 0,
    'versicolor': 1,
    'virginica': 2
}
iris['encoded_species'] = iris['species'].apply(lambda x: species_dict[x])
print(iris.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  sepal_ratio  petal_ratio  sepal length (in)  sepal width (in)  \
0  setosa     0.686275     0.142857           2.007874          1.377953   
1  setosa     0.612245     0.142857           1.929134          1.181102   
2  setosa     0.680851     0.153846           1.850394          1.259843   
3  setosa     0.673913     0.133333           1.811024          1.220472   
4  setosa     0.720000     0.142857           1.968504          1.417323   

   petal length (in)  petal width (in)  encoded_species  
0         
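For a plain dictionary lookup like this one, `Series.map()` is a common alternative to `.apply()` with a lambda. The sketch below compares the two on a toy series (the `species` values are illustrative) and shows one behavioural difference worth knowing about: labels missing from the dictionary.

```python
import pandas as pd

species_dict = {"setosa": 0, "versicolor": 1, "virginica": 2}
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"])

# Both approaches give the same encoding when every label is in the dictionary.
encoded_apply = species.apply(lambda name: species_dict[name])
encoded_map = species.map(species_dict)
print(encoded_map.tolist())

# For a label that is not in the dictionary, the lambda would raise a KeyError,
# whereas map() yields NaN for that row.
unknown = pd.Series(["setosa", "unknown"]).map(species_dict)
print(unknown.isna().tolist())
```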

## March Madness

Let's switch to a dataset about something other than flowers: March Madness!

Read in the dataset `ncaa-seeds.csv` to an object named `seeds`.

This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In [16]:
seeds = pd.read_csv('ncaa-seeds.csv')
seeds.head()

Unnamed: 0,team_seed,opponent_seed
0,01N,16N
1,02N,15N
2,03N,14N
3,04N,13N
4,05N,12N
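The `Unnamed: 0` column in the output above is typical of a CSV that was written with its index included. Assuming that is the case for `ncaa-seeds.csv` (the file itself is not shown here), passing `index_col=0` to `read_csv` would use that column as the index instead. The sketch uses a two-row inline CSV, `csv_text`, made up to mimic that layout.

```python
import io

import pandas as pd

# Hypothetical CSV text with an unnamed leading index column.
csv_text = ",team_seed,opponent_seed\n0,01N,16N\n1,02N,15N\n"

# Default read: the unnamed column shows up as data.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(with_extra.columns.tolist())

# index_col=0: the first column becomes the row index.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(as_index.columns.tolist())
```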


In `team_seed`, the `01` is the team's seed and the `N` is its division (North). This row says that the 1st seed in the North division will play the 16th seed from the same division.
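The encoding can be taken apart with ordinary string slicing. A small sketch, assuming every code ends in a single division letter; `split_seed` is a hypothetical helper written for this illustration, not part of the exercise.

```python
from typing import Tuple


def split_seed(code: str) -> Tuple[int, str]:
    """Split a code such as '01N' into (seed number, division letter)."""
    # Everything except the last character is the seed; the last one is the division.
    return int(code[:-1]), code[-1]


print(split_seed("01N"))
print(split_seed("16N"))
```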

Using the `.apply()` method, create the following new columns:
- `team_division`
- `opponent_division`

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 01N       | 16N           | N             | N                 |


In [17]:
seeds['team_division'] = seeds['team_seed'].apply(lambda x: x[-1])  # Last character is the division for the team
seeds['opponent_division'] = seeds['opponent_seed'].apply(lambda x: x[-1])  # Last character is the division for the opponent

# Display the updated DataFrame
print(seeds)

   team_seed opponent_seed team_division opponent_division
0        01N           16N             N                 N
1        02N           15N             N                 N
2        03N           14N             N                 N
3        04N           13N             N                 N
4        05N           12N             N                 N
5        06N           11N             N                 N
6        07N           10N             N                 N
7        08N           09N             N                 N
8        01S           16S             S                 S
9        02S           15S             S                 S
10       03S           14S             S                 S
11       04S           13S             S                 S
12       05S           12S             S                 S
13       06S           11S             S                 S
14       07S           10S             S                 S
15       08S           09S             S                
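The exercise asks for `.apply()`, but the same slicing can also be expressed with the pandas `.str` accessor, which applies string indexing to every row. A sketch on a two-row stand-in dataframe (`seeds_demo` is an illustrative name):

```python
import pandas as pd

seeds_demo = pd.DataFrame({
    "team_seed": ["01N", "08S"],
    "opponent_seed": ["16N", "09S"],
})

# .str[-1] takes the last character of each string in the column.
seeds_demo["team_division"] = seeds_demo["team_seed"].str[-1]
seeds_demo["opponent_division"] = seeds_demo["opponent_seed"].str[-1]
print(seeds_demo)
```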

Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 1         | 16            | N             | N                 |

In [18]:
# The division columns already exist from the previous cell, so they are not recomputed here:
# re-running that step on stripped seeds would capture a digit instead of the division letter.
seeds['team_seed'] = seeds['team_seed'].apply(lambda x: int(x[:2]))  # '01N' -> 1
seeds['opponent_seed'] = seeds['opponent_seed'].apply(lambda x: int(x[:2]))  # '16N' -> 16
print(seeds)

    team_seed  opponent_seed team_division opponent_division
0           1             16             N                 N
1           2             15             N                 N
2           3             14             N                 N
3           4             13             N                 N
4           5             12             N                 N
5           6             11             N                 N
6           7             10             N                 N
7           8              9             N                 N
8           1             16             S                 S
9           2             15             S                 S
10          3             14             S                 S
11          4             13             S                 S
12          5             12             S                 S
13          6             11             S                 S
14          7             10             S                 S
15          8              9             S                

Create a new column called `seed_delta`, which is the team's seed minus their opponent's seed.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|-----------|---------------|---------------|-------------------|------------|
| 1         | 16            | N             | N                 | -15        |

<br>
<details><summary>Did you get an error?</summary>
<code>team_seed</code> and <code>opponent_seed</code> must be numeric columns before you can perform mathematical operations on them.
</details>
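If the seed columns still hold strings such as `'01'`, one way to make them numeric is `astype(int)`. A minimal sketch on a two-row stand-in dataframe (`demo` is an illustrative name):

```python
import pandas as pd

demo = pd.DataFrame({"team_seed": ["01", "08"], "opponent_seed": ["16", "09"]})

# Subtracting two string columns raises a TypeError, so convert to integers first.
demo["team_seed"] = demo["team_seed"].astype(int)
demo["opponent_seed"] = demo["opponent_seed"].astype(int)

demo["seed_delta"] = demo["team_seed"] - demo["opponent_seed"]
print(demo)
```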

In [25]:
seeds['seed_delta'] = seeds['team_seed'] - seeds['opponent_seed']
print(seeds)

    team_seed  opponent_seed team_division opponent_division  seed_delta
0           1             16             N                 N         -15
1           2             15             N                 N         -13
2           3             14             N                 N         -11
3           4             13             N                 N          -9
4           5             12             N                 N          -7
5           6             11             N                 N          -5
6           7             10             N                 N          -3
7           8              9             N                 N          -1
8           1             16             S                 S         -15
9           2             15             S                 S         -13
10          3             14             S                 S         -11
11          4             13             S                 S          -9
12          5             12             S         