## Loading/Exploring the data

Load the `iris.csv` file into a pandas DataFrame, then take a minute to familiarize yourself with the data.

## Import Pandas

Import the `pandas` library as `pd`

In [1]:
import pandas as pd

Read the `iris.csv` dataset into an object named `iris`

In [28]:
df = pd.read_csv('iris.csv')
df = df[['sepal length (cm)','sepal width (cm)','petal length (cm)', 'petal width (cm)','species']]

How many different species are in this dataset?

In [29]:
species = df['species']
species.nunique()

3

What are their names?

In [34]:
species = df['species']
for i in species.unique():
    print(i)

setosa
versicolor
virginica


How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

In [35]:
species = df['species']
species.value_counts()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
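The three cells above use related `Series` methods. As a minimal sketch of how they fit together, the snippet below runs them on a small made-up series (`toy_species` is a hypothetical stand-in, not the real iris data):

```python
import pandas as pd

# Hypothetical toy labels, not read from iris.csv
toy_species = pd.Series(["setosa", "virginica", "setosa", "versicolor", "setosa"])

n_unique = toy_species.nunique()        # how many distinct labels there are
names = sorted(toy_species.unique())    # the distinct labels themselves
counts = toy_species.value_counts()     # number of rows per label

print(n_unique)
print(names)
print(counts)
```

The length of the `value_counts()` result always equals `nunique()`, so the third call answers all three questions at once.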

## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [None]:
add_sepal_ratio_column = df['sepal width (cm)'] / df['sepal length (cm)']
df = df.assign(sepal_ratio_cm = add_sepal_ratio_column)
df.head()

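|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species   | sepal_ratio_cm |
|-----|-------------------|------------------|-------------------|------------------|-----------|----------------|
| 0   | 5.1               | 3.5              | 1.4               | 0.2              | setosa    | 0.686275       |
| 1   | 4.9               | 3.0              | 1.4               | 0.2              | setosa    | 0.612245       |
| 2   | 4.7               | 3.2              | 1.3               | 0.2              | setosa    | 0.680851       |
| 3   | 4.6               | 3.1              | 1.5               | 0.2              | setosa    | 0.673913       |
| 4   | 5.0               | 3.6              | 1.4               | 0.2              | setosa    | 0.720000       |
| ... | ...               | ...              | ...               | ...              | ...       | ...            |
| 145 | 6.7               | 3.0              | 5.2               | 2.3              | virginica | 0.447761       |
| 146 | 6.3               | 2.5              | 5.0               | 1.9              | virginica | 0.396825       |
| 147 | 6.5               | 3.0              | 5.2               | 2.0              | virginica | 0.461538       |
| 148 | 6.2               | 3.4              | 5.4               | 2.3              | virginica | 0.548387       |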


Create a similar column called `'petal_ratio'`: petal width / petal length

In [41]:
add_petal_ratio = df['petal width (cm)'] / df['petal length (cm)']
df = df.assign(petal_ratio_cm = add_petal_ratio)
df.head()

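|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio_cm | petal_ratio_cm |
|---|-------------------|------------------|-------------------|------------------|---------|----------------|----------------|
| 0 | 5.1               | 3.5              | 1.4               | 0.2              | setosa  | 0.686275       | 0.142857       |
| 1 | 4.9               | 3.0              | 1.4               | 0.2              | setosa  | 0.612245       | 0.142857       |
| 2 | 4.7               | 3.2              | 1.3               | 0.2              | setosa  | 0.680851       | 0.153846       |
| 3 | 4.6               | 3.1              | 1.5               | 0.2              | setosa  | 0.673913       | 0.133333       |
| 4 | 5.0               | 3.6              | 1.4               | 0.2              | setosa  | 0.72           | 0.142857       |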
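Both ratio columns rely on pandas dividing two columns element by element. As a hedged alternative sketch, the division can also be written inline with `DataFrame.assign` and a lambda, which avoids the intermediate variable. The two-row frame `toy` below is made-up illustration data, not the iris file:

```python
import pandas as pd

# Hypothetical two-row frame using the same column names as the iris data
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 5.0],
    "petal width (cm)": [0.2, 2.0],
})

# The lambda receives the frame and returns the new column (row-wise width / length)
toy = toy.assign(
    petal_ratio_cm=lambda d: d["petal width (cm)"] / d["petal length (cm)"]
)

print(toy)
```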


Create four new columns that hold the same measurements as `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, converted to inches.

In [72]:
cm_columns = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

for col in cm_columns:
    # Build the new column name, e.g. 'sepal length (inch)'
    inch_col = col.replace('(cm)', '(inch)')
    df[inch_col] = df[col] * 0.393701  # 1 cm is about 0.393701 inches

df.head()



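|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio_cm | petal_ratio_cm | sepal length (inch) | sepal width (inch) | petal length (inch) | petal width (inch) |
|---|-------------------|------------------|-------------------|------------------|---------|----------------|----------------|---------------------|--------------------|---------------------|--------------------|
| 0 | 5.1               | 3.5              | 1.4               | 0.2              | setosa  | 0.686275       | 0.142857       | 2.007875            | 1.377954           | 0.551181            | 0.07874            |
| 1 | 4.9               | 3.0              | 1.4               | 0.2              | setosa  | 0.612245       | 0.142857       | 1.929135            | 1.181103           | 0.551181            | 0.07874            |
| 2 | 4.7               | 3.2              | 1.3               | 0.2              | setosa  | 0.680851       | 0.153846       | 1.850395            | 1.259843           | 0.511811            | 0.07874            |
| 3 | 4.6               | 3.1              | 1.5               | 0.2              | setosa  | 0.673913       | 0.133333       | 1.811025            | 1.220473           | 0.590552            | 0.07874            |
| 4 | 5.0               | 3.6              | 1.4               | 0.2              | setosa  | 0.72           | 0.142857       | 1.968505            | 1.417324           | 0.551181            | 0.07874            |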
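The conversion can also be done without an explicit loop. The sketch below is one possible vectorized variant, assuming every column to convert ends in `(cm)`. The frame `toy` and the constant name `CM_TO_INCH` are hypothetical names introduced for illustration:

```python
import pandas as pd

CM_TO_INCH = 0.393701  # same conversion factor as in the cell above

# Hypothetical two-row frame with iris-style column names
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "sepal width (cm)": [3.5, 3.0],
})

# Select the centimetre columns, scale them all at once, then rename the results
cm_cols = [c for c in toy.columns if c.endswith("(cm)")]
inches = (toy[cm_cols] * CM_TO_INCH).rename(
    columns=lambda c: c.replace("(cm)", "(inch)")
)

# join aligns on the index, so the new columns line up with the original rows
toy = toy.join(inches)
print(toy.columns.tolist())
```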


## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>Create a dictionary using the species as keys and the numbers 0-2 for values</details>

<details><summary>Hint 2</summary>Use the dictionary in hint 1 with the <code>.apply()</code> method to create the new column</details>


In [80]:

species_df = df['species']

def assignValues(species_df):
    if(species_df=='setosa'):
        return 0
    elif(species_df=='versicolor'):
        return 1
    elif(species_df=='virginica'):
        return 2
    else:
        return

encoded_species_df = species_df.apply(assignValues)

df['encoded_species'] = encoded_species_df

df.head(51)

df.to_csv('iris(1).csv')
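The hints suggest a dictionary rather than an `if`/`elif` chain. A minimal sketch of that approach follows, using a made-up four-row frame. The names `toy` and `species_codes` are hypothetical, chosen for illustration:

```python
import pandas as pd

# Dictionary from hint 1: species name -> integer code
species_codes = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Hypothetical toy frame, not the real iris data
toy = pd.DataFrame({"species": ["setosa", "virginica", "versicolor", "setosa"]})

# Hint 2: look each species up in the dictionary via .apply()
# (.get returns None for a name that is not in the dictionary)
toy["encoded_species"] = toy["species"].apply(lambda name: species_codes.get(name))

# Series.map accepts the dictionary directly and gives the same codes here
mapped = toy["species"].map(species_codes)

print(toy)
```

Adding a fourth species would then only require one more dictionary entry instead of another branch.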

## March Madness

Let's switch to a dataset about something other than flowers: March Madness!

Read in the dataset `ncaa-seeds.csv` to an object named `seeds`.

This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In [1]:
import pandas as pd
df = pd.read_csv('ncaa-seeds.csv')

seed_option = df['team_seed,opponent_seed'].str.split(',',expand=True)
df[['team_seed','opponent_seed']] = seed_option
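# .copy() gives an independent frame, so adding columns later does not raise SettingWithCopyWarning
seed = df[['team_seed','opponent_seed']].copy()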

seed.head()

|   | team_seed | opponent_seed |
|---|-----------|---------------|
| 0 | 01N       | 16N           |
| 1 | 02N       | 15N           |
| 2 | 03N       | 14N           |
| 3 | 04N       | 13N           |
| 4 | 05N       | 12N           |


In `team_seed`, `01` is the team's seed and `N` is its division (North). This row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `.apply()` method, create the following new columns:
- `team_division`
- `opponent_division`

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 01N       | 16N           | N             | N                 |


In [2]:
import numpy as np

team_seed = seed['team_seed']
opponent_seed = seed['opponent_seed']

def getTeamDivision(seed_code):
    # Return the division letter found in a seed code such as '01N'
    for division in ('N', 'S', 'E', 'W'):
        if division in seed_code:
            return division
    return np.nan

team_division = team_seed.apply(getTeamDivision)
opponent_division = opponent_seed.apply(getTeamDivision)

seed['team_division'] = team_division
seed['opponent_division'] = opponent_division

seed.head(30)

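|   | team_seed | opponent_seed | team_division | opponent_division |
|---|-----------|---------------|---------------|-------------------|
| 0 | 01N       | 16N           | N             | N                 |
| 1 | 02N       | 15N           | N             | N                 |
| 2 | 03N       | 14N           | N             | N                 |
| 3 | 04N       | 13N           | N             | N                 |
| 4 | 05N       | 12N           | N             | N                 |
| 5 | 06N       | 11N           | N             | N                 |
| 6 | 07N       | 10N           | N             | N                 |
| 7 | 08N       | 09N           | N             | N                 |
| 8 | 01S       | 16S           | S             | S                 |
| 9 | 02S       | 15S           | S             | S                 |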
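If the division letter is always the last character of the seed code, as it is in the rows shown, the lookup can be shortened. The sketch below makes that assumption explicit and uses a made-up two-row frame (`toy` is a hypothetical name):

```python
import pandas as pd

# Hypothetical toy rows in the same format as ncaa-seeds.csv
toy = pd.DataFrame({
    "team_seed": ["01N", "08S"],
    "opponent_seed": ["16N", "09S"],
})

# Assumption: the division is always the final character of the code
for col in ["team_seed", "opponent_seed"]:
    division_col = col.replace("_seed", "_division")
    toy[division_col] = toy[col].apply(lambda code: code[-1])

print(toy)
```

Unlike the explicit N/S/E/W check, this variant does not return `NaN` for an unexpected code, so it trades validation for brevity.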


Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 1         | 16            | N             | N                 |

In [6]:
def removeExtras(seed_codes):
    # Drop the division letter, then the leading zero: '01N' -> '1'
    return seed_codes.str.replace(r'[NSEW]', '', regex=True).str.lstrip('0')

seed['team_seed'] = removeExtras(team_seed)
seed['opponent_seed'] = removeExtras(opponent_seed)

seed.head(32)

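|   | team_seed | opponent_seed | team_division | opponent_division |
|---|-----------|---------------|---------------|-------------------|
| 0 | 1         | 16            | N             | N                 |
| 1 | 2         | 15            | N             | N                 |
| 2 | 3         | 14            | N             | N                 |
| 3 | 4         | 13            | N             | N                 |
| 4 | 5         | 12            | N             | N                 |
| 5 | 6         | 11            | N             | N                 |
| 6 | 7         | 10            | N             | N                 |
| 7 | 8         | 9             | N             | N                 |
| 8 | 1         | 16            | S             | S                 |
| 9 | 2         | 15            | S             | S                 |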
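Another way to isolate the number is to extract the digits instead of removing the letters. The sketch below shows one such variant on a made-up series (`codes` and `numbers` are hypothetical names). It also converts the result to integers in the same step:

```python
import pandas as pd

# Hypothetical seed codes in the same format as the dataset
codes = pd.Series(["01N", "16N", "08S", "10W"])

# Capture the run of digits; expand=False returns a Series rather than a DataFrame
digits = codes.str.extract(r"(\d+)", expand=False)

# Converting to int drops the leading zero and makes the column numeric
numbers = digits.astype(int)

print(numbers.tolist())
```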


Create a new column called `seed_delta`, which is the team's seed minus its opponent's seed.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|-----------|---------------|---------------|-------------------|------------|
| 1         | 16            | N             | N                 | -15        |

<br>
<details><summary>Did you get an error?</summary>
<code>team_seed</code> and <code>opponent_seed</code> must be numeric columns before you can perform arithmetic on them.
</details>

In [45]:
seed['team_seed'] = pd.to_numeric(seed['team_seed'])
seed['opponent_seed'] = pd.to_numeric(seed['opponent_seed'])

team_seed = seed['team_seed']
opponent_seed = seed['opponent_seed']

seed['seed_delta'] = team_seed - opponent_seed

seed.head()

seed.to_csv('ncaa-seeds(1).csv')
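The conversion and the subtraction can be checked in isolation on a small example. This is a minimal sketch with made-up values stored as strings, which is how the seeds look after the letters are stripped (`toy` is a hypothetical name):

```python
import pandas as pd

# Hypothetical rows where the seeds are still strings
toy = pd.DataFrame({"team_seed": ["1", "8"], "opponent_seed": ["16", "9"]})

# Convert both columns to numbers, one column at a time
seed_cols = ["team_seed", "opponent_seed"]
toy[seed_cols] = toy[seed_cols].apply(pd.to_numeric)

# Team seed minus opponent seed: negative when the team is the better seed
toy["seed_delta"] = toy["team_seed"] - toy["opponent_seed"]

print(toy)
```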

