## Loading/Exploring the data

Load the `iris.csv` file into a pandas DataFrame. Take a minute to familiarize yourself with the data.

## Import Pandas

Import the `pandas` library as `pd`.

In [1]:
import pandas as pd

Read the `iris.csv` dataset into a DataFrame (named `df_iris` in the cells that follow).

In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df_iris = pd.read_csv(r"D:\Microsoft ml engineer\worksession\datasets\iris 1.csv")

In [9]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
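For a first look beyond `.info()`, the following is a minimal sketch that runs without the original file. The three-row CSV, the temporary folder, and the name `sample` are stand-ins invented for illustration; only `pd.read_csv`, `.shape`, `.dtypes`, and `.describe()` are the point.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical three-row stand-in for iris.csv, written to a temporary
# folder so the example runs on any machine.
csv_text = (
    "sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "iris 1.csv"  # pathlib handles spaces and separators
csv_path.write_text(csv_text)

sample = pd.read_csv(csv_path)

print(sample.shape)       # (rows, columns)
print(sample.dtypes)      # one dtype per column
print(sample.describe())  # summary statistics for the numeric columns
```

Building the path with `pathlib.Path` avoids having to think about backslash escapes at all, which can be convenient when the same notebook is run on Windows and on other systems.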


How many different species are in this dataset?

In [10]:
df_iris["species"].nunique()

3

What are their names?

In [13]:
df_iris["species"].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

In [18]:
df_iris["species"].value_counts()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

In [17]:
df_iris.groupby("species").count()

| species    | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|------------|-------------------|------------------|-------------------|------------------|
| setosa     | 50                | 50               | 50                | 50               |
| versicolor | 50                | 50               | 50                | 50               |
| virginica  | 50                | 50               | 50                | 50               |
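The two cells above agree here because the iris data has no missing values. As a hedged illustration of how they can differ, the sketch below uses a small made-up frame (`toy`, with one deliberately missing measurement) to compare `.value_counts()`, `.groupby(...).size()`, and `.groupby(...).count()`.

```python
import pandas as pd

# Made-up data for illustration; one petal length is missing on purpose.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor",
                "virginica", "virginica", "virginica"],
    "petal length (cm)": [1.4, None, 4.7, 6.0, 5.1, 5.9],
})

counts = toy["species"].value_counts()     # rows per species, largest first
sizes = toy.groupby("species").size()      # rows per species, ordered by label
non_null = toy.groupby("species").count()  # non-null values per column
shares = toy["species"].value_counts(normalize=True)  # fractions instead of counts

print(counts)
print(sizes)
print(non_null)
print(shares)
```

`.count()` reports one setosa value for `petal length (cm)` even though there are two setosa rows, so `.value_counts()` or `.size()` is the safer choice when the question is "how many samples".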


## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [23]:
df_iris["sepal_ratio"] = df_iris["sepal width (cm)"] / df_iris["sepal length (cm)"]
df_iris.head()

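|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio |
|---|-------------------|------------------|-------------------|------------------|---------|-------------|
| 0 | 5.1               | 3.5              | 1.4               | 0.2              | setosa  | 0.686275    |
| 1 | 4.9               | 3.0              | 1.4               | 0.2              | setosa  | 0.612245    |
| 2 | 4.7               | 3.2              | 1.3               | 0.2              | setosa  | 0.680851    |
| 3 | 4.6               | 3.1              | 1.5               | 0.2              | setosa  | 0.673913    |
| 4 | 5.0               | 3.6              | 1.4               | 0.2              | setosa  | 0.72        |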


Create a similar column called `'petal_ratio'`: petal width / petal length

In [26]:
df_iris["petal_ratio"] = df_iris["petal width (cm)"] / df_iris["petal length (cm)"]
df_iris.head()

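|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio | petal_ratio |
|---|-------------------|------------------|-------------------|------------------|---------|-------------|-------------|
| 0 | 5.1               | 3.5              | 1.4               | 0.2              | setosa  | 0.686275    | 0.142857    |
| 1 | 4.9               | 3.0              | 1.4               | 0.2              | setosa  | 0.612245    | 0.142857    |
| 2 | 4.7               | 3.2              | 1.3               | 0.2              | setosa  | 0.680851    | 0.153846    |
| 3 | 4.6               | 3.1              | 1.5               | 0.2              | setosa  | 0.673913    | 0.133333    |
| 4 | 5.0               | 3.6              | 1.4               | 0.2              | setosa  | 0.72        | 0.142857    |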


Create 4 new columns that hold the same measurements as `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, but in inches.

In [27]:
conv_factor = 0.3937
df_iris['sepal length (in)'] = df_iris['sepal length (cm)'] * conv_factor
df_iris['sepal width (in)'] = df_iris['sepal width (cm)'] * conv_factor 
df_iris['petal length (in)'] = df_iris['petal length (cm)'] * conv_factor
df_iris['petal width (in)'] = df_iris['petal width (cm)'] * conv_factor
df_iris.head()


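|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio | petal_ratio | sepal length (in) | sepal width (in) | petal length (in) | petal width (in) |
|---|-------------------|------------------|-------------------|------------------|---------|-------------|-------------|-------------------|------------------|-------------------|------------------|
| 0 | 5.1               | 3.5              | 1.4               | 0.2              | setosa  | 0.686275    | 0.142857    | 2.00787           | 1.37795          | 0.55118           | 0.07874          |
| 1 | 4.9               | 3.0              | 1.4               | 0.2              | setosa  | 0.612245    | 0.142857    | 1.92913           | 1.1811           | 0.55118           | 0.07874          |
| 2 | 4.7               | 3.2              | 1.3               | 0.2              | setosa  | 0.680851    | 0.153846    | 1.85039           | 1.25984          | 0.51181           | 0.07874          |
| 3 | 4.6               | 3.1              | 1.5               | 0.2              | setosa  | 0.673913    | 0.133333    | 1.81102           | 1.22047          | 0.59055           | 0.07874          |
| 4 | 5.0               | 3.6              | 1.4               | 0.2              | setosa  | 0.72        | 0.142857    | 1.9685            | 1.41732          | 0.55118           | 0.07874          |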
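The four near-identical assignments can also be written as a loop over the centimeter columns. This is only a sketch under a few assumptions: `toy` is a made-up two-row frame, the `(cm)` suffix is used to find the columns, and the exact definition 1 in = 2.54 cm replaces the rounded factor 0.3937, so the last decimals differ slightly from the output above.

```python
import pandas as pd

CM_PER_INCH = 2.54  # exact by definition

# Made-up two-row frame with the same column naming as the iris data.
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "sepal width (cm)": [3.5, 3.0],
    "species": ["setosa", "setosa"],
})

# Collect the names first so the loop does not iterate over columns it adds.
cm_columns = [col for col in toy.columns if col.endswith("(cm)")]
for col in cm_columns:
    toy[col.replace("(cm)", "(in)")] = toy[col] / CM_PER_INCH

print(toy.filter(like="(in)").round(4))
```

The loop adds one `(in)` column per `(cm)` column, so it keeps working unchanged if more measurements are added later.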


## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>Create a dictionary using the species as keys and the numbers 0-2 for values</details>

<details><summary>Hint 2</summary>Use the dictionary in hint 1 with the <code>.apply()</code> method to create the new column</details>


In [37]:

species_dict = {"setosa": 0, "versicolor": 1, "virginica": 2}
df_iris["encoded_species"] = df_iris["species"].apply(lambda x: species_dict[x])
df_iris["encoded_species"].tail()



145    2
146    2
147    2
148    2
149    2
Name: encoded_species, dtype: int64
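The exercise asks for `.apply()`, but the same dictionary lookup can also be expressed with `Series.map`, or with an ordered `pd.Categorical`. The sketch below uses a short made-up Series (`species`) and illustrative variable names; it is an alternative to compare against, not a replacement for the answer above.

```python
import pandas as pd

# Made-up labels for illustration.
species = pd.Series(["setosa", "virginica", "versicolor", "setosa"], name="species")
species_codes = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Option 1: map each label through the dictionary.
encoded_map = species.map(species_codes)

# Option 2: categorical codes follow the order given in `categories`.
encoded_cat = pd.Categorical(species, categories=list(species_codes)).codes

# A label missing from the dictionary becomes NaN with .map(),
# whereas the lambda lookup species_dict[x] would raise a KeyError.
with_unknown = pd.Series(["setosa", "unknown"]).map(species_codes)

print(encoded_map.tolist())
print(list(encoded_cat))
print(with_unknown.isna().tolist())
```

Which behavior is preferable for unknown labels (a silent `NaN` or an immediate `KeyError`) depends on whether unexpected species should stop the analysis.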

## March Madness

Let's switch to a dataset about something other than flowers: March Madness!

Read in the dataset `ncaa-seeds.csv` to an object named `seeds`.

This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In [42]:
seeds = pd.read_csv("D:/Microsoft ml engineer/worksession/datasets/ncaa-seeds 1.csv")
seeds.tail()

|    | team_seed | opponent_seed |
|----|-----------|---------------|
| 27 | 04W       | 13W           |
| 28 | 05W       | 12W           |
| 29 | 06W       | 11W           |
| 30 | 07W       | 10W           |
| 31 | 08W       | 09W           |


In [46]:
seeds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   team_seed          32 non-null     object
 1   opponent_seed      32 non-null     object
 2   team_division      32 non-null     object
 3   opponent_division  32 non-null     object
dtypes: object(4)
memory usage: 1.1+ KB


In `team_seed`, the `01` is the team's seed and the `N` is its division (North). This row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `.apply()` method, create the following new columns:
- `team_division`
- `opponent_division`

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 01N       | 16N           | N             | N                 |


In [45]:
seeds['team_division'] = seeds['team_seed'].apply(lambda x: x[-1])
seeds['opponent_division'] = seeds['opponent_seed'].apply(lambda x: x[-1]) 
seeds.head()

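|   | team_seed | opponent_seed | team_division | opponent_division |
|---|-----------|---------------|---------------|-------------------|
| 0 | 01N       | 16N           | N             | N                 |
| 1 | 02N       | 15N           | N             | N                 |
| 2 | 03N       | 14N           | N             | N                 |
| 3 | 04N       | 13N           | N             | N                 |
| 4 | 05N       | 12N           | N             | N                 |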
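The exercise calls for `.apply()`, and the lambda above does the job. For comparison, the `.str` accessor supports the same indexing directly on the column. The sketch below assumes a made-up three-row frame named `toy_seeds`.

```python
import pandas as pd

# Made-up rows in the same "seed + division letter" format.
toy_seeds = pd.DataFrame({
    "team_seed": ["01N", "02N", "08W"],
    "opponent_seed": ["16N", "15N", "09W"],
})

# .str[-1] takes the last character of every string in the column.
for side in ("team", "opponent"):
    toy_seeds[f"{side}_division"] = toy_seeds[f"{side}_seed"].str[-1]

print(toy_seeds)
```

Unlike the lambda, `.str[-1]` passes missing values through as missing instead of raising an error, which may or may not be what you want.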


Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 1         | 16            | N             | N                 |

In [47]:
seeds['team_seed'] = seeds['team_seed'].apply(lambda x: int(x[:-1]))
seeds['opponent_seed'] = seeds['opponent_seed'].apply(lambda x: int(x[:-1]))
seeds.head()

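|   | team_seed | opponent_seed | team_division | opponent_division |
|---|-----------|---------------|---------------|-------------------|
| 0 | 1         | 16            | N             | N                 |
| 1 | 2         | 15            | N             | N                 |
| 2 | 3         | 14            | N             | N                 |
| 3 | 4         | 13            | N             | N                 |
| 4 | 5         | 12            | N             | N                 |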
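As a hedged alternative to slicing, a regular expression can split the seed and the division in a single pass with `Series.str.extract`. The pattern below assumes every value is one or more digits followed by exactly one uppercase letter, which matches the rows shown here but is not guaranteed for other data. The Series `raw` and the group names are illustrative.

```python
import pandas as pd

# Made-up values in the "digits + one letter" format.
raw = pd.Series(["01N", "16N", "08W"], name="team_seed")

# Named groups become the column names of the resulting DataFrame.
parts = raw.str.extract(r"^(?P<seed>\d+)(?P<division>[A-Z])$")
parts["seed"] = parts["seed"].astype(int)  # "01" -> 1

print(parts)
```

A value that does not match the pattern yields missing entries in both columns, and the `astype(int)` step would then fail, which makes malformed input easy to notice.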


Create a new column called `seed_delta`, which is the team's seed minus their opponent's seed.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|-----------|---------------|---------------|-------------------|------------|
| 1         | 16            | N             | N                 | -15        |

<br>
<details><summary>Did you get an error?</summary>
<code>team_seed</code> and <code>opponent_seed</code> need to be numerical columns in order for you to perform mathematical operations on them.
</details>

In [49]:
seeds["seed_delta"] = seeds["team_seed"] - seeds["opponent_seed"]
seeds.head()

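|   | team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|---|-----------|---------------|---------------|-------------------|------------|
| 0 | 1         | 16            | N             | N                 | -15        |
| 1 | 2         | 15            | N             | N                 | -13        |
| 2 | 3         | 14            | N             | N                 | -11        |
| 3 | 4         | 13            | N             | N                 | -9         |
| 4 | 5         | 12            | N             | N                 | -7         |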
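If the seed columns are still stored as text, they need converting before the subtraction. A minimal sketch, assuming a made-up frame `toy_matchups` whose seeds are digit-only strings, uses `pd.to_numeric` for the conversion:

```python
import pandas as pd

# Made-up matchups with the seeds still stored as text.
toy_matchups = pd.DataFrame({
    "team_seed": ["1", "2"],
    "opponent_seed": ["16", "15"],
})
print(toy_matchups.dtypes)  # text columns, not numbers

# Convert both columns to numbers, then subtract element-wise.
for col in ("team_seed", "opponent_seed"):
    toy_matchups[col] = pd.to_numeric(toy_matchups[col])

toy_matchups["seed_delta"] = toy_matchups["team_seed"] - toy_matchups["opponent_seed"]
print(toy_matchups)
```

A negative `seed_delta` means the team has the lower seed number, that is, the better seed of the pair.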
