# Tasks

In [11]:
# Import Standard Libraries
# Import Pandas library for working with dataframes
import pandas as pd

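# Import the machine learning library that contains example datasets.
# The datasets submodule is imported explicitly because `import sklearn` alone may not load it in every scikit-learn version.
import sklearn as skl
import sklearn.datasets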

# Import Numpy library to work with arrays
import numpy as np

# Import library to visualize data
import matplotlib.pyplot as plt

# Import the display function in case it is not already available in the session
from IPython.display import display


## Task 1: Source the Data Set

### Task 1 Summary:
Import the Iris data set from the sklearn.datasets module.

Explain, in your own words, what the load_iris() function returns.

### Task 1 Resources:
1 - Standard Datasets available from Scikit-Learn - https://scikit-learn.org/stable/datasets/toy_dataset.html

2 - Information on the load_iris function - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris

In [40]:
# Import the Iris dataset using scikit-learn and assign it to a variable. The scikit-learn library includes a few standard datasets; here we import one of them.

data_Iris = skl.datasets.load_iris(as_frame=True) # skl.datasets holds the standard datasets, and the load_iris function loads the Iris dataset.
# Setting the boolean parameter as_frame to True returns the data as pandas objects (a DataFrame and a Series) instead of NumPy arrays.

print("The output from load_iris is: \n",data_Iris) # Displays the output from load_iris
print("\nThe keys in this data set are: \n",data_Iris.keys()) # Use the keys() method to see which attributes are available

'''
The attributes of the object returned by load_iris are:
1. data - the feature measurements for each sample (150 rows x 4 columns)
2. target - the species label for each sample / row of data, stored as an integer code (0, 1 or 2)
3. frame - a dataframe with the data and target combined, i.e. shape of (150, 5)
4. target_names - the names of the target classes (species)
5. DESCR - a string that gives more information about the data set
6. feature_names - a list of the 4 feature names in the data set
7. filename - the name (or path, depending on the scikit-learn version) of the source csv file containing the data
8. data_module - the name of the module that holds the data file, here sklearn.datasets.data

'''

df_Iris = data_Iris.data # Extract the feature dataframe into its own variable. It could also be displayed directly using data_Iris.data
print("\nThis is a test to preview the data set - excluding the target (species): ")
display(df_Iris) # Display the data set as a test (this excludes the target, which is the species of Iris)
# To include the target, I would instead have used df_Iris = data_Iris.frame


The output from load_iris is: 
 {'data':      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]

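|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |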
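
To make the answer to Task 1 more concrete, the following sketch inspects the object that `load_iris` returns. It assumes a scikit-learn version that supports `as_frame=True` (0.23 or later). The names `iris`, `X` and `y` are illustrative only and do not appear in the notebook cells above.

```python
from sklearn.datasets import load_iris

# load_iris returns a Bunch: a dictionary subclass whose keys can also be read as attributes.
iris = load_iris(as_frame=True)

print(type(iris).__name__)         # class name of the returned object
print(isinstance(iris, dict))      # a Bunch is also a dict
print(iris["data"] is iris.data)   # key access and attribute access give the same object

# With as_frame=True, data is a DataFrame, target is a Series, and frame combines both.
print(iris.data.shape, iris.target.shape, iris.frame.shape)

# return_X_y=True skips the Bunch and returns only the (data, target) pair.
X, y = load_iris(return_X_y=True, as_frame=True)
print(X.shape, y.shape)
```

Because the Bunch behaves like a dictionary, `data_Iris.keys()` in the cell above works the same way as it would on a plain `dict`.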


## Task 2: Explore the Data Structure

### Task 2 Summary:
Print and explain the shape of the data set, the first and last 5 rows of the data, the feature names, and the target classes.

### Task 2 Resources:
1 - Information on the pandas shape function - https://www.geeksforgeeks.org/python-pandas-df-size-df-shape-and-df-ndim/

2 - Information on the pandas head & tail function to show first and last 5 rows of data - https://stackoverflow.com/questions/58260771/how-to-show-firstlast-n-rows-of-a-dataframe

3 - Using the list() function to return a list of column names - https://stackoverflow.com/questions/19482970/get-a-list-from-pandas-dataframe-column-headers

In [42]:
# Get the shape of the data set (excluding the target)
print("1. The shape of the Iris data set (excluding the target) is - {}\n".format(df_Iris.shape)) # The df.shape attribute returns a tuple with the dimensions of the dataframe as (rows, columns).
# The result shows that this data set has 150 rows and 4 columns (excluding the target).
# With the target included (data_Iris.frame), the shape would be 150 rows and 5 columns.

# Show only the first 5 and last 5 rows of data (excluding the target)
print("2.1 The following output shows only the first 5 rows of data: ") # Label the output that follows
display(df_Iris.head(5)) # Use the display function as it outputs a cleaner looking table. The pandas df.head(n) method returns only the top n rows of the dataset.
print("2.2 The following output shows only the bottom 5 rows of data: ") # Label the output that follows
display(df_Iris.tail(5)) # Use the display function as it outputs a cleaner looking table. The pandas df.tail(n) method returns only the bottom n rows of the dataset.

# Get the feature names of the dataset. From the preview we can see that the feature names are the column headings of the dataframe.
feature_names = data_Iris.feature_names # list(df_Iris.columns) would return the same list of column names

print("3. The following is a list of feature names in the data set: \n{}\n".format(feature_names))
# df_Iris.head(0) would also display the column names

#Return the target classes
target_names = data_Iris.target_names
print("4. The target classes are: \n{}\n".format(target_names))

1. The shape of the Iris data set (excluding the target) is - (150, 4)

2.1 The following output shows only the first 5 rows of data: 


|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


2.2 The following output shows only the bottom 5 rows of data: 


|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |


3. The following is a list of feature names in the data set: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

4. The target classes are: 
['setosa' 'versicolor' 'virginica']
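
The alternatives mentioned in the code comments for Task 2 can be checked directly. The sketch below reloads the data so that it runs on its own. The names `iris`, `df` and `species` are illustrative only, and mapping the integer target codes to class names is an extra step shown here as one possible way to read the target, not something the task requires.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data

# Three ways to get the feature names; all of them give the same list.
print(iris.feature_names)
print(list(df.columns))
print(df.head(0).columns.tolist())

# Including the target adds one column, so the shape becomes (150, 5).
print(iris.frame.shape)

# The target holds the integer codes 0, 1 and 2; map each code to its class name.
species = iris.target.map(dict(enumerate(iris.target_names)))
print(species.value_counts())
```

The position of each name in `target_names` corresponds to the integer code in `target`, which is why `enumerate` can be used to build the lookup.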



## Task 3: Summarize the Data

## Task 4: Visualize Features

## Task 5: Investigate Relationships

## Task 6: Analyze Relationship

## Task 7: Analyze Class Distributions

## Task 8: Compute Correlations

## Task 9: Fit a Simple Linear Regression

## Task 10: Too Many Features

## End