In [1]:
"""
regression     -> predict a continuous value
classification -> predict a discrete value

In classification problems, the labels are categorical and represent a finite number of classes.

Discrete values can be numeric, like the number of students in a class, or categorical, like red, blue, or yellow.

There are two types of classification: binary (two classes) and multi-class (more than two classes).

Common algorithms for classification include
logistic regression,
k-nearest neighbors,
decision trees,
naive Bayes,
support vector machines,
and neural networks.

Supervised learning problems are grouped into regression and classification problems.
Both aim to construct a mapping function from input variables (X) to an output variable (y).
The difference is that the output variable is continuous in regression and categorical in classification.
"""


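To make the regression/classification distinction concrete, here is a minimal sketch that uses the same nearest-neighbor lookup for both tasks. The toy values, the labels "small" and "large", and the helper names are all invented for illustration; they do not come from the Iris data or from any library.

```python
from collections import Counter

# Toy 1-D inputs, invented for illustration only.
X = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_continuous = [1.5, 2.5, 3.5, 10.5, 11.5, 12.5]                   # regression target
y_categorical = ["small", "small", "small", "large", "large", "large"]  # class labels


def k_nearest(x_new, k):
    """Return the indices of the k training inputs closest to x_new."""
    order = sorted(range(len(X)), key=lambda i: abs(X[i] - x_new))
    return order[:k]


def predict_regression(x_new, k=3):
    """Regression: average the continuous targets of the k nearest neighbors."""
    idx = k_nearest(x_new, k)
    return sum(y_continuous[i] for i in idx) / k


def predict_class(x_new, k=3):
    """Classification: majority vote over the labels of the k nearest neighbors."""
    idx = k_nearest(x_new, k)
    return Counter(y_categorical[i] for i in idx).most_common(1)[0][0]


reg_pred = predict_regression(2.2)   # a continuous number
cls_pred = predict_class(2.2)        # one of a finite set of labels
print(reg_pred, cls_pred)
```

The mapping from X to y has the same shape in both cases; only the type of the output differs.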
In [2]:
"""
Iris Dataset

There are 150 iris plants, 
each with 4 numeric attributes: 
sepal length in cm, 
sepal width in cm, 
petal length in cm, 
and petal width in cm. 

The task is to classify each plant as
an iris-setosa, 
an iris-versicolor, 
or an iris-virginica based on these attributes.
"""

import pandas as pd

# Load the Iris data from a remote CSV file.
iris = pd.read_csv('https://sololearn.com/uploads/files/iris.csv')

# Print (number of rows, number of columns).
print(iris.shape)


(150, 6)


In [3]:
iris.head()

,id,sepal_len,sepal_wd,petal_len,petal_wd,species
0,0,5.1,3.5,1.4,0.2,iris-setosa
1,1,4.9,3.0,1.4,0.2,iris-setosa
2,2,4.7,3.2,1.3,0.2,iris-setosa
3,3,4.6,3.1,1.5,0.2,iris-setosa
4,4,5.0,3.6,1.4,0.2,iris-setosa


In [4]:
# Drop the id column: it only numbers the rows and is not a feature.
iris.drop('id', axis=1, inplace=True)
iris.head()


,sepal_len,sepal_wd,petal_len,petal_wd,species
0,5.1,3.5,1.4,0.2,iris-setosa
1,4.9,3.0,1.4,0.2,iris-setosa
2,4.7,3.2,1.3,0.2,iris-setosa
3,4.6,3.1,1.5,0.2,iris-setosa
4,5.0,3.6,1.4,0.2,iris-setosa


In [5]:
iris.describe()

#The count is 150 for every numeric column, which matches the number of rows,
#so none of these columns has missing values. In that respect, this is a clean dataset.

,sepal_len,sepal_wd,petal_len,petal_wd
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
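Since describe() summarizes only the numeric columns by default, a more direct check that also covers the species column is isna().sum(). The sketch below is an illustration, not part of the original notebook: it types in the five rows shown by iris.head() so that it runs without downloading the CSV, and the name `sample` is made up for the example.

```python
import pandas as pd

# The five rows printed by iris.head(), entered by hand (no download needed).
sample = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_wd": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_len": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_wd": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["iris-setosa"] * 5,
})

# isna() marks missing cells; sum() counts them per column.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing:", total_missing)
```

On the full DataFrame the same call would be iris.isna().sum().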


In [None]:
"""
The attribute ranges are of similar magnitude,
so we skip standardization here.
However, standardizing attributes so that each has a mean of zero and a standard deviation of one
can be an important preprocessing step for many machine learning algorithms.
This is one form of feature scaling; see the scikit-learn example on the importance of feature scaling for more details.

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
"""

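Although standardization is skipped in this notebook, a minimal sketch of what it does may help. The example below uses plain pandas on three columns of the five rows shown by iris.head(); the names `sample` and `standardized` are made up for the illustration. It divides by the population standard deviation (ddof=0), which is the convention scikit-learn's StandardScaler uses.

```python
import pandas as pd

# Three numeric columns from the five rows printed by iris.head().
# petal_wd is left out because it is constant in these rows (zero spread).
sample = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_wd": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_len": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# Standardize each column: subtract its mean, divide by its standard deviation.
standardized = (sample - sample.mean()) / sample.std(ddof=0)

# After standardization each column has mean ~0 and standard deviation ~1.
print(standardized.mean().round(6))
print(standardized.std(ddof=0).round(6))
```

A column with zero spread cannot be standardized this way, since the division would be by zero.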
In [None]:
print(iris.groupby('species').size())

print(iris['species'].value_counts())
#The method value_counts() is a quick way to see how the data is distributed.
#On a categorical column, it counts how many times each unique value occurs.
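As a small illustration of value_counts(), the sketch below builds a stand-in species column instead of downloading the CSV. It assumes the standard Iris class sizes of 50 plants per species; the variable names are made up for the example.

```python
import pandas as pd

# Stand-in for iris['species'], assuming 50 plants per species.
species = pd.Series(
    ["iris-setosa"] * 50 + ["iris-versicolor"] * 50 + ["iris-virginica"] * 50,
    name="species",
)

counts = species.value_counts()                      # occurrences of each class
proportions = species.value_counts(normalize=True)   # share of each class
n_classes = species.nunique()                        # number of distinct classes

print(counts)
print(proportions)
print("classes:", n_classes)
```

Passing normalize=True returns shares instead of raw counts, which makes it easier to compare class balance across datasets of different sizes.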



In [None]:
"""
Iris is a balanced dataset: the data points are evenly distributed across the classes.

Fraud detection is a typical example of an imbalanced dataset.
Generally only a small percentage of all transactions is actual fraud,
about 1 in 1000. When the dataset is imbalanced,
a somewhat different analysis is needed.
Therefore, it is important to know whether the data is balanced or imbalanced.
"""

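One way to see why imbalance changes the analysis is the sketch below. It uses hypothetical labels at the 1-in-1000 fraud rate mentioned above and a baseline that never predicts fraud. The choice of accuracy and recall as the metrics to compare is an assumption of this example, not something the notebook prescribes.

```python
# Hypothetical labels: 1 = fraud, 0 = legitimate, at a rate of 1 in 1000.
n_total = 10_000
n_fraud = n_total // 1000
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# Baseline "model" that labels every transaction as legitimate.
y_pred = [0] * n_total

# Accuracy: share of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Recall: share of the actual fraud cases that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fraud

print("accuracy:", accuracy)
print("recall:", recall)
```

The baseline scores very high on accuracy while catching no fraud at all, so accuracy alone says little on data like this.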
"""
An imbalanced dataset is one where the classes within the data are not equally represented. 
For more on imbalanced data, see the link below.

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
"""