Hi there, Kagglers! 👋

This notebook introduces you to the basics of machine learning using the Iris dataset. This dataset contains information about the sepal and petal measurements of three different species of iris flowers. 🌸

In this notebook, you will learn how to:

* Load and explore the dataset using Python
* Visualize the data using histograms, scatter plots, and pair plots
* Train and test various machine learning algorithms such as logistic regression, decision tree, SVM, and KNN
* Compare the accuracy of different models using petals and sepals separately
* Make predictions for new observations using the best model

This notebook is a great way to get started with machine learning and understand some of the key concepts and techniques. You will also get to practice your coding skills in Python, a popular language for data analysis and visualization. 💻

I hope you find this notebook useful and interesting. If you do, please upvote it and share it with your friends. 😊

**Please Upvote!!!**

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns  
sns.set_palette('husl')
import matplotlib.pyplot as plt
%matplotlib inline

# importing all the necessary packages to use the various classification algorithms
from sklearn import metrics #for checking the model accuracy
from sklearn.linear_model import LogisticRegression # for Logistic Regression algorithm
from sklearn.tree import DecisionTreeClassifier #for using Decision Tree Algorithm
from sklearn import svm  #for Support Vector Machine (SVM) Algorithm
from sklearn.neighbors import KNeighborsClassifier  # for K nearest neighbours

from sklearn.model_selection import train_test_split  #to split the dataset for training and testing


data = pd.read_csv('/kaggle/input/iris/Iris.csv') #load the dataset

# Preview of Data

1. There are 150 observations with 4 features each (sepal length, sepal width, petal length, petal width).
2. There are no null values, so we don't have to worry about that.
3. There are 50 observations of each species (setosa, versicolor, virginica).

In [None]:
data.head() #show the first 5 rows from the dataset

In [None]:
data.info() #checking if there is any inconsistency in the dataset 
#as we see there are no null values in the dataset, so the data can be processed

In [None]:
data.describe() # summary statistics (count, mean, std, min, quartiles, max) for each numeric column

In [None]:
data['Species'].value_counts() #there are 3 species (setosa, versicolor, virginica) with 50 rows each

# Some Exploratory Data Analysis With Iris

In [None]:
# the first plot call creates the axes; the other two species are drawn onto the same axes
ax = data[data.Species=='Iris-setosa'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='red', label='Setosa')
data[data.Species=='Iris-versicolor'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='blue', label='Versicolor',ax=ax)
data[data.Species=='Iris-virginica'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='yellow', label='Virginica', ax=ax)
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
ax.set_title("Sepal Length VS Width")
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()

The above graph shows the relationship between the sepal length and width. Now we will check the relationship between the petal length and width.

In [None]:
# the first plot call creates the axes; the other two species are drawn onto the same axes
ax = data[data.Species=='Iris-setosa'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='orange', label='Setosa')
data[data.Species=='Iris-versicolor'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='blue', label='Versicolor',ax=ax)
data[data.Species=='Iris-virginica'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='green', label='Virginica', ax=ax)
ax.set_xlabel("Petal Length")
ax.set_ylabel("Petal Width")
ax.set_title("Petal Length VS Width")
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()

As we can see, the petal features give a better cluster division than the sepal features. This is an indication that the petals may give more accurate predictions than the sepals. We will check that later.

# Now let us see how the length and width are distributed

In [None]:
data.hist(edgecolor='black', linewidth=2)
fig=plt.gcf()
fig.set_size_inches(15,8)
plt.show()

# Now let us see how the length and width vary according to the species

# Data Visualization

1. After graphing the features in a pair plot, it is clear that the relationship between pairs of features of Iris-setosa (in pink) is distinctly different from those of the other two species.
2. There is some overlap in the pairwise relationships of the other two species, iris-versicolor (brown) and iris-virginica (green).

In [None]:
tmp = data.drop('Id', axis=1)
g = sns.pairplot(tmp, hue='Species', markers='+')
plt.show()

In [None]:
figg = sns.violinplot(y='Species', x='SepalLengthCm', data=data, inner='quartile')
plt.show()
figg = sns.violinplot(y='Species', x='SepalWidthCm', data=data, inner='quartile')
plt.show()
figg = sns.violinplot(y='Species', x='PetalLengthCm', data=data, inner='quartile')
plt.show()
figg = sns.violinplot(y='Species', x='PetalWidthCm', data=data, inner='quartile')
plt.show()

**The given problem is a classification problem, so we will use classification algorithms to build a model.**

**Classification: samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data.**

**Regression: if the desired output consists of one or more continuous variables, the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.**

**Before we start, we need to clear up some ML terminology.**

**Attributes: an attribute is a property of an instance that may be used to determine its classification. In this dataset, the attributes are the petal and sepal length and width. Attributes are also known as features.**

**Target variable: in the machine learning context, the variable that is or should be the output. Here the target variable is the species, which takes one of 3 values.**

In [None]:
X = data.drop(['Id', 'Species'], axis=1)
y = data['Species']
# print(X.head())
print(X.shape)
# print(y.head())
print(y.shape)

In [None]:
data.shape #get the shape of the dataset

When we train any algorithm, the number of features and their correlation play an important role. If there are many features and many of them are highly correlated, training an algorithm with all of them can reduce the accuracy. Feature selection should therefore be done carefully. This dataset has few features, but we will still look at the correlation.

In [None]:
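plt.figure(figsize=(7,4)) 
# corr() needs numeric columns only, so drop the Id and the Species label first
sns.heatmap(data.drop(['Id', 'Species'], axis=1).corr(),annot=True,cmap='cubehelix_r') #draws a heatmap from the correlation matrix of the four features
plt.show()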

**Observation--->**

The Sepal Width and Length are not correlated. The Petal Width and Length are highly correlated.

We will use all the features for training the algorithm and check the accuracy.

Then we will use one petal feature and one sepal feature to check the accuracy of the algorithm, so that we train on only 2 features that are not highly correlated with each other. This keeps some variance in the dataset, which may help accuracy. We will check it later.
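
The numbers in the heatmap are Pearson correlation coefficients, which is what pandas `corr()` computes by default. Here is a minimal sketch of how one such coefficient is calculated. The `pearson` helper and the toy values are made up for illustration and are not taken from `Iris.csv`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations from the mean (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, made up for illustration (not real Iris measurements)
length = [1.0, 2.0, 3.0, 4.0, 5.0]
width_up = [0.2, 0.4, 0.6, 0.8, 1.0]     # grows in step with length
width_mixed = [3.0, 1.0, 4.0, 1.0, 3.0]  # no linear trend with length

r_up = pearson(length, width_up)        # close to 1: strongly correlated
r_mixed = pearson(length, width_mixed)  # close to 0: not linearly correlated
print(r_up, r_mixed)
```

A value near 1 or -1 means the two features carry largely the same information, while a value near 0 means there is no linear relationship between them.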

**Steps To Be followed When Applying an Algorithm**
1. Split the dataset into a training set and a testing set. The testing set is generally smaller than the training set, since more training data helps the model learn better.
2. Select an algorithm that suits the problem (classification or regression).
3. Then pass the training dataset to the algorithm to train it. We use the .fit() method.
4. Then pass the testing data to the trained algorithm to predict the outcome. We use the .predict() method.
5. We then check the accuracy by passing the predicted outcome and the actual output to a metric function such as metrics.accuracy_score().
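
The five steps above can be sketched end to end as follows. This is only a minimal sketch, not the pipeline used in the rest of this notebook: it assumes scikit-learn's bundled copy of Iris (`load_iris`) instead of the Kaggle CSV, and the names `X_demo`, `y_demo`, and `clf` are made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: use scikit-learn's bundled Iris data so the sketch runs anywhere
X_demo, y_demo = load_iris(return_X_y=True)

# Step 1: split into a training set and a (smaller) testing set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Step 2: pick an algorithm for a classification problem
clf = DecisionTreeClassifier(random_state=0)

# Step 3: train on the training set
clf.fit(X_tr, y_tr)

# Step 4: predict labels for the testing set
pred = clf.predict(X_te)

# Step 5: compare the predictions with the true labels
acc = accuracy_score(y_te, pred)
print('Accuracy:', acc)
```

The same pattern works for every classifier used below; only the line in step 2 changes.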

**Split the dataset into a training set and a testing set**

**Advantages**

* By splitting the dataset pseudo-randomly into two separate sets, we can train using one set and test using another.
* This ensures that we won't use the same observations in both sets.
* More flexible and faster than creating a model using all of the dataset for training.

**Disadvantages**

* The accuracy scores for the testing set can vary depending on what observations are in the set.
* This disadvantage can be countered using k-fold cross-validation.
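
As a rough illustration of k-fold cross-validation, here is a sketch using scikit-learn's `cross_val_score`. It assumes the bundled `load_iris` data rather than the Kaggle CSV, and the choice of 5 folds and `n_neighbors=3` is arbitrary; the names `X_cv`, `y_cv`, and `knn_cv` are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
X_cv, y_cv = load_iris(return_X_y=True)

knn_cv = KNeighborsClassifier(n_neighbors=3)

# Split the data into 5 folds; each fold is used once as the testing set
# while the model is trained on the remaining 4 folds
fold_scores = cross_val_score(knn_cv, X_cv, y_cv, cv=5, scoring='accuracy')

print('Accuracy per fold:', fold_scores)
print('Mean accuracy:', fold_scores.mean())
```

Each entry in `fold_scores` is the accuracy on one held-out fold, so the mean depends less on one lucky or unlucky split than a single train/test score does.
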

**Notes**

* The accuracy score of the models depends on the observations in the testing set, which is determined by the seed of the pseudo-random number generator (random_state parameter).
* As a model's complexity increases, the training accuracy (accuracy you get when you train and test the model on the same data) increases.
* If a model is too complex or not complex enough, the testing accuracy is lower.
* For KNN models, the value of k determines the level of complexity. A lower value of k means that the model is more complex.
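
To see the first note in action, here is a small sketch that repeats the same split with different seeds. It assumes scikit-learn's bundled `load_iris` data, and the names `X_s`, `y_s`, `knn_seed`, and `seed_scores` are made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
X_s, y_s = load_iris(return_X_y=True)

seed_scores = {}
for seed in range(5):
    # A different random_state puts different observations in the testing set
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_s, y_s, test_size=0.4, random_state=seed
    )
    knn_seed = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    # score() returns the accuracy on the given testing data
    seed_scores[seed] = knn_seed.score(X_te, y_te)

for seed, score in seed_scores.items():
    print('random_state =', seed, '-> testing accuracy =', score)
```

If the printed accuracies differ from seed to seed, that difference comes only from which observations ended up in the testing set, since the model and its settings stay the same.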

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5) # 60% training data, 40% (0.4) test data
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

**Logistic Regression**

In [None]:
model = LogisticRegression()
model.fit(X_train,y_train)
prediction=model.predict(X_test)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction,y_test))

**Decision Tree**

In [None]:
model=DecisionTreeClassifier()
model.fit(X_train,y_train)
prediction=model.predict(X_test)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction,y_test))

**Support Vector Machine (SVM)**

In [None]:
model = svm.SVC() #select the algorithm
model.fit(X_train,y_train) # we train the algorithm with the training data and the training output
prediction=model.predict(X_test) #now we pass the testing data to the trained algorithm
print('The accuracy of the SVM is:',metrics.accuracy_score(prediction,y_test))#now we check the accuracy of the algorithm. 
#we pass the predicted output by the model and the actual output

**K-Nearest Neighbours**

In [None]:
model=KNeighborsClassifier(n_neighbors=3) #this examines 3 neighbours for putting the new data into a class
model.fit(X_train,y_train)
prediction=model.predict(X_test)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction,y_test))
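
To make the `n_neighbors=3` comment above more concrete, here is a from-scratch sketch of the idea behind KNN: find the k closest training points and let them vote. This is a simplified illustration, not scikit-learn's implementation; the `knn_predict` helper, the toy points, and the labels `'A'` and `'B'` are all made up.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and pick the most common one
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy (length, width) pairs, made up for illustration
toy_points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2),
              (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
toy_labels = ['A', 'A', 'A', 'B', 'B', 'B']

result_small = knn_predict(toy_points, toy_labels, (1.6, 0.2))
result_large = knn_predict(toy_points, toy_labels, (4.6, 1.3))
print(result_small, result_large)
```

A small k lets a single nearby point decide the class, while a larger k averages over more neighbours, which is why k controls the complexity of the model.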

In [None]:
# training accuracy for different k values (the model is trained and tested on the same data)
k_range = list(range(1,26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    y_pred = knn.predict(X)
    scores.append(metrics.accuracy_score(y, y_pred))
    
plt.plot(k_range, scores)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()

In [None]:
# testing accuracy for different k values (trained on the training set, scored on the testing set)
k_range = list(range(1,26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))
    
plt.plot(k_range, scores)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()

Let's check the accuracy for various values of k for K-Nearest Neighbours

In [None]:
a_index=list(range(1,11)) # values of k to try
accuracies=[] # a plain list, since Series.append was removed in pandas 2.0
for i in a_index:
    model=KNeighborsClassifier(n_neighbors=i) 
    model.fit(X_train,y_train)
    prediction=model.predict(X_test)
    # compare against the true test labels (y_test), not an earlier prediction
    accuracies.append(metrics.accuracy_score(y_test,prediction))
plt.plot(a_index, accuracies)
plt.xticks(a_index)
plt.show()

Above is the graph showing the accuracy for the KNN models using different values of k.

# Choosing KNN to Model Iris Species Prediction with k = 12
**After seeing that a value of k = 12 is a pretty good number of neighbors for this model, I used it to fit the model for the entire dataset instead of just the training set.**

In [None]:
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X, y)

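# make a prediction for an example of an out-of-sample observation
# the values follow the column order of X; a DataFrame keeps the feature names the model was fitted with
knn.predict(pd.DataFrame([[6, 3, 4, 2]], columns=X.columns))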

**We used all the features of iris in the above models. Now we will use Petals and Sepals Separately**

**Creating Petals And Sepals Training Data**

In [None]:
petal=data[['PetalLengthCm','PetalWidthCm','Species']]
sepal=data[['SepalLengthCm','SepalWidthCm','Species']]

In [None]:
train_p,test_p=train_test_split(petal,test_size=0.3,random_state=0)  #petals
train_x_p=train_p[['PetalWidthCm','PetalLengthCm']]
train_y_p=train_p.Species
test_x_p=test_p[['PetalWidthCm','PetalLengthCm']]
test_y_p=test_p.Species


train_s,test_s=train_test_split(sepal,test_size=0.3,random_state=0)  #Sepal
train_x_s=train_s[['SepalWidthCm','SepalLengthCm']]
train_y_s=train_s.Species
test_x_s=test_s[['SepalWidthCm','SepalLengthCm']]
test_y_s=test_s.Species

**Logistic Regression**

In [None]:
model = LogisticRegression()
model.fit(train_x_p,train_y_p) 
prediction=model.predict(test_x_p) 
print('The accuracy of the Logistic Regression using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s) 
prediction=model.predict(test_x_s) 
print('The accuracy of the Logistic Regression using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

**Decision Tree**

In [None]:
model=DecisionTreeClassifier()
model.fit(train_x_p,train_y_p) 
prediction=model.predict(test_x_p) 
print('The accuracy of the Decision Tree using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s) 
prediction=model.predict(test_x_s) 
print('The accuracy of the Decision Tree using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

**SVM**

In [None]:
model=svm.SVC()
model.fit(train_x_p,train_y_p) 
prediction=model.predict(test_x_p) 
print('The accuracy of the SVM using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model=svm.SVC()
model.fit(train_x_s,train_y_s) 
prediction=model.predict(test_x_s) 
print('The accuracy of the SVM using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

**K-Nearest Neighbours**

In [None]:
model=KNeighborsClassifier(n_neighbors=3) 
model.fit(train_x_p,train_y_p) 
prediction=model.predict(test_x_p) 
print('The accuracy of the KNN using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s) 
prediction=model.predict(test_x_s) 
print('The accuracy of the KNN using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

**Observations:** 

* Using Petals instead of Sepals for training gives much better accuracy.
* This was expected: in the heatmap above, the correlation between the Sepal Width and Length was very low, whereas the correlation between Petal Width and Length was very high.


Thus we have just implemented some of the common Machine Learning algorithms. Since the dataset is small with very few features, I didn't cover some concepts, as they become relevant when there are many features.

I have compiled a notebook covering some advanced ML concepts using a larger dataset. Have a look at that too.

I hope the notebook was useful to you to get started with Machine Learning.

If you found this notebook useful, **Please Upvote.**

**Thank You!!**