## Getting Started with Logistic Regression!
![alt text](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png)

In this notebook, you are going to learn how to write a simple program that classifies an iris species as either (**virginica, setosa, or versicolor**) based on the petal length, petal width, sepal length, and sepal width, using a machine learning algorithm called Logistic Regression.

### Logistic Regression in Summary

Logistic regression is a model that uses the logistic (sigmoid) function to model the probability of a categorical dependent variable. Like other regression analyses, logistic regression is a predictive analysis.

![alt text](https://machinelearningblogcom.files.wordpress.com/2018/04/bildschirmfoto-2018-04-23-um-12-05-381.png?w=736)

Logistic regression is used to describe data and to explain the relationship between one dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables.

This model takes the input values x and returns an output f(x) between 0 and 1. When building a machine learning model, each data point is first reduced to a weighted sum of its features (x1 * w1 + x2 * w2 + ... and so on), and the logistic function maps that sum to a value between 0 and 1. If we take 0.5 as the deciding value, or threshold, then any result above 0.5 is taken as 1 and anything below it as 0.

This is what a sigmoid function looks like:

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png)
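
As a minimal sketch of the idea described above, the snippet below computes a weighted sum of the inputs, passes it through the sigmoid, and applies the 0.5 threshold. The helper names `sigmoid` and `predict_binary` are hypothetical, and the inputs and weights are made up for illustration; they are not weights learned from the iris data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_binary(x, w, threshold=0.5):
    """Return (probability, label) for one data point.

    x and w are equal-length sequences of inputs and weights.
    """
    z = sum(xi * wi for xi, wi in zip(x, w))  # x1*w1 + x2*w2 + ...
    p = sigmoid(z)
    return p, int(p > threshold)

# Hypothetical inputs and weights, chosen only for illustration
prob, label = predict_binary(x=[1.0, 2.0], w=[0.5, -0.1])
print(round(prob, 3), label)  # 0.574 1
```

The weighted sum here is 0.3, which the sigmoid maps to roughly 0.574, so the point falls on the "1" side of the threshold.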

### Defining the Problem

We will start by stating what we want this program to do: predict/classify the iris species as either (**virginica, setosa, or versicolor**) based on the **petal length, petal width, sepal length, and sepal width**.

### Imports

Let us start by importing the dependencies that will make this program a little easier to write. I’m importing the machine learning library **sklearn**, along with the plotting libraries **seaborn** and **matplotlib**, which you might have come across during your earlier courses with TechIS.

In [3]:
# Import the dependencies
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error, r2_score

Next I will load the data set from the seaborn library, store it in a variable called `data`, and run the `describe()` function on it to get a rudimentary idea of what's in the dataset.


---


One thing to note: the Iris dataset can easily be loaded into any of
your notebooks and stored in a variable with:

                data = sns.load_dataset("iris")

In [None]:
#Load the data set
data = sns.load_dataset("iris")
display(data.head())
display(data.describe(),"The Three possible species",data.species.unique())
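
`sns.load_dataset` fetches the file over the network the first time it is called. If that is not possible, one option is to build an equivalent table from the copy of the dataset that ships with scikit-learn. This is a sketch under assumptions: the variable name `data_offline` is hypothetical, and the columns are renamed by hand to match the seaborn column names used in this notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
frame = iris.frame.copy()  # 4 feature columns plus an integer 'target' column

# Rename to the seaborn-style column names used in this notebook
frame.columns = ["sepal_length", "sepal_width",
                 "petal_length", "petal_width", "target"]

# Map the integer codes 0, 1, 2 to the species names
code_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}
frame["species"] = frame["target"].map(code_to_name)

data_offline = frame.drop(columns="target")
print(data_offline.head())
print(data_offline["species"].unique())
```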

Start preparing the training data set by storing all of the independent variables (columns/features) in a variable called ‘X’. These are the columns `sepal_length`, `sepal_width`, `petal_length` and `petal_width`.

Then store the dependent variable (the target) in a variable called ‘y’. This is the column named `species`.


In [None]:
# X = feature values, all the columns except the species column
X = data.iloc[:, :-1]
display(X)

# y = target values, only the species column
y = data.iloc[:, -1]
display(y)

### Let's plot the relation between the features and the species!

We will use a scatter plot to show this relation. The sepal length will be blue, sepal width will be green, petal length will be red and petal width will be black.

In [None]:
# Plot the relation of each feature with each species
plt.xlabel('Features')
plt.ylabel('Species')

# One scatter series per feature, all plotted against the species column
feature_colors = {'sepal_length': 'blue', 'sepal_width': 'green',
                  'petal_length': 'red', 'petal_width': 'black'}
for feature, color in feature_colors.items():
    plt.scatter(data[feature], data['species'], color=color, label=feature)

plt.legend(loc=4, prop={'size':8})
plt.show()

### EDA

Here we can see that the four features (sepal length, sepal width, petal length, and petal width) together determine whether a flower is Setosa, Versicolor or Virginica.

Let us try plotting a 2-D scatter plot with a different colour for each species.

In [None]:
sns.set_style("whitegrid")
# `size` was renamed to `height` in newer seaborn versions
sns.FacetGrid(data,hue="species",height=8) \
    .map(plt.scatter,"sepal_length","sepal_width") \
    .add_legend()
plt.show()

### Post EDA

Some conclusions that can be derived from the graph are:

- The Setosa points (blue with the default palette) can be easily separated from the other two species by drawing a line.
- The Versicolor and Virginica points cannot be easily separated from each other.
- Using the sepal_length and sepal_width features, we can distinguish Setosa flowers from the others.
- Separating Versicolor from Virginica is much harder as they have considerable overlap.

### Optional : Additional EDA 

We could also make use of 3-D plots

In [None]:
import plotly.express as px
fig = px.scatter_3d(data, x='sepal_length', y='sepal_width', z='petal_width',color='species')
fig.show()

Here we use the plotly library for plotting. As you can see, sepal length is on the x-axis, sepal width on the y-axis and petal width on the z-axis.

### Optional : Pair Plots for EDA

A pairs plot allows us to see both the distribution of single variables and the relationships between two variables.
For example, we have four features in our iris dataset: ‘sepal length’, ‘sepal width’, ‘petal length’ and ‘petal width’. In that case, we will have 4C2 plots, i.e. 6 unique plots. The pairs in this case will be:

- sepal length, sepal width
- sepal length, petal length
- sepal length, petal width
- sepal width, petal length
- sepal width, petal width
- petal length, petal width
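
The count of 4C2 = 6 and the list above can be checked with a short sketch using the standard library. The `features` list here is typed out by hand for illustration.

```python
from itertools import combinations

features = ["sepal length", "sepal width", "petal length", "petal width"]

# All unordered pairs of distinct features: 4C2 = 6
pairs = list(combinations(features, 2))
for first, second in pairs:
    print(f"{first}, {second}")
print(len(pairs))  # 6
```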

In [None]:
sns.set_style("whitegrid")
# `size` was renamed to `height` in newer seaborn versions
sns.pairplot(data,hue="species",height=3)
plt.show()

### As Seen Above, The Pair Plots Can Be Divided Into Three Parts:

- The diagonal plots, which show the distribution of a single variable as a histogram or density curve (the PDF/probability distribution).
- The upper triangle and the lower triangle, which show the scatter plots.
- The scatter plots show the relationship between each pair of features. The upper and lower triangles are mirror images of each other.

### Splitting the Dataset into Training and Testing

Split the data into 80% training and 20% testing by using the function `train_test_split()` from the `sklearn.model_selection` module, and store the results in `x_train`, `x_test`, `y_train`, and `y_test`.

In [9]:
#Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
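
Because the split above is random, the metrics will change slightly from run to run. A possible variation, shown here as a sketch, fixes the seed with `random_state` and uses `stratify` so each species keeps the same share in both parts. The seed value 42 is arbitrary, and the `_demo` names are hypothetical. The snippet loads its own copy of the data from scikit-learn so it can run on its own.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # arbitrary seed, makes the split repeatable
    stratify=y_demo,    # keep the class proportions in both parts
)

print(x_tr.shape, x_te.shape)    # (120, 4) (30, 4)
print(Counter(y_te.tolist()))    # 10 test rows per species
```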

### Let's Start Training!

Create and train the Logistic Regression model!

In [None]:
#Train the model
model = LogisticRegression(verbose=1)
model.fit(x_train, y_train) #Training the model
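
The summary earlier describes a two-class output with a 0.5 threshold, while this problem has three species. With more than two classes, scikit-learn's `LogisticRegression` returns one probability per class through `predict_proba`, and `predict` picks the most probable class. The sketch below illustrates this on scikit-learn's own copy of the data; the names `clf`, `X_demo` and `y_demo` are hypothetical, and `max_iter=1000` is an assumption made only to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)  # max_iter raised as an assumption
clf.fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:5])    # one row per sample, one column per class
print(clf.classes_)                      # class order of the columns
print(proba.round(3))
print(proba.sum(axis=1))                 # each row sums to 1

# The predicted class is the one with the highest probability
predicted = clf.classes_[proba.argmax(axis=1)]
print(predicted)
```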

### But did it work?

Now that the model is trained, we will print the predictions and get a few metrics from the model based on the testing data set.

For prediction, we pass in our `x_test` and save the results in the `predictions` variable.

Then we call `classification_report` to check the precision, recall and f1-score for each species.

In [None]:
#Test the model
predictions = model.predict(x_test)
print(predictions)# printing predictions

print()# Printing new line

#Check precision, recall, f1-score
print( classification_report(y_test, predictions) )
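
To make the columns of the report less of a black box, here is a small sketch that computes precision, recall and f1-score for a single class by counting. The helper `per_class_metrics` is hypothetical, and the labels are made up for illustration; they are not the model's output.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating it as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correct hits
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    fn = sum(t == positive and p != positive for t, p in pairs)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: one versicolor flower is mistaken for virginica
y_true_demo = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred_demo = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

for species in ["setosa", "versicolor", "virginica"]:
    p, r, f = per_class_metrics(y_true_demo, y_pred_demo, species)
    print(species, round(p, 3), round(r, 3), round(f, 3))
```

In this toy example the single mistake lowers the recall of versicolor (a true versicolor was missed) and the precision of virginica (a virginica prediction was wrong).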

## Scoring and Metrics

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

print("The accuracy of the model:", accuracy_score(y_test, predictions))

class_names = ['setosa','versicolor','virginica']

titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    # plot_confusion_matrix was removed in scikit-learn 1.2;
    # ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it
    disp = ConfusionMatrixDisplay.from_estimator(model, x_test, y_test,
                                                 display_labels=class_names,
                                                 cmap=plt.cm.Blues,
                                                 normalize=normalize)
    disp.ax_.set_title(title)

    print(title)
    print(disp.confusion_matrix)

plt.show()
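
The `'true'` option normalizes each row of the confusion matrix, so every row (one true species) sums to 1. The sketch below shows that arithmetic on a made-up matrix of counts; the numbers are illustrative and are not results from the model above.

```python
# Rows = true species, columns = predicted species (made-up counts)
counts = [
    [10, 0, 0],   # setosa
    [0, 8, 2],    # versicolor
    [0, 1, 9],    # virginica
]

# normalize='true': divide each entry by the total of its row
normalized = [[c / sum(row) for c in row] for row in counts]
for row in normalized:
    print([round(v, 2) for v in row])

# Accuracy is the share of counts that lie on the diagonal
correct = sum(counts[i][i] for i in range(len(counts)))
total = sum(sum(row) for row in counts)
print(correct / total)  # 0.9
```

After normalization, the diagonal entries are the per-class recall values.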