## Logistic Regression Exercise

1) Load the iris dataset
2) Using Logistic Regression, classify the outcome (column: 'class') based on the features (columns: 'sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm')
3) Provide some values to predict the outcome
4) Validate the model - print the confusion matrix and the accuracy score

In [1]:
# import everything
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sb

We are going to load a cleaned copy of the iris dataset from a local CSV file for this example.


In [2]:
iris_data = pd.read_csv('iris-data-clean.csv')
iris_data.tail()

Unnamed: 0,sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,class
140,6.7,3.0,5.2,2.3,Virginica
141,6.3,2.5,5.0,1.9,Virginica
142,6.5,3.0,5.2,2.0,Virginica
143,6.2,3.4,5.4,2.3,Virginica
144,5.9,3.0,5.1,1.8,Virginica
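
If the CSV file is not at hand, a similar frame can be built from the iris data bundled with scikit-learn. This is only a sketch: the name `builtin_df` and the renamed columns are assumptions chosen to mirror the CSV layout, and the bundled data has 150 rows, so it is not identical to the cleaned file shown above.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data as a DataFrame (four features plus a 'target' column).
iris = load_iris(as_frame=True)
builtin_df = iris.frame.copy()

# Rename the columns to mirror the CSV layout used in this exercise (assumed names).
builtin_df.columns = [
    "sepal_length_cm", "sepal_width_cm",
    "petal_length_cm", "petal_width_cm", "target",
]

# Replace the integer target with capitalised species names in a 'class' column.
species = pd.Series(iris.target_names).str.capitalize()
builtin_df["class"] = builtin_df["target"].map(species)
builtin_df = builtin_df.drop(columns="target")

print(builtin_df.tail())
```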


In [4]:
# iris_data is already a DataFrame; keep a working copy with the same column names
df = iris_data.copy()
df.tail()

Unnamed: 0,sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,class
140,6.7,3.0,5.2,2.3,Virginica
141,6.3,2.5,5.0,1.9,Virginica
142,6.5,3.0,5.2,2.0,Virginica
143,6.2,3.4,5.4,2.3,Virginica
144,5.9,3.0,5.1,1.8,Virginica

In [5]:
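def encode_class(label):
    """Map a species name to an integer code for the classifier."""
    if label == "Setosa":
        return 0
    elif label == "Virginica":
        return 1
    else:
        # any other label (the remaining species) gets code 2
        return 2

df["class"] = df["class"].apply(encode_class)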
df.head()

Unnamed: 0,sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
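
The same encoding can also be written as a dictionary lookup with `Series.map`. The sketch below runs on a small made-up series rather than the real frame. `CLASS_CODES` is a hypothetical name, and `"Versicolor"` is an assumed spelling for the third class, which the cell above only handles through its `else` branch.

```python
import pandas as pd

# Hypothetical label-to-code table; "Versicolor" is an assumed spelling.
CLASS_CODES = {"Setosa": 0, "Virginica": 1, "Versicolor": 2}

labels = pd.Series(["Setosa", "Versicolor", "Virginica", "Setosa"])
codes = labels.map(CLASS_CODES)
print(codes.tolist())  # [0, 2, 1, 0]

# Labels missing from the table become NaN instead of silently receiving code 2.
unknown = pd.Series(["setosa"]).map(CLASS_CODES)
print(unknown.isna().tolist())  # [True]
```

An explicit table makes a misspelled label visible as a missing value, whereas the `else` branch would quietly assign it to class 2.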


In [None]:
df.tail()

In [None]:
# we are going to use the four features (sepal length, sepal width, petal length, petal width)
X = df.iloc[:,0:4]

# use 'class' as the target we're trying to predict
y = df['class']
X.head()

### Visualize the Data

In [None]:
sb.pairplot(df, hue='class')  # color the points by class
plt.show()

### Train our Model

Now we can split the data into training and test sets, then use the training set to fit our model.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
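
When no size is given, `train_test_split` holds out 25% of the rows for testing. The sketch below uses made-up arrays (`X_demo`, `y_demo`) to show the resulting shapes. It also passes `stratify`, an optional argument not used in this exercise, which keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each, and two balanced classes.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# test_size=0.25 matches the default; stratify keeps both classes in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
print(np.bincount(y_te))       # number of test samples per class
```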

In [None]:
X_train.head()

In [None]:
# we are going to fit our model to the training set

# with the lbfgs solver a multinomial model is fitted for multi-class targets by default;
# the multi_class argument is deprecated since scikit-learn 1.5, so it is left out here
logReg = LogisticRegression(solver = 'lbfgs', random_state = 42)
logReg.fit(X_train, y_train)

In [None]:
y_pred = logReg.predict(X_test)

In [None]:
print(y_pred)
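
Besides the predicted labels, a fitted `LogisticRegression` can report a probability for each class through `predict_proba`. The sketch below is self-contained and uses scikit-learn's bundled iris data as a stand-in for the CSV, so its class codes and numbers need not match the cells above. The names `clf` and `proba` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: the bundled iris set rather than the exercise's CSV.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# max_iter is raised so the solver has room to converge.
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_tr, y_tr)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_te[:3])
print(clf.classes_)
print(proba.round(3))

# The predicted label is the class with the highest probability.
print(clf.classes_[proba.argmax(axis=1)], clf.predict(X_te[:3]))
```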

### Model Validation

As with Linear Regression, we want to know how well our model predicts.
Since we are doing classification with Logistic Regression, we use
`accuracy_score()` from `sklearn.metrics`, which returns the fraction of test samples classified correctly.
A score closer to 1 means better predictions.

In [None]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))
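
Step 4 of the exercise also asks for the confusion matrix. A minimal sketch with made-up labels (`y_true_demo` and `y_pred_demo` are illustrative) shows how `confusion_matrix` from `sklearn.metrics` relates to the accuracy score. In the notebook, the same call would presumably take `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: one sample of class 1 is misclassified as class 2.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]

# Accuracy is the share of samples on the diagonal: 5 of 6 here.
acc = accuracy_score(y_true_demo, y_pred_demo)
print(acc, cm.trace() / cm.sum())
```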

In [None]:
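# we can use this model to predict the class for new measurements
# (wrap them in a DataFrame with the training column names to avoid a feature-name warning)
new_sample = pd.DataFrame([[4.9, 3.5, 1.6, 0.25]], columns=X.columns)
print(logReg.predict(new_sample))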

### Experimentation - choosing only two features

In [None]:
# we are going to choose only two features.

X2 = df.iloc[:, 2:4]  # the columns for petal_length and petal_width
y2 = df['class']
X2.head()
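
Positional slices such as `iloc[:, 2:4]` depend on the column order. Selecting the same columns by name is a little more robust, as this small sketch shows. It rebuilds the first two rows of the table shown earlier in a throwaway frame called `demo` (an illustrative name).

```python
import pandas as pd

# First two rows of the data shown earlier, rebuilt by hand for illustration.
demo = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9],
    "sepal_width_cm": [3.5, 3.0],
    "petal_length_cm": [1.4, 1.4],
    "petal_width_cm": [0.2, 0.2],
    "class": [0, 0],
})

# Select the petal columns by name instead of by position.
petal = demo[["petal_length_cm", "petal_width_cm"]]
print(petal.equals(demo.iloc[:, 2:4]))  # True
```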

In [None]:
df_zero = df.loc[df['class'] == 0]
df_one = df.loc[df['class'] == 1]
df_two = df.loc[df['class'] == 2]
df_zero.head()

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
# select the columns for petal_length and petal_width
ax.scatter(df_zero.iloc[:, 2:3], df_zero.iloc[:, 3:4])
ax.scatter(df_one.iloc[:, 2:3], df_one.iloc[:, 3:4])
ax.scatter(df_two.iloc[:, 2:3], df_two.iloc[:, 3:4])
ax.set(xlabel = 'petal length /cm', ylabel = 'petal width /cm')

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state = 42)

In [None]:
X2_train.head()

In [None]:
# create a new instance of the Logistic Regression model

logReg2 = LogisticRegression(solver = 'lbfgs', random_state = 42)  # use the same seed for random_state
logReg2.fit(X2_train, y2_train)

In [None]:
y2_pred = logReg2.predict(X2_test)
print(y2_pred)

In [None]:
print(y2_test)

In [None]:
print(accuracy_score(y2_test, y2_pred))

In [None]:
# how does the accuracy score compare with the model that uses all four features?
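
One way to make that comparison repeatable is to wrap the split, fit and score steps in a helper and call it once per feature set. The sketch below is an assumption-laden stand-in: it uses scikit-learn's bundled iris data instead of the CSV, and `holdout_accuracy` is a hypothetical helper, so the scores it prints need not match the notebook's.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data: the bundled iris set rather than the exercise's CSV.
X_all, y_all = load_iris(return_X_y=True)


def holdout_accuracy(X, y):
    """Split, fit and score with the same seed so the runs are comparable."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
    model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


acc_four = holdout_accuracy(X_all, y_all)          # all four features
acc_two = holdout_accuracy(X_all[:, 2:4], y_all)   # petal length and width only

print(f"four features:       {acc_four:.3f}")
print(f"petal features only: {acc_two:.3f}")
```

A single hold-out split is a noisy estimate on a dataset this small, so a small difference between the two scores should not be over-interpreted.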