<a href="https://colab.research.google.com/github/DillonZdrojewski/March-4-2025/blob/main/train_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install scikit-learn



### Split a dataset into training and testing sets

In [2]:
#Imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns



- Independent variables (features or attributes): `X`
- Dependent variable (target): `y`
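
To make the two roles concrete, here is a minimal sketch that loads the Iris data (the dataset used throughout this notebook) and inspects the feature matrix and the target vector. The names `iris_demo`, `X_demo`, and `y_demo` are illustrative and are not part of the original notebook.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn
iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# X holds one row per flower and one column per measured attribute
print(X_demo.shape)             # (150, 4)
# y holds one class label per flower
print(y_demo.shape)             # (150,)
print(iris_demo.feature_names)  # names of the four attributes
print(iris_demo.target_names)   # names of the three classes
```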


In [10]:
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

In [11]:
from sklearn.utils import shuffle

# Shuffle features and labels together so each row keeps its own label
X, y = shuffle(X, y)

### The target variable `y` is the flower class that each observation belongs to.
Class 0: Setosa  
Class 1: Versicolor  
Class 2: Virginica  

The dataset is sorted in ascending order by class, so splitting it as-is could leave the training set with only classes 0 and 1 while the test set contains only Virginica (class 2). Hence, we should **randomly shuffle the dataset before we split it**.
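
The problem described above can be seen directly. The sketch below cuts the unshuffled Iris data in half and lists the classes present in each half, then repeats the check after shuffling. The variable names and the `random_state=0` seed are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import shuffle

X_raw, y_raw = load_iris(return_X_y=True)
half = len(y_raw) // 2

# Without shuffling, the first half never sees class 2
# and the second half never sees class 0
print(np.unique(y_raw[:half]))   # [0 1]
print(np.unique(y_raw[half:]))   # [1 2]

# After shuffling, rows and labels are permuted together,
# so both halves draw from the full mix of classes
X_shuf, y_shuf = shuffle(X_raw, y_raw, random_state=0)
print(np.bincount(y_shuf[:half], minlength=3))
print(np.bincount(y_shuf[half:], minlength=3))
```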



### Split the data into training and testing.

- scikit-learn provides a ready-made function, `train_test_split`, to split data into training and testing sets.
- Here, we use 50% of the data for training and 50% for testing.


In [17]:
#Import Module
from sklearn.model_selection import train_test_split


train_X, test_X, train_y, test_y = train_test_split(X, y,
                                                    train_size=0.5,
                                                    test_size=0.5,
                                                    random_state=122)
print("Labels for training and testing data")
print(train_y)
print(test_y)

Labels for training and testing data
[1 0 1 1 2 2 0 0 1 2 1 1 2 2 2 2 2 2 0 0 0 1 1 1 2 2 0 1 0 1 0 2 0 2 1 2 2
 0 0 1 1 2 0 1 1 1 0 0 0 0 2 2 2 2 1 2 1 1 2 2 2 1 0 2 0 1 2 0 0 0 1 0 1 2
 2]
[0 2 0 0 0 2 1 2 0 2 0 1 2 2 0 1 1 2 0 1 1 1 2 2 2 0 2 0 0 1 1 0 1 1 1 0 2
 1 1 0 2 0 2 1 0 1 2 2 0 2 2 2 0 1 0 0 0 0 1 2 0 0 1 1 2 0 1 1 1 0 1 0 2 1
 1]
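
The labels printed above are well mixed, but a random split does not guarantee equal class counts in the two parts. As a possible variation (not used in the original notebook), `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below assumes a fresh copy of the Iris data and illustrative variable names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# stratify=y_all keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all,
    train_size=0.5,
    test_size=0.5,
    stratify=y_all,
    random_state=122,
)

# Count how many samples of each class landed in each part
print(np.bincount(y_tr))  # [25 25 25]
print(np.bincount(y_te))  # [25 25 25]
```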


---

## Logistic Regression: Classification

---

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

classifier = LogisticRegression(max_iter=10000, random_state=0)
classifier.fit(train_X, train_y)


In [21]:
prediction = classifier.predict(test_X)

In [20]:
print(prediction)
print(test_y)
#Top line is prediction and bottom is actual answer

[0 2 0 0 0 2 1 2 0 2 0 1 2 2 0 1 1 2 0 1 1 1 2 2 2 0 2 0 0 1 1 0 1 1 1 0 2
 1 1 0 2 0 2 1 0 1 2 2 0 2 2 2 0 1 0 0 0 0 1 1 0 0 1 1 2 0 1 1 2 0 1 0 2 1
 1]
[0 2 0 0 0 2 1 2 0 2 0 1 2 2 0 1 1 2 0 1 1 1 2 2 2 0 2 0 0 1 1 0 1 1 1 0 2
 1 1 0 2 0 2 1 0 1 2 2 0 2 2 2 0 1 0 0 0 0 1 2 0 0 1 1 2 0 1 1 1 0 1 0 2 1
 1]
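
Besides hard class labels, a fitted `LogisticRegression` can also report a probability for each class through `predict_proba`. The following sketch rebuilds a comparable model on a fresh copy of the data (so the exact numbers will differ from the run above) and shows that the predicted label is the class with the highest probability. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, train_size=0.5, random_state=122)

clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_tr, y_tr)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_te[:5])
print(np.round(proba, 3))

# The predicted label is the column with the largest probability
print(proba.argmax(axis=1))
print(clf.predict(X_te[:5]))
```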


## Performance Metric: Accuracy

In [28]:
# Accuracy as a percentage, reusing the predictions computed earlier
accuracy = accuracy_score(test_y, prediction) * 100
print(f"Logistic Regression model accuracy: {accuracy:.2f}%")

Logistic Regression model accuracy: 97.33%


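### Alternative ways of calculating the accuracy

In [26]:
# Fraction of predictions that match the true labels
np.mean(prediction == test_y)
# Built-in equivalent; only this last expression is displayed as the cell output
classifier.score(test_X, test_y)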

0.9733333333333334
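
Accuracy is simply the number of correct predictions divided by the total number of predictions. The toy example below (hand-made labels, not taken from the notebook) checks that the manual calculation, `np.mean`, and `accuracy_score` all agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hand-made labels for illustration: 4 of the 5 predictions are correct
y_true_toy = np.array([0, 1, 2, 2, 1])
y_pred_toy = np.array([0, 1, 2, 1, 1])

n_correct = int(np.sum(y_true_toy == y_pred_toy))
manual_acc = n_correct / len(y_true_toy)
mean_acc = float(np.mean(y_true_toy == y_pred_toy))
sk_acc = accuracy_score(y_true_toy, y_pred_toy)

print(n_correct, manual_acc, mean_acc, sk_acc)  # 4 0.8 0.8 0.8
```

By the same arithmetic, the 97.33% reported above on 75 test samples corresponds to 73 correct predictions, since 73 / 75 ≈ 0.9733.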

In [27]:
# performance on the training set
classifier.score(train_X, train_y)

0.96

## Now, try a 20% training and 80% testing split.

In [30]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y,
                                                    train_size=0.2,
                                                    test_size=0.8,
                                                    random_state=122)
print("Labels for training and testing data")
print(train_y)
print(test_y)

Labels for training and testing data
[1 0 0 0 0 2 2 2 2 1 2 1 1 2 2 2 1 0 2 0 1 2 0 0 0 1 0 1 2 2]
[0 2 0 0 0 2 1 2 0 2 0 1 2 2 0 1 1 2 0 1 1 1 2 2 2 0 2 0 0 1 1 0 1 1 1 0 2
 1 1 0 2 0 2 1 0 1 2 2 0 2 2 2 0 1 0 0 0 0 1 2 0 0 1 1 2 0 1 1 1 0 1 0 2 1
 1 1 0 1 1 2 2 0 0 1 2 1 1 2 2 2 2 2 2 0 0 0 1 1 1 2 2 0 1 0 1 0 2 0 2 1 2
 2 0 0 1 1 2 0 1 1]


In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

classifier = LogisticRegression(max_iter=10000, random_state=0)
classifier.fit(train_X, train_y)

In [32]:
prediction = classifier.predict(test_X)

In [38]:
print(prediction)
print(test_y)
# Top line is prediction and bottom is actual answer


In [42]:
acc = accuracy_score(test_y, classifier.predict(test_X)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")


Logistic Regression model accuracy: 87.50%
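
To compare several split sizes side by side, the experiment can be wrapped in a loop. This is a rough sketch on a fresh copy of the data with illustrative names; because the split depends on the random seed, the accuracies it prints will not necessarily match the figures above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

results = {}
for train_fraction in (0.2, 0.5, 0.8):
    # The remaining samples automatically become the test set
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_all, y_all, train_size=train_fraction, random_state=122)
    clf = LogisticRegression(max_iter=10000, random_state=0)
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    results[train_fraction] = (len(y_tr), len(y_te), acc)

for frac, (n_train, n_test, acc) in results.items():
    print(f"train={frac:.0%} ({n_train} rows), test={n_test} rows, "
          f"accuracy={acc * 100:.2f}%")
```

A single split gives a noisy estimate, so small differences between the rows should be read with caution.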


## K Nearest Neighbors

In [49]:
from sklearn.neighbors import KNeighborsClassifier

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Create a KNN classifier; n_neighbors sets how many nearest neighbors
# vote on each prediction (here k=3)
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate the model
accuracy = 100*accuracy_score(y_test, y_pred)
print(f"kNN Accuracy: {accuracy}")

kNN Accuracy: 97.33333333333334


In [56]:
from sklearn.neighbors import KNeighborsClassifier

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = 100*accuracy_score(y_test, y_pred)
print(f"kNN Accuracy: {accuracy}")

kNN Accuracy: 100.0
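
The two kNN runs above differ only in how the data was split, yet the accuracy moves from 97.33% to 100%. One way to get a steadier estimate, and to compare different values of `n_neighbors`, is cross-validation. That technique is not used in the original notebook; the sketch below uses `cross_val_score` with five folds, and the candidate values of k are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)

mean_scores = {}
for k in (1, 3, 5, 7):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # Five stratified folds: each fold serves once as the test set
    scores = cross_val_score(knn_k, X_all, y_all, cv=5)
    mean_scores[k] = scores.mean()
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")
```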


## Decision Tree Classifier

In [53]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Split the dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create Decision Tree classifier object
classifier = DecisionTreeClassifier()

# Train Decision Tree classifier
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test) # predict

# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print('Accuracy:',classifier.score(X_test, y_test))

Accuracy: 0.9555555555555556
Accuracy: 0.9555555555555556
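
A decision tree has the advantage that its learned rules can be read directly. The sketch below trains a shallow tree and prints its rules with `export_text`. The `max_depth=2` limit and the `random_state=0` seed are assumptions added here for readability and reproducibility; the original cell uses the default settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, test_size=0.3, random_state=1)

# A shallow tree keeps the printed rules short
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_clf.fit(X_tr, y_tr)

# Text rendering of the if/else thresholds the tree learned
rules = export_text(tree_clf, feature_names=list(iris_data.feature_names))
print(rules)
print("Test accuracy:", tree_clf.score(X_te, y_te))
```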


## Performance Metrics

In [57]:
# Import the metric functions before using them
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
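
To show how to read these two metrics, here is a small self-contained example on hand-made labels (not the notebook's data). In the confusion matrix, each row is a true class and each column is a predicted class, so off-diagonal entries are mistakes.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-made labels for illustration: one class-2 sample is predicted as class 1
y_true_toy = [0, 1, 2, 2, 1]
y_pred_toy = [0, 1, 2, 1, 1]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
# [[1 0 0]
#  [0 2 0]
#  [0 1 1]]

# Per-class precision, recall, and F1 score
report = classification_report(y_true_toy, y_pred_toy)
print(report)
```

The single off-diagonal entry in the last row is the class-2 sample that was labelled as class 1.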