## Lab Assignment: Machine Learning with Scikit-Learn
Student: Jeff McKendry

### Objective: To give students practical experience in implementing basic machine learning algorithms using Scikit-Learn.

### Instructions:
Produce four machine learning models, one for each algorithm type listed below, using the dataset named in each task.

1. Decision Trees
- Load the Iris dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a decision tree classifier with a maximum depth of 2 and fit it to the training data.
- Evaluate the performance of the decision tree classifier on the testing data using accuracy as the evaluation metric.

2. K-Nearest Neighbors
- Load the Breast Cancer dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a K-Nearest Neighbors classifier with k=5 and fit it to the training data.
- Evaluate the performance of the K-Nearest Neighbors classifier on the testing data using precision, recall, and F1-score as the evaluation metrics.

3. Linear Regression
- Load the California Housing dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a linear regression model and fit it to the training data.
- Evaluate the performance of the linear regression model on the testing data using mean squared error as the evaluation metric.

4. Naive Bayes
- Load the SMS Spam Collection dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a Naive Bayes classifier and fit it to the training data.
- Evaluate the performance of the Naive Bayes classifier on the testing data using accuracy, precision, recall, and F1-score as the evaluation metrics.

### Deliverable:
Modify this notebook to include the Python code as well as any documentation related to your submission. Submit the notebook as your response in Blackboard.

### Grading Criteria:

Your lab assignment will be graded based on the following criteria:

- Correctness of the implementation
- Proper use of basic control structures and functions
- Code efficiency
- Clarity and readability of the code
- Compliance with the instructions and deliverables

### Student Submission
1. Decision Trees
- Load the Iris dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a decision tree classifier with a maximum depth of 2 and fit it to the training data.
- Evaluate the performance of the decision tree classifier on the testing data using accuracy as the evaluation metric.

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


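# Load the Iris features and class labels.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a shallow tree limited to two levels of splits.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)

# Score the predictions on the held-out test set.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the Decision Tree Classifier with max depth of 2: {accuracy:.2f}")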

Accuracy of the Decision Tree Classifier with max depth of 2: 0.98
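
To see what a tree with a maximum depth of 2 actually learned, the split rules can be printed as text. The following is a minimal sketch that repeats the same split and model settings as the cell above; the names `rules`, `tree_depth`, and `n_leaves` are introduced here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)

# Render the learned split rules as indented plain text.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# The depth actually reached can never exceed max_depth,
# and a binary tree of depth 2 has at most 4 leaves.
tree_depth = clf.get_depth()
n_leaves = clf.get_n_leaves()
print(f"depth={tree_depth}, leaves={n_leaves}")
```

Reading the printed rules is a quick way to confirm that the depth limit was applied and to see which features the tree splits on.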


2. K-Nearest Neighbors
- Load the Breast Cancer dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a K-Nearest Neighbors classifier with k=5 and fit it to the training data.
- Evaluate the performance of the K-Nearest Neighbors classifier on the testing data using precision, recall, and F1-score as the evaluation metrics.


In [7]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

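# Load the Breast Cancer features and binary labels (0 = malignant, 1 = benign).
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Hold out 30% of the samples for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Classify each test sample by majority vote among its 5 nearest training samples.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

# These metrics default to pos_label=1, so "benign" is treated as the positive class.
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")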


Precision: 0.95
Recall: 0.99
F1-score: 0.97
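
K-Nearest Neighbors relies on distances between samples, so features with large numeric ranges can dominate the vote. One common variation, shown here only as a sketch and not as part of the assignment requirements, is to standardize the features inside a pipeline before the classifier. The names `scaled_knn`, `y_pred_scaled`, and `scaled_scores` are introduced for illustration. Note also that label 1 means "benign" in this dataset, so the default precision, recall, and F1 scores treat benign as the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training split only,
# then applies the same scaling to the test split at predict time.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
y_pred_scaled = scaled_knn.predict(X_test)

# Same three metrics as the cell above, computed on the scaled model.
scaled_scores = {
    "Precision": precision_score(y_test, y_pred_scaled),
    "Recall": recall_score(y_test, y_pred_scaled),
    "F1-score": f1_score(y_test, y_pred_scaled),
}
for name, value in scaled_scores.items():
    print(f"{name}: {value:.2f}")

# Confirm which class each label refers to.
print(list(cancer.target_names))
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.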




3. Linear Regression
- Load the California Housing dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a linear regression model and fit it to the training data.
- Evaluate the performance of the linear regression model on the testing data using mean squared error as the evaluation metric.

In [8]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

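# Download (on first use) and load the California Housing data.
# The target is the median house value in units of $100,000.
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Hold out 30% of the samples for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit an ordinary least squares model on the training data.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Compare predictions with the true values on the held-out test set.
y_pred = lin_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error of the Linear Regression model: {mse:.2f}")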


Mean Squared Error of the Linear Regression model: 0.53
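
As a worked check on the metric: mean squared error averages the squared differences between true and predicted values, and its square root (RMSE) is in the same units as the target. The California Housing target is expressed in units of $100,000, so the MSE of 0.53 reported above corresponds to an RMSE of about 0.73, or roughly $73,000. The sketch below uses small made-up arrays (`y_true` and `y_hat` are hypothetical values, not taken from the dataset) to show that the manual formula agrees with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values, chosen for easy arithmetic.
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.5, 3.0, 4.0])

# Errors are 0.5, 0.0, -1.0, 0.5; their squares sum to 1.5, so the mean is 0.375.
manual_mse = np.mean((y_true - y_hat) ** 2)
library_mse = mean_squared_error(y_true, y_hat)

# RMSE puts the error back on the scale of the target variable.
rmse = np.sqrt(library_mse)

print(f"Manual MSE:  {manual_mse:.3f}")
print(f"Library MSE: {library_mse:.3f}")
print(f"RMSE:        {rmse:.3f}")
```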


4. Naive Bayes
- Load the SMS Spam Collection dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a Naive Bayes classifier and fit it to the training data.
- Evaluate the performance of the Naive Bayes classifier on the testing data using accuracy, precision, recall, and F1-score as the evaluation metrics.

In [12]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

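# scikit-learn's datasets module does not include the SMS Spam Collection;
# two 20 Newsgroups categories are used as the text data here.
categories = ['rec.sport.baseball', 'rec.sport.hockey']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Hold out 30% of the documents for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Build the vocabulary from the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Multinomial Naive Bayes works directly on the word-count features.
nb = MultinomialNB()
nb.fit(X_train_vec, y_train)

y_pred = nb.predict(X_test_vec)

# Weighted averaging combines the per-class scores in proportion to class support.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")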


Accuracy: 0.86
Precision: 0.88
Recall: 0.86
F1-score: 0.86
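
The task names the SMS Spam Collection, which is not bundled with Scikit-Learn, while the cell above works on two 20 Newsgroups categories. The same count-vectorizer and Naive Bayes steps would apply to spam messages. The sketch below illustrates this on a handful of made-up messages; `messages`, `labels`, and `new_messages` are hypothetical examples, not samples from the real SMS Spam Collection.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages; 1 = spam, 0 = ham (not spam).
messages = [
    "win a free prize now",
    "free cash prize claim now",
    "are we meeting for lunch today",
    "see you at the game tonight",
]
labels = [1, 1, 0, 0]

# Turn each message into a vector of word counts.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(messages)

# Fit Naive Bayes on the word counts.
nb = MultinomialNB()
nb.fit(X_counts, labels)

# Words unseen during training (such as "your") are ignored by transform().
new_messages = ["claim your free prize", "lunch at the game today"]
predictions = nb.predict(vectorizer.transform(new_messages))
print(predictions)
```

With only four training messages this is not a meaningful classifier; it only shows how the text-to-counts-to-prediction steps fit together.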
