## Iris dataset experimentation
The Iris dataset was used in R.A. Fisher's classic 1936 paper, "The Use of Multiple Measurements in Taxonomic Problems," and is also available from the UCI Machine Learning Repository.

It includes three iris species with 50 samples each, along with four measurements for each flower: sepal length, sepal width, petal length, and petal width. One species is linearly separable from the other two, but the other two are not linearly separable from each other.

https://www.geeksforgeeks.org/exploratory-data-analysis-on-iris-dataset/

In [None]:
# Step 1: Import Necessary Libraries
# Import libraries for data manipulation, visualization, and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 0: Download the data (also available at https://www.kaggle.com/datasets/uciml/iris)
# Once this cell has been run and iris.csv exists, the code below can be safely deleted
from sklearn import datasets
# Load the IRIS dataset and write it to a CSV file
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris_df.to_csv('iris.csv', index=False)
del iris_df

In [None]:
# Step 2: Load the IRIS Dataset
# Load the IRIS dataset from a CSV file (hint, use the pandas read_csv function)
# Create a DataFrame from the dataset

iris_df = pd.read_csv('iris.csv')
iris_df.head() # Display the first 5 rows of the dataset

### Step 3: Explore the Data

Get basic information and statistics about the dataset
- Use `.head()` to view the first few rows of the dataset
- Use `.info()` to understand the structure of the dataset (e.g., data types, missing values)
- Use `.describe()` to get summary statistics for each feature
- Use `iris_df["<col_name>"]` or `iris_df.col_name` to get a specific column of data (called a series in pandas)
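The inspection calls listed above can be sketched as follows. This is a minimal illustration, not the expected solution: it rebuilds the DataFrame straight from scikit-learn so that it runs without `iris.csv`, and the names `sepal_length_series` and `same_column` are made up for the example. In scikit-learn's copy of the data, the species codes 0, 1, and 2 stand for setosa, versicolor, and virginica.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame from scikit-learn so the sketch does not depend on iris.csv
iris = datasets.load_iris()
iris_df = pd.DataFrame(
    iris["data"],
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
iris_df["species"] = iris["target"]

print(iris_df.head())      # first five rows
iris_df.info()             # dtypes and non-null counts (prints directly)
print(iris_df.describe())  # count, mean, std, min, quartiles, max per column

# Bracket and attribute access return the same pandas Series
sepal_length_series = iris_df["sepal_length"]
same_column = iris_df.sepal_length
print(sepal_length_series.equals(same_column))

# Number of samples for each species code
print(iris_df["species"].value_counts())
```

Bracket access is the safer habit, since attribute access fails for column names that contain spaces or clash with DataFrame methods.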
  
Visualization
- Use matplotlib scatter plots to visualize the relationships between pairs of features
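One possible way to draw every pair of features at once is sketched below. It is only an illustration: the 2x3 grid layout, the figure size, and the helper name `pairs` are choices made for this example, and the data is reloaded from scikit-learn so that the block is self-contained.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_df = pd.DataFrame(iris["data"], columns=feature_names)
iris_df["species"] = iris["target"]

# Four features give six unique pairs; draw one panel per pair
pairs = list(combinations(feature_names, 2))
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (x_name, y_name) in zip(axes.ravel(), pairs):
    # Color each point by its numeric species code
    ax.scatter(iris_df[x_name], iris_df[y_name], c=iris_df["species"])
    ax.set_xlabel(x_name)
    ax.set_ylabel(y_name)
fig.tight_layout()
```

Seaborn's `sns.pairplot(iris_df, hue="species")` produces a similar grid in a single call, if a ready-made version is preferred.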

In [None]:
# TODO: Create a plot of each pair of features, colored by species. For instance,
# this is a plot of sepal length vs. sepal width:
plt.scatter(iris_df['sepal_length'], iris_df['sepal_width'], c=iris_df['species'])

TODO: Replace this cell with a *short* explanation of what you see in the plots. Be ready to expand when presenting. For example, you might note which species is easiest to differentiate and which variables best distinguish each species.

### Step 4: Preprocess the Data
- Split the dataset into features and target. 
- Split the dataset into training and testing sets
- By convention, `X` is capitalized because it is a matrix, whereas `y` is lowercase because it is a vector
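The preprocessing steps above can be sketched roughly as follows. This is one possible approach under stated assumptions: the species column is assumed to be the last column, the data is reloaded from scikit-learn instead of `iris.csv`, and the names `X_train_scaled` and `X_test_scaled` are introduced only for this example. The scaler is fit on the training set alone so that no information from the test set leaks into the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
iris_df = pd.DataFrame(
    iris["data"],
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
iris_df["species"] = iris["target"]

X = iris_df.iloc[:, :-1]  # every column except the last: a 2-D feature matrix
y = iris_df.iloc[:, -1]   # the last column: a 1-D target vector
print(f"X shape: {X.shape}, y shape: {y.shape}")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# After scaling, each training feature has mean ~0 and standard deviation ~1
print(np.round(X_train_scaled.mean(axis=0), 6))
print(np.round(X_train_scaled.std(axis=0), 6))
```

The test set is transformed with the training statistics, so its feature means and standard deviations will be close to, but not exactly, 0 and 1.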

In [None]:
X = #TODO (hint: features are all columns except the last one)
y = #TODO (hint: species is likely the last column)
print(f'X shape: {X.shape}, y shape: {y.shape}')

# TODO: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing the features can be helpful for some models. This can be done manually
# or with StandardScaler from sklearn
# TODO: Plot some of the graphs explored in Step 3 before and after scaling to see the difference

### Step 5: Train and evaluate the model

#### Train

In [None]:
# Use Logistic Regression to classify the flowers based on their features
# TODO: Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

#### Evaluate

In [None]:
# Make predictions on the test set
# TODO: Make predictions
y_pred = model.predict(X_test)
y_true = #TODO (hint, what is the true value of y that we're testing against?)

# Print accuracy, confusion matrix, and classification report to evaluate model performance
# TODO: Evaluate the model
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))

In [None]:
# TODO: Use sns.heatmap to visualize the confusion matrix
# TODO: Normalize the confusion matrix so that each row sums to 1
# Normalizing the confusion matrix lets us see the proportion of correct and
# incorrect predictions for each class. Reminder: read the
# Confusion Matrix article for more information

TODO: Replace this cell with a short explanation of how to interpret the confusion matrix in context
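As a hedged illustration of row normalization, the sketch below uses made-up labels (not results from this notebook) so that the arithmetic is easy to follow. In scikit-learn's convention, rows are the true classes and columns are the predicted classes, so dividing each row by its total gives the fraction of each true class that was assigned to each predicted class. The axis labels and color map are arbitrary choices for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only; these are not results from the notebook
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 1, 2, 1, 2])

cm = confusion_matrix(y_true, y_pred)
# Divide each row by its total so that every row sums to 1
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
sns.heatmap(cm_normalized, annot=True, fmt=".2f", cmap="Blues",
            vmin=0, vmax=1, ax=ax)
ax.set_xlabel("Predicted species")
ax.set_ylabel("True species")
```

Passing `normalize="true"` to `confusion_matrix` performs the same row normalization directly. With these made-up labels, the middle row reads 0.00, 0.75, 0.25: three of the four true class-1 samples were predicted correctly and one was mistaken for class 2.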

### Step 6: Repeat with a model of your choice
KMeans, KNN, and Random Forest (all from sklearn) are good starting points, but feel free to experiment.
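As one example of what this step could look like, here is a minimal sketch using KNN. It is not the expected answer: the data is reloaded from scikit-learn so the block stands alone, and `n_neighbors=5` is simply scikit-learn's default written out, not a tuned value.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train: n_neighbors=5 is the library default, shown explicitly for visibility
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Test: predict species codes for the held-out samples
y_pred = knn.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred)
knn_cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", knn_accuracy)
print("Confusion Matrix:\n", knn_cm)
```

KNN relies on distances between samples, so it is one of the models where scaling the features first can change the result.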

In [None]:
# Train

In [None]:
# Test

In [None]:
# Confusion Matrix

TODO: Replace this cell with a short explanation of how to interpret the confusion matrix in context

### Step 7: Conclusion and Next Steps
Work through the steps below with a focus on THINKING through the results. Writing them down is not strictly necessary, but be ready to explain your thinking.
- Discuss the accuracy and overall performance of the two models
- Provide suggestions for improvement or extensions, such as trying different algorithms or hyperparameter tuning
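For the hyperparameter tuning suggestion, a rough sketch with scikit-learn's `GridSearchCV` might look like the following. The candidate values for `C`, the 5-fold setting, and `max_iter=1000` are illustrative assumptions, not recommendations, and the data is reloaded from scikit-learn so the block is self-contained.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Illustrative grid over the regularization strength C (smaller = stronger)
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

The test set is scored only once, after the search has finished, so it still gives an unbiased estimate of the chosen model.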

TODO: discussion