## CSE 532 Assignment 4 (Due 4/27/24)

**Note: As with the previous assignment you should submit a separate document (.pdf or .doc(x)) with your responses to the analysis portion of the problems.** 

**1. (Machine Learning (Classification))** <br>a. Choose one of the [toy classification datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) bundled with sklearn **other than the digits dataset**. <br> b. Train **three** distinct sklearn classification estimators for the chosen dataset and compare the results to see which one performs the best when using **2-fold cross-validation**.  Note that you should use three distinct classification models here (not just tweak underlying parameters).  A relatively complete listing of the available estimators can be found here (https://scikit-learn.org/stable/supervised_learning.html) -- but make sure you only use classifiers!  Unless you have an inclination to do otherwise, I recommend using the model default parameters when available.   <br> c. Repeat b. for **20-fold cross-validation**. Explain in a paragraph the difference in your results when using 20-fold vs 2-fold cross-validation (if any). <br>d. Construct a **confusion matrix** for your _most accurate_ model among the three estimators and two cross-validation options. <br> e. Which class in your dataset is most accurately predicted to have the correct label by the best classifier, and which is most likely to be confused with one or more of the wrong classes? _(You can use a cell in a jupyter notebook file for this or a separate text/document file)._

**2 (Option I). (Machine Learning (Regression))** <br>a. Locate a non-proprietary, small-scale dataset _suitable for regression_ online.  There are countless sources and repositories that you can use in this task, but if you have trouble finding one, I recommend starting via Kaggle (https://www.kaggle.com/code/rtatman/datasets-for-regression-analysis/notebook).  Explain briefly what the dataset represents, what target variable you will be using, and what other features are present.  _You may want or need to apply preprocessing to your data to ensure it can be used properly with the regression models_ (e.g. making every feature numeric through transformation or by dropping some)  <br> b. Train **three** distinct sklearn regression estimators for the chosen dataset and compare the results to see which one performs the best when using **10-fold cross-validation**, utilizing the R-Squared score to gauge performance.  Note that you should use three distinct regression models here (not just tweak underlying parameters).  A relatively complete listing of the available estimators can be found here (https://scikit-learn.org/stable/supervised_learning.html) -- but make sure you only use regression models!  Unless you have an inclination to do otherwise, I recommend using the model default parameters when available.<br>  c. Repeat part b utilizing the Mean Squared Error to gauge performance.  _Briefly_ research the difference between the two metrics (MSE and R2), and explain in a paragraph or two i. the difference between them ii. when each one is the preferable metric to use. _(You can use a cell in a jupyter notebook file for this or a separate text/document file)._

**2 (Option II). (PRAW and Sentiment)** <br>*Note: you should feel free to propose an alternative task using PRAW of your own design to me via email if you have a specific one in mind.*
<br>a. Use PRAW to extract the 20 top submissions of all time from each of five related subreddits of your choice (ex: someone interested in sports subreddits might extract the top 20 posts from r/basketball, top 20 posts from r/football, etc.). 
<br>b. Use a sentiment analyzer (via TextBlob or one of your choosing) to determine the positive sentiment of each and every top submission (all 100 in total), and store these in a variable of your choosing (data-frame, list of lists, etc.)  
<br>c. Investigate how to use a Python [box-plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html).  After doing so, produce a box-plot (default parameters are fine) for the sentiment measures of each of your five subreddits.  Your presentation should look _something_ like the image at the bottom of this file, with your five chosen subreddits replacing the x-axis labels (Subreddit 1 .. Subreddit 5).
<br>d. Repeat steps a-c but use the 20 most controversial submissions of all time for each of your five subreddits.<br>e. Does anything surprise you about the distribution of sentiments, either with respect to individual subreddits, differences between the five subreddits, or the differences between _top_ and _controversial_ submissions?  Explain your answer in a few sentences.  _(You can use a cell in a jupyter notebook file for this or a separate text/document file)._
<br>![Box Plot Example](A4BPExample.png)


Problem 1 Code:

In [17]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Define the 2-fold cross-validation
kf = KFold(n_splits=2, shuffle=True, random_state=42)

# Define the three classifiers
classifiers = {
    'k-Nearest Neighbors': KNeighborsClassifier(),
    'SVC': SVC(),
    'Gaussian Naive Bayes': GaussianNB()
}

# Perform 2-fold cross-validation for each classifier
results = {}
for name, clf in classifiers.items():
    # Calculate the accuracy scores using cross-validation
    scores = cross_val_score(clf, X, y, cv=kf)
    avg_accuracy = np.mean(scores)
    
    # Calculate predictions using cross-validation
    y_pred = cross_val_predict(clf, X, y, cv=kf)
    
    # Calculate the confusion matrix
    conf_matrix = confusion_matrix(y, y_pred)
    
    # Store results in the dictionary
    results[name] = {
        'Average Accuracy': avg_accuracy,
        'Confusion Matrix': conf_matrix
    }

# Display the results
for name, metrics in results.items():
    print(f"{name}:")
    print(f"  Average Accuracy: {metrics['Average Accuracy']:.2f}")
    print("  Confusion Matrix:")
    print(metrics['Confusion Matrix'])
    print()



k-Nearest Neighbors:
  Average Accuracy: 0.95
  Confusion Matrix:
[[50  0  0]
 [ 0 47  3]
 [ 0  5 45]]

SVC:
  Average Accuracy: 0.97
  Confusion Matrix:
[[50  0  0]
 [ 0 48  2]
 [ 0  2 48]]

Gaussian Naive Bayes:
  Average Accuracy: 0.97
  Confusion Matrix:
[[50  0  0]
 [ 0 48  2]
 [ 0  3 47]]



In [18]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Define the 20-fold cross-validation
kf = KFold(n_splits=20, shuffle=True, random_state=42)

# Define the three classifiers
classifiers = {
    'k-Nearest Neighbors': KNeighborsClassifier(),
    'SVC': SVC(),
    'Gaussian Naive Bayes': GaussianNB()
}

# Perform 20-fold cross-validation for each classifier
results = {}
for name, clf in classifiers.items():
    # Calculate the accuracy scores using cross-validation
    scores = cross_val_score(clf, X, y, cv=kf)
    avg_accuracy = np.mean(scores)
    
    # Calculate predictions using cross-validation
    y_pred = cross_val_predict(clf, X, y, cv=kf)
    
    # Calculate the confusion matrix
    conf_matrix = confusion_matrix(y, y_pred)
    
    # Store results in the dictionary
    results[name] = {
        'Average Accuracy': avg_accuracy,
        'Confusion Matrix': conf_matrix
    }

# Display the results
for name, metrics in results.items():
    print(f"{name}:")
    print(f"  Average Accuracy: {metrics['Average Accuracy']:.2f}")
    print("  Confusion Matrix:")
    print(metrics['Confusion Matrix'])
    print()

k-Nearest Neighbors:
  Average Accuracy: 0.96
  Confusion Matrix:
[[50  0  0]
 [ 0 47  3]
 [ 0  2 48]]

SVC:
  Average Accuracy: 0.97
  Confusion Matrix:
[[50  0  0]
 [ 0 47  3]
 [ 0  2 48]]

Gaussian Naive Bayes:
  Average Accuracy: 0.96
  Confusion Matrix:
[[50  0  0]
 [ 0 47  3]
 [ 0  3 47]]



Problem 1
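    There is no meaningful difference between 2-fold and 20-fold cross-validation: across all three models the average accuracy changes by at most 1 percentage point (k-Nearest Neighbors 0.95 vs 0.96, SVC 0.97 vs 0.97, Gaussian Naive Bayes 0.97 vs 0.96), which is attributable to random chance. The confusion matrices tell a similar story.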
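One way to see why the two settings could differ in principle is to look at how many samples each fold trains and tests on. The sketch below only inspects fold sizes for a 150-sample dataset with the same `KFold` settings as the cells above; the helper `fold_sizes` and the placeholder array `X_dummy` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 150  # size of the Iris dataset used above
X_dummy = np.zeros((n_samples, 1))  # placeholder features; only the row count matters


def fold_sizes(n_splits):
    """Return a list of (train_size, test_size) pairs, one per fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    return [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X_dummy)]


sizes_2 = fold_sizes(2)
sizes_20 = fold_sizes(20)

print("2-fold :", sizes_2)
print("20-fold:", sorted(set(sizes_20)))
```

With 2 folds each model is trained on 75 samples and tested on 75; with 20 folds it is trained on 142 or 143 samples and tested on only 7 or 8. More training data per fold can raise accuracy slightly, while the tiny test folds make individual fold scores noisier, so a difference of about one percentage point in the average is unsurprising.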
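    The first class is classified with 100% accuracy by every model. According to the documentation, this class corresponds to Iris setosa. Iris versicolor and Iris virginica each have 2 to 5 misclassifications depending on the model and the number of folds, and every error confuses these two classes with each other. SVC and Gaussian Naive Bayes perform almost identically.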
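The per-class accuracy described above can be read off a confusion matrix directly. As a minimal sketch, the snippet below hard-codes the SVC matrix printed by the 2-fold run and divides each diagonal entry by its row total (the per-class recall); the variable names are illustrative, and the class order is assumed to follow `load_iris().target_names`.

```python
import numpy as np

# SVC confusion matrix from the 2-fold run above
# (rows = true class, columns = predicted class)
conf_matrix = np.array([[50, 0, 0],
                        [0, 48, 2],
                        [0, 2, 48]])
class_names = ["setosa", "versicolor", "virginica"]  # assumed order of target_names

# Fraction of each true class that received the correct label
per_class_recall = conf_matrix.diagonal() / conf_matrix.sum(axis=1)

for name, recall in zip(class_names, per_class_recall):
    print(f"{name}: {recall:.2f}")
```

For this matrix the result is 1.00 for setosa and 0.96 for both versicolor and virginica, matching the observation that only the last two classes are ever confused.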

Problem 2, Option 1 code:

In [19]:
# Import necessary libraries
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Define 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Define the three regression estimators
regressors = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Random Forest Regressor': RandomForestRegressor()
}

# Perform 10-fold cross-validation for each estimator
# Calculate both R-Squared and MSE
results = {}
for name, reg in regressors.items():
    # Calculate R-Squared scores using cross-validation
    r2_scores = cross_val_score(reg, X, y, cv=kf, scoring='r2')
    avg_r2 = np.mean(r2_scores)
    
    # Calculate MSE scores using cross-validation
    mse_scores = cross_val_score(reg, X, y, cv=kf, scoring=make_scorer(mean_squared_error))
    avg_mse = np.mean(mse_scores)
    
    # Store the average R-Squared and MSE in the dictionary
    results[name] = {
        'Average R-Squared': avg_r2,
        'Average MSE': avg_mse
    }

# Display the results for each estimator
for name, metrics in results.items():
    print(f"{name}:")
    print(f"  Average R-Squared: {metrics['Average R-Squared']:.2f}")
    print(f"  Average MSE: {metrics['Average MSE']:.2f}")
    print()


Linear Regression:
  Average R-Squared: 0.60
  Average MSE: 0.53

Decision Tree Regressor:
  Average R-Squared: 0.61
  Average MSE: 0.51

Random Forest Regressor:
  Average R-Squared: 0.81
  Average MSE: 0.25



Problem 2:

	After some basic research I found the California housing dataset, which is available through sklearn (`fetch_california_housing`). It has over 20,000 samples, far more than the 150 samples in the Iris toy classification dataset used in Problem 1. Each sample describes one California district. The target variable is the median house value of the district, expressed in units of $100,000.
    
Features present:

    • Median income in the district

    • Median house age in the district

    • Average number of rooms per household

    • Average number of bedrooms per household

    • Population of the district

    • Average number of household members (occupancy) in the district

    • Latitude of the district

    • Longitude of the district

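	R-Squared is a relative, unitless measure of fit: it indicates what proportion of the variance in the target variable the model explains, with 1 being a perfect fit and 0 being no better than always predicting the mean. MSE is an absolute measure of fit: it is the average squared deviation between actual and predicted values, expressed in squared units of the target.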
    
	R-Squared is preferred when you want to understand the proportion of variance explained by the model, or to compare fits across targets with different scales.
	MSE is preferred when you want to measure the actual magnitude of the error in terms of the target's units and assess the quality of individual predictions.
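The relationship between the two metrics can be checked numerically. The sketch below uses a small made-up set of values (not taken from the housing data) to illustrate that, on a single evaluation set, R-Squared equals 1 minus the MSE divided by the variance of the true values, and that rescaling the target changes MSE but leaves R-Squared unchanged.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up example values, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.5, 5.0])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# R-Squared rewritten in terms of MSE and the (population) variance of y_true
r2_from_mse = 1.0 - mse / np.var(y_true)

# Rescale the target (e.g. a change of units) and recompute both metrics
scale = 100.0
mse_scaled = mean_squared_error(y_true * scale, y_pred * scale)
r2_scaled = r2_score(y_true * scale, y_pred * scale)

print(f"MSE = {mse:.3f}, R2 = {r2:.3f}, R2 from MSE = {r2_from_mse:.3f}")
print(f"After rescaling: MSE = {mse_scaled:.1f}, R2 = {r2_scaled:.3f}")
```

Here the MSE is 0.15 and R-Squared is 0.925 before rescaling; after multiplying the target by 100 the MSE becomes 1500 while R-Squared stays at 0.925. In cross-validation the identity holds within each fold, so the averaged scores reported above need not satisfy it exactly.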