___
Team Member Names
- Name 1: Matthew D. Cusack
- Name 2: Tim Cabaza
- Name 3: Amy Adyanthaya

<a id="top"></a>
________
# Clustering
____

## Contents
* <a href="#Imports">Dependency and Data Imports</a>
* <a href="#DataPrep">Data Preparation</a>
* <a href="#BusinessUnderstanding">Business Understanding</a>
* <a href="#DataUnderstanding">Data Understanding</a>
* <a href="#ModelEval">Modeling and Evaluation</a>
    * <a href="#OptionA">Option A: Cluster Analysis</a>
    * <a href="#OptionB">Option B: Association Rule Mining</a>
    * <a href="#OptionC">Option C: Collaborative Filtering</a>
* <a href="#Deployment">Deployment</a>
* <a href="#Exceptional">Exceptional Work</a>
_______

___
<a href="#top">Back to Top</a>
<a id="Imports"></a>
## Imports

In [None]:
# Load Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# file path
file_path = "../1 - Visualization and Data Preprocessing/Data/ONPClean2.csv" # previously cleaned
# file_path = '../1 - Visualization and Data Preprocessing/Data/OnlineNewsPopularity.csv' # unclean

# Load the dataset
df = pd.read_csv(file_path)

# Set the maximum number of columns to display to None
pd.set_option('display.max_columns', None)
df.head()

___
<a href="#top">Back to Top</a>
<a id="DataPrep"></a>
## Data Preparation
To Do:

    Remove Outliers
    Scale Data
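
The two to-do items above could be sketched as follows. This is a minimal illustration, not the project's final pipeline: it assumes an interquartile-range (IQR) rule for outliers and `StandardScaler` for scaling, and the helper `remove_iqr_outliers`, the toy frame, and its column names are all hypothetical stand-ins for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def remove_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    within = ((frame[columns] >= lower) & (frame[columns] <= upper)).all(axis=1)
    return frame[within]


# Toy stand-in for the real data (hypothetical column names)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],   # 100.0 is an obvious outlier
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})

# Drop rows flagged by the IQR rule, then standardize the remaining rows
trimmed = remove_iqr_outliers(toy, ["a", "b"])
scaled = StandardScaler().fit_transform(trimmed)

print(len(toy), "rows before,", len(trimmed), "rows after outlier removal")
```

Removing outliers before scaling keeps extreme values from distorting the mean and standard deviation that the scaler estimates.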

In [None]:
# from sklearn.preprocessing import StandardScaler, QuantileTransformer

# Scale the features in the training and testing sets using StandardScaler.
# scaler = StandardScaler()

# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)

# Scale the features using QuantileTransformer with n_quantiles=100 AFTER using StandardScaler.
# quantile_transformer = QuantileTransformer(n_quantiles=100)

# Fit on the training set only, then apply the same fitted transformer to the test set
# X_train_q = quantile_transformer.fit_transform(X_train)
# X_test_q = quantile_transformer.transform(X_test)

In [None]:
# from sklearn.model_selection import StratifiedKFold

# random_state = 42

# # Create a StratifiedKFold object with n_splits=10
# # (shuffle=True is required whenever random_state is set)
# kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)

# # One-hot encode the categorical columns, then drop the target and the variable it is derived from
# X0 = pd.get_dummies(df, columns=['day_of_week', 'news_category'])
# X1 = X0.drop(['share_quantile_ranges', 'shares'], axis=1)
# y1 = X0['share_quantile_ranges']
# # print(X1.columns)
# # print(y1)

# # Split the data into training and testing folds
# X_train, X_test, y_train, y_test = [], [], [], []
# for train_index, test_index in kfold.split(X1, y1):
#     X_train.append(X1.iloc[train_index])
#     X_test.append(X1.iloc[test_index])
#     y_train.append(y1.iloc[train_index])
#     y_test.append(y1.iloc[test_index])

___
<a href="#top">Back to Top</a>
<a id="BusinessUnderstanding"></a>
## Business Understanding (10 points total)
    Describe the purpose of the data set you selected (i.e., why was this data
    collected in the first place?). 
    
    How will you measure the effectiveness of a good algorithm? 
    
    Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?

____
<a href="#top">Back to Top</a>
<a id="DataUnderstanding"></a>
## Data Understanding (20 points total)

    [10 points]
        Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. 
    
        Verify data quality: 
            Are there missing values? 
            Duplicate data? Outliers? 
            Are those mistakes? 
            How do you deal with these problems?

    [10 points]
        Visualize any important attributes appropriately.
        Important: Provide an interpretation for any charts or graphs.
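
The data-quality questions above (missing values and duplicates) can be answered with a few pandas calls. The sketch below is illustrative only: `quality_report` is a hypothetical helper, and the toy frame stands in for the real `df`.

```python
import numpy as np
import pandas as pd


def quality_report(frame):
    """Summarize missing values per column and the number of fully duplicated rows."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy stand-in for the real data: one missing value and one duplicated row
toy = pd.DataFrame({"x": [1.0, 2.0, 2.0, np.nan], "y": [5, 6, 6, 7]})

report = quality_report(toy)
print(report)
```

Whether a flagged row is a mistake still has to be judged case by case; the report only shows where to look.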


___
<a href="#top">Back to Top</a>
<a id="ModelEval"></a>
## Modeling and Evaluation (50 points total)
    Different tasks will require different evaluation methods. Be as thorough as possible when analyzing
    the data you have chosen and use visualizations of the results to explain the performance and
    expected outcomes whenever possible. Guide the reader through your analysis with plenty of
    discussion of the results.

___
<a href="#top">Back to Top</a>
<a id="OptionA"></a>

### Option A: Cluster Analysis

    Perform cluster analysis using several clustering methods

    How did you determine a suitable number of clusters for each method?

    Use internal and/or external validation measures to describe and compare the
    clusterings and the clusters (some visual methods would be good).

    Describe your results. What findings are the most interesting and why?

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the data (placeholder file; np.loadtxt expects an all-numeric CSV with no header row)
X = np.loadtxt('data.csv', delimiter=',')

# Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Choose a clustering algorithm; fixing n_init and random_state makes runs reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Cluster the data
kmeans.fit(X_std)

# Get the cluster labels
labels = kmeans.labels_

# Visualize the clusters on the first two standardized features
plt.scatter(X_std[:, 0], X_std[:, 1], c=labels)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Clustered Data')
plt.show()
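
One possible way to answer the "suitable number of clusters" question is to compare an internal validation measure across candidate values of k. The sketch below uses the silhouette score from scikit-learn on synthetic data. The blob centers, the range of k, and the name `X_demo` are assumptions chosen for illustration, not values taken from the project data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=42,
)

# Fit k-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels_k = model.fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)

# A higher silhouette score means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

The silhouette score is only one internal measure. On real data it is worth cross-checking against an elbow plot of inertia or a visual inspection of the clusters.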


___
<a href="#top">Back to Top</a>
<a id="OptionB"></a>

### Option B: Association Rule Mining

    Create frequent itemsets and association rules.

    Use tables/visualization to discuss the found results.

    Use several measures for evaluating how interesting different rules are.

    Describe your results. What findings are the most compelling and why?

In [None]:
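# Using R
library(arules)

# Load the data (placeholder file; assumes one row per TID/item pair)
data <- read.csv("data.csv")

# Create a transaction dataset: one basket of items per TID
transaction_data <- as(split(data$items, data$TID), "transactions")

# Mine association rules (named `rules` to avoid shadowing the arules package)
rules <- apriori(transaction_data,
                 parameter = list(minlen = 2, supp = 0.01, conf = 0.5))

# Get the top 10 association rules, ranked here by lift (one possible choice)
top_rules <- head(sort(rules, by = "lift"), 10)

# Print the top 10 association rules
inspect(top_rules)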
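
To make the rule-interest measures concrete, the sketch below computes support, confidence, and lift by hand in Python for a single rule. The toy baskets and the helper names `support` and `rule_metrics` are hypothetical, and the code does not reproduce what the arules package does internally.

```python
# Toy transactions (hypothetical baskets, for illustration only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Assumes both sides occur at least once, so no division by zero arises.
    """
    supp_both = support(set(antecedent) | set(consequent), transactions)
    supp_a = support(antecedent, transactions)
    supp_c = support(consequent, transactions)
    confidence = supp_both / supp_a   # P(consequent | antecedent)
    lift = confidence / supp_c        # > 1 suggests a positive association
    return {"support": supp_both, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"bread"}, {"milk"}, transactions)
print(metrics)
```

In this toy example the lift for bread -> milk is below 1, meaning milk is slightly less likely in baskets that contain bread than in baskets overall.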

___
<a href="#top">Back to Top</a>
<a id="OptionC"></a>

### Option C: Collaborative Filtering

    Create user-item matrices or item-item matrices using collaborative filtering

    Determine the performance of the recommendations using different performance measures
    and explain what each measure means.

    Use tables/visualization to discuss the found results. Explain each visualization in detail.

    Describe your results. What findings are the most compelling and why?

In [None]:
import graphlab as gl

# Load the data (placeholder file; assumes columns user_id, item_id, and rating)
data = gl.SFrame.read_csv('data.csv')

# No explicit user-item pivot is needed: the recommender takes long-format
# observations (one row per user/item/rating) and builds the matrix internally

# Create a recommender system
recommender = gl.recommender.create(data, user_id='user_id',
                                    item_id='item_id', target='rating')

# Get the top 10 recommendations for a user
recommendations = recommender.recommend(users=[1], k=10)

# Print the top 10 recommendations
print(recommendations)
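
Two commonly used performance measures for recommendations are RMSE (error on predicted ratings) and precision at k (quality of the ranked list). The sketch below implements both with NumPy. The function names and the toy inputs are hypothetical and independent of GraphLab's own evaluation utilities.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that appear in the relevant set."""
    top_k = list(recommended)[:k]
    return sum(item in relevant for item in top_k) / k


# Toy ratings: two of the three predictions are off by one star
rating_error = rmse([4, 3, 5], [3, 3, 4])

# Toy ranking: items "a" and "c" in the top 3 are relevant
ranking_quality = precision_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3)

print(rating_error, ranking_quality)
```

RMSE rewards accurate rating predictions, while precision at k only looks at whether the top of the list contains items the user actually cares about, so the two measures can disagree.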

___
<a href="#top">Back to Top</a>
<a id="Deployment"></a>
## Deployment (10 points total)
    Be critical of your performance and tell the reader how your current model might be usable by
    other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?

    How useful is your model for interested parties (i.e., the companies or organizations
    that might want to use it)?

    How would you deploy your model for interested parties?

    What other data should be collected?

    How often would the model need to be updated, etc.?

___
<a href="#top">Back to Top</a>
<a id="Exceptional"></a>
## Exceptional Work (10 points total)
    You have free rein to provide additional analyses or combine analyses.