# Table of Contents
### - Data cleaning and manipulation
### - K-means cluster determination and elbow chart
### - Cluster definition and visualization
### - Descriptive cluster analysis and insight

# Notebook Set-Up

In [None]:
# Import relevant libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
import sklearn
from sklearn.cluster import KMeans
import pylab as pl
from sklearn.preprocessing import StandardScaler

In [None]:
# Define path and import dataset
path = r'C:\Users\mmreg\OneDrive\Desktop\Data Analytics Course Work\Data Immersion\Tasks\08-2022 Exploratory Analytics Project\02 Data\Prepared'

In [None]:
df = pd.read_csv(os.path.join(path, 'citibike_clean.csv'), index_col = False)
df.head()

In [None]:
df = df.drop(columns = 'Unnamed: 0')
# Ensure removal of column
df.head()

In [None]:
# Define figure size throughout notebook
sns.set(rc = {'figure.figsize':(20,12)})

# Question 2
## Import your data and conduct any necessary cleaning, manipulations, and reprocessing (such as renaming).

### - Data has been cleaned in previous tasks

### - Columns have concise names for efficiency and clarity

In [None]:
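# Keep only the numerical columns (k-means cannot use categorical variables)
# .copy() makes df_2 an independent dataframe, so adding columns later does not raise a SettingWithCopyWarning
df_2 = df[['start_hour', 'start_station_latitude', 'start_station_longitude', 'end_station_latitude', 'end_station_longitude', 'trip_duration', 'age']].copy()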
df_2.head()

# Question 3
## Use the elbow technique as shown in the Jupyter notebook for this Exercise.

In [None]:
# Define range of potential cluster numbers and k-mean clusters
num_cl = range(1, 10)
kmeans = [KMeans(n_clusters = i) for i in num_cl]

In [None]:
# Fit each model and record its score (the negative of the within-cluster sum of squares)
score = [km.fit(df_2).score(df_2) for km in kmeans]

In [None]:
score

In [None]:
# Plot the elbow curve
pl.plot(num_cl,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

# Question 4
## Make an informed decision about the number of clusters you’ll use in your k-means algorithm based on the chart.
### Based on the elbow curve created here, I have decided to use three clusters for this analysis. While the curve does not have a sharply defined bend, three is the last cluster count at which a discernible bend appears. This leads me to believe that three is the optimal cluster count for this analysis.
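One way to make the visual judgment a little more concrete is to look at how much the score improves with each additional cluster. The following is a minimal sketch, assuming made-up score values (not the output of this notebook) and a hypothetical helper named `score_gains`:

```python
def score_gains(scores):
    """Return the improvement in score gained by each additional cluster.

    `scores` is a list like the notebook's `score`: one KMeans.score() value
    per k, starting at k = 1. The values are negative and move toward zero
    as k grows.
    """
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


# Illustrative values only; these are NOT the scores from this dataset.
example_scores = [-1000.0, -400.0, -150.0, -120.0, -100.0]

gains = score_gains(example_scores)
for k, gain in enumerate(gains, start=2):
    print(f"k={k}: score improved by {gain:.1f}")
```

In these made-up numbers the gains flatten out after k = 3, which is the pattern an elbow suggests. Applying the same helper to the real `score` list would show whether the bend at three is as clear numerically as it looks on the chart.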

# Question 5
## Run the k-means algorithm.

In [None]:
# Create the k-means object
kmeans = KMeans(n_clusters = 3)
# Note: passing n_jobs raised "TypeError: __init__() got an unexpected keyword argument 'n_jobs'", so I omitted it as suggested on StackOverflow.

In [None]:
# Fit k-means object to the dataset
kmeans.fit(df_2)

# Question 6
## Attach a new column to your dataframe with the resulting clusters as shown in the Exercise. This will allow you to create a visualization using your clusters.

In [None]:
# Create and attach column to dataset
df_2['clusters'] = kmeans.fit_predict(df_2)
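Note that `fit_predict` fits the model again and returns the same labels it stores in `labels_`, which is why `kmeans.labels_` can be used later to color the plots. A minimal sketch on synthetic points (not the Citi Bike data); `random_state` and `n_init` are assumptions added here so that repeated runs give the same result:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points for illustration only (three well-separated blobs).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# random_state is an assumption added here so repeated runs give the same labels.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

# fit_predict refits the model and returns the labels it stores in labels_.
print(np.array_equal(labels, km.labels_))
print(np.bincount(labels))
```

Without a fixed `random_state`, the cluster numbers (0, 1, 2) may be assigned differently each time the model is refit, even when the groups themselves are the same.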

In [None]:
df_2.head()

In [None]:
df_2['clusters'].value_counts()

# Question 7
## Create a few different visualizations (e.g., scatterplots) using your clustered data. Try plotting different variables against each other to see the results in terms of the clusters.

In [None]:
# Plot age and trip_duration with cluster information
print('Fig. 1')
plt.figure(figsize=(20, 12))
ax = sns.scatterplot(x=df_2['age'], y=df_2['trip_duration'], hue=kmeans.labels_, s=100)

ax.grid(False)
plt.xlabel('Age (Years)')
plt.ylabel('Trip Length (Sec)')
plt.show()

In [None]:
# Plot trip_duration and start_hour with cluster information
print('Fig. 2')
plt.figure(figsize=(20, 12))
ax = sns.scatterplot(x=df_2['start_hour'], y=df_2['trip_duration'], hue=kmeans.labels_, s=100)

ax.grid(False)
plt.xlabel('Trip Start Time')
plt.ylabel('Trip Length (Sec)')
plt.show()

In [None]:
# Plot age and start_hour with cluster information
print('Fig. 3')
plt.figure(figsize=(20, 12))
ax = sns.scatterplot(x=df_2['age'], y=df_2['start_hour'], hue=kmeans.labels_, s=100)

ax.grid(False)
plt.xlabel('Age (Years)')
plt.ylabel('Trip Start Time')
plt.show()

# Question 8
## Discuss how and why the clusters make sense. If they don’t make sense, however, this is also useful insight, as it means you’ll need to explore the data further.

### The clusters appear to represent trip length, dividing the data into short, medium, and long trips (as seen in Fig. 1): black marks the long trips, purple the medium trips, and pink the short trips. From this, it appears that riders of college or young working professional age are more likely to take longer trips, while younger customers and senior citizens are more likely to take shorter ones. Applying the same reading to Fig. 2, the longest trips occur during peak hours, while the early hours of the morning see much shorter trips on average. Fig. 3 offers very little insight from the cluster information, though at a glance I would hypothesize that young riders are more likely than senior citizens to take short trips.
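Clusters that follow trip length this closely are consistent with how k-means works: it measures Euclidean distance, so a variable recorded in large units (trip duration in seconds) can outweigh variables with small ranges (hour of day, coordinates, age). `StandardScaler` is imported at the top of this notebook but is not applied to the data. The sketch below uses synthetic data, not the Citi Bike data, to illustrate the effect. The two age groups, the unrelated duration column, and the helper `agreement` are all assumptions made for illustration. Whether scaling would change the clusters in this dataset would need to be tested.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT the Citi Bike data): two age groups, plus a
# duration column in seconds that is unrelated to the age groups.
age = np.concatenate([rng.normal(25, 2, 200), rng.normal(60, 2, 200)])
duration = rng.uniform(60, 3600, 400)
X = np.column_stack([age, duration])
true_group = np.repeat([0, 1], 200)


def agreement(labels, truth):
    """Share of points whose cluster matches the true group (label order ignored)."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)


# Cluster once on the raw values and once on standardized values.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)

print("agreement with age groups, raw:   ", agreement(raw_labels, true_group))
print("agreement with age groups, scaled:", agreement(scaled_labels, true_group))
```

On the raw values the split is driven by duration, so agreement with the age groups stays close to chance. After standardizing, the two age groups are recovered.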

# Question 9
## Calculate the descriptive statistics for your clusters using the groupby() function and discuss your findings.

In [None]:
# Create groupings by cluster for clarity
df_2['cluster'] = df_2['clusters'].map({2: 'Long Trip', 1: 'Medium Trip', 0: 'Short Trip'})
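Because k-means assigns cluster numbers arbitrarily, the mapping of 0, 1, and 2 to short, medium, and long trips can change between runs unless `random_state` is fixed. The following sketch derives the names from each cluster's mean trip duration instead of hard-coding them. It assumes a small made-up dataframe and a hypothetical helper named `name_clusters_by_duration`:

```python
import pandas as pd


def name_clusters_by_duration(frame, names=("Short Trip", "Medium Trip", "Long Trip")):
    """Map cluster ids to names, ordered by each cluster's mean trip_duration."""
    order = frame.groupby("clusters")["trip_duration"].mean().sort_values().index
    mapping = dict(zip(order, names))
    return frame["clusters"].map(mapping)


# Made-up rows for illustration; not taken from the Citi Bike dataset.
toy = pd.DataFrame(
    {
        "clusters": [1, 1, 0, 0, 2, 2],
        "trip_duration": [300, 360, 5000, 5400, 1500, 1700],
    }
)
toy["cluster"] = name_clusters_by_duration(toy)
print(toy)
```

With this approach the cluster with the lowest mean duration is always named 'Short Trip', whichever number k-means happened to give it.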

In [None]:
# Mean and median of each variable by cluster
df_2.groupby('cluster').agg({
    'age': ['mean', 'median'],
    'trip_duration': ['mean', 'median'],
    'start_hour': ['mean', 'median'],
    'start_station_latitude': ['mean', 'median'],
    'start_station_longitude': ['mean', 'median'],
})

# Question 10
## Propose what these results could be useful for in future steps of an analytics pipeline.

### Unfortunately, according to the descriptive statistics, the assumptions made at the visualization stage were mostly incorrect. The mean and median ages of all three clusters are very similar, and the start hour, though it shows a slight tendency for shorter trips to start earlier in the day, is almost too weak to call a relationship. Further analysis would be needed to confirm that there is no relationship between these variables. There is some good that comes of this, however. Starting latitude and longitude could give us insight into the most popular stations, and further analysis could use this to find an ideal dispersal of bikes across stations. The mean and median trip durations also give us a defined parameter for what a short, medium, and long trip means in terms of trip length.

In [None]:
# Export dataset for final presentation
df_2.to_csv(os.path.join(path, 'citibike_cluster.csv'))