## Problem Statement

#### You are provided with a dataset containing various attributes of different wine samples. The goal of this assignment is to perform cluster analysis using the K-means algorithm to identify natural groupings in the data based on the attributes provided.

## Dataset Overview 

#### The dataset consists of the following columns:

#### 1. Alcohol: Alcohol content in the wine sample.
#### 2. Malic_Acid: Amount of malic acid in the wine.
#### 3. Ash: Ash content in the wine.
#### 4. Ash_Alcalinity: Alkalinity of ash in the wine.
#### 5. Magnesium: Magnesium content in the wine.
#### 6. Total_Phenols: Total phenols content in the wine.
#### 7. Flavanoids: Flavonoid content in the wine.
#### 8. Nonflavanoid_Phenols: Non-flavonoid phenol content in the wine.
#### 9. Proanthocyanins: Proanthocyanin content in the wine.
#### 10. Color_Intensity: Intensity of the color of the wine.
#### 11. Hue: Hue of the wine.
#### 12. OD280: Ratio of OD280/OD315 of diluted wines.
#### 13. Proline: Proline content in the wine.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('WineData - WineData.csv')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'WineData - WineData.csv'

In [None]:
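# Drop the exported index column; errors='ignore' keeps the cell re-runnable if it is already gone
df = df.drop(columns='Unnamed: 0', errors='ignore')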
df.head()

## Tasks

### Task 1: Data Preprocessing
#### - Handle any missing values, if present.
#### - Scale the data using StandardScaler or MinMaxScaler, since K-means is sensitive to the scale of the features.
#### - Remove any unnecessary columns that don't contribute to clustering (e.g., an index column, if not relevant).

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

### Handle missing values

In [None]:
print(df.isnull().sum())

In [None]:
# Handle missing values (if any) by filling each numeric column with its mean (one simple option)
df = df.fillna(df.mean(numeric_only=True))

### Scale the data

In [None]:
# Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)
scaled_features

### Convert back to DataFrame

In [None]:
# Convert back to DataFrame
scaled_df = pd.DataFrame(scaled_features, columns=df.columns)
scaled_df
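
The two scalers named in Task 1 rescale features differently. As a minimal sketch, the cell below applies both to a tiny made-up table (the values are hypothetical and are not taken from the wine data) to show what each one guarantees: StandardScaler gives every column mean 0 and unit standard deviation, while MinMaxScaler maps every column onto the range [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy values: two features on very different scales.
toy = pd.DataFrame({
    "Alcohol": [12.0, 13.0, 14.0, 13.0],
    "Proline": [500.0, 1000.0, 1500.0, 1000.0],
})

# StandardScaler: subtract the column mean, divide by the column standard deviation.
standardized = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# MinMaxScaler: map each column's minimum to 0 and its maximum to 1.
min_maxed = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(standardized.mean().round(6))       # column means after standardizing
print(standardized.std(ddof=0).round(6))  # population standard deviations
print(min_maxed.agg(["min", "max"]))      # column ranges after min-max scaling
```

Without either step, a feature measured in the hundreds (such as Proline) would dominate the Euclidean distances that K-means relies on.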

### Task 2: Determine the Optimal Number of Clusters
#### Use the Elbow method to determine the optimal number of clusters.
#### Visualize the results using a line plot of the Within-Cluster Sum of Squares (WCSS) against the number of clusters.

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

In [None]:
# Determine the optimal number of clusters using the Elbow method
wcss = []
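for i in range(1, 11):  # Trying from 1 to 10 clusters
    # n_init is set explicitly so results do not depend on the scikit-learn version's default
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)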
    kmeans.fit(scaled_df)
    wcss.append(kmeans.inertia_)

In [None]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.xticks(range(1, 11))
plt.grid()
plt.show()
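
To make the quantity on the y-axis concrete, the sketch below recomputes WCSS by hand and compares it with `inertia_`. It uses synthetic blobs from `make_blobs` as a stand-in for `scaled_df`, which is an assumption made only so the cell runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for scaled_df (assumption: three blobs in four dimensions).
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# WCSS: squared distance from each point to the centroid of its assigned cluster, summed.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
wcss_by_hand = np.sum((X - assigned_centroids) ** 2)

print(wcss_by_hand, kmeans.inertia_)  # the two values should agree up to rounding
```

Because WCSS can only shrink as clusters are added, the Elbow method looks for the point where the decrease flattens rather than for a minimum.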

### Task 3: K-means Clustering
#### Apply K-means clustering using the optimal number of clusters obtained from the Elbow method.
#### Assign cluster labels to each data point and create a new column in the dataset with these labels.

In [None]:
# Assuming the optimal number of clusters is 3 based on the Elbow method
optimal_clusters = 3  # Change this to the number you determined


In [None]:
# Apply K-means
kmeans = KMeans(n_clusters=optimal_clusters, n_init=10, random_state=42)  # n_init set explicitly for reproducibility across versions
df['Cluster'] = kmeans.fit_predict(scaled_df)

In [None]:
# Show the updated DataFrame
df.head()
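
After assigning labels, a quick sanity check is to count how many samples landed in each cluster. The sketch below does this on synthetic data (the `toy` frame and its `feature_a` / `feature_b` columns are hypothetical); the same `value_counts` call should work on `df['Cluster']`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled wine features.
X, _ = make_blobs(n_samples=90, centers=3, n_features=2, random_state=0)
toy = pd.DataFrame(X, columns=["feature_a", "feature_b"])

km = KMeans(n_clusters=3, n_init=10, random_state=42)
toy["Cluster"] = km.fit_predict(toy[["feature_a", "feature_b"]])

# Number of samples per cluster label; the integer labels themselves carry no meaning or order.
cluster_sizes = toy["Cluster"].value_counts().sort_index()
print(cluster_sizes)
```

A cluster with only a handful of points can be a hint that the chosen number of clusters is too high or that outliers are present.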


### Task 4: Cluster Analysis
#### Analyze the clusters by comparing the mean values of each feature within each cluster.
#### Visualize the clusters using a pairplot or scatterplot for selected features to understand the separations visually.
### Analyze the clusters

In [None]:
# Analyze the clusters
cluster_means = df.groupby('Cluster').mean()
cluster_means

### Visualizing the clusters


In [None]:
# Visualizing the clusters (choose two features for scatter plot)
plt.figure(figsize=(10, 6))
plt.scatter(df['Alcohol'], df['Color_Intensity'], c=df['Cluster'], cmap='viridis')
plt.title('Clusters Visualization')
plt.xlabel('Alcohol')
plt.ylabel('Color Intensity')
plt.colorbar(label='Cluster')
plt.grid()
plt.show()
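
Task 4 also mentions a pairplot. One way to sketch that without extra dependencies is `pandas.plotting.scatter_matrix`, colored by cluster label. This is a stand-in for a seaborn-style pairplot, not the notebook's own method, and the `toy` data and feature names below are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a few selected wine features.
X, _ = make_blobs(n_samples=120, centers=3, n_features=3, random_state=1)
toy = pd.DataFrame(X, columns=["feature_a", "feature_b", "feature_c"])
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(toy)

# One scatter plot per feature pair, histograms on the diagonal, points colored by cluster.
axes = pd.plotting.scatter_matrix(
    toy, c=labels, cmap="viridis", figsize=(8, 8), diagonal="hist", alpha=0.8
)
plt.suptitle("Pairwise feature plots colored by cluster")
plt.show()
```

With all 13 wine features the grid gets crowded, so passing a subset such as `df[['Alcohol', 'Color_Intensity', 'Hue']]` is likely easier to read.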

### Task 5: Interpretation
#### Interpret the characteristics of each cluster. For example, identify which cluster has the highest alcohol content, or which has the most intense color, etc.
#### Suggest potential names or categories for each cluster based on the observed characteristics.

In [None]:
# Example interpretations
for cluster_num in range(optimal_clusters):
    print(f"Cluster {cluster_num}:")
    print(cluster_means.loc[cluster_num])
    print("\n")
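
Reading 13 means per cluster by eye is error-prone. As a sketch, `idxmax` and `idxmin` can report which cluster is highest or lowest on each feature. The numbers below are hypothetical and for illustration only; in the notebook the same calls would be applied to `cluster_means`.

```python
import pandas as pd

# Hypothetical cluster means, for illustration only (not results from the wine data).
example_means = pd.DataFrame(
    {
        "Alcohol": [13.7, 12.3, 13.1],
        "Color_Intensity": [5.5, 3.1, 7.2],
        "Hue": [1.06, 1.05, 0.69],
    },
    index=pd.Index([0, 1, 2], name="Cluster"),
)

# For every feature, the cluster label with the largest and the smallest mean.
highest = example_means.idxmax()
lowest = example_means.idxmin()
summary = pd.DataFrame({"highest_in_cluster": highest, "lowest_in_cluster": lowest})
print(summary)
```

If `cluster_means` is computed from the unscaled `df`, the values stay in their original units, which makes them easier to turn into descriptive cluster names.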