# Data Mining Assignment 1

This notebook contains the steps for completing Data Mining Assignment 1, including initial data analysis, preprocessing, and a summary of findings.

## 1. Initial Data Analysis

### 1.1 Load the Dataset
We start by loading the dataset and examining its structure.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import entropy
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Load the dataset
df = pd.read_csv('superstore.csv')

# Examine structure
print("Dataset Shape:", df.shape)
print("\nData Types:\n", df.dtypes)

### 1.2 Descriptive Statistics
Next, we extract descriptive statistics for each column.

In [None]:
print("\nDescriptive Statistics:\n", df.describe(include='all'))

### 1.3 Identify Missing Values
We check for missing values in the dataset.

In [None]:
missing_values = df.isnull().sum()
print("\nMissing Values:\n", missing_values)
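
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (the column names and values are assumptions for illustration, not taken from `superstore.csv`) that reports both the count and the percentage of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Sales": [10.0, np.nan, 30.0, 40.0],
    "Region": ["East", "West", None, "East"],
    "Quantity": [1, 2, 3, 4],
})

# isnull().sum() counts gaps; isnull().mean() gives the fraction of missing rows
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
})

# Keep only the columns that actually have gaps, worst first
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_pct", ascending=False)
print(missing_summary)
```

Applying the same two expressions to `df` would give the equivalent table for the real data.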

### 1.4 Data Quality Assessment
We assess data quality by checking for duplicates.

In [None]:
print("\nDuplicate Rows:", df.duplicated().sum())
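
The count above only covers the later copies of each repeated row. To inspect the duplicate groups themselves, `duplicated(keep=False)` can be used, as in this small sketch (the `Order ID` column and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are exact copies of each other
toy = pd.DataFrame({
    "Order ID": ["A1", "A2", "A1", "A3"],
    "Sales": [10.0, 20.0, 10.0, 30.0],
})

# Default keep='first' flags only the later copies, i.e. what drop_duplicates() removes
n_extra_copies = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group so they can be viewed together
duplicate_groups = toy[toy.duplicated(keep=False)].sort_values(list(toy.columns))

print("Rows that drop_duplicates() would remove:", n_extra_copies)
print(duplicate_groups)
```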

### 1.5 Data Visualization
We create various visualizations to understand the data better.

In [None]:
# Histograms
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.savefig('initial_histograms.png')
plt.show()

# Box plots for numerical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
for col in numerical_cols:
    plt.figure()
    sns.boxplot(x=df[col])
    plt.title(f'Box Plot of {col}')
    plt.savefig(f'initial_boxplot_{col}.png')
    plt.show()

# Pairwise scatter plots
sns.pairplot(df[numerical_cols])
plt.savefig('initial_pairplot.png')
plt.show()

### 1.6 Checklist of Issues
We create a checklist of identified issues. A column is listed under outliers if its absolute skewness exceeds 1 or if any of its values fall outside the 1.5 * IQR fences.

In [None]:
# Flag a column if it is strongly skewed or has values outside the 1.5 * IQR fences
def has_outliers(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    fence = 1.5 * (q3 - q1)
    return abs(s.skew()) > 1 or (~s.dropna().between(q1 - fence, q3 + fence)).any()

issues = {
    "Missing Values": missing_values[missing_values > 0].index.tolist(),
    "Outliers": [col for col in numerical_cols if has_outliers(df[col])],
    "Duplicates": bool(df.duplicated().sum() > 0)
}
print("\nChecklist of Issues:", issues)
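
For the report it can also help to know how many values fall outside the fences, not just which columns are affected. A minimal sketch, assuming a hypothetical helper named `count_iqr_outliers` and a toy series:

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaNs."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Toy data: Q1 = 2.25, Q3 = 4.75, so the fences are -1.5 and 8.5
toy = pd.Series([1, 2, 3, 4, 5, 100])
n_outliers = count_iqr_outliers(toy)
print(n_outliers)  # → 1 (only the value 100 lies outside the fences)
```

Calling the helper on each entry of `numerical_cols` would give a per-column outlier count.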

## 2. Data Preprocessing

### 2.1 Handle Missing Values
We handle missing values using mean imputation for numerical columns and forward fill for categorical columns.

In [None]:
# Mean imputation for numerical columns
# (assign back instead of inplace=True, which does not update df under pandas Copy-on-Write)
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].mean())

# Forward fill for categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    df[col] = df[col].ffill()
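
One caveat with forward fill is that a missing value in the first row has nothing above it to copy, so it stays missing. The sketch below shows this on a hypothetical toy frame and falls back to the most frequent category; the mode fallback is an assumption added for illustration, not part of the steps above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a gap in the first row of each column
toy = pd.DataFrame({
    "Sales": [np.nan, 20.0, 40.0],
    "Segment": [None, "Consumer", None],
})

# Mean imputation: the gap becomes the mean of the observed values (30.0)
toy["Sales"] = toy["Sales"].fillna(toy["Sales"].mean())

# Forward fill copies the previous row's value, so the first row stays missing
toy["Segment"] = toy["Segment"].ffill()
remaining = int(toy["Segment"].isnull().sum())

# Assumed fallback: fill whatever is left with the most frequent category
toy["Segment"] = toy["Segment"].fillna(toy["Segment"].mode().iloc[0])

print("Missing after forward fill:", remaining)
print(toy)
```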

### 2.2 Remove Outliers
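We remove outliers using the IQR method: for each numerical column, rows with a value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are dropped. Columns are filtered one after another, so each column's quartiles are computed on the rows that survived the previous columns.

In [None]:
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # Keep rows inside the fences; .copy() avoids SettingWithCopyWarning in later assignments
    df = df[~((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR)))].copy()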
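
Because the loop filters column by column, the result can depend on column order. An alternative, sketched here on hypothetical toy data, computes every column's fences on the unfiltered frame and combines them into one mask. This is a variation for comparison, not the method used above.

```python
import pandas as pd

# Hypothetical toy frame: row 4 has an extreme Sales value, row 5 an extreme Profit value
toy = pd.DataFrame({
    "Sales": [10, 12, 11, 13, 500, 12],
    "Profit": [1, 2, 1, 2, 2, -90],
})

# Start by keeping every row, then narrow the mask one column at a time
keep = pd.Series(True, index=toy.index)
for col in ["Sales", "Profit"]:
    q1, q3 = toy[col].quantile(0.25), toy[col].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive and treats NaN as outside the range
    keep &= toy[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

filtered = toy[keep].copy()
print(f"Kept {len(filtered)} of {len(toy)} rows")
```

Printing the kept and total row counts also makes it visible how much data the outlier step discards.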

### 2.3 Correct Inconsistent Data
We correct inconsistent data, such as standardizing country names.

In [None]:
df['Country'] = df['Country'].str.strip().replace({'United Stateshcr': 'United States', 'United Kingdomegb': 'United Kingdom'})
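
Before writing a replacement mapping, it helps to list the distinct labels so that inconsistent variants become visible. A minimal sketch on hypothetical values (the variants shown are assumptions, not values observed in the dataset):

```python
import pandas as pd

# Hypothetical labels that differ only in whitespace and letter case
countries = pd.Series([" United States", "united states", "United States ", "UNITED KINGDOM"])

# The raw labels look like four different countries
print(countries.value_counts())

# Trimming whitespace and normalizing case collapses the trivial variants
normalized = countries.str.strip().str.title()
label_counts = normalized.value_counts()
print(label_counts)
```

Variants that survive this normalization are the ones that need an explicit entry in the `replace` mapping.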

### 2.4 Standardization/Normalization
We create standardized and normalized versions of the dataset.

In [None]:
# Standardization
scaler = StandardScaler()
df_standardized = df.copy()
df_standardized[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Normalization
minmax_scaler = MinMaxScaler()
df_normalized = df.copy()
df_normalized[numerical_cols] = minmax_scaler.fit_transform(df[numerical_cols])
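
To make the two transformations concrete: standardization maps each value to (x - mean) / std, and min-max normalization maps it to (x - min) / (max - min). The sketch below checks these formulas against the scikit-learn scalers on a small hypothetical array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data, shaped (n_samples, n_features)
x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Standardization: zero mean, unit variance (np.std defaults to the population std, ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

# Min-max normalization: rescale to the [0, 1] range
m_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
m_sklearn = MinMaxScaler().fit_transform(x)

print("Standardization matches:", np.allclose(z_manual, z_sklearn))
print("Normalization matches:", np.allclose(m_manual, m_sklearn))
```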

### 2.5 Remove Duplicates
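We remove duplicate rows from `df`. Note that `df_standardized` and `df_normalized` were created in section 2.4, before this step, so they still contain any duplicate rows.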

In [None]:
df.drop_duplicates(inplace=True)

### 2.6 Compare Distributions
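We compare the distributions of two features using KL divergence. KL divergence is defined between probability distributions, so each feature is first binned into a histogram over a shared set of bin edges.

In [None]:
# Example: Sales vs Profit, binned on shared edges (30 bins is an arbitrary choice)
bins = np.histogram_bin_edges(pd.concat([df['Sales'], df['Profit']]).dropna(), bins=30)
p = np.histogram(df['Sales'].dropna(), bins=bins)[0] + 1e-10  # small constant avoids empty bins
q = np.histogram(df['Profit'].dropna(), bins=bins)[0] + 1e-10
kl_div_sales_profit = entropy(p, q)  # entropy() normalizes p and q to sum to 1
print("\nKL Divergence (Sales vs Profit):", kl_div_sales_profit)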
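
`scipy.stats.entropy(p, q)` expects two non-negative arrays of equal length that describe distributions, so raw feature values have to be turned into histograms on shared bins first. The following sketch illustrates this on synthetic data; the sample sizes, bin count, and smoothing constant are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import entropy

# Synthetic samples from two normal distributions with different means
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 5000)
b = rng.normal(0.5, 1.0, 5000)

# Shared bin edges so that bin i means the same interval for both samples
edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=30)
p = np.histogram(a, bins=edges)[0] + 1e-10  # small constant avoids division by zero
q = np.histogram(b, bins=edges)[0] + 1e-10

# entropy(p, q) normalizes both arrays and returns sum(p * log(p / q))
kl_pq = entropy(p, q)
kl_qp = entropy(q, p)
kl_pp = entropy(p, p)

print("KL(P || Q):", kl_pq)
print("KL(Q || P):", kl_qp)
print("KL(P || P):", kl_pp)
```

KL divergence is not symmetric, so the two directions generally give different values, while comparing a distribution with itself gives zero.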

### 2.7 Visualize Cleaned Data
We create visualizations for the cleaned data.

In [None]:
df_normalized.hist(figsize=(12, 10))
plt.tight_layout()
plt.savefig('cleaned_histograms.png')
plt.show()

for col in numerical_cols:
    plt.figure()
    sns.boxplot(x=df_normalized[col])
    plt.title(f'Cleaned Box Plot of {col}')
    plt.savefig(f'cleaned_boxplot_{col}.png')
    plt.show()

## 3. Conclusion and Summary

### 3.1 Improvement Statistics
We compare the number of missing values before and after preprocessing and confirm that no duplicate rows remain.

In [None]:
initial_missing = missing_values.sum()
cleaned_missing = df.isnull().sum().sum()
print("\nMissing Values Before:", initial_missing)
print("Missing Values After:", cleaned_missing)
# After drop_duplicates() this counts the duplicates that remain, not the number removed
print("Duplicates Remaining:", df.duplicated().sum())
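
To report how many rows and duplicates were actually removed, the counts have to be captured before cleaning starts. A minimal sketch, assuming a hypothetical helper named `quality_snapshot` and a toy frame; the toy cleaning step uses `dropna()` instead of imputation purely to keep the example short.

```python
import numpy as np
import pandas as pd

def quality_snapshot(frame: pd.DataFrame) -> dict:
    """Collect row, missing-value, and duplicate counts for a frame."""
    return {
        "rows": len(frame),
        "missing": int(frame.isnull().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
    }

# Hypothetical raw data: one duplicate row and two missing entries
raw = pd.DataFrame({
    "Sales": [10.0, 10.0, np.nan, 40.0],
    "Region": ["East", "East", "West", None],
})

before = quality_snapshot(raw)             # taken before any cleaning
cleaned = raw.drop_duplicates().dropna()   # simplified stand-in for the preprocessing steps
after = quality_snapshot(cleaned)

# Side-by-side table of the two snapshots
report = pd.DataFrame({"before": before, "after": after})
print(report)
```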

### 3.2 Feature Dependencies
We create a correlation matrix to evaluate feature dependencies.

In [None]:
corr_matrix = df[numerical_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.savefig('correlation_matrix.png')
plt.show()
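
A heatmap becomes hard to read as the number of columns grows, so it can help to list the feature pairs ranked by absolute correlation. A minimal sketch on hypothetical toy data:

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy frame: Profit is an exact linear function of Sales
toy = pd.DataFrame({
    "Sales": [10, 20, 30, 40, 50],
    "Profit": [1, 2, 3, 4, 5],
    "Discount": [0.5, 0.1, 0.4, 0.2, 0.3],
})

corr = toy.corr()

# One entry per unordered pair of columns, skipping the diagonal
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})

# Rank by strength of the relationship, regardless of sign
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

The same ranking could be produced from `corr_matrix` to name the strongest dependencies in the report.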

### 3.3 Comprehensive Report
Below is a summary of the steps performed and the findings.

# Data Mining Assignment 1 Report

## 1. Initial Data Analysis
- **Shape**: (number of rows, number of columns)
- **Missing Values**: Total missing values across columns
- **Outliers**: Detected in specific columns
- **Duplicates**: Number of duplicate rows

## 2. Data Preprocessing
- **Missing Values**: Handled using mean imputation and forward fill.
- **Outliers**: Removed using IQR method.
- **Inconsistent Data**: Standardized 'Country' column.
- **Standardization/Normalization**: Applied StandardScaler (standardization) and MinMaxScaler (normalization).
- **Duplicates**: Removed duplicate rows.
- **KL Divergence**: Calculated for Sales vs Profit.

## 3. Results
- **Missing Values Reduced**: From initial to final count.
- **Visualizations**: See 'initial_' and 'cleaned_' PNG files.
- **Feature Dependencies**: Correlation matrix saved as 'correlation_matrix.png'.