# Data Mining Project

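This notebook is a Google Colab template for a data mining workflow: data loading, exploratory analysis, preprocessing, model training and evaluation, prediction, and saving the model.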

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FarnoodTavasoli/datamining_project/blob/main/data_mining_project.ipynb)

## 1. Setup and Installation

Install required packages and import necessary libraries.

In [None]:
# Install additional packages if needed (uncomment as required)
# !pip install pandas numpy matplotlib seaborn scikit-learn
# !pip install xgboost lightgbm
# !pip install plotly

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score

# Common ML algorithms
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Settings
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')

print("Libraries imported successfully!")

## 2. Data Loading

Load your dataset. You can:
- Upload files directly to Colab
- Load from Google Drive
- Download from URL
- Use built-in datasets

In [None]:
# Option 1: Mount Google Drive (uncomment to use)
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('/content/drive/MyDrive/your_dataset.csv')

In [None]:
# Option 2: Upload file from local computer (uncomment to use)
# from google.colab import files
# uploaded = files.upload()
# df = pd.read_csv(list(uploaded.keys())[0])

In [None]:
# Option 3: Load from URL (example with a sample dataset)
# df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

In [None]:
# Option 4: Use a built-in sklearn dataset (example)
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")

## 3. Exploratory Data Analysis (EDA)

Explore and understand your data.

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Basic information about the dataset
print("Dataset Info:")
df.info()
print("\nDataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())

In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_pct = 100 * missing / len(df)
missing_table = pd.DataFrame({'Missing': missing, 'Percentage': missing_pct})
missing_table[missing_table['Missing'] > 0].sort_values('Missing', ascending=False)

In [None]:
# Visualizations
# Distribution of numerical features
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
    # Dynamically calculate layout
    n_cols = len(numeric_cols)
    n_rows = (n_cols + 2) // 3  # Calculate rows needed for 3 columns
    n_plot_cols = min(3, n_cols)  # Max 3 columns
    
    df[numeric_cols].hist(bins=30, figsize=(15, 5*n_rows), layout=(n_rows, n_plot_cols))
    plt.tight_layout()
    plt.show()

# Correlation heatmap
numeric_df = df.select_dtypes(include=[np.number])
if len(numeric_df.columns) > 1:
    plt.figure(figsize=(10, 8))
    sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', center=0)
    plt.title('Correlation Heatmap')
    plt.show()

## 4. Data Preprocessing

Clean and prepare data for modeling.

In [None]:
# Handle missing values
# Example: Fill numeric columns with median, categorical with mode
for col in df.columns:
    if df[col].isnull().sum() > 0:
        if pd.api.types.is_numeric_dtype(df[col]):  # covers int32, float32, etc., not just int64/float64
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

print("Missing values handled!")

In [None]:
# Encode categorical variables
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

if len(categorical_cols) > 0:
    print(f"Encoding {len(categorical_cols)} categorical columns: {categorical_cols.tolist()}")
    
    for col in categorical_cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
    
    print("Categorical encoding completed!")
else:
    print("No categorical columns to encode.")
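
`LabelEncoder` assigns each category an arbitrary integer, which linear models and SVMs may interpret as an ordering. For nominal columns, one-hot encoding is a common alternative. The following is a minimal sketch on a hypothetical toy DataFrame; the `city` and `age` columns are invented for illustration and are not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical toy data: "city" is a nominal column with no natural order.
toy = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Oslo"],
    "age": [31, 45, 28, 39],
})

# Create one 0/1 indicator column per category; "age" passes through unchanged.
encoded = pd.get_dummies(toy, columns=["city"], dtype=int)
print(encoded)
```

One-hot encoding adds a column per distinct value, so it tends to suit columns with few categories.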

In [None]:
# Feature engineering (add your custom features here)
# Example:
# df['new_feature'] = df['feature1'] * df['feature2']

print("Feature engineering completed!")

## 5. Model Training and Evaluation

Split the data, scale the features, and train several classifiers. The models and the accuracy metric used below assume a classification target.

In [None]:
# Separate features and target
# Adjust 'target' to your actual target column name
target_col = 'target'  # Change this to your target column

if target_col in df.columns:
    X = df.drop(target_col, axis=1)
    y = df[target_col]
    
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    print(f"Training set size: {X_train.shape}")
    print(f"Test set size: {X_test.shape}")
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    print("Data split and scaling completed!")
else:
    print(f"Warning: Target column '{target_col}' not found. Please adjust the target_col variable.")
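
The split above is purely random, so class proportions in the test set can drift from those in the full dataset, especially with small or imbalanced data. As a hedged sketch, passing `stratify=y` to `train_test_split` keeps the proportions aligned. The variable names below (`X_tr`, `X_te`, `y_tr`, `y_te`) are chosen for this example so they do not overwrite the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo preserves the class ratio of y_demo in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many test samples fall in each class.
test_counts = np.bincount(y_te)
print("Test samples per class:", test_counts)
```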

In [None]:
# Train multiple models
if target_col in df.columns:
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(random_state=42)
    }
    
    results = {}
    
    print("Training models...\n")
    for name, model in models.items():
        # Train
        model.fit(X_train_scaled, y_train)
        
        # Predict
        y_pred = model.predict(X_test_scaled)
        
        # Evaluate
        accuracy = accuracy_score(y_test, y_pred)
        results[name] = accuracy
        
        print(f"{name}:")
        print(f"  Accuracy: {accuracy:.4f}")
        print()
    
    # Display results summary
    results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])
    results_df = results_df.sort_values('Accuracy', ascending=False)
    print("\nModel Comparison:")
    print(results_df)

In [None]:
# Detailed evaluation of best model
if target_col in df.columns and results:
    best_model_name = max(results, key=results.get)
    best_model = models[best_model_name]
    
    print(f"Best Model: {best_model_name}\n")
    
    y_pred = best_model.predict(X_test_scaled)
    
    # Classification report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
    # Confusion matrix
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {best_model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

In [None]:
# Visualize model comparison
if target_col in df.columns and results:  # guard avoids a NameError when the target column was not found
    plt.figure(figsize=(10, 6))
    models_list = list(results.keys())
    accuracies = list(results.values())
    
    plt.bar(models_list, accuracies, color='skyblue')
    plt.xlabel('Model')
    plt.ylabel('Accuracy')
    plt.title('Model Performance Comparison')
    plt.xticks(rotation=45, ha='right')
    plt.ylim([0, 1])
    
    # Add value labels on bars
    for i, v in enumerate(accuracies):
        plt.text(i, v + 0.01, f'{v:.4f}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()

## 6. Making Predictions

Use the trained model to make predictions on new data.

In [None]:
# Example: Make prediction on a sample
if target_col in df.columns and results:
    # Take a sample from test set
    sample_idx = 0
    sample = X_test_scaled[sample_idx:sample_idx+1]
    
    prediction = best_model.predict(sample)
    
    print(f"Sample data: {X_test.iloc[sample_idx].to_dict()}")
    print(f"\nPredicted class: {prediction[0]}")
    print(f"Actual class: {y_test.iloc[sample_idx]}")
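
The sample above was already scaled. Data that arrives later in its original units has to pass through the same fitted scaler before prediction. Here is a self-contained sketch of that step, assuming the Iris dataset and a logistic regression model; `demo_scaler`, `demo_model`, and `new_flower` are illustrative names, and the measurement values are a made-up input.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris_demo = load_iris()
X_raw = pd.DataFrame(iris_demo.data, columns=iris_demo.feature_names)
y_raw = iris_demo.target

# Fit the scaler and the model once, on the training data.
demo_scaler = StandardScaler().fit(X_raw)
demo_model = LogisticRegression(max_iter=1000).fit(demo_scaler.transform(X_raw), y_raw)

# Hypothetical new measurement in original units (cm), same column order as training.
new_flower = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=iris_demo.feature_names)

# Reuse the fitted scaler with transform(); never refit it on new data.
new_scaled = demo_scaler.transform(new_flower)
predicted = demo_model.predict(new_scaled)
print("Predicted species:", iris_demo.target_names[predicted[0]])
```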

## 7. Conclusion and Next Steps

Summarize your findings here, then consider these potential improvements:

1. **Model Performance**: Review the accuracy and other metrics of your models
2. **Feature Importance**: Analyze which features contribute most to predictions
3. **Hyperparameter Tuning**: Use GridSearchCV or RandomizedSearchCV to optimize model parameters
4. **Cross-Validation**: Implement k-fold cross-validation for more robust evaluation
5. **Advanced Techniques**: Try ensemble methods, deep learning, or other advanced algorithms
6. **Deployment**: Save the model and deploy it for real-world use
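
Items 3 and 4 can be sketched together. The example below is one possible setup, not the only one: it assumes the Iris dataset, wraps scaling and logistic regression in a scikit-learn pipeline so the scaler is refit inside each fold, and searches an arbitrary grid of `C` values chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

# Scaling lives inside the pipeline, so each fold fits the scaler
# on its own training portion only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation (stratified by default for classifiers).
cv_scores = cross_val_score(pipe, X_cv, y_cv, cv=5)
print(f"CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# Illustrative grid; make_pipeline names each step after its lowercased class.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_cv, y_cv)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

Cross-validated scores are usually a steadier basis for comparing models than a single train/test split, particularly on small datasets.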

### Save the Model

In [None]:
# Save the best model
import pickle

if target_col in df.columns and results:
    # Save model
    with open('best_model.pkl', 'wb') as f:
        pickle.dump(best_model, f)
    
    # Save scaler
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
    
    print("Model and scaler saved successfully!")
    
    # Download to local machine (uncomment if needed)
    # from google.colab import files
    # files.download('best_model.pkl')
    # files.download('scaler.pkl')
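
To reuse the saved files later, load both the model and the scaler and apply them in the same order as during training. The sketch below is a self-contained round trip under stated assumptions: it trains a small decision tree on Iris and writes to a temporary directory, so it does not touch the `best_model.pkl` and `scaler.pkl` files created above.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
scaler_in = StandardScaler().fit(X_all)
model_in = DecisionTreeClassifier(random_state=42).fit(scaler_in.transform(X_all), y_all)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"
    scaler_path = Path(tmp) / "scaler.pkl"

    # Save both objects; the model is of little use without its scaler.
    model_path.write_bytes(pickle.dumps(model_in))
    scaler_path.write_bytes(pickle.dumps(scaler_in))

    # Only unpickle files you created yourself or otherwise trust.
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_scaler = pickle.loads(scaler_path.read_bytes())

# The restored objects should reproduce the original predictions.
restored_pred = loaded_model.predict(loaded_scaler.transform(X_all[:5]))
original_pred = model_in.predict(scaler_in.transform(X_all[:5]))
print("Predictions match:", (restored_pred == original_pred).all())
```

Pickled scikit-learn objects are best loaded with the same library version that saved them.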

## Additional Resources

- [Scikit-learn Documentation](https://scikit-learn.org/)
- [Pandas Documentation](https://pandas.pydata.org/)
- [Matplotlib Documentation](https://matplotlib.org/)
- [Seaborn Documentation](https://seaborn.pydata.org/)

---

**Note**: This notebook is a template. Customize it based on your specific data mining task and dataset.