# Laptop Price Prediction - Machine Learning Project

This notebook implements a complete machine learning workflow to predict laptop prices using Linear Regression and Decision Tree models.

## Table of Contents
1. Data Loading and Exploration
2. Data Preprocessing
3. Model Training (Linear Regression & Decision Tree)
4. Model Evaluation and Comparison
5. Visualization
6. Save Best Model and Preprocessing Objects
7. Model Testing with Sample Predictions

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pickle
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('All libraries imported successfully!')

## 1. Data Loading and Exploration

In [None]:
# Load the dataset
df = pd.read_csv('laptop_prices.csv')

print(f"Dataset Shape: {df.shape}")
print(f"\nNumber of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

In [None]:
# Display first few rows
df.head(10)

In [None]:
# Dataset information
df.info()

In [None]:
# Statistical summary
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

In [None]:
# Target variable distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(df['Price_Tsh'], bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Price (Tsh)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Laptop Prices', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.boxplot(df['Price_Tsh'])
plt.ylabel('Price (Tsh)', fontsize=12)
plt.title('Boxplot of Laptop Prices', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('price_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"Mean Price: {df['Price_Tsh'].mean():,.2f} Tsh")
print(f"Median Price: {df['Price_Tsh'].median():,.2f} Tsh")
print(f"Min Price: {df['Price_Tsh'].min():,.2f} Tsh")
print(f"Max Price: {df['Price_Tsh'].max():,.2f} Tsh")

## 2. Data Preprocessing

In [None]:
# Create a copy for preprocessing
df_processed = df.copy()

print("Original dataset shape:", df_processed.shape)

In [None]:
# Select important features for modeling
# We'll focus on the most relevant features

selected_features = [
    'Company', 'TypeName', 'Inches', 'Ram', 'OS', 'Weight',
    'Touchscreen', 'IPSpanel', 'RetinaDisplay', 'CPU_company',
    'CPU_freq', 'PrimaryStorage', 'GPU_company', 'Price_Tsh'
]

df_processed = df_processed[selected_features]
print("Selected features:", df_processed.shape[1] - 1)  # -1 for target variable

In [None]:
# Handle missing values
print("Missing values before handling:")
print(df_processed.isnull().sum())
print()

# Drop rows with missing values (if any)
df_processed = df_processed.dropna()
print(f"Dataset shape after handling missing values: {df_processed.shape}")

In [None]:
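# Convert categorical Yes/No to binary (1 = 'Yes', 0 = anything else).
# Note: the comparison is exact, so values such as 'yes' or 1 would all become 0.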
binary_columns = ['Touchscreen', 'IPSpanel', 'RetinaDisplay']

for col in binary_columns:
    df_processed[col] = (df_processed[col] == 'Yes').astype(int)

print("Binary columns converted successfully")
print(df_processed[binary_columns].head())
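
The exact comparison with `'Yes'` in the cell above turns every other value into 0 without warning. A stricter conversion could look like the sketch below. It assumes the flags may arrive as `Yes`/`No` in mixed case or as 0/1; `to_binary_flag`, `FLAG_MAP`, and the demo frame are hypothetical and not part of the dataset or the notebook.

```python
import pandas as pd

# Hypothetical mapping of accepted spellings; extend it if the data uses others.
FLAG_MAP = {"yes": 1, "no": 0, "1": 1, "0": 0}

def to_binary_flag(series: pd.Series) -> pd.Series:
    """Map Yes/No-style flags to 0/1 and raise on any unexpected value."""
    normalized = series.astype(str).str.strip().str.lower()
    mapped = normalized.map(FLAG_MAP)
    if mapped.isna().any():
        # Missing values become the string "nan" above and are reported here too.
        unexpected = sorted(normalized[mapped.isna()].unique())
        raise ValueError(f"Unexpected flag values in {series.name!r}: {unexpected}")
    return mapped.astype(int)

# Made-up values for illustration only
demo = pd.DataFrame({"Touchscreen": ["Yes", "No", "yes ", 1]})
demo["Touchscreen"] = to_binary_flag(demo["Touchscreen"])
print(demo["Touchscreen"].tolist())
```

Failing loudly here is a design choice: a column that is silently all zeros still trains without errors, so the mistake would only surface later as a feature with no importance.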

In [None]:
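# Encode categorical variables as integer codes.
# Note: LabelEncoder numbers categories in sorted order, so the codes carry no
# real ranking; Linear Regression will still treat them as ordered magnitudes.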
categorical_columns = ['Company', 'TypeName', 'OS', 'CPU_company', 'GPU_company']

# Create label encoders dictionary to save for later use
label_encoders = {}

for col in categorical_columns:
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df_processed[col])
    label_encoders[col] = le
    print(f"{col}: {len(le.classes_)} unique values")

print("\nCategorical encoding completed!")

In [None]:
# Check for any remaining non-numeric data
print("Data types after preprocessing:")
print(df_processed.dtypes)
print("\nProcessed dataset:")
df_processed.head()

In [None]:
# Correlation analysis
plt.figure(figsize=(12, 10))
correlation_matrix = df_processed.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nTop correlations with Price:")
price_corr = correlation_matrix['Price_Tsh'].sort_values(ascending=False)
print(price_corr)

In [None]:
# Split features and target
X = df_processed.drop('Price_Tsh', axis=1)
y = df_processed['Price_Tsh']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature names: {list(X.columns)}")

In [None]:
# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"\nTraining set: {X_train.shape[0]/len(X)*100:.1f}%")
print(f"Testing set: {X_test.shape[0]/len(X)*100:.1f}%")

In [None]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling completed!")
print(f"\nScaled training data shape: {X_train_scaled.shape}")
print(f"Scaled testing data shape: {X_test_scaled.shape}")
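
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. The small check below reproduces that arithmetic by hand, using made-up RAM values (the `*_demo` names are illustrative).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, e.g. RAM in GB
X_train_demo = np.array([[4.0], [8.0], [16.0]])
X_test_demo = np.array([[32.0]])

scaler_demo = StandardScaler().fit(X_train_demo)

# StandardScaler computes z = (x - mean) / std, with the population std (ddof=0)
# estimated from the training data only.
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
scaled = scaler_demo.transform(X_test_demo)

print(scaled, manual)
```

Fitting on the full dataset before splitting would let test-set statistics leak into training, which is why `fit_transform` is called on `X_train` and only `transform` on `X_test`.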

## 3. Model Training

### 3.1 Linear Regression Model

In [None]:
# Train Linear Regression model
print("Training Linear Regression Model...")
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
print("Linear Regression model trained successfully!")

# Make predictions
y_pred_lr_train = lr_model.predict(X_train_scaled)
y_pred_lr_test = lr_model.predict(X_test_scaled)

print("\nPredictions completed!")

### 3.2 Decision Tree Model

In [None]:
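# Train Decision Tree model
# (trees are insensitive to feature scaling; the scaled data is reused so both
# models go through the same saved preprocessing)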
print("Training Decision Tree Model...")
dt_model = DecisionTreeRegressor(random_state=42, max_depth=10)
dt_model.fit(X_train_scaled, y_train)
print("Decision Tree model trained successfully!")

# Make predictions
y_pred_dt_train = dt_model.predict(X_train_scaled)
y_pred_dt_test = dt_model.predict(X_test_scaled)

print("\nPredictions completed!")

## 4. Model Evaluation and Comparison

In [None]:
# Function to calculate evaluation metrics
def evaluate_model(y_true, y_pred, model_name, dataset_type):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    print(f"\n{'='*60}")
    print(f"{model_name} - {dataset_type} Set Performance")
    print(f"{'='*60}")
    print(f"Mean Squared Error (MSE):     {mse:,.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:,.2f}")
    print(f"Mean Absolute Error (MAE):     {mae:,.2f}")
    print(f"R² Score:                      {r2:.4f}")
    print(f"{'='*60}")
    
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R2': r2}
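
To make the four metrics concrete, the sketch below computes them by hand with NumPy on four made-up prices. The formulas are the standard definitions that the scikit-learn functions in `evaluate_model` implement; the numbers are invented for illustration.

```python
import numpy as np

# Made-up actual and predicted prices
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true_demo - y_pred_demo            # [-10, 10, -20, 20]

mse_demo = np.mean(errors ** 2)               # (100 + 100 + 400 + 400) / 4 = 250
rmse_demo = np.sqrt(mse_demo)                 # square root of MSE, same units as the price
mae_demo = np.mean(np.abs(errors))            # (10 + 10 + 20 + 20) / 4 = 15

ss_res = np.sum(errors ** 2)                                  # 1000
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)      # 50000
r2_demo = 1 - ss_res / ss_tot                                 # 1 - 1000/50000 = 0.98

print(mse_demo, rmse_demo, mae_demo, r2_demo)
```

RMSE and MAE are both in Tsh, so they can be read directly against laptop prices. RMSE penalizes large misses more heavily than MAE, so a wide gap between the two points to a few large errors.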

In [None]:
# Evaluate Linear Regression
lr_train_metrics = evaluate_model(y_train, y_pred_lr_train, "Linear Regression", "Training")
lr_test_metrics = evaluate_model(y_test, y_pred_lr_test, "Linear Regression", "Testing")

In [None]:
# Evaluate Decision Tree
dt_train_metrics = evaluate_model(y_train, y_pred_dt_train, "Decision Tree", "Training")
dt_test_metrics = evaluate_model(y_test, y_pred_dt_test, "Decision Tree", "Testing")

In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree'],
    'Train_RMSE': [lr_train_metrics['RMSE'], dt_train_metrics['RMSE']],
    'Test_RMSE': [lr_test_metrics['RMSE'], dt_test_metrics['RMSE']],
    'Train_MAE': [lr_train_metrics['MAE'], dt_train_metrics['MAE']],
    'Test_MAE': [lr_test_metrics['MAE'], dt_test_metrics['MAE']],
    'Train_R2': [lr_train_metrics['R2'], dt_train_metrics['R2']],
    'Test_R2': [lr_test_metrics['R2'], dt_test_metrics['R2']]
})

print("\n" + "="*80)
print("MODEL COMPARISON SUMMARY")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)

In [None]:
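# Determine best model based on Test R² Score
# Note: because the test set is used to choose the model, the winning score
# is a slightly optimistic estimate of performance on new data.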
if lr_test_metrics['R2'] > dt_test_metrics['R2']:
    best_model = lr_model
    best_model_name = "Linear Regression"
    best_predictions = y_pred_lr_test
    best_r2 = lr_test_metrics['R2']
else:
    best_model = dt_model
    best_model_name = "Decision Tree"
    best_predictions = y_pred_dt_test
    best_r2 = dt_test_metrics['R2']

print(f"\n{'*'*80}")
print(f"BEST MODEL: {best_model_name}")
print(f"Test R² Score: {best_r2:.4f}")
print(f"{'*'*80}")
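
`cross_val_score` is imported at the top of the notebook but never used. One way to use it is to pick the model by cross-validation on the training set and keep the test set for a single final check. The sketch below runs on synthetic data from `make_regression`, not the laptop dataset, and every name in it is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the features and prices
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Putting the scaler in a pipeline means it is refitted inside every CV fold.
candidates = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Decision Tree": make_pipeline(
        StandardScaler(), DecisionTreeRegressor(random_state=42, max_depth=10)
    ),
}

# Mean 5-fold R² on the training data only
cv_scores = {
    name: cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
chosen_name = max(cv_scores, key=cv_scores.get)

# Refit the chosen pipeline on all training data, then score once on the test set.
chosen_model = candidates[chosen_name].fit(X_tr, y_tr)
print(cv_scores)
print(chosen_name, chosen_model.score(X_te, y_te))
```

With this arrangement the test R² is reported for one model only, so it stays an honest estimate of performance on unseen laptops.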

## 5. Visualization

In [None]:
# Model comparison visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. RMSE Comparison
models = ['Linear Regression', 'Decision Tree']
train_rmse = [lr_train_metrics['RMSE'], dt_train_metrics['RMSE']]
test_rmse = [lr_test_metrics['RMSE'], dt_test_metrics['RMSE']]

x = np.arange(len(models))
width = 0.35

axes[0, 0].bar(x - width/2, train_rmse, width, label='Training', alpha=0.8)
axes[0, 0].bar(x + width/2, test_rmse, width, label='Testing', alpha=0.8)
axes[0, 0].set_xlabel('Model', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('RMSE', fontsize=12, fontweight='bold')
axes[0, 0].set_title('RMSE Comparison', fontsize=14, fontweight='bold')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(models)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. R² Score Comparison
train_r2 = [lr_train_metrics['R2'], dt_train_metrics['R2']]
test_r2 = [lr_test_metrics['R2'], dt_test_metrics['R2']]

axes[0, 1].bar(x - width/2, train_r2, width, label='Training', alpha=0.8)
axes[0, 1].bar(x + width/2, test_r2, width, label='Testing', alpha=0.8)
axes[0, 1].set_xlabel('Model', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('R² Score', fontsize=12, fontweight='bold')
axes[0, 1].set_title('R² Score Comparison', fontsize=14, fontweight='bold')
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels(models)
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Actual vs Predicted - Linear Regression
axes[1, 0].scatter(y_test, y_pred_lr_test, alpha=0.6, s=50)
axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                'r--', lw=2, label='Perfect Prediction')
axes[1, 0].set_xlabel('Actual Price (Tsh)', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Predicted Price (Tsh)', fontsize=12, fontweight='bold')
axes[1, 0].set_title(f'Linear Regression: Actual vs Predicted\nR² = {lr_test_metrics["R2"]:.4f}', 
                     fontsize=14, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 4. Actual vs Predicted - Decision Tree
axes[1, 1].scatter(y_test, y_pred_dt_test, alpha=0.6, s=50, color='orange')
axes[1, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                'r--', lw=2, label='Perfect Prediction')
axes[1, 1].set_xlabel('Actual Price (Tsh)', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('Predicted Price (Tsh)', fontsize=12, fontweight='bold')
axes[1, 1].set_title(f'Decision Tree: Actual vs Predicted\nR² = {dt_test_metrics["R2"]:.4f}', 
                     fontsize=14, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Residual analysis for both models
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Linear Regression residuals
residuals_lr = y_test - y_pred_lr_test
axes[0].scatter(y_pred_lr_test, residuals_lr, alpha=0.6)
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_xlabel('Predicted Price (Tsh)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Residuals', fontsize=12, fontweight='bold')
axes[0].set_title('Linear Regression: Residual Plot', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Decision Tree residuals
residuals_dt = y_test - y_pred_dt_test
axes[1].scatter(y_pred_dt_test, residuals_dt, alpha=0.6, color='orange')
axes[1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted Price (Tsh)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Residuals', fontsize=12, fontweight='bold')
axes[1].set_title('Decision Tree: Residual Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('residual_plots.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Feature importance (for Decision Tree)
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance', fontsize=12, fontweight='bold')
plt.ylabel('Feature', fontsize=12, fontweight='bold')
plt.title('Decision Tree: Feature Importance', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nTop 5 Most Important Features:")
print(feature_importance.head())

## 6. Save Best Model and Preprocessing Objects

In [None]:
# Save the best model
model_data = {
    'model': best_model,
    'scaler': scaler,
    'label_encoders': label_encoders,
    'feature_names': list(X.columns),
    'model_name': best_model_name,
    'test_r2_score': best_r2,
    'test_rmse': lr_test_metrics['RMSE'] if best_model_name == 'Linear Regression' else dt_test_metrics['RMSE']
}

with open('model.pkl', 'wb') as f:
    pickle.dump(model_data, f)

print(f"✓ Best model ({best_model_name}) saved successfully as 'model.pkl'")
print(f"✓ Model R² Score: {best_r2:.4f}")
print(f"\nThe model file includes:")
print("  - Trained model")
print("  - Feature scaler")
print("  - Label encoders")
print("  - Feature names")
print("  - Model metadata")
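
The saved bundle holds the encoders and scaler so that a raw laptop record can be preprocessed the same way as the training data. The sketch below shows that inference path on a tiny made-up dataset with two features. `predict_price` and all `demo_*` names are hypothetical, and the prices are invented.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up training data standing in for the real dataset
demo_train = pd.DataFrame({
    "Company": ["Dell", "HP", "Apple", "Dell", "HP", "Apple"],
    "Ram": [8, 8, 16, 16, 4, 8],
    "Price_Tsh": [1_500_000, 1_400_000, 4_000_000, 2_500_000, 900_000, 3_200_000],
})
demo_features = ["Company", "Ram"]

# Fit the same kinds of objects the notebook stores in model.pkl
demo_encoders = {"Company": LabelEncoder().fit(demo_train["Company"])}
X_demo = demo_train[demo_features].copy()
X_demo["Company"] = demo_encoders["Company"].transform(X_demo["Company"])
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), demo_train["Price_Tsh"])

def predict_price(raw: dict) -> float:
    """Encode, scale, and score one raw record; reject categories unseen in training."""
    row = pd.DataFrame([raw])[demo_features]  # enforce the training column order
    for col, le in demo_encoders.items():
        unseen = set(row[col]) - set(le.classes_)
        if unseen:
            raise ValueError(f"Unknown {col} value(s): {sorted(unseen)}")
        row[col] = le.transform(row[col])
    return float(demo_model.predict(demo_scaler.transform(row))[0])

print(predict_price({"Company": "Dell", "Ram": 8}))
```

`LabelEncoder.transform` raises on labels it has not seen, so checking against `classes_` first gives the caller a clearer error message. The same pattern would apply to the objects stored under `'label_encoders'`, `'scaler'`, and `'feature_names'` in the pickle.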

## 7. Model Testing with Sample Predictions

In [None]:
# Test the saved model with sample predictions
print("Testing the saved model with sample predictions...\n")

# Load the model
with open('model.pkl', 'rb') as f:
    loaded_model_data = pickle.load(f)

# Get 5 random samples from test set (seeded so the output is reproducible)
rng = np.random.default_rng(42)
sample_indices = rng.choice(X_test.index, 5, replace=False)
samples = X_test.loc[sample_indices]
actual_prices = y_test.loc[sample_indices]

# Make predictions
samples_scaled = loaded_model_data['scaler'].transform(samples)
predictions = loaded_model_data['model'].predict(samples_scaled)

# Display results
results_df = pd.DataFrame({
    'Actual Price (Tsh)': actual_prices.values,
    'Predicted Price (Tsh)': predictions,
    'Difference (Tsh)': actual_prices.values - predictions,
    'Error (%)': np.abs((actual_prices.values - predictions) / actual_prices.values * 100)
})

print("Sample Predictions:")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)
print(f"\nMean Absolute Error: {np.abs(results_df['Difference (Tsh)']).mean():,.2f} Tsh")
print(f"Mean Absolute Percentage Error: {results_df['Error (%)'].mean():.2f}%")

## Summary

This notebook covers the following steps:

1. **Data Loading and Exploration**: Loaded the laptop prices dataset and examined its shape, summary statistics, missing values, and price distribution
2. **Data Preprocessing**: 
   - Handled missing values
   - Encoded categorical variables
   - Split data into training and testing sets
   - Scaled features
3. **Model Training**: Trained both Linear Regression and Decision Tree models
4. **Model Evaluation**: Evaluated and compared both models using RMSE, MAE, and R²
5. **Visualization**: Plotted metric comparisons, actual vs. predicted prices, residuals, and feature importance
6. **Model Saving**: Saved the model with the higher test R², together with the scaler and label encoders
7. **Model Testing**: Reloaded the saved model and checked its predictions on five test samples

The saved `model.pkl` bundles the model, scaler, label encoders, and feature names, so an application can load it and apply the same preprocessing before predicting.