# Assignment 2.2: Pumpkin Logistic Regression

**Objective**: Build a logistic regression model using the full pumpkin dataset, with proper data cleaning and standardization. We'll evaluate the model with a confusion matrix and an ROC curve.

**Dataset**: US-pumpkins.csv contains 1757 records and 26 columns, including prices, varieties, sizes, origins, and dates.

**Target**: We need to define a classification target. Common approaches:
- Classify by variety (HOWDEN TYPE, PIE TYPE, MINIATURE)
- Classify by size category (large vs medium vs small)
- Classify by price range (high vs low price)

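We'll use **Variety** as our target: it is missing in only 5 of the 1757 rows (Item Size is missing in 279), and unlike a price-based target it needs no arbitrary threshold to define the classes.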

In [3]:
# Step 1: Load and inspect the pumpkin dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('C:/Users/Axewu/OneDrive/Dokumenter/GitHub/ML-For-Beginners/2-Regression/data/US-pumpkins.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst 3 rows:")
df.head(3)

Dataset shape: (1757, 26)

Columns: ['City Name', 'Type', 'Package', 'Variety', 'Sub Variety', 'Grade', 'Date', 'Low Price', 'High Price', 'Mostly Low', 'Mostly High', 'Origin', 'Origin District', 'Item Size', 'Color', 'Environment', 'Unit of Sale', 'Quality', 'Condition', 'Appearance', 'Storage', 'Crop', 'Repack', 'Trans Mode', 'Unnamed: 24', 'Unnamed: 25']

First 3 rows:


Unnamed: 0,City Name,Type,Package,Variety,Sub Variety,Grade,Date,Low Price,High Price,Mostly Low,...,Unit of Sale,Quality,Condition,Appearance,Storage,Crop,Repack,Trans Mode,Unnamed: 24,Unnamed: 25
0,BALTIMORE,,24 inch bins,,,,4/29/17,270.0,280.0,270.0,...,,,,,,,E,,,
1,BALTIMORE,,24 inch bins,,,,5/6/17,270.0,280.0,270.0,...,,,,,,,E,,,
2,BALTIMORE,,24 inch bins,HOWDEN TYPE,,,9/24/16,160.0,160.0,160.0,...,,,,,,,N,,,


In [4]:
# Step 2: Data Exploration and Missing Values Analysis
print("=== DATA EXPLORATION ===")
print(f"Dataset shape: {df.shape}")
print(f"\nData types:")
print(df.dtypes.value_counts())

print(f"\n=== MISSING VALUES ===")
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
}).sort_values('Missing Percentage', ascending=False)

print(missing_df[missing_df['Missing Count'] > 0])

print(f"\n=== TARGET VARIABLE ANALYSIS ===")
print("Variety value counts:")
print(df['Variety'].value_counts())
print(f"\nVariety missing values: {df['Variety'].isnull().sum()}")

print(f"\n=== NUMERIC FEATURES ===")
numeric_cols = ['Low Price', 'High Price', 'Mostly Low', 'Mostly High']
print(df[numeric_cols].describe())

=== DATA EXPLORATION ===
Dataset shape: (1757, 26)

Data types:
object     13
float64    13
Name: count, dtype: int64

=== MISSING VALUES ===
                 Missing Count  Missing Percentage
Grade                     1757          100.000000
Quality                   1757          100.000000
Condition                 1757          100.000000
Appearance                1757          100.000000
Storage                   1757          100.000000
Crop                      1757          100.000000
Trans Mode                1757          100.000000
Unnamed: 24               1757          100.000000
Environment               1757          100.000000
Type                      1712           97.438816
Unnamed: 25               1654           94.137735
Origin District           1626           92.544109
Unit of Sale              1595           90.779738
Sub Variety               1461           83.153102
Color                      616           35.059761
Item Size                  279           15.879340

## Data Cleaning Strategy

Based on the exploration above, our cleaning strategy will be:

1. **Target Variable**: Use 'Variety' as our classification target (HOWDEN TYPE, PIE TYPE, MINIATURE)
2. **Remove irrelevant columns**: Drop columns with too many missing values or no predictive value
3. **Handle missing values**: 
   - Drop rows where target (Variety) is missing
   - For features: impute or drop based on missing percentage
4. **Feature Engineering**:
   - Create average price feature from Low/High Price
   - Encode categorical variables
   - Parse dates if needed
5. **Remove outliers**: Handle extreme price values
6. **Standardization**: Scale all numeric features for logistic regression
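
Steps 2 and 3 of the strategy hinge on a missing-percentage cutoff. The sketch below shows one way that decision could be expressed in code. The helper name `split_by_missing` and the 80% threshold are illustrative assumptions, not values taken from this notebook, and the toy rows are made up.

```python
import numpy as np
import pandas as pd


def split_by_missing(frame: pd.DataFrame, threshold: float = 0.8):
    """Return (columns_to_drop, columns_to_impute) based on missing fraction.

    `threshold` is a hypothetical cutoff: columns whose missing fraction is
    at or above it are proposed for dropping; columns with some missing
    values below it are proposed for imputation.
    """
    missing_fraction = frame.isnull().mean()
    to_drop = missing_fraction[missing_fraction >= threshold].index.tolist()
    to_impute = missing_fraction[
        (missing_fraction > 0) & (missing_fraction < threshold)
    ].index.tolist()
    return to_drop, to_impute


# Toy frame standing in for the pumpkin data (values are made up).
toy = pd.DataFrame({
    "Grade": [np.nan, np.nan, np.nan, np.nan, np.nan],        # 100% missing
    "Color": ["ORANGE", np.nan, "WHITE", "ORANGE", np.nan],   # 40% missing
    "Low Price": [160.0, 90.0, 90.0, 160.0, 270.0],           # complete
})

to_drop, to_impute = split_by_missing(toy, threshold=0.8)
print("Drop:", to_drop)
print("Impute:", to_impute)
```

Keeping the cutoff as a parameter makes it easy to see how the list of dropped columns changes when the threshold is tightened or relaxed.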

In [11]:
# Step 3: Data Cleaning Implementation
print("=== BEFORE CLEANING ===")
print(f"Original shape: {df.shape}")

# Create a copy for cleaning
df_clean = df.copy()

# 1. Remove rows where target variable (Variety) is missing
print(f"Rows with missing Variety: {df_clean['Variety'].isnull().sum()}")
df_clean = df_clean.dropna(subset=['Variety'])
print(f"After removing missing Variety: {df_clean.shape}")

# 2. Keep only the main varieties (top 3 most common)
top_varieties = df_clean['Variety'].value_counts().head(3).index
df_clean = df_clean[df_clean['Variety'].isin(top_varieties)]
print(f"After keeping top 3 varieties: {df_clean.shape}")
print(f"Varieties kept: {list(top_varieties)}")

# 3. Remove columns that are mostly empty or not useful for prediction
columns_to_drop = [
    'Unnamed: 24', 'Unnamed: 25', 'Type', 'Sub Variety', 'Grade',
    'Environment', 'Unit of Sale', 'Quality', 'Condition', 'Appearance',
    'Storage', 'Origin District', 'Trans Mode'
]
df_clean = df_clean.drop(columns=columns_to_drop, errors='ignore')
print(f"After dropping irrelevant columns: {df_clean.shape}")

# 4. Report missing values in the remaining columns (imputation happens in Step 4)
print(f"\n=== MISSING VALUES AFTER INITIAL CLEANING ===")
missing_after = df_clean.isnull().sum()
print(missing_after[missing_after > 0])

# 5. Create average price and price range features
df_clean['Avg_Price'] = (df_clean['Low Price'] + df_clean['High Price']) / 2
df_clean['Price_Range'] = df_clean['High Price'] - df_clean['Low Price']

# 6. Parse dates (two-digit years) and extract month and year
df_clean['Date'] = pd.to_datetime(df_clean['Date'], format='%m/%d/%y', errors='coerce')
df_clean['Month'] = df_clean['Date'].dt.month
df_clean['Year'] = df_clean['Date'].dt.year

print(f"\n=== AFTER FEATURE ENGINEERING ===")
print(f"Final shape: {df_clean.shape}")
print(f"Columns: {list(df_clean.columns)}")

df_clean.head()

=== BEFORE CLEANING ===
Original shape: (1757, 26)
Rows with missing Variety: 5
After removing missing Variety: (1752, 26)
After keeping top 3 varieties: (1320, 26)
Varieties kept: ['HOWDEN TYPE', 'PIE TYPE', 'MINIATURE']
After dropping irrelevant columns: (1320, 13)

=== MISSING VALUES AFTER INITIAL CLEANING ===
Mostly Low       88
Mostly High      88
Item Size       223
Color           329
Crop           1320
dtype: int64

=== AFTER FEATURE ENGINEERING ===
Final shape: (1320, 17)
Columns: ['City Name', 'Package', 'Variety', 'Date', 'Low Price', 'High Price', 'Mostly Low', 'Mostly High', 'Origin', 'Item Size', 'Color', 'Crop', 'Repack', 'Avg_Price', 'Price_Range', 'Month', 'Year']


Unnamed: 0,City Name,Package,Variety,Date,Low Price,High Price,Mostly Low,Mostly High,Origin,Item Size,Color,Crop,Repack,Avg_Price,Price_Range,Month,Year
2,BALTIMORE,24 inch bins,HOWDEN TYPE,2016-09-24,160.0,160.0,160.0,160.0,DELAWARE,med,ORANGE,,N,160.0,0.0,9,2016
3,BALTIMORE,24 inch bins,HOWDEN TYPE,2016-09-24,160.0,160.0,160.0,160.0,VIRGINIA,med,ORANGE,,N,160.0,0.0,9,2016
4,BALTIMORE,24 inch bins,HOWDEN TYPE,2016-11-05,90.0,100.0,90.0,100.0,MARYLAND,lge,ORANGE,,N,95.0,10.0,11,2016
5,BALTIMORE,24 inch bins,HOWDEN TYPE,2016-11-12,90.0,100.0,90.0,100.0,MARYLAND,lge,ORANGE,,N,95.0,10.0,11,2016
6,BALTIMORE,36 inch bins,HOWDEN TYPE,2016-09-24,160.0,170.0,160.0,170.0,MARYLAND,med,ORANGE,,N,165.0,10.0,9,2016
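
The missing-value report above shows `Crop` empty in all 1320 remaining rows, so filling it later produces a constant column that cannot help separate varieties. A hedged sketch of how such columns could be detected follows. The helper name `uninformative_columns` is hypothetical and the toy values are made up.

```python
import numpy as np
import pandas as pd


def uninformative_columns(frame: pd.DataFrame) -> list:
    """List columns that are entirely missing or hold a single distinct value."""
    # nunique() ignores NaN by default, so an all-missing column counts as 0.
    distinct = frame.nunique()
    return distinct[distinct <= 1].index.tolist()


toy = pd.DataFrame({
    "Variety": ["HOWDEN TYPE", "PIE TYPE", "MINIATURE"],
    "Crop": [np.nan, np.nan, np.nan],   # entirely missing
    "Repack": ["N", "N", "N"],          # single value
    "Low Price": [160.0, 90.0, 15.0],
})

candidates = uninformative_columns(toy)
print(candidates)
```

Whether to drop the flagged columns is a judgment call; they add no signal, but removing them would change the feature count reported in later cells.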


In [12]:
# Step 4: Feature Preparation and Encoding
from sklearn.preprocessing import LabelEncoder

print("=== FEATURE PREPARATION ===")

# Handle remaining missing values
print("Handling missing values...")
# Fill missing categorical values with 'Unknown'
categorical_cols = ['City Name', 'Package', 'Origin', 'Item Size', 'Color', 'Crop', 'Repack']
for col in categorical_cols:
    if col in df_clean.columns:
        df_clean[col] = df_clean[col].fillna('Unknown')

# Fill missing numeric values with median
numeric_cols = ['Low Price', 'High Price', 'Mostly Low', 'Mostly High', 'Avg_Price', 'Price_Range']
for col in numeric_cols:
    if col in df_clean.columns:
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Remove outliers in price (using IQR method)
print("Removing price outliers...")
Q1 = df_clean['Avg_Price'].quantile(0.25)
Q3 = df_clean['Avg_Price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Price range before outlier removal: {df_clean['Avg_Price'].min():.2f} - {df_clean['Avg_Price'].max():.2f}")
df_clean = df_clean[(df_clean['Avg_Price'] >= lower_bound) & (df_clean['Avg_Price'] <= upper_bound)]
print(f"Price range after outlier removal: {df_clean['Avg_Price'].min():.2f} - {df_clean['Avg_Price'].max():.2f}")
print(f"Shape after outlier removal: {df_clean.shape}")

# Encode categorical variables
print("\nEncoding categorical variables...")
label_encoders = {}
categorical_features = ['City Name', 'Package', 'Origin', 'Item Size', 'Color', 'Crop', 'Repack']

for col in categorical_features:
    if col in df_clean.columns and df_clean[col].dtype == 'object':
        le = LabelEncoder()
        df_clean[col + '_encoded'] = le.fit_transform(df_clean[col].astype(str))
        label_encoders[col] = le
        print(f"Encoded {col}: {len(le.classes_)} unique values")

print(f"\nFinal dataset shape: {df_clean.shape}")
print(f"Target variable distribution:")
print(df_clean['Variety'].value_counts())

=== FEATURE PREPARATION ===
Handling missing values...
Removing price outliers...
Price range before outlier removal: 10.75 - 480.00
Price range after outlier removal: 10.75 - 400.00
Shape after outlier removal: (1316, 17)

Encoding categorical variables...
Encoded City Name: 13 unique values
Encoded Package: 14 unique values
Encoded Origin: 22 unique values
Encoded Item Size: 8 unique values
Encoded Color: 4 unique values
Encoded Crop: 1 unique values
Encoded Repack: 1 unique values

Final dataset shape: (1316, 24)
Target variable distribution:
Variety
HOWDEN TYPE    541
PIE TYPE       465
MINIATURE      310
Name: count, dtype: int64
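
To make the IQR rule used in this cell concrete, here is a small worked sketch on made-up prices. `iqr_bounds` is a hypothetical helper, and the 1.5 multiplier matches the one used above.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices with one extreme value.
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 500.0])
lower, upper = iqr_bounds(prices)

# between() is inclusive on both ends, like the >= / <= filter above.
kept = prices[prices.between(lower, upper)]
print(lower, upper, kept.tolist())
```

Here Q1 is 22.5 and Q3 is 47.5, so the fences are -15 and 85 and only the 500 value falls outside them.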


In [13]:
# Step 5: Feature Selection and Standardization
print("=== FEATURE SELECTION ===")

# Select features for the model
feature_columns = [
    'Low Price', 'High Price', 'Mostly Low', 'Mostly High', 
    'Avg_Price', 'Price_Range', 'Month', 'Year',
    'City Name_encoded', 'Package_encoded', 'Origin_encoded', 
    'Item Size_encoded', 'Color_encoded', 'Crop_encoded', 'Repack_encoded'
]

# Keep only features that exist in our dataset
available_features = [col for col in feature_columns if col in df_clean.columns]
print(f"Selected features: {available_features}")

# Prepare X (features) and y (target)
X = df_clean[available_features].copy()
y = df_clean['Variety'].copy()

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Feature data types:")
print(X.dtypes)

# Check for any remaining missing values
print(f"\nMissing values in features:")
print(X.isnull().sum().sum())

# Remove any rows with missing values
if X.isnull().sum().sum() > 0:
    mask = ~(X.isnull().any(axis=1) | y.isnull())
    X = X[mask]
    y = y[mask]
    print(f"After removing missing values: X shape {X.shape}, y shape {y.shape}")

print(f"\nTarget variable distribution:")
print(y.value_counts())
print(f"Target proportions:")
print(y.value_counts(normalize=True))

=== FEATURE SELECTION ===
Selected features: ['Low Price', 'High Price', 'Mostly Low', 'Mostly High', 'Avg_Price', 'Price_Range', 'Month', 'Year', 'City Name_encoded', 'Package_encoded', 'Origin_encoded', 'Item Size_encoded', 'Color_encoded', 'Crop_encoded', 'Repack_encoded']
Features shape: (1316, 15)
Target shape: (1316,)
Feature data types:
Low Price            float64
High Price           float64
Mostly Low           float64
Mostly High          float64
Avg_Price            float64
Price_Range          float64
Month                  int32
Year                   int32
City Name_encoded      int64
Package_encoded        int64
Origin_encoded         int64
Item Size_encoded      int64
Color_encoded          int64
Crop_encoded           int64
Repack_encoded         int64
dtype: object

Missing values in features:
0

Target variable distribution:
Variety
HOWDEN TYPE    541
PIE TYPE       465
MINIATURE      310
Name: count, dtype: int64
Target proportions:
Variety
HOWDEN TYPE    0.411094
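
The `_encoded` features listed above are integer codes from `LabelEncoder`, which scikit-learn documents as a tool for encoding target labels rather than input features. For nominal columns such as `Origin`, the codes imply an ordering that a linear model will treat as meaningful. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on made-up rows and is an illustration, not the method this notebook uses.

```python
import pandas as pd

# Made-up rows standing in for the cleaned pumpkin data.
toy = pd.DataFrame({
    "Origin": ["MARYLAND", "DELAWARE", "VIRGINIA", "MARYLAND"],
    "Avg_Price": [95.0, 160.0, 160.0, 165.0],
})

# One indicator column per category; drop_first removes the column for the
# first category, which is fully determined by the remaining indicators.
encoded = pd.get_dummies(toy, columns=["Origin"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The trade-off is width: a column with 22 distinct origins becomes 21 indicator columns, which is usually acceptable for a dataset of this size.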


In [14]:
# Step 6: Standardization and Train/Test Split
print("=== STANDARDIZATION ===")

# Split the data first (important: standardize after splitting to avoid data leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train set: X_train {X_train.shape}, y_train {y_train.shape}")
print(f"Test set: X_test {X_test.shape}, y_test {y_test.shape}")

print(f"\nTrain set target distribution:")
print(y_train.value_counts())
print(f"\nTest set target distribution:")
print(y_test.value_counts())

# Standardize the features
print(f"\n=== BEFORE STANDARDIZATION ===")
print(f"Feature ranges (train set):")
print(X_train.describe().loc[['min', 'max']])

# Initialize and fit the scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print(f"\n=== AFTER STANDARDIZATION ===")
print(f"Feature ranges (train set):")
print(X_train_scaled.describe().loc[['min', 'max']])

print(f"\nMean and std of standardized features (should be ~0 and ~1):")
print(f"Means: {X_train_scaled.mean().round(6).head()}")
print(f"Stds: {X_train_scaled.std().round(6).head()}")

print(f"\nData is ready for logistic regression!")
print(f"Final shapes: X_train_scaled {X_train_scaled.shape}, X_test_scaled {X_test_scaled.shape}")

=== STANDARDIZATION ===
Train set: X_train (1052, 15), y_train (1052,)
Test set: X_test (264, 15), y_test (264,)

Train set target distribution:
Variety
HOWDEN TYPE    432
PIE TYPE       372
MINIATURE      248
Name: count, dtype: int64

Test set target distribution:
Variety
HOWDEN TYPE    109
PIE TYPE        93
MINIATURE       62
Name: count, dtype: int64

=== BEFORE STANDARDIZATION ===
Feature ranges (train set):
     Low Price  High Price  Mostly Low  Mostly High  Avg_Price  Price_Range  \
min      10.75       10.75        12.0         12.0      10.75          0.0   
max     400.00      400.00       400.0        400.0     400.00        100.0   

     Month    Year  City Name_encoded  Package_encoded  Origin_encoded  \
min    1.0  2014.0                0.0              0.0             0.0   
max   12.0  2017.0               12.0             13.0            21.0   

     Item Size_encoded  Color_encoded  Crop_encoded  Repack_encoded  
min                0.0            0.0           0.0
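
One detail behind the "should be ~0 and ~1" check: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), so the printed values sit slightly above 1. A constant column comes out as all zeros with a standard deviation of 0. The sketch below illustrates both effects on made-up numbers.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "Avg_Price": [10.0, 20.0, 30.0, 40.0],
    "Crop_encoded": [0.0, 0.0, 0.0, 0.0],   # constant, like a fully imputed column
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_std = scaled["Avg_Price"].std(ddof=0)   # what the scaler normalizes to 1
sample_std = scaled["Avg_Price"].std()      # pandas default, ddof=1
const_std = scaled["Crop_encoded"].std()    # constant column stays at 0

print(pop_std, sample_std, const_std)
```

With around a thousand training rows the gap between the two definitions is tiny, which is why the check is phrased as approximately 1.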

## Summary of Data Cleaning and Preprocessing

### What We Did:

1. **Data Loading**: Loaded 1757 records with 26 columns from the pumpkin dataset

2. **Target Selection**: Used 'Variety' as our classification target (HOWDEN TYPE, PIE TYPE, MINIATURE)

3. **Data Cleaning**:
   - Removed rows with missing target values
   - Kept only the top 3 most common varieties
   - Dropped irrelevant columns (empty columns, non-predictive features)
   - Imputed remaining missing values: 'Unknown' for categorical columns, the median for numeric columns

4. **Feature Engineering**:
   - Created `Avg_Price` = (Low Price + High Price) / 2
   - Created `Price_Range` = High Price - Low Price  
   - Extracted `Month` and `Year` from dates
   - Encoded categorical variables using LabelEncoder

5. **Outlier Removal**: Removed price outliers using IQR method

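6. **Standardization**:
   - Split data into train/test (80/20) with stratification
   - Applied StandardScaler to features (fit on train, transform both)
   - On the training set, every non-constant feature now has mean ≈ 0 and std ≈ 1; single-valued columns such as `Crop_encoded` and `Repack_encoded` become all zeros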

### Why Standardization Matters:
- Regularized logistic regression (scikit-learn's default applies an L2 penalty) is sensitive to feature scales
- Price features (10.75 to 400 after outlier removal) and `Month` (1 to 12) sit on very different scales
- Without standardization, the penalty falls unevenly across features and the solver can converge slowly
- StandardScaler puts all features on a comparable scale, so no feature is favored simply because of its units
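
As a sketch of how the scaled data could feed the model named in the objective, the scaler and classifier can be chained in a scikit-learn pipeline so that the scaler is always fit on training data only. The example uses synthetic three-class data from `make_classification`, not the pumpkin data, and the hyperparameters (such as `max_iter=1000`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pumpkin features: 3 classes, 6 numeric features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
# Put one feature on a much larger scale, as prices are relative to months.
X_demo[:, 0] *= 100.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns scaling statistics from the training split only;
# predict() reuses those statistics on the test split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, model.predict(X_te))
print(cm)
```

Bundling the two steps also means cross-validation would refit the scaler inside each fold, which keeps test-fold statistics out of training.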