# Task: Building Linear and Polynomial Regression Models for Pumpkin Prices

**Objective**: Build and evaluate linear and polynomial regression models that predict pumpkin prices from the features available in the dataset.

## Steps to Solve the Task

1. **Data Download**:
   - Download the dataset from Kaggle: https://www.kaggle.com/datasets/usda/a-year-of-pumpkin-prices?resource=download

2. **Data Exploration**:
   - Load the dataset into a DataFrame (using pandas).
   - Inspect the first few rows of the dataset to understand its structure.
   - Check for missing values and data types.

3. **Data Preprocessing**:
   - Handle missing values if any are found.
   - Convert categorical variables (if any) into numerical format using techniques like One-Hot Encoding.
   - Split the dataset into features (X) and target variable (y) where `y` is the price of pumpkins.

4. **Data Visualization**:
   - Create visualizations (scatter plots) to explore the relationship between the features and pumpkin prices. This helps identify trends and the potential need for polynomial features.




5. **Split the Data**:
   - Split the dataset into training and testing sets (e.g., 80% training, 20% testing) using `train_test_split` from `sklearn.model_selection`.

6. **Build a Linear Regression Model**:
   - Import the necessary libraries (e.g., `LinearRegression` from `sklearn.linear_model`).
   - Create a linear regression model and fit it to the training data.
   - Make predictions on the test set.
   - Evaluate the model's performance using metrics like Mean Squared Error (MSE) and R-squared.

7. **Build a Polynomial Regression Model**:
   - Import `PolynomialFeatures` from `sklearn.preprocessing`.
   - Transform the features into polynomial features (e.g., degree = 2).
   - Create a new regression model using the transformed features.
   - Fit the polynomial regression model to the training data.
   - Make predictions on the test set and evaluate the model.

8. **Comparison of Models**:
   - Compare the performance of the linear and polynomial regression models using the evaluation metrics.
   - Visualize the results to see how well each model fits the data.

9. **Conclusion**:
   - Summarize the findings and discuss which model performed better and why.
   - Consider the implications of using polynomial features and potential risks of overfitting.
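
Steps 5 to 8 can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the pumpkin dataset: the single feature, the quadratic relationship, and the noise level are all assumptions chosen so that the difference between the two models is visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the real data: one feature with a curved relationship (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(0, 1.0, size=200)

# 80% training, 20% testing, as in step 5.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# The polynomial model is a linear regression fitted on degree-2 features.
models = {
    "linear": LinearRegression(),
    "polynomial (degree 2)": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
    print(f"{name}: MSE={scores[name]['mse']:.2f}, R2={scores[name]['r2']:.2f}")
```

Wrapping `PolynomialFeatures` and `LinearRegression` in one pipeline means the same feature transformation is applied at fit and predict time, which avoids transforming the training and test sets by hand.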

## Optional Extensions
- Explore the effect of different polynomial degrees on model performance.
- Experiment with other regression techniques, such as Ridge or Lasso regression, for comparison.
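
Both extensions can be explored together by sweeping the polynomial degree of a regularized model and comparing cross-validated errors. The sketch below again uses synthetic data; the cubic relationship, the degree range 1 to 6, and `alpha=1.0` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a cubic relationship (assumption, not the pumpkin data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 1))
y = X[:, 0] ** 3 - 2.0 * X[:, 0] + rng.normal(0, 1.0, size=150)

cv_mse = {}
for degree in range(1, 7):
    # Scale the polynomial features so the Ridge penalty treats them comparably.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    )
    # scikit-learn reports negated MSE so that larger is better; flip the sign back.
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[degree] = -neg_mse.mean()
    print(f"degree={degree}: cross-validated MSE={cv_mse[degree]:.2f}")

best_degree = min(cv_mse, key=cv_mse.get)
print("lowest cross-validated MSE at degree", best_degree)
```

Swapping `Ridge` for `Lasso` from `sklearn.linear_model` gives the L1-regularized variant with the same pipeline structure.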


In [1]:
# data analysis stack
import numpy as np
import pandas as pd

# machine-learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("new-york.csv")

In [3]:
df.head()

Unnamed: 0,Commodity Name,City Name,Type,Package,Variety,Sub Variety,Grade,Date,Low Price,High Price,...,Color,Environment,Unit of Sale,Quality,Condition,Appearance,Storage,Crop,Repack,Trans Mode
0,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,...,,,,,,,,,N,
1,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,...,,,,,,,,,N,
2,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,130,150,...,,,,,,,,,N,
3,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,130,150,...,,,,,,,,,N,
4,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,120,140,...,,,,,,,,,N,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 25 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Commodity Name   112 non-null    object 
 1   City Name        112 non-null    object 
 2   Type             0 non-null      float64
 3   Package          112 non-null    object 
 4   Variety          112 non-null    object 
 5   Sub Variety      18 non-null     object 
 6   Grade            0 non-null      float64
 7   Date             112 non-null    object 
 8   Low Price        112 non-null    int64  
 9   High Price       112 non-null    int64  
 10  Mostly Low       112 non-null    int64  
 11  Mostly High      112 non-null    int64  
 12  Origin           112 non-null    object 
 13  Origin District  15 non-null     object 
 14  Item Size        104 non-null    object 
 15  Color            21 non-null     object 
 16  Environment      0 non-null      float64
 17  Unit of Sale    

In [5]:
df.describe()

Unnamed: 0,Type,Grade,Low Price,High Price,Mostly Low,Mostly High,Environment,Quality,Condition,Appearance,Storage,Crop,Trans Mode
count,0.0,0.0,112.0,112.0,112.0,112.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,,,107.125,116.482143,107.3125,116.410714,,,,,,,
std,,,71.840931,77.454664,71.830008,77.545285,,,,,,,
min,,,15.0,16.0,15.0,16.0,,,,,,,
25%,,,18.0,20.0,18.0,18.0,,,,,,,
50%,,,130.0,140.0,130.0,140.0,,,,,,,
75%,,,150.0,170.0,150.0,170.0,,,,,,,
max,,,260.0,300.0,260.0,300.0,,,,,,,


In [6]:
missing_values = df.isna().sum()
print(missing_values)

Commodity Name       0
City Name            0
Type               112
Package              0
Variety              0
Sub Variety         94
Grade              112
Date                 0
Low Price            0
High Price           0
Mostly Low           0
Mostly High          0
Origin               0
Origin District     97
Item Size            8
Color               91
Environment        112
Unit of Sale        87
Quality            112
Condition          112
Appearance         112
Storage            112
Crop               112
Repack               0
Trans Mode         112
dtype: int64


In [7]:
# Drop every column that contains at least one missing value
df = df.dropna(axis=1)
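
Note that `dropna(axis=1)` removes any column with even a single missing value, so `Item Size` is discarded although only 8 of its 112 entries are missing. A gentler alternative, sketched here on a hypothetical toy frame, keeps columns that are mostly complete and fills the remaining gaps. The 70% threshold, the `"unknown"` placeholder, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern above (values are illustrative, not from the dataset).
toy = pd.DataFrame({
    "Variety": ["HOWDEN TYPE", "PIE TYPE", "PIE TYPE", "MINIATURE"],
    "Grade": [np.nan, np.nan, np.nan, np.nan],   # entirely missing
    "Item Size": ["lge", None, "med", "sml"],    # one gap
    "Low Price": [150, 130, 120, 15],
})

# Keep only columns in which at least 70% of the values are present.
min_non_null = int(np.ceil(0.7 * len(toy)))
kept = toy.dropna(axis=1, thresh=min_non_null)

# Fill the remaining gaps in the categorical column with an explicit placeholder.
kept = kept.fillna({"Item Size": "unknown"})
print(kept)
```

Whether a placeholder category, imputation, or dropping the affected rows is appropriate depends on how the column will be used as a feature.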


In [8]:
df.head()

Unnamed: 0,Commodity Name,City Name,Package,Variety,Date,Low Price,High Price,Mostly Low,Mostly High,Origin,Repack
0,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,150,170,150,170,MICHIGAN,N
1,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,150,170,150,170,MICHIGAN,N
2,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,130,150,130,150,NEW JERSEY,N
3,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,130,150,130,150,NEW JERSEY,N
4,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,120,140,120,140,NEW YORK,N


In [9]:
df.shape

(112, 11)

In [10]:
# One-hot encode the Origin column
df = pd.get_dummies(df, columns=["Origin"])

In [11]:
df.head()

Unnamed: 0,Commodity Name,City Name,Package,Variety,Date,Low Price,High Price,Mostly Low,Mostly High,Repack,Origin_MICHIGAN,Origin_NEW JERSEY,Origin_NEW YORK,Origin_OHIO,Origin_PENNSYLVANIA
0,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,150,170,150,170,N,True,False,False,False,False
1,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,150,170,150,170,N,True,False,False,False,False
2,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,130,150,130,150,N,False,True,False,False,False
3,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,130,150,130,150,N,False,True,False,False,False
4,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,120,140,120,140,N,False,False,True,False,False
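
The encoding above produces one indicator column per origin, five in total. For every row those indicators sum to one, so with an intercept in the model one of them is redundant. This does not prevent `LinearRegression` from fitting, but it makes the individual coefficients harder to interpret. A minimal sketch on a hypothetical toy frame shows the `drop_first` and `dtype` options of `pd.get_dummies`:

```python
import pandas as pd

# Toy frame with three origins (prices are illustrative).
toy = pd.DataFrame({
    "Origin": ["MICHIGAN", "NEW JERSEY", "NEW YORK"],
    "price": [160.0, 140.0, 130.0],
})

# drop_first=True omits the first level (MICHIGAN), which becomes the baseline;
# dtype=int emits 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["Origin"], drop_first=True, dtype=int)
print(encoded)
```

With this encoding, each remaining coefficient reads as a price difference relative to the omitted baseline origin.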


In [12]:
#print(df['price'])

In [13]:
# Target: midpoint of the low and high price
df["price"] = (df["Low Price"] + df["High Price"]) / 2
y = df['price']

In [14]:
print(df['price'])

0      160.0
1      160.0
2      140.0
3      140.0
4      130.0
       ...  
107     32.0
108     18.0
109     18.0
110     18.0
111     18.0
Name: price, Length: 112, dtype: float64


In [15]:
# import plotly.express as px

# for col in df.columns:
#     plt = px.histogram(df, x = col, color ="price", title=col + ' vs price')
#     plt.show()
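
For step 4 of the task, a scatter plot of one numeric feature against the price is usually more informative than a histogram colored by a continuous target. The sketch below uses matplotlib rather than plotly and a small hypothetical frame; the column choice and the values are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only; in the notebook this would be the real df.
toy = pd.DataFrame({
    "Mostly Low": [150, 130, 120, 32, 18, 18],
    "price": [160.0, 140.0, 130.0, 32.0, 18.0, 18.0],
})

# One point per row: feature on the x-axis, target on the y-axis.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(toy["Mostly Low"], toy["price"])
ax.set_xlabel("Mostly Low")
ax.set_ylabel("price")
ax.set_title("Mostly Low vs price")
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and the figure is displayed inline at the end of the cell.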

In [16]:
df.head()

Unnamed: 0,Commodity Name,City Name,Package,Variety,Date,Low Price,High Price,Mostly Low,Mostly High,Repack,Origin_MICHIGAN,Origin_NEW JERSEY,Origin_NEW YORK,Origin_OHIO,Origin_PENNSYLVANIA,price
0,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,150,170,150,170,N,True,False,False,False,False,160.0
1,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,150,170,150,170,N,True,False,False,False,False,160.0
2,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,130,150,130,150,N,False,True,False,False,False,140.0
3,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,130,150,130,150,N,False,True,False,False,False,140.0
4,PUMPKINS,NEW YORK,36 inch bins,HOWDEN TYPE,09/24/2016,120,140,120,140,N,False,False,True,False,False,130.0


In [17]:
# Drop the remaining non-numeric columns, which are not used as features
df = df.drop(columns=['Commodity Name', 'City Name', 'Package', 'Variety', 'Date', 'Repack'])

In [18]:
df.head()

Unnamed: 0,Low Price,High Price,Mostly Low,Mostly High,Origin_MICHIGAN,Origin_NEW JERSEY,Origin_NEW YORK,Origin_OHIO,Origin_PENNSYLVANIA,price
0,150,170,150,170,True,False,False,False,False,160.0
1,150,170,150,170,True,False,False,False,False,160.0
2,130,150,130,150,False,True,False,False,False,140.0
3,130,150,130,150,False,True,False,False,False,140.0
4,120,140,120,140,False,False,True,False,False,130.0
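
Dropping `Date` discards any seasonal signal in the prices. If that signal is of interest, the column could instead be parsed and turned into numeric features before it is removed. A minimal sketch on a hypothetical toy frame follows; the first date appears in the data above, while the other dates and the prices are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["09/24/2016", "10/01/2016", "10/29/2016"],
    "price": [160.0, 140.0, 18.0],  # illustrative values
})

# Parse the month/day/year strings, then derive numeric calendar features.
parsed = pd.to_datetime(toy["Date"], format="%m/%d/%Y")
toy["month"] = parsed.dt.month
toy["day_of_year"] = parsed.dt.dayofyear

# The raw string column is no longer needed once the features exist.
toy = toy.drop(columns=["Date"])
print(toy)
```

A feature such as `day_of_year` is also a natural candidate for the polynomial terms discussed in the task, since a seasonal price curve is rarely a straight line.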


In [19]:
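# Create the training and testing datasets
from sklearn.model_selection import train_test_split
# Caution: only the raw low/high prices are dropped here, so X still contains
# 'price' (the target itself) as well as 'Mostly Low' and 'Mostly High'.
X = df.drop(columns=['Low Price', 'High Price'])
y = df['price']               # Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)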

In [20]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

In [23]:
# Test the model and evaluate its performance.
from sklearn.metrics import mean_squared_error, r2_score
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f" linear regression MSE: {mse:.2f}, R2: {r2:.2f}")

 linear regression MSE: 0.00, R2: 1.00
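
An MSE of 0.00 with an R2 of 1.00 usually indicates that the features contain the target. That is the case here: `X` in cell `In [19]` is built by dropping only `Low Price` and `High Price`, so it still includes the `price` column, along with `Mostly Low` and `Mostly High`, which closely track the two dropped columns in the `head()` output. The effect can be reproduced on synthetic data; the feature names, coefficients, and noise level below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic frame: price depends on two features plus noise (assumption).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "origin_a": rng.integers(0, 2, size=n),
    "day_of_year": rng.integers(250, 310, size=n),
})
toy["price"] = (
    100 + 20 * toy["origin_a"] - 0.3 * toy["day_of_year"] + rng.normal(0, 10, size=n)
)


def holdout_r2(feature_columns):
    """Fit a linear regression on the given columns and return the test-set R2."""
    X = toy[feature_columns]
    y = toy["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Leaky: the target is one of the features, so the model can simply copy it.
r2_leaky = holdout_r2(["origin_a", "day_of_year", "price"])
# Clean: the target is excluded from the features.
r2_clean = holdout_r2(["origin_a", "day_of_year"])
print(f"with target in X: R2={r2_leaky:.2f}; without: R2={r2_clean:.2f}")
```

For the notebook, a leak-free feature matrix would exclude `price` and, arguably, `Mostly Low` and `Mostly High` as well, since they are price quotes rather than independent descriptors of the pumpkins.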
