# Predictive Analysis of Nike Product Sales Using Linear Regression Models

## Main Objective of the Analysis
The main objective of this analysis is prediction, specifically to forecast sales based on various product attributes. This will help understand the factors driving sales performance and enable better inventory and marketing strategies.

## Dataset Overview
The dataset (`nike_data_2022_09.csv`) contains 112 scraped Nike product listings with 17 columns, including product name, color, price, availability, average rating, and review count. It has no sales column, so `review_count` is used as a proxy for sales in the models.

### **Key Attributes:**
- `name`: Product name.
- `price`: Price of the product.
- `avg_rating`: Average rating given by customers (present for only 23 of the 112 products).
- `review_count`: Number of customer reviews (present for only 23 of the 112 products).
- `availability`: Whether the product is in stock.
- `color`: Available colors.

The key features considered for modeling were: `price`, `avg_rating`, `review_count`, `color`, and `availability`.

In [None]:
%pip install pandas scikit-learn

In [2]:
import pandas as pd
# Load the dataset
nike_data = "D:\\projects\\-IBM-Machine-Learning\\nike_data_2022_09.csv"
df = pd.read_csv(nike_data)
# Display basic information about the dataset 
print(df.info())
print(df.describe())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   url              112 non-null    object 
 1   name             112 non-null    object 
 2   sub_title        112 non-null    object 
 3   brand            112 non-null    object 
 4   model            112 non-null    int64  
 5   color            110 non-null    object 
 6   price            112 non-null    float64
 7   currency         112 non-null    object 
 8   availability     108 non-null    object 
 9   description      112 non-null    object 
 10  raw_description  112 non-null    object 
 11  avg_rating       23 non-null     float64
 12  review_count     23 non-null     float64
 13  images           108 non-null    object 
 14  available_sizes  56 non-null     object 
 15  uniq_id          112 non-null    object 
 16  scraped_at       112 non-null    object 
dtypes: float64(3), i

## Data Exploration
Understanding Distributions: Investigated the distributions of the numerical features (`price`, `avg_rating`, and `review_count`, the proxy for sales).

Correlation Analysis: Examined correlations between features to understand relationships.
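A minimal sketch of these two exploration steps follows. It assumes a small hypothetical frame named `sample` that stands in for the numeric columns of `df`; the values are invented for illustration and are not taken from the Nike data.

```python
import pandas as pd

# Hypothetical stand-in for df[['price', 'avg_rating', 'review_count']];
# the values are invented for illustration only.
sample = pd.DataFrame({
    "price": [25.0, 40.0, 60.0, 90.0, 120.0, 150.0],
    "avg_rating": [4.5, None, 4.8, None, 3.9, 4.2],
    "review_count": [12.0, None, 30.0, None, 4.0, 7.0],
})

# Distribution summary: count, mean, std, and quartiles per column
summary = sample.describe()
print(summary)

# Skewness hints at long tails, which are common for prices and review counts
print(sample.skew())

# Pairwise Pearson correlations; missing values are excluded pair by pair
corr = sample.corr()
print(corr)
```

With so many missing ratings in the real data, the `count` row of `describe()` is worth checking before reading anything into the correlations.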

## Data Cleaning
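Handling Missing Values: Filled missing `avg_rating` and `review_count` with 0 and missing `availability` with `'Unknown'`; dropped rows with missing `color` or `available_sizes`.

Outliers: Identified and treated outliers in `price` and in `review_count` (the proxy for sales).

Normalization: Standardized the numerical features (`price`, `avg_rating`, `review_count`) with `StandardScaler` so that they share a common scale.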
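The outlier and scaling steps could look roughly like the sketch below. It is only one possible approach (the IQR rule followed by clipping), shown on a hypothetical `prices` series with made-up values; the thresholds and the choice to clip rather than drop rows are assumptions, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical prices; 900.0 is an intentionally extreme value.
prices = pd.Series([25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 900.0], name="price")

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (prices < lower) | (prices > upper)
print(prices[is_outlier])

# One possible treatment: clip to the fences instead of dropping rows
clipped = prices.clip(lower=lower, upper=upper)

# Standardize to zero mean and unit variance (population std, ddof=0,
# which is the convention StandardScaler uses)
standardized = (clipped - clipped.mean()) / clipped.std(ddof=0)
print(standardized)
```

Clipping keeps every row, which matters for a dataset this small; dropping the flagged rows is the alternative when the extreme values are likely data errors.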

## Feature Engineering
Creating Interaction Terms: Combined features like Price and Rating to understand compounded effects.

Polynomial Features: Created polynomial features for Price to capture non-linear effects.

In [3]:
# Data Cleaning and Preprocessing
from sklearn.preprocessing import StandardScaler
# Drop columns not useful for modeling (URLs, descriptions, images, etc.)
cleaned_data = df.drop(columns=['url', 'description', 'raw_description', 'images', 'sub_title', 'uniq_id', 'scraped_at'])

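# Handle missing values
# Fill missing 'avg_rating' and 'review_count' with 0, assuming no reviews or ratings.
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which triggers a pandas FutureWarning and will stop working in pandas 3.0.
cleaned_data['avg_rating'] = cleaned_data['avg_rating'].fillna(0)
cleaned_data['review_count'] = cleaned_data['review_count'].fillna(0)

# Fill missing 'availability' with 'Unknown'
cleaned_data['availability'] = cleaned_data['availability'].fillna('Unknown')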

# Drop rows with missing 'color' and 'available_sizes' for simplicity in this analysis
cleaned_data.dropna(subset=['color', 'available_sizes'], inplace=True)

# Encode categorical columns 'availability' and 'color'
cleaned_data['availability'] = cleaned_data['availability'].astype('category').cat.codes
cleaned_data['color'] = cleaned_data['color'].astype('category').cat.codes

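# Scale the numerical columns 'price', 'avg_rating', and 'review_count'.
# Note: 'review_count' is later used as the target, so the RMSE values reported
# for the models are in standard-deviation units, not raw review counts.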
scaler = StandardScaler()
scaled_columns = ['price', 'avg_rating', 'review_count']
cleaned_data[scaled_columns] = scaler.fit_transform(cleaned_data[scaled_columns])

# Display the sanitized data
print(cleaned_data.isnull().sum())
print(cleaned_data.head())

name               0
brand              0
model              0
color              0
price              0
currency           0
availability       0
avg_rating         0
review_count       0
available_sizes    0
dtype: int64
                                      name brand     model  color     price  \
0  Nike Dri-FIT Team (MLB Minnesota Twins)  Nike  14226571     26 -0.809161   
1                             Club América  Nike  13814665      3  0.438128   
4    Paris Saint-Germain Repel Academy AWF  Nike  13327415     16 -0.060788   
5        NFL Miami Dolphins (Mike Gesicki)  Nike  14057953     41  1.435959   
7                    Nike College (Oregon)  Nike  13817332     41 -1.333771   

  currency  availability  avg_rating  review_count       available_sizes  
0      USD             0   -0.436256     -0.219495  S | M | L | XL | 2XL  
1      USD             0    2.437901     -0.159535             L (12–14)  
4      USD             0   -0.436256     -0.219495   XS | S | M | L | XL  
5 


## Model Training

### Simple Linear Regression (Baseline Model)

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Check and print the columns of the DataFrame
print(cleaned_data.columns)

# Use 'price' as the single predictor
X = cleaned_data[['price']]
y = cleaned_data['review_count']  # Using 'review_count' as a proxy for sales

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predictions
y_pred = lr.predict(X_test)
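# Take the square root of the MSE directly; the `squared=False` argument
# is deprecated in recent scikit-learn releases.
rmse_baseline = mean_squared_error(y_test, y_pred) ** 0.5
print(f"Baseline Model RMSE: {rmse_baseline}")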


Index(['name', 'brand', 'model', 'color', 'price', 'currency', 'availability',
       'avg_rating', 'review_count', 'available_sizes'],
      dtype='object')
Baseline Model RMSE: 1.987711689471113




### Polynomial Regression

In [7]:
from sklearn.preprocessing import PolynomialFeatures

# Creating polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Splitting the data
X_train_poly, X_test_poly, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Training the model
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train)

# Predictions
y_pred_poly = lr_poly.predict(X_test_poly)
# Square root of the MSE; avoids the deprecated `squared=False` argument
rmse_poly = mean_squared_error(y_test, y_pred_poly) ** 0.5
print(f"Polynomial Model RMSE: {rmse_poly}")


Polynomial Model RMSE: 1.9885609999469993




### Regularized Regression (Ridge)

In [8]:
from sklearn.linear_model import Ridge

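# Training the Ridge model on the original (non-polynomial) price feature
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Predictions
y_pred_ridge = ridge.predict(X_test)
# Square root of the MSE; avoids the deprecated `squared=False` argument
rmse_ridge = mean_squared_error(y_test, y_pred_ridge) ** 0.5
print(f"Ridge Regression RMSE: {rmse_ridge}")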


Ridge Regression RMSE: 1.9890046627476308




## Model Recommendations
Based on the test-set Root Mean Squared Error (RMSE) of each model, measured in standardized `review_count` units:
1. **Simple Linear Regression** (RMSE ≈ 1.9877) provides a basic baseline and has the lowest error of the three, by a very small margin.
2. **Polynomial Regression** (RMSE ≈ 1.9886) can capture non-linear relationships, but on this split it does not improve on the baseline.
3. **Ridge Regression** (RMSE ≈ 1.9890) helps reduce overfitting, but with a single feature there is little to regularize and it does not improve performance here.

**Recommendation**: Because the three RMSE values differ only in the third decimal place, the results do not show an accuracy gain from added complexity. I recommend keeping the **Simple Linear Regression** model as the working baseline and revisiting **Polynomial Regression** once additional features or more data give evidence of non-linear effects.
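Because the three test-set RMSE values above are very close and come from a single small test split, a k-fold cross-validation comparison may give a steadier picture. The sketch below uses synthetic data (`X_cv`, `y_cv`) as a stand-in for `price` and `review_count`; the data-generating formula, the fold count, and the `candidates` dictionary are all assumptions made for illustration. Putting `StandardScaler` inside each pipeline means the scaler is fit on the training folds only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for price -> review_count (not the Nike data)
rng = np.random.default_rng(42)
X_cv = rng.uniform(20, 200, size=(60, 1))
y_cv = 0.05 * X_cv[:, 0] + rng.normal(0, 2.0, size=60)

# The three model families compared in this notebook, each as a pipeline
candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "poly2": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse = {}
for label, model in candidates.items():
    # scikit-learn reports negated MSE so that higher scores are better
    neg_mse = cross_val_score(
        model, X_cv, y_cv, cv=cv, scoring="neg_mean_squared_error"
    )
    cv_rmse[label] = float(np.sqrt(-neg_mse).mean())
    print(f"{label}: mean RMSE over 5 folds = {cv_rmse[label]:.3f}")
```

If the fold-to-fold spread is larger than the gap between models, the simpler model is usually the safer choice.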

## Key Findings and Insights
1. **Impact of Price**: Price alone is a weak predictor of review count. Adding a squared price term did not lower the RMSE, so these models provide no evidence of a non-linear relationship between price and review count.
2. **Customer Reviews**: The relationship between `review_count` and `avg_rating` was not modeled here. Both columns are missing for 89 of the 112 products and were filled with 0, so any correlation between them should be interpreted with caution.
3. **Product Categories**: Although not explicitly analyzed here, incorporating product categories could provide deeper insights into which types of products attract more customer engagement.

## Next Steps
1. **Feature Addition**: Consider adding more features such as promotional activity indicators, seasonal trends, and competitive pricing to refine the model.
2. **Model Enhancement**: Experiment with other regularization techniques like Lasso and ElasticNet, which may further reduce overfitting.
3. **Extended Analysis**: Conduct a more detailed analysis of categorical variables (e.g., product categories, colors) and their interactions with numerical variables.
4. **Time-Series Analysis**: The dataset records only a scrape timestamp (`scraped_at`). If repeated snapshots over time become available, incorporate time-series analysis to understand temporal patterns and trends.
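For the Lasso and ElasticNet suggestion above, a minimal sketch on synthetic data is shown below. The arrays `X_reg` and `y_reg` and the `alpha` and `l1_ratio` values are assumptions chosen for illustration; in practice they would be tuned, for example with `LassoCV` or `ElasticNetCV`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data: only the first of three features carries signal
rng = np.random.default_rng(0)
X_reg = rng.normal(size=(80, 3))
y_reg = 2.0 * X_reg[:, 0] + rng.normal(scale=0.1, size=80)

# Lasso (L1 penalty) can shrink uninformative coefficients to exactly zero
lasso = Lasso(alpha=0.5).fit(X_reg, y_reg)

# ElasticNet mixes L1 and L2 penalties; l1_ratio sets the balance
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_reg, y_reg)

print("Lasso coefficients:     ", lasso.coef_)
print("ElasticNet coefficients:", enet.coef_)
```

Unlike Ridge, both models can act as a feature selector, which becomes useful once more candidate features are added to the single `price` predictor.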
