# Fill-in-the-Blank Exercise: Data Splitting & Preprocessing

In this exercise, complete the blanks (_____). Your task is to:

1. Import the necessary libraries.
2. Simulate sample ice cream sales data.
3. Split the data into training and test sets.
4. Scale the `temperature` feature with `StandardScaler`: use `fit_transform` on the training set and `transform` on the test set.

Fill in the blanks where indicated.

In [2]:
# Fill in the blanks to import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datetime import datetime, timedelta
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print("Libraries imported!")

Libraries imported!


In [3]:
# Fill in the blanks to simulate 90 days of ice cream sales data
np.random.seed(42)
n_days = 90
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_days)]  # Hint: use n_days

temperatures = np.random.normal(loc=25, scale=3, size=n_days).round(1)  # Use n_days
promotions = np.random.choice([0, 1], size=n_days, p=[0.7, 0.3])  # Use n_days

# Create sales using the formula: 300 + 12 * temperature + 60 * promotion + noise
sales = 300 + 12 * temperatures + 60 * promotions + np.random.normal(0, 20, size=n_days)

df = pd.DataFrame({
    'date': dates,
    'temperature': temperatures,
    'promotion': promotions,
    'sales': sales.round().astype(int)
})

print(df.head())
print("Data simulation complete!")

        date  temperature  promotion  sales
0 2024-01-01         26.5          1    715
1 2024-01-02         24.6          0    605
2 2024-01-03         26.9          1    659
3 2024-01-04         29.6          1    728
4 2024-01-05         24.3          0    572
Data simulation complete!


In [4]:
# Fill in the blanks to split data and apply StandardScaler
# Split the data into features (X) and target (y)
X = df[['temperature', 'promotion']]
y = df['sales']

# Split data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize StandardScaler
scaler = StandardScaler()

# Scale the 'temperature' column on the training set (use fit_transform) and test set (use transform)
X_train_scaled = X_train.copy()
X_train_scaled['temperature'] = scaler.fit_transform(X_train[['temperature']])

X_test_scaled = X_test.copy()
X_test_scaled['temperature'] = scaler.transform(X_test[['temperature']])

print("Data splitting and scaling complete!")
print(X_train_scaled.head())

Data splitting and scaling complete!
    temperature  promotion
49    -1.732071          0
62    -1.036150          0
73     1.747536          1
69    -0.549005          0
76     0.216509          0
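
To see what `fit_transform` actually learned, it can help to inspect the fitted scaler. The cell below is a minimal sketch, not part of the original exercise: it rebuilds only the `temperature` column with the same seed and parameters as the simulation cell, then prints the split sizes and the statistics stored in `scaler.mean_` and `scaler.scale_`. The names `train_scaled` and `test_scaled` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild only the temperature feature (same seed and parameters as the simulation cell)
np.random.seed(42)
n_days = 90
temperatures = np.random.normal(loc=25, scale=3, size=n_days).round(1)
X = pd.DataFrame({'temperature': temperatures})

# Same split settings as the exercise: 80% training, 20% test
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit on the training rows only, then reuse the learned statistics on the test rows
scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train)
test_scaled = scaler.transform(X_test)

print("Train rows:", len(X_train), "| Test rows:", len(X_test))
print("Mean learned from training data:", scaler.mean_[0])
print("Std learned from training data: ", scaler.scale_[0])
print("Mean of scaled training data:", train_scaled.mean())
print("Mean of scaled test data:    ", test_scaled.mean())
```

The scaled training column has mean 0 and standard deviation 1 by construction. The scaled test column usually does not, because it is shifted and scaled with the training statistics rather than its own. That is expected and is not a sign of a mistake.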


## Reflection

In your own words, explain why it is important to use `fit_transform` on the training data and only `transform` on the test data.

In [5]:
# Write your reflection below (feel free to use a comment or print statement)
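print("Reflection: fit_transform learns the mean and standard deviation from the training data only; transform reuses those statistics on the test data, so no test-set information leaks into preprocessing.")

Reflection: fit_transform learns the mean and standard deviation from the training data only; transform reuses those statistics on the test data, so no test-set information leaks into preprocessing.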
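
One way to make the leakage concrete is to compare a scaler fitted on the training rows alone with one fitted on all 90 rows. The sketch below assumes the same seed, temperature parameters, and split settings as the exercise. The names `correct` and `leaky` are hypothetical labels used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the temperature column as a 2-D array (StandardScaler expects 2-D input)
np.random.seed(42)
temperatures = np.random.normal(loc=25, scale=3, size=90).round(1).reshape(-1, 1)

temp_train, temp_test = train_test_split(temperatures, test_size=0.2, random_state=42)

# Correct: statistics come from the training rows only
correct = StandardScaler().fit(temp_train)

# Leaky: statistics are computed over training AND test rows
leaky = StandardScaler().fit(temperatures)

print("Train-only fit -> mean:", correct.mean_[0], "std:", correct.scale_[0])
print("Full-data fit  -> mean:", leaky.mean_[0], "std:", leaky.scale_[0])
```

The two sets of statistics will generally differ. Any difference comes entirely from the test rows, which means a scaler fitted on the full data has already "seen" the data that is supposed to stay hidden until evaluation.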
