## Train/Test split
To detect overfitting, it's common practice in machine learning to split data into train and test datasets. Scoring the model on the held-out test set shows how well it predicts new, unseen data.

Since we're working with time-series data, we cannot use random split methods: they would place observations from the future in the training set, giving the model information it would not have at prediction time.

A function to print the start and end of a DataFrame is available as show_start_end(), which takes a DataFrame as the only argument, and returns a string.

The data is available as environment.

In [None]:
# Define the split day
limit_day = "2018-10-27"

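# Split the data chronologically. Label slices include both endpoints,
# so boolean masks keep the limit day out of the training set
train_env = environment.loc[environment.index < limit_day]
test_env = environment.loc[environment.index >= limit_day]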

# Print start and end dates
print(show_start_end(train_env))
print(show_start_end(test_env))

# Split the data into X and y
X_train = train_env.drop('target', axis=1)
y_train = train_env['target']
X_test = test_env.drop('target', axis=1)
y_test = test_env['target']
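
The chronological split described above can be checked on synthetic data. The sketch below is only an illustration under stated assumptions: the `readings` DataFrame and its `temperature` column are made up and stand in for the course data. It also shows why the boundary deserves attention: pandas label slices on a DatetimeIndex include both endpoints, so slicing up to and from the same day puts that day in both sets.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings over ten days (hypothetical data, not the course dataset)
index = pd.date_range("2018-10-20", periods=240, freq="60min")
readings = pd.DataFrame(
    {"temperature": np.linspace(5.0, 20.0, len(index))},
    index=index,
)

limit_day = pd.Timestamp("2018-10-27")

# Chronological split: rows before the limit day train, the rest test
train = readings.loc[readings.index < limit_day]
test = readings.loc[readings.index >= limit_day]

# Label slices include both endpoints, so the limit day lands in both pieces
overlap = readings.loc[:"2018-10-27"].index.intersection(
    readings.loc["2018-10-27":].index
)

print("last training timestamp:", train.index.max())
print("first test timestamp:   ", test.index.min())
print("rows shared by the two label slices:", len(overlap))
```

With the boolean masks, every training timestamp is strictly earlier than every test timestamp, which is the property a time-series split needs.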

## Logistic Regression
Using the data from the previous exercise, you'll now train a machine learning model.

In line with best practices, the data is now available as X_train, while the labels have been loaded as y_train. A subset of the data is also available as X_test. You'll learn later in this chapter how to properly create these variables.

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Initialize the model
logreg = LogisticRegression()

# Fit the model
logreg.fit(X_train, y_train)

# Predict classes
print(logreg.predict(X_test))
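
To see what fit() and predict() do without the course data, here is a minimal sketch on synthetic values. The single `temperature` feature and the 15-degree labelling rule are assumptions made purely for illustration. It also calls predict_proba(), which returns the class probabilities behind each predicted class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical single feature (temperature); label 1 when it is warm
temperature = rng.uniform(0.0, 30.0, size=200).reshape(-1, 1)
labels = (temperature.ravel() > 15.0).astype(int)

# Fit the classifier on the synthetic readings
model = LogisticRegression()
model.fit(temperature, labels)

# Classify one clearly cold and one clearly warm reading
new_readings = np.array([[2.0], [28.0]])
classes = model.predict(new_readings)
probabilities = model.predict_proba(new_readings)

print(classes)
print(probabilities.round(3))
```

Each row of `probabilities` sums to 1, and the predicted class is the column with the larger probability.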

## Model performance
You're now going to evaluate the model from the previous lesson against the test data.

Evaluating the model against new, unseen data is important, as it indicates how well the model can estimate data it has never encountered before.

All necessary modules have been imported. The training data is available as X_train and y_train, and the test data as X_test and y_test.

In [None]:
# Create LogisticRegression model
logreg = LogisticRegression()

# Fit the model
logreg.fit(X_train, y_train)

# Score the model on the training and test data
print(logreg.score(X_train, y_train))
print(logreg.score(X_test, y_test))
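
For a classifier such as LogisticRegression, score() returns the mean accuracy, that is, the fraction of correctly predicted labels. The sketch below illustrates this on synthetic data (the features and the noisy labelling rule are assumptions, not the course dataset) by comparing score() with accuracy_score() from sklearn.metrics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

# Synthetic features and noisy binary labels (hypothetical data)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Ordered split: the first 200 rows train the model, the last 100 test it
X_tr, X_te = X[:200], X[200:]
y_tr, y_te = y[:200], y[200:]

clf = LogisticRegression().fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)

# The same test accuracy, computed from the predictions directly
manual_acc = accuracy_score(y_te, clf.predict(X_te))

print(train_acc, test_acc, manual_acc)
```

A training score that is much higher than the test score is a typical sign of overfitting.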

## Scaling
Before applying a machine learning algorithm, one of the most common operations applied to the data is scaling.

Scaling data helps the algorithm converge faster. It also keeps features with large numeric ranges from dominating features with small ones.

You'll now create and inspect a standard scaler object.

The data is available as environment.

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
sc = StandardScaler()

# Fit the scaler
sc.fit(environment)

# Transform the data
environ_scaled = sc.transform(environment)

# Convert scaled data to DataFrame
environ_scaled = pd.DataFrame(environ_scaled,
                              columns=environment.columns,
                              index=environment.index)
print(environ_scaled.head())
plot_unscaled_scaled(environment, environ_scaled)
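
To inspect what a fitted StandardScaler has learned, you can look at its mean_ and scale_ attributes. The following sketch uses a tiny, made-up DataFrame (the `temperature` and `pressure` columns and their values are assumptions for illustration) to show that each column ends up with mean 0 and unit standard deviation, regardless of its original range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
raw = pd.DataFrame({
    "temperature": [10.0, 12.0, 14.0, 16.0],
    "pressure": [990.0, 1000.0, 1010.0, 1020.0],
})

scaler = StandardScaler().fit(raw)

# Per-column mean and standard deviation learned during fit()
print(scaler.mean_)
print(scaler.scale_)

# transform() subtracts the mean and divides by the standard deviation
scaled = pd.DataFrame(scaler.transform(raw),
                      columns=raw.columns,
                      index=raw.index)
print(scaled)
```

StandardScaler uses the population standard deviation (ddof=0), which is why the check below passes `ddof=0` to pandas.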

## Creating Pipelines
You'll now use one of the best features scikit-learn has to offer, Pipelines. Pipelines allow you to chain multiple actions, like transformations and estimations, which are applied sequentially to new data.

You'll now create a pipeline containing both a StandardScaler and a LogisticRegression estimator.

This allows you to pass unscaled data to the pipeline, where the StandardScaler will scale the data, and the LogisticRegression will predict the target column.

The unscaled data is available as X_train, while the labels have been loaded as y_train. A subset of the data, X_test, is also available to evaluate the model.

StandardScaler and LogisticRegression have been imported for you.

In [None]:
# Import pipeline
from sklearn.pipeline import Pipeline

# Create Scaler and Regression objects
sc = StandardScaler()
logreg = LogisticRegression()

# Create Pipeline
pl = Pipeline([
    ("scale", sc),
    ("logreg", logreg),
])

# Fit the pipeline and print predictions
pl.fit(X_train, y_train)
print(pl.predict(X_test))
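
To make the chaining concrete, the sketch below compares a Pipeline with the equivalent manual steps: fit the scaler, transform the data, fit the model on the scaled data, then scale new data before predicting. The synthetic features and labelling rule are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical unscaled features: temperature-like and pressure-like columns
X_tr = rng.normal(loc=[15.0, 1000.0], scale=[5.0, 20.0], size=(200, 2))
y_tr = (X_tr[:, 0] > 15.0).astype(int)
X_new = rng.normal(loc=[15.0, 1000.0], scale=[5.0, 20.0], size=(50, 2))

# Pipeline: scaling and classification in one object
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_new)

# The same steps done by hand
manual_scaler = StandardScaler().fit(X_tr)
manual_model = LogisticRegression().fit(manual_scaler.transform(X_tr), y_tr)
manual_pred = manual_model.predict(manual_scaler.transform(X_new))

print((pipe_pred == manual_pred).all())
```

The pipeline fits the scaler on the training data only and reuses those statistics for new data, so you cannot forget to scale at prediction time.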


## Store Pipeline
You'll now create the Pipeline again, but directly: instead of first assigning the StandardScaler and LogisticRegression to variables, you will initialize them as part of the Pipeline creation.

You'll then store the model for further use.

The data is available as X_train, with the labels as y_train.

StandardScaler, LogisticRegression and Pipeline have been imported for you.

In [None]:
# Create Pipeline
pl = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression()),
])

# Fit the pipeline
pl.fit(X_train, y_train)

# Store the model (pickle and pathlib are standard-library modules)
import pickle
from pathlib import Path

with Path("pipeline.pkl").open("wb") as f:
    pickle.dump(pl, f)

# Load the pipeline
with Path("pipeline.pkl").open("rb") as f:
    pl_loaded = pickle.load(f)

print(pl_loaded)
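
A quick way to confirm that a stored pipeline behaves like the original is to compare their predictions after a round trip through pickle. The sketch below does this with synthetic data and a temporary directory, both of which are assumptions made so that the example leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical training data
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

original = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression()),
]).fit(X, y)

# Write the fitted pipeline to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "pipeline.pkl"
    with model_path.open("wb") as f:
        pickle.dump(original, f)
    with model_path.open("rb") as f:
        restored = pickle.load(f)

# The restored pipeline should predict exactly like the original
same_predictions = np.array_equal(original.predict(X), restored.predict(X))
print(same_predictions)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. A pickled model is also best loaded with the same scikit-learn version that created it.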



## Apply model to data stream
Let's now apply your trained machine learning Pipeline to streaming data, and categorize the values immediately.

You'll then use predict() on the incoming messages to determine the category. Based on the result of the prediction you will take action, and close the windows in your house (or not).

Remember that category 1 means good weather, whereas category 0 signifies bad, cold weather.

Additionally, the pipeline returns an array of predictions. As you passed in only one element, you need to access the first element using category[0].

The function close_window() will handle this for you, and will additionally log the record for further study.

pandas as pd and json have been preloaded in the session for you, and the model is available as pl.

In [None]:
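def model_subscribe(client, userdata, message):
    # Decode the JSON payload of the incoming MQTT message
    data = json.loads(message.payload)
    # Parse to a single-row DataFrame indexed by timestamp
    df = pd.DataFrame.from_records([data], index='timestamp', columns=cols)
    # Predict result; predict() returns an array with one element per row
    category = pl.predict(df)
    if category[0] < 1:
        # Category 0 (bad weather): call business logic
        close_window(df, category)
    else:
        print("Nice Weather, nothing to do.")

# Subscribe model_subscribe to the MQTT topic
# (cols, topic, MQTT_HOST and close_window() are assumed to be defined in the session)
subscribe.callback(model_subscribe, topic, hostname=MQTT_HOST)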
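
The message-handling logic can be tried without an MQTT broker by feeding JSON strings to a plain function. The sketch below is a simplified stand-in under stated assumptions: the `temperature` and `humidity` columns, the training data and the `categorize()` helper are hypothetical, and the DataFrame is built with pd.DataFrame() plus set_index() rather than from_records().

```python
import json

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature columns; the course's cols list may differ
feature_cols = ["temperature", "humidity"]

rng = np.random.default_rng(4)
train = pd.DataFrame({
    "temperature": rng.uniform(0.0, 30.0, 200),
    "humidity": rng.uniform(30.0, 90.0, 200),
})
# Category 1 means good (warm) weather, category 0 bad, cold weather
target = (train["temperature"] > 15.0).astype(int)

model = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression()),
]).fit(train, target)


def categorize(payload):
    """Parse one JSON message and return its predicted category (0 or 1)."""
    data = json.loads(payload)
    # One message becomes a one-row DataFrame indexed by its timestamp
    df = pd.DataFrame([data]).set_index("timestamp")[feature_cols]
    # predict() returns an array, so take the first (and only) element
    category = model.predict(df)
    return int(category[0])


cold = json.dumps({"timestamp": "2018-11-01 08:00:00",
                   "temperature": 3.0, "humidity": 80.0})
warm = json.dumps({"timestamp": "2018-11-01 14:00:00",
                   "temperature": 27.0, "humidity": 45.0})

print(categorize(cold), categorize(warm))
```

Keeping the parsing and prediction in a separate function like this makes the logic easy to test before wiring it into an MQTT callback.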