Using Python for Research
Final

Jason Padgett
10/06/2024

The project: predict the type of movement in data recorded by a 'count your steps' app, using a second data set in which the movements are already labeled.

During the class, I was able to use the tools presented to find homework answers, but felt that I didn't yet understand what those tools were doing in the background, and what some of the returned numbers were representing. So, my first attempt at this project was using brute force. I wanted to avoid using the tools, do the math myself, and teach myself where those numbers were coming from. I completed working code, but the results, when plugged into the test form, only rated an accuracy of about 26%, which is no better than random chance. So, run number 1 was scrapped.

Final v.2 was attempted using LogisticRegression and RandomForestClassifier. I used some plotting code from class 5-2 to render a comparison of the accuracy of each tool and found that RandomForestClassifier was yielding better accuracy. So, with a clean sheet of paper, I wrote new code based on using RandomForestClassifier.
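A minimal sketch of that kind of comparison, using cross_val_score instead of the class plotting code. The data here is synthetic (make_classification), with three features standing in for x, y, z. The four classes are an assumption, suggested only by the 26% chance-level figure above. The scores it prints say nothing about the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 3 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
cv_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing mean cross-validated accuracy, rather than a single train/test split, makes the comparison less dependent on which rows happen to land in the test set.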

Initializing the code, I state my imports, set the location of the data I am going to use, start my timer, and grab the training and test data from external files:


In [1]:
import datetime
import os

import pandas as pd

from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings("ignore")

data_dir = "./data"

# start process timer
start_time = datetime.datetime.now()

# Load data from files - training data
df_train_time_series = pd.read_csv(os.path.join(data_dir, "train_time_series.csv"))
df_train_labels = pd.read_csv(os.path.join(data_dir, "train_labels.csv"))

# Load data from files - test data
df_test_time_series = pd.read_csv(os.path.join(data_dir, "test_time_series.csv"))
df_test_labels = pd.read_csv(os.path.join(data_dir, "test_labels_input.csv"))

Having grabbed the data, I want to set that data up in tables which will give me easier access during the classification process:


In [2]:
def set_table(table_l, table_ts):
    """Formatting data collected from external files, and forming a new table with the data I will use during the training / prediction process"""
    table = []
    for i in range(len(table_l)):
        for j in range(len(table_ts)):
            if table_l.timestamp.iloc[i] == table_ts.timestamp.iloc[j]:
                table.append([
                    table_ts.x.iloc[j],
                    table_ts.y.iloc[j],
                    table_ts.z.iloc[j],
                    table_l.label.iloc[i]
                ])
    return table


train_table = pd.DataFrame(set_table(df_train_labels, df_train_time_series), columns=['x', 'y', 'z', 'label'])
test_table = pd.DataFrame(set_table(df_test_labels, df_test_time_series), columns=['x', 'y', 'z', 'label'])
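The nested loops in set_table compare every label row against every time-series row. A possible alternative is a pandas merge on timestamp, sketched here on two tiny hypothetical frames (labels_demo and series_demo are made up for illustration and only carry the columns that set_table reads). With unique timestamps this should produce the same rows in the same order.

```python
import pandas as pd

# Hypothetical stand-ins for the label and time-series tables.
labels_demo = pd.DataFrame({"timestamp": [10, 30], "label": [1, 2]})
series_demo = pd.DataFrame({
    "timestamp": [10, 20, 30],
    "x": [0.1, 0.2, 0.3],
    "y": [1.1, 1.2, 1.3],
    "z": [2.1, 2.2, 2.3],
})

# Inner join: keep only timestamps present in both tables,
# then keep the columns used for training, in the same order as set_table.
merged_demo = labels_demo[["timestamp", "label"]].merge(
    series_demo[["timestamp", "x", "y", "z"]], on="timestamp", how="inner"
)[["x", "y", "z", "label"]]

print(merged_demo)
```

Selecting the needed columns before merging avoids suffixed duplicates if both real tables share other column names.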

Using the tables created, next I set up the variables I will use:


In [3]:
classification_target = 'label'
all_covariates = ['x', 'y', 'z']

X, y = train_table[all_covariates], train_table[classification_target]

The classification process:
I struggled to get results with decent accuracy, so I started researching. I found RandomizedSearchCV, which samples multiple combinations of hyperparameters, fits a RandomForestClassifier for each combination, scores each one with cross-validation, and keeps the best-scoring combination. This greatly increased my run time, but my accuracy jumped up.

I set param_dist with a range of n_estimators and a range of max_depth values. I then create a new RandomForestClassifier instance, rf, and pass it to RandomizedSearchCV along with param_dist. I also set n_iter=10 (the number of hyperparameter combinations to sample) and cv=5 (the number of cross-validation folds to use), and save the search object in the variable rand_search.
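To see what a search over these distributions might actually try, the sketch below draws ten candidates with scikit-learn's ParameterSampler, which to my understanding is how RandomizedSearchCV picks its candidates. The name param_dist_demo and the fixed random_state are only there to make the sketch repeatable; they are not part of the project code.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same ranges as param_dist. Note that scipy's randint(low, high)
# draws integers from low up to, but not including, high.
param_dist_demo = {
    "n_estimators": randint(100, 375),
    "max_depth": randint(5, 20),
}

# Draw 10 candidate settings, one per n_iter.
candidates = list(ParameterSampler(param_dist_demo, n_iter=10, random_state=0))
for params in candidates:
    print(params)
```

Each printed dictionary is one hyperparameter combination. With cv=5, every combination is fit five times before the final refit.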

Next, I call rand_search.fit(X, y), which runs the search. The random forest refit with the best-performing hyperparameters is then available as best_estimator_, and I save it in best_rf.

For reference, I print to the console the best parameters that RandomizedSearchCV found.
Using best_rf, I make my predictions and store them in answers.

In [4]:
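# Distributions to sample from; randint(low, high) excludes high
param_dist = {'n_estimators': randint(100, 375),
              'max_depth': randint(5, 20)}

rf = RandomForestClassifier()

# Sample 10 hyperparameter combinations, each scored with 5-fold cross-validation
rand_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=10, cv=5)

rand_search.fit(X, y)

# Forest refit on all training data with the best combination found
best_rf = rand_search.best_estimator_

print('Best hyperparameters:', rand_search.best_params_)

answers = best_rf.predict(test_table[all_covariates])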

Best hyperparameters: {'max_depth': 6, 'n_estimators': 330}
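Beyond best_params_, a fitted RandomizedSearchCV also exposes cv_results_ and best_score_, which show how every sampled combination scored. The sketch below is an illustration on synthetic data with a deliberately small search so that it runs quickly. X_demo, y_demo, search_demo and the parameter ranges are all assumptions, not the project's settings.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): 3 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

search_demo = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(10, 40), "max_depth": randint(2, 8)},
    n_iter=4, cv=3, random_state=0,
)
search_demo.fit(X_demo, y_demo)

# One row per sampled combination, best-ranked first.
summary = pd.DataFrame(search_demo.cv_results_)[
    ["param_n_estimators", "param_max_depth", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary)
print("Best mean CV accuracy:", search_demo.best_score_)
```

The best_score_ value is a cross-validated estimate on the training data, so it gives a rough idea of expected accuracy before anything is submitted to the test form.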


Using my list of answers, I create a table that mimics the initial df_test_labels, with the answers included, and save that table to test_labels.csv.

Finally, I stop my runtime timer and display the elapsed time in the console.

In [5]:
results = []
for i in range(len(answers)):
    results.append([
        df_test_labels.orig_index.iloc[i],
        df_test_labels.timestamp.iloc[i],
        df_test_labels['UTC time'].iloc[i],
        answers[i]
    ])

results = pd.DataFrame(results, columns=['', 'timestamp', 'UTC time', 'label'])

results.to_csv(os.path.join(data_dir, "test_labels.csv"), sep=",", index=False)

end_time = datetime.datetime.now()

# print elapsed time to console (total_seconds() keeps the minutes if a run exceeds 60 s)
print(f"Run time: {(end_time - start_time).total_seconds():.6f} s.ms")

Run time: 42.696559 s.ms
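The results table could also be built without a loop. This is a sketch only: labels_in_demo and answers_demo are hypothetical stand-ins for the test label table and the predictions, and it assumes, like the loop above, that the predictions line up row for row with the label rows.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the test label table and the predictions.
labels_in_demo = pd.DataFrame({
    "timestamp": [100, 200, 300],
    "UTC time": ["t0", "t1", "t2"],
    "label": [np.nan, np.nan, np.nan],
})
answers_demo = np.array([2, 2, 1])

# assign() returns a copy with the label column replaced by the predictions.
results_demo = labels_in_demo.assign(label=answers_demo)

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
results_demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Because assign() raises an error when the lengths differ, a mismatch between the number of predictions and the number of label rows would show up immediately.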


Changing n_iter changes the number of hyperparameter combinations that RandomizedSearchCV samples and compares. A higher number gives the search more chances to find a good combination, so accuracy tends to improve, but the runtime grows with it. I chose n_iter=10 because it improves the accuracy and keeps my runtime below 1 minute.
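The trade-off between n_iter and runtime can be measured directly. The sketch below times a small search on synthetic data for a few n_iter values. The data, the parameter ranges, and the n_iter values are assumptions chosen so it runs in seconds, and the numbers it prints will not match the project's.

```python
import time

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): 3 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

timings = {}
for n_iter in (2, 4, 8):
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={"n_estimators": randint(10, 40), "max_depth": randint(2, 8)},
        n_iter=n_iter, cv=3, random_state=0,
    )
    started = time.perf_counter()
    search.fit(X_demo, y_demo)
    elapsed = time.perf_counter() - started
    # Record (seconds taken, best mean cross-validated accuracy).
    timings[n_iter] = (elapsed, search.best_score_)
    print(f"n_iter={n_iter}: {elapsed:.2f} s, best CV accuracy {search.best_score_:.3f}")
```

Each extra candidate adds cv more model fits, so runtime rises roughly in proportion to n_iter, while the gain in accuracy depends on whether the extra draws happen to land on a better combination.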