# Introduction

### This project uses a dataset of tri-axial smartphone accelerometer data. The goal is to solve the classification problem of identifying the exercise type from the accelerometer readings. The project steps are as follows:

    1. Import the necessary libraries
    2. Load the data from the csv files
    3. Preprocess the data to prepare it for the ML algorithm
    4. Train the model
    5. Make predictions with the trained model
    6. Save the result as a new csv file

***

# Methods

### The datasets were checked for missing values, and the 'timestamp' and 'UTC time' columns were converted to datetime format, which is more convenient to work with in Python. After that, the 'UTC time' column was dropped as unnecessary (only the 'x', 'y' and 'z' accelerometer data were used for training). The formatted time-series data were then merged with the labels so that only labeled rows were kept for training. As the last preprocessing step, the data were split into training and testing sets and into features (X) and labels (y).
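
As a minimal sketch of the missing-value check and the label merge described here, the snippet below uses small synthetic frames. The names `readings` and `labels` and all values are made up for illustration and are not the project's data; only the column names follow the project.

```python
import pandas as pd

# Synthetic stand-ins for the accelerometer readings and the labels (illustrative values only).
readings = pd.DataFrame({
    "timestamp": [1000, 1100, 1200, 1300],
    "x": [0.1, 0.2, 0.3, 0.4],
    "y": [0.0, -0.1, 0.2, 0.1],
    "z": [0.9, 1.0, 1.1, 0.8],
})
labels = pd.DataFrame({"timestamp": [1100, 1300], "label": [1, 2]})

# Count missing values per column; all zeros means nothing has to be imputed or dropped.
missing_per_column = readings.isna().sum()

# An inner merge keeps only the readings whose timestamp also appears in the labels.
merged = readings.merge(labels, how="inner", on="timestamp")

# Split the merged frame into features and target.
X = merged[["x", "y", "z"]]
y = merged["label"]

print(missing_per_column.to_dict())
print(merged)
```

Because the merge is an inner join, readings without a matching label are silently dropped, so it is worth comparing `len(merged)` with `len(labels)` to confirm that every label found its reading.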

***
# Results

### The models showed low accuracy (below 50%). Increasing the amount of input data is therefore suggested: with more training data, the model should be able to perform better. Cross-validation may also help reach higher accuracy. The best running time for a single classification model is around 0.24 seconds.
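
A minimal sketch of how cross-validation could be applied with scikit-learn's `cross_val_score` is given below. The data are random placeholders (`X_demo`, `y_demo` and the four class labels are assumptions, not the project's dataset), so the scores only illustrate the mechanics. Cross-validation by itself gives a more stable accuracy estimate; any improvement would come from using that estimate to compare models or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random placeholder data: 200 samples with three features standing in for x, y, z.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.integers(1, 5, size=200)  # assumed class labels 1-4, for illustration only

model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: train on four folds, score on the held-out fold, repeat five times.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

With the real data, `X_demo` and `y_demo` would be replaced by the training features and labels, and the mean score could be compared across the five classifiers.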

***
# Conclusion

### This project is a great chance to practice ML skills, although the datasets are quite small and leave little room for applying EDA. The results are not good enough; however, it is clear how the ML approach could be improved.

***

In [87]:
# Import the libraries. All classifier models are imported in one place.

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import time

In [91]:
# Start timeclock 
start_time = time.time()

# Read all csv files that we need for the project
train_data = pd.read_csv('D:\\Work_folder\\Harvard_1\\train_time_series.csv')
train_labels = pd.read_csv("D:\\Work_folder\\Harvard_1\\train_labels.csv")
test_data = pd.read_csv('D:\\Work_folder\\Harvard_1\\test_time_series.csv')
test_labels = pd.read_csv("D:\\Work_folder\\Harvard_1\\test_labels.csv")

# Drop the "UTC time" column; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
train_data = train_data[['timestamp', 'x', 'y', 'z']].copy()
test_data = test_data[['timestamp', 'x', 'y', 'z']].copy()
train_labels = train_labels[['timestamp', 'label']].copy()
test_labels = test_labels[['timestamp', 'label']].copy()

# Change the datatype of 'timestamp' column 
train_data['timestamp'] = pd.to_datetime(train_data['timestamp'])
test_data['timestamp'] = pd.to_datetime(test_data['timestamp'])
train_labels['timestamp'] = pd.to_datetime(train_labels['timestamp'])
test_labels['timestamp'] = pd.to_datetime(test_labels['timestamp'])

# Merge the datasets to keep only the rows with labeled data, using the 'timestamp' column as the merge key
merged_train_data = train_data.merge(train_labels, how='inner', on='timestamp')
merged_test_data = test_data.merge(test_labels, how='inner', on='timestamp')

# Create data for training and testing
X_train = merged_train_data[['x', 'y', 'z']]
y_train = merged_train_data['label']
X_test = merged_test_data[['x', 'y', 'z']]

################# RANDOM FOREST CLASSIFIER ######################

# creating an RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred_rf = clf.predict(X_test)

##################################################################

################# GRADIENT BOOSTING CLASSIFIER ###################

# creating a GB classifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0)  
 
# Training the model on the training dataset
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred_gb = clf.predict(X_test)

##################################################################

################# ADA BOOST CLASSIFIER ###########################

# creating an AB classifier
clf = AdaBoostClassifier(n_estimators=100, algorithm="SAMME", random_state=0) 
 
# Training the model on the training dataset
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred_ab = clf.predict(X_test)

##################################################################

################# LOGISTIC REGRESSION ############################

# creating an LR classifier
clf = LogisticRegression(random_state=0)  
 
# Training the model on the training dataset
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred_lr = clf.predict(X_test)

##################################################################

################# SUPPORT VECTOR CLASSIFIER ######################

# creating an SV classifier
clf = SVC(gamma='auto')  
 
# Training the model on the training dataset
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred_sv = clf.predict(X_test)

##################################################################

# Build the output dataframe. The Random Forest predictions are saved, as that classifier showed the best results.
# The predictions are attached to the merged test rows they were computed from, so rows cannot be misaligned.
test_label = merged_test_data[['timestamp']].copy()
test_label['label'] = y_pred_rf

# save the dataframe as a csv file 
test_label.to_csv("D:\\Work_folder\\Harvard_1\\predictions_rf.csv")
# The reported time was measured with only the RF classifier running, as it showed the best performance. Running this cell as written, with all five classifiers, takes noticeably longer.
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.22152233123779297 seconds ---
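
The single timer above covers the whole cell, from reading the files to writing the result. To time each step or classifier separately, a small helper along the following lines could be used. This is only a sketch: `timed` and `slow_sum` are hypothetical names introduced for illustration and are not part of the project code.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock intended for measuring short durations
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def slow_sum(n):
    """Stand-in workload; in the notebook this would be a fit or predict call."""
    return sum(range(n))


result, elapsed = timed(slow_sum, 100_000)
print(f"slow_sum took {elapsed:.4f} seconds")
```

In the notebook, a call such as `timed(clf.fit, X_train, y_train)` would report the training time of one classifier without including file loading or the other models.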
