<a href="https://colab.research.google.com/github/Anand-1932/Walk_Run_Classification/blob/main/PRCP_1013_WalkRunClass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>CONTENTS<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Business Case" data-toc-modified-id="Business Case-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Business Case</a></span></li>
<li><span><a href="#Domain Analysis" data-toc-modified-id="Domain Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Domain Analysis</a></span></li>
<li><span><a href="#Importing Libraries" data-toc-modified-id="Importing Libraries-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Importing Libraries</a></span></li>
<li><span><a href="#Loading Dataset" data-toc-modified-id="Loading Dataset-3"><span class="toc-item-num">4&nbsp;&nbsp;</span>Loading Dataset</a></span></li>
<li><span><a href="#Basic Checks" data-toc-modified-id="Basic Checks"><span class="toc-item-num">5&nbsp;&nbsp;</span>Basic Checks</a></span></li>
<li><span><a href="#Exploratory Data Ananlysis" data-toc-modified-id="#Exploratory Data Ananlysis"><span class="toc-item-num">6&nbsp;&nbsp;</span>Exploratory Data Analysis</a></span></li>
<li><span><a href="#Data Preprocessing" data-toc-modified-id="#Data Preprocessing"><span class="toc-item-num">7&nbsp;&nbsp;</span>Data Preprocessing</a></span></li>
<li><span><a href="#Feature Engineering" data-toc-modified-id="Feature Engineering"><span class="toc-item-num">8&nbsp;&nbsp;</span>Feature Engineering</a></span></li>
<li><span><a href="#Split Data Into Independent & Dependent Variable" data-toc-modified-id="Split Data Into Independent & Dependent Variable"><span class="toc-item-num">9&nbsp;&nbsp;</span>Split Data Into Independent & Dependent Variable</a></span></li>
<li><span><a href="#Model Building" data-toc-modified-id="Model Building"><span class="toc-item-num">10&nbsp;&nbsp;</span>Model Building</a></span></li>
<li><span><a href="#Hyperparameter Tunning" data-toc-modified-id="Hyperparameter Tunning"><span class="toc-item-num">11&nbsp;&nbsp;</span>Hyperparameter Tuning</a></span></li>
<li><span><a href="#Model Selection Baeed on F1 Score" data-toc-modified-id="Model Selection Baeed on F1 Score"><span class="toc-item-num">12&nbsp;&nbsp;</span>Model Selection Based on F1 Score</a></span></li>
<li><span><a href="#Visualize Some Prediction" data-toc-modified-id="Visualize Some Prediction"><span class="toc-item-num">13&nbsp;&nbsp;</span>Visualize Some Prediction</a></span></li></ul></div>

# **Business Case:**

Wearable devices like smartwatches and fitness trackers collect motion data using accelerometers and gyroscopes. Accurately distinguishing between walking and running is crucial for:

1. Fitness Tracking – Accurate calorie burn estimation, step counting, and workout categorization.

2. Healthcare & Rehabilitation – Monitoring patient mobility, recovery progress, and fall detection.

3. Sports Science – Performance analysis for athletes and injury prevention.

4. Smartphone Applications – Enhancing step-tracking accuracy in health apps.

# **Domain Analysis:**

* Sensors Used: Accelerometers (measure motion in x, y, z) and Gyroscopes (measure rotation).

* Challenges: Noise in sensor data, variations in user movement, and different device placements (e.g., wrist vs. pocket).

* Solution: Machine learning models trained on motion data can classify movements accurately, improving health and fitness applications.

This dataset contains 11 columns, described below. We go through each feature to understand how much it is likely to contribute to predicting the label column.


 1.  Date
        *   Description: Stores the date on which the data was recorded.

        *   Likely irrelevant for distinguishing between walking and running.

2.   Time
        *   Description: Stores the time of day at which the data was recorded.

        *   Likely irrelevant for distinguishing between walking and running.

        *   Time of day might influence activity patterns, but it is not directly helpful for classification.

3.   username
        *   Description: The name of the person whose motion data was recorded.

        *   Could help if different individuals have distinct movement patterns. However, it might introduce bias, making the model too personalized.

4.   wrist
        *   Description: Indicates on which wrist the device was worn (0 = Left, 1 = Right).

        *   Could affect gyroscope readings, as arm movement varies based on wrist placement.

        *   Feature importance analysis suggests low contribution, but keeping it doesn’t hurt.        

5.   activity
        *   Description: The movement label, where 0 = Walking, 1 = Running.

        *   Binary classification problem. Well-balanced dataset (~50% for each class).

6.   Accelerometer Data (acceleration_x, acceleration_y, acceleration_z)
        *   X-axis → Left-right movement

        *   Y-axis → Up-down movement

        *   Z-axis → Forward-backward movement

        *   Running generally has higher acceleration than walking.

        *   Peaks in acceleration_y (vertical motion) could differentiate between slow and fast movement.

7.   Gyroscope Data (gyro_x, gyro_y, gyro_z)
        *   X-axis → Tilt around left-right
        *   Y-axis → Tilt around up-down
        *   Z-axis → Tilt around forward-backward

        *   Walking has steady, rhythmic rotations, while running causes stronger, sharper rotations.
        *   gyro_y may be important, as arm swings differ in walking vs. running.
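
As a rough illustration of the point that running tends to produce larger accelerations than walking, the sketch below computes an overall acceleration magnitude per reading and averages it per class. The four rows are made-up values for illustration only (they are not taken from the dataset), and `acc_magnitude` is a hypothetical helper column that the notebook itself does not create.

```python
import numpy as np
import pandas as pd

# Made-up readings for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "activity":       [0, 0, 1, 1],            # 0 = walking, 1 = running
    "acceleration_x": [0.1, -0.2, 0.9, -1.1],
    "acceleration_y": [-0.9, -1.0, -1.6, -0.4],
    "acceleration_z": [0.1, -0.1, 0.5, -0.7],
})

# Euclidean norm of the three accelerometer axes (hypothetical helper column)
axes = ["acceleration_x", "acceleration_y", "acceleration_z"]
toy["acc_magnitude"] = np.sqrt((toy[axes] ** 2).sum(axis=1))

# Mean magnitude per activity class
mean_magnitude = toy.groupby("activity")["acc_magnitude"].mean()
print(mean_magnitude)
```

On the real data, the same `groupby` applied to `data` would show whether the expected gap between the two classes actually holds.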

# **Objectives**
The objective of this project is to develop a machine learning model that can accurately classify whether a person is walking or running based on motion sensor data (accelerometer & gyroscope readings).


1.  Clean and preprocess raw motion sensor data.

2.  Standardize features to improve model performance.

3.  Train and compare multiple machine learning models.

4.  Implement Logistic Regression, Decision Tree, Random Forest, SVM, KNN, and a Neural Network (MLP).

5.  Identify the best-performing model using evaluation metrics.

6.  Use feature selection and hyperparameter tuning to improve accuracy.

7.  Apply techniques like SMOTE (if needed) to handle class imbalance.

8.  Analyze feature importance to understand which factors contribute most to classification.

9.  Ensure the model is robust and generalizable for real-world scenarios.

10.  Develop a prediction pipeline to classify new motion sensor readings.

11.  Provide an automated system for activity recognition in wearable devices.





# **IMPORTING LIBRARIES**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import pickle

# **LOADING DATASET**

In [None]:
data = pd.read_csv('/content/walkrun.csv')

In [None]:
data.sample(5)

In [None]:
data.shape

# **BASIC CHECKS**

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.isnull().sum()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.describe(include = 'object')

# **EXPLORATORY DATA ANALYSIS (EDA)**

In [None]:
data.columns

In [None]:
data['wrist'].value_counts().plot(kind = 'bar')

In [None]:
data['activity'].value_counts().plot(kind = 'bar')

In [None]:
sns.histplot(data['acceleration_x'],kde = True)

In [None]:
sns.histplot(data['acceleration_y'],kde = True)

In [None]:
sns.histplot(data['acceleration_z'],kde = True)

In [None]:
sns.histplot(data['gyro_x'],kde = True)

In [None]:
sns.histplot(data['gyro_y'],kde = True)

In [None]:
sns.histplot(data['gyro_z'],kde = True)

In [None]:
sns.scatterplot(x = 'acceleration_x',y = 'activity',data = data)

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x=data['wrist'], hue=data['activity'], palette="coolwarm")
plt.xlabel("Wrist (0 = Left, 1 = Right)")
plt.ylabel("Count")
plt.title("Distribution of Walking vs Running by Wrist Placement")
plt.legend(title="Activity", labels=["Walking (0)", "Running (1)"])
plt.show()

In [None]:
data.columns

In [None]:
# Create a boxplot to compare acceleration_x across wrist placements
plt.figure(figsize=(6, 4))
sns.boxplot(x=data['wrist'], y=data['acceleration_x'], palette="coolwarm")
plt.xlabel("Wrist (0 = Left, 1 = Right)")
plt.ylabel("Acceleration X")
plt.title("Comparison of Acceleration X by Wrist Placement")
plt.show()

In [None]:
# Create a boxplot to compare acceleration_x across activity types
plt.figure(figsize=(6, 4))
sns.boxplot(x=data['activity'], y=data['acceleration_x'], palette="coolwarm")
plt.xlabel("Activity (0 = Walking, 1 = Running)")
plt.ylabel("Acceleration X")
plt.title("Comparison of Acceleration X by Activity")
plt.show()

In [None]:
# Create a boxplot to compare acceleration_y across activity types
plt.figure(figsize=(6, 4))
sns.boxplot(x=data['activity'], y=data['acceleration_y'], palette="coolwarm")
plt.xlabel("Activity (0 = Walking, 1 = Running)")
plt.ylabel("Acceleration Y")
plt.title("Comparison of Acceleration Y by Activity")
plt.show()

In [None]:
# Create a boxplot to compare acceleration_z across activity types
plt.figure(figsize=(6, 4))
sns.boxplot(x=data['activity'], y=data['acceleration_z'], palette="coolwarm")
plt.xlabel("Activity (0 = Walking, 1 = Running)")
plt.ylabel("Acceleration Z")
plt.title("Comparison of Acceleration Z by Activity")
plt.show()

In [None]:
# Create a boxplot to compare gyro_x across activity types
plt.figure(figsize=(6, 4))
sns.boxplot(x=data['activity'], y=data['gyro_x'], palette="coolwarm")
plt.xlabel("Activity (0 = Walking, 1 = Running)")
plt.ylabel("Gyro X")
plt.title("Comparison of Gyro X by Activity")
plt.show()

In [None]:
# Create a boxplot to compare gyro_y across activity types
plt.figure(figsize=(6, 4))
sns.boxplot(x=data['activity'], y=data['gyro_y'], palette="coolwarm")
plt.xlabel("Activity (0 = Walking, 1 = Running)")
plt.ylabel("Gyro Y")
plt.title("Comparison of Gyro Y by Activity")
plt.show()

In [None]:
# Create a boxplot to compare gyro_z across activity types
plt.figure(figsize=(6, 4))
sns.boxplot(x=data['activity'], y=data['gyro_z'], palette="coolwarm")
plt.xlabel("Activity (0 = Walking, 1 = Running)")
plt.ylabel("Gyro Z")
plt.title("Comparison of Gyro Z by Activity")
plt.show()

# **DATA PREPROCESSING**

In [None]:
data.isnull().sum()

In [None]:
# removing null values
data.dropna(inplace = True)

In [None]:
data.isnull().sum()

In [None]:
data.duplicated().sum()

In [None]:
# checking for outliers
plt.figure(figsize=(20,15),facecolor='white')
plotnumber = 1

for i in data.select_dtypes(include="number").columns:
  if plotnumber > 18:  # the 6x3 grid holds at most 18 subplots
    break
  ax=plt.subplot(6,3,plotnumber)
  sns.boxplot(x=data[i])
  plt.xlabel(i,fontsize=10)
  plt.ylabel('count',fontsize=10)

  plotnumber+=1

plt.tight_layout()

In [None]:
# Define the whisker function, which returns the lower whisker (lw) and upper whisker (uw)
def wisker(col):
  Q1,Q3=np.percentile(col,[25,75])
  IQR=Q3-Q1
  lw=Q1-1.5*IQR
  uw=Q3+1.5*IQR
  return lw,uw

In [None]:
for i in data[['acceleration_x','acceleration_y','acceleration_z','gyro_x','gyro_y','gyro_z']]:
  lw,uw=wisker(data[i])
  data[i]=np.where(data[i]<lw,lw,data[i])
  data[i]=np.where(data[i]>uw,uw,data[i])
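
To make the capping step concrete, here is a small worked example of the same 1.5 × IQR rule on a five-value array. The helper name `iqr_bounds` and the sample values are assumptions for illustration; `np.clip` is used as a compact equivalent of the two `np.where` calls.

```python
import numpy as np

def iqr_bounds(values):
    """Return (lower, upper) whisker bounds using the 1.5 * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

sample = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 is an obvious outlier
lower, upper = iqr_bounds(sample)                # Q1 = 2, Q3 = 4, IQR = 2
capped = np.clip(sample, lower, upper)           # values outside the whiskers are pulled to them

print(lower, upper)   # -1.0 7.0
print(capped)         # [1. 2. 3. 4. 7.]
```

Capping keeps every row in the dataset, unlike dropping outliers, at the cost of flattening genuinely extreme readings.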

In [None]:
# Rechecking for outliers
plt.figure(figsize=(20,15),facecolor='white')
plotnumber = 1

for i in data.select_dtypes(include="number").columns:
  if plotnumber > 18:  # the 6x3 grid holds at most 18 subplots
    break
  ax=plt.subplot(6,3,plotnumber)
  sns.boxplot(x=data[i])
  plt.xlabel(i,fontsize=10)
  plt.ylabel('count',fontsize=10)

  plotnumber+=1

plt.tight_layout()

In [None]:
# checking correlation between numerical column
data.select_dtypes(include='number').corr()

In [None]:
sns.heatmap(data.select_dtypes(include='number').corr(),annot=True)

# **FEATURE ENGINEERING**

In [None]:
data.columns

In [None]:
# dropping date column
# it is not important for our prediction
data = data.drop(columns = ['date'])

In [None]:
# dropping time column
# it is not important for our prediction
data = data.drop(columns = ['time'])

In [None]:
# dropping username column
# it is not important for our prediction
data = data.drop(columns = ['username'])

In [None]:
data.sample(5)

In [None]:
data.shape

# **Splitting data in independent and target variable**

In [None]:
x = data.drop(columns = ['activity'])
y = data['activity']

In [None]:
x

In [None]:
x.shape

In [None]:
y.shape

# Splitting data into testing and training set

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
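
The effect of `stratify=y` can be checked on toy labels. The sketch below assumes a made-up 70/30 label vector and placeholder features; it only illustrates that the class share is carried over into both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 70 walking (0) and 30 running (1); the features are placeholders
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the share of class 1 should be about the same in both parts
print(y_tr.mean(), y_te.mean())
```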

In [None]:
x_train.shape

In [None]:
x_test.shape

In [None]:
y_train.shape

# **Applying standard scaler to numerical column**

In [None]:
# Initialize StandardScaler (rescales each feature to zero mean and unit variance)
scaler = StandardScaler()

# Fit on the training set only, then apply the same transformation to the test set
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
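
A minimal sketch of what `StandardScaler` does, using a made-up single-column example: the mean and standard deviation are learned from the training values only and then reused for the test value. The variable names here are illustrative and separate from the notebook's `scaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_col = np.array([[1.0], [2.0], [3.0]])   # toy training column
test_col = np.array([[4.0]])                  # toy test value

sc = StandardScaler()
train_scaled = sc.fit_transform(train_col)    # learns mean and std from the training values
test_scaled = sc.transform(test_col)          # reuses the training mean and std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1, while the test value lands outside that range because it was never seen during fitting.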

In [None]:
y.value_counts()

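# **Applying SMOTE**

If the training data is imbalanced, i.e., the walking and running classes differ substantially in size, the model can become biased toward the majority class. To even out the class counts, we apply SMOTE to the training set only. The output of `y.value_counts()` above shows how large the difference actually is.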

In [None]:
# Apply SMOTE on the training data
smote = SMOTE(random_state=42)
x_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)

# Checking the class distribution after SMOTE
print("Original class distribution:", y_train.value_counts())
print("Resampled class distribution:", y_train_smote.value_counts())
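
For intuition, the idea SMOTE builds on is to create new minority-class points by interpolating between an existing point and one of its nearest minority-class neighbours. The sketch below is a simplified NumPy illustration of that interpolation step, not imblearn's implementation; `smote_like_sample` and the toy points are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in 2D (made-up values)
minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]      # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                           # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, which is why oversampling of this kind is applied to the training set only and never to the test set.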

# **MODEL CREATION**

# Logistic Regression

In [None]:
LR = LogisticRegression()
LR.fit(x_train_smote,y_train_smote)
y_pred = LR.predict(x_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

# Decision Tree Classifier

In [None]:
DT = DecisionTreeClassifier()
DT.fit(x_train_smote,y_train_smote)
y_pred = DT.predict(x_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

# Random Forest Classifier

In [None]:
RF = RandomForestClassifier()
RF.fit(x_train_smote,y_train_smote)
y_pred = RF.predict(x_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
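
Since the Domain Analysis mentions feature importance, here is a hedged sketch of how importances can be read from a fitted random forest. It runs on synthetic stand-in data from `make_classification`, so the printed ranking says nothing about the real dataset; only the column names are borrowed from it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["wrist", "acceleration_x", "acceleration_y", "acceleration_z",
                 "gyro_x", "gyro_y", "gyro_z"]

# Synthetic stand-in data with the same number of features as the notebook
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, the equivalent would presumably be `pd.Series(RF.feature_importances_, index=x.columns)`, because `x_train` is a plain NumPy array after scaling and no longer carries column names.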

# Support Vector Classifier

In [None]:
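# Note: this rebinds the name SVC from the class to the model instance,
# so re-running this cell fails unless SVC is imported again
SVC = SVC()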
SVC.fit(x_train_smote,y_train_smote)
y_pred = SVC.predict(x_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

# KNeighborsClassifier

In [None]:
KNN = KNeighborsClassifier()
KNN.fit(x_train_smote,y_train_smote)
y_pred = KNN.predict(x_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

# Multi-Layer Perceptron

In [None]:
MLP = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
MLP.fit(x_train_smote,y_train_smote)
y_pred = MLP.predict(x_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

# **MODEL COMPARISON**

*   The algorithms used in this project are LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, Support Vector Machine (SVC), KNeighborsClassifier, and Multi-Layer Perceptron (MLP).

*   RandomForestClassifier, SVC, KNeighborsClassifier, and MLP all reach nearly the same accuracy, close to 99.5%.

*   SVC has the highest accuracy, 99.58%.
*   Hyperparameter tuning was not performed, since most of the models already score close to the maximum with their default settings.


*   Logistic Regression has the lowest accuracy, 91%.


*   In conclusion, SVM provides the best accuracy, 99.58%.
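
The comparison above is based on accuracy. A comparison by F1 score could be sketched as below, looping over a dictionary of models. This is an illustrative sketch on synthetic data, with only three of the six models and with hypothetical variable names; `svm.SVC` is referenced through the module so that the notebook's own `SVC` variable is left untouched.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the walk/run dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale with statistics learned from the training part only
scaler_demo = StandardScaler().fit(X_tr)
X_tr, X_te = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

candidates = {
    "LogisticRegression": LogisticRegression(),
    "SVC": svm.SVC(),
    "KNN": KNeighborsClassifier(),
}

# Fit each model and record its F1 score on the held-out part
f1_scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    f1_scores[name] = f1_score(y_te, clf.predict(X_te))

best_name = max(f1_scores, key=f1_scores.get)
print(f1_scores)
print("Best model by F1:", best_name)
```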



# **VISUALIZING PREDICTION WITH MODEL HAVING BEST ACCURACY SCORE**


*   Prediction of new data points using the saved Support Vector Classifier (SVC)




In [None]:
# Save the trained SVC model; the with-block closes the file after writing
with open('svc.pkl', 'wb') as f:
  pickle.dump(SVC, f)

In [None]:

# Load the trained model (it was saved with pickle, so it is read back with pickle)
with open("svc.pkl", "rb") as f:
  model = pickle.load(f)


# Format: [wrist, acceleration_x, acceleration_y, acceleration_z, gyro_x, gyro_y, gyro_z]
new_data_point = np.array([[1, 0.5, 0.8, -0.3, 0.01, 0.02, -0.04]])  # Example values

# Scale the new data point with the scaler fitted on the training data
new_data_scaled = scaler.transform(new_data_point)

# prediction
prediction = model.predict(new_data_scaled)

# Result
activity = "Running" if prediction[0] == 1 else "Walking"
print(f"Predicted Activity: {activity}")
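
Because the scaler has to be applied to every new reading before prediction, one option is to bundle the scaler and the classifier into a single scikit-learn pipeline and save that one object. The sketch below shows the idea on synthetic data; it is an alternative arrangement under that assumption, not what the notebook above does, and all variable names are hypothetical.

```python
import pickle

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with seven features (not the walk/run dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=42)

# Scaler and classifier are fitted and stored together
pipe = make_pipeline(StandardScaler(), svm.SVC())
pipe.fit(X_demo, y_demo)

# Round-trip through pickle (in memory here; writing to a file works the same way)
restored = pickle.loads(pickle.dumps(pipe))

# Raw, unscaled readings go straight in; the pipeline scales them internally
new_reading = X_demo[:1]
print(restored.predict(new_reading))
```

With this arrangement there is no separate scaler object to keep in sync with the saved model.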

# **CHALLENGES FACED**

*   The dataset contains columns that are irrelevant for classification (date, time, and username); we dropped them before training.

*   To guard against class imbalance, we applied SMOTE to the training set; an imbalanced training set would otherwise bias the model toward the majority class.

*   We did not tune the SVM's hyperparameters, because a search would be computationally expensive.
*   New data must be scaled with the fitted scaler before prediction; this step is easy to forget, and skipping it leads to incorrect results.


*   The dataset contains a large number of outliers; we capped them at the lower and upper whiskers computed with the IQR method.


*   Despite these challenges, the model achieved high accuracy (~99.58%) on the test set, which suggests it is reliable for activity classification on this dataset.



# **CONCLUSION**
In conclusion, walk/run classification using SVM offers a promising approach for efficient and accurate classification of whether a person is walking or running. Of the classification algorithms compared, SVM performed best, and with a test accuracy of more than 99.5% it predicts the correct class almost every time. The project built an accurate activity classification model that can be integrated into wearable devices, fitness applications, and healthcare monitoring systems to track human movement efficiently.

# **FUTURE SCOPE**
*   The model was trained on a specific dataset and may not generalize well to different environments, terrains, or individuals. To address this, collect data from more users, covering various speeds, surfaces (e.g., treadmill, grass, pavement), and conditions.

*   The model currently uses only accelerometer and gyroscope data. Additional signals such as heart rate, stride length, barometer readings, or GPS movement patterns could help distinguish running from fast walking more accurately.

*   The current model is trained offline, but real-time classification in mobile apps or IoT devices requires an efficient model. It could be converted into a lightweight version using TensorFlow Lite or ONNX for deployment on mobile and wearable devices.

*   Edge computing could be used to run the model on a smartwatch or smartphone instead of relying on cloud-based predictions.

*   The current model only classifies walking vs. running. It could be extended to recognize other activities such as standing, sitting, jumping, cycling, and stair climbing to create a comprehensive activity recognition system.

By implementing these improvements, the model can be made more accurate, generalizable, and real-time capable, enabling its use in health monitoring, sports analytics, and wearable tech applications.