# Introduction
Time-series analysis and forecasting play a crucial role in various domains such as finance, healthcare, and engineering. The ability to predict future trends based on historical data is invaluable for making informed decisions. This project aims to tackle a time-series prediction problem by leveraging the capabilities of advanced machine learning models, specifically Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) layers.

RNNs are designed to process sequential data by maintaining a hidden state that captures information from previous time steps. However, traditional RNNs often face challenges such as vanishing gradients, which limit their ability to capture long-term dependencies. LSTMs address these limitations by introducing gated mechanisms that enable the model to retain relevant information over extended sequences.
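
For reference, the gating described above is usually written as the following cell update. This is the standard textbook formulation, not something extracted from this notebook: the symbol names ($W$, $U$, $b$ for learned weights and biases, $\sigma$ for the logistic sigmoid, $\odot$ for the element-wise product) are conventional choices.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

Because $c_t$ is updated additively, gradients can flow through the cell state across many time steps when the forget gate stays close to 1, which is what mitigates the vanishing-gradient problem.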

The dataset used in this project (`user_behavior_dataset.csv`) is treated as sequential data and split, in row order, into training and testing subsets. The pipeline begins with preprocessing, including normalization and sequence creation. A baseline linear regression model serves as a benchmark for the LSTM-based model. Both models are evaluated with Mean Absolute Error (MAE) and Mean Squared Error (MSE) on the held-out test subset, which measures how well their predictions generalize to unseen data.

The steps include:
- Loading and preprocessing the dataset, including normalization and sequence creation.
- Implementing a baseline linear regression model for benchmark comparison.
- Designing and training an LSTM-based model to learn temporal patterns.
- Evaluating the models using metrics such as Mean Absolute Error (MAE) and Mean Squared Error (MSE).
- Analyzing the results to demonstrate the advantages of deep learning approaches.
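
The two evaluation metrics named in the steps above can be sketched without any library, which makes the arithmetic explicit. The function names `mae` and `mse` and the toy values are illustrative only; the notebook itself uses `sklearn.metrics`.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: the average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: the average of (y - y_hat)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values, not taken from the dataset
y_true = [4, 3, 2, 3]
y_pred = [3.0, 3.0, 4.0, 3.0]

toy_mae = mae(y_true, y_pred)  # (1 + 0 + 2 + 0) / 4
toy_mse = mse(y_true, y_pred)  # (1 + 0 + 4 + 0) / 4
```

MSE squares each error, so it penalizes a few large misses more heavily than MAE does. Reporting both gives a rough sense of whether errors are uniform or dominated by outliers.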

This project aims to demonstrate that deep learning can capture temporal patterns in sequential data and produce more accurate predictions than a traditional baseline. The findings may inform future research and applications in time-series analysis.

In [9]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

In [10]:
# Load the dataset
file_path = 'user_behavior_dataset.csv'
df = pd.read_csv(file_path)
df.head()

| | User ID | Device Model | Operating System | App Usage Time (min/day) | Screen On Time (hours/day) | Battery Drain (mAh/day) | Number of Apps Installed | Data Usage (MB/day) | Age | Gender | User Behavior Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Google Pixel 5 | Android | 393 | 6.4 | 1872 | 67 | 1122 | 40 | Male | 4 |
| 1 | 2 | OnePlus 9 | Android | 268 | 4.7 | 1331 | 42 | 944 | 47 | Female | 3 |
| 2 | 3 | Xiaomi Mi 11 | Android | 154 | 4.0 | 761 | 32 | 322 | 42 | Male | 2 |
| 3 | 4 | Google Pixel 5 | Android | 239 | 4.8 | 1676 | 56 | 871 | 20 | Male | 3 |
| 4 | 5 | iPhone 12 | iOS | 187 | 4.3 | 1367 | 58 | 988 | 31 | Female | 3 |


In [11]:
# Explore dataset
print("Dataset Overview:")
print(df.head())

Dataset Overview:
   User ID    Device Model Operating System  App Usage Time (min/day)  \
0        1  Google Pixel 5          Android                       393   
1        2       OnePlus 9          Android                       268   
2        3    Xiaomi Mi 11          Android                       154   
3        4  Google Pixel 5          Android                       239   
4        5       iPhone 12              iOS                       187   

   Screen On Time (hours/day)  Battery Drain (mAh/day)  \
0                         6.4                     1872   
1                         4.7                     1331   
2                         4.0                      761   
3                         4.8                     1676   
4                         4.3                     1367   

   Number of Apps Installed  Data Usage (MB/day)  Age  Gender  \
0                        67                 1122   40    Male   
1                        42                  944   47  Female   

In [12]:
print("\nDataset Info:")
print(df.info())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User ID                     700 non-null    int64  
 1   Device Model                700 non-null    object 
 2   Operating System            700 non-null    object 
 3   App Usage Time (min/day)    700 non-null    int64  
 4   Screen On Time (hours/day)  700 non-null    float64
 5   Battery Drain (mAh/day)     700 non-null    int64  
 6   Number of Apps Installed    700 non-null    int64  
 7   Data Usage (MB/day)         700 non-null    int64  
 8   Age                         700 non-null    int64  
 9   Gender                      700 non-null    object 
 10  User Behavior Class         700 non-null    int64  
dtypes: float64(1), int64(7), object(3)
memory usage: 60.3+ KB
None


In [15]:
# Data Preprocessing
# Separate numerical and categorical columns
numerical_cols = ['App Usage Time (min/day)', 'Screen On Time (hours/day)',
                  'Battery Drain (mAh/day)', 'Number of Apps Installed',
                  'Data Usage (MB/day)', 'Age']
categorical_cols = ['Gender', 'Device Model', 'Operating System']
numerical_cols

['App Usage Time (min/day)',
 'Screen On Time (hours/day)',
 'Battery Drain (mAh/day)',
 'Number of Apps Installed',
 'Data Usage (MB/day)',
 'Age']

In [16]:
categorical_cols

['Gender', 'Device Model', 'Operating System']

In [17]:
# Handle missing values for numerical columns (using median)
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].median())

In [18]:
# Handle missing values for categorical columns (using mode)
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
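
The two imputation rules above (median for numerical columns, mode for categorical ones) can be sketched in plain Python as follows. The helper `fill_missing` and the sample values are hypothetical. Note also that `df.info()` above reports 700 non-null entries in every column, so for this particular file both fills leave the data unchanged.

```python
from statistics import median, mode


def fill_missing(values, strategy):
    """Replace None entries with the median or the mode of the present values."""
    present = [v for v in values if v is not None]
    # Tie-breaking differs from pandas: statistics.mode returns the first
    # most-common value encountered, while Series.mode()[0] returns the
    # smallest of the tied values.
    fill = median(present) if strategy == "median" else mode(present)
    return [fill if v is None else v for v in values]


# Illustrative columns with one gap each (None marks a missing entry)
ages = [40, None, 42, 20, 31]
systems = ["Android", "iOS", None, "Android"]

filled_ages = fill_missing(ages, "median")    # median of 20, 31, 40, 42 is 35.5
filled_systems = fill_missing(systems, "mode")
```

The median is less sensitive to extreme values than the mean, which is the usual reason for preferring it when filling skewed numerical columns.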

In [19]:
# Verify there are no remaining missing values
print("Missing values after handling:")
print(df.isnull().sum())

Missing values after handling:
User ID                       0
Device Model                  0
Operating System              0
App Usage Time (min/day)      0
Screen On Time (hours/day)    0
Battery Drain (mAh/day)       0
Number of Apps Installed      0
Data Usage (MB/day)           0
Age                           0
Gender                        0
User Behavior Class           0
dtype: int64


In [20]:
# Encode categorical variables
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
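
As a rough illustration of what `LabelEncoder` does in the loop above, the sketch below sorts the distinct values and assigns each one its position as an integer code. The helper `fit_label_encoding` is hypothetical, and the device list contains only the four names visible in the preview, so the codes differ from those the full dataset produces.

```python
def fit_label_encoding(values):
    """Map each distinct value to its index in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# Only the device names visible in df.head(); the full column has more categories
devices = ["Google Pixel 5", "OnePlus 9", "Xiaomi Mi 11", "Google Pixel 5", "iPhone 12"]

mapping = fit_label_encoding(devices)
encoded = [mapping[d] for d in devices]
```

Uppercase letters sort before lowercase ones, so `iPhone 12` receives the largest code. The integer codes imply an ordering between categories that has no real meaning, which is worth keeping in mind when they are fed to a regression model.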

In [21]:
label_encoders

{'Gender': LabelEncoder(),
 'Device Model': LabelEncoder(),
 'Operating System': LabelEncoder()}

In [22]:
# Normalize numerical features
scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
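
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`. A minimal sketch of that formula is shown below. The helper `min_max_scale` is hypothetical, and the age values are illustrative: the range 18 to 59 is an assumption, chosen because it maps 40 to roughly 0.5366, the scaled Age shown for the first row of `features`.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has zero span; map it to 0.0 instead of dividing by zero
    return [(v - lo) / span if span else 0.0 for v in values]


# Illustrative ages; the 18-59 range is an assumption, not read from the file
ages = [40, 47, 18, 59]
scaled = min_max_scale(ages)  # 40 -> 22/41, 47 -> 29/41, 18 -> 0.0, 59 -> 1.0
```

In the notebook the scaler is fitted on the full dataset before the train-test split, so the test rows contribute to the minimum and maximum. Fitting on the training portion only would avoid that leakage.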

In [23]:
scaler

In [24]:
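# Prepare sequences for RNN
sequence_length = 5  # number of consecutive rows in each input window
# Note: 'User ID' stays in the feature matrix and is not scaled,
# because it is not listed in numerical_cols
features = df.drop(columns=['User Behavior Class']).values
labels = df['User Behavior Class'].values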

In [25]:
features

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        4.25887265e-01, 5.36585366e-01, 1.00000000e+00],
       [2.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
        3.51565762e-01, 7.07317073e-01, 0.00000000e+00],
       [3.00000000e+00, 3.00000000e+00, 0.00000000e+00, ...,
        9.18580376e-02, 5.85365854e-01, 1.00000000e+00],
       ...,
       [6.98000000e+02, 0.00000000e+00, 0.00000000e+00, ...,
        1.48225470e-01, 7.80487805e-01, 0.00000000e+00],
       [6.99000000e+02, 2.00000000e+00, 0.00000000e+00, ...,
        5.09394572e-02, 6.34146341e-01, 1.00000000e+00],
       [7.00000000e+02, 1.00000000e+00, 0.00000000e+00, ...,
        3.03131524e-01, 1.21951220e-01, 0.00000000e+00]])

In [27]:
labels

array([4, 3, 2, 3, 3, 2, 4, 5, 4, 4, 1, 3, 4, 3, 3, 5, 2, 3, 1, 5, 2, 5,
       4, 3, 3, 2, 4, 5, 4, 4, 3, 3, 3, 1, 5, 1, 2, 5, 4, 4, 3, 5, 2, 2,
       2, 5, 5, 2, 5, 5, 4, 3, 2, 2, 5, 1, 1, 5, 4, 4, 5, 4, 1, 5, 4, 3,
       3, 2, 5, 1, 4, 5, 4, 1, 1, 3, 1, 2, 1, 3, 5, 2, 4, 1, 3, 4, 2, 1,
       2, 4, 2, 3, 1, 2, 4, 4, 5, 5, 1, 3, 1, 3, 2, 3, 2, 1, 2, 1, 3, 3,
       2, 5, 3, 2, 4, 2, 1, 3, 1, 2, 4, 5, 5, 5, 3, 5, 4, 3, 2, 2, 5, 4,
       2, 1, 1, 5, 1, 3, 5, 5, 3, 4, 2, 1, 5, 3, 4, 1, 3, 2, 5, 1, 2, 4,
       1, 2, 1, 4, 4, 2, 1, 1, 4, 1, 3, 5, 5, 1, 3, 5, 2, 1, 2, 4, 1, 3,
       3, 3, 3, 5, 2, 4, 5, 1, 5, 5, 4, 1, 2, 1, 2, 1, 4, 3, 4, 3, 3, 2,
       1, 2, 5, 2, 1, 1, 2, 4, 1, 2, 5, 3, 4, 5, 2, 1, 2, 5, 4, 5, 5, 2,
       4, 2, 4, 5, 2, 5, 4, 3, 2, 4, 2, 5, 5, 2, 2, 1, 5, 4, 1, 5, 1, 5,
       2, 4, 1, 3, 1, 4, 2, 1, 1, 4, 5, 2, 1, 4, 4, 3, 3, 5, 1, 4, 5, 1,
       4, 5, 3, 5, 3, 4, 4, 3, 2, 2, 2, 5, 3, 3, 1, 4, 1, 1, 4, 1, 4, 5,
       4, 3, 2, 1, 4, 1, 1, 3, 5, 1, 4, 2, 3, 2, 4,

In [28]:
X, y = [], []
for i in range(len(features) - sequence_length):
    X.append(features[i:i+sequence_length])
    y.append(labels[i+sequence_length])
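
The windowing loop above can be checked on a toy input. The sketch below wraps the same logic in a hypothetical helper, `make_windows`, and runs it on seven single-feature rows. The targets are the first seven values of `labels` shown earlier; the feature values are made up.

```python
def make_windows(rows, targets, sequence_length):
    """Pair each run of `sequence_length` rows with the target of the next row."""
    X, y = [], []
    for i in range(len(rows) - sequence_length):
        X.append(rows[i:i + sequence_length])
        y.append(targets[i + sequence_length])
    return X, y


rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7]]  # made-up features
targets = [4, 3, 2, 3, 3, 2, 4]                           # first seven labels

X_toy, y_toy = make_windows(rows, targets, sequence_length=5)

# Seven rows and a window of five give two windows; 700 rows give 700 - 5 = 695
n_windows_full = 700 - 5
```

Each target is the label of the row immediately after its window, so `y_toy` starts with `2, 4`, matching the first two entries of `y` in the notebook output. Consecutive windows overlap in all but one row.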

In [29]:
X = np.array(X)
y = np.array(y)

In [31]:
X

array([[[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         4.25887265e-01, 5.36585366e-01, 1.00000000e+00],
        [2.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
         3.51565762e-01, 7.07317073e-01, 0.00000000e+00],
        [3.00000000e+00, 3.00000000e+00, 0.00000000e+00, ...,
         9.18580376e-02, 5.85365854e-01, 1.00000000e+00],
        [4.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         3.21085595e-01, 4.87804878e-02, 1.00000000e+00],
        [5.00000000e+00, 4.00000000e+00, 1.00000000e+00, ...,
         3.69937370e-01, 3.17073171e-01, 0.00000000e+00]],

       [[2.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
         3.51565762e-01, 7.07317073e-01, 0.00000000e+00],
        [3.00000000e+00, 3.00000000e+00, 0.00000000e+00, ...,
         9.18580376e-02, 5.85365854e-01, 1.00000000e+00],
        [4.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         3.21085595e-01, 4.87804878e-02, 1.00000000e+00],
        [5.00000000e+00, 4.00000000e+0

In [32]:
y

array([2, 4, 5, 4, 4, 1, 3, 4, 3, 3, 5, 2, 3, 1, 5, 2, 5, 4, 3, 3, 2, 4,
       5, 4, 4, 3, 3, 3, 1, 5, 1, 2, 5, 4, 4, 3, 5, 2, 2, 2, 5, 5, 2, 5,
       5, 4, 3, 2, 2, 5, 1, 1, 5, 4, 4, 5, 4, 1, 5, 4, 3, 3, 2, 5, 1, 4,
       5, 4, 1, 1, 3, 1, 2, 1, 3, 5, 2, 4, 1, 3, 4, 2, 1, 2, 4, 2, 3, 1,
       2, 4, 4, 5, 5, 1, 3, 1, 3, 2, 3, 2, 1, 2, 1, 3, 3, 2, 5, 3, 2, 4,
       2, 1, 3, 1, 2, 4, 5, 5, 5, 3, 5, 4, 3, 2, 2, 5, 4, 2, 1, 1, 5, 1,
       3, 5, 5, 3, 4, 2, 1, 5, 3, 4, 1, 3, 2, 5, 1, 2, 4, 1, 2, 1, 4, 4,
       2, 1, 1, 4, 1, 3, 5, 5, 1, 3, 5, 2, 1, 2, 4, 1, 3, 3, 3, 3, 5, 2,
       4, 5, 1, 5, 5, 4, 1, 2, 1, 2, 1, 4, 3, 4, 3, 3, 2, 1, 2, 5, 2, 1,
       1, 2, 4, 1, 2, 5, 3, 4, 5, 2, 1, 2, 5, 4, 5, 5, 2, 4, 2, 4, 5, 2,
       5, 4, 3, 2, 4, 2, 5, 5, 2, 2, 1, 5, 4, 1, 5, 1, 5, 2, 4, 1, 3, 1,
       4, 2, 1, 1, 4, 5, 2, 1, 4, 4, 3, 3, 5, 1, 4, 5, 1, 4, 5, 3, 5, 3,
       4, 4, 3, 2, 2, 2, 5, 3, 3, 1, 4, 1, 1, 4, 1, 4, 5, 4, 3, 2, 1, 4,
       1, 1, 3, 5, 1, 4, 2, 3, 2, 4, 3, 4, 5, 2, 2,

In [33]:
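# Train-test split (chronological): with shuffle=False the first 80% of windows
# form the training set and the last 20% the test set. random_state is omitted
# because it has no effect when shuffling is disabled.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)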
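
With `shuffle=False`, the split reduces to cutting the ordered windows at a single index. A minimal sketch is given below, assuming the test size is rounded up, as scikit-learn does. The helper name `chronological_split` is hypothetical.

```python
import math


def chronological_split(items, test_size):
    """Keep the first part for training and the last `test_size` fraction for testing."""
    n_test = math.ceil(len(items) * test_size)
    cut = len(items) - n_test
    return items[:cut], items[cut:]


windows = list(range(10))  # stand-in for ten ordered windows
train, test = chronological_split(windows, test_size=0.2)
```

Keeping the order intact ensures the test windows come after every training window. Overlapping windows near the cut still share rows, so the two sets are not fully independent.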

In [34]:
X_train

array([[[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         4.25887265e-01, 5.36585366e-01, 1.00000000e+00],
        [2.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
         3.51565762e-01, 7.07317073e-01, 0.00000000e+00],
        [3.00000000e+00, 3.00000000e+00, 0.00000000e+00, ...,
         9.18580376e-02, 5.85365854e-01, 1.00000000e+00],
        [4.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         3.21085595e-01, 4.87804878e-02, 1.00000000e+00],
        [5.00000000e+00, 4.00000000e+00, 1.00000000e+00, ...,
         3.69937370e-01, 3.17073171e-01, 0.00000000e+00]],

       [[2.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
         3.51565762e-01, 7.07317073e-01, 0.00000000e+00],
        [3.00000000e+00, 3.00000000e+00, 0.00000000e+00, ...,
         9.18580376e-02, 5.85365854e-01, 1.00000000e+00],
        [4.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         3.21085595e-01, 4.87804878e-02, 1.00000000e+00],
        [5.00000000e+00, 4.00000000e+0

In [35]:
# Baseline Model (Simple Linear Regression for Benchmark)
from sklearn.linear_model import LinearRegression
baseline_model = LinearRegression()
# Flatten each (sequence_length, n_features) window into one row,
# because LinearRegression expects 2-D input
baseline_model.fit(X_train.reshape(X_train.shape[0], -1), y_train)
y_pred_baseline = baseline_model.predict(X_test.reshape(X_test.shape[0], -1))
print("Baseline MAE:", mean_absolute_error(y_test, y_pred_baseline))
print("Baseline MSE:", mean_squared_error(y_test, y_pred_baseline))

Baseline MAE: 1.2734767996149956
Baseline MSE: 2.177265922659539
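
The `reshape(..., -1)` calls in the baseline cell flatten each window row by row, so the linear model sees every time step's features side by side. A small sketch of that layout is shown below. The helper `flatten_window` is hypothetical, and the figure of 10 features per row is an assumption based on the 11 columns listed by `df.info()` minus the label.

```python
def flatten_window(window):
    """Concatenate the rows of a (time steps x features) window into one flat list."""
    return [value for row in window for value in row]


window = [[1, 2], [3, 4], [5, 6]]  # 3 time steps x 2 features, made-up values
flat = flatten_window(window)

# Assumed notebook shapes: 5 time steps x 10 features per window
n_inputs_per_sample = 5 * 10
```

After flattening, the baseline treats the inputs as an unordered set of columns. It can weight "Age two steps ago" differently from "Age one step ago", but it has no built-in notion of sequence.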


In [36]:
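# RNN Model
model = Sequential([
    # Declare the input shape with an explicit Input object: (time steps, features)
    tf.keras.Input(shape=(X_train.shape[1], X_train.shape[2])),
    LSTM(64, return_sequences=True),  # pass the full sequence to the next LSTM
    Dropout(0.2),
    LSTM(32),                         # keep only the final hidden state
    Dropout(0.2),
    Dense(1, activation='linear')     # single continuous output (regression)
])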



In [37]:
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()
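
The output of `model.summary()` is not shown here, but the trainable parameter counts can be estimated by hand. The sketch below uses the standard formulas for an LSTM layer with biases (four gates, each with input weights, recurrent weights, and a bias) and for a dense layer. The helper names are hypothetical, and the input width of 10 features is an assumption based on the 11 columns minus the label.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


n_features = 10  # assumption: 11 columns in df minus 'User Behavior Class'

layer_params = {
    "lstm_64": lstm_params(n_features, 64),  # first LSTM reads the raw features
    "lstm_32": lstm_params(64, 32),          # second LSTM reads 64-dim outputs
    "dense_1": dense_params(32, 1),          # Dropout layers add no parameters
}
total_params = sum(layer_params.values())
```

Under these assumptions the model has roughly 31.6 thousand trainable parameters, which is large relative to a few hundred training windows and is one reason the dropout layers are useful.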

In [38]:
# Train the model
history = model.fit(X_train, y_train, validation_split=0.2, epochs=20, batch_size=32)

Epoch 1/20
14/14 ━━━━━━━━━━━━━━━━━━━━ 3s 27ms/step - loss: 7.7767 - mae: 2.3601 - val_loss: 2.8310 - val_mae: 1.3426
Epoch 2/20
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 2.4345 - mae: 1.3057 - val_loss: 1.7555 - val_mae: 1.1162
Epoch 3/20
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 2.2273 - mae: 1.2935 - val_loss: 1.7520 - val_mae: 1.1135
Epoch 4/20
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 2.2400 - mae: 1.3115 - val_loss: 1.7415 - val_mae: 1.0813
Epoch 5/20
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 2.2628 - mae: 1.3210 - val_loss: 1.7453 - val_mae: 1.0911
Epoch 6/20
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 2.3556 - mae: 1.3396 - val_loss: 1.7448 - val_mae: 1.0902
Epoch 7/20
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 2.2422 

In [39]:
history

<keras.src.callbacks.history.History at 0x22928c3c690>
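
The `14/14` shown in the training log can be reproduced from the data sizes. The sketch below assumes that Keras holds out the last `validation_split` fraction of the training arrays and rounds the fitted portion down, which is its documented behavior. The helper `fit_batches` is hypothetical.

```python
import math


def fit_batches(n_samples, validation_split, batch_size):
    """Return (samples used for fitting, validation samples, batches per epoch)."""
    n_fit = math.floor(n_samples * (1 - validation_split))
    n_val = n_samples - n_fit
    return n_fit, n_val, math.ceil(n_fit / batch_size)


n_windows = 700 - 5                       # rows minus sequence_length
n_test = math.ceil(n_windows * 20 / 100)  # test_size=0.2, rounded up
n_train = n_windows - n_test

n_fit, n_val, steps_per_epoch = fit_batches(n_train, validation_split=0.2, batch_size=32)
predict_batches = math.ceil(n_test / 32)  # model.predict uses batch size 32 by default
```

Under these assumptions only 444 windows are used to fit the weights, with 112 held out for validation and 139 for the final test. The computed batch counts, 14 for training and 5 for prediction, agree with the progress bars in the notebook output.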

In [40]:
# Evaluate the model
y_pred = model.predict(X_test).ravel()  # flatten (n, 1) predictions to shape (n,)
print("RNN MAE:", mean_absolute_error(y_test, y_pred))
print("RNN MSE:", mean_squared_error(y_test, y_pred))

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step
RNN MAE: 1.161002152257686
RNN MSE: 1.9226019653241697
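
To put the two sets of scores side by side, the relative error reduction can be computed directly from the printed values. The helper `relative_reduction` is a hypothetical name; the four numbers are copied from the outputs above.

```python
def relative_reduction(baseline, model):
    """Fraction by which the model's error is lower than the baseline's."""
    return (baseline - model) / baseline


# Values copied from the notebook outputs
baseline_mae, baseline_mse = 1.2734767996149956, 2.177265922659539
rnn_mae, rnn_mse = 1.161002152257686, 1.9226019653241697

mae_gain = relative_reduction(baseline_mae, rnn_mae)  # about 0.088
mse_gain = relative_reduction(baseline_mse, rnn_mse)  # about 0.117
```

These figures come from a single training run on 139 test windows, so they are best read as indicative rather than as a precise estimate of the gap between the two models.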


# Conclusion
On the held-out test set, the LSTM-based model outperformed the baseline linear regression: MAE fell from 1.273 to 1.161 (about 9% lower) and MSE from 2.177 to 1.923 (about 12% lower). The gain is modest but consistent across both metrics, which suggests the LSTM captures some sequential structure that the linear baseline misses. Future work can involve optimizing hyperparameters, incorporating additional features, or exploring advanced architectures to further enhance prediction accuracy.