# Assignment 2

This assignment serves as a comprehensive evaluation of your machine learning skills, encompassing not only the technical aspects of model development but also your ability to analyze, interpret, and present data insights effectively. As such, it's essential to ensure that your submission is complete, functional, and devoid of any obvious gaps, as if you were delivering this project to a client.

To achieve this, leverage the full capabilities of Markdown and the interactive visualization tools available in Jupyter notebooks to craft a well-structured and visually appealing report of your findings. Your report should clearly communicate the insights you've gained from the exploratory data analysis, the rationale behind your data preprocessing and feature engineering decisions, and a thorough analysis of feature importance. High-quality visualizations and well-organized documentation will not only support your analysis but also make your results more accessible and understandable to your audience.

Remember, the ability to present complex results in an intuitive and engaging manner is a crucial skill, almost as important as the technical proficiency in model building and data analysis. Treat this assignment as an opportunity to showcase your skills in both areas.

## Instructions
- Your submission should be a `.ipynb` file with your name,
  like `FirstnameLastname.ipynb`. It should include the answers to the questions in markdown cells, your data analysis and results.
- You are expected to follow the best practices for code writing and model
training. Poor coding style will be penalized.
- You are allowed to discuss ideas with your peers, but no sharing of code.
Plagiarism in the code will result in failing. If you use code from the
internet, cite it by adding the source of the code as a comment in the first line of the code cell. [Academic misconduct policy](https://wiki.innopolis.university/display/DOE/Academic+misconduct+policy)
- In real life clients can give unclear goals or requirements. So, if the instructions seem vague, use common sense to make reasonable assumptions and decisions.

## Self-Reliance and Exploration
In this task, you're encouraged to rely on your resourcefulness and creativity. Dive into available resources, experiment with various solutions, and learn from every outcome. While our team is here to clarify task details and offer conceptual guidance, we encourage you to first seek answers independently. This approach is vital for developing your problem-solving skills in machine learning.



# Task 1: [Where's Waldo?](https://www.wikihow.com/Find-Waldo) (50%)

## Fingerprinting
Browser fingerprinting is a technique used to identify and track individuals based on unique characteristics of their web browser configuration. These characteristics can include the browser type, version, installed plugins, and screen resolution, among others. By combining these attributes, websites can create a digital fingerprint that can be used to track user behavior across multiple sites, even if they clear their cookies or use different devices. This has raised concerns about privacy and the potential for this technology to be used for targeted advertising, surveillance, and other purposes.

[Read more about Fingerprinting](https://datadome.co/learning-center/browser-fingerprinting-techniques/)


## What You Need to Do
In this task, you are required to employ a fully connected feed-forward Artificial Neural Network (ANN) to tackle a classification problem. This involves several key steps, each critical to the development and performance of your model:

- **Exploratory Data Analysis (EDA) (10%)**: Begin by conducting a thorough exploratory analysis of the provided dataset. Your goal here is to uncover patterns, anomalies, relationships, or trends that could influence your modeling decisions. **Share the insights** you gather from this process and explain how they informed your subsequent steps.
  
- **Data Preprocessing and Feature Engineering (10%)**: Based on your EDA insights, choose and implement the most appropriate data preprocessing steps and feature engineering techniques. This may include handling missing values, encoding categorical variables, normalizing data, and creating new features that could enhance your model's ability to learn from the data.
  
- **Model Design and Training (10%)**: Design a fully connected feed-forward ANN model. You will need to experiment with different architectures, layer configurations, and hyperparameters to find the most effective solution for the classification problem at hand.

- **Feature Importance Analysis (10%)**: After developing your model, analyze which features are most important for making predictions. Discuss how this analysis aligns with your initial EDA insights and what it reveals about the characteristics most indicative of specific user behaviors or identities.

- **Evaluation (10%)**: You will be required to submit your model prediction on a hidden data set.

### Data
You will be using the data in `Task_1.json` to identify Waldo (`user_id=0`). The dataset includes:
- **"browser", "os" and "locale"**: Information about the software used.
- **"user_id"**: A unique identifier for each user.
- **"location"**: Geolocation based on the IP address used.
- **"sites"**: A list of visited URLs and the time spent there in seconds.
- **"time" and "date"**: When the session started in GMT.


### Evaluation
After training, evaluate your model by printing the classification report on your test set. Then, predict whether each user in `task_1_verify.json` is Waldo or not, by adding the boolean `is_waldo` property to the `task_1_verify.json`:

```diff
  [
    {
+     "is_waldo": false,
      "browser": "Chrome",
      "os": "Debian",
      "locale": "ur-PK",
      "location": "Russia/Moscow",
      "sites": [
          // ...
      ],
      "time": "04:12:00",
      "date":"2017-06-29"
    }
    // ...
  ]

```

## Learning Objectives

- **Exploratory Data Analysis**: Apply suitable analysis techniques to gain insights and better understand the dataset.
- **Classification Approach**: Identify the most appropriate method for the given problem.
- **Data Preprocessing**: Select and execute proper preprocessing and encoding techniques.
- **Model Implementation**: Utilize ANNs to address a classification problem, including training, validation, and testing phases.
- **Feature Importance Analysis**: Determine and report which features are most critical for the model's predictions to uncover insights into specific user behaviors.

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [53]:
df = pd.read_json('task_1_train_data.json')
df.head()

Unnamed: 0,browser,os,locale,user_id,location,sites,time,date
0,Chrome,Debian,ur-PK,116,Russia/Moscow,"[{'site': 'bing.net', 'length': 52}, {'site': ...",04:12:00,2017-06-29
1,Firefox,Windows 8,uk-UA,155,France/Paris,"[{'site': 'yahoo.com', 'length': 46}, {'site':...",03:57:00,2016-03-23
2,Safari,MacOS,fr-FR,39,Japan/Tokyo,"[{'site': 'oracle.com', 'length': 335}]",05:26:00,2016-11-17
3,Chrome,Windows 8,nl-NL,175,Australia/Sydney,"[{'site': 'mail.google.com', 'length': 192}, {...",00:05:00,2016-08-23
4,Firefox,Ubuntu,ro-RO,50,USA/San Francisco,"[{'site': 'mail.google.com', 'length': 266}, {...",22:55:00,2016-07-23


## Exploratory Data Analysis

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40000 entries, 0 to 39999
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   browser   40000 non-null  object        
 1   os        40000 non-null  object        
 2   locale    40000 non-null  object        
 3   user_id   40000 non-null  int64         
 4   location  40000 non-null  object        
 5   sites     40000 non-null  object        
 6   time      40000 non-null  object        
 7   date      40000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(6)
memory usage: 2.7+ MB


In [55]:
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40000 entries, 0 to 39999
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   browser   40000 non-null  object        
 1   os        40000 non-null  object        
 2   locale    40000 non-null  object        
 3   user_id   40000 non-null  int64         
 4   location  40000 non-null  object        
 5   sites     40000 non-null  object        
 6   time      40000 non-null  datetime64[ns]
 7   date      40000 non-null  datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(5)
memory usage: 2.7+ MB


In [56]:
df.describe()

Unnamed: 0,user_id,time,date
count,40000.0,40000,40000
mean,99.5,1900-01-01 11:40:01.321499904,2016-11-09 05:19:04.079999744
min,0.0,1900-01-01 00:00:00,2016-01-14 00:00:00
25%,49.75,1900-01-01 05:54:00,2016-06-12 00:00:00
50%,99.5,1900-01-01 11:32:00,2016-11-09 00:00:00
75%,149.25,1900-01-01 17:01:00,2017-04-09 00:00:00
max,199.0,1900-01-01 23:59:00,2017-09-28 00:00:00
std,57.735027,,


In [57]:
df['num_sites_visited'] = df['sites'].apply(lambda x: len(x))

site_lengths = []
for sites_list in df['sites']:
    for site in sites_list:
        site_lengths.append(site['length'])

site_lengths_series = pd.Series(site_lengths)
site_length_distribution = site_lengths_series.describe()

site_counts = {}
for sites_list in df['sites']:
    for site in sites_list:
        site_name = site['site']
        site_counts[site_name] = site_counts.get(site_name, 0) + 1

frequent_sites = pd.Series(site_counts).sort_values(ascending=False)

print("Number of Sites Visited:")
print(df['num_sites_visited'])

print("\nDistribution of Site Lengths:")
print(site_length_distribution)

print("\nFrequently Visited Sites:")
print(frequent_sites)

print("\nNumber of sites:")
print(len(site_counts))

Number of Sites Visited:
0        10
1         8
2         1
3         3
4         8
         ..
39995    10
39996     6
39997     4
39998     3
39999     6
Name: num_sites_visited, Length: 40000, dtype: int64

Distribution of Site Lengths:
count    300578.000000
mean        129.723985
std          90.306291
min          40.000000
25%          66.000000
50%         102.000000
75%         165.000000
max        1185.000000
dtype: float64

Frequently Visited Sites:
youtube.com          14517
toptal.com           10943
slack.com            10758
lenta.ru              9255
vk.com                8744
                     ...  
shaushka.com             2
ac-toulouse.fr           1
styleblazer.com          1
directe.cat              1
grumpybumpers.com        1
Length: 11129, dtype: int64

Number of sites:
11129


There are 11129 websites, so any kind of encoding would not be appropriate. I propose changing this feature to percent of sites that belong to sites visited by waldo

In [58]:
waldo = df[df['user_id'] == 0]
other_users_data = df[df['user_id'] != 0]
waldo.num_sites_visited.max()

12

In [59]:
df.locale.nunique()

25

## Data Preprocessing and Feature Engineering

In [60]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'locale' column
df['locale_encoded'] = label_encoder.fit_transform(df['locale'])
df.drop('locale', axis=1, inplace=True)
df['location_encoded'] = label_encoder.fit_transform(df['location'])
df.drop('location', axis=1, inplace=True)
# Display the encoded DataFrame
print(df)


       browser          os  user_id  \
0       Chrome      Debian      116   
1      Firefox   Windows 8      155   
2       Safari       MacOS       39   
3       Chrome   Windows 8      175   
4      Firefox      Ubuntu       50   
...        ...         ...      ...   
39995   Chrome  Windows 10      184   
39996   Chrome  Windows 10      181   
39997   Safari       MacOS      112   
39998   Safari       MacOS      136   
39999  Firefox   Windows 8      190   

                                                   sites                time  \
0      [{'site': 'bing.net', 'length': 52}, {'site': ... 1900-01-01 04:12:00   
1      [{'site': 'yahoo.com', 'length': 46}, {'site':... 1900-01-01 03:57:00   
2                [{'site': 'oracle.com', 'length': 335}] 1900-01-01 05:26:00   
3      [{'site': 'mail.google.com', 'length': 192}, {... 1900-01-01 00:05:00   
4      [{'site': 'mail.google.com', 'length': 266}, {... 1900-01-01 22:55:00   
...                                                

In [61]:
import pandas as pd

one_hot_encoded = pd.get_dummies(df['browser'], prefix='browser', dtype=float)

df = pd.concat([df, one_hot_encoded], axis=1)

df.drop('browser', axis=1, inplace=True)

print(df)


               os  user_id                                              sites  \
0          Debian      116  [{'site': 'bing.net', 'length': 52}, {'site': ...   
1       Windows 8      155  [{'site': 'yahoo.com', 'length': 46}, {'site':...   
2           MacOS       39            [{'site': 'oracle.com', 'length': 335}]   
3       Windows 8      175  [{'site': 'mail.google.com', 'length': 192}, {...   
4          Ubuntu       50  [{'site': 'mail.google.com', 'length': 266}, {...   
...           ...      ...                                                ...   
39995  Windows 10      184  [{'site': 'airbnb.com', 'length': 96}, {'site'...   
39996  Windows 10      181  [{'site': 'lenta.ru', 'length': 84}, {'site': ...   
39997       MacOS      112  [{'site': 'toptal.com', 'length': 65}, {'site'...   
39998       MacOS      136  [{'site': 'yworks.com', 'length': 146}, {'site...   
39999   Windows 8      190  [{'site': 'vk.com', 'length': 43}, {'site': 's...   

                     time  

In [62]:
df.os.nunique()

6

In [63]:
one_hot_encoded = pd.get_dummies(df['os'], prefix='os', dtype=float)

# Concatenate the one-hot encoded DataFrame with the original DataFrame
df = pd.concat([df, one_hot_encoded], axis=1)

# Drop the original 'locale' column
df.drop('os', axis=1, inplace=True)

# Display the DataFrame with one-hot encoded columns
print(df)


       user_id                                              sites  \
0          116  [{'site': 'bing.net', 'length': 52}, {'site': ...   
1          155  [{'site': 'yahoo.com', 'length': 46}, {'site':...   
2           39            [{'site': 'oracle.com', 'length': 335}]   
3          175  [{'site': 'mail.google.com', 'length': 192}, {...   
4           50  [{'site': 'mail.google.com', 'length': 266}, {...   
...        ...                                                ...   
39995      184  [{'site': 'airbnb.com', 'length': 96}, {'site'...   
39996      181  [{'site': 'lenta.ru', 'length': 84}, {'site': ...   
39997      112  [{'site': 'toptal.com', 'length': 65}, {'site'...   
39998      136  [{'site': 'yworks.com', 'length': 146}, {'site...   
39999      190  [{'site': 'vk.com', 'length': 43}, {'site': 's...   

                     time       date  num_sites_visited  locale_encoded  \
0     1900-01-01 04:12:00 2017-06-29                 10              20   
1     1900-01-01 03:5

In [64]:
sites_set = set()
for site in df.sites[df.user_id == 0]:
    for s in site:
        sites_set.add(s['site'])

In [65]:
df.num_sites_visited[df.user_id == 0].min()

2

In [66]:
def check(sites):
    count = 0
    for s in sites:
        if s['site'] in sites_set:
            count += 1
    if len(sites) == 0:
        return 0
    return count/len(sites)
df['percent'] = df['sites'].apply(lambda x: check(x))
df.drop('sites', axis=1, inplace=True)

In [67]:
print(df)

       user_id                time       date  num_sites_visited  \
0          116 1900-01-01 04:12:00 2017-06-29                 10   
1          155 1900-01-01 03:57:00 2016-03-23                  8   
2           39 1900-01-01 05:26:00 2016-11-17                  1   
3          175 1900-01-01 00:05:00 2016-08-23                  3   
4           50 1900-01-01 22:55:00 2016-07-23                  8   
...        ...                 ...        ...                ...   
39995      184 1900-01-01 06:45:00 2016-04-03                 10   
39996      181 1900-01-01 20:57:00 2016-12-28                  6   
39997      112 1900-01-01 04:12:00 2016-07-26                  4   
39998      136 1900-01-01 10:18:00 2017-01-01                  3   
39999      190 1900-01-01 04:55:00 2017-03-09                  6   

       locale_encoded  location_encoded  browser_Chrome  browser_Firefox  \
0                  20                13             1.0              0.0   
1                  19          

In [68]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


location_column = df['location_encoded'].values.reshape(-1, 1)

scaler = StandardScaler()

scaler.fit(location_column)

scaled_location = scaler.transform(location_column)

df['location_encoded'] = scaled_location
location_column = df['locale_encoded'].values.reshape(-1, 1)

scaler = StandardScaler()

scaler.fit(location_column)

scaled_location = scaler.transform(location_column)

df['locale_encoded'] = scaled_location
print(df)

       user_id                time       date  num_sites_visited  \
0          116 1900-01-01 04:12:00 2017-06-29                 10   
1          155 1900-01-01 03:57:00 2016-03-23                  8   
2           39 1900-01-01 05:26:00 2016-11-17                  1   
3          175 1900-01-01 00:05:00 2016-08-23                  3   
4           50 1900-01-01 22:55:00 2016-07-23                  8   
...        ...                 ...        ...                ...   
39995      184 1900-01-01 06:45:00 2016-04-03                 10   
39996      181 1900-01-01 20:57:00 2016-12-28                  6   
39997      112 1900-01-01 04:12:00 2016-07-26                  4   
39998      136 1900-01-01 10:18:00 2017-01-01                  3   
39999      190 1900-01-01 04:55:00 2017-03-09                  6   

       locale_encoded  location_encoded  browser_Chrome  browser_Firefox  \
0            1.076578          0.544151             1.0              0.0   
1            0.933660         -

In [69]:
from sklearn.preprocessing import MinMaxScaler
df_normalized = pd.DataFrame()
df_normalized['hour'] = df['time'].dt.hour

scaler = MinMaxScaler()
df['time'] = pd.DataFrame(scaler.fit_transform(df_normalized[['hour']]), columns=['hour'])


print(df)

       user_id      time       date  num_sites_visited  locale_encoded  \
0          116  0.173913 2017-06-29                 10        1.076578   
1          155  0.130435 2016-03-23                  8        0.933660   
2           39  0.217391 2016-11-17                  1       -0.495523   
3          175  0.000000 2016-08-23                  3       -0.066768   
4           50  0.956522 2016-07-23                  8        0.504905   
...        ...       ...        ...                ...             ...   
39995      184  0.260870 2016-04-03                 10        1.362414   
39996      181  0.869565 2016-12-28                  6        0.790742   
39997      112  0.173913 2016-07-26                  4       -1.210114   
39998      136  0.434783 2017-01-01                  3        0.219069   
39999      190  0.173913 2017-03-09                  6       -0.638441   

       location_encoded  browser_Chrome  browser_Firefox  \
0              0.544151             1.0            

In [70]:
X = df.drop(['user_id', 'date', 'num_sites_visited'], axis=1)  
y = (df['user_id'] == 0).astype(int)  

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42)

In [71]:
import pandas as pd
import numpy as np

user_id_0 = df[df.user_id==0]

num_rows_user_id_0 = len(user_id_0)

num_random_users = 2 * num_rows_user_id_0

random_users = df[df['user_id'] != 0].sample(n=num_random_users, random_state=42)

result = pd.concat([user_id_0, random_users])

result_shuffled = result.sample(frac=1.0, random_state=42)

X = result_shuffled.drop(['user_id', 'date', 'num_sites_visited'], axis=1)  
y = (result_shuffled['user_id'] == 0).astype(int)  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Design and Training

In [72]:
import torch.nn as nn
import torch.nn.functional as F

class ANN(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(ANN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size1)
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        self.fc3 = nn.Linear(hidden_size2, output_size)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=1)
        return x

# Define the model
input_size = X_train.shape[1]
hidden_size1 = 64
hidden_size2 = 32
output_size = len(np.unique(y))
ann = ANN(input_size, hidden_size1, hidden_size2, output_size)

# Print the model architecture
print(ann)


ANN(
  (fc1): Linear(in_features=14, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=2, bias=True)
)


In [73]:
import torch
import torch.optim as optim

X_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_tensor = torch.tensor(y_train.values, dtype=torch.long)
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(ann.parameters(), lr=0.001)

# Train the model
num_epochs = 200
for epoch in range(num_epochs):
    # Forward pass
    outputs = ann(X_tensor)
    loss = criterion(outputs, y_tensor)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')



Epoch [1/200], Loss: 0.7150
Epoch [2/200], Loss: 0.7115
Epoch [3/200], Loss: 0.7080
Epoch [4/200], Loss: 0.7045
Epoch [5/200], Loss: 0.7010
Epoch [6/200], Loss: 0.6975
Epoch [7/200], Loss: 0.6940
Epoch [8/200], Loss: 0.6905
Epoch [9/200], Loss: 0.6870
Epoch [10/200], Loss: 0.6834
Epoch [11/200], Loss: 0.6798
Epoch [12/200], Loss: 0.6762
Epoch [13/200], Loss: 0.6726
Epoch [14/200], Loss: 0.6689
Epoch [15/200], Loss: 0.6651
Epoch [16/200], Loss: 0.6613
Epoch [17/200], Loss: 0.6573
Epoch [18/200], Loss: 0.6533
Epoch [19/200], Loss: 0.6491
Epoch [20/200], Loss: 0.6449
Epoch [21/200], Loss: 0.6406
Epoch [22/200], Loss: 0.6363
Epoch [23/200], Loss: 0.6318
Epoch [24/200], Loss: 0.6273
Epoch [25/200], Loss: 0.6226
Epoch [26/200], Loss: 0.6179
Epoch [27/200], Loss: 0.6131
Epoch [28/200], Loss: 0.6081
Epoch [29/200], Loss: 0.6031
Epoch [30/200], Loss: 0.5981
Epoch [31/200], Loss: 0.5929
Epoch [32/200], Loss: 0.5877
Epoch [33/200], Loss: 0.5824
Epoch [34/200], Loss: 0.5771
Epoch [35/200], Loss: 0

In [74]:
X_tensor = torch.tensor(X_test1.values, dtype=torch.float32)
y_tensor = torch.tensor(y_test1.values, dtype=torch.long)
with torch.no_grad():
    outputs = ann(X_tensor)
    _, predicted = torch.max(outputs, 1)
    accuracy = (predicted == y_tensor).sum().item() / len(y_tensor)
    print(f'Test Accuracy: {accuracy:.4f}')

Test Accuracy: 0.9781


In [75]:
from sklearn.metrics import precision_score, recall_score, f1_score

with torch.no_grad():
    outputs = ann(X_tensor)
    _, predicted = torch.max(outputs, 1)

predicted = predicted.numpy()

y_true = y_tensor.numpy()

precision = precision_score(y_true, predicted, average='macro')
recall = recall_score(y_true, predicted, average='macro')
f1 = f1_score(y_true, predicted, average='macro')

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')


Precision: 0.5853
Recall: 0.9890
F1 Score: 0.6402


## Feature Importance Analysis

In [76]:
from sklearn.inspection import permutation_importance

weights = ann.fc1.weight.detach().numpy()

feature_importance = np.abs(weights).mean(axis=0)

feature_names = X_train.columns

sorted_indices = feature_importance.argsort()[::-1]

print("Feature Importance:")
for idx in sorted_indices:
    print(f"{feature_names[idx]}: {feature_importance[idx]}")


Feature Importance:
percent: 0.23134776949882507
locale_encoded: 0.20508535206317902
browser_Internet Explorer: 0.18107588589191437
os_Ubuntu: 0.16903850436210632
os_Windows 7: 0.1639598309993744
os_Debian: 0.15968242287635803
os_Windows 8: 0.15960586071014404
browser_Chrome: 0.15612007677555084
browser_Safari: 0.1509840488433838
os_Windows 10: 0.14034917950630188
os_MacOS: 0.13916754722595215
browser_Firefox: 0.1361299753189087
time: 0.12159864604473114
location_encoded: 0.10722179710865021


## Evaluation

In [100]:
df_ver = pd.read_json('task_1_verify.json')
df_final = df_ver
df_ver.head()

Unnamed: 0,browser,os,locale,location,sites,time,date
0,Internet Explorer,Windows 8,xh-ZA,France/Paris,"[{'site': 'baidu.com', 'length': 201}, {'site'...",14:13:00,2016-11-05
1,Chrome,Windows 10,ja-JP,Germany/Berlin,"[{'site': 'toptal.com', 'length': 96}, {'site'...",21:06:00,2017-02-22
2,Chrome,Windows 10,it-IT,Singapore/Singapore,"[{'site': 'bing.net', 'length': 225}, {'site':...",13:17:00,2016-01-30
3,Chrome,Windows 10,ur-PK,UK/London,"[{'site': 'google.com', 'length': 113}, {'site...",17:00:00,2017-02-27
4,Firefox,Ubuntu,en-CA,Russia/Moscow,"[{'site': 'googleapis.com', 'length': 243}, {'...",18:11:00,2017-04-19


In [101]:
df_ver['time'] = pd.to_datetime(df_ver['time'], format='%H:%M:%S')
df_ver.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40000 entries, 0 to 39999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   browser   40000 non-null  object        
 1   os        40000 non-null  object        
 2   locale    40000 non-null  object        
 3   location  40000 non-null  object        
 4   sites     40000 non-null  object        
 5   time      40000 non-null  datetime64[ns]
 6   date      40000 non-null  datetime64[ns]
dtypes: datetime64[ns](2), object(5)
memory usage: 2.4+ MB


In [102]:
label_encoder = LabelEncoder()

df_ver['locale_encoded'] = label_encoder.fit_transform(df_ver['locale'])
df_ver.drop('locale', axis=1, inplace=True)
df_ver['location_encoded'] = label_encoder.fit_transform(df_ver['location'])
df_ver.drop('location', axis=1, inplace=True)
print(df_ver)

                 browser          os  \
0      Internet Explorer   Windows 8   
1                 Chrome  Windows 10   
2                 Chrome  Windows 10   
3                 Chrome  Windows 10   
4                Firefox      Ubuntu   
...                  ...         ...   
39995            Firefox  Windows 10   
39996             Chrome  Windows 10   
39997             Chrome   Windows 8   
39998             Chrome  Windows 10   
39999            Firefox   Windows 8   

                                                   sites                time  \
0      [{'site': 'baidu.com', 'length': 201}, {'site'... 1900-01-01 14:13:00   
1      [{'site': 'toptal.com', 'length': 96}, {'site'... 1900-01-01 21:06:00   
2      [{'site': 'bing.net', 'length': 225}, {'site':... 1900-01-01 13:17:00   
3      [{'site': 'google.com', 'length': 113}, {'site... 1900-01-01 17:00:00   
4      [{'site': 'googleapis.com', 'length': 243}, {'... 1900-01-01 18:11:00   
...                                    

In [103]:
one_hot_encoded = pd.get_dummies(df_ver['os'], prefix='os', dtype=float)

# Concatenate the one-hot encoded DataFrame with the original DataFrame
df_ver = pd.concat([df_ver, one_hot_encoded], axis=1)

# Drop the original 'locale' column
df_ver.drop('os', axis=1, inplace=True)

# Display the DataFrame with one-hot encoded columns
print(df_ver)


                 browser                                              sites  \
0      Internet Explorer  [{'site': 'baidu.com', 'length': 201}, {'site'...   
1                 Chrome  [{'site': 'toptal.com', 'length': 96}, {'site'...   
2                 Chrome  [{'site': 'bing.net', 'length': 225}, {'site':...   
3                 Chrome  [{'site': 'google.com', 'length': 113}, {'site...   
4                Firefox  [{'site': 'googleapis.com', 'length': 243}, {'...   
...                  ...                                                ...   
39995            Firefox  [{'site': 'instagram.com', 'length': 170}, {'s...   
39996             Chrome  [{'site': 'youtube.com', 'length': 55}, {'site...   
39997             Chrome  [{'site': 'mail.google.com', 'length': 178}, {...   
39998             Chrome  [{'site': 'toptal.com', 'length': 89}, {'site'...   
39999            Firefox  [{'site': 'toptal.com', 'length': 187}, {'site...   

                     time       date  locale_encode

In [104]:
one_hot_encoded = pd.get_dummies(df_ver['browser'], prefix='browser', dtype=float)

df_ver = pd.concat([df_ver, one_hot_encoded], axis=1)

df_ver.drop('browser', axis=1, inplace=True)

print(df_ver)

                                                   sites                time  \
0      [{'site': 'baidu.com', 'length': 201}, {'site'... 1900-01-01 14:13:00   
1      [{'site': 'toptal.com', 'length': 96}, {'site'... 1900-01-01 21:06:00   
2      [{'site': 'bing.net', 'length': 225}, {'site':... 1900-01-01 13:17:00   
3      [{'site': 'google.com', 'length': 113}, {'site... 1900-01-01 17:00:00   
4      [{'site': 'googleapis.com', 'length': 243}, {'... 1900-01-01 18:11:00   
...                                                  ...                 ...   
39995  [{'site': 'instagram.com', 'length': 170}, {'s... 1900-01-01 14:12:00   
39996  [{'site': 'youtube.com', 'length': 55}, {'site... 1900-01-01 18:49:00   
39997  [{'site': 'mail.google.com', 'length': 178}, {... 1900-01-01 15:58:00   
39998  [{'site': 'toptal.com', 'length': 89}, {'site'... 1900-01-01 05:30:00   
39999  [{'site': 'toptal.com', 'length': 187}, {'site... 1900-01-01 11:32:00   

            date  locale_encoded  locat

In [105]:
def check(sites):
    count = 0
    for s in sites:
        if s['site'] in sites_set:
            count += 1
    if len(sites) == 0:
        return 0
    return count/len(sites)
df_ver['percent'] = df_ver['sites'].apply(lambda x: check(x))
df_ver.drop('sites', axis=1, inplace=True)

In [106]:
location_column = df_ver['location_encoded'].values.reshape(-1, 1)

scaler = StandardScaler()

scaler.fit(location_column)

scaled_location = scaler.transform(location_column)

df_ver['location_encoded'] = scaled_location
location_column = df_ver['locale_encoded'].values.reshape(-1, 1)

scaler = StandardScaler()

scaler.fit(location_column)

scaled_location = scaler.transform(location_column)

df_ver['locale_encoded'] = scaled_location
print(df_ver)

                     time       date  locale_encoded  location_encoded  \
0     1900-01-01 14:13:00 2016-11-05        1.503872         -0.815392   
1     1900-01-01 21:06:00 2017-02-22       -0.210728         -0.645061   
2     1900-01-01 13:17:00 2016-01-30       -0.353611          0.717587   
3     1900-01-01 17:00:00 2017-02-27        1.075222          1.058249   
4     1900-01-01 18:11:00 2017-04-19       -1.353795          0.547256   
...                   ...        ...             ...               ...   
39995 1900-01-01 14:12:00 2016-08-27       -1.496678         -0.474730   
39996 1900-01-01 18:49:00 2016-07-12       -0.782261         -0.304399   
39997 1900-01-01 15:58:00 2016-06-22       -0.925145          1.228580   
39998 1900-01-01 05:30:00 2016-05-19        1.503872          1.058249   
39999 1900-01-01 11:32:00 2017-07-23       -1.210912         -1.326385   

       os_Debian  os_MacOS  os_Ubuntu  os_Windows 10  os_Windows 7  \
0            0.0       0.0        0.0    

In [107]:
df_normalized['time'] = df_ver['time'].dt.hour

#ponents into a single column
scaler = MinMaxScaler()
df_ver['time'] = pd.DataFrame(scaler.fit_transform(df_normalized[['hour']]), columns=['hour'])
df_ver.drop('date', axis=1, inplace=True)
print(df_ver)

           time  locale_encoded  location_encoded  os_Debian  os_MacOS  \
0      0.173913        1.503872         -0.815392        0.0       0.0   
1      0.130435       -0.210728         -0.645061        0.0       0.0   
2      0.217391       -0.353611          0.717587        0.0       0.0   
3      0.000000        1.075222          1.058249        0.0       0.0   
4      0.956522       -1.353795          0.547256        0.0       0.0   
...         ...             ...               ...        ...       ...   
39995  0.260870       -1.496678         -0.474730        0.0       0.0   
39996  0.869565       -0.782261         -0.304399        0.0       0.0   
39997  0.173913       -0.925145          1.228580        0.0       0.0   
39998  0.434783        1.503872          1.058249        0.0       0.0   
39999  0.173913       -1.210912         -1.326385        0.0       0.0   

       os_Ubuntu  os_Windows 10  os_Windows 7  os_Windows 8  browser_Chrome  \
0            0.0            0.0 

In [108]:
# Assuming df_ver.values is your NumPy array
print(df_ver.values.dtype)

# Convert the array to a supported data type (e.g., float32)
X_verify = torch.tensor(df_ver.values.astype(np.float32), dtype=torch.float32)


float64


In [109]:
outputs = ann(X_verify)
outputs_detached = outputs.detach().numpy()
outputs

tensor([[9.9999e-01, 1.4366e-05],
        [9.6714e-01, 3.2860e-02],
        [9.9980e-01, 2.0002e-04],
        ...,
        [9.9734e-01, 2.6588e-03],
        [8.4476e-01, 1.5524e-01],
        [9.9935e-01, 6.5476e-04]], grad_fn=<SoftmaxBackward0>)

In [110]:
_, predicted = torch.max(outputs, 1)
count = 0
for i in predicted:
    if i:
        df_final['is_waldo'] = True
    else:
        df_final['is_waldo'] = False

In [111]:
df_final.head()

Unnamed: 0,browser,os,sites,time,date,locale_encoded,location_encoded,is_waldo
0,Internet Explorer,Windows 8,"[{'site': 'baidu.com', 'length': 201}, {'site'...",1900-01-01 14:13:00,2016-11-05,23,5,False
1,Chrome,Windows 10,"[{'site': 'toptal.com', 'length': 96}, {'site'...",1900-01-01 21:06:00,2017-02-22,11,6,False
2,Chrome,Windows 10,"[{'site': 'bing.net', 'length': 225}, {'site':...",1900-01-01 13:17:00,2016-01-30,10,14,False
3,Chrome,Windows 10,"[{'site': 'google.com', 'length': 113}, {'site...",1900-01-01 17:00:00,2017-02-27,20,16,False
4,Firefox,Ubuntu,"[{'site': 'googleapis.com', 'length': 243}, {'...",1900-01-01 18:11:00,2017-04-19,3,13,False


In [112]:
df_final.to_json('task1_verify.json', orient='records', indent=2)