# 📈 Search Queries Anomaly Detection

In this project, we focus on **Search Queries Anomaly Detection**, a technique used to identify unusual or unexpected patterns within search query data.  
Anomalies can represent significant deviations from normal behavior, such as sudden spikes or drops in clicks, impressions, or CTR, and detecting them early can provide valuable insights for improving search engine optimization (SEO), marketing strategies, or system monitoring.

Using Python and machine learning techniques, particularly the **Isolation Forest algorithm**, we systematically:

- Clean and preprocess search queries,
- Analyze and visualize key search metrics,
- Detect outliers that may indicate issues or opportunities.

This project was inspired by and based on the excellent work from [The Clever Programmer](https://thecleverprogrammer.com/2023/11/20/search-queries-anomaly-detection-using-python/), and serves as my personal implementation and learning exercise.

In [1]:
#Importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from collections import Counter
import re
import plotly.express as px
import plotly.io as pio
pio.templates.default = 'plotly_white'
from sklearn.ensemble import IsolationForest

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/queries/Queries.csv


In [2]:
#Load dataset
query_data = pd.read_csv("/kaggle/input/queries/Queries.csv")

In [3]:
print(query_data.head())

                                 Top queries  Clicks  Impressions     CTR  \
0                number guessing game python    5223        14578  35.83%   
1                        thecleverprogrammer    2809         3456  81.28%   
2           python projects with source code    2077        73380   2.83%   
3  classification report in machine learning    2012         4959  40.57%   
4                      the clever programmer    1931         2528  76.38%   

   Position  
0      1.61  
1      1.02  
2      5.94  
3      1.28  
4      1.09  


# Exploratory Data Analysis

In [4]:
print(query_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   object 
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB
None


Now, let’s convert the CTR (click-through rate) column from a percentage string to a float between 0 and 1:

In [5]:
query_data['CTR'] = query_data['CTR'].str.rstrip('%').astype('float')/100
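
As a quick sanity check on this conversion, the sketch below parses the CTR strings from the first three rows printed earlier and compares them with Clicks divided by Impressions. The helper name `parse_ctr` and the tolerance are illustrative assumptions, not part of the original notebook.

```python
# Rows copied from the query_data.head() output: (clicks, impressions, CTR string)
sample_rows = [
    (5223, 14578, "35.83%"),
    (2809, 3456, "81.28%"),
    (2077, 73380, "2.83%"),
]

def parse_ctr(ctr_text):
    """Hypothetical helper: turn a string such as '35.83%' into a fraction (0.3583)."""
    return float(ctr_text.rstrip("%")) / 100

parsed = [parse_ctr(ctr) for _, _, ctr in sample_rows]
recomputed = [clicks / impressions for clicks, impressions, _ in sample_rows]

# The reported CTR is rounded to two decimals (in percent), so allow a small tolerance.
for reported, computed in zip(parsed, recomputed):
    assert abs(reported - computed) < 1e-4
    print(f"reported={reported:.4f}  clicks/impressions={computed:.4f}")
```

If the two values disagreed by more than the rounding error, that would suggest the CTR column was parsed incorrectly.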

Now, let's analyze the most common words across the search queries:

In [6]:
#Function to clean and split the queries into words
def clean_and_split(query):
    words = re.findall(r'\b[a-zA-Z]+\b', query.lower())
    return words
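
To see what the tokenizer keeps and drops, here is a small usage sketch. It redefines the function with the character class `[a-zA-Z]` so the block runs on its own; the sample queries are taken from outputs shown in this notebook. Tokens that contain digits, such as `r2`, are skipped entirely because `\b` requires a word boundary on both sides of the match.

```python
import re
from collections import Counter

def clean_and_split(query):
    # Lowercase the query, then keep only purely alphabetic tokens
    return re.findall(r'\b[a-zA-Z]+\b', query.lower())

sample_queries = ["number guessing game python", "r2 score", "python turtle"]

word_counts = Counter()
for query in sample_queries:
    word_counts.update(clean_and_split(query))

print(clean_and_split("r2 score"))   # 'r2' contains a digit, so only 'score' survives
print(word_counts.most_common(1))    # 'python' appears in two of the three queries
```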

In [9]:
#Split each query into words and count the frequency of each word
word_counts = Counter()
for query in query_data['Top queries']:
    word_counts.update(clean_and_split(query))
                                      
word_freq_df = pd.DataFrame(word_counts.most_common(20),columns = ['Word','Frequency'])

#Plotting the word frequencies
fig = px.bar(word_freq_df, x='Word', y='Frequency', title='Top 20 most common words in search queries')
fig.show()

**Let's look at the top queries by clicks and by impressions**

In [10]:
top_queries_clicks_vis = query_data.nlargest(10, 'Clicks')[['Top queries','Clicks']]
top_queries_impression_vis = query_data.nlargest(10, 'Impressions')[['Top queries','Impressions']]

In [11]:
#Plotting
fig_clicks = px.bar(top_queries_clicks_vis, x='Top queries', y='Clicks', title = 'Top Queries by Click')
fig_impression = px.bar(top_queries_impression_vis, x='Top queries', y='Impressions', title = 'Top Queries by Impressions')
fig_clicks.show()
fig_impression.show()

Let's analyze the queries with the highest and lowest CTR (click-through rate):

In [12]:
top_ctr_vis = query_data.nlargest(10, 'CTR')[['Top queries','CTR']]
bottom_ctr_vis = query_data.nsmallest(10, 'CTR')[['Top queries','CTR']]

In [13]:
#Plotting
fig_top_ctr = px.bar(top_ctr_vis, x= 'Top queries', y='CTR', title ='Top Queries by CTR')
fig_bottom_ctr = px.bar(bottom_ctr_vis, x= 'Top queries', y = 'CTR', title = 'Bottom Queries by CTR')
fig_top_ctr.show()
fig_bottom_ctr.show()

Inspecting the correlation between the different metrics:

In [14]:
# Correlation matrix visualization
correlation_matrix = query_data[['Clicks','Impressions','CTR','Position']].corr()
fig_corr = px.imshow(correlation_matrix, text_auto= True,title= 'Correlation Matrix')
fig_corr.show()

In this correlation matrix (note that a numerically higher Position means a lower ranking in the search results, and that correlation alone does not establish cause and effect):

- Clicks and Impressions are positively correlated: queries with more Impressions tend to have more Clicks.
- Clicks and CTR have a weak positive correlation: queries with more Clicks tend to have a slightly higher click-through rate.
- Clicks and Position are weakly negatively correlated: pages ranked further down the results tend to receive somewhat fewer Clicks.
- Impressions and CTR are negatively correlated: queries with more Impressions tend to have a lower click-through rate.
- Impressions and Position are positively correlated: queries with numerically higher Position values tend to have more Impressions.
- CTR and Position have a strong negative correlation: the further down the results a page ranks, the lower its click-through rate tends to be.
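
For reference, `DataFrame.corr()` computes the Pearson coefficient by default. The sketch below applies that formula by hand to the five rows printed by `query_data.head()`. It only illustrates the calculation: five rows are far too few to reproduce the values in the full correlation matrix. The function name `pearson` is a hypothetical helper.

```python
import math

# Impressions and CTR (as fractions) from the five rows shown by query_data.head()
impressions = [14578, 3456, 73380, 4959, 2528]
ctr = [0.3583, 0.8128, 0.0283, 0.4057, 0.7638]

def pearson(xs, ys):
    """Hypothetical helper: Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

r = pearson(impressions, ctr)
print(f"Pearson r between Impressions and CTR on this sample: {r:.2f}")
```

On this small sample the coefficient is negative, which agrees in sign with the Impressions and CTR relationship described in the list.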

# Detecting Anomalies in Search Queries

Now, let’s detect anomalies in search queries. Several techniques are available for anomaly detection. A simple and effective one is the Isolation Forest algorithm, which does not assume a particular data distribution and scales well to large datasets:

In [15]:
#Selecting relevant features
features = query_data[['Clicks','Impressions','CTR','Position']]

#initializing the model: 100 trees; contamination = 0.01 asks the model to flag roughly 1% of the rows (about 10 of 1000)
isolation_forest = IsolationForest(n_estimators = 100, contamination = 0.01)

In [16]:
#fit the model
isolation_forest.fit(features)



In [17]:
query_data['Anomaly'] = isolation_forest.predict(features)

In [18]:
anomalies = query_data[query_data['Anomaly'] == -1]
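
The fit and predict steps can be tried out on synthetic data to see what the labels and scores look like. The sketch below is a minimal illustration under stated assumptions: the two-column random data, the seed, and the variable names are all made up for the example, and `random_state` is set here only for repeatability (the notebook's model does not set it). `fit_predict` and `decision_function` are standard `IsolationForest` methods.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature table: 990 ordinary rows plus 10 extreme ones
normal_rows = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
extreme_rows = rng.normal(loc=8.0, scale=0.5, size=(10, 2))
X = np.vstack([normal_rows, extreme_rows])

# random_state makes the run repeatable; the notebook's model does not set it
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)          # 1 = inlier, -1 = anomaly
scores = model.decision_function(X)    # negative scores correspond to label -1

n_flagged = int((labels == -1).sum())
most_anomalous = np.argsort(scores)[:5]  # indices of the five lowest scores
print(f"flagged {n_flagged} of {len(X)} rows")
print("most anomalous row indices:", most_anomalous)
```

Sorting by `decision_function` gives a ranking rather than a yes/no label, which can help decide which flagged queries to review first.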

Here’s how to analyze the detected anomalies to understand their nature and whether they represent true outliers or data errors:

In [19]:
print(anomalies[['Top queries', 'Clicks', 'Impressions', 'CTR', 'Position']])

                          Top queries  Clicks  Impressions     CTR  Position
0         number guessing game python    5223        14578  0.3583      1.61
1                 thecleverprogrammer    2809         3456  0.8128      1.02
2    python projects with source code    2077        73380  0.0283      5.94
4               the clever programmer    1931         2528  0.7638      1.09
15         rock paper scissors python    1111        35824  0.0310      7.19
21              classification report     933        39896  0.0234      7.53
34           machine learning roadmap     708        42715  0.0166      8.97
82                           r2 score     367        56322  0.0065      9.33
167               text to handwriting     222        11283  0.0197     28.52
929                     python turtle      52        18228  0.0029     18.75


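The anomalies in our search query data are not necessarily data errors. In the output above they fall into two rough groups: queries with very high Clicks and CTR at top positions (such as `number guessing game python` and the branded queries `thecleverprogrammer` and `the clever programmer`), and queries with many Impressions but a CTR of about 3% or lower (such as `python projects with source code` and `r2 score`). The first group shows where the website already performs strongly; the second points to potential areas for optimization, since those pages are shown often but rarely clicked. Reviewing such anomalies regularly can help in maintaining and growing the website’s relevance and user engagement.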