**Google Search Queries Anomaly Detection**

Search Queries Anomaly Detection means identifying queries that are outliers according to their performance metrics. It is valuable for businesses to spot potential issues or opportunities, such as unexpectedly high or low CTRs.

It is a technique for identifying unusual or unexpected patterns in search query data. Below is the process we can follow for the task of Search Queries Anomaly Detection:

1. Gather historical search query data from the source, such as a search engine or a website’s search functionality.
2. Conduct an initial analysis to understand the distribution of search queries, their frequency, and any noticeable patterns or trends.
3. Create relevant features or attributes from the search query data that can aid in anomaly detection.
4. Choose an appropriate anomaly detection algorithm. Common methods include statistical approaches like Z-score analysis and machine learning algorithms like Isolation Forests or One-Class SVM.
5. Train the selected model on the prepared data.
6. Apply the trained model to the search query data to identify anomalies or outliers.
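
The statistical approach mentioned in the steps above can be sketched with a simple Z-score check. This is only a minimal illustration: the sample data, the `zscore_outliers` helper, and the threshold of 2.0 are assumptions made for the example and are not part of the dataset analyzed in this article.

```python
import pandas as pd

def zscore_outliers(df, column, threshold=3.0):
    """Return rows whose value in `column` lies more than `threshold`
    standard deviations from the column mean (hypothetical helper)."""
    mean = df[column].mean()
    std = df[column].std(ddof=0)  # population standard deviation
    if std == 0:
        return df.iloc[0:0]  # no spread, so nothing can be an outlier
    z = (df[column] - mean) / std
    return df[z.abs() > threshold]

# Made-up sample data: nine ordinary queries and one with far more clicks
sample = pd.DataFrame({
    "query": [f"q{i}" for i in range(10)],
    "Clicks": [10, 12, 11, 9, 10, 13, 11, 10, 12, 500],
})

outliers = zscore_outliers(sample, "Clicks", threshold=2.0)
print(outliers)
```

A Z-score check looks at one metric at a time, which is one reason a multivariate method such as Isolation Forest can be a better fit when Clicks, Impressions, CTR, and Position need to be considered together.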

In [2]:
import pandas as pd
from collections import Counter
import re
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"

queries_df = pd.read_csv("/content/drive/MyDrive/Queries.csv")
print(queries_df.head())

                                 Top queries  Clicks  Impressions     CTR  \
0                number guessing game python    5223        14578  35.83%   
1                        thecleverprogrammer    2809         3456  81.28%   
2           python projects with source code    2077        73380   2.83%   
3  classification report in machine learning    2012         4959  40.57%   
4                      the clever programmer    1931         2528  76.38%   

   Position  
0      1.61  
1      1.02  
2      5.94  
3      1.28  
4      1.09  


**Exploratory Data Analysis**

In [3]:
# info() prints its summary directly and returns None, so no print() is needed
queries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   object 
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB


The CTR column is stored as a percentage string (dtype `object`), so let’s convert it to a float between 0 and 1:

In [4]:
# Cleaning CTR column
queries_df['CTR'] = queries_df['CTR'].str.rstrip('%').astype('float') / 100
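
If the CTR column might contain stray whitespace or unparseable entries, a more defensive conversion could look like the sketch below. The `parse_ctr` helper and the sample values are hypothetical and only illustrate the idea; the dataset used here has no missing CTR values according to the `info()` output.

```python
import pandas as pd

def parse_ctr(series):
    """Convert percentage strings such as '35.83%' to fractions (0.3583).
    Values that cannot be parsed become NaN instead of raising an error."""
    cleaned = series.astype(str).str.strip().str.rstrip('%')
    return pd.to_numeric(cleaned, errors='coerce') / 100

# Made-up sample values, including padding and an unparseable entry
sample_ctr = pd.Series(['35.83%', '81.28%', ' 2.83% ', 'n/a'])
parsed = parse_ctr(sample_ctr)
print(parsed)
```

Like the original cell, this should be applied only once to the raw percentage strings, since running it again on already converted values would divide them by 100 a second time.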

Next, let’s analyze the most common words across the search queries:

In [6]:
# Function to clean and split the queries into words
def clean_and_split(query):
    words = re.findall(r'\b[a-zA-Z]+\b', query.lower())
    return words

# Split each query into words and count the frequency of each word
word_counts = Counter()
for query in queries_df['Top queries']:
    word_counts.update(clean_and_split(query))

# most_common(20) already returns the words sorted by descending frequency
word_freq_df = pd.DataFrame(word_counts.most_common(20), columns=['Word', 'Frequency'])

# Create the bar plot with improvements
fig = px.bar(
    word_freq_df,  # Top 20 words
    x='Frequency',
    y='Word',
    title='Top 20 Most Common Words in Search Queries',
    text='Frequency',  # Display frequency on bars
    color='Frequency',  # Color based on frequency
    color_continuous_scale='Blues',  # Aesthetic color scale
    orientation='h',  # Horizontal bars for readability
)

# Update layout for better aesthetics
fig.update_layout(
    title={
        'text': 'Top 20 Most Common Words in Search Queries',
        'x': 0.5,  # Center align title
        'xanchor': 'center'
    },
    xaxis_title='Frequency',
    yaxis_title='Words',
    template='plotly_white',  # Clean background
    coloraxis_showscale=False  # Hide color bar for simplicity
)

# Update traces for better label placement
fig.update_traces(textposition='outside')

# Show the figure
fig.show()

Now, looking at the top queries by clicks and impressions:

In [7]:
# Top queries by Clicks and Impressions
top_queries_clicks_vis = queries_df.nlargest(10, 'Clicks')[['Top queries', 'Clicks']]
top_queries_impressions_vis = queries_df.nlargest(10, 'Impressions')[['Top queries', 'Impressions']]

# Plot for Clicks
fig_clicks = px.bar(
    top_queries_clicks_vis,
    x='Top queries',
    y='Clicks',
    title='Top Queries by Clicks',
    text='Clicks',
    color='Clicks',
    color_continuous_scale='Viridis',  # Attractive color palette
)

# Plot for Impressions
fig_impressions = px.bar(
    top_queries_impressions_vis,
    x='Top queries',
    y='Impressions',
    title='Top Queries by Impressions',
    text='Impressions',
    color='Impressions',
    color_continuous_scale='Cividis',  # Complementary color palette
)

# Update layout for both plots
for fig in [fig_clicks, fig_impressions]:
    fig.update_layout(
        xaxis_title='Top Queries',
        yaxis_title='Counts',
        title={
            'x': 0.5,  # Center-align title
            'xanchor': 'center'
        },
        template='plotly_white',  # Clean aesthetic
        margin=dict(l=50, r=50, t=50, b=50),  # Add spacing around the plot
    )
    fig.update_traces(
        textposition='outside',  # Place text above bars
        marker_line_width=0.5,  # Thin border for clarity
    )

# Show the plots
fig_clicks.show()
fig_impressions.show()

Now, let’s analyze the queries with the highest and lowest CTRs:

In [9]:
# Select the 10 queries with the highest and the 10 with the lowest CTR
top_ctr_vis = queries_df.nlargest(10, 'CTR')[['Top queries', 'CTR']]
bottom_ctr_vis = queries_df.nsmallest(10, 'CTR')[['Top queries', 'CTR']]

# Plot for Top CTR
fig_top_ctr = px.bar(
    top_ctr_vis,
    x='CTR',
    y='Top queries',
    title='Top Queries by CTR',
    text='CTR',
    color='CTR',
    color_continuous_scale='Greens',  # Highlight high CTR in green shades
    orientation='h',  # Horizontal bars for readability
)

# Plot for Bottom CTR
fig_bottom_ctr = px.bar(
    bottom_ctr_vis,
    x='CTR',
    y='Top queries',
    title='Bottom Queries by CTR',
    text='CTR',
    color='CTR',
    color_continuous_scale='Reds',  # Highlight low CTR in red shades
    orientation='h',  # Horizontal bars for readability
)

# Apply consistent styling to both plots
for fig in [fig_top_ctr, fig_bottom_ctr]:
    fig.update_layout(
        title={
            'x': 0.5,  # Center-align the title
            'xanchor': 'center',
        },
        xaxis_title='Click-Through Rate (CTR)',
        yaxis_title='Top Queries',
        template='plotly_white',  # Clean layout
        margin=dict(l=80, r=40, t=50, b=50),  # Adequate spacing
    )
    fig.update_traces(
        textposition='outside',  # Place CTR values outside bars
        marker_line_width=0.5,  # Add a thin border around bars
    )

# Show the plots
fig_top_ctr.show()
fig_bottom_ctr.show()


Now, let’s have a look at the correlation between different metrics:

In [10]:
# Compute correlation matrix
correlation_matrix = queries_df[['Clicks', 'Impressions', 'CTR', 'Position']].corr()

# Plot correlation matrix
fig_corr = px.imshow(
    correlation_matrix,
    text_auto=".2f",  # Round values to 2 decimal places
    color_continuous_scale="RdBu",  # Diverging color scale for better contrast
    title="Correlation Matrix of Query Metrics",
)

# Update layout for aesthetics
fig_corr.update_layout(
    title={
        'x': 0.5,  # Center-align the title
        'xanchor': 'center'
    },
    xaxis_title="Metrics",
    yaxis_title="Metrics",
    template="plotly_white",  # Clean background
    coloraxis_colorbar=dict(
        title="Correlation",  # Label the color bar
        ticks="outside"
    ),
    margin=dict(l=50, r=50, t=50, b=50),  # Add space around the plot
)

# Improve axis labels
fig_corr.update_xaxes(tickangle=45)  # Rotate x-axis labels for readability

# Show the figure
fig_corr.show()


In this correlation matrix (note that Position is the average ranking in the search results, so a lower value means a better ranking):

- Clicks and Impressions are positively correlated: queries with more Impressions tend to receive more Clicks.
- Clicks and CTR have a weak positive correlation: queries with more Clicks tend to have a slightly higher Click-Through Rate.
- Clicks and Position are weakly negatively correlated: queries that rank further down the results (a larger Position value) tend to receive fewer Clicks.
- Impressions and CTR are negatively correlated: queries with more Impressions tend to have a lower Click-Through Rate.
- Impressions and Position are positively correlated: in this dataset, queries with more Impressions tend to have a larger Position value, that is, a worse average ranking.
- CTR and Position have a strong negative correlation: the further down the results a query ranks, the lower its Click-Through Rate tends to be.

These are associations, not causal effects, so they describe how the metrics move together rather than what drives what.
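
The `corr()` call above uses Pearson correlation by default, which measures linear association and can be sensitive to a few very large values. As a possible cross-check, a rank-based Spearman correlation can be computed with the same method. The small DataFrame below is made up purely to show the difference and does not come from the article's dataset.

```python
import pandas as pd

# Made-up data: Impressions rise with Position, but one value is extreme
sample = pd.DataFrame({
    'Impressions': [100, 200, 400, 800, 100000],
    'Position': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Pearson measures linear association; Spearman works on ranks
pearson = sample.corr(method='pearson').loc['Impressions', 'Position']
spearman = sample.corr(method='spearman').loc['Impressions', 'Position']

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Because the relationship here is perfectly monotonic, Spearman reports a correlation of 1, while Pearson is pulled down by the single extreme value. On skewed metrics such as Clicks and Impressions, comparing both can show whether a weak Pearson value hides a consistent ordering.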

**Detecting Anomalies in Search Queries**

Now, let’s detect anomalies in the search queries. A simple and effective method is the Isolation Forest algorithm, which isolates unusual points with random splits, works well with different data distributions, and is efficient with large datasets:

In [11]:
from sklearn.ensemble import IsolationForest

# Selecting relevant features
features = queries_df[['Clicks', 'Impressions', 'CTR', 'Position']]

# Initializing Isolation Forest
iso_forest = IsolationForest(n_estimators=100, contamination=0.01)  # contamination is the expected proportion of outliers

# Fitting the model
iso_forest.fit(features)

# Predicting anomalies
queries_df['anomaly'] = iso_forest.predict(features)

# Keeping only the rows flagged as anomalies (predict returns -1 for anomalies and 1 for normal points)
anomalies = queries_df[queries_df['anomaly'] == -1]
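
Beyond the -1/1 labels, Isolation Forest also exposes a continuous anomaly score, which can help rank queries by how unusual they are. The sketch below uses synthetic data rather than `queries_df`, and it sets `random_state` (an addition not used in the cell above) so that repeated runs give the same result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for query metrics: 200 typical rows plus one extreme row
typical = pd.DataFrame({
    'Clicks': rng.integers(20, 60, size=200),
    'Impressions': rng.integers(500, 1500, size=200),
})
extreme = pd.DataFrame({'Clicks': [5000], 'Impressions': [90000]})
demo = pd.concat([typical, extreme], ignore_index=True)
demo_features = demo[['Clicks', 'Impressions']]

# random_state makes the randomly built trees reproducible
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(demo_features)

# score_samples: the lower the score, the more anomalous the row
demo['score'] = model.score_samples(demo_features)
demo['anomaly'] = model.predict(demo_features)

# Show the three most anomalous rows
most_anomalous = demo.sort_values('score').head(3)
print(most_anomalous)
```

Sorting by score makes it easier to review the most extreme queries first, and it shows how close borderline rows are to the cut-off implied by `contamination`.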

Now, let’s look at the detected anomalies to understand their nature and whether they represent true outliers or data errors:

In [12]:
print(anomalies[['Top queries', 'Clicks', 'Impressions', 'CTR', 'Position']])

                          Top queries  Clicks  Impressions     CTR  Position
0         number guessing game python    5223        14578  0.3583      1.61
1                 thecleverprogrammer    2809         3456  0.8128      1.02
2    python projects with source code    2077        73380  0.0283      5.94
4               the clever programmer    1931         2528  0.7638      1.09
11                  clever programmer    1243        21566  0.0576      4.82
15         rock paper scissors python    1111        35824  0.0310      7.19
21              classification report     933        39896  0.0234      7.53
34           machine learning roadmap     708        42715  0.0166      8.97
82                           r2 score     367        56322  0.0065      9.33
167               text to handwriting     222        11283  0.0197     28.52


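The detected anomalies fall into a few groups. Queries such as "number guessing game python", "thecleverprogrammer", and "the clever programmer" stand out for their unusually high Clicks, CTRs, or both. Others, such as "python projects with source code", "machine learning roadmap", and "r2 score", receive a very large number of Impressions but have CTRs below 3%, which makes them candidates for optimization. "text to handwriting" is flagged mainly for its poor average Position (28.52). So, the anomalies in our search query data are not just outliers. They can point to potential areas for growth, optimization, and strategic focus, and some may reflect emerging trends or areas of growing interest. Staying responsive to them can help maintain and grow the website’s relevance and user engagement.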
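
One way to make this interpretation more systematic is to attach a rough label to each anomaly. The sketch below reuses four rows copied from the anomaly output above. The `triage` function, its label names, and its thresholds are illustrative assumptions and would need tuning for a real site.

```python
import pandas as pd

# Four rows copied from the anomaly output above
anomalies_sample = pd.DataFrame({
    'Top queries': ['thecleverprogrammer', 'python projects with source code',
                    'r2 score', 'text to handwriting'],
    'Clicks': [2809, 2077, 367, 222],
    'Impressions': [3456, 73380, 56322, 11283],
    'CTR': [0.8128, 0.0283, 0.0065, 0.0197],
    'Position': [1.02, 5.94, 9.33, 28.52],
})

def triage(row, ctr_high=0.5, ctr_low=0.05, position_poor=20):
    """Assign a rough label to an anomalous query.
    All thresholds are hypothetical and chosen only for this example."""
    if row['CTR'] >= ctr_high:
        return 'strong performer'
    if row['Position'] >= position_poor:
        return 'poor ranking'
    if row['CTR'] <= ctr_low:
        return 'high visibility, low CTR'
    return 'review manually'

anomalies_sample['label'] = anomalies_sample.apply(triage, axis=1)
print(anomalies_sample[['Top queries', 'label']])
```

Labels like these are only a starting point for manual review, but they separate anomalies worth protecting from those worth improving.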

**Summary**

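So, Search Queries Anomaly Detection means identifying queries that are outliers according to their performance metrics. In this walkthrough, we explored the query data, examined how Clicks, Impressions, CTR, and Position relate to each other, and used an Isolation Forest to flag unusual queries. This is valuable for businesses that want to spot potential issues or opportunities, such as unexpectedly high or low CTRs.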