<h1>Search Queries Anomaly Detection: Process We Can Follow</h1>
Search Queries Anomaly Detection is a technique to identify unusual or unexpected patterns in search query data. Below is the process we can follow for the task of Search Queries Anomaly Detection:

<ol><li>Gather historical search query data from the source, such as a search engine or a website’s search functionality.</li><br>
<li>Conduct an initial analysis to understand the distribution of search queries, their frequency, and any noticeable patterns or trends.</li><br>
<li>Create relevant features or attributes from the search query data that can aid in anomaly detection.</li><br>
<li>Choose an appropriate anomaly detection algorithm. Common methods include statistical approaches like Z-score analysis and machine learning algorithms like Isolation Forests or One-Class SVM.</li><br>
<li>Train the selected model on the prepared data.</li><br>
<li>Apply the trained model to the search query data to identify anomalies or outliers.<br></li></ol>
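
Step 4 mentions Z-score analysis as a statistical alternative. The sketch below shows the idea on synthetic data; the click counts, the planted outlier, and the threshold of 3 are all assumptions chosen for illustration and do not come from the dataset used in this article.

```python
import numpy as np
import pandas as pd

# Synthetic click counts: 50 "ordinary" values plus one planted extreme value.
# These numbers are made up for illustration only.
rng = np.random.default_rng(0)
clicks = pd.Series(np.append(rng.normal(50, 5, size=50), 500.0), name="Clicks")

# Z-score: distance from the mean, measured in standard deviations
z_scores = (clicks - clicks.mean()) / clicks.std()

# Flag values whose absolute Z-score exceeds an assumed threshold
threshold = 3
zscore_anomalies = clicks[z_scores.abs() > threshold]
print(zscore_anomalies)
```

Because a single extreme value inflates both the mean and the standard deviation, Z-scores can miss outliers in very small or heavily skewed samples, which is one reason tree-based methods such as Isolation Forest are often preferred.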

In [50]:
# Importing Required Libraries
import pandas as pd
from collections import Counter
import re
import plotly.express as px
import plotly.offline as pyo
%matplotlib inline
# Set the default Plotly template
px.defaults.template = "plotly_white"

In [51]:
#Load the data set
df = pd.read_csv('Queries.csv')

In [52]:
#Viewing the dataset
df.head()

Unnamed: 0,Top queries,Clicks,Impressions,CTR,Position
0,number guessing game python,5223,14578,35.83%,1.61
1,thecleverprogrammer,2809,3456,81.28%,1.02
2,python projects with source code,2077,73380,2.83%,5.94
3,classification report in machine learning,2012,4959,40.57%,1.28
4,the clever programmer,1931,2528,76.38%,1.09


In [53]:
#Dataset info and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   object 
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB


There are no null values in the dataset: every column has 1000 non-null entries.
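
Missing values can also be counted directly per column rather than read off the `info()` summary. A minimal sketch on a toy frame (the rows here are invented and deliberately contain gaps so the counts are non-zero):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (illustrative values only)
sample = pd.DataFrame({
    "Top queries": ["python turtle", None, "the clever programmer"],
    "Clicks": [52, 10, np.nan],
})

# isnull() marks missing cells; sum() counts them per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Running `df.isnull().sum()` on the real dataset should report zero for every column if the `info()` output above is accurate.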

In [54]:
# Convert the CTR percentage strings to floats (e.g. '35.83%' -> 0.3583)
df['CTR'] = df['CTR'].str.rstrip('%').astype('float') / 100

In [55]:
df['CTR'].head()

0    0.3583
1    0.8128
2    0.0283
3    0.4057
4    0.7638
Name: CTR, dtype: float64
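
As a sanity check on the conversion, CTR should equal Clicks divided by Impressions, up to rounding. The sketch below rebuilds the five rows shown by `df.head()` earlier and compares the two; the tolerance of `1e-4` is an assumption that allows for CTR being reported to two decimal places of a percent.

```python
import numpy as np
import pandas as pd

# The first five rows of the dataset, copied from the df.head() output
head = pd.DataFrame({
    "Clicks": [5223, 2809, 2077, 2012, 1931],
    "Impressions": [14578, 3456, 73380, 4959, 2528],
    "CTR": ["35.83%", "81.28%", "2.83%", "40.57%", "76.38%"],
})

# Same conversion as in the cell above
head["CTR"] = head["CTR"].str.rstrip("%").astype("float") / 100

# Recompute CTR from its definition and compare with the reported value
recomputed = head["Clicks"] / head["Impressions"]
ctr_matches = np.allclose(head["CTR"], recomputed, atol=1e-4)
print(ctr_matches)
```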

In [56]:
df['Top queries'][:6]

0                  number guessing game python
1                          thecleverprogrammer
2             python projects with source code
3    classification report in machine learning
4                        the clever programmer
5          standard scaler in machine learning
Name: Top queries, dtype: object

In [57]:
# Clean a query and split it into lowercase, purely alphabetic words
def clean_and_split(query):
    words = re.findall(r'\b[a-zA-Z]+\b', query.lower())
    return words
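
It helps to see what this tokenizer keeps and what it drops. The sketch below repeats the function so it runs on its own; the third query, `python3 tutorial`, is a made-up example and not taken from the dataset.

```python
import re

def clean_and_split(query):
    # Keep only runs of letters that are bounded by non-word characters
    return re.findall(r'\b[a-zA-Z]+\b', query.lower())

print(clean_and_split("Number Guessing Game Python"))
# ['number', 'guessing', 'game', 'python']
print(clean_and_split("grading system code in c++"))
# ['grading', 'system', 'code', 'in', 'c']
print(clean_and_split("python3 tutorial"))
# ['tutorial']
```

Two side effects are worth keeping in mind when reading the word frequencies: `c++` is reduced to `c`, and a token that mixes letters and digits, such as `python3`, is dropped entirely because there is no word boundary between the letters and the digit.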

In [58]:
# Split each query into words and count the frequency of each word
word_counts = Counter()
for query in df['Top queries']:
    word_counts.update(clean_and_split(query))

# Create a DataFrame with word frequencies
word_freq_df = pd.DataFrame(word_counts.most_common(20), columns=['Word', 'Frequency'])

In [59]:
word_freq_df.head()

Unnamed: 0,Word,Frequency
0,python,562
1,in,232
2,code,138
3,learning,133
4,machine,123


In [60]:
#Plotting the word Frequencies

fig = px.bar(word_freq_df , x='Word' , y='Frequency', title='Top 20 Most Common Words in Search Queries')
fig.show()

Now, let’s have a look at the top queries by clicks and impressions:

In [66]:
#Top queries by click and impression
df_top_queries = df.nlargest(10 , 'Clicks')[['Top queries','Clicks']]
df_top_impression = df.nlargest(10 , 'Impressions')[['Top queries' , 'Impressions']]

#Plotting bar Chart 

fig_click = px.bar(df_top_queries, x='Top queries' , y='Clicks' , title='Top 10 Queries By Clicks')
fig_Impressions = px.bar(data_frame=df_top_impression , x='Top queries' , y='Impressions' , title='Top 10 Queries by Impressions')
fig_click.show()
fig_Impressions.show()

Now, let’s analyze the queries with the highest and lowest CTR:

In [73]:
# Queries with highest and lowest CTR
Top_CTR = df.nlargest(10 , columns='CTR')[['CTR' , 'Top queries']]
Bottom_CTR = df.nsmallest(10 , columns='CTR')[['CTR' , 'Top queries']]

## Plotting
Fig_Top_CTR = px.bar(data_frame=Top_CTR , x='Top queries' , y='CTR' , title='Top 10 queries by CTR')
Fig_Bottom_CTR = px.bar(data_frame= Bottom_CTR , x='Top queries' , y='CTR' , title = 'Bottom 10 queries by CTR')
Fig_Top_CTR.show()
Fig_Bottom_CTR.show()

Now, let’s have a look at the correlation between different metrics:

In [78]:
# Correlation matrix visualization
correlation_matrix = df[['Clicks' , 'CTR' , 'Impressions', 'Position']].corr()
fig_corr_matrix = px.imshow(img=correlation_matrix , text_auto=True , title='Correlation Matrix')
fig_corr_matrix.show()

<h1>In this correlation matrix:</h1>
<ol>
<li>Clicks and Impressions are positively correlated: queries that are shown more often tend to receive more Clicks.</li>
<li>Clicks and CTR have a weak positive correlation: queries with more Clicks tend to have a slightly higher Click-Through Rate.</li>
<li>Clicks and Position are weakly negatively correlated: queries with a larger Position value (that is, a lower ranking in the results) tend to receive fewer Clicks.</li>
<li>Impressions and CTR are negatively correlated: queries with more Impressions tend to have a lower Click-Through Rate.</li>
<li>Impressions and Position are positively correlated: queries with a larger Position value tend to have more Impressions.</li>
<li>CTR and Position have a strong negative correlation: the larger the Position value (the further down the results the page ranks), the lower the Click-Through Rate.</li>
</ol>
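
The metrics here are heavily skewed (the top query has thousands of Clicks, while queries far down the list have around fifty), so a rank-based correlation can be a useful cross-check on the Pearson values that `.corr()` computes by default. The sketch below compares the two methods on only the five rows shown by `df.head()`, so the numbers illustrate the mechanics and are not representative of the full dataset.

```python
import pandas as pd

# The first five rows of the dataset, copied from the df.head() output
head = pd.DataFrame({
    "Clicks": [5223, 2809, 2077, 2012, 1931],
    "Impressions": [14578, 3456, 73380, 4959, 2528],
    "CTR": [0.3583, 0.8128, 0.0283, 0.4057, 0.7638],
    "Position": [1.61, 1.02, 5.94, 1.28, 1.09],
})

# Pearson measures linear association; Spearman measures monotonic association
pearson = head.corr(method="pearson")
spearman = head.corr(method="spearman")

print("Pearson  CTR vs Position:", pearson.loc["CTR", "Position"])
print("Spearman CTR vs Position:", spearman.loc["CTR", "Position"])
```

On the full dataset, the same comparison would be `df[['Clicks', 'CTR', 'Impressions', 'Position']].corr(method='spearman')`. In either case, the coefficients describe association between metrics, not cause and effect.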

<h1>Detecting Anomalies in Search Queries</h1>
Now, let’s detect anomalies in search queries. You can use various techniques for anomaly detection. A simple and effective method is the Isolation Forest algorithm, which works well with different data distributions and is efficient with large datasets:

In [82]:
#importing IsolationForest
from sklearn.ensemble import IsolationForest

#Selecting Relevant Features
features = df[['Clicks','Impressions' ,'CTR' , 'Position']]

#Initializing Isolation Forest
iso_forest = IsolationForest(n_estimators=100, contamination=0.1)  # contamination is the expected proportion of outliers
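
The `contamination` parameter largely decides how many rows get flagged, so it is worth seeing its effect before trusting the count of anomalies. The sketch below fits the model on synthetic Gaussian data (not the query dataset) with a few assumed contamination values; `random_state=0` is added here only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data with no planted outliers: 500 rows, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))

flagged_share = {}
for contamination in (0.01, 0.05, 0.1):
    model = IsolationForest(n_estimators=100, contamination=contamination, random_state=0)
    labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
    flagged_share[contamination] = float((labels == -1).mean())

print(flagged_share)
```

The share of flagged rows tracks `contamination` closely even though the data contains no real outliers, which suggests treating the value as an assumption about the data rather than something the model discovers.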


In [83]:
#Fitting the Model
iso_forest.fit(features)

In [84]:
#Predicting Anomalies (-1 = anomaly, 1 = normal)
df['Anomaly'] = iso_forest.predict(features)

In [91]:
#Filtering Out Anomalies
Anomalies = df[df['Anomaly'] == -1]

Here’s how to analyze the detected anomalies to understand their nature and whether they represent true outliers or data errors:

In [92]:
print(Anomalies[['Top queries', 'Clicks', 'Impressions', 'CTR', 'Position']])

                                   Top queries  Clicks  Impressions     CTR  \
0                  number guessing game python    5223        14578  0.3583   
1                          thecleverprogrammer    2809         3456  0.8128   
2             python projects with source code    2077        73380  0.0283   
3    classification report in machine learning    2012         4959  0.4057   
4                        the clever programmer    1931         2528  0.7638   
..                                         ...     ...          ...     ...   
927                  the clever programmer.com      53           64  0.8281   
928                   the cleverprogrammer.com      53           62  0.8548   
929                              python turtle      52        18228  0.0029   
963                 grading system code in c++      51           79  0.6456   
964       python program to send otp to mobile      51           72  0.7083   

     Position  
0        1.61  
1        1.02  
2  
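
One way to dig further into flagged rows is to rank them by anomaly score and compare their average metrics with those of the normal rows. The sketch below does this on synthetic data with three planted extremes; the frame `demo`, its values, and the `contamination=0.05` and `random_state=42` settings are assumptions for illustration, not the article's dataset or configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic "ordinary" queries (illustrative values only)
rng = np.random.default_rng(42)
n = 200
normal = pd.DataFrame({
    "Clicks": rng.integers(40, 80, size=n).astype(float),
    "Impressions": rng.integers(400, 800, size=n).astype(float),
    "Position": rng.uniform(3, 8, size=n).round(2),
})

# Three planted extremes: very high traffic, very low CTR, very high CTR
planted = pd.DataFrame({
    "Clicks": [5000.0, 60.0, 3000.0],
    "Impressions": [15000.0, 50000.0, 3500.0],
    "Position": [1.5, 9.5, 1.0],
})
demo = pd.concat([normal, planted], ignore_index=True)
demo["CTR"] = demo["Clicks"] / demo["Impressions"]

feature_cols = ["Clicks", "Impressions", "CTR", "Position"]
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
demo["Anomaly"] = model.fit_predict(demo[feature_cols])

# decision_function: lower (more negative) scores are more anomalous
demo["Score"] = model.decision_function(demo[feature_cols])

# Average metrics for anomalies (-1) versus normal rows (1)
summary = demo.groupby("Anomaly")[feature_cols].mean()
print(summary)

# Flagged rows, most anomalous first
ranked = demo[demo["Anomaly"] == -1].sort_values("Score")
print(ranked.head())
```

Applied to the real data, the same two views (group means and a score-ranked list) make it easier to judge whether a flagged query is a genuine outlier, such as a branded search with an unusually high CTR, or simply a borderline case pulled in by the chosen contamination rate.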

<h1>Summary</h1>
So, Search Queries Anomaly Detection means identifying queries that are outliers according to their performance metrics. It is valuable for businesses to spot potential issues or opportunities, such as unexpectedly high or low CTRs. I hope you liked this article on Search Queries Anomaly Detection with Machine Learning using Python. Feel free to ask valuable questions in the comments section below.