## 1. Import the libraries:

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import Counter
%matplotlib inline

## 2. Load the dataset:

In [20]:
data = pd.read_csv("Queries.csv")
data.head(2)

| | Top queries | Clicks | Impressions | CTR | Position |
|---|---|---|---|---|---|
| 0 | number guessing game python | 5223 | 14578 | 35.83% | 1.61 |
| 1 | thecleverprogrammer | 2809 | 3456 | 81.28% | 1.02 |


## 3. Exploratory Data Analysis:

The datatypes of the features:

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   object 
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB


Check for null values:

In [6]:
data.isna().sum()

Top queries    0
Clicks         0
Impressions    0
CTR            0
Position       0
dtype: int64

There are no null values in any column.

The **CTR** column has the object data type because its values are strings ending in a percent sign. To use it in numeric operations later, let's strip the `%` and convert it to a float fraction:

In [21]:
data['CTR'] = data['CTR'].str.rstrip("%").astype('float')/100
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   float64
 4   Position     1000 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 39.2+ KB
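
The conversion above raises an error if any CTR entry is not a clean percentage string. As a minimal sketch of a more forgiving variant, assuming malformed entries should become `NaN` rather than stop the run, `pd.to_numeric` with `errors="coerce"` can be used. The sample values below are hypothetical and are not taken from `Queries.csv`:

```python
import pandas as pd

# Hypothetical sample values; the last entry is deliberately malformed.
ctr_raw = pd.Series(["35.83%", "81.28%", "n/a"])

# Strip the trailing percent sign, then parse the remainder as a number.
# errors="coerce" turns unparseable entries into NaN instead of raising.
ctr = pd.to_numeric(ctr_raw.str.rstrip("%"), errors="coerce") / 100

# Count how many entries failed to parse.
print(ctr.isna().sum())
```

On this dataset the direct `astype('float')` call is sufficient, since every one of the 1000 CTR values converted without error.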


In [23]:
data.head(5)

| | Top queries | Clicks | Impressions | CTR | Position |
|---|---|---|---|---|---|
| 0 | number guessing game python | 5223 | 14578 | 0.3583 | 1.61 |
| 1 | thecleverprogrammer | 2809 | 3456 | 0.8128 | 1.02 |
| 2 | python projects with source code | 2077 | 73380 | 0.0283 | 5.94 |
| 3 | classification report in machine learning | 2012 | 4959 | 0.4057 | 1.28 |
| 4 | the clever programmer | 1931 | 2528 | 0.7638 | 1.09 |


## 4. Count the most frequent words in the queries:

In [74]:
# Split a query into lowercase alphabetic words, discarding punctuation:
def split_into_tokens(query):
    words = re.findall(r'\b[a-z]+\b', query.lower())
    return words


# Creating a counter to count the word frequency:
word_frequency = Counter()
for query in data['Top queries']:
    word_frequency.update(split_into_tokens(query))
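
To see what the pattern `\b[a-z]+\b` keeps and drops, here is a small sketch that runs the same function on a few sample strings. The samples are hypothetical, chosen to show edge cases, and are not necessarily present in the dataset:

```python
import re

def split_into_tokens(query):
    # Same pattern as above: runs of a-z bounded by word boundaries.
    return re.findall(r'\b[a-z]+\b', query.lower())

# Hypothetical sample queries chosen to show edge cases.
samples = [
    "Number Guessing Game Python",  # mixed case is lowercased
    "k-means clustering",           # the hyphen splits "k" and "means"
    "python3 tutorial",             # "python3" is dropped entirely
]

for sample in samples:
    print(sample, "->", split_into_tokens(sample))
```

Note that a word with an attached digit, such as `python3`, yields no token at all: there is no word boundary between a letter and a digit, so the pattern cannot match any part of it.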

#### The top 20 most common words:

In [76]:
top_20_common_df = pd.DataFrame(word_frequency.most_common(20), columns=["Words", "Counts"])
top_20_common_df.head(20)

| | Words | Counts |
|---|---|---|
| 0 | python | 562 |
| 1 | in | 232 |
| 2 | code | 138 |
| 3 | learning | 133 |
| 4 | machine | 123 |
| 5 | using | 105 |
| 6 | game | 103 |
| 7 | number | 95 |
| 8 | to | 82 |
| 9 | prediction | 70 |
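
Several of the most frequent words (`in`, `using`, `to`) are function words that say little about what users search for. One possible refinement, sketched here as an assumption and not as part of the original analysis, is to drop a small stop-word list before ranking. The counts are copied from the rows shown above, and the `stop_words` set is a hypothetical, hand-picked list:

```python
from collections import Counter

# Counts copied from the rows shown above.
word_frequency = Counter({
    "python": 562, "in": 232, "code": 138, "learning": 133, "machine": 123,
    "using": 105, "game": 103, "number": 95, "to": 82, "prediction": 70,
})

# Hypothetical, hand-picked stop-word list (an assumption for illustration).
stop_words = {"in", "to", "using"}

# Keep only the words that are not in the stop-word list.
content_words = Counter(
    {word: count for word, count in word_frequency.items() if word not in stop_words}
)

print(content_words.most_common(5))
```

A curated list from a library such as NLTK could replace the hand-picked set if broader coverage is needed.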
