<p style="text-align:center">
        <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo">
</p>


### Analyse search terms on the e-commerce web server


##### In this assignment you will download the search term data set for the e-commerce web server and run analytic queries on it.


In [None]:
# Install Spark

In [1]:
!pip install pyspark
!pip install findspark

Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 1.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 kB 27.7 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845498 sha256=600f72134c5f7d5120b0937eceeb288c2a58486dc17334926009f5d4a8268622
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/91/aa/66/700b503fd714b56462975ab7bf33485ec26677c3c990e67e9a
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.1
Collecting finds

In [2]:
# Start session
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import S

In [3]:
# Create a Spark context
sc = SparkContext()

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Analyzing search terms").getOrCreate()

22/11/19 18:09:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [None]:
# Download the search term dataset from the URL below
# https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/searchterms.csv

In [7]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/searchterms.csv'

--2022-11-19 18:15:18--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/searchterms.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 233457 (228K) [text/csv]
Saving to: ‘searchterms.csv’


2022-11-19 18:15:18 (70.6 MB/s) - ‘searchterms.csv’ saved [233457/233457]



In [None]:
# Load the CSV into a Spark DataFrame

In [12]:
df = spark.read.csv('searchterms.csv', header=True)

In [None]:
# Print the number of rows and columns

In [13]:
rows = df.count()
cols = len(df.columns)
print("No. of rows: ", rows, "\nNo. of columns: ", cols)

No. of rows:  10000 
No. of columns:  4


In [None]:
# Print the top 5 rows

In [14]:
df.show(5)

+---+-----+----+--------------+
|day|month|year|    searchterm|
+---+-----+----+--------------+
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021| mobile latest|
| 12|   11|2021|   tablet wifi|
| 12|   11|2021|laptop 14 inch|
| 12|   11|2021|     mobile 5g|
+---+-----+----+--------------+
only showing top 5 rows



In [None]:
# Find the data type of the column searchterm

In [25]:
df.dtypes

[('day', 'string'),
 ('month', 'string'),
 ('year', 'string'),
 ('searchterm', 'string')]

In [None]:
# How many times was the term `gaming laptop` searched?

In [29]:
df.createOrReplaceTempView('search')
spark.sql("SELECT COUNT(searchterm) FROM search WHERE searchterm='gaming laptop'").show()

+-----------------+
|count(searchterm)|
+-----------------+
|              499|
+-----------------+



                                                                                

In [None]:
# Print the top 5 most frequently used search terms

In [39]:
df.createOrReplaceTempView('search')
print("Top 5 search terms")
spark.sql("SELECT searchterm FROM search GROUP BY searchterm LIMIT 5").show()

Top 5 search terms
+-------------------+
|         searchterm|
+-------------------+
|          mobile 5g|
|ebooks data science|
|      mobile 6 inch|
|     tablet 10 inch|
|             laptop|
+-------------------+



In [None]:
# The pretrained sales forecasting model is available at the URL below
# https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/model.tar.gz
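
The notebook does not show how the archive is fetched and unpacked before the model is loaded. One possible way to do it from Python is sketched below; the helper names `extract_archive` and `download_and_extract` are hypothetical, and the assumption that the archive unpacks to a `sales_prediction.model` directory is taken only from the path used in the load cell.

```python
import tarfile
import urllib.request
from pathlib import Path

MODEL_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/model.tar.gz"
)


def extract_archive(archive_path, target_dir="."):
    """Unpack a .tar.gz archive into target_dir and return its top-level names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    top_level = set()
    for name in names:
        parts = Path(name).parts  # normalises a leading "./"
        if parts:
            top_level.add(parts[0])
    return sorted(top_level)


def download_and_extract(url=MODEL_URL, target_dir="."):
    """Download the archive into target_dir, then unpack it there."""
    archive_path = Path(target_dir) / "model.tar.gz"
    urllib.request.urlretrieve(url, str(archive_path))
    return extract_archive(archive_path, target_dir)
```

Calling `download_and_extract()` before the load cell returns the names of the unpacked top-level entries, which makes it easy to check that the directory passed to `LinearRegressionModel.load` is actually present.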

In [None]:
# Load the sales forecast model.

In [46]:
from pyspark.ml.regression import LinearRegressionModel
from pyspark.ml.feature import VectorAssembler
model = LinearRegressionModel.load('sales_prediction.model')

In [None]:
# Using the sales forecast model, predict the sales for the year 2023.

In [47]:
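def predict(year):
    # Assemble the "year" column into the "features" vector the model reads
    assembler = VectorAssembler(inputCols=["year"], outputCol="features")
    # One-row input; the sales value 0 is only a placeholder
    data = [[year, 0]]
    columns = ["year", "sales"]
    input_df = spark.createDataFrame(data, columns)
    features_df = assembler.transform(input_df).select('features', 'sales')
    # Apply the loaded model and print the predicted sales
    predictions = model.transform(features_df)
    predictions.select('prediction').show()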

In [48]:
predict(2023)

                                                                                

+------------------+
|        prediction|
+------------------+
|175.16564294006457|
+------------------+



22/11/19 19:14:38 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/11/19 19:14:38 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
