## Overview

This notebook will show you how to create and query a table or DataFrame that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is a Databricks File System that allows you to store data for querying inside of Databricks. This notebook assumes that you have a file already inside of DBFS that you would like to read from.

This notebook is written in **Python** so the default cell type is Python. However, you can use different languages by using the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [2]:
# File location and type
file_location = "/FileStore/tables/amazon_alexa.csv"
file_type = "csv"
   
# CSV options
infer_schema = "true" #cambiar a true
first_row_is_header = "true" #cambiar a true
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

In [3]:
display(df)

rating,date,variation,verified_reviews,feedback
5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
5,31-Jul-18,Charcoal Fabric,Loved it!,1
4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you. I like being able to turn lights on and off while away from home.",1
5,31-Jul-18,Charcoal Fabric,"I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.",1
5,31-Jul-18,Charcoal Fabric,Music,1
5,31-Jul-18,Heather Gray Fabric,"I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible, and found this smart speaker. Can’t wait to see what else it can do.",1
3,31-Jul-18,Sandstone Fabric,"Without having a cellphone, I cannot use many of her features. I have an iPad but do not see that of any use. It IS a great alarm. If u r almost deaf, you can hear her alarm in the bedroom from out in the living room, so that is reason enough to keep her.It is fun to ask random questions to hear her response. She does not seem to be very smartbon politics yet.",1
5,31-Jul-18,Charcoal Fabric,I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.,1
5,30-Jul-18,Heather Gray Fabric,looks great,1
5,30-Jul-18,Heather Gray Fabric,"Love it! I’ve listened to songs I haven’t heard since childhood! I get the news, weather, information! It’s great!",1


In [4]:
display(df)

rating,date,variation,verified_reviews,feedback
5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
5,31-Jul-18,Charcoal Fabric,Loved it!,1
4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you. I like being able to turn lights on and off while away from home.",1
5,31-Jul-18,Charcoal Fabric,"I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.",1
5,31-Jul-18,Charcoal Fabric,Music,1
5,31-Jul-18,Heather Gray Fabric,"I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible, and found this smart speaker. Can’t wait to see what else it can do.",1
3,31-Jul-18,Sandstone Fabric,"Without having a cellphone, I cannot use many of her features. I have an iPad but do not see that of any use. It IS a great alarm. If u r almost deaf, you can hear her alarm in the bedroom from out in the living room, so that is reason enough to keep her.It is fun to ask random questions to hear her response. She does not seem to be very smartbon politics yet.",1
5,31-Jul-18,Charcoal Fabric,I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.,1
5,30-Jul-18,Heather Gray Fabric,looks great,1
5,30-Jul-18,Heather Gray Fabric,"Love it! I’ve listened to songs I haven’t heard since childhood! I get the news, weather, information! It’s great!",1


In [5]:
#Creamos un RDD a partir de un archivo csv
data_file = "/FileStore/tables/amazon_alexa.csv"
raw_rdd = sc.textFile(data_file).cache()
raw_rdd.take(5) #show the top 5 lines of the file

In [6]:
type(raw_rdd)#se imprime el tipo de clase del argumento pasado como parámetro

In [7]:
#Vamos a imprimir el 3er registro del RDD
lista = raw_rdd.collect()
print("El tercer elemento de la lista es: %d", lista[3])

In [8]:
#¿Cuantos elementos tiene el RDD?
print("Visualizando el total del elementos:", raw_rdd.count())

In [9]:
#En RDD, es necesario separar cada una de las entradas, antes de parsear y construir un dataframe
csv_rdd = raw_rdd.map(lambda row: row.split(",")) 
print(csv_rdd.take(3))#print 3 rows

In [10]:
#Creamos la tabla parsed_rdd.
#Vamos a importar la clase Row de la librería de pyspark.sql


from pyspark.sql import Row #Importamos la clase Row 

amazon_rdd = csv_rdd.map(lambda r: Row(
    rating = r[0],
    date = r[1],
    variation = r[2],
    verified_reviews = r[3],
    feedback = r[4]
    )
)
amazon_rdd.take(5)

In [11]:
#Transformación map: retorna un nuevo RDD
temp_table_name = "amazon_data"
df.createOrReplaceTempView(temp_table_name)

#Construimos un datraframe
df = sqlContext.createDataFrame(amazon_rdd)
display(df)

date,feedback,rating,variation,verified_reviews
date,feedback,rating,variation,verified_reviews
31-Jul-18,1,5,Charcoal Fabric,Love my Echo!
31-Jul-18,1,5,Charcoal Fabric,Loved it!
31-Jul-18,"you can answer a question correctly but Alexa says you got it wrong and answers the same as you. I like being able to turn lights on and off while away from home.""",4,Walnut Finish,"""Sometimes while playing a game"
31-Jul-18,"i control the lights and play games like categories. Has nice sound when playing music as well.""",5,Charcoal Fabric,"""I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs"
31-Jul-18,1,5,Charcoal Fabric,Music
31-Jul-18,"and found this smart speaker. Can’t wait to see what else it can do.""",5,Heather Gray Fabric,"""I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible"
31-Jul-18,I cannot use many of her features. I have an iPad but do not see that of any use. It IS a great alarm. If u r almost deaf,3,Sandstone Fabric,"""Without having a cellphone"
31-Jul-18,1,5,Charcoal Fabric,I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.
30-Jul-18,1,5,Heather Gray Fabric,looks great


In [12]:
reviews_rdd = df.select("verified_reviews").rdd.flatMap(lambda x: x)
reviews_rdd.collect()

In [13]:
#Remover cabeceras y convertir todos los datos a minúsculas para un fácil procesamiento
header = reviews_rdd.first() #devuelve el primer registro del RDD
data_rmv_col = reviews_rdd.filter(lambda row: row != header) #removemos el encabezado
lowerCase_sentRDD = data_rmv_col.map(lambda x : x.lower()) #convertirmos el texto a minúscula

print(lowerCase_sentRDD.count())
print(lowerCase_sentRDD.collect())

In [14]:
#Vamos a instalar NLTK en Databricks

In [15]:
#%sh 
#pip install nltk
#pip install --upgrade pip
#python -m nltk.downloader all

In [16]:
#Tokenizamos los textos en sentencias
import nltk
def sent_TokenizeFunct(x):
    return nltk.sent_tokenize(x)
  
sentenceTokenizeRDD = lowerCase_sentRDD.map(sent_TokenizeFunct)
lista = sentenceTokenizeRDD.collect()
print(lista[0:3])

In [17]:
#Hacer un split a cada sentencia, para trabajar a nivel palabra 
def word_TokenizeFunct(x):
    splitted = [word for line in x for word in line.split()]
    return splitted
  
wordTokenizeRDD = sentenceTokenizeRDD.map(word_TokenizeFunct)
lista2 = wordTokenizeRDD.collect()
print(lista2[0:3])

In [18]:
#Removemos stopwords
def removeStopWordsFunct(x):
    from nltk.corpus import stopwords
    stop_words=set(stopwords.words('english'))
    filteredSentence = [w for w in x if not w in stop_words]
    return filteredSentence
stopwordRDD = wordTokenizeRDD.map(removeStopWordsFunct)
lista3 = stopwordRDD.collect()
print(lista3[0:3])

In [19]:
print(stopwordRDD.first())

In [20]:
#Eliminamos signos de puntuación y espacios en blanco
import string
def removePunctuationsFunct(x):
    list_punct=list(string.punctuation)
    filtered = [''.join(c for c in s if c not in list_punct) for s in x] 
    filtered_space = [s for s in filtered if s] #remove empty space 
    return filtered
rmvPunctRDD = stopwordRDD.map(removePunctuationsFunct)
lista3 = rmvPunctRDD.collect()
print(lista3[0:3])

In [21]:
#Análisis de sentimientos
def sentimentWordsFunct(x):
  from nltk.sentiment.vader import SentimentIntensityAnalyzer
  analyzer = SentimentIntensityAnalyzer() 
  senti_list_temp = []
  for i in x:
    y = ''.join(i) 
    vs = analyzer.polarity_scores(y)
    senti_list_temp.append((y, vs))
    senti_list_temp = [w for w in senti_list_temp if w]
  sentiment_list  = []
  for j in senti_list_temp:
    first = j[0]
    second = j[1]
    
    for (k,v) in second.items():
      if k == 'compound':
        if v < 0.0:
          sentiment_list.append((first, "Negative"))
        elif v == 0.0:
          sentiment_list.append((first, "Neutral"))
        else:
          sentiment_list.append((first, "Positive"))
  return sentiment_list

sentimentRDD = rmvPunctRDD.map(sentimentWordsFunct)
sentimentRDD.take(10)

In [22]:
#Vamos a extrer la frecuencia de las palabras más usadas
freqDistRDD = rmvPunctRDD.flatMap(lambda x : nltk.FreqDist(x).most_common()).map(lambda x: x).reduceByKey(lambda x,y : x+y).sortBy(lambda x: x[1], ascending = False)

freqDistRDD.take(10)

In [23]:
#Vamos a visualizar las palabras más frecuentes
import matplotlib as plt
df_fDist = freqDistRDD.toDF() #convertirmos el RDD hacia un dataframe
df_fDist.createOrReplaceTempView("myTable") 
df2 = spark.sql("SELECT _1 AS PalabrasClave, _2 as Frecuencia from myTable limit 20") #renombramos las columnas
pandas_D = df2.toPandas() #convertimos el dataframe de spark a dataframe de pandas
pandas_D.plot.barh(x='PalabrasClave', y='Frecuencia', rot=1, figsize=(10,8))
display()

In [24]:
display(pandas_D.head(20))

PalabrasClave,Frecuencia
love,812
echo,576
great,574
it,386
alexa,366
like,358
music,316
use,315
works,313
,293
