## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is the Databricks File System, which lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python** so the default cell type is Python. However, you can use different languages by using the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
#  !pip install textblob langdetect googletrans
import pyspark
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import functions as F
from textblob import TextBlob
from langdetect import detect
import googletrans
from googletrans import Translator
spark=SparkSession.builder.appName("DatawithPySpark").getOrCreate()




Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)

In [0]:
# File location and type
file_location = "/FileStore/tables/dataset1.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

sentence,polarity
واو فيديو رائع انا إلهام و انا من أشد المعجبين بك 😍😍,1.0
Sh7al bqety katfkar bash matnsash shy laqta mn laqtat lly majitish m9erre3? 😂,1.0
طلعتي كتشبه زياش 😂👍,1.0
Ana yallah bdit fel YouTube ou bghitkom tde3moni a khouti lmohtawa n9i dekhlou ou choufou b3inikom allah yrhem likom lwalidin 😍,1.0
Tbarkellaaah aliiik nta dmaaghhh👍👍👍👍👍👍👍👍👍❤️❤️❤️❤️❤️❤️❤️,1.0
كلشي على مقدم البرنامج😂 ههه عزيزنا المشاهدون.,1.0
كتشبه لراجلي إحساس عا راجلي زوين شويا،الله إخليه ليا 🙏❤,1.0
🤣🤣🤣🤣🤣🤣😃😂😂😂😂🤣🤣🤣🤣,1.0
0:59 جوعب 🤣🤣🤣🤣🤣🤣🤣😂😂😂,1.0
😂👍👏✌,1.0


In [0]:

# File location and type
file_locat = "/FileStore/tables/emojis.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ";"

# The applied options are for CSV files. For other file types, these will be ignored.
emojis = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_locat)

display(emojis)

emoji,mean
😀,مبتسم
😁,مشع بعيون مبتسمة
😂,بدموع الفرح
🤣,الدحرجة على الأرض يضحك
😃,مبتسم بعيون كبيرة
😄,مبتسم بعيون مبتسمة
😅,مبتسم مع عرق
😆,بابتسامة عريضة
😉,غمز
😊,مبتسم بعيون مبتسمة


In [0]:
df.show(5)

+--------------------+--------+
|            sentence|polarity|
+--------------------+--------+
|واو فيديو رائع ان...|       1|
|Sh7al bqety katfk...|       1|
|طلعتي كتشبه زياش ...|       1|
|Ana yallah bdit f...|       1|
|Tbarkellaaah alii...|       1|
+--------------------+--------+
only showing top 5 rows



In [0]:
df.head(10)

Out[6]: [Row(sentence='واو فيديو رائع انا إلهام و انا من أشد المعجبين بك 😍😍', polarity=1),
 Row(sentence='Sh7al bqety katfkar bash matnsash shy laqta mn laqtat lly majitish m9erre3? 😂', polarity=1),
 Row(sentence='طلعتي كتشبه زياش 😂👍', polarity=1),
 Row(sentence='Ana yallah bdit fel YouTube ou bghitkom tde3moni a khouti lmohtawa n9i dekhlou ou choufou b3inikom allah yrhem likom lwalidin 😍', polarity=1),
 Row(sentence='Tbarkellaaah aliiik nta dmaaghhh👍👍👍👍👍👍👍👍👍❤️❤️❤️❤️❤️❤️❤️', polarity=1),
 Row(sentence='كلشي على مقدم البرنامج😂 ههه عزيزنا المشاهدون.', polarity=1),
 Row(sentence='كتشبه لراجلي إحساس عا راجلي زوين شويا،الله إخليه ليا 🙏❤', polarity=1),
 Row(sentence='🤣🤣🤣🤣🤣🤣😃😂😂😂😂🤣🤣🤣🤣', polarity=1),
 Row(sentence='0:59 جوعب  🤣🤣🤣🤣🤣🤣🤣😂😂😂', polarity=1),
 Row(sentence='😂👍👏✌', polarity=1)]

In [0]:
df.columns

Out[7]: ['sentence', 'polarity']

In [0]:
df.dtypes

Out[8]: [('sentence', 'string'), ('polarity', 'int')]

In [0]:
df.printSchema()

root
 |-- sentence: string (nullable = true)
 |-- polarity: integer (nullable = true)



In [0]:
df.count()

Out[10]: 518

In [0]:
len(df.columns)

Out[11]: 2

In [0]:
df.groupBy('polarity').count().show()

+--------+-----+
|polarity|count|
+--------+-----+
|      -1|  371|
|    null|   51|
|       1|   96|
+--------+-----+
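
The counts above already show that the labels are unbalanced and that some rows carry no label at all. A minimal sketch of the arithmetic, using only the numbers printed by `groupBy('polarity').count()` (the dictionary and variable names are illustrative, not part of the notebook):

```python
# Counts copied from the groupBy output above; None stands for the null label.
label_counts = {-1: 371, None: 51, 1: 96}

total = sum(label_counts.values())    # 518, the value df.count() returned
labeled = total - label_counts[None]  # rows that still have a label

for label, n in label_counts.items():
    print(f"{label!s:>5}: {n:4d} rows, {n / total:.1%} of all rows")

# Share of the negative class among labeled rows only
negative_share = label_counts[-1] / labeled
print(f"negative share among labeled rows: {negative_share:.1%}")
```

A split this uneven means that plain accuracy should be read next to the share of the majority class.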



In [0]:
from databricks import koalas as ks


In [0]:


# translator = Translator()
# def toarabic(x):
#     print(x)
#     result = translator.translate(x, dest='ar')
#     return result

In [0]:
df_k = df.to_koalas()

In [0]:
type(df_k)

Out[19]: databricks.koalas.frame.DataFrame

In [0]:
df_k.head()

Unnamed: 0,sentence,polarity
0,واو فيديو رائع انا إلهام و انا من أشد المعجبين...,1
1,Sh7al bqety katfkar bash matnsash shy laqta mn...,1
2,طلعتي كتشبه زياش 😂👍,1
3,Ana yallah bdit fel YouTube ou bghitkom tde3mo...,1
4,Tbarkellaaah aliiik nta dmaaghhh👍👍👍👍👍👍👍👍👍❤️❤️❤...,1


In [0]:
emojis = emojis.to_koalas()

In [0]:
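# Remove duplicate emojis
# !pip install emoji
import emoji
from collections import Counter

def remove_duplicate_emojis(text):
    # Emoji characters that occur in the text.
    # Note: emoji.UNICODE_EMOJI exists in emoji 1.x (1.6.3 is installed here); it was removed in 2.0.
    count = Counter(c for c in text if c in emoji.UNICODE_EMOJI['en'])
    new_text = ""
    for i in text:
        if i in count:
            # Keep only the first occurrence of each emoji, preceded by a space
            if i not in new_text:
                new_text = new_text + " " + i
        else:
            new_text = new_text + i
    return new_text

def replace_emojis(text):
    # Replace each known emoji with its description from the emojis table
    emojiList = emojis['emoji'].tolist()
    for i in text:
        if i in emojiList:
            text = text.replace(i, emojis[emojis['emoji'] == i]["mean"].values[0])
    return text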

Collecting emoji
  Downloading emoji-1.6.3.tar.gz (174 kB)

In [0]:
text = "طلعتي كتشبه زياش 😂😂👍"
remove_duplicate_emojis(text)

Out[26]: 'طلعتي كتشبه زياش  😂 👍'
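
The same de-duplication can be sketched without the `emoji` package, which avoids depending on a particular package version. This is only an approximation: `is_emoji_char` and `dedupe_emojis` are hypothetical helpers, and the code point ranges are a rough assumption that covers common pictographs but not every emoji (variation selectors and joiners, for example, are treated as ordinary characters).

```python
def is_emoji_char(ch):
    """Rough emoji test by code point range (an approximation, not the emoji package)."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF

def dedupe_emojis(text):
    """Keep the first occurrence of each emoji, preceded by a space; drop repeats."""
    seen = set()
    out = []
    for ch in text:
        if is_emoji_char(ch):
            if ch not in seen:
                seen.add(ch)
                out.append(" " + ch)
        else:
            out.append(ch)
    return "".join(out)

sample = "طلعتي كتشبه زياش 😂😂👍"
print(dedupe_emojis(sample))
```

Using a set for the emojis already seen keeps the membership test cheap even for long comments.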

In [0]:
emojis[emojis['emoji']=='😂']["mean"].values[0]

Out[27]: ' بدموع الفرح'
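
Filtering the `emojis` table once per character, as in the lookup above, repeats the same scan many times. A minimal sketch of a dictionary-based alternative follows; `emoji_meanings` is a hypothetical in-memory copy of three rows from the table shown earlier, and `replace_emojis_with_dict` is an illustrative name rather than a function from the notebook.

```python
# Hypothetical in-memory copy of a few rows of the emojis table shown earlier.
emoji_meanings = {
    "😀": "مبتسم",
    "😂": "بدموع الفرح",
    "😉": "غمز",
}

def replace_emojis_with_dict(text, meanings):
    # One dictionary lookup per character; unknown characters pass through unchanged.
    return "".join(meanings.get(ch, ch) for ch in text)

print(replace_emojis_with_dict("ههه 😂", emoji_meanings))
```

In the notebook, such a dictionary could presumably be built once from the `emoji` and `mean` columns and then reused for every sentence.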

In [0]:
st_word_ar2 =['ila','3a','ta','ja','idan','9al','chwiya','ktar','bzaf','chi','ala','li','l','lihom','lik','liha','ola','awla','wela','ama','ima','howa','ila','ana','nta','ntoma','inama','ah','hadok','oh','a7','waa','fin','finhoma','finma','eh','ps','be3d','be3dma','wra','meli','mli','chi','bik','bikom','bihom','walakin','bach','bayach','bimn','bina','bih','biya','dak','dik','dok','o','kon','3a','kon3a','hta','hata','3ata','bla','hadak','hadik','3asa','3ada','la3ala','wa','3la','3lik','3lih','dakchi','3ando','3andha','3andhom','ma','li','f','fih','fiha','fihom','kan','bhal','haka','kifma','tahowa','tahiya','kolo','kolchi','la','bjouj','bjoujat','bihom','ch7al','kifma','kifach','la','machi','haka','lik','likom','lihom','liha','ma','walaw','liya','chno','imta','mndik','m3a','men','mno','mnhom','mna','mnha','ah','wayeh','hna','hnaya','haka','wach','momkin','hada','hiya','howa','homa','hadok','temak','lheh','ahya','w','hta','3lach','dyal','db']

In [0]:
from pyspark.streaming import StreamingContext

In [0]:
# !pip install pyarabic nltk
import pyarabic.araby as araby
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('arabic'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


def pre_pros(sentence):
    sentence = sentence.lower()
    # Remove URLs
    sentence = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',' ', sentence)
    # Remove non-word characters: keep only Latin letters, digits, Arabic letters, and é/è
    sentence = re.sub("[^a-zA-Z0-9ء-يéè]"," ", sentence)
#   sentence = translator.translate(sentence ,dest='ar').text
    # Collapse runs of spaces into a single space
    sentence = re.sub(' +', ' ', sentence)
    # Shorten any run of a repeated character to two occurrences
    sentence = re.sub(r'(.)\1+', r'\1\1', sentence)
#     sentence = araby.strip_diacritics(sentence)
#     temp1 = [word for word in temp.split() if word not in st_word_ar1]
#         temp = " ".join(temp1)

    # Drop the custom stop words in st_word_ar2, then NLTK's Arabic stop words
    sentence1 = [word for word in sentence.split() if word.lower() not in st_word_ar2]
    sentence = " ".join(sentence1)
    sentence = ' '.join(word for word in sentence.split() if word not in stop_words)
#     sentence = [lemmatizer.lemmatize(token, "v") for token in sentence.split()]
    # Only for text that is not detected as English or French
    if (sentence != "" and detect(sentence)!="en" and detect(sentence)!="fr"):
        # Strip tashkil (Arabic diacritics)
        sentence = araby.strip_diacritics(sentence)
    sentence = "".join(sentence)
    return sentence.strip()





Collecting pyarabic
  Downloading PyArabic-0.6.14-py3-none-any.whl (126 kB)

In [0]:
# from pyspark.streaming import StreamingContext
# text_data = spark.socketTextStream("localhost", 8085)

In [0]:
df_k['pre_sentence'] = df_k['sentence'].map(lambda line: pre_pros(line))
df_k.head()



Unnamed: 0,sentence,polarity,pre_sentence
0,واو فيديو رائع انا إلهام و انا من أشد المعجبين...,1,فيديو رائع انا إلهام انا أشد المعجبين
1,Sh7al bqety katfkar bash matnsash shy laqta mn...,1,sh7al bqety katfkar bash matnsash shy laqta mn...
2,طلعتي كتشبه زياش 😂👍,1,طلعتي كتشبه زياش
3,Ana yallah bdit fel YouTube ou bghitkom tde3mo...,1,yallah bdit fel youtube ou bghitkom tde3moni a...
4,Tbarkellaaah aliiik nta dmaaghhh👍👍👍👍👍👍👍👍👍❤️❤️❤...,1,tbarkellaah aliik dmaaghh
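
The regular-expression steps inside `pre_pros` can be tried in isolation. The sketch below keeps only those steps plus a stop-word filter; it leaves out URL removal, NLTK, `langdetect`, and `pyarabic`. `SAMPLE_STOP_WORDS` is a tiny stand-in for `st_word_ar2`, and `clean_sketch` is an illustrative name.

```python
import re

# Tiny stand-in for st_word_ar2; the notebook's list is much longer.
SAMPLE_STOP_WORDS = {"nta", "ana", "bzaf"}

def clean_sketch(sentence):
    sentence = sentence.lower()
    # Replace everything except Latin letters, digits, Arabic letters (U+0621..U+064A), é and è
    sentence = re.sub("[^a-zA-Z0-9\u0621-\u064Aéè]", " ", sentence)
    # Collapse runs of spaces
    sentence = re.sub(" +", " ", sentence)
    # Shorten any run of a repeated character to exactly two occurrences
    sentence = re.sub(r"(.)\1+", r"\1\1", sentence)
    words = [w for w in sentence.split() if w not in SAMPLE_STOP_WORDS]
    return " ".join(words)

print(clean_sketch("Tbarkellaaah aliiik nta dmaaghhh👍👍👍❤️❤️"))
# prints: tbarkellaah aliik dmaaghh
```

Because repeated letters are shortened to two rather than one, an elongated stop word such as `bzaaaaaf` becomes `bzaaf` and is not matched by the stop-word entry `bzaf`.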


In [0]:
df1=df_k.dropna()
df1.head()

Unnamed: 0,sentence,polarity,pre_sentence
0,واو فيديو رائع انا إلهام و انا من أشد المعجبين...,1,فيديو رائع انا إلهام انا أشد المعجبين
1,Sh7al bqety katfkar bash matnsash shy laqta mn...,1,sh7al bqety katfkar bash matnsash shy laqta mn...
2,طلعتي كتشبه زياش 😂👍,1,طلعتي كتشبه زياش
3,Ana yallah bdit fel YouTube ou bghitkom tde3mo...,1,yallah bdit fel youtube ou bghitkom tde3moni a...
4,Tbarkellaaah aliiik nta dmaaghhh👍👍👍👍👍👍👍👍👍❤️❤️❤...,1,tbarkellaah aliik dmaaghh


In [0]:
df2=df1[df1["pre_sentence"]!=""]
df2.head()

Unnamed: 0,sentence,polarity,pre_sentence
0,واو فيديو رائع انا إلهام و انا من أشد المعجبين...,1,فيديو رائع انا إلهام انا أشد المعجبين
1,Sh7al bqety katfkar bash matnsash shy laqta mn...,1,sh7al bqety katfkar bash matnsash shy laqta mn...
2,طلعتي كتشبه زياش 😂👍,1,طلعتي كتشبه زياش
3,Ana yallah bdit fel YouTube ou bghitkom tde3mo...,1,yallah bdit fel youtube ou bghitkom tde3moni a...
4,Tbarkellaaah aliiik nta dmaaghhh👍👍👍👍👍👍👍👍👍❤️❤️❤...,1,tbarkellaah aliik dmaaghh


In [0]:

# import re
# import nltk
# from nltk.corpus import stopwords
# nltk.download('stopwords')
# stop_words = set(stopwords.words('arabic'))
# from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()


# def clean_text(sentence):
#     sentence = sentence.lower()
#     sentence = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',' ', sentence)
# #     sentence = re.sub("[^a-zA-Z0-9ء-يéè]"," ", sentence)
# #     if(sentence != "" and detect(sentence)=="en"):
# #         sentence = toarabic(sentence)
#     sentence=remove_duplicate_emojis(sentence)
#     sentence=replace_emojis(sentence)
#     display(sentence)
#     sentence = re.sub(' +', ' ', sentence)
#     sentence = re.sub(r'(.)\1+', r'\1\1', sentence)
#     sentence = ' '.join(word for word in sentence.split() if word not in stop_words)
# #     sentence = [lemmatizer.lemmatize(token, "v") for token in sentence.split()]

#     sentence = "".join(sentence)
#     return sentence.strip()


# df_k['pre_sentence'] = df_k['sentence'].map(lambda line: clean_text(line))
# df_k


In [0]:
df3=df2.to_koalas()
df3.loc[df3.polarity == -1,'polarity'] = 0
df3

Unnamed: 0,sentence,polarity,pre_sentence
0,واو فيديو رائع انا إلهام و انا من أشد المعجبين...,1,فيديو رائع انا إلهام انا أشد المعجبين
1,Sh7al bqety katfkar bash matnsash shy laqta mn...,1,sh7al bqety katfkar bash matnsash shy laqta mn...
2,طلعتي كتشبه زياش 😂👍,1,طلعتي كتشبه زياش
3,Ana yallah bdit fel YouTube ou bghitkom tde3mo...,1,yallah bdit fel youtube ou bghitkom tde3moni a...
4,Tbarkellaaah aliiik nta dmaaghhh👍👍👍👍👍👍👍👍👍❤️❤️❤...,1,tbarkellaah aliik dmaaghh
5,كلشي على مقدم البرنامج😂 ههه عزيزنا المشاهدون.,1,كلشي مقدم البرنامج هه عزيزنا المشاهدون
6,كتشبه لراجلي إحساس عا راجلي زوين شويا،الله إخل...,1,كتشبه لراجلي إحساس عا راجلي زوين شويا الله إخل...
8,0:59 جوعب 🤣🤣🤣🤣🤣🤣🤣😂😂😂,1,0 59 جوعب
10,Nari bzaaaaaf 😂😂😂😂😂😂😂🙆🏻‍♀️🙆🏻‍♀️🙆🏻‍♀️🙆🏻‍♀️🙆🏻‍♀️,1,nari bzaaf
12,ههههه🤣🤣🤣😂😂😂😂🤣🤣🤣,1,هه


In [0]:
df4=df3.to_spark()

In [0]:
df4.head()

Out[39]: Row(sentence='واو فيديو رائع انا إلهام و انا من أشد المعجبين بك 😍😍', polarity=1, pre_sentence='فيديو رائع انا إلهام انا أشد المعجبين')

In [0]:
# from pyspark.sql import SQLContext
# sc = SparkContext.getOrCreate()
# sqlContext = SQLContext(sc)

# spark_dff = sqlContext.createDataFrame(df2)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-4193986249466460> in <module>
      3 sqlContext = SQLContext(sc)
      4 
----> 5 spark_dff = sqlContext.createDataFrame(df2)

/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    384         Py4JJavaError: ...
    385         """
--> 386         return self.sparkSession.createDataFrame(

In [0]:
from pyspark.ml.feature import Tokenizer
tokenization=Tokenizer(inputCol='pre_sentence',outputCol='tokens')
tokenized_df=tokenization.transform(df4)
tokenized_df.show(4,False)


+--------------------------------------------------------------------------------------------------------------------------------+--------+-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
|sentence                                                                                                                        |polarity|pre_sentence                                                                                                       |tokens                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------+--------+-------------------------------------------------------------------------------

In [0]:
# count_vec.fit(tokenized_df).vocabulary

In [0]:
from pyspark.ml.feature import HashingTF,IDF
hashing_vec=HashingTF(inputCol='tokens',outputCol='tf_features')
hashing_df=hashing_vec.transform(tokenized_df)
hashing_df.select(['tokens','tf_features']).show(4,False)

+--------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|tokens                                                                                                                                |tf_features                                                                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[فيديو, رائع

In [0]:
hashing_df.columns

Out[43]: ['sentence', 'polarity', 'pre_sentence', 'tokens', 'tf_features']
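
`HashingTF` turns each token list into a fixed-length count vector by hashing every token to an index. The sketch below illustrates that idea in plain Python. It is not Spark's implementation: Spark uses a MurmurHash3 variant, while this sketch uses MD5 only as a stable stand-in, so the indices will differ from those in `tf_features`. The default of 262144 features matches the vector size printed later in the notebook; the function name and sample tokens are illustrative.

```python
import hashlib
from collections import Counter

def hashed_term_frequencies(tokens, num_features=262144):
    """Map tokens to {index: count} using a stable hash modulo num_features."""
    counts = Counter()
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_features
        counts[index] += 1
    return dict(counts)

tokens = ["nari", "bzaaf", "nari"]
tf = hashed_term_frequencies(tokens)
print(tf)
```

No vocabulary has to be fitted, but two different tokens can land on the same index, and an index cannot be mapped back to its token.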

In [0]:
tf_idf_vec=IDF(inputCol='tf_features',outputCol='tf_idf_features')
tf_idf_df=tf_idf_vec.fit(hashing_df).transform(hashing_df)
tf_idf_df.select(['tf_idf_features']).show(4,False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|tf_idf_features                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------
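
The `IDF` stage rescales each term frequency by how rare the term is across documents. Spark's MLlib documentation gives the weight as idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing t. A minimal plain-Python sketch of that formula follows; the function name and the toy documents are made up for illustration.

```python
import math

def inverse_document_frequencies(documents):
    """idf(t) = log((m + 1) / (df(t) + 1)) for m documents."""
    m = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

docs = [["nari", "bzaaf"], ["nari", "zwin"], ["nari"]]
idf = inverse_document_frequencies(docs)
for term, weight in sorted(idf.items()):
    print(f"{term}: {weight:.3f}")
```

A term that appears in every document gets weight 0, so it contributes nothing to `tf_idf_features`.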

In [0]:
len_udf = udf(lambda s: len(s), IntegerType())
refined_text_df = tf_idf_df.withColumn("token_count",len_udf(col('tokens')))
refined_text_df.orderBy(rand()).show(10)

+--------------------+--------+--------------------+--------------------+--------------------+--------------------+-----------+
|            sentence|polarity|        pre_sentence|              tokens|         tf_features|     tf_idf_features|token_count|
+--------------------+--------+--------------------+--------------------+--------------------+--------------------+-----------+
|Hta ana bent o ma...|       0|bent manbghix yew...|[bent, manbghix, ...|(262144,[38562,53...|(262144,[38562,53...|         16|
|كلشي على مقدم الب...|       1|كلشي مقدم البرنام...|[كلشي, مقدم, البر...|(262144,[25300,60...|(262144,[25300,60...|          6|
|بقيتي فيا ولكين ن...|       0|بقيتي فيا ولكين ن...|[بقيتي, فيا, ولكي...|(262144,[18546,30...|(262144,[18546,30...|         15|
|الله يصبرك اختي و...|       0|الله يصبرك اختي و...|[الله, يصبرك, اخت...|(262144,[28943,81...|(262144,[28943,81...|          5|
|لي غاضتني هي الأم...|       0|         غاضتني الأم|      [غاضتني, الأم]|(262144,[222266,2...|(262144,[2

In [0]:
from pyspark.ml.feature import CountVectorizer

In [0]:
count_vec=CountVectorizer(inputCol='tokens',outputCol='features')
cv_text_df=count_vec.fit(refined_text_df).transform(refined_text_df)
cv_text_df.select(['tokens','token_count','features','polarity']).show(10)

+--------------------+-----------+--------------------+--------+
|              tokens|token_count|            features|polarity|
+--------------------+-----------+--------------------+--------+
|[فيديو, رائع, انا...|          7|(2586,[25,90,153,...|       1|
|[sh7al, bqety, ka...|         12|(2586,[109,595,84...|       1|
|[طلعتي, كتشبه, زياش]|          3|(2586,[373,497,16...|       1|
|[yallah, bdit, fe...|         18|(2586,[18,39,202,...|       1|
|[tbarkellaah, ali...|          3|(2586,[881,955,14...|       1|
|[كلشي, مقدم, البر...|          6|(2586,[52,93,325,...|       1|
|[كتشبه, لراجلي, إ...|         10|(2586,[0,51,88,49...|       1|
|       [0, 59, جوعب]|          3|(2586,[778,1501,2...|       1|
|       [nari, bzaaf]|          2|(2586,[192,956],[...|       1|
|                [هه]|          1|   (2586,[52],[1.0])|       1|
+--------------------+-----------+--------------------+--------+
only showing top 10 rows
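
Unlike `HashingTF`, `CountVectorizer` first learns a vocabulary, which is why the vectors above have 2586 entries instead of 262144. The sketch below shows the idea in plain Python under two assumptions: terms are ordered by total corpus count, and ties are broken alphabetically (the tie-breaking rule is a choice made here, not something the notebook confirms). The function names and toy documents are illustrative.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Order terms by total corpus count, most frequent first (ties alphabetical here)."""
    totals = Counter(term for tokens in documents for term in tokens)
    ordered = sorted(totals, key=lambda t: (-totals[t], t))
    return {term: index for index, term in enumerate(ordered)}

def to_sparse_counts(tokens, vocabulary):
    """Return (size, indices, values), mirroring how the features column is printed."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

docs = [["nari", "bzaaf"], ["nari", "nari", "zwin"]]
vocab = fit_vocabulary(docs)
print(vocab)
print(to_sparse_counts(docs[1], vocab))
```

Each index maps back to exactly one term, at the cost of a pass over the data to fit the vocabulary.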



In [0]:
model_text_df=cv_text_df.select(['features','token_count','polarity'])

In [0]:
from pyspark.ml.feature import VectorAssembler
df_assembler = VectorAssembler(inputCols=['features','token_count'],outputCol='features_vec')
model_text_df = df_assembler.transform(model_text_df)
model_text_df.printSchema()

root
 |-- features: vector (nullable = true)
 |-- token_count: integer (nullable = true)
 |-- polarity: integer (nullable = true)
 |-- features_vec: vector (nullable = true)
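
`VectorAssembler` concatenates its input columns into a single vector, so the 2586 count features plus `token_count` give 2587 entries. A minimal sketch of that concatenation on the sparse representation follows; `assemble` is a hypothetical helper, and the sample vector is the `[هه]` row printed earlier.

```python
def assemble(sparse_features, token_count):
    """Append one scalar to a sparse vector given as (size, indices, values)."""
    size, indices, values = sparse_features
    if token_count == 0:
        # A zero entry is simply not stored in a sparse vector
        return size + 1, list(indices), list(values)
    return size + 1, list(indices) + [size], list(values) + [float(token_count)]

features = (2586, [52], [1.0])  # the [هه] row: one token at vocabulary index 52
assembled = assemble(features, 1)
print(assembled)
# prints: (2587, [52, 2586], [1.0, 1.0])
```

The appended value sits at index 2586, the last position of the 2587-entry vector.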



In [0]:
from pyspark.ml.classification import LogisticRegression
training_df,test_df=model_text_df.randomSplit([0.75,0.25])

In [0]:
training_df.groupBy('polarity').count().show()

+--------+-----+
|polarity|count|
+--------+-----+
|       1|   54|
|       0|  228|
+--------+-----+



In [0]:
test_df.groupBy('polarity').count().show()

+--------+-----+
|polarity|count|
+--------+-----+
|       1|   18|
|       0|   76|
+--------+-----+



In [0]:
log_reg=LogisticRegression(featuresCol='features_vec',labelCol='polarity').fit(training_df)
results=log_reg.evaluate(test_df).predictions
results.show(30)

+--------------------+-----------+--------+--------------------+--------------------+--------------------+----------+
|            features|token_count|polarity|        features_vec|       rawPrediction|         probability|prediction|
+--------------------+-----------+--------+--------------------+--------------------+--------------------+----------+
|(2586,[0,1,6,14,1...|         40|       0|(2587,[0,1,6,14,1...|[35.4982161786296...|[0.99999999999999...|       0.0|
|(2586,[0,1,22,51,...|          9|       0|(2587,[0,1,22,51,...|[12.3390794969499...|[0.99999562272369...|       0.0|
|(2586,[0,2,3,11,1...|         30|       0|(2587,[0,2,3,11,1...|[27.1370586614787...|[0.99999999999836...|       0.0|
|(2586,[0,2,22,23,...|         17|       0|(2587,[0,2,22,23,...|[24.7701492516034...|[0.99999999998252...|       0.0|
|(2586,[0,2,23,48,...|         60|       0|(2587,[0,2,23,48,...|[21.2218689264974...|[0.99999999939262...|       0.0|
|(2586,[0,2,36,50,...|         15|       0|(2587,[0,2,36

In [0]:
results1=results.to_koalas()
results1

  Unable to convert the field features. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: Unsupported type in conversion to Arrow: VectorUDT
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.


Unnamed: 0,features,token_count,polarity,features_vec,rawPrediction,probability,prediction
0,"(5.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...",40,0,"(5.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[35.498216178629676, -35.498216178629676]","[0.9999999999999996, 4.440892098500626e-16]",0.0
1,"(3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",9,0,"(3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[12.339079496949969, -12.339079496949969]","[0.999995622723691, 4.377276308975553e-06]",0.0
2,"(1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",30,0,"(1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[27.13705866147875, -27.13705866147875]","[0.9999999999983613, 1.638689184346731e-12]",0.0
3,"(1.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",17,0,"(1.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[24.770149251603478, -24.770149251603478]","[0.9999999999825233, 1.7476686764439364e-11]",0.0
4,"(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",60,0,"(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[21.22186892649744, -21.22186892649744]","[0.9999999993926214, 6.073785918658814e-10]",0.0
5,"(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",15,0,"(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[21.270900906592942, -21.270900906592942]","[0.9999999994216842, 5.783158396610588e-10]",0.0
6,"(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, ...",20,0,"(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, ...","[37.86783993100685, -37.86783993100685]","[1.0, 0.0]",0.0
7,"(2.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",42,0,"(2.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[41.7913300696267, -41.7913300696267]","[1.0, 0.0]",0.0
8,"(1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",11,0,"(1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[25.980955040988338, -25.980955040988338]","[0.9999999999947926, 5.207390074701834e-12]",0.0
9,"(2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...",45,0,"(2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...","[30.622586677414358, -30.622586677414358]","[0.9999999999999498, 5.0182080713057076e-14]",0.0


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-2406234298505232> in <module>
      4 
      5 rddCollect = df.collect()
----> 6 results=log_reg.evaluate(rddCollect).predictions

/databricks/spark/python/pyspark/ml/classification.py in evaluate(self, dataset)
   1264         """
   1265         if not isinstance(dataset, DataFrame):
-> 1266             raise ValueErro

In [0]:
training_predictions=log_reg.evaluate(training_df)
print(training_predictions.accuracy)


1.0


In [0]:
test_results=log_reg.evaluate(test_df)
print(test_results.accuracy)

0.8829787234042553
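
With classes this unbalanced, the test accuracy is easier to judge next to a majority-class baseline. The sketch below uses only numbers printed above (the test-set class counts and the reported accuracy); the variable names are illustrative.

```python
# Test-set class counts from test_df.groupBy('polarity').count() above
test_counts = {1: 18, 0: 76}
reported_accuracy = 0.8829787234042553  # printed by test_results.accuracy

n_test = sum(test_counts.values())
majority_label = max(test_counts, key=test_counts.get)

# Accuracy of always predicting the most frequent class
baseline_accuracy = test_counts[majority_label] / n_test

# Number of correctly classified test rows implied by the reported accuracy
correct = round(reported_accuracy * n_test)

print(f"majority baseline: {baseline_accuracy:.3f}")
print(f"model: {reported_accuracy:.3f} ({correct} of {n_test} rows correct)")
```

The baseline works out to about 0.809, so the model's 0.883 is a moderate gain rather than a large one, and the training accuracy of 1.0 suggests overfitting.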


In [0]:
true_postives = results[(results.polarity == 1) & (results.prediction== 1)].count()

true_negatives = results[(results.polarity == 0) & (results.prediction== 0)].count()

false_positives = results[(results.polarity == 0) & (results.prediction== 1)].count()

false_negatives = results[(results.polarity == 1) & (results.prediction== 0)].count()
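
The four counts above are the cells of a confusion matrix, from which the usual metrics follow. A minimal sketch: `classification_metrics` is a hypothetical helper, and the numbers passed to it are made up for illustration and are not the notebook's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only; NOT the notebook's results.
metrics = classification_metrics(tp=8, tn=2, fp=2, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For the minority positive class, precision and recall say more than accuracy alone, since a model that rarely predicts 1 can still score a high accuracy.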
