<a href="https://colab.research.google.com/github/CostrunLarisa/Big-Data/blob/main/YoutubeComments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Reading the data

Dataset link: https://www.kaggle.com/datasets/nipunarora8/most-liked-comments-on-youtube

In [4]:
pip install pyspark


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=2173d6a472415343e5b92035159f9b95097f4a1a460a7b8dfd8635d594e9747b
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [30]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ytcomments').getOrCreate()
data = spark.read.csv('sample_data/youtube_dataset.csv',inferSchema=True,
                     header=True)
data.printSchema()

root
 |-- Video Name: string (nullable = true)
 |-- Channel Name: string (nullable = true)
 |-- Comment Id: string (nullable = true)
 |-- User Name: string (nullable = true)
 |-- Comment: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Likes: string (nullable = true)



## Data Preprocessing

We will drop the "User Name" column, since it is not relevant to our analysis.

In [31]:
# Drop the rows where the Likes or Comment column is null

data = data.na.drop(subset=["Likes", "Comment"])
data = data.drop("User Name")
data.printSchema()
data.columns

root
 |-- Video Name: string (nullable = true)
 |-- Channel Name: string (nullable = true)
 |-- Comment Id: string (nullable = true)
 |-- Comment: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Likes: string (nullable = true)



['Video Name', 'Channel Name', 'Comment Id', 'Comment', 'Date', 'Likes']

In [32]:
from pyspark.sql.functions import length

# Add a new column, Comment_Length, to avoid processing every row again later

data = data.withColumn("Comment_Length", length(data["Comment"]))
data.columns

['Video Name',
 'Channel Name',
 'Comment Id',
 'Comment',
 'Date',
 'Likes',
 'Comment_Length']

We will convert the Date column to the Date type, using the 'yyyy-MM-dd' format, so that we can compute the number of days that have passed between the comment's publication date and today.

In [39]:
from pyspark.sql.functions import current_date, datediff
from pyspark.sql.functions import substring
from pyspark.sql.functions import to_date

data = data.withColumn('Date', substring(data['Date'], 1, 10))
data = data.withColumn('Date', to_date(data['Date'], 'yyyy-MM-dd'))
updated_data = data.withColumn('Days_Passed', datediff(current_date(), data['Date']))
updated_data = updated_data.drop('Date')

+--------------------+-------------+--------------------+--------------------+------+--------------+-----------+
|          Video Name| Channel Name|          Comment Id|             Comment| Likes|Comment_Length|Days_Passed|
+--------------------+-------------+--------------------+--------------------+------+--------------+-----------+
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgwV0tapZzaFxdYm1...|The people who li...| 98280|            63|       1043|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzXUsI6yrRjTKNAS...|Let's be honest t...|    13|            67|       1013|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzryH8U6Dz_yBmIg...|3.2 Million comme...|370547|            51|       1191|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzPg7VsuRTwJf77j...|claim your “here ...|   763|            77|       1014|
|Luis Fonsi - Desp...|LuisFonsiVEVO|Ugw61yKNdyVJ5T4R_...|The ones who are ...|    94|            56|       1014|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgxaUPmMyW0KD8kqi...|  Kimler burda😂🥰🌹|    45|            15

In [40]:
data = updated_data
data.show()

+--------------------+-------------+--------------------+--------------------+------+--------------+-----------+
|          Video Name| Channel Name|          Comment Id|             Comment| Likes|Comment_Length|Days_Passed|
+--------------------+-------------+--------------------+--------------------+------+--------------+-----------+
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgwV0tapZzaFxdYm1...|The people who li...| 98280|            63|       1043|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzXUsI6yrRjTKNAS...|Let's be honest t...|    13|            67|       1013|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzryH8U6Dz_yBmIg...|3.2 Million comme...|370547|            51|       1191|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzPg7VsuRTwJf77j...|claim your “here ...|   763|            77|       1014|
|Luis Fonsi - Desp...|LuisFonsiVEVO|Ugw61yKNdyVJ5T4R_...|The ones who are ...|    94|            56|       1014|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgxaUPmMyW0KD8kqi...|  Kimler burda😂🥰🌹|    45|            15
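The Days_Passed computation can be traced by hand for a single value. The plain-Python sketch below mirrors the three Spark steps (keep the first 10 characters, parse them as 'yyyy-MM-dd', subtract from today). The names `days_passed` and `reference_day` are hypothetical, and the reference date 2023-06-19 is an assumption: it is not stated anywhere in the notebook, it is simply the date that reproduces the value 1043 shown for the first row, whose raw timestamp is 2020-08-10T20:00:43Z.

```python
from datetime import date


def days_passed(timestamp: str, today: date) -> int:
    """Whole days between the calendar date in `timestamp` and `today`."""
    # Same idea as substring(Date, 1, 10) followed by to_date(..., 'yyyy-MM-dd'):
    # keep only the 'yyyy-MM-dd' prefix and parse it as a date.
    published = date.fromisoformat(timestamp[:10])
    # Same idea as datediff(end, start): end minus start, in days.
    return (today - published).days


# Assumed run date (not given in the notebook); see the note above.
reference_day = date(2023, 6, 19)
example = days_passed("2020-08-10T20:00:43Z", reference_day)
print(example)
```

Because current_date() is evaluated when the cell runs, Days_Passed grows by one for every day that passes before the notebook is re-run, so the values in the output above are tied to the day of execution.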

## Adding a UDF that counts the number of emojis in a comment

In [42]:
pip install emoji

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting emoji
  Downloading emoji-2.5.1.tar.gz (356 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.5.1-py2.py3-none-any.whl size=351210 sha256=cdc624d47c1d590f9b1db9cf34daeeb3f1c72631deb7cef15226333ac9608b0a
  Stored in directory: /root/.cache/pip/wheels/51/92/44/e2ef13f803aa08711819357e6de0c5fe67b874671141413565
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-2.5.1


In [47]:
import emoji
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def count_emojis(text):
    # emoji.emoji_count returns how many emoji occur in the string
    return emoji.emoji_count(text)


# Register the UDF
count_emojis_udf = udf(count_emojis, IntegerType())
spark.udf.register("count_emojis", count_emojis_udf)


<function __main__.count_emojis(text)>

### Adding a new column 'Emojis_number'

In [48]:
from pyspark.sql.functions import col

data = data.withColumn('Emojis_number', count_emojis_udf(col('Comment')))
data.show()

+--------------------+-------------+--------------------+--------------------+------+--------------+-----------+-------------+
|          Video Name| Channel Name|          Comment Id|             Comment| Likes|Comment_Length|Days_Passed|Emojis_number|
+--------------------+-------------+--------------------+--------------------+------+--------------+-----------+-------------+
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgwV0tapZzaFxdYm1...|The people who li...| 98280|            63|       1043|            0|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzXUsI6yrRjTKNAS...|Let's be honest t...|    13|            67|       1013|            0|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzryH8U6Dz_yBmIg...|3.2 Million comme...|370547|            51|       1191|            0|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzPg7VsuRTwJf77j...|claim your “here ...|   763|            77|       1014|            1|
|Luis Fonsi - Desp...|LuisFonsiVEVO|Ugw61yKNdyVJ5T4R_...|The ones who are ...|    94|            56|       1014

### We want to check how many rows in the dataset contain emojis, to judge how relevant this column (Emojis_number) is

In [58]:
from pyspark.sql.functions import col

filtered_data = data.filter(col('Emojis_number') != 0)
print('Total rows: ' + str(data.count()))
print('Rows with emojis: ' + str(filtered_data.count()))

Total rows: 14829
Rows with emojis: 2865
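To judge relevance, the two counts are easier to read as a proportion. The short sketch below only redoes the arithmetic on the numbers printed above; the variable names are hypothetical.

```python
# Counts copied from the cell output above.
total_rows = 14829
rows_with_emojis = 2865

# Fraction of comments that contain at least one emoji.
share = rows_with_emojis / total_rows
print(f"{share:.1%}")  # prints 19.3%
```

Roughly one comment in five contains an emoji, so the column is non-zero for a minority of rows but is far from empty.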


We will add a label column, which is 1 or 0 depending on whether or not a comment is considered most liked.

A comment is considered most liked if its number of likes is > 1000.

## Adding a new *Label* column.

In [20]:
from pyspark.sql.functions import lit

data = data.withColumn('Label', lit(0))

### For the rows where Likes > 1000, we change the value of Label to 1.

In [21]:
from pyspark.sql.functions import when

# Create a temporary column 'Temp_label' with the modified values
updated_data = data.withColumn('Temp_label', when(data.Likes > 1000, 1).otherwise(data.Label))

# Drop the original 'Label' column and rename 'Temp_label' to 'Label'
updated_data = updated_data.drop('Label').withColumnRenamed('Temp_label', 'Label')


+--------------------+-------------+--------------------+--------------------+--------------------+------+--------------+-----+
|          Video Name| Channel Name|          Comment Id|             Comment|                Date| Likes|Comment_Length|Label|
+--------------------+-------------+--------------------+--------------------+--------------------+------+--------------+-----+
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgwV0tapZzaFxdYm1...|The people who li...|2020-08-10T20:00:43Z| 98280|            63|    1|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzXUsI6yrRjTKNAS...|Let's be honest t...|2020-09-09T03:41:34Z|    13|            67|    0|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzryH8U6Dz_yBmIg...|3.2 Million comme...|2020-03-15T21:11:08Z|370547|            51|    1|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzPg7VsuRTwJf77j...|claim your “here ...|2020-09-08T06:01:36Z|   763|            77|    0|
|Luis Fonsi - Desp...|LuisFonsiVEVO|Ugw61yKNdyVJ5T4R_...|The ones who are ...|2020-09-08T08:44:47Z|    9

In [22]:
data = updated_data
data.show()

+--------------------+-------------+--------------------+--------------------+--------------------+------+--------------+-----+
|          Video Name| Channel Name|          Comment Id|             Comment|                Date| Likes|Comment_Length|Label|
+--------------------+-------------+--------------------+--------------------+--------------------+------+--------------+-----+
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgwV0tapZzaFxdYm1...|The people who li...|2020-08-10T20:00:43Z| 98280|            63|    1|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzXUsI6yrRjTKNAS...|Let's be honest t...|2020-09-09T03:41:34Z|    13|            67|    0|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzryH8U6Dz_yBmIg...|3.2 Million comme...|2020-03-15T21:11:08Z|370547|            51|    1|
|Luis Fonsi - Desp...|LuisFonsiVEVO|UgzPg7VsuRTwJf77j...|claim your “here ...|2020-09-08T06:01:36Z|   763|            77|    0|
|Luis Fonsi - Desp...|LuisFonsiVEVO|Ugw61yKNdyVJ5T4R_...|The ones who are ...|2020-09-08T08:44:47Z|    9

### Using data grouping and aggregation (agg and groupBy) to find the most-commented video on each channel

## Formatting for MLlib

In [9]:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import ArrayType, DoubleType

data = data.withColumn("Likes", data["Likes"].cast("double"))
assembler = VectorAssembler(inputCols=['Likes', 'Comment_Length'],
                            outputCol='features')

output = assembler.transform(data)
final_data = output.select('features','Video Name')

# Show the DataFrame
final_data.show()

+---------------+--------------------+
|       features|          Video Name|
+---------------+--------------------+
| [98280.0,63.0]|Luis Fonsi - Desp...|
|    [13.0,67.0]|Luis Fonsi - Desp...|
|[370547.0,51.0]|Luis Fonsi - Desp...|
|   [763.0,77.0]|Luis Fonsi - Desp...|
|    [94.0,56.0]|Luis Fonsi - Desp...|
|    [45.0,15.0]|Luis Fonsi - Desp...|
| [36446.0,56.0]|Luis Fonsi - Desp...|
|   [142.0,51.0]|Luis Fonsi - Desp...|
|    [10.0,48.0]|Luis Fonsi - Desp...|
|   [109.0,85.0]|Luis Fonsi - Desp...|
|[321690.0,33.0]|Luis Fonsi - Desp...|
|     [8.0,69.0]|Luis Fonsi - Desp...|
|   [166.0,64.0]|Luis Fonsi - Desp...|
|      [7.0,9.0]|Luis Fonsi - Desp...|
|     [0.0,64.0]|Luis Fonsi - Desp...|
|   [114.0,66.0]|Luis Fonsi - Desp...|
| [10412.0,40.0]|Luis Fonsi - Desp...|
|     [5.0,73.0]|Luis Fonsi - Desp...|
|     [2.0,63.0]|Luis Fonsi - Desp...|
|  [8075.0,55.0]|Luis Fonsi - Desp...|
+---------------+--------------------+
only showing top 20 rows



## Splitting the dataset

In [10]:
train_yt, test_yt = final_data.randomSplit([0.7,0.3])

## Training the model

In [11]:
from pyspark.ml.classification import LogisticRegression

lr_yt = LogisticRegression(labelCol='features')

In [12]:
fitted_yt_model = lr_yt.fit(train_yt)

IllegalArgumentException: ignored

## Training the model after selecting only the rows where the number of likes is > 100

