# Mini Project 3


## Environment setup
Upload the installer file "miniproyecto3_installer_drive.py" (the name used in the cells below).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving miniproyecto3_installer_drive.py to miniproyecto3_installer_drive.py
User uploaded file "miniproyecto3_installer_drive.py" with length 2562237 bytes


In [None]:
exec(open('miniproyecto3_installer_drive.py').read())

Active services:
2961 ResourceManager
3237 JobHistoryServer
3110 DataNode
3191 NodeManager
3031 NameNode
3292 Jps



## Activity 1
Viewing the data in MySQL: *Hate Speech and Offensive Language Dataset*


In [None]:
!mysql -u root --password=password testdb

Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 15
Server version: 8.0.32-0ubuntu0.20.04.2 (Ubuntu)

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from hate_speech limit 5;
+----------+-------+-------------+--------------------+---------+-------+----------------------------------------------------------------------------------------------------------------------------------------------+
| tweet_id | count | hate_speech | offensive_language | neither | class | tweet                                                                                                                      

## Activity 2
Importing the data with Sqoop. The MySQL password is "password" (without the quotes).

In [None]:
!sqoop import-all-tables --connect jdbc:mysql://localhost/testdb \
                         --username root \
                         -P \
                         --hive-import

23/05/05 17:04:55 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Enter password: 
23/05/05 17:05:21 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
23/05/05 17:05:21 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
23/05/05 17:05:21 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
23/05/05 17:05:22 INFO tool.CodeGenTool: Beginning code generation
23/05/05 17:05:22 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hate_speech` AS t LIMIT 1
23/05/05 17:05:22 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hate_speech` AS t LIMIT 1
23/05/05 17:05:22 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /content/hadoop
Note: /tmp/sqoop-root/compile/fd259

## Activity 3
Viewing the data with Hive

In [None]:
!hive

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/content/apache-hive-2.3.9-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/content/hadoop-2.10.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/content/apache-hive-2.3.9-bin/lib/hive-common-2.3.9.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> select * from hate_speech limit 10;
OK
hate_speech.tweet_id	hate_speech.count	hate_speech.hate_speech	hate_speech.offensive_language	hate_speech.neither	hate_speech.class	hate_

## Activity 4
Reading the data with Spark SQL and splitting it into train and test sets

In [None]:
import findspark
findspark.init()
from pyspark.sql.functions import col
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("MP-3").getOrCreate()

In [None]:
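# Load the Hive table and drop rows whose tweet is missing or a single character
df_hate = spark.sql('SELECT * FROM hate_speech')
df_hate = df_hate.filter('tweet IS NOT NULL AND LENGTH(tweet) > 1')
# Shift the class ids (0, 1, 2) to labels (1, 2, 3)
df_hate = df_hate.withColumn('label', df_hate['class'] + 1)
# 95% of the rows for training, 5% for testing (no seed, so the split varies per run)
training, testing = df_hate.randomSplit([0.95, 0.05])
df_hate.show(10, False)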

+--------+-----+-----------+------------------+-------+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|tweet_id|count|hate_speech|offensive_language|neither|class|tweet                                                                                                                                                             |label|
+--------+-----+-----------+------------------+-------+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|0       |3    |0          |0                 |3      |2    |!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...                      |3    |
|1       |3    |0          |3                 |0      |1    |!!!!! RT @mleew
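The 95/5 split above draws rows at random, so the exact contents of `training` and `testing` change between runs unless a seed is passed (`randomSplit` accepts an optional `seed` argument). As a rough pure-Python sketch of the idea, and not Spark's actual implementation, each row can be assigned to a split by drawing a uniform number and comparing it against the cumulative normalized weights. The function name `weighted_split` and the toy rows are hypothetical.

```python
import random

def weighted_split(rows, weights, seed=None):
    """Assign each row to exactly one split according to normalized weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.95, 1.0] for weights [0.95, 0.05].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0, 1)
        for i, b in enumerate(bounds):
            if u < b:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # stand-in for tweet ids
training_ids, testing_ids = weighted_split(rows, [0.95, 0.05], seed=42)
print(len(training_ids), len(testing_ids))
```

The split sizes are only approximately 95% and 5%, because each row is assigned independently.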

## Activity 5
Preprocessing the training dataset with _tokenization_ and _stopword_ removal

In [None]:
from pyspark.ml.feature import RegexTokenizer
from pyspark.ml.feature import StopWordsRemover

tokenizer = RegexTokenizer(pattern="\\W+", inputCol="tweet", outputCol = 'words' )

stopWords=StopWordsRemover.loadDefaultStopWords('english')
stopWordsRemover = StopWordsRemover(stopWords = stopWords, inputCol = 'words', outputCol = 'clean_words' )

In [None]:
# tokenizer
training_words = tokenizer.transform(training)
training_words.show()

+--------+-----+-----------+------------------+-------+-----+--------------------+-----+--------------------+
|tweet_id|count|hate_speech|offensive_language|neither|class|               tweet|label|               words|
+--------+-----+-----------+------------------+-------+-----+--------------------+-----+--------------------+
|       0|    3|          0|                 0|      3|    2|!!! RT @mayasolov...|    3|[rt, mayasolovely...|
|       1|    3|          0|                 3|      0|    1|!!!!! RT @mleew17...|    2|[rt, mleew17, boy...|
|       2|    3|          0|                 3|      0|    1|!!!!!!! RT @UrKin...|    2|[rt, urkindofbran...|
|       3|    3|          0|                 2|      1|    1|!!!!!!!!! RT @C_G...|    2|[rt, c_g_anderson...|
|       4|    6|          0|                 6|      0|    1|!!!!!!!!!!!!! RT ...|    2|[rt, shenikarober...|
|       5|    3|          1|                 2|      0|    1|"!!!!!!!!!!!!!!!!...|    2|[t_madison_x, the...|
|       6|

In [None]:
# show the "tweet" and "words" columns of training_words
training_words.select(col('tweet'), col('words')).show(20, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                                             |words                                                                                                                                                     |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|!!! RT @mayasolovely: As a woman you sh
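To make the tokenizer's behavior concrete, here is a small pure-Python approximation of `RegexTokenizer(pattern="\\W+")` with its default settings: lowercase the text, split on the pattern, and drop empty tokens. It is only a sketch. The helper name `regex_tokenize` is hypothetical, and `re.ASCII` is used on the assumption that it best matches the default ASCII meaning of `\W` in Java regular expressions.

```python
import re

def regex_tokenize(text, pattern=r"\W+"):
    """Lowercase the text, split on the pattern, and drop empty tokens."""
    tokens = re.split(pattern, text.lower(), flags=re.ASCII)
    return [t for t in tokens if t]

tweet = "!!! RT @mayasolovely: As a woman you shouldn't complain"
print(regex_tokenize(tweet))
```

Because the underscore counts as a word character, a handle such as `@C_G_Anderson` stays a single token, while the apostrophe in `shouldn't` splits the word in two.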

In [None]:
# stopWordsRemover
training_clean = stopWordsRemover.transform(training_words)

In [None]:
from pyspark.sql.functions import size

# Finally, keep only the tweets with more than 3 words that are not stopwords
train_words = training_clean.filter(size(training_clean['clean_words']) > 3)
train_words.show(n=5, truncate=False)

+--------+-----+-----------+------------------+-------+-----+--------------------------------------------------------------------------------------------------------------------------------------------+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|tweet_id|count|hate_speech|offensive_language|neither|class|tweet                                                                                                                                       |label|words                                                                                                                                                     |clean_words                                                                                 |
+--------+-----+-----------+------------------+-------+-----+-------------------------
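The two steps above (stopword removal, then keeping only tweets with more than 3 remaining words) can be sketched in plain Python as follows. The `STOPWORDS` set here is a tiny illustrative stand-in and not Spark's default English list, and the helper names `remove_stopwords` and `keep_document` are hypothetical.

```python
# Tiny illustrative stopword set; Spark's default English list is much longer.
STOPWORDS = {"i", "a", "as", "you", "the", "on", "who", "t"}

def remove_stopwords(words, stopwords=STOPWORDS):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [w for w in words if w.lower() not in stopwords]

def keep_document(clean_words, min_words=3):
    """Mirror the filter size(clean_words) > 3."""
    return len(clean_words) > min_words

docs = [
    ["rt", "mayasolovely", "as", "a", "woman", "you", "shouldn", "t", "complain"],
    ["i", "need", "a", "nap"],
]
for words in docs:
    clean = remove_stopwords(words)
    print(clean, keep_document(clean))
```

The second toy document is dropped: only two words survive stopword removal, which is not more than 3.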

## Activity 6
Training a Word2Vec model and transforming the tweets into numeric features

In [None]:
from pyspark.ml.feature import Word2Vec

model_w2v = Word2Vec(vectorSize=32, minCount = 0, inputCol="clean_words", outputCol="features").fit(train_words)
train_features = model_w2v.transform(train_words)

In [None]:
# show the final DataFrame that will be used for training
train_features.show()
train_features.printSchema()

+--------+-----+-----------+------------------+-------+-----+--------------------+-----+--------------------+--------------------+--------------------+
|tweet_id|count|hate_speech|offensive_language|neither|class|               tweet|label|               words|         clean_words|            features|
+--------+-----+-----------+------------------+-------+-----+--------------------+-----+--------------------+--------------------+--------------------+
|       0|    3|          0|                 0|      3|    2|!!! RT @mayasolov...|    3|[rt, mayasolovely...|[rt, mayasolovely...|[0.06701692820449...|
|       1|    3|          0|                 3|      0|    1|!!!!! RT @mleew17...|    2|[rt, mleew17, boy...|[rt, mleew17, boy...|[0.12749311456886...|
|       2|    3|          0|                 3|      0|    1|!!!!!!! RT @UrKin...|    2|[rt, urkindofbran...|[rt, urkindofbran...|[0.09391862348032...|
|       3|    3|          0|                 2|      1|    1|!!!!!!!!! RT @C_G...|    2|
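Each row's `features` vector above is a single fixed-length vector per tweet: Spark's `Word2VecModel.transform` averages the vectors of the words in the document. A minimal pure-Python sketch of that averaging follows. The 3-dimensional `EMBEDDINGS` table is hypothetical (the notebook's model uses `vectorSize=32`), as is the helper name `document_vector`, and the sketch assumes every word is in the vocabulary.

```python
# Hypothetical 3-dimensional embeddings; the notebook's model uses vectorSize=32.
EMBEDDINGS = {
    "need": [0.25, -0.5, 1.0],
    "good": [0.75, 0.5, 0.0],
    "nap": [0.5, 0.0, -1.0],
}

def document_vector(words, embeddings=EMBEDDINGS, size=3):
    """Average the word vectors of a document; empty documents map to zeros."""
    if not words:
        return [0.0] * size
    total = [0.0] * size
    for word in words:
        vec = embeddings[word]  # sketch assumes every word is in the vocabulary
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(words) for t in total]

print(document_vector(["need", "good", "nap"]))
```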

## Activity 7
Training a RandomForestClassifier model

In [None]:
from pyspark.ml.classification import RandomForestClassifier

algorithm = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=200)
model = algorithm.fit(train_features)


In [None]:
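# build the testing DataFrames

testing_words = tokenizer.transform(testing)
testing_clean = stopWordsRemover.transform(testing_words)
test_words = testing_clean.filter(size(testing_clean['clean_words']) > 3)
# Note: this fits a second Word2Vec model on the test split, so its vectors do not
# share the embedding space of the training features; model_w2v.transform(test_words)
# would reuse the embeddings the classifier was trained on.
model_w2v1 = Word2Vec(vectorSize=32, minCount = 0, inputCol="clean_words", outputCol="features").fit(test_words)
test_features = model_w2v1.transform(test_words)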
test_features.show()
test_features.printSchema()

+--------+-----+-----------+------------------+-------+-----+--------------------+-----+--------------------+--------------------+--------------------+
|tweet_id|count|hate_speech|offensive_language|neither|class|               tweet|label|               words|         clean_words|            features|
+--------+-----+-----------+------------------+-------+-----+--------------------+-----+--------------------+--------------------+--------------------+
|      28|    3|          0|                 3|      0|    1|""" i need a trip...|    2|[i, need, a, trip...|[need, trippy, bi...|[0.01936799800023...|
|      90|    3|          3|                 0|      0|    0|"""@CB_Baby24: @w...|    1|[cb_baby24, white...|[cb_baby24, white...|[0.00134788099158...|
|     148|    3|          0|                 3|      0|    1|"""@ItsYahBoiRay:...|    2|[itsyahboiray, an...|[itsyahboiray, an...|[-0.0343563282027...|
|     172|    3|          0|                 3|      0|    1|"""@Latrobemark: ...|    2|

In [None]:
# get the predictions
predictions = model.transform(test_features)

In [None]:
predictions.show(50)

+--------+-----+-----------+------------------+-------+-----+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|tweet_id|count|hate_speech|offensive_language|neither|class|               tweet|label|               words|         clean_words|            features|       rawPrediction|         probability|prediction|
+--------+-----+-----------+------------------+-------+-----+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|      28|    3|          0|                 3|      0|    1|""" i need a trip...|    2|[i, need, a, trip...|[need, trippy, bi...|[0.01936799800023...|[0.0,17.353466702...|[0.0,0.0867673335...|       2.0|
|      90|    3|          3|                 0|      0|    0|"""@CB_Baby24: @w...|    1|[cb_baby24, white...|[cb_baby24, white...|[0.00134788099158...|[0.0,17.652882295...|[0.0,0.0
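In the output above, `probability` is the `rawPrediction` vector normalized so that its entries sum to 1. With `numTrees=200` the raw entries add up to roughly 200, which is consistent with the first row shown (about 17.35 raw against about 0.0868 probability). Index 0 stays at 0.0 because the labels are 1, 2 and 3. The sketch below illustrates the normalization with a hypothetical raw vector; the helper name `raw_to_probability` and the numbers are assumptions, not values from the run.

```python
def raw_to_probability(raw):
    """Normalize a raw vote vector so that its entries sum to 1."""
    total = sum(raw)
    if total == 0:
        return list(raw)
    return [r / total for r in raw]

# Hypothetical raw vector for 200 trees; index 0 is unused because labels are 1, 2, 3.
raw = [0.0, 20.0, 150.0, 30.0]
probability = raw_to_probability(raw)
# The predicted label is the index with the highest probability.
prediction = float(probability.index(max(probability)))
print(probability, prediction)
```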

In [None]:
# show the label, the prediction, and the original tweet

predictions.select(col('label'), col('prediction'), col('tweet')).show(10,False)

+-----+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|prediction|tweet                                                                                                                                                                |
+-----+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2    |2.0       |""" i need a trippy bitch who fuck on Hennessy """                                                                                                                   |
|1    |2.0       |"""@CB_Baby24: @white_thunduh alsarabsss"" hes a beaner smh you can tell hes a mexican"                                                                              |
|2    |2.0       |"""@ItsYahBoiRay: @anelylove if I don't get my dick sucke

In [None]:
# evaluation with the accuracy metric

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')
accuracy = evaluator.evaluate(predictions)
print("Test accuracy: ", "%2.1f%%" % (accuracy*100,))

Test accuracy:  79.6%
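Accuracy is the fraction of test rows whose prediction equals the label. As a small illustration, the sketch below computes it in plain Python for the three label/prediction pairs visible in the preview table above. The helper name `accuracy` is hypothetical, and the value for three rows is of course not the test accuracy of the model.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if float(y) == float(p))
    return correct / len(labels)

labels = [2, 1, 2]               # label column of the three preview rows
predictions = [2.0, 2.0, 2.0]    # prediction column of the same rows
print("Accuracy on the three preview rows: %2.1f%%" % (accuracy(labels, predictions) * 100))
```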
