# Project - Apache Spark & Elasticsearch

##### Students:
* Lilia IZRI      (DS)
* Yacine MOKHTARI (DS)
* Alexandre COMBEAU (DS)

##### Report
[TODO: ADD A LINK HERE]


In [2]:
# !pip install textblob
# !pip install elasticsearch

In [1]:
# import necessary packages
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext


# For ML
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.clustering import StreamingKMeans

# From our util.py file
from utils import sentiment, tweetToJSON

## I. Process & Analyze input data (tweets)
### 1. Create our Dstream that receives data

In [2]:
# Initiate the SparkContext and StreamingContext with 10 second batch interval
sc = SparkContext()
ssc = StreamingContext(sc, 10)
spark = SparkSession(sc)
ssc.checkpoint("file:///tmp/spark")    # Checkpoint for backups (useful for operations by window)

In [3]:
# initiate streaming text from a TCP (socket) source (Our tweets received)
socket_stream = ssc.socketTextStream("127.0.0.1", 5567)

### 2. Process data and tag with sentiment 

Here, we only take the polarity into account and chose to ignore the subjectivity. ;)
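
As an illustration of keeping only the polarity, here is a minimal sketch of how a polarity score could be collapsed to the three labels {-1, 0, 1}. The real `sentiment` function lives in `utils.py` and is not shown here, so the names `polarity_to_label` and `sentiment_sketch`, and the threshold at exactly 0, are assumptions.

```python
def polarity_to_label(polarity: float) -> int:
    """Collapse a polarity score in [-1.0, 1.0] to -1, 0 or 1 (assumed threshold: 0)."""
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0


def sentiment_sketch(text: str) -> int:
    """Hypothetical stand-in for utils.sentiment; requires `pip install textblob`."""
    # Imported lazily so that polarity_to_label can be used without TextBlob installed.
    from textblob import TextBlob

    # TextBlob(...).sentiment is a (polarity, subjectivity) pair; subjectivity is ignored.
    return polarity_to_label(TextBlob(text).sentiment.polarity)


# The pure helper can be checked without any external package.
print(polarity_to_label(0.35), polarity_to_label(-0.8), polarity_to_label(0.0))
```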

In [None]:
# We split the fields of the received tweet and tag the data with the sentiment of the tweet,
#   so the rdd below 'tweets' will be of the form (user, text, date, latitude, longitude, hashtags, sentiment, tweet_id)
def mapSplit(tweet):
    """
    A function that takes a tweet  (the one we sent from the other iPython file),
    splits it into its different fields and adds the sentiment field {-1, 0, 1}
    """
    return (tweet[1],    tweet[2],  tweet[3],    tweet[4],    tweet[5],    tweet[6],   "sentiment: "+ str(sentiment(tweet[2][6:])), tweet[7])
             #user        #text       #date      #latitude   #longitude    #Hashtags    #sentiment(= {-1,0,1})   #Id

In [4]:
#### 1. Process the received tweets (we parse them the same way we sent them into the socket: fields separated by " ###:field:### ", each written as "field_name: value")
tweets_split = socket_stream.map(lambda tweet: tweet.split(' ###:field:### '))
tweets = tweets_split.map(mapSplit)

#### 'tweets' contains RDDs whose elements are represented as tuples
# tweets.pprint() # uncomment this line to see the tweets in the tuple format

#### json_list_per_stream holds the tweet tuples converted to strings in JSON/dict format
json_list_per_stream = tweets.map(tweetToJSON)
# json_list_per_stream.pprint() # uncomment this line to see the tweets in the JSON format

Form of an RDD element in `tweets`:
``('user: userX', 'tweet: @userY blablablbabla', 'date: Thu May 05 00:16:06 +0000 2022', 'lat: 44.933143', 'lon: 7.540121', 'hashtags: #SaveTheWorld', 'sentiment: 0', 'id: 1237288393929')``
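
To make the wire format concrete, the sketch below splits one hypothetical raw line on the separator and strips the `name: ` prefixes by splitting on the first `": "` instead of using fixed slice offsets. The raw line and the helper `strip_prefix` are assumptions for illustration. The content before the first separator is assumed to be empty or unused, since `mapSplit` only reads indices 1 to 7.

```python
SEPARATOR = " ###:field:### "

# Hypothetical raw line, built the way the sender is assumed to write it.
raw = SEPARATOR.join([
    "",  # whatever precedes the first separator (ignored by mapSplit)
    "user: userX",
    "tweet: @userY blablablbabla",
    "date: Thu May 05 00:16:06 +0000 2022",
    "lat: 44.933143",
    "lon: 7.540121",
    "hashtags: #SaveTheWorld",
    "id: 1237288393929",
])


def strip_prefix(field: str) -> str:
    """Return the value after the first ': ' (e.g. 'lat: 44.9' -> '44.9')."""
    return field.split(": ", 1)[1]


fields = raw.split(SEPARATOR)
user, text, date, lat, lon, hashtags, tweet_id = fields[1:8]

# Numeric fields can be converted once the prefix is gone.
coordinates = (float(strip_prefix(lat)), float(strip_prefix(lon)))
print(strip_prefix(user), coordinates, int(strip_prefix(tweet_id)))
```

Splitting on the first `": "` avoids having to keep offsets such as `[5:]` or `[6:]` in sync with the field names.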

### 3. ML : Cluster tweets according to sentiments and their location

In [8]:
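# We create a training set and a test set from the same stream.
# Features: [sentiment, latitude, longitude]; the "sentiment: ", "lat: " and "lon: " prefixes are stripped before conversion.
training_data =  tweets.map(lambda tweet: Vectors.dense([float(tweet[6][11:]), float(tweet[3][5:]), float(tweet[4][5:])]))
testing_data  =  tweets.map(lambda tweet: LabeledPoint(float(tweet[6][11:]), Vectors.dense([float(tweet[6][11:]), float(tweet[3][5:]), float(tweet[4][5:])])))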


# We create a model with random clusters and specify the number of clusters to find
k = 3
dimension = 3
weights = 0.0
seed = 21

# init
model = StreamingKMeans(k=k, decayFactor=0.5).setRandomCenters(dimension, weights, seed)

# Train the model
model.trainOn(training_data)  

# Predict
result = model.predictOnValues(testing_data.map(lambda lp: (lp.label, lp.features)))
# result.pprint()
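
To clarify what `decayFactor=0.5` does, here is a small sketch of the per-cluster center update as given in the Spark MLlib guide for streaming k-means: the new center is a weighted average of the old center (weight `n_old * decay`) and the mean of the points assigned in the current batch (weight `m_new`). The function name `update_center` and the sample numbers are assumptions; this is plain Python, not Spark's implementation.

```python
def update_center(center, n_old, batch_mean, m_new, decay):
    """Blend the previous center with the current batch mean.

    center     -- previous cluster center (list of floats)
    n_old      -- weight (number of points) accumulated so far
    batch_mean -- mean of the points assigned to this cluster in the new batch
    m_new      -- number of points assigned in the new batch
    decay      -- decay factor in [0, 1]; 0 forgets the past, 1 keeps all of it
    """
    denom = n_old * decay + m_new
    if denom == 0:
        # No past weight and no new points: nothing to update.
        return list(center)
    return [(c * n_old * decay + x * m_new) / denom
            for c, x in zip(center, batch_mean)]


# Feature order mirrors the notebook: [sentiment, latitude, longitude].
old_center = [0.0, 0.0, 0.0]
batch_mean = [1.0, 40.0, 10.0]

for decay in (0.0, 0.5, 1.0):
    print(decay, update_center(old_center, 10, batch_mean, 10, decay))
```

With `decay=0.0` the center jumps to the batch mean, while with `decay=1.0` old and new points count equally, so `0.5` sits between the two.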

In [9]:
# We keep the predictions of each tweet (the index of the cluster), and we create (indexCluster, 1) pairs
predictions   = result.map(lambda x: (x[1], 1))

# We reduce by key and window to get the number of elements assigned to each cluster
size_clusters = predictions.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
size_clusters.pprint()
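
The two lambdas passed to `reduceByKeyAndWindow` let Spark update the window incrementally: counts from the batch entering the window are added (`x + y`) and counts from the batch leaving it are subtracted (`x - y`). This form requires checkpointing, which is why `ssc.checkpoint` is set earlier. The sketch below simulates that bookkeeping in plain Python for a 30-second window over 10-second batches. The function `slide` and the sample batches are assumptions for illustration.

```python
from collections import Counter, deque

WINDOW_BATCHES = 3  # 30 s window / 10 s batch interval


def slide(window_counts, batches, new_batch):
    """Advance the window by one batch and return the updated per-cluster counts."""
    batches.append(new_batch)
    updated = Counter(window_counts)
    for key, n in new_batch.items():
        updated[key] += n  # reduce function: x + y
    if len(batches) > WINDOW_BATCHES:
        leaving = batches.popleft()
        for key, n in leaving.items():
            updated[key] -= n  # inverse function: x - y
    return updated


# Hypothetical per-batch counts: {cluster_index: number_of_tweets}.
stream = [{1: 2}, {1: 1, 0: 3}, {2: 4}, {1: 5}]

counts, batches = Counter(), deque()
for batch in stream:
    counts = slide(counts, batches, batch)
    print(dict(counts))
```

After the fourth batch the first one has left the window, so the counts equal the sum of the last three batches only.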

In [10]:
# tweets.saveAsTextFiles("tmp/")

In [8]:
# start streaming and wait a couple of minutes to get enough tweets
ssc.start()

-------------------------------------------
Time: 2022-05-06 00:22:50
-------------------------------------------
('user: vazouzou', 'tweet: RT @ThatUmbrella: "He did that to me in front of other people!"  Amber Heard says Johnny Depp abused her in front of multiple people, they…', 'date: Fri May 06 00:22:37 +0000 2022', 'lat: 44.933143', 'lon: 7.540121', 'hashtags: ', 'sentiment: -1', 'id: 219225360')
('user: pqpmiam', 'tweet: tirem a cara feia do johnny depp da timeline não aguento mais', 'date: Fri May 06 00:22:38 +0000 2022', 'lat: 42.4494141', 'lon: 27.2936714', 'hashtags: ', 'sentiment: 0', 'id: 472154327')
('user: mariruybarbosa', 'tweet: To meio obcecada com esse caso da Amber e do Depp! 🥵', 'date: Fri May 06 00:22:38 +0000 2022', 'lat: -22.9110137', 'lon: -43.2093727', 'hashtags: ', 'sentiment: 0', 'id: 125432895')
('user: Mursleen01', 'tweet: Vanessa Paradis attends fashion show amid ex Johnny Depp’s\xa0trial https://t.co/oM0elEkOK8', 'date: Fri May 06 00:22:38 +0000 2022', '

In [21]:
print("helloooo", json_list_per_stream)

helloooo None
-------------------------------------------
Time: 2022-05-05 23:36:00
-------------------------------------------
('user: julesradiohead', 'tweet: RT @ReemDepp: Johnny Depp obviously disgusted and don’t want to face her! He gave her his back… https://t.co/rrXqj5sJt7', 'date: Thu May 05 23:35:00 +0000 2022', 'lat: 44.933143', 'lon: 7.540121', 'hashtags: ', 'sentiment: -1', 'id: 1331954780')
('user: Assiaaaa5', 'tweet: RT @marcorobinson7: Vanessa Paradis on Johnny Depp: "I knew him for 25 years, we lived as a couple for 14 years and we raised our 2 childre…', 'date: Thu May 05 23:35:01 +0000 2022', 'lat: 44.933143', 'lon: 7.540121', 'hashtags: ', 'sentiment: 0', 'id: 1139594034484273152')
('user: 1LoveEmpower', "tweet: RT @maIeficentmills: here are texts from johnny depp's ex assistant stephen deuters to amber heard, which acknowledge &amp; prove that this inc…", 'date: Thu May 05 23:35:01 +0000 2022', 'lat: 25.029422', 'lon: -77.36195598496681', 'hashtags: ', 'sentiment: 0

In [11]:
print("Clusters coordinates: " + str(model.latestModel().clusterCenters))

-------------------------------------------
Time: 2022-05-05 23:33:30
-------------------------------------------
('user: RoxyMcfly143', 'tweet: RT @Eve_Barlow: The Depp / Heard trial is a festival of misogynist hate. To all those engaged in the dehumanization of a courageous individ…', 'date: Thu May 05 23:33:23 +0000 2022', 'lat: 44.933143', 'lon: 7.540121', 'hashtags: ', 'sentiment: -1', 'id: 1281605548111204353')



NameError: name 'model' is not defined

-------------------------------------------
Time: 2022-05-05 23:33:40
-------------------------------------------
('user: redback1gc', 'tweet: RT @JackPosobiec: The Johnny Depp trial is a massive redpill', 'date: Thu May 05 23:33:24 +0000 2022', 'lat: 36.5748441', 'lon: 139.2394179', 'hashtags: ', 'sentiment: 0', 'id: 2952892291')
('user: EVlLEMPIRE', 'tweet: RT @altoidsrevenge: johnny depp said he would f**k his ex wife’s burnt dead body n y’all still on here talm bout some “justiceforjohnny” no…', 'date: Thu May 05 23:33:24 +0000 2022', 'lat: 25.029422', 'lon: -77.36195598496681', 'hashtags: ', 'sentiment: -1', 'id: 1477796808667639809')
('user: PdeGdb', "tweet: RT @paulkart76: I'm thankful Johnny Depp is even still alive at all. Amber Heard is one of the scariest women I've ever seen. This is her i…", 'date: Thu May 05 23:33:24 +0000 2022', 'lat: 44.933143', 'lon: 7.540121', 'hashtags: ', 'sentiment: 1', 'id: 1438987903791677444')
('user: RoM_Rossary', 'tweet: RT @AlexJam91754067: A

In [13]:
print("Clusters coordinates: " + str(model.latestModel().clusterCenters))

Clusters coordinates: [[-0.05196425 -0.11119605  1.0417968 ]
 [-1.25673929  0.74538768 -1.71105376]
 [-0.20586438 -0.23457129  1.12814404]]
-------------------------------------------
Time: 2022-05-05 23:15:40
-------------------------------------------
('user: _BiGmIKE___', 'tweet: So Megan and Johnny depp wife lying😂', 'date: Thu May 05 23:15:33 +0000 2022', 'lat: 40.6233536', 'lon: -111.8908443', 'hashtags: ', 0, 'id: 2830969055')
('user: paolayesenia', 'tweet: RT @JuanitoSay: Con ustedes, otra ex novia apoyando a Johnny Depp.  NO EXISTE una mujer que viviera algo negativo con el, SALVO Amber Heard…', 'date: Thu May 05 23:15:33 +0000 2022', 'lat: 52.4760892', 'lon: -71.8258668', 'hashtags: ', 0, 'id: 245313079')

-------------------------------------------
Time: 2022-05-05 23:15:40
-------------------------------------------
(1, 2)

-------------------------------------------
Time: 2022-05-05 23:15:50
-------------------------------------------
('user: FunRageDie', "tweet: RT @MikeE

In [3]:
## Oddly, this works on its own... but it crashes when run on a stream


# a = "id: 234774849593"

# es.index(index="hello",
#                 id=int(a[4:]),
#                 document={"user": "lili",
#                           "text": "22"})
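
The cause of the crash on the stream is not recorded in this notebook. One pattern often used when writing a DStream to an external store is to create the client inside `foreachPartition`, so that the client object is never pickled and shipped from the driver; whether this addresses the crash above is an assumption. The sketch reuses the `es.index(index=..., id=..., document=...)` call from the cell above. The helper names, the server address and the document field names are hypothetical, and the tuple layout assumed is the one produced by the current `mapSplit`.

```python
def tweet_to_action(tweet):
    """Turn one tuple from `tweets` into (doc_id, document). Hypothetical helper."""
    user, text, date, lat, lon, hashtags, sentiment_field, id_field = tweet

    def value(field):
        # Drop the 'name: ' prefix by splitting on the first ': '.
        return field.split(": ", 1)[1]

    document = {
        "user": value(user),
        "text": value(text),
        "date": value(date),
        "location": {"lat": float(value(lat)), "lon": float(value(lon))},
        "hashtags": value(hashtags),
        "sentiment": int(value(sentiment_field)),
    }
    return int(value(id_field)), document


def index_partition(tweets_iter):
    """Index every tweet of one partition; meant to run on the executors."""
    # Lazy import: the client is created where it is used, not on the driver.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed address
    for tweet in tweets_iter:
        doc_id, document = tweet_to_action(tweet)
        es.index(index="hello", id=doc_id, document=document)


# Wiring on the DStream (to be registered before ssc.start(); not executed here):
# tweets.foreachRDD(lambda rdd: rdd.foreachPartition(index_partition))

sample = ('user: vazouzou', 'tweet: some text', 'date: Fri May 06 00:22:37 +0000 2022',
          'lat: 44.933143', 'lon: 7.540121', 'hashtags: ', 'sentiment: -1', 'id: 219225360')
print(tweet_to_action(sample))
```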