__You decide to set yourself some new challenges! You pick up the book “Beautiful Stories from
Shakespeare”, and a few questions come naturally to mind__:
- Which theme stands out most in these beautiful stories? You look at the most frequently used
words to get a sense of the lexical field of these stories.
- You have discovered a wonderful story that you want to share with a friend, so you set about
splitting the book into its individual stories in order to export them separately.

In [2]:
import requests

# Step 1: download the file from the URL
url = 'https://www.gutenberg.org/files/1430/1430-0.txt'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# Save the content to a local file
with open('gutenberg.txt', 'wb') as file:
    file.write(response.content)

In [3]:
# Initialize Spark
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("example").setMaster("local")
sc = SparkContext(conf=conf)

# Load the text file into an RDD
rdd = sc.textFile("gutenberg.txt")
# Drop every line equal to the first line of the file
header = rdd.first()
data_rows = rdd.filter(lambda line: line != header)

24/07/30 21:48:27 WARN Utils: Your hostname, locoselli-HP-ProDesk-400-G4-MT resolves to a loopback address: 127.0.1.1; using 192.168.1.21 instead (on interface wlx242fd066436c)
24/07/30 21:48:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/30 21:48:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/07/30 21:48:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/07/30 21:48:28 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/07/30 21:48:28 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
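
The filter above drops only the lines identical to the first line of the file. Project Gutenberg texts also carry a license preamble and footer, delimited by lines starting with `*** START OF` and `*** END OF` (the exact marker wording is assumed here). A minimal sketch, in plain Python, of keeping only the text between those markers; `strip_gutenberg_boilerplate` and the sample lines are hypothetical:

```python
def strip_gutenberg_boilerplate(lines):
    """Keep only the lines between the '*** START OF' and '*** END OF' markers."""
    lines = list(lines)
    body = []
    inside = False
    for line in lines:
        marker = line.strip().upper()
        if marker.startswith("*** START OF"):
            inside = True
            continue
        if marker.startswith("*** END OF"):
            break
        if inside:
            body.append(line)
    # Assumption: if no START marker is found, return the input unchanged.
    return body if inside else lines


# Hypothetical sample, not taken from the real file
sample = [
    "Some preamble line",
    "*** START OF THE EXAMPLE EBOOK ***",
    "A first line of text.",
    "A second line of text.",
    "*** END OF THE EXAMPLE EBOOK ***",
    "Some license line",
]
print(strip_gutenberg_boilerplate(sample))
```

A per-line `filter` cannot know whether it is inside the markers, so for a file of this size one option would be to apply the function to `rdd.collect()` and pass the result back to `sc.parallelize`.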
                                                                                

# Which theme stands out most in these beautiful stories?
You look at the most frequently used words to get a sense of the lexical field of these stories.


In [15]:

# Split each line of the file into lowercase words: flatMap applies the lambda to every line and flattens the results into a single RDD of words
words = rdd.flatMap(lambda line: line.lower().split())

word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

most_frequent_word = word_counts.sortBy(lambda x: x[1], ascending=False).take(50) 
most_frequent_word

['***', 'start', 'of']
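
`split()` separates on whitespace only, so punctuation stays attached to words and tokens such as `***` are counted as words. A possible alternative, sketched here with a regular expression, keeps only runs of ASCII letters, with optional internal apostrophes. `tokenize` and `WORD_RE` are hypothetical names, and the pattern is an assumption that suits plain English text:

```python
import re

# Runs of lowercase ASCII letters, optionally joined by apostrophes ("winter's").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(line):
    """Return the lowercase words of a line, without surrounding punctuation."""
    return WORD_RE.findall(line.lower())


print(tokenize("*** START OF"))             # ['start', 'of']
print(tokenize("THE WINTER'S TALE"))        # ['the', "winter's", 'tale']
print(tokenize("the King of Sicily, and"))  # ['the', 'king', 'of', 'sicily', 'and']
```

Since it takes a line and returns a list, `tokenize` could be passed directly to `rdd.flatMap` in place of the lambda.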

In [16]:
blacklist = set([
    'the', 'and', 'to', '.', 'of', 'he', 'a', 'was', 'his', 'that', 'in', 'her', 'she', 'had', 'for',
    'with', 'said', 'but', 'not', 'i', 'as', 'it', 'you', 'him', 'is', 'when', 'be', 'on', 'so', 'they',
    'at', 'who', 'were', 'have', 'by', 'would', 'this', 'my', 'then', '“i', 'all', 'from', '--', 'will',
    'your'
])

filtered_words = words.filter(lambda word: word not in blacklist and word.isalpha())
word_counts = filtered_words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)


most_frequent_word = word_counts.sortBy(lambda x: x[1], ascending=False).take(10)
most_frequent_word

[('king', 127),
 ('one', 120),
 ('their', 117),
 ('which', 113),
 ('if', 113),
 ('are', 108),
 ('no', 105),
 ('could', 105),
 ('love', 102),
 ('been', 100)]

In [6]:
print("The 10 most frequent words (excluding the blacklist):")
for word, count in most_frequent_word:
    print(f"{word}: {count}")

The 10 most frequent words (excluding the blacklist):
king: 127
one: 120
their: 117
which: 113
if: 113
are: 108
no: 105
could: 105
love: 102
been: 100
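
To sanity-check the Spark pipeline on a small sample, the same filter, count, and rank steps can be written with `collections.Counter`. This is a sketch only; `top_words` and the sample sentence are made up for illustration:

```python
from collections import Counter


def top_words(tokens, stopwords, n=10):
    """Count alphabetic tokens that are not stopwords and return the n most common."""
    counts = Counter(t for t in tokens if t not in stopwords and t.isalpha())
    return counts.most_common(n)


# Hypothetical sample sentence, not taken from the book
tokens = "the king and the queen love the king king love".split()
print(top_words(tokens, {"the", "and"}, n=2))  # [('king', 3), ('love', 2)]
```

The generator expression plays the role of `filter`, `Counter` replaces `map` plus `reduceByKey`, and `most_common` replaces `sortBy` plus `take`.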


# Splitting the stories so they can be exported separately.
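
The goal has two parts: cutting the text into stories and writing each one to its own file. A minimal sketch of both in plain Python, assuming that stories are separated by four blank lines (five consecutive CRLF line breaks) and that the first line of each chunk is its title. `split_stories`, `slugify`, `export_stories`, and the sample text are hypothetical:

```python
import re
import tempfile
from pathlib import Path

# Assumption: four blank lines (five CRLF line breaks) separate two stories.
STORY_SEPARATOR = "\r\n" * 5


def split_stories(text, separator=STORY_SEPARATOR):
    """Split the text on the separator and drop empty chunks."""
    return [chunk.strip() for chunk in text.split(separator) if chunk.strip()]


def slugify(title):
    """Turn a title into a lowercase, file-name-friendly string."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


def export_stories(text, out_dir):
    """Write each story to '<index>_<title>.txt' and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, story in enumerate(split_stories(text)):
        title = story.splitlines()[0]
        path = out_dir / f"{index:02d}_{slugify(title)}.txt"
        path.write_text(story, encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical sample text with two short "stories"
sample_text = "STORY ONE\r\n\r\nText A." + STORY_SEPARATOR + "STORY TWO\r\n\r\nText B."
with tempfile.TemporaryDirectory() as tmp:
    exported_names = [p.name for p in export_stories(sample_text, tmp)]
print(exported_names)  # ['00_story_one.txt', '01_story_two.txt']
```

Because empty chunks are dropped here, the indices may not line up with those produced by a plain `split` on the same text.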

In [62]:

tout_fichier = sc.wholeTextFiles("gutenberg.txt").values().first()

# Split the stories on five consecutive CRLF line breaks, i.e. four blank lines
lines = tout_fichier.split("\r\n\r\n\r\n\r\n\r\n")

dd2 = sc.parallelize(lines)

# Pair each chunk with its zero-based position: (index, text)
dd3 = dd2.zipWithIndex().map(lambda x: (x[1], x[0]))

# Example: select the chunk at index 11
onzeeme_histoire = dd3.filter(lambda x: x[0] == 11)

onzeeme_histoire.take(100)


(11,
 "THE WINTER'S TALE\r\n\r\n\r\n\r\nLeontes was the King of Sicily, and his dearest friend was Polixenes,\r\nKing of Bohemia. They had been brought up together, and only separated\r\nwhen they reached man's estate and each had to go and rule over\r\nhis kingdom. After many years, when each was married and had a son,\r\nPolixenes came to stay with Leontes in Sicily.\r\n\r\nLeontes was a violent-tempered man and rather silly, and he took it into\r\nhis stupid head that his wife, Hermione, liked Polixenes better than\r\nshe did him, her own husband. When once he had got this into his head,\r\nnothing could put it out; and he ordered one of his lords, Camillo, to\r\nput a poison in Polixenes' wine. Camillo tried to dissuade him from this\r\nwicked action, but finding he was not to be moved, pretended to consent.\r\nHe then told Polixenes what was proposed against him, and they fled from\r\nthe Court of Sicily that night, and returned to Bohemia, where Camillo\r\nlived on as Polixenes' 