# Web Scraping Viewer Review Data

*David Scanu et Ramata Soraya Dussart*

This notebook uses **Beautiful Soup** to scrape viewer reviews of the films "Inception" and "Sonic 2" from AlloCiné. **The reviews and their ratings are then saved to a .csv file**.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
def scrape_comments(url, max_page):
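	"""Scrape viewer comments and their ratings from AlloCiné review pages.

	Returns a DataFrame with one row per comment and two columns:
	'note' (rating as a float) and 'comment' (cleaned review text).
	"""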

	# Headers for request
	HEADERS = ({'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36', 'Accept-Language': 'en-US, en;q=0.5'})

	comments_count = 0

	total_comments_list = []

	for i in range(1, max_page+1):
		
		url_page = url + f"?page={i}"

		# HTTP request (the timeout avoids hanging forever on a stalled connection)
		webpage = requests.get(url_page, headers=HEADERS, timeout=10)
		# Soup Object containing all data
		soup = BeautifulSoup(webpage.content, "lxml")
		# Comments
		comments = soup.find_all("div", {"class" : "hred review-card cf"})

		# print(f"Page => {i}")
		# print(url_page)

		for comment in comments:

			comment_ls = []
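			# Rating: the site uses a decimal comma ("4,5"), so swap it for a dot
			comment_note = comment.find("span", {"class": "stareval-note"}).get_text().replace(',', '.')
			# Convert to float
			comment_note_float = float(comment_note)
			# Text: attrs must be a dict ({"class": ...}), not a set ({"class", ...});
			# double quotes and spoiler markers are removed
			comment_text = comment.find("div", {"class": "content-txt review-card-content"}).get_text().strip().replace('"', "'").replace("spoiler:", '').replace(" [spoiler]", '')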

			comment_ls.append(comment_note_float)
			comment_ls.append(comment_text)

			comments_count += 1
			total_comments_list.append(comment_ls)

	print(f"{comments_count} comments imported in DataFrame.")

	df = pd.DataFrame(total_comments_list, columns=['note', 'comment'])
	return df
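
The parsing step can be checked without any network access by running the same selectors on a small hand-written HTML fragment. This is only a sketch under assumptions: the fragment below is invented to imitate a review card (only the class names come from `scrape_comments`), `parse_reviews` is a hypothetical helper that is not part of the notebook, and it uses the standard-library `html.parser` instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Invented HTML imitating one review card; only the class names
# are taken from scrape_comments, the content is made up.
SAMPLE_HTML = """
<div class="hred review-card cf">
  <span class="stareval-note">4,5</span>
  <div class="content-txt review-card-content">
    Un "très" bon film. spoiler: la fin est ouverte.
  </div>
</div>
"""

def parse_reviews(html):
    """Return a list of [rating, text] pairs found in review-card markup."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib parser, assumption: lxml not required
    rows = []
    for card in soup.find_all("div", {"class": "hred review-card cf"}):
        note_tag = card.find("span", {"class": "stareval-note"})
        text_tag = card.find("div", {"class": "content-txt review-card-content"})
        if note_tag is None or text_tag is None:
            continue  # skip cards that lack a rating or a text
        rating = float(note_tag.get_text().replace(",", "."))
        text = text_tag.get_text().strip().replace('"', "'").replace("spoiler:", "")
        rows.append([rating, text])
    return rows

sample_rows = parse_reviews(SAMPLE_HTML)
print(sample_rows)
```

Unlike the function above, this sketch skips a card when either tag is missing instead of raising an `AttributeError`; whether that is the right behavior for the real pages is a design choice left open here.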

#### Inception

In [4]:
url_inception = "https://www.allocine.fr/film/fichefilm-143692/critiques/spectateurs/"
df_inception = scrape_comments(url_inception, 479)

7175 comments imported in DataFrame.


In [None]:
df_inception.shape

(7175, 2)

In [None]:
df_inception.head()

|   | note | comment |
|---|------|---------|
| 0 | 5.0 | Après le chef d'oeuvre super-héroïque The Dark... |
| 1 | 5.0 | C’est fou ce qu’on aime détester Christopher N... |
| 2 | 5.0 | CHEF D’ŒUVRE ! Le film est absolument parfait ... |
| 3 | 5.0 | Un film aussi novateur que complexe, dont la m... |
| 4 | 5.0 | Christopher Nolan est sûrement l'un des seuls ... |


#### Sonic 2

In [None]:
url_sonic_2 = "https://www.allocine.fr/film/fichefilm-281203/critiques/spectateurs/"
df_sonic_2 = scrape_comments(url_sonic_2, 13)

190 comments imported in DataFrame.


#### Concatenating the two DataFrames

In [None]:
print(df_inception.shape)
print(df_sonic_2.shape)

(7175, 2)
(190, 2)


In [None]:
df = pd.concat([df_inception, df_sonic_2])
df.shape

(7365, 2)
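
By default `pd.concat` keeps each frame's original row labels, so the combined frame contains repeated index values. The export below uses `index=False`, so the CSV is unaffected, but it matters if rows are later selected by label. The following sketch illustrates the difference on toy frames; `df_a` and `df_b` are invented stand-ins, not the scraped data.

```python
import pandas as pd

# Toy frames standing in for df_inception and df_sonic_2 (values invented).
df_a = pd.DataFrame({"note": [5.0, 4.0], "comment": ["a", "b"]})
df_b = pd.DataFrame({"note": [3.5], "comment": ["c"]})

kept = pd.concat([df_a, df_b])                           # index: 0, 1, 0
renumbered = pd.concat([df_a, df_b], ignore_index=True)  # index: 0, 1, 2

print(list(kept.index))
print(list(renumbered.index))
```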

#### Exporting to a .csv file

In [None]:
# Exporting to .csv file
df.to_csv('corpus.csv', sep='|', index=False)
df.shape

(7365, 2)
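
Because the file is written with `sep='|'`, it has to be read back with the same separator. The sketch below shows a round trip on invented rows, using an in-memory buffer in place of `corpus.csv`; `df_demo` and `df_back` are hypothetical names. It also includes a comment containing a `|` character, which pandas wraps in quotes on export so that it is read back as a single field.

```python
import io
import pandas as pd

# Invented rows; the second comment contains the separator character.
df_demo = pd.DataFrame({
    "note": [5.0, 2.5],
    "comment": ["Très bon film", "Bof | trop long"],
})

buffer = io.StringIO()                       # stands in for 'corpus.csv'
df_demo.to_csv(buffer, sep="|", index=False)
buffer.seek(0)

df_back = pd.read_csv(buffer, sep="|")       # same separator as on export
print(df_back)
```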