https://docs.google.com/presentation/d/1hrdgmIETLXMcO9jCtLcY_GM0wNb1cGKs/edit#slide=id.g11091aa2c13_1_81
Explorer un jeu de données
Nettoyer un jeu de données
Formatter un jeu de données
Appliquer des algorithmes de Machine Learning
Optimiser des algorithmes de Machine Learning

# Projet Machine Learning - Reviews

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from textblob import Word, TextBlob

## Exploration du dataset

In [None]:
df = pd.read_csv('reviews.csv')

In [3]:
df

Unnamed: 0,Rating,Year_Month,Reviewer_Location,Review_Text
0,5,2019-3,United Arab Emirates,"We've been to Disneyland Hongkong and Tokyo, s..."
1,4,2018-6,United Kingdom,I went to Disneyland Paris in April 2018 on Ea...
2,5,2019-4,United Kingdom,"What a fantastic place, the queues were decent..."
3,4,2019-4,Australia,We didn't realise it was school holidays when ...
4,5,missing,France,A Trip to Disney makes you all warm and fuzzy ...
...,...,...,...,...
13625,5,missing,United Kingdom,i went to disneyland paris in july 03 and thou...
13626,5,missing,Canada,2 adults and 1 child of 11 visited Disneyland ...
13627,5,missing,South Africa,My eleven year old daughter and myself went to...
13628,4,missing,United States,"This hotel, part of the Disneyland Paris compl..."


## Preprocessing

In [35]:
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = stopwords.words('english')
custom_stop_words = ['Disney', 'Disneyland', 'Disneyworld']
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /Users/donor/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/donor/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/donor/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [36]:
def preprocess_reviews(review, custom_stopwords):
    preprocessed_review = review
    preprocessed_review.replace('[^\w\s]', '') # remove whitespace at the beginning of the review
    preprocessed_review = ' '.join(word for word in preprocessed_review.split() if word not in stop_words)
    preprocessed_review = ' '.join(Word(word).lemmatize() for word in preprocessed_review.split())
    return preprocessed_review

In [37]:
df['processed_review'] = df['Review_Text'].apply(lambda x: preprocess_reviews(x, custom_stop_words))
df

Unnamed: 0,Rating,Year_Month,Reviewer_Location,Review_Text,processed_review
0,5,2019-3,United Arab Emirates,"We've been to Disneyland Hongkong and Tokyo, s...","We've Disneyland Hongkong Tokyo, far one best...."
1,4,2018-6,United Kingdom,I went to Disneyland Paris in April 2018 on Ea...,I went Disneyland Paris April 2018 Easter week...
2,5,2019-4,United Kingdom,"What a fantastic place, the queues were decent...","What fantastic place, queue decent best time y..."
3,4,2019-4,Australia,We didn't realise it was school holidays when ...,"We realise school holiday went, consequently e..."
4,5,missing,France,A Trip to Disney makes you all warm and fuzzy ...,A Trip Disney make warm fuzzy actual kid again...
...,...,...,...,...,...
13625,5,missing,United Kingdom,i went to disneyland paris in july 03 and thou...,went disneyland paris july 03 thought brillian...
13626,5,missing,Canada,2 adults and 1 child of 11 visited Disneyland ...,2 adult 1 child 11 visited Disneyland Paris be...
13627,5,missing,South Africa,My eleven year old daughter and myself went to...,My eleven year old daughter went visit son Lon...
13628,4,missing,United States,"This hotel, part of the Disneyland Paris compl...","This hotel, part Disneyland Paris complex, won..."


In [40]:
df['processed_review'][1]

"I went Disneyland Paris April 2018 Easter weekend, I know say June 2018 I can't choose date then, I loved it, mum went I autism managed get disability pas parks. Disney excellent disability access cater type disabilities, visible (wheelchair users, etc.) invisible (autism, etc.), managed get lot ride pas queue normal queue entrance disabilities. I fault one thing I went met Spider man photo taken pay photo expensive, even pay one photo. The food spectacular edible nice, variety food outlet plenty choice. I would loved go Halloween Christmas I would love go again."

In [41]:
df['Review_Text'][1]

"I went to Disneyland Paris in April 2018 on Easter weekend, I know it says June 2018 but I can't choose a date before then, and I loved it, me and my mum went and as I have autism we managed to get a disability pass for both parks. Disney are excellent with disability access and cater to all types of disabilities, both visible (wheelchair users, etc.) and invisible (autism, etc.), we managed to get on a lot of rides because with the pass you don't queue in the normal queue but the entrance for disabilities. I can only fault one thing when I went we met Spider man and had photos taken but you have to pay for the photos and they are very expensive, even to just pay for one photo. The food wasn't spectacular but it was edible and nice, there is a variety of food outlets so there was plenty of choice. I would loved to go again in Halloween or Christmas but if not I would just love to go again."

## Calculate sentiment

In [46]:
df['polarity'] = df['processed_review'].apply(lambda x: TextBlob(x).sentiment[0])
df['subjectivity'] = df['processed_review'].apply(lambda x: TextBlob(x).sentiment[1])

In [47]:
df

Unnamed: 0,Rating,Year_Month,Reviewer_Location,Review_Text,processed_review,subjectivity,polarity
0,5,2019-3,United Arab Emirates,"We've been to Disneyland Hongkong and Tokyo, s...","We've Disneyland Hongkong Tokyo, far one best....",0.550000,0.287500
1,4,2018-6,United Kingdom,I went to Disneyland Paris in April 2018 on Ea...,I went Disneyland Paris April 2018 Easter week...,0.806250,0.468750
2,5,2019-4,United Kingdom,"What a fantastic place, the queues were decent...","What fantastic place, queue decent best time y...",0.566667,0.135185
3,4,2019-4,Australia,We didn't realise it was school holidays when ...,"We realise school holiday went, consequently e...",0.539500,0.164875
4,5,missing,France,A Trip to Disney makes you all warm and fuzzy ...,A Trip Disney make warm fuzzy actual kid again...,0.554167,0.243750
...,...,...,...,...,...,...,...
13625,5,missing,United Kingdom,i went to disneyland paris in july 03 and thou...,went disneyland paris july 03 thought brillian...,0.595833,0.275000
13626,5,missing,Canada,2 adults and 1 child of 11 visited Disneyland ...,2 adult 1 child 11 visited Disneyland Paris be...,0.603667,0.153000
13627,5,missing,South Africa,My eleven year old daughter and myself went to...,My eleven year old daughter went visit son Lon...,0.337500,0.212500
13628,4,missing,United States,"This hotel, part of the Disneyland Paris compl...","This hotel, part Disneyland Paris complex, won...",0.519780,0.253480


In [48]:
kaggle competitions download -c sentiment-analysis-on-movie-reviews

SyntaxError: invalid syntax (2842003431.py, line 1)