# Text data analysis

Using data from the [Yelp](https://www.yelp.fr) application, address the topics of dissatisfaction expressed in customer reviews.

**Analyze the reviews to detect the different topics of dissatisfaction**
 - pre-processing of the text data
 - use of dimensionality reduction techniques
 - visualization of high-dimensional data

**Collect a sample (about 200 restaurants) of data via the Yelp API**
 - retrieve only the required fields
 - store the results in a usable file (for example CSV)
 
In this part we will use the Yelp API to retrieve new data and run it through our 3 models. We will then check the relevance of the predictions manually to determine the best model.

# Loading the libraries

In [1]:
import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim import models
from sklearn.feature_extraction.text import TfidfVectorizer

from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport
import json
import pprint
from string import Template
import csv
import string
from Extract import Extract

import pickle
import random

In [2]:
# Needed during development so that edits to locally imported classes (e.g. Extract) are picked up by the notebook
%load_ext autoreload
%autoreload 2

In [3]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [4]:
data_dir = '../data/cities/'

## Parameters

In [None]:
print("API key")
api_key = input()

In [6]:
# City
city_name = 'Dublin'

# Number of restaurants to search for
restaurant_count = 100

# Importing data from the API

In [7]:
class YelpApi:
    """Use the Yelp API to retrieve restaurant reviews.
    This script uses GraphQL.
    """

    def __init__(self):
        """Init class
        
        :param self:
        
        :return: void 
        """
        self.client = None
        self.restaurants = []
        self.create_client()

    def create_client(self):
        """Yelp API client
        
        :param self:
        
        :return: void 
        
        """
        reqHeaders = {
            'Authorization': 'Bearer ' + api_key
        }

        _transport = RequestsHTTPTransport(
            url="https://api.yelp.com/v3/graphql",
            headers=reqHeaders,
            use_json=True,
        )

        self.client = Client(
            transport=_transport,
            fetch_schema_from_transport=True,
        )

    def query_gql_location(self, location, limit=50, offset=0):
        """Request restaurant names and reviews (limited to 3 per restaurant by Yelp)

        :param self:
        :param location: city name
        :param limit: Number of records to retrieve - max 50 (default: {50})
        :param offset: Offset (default: {0})
        
        :return: void
        """
        queryTemplate = Template(
            """
            {
                search(
                    term: "restaurants", 
                    location: "$location", 
                    categories:"restaurants", 
                    limit:$limit, 
                    offset:$offset                    
                ){
                    total
                    business {
                        name
                        id
                        reviews(limit: 50) {
                            text
                            rating
                        }
                    }
                }
            }
            """
        )
        query = queryTemplate.substitute(location=location, limit=limit, offset=offset)
        request = gql(query)
        result = self.client.execute(request)
        for restaurant in result['search']['business']:
            self.restaurants.append(restaurant)

    def get_restaurants(self, location, count=50):
        """Get restaurants and reviews information

        :param self:
        :param location: city name
        :param count: number of restaurants to get (default: 50)

        :return: void
        """

        max_count = 50  # maximum page size used for one search query
        # Split the request into full pages plus one final partial page
        full_pages, remainder = divmod(count, max_count)
        offset = 0
        for _ in range(full_pages):
            self.query_gql_location(location, max_count, offset)
            offset += max_count
        if remainder > 0:
            self.query_gql_location(location, remainder, offset)

    def to_csv(self, file):
        """Save reviews as CSV:
            - text
            - stars (the Yelp rating)
        :param self:
        :param file: filename

        :return: void
        """
        with open(file, "w", newline="", encoding="utf-8") as csv_file:
            # Create the writer once and write the header before any review
            wr = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
            wr.writerow(['text', 'stars'])
            for restaurant in self.restaurants:
                for review in restaurant['reviews']:
                    wr.writerow([review['text'].replace("\n", ""), review['rating']])
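
The paging logic of `get_restaurants` can be checked without calling the API. The sketch below is only an illustration: `plan_pages` is a hypothetical helper that is not part of the notebook, and it simply lists the `(limit, offset)` pairs that would be requested for a given restaurant count, assuming a page size of 50 as in the class above.

```python
def plan_pages(count, max_count=50):
    """Return the (limit, offset) pairs needed to fetch `count` items.

    Hypothetical helper for illustration; it mirrors the loop in
    YelpApi.get_restaurants but performs no network call.
    """
    full_pages, remainder = divmod(count, max_count)
    # One full-size request per complete page
    pages = [(max_count, i * max_count) for i in range(full_pages)]
    # One smaller request for whatever is left over
    if remainder > 0:
        pages.append((remainder, full_pages * max_count))
    return pages


print(plan_pages(100))  # [(50, 0), (50, 50)]
print(plan_pages(120))  # [(50, 0), (50, 50), (20, 100)]
```

With `restaurant_count = 100`, two requests of 50 restaurants each are issued and no partial page is needed.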

## Import

In [8]:
yelp = YelpApi()
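
The class builds its GraphQL query by substituting values into a `string.Template`. As a possible variation, and not what the notebook does, the sketch below passes the location through `json.dumps` so that a city name containing quotes cannot break the query string. `QUERY` and `build_query` are hypothetical names, and the query is shortened to a few fields for readability.

```python
import json
from string import Template

# Shortened query for illustration; the notebook's query also requests reviews
QUERY = Template("""
{
    search(term: "restaurants", location: $location, limit: $limit, offset: $offset) {
        total
        business { name id }
    }
}
""")


def build_query(location, limit=50, offset=0):
    """Return the query text with the parameters substituted in."""
    # json.dumps adds the surrounding double quotes and escapes embedded
    # quotes and backslashes, which is adequate for typical city names
    return QUERY.substitute(
        location=json.dumps(location),
        limit=int(limit),
        offset=int(offset),
    )


print(build_query("Dublin", limit=20, offset=100))
```

The resulting string can be passed to `gql(...)` in the same way as in `query_gql_location`.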

In [9]:
yelp.get_restaurants(city_name, restaurant_count)
yelp.to_csv(data_dir + city_name + '_raw.csv')
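
To see what `to_csv` produces, here is a small self-contained round trip using an in-memory buffer instead of a file. The two reviews are invented sample data, not results from the API. It shows that `csv.QUOTE_ALL` keeps commas and double quotes inside the review text intact when the file is read back.

```python
import csv
import io

# Invented sample reviews shaped like the API result (assumption for illustration)
reviews = [
    {"text": 'Great food, "best" in town.\nWill return.', "rating": 5},
    {"text": "Cold fries, slow service", "rating": 2},
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(["text", "stars"])
for review in reviews:
    # Same transformation as in YelpApi.to_csv: line breaks are removed
    writer.writerow([review["text"].replace("\n", ""), review["rating"]])

# Read the content back to check that every field survives the round trip
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1])
```

Removing `\n` outright glues the surrounding words together, which is why a review such as `Food: 5/5Service: 5/5` appears without spaces in the data; replacing the line break with a space would keep the words apart.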

## Converting the received data into a DataFrame

In [10]:
new_df = pd.read_csv(data_dir + city_name + '_raw.csv')
new_df.head()

Unnamed: 0,text,stars
0,A must try among temple street. Head there for...,5
1,I couldn't have asked for a better restaurant ...,4
2,While walking (drinking) through the Temple ba...,4
3,Food: 5/5Service: 5/5Cleanliness: 5/5Decor & S...,5
4,My visit to P Mac's was my first time sitting ...,4


## Data cleaning

In [11]:
extract = Extract()
extract.clean_file(data_dir + city_name + '_raw.csv', data_dir + city_name + '_clean.csv')

/ |#                                                  | 0 Elapsed Time: 0:00:00
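
`Extract.clean_file` comes from a local module that is not shown in this notebook, so its exact behaviour is unknown. The sketch below is only a guess at the kind of processing involved, based on the `clean_text` and `bad_review` columns of the cleaned file: lowercasing, keeping alphabetic tokens, dropping stop words, and flagging low ratings. The function names, the stop-word list and the threshold of 2 stars are all assumptions, and stemming (visible in the real output, e.g. `updat`) is left out to keep the example dependency-free.

```python
import re

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one
STOP_WORDS = {"a", "an", "and", "from", "i", "is", "it", "my", "of", "the", "to", "was"}


def clean_text(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def is_bad_review(stars, threshold=2):
    """Return 1 when the rating is at or below the threshold (assumed: 2 stars), else 0."""
    return int(stars <= threshold)


print(clean_text("Updating my review from a 4 stars to a 2."))  # updating review stars
print(is_bad_review(2), is_bad_review(4))  # 1 0
```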

## Loading the previously processed data

In [12]:
new_df = pd.read_csv(data_dir + city_name + '_clean.csv')
new_df = new_df[new_df['bad_review'] == 1]
new_df.head()

Unnamed: 0,text,stars,clean_text,bad_review
10,Updating my review from a 4 stars to a 2. Defi...,2,updat review star definit lost appeal mediocr ...,1
59,Wow. Front of house staff need some training o...,2,wow front hous staff need train friendli ask b...,1
105,"First of all, the menu is incorrect on their w...",1,first menu incorrect websit pre set earli bird...,1
111,Updated: March 4 2020I have been having diarrh...,1,updat march diarrhea sinc night finish meal da...,1
115,I was very excited to try this place because I...,2,veri excit tri place becaus heard mani thing p...,1


In [13]:
print(f"This dataset contains {new_df.shape[0]} observations")

This dataset contains 12 observations


In [14]:
# pick 4 observations at random
rand_obs = random.sample(list(new_df.index), 4)
for i in rand_obs:
    print(f"{i} - {new_df.loc[i, 'text']}")

193 - It's a quaint little place for breakfast, although the service is not the best. It took a while to place our order, and we had to ask for utensils, refills,...
196 - No idea the reviews here. Mediocre, at best. Sat at the kitchen counter and watched the chefs talk gossip back and forth. Steak came, after being pushed,...
116 - I'm going to echo a few other yelpers that said this place wasn't that great. We ordered a fish and chips meal to go since there was no place, not even...
111 - Updated: March 4 2020I have been having diarrhea since the night I finished this meal for 3 days, and been to bathroom 20 times a day. It was horrible....
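
Because the predictions are checked by hand, it can help to draw the same sample on every run. The sketch below is one way to do that, assuming a fixed seed is acceptable: `sample_indices` is a hypothetical helper, and the index labels are a few of those shown in the outputs above.

```python
import random


def sample_indices(indices, k, seed=42):
    """Draw k distinct indices; the same seed always yields the same draw."""
    # A local generator leaves the global random state untouched
    rng = random.Random(seed)
    return rng.sample(list(indices), k)


# A few index labels taken from the outputs above
index = [10, 59, 105, 111, 115, 116, 193, 196]
first = sample_indices(index, 4)
second = sample_indices(index, 4)
print(first == second)  # True
```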


# Saving the sample set

In [15]:
new_df.loc[rand_obs].to_csv(data_dir + city_name + '_sample.csv')