# Text data analysis

Using data from the [Yelp](https://www.yelp.fr) application, analyze the topics of dissatisfaction expressed in customer reviews.

**Analyze the reviews to detect the different topics of dissatisfaction**
 - preprocessing of the text data
 - use of dimensionality reduction techniques
 - visualization of high-dimensional data

**Collect a sample (about 200 restaurants) of data through the Yelp API**
 - retrieve only the necessary fields
 - store the results in a usable file (for example CSV)
 
In this part, we will use the Yelp API to retrieve new data and run our 3 models on it. We will then check the relevance of the predictions by hand to determine the best model.
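
The collection step described above (keep only the necessary fields, store them in a CSV file) can be sketched as follows. This is a minimal illustration, assuming the API response has already been parsed into a list of dictionaries with `name`, `id` and `reviews` keys. The `flatten_reviews` and `rows_to_csv` helpers and the sample data are hypothetical and only serve the example.

```python
import csv
import io


def flatten_reviews(restaurants):
    """Keep only the fields needed downstream: review text and rating."""
    rows = []
    for restaurant in restaurants:
        # A business may come back with no reviews at all (missing key or None).
        for review in restaurant.get("reviews") or []:
            rows.append({"text": review["text"], "stars": review["rating"]})
    return rows


def rows_to_csv(rows, handle):
    """Write the rows to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["text", "stars"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample mimicking the assumed shape of the API response.
sample = [
    {"name": "A", "id": "a1", "reviews": [{"text": "Cold food", "rating": 1}]},
    {"name": "B", "id": "b2", "reviews": None},
]
buffer = io.StringIO()
rows_to_csv(flatten_reviews(sample), buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing a handle opened with `open(path, "w", newline="")` would write the same content to disk.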

# Loading the libraries

In [1]:
import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim import models
from sklearn.feature_extraction.text import TfidfVectorizer

from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport
import json
import pprint
from string import Template
import csv
import string
from Extract import Extract

import pickle
import random

In [2]:
# Needed during development so that changes to the Extract class are reloaded in the notebook
%load_ext autoreload
%autoreload 2

In [3]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [4]:
data_dir = '../data/'
models_dir = './models/'

# Importing data from the API

In [5]:
class YelpApi:
    """
    Use the Yelp API to retrieve restaurant reviews.
    This class uses GraphQL.
    """

    def __init__(self):
        """
        Init class
        """
        self.client = None
        self.restaurants = []
        self.create_client()

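    def create_client(self):
        """
        Yelp API client
        """
        import os  # local import so this method does not depend on the import cell

        reqHeaders = {
            # Read the API key from the environment instead of hardcoding it.
            # YELP_API_KEY is an assumed variable name; adapt it to your setup.
            'Authorization': f"Bearer {os.environ['YELP_API_KEY']}"
        }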

        _transport = RequestsHTTPTransport(
            url="https://api.yelp.com/v3/graphql",
            headers=reqHeaders,
            use_json=True,
        )

        self.client = Client(
            transport=_transport,
            fetch_schema_from_transport=True,
        )

    def query_gql_location(self, location, limit=50, offset=0):
        """
        Request restaurant names and reviews (limited to 3 per restaurant by Yelp)

        Arguments:
            location {string} -- city name

        Keyword Arguments:
            limit {int} -- Number of records to retrieve - max 50 (default: {50})
            offset {int} -- Offset (default: {0})
        """
        queryTemplate = Template(
            """
            {
                search(
                    term: "restaurants",
                    location: "$location",
                    categories: "restaurants",
                    limit: $limit,
                    offset: $offset
                ) {
                    total
                    business {
                        name
                        id
                        reviews(limit: 50) {
                            text
                            rating
                        }
                    }
                }
            }
            """
        )
        query = queryTemplate.substitute(location=location, limit=limit, offset=offset)
        request = gql(query)
        result = self.client.execute(request)
        for restaurant in result['search']['business']:
            self.restaurants.append(restaurant)

    def get_restaurants(self, location, count=50):
        """
        Get restaurants and reviews information

        Arguments:
            location {string} -- city

        Keyword Arguments:
            count {int} -- Number of restaurants (default: {50})
        """

        max_count = 50  # maximum number of records per query
        full_pages, remainder = divmod(count, max_count)
        offset = 0
        # One query per full page of max_count records
        for _ in range(full_pages):
            self.query_gql_location(location, max_count, offset)
            offset += max_count
        # Last, partial page if count is not a multiple of max_count
        if remainder > 0:
            self.query_gql_location(location, remainder, offset)

    def to_csv(self, file):
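        """
        Save reviews as CSV with the columns:
            text,stars

        Arguments:
            file {string} -- Filename
        """
        with open(file, "w", newline="", encoding="utf-8") as csv_file:
            # Create the writer once and write the header before any review
            wr = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
            wr.writerow(['text', 'stars'])
            for restaurant in self.restaurants:
                # A business can come back without any review
                for review in restaurant['reviews'] or []:
                    # Replace line breaks with spaces so that words are not glued together
                    wr.writerow([review['text'].replace("\n", " "), review['rating']])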

## Import

In [6]:
yelp = YelpApi()

In [7]:
yelp.get_restaurants("London", 700)
yelp.to_csv(data_dir + 'london_raw.csv')
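
The call above requests 700 restaurants, while a single query returns at most 50 records, so `get_restaurants` issues one query per page. The sketch below reproduces only that pagination arithmetic, without calling the API; `page_plan` is a hypothetical helper introduced for illustration, and whether Yelp accepts every offset it produces is not checked here.

```python
def page_plan(count, page_size=50):
    """Return the (limit, offset) pairs needed to fetch `count` records."""
    full_pages, remainder = divmod(count, page_size)
    plan = [(page_size, page * page_size) for page in range(full_pages)]
    if remainder:
        # Last, partial page
        plan.append((remainder, full_pages * page_size))
    return plan


plan = page_plan(700)
print(len(plan), plan[0], plan[-1])  # 14 (50, 0) (50, 650)
print(page_plan(120))                # [(50, 0), (50, 50), (20, 100)]
```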

## Converting the retrieved data to a DataFrame

In [8]:
new_df = pd.read_csv(data_dir + 'london_raw.csv')
new_df.head()

Unnamed: 0,text,stars
0,One of the best fish ever with the most tasty ...,5
1,One of the best fish and chips I've ever had. ...,5
2,You can't go to England and call it an officia...,5
3,"A superb restaurant, we look forward to going ...",5
4,Hard to find a way to add any higher praise to...,5


## Data cleaning

In [10]:
extract = Extract()
extract.clean_file(data_dir + 'london_raw.csv', data_dir + 'london_clean.csv')

100% (4 of 4) |##########################| Elapsed Time: 0:00:05 ETA:  00:00:00
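
The `Extract` class is not shown in this notebook. Judging from the `clean_text` column displayed further down, the cleaning appears to involve lowercasing, punctuation removal, stop-word removal and lemmatization. The sketch below covers only the first three steps with the standard library; the `clean_text` function and the tiny `STOP_WORDS` set are assumptions made for illustration and are not the implementation used by `Extract`.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "at", "but", "i", "is", "it",
    "my", "the", "this", "to", "very", "was",
}


def clean_text(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    # Extracting runs of letters discards punctuation and digits in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Single letters such as the "s" left over from "it's" carry no meaning.
    kept = [token for token in tokens if len(token) > 1 and token not in STOP_WORDS]
    return " ".join(kept)


print(clean_text("This place is a little small, but it's very popular."))
# place little small popular
```

Lemmatization (for example turning "dropped" into "drop") is left out on purpose, since it requires an external NLP library.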

## Loading the previously processed data

In [11]:
new_df = pd.read_csv(data_dir + 'london_clean.csv')
new_df = new_df[new_df['bad_review'] == 1]
new_df.head()

Unnamed: 0,text,stars,clean_text,bad_review
50,This is my second time at NOPI and I really wa...,2,second time nopi really impressed time starter...,1
83,"This place is a little small, but it's very po...",2,place little small popular come monday morning...,1
169,I oredered a Burger&Lobster DIY kit because of...,1,oredere burger lobster diy kit review day deli...,1
176,You arrive excited for some classic Belgian fo...,2,arrive excited classic belgian food excitement...,1
194,"dropped bottled water on my pants and shirt, b...",2,drop bottled water pant shirt still charge eno...,1
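
The rule that sets `bad_review` lives in `Extract` and is not visible here. Every row shown above has 1 or 2 stars, so one plausible rule is a simple rating threshold. The sketch below illustrates that idea on made-up reviews; the threshold value and the `flag_bad_review` helper are assumptions, not the notebook's actual logic.

```python
BAD_REVIEW_MAX_STARS = 2  # assumed threshold, not taken from Extract


def flag_bad_review(stars, max_stars=BAD_REVIEW_MAX_STARS):
    """Return 1 when the rating is at or below the threshold, else 0."""
    return int(stars <= max_stars)


# Made-up reviews for illustration
reviews = [
    {"text": "Terrible service", "stars": 1},
    {"text": "Average", "stars": 3},
    {"text": "Superb", "stars": 5},
    {"text": "Burned pizza", "stars": 2},
]
for review in reviews:
    review["bad_review"] = flag_bad_review(review["stars"])

# Same filter as the notebook cell, expressed on plain dictionaries
bad_reviews = [review for review in reviews if review["bad_review"] == 1]
print([review["text"] for review in bad_reviews])
# ['Terrible service', 'Burned pizza']
```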


In [12]:
print(f"This dataset contains {new_df.shape[0]} observations")

This dataset contains 93 observations


In [17]:
# pick 4 observations at random
rand_obs = random.sample(list(new_df.index), 4)
for i in rand_obs:
    print(f"{i} - {new_df.loc[i]['text']}")

1675 - Terrible service from the time I sat down. Had to flag down multiple people other than our actual waitress whom was nowhere to be found in order to order...
1248 - If you like Burned Pizza and Bad Service that is the place for you. We ordered a pizza to go and the pizza was soggy, soft and very burned under the...
1940 - We were frequent visitors of Vapiano's when we lived in Vienna, so we know what to expect, and I have to say their food and service were always...
1421 - Walked in for dinner. First night in London and super excited for meat & ale pie. Sat at a table upstairs. 10 minutes went by and no one came around at all....
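
Because `random.sample` is called without a seed, rerunning the cell draws different reviews each time. If a repeatable sample is wanted, a seeded generator is one option. The sketch below assumes a fixed seed is acceptable for this manual check; the seed value is arbitrary and the index list simply reuses labels visible in the outputs above.

```python
import random

# Index labels borrowed from the outputs above, for illustration only
indices = [50, 83, 169, 176, 194, 1248, 1421, 1675, 1940]

rng = random.Random(42)  # 42 is an arbitrary seed
first_draw = rng.sample(indices, 4)

rng = random.Random(42)  # same seed, same sequence of draws
second_draw = rng.sample(indices, 4)

print(first_draw == second_draw)  # True
```

Using a dedicated `random.Random` instance avoids changing the global random state used elsewhere in the notebook.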


# Saving the sample set

In [18]:
new_df.loc[rand_obs].to_csv(data_dir + 'london_sample.csv')