## Introduction

This notebook reads a text dataset and uses the Azure Text Analytics service to detect the language of each entry. It then extracts key phrases, sentiment scores, and known entities from the same texts.

In [1]:
# import libraries
import pandas as pd
import os

In [2]:
# Import the azure libraries
from azure.cognitiveservices.language.textanalytics import TextAnalyticsClient
from msrest.authentication import CognitiveServicesCredentials

In [3]:
# read credentials file to get key and endpoint
config = pd.read_csv('credentials.txt', index_col='resource')
cog_key = config.loc['nlp', 'cog_key']
cog_endpoint = config.loc['nlp', 'cog_endpoint']
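The cell above implies that `credentials.txt` is a comma-separated file with a `resource` column plus `cog_key` and `cog_endpoint` columns, and a row named `nlp`. The sketch below shows that layout using only the standard library. The file contents are placeholders, and `load_credentials` is a hypothetical helper, not part of pandas or the Azure SDK.

```python
import csv
import io

# Hypothetical contents of credentials.txt; the key and endpoint are placeholders.
SAMPLE_CREDENTIALS = """resource,cog_key,cog_endpoint
nlp,YOUR_KEY_HERE,https://YOUR_RESOURCE.cognitiveservices.azure.com/
"""


def load_credentials(handle, resource="nlp"):
    """Return (key, endpoint) for the row whose 'resource' column matches."""
    for row in csv.DictReader(handle):
        if row["resource"] == resource:
            return row["cog_key"], row["cog_endpoint"]
    raise KeyError(f"no credentials row for resource {resource!r}")


# Parse the in-memory sample instead of a real file.
sample_key, sample_endpoint = load_credentials(io.StringIO(SAMPLE_CREDENTIALS))
print(sample_endpoint)
```

Keeping the key in a separate file rather than in the notebook makes it easier to share the notebook without sharing the secret.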

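To use the Text Analytics service in your Cognitive Services resource, you need the Azure Cognitive Services Text Analytics SDK, which provides the `azure.cognitiveservices.language.textanalytics` module imported above. Install it (for example with `pip install azure-cognitiveservices-language-textanalytics`) before running the import cell.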

In [36]:
# Read the reviews from the test file (x_test.txt), one review per line
path = r'C:\Users\linds\OneDrive\Data Science\open_classrooms\students\Kazuki Matsuda\Credit'
with open(os.path.join(path, 'x_test.txt'), encoding="utf8") as f:
    lines = f.readlines()
len(lines)

117500

In [37]:
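# Azure expects a list of documents, each a dictionary (JSON object) with 'id' and 'text'
reviews = []
for review_id, text in enumerate(lines):
    reviews.append({'id': review_id, 'text': text})

    # Limit the batch: the service does not accept the entire large file in one request.
    # The check runs after the append, so ids 0 through 40 (41 reviews) are kept.
    if review_id > 39:
        break
len(reviews)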

41
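Rather than stopping after the first few dozen lines, the whole file could be sent in several smaller requests. The sketch below is one way to do that under stated assumptions: `make_documents` and `batched` are hypothetical helpers, the batch size of 40 is only illustrative, and writing the ids as strings is an assumption about what the service prefers. It runs on made-up lines and does not call Azure.

```python
def make_documents(texts, start_id=0):
    """Wrap raw strings as {'id', 'text'} dictionaries; ids are written as strings."""
    return [
        {"id": str(i), "text": text.strip()}
        for i, text in enumerate(texts, start=start_id)
    ]


def batched(documents, batch_size):
    """Yield consecutive slices of at most batch_size documents."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


# Made-up input standing in for the lines read from the file.
sample_lines = [f"text {n}\n" for n in range(95)]
sample_documents = make_documents(sample_lines)
sample_batches = list(batched(sample_documents, batch_size=40))
print([len(batch) for batch in sample_batches])  # [40, 40, 15]
```

Each batch could then be passed as the `documents` argument of a client call, one request per batch.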

In [38]:
# Take a look at the reviews
for review in reviews:
    print(review)

{'id': 0, 'text': "Ne l fin de l seclo XIX l Japon era inda çconhecido i sótico pa l mundo oucidental. Cula antroduçon de la stética japonesa, particularmente na Sposiçon Ounibersal de 1900, an Paris, l Oucidente adquiriu un apetite ansaciable pul Japon i Heiarn se tornou mundialmente coincido pula perfundidade, ouriginalidade i sinceridade de ls sous cuntos. An sous radadeiros anhos, alguns críticos, cumo George Orwell, acusórun Heiarn de trasferir sou nacionalismo i fazer l Japon parecer mais sótico, mas, cumo l'home qu'oufereciu al Oucidente alguns de sous purmeiros lampeijos de l Japon pré-andustrial i de l Período Meiji, sou trabalho inda ye balioso até hoije.\n"}
{'id': 1, 'text': 'Schiedam is gelegen tussen Rotterdam en Vlaardingen, oorspronkelijk aan de Schie en later ook aan de Nieuwe Maas. Per 30 april 2017 had de gemeente 77.833 inwoners (bron: CBS). De stad is vooral bekend om haar jenever, de historische binnenstad met grachten, en de hoogste windmolens ter wereld.\n'}
{'i

## Detect Language
Let's start by identifying the language in which these reviews are written.

In [39]:
# Get a client for your text analytics cognitive service resource
text_analytics_client = TextAnalyticsClient(endpoint=cog_endpoint,
                                            credentials=CognitiveServicesCredentials(cog_key))

# Analyze the reviews you read from the data_project/x_test
language_analysis = text_analytics_client.detect_language(documents=reviews)

# print detected language details
for review_num, review in enumerate(reviews):
    # print the review number
    print('Review', review_num)

    # Get the language details for this review (the top-ranked detected language)
    lang = language_analysis.documents[review_num].detected_languages[0]
    print(' - Language: {}\n - Code: {}\n - Score: {}\n'.format(lang.name, lang.iso6391_name, lang.score))

    # Add the detected language code to the review (so we can do further analysis)
    review["language"] = lang.iso6391_name

Review 0
 - Language: Portuguese
 - Code: pt
 - Score: 0.9716981649398804

Review 1
 - Language: Dutch
 - Code: nl
 - Score: 1.0

Review 2
 - Language: Serbian_Cyrillic
 - Code: sr
 - Score: 0.9195402264595032

Review 3
 - Language: Kannada
 - Code: kn
 - Score: 0.9455782175064087

Review 4
 - Language: Indonesian
 - Code: id
 - Score: 0.9411764740943909

Review 5
 - Language: Russian
 - Code: ru
 - Score: 0.9833333492279053

Review 6
 - Language: Farsi
 - Code: fa
 - Score: 0.9615384936332703

Review 7
 - Language: Uzbek_Cyrillic
 - Code: uz
 - Score: 0.925000011920929

Review 8
 - Language: Bulgarian
 - Code: bg
 - Score: 1.0

Review 9
 - Language: French
 - Code: fr
 - Score: 0.7586206793785095

Review 10
 - Language: English
 - Code: en
 - Score: 0.9946523904800415

Review 11
 - Language: Russian
 - Code: ru
 - Score: 0.90625

Review 12
 - Language: Indonesian
 - Code: id
 - Score: 0.9512194991111755

Review 13
 - Language: Italian
 - Code: it
 - Score: 0.9787233471870422

Review 1
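Once each review carries a `language` code, a quick tally shows how the batch is distributed. The sketch below uses the 14 codes printed above as sample data. `sample_reviews` is a stand-in for the real `reviews` list, so nothing here calls the service.

```python
from collections import Counter

# Language codes copied from the first 14 results printed above.
sample_codes = ["pt", "nl", "sr", "kn", "id", "ru", "fa",
                "uz", "bg", "fr", "en", "ru", "id", "it"]
sample_reviews = [{"id": i, "language": code} for i, code in enumerate(sample_codes)]

# Count how many reviews were assigned to each language code.
language_counts = Counter(review["language"] for review in sample_reviews)
for code, count in language_counts.most_common(3):
    print(code, count)
```

The same tally on the real list can help decide which languages are worth analyzing further.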

## Extract Key Phrases

Now you can analyze the text in the customer reviews to identify key phrases that give some indication of the main talking points.

In [40]:
# Use the client and reviews you created in the previous code cell to get key phrases
key_phrase_analysis = text_analytics_client.key_phrases(documents=reviews)

# print key phrases for each review
for review_num in range(len(reviews)):
    # print the review id
    print('Review', reviews[review_num]['id'])

    # Get the key phrases in this review
    print('\nKey Phrases:')
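    try:
        key_phrases = key_phrase_analysis.documents[review_num].key_phrases
        # Print each key phrase
        for key_phrase in key_phrases:
            print('\t', key_phrase)
        print('\n')
    except Exception:
        # Most likely the service returned fewer results than reviews
        # (documents it could not process are left out), so the index runs out.
        print('Died again, sigh')
        break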

Review 0

Key Phrases:
	 Japon pré
	 Japon parecer
	 Oucidente
	 sótico
	 An sous radadeiros anhos
	 sous purmeiros lampeijos
	 ls sous cuntos
	 un apetite ansaciable pul Japon
	 acusórun Heiarn
	 an Paris
	 cumo George Orwell
	 cumo l'home qu'oufereciu
	 críticos
	 ouriginalidade
	 sinceridade
	 trasferir
	 stética japonesa
	 Sposiçon Ounibersal
	 andustrial
	 çconhecido
	 nacionalismo
	 pula perfundidade
	 Período Meiji
	 seclo


Review 1

Key Phrases:
	 de Schie
	 de gemeente
	 De stad
	 de historische binnenstad
	 de hoogste windmolens
	 de Nieuwe Maas
	 grachten
	 jenever
	 Rotterdam en Vlaardingen
	 inwoners


Review 2

Key Phrases:
	 сар гэхэд хийн түлшний хэрэглээ сард
	 хийн түлшээр ажилладаг тулга
	 хийн түлшний ашиг тусыг хэвлэл мэдээллийн хэрэгслэлээр тасралтгүй сурталчилсаны үр дүнд
	 тн хүрч өссөн юм
	 оны
	 өөрийн брэнд болгон хэрэглээнд нэвтрүүлж
	 тоног төхөөрөмжийн ашиглалт
	 төрөл бүрийн халаагуурууд
	 ын Орос баллон хэрэглэдэг байсан уламжлалыг халж хэрэглэгчдийг со

The key phrases can help you gain an understanding of the most important talking points in each review. For example, a review containing a phrase "helpful staff" or "poor service" can give you an indication of some of the main concerns of the reviewer.
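The loop above looks results up by list position, which only lines up if the service returns exactly one result per review. A more defensive pattern is to match results to reviews by `id`. The sketch below assumes the response exposes `documents` and `errors` lists whose items carry an `id` (and, for errors, a `message`). It uses stand-in objects rather than a real response, and `index_by_id` is a hypothetical helper.

```python
from types import SimpleNamespace


def index_by_id(response):
    """Map document id -> result, and document id -> error message."""
    results = {str(doc.id): doc for doc in response.documents}
    errors = {str(err.id): err.message for err in response.errors}
    return results, errors


# Stand-in for a service response: document "1" failed, so only two results come back.
fake_response = SimpleNamespace(
    documents=[
        SimpleNamespace(id="0", key_phrases=["Período Meiji"]),
        SimpleNamespace(id="2", key_phrases=["jenever", "grachten"]),
    ],
    errors=[SimpleNamespace(id="1", message="hypothetical error message")],
)

results, errors = index_by_id(fake_response)
for doc_id in ["0", "1", "2"]:
    if doc_id in results:
        print(doc_id, results[doc_id].key_phrases)
    else:
        print(doc_id, "skipped:", errors.get(doc_id, "no result returned"))
```

With this lookup, a failed document is reported and skipped instead of shifting every later result onto the wrong review.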

## Determine Sentiment

It might be useful to classify the reviews as *positive* or *negative* based on a *sentiment score*. Again, you can use the Text Analytics service to do this.

In [41]:
# Use the client and reviews you created previously to get sentiment scores
sentiment_analysis = text_analytics_client.sentiment(documents=reviews)

# Print the results for each review
for review_num in range(len(reviews)):

    # Get the sentiment score for this review
    try:
        sentiment_score = sentiment_analysis.documents[review_num].score
        # classify as 'negative' below 0.5, otherwise 'positive'
        if sentiment_score < 0.5:
            sentiment = 'negative'
        else:
            sentiment = 'positive'

        # print review id and sentiment
        print('{} : {} ({})'.format(reviews[review_num]['id'], sentiment, sentiment_score))

    except Exception:
        # Most likely fewer results than reviews came back, so the index runs out.
        print('Died again, sigh')
        break

0 : positive (0.9804902076721191)
1 : positive (0.8357977867126465)
2 : positive (0.9851184487342834)
3 : positive (0.5360177755355835)
4 : positive (0.6788195371627808)
5 : positive (0.5892857313156128)
6 : positive (0.5201155543327332)
7 : negative (0.4048193693161011)
8 : positive (0.5655439496040344)
9 : positive (0.527801513671875)
10 : positive (0.5825332999229431)
11 : negative (0.4395814538002014)
12 : positive (0.6430618166923523)
13 : positive (0.5751879811286926)
14 : negative (0.42960482835769653)
15 : negative (0.4783596396446228)
16 : positive (0.7114706635475159)
Died again, sigh
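Many of the scores above sit close to 0.5, where a hard positive/negative split says little. One option is to add a neutral band around the threshold. The sketch below is only an illustration: `label_sentiment` is a hypothetical helper, the 0.1 margin is an arbitrary choice, and the three scores are copied from the output above.

```python
def label_sentiment(score, threshold=0.5, margin=0.1):
    """Label a score, treating values within `margin` of the threshold as neutral."""
    if score >= threshold + margin:
        return "positive"
    if score <= threshold - margin:
        return "negative"
    return "neutral"


# Scores copied from reviews 0, 6, and 7 in the output above.
recorded_scores = {0: 0.9804902076721191, 6: 0.5201155543327332, 7: 0.4048193693161011}
for review_id, score in recorded_scores.items():
    print(review_id, label_sentiment(score), round(score, 3))
```

With this margin, only review 0 is labeled positive; reviews 6 and 7 fall in the neutral band.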


## Extract Known Entities

*Entities* are things mentioned in text that refer to a commonly understood type of item, such as a location, a person, or a date. Suppose you're interested in the dates and places mentioned in the reviews; you can use the following code to find them.

In [42]:
# Use the client and reviews you created previously to get named entities
entity_analysis = text_analytics_client.entities(documents=reviews)

# Print the results for each review
for review_num in range(len(reviews)):
    print('\nReview', reviews[review_num]['id'])
    # Get the named entities in this review
    try:
        entities = entity_analysis.documents[review_num].entities
        for entity in entities:
            # Only keep date/time and location entities
            if entity.type in ['DateTime','Location']:
                link = '(' + entity.wikipedia_url + ')' if entity.wikipedia_id is not None else ''
                print(' - {}: {} {}'.format(entity.type, entity.name, link))

    except Exception:
        # Most likely fewer results than reviews came back, so the index runs out.
        print('Died again, sigh')
        break


Review 0
 - Location: Paris 
 - Location: Heiarn 

Review 1
 - Location: Schiedam 
 - Location: Rotterdam 
 - Location: Vlaardingen 
 - Location: Schie 
 - Location: Nieuwe Maas 

Review 2

Review 3

Review 4

Review 5
 - Location: Адыге-Хабль районда авай 

Review 6

Review 7
 - Location: بحيرة يلوستون 
 - Location: مونتانا 

Review 8
 - Location: Mississippi 
 - Location: Louisiana 
 - DateTime: 1523 

Review 9
 - Location: 팔공산 
 - Location: 비슬산 
 - Location: 북 
 - Location: 금호강 
 - Location: 대구 

Review 10

Review 11

Review 12

Review 13
 - Location: Itagliàn Pàvana 
 - Location: Sanbûca 

Review 14
 - Location: Иньва 

Review 15
 - Location: Ems 
 - Location: Osneborg 
 - Location: Hannover 
 - Location: Braunschweig 
 - Location: Lüneburg 

Review 16
 - DateTime: 1959 
 - Location: Morago (https://en.wikipedia.org/wiki/Morago)

Review 17
 - Location: Pohja 
 - Location: Helsinki 
 - Location: suomalainen 
 - Location: Suomen 

Review 18
Died again, sigh


Note that some entities are sufficiently well-known to have an associated Wikipedia page, in which case the Text Analytics service returns the URL for that page.
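Since only some entities come with a Wikipedia URL, the formatting step needs to cope with a missing link. The sketch below shows one way to do that with stand-in objects built from three entities in the output above. `format_entity` is a hypothetical helper, and the attribute names simply mirror those used in the cell above.

```python
from types import SimpleNamespace


def format_entity(entity):
    """Return a display line, adding the Wikipedia URL only when one is present."""
    url = getattr(entity, "wikipedia_url", None)
    suffix = f" ({url})" if url else ""
    return f" - {entity.type}: {entity.name}{suffix}"


# Stand-ins for entities in the output above (not real SDK objects).
sample_entities = [
    SimpleNamespace(type="Location", name="Paris", wikipedia_url=None),
    SimpleNamespace(type="Location", name="Morago",
                    wikipedia_url="https://en.wikipedia.org/wiki/Morago"),
    SimpleNamespace(type="DateTime", name="1959", wikipedia_url=None),
]
for entity in sample_entities:
    print(format_entity(entity))
```

Checking the URL itself, rather than a separate id field, avoids a string concatenation error if the URL is ever absent.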

## Learn More

For more information about the Text Analytics service, see [the Text Analytics service documentation](https://docs.microsoft.com/azure/cognitive-services/text-analytics/).