# How To Create a Python Dictionary

You can recognize a dictionary by its curly brackets `{ }` and, for each item in the dictionary, by the colon `:` that separates the key from the value.

In [1]:
fruit = {"apple" : 5, "pear" : 3, "banana" : 4, "pineapple" : 1, "cherry" : 20}

# Access the `fruit` dictionary directly (without using get) and print the value of "banana"
print(fruit["banana"])

# Pick one of the 5 fruits and show that both retrieval styles obtain equal results
print(fruit["cherry"] == fruit.get("cherry"))

4
True
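
The two retrieval styles only behave differently when a key is missing. A minimal sketch of that difference, assuming a key (`"mango"`) that is not part of the original example and is used here purely for illustration:

```python
fruit = {"apple": 5, "pear": 3, "banana": 4, "pineapple": 1, "cherry": 20}

# .get() returns None (or a default you supply) when the key is absent
missing = fruit.get("mango")
with_default = fruit.get("mango", 0)

# Direct access raises KeyError for an absent key
try:
    fruit["mango"]
    raised = False
except KeyError:
    raised = True

print(missing, with_default, raised)
```

Direct access is the stricter choice when a missing key indicates a bug; `.get()` is convenient when absence is expected.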


# Loading Data In Your Python Dictionary

- ### The data used are the reviews of Donna Tartt's The Goldfinch in the Amazon book review set from the UC Irvine Machine Learning Repository. These reviews have been stored in a simple tab-separated file, which is nothing more than a plain text file with columns.

- ### The table contains four columns: <font color=magenta>review score, url, review title and review text.</font>

- ### There are several conceivable ways to put this into a dictionary; in this case, you take the url as the dictionary key and store the other columns in a nested dictionary as the value.
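
The target shape could look roughly like the sketch below. The urls, titles and texts are hypothetical placeholders rather than rows from the file, and `reviews_sketch` is a name chosen only for this illustration.

```python
# Hypothetical rows: every value here is a placeholder, not real data
reviews_sketch = {
    "/example/review-1": {"score": "5.0", "title": "Loved it", "review": "Great read."},
    "/example/review-2": {"score": "1.0", "title": "Not for me", "review": "Too long."},
}

# Two lookups: first by url, then by field name
first_title = reviews_sketch["/example/review-1"]["title"]
print(first_title)
```

Keying on the url works as long as each url is unique; the remaining columns stay grouped under it.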

In [3]:
import urllib.request  # `import urllib` alone does not reliably expose urllib.request
import random

# Load the data from a remote location (URL)
file = urllib.request.urlopen(
    "https://gist.githubusercontent.com/twielfaert/a0972bf366d9aaf6cb1206c16bf93731/raw/dde46ad1fa41f442971726f34ad03aaac85f5414/Donna-Tartt-The-Goldfinch.csv")
f = file.read()

In [5]:
# Decode the raw bytes into a string
text = f.decode(encoding='utf-8', errors='ignore')
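
What `errors='ignore'` does can be seen on a small byte string. This is a sketch on a hypothetical input (`raw`), not on the downloaded file:

```python
# Hypothetical byte string: valid UTF-8 for "café" followed by one invalid byte (0xff)
raw = "café".encode("utf-8") + b"\xff"

# errors='ignore' silently drops bytes that cannot be decoded
ignored = raw.decode(encoding="utf-8", errors="ignore")

# errors='replace' substitutes U+FFFD instead, so the loss stays visible
replaced = raw.decode(encoding="utf-8", errors="replace")

print(ignored)
print(replaced)
```

Ignoring undecodable bytes keeps the tutorial moving, at the cost of dropping characters without notice.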

In [8]:
len(text)

9688697

In [9]:
# Split this single string at the end of lines
lines = text.split("\n")

In [10]:
len(lines)

22861
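
One detail worth knowing about `split("\n")`: if the text ends with a newline, the resulting list ends with an empty string. A small sketch on a hypothetical two-line string (`sample`):

```python
sample = "a\tb\nc\td\n"  # hypothetical two-line text ending in a newline

by_split = sample.split("\n")        # keeps a trailing empty string
by_splitlines = sample.splitlines()  # does not

print(by_split)
print(by_splitlines)
```

Note that `str.splitlines()` also breaks on other line boundaries (for example `\r` or form feeds), so the two methods are not interchangeable for every input.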

In [12]:
lines[0]

'5.0\t/gp/customer-reviews/R20Q73A83CLX8E?ASIN=0316055441\tSeveral GREAT novellas in one very long book!\t"<span class=""a-size-base review-text"">I won\'t go into the plot since everyone will know it. My concern whenever I\'m given or purchase a very long book is, ""Will it keep me engaged?"" and is it worth the weeks it will take me to finish it?""<br/><br/>The answer with THE GOLDFINCH is ""Yes!"" and ""Sorta!""<br/><br/>To me, the book is divided into sections or novellas--the explosion, living with the wealthy family, moving to Vegas, etc.<br/><br/>The brilliant opening section immediately kept me engaged--I think the explosion and Theo\'s experience and recovery is some of the best writing I\'ve read in years.<br/><br/>The family he moves in with may remind you of THE ROYAL TENENBAUMS or Salinger\'s Glass family. They are funny, a bit tragic and sort of odd. The father especially--something about his behavior seemed a bit ""off"" as did his wild dialogue; it didn\'t seem at all "

In [13]:
lines[0].strip().split("\t")

['5.0',
 '/gp/customer-reviews/R20Q73A83CLX8E?ASIN=0316055441',
 'Several GREAT novellas in one very long book!',
 '"<span class=""a-size-base review-text"">I won\'t go into the plot since everyone will know it. My concern whenever I\'m given or purchase a very long book is, ""Will it keep me engaged?"" and is it worth the weeks it will take me to finish it?""<br/><br/>The answer with THE GOLDFINCH is ""Yes!"" and ""Sorta!""<br/><br/>To me, the book is divided into sections or novellas--the explosion, living with the wealthy family, moving to Vegas, etc.<br/><br/>The brilliant opening section immediately kept me engaged--I think the explosion and Theo\'s experience and recovery is some of the best writing I\'ve read in years.<br/><br/>The family he moves in with may remind you of THE ROYAL TENENBAUMS or Salinger\'s Glass family. They are funny, a bit tragic and sort of odd. The father especially--something about his behavior seemed a bit ""off"" as did his wild dialogue; it didn\'t see

In [14]:
# Initialising the dictionary
reviews = {}

# Filling the dictionary
for line in lines:
    fields = line.strip().split("\t")

    # These are just training wheels to see more clearly what goes into the dictionary
    score = fields[0]
    review_id = fields[1]  # named review_id to avoid shadowing the built-in id()
    title = fields[2]
    review = fields[3]

    # NOTICE: each dict entry is also a dict

    reviews[review_id] = {"score": score, "title": title, "review": review}

# Take a random key from the dictionary and print its value
print(reviews[random.choice(list(reviews.keys()))])

{'review': '"<span class=""a-size-base review-text"">Good, but not really my cup of tea. Complex characters, to say the least, and a captivating plot, but I was left disappointed.</span>"', 'score': '5.0', 'title': 'interesting but...'}
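
The review text in the output above still carries its surrounding quotes and doubled `""`, which is CSV-style quoting. As an alternative to splitting on tabs by hand, the standard `csv` module can parse such a line and undo the quoting. A sketch on a hypothetical line (`sample_line`) shaped like the file's rows:

```python
import csv

# Hypothetical line in the same shape as the file: score, url, title, quoted review text
sample_line = '4.0\t/example/review-3\tA fine book\t"<span class=""text"">She said ""yes"".</span>"'

# csv.reader accepts any iterable of strings; delimiter="\t" makes it read tab-separated data
row = next(csv.reader([sample_line], delimiter="\t"))

print(row[3])
```

Whether this is preferable depends on the data: the manual split keeps the text exactly as stored, while `csv.reader` interprets the quoting.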


In [None]:
reviews.keys()

## MISSING VALUES !!!

In [None]:
# Counting the number of lines in the file
print("Number of lines: " + str(len(lines)))

# Counting the keys in the dictionary; should equal the number of lines in the file
print("Number of dictionary keys: " + str(len(reviews.keys())))
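
One way the two counts could differ, offered as an assumption rather than a diagnosis of this dataset, is a url that occurs on more than one line: assigning to an existing key overwrites the earlier entry. A sketch with hypothetical rows:

```python
# Hypothetical rows in which the same url appears twice
rows = [
    ("5.0", "/example/review-1", "First title"),
    ("2.0", "/example/review-1", "Second title"),
    ("4.0", "/example/review-2", "Other title"),
]

sketch = {}
for score, url, title in rows:
    # A repeated url replaces the value stored earlier under that key
    sketch[url] = {"score": score, "title": title}

print(len(rows), len(sketch))
```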

## How To Filter a Dictionary in Python

- ### Let's say you are interested in the bad reviews and want to see what people actually wrote by selecting only the reviews that score 1.0.


- ### Python dictionary items not only have both a key and a value, but they can also be looped over as pairs. Instead of `for item in dictionary`, which yields only the keys, you use `for key, value in dictionary.items()`.
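
The difference between the two loop styles, sketched on a small dictionary like the `fruit` example from the first section:

```python
fruit = {"apple": 5, "pear": 3, "banana": 4}

# Plain iteration yields the keys only
keys_only = [item for item in fruit]

# .items() yields (key, value) tuples that can be unpacked in the loop header
pairs = [(key, value) for key, value in fruit.items()]

print(keys_only)
print(pairs)
```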

In [None]:
# Store review keys with low score (1.0) in list
lowscores = []
for key, value in reviews.items():
  if float(value["score"]) == 1.0: # Convert score to float
    lowscores.append(key)

# Print all the entries with a low review score
for item in lowscores:
  print(reviews[item])

## Python Dictionary Operations

- ### To extract a subset, you use the keys stored in lowscores to create the new dictionary.
- ### There are two options for this: one retrieves the relevant items from the original dictionary with the .get() method, leaving the original intact; the other uses .pop(), which removes them permanently from the original dictionary.

In [21]:
# for-loop style to subset a dictionary
forloop = {}
for k in lowscores:
    forloop[k] = reviews[k]

# Dictionary comprehension to get a subset dictionary. Alternatively you can use `.pop` instead of `.get`
dictcomp = {k: reviews.get(k) for k in lowscores}

# Verify that these objects are equal
print(forloop == dictcomp)

 

True
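
The effect of `.pop()` on the source dictionary can be seen on a small stand-in. In this sketch, `scores` and `low` are hypothetical substitutes for `reviews` and `lowscores`:

```python
scores = {"a": "1.0", "b": "5.0", "c": "1.0"}  # hypothetical stand-in for `reviews`
low = ["a", "c"]                               # hypothetical stand-in for `lowscores`

original = dict(scores)  # work on a copy so `scores` itself stays untouched

# .get() reads the values and leaves `original` as it was
with_get = {k: original.get(k) for k in low}
size_after_get = len(original)

# .pop() returns the same values but removes the keys from `original`
with_pop = {k: original.pop(k) for k in low}
size_after_pop = len(original)

print(with_get == with_pop, size_after_get, size_after_pop)
```

Both subsets are equal; only the size of the source dictionary differs afterwards.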


In [None]:
## How To Sort Dictionaries in Python

### Very simple NLP: frequency of words in bad reviews
# If you are interested in the words that go hand in hand with a negative sentiment about the novel,
# you could do a low-level form of sentiment analysis by making a frequency list of the words in the
# negative reviews (scored 1.0).

# Regular Expressions
# You need to process the review text a little by removing the HTML tags and converting uppercase words
# to lowercase. For the first, you use a regular expression that removes all tags: re.sub("<.*?>", "", review).
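
A sketch of the tag-stripping step on a hypothetical review string (`raw_review`). It also shows a side effect worth being aware of: replacing tags with an empty string can fuse the words on either side of a `<br/>`, whereas replacing them with a space keeps them apart.

```python
import re

# Hypothetical review text in the same shape as the dataset
raw_review = '<span class="a-size-base review-text">Too LONG.<br/><br/>I gave up.</span>'

# <.*?> matches "<", then as few characters as possible, then ">"
no_tags = re.sub(r"<.*?>", "", raw_review)
fused_words = no_tags.lower().split()

# Variant: substitute a space so words around a tag stay separate
spaced_words = re.sub(r"<.*?>", " ", raw_review).lower().split()

print(fused_words)
print(spaced_words)
```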



# defaultdict
### You build the frequency dictionary using a defaultdict instead of a normal dictionary.
### This guarantees that each "key" is initialized on first access, so you can just increase the frequency count by 1.

### If you were not using defaultdict, Python would raise a KeyError when you try to increase the count for the
### first time (so from 0 to 1) because the key does not yet exist.
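
The contrast described above, sketched with a hypothetical token list (`words`):

```python
from collections import defaultdict

words = ["long", "boring", "long"]  # hypothetical tokens

# A plain dict has no entry to increment yet, so += raises KeyError
plain = {}
try:
    plain["long"] += 1
    plain_failed = False
except KeyError:
    plain_failed = True

# defaultdict(int) creates a missing key with the value int() == 0 on first access
counts = defaultdict(int)
for word in words:
    counts[word] += 1

print(plain_failed, dict(counts))
```

With a plain dictionary the same effect is possible via `plain[word] = plain.get(word, 0) + 1`; `defaultdict` simply removes that bookkeeping.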

In [4]:
# Import defaultdict and the regular expression module
from collections import defaultdict
import re

freqdict = defaultdict(int)
for item in lowscores:
    review = reviews[item]["review"]
    # Remove HTML tags and split the review by word (whitespace separated)
    cleantext = re.sub(r'<.*?>', '', review).strip().split()
    for word in cleantext:
        # Convert to all lowercase
        word = word.lower()

        # Increase the count by one
        freqdict[word] += 1

print(freqdict)

# Once the frequency dictionary is ready, you still need to sort the keys by value in descending order to
# see at a glance which words are highly frequent.
# Sorting does not happen inside the dictionary itself: sorted() returns a list of (key, value) pairs, and
# here that list is stored in an OrderedDict, which remembers the order in which items were inserted.
# (Since Python 3.7, plain dictionaries also preserve insertion order.)

### The sorted function is called here with 3 arguments.

# FIRST
# The first one is the object that you want to sort: your frequency dictionary.
# Remember, however, that iterating over a dictionary directly yields only its keys;
# to get the key-value pairs you need the .items() method.

# SECOND
# The second argument specifies what part of each element should be used to sort: key=lambda item: item[1].
# A lambda function is an anonymous function, meaning it is a function without a name, defined inline where it is used.
# In this case, it simply returns the dictionary value (item[1], with item[0] being the key) as the sort key.


# THIRD
# The third and final argument, reverse, specifies whether sorting should be ascending (the default) or descending.
# In this case, you want to see the most frequent words at the top and need to specify explicitly that reverse=True.

from collections import OrderedDict

# Create Ordered dictionary
ordict = OrderedDict(sorted(freqdict.items(), key=lambda item: item[1], reverse=True))

# Skip the most frequent 10% of words
top10 = len(ordict) // 10

# Print the next 100 words from the remaining 90%
print(list(ordict.items())[top10:top10+100])
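
The sorting step on its own, sketched with hypothetical counts (`freq`). As an alternative not used in the code above, `collections.Counter` offers `most_common()` for the same purpose:

```python
from collections import Counter

freq = {"the": 12, "book": 7, "boring": 3, "long": 7}  # hypothetical counts

# sorted() returns a list of (word, count) tuples, largest count first
by_count = sorted(freq.items(), key=lambda item: item[1], reverse=True)

# Counter.most_common(n) returns the n most frequent entries in one call
top_two = Counter(freq).most_common(2)

print(by_count)
print(top_two)
```

Either result is a list of tuples; wrapping it in `OrderedDict` (or `dict`) is only needed if you want key lookup on the sorted data.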
