# Week 3 Homework

Topics Covered:
- Data Filtering and Cleaning
    - Filtering by date and attributes
- Processing Text and Strings in Python
    - Removing punctuation, making all lowercase
- Processing Times and Dates in Python 
    - Conversion from Strings to Time Objects, to Unix time
    - Sorting time feature data



## Section 1: Data Filtering and Cleaning
### ...With an introduction to processing Time

In this section, you will preprocess a moderately-sized dataset filled with Airbnb Listing Data for Seattle, WA and its surrounding regions.

To expedite the assignment, we've written the code needed to import the `airbnb_listings.csv` file. Please begin the assignment by running the code cell below.

In [0]:
# Be sure to run this!
import csv

f = open('datasets/airbnb_listings.csv')
# Note: the delimiter is ',' since this is a csv rather than a tsv file
reader = csv.reader(f, delimiter = ',')
header = next(reader)

print(header)

['index', 'room_id', 'host_id', 'room_type', 'address', 'reviews', 'overall_satisfaction', 'accommodates', 'bedrooms', 'bathrooms', 'price', 'last_modified', 'latitude', 'longitude', 'location', 'name', 'currency', 'rate_type']


### Note
Be sure to take note of the headers within the dataset.

### Task 1:
Import the rest of the csv file into a list of dictionaries, simply named `dataset`

**Hint**: These are the corresponding headers for each data type:

##### Int
- index
- room_id
- host_id
- reviews
- accommodates
- price

##### Float
- overall_satisfaction
- bedrooms
- bathrooms
- latitude
- longitude

_Note: `bedrooms` and `bathrooms` are floats because some listings count smaller rooms as halves._

##### String
- Everything else
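
As a minimal sketch of the row-to-dictionary pattern, the snippet below parses a tiny, made-up CSV held in memory. The column values are hypothetical and chosen only for illustration; empty strings are left untouched so that a missing value does not raise a `ValueError` when cast.

```python
import csv
import io

# Hypothetical three-column sample, not taken from airbnb_listings.csv
sample = io.StringIO(
    "room_id,overall_satisfaction,room_type\n"
    "101,4.5,Private room\n"
    "102,,Shared room\n"
)
sample_reader = csv.reader(sample, delimiter=',')
sample_header = next(sample_reader)

rows = []
for line in sample_reader:
    # Pair each header with the value in the same column
    d = dict(zip(sample_header, line))
    d['room_id'] = int(d['room_id'])
    # Only cast non-empty values; float('') would raise ValueError
    if d['overall_satisfaction'] != '':
        d['overall_satisfaction'] = float(d['overall_satisfaction'])
    rows.append(d)

print(rows)
```

The second row keeps `''` for its missing rating, which is something to keep in mind when comparing or averaging that field later.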



In [0]:
dataset = []

for line in reader:
    # Convert into a dictionary object
    d = dict(zip(header,line))
    # Cast IDs, counts, and price as integers
    for field in ['index','room_id','host_id', 'reviews', 'accommodates', 'price']:
        d[field] = int(d[field])
    for field in ['overall_satisfaction', 'bedrooms', 'bathrooms', 'latitude', 'longitude']:
        if(d[field] != ''):
            d[field] = float(d[field])
    dataset.append(d)

In [0]:
#Use this to verify that you're on the right track.
if(len(dataset) == 7576):
    print("Great, looks like we've got the dataset up and running. Let's try some filtering techniques we learned in the lecture.")

Great, looks like we've got the dataset up and running. Let's try some filtering techniques we learned in the lecture.


*Note that attempting to print the dataset at this point will cause Jupyter to throw an error due to its size.*

### Task 2: Filtering by Room Type



Let's start slimming down the dataset. 

Let's say we're only interested in `room_type` features that are `"Entire home/apt"` listings.  
Using what we know from the lecture, how would we filter down the dataset by the above criteria? Think about what the line below is doing.

In [0]:
dataset = [d for d in dataset if d['room_type'] == "Entire home/apt"]
print(len(dataset))

5603
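
To see what the list comprehension above is doing, it can help to write it out as an explicit loop. The sketch below uses a few hypothetical listings (not rows from the real dataset) and shows that both forms keep exactly the rows that satisfy the condition.

```python
# Hypothetical listings, for illustration only
listings = [
    {'room_id': 1, 'room_type': 'Entire home/apt'},
    {'room_id': 2, 'room_type': 'Private room'},
    {'room_id': 3, 'room_type': 'Entire home/apt'},
]

# List comprehension: keep the rows that satisfy the condition
filtered = [d for d in listings if d['room_type'] == 'Entire home/apt']

# Equivalent explicit loop
filtered_loop = []
for d in listings:
    if d['room_type'] == 'Entire home/apt':
        filtered_loop.append(d)

print(len(filtered), filtered == filtered_loop)
```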


### Task 3: Filtering By Number of Reviews

While this is printable now, it's still incredibly large (clocking in at 5603 entries!). Let's find another criterion to narrow down the dataset so that we can make more meaningful predictions.

Let's filter `dataset` once again, keeping only listings that have 150 or more `reviews` (that is, `reviews >= 150`).

In [0]:
dataset = #TODO: YOUR CODE HERE
print(len(dataset))

In [0]:
# Use this to verify that you are on the right track
if(len(dataset) == 423):
    print("Great job!")

## Section 2: Strings in Python

### Task 1: Removing Punctuation and Casing

Let's begin by running the code snippet below to import Python's `string` library and create a `vocab` list.  
We will use this list to collect every word inside the slimmed-down listing dataset. 

In [0]:
import string

vocab = []
vocabFrequency = []

After this, we will...
1. Loop through the dataset to remove punctuation  
    _Note: You will need to use Python's list of punctuation._
2. Make every string lowercase  
3. Extend the `vocab` list with the newly cleaned string, split into individual words.

_Hint: Refer to Python's documentation on [common string operations](https://docs.python.org/3.7/library/string.html) to complete this task_
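
The three steps above can be sketched on a single string. This is one possible approach, shown on a hypothetical listing name; `string.punctuation` is the standard library's string of ASCII punctuation characters.

```python
import string

# Hypothetical listing name, for illustration only
name = "Cozy, Modern Loft -- Near Downtown!"

# Keep every character that is not in string.punctuation
no_punct = ''.join(c for c in name if c not in string.punctuation)

# Normalize casing
lowered = no_punct.lower()

# split() with no arguments splits on any run of whitespace
words = lowered.split()

print(words)
```

Because `split()` collapses runs of whitespace, the double space left behind by the removed dashes does not produce an empty word.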

In [0]:

#Process the string, then add each word to the compendium
for d in dataset:
    # Remove punctuation
    d['name'] = ''.join(#TODO: YOUR CODE HERE)
    # Make lowercase
    d['name'] = #TODO: YOUR CODE HERE
    # Extend the vocab list by splitting d['name']
    vocab.extend(#TODO: YOUR CODE HERE)


Then, let's append each word's count to the `vocabFrequency` list (this part is done for you). 

After that, we will zip up the `vocab` and `vocabFrequency` lists into a dictionary.
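
As a rough illustration of zipping two parallel lists into a dictionary, the sketch below uses a short hypothetical word list. The comparison with `collections.Counter` is an aside and not part of the assignment; it only shows that the standard library can build the same mapping directly.

```python
from collections import Counter

# Hypothetical word list, for illustration only
words = ['seattle', 'loft', 'seattle', 'cozy', 'loft', 'seattle']

# Parallel list of counts, one entry per word (duplicates included)
counts = [words.count(w) for w in words]

# zip pairs each word with its count; the dict keeps one entry per unique word
freq = dict(zip(words, counts))

# Sort the (word, count) pairs by count, largest first
ranked = sorted(freq.items(), reverse=True, key=lambda x: x[1])
print(ranked)

# Counter builds the same word-to-count mapping in one pass
print(freq == dict(Counter(words)))
```

Repeated words overwrite their own dictionary entry with the same count, which is why the duplicates in the two lists are harmless.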



In [0]:
for word in vocab:
    vocabFrequency.append(vocab.count(word))

# Zip up the vocab and vocabFrequency lists into a dictionary object.
# Recall the method we used from the previous section to do so. 
freqDict = #TODO: YOUR CODE HERE
# Don't modify this line:
freqDict = sorted(freqDict.items(), reverse=True, key=lambda x: x[1])

In [0]:
#Run this cell
print(freqDict[:10])
print(len(dataset))

Here we see the top 10 terms used in the name of each listing. Naturally, since the dataset is focused around the Seattle area, 'seattle' is the top term, clocking in at 70 instances within the subset of our dataset. 

In a future assignment, we may use the [NLTK Python Library](https://www.nltk.org) to further filter string lists by parts of speech, but that is a topic for a different course. 

### Knowledge Check:
1. Name some key reasons why filtering is necessary when working with large datasets.
2. What are some features of the `airbnb_listings` dataset that might not be necessary? Explain your reasoning. 
3. If you were to remove those features, what would the code for that look like?

_You do not need to answer these officially, but we encourage you to go back and review the code if you do not know the answer._

## Section 3: Time and Date Data

Let's begin this section by running the cell below.

In [0]:
import time

firstTime = dataset[0]['last_modified']
print(firstTime)
print(type(firstTime))


As we can see, the variable `firstTime` is considered a `str` by Python.

#### Pause and Reflect:
What could be a potential shortcoming of a time feature being represented as a string?

### Task 1: Conversion of firstTime from String, to Tuple, to Unix Time

In order to grasp the concept of time casting and conversion, let's modify the `firstTime` variable we created in the prior cells. 

1. Convert the string to a Time object using a method found [in the documentation for time](https://docs.python.org/3.7/library/time.html).  
_Note: Only the first 19 characters in the time string contain relevant time data that we'd like to cast._  
_Note 2: Note the date's format (that is, the order of the year, month, day, hour, minute, and second fields) as printed out in the prior cell. Refer to the list of directives within the documentation to aid with the casting process._

2. Print out the `firstTime` variable, and note the way that Time objects are represented in the console output. 
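
A minimal sketch of string-to-time parsing, assuming a hypothetical timestamp layout; the real `last_modified` values may be formatted differently, so the format string here should not be copied blindly.

```python
import time

# Hypothetical timestamp; check the real format before reusing this
raw = "2019-01-15 08:30:45.123456"

# Keep the first 19 characters, dropping the fractional seconds
trimmed = raw[:19]

# Each directive in the format string matches one field of the input
parsed = time.strptime(trimmed, "%Y-%m-%d %H:%M:%S")

print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
print(type(parsed))
```

The result is a `time.struct_time`, a named tuple, so `isinstance(parsed, tuple)` evaluates to `True`.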

In [0]:
#Convert from String to a Time object
firstTime = #TODO: YOUR CODE HERE
print(firstTime)
print(type(firstTime))

Using the slimmed-down dataset from prior sections, let's use our new knowledge to manipulate the time-based features of any given listing.

*Hint*: Be sure to leverage the `time.strptime` function to cast the time `str` (strings) as Python `tuple` objects.

In [0]:
# First, let's modify every 'last_modified' entry within the dataset
for d in dataset:
    #Check if the `last_modified` feature of line d is of type string. 
    #Use an if statement, followed by isinstance. 
    #TODO: YOUR CODE HERE FOR IF STATEMENT
        #Hint: Use the code from the firstTime variable's conversion as an example for the next line.
        d['last_modified'] = #TODO: YOUR CODE HERE

In order to make the sorting process a bit easier for us, let's convert the Python `time` structs into Unix time. 

As per the lecture, Unix time is defined as the time (in seconds) since the [Unix Epoch](https://en.wikipedia.org/wiki/Unix_time).

*Hint*: Look through the [Python Time Library's Documentation](https://docs.python.org/3.7/library/time.html) to find the function needed to convert struct_time to Unix time. 
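
One point worth knowing before converting: the standard library offers two conversions that differ in how they interpret the struct. The sketch below, using a hypothetical timestamp, contrasts `calendar.timegm` (treats the value as UTC) with `time.mktime` (treats it as local time). Which one is appropriate depends on what timezone the data was recorded in, which the dataset does not state.

```python
import calendar
import time

# Hypothetical timestamp, for illustration only
parsed = time.strptime("1970-01-02 00:00:00", "%Y-%m-%d %H:%M:%S")

# calendar.timegm treats the struct_time as UTC
utc_seconds = calendar.timegm(parsed)
print(utc_seconds)  # 86400: exactly one day after the epoch

# time.mktime treats the struct_time as local time, so its result
# depends on the timezone of the machine running the code
local_seconds = time.mktime(parsed)
print(local_seconds)
```

For sorting purposes either choice works, as long as the same function is applied to every entry.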


In [0]:
for d in dataset:
    if isinstance(d['last_modified'], tuple):
        d['last_modified'] = #TODO: YOUR CODE HERE


Next, we will sort the dataset by the `last_modified` feature. To do this, we'll look at [the documentation for Sorting in Python](https://docs.python.org/3/howto/sorting.html#key-functions), specifically the section on Key Functions.

This will most likely involve using lambdas. Since knowledge of lambda functions is not required for this course, we will provide the code snippet below:

`lambda x: x['last_modified']`
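
As a small illustration of how a key function is used, the sketch below sorts a few hypothetical records whose `last_modified` values are already Unix times. The key function is called once per element, and the elements are ordered by the values it returns.

```python
# Hypothetical records, for illustration only
records = [
    {'room_id': 1, 'last_modified': 1547541045},
    {'room_id': 2, 'last_modified': 1500000000},
    {'room_id': 3, 'last_modified': 1600000000},
]

# sorted() returns a new list and leaves the original untouched
oldest_first = sorted(records, key=lambda x: x['last_modified'])

# reverse=True flips the order without changing the key function
newest_first = sorted(records, key=lambda x: x['last_modified'], reverse=True)

print([r['room_id'] for r in oldest_first])
print([r['room_id'] for r in newest_first])
```

The list method `sort()` accepts the same `key` and `reverse` arguments but reorders the list in place instead of returning a new one.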

In [0]:
# Using dataset.copy() makes a shallow copy of the dataset.
# Leave this line be.
sortedByTimeDataset = dataset.copy()

sortedByTimeDataset = #TODO: YOUR CODE HERE

for d in sortedByTimeDataset:
    print(d['last_modified'])

Using UNIX time made it easy to sort the dataset by the `last_modified` field!

### Knowledge Check:
1. Which method would you use to convert from a Python Time object to a String?
2. In what context would you need to perform the above operation?
3. Which directive (e.g. %p) would you use to express Minutes as a decimal number (that is, from 00 to 59)?

_You do not need to answer these officially, but we encourage you to go back and review the code if you do not know the answer._

---
## You're all done!
You should now be familiar with the basics of reading in CSV and JSON files and playing with the data. In your own time, we encourage you to find some datasets and try filtering and cleaning them to slim down extraneous results, or home in on a specific feature result. 