# Humanities Context—Lab 1 Guide


## Instructions/Assignment

#### Part 1: Cleaning Data
There are a lot of tweets in this data set. The first thing you need to do is download the data set and then create a much smaller version to test your work on. Do this using your favorite text editor. When you're done, you should have two text files: the original, and a file in the same format as tweets.txt that contains only 30 tweets.


#### Part 2: Creating Dictionaries
Write a function in Python that reads the smaller tweet file and creates a list of dictionaries. Each dictionary will represent one line of your file and will have the key-value pairs:

- text: a string, the text of the tweet all in lowercase
- time: a datetime object, date and time of the tweet
- latitude: a float, the latitude of the tweet's location
- longitude: a float, the longitude of the tweet's location

Start by opening your abbreviated tweet file and then read each line. As you read the line, parse the line using split to access the information you need and create a dictionary. Add the dictionary to a list.
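
For reference, one finished dictionary might look like the sketch below. The values come from the first tweet in the sample output at the end of this guide; the name `example_tweet` is made up for illustration.

```python
import datetime

# One tweet, represented with the four required key-value pairs.
example_tweet = {
    'text': 'irene is on the way to meet  rochester http://t.co/07by3a4',
    'time': datetime.datetime(2011, 8, 28, 19, 3, 27),
    'latitude': 43.10210224,
    'longitude': -77.51810055,
}

# Show the type stored under each key.
for key, value in example_tweet.items():
    print(key, type(value).__name__)
```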


#### What to Submit
Submit the smaller of your two tweet files. It should be in the same format as the tweets.txt above. Also submit a notebook called hum_1.ipynb that has your function from part 2 in it. If you finish this on Thursday, no need to attend lab on Friday. You’ll still get the credit.  If you don’t finish by the end of lab on Friday, submit what you have.



## Guide:

**Steps:**   
1. **Making tweets.txt**  
If opening or accessing all_tweets.txt is giving you trouble, you can use the 30-line tweets.txt that I uploaded. I also did a quick scrub for profanity on it.

2. **Defining our goal/steps**   
For each tweet (aka each line of tweets.txt) we want a dictionary whose entries are the tweet's time, text, latitude, and longitude. We also want a master list that holds the dictionaries for each tweet. The tricky part of this lab is how we parse the tweet from a single string into its constituent parts. The flow of the program can be written as follows:
 1. Open file  
 2. While you have not yet encountered the last line of the file, get the next line  
 3. Parse the line into discrete parts as required  
 4. For each tweet, load its relevant parts into a dictionary   
 5. Append the dictionary to the master list, which is what we will return  
 6. Repeat 2-5 until the last line of the file is read in  
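
The flow above might be sketched as the skeleton below. It is only an outline under a few assumptions: the names `read_lines` and `sample` are made up for illustration, and an in-memory `io.StringIO` object stands in for the real file so the snippet runs on its own. The actual parsing is left as a comment.

```python
import io

def read_lines(file_obj):
    """Collect one (still unparsed) entry per line of file_obj."""
    master_list = []
    line = file_obj.readline()       # step 2: get the first line
    while line != '':                # readline() returns '' at end of file
        entry = {'raw': line}        # steps 3-4 would parse the line here
        master_list.append(entry)    # step 5
        line = file_obj.readline()   # step 6: move on to the next line
    return master_list

# io.StringIO behaves like an open text file, so no file on disk is needed.
sample = io.StringIO("first line\nsecond line\n")
print(len(read_lines(sample)))  # prints 2
```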

3. **Splitting tweet into its components**  
In the text file, the tweet is split around tabs, so we can split it into a list using `.split('\t')`. The location (as a single \[lat,long\] string), time, and text are at indices 0, 2, and 3 respectively.  
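
As a rough illustration of the split, the snippet below uses a made-up line in the layout just described. The field at index 1 is not used in this guide, so the word `placeholder` stands in for it, and the tweet text is invented.

```python
# A made-up line in the layout described above; index 1 is a placeholder
# because this guide does not use that field.
line = ("[43.102102240000001, -77.51810055]\tplaceholder\t"
        "2011-08-28 19:03:27\tIrene is on the way\n")

parts = line.split('\t')
location, time_string, text = parts[0], parts[2], parts[3]

print(location)      # [43.102102240000001, -77.51810055]
print(time_string)   # 2011-08-28 19:03:27
print(repr(text))    # 'Irene is on the way\n'
```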

4. **Refining/parsing components**  
 1. Longitude/latitude: Although it's never directly stated, the convention is that latitude is ordered first. We can split the location around the comma with `.split(',')`, clean up the extra spaces and square brackets with `.strip('[] ')`, and then convert each piece with `float()`, since the assignment asks for floats.
 
 2. Text: We need to remove the trailing '\n' from the text with `.rstrip('\n')`, and to convert it to all lowercase with `.lower()`.  
 
 3. Time: We need to convert the string date-time group into a datetime object (see step 5)
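
A minimal sketch of the location and text clean-up, assuming the pieces have already been split out of the line. The sample strings and variable names are invented for illustration, and the `float()` call is there because the assignment asks for latitude and longitude as floats.

```python
location = "[43.102102240000001, -77.51810055]"
raw_text = "Irene Is On The Way\n"

# Split around the comma, strip brackets and spaces, then convert to float.
lat_string, long_string = location.split(',')
latitude = float(lat_string.strip('[] '))
longitude = float(long_string.strip('[] '))

# Drop the trailing newline and lowercase the text.
text = raw_text.rstrip('\n').lower()

print(latitude, longitude)
print(text)   # irene is on the way
```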
 
5. **datetime object**   
A datetime object is a type of object that is designed to store a particular date and time as its value, and to make date- and time-related operations easier. There are several ways to instantiate a datetime object; notably, it can be created by parsing an [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) formatted string. Fortunately, the tweet's date and time are already stored in ISO 8601 format, so we can create our datetime object with `datetime.datetime.fromisoformat(datetime_string)` (available in Python 3.7 and later).
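
For example, a date-time string in the form seen in the sample output can be parsed as in this small sketch (the variable names are illustrative):

```python
import datetime

datetime_string = "2011-08-28 19:03:27"
tweet_time = datetime.datetime.fromisoformat(datetime_string)

# The parts of the date and time are now available as attributes.
print(tweet_time.year, tweet_time.hour)   # 2011 19
print(tweet_time.isoformat())             # 2011-08-28T19:03:27
```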
 
Final considerations/points of discussion: dictionary mutability. Why does appending the dictionary to the list before adding any values to it still work fine? And why would creating the dictionary outside the loop, despite running without errors, just give us a list of the same dictionary repeated 30 times?
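
Both discussion points can be tried out with a toy loop. This is only a sketch: the three-item word list stands in for the 30 tweets, and the names `fresh` and `shared` are made up.

```python
# Case 1: a new dictionary per iteration, appended before it is filled.
fresh = []
for word in ['a', 'b', 'c']:
    entry = {}
    fresh.append(entry)      # the list stores a reference to entry
    entry['text'] = word     # so this later change is visible through the list

# Case 2: one dictionary created outside the loop and reused.
shared = []
entry = {}
for word in ['a', 'b', 'c']:
    entry['text'] = word     # overwrites the same dictionary each time
    shared.append(entry)

print(fresh)    # [{'text': 'a'}, {'text': 'b'}, {'text': 'c'}]
print(shared)   # [{'text': 'c'}, {'text': 'c'}, {'text': 'c'}]
```

In the second case every list slot refers to one and the same dictionary, so all of them show whatever was written last.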


## Additional Information

Just some extra stuff you might find helpful if you have the time; it is not strictly necessary for the lab.

### datetime module

The datetime module supplies **classes** for manipulating dates and times. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation.

The classes define objects that are useful for operations related to dates and time. When we import the datetime module, we can instantiate those objects and access their methods to help perform those operations.

#### datetime classes
The datetime module defines 6 classes. For this lab, the only one we really need is datetime, but I've included the rest just for completeness.

1. datetime.date:  
An idealized naive date, assuming the current Gregorian calendar always was, and always will be, in effect. Attributes: year, month, and day.  
2. datetime.time:  
An idealized time, independent of any particular day, assuming that every day has exactly 24*60*60 seconds. (There is no notion of “leap seconds” here.) Attributes: hour, minute, second, microsecond, and tzinfo.  
3. datetime.datetime:  
A combination of a date and a time. Attributes: year, month, day, hour, minute, second, microsecond, and tzinfo.
4. datetime.timedelta:   
A duration expressing the difference between two date, time, or datetime instances to microsecond resolution.
5. datetime.tzinfo:   
An *abstract base class*. This should not be instantiated directly. Instead, we use its *subclass* timezone.  
6. datetime.timezone:  
A class that implements the tzinfo abstract base class. Tracks the timezone as a fixed offset from UTC. Time-related objects store it in their tzinfo attribute to track the timezone.
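
A short tour of several of these classes, as a hedged sketch. The two timestamps are taken from the sample output at the end of this guide; the guide does not say which timezone the tweets use, so attaching UTC here is purely for illustration.

```python
import datetime

tweet_time = datetime.datetime(2011, 8, 28, 19, 3, 27)

# date() and time() pull the two halves out of a datetime.
print(tweet_time.date())   # 2011-08-28
print(tweet_time.time())   # 19:03:27

# Subtracting two datetimes gives a timedelta.
earlier = datetime.datetime(2011, 8, 28, 19, 2, 36)
gap = tweet_time - earlier
print(gap)                   # 0:00:51
print(gap.total_seconds())   # 51.0

# timezone is a fixed offset from UTC; attaching one makes the datetime aware.
utc_time = tweet_time.replace(tzinfo=datetime.timezone.utc)
print(utc_time.isoformat())  # 2011-08-28T19:03:27+00:00
```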

**Properties and constants**

Date, time, datetime, and timezone objects are immutable and hashable. Time and datetime objects may be either **aware** (can locate themselves relative to other aware objects) or **naive** (do not contain enough information to unambiguously locate themselves relative to other date/time objects); date objects are always naive. The datetime module also defines two constants, which are imported into your program along with the module, and which are used by the date and datetime classes to define the minimum and maximum allowable year.

1. datetime.MINYEAR: Set to 1.  
2. datetime.MAXYEAR: Set to 9999.
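
These properties can be checked interactively. The snippet below is a small sketch; the names `naive`, `aware`, and `counts` are illustrative, and UTC is used only as an example timezone.

```python
import datetime

print(datetime.MINYEAR, datetime.MAXYEAR)   # 1 9999

# A naive datetime has no tzinfo; attaching a timezone makes it aware.
naive = datetime.datetime(2011, 8, 28, 19, 3, 27)
aware = naive.replace(tzinfo=datetime.timezone.utc)
print(naive.tzinfo)   # None
print(aware.tzinfo)   # UTC

# Hashable objects can be used as dictionary keys.
counts = {naive: 1}
print(counts[datetime.datetime(2011, 8, 28, 19, 3, 27)])   # 1

# Years outside the allowed range are rejected.
try:
    datetime.date(datetime.MAXYEAR + 1, 1, 1)
except ValueError as error:
    print("ValueError:", error)
```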


**Source**: [Python's datetime documentation](https://docs.python.org/3/library/datetime.html)

**Related Modules and Packages**

[calendar](https://docs.python.org/3/library/calendar.html#module-calendar): Provides general calendar related functions.  

[time](https://docs.python.org/3/library/time.html#module-time): Provides various time-related functions.

[dateutil](https://dateutil.readthedocs.io/en/stable/): A **third party** library that extends the functionality of datetime with more advanced features, including extended timezone and parsing support. 
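
As a small taste of the standard-library `calendar` module listed above, this sketch looks up the weekday of the date that appears in the sample tweets. It is an optional extra, not something the lab requires.

```python
import calendar

# calendar.weekday returns 0 for Monday through 6 for Sunday.
day_index = calendar.weekday(2011, 8, 28)
print(calendar.day_name[day_index])   # Sunday
print(calendar.isleap(2011))          # False
```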

## Solution Code

'''
Samuel Weissmann
spw2136
2019-10-24
'''

import datetime as dt

def parse_tweets(file_name):
    '''Return a list of dictionaries, one per line (tweet) of file_name.'''
    dicts_li = []

    #the with statement closes the file for us, even if an error occurs
    with open(file_name, 'r') as fo:
        tweet = fo.readline()

        while tweet != '': #readline() returns '' once the end of the file is reached
            #setup
            tweet_li = tweet.split('\t') #list of tweet parts, split around tabs (\t)
            coordinates = tweet_li[0].split(',') #splits coords into [latitude, longitude]
            tweet_di = {} #empty dict to which we will add the relevant parts of the tweet
            dtg = dt.datetime.fromisoformat(tweet_li[2]) #datetime object for the tweet's date and time

            #add entries to dictionary
            tweet_di['text'] = tweet_li[3].rstrip('\n').lower()
            tweet_di['time'] = dtg
            tweet_di['latitude'] = float(coordinates[0].strip('[] ')) #the assignment asks for floats
            tweet_di['longitude'] = float(coordinates[1].strip('[] '))

            #add dictionary to list
            dicts_li.append(tweet_di) #discussion point: Could be added right away before adding values and still work

            #update LCV
            tweet = fo.readline()

    return dicts_li

#test program
results = parse_tweets('tweets.txt')
for di in results:
    print("\n \"{}\" was said on {} at [{}, {}].".format(di['text'],str(di['time']),di['latitude'],di['longitude']))


 "irene is on the way to meet  rochester http://t.co/07by3a4" was said on 2011-08-28 19:03:27 at [43.10210224, -77.51810055].

 "@cami219 hahaha i love you!" was said on 2011-08-28 19:03:28 at [25.58798, -80.39362].

 "i really can't stand to be in a  salon.. #ihatethesmell" was said on 2011-08-28 19:03:28 at [38.31831103, -81.71802073].

 "i'm at los trompos fco de montejo (mérida) http://t.co/44p4cob" was said on 2011-08-28 19:03:28 at [21.030365, -89.641651].

 "@blessthefall yes! awakening! cant wait tto hhear whats for the soul" was said on 2011-08-28 19:03:28 at [29.0043199, -81.3873711].

 "i'm at corberstone pl @knob hill (evans) http://t.co/f2o2chn" was said on 2011-08-28 19:03:29 at [33.550829, -82.190281].

 "work needs to fly by ... i'm so excited to see spy kids 4 with then love of my life ... arreic" was said on 2011-08-28 19:02:36 at [41.29866963, -81.91532933].

 "@lowkeypea smh" 