In [273]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# DSC 80: Lab 07

### Due Date: Monday, November 15th, 11:59PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab*.py` file, which will be imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab.py` is compiled to bytecode (in the directory `__pycache__`). Subsequent imports of `lab` merely import the existing compiled python.

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
from lab import *

In [5]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import glob
import time
import re
import requests
import json

# Practice with regular expressions (Regex)

**Question 1**

You start with some basic regular expression exercises to get some practice using them. You will find function stubs and related doctests in the starter code. 

**Exercise 1:** A string that has a `[` as the third character and `]` as the sixth character.

**Exercise 2:** Phone numbers that start with '(858)' and follow the format '(xxx) xxx-xxxx' (x represents a digit).

*Notice: There is a space between (xxx) and xxx-xxxx*
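As an illustration of the format in Exercise 2, here is a minimal sketch. The helper name `looks_like_858_number` is hypothetical and is not one of the starter-code stubs; the autograder may expect a different form.

```python
import re

# Hypothetical helper for illustration; not part of the starter code.
def looks_like_858_number(s):
    """Return True if s is exactly '(858) xxx-xxxx', where x is a digit."""
    # Parentheses are escaped because they are regex metacharacters;
    # fullmatch requires the whole string to fit the pattern.
    return re.fullmatch(r'\(858\) [0-9]{3}-[0-9]{4}', s) is not None

print(looks_like_858_number('(858) 456-7890'))
print(looks_like_858_number('(858)456-7890'))
```

Using `re.fullmatch` avoids having to anchor the pattern with `^` and `$` by hand.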

**Exercise 3:** A string whose length is between 6 and 10 and contains only word characters, whitespace and `?`. This string must have `?` as its last character.

**Exercise 4:** A string that begins with '\\$' and with another '\\$' within, where:
   - Characters between the two '\\$' can be anything (including nothing) except the letters 'a', 'b', 'c' (lower case).
   - Characters after the second '\\$' can only have any number of the letters 'a', 'b', 'c' (upper or lower case), with every 'a' before every 'b', and every 'b' before every 'c'.
       - E.g. 'AaBbbC' works, 'ACB' doesn't.

**Exercise 5:** A string that represents a valid Python file name including the extension. 

*Notice*: For simplicity, assume that the file name contains only letters, numbers and an underscore `_`.

**Exercise 6:** Find patterns of lowercase letters joined with an underscore.

**Exercise 7:** Find patterns that start with and end with a `_`.

**Exercise 8:**  Apple registration numbers and Apple hardware product serial numbers might have the number '0' (zero), but never the letter 'O'. Serial numbers don't have the number '1' (one) or the letter 'i'. Write a regular expression that checks if the given serial number belongs to a genuine Apple product.

**Exercise 9:** Check if a given ID number is from Los Angeles (LAX), San Diego (SAN) or the state of New York (NY). ID numbers have the following format `SC-NN-CCC-NNNN`. 
   - SC represents state code in uppercase 
   - NN represents a number with 2 digits 
   - CCC represents a three letter city code in uppercase
   - NNNN represents a number with 4 digits
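One way to read the `SC-NN-CCC-NNNN` format is sketched below. It assumes that "from New York" means the state code is `NY` (with any city code) and that "from Los Angeles or San Diego" means the city code is `LAX` or `SAN` (with any state code). The name `is_target_id` is hypothetical.

```python
import re

# Assumption: NY is matched on the state code, LAX/SAN on the city code.
ID_PATTERN = re.compile(
    r'(?:NY-[0-9]{2}-[A-Z]{3}|[A-Z]{2}-[0-9]{2}-(?:LAX|SAN))-[0-9]{4}'
)

def is_target_id(s):
    """Return True if s is a well-formed ID from LAX, SAN or NY."""
    return ID_PATTERN.fullmatch(s) is not None

print(is_target_id('CA-56-LAX-7895'))
print(is_target_id('ca-23-SAN-1231'))
```

The non-capturing groups `(?:...)` keep the alternation contained without creating groups that would change what `re.findall` returns.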

**Exercise 10:**  Given an input string, cast it to lower case, remove spaces/punctuation, and return a list of every 3-character substring following this logic:
   - The substring doesn't start with 'a' or 'A'
   - The last substring (and only the last substring) can be shorter than 3 characters, depending on the length of the input string.
   - The substrings cannot overlap
   
Here's an example with one of the doctests:

`>>> match_10("Ab..DEF")`
`['def']`

1. convert it to a lowercase string resulting in "ab..def"
2. delete any 3 letter sequence that starts with the letter 'a', so delete "ab." from the string, leaving us with ".def"
3. delete the punctuation resulting in "def"
4. finally, we get `["def"]`

(Only split in the last step; every earlier step only removes characters from the string.)

In [None]:
# out = 'ABCdef'
# out = 'DEFaabc !g'
# out = 'Come ti chiami?'
out = 'and'
# out = 'Ab..DEF'

def test(string):
    out = string.lower()
    arr = pd.Series(re.findall(r'.{1,3}', out))
    arr_df = arr.str.extract(r'(^[^a].*)').dropna()
    arr = []
    for i in arr_df[0]:
        w = re.sub(r'[^\w]','',i)
        w = re.sub(r'_','',w)
        arr.append(w)

    arr = pd.Series(re.findall(r'.{1,3}', ''.join(arr)))
    if len(arr) == 0:
        return []
    return arr.str.extract(r'(^[^a].*)').dropna()[0].to_list()

test(out)

In [70]:
arr_df

NameError: name 'arr_df' is not defined

In [None]:
grader.check("q1")

## Regex groups: extracting personal information from messy data

**Question 2**

The file in `data/messy.txt` contains personal information from a fictional website that a user scraped from webserver logs. Within this dataset, there are four fields that interest you:
1. Email Addresses (assume they are alphanumeric user-names and domain-names),
2. [Social Security Numbers](https://en.wikipedia.org/wiki/Social_Security_number#Structure)
3. Bitcoin Addresses (alpha-numeric strings of long length)
4. Street Addresses

Create a function `extract_personal` that takes in a string like `open('data/messy.txt').read()` and returns a tuple of four separate lists containing values of the 4 pieces of information listed above (in the order given). Do **not** keep empty values.

*Hint*: There are multiple "delimiters" in use in the file; there are few enough of them that you can safely determine what they are.

*Note:* Since this data is messy/corrupted, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `@` for emails) should help assure correctness! Your function will be tested on a sample of the file `messy.txt`.
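The spot check suggested above can be sketched as follows. The sample string is made up to imitate the mixed delimiters and is not taken from `data/messy.txt`; the helper `coverage` is hypothetical. Note that a marker such as `@` can also appear in junk fields, so the marker count is only an upper bound on the number of true emails.

```python
import re

# Made-up sample for illustration only; it contains one real email and one stray '@'.
sample = 'x|jane1@example.com\tnoise!@#$,[{bitcoin:1AbCdEfGh123\tssn:123-45-6789}]\n'

def coverage(found, text, marker):
    """Fraction of marker occurrences in text that are accounted for by found."""
    total = text.count(marker)
    return len(found) / total if total else 1.0

# Emails are assumed to be alphanumeric user names and domain names.
emails = re.findall(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z]+', sample)
print(emails, coverage(emails, sample, '@'))
```

A coverage well below 1 is a prompt to look at the unmatched occurrences by eye, not proof that the pattern is wrong.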

In [71]:
fp = os.path.join('data', 'messy.txt')
s = open(fp, encoding='utf8').read()

In [73]:
len(extract_personal(s)[0])

955

In [74]:
len(extract_personal(s)[1])

2571

In [75]:
len(extract_personal(s)[2])

2781

In [76]:
len(extract_personal(s)[3])

958

In [43]:
s[:1000]

'1\t4/12/2018\tLorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin risus. Praesent lectus.\n\nVestibulum quam sapien| varius ut, blandit non, interdum in, ante. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Duis faucibus accumsan odio. Curabitur convallis.|dottewell0@gnu.org\toR1mOq,!@#$%^&*(),[{bitcoin:18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv\tIP:192.232.9.210\tccn:3563354617955160|ssn:380-09-9403}]|05-6609813,814 Monterey Court\n2\t12/18/2018\tSuspendisse potenti. In eleifend quam a odio. In hac habitasse platea dictumst.\n\nMaecenas ut massa quis augue luctus tincidunt. Nulla mollis molestie lorem. Quisque ut erat.,bassiter1@sphinn.com\tc5KvmarHX3o,test\u2060test\u202b,[{bitcoin:1EB7kYpnfJSqS7kUFpinsmPF3uiH9sfRf1,IP:20.73.13.197|ccn:3542723823957010\tssn:118-12-8276}#{bitcoin:1E5fev4boabWZmXvHGVkHcNJZ2tLnpM6Zv*IP:238.206.212.148\tccn:337941898369615,ssn:427-22-9352}#{bitcoin:1DqG3WcmGw74PjptjzcAmxGFuQdvWL7RCC,IP:171.241.15.98\tccn:3574

In [25]:
# Raw strings keep the regex escapes (\. \s \d \n) from being read as Python string escapes
email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', s)
ssn = re.findall(r'ssn:([0-9]+-[0-9]+-[0-9]+)', s)
bca = re.findall(r'bitcoin:([0-9a-zA-Z]{5,})', s)
sa = re.findall(r'([0-9]*\s[a-zA-Z\s]+)\n\d*', s)


In [44]:
grader.check("q2")

## Content in Amazon review data

**Question 3**

The dataset `reviews.txt` contains [Amazon reviews](http://jmcauley.ucsd.edu/data/amazon/) for ~200k phones and phone accessories. This dataset has been "cleaned" for you. The goal of this section is to create a function that takes in the review dataset and a review and returns the word that "best summarizes the review" using TF-IDF.

1. Create a function `tfidf_data(review, reviews)` that takes a review as well as the review data and returns a dataframe:
    - indexed by the words in `review`,
    - with columns given by (a) the number of times each word is found in the review (`cnt`), (b) the term frequency for each word (`tf`), (c) the inverse document frequency for each word (`idf`), and (d) the TF-IDF for each word (`tfidf`).
    
2. Create a function `relevant_word(tfidf_data)` which takes in a dataframe as above and returns the word that "best summarizes the review" described by `tfidf_data`.


*Note:* Use this function to "cluster" review types -- run it on a sample of reviews and see which words come up most. Unfortunately, you will likely have to change your code from your answer above to run it on the entire dataset (to do this, compute as many of the frequencies as possible "ahead of time" and look them up when needed; you should also likely filter out words that occur "rarely").
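The "ahead of time" idea can be sketched on a toy corpus: count once how many reviews contain each word, then look the counts up for every review. The corpus, the names `doc_freq` and `tfidf_lookup`, and the whitespace tokenization are all assumptions for illustration, and the sketch assumes every word of the review occurs somewhere in the corpus.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy corpus for illustration; not the real reviews.txt.
reviews_toy = pd.Series(['great case', 'great phone', 'slim case for phone', 'bad'])

# Document frequency: number of reviews containing each word, computed once.
doc_freq = Counter(w for r in reviews_toy for w in set(r.split()))
n_docs = len(reviews_toy)

def tfidf_lookup(review):
    """Build the cnt/tf/idf/tfidf table using the precomputed doc_freq."""
    words = review.split()
    out = pd.DataFrame({'cnt': pd.Series(words).value_counts()})
    out['tf'] = out['cnt'] / len(words)
    out['idf'] = [np.log(n_docs / doc_freq[w]) for w in out.index]
    out['tfidf'] = out['tf'] * out['idf']
    return out

print(tfidf_lookup('great case'))
```

Filtering rare words would then amount to dropping entries of `doc_freq` below some threshold before the lookups.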

In [6]:
fp = os.path.join('data', 'reviews.txt')
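# The squeeze keyword was removed from read_csv in pandas 2.0; squeeze the single column explicitly
reviews = pd.read_csv(fp, header=None).squeeze('columns')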
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()

In [17]:
review

'this is a great new case design that i have not seen before it has a slim silicone skin that really locks in the phone to cover and protect your phone from spills and such and also a hard polycarbonate outside shell cover to guard it against damage  this case also comes with different interchangeable skins and covers to create multiple color combinations  this is a different kind of case than the usual chunk of plastic  it is innovative and suits the iphone 5 perfectly'

In [72]:
# Count how many times each distinct word appears in the review
wordlist = review.strip().split()
key_bag = {w: wordlist.count(w) for w in set(wordlist)}
key_bag

out_df = pd.Series(key_bag).to_frame('cnt')
# Term frequency: occurrences of the word divided by the number of words in the review
out_df['tf'] = out_df['cnt'] / len(wordlist)

# Inverse document frequency: match whole words only, so that e.g. 'a' is not
# counted inside 'case'
idf = []
for w in out_df.index:
    pattern = r'\b%s\b' % re.escape(w)
    idf.append(np.log(len(reviews) / reviews.str.contains(pattern).sum()))
out_df['idf'] = idf

out_df['tfidf'] = out_df['tf'] * out_df['idf']

In [332]:
fp = os.path.join('data', 'reviews.txt')
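# The squeeze keyword was removed from read_csv in pandas 2.0; squeeze the single column explicitly
reviews = pd.read_csv(fp, header=None).squeeze('columns')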
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()

In [334]:
out = tfidf_data(review,reviews)
relevant_word(out)

'skin'

In [337]:
out.tfidf.idxmax()

'skin'

In [327]:
grader.check("q3")

## Tweet Analysis: Internet Research Agency

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the *Internet Research Agency* (the tweet factory facing allegations of attempting to influence US political elections).

The questions in this section will focus on the following:
1. We will look at the hashtags present in the text and trends in their makeup.
2. We will prepare this dataset for modeling by creating features out of the text fields.


**Question 4**

### HashTags

You may assume that a hashtag is any string without whitespace following a `#` (this is more permissive than Twitter's rules for hashtags; you are encouraged to go down this rabbit-hole to better figure out how to clean your data!).

* Create a function `hashtag_list` that takes in a column of tweet-text and returns a column containing the list of hashtags present in the tweet text. If a tweet doesn't contain a hashtag, the function should return an empty list.

* Create a function `most_common_hashtag` that takes in a column of hashtag-lists (the output above) and returns a column consisting of a single hashtag from the tweet-text. 
    - If the text has no hashtags, the entry should be `NaN`,
    - If the text has one distinct hashtag, the entry should contain that hashtag,
    - If the text has more than one hashtag, the entry should be the most common hashtag (among all hashtags in the column). If there is a tie for most common, any of the most common can be returned.
        - E.g. if the input column was: `pd.Series([[1, 2, 2], [3, 2, 3]])`, the output would be: `pd.Series([2, 2])`. Even though `3` was more common in the second list, `2` is the most common among all hashtags in the column.
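The selection rule above can be sketched as follows: count every hashtag across the whole column once, then pick, within each list, the tag with the highest column-wide count. The name `most_common_sketch` is hypothetical and is not the graded `most_common_hashtag`.

```python
from collections import Counter

import numpy as np
import pandas as pd

def most_common_sketch(tag_lists):
    """For each list, return the tag that is most common across the whole column."""
    # Column-wide counts, computed once.
    counts = Counter(tag for tags in tag_lists for tag in tags)

    def pick(tags):
        if len(tags) == 0:
            return np.nan  # no hashtags in this tweet
        return max(set(tags), key=lambda tag: counts[tag])

    return tag_lists.apply(pick)

print(most_common_sketch(pd.Series([[1, 2, 2], [3, 2, 3]])).tolist())
```

On the example column, `2` appears three times overall and `3` only twice, so both rows resolve to `2` even though `3` dominates the second list.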

In [17]:
testdata = [[
    'RT @DSC80: Text-cleaning is cool! #NLP #NLP #NLP https://t.co/xsfdw88d #NLP1 #NLP1'
], ['#NLP1 #NLP1']]
test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])

In [52]:
tweet_lists = hashtag_list(ira.text)

In [42]:
tag_all = []
for i in tweet_lists:
    tag_all.extend(i)
d_all = {x:tag_all.count(x) for x in tag_all}
max_tag = max(d_all, key=d_all.get)
max_tag

'news'

In [54]:
frequancy = pd.Series(tweet_lists.sum()).value_counts()
frequancy

news                5269
sports              2545
politics            2244
world               1585
BlackLivesMatter     813
                    ... 
OscarHasNoColor,       1
Sunday                 1
EmergencyKit           1
GoCowboys              1
NARUTO                 1
Length: 13392, dtype: int64

In [62]:
a = ['politics', 'sports']
out = {}
for w in a:
    out[w] = frequancy[w]

'sports'

In [60]:
out

{'politics': 2244, 'sports': 2545}

In [6]:
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])
ira

Unnamed: 0,id,name,date,text
0,3906258,ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef97452...,2016-11-16 09:04,The Best Exercise To Lose Belly Fat In 2 weeks...
1,1051443,8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5...,2016-12-24 04:31,RT @Philanthropy: Dozens of ‘hate groups’ have...
2,2823399,Room Of Rumor,2016-08-18 20:26,"Artificial intelligence can find, map poverty,..."
3,272878,San Francisco Daily,2016-03-18 19:28,Uber balks at rules proposed by world’s busies...
4,7697802,41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed...,2016-07-30 15:44,RT @dirtroaddiva1: #IHatePokemonGoBecause he ...
...,...,...,...,...
89995,5635647,KansasCityDailyNews,2016-04-03 21:19,Trump: Kasich shouldn't be allowed to run http...
89996,7012979,f46e654ff3f1f9697f2b94de5a2e42a6914e1f00da14a7...,2016-12-19 15:04,RT @JefLeeson: #ThingsYouCantIgnore The last s...
89997,6955774,88669ad69e40d7c199af91e8107f1e0e7988d377d2e41f...,2016-08-27 14:15,RT @nealcarter: When someone said the first li...
89998,8563509,ec2109adb67d2a24091026d5d9aab64dadca1fdb2f7355...,2016-10-20 15:19,RT @indigenous01: #rantfortoday I speak the Wo...


In [None]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(testdata, columns=['text'])
out = hashtag_list(test['text'])
out

In [2]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])
most_common_hashtag(test)

NameError: name 'hashtag_list' is not defined

In [65]:
most_common_hashtag(hashtag_list(ira.text))

0                        CatTV
1                          NaN
2                         tech
3                         news
4        IHatePokemonGoBecause
                 ...          
89995                 politics
89996      ThingsYouCantIgnore
89997                      NaN
89998             rantfortoday
89999                     news
Length: 90000, dtype: object

In [66]:
grader.check("q4")

NameError: name 'grader' is not defined

**Question 5 (Features)**

Now create a dataframe of features from the `ira` data. That is, create a function `create_features` that takes in the `ira` data and returns a dataframe with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `num_hashtags` gives the number of hashtags present in a tweet,
* `mc_hashtags` gives the most common hashtag associated to a tweet (as given by the problem above),
* `num_tags` gives the number of tags a given tweet has (look for the presence of `@`),
* `num_links` gives the number of hyper-links present in a given tweet 
    - (a hyper-link is a string starting with `http://` or `https://`, followed by a run of non-whitespace characters),
* A boolean column `is_retweet` that describes if the given tweet is a retweet (i.e. `RT`),
* A 'clean' text field `text` that contains the tweet text with:
    - The non-alphanumeric characters removed (except spaces),
    - All words should be separated by exactly one space,
    - The characters all lowercase,
    - All the meta-information above (Retweet info, tags, hyperlinks, hashtags) removed.

*Note:* You should make a helper function for each column.

*Note:* This will take a while to run on the entire dataset -- test it on a small sample first!
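As an example of one such helper, here is a minimal sketch of the text-cleaning step. The name `clean_text_sketch` is hypothetical, and the sketch assumes that removed non-alphanumeric characters act as word separators (so `Text-cleaning` becomes two words).

```python
import re

def clean_text_sketch(tweet):
    """Strip retweet marker, tags, links and hashtags, then normalize the text."""
    # Drop a leading retweet marker.
    t = re.sub(r'^RT\s', ' ', tweet)
    # Drop tags (@...), hyperlinks and hashtags, each up to the next whitespace.
    t = re.sub(r'@\S+|https?://\S+|#\S+', ' ', t)
    # Replace every remaining non-alphanumeric character with a space.
    t = re.sub(r'[^A-Za-z0-9 ]', ' ', t)
    # Lowercase and collapse runs of whitespace into single spaces.
    return ' '.join(t.lower().split())

print(clean_text_sketch(
    'RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1'
))
```

The meta-information is removed before the punctuation so that `@`, `#` and `://` are still present to mark what should be dropped.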

In [302]:
htag_list = hashtag_list(ira['text'])
num_hashtags = htag_list.map(len)
mc_hashtags = most_common_hashtag(htag_list)
num_tags = []
num_links = []
is_retweet = []
text = []
for i in ira['text'].index:
    t = ira['text'].loc[i]
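    # Tags: '@' followed by non-whitespace characters
    num_tags.append(len(re.findall(r'@\S+', t)))
    # Links: 'http://' or 'https://' followed by non-whitespace characters
    num_links.append(len(re.findall(r'https?://\S+', t)))
    # Retweets start with 'RT' followed by whitespace
    is_retweet.append(bool(re.match(r'^RT\s', t)))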
    text.append(clean_text(t))
out = pd.DataFrame()
out['num_hashtags'] = num_hashtags
out['mc_hashtags'] = mc_hashtags
out['num_tags'] = num_tags
out['num_links'] = num_links
out['is_retweet'] = is_retweet
out['text'] = text
out

Unnamed: 0,num_hashtags,mc_hashtags,num_tags,num_links,is_retweet,text
0,4,news,0,2,False,The Best Exercise To Lose Belly Fat In 2 weeks
1,0,,1,1,True,"Dozens of ‘hate groups’ have charity status, C..."
2,1,tech,0,0,False,"Artificial intelligence can find, map poverty,..."
3,1,news,0,0,False,Uber balks at rules proposed by world’s busies...
4,2,news,1,1,True,"he didn't let me do ""that"" for a Klondike bar..."
...,...,...,...,...,...,...
89995,1,politics,0,1,False,Trump: Kasich shouldn't be allowed to run
89996,1,ThingsYouCantIgnore,1,0,True,The last step at the top of the stairs.
89997,0,,2,0,True,When someone said the first link from @thetrud...
89998,1,rantfortoday,1,0,True,I speak the Word of God therefore because the ...


In [321]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(testdata, columns=['text'])
out = create_features(test)
anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
ans = pd.DataFrame(ansdata, columns=anscols)


In [322]:
out

Unnamed: 0,text,num_hashtags,mc_hashtags,num_tags,num_links,is_retweet
0,text cleaning is cool,3,NLP1,1,1,True


In [323]:
ans

Unnamed: 0,text,num_hashtags,mc_hashtags,num_tags,num_links,is_retweet
0,text cleaning is cool,3,NLP1,1,1,True


In [69]:
create_features(ira)

Unnamed: 0,text,num_hashtags,mc_hashtags,num_tags,num_links,is_retweet
0,the best exercise to lose belly fat in 2 weeks,4,CatTV,0,2,False
1,dozens of hate groups have charity status chro...,0,,1,1,True
2,artificial intelligence can find map poverty r...,1,tech,0,0,False
3,uber balks at rules proposed by world s busies...,1,news,0,0,False
4,he didn t let me do that for a klondike bar sc...,2,IHatePokemonGoBecause,1,1,True
...,...,...,...,...,...,...
89995,trump kasich shouldn t be allowed to run,1,politics,0,1,False
89996,the last step at the top of the stairs,1,ThingsYouCantIgnore,1,0,True
89997,when someone said the first link from thetrudz...,0,,2,0,True
89998,i speak the word of god therefore because the ...,1,rantfortoday,1,0,True


In [324]:
grader.check("q5")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [325]:
grader.check_all()

q1 results: All test cases passed!

q2 results: All test cases passed!

q3 results: All test cases passed!

q4 results: All test cases passed!

q5 results: All test cases passed!