In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 7 – Regular Expressions

## DSC 80, Spring 2022

### Due Date: Monday, May 16th at 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying `lab.py` file will be tested (a la DSC 20),
2. The notebook may be graded (if it contains free response questions or asks you to draw plots).

**Do not change the function names in the `lab.py` file!**
- The functions in the `lab.py` file are how your assignment is graded, and they are graded by their name.
- If you changed something you weren't supposed to, just use git to revert! Ask us if you need help with this, or google around for `git revert`.

**Tips for working in the notebook**:
- The notebooks serve to present the questions and give you a place to present your results for later review.
- The notebooks in *lab assignments* are not graded (only the `lab.py` file is submitted and graded).
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `lab.py` file. You can write code here, but make sure that all of your real work is in the `lab.py` file.

**Tips for developing in the `lab.py` file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional helper functions to solve the lab! 
- Always document your code!

### Importing code from `lab.py`

* We import our `lab.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because, once `lab` has been imported, Python caches the loaded module (its compiled bytecode is also stored in the directory `__pycache__`). Subsequent imports of `lab` reuse the cached module instead of re-reading `lab.py`.
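As a rough illustration of why a plain re-import does not pick up edits, the sketch below writes a throwaway module to a temporary directory, edits it on disk, and compares a second `import` with an explicit `importlib.reload`. The module name `demo_lab_module` and its `VALUE` variable are hypothetical stand-ins for `lab.py`; `autoreload` automates the reload step for you.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical throwaway module standing in for lab.py.
tmp_dir = tempfile.mkdtemp()
module_path = Path(tmp_dir) / "demo_lab_module.py"
module_path.write_text("VALUE = 1\n")

sys.dont_write_bytecode = True  # keep the demo independent of __pycache__
sys.path.insert(0, tmp_dir)
import demo_lab_module

first = demo_lab_module.VALUE

# Edit the file on disk, as you would edit lab.py in an editor.
module_path.write_text("VALUE = 200\n")

import demo_lab_module  # reuses the module object already loaded
after_reimport = demo_lab_module.VALUE

importlib.reload(demo_lab_module)  # re-executes the source file
after_reload = demo_lab_module.VALUE

print(first, after_reimport, after_reload)
```

The second `import` leaves the old value in place; only the explicit reload sees the edit.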

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from lab import *

In [5]:
import pandas as pd
import numpy as np
import os
import re

***Note:*** While working on the lab, check the Campuswire post titled "Lab 7 Released!" for any clarifications.

## Question 1 – Practice with Regular Expressions 🛠

Regular expressions can be tricky, and the best way to gain familiarity with them is through lots of practice. In this question, you will work through ten exercises, each of which requires you to write a regular expression that matches strings that satisfy certain criteria. Make sure to take a close look at the doctests for each function in `lab.py`, as they provide useful guidance for the types of strings you should and shouldn't match.

***Notes:*** 
- Make sure to refer to the [Regular Expression Resources](https://dsc80.com/resources/) posted on the course website. In particular, we recommend having [regex101.com](https://regex101.com/) open while working, along with the [cheat sheet](https://dsc80.com/resources/other/berkeley-regex-reference.pdf).

- Each exercise has a star rating, between 1 (⭐️) and 3 (⭐️⭐️⭐️) stars, indicating its difficulty level (1 being the easiest, 3 being the hardest). If you are spending lots of time on 1-star exercises, take a close look at the syntax from lecture, as there is probably an easier way of writing the necessary pattern!

<br>

### Exercise 1 (⭐️)

Write a regular expression that matches strings that have `'['` as the third character and `']'` as the sixth character.

<br>

### Exercise 2 (⭐️)

Write a regular expression that matches strings that are phone numbers that start with `'(858)'` and follow the format `'(xxx) xxx-xxxx'` (`'x'` represents a digit).

***Note:*** There is a space between `'(xxx)'` and `'xxx-xxxx'`.
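One possible shape for such a pattern is sketched below. It assumes the whole string must be the phone number and therefore uses `re.fullmatch`; the doctests in `lab.py` may check matches differently, so treat this as an illustration of escaping parentheses and counting digits rather than the expected answer.

```python
import re

# Literal parentheses must be escaped; \d{3} and \d{4} count digits.
phone_pattern = r'\(858\) \d{3}-\d{4}'

def looks_like_858_number(string):
    """Return True if the entire string has the form '(858) xxx-xxxx'."""
    return re.fullmatch(phone_pattern, string) is not None

print(looks_like_858_number('(858) 456-7890'))
print(looks_like_858_number('(858)456-7890'))
```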

<br>

### Exercise 3 (⭐️)

Write a regular expression that matches strings that:
- are between 6 and 10 characters long (inclusive),
- contain only alphanumeric characters, whitespace and `'?'`, and
- end with `'?'`.

<br>

### Exercise 4 (⭐️⭐️)

Write a regular expression that matches strings with exactly two `'$'`, one of which is at the start of the string, such that:
- the characters between the two `'$'` can be anything (including nothing) except the lowercase letters `'a'`, `'b'`, and `'c'`, (and `'$'`), and
- the characters after the second `'$'` can only be the **lowercase or uppercase** letters `'a'`/`'A'`, `'b'`/`'B'`, and `'c'`/`'C'`, with every `'a'`/`'A'` before every `'b'`/`'B'`, and every `'b'`/`'B'` before every `'c'`/`'C'`. There must be at least one `'a'` or `'A'`, at least one `'b'` or `'B'`, and at least one `'c'` or `'C'`.
    

<br>

### Exercise 5 (⭐️)
Write a regular expression that matches strings that represent valid Python file names, including the extension. 

***Note:*** For simplicity, assume that file names contain only letters, numbers, and underscores (`'_'`).

<br>

### Exercise 6 (⭐️)
Write a regular expression that matches strings that:
- are made up of only lowercase letters and exactly one underscore (`'_'`), and
- have at least one lowercase letter on both sides of the underscore.

<br>

### Exercise 7 (⭐️)
Write a regular expression that matches strings that start with and end with an underscore (`'_'`).

<br>

### Exercise 8 (⭐️)

Apple serial numbers are strings of length 1 or more that are made up of any characters, other than
- the uppercase letter `'O'`, 
- the lowercase letter `'i'`, and 
- the number `'1'`.

Write a regular expression that matches strings that are valid Apple serial numbers.

<br>

### Exercise 9 (⭐️⭐️)

ID numbers are formatted as `'SC-NN-CCC-NNNN'`, where 
- SC represents state code in uppercase (e.g. `'CA'`),
- NN represents a number with 2 digits (e.g. `'98'`),
- CCC represents a three letter city code in uppercase (e.g. `'SAN'`), and
- NNNN represents a number with 4 digits (e.g. `'1024'`).

Write a regular expression that matches strings that are ID numbers corresponding to the cities of `'SAN'` or `'LAX'`, or the state of `'NY'`. Assume that there is only one city named `'SAN'` and only one city named `'LAX'`.
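The "either the city is `'SAN'`/`'LAX'` or the state is `'NY'`" condition maps naturally onto alternation. The sketch below is one way to express it, assuming the entire string must be an ID number (hence `re.fullmatch`); the helper name `is_target_id` is hypothetical.

```python
import re

# Two alternatives: any state with city SAN or LAX, or state NY with any city.
id_pattern = r'[A-Z]{2}-\d{2}-(?:SAN|LAX)-\d{4}|NY-\d{2}-[A-Z]{3}-\d{4}'

def is_target_id(string):
    """Return True if the entire string is an ID for SAN, LAX, or NY."""
    return re.fullmatch(id_pattern, string) is not None

print(is_target_id('CA-98-SAN-1024'))
print(is_target_id('NY-32-NYC-1232'))
print(is_target_id('CA-98-SFO-1024'))
```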

<br>

### Exercise 10 (⭐️⭐️⭐️)

Write a function named `match_10` that takes in a string and:
- converts the string to lowercase,
- removes all non-alphanumeric characters (i.e. removes everything that is not in the `\w` character class), and the letter `'a'`, and
- returns a list of every **non-overlapping** three-character substring in the remaining string, starting from the beginning of the string.
   
For instance, consider the following doctest:

```py
>>> match_10('Ab..DEF')
['bde']
```

Here's how `match_10` should process `'Ab..DEF'`:

1. Convert to lowercase: `'ab..def'`.
2. Remove non-alphanumeric characters and the letter `'a'`: `'bdef'`.
3. Starting from the beginning of the string, there is only a single non-overlapping three character substring: `'bde'`. Hence, we return `['bde']`.

***Note:*** Perform your operations in the exact order described above, otherwise your code may not pass all the tests.
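The three steps can be sketched as follows. The name `match_10_sketch` is hypothetical (the graded function is `match_10` in `lab.py`), and the character class `[\Wa]` is just one way to drop non-`\w` characters and the letter `'a'` in a single pass.

```python
import re

def match_10_sketch(string):
    # Step 1: convert to lowercase.
    lowered = string.lower()
    # Step 2: remove everything outside \w, plus the letter 'a'.
    kept = re.sub(r'[\Wa]', '', lowered)
    # Step 3: findall scans left to right and never reuses a character,
    # so the three-character chunks are non-overlapping.
    return re.findall(r'\w{3}', kept)

print(match_10_sketch('Ab..DEF'))  # ['bde']
```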

In [22]:
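string = 'Ab..DEFbt'
filter2 = string.lower()                # lowercase first
filter2 = filter2.replace('a', '')      # drop the letter 'a'
filter2 = re.findall(r'\w', filter2)    # keep only \w characters
filter2 = ''.join(filter2)
#filter2

# non-overlapping three-character chunks, starting from the left
re.findall(r'\w{3}', filter2)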

['bde', 'fbt']

In [23]:
grader.check("q1")

## Question 2 – Capturing Groups in Regular Expressions 📡

The dataset stored in `data/messy.txt` contains personal information from a fictional website that a user scraped from web server logs. Within this dataset, there are four fields that are of interest to you:
1. Email Addresses (assume they are alphanumeric usernames and domain names)
2. [Social Security Numbers](https://en.wikipedia.org/wiki/Social_Security_number#Structure)
3. Bitcoin Addresses (alphanumeric strings of long length)
4. Street Addresses

Create a function `extract_personal` that takes in a string containing the contents of a server log file (like `open('data/messy.txt').read()`) and returns a **tuple of four separate lists** containing values of the 4 pieces of information listed above (in the order listed above). Do **not** keep empty values.

***Note:*** Since this data is messy, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `'@'` for emails) should help assure correctness! Your function will be tested on a sample of the file `messy.txt`.

***Hint:*** There are multiple "delimiters" in use in the file; there are few enough of them that you can safely determine what they are.
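To show how capturing groups pull out just the value after a label, here is a small sketch run on an invented two-line sample. The sample text, its delimiters, and the function name `extract_personal_sketch` are all assumptions made for illustration; the real `data/messy.txt` needs to be inspected to find its actual delimiters and labels.

```python
import re

# Invented sample; the real file may use different delimiters and labels.
sample = (
    "1|user1@example.com,ssn:123-45-6789|bitcoin:1A2b3C4d5E6f7G8h|814 Monterey Court\n"
    "2|null,ssn:null|bitcoin:null|62 Hooker Park\n"
)

def extract_personal_sketch(s):
    emails = re.findall(r'\w+@\w+\.\w+', s)
    # The parentheses capture only the value, not the 'ssn:' label.
    ssns = re.findall(r'ssn:(\d{3}-\d{2}-\d{4})', s)
    # Requiring 5+ characters skips the placeholder value 'null'.
    bitcoins = re.findall(r'bitcoin:([A-Za-z0-9]{5,})', s)
    streets = re.findall(r'\d+ \w+ \w+(?: \w+)*', s)
    return emails, ssns, bitcoins, streets

print(extract_personal_sketch(sample))
```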

In [45]:
# experiment with extract_personal using the file s below
fp = os.path.join('data', 'messy.txt')
s = open(fp, encoding='utf8').read()

In [52]:
emails = re.findall(r'(\w+@\w+\.\w+(?:\.\w+)*)', s)

In [76]:
len(emails)

938

In [56]:
ssn = re.findall(r'ssn:(\d{3}-\d{2}-\d{4})', s)

In [57]:
ssn

['380-09-9403',
 '118-12-8276',
 '427-22-9352',
 '649-16-2247',
 '265-90-3805',
 '552-37-4756',
 '757-41-7191',
 '855-12-9209',
 '592-06-3498',
 '295-47-4318',
 '487-70-8307',
 '388-94-8970',
 '672-36-5180',
 '493-79-8907',
 '797-87-2569',
 '634-49-4764',
 '236-77-4126',
 '419-91-7185',
 '608-41-6360',
 '850-99-8257',
 '634-32-0336',
 '466-29-7843',
 '219-94-1536',
 '606-83-6633',
 '225-28-2622',
 '158-80-4556',
 '579-70-9026',
 '561-14-1915',
 '619-94-3154',
 '554-21-4300',
 '494-98-3760',
 '535-06-0965',
 '435-17-2290',
 '493-31-9068',
 '898-20-5714',
 '645-60-2112',
 '521-33-6224',
 '177-64-8758',
 '705-22-5915',
 '626-83-3638',
 '296-71-6465',
 '623-74-3100',
 '462-34-3229',
 '672-84-3797',
 '551-76-0257',
 '612-43-4860',
 '540-35-6767',
 '882-68-5364',
 '361-38-8584',
 '497-99-1297',
 '308-27-2080',
 '774-78-1689',
 '159-44-5881',
 '763-29-1574',
 '186-51-3531',
 '110-65-7380',
 '728-96-0460',
 '235-88-1295',
 '270-62-8459',
 '247-78-4274',
 '447-90-1874',
 '685-64-3362',
 '431-58

In [72]:
bitcoin = re.findall(r'bitcoin:(\w{5,})', s)

In [65]:
len(bitcoin)

2857

In [67]:
counter = 0
for n in bitcoin:
    if n ==  'null':
        counter+=1
        
counter

76

In [68]:
len(bitcoin) - counter

2781

In [73]:
len(bitcoin)

2781

In [74]:
streets = re.findall(r'\d+ \w+ \w+(?: \w+)*', s)

In [75]:
len(streets)

958

In [77]:
streets

['814 Monterey Court',
 '62 Hooker Park',
 '27811 Clyde Gallagher Drive',
 '32553 Riverside Pass',
 '17157 Clemons Alley',
 '75 Dorton Parkway',
 '5 Manitowish Trail',
 '84 Northland Center',
 '64404 Sundown Street',
 '47144 Mockingbird Street',
 '413 Sutherland Court',
 '8906 Almo Lane',
 '34223 Graceland Crossing',
 '94 Mcguire Lane',
 '1 Delladonna Circle',
 '169 Claremont Point',
 '19475 Meadow Vale Crossing',
 '147 Cascade Center',
 '762 Knutson Terrace',
 '301 Monument Trail',
 '39 Buena Vista Alley',
 '94 Barby Circle',
 '98 Hansons Road',
 '83525 Calypso Way',
 '61 Heath Street',
 '60 Bay Hill',
 '923 Lindbergh Place',
 '1 Elmside Circle',
 '34 High Crossing Road',
 '63 Bonner Lane',
 '67 Superior Terrace',
 '846 Acker Road',
 '4 Menomonie Avenue',
 '858 Arrowood Terrace',
 '2566 Anhalt Trail',
 '30840 Sherman Hill',
 '29170 Parkside Lane',
 '150 Logan Street',
 '74 North Court',
 '51374 8th Hill',
 '6432 Continental Avenue',
 '46916 Donald Court',
 '1950 Beilfuss Hill',
 '949 

In [78]:
# don't change this cell, but do run it -- it is needed for the tests
test_fp = os.path.join('data', 'messy.test.txt')
test_s = open(test_fp, encoding='utf8').read()
emails, ssn, bitcoin, addresses = extract_personal(test_s)

In [79]:
grader.check("q2")

## Question 3 – TF-IDF 📊

The dataset `data/reviews.txt` contains [Amazon reviews](http://jmcauley.ucsd.edu/data/amazon/) for ~200k phones and phone accessories. The dataset has already been "cleaned" for you. In this question, you will create a function that takes in the reviews dataset as a Series (with one entry per review) as well as a single review, and returns the word that "best summarizes the single review" using TF-IDF.

To do so, implement the two functions below.

#### `tfidf_data`

Create a function `tfidf_data` that takes in the reviews data as a Series (`reviews_ser`) and a single review (`review`) and returns a DataFrame indexed by the words in `review` with four columns:
- `'cnt'`: the number of times each word is found in the review 
- `'tf'`: the term frequency for each word
- `'idf'`: the inverse document frequency for each word
- `'tfidf'`: the TF-IDF for each word

You may use a `for`-loop. The words in the outputted DataFrame may appear in any order.

***Hint:*** You may need to use the [`'\b'` character](https://www.regular-expressions.info/wordboundaries.html) somewhere.
    
<br>

#### `relevant_word`

Create a function `relevant_word` that takes in the DataFrame that `tfidf_data` returns and returns the word that "best summarizes" the review. If there are multiple "best" summary words, return any one of them.
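The arithmetic behind the four columns can be sketched in plain Python on a toy corpus. This assumes the common definitions (term frequency = count divided by the number of words in the review; inverse document frequency = natural log of the number of reviews divided by the number of reviews containing the word) and that every word of the review appears in at least one review. The function name `tfidf_sketch` and the toy data are hypothetical, and the result is a dictionary rather than the DataFrame the lab asks for.

```python
import math
import re
from collections import Counter

def tfidf_sketch(reviews, review):
    words = re.findall(r'\b\w+\b', review)
    rows = {}
    for word, cnt in Counter(words).items():
        tf = cnt / len(words)
        # \b keeps 'case' from matching inside e.g. 'showcase'.
        pattern = re.compile(rf'\b{re.escape(word)}\b')
        docs_with_word = sum(1 for doc in reviews if pattern.search(doc))
        idf = math.log(len(reviews) / docs_with_word)
        rows[word] = {'cnt': cnt, 'tf': tf, 'idf': idf, 'tfidf': tf * idf}
    return rows

toy_reviews = ['great case case', 'great battery', 'slow shipping']
toy_rows = tfidf_sketch(toy_reviews, toy_reviews[0])

# The "best summary" word is the one with the largest TF-IDF.
best_word = max(toy_rows, key=lambda w: toy_rows[w]['tfidf'])
print(best_word)  # case
```

Here `'case'` wins because it is frequent in the review (tf = 2/3) and rare in the corpus (idf = log 3), while `'great'` appears in two of the three reviews.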

In [119]:
# experiment with tfidf_data using reviews_ser and review below 
fp = os.path.join('data', 'reviews.txt')
reviews_ser = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()





In [149]:
reviews_ser

0        works great  i called t mobile and had this si...
1        these items looked to be of good quality and h...
2        this product arrive faster than i expected  i ...
3        i brought this for my sister who has a g2 but ...
4         i am both delighted and disappointed with del...
                               ...                        
58297    bought i for my husband's phone a week ago  ni...
58298    the spring loaded adjustable holder looks nift...
58299    received on time  everything working less spea...
58300    i bought this battery pack with the mindset of...
58301              that work    but thats not the original
Name: 0, Length: 58302, dtype: object

In [83]:
review

'this is a great new case design that i have not seen before it has a slim silicone skin that really locks in the phone to cover and protect your phone from spills and such and also a hard polycarbonate outside shell cover to guard it against damage  this case also comes with different interchangeable skins and covers to create multiple color combinations  this is a different kind of case than the usual chunk of plastic  it is innovative and suits the iphone 5 perfectly'

In [158]:
words = pd.Series(re.findall(r'\b\w+\b',review)).value_counts()
data = pd.DataFrame(columns = ['cnt', 'tf', 'idf', 'tfidf'])
data['cnt'] = words
data['tf'] = data['cnt'] / data['cnt'].sum()
data


idfs = []
for n in words.index:
    idfs.append(np.log(len(reviews_ser) / reviews_ser.str.contains(n).sum()))


data['idf'] = idfs
data['tfidf'] = data['tf'] * data['idf']
#len(words)

In [159]:
tfidf_data()

| | cnt | tf | idf | tfidf |
|---|---|---|---|---|
| and | 5 | 0.058824 | 0.219198 | 0.012894 |
| a | 4 | 0.047059 | 0.003041 | 0.000143 |
| this | 3 | 0.035294 | 0.440309 | 0.01554 |
| is | 3 | 0.035294 | 0.194456 | 0.006863 |
| to | 3 | 0.035294 | 0.251689 | 0.008883 |
| it | 3 | 0.035294 | 0.081721 | 0.002884 |
| the | 3 | 0.035294 | 0.119618 | 0.004222 |
| case | 3 | 0.035294 | 1.059063 | 0.037379 |
| that | 2 | 0.023529 | 0.839864 | 0.019762 |
| also | 2 | 0.023529 | 1.922164 | 0.045227 |


In [117]:
tf = []
for n in words:
    count = len(re.findall(fr'\b{n}\b', review))
    tf.append(count)
data['cnt'] = pd.Series(index = words, data = tf)
data['tf'] = data['cnt'] / data.shape[0]

num_t_appear = []

for n in reviews_ser:
    print(re.findall(fr'\b\w+\b',n))
    

['works', 'great', 'i', 'called', 't', 'mobile', 'and', 'had', 'this', 'sim', 'card', 'activated', 'in', '5', 'minutes', 'they', 'have', 'a', 'automated', 'system', 'or', 'you', 'can', 'speak', 'to', 'a', 'rep']
['these', 'items', 'looked', 'to', 'be', 'of', 'good', 'quality', 'and', 'have', 'lasted', 'several', 'months', 'so', 'far', 'i', 'am', 'satisfied', 'with', 'the', 'purchase', 'considering', 'the', 'low', 'cost']


In [107]:
data = pd.DataFrame(index = words,columns = ['cnt', 'tf', 'idf', 'tfidf'])
data

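| | cnt | tf | idf | tfidf |
|---|---|---|---|---|
| this | NaN | NaN | NaN | NaN |
| is | NaN | NaN | NaN | NaN |
| a | NaN | NaN | NaN | NaN |
| great | NaN | NaN | NaN | NaN |
| new | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... |
| suits | NaN | NaN | NaN | NaN |
| the | NaN | NaN | NaN | NaN |
| iphone | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN |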


In [160]:
fp = os.path.join('data', 'reviews.txt')
reviews_ser = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
q3_tfidf = tfidf_data(reviews_ser, review)





In [171]:
q3_tfidf['tfidf'].sort_values(ascending=False).index[0]

'spills'

In [172]:
relevant_word(q3_tfidf)

'spills'

In [92]:
data

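| | cnt | tf | idf | tfidf |
|---|---|---|---|---|
| this | NaN | NaN | NaN | NaN |
| is | NaN | NaN | NaN | NaN |
| a | NaN | NaN | NaN | NaN |
| great | NaN | NaN | NaN | NaN |
| new | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... |
| suits | NaN | NaN | NaN | NaN |
| the | NaN | NaN | NaN | NaN |
| iphone | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN |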


In [130]:
unique_words = pd.Series(re.findall(r'\b\w+\b',review).sum()).value_counts()
unique_words

AttributeError: 'list' object has no attribute 'sum'

In [173]:
# don't change this cell, but do run it -- it is needed for the tests
fp = os.path.join('data', 'reviews.txt')
reviews_ser = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
q3_tfidf = tfidf_data(reviews_ser, review)

try:
    q3_rel = relevant_word(q3_tfidf)
except:
    q3_rel = None





In [174]:
grader.check("q3")

## Questions 4 and 5 – Tweet Analysis 🐥

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the [Internet Research Agency](https://en.wikipedia.org/wiki/Internet_Research_Agency), the tweet factory facing allegations for attempting to influence US political elections.

Questions 4 and 5 will focus on the following:
- Question 4: Look at the hashtags present in the text and trends in their makeup.
- Question 5: Prepare the dataset for modeling by creating features out of the text fields.

### Question 4 – Hashtags #️⃣

You may assume that a hashtag is any string without whitespace that immediately follows a `'#'`.

#### `hashtag_list`

Create a function `hashtag_list` that takes in a Series of tweet texts and returns a Series containing a list of hashtags present in each tweet's text. If a tweet's text doesn't contain a hashtag, the Series should contain an empty list for that tweet. Don't include the `'#'` symbol in the lists that are returned (see the doctest for an example).

<br>

#### `most_common_hashtag`

Create a function `most_common_hashtag` that takes in a Series of hashtag lists (as is outputted by `hashtag_list`) and returns a Series consisting of a single hashtag per tweet: 
- If the tweet's text has no hashtags, the entry in the output Series should be `NaN`.
- If the tweet's text has one distinct hashtag, the entry in the output Series should be that hashtag.
- If the tweet's text has more than one hashtag, the entry in the output Series should be the most common hashtag **in the whole input Series**. If there is a tie for most common, any of the most common can be returned.
    - For example, if the input Series was `pd.Series([[2], [2], [3, 2, 3]])`, the output would be `pd.Series([2, 2, 2])`. Even though `3` was more common in the third list than `2`, `2` is the most common among all hashtags in the Series.
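The rules above can be sketched on plain Python lists. The lab itself works with pandas Series, so the names `hashtag_list_sketch` and `most_common_hashtag_sketch`, and the use of lists and `collections.Counter`, are illustrative assumptions rather than the expected implementation.

```python
import math
import re
from collections import Counter

def hashtag_list_sketch(texts):
    # A hashtag is any run of non-whitespace right after '#';
    # the capturing group leaves the '#' itself out.
    return [re.findall(r'#(\S+)', text) for text in texts]

def most_common_hashtag_sketch(tag_lists):
    # Count every hashtag across the whole input, not per tweet.
    totals = Counter(tag for tags in tag_lists for tag in tags)
    result = []
    for tags in tag_lists:
        distinct = set(tags)
        if not distinct:
            result.append(math.nan)
        elif len(distinct) == 1:
            result.append(tags[0])
        else:
            result.append(max(distinct, key=lambda tag: totals[tag]))
    return result

print(hashtag_list_sketch(['#tech is #fun', 'no tags here']))
print(most_common_hashtag_sketch([[2], [2], [3, 2, 3]]))  # [2, 2, 2]
```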

In [209]:
# The doctests/public tests don't test your work on the `ira` data,
# but the hidden tests do.
# So, make sure to thoroughly test your work yourself!
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])

In [185]:
hashes_ser = ira['text'].apply(lambda x: re.findall(r'#(\S+)', x))
hashes_ser

0        [Exercise, LoseBellyFat, CatTV, TeenWolf…]
1                                                []
2                                            [tech]
3                                            [news]
4           [IHatePokemonGoBecause, PokesAreJokes.]
                            ...                    
89995                                    [politics]
89996                         [ThingsYouCantIgnore]
89997                                            []
89998                                [rantfortoday]
89999                                        [news]
Name: text, Length: 90000, dtype: object

In [207]:
hashes_ser[8]

['NowPlaying:', 'rap', 'hiphop', 'music']

In [208]:
hashes_counts.loc[hashes_ser[8]]

NowPlaying:    70
rap            37
hiphop         47
music          49
Name: text, dtype: int64

In [206]:
hashes_counts.loc[hashes_ser[8]].idxmax()

'NowPlaying:'

In [203]:
most_common_hashtag(hashes_ser)[8]

'NowPlaying:'

In [196]:
hashes_ser[0]

['Exercise', 'LoseBellyFat', 'CatTV', 'TeenWolf…']

In [199]:
hashes_counts = hashes_ser.explode().value_counts()
hashes_counts

hashes_counts.loc[hashes_ser[0]].idxmax()


'CatTV'

In [None]:

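# Draft of the most_common_hashtag logic, wrapped in a helper so that `return` is valid.
hashes_counts = hashes_ser.explode().value_counts()  # counts over the whole Series

def pick_hashtag(tags):
    if len(tags) == 0:
        return np.nan
    elif len(tags) == 1:
        return tags[0]
    else:
        # most common hashtag overall among this tweet's hashtags
        return hashes_counts.loc[tags].idxmax()

hashes_ser.apply(pick_hashtag)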

In [186]:
len([])

0

In [180]:
ira.iloc[49197]

id                                                3991390
name                                       Messiah Haynes
date                                     2016-07-18 11:00
text    RT @adamoc132013: #IDoOkForSomeoneWho has no m...
Name: 49197, dtype: object

In [210]:
grader.check("q4")

### Question 5 – Features 📋

Now, create a DataFrame of features from the `ira` data.  That is, create a function `create_features` that takes in a DataFrame `ira` that has just a single column, `'text'`, and returns a DataFrame with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `'num_hashtags'`, the number of hashtags present in the tweet.
* `'mc_hashtags'`, the most common hashtag associated to the tweet (using the result of `most_common_hashtag` from Question 4).
* `'num_tags'`, the number of tags the tweet has (look for the presence of `'@'`).
* `'num_links'`, the number of hyperlinks present in the tweet.
    - A hyperlink is a string starting with `'http://'` or `'https://'`, not followed by whitespaces.
* `'is_retweet'`, a Boolean describing whether the tweet is a retweet. A retweet is a tweet that **begins** with `'RT'`.
* `'text'`, a version of the tweet's text that is cleaned according to the following steps, **in this exact order**:
    1. All meta-information above (retweet info, tags, hyperlinks, and hashtags) should be replaced with a single space.
    2. Everything other than letters, numbers, and spaces should be replaced with a single space.
    3. All letters should be lowercase.
    4. All words should be separated by exactly one space, and leading/trailing whitespace should be removed (stripped).
    
The columns in the outputted DataFrame must be in the order `['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']`. (Remember, the DataFrame that `create_features` is called on only has a single column, `'text'`.)

***Notes:***
- It's a good idea to make a helper function for each column.
- The `\w` character class in regex **does not** refer to letters, numbers, and spaces (or even just letters and numbers). As such, you can't use it here!
- `create_features` will take a while to run on the entire dataset – test it on a small sample first!
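The cleaning steps and counts can be sketched for a single tweet string as below. The function names are hypothetical, and a few interpretations are assumptions: a tag is taken to be `'@'` followed by `\w` characters, "retweet info" is taken to be a leading `'RT'`, and links are removed before tags and hashtags so that a `'#'` or `'@'` inside a URL is not treated separately.

```python
import re

def clean_text_sketch(text):
    # Step 1: replace meta-information with a single space.
    text = re.sub(r'^RT\b', ' ', text)         # retweet marker at the start
    text = re.sub(r'https?://\S+', ' ', text)  # hyperlinks
    text = re.sub(r'@\w+', ' ', text)          # tags
    text = re.sub(r'#\S+', ' ', text)          # hashtags
    # Step 2: anything other than letters, numbers, and spaces becomes a space.
    text = re.sub(r'[^A-Za-z0-9 ]', ' ', text)
    # Step 3: lowercase.
    text = text.lower()
    # Step 4: exactly one space between words, no leading/trailing whitespace.
    return ' '.join(text.split())

def count_features_sketch(text):
    return {
        'num_hashtags': len(re.findall(r'#\S+', text)),
        'num_tags': len(re.findall(r'@\w+', text)),
        'num_links': len(re.findall(r'https?://\S+', text)),
        'is_retweet': text.startswith('RT'),
    }

tweet = 'RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1'
print(clean_text_sketch(tweet))  # text cleaning is cool
print(count_features_sketch(tweet))
```

Per-string helpers like these can be applied to a column with `Series.apply` while testing on a small sample.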

In [211]:
# The doctests/public tests don't test your work on the `ira` data,
# but the hidden tests do.
# So, make sure to thoroughly test your work yourself!
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])

In [219]:
hashtag_list(ira['text'])

0        [Exercise, LoseBellyFat, CatTV, TeenWolf…]
1                                                []
2                                            [tech]
3                                            [news]
4           [IHatePokemonGoBecause, PokesAreJokes.]
                            ...                    
89995                                    [politics]
89996                         [ThingsYouCantIgnore]
89997                                            []
89998                                [rantfortoday]
89999                                        [news]
Name: text, Length: 90000, dtype: object

In [267]:
ira['text']

0        The Best Exercise To Lose Belly Fat In 2 weeks...
1        RT @Philanthropy: Dozens of ‘hate groups’ have...
2        Artificial intelligence can find, map poverty,...
3        Uber balks at rules proposed by world’s busies...
4        RT @dirtroaddiva1: #IHatePokemonGoBecause he  ...
                               ...                        
89995    Trump: Kasich shouldn't be allowed to run http...
89996    RT @JefLeeson: #ThingsYouCantIgnore The last s...
89997    RT @nealcarter: When someone said the first li...
89998    RT @indigenous01: #rantfortoday I speak the Wo...
89999    10 Things to Know for Monday https://t.co/XoOg...
Name: text, Length: 90000, dtype: object

In [258]:
'RTX '.replace(r'^(?:RT)+',"")

'RTX '

In [322]:
# Note: the nested list below builds MultiIndex columns, which is why
# df['text'].values shows one-element arrays in the outputs below.
df = pd.DataFrame(columns=[['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']])
hashes = hashtag_list(ira['text'])  # compute once instead of inside the loop
df['num_hashtags'] = hashes.apply(len)
df['mc_hashtags'] = most_common_hashtag(hashes)

df['num_tags'] = ira['text'].apply(lambda x: len(re.findall(r'@\w+', x)))
df['is_retweet'] = ira['text'].apply(lambda x: x[0:2] == 'RT')
df['num_links'] = ira['text'].apply(lambda x: len(re.findall(r'https?://\S+', x)))

clean_text = []
for counter, n in enumerate(ira['text'][:30]):
    cleaned = n.replace('RT', ' ')
    # replace tags with a space
    for l in re.findall(r'@\w+', cleaned):
        cleaned = cleaned.replace(l, ' ')
    # replace hashtags with a space; re.escape guards against regex metacharacters
    for k in hashes[counter]:
        cleaned = re.sub(fr'(?:\s|\b)#{re.escape(k)}(?:\s|\b)', ' ', cleaned)
    # replace links with a space
    for b in re.findall(r'https?://\S+', cleaned):
        cleaned = cleaned.replace(b, ' ')
    # replace every run of other characters with a space
    cleaned = re.sub(r'[^A-Za-z0-9 ]+', ' ', cleaned)
    cleaned = cleaned.lower()
    words = re.findall(r'\b([a-z0-9]+)\b', cleaned)
    clean_text.append(' '.join(words))
df['text'] = pd.Series(clean_text)

In [328]:
df['text'].values[4]

array(['he didn t let me do that for a klondike bar screw you pokemon'],
      dtype=object)

In [286]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(testdata, columns=['text'])
test.columns
#out = create_features(test)

Index(['text'], dtype='object')

In [315]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(testdata, columns=['text'])
out = create_features(test)
anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
ans = pd.DataFrame(ansdata, columns=anscols)
(out == ans).all().all()

0    [NLP, NLP1, NLP1]
Name: text, dtype: object
NLP
NLP1
NLP1
 : Text-cleaning is cool! https://t.co/xsfdw88d 


True

In [316]:
out

| | text | num_hashtags | mc_hashtags | num_tags | num_links | is_retweet |
|---|---|---|---|---|---|---|
| 0 | text cleaning is cool | 3 | NLP1 | 1 | 1 | True |


In [317]:
ans

| | text | num_hashtags | mc_hashtags | num_tags | num_links | is_retweet |
|---|---|---|---|---|---|---|
| 0 | text cleaning is cool | 3 | NLP1 | 1 | 1 | True |


In [270]:

df['text'].values[:30]

array([['the best to lose belly fat in 2 weeks'],
       ['dozens of hate groups have charity status chronicle study finds'],
       ['artificial intelligence can find map poverty researchers say'],
       ['uber balks at rules proposed by world s busiest airport'],
       ['he didn t let me do that for a klondike bar screw you pokemon'],
       ['chick fil a remains closed after violations'],
       ['we cannot afford to wait to address this public health crisis we must quickly fund efforts to stop zika s spread'],
       ['that the two leading republican candidates are an ignorant bully and an ignorant preacher'],
       ['rj ommio from nothing prod by davo'],
       ['hill street vida blues'],
       ['all you wanted to know about hillary'],
       ['photos man driving atv hit by semi truck while leading police on pursuit'],
       ['you don t have to use your daughters and wives as surrogates for your outrage over you can just be offende'],
       ['celebrity biographer wendy leigh

In [318]:
# don't change this cell, but do run it -- it is needed for the tests
# (yes, we know it says "hidden" – there are still truly hidden tests in this question)
fp_hidden = 'data/ira_test.csv'
ira_hidden = pd.read_csv(fp_hidden, header=None)
text_hidden = ira_hidden.iloc[:, -1:]
text_hidden.columns = ['text']

test_hidden = create_features(text_hidden)

In [319]:
grader.check("q5")

## Congratulations! You're done! 🏁

Submit your `lab.py` file to Gradescope. Note that you only need to submit the `lab.py` file; this notebook should not be uploaded.

Before submitting, you should ensure that all of your work is in the `lab.py` file. You can do this by running the doctests below, which will verify that your work passes the public tests **and** that your work is in the `lab.py` file. Run the cell below; you should see no output.

In [320]:
!python -m doctest lab.py





In addition, `grader.check_all()` will verify that your work passes the public tests.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [321]:
grader.check_all()

q1 results: All test cases passed!

q2 results: All test cases passed!

q3 results: All test cases passed!

q4 results: All test cases passed!

q5 results: All test cases passed!