In [199]:
import pandas as pd

**Columns**

```
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - the id of the tweet (2087)
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3 - the query (lyx). If there is no query, then this value is NO_QUERY.
4 - the user that tweeted (robotickilldozr)
5 - the text of the tweet (Lyx is cool)
```

In [200]:
cols = ['polarity','id', 'date', 'query', 'user', 'tweet']

data = pd.read_csv('sentiment.csv',names=cols, encoding='ISO-8859-1')
print('length of data {}'.format(len(data)))

length of data 1600000


In [201]:
data[:5]

,polarity,id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


#### 1.) Randomly sample 1% of the data (otherwise, the dataset is too big!)

In [202]:
sample = data.sample(n=int(len(data) * .01))
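A side note on reproducibility: `DataFrame.sample` draws different rows on every run unless it is seeded. The sketch below uses a toy frame in place of the 1.6M-row dataset, and the seed value 42 is an arbitrary assumption, not something the exercise specifies. `frac=0.01` asks for 1% of the rows directly.

```python
import pandas as pd

# Toy stand-in for the full dataset (1,000 rows instead of 1.6 million).
data = pd.DataFrame({"polarity": [0, 4] * 500, "tweet": ["text"] * 1000})

# frac=0.01 requests 1% of the rows; random_state fixes which rows are drawn.
sample_a = data.sample(frac=0.01, random_state=42)
sample_b = data.sample(frac=0.01, random_state=42)

# Both draws select exactly the same rows.
same_rows = sample_a.index.equals(sample_b.index)
```

With a fixed seed, the counts in the later steps stay the same from run to run.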

#### 2.) Drop the id, date, query, and user columns

In [203]:
sample = sample.drop(columns=["id", "date", "query", "user"])

#### 3.) Change all 4s in polarity to 1

- A lambda function might be useful

In [204]:
sample["polarity"] = sample["polarity"].apply(lambda x: 1 if x == 4 else x)
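As an alternative to the lambda, the same relabelling can probably be written with `Series.replace`, which avoids calling a Python function per row. A minimal sketch on a toy series (the values are made up for illustration):

```python
import pandas as pd

# Toy polarity column; 2 is included to show that other labels are untouched.
polarity = pd.Series([0, 4, 4, 0, 2])

# Lambda version, as in the cell above.
via_lambda = polarity.apply(lambda x: 1 if x == 4 else x)

# Vectorized version: replace every 4 with 1.
via_replace = polarity.replace(4, 1)
```

Both versions give the same result here; the lambda is what the exercise hint suggests.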

#### 4.) How many are there for each polarity?

- groupby might be useful

In [205]:
sample.groupby("polarity").count()

polarity,tweet
0,8036
1,7964
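If only the class balance is of interest, `value_counts` is a compact alternative to `groupby(...).count()`. The sketch below assumes a toy frame with the same two columns; `normalize=True` returns shares instead of raw counts.

```python
import pandas as pd

# Toy frame with the two columns that remain after step 2.
sample = pd.DataFrame({"polarity": [0, 1, 1, 0, 1], "tweet": list("abcde")})

# Rows per polarity label, ordered by label rather than by frequency.
counts = sample["polarity"].value_counts().sort_index()

# The same breakdown as fractions of the total.
shares = sample["polarity"].value_counts(normalize=True).sort_index()
```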


#### 5.) Perform the following operations on the tweet column:

- create a new column that contains the tweet's words in all caps (e.g., HELLO) (for tweets without all-caps words, the row should be NaN)
- create a new column that contains the tweet's hashtags (#) (for tweets without hashtags, the row should be NaN)
- create a new column that contains the tweet's mentions (@) (for tweets without mentions, the row should be NaN)
- create a new column that contains the tweet's urls (http) (for tweets without urls, the row should be NaN)
- create a new column that contains the tweet's numbers (e.g., 55) (for tweets without numbers, the row should be NaN)
- create a new column that contains the original tweet in all lowercase


In [206]:
import re, numpy as np

In [207]:
# Tweets with uppercase
pattern = r"\b[A-Z][A-Z]+\b"
regex = re.compile(pattern)

sample["all_caps"] = sample["tweet"].apply(lambda x: regex.findall(x) if regex.findall(x) else np.nan)

In [208]:
# Tweets with hashtags
pattern = r"\#\w+"
regex = re.compile(pattern)

sample["hashtags"] = sample["tweet"].apply(lambda x: regex.findall(x) if regex.findall(x) else np.nan)
#sample[sample.tweet.str.contains("#")]

In [209]:
# Tweets with mentions
pattern = r"\@\w+"
regex = re.compile(pattern)

sample["mentions"] = sample["tweet"].apply(lambda x: regex.findall(x) if regex.findall(x) else np.nan)
#sample[sample.tweet.str.contains("@")]

In [210]:
# Tweets with urls
pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
regex = re.compile(pattern)

sample["urls"] = sample["tweet"].apply(lambda x: regex.findall(x) if regex.findall(x) else np.nan)
#sample[sample.tweet.str.contains("http")][["tweet", "urls"]].to_csv("urls.csv")

In [211]:
# Tweets with standalone numbers
pattern = r"\b\d+\b"
regex = re.compile(pattern)

sample["numbers"] = sample["tweet"].apply(lambda x: regex.findall(x) if regex.findall(x) else np.nan)

#sample[["tweet", "numbers"]].to_csv("numbers.csv")
#sample[["tweet", "numbers"]]

In [212]:
# lowercase
sample["lowercase"] = sample["tweet"].apply(lambda x: x.lower())
#sample
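The extraction cells all follow the same shape, so they could be folded into one helper that runs `findall` once per tweet and falls back to NaN. This is only a sketch: the names `PATTERNS` and `extract` are hypothetical, the patterns are copied from the cells above (the URL pattern is left out for brevity), and `math.nan` stands in for `np.nan` so the example needs only the standard library.

```python
import math
import re

# Column name -> compiled pattern, copied from the cells above.
PATTERNS = {
    "all_caps": re.compile(r"\b[A-Z][A-Z]+\b"),
    "hashtags": re.compile(r"\#\w+"),
    "mentions": re.compile(r"\@\w+"),
    "numbers": re.compile(r"\b\d+\b"),
}

def extract(tweet, regex):
    """Return every match in the tweet, or NaN when there is none."""
    matches = regex.findall(tweet)
    return matches if matches else math.nan

# Made-up tweet that exercises all four patterns.
example = "@bob LOL check #fun 42 times"
features = {name: extract(example, rx) for name, rx in PATTERNS.items()}
```

On the dataframe, each column could then be filled with something like `sample[name] = sample["tweet"].apply(extract, args=(rx,))` inside a loop over `PATTERNS.items()`.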

#### 6.) Stem all of the words

- Some help: [Learn Python Stemming](https://data-flair.training/blogs/python-stemming/)
- Stemming is the act of reducing a word to its stem. A stem is like a root for a word: the stem of writing is write. But a stem doesn’t always have to be a real word; study, studies, and studying all stem to studi, which isn’t actually a word.
- Use the lowercase tweet column
- Create a new column called "stem"


In [213]:
# example:
import nltk
from nltk.stem import PorterStemmer
ps=PorterStemmer()
ps.stem('writing')

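def stem(tweet):
    """Stem each space-separated word and rebuild the tweet."""
    stems = []

    tweet_split = tweet.split(" ")

    for word in tweet_split:
        stems.append(ps.stem(word))

    # Join on a space so the stemmed words stay separated.
    return " ".join(stems)

# The task asks for a column named "stem".
sample["stem"] = sample["lowercase"].apply(stem)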



That is, you'll need to call `ps.stem()` on each word for each tweet. 

Hints:

- Convert the lowercase tweets column to a list of strings (e.g., use the string split() function)
- Use a lambda function that steps through each row, then a loop that steps through each word
- Use a join to convert the list of words back into a single string (e.g., ' '.join(list))
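The hints can be sketched as a small helper that takes the per-word stemmer as an argument. It joins on a space so the stemmed words stay separated. The names `stem_tweet` and `toy_stem` are hypothetical; `toy_stem` is a deliberately crude stand-in (it strips a single trailing "s") so the sketch runs without NLTK, and it is not how the Porter stemmer works.

```python
def stem_tweet(tweet, stem_word):
    """Stem every whitespace-separated word and rebuild the tweet."""
    # split() with no argument also collapses runs of whitespace.
    return " ".join(stem_word(word) for word in tweet.split())

def toy_stem(word):
    """Hypothetical stand-in stemmer: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") else word

stemmed = stem_tweet("cats chase  dogs", toy_stem)
```

For the exercise itself, `ps.stem` from the cell above would be passed in place of `toy_stem`, for example `sample["lowercase"].apply(lambda t: stem_tweet(t, ps.stem))`.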

#### 7.) Dump your dataframe to csv

In [214]:
sample.to_csv("clean_munge.csv")
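One thing to be aware of when dumping: by default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Passing `index=False` leaves it out. The sketch below uses a toy frame and in-memory buffers instead of a real file.

```python
import io
import pandas as pd

# Toy frame with a non-default index, like a frame produced by sample().
frame = pd.DataFrame({"polarity": [0, 1], "tweet": ["a", "b"]}, index=[7, 3])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
```

Whether to keep the index depends on whether the original row numbers are needed later.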