
- Load data from a CSV file.
- Use various data cleaning and preparation techniques to format the loaded data.
- Use various data wrangling techniques to reshape the loaded data.

In [19]:
import pandas as pd

**Columns**

```
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - the id of the tweet (2087)
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3 - the query (lyx). If there is no query, then this value is NO_QUERY.
4 - the user that tweeted (robotickilldozr)
5 - the text of the tweet (Lyx is cool)
```
- Name the columns: polarity, id, date, query, user, and tweet.
- Read the data file "sentiment.csv" and load it into a DataFrame named "data".
- Print out the length of data using the following format: "length of data: xxxx".
- Display the first five rows of the DataFrame.


In [20]:
cols = ['polarity','id','date','query','user','tweet']

# The file has no header row, so the column names are supplied through names=
data = pd.read_csv('sentiment.csv', names=cols, encoding='ISO-8859-1')

# Report the number of rows in the requested format
print('length of data:', len(data))

data.head(5)

length of data: 1600000

Unnamed: 0,polarity,id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
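As a minimal sketch of why `names=` matters for a file without a header row, the snippet below reads two made-up rows from an in-memory buffer instead of "sentiment.csv". The rows, the user names, and the variable `toy` are illustrative assumptions, not part of the dataset.

```python
import io

import pandas as pd

# Two made-up rows in the same six-field layout as the tweet file (no header row).
raw = io.StringIO(
    '0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,"first, made-up tweet"\n'
    '4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,second made-up tweet\n'
)

cols = ['polarity', 'id', 'date', 'query', 'user', 'tweet']

# names= supplies the column labels, so the first line is read as data, not as a header.
toy = pd.read_csv(raw, names=cols)

print('length of data:', len(toy))
print(toy.columns.tolist())
```

Without `names=`, pandas would treat the first tweet as the header row and the frame would be one row short.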


- Randomly sample 1% of the data
- Print out the length of data using the following format "length of data: xxxx"
- Display the first five rows of the DataFrame.

In [30]:
# Draw a random 1% sample of the rows; the sample differs on every run
random = data.sample(frac=0.01)

print('length of data:', random.shape[0])

random.head(5)

length of data: 16000


Unnamed: 0,polarity,id,date,query,user,tweet
1534097,4,2178647339,Mon Jun 15 08:00:26 PDT 2009,NO_QUERY,KStewart_FAN,@dakotaafanning I love your pic!!!
1416028,4,2057334552,Sat Jun 06 12:48:37 PDT 2009,NO_QUERY,tasha_peach,is really hppy at d mo but dunno y lol
943555,4,1795132798,Thu May 14 07:13:10 PDT 2009,NO_QUERY,_Lauren_Mallory,@Angela_Webber_ You know I'm always up for swe...
1588287,4,2191061205,Tue Jun 16 04:25:39 PDT 2009,NO_QUERY,deestillballin,Gettin ready for work. Love working OT gonna l...
631919,0,2232493172,Thu Jun 18 19:59:23 PDT 2009,NO_QUERY,CHasmonio,Got an itchy throat it really annoys me cos I ...
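Because the sample above changes on every run, the rows shown will not match a rerun of the notebook. If repeatable results are wanted, one option is to pass `random_state` to `sample`. The sketch below uses a made-up 200-row frame, and the seed value 42 is an arbitrary assumption.

```python
import pandas as pd

# Made-up stand-in for the tweet data: 200 rows with alternating polarity.
toy = pd.DataFrame({'polarity': [0, 4] * 100, 'tweet': ['made-up text'] * 200})

# frac=0.01 keeps 1% of the rows; random_state fixes the draw so it can be repeated.
sample_a = toy.sample(frac=0.01, random_state=42)
sample_b = toy.sample(frac=0.01, random_state=42)

print('length of data:', len(sample_a))
print(sample_a.index.equals(sample_b.index))
```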


- Drop id, date, query, and user columns
- Display the first five rows of the DataFrame.

In [33]:
new_data = random.drop(['id','date','query','user'], axis = 1)
new_data.head(5)

Unnamed: 0,polarity,tweet
1534097,4,@dakotaafanning I love your pic!!!
1416028,4,is really hppy at d mo but dunno y lol
943555,4,@Angela_Webber_ You know I'm always up for swe...
1588287,4,Gettin ready for work. Love working OT gonna l...
631919,0,Got an itchy throat it really annoys me cos I ...


- Change all 4s in polarity to 1
- Display the first five lines of the DataFrame.

In [34]:
# Recode positive tweets from 4 to 1; all other polarity values are left unchanged
new_data['polarity'] = new_data['polarity'].replace(4, 1)
new_data

Unnamed: 0,polarity,tweet
1534097,1,@dakotaafanning I love your pic!!!
1416028,1,is really hppy at d mo but dunno y lol
943555,1,@Angela_Webber_ You know I'm always up for swe...
1588287,1,Gettin ready for work. Love working OT gonna l...
631919,0,Got an itchy throat it really annoys me cos I ...
...,...,...
903908,1,@urbalcloud hahaha! i have a secret door to th...
943671,1,@planethealer ok good I was thinking of you t...
1380753,1,"Not coming or going, not something or nothing,..."
1313433,1,"good morning twit, 5 days left of my vacation,..."


- How many tweets are there for each polarity?
- Hint: groupby might be useful.

In [35]:
grouped = new_data.groupby('polarity')
grouped.size()


polarity
0    7999
1    8001
dtype: int64
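The same counts can also be obtained without an explicit groupby. The sketch below compares `groupby(...).size()` with `value_counts()` on a made-up five-row frame. The frame `toy` is an assumption for illustration only.

```python
import pandas as pd

# Made-up polarity labels: two negative (0) and three positive (1).
toy = pd.DataFrame({'polarity': [0, 1, 1, 0, 1]})

# Count rows per polarity with groupby, as in the cell above.
by_group = toy.groupby('polarity').size()

# value_counts gives the same counts; sort_index orders them by label.
by_count = toy['polarity'].value_counts().sort_index()

print(by_group.to_dict())
print(by_count.to_dict())
```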

Use the pandas apply method and lambda functions to:

- Create a new column called split that contains a list of the words in each tweet.
- Create a new column called caps that contains a list of the words in each tweet that are all-in-caps.
- Create a new column called hashtags that contains a list of the words in the tweet with a hashtag, the # symbol (for tweets without hashtags, the row should be NaN).
- Create a new column called mentions that contains a list of the words in the tweet that have mentions (@) (for tweets without mentions, the row should be NaN).
- Create a new column called urls that contains a list of the Uniform Resource Locators (URLs) in the tweet (for tweets without URLs, the row should be NaN).
- Create a new column called numbers that contains a list of the numbers, like 55, in the tweet (for tweets without numbers, the row should be NaN).
- Create a new column called lowercase that contains a list of the words in the original tweet in all lowercase.

In [53]:
import numpy as np

# This split column breaks the tweets into units for further processing
new_data['split'] = new_data['tweet'].apply(lambda element: element.split())


new_data['caps'] = new_data['split'].apply(lambda element: [word for word in element if word.isupper()])
new_data['hashtags'] = new_data['split'].apply(lambda element: [word for word in element if word.startswith('#')])
new_data['mentions'] = new_data['split'].apply(lambda element: [word for word in element if word.startswith('@')])
new_data['urls'] = new_data['split'].apply(lambda element: [word for word in element if word.startswith('http')])
new_data['numbers'] = new_data['split'].apply(lambda element: [word for word in element if word.isdigit()])  
new_data['lowercase'] = new_data['split'].apply(lambda element: [word.lower() for word in element])
new_data

Unnamed: 0,polarity,tweet,split,caps,hashtags,mentions,urls,numbers,lowercase
1534097,1,@dakotaafanning I love your pic!!!,"[@dakotaafanning, I, love, your, pic!!!]",[I],[],[@dakotaafanning],[],[],"[@dakotaafanning, i, love, your, pic!!!]"
1416028,1,is really hppy at d mo but dunno y lol,"[is, really, hppy, at, d, mo, but, dunno, y, lol]",[],[],[],[],[],"[is, really, hppy, at, d, mo, but, dunno, y, lol]"
943555,1,@Angela_Webber_ You know I'm always up for swe...,"[@Angela_Webber_, You, know, I'm, always, up, ...",[],[],[@Angela_Webber_],[],[],"[@angela_webber_, you, know, i'm, always, up, ..."
1588287,1,Gettin ready for work. Love working OT gonna l...,"[Gettin, ready, for, work., Love, working, OT,...",[OT],[],[],[],[],"[gettin, ready, for, work., love, working, ot,..."
631919,0,Got an itchy throat it really annoys me cos I ...,"[Got, an, itchy, throat, it, really, annoys, m...",[I],[],[],[http://mypict.me/4oJx],[],"[got, an, itchy, throat, it, really, annoys, m..."
...,...,...,...,...,...,...,...,...,...
903908,1,@urbalcloud hahaha! i have a secret door to th...,"[@urbalcloud, hahaha!, i, have, a, secret, doo...",[],[],[@urbalcloud],[],[],"[@urbalcloud, hahaha!, i, have, a, secret, doo..."
943671,1,@planethealer ok good I was thinking of you t...,"[@planethealer, ok, good, I, was, thinking, of...","[I, I]",[],[@planethealer],[],[],"[@planethealer, ok, good, i, was, thinking, of..."
1380753,1,"Not coming or going, not something or nothing,...","[Not, coming, or, going,, not, something, or, ...",[],[],[],[],[],"[not, coming, or, going,, not, something, or, ..."
1313433,1,"good morning twit, 5 days left of my vacation,...","[good, morning, twit,, 5, days, left, of, my, ...",[],[],[],[],[5],"[good, morning, twit,, 5, days, left, of, my, ..."
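The exercise asks for NaN in rows without a match, whereas the output above shows empty lists. A minimal sketch of one way to get NaN is given below, on a made-up two-row frame. The helper name `list_or_nan` and the frame `toy` are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up tweets: one with a hashtag, one without.
toy = pd.DataFrame({'tweet': ['made-up #tag tweet', 'no tags here']})
toy['split'] = toy['tweet'].apply(lambda text: text.split())


def list_or_nan(words):
    """Return the list unchanged, or NaN when it is empty (hypothetical helper)."""
    return words if words else np.nan


# Collect the words that start with '#', then swap empty results for NaN.
toy['hashtags'] = toy['split'].apply(
    lambda words: list_or_nan([w for w in words if w.startswith('#')])
)

print(toy['hashtags'].tolist())
```

The same helper could be wrapped around the mentions, urls, and numbers comprehensions.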


In linguistics, a stem is the part of a word responsible for its lexical meaning. Stemming is the act of reducing a word to its stem. For some words, the stem is itself a stand-alone word: for instance, the stem of the word writing is write. But a stem isn't always a stand-alone word. For example, the stem of the words study, studies, and studying is studi.
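The examples in this paragraph can be checked with a short sketch, assuming the nltk package is installed and using its PorterStemmer. The word list is taken from the paragraph, and the variable names are illustrative.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words from the paragraph above.
words = ['study', 'studies', 'studying', 'writing']

# Map each word to the stem produced by the Porter stemmer.
stems = {word: ps.stem(word) for word in words}

print(stems)
```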

(a) Create a new column called "stem".
- Using the lowercase list of words of each tweet, apply the stem method to each word and create a list of all of the stem words.
- Use a lambda function that steps through each row, then a loop that steps through each word in that row's list.


(b) Create a new column called "join" that combines the list of stems into a single string.

- Use a join to convert the list of words back into a single string.

In [59]:
# example:
import nltk
from nltk.stem import PorterStemmer
ps=PorterStemmer()
ps.stem('writing')

# Stem each lowercase word with a list comprehension, then join the stems into one string
new_data['stem'] = new_data['lowercase'].apply(lambda element: [ps.stem(word) for word in element])
new_data['join'] = new_data['stem'].apply(lambda element: ' '.join(element))
new_data

Unnamed: 0,polarity,tweet,split,caps,hashtags,mentions,urls,numbers,lowercase,stem,join
1534097,1,@dakotaafanning I love your pic!!!,"[@dakotaafanning, I, love, your, pic!!!]",[I],[],[@dakotaafanning],[],[],"[@dakotaafanning, i, love, your, pic!!!]","[@dakotaafan, i, love, your, pic!!!]",@dakotaafan i love your pic!!!
1416028,1,is really hppy at d mo but dunno y lol,"[is, really, hppy, at, d, mo, but, dunno, y, lol]",[],[],[],[],[],"[is, really, hppy, at, d, mo, but, dunno, y, lol]","[is, realli, hppi, at, d, mo, but, dunno, y, lol]",is realli hppi at d mo but dunno y lol
943555,1,@Angela_Webber_ You know I'm always up for swe...,"[@Angela_Webber_, You, know, I'm, always, up, ...",[],[],[@Angela_Webber_],[],[],"[@angela_webber_, you, know, i'm, always, up, ...","[@angela_webber_, you, know, i'm, alway, up, f...",@angela_webber_ you know i'm alway up for swe...
1588287,1,Gettin ready for work. Love working OT gonna l...,"[Gettin, ready, for, work., Love, working, OT,...",[OT],[],[],[],[],"[gettin, ready, for, work., love, working, ot,...","[gettin, readi, for, work., love, work, ot, go...",gettin readi for work. love work ot gonna lov...
631919,0,Got an itchy throat it really annoys me cos I ...,"[Got, an, itchy, throat, it, really, annoys, m...",[I],[],[],[http://mypict.me/4oJx],[],"[got, an, itchy, throat, it, really, annoys, m...","[got, an, itchi, throat, it, realli, annoy, me...",got an itchi throat it realli annoy me co i c...
...,...,...,...,...,...,...,...,...,...,...,...
903908,1,@urbalcloud hahaha! i have a secret door to th...,"[@urbalcloud, hahaha!, i, have, a, secret, doo...",[],[],[@urbalcloud],[],[],"[@urbalcloud, hahaha!, i, have, a, secret, doo...","[@urbalcloud, hahaha!, i, have, a, secret, doo...",@urbalcloud hahaha! i have a secret door to t...
943671,1,@planethealer ok good I was thinking of you t...,"[@planethealer, ok, good, I, was, thinking, of...","[I, I]",[],[@planethealer],[],[],"[@planethealer, ok, good, i, was, thinking, of...","[@planetheal, ok, good, i, wa, think, of, you,...",@planetheal ok good i wa think of you two whe...
1380753,1,"Not coming or going, not something or nothing,...","[Not, coming, or, going,, not, something, or, ...",[],[],[],[],[],"[not, coming, or, going,, not, something, or, ...","[not, come, or, going,, not, someth, or, nothi...","not come or going, not someth or nothing, not..."
1313433,1,"good morning twit, 5 days left of my vacation,...","[good, morning, twit,, 5, days, left, of, my, ...",[],[],[],[],[5],"[good, morning, twit,, 5, days, left, of, my, ...","[good, morn, twit,, 5, day, left, of, my, vaca...","good morn twit, 5 day left of my vacation, ti..."


In [None]:
new_data.to_csv('new_data.csv')