# Data Cleaning

This notebook walks through common text-cleaning steps:
* **remove punctuation**: strip all punctuation characters from a string
* **stop words**: very common words (such as "the" or "is") that are filtered out before or after processing text
* **stemming**: reducing inflected (or sometimes derived) words to their word stem, base, or root form
* **lemmatization**: grouping together the inflected forms of a word so they can be analysed as a single item
* **tokenization**: breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens



In [1]:
import pandas as pd

In [4]:
tweets = pd.read_csv('../assets/twitter.csv')

In [66]:
t0 = tweets.tweet[0]

## Remove Punctuation

In [67]:
import string

In [68]:
t0.translate(str.maketrans('', '', string.punctuation))

' user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction   run'
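
The call above relies on the three-argument form of `str.maketrans`, where the third argument lists the characters to delete. A minimal sketch on a made-up string may help; the `sample` text and the `strip_punctuation` helper are illustrative names, not part of the notebook. It also shows that `string.punctuation` covers ASCII punctuation only, so typographic quotes are kept.

```python
import string

# Translation table: the third argument lists the characters to delete.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return text.translate(PUNCT_TABLE)

sample = "@user hello, world!   #run"  # hypothetical sample, not from the dataset
print(strip_punctuation(sample))       # whitespace is left untouched

# string.punctuation is ASCII-only, so curly quotes survive.
print(strip_punctuation("“quoted” text..."))
```

Building the table once and reusing it avoids recreating it for every tweet.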

## Remove Stop Words

In [61]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
#stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aymanelsayeed/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [70]:
t0

' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'

In [71]:
# Remove stop words from the tweet; build the set once for fast lookups
stop_words = set(stopwords.words('english'))
' '.join([word for word in t0.split() if word.lower() not in stop_words])

'@user father dysfunctional selfish drags kids dysfunction. #run'

### Exercise
Write a function that removes all stop words and punctuation from a given tweet, then run it on all tweets.
 

In [60]:
def remove_stopwords_punctuation(tweet):
    pass

Remove stop words and punctuation from all tweets, and save the result in a new column called 'cleaned'.
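
One possible sketch of the exercise, under stated assumptions: the extra `stop_words` parameter and the small `DEMO_STOP_WORDS` set are illustrative additions so the block runs without NLTK data. In the notebook you would pass `set(stopwords.words('english'))` instead.

```python
import string

# Small illustrative stop-word set (an assumption for this sketch); in the
# notebook, pass set(stopwords.words('english')) instead.
DEMO_STOP_WORDS = {'a', 'and', 'he', 'his', 'into', 'is', 'so', 'when'}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_stopwords_punctuation(tweet, stop_words=DEMO_STOP_WORDS):
    """Strip ASCII punctuation, then drop words found in stop_words."""
    no_punct = tweet.translate(PUNCT_TABLE)
    kept = [w for w in no_punct.split() if w.lower() not in stop_words]
    return ' '.join(kept)

tweet = (' @user when a father is dysfunctional and is so selfish '
         'he drags his kids into his dysfunction.   #run')
print(remove_stopwords_punctuation(tweet))
```

On the DataFrame, something like `tweets['cleaned'] = tweets.tweet.apply(remove_stopwords_punctuation)` would store the result. Note that removing punctuation first turns contractions such as "don't" into "dont", which a stop-word list may not contain; filtering stop words before stripping punctuation is a reasonable alternative.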

## Stemming

In [50]:
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [51]:
porter = nltk.PorterStemmer()

In [52]:
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']
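
Stemmers apply suffix-stripping rules without consulting a dictionary. The sketch below is a deliberately naive illustration of that idea, not the Porter algorithm; the `SUFFIXES` rule list and `naive_stem` helper are hypothetical.

```python
# Deliberately naive suffix stripper; NOT the Porter algorithm.
SUFFIXES = ('ings', 'ing', 'ed', 's')  # hypothetical rule list, longest first

def naive_stem(word):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = "List listed lists listing listings".lower().split(' ')
print([naive_stem(w) for w in words])

# Rules ignore meaning, so unrelated words can be mangled (over-stemming).
print(naive_stem('string'), naive_stem('news'))
```

Because the rules never check whether the result is a real word, stems can be non-words; this is the trade-off that lemmatization addresses.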

## Lemmatization

In [53]:
# WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once if it is missing
WNlemma = nltk.WordNetLemmatizer()

In [54]:
[WNlemma.lemmatize(t) for t in words1]

['list', 'listed', 'list', 'listing', 'listing']

`lemmatize` treats every word as a noun unless a part of speech is passed (`pos='n'` is the default), so the verb form 'listed' is returned unchanged, while the plural noun 'listings' is reduced to 'listing'.

## Tokenization
* **word_tokenize**: split a string into words and punctuation tokens
* **sent_tokenize**: split a string into sentences

In [55]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [56]:
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']
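
To see what a tokenizer adds over `split(' ')`, here is a rough regex approximation using only the standard library. It is a sketch, not NLTK's algorithm: the `TOKEN_PATTERN` and `simple_tokenize` names are illustrative, and unlike `word_tokenize` it keeps "shouldn't" as one token.

```python
import re

# A run of word characters with an optional apostrophe part, or a single
# character that is neither a word character nor whitespace (punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = "Children shouldn't drink a sugary drink before bed."
print(simple_tokenize(text))
```

The final period becomes its own token here, whereas plain `split(' ')` leaves it attached as 'bed.'.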

In [57]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

4

In [58]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
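
For contrast, a naive rule-based splitter shows why a trained sentence tokenizer is useful. This sketch (standard library only, with the illustrative name `naive_sentences`) splits at whitespace following '.', '!' or '?', and so breaks the sentence at the abbreviation "U.S.".

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Naive rule: split at whitespace that follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
print(len(naive_sentences))  # one more piece than sent_tokenize found above
```

The price "$2.99" survives because no whitespace follows its decimal point, but the abbreviation is split in two.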