# Cleaning the Data

Today we'll do two things: (1) learn what it takes to clean text data, and (2) practice using pre-made Python packages.
We'll clean the data in five steps:

1. Lowercase the words
2. Remove punctuation
3. Remove stopwords (extremely common words in English)
4. Tokenize the text (split it into individual words)
5. Lemmatize words (convert to dictionary/base form)
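
To make the order of the five steps concrete, here is a minimal sketch that runs them on a single sentence using only the standard library. The names `TOY_STOPWORDS`, `TOY_LEMMAS`, and `clean_text` are hypothetical stand-ins invented for this illustration; the sections below use gensim and nltk for the real work.

```python
import re

# Hypothetical stand-ins, chosen only for illustration: real stopword lists
# and lemmatizers are far larger than these two tiny tables.
TOY_STOPWORDS = {"the", "are", "on", "a", "is"}
TOY_LEMMAS = {"cats": "cat", "mats": "mat"}


def clean_text(text):
    """Apply the five cleaning steps to one string and return a token list."""
    text = text.lower()                                 # 1. lowercase
    text = re.sub(r"\W+", " ", text)                    # 2. remove punctuation
    kept = [w for w in text.split() if w not in TOY_STOPWORDS]
    text = " ".join(kept)                               # 3. remove stopwords
    tokens = text.split()                               # 4. tokenize on whitespace
    return [TOY_LEMMAS.get(t, t) for t in tokens]       # 5. lemmatize via lookup


print(clean_text("The cats are sitting on the mats!"))
# ['cat', 'sitting', 'mat']
```

Lowercasing comes first so that "The" and "the" are treated as the same word when the stopword filter runs.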

### Lowercasing

Let's get started by importing pandas and reading in the article bodies. Then, we'll lowercase everything. Look through Google to see what string method would be appropriate.

In [None]:
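import pandas as pd

# Read the article bodies; the text lives in the "articleBody" column.
bodies = pd.read_csv("train_bodies.csv")
# DataFrame.set_value was removed in pandas 1.0, so use the vectorized
# .str accessor, which lowercases every row in one call.
bodies["articleBody"] = bodies["articleBody"].str.lower()
print(bodies)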
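
If you want to try lowercasing without the CSV file, the sketch below does it on a small in-memory DataFrame. The `toy` frame and its contents are made up for illustration, and whether train_bodies.csv actually contains empty cells is an assumption; the example only shows how the `.str` accessor behaves if it does.

```python
import pandas as pd

# Hypothetical toy data standing in for train_bodies.csv.
toy = pd.DataFrame({"articleBody": ["Breaking NEWS", None, "Hello World"]})

# .str.lower() lowercases each string and passes missing values through,
# whereas calling .lower() on a missing value directly would raise an error.
toy["articleBody"] = toy["articleBody"].str.lower()
print(toy)
```

A row-by-row loop that calls `.lower()` on each cell would stop at the first missing value, which is one more reason to prefer the vectorized form.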

### Remove Punctuation

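Search for a way to remove punctuation from a string. A regular expression works well here: replace every run of non-word characters with a single space. The StackOverflow thread below is a good starting point.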

https://stackoverflow.com/questions/1276764/stripping-everything-but-alphanumeric-chars-from-a-string-in-python

In [None]:
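import re

# \W+ matches runs of characters that are not letters, digits, or underscores.
# Replacing each run with a space keeps neighboring words from merging.
bodies["articleBody"] = bodies["articleBody"].apply(
    lambda text: re.sub(r"\W+", " ", text)
)
print(bodies)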
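
It helps to see exactly what the `\W+` pattern does before trusting it on the whole dataset. The sketch below applies it to a made-up sentence (the `sample` string is purely illustrative).

```python
import re

sample = "Hello, world! It's snake_case."
cleaned = re.sub(r"\W+", " ", sample)

print(repr(cleaned))          # 'Hello world It s snake_case '
print(repr(cleaned.strip()))  # 'Hello world It s snake_case'
```

Three things are worth noticing: the underscore survives because `\W` treats it as a word character, the apostrophe splits "It's" into "It" and "s", and punctuation at the end of the text leaves a trailing space that `strip()` removes.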

### Remove Stopwords

Let's use gensim's remove_stopwords function, which filters out the words in gensim's built-in STOPWORDS set.

In [None]:
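from gensim.parsing.preprocessing import remove_stopwords

# remove_stopwords drops every word found in gensim's built-in STOPWORDS set.
# Series.apply replaces DataFrame.set_value, which was removed in pandas 1.0.
bodies["articleBody"] = bodies["articleBody"].apply(remove_stopwords)
print(bodies)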

### Tokenization

Let's use nltk's tokenizer for this task.

In [None]:
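import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data, downloaded once.
# Newer NLTK releases look for "punkt_tab", older ones for "punkt".
nltk.download("punkt")
nltk.download("punkt_tab")
# Each article body becomes a list of word tokens.
bodies["articleBody"] = bodies["articleBody"].apply(word_tokenize)
print(bodies)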

### Lemmatizing

Let's use nltk's lemmatizer for this.

In [None]:
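import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer looks words up in WordNet, which must be downloaded once.
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
# Each row is now a list of tokens, so lemmatize token by token.
# With no part-of-speech tag, lemmatize() treats every word as a noun.
bodies["articleBody"] = bodies["articleBody"].apply(
    lambda tokens: [lemmatizer.lemmatize(token) for token in tokens]
)
print(bodies)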

Yay, we're done.