# Step 2: Clean the Data

**Project Description**: This project aims to analyze which words in news headlines generate the most engagement. Headlines are from the r/news subreddit. 

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('omw-1.4')  # Open Multilingual WordNet
nltk.download('wordnet', quiet=True)  # WordNet corpus, required by WordNetLemmatizer
wnl = nltk.WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\reyni\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\reyni\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Before we can proceed with our analysis, we first need to clean the data. Since we are only interested in individual words, we need to remove capitalization, special characters, numbers, and stop words. We also use lemmatization to combine different forms of the same word (for example, "runs" and "running" both become "run").

In [2]:
#read data from step 1
titles = pd.read_csv('titles.csv').drop(['Unnamed: 0'], axis=1) # read from csv and drop extra index column
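An `Unnamed: 0` column like the one dropped here typically appears when a DataFrame is saved with its default index and then read back. The sketch below illustrates that round trip on a made-up two-row frame; it assumes step 1 called `to_csv` without `index=False`, which this notebook does not confirm.

```python
import io
import pandas as pd

# Made-up data standing in for the headlines collected in step 1
df = pd.DataFrame({'title': ['First headline', 'Second headline']})

# Default to_csv writes the index as an unnamed first column
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)
print(list(with_index.columns))  # ['Unnamed: 0', 'title']

# Passing index=False leaves the index out of the file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)
print(list(without_index.columns))  # ['title']
```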

In [3]:
# Use cleaner function to strip everything except lowercase letters and spaces
def cleaner(document):
    document = document.lower()  # to lower case
    document = re.sub(r'[^\w\s]', '', document)  # remove punctuation and other special characters
    document = re.sub(r'[^a-z ]+', '', document)  # remove digits and anything else that is not a letter or a space
    return document

titles['title'] = titles['title'].apply(cleaner)
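To see what the cleaning step does to a single headline, here is a small standalone sketch. The sample headline is invented for illustration; the function repeats the same two substitutions as the notebook's `cleaner`.

```python
import re

def cleaner(document):
    document = document.lower()                      # to lower case
    document = re.sub(r'[^\w\s]', '', document)      # drop punctuation
    document = re.sub(r'[^a-zA-Z ]+', '', document)  # drop digits and other non-letters
    return document

# Invented sample headline
sample = "Fed raises rates by 0.75%, markets don't flinch"
cleaned = cleaner(sample)

print(repr(cleaned))    # 'fed raises rates by  markets dont flinch'
print(cleaned.split())  # ['fed', 'raises', 'rates', 'by', 'markets', 'dont', 'flinch']
```

Removing the number leaves a double space behind, which is harmless here because `split()` treats any run of whitespace as one separator. Note also that "don't" becomes "dont" once the apostrophe is stripped.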

In [4]:
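# lemmatize each word as a verb; the lemmatizer works on single words, so split the title first
wnl = nltk.WordNetLemmatizer()
titles['lemma_titles'] = titles['title'].apply(
    lambda title: ' '.join(wnl.lemmatize(word, pos='v') for word in title.split())
)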

In [5]:
# remove stop words
stop = set(stopwords.words('english'))  # a set makes the membership test fast
titles['title_stop'] = titles['lemma_titles'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
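One thing to watch for at this step: the titles have already had their apostrophes stripped, so a contraction such as "don't" arrives as "dont" and will not match a stop-word entry that is spelled with an apostrophe. The sketch below shows the effect and one possible workaround, which is to pass the stop words through the same normalization. `stop_sample` is a tiny hand-written list standing in for `stopwords.words('english')`, and `strip_punct` is a hypothetical helper that has the same net effect as the cleaning step; neither comes from the notebook.

```python
import re

def strip_punct(text):
    """Hypothetical helper: lowercase, then keep only letters and spaces."""
    return re.sub(r'[^a-z ]+', '', text.lower())

# Hand-written stand-in for stopwords.words('english') (assumption, not the real list)
stop_sample = ["by", "the", "don't", "isn't"]

title = "fed raises rates by markets dont flinch"

# Filtering against the raw list misses the apostrophe-free contraction
raw_stop = set(stop_sample)
naive = [w for w in title.split() if w not in raw_stop]

# Normalizing the stop words the same way as the titles catches it
normalized_stop = {strip_punct(w) for w in stop_sample}
fixed = [w for w in title.split() if w not in normalized_stop]

print(naive)  # ['fed', 'raises', 'rates', 'markets', 'dont', 'flinch']
print(fixed)  # ['fed', 'raises', 'rates', 'markets', 'flinch']
```

Whether this matters depends on how often contractions show up in the headlines; checking the most frequent words after cleaning is a quick way to find out.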

In [6]:
# Now we can drop the original title and lemma_titles columns and rename title_stop to titles
news = titles.drop(columns=['title', 'lemma_titles'])
news = news.rename(columns={'title_stop': 'titles'})
news.to_csv('news_cleaned.csv')  # to_csv returns None when given a path, so there is nothing to assign