# Cleaning the Data
1. Clean up the data by removing missing values and duplicates. - **Autumn**
1. Convert titles and excerpts to lower case. - **Autumn**
1. Remove punctuation. - **Autumn**
1. Tokenize words. - **Joel**
1. Remove stop words. - **Joel**
1. Lemmatize or stem the tokens. - **Joel**
1. Create a term-document matrix. - ?
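
The later steps in this list (tokenizing, removing stop words, building the term-document matrix) are not carried out in this notebook. As a rough, standard-library-only sketch of what they could look like on already-normalized text: the stop-word list, the helper names, and the two sample documents below are all made up for illustration, lemmatization/stemming is left out, and a real pipeline would more likely rely on a dedicated NLP library.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "to", "by", "for", "a", "of", "in"}

def tokenize(text):
    """Split already-normalized (lower-case, punctuation-free) text on whitespace."""
    return text.split()

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[i] in docs[j]."""
    counts = [Counter(remove_stop_words(tokenize(d))) for d in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

# Made-up sample documents.
docs = ["will test door service", "service board to test the door"]
vocabulary, matrix = term_document_matrix(docs)

for term, row in zip(vocabulary, matrix):
    print(term, row)
```

Each row corresponds to a term and each column to a document. With roughly 11 million documents, a dense list-of-lists like this would not scale, so a sparse representation would probably be needed in practice.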

In [1]:
import pandas as pd
import numpy as np
from fastparquet import ParquetFile

In [2]:
# load the data
pf = ParquetFile("../raw_data/nyt_data.parquet")

df = pf.to_pandas()

df.shape

(17370913, 3)

## Remove Duplicates and Missing Values

In [3]:
# drop duplicates
df.drop_duplicates(inplace=True)

df.shape

(11027535, 3)
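
Called without a `subset`, `drop_duplicates` treats two rows as duplicates only when every column matches, so the same headline appearing in two different years is kept. Here that removed 6,343,378 rows (17,370,913 down to 11,027,535). A small sketch on a made-up frame shows the behavior:

```python
import pandas as pd

# Made-up rows: 0 and 1 are identical; 2 differs only in year.
toy = pd.DataFrame({
    "year": [1920, 1920, 1921],
    "title": ["Same headline", "Same headline", "Same headline"],
    "excerpt": ["Same excerpt", "Same excerpt", "Same excerpt"],
})

# All columns are compared, so only the exact repeat (row 1) is dropped.
deduped = toy.drop_duplicates()
print(deduped.shape)

# Restricting the comparison to the text columns also drops row 2.
deduped_text_only = toy.drop_duplicates(subset=["title", "excerpt"])
print(deduped_text_only.shape)
```

Whether repeats across years should count as duplicates depends on the analysis. The notebook keeps them.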

In [4]:
df.isnull().sum()

year       0
title      0
excerpt    0
dtype: int64

In [5]:
# treat empty or whitespace-only strings as missing values (NaN)
df.replace(r'^\s*$', np.nan, inplace=True, regex=True)
df.isnull().sum()

year             0
title          153
excerpt    5761610
dtype: int64
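
The first `isnull()` check reported no missing values because blank fields were stored as empty strings rather than nulls. The pattern `^\s*$` matches strings that are empty or contain only whitespace. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up data: blanks are stored as '' or whitespace, not as NaN.
toy = pd.DataFrame({
    "title": ["Headline", "", "   "],
    "excerpt": ["text", "\t", "more text"],
})

missing_before = int(toy.isnull().sum().sum())

# Replace empty / whitespace-only strings with NaN so pandas sees them as missing.
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)
missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```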

In [6]:
# drop only the rows where both title and excerpt are missing
df.dropna(how='all', subset=['title', 'excerpt'], axis=0, inplace=True)
df.shape

(11027532, 3)
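
Only 3 rows were dropped here (11,027,535 down to 11,027,532) even though 5,761,610 excerpts are missing, because `how='all'` removes a row only when every column in `subset` is missing. The following sketch, on a made-up frame, contrasts it with `how='any'`:

```python
import numpy as np
import pandas as pd

# Made-up rows: complete, title only, and neither title nor excerpt.
toy = pd.DataFrame({
    "year": [1920, 1920, 1920],
    "title": ["Headline", "Title only", np.nan],
    "excerpt": ["text", np.nan, np.nan],
})

# how='all': drop a row only if title AND excerpt are both missing.
both_missing_dropped = toy.dropna(how="all", subset=["title", "excerpt"])

# how='any': drop a row if title OR excerpt is missing.
any_missing_dropped = toy.dropna(how="any", subset=["title", "excerpt"])

print(len(both_missing_dropped), len(any_missing_dropped))
```

Using `how='any'` on the real data would have discarded every article without an excerpt, which is more than half of the rows.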

In [7]:
df.head()

| | year | title | excerpt |
|---|---|---|---|
| 0 | 1920 | At last the Federal Reserve Board has issued r... | |
| 1 | 1920 | WILL TEST DOOR SERVICE. | Service Board to Further Examine I.R.T. Safety... |
| 2 | 1920 | Sanction for Chinese Contracts. | |
| 3 | 1920 | LEADS FRAZIER BY 4,496. | Langer's Margin Falls in North Dakota--Gronna ... |
| 4 | 1920 | CHICAGO, April 30.--With 300 suspicious charac... | Federal Agents and Police Round-- up Suspiciou... |


## Normalize Data

In [8]:
# Normalize each column (lower case, punctuation replaced by spaces),
# then join title and excerpt with a '\r\n' separator.
reg = r'[^\w\s]'

def normalize(col):
    return col.fillna('').str.lower().str.replace(reg, ' ', regex=True)

df['title_excerpt'] = normalize(df['title']) + '\r\n' + normalize(df['excerpt'])

df.head()

| | year | title | excerpt | title_excerpt |
|---|---|---|---|---|
| 0 | 1920 | At last the Federal Reserve Board has issued r... | | at last the federal reserve board has issued r... |
| 1 | 1920 | WILL TEST DOOR SERVICE. | Service Board to Further Examine I.R.T. Safety... | will test door service \r\nservice board to fu... |
| 2 | 1920 | Sanction for Chinese Contracts. | | sanction for chinese contracts \r\n |
| 3 | 1920 | LEADS FRAZIER BY 4,496. | Langer's Margin Falls in North Dakota--Gronna ... | leads frazier by 4 496 \r\nlanger s margin fal... |
| 4 | 1920 | CHICAGO, April 30.--With 300 suspicious charac... | Federal Agents and Police Round-- up Suspiciou... | chicago april 30 with 300 suspicious charac... |
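
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace, and each match is replaced by a single space. To make the effect concrete, here is a small sketch using the standard `re` module on single strings. The helper name `strip_punctuation_lower` is hypothetical, and the notebook itself applies the same pattern through the pandas `.str` accessor.

```python
import re

PUNCTUATION = r'[^\w\s]'

def strip_punctuation_lower(text):
    """Lower-case the text, then replace each punctuation character with a space."""
    return re.sub(PUNCTUATION, ' ', text.lower())

title = strip_punctuation_lower("LEADS FRAZIER BY 4,496.")
possessive = strip_punctuation_lower("Langer's Margin")
underscore = strip_punctuation_lower("a_b--c")

print(repr(title))
print(repr(possessive))
print(repr(underscore))
```

Two side effects are worth noting. Numbers and possessives are split (`4,496` becomes `4 496`, `langer's` becomes `langer s`), and runs of punctuation leave runs of spaces. The underscore survives because `\w` includes it. Splitting on whitespace during tokenization absorbs the extra spaces.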


In [9]:
df.drop(columns=['year','title','excerpt'], inplace=True)
df.head()

| | title_excerpt |
|---|---|
| 0 | at last the federal reserve board has issued r... |
| 1 | will test door service \r\nservice board to fu... |
| 2 | sanction for chinese contracts \r\n |
| 3 | leads frazier by 4 496 \r\nlanger s margin fal... |
| 4 | chicago april 30 with 300 suspicious charac... |


In [10]:
# save result
df.to_parquet('clean_data.parquet', compression='gzip')