# Data cleaning

This notebook prepares the raw data for subsequent topic modeling.

In [2]:
import pickle
import re
import numpy as np
import pandas as pd

Importing the raw data scraped from the website.

In [3]:
# Open the file in a context manager so the handle is closed after loading
with open('data/speeches_dates.pkl', 'rb') as f:
    speeches_dates = pickle.load(f)

The raw data is a list of lists: [[speech, date], [speech, date], ...]

In [4]:
speeches_dates = np.array(speeches_dates)
speeches_dates.shape

(247, 2)

Transposing gives [[speech, speech, ...], [date, date, ...]]

In [5]:
speeches_dates.T.shape

(2, 247)

In [6]:
data = pd.DataFrame([])

Creating a data frame with two columns: speech and date.

In [7]:
data['speech'] = speeches_dates.T[0]
data['date'] = speeches_dates.T[1]
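
As a side note, the same two-column frame can probably be built in one step, without going through NumPy and a transpose. The sketch below is a minimal illustration on toy data; the `pairs` list and its contents are hypothetical stand-ins for the scraped `speeches_dates` list.

```python
import pandas as pd

# Toy stand-in for the scraped list of [speech, date] pairs (hypothetical values)
pairs = [
    ["First speech text", "January 08, 1790"],
    ["Second speech text", "December 08, 1790"],
]

# Each inner list becomes one row; column names are assigned directly
toy = pd.DataFrame(pairs, columns=["speech", "date"])
print(toy.shape)
```

This keeps the speeches as ordinary Python strings in an object column rather than as a fixed-width NumPy string array.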

In [8]:
data

Unnamed: 0,speech,date
0,The President. Thank you. Thank you. Thank you...,"April 28, 2021"
1,"Thank you very much. Mr. Speaker, Mr. Vice Pre...","February 28, 2017"
2,"The President. Mr. Speaker, Mr. Vice President...","January 30, 2018"
3,"The President. Madam Speaker, Mr. Vice Preside...","February 05, 2019"
4,The President. Thank you very much. Thank you....,"February 04, 2020"
...,...,...
242,To the Congress of the United States: With the...,"February 15, 1973"
243,"To the Congress of the United States: Today, i...","February 22, 1973"
244,"To the Congress of the United States: ""Informa...","March 01, 1973"
245,"To the Congress of the United States: Today, i...","March 08, 1973"


The date column has to be converted to datetime format, so the date strings are first parsed into year, month, and day.

In [9]:
date = pd.DataFrame([])
date['year'] = data.date.apply(lambda x: x[-4:])
date['month'] = data.date.apply(lambda x: re.findall(r'(.*) \d\d,', x)[0])
date['day'] = data.date.apply(lambda x: re.findall(r' (\d+), ', x)[0])
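
To make the parsing step concrete, here is a small sketch that applies the same slicing and the same two regular expressions to a single date string. The `sample` value copies the format shown in the table above. Note that the month pattern uses `\d\d`, so it assumes a zero-padded, two-digit day.

```python
import re

sample = "February 05, 2019"  # same format as the scraped date strings

year = sample[-4:]                               # last four characters
month = re.findall(r'(.*) \d\d,', sample)[0]     # text before " DD,"
day = re.findall(r' (\d+), ', sample)[0]         # digits between the space and ", "

print(year, month, day)
```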

In [10]:
map_dict = {'January': 1,
            'February': 2,
            'March': 3,
            'April': 4,
            'May': 5,
            'June': 6,
            'July': 7,
            'August': 8,
            'September': 9,
            'October': 10,
            'November': 11,
            'December': 12}

Replacing the month names with month numbers.

In [11]:
date.month = date.month.map(map_dict)

In [12]:
data['date'] = pd.to_datetime(date[['year', 'month', 'day']])
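
A possible shortcut, sketched here on toy values rather than on the real column: `pd.to_datetime` accepts a `format` string, so strings such as "April 28, 2021" can likely be parsed in a single call. The `raw` series is a hypothetical sample, and `%B` assumes English month names in the active locale.

```python
import pandas as pd

# Hypothetical sample in the same "Month DD, YYYY" format as the scraped dates
raw = pd.Series(["April 28, 2021", "February 15, 1973", "January 08, 1790"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(raw, format="%B %d, %Y")
print(parsed.dt.year.tolist())
```

The manual route above (regex plus `map_dict`) gives more control if some strings deviate from the expected format, so which approach fits better depends on how clean the scraped dates are.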

Finally, sorting the dataframe by date, from oldest to newest.

In [13]:
data.sort_values(by='date', ascending=True, inplace=True)
data.reset_index(drop=True, inplace=True)
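
For reference, sorting and index renumbering can also be written as a single non-mutating call. The sketch below uses a toy frame with hypothetical values; `ignore_index=True` renumbers the rows 0..n-1 in the returned copy.

```python
import pandas as pd

# Toy frame with dates deliberately out of order (hypothetical values)
toy = pd.DataFrame({
    "speech": ["c", "a", "b"],
    "date": pd.to_datetime(["2021-04-28", "1790-01-08", "1973-02-15"]),
})

# Returns a new, sorted frame with a fresh 0..n-1 index; `toy` is left unchanged
toy_sorted = toy.sort_values(by="date", ignore_index=True)
print(toy_sorted)
```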

In [14]:
data

Unnamed: 0,speech,date
0,Fellow-Citizens of the Senate and House of Rep...,1790-01-08
1,Fellow-Citizens of the Senate and House of Rep...,1790-12-08
2,Fellow-Citizens of the Senate and House of Rep...,1791-10-25
3,Fellow-Citizens of the Senate and House of Rep...,1792-11-06
4,Fellow-Citizens of the Senate and House of Rep...,1793-12-03
...,...,...
242,"Thank you very much. Mr. Speaker, Mr. Vice Pre...",2017-02-28
243,"The President. Mr. Speaker, Mr. Vice President...",2018-01-30
244,"The President. Madam Speaker, Mr. Vice Preside...",2019-02-05
245,The President. Thank you very much. Thank you....,2020-02-04


Summary:

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247 entries, 0 to 246
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   speech  247 non-null    object        
 1   date    247 non-null    datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 4.0+ KB


Pickling the data for later use.

In [16]:
# Write inside a context manager so the file is flushed and closed
with open('data/data.pkl', 'wb') as f:
    pickle.dump(data, f)
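
A quick way to check that a pickled frame survives the round trip is to load it back and compare. This is only a sketch on a toy frame written to a temporary directory; the file name `toy.pkl` and the frame contents are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame with the same column layout as the cleaned data (hypothetical values)
toy = pd.DataFrame({
    "speech": ["a", "b"],
    "date": pd.to_datetime(["1790-01-08", "2021-04-28"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# True if values and dtypes match after the round trip
print(restored.equals(toy))
```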