In [1]:
import pandas as pd

In this notebook we take a text string and split it on periods to obtain a list of strings, each representing a sentence. We then build a dataframe of simple features from this list. More complex featurization (e.g. using word vectors) is explored in the notebook under featurization.

# One Dimensional Lists (featuring some light featurization)

## Input Text $\to$ Input List

In [2]:
input_text = """
The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security.--Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.
"""
input_text

"\nThe unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.\n\nWe hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its found

In [3]:
input_list = [x.strip() for x in input_text.replace("\n", "").replace("--","").split(".") if x != ""]
input_list

["The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation",
 'We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness',
 'That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its founda

## Build Dataframe of Simple Features

In [4]:
df = pd.DataFrame(input_list, columns=["raw_text"])

In [5]:
# Length features: characters and whitespace-separated words per sentence
df["number_of_characters"] = df["raw_text"].str.len()
df["number_of_words"] = df["raw_text"].apply(lambda x: len(x.split()))
# First word, lowercased and with any comma removed
df["first_word"] = df["raw_text"].apply(lambda x: x.split()[0].replace(",", "").lower())
df["first_word_is_the"] = df["first_word"] == "the"
# Drop the raw text so that only the derived features remain
df = df.drop("raw_text", axis=1)
df

Unnamed: 0,number_of_characters,number_of_words,first_word,first_word_is_the
0,473,81,the,True
1,208,35,we,False
2,441,75,that,False
3,309,47,prudence,False
4,261,45,but,False
5,151,25,such,False
6,186,31,the,True
7,55,11,to,False


In [6]:
df["letter_of_the_alphabet"] = pd.Series(list("abcdefgh"))
df

Unnamed: 0,number_of_characters,number_of_words,first_word,first_word_is_the,letter_of_the_alphabet
0,473,81,the,True,a
1,208,35,we,False,b
2,441,75,that,False,c
3,309,47,prudence,False,d
4,261,45,but,False,e
5,151,25,such,False,f
6,186,31,the,True,g
7,55,11,to,False,h


In [7]:
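# A Series with one value, shorter than the eight-row dataframe
df["other_letter"] = pd.Series(["a"])
# A Series with ten values, longer than the eight-row dataframe
df["yet_another_letter"] = pd.Series(list("abcdefghij"))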
df

Unnamed: 0,number_of_characters,number_of_words,first_word,first_word_is_the,letter_of_the_alphabet,other_letter,yet_another_letter
0,473,81,the,True,a,a,a
1,208,35,we,False,b,,b
2,441,75,that,False,c,,c
3,309,47,prudence,False,d,,d
4,261,45,but,False,e,,e
5,151,25,such,False,f,,f
6,186,31,the,True,g,,g
7,55,11,to,False,h,,h


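We see from this that if the Series we add as a new column is shorter than the existing columns, the missing rows are filled with NaN, while if it is longer, the extra values at the tail are ignored. This happens because pandas aligns a Series to the dataframe's index when assigning it as a column.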
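The alignment behaviour can be checked on a smaller example. The sketch below is not part of the notebook's pipeline: the three-row dataframe and the column names `short`, `long`, `shifted` and `bad` are made up for illustration. It also shows that a plain list, which has no index to align on, is rejected when its length does not match.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate index alignment
df = pd.DataFrame({"n": [10, 20, 30]})

# Shorter Series: rows 1 and 2 have no matching label and become NaN
df["short"] = pd.Series(["a"])

# Longer Series: labels 3 and 4 have no matching row and are dropped
df["long"] = pd.Series(list("abcde"))

# Alignment is by index label, not by position
df["shifted"] = pd.Series(["x", "y", "z"], index=[2, 1, 0])

# A plain list cannot be aligned, so a length mismatch raises ValueError
try:
    df["bad"] = ["a"]
    list_error = None
except ValueError as exc:
    list_error = exc

print(df)
print(type(list_error).__name__)
```

Wrapping values in a `pd.Series` is therefore a deliberate choice: it trades a length check for silent alignment, which is convenient here but can hide mistakes in real feature pipelines.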

# Multidimensional Lists

In [8]:
input_data = [(1, 1), (1, 2), (2, 3), (3, 5), (5, 8)]
pd.DataFrame(input_data, columns=["Fib n", "Fib n + 1"])

Unnamed: 0,Fib n,Fib n + 1
0,1,1
1,1,2
2,2,3
3,3,5
4,5,8


In [9]:
input_data = [[1, 1], [1, 2], [2, 3], [3, 5], [5, 8]]
pd.DataFrame(input_data, columns=["Fib n", "Fib n + 1"])

Unnamed: 0,Fib n,Fib n + 1
0,1,1
1,1,2
2,2,3
3,3,5
4,5,8


In [10]:
input_data = ([1, 1], [1, 2], [2, 3], [3, 5], [5, 8])
pd.DataFrame(input_data, columns=["Fib n", "Fib n + 1"])

Unnamed: 0,Fib n,Fib n + 1
0,1,1
1,1,2
2,2,3
3,3,5
4,5,8


In [11]:
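fib = [1, 1, 2, 3, 5, 8]
# Pair each term with its successor; list() materializes the zip iterator into a list of tuples
input_data = list(zip(fib, fib[1:]))
pd.DataFrame(input_data, columns=["Fib n", "Fib n + 1"])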

Unnamed: 0,Fib n,Fib n + 1
0,1,1
1,1,2
2,2,3
3,3,5
4,5,8
