# Pandas basics

Before working with pandas (as with numpy or any other Python package), we first need to import it.


In [3]:
import pandas as pd

### Pandas Series
A pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call s = pd.Series(data, index=index).

In [None]:
# A Series can be created from an ndarray:
import numpy as np
numpy_random = np.random.randn(5)
print(numpy_random)

s_random = pd.Series(numpy_random)
print(s_random)

[ 0.50132992  2.6809358   0.68758543  0.38589432 -0.27117245]
0    0.501330
1    2.680936
2    0.687585
3    0.385894
4   -0.271172
dtype: float64


In [None]:
# from dicts:
dict_age = {"John": 18, "Anne": 22, "Connie": 31}
series_age = pd.Series(dict_age)
print(series_age)

John      18
Anne      22
Connie    31
dtype: int64


In [None]:
# from lists:
s_char_idx = pd.Series([74, 123, 42, 123, 51], index=["a", "b", "c", "d", "e"])
s_num_idx = pd.Series([74, 123, 42, 123, 51], index=[1, 2, 3, 4, 5])

print(s_char_idx)
print(s_num_idx)

a     74
b    123
c     42
d    123
e     51
dtype: int64
1     74
2    123
3     42
4    123
5     51
dtype: int64


Series act similarly to dicts and arrays


In [None]:
# array-like
print(s_random[0]) 

print(s_random[:3]) # first 3 elements

0.501329917599262
0    0.501330
1    2.680936
2    0.687585
dtype: float64


In [None]:
# dict-like
print(series_age['Connie'])
print('John is', series_age['John'], 'years old')


print('Anne appears in series_age:', 'Anne' in series_age)
print('Martin appears in series_age:', 'Martin' in series_age)

31
John is 18 years old
Anne appears in series_age: True
Martin appears in series_age: False
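
Label-based and position-based access can also be written explicitly with .loc and .iloc. The sketch below rebuilds the small age Series from above and uses Series.get for a lookup that may miss. Being explicit is helpful when the index is not a default integer range, since recent pandas releases warn when an integer key is silently treated as a position on a labeled index.

```python
import pandas as pd

series_age = pd.Series({"John": 18, "Anne": 22, "Connie": 31})

first_by_position = series_age.iloc[0]    # position-based: first element
anne_by_label = series_age.loc["Anne"]    # label-based: element labeled "Anne"
first_two = series_age.iloc[:2]           # positional slice, end excluded
martin_age = series_age.get("Martin")     # missing label: returns None instead of raising

print(first_by_position, anne_by_label, martin_age)
print(first_two)
```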


### Pandas DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict **of Series objects**. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different types of input:


*   Dict of 1D ndarrays, lists, dicts, or Series
*   2-D numpy.ndarray
*   Structured or record ndarray
*   A Series
*   Another DataFrame
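
As a minimal sketch of the "dict of Series" view, the example below builds a DataFrame from two Series whose indexes only partly overlap. The names and the city values are made up for illustration. pandas aligns the Series on their labels and fills the gaps with NaN.

```python
import pandas as pd

# Two Series with partly overlapping labels (illustrative data)
age = pd.Series({"John": 18, "Anne": 22, "Connie": 31})
city = pd.Series({"John": "Oslo", "Anne": "Lyon"})  # no entry for Connie

# Each dict key becomes a column; rows are the union of the Series labels
people = pd.DataFrame({"age": age, "city": city})
print(people)

# The missing city for Connie shows up as NaN
print(people["city"].isna())
```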

In [None]:
# Creating a dataframe from a list

lst = ['This', 'is', 'a', 'list', 'containing', 'many', 'words', 'soon', 'to', 'be', 'transformed', 'to', 'a', 'dataframe']
first_dataframe = pd.DataFrame(lst)

first_dataframe.head(n=6) # Number of entries to display


Unnamed: 0,0
0,This
1,is
2,a
3,list
4,containing
5,many


In [None]:
# Creating a df from a list of lists

lst_of_lists = [['This', 'is', 'a'], ['list', 'containing', 'many'], ['words', 'soon', 'to'], ['be', 'transformed', 'to', 'a', 'dataframe']]
second_dataframe = pd.DataFrame(lst_of_lists)

second_dataframe.head()

Unnamed: 0,0,1,2,3,4
0,This,is,a,,
1,list,containing,many,,
2,words,soon,to,,
3,be,transformed,to,a,dataframe


In [None]:
# Creating a df from a dict of lists - all arrays have to be of same length

dict_of_lists = {'Name 1': ['This', 'is', 'a'], 'Name 2': ['list', 'containing', 'many'], 'Name 3': ['words', 'soon', 'to'], 'Name 4': ['be', 'transformed', 'to a dataframe']}

third_dataframe = pd.DataFrame(dict_of_lists)
third_dataframe.head()


Unnamed: 0,Name 1,Name 2,Name 3,Name 4
0,This,list,words,be
1,is,containing,soon,transformed
2,a,many,to,to a dataframe


In [None]:
# Getting 1 column of interest
print(third_dataframe['Name 1'])

# Getting 2 columns of interest
print(third_dataframe[['Name 2', 'Name 3']])


0    This
1      is
2       a
Name: Name 1, dtype: object
       Name 2 Name 3
0        list  words
1  containing   soon
2        many     to
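
Note the difference in the two results above: single brackets return a Series, while a list of column names returns a DataFrame, even when the list holds only one name. A small sketch on a cut-down version of the frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name 1": ["This", "is", "a"],
    "Name 2": ["list", "containing", "many"],
})

one_column = df["Name 1"]          # single brackets -> Series (1-D)
one_column_frame = df[["Name 1"]]  # list of names -> DataFrame (2-D)

print(type(one_column).__name__, one_column.shape)
print(type(one_column_frame).__name__, one_column_frame.shape)
```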


# Heart disease patients data

### Importing and viewing data

* age : age of the patient

* sex : sex of the patient

* exang : exercise-induced angina (1 = yes; 0 = no)

* ca : number of major vessels (0-3)

* cp : chest pain type
  * Value 0: typical angina
  * Value 1: atypical angina
  * Value 2: non-anginal pain
  * Value 3: asymptomatic

* trestbps : resting blood pressure (in mm Hg)

* chol : cholesterol in mg/dl fetched via BMI sensor

* fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

* restecg : resting electrocardiographic results
  * Value 0: normal
  * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* thalach : maximum heart rate achieved

* target : 0 = less chance of heart attack; 1 = more chance of heart attack

In [4]:
data = pd.read_csv("heart.csv") # Dataset containing information about patients with heart diseases
data.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [None]:
# You can check the df’s shape (similar to numpy arrays)
print("Entire data shape:", data.shape)


Entire data shape: (303, 14)


In [None]:
# Retrieving age column
age = data['age']
print(age)


0      63
1      37
2      41
3      56
4      57
       ..
298    57
299    45
300    68
301    57
302    57
Name: age, Length: 303, dtype: int64


In [None]:
# Accessing individual values from age column
print(age[0]) # Age of entry 0
print(age[150]) # Age of entry 150
print(age[300]) # Age of entry 300

63
66
68


In [None]:
# Retrieving entire rows
row_3 = data.loc[3] # Retrieve row 3 (4th row)
print(row_3, '\n')
print("Age of row 3 entry is:", row_3['age'])


age          56.0
sex           1.0
cp            1.0
trestbps    120.0
chol        236.0
fbs           0.0
restecg       1.0
thalach     178.0
exang         0.0
oldpeak       0.8
slope         2.0
ca            0.0
thal          2.0
target        1.0
Name: 3, dtype: float64 

Age of row 3 entry is: 56.0


In [None]:
# Sorting entries by properties (rows)
sorted_data = data.sort_values(['age', 'chol']) # The sort is not done in place, meaning that the data variable will remain the same
sorted_data.head()

# data.sort_values(['age', 'chol'], inplace=True) # Uncomment this line if you want sorting to be done in place
# data.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
72,29,1,1,130,204,0,0,202,0,0.0,2,0,2,1
58,34,1,3,118,182,0,0,174,0,0.0,2,0,2,1
125,34,0,1,118,210,0,1,192,0,0.7,2,0,2,1
65,35,0,0,138,183,0,1,182,0,1.4,2,0,2,1
157,35,1,1,122,192,0,1,174,0,0.0,2,0,2,1
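
As the output above shows, sorting keeps the original index labels (72, 58, 125, ...). The sketch below, on a small made-up frame, shows what that means for row access: .loc looks up by label, .iloc by position, and reset_index(drop=True) renumbers the rows if you want the two to agree again.

```python
import pandas as pd

# Small illustrative frame (values are made up)
patients = pd.DataFrame({"age": [63, 37, 41, 37], "chol": [233, 250, 204, 180]})

by_age = patients.sort_values(["age", "chol"])
print(by_age)

youngest = by_age.iloc[0]   # first row in sorted order (position 0)
original_first = by_age.loc[0]  # row whose label is 0, wherever it ended up

# Renumber rows 0..n-1 in the sorted order, discarding the old labels
renumbered = by_age.reset_index(drop=True)
```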


In [None]:
# Indexing based on a property
data[data['cp'] == 3].head() # Get all the patients with chest pain value 3 (asymptomatic) 


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
13,64,1,3,110,211,0,0,144,1,1.8,1,0,2,1
14,58,0,3,150,283,1,0,162,0,1.0,2,0,2,1
17,66,0,3,150,226,0,1,114,0,2.6,0,0,2,1
19,69,0,3,140,239,0,1,151,0,1.8,2,2,2,1
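
Several conditions can be combined in one filter. A minimal sketch on a made-up frame: use & for "and" and | for "or", and wrap each comparison in parentheses, because these operators bind more tightly than == or >.

```python
import pandas as pd

# Small illustrative frame (values are made up)
patients = pd.DataFrame({
    "age": [63, 37, 41, 66, 69],
    "sex": [1, 1, 0, 0, 0],
    "cp":  [3, 2, 1, 3, 3],
})

# Patients with chest pain value 3 AND older than 65
older_cp3 = patients[(patients["cp"] == 3) & (patients["age"] > 65)]

# Patients with sex = 1 OR chest pain value 1
sex1_or_cp1 = patients[(patients["sex"] == 1) | (patients["cp"] == 1)]

print(older_cp3)
print(sex1_or_cp1)
```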


### Exercise - Data accessing in Heart disease patients (E1)



1. Find the average age of patients


In [7]:
patient_ages = data['age']
avg_patient_age = patient_ages.mean()

print("Average age of patients:", avg_patient_age)

Average age of patients: 54.366336633663366


2. Find the number of patients with sex = 0. Then find the number of patients with sex = 1.


In [8]:
sex_0_patients = data[data['sex'] == 0]
sex_1_patients = data[data['sex'] == 1]

print(f'Sex 0 patients: {len(sex_0_patients)}')
print(f'Sex 1 patients: {len(sex_1_patients)}')

Sex 0 patients: 96
Sex 1 patients: 207
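
An alternative sketch for this kind of question is value_counts, which counts every distinct value of a column in one call. The frame below is made up; on the real data you would call it on data['sex'].

```python
import pandas as pd

patients = pd.DataFrame({"sex": [1, 1, 0, 1, 0]})  # illustrative values

counts = patients["sex"].value_counts()  # index: distinct values, values: their counts
print(counts)
```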



3. Find the average cholesterol of people older than 60

In [9]:
patient_chol = data[data['age'] > 60]['chol']
avg_patient_chol = patient_chol.mean()

print(f'Average patient cholesterol (age > 60): {avg_patient_chol}')

Average patient cholesterol (age > 60): 260.1518987341772


### Exercise - Netflix Subscriptions in pandas (E2)



1. Load the Netflix subscriptions data into a dataframe and display it


In [11]:
netflix_data = pd.read_csv('netflix.csv')

netflix_data.head(n=10)

Unnamed: 0,Country_code,Country,Total Library Size,No. of TV Shows,No. of Movies,Cost Per Month - Basic ($),Cost Per Month - Standard ($),Cost Per Month - Premium ($)
0,ar,Argentina,4760,3154,1606,3.74,6.3,9.26
1,au,Australia,6114,4050,2064,7.84,12.12,16.39
2,at,Austria,5640,3779,1861,9.03,14.67,20.32
3,be,Belgium,4990,3374,1616,10.16,15.24,20.32
4,bo,Bolivia,4991,3155,1836,7.99,10.99,13.99
5,br,Brazil,4972,3162,1810,4.61,7.11,9.96
6,bg,Bulgaria,6797,4819,1978,9.03,11.29,13.54
7,ca,Canada,6239,4311,1928,7.91,11.87,15.03
8,cl,Chile,4994,3156,1838,7.07,9.91,12.74
9,co,Colombia,4991,3156,1835,4.31,6.86,9.93



2. View all of the information for the country with the largest library size (code, name, library size, #tv_shows, #movies, basic subscription fee, etc.)


In [14]:
sorted_countries_library_size = netflix_data.sort_values(['Total Library Size'], ascending=False)

sorted_countries_library_size.head(n=1)

Unnamed: 0,Country_code,Country,Total Library Size,No. of TV Shows,No. of Movies,Cost Per Month - Basic ($),Cost Per Month - Standard ($),Cost Per Month - Premium ($)
12,cz,Czechia,7325,5234,2091,8.83,11.49,14.15



3. View the countries with the highest basic subscription fees - it's sufficient to look at the top 5


In [15]:
sorted_countries_sub_fees = netflix_data.sort_values(['Cost Per Month - Basic ($)'], ascending=False)

sorted_countries_sub_fees.head(n=5)

Unnamed: 0,Country_code,Country,Total Library Size,No. of TV Shows,No. of Movies,Cost Per Month - Basic ($),Cost Per Month - Standard ($),Cost Per Month - Premium ($)
33,li,Liechtenstein,3048,1712,1336,12.88,20.46,26.96
56,ch,Switzerland,5506,3654,1852,12.88,20.46,26.96
13,dk,Denmark,4558,2978,1580,12.0,15.04,19.6
55,se,Sweden,4361,2973,1388,10.9,14.2,19.7
29,il,Israel,5713,3650,2063,10.56,15.05,19.54



4. Find the country that has the best library size/basic sub price ratio. This is the country in which you get the most content per dollar.


In [19]:
# country with best library size / subscription fees ratio
netflix_data['Ratio'] = netflix_data['Total Library Size'] / netflix_data['Cost Per Month - Basic ($)']
sorted_countries_ratio = netflix_data.sort_values(['Ratio'], ascending=False)

sorted_countries_ratio.head(n=5)

Unnamed: 0,Country_code,Country,Total Library Size,No. of TV Shows,No. of Movies,Cost Per Month - Basic ($),Cost Per Month - Standard ($),Cost Per Month - Premium ($),Ratio
59,tr,Turkey,4639,2930,1709,1.97,3.0,4.02,2354.822335
26,in,India,5843,3718,2125,2.64,6.61,8.6,2213.257576
0,ar,Argentina,4760,3154,1606,3.74,6.3,9.26,1272.727273
9,co,Colombia,4991,3156,1835,4.31,6.86,9.93,1158.00464
5,br,Brazil,4972,3162,1810,4.61,7.11,9.96,1078.524946
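
If only the best row is needed, sorting the whole frame is not required. The sketch below uses three rows copied from the table above and picks the winner with idxmax; nlargest is a shortcut for "top n by a column".

```python
import pandas as pd

# Three rows taken from the Netflix table above
netflix_sample = pd.DataFrame({
    "Country": ["Argentina", "India", "Turkey"],
    "Total Library Size": [4760, 5843, 4639],
    "Cost Per Month - Basic ($)": [3.74, 2.64, 1.97],
})
netflix_sample["Ratio"] = (
    netflix_sample["Total Library Size"] / netflix_sample["Cost Per Month - Basic ($)"]
)

# idxmax returns the index label of the largest value; .loc fetches that row
best_row = netflix_sample.loc[netflix_sample["Ratio"].idxmax()]
print(best_row["Country"], round(best_row["Ratio"], 2))

# The two rows with the highest ratio, already sorted in descending order
top_two = netflix_sample.nlargest(2, "Ratio")
print(top_two)
```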


# Data Cleaning


In order to make sense of our data we often need to clean it first.
Data cleaning can generally be summarized as bringing your data into a form that is easily explorable. Having your data in a clean form should save you a lot of trouble and time when doing your investigations. 

On top of that, the results of your analysis may become far more valuable if you get rid of the noise found in your initial data.

Examples of data cleaning:
*   Finding structure for your data - if you were to analyze an entire book, you would probably not want to work with a single string containing all the text data. Simply trying to access and view subparts of the text would be a pain.

*The following generally apply to numeric data:*
*   Getting rid of empty/NaN entries - you don't want these polluting your data
*   Sometimes getting rid of 0 values

*The following generally apply to text data:*
*   Separating your texts into pages/sentences/words
*   Converting all of your text data to the same case (e.g. lowercase)
*   Getting rid of punctuation
*   Getting rid of frequent words that hold little meaning for our investigation (words like 'the', 'and', etc.)
*   Other forms of text normalization (stemming, lemmatization - we'll be touching on them in another lab)
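
The numeric cleaning steps above can be sketched on a small made-up frame. Whether a 0 is a real measurement or a placeholder for "missing" depends on the dataset, so treat the second step as an assumption to check rather than a rule.

```python
import numpy as np
import pandas as pd

# Illustrative readings: one missing value, one suspicious 0
readings = pd.DataFrame({
    "patient": ["a", "b", "c", "d"],
    "chol": [233.0, np.nan, 0.0, 204.0],
})

# Step 1: drop rows where chol is NaN
no_missing = readings.dropna(subset=["chol"])

# Step 2: drop rows where chol is 0, then renumber the remaining rows
cleaned = no_missing[no_missing["chol"] != 0].reset_index(drop=True)
print(cleaned)
```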


### Poem

Say we have this short poem:

In [1]:
poem = """Because I could not stop for Death -
He kindly stopped for me -
The Carriage held but just Ourselves -
and Immortality"""


Let's create a dataframe out of it and then tokenize the data into lowercase words.


In [4]:
poem_split = poem.splitlines() # Let's split the poem into a list of lines

df_poem = pd.DataFrame({
    "content": poem_split
}) # and create our dataframe from the list

df_poem.head()


Unnamed: 0,content
0,Because I could not stop for Death -
1,He kindly stopped for me -
2,The Carriage held but just Ourselves -
3,and Immortality


In [None]:
%%capture
%pip install nltk
%pip install tidytext
%pip install gutenbergpy
%pip install siuba

In [5]:
from tidytext import unnest_tokens
import nltk
nltk.download('punkt')

# tokenize content into words (separate words, remove punctuation, bring everything lowercase)
df = (unnest_tokens(df_poem, "word", "content")) # Can also do manually - convert content to lowercase, then remove punctuation and then split using spaces
df.head(10)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bogda\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,word
0,because
0,i
0,could
0,not
0,stop
0,for
0,death
1,he
1,kindly
1,stopped


In [6]:
# Unnest tokens does not automatically reset the indexes in our dataframe, so we'll do that manually
df.reset_index(drop=True, inplace=True) # drop=True means we want to drop old indexes, inplace=True means we want to modify the existing df, not create and return a copy with the modifications
df.head(10)

Unnamed: 0,word
0,because
1,i
2,could
3,not
4,stop
5,for
6,death
7,he
8,kindly
9,stopped
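
The manual route mentioned in the comment above (lowercase, strip punctuation, split on whitespace) can be sketched with pandas string methods alone. This is an approximation of what unnest_tokens does, not its actual implementation; the regular expression used for punctuation is an assumption.

```python
import pandas as pd

df_poem = pd.DataFrame({"content": [
    "Because I could not stop for Death -",
    "He kindly stopped for me -",
    "The Carriage held but just Ourselves -",
    "and Immortality",
]})

words = (
    df_poem["content"]
    .str.lower()                              # same case everywhere
    .str.replace(r"[^\w\s]", "", regex=True)  # drop anything that is not a word character or whitespace
    .str.split()                              # one list of words per line
    .explode()                                # one row per word, keeping the line's index
    .rename("word")
)

# As with unnest_tokens, the index still points at the source line, so reset it
df_words = words.reset_index(drop=True).to_frame()
print(df_words.head(10))
```

An empty line splits into an empty list, which explode turns into a NaN row, so a null filter is still needed afterwards on real book text.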


### Getting a book into a dataframe
We'll be getting our data from Project Gutenberg: http://www.gutenberg.org/

Download Sense and Sensibility by Jane Austen (http://www.gutenberg.org/ebooks/161)

In [11]:
# We'll make use of the gutenbergpy python package
import gutenbergpy.textget

raw_book = gutenbergpy.textget.get_text_by_id(161)
sense_sensibility_text = gutenbergpy.textget.strip_headers(raw_book)
print(len(sense_sensibility_text)) # Length of our data


693092


In [None]:
print(sense_sensibility_text)

In [13]:
# Create our dataframe
import re
sense_sensibility_lines = sense_sensibility_text.splitlines()

sense_sensibility_df = pd.DataFrame({
    "content": sense_sensibility_lines,
    "line": list(range(len(sense_sensibility_lines)))
})

sense_sensibility_df.head(110)


Unnamed: 0,content,line
0,b'',0
1,b'[Illustration]',1
2,b'',2
3,b'',3
4,b'',4
...,...,...
105,"b'will, gave as much disappointment as pleasur...",105
106,"b'unjust, nor so ungrateful, as to leave his e...",106
107,b'he left it to him on such terms as destroyed...,107
108,b'bequest. Mr. Dashwood had wished for it more...,108


In [14]:
# The package provided us with data in a byte object format. Since we prefer working with strings, let's convert all content into utf-8 strings.
ss_df = sense_sensibility_df.copy()
ss_df['content'] = ss_df['content'].str.decode("utf-8") # Transform content byte objects into strings
ss_df.head(110)

Unnamed: 0,content,line
0,,0
1,[Illustration],1
2,,2
3,,3
4,,4
...,...,...
105,"will, gave as much disappointment as pleasure....",105
106,"unjust, nor so ungrateful, as to leave his est...",106
107,he left it to him on such terms as destroyed h...,107
108,bequest. Mr. Dashwood had wished for it more f...,108


### Chapter extraction

In [15]:
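def line_is_chapter(dataframe):
    """Return, for every line, the number of the chapter it belongs to (0 = before chapter 1)."""
    # A chapter heading looks like "CHAPTER IV." (roman or arabic numeral, trailing period)
    chapter_pattern = re.compile(r"^chapter [\divxlc]*\.$", re.IGNORECASE)
    chapter_list = []
    curr_chapter = 0
    for _, row in dataframe.iterrows():
        if chapter_pattern.search(row['content']):
            curr_chapter += 1 # a new chapter starts on this line
        chapter_list.append(curr_chapter)
    return chapter_list

ss_df = ss_df.assign(chapter = line_is_chapter(ss_df))
ss_df.head(1110)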

Unnamed: 0,content,line,chapter
0,,0,0
1,[Illustration],1,0
2,,2,0
3,,3,0
4,,4,0
...,...,...,...
1105,particularly gentlemanlike.,1105,7
1106,,1106,7
1107,There was nothing in any of the party which co...,1107,7
1108,companions to the Dashwoods; but the cold insi...,1108,7
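
The same chapter numbering can be sketched without an explicit loop: mark the heading lines with str.match and take a cumulative sum, so every line receives the number of headings seen so far. The lines below are a short hand-written sample for illustration, and the pattern is the one used above.

```python
import pandas as pd

# Short hand-written sample of book lines (illustrative)
lines = pd.DataFrame({"content": [
    "SENSE AND SENSIBILITY",
    "CHAPTER I.",
    "The family of Dashwood had long been settled in Sussex.",
    "",
    "CHAPTER II.",
    "Mrs. John Dashwood now installed herself mistress of Norland;",
]})

# True on heading lines such as "CHAPTER IV.", False everywhere else
is_heading = lines["content"].str.match(r"^chapter [\divxlc]*\.$", case=False)

# Running count of headings = chapter number of each line (0 before the first heading)
lines["chapter"] = is_heading.cumsum()
print(lines)
```

On a full book this avoids iterating row by row, which tends to be noticeably slower in pandas.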


In [16]:
ss_df = (unnest_tokens(ss_df, "word", "content"))
ss_df.reset_index(drop=True, inplace=True)
ss_df.head(110)


Unnamed: 0,line,chapter,word
0,0,0,
1,1,0,illustration
2,2,0,
3,3,0,
4,4,0,
...,...,...,...
105,58,0,chapter
106,58,0,xliv
107,59,0,chapter
108,59,0,xlv


In [17]:
ss_df = ss_df[ss_df.word.notnull()].reset_index(drop=True)
ss_df.head(110)

Unnamed: 0,line,chapter,word
0,1,0,illustration
1,6,0,sense
2,6,0,and
3,6,0,sensibility
4,8,0,by
...,...,...,...
105,63,0,chapter
106,63,0,xlix
107,64,0,chapter
108,64,0,l


In [27]:
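from siuba import count # only needed for the siuba variants below; avoids a star import that shadows built-ins such as filter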
# Seeing the number of appearances per word
# Uncomment each of the lines below one by one and see which one you like the most, you'll probably be using it throughout the lab

ss_word_count_sorted = ss_df['word'].value_counts() # one way to see word counts in pandas
# ss_word_count = ss_df.groupby('word').count() # different way to see word counts in pandas

# count(ss_df, 'word') # siuba count
# count(ss_df, 'word', sort=True).head(22) # siuba count & sort

ss_word_count_sorted.head(n=10)


to     4087
the    4086
of     3568
and    3402
her    2518
a      2042
i      1951
in     1931
was    1847
it     1713
Name: word, dtype: int64

### Cleaning mini-exercise 1 (E3)




Inspect the 100 most frequent word entries and eliminate all the ones that are not actual words e.g.: “ ”

Update the ss_df dataframe by getting rid of all these "words".

In [19]:
# Inspect the 100 most frequent entries, then eliminate the non-words from the dataframe
for word, cnt in ss_word_count_sorted.head(100).items():
    print(f'{word}: {cnt}')


### Stop words


Stop words are words that are frequent in any context and don't give much (if any) information by themselves. For this reason we usually eliminate them before trying to get meaningful information from our data.

Examples of stop words are:

*  Determiners – Determiners mark nouns: a determiner is usually followed by a noun

    examples: the, a, an, another
*  Coordinating conjunctions – Coordinating conjunctions connect words, phrases, and clauses

    examples: for, and, nor, but, or, yet, so
*  Prepositions – Prepositions express temporal or spatial relations

    examples: in, under, towards, before

In [None]:
ss_df.head()

Unnamed: 0,line,chapter,word
0,1,0,illustration
1,6,0,sense
2,6,0,and
3,6,0,sensibility
4,8,0,by


In [None]:
stop_words_test = ['illustration', 'companion', 'dog']

# Keep only the rows of ss_df whose word is not in the stop words list
ss_df_test_filtered = ss_df[~ss_df['word'].isin(stop_words_test)]
ss_df_test_filtered.head()



Unnamed: 0,line,chapter,word
1,6,0,sense
2,6,0,and
3,6,0,sensibility
4,8,0,by
5,8,0,jane


In [None]:
#nltk already has a list of common stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

print(stopwords.words('english'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'ea

### Cleaning mini-exercise 2 (E4)



Filter out common stop words from your ss_df dataframe. Use nltk's 'english' stop words list to do so.

*(You can change the ss_df variable, no need to make a copy)*

**Display the top 5 most commonly occurring words after having applied your stop word filtering.**

In [None]:
# Write your code below

### Exercise - Perform your own data cleaning (E5)

Try some of the most recent entries in the Gutenberg project - **Kalevala : the Epic Poem of Finland — Complete by Lönnrot and Crawford**

Import the Epic Poem of Finland from Gutenberg, bring it into a clean dataframe containing all of the words as separate row entries (get rid of punctuation signs, bring everything to lowercase, get rid of NaN values, get rid of common stop words).

As previously for the Sense and Sensibility example, **each row should contain the word index, the line index and the chapter index** of where the word appeared. 
*Be careful with the chapter separation: chapters may not be marked in the same way as they were in Sense & Sensibility.*

**After you have your clean dataframe**:
1. Display all of the words that occur more than 400 times in the book.
2. Extend the old stop words list with the words that you notice occur more than 400 times and can be considered stop words (e.g. thy). The previous list probably missed these archaic English words.

    Apply the stop words filtering with the newly extended list and display once again all of the words that occur more than 400 times.
3. Display the 3 most frequently appearing words from chapters 5 and 10.

In [None]:
# Write your code below


# Multiple books

### Getting started
Let's create a dataframe containing information from multiple books. We'll choose 4 books by the Bronte sisters.

In [1]:
# Jane Eyre
bronte1 = gutenbergpy.textget.strip_headers(gutenbergpy.textget.get_text_by_id(1260))

# Wuthering Heights
bronte2 = gutenbergpy.textget.strip_headers(gutenbergpy.textget.get_text_by_id(768))

# Villette
bronte3 = gutenbergpy.textget.strip_headers(gutenbergpy.textget.get_text_by_id(9182))

# Agnes Grey
bronte4 = gutenbergpy.textget.strip_headers(gutenbergpy.textget.get_text_by_id(767))

bronte1_lines = bronte1.splitlines()
bronte2_lines = bronte2.splitlines()
bronte3_lines = bronte3.splitlines()
bronte4_lines = bronte4.splitlines()

bronte1_lines_df = pd.DataFrame({
    "content": bronte1_lines,
    "line": list(range(len(bronte1_lines)))
})

bronte2_lines_df = pd.DataFrame({
    "content": bronte2_lines,
    "line": list(range(len(bronte2_lines)))
})

bronte3_lines_df = pd.DataFrame({
    "content": bronte3_lines,
    "line": list(range(len(bronte3_lines)))
})

bronte4_lines_df = pd.DataFrame({
    "content": bronte4_lines,
    "line": list(range(len(bronte4_lines)))
})
print(bronte4_lines_df[:200])





In [None]:
# We’ll want to know which content comes from which book
bronte1_lines_df = bronte1_lines_df.assign(book = 'Jane Eyre')
bronte2_lines_df = bronte2_lines_df.assign(book = 'Wuthering Heights')
bronte3_lines_df = bronte3_lines_df.assign(book = 'Villette')
bronte4_lines_df = bronte4_lines_df.assign(book = 'Agnes Grey')


In [None]:
# Finally, we concatenate the books into one dataframe
books = [bronte1_lines_df, bronte2_lines_df, bronte3_lines_df, bronte4_lines_df]
bronte_books_df = pd.concat(books)
bronte_books_df.head()


Unnamed: 0,content,line,book
0,b'',0,Jane Eyre
1,b'',1,Jane Eyre
2,b'',2,Jane Eyre
3,b'',3,Jane Eyre
4,b'JANE EYRE',4,Jane Eyre
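
One thing to keep in mind: pd.concat keeps each frame's own index, so after stacking several books the index labels repeat (every book has a row 0). A small sketch with two made-up frames shows the effect and the ignore_index option, which renumbers the rows instead.

```python
import pandas as pd

# Two tiny stand-ins for per-book frames (content is made up)
book_a = pd.DataFrame({"content": ["a1", "a2"], "line": [0, 1]}).assign(book="Book A")
book_b = pd.DataFrame({"content": ["b1", "b2"], "line": [0, 1]}).assign(book="Book B")

stacked = pd.concat([book_a, book_b])                        # index: 0, 1, 0, 1
renumbered = pd.concat([book_a, book_b], ignore_index=True)  # index: 0, 1, 2, 3

print(stacked.loc[0])   # two rows, one per book
print(renumbered)
```

Either form works for the steps that follow, since the line and book columns identify each row; the repeated labels only matter when rows are looked up with .loc.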


In [None]:
# What shapes do the books have before concatenation?

print('Jane Eyre', bronte1_lines_df.shape)
print('Wuthering Heights', bronte2_lines_df.shape)
print('Villette', bronte3_lines_df.shape)
print('Agnes Grey', bronte4_lines_df.shape)

print('\nAll 4 Bronte sisters books', bronte_books_df.shape)


Jane Eyre (21008, 3)
Wuthering Heights (12349, 3)
Villette (21329, 3)
Agnes Grey (6992, 3)

All 4 Bronte sisters books (61678, 3)


In [None]:
bronte_books_df['content'] = bronte_books_df['content'].str.decode("utf-8") # Transform content byte objects into strings

In [None]:
bronte_books_df_words = (unnest_tokens(bronte_books_df, "word", "content")) # Tokenize our data (split into words)

In [None]:
bronte_books_df_words.head(300)

Unnamed: 0,line,book,word
0,0,Jane Eyre,
1,1,Jane Eyre,
2,2,Jane Eyre,
3,3,Jane Eyre,
4,4,Jane Eyre,jane
...,...,...,...
63,63,Jane Eyre,crime—an
63,63,Jane Eyre,insult
63,63,Jane Eyre,to
63,63,Jane Eyre,piety


### Exercise - Multiple Books / Multiple Authors (E6)



1. **Remove all the NaNs and common stop words** from your Bronte sisters dataframe and then **view the most commonly occurring words** in the sisters' writings.
2. **Create a similar dataframe (just like the one for the Bronte sisters) using 4 books from an author of your choice**. 
    Make sure all 4 books are from the same author.

    *If you can't find an author that you like, you can choose these 4 books by H. G. Wells:
    The Time Machine, The War of the Worlds, The Invisible Man, The Island of Doctor Moreau.*

    **View the most commonly used words from your author of choice** (after you've removed NaN and stop words from your dataframe).

3. Combine the 2 dataframes (Bronte sisters df + your author df) into a frame that contains **all of the previous information + information about the author** for each of the words in the dataframe.

 (Hint: This can be done similarly to how you've added the book name information to your Bronte books dataframe - add a column with the author name to each of the 2 initial dataframes. Concatenating these 2 "upgraded" dataframes should now give you the desired result.)

    Your columns should be: **index, line, book, word, *author*** (the order does not matter).

In [None]:
# Write your code below