<a href="https://colab.research.google.com/github/DrMelissaFranklin/Docker.dsub/blob/main/Copy_of_Sunday_1pm_Mel_Project_5_NLP_Famous.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing



### This project is a supervised machine learning problem using Natural Language Processing.

This project has three parts:
- Part 1) uses a traditional dataset in a CSV file: one person in the dataset is chosen as the target, and the ten people most similar to that target are found.
- Part 2) repeats the analysis using a Wikipedia API to access content directly on Wikipedia.
- Part 3) adds an interactive component to the notebook.


### Part 1)



- The CSV file is available at https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv
- The file contains a list of famous people and a brief overview.
- The goal of part 1) is to provide the capability to
  - Take one person from the list as input and output the 10 other people whose overviews are "closest" to that person's in a Natural Language Processing sense
  - Also output the sentiment of that person's overview



In [None]:
%%capture
# Install textblob
!pip install -U textblob


In [None]:
from textblob import TextBlob


In [None]:
%%capture
# Download corpora
!python -m textblob.download_corpora


In [None]:
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')


stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

pd.options.display.max_columns = 100

In [None]:
!curl -s https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv | wc -l

42786


In [None]:
df = pd.read_csv('https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv')

In [None]:
df.shape


(42786, 3)

In [None]:
df.columns

Index(['URI', 'name', 'text'], dtype='object')

In [None]:
df.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [None]:
df.tail()

Unnamed: 0,URI,name,text
42781,<http://dbpedia.org/resource/Motoaki_Takenouchi>,Motoaki Takenouchi,motoaki takenouchi born july 8 1967 saitama pr...
42782,<http://dbpedia.org/resource/Alan_Judge_(footb...,"Alan Judge (footballer, born 1960)",alan graham judge born 14 may 1960 is a retire...
42783,<http://dbpedia.org/resource/Eduardo_Lara>,Eduardo Lara,eduardo lara lozano born 4 september 1959 in c...
42784,<http://dbpedia.org/resource/Tatiana_Faberg%C3...,Tatiana Faberg%C3%A9,tatiana faberg is an author and faberg scholar...
42785,<http://dbpedia.org/resource/Kenneth_Thomas>,Kenneth Thomas,kenneth thomas born february 24 1938 was chief...


In [None]:
df.iloc[42]['text']
# Note: the text contains spelling errors because accented characters are missing (e.g. "qubec", "universit laval")

'sylvie roy born november 4 1964 in la tuque quebec is a politician in quebec canada and the coalition avenir qubec member of the national assembly for the electoral district of arthabaska she previously represented the riding of lotbinire from 2003 until 2012 initially as a member of the nowdefunct adq until the merger of that party into the caq in 2012she was awarded a law degree from universit laval in 1987 and admitted to the barreau du qubec in 1988 she was lawyer for 15 years including 12 years for mental health organizations in mauricie she served as mayor of saintsophiedelvrard from 1998 to 2003 she also worked for the bcancour regional county municipality quebecroy was first elected to the national assembly in the 2003 election with 37 of the voteparti qubcois pq incumbent jeanguy par finished third with 26 of the votein the 2007 election roy was easily reelected with 59 of the vote liberal candidate laurent boissonneault finished second with 22 of the voteon march 29 2007 roy

In [None]:
# Determine the sentiment associated with text for row 42, Sylvie Roy, Canadian politician.
text_sentiment = df.iloc[42]['text']
text_sentiment = TextBlob(text_sentiment)
text_sentiment.sentiment

Sentiment(polarity=0.03240740740740741, subjectivity=0.2537037037037037)

In [None]:
# Perform the count transformation
vectorizer = CountVectorizer(stop_words='english')
# Pass the 'text' column of the DataFrame: one row per person, one column per vocabulary term
bow_vec = vectorizer.fit_transform(df['text'])
bow_vec
# Returns a sparse bag-of-words matrix covering all 42,786 rows

<42786x437190 sparse matrix of type '<class 'numpy.int64'>'
	with 5847547 stored elements in Compressed Sparse Row format>

In [None]:
#bow_vec.toarray()

In [None]:
#tf-idf
tfidf = TfidfTransformer()
tfidf_vec = tfidf.fit_transform(bow_vec)
tfidf_vec  #a sparse matrix

<42786x437190 sparse matrix of type '<class 'numpy.float64'>'
	with 5847547 stored elements in Compressed Sparse Row format>
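
To make the weighting concrete, here is a minimal pure-Python sketch of what the count-then-TF-IDF step computes under scikit-learn's default settings (smoothed IDF followed by L2 normalization of each row). The function `tfidf_rows` and the three toy documents are hypothetical, and the sketch splits on whitespace without removing stop words, so it is a simplification of the cells above rather than a replacement for them.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is a {term: weight} dict with unit L2 norm."""
    counts = [Counter(doc.split()) for doc in docs]           # raw term counts per document
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)    # documents containing each term
    # Smoothed IDF (the TfidfTransformer default): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty documents
        rows.append({t: w / norm for t, w in weights.items()})
    return sorted(idf), rows

# Hypothetical toy overviews, not taken from the dataset
toy_docs = [
    "chess master chess coach",
    "chess federation officer",
    "quebec politician national assembly",
]
vocab, rows = tfidf_rows(toy_docs)
print(len(vocab), rows[0])
```

Terms that appear in fewer documents get a larger IDF, so within one document a rare word outweighs an equally frequent common word.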

In [None]:
nn = NearestNeighbors().fit(tfidf_vec)


Get the nearest-neighbor distances for row 42 (Sylvie Roy)


In [None]:
distances, indices = nn.kneighbors(
  X = tfidf_vec[42],
  n_neighbors = 11)



In [None]:
distances


array([[0.        , 1.14739967, 1.17374642, 1.18171971, 1.20426593,
        1.20595633, 1.22422289, 1.23227593, 1.25966377, 1.26066425,
        1.26560136]])
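
These distances can be read as cosine similarities. `NearestNeighbors` defaults to Euclidean distance and `TfidfTransformer` defaults to L2-normalized rows, and for two unit-length vectors the squared Euclidean distance equals 2 - 2·cos. The sketch below checks that identity on two hand-made unit vectors and then applies it to the closest distance printed above; `cosine_from_euclidean` is a hypothetical helper name.

```python
import math

def cosine_from_euclidean(distance):
    """Cosine similarity of two unit-length vectors that are `distance` apart."""
    return 1.0 - distance ** 2 / 2.0

# Verify the identity on two hand-made unit vectors
u = (0.6, 0.8)
v = (1.0, 0.0)
euclid = math.dist(u, v)
cosine = sum(a * b for a, b in zip(u, v))
print(abs(cosine_from_euclidean(euclid) - cosine) < 1e-12)

# Closest neighbor of row 42 in the output above
closest = cosine_from_euclidean(1.14739967)
print(round(closest, 3))
```

A distance of 0 means identical text, and a distance of about 1.414 (the square root of 2) means the two overviews share no weighted terms at all.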

In [None]:
indices


array([[   42, 32591,  7862, 36314,  1859, 36099, 40841, 27224, 20278,
        26944, 18123]])

In [None]:
indices[0]

array([   42, 32591,  7862, 36314,  1859, 36099, 40841, 27224, 20278,
       26944, 18123])

In [None]:
df.iloc[indices[0]]

Unnamed: 0,URI,name,text
42,<http://dbpedia.org/resource/Sylvie_Roy>,Sylvie Roy,sylvie roy born november 4 1964 in la tuque qu...
32591,<http://dbpedia.org/resource/%C3%89ric_Caire>,%C3%89ric Caire,ric caire born may 21 1965 in soreltracy quebe...
7862,<http://dbpedia.org/resource/Claude_Roy_(polit...,Claude Roy (politician),claude roy born april 25 1952 in montmagny que...
36314,<http://dbpedia.org/resource/Marc_Picard>,Marc Picard,marc picard born april 25 1955 in saintraphal ...
1859,<http://dbpedia.org/resource/Simon-Pierre_Diam...,Simon-Pierre Diamond,simonpierre diamond born february 9 1985 in bo...
36099,<http://dbpedia.org/resource/Janvier_Grondin>,Janvier Grondin,janvier grondin born on june 16 1947 in saintj...
40841,<http://dbpedia.org/resource/Sylvain_L%C3%A9ga...,Sylvain L%C3%A9gar%C3%A9,sylvain lgar born october 22 1970 in quebec ci...
27224,<http://dbpedia.org/resource/Catherine_Morisse...,Catherine Morissette,catherine morissette born february 3 1979 in q...
20278,<http://dbpedia.org/resource/Roger_Bertrand>,Roger Bertrand,roger bertrand born july 26 1947 is an economi...
26944,<http://dbpedia.org/resource/Norbert_Morin>,Norbert Morin,norbert morin born december 16 1945 is a polit...
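
Eleven neighbors are requested because the closest match is always the query row itself, at distance 0. The following is a small pure-Python sketch of the same search that leaves the query out explicitly. The names `euclidean` and `nearest_rows` and the toy vectors are hypothetical, and the linear scan only illustrates the idea; it is not how scikit-learn implements the search.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as {term: weight} dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def nearest_rows(vectors, query_index, k=10):
    """Return [(row_index, distance)] for the k rows closest to the query, excluding the query."""
    query = vectors[query_index]
    scored = [(i, euclidean(query, v)) for i, v in enumerate(vectors) if i != query_index]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical unit-length vectors standing in for TF-IDF rows
toy_vectors = [
    {"quebec": 0.8, "politician": 0.6},
    {"quebec": 0.6, "politician": 0.8},
    {"chess": 1.0},
    {"quebec": 1.0},
]
neighbours = nearest_rows(toy_vectors, query_index=0, k=2)
print(neighbours)
```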


# ALTERNATIVE: Defining methods to accomplish Part 1

### def preprocess_text (text):
  '''
  This function lowercases the text, removes punctuation and special characters, and converts the result to a TextBlob
  '''
  text = text.lower()
  text = re.sub(r"[^\w\s]", "", text)
  text = TextBlob(text)
  return text

### def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    return text

### def create_neighbors(overviews):
    # Vectorize the preprocessed overviews: NearestNeighbors needs numeric features, not raw strings
    vectorizer = TfidfVectorizer(stop_words='english')
    overview_matrix = vectorizer.fit_transform([preprocess_text(ov) for ov in overviews])

    # Brute-force search over cosine distance; 11 neighbors because the first hit is the query itself
    nn = NearestNeighbors(n_neighbors=11, algorithm='brute', metric='cosine')
    nn.fit(overview_matrix)

    return nn, overview_matrix

### def find_similar_profiles(nn, overview_matrix, profile_index):
    # Query with the row that was already vectorized for this profile
    distances, indices = nn.kneighbors(overview_matrix[profile_index])
    similar_indices = indices[0][1:]  # drop the profile itself
    return df.iloc[similar_indices]



# Part 2


- For the same person from step 1), use the Wikipedia API to access the whole content of that person's Wikipedia page.
- The goal of part 2) is to produce the capability to:
  1. For that Wikipedia page determine the sentiment of the entire page
  1. Print out the Wikipedia article
  1. Collect the Wikipedia pages from the 10 nearest neighbors in Step 1)
  1. Determine the nearness ranking of these 10 to your main subject based on their entire Wikipedia page
  1. Compare the nearest ranking from Step 1) with the Wikipedia page nearness ranking
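
One way to approach the last item in the list above is to line the two orderings up and summarize the agreement with Spearman's rank correlation. The sketch below is only an illustration under stated assumptions: the function names and the four-item rankings are hypothetical, and the closed-form formula used here requires that both lists contain the same names with no ties.

```python
def rank_shifts(baseline, other):
    """Map each name to (position in baseline, position in other, shift); positions start at 1."""
    if set(baseline) != set(other) or len(baseline) != len(set(baseline)):
        raise ValueError("both rankings must contain the same distinct names")
    other_pos = {name: i for i, name in enumerate(other, start=1)}
    return {name: (i, other_pos[name], other_pos[name] - i)
            for i, name in enumerate(baseline, start=1)}

def spearman_rho(baseline, other):
    """Spearman rank correlation for two orderings of the same items (no ties)."""
    n = len(baseline)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    d_squared = sum(shift ** 2 for _, _, shift in rank_shifts(baseline, other).values())
    return 1 - 6 * d_squared / (n * (n * n - 1))

csv_ranking = ["A", "B", "C", "D"]    # hypothetical order from the CSV overviews
wiki_ranking = ["B", "A", "C", "D"]   # hypothetical order from the full Wikipedia pages
rho = spearman_rho(csv_ranking, wiki_ranking)
print(rank_shifts(csv_ranking, wiki_ranking), rho)
```

A value near 1 means the two sources rank the neighbors in almost the same order; a value near -1 means the order is reversed. Neighbors whose Wikipedia page could not be fetched would have to be dropped from both lists first.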

In [None]:
# Collect the names of the seed person and the nearest neighbors for the Wikipedia lookup
names = [df.iloc[i]['name'] for i in indices[0]]
names_for_wiki = pd.Series(names)
#print(names_for_wiki)

In [None]:
def preprocess_names(names):

    # Remove punctuation and special characters
    Wiki_names = names.str.replace(r'[^\w\s]', '', regex=True) # Use .str.replace() for Series

    return Wiki_names

# Call the function with your data and assign the result
processed_names = preprocess_names(names_for_wiki)

# Now you can print the processed names
print(processed_names)

0                    Sylvie Roy
1                 C389ric Caire
2         Claude Roy politician
3                   Marc Picard
4           SimonPierre Diamond
5               Janvier Grondin
6          Sylvain LC3A9garC3A9
7          Catherine Morissette
8                Roger Bertrand
9                 Norbert Morin
10    SC3A9bastien Schneeberger
dtype: object
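
The mangled entries such as `C389ric Caire` come from percent-encoded names in the dataset (`%C3%89ric Caire`): stripping punctuation removes the `%` signs but leaves the hex digits behind. A possible alternative, sketched here with the standard library, is to decode the names instead of stripping them. `decode_name` is a hypothetical helper, and whether the decoded title matches a Wikipedia page title is an assumption that would still need checking.

```python
from urllib.parse import unquote

def decode_name(raw_name):
    """Turn a percent-encoded name into readable text; other names pass through unchanged."""
    return unquote(raw_name)

# Names as they appear in the 'name' column above
encoded = [
    "Sylvie Roy",
    "%C3%89ric Caire",
    "Sylvain L%C3%A9gar%C3%A9",
    "Claude Roy (politician)",
]
decoded = [decode_name(n) for n in encoded]
print(decoded)
```

Decoding also keeps the parentheses in `Claude Roy (politician)`, which the punctuation-stripping step removes.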


# Part 2: Replace Sylvie Roy (seed 42) with row 42785 (Kenneth Thomas) from the tail to avoid the French name problems.

In [None]:
df2 = df.copy()
df2.iloc[42785]['text']

'kenneth thomas born february 24 1938 was chief financial officer of the united states chess federation from april 23 to december 31 2004 he took over the position of cfo of the uscf during a period of great financial distress with the uscf having lost money seven years in a row with total losses amounting to 17 million he was hired by bill goichberg because the uscf was in severe financial difficulty he agreed to work for far below the normal salary he restored the federation to profitability and financial solvencyken thomas was born in mead oklahoma during the great dust bowl three of his siblings died of starvation and ken nearly died as well after world war ii ken with his family moved to new jersey he graduated from bound brook high school therekenneth thomas served in the us army ordnance corps from 1961 to 1964 as a sergeant he worked in nuclear weapon assembly he worked in the nuclear weapons laboratory in sandia base near albuquerque new mexico he was a mechanic in the interna

In [None]:
#nn = NearestNeighbors().fit(tfidf_vec)
#This code was already run for the entire df so don't need to run again

Get the nearest-neighbor distances for the new seed, row 42785 (Kenneth Thomas)


In [None]:
distances, indices = nn.kneighbors(
  X = tfidf_vec[42785],
  n_neighbors = 11)

#running this code again for 42785

In [None]:
distances


array([[0.        , 1.22992624, 1.25191282, 1.25550796, 1.26614022,
        1.27261345, 1.28431549, 1.28672592, 1.28783668, 1.28832753,
        1.29327643]])

In [None]:
indices


array([[42785, 35332, 42541, 11232, 20076, 14897, 22282, 12890, 36050,
        36821, 19596]])

In [None]:
indices[0]

array([42785, 35332, 42541, 11232, 20076, 14897, 22282, 12890, 36050,
       36821, 19596])

In [None]:
df2.iloc[indices[0]]

Unnamed: 0,URI,name,text
42785,<http://dbpedia.org/resource/Kenneth_Thomas>,Kenneth Thomas,kenneth thomas born february 24 1938 was chief...
35332,<http://dbpedia.org/resource/Don_Schultz>,Don Schultz,don schultz born 13 may 1936 in woodhaven quee...
42541,<http://dbpedia.org/resource/Tamara_Golovey>,Tamara Golovey,tamara golovey russian is a chess master chess...
11232,<http://dbpedia.org/resource/Yanko_Yanev>,Yanko Yanev,yanko yanev has played a vital role in shaping...
20076,<http://dbpedia.org/resource/Michael_Khodarkov...,Michael Khodarkovsky,michael khodarkovsky odessa ussr july 21 1958 ...
14897,<http://dbpedia.org/resource/James_Eade>,James Eade,james eade born march 23 1957 is an american c...
22282,<http://dbpedia.org/resource/Irma_Arguello>,Irma Arguello,irma arguello is an international security exp...
12890,<http://dbpedia.org/resource/Randy_Bauer>,Randy Bauer,randy bauer born 1958 is a chess master and a ...
36050,<http://dbpedia.org/resource/John_Large>,John Large,john h large is a nuclear engineer and analyst...
36821,<http://dbpedia.org/resource/Michael_R._Anasta...,Michael R. Anastasio,michael anastasio born 1948 led two national s...


In [None]:
%%capture output
#install Wikipedia API
!pip3 install wikipedia-api

In [None]:
import wikipediaapi

In [None]:
# Get the names of Kenneth Thomas and his nearest neighbors for the Wikipedia lookup
names = [] # Create an empty list as a container for each name, appended row by row in the following loop
for i in indices[0]:
  names.append(df2.iloc[i]['name'])
print(pd.Series(names)) # Display the list "names" as a pandas Series

0           Kenneth Thomas
1              Don Schultz
2           Tamara Golovey
3              Yanko Yanev
4     Michael Khodarkovsky
5               James Eade
6            Irma Arguello
7              Randy Bauer
8               John Large
9     Michael R. Anastasio
10          Bruce G. Blair
dtype: object


In [None]:
# Get the text from Wikipedia, referred to as "bio" - https://en.wikipedia.org/wiki/et.al
list_for_series = [] # Create a list as a container for the biographical text retrieved through wikipediaapi
wikip = wikipediaapi.Wikipedia(user_agent = 'famous') # Initialize the wikipediaapi client once, outside the loop. The user_agent string 'famous' identifies this script to Wikipedia's servers.
for bio_text in names:
  page_person = wikip.page(bio_text) # Look up the Wikipedia page for the current person (using bio_text as the page title) and assign it to page_person.
  wiki_text = page_person.text # Extract the text content of the page (empty when no page is found) and assign it to wiki_text.
  list_for_series.append(wiki_text) # Add the extracted Wikipedia text (wiki_text) to the list_for_series list.
pd.Series(list_for_series) # Create a pandas Series from the list_for_series list

Unnamed: 0,0
0,"Kenneth, Ken or Kenny Thomas may refer to:\n\n..."
1,"Donald Schultz (May 13, 1936, Woodhaven, Queen..."
2,"Tamara Asherawna Golovey is a Chess Master, Ch..."
3,
4,Michael Khodarkovsky is an American chess play...
5,"James V. Eade (born March 23, 1957) is an Amer..."
6,Irma Arguello is an international security exp...
7,
8,John Henry Large (4 May 1943 – 3 November 2018...
9,Michael Anastasio (born 1948) led two national...


In [None]:
def get_wiki_bios(name): # This line defines a function called get_wiki_bios that takes one argument: name, a Wikipedia page title given as a string.
    wikip = wikipediaapi.Wikipedia(user_agent = 'famous') # Use the wikipediaapi client imported above; no module named wikipedia is imported in this notebook.
    page = wikip.page(name) # This line is the core of the function. It looks up the Wikipedia page corresponding to the provided name.
    if page.exists(): # wikipediaapi reports a missing page through exists() rather than by raising an exception.
        return page.text
    # If the page is not found, return None
    return None

In [None]:
names_series = pd.Series(names, name='IndvName') # Creates a pandas Series named names_series from the names list
list_for_series_series = pd.Series(list_for_series, name='Blurb', index=names_series.index) # Creates another pandas Series named list_for_series_series from the list_for_series list; name='Blurb' assigns the column name 'Blurb' to this Series; index=names_series.index ensures that this Series uses the same index as names_series, so the names and blurbs are aligned correctly

# Convert Series to DataFrames with a single column
names_df = names_series.to_frame() # Converts the names_series Series into a DataFrame named names_df. This creates a DataFrame with one column ('IndvName') containing the names.
list_for_series_df = list_for_series_series.to_frame() # Converts the list_for_series_series Series into a DataFrame named list_for_series_df. This creates a DataFrame with one column ('Blurb') containing the blurbs.

mini_df = pd.concat([names_df, list_for_series_df], axis=1) # This creates the mini_df DataFrame, which now has two columns: 'IndvName' (containing names) and 'Blurb' (containing the corresponding blurbs)
print(mini_df)

                IndvName                                              Blurb
0         Kenneth Thomas  Kenneth, Ken or Kenny Thomas may refer to:\n\n...
1            Don Schultz  Donald Schultz (May 13, 1936, Woodhaven, Queen...
2         Tamara Golovey  Tamara Asherawna Golovey is a Chess Master, Ch...
3            Yanko Yanev                                                   
4   Michael Khodarkovsky  Michael Khodarkovsky is an American chess play...
5             James Eade  James V. Eade (born March 23, 1957) is an Amer...
6          Irma Arguello  Irma Arguello is an international security exp...
7            Randy Bauer                                                   
8             John Large  John Henry Large (4 May 1943 – 3 November 2018...
9   Michael R. Anastasio  Michael Anastasio (born 1948) led two national...
10        Bruce G. Blair  Bruce Gentry Blair (November 16, 1947 – July 1...


In [None]:
def clean_text(text): # This line defines a function named clean_text that takes one argument called text.
    text = text.lower()  # Convert input 'text' to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and special characters

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = text.split()
    text = ' '.join([word for word in words if word not in stop_words])

    # Stemming
    stemmer = PorterStemmer()
    words = text.split()
    text = ' '.join([stemmer.stem(word) for word in words])

    return text

# 'CleanedWikiText' and 'Blurb' are column names
mini_df['CleanedWikiText'] = mini_df['Blurb'].apply(clean_text)
mini_df

Unnamed: 0,IndvName,Blurb,CleanedWikiText
0,Kenneth Thomas,"Kenneth, Ken or Kenny Thomas may refer to:\n\n...",kenneth ken kenni thoma may refer kenneth thom...
1,Don Schultz,"Donald Schultz (May 13, 1936, Woodhaven, Queen...",donald schultz may 13 1936 woodhaven queen new...
2,Tamara Golovey,"Tamara Asherawna Golovey is a Chess Master, Ch...",tamara asherawna golovey chess master chess in...
3,Yanko Yanev,,
4,Michael Khodarkovsky,Michael Khodarkovsky is an American chess play...,michael khodarkovski american chess player coa...
5,James Eade,"James V. Eade (born March 23, 1957) is an Amer...",jame v ead born march 23 1957 american chess m...
6,Irma Arguello,Irma Arguello is an international security exp...,irma arguello intern secur expert argentina fo...
7,Randy Bauer,,
8,John Large,John Henry Large (4 May 1943 – 3 November 2018...,john henri larg 4 may 1943 3 novemb 2018 engli...
9,Michael R. Anastasio,Michael Anastasio (born 1948) led two national...,michael anastasio born 1948 led two nation sci...


In [None]:
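# Perform the TF-IDF Vectorization
tf_idf_vec = TfidfVectorizer(stop_words = 'english')
# Bug: iterating over a DataFrame yields its column names, so only the three column names are vectorized here.
# The intended input was the text column, mini_df['CleanedWikiText'].
tf_idf_peep = tf_idf_vec.fit_transform(mini_df)
tf_idf_peep.shape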

(3, 3)

In [None]:
tf_idf_peep

<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [None]:
tf_idf_peep.transpose().shape

(3, 3)

In [None]:
tf_idf_vec.get_feature_names_out()

array(['blurb', 'cleanedwikitext', 'indvname'], dtype=object)

# Troubleshooting reveals common errors and why the old code does not work as intended:
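
The `(3, 3)` shape and the feature names `blurb`, `cleanedwikitext`, `indvname` point to the cause: the vectorizer was handed the whole DataFrame, and iterating over a DataFrame yields its column labels, not its rows. The sketch below reproduces the effect without pandas by using a plain dict as a stand-in, since a dict also iterates over its keys. The `tokens` helper is hypothetical and only mimics the default token pattern of the scikit-learn vectorizers (words of two or more characters, lowercased).

```python
import re

# Stand-in for mini_df: a mapping of column name -> column values
mini = {
    "IndvName": ["Don Schultz", "James Eade"],
    "Blurb": ["Donald Schultz (May 13, 1936", "James V. Eade (born March 23, 1957)"],
    "CleanedWikiText": ["donald schultz may 13 1936", "jame v ead born march 23 1957"],
}

def tokens(documents):
    """Lowercase each document and collect its distinct tokens of two or more characters, sorted."""
    found = set()
    for doc in documents:
        found.update(re.findall(r"\b\w\w+\b", doc.lower()))
    return sorted(found)

from_container = tokens(mini)                     # iterates over the column names
from_column = tokens(mini["CleanedWikiText"])     # iterates over the actual documents
print(from_container)
print(from_column)
```

In the notebook, the corresponding fix would be to pass the column itself, `tf_idf_vec.fit_transform(mini_df['CleanedWikiText'])`, so that each person's cleaned page text becomes one row of the matrix.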

# Part 3
Make an interactive notebook.

In addition to presenting the project slides, at the end of the presentation each student will demonstrate their code using a famous person suggested by the other students that exists in the DBpedia set.

Create a function that wraps the TF-IDF search, but precalculate the TF-IDF matrix first or each query will take too long.

Drop-down
subjectivity, polarity
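
For the live demonstration, a name suggested by the audience first has to be matched to a row of the dataset. The following sketch shows one possible lookup using only the standard library: an exact case-insensitive match returns the row position, and otherwise `difflib` proposes similar names. `resolve_name` and the short `people` list are hypothetical; in the notebook the list would come from the `name` column, and the returned position would index into the precalculated TF-IDF matrix.

```python
import difflib

def resolve_name(query, names, max_suggestions=3):
    """Return (row_index, suggestions); row_index is None when there is no exact match."""
    lowered = [n.lower() for n in names]
    key = query.strip().lower()
    if key in lowered:
        return lowered.index(key), []
    # No exact match: offer the most similar names, best match first
    close = difflib.get_close_matches(key, lowered, n=max_suggestions, cutoff=0.6)
    return None, [names[lowered.index(c)] for c in close]

# Hypothetical subset of the 'name' column
people = ["Kenneth Thomas", "Don Schultz", "Tamara Golovey", "Sylvie Roy"]
exact = resolve_name("  kenneth thomas ", people)
typo = resolve_name("Silvie Roy", people)
print(exact, typo)
```

Because the lookup only touches the list of names, it stays fast even when the TF-IDF matrix itself is large, which fits the plan of computing that matrix once up front.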


In [None]:
# Determine the sentiment associated with the text column in a row number between 0 and 42785 for the AWS dataset.
text_sentiment = df.iloc[42785]['text']
text_sentiment = TextBlob(text_sentiment)
text_sentiment.sentiment

Sentiment(polarity=0.05142292490118579, subjectivity=0.34561264822134385)

In [None]:
def get_wikipedia_content(names): # This line defines a function called get_wikipedia_content that accepts one argument: names, expected to be a single page title as a string.
    wikip = wikipediaapi.Wikipedia(user_agent = 'famous') # Use the wikipediaapi client; no module named wikipedia is imported in this notebook.
    page = wikip.page(names) # This is the core of the function: it looks up the Wikipedia page corresponding to the provided title and assigns it to the variable page.
    if page.exists(): # wikipediaapi reports a missing page through exists() rather than by raising an exception.
        return page.text
    return None  # If the page is not found, return None
print(names) # This prints the plain Python list 'names', but what was needed was the pandas Series "names_series", not the list

['Kenneth Thomas', 'Don Schultz', 'Tamara Golovey', 'Yanko Yanev', 'Michael Khodarkovsky', 'James Eade', 'Irma Arguello', 'Randy Bauer', 'John Large', 'Michael R. Anastasio', 'Bruce G. Blair']


In [None]:
wiki_names = list(names) # Creates a copy of the names list (Kenneth Thomas and his 10 neighbors) and assigns it to wiki_names

wikip = wikipediaapi.Wikipedia(user_agent='Famous_Peep') # This line initializes the wikipediaapi client. The user_agent 'Famous_Peep' is a string identifying this script to Wikipedia's servers.

# Loop through wiki_names to get text for each person
wiki_texts = {}  # Dictionary to store Wikipedia text for each person
for person in wiki_names: # Starts a loop that iterates through each person (name) in the wiki_names list.
    try: # This code block handles errors.
        page_ex = wikip.page(person) # Looks up the page holding the biographical text
        if page_ex.exists(): # Checks to see if the page exists
            wiki_texts[person] = page_ex.text # If the page exists, this line stores the text content of the page (page_ex.text) in the wiki_texts dictionary, using the person's name as the key
        else: # If the page does not exist, this block is executed.
            print(f"Wikipedia page not found for: {person}")
    except Exception as e:
        print(f"Error fetching Wikipedia page for {person}: {e}")
#print(wiki_texts) # Print the wiki_texts dictionary

wiki_df = pd.DataFrame(list(wiki_texts.items()), columns=['Person', 'WikiText'])
# Creates a DataFrame named wiki_df using the pd.DataFrame() constructor; wiki_texts.items() provides the data as (key, value) pairs from the wiki_texts dictionary (keys are person names, values are Wikipedia text content); columns=['Person', 'WikiText'] specifies the column names.
print(wiki_df)

## Future Directions:

In [None]:
# Compare this output for row 42 with the earlier output for row 42, which was produced without cleaning
# BEYOND MVP - consider French translation and compare again...