<a href="https://colab.research.google.com/github/RioAccountant/Project_5/blob/main/Project_5_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing



This project will give you practical experience using Natural Language Processing techniques. This project is in three parts:
- in part 1) you will use a traditional dataset in a CSV file
- in part 2) you will use the Wikipedia API to directly access content on Wikipedia.
- in part 3) you will make your notebook interactive


### Part 1)



- The CSV file is available at https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv
- The file contains a list of famous people and a brief overview.
- The goal of part 1) is to provide the capability to
  - Take one person from the list as input and output the 10 other people whose overviews are "closest" to that person in a Natural Language Processing sense
  - Also output the sentiment of the overview of the person



### Part 2)



- For the same person from step 1), use the Wikipedia API to access the whole content of that person's Wikipedia page.
- The goal of part 2) is to produce the capability to:
  1. For that Wikipedia page determine the sentiment of the entire page
  1. Print out the Wikipedia article
  1. Collect the Wikipedia pages from the 10 nearest neighbors in Step 1)
  1. Determine the nearness ranking of these 10 to your main subject based on their entire Wikipedia page
  1. Compare the nearest ranking from Step 1) with the Wikipedia page nearness ranking



### Part 3)


Make an interactive notebook.

In addition to presenting the project slides, at the end of the presentation each student will demonstrate their code using a famous person suggested by the other students that exists in the DBpedia set.
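
One way to make the demo robust to names suggested on the spot is to look the name up case-insensitively and offer near matches when there is no exact hit. The sketch below is a minimal illustration using only the standard library; `find_person` and the three-name list are hypothetical stand-ins for a lookup against the `name` column, and the result could be wired to `input()` or a notebook widget.

```python
import difflib


def find_person(query, names, max_suggestions=3):
    """Return (exact_match_or_None, list_of_close_suggestions)."""
    # Map lowercase names back to their original spelling
    by_lower = {n.lower(): n for n in names}
    key = query.strip().lower()
    if key in by_lower:
        return by_lower[key], []
    # No exact hit: fall back to fuzzy matching on the lowercase names
    close = difflib.get_close_matches(key, list(by_lower), n=max_suggestions, cutoff=0.6)
    return None, [by_lower[c] for c in close]


# Hypothetical short list standing in for the full name column
names = ["Silvino Francisco", "Kirk Stevens", "Steve Davis"]
print(find_person("silvino francisco", names))
print(find_person("Kirk Stevans", names))
```

An exact match can then be passed straight to the Part 1 and Part 2 steps, while the suggestions give the audience a chance to correct a misspelling.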


## Import Libraries (Essential)

In [None]:
import numpy as np
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import NearestNeighbors
pd.options.display.max_columns = 100

In [None]:
%%capture
!python -m textblob.download_corpora

In [None]:
%%capture
!pip3 install wikipedia-api

In [None]:
import nltk
# nltk.download('omw-1.4')
nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
import wikipediaapi

## Pull in & Review Data

In [None]:
#Read in the data
url="https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv"

In [None]:
#This is not the original sample because the random seed was not set. See the indices below for the original selection.
df = pd.read_csv(url)
df.sample(12)

Unnamed: 0,URI,name,text
26143,<http://dbpedia.org/resource/Philip_Purser-Hal...,Philip Purser-Hallard,philip purserhallard born 1971 as philip halla...
14509,<http://dbpedia.org/resource/C.H._Greenblatt>,C.H. Greenblatt,carl harvey ch greenblatt born june 17 1972 is...
26982,<http://dbpedia.org/resource/William_Hartung>,William Hartung,william d hartung born 7 june 1955 is director...
32372,<http://dbpedia.org/resource/O._Vincent_Haleck>,O. Vincent Haleck,otto vincent haleck jr born january 19 1949 ha...
33724,<http://dbpedia.org/resource/Spiro_Zavos>,Spiro Zavos,spiro zavos born in 1937 in wellington new zea...
41141,<http://dbpedia.org/resource/Gwendolyn_Ecleo>,Gwendolyn Ecleo,gwendolyn buray ecleo mdmg born august 25 1974...
14756,<http://dbpedia.org/resource/Mikhail_Simonyan>,Mikhail Simonyan,mikhail simonyan born 1986 is a violinist from...
25761,<http://dbpedia.org/resource/Tony_Anselmo>,Tony Anselmo,tony anselmo born february 18 1960 is an ameri...
33773,<http://dbpedia.org/resource/Richard_Petty>,Richard Petty,richard lee petty born july 2 1937 nicknamed t...
14057,<http://dbpedia.org/resource/Bucky_Jacobsen>,Bucky Jacobsen,larry william bucky jacobsen born august 30 19...
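
The sample above changes on every run because no seed was fixed. The sketch below illustrates the idea with the standard library only: a generator created from the same seed always yields the same row positions. The seed value and the `sample_rows` helper are illustrative assumptions; in pandas, the equivalent is passing `random_state` to `DataFrame.sample`.

```python
import random

N_ROWS = 42786  # number of rows in the dataset
SEED = 42       # arbitrary illustrative seed


def sample_rows(seed, k=12):
    """Draw k distinct row positions using a dedicated, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(range(N_ROWS), k)


first = sample_rows(SEED)
second = sample_rows(SEED)
# Same seed, same positions: the selection can be reproduced later
print(first == second)
```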


In [None]:
df.info()

In [None]:
df.shape

(42786, 3)

## Part 1.1 - Pick 1 and find the 10 closest (in order of proximity)

In [None]:
#Pick my sample from the original randomly generated list at the beginning
df.iloc[32608]['text']
#Silvino Francisco; what do I see? All lowercase and already cleaned, ready for a bag of words ('BOW').

'silvino francisco born 3 may 1946 is a retired south african professional snooker player who won the south african snooker championship 4 timesfrancisco comes from a snookerplaying family his brother mannie and nephew peter both played at a high level mannie having been a runnerup in the world amateur billiards championship on several occasions and peter having risen to the world ranking of number 14francisco won the 1985 british open beating kirk stevens 129 afterwards he accused stevens of playing under the influence of drugs and was fined and penalised ranking points when stevens admitted a drugs problem the penalty was reversedhe was involved in another scandal after the 1989 masters after losing 51 to terry griffiths in the last16 it was discovered that there had been heavy betting on that exact score francisco was arrested but later released without charge gambling problems followed to the extent of being declared bankrupt in 1996 due to income tax arrears having split up from h

## Part 1.2 - Sentiment analysis of selection

In [None]:
#Sentiment analysis on Indv.
text_sentiment = df.iloc[32608]['text']
text_sentiment = TextBlob(text_sentiment)
text_sentiment.sentiment #score
print(f"{text_sentiment.sentiment} was the score for Silvino's text.")

Sentiment(polarity=-0.004333333333333333, subjectivity=0.1743333333333333) was the score for Silvino's text.
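
TextBlob's default analyzer reports polarity on a scale from -1.0 (negative) to 1.0 (positive) and subjectivity from 0.0 (objective) to 1.0 (subjective), so the score above reads as essentially neutral and fairly objective. A minimal sketch of turning the polarity number into a label follows; the `label_polarity` helper and its 0.05 threshold are illustrative assumptions, not part of TextBlob.

```python
def label_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse label.

    The threshold is an arbitrary illustrative cutoff, not a TextBlob setting.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity reported for the overview text above
print(label_polarity(-0.004333333333333333))
```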


In [None]:
#Build the bag-of-words count matrix
vectorizer = CountVectorizer(stop_words='english')
bow_vec= vectorizer.fit_transform(df['text'])
bow_vec
#returns a sparse bag-of-words matrix for all ~42K overviews, related or not

<42786x437190 sparse matrix of type '<class 'numpy.int64'>'
	with 5847547 stored elements in Compressed Sparse Row format>
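
To make the shape above concrete: each row is one overview and each column is one distinct word, with the cell holding how often that word occurs. The sketch below builds the same kind of matrix by hand for three made-up sentences. It is only an approximation of what `CountVectorizer` does; the stop-word set here is a tiny illustrative subset, and the tokenizer is a plain whitespace split that drops one-character tokens.

```python
from collections import Counter

# Tiny illustrative stop-word list; scikit-learn's 'english' list is much longer
STOP_WORDS = {"a", "an", "the", "is", "who", "of", "in"}

# Made-up documents for illustration
docs = [
    "a snooker player who won the british open",
    "a snooker player born in 1958",
    "a welsh guitarist",
]


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words and one-character tokens."""
    return [w for w in text.lower().split() if w not in STOP_WORDS and len(w) > 1]


# Vocabulary: every distinct token, sorted so that column order is stable
vocab = sorted({w for d in docs for w in tokenize(d)})

# Count matrix: one row per document, one column per vocabulary term
counts = []
for d in docs:
    c = Counter(tokenize(d))
    counts.append([c[w] for w in vocab])

print(vocab)
for row in counts:
    print(row)
```

Most cells are zero even in this toy case, which is why the real matrix is stored in a sparse format.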

**TF-IDF Transformer**


In [None]:
#tf-idf
tfidf = TfidfTransformer()
tfidf_vec = tfidf.fit_transform(bow_vec)
tfidf_vec
#42786 rows by 437190 columns: one column per unique word across all of the documents; sparse because zeros are not stored
#notice that the shape and the number of stored elements match the count matrix above

<42786x437190 sparse matrix of type '<class 'numpy.float64'>'
	with 5847547 stored elements in Compressed Sparse Row format>
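
The weighting itself can be reproduced by hand. With its default settings, `TfidfTransformer` uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies each count by it, and then scales every row to unit Euclidean length. The sketch below applies those steps to a made-up 3x3 count matrix; the numbers are purely illustrative.

```python
import math

# Hypothetical count matrix: 3 documents x 3 terms
counts = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each term
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

tfidf_rows = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    # L2-normalize so every document vector has length 1
    tfidf_rows.append([v / norm if norm else 0.0 for v in weighted])

for row in tfidf_rows:
    print([round(v, 3) for v in row])
```

In the first row the term that appears in only one document ends up with more weight than the term shared by two documents, which is the point of the idf factor.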

In [None]:
#USED FOR TESTING ONLY
#bow_vec.toarray() - DO NOT RUN on the full matrix or the session will crash; this was only done for the first n=10 test rows
#next step: convert to tf-idf

In [None]:
#nearest neighbor
nn = NearestNeighbors().fit(tfidf_vec)
nn

In [None]:
#distances from row 32608; ask for 11 neighbors because the first result is always the reference doc itself (distance 0)
distances,indices= nn.kneighbors(
X=tfidf_vec[32608],
n_neighbors=11)
distances

array([[0.        , 1.19749265, 1.26665982, 1.27209211, 1.27389995,
        1.28239307, 1.28974116, 1.29881506, 1.30087229, 1.30312716,
        1.31716019]])
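
These distances sit in a narrow band below about 1.41 for a reason. Assuming the defaults are in effect (unit-length TF-IDF rows and Euclidean distance in `NearestNeighbors`), the squared distance between two documents equals 2 - 2·cos, so each distance maps directly to a cosine similarity, and two documents with no words in common are sqrt(2) apart. A small sketch of the conversion:

```python
import math

# Distances reported above (the first entry is the reference document itself)
distances_row = [0.0, 1.19749265, 1.26665982, 1.27209211, 1.27389995,
                 1.28239307, 1.28974116, 1.29881506, 1.30087229, 1.30312716,
                 1.31716019]


def cosine_from_euclidean(d):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - (d * d) / 2.0


for d in distances_row:
    print(f"distance {d:.4f} -> cosine similarity {cosine_from_euclidean(d):.4f}")

print(f"distance between documents sharing no words: {math.sqrt(2):.4f}")
```

Read this way, even the closest neighbor shares only a modest fraction of its weighted vocabulary with the reference overview.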

In [None]:
#indices of the NN for the next step
indices

array([[ 0,  1,  3,  4,  8, 10,  2,  9,  6,  5,  7]])

In [None]:
#Sentiment analysis of the name string alone (the full Wikipedia text is scored in Part 2.3)
text_sentiment = 'Silvino Francisco'
text_sentiment = TextBlob(text_sentiment)
text_sentiment.sentiment #score
print(f"{text_sentiment.sentiment} was the score for Silvino's name string.")

Sentiment(polarity=0.0, subjectivity=0.0) was the score for Silvino's name string.


## Part 2.0 - Locate and import the Wikipedia page; extract the text content in its entirety.

In [None]:
#Pull Francisco's page from wikipedia - https://en.wikipedia.org/wiki/Silvino_Francisco
topic = 'Silvino_Francisco'
wikip = wikipediaapi.Wikipedia(user_agent = 'whomever')
page_ex = wikip.page(topic)
wiki_text = page_ex.text
wiki_text

"Silvino Francisco (born 3 May 1946) is a South African former professional snooker player who won the 1985 British Open.\n\nSnooker career\nFrancisco comes from a snooker-playing family. His brother Manuel and nephew Peter both played at a high level, Manuel having been a runner-up in the World Amateur Billiards Championship on several occasions, and Peter having risen to the world ranking of number 14.\nFrancisco won the 1985 British Open, beating Kirk Stevens 12–9. Prior to the start of the Final match, Francisco accused Stevens of playing under the influence of drugs. Francisco was subsequently fined for the comments. The world governing body of snooker, the WPBSA, accepted that the accusation was false and it is on record that Kirk Stevens has never failed a drugs test in the history of his career. Stevens later admitted to having an addiction to cocaine.\nHe was involved in another scandal after the 1989 Masters. After losing 5–1 to Terry Griffiths in the last-16, it was discover

## Part 2.1 - Output 10 others

In [None]:
#Show the rows for the 10 others, for reference in the Wikipedia search
df.iloc[[28059, 34896, 9094, 30345, 37392, 23037, 22521, 33973, 7443, 19498]]


Unnamed: 0,URI,name,text
28059,<http://dbpedia.org/resource/Kirk_Stevens>,Kirk Stevens,kirk stevens born august 17 1958 is a canadian...
34896,<http://dbpedia.org/resource/Barry_Hawkins>,Barry Hawkins,barry hawkins born 23 april 1979 is an english...
9094,<http://dbpedia.org/resource/Mark_Wildman>,Mark Wildman,mark wildman born 25 january 1936 is an englis...
30345,<http://dbpedia.org/resource/Steve_Davis>,Steve Davis,steve davis obe born 22 august 1957 is an engl...
37392,<http://dbpedia.org/resource/Ray_Stevens_(poli...,Ray Stevens (politician),raymond alexander ray stevens mp born 1 februa...
23037,<http://dbpedia.org/resource/Michael_Stevens_(...,Michael Stevens (footballer),michael stevens born 7 november 1980 is a form...
22521,<http://dbpedia.org/resource/Alun_Davies_(guit...,Alun Davies (guitarist),alun davies born 1943 is a welsh guitarist stu...
33973,<http://dbpedia.org/resource/Mark_Bennett_(sno...,Mark Bennett (snooker player),mark bennett born september 23 1963 is a forme...
7443,<http://dbpedia.org/resource/Dave_Stevens_(ath...,Dave Stevens (athlete),dave stevens born january 12 1966 is an athlet...
19498,<http://dbpedia.org/resource/Patrick_Wallace>,Patrick Wallace,patrick wallace born september 20 1969 is a fo...


In [None]:
#Collect the names (the subject plus its neighbors) to use as Wikipedia page titles
names = []
for i in indices[0]:
  names.append(df.iloc[i]['name'])
pd.Series(names)
#print(pd.Series(names))


Unnamed: 0,0
0,Silvino Francisco
1,Kirk Stevens
2,Barry Hawkins
3,Mark Wildman
4,Steve Davis
5,Ray Stevens (politician)
6,Michael Stevens (footballer)
7,Alun Davies (guitarist)
8,Mark Bennett (snooker player)
9,Dave Stevens (athlete)


In [None]:
# Pull the full Wikipedia text for the subject and each neighbor, using the names above as page titles
wikip = wikipediaapi.Wikipedia(user_agent = 'whomever')  # create the client once, outside the loop
subsidiary = []
for name in names:
  page_ex = wikip.page(name)
  subsidiary.append(page_ex.text)
pd.Series(subsidiary)

Unnamed: 0,0
0,Silvino Francisco (born 3 May 1946) is a South...
1,"Kirk Stevens (born August 17, 1958) is a Canad..."
2,Barry Hawkins (born 23 April 1979) is an Engli...
3,Markham Wildman (born 25 January 1936) is an E...
4,Steve Davis (born 22 August 1957) is an Engli...
5,Raymond Alexander Stevens (born 1 February 195...
6,Michael Stevens (born 7 November 1980) is a fo...
7,Alun Davies (born 27 July 1942) is a Welsh gui...
8,Mark Bennett (born 23 September 1963) is a Wel...
9,David or Dave Stevens may refer to:\n\nDavid S...
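
The last row above is not a biography: "Dave Stevens (athlete)" resolved to a disambiguation page ("David or Dave Stevens may refer to:"), so its text is not comparable to the others. The sketch below is a rough heuristic for flagging such pages before they enter the ranking. `looks_like_disambiguation` is a hypothetical helper, and the phrase check is an assumption about how these pages usually open, not a guaranteed rule.

```python
def looks_like_disambiguation(page_text, head_chars=200):
    """Heuristic: disambiguation pages usually open with '... may refer to:'."""
    head = page_text[:head_chars].lower()
    return "may refer to:" in head or "may also refer to:" in head


# Openings copied from the (truncated) output above
samples = {
    "Kirk Stevens": "Kirk Stevens (born August 17, 1958) is a Canad...",
    "Dave Stevens (athlete)": "David or Dave Stevens may refer to:\n\nDavid S...",
}
for name, text in samples.items():
    print(name, looks_like_disambiguation(text))
```

A flagged name could then be retried with a more specific page title or excluded from the comparison.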


In [None]:
#Join the two sets of info into a new "mini" DF
names_series = pd.Series(names, name='IndvName')
subsidiary_series = pd.Series(subsidiary, name='Blurb', index=names_series.index)

# Convert Series to DataFrames with a single column
names_df = names_series.to_frame()
subsidiary_df = subsidiary_series.to_frame()

# Concatenate DataFrames side by side, as columns (axis=1)
mini_df = pd.concat([names_df, subsidiary_df], axis=1)



In [None]:
#Confirm it worked
mini_df

Unnamed: 0,IndvName,Blurb
0,Silvino Francisco,Silvino Francisco (born 3 May 1946) is a South...
1,Kirk Stevens,"Kirk Stevens (born August 17, 1958) is a Canad..."
2,Barry Hawkins,Barry Hawkins (born 23 April 1979) is an Engli...
3,Mark Wildman,Markham Wildman (born 25 January 1936) is an E...
4,Steve Davis,Steve Davis (born 22 August 1957) is an Engli...
5,Ray Stevens (politician),Raymond Alexander Stevens (born 1 February 195...
6,Michael Stevens (footballer),Michael Stevens (born 7 November 1980) is a fo...
7,Alun Davies (guitarist),Alun Davies (born 27 July 1942) is a Welsh gui...
8,Mark Bennett (snooker player),Mark Bennett (born 23 September 1963) is a Wel...
9,Dave Stevens (athlete),David or Dave Stevens may refer to:\n\nDavid S...


## Part 2.2 Clean the Wikipedia text for sentiment analysis using string replacement.

### 2.2a First strip the Blurbs down to the essence

In [None]:
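def clean_text(blurb):
    """Lowercase a blurb and strip newlines and selected punctuation."""
    return (
        blurb.lower()
        .replace("\n", " ")   # newlines become spaces
        .replace("'s", "")    # drop possessive endings
        .replace("'", "")     # drop any remaining apostrophes
        .replace("(", "")
        .replace(")", "")
        .replace('"', "")
        .replace(":", " ")
        .replace("<", " ")
        .replace("-", "")
        .replace(",", "")
    )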
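
A chain of `replace` calls only removes the characters it lists, so periods and en dashes survive. As an alternative sketch (not the method used in this notebook), a few regular expressions can remove all punctuation at once and collapse the leftover whitespace. `clean_text_regex` is a hypothetical name, and the sample sentence is made up.

```python
import re


def clean_text_regex(blurb):
    """Lowercase, drop possessive 's, remove punctuation, collapse whitespace."""
    text = blurb.lower()
    text = re.sub(r"'s\b", "", text)     # drop possessive endings
    text = re.sub(r"[^\w\s]", "", text)  # remove anything that is not a word character or whitespace
    text = re.sub(r"\s+", " ", text)     # collapse newlines and repeated spaces
    return text.strip()


print(clean_text_regex("Francisco's brother (Manuel) played snooker.\nHe won 12-9."))
```

The trade-off is that removing every punctuation mark also fuses scores such as "12-9" into "129" and erases sentence boundaries, which matters if a later step relies on sentences.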

### 2.2b Validate Processing for String Replacement Method

In [None]:
#Test Subject Only
clean_text(mini_df.iloc[0]['Blurb'])

'silvino francisco born 3 may 1946 is a south african former professional snooker player who won the 1985 british open.  snooker career francisco comes from a snookerplaying family. his brother manuel and nephew peter both played at a high level manuel having been a runnerup in the world amateur billiards championship on several occasions and peter having risen to the world ranking of number 14. francisco won the 1985 british open beating kirk stevens 12–9. prior to the start of the final match francisco accused stevens of playing under the influence of drugs. francisco was subsequently fined for the comments. the world governing body of snooker the wpbsa accepted that the accusation was false and it is on record that kirk stevens has never failed a drugs test in the history of his career. stevens later admitted to having an addiction to cocaine. he was involved in another scandal after the 1989 masters. after losing 5–1 to terry griffiths in the last16 it was discovered that there had

In [None]:
#Apply to ALL rows
mini_df['Blurb'] = mini_df['Blurb'].apply(clean_text)
mini_df

Unnamed: 0,IndvName,Blurb
0,Silvino Francisco,silvino francisco born 3 may 1946 is a south a...
1,Kirk Stevens,kirk stevens born august 17 1958 is a canadian...
2,Barry Hawkins,barry hawkins born 23 april 1979 is an english...
3,Mark Wildman,markham wildman born 25 january 1936 is an eng...
4,Steve Davis,steve davis born 22 august 1957 is an english...
5,Ray Stevens (politician),raymond alexander stevens born 1 february 1953...
6,Michael Stevens (footballer),michael stevens born 7 november 1980 is a form...
7,Alun Davies (guitarist),alun davies born 27 july 1942 is a welsh guita...
8,Mark Bennett (snooker player),mark bennett born 23 september 1963 is a welsh...
9,Dave Stevens (athlete),david or dave stevens may refer to david ste...


## Part 2.3 Sentiment Analysis for Individual Comparison

In [None]:
#Sentiment analysis on Indv.
text_sentiment2 = mini_df.iloc[0]['Blurb']
text_sentiment2 = TextBlob(text_sentiment2)
text_sentiment2.sentiment #score
print(f"{text_sentiment2.sentiment} was the score for Silvino's WIKI text.")

Sentiment(polarity=-0.013518518518518522, subjectivity=0.21537037037037032) was the score for Silvino's WIKI text.


### Configure the Bag-of-Words and TF-IDF Matrices

In [None]:
#Perform the count trans on the mini DF-notes say NOT to do this, same result as above though
vectorizer = CountVectorizer(stop_words='english')
bow_vec2= vectorizer.fit_transform(mini_df.iloc[:]['Blurb'])
bow_vec2
#returns a bag of words for neighbors, related or not, for the mini_df-sparse matrix created



<11x2269 sparse matrix of type '<class 'numpy.int64'>'
	with 3581 stored elements in Compressed Sparse Row format>

In [None]:
#tf-idf
tfidf2 = TfidfTransformer()
tfidf_vec2 = tfidf2.fit_transform(bow_vec2)
tfidf_vec2
#11 rows by 2269 columns: one column per unique word across the 11 pages; sparse because zeros are not stored
#notice that the shape and the number of stored elements match the count matrix above

<11x2269 sparse matrix of type '<class 'numpy.float64'>'
	with 3581 stored elements in Compressed Sparse Row format>

### Find the nearest neighbors within the set.

In [None]:
#nearest neighbor
nn2 = NearestNeighbors().fit(tfidf_vec2)
nn2


## Part 2.4 Rankings of Proximity - Compare

In [None]:
#distances from row 0 (the subject); ask for 11 neighbors because the first result is always the reference doc itself (distance 0)
distances2,indices2= nn2.kneighbors(
X=tfidf_vec2[0],
n_neighbors=11)
distances2

array([[0.        , 1.10953878, 1.2565779 , 1.27730113, 1.30713706,
        1.31257669, 1.33043892, 1.35589003, 1.35922709, 1.36001405,
        1.36019251]])

In [None]:
#indices2-to order the relationships of proximity
indices2

array([[ 1,  0,  4,  3, 10,  9,  6,  8,  5,  2,  7]])

In [None]:
#Compare results 1 versus 2
print(indices)
print(indices2)

[[ 0  1  3  4  8 10  2  9  6  5  7]]
[[ 1  0  4  3 10  9  6  8  5  2  7]]


In [None]:
print(indices)
print(indices2, mini_df['IndvName'])

[[ 0  1  3  4  8 10  2  9  6  5  7]]
[[ 1  0  4  3 10  9  6  8  5  2  7]] 0                 Silvino Francisco
1                      Kirk Stevens
2                     Barry Hawkins
3                      Mark Wildman
4                       Steve Davis
5          Ray Stevens (politician)
6      Michael Stevens (footballer)
7           Alun Davies (guitarist)
8     Mark Bennett (snooker player)
9            Dave Stevens (athlete)
10                  Patrick Wallace
Name: IndvName, dtype: object
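
Positions returned by `kneighbors` refer to rows of whichever matrix the model was fitted on, so translating both results to names before comparing keeps the two rankings on a common footing. The sketch below shows one way to report how far each person moves between two orderings and to sum those moves into a single number (the Spearman footrule). `rank_shifts` is a hypothetical helper, the first list uses four names from the table above, and the second ordering is made up purely for illustration.

```python
def rank_shifts(first, second):
    """For each name, return (position in first, position in second, shift)."""
    pos_second = {name: i for i, name in enumerate(second)}
    return {name: (i, pos_second[name], pos_second[name] - i)
            for i, name in enumerate(first)}


# Order from the overview-based neighbors (names as listed above)
overview_rank = ["Kirk Stevens", "Barry Hawkins", "Mark Wildman", "Steve Davis"]
# Hypothetical page-based order, made up purely for illustration
page_rank = ["Kirk Stevens", "Steve Davis", "Mark Wildman", "Barry Hawkins"]

shifts = rank_shifts(overview_rank, page_rank)
for name, (a, b, s) in shifts.items():
    print(f"{name:15s} overview #{a + 1}  page #{b + 1}  shift {s:+d}")

# Spearman footrule: total absolute displacement (0 means identical order)
footrule = sum(abs(s) for _, _, s in shifts.values())
print("footrule distance:", footrule)
```

A footrule of 0 would mean the overview-based and page-based rankings agree exactly; larger values mean the full Wikipedia pages reorder the neighbors more.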
