<a href="https://colab.research.google.com/github/SPlearning27/DDDS-My-Projects/blob/main/Project-5/%20SP_Project5__NLP_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing



This project will give you practical experience using Natural Language Processing techniques. This project is in three parts:
- in part 1) you will use a dataset in a CSV file
- in part 2) you will use the Wikipedia API to directly access content on Wikipedia.
- in part 3) you will make your notebook interactive


In [None]:
import numpy as np
import pandas as pd
import re

from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import NearestNeighbors

pd.options.display.max_columns = 100

import nltk
# nltk.download('omw-1.4')
nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
%%capture
!python -m textblob.download_corpora


# Part 1)



- The CSV file is available at https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv
- The file contains a list of famous people and a brief overview.
- The goal of part 1) is to ...
  1. Pick one person from the list (the target person) and output 10 other people whose overviews are "closest" to the target person's in a Natural Language Processing sense
  1. Also output the sentiment of the overview of the target person



In [None]:
url = "https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv"
!curl -s -I {url}

HTTP/1.1 200 OK
x-amz-id-2: rsbVVkt7ltMUT91TWfBS733Vgsr1X1FvZUmp1ryqLoF1IQoUIaG8GGMEGJh6e+iuwws4S6X2dfs=
x-amz-request-id: 07K02KM9RF9VGHAE
Date: Tue, 15 Jul 2025 20:13:09 GMT
Last-Modified: Mon, 23 Oct 2023 18:30:29 GMT
ETag: "f4f3b3bc07aa7b6dddb6e383fc52ae64-10"
x-amz-server-side-encryption: AES256
Accept-Ranges: bytes
Content-Type: text/csv
Content-Length: 83886080
Server: AmazonS3



In [None]:
!curl -s -O {url}

In [None]:
ls -la

total 81936
drwxr-xr-x 1 root root     4096 Jul 15 20:13 ./
drwxr-xr-x 1 root root     4096 Jul 15 17:58 ../
drwxr-xr-x 4 root root     4096 Jul 14 13:37 .config/
-rw-r--r-- 1 root root 83886080 Jul 15 20:13 NLP.csv
drwxr-xr-x 1 root root     4096 Jul 14 13:37 sample_data/


In [None]:
!head -1 NLP.csv | tr , '\n' | cat -n

     1	URI
     2	name
     3	text


In [None]:
og_npl_data = pd.read_csv(url)
og_npl_data

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...
...,...,...,...
42781,<http://dbpedia.org/resource/Motoaki_Takenouchi>,Motoaki Takenouchi,motoaki takenouchi born july 8 1967 saitama pr...
42782,<http://dbpedia.org/resource/Alan_Judge_(footb...,"Alan Judge (footballer, born 1960)",alan graham judge born 14 may 1960 is a retire...
42783,<http://dbpedia.org/resource/Eduardo_Lara>,Eduardo Lara,eduardo lara lozano born 4 september 1959 in c...
42784,<http://dbpedia.org/resource/Tatiana_Faberg%C3...,Tatiana Faberg%C3%A9,tatiana faberg is an author and faberg scholar...


In [None]:
og_npl_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42786 entries, 0 to 42785
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URI     42786 non-null  object
 1   name    42786 non-null  object
 2   text    42786 non-null  object
dtypes: object(3)
memory usage: 1002.9+ KB


## Data Cleaning

- Remove parenthetical qualifiers and special characters from the 'name' column
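
The cleaning cells in this section keep only ASCII letters and whitespace, so a percent-encoded name such as 'Tatiana Faberg%C3%A9' ends up as 'Tatiana FabergCA' and hyphenated names lose their hyphen. The sketch below is one possible alternative, not the method used in this notebook: it decodes the percent-encoding first, folds accents to plain letters, and keeps hyphens. The helper name `clean_name` is hypothetical.

```python
import re
import unicodedata
from urllib.parse import unquote


def clean_name(raw: str) -> str:
    """Hypothetical helper: decode, drop qualifiers, fold accents, keep letters/hyphens."""
    name = unquote(raw)                          # decode percent-encoding such as %C3%A9
    name = re.sub(r'\(.*?\)', '', name)          # drop qualifiers like "(footballer, born 1960)"
    name = unicodedata.normalize('NFKD', name)   # split accented letters into base + accent
    name = ''.join(ch for ch in name if not unicodedata.combining(ch))
    name = re.sub(r'[^a-zA-Z\s-]', '', name)     # keep ASCII letters, whitespace and hyphens
    return re.sub(r'\s+', ' ', name).strip()     # collapse leftover whitespace


for raw in ['Tatiana Faberg%C3%A9',
            'Alan Judge (footballer, born 1960)',
            'Peter Rhoades-Brown']:
    print(f"{raw!r} -> {clean_name(raw)!r}")
```

Applied to the dataframe, this could replace the two regex steps with a single `npl_data_clean_temp1['name'].apply(clean_name)` call.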

In [None]:
npl_data_clean_temp1 = og_npl_data.copy()

In [None]:
# Apply the regex replacement to the 'name' column
npl_data_clean_temp1['name'] = npl_data_clean_temp1['name'].str.replace(r'\(.*?\)', '', regex=True).str.strip()
print(npl_data_clean_temp1)

                                                     URI  \
0            <http://dbpedia.org/resource/Digby_Morrell>   
1           <http://dbpedia.org/resource/Alfred_J._Lewy>   
2            <http://dbpedia.org/resource/Harpdog_Brown>   
3      <http://dbpedia.org/resource/Franz_Rottensteiner>   
4                   <http://dbpedia.org/resource/G-Enka>   
...                                                  ...   
42781   <http://dbpedia.org/resource/Motoaki_Takenouchi>   
42782  <http://dbpedia.org/resource/Alan_Judge_(footb...   
42783         <http://dbpedia.org/resource/Eduardo_Lara>   
42784  <http://dbpedia.org/resource/Tatiana_Faberg%C3...   
42785       <http://dbpedia.org/resource/Kenneth_Thomas>   

                       name                                               text  
0             Digby Morrell  digby morrell born 10 october 1979 is a former...  
1            Alfred J. Lewy  alfred j lewy aka sandy lewy graduated from un...  
2             Harpdog Brown  harpdog

In [None]:
# Match every character that is not an ASCII letter or whitespace
pattern = r'[^a-zA-Z\s]'

# Remove the matched characters, then collapse any runs of whitespace left behind
npl_data_clean_temp1['name'] = (
    npl_data_clean_temp1['name']
    .str.replace(pattern, '', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)

print("\nDataFrame after cleaning 'name' column:")
print(npl_data_clean_temp1)



DataFrame after cleaning 'name' column:
                                                     URI                 name  \
0            <http://dbpedia.org/resource/Digby_Morrell>        Digby Morrell   
1           <http://dbpedia.org/resource/Alfred_J._Lewy>        Alfred J Lewy   
2            <http://dbpedia.org/resource/Harpdog_Brown>        Harpdog Brown   
3      <http://dbpedia.org/resource/Franz_Rottensteiner>  Franz Rottensteiner   
4                   <http://dbpedia.org/resource/G-Enka>                GEnka   
...                                                  ...                  ...   
42781   <http://dbpedia.org/resource/Motoaki_Takenouchi>   Motoaki Takenouchi   
42782  <http://dbpedia.org/resource/Alan_Judge_(footb...           Alan Judge   
42783         <http://dbpedia.org/resource/Eduardo_Lara>         Eduardo Lara   
42784  <http://dbpedia.org/resource/Tatiana_Faberg%C3...     Tatiana FabergCA   
42785       <http://dbpedia.org/resource/Kenneth_Thomas>       Kenne

In [None]:
# Check nulls in the columns
npl_data_clean_temp1.isnull().sum()

Unnamed: 0,0
URI,0
name,0
text,0


In [None]:
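# Series.replace only matches whole cell values, so chain .str.replace to remove substrings
npl_data_clean_temp2 = npl_data_clean_temp1['text'].astype(str).str.replace(',', '').str.replace("\n", " ").str.replace("'s", '').str.replace("'", '').str.strip()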

print(npl_data_clean_temp2)

0        digby morrell born 10 october 1979 is a former...
1        alfred j lewy aka sandy lewy graduated from un...
2        harpdog brown is a singer and harmonica player...
3        franz rottensteiner born in waidmannsfeld lowe...
4        henry krvits born 30 december 1974 in tallinn ...
                               ...                        
42781    motoaki takenouchi born july 8 1967 saitama pr...
42782    alan graham judge born 14 may 1960 is a retire...
42783    eduardo lara lozano born 4 september 1959 in c...
42784    tatiana faberg is an author and faberg scholar...
42785    kenneth thomas born february 24 1938 was chief...
Name: text, Length: 42786, dtype: object


## Convert to TextBlob

In [None]:
%%capture
# Install textblob
!pip install -U textblob


In [None]:
%%capture
!python -m textblob.download_corpora


In [None]:
# Make a textblob
npl_text_blob = npl_data_clean_temp2.apply(lambda text: TextBlob(text))

In [None]:
npl_text_blob

Unnamed: 0,text
0,"(d, i, g, b, y, , m, o, r, r, e, l, l, , b, ..."
1,"(a, l, f, r, e, d, , j, , l, e, w, y, , a, ..."
2,"(h, a, r, p, d, o, g, , b, r, o, w, n, , i, ..."
3,"(f, r, a, n, z, , r, o, t, t, e, n, s, t, e, ..."
4,"(h, e, n, r, y, , k, r, v, i, t, s, , b, o, ..."
...,...
42781,"(m, o, t, o, a, k, i, , t, a, k, e, n, o, u, ..."
42782,"(a, l, a, n, , g, r, a, h, a, m, , j, u, d, ..."
42783,"(e, d, u, a, r, d, o, , l, a, r, a, , l, o, ..."
42784,"(t, a, t, i, a, n, a, , f, a, b, e, r, g, , ..."


In [None]:
npl_text_blob[42782]

TextBlob("alan graham judge born 14 may 1960 is a retired professional footballer who is the seventh oldest player to play in the football league he played as a goalkeeperduring his career he played for various clubs at all tiers of the league he was part of the oxford united team which won the milk cup in 1986 he also briefly served as a backup goalkeeper for chelsea in the european cup winners cupoften referred to as the judge after retiring from the professional game alan worked as a driving instructor and goalkeeping coach at several clubs including swindon and oxford occasionally acting as emergency goalkeeping cover in 2001 he organised a rerun of the 1986 milk cup final against qpr for charity on 18 march 2003 at the age of 42 he played his first football league match since leaving hereford united in 1994 he helped oxford to a 11 draw with cambridge united making a vital save in stoppage time during the 200304 season he also made several appearances for didcot town he made a sec

In [None]:
# Singularize each word in every TextBlob, then join the words back into a single string.
# Note: singularize() also trims non-plural words ending in "s" (e.g. "as" -> "a", "his" -> "hi").
npl_blob_singular = npl_text_blob.apply(lambda x: ' '.join([y.singularize() for y in x.words]))
npl_blob_singular

Unnamed: 0,text
0,digby morrell born 10 october 1979 is a former...
1,alfred j lewy aka sandy lewy graduated from un...
2,harpdog brown is a singer and harmonica player...
3,franz rottensteiner born in waidmannsfeld lowe...
4,henry krvit born 30 december 1974 in tallinn b...
...,...
42781,motoaki takenouchi born july 8 1967 saitama pr...
42782,alan graham judge born 14 may 1960 is a retire...
42783,eduardo lara lozano born 4 september 1959 in c...
42784,tatiana faberg is an author and faberg scholar...


## BoW using CountVectorizer

- Perform the count transformation

In [None]:
npl_vectorizer = CountVectorizer(stop_words='english')
npl_bow_vec = npl_vectorizer.fit_transform(npl_blob_singular)


In [None]:
type(npl_bow_vec), npl_bow_vec.shape

(scipy.sparse._csr.csr_matrix, (42786, 404869))

In [None]:
#npl_bow_vec.toarray()

In [None]:
npl_vectorizer.get_feature_names_out()

array(['00', '000', '0000', ..., 'zzebra', 'zzran', 'zzt'], dtype=object)

In [None]:
#npl_vec_df = pd.DataFrame(npl_bow_vec.toarray(), columns = npl_vectorizer.get_feature_names_out() )

## TF-IDF using TfidfTransformer
- Perform the TF-IDF transformation
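
To make the transformation less of a black box, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the scikit-learn defaults for `TfidfTransformer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The three toy documents and the function name `tfidf` are made up for illustration, and the sketch uses plain whitespace tokenisation rather than scikit-learn's tokenizer.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only
docs = [
    "goalkeeper played football league",
    "goalkeeper coach football",
    "singer harmonica player",
]


def tfidf(docs):
    """Sketch of TF-IDF with smoothed IDF and L2 row normalisation."""
    tokenised = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(t for doc in tokenised for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])  # scale each row to unit length
    return vocab, rows


vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:12s} {weight:.4f}")
```

In the first document, "played" (which appears in one document) receives a higher weight than "goalkeeper" (which appears in two), which is the behaviour that lets rare, distinctive words dominate the nearest-neighbour search.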

In [None]:
# Perform the TF-IDF transformation
tf_idf_vec = TfidfTransformer()
npl_tf_idf = tf_idf_vec.fit_transform(npl_bow_vec)

#npl_tf_idf.toarray()


## K Nearest Neighbors

- Fit a nearest-neighbors model on the TF-IDF matrix

In [None]:
nn = NearestNeighbors().fit(npl_tf_idf)

- Select the target person's TF-IDF row to use as the query (reference) vector

In [None]:
# Create a reference matrix from the TF-IDF matrix
# The target person is the entry at row index 42782 (Alan Judge)
npl_ref_matrix = npl_tf_idf[42782]


In [None]:
npl_ref_matrix.shape

(1, 404869)

- Get the distances to the nearest neighbors

In [None]:
# Query 11 neighbors: the target person itself (distance 0) plus the 10 closest people
distances, indices = nn.kneighbors(
  X = npl_ref_matrix,
  n_neighbors = 11,
)


In [None]:
# Distances from the target person to itself and to its 10 nearest neighbors
distances[0]

array([0.        , 1.21001444, 1.21221996, 1.21439732, 1.21827016,
       1.2195035 , 1.22471136, 1.2252636 , 1.22746783, 1.22767495,
       1.23418686])
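
These values are Euclidean distances, assuming the default `NearestNeighbors` metric, between rows that `TfidfTransformer` has scaled to unit length by default. For unit-length vectors the squared distance equals `2 - 2 * cos`, so each distance can be converted to a cosine similarity, which is easier to compare with the similarities used in Part 2. The sketch below applies that identity to the distances printed above; the function name `euclidean_to_cosine` is hypothetical.

```python
import math

# Distances reported above for the target person and its 10 neighbours
distances = [0.0, 1.21001444, 1.21221996, 1.21439732, 1.21827016,
             1.2195035, 1.22471136, 1.2252636, 1.22746783, 1.22767495,
             1.23418686]


def euclidean_to_cosine(d):
    """For unit-length vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - d * d / 2.0


cosines = [euclidean_to_cosine(d) for d in distances]
for d, c in zip(distances, cosines):
    print(f"distance {d:.4f} -> cosine similarity {c:.4f}")

# Sanity check of the identity on two explicit unit vectors
a = [1.0, 0.0]
b = [math.sqrt(0.5), math.sqrt(0.5)]
dot = sum(x * y for x, y in zip(a, b))
check = euclidean_to_cosine(math.dist(a, b))
print(f"direct cosine {dot:.4f}, from distance {check:.4f}")
```

Because the conversion is monotonic, ranking by smallest Euclidean distance and ranking by largest cosine similarity give the same order here.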

In [None]:
indices[0]

array([42782, 34838, 23099, 24116, 25507,  3785, 19611,  4113, 31709,
        4300, 21696])

In [None]:
# Look up the target person and the 10 closest people in the original dataframe
og_npl_data.iloc[indices[0]]

Unnamed: 0,URI,name,text
42782,<http://dbpedia.org/resource/Alan_Judge_(footb...,"Alan Judge (footballer, born 1960)",alan graham judge born 14 may 1960 is a retire...
34838,<http://dbpedia.org/resource/Len_Bond>,Len Bond,len bond born 2 december 1954 is an english fo...
23099,<http://dbpedia.org/resource/Matt_Green_(footb...,Matt Green (footballer),matthew james matt green born 2 january 1987 i...
24116,<http://dbpedia.org/resource/Tony_Smith_(footb...,"Tony Smith (footballer, born 1957)",anthony tony smith born 20 february 1957 is a ...
25507,<http://dbpedia.org/resource/Peter_Rhoades-Brown>,Peter Rhoades-Brown,peter rhoadesbrown born 2 january 1962 in hamp...
3785,<http://dbpedia.org/resource/Steve_Arnold_(foo...,"Steve Arnold (footballer, born 1951)",stephen frank arnold born 5 january 1951 is an...
19611,<http://dbpedia.org/resource/George_Harris_(fo...,"George Harris (footballer, born 1940)",george harris born 10 june 1940 is an englishb...
4113,<http://dbpedia.org/resource/Keith_Waugh>,Keith Waugh,keith waugh born 27 october 1956 is an english...
31709,<http://dbpedia.org/resource/Nick_Colgan>,Nick Colgan,nicholas vincent nick colgan born 19 september...
4300,<http://dbpedia.org/resource/Peter_Hucker>,Peter Hucker,peter hucker born 28 october 1959 is an englis...


In [None]:
# Print each distance next to the corresponding (singularized) overview text
for a,b in zip(distances[0], npl_blob_singular.iloc[indices[0]]):
  print(f"{a:.4f}: {b}")

0.0000: alan graham judge born 14 may 1960 is a retired professional footballer who is the seventh oldest player to play in the football league he played a a goalkeeperduring hi career he played for variou club at all tier of the league he wa part of the oxford united team which won the milk cup in 1986 he also briefly served a a backup goalkeeper for chelsea in the european cup winner cupoften referred to a the judge after retiring from the professional game alan worked a a driving instructor and goalkeeping coach at several club including swindon and oxford occasionally acting a emergency goalkeeping cover in 2001 he organised a rerun of the 1986 milk cup final against qpr for charity on 18 march 2003 at the age of 42 he played hi first football league match since leaving hereford united in 1994 he helped oxford to a 11 draw with cambridge united making a vital save in stoppage time during the 200304 season he also made several appearance for didcot town he made a second appearance f

# Part 2)



- For the same target person that you chose in Part 1), use the Wikipedia API to access the whole content of the target person's Wikipedia page.
- The goal of Part 2) is to ...
  1. Print out the text of the Wikipedia article for the target person
  1. Determine the sentiment of the text of the Wikipedia page for the target person
  1. Collect the text of the Wikipedia pages from the 10 nearest neighbors from Part 1)
  1. Determine the nearness ranking of these 10 people to your target person based on their entire Wikipedia page
  1. Compare, i.e. plot, the nearness ranking from Part 1) with the Wikipedia page nearness ranking. A difference of the ranks is one means of comparison.



### Using Wikipedia API

- Install Wikipedia API

In [None]:
%%capture output
#install Wikipedia API
!pip3 install wikipedia-api


In [None]:
import wikipediaapi
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

- Fetch the target person's page from Wikipedia

In [None]:
# https://en.wikipedia.org/wiki/Alan_Judge_(English_footballer)
topic = 'Alan_Judge_(English_footballer)'
wikip = wikipediaapi.Wikipedia('foobar')
page_ex = wikip.page(topic)
wiki_text = page_ex.text
wiki_text


"Alan Graham Judge (born 14 May 1960) is an English retired professional footballer, who is the seventh oldest player to play in the Football League. He played as a goalkeeper.\nDuring his career he played for various clubs at all tiers of the League. He was part of the Oxford United team that won the Milk Cup in 1986. He also briefly served as a backup goalkeeper for Chelsea in the European Cup Winners' Cup.\nOften referred to as The Judge, after retiring from the professional game he worked as a driving instructor and goalkeeping coach at several clubs including Swindon and Oxford, occasionally acting as emergency goalkeeping cover. In 2001, he organised a re-run of the 1986 Milk Cup Final against QPR, for charity.\nOn 18 March 2003, at the age of 42, he played his first Football League match since leaving Hereford United in 1994. He helped Oxford to a 1–1 draw with Cambridge United, making a vital save in stoppage time. During the 2003–04 season he also played for Didcot Town. He ma

### Text Cleaning

- Lowercase the text and replace newline characters with spaces before doing any processing. Strip possessive 's and apostrophes, and remove parentheses and double quotes.

In [None]:
wiki_text_clean = (
  wiki_text
  .lower()
  .replace("\n"," ")
  .replace("\'s",'')
  .replace('\'','')
  .replace("(", "")
  .replace(")", "")
  .replace('"', "")
)
wiki_text_clean


'alan graham judge born 14 may 1960 is an english retired professional footballer, who is the seventh oldest player to play in the football league. he played as a goalkeeper. during his career he played for various clubs at all tiers of the league. he was part of the oxford united team that won the milk cup in 1986. he also briefly served as a backup goalkeeper for chelsea in the european cup winners cup. often referred to as the judge, after retiring from the professional game he worked as a driving instructor and goalkeeping coach at several clubs including swindon and oxford, occasionally acting as emergency goalkeeping cover. in 2001, he organised a re-run of the 1986 milk cup final against qpr, for charity. on 18 march 2003, at the age of 42, he played his first football league match since leaving hereford united in 1994. he helped oxford to a 1–1 draw with cambridge united, making a vital save in stoppage time. during the 2003–04 season he also played for didcot town. he made a s

#### Convert to TextBlob

In [None]:
# Convert to textblob
wiki_blob = TextBlob(wiki_text_clean)
wiki_blob

TextBlob("alan graham judge born 14 may 1960 is an english retired professional footballer, who is the seventh oldest player to play in the football league. he played as a goalkeeper. during his career he played for various clubs at all tiers of the league. he was part of the oxford united team that won the milk cup in 1986. he also briefly served as a backup goalkeeper for chelsea in the european cup winners cup. often referred to as the judge, after retiring from the professional game he worked as a driving instructor and goalkeeping coach at several clubs including swindon and oxford, occasionally acting as emergency goalkeeping cover. in 2001, he organised a re-run of the 1986 milk cup final against qpr, for charity. on 18 march 2003, at the age of 42, he played his first football league match since leaving hereford united in 1994. he helped oxford to a 1–1 draw with cambridge united, making a vital save in stoppage time. during the 2003–04 season he also played for didcot town. he

In [None]:
# Check how many sentences are in this textblob
len(wiki_blob.sentences)

12

### Sentiment analysis of text in Wikipedia

Note: .sentiment is a TextBlob attribute; calling it on a plain str raises an AttributeError.

In [None]:
# sentiment analysis of the whole text from Wikipedia
polarity_wiki = wiki_blob.sentiment.polarity
subjectivity_wiki = wiki_blob.sentiment.subjectivity
print(f"The text of Wikipedia on Alan_Judge_(English_footballer) has polarity of {polarity_wiki: .4f} and subjectivity of {subjectivity_wiki: .4f}.")

The text of Wikipedia on Alan_Judge_(English_footballer) has polarity of  0.0115 and subjectivity of  0.2436.


#### Convert sentences in TextBlob to strings

In [None]:
# 12 sentences in TextBlob
api_sentences = wiki_blob.sentences
api_sentences

[Sentence("alan graham judge born 14 may 1960 is an english retired professional footballer, who is the seventh oldest player to play in the football league."),
 Sentence("he played as a goalkeeper."),
 Sentence("during his career he played for various clubs at all tiers of the league."),
 Sentence("he was part of the oxford united team that won the milk cup in 1986. he also briefly served as a backup goalkeeper for chelsea in the european cup winners cup."),
 Sentence("often referred to as the judge, after retiring from the professional game he worked as a driving instructor and goalkeeping coach at several clubs including swindon and oxford, occasionally acting as emergency goalkeeping cover."),
 Sentence("in 2001, he organised a re-run of the 1986 milk cup final against qpr, for charity."),
 Sentence("on 18 march 2003, at the age of 42, he played his first football league match since leaving hereford united in 1994. he helped oxford to a 1–1 draw with cambridge united, making a vita

In [None]:
# Convert text blob sentences to strings
api_sentences_str = [ str(x) for x in api_sentences ]
api_sentences_str


['alan graham judge born 14 may 1960 is an english retired professional footballer, who is the seventh oldest player to play in the football league.',
 'he played as a goalkeeper.',
 'during his career he played for various clubs at all tiers of the league.',
 'he was part of the oxford united team that won the milk cup in 1986. he also briefly served as a backup goalkeeper for chelsea in the european cup winners cup.',
 'often referred to as the judge, after retiring from the professional game he worked as a driving instructor and goalkeeping coach at several clubs including swindon and oxford, occasionally acting as emergency goalkeeping cover.',
 'in 2001, he organised a re-run of the 1986 milk cup final against qpr, for charity.',
 'on 18 march 2003, at the age of 42, he played his first football league match since leaving hereford united in 1994. he helped oxford to a 1–1 draw with cambridge united, making a vital save in stoppage time.',
 'during the 2003–04 season he also played

#### Using TfidfVectorizer

In [None]:
# Perform the TF-IDF Vectorization
tf_idf_vec = TfidfVectorizer(stop_words = 'english')
tf_idf_pp = tf_idf_vec.fit_transform(api_sentences_str)
tf_idf_pp.shape

(12, 89)

In [None]:
tf_idf_pp

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 116 stored elements and shape (12, 89)>

In [None]:
np.set_printoptions(precision=6, formatter=None)
tf_idf_vec.get_feature_names_out()

array(['04', '14', '176', '18', '1960', '1986', '1994', '2001', '2003',
       '2004', '42', '44', 'acting', 'age', 'alan', 'appearance',
       'appeared', 'backup', 'born', 'briefly', 'cambridge', 'career',
       'charity', 'chelsea', 'chipping', 'clubs', 'coach', 'cover', 'cup',
       'days', 'didcot', 'draw', 'driving', 'emergency', 'english',
       'european', 'final', 'football', 'footballer', 'game',
       'goalkeeper', 'goalkeeping', 'graham', 'helped', 'hereford',
       'including', 'instructor', 'judge', 'league', 'leaving', 'making',
       'march', 'match', 'milk', 'norton', 'november', 'occasionally',
       'oldest', 'organised', 'oxford', 'play', 'played', 'player',
       'premier', 'professional', 'qpr', 'references', 'referred',
       'retired', 'retiring', 'run', 'save', 'season', 'second', 'served',
       'seventh', 'stoppage', 'swindon', 'team', 'tiers', 'time', 'town',
       'united', 'various', 'vital', 'winners', 'won', 'worked', 'years'],
      dtype=ob

In [None]:
# Print out results in a dataframe
target_tf_df = pd.DataFrame(tf_idf_pp.toarray(), columns = tf_idf_vec.get_feature_names_out())
target_tf_df.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
04,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.466561,0.000000,0.0,0.0,0.0
14,0.269906,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
176,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.329668,0.0,0.0,0.0
18,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.216628,0.000000,0.000000,0.0,0.0,0.0
1960,0.269906,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
vital,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.216628,0.000000,0.000000,0.0,0.0,0.0
winners,0.000000,0.0,0.0,0.235363,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
won,0.000000,0.0,0.0,0.235363,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
worked,0.000000,0.0,0.0,0.000000,0.225438,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0


## Wikipedia pages of the 10 closest people

In [None]:
target_pp = og_npl_data.iloc[indices[0]]


In [None]:
# Create a dataframe of the target person and the closest 10 people from part 1
target_pp_df = pd.DataFrame(target_pp)
target_pp_df

Unnamed: 0,URI,name,text
42782,<http://dbpedia.org/resource/Alan_Judge_(footb...,"Alan Judge (footballer, born 1960)",alan graham judge born 14 may 1960 is a retire...
34838,<http://dbpedia.org/resource/Len_Bond>,Len Bond,len bond born 2 december 1954 is an english fo...
23099,<http://dbpedia.org/resource/Matt_Green_(footb...,Matt Green (footballer),matthew james matt green born 2 january 1987 i...
24116,<http://dbpedia.org/resource/Tony_Smith_(footb...,"Tony Smith (footballer, born 1957)",anthony tony smith born 20 february 1957 is a ...
25507,<http://dbpedia.org/resource/Peter_Rhoades-Brown>,Peter Rhoades-Brown,peter rhoadesbrown born 2 january 1962 in hamp...
3785,<http://dbpedia.org/resource/Steve_Arnold_(foo...,"Steve Arnold (footballer, born 1951)",stephen frank arnold born 5 january 1951 is an...
19611,<http://dbpedia.org/resource/George_Harris_(fo...,"George Harris (footballer, born 1940)",george harris born 10 june 1940 is an englishb...
4113,<http://dbpedia.org/resource/Keith_Waugh>,Keith Waugh,keith waugh born 27 october 1956 is an english...
31709,<http://dbpedia.org/resource/Nick_Colgan>,Nick Colgan,nicholas vincent nick colgan born 19 september...
4300,<http://dbpedia.org/resource/Peter_Hucker>,Peter Hucker,peter hucker born 28 october 1959 is an englis...


In [None]:
# 10 Closest people
# https://en.wikipedia.org/wiki/Len_Bond
# https://en.wikipedia.org/wiki/Matt_Green_(footballer)
# https://en.wikipedia.org/wiki/Tony_Smith_(footballer,_born_1957)
# https://en.wikipedia.org/wiki/Peter_Rhoades-Brown
# https://en.wikipedia.org/wiki/Steve_Arnold_(footballer,_born_1951)
# https://en.wikipedia.org/wiki/George_Harris_(footballer,_born_1940)
# https://en.wikipedia.org/wiki/Keith_Waugh
# https://en.wikipedia.org/wiki/Nick_Colgan
# https://en.wikipedia.org/wiki/Peter_Hucker
# https://en.wikipedia.org/wiki/Gary_Hooper

title = ['Len_Bond',
         'Matt_Green_(footballer)',
         'Tony_Smith_(footballer,_born_1957)',
         'Peter_Rhoades-Brown',
         'Steve_Arnold_(footballer,_born_1951)',
         'George_Harris_(footballer,_born_1940)',
         'Keith_Waugh',
         'Nick_Colgan',
         'Peter_Hucker',
         'Gary_Hooper'
         ]

wikip = wikipediaapi.Wikipedia('foobar')

# Initialize an empty list to store the Wikipedia texts
all_wiki_texts = []

for topic_title in title:
    page_ex = wikip.page(topic_title)

    if page_ex.exists():
        wiki_text = page_ex.text
        all_wiki_texts.append(wiki_text) # Add the text to the list
        print(f"--- Collected text for: {topic_title} ---")
    else:
        print(f"--- Warning: Page for '{topic_title}' does not exist. Skipping. ---")

# Now, 'all_wiki_texts' contains all the collected Wikipedia article texts
print("\n--- All collected texts (500 characters) : ---")
for i, text in enumerate(all_wiki_texts):
    print(f"Article {i+1}: {text[:500]}...") # Print 500 characters of each collected text

--- Collected text for: Len_Bond ---
--- Collected text for: Matt_Green_(footballer) ---
--- Collected text for: Tony_Smith_(footballer,_born_1957) ---
--- Collected text for: Peter_Rhoades-Brown ---
--- Collected text for: Steve_Arnold_(footballer,_born_1951) ---
--- Collected text for: George_Harris_(footballer,_born_1940) ---
--- Collected text for: Keith_Waugh ---
--- Collected text for: Nick_Colgan ---
--- Collected text for: Peter_Hucker ---
--- Collected text for: Gary_Hooper ---

--- All collected texts (500 characters) : ---
Article 1: Len Bond (born 2 December 1954) is an English former professional football goalkeeper. He made more than 300 appearances in the Football League, including 168 for Exeter City and 122 for Brentford.

Career
Bond was born in Ilminster, Somerset. He began his career as an apprentice with Bristol City, turning professional in September 1971, although he had made his league debut on the last day of the previous season. He remained at Ashton Gate for 

## Nearness Ranking
- Rank the 10 people by nearness to the target person

In [None]:
# Combine all steps to obtain Nearness Ranking to 'Alan_Judge_(English_footballer)' based on Wikipedia content
# Step 1. Define your data and target person
people_titles = [
    'Len_Bond',
    'Matt_Green_(footballer)',
    'Tony_Smith_(footballer,_born_1957)',
    'Peter_Rhoades-Brown',
    'Steve_Arnold_(footballer,_born_1951)',
    'George_Harris_(footballer,_born_1940)',
    'Keith_Waugh',
    'Nick_Colgan',
    'Peter_Hucker',
    'Gary_Hooper'
]
# Target person
target_person_title = 'Alan_Judge_(English_footballer)'

# Step 2. Fetch Wikipedia content
# note: foobar = user_agent
wikip = wikipediaapi.Wikipedia('foobar')

def get_wiki_text(title):
    page = wikip.page(title)
    if page.exists():
        text = page.text
        # Remove parenthesised text such as "(born 14 May 1960)"
        text = re.sub(r'\([^)]*\)', '', text)
        # Remove wiki-style section headers such as "== References ==", if any are present
        text = re.sub(r'==\s*[^=]+\s*==', '', text)
        # Keep only ASCII letters and whitespace (digits and punctuation are dropped), then collapse extra spaces
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    else:
        print(f"Warning: Page for '{title}' does not exist. Skipping.")
        return None

# Collect all texts, including the target person
all_texts = {}
for title in people_titles + [target_person_title]:
    text = get_wiki_text(title)
    if text:
        all_texts[title] = text

# Separate the target person's text from the rest
target_text = all_texts.pop(target_person_title, None)
if target_text is None:
    # Stop here: without the target page there is nothing to rank against
    raise ValueError(f"Could not retrieve Wikipedia page for target person: {target_person_title}")

# Prepare the list of texts for vectorization
corpus_titles = list(all_texts.keys())
corpus_texts = list(all_texts.values())

# Add target text to the corpus for consistent vectorization
corpus_texts_with_target = corpus_texts + [target_text]
corpus_titles_with_target = corpus_titles + [target_person_title]

# Step 3. Vectorize the Text
# TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.
# It lowercases and tokenizes the text by default (punctuation is dropped); stop words are removed because stop_words='english' is set.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000) # max_features to limit vocabulary size
tfidf_matrix = vectorizer.fit_transform(corpus_texts_with_target)

# The last row of the tfidf_matrix corresponds to the target person's text
target_person_vector = tfidf_matrix[-1]
other_people_vectors = tfidf_matrix[:-1]

# Step 4. Calculate Similarities (Cosine Similarity)
# Reshape target_person_vector to be 2D for cosine_similarity
similarities = cosine_similarity(target_person_vector.reshape(1, -1), other_people_vectors)

# Take the first row only
similarity_scores = similarities[0]

# Step 5. Rank by Nearness
# Create a list of (person_title, similarity_score) tuples
ranked_people = list(zip(corpus_titles, similarity_scores))

# Sort the list by similarity score in descending order (highest similarity is "nearest")
ranked_people.sort(key=lambda x: x[1], reverse=True)

print(f"\nNearness Ranking to '{target_person_title}' based on Wikipedia content:")
for rank, (person, score) in enumerate(ranked_people):
    print(f"{rank + 1}. {person}: {score:.4f} (Similarity)") # format score to 4 decimal points


Nearness Ranking to 'Alan_Judge_(English_footballer)' based on Wikipedia content:
1. Peter_Rhoades-Brown: 0.2671 (Similarity)
2. Peter_Hucker: 0.2359 (Similarity)
3. Tony_Smith_(footballer,_born_1957): 0.2065 (Similarity)
4. Steve_Arnold_(footballer,_born_1951): 0.1906 (Similarity)
5. George_Harris_(footballer,_born_1940): 0.1899 (Similarity)
6. Len_Bond: 0.1802 (Similarity)
7. Keith_Waugh: 0.1760 (Similarity)
8. Matt_Green_(footballer): 0.1641 (Similarity)
9. Nick_Colgan: 0.1502 (Similarity)
10. Gary_Hooper: 0.1253 (Similarity)
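
The last goal of Part 2) asks for a comparison of the two rankings, with the rank difference as one option. The sketch below is one possible way to do that in plain Python, not a definitive implementation. The Part 1 order is taken from the neighbour list used in this notebook (it assumes 'Gary_Hooper' is the tenth neighbour, as in the title list above), and the Wikipedia order is copied from the ranking printed above. The Spearman rank correlation is an extra summary that the project text does not require.

```python
# Part 1 order (nearest first), following the title list used in this notebook
part1_order = [
    'Len_Bond', 'Matt_Green_(footballer)', 'Tony_Smith_(footballer,_born_1957)',
    'Peter_Rhoades-Brown', 'Steve_Arnold_(footballer,_born_1951)',
    'George_Harris_(footballer,_born_1940)', 'Keith_Waugh', 'Nick_Colgan',
    'Peter_Hucker', 'Gary_Hooper',
]
# Part 2 order (nearest first), copied from the Wikipedia-based ranking output
wiki_order = [
    'Peter_Rhoades-Brown', 'Peter_Hucker', 'Tony_Smith_(footballer,_born_1957)',
    'Steve_Arnold_(footballer,_born_1951)', 'George_Harris_(footballer,_born_1940)',
    'Len_Bond', 'Keith_Waugh', 'Matt_Green_(footballer)', 'Nick_Colgan',
    'Gary_Hooper',
]

part1_rank = {name: i + 1 for i, name in enumerate(part1_order)}
wiki_rank = {name: i + 1 for i, name in enumerate(wiki_order)}

# Positive difference: the person ranks further away on the full Wikipedia text
rank_diff = {name: wiki_rank[name] - part1_rank[name] for name in part1_order}

# Spearman rank correlation for rankings without ties
n = len(part1_order)
sum_sq = sum(d * d for d in rank_diff.values())
spearman_rho = 1 - 6 * sum_sq / (n * (n * n - 1))

for name in part1_order:
    print(f"{name:40s} part1={part1_rank[name]:2d} "
          f"wiki={wiki_rank[name]:2d} diff={rank_diff[name]:+d}")
print(f"Spearman rank correlation: {spearman_rho:.4f}")
```

The `rank_diff` dictionary could be passed to a bar chart, for example `plt.bar(list(rank_diff), list(rank_diff.values()))`, to satisfy the plotting part of the goal.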


In [None]:
import matplotlib.pyplot as plt

# Part 3)


Make an interactive notebook where a user can choose or enter a name and the notebook displays the 10 closest individuals.
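
Since the user may type a name instead of picking one, a typed name first has to be matched to a name in the dataset. The sketch below shows one way to do that with the standard library's `difflib.get_close_matches`. The helper `resolve_name`, the `cutoff` value, and the short `known_names` list (a hand-picked subset of names that appear in this notebook's outputs) are assumptions for illustration; in the notebook the list would come from `og_npl_data['name']`.

```python
from difflib import get_close_matches

# Small illustrative subset of the dataset's 'name' column
known_names = [
    'Alan Judge (footballer, born 1960)',
    'Len Bond',
    'Matt Green (footballer)',
    'Peter Rhoades-Brown',
    'Keith Waugh',
    'Nick Colgan',
    'Peter Hucker',
]


def resolve_name(query, names, cutoff=0.6):
    """Hypothetical helper: return the best-matching known name, or None."""
    lowered = {n.lower(): n for n in names}  # compare case-insensitively
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


for typed in ['peter huker', 'len bond', 'zzz']:
    print(f"{typed!r} -> {resolve_name(typed, known_names)!r}")
```

The resolved name can then be looked up in the dataframe to find the row whose TF-IDF vector is passed to `nn.kneighbors`.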


In [None]:
!curl -s https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv | wc -l

42786


In [None]:
from ipywidgets import interact

## Interactive lookup of the 10 closest people

In [None]:
# Dropdown names, listed in the same order as the Part 1 neighbours in indices[0][1:]
# (assumes the tenth neighbour is Gary Hooper, as in the title list from Part 2)
names = ['Len Bond',
         'Matt Green',
         'Tony Smith',
         'Peter Rhoades-Brown',
         'Steve Arnold',
         'George Harris',
         'Keith Waugh',
         'Nick Colgan',
         'Peter Hucker',
         'Gary Hooper']
# Map each display name to its row number in the dataset
name_to_row = dict(zip(names, indices[0][1:].tolist()))

@interact(text = names)
def knn(text):
  # Query with the chosen person's TF-IDF row; the first hit is that person (distance 0)
  dists, idxs = nn.kneighbors(
  X = npl_tf_idf[name_to_row[text]],
  n_neighbors = 11,
  )
  print(f"The 10 people closest to {text}:")
  for dist, idx in zip(dists[0][1:], idxs[0][1:]):
   print(f"{dist:.4f}: {og_npl_data['name'].iloc[idx]}")


interactive(children=(Dropdown(description='text', options=('Len Bond', 'Matt Green', 'Tony Smith', 'Peter Rho…

# References

- Module 5: Data cleaning | TextBlob | TF-IDF transformation | K Nearest Neighbors
  - Lecture 2b: https://colab.research.google.com/drive/1eIY8PPxqnH4izK7WV9BS5lX7S2WnnzDz?authuser=1#scrollTo=xDuCEa4Z-mwb
  - Lecture 3c: https://colab.research.google.com/drive/1IJ4Kzy6OX78YcEd8B6-DQxuBnftBqz98?authuser=1#scrollTo=xDuCEa4Z-mwb
  - Lecture 4a: https://colab.research.google.com/drive/1im3J4Y0xgtn_rPA8ETJ_43q4VQjoNs6n?authuser=1#scrollTo=yuFf00oinAiU

- Module 4: Interact
  - Lecture 2g: https://colab.research.google.com/drive/1dy0VG55NfM9zv_eosKJ1xbYXVbNHSodW?authuser=1#scrollTo=xAoHSexy5d8W

