# WORD PAIRS

**File:** WordPairs.ipynb

**Course:** Data Science Foundations: Data Mining in Python

# INSTALL AND IMPORT LIBRARIES

To explore the connections between words, we'll use the Python library `networkx`. It can be installed with Python's `pip` command. The installation only needs to be run once per Python environment.

The standard, shorter approach may work:

In [None]:
# pip install networkx

If the command above fails, it may be necessary to be more explicit and install into the interpreter that the notebook kernel is using, in which case you can run the code below.

In [None]:
# import sys
# !{sys.executable} -m pip install networkx

Once `networkx` is installed, load the libraries below.

In [None]:
# Import libraries
import re  # For regular expressions
import nltk  # For text functions
import matplotlib.pyplot as plt  # For plotting
import pandas as pd  # For dataframes
import networkx as nx  # For network graphs

# Import specific text functions from NLTK
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

# Download data for NLTK
nltk.download('stopwords', quiet=True)
nltk.download('opinion_lexicon', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # Needed by word_tokenize in newer NLTK releases

# IMPORT DATA

In [None]:
# Read the tab-separated text, drop empty rows, and remove the ID column
df = pd.read_csv('data/Iliad.txt', sep='\t') \
    .dropna() \
    .drop(columns='gutenberg_id')

df.head(10)

# PREPARE DATA


## Tokenize the Data

In [None]:
def clean_text(text):
    text = text.lower()  # Lowercase
    text = text.replace("'", '')  # Drop apostrophes
    text = re.sub(r'[^\w]', ' ', text)  # Replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text)  # Collapse repeated whitespace
    text = text.strip()  # Trim leading and trailing spaces
    return text


df['text'] = df['text'].map(clean_text)  # Normalize each line
df['text'] = df['text'].map(word_tokenize)  # Split text into words

df.head()
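
To see what the cleaning step does, here is a minimal sketch that applies the same `clean_text` logic to a single line. The sample string is made up for illustration and is not taken from the data file, and `str.split` stands in for `word_tokenize` so the example runs without NLTK; the two can differ on some words.

```python
import re


def clean_text(text):
    """Same steps as the notebook's clean_text, repeated so this runs alone."""
    text = text.lower()
    text = text.replace("'", '')
    text = re.sub(r'[^\w]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


sample = "Achilles' wrath, to Greece the direful spring!"  # Made-up sample line
cleaned = clean_text(sample)
tokens = cleaned.split()  # Stand-in for word_tokenize (assumption)

print(cleaned)
print(tokens)
```

Punctuation and capitalization are gone, so the same word is counted the same way wherever it appears in a line.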

## Create Word Pair Tokens

- Instead of splitting the text into single words, separate it into pairs of adjacent words.

In [None]:
df['wordpairs'] = df['text'].map(lambda x: list(ngrams(x, 2)))
df = df.explode('wordpairs')

df.head(10)
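
As a rough illustration of what the pairing step produces, the sketch below builds adjacent pairs with plain Python. The helper `adjacent_pairs` is a hypothetical name introduced here, and the token lists are invented; for a list of tokens, `zip(tokens, tokens[1:])` yields the same tuples as `ngrams(tokens, 2)`.

```python
# Made-up tokenized lines, one list per row of the dataframe
lines = [
    ['sing', 'o', 'goddess', 'the', 'anger'],
    ['of', 'achilles'],
    ['woe'],
]


def adjacent_pairs(tokens):
    """Return every pair of neighboring tokens, in order (hypothetical helper)."""
    return list(zip(tokens, tokens[1:]))


pairs_per_line = [adjacent_pairs(tokens) for tokens in lines]

for pairs in pairs_per_line:
    print(pairs)
```

Two details are worth noticing. Pairs are built one row at a time, so no pair spans the end of one line and the start of the next. A line with a single word yields an empty list, which `explode` turns into a missing value in the dataframe.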

## Sort the Tokens by Frequency

In [None]:
df['wordpairs'].value_counts().head(10)

## Split Word Pairs

- In order to remove word pairs with stop words, the pairs must first be separated.
- Separated pairs are also necessary for creating network graphs.

In [None]:
df = pd.DataFrame(df.wordpairs.values.tolist(), columns=['word1', 'word2']).dropna()

df.head(10)

- Get the number of rows in the dataframe.

In [None]:
df.shape

## Remove Stop Words

- Dropping every pair in which either word is a stop word reduces the total number of observations from 127,709 to 33,694, a 74% reduction.

In [None]:
en_stopwords = set(stopwords.words('english'))

df = df[~(df.word1.isin(en_stopwords) | df.word2.isin(en_stopwords))]

df.head()

- Get the new number of rows in the dataframe.

In [None]:
df.shape
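
The filter and the reported reduction can be checked with a small sketch. The stop word set and the pairs below are invented for illustration (NLTK's English list is much longer); only the two row counts come from the text above.

```python
# Tiny illustrative stop word set, not NLTK's list
stop_words = {'the', 'of', 'to', 'and'}

# Made-up word pairs
pairs = [
    ('the', 'anger'),
    ('anger', 'of'),
    ('swift', 'footed'),
    ('footed', 'achilles'),
    ('to', 'troy'),
]

# Keep a pair only if neither word is a stop word,
# the same condition as ~(word1 is stop | word2 is stop)
kept = [(w1, w2) for w1, w2 in pairs
        if w1 not in stop_words and w2 not in stop_words]
print(kept)

# Reduction reported in the text: 127,709 rows before, 33,694 after
before, after = 127_709, 33_694
reduction = (before - after) / before
print(f'{reduction:.0%}')
```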

## Sort Word Pairs by Frequency

In [None]:
df = df.groupby(['word1', 'word2'])\
    .size()\
    .to_frame('n')\
    .reset_index()\
    .sort_values('n', ascending=False)

df.head(20)
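
The grouping step is a frequency count of ordered pairs. A minimal sketch of the same idea, using `collections.Counter` from the standard library instead of pandas and a handful of made-up pairs:

```python
from collections import Counter

# Made-up pairs; in the notebook these come from the word1/word2 columns
pairs = [
    ('swift', 'footed'),
    ('son', 'atreus'),
    ('swift', 'footed'),
    ('footed', 'achilles'),
    ('son', 'atreus'),
    ('swift', 'footed'),
]

counts = Counter(pairs)       # Count each (word1, word2) tuple
top = counts.most_common(2)   # Most frequent pairs first

print(top)
```

As with grouping on `word1` and `word2`, the order of the words matters: `('son', 'atreus')` and `('atreus', 'son')` would be counted as two different pairs.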

# VISUALIZE DATA

## Create Data Table

- Restrict to word pairs that appear more than 12 times.

In [None]:
df[df.n > 12].head(10)

## Visualize Network Graph

- For clarity's sake, restrict to word pairs that appear more than 25 times.

In [None]:
G = nx.from_pandas_edgelist(df[df.n > 25], 'word1', 'word2')
plt.figure(figsize=(12, 10))
nx.draw_shell(G, with_labels=True, node_color='white', font_size=15)
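
By default, `nx.from_pandas_edgelist` builds an undirected `Graph`, so each word becomes a node and each remaining pair becomes an edge without direction. The sketch below mimics that with plain Python data structures to show what survives the threshold; the rows and counts are invented for illustration.

```python
# Made-up (word1, word2, n) rows standing in for the dataframe
rows = [
    ('swift', 'footed', 30),
    ('son', 'atreus', 40),
    ('atreus', 'son', 28),
    ('old', 'man', 12),
]
threshold = 25

# Undirected adjacency: each kept pair links both words to each other
adjacency = {}
for w1, w2, n in rows:
    if n > threshold:
        adjacency.setdefault(w1, set()).add(w2)
        adjacency.setdefault(w2, set()).add(w1)

# An undirected edge ignores word order, so a frozenset represents it
edges = {frozenset((w1, w2)) for w1, w2, n in rows if n > threshold}

print(sorted(adjacency))
print(len(edges))
```

Here the pairs `('son', 'atreus')` and `('atreus', 'son')` collapse into a single edge, and `('old', 'man')` is dropped for falling below the threshold. If word order should be preserved in the graph, one option is to pass `create_using=nx.DiGraph()` to `nx.from_pandas_edgelist`.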

# CLEAN UP

- If desired, clear the results with Cell > All Output > Clear. 
- Save your work by selecting File > Save and Checkpoint.
- Shut down the Python kernel and close the file by selecting File > Close and Halt.