
In [None]:
from collections import OrderedDict
import numpy as np
import pandas as pd
import re

<b>This assignment deals with loading a simple text file into a Python structure, lists, arrays, and dataframes.</b>

<b>a. Locate a movie script, play script, poem, or book of your choice in .txt format*. Project Gutenberg is a great resource for this if you're not sure where to start.</b>

<b>b. Load the words of this structure, one-by-one, into a one-dimensional, sequential Python list (i.e. the first word should be the first element in the list, while the last word should be the last element). It's up to you how to deal with special characters -- you can remove them manually, ignore them during the loading process, or even count them as words, for example.</b>

In [283]:
def load_data(data_file):
  """
    Reads a txt file and returns a list of cleaned, lowercased words
  """
  # Compile regex matching every character that is not a letter, digit, '.', ':', '/' or '-'
  regex_pattern = re.compile(r'[^A-Za-z0-9.:/-]+')

  # Read txt file
  with open(data_file) as f:
    # Remove the disallowed characters from each word, lowercase it,
    # then strip leading and trailing '.', ':' and '/'
    return [regex_pattern.sub('', word).lower().strip('.:/') for line in f for word in line.split()]


words_list = load_data('the_martian_circe.txt')
print(words_list[:100])

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'martian', 'circe', 'by', 'raymond', 'f', 'jones', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', 'if', 'you', 'are', 'not', 'located', 'in', 'the', 'united', 'states', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'ebook', 'title', 'the', 'martian', 'circe', 'author', 'raymond']
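One side effect of the cleaning approach above is that a token made up entirely of punctuation (for example `...`) can end up as an empty string in the list. The following is a minimal sketch of the same cleaning rule applied to an in-memory string, with empty results dropped. The helper name `tokenize` and the sample sentence are illustrative assumptions, not part of the assignment.

```python
import re

# Same character class as load_data: keep letters, digits and . : / -
TOKEN_CLEANER = re.compile(r'[^A-Za-z0-9.:/-]+')

def tokenize(text):
    """Hypothetical helper: clean each whitespace-separated token and drop empties."""
    words = []
    for raw in text.split():
        word = TOKEN_CLEANER.sub('', raw).lower().strip('.:/')
        if word:  # skip tokens that were pure punctuation
            words.append(word)
    return words

sample = 'Hello, World! ... Re-use it: www.gutenberg.org.'
print(tokenize(sample))
# ['hello', 'world', 're-use', 'it', 'www.gutenberg.org']
```

Whether to drop such tokens is a design choice; the assignment explicitly allows counting special characters as words.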


<b>c. Use your list to create and print a two-column pandas data-frame with the following properties: i. Each index should mark the first occurrence of a unique word (independent of case) in the text. ii. The first column for each index should represent the word in question at that index iii. The second column should represent the number of times that particular word appears in the text.</b>

In [133]:
def count_elements(list_of_elements):
  """
    Count word occurrences while preserving first-occurrence order
  """
  data = OrderedDict() 
  for element in list_of_elements:
    if element not in data:
      data[element] = 1
    else:
      data[element] += 1
  return data

  
counted_occurence = count_elements(words_list)
df = pd.DataFrame(counted_occurence.items(), columns=['Word', 'Count'])
df

index,Word,Count
0,the,1317
1,project,89
2,gutenberg,31
3,ebook,13
4,of,631
...,...,...
3176,pg,1
3177,facility,1
3178,includes,1
3179,subscribe,1
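As a possible alternative to the hand-written counting loop, `collections.Counter` from the standard library produces the same counts. Because `Counter` is a `dict` subclass, it keeps keys in first-insertion order on Python 3.7 and later. This is a sketch on a toy list rather than on the assignment text.

```python
from collections import Counter

toy_words = ['the', 'cat', 'saw', 'the', 'dog', 'the']

# Counter tallies each word; keys stay in order of first occurrence (Python 3.7+)
toy_counts = Counter(toy_words)

# A list of (word, count) pairs, suitable as row data for a two-column table
toy_rows = list(toy_counts.items())
print(toy_rows)
# [('the', 3), ('cat', 1), ('saw', 1), ('dog', 1)]
```

The resulting pairs could be passed to `pd.DataFrame(..., columns=['Word', 'Count'])` in the same way as the `OrderedDict` items above.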


<b> d. The co-occurrence of two events represents the likelihood of the two occurring together. A simple example of co-occurrence in texts is a predecessor-successor relationship -- that is, the frequency with which one word immediately follows another. The word "cellar," for example, is commonly followed by "door." </b>

For this task, you are to construct a 2-dimensional predecessor-successor co-occurrence array as follows**: i. The row index corresponds to the word from the same index in part c.'s data-frame. ii. The column index likewise corresponds to the word in the same index in the data-frame. iii. The value in each array location represents the count of the number of times the word corresponding to the row index immediately precedes the word corresponding to the column index in the text.

In [196]:
def generate_co_occurance_matrix(words_list, columns):
  """
    Generates co-occurrence matrix:
      words_list: a list of words in the text
      columns: unique list of words
  """
  # Convert words list to numpy array
  words_array = np.array(words_list)
  # Convert columns list to numpy array
  columns_array = np.array(columns)
  # Generate zero-valued integer matrix
  # (np.int was removed in NumPy 1.24, so the built-in int is used instead)
  matrix = np.zeros((len(columns), len(columns)), dtype=int)
  # Iterate over unique words
  for word in columns_array:
    # Find row position in matrix for the word
    row_position = np.where(columns_array == word)[0][0]
    # Find all occurrences of this word in the list and iterate over them
    for word_position in np.where(words_array == word)[0]:
      # The last word in the text has no successor
      if word_position < len(words_array) - 1:
        # Find position of the successor word
        col_position = np.where(columns_array == str(words_array[word_position + 1]))[0][0]
        # Increment predecessor-successor co-occurrence by one
        matrix[row_position, col_position] += 1
  return matrix

matrix = generate_co_occurance_matrix(words_list, list(counted_occurence.keys()))
matrix

array([[ 0, 33,  0, ...,  0,  0,  0],
       [ 0,  0, 31, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       ...,
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0]])
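The repeated `np.where` lookups above scan the whole word list once per unique word. A possible alternative, sketched here on a toy list, maps each word to its index with a dictionary and walks the adjacent pairs once. The function name `cooccurrence_single_pass` and the toy data are assumptions made for illustration.

```python
import numpy as np

def cooccurrence_single_pass(words, vocabulary):
    """Hypothetical alternative: count predecessor-successor pairs in one pass."""
    # Map each unique word to its row/column index
    index_of = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros((len(vocabulary), len(vocabulary)), dtype=int)
    # zip pairs every word with the word that immediately follows it
    for predecessor, successor in zip(words, words[1:]):
        counts[index_of[predecessor], index_of[successor]] += 1
    return counts

toy_words = ['the', 'cellar', 'door', 'the', 'cellar', 'door', 'the', 'door']
toy_vocab = list(dict.fromkeys(toy_words))  # unique words in first-occurrence order
toy_matrix = cooccurrence_single_pass(toy_words, toy_vocab)
print(toy_vocab)
print(toy_matrix)
# ['the', 'cellar', 'door']
# [[0 2 1]
#  [0 0 2]
#  [2 0 0]]
```

Every adjacent pair contributes exactly one count, so the matrix entries sum to the number of words minus one.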

<b>e. Based on the data-frame derived in part c. and array derived in part d., determine and print the following information:<br>
i. The first occurring word in the text. </b>

In [141]:
df['Word'].iloc[0]

'the'

<b>ii. The unique word that first occurs last within the text. </b> 

In [142]:
df['Word'].iloc[-1]

'newsletter'

<b>iii. The most common word </b>

In [159]:
df[df['Count'] == df['Count'].max()]['Word'].iloc[0]

'the'

<b> v. Words A and B such that B follows A more than any other combination of words.</b>

In [163]:
# Find max value in matrix
max_occurence = np.amax(matrix)
# Find positions of max value 
position = np.where(matrix == max_occurence)
print()

print(df['Word'].iloc[position[0][0]], df['Word'].iloc[position[1][0]])


of the
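The same lookup can also be written with `np.argmax` and `np.unravel_index`, which convert the flat position of the largest entry into a (row, column) pair. This is a sketch on a small hand-made matrix; the toy vocabulary and counts are assumptions.

```python
import numpy as np

toy_vocab = ['the', 'cellar', 'door']
toy_matrix = np.array([[0, 2, 1],
                       [0, 0, 3],
                       [2, 0, 0]])

# argmax returns the flat index of the first maximum;
# unravel_index turns it back into (row, column) coordinates
row, col = np.unravel_index(np.argmax(toy_matrix), toy_matrix.shape)
print(toy_vocab[row], toy_vocab[col])
# cellar door
```

If several pairs tie for the maximum, both this approach and taking the first result of `np.where` report only the first one in row-major order.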


<b>vi. The word that most commonly follows the least common word </b>

In [194]:
df[df['Count'] == df['Count'].min()]

index,Word,Count
59,title,1
60,author,1
61,release,1
63,january,1
64,18,1
...,...,...
3176,pg,1
3177,facility,1
3178,includes,1
3179,subscribe,1


In [255]:
df.iloc[59]['Word']

'title'

In [282]:
def find_most_common_word_follows_least(df, matrix):
  # Find the indices of the least common words
  least_common_words_index = df.index[df['Count'] == df['Count'].min()].tolist()
  # Create an empty DataFrame to collect the results
  result = pd.DataFrame(data = [],  columns=['Predecessor', 'Successor',
                                             'Occurence_predecessor','Occurence_successor' ])

  # Iterate over least common words
  for column_pos in least_common_words_index:
    # Take this word's column: how often each word immediately precedes it
    tmp_array = matrix[:,column_pos]
    # Keep only the counts greater than 0
    positive_occurence_list  = tmp_array[tmp_array>0]
    # Iterate over those counts
    for occurence in positive_occurence_list:
      # Find the row index of the first predecessor with this count
      row_pos = np.where(tmp_array == occurence)[0][0]
      # Generate new row
      new_row = [df.iloc[row_pos]['Word'], df.iloc[column_pos]['Word'],
                 df.iloc[row_pos]['Count'], df.iloc[column_pos]['Count']]
      # Append it to result
      result.loc[len(result)] = new_row
  return result

df_task4 = find_most_common_word_follows_least(df, matrix)
df_task4

index,Predecessor,Successor,Occurence_predecessor,Occurence_successor
0,ebook,title,13,1
1,circe,author,5,1
2,jones,release,3,1
3,date,january,3,1
4,january,18,1,1
...,...,...,...,...
1720,main,pg,5,1
1721,search,facility,3,1
1722,site,includes,4,1
1723,to,subscribe,454,1


In [281]:
df_task4[df_task4['Occurence_predecessor'] == df_task4['Occurence_predecessor'].max()]

index,Predecessor,Successor,Occurence_predecessor,Occurence_successor
32,the,same--the,1317,1
36,the,odor,1317,1
42,the,backwash,1317,1
45,the,brawling,1317,1
58,the,solar,1317,1
...,...,...,...,...
1628,the,assistance,1317,1
1643,the,internal,1317,1
1672,the,irs,1317,1
1682,the,solicitation,1317,1
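The function above reads the matrix columns (`matrix[:, column_pos]`), so each row of its result pairs a least common word with the word that precedes it. If the task is read literally, as asking which word follows a least common word, the matrix rows would be used instead. The following is a minimal sketch of that reading on toy data; the helper name `most_common_successors` and the toy values are assumptions, and it has not been run on the assignment text.

```python
import numpy as np

def most_common_successors(vocabulary, counts, matrix):
    """Hypothetical helper: map each least common word to its most frequent successor."""
    counts = np.asarray(counts)
    result = {}
    # Indices of all words whose count equals the minimum count
    for row in np.flatnonzero(counts == counts.min()):
        successors = matrix[row, :]  # row = counts of words that follow this word
        if successors.sum() == 0:    # e.g. the final word of the text has no successor
            continue
        result[vocabulary[row]] = vocabulary[int(np.argmax(successors))]
    return result

# Toy text: the cellar door creaked the cellar door the door
toy_vocab = ['the', 'cellar', 'door', 'creaked']
toy_counts = [3, 2, 3, 1]
toy_matrix = np.array([[0, 2, 1, 0],
                       [0, 0, 2, 0],
                       [1, 0, 0, 1],
                       [1, 0, 0, 0]])
print(most_common_successors(toy_vocab, toy_counts, toy_matrix))
# {'creaked': 'the'}
```

When many words share the minimum count, as in the assignment text, this yields one successor per least common word rather than a single answer.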
