# 02_02 Loading text files

The signature of a word is the string you get by rearranging its letters into alphabetical order. Two words are therefore anagrams exactly when they share the same signature.
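As a minimal sketch of that definition, the signature can be computed by sorting the letters and joining them back together. The helper name `signature` is an assumption made for illustration; it does not come from the notebook.

```python
def signature(word):
    """Return the letters of `word` in alphabetical order (hypothetical helper)."""
    # sorted() on a string returns a list of its characters in order
    return ''.join(sorted(word))

print(signature('listen'))  # eilnst
print(signature('silent'))  # eilnst
print(signature('listen') == signature('silent'))  # True
```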

Our strategy for finding anagrams will be to build a Python dictionary of all the words in an English dictionary, keyed by their signature. That way, checking whether a word has an anagram is as simple as looking up its signature in the Python dictionary.
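Here is a rough sketch of that strategy on a handful of hand-picked words. The names `signature`, `sample_words`, `words_by_signature`, and `find_anagrams` are all assumptions for illustration, and the real word list would come from the file loaded later in this section.

```python
import collections

def signature(word):
    """Letters of `word` in alphabetical order (hypothetical helper)."""
    return ''.join(sorted(word))

# A tiny stand-in for the real word list
sample_words = ['post', 'spot', 'stop', 'tops', 'aardvark', 'listen', 'silent']

# Map each signature to the set of words that share it
words_by_signature = collections.defaultdict(set)
for word in sample_words:
    words_by_signature[signature(word)].add(word)

def find_anagrams(word):
    """Return the other known words with the same signature, sorted."""
    # .get() avoids creating empty entries in the defaultdict on a miss
    matches = words_by_signature.get(signature(word), set())
    return sorted(matches - {word})

print(find_anagrams('stop'))      # ['post', 'spot', 'tops']
print(find_anagrams('aardvark'))  # []
```

Using a set per signature means repeated words in the input do not produce repeated anagrams.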

But before that, we need to learn how to load a dictionary from a file.

In [1]:
import math
import collections
import dataclasses
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

In [None]:
# This repository contains the 1934 English dictionary that is distributed with many Unix systems.
words = []
# Iterating over an open text file yields its lines one at a time, each as a string
for line in open('words.txt'):
    words.append(line)
# We now have a list with one entry per line of the 1934 dictionary


**Looping through a Text file**

for line in open(filename, mode='r'):

&emsp; *Do something with* line

In Python, we speak of idioms: code constructs that have become the preferred way to achieve a certain goal.
One example is looping through all the lines of a text file.
For that, we open the file for reading and use the file object as an iterable in a for loop, which gives us the lines one by one.
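A common variant of this idiom wraps the file in a `with` block so that it is closed as soon as the loop finishes. The sketch below is self-contained: it first writes a tiny stand-in file (the name `sample_words.txt` and its contents are made up for illustration) and then loops over its lines.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Write a small hypothetical word file so the example runs anywhere
    path = os.path.join(tmpdir, 'sample_words.txt')
    with open(path, 'w') as f:
        f.write('A\na\naa\n')

    lines = []
    # The file object is iterable: each pass of the loop yields one line
    with open(path, mode='r') as f:
        for line in f:
            lines.append(line)

print(lines)  # ['A\n', 'a\n', 'aa\n']
```

Note that each line still carries its trailing newline, just as in the notebook output below.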

In [4]:
#We end up with more than 200,000 words.
len(words)

235886

In [None]:
# This gives us the first 10 entries
words[:10]
# Note that each line ends with a newline character, which will interfere with our processing
# Also, some words are capitalized, which will cause issues since 'A' != 'a'

['A\n',
 'a\n',
 'aa\n',
 'aal\n',
 'aalii\n',
 'aam\n',
 'Aani\n',
 'aardvark\n',
 'aardwolf\n',
 'Aaron\n']

In [8]:
# .strip() removes any leading and trailing whitespace (including newlines)
'Aaron\n'.strip()

'Aaron'

In [9]:
#lower() allows us to make the whole string lowercase
'Aaron\n'.strip().lower()

'aaron'

In [10]:
#We can then incorporate both of those into the loop
words = []
for line in open('words.txt'):
    words.append(line.strip().lower())

In [12]:
# The new issue is that we now have a duplicate, because both 'A' and 'a' were in the original file
words[:10]

['a',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aani',
 'aardvark',
 'aardwolf',
 'aaron']

In [13]:
#One way around this is to build a set rather than a list; recall that sets don't allow repeated values
words = set() # creates an empty set
for line in open('words.txt'):
    words.add(line.strip().lower()) # append() is for lists, add() is for sets

In [18]:
#Since the body of our loop is just one line, we can replace it with a comprehension to make it more idiomatic
words = {line.strip().lower() for line in open('words.txt')}
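The one-line comprehension leaves it to Python to close the file at some later point. If you would rather close it explicitly, the same comprehension can sit inside a `with` block. This is only a sketch on a made-up stand-in file; `sample_words.txt` and `unique_words` are hypothetical names.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical miniature of words.txt, including an 'A'/'a' duplicate
    path = os.path.join(tmpdir, 'sample_words.txt')
    with open(path, 'w') as f:
        f.write('A\na\naa\nAaron\n')

    # Same set comprehension as above, but the file is closed on exiting the block
    with open(path) as f:
        unique_words = {line.strip().lower() for line in f}

# Sets are unordered, so sort before printing to get a stable result
print(sorted(unique_words))  # ['a', 'aa', 'aaron']
```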

In [19]:
words

{'amplexation',
 'courteously',
 'backswordsman',
 'roadside',
 'closish',
 'cowhorn',
 'dissolutive',
 'peroxidate',
 'morcellation',
 'teamaker',
 'sinfully',
 'overbrim',
 'visioned',
 'terrence',
 'trionym',
 'radioelement',
 'sulphammonium',
 'dehortative',
 'begad',
 'penumbral',
 'chorepiscopus',
 'submarginal',
 'unrecorded',
 'feudalism',
 'ironmaker',
 'tartryl',
 'sulfohalite',
 'deipara',
 'resurrectible',
 'orbicularly',
 'molecula',
 'endocervical',
 'nummular',
 'unburdened',
 'overhand',
 'inedibility',
 'impavid',
 'fundulus',
 'transaquatic',
 'phyllogenous',
 'snakeology',
 'tangy',
 'outdragon',
 'interpermeate',
 'garlandage',
 'tanaist',
 'uncoachableness',
 'ephydridae',
 'erythron',
 'antidromic',
 'photoelectricity',
 'sibylline',
 'revaporization',
 'stereogoniometer',
 'phototelescope',
 'quinquino',
 'varicella',
 'tractable',
 'purchaser',
 'decachord',
 'fertilizational',
 'newsy',
 'vedette',
 'ghetti',
 'unfeoffed',
 'cirrose',
 'baldrib',
 'undersoil',


In [20]:
# We can wrap the set in sorted() to get a sorted list of the words
words = sorted({line.strip().lower() for line in open('words.txt')})

In [21]:
words

['a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aani',
 'aardvark',
 'aardwolf',
 'aaron',
 'aaronic',
 'aaronical',
 'aaronite',
 'aaronitic',
 'aaru',
 'ab',
 'aba',
 'ababdeh',
 'ababua',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally',
 'abaction',
 'abactor',
 'abaculus',
 'abacus',
 'abadite',
 'abaff',
 'abaft',
 'abaisance',
 'abaiser',
 'abaissed',
 'abalienate',
 'abalienation',
 'abalone',
 'abama',
 'abampere',
 'abandon',
 'abandonable',
 'abandoned',
 'abandonedly',
 'abandonee',
 'abandoner',
 'abandonment',
 'abanic',
 'abantes',
 'abaptiston',
 'abarambo',
 'abaris',
 'abarthrosis',
 'abarticular',
 'abarticulation',
 'abas',
 'abase',
 'abased',
 'abasedly',
 'abasedness',
 'abasement',
 'abaser',
 'abasgi',
 'abash',
 'abashed',
 'abashedly',
 'abashedness',
 'abashless',
 'abashlessly',
 'abashment',
 'abasia',
 'abasic',
 'abask',
 'abassin',
 'abastardize',
 'abatable',
 'abate',
 'a

We're now ready to make anagrams.

In [24]:
# If we wanted to try a different language, like French, we would just need the right dictionary
# Python strings are natively Unicode, so they handle international character sets transparently; French only needs a few accented letters
# Internally, characters are stored using one, two, or four bytes each, as needed. The only care we need to take is to tell Python which encoding to use for the files we read and write
# The Unicode standard defines several encodings that map characters to their representations in bytes; we just need to know the right one
# The most common are UTF-8 and UTF-16, but a few legacy encodings are also useful
# The French dictionary in francais.txt is written using ISO 8859-1, also known as Latin-1, so all we need to do is specify that using the encoding parameter in open()
paroles = sorted({line.strip().lower()
                 for line in open('francais.txt', encoding='latin-1')})
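To see why the `encoding` argument matters, the sketch below writes a tiny Latin-1 file and reads it back twice: once as Latin-1 and once as UTF-8. The file name `sample_francais.txt` and its three lines are made up for illustration. The last line is deliberately blank to show that a blank or whitespace-only line ends up as an empty string after `strip()`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical miniature of francais.txt, saved as Latin-1
    path = os.path.join(tmpdir, 'sample_francais.txt')
    with open(path, 'w', encoding='latin-1') as f:
        f.write('abaissé\nabaissât\n\n')

    # Reading with the matching encoding recovers the accented letters
    with open(path, encoding='latin-1') as f:
        sample_paroles = sorted({line.strip().lower() for line in f})

    # Reading the same bytes as UTF-8 fails, because a lone 0xE9 byte
    # followed by a line ending is not a valid UTF-8 sequence
    try:
        with open(path, encoding='utf-8') as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

print(sample_paroles)  # ['', 'abaissât', 'abaissé']
print(utf8_failed)     # True
```

If the empty entry is unwanted, adding `if line.strip()` at the end of the comprehension is one way to drop it.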

In [25]:
paroles

['',
 'a',
 'ab',
 'abaissa',
 'abaissai',
 'abaissaient',
 'abaissais',
 'abaissait',
 'abaissant',
 'abaissas',
 'abaissasse',
 'abaissassent',
 'abaissasses',
 'abaissassiez',
 'abaissassions',
 'abaisse',
 'abaissement',
 'abaissements',
 'abaissent',
 'abaisser',
 'abaissera',
 'abaisserai',
 'abaisseraient',
 'abaisserais',
 'abaisserait',
 'abaisseras',
 'abaisserez',
 'abaisseriez',
 'abaisserions',
 'abaisserons',
 'abaisseront',
 'abaisses',
 'abaisseur',
 'abaisseurs',
 'abaissez',
 'abaissiez',
 'abaissions',
 'abaissons',
 'abaissâmes',
 'abaissât',
 'abaissâtes',
 'abaissèrent',
 'abaissé',
 'abaissée',
 'abaissées',
 'abaissés',
 'abandon',
 'abandonna',
 'abandonnai',
 'abandonnaient',
 'abandonnais',
 'abandonnait',
 'abandonnant',
 'abandonnas',
 'abandonnasse',
 'abandonnassent',
 'abandonnasses',
 'abandonnassiez',
 'abandonnassions',
 'abandonne',
 'abandonnent',
 'abandonner',
 'abandonnera',
 'abandonnerai',
 'abandonneraient',
 'abandonnerais',
 'abandonnerait',